Monday, March 21, 2011

Detecting Character Encoding of a Text File in ColdFusion

Recently I needed to detect the character encoding of text files (these files were uploaded by end users) using ColdFusion. Since a character encoding cannot be detected with certainty (it is defined by the creator of the file), it boiled down to guesstimating the encoding. I did some research and found a great Java library called "juniversalchardet", which is based on Mozilla's encoding detection library. You can download the library from http://code.google.com/p/juniversalchardet/

ColdFusion code:
<cfscript>
// passing null to the constructor means no CharsetListener is registered
detector = createObject("java", "org.mozilla.universalchardet.UniversalDetector").init(JavaCast("null", ""));
myfile = FileOpen("C:\inetpub\wwwroot\MMBranches\EPNU\Web\TestStuff\doc_unicode4.txt", "readBinary");

// continue the loop until the end of the file is reached or the detector is done
while (!FileIsEOF(myfile) && !detector.isDone()) {
    x = FileRead(myfile, 1024);        // read up to 1 KB of binary data
    detector.handleData(x, 0, Len(x)); // process only the bytes actually read
}
FileClose(myfile);
detector.dataEnd();
// note: getDetectedCharset() returns null when no encoding could be guessed
WriteOutput(detector.getDetectedCharset());
</cfscript>
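
To illustrate why this is guesswork: the only hard evidence a text file can carry about its own encoding is an optional byte order mark (BOM) at the very start. The Java sketch below checks for the common BOMs and returns null when none is present, which is exactly the case where a statistical detector such as juniversalchardet has to guess. The class and method names (BomSniffer, detectBom, startsWith) are hypothetical and made up for this illustration; this is not taken from the library and makes no claim about how the library works internally.

```java
import java.nio.charset.StandardCharsets;

public class BomSniffer {

    /** Returns the charset name implied by a leading BOM, or null if there is none. */
    static String detectBom(byte[] data) {
        // 4-byte BOMs are checked first: the UTF-32LE BOM begins with the UTF-16LE BOM.
        if (startsWith(data, 0x00, 0x00, 0xFE, 0xFF)) return "UTF-32BE";
        if (startsWith(data, 0xFF, 0xFE, 0x00, 0x00)) return "UTF-32LE";
        if (startsWith(data, 0xEF, 0xBB, 0xBF)) return "UTF-8";
        if (startsWith(data, 0xFE, 0xFF)) return "UTF-16BE";
        if (startsWith(data, 0xFF, 0xFE)) return "UTF-16LE";
        return null; // no BOM: the encoding can only be guessed from the content
    }

    /** True if data begins with the given unsigned byte values. */
    static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'h', 'i'};
        byte[] plain = "hi".getBytes(StandardCharsets.US_ASCII);
        System.out.println(detectBom(withBom)); // prints UTF-8
        System.out.println(detectBom(plain));   // prints null
    }
}
```

Most uploaded files have no BOM at all, so a check like this answers only the easy cases; everything else still needs a content-based guess.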

  

