[Mono-list] Character coding auto-detection in plain-text files

Pedro Castro mail at pedrocastro.org
Mon Mar 12 06:08:47 EDT 2007


Hi,

Antonello Provenzano wrote:
>
> The fact is that the C# port of CharDet was made from the Java
> version as its starting point; if you've checked, the original
> JCharDet is quite outdated too (the latest release was 3 years ago).
Yes, that's true. But the Python port is a lot newer and includes the 
Universal chardet (more below).
>
> I haven't tried it yet, but I believe the current version should work
> for detecting character encodings, since the encoding table has not
> changed since then.
The current code works (I have it working in a project of mine,
http://sublib.sf.net), but it is no longer complete. If you look at
http://www.mozilla.org/projects/intl/chardet.html , the new "universal
charset detector" (whose code is at
http://lxr.mozilla.org/seamonkey/source/extensions/universalchardet/ )
includes more encodings. Just to name a few:
ISO-8859-2
ISO-8859-5
ISO-8859-7
windows-1250
windows-1251
windows-1253

I've received feedback from Polish users, for instance, for whom the
auto-detection fails, so they have to select the encoding manually for
things to work.
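
To illustrate why this kind of failure can happen, here is a minimal
sketch in Python using only the standard library. It is not code from
any chardet port; the function name candidate_decodings and the sample
sentence are assumptions made for this example. It shows that Polish
text saved as windows-1250 also decodes "successfully" under other
single-byte encodings, just to the wrong characters, so a detector
without a model for the right encoding has no error to react to.

```python
def candidate_decodings(data, encodings):
    """Return {encoding: text} for each encoding that decodes `data`
    without raising. Hypothetical helper, for illustration only."""
    results = {}
    for name in encodings:
        try:
            results[name] = data.decode(name)  # strict decoding
        except UnicodeDecodeError:
            pass  # the bytes are not valid in this encoding
    return results


# Sample Polish pangram (an assumption for this sketch), saved as windows-1250.
TEXT = "Zażółć gęślą jaźń"
SAMPLE = TEXT.encode("cp1250")

if __name__ == "__main__":
    found = candidate_decodings(
        SAMPLE, ["utf-8", "iso8859_2", "cp1250", "latin_1"]
    )
    for name, decoded in found.items():
        status = "correct" if decoded == TEXT else "wrong characters"
        print(f"{name}: decodes without error ({status})")
```

Strict decoding alone only rules out UTF-8 here; telling windows-1250
apart from ISO-8859-2 needs the kind of per-language statistics that
the universal detector adds.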

Best regards,

>
>
> On 3/10/07, Pedro Castro <mail at pedrocastro.org> wrote:
>> Hi,
>>
>> This comes first as a question: is there currently a way to autodetect
>> encodings in text files / strings?
>>
>> I realize there isn't, so I would like to ask if someone's interested in
>> going forward with this. Mozilla has a great detector, written in C,
>> which has been ported to other languages, like Java
>> (http://jchardet.sourceforge.net/) and Python
>> (http://chardet.feedparser.org/) for instance. A port exists in C# but
>> is very outdated
>> (http://www.conceptdevelopment.net/Localization/NCharDet/).
>>
>> This library would be of great help to many applications, mostly those
>> working with files in different encodings, but basically any
>> application reading plain-text files.
>>
>> -- 
>> Pedro Castro
>> http://www.pedrocastro.org
>


-- 
Pedro Castro
http://www.pedrocastro.org


