[Mono-devel-list] Detecting Unicode strings

Rafael Teixeira monoman at gmail.com
Tue Jun 28 11:07:16 EDT 2005


Hi Aleksandar,

Do you mean Unicode as 16-bit wide chars (in truth UTF-16, since
Unicode code points no longer fit in 16 bits) or UTF-8 (the 8-bit
multibyte encoding of Unicode chars)?

The wide-char encodings (UCS-2, UTF-16, UCS-4) are easily spotted by
the large number of zero bytes in the stream for texts in Western
languages like English or Portuguese, but things get harder with
Eastern texts.
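
To make the zero-byte idea concrete, here is a minimal Python sketch. The
function names and the 0.3 threshold are my own assumptions for
illustration, not anything standard, and the result is only a guess.

```python
def zero_byte_ratio(data: bytes) -> float:
    """Return the fraction of bytes in data that are 0x00."""
    if not data:
        return 0.0
    return data.count(0) / len(data)


def looks_wide(data: bytes, threshold: float = 0.3) -> bool:
    """Guess whether data is in a wide encoding (UTF-16, UCS-4, ...).

    The threshold is an arbitrary assumption; tune it for your data.
    """
    return zero_byte_ratio(data) >= threshold


if __name__ == "__main__":
    # Western text in UTF-16: every second byte is zero.
    print(looks_wide("Hello, world".encode("utf-16-le")))
    # Japanese text in UTF-16 may contain no zero bytes at all.
    print(looks_wide("日本語".encode("utf-16-le")))
```

The second call shows why the heuristic breaks down for Eastern texts:
those characters fill both bytes of each 16-bit unit.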

UTF-8 allows only specific byte combinations, so you can look for
invalid ones to conclude the text is not encoded in UTF-8. But English
text is almost entirely plain ASCII, where multi-byte sequences are
too rare for the check to be reliable.
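
A rough Python sketch of the validity check, relying on the built-in
strict UTF-8 decoder. The helper names are hypothetical. Note that pure
ASCII always passes, so a positive answer on English text proves very
little.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes as UTF-8 without errors."""
    try:
        data.decode("utf-8")  # strict error handling is the default
    except UnicodeDecodeError:
        return False
    return True


def has_non_ascii(data: bytes) -> bool:
    """Return True if any byte is outside the 7-bit ASCII range."""
    return any(b >= 0x80 for b in data)


def utf8_verdict(data: bytes) -> str:
    """Summarise what the validity check can and cannot tell us."""
    if not is_valid_utf8(data):
        return "not UTF-8"
    if has_non_ascii(data):
        return "probably UTF-8"
    return "plain ASCII, no evidence either way"


if __name__ == "__main__":
    print(utf8_verdict("olá".encode("utf-8")))
    print(utf8_verdict("olá".encode("latin-1")))
    print(utf8_verdict(b"hello"))
```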

Google for "BOM Byte Order Mark" to see another way to look at it
(a BOM is optional, so many files do not carry one).
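
As an illustration, BOM sniffing could look like this in Python, using
the BOM constants from the standard `codecs` module. The function name
and the returned labels are my own choices. The UTF-32 marks are tested
first because the UTF-32 LE mark starts with the same two bytes as the
UTF-16 LE one.

```python
import codecs

# Longer marks first, so UTF-32 LE is not mistaken for UTF-16 LE.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data: bytes):
    """Return the encoding suggested by a leading BOM, or None."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None  # no BOM: this says nothing about the encoding


if __name__ == "__main__":
    print(sniff_bom(codecs.BOM_UTF8 + b"hi"))
    print(sniff_bom(b"hi"))
```

A missing BOM is not evidence against any encoding, so this only helps
when the mark happens to be there.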

All of those methods are just guesswork and should be applied with
care. The correct way is to have the source text explicitly labelled
with the encoding it uses, as XML for instance mandates (UTF-8 is the
default).
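
For the XML case, a simplified Python sketch of reading the declared
encoding might look like the following. The regular expression and the
function name are my own, and the sketch assumes the document starts in
an ASCII-compatible encoding with no BOM; a real XML parser does more
than this.

```python
import re

# Matches an XML declaration at the very start and captures the
# value of its encoding pseudo-attribute, if present.
_XML_DECL = re.compile(
    rb'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def declared_xml_encoding(data: bytes) -> str:
    """Return the encoding named in the XML declaration.

    Falls back to "utf-8" when no encoding is declared.
    """
    match = _XML_DECL.match(data)
    if match:
        return match.group(1).decode("ascii").lower()
    return "utf-8"


if __name__ == "__main__":
    doc = b'<?xml version="1.0" encoding="ISO-8859-1"?><a/>'
    print(declared_xml_encoding(doc))
    print(declared_xml_encoding(b"<a/>"))
```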

HIH,

On 6/28/05, Aleksandar Dezelin <dezelin at gmail.com> wrote:
> How can I detect if the string is a Unicode string at runtime?
>  
>  cheers,
>  Aleksandar Dezelin
> 
> -- 
> Linux is like wighwam, no windows no gates and apache inside... 
> _______________________________________________
> Mono-devel-list mailing list
> Mono-devel-list at lists.ximian.com
> http://lists.ximian.com/mailman/listinfo/mono-devel-list
> 
> 
> 


-- 
Rafael "Monoman" Teixeira
---------------------------------------
I'm trying to become a "Rosh Gadol" before my own eyes. 
See http://www.joelonsoftware.com/items/2004/12/06.html for enlightenment.
It hurts!


