[Mono-devel-list] Detecting Unicode strings
monoman at gmail.com
Tue Jun 28 11:07:16 EDT 2005
Do you mean Unicode (16-bit wide chars [in truth UTF-16, as
Unicode code points no longer fit in 16 bits]) or UTF-8 (8-bit multibyte encoding of
Unicode)?
The wide char encodings (UCS-2, UTF-16, UCS-4) are easily spotted by
the large number of zeroed bytes in the stream for texts in Western
languages like English or Portuguese. But things get harder with
texts in other scripts, which contain far fewer zero bytes.
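To make the zero-byte idea concrete, here is a minimal Python sketch. The function names and the 0.8 threshold are my own choices for illustration, not anything standard, and the result is only a guess.

```python
def zero_ratio(data: bytes, start: int, step: int) -> float:
    """Fraction of zero bytes among data[start], data[start + step], ..."""
    chunk = data[start::step]
    return chunk.count(0) / len(chunk) if chunk else 0.0


def guess_wide_encoding(data: bytes, threshold: float = 0.8):
    """Guess a wide encoding from where the zero bytes sit, or return None.

    The threshold is an arbitrary assumption; tune it for your data.
    """
    if len(data) < 4:
        return None  # too little data to say anything
    hi, lo = threshold, 1.0 - threshold

    # 32-bit units: for Western text three of every four bytes are zero.
    z = [zero_ratio(data, i, 4) for i in range(4)]
    if z[0] < lo and min(z[1], z[2], z[3]) > hi:
        return "utf-32-le"
    if z[3] < lo and min(z[0], z[1], z[2]) > hi:
        return "utf-32-be"

    # 16-bit units: every other byte is zero.
    even, odd = zero_ratio(data, 0, 2), zero_ratio(data, 1, 2)
    if even < lo and odd > hi:
        return "utf-16-le"
    if odd < lo and even > hi:
        return "utf-16-be"
    return None
```

For example, guess_wide_encoding("Hello, world".encode("utf-16-le")) gives "utf-16-le", while the same call on Japanese text encoded as UTF-16 gives None, because almost none of its bytes are zero.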
UTF-8 allows only specific byte combinations, so you
can look for invalid ones to conclude that a text is not encoded in UTF-8. But in
English texts multibyte combinations are too rare to make it
a reliable test.
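A rough Python sketch of the UTF-8 check follows. It leans on the strict decoder to reject invalid byte combinations; the function name and the three result labels are made up for this example.

```python
def classify_utf8(data: bytes) -> str:
    """Return "not-utf8", "ascii-only" or "probably-utf8" (labels are illustrative)."""
    try:
        # The strict decoder raises on any invalid byte combination.
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return "not-utf8"
    # Pure ASCII is valid UTF-8 but also valid in many other encodings,
    # so passing the check tells us nothing in that case.
    if all(b < 0x80 for b in data):
        return "ascii-only"
    return "probably-utf8"
```

Plain English input usually comes back as "ascii-only", which is exactly the weak spot: the check only becomes informative once the text contains non-ASCII characters.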
Google for "BOM Byte Order Mark" to see another (optional, not always
present) way to look at it.
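As a small illustration of the BOM approach, this Python sketch compares the start of the data with the BOM constants from the standard codecs module. The function name and the return shape are my own; a missing BOM proves nothing either way.

```python
import codecs

# Order matters: the UTF-32-LE BOM (FF FE 00 00) begins with the
# UTF-16-LE BOM (FF FE), so the longer marks are tested first.
_BOMS = (
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
)


def detect_bom(data: bytes):
    """Return (encoding name, BOM length), or (None, 0) when no BOM is found."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0
```

One known ambiguity remains: UTF-16-LE data that starts with a BOM followed by a NUL character looks the same as a UTF-32-LE BOM, so this too is a guess.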
All of those methods are just guesswork and should be applied with
care. The correct way is to have the source text explicitly identify
which encoding it uses, as XML mandates, for instance (UTF-8
is the default).
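For the XML case, a minimal Python sketch of reading the declared encoding could look like this. It assumes the document starts with the XML declaration in an ASCII-compatible encoding; the function name and the regular expression are illustrative only, and a real XML parser does this job properly.

```python
import re

# Matches e.g. <?xml version="1.0" encoding="ISO-8859-1"?> at the very start.
_XML_DECL = re.compile(
    rb'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def declared_xml_encoding(data: bytes) -> str:
    """Return the encoding named in the XML declaration, else "utf-8"."""
    match = _XML_DECL.match(data)
    if match:
        return match.group(1).decode("ascii")
    return "utf-8"  # the default when nothing is declared
```

The point is that the text itself states its encoding, so no guessing is needed beyond reading the first line.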
On 6/28/05, Aleksandar Dezelin <dezelin at gmail.com> wrote:
> How can I detect if the string is a Unicode string at runtime?
> Aleksandar Dezelin
> Linux is like a wigwam: no windows, no gates, and Apache inside...
Rafael "Monoman" Teixeira
I'm trying to become a "Rosh Gadol" before my own eyes.
See http://www.joelonsoftware.com/items/2004/12/06.html for enlightenment.