[Mono-devel-list] Problems with UTF-8 Decoder

Svetlana Zholkovsky svetlanaz at mainsoft.com
Sun Feb 27 06:07:58 EST 2005


Hi, All!

I am using the UTF-8 encoding to encode and decode the following Unicode strings:

"\u4f00\u302a\ud800\udc00\u4f01",
"\uFEFF",
"\u0041\u2262\u0391\u002e",
"\ud55c\uad6d\uc5b4",
"\u65e5\u672c\u8a9e",
"\ufeff\u233b4"

The encoder works fine and its code looks like an exact implementation
of the RFC 3629 spec, but the decoder does not return the original
characters. The character "\uFEFF" (bytes EF BB BF) is not returned
at all.
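
As a point of comparison, the expected round trip can be sketched in
Python. This is not Mono code: the names SAMPLES and round_trips are
hypothetical, and Python's built-in "utf-8" codec merely stands in for a
decoder that follows RFC 3629. Because Python strings hold code points
rather than UTF-16 units, the surrogate pair \ud800\udc00 is written as
U+10000 here.

```python
# Test strings from the report, adapted for Python: the UTF-16 surrogate
# pair \ud800\udc00 is written as the single code point U+10000.
SAMPLES = [
    "\u4f00\u302a\U00010000\u4f01",
    "\ufeff",
    "\u0041\u2262\u0391\u002e",
    "\ud55c\uad6d\uc5b4",
    "\u65e5\u672c\u8a9e",
    "\ufeff\u233b4",
]


def round_trips(text: str) -> bool:
    """Return True if encoding to UTF-8 and decoding again yields the input."""
    return text.encode("utf-8").decode("utf-8") == text


if __name__ == "__main__":
    for sample in SAMPLES:
        encoded = sample.encode("utf-8")
        # Print the encoded bytes as hex next to the round-trip result.
        print(encoded.hex(), round_trips(sample))
```

With this codec U+FEFF encodes to EF BB BF and decodes back to U+FEFF,
so the character survives the round trip. Python's separate "utf-8-sig"
codec is the one that strips a leading BOM.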

I've checked UTF8Encoding.cs, and I have to admit that, unlike the
encoder, the decoder contains some strange logic that tries to decode
sequences of 5 or 6 bytes (the standard defines only 1- to 4-byte
sequences for valid Unicode characters).
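
To illustrate the 1- to 4-byte limit, here is a minimal Python sketch
of how a lead byte maps to a sequence length under the RFC 3629 syntax.
The function name utf8_sequence_length is hypothetical and this is not
the logic in UTF8Encoding.cs; it only shows that the lead bytes of the
older 5- and 6-byte forms (F8..FD) are no longer valid.

```python
def utf8_sequence_length(lead: int) -> int:
    """Return the sequence length a lead byte starts, or 0 if it is invalid.

    Ranges follow the UTF-8 syntax in RFC 3629, section 4.
    """
    if 0x00 <= lead <= 0x7F:
        return 1  # 0xxxxxxx: single-byte (ASCII) character
    if 0xC2 <= lead <= 0xDF:
        return 2  # 110xxxxx: C0 and C1 are excluded because they are overlong
    if 0xE0 <= lead <= 0xEF:
        return 3  # 1110xxxx
    if 0xF0 <= lead <= 0xF4:
        return 4  # 11110xxx: capped at F4 so the result stays <= U+10FFFF
    # Continuation bytes (80..BF), C0, C1, and F5..FF, which includes the
    # lead bytes of the obsolete 5-byte (F8..FB) and 6-byte (FC..FD) forms.
    return 0
```

For example, the lead byte EF of the encoded U+FEFF starts a 3-byte
sequence, while F8 and FC are rejected outright.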

So, before I try to fix the problem, maybe someone can explain the
logic of the current UTF-8 decoder implementation?

I've opened a bug for this problem:
http://bugzilla.ximian.com/show_bug.cgi?id=73086

Thanks,
Svetlana.




