[Mono-devel-list] Problems with UTF-8 Decoder

Mon Feb 28 13:53:20 EST 2005

You are using outdated documentation for the UTF-8 standard. As of
Unicode 3.x, we have more than 1 million code points (21 bits), and
UTF-8 was extended to encode some of those in 5 or 6 bytes.

Get some updated documentation. 
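
When comparing documents, it helps to see how the first byte of a
sequence determines its claimed length. The following is only a sketch:
SequenceLength and its allowLegacy flag are hypothetical names, not
anything in Mono. It assumes the 5 and 6 byte forms as defined in
RFC 2279, while RFC 3629 stops at 4 bytes.

```csharp
using System;

static class Utf8LeadByte
{
    // Number of bytes a sequence claims to have, judged only from the bit
    // pattern of its first byte. Returns 0 for a byte that cannot start a
    // sequence. allowLegacy = true also accepts the 5 and 6 byte forms
    // described in RFC 2279.
    // Note: this does not reject overlong forms (C0, C1) or lead bytes
    // for values above U+10FFFF (F5 to F7); a real decoder must.
    public static int SequenceLength(byte lead, bool allowLegacy)
    {
        if (lead < 0x80) return 1;   // 0xxxxxxx
        if (lead < 0xC0) return 0;   // 10xxxxxx: continuation byte
        if (lead < 0xE0) return 2;   // 110xxxxx
        if (lead < 0xF0) return 3;   // 1110xxxx
        if (lead < 0xF8) return 4;   // 11110xxx
        if (!allowLegacy) return 0;  // RFC 3629: nothing longer than 4
        if (lead < 0xFC) return 5;   // 111110xx
        if (lead < 0xFE) return 6;   // 1111110x
        return 0;                    // FE and FF never start a sequence
    }

    static void Main()
    {
        byte[] leads = { 0x41, 0xEF, 0xF0, 0xF8, 0xFC };
        foreach (byte b in leads)
            Console.WriteLine("{0:X2}: strict={1} legacy={2}", b,
                SequenceLength(b, false), SequenceLength(b, true));
    }
}
```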

Also, off the top of my head, \uFEFF is the continuation prefix in
UTF-16, which is what CLI strings contain. If so, you are trying to
give the encoder an invalid character...
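
A quick way to see what actually happens with that character is to
round-trip it and dump the bytes. This is only a sketch: the class and
method names are made up for illustration, and it assumes a strict
UTF8Encoding (no BOM emitted by the encoder, exception on invalid
bytes).

```csharp
using System;
using System.Text;

static class Utf8RoundTrip
{
    // Hex dump of the UTF-8 bytes produced for a string.
    // For "\uFEFF" the RFC 3629 encoding is EF-BB-BF.
    public static string ToHex(string s)
    {
        // Arguments: encoderShouldEmitUTF8Identifier, throwOnInvalidBytes
        UTF8Encoding utf8 = new UTF8Encoding(false, true);
        return BitConverter.ToString(utf8.GetBytes(s));
    }

    // True when decoding the encoded bytes gives back the same string.
    public static bool RoundTrips(string s)
    {
        UTF8Encoding utf8 = new UTF8Encoding(false, true);
        byte[] bytes = utf8.GetBytes(s);
        return utf8.GetString(bytes) == s;
    }

    static void Main()
    {
        // Two of the strings from the quoted message below.
        string[] samples = { "\uFEFF", "\u4f00\u302a\ud800\udc00\u4f01" };
        foreach (string s in samples)
            Console.WriteLine("{0} round-trips: {1}", ToHex(s), RoundTrips(s));
    }
}
```

If RoundTrips returns false for a string, comparing the ToHex output
against the byte sequences in the RFC shows whether the encoder or the
decoder is at fault.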

HIH,

On Sun, 27 Feb 2005 13:07:58 +0200, Svetlana Zholkovsky
<svetlanaz at mainsoft.com> wrote:
> Hi, All!
> 
> I am using a UTF-8 encoding to encode/decode the following Unicode strings:
> 
> "\u4f00\u302a\ud800\udc00\u4f01",
> "\uFEFF",
> "\u0041\u2262\u0391\u002e",
> "\ud55c\uad6d\uc5b4",
> "\u65e5\u672c\u8a9e",
> "\ufeff\u233b4"
> 
> The encoding works fine, and the code looks like an exact
> implementation of the RFC 3629 spec, but the decoder does not return
> the original characters.
> The character "\uFEFF" (bytes EF BB BF) is not returned at all.
> 
> I've checked UTF8Encoding.cs, and I have to admit that, unlike the
> encoder, the decoder has some strange logic that tries to decode
> sequences of 5 or 6 bytes (the standard defines only 1 to 4 byte
> sequences for valid Unicode characters).
> 
> So, before I try to fix the problem, maybe someone can clarify the
> logic of the current UTF-8 decoder implementation for me?
> 
> I've opened a bug http://bugzilla.ximian.com/show_bug.cgi?id=73086 on
> UTF-8.
> 
> Thanks,
> Svetlana.
> 

-- 
Rafael "Monoman" Teixeira
---------------------------------------
I'm trying to become a "Rosh Gadol" before my own eyes. 
See http://www.joelonsoftware.com/items/2004/12/06.html for enlightenment.