[Mono-devel-list] Problems with UTF-8 Decoder

Rafael Teixeira monoman at gmail.com
Mon Feb 28 14:03:12 EST 2005


Sorry, I have now checked: U+FEFF is the ZERO WIDTH NO-BREAK SPACE, which
is also used as the Byte Order Mark. It is normally dropped when it is
the first Unicode character in a text stream, but the encoder/decoder
itself should certainly preserve it.

About the character: http://www.fileformat.info/info/unicode/char/feff/index.htm
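
To illustrate the distinction between preserving U+FEFF and stripping it
as a BOM, here is a minimal sketch in Python (not Mono's UTF8Encoding; the
variable names are only for illustration). It assumes Python's standard
"utf-8" codec, which keeps the character, and "utf-8-sig", which drops a
single leading BOM:

```python
# U+FEFF is ZERO WIDTH NO-BREAK SPACE; at the start of a stream it doubles as the BOM.
BOM = "\ufeff"

encoded = BOM.encode("utf-8")            # three bytes: EF BB BF
plain = encoded.decode("utf-8")          # plain UTF-8 decoding keeps the character
stripped = encoded.decode("utf-8-sig")   # "utf-8-sig" drops one leading BOM only

# A BOM in the middle or at the end of the text is ordinary data in both codecs.
text = BOM + "abc" + BOM
round_trip = text.encode("utf-8").decode("utf-8")
sig_decoded = text.encode("utf-8").decode("utf-8-sig")

print(" ".join(f"{b:02X}" for b in encoded))
print(plain == BOM, stripped == "")
print(round_trip == text, sig_decoded == "abc" + BOM)
```

Whether to strip a leading BOM is a decision for the layer that reads the
stream; a bare encode/decode pair should round-trip the character unchanged.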

I was confusing it with the surrogate pairs that UTF-16 uses to
represent code points above U+FFFF. See:

http://czyborra.com/utf/
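
For reference, the surrogate-pair arithmetic can be sketched as follows.
This is a small Python illustration, not code from Mono; the function
names to_surrogate_pair and from_surrogate_pair are hypothetical helpers:

```python
def to_surrogate_pair(code_point):
    """Split a code point above U+FFFF into (high, low) UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only U+10000..U+10FFFF are encoded as surrogate pairs")
    offset = code_point - 0x10000        # 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits go into the high surrogate
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits go into the low surrogate
    return high, low


def from_surrogate_pair(high, low):
    """Combine a (high, low) surrogate pair back into one code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


high, low = to_surrogate_pair(0x10000)
print(f"{high:04X} {low:04X}")

# In UTF-8 the same code point is a single 4-byte sequence, not two 3-byte ones.
utf8 = chr(0x10000).encode("utf-8")
print(" ".join(f"{b:02X}" for b in utf8))
```

The pair "\ud800\udc00" in the test strings quoted below is exactly
U+10000, so a correct UTF-8 encoder should emit four bytes for it.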


On Mon, 28 Feb 2005 15:53:20 -0300, Rafael Teixeira <monoman at gmail.com> wrote:
> You are using outdated documentation for the UTF-8 standard. As of
> Unicode 3.x we have more than 1 million code points (20 bits), and
> UTF-8 was extended to encode some of those in 5 or 6 bytes.
> 
> Get some updated documentation.
> 
> Also, off the top of my head, \uFEFF is the continuation prefix in
> UTF-16, which is what CLI strings contain; if so, you are trying to
> give the encoder an invalid character...
> 
> HIH,
> 
> On Sun, 27 Feb 2005 13:07:58 +0200, Svetlana Zholkovsky
> <svetlanaz at mainsoft.com> wrote:
> > Hi, All!
> >
> > I am using a UTF-8 Encoding to encode/decode the following unicode strings:
> >
> > "\u4f00\u302a\ud800\udc00\u4f01",
> > "\uFEFF",
> > "\u0041\u2262\u0391\u002e",
> > "\ud55c\uad6d\uc5b4",
> > "\u65e5\u672c\u8a9e",
> > "\ufeff\u233b4"
> >
> > The encoding works fine, and the code looks like an exact
> > implementation of RFC 3629, but the decoder does not return the
> > original characters.
> > The character "\uFEFF" (bytes EF BB BF) is not returned at all.
> >
> > I've checked UTF8Encoding.cs, and I have to admit that, unlike the
> > encoder, the decoder contains some strange logic that tries to decode
> > sequences of 5 or 6 bytes (the standard defines only 1- to 4-byte
> > sequences for valid Unicode characters).
> >
> > So, before I try to fix the problem, maybe someone can explain the
> > logic of the current UTF-8 decoder implementation?
> >
> > I've opened a bug http://bugzilla.ximian.com/show_bug.cgi?id=73086 on
> > UTF-8.
> >
> > Thanks,
> > Svetlana.
> 
> --
> Rafael "Monoman" Teixeira
> ---------------------------------------
> I'm trying to become a "Rosh Gadol" before my own eyes.
> See http://www.joelonsoftware.com/items/2004/12/06.html for enlightment.
> 




