[Mono-dev] Re: [Mono-list] Re: Mono.Unix Filename Marshaling

Wed Oct 26 06:47:25 EDT 2005

On Wed, 2005-10-26 at 08:08 +0200, Florian Weimer wrote:
> It seems that the UTF-8 decoder treats the byte sequence EF BF BF as
> invalid.  Doesn't this mean that with your changes, it is encoded as
> FFFF 00EF FFFF 00BF FFFF 00BF on the Mono side?

The UTF-8 decoder doesn't treat EF BF BF as invalid; see
mcs/class/corlib/Test/System.Text/UTF8EncodingTest.cs:T5_IllegalCodePosition_3_Other_532().
Apparently .NET treats EF BF BF as the encoding of U+FFFF, which is
correct, even though U+FFFF is a noncharacter that is guaranteed never
to be assigned.
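
As a quick sanity check outside Mono, here is a minimal Python sketch.
It uses Python's UTF-8 decoder, not Mono's or .NET's, so it only shows
what the UTF-8 encoding form itself yields for these three bytes; the
variable names are just for illustration.

```python
# The byte sequence from the quoted question.
raw = bytes([0xEF, 0xBF, 0xBF])

# Strict decoding: raises UnicodeDecodeError if the bytes are not valid UTF-8.
text = raw.decode("utf-8")

# Show the decoded code points in U+XXXX notation.
codepoints = [f"U+{ord(c):04X}" for c in text]
print(codepoints)
```

The strict decode succeeds and yields the single code point U+FFFF,
which encodes back to the same three bytes.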

Consequently, EF BF BF will be decoded as U+FFFF.  If it's the last
character in the managed string, it will be re-encoded as EF BF BF.  If
there's a character after it, the encoder will assume that U+FFFF is the
escape character and that the following character holds a raw byte (the
usual escape mechanism), so in this case the output won't match the
input.
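
The round-trip problem can be sketched in Python.  This is not the
Mono.Unix code; decode_filename, encode_filename and the "ffff-escape"
handler name are hypothetical, and the sketch assumes the scheme
described in this thread: each undecodable byte becomes U+FFFF followed
by a character carrying the byte value.

```python
import codecs

ESCAPE = "\uffff"  # escape character discussed in this thread


def _escape_bad_bytes(err):
    """Error handler: turn each undecodable byte into ESCAPE + chr(byte)."""
    if not isinstance(err, UnicodeDecodeError):
        raise err
    bad = err.object[err.start:err.end]
    return "".join(ESCAPE + chr(b) for b in bad), err.end


# "ffff-escape" is a made-up handler name for this sketch.
codecs.register_error("ffff-escape", _escape_bad_bytes)


def decode_filename(raw: bytes) -> str:
    """Decode UTF-8, escaping invalid bytes instead of failing."""
    return raw.decode("utf-8", errors="ffff-escape")


def encode_filename(name: str) -> bytes:
    """Reverse decode_filename: ESCAPE + char becomes one raw byte."""
    out = bytearray()
    i = 0
    while i < len(name):
        if name[i] == ESCAPE and i + 1 < len(name):
            # Treat the next character as an escaped byte.
            out.append(ord(name[i + 1]) & 0xFF)
            i += 2
        else:
            # Ordinary character (or a trailing ESCAPE): encode as UTF-8.
            out += name[i].encode("utf-8")
            i += 1
    return bytes(out)
```

With this sketch, the invalid byte FF round-trips, and so does EF BF BF
on its own.  The input EF BF BF 41 does not: it decodes to U+FFFF
followed by "A", and re-encoding yields only the byte 41.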

I'm hoping that this scenario is sufficiently rare that things will Just
Work.  If it isn't, I'll have to find a different escape character.
How's U+0001 sound (control character, START OF HEADING)?  Something
else?

 - Jon