[Mono-dev] mcs patch for default encoding

Kornél Pál kornelpal at hotmail.com
Tue Aug 23 05:53:57 EDT 2005


There is no other way to detect UTF-8 without a BOM, so csc.exe has to do
the same. :) But this test could be limited to the first n bytes of the
stream; the rest of the stream could then be assumed to have the same
encoding.
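
A rough sketch of that idea follows. It is in Python rather than C#, purely
as an illustration, and it is not the code from the patch. The names
looks_like_utf8 and pick_encoding, the 4096-byte limit and the fallback
parameter are all assumptions made up for this example. An incremental
decoder is used so that a multi-byte character cut in half by the limit is
not mistaken for invalid input.

```python
import codecs
import locale


def looks_like_utf8(data: bytes, limit: int = 4096) -> bool:
    """Return True if the first `limit` bytes of `data` are well-formed UTF-8.

    Hypothetical helper: only the prefix is validated, the rest of the
    input is assumed to use the same encoding.
    """
    prefix = data[:limit]
    decoder = codecs.getincrementaldecoder("utf-8")(errors="strict")
    try:
        # final=False tolerates a multi-byte sequence truncated by the limit;
        # final=True is used when the prefix is the whole input.
        decoder.decode(prefix, final=len(data) <= limit)
    except UnicodeDecodeError:
        return False
    return True


def pick_encoding(data: bytes, fallback: str = "", limit: int = 4096) -> str:
    """Pick a codec name: BOM first, then the UTF-8 check, then the fallback."""
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    if looks_like_utf8(data, limit):
        return "utf-8"
    # Stand-in for Encoding.Default: the locale's preferred encoding.
    return fallback or locale.getpreferredencoding(False)
```

The trade-off is the one stated above: a file that is pure ASCII within the
first n bytes and contains non-UTF-8 bytes later on would still be taken
for UTF-8.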

Kornél

----- Original Message -----
From: "Atsushi Eno" <atsushi at ximian.com>
To: "Kornél Pál" <kornelpal at hotmail.com>
Cc: "mono-devel mailing list" <mono-devel-list at lists.ximian.com>; "Marek
Safar" <marek.safar at seznam.cz>
Sent: Tuesday, August 23, 2005 11:50 AM
Subject: Re: [Mono-dev] mcs patch for default encoding


>I don't think this is acceptable because of its significant
> performance loss (reading the entire stream)...
>
> Atsushi Eno
>
> Kornél Pál wrote:
>> Hi,
>>
>> Character set detection.
>>
>> This code uses a UTF8Encoding with throwOnInvalidBytes. StreamReader
>> detects the BOM (UTF-8, Unicode, Unicode (Big-Endian)). UTF-8 is easy to
>> validate because it has strict rules for the byte representation of
>> characters, so it is safe to assume that a text is UTF-8 if it can be
>> parsed as UTF-8. UTF8Encoding (with throwOnInvalidBytes) throws
>> ArgumentException when the input is not UTF-8. In that case, fall back
>> to Encoding.Default.
>>
>> Unicode (16-bit) is not detected by csc.exe without a BOM, so I think
>> we shouldn't deal with it.
>>
>> Kornél
>
> _______________________________________________
> Mono-devel-list mailing list
> Mono-devel-list at lists.ximian.com
> http://lists.ximian.com/mailman/listinfo/mono-devel-list
>



