[Mono-dev] mcs patch for default encoding

Kornél Pál kornelpal at hotmail.com
Tue Aug 23 05:36:36 EDT 2005


Hi,

The attached code does character set detection.

It uses a UTF8Encoding with throwOnInvalidBytes set. StreamReader detects the
BOM (UTF-8, Unicode, Unicode (Big-Endian)). UTF-8 is easy to validate because
it has strict rules for the byte representation of characters, so it is safe
to assume that a text is UTF-8 if it can be parsed as UTF-8. UTF8Encoding
(with throwOnInvalidBytes) throws ArgumentException when the input is not
valid UTF-8; in that case the code falls back to Encoding.Default.

csc.exe does not detect Unicode (16-bit) without a BOM, so I think we
shouldn't deal with it either.
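
For readers without the attachment, the approach described above can be
sketched roughly as follows. This is not the attached DetectEncoding.cs; the
class name EncodingSketch, the method Decode, and the fallback parameter are
hypothetical names chosen for illustration. It only assumes the documented
UTF8Encoding (bool, bool) and StreamReader (Stream, Encoding, bool)
constructors.

```csharp
using System;
using System.IO;
using System.Text;

public static class EncodingSketch
{
	// Hypothetical helper: decodes raw source bytes as UTF-8 when they are
	// valid UTF-8 (or carry a BOM), otherwise with the given fallback encoding.
	public static string Decode (byte [] bytes, Encoding fallback)
	{
		// Second argument is throwOnInvalidBytes: invalid sequences raise
		// an exception instead of being replaced silently.
		Encoding strictUtf8 = new UTF8Encoding (false, true);
		try {
			// Third argument asks StreamReader to honor a BOM
			// (UTF-8, Unicode, Unicode (Big-Endian)) if one is present.
			using (StreamReader reader = new StreamReader (
				new MemoryStream (bytes), strictUtf8, true))
				return reader.ReadToEnd ();
		} catch (ArgumentException) {
			// Not parseable as UTF-8: assume the fallback code page.
			return fallback.GetString (bytes);
		}
	}
}
```

A compiler driver would presumably pass Encoding.Default as the fallback; the
parameter is kept explicit here only so that the sketch can be tried with a
fixed code page.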

Kornél

----- Original Message -----
From: "Atsushi Eno" <atsushi at ximian.com>
To: "Marek Safar" <marek.safar at seznam.cz>
Cc: "mono-devel mailing list" <mono-devel-list at lists.ximian.com>
Sent: Tuesday, August 23, 2005 9:55 AM
Subject: Re: [Mono-dev] mcs patch for default encoding


> Oh, actually I have.
>
> I even have a case that does not work with mcs but works with csc -
> i.e. the case that csc detects utf-8 regardless of BOM.
>
>
> I forgot one thing - with regard to that remaining problem, we need
> to fix the WinForms build, because KeyboardLayouts.cs seems to have
> raw non-ASCII characters:
>
> syntax error, got token `IDENTIFIER'
> System.Windows.Forms\KeyboardLayouts.cs(93,51): error CS1526: A new
> expression requires () or [] after type
> System.Windows.Forms\KeyboardLayouts.cs(97,62): error CS8025: Parsing
> error
> Compilation failed: 2 error(s), 0 warnings
>
> They should be replaced by \uXXXX but I have no idea what those
> characters actually are :|
>
> Atsushi Eno
>
>
> Marek Safar wrote:
>> Hello Eno,
>>
>> Could you write some tests to cover this functionality? I mean, e.g.,
>> a simple test file with a UTF header.
>>
>> Thanks,
>> Marek
>>
>>> Hi again,
>>>
>>>> Agreed. In fact, I was also fixing bug #75065, maybe duplicate.
>>>> I have a fix for UTF8Encoding, but it uncovered another mcs bug
>>>> which does not handle files with BOM with specific encoding.
>>>> To summarize the situation:
>>>>
>>>>     - Currently driver.cs does not process source files with
>>>>       default encoding.
>>>>     - UTF8Encoding.cs does not handle U+FEFF correctly.
>>>>     - When we fix UTF8Encoding.cs to handle U+FEFF, it starts
>>>>       to reject some source files that have a BOM.
>>>>       (CS8025: Parsing error)
>>>>     - Even if we fix driver.cs to let StreamReader consider BOM
>>>>       (currently we disable it), there are still some files
>>>>       borking.
>>>>
>>>> Am digging into this bug in depth. Hopefully I'll post a set of
>>>> fixes later.
>>>
>>>
>>> ... and now I finished the fixes as was done in the attached patch:
>>>
>>>     - driver.cs :
>>>       a) uses Encoding.Default for the default input.
>>>       b) always passes true for BOM detection.
>>>     - support.cs : Handle preamble_size precisely.
>>>     - UTF8Encoding.cs : it should not skip U+FEFF. This fixes
>>>       bug #73086 and #75065.
>>>
>>> They should be applied at the same time, except for a).
>>>
>>> Atsushi Eno
>


--------------------------------------------------------------------------------


> public class 쯠쯡쯢
> {
> public string 颀顰飳;
>
> public static void Main ()
> {
> }
> }
>
>


--------------------------------------------------------------------------------


> public class 쯠쯡쯢
> {
> static string 颀顰飳 = "頃頇";
> public static void Main ()
> {
> foreach (char c in 颀顰飳)
> System.Console.WriteLine ("{0:X04}", (int) c);
> }
> }
>
>


--------------------------------------------------------------------------------


> Index: Makefile
> ===================================================================
> --- Makefile (revision 48630)
> +++ Makefile (working copy)
> @@ -2,7 +2,7 @@
> include ../../build/rules.make
>
> LIBRARY = System.Windows.Forms.dll
> -LIB_MCS_FLAGS = /unsafe \
> +LIB_MCS_FLAGS = /unsafe /codepage:65001 \
>  /r:$(corlib) /r:System.dll /r:System.Xml.dll \
>  /r:System.Drawing.dll /r:Accessibility.dll \
>  /r:System.Data.dll /r:Mono.Posix.dll \
>


--------------------------------------------------------------------------------


-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: DetectEncoding.cs
Url: http://lists.ximian.com/pipermail/mono-devel-list/attachments/20050823/89da8f4c/attachment.pl 

