[Mono-dev] mcs patch for default encoding

Kornél Pál kornelpal at hotmail.com
Tue Aug 23 06:48:24 EDT 2005


I tried to compile a 2 GB file with csc.exe and got an out-of-memory
error. Then I reduced the size to 500 MB but still ran out of memory.
Finally I was able to compile a 200 MB file.

I got error CS1034: Compiler limit exceeded: Line cannot exceed 2046
characters

So I added line breaks as well, and put // at the beginning of each line
to add some non-whitespace characters, just for fun and to test the
compiler. :)

The first non-ASCII character is very close to the end of the file. csc.exe
compiled it correctly, both the UTF-8 and the ACP (ANSI code page) version.
DétectEncoding was compiled correctly in both cases. I attached the test
cases (about 200 MB each).
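
For anyone who wants to reproduce this at a smaller scale, here is a rough
Python sketch of how such a test file can be generated. It is not the script
used for the attached test cases: the function name, the line count, the line
width and the choice of cp1250 as the ANSI code page are all assumptions made
for illustration.

```python
import os

def write_test_source(path, encoding, total_lines=1000, line_width=80):
    """Write a C#-like source file that consists of // comment lines,
    followed by a class whose name contains a non-ASCII character.

    All parameters are illustrative; the real test files were about 200 MB.
    """
    # Each filler line starts with // so it contains non-whitespace
    # characters, and stays far below the 2046-character line limit.
    filler = "// " + "x" * (line_width - 3) + "\n"
    # The only non-ASCII character (é) is placed at the very end of the file.
    tail = "class D\u00e9tectEncoding { static void Main () {} }\n"
    with open(path, "w", encoding=encoding, newline="\n") as f:
        for _ in range(total_lines):
            f.write(filler)
        f.write(tail)
    return os.path.getsize(path)
```

Calling it once with "utf-8" and once with "cp1250" (assumed here as the
ANSI code page) gives two files that differ only in how the é is encoded.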

So I think csc.exe parses the whole file to detect UTF-8, and has poor memory
management in addition. :) Maybe it caches the source file in its own
allocated memory.
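
To make the two options discussed in this thread concrete, here is a minimal
Python sketch of the idea. It is only an analogue of the approach, not the mcs
patch and not what csc.exe actually does: detect_encoding, fallback and
probe_bytes are hypothetical names, and the strict UTF-8 decoder stands in for
UTF8Encoding with throwOnInvalidBytes. With probe_bytes=None the whole input
is validated; with a number, only the first n bytes are checked and the rest
is assumed to have the same encoding.

```python
import codecs
import locale

def detect_encoding(data, fallback=None, probe_bytes=None):
    """Guess the encoding of `data` (bytes) and return a codec name.

    Hypothetical helper: a BOM wins, otherwise the input is UTF-8 if it
    validates as UTF-8, otherwise the fallback (default) encoding is used.
    """
    if fallback is None:
        # Rough counterpart of Encoding.Default: the locale's encoding.
        fallback = locale.getpreferredencoding(False)

    # Files with a BOM need no guessing.
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"

    whole = probe_bytes is None or len(data) <= probe_bytes
    sample = data if whole else data[:probe_bytes]

    # A strict incremental decoder raises on invalid bytes. With final=False
    # a multi-byte sequence cut off at the end of the prefix is not an error.
    decoder = codecs.getincrementaldecoder("utf-8")(errors="strict")
    try:
        decoder.decode(sample, final=whole)
    except UnicodeDecodeError:
        return fallback
    return "utf-8"
```

The trade-off shows up with a file like my test cases: if the first non-ASCII
byte comes after the probed prefix, the prefix-only check reports UTF-8 even
when the file is in the ANSI code page, while the full scan gets it right at
the cost of reading everything.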

Kornél

----- Original Message -----
From: "Kornél Pál" <kornelpal at hotmail.com>
To: "Atsushi Eno" <atsushi at ximian.com>
Cc: "Marek Safar" <marek.safar at seznam.cz>; "mono-devel mailing list"
<mono-devel-list at lists.ximian.com>
Sent: Tuesday, August 23, 2005 11:53 AM
Subject: Re: [Mono-dev] mcs patch for default encoding


> There is no other way to detect UTF-8 without a BOM, so csc.exe has to do
> the same. :) But this test could be done on only the first n bytes of a
> stream, and then it could be assumed that the rest of the stream has the
> same encoding.
>
> Kornél
>
> ----- Original Message -----
> From: "Atsushi Eno" <atsushi at ximian.com>
> To: "Kornél Pál" <kornelpal at hotmail.com>
> Cc: "mono-devel mailing list" <mono-devel-list at lists.ximian.com>; "Marek
> Safar" <marek.safar at seznam.cz>
> Sent: Tuesday, August 23, 2005 11:50 AM
> Subject: Re: [Mono-dev] mcs patch for default encoding
>
>
>> I don't think this is acceptable because of its significant
>> performance loss (reading the entire stream)...
>>
>> Atsushi Eno
>>
>> Kornél Pál wrote:
>>> Hi,
>>>
>>> Character set detection.
>>>
>>> This code uses a UTF8Encoding with throwOnInvalidBytes. StreamReader
>>> detects the BOM (UTF-8, Unicode, Unicode (Big-Endian)). UTF-8 is easy
>>> to validate as it has strict rules regarding the byte representation
>>> of characters. So it's safe to assume that a text is UTF-8 if it can
>>> be parsed as UTF-8. UTF8Encoding (with throwOnInvalidBytes) throws
>>> ArgumentException when it is not UTF-8. In this case fall back to
>>> Encoding.Default.
>>>
>>> Unicode (16-bit) is not detected by csc.exe without BOM so I think we
>>> shouldn't deal with it.
>>>
>>> Kornél
>>
>> _______________________________________________
>> Mono-devel-list mailing list
>> Mono-devel-list at lists.ximian.com
>> http://lists.ximian.com/mailman/listinfo/mono-devel-list
>>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: DetectEncoding.tar.bz2
Type: application/octet-stream
Size: 41115 bytes
Desc: not available
Url : http://lists.ximian.com/pipermail/mono-devel-list/attachments/20050823/507dd10b/attachment.obj 

