[Mono-dev] Fwd: [Mono-patches] r63710 - in trunk/mcs/class/System.Web: System.Web.UI.WebControls Test/System.Web.UI.WebControls

Mon Aug 14 07:06:34 EDT 2006

> Hi Kornél (am always copypasting your name ;-),

I was pretty sure about that.:)

>> What about using UTF-8 (without BOM) in ChangeLog? It may sound weird but 
>> I personally have no problem with using Kanji (or other non-Latin 
>> scripts) when using UTF-8, although Latin characters are more likely 
>> to be readable by everyone looking at the files (as we use English). 
>> So using Latin script seems more reasonable. :) But I personally see 
>> nothing wrong with optionally including names in alternative scripts in 
>> addition to their Latin representations.
>
> Is saving files in utf-8 without BOM possible in general western
> editors land? If yes I like the idea. If not then maybe it is not
> a good solution for us (yeah, not using non-ASCII letters is the
> most pessimistic option).
>
> (BTW I guess, with BOM you guys will get stuck, right?)

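I suggested leaving out the BOM only because a non-UTF-8-aware editor treats 
it as ordinary text: when something is inserted before it (for example a new 
entry at the top of a ChangeLog), the BOM stays where it was and ends up in 
the middle of the file instead of at the beginning. (I didn't explain this 
reason before, so now I have. :)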
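
A minimal Python sketch of that failure mode. The helper name and the 
entry text are made up for illustration; the only assumption is a tool that 
prepends bytes without knowing what a BOM is:

```python
import codecs

def prepend_entry_naively(existing: bytes, new_entry: bytes) -> bytes:
    """Hypothetical BOM-unaware tool: it just puts the new bytes in front."""
    return new_entry + existing

# A UTF-8 file that starts with a BOM (illustrative contents only).
changelog = codecs.BOM_UTF8 + "old entry\n".encode("utf-8")

result = prepend_entry_naively(changelog, "new entry\n".encode("utf-8"))

# The BOM is still in the data, but no longer at offset 0.
bom_offset = result.find(codecs.BOM_UTF8)
print(result.startswith(codecs.BOM_UTF8), bom_offset)
```

A reader that only looks for the BOM at the start of the file will no longer 
see it, and the three BOM bytes remain as stray data in front of the old entry.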

I mostly use Windows XP, which supports UTF-8 and has no problem with the 
BOM. Notepad, for example, cannot save UTF-8 without a BOM. Microsoft 
programs (Notepad, Visual Studio, csc, ...) do recognize UTF-8 without a BOM: 
they try to parse the entire file as UTF-8 and treat it as UTF-8 if it is 
valid, otherwise they fall back to the default ANSI code page. They recognize 
the BOM too, of course. Visual Studio, for example, saves a file with a BOM 
if it originally had one and without a BOM if it didn't.
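
The detection rule described above can be sketched in Python roughly like 
this. It is only an approximation of the behaviour, not the actual logic of 
any Microsoft program; the function name is hypothetical and cp1252 is an 
assumed stand-in for "the default ANSI code page":

```python
import codecs

def guess_decode(data: bytes, ansi_codepage: str = "cp1252"):
    """Return (text, encoding_used) following the rule sketched in the mail."""
    # 1. A BOM at the very start marks the file as UTF-8.
    if data.startswith(codecs.BOM_UTF8):
        body = data[len(codecs.BOM_UTF8):]
        return body.decode("utf-8", errors="replace"), "utf-8 (BOM)"
    # 2. No BOM: accept UTF-8 only if the whole file is valid UTF-8.
    try:
        return data.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # 3. Otherwise fall back to the (assumed) ANSI code page.
        return data.decode(ansi_codepage), ansi_codepage

print(guess_decode("Kornél".encode("utf-8")))
print(guess_decode("Kornél".encode("cp1252")))
print(guess_decode(codecs.BOM_UTF8 + "Kornél".encode("utf-8")))
```

Pure ASCII files are valid UTF-8 as well, so under this rule they are always 
reported as UTF-8.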

I normally use Linux only over telnet, so I have little experience with 
Linux-based text editors, and I hardly use other operating systems.

Assuming that people use UTF-8-aware text editors, the BOM should not cause 
any problems. And I think that eastern localized versions of the same 
software handle the BOM the same way as the western localized versions.

>> Currently .cs files are compiled either as Latin 1 (default) or as UTF-8 
>> (when set in Makefile of the assembly, but anything else could be set) so 
>> I think the appropriate encoding can be used in source files.
>
> (Note that what Kornél mentioned above was all about ChangeLogs.)
>
> When it comes to mcs sources, we wouldn't want to change things.
> It forces us to change all relevant sources to utf-8 because
> with BOMless utf-8 explicit compiler option -codepage:65001 is
> required.

I didn't suggest changing the encoding. I just said that I think it's OK to 
take advantage of the encoding actually used for compiling the sources rather 
than sticking to ASCII. That is currently Latin 1 in most cases and UTF-8 for 
Microsoft.VisualBasic and System.Windows.Forms. I say "currently" because it 
can be changed, although I agree that there is no reason to change it.

>> Apart from the above things using \u#### and \U######## escape sequences 
>> for non-ASCII characters in string literals ensures that the code 
>> functions correctly. Compilers ignore comments so incorrectly encoded 
>> characters in comments can cause little harm if any.
>
> Sadly it indeed caused problems when I tried to build a library
> from the sources whose comments are written in Japanese (Shift_JIS).
> So, in such cases, using UTF-8 is the only solution I think.

Unfortunately I don't know the Shift_JIS encoding so I don't know whether 
it's safe to use it in comments.

As long as comments begin with // and end with a new line there should be no 
problem. If the encoding is ASCII compatible, NUL, CR, LF, slash and space 
have the same byte encodings as in ASCII. But a problem can occur when 
the encoding uses control characters (only the native new line sequence is 
relevant) as trail bytes. If Shift_JIS fulfils the above requirements, the 
problem is likely to be with the compiler. Otherwise Shift_JIS is not safe 
to use in comments.

Using C style comments /* */ is less safe because they end with normal 
characters (not control characters), but if the encoding does not use 
these characters as trail bytes it's just as safe as //.
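
One way to check the trail-byte question empirically is to encode every 
character the codec supports and collect the ASCII-range bytes that show up 
after the first byte. The sketch below does that with Python's own 
"shift_jis" codec, which is an assumption: other tools may mean a slightly 
different variant (such as cp932) by that name. The function name is 
hypothetical.

```python
def ascii_trail_bytes(encoding: str) -> set:
    """Collect ASCII-range byte values used in non-leading positions."""
    trail = set()
    for code_point in range(0x80, 0x10000):
        try:
            encoded = chr(code_point).encode(encoding)
        except UnicodeEncodeError:
            continue  # character not representable in this encoding
        for byte in encoded[1:]:
            if byte < 0x80:
                trail.add(byte)
    return trail

trail = ascii_trail_bytes("shift_jis")

# Bytes that delimit // and /* */ comments.
markers = {"CR": 0x0D, "LF": 0x0A, "/": 0x2F, "*": 0x2A}
unsafe = {name for name, byte in markers.items() if byte in trail}

print("comment markers used as trail bytes:", unsafe)
print("backslash (0x5C) used as trail byte:", 0x5C in trail)
```

With this codec none of the comment markers appear as trail bytes, although 
other ASCII values such as 0x5C (backslash) do, for example in the second 
byte of U+30BD.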

Note that the above is true only when the text is decoded using a single-byte 
encoding. When it is decoded using a multi-byte encoding, end markers may be 
lost because they may be interpreted as trail bytes.
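
To illustrate how an end marker can get lost, here is a deliberately naive 
double-byte splitter in Python. It is a hypothetical decoder, not the 
behaviour of mcs, csc or any real codec: it assumes that every byte from 0x81 
upwards is a lead byte and swallows the following byte without validating it.

```python
def naive_dbcs_split(data: bytes) -> list:
    """Split bytes into units the way a careless double-byte decoder might."""
    units = []
    i = 0
    while i < len(data):
        if data[i] >= 0x81 and i + 1 < len(data):
            units.append(data[i:i + 2])  # "lead byte" plus whatever follows
            i += 2
        else:
            units.append(data[i:i + 1])
            i += 1
    return units

def count_newlines_seen(data: bytes) -> int:
    """Number of line ends that survive as stand-alone units."""
    return sum(1 for unit in naive_dbcs_split(data) if unit == b"\n")

# A Latin 1 source file whose comment ends in a non-ASCII character.
source = "// caf\u00e9\nint x;\n".encode("latin-1")

print(source.count(b"\n"), count_newlines_seen(source))
```

The file contains two line ends, but the naive splitter only sees one: the 
0xE9 byte at the end of the comment consumes the LF, so the next line would 
be read as part of the comment.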

Kornél