[Mono-dev] mcs default encoding: Latin1 or not

Kornél Pál kornelpal at hotmail.com
Fri Aug 26 10:40:10 EDT 2005


Hi,

From: Atsushi Eno
> Can we edit UTF-8 files in vim on cygwin? No. This fact simply shows
> that we are not living in the age of Unicode.

I think that the fact that many default Linux installations use UTF-8 as
their default encoding, for example, shows that we are living in the age of
Unicode.

I rarely use Linux, so I have little experience with UTF-8 support on
Linux, but I have the following experience on Windows (Windows XP with
relatively recent versions of software):

All the tools I use (including notepad.exe, Visual Studio .NET, csc.exe,
Total Commander, Internet Explorer, Firefox, and the .NET Framework) know
UTF-8 and can detect it even without a BOM, but all of them default to the
system's default ANSI code page. In addition, Mono has all the infrastructure
to do the same, although it currently has a different default behaviour.
Windows NT-based systems use Unicode to represent strings in memory.

This makes me think that we are in the age of Unicode.

We shouldn't use non-ASCII characters in identifiers, but we can use them in
strings and comments. Of course we could restrict ourselves to ASCII, but I
think UTF-8 is a better solution.

Back to vim: I think the fact that vim has no UTF-8 support suggests that
vim is a tool from the past, or that the developers of vim still live in the
past, since everything around vim has UTF-8 support.

At least on Windows you can open text in any code page, edit it, and when
you save it no characters will be corrupted, so you can open it again using
the correct code page. This is true for UTF-8 as well. Some control characters
(0x00-0x1F) may be lost, but these characters are used neither by SBCS nor by
DBCS code pages. So you can safely edit UTF-8 files without UTF-8 support as
long as you modify and add only ASCII characters.
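
To illustrate (this is just a sketch of mine, not anything from mcs): a
single-byte code page like Latin1 maps every byte 0x00-0xFF to a character,
which is why such round-trip editing is lossless at the byte level:

// Sketch only: ISO-8859-1 (code page 28591) assigns a character to every
// byte, so decoding and re-encoding returns exactly the original bytes.
using System;
using System.Text;

class Latin1RoundTrip
{
    static void Main()
    {
        // The UTF-8 bytes of "Kornél": 'K','o','r','n', 0xC3 0xA9 (é), 'l'.
        byte[] original = { 0x4B, 0x6F, 0x72, 0x6E, 0xC3, 0xA9, 0x6C };

        Encoding latin1 = Encoding.GetEncoding(28591);
        string text = latin1.GetString(original);  // every byte becomes some character
        byte[] saved = latin1.GetBytes(text);      // and maps straight back to the same byte

        Console.WriteLine(BitConverter.ToString(original) ==
                          BitConverter.ToString(saved)); // True: nothing was corrupted
    }
}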

Not using a BOM was an idea to support text editors without UTF-8 support,
but if this causes a lot of UTF-8 aware editors not to treat the files as
UTF-8, we have to use a BOM.

But I'm still sure that mcs has a bug regarding UTF-8: files that are not
UTF-8 encoded can be read using UTF8Encoding (it skips invalid byte
sequences), yet mcs throws errors for source code that should not be affected
by characters dropped from comments.
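
To show what I mean, here is a minimal sketch (my own illustration; the byte
values are made up) of how a non-throwing UTF8Encoding swallows non-UTF-8
input without reporting an error:

// Sketch only: decoding non-UTF-8 bytes with a non-throwing UTF8Encoding
// does not fail; invalid sequences are silently dropped or replaced, so a
// compiler reading the file this way sees altered text instead of getting
// a clear "wrong encoding" error.
using System;
using System.Text;

class LenientUtf8Decode
{
    static void Main()
    {
        // "Kornél" as saved by a Latin1 editor: 0xE9 alone is not valid UTF-8.
        byte[] latin1Bytes = { 0x4B, 0x6F, 0x72, 0x6E, 0xE9, 0x6C };

        // false, false: no BOM on output, no exception on invalid input.
        UTF8Encoding lenient = new UTF8Encoding(false, false);
        string decoded = lenient.GetString(latin1Bytes);

        Console.WriteLine(decoded.Length); // shorter than 6, or containing a
                                           // replacement char, but no error
    }
}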

Kornél

----- Original Message -----
From: "Rafael Teixeira" <monoman at gmail.com>
To: "Atsushi Eno" <atsushi at ximian.com>
Cc: "Kornél Pál" <kornelpal at hotmail.com>; "mono-devel mailing list"
<mono-devel-list at lists.ximian.com>
Sent: Friday, August 26, 2005 3:30 PM
Subject: Re: [Mono-dev] mcs default encoding: Latin1 or not


Just to comment a bit.

We have at least two decisions to make coming from this discussion:

- What default encoding should mcs use?

I prefer to use the local current encoding (Encoding.Default), so it
works for files edited with the commonly used editors on each
platform (gedit, and I believe MonoDevelop, follow that, for instance),
read and saved without specifying another encoding. On Windows it would
also work, as notepad/write/VS.NET follow the current code page.
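
Something along these lines, as a rough sketch of the policy (not actual mcs
code; the option handling is hypothetical):

// Rough sketch, not the real mcs code: open a source file with the
// platform's current ANSI/locale code page unless the user asked for
// another encoding; a BOM, when present, still overrides the choice.
using System.IO;
using System.Text;

class SourceOpener
{
    // 'requested' stands for a hypothetical /codepage:XXX command-line option.
    public static StreamReader Open(string path, Encoding requested)
    {
        Encoding enc = (requested != null) ? requested : Encoding.Default;

        // Third argument: detect a BOM and let it win over 'enc'.
        return new StreamReader(path, enc, true);
    }
}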

- What standard encoding should all of our source files in the Mono
repository use to keep things workable for hackers from all cultures?

Here we should stick to UTF-8 (and for those who like to use vim
inside cygwin, I think we should ask the vim hackers to make it support
UTF-8 on that platform). Remember that while most code in Mono uses only
ASCII identifiers (as it follows the MS API, or is new code sticking to the
coding guidelines), author names for sure, and even some other
comment text, may contain non-ASCII characters.

The derived question is: should the files have the BOM marker? I would
say that, to ease mixed Mono/.NET (mcs/csc) work, they should, but
regrettably, while the VS.NET editor does a good job of
detecting/hiding/preserving the BOM, AFAIK most Linux editors
don't, showing it as the special space character it is. I do
think we can make MonoDevelop mimic VS.NET in that regard, but many
Mono hackers use other editors, or a multitude of them (besides MD, I
use gedit extensively when doing quick fixes from the console). So I
don't have a firm opinion on this sub-issue; input is welcome.
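
For concreteness, here is roughly what the choice means at the byte level
(just an illustration of mine; the file names are made up):

// Sketch only: UTF8Encoding can be asked to emit or omit the three-byte
// EF BB BF signature that editors then detect (or display as a stray char).
using System;
using System.IO;
using System.Text;

class BomDemo
{
    static void Main()
    {
        WriteSample("with-bom.cs", new UTF8Encoding(true));     // emits EF BB BF first
        WriteSample("without-bom.cs", new UTF8Encoding(false)); // plain UTF-8 bytes only
    }

    static void WriteSample(string path, Encoding enc)
    {
        using (StreamWriter w = new StreamWriter(path, false, enc))
            w.WriteLine("// Kornél");

        // Print the first three bytes to see where the signature shows up.
        byte[] head = new byte[3];
        using (FileStream fs = File.OpenRead(path))
            fs.Read(head, 0, head.Length);
        Console.WriteLine("{0}: {1:X2} {2:X2} {3:X2}", path, head[0], head[1], head[2]);
    }
}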

Regards,

On 8/26/05, Atsushi Eno <atsushi at ximian.com> wrote:
> Hi,
>
> > If you don't like ISO 8859-1 (code page 28591) because it's foreign, why
> > do you want to use ASCII in source files? :)
>
> Well, ASCII is not foreign for Japanese. None of iso-2022-jp /
> shift_jis / euc-jp contradicts ASCII; ASCII is actually
> part of those encodings.
>
> I know there used to be non-ASCII based encodings such as Indian
> ISCII, Arabic ASMO 449, Bangladeshi BDS 1520:1995 etc., but I
> don't know of any modern encoding that contradicts ASCII (I don't
> think it is possible to publish world-ready applications with
> those encodings).
>
> So AFAIK ASCII is safe, the GCM for us; Latin1 is not.
>
> > I personally hate the fact of having code pages but this has historical
> > reasons. I think UTF-8 is a good solution as it is international,
> > culture-neutral and ASCII compatible.
> >
> > I think we are living in the age of Unicode. So there is no reason to
> > use only ASCII. It's OK to use only ASCII in identifiers and use
> > English in comments and texts, but I don't think we should forgo
> > taking advantage of Unicode. We can use it for names, for example.
>
> Can we edit UTF-8 files in vim on cygwin? No. This fact simply shows
> that we are not living in the age of Unicode.
>
> I heard a story: there was a Japanese or Chinese person who used Chinese
> characters in his (or her) blog, which was aggregated somewhere
> (I don't remember the details), and that person got blamed for using
> Chinese, even though it was written in the UTF-8 encoding.
>
> > I think mcs should use Encoding.Default as the default encoding, as I
> > think this is nearest to the user's need and provides compatibility
> > with csc.exe.
>
> > But we should use UTF-8 without a signature (BOM) for our .cs source code
> > files and explicitly tell mcs to use UTF-8.
>
> Why? I think we *should* use a BOM because, as we discussed before, neither
> mcs nor csc autodetects the encoding correctly.
>
> Here I guess that you think BOM-less UTF-8 sources could be edited
> in Latin1 editors. What happens if I put in CJK ideographs? Actually
> all of us Japanese hackers (really all) said that we feel reluctant
> to edit files that contain Latin1 letters, because our
> usual editors do not support Latin1 (even as a candidate encoding
> for saving files).
>
> Atsushi Eno


--
Rafael "Monoman" Teixeira
---------------------------------------
I'm trying to become a "Rosh Gadol" before my own eyes.
See http://www.joelonsoftware.com/items/2004/12/06.html for enlightenment.
It hurts!
_______________________________________________
Mono-devel-list mailing list
Mono-devel-list at lists.ximian.com
http://lists.ximian.com/mailman/listinfo/mono-devel-list



