[Mono-list] Encoding problems

Jonathan Pryor jonpryor@vt.edu
Tue, 11 Jan 2005 07:31:46 -0500


On Mon, 2005-01-10 at 22:31 -0200, Francisco Figueiredo Jr. wrote:
> I received a report about problems with encoding on mono.

It probably isn't Mono, but I'm willing to be proven wrong. :-)

From the outset, I'm guessing that this is a codepage/charset issue.  US
English and Spanish locales can use different codepages, and a byte value
in one codepage may map to a different character in another.  In
particular, only ASCII is guaranteed to be consistent between them;
anything above codepoint 127 can differ, and this is where "funky"
characters like n-tilde and a-acute are placed.

The only way to preserve sanity is to either (1) use only characters
that are in both codepages (read: stick with ASCII), or (2) use a
character set that covers the union of all required codepages.  That's
Unicode, typically encoded as UTF-8.
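
To make that concrete, here is a minimal sketch in Python (chosen only
for illustration; nothing here is Mono-specific, and the variable names
are made up).  It encodes the name from the report in two encodings and
shows that the ASCII part matches byte for byte, while the non-ASCII part
turns into "strange chars" when decoded with the wrong codepage:

```python
# Illustration only: the same text under two encodings, then a wrong decode.
# Escapes are used so the encoding of this source file doesn't matter.
text = "Magri\u00f1\u00e1"  # n-tilde and a-acute at the end

utf8_bytes = text.encode("utf-8")         # n-tilde and a-acute take two bytes each
latin1_bytes = text.encode("iso-8859-1")  # n-tilde and a-acute take one byte each

# The ASCII prefix is byte-for-byte identical in both encodings.
ascii_prefix_matches = utf8_bytes[:5] == latin1_bytes[:5] == b"Magri"

# Decoding with the wrong codepage gives no error, just the wrong characters.
mojibake = utf8_bytes.decode("iso-8859-1")

# Decoding with the codepage that was used for encoding restores the text.
round_trip = utf8_bytes.decode("utf-8")

print(ascii_prefix_matches)
print(ascii(mojibake))      # ascii() keeps the output printable in any locale
print(round_trip == text)
```

Note that the wrong decode succeeds silently, which is why this kind of
problem tends to show up as garbage on screen rather than as an
exception.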

> The following text isn't being returned correctly from database:
> 
> Magriñá
> 
> The chars n-tilde and a-acute is appearing as strange chars.
> 
> On mono 1.0.4 on linux if you change LANG to en_US the text reads
> correctly, with es_ES not.

Is it LANG=en_US or LANG=en_US.UTF-8?  The text after the '.' specifies
the codepage to use.  If the codepage isn't explicitly specified, then
the locale's default is used (usually an ISO-8859 variant such as
Latin-1 rather than UTF-8, IIRC).  This is likely where you're
experiencing problems.
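
As a rough sketch of how the value is structured (Python again, with a
hypothetical helper name), a POSIX locale name has the form
language[_territory][.codeset][@modifier], so the codepage is whatever
sits between the '.' and an optional '@':

```python
def codeset_from_locale_name(name):
    """Return the codeset part of a POSIX locale name, or None if absent.

    Locale names follow language[_territory][.codeset][@modifier].
    This helper is hypothetical; it only splits the string and does not
    ask the system which default applies.
    """
    base = name.split("@", 1)[0]   # drop an optional modifier such as @euro
    if "." not in base:
        return None                # no explicit codeset: a default is used
    return base.split(".", 1)[1]


for value in ("en_US", "en_US.UTF-8", "es_ES.ISO-8859-15@euro"):
    print(value, "->", codeset_from_locale_name(value))
```

A result of None is the interesting case: the codepage then depends on
the system's locale definitions rather than on anything you wrote down.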

> I tested here with svn version and with both en_US and es_ES it works.
> Only if I export LANG= it returns wrong chars. What is the default
> encoding when I don't set LANG?

You say that you tested "here", which potentially implies that it's a
different machine than the one experiencing the problem.  Is this
correct?

Regardless, the default LANG value varies between distros; in FC2 it's
set in /etc/sysconfig/i18n (read by /etc/profile.d/lang.sh, which is
read by /etc/profile, which is read by bash).  I'm sure the place where
it's set varies between distros as well.
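
Wherever the value comes from, what matters to a program is what ends
up in its environment.  The sketch below (Python, hypothetical function
name) applies the usual POSIX lookup order for character handling:
LC_ALL first, then LC_CTYPE, then LANG.  The fallback to "C" when none
of them is set is an assumption that matches glibc; POSIX itself leaves
that default implementation-defined.

```python
import os


def effective_ctype_setting(env=None):
    """Return (variable, value) for the setting that decides the codepage.

    Lookup order: LC_ALL, then LC_CTYPE, then LANG.  An unset or empty
    variable is skipped.  The ("C") fallback is an assumption based on
    glibc behaviour, not something POSIX guarantees.
    """
    env = os.environ if env is None else env
    for var in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = env.get(var)
        if value:
            return var, value
    return None, "C"


print(effective_ctype_setting())                  # whatever this shell exports
print(effective_ctype_setting({"LANG": ""}))      # like running with LANG= exported
```

With an empty LANG and nothing else set, the sketch reports the C
locale, whose character set is plain ASCII on glibc systems, so anything
above codepoint 127 has no defined meaning there.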

Furthermore, the only distro I'm aware of that defaults to using UTF-8
throughout is Red Hat and associated distros such as Fedora Core.  This
may have changed (I hope so; it's been 3 years since I heard anything
about this), but until all distros migrate to UTF-8 there will be
behavioral differences in *any* locale-aware program.  (Just look at the
locale-related problems in Gnome and the use of G_BROKEN_FILENAMES...)

> Do you know if there is any problem with 1.0.4 or 1.0.5 and if so if
> there is any fix?

The fix is to *always* specify your codepage and consistently use it.
This may (will) require configuring your database so that it uses the
correct codepage to store strings (as Aleksandar Dezelin mentioned, SQL
Server requires the nchar data type for Unicode strings).

Mono isn't a mind reader, and can't tell what codepage a given string is
in.  It's up to you to ensure codepages are correct and consistent.
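
As an illustration of "always specify your codepage", here is a small
Python sketch (hypothetical helper; a temporary file stands in for
whatever storage you actually use).  The encoding is named explicitly on
both the write and the read, so the result doesn't depend on LANG:

```python
import os
import tempfile


def round_trip(text, encoding):
    """Write text to a temporary file and read it back.

    The encoding is passed explicitly both times instead of relying on
    the locale default.
    """
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        with open(path, "w", encoding=encoding) as f:
            f.write(text)
        with open(path, "r", encoding=encoding) as f:
            return f.read()
    finally:
        os.remove(path)


name = "Magri\u00f1\u00e1"  # n-tilde and a-acute, written as escapes
print(round_trip(name, "utf-8") == name)
print(round_trip(name, "iso-8859-1") == name)
```

Either encoding works here because the same one is used at both ends;
the breakage comes from writing with one and reading with another.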

 - Jon