[Mono-list] Why UTF-16 strings in Mono.Unix?

Jonathan Pryor jonpryor at vt.edu
Tue Oct 18 07:02:45 EDT 2005


On Tue, 2005-10-18 at 09:58 +0200, Florian Weimer wrote:
> * Jonathan Pryor:
> > On Mon, 2005-10-17 at 19:03 +0200, Florian Weimer wrote:
> >> Why are UTF-16 strings used in Mono.Unix?  Doesn't this mean that some
> >> resources are inaccessible to programs running under Mono in a
> >> multibyte locale (such as one using UTF-8)?
> >
> > Care to elaborate?  System.String is always used to represent strings in
> > Mono.Unix and Mono.Unix.Native, but Mono's marshaler will convert the
> > strings to UTF-8 for the P/Invoke call.
> 
> UNIX systems do not have a system-wide locale.  Some user might run
> under a single-byte locale and create a file named "Ärger.txt" (whose
> name consists of exactly nine bytes in his locale).  Another user who
> uses UTF-8 cannot access this file using any name that is valid UTF-8.
> For applications written in C, this is typically not a problem because
> you can pass the necessary byte string on the command line (entering
> ?rger.txt in the shell, which performs expansion), but this won't work
> with Mono applications.
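
To make the byte-level mismatch concrete, here is a minimal Python
sketch.  It assumes the single-byte locale is ISO-8859-1, which the
example above does not actually specify, and the variable names are
purely illustrative.

```python
# Assumption: the file was created under an ISO-8859-1 locale.
name = "\u00c4rger.txt"  # "Ärger.txt", written with an escape for safety

# Bytes the single-byte-locale user actually stored on disk.
latin1_bytes = name.encode("iso-8859-1")

# Bytes a marshaler that always converts to UTF-8 would pass to open(2).
utf8_bytes = name.encode("utf-8")

print(len(latin1_bytes))  # 9
print(len(utf8_bytes))    # 10

# The on-disk name is not valid UTF-8, so no UTF-8 string maps onto it.
try:
    latin1_bytes.decode("utf-8")
    on_disk_name_is_valid_utf8 = True
except UnicodeDecodeError:
    on_disk_name_is_valid_utf8 = False

print(on_disk_name_is_valid_utf8)  # False
```

The two byte strings differ in their first byte(s), so a name that is
always marshaled as UTF-8 never matches the nine bytes on disk.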

This won't work with a great deal more than just Mono applications.
This will likely also "break" for every app that uses a runtime (Java,
Perl, Python), and certainly won't work with GTK+/Gnome applications
unless the user explicitly sets the G_FILENAME_ENCODING environment
variable to contain the character set name that should instead be used
(and how many users will know about G_FILENAME_ENCODING, much less set
it?), or the user sets G_BROKEN_FILENAMES=1.

A "fix" might be for Mono's string marshaler and
Marshal.StringToHGlobalAnsi() to follow G_FILENAME_ENCODING instead of
always converting to UTF-8 (something I considered a few months ago but
never got around to writing a patch for), in which case things would
work properly for you...if you remembered G_FILENAME_ENCODING, anyway.
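
A rough sketch of that lookup, in Python rather than the marshaler's C.
It follows GLib's documented rules as I understand them: use the first
entry of the comma-separated G_FILENAME_ENCODING list, treat "@locale"
as the current locale's charset, fall back to the locale charset when
G_BROKEN_FILENAMES is set, and use UTF-8 otherwise.  The function names
filename_charset and marshal_filename are hypothetical; nothing like
them exists in Mono or GLib.

```python
import locale


def filename_charset(env):
    """Pick the filename encoding from a dict of environment variables.

    Simplified: only the first entry of G_FILENAME_ENCODING is used.
    """
    first = env.get("G_FILENAME_ENCODING", "").split(",")[0].strip()
    if first == "@locale":
        return locale.getpreferredencoding(False)
    if first:
        return first
    if "G_BROKEN_FILENAMES" in env:
        return locale.getpreferredencoding(False)
    return "UTF-8"


def marshal_filename(name, env):
    """Encode a string as the NUL-terminated bytes a native call would see."""
    return name.encode(filename_charset(env)) + b"\0"


# Default behavior: always UTF-8.
print(marshal_filename("\u00c4rger.txt", {}))
# With the variable set, the single-byte name is produced instead.
print(marshal_filename("\u00c4rger.txt",
                       {"G_FILENAME_ENCODING": "ISO-8859-1"}))
```

In a real program the dict would be os.environ; passing it in explicitly
just keeps the sketch easy to try with different settings.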

> A first step in a direction to fix that would be to use native strings
> (multibyte strings) for accessing native APIs.

What does that mean, exactly?  Mono is already generating multibyte
strings for the native APIs -- UTF-8 strings, yes, but UTF-8 is a
multibyte encoding -- so your statement is effectively meaningless.

It sounds like what you *really* want is for Mono's string marshaler to
marshal to the user's preferred character set/encoding instead of UTF-8.
This can be done, though I'm not sure what all it would impact, and
determining what the user's preferred encoding is would likely fall to
using G_FILENAME_ENCODING, in which case few may benefit anyway.

 - Jon



