[Mono-list] Mono.Unix Filename Marshaling
Jonathan Pryor
jonpryor at vt.edu
Tue Oct 25 22:30:28 EDT 2005
To permit better handling of arbitrary filenames, Mono.Unix in svn has
been extended to use the following semantics:
- When marshaling a filename from unmanaged to managed code (such
as with Syscall.readdir() or Syscall.readdir_r()), Mono.Unix will
first attempt to decode the filename as a UTF-8 string.
If the UTF-8 decode fails, each invalid byte will be represented
as a two-character System.Char sequence:
Mono.Unix.UnixEncoding.EscapeByte followed by the "offending" byte
cast to a char.
- When marshaling a filename from managed to unmanaged code (such as
via Syscall.open() or Syscall.stat()), the filename will be
encoded using UTF-8 unless Mono.Unix.UnixEncoding.EscapeByte is
encountered, in which case the EscapeByte character will be skipped
and the following character will be marshaled as a byte.
See Mono.Unix.UnixEncoding for details.
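As a rough illustration of the two rules above, here is a minimal Python sketch of the round trip. It is not the Mono.Unix.UnixEncoding source; the names ESCAPE, decode_filename and encode_filename are made up for this example, and the sketch assumes the on-disk name does not already contain U+FFFF.

```python
ESCAPE = "\uffff"  # stand-in for Mono.Unix.UnixEncoding.EscapeByte (assumed value, see Q & A)


def decode_filename(raw: bytes) -> str:
    """Unmanaged -> managed: decode UTF-8, escaping each byte that is not valid UTF-8."""
    out = []
    i = 0
    while i < len(raw):
        # Try to decode one UTF-8 sequence (1 to 4 bytes) starting at position i.
        for width in (1, 2, 3, 4):
            chunk = raw[i:i + width]
            try:
                out.append(chunk.decode("utf-8"))
            except UnicodeDecodeError:
                continue
            i += len(chunk)
            break
        else:
            # No valid sequence starts here: emit ESCAPE plus the byte cast to a char.
            out.append(ESCAPE + chr(raw[i]))
            i += 1
    return "".join(out)


def encode_filename(name: str) -> bytes:
    """Managed -> unmanaged: encode UTF-8, turning ESCAPE + char back into one raw byte."""
    out = bytearray()
    i = 0
    while i < len(name):
        if name[i] == ESCAPE and i + 1 < len(name):
            # Skip the escape marker and marshal the following char as a single byte.
            out.append(ord(name[i + 1]) & 0xFF)
            i += 2
        else:
            out += name[i].encode("utf-8")
            i += 1
    return bytes(out)


if __name__ == "__main__":
    raw = b"caf\xe9.txt"             # Latin-1 "e acute", which is not valid UTF-8
    managed = decode_filename(raw)   # contains ESCAPE followed by chr(0xE9)
    print(repr(managed))
    print(encode_filename(managed) == raw)
```

The point of the sketch is only that the mapping is reversible: whatever bytes come back from the directory listing can be handed back to open() or stat() unchanged.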
In short, it's a Glorious Hack. Rejoice. Or something.
What this means:
- Any filename on disk, in any encoding (or lack thereof), can be
found and used with the Mono.Unix(.Native) types.
- You don't need to specify the encoding of filenames (which could be
wrong anyway, since a directory may contain files in > 1 encoding).
- Printing or otherwise saving/displaying the filename may be
incorrect, since it contains extra escaping that's relevant only to
the Mono.Unix(.Native) classes. I'm not losing any sleep over this,
because if the encoding is unknown the strings couldn't be displayed
correctly anyway...
- You _may not_ be able to use the System.IO classes to open a file
obtained via Mono.Unix(.Native) classes. This is because System.IO
doesn't know about UnixEncoding and the escape mechanism it uses.
I don't consider this to be a problem: the System.IO classes
couldn't open these files *anyway*, they weren't returned by
System.IO.Directory.GetFiles(), and they were effectively invisible
to normal Mono programs. They still are.
If the filename contains Mono.Unix.UnixEncoding.EscapeByte, then
you won't be able to use System.IO with that file. If the filename
doesn't contain EscapeByte, it can be used with System.IO.
- You still can't specify filenames in arbitrary encodings on the
mono command line. Mono will still try to decode these as either
UTF-8 strings or as an encoding listed in MONO_EXTERNAL_ENCODINGS.
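The System.IO and display caveats in the list above boil down to one check: does the managed string contain the escape character? A small Python sketch of that check follows. The helper names are hypothetical and not part of Mono.Unix, and swapping each escaped byte for U+FFFD when displaying is just one possible choice.

```python
ESCAPE = "\uffff"        # stand-in for Mono.Unix.UnixEncoding.EscapeByte (assumed value)
REPLACEMENT = "\ufffd"   # Unicode replacement character, used here for display only


def usable_with_system_io(name: str) -> bool:
    """A name with no escape character is plain UTF-8 and needs no special handling."""
    return ESCAPE not in name


def display_name(name: str) -> str:
    """Return a printable form: each ESCAPE + char pair becomes one replacement character."""
    out = []
    i = 0
    while i < len(name):
        if name[i] == ESCAPE and i + 1 < len(name):
            out.append(REPLACEMENT)
            i += 2
        else:
            out.append(name[i])
            i += 1
    return "".join(out)


if __name__ == "__main__":
    escaped = "caf" + ESCAPE + "\xe9" + ".txt"
    print(usable_with_system_io(escaped))
    print(display_name(escaped))
```

The display form is lossy on purpose, so it should never be passed back to a syscall wrapper; keep the escaped string for that.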
Q & A:
Q Why UTF-8? Why not use Encoding.Default?
A Because UTF-8 is sane and should always be used. :-)
Q Seriously?
A Ha ha only serious. Plus, a directory can contain files in
more than one encoding, and expecting the developer to provide the
right encoding for each file would require the developer to be
clairvoyant.
Plus, using UTF-8 allows any Unicode character to be used in a
filename (which could be considered a bad thing, depending).
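To make the mixed-encoding point concrete, here is a small Python sketch of a hypothetical directory holding the same name written once as UTF-8 and once as Latin-1. The file names are invented for illustration; the point is only that no single caller-supplied encoding decodes both entries correctly.

```python
EXPECTED = "r\u00e9sum\u00e9.txt"

# Two directory entries as raw bytes, written by programs using different encodings.
names_on_disk = [
    EXPECTED.encode("utf-8"),
    EXPECTED.encode("latin-1"),
]


def decodes_correctly(raw: bytes, encoding: str) -> bool:
    """True only if decoding succeeds and yields the intended name."""
    try:
        return raw.decode(encoding) == EXPECTED
    except UnicodeDecodeError:
        return False


if __name__ == "__main__":
    for encoding in ("utf-8", "latin-1"):
        results = [decodes_correctly(raw, encoding) for raw in names_on_disk]
        print(encoding, results)
```

Each encoding gets one entry right and the other wrong: UTF-8 rejects the Latin-1 bytes outright, while Latin-1 silently turns the UTF-8 bytes into mojibake.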
Q What is Mono.Unix.UnixEncoding.EscapeByte?
A U+FFFF, which is guaranteed not to be a Unicode character at all.
I suppose someone might still try to use this in a filename, but I
think it's highly unlikely (famous last words, knock on wood...).
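For the curious, the status of U+FFFF can be checked from Python's standard library. This is a sketch using Python's Unicode database rather than anything in Mono, and is_noncharacter is a helper written for this example. It also shows why the caveat above is real: the code point is a noncharacter, yet UTF-8 can still encode it.

```python
import unicodedata

ESCAPE = "\uffff"  # the value given above for Mono.Unix.UnixEncoding.EscapeByte


def is_noncharacter(ch: str) -> bool:
    """Unicode noncharacters: U+FDD0..U+FDEF plus the last two code points of every plane."""
    cp = ord(ch)
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


if __name__ == "__main__":
    print(unicodedata.category(ESCAPE))   # general category: Cn (not assigned)
    print(is_noncharacter(ESCAPE))
    print(ESCAPE.encode("utf-8"))         # still encodable, so it *can* appear on disk
```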
Q Why not use byte[] instead of string for filenames in
Syscall.open(), Syscall.stat(), etc.?
A Because byte[] is fugly to work with, so it would need to be offered
in addition to the string versions, which would double all the
file-related APIs. Do you really want to explain the difference
between these APIs?
    public static int open (string pathname, OpenFlags flags);
    public static int open (byte[] pathname, OpenFlags flags);
(Hint: if you *do* want to explain the difference between these
you're masochistic.)
Furthermore, what should Mono.Unix.Native.Dirent.d_name be (or
Fstab.fs_file, or any other string-typed structure member)?
If it's a byte[], developers will still need a way to convert it to
a string for debugging and display to the user, but the developer
can't know what encoding to use (it could be anything), so this
becomes an impossible problem. UnixEncoding may be a Glorious Hack,
but at least it leaves the API usage unambiguous.
Q .NET doesn't have these limitations! Why does Mono?
A Because Windows stores all filenames on disk as Unicode (and has
since Windows NT 3.1 and/or the introduction of Long Filenames in
Windows 95), so it doesn't need to worry (as much) about the
arbitrary filename encoding problem.
Q Why doesn't Mono do this (or something like it) so that System.IO
can read and process all files?
A Priorities. :-)
Plus, I thought it would be easy for Mono to do this, but after
implementing Mono.Unix.UnixEncoding I'm not sure the other
maintainers would wish to deal with the issues of arbitrary
filename encodings.
Plus, most current Linux distros default to using UTF-8 already,
so (hopefully) this won't be an issue for too much longer
(10 years?).
- Jon