[Mono-list] Mono.Unix Filename Marshaling

Jonathan Pryor jonpryor at vt.edu
Tue Oct 25 22:30:28 EDT 2005


To permit better handling of arbitrary filenames, Mono.Unix in svn has
been extended to use the following semantics:

  - When marshaling a filename from unmanaged to managed code (such 
    as with Syscall.readdir() or Syscall.readdir_r()), Mono.Unix will
    first attempt to decode the filename as a UTF-8 string.

    If the UTF-8 decode fails, each invalid byte will be represented
    as a two-char System.Char sequence:
    Mono.Unix.UnixEncoding.EscapeByte followed by the "offending" byte
    cast to a char.

  - When marshaling a filename from managed to unmanaged code (such as
    via Syscall.open() or Syscall.stat()), the filename will be 
    encoded using UTF-8 unless Mono.Unix.UnixEncoding.EscapeByte is 
    encountered, in which case the EscapeByte character will be skipped
    and the following character will be marshaled as a byte.

See Mono.Unix.UnixEncoding for details.
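
To make the two rules above concrete, here is a minimal, self-contained
sketch of the escaping scheme.  It is NOT the actual
Mono.Unix.UnixEncoding source: the class name FilenameEscapeSketch and
its Decode/Encode/SequenceLength helpers are made up for illustration,
and the escape value is hard-coded to the U+FFFF described in this
message.  Real code should use Mono.Unix.UnixEncoding instead.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical illustration only; not the Mono.Unix implementation.
static class FilenameEscapeSketch
{
    // Escape marker as described in this message (U+FFFF).
    public const char Escape = '\uFFFF';

    // Strict UTF-8: throws on malformed input instead of substituting U+FFFD.
    static readonly UTF8Encoding Strict = new UTF8Encoding(false, true);

    // Length of the UTF-8 sequence announced by a lead byte, or 0 if the
    // byte can never start a valid sequence.
    static int SequenceLength(byte lead)
    {
        if (lead < 0x80) return 1;
        if (lead >= 0xC2 && lead <= 0xDF) return 2;
        if (lead >= 0xE0 && lead <= 0xEF) return 3;
        if (lead >= 0xF0 && lead <= 0xF4) return 4;
        return 0;
    }

    // Unmanaged -> managed: decode valid UTF-8 sequences normally and turn
    // every other byte into Escape followed by the byte cast to a char.
    public static string Decode(byte[] raw)
    {
        var sb = new StringBuilder();
        int i = 0;
        while (i < raw.Length)
        {
            int len = SequenceLength(raw[i]);
            if (len > 0 && i + len <= raw.Length)
            {
                try
                {
                    sb.Append(Strict.GetString(raw, i, len));
                    i += len;
                    continue;
                }
                catch (ArgumentException)
                {
                    // Malformed sequence: fall through and escape raw[i].
                }
            }
            sb.Append(Escape).Append((char)raw[i]);
            i++;
        }
        return sb.ToString();
    }

    // Managed -> unmanaged: UTF-8 encode everything except escaped pairs,
    // where the Escape char is skipped and the next char becomes one byte.
    public static byte[] Encode(string name)
    {
        var bytes = new List<byte>();
        var pending = new StringBuilder();
        for (int i = 0; i < name.Length; i++)
        {
            if (name[i] == Escape && i + 1 < name.Length)
            {
                bytes.AddRange(Encoding.UTF8.GetBytes(pending.ToString()));
                pending.Clear();
                bytes.Add((byte)name[++i]);
            }
            else
            {
                pending.Append(name[i]);
            }
        }
        bytes.AddRange(Encoding.UTF8.GetBytes(pending.ToString()));
        return bytes.ToArray();
    }
}
```

With this sketch, the three bytes 0x66 0x6F 0xE9 ("fo" plus a Latin-1
e-acute, which is not valid UTF-8) decode to the four chars 'f', 'o',
U+FFFF, U+00E9, and Encode turns that string back into the original
three bytes.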

In short, it's a Glorious Hack.  Rejoice.  Or something.

What this means:
  - Any filename on disk, in any encoding (or lack thereof), can be
    found and used with the Mono.Unix(.Native) types.

  - You don't need to specify the encoding of filenames (which could be
    wrong anyway, since a directory may contain filenames in more than
    one encoding).

  - Printing, saving, or displaying the filename may give incorrect
    results, since it can contain extra escaping that's relevant only
    to the Mono.Unix(.Native) classes.  I'm not losing any sleep over
    this, because if the encoding is unknown the strings couldn't be
    displayed correctly anyway...

  - You _may not_ be able to use the System.IO classes to open a file
    obtained via the Mono.Unix(.Native) classes.  This is because
    System.IO doesn't know about UnixEncoding and the escape mechanism
    it uses.  I don't consider this to be a problem: the System.IO
    classes couldn't open these files *anyway*, they weren't returned
    by System.IO.Directory.GetFiles(), and they were effectively
    invisible to normal Mono programs.  They still are.

    If the filename contains Mono.Unix.UnixEncoding.EscapeByte, then
    you won't be able to use System.IO with that file.  If the filename
    doesn't contain EscapeByte, it can be used with System.IO.

  - You still can't specify filenames in arbitrary encodings on the 
    mono command line.  Mono will still try to decode these as either
    UTF-8 strings or as an encoding listed in MONO_EXTERNAL_ENCODINGS.
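
For the System.IO point in the list above, the check can be sketched
as follows.  The SystemIoCompat class and IsUsableWithSystemIO method
are hypothetical names made up for this example, and the escape value
is hard-coded to the U+FFFF described in this message; real code
should compare against Mono.Unix.UnixEncoding.EscapeByte.

```csharp
using System;

// Hypothetical helper; not part of Mono.Unix.
static class SystemIoCompat
{
    // Escape marker value as stated in this message (assumption: in real
    // code, read Mono.Unix.UnixEncoding.EscapeByte instead).
    const char Escape = '\uFFFF';

    // True when the name contains no escape marker, i.e. it decoded as
    // plain UTF-8 and can be handed to System.IO.
    public static bool IsUsableWithSystemIO(string name)
    {
        return name.IndexOf(Escape) < 0;
    }
}
```

A name that fails this check should stay within the
Mono.Unix(.Native) APIs, which know how to undo the escaping.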

Q & A:
  Q Why UTF-8?  Why not use Encoding.Default?  
  A Because UTF-8 is sane and should always be used. :-)

  Q Seriously?
  A Ha ha only serious.  Plus, a directory can contain filenames in
    more than one encoding, and expecting the developer to provide the
    right encoding for each one would require the developer to be
    clairvoyant.

    Plus, using UTF-8 allows any Unicode character to be used in a
    filename (which could be considered a bad thing, depending).

  Q What is Mono.Unix.UnixEncoding.EscapeByte?
  A U+FFFF, which is guaranteed not to be a Unicode character at all.
    I suppose someone might still try to use this in a filename, but I
    think it's highly unlikely (famous last words, knock on wood...).

  Q Why not use byte[] instead of string for filenames in 
    Syscall.open(), Syscall.stat(), etc.?
  A Because byte[] is fugly to work with, so it would need to be offered
    in addition to the string versions, which would double all the 
    file-related APIs.  Do you really want to explain the difference
    between these APIs?

	public static int open (string pathname, OpenFlags flags);
	public static int open (byte[] pathname, OpenFlags flags);

    (Hint: if you *do* want to explain the difference between these
    you're masochistic.)

    Furthermore, what should Mono.Unix.Native.Dirent.d_name be (or
    Fstab.fs_file, or any other string-typed structure member)?
    If it's a byte[], developers will still need a way to convert it to 
    a string for debugging and display to the user, but the developer 
    can't know what encoding to use (it could be anything), so this 
    becomes an impossible problem.  UnixEncoding may be a Glorious Hack,
    but at least it leaves the API usage unambiguous.

  Q .NET doesn't have these limitations!  Why does Mono?
  A Because Windows stores all filenames on disk as Unicode (and has 
    since Windows NT 3.1 and/or the introduction of Long Filenames in
    Windows 95), so it doesn't need to worry (as much) about the
    arbitrary filename encoding problem.

  Q Why doesn't Mono do this (or something like it) so that System.IO 
    can read and process all files?
  A Priorities. :-)

    Plus, I thought it would be easy for Mono to do this, but after 
    implementing Mono.Unix.UnixEncoding I'm not sure the other 
    maintainers would wish to deal with the issues of arbitrary 
    filename encodings.

    Plus, most current Linux distros default to using UTF-8 already,
    so (hopefully) this won't be an issue for too much longer
    (10 years?).

 - Jon



