[MonoDevelop] Re: Source files are UTF-8... are we sure?

Gaute B Strokkenes gs234@srcf.ucam.org
Sat, 10 Apr 2004 17:57:28 +0200


On  8 apr 2004, steve@citygroup.ca wrote:

> // src/Addins/DisplayBindings/SourceEditor/SourceEditorBuffer.cs:
> public static SourceEditorBuffer CreateTextBufferFromFile (string filename)
> {
>   FileStream fs = new FileStream(filename, FileMode.Open);
>   fs.Position = 0;
>   byte[] preamble = Encoding.UTF8.GetPreamble();
>   for (int j = 0; j < preamble.Length; j++)
>   {
>     if (preamble[j] != fs.ReadByte())
>     {
>       System.Console.WriteLine("CreateTextBufferFromFile(): file is not UTF-8. Skipping.");
>       return (null);
>     }
>   }
>   System.Console.WriteLine("CreateTextBufferFromFile(): file is UTF-8. Loading into sourcebuffer...");
>   SourceEditorBuffer buff = new SourceEditorBuffer ();
>   buff.LoadFile (filename);
>   return buff;
> }
> // end

Checking for the preamble is the wrong way to autodetect UTF-8 encoded
text: the BOM is optional, so a file without one may still be perfectly
good UTF-8.  If you must autodetect, then the correct way is to scan
through the entire file and verify that all the bytes form valid UTF-8
sequences.  (Valid UTF-8 sequences are very distinctive, so this is
highly unlikely to give you false positives.)
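
As a rough illustration of scanning the whole file, here is a minimal
sketch in C#.  The class and method names (Utf8Detector, IsValidUtf8,
IsValidUtf8File) are made up for this example and are not MonoDevelop
code.  It assumes the file is small enough to read into memory in one
go, and it leans on the strict mode of System.Text.UTF8Encoding
instead of hand-rolling the byte checks.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical helper for illustration only; not part of MonoDevelop.
public static class Utf8Detector
{
    // Returns true if every byte in data belongs to a valid UTF-8 sequence.
    public static bool IsValidUtf8 (byte[] data)
    {
        // First argument: do not emit a BOM when encoding.
        // Second argument: throw on invalid bytes instead of substituting.
        UTF8Encoding strict = new UTF8Encoding (false, true);
        try {
            // Decodes the whole buffer just to count characters;
            // a malformed sequence raises an exception.
            strict.GetCharCount (data);
            return true;
        } catch (ArgumentException) {
            // DecoderFallbackException derives from ArgumentException.
            return false;
        }
    }

    // Assumption: the file fits comfortably in memory.
    public static bool IsValidUtf8File (string filename)
    {
        return IsValidUtf8 (File.ReadAllBytes (filename));
    }
}
```

Note that pure ASCII text passes this check too, which is fine since
ASCII is a subset of UTF-8.  A leading BOM, if present, is also
accepted, because it is itself a valid UTF-8 sequence.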

> Just before submitting the System.Text bugreport, however, I tried
> running the same test case using those 2 Windows files on this
> FC1/mono box. Lo and behold, it recognizes the .txt as UTF-8 (which
> was set in Notepad) just fine.

You are quite unlikely to find a tool other than Windows Notepad that
adds the UTF-8 BOM preamble to UTF-8 text files.  The practice is
discouraged because a magic cookie at the front of each file just
isn't plain-text compatible and causes a world of trouble down the
line.  (As stated above, if you need to autodetect then you don't need
it anyway.)
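
If you still want to cope with Notepad-style files, one option is to
treat the preamble as optional and simply skip it when it happens to
be there.  A minimal sketch, again with made-up names (BomSkipper,
PreambleLength) that do not exist in MonoDevelop:

```csharp
using System;
using System.Text;

// Hypothetical helper for illustration only; not part of MonoDevelop.
public static class BomSkipper
{
    // Returns how many leading bytes to skip: the preamble length if
    // data starts with the UTF-8 BOM, otherwise 0.
    public static int PreambleLength (byte[] data)
    {
        byte[] preamble = Encoding.UTF8.GetPreamble ();
        if (data.Length < preamble.Length)
            return 0;
        for (int i = 0; i < preamble.Length; i++) {
            if (data[i] != preamble[i])
                return 0;
        }
        return preamble.Length;
    }
}
```

The point of the design is that a missing BOM is not treated as an
error; it only changes where decoding starts.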

Have a look at: http://www.cl.cam.ac.uk/~mgk25/unicode.html#linux

That site is an excellent resource for all things Unicode in the free
and open source software world, by the way.

--
Gaute Strokkenes                        http://www.srcf.ucam.org/~gs234/
..  So, if we convert SUPPLY-SIDE SOYBEAN FUTURES into
 HIGH-YIELD T-BILL INDICATORS, the PRE-INFLATIONARY risks
 will DWINDLE to a rate of 2 SHOPPING SPREES per EGGPLANT!!