[Mono-list] UTF-16 and XmlTextReader questions

François Garillot garillot at seas.upenn.edu
Fri Jul 29 11:35:45 EDT 2005


Hi

I've been feeding some UTF-16 documents to an XmlTextReader lately¹, and
I've encountered some behavior I have trouble understanding. 

My test case is a UTF-16-encoded file (called "test.xml" below)
containing just the character U+00E1 LATIN SMALL LETTER A WITH ACUTE
between the opening and closing tags of a "foo" element.

- If this file has no BOM² and no XML text declaration, the
XmlTextReader chokes on the U+00E1 character (System.ArgumentException:
Arg_InvalidUTF8), which is logical since it expects UTF-8³. However:

- If this file has no BOM but carries an erroneous XML text declaration
claiming it is UTF-8, the XmlTextReader processes the file, simply
discarding the offending U+00E1. Shouldn't it raise an error exactly as
in the previous case?

- If the file has a BOM (hex FE FF) but no XML text declaration, the
XmlTextReader chokes on the BOM, outputting:

 Unhandled Exception: System.Xml.XmlException: Text node cannot appear
 in this state.
  file://test.xml Line 1, position 1.
 in <0x001ee> System.Xml.XmlTextReader:ReadText (Boolean notWhitespace)
 in <0x00186> System.Xml.XmlTextReader:ReadContent ()
 in <0x0011f> System.Xml.XmlTextReader:Read ()
 in <0x00071> test:Main ()
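
For anyone wanting to reproduce the three cases above, here is a
minimal sketch of a generator. Several things in it are assumptions
rather than a record of how the original files were made: the output
file names are hypothetical (the test program always reads "test.xml"),
big-endian UTF-16 is inferred from the FE FF signature, and the UTF-8
declaration is itself written out as UTF-16 bytes.

```csharp
using System;
using System.IO;
using System.Text;

class MakeTestFiles
{
    // Writes xml as big-endian UTF-16, optionally preceded by the BOM.
    static void Write (string path, string xml, bool withBom)
    {
        // UnicodeEncoding (bigEndian, byteOrderMark): GetPreamble ()
        // returns FE FF only when byteOrderMark is true, and an empty
        // array otherwise.
        Encoding enc = new UnicodeEncoding (true, withBom);
        using (FileStream fs = File.Create (path)) {
            byte[] bom = enc.GetPreamble ();
            fs.Write (bom, 0, bom.Length);
            byte[] body = enc.GetBytes (xml);
            fs.Write (body, 0, body.Length);
        }
    }

    public static void Main ()
    {
        string doc = "<foo>\u00E1</foo>";
        string decl = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>";

        // Hypothetical file names, one per case described above.
        Write ("nobom-nodecl.xml", doc, false);        // case 1
        Write ("nobom-utf8decl.xml", decl + doc, false); // case 2
        Write ("bom-nodecl.xml", doc, true);           // case 3
    }
}
```

Copying one of the generated files to "test.xml" before running the
test program should then exercise the corresponding case.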

The XML spec³ (4.3.3, §2) says about the BOM:

"This is an encoding signature, not part of either the markup or the
character data of the XML document."

Therefore, shouldn't the XmlTextReader discard the BOM along the way and
process the document as usual?
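
In the meantime, one possible workaround is to let a StreamReader
consume the BOM and pick the encoding before the XmlTextReader sees any
characters. This is only a sketch: it relies on the standard
StreamReader (path, encoding, detectEncodingFromByteOrderMarks)
constructor and the XmlTextReader (TextReader) constructor, and it has
not been checked against the Mono revision mentioned in this message.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;

class BomWorkaround
{
    public static void Main ()
    {
        // The third argument asks the StreamReader to detect the
        // encoding from a leading BOM; UTF-8 is only the fallback used
        // when no BOM is present.
        using (StreamReader sr = new StreamReader ("test.xml", Encoding.UTF8, true)) {
            XmlTextReader reader = new XmlTextReader (sr);
            while (reader.Read ()) {
                // Print the character data of each text node.
                if (reader.NodeType == XmlNodeType.Text)
                    Console.WriteLine (reader.Value);
            }
        }
    }
}
```

This only helps when the file actually starts with a BOM; a BOM-less
UTF-16 file would still be decoded with the UTF-8 fallback.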

I'm using mono and mcs from the svn repository, revision 47821. Thanks
in advance for any input on this subject.

-- 
François Garillot <garillot at seas.upenn.edu>

1: I'm using this code as a test, with "test.xml" as my XML content test
file.

using System;
using System.Text;
using System.Xml;

class test
{
    public static void Main ()
    {
        XmlTextReader reader = new XmlTextReader("test.xml");
        StringBuilder sb = new StringBuilder();
        // Append the outer XML of each node that Read() stops on.
        while (reader.Read()) {
            sb.Append(reader.ReadOuterXml());
        }
        Console.WriteLine(sb.ToString());
    }
}

2: Byte Order Mark
3: The behavior I expect in each case comes from my reading of:
 http://www.w3.org/TR/2000/REC-xml-20001006#charencoding


