[Mono-list] UTF-16 and XmlTextReader questions

François Garillot garillot at seas.upenn.edu
Fri Jul 29 14:29:30 EDT 2005


Atsushi Eno wrote:

> This does not happen either. Can you please post the exact XML files
> that raise the errors?

OK. I work from this base file:
<test>á</test>
hexdump:
0000000 743c 7365 3e74 3ce1 742f 7365 3e74     
000000e

This UTF-16 file with no BOM and no declaration gets rejected, as
expected, for not being UTF-8 (System.ArgumentException: Arg_InvalidUTF8).

***

Test 1 

I add:
<?xml version="1.0" encoding="utf-8"?>
to it.

hexdump:
0000000 3f3c 6d78 206c 6576 7372 6f69 3d6e 3122
0000010 302e 2022 6e65 6f63 6964 676e 223d 7475
0000020 2d66 2238 3e3f 3c0a 6574 7473 e13e 2f3c
0000030 6574 7473 003e                         
0000035

This UTF-16 file with no BOM and an erroneous UTF-8 XML declaration
should get rejected, if I understand the XML spec (4.3.3, §8)¹
correctly.

The output I get is simply the XML file with the offending á discarded:

<?xml version="1.0" encoding="utf-8"?><test></test>

hexdump:
0000000 3f3c 6d78 206c 6576 7372 6f69 3d6e 3122
0000010 302e 2022 6e65 6f63 6964 676e 223d 7475
0000020 2d66 2238 3e3f 743c 7365 3e74 2f3c 6574
0000030 7473 0a3e                              
0000034

***

Test 2 

I take the base file again and run 'iconv -f utf-16 -t utf-16' on it.
I get:
ÿþ<test>á</test>

hexdump:
0000000 feff 743c 7365 3e74 3ce1 742f 7365 3e74
0000010

This UTF-16 file with BOM and no declaration should, if I understand
correctly, be accepted. However, the output I get is the error I
described in my first posting:

Unhandled Exception: System.Xml.XmlException: Text node cannot appear in
this state. file://test.xml Line 1, position 1.
in <0x001ee> System.Xml.XmlTextReader:ReadText (Boolean notWhitespace)
in <0x00186> System.Xml.XmlTextReader:ReadContent ()
in <0x0011f> System.Xml.XmlTextReader:Read ()
in <0x00071> test:Main ()
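
For reference, a minimal sketch of what a generic UTF-16 decoder makes of
the 16 bytes in the Test 2 dump. It assumes the dump shows 16-bit words in
little-endian order, so the file starts with the bytes FF FE. It uses
Python's built-in "utf-16" codec, not Mono's decoder, so it only shows how
the bytes pair up into code units.

```python
# Test 2 file as read from the dump: BOM (FF FE) followed by the base-file bytes.
DATA = b"\xff\xfe<test>\xe1</test>"

# The "utf-16" codec consumes the BOM and uses it to pick the byte order.
text = DATA.decode("utf-16")

print(len(text))                            # number of decoded characters
print([f"U+{ord(c):04X}" for c in text])    # code point of each character
print("<" in text)                          # is there any markup character?
```

Read this way, the 14 bytes after the BOM pair up into 7 code units, and
the first decoded character is U+743C rather than '<'. A parser that sees
this would find character data at line 1, position 1, which may be what
the exception above is reporting.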

-- 
François Garillot <garillot at seas.upenn.edu>

¹ In the absence of information provided by an external transport
protocol (e.g. HTTP or MIME), it is an error for an entity including an
encoding declaration to be presented to the XML processor in an encoding
other than that named in the declaration, (...)


