[Mono-dev] Re: detecting xml encoding

Atsushi Eno atsushi at ximian.com
Tue Nov 8 04:35:35 EST 2005


Hi,

Andrew Skiba wrote:
> Hi guys,
> 
> It looks like our XML processor fails to work on an EBCDIC system. There
> may be XML documents in both UTF-8 and EBCDIC encodings. I found this section
> 
> http://www.w3.org/TR/REC-xml/#sec-guessing-no-ext-info
> 
> which explains clearly how to detect the encoding of an XML document with
> or without a BOM. Atsushi may recall that there is another section that
> states that an XML document in UTF-16 must always have a BOM, so I don't
> know why they explain here how to guess the encoding even for UTF-16.
> Maybe these parts were written by different people :-) Anyway, a BOM does
> not help us deal with the EBCDIC problem, and this section describes
> exactly how to detect the correct encoding.

I think the two parts are separate because Appendix F is non-normative,
while section 4 is normative.

> First, the BOM is read, if present. If it's not there, the '<?xml'
> characters are read. There may be 9 different cases, specified in the
> link I gave. That gives us enough information to read the whole XML
> declaration, written in English letters, and this declaration
> MUST specify the encoding. Quoting from
> http://www.w3.org/TR/REC-xml/#charencoding
> 
> "Unless an encoding is determined by a higher-level protocol, it is also
> a fatal error if an XML entity contains no encoding declaration and its
> content is not legal UTF-8 or UTF-16."
> 
> So this gives us a robust mechanism to determine XML document encoding.
> 
> Please tell me what you think.
> 
> Andrew.
> 

We can change XmlInputStream.cs to detect more encodings than it does
now. I don't know whether the other encodings really work, though (few
people, if any, would have touched EBCDIC).
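
For illustration, here is a rough sketch of the detection steps quoted
above (BOM first, then the bytes of '<?xm', then the encoding declaration),
following the byte patterns listed in Appendix F. It is Python rather than
the actual XmlInputStream.cs code, and it is not how Mono implements it.
The name detect_xml_encoding is hypothetical, cp037 is an assumed stand-in
for "some EBCDIC flavor", and the unusual UCS-4 byte orders are left out.

```python
import re

# Byte-order marks. The 4-byte UCS-4 marks come first so that FF FE 00 00
# is not mistaken for a UTF-16 little-endian BOM.
BOMS = [
    (b"\x00\x00\xfe\xff", "utf-32-be"),
    (b"\xff\xfe\x00\x00", "utf-32-le"),
    (b"\xef\xbb\xbf", "utf-8"),
    (b"\xfe\xff", "utf-16-be"),
    (b"\xff\xfe", "utf-16-le"),
]

# Without a BOM: how the first characters of '<?xml' look in each family.
PREFIXES = [
    (b"\x00\x00\x00\x3c", "utf-32-be"),
    (b"\x3c\x00\x00\x00", "utf-32-le"),
    (b"\x00\x3c\x00\x3f", "utf-16-be"),
    (b"\x3c\x00\x3f\x00", "utf-16-le"),
    (b"\x3c\x3f\x78\x6d", "utf-8"),  # any ASCII-compatible encoding
    (b"\x4c\x6f\xa7\x94", "cp037"),  # EBCDIC; cp037 is an assumption
]

ENCODING_ATTR = re.compile(
    r"""encoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["']"""
)


def detect_xml_encoding(data: bytes) -> str:
    """Guess the encoding of an XML entity from its leading bytes."""
    # Step 1: a BOM, if present, settles the question in this sketch.
    for bom, name in BOMS:
        if data.startswith(bom):
            return name

    # Step 2: no BOM, so identify the encoding family from '<?xm'.
    family = next(
        (name for prefix, name in PREFIXES if data.startswith(prefix)), None
    )
    if family is None:
        # No BOM and no XML declaration: the default is UTF-8.
        return "utf-8"

    # Step 3: the family is enough to decode the XML declaration, which
    # may name the precise encoding.
    head = data[:256].decode(family, errors="replace")
    end = head.find("?>")
    if end == -1:
        return family
    match = ENCODING_ATTR.search(head[:end])
    return match.group(1) if match else family
```

Feeding it a document whose declaration says encoding="IBM037" and whose
bytes are EBCDIC returns "IBM037". A document with neither a BOM nor a
declaration falls back to "utf-8".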

Atsushi Eno


