[Mono-dev] detecting xml encoding

Andrew Skiba andrews at mainsoft.com
Tue Nov 8 04:13:21 EST 2005


Hi guys,

It looks like our XML processor fails to work on EBCDIC systems. There
may be XML documents in both UTF-8 and EBCDIC encodings. I found this section

http://www.w3.org/TR/REC-xml/#sec-guessing-no-ext-info

which defines well how to detect the encoding of an XML document with or
without a BOM. Atsushi may recall that there is another section which
states that an XML document in UTF-16 must always have a BOM, so I don't
know why they explain here how to guess the encoding even for UTF-16.
Maybe these parts were written by different people :-) Anyway, a BOM does
not help us deal with the EBCDIC problem, and this section describes
exactly how to detect the correct encoding.

First, the BOM is read, if present. If it's not there, the '<?xml'
characters are read. There may be 9 different cases, specified in the
link I gave. That gives us enough information to read the whole XML
declaration, which uses only characters from the ASCII repertoire, and
this declaration MUST specify the encoding unless the document is UTF-8
or UTF-16. Citation from
http://www.w3.org/TR/REC-xml/#charencoding

"Unless an encoding is determined by a higher-level protocol, it is also
a fatal error if an XML entity contains no encoding declaration and its
content is not legal UTF-8 or UTF-16."

So this gives us a robust mechanism to determine XML document encoding.
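
To make the steps above concrete, here is a rough sketch in Python (not
Mono code, just an illustration). The names detect_xml_encoding, BOMS,
PREFIXES and DECL_RE are made up for this example. It assumes cp037 as
the EBCDIC flavour for reading the declaration, looks only at the first
512 bytes, and skips the unusual UCS-4 byte orders (2143 and 3412) listed
in the W3C section.

```python
import codecs
import re

# Byte-order marks. UCS-4 entries come first because the UTF-16LE BOM
# (FF FE) is a prefix of the UCS-4 little-endian BOM (FF FE 00 00).
BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]

# First four bytes of a document that starts with '<?xml' and has no BOM.
PREFIXES = [
    (b"\x00\x00\x00\x3c", "utf-32-be"),
    (b"\x3c\x00\x00\x00", "utf-32-le"),
    (b"\x00\x3c\x00\x3f", "utf-16-be"),
    (b"\x3c\x00\x3f\x00", "utf-16-le"),
    (b"\x3c\x3f\x78\x6d", "ascii"),   # UTF-8, ISO 8859, ASCII-compatible
    (b"\x4c\x6f\xa7\x94", "cp037"),   # EBCDIC (assumption: cp037 flavour)
]

# Pulls the encoding name out of an already decoded XML declaration.
DECL_RE = re.compile(
    r"""<\?xml[^>]*?encoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["']"""
)


def detect_xml_encoding(data: bytes) -> str:
    """Guess the encoding of an XML document from its first bytes."""
    # Step 1: a BOM, if present, settles the encoding family.
    for bom, name in BOMS:
        if data.startswith(bom):
            return name

    # Step 2: no BOM, so look at how '<?xm' (or '<?') is encoded.
    for prefix, family in PREFIXES:
        if data.startswith(prefix):
            break
    else:
        # No BOM and no recognisable declaration: UTF-8 is the default.
        return "utf-8"

    # For the UTF-16 / UCS-4 families the byte pattern already tells us
    # width and byte order, so this sketch stops here.
    if family not in ("ascii", "cp037"):
        return family

    # Step 3: the family is enough to decode the declaration itself.
    head = data[:512].decode(family, errors="replace")
    match = DECL_RE.match(head)
    if match:
        return match.group(1)

    # No encoding declaration. For the ASCII family that means UTF-8.
    # For EBCDIC it is strictly a fatal error under the rule quoted
    # above; this sketch just falls back to the provisional code page.
    return "utf-8" if family == "ascii" else family
```

For example, a document whose declaration says encoding="IBM037" and
which is stored in EBCDIC is first recognised by the 4C 6F A7 94 prefix,
then its declaration is decoded with cp037, and the declared name is
returned so the rest of the document can be decoded with it.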

Please tell me what you think.

Andrew.


