[Mono-bugs] [Bug 52101][Maj] New - UTF-8 encoded byte order mark is output to stdout
bugzilla-daemon@bugzilla.ximian.com
Fri, 12 Dec 2003 09:25:15 -0500 (EST)
Please do not reply to this email. If you want to comment on the bug, go to the
URL shown below and enter your comments there.
Changed by bruno@clisp.org.
http://bugzilla.ximian.com/show_bug.cgi?id=52101
--- shadow/52101 2003-12-12 09:25:15.000000000 -0500
+++ shadow/52101.tmp.29342 2003-12-12 09:25:15.000000000 -0500
@@ -0,0 +1,79 @@
+Bug#: 52101
+Product: Mono/Class Libraries
+Version: unspecified
+OS: SUSE 9.0
+OS Details:
+Status: NEW
+Resolution:
+Severity:
+Priority: Major
+Component: System
+AssignedTo: mono-bugs@ximian.com
+ReportedBy: bruno@clisp.org
+QAContact: mono-bugs@ximian.com
+TargetMilestone: ---
+URL:
+Cc:
+Summary: UTF-8 encoded byte order mark is output to stdout
+
+Description of Problem:
+A UTF-8 ZWNBSP (zero-width non-breaking space) is output when "mono"
+produces console output in a UTF-8 locale. Mentioning this behaviour in the
+FAQ doesn't change the fact that it's a bug.
+
+
+Steps to reproduce the problem:
+Install mono-0.28.
+$ export LANG=de_DE.UTF-8
+$ export LC_COLLATE=POSIX
+$ env | grep '^\(LANG\|LC_\)'
+LC_COLLATE=POSIX
+LANG=de_DE.UTF-8
+$ mcs hello.cs -o hello.mono.exe
+$ mono hello.mono.exe | hexdump -e '"%06.6_ax " 16/1 "%02X "' -e '" " 16/1 "%_p" "\n"'
+
+
+Actual Results:
+$ mono hello.mono.exe | hexdump -e '"%06.6_ax " 16/1 "%02X "' -e '" " 16/1 "%_p" "\n"'
+000000 EF BB BF 48 65 6C 6C 6F 0A ...Hello.
+
+
+Expected Results:
+$ mono hello.mono.exe | hexdump -e '"%06.6_ax " 16/1 "%02X "' -e '" " 16/1 "%_p" "\n"'
+000000 48 65 6C 6C 6F 0A Hello.
+
+
+How often does this happen?
+Reproducible.
+
+
+Additional Information:
+
+- The fact that the output contains line terminators of 0A, not 0D 0A, shows that
+Mono is intended to follow the customary conventions of Unix systems.
+
+- The customary convention on Unix systems is not to output a byte
+order mark on stdout or to files. See
+http://www.cl.cam.ac.uk/~mgk25/unicode.html, which says:
+ "One influential non-POSIX PC operating system vendor (whom we shall
+leave unnamed here) suggested that all Unicode files should start with the
+character ZERO WIDTH NOBREAK SPACE (U+FEFF), which is in this role
+also referred to as the "signature" or "byte-order mark (BOM)", in order to
+identify the encoding and byte-order used in a file. Linux/Unix does
+not use any BOMs and signatures. They would
+break far too many existing ASCII syntax conventions (such as scripts starting
+with #!). On POSIX systems, the selected locale identifies
+already the encoding expected in all input and output files of a process."
+
+- RFC 2279, which defines UTF-8, does not mention byte order marks,
+whereas RFC 2781, which defines UTF-16, does.
+This makes it clear that byte order marks are reserved for the
+UTF-16/UCS-2/UCS-4 encodings.
+ http://www.faqs.org/rfcs/rfc2279.html
+ http://www.faqs.org/rfcs/rfc2781.html
+
+- The fix may be as simple as changing how the value of
+System.Text.Encoding.UTF8 is created: just invoke the default UTF8Encoding
+constructor, which does not request a byte order mark.
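
To make the byte-level argument above concrete, here is a minimal sketch in Python (chosen only for illustration; it is not part of Mono and is not the proposed fix). It checks the bytes shown under "Actual Results" and "Expected Results" for a leading UTF-8 encoded U+FEFF, and shows why such a prefix defeats a "#!" check. The helper names has_utf8_bom and strip_utf8_bom are hypothetical.

```python
import codecs

# U+FEFF encoded as UTF-8 is the three bytes EF BB BF seen in the hexdump.
UTF8_BOM = codecs.BOM_UTF8


def has_utf8_bom(data: bytes) -> bool:
    """Return True if data begins with the UTF-8 encoded byte order mark."""
    return data.startswith(UTF8_BOM)


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 byte order mark, if one is present."""
    return data[len(UTF8_BOM):] if has_utf8_bom(data) else data


# Byte sequences copied from the hexdump output in the report.
actual = bytes.fromhex("EF BB BF 48 65 6C 6C 6F 0A")
expected = bytes.fromhex("48 65 6C 6C 6F 0A")

# A script prefixed with the mark no longer starts with the two bytes "#!".
script_with_bom = UTF8_BOM + b"#!/bin/sh\n"

if __name__ == "__main__":
    print(has_utf8_bom(actual))
    print(has_utf8_bom(expected))
    print(strip_utf8_bom(actual) == expected)
    print(script_with_bom.startswith(b"#!"))
```

A consumer on the receiving end of the pipe would need something like strip_utf8_bom as a workaround; the point of this report is that no such workaround should be necessary on Unix.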