[Mono-bugs] [Bug 52101][Maj] New - UTF-8 encoded byte order mark is output to stdout
bugzilla-daemon@bugzilla.ximian.com
Tue, 6 Jan 2004 12:41:54 -0500 (EST)
Please do not reply to this email; if you want to comment on the bug, go to the
URL shown below and enter your comments there.
Changed by miguel@ximian.com.
http://bugzilla.ximian.com/show_bug.cgi?id=52101
--- shadow/52101 2004-01-06 12:41:54.000000000 -0500
+++ shadow/52101.tmp.22886 2004-01-06 12:41:54.000000000 -0500
@@ -0,0 +1,89 @@
+Bug#: 52101
+Product: Mono/Class Libraries
+Version: unspecified
+OS: SUSE 9.0
+OS Details:
+Status: RESOLVED
+Resolution: FIXED
+Severity: Unknown
+Priority: Major
+Component: System
+AssignedTo: mono-bugs@ximian.com
+ReportedBy: bruno@clisp.org
+QAContact: mono-bugs@ximian.com
+TargetMilestone: ---
+URL:
+Cc:
+Summary: UTF-8 encoded byte order mark is output to stdout
+
+Description of Problem:
+A UTF-8 encoded ZWNBSP (zero-width no-break space, U+FEFF) is output when
+"mono" produces console output in a UTF-8 locale. Mentioning this behaviour
+in the FAQ doesn't change the fact that it's a bug.
+
+
+Steps to reproduce the problem:
+Install mono-0.28.
+$ export LANG=de_DE.UTF-8
+$ export LC_COLLATE=POSIX
+$ env | grep '^\(LANG\|LC_\)'
+LC_COLLATE=POSIX
+LANG=de_DE.UTF-8
+$ mcs hello.cs -o hello.mono.exe
+$ mono hello.mono.exe | hexdump -e '"%06.6_ax " 16/1 "%02X "' -e '" "
+16/1 "%_p" "\n"'
+
+
+Actual Results:
+$ mono hello.mono.exe | hexdump -e '"%06.6_ax " 16/1 "%02X "' -e '" "
+16/1 "%_p" "\n"'
+000000 EF BB BF 48 65 6C 6C 6F 0A ...Hello.
+
+
+Expected Results:
+$ mono hello.mono.exe | hexdump -e '"%06.6_ax " 16/1 "%02X "' -e '" "
+16/1 "%_p" "\n"'
+000000 48 65 6C 6C 6F 0A Hello.
+
+
+How often does this happen?
+Reproducible.
+
+
+Additional Information:
+
+- The fact that the output contains line terminators of 0A, not 0D 0A, shows
+that Mono is intended to follow the customary conventions on Unix systems.
+
+- The customary convention on Unix systems is not to output a byte order
+mark on stdout or to files. See
+http://www.cl.cam.ac.uk/~mgk25/unicode.html, which says:
+ "One influential non-POSIX PC operating system vendor (whom we shall
+leave unnamed here) suggested that all Unicode files should start with the
+character ZERO WIDTH NOBREAK SPACE (U+FEFF), which is in this role
+also referred to as the "signature" or "byte-order mark (BOM)", in order to
+identify the encoding and byte-order used in a file. Linux/Unix does
+not use any BOMs and signatures. They would
+break far too many existing ASCII syntax conventions (such as scripts starting
+with #!). On POSIX systems, the selected locale identifies
+already the encoding expected in all input and output files of a process."
+
+- RFC 2279, which defines UTF-8, doesn't mention byte order marks,
+whereas RFC 2781, which defines UTF-16, does.
+This makes it clear that byte order marks are reserved for the
+UTF-16/UCS-2/UCS-4 encodings.
+ http://www.faqs.org/rfcs/rfc2279.html
+ http://www.faqs.org/rfcs/rfc2781.html
+
+- The fix may be simply to change the value of System.Text.Encoding.UTF8:
+just construct it with the default UTF8Encoding constructor.
+
+------- Additional Comments From bruno@clisp.org 2003-12-12 09:25 -------
+Created an attachment (id=6207)
+hello.cs (source code)
+
+
+------- Additional Comments From miguel@ximian.com 2004-01-06 12:41 -------
+A fix that went into CVS on Nov 14th was intended to address this,
+but for some reason it did not work on your setup. This has now been
+fixed in CVS.
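
For anyone re-checking the fix, the hexdump comparison in the report boils down to asking whether the first three bytes of stdout are EF BB BF. A minimal sketch of that check follows. The function name has_utf8_bom is hypothetical (it is not part of Mono or of the report), and it assumes `head -c` and `od` are available; the printf lines only simulate program output.

```shell
# has_utf8_bom: exit status 0 if stdin begins with the bytes EF BB BF
# (U+FEFF encoded as UTF-8), non-zero otherwise.
# Hypothetical helper; assumes `head -c` and `od` are available.
has_utf8_bom() {
    [ "$(head -c 3 | od -An -tx1 | tr -d ' \n')" = "efbbbf" ]
}

# Simulated program output, with and without the signature.
printf '\357\273\277Hello\n' | has_utf8_bom && echo "BOM present"
printf 'Hello\n' | has_utf8_bom || echo "no BOM"
```

With the setup from the report, `mono hello.mono.exe | has_utf8_bom` could stand in for the hexdump pipeline: it should succeed on an affected build and fail once the fix is in place.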