[Mono-bugs] [Bug 52101][Maj] New - UTF-8 encoded byte order mark is output to stdout

bugzilla-daemon@bugzilla.ximian.com
Tue, 6 Jan 2004 12:41:54 -0500 (EST)

Please do not reply to this email; if you want to comment on the bug, go to the
URL shown below and enter your comments there.

Changed by miguel@ximian.com.


--- shadow/52101	2004-01-06 12:41:54.000000000 -0500
+++ shadow/52101.tmp.22886	2004-01-06 12:41:54.000000000 -0500
@@ -0,0 +1,89 @@
+Bug#: 52101
+Product: Mono/Class Libraries
+Version: unspecified
+OS: SUSE 9.0
+OS Details: 
+Status: RESOLVED   
+Resolution: FIXED
+Severity: Unknown
+Priority: Major
+Component: System
+AssignedTo: mono-bugs@ximian.com                            
+ReportedBy: bruno@clisp.org               
+QAContact: mono-bugs@ximian.com
+TargetMilestone: ---
+Summary: UTF-8 encoded byte order mark is output to stdout
+Description of Problem: 
+A UTF-8 encoded ZWNBSP (zero-width no-break space, U+FEFF) is output when
+"mono" produces console output in a UTF-8 locale. Mentioning this behaviour
+in the FAQ doesn't change the fact that it's a bug.
+Steps to reproduce the problem: 
+Install mono-0.28. 
+$ export LANG=de_DE.UTF-8 
+$ env | grep '^\(LANG\|LC_\)' 
+$ mcs hello.cs -o hello.mono.exe 
+$ mono hello.mono.exe | hexdump -e '"%06.6_ax  " 16/1 "%02X "' -e '"  " 16/1 "%_p" "\n"'
+Actual Results: 
+$ mono hello.mono.exe | hexdump -e '"%06.6_ax  " 16/1 "%02X "' -e '"  " 16/1 "%_p" "\n"'
+000000  EF BB BF 48 65 6C 6C 6F 0A                       ...Hello. 
+Expected Results: 
+$ mono hello.mono.exe | hexdump -e '"%06.6_ax  " 16/1 "%02X "' -e '"  " 16/1 "%_p" "\n"'
+000000  48 65 6C 6C 6F 0A                                Hello. 
+How often does this happen?  
+Additional Information: 
+- The fact that the output contains line terminators of 0A, not 0D 0A, shows
+that Mono is intended to follow the usual conventions on Unix systems.
+- The usual convention on Unix systems is to not output a byte order mark
+on stdout or to files. See
+http://www.cl.cam.ac.uk/~mgk25/unicode.html; this says:
+  "One influential non-POSIX PC operating system vendor (whom we shall 
+leave unnamed here) suggested that all Unicode files should start with the 
+character ZERO WIDTH NOBREAK SPACE (U+FEFF), which is in this role 
+also referred to as the "signature" or "byte-order mark (BOM)", in order to 
+identify the encoding and byte-order used in a file. Linux/Unix does 
+not use any BOMs and signatures. They would
+break far too many existing ASCII syntax conventions (such as scripts starting
+with #!). On POSIX systems, the selected locale identifies
+already the encoding expected in all input and output files of a process." 
+- RFC 2279, which defines UTF-8, doesn't talk about byte order marks,
+whereas RFC 2781, which defines UTF-16, does.
+This makes it clear that byte order marks are reserved for the
+UTF-16/UCS-2/UCS-4 encodings.
+  http://www.faqs.org/rfcs/rfc2279.html 
+  http://www.faqs.org/rfcs/rfc2781.html 
+- The fix may be to simply change the value of the variable
+System.Text.Encoding.UTF8: just invoke the default constructor of UTF8Encoding.
+------- Additional Comments From bruno@clisp.org  2003-12-12 09:25 -------
+Created an attachment (id=6207)
+hello.cs (source code)
+------- Additional Comments From miguel@ximian.com  2004-01-06 12:41 -------
+A fix that went into CVS on Nov 14th was intended to address this,
+but for some reason it did not work on your setup.  This has now been
+fixed in CVS.
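
The hexdump comparison in the report (EF BB BF before "Hello") can also be
checked mechanically. The following is only a minimal sketch, not part of the
original report or of Mono: has_utf8_bom is a hypothetical helper name, and
the printf lines stand in for the output of "mono hello.mono.exe" so that the
snippet runs without Mono installed.

```shell
#!/bin/sh
# has_utf8_bom: hypothetical helper (an assumption, not from the bug report).
# Reads stdin and succeeds (exit status 0) only if the first three bytes
# are EF BB BF, the UTF-8 encoding of U+FEFF.
has_utf8_bom() {
    # od prints at most 3 bytes as hex; tr strips the separators.
    [ "$(od -An -tx1 -N 3 | tr -d ' \t\n')" = "efbbbf" ]
}

# Simulated "actual" output from the report: BOM followed by Hello.
printf '\357\273\277Hello\n' | has_utf8_bom && echo "BOM present"

# Simulated "expected" output from the report: Hello only.
printf 'Hello\n' | has_utf8_bom || echo "no BOM"
```

With Mono installed, the same check could presumably be applied to the real
program as "mono hello.mono.exe | has_utf8_bom".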