[Mono-bugs] [Bug 77315][Nor] New - Invalid Unicode surrogate handling

Fri Jan 20 07:13:58 EST 2006

Please do not reply to this email- if you want to comment on the bug, go to the
URL shown below and enter your comments there.

Changed by pawel.sakowski at mind-breeze.com.

http://bugzilla.ximian.com/show_bug.cgi?id=77315

--- shadow/77315	2006-01-20 07:13:58.000000000 -0500
+++ shadow/77315.tmp.11592	2006-01-20 07:13:58.000000000 -0500
@@ -0,0 +1,105 @@
+Bug#: 77315
+Product: Mono: Runtime
+Version: 1.1
+OS: GNU/Linux [Other]
+OS Details: 
+Status: NEW   
+Resolution: 
+Severity: 
+Priority: Normal
+Component: interop
+AssignedTo: mono-bugs at ximian.com                            
+ReportedBy: pawel.sakowski at mind-breeze.com               
+QAContact: mono-bugs at ximian.com
+TargetMilestone: ---
+URL: 
+Cc: 
+Summary: Invalid Unicode surrogate handling
+
+Description of Problem:
+The internal string representation (UTF-16) allows for certain invalid
+strings to appear in the program, when the surrogate characters
+(U+D800-U+DFFF) don't pair correctly. I've identified three problems
+regarding unpaired surrogate handling:
+
+- UTF7Encoding fails to detect the situation when an input byte array
+includes an encoding of unpaired surrogates (that is, it encodes a valid
+sequence of UTF-16 code units, but not a valid Unicode character stream).
+
+- The UTF8Encoding decoder accepts input byte streams that contain UTF-8
+encodings of surrogate code points, e.g. when U+233B4 (in UTF-16: D84C DFB4)
+is encoded as a 3-byte encoding of U+D84C followed by a 3-byte encoding of
+U+DFB4. The correct encoding is a direct 4-byte encoding of the whole
+character. RFC 3629 prohibits encoding surrogates, so these MUST be rejected.
+
+- The most serious problem: if a string with unpaired surrogates (e.g.
+obtained from the above decoders) is marshalled into a native const char*,
+the invalid string is rejected by g_utf16_to_utf8. In that case the callee
+receives NULL instead of the string. This causes a segmentation fault in
+many native functions, which do not expect their string arguments to be
+NULL. The caller cannot catch this with a null check of its own, since the
+original managed string was not null. Besides, a g_warning is printed,
+cluttering the console; IMO warnings should be reserved for internal
+runtime errors and should not occur merely because an invalid argument is
+passed to a function.
+
+Steps to reproduce the problem:
+using System;
+using System.Runtime.InteropServices;
+using System.Text;
+class TestEncodings {
+static void Main() {
+        try {
+                // standalone U+D800
+                string str = new UTF7Encoding()
+                        .GetString(Encoding.ASCII.GetBytes("+2AA-"));
+                // whether erroneous input is ignored or rejected,
+                // invalid UTF-16 shouldn't be output
+                if (!str.Equals("\uD800")) throw new ArgumentException();
+                Console.WriteLine("UTF7 surrogates: BAD");
+        } catch (ArgumentException) {
+                Console.WriteLine("UTF7 surrogates: ok");
+        }
+        try {
+                //standalone U+D88C
+                new UTF8Encoding(false, true)
+                        .GetString(new byte[] {0xED, 0xA2, 0x8C});
+                Console.WriteLine("UTF8 surrogates: BAD");
+        } catch (ArgumentException) {
+                Console.WriteLine("UTF8 surrogates: ok");
+        }
+        StringBuilder builder = new StringBuilder(128);
+        try {
+                sprintf(builder, "%s", "\uD88C");
+                if (!builder.ToString().Equals("(null)"))
+                        throw new Exception();
+                Console.WriteLine("Marshal nulling: BAD");
+        } catch {
+                Console.WriteLine("Marshal nulling: ok");
+        }
+}
+[DllImport("libc")]
+static extern int sprintf(StringBuilder buf, string format, string arg);
+}
+
+Actual Results:
+UTF7 surrogates: BAD
+UTF8 surrogates: BAD
+
+** (TestEncodings.exe:3902): WARNING **: Partial character sequence at end
+of input
+Marshal nulling: BAD
+
+
+Expected Results:
+UTF7 surrogates: ok
+UTF8 surrogates: ok
+Marshal nulling: ok
+
+
+How often does this happen? 
+always
+
+Additional information:
+These related issues are reported as a single bug, filed under the
+component of the issue with the biggest impact.
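
For illustration only, here is a minimal sketch of the "unpaired surrogate"
condition that all three problems in the report revolve around. It is not
Mono's implementation and not a proposed patch; the class and method names
(SurrogateCheck, FirstUnpairedSurrogate, ToSurrogatePair) are hypothetical.
It uses plain range comparisons so it does not depend on any particular
class library version.

```csharp
using System;

static class SurrogateCheck
{
    // Hypothetical helper: returns the index of the first unpaired
    // surrogate in s, or -1 if every surrogate belongs to a well-formed
    // high/low pair.
    public static int FirstUnpairedSurrogate(string s)
    {
        for (int i = 0; i < s.Length; i++)
        {
            char c = s[i];
            if (c >= 0xD800 && c <= 0xDBFF)
            {
                // A high surrogate must be followed by a low surrogate.
                if (i + 1 < s.Length && s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF)
                    i++; // skip the low half of the valid pair
                else
                    return i;
            }
            else if (c >= 0xDC00 && c <= 0xDFFF)
            {
                // A low surrogate with no preceding high surrogate.
                return i;
            }
        }
        return -1;
    }

    // Hypothetical helper: splits a supplementary code point
    // (U+10000..U+10FFFF) into its UTF-16 surrogate pair.
    public static char[] ToSurrogatePair(int codePoint)
    {
        if (codePoint < 0x10000 || codePoint > 0x10FFFF)
            throw new ArgumentOutOfRangeException("codePoint");
        int v = codePoint - 0x10000;
        return new char[] {
            (char)(0xD800 + (v >> 10)),   // high (lead) surrogate
            (char)(0xDC00 + (v & 0x3FF))  // low (trail) surrogate
        };
    }
}
```

With these helpers, FirstUnpairedSurrogate("\uD88C") returns 0, and
ToSurrogatePair(0x233B4) yields D84C DFB4. A caller could run such a check
on a string before passing it to a P/Invoke function as a workaround, but
that is an assumption about usage rather than something the runtime
provides.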