[Mono-bugs] [Bug 77315][Nor] New - Invalid Unicode surrogate
handling
bugzilla-daemon at bugzilla.ximian.com
bugzilla-daemon at bugzilla.ximian.com
Fri Jan 20 07:13:58 EST 2006
Please do not reply to this email- if you want to comment on the bug, go to the
URL shown below and enter your comments there.
Changed by pawel.sakowski at mind-breeze.com.
http://bugzilla.ximian.com/show_bug.cgi?id=77315
--- shadow/77315 2006-01-20 07:13:58.000000000 -0500
+++ shadow/77315.tmp.11592 2006-01-20 07:13:58.000000000 -0500
@@ -0,0 +1,105 @@
+Bug#: 77315
+Product: Mono: Runtime
+Version: 1.1
+OS: GNU/Linux [Other]
+OS Details:
+Status: NEW
+Resolution:
+Severity:
+Priority: Normal
+Component: interop
+AssignedTo: mono-bugs at ximian.com
+ReportedBy: pawel.sakowski at mind-breeze.com
+QAContact: mono-bugs at ximian.com
+TargetMilestone: ---
+URL:
+Cc:
+Summary: Invalid Unicode surrogate handling
+
+Description of Problem:
+The internal string representation (UTF-16) allows for certain invalid
+strings to appear in the program, when the surrogate characters
+(U+D800-U+DFFF) don't pair correctly. I've identified three problems
+regarding unpaired surrogate handling:
+
+- UTF7Encoding fails to detect the situation when an input byte array
+includes encoding of unpaired surrogates (that is, encodes a valid UTF-16
+codepoint stream, but not a valid Unicode character stream)
+
+- UTF8Encoding decoder accepts when an input byte stream contains UTF-8
+encoding of surrogates character, e.g. when U+233B4 (in UTF16: D88C 8FB4)
+is encoded as a 3-byte encoding of U+D88C followed by a 3-byte encoding of
+U+8FB4. The correct encoding is a direct 4-byte encoding of the whole
+character. Rejecting such overlong encodings is a MUST according to RFC 3629.
+
+- The most serious problem: if a string with unpaired surrogates (e.g.
+obtained from the above decoders) is subject to marshalling into native
+const char*, such invalid string gets rejected by g_utf16_to_utf8. In that
+case the callee receives NULL instead of the string. This causes a
+segmentation fault in many native functions, which do not expect their
+string arguments to be NULL. The caller is unable to make the check on his
+own, since the original managed string was essentialy !=null. Besides, a
+g_warning appears cluttering the console, which IMO should be reserved for
+internal runtime errors, and not occur if an invalid argument is being
+passed to a function.
+
+Steps to reproduce the problem:
+using System;
+using System.Runtime.InteropServices;
+using System.Text;
+class TestEncodings {
+static void Main() {
+ try {
+ // standalone U+D800
+ string str = new
+UTF7Encoding().GetString(Encoding.ASCII.GetBytes("+2AA-"));
+ // whether erronous input is ignored or rejected, an
+invalid UTF-16 shouldn't be output
+ if (!str.Equals("\uD800")) throw new ArgumentException();
+ Console.WriteLine("UTF7 surrogates: BAD");
+ } catch (ArgumentException) {
+ Console.WriteLine("UTF7 surrogates: ok");
+ }
+ try {
+ //standalone U+D88C
+ new UTF8Encoding(false,true).GetString(new byte[] {0xED,
+0xA2, 0x8C});
+ Console.WriteLine("UTF8 surrogates: BAD");
+ } catch (ArgumentException) {
+ Console.WriteLine("UTF8 surrogates: ok");
+ }
+ StringBuilder builder = new StringBuilder(128);
+ try {
+ sprintf(builder, "%s", "\uD88C");
+ if (!builder.ToString().Equals("(null)")) throw new
+Exception();
+ Console.WriteLine("Marshal nulling: BAD");
+ } catch {
+ Console.WriteLine("Marshal nulling: ok");
+ }
+}
+[DllImport("libc")] static extern int sprintf(StringBuilder buf, string
+format, string arg);
+}
+
+Actual Results:
+UTF7 surrogates: BAD
+UTF8 surrogates: BAD
+
+** (TestEncodings.exe:3902): WARNING **: Partial character sequence at end
+of input
+Marshal nulling: BAD
+
+
+Expected Results:
+UTF7 surrogates: ok
+UTF8 surrogates: ok
+Marshal nulling: ok
+
+
+How often does this happen?
+always
+
+Additional information:
+Reporting these related issues as one bug report, for the component
+matching that issue which has the biggest impact.
More information about the mono-bugs
mailing list