[Mono-bugs] [Bug 67810][Nor] New - Probably a Conversion bug - needs further investigation ?

bugzilla-daemon@bugzilla.ximian.com bugzilla-daemon@bugzilla.ximian.com
Fri, 8 Oct 2004 03:43:35 -0400 (EDT)


Please do not reply to this email- if you want to comment on the bug, go to the
URL shown below and enter your comments there.

Changed by kjambunathan@novell.com.

http://bugzilla.ximian.com/show_bug.cgi?id=67810

--- shadow/67810	2004-10-08 03:43:35.000000000 -0400
+++ shadow/67810.tmp.24893	2004-10-08 03:43:35.000000000 -0400
@@ -0,0 +1,262 @@
+Bug#: 67810
+Product: Mono: Class Libraries
+Version: unspecified
+OS: unknown
+OS Details: 
+Status: NEW   
+Resolution: 
+Severity: 
+Priority: Normal
+Component: VB Runtime
+AssignedTo: banirban@novell.com                            
+ReportedBy: kjambunathan@novell.com               
+QAContact: mono-bugs@ximian.com
+TargetMilestone: ---
+URL: 
+Cc: 
+Summary: Probably a Conversion bug - needs further investigation ?
+
+This conversion issue needs to be addressed  and I am logging a defect for
+later perusal. The defect is reported by Richard.Polton@morganstanley.com 
+
+The converstaion begins with the following posting:
+http://lists.ximian.com/archives/public/mono-list/2004-September/023382.html
+
+and it continues on in October 2004 thread as well.
+
+This is the converstaion log for easy reference:
+
+>>> Richard Polton <Richard.Polton@morganstanley.com> 09/24/04 4:04 PM
+>>>
+Hi,
+
+I was browsing the source code for Conversions.cs (or similar) in the
+VisualBasic (I think) package in 1.0.1 and came across a couple of
+oddities. The one which struck me in particular was the conversion of a
+
+single character into a numeric type. It was performed by subtracting
+the absolute character '0' from the function parameter. This will only
+
+work if the input character set is known to be ASCII or known to be
+ordered in the same way? So, my question is, is it a requirement of all
+
+.NET (and Mono) applications that the character set used is ASCII or
+similar? I know the spec refers to the program as being encoded in
+Unicode but I couldn't find anything regarding the recognisable input
+chars.
+
+Richard
+
+-----Original Message-----
+Hey Richard
+
+This is definitely odd. 
+
+Could you point us to the offending code or better still could you
+please submit a test report in bugzilla with a test case.
+
+Regards,
+Jambunathan  K.
+
+-----Original Message-----
+
+Hi,
+
+The code is in
+
+mcs-1.0.1/class/Microsoft.VisualBasic/Microsoft.VisualBasic/Conversion.c
+s
+
+at line 687
+
+The particular piece I noticed is
+
+return Expression - '0';
+
+In fact, habing given it further thought, I have a couple of questions:
+i) if I sit at a Japanese terminal (for example) and enter '-', i.e.
+ichi or 'one', is this a valid Unicode character?
+ii) how wide is the 'char' datatype? I assume it contains Unicode rather
+than single-byte ASCII.
+iii) if entering 'ichi' is valid, and char contains Unicode, then I
+suspect that the below subtration will return a number substantially
+greater than one.
+
+- Richard
+
+-----Original Message-----
+
+On Tue, 2004-10-05 at 04:34, Polton, Richard (IT) wrote:
+>  In fact, habing given it further thought, I have a couple of questions:
+> 
+> i) if I sit at a Japanese terminal (for example) and enter '-', i.e.
+> ichi or 'one', is this a valid Unicode character?
+
+Yes.
+
+> ii) how wide is the 'char' datatype? I assume it contains Unicode rather
+> than single-byte ASCII.
+
+16-bit unsigned value.  It supports Unicode.
+
+> iii) if entering 'ichi' is valid, and char contains Unicode, then I
+> suspect that the below subtration will return a number substantially
+> greater than one.
+
+No.  At least, not if it's remotely like CVS HEAD:
+
+	public static int Val (char Expression) {
+		if (char.IsDigit(Expression)) {
+			return Expression - '0';
+		}
+		else {
+			throw new ArgumentException();
+		}
+	}
+
+Ichi isn't a digit, so it will generate an ArgumentException.
+
+(Assuming that Ichi is Unicode U+4E00, which certainly looks like '-'. 
+It's in the Unicode category "Letter, Other".)
+
+The subtraction should be safe, as (1) it's only done on digits, and (2)
+Unicode follows the ASCII character ordering (for glyphs 0-127), which
+permits this subtraction.
+
+ - Jon
+
+-----Original Message-----
+
+Thanks for this. Is it fair to say, then, that only Arabic numerals are
+counted as digits?  Even though other numeric characters have integer
+values?
+
+-Richard
+
+-----Original Message-----
+
+A quick perusal through Perl's "Category.pl" shows this:
+
+(1) Numbers are categorized as "Nd"
+(2) The only ranges that are "Nd" seem to be:
+
+	0030 - 0039	'0' - '9'
+	0660 - 0669	ARABIC-INDIC DIGIT 0 - 9 (same order as ASCII)
+	06F0 - 06F9	EXTENDED ARABIC-INDIC DIGIT 0-9 ("")
+	0966 - 096F	DEVANAGRAI DIGIT 0-9
+	09E6 - 09EF	BENGALI DIGIT 0-9
+	0A66 - 0A6F
+	0AE6 - 0AEF
+	0B66 - 0B6F
+	0BE7 - 0BEF
+	0C66 - 0C6F
+	0CE6 - 0CEF
+	0D66 - 0D6F
+	0E50 - 0E59
+	0ED0 - 0ED9
+	0F20 - 0F29
+	... Plus 8 more...
+
+I'm too lazy to look at all of these ranges, but the ones I did look at
+all had digits in the order 0..9.  The subtraction should be legal for
+all of these glyphs.  (Which is probably by design; it would be very odd
+-- broken? -- to have so many digits in the "right" order, and then have
+a few in a different order...)
+
+Gnome's Character Map program (gucharmap) is very handy for looking up
+the Unicode Category a character belongs to.  Too bad the opposite
+direction (Unicode Category -> characters) tends to be more difficult
+(hence consulting Perl's internal tables).
+
+ - Jon
+
+-----Original Message-----
+
+Thanks for this, although this begs another question :-)
+
+If the char which is to be converted is 0661, say, then what will be the
+value of the subtraction? Will it be 0661 - 0660 or will it be 0661 -
+0030? I assume that a literal '0' will always map to 0030 rather than
+cleverly detect the range of digits that the char belongs to.
+
+Richard=20
+
+
+
+-----Original Message-----
+
+On Wed, 2004-10-06 at 03:59, Polton, Richard (IT) wrote:
+> If the char which is to be converted is 0661, say, then what will be the
+> value of the subtraction? Will it be 0661 - 0660 or will it be 0661 -
+> 0030? I assume that a literal '0' will always map to 0030 rather than
+> cleverly detect the range of digits that the char belongs to.
+
+Oh.  Good point.  (Why didn't I think of that?)  The literal '0' is
+mapped to 0030, so  you'd get U+0661 - U+0030, which is *way* too big.
+
+So I guess the code is broken.  The question is, in what way? :-/
+
+Now the question is: what does Microsoft's implementation do? :-)
+
+Someone will have to throw U+0661 at Microsoft's
+Microsoft.VisualBasic.dll and see what the return value (or exception
+generated) is.  They may require a value between '0' and '9', and all
+other "Nd" digits, such as U+0661, generate exceptions.
+
+Alternatively, Microsoft always subtracts from the proper value.
+
+We can do either of these, we just need to know which to do.
+
+ - Jon
+
+-----Original Message-----
+
+At 06:57 AM 06/10/2004 -0400, Jonathan Pryor wrote:
+>On Wed, 2004-10-06 at 03:59, Polton, Richard (IT) wrote:
+>> If the char which is to be converted is 0661, say, then what will be the
+>> value of the subtraction? Will it be 0661 - 0660 or will it be 0661 -
+>> 0030? I assume that a literal '0' will always map to 0030 rather than
+>> cleverly detect the range of digits that the char belongs to.
+>
+>Oh.  Good point.  (Why didn't I think of that?)  The literal '0' is
+>mapped to 0030, so  you'd get U+0661 - U+0030, which is *way* too big.
+>
+>So I guess the code is broken.  The question is, in what way? :-/
+>
+>Now the question is: what does Microsoft's implementation do? :-)
+>
+>Someone will have to throw U+0661 at Microsoft's
+>Microsoft.VisualBasic.dll and see what the return value (or exception
+>generated) is.  They may require a value between '0' and '9', and all
+>other "Nd" digits, such as U+0661, generate exceptions.
+>
+>Alternatively, Microsoft always subtracts from the proper value.
+>
+>We can do either of these, we just need to know which to do.
+
+I just wrote a simple test program using some of the ranges listed two
+posts back. I threw "375", "\u0663\u0667\u0665", "\u09E9\u09ED\u09EB", and,
+for kicks, the native Japanese representation, 5 kanji long,
+"\u4E09\u767E\u4E03\u5341\u4E94" (sambyaku nanajuu go). Here are the results:
+
+int.Parse(..):
+  Arabic: 375
+  Arabic-Indic: ERROR: FormatException
+  Bengali: ERROR: FormatException
+  Japanese: ERROR: FormatException
+VB.NET's Val(..):
+  Arabic: 375
+  Arabic-Indic: 0
+  Bengali: 0
+  Japanese: 0
+
+When I concatenate the arabic-indic script to the arabic script (yielding
+the string "375\u0663\u0667\u0665"), VB's Val() function returns "375". In
+other words, int.Parse() should throw when it gets something that isn't in
+['0', '9'] (or relevant punctuation), and
+Microsoft.VisualBasic.Conversion.Val() should stop parsing when it reaches
+the first such character.
+
+Hope this helps :-)
+
+Jonathan Gilbert