[Mono-dev] Unhandled Exception in Normalization.cs Combine()

Fri Jun 19 06:04:28 EDT 2009

Actually, I was wrong to "fix" the first "bug" you reported. It is
.NET that is buggy, though unlike older Mono it doesn't throw an
unhandled exception.

http://demo.icu-project.org/icu-bin/nbrowser?t=\u03B1\u0313\u0345&s=&uv=0

To examine the C# implementation, try the following:

	foreach (char c in "\u03B1\u0313\u0345".Normalize ())
		Console.Write ("{0:X04} ", (int) c);

.NET outputs: 03B1 0313 0345

I have a fix that corrects the output to: 1F80

I'll check in the fix soon. With the fix your test prints all "True".
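
For anyone who wants to compare all four forms side by side, here is a
minimal sketch (not part of the fix). It prints the UTF-16 code units of
each sample string from this thread under every NormalizationForm. The
ToHex helper is a hypothetical name introduced only for illustration, and
the top-level statements assume a C# 9 or later compiler; with an older
compiler, wrap the body in a Main method. Results depend on the runtime,
so compare them against the ICU pages linked in this thread rather than
treating any one runtime as the reference.

```csharp
using System;
using System.Linq;
using System.Text;

// Hypothetical helper (not from the thread): format a string as
// space-separated, 4-digit hex UTF-16 code units.
string ToHex (string s)
{
	return string.Join (" ", s.Select (c => ((int) c).ToString ("X4")));
}

// Sample inputs taken from the messages in this thread.
string[] inputs = { "\u03B1\u0313\u0345", "\u00e1bc", "\u0061\u0301bc" };

foreach (string input in inputs) {
	Console.WriteLine ("input: " + ToHex (input));
	// Enumerates FormC, FormD, FormKC and FormKD.
	foreach (NormalizationForm form in Enum.GetValues (typeof (NormalizationForm)))
		Console.WriteLine ("  {0,-6} {1}", form, ToHex (input.Normalize (form)));
}
```

Printing code units instead of the strings themselves avoids the display
problem mentioned elsewhere in this thread, where a terminal or editor
composes the characters before showing them.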

Atsushi Eno

Atsushi Eno wrote:
> Hi Tom, and Tom :)
> 
> I have tried the Hindle version of the test.
> 
> Summary: the sample depends on a .NET bug; 2 .NET bugs, 1 Mono bug.
> 
> This shows exactly that .NET normalization is buggy. Here are the
> ICU normalization results:
> http://demo.icu-project.org/icu-bin/nbrowser?t=\u00e1bc&s=&uv=0
> 
> i.e. in NFKD, \u00e1bc must be decomposed to \u0061\u0301\u0062\u0063,
> while .NET returns the same string as the input.
> 
> The sample code is confusing because it feeds each normalized
> "styleName" output into the next normalization as its input. .NET does
> not correctly decompose it to \u0061\u0301\u0062\u0063, while Mono does.
> When run on Mono, the sample keeps using the correct NFKD string as the
> input to the following normalizations, hence the difference in NFKC
> (i.e. we have no bug in normalizing an NFKC string, contrary to what
> the test claims).
> 
> I have created a modified version that makes this a bit more visible:
> http://pastebin.ca/1465907
> 
> There does seem to be a Mono bug in NFD-to-NFC and NFKD-to-NFKC
> composition, though. I have extracted a simpler test:
> 
> 	string s1 = "\u0061\u0301bc";
> 	string s2 = "\u00e1bc";
> 	Console.WriteLine (s1.Normalize () == s2);
> 
> *Both* Mono and .NET say "False", but it must be "True". See
> ICU conversion results:
> http://demo.icu-project.org/icu-bin/nbrowser?t=\u0061\u0301bc&s=&uv=0
> Its NFC must be \u00e1\u0062\u0063 (the string s2 above).
> 
> I'll work on fixing the composition part of the issue.
> 
> I haven't tried the Philpot version as I have never installed
> mbunit on this Windows machine - it'd be nicer if the sample just
> compiled and ran with the standard libs, to make it possible to
> integrate it into our nunit tests.
> 
> Atsushi Eno
> 
> 
> Tom Hindle wrote:
>> Attached is my small self-contained test case.
>> I think the output should be 5 trues.
>>
>> I am getting 2 Trues and 3 Fails on mono version r136435.
>>
>> Incidentally .NET returns 5 trues for this test case.
>>
>> Is there a Bugzilla entry for this issue?
>>
>>
>>
>> Also, normalization-tables.h now has Windows line endings (CRLF).
>>
>> Thanks
>> Tom
>>
>> On Thu, 2009-06-18 at 13:51 -0700, Tom Philpot wrote:
>>> Here is a revision of the test case I sent earlier to the list that
>>> doesn't rely on any specific encoding (only uses '\uXXXX' characters).
>>>
>>> Hopefully this will be helpful.
>>>
>>> Tom
>>>
>>>
>>> On 6/18/09 1:49 PM, "Tom Hindle" <tom_hindle at sil.org> wrote:
>>>
>>>> Hi Guys,
>>>>
>>>> With regard to the recent Normalization changes, I have just run our
>>>> test suite with recent mono r136422 and am getting a number of
>>>> regressions.
>>>>
>>>>
>>>> For example:
>>>>
>>>> {
>>>> string styleName = "\u00e1bc";
>>>> StStyle style = new StStyle();
>>>> Cache.LangProject.StylesOC.Add(style);
>>>> style.Name = styleName;
>>>>
>>>> FwStyleSheet.StyleInfoCollection styleCollection =
>>>>     new FwStyleSheet.StyleInfoCollection();
>>>> styleCollection.Add(new BaseStyleInfo(style));
>>>>
>>>> Assert.IsTrue(styleCollection.Contains(styleName.Normalize(NormalizationForm.FormC)));
>>>> Assert.IsTrue(styleCollection.Contains(styleName.Normalize(NormalizationForm.FormD)));
>>>> Assert.IsTrue(styleCollection.Contains(styleName.Normalize(NormalizationForm.FormKC)));
>>>> Assert.IsTrue(styleCollection.Contains(styleName.Normalize(NormalizationForm.FormKD)));
>>>> }
>>>>
>>>> is now failing, as well as other larger unit tests.
>>>>
>>>> I will look into this further to try and produce an example test
>>>> program that doesn't contain references to our code base.
>>>>
>>>> Thanks
>>>> Tom
>>>>
>>>> On Thu, 2009-06-18 at 15:01 +0900, Atsushi Eno wrote:
>>>>> Hi,
>>>>>
>>>>> If you mean the test cases in the previous email, those are what
>>>>> (as I said) include raw native encoding from your land (Latin1?),
>>>>> which I cannot read. Replace them all with ASCII representation
>>>>> (\uxxxx). Even if the attachment includes encoding (you mean BOMs?),
>>>>> it is not readable in some environments (like the text editor I use
>>>>> on Windows). Let me repeat, Latin1 is not universal. Don't depend on
>>>>> it (if you do).
>>>>>
>>>>> Atsushi Eno
>>>>>
>>>>>
>>>>> Tom Philpot wrote:
>>>>>> Atsushi,
>>>>>>
>>>>>> Thanks for the feedback. For some reason, the Mac always composes
>>>>>> Unicode strings before displaying them. I'll look at the test case
>>>>>> in corlib tomorrow when I get in to work. Would it be helpful for
>>>>>> the test cases if I gave you both the FormD bytes and the FormC
>>>>>> bytes that I think are correct for the test case I sent? Perhaps
>>>>>> the encoding did not come across in the attachment.
>>>>>>
>>>>>> We have a workaround for the Mac port of our app which would
>>>>>> require overriding string.Normalize to p/invoke to Mac OS X's
>>>>>> NSString library to do normalization. It would work, but we would
>>>>>> prefer not to have to ship a custom build of Mono. The
>>>>>> normalization on .NET appears to be "good enough" for our purposes
>>>>>> and we'd just like our Mac version to be consistent.
>>>>>>
>>>>>> Tom
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Atsushi Eno [mailto:atsushieno at veritas-vos-liberabit.com]
>>>>>> Sent: Wed 6/17/2009 7:51 PM
>>>>>> To: Tom Philpot
>>>>>> Cc: mono-devel-list at ximian.com
>>>>>> Subject: Re: [Mono-dev] Unhandled Exception in Normalization.cs Combine()
>>>>>>
>>>>>> You seem to have embedded raw native encoding in your land that
>>>>>> is *not* understandable in Japan. Anyway, the input string you
>>>>>> posted in the previous sample was already in FormC, so the
>>>>>> conversion will look like it is "doing nothing".
>>>>>>
>>>>>> There is a standalone normalization test, generated from the
>>>>>> normalization conformance test, in corlib/Mono.Globalization.Unicode.
>>>>>> We fail about 26000. Far from good, but still better than 35000
>>>>>> on .NET.
>>>>>>
>>>>>> Atsushi Eno
>>>>>>
>>>>>> Tom Philpot wrote:
>>>>>>> Now, string.Normalize(NormalizationForm.FormC) doesn't do anything
>>>>>>> using mono (r136228).
>>>>>>>
>>>>>>> I've attached some test cases which will hopefully help in tracking
>>>>>>> down what doesn't work.
>>>>>>>
>>>>>>> On 6/15/09 1:58 AM, "Atsushi Eno"
>>>>>>> <atsushieno at veritas-vos-liberabit.com> wrote:
>>>>>>>
>>>>>>>> Hi again,
>>>>>>>>
>>>>>>>> It should be now fixed in trunk.
>>>>>>>>
>>>>>>>> Atsushi Eno
>>>>>>>>
>>>>>>>> Atsushi Eno wrote:
>>>>>>>>> I'll have a look. However, since 4 years have passed since I
>>>>>>>>> wrote it, I'll have to revisit the spec, and it will take quite
>>>>>>>>> some time.
>>>>>>>>>
>>>>>>>>> Atsushi Eno
>>>>>>>>>
>>>>> _______________________________________________
>>>>> Mono-devel-list mailing list
>>>>> Mono-devel-list at lists.ximian.com
>>>>> http://lists.ximian.com/mailman/listinfo/mono-devel-list