[Mono-dev] [PATCHES] Bug 480178 - System.Globalization.CharUnicodeInfo.GetUnicodeCategory() does not handle surrogate characters appropriately.

Sat May 15 11:09:18 EDT 2010

There are a few things to notice when extending unicode data.

MS have (or used to on 2.0) a lot of broken stuff when it comes to
unicode/locales.
So sometimes fixing for MS compatibility means breaking correctness.

On Sat, May 15, 2010 at 12:04 PM, Damien Diederen <dd at crosstwine.com> wrote:

>
> Hi Andreas,
>
> "Andreas Nahr" <ClassDevelopment at A-SoftTech.com> writes:
> > Hi,
> >
> > I'm the author (of a large part) of the current Char class and
> > unfortunately I think you overlooked some things with the changes.
> >
> > What you are doing is (heavily) degrading the performance of most of
> > the char methods (some of which are VERY commonly used) for supporting
> > a single (likely VERY rarely used case) in a single (likely not used
> > very much) method.
>
> Thanks for the feedback.  You are completely right: we were discussing
> access times to the packed data, and I got sidetracked into measuring
> that, mistakenly assuming these accesses would still be inlined in the
> interesting cases.
>
> I now had a look at the disassembly (better late than never), which is a
> bit sobering.  Thanks for the sanity check!
>
> > 1. Lets first start with performance. Unfortunately your micro
> > benchmark is completely omitting the most important speed factor: All
> > the methods were designed to be inlinable (because they tend to be
> > called in tight loops, as they are even within corelib). After your
> > change likely most/none of them would be inlinable anymore. Your
> > benchmark, however, does not CALL A SINGLE METHOD. It just measures
> > the overhead of two additional mathematical operators and a memory
> > access. That is not the big speed-sink. But even then your approach
> > looses 25% perf for no additional gain. The original implementation
> > was the result of (literally) thousands of performance tests.
>
> Heh.  Ironically, I introduced the testing method to *avoid* measuring
> function call overhead…
>
> My initial iteration (in Bugzilla) was trying to take advantage of the
> initial linear portion of the table, but the conditional function call
> would probably also result in the method not being inlineable.
>
> > 2. Gain vs. impact. I don't see much sense in changing lots of methods
> > in Char to something worse just to support one corner case that could
> > be handled separately.
>
> Another goal of my approach was to support multiple variants of the
> database while still sharing identical pages.  Maybe I should punt on
> that, either by only supporting v2.0–3.5 or by including multiple flat
> tables (which shouldn't be faulted in if not used).
>
> > 3. Imho the alternative is relatively simple: Just leave the Char
> > methods as they are and add the special casing only for the one
> > relevant method. You could just add another table (or two for bi-level
> > table compression or one that reuses the data from the first table) as
> > a separate entity. So you can leave all the original methods intact
> > and ONLY if somebody really uses that method and ONLY if he uses it to
> > query the astral plane using surrogates he actually has to pay the
> > price for it (in the runtime loading the relevant table into memory
> > and using that additional lookup).
>
> I just implemented such a solution, and quickly measured its performance
> with the attached microbenchmark (feel free to point me to a more
> representative test suite if you have one at hand).  On this (x86-64)
> computer, it (unsurprisingly) results in the same numbers as the
> unpached version:
>
> $ time mono -O=all CharMicrobenchs.exe --load /tmp/mono.txt --repeat 5000
> --run
>
>  - Original Mono: 10.5s;
>  - Bi-level table: 15.6s;
>  - Linear + bi-level table: 10.5s.
>
> (Run a few times; jitter is minimal.  Mono.txt is a lynx dump of
> http://de.wikipedia.org/wiki/Mono-Projekt.)
>
> This approach grows the category data table from 64 to ~70kB, and
> requires an additional 8kB index for the astral planes.  The resulting
> runtime produces the same results as Microsoft's v2.0.50727 and
> v3.5.21022 for the whole Unicode range, but has no support for v4.
>
> Would that be an acceptable overhead?
>
> Would supporting v4.0.30319 be worth it at an additional cost of ~78kB
> of data (which shouldn't be faulted in) in the Mono binary?
>
> Further comments are of course welcome.  I will prepare a new series of
> patches in the near future.
>
> > Happy hacking
> > Andreas
> >
> > P.S. And sorry for being somewhat harsh on the patches.
>
> Heh.  That's what I get for cutting corners ;)
>
> Cheers,
> Damien
>
> --
> http://crosstwine.com
> tel:  +49 21 89 29 39
> cell: +49 174 34 89 428
>
> “Strong Opinions, Weakly Held”
>                 — Bob Johansen
>
> _______________________________________________
> Mono-devel-list mailing list
> Mono-devel-list at lists.ximian.com
> http://lists.ximian.com/mailman/listinfo/mono-devel-list
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://lists.ximian.com/pipermail/mono-devel-list/attachments/20100515/95b1ead7/attachment-0001.html