[Mono-dev] [PATCHES] Bug 480178 - System.Globalization.CharUnicodeInfo.GetUnicodeCategory() does not handle surrogate characters appropriately.

Damien Diederen dd at crosstwine.com
Sat May 15 11:04:58 EDT 2010


Hi Andreas,

"Andreas Nahr" <ClassDevelopment at A-SoftTech.com> writes:
> Hi,
>
> I'm the author (of a large part) of the current Char class and
> unfortunately I think you overlooked some things with the changes.
>
> What you are doing is (heavily) degrading the performance of most of
> the char methods (some of which are VERY commonly used) for supporting
> a single (likely VERY rarely used case) in a single (likely not used
> very much) method.

Thanks for the feedback.  You are completely right: we were discussing
access times to the packed data, and I got sidetracked into measuring
that, mistakenly assuming these accesses would still be inlined in the
interesting cases.

I now had a look at the disassembly (better late than never), which is a
bit sobering.  Thanks for the sanity check!

> 1. Lets first start with performance. Unfortunately your micro
> benchmark is completely omitting the most important speed factor: All
> the methods were designed to be inlinable (because they tend to be
> called in tight loops, as they are even within corelib). After your
> change likely most/none of them would be inlinable anymore. Your
> benchmark, however, does not CALL A SINGLE METHOD. It just measures
> the overhead of two additional mathematical operators and a memory
> access. That is not the big speed-sink. But even then your approach
> looses 25% perf for no additional gain. The original implementation
> was the result of (literally) thousands of performance tests.

Heh.  Ironically, I introduced the testing method to *avoid* measuring
function call overhead…

My initial iteration (in Bugzilla) was trying to take advantage of the
initial linear portion of the table, but the conditional function call
would probably also result in the method not being inlineable.

> 2. Gain vs. impact. I don't see much sense in changing lots of methods
> in Char to something worse just to support one corner case that could
> be handled separately.

Another goal of my approach was to support multiple variants of the
database while still sharing identical pages.  Maybe I should punt on
that, either by only supporting v2.0–3.5 or by including multiple flat
tables (which shouldn't be faulted in if not used).

> 3. Imho the alternative is relatively simple: Just leave the Char
> methods as they are and add the special casing only for the one
> relevant method. You could just add another table (or two for bi-level
> table compression or one that reuses the data from the first table) as
> a separate entity. So you can leave all the original methods intact
> and ONLY if somebody really uses that method and ONLY if he uses it to
> query the astral plane using surrogates he actually has to pay the
> price for it (in the runtime loading the relevant table into memory
> and using that additional lookup).

I just implemented such a solution, and quickly measured its performance
with the attached microbenchmark (feel free to point me to a more
representative test suite if you have one at hand).  On this (x86-64)
computer, it (unsurprisingly) results in the same numbers as the
unpached version:

$ time mono -O=all CharMicrobenchs.exe --load /tmp/mono.txt --repeat 5000 --run

  - Original Mono: 10.5s;
  - Bi-level table: 15.6s;
  - Linear + bi-level table: 10.5s.

(Run a few times; jitter is minimal.  Mono.txt is a lynx dump of
http://de.wikipedia.org/wiki/Mono-Projekt.)

This approach grows the category data table from 64 to ~70kB, and
requires an additional 8kB index for the astral planes.  The resulting
runtime produces the same results as Microsoft's v2.0.50727 and
v3.5.21022 for the whole Unicode range, but has no support for v4.

Would that be an acceptable overhead?

Would supporting v4.0.30319 be worth it at an additional cost of ~78kB
of data (which shouldn't be faulted in) in the Mono binary?

Further comments are of course welcome.  I will prepare a new series of
patches in the near future.

> Happy hacking
> Andreas
>
> P.S. And sorry for being somewhat harsh on the patches.

Heh.  That's what I get for cutting corners ;)

Cheers,
Damien
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: CharMicrobenchs.cs
Url: http://lists.ximian.com/pipermail/mono-devel-list/attachments/20100515/edc4b02a/attachment.pl 
-------------- next part --------------
-- 
http://crosstwine.com
tel:  +49 21 89 29 39
cell: +49 174 34 89 428

“Strong Opinions, Weakly Held”
                 — Bob Johansen


More information about the Mono-devel-list mailing list