[Mono-bugs] [Bug 480178] System.Globalization.CharUnicodeInfo.GetUnicodeCategory() does not handle surrogate characters appropriately.

bugzilla_noreply at novell.com bugzilla_noreply at novell.com
Fri May 14 16:56:28 EDT 2010



Damien Diederen <dd at crosstwine.com> changed:

           What    |Removed                     |Added
 Attachment #359241|0                           |1
        is obsolete|                            |

--- Comment #31 from Damien Diederen <dd at crosstwine.com> 2010-05-14 20:56:26 UTC ---
Created an attachment (id=362384)
 --> (http://bugzilla.novell.com/attachment.cgi?id=362384)
System.Char: Handle astral planes in GetUnicodeCategory(string,int)

If the string element at index starts a surrogate pair, we decode the full
codepoint and "query" the higher planes of the database.

This commit fixes #480178
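
For readers unfamiliar with surrogate pairs, the decoding step described
above can be sketched roughly as follows. This is an illustration in
Python, not the code from the attachment: the string is modelled as a
sequence of UTF-16 code units (as .NET strings are), the names
code_point_at and category_at are hypothetical, and the category comes
from Python's own unicodedata module rather than Mono's tables.

```python
import unicodedata
from typing import Sequence

HIGH_START, HIGH_END = 0xD800, 0xDBFF   # high (lead) surrogate range
LOW_START, LOW_END = 0xDC00, 0xDFFF     # low (trail) surrogate range


def code_point_at(units: Sequence[int], index: int) -> int:
    """Return the code point that starts at units[index].

    If units[index] is a high surrogate followed by a low surrogate,
    the pair is combined into a single astral-plane code point.
    Otherwise the code unit itself is returned (BMP character or
    unpaired surrogate).
    """
    unit = units[index]
    if HIGH_START <= unit <= HIGH_END and index + 1 < len(units):
        low = units[index + 1]
        if LOW_START <= low <= LOW_END:
            return 0x10000 + ((unit - HIGH_START) << 10) + (low - LOW_START)
    return unit


def category_at(units: Sequence[int], index: int) -> str:
    """Look up the general category of the code point at units[index].

    Uses Python's Unicode database as a stand-in for the runtime's own
    category table (an assumption made for this sketch only).
    """
    return unicodedata.category(chr(code_point_at(units, index)))
```

With this sketch, category_at([0xD801, 0xDC00], 0) resolves the pair to
U+10400 and reports an uppercase letter, whereas a lone 0xD801 is
reported as a surrogate.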


The updated Mono runtime has been verified to produce the same results
as Microsoft's; here are the MD5 sums of their Unicode category databases:

    eba45e00acdc82f9a08873465110aef4  v2.0.50727.dump
    eba45e00acdc82f9a08873465110aef4  v3.5.21022.dump
    56fd5c828fbb9083693835680667fd2c  v4.0.30319.dump

    eba45e00acdc82f9a08873465110aef4  gmcs.dump
    56fd5c828fbb9083693835680667fd2c  dmcs.dump

(Generated via create-category-table --dump, compiled and executed
under the relevant runtime.)
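
The general idea of this check (dump one category per codepoint, then
compare checksums) can be sketched as below. The dump layout shown here
is an assumption made purely for illustration; the actual output format
of create-category-table --dump is not described in this message, and
the functions category_dump and dump_md5 are hypothetical. The sums
produced by this sketch will therefore not match the ones listed above.

```python
import hashlib
import unicodedata


def category_dump(first: int = 0x0000, last: int = 0x10FFFF) -> bytes:
    """Return one line per codepoint: six hex digits, a space, the category.

    The line format is an assumption for this sketch, and the data comes
    from Python's unicodedata module, not from a .NET or Mono runtime.
    """
    lines = [
        f"{cp:06X} {unicodedata.category(chr(cp))}\n"
        for cp in range(first, last + 1)
    ]
    return "".join(lines).encode("ascii")


def dump_md5(dump: bytes) -> str:
    """Return the hexadecimal MD5 digest of a dump."""
    return hashlib.md5(dump).hexdigest()
```

Two runtimes agree on every category exactly when their dumps are
byte-identical, which is what equal MD5 sums indicate in practice.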


The simple data access pattern, suggested by Paolo Molaro, is fairly
efficient; here are timings observed on a simple loop fetching the
category code of every codepoint in "Range", repeated "Iterations"
times (Intel(R) Core(TM)2 Duo CPU P7350 @ 2.00GHz; best of three runs):

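    | Range       | Iterations | Linear table | 2.0+  | 4.0   |
    |-------------+------------+--------------+-------+-------|
    | 0000-00FF   |     256000 | 0.30s        | 0.35s | 0.37s |
    | 0000-FFFF   |      16000 | 4.75s        | 5.54s | 5.82s |
    | 0000-10FFFF |       1000 | N/A          | 5.63s | 6.19s |
    |-------------+------------+--------------+-------+-------|
    | Data size   |            | 64kB         | 30kB  | 48kB  |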
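
The shape of such a measurement loop might look like the sketch below.
It is not the benchmark program used for the table: the function names
are hypothetical, and it times Python's unicodedata lookup rather than
the Mono runtime, so its numbers are not comparable with the ones above.

```python
import time
import unicodedata


def time_lookups(first: int, last: int, iterations: int) -> float:
    """Time fetching the category of every codepoint in [first, last],
    repeated `iterations` times, and return the elapsed seconds."""
    category = unicodedata.category
    # Build the characters up front so only the lookups are timed.
    chars = [chr(cp) for cp in range(first, last + 1)]
    start = time.perf_counter()
    for _ in range(iterations):
        for ch in chars:
            category(ch)
    return time.perf_counter() - start


def best_of(runs: int, first: int, last: int, iterations: int) -> float:
    """Return the fastest of `runs` timings, as in 'best of three runs'."""
    return min(time_lookups(first, last, iterations) for _ in range(runs))
```

For instance, best_of(3, 0x0000, 0x00FF, 256000) would correspond to the
first row of the table.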

In the table above, 2.0+ denotes a mode compatible with versions
v2.0.50727...v3.5.21022 of Microsoft's framework, whereas 4.0 mimics
v4.0.30319.  The former is used by programs compiled by 'mcs', 'gmcs'
and 'smcs'; the latter by programs compiled by 'dmcs'.

(The difference in performance between these modes is probably due to
a change in memory access patterns: the 4.0 table shares "pages" with
the 2.0 one, causing accesses to be more spread out.)
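
To make the "shared pages" remark concrete, here is a rough sketch of a
two-level table in which several index arrays point into one common pool
of pages. The page size of 256 entries, the class PagePool and the
functions build_index and lookup are all assumptions for illustration;
the actual layout in the attachment may differ.

```python
from typing import Dict, List, Sequence, Tuple

PAGE_BITS = 8                  # assumed page size: 256 entries
PAGE_SIZE = 1 << PAGE_BITS
PAGE_MASK = PAGE_SIZE - 1


class PagePool:
    """A pool of category pages in which identical pages are stored once."""

    def __init__(self) -> None:
        self.pages: List[Tuple[int, ...]] = []
        self._ids: Dict[Tuple[int, ...], int] = {}

    def intern(self, page: Sequence[int]) -> int:
        """Return the id of `page`, adding it to the pool if it is new."""
        key = tuple(page)
        if key not in self._ids:
            self._ids[key] = len(self.pages)
            self.pages.append(key)
        return self._ids[key]


def build_index(pool: PagePool, values: Sequence[int]) -> List[int]:
    """Split a linear table into pages and return its page-id index."""
    if len(values) % PAGE_SIZE != 0:
        raise ValueError("table length must be a multiple of PAGE_SIZE")
    return [
        pool.intern(values[start:start + PAGE_SIZE])
        for start in range(0, len(values), PAGE_SIZE)
    ]


def lookup(pool: PagePool, index: Sequence[int], cp: int) -> int:
    """Fetch the value for codepoint `cp` through the two-level table."""
    return pool.pages[index[cp >> PAGE_BITS]][cp & PAGE_MASK]
```

Building the index for a second table against the same pool only adds
the pages that differ from the first one, which keeps the data small but
means the pages reached through the second index are no longer
contiguous in memory.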
