[Mono-bugs] [Bug 480178] System.Globalization.CharUnicodeInfo.GetUnicodeCategory() does not handle surrogate characters appropriately.

bugzilla_noreply at novell.com
Mon May 17 14:17:07 EDT 2010



Damien Diederen <dd at crosstwine.com> changed:

           What    |Removed                     |Added
 Attachment #362383|0                           |1
        is obsolete|                            |
 Attachment #362384|0                           |1
        is obsolete|                            |

--- Comment #38 from Damien Diederen <dd at crosstwine.com> 2010-05-17 18:17:06 UTC ---
Created an attachment (id=362713)
 --> (http://bugzilla.novell.com/attachment.cgi?id=362713)
System.Char: Handle astral planes in GetUnicodeCategory(string,int)

If the string element at index starts a surrogate pair, we decode the
full codepoint and "query" the higher planes of the database.
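
As a rough illustration of the decoding step described above (a
minimal sketch, not the code from the attached patch; the class
AstralCategory and the helper CodePointAt are hypothetical names),
a surrogate pair can be combined into a full code point like this:

```csharp
using System;
using System.Globalization;

static class AstralCategory
{
    // Hypothetical helper, not taken from the patch.
    // Returns the code point that starts at s[index], combining a
    // well-formed surrogate pair into a single value.
    public static int CodePointAt(string s, int index)
    {
        char first = s[index];
        if (char.IsHighSurrogate(first) && index + 1 < s.Length)
        {
            char second = s[index + 1];
            if (char.IsLowSurrogate(second))
            {
                // Standard UTF-16 decoding: 10 bits from each half,
                // offset by 0x10000 to land in planes 1..16.
                return 0x10000 + ((first - 0xD800) << 10) + (second - 0xDC00);
            }
        }
        // BMP character or unpaired surrogate: the code unit itself.
        return first;
    }

    static void Main()
    {
        // U+1D538 encoded as the surrogate pair D835 DD38.
        string s = "\uD835\uDD38";
        Console.WriteLine("U+{0:X}", CodePointAt(s, 0));
        // The overload this commit changes; a runtime without the fix
        // would be expected to report Surrogate here.
        Console.WriteLine(char.GetUnicodeCategory(s, 0));
    }
}
```

The decoded value is what would then be used to query the database
entries for the higher planes.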

This commit fixes bug 480178.

CAUTION: This commit depends on the following runtime change:

  System.Char icall: New Unicode category tables compatible with MS
  .NET v2.0.50727 and v3.5.21022.

Without it, Mono will suffer a low-level (internal call) crash when
initializing System.Char.


The updated Mono runtime has been verified to produce the same results
as Microsoft's; here are the MD5 sums of their Unicode category
database dumps (generated via create-category-table --dump, compiled
and executed under the relevant runtime):

    eba45e00acdc82f9a08873465110aef4  v2.0.50727.dump
    eba45e00acdc82f9a08873465110aef4  v3.5.21022.dump

    eba45e00acdc82f9a08873465110aef4  gmcs.dump

Note that this differs from the results produced by Mono (even
for the BMP) prior to the introduction of these changes, and also
differs from the results produced by Microsoft's recently-released
v4.0.30319 runtime:

    56fd5c828fbb9083693835680667fd2c  v4.0.30319.dump

Other versions of the internal database can be easily generated using
create-category-table(.cs), but this currently requires a rebuild.
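
For readers without access to create-category-table, a comparison in
the same spirit can be approximated as follows. This is only a sketch
under stated assumptions: the one-byte-per-code-point layout and the
class name CategoryDump are made up for illustration, so the digest it
prints will not match the sums listed above; it is only meaningful
when the same program is run under two runtimes and the outputs are
compared.

```csharp
using System;
using System.Globalization;
using System.Security.Cryptography;

static class CategoryDump
{
    // One byte per code point in U+0000..U+10FFFF. This layout is an
    // assumption for illustration; it is not the --dump format.
    public static byte[] Dump()
    {
        byte[] table = new byte[0x110000];
        for (int cp = 0; cp < table.Length; cp++)
        {
            UnicodeCategory cat;
            if (cp < 0x10000)
            {
                // BMP, including lone surrogates, goes through the
                // char overload.
                cat = char.GetUnicodeCategory((char)cp);
            }
            else
            {
                // Astral code points must be passed as a surrogate
                // pair, so they exercise the (string, int) overload.
                cat = char.GetUnicodeCategory(char.ConvertFromUtf32(cp), 0);
            }
            table[cp] = (byte)cat;
        }
        return table;
    }

    static void Main()
    {
        byte[] hash;
        using (MD5 md5 = MD5.Create())
            hash = md5.ComputeHash(Dump());
        // Print the digest as lowercase hex, md5sum style.
        Console.WriteLine(BitConverter.ToString(hash).Replace("-", "").ToLowerInvariant());
    }
}
```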


Direct array indexing is mandatory for code points in the
U+0000..U+FFFF range; as pointed out by Andreas Nahr, performing
bi-level lookups in the Char.Is*(char) predicates causes the JIT to
stop inlining them and results in a significant performance drop.

The simple data access pattern used for higher planes, suggested by
Paolo Molaro, is fairly efficient but currently only used by this
non-optimized method.
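
To make the BMP trade-off concrete, here is a hedged sketch contrasting
direct indexing with a generic bi-level lookup. The table layout, the
256-entry block size and all names (CategoryLookup, Direct, BiLevel)
are assumptions for illustration; they are not the runtime's actual
tables, and the access pattern used for the higher planes is not
described in this message, so it is not reproduced here.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

static class CategoryLookup
{
    const int BlockSize = 256; // hypothetical block size

    // Flat table: one entry per BMP code unit.
    static readonly byte[] flat = new byte[0x10000];
    // Bi-level table: blockIndex maps each 256-entry block of the BMP
    // to a slot in blocks; identical blocks share one slot.
    static readonly ushort[] blockIndex = new ushort[0x10000 / BlockSize];
    static readonly byte[][] blocks;

    static CategoryLookup()
    {
        // Both tables are filled from the running runtime so that the
        // sketch stays self-contained.
        List<byte[]> unique = new List<byte[]>();
        Dictionary<string, ushort> seen = new Dictionary<string, ushort>();
        for (int b = 0; b < blockIndex.Length; b++)
        {
            byte[] block = new byte[BlockSize];
            for (int i = 0; i < BlockSize; i++)
            {
                int cp = b * BlockSize + i;
                block[i] = flat[cp] = (byte)char.GetUnicodeCategory((char)cp);
            }
            string key = Convert.ToBase64String(block);
            ushort slot;
            if (!seen.TryGetValue(key, out slot))
            {
                slot = (ushort)unique.Count;
                seen.Add(key, slot);
                unique.Add(block);
            }
            blockIndex[b] = slot;
        }
        blocks = unique.ToArray();
    }

    // Direct indexing: a single array load.
    public static UnicodeCategory Direct(char c)
    {
        return (UnicodeCategory)flat[c];
    }

    // Bi-level lookup: two dependent loads plus a shift and a mask.
    public static UnicodeCategory BiLevel(char c)
    {
        return (UnicodeCategory)blocks[blockIndex[c >> 8]][c & 0xFF];
    }

    static void Main()
    {
        for (int cp = 0; cp <= 0xFFFF; cp++)
            if (Direct((char)cp) != BiLevel((char)cp))
                throw new InvalidOperationException("tables disagree");
        Console.WriteLine("{0} distinct blocks out of {1}", blocks.Length, blockIndex.Length);
    }
}
```

The bi-level form trades memory for an extra dependent load; whether a
given JIT still inlines it depends on that JIT's heuristics, which this
sketch does not measure.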
