[Mono-bugs] [Bug 480178] System.Globalization.CharUnicodeInfo.GetUnicodeCategory() does not handle surrogate characters appropriately.
bugzilla_noreply at novell.com
Fri May 14 16:56:28 EDT 2010
http://bugzilla.novell.com/show_bug.cgi?id=480178
http://bugzilla.novell.com/show_bug.cgi?id=480178#c31
Damien Diederen <dd at crosstwine.com> changed:
              What |Removed |Added
----------------------------------------------------------------------------
Attachment #359241 |0       |1
       is obsolete |        |
--- Comment #31 from Damien Diederen <dd at crosstwine.com> 2010-05-14 20:56:26 UTC ---
Created an attachment (id=362384)
--> (http://bugzilla.novell.com/attachment.cgi?id=362384)
System.Char: Handle astral planes in GetUnicodeCategory(string,int)
If the string element at index starts a surrogate pair, we decode the full
codepoint and "query" the higher planes of the database.
This commit fixes #480178
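
The decoding step described above can be sketched as follows. This is only
an illustration of the standard UTF-16 surrogate arithmetic, written in
Python rather than taken from the attached patch; the helper names
(utf16_units, codepoint_at) are hypothetical.

```python
def utf16_units(text):
    """Split a string into UTF-16 code units, as a .NET string stores them."""
    data = text.encode("utf-16-be", "surrogatepass")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


def is_high_surrogate(unit):
    return 0xD800 <= unit <= 0xDBFF


def is_low_surrogate(unit):
    return 0xDC00 <= unit <= 0xDFFF


def codepoint_at(units, index):
    """Return the codepoint that starts at units[index].

    If the unit at index is a high surrogate followed by a low surrogate,
    the pair is combined into a single astral-plane codepoint. Otherwise
    the unit is returned unchanged (this includes lone surrogates).
    """
    first = units[index]
    if is_high_surrogate(first) and index + 1 < len(units):
        second = units[index + 1]
        if is_low_surrogate(second):
            return 0x10000 + ((first - 0xD800) << 10) + (second - 0xDC00)
    return first
```

For example, the string "a" followed by U+1D11E is stored as the three
units 0061 D834 DD1E; codepoint_at(units, 1) then yields 0x1D11E, which is
the value to look up in the higher planes of the category data.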
COMPATIBILITY
The updated Mono runtime has been verified to produce the same results
as Microsoft's; here are MD5 sums of their Unicode category database
dumps:
eba45e00acdc82f9a08873465110aef4 v2.0.50727.dump
eba45e00acdc82f9a08873465110aef4 v3.5.21022.dump
56fd5c828fbb9083693835680667fd2c v4.0.30319.dump
eba45e00acdc82f9a08873465110aef4 gmcs.dump
56fd5c828fbb9083693835680667fd2c dmcs.dump
(Generated via create-category-table --dump, compiled and executed
under the relevant runtime.)
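
The comparison method (dump the category of every codepoint, then compare
digests) can be sketched like this. It is not the create-category-table
tool: the dump line format is an assumption, and the categories come from
Python's unicodedata module, whose Unicode version differs from the .NET
runtimes listed above, so the digests will not match those sums.

```python
import hashlib
import unicodedata


def dump_categories(first=0x0000, last=0x10FFFF):
    """Return one 'CODEPOINT CATEGORY' line per codepoint, as ASCII bytes.

    The line format is hypothetical; any stable format works as long as
    every runtime under comparison produces it the same way.
    """
    lines = []
    for cp in range(first, last + 1):
        lines.append("%06X %s\n" % (cp, unicodedata.category(chr(cp))))
    return "".join(lines).encode("ascii")


def md5_of_dump(first=0x0000, last=0x10FFFF):
    """MD5 digest of the dump; equal digests mean identical category data."""
    return hashlib.md5(dump_categories(first, last)).hexdigest()
```

Two runtimes agree on every category in the range exactly when their dumps
hash to the same value, which is what the identical sums above show.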
PERFORMANCE
The simple data access pattern, suggested by Paolo Molaro, is fairly
efficient; here are timings for a simple loop that fetches the category
code of every codepoint in "Range", repeated "Iterations" times
(Intel(R) Core(TM)2 Duo CPU P7350 @ 2.00GHz; best of three runs):
| Range | Iterations | Linear table | 2.0+ | 4.0 |
|-------------+------------+--------------+-------+-------|
| 0000-00FF | 256000 | 0.30s | 0.35s | 0.37s |
| 0000-FFFF | 16000 | 4.75s | 5.54s | 5.82s |
| 0000-10FFFF | 1000 | N/A | 5.63s | 6.19s |
|-------------+------------+--------------+-------+-------|
| Data size | | 64kB | 30kB | 48kB |
In the table above, 2.0+ denotes a mode compatible with versions
v2.0.50727...v3.5.21022 of Microsoft's framework, whereas 4.0 mimics
v4.0.30319. The former is used by programs compiled by 'mcs', 'gmcs'
and 'smcs'; the latter by programs compiled by 'dmcs'.
(The difference in performance between these modes is probably due to
a change in memory access patterns: the 4.0 table shares "pages" with
the 2.0 one, causing accesses to be more spread out.)
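
For readers unfamiliar with paged tables, here is a minimal sketch of the
general technique: a first-level index selects a "page", identical pages are
stored only once, and a lookup costs two array accesses. The 256-entry page
size and the function names are assumptions made for illustration; the
actual layout in the patch may differ.

```python
import unicodedata

PAGE_BITS = 8                 # assumed page size: 256 codepoints
PAGE_SIZE = 1 << PAGE_BITS


def build_paged_table(category_of, limit=0x110000):
    """Build (index, pages) so identical pages are stored only once."""
    pages = []      # unique pages, each a tuple of PAGE_SIZE categories
    page_ids = {}   # page contents -> position in `pages`
    index = []      # one entry per block of PAGE_SIZE codepoints
    for start in range(0, limit, PAGE_SIZE):
        page = tuple(category_of(cp) for cp in range(start, start + PAGE_SIZE))
        if page not in page_ids:
            page_ids[page] = len(pages)
            pages.append(page)
        index.append(page_ids[page])
    return index, pages


def lookup(index, pages, cp):
    """Two-step access: pick the page, then the entry within it."""
    return pages[index[cp >> PAGE_BITS]][cp & (PAGE_SIZE - 1)]


def python_category(cp):
    """Category source for the sketch; Mono uses its own generated data."""
    return unicodedata.category(chr(cp))
```

Large uniform ranges (unassigned or private-use blocks, for instance)
collapse into a handful of shared pages, which is how such a table can
cover 0000-10FFFF in less space than a linear table covering 0000-FFFF.
When two table variants share pages, the pages one variant touches end up
interleaved with pages it never uses, which fits the explanation given
above for the slower 4.0 mode.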
--
Configure bugmail: http://bugzilla.novell.com/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the QA contact for the bug.
You are the assignee for the bug.