[Mono-list] What affects collation order in a given culture?
Weeble
clockworksaint at gmail.com
Mon Jun 3 12:22:36 UTC 2013
I should preface this by saying that I don't know Mandarin, so I'm
working rather blind. I want to make sure that a list of
filenames/track names/artists/albums is sorted correctly for Chinese
users. My understanding is that the most commonly expected sort order
is based on the pinyin transcription of the characters.
I've been investigating how strings are sorted in various cultures in
.NET, and I've found that Mono and .NET give different results for the
"zh-Hans" culture. From what I've read, "zh-Hans" should just be
another name for the "zh-CHS" culture, so I'd expect the same results
for both, but on Mono they differ.
Here's a link to my short test program:
http://pastebin.com/kTL9QuLS
Here's my output on .NET:
http://pastebin.com/D5Hp6GjA
On .NET, in both the zh-Hans and the zh-CHS culture, the example
strings are sorted in an order consistent with their pinyin
transcriptions, which is what I expect.
Here's my output on mono 3.0.10, running on Ubuntu:
http://pastebin.com/jMB0FdkP
This time, I get the same result as for .NET with zh-CHS. However, for
zh-Hans, I get a different order. It *looks* like they're just being
ordered by Unicode code point. I am surprised that I see a different
sort order for zh-CHS from zh-Hans on the same setup, and I'm
surprised at the difference from .NET.
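One way to test the code-point suspicion independently of either runtime is sketched below in Python. It is only an illustration: the five sample characters, their hand-written pinyin readings, and the helper names (code_point_order, pinyin_order, classify) are all assumptions made up for this sketch, not taken from the test program linked above. It sorts the samples once by code point and once by pinyin, then reports which of the two an observed order is consistent with.

```python
# Hypothetical sample data: five common characters with a hand-written
# pinyin reading for each (tone marks dropped). These are illustrative
# and are not the strings used in the original test program.
PINYIN = {
    "阿": "a",
    "安": "an",
    "北": "bei",
    "文": "wen",
    "中": "zhong",
}


def code_point_order(strings):
    """Sort by Unicode code point, which is Python's default for str."""
    return sorted(strings)


def pinyin_order(strings):
    """Sort by the pinyin reading of each character, using the table above."""
    return sorted(strings, key=lambda s: [PINYIN[ch] for ch in s])


def classify(observed):
    """Name the ordering(s) that an observed sort result is consistent with."""
    matches = []
    if observed == code_point_order(observed):
        matches.append("code-point")
    if observed == pinyin_order(observed):
        matches.append("pinyin")
    return matches or ["neither"]


if __name__ == "__main__":
    samples = list(PINYIN)
    print("code-point:", code_point_order(samples))
    print("pinyin:    ", pinyin_order(samples))
```

Pasting the order printed by a given runtime into classify() would show whether it matches plain code-point sorting. The pinyin table only covers the characters listed, so it would need extending for any other test strings.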
I also tried an older version of Mono, 2.10.8, as
distributed with Ubuntu:
http://pastebin.com/BmEepAmc
This gives me the expected sort order for both zh-Hans and zh-CHS, but
it also reports the culture name as being simply "Chinese" in each
case, instead of the expected "Chinese (Simplified) Legacy" and
"Chinese (Simplified)".
Finally, I've summarized the results in a table:
Runtime      Requested culture  Culture display name                    Collation order for Chinese characters
.NET 4.0     invariant          Invariant Language (Invariant Country)  code-point
.NET 4.0     zh-CHS             Chinese (Simplified) Legacy             pinyin
.NET 4.0     zh-Hans            Chinese (Simplified)                    pinyin
Mono 2.10.8  invariant          Invariant Language (Invariant Country)  code-point
Mono 2.10.8  zh-CHS             Chinese                                 pinyin
Mono 2.10.8  zh-Hans            Chinese                                 pinyin
Mono 3.0.10  invariant          Invariant Language (Invariant Country)  code-point
Mono 3.0.10  zh-CHS             Chinese (Simplified) Legacy             pinyin
Mono 3.0.10  zh-Hans            Chinese (Simplified)                    code-point(?!)
(In case the formatting is screwed up in email, here it is monospaced:
http://pastebin.com/SXYR7ucc )
If you're still following me (thank you!), I have a few questions:
1. Am I correct to expect that zh-CHS and zh-Hans should have the same
collation behaviour as each other?
2. Am I correct to expect that zh-Hans will have a pinyin-based collation order?
3. What systems/libraries are involved here? Does Mono depend on some
system library for its collation order, or does it implement this
itself? Are there particular configuration options I need to be aware
of if I am compiling mono myself?
4. How does Mono pick the default culture on its various platforms?
Will it ever pick 'zh-Hans' as the default culture? Or would it always
prefer 'zh-CHS'? I'm worried that if it defaults to 'zh-Hans' for some
Chinese users they will get a surprising and unhelpful sort order.
Regards,
Weeble.