[Mono-list] What affects collation order in a given culture?

Weeble clockworksaint at gmail.com
Mon Jun 3 12:22:36 UTC 2013


I should preface this by saying that I don't know Mandarin, so I'm
working rather blind. I want to make sure that a list of
filenames/track names/artists/albums is sorted correctly for Chinese
users. My understanding is that the most commonly expected sort order
is based on the pinyin transcription of the characters.

I've been investigating how strings are sorted in various cultures in
.NET, and I've found that Mono and .NET give different results for the
"zh-Hans" culture. From what I've read, I think "zh-Hans" should just
be another name for the "zh-CHS" culture, so I would expect the same
results for both, but Mono sorts them differently.

Here's a link to my short test program:

http://pastebin.com/kTL9QuLS

Here's my output on .NET:

http://pastebin.com/D5Hp6GjA

On .NET, in both the zh-Hans and the zh-CHS culture, the example
strings are sorted in an order consistent with their pinyin
transcriptions, which is what I expect.

Here's my output on Mono 3.0.10, running on Ubuntu:

http://pastebin.com/jMB0FdkP

This time, zh-CHS gives the same result as on .NET. However, for
zh-Hans, I get a different order. It *looks* like the strings are just
being ordered by Unicode code point. I am surprised to see a different
sort order for zh-CHS and zh-Hans on the same setup, and I'm
surprised at the difference from .NET.
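
To make the difference between the two orderings concrete, here is a
minimal sketch in Python (not my actual test program). The six
characters and the toneless pinyin table are hand-picked assumptions
for illustration, not my test data, and pinyin_key is a hypothetical
helper. Real pinyin collation also deals with tones and characters
that have several readings, which this ignores.

```python
# Illustrative only: a tiny hand-written pinyin table, not a real
# collation library and not the data from my test program.
PINYIN = {
    "爱": "ai",
    "北": "bei",
    "大": "da",
    "国": "guo",
    "人": "ren",
    "中": "zhong",
}


def pinyin_key(text):
    """Hypothetical helper: map each character to its toneless pinyin.

    Characters missing from the table fall back to themselves.
    """
    return [PINYIN.get(ch, ch) for ch in text]


words = ["中", "国", "爱", "北", "人", "大"]

# Plain string comparison in Python orders by Unicode code point.
by_code_point = sorted(words)

# Sorting on the pinyin key orders by romanized spelling instead.
by_pinyin = sorted(words, key=pinyin_key)

print("code point:", " ".join(by_code_point))  # 中 人 北 国 大 爱
print("pinyin:    ", " ".join(by_pinyin))      # 爱 北 大 国 人 中
```

The two lists come out in different orders, which is the kind of
difference I seem to be seeing between zh-CHS and zh-Hans on Mono
3.0.10.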

I also tried an older version of Mono, 2.10.8, as distributed with
Ubuntu:

http://pastebin.com/BmEepAmc

This gives me the expected sort order for both zh-Hans and zh-CHS, but
it also reports the culture name as being simply "Chinese" in each
case, instead of the expected "Chinese (Simplified) Legacy" and
"Chinese (Simplified)".

Finally, I've summarized the results in a table:

Runtime       Requested culture   Culture display name                     Collation order for Chinese characters

.NET 4.0      invariant           Invariant Language (Invariant Country)   code-point
.NET 4.0      zh-CHS              Chinese (Simplified) Legacy              pinyin
.NET 4.0      zh-Hans             Chinese (Simplified)                     pinyin

Mono 2.10.8   invariant           Invariant Language (Invariant Country)   code-point
Mono 2.10.8   zh-CHS              Chinese                                  pinyin
Mono 2.10.8   zh-Hans             Chinese                                  pinyin

Mono 3.0.10   invariant           Invariant Language (Invariant Country)   code-point
Mono 3.0.10   zh-CHS              Chinese (Simplified) Legacy              pinyin
Mono 3.0.10   zh-Hans             Chinese (Simplified)                     code-point(?!)

(In case the formatting is screwed up in email, here it is monospaced:
http://pastebin.com/SXYR7ucc )

If you're still following me (thank you!), I have a few questions:

1. Am I correct to expect that zh-CHS and zh-Hans should have the same
collation behaviour as each other?
2. Am I correct to expect that zh-Hans will have a pinyin-based collation order?
3. What systems/libraries are involved here? Does Mono depend on some
system library for its collation order, or does it implement this
itself? Are there particular configuration options I need to be aware
of if I am compiling Mono myself?
4. How does Mono pick the default culture on its various platforms?
Will it ever pick 'zh-Hans' as the default culture? Or would it always
prefer 'zh-CHS'? I'm worried that if it defaults to 'zh-Hans' for some
Chinese users they will get a surprising and unhelpful sort order.

Regards,

Weeble.

