[Mono-dev] String.GetHashCode()

Mon Dec 3 06:36:30 EST 2007

On Sat, 2007-12-01 at 16:16 +0100, Tinco Andringa wrote:
>         Speaking of unicode/utf-16 and memory, couldn't a lot of
>         memory be spared if strings where stored as utf8 internally,
>         which would be converted back to utf-16 when more than 256
>         different characters would be used?

It depends on what you're storing in the strings.  If you're mostly
storing ASCII or Western European text, then yes, UTF-8 would require
less memory.  If, on the other hand, you're storing Asian language text
(Japanese, Chinese, Korean), or anything else made up of characters at
or above U+0800 (i.e. more than 97% of all possible code points), then
UTF-8 is a space *loss*, not a gain: each such character needs at least
3 bytes in UTF-8, while UTF-16 needs 2 (or 4 for characters outside the
Basic Multilingual Plane, which take 4 bytes in UTF-8 as well).
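
To make the trade-off concrete, here is a rough Python sketch (not
anything from the Mono runtime; the helper name encoded_sizes and the
sample strings are invented for illustration) that prints the encoded
size of a few strings in both encodings, without a byte-order mark:

```python
def encoded_sizes(text):
    """Return (utf8_bytes, utf16_bytes) for text, with no byte-order mark."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))


# Illustrative samples; escapes keep this snippet itself pure ASCII.
samples = [
    ("ASCII", "hash code"),
    ("Western European", "na\u00efve caf\u00e9"),  # two accented letters
    ("Japanese", "\u65e5\u672c\u8a9e"),            # three CJK ideographs
    ("U+07FF", "\u07ff"),                          # last 2-byte UTF-8 code point
    ("U+0800", "\u0800"),                          # first 3-byte UTF-8 code point
    ("U+1F600", "\U0001F600"),                     # outside the BMP
]

for label, text in samples:
    utf8, utf16 = encoded_sizes(text)
    print(f"{label:17} UTF-8: {utf8:2} bytes   UTF-16: {utf16:2} bytes")
```

The ASCII and Western European samples come out smaller in UTF-8, the
Japanese sample comes out smaller in UTF-16, and the code point outside
the BMP costs 4 bytes either way.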

See also:

http://blogs.msdn.com/michkap/archive/2005/05/20/420317.aspx
http://blogs.msdn.com/michkap/archive/2005/05/22/420822.aspx
http://blogs.msdn.com/michkap/archive/2005/05/25/421828.aspx

Because of this, it is not uncommon for Linux software to use UTF-16
internally, e.g. Mozilla, Qt, Python (which iirc has a configure-time
option to control use of UTF-16 vs. UTF-32 strings), etc.

 - Jon