jonpryor at vt.edu
Mon Dec 3 06:36:30 EST 2007
On Sat, 2007-12-01 at 16:16 +0100, Tinco Andringa wrote:
> Speaking of unicode/utf-16 and memory, couldn't a lot of
> memory be spared if strings were stored as utf8 internally,
> converting back to utf-16 only when more than 256
> different characters are used?
It depends on what you're storing in the strings. If you're only
storing ASCII or Western European characters in your strings, then yes,
UTF-8 would require less memory. If, on the other hand, you're storing
Asian language text (Japanese, Chinese, Korean), or anything else
containing characters >= U+0800 (i.e. over 97% of all potential
characters), then UTF-8 is a space *loss*, not a gain, as each such
character requires at least 3 bytes in UTF-8, while UTF-16 needs only 2.
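A quick sketch (in Python, unrelated to Mono's internals) makes the
trade-off above concrete, comparing per-character byte counts in the
two encodings:

```python
# Per-character storage cost of UTF-8 vs UTF-16 for sample characters.
# Characters below U+0800 favor UTF-8; those at or above it favor UTF-16.
samples = {
    "A": "ASCII (U+0041)",
    "\u00e9": "Western European (U+00E9)",
    "\u65e5": "CJK (U+65E5)",
}
for ch, desc in samples.items():
    utf8_len = len(ch.encode("utf-8"))
    # utf-16-le avoids counting the 2-byte BOM that plain "utf-16" prepends
    utf16_len = len(ch.encode("utf-16-le"))
    print(f"{desc}: UTF-8 = {utf8_len} bytes, UTF-16 = {utf16_len} bytes")
```

Running this shows the ASCII character taking 1 byte in UTF-8 vs 2 in
UTF-16, but the CJK character taking 3 bytes in UTF-8 vs 2 in UTF-16.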
Because of this, it is not uncommon for Linux apps to use UTF-16
internally, e.g. Mozilla, Qt, Python (which iirc has a configure-time
option to control the use of UTF-16 vs. UTF-32 strings), etc.
More information about the Mono-devel-list