[Mono-list] unicode trouble

Jonathan Pryor jonpryor@vt.edu
Mon, 09 Feb 2004 07:16:52 -0500


On Mon, 2004-02-09 at 02:22, gabor wrote:
<snip/>
> i just can't understand why the designers of dotnet didn't look at the unicode
> standards. i can understand that java has this problem, but java is much older 
> than dotnet.
> 
> maybe it's because winapi uses 16-bit characters?

I imagine it's due to a memory trade-off.  The easiest way for the
programmer to deal with things would be to just use UTF-32 (UCS-4) for
all Unicode strings.  You wouldn't have to worry about surrogate pairs
or anything else like that.
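For illustration, here's a small Java sketch (Java shares the 16-bit
char design being discussed) showing the surrogate-pair problem that a
32-bit representation would avoid: a character outside the 16-bit range
occupies two code units, so length() and the code-point count disagree.

```java
public class SurrogateDemo {
    public static void main(String[] args) {
        // U+1D11E MUSICAL SYMBOL G CLEF lies outside the 16-bit range,
        // so UTF-16 stores it as a surrogate pair of two chars.
        String clef = "\uD834\uDD1E";
        System.out.println(clef.length());                          // 2 code units
        System.out.println(clef.codePointCount(0, clef.length()));  // 1 code point
    }
}
```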

It would also mean that all strings would require 32 bits for each
character, which would eat up *lots* of memory for all strings.  The
most common code points -- US, Europe, Asia -- all easily fit within
16 bits, *by design*.  So the designers had a choice: use 32-bit
characters internally everywhere, forcing nearly all users to "waste"
16-24 bits per character (1/2 to 3/4 of all memory dedicated to
strings), or use 16-bit characters internally, which would suit the
needs of most current users (probably > 80%), while only "wasting"
8 bits per character for the US and parts of Europe, a minority of the
world population.
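The arithmetic can be checked directly.  A rough sketch in Java,
assuming the JDK's optional "UTF-32LE" charset is available via
Charset.forName: encoding an ASCII string as UTF-32 takes twice the
bytes of UTF-16, and four times what an 8-bit encoding would need.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingCost {
    public static void main(String[] args) {
        String ascii = "hello world";  // 11 characters, all fit in 8 bits
        int utf16 = ascii.getBytes(StandardCharsets.UTF_16LE).length;     // 22 bytes
        int utf32 = ascii.getBytes(Charset.forName("UTF-32LE")).length;   // 44 bytes
        // UTF-32 "wastes" 24 bits/char here (3/4 of the string's memory);
        // UTF-16 wastes only 8 bits/char (1/2).
        System.out.println(utf16 + " bytes vs " + utf32 + " bytes");
    }
}
```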

16-bit characters were considered to be a decent trade-off, I would
assume.

An alternative approach could have been for the string to do on-the-fly
conversion between 32-bit Unicode code points and an internal
representation, such as UTF-16.  This would imply that System.Char is a
32-bit structure, and that System.String wouldn't conceptually store a
char[] array, but rather some implementation-defined encoding of the
char[] array, to save memory.  This could be argued to complicate
things, but I don't know why else this strategy wouldn't work.
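Something close to this design exists in later Java, for what it's
worth: the string keeps UTF-16 code units internally, but codePoints()
exposes a stream of 32-bit code points, decoding surrogate pairs on the
fly -- roughly the conversion layer described above.

```java
public class CodePointView {
    public static void main(String[] args) {
        String s = "a\uD834\uDD1Eb";  // 'a', U+1D11E (a surrogate pair), 'b'
        // Internally 4 UTF-16 code units; externally 3 code points.
        s.codePoints().forEach(cp -> System.out.printf("U+%04X%n", cp));
    }
}
```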

 - Jon