[Mono-list] unicode trouble

Jonathan Pryor jonpryor@vt.edu
Mon, 09 Feb 2004 20:48:07 -0500


On Mon, 2004-02-09 at 19:21, Marcus wrote:
> As I recall, when the CM3 Modula-3 compiler added support for unicode, they 
> used a hybrid scheme where TEXTs (their equivalent of System.String) can 
> contain both 8-bit and 16-bit "chars". So only the portions of the string 
> that require more than 8 bits use it. Something similar could be done with 
> 32-bit characters in some future library if compactness were a concern.

In a fashion, that's what UTF-16 does: most characters fit in a single
16-bit code unit, but when a character doesn't fit within 16 bits it is
encoded as a Unicode Surrogate Pair: two 16-bit code units that are
combined to form the actual character.  (Combining Characters are a
separate mechanism, but have the same effect: one logical character can
span more than one 16-bit char.)
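
For instance, here's a minimal C# sketch (the class name and the sample
code point are purely illustrative, nothing from the original thread)
showing how a code point above U+FFFF ends up as two 16-bit chars in a
System.String, and how the pair maps back to one logical character:

	// C# -- illustrative sketch
	using System;

	class SurrogateDemo {
		static void Main ()
		{
			// U+24B62 doesn't fit in 16 bits, so UTF-16 stores it
			// as two 16-bit code units: 0xD852 followed by 0xDF62.
			string s = "\uD852\uDF62";

			Console.WriteLine (s.Length);                // 2 chars, 1 logical character
			Console.WriteLine (char.IsSurrogate (s[0])); // True

			// Recombine the pair into the original code point.
			int cp = 0x10000 + ((s[0] - 0xD800) << 10) + (s[1] - 0xDC00);
			Console.WriteLine ("U+{0:X}", cp);           // U+24B62
		}
	}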

The downside to this approach is user complexity: you can't just iterate
over all the literal characters in the string, as there isn't a
one-to-one mapping between the literal characters (16-bit chars) and the
logical characters you actually care about (Unicode code points, i.e.
UTF-32 values).  The previously mentioned
System.Globalization.TextElementEnumerator is used to map between the
physical representation (the char[] array) and the logical one (code
points and combining sequences).
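
Continuing the sketch above (again, the names and sample string are just
illustrations), StringInfo.GetTextElementEnumerator() hands back one text
element per *logical* character, whether it's stored as a surrogate pair
or as a base char plus a combining char:

	// C# -- illustrative sketch
	using System;
	using System.Globalization;

	class TextElementDemo {
		static void Main ()
		{
			// Two logical characters, four 16-bit chars:
			// U+24B62 (a surrogate pair) and "e" + U+0301 (combining acute).
			string s = "\uD852\uDF62e\u0301";

			Console.WriteLine (s.Length);	// 4 literal chars

			TextElementEnumerator e = StringInfo.GetTextElementEnumerator (s);
			while (e.MoveNext ())
				Console.WriteLine ("element at {0}: {1}",
						e.ElementIndex, e.GetTextElement ());
			// element at 0: the surrogate pair
			// element at 2: "e" with the accent attached
		}
	}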

> By the way, how much performance penality is there for accessing a single 8 
> one modern 32-bit processors?

I don't understand this question.  A "single 8 one modern 32-bit
processors"?  I can only assume that you're asking about the performance
penalty of accessing memory that isn't properly aligned (according to
its underlying type).  In that case it depends on the processor: x86
processors will perform the unaligned access, just at a slower rate,
while some RISC processors (SPARC, Alpha, etc.) will refuse to load
memory that isn't properly aligned and instead raise a processor
exception.  On Linux (on some architectures) that exception is trapped
by the kernel, which performs the unaligned access in software, much
more slowly, before the program continues.

Note that this doesn't cause 8-bit characters to be horribly slow; it's
accessing a 32-bit quantity (say) that isn't aligned on a 32-bit address
boundary that performs slowly, if it works at all.  For example, if the
following structure were packed:

	// GNU C
	#include <stdint.h>	// int32_t

	struct foo {
		char	c;	// offset: 0
		int32_t	i;	// offset: 1, not 32-bit aligned
	} __attribute__((packed));

	struct foo f;
	f.i = 42;		// unaligned 32-bit store

The above code is liable to generate bus errors under some operating
systems, such as SunOS on SPARC, as f.i isn't integer-aligned and the
processor raises an exception when it attempts the unaligned store.  The
"portable" equivalent would be:

	#include <string.h>	// memcpy

	struct foo f;
	int32_t n = 42;
	memcpy (&f.i, &n, sizeof (n));	// byte-wise copy; no alignment assumed

Just a minor digression...

 - Jon

> On Monday 09 February 2004 7:16 am, Jonathan Pryor wrote:
> 
> > I imagine it's due to a memory trade-off.  The easiest way for the
> > programmer to deal with things would be to just use UTF-32 for all
> > Unicode strings.  You wouldn't have to worry about surrogate pairs or
> > anything else like that.
> >
> > It would also mean that all strings would require 32-bits for each
> > character, which would eat up *lots* of memory for all strings.  The
> > most common code points -- US, Europe, Asia -- all easily fit within
> > 16-bits, *by design*. 
> _______________________________________________
> Mono-list maillist  -  Mono-list@lists.ximian.com
> http://lists.ximian.com/mailman/listinfo/mono-list