[Mono-list] unicode trouble

Fabio Montoya [@model-it] fabio@model-it.com.mx
Mon, 9 Feb 2004 00:04:11 -0600


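Gabor is right, Max! The Unicode standard defines code points in a 21-bit
code space (U+0000 to U+10FFFF), so a single 16-bit value cannot hold all
of them. The fixed-width 32-bit form is called UCS-4 or UTF-32 (UCS stands
for Universal Character Set).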

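For practical reasons, the Unicode standard defines transformation formats,
e.g.:

UTF-8  Unicode transformation format with 8-bit code units (1 to 4 per
       code point)
UTF-16 Unicode transformation format with 16-bit code units (1 or 2 per
       code point; code points above U+FFFF take a surrogate pair)
[Any transformation format with code units wider than 8 bits needs to
handle byte-ordering issues.]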
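
To make the formats above concrete, here is a minimal sketch that encodes
Gabor's example character, U+10300, in each of them. The class and method
names (EncodingDemo, Hex) are made up for illustration, and Encoding.UTF32
assumes .NET 2.0 or later, so treat it as a sketch rather than something
tested on the current Mono release.

```csharp
using System;
using System.Text;

static class EncodingDemo
{
    // Hex dump of a byte array, e.g. "F0 90 8C 80" (hypothetical helper).
    public static string Hex(byte[] bytes)
    {
        return BitConverter.ToString(bytes).Replace('-', ' ');
    }

    static void Main()
    {
        // U+10300 OLD ITALIC LETTER A, written with C#'s 8-digit \U escape.
        string s = "\U00010300";

        // GetBytes does not emit a byte-order mark, so only the character's
        // own bytes are shown.
        Console.WriteLine("UTF-8     : " + Hex(Encoding.UTF8.GetBytes(s)));             // F0 90 8C 80
        Console.WriteLine("UTF-16 LE : " + Hex(Encoding.Unicode.GetBytes(s)));          // 00 D8 00 DF
        Console.WriteLine("UTF-16 BE : " + Hex(Encoding.BigEndianUnicode.GetBytes(s))); // D8 00 DF 00
        Console.WriteLine("UTF-32 LE : " + Hex(Encoding.UTF32.GetBytes(s)));            // 00 03 01 00
    }
}
```

The two UTF-16 lines carry the same two code units, D800 and DF00, and
differ only in byte order, which is the byte-ordering issue mentioned above.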


Gabor's original question still stands...

| > but what about unicode characters, that are simply above the 16-bit 
| > limit?
| >
| > for example:
| > OLD ITALIC LETTER A (unicode code: 10300).
| >
| > how do you represent those in .net?
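
One possible answer, offered as a sketch and not as the definitive way: a
.NET string is a sequence of UTF-16 code units, so U+10300 is stored as a
surrogate pair (two chars), and Length and the indexer count chars, not
code points. The helpers used below (char.ConvertFromUtf32,
char.ConvertToUtf32, char.IsSurrogatePair, StringInfo.LengthInTextElements)
assume .NET 2.0 or later; SurrogateDemo and CodePointCount are names
invented for this example. Note also that 10300 in the question is a
hexadecimal code; a cast such as (char)10300 uses decimal 10300, which is
U+283C, a Braille pattern, and not OLD ITALIC LETTER A.

```csharp
using System;
using System.Globalization;

static class SurrogateDemo
{
    // Counts Unicode code points instead of 16-bit chars (hypothetical helper).
    public static int CodePointCount(string s)
    {
        int count = 0;
        for (int i = 0; i < s.Length; i++)
        {
            // A well-formed surrogate pair is one code point: skip its low half.
            if (char.IsSurrogatePair(s, i)) i++;
            count++;
        }
        return count;
    }

    static void Main()
    {
        // "marrakesh" with the first "a" replaced by U+10300, as in Gabor's file.
        string s = "m" + char.ConvertFromUtf32(0x10300) + "rrakesh";

        Console.WriteLine(s.Length);                                // 10 chars
        Console.WriteLine(CodePointCount(s));                       // 9 code points
        Console.WriteLine("{0:X4} {1:X4}", (int)s[1], (int)s[2]);   // D800 DF00
        Console.WriteLine("U+{0:X}", char.ConvertToUtf32(s, 1));    // U+10300
        Console.WriteLine(new StringInfo(s).LengthInTextElements);  // 9
    }
}
```

This matches what Gabor observed: the string reports two "letters" for
old-italic-a, yet it round-trips to UTF-8 correctly because the pair is
well-formed UTF-16.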

 
Cheers!


Fabio Montoya


| -----Original Message-----
| From: mono-list-admin@lists.ximian.com 
| [mailto:mono-list-admin@lists.ximian.com] On Behalf Of max
| Sent: Sunday, February 08, 2004 10:04 PM
| To: gabor; mono-list@lists.ximian.com
| Subject: Re: [Mono-list] unicode trouble
| 
| Hi Gabor,
| I think you're confused. Characters in .NET are 16 bits 
| BECAUSE they are unicode. 16 bits = 2 bytes = 65536 values.
| 
| a way to check that is simple. here's some C# example code:
| 
|       string s = "a";
|       s += (char)10300;
| 
|       Console.WriteLine("s = " + s);
|       Console.WriteLine("len = " + s.Length);
| 
|       for (int i = 0; i < s.Length; i++ ) {
|          Console.WriteLine("s["+i+"] = " + (int)s[i]);
|       }
| 
| max
| 
| On Sunday 08 February 2004 15:19, gabor wrote:
| > hi,
| >
| > as i understand, characters in .net are 16-bit values.
| >
| > but what about unicode characters, that are simply above the 16-bit 
| > limit?
| >
| > for example:
| > OLD ITALIC LETTER A (unicode code: 10300).
| >
| > how do you represent those in .net?
| >
| > i tried to open a textfile containing this old-italic-a:
| >
| > - the length and indexing methods of string all said that
| > old-italic-a is actually 2 letters => it doesn't work
| > - when writing the string back to an utf8 encoded textfile, then it 
| > was correctly written.
| >
| > so for me it seems that dotnet (mono) uses utf16 as internal
| > encoding format, but indexing (and length) doesn't use that
| > information.
| >
| > am i correct?
| >
| > are there any ways to handle those characters in dotnet?
| >
| > for example the new java-1.5 contains some new string-methods that
| > can handle these characters. it's not perfect in java, but at least
| > there is something.
| >
| > if someone wants to play with it, i attached a text file containing 
| > the text "marrakesh", encoded in utf8, where i replaced the first "a"
| > with old-italic-a (it's easy to do with a little iconv to-from ucs4 
| > and hexedit)
| >
| > thanks,
| > gabor farkas
| 