[Mono-list] unicode trouble

Fabio Montoya [@model-it] fabio@model-it.com.mx
Mon, 9 Feb 2004 00:07:52 -0600


Sorry, I should have said "Gabor's original question persists..."
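
For what it's worth, here is a minimal sketch of one way to represent and inspect a character above the 16-bit limit in C#, such as the OLD ITALIC LETTER A (U+10300, hexadecimal) from the question quoted below. It assumes a runtime that provides char.ConvertFromUtf32, char.ConvertToUtf32 and char.IsSurrogatePair (as far as I know these arrived in .NET 2.0, so a 1.x runtime may not have them). The class name SupplementaryDemo and the helper CountCodePoints are made up for illustration, not part of any library.

```csharp
using System;
using System.Text;

// Hypothetical demo class; the names are illustrative only.
public static class SupplementaryDemo
{
    // Counts Unicode code points by treating each valid surrogate pair
    // (two UTF-16 code units) as a single character.
    public static int CountCodePoints(string s)
    {
        int count = 0;
        for (int i = 0; i < s.Length; i++)
        {
            if (char.IsSurrogatePair(s, i))
                i++; // skip the low surrogate of this pair
            count++;
        }
        return count;
    }

    public static void Main()
    {
        // U+10300 is a hexadecimal code point; it does not fit in one 16-bit char,
        // so the runtime stores it as a high/low surrogate pair.
        string oldItalicA = char.ConvertFromUtf32(0x10300);
        string s = "a" + oldItalicA;

        Console.WriteLine("UTF-16 code units: " + s.Length);
        Console.WriteLine("code points:       " + CountCodePoints(s));
        Console.WriteLine("s[1] = 0x" + ((int)s[1]).ToString("X4"));
        Console.WriteLine("s[2] = 0x" + ((int)s[2]).ToString("X4"));
        Console.WriteLine("decoded = U+" + char.ConvertToUtf32(s, 1).ToString("X"));
        Console.WriteLine("UTF-8 bytes: " + Encoding.UTF8.GetByteCount(s));
    }
}
```

If I have the arithmetic right, this reports 3 code units but 2 code points, with the pair 0xD800 / 0xDF00 at indexes 1 and 2. That matches what Gabor saw: Length and the indexer work on UTF-16 code units, while encoding to UTF-8 still writes the character correctly.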

Fabio Montoya

| -----Original Message-----
| From: mono-list-admin@lists.ximian.com 
| [mailto:mono-list-admin@lists.ximian.com] On Behalf Of Fabio 
| Montoya [@model-it]
| Sent: Monday, February 09, 2004 12:04 AM
| To: aranym@adelphia.net; 'gabor'; mono-list@lists.ximian.com
| Subject: RE: [Mono-list] unicode trouble
| 
| 
| 
| Gabor is right, Max! The Unicode standard defines characters in a code
| space larger than 16 bits: code points run from U+0000 to U+10FFFF. The
| fixed-width 32-bit form of that space is called UCS-4 or UTF-32.
| 
| For practical reasons, the Unicode standard defines transformation formats,
| e.g.:
| 
| UTF-8  Unicode transformation format with 8-bit code units
| UTF-16 Unicode transformation format with 16-bit code units
| 
| [Any transformation format with code units wider than 8 bits needs to
| handle byte-ordering issues.]
| 
| 
| The original Max's question persists...
| 
| | > but what about unicode characters, that are simply above the 16-bit
| | > limit?
| | >
| | > for example:
| | > OLD ITALIC LETTER A (unicode code: U+10300, hexadecimal).
| | >
| | > how do you represent those in .net?
| 
|  
| Cheers!
| 
| 
| Fabio Montoya
| 
| 
| | -----Original Message-----
| | From: mono-list-admin@lists.ximian.com 
| | [mailto:mono-list-admin@lists.ximian.com] On Behalf Of max
| | Sent: Sunday, February 08, 2004 10:04 PM
| | To: gabor; mono-list@lists.ximian.com
| | Subject: Re: [Mono-list] unicode trouble
| | 
| | Hi Gabor,
| | I think you're confused. Characters in .NET are 16 bits 
| | BECAUSE they are unicode. 16 bits = 2 bytes = 65536 values.
| | 
| | a way to check that is simple. here's some C# example code:
| | 
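| |       string s = "a";
| |       // U+10300 is hexadecimal; (char)10300 would be decimal 10300 = U+283C.
| |       s += "\U00010300";   // OLD ITALIC LETTER A, stored as a surrogate pair
| | 
| |       Console.WriteLine("s = " + s);
| |       Console.WriteLine("len = " + s.Length);   // counts 16-bit code units
| | 
| |       for (int i = 0; i < s.Length; i++ ) {
| |          Console.WriteLine("s["+i+"] = " + (int)s[i]);
| |       }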
| | 
| | max
| | 
| | On Sunday 08 February 2004 15:19, gabor wrote:
| | > hi,
| | >
| | > as i understand, characters in .net are 16-bit values.
| | >
| | > but what about unicode characters, that are simply above the 16-bit
| | > limit?
| | >
| | > for example:
| | > OLD ITALIC LETTER A (unicode code: U+10300, hexadecimal).
| | >
| | > how do you represent those in .net?
| | >
| | > i tried to open a textfile containing this old-italic-a:
| | >
| | > - the length and indexing methods of string all said that old-italic-a
| | > is actually 2 letters => it doesn't work
| | > - when writing the string back to an utf8 encoded textfile, then it
| | > was correctly written.
| | >
| | > so for me it seems that dotnet (mono) uses utf16 as internal encoding
| | > format, but indexing (and length) doesn't use that information.
| | >
| | > am i correct?
| | >
| | > are there any ways to handle those characters in dotnet?
| | >
| | > for example the new java-1.5 contains some new string-methods that can
| | > handle these characters. it's not perfect in java, but at least there
| | > is something.
| | >
| | > if someone wants to play with it, i attached a text file containing
| | > the text "marrakesh", encoded in utf8, where i replaced the first "a"
| | > with old-italic-a (it's easy to do with a little iconv to-from ucs4
| | > and hexedit)
| | >
| | > thanks,
| | > gabor farkas
| | 
| | _______________________________________________
| | Mono-list maillist  -  Mono-list@lists.ximian.com 
| | http://lists.ximian.com/mailman/listinfo/mono-list