[Mono-list] unicode trouble
Fabio Montoya [@model-it]
fabio@model-it.com.mx
Mon, 9 Feb 2004 00:04:11 -0600
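Gabor is right, Max! The Unicode standard defines characters in a code
space larger than 16 bits (code points U+0000 through U+10FFFF). Stored
directly, each code point fits in one 32-bit unit; that form is called
UCS-4 or UTF-32.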
For practical reasons, the Unicode standard also defines transformation
formats, e.g.:
UTF-8   Unicode transformation format with 8-bit code units
UTF-16  Unicode transformation format with 16-bit code units
[Any transformation format with code units wider than 8 bits needs to
handle byte-ordering issues.]
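To make the difference between the formats concrete, here is a small
sketch that prints the bytes of OLD ITALIC LETTER A in UTF-8 and in both
byte orders of UTF-16. The class name EncodingDemo is only illustrative,
and the sketch assumes a compiler that accepts the \U escape in string
literals.

```csharp
using System;
using System.Text;

public class EncodingDemo
{
    public static void Main()
    {
        // U+10300 OLD ITALIC LETTER A, written with a C# \U escape (hex).
        string s = "\U00010300";

        // UTF-8: 8-bit code units, so there is no byte-order question.
        Console.WriteLine(BitConverter.ToString(Encoding.UTF8.GetBytes(s)));

        // UTF-16, little-endian: low byte of each 16-bit unit comes first.
        Console.WriteLine(BitConverter.ToString(Encoding.Unicode.GetBytes(s)));

        // UTF-16, big-endian: high byte of each 16-bit unit comes first.
        Console.WriteLine(BitConverter.ToString(Encoding.BigEndianUnicode.GetBytes(s)));
    }
}
```

This should print F0-90-8C-80, then 00-D8-00-DF, then D8-00-DF-00: four
bytes in UTF-8, and two 16-bit units (D800 DF00) in UTF-16, whose bytes
swap places depending on the byte order.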
Gabor's original question still stands...
| > but what about unicode characters, that are simply above the 16-bit
| > limit?
| >
| > for example:
| > OLD ITALIC LETTER A (unicode code: 10300).
| >
| > how do you represent those in .net?
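One possible answer, as a sketch rather than a definitive recipe: in
UTF-16 a code point above U+FFFF is stored as two 16-bit chars, a
surrogate pair, which is why Length and the indexer report two
"letters". The helpers FromCodePoint and CountCodePoints below are
hypothetical names written for this mail, not framework APIs. Note also
that the code point is hexadecimal 10300; the test snippet quoted
further down casts decimal 10300, which is U+283C (a Braille pattern)
and fits in a single char, so it does not exercise this case.

```csharp
using System;

public class SurrogateDemo
{
    // Hypothetical helper: encode one Unicode code point as a UTF-16 string.
    public static string FromCodePoint(int cp)
    {
        if (cp < 0 || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            throw new ArgumentOutOfRangeException("cp");
        if (cp < 0x10000)
            return ((char)cp).ToString();      // fits in one 16-bit char

        int v = cp - 0x10000;                  // 20 bits remain
        char high = (char)(0xD800 + (v >> 10));    // top 10 bits
        char low = (char)(0xDC00 + (v & 0x3FF));   // bottom 10 bits
        return new string(new char[] { high, low });
    }

    // Hypothetical helper: count code points instead of 16-bit chars.
    public static int CountCodePoints(string s)
    {
        int n = 0;
        for (int i = 0; i < s.Length; i++)
        {
            // A high surrogate followed by a low surrogate is one character.
            if (s[i] >= 0xD800 && s[i] <= 0xDBFF && i + 1 < s.Length
                && s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF)
                i++;
            n++;
        }
        return n;
    }

    public static void Main()
    {
        string s = "a" + FromCodePoint(0x10300);   // hex, not decimal
        Console.WriteLine("chars = " + s.Length);
        Console.WriteLine("code points = " + CountCodePoints(s));
        Console.WriteLine(((int)s[1]).ToString("X4") + " " + ((int)s[2]).ToString("X4"));
    }
}
```

Here Length is 3 while the code point count is 2, and the two chars
after "a" are D800 and DF00. Writing such a string back out as UTF-8
works, as Gabor saw, because the encoder combines the pair again.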
Cheers!
Fabio Montoya
| -----Original Message-----
| From: mono-list-admin@lists.ximian.com
| [mailto:mono-list-admin@lists.ximian.com] On Behalf Of max
| Sent: Sunday, February 08, 2004 10:04 PM
| To: gabor; mono-list@lists.ximian.com
| Subject: Re: [Mono-list] unicode trouble
|
| Hi Gabor,
| I think you're confused. Characters in .NET are 16 bits
| BECAUSE they are unicode. 16 bits = 2 bytes = 65536 values.
|
| a way to check that is simple. here's some C# example code:
|
| string s = "a";
| s += (char)10300;
|
| Console.WriteLine("s = " + s);
| Console.WriteLine("len = " + s.Length);
|
| for (int i = 0; i < s.Length; i++ ) {
| Console.WriteLine("s["+i+"] = " + (int)s[i]);
| }
|
| max
|
| On Sunday 08 February 2004 15:19, gabor wrote:
| > hi,
| >
| > as i understand, characters in .net are 16-bit values.
| >
| > but what about unicode characters, that are simply above the 16-bit
| > limit?
| >
| > for example:
| > OLD ITALIC LETTER A (unicode code: 10300).
| >
| > how do you represent those in .net?
| >
| > i tried to open a textfile containing this old-italic-a:
| >
| > - the length and indexing methods of string all said that
| > old-italic-a is actually 2 letters => it doesn't work
| > - when writing the string back to an utf8 encoded textfile, then it
| > was correctly written.
| >
| > so for me it seems that dotnet (mono) uses utf16 as internal
| > encoding format, but indexing (and length) doesn't use that
| > information.
| >
| > am i correct?
| >
| > are there any ways to handle those characters in dotnet?
| >
| > for example the new java-1.5 contains some new string-methods that
| > can handle these characters. it's not perfect in java, but at least
| > there is something.
| >
| > if someone wants to play with it, i attached a text file containing
| > the text "marrakesh", encoded in utf8, where i replaced the first
| > "a" with old-italic-a (it's easy to do with a little iconv to-from
| > ucs4 and hexedit)
| >
| > thanks,
| > gabor farkas