[Mono-list] unicode trouble

max aranym@adelphia.net
Sun, 8 Feb 2004 20:03:33 -0800


Hi Gabor,
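I think there's some confusion here. Characters in .NET are 16 bits because 
a char is one UTF-16 code unit, not a whole Unicode code point. 
16 bits = 2 bytes = 65536 values, so a code point above U+FFFF (like 
U+10300) is stored as two chars, a surrogate pair.

A simple way to check that is with some C# example code: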

      string s = "a";
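      // 10300 is a hex code point (U+10300); it does not fit in one 16-bit
      // char, so append it with the \U escape instead of a (char) cast
      s += "\U00010300";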

      Console.WriteLine("s = " + s);
      Console.WriteLine("len = " + s.Length);

      for (int i = 0; i < s.Length; i++ ) {
         Console.WriteLine("s["+i+"] = " + (int)s[i]);
      }
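
If you want to walk the string by code point rather than by char, a rough sketch follows. It assumes a runtime that provides char.ConvertToUtf32 and char.IsHighSurrogate (these are an assumption about your version; older runtimes may not have them), and the test string is just "marrakesh" with the first "a" swapped for U+10300, built here with an escape instead of read from the attached file.

```csharp
using System;

class CodePoints {
    static void Main() {
        // "m" + OLD ITALIC LETTER A (U+10300) + "rrakesh"
        string s = "m\U00010300rrakesh";

        int count = 0;
        for (int i = 0; i < s.Length; i++) {
            // For a surrogate pair starting at i this returns the full
            // code point; for any other char it returns the char's value.
            int cp = char.ConvertToUtf32(s, i);
            Console.WriteLine("code point " + count + " = U+" + cp.ToString("X4"));

            // Skip the low surrogate that belongs to the pair just read.
            if (char.IsHighSurrogate(s[i])) i++;
            count++;
        }

        Console.WriteLine("chars = " + s.Length);     // UTF-16 code units
        Console.WriteLine("code points = " + count);  // one per letter
    }
}
```

With that string the char count should come out one higher than the code point count, since only the old-italic-a takes two chars. Note that ConvertToUtf32 throws on an unpaired surrogate, so malformed input would need a try/catch or an explicit check.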

max

On Sunday 08 February 2004 15:19, gabor wrote:
> hi,
>
> as i understand, characters in .net are 16-bit values.
>
> but what about unicode characters, that are simply above the 16-bit
> limit?
>
> for example:
> OLD ITALIC LETTER A (unicode code: 10300).
>
> how do you represent those in .net?
>
> i tried to open a textfile containing this old-italic-a:
>
> - the length and indexing methods of string all said that old-italic-a
> is actually 2 letters => it doesn't work
> - when writing the string back to a utf8 encoded textfile, it was
> correctly written.
>
> so for me it seems that dotnet (mono) uses utf16 as internal encoding
> format, but indexing (and length) doesn't use that information.
>
> am i correct?
>
> are there any ways to handle those characters in dotnet?
>
> for example the new java-1.5 contains some new string-methods that can
> handle these characters. it's not perfect in java, but at least there is
> something.
>
> if someone wants to play with it, i attached a text file containing the
> text "marrakesh", encoded in utf8, where i replaced the first "a" with
> old-italic-a
> (it's easy to do with a little iconv to-from ucs4 and hexedit)
>
> thanks,
> gabor farkas