[Mono-list] unicode trouble

gabor gabor@z10n.net
Mon, 09 Feb 2004 00:19:03 +0100


hi,

as i understand, characters in .net are 16-bit values.

but what about unicode characters that lie above the 16-bit limit?

for example:
OLD ITALIC LETTER A (unicode code point: U+10300).

how do you represent those in .net?

i tried to open a textfile containing this old-italic-a:

- the length and indexing methods of string both treated old-italic-a
as 2 characters, so that doesn't work.
- when i wrote the string back to a utf8 encoded textfile, it was
written correctly.

so it seems to me that dotnet (mono) uses utf16 as its internal encoding,
but indexing (and length) count 16-bit units and ignore surrogate pairs.
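
a minimal sketch of the utf16 arithmetic behind this, written in java only because java strings are also sequences of 16-bit units; the class and method names are made up for illustration, and this is not taken from the mono sources:

```java
// sketch: how a code point above U+FFFF is split into two 16-bit units in utf16
class SurrogateSketch {
    // high (lead) surrogate: top 10 bits of (codePoint - 0x10000), offset by 0xD800
    static char highSurrogate(int codePoint) {
        return (char) (0xD800 + ((codePoint - 0x10000) >> 10));
    }

    // low (trail) surrogate: bottom 10 bits of (codePoint - 0x10000), offset by 0xDC00
    static char lowSurrogate(int codePoint) {
        return (char) (0xDC00 + ((codePoint - 0x10000) & 0x3FF));
    }

    public static void main(String[] args) {
        int oldItalicA = 0x10300; // OLD ITALIC LETTER A
        String s = "" + highSurrogate(oldItalicA) + lowSurrogate(oldItalicA);
        // prints "D800 DF00 length=2": one character, two 16-bit units
        System.out.printf("%04X %04X length=%d%n",
                (int) s.charAt(0), (int) s.charAt(1), s.length());
    }
}
```

if the string type simply counts 16-bit units, a length of 2 for this single character is what one would expect.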

am i correct?

are there any ways to handle those characters in dotnet?

for example, java 1.5 adds some new string methods that can handle
these characters. it's not perfect in java, but at least there is
something.
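
to illustrate, a small sketch of the kind of java 1.5 methods meant here (codePointCount, codePointAt, Character.charCount, StringBuilder.appendCodePoint). the class name and the two helper methods are made up for the example:

```java
// sketch: counting and walking code points instead of 16-bit units
class CodePointSketch {
    // builds "marrakesh" with the first "a" replaced by U+10300
    static String sample() {
        return new StringBuilder("m").appendCodePoint(0x10300).append("rrakesh").toString();
    }

    // number of unicode code points, as opposed to 16-bit units
    static int codePointLength(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String s = sample();
        System.out.println(s.length());         // 10: counts 16-bit units
        System.out.println(codePointLength(s)); // 9: counts code points
        // walk the string one code point at a time, stepping 1 or 2 units
        for (int i = 0; i < s.length(); i += Character.charCount(s.codePointAt(i))) {
            System.out.printf("U+%04X%n", s.codePointAt(i));
        }
    }
}
```

note that charAt and length still work in 16-bit units, so the caller has to remember to use the code point variants.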

if someone wants to play with it, i attached a text file containing the
text "marrakesh", encoded in utf8, where i replaced the first "a" with
old-italic-a.
(it's easy to do with a little iconv to-from ucs4 and hexedit)
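
as an alternative to iconv and hexedit, the same bytes can probably be produced from code. a rough sketch in java, assuming the file holds just the text with no trailing newline; the class and method names are invented for the example:

```java
// sketch: produce the utf8 bytes of "marrakesh" with U+10300 as the first "a"
class Utf8Sketch {
    static byte[] sampleBytes() {
        String s = "m" + new String(Character.toChars(0x10300)) + "rrakesh";
        return s.getBytes(java.nio.charset.StandardCharsets.UTF_8);
    }

    // formats bytes as lowercase hex pairs separated by spaces
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        // prints "6d f0 90 8c 80 72 72 61 6b 65 73 68"
        System.out.println(hex(sampleBytes()));
    }
}
```

the four bytes f0 90 8c 80 are the utf8 form of U+10300, so the file is 12 bytes for 9 characters.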

thanks,
gabor farkas

attachment: marrakesh.txt

m𐌀rrakesh
