[Mono-list] unicode trouble

gabor gabor@z10n.net
Mon, 09 Feb 2004 00:19:03 +0100


as i understand, characters in .net are 16-bit values.

but what about unicode characters that are simply above the 16-bit range?

for example:
OLD ITALIC LETTER A (unicode code point U+10300).

how do you represent those in .net?

i tried to open a textfile containing this old-italic-a:

- the length and indexing methods of string all said that old-italic-a
is actually 2 letters => it doesn't work
- when writing the string back to a utf8-encoded textfile, it was
written correctly.

so it seems to me that dotnet (mono) uses utf16 as its internal encoding
format, but indexing (and length) don't take surrogate pairs into account.
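
to make the arithmetic concrete, here is a small sketch of how a code point
above U+FFFF is split into two 16-bit utf16 units. it is written in java
only because java strings are also sequences of 16-bit units; the class and
method names are made up for illustration, and this is the standard utf16
rule, not something taken from the mono sources.

```java
// sketch: split a supplementary code point into a utf16 surrogate pair
class SurrogateSketch {
    // returns {high surrogate, low surrogate} for a code point in U+10000..U+10FFFF
    static char[] toSurrogates(int codePoint) {
        int offset = codePoint - 0x10000;                // 20 bits remain
        char high = (char) (0xD800 + (offset >> 10));    // top 10 bits
        char low = (char) (0xDC00 + (offset & 0x3FF));   // bottom 10 bits
        return new char[] { high, low };
    }

    public static void main(String[] args) {
        char[] pair = toSurrogates(0x10300);             // OLD ITALIC LETTER A
        System.out.printf("%04X %04X%n", (int) pair[0], (int) pair[1]); // D800 DF00
        System.out.println(new String(pair).length());   // 2
    }
}
```

so one character is stored as the two units D800 DF00, which would explain
why length and indexing report 2.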

am i correct?

are there any ways to handle those characters in dotnet?

for example the new java-1.5 contains some new string-methods that can
handle these characters. it's not perfect in java, but at least there is
something.
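
a minimal sketch of the java-1.5 methods i mean (codePointCount,
codePointAt, offsetByCodePoints and Character.toChars all exist since 1.5;
the class name and the sample() helper are just for illustration):

```java
// sketch: code-point-aware String methods added in java 1.5
class CodePointSketch {
    // "marrakesh" with the first "a" replaced by U+10300
    static String sample() {
        return "m" + new String(Character.toChars(0x10300)) + "rrakesh";
    }

    public static void main(String[] args) {
        String s = sample();
        System.out.println(s.length());                            // 10 utf16 units
        System.out.println(s.codePointCount(0, s.length()));       // 9 code points
        System.out.println(Integer.toHexString(s.codePointAt(1))); // 10300
        // utf16 index of the third code point, stepping over the surrogate pair
        int i = s.offsetByCodePoints(0, 2);
        System.out.println(i + " " + s.charAt(i));                 // 3 r
    }
}
```

charAt and length still work on 16-bit units, so the caller has to convert
between code point positions and unit indexes by hand.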

if someone wants to play with it, i attached a text file containing the
text "marrakesh", encoded in utf8, where i replaced the first "a" with
old-italic-a.
(it's easy to do with a little iconv to-from ucs4 and hexedit)
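
for anyone who would rather generate the bytes than hex-edit them, a rough
sketch in java follows. this is not how the attachment was made (that was
iconv and hexedit); the class and helper names are made up, and it only
prints the utf8 bytes instead of writing a file.

```java
import java.nio.charset.StandardCharsets;

// sketch: utf8 bytes of "marrakesh" with the first "a" replaced by U+10300
class Utf8Sketch {
    static byte[] encode() {
        String s = "m" + new String(Character.toChars(0x10300)) + "rrakesh";
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // formats bytes as space-separated uppercase hex
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+10300 takes four bytes in utf8: F0 90 8C 80
        System.out.println(hex(encode()));
    }
}
```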

gabor farkas

Content-Disposition: attachment; filename=marrakesh.txt