[Mono-list] unicode trouble

Chris Mullins cmullins@winfessor.com
Sun, 8 Feb 2004 22:37:45 -0800


After re-reading it, my post isn't too clear.=20

Basically I was trying to say that Unicode CodePoints can be encoded
using UTF8 or UTF16 regardless if that are > 0xFFFF or not.=20

.NET has the ability to:=20
1) Iterate over strings by graphemes so that regardless of encoding,
developers can treat Unicode combining characters and surrogate pairs as
a single entity.=20

2) Build and manipulate strings that consist of any currently defined
Unicode CodePoint. While creating a grapheme for a CodePoint >0xFFFF is
tricky, once the grapheme is properly encoded into a string, any of the
standard string manipulations can be used to append it with other
strings, or otherwise manipulate it.=20

Both encodings in .NET, UTF8 and UTF16, have this ability. Many people
incorrectly interpret UTF8 to "generally be 1 byte long for western
languages and 2 bytes for Eastern Languages" when the reality is that
when UTF8 encodes a CodePoint the resulting encoding is sometimes 1, 2,
3, 4 or (I think) 5 bytes long. UTF16 has similar properties.=20

--=20
Chris Mullins

-----Original Message-----
From: Chris Mullins=20
Sent: Sunday, February 08, 2004 9:58 PM
To: mono-list@lists.ximian.com
Subject: RE: [Mono-list] unicode trouble


(Forgiveness please, in advance, for the VB.NET code samples that follow
- I'm too tired to port them over to C# right now...)

Fergus Henderson Wrote:

> Unfortunately Windows, Java, and .NET all use 16-bit=20
> characters. That means that they must either (a) use=20
> UCS-2 encoding, i.e. don't support the new unicode=20
> characters such as "OLD ITALIC LETTER A"; or (b)=20
> use UTF-16 encoding, which means that these=20
> characters which don't fit in 16 bits get represented as=20
> a pair of 16-bit codes.

Strings in .NET, have no trouble representing Unicode code points above
0xFFFF. In fact, at least in the Microsoft Implementation (I haven't
tried my code yet in Mono), this is handled nearly flawlessly.=20

The encodings used by .NET (UTF8, UTF16) have no trouble representing
CodePoints over 0xFFFF, and in fact do so exactly as defined by the
Unicode spec.=20

If your goal is to iterate over a .NET string, you should use the .NET
System.Globalization.StringInfo class - this class gives you the ability
to iterate by graphemes, rather than single 'characters'. Thus is your
Unicode string has 3 "displayable" characters, iterating by grapheme
will let you see all 3 characters characters.

(All the following code samples assume UTF8 encoding - it's the same for
a UTF16 encoding of Unicode. (Anyone want to implements a UTF32 encoding
for .NET? That would sure make this stuff a bit easier!))

The following code shows how to iterate over a String, regardless of the
Unicode CodePoints that are contained in the string:

Public Shared Function Prohibit(ByVal stringToTest As String) As Boolean
        'ensure the string isn't too long
        Dim bytes() As Byte =3D
System.Text.Encoding.UTF8.GetBytes(stringToTest)
      =20
        Dim si As New System.Globalization.StringInfo
        Dim myTEE As System.Globalization.TextElementEnumerator =3D
si.GetTextElementEnumerator(stringToTest)

        myTEE.Reset()
        While myTEE.MoveNext()
            Dim CodePoint As Integer

            Dim grapheme As String =3D myTEE.GetTextElement
            If grapheme.Length > 1 Then
                Dim uc As Char =3D grapheme.Chars(0)
                Dim lc As Char =3D grapheme.Chars(1)

                CodePoint =3D ((AscW(uc) - &HD800) * &H400) + AscW(lc) -
&HDC00 + &H10000
            Else
                CodePoint =3D AscW(grapheme)
            End If

            '*** Do something here with the Codepoint...
		'*** (like check the code against the profiles defined
		'*** prohibit tables).=20

        End While

        Return True
    End Function=20

I've implemented StrinPrep (RFC 3454) in .NET, and while it was painful,
the capabilities are there to support it.=20

To create a .NET string from a Unicode code point, use the following
VB.NET code (if you port it to C# be careful to stick with the
System.Math.DivRem statement - or else integer rounding will cause all
sorts of problems.=20

Public Shared Function UnicodeCodepointToString(ByVal codePoint As
Int32) As String
	If codePoint <=3D &HFFFF Then
      	Return String.Concat(ChrW(codePoint))
      Else
      	Dim remainder As Integer
            Dim intDivide As Integer =3D System.Math.DivRem(codePoint -
&H10000, &H400, remainder)

            Dim H As Integer =3D System.Convert.ToInt32(intDivide +
&HD800)
            Dim L As Integer =3D System.Convert.ToInt32(remainder +
&HDC00)
            Return String.Concat(System.Convert.ToChar(H),
System.Convert.ToChar(L))
        End If
End Function

This algorithm was adapted from the Surrogate Encoding algorithm
presented in Chapter 3.7.D25 of the Unicode handbook:=20
	http://www.unicode.org/book/ch03.pdf


--=20
Chris Mullins

_______________________________________________
Mono-list maillist  -  Mono-list@lists.ximian.com
http://lists.ximian.com/mailman/listinfo/mono-list