[Mono-list] unicode trouble

Chris Mullins cmullins@winfessor.com
Sun, 8 Feb 2004 21:58:26 -0800


(Forgiveness please, in advance, for the VB.NET code samples that follow
- I'm too tired to port them over to C# right now...)

Fergus Henderson Wrote:

> Unfortunately Windows, Java, and .NET all use 16-bit
> characters. That means that they must either (a) use
> UCS-2 encoding, i.e. don't support the new unicode
> characters such as "OLD ITALIC LETTER A"; or (b)
> use UTF-16 encoding, which means that these
> characters which don't fit in 16 bits get represented as
> a pair of 16-bit codes.

Strings in .NET have no trouble representing Unicode code points above
0xFFFF. In fact, at least in the Microsoft implementation (I haven't
tried my code yet in Mono), this is handled nearly flawlessly.

The encodings used by .NET (UTF8, UTF16) have no trouble representing
code points over 0xFFFF, and in fact do so exactly as defined by the
Unicode spec.
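
As a small illustration of the point above, here is a minimal sketch
(the module name EncodingDemo is made up for this example) that builds
the "OLD ITALIC LETTER A" (U+10300) from the quoted text as a surrogate
pair and prints how .NET sees it:

```vbnet
Imports System
Imports System.Text
Imports Microsoft.VisualBasic

Module EncodingDemo
    Sub Main()
        ' U+10300 OLD ITALIC LETTER A as a UTF-16 surrogate pair (D800 DF00).
        Dim s As String = ChrW(&HD800) & ChrW(&HDF00)

        ' Length counts 16-bit code units, not code points.
        Console.WriteLine(s.Length)

        ' The same code point as UTF-8 bytes, then as UTF-16 (little-endian) bytes.
        Console.WriteLine(BitConverter.ToString(Encoding.UTF8.GetBytes(s)))
        Console.WriteLine(BitConverter.ToString(Encoding.Unicode.GetBytes(s)))
    End Sub
End Module
```

This should print 2, then F0-90-8C-80 (the four-byte UTF-8 form), then
00-D8-00-DF (the surrogate pair in little-endian UTF-16).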

If your goal is to iterate over a .NET string, you should use the .NET
System.Globalization.StringInfo class - this class gives you the ability
to iterate by graphemes, rather than by single 16-bit 'characters'. Thus
if your Unicode string has 3 "displayable" characters, iterating by
grapheme will let you see all 3 characters.

(All the following code samples assume UTF8 encoding - it's the same for
a UTF16 encoding of Unicode. (Anyone want to implement a UTF32 encoding
for .NET? That would sure make this stuff a bit easier!))

The following code shows how to iterate over a String, regardless of the
Unicode CodePoints that are contained in the string:

Public Shared Function Prohibit(ByVal stringToTest As String) As Boolean
    'ensure the string isn't too long
    Dim bytes() As Byte = _
        System.Text.Encoding.UTF8.GetBytes(stringToTest)

    'GetTextElementEnumerator is Shared, so no StringInfo instance is needed
    Dim myTEE As System.Globalization.TextElementEnumerator = _
        System.Globalization.StringInfo.GetTextElementEnumerator(stringToTest)

    myTEE.Reset()
    While myTEE.MoveNext()
        Dim CodePoint As Integer

        Dim grapheme As String = myTEE.GetTextElement()
        If grapheme.Length > 1 AndAlso Char.IsSurrogate(grapheme, 0) Then
            'surrogate pair: combine the high and low 16-bit code units
            Dim uc As Char = grapheme.Chars(0)
            Dim lc As Char = grapheme.Chars(1)

            CodePoint = ((AscW(uc) - &HD800) * &H400) + _
                        AscW(lc) - &HDC00 + &H10000
        Else
            'BMP character (any trailing combining marks are ignored here)
            CodePoint = AscW(grapheme)
        End If

        '*** Do something here with the Codepoint...
        '*** (like check the code against the profiles defined
        '*** prohibit tables).

    End While

    Return True
End Function

I've implemented Stringprep (RFC 3454) in .NET, and while it was painful,
the capabilities are there to support it.

To create a .NET string from a Unicode code point, use the following
VB.NET code (if you port it to C#, be careful to stick with the
System.Math.DivRem statement, or at least with a truncating integer
division - otherwise rounding will cause all sorts of problems).

Public Shared Function UnicodeCodepointToString( _
        ByVal codePoint As Int32) As String
    If codePoint <= &HFFFF Then
        Return String.Concat(ChrW(codePoint))
    Else
        Dim remainder As Integer
        Dim intDivide As Integer = _
            System.Math.DivRem(codePoint - &H10000, &H400, remainder)

        Dim H As Integer = System.Convert.ToInt32(intDivide + &HD800)
        Dim L As Integer = System.Convert.ToInt32(remainder + &HDC00)
        Return String.Concat(System.Convert.ToChar(H), _
                             System.Convert.ToChar(L))
    End If
End Function

This algorithm was adapted from the Surrogate Encoding algorithm
presented in Chapter 3.7.D25 of the Unicode handbook:
	http://www.unicode.org/book/ch03.pdf
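
To sanity-check the arithmetic, here is a minimal worked sketch of the
same surrogate math on plain integers, in both directions. The module
and function names (SurrogateDemo, ToSurrogates, FromSurrogates) are
hypothetical and only for illustration; it uses VB's integer-division
operator (\) in place of DivRem, which truncates the same way for
non-negative values.

```vbnet
Imports System

Module SurrogateDemo
    ' Hypothetical helper: split a code point above &HFFFF into
    ' its high and low UTF-16 surrogate values.
    Function ToSurrogates(ByVal codePoint As Integer) As Integer()
        Dim offset As Integer = codePoint - &H10000
        Dim high As Integer = &HD800 + (offset \ &H400)   ' \ is integer division
        Dim low As Integer = &HDC00 + (offset Mod &H400)
        Return New Integer() {high, low}
    End Function

    ' Hypothetical helper: recombine a surrogate pair into a code point.
    Function FromSurrogates(ByVal high As Integer, ByVal low As Integer) As Integer
        Return ((high - &HD800) * &H400) + (low - &HDC00) + &H10000
    End Function

    Sub Main()
        ' U+10300 OLD ITALIC LETTER A
        Dim pair() As Integer = ToSurrogates(&H10300)
        Console.WriteLine("{0:X4} {1:X4}", pair(0), pair(1))
        Console.WriteLine("{0:X}", FromSurrogates(pair(0), pair(1)))
    End Sub
End Module
```

For U+10300 the offset is &H300, so the pair comes out as D800 DF00,
and recombining it gives back 10300.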


-- 
Chris Mullins