[Mono-dev] Patch to boost speed of UnicodeEncoding

Sat Mar 11 07:09:46 EST 2006

Alright guys,

Here is a cool (and still incomplete) patch I'm working on to speed up 
System.Text.UnicodeEncoding. I'd like everyone's opinions to make sure 
this is sane before I finish it.

I was tinkering with this idea. Since strings are already stored in 
memory as UTF-16 code units, converting them one char at a time in a 
while loop, as we do now, was really bothering me. Directly copying 
what's in memory seems a bit more sane. I don't want to make it sound 
that easy, because it isn't (and that may be why it wasn't done like 
this when it was first written). :-P

The biggest problem is that UnicodeEncoding can be big-endian or 
little-endian, so I added a check that compares the system's byte order 
('BitConverter.IsLittleEndian') with the byte order of the current 
Encoding instance (the 'bigEndian' bool field). If they don't match, we 
fall back to the method we already use. (Is that right? Is the internal 
UTF-16 representation of our strings specific to the endianness of the 
system? I assumed yes here, but if it's not, it's a simple change to 
comment the check out.)
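
To make the byte-order check concrete, here is a rough standalone 
sketch (not part of the patch). The class and method names are made up 
for illustration, and it uses the public Buffer.BlockCopy as a stand-in 
for the internal memcpy in String.cs. It assumes, as I do above, that a 
raw copy of a char[] yields bytes in the system's native byte order.

```csharp
using System;

static class EndianSketch
{
    // Hypothetical helper: true when a raw memory copy already produces
    // the byte order this encoding asks for.
    public static bool CanBlockCopy(bool bigEndian)
    {
        return BitConverter.IsLittleEndian != bigEndian;
    }

    // Encodes chars as UTF-16 in the requested byte order.
    public static byte[] GetBytes(char[] chars, bool bigEndian)
    {
        byte[] bytes = new byte[chars.Length * 2];
        if (CanBlockCopy(bigEndian)) {
            // Raw copy: the result is in native byte order, which matches.
            Buffer.BlockCopy(chars, 0, bytes, 0, bytes.Length);
        } else {
            // Byte orders differ: fall back to the per-char loop.
            int posn = 0;
            foreach (char ch in chars) {
                if (bigEndian) {
                    bytes[posn++] = (byte)(ch >> 8);
                    bytes[posn++] = (byte)ch;
                } else {
                    bytes[posn++] = (byte)ch;
                    bytes[posn++] = (byte)(ch >> 8);
                }
            }
        }
        return bytes;
    }
}
```

Either branch should give the same bytes for a given 'bigEndian' value; 
only the speed differs.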

Also, since the memcpy function in String.cs uses some unsafe code, 
taking a possible hit for that on a really small string seems silly, 
so I put in a condition: if the char count is less than or equal to 
10, use the existing method. (Maybe the 10-char cutoff should be 
adjusted, or is that idea silly?)
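
One way to pick the cutoff would be to time both approaches at a few 
lengths. The following is only a rough sketch: the names are 
hypothetical, Buffer.BlockCopy stands in for the internal memcpy, and 
it does not include the temporary String allocation the patch 
currently does. The numbers will vary by machine and runtime, so treat 
them as a hint rather than a result.

```csharp
using System;
using System.Diagnostics;

static class ThresholdSketch
{
    // Per-char loop producing little-endian output, like the existing code.
    public static void LoopCopy(char[] chars, byte[] bytes)
    {
        int posn = 0;
        for (int i = 0; i < chars.Length; i++) {
            char ch = chars[i];
            bytes[posn++] = (byte)ch;
            bytes[posn++] = (byte)(ch >> 8);
        }
    }

    // Raw copy in native byte order (stand-in for the memcpy shortcut).
    public static void BlockCopy(char[] chars, byte[] bytes)
    {
        Buffer.BlockCopy(chars, 0, bytes, 0, chars.Length * 2);
    }

    // Returns elapsed Stopwatch ticks for 'iterations' conversions.
    public static long TimeTicks(bool useBlock, int length, int iterations)
    {
        char[] chars = new string('x', length).ToCharArray();
        byte[] bytes = new byte[length * 2];
        Stopwatch sw = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++) {
            if (useBlock)
                BlockCopy(chars, bytes);
            else
                LoopCopy(chars, bytes);
        }
        sw.Stop();
        return sw.ElapsedTicks;
    }

    // Prints loop vs. block timings for a few candidate lengths.
    public static void Report()
    {
        foreach (int len in new int[] { 4, 8, 10, 16, 32, 64 }) {
            Console.WriteLine("{0,4} chars: loop={1} block={2}", len,
                TimeTicks(false, len, 1000000),
                TimeTicks(true, len, 1000000));
        }
    }
}
```

Calling ThresholdSketch.Report() prints one line per length; the 
cutoff would be roughly where the block copy starts to win.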

Below is an unfinished sample of my idea. Of course I will have to 
reverse this logic for GetChars() (instead of GetBytes() below) and 
finish the overloads of System.Text.UnicodeEncoding's GetBytes and 
GetChars methods, but I want to see what everyone thinks.
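
For the GetChars() direction, the reversed logic might look roughly 
like this. Again this is a standalone sketch with hypothetical names, 
using Buffer.BlockCopy instead of the internal memcpy. It skips 
argument validation and simply ignores a trailing odd byte, which the 
real method would have to handle properly.

```csharp
using System;

static class GetCharsSketch
{
    // Decodes UTF-16 bytes in the given byte order into chars.
    public static char[] GetChars(byte[] bytes, int byteIndex,
                                  int byteCount, bool bigEndian)
    {
        int charCount = byteCount / 2; // a trailing odd byte is dropped here
        char[] chars = new char[charCount];
        if (BitConverter.IsLittleEndian != bigEndian) {
            // The bytes are already in native order: copy them straight in.
            Buffer.BlockCopy(bytes, byteIndex, chars, 0, charCount * 2);
        } else {
            // Byte orders differ: assemble each char by hand.
            int posn = byteIndex;
            for (int i = 0; i < charCount; i++) {
                int first = bytes[posn++];
                int second = bytes[posn++];
                chars[i] = bigEndian
                    ? (char)((first << 8) | second)
                    : (char)((second << 8) | first);
            }
        }
        return chars;
    }
}
```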


Index: System/String.cs
===================================================================

--- System/String.cs    (revision 57749)
+++ System/String.cs    (working copy)
@@ -1746,6 +1746,22 @@
                        return n;
                }

+
+               internal unsafe int InternalStrToByteArr(Byte[] tmp, int offset)
+               {
+                       //shortcut function for System.Text.UnicodeEncoding
+
+                       //byte[] tmp = new Byte[this.length * 2];
+                       fixed (byte* dest = tmp){
+                               fixed (char* src = this)
+                               {
+                                       memcpy ((byte*) (dest+offset), (byte*)src, this.length * 2);
+                               }
+                       }
+                       return this.length * 2;
+
+               }
+
                internal unsafe void InternalSetChar (int idx, char val)
                {
                        if ((uint) idx >= (uint) Length)
Index: System.Text/UnicodeEncoding.cs
===================================================================
--- System.Text/UnicodeEncoding.cs      (revision 57749)
+++ System.Text/UnicodeEncoding.cs      (working copy)
@@ -123,22 +123,36 @@
                if ((bytes.Length - byteIndex) < (charCount * 2)) {
                        throw new ArgumentException (_("Arg_InsufficientSpace"));
                }
-               int posn = byteIndex;
+
+               int retval;
                char ch;
-               if (bigEndian) {
-                       while (charCount-- > 0) {
-                               ch = chars[charIndex++];
-                               bytes[posn++] = (byte)(ch >> 8);
-                               bytes[posn++] = (byte)ch;
-                       }
-               } else {
-                       while (charCount-- > 0) {
-                               ch = chars[charIndex++];
-                               bytes[posn++] = (byte)ch;
-                               bytes[posn++] = (byte)(ch >> 8);
-                       }
+
+               // Shortcut the encoding process if the system's byte order
+               // matches this encoding's; otherwise use the byte-by-byte method
+               // (also for really small strings, where the shortcut can hurt us)
+               if (BitConverter.IsLittleEndian == bigEndian || charCount <= 10) {
+                       int posn = byteIndex;
+                       if (bigEndian) {
+                               while (charCount-- > 0) {
+                                       ch = chars[charIndex++];
+                                       bytes[posn++] = (byte)(ch >> 8);
+                                       bytes[posn++] = (byte)ch;
+                               }
+                       } else {
+                               while (charCount-- > 0) {
+                                       ch = chars[charIndex++];
+                                       bytes[posn++] = (byte)ch;
+                                       bytes[posn++] = (byte)(ch >> 8);
+                               } //
+
+                       } //
+
+                       retval = posn - byteIndex;
                }
-               return posn - byteIndex;
+               else {
+                       retval = (new String(chars,charIndex,charCount)).InternalStrToByteArr(bytes, byteIndex);
+               }
+               return retval;
        }



-- 
Zac Bowling
http://zacbowling.com/