[Mono-dev] [PATCH] Add GetString to UnicodeEncoding 2.0 and modify some Encoding wrappers

Kornél Pál kornelpal at gmail.com
Wed Apr 12 09:09:57 EDT 2006


For new byte[1]:

The following code, for example (and other calls with an empty output 
buffer), has to throw ArgumentException (not the IndexOutOfRangeException 
that &bytes[0] throws):

Encoding e = Encoding.GetEncoding(1252);
e.GetBytes(new char[] {'a'}, 0, 1, new byte[] {}, 0);
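A minimal NUnit check for this contract could look like the following sketch (the fixture and test names are illustrative, not from the patch):

```csharp
using System;
using System.Text;
using NUnit.Framework;

[TestFixture]
public class EncodingBufferTest
{
	// With an empty output buffer, GetBytes must fail up front with
	// ArgumentException instead of faulting inside unsafe code with
	// IndexOutOfRangeException.
	[Test]
	[ExpectedException (typeof (ArgumentException))]
	public void GetBytesEmptyOutputBuffer ()
	{
		Encoding e = Encoding.GetEncoding (1252);
		e.GetBytes (new char [] {'a'}, 0, 1, new byte [0], 0);
	}
}
```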

I hadn't previously noticed the following line in MonoEncoding:

if (bytes.Length - byteIndex < charCount)
	throw new ArgumentException (Strings.GetString ("Arg_InsufficientSpace"), "bytes");

This check eliminates the IndexOutOfRangeException, so no new byte[1] is 
required. However, I'm not sure it's correct or safe to assume that 
GetByteCount returns charCount: for DBCS it can be more, and when no 
fallback character is used (invalid bytes are ignored) it can be less as 
well. As MonoEncoding is the base encoding class of I18N, it may be better 
to move this check to a higher level, use new byte[1] in MonoEncoding, and 
let GetBytesImpl check the buffer size.

I haven't got enough time for the other points now, but I'm happy that we 
actually have very similar opinions, and I like discussing things with 
you. :)

Kornél

----- Original Message ----- 
From: "Atsushi Eno" <atsushi at ximian.com>
To: "Kornél Pál" <kornelpal at gmail.com>
Cc: <mono-devel-list at lists.ximian.com>
Sent: Wednesday, April 12, 2006 2:04 PM
Subject: Re: [Mono-dev] [PATCH] Add GetString to UnicodeEncoding 2.0 and 
modify some Encoding wrappers


> Hi,
>
> Here's the real answer.
>
> I might not be fully understanding you, but if you are saying that
> your current patch as is should be applied, then it's no-go (due
> to the big difference in ASCII and Unicode as you showed us).
>
> Note that I'm not saying that performance always matters more than
> compatibility (I'm actually more anti-optimization than most). If there
> is a way to achieve this "compatibility" without harming performance,
> that would be awesome. I'm just not for *extremism*. The reason you
> once convinced me was not that the evidence came as numbers but that
> the differences were significant.
>
> (Hey, there is no doubt that I love your detailed analysis BTW :-)
>
> I agree with you that we may as well feel free to override virtual
> members where doing so does not result in MissingMethodException (but
> that might be only me).
>
> For the individual changes other than that performance loss, there is
> certain goodness in your patches. But for some I'm not convinced
> (such as passing new byte [1]), because you don't provide NUnit tests
> that demonstrate them.
>
> If you don't write any, I will create some for the changes I am
> convinced of. But as I wrote in my first reply, the difference is
> so minor that it is low priority for me.
>
> BTW thanks for the decent tester code. It convinced me that there are
> still some optimizable things.
>
> Atsushi Eno
>
> Kornél Pál wrote:
>> Hi,
>>
>> I've done some tests:
>> New 1.1:
>> UnicodeEncoding: 6750
>> ASCIIEncoding: 18609
>> UTF8Encoding: 9922
>> CP932: 14641
>>
>> New 2.0:
>> UnicodeEncoding: 13594
>> ASCIIEncoding: 19562
>> UTF8Encoding: 16625
>> CP932: 38906
>>
>> Old 1.1:
>> UnicodeEncoding: 6906
>> ASCIIEncoding: 18859
>> UTF8Encoding: 10062
>> CP932: 21719
>>
>> Old 2.0:
>> UnicodeEncoding: 6750
>> ASCIIEncoding: 7297
>> UTF8Encoding: 16719
>> CP932: 45469
>>
>> I have the following conclusion:
>>
>> UnicodeEncoding in 2.0 is slower because GetBytes(string) is not 
>> overridden. But performance is improved in 1.1, where the overridden 
>> implementation is optimized for UnicodeEncoding.
>>
>> In ASCIIEncoding you can see the drawback of doing optimizations in the 
>> Encoding class: the current code is only faster on 2.0. With the new 
>> code, 1.1 didn't change because it doesn't use unsafe code.
>>
>> There is no change in UTF8Encoding (or only a minimal improvement).
>>
>> CP932 is faster because optimization is done in MonoEncoding.
>>
>> My conclusion is that Encoding should be MS.NET compatible because it's 
>> the class users are most likely to use directly. And no improvement can 
>> be made in the 1.1 profile anyway, because there are no unsafe methods 
>> there, so there is no point in sacrificing compatibility for performance.
>>
>> I think the best solution for encoding optimization is a single unsafe 
>> implementation of each piece of functionality (GetBytes, GetChars, 
>> GetByteCount, GetCharCount), with the other overloads (string, char[], 
>> byte[]) calling that single implementation. This also makes the code 
>> more maintainable. It is what I've done in UnicodeEncoding.
>>
>> And I think the derived public encoding classes are the place where we 
>> shouldn't care about MS.NET compatibility: we should override as many 
>> methods as we need, even if they aren't overridden in MS.NET. (For 
>> private encoding classes, layout compatibility is not a requirement.)
>>
>> For example, if I remove the !NET_2_0 and NET_2_0 conditions from 
>> GetBytes(string) and GetString(byte[], int, int) in UnicodeEncoding, a 
>> significant performance improvement can be achieved in all profiles.
>>
>> Is this "deal" acceptable? If you have any objections please let me know.
>>
>> Kornél 
