[Mono-dev] [PATCH] Implement internal Encodings using unified code base

Thu Mar 20 10:32:58 EDT 2008

Hi,

>From: "Paolo Molaro" <lupus at ximian.com>
...
>I would much prefer to see the argument validation close to the unsafe
>code and I don't particularly like the use of these internal virtual
>wrapper methods.
>The code duplication can be avoided by using simple static validation
>methods like (only for the cases when more that two lines of checks are
>needed):
>
>static void ValidateGetBytes (...same arguments as GetBytes ()...) {
>}
>
>Basically using virtual+unsafe should be severely limited in our managed
>assemblies or it's going to increase the cost of security checks
>significantly.

My reasons for the new Encoding infrastructure I propose:

- Unsafe code is required anyway because Encoding has public unsafe methods. 
Handling strings and character arrays in a single function requires unsafe 
code as well. Using a single implementation avoids accidental differences 
between string, array and pointer based public methods.

- All argument checks are done in the abstract Encoding.cs and encoding 
implementations only have to provide the actual implementation. I believe 
that this is more secure than using different checks is each encoding 
implementation because they can easily go out of sync.

- Argument validation could be moved to static methods in Encoding.cs that 
could result in better security (by not exposing unchecked unsafe code) but 
Microsoft implementation has bad desingn. There are a lot of unnecessary 
buffer copy operations (mostly related to string <-> char[] conversion). To 
fix https://bugzilla.novell.com/show_bug.cgi?id=363713 while avoiding 
unecessary buffer copy operations there has to be some internal virtual 
methods with a check like in IsInternal property.

- After a lot of brainstorming I think that having these four virtual unsafe 
methods provides the solution. (GetStringImpl is required in .NET 1.x only 
because a bug in MS implementation.) Only single implementation per method 
per encoding is required, argument checks are shared between encodings, and 
Encoding.cs can call faster code paths than MS.NET does without overriding 
methods not overridden in MS.NET.

- Also note that I have plans to integrate Encoders/Decoders to this 
infrastructure by adding an Encoder/Decoder parameter at the end of the 
argument list of these four ...Impl methods that will let us use 
Encoder/Decoder as a state store rather than having actual code 
implementation.

These goals could only be achieved without these four virtual unsafe methods 
by adding some more internal virtual (not unsafe) methods and having a lot 
of duplicate code. This would be much more difficult to maintain so I 
believe that this would be less secure than having four virtual unsafe 
methods that lets us to use much cleaner design in Encoding implementations.

>> - if (charCount < 0 || charIndex + charCount > s.Length)
>> + if (charCount < 0 || charIndex > (s.Length - charCount))
>These fixes should be in their own separate patch instead of being
>hidden in a large change.

This is the only one. (Maybe I'm wrong but no other intentional fixes of 
this kind.)

>> - return count * 2;
>> + return GetByteCountInternal (chars, index, count);
>For these simple cases I think that hiding count*2 inside
>GetByteCountInternal() doesn't help code readability or any other
>property of the code. This would look much better if it was instead like
>this:

Currently our implementations has weak fallback support. For example 
UnicodeEncoding has none while MS.NET has.
Eventually we will have to say goodby to count * 2 because that is not 
compatible with fallbacks.
.NET 1.x has no fallback support so there could be some performance gain by 
avoiding GetByteCountInternal but I think having a centralized 
implementation is worth to have little performance loss on .NET 1.x because 
people will eventually use later versions instead.

The same applies for complicating other currently simple implementations.

>What I suggest is the following:
>1) a small patch with the bugfixes

This is the only simple bug fix:
>> - if (charCount < 0 || charIndex + charCount > s.Length)
>> + if (charCount < 0 || charIndex > (s.Length - charCount))

If you approve it, I'll commit it.

Other bugs can only be fixed by infrastructural changes.

>2) another patch that introduces the validation methods
>and reduces the code duplication
>3) separate patches for each encoding with the optimizations

Because of using IsInternal property each encodings can be committed 
separatedly only Encoding.cs has to go first.

I tried to defend the infrastructure proposed by me. I see that the one 
proposed by you is cleaner but I think that it does not address all the 
issuse that mine does and extending it to address all the issues would 
result in a lot of extra code. I believe that having four internal virtual 
unsafe methods is not a big problem. If it is a big issue sorry for wasting 
time but I didn't get your reasons against them.

Kornél