[Mono-dev] [PATCH] Implement internal Encodings usingunified code base

Paolo Molaro lupus at ximian.com
Fri Mar 21 13:38:18 EDT 2008

On 03/21/08 Kornél Pál wrote:
> >From: "Paolo Molaro" <lupus at ximian.com>
> >Sure, the existing unsafe methods are perfectly fine, there is no need
> >to add more virtual unsafe methods (this is about 2.0). In 1.1 the
> >unsafe methods would simply become internal.
> This is unfortunately unture because we cannot call the public unsafe 
> methods because it would result in the problems described in bug 363713. To 
> fix bug 363713 a private (actually internal) unsafe implementation is 
> required.

This patch fixes the GetByteCount() issue described in the bug:

Index: System.Text/Encoding.cs
--- System.Text/Encoding.cs	(revision 98527)
+++ System.Text/Encoding.cs	(working copy)
@@ -1150,7 +1150,7 @@
 		for (int p = 0; p < count; p++)
 			c [p] = chars [p];
-		return GetByteCount (c);
+		return GetByteCount (c, 0, count);

For the case of GetByteCount (), the GetByteCount (char[],int,int)
overload is the only one that needs to be implemented by a subclass.
Efficient implementations can implement the unsafe variant and that will
avoid the array creation. The MS people likely did this to not break
compatibility and allow non-unsafe derived classes of Encoding.
All our internal implementations can override the unsafe variant.
If there is another issue related to GetByteCount() I'd like to know
otherwise we can continue and examine the other buggy cases.

> Also note that by having private implementations code of Encoding 
> and Encoder/Decoder methods can be shared as well by passing state 
> information to these four methods (and passing zero or initial state when 
> called from the stateless Encoding).

Sure, this is a separate issue about the implementation of the unsafe
variants and there is no difference related to your design in it.
It is likely that the implementation for most complex encodings should
be into the encoder classes. We'd have to write a few tests to make sure
if it is a requirement.

> To fix bug 363713 we have to take slower code paths (the same that MS.NET
> uses) for some methods that will result in buffer copy operations. This bug
> only affects inherited encodings (but there are a lot of inheritable
> encoding classes). This is why I introduced the IsInternal property;
> instances of internal classes can take faster code paths, while instances of
> inherited classes will take MS compatible (slower) code paths. I decided to

I have no objections to a mechanism like IsInternal.

> implement this code path redirection in Encoding.cs and call the four
> internal virtual (in fact internally must-implement) unsafe methods. This
> way the source code of inherited encoding classes will have a clean look
> because only four methods has to be implemented the other ones will only
> call the ...Internal implementations in Encoding.cs that does all the work.

Well, the issue is that in this case "all the work" is basically
argument checking + a call to the different virtual unsafe method.
Argument checking I already showed that having static methods to share
the code has the same property of removing duplication and the advantage
of making visible what is exactly checked at the point where the data is
entering the unsafe user code -> trusted runtime code boundary.
And when you have implemented the checks in common static methods, the
new unsafe virtual calls have basically no need to exist and they become
both a runtime burden and an additional issue when auditing the code.

> Because of fallbacks the actual required byte cound depends on the fallback 
> buffer as well. Also note that UnicodeEncoding in MS.NET 2.0 has support for 
> surrogates and can detect invalid surrogate sequences that means that the 
> fallback buffer will be used to replace invalid input data. If you don't 
> have a single implementation for GetByteCount you will have to duplicate 
> code for strings, char arrays and char pointers as well. So having a single 

I never proposed having duplicated code for those cases, quite the
opposite, so this whole argument of yours is moot.

> Your proposed solution (that would be just as much change in the
> infrastructure) is to use static implementation for these four methods in
> contrast to virtual methods.
> This would need more code in individual encoding classes:
> - Argument checking (can be implemented using static methods in Encoding.cs)

I don't see how my proposal to have common static methods to do argument
checking amounts to code duplication.

> - IsInternal checks: not much code but can easily go out of sync (some 
> methods require it while others not because the method is overridden in all 
> internal Encoding classes)

Not an issue as you say.

> - calling MS compatible slow implementation (could be a static method in 
> Encoding.cs)

Not an issue. The slow code path can go into separate common methods,
but this so so trivial that it may not even be worth it.

> - Array-to-pointer conversions has to be done in each encoding class 
> (because of calling static internal implementations)

No issue here as well.

> Your proposed design avoids the four internal virtual unsafe methods but 
> treads the gain for the above mentioned extra code. Your desing probably has 
> some performance benefits, but there will be a lot of redundant code (in 
> nearly each method of each encoding class). This code will look similar (or 
> the same) in source code but because of calling different static methods in 
> each class the compiled code will be different so the code has to be copied. 
> That means that if someone only modifies a single encoding class (by fixing 
> a bug) the others will go out of sync.

See above, there is no code duplication except trivial calls to helper
static methods. When you analyze the memory cost you need to have an
idea of how much each different solution costs in terms of bytes.
The calls is my proposal about to the generated code that performs the
call, say 20 bytes for a method with 3 arguments. The additional virtual
unsafe method itself would be 50 bytes just in additional metadata
and virtual table costs. So the amounts are small, but they are
definitely not a benefit of your design.

> Sure, both designs fix bug 363713 but as a conclusion I believe that my 
> proposed desing will result in code that is much easier to maintain. So I 
> would not trade it for yours unless you have some serious reason(s) not to 
> use it.

I already explained the reasons: the unnecessary complexity of additional
virtual unsafe methods.

> Fixing bug 363713 requires infrastructural changes because the current call 
> sequence causes the bug. In a lot of cases (see the test attached to the 
> bug) we cannot call the right public methods for other public methods 
> because MS.NET uses different code paths. For methods expected not to call 
> public methods we either have to have a private implementation that is 
> called by public methods or copy the same implementation to each public 
> method. For method that take slower code paths we have to use a technique 
> like the IsInternal property if we want to provide better performance. I 
> consider these changes being infrastructural because all of the Encodings 
> have to be modified similarly.

See my patch above for the fix to the GetByteCount case.
The other cases should be very similar. On top of the fixes we can then
implement the proper IsInternal-based optimizations.

> duplicated code. For example in ASCII and Latin-1 encodings. Applying the 
> desing pattern you suggest would result in an infrastructural change as well 
> but the implementation would be distributed and copied across source files 
> while mine would be centralized to Encoding.cs.

See above where your points about non-existent duplication in my
proposal are rebutted.
Everything your proposal does can be implemented with static methods
instead of virtual unsafe methods. Static methods may not be the cool
new thing of the day, but they are more efficient and allow us to
immediately check where potentialy ostile data is manipulated.

> I am open to modifications of my proposed desing but as I see your proposed 
> desing would introduce a lot of code duplication while there still would be 
> unsafe methods and a lot of unsafe code paths, and these code paths would be 
> copied in each encoding that would require to maintain each encoding 
> separatedly. With my desing there would be only four methods with unsafe 
> code per encoding.

See above, there is no relevant code duplication in my proposal.
The difference is basically the following, for, say, GetByteCount.
Your proposal:
	encoding-specific impl:
		call virtual common implementation
	virtual common implementation:
		argument checks + basic manipulation
		call virtual specific implementation

My proposal:
	encoding-specific impl:
		call common static argcheck + basic manipulation method
		call static/instance specific implementation (or even inline)
	common static code:
		argument checks + basic manipulation

So the only high-level difference is that my version avoids doing
unnecessary virtual calls and that potentially the code can be inline
exactly where it is used without hard to follow flows (important both
for performance and for security checks).

> Some fact sheets I have created in 2006 about Encoding methods of MS.NET. 
> This is based on the actual implementation of MS.NET (using a test like the 
> one in bug 363713):

Thanks, this is useful in developing individual test cases for the
bugs we have in our implementation.
I suggest writing individual test cases that exploit the current bugs in
a format suitable for adding to our test suite.
Once we have individual test cases we can reason better about the way to
fix them. And once we are bug-free we can look at implementing the
IsInternal checks to speedup things.


lupus at debian.org                                     debian/rules
lupus at ximian.com                             Monkeys do it better

More information about the Mono-devel-list mailing list