[Mono-dev] New (faster) Implementation for managed CharCopy

Mon Mar 20 14:34:14 EST 2006

> On 03/18/06 Andreas Nahr wrote:
>> ALL new Implementations are always faster, sometimes more than twice as
>> fast as current. The best overall is CharCopy autoalign. However I'm not
>> sure if this would work on non-x86 Platforms, but CC aligned should always
>> work if I understand the alignment issues right.
>
> Unaligned loads and stores are unacceptable, since they break at least
> sparc and arm.

Unfortunately I don't have any sparc or arm system at hand, so I couldn't 
test there, but the CC aligned implementation should work, because it only 
performs aligned loads and stores.
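
To make clear what "aligned" means here, a minimal sketch in C. This is not 
the managed code from the patch; the name char_copy_aligned is made up for 
this example, and it assumes 16-bit chars that are themselves 2-byte aligned. 
The 4-byte memcpy calls stand in for single 32-bit loads and stores, which is 
what compilers typically generate for them.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical sketch: copy `count` 16-bit chars using 32-bit accesses
   only at addresses that are 4-byte aligned. */
void char_copy_aligned(uint16_t *dest, const uint16_t *src, size_t count)
{
    /* Word copies are only possible when both pointers have the same
       offset within a 4-byte word; otherwise use 16-bit copies only. */
    if ((((uintptr_t)dest ^ (uintptr_t)src) & 3u) == 0) {
        /* Copy one leading char if needed to reach a 4-byte boundary. */
        if (count > 0 && ((uintptr_t)dest & 3u) != 0) {
            *dest++ = *src++;
            count--;
        }
        /* Both pointers are now 4-byte aligned: move two chars per step. */
        while (count >= 2) {
            uint32_t word;
            memcpy(&word, src, sizeof word);   /* 32-bit load from aligned src */
            memcpy(dest, &word, sizeof word);  /* 32-bit store to aligned dest */
            src += 2;
            dest += 2;
            count -= 2;
        }
    }
    /* Remaining chars, or the whole buffer when the offsets differ. */
    while (count > 0) {
        *dest++ = *src++;
        count--;
    }
}
```

When source and destination have different offsets, this sketch never issues 
a 32-bit access at all, so nothing in it depends on the hardware tolerating 
unaligned accesses.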

> The current memcpy() was done because it is a compromise between speed
> and code bloat.

The current implementation is 180 lines of code. The proposed one has only 
113 lines of code. Which one is more bloated?

> You could certainly unroll some more for some speed
> gains.

It doesn't make any sense to unroll any further, because each additional 
level of unrolling makes copying SHORT data much slower. In fact, if you 
look at the numbers, the new implementations are not much faster for huge 
strings; they are much faster for short and especially medium-length 
strings.
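
As a rough illustration of that trade-off, here is a sketch in C of a copy 
loop unrolled four times. It is not taken from either implementation, and 
the name char_copy_unrolled4 is invented for this example.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch: copy `count` 16-bit chars, four per iteration. */
void char_copy_unrolled4(uint16_t *dest, const uint16_t *src, size_t count)
{
    /* Main loop: one loop test for every four chars copied. */
    while (count >= 4) {
        dest[0] = src[0];
        dest[1] = src[1];
        dest[2] = src[2];
        dest[3] = src[3];
        dest += 4;
        src += 4;
        count -= 4;
    }
    /* Tail: at most three chars, copied one at a time. Unrolling the main
       loop to 8 or 16 chars would raise this to at most 7 or 15 chars. */
    while (count > 0) {
        *dest++ = *src++;
        count--;
    }
}
```

A copy of fewer than four chars never enters the main loop here, so it gains 
nothing from the unrolling and runs entirely in the one-at-a-time tail. The 
wider the main loop, the more short copies fall into that category.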

> If you can provide significant gains with little code bloat in the
> current methods, please do, but we're not going to add additional
> specialized memory copy routines.

Well, if a speedup of more than 100% in some cases isn't enough for you, how 
much should it be?

> memcpy is used also by the runtime and your methods only deal with
> chars, so they are not a replacement.

And how often are they actually called to copy bytes? Two times in the 
entire Corelib? The rest of the time they are called for characters.
Besides that, it does not even seem semantically right to put something like 
memcpy into the STRING class.

I'd say put the specialized CharCopy into String and move the memcpy to 
e.g. Buffer if it is still needed.

> The way to speedup the code is to implement arch-specific hand-coded
> managed implementations in the jit.

I absolutely agree with that; however, I don't see any reason why we should 
have a managed fallback that is much slower than it needs to be, especially 
as nobody has written those arch-specific implementations in the last few 
years.

In another thread you posted that one should measure against the icalls. 
What is the point of that if the managed implementation is known to be 
slow?

Greets
Andreas