[Mono-list] cpblk?

Jaroslaw Kowalski jarek@atm.com.pl
Sat, 25 May 2002 21:19:34 +0200 (CEST)


It's not trivial to achieve the maximum possible performance for a task as
seemingly simple as a memory block transfer.

From my experience with game programming I can tell you that it's
generally best to completely unroll the copy for small data blocks of
constant size, that is, to use a series of (interleaved) "mov" instructions.
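
To illustrate the unrolled case, here is a minimal C sketch for a block of
exactly 16 bytes. The helper name copy16_unrolled and the 16-byte size are
made up for the example, and it assumes the two blocks do not overlap. The
fixed-size 4-byte memcpy() calls are only a portable way to spell an
unaligned load or store; an optimizing compiler will typically turn each one
into a single "mov", but that is not guaranteed.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper (not from any library): copies exactly 16 bytes
 * without a loop. Assumes dest and src do not overlap. */
static void copy16_unrolled(void *dest, const void *src)
{
    const unsigned char *s = (const unsigned char *)src;
    unsigned char *d = (unsigned char *)dest;
    uint32_t w0, w1, w2, w3;

    /* Loads and stores are interleaved so that a store never has to
     * wait on the load issued immediately before it. */
    memcpy(&w0, s + 0, 4);   /* load  bytes 0..3   */
    memcpy(&w1, s + 4, 4);   /* load  bytes 4..7   */
    memcpy(d + 0, &w0, 4);   /* store bytes 0..3   */
    memcpy(&w2, s + 8, 4);   /* load  bytes 8..11  */
    memcpy(d + 4, &w1, 4);   /* store bytes 4..7   */
    memcpy(&w3, s + 12, 4);  /* load  bytes 12..15 */
    memcpy(d + 8, &w2, 4);   /* store bytes 8..11  */
    memcpy(d + 12, &w3, 4);  /* store bytes 12..15 */
}
```

A JIT would emit the equivalent "mov" sequence directly instead of relying
on a C compiler to produce it.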

When you have small data blocks to be moved (but the size isn't known
at compile time) it's generally best to use "rep movsb" without any additional
logic. When you have larger blocks, it really pays off to optimize for
things like DWORD/QWORD alignment and cache prefetching (available in most
modern CPU architectures). Ideally you have specialized copy/move routines
for different architectures and instruction sets (Pentium, K6, Athlon, MMX,
SSE, SSE2, etc.) and just call (or emit the call to) the appropriate one.
The cost of "call/ret" is negligible on new processors.
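
One way to organize the "specialized routine per architecture" idea is to
pick a function pointer once and call through it afterwards. The C sketch
below is only an outline under loud assumptions: the cpu_kind enum, the
copy_* names and select_copy() are all hypothetical, CPU detection is left
out entirely, and the "specialized" routines are placeholders that simply
forward to memmove() where real MMX/SSE2 code would go.

```c
#include <stddef.h>
#include <string.h>

/* Signature shared by all copy routines (same shape as memmove). */
typedef void *(*copy_fn)(void *dest, const void *src, size_t n);

/* Hypothetical CPU classes; a real runtime would detect these itself
 * (for example with CPUID on x86). */
enum cpu_kind { CPU_GENERIC, CPU_MMX, CPU_SSE2 };

/* Placeholder bodies: each one just forwards to memmove(). A real
 * implementation would contain code tuned for that architecture. */
static void *copy_generic(void *d, const void *s, size_t n) { return memmove(d, s, n); }
static void *copy_mmx(void *d, const void *s, size_t n)     { return memmove(d, s, n); }
static void *copy_sse2(void *d, const void *s, size_t n)    { return memmove(d, s, n); }

/* Pick the routine once, at startup or at JIT time; later copies are a
 * plain indirect call (or a direct call emitted by the JIT). */
static copy_fn select_copy(enum cpu_kind kind)
{
    switch (kind) {
    case CPU_SSE2: return copy_sse2;
    case CPU_MMX:  return copy_mmx;
    default:       return copy_generic;
    }
}
```

A JIT has it even easier than this: it knows the processor when it generates
code, so it can emit a direct call to the chosen routine.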

So when the size isn't known at compile time I suggest a simple comparison
of the block size against some threshold, followed by either "rep movsb" or
a call to a memmove() optimized for the current processor architecture.
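
In C the threshold test might look roughly like the sketch below. The names
copy_block and SMALL_COPY_THRESHOLD and the value 64 are assumptions chosen
for illustration; the right threshold has to be measured on the target
processor. The plain forward byte loop is a portable stand-in for
"rep movsb" and, like it, is only safe when the destination does not start
inside the source block.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical cut-off; the best value depends on the CPU. */
#define SMALL_COPY_THRESHOLD 64

/* Copies n bytes from src to dest. Small blocks take the simple path
 * with no extra logic; everything else goes to the library memmove(). */
static void *copy_block(void *dest, const void *src, size_t n)
{
    if (n < SMALL_COPY_THRESHOLD) {
        /* Stand-in for "rep movsb": a forward, byte-by-byte copy. */
        unsigned char *d = (unsigned char *)dest;
        const unsigned char *s = (const unsigned char *)src;
        while (n--)
            *d++ = *s++;
        return dest;
    }
    /* Large block: let the (ideally CPU-specific) library routine
     * handle alignment and prefetching. */
    return memmove(dest, src, n);
}
```

The single compare is the only overhead on the small path, which is the point
of keeping that path free of alignment checks.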

If the size is known at compile time and it is small, just unroll the
loop. If the size is above some threshold, just call memmove().

Just my $0.05 ;-)


On 25 May 2002, Miguel de Icaza wrote:

> > > memcpy already takes care of copying in the fastest way possible.
> > 
> > That's right, but we still have a call, a ret, and a conditional or two ;-)
> I was going to say exactly that ;-)
> > By inlining we can get rid of these things (especially if size is known up-front).
> > Moreover, due to JIT's dynamic nature it's possible to generate faster code at run-time.
> > For example, the following (generic) memcpy is faster on pre-Pentium x86s (Intel syntax):
> >   mov esi, $src
> >   mov ecx, $size
> >   mov edi, $dest
> >   shr ecx,1
> >   rep movsw
> >   adc cl,cl
> >   rep movsb
> > 
> > For const size==1 we could just mov al, [src]; mov [dest],al
> > etc.
> > BTW, MS JIT uses similar optimizations for cpblk/initblk.
> Exactly.  The same logic that lives in memmove() for the data size
> quantum can be inlined by the JIT engine trivially.  
> However, how often does this happen?  Until a couple of days ago we did
> not have cpblk, so my guess is that measuring the performance impact
> might not be immediately noticeable. 
> I would very much like to see this at some point.
> Miguel.