[Mono-list] cpblk?

Dietmar Maurer dietmar@ximian.com
27 May 2002 12:37:08 +0200


Wow, Sergey is always a bit faster than I am :-)

I would include that patch if you remove the MOVAPS or test whether SSE
is really available. Maybe the simple generic memcopy you posted
first is not much slower?
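
A minimal sketch in C of the kind of check meant here, not code from the
patch: it asks CPUID whether SSE is present (leaf 1, EDX bit 25) before a
MOVAPS path is chosen. It assumes GCC or Clang for <cpuid.h>; the helper
names cpu_has_sse and can_use_movaps are made up for illustration. The
16-byte alignment test is there because MOVAPS faults on unaligned memory
operands, and the 1K threshold is the one mentioned in the quoted mail.

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__i386__) || defined(__x86_64__)
#include <cpuid.h>
#endif

/* Returns 1 if CPUID reports SSE (leaf 1, EDX bit 25), 0 otherwise.
   On non-x86 targets it always returns 0. */
static int cpu_has_sse(void)
{
#if defined(__i386__) || defined(__x86_64__)
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 0;               /* CPUID leaf 1 not supported */
    return (int)((edx >> 25) & 1u);
#else
    return 0;
#endif
}

/* Hypothetical policy helper: a MOVAPS-based copy is only an option when
   SSE is present, the block is larger than 1K, and both pointers are
   16-byte aligned. Otherwise the caller should use the generic copy. */
static int can_use_movaps(const void *dst, const void *src, size_t size)
{
    return cpu_has_sse()
        && size > 1024
        && ((uintptr_t)dst % 16) == 0
        && ((uintptr_t)src % 16) == 0;
}
```

In a JIT the CPUID result would presumably be looked up once at startup
and cached, so the check costs nothing per emitted cpblk.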

- Dietmar


On Sun, 2002-05-26 at 03:22, Sergey Chaban wrote:
> Hello!
> 
> > However, how often does this happen?
> 
> Not very often, most certainly :-)
> As far as I can tell, the opcode is currently used by Managed VC++ to inline memcpy
> when certain optimizations are enabled or when the compiler is explicitly instructed
> to do so with #pragma intrinsic(memcpy).
> 
> I think that another use for cpblk is dynamic code generated at runtime (with Reflection.Emit),
> perhaps when the size is already known (something similar to the self-modifying code often
> used in the old days).
> 
> 
> > When you have small data blocks to be moved (but the size isn't known
> > at compile time) it's generally best to "rep movsb" without any additional
> > logic. When you have larger blocks, it really pays off to optimize for
> > things like DWORD/QWORD alignment, cache prefetching (available in most
> > modern CPU architectures). Ideally you have specialized copy/move routines
> 
> I totally agree :-)
> All in all, I think it's perfectly correct to implement cpblk with memmove,
> but I think that it would be wrong to make any assumptions about its behaviour
> (with regard to overlapping blocks), and write code based on these assumptions.
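
To make the overlap point concrete, here is a small C sketch (not from
the patch; shift_right is a hypothetical helper). In C, memmove is
defined for overlapping blocks while memcpy is not, so a cpblk built on
memmove happens to tolerate overlap, but code that relies on that is
depending on an implementation detail.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copy the first n bytes of buf to buf + k.
   Source and destination overlap whenever 0 < k < n, so memmove is
   required here; calling memcpy would be undefined behaviour. */
static void shift_right(char *buf, size_t n, size_t k)
{
    memmove(buf + k, buf, n);
}
```

For example, with buf holding "abcdef", shift_right(buf, 4, 2) leaves
"ababcd" in the buffer.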
> 
> Also not all modern CPUs are x86s ;-)
> 
> I put together some tests and this patch with some optimizations for size=const:
> http://mono.eurosoft.od.ua/files/x86.brg.cpblk.diff
> 
> Some sample code:
> http://mono.eurosoft.od.ua/files/CpblkTest.il
> http://mono.eurosoft.od.ua/files/BulkCpy.il
> 
> These tests are rather synthetic. Unfortunately it's currently impossible to run
> VC++-generated executables under Mono, otherwise I'd code something more realistic :-)
> The first test just moves XYZ float vectors around (size=12; in this case the performance
> increase is quite noticeable). The second just copies blocks of various sizes.
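
As a rough C analogue of the size=const idea (an illustration only, not
what the patch emits): when the size is known to be 12, the copy can be
three fixed 4-byte moves with no loop and no size test, whereas the
generic routine has to loop on a run-time count. The names copy_generic
and copy_12 are assumptions made for this sketch.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Generic fallback: byte-by-byte copy, size known only at run time. */
static void copy_generic(void *dst, const void *src, size_t size)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    while (size--)
        *d++ = *s++;
}

/* Specialized for size == 12 (e.g. one XYZ float vector): three 4-byte
   loads followed by three 4-byte stores. The fixed-size memcpy calls
   avoid alignment and aliasing problems; compilers typically turn each
   one into a single move. */
static void copy_12(void *dst, const void *src)
{
    const unsigned char *s = src;
    unsigned char *d = dst;
    uint32_t a, b, c;

    memcpy(&a, s + 0, 4);
    memcpy(&b, s + 4, 4);
    memcpy(&c, s + 8, 4);
    memcpy(d + 0, &a, 4);
    memcpy(d + 4, &b, 4);
    memcpy(d + 8, &c, 4);
}
```

Both routines produce the same bytes; the difference is only in how much
work is left to run time.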
> 
> The patch is quick and dirty: for different sizes it emits code optimized for different CPUs :-)
> Moreover, it uses MOVAPS instructions to copy blocks larger than 1K without checking
> whether SSE is actually available, so the second test will crash on CPUs without SSE.
> It uses the FPU to move blocks of certain sizes, which is faster on older Pentiums/486 but
> slower on P6+.
> This is just to demonstrate/test CPU-specific optimizations for cpblk.