[Mono-list] cpblk?
Sergey Chaban
serge@wildwestsoftware.com
Sun, 26 May 2002 04:22:28 +0300
Hello!
> However, how often does this happen?
Not very often, most certainly :-)
As far as I can tell, the opcode is currently used by Managed VC++ to inline memcpy,
if certain optimizations are enabled or if the compiler is explicitly instructed to do so
with #pragma intrinsic(memcpy).
I think another use for cpblk is code generated dynamically at runtime (with Reflection.Emit),
perhaps when the size is already known (something similar to the self-modifying code often used in the old days).
> When you have small data blocks to be moved (but the size isn't known
> at compile time) it's generally best to "rep movsb" without any additional
> logic. When you have larger blocks, it really pays off to optimize for
> things like DWORD/QWORD alignment, cache prefetching (available in most
> modern CPU architectures). Ideally you have specialized copy/move routines
I totally agree :-)
All in all, I think it's perfectly correct to implement cpblk with memmove,
but I think that it would be wrong to make any assumptions about its behaviour
(with regard to overlapping blocks), and write code based on these assumptions.
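
To illustrate the point about overlapping blocks in C terms, here is a minimal sketch.
The helper name shift_right_by_one is hypothetical and not part of Mono or the patch.
Code that really needs overlap-safe copying should ask for it explicitly (memmove)
rather than rely on how a particular cpblk implementation happens to behave:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: shifts the first n bytes of buf right by one byte.
   The source [buf, buf+n) and destination [buf+1, buf+1+n) overlap, so
   memmove is required; memcpy would be undefined behaviour here.
   The caller must guarantee that buf holds at least n + 1 bytes. */
static void shift_right_by_one(char *buf, size_t n)
{
    assert(buf != NULL);
    memmove(buf + 1, buf, n);
}
```

A copy that is known never to overlap can keep using memcpy, which is the case
the VC++ intrinsic covers.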
Also not all modern CPUs are x86s ;-)
I put together some tests and this patch with some optimizations for size=const:
http://mono.eurosoft.od.ua/files/x86.brg.cpblk.diff
Some sample code:
http://mono.eurosoft.od.ua/files/CpblkTest.il
http://mono.eurosoft.od.ua/files/BulkCpy.il
These tests are rather synthetic. Unfortunately it's currently impossible to run
VC++ generated executables under Mono, otherwise I'd code something more realistic :-)
The first test is just moving XYZ float vectors around (size=12; in this case the performance
increase is quite noticeable). The second just copies blocks of various sizes.
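
For the size=12 case, the idea of specializing on a constant size can be sketched in C
as follows. This is only an illustration under assumptions: the Vec3 type and copy12
function are hypothetical, and the sequence the patch actually emits may differ.
With the size fixed at 12, the copy reduces to three 32-bit loads and three 32-bit
stores, with no loop and no call:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical XYZ vector; assumes a 4-byte float, so 12 bytes in total. */
typedef struct { float x, y, z; } Vec3;

/* Copies exactly 12 bytes as three 32-bit words. The 4-byte memcpy calls
   are a portable way to express unaligned word loads/stores; compilers
   typically turn each one into a single move instruction. */
static void copy12(void *dst, const void *src)
{
    const unsigned char *s = (const unsigned char *)src;
    unsigned char *d = (unsigned char *)dst;
    uint32_t a, b, c;

    assert(dst != NULL && src != NULL);
    memcpy(&a, s, 4);     memcpy(&b, s + 4, 4); memcpy(&c, s + 8, 4); /* loads  */
    memcpy(d, &a, 4);     memcpy(d + 4, &b, 4); memcpy(d + 8, &c, 4); /* stores */
}
```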
The patch is quick and dirty: for different sizes it emits code optimized for different CPUs :-)
Moreover, it uses MOVAPS instructions to copy blocks larger than 1K without checking
if SSE is actually available, so the second test will crash on CPUs without SSE.
It also uses the FPU to move blocks of certain sizes, which is faster on older Pentiums/486 but slow on P6+.
This is just to demonstrate/test CPU-specific optimizations for cpblk.
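
A more careful version would check for SSE before choosing the MOVAPS path. A rough
sketch of such a guard in C follows. The names have_sse, copy_sse, copy_plain and
pick_copy are hypothetical, copy_sse is only a stand-in (it calls memcpy instead of
emitting MOVAPS), and the feature test relies on the GCC/Clang builtin
__builtin_cpu_supports, falling back to "no SSE" elsewhere. MOVAPS also requires
16-byte aligned operands, so the sketch checks alignment as well:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef void *(*copy_fn)(void *dst, const void *src, size_t n);

/* Returns nonzero if the CPU reports SSE support (GCC/Clang on x86 only). */
static int have_sse(void)
{
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    return __builtin_cpu_supports("sse") != 0;
#else
    return 0; /* unknown platform: assume SSE is not available */
#endif
}

/* Stand-in for an SSE block copy; a real one would use MOVAPS. */
static void *copy_sse(void *dst, const void *src, size_t n)
{
    return memcpy(dst, src, n);
}

/* Plain byte-by-byte fallback that works on any CPU. */
static void *copy_plain(void *dst, const void *src, size_t n)
{
    unsigned char *d = (unsigned char *)dst;
    const unsigned char *s = (const unsigned char *)src;
    while (n--)
        *d++ = *s++;
    return dst;
}

/* Picks the SSE path only for blocks larger than 1K that are 16-byte
   aligned on both ends, and only when the CPU actually has SSE. */
static copy_fn pick_copy(const void *dst, const void *src, size_t n)
{
    int aligned;

    assert(dst != NULL && src != NULL);
    aligned = (((uintptr_t)dst | (uintptr_t)src) & 15u) == 0;
    if (n > 1024 && aligned && have_sse())
        return copy_sse;
    return copy_plain;
}
```

In a JIT the same decision would be made once at code generation time rather than
on every copy.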
Sergey