[Mono-devel-list] [Patch] Managed code is fast!

Fri May 21 09:08:33 EDT 2004

Hey Lupus,

The investigation of this is very interesting. I have a few comments:

1) Have you changed code like
h = (h << 5) - h + *cc;
h = (h << 5) - h + cc [1];
h = (h << 5) - h + cc [2];
h = (h << 5) - h + cc [3];

to use memindex-style addressing (base plus constant offset) rather than an extra add?
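
To make the question concrete, here is a minimal C sketch of the h*31 + c hash with a four-way unrolled body that reads each character at a constant offset and bumps the pointer once per iteration. The function name hash_chars and the use of uint16_t for characters are assumptions for illustration; this is not the code from the patch.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not Mono's actual code: computes h = h*31 + c
   over 16-bit characters, written as (h << 5) - h + c. */
uint32_t hash_chars(const uint16_t *cc, size_t len)
{
    uint32_t h = 0;
    const uint16_t *end = cc + (len & ~(size_t)3);

    /* Four steps per iteration. Each load uses a constant offset from cc,
       so the pointer is increased once instead of once per character. */
    while (cc < end) {
        h = (h << 5) - h + cc[0];
        h = (h << 5) - h + cc[1];
        h = (h << 5) - h + cc[2];
        h = (h << 5) - h + cc[3];
        cc += 4;
    }

    /* Remaining 0 to 3 characters. */
    for (size_t i = 0; i < (len & 3); i++)
        h = (h << 5) - h + cc[i];

    return h;
}
```

Whether the indexed form actually wins depends on what the JIT or compiler emits for it, so it would need measuring.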

2) I wonder if it would be a good idea to manually write the memcpy/memmove routine in assembly. This would seem to be the best way to handle things like your `fixme' on copying with doubles. Also, we would be able to do per-cpu tricks (totte wants SSE2 type stuff). I will try something like this later tonight.

-- Ben

>>> Paolo Molaro <lupus at ximian.com> 05/21/04 08:36 AM >>>
On 05/21/04 Andreas Nahr wrote:
> > > private unsafe static void CharCopy (char* source, char* destination,
> int count)
> >
> > What is the perf here if things are not dword aligned?
> 
> I think for me things were always dword aligned. We should ensure that
> Strings always get the right alignment in the JIT.

We can guarantee the character data in a string will be aligned to a 4 byte
boundary, but CharCopy can be called on data aligned to just 2.

> > > + while (count >= 16) {
> > > + *((int*) destination) = *((int*) source);
> > > + destination += 2;
> > > + source += 2;
> > > + *((int*) destination) = *((int*) source);
> > > + destination += 2;
> > > + source += 2;
> >
> > It is probably better to do something like:
> >
> > *((int*) dest + x) = ...
> 
> Did you really test this or are you just guessing?

What? It's much easier to talk than to test! Why should he test? :-)

> For me the above solution (although more source code) always produced
> superior speed.
> However I used the notation *((int*) dest[x]) =...
> But this seems to be compiled into the same IL.

When you posted about the low performance and I changed the JIT to
produce faster code I also investigated a few methods in String and
methods to do copies. The basic thing to note is to keep the variables
used in the inner loop to 3 and to do clever unrolling. When unrolling
in a copy, for example, you should not do:
	copy 1
	increase pointers by 1
	copy 1
	increase pointers by 1
	...

but the more efficient:
	copy 1
	copy 1
	copy 1
	copy 1
	increase pointers by 4
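
The second pattern can be sketched in C as follows. This is only an illustration of the unrolling shape described above, not the attached memcpy: the name char_copy is hypothetical, it copies one 16-bit character per step rather than an int, and it assumes the buffers do not overlap.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch: copies count 16-bit characters from src to dest.
   The buffers must not overlap. */
void char_copy(uint16_t *dest, const uint16_t *src, size_t count)
{
    /* Unrolled body: four copies at fixed offsets, then one increase per
       pointer. The loop state is three variables: dest, src and count. */
    while (count >= 4) {
        dest[0] = src[0];
        dest[1] = src[1];
        dest[2] = src[2];
        dest[3] = src[3];
        dest += 4;
        src += 4;
        count -= 4;
    }

    /* Tail: the last 0 to 3 characters. */
    while (count > 0) {
        *dest++ = *src++;
        count--;
    }
}
```

Copying element by element sidesteps the alignment question; a version that moves an int at a time would also have to deal with data aligned to just 2.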

See the attached benchmarks for ideas: GetHashCode() is always faster
than the C version (on x86; on ppc it's faster until 200 chars and 20%
slower at 1000, but I didn't optimize that yet). It's twice as fast
as the current code so I'll get it in cvs in the next few days.
As for copies: I'd like to have something like the attached memcpy in
System.String and use it whenever a copy is required (it will eventually
be used also for the cpblk IL opcode). The memcpy is always faster than
the C version for me (except when the data is misaligned): I didn't have
the time to properly test if this is because of bugs in the code:-)
If someone could write an extensive set of tests for memcpy it would be
appreciated.
Results from both benchmarks on different cpus are also appreciated:
please provide cpu type and speed and run with -O=all with mono from
cvs (-O=loop is enough to get most of the speed: I'll enable it by
default shortly since it has low impact on JIT time).
A memmove method is also needed for some of the string methods.
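
What separates memmove from memcpy is that source and destination may overlap, so the copy direction has to be chosen accordingly. A minimal C sketch of that choice follows; the name char_move is hypothetical, and the loops are left un-unrolled to keep the overlap logic visible.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: overlap-safe move of count 16-bit characters. */
void char_move(uint16_t *dest, const uint16_t *src, size_t count)
{
    /* Compare the addresses as integers so that unrelated buffers can be
       compared as well. */
    uintptr_t d = (uintptr_t)dest;
    uintptr_t s = (uintptr_t)src;

    if (d <= s || d >= s + count * sizeof(uint16_t)) {
        /* dest is at or below src, or past its end: copy forward. */
        for (size_t i = 0; i < count; i++)
            dest[i] = src[i];
    } else {
        /* dest starts inside the source range: copy backward so that
           characters are read before they are overwritten. */
        for (size_t i = count; i > 0; i--)
            dest[i - 1] = src[i - 1];
    }
}
```

For example, moving 5 characters from the start of {1,2,3,4,5,6,7,8} to offset 2 takes the backward branch and leaves {1,2,1,2,3,4,5,8}.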
Thanks.

lupus

-- 
-----------------------------------------------------------------
lupus at debian.org                                     debian/rules
lupus at ximian.com                             Monkeys do it better