[Mono-devel-list] [PATCH] String speedup
05mauben at hawken.edu
Tue Feb 24 13:32:40 EST 2004
Hrm, I guess I tried too hard :-).
Even though your new version beats the ICall in most cases, I'd really like to see a cheaper way to make the call to memcpy and other such methods.
I was talking with Miguel about this and he suggested working around the exception handling helpers.
Last night, I told the JIT not to make a wrapper for CPBLK. I instantly got a huge perf boost. However, I don't think we can do that for cpblk, for correctness reasons (what if you try to copy 10 bytes from NULL to NULL? What would happen to the exception without the unwind info?)
However, for internal calls, we can make some assumptions about the validity of memory. What I was thinking was to provide the following methods to the framework:
internal void memcpy (void* src, void* dest, int cb);
internal void memmove (void* src, void* dest, int cb);
It would then be the caller's responsibility to ensure that these methods are only passed valid addresses and a valid count.
In the jit, we would just implement these as direct calls to the glibc functions. Since there are no exceptions, it would be just as fast as doing the call from C code.
Thus, we get the benefit of the tuned-to-death memcpy et al while not having the overhead of an ICall.
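To make the contract concrete, here is a minimal C sketch of the two paths: a checked copy that validates its arguments, and an unchecked one that goes straight to libc. The names copy_checked and copy_unchecked are hypothetical and are not part of Mono. The (src, dest, cb) argument order follows the signatures proposed above, which is the reverse of libc's memcpy (dest, src, n). The checked version returns -1 where a runtime would raise an exception.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helpers for illustration only; not Mono's API. */

/* Checked path: validate before copying. Returns 0 on success and
   -1 where a managed runtime would throw an exception. */
static int copy_checked (const void *src, void *dest, int cb)
{
    if (cb < 0)
        return -1;
    if (cb == 0)
        return 0;               /* nothing to copy, pointers not touched */
    if (src == NULL || dest == NULL)
        return -1;
    memcpy (dest, src, (size_t) cb);
    return 0;
}

/* Unchecked path: the caller guarantees valid, non-overlapping buffers
   and cb >= 0. The asserts only document the contract; they compile
   out when NDEBUG is defined, leaving a bare call to memcpy. */
static void copy_unchecked (const void *src, void *dest, int cb)
{
    assert (src != NULL && dest != NULL);
    assert (cb >= 0);
    memcpy (dest, src, (size_t) cb);
}
```

A caller that has already validated its buffers (for example, string code that knows both lengths) would use copy_unchecked; anything taking pointers from outside would go through copy_checked.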
I remember a MS blogger writing about something similar in their framework. They had two calling conventions, one that created a stub (much like today's icall interface) and one that did a direct call.
>>> Paolo Molaro <lupus at ximian.com> 02/24/04 12:12 PM >>>
On 02/23/04 Ben Maurer wrote:
> This tests the speed of copying strings of various lengths. On my box, the results were:
> Length | Before | After
> 2      | .388 s | .273 s
> 5      | .426 s | .436 s
> 8      | .419 s | .421 s
> 47     | .536 s | .937 s
> My implementation of memcpy/memmove is attached, with a little test driver.
> So, it looks like right now, after len 2 strings, the cost of the icall
> becomes lower than the benefit of memcpy.
What about a different explanation? Because to me it looks like that
with a crippled memcpy managed implementation you can get as bad results
as you want. Attached a first cut that doesn't try to optimize away the
unaligned accesses. It beats the icall on my system until about 50-55
and is about 10% slower with lengths between 80-100 (and 10% is
definitely within the improvements we can gain in the jit). Also note
that it takes 3 80-char copies with the icall to gain back the time lost
with a single 10-char copy. Note it doesn't handle overlap, so I'm not
going to commit it: the few calls in stringbuilder that do need overlap
should be changed to call another function so the common case is handled
faster. I had hoped you would do that, but it looks like you wanted to
show how to write horrible and slow code instead.
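For readers without the attachment, the split described above (a fast copy for the common non-overlapping case, and a separate function for the few callers that need overlap) might look roughly like the following C sketch. This is not the attached code, which was managed code. The names copy_forward and copy_overlapping are hypothetical, and the 32-bit word size is an assumption. Like the first cut described above, it makes no attempt to align the accesses.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical sketch; not the implementation attached to the mail. */

/* Common case: forward copy, one 32-bit word at a time, then the byte
   tail. Callers must not pass a dest that overlaps src from above. */
static void copy_forward (void *dest, const void *src, size_t n)
{
    unsigned char *d = dest;
    const unsigned char *s = src;

    while (n >= sizeof (uint32_t)) {
        uint32_t w;
        /* Fixed-size memcpy is the portable way to do a possibly
           unaligned word load/store; compilers usually inline it. */
        memcpy (&w, s, sizeof w);
        memcpy (d, &w, sizeof w);
        d += sizeof w;
        s += sizeof w;
        n -= sizeof w;
    }
    while (n > 0) {
        *d++ = *s++;
        n--;
    }
}

/* Rare case: regions may overlap. Copy forward when that is safe,
   otherwise copy bytes from the end backwards. */
static void copy_overlapping (void *dest, const void *src, size_t n)
{
    unsigned char *d = dest;
    const unsigned char *s = src;

    if (d == s || n == 0)
        return;
    /* Compare as integers; dest below src, or past its end, is safe. */
    if ((uintptr_t) d < (uintptr_t) s || (uintptr_t) d >= (uintptr_t) s + n) {
        copy_forward (d, s, n);
    } else {
        while (n > 0) {
            n--;
            d[n] = s[n];
        }
    }
}
```

Keeping the overlap test out of copy_forward is the point of the split: the common case pays for no direction check at all.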
> One other thought I had was somehow using the CPBLK instruction. We
> could make method that was transformed into CPBLK by the jit. This way,
> we just have to optimize that opcode. Note, that Mono runs the CPBLK
> bench mark 3x slower than MS does, so we may have to do some work. Also,
Trivially optimized with about 5 lines of C code. Anyone out there who
wants to start some jit hacking? No asm knowledge required, the results
should look something like:
$ mono -O=all,-intrins benchmark/bulkcpy.exe
Elapsed : 4046 ms.
$ mono -O=all benchmark/bulkcpy.exe
Elapsed : 1359 ms.
On 02/23/04 Ben Maurer wrote:
> Some grepping shows that the old JIT had code generation for CPBLK, and
> it looked pretty fast. Maybe we can port that over?
Nope. Well, you're free to spend your time porting it, maybe you'll
learn something. Once you have ported it we'll show you why it is not
lupus at debian.org debian/rules
lupus at ximian.com Monkeys do it better