[Mono-devel-list] mono AES performance woes (was: poor PPC JIT output)

Mon Jul 18 15:18:34 EDT 2005

On Jul 18, 2005, at 2:59 AM, Paolo Molaro wrote:

> On 07/15/05 Allan Hsu wrote:
>
>> Is there any reference on what sorts of things you can change using
>> mono_set_defaults? Following the mono source for references to that
>> function wasn't particularly enlightening. It would be useful if the
>>
>
> grep mono_set_defaults *.c
> mini.c:mono_set_defaults (int verbose_level, guint32 opts)
> Should be pretty evident. Just always use the result of
> mono_parse_default_optimizations (NULL) as the opts value.

I understood the verbose_level parameter, but the opts parameter was
what mystified me. I should have been more specific about what I was
looking for. At the time, I didn't understand the value that
mono_parse_default_optimizations() returns or what values you can
pass in to affect it. I've since traced it back to the relevant code
in driver.c and the mini-X.c platform code and now see how it works.
Is it safe to change those parameters, or will doing so cause
undefined results?

>> To be fair, the native implementation is able to take advantage of
>> 64-bit processors when available, while all mono builds in the above
>> benchmarks are 32-bit. The Windows XP machine is the standard 32-bit
>> install, even though the processor is 64-bit. This is a pretty
>> informal benchmark, but all I'm interested in showing here is how bad
>> the AES performance under mono is.
>>
>
> The current implementation causes lots of spilling and other
> unnecessary work which the jit doesn't remove (the work massi is
> doing should improve this). Some parts of it can be easily changed
> to use unsafe code and that should improve performance a lot: I'll
> leave that to Sebastien:-)

This is good to hear. I hope the benchmarking I did will provide some  
information that somebody will find useful.

For my specific application, there is no such thing as "enough"  
performance:) I plan on writing a managed wrapper around libcrypto  
for this reason. This will be the subject of another email.

>>> Some of the data looks definitely bogus: it reports a stall even on
>>> the addi, here:
>>>
>>>    0x2e143c8 lwz      r4,32(r1)    3:1 Stall=2
>>>    0x2e143cc lwz      r5,12(r4)    3:1 Stall=2
>>>    0x2e143d0 cmplwi   r5,0x0000     3:1 Stall=2
>>>    0x2e143d4 blel     $+696 <0x2e1468c [8B]>    2:1
>>> 0.4%    0x2e143d8 addi     r4,r4,16     2:1 Stall=1
>>>
> [...]
>
>> As for the stall statistics, you have misread them. Each line that
>> says "Stall=N" is saying that the instruction latency of the marked
>> instruction will cause a subsequent dependent instruction to stall,
>> not that the marked instruction itself will stall. N is the maximum
>> number of stall cycles for the nearest dependent instruction. The
>>
>
> Since the tool reports that the addi stalls only sometimes (check the
> similar code sequences where no stall is reported), my take
> is that your interpretation or the data reported is not correct.

I'm not sure if my meaning came across. The line next to the addi  
instruction that says "Stall=1" means that a dependent instruction  
*following* the addi looks like it will stall while waiting for the  
results from addi, not that the addi instruction itself will stall.  
The code that follows that specific instruction looks like this:

0.4%    0x2e143d8     addi     r4,r4,16     2:1        Stall=1
        0x2e143dc     lbz      r4,0(r4)     3:1        Stall=2
        0x2e143e0     add      r3,r3,r4     2:1        Stall=1
        0x2e143e4     stw      r3,44(r1)    3:1

The instruction latency of the addi instruction is 2 cycles; the lbz  
that immediately follows the addi is dependent on the addi. The lbz  
will stall for 1 cycle. That is what the Shark output is trying to say.
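
To make the arithmetic concrete, here is a rough Python sketch of that
reading of the Stall column. It is only a toy model, not how Shark
actually computes anything: it assumes the first number in the "2:1"
column is the instruction latency, that instructions issue in order one
per cycle, and that the stall charged to an instruction is its latency
minus the distance to the nearest instruction that reads its result.
The names stall_cycles and annotate are made up for this example.

```python
def stall_cycles(latency, distance=1):
    """Cycles a dependent instruction waits if it issues `distance`
    slots after a producer with the given latency (toy model)."""
    return max(0, latency - distance)


# (mnemonic, destination register, source registers, assumed latency)
# Latencies are taken from the first number of the "N:M" column above.
sequence = [
    ("addi", "r4", ["r4"], 2),
    ("lbz", "r4", ["r4"], 3),
    ("add", "r3", ["r3", "r4"], 2),
    ("stw", None, ["r3", "r1"], 3),  # a store produces no register result
]


def annotate(seq):
    """Attach to each instruction the stall it causes in its nearest
    dependent instruction, mirroring the Stall=N annotation."""
    result = []
    for i, (name, dest, _sources, latency) in enumerate(seq):
        stall = 0
        if dest is not None:
            for j in range(i + 1, len(seq)):
                if dest in seq[j][2]:
                    # Nearest later instruction that reads the result.
                    stall = stall_cycles(latency, j - i)
                    break
                if seq[j][1] == dest:
                    # Result overwritten before anything read it.
                    break
        result.append((name, stall))
    return result


for name, stall in annotate(sequence):
    print(name, "Stall=%d" % stall if stall else "")
```

Under those assumptions the model charges the stall to the producing
instruction (addi, lbz, add) and nothing to the final stw, which is the
same pattern as the Stall annotations in the listing.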

     -Allan

--
Allan Hsu <allan at counterpop dot net>
1E64 E20F 34D9 CBA7 1300  1457 AC37 CBBB 0E92 C779