[Mono-devel-list] mono AES performance woes (was: poor PPC JIT output)

Fri Jul 15 20:42:11 EDT 2005

On Jul 15, 2005, at 3:39 AM, Paolo Molaro wrote:

> On 07/14/05 Allan Hsu wrote:
>
>> Code generated by the PPC code emitter performs very poorly in
>> comparison to the same code emitted for other platforms (most
>> notably, x86). I had a brief conversation about this with Miguel in
>> #mono today and he suggested that I post some examples.
>>
>
> I'm sure he meant an actual test case, which you didn't provide.

I apologize for that. I was sharing the information I had already  
gathered as part of an investigation into the poor performance of the  
OS X port of our product. I was not sure if this sort of data was  
useful or if, as seems the case, I was doing something wrong. It  
looks like the performance problems I was running into are not  
specific to PPC, but the lack of JIT optimization (which I've  
remedied) made them *very* apparent.

>> Preliminary profiling with Shark (a profiling tool that is part of
>> the Apple CHUD tools) shows some heinously inefficient JIT output on
>> both G4 and G5 machines. Here's some sample Shark analysis on the
>> code emitted by mono 1.1.8.1 from
>> System.Security.Cryptography.RijndaelTransform.ECB(byte[], byte[])
>> and System.Security.Cryptography.RijndaelTransform.ShiftRow(bool):
>>
>> http://strangecargo.org/~allan/mono/
>>
>
> It looks like optimizations are not enabled: are you embedding mono
> in your app?
> You should try adding:
>     mono_set_defaults (0, mono_parse_default_optimizations (NULL));
> before the call to mono_jit_init ().

I am indeed using embedded mono, and I was not at all aware that  
optimizations were disabled by default. This does not occur in any of  
the sample code that I've seen and this is the first I've heard of it.

Is there any reference on what sorts of things you can change using  
mono_set_defaults? Following the mono source for references to that  
function wasn't particularly enlightening. It would be useful if the  
Wiki page on embedding mono mentioned JIT optimization.

I have done some more isolated testing of AES performance after  
turning on optimization and it seems that the JIT-emitted PPC code is  
roughly on par with x86 mono performance. Here is the code I used for  
some simple benchmarking:

http://strangecargo.org/~allan/mono/aes.tar.bz2

Here are some times for 1000 encrypts/decrypts of 32768-byte chunks
from some machines we have here in the office, ordered from slowest
to fastest:

57.7 seconds under mono 1.1.8.1, OS X 10.4.2 (1.67 GHz G4 1.2)
55.0 seconds under mono 1.1.8.1, Linux 2.6.9 (1.8 GHz Athlon XP 2500+)
45.8 seconds under mono 1.1.8.1, Linux 2.6.9 (2.2 GHz Athlon 64 3200+)
42.4 seconds under mono 1.1.8.1, OS X 10.4.2 (2.0 GHz G5 3.0)
9.01 seconds under Microsoft .NET 1.1.4322, Windows XP Pro SP2 (2.0 GHz Athlon 64 3200+)

If you look at the benchmark code, it uses RijndaelManaged to do  
encrypt/decrypt. This class is supposedly 100% managed code in the  
Microsoft implementation.

Included in the tarball is some native code that links against  
OpenSSL to do the same thing. This is what native performance for the  
same sized chunks looks like:

1.67 seconds under OpenSSL 0.9.7a, Linux 2.6.9 (1.8 GHz Athlon XP 2500+)
1.44 seconds under OpenSSL 0.9.7, OS X 10.4.2 (1.67 GHz G4 1.2)
1.05 seconds under OpenSSL 0.9.7, OS X 10.4.2 (2.0 GHz G5 3.0)
0.67 seconds under OpenSSL 0.9.7a, Linux 2.6.9 (2.2 GHz Athlon 64 3200+)

To be fair, the native implementation is able to take advantage of
64-bit processors when available, while all mono builds in the above
benchmarks are 32-bit. The Windows XP machine is the standard 32-bit  
install, even though the processor is 64-bit. This is a pretty  
informal benchmark, but all I'm interested in showing here is how bad  
the AES performance under mono is.
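
For a rough sense of scale, the two lists above can be put side by
side per machine. This is only a sketch of the arithmetic in Python;
the dictionary keys and the slowdown() helper are names made up for
illustration, and the timings are the ones quoted above.

```python
# Seconds for the same run on each machine, copied from the mono and
# OpenSSL lists above. The machine labels are shorthand, not official names.
mono_seconds = {
    "G4 1.67 GHz": 57.7,
    "Athlon XP 2500+ 1.8 GHz": 55.0,
    "Athlon 64 3200+ 2.2 GHz": 45.8,
    "G5 2.0 GHz": 42.4,
}
openssl_seconds = {
    "G4 1.67 GHz": 1.44,
    "Athlon XP 2500+ 1.8 GHz": 1.67,
    "Athlon 64 3200+ 2.2 GHz": 0.67,
    "G5 2.0 GHz": 1.05,
}


def slowdown(managed, native):
    """Return how many times longer the managed run took than the native run."""
    return managed / native


for machine, managed in mono_seconds.items():
    ratio = slowdown(managed, openssl_seconds[machine])
    print(f"{machine}: mono takes {ratio:.1f}x as long as OpenSSL")
```

On these numbers the managed code takes somewhere between about 33
and 68 times as long as the native code, with the largest gap on the
Athlon 64, where the native build can use the 64-bit processor.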

It was suggested in #mono that I try compiling the mono AES
implementation under VS.NET and running it under the Microsoft VM to
compare performance.
The resulting project is available here:
http://strangecargo.org/~allan/mono/AESSpeedTest.zip

The same operation benchmarks as follows:
22.76 seconds under Microsoft .NET 1.1.4322, Windows XP Pro SP2 (2.0 GHz Athlon 64 3200+)

The AES code is taken from mono svn, so it may be different from the  
code used in the mono 1.1.8.1 benchmarks above.

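While switching to the Microsoft VM boosts speed significantly, the
mono AES code is still well behind Microsoft's own RijndaelManaged on
the same VM and machine (22.76 versus 9.01 seconds), so it looks like
significant gains could be made by optimizing the mono
RijndaelManaged code.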
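
The three Athlon 64 timings can be split into two rough factors: how
much is down to the AES implementation and how much to the VM. The
Python sketch below is only an estimate. It assumes the 2.2 GHz Linux
machine and the 2.0 GHz Windows machine are close enough to compare,
and that the svn AES code behaves like the 1.1.8.1 code; neither is
strictly true. The variable names are made up for illustration.

```python
# Timings in seconds, taken from the benchmarks above.
ms_aes_on_ms_vm = 9.01       # Microsoft RijndaelManaged, .NET 1.1 (2.0 GHz Athlon 64)
mono_aes_on_ms_vm = 22.76    # mono svn AES code, .NET 1.1 (same machine)
mono_aes_on_mono_vm = 45.8   # mono 1.1.8.1, Linux (2.2 GHz Athlon 64, assumed comparable)

# Same VM, different AES code: what a better implementation might recover.
implementation_factor = mono_aes_on_ms_vm / ms_aes_on_ms_vm

# Roughly the same AES code, different VM: what is left for the JIT/runtime.
vm_factor = mono_aes_on_mono_vm / mono_aes_on_ms_vm

# The two factors multiply back to the overall gap.
total_factor = mono_aes_on_mono_vm / ms_aes_on_ms_vm

print(f"implementation: {implementation_factor:.2f}x")
print(f"VM:             {vm_factor:.2f}x")
print(f"total:          {total_factor:.2f}x")
```

Under those assumptions the implementation accounts for about 2.5x
and the VM for about 2.0x of the roughly 5x overall gap.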

(some insightful comment would go here if I weren't so tired of  
writing this email).

-Allan

<everything below doesn't matter so much, since it was based on  
information gathered from unoptimized JIT output>
>> Information on how to read Shark analysis comes with Shark (available
>> for free from the Apple Developer Connection website).
>>
>
> A direct pointer to the doc would be useful.

Unfortunately, I can't find a copy of the documentation that's
available online (otherwise, I would have linked it). The closest
thing I can find to online documentation is this document:
http://developer.apple.com/tools/sharkoptimize.html

>> (A summary:
>> numerous and frequent pipeline stalls, unoptimized loops).
>>
>
> Some of the data looks definitely bogus: it reports a stall even on
> the addi, here:
>
>     0x2e143c8 lwz      r4,32(r1)    3:1 Stall=2
>     0x2e143cc lwz      r5,12(r4)    3:1 Stall=2
>     0x2e143d0 cmplwi   r5,0x0000     3:1 Stall=2
>     0x2e143d4 blel     $+696 <0x2e1468c [8B]>    2:1
> 0.4%    0x2e143d8 addi     r4,r4,16     2:1 Stall=1
>
> How can it stall while adding an immediate value to a register
> that was loaded several instructions before? Anyway, maybe the
> documentation for the output format will shed some light, once provided.
> As for the loop commentary: did you actually test how much you
> gain by aligning loop starts on 32 byte boundaries? It would be
> a huge waste of memory in most cases.

I was not implying that all of the Shark suggestions were useful. I
was simply summarizing the bulk of the suggestions. There are other
sorts of optimizations that Shark often suggests that are absent from
the analysis of JIT code. I agree that loop alignment is probably
wasteful in the majority of cases.

As for the stall statistics, you have misread them. Each line that  
says "Stall=N" is saying that the instruction latency of the marked  
instruction will cause a subsequent dependent instruction to stall,  
not that the marked instruction itself will stall. N is the maximum  
number of stall cycles for the nearest dependent instruction. The  
documentation claims that the register analysis algorithm they use is  
"very conservative" and the actual stall cycles observed may be higher.

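To make that reading of the Stall=N column concrete, here is a small
Python sketch that applies it to the five instructions quoted above.
The register read/write sets are filled in by hand from my reading of
the PPC instructions (cr0 for the compare, lr for the branch-and-link),
and the Insn type and nearest_dependent() helper are hypothetical
names. This is not how Shark itself performs the analysis.

```python
from typing import NamedTuple, Optional, Tuple


class Insn(NamedTuple):
    text: str                # disassembly as shown in the listing
    writes: Optional[str]    # register the instruction produces, if any
    reads: Tuple[str, ...]   # registers the instruction consumes
    stall: int               # the "Stall=N" annotation, 0 if absent


# The quoted excerpt; read/write sets are hand-written assumptions.
LISTING = [
    Insn("lwz    r4,32(r1)", "r4", ("r1",), 2),
    Insn("lwz    r5,12(r4)", "r5", ("r4",), 2),
    Insn("cmplwi r5,0x0000", "cr0", ("r5",), 2),
    Insn("blel   $+696", "lr", ("cr0",), 0),
    Insn("addi   r4,r4,16", "r4", ("r4",), 1),
]


def nearest_dependent(listing, index):
    """Return the index of the first later instruction that reads the
    register written at `index`, or None if none appears in the listing."""
    target = listing[index].writes
    if target is None:
        return None
    for later in range(index + 1, len(listing)):
        if target in listing[later].reads:
            return later
        if listing[later].writes == target:
            return None  # value overwritten before anything read it
    return None


for i, insn in enumerate(LISTING):
    if insn.stall == 0:
        continue
    dep = nearest_dependent(LISTING, i)
    victim = LISTING[dep].text if dep is not None else "(outside this excerpt)"
    print(f"{insn.text:<18} Stall={insn.stall}: up to {insn.stall} cycle(s) charged to {victim}")
```

Read this way, the Stall=1 on the addi refers to whatever later
instruction consumes r4, which falls outside the five lines quoted,
so the annotation only looks odd in isolation.
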
     -Allan
--
Allan Hsu <allan at counterpop dot net>
1E64 E20F 34D9 CBA7 1300  1457 AC37 CBBB 0E92 C779