[Mono-devel-list] poor PPC JIT output

Fri Jul 15 06:39:36 EDT 2005

On 07/14/05 Allan Hsu wrote:
> Code generated by the PPC code emitter performs very poorly in  
> comparison to the same code emitted for other platforms (most  
> notably, x86). I had a brief conversation about this with Miguel in  
> #mono today and he suggested that I post some examples.

I'm sure he meant an actual test case, which you didn't provide.

> Preliminary profiling with Shark (a profiling tool that is part of  
> the Apple CHUD tools) shows some heinously inefficient JIT output on  
> both G4 and G5 machines. Here's some sample Shark analysis on the  
> code emitted by mono 1.1.8.1 from  
> System.Security.Cryptography.RijndaelTransform.ECB(byte[], byte[])  
> and System.Security.Cryptography.RijndaelTransform.ShiftRow(bool):
> 
> http://strangecargo.org/~allan/mono/

It looks like optimizations are not enabled: are you embedding mono
in your app?
You should try adding:
	mono_set_defaults (0, mono_parse_default_optimizations (NULL));
before the call to mono_jit_init ().

> Information on how to read Shark analysis comes with Shark (available  
> for free from the Apple Developer Connection website).

A direct pointer to the doc would be useful.

> (A summary:  
> numerous and frequent pipeline stalls, unoptimized loops).

Some of the data definitely looks bogus: it reports a stall even on
the addi here:

	0x2e143c8 lwz      r4,32(r1)	3:1 Stall=2
	0x2e143cc lwz      r5,12(r4)	3:1 Stall=2
	0x2e143d0 cmplwi   r5,0x0000 	3:1 Stall=2
	0x2e143d4 blel     $+696 <0x2e1468c [8B]>	2:1
0.4%	0x2e143d8 addi     r4,r4,16 	2:1 Stall=1

How can it stall while adding an immediate value to a register
that was loaded several instructions before? Anyway, maybe the documentation
for the output format will shed some light, once provided.
As for the loop commentary: did you actually measure how much you
gain by aligning loop starts on 32-byte boundaries? It would be
a huge waste of memory in most cases.
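
To put a rough number on that padding cost, here is a minimal sketch in C.
The helper name align_padding is made up for illustration, and the address
used in the note below is simply one taken from the listing above; nothing
says it is an actual loop head.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: number of padding bytes needed to move addr up to
 * the next multiple of align. Returns 0 if addr is already aligned.
 * align must be a power of two.
 */
static uint32_t align_padding(uint32_t addr, uint32_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (align - (addr & (align - 1))) & (align - 1);
}
```

For 0x2e143d8 and a 32-byte boundary this gives 8 bytes, i.e. two 4-byte
nops. Assuming loop heads fall evenly on the eight possible 4-byte-aligned
offsets within a 32-byte block, the padding averages 14 bytes per loop.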

> Is there any active effort to optimize the PPC code emitter? The  

We usually optimize only when we find performance issues (or when someone
reports them). Sometimes the performance issues are easily addressable,
but if nobody reports them it means they are not important, so we
spend our limited time on other tasks.

> above two methods account for the majority of CPU time on a pegged  
> 2Ghz G5 while decrypting AES blocks coming off the wire. The x86  
> machine encrypting the data (also running mono) doesn't even break a  
> sweat.

Without a test case this doesn't mean anything: you can't compare
two programs that do two different things and arrive at any
meaningful conclusion.

lupus

-- 
-----------------------------------------------------------------
lupus at debian.org                                     debian/rules
lupus at ximian.com                             Monkeys do it better