[Mono-devel-list] mono AES performance woes (was: poor PPC JIT output)

Zoltan Varga vargaz at gmail.com
Mon Jul 18 15:29:48 EDT 2005


Hi,

This has been fixed in SVN, so you no longer need to call mono_set_defaults
(which isn't in the public headers anyway).

Zoltan

On 7/18/05, Allan Hsu <allan at counterpop.net> wrote:
> On Jul 18, 2005, at 2:59 AM, Paolo Molaro wrote:
> 
> > On 07/15/05 Allan Hsu wrote:
> >
> >> Is there any reference on what sorts of things you can change using
> >> mono_set_defaults? Following the mono source for references to that
> >> function wasn't particularly enlightening. It would be useful if the
> >>
> >
> > grep mono_set_defaults *.c
> > mini.c:mono_set_defaults (int verbose_level, guint32 opts)
> > Should be pretty evident. Just always use the result of
> > mono_parse_default_optimizations (NULL) as the opts value.
> 
> I understood the verbose_level parameter, but the opts parameter was
> what mystified me. I should have been more specific about what I was
> looking for. At the time, I didn't understand the value that
> mono_parse_default_optimizations() returns or what values you can
> pass in to affect it. I've since traced it back to the relevant code
> in driver.c and the mini-X.c platform code, and I now see how it works.
> Is it safe to mess with those parameters, or will it cause undefined
> results?
> 
> >> To be fair, the native implementation is able to take advantage of
> >> 64-bit processors when available, while all mono builds in the above
> >> benchmarks are 32-bit. The Windows XP machine is the standard 32-bit
> >> install, even though the processor is 64-bit. This is a pretty
> >> informal benchmark, but all I'm interested in showing here is how bad
> >> the AES performance under mono is.
> >>
> >
> > The current implementation causes lots of spilling and other
> > unnecessary work which the jit doesn't remove (the work massi is
> > doing should improve this). Some parts of it can be easily changed
> > to use unsafe code and that should improve performance a lot: I'll
> > leave that to Sebastien :-)
> 
> This is good to hear. I hope the benchmarking I did will provide some
> information that somebody will find useful.
> 
> For my specific application, there is no such thing as "enough"
> performance:) I plan on writing a managed wrapper around libcrypto
> for this reason. This will be the subject of another email.
> 
> >>> Some of the data looks definitely bogus: it reports a stall even on
> >>> the addi, here:
> >>>
> >>>    0x2e143c8 lwz      r4,32(r1)    3:1 Stall=2
> >>>    0x2e143cc lwz      r5,12(r4)    3:1 Stall=2
> >>>    0x2e143d0 cmplwi   r5,0x0000     3:1 Stall=2
> >>>    0x2e143d4 blel     $+696 <0x2e1468c [8B]>    2:1
> >>> 0.4%    0x2e143d8 addi     r4,r4,16     2:1 Stall=1
> >>>
> > [...]
> >
> >> As for the stall statistics, you have misread them. Each line that
> >> says "Stall=N" is saying that the instruction latency of the marked
> >> instruction will cause a subsequent dependent instruction to stall,
> >> not that the marked instruction itself will stall. N is the maximum
> >> number of stall cycles for the nearest dependent instruction. The
> >>
> >
> > Since the tool reports that the addi stalls only sometimes (check the
> > similar code sequences where no stall is reported), my take
> > is that your interpretation or the data reported is not correct.
> 
> I'm not sure if my meaning came across. The line next to the addi
> instruction that says "Stall=1" means that a dependent instruction
> *following* the addi looks like it will stall while waiting for the
> results from addi, not that the addi instruction itself will stall.
> The code that follows that specific instruction looks like this:
> 
> 0.4%    0x2e143d8     addi     r4,r4,16     2:1        Stall=1
>         0x2e143dc     lbz      r4,0(r4)     3:1        Stall=2
>         0x2e143e0     add      r3,r3,r4     2:1        Stall=1
>         0x2e143e4     stw      r3,44(r1)    3:1
> 
> The instruction latency of the addi instruction is 2 cycles; the lbz
> that immediately follows the addi is dependent on the addi. The lbz
> will stall for 1 cycle. That is what the Shark output is trying to say.
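
The reading of the "Stall=N" column described above can be sketched as a small model. This is only an illustration, not how Shark computes its numbers. It assumes a single-issue, in-order pipeline, and it assumes the first number in the "2:1" column is the instruction latency (which matches the 2-cycle figure given for addi). The names Instr, producer_stalls and SEQUENCE are made up for this sketch.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Instr:
    """One instruction in a simplified in-order pipeline model (hypothetical)."""
    name: str
    dest: Optional[str]      # register written, or None (e.g. a store)
    srcs: Tuple[str, ...]    # registers read
    latency: int             # cycles until the result is available (assumed)


def producer_stalls(instrs):
    """Return, per instruction, the stall its nearest dependent would see.

    Assumes one instruction issues per cycle, so a consumer sitting
    `distance` slots after its producer waits max(0, latency - distance).
    The stall is attributed to the producer, as in the quoted listing.
    """
    stalls = []
    for i, ins in enumerate(instrs):
        stall = 0
        if ins.dest is not None:
            for distance, later in enumerate(instrs[i + 1:], start=1):
                if ins.dest in later.srcs:
                    # Nearest instruction that reads the result.
                    stall = max(0, ins.latency - distance)
                    break
                if later.dest == ins.dest:
                    # Result overwritten before anyone read it.
                    break
        stalls.append(stall)
    return stalls


# The four instructions from the listing above, with latencies taken
# from the first number of the "N:M" column (an assumption).
SEQUENCE = [
    Instr("addi r4,r4,16", "r4", ("r4",), 2),
    Instr("lbz r4,0(r4)", "r4", ("r4",), 3),
    Instr("add r3,r3,r4", "r3", ("r3", "r4"), 2),
    Instr("stw r3,44(r1)", None, ("r3", "r1"), 3),
]

if __name__ == "__main__":
    for ins, stall in zip(SEQUENCE, producer_stalls(SEQUENCE)):
        note = f"Stall={stall}" if stall else ""
        print(f"{ins.name:<16} {note}")
```

In this model the annotation on a producer goes away once an independent instruction sits between it and its consumer. That might be one reason similar sequences show no stall on the addi, but it is an inference from the sketch and has not been checked against Shark.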
> 
>      -Allan
> 
> --
> Allan Hsu <allan at counterpop dot net>
> 1E64 E20F 34D9 CBA7 1300  1457 AC37 CBBB 0E92 C779
> _______________________________________________
> Mono-devel-list mailing list
> Mono-devel-list at lists.ximian.com
> http://lists.ximian.com/mailman/listinfo/mono-devel-list
>


