[Mono-bugs] [Bug 487846] PPC code gen is inefficient in several areas

Mon Sep 7 23:22:31 EDT 2009

http://bugzilla.novell.com/show_bug.cgi?id=487846

User munroesj at us.ibm.com added comment
http://bugzilla.novell.com/show_bug.cgi?id=487846#c31

--- Comment #31 from Steven Munroe <munroesj at us.ibm.com>  2009-09-07 21:22:23 MDT ---
Some observations: the Mono JIT is generating a lot of branch +4 instructions
(a branch to the instruction immediately following the branch).
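
As a rough illustration of what to look for in the generated code, here is a
minimal sketch that counts b +4 instructions in a buffer of 32-bit PowerPC
instruction words. The helper names are made up for this example and are not
part of Mono; the encoding (I-form, opcode 18, AA=0, LK=0) is the standard one.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* PowerPC I-form branch: opcode 18 in the top 6 bits, a 24-bit LI field,
 * then the AA (absolute) and LK (link) bits. */
#define PPC_B_PLUS_4 0x48000004u /* b .+4: relative, no link, offset 4 */

/* Hypothetical helper: true if insn is an unconditional relative branch
 * to the instruction immediately following it. */
static int is_branch_plus_4(uint32_t insn)
{
    return insn == PPC_B_PLUS_4;
}

/* Hypothetical helper: count the b +4 instructions in a code buffer. */
static size_t count_branch_plus_4(const uint32_t *code, size_t n_insns)
{
    size_t i, count = 0;

    assert(code != NULL || n_insns == 0);
    for (i = 0; i < n_insns; i++)
        if (is_branch_plus_4(code[i]))
            count++;
    return count;
}
```

Note that bl .+4 (0x48000005) is deliberately not matched, since it sets the
link register and is not redundant.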

This is bad for a number of reasons, the primary one being that the branch
terminates the (superscalar, multi-instruction) dispatch group. As this branch
is nominally redundant, it raises the CPI and hurts overall performance.

However, this branch +4 is useful in the load-hit-store (LHS) case. For higher
bandwidth, POWER uses a Store Queue (STQ) between the LSU and the L2 cache. But
in the special case of a load of data that was recently stored, the load may
experience a pipeline delay (waiting for the data to reach a bypass stage or
exit the STQ into the L2). The worst case is when the load immediately follows
the store instruction and the IFU attempts to dispatch both in the same group
(as can happen on out-of-order machines like P4/P5/P7). This is detected later
and results in an instruction reject/flush/refetch starting with the load (a
delay of dozens of cycles).
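
To make the LHS pattern concrete, here is a minimal sketch that recognizes a
32-bit store followed by a 32-bit load of the same base register and
displacement. It only handles the D-form stw/lwz pair; the function name is
made up for this example, and a real check would also need to cover std/ld,
the indexed forms, and partially overlapping accesses.

```c
#include <assert.h>
#include <stdint.h>

/* D-form fields: opcode(6) | RT or RS(5) | RA(5) | D(16). */
#define PPC_OPCODE(insn) ((insn) >> 26)
#define PPC_RA(insn)     (((insn) >> 16) & 0x1Fu)
#define PPC_D(insn)      ((insn) & 0xFFFFu)

#define PPC_OP_LWZ 32u
#define PPC_OP_STW 36u

/* Hypothetical helper: true if 'store' is a stw and 'load' is a lwz that
 * reads back the same base register and displacement. */
static int is_load_hit_store(uint32_t store, uint32_t load)
{
    if (PPC_OPCODE(store) != PPC_OP_STW || PPC_OPCODE(load) != PPC_OP_LWZ)
        return 0;
    return PPC_RA(store) == PPC_RA(load) && PPC_D(store) == PPC_D(load);
}
```

For example, stw r3,8(r1) (0x90610008) followed by lwz r4,8(r1) (0x80810008)
matches, while a following lwz r4,12(r1) (0x8081000C) does not.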

This hazard is normally rare, as POWER has lots of registers (32) and compilers
normally allocate locals to registers and avoid the reload. The Mono JIT does
not seem to be allocating locals to registers aggressively enough to avoid the
reload case (even when many registers are available).

I observe that many of the b+4 cases occur when exiting an if/then/else, and
many of those then/else blocks set (store) local variables that are
referenced (loaded) in the next basic block. So in this case simply nopping
the branch may give worse performance than leaving it in place, and a peephole
optimization that checks for this case is required. If there is no LHS case,
then simply nop the branch.

In the LHS case where the store/load use the same register and we can prove
that there is only one entry into the block with the load, we can nop both the
branch and the load.

In the LHS case where the store/load use different registers and we can prove
that there is only one entry into the block with the load, we can nop the
branch and replace the load with a move register.

If we can't prove that there is only one entry into the following block and the
store/load use the same register, we can patch the branch to skip the load
(branch+8).
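
The cases above can be summarized as a small decision function. This is only
a sketch: the enum and function names are hypothetical and not Mono code, and
the last case (LHS, different registers, multiple entries) is not covered by
the text, so leaving the code untouched there is an assumption. The mr
encoding used for the "move register" replacement is the standard extended
mnemonic for or rA,rS,rS.

```c
#include <assert.h>
#include <stdint.h>

typedef enum {
    PEEP_NOP_BRANCH,            /* no LHS: nop the b +4 */
    PEEP_NOP_BRANCH_AND_LOAD,   /* LHS, same register, single entry */
    PEEP_NOP_BRANCH_LOAD_TO_MR, /* LHS, different registers, single entry */
    PEEP_BRANCH_PLUS_8,         /* LHS, same register, multiple entries */
    PEEP_LEAVE_ALONE            /* assumption: case not covered by the text */
} PeepAction;

/* Hypothetical helper: pick the rewrite for one b +4 site. */
static PeepAction choose_peep_action(int is_lhs, int same_reg, int single_entry)
{
    if (!is_lhs)
        return PEEP_NOP_BRANCH;
    if (single_entry)
        return same_reg ? PEEP_NOP_BRANCH_AND_LOAD
                        : PEEP_NOP_BRANCH_LOAD_TO_MR;
    return same_reg ? PEEP_BRANCH_PLUS_8 : PEEP_LEAVE_ALONE;
}

/* Encode mr dst,src, i.e. or dst,src,src (X-form, opcode 31, XO 444). */
static uint32_t ppc_mr(unsigned dst, unsigned src)
{
    assert(dst < 32 && src < 32);
    return (31u << 26) | (src << 21) | (dst << 16) | (src << 11) | (444u << 1);
}
```

With a store from r3 and a reload into r4 in a single-entry block, the load
would be replaced by ppc_mr(4, 3), which encodes as 0x7C641B78.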

This requires some new infrastructure to determine if the load target has
multiple entries, and to avoid special cases (MONO_PATCH_INFO_METHOD_JUMP)
where the branch may be relocated several times (this can generate asserts when
the second patch finds a nop instead of a branch).
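
One possible shape for the second check is sketched below: before touching a
b +4, the peephole consults the list of pending patches and skips any branch
that may still be relocated. The PendingPatch struct and its fields are a
hypothetical stand-in for whatever patch records the JIT keeps; they are not
Mono's actual data structures.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a JIT's list of pending patch records. */
typedef struct PendingPatch {
    uint32_t code_offset;            /* offset of the instruction to patch */
    int may_be_repatched;            /* e.g. a method-jump style patch */
    const struct PendingPatch *next;
} PendingPatch;

/* Hypothetical helper: true if the branch at branch_offset has no pending
 * patch that could later expect to find a branch instruction there. */
static int peephole_may_rewrite(const PendingPatch *patches,
                                uint32_t branch_offset)
{
    const PendingPatch *p;

    for (p = patches; p != NULL; p = p->next)
        if (p->code_offset == branch_offset && p->may_be_repatched)
            return 0;
    return 1;
}
```

A linear scan is used only to keep the sketch short; with many patch records
a sorted array or hash lookup would be the more likely choice.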

I am working through this and hope to have a patch to review soon. This is
helping me to learn about Mono internals and leads into the larger issue of
optimizing calls and pointer loads.
