[Mono-devel-list] RAPGO Proposal
Willibald.Krenn at gmx.at
Fri Nov 26 13:19:09 EST 2004
Dynamic Profile Guided Optimizations (DPGOs)
Runtime Applied Profile Guided Optimizations (RAPGOs)
as I like to call them. Version 0.1; 26th Nov. 2004
I'd like to present some of my ideas about implementing RAPGOs
in the mono just-in-time compiler. I'll also cover some bits
of the mono jit that I don't quite understand, and ask for the
reasons behind some of the design decisions.
I'm doing this work as a diploma thesis, so laying out all my
ideas this broadly entails some risk for my work. (Before they
are implemented, that is: you mono hackers are damn fast
coders!)
That being said, I hope for lots of feedback and comments from
Before going into details, I'd like to share some of my sources
on the RAPGO topic:
First and foremost, I like the PhD work Thomas Kistler did
at the University of California; some of his ideas are well
suited for inclusion into mono (everything IMO of course).
Unfortunately this work is not freely available, but
several excerpts from his thesis are:
-> Kistler, Thomas and Franz, Michael
"Continuous Program Optimization: A Case Study"
"Continuous Program Optimization: Design and Evaluation"
What other JITs are doing in that direction:
-> Arnold, Hind and Ryder
"Online Feedback-Directed Optimization of Java"
(Includes a feedback directed splitting algo)
-> Adl-Tabatabai et al. (Intel)
"The StarJIT Compiler: A Dynamic Compiler for Managed Runtime
-> Hind, Rajan, Sweeney
"Phase Shift Detection: A Problem Classification"
-> Pettis, Hansen
"Profile Guided Code Positioning"
(how BBs should be sorted)
Last, but not least: Profiling
-> Conte, Menezes, Hirsch
"Accurate and Practical Profile-Driven Compilation Using the
"Using Branch Handling Hardware to Support Profile-Driven
-> Ammons, Ball and Larus
"Exploiting Hardware Performance Counters with Flow and
Context Sensitive Profiling"
'Priority Path' [PP]
I'll refer to all sorts of time-critical operations as being
in the Priority Path. This includes e.g. first-time
compilation, exception handling etc. (Most/All of the
'Robust Path' [RP]
All operations that are not time critical. 'Robust'
should indicate that inhabitants of this path shall be
Robust Path code allows for and should use
Just read the papers if you don't understand why this is a
necessary technique for JIT compilers.
II Design Goals
(1) One of the most important things besides performance is
considered to be security.
(2) Code shall be maintainable.
(3) First time compilation should be as fast as possible.
(4) Profiling with low overhead and not full time. (Except
(5) Somehow remember the set of optimizations an application
(6) Extensible framework for optimizations/profiling techniques.
*) [PP] First-Time Compilation is done with current mini
*) [RP] A low priority level thread will be used for all the
*) In the following I'll stick to the design outlined by
- All Optimization Objects need
- A method that returns the estimated speedup achievable
(fast, without using/needing the IR or doing any
- A method that actually performs the transformation on
the SSA/CFG.. (IR)
- All Profiling Objects need
- A method that does aging (or whatever) on profile data
- A method for signaling phase changes
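
To make the two kinds of objects a bit more concrete, here is a minimal
sketch in C. Everything in it is an assumption for illustration only: the
names (OptimizationObject, ProfilingObject, pick_best, ...), the placeholder
IR and profile types, and the speedup figures are made up and are not part
of mono or of the Kistler design.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical placeholders; mono's real IR and profile types differ. */
typedef struct MethodIR { int n_insns; } MethodIR;
typedef struct MethodProfile { unsigned long calls; } MethodProfile;

typedef struct OptimizationObject {
    const char *name;
    /* Fast estimate of the achievable speedup factor; must not need the IR. */
    double (*estimate_speedup)(const MethodProfile *profile);
    /* Performs the transformation on the IR; returns 0 on success. */
    int (*transform)(MethodIR *ir, const MethodProfile *profile);
} OptimizationObject;

typedef struct ProfilingObject {
    const char *name;
    /* Ages the profile data so that old behaviour counts less. */
    void (*age)(MethodProfile *profile);
    /* Returns nonzero if the profile data indicates a phase change. */
    int (*phase_changed)(const MethodProfile *profile);
} ProfilingObject;

/* Example aging policy (an assumption): halve the call counter. */
void halve_calls(MethodProfile *profile) { profile->calls /= 2; }

/* Example estimators with made-up figures, for illustration only. */
double estimate_inlining(const MethodProfile *p)
{
    return p->calls > 1000 ? 1.5 : 1.0;
}

double estimate_unrolling(const MethodProfile *p)
{
    (void)p;
    return 1.1;
}

/* Returns the index of the object with the highest estimate above
   min_speedup, or -1 if no object is worth applying. */
int pick_best(const OptimizationObject *opts, size_t n,
              const MethodProfile *profile, double min_speedup)
{
    int best = -1;
    double best_gain = min_speedup;
    for (size_t i = 0; i < n; i++) {
        assert(opts[i].estimate_speedup != NULL);
        double gain = opts[i].estimate_speedup(profile);
        if (gain > best_gain) {
            best_gain = gain;
            best = (int)i;
        }
    }
    return best;
}
```

The point of keeping estimate_speedup separate from transform is that the
selection can run without building the IR at all.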
The Optimization/Profiling Objects are managed by some
manager (Profiler, Optimizer). On top there is another
Manager that is responsible for interaction of the
Work is done like:
1) [PP] do first time compilation (perhaps only
2) [RP] Manager asks Profiler for any kind of Phase
3) [RP] Call Optimizer to work on the candidates
indicated in step 2
4) [RP] Optimizer uses Profiler data to apply
5) [RP] Replace slow procedure/method by new one.
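
One pass over steps 2 to 5 could be sketched like this in C. This is only
a sketch under assumptions: the Method struct, the "hot" threshold and the
function names are hypothetical, and the actual optimization and code
replacement are reduced to bumping an opt_level counter.

```c
#include <assert.h>
#include <stddef.h>

typedef struct Method {
    const char *name;
    unsigned long calls;  /* profile data since the last (re)compilation */
    int opt_level;        /* 0 = result of first-time compilation */
} Method;

enum { MAX_CANDIDATES = 8 };

/* Step 2: stand-in for asking the profiler. Writes the indices of
   methods called at least hot_threshold times into out and returns
   how many were found (at most max_out). */
size_t find_candidates(const Method *methods, size_t n,
                       unsigned long hot_threshold,
                       size_t *out, size_t max_out)
{
    size_t found = 0;
    for (size_t i = 0; i < n && found < max_out; i++)
        if (methods[i].calls >= hot_threshold)
            out[found++] = i;
    return found;
}

/* Steps 3-5: stand-in for optimizing with profile data and installing
   the new version; the profile is restarted afterwards. */
void reoptimize_and_replace(Method *m)
{
    m->opt_level++;
    m->calls = 0;
}

/* One iteration of the low-priority [RP] thread's work loop.
   Returns the number of methods that were replaced. */
size_t robust_path_pass(Method *methods, size_t n,
                        unsigned long hot_threshold)
{
    assert(methods != NULL);
    size_t candidates[MAX_CANDIDATES];
    size_t count = find_candidates(methods, n, hot_threshold,
                                   candidates, MAX_CANDIDATES);
    for (size_t i = 0; i < count; i++)
        reoptimize_and_replace(&methods[candidates[i]]);
    return count;
}
```

Resetting the counter after replacement means a method has to become hot
again before it is looked at a second time.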
Notes: * It's also possible to replace a given method by one
that is instrumented for further profile generation
* Optimizer itself decides what OptimizationObjects it
will apply - based on the estimated speedup figure the
One addition, not found in the Kistler paper: There are
applications (e.g. mcs) where RAPGOs simply don't make sense at
all. We therefore must have some metadata for applications
indicating mean run time etc.
IMO it would be beneficial to somehow cache compiled code on
disk along with the executable, so that the first time
compilation may be replaced by loading the cached version...
(Some sort of implicit AOT compilation?) Of course this can't
be done with fully optimized versions..
IV Profiling (AMD64)
I like the idea of 'Profile Buffers' introduced in "Using Branch
Handling Hardware to Support Profile-Driven Optimization".
The authors basically propose some hardware device that
increments different counters (register) automatically on
taking/not taking a branch.
Of course we don't have that device available, but instead I
propose to use the legacy MMX registers for that purpose.
(Except MM0 and MM1, which can be used for sin/cos calculation)
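
What such a profile buffer records can be sketched in plain C. Note that
this sketch keeps the counters in ordinary memory, not in MMX registers,
and that the buffer size, the branch ids and all function names are
assumptions made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

enum { MAX_BRANCHES = 64 };

/* One pair of counters per instrumented branch. */
typedef struct BranchCounters {
    uint32_t taken;
    uint32_t not_taken;
} BranchCounters;

/* Software stand-in for the hardware profile buffer. */
BranchCounters profile_buffer[MAX_BRANCHES];

/* What the instrumentation would do at each conditional branch. */
void record_branch(unsigned id, int taken)
{
    assert(id < MAX_BRANCHES);
    if (taken)
        profile_buffer[id].taken++;
    else
        profile_buffer[id].not_taken++;
}

/* Fraction of executions in which the branch was taken.
   Returns 0.5 (no bias known) if the branch was never executed. */
double taken_ratio(unsigned id)
{
    assert(id < MAX_BRANCHES);
    uint32_t total = profile_buffer[id].taken + profile_buffer[id].not_taken;
    if (total == 0)
        return 0.5;
    return (double)profile_buffer[id].taken / (double)total;
}
```

An optimization such as basic block ordering would then only need
taken_ratio() for the branches of the method it works on.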
If anyone knows a way of reading out the BTB on current x86
CPUs, this information could be used too..
In that context "Static Branch Frequency and Program Profile
Analysis" by Wu and Larus seems to be interesting too.
How to instrument a given method?
Basically a ProfilerObject that needs to instrument code shall
define a separate OptimizationObject that applies the
I'm not really sure how to handle that - basically there are
I've come up with the following 'idea': Each method is called
indirectly via call *rax (where rax points to some GOT). So by
changing the offset, every call will go to the new location.
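
In C terms the indirect call scheme is just a table of function pointers.
The following is a minimal sketch, assuming a fixed-size table and
hypothetical names (method_table, install_method, call_method); the two
sum_to_n variants only stand in for a slow and an optimized version of the
same method.

```c
#include <assert.h>
#include <stddef.h>

enum { MAX_SLOTS = 16 };

typedef int (*MethodCode)(int);

/* One entry per method; plays the role of the table rax points into. */
MethodCode method_table[MAX_SLOTS];

/* Installs (or replaces) the code for a slot. A real JIT would need an
   atomic store here, since other threads may be calling concurrently. */
void install_method(size_t slot, MethodCode code)
{
    assert(slot < MAX_SLOTS);
    method_table[slot] = code;
}

/* Every call goes through the table, so callers never need patching. */
int call_method(size_t slot, int arg)
{
    assert(slot < MAX_SLOTS && method_table[slot] != NULL);
    return method_table[slot](arg);
}

/* Stand-in for the first-time compiled version. */
int sum_to_n_loop(int n)
{
    int sum = 0;
    for (int i = 1; i <= n; i++)
        sum += i;
    return sum;
}

/* Stand-in for the optimized version. */
int sum_to_n_closed(int n)
{
    return n * (n + 1) / 2;
}
```

After install_method(0, sum_to_n_closed), a caller that keeps using
call_method(0, ...) gets the new code without being touched itself. The
price is one extra memory load on every call.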
Another technique would be to replace the existing method by
some code that patches the caller's address to jump to the new
code the next time directly. This however means that we would
have to take care how long a given 'Patcher' needs to be
preserved... (some problem GC could take care of..)
Before freeing/overwriting a method we also have to ensure no
thread is executing this piece of code anymore. Simple
Entry/Exit counters should be able to handle that..
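
A minimal sketch of such entry/exit counters using C11 atomics follows.
The CodeVersion struct and all function names are hypothetical. The sketch
also ignores one race: a thread that has already fetched the old address
but not yet incremented the counter is not seen, so a real implementation
would need something extra on top of this.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Bookkeeping for one compiled version of a method. */
typedef struct CodeVersion {
    atomic_int active;    /* threads currently executing this version */
    atomic_bool retired;  /* set once a newer version has been installed */
} CodeVersion;

void code_version_init(CodeVersion *v)
{
    atomic_init(&v->active, 0);
    atomic_init(&v->retired, false);
}

/* Emitted at the method's entry. */
void method_enter(CodeVersion *v)
{
    atomic_fetch_add(&v->active, 1);
}

/* Emitted at every exit of the method (including exceptional exits). */
void method_exit(CodeVersion *v)
{
    int previous = atomic_fetch_sub(&v->active, 1);
    assert(previous > 0);
    (void)previous;
}

/* Called after callers have been redirected to the new version. */
void retire(CodeVersion *v)
{
    atomic_store(&v->retired, true);
}

/* The memory may be reused only when nobody can still be inside. */
bool can_free(CodeVersion *v)
{
    return atomic_load(&v->retired) && atomic_load(&v->active) == 0;
}
```

A thread stuck in an endless loop keeps active above zero forever, which
is exactly the case where the signal trick would be needed.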
(In case of an endless loop, code could be patched so that this
thread generates a signal..)
Currently all code that is being emitted gets copied to its
final location - smells like overhead to me..
What about mmap and direct emit into this area?
This would also save time for freeing/allocating memory for
Ok, as you might have guessed by now, I don't have any code to share -
but I'm confident that this will change in the next few weeks..
Thanks for wading through all this,