[Mono-devel-list] RAPGO Proposal

Fri Nov 26 13:19:09 EST 2004

About
                   Dynamic Profile Guided Optimizations (DPGOs)
	                          or
	      Runtime Applied Profile Guided Optimizations (RAPGOs)

as I like to call them.                     Version 0.1; 26th Nov. 2004

Abstract
~~~~~~~~
	I like to present some of my ideas about implementing RAPGOS
  	into the mono just-in-time compiler. I'll also cover some bits
	in the mono jit, that are not quite understandable from my POV
	and ask for reasons some design decisions were taken.

	I'm doing this work as a diploma thesis and therefore it entails
	some risk for my work laying out all my ideas in such a broad
	manner. (Before they are implemented, that is: You mono hackers
	are damn fast coders!)

	That being said, I hope for lots of feedback and comments from
	the community.

Papers
~~~~~~
	Before going into details, I'd like to share some of my sources
	on the RAPGO topic:
	First and formost, I like the PhD work of Thomas Kistler he did
	at the University of Clifornia; Some of his ideas are well
	suited for inclusion into mono (everything IMO of course).
	Unfortunately this work is not freely available, but instead
	several rip-outs of his thesis are available:
	-> Kistler, Thomas and Franz, Michael
	   "Continuous Program Optimization: A Case Study"
	-> same
	   "Continuous Program Optimization: Design and Evaluation"

	What are other jits doing in that direction:
	-> Arnold, Hind and Ryder
	   "Online Feedback-Directed Optimization of Java"
	   (Includes a feedback directed splitting algo)
	-> Adl-Tabatabai et al. (Intel)
	   "The StarJIT Compiler: A Dynamic Compiler for Managed Runtime
	    Environments"

	Misc.:
	-> Hind, Rajan, Sweeney
	   "Phase Shift Detection: A Problem Classification"
	-> Pettis, Hansen
	   "Profile Guided Code Positioning"
	   (how BBs should be sorted)

	Last, but not least: Profiling
	-> Conte, Menezes, Hirsch
	   "Accurate and Practical Profile-Driven Compilation Using the
	    Profile Buffer"
	-> same
	   "Using Branch Handling Hardware to Support Profile-Driven
	    Optimization"
	-> Ammons, Ball and Larus
	   "Exploiting Hardware Performance Counters with Flow and
	    Context Sensitive Profiling"

0 Definitions
~~~~~~~~~~~~~

	'Priority Path' [PP]
		I'll refer to all sorts of time critical operations to
		be in the Priority Path. This includes e.g first time
		compilation, exception handling etc. (Most/All of the
		current jit)

	'Robust Path' [RP]
		All operations that are not time critical. 'Robust'
		should indicate	that inhabitants of this path shall be
		defensively coded.
		(www.secureprogramming.com)
		Robust Path code allows for and should use
		Object-Oriented code.

I Motivation
~~~~~~~~~~~~
	Just read the papers if you don't understand why this is a
	necessary technique for JIT compilers..

II Design Goals
~~~~~~~~~~~~~~~
	(1) One of the most important things besides performance is
	    considered to be security.
	(2) Code shall be maintainable.
	(3) First time compilation should be as fast as possible.
	(4) Profiling with low overhead and not full time. (Except
	    sampling, perhaps)
	(5) Somehow remember the set of optimizations an application
             favoured most.
	(6) Extensible framework for optimizations/profiling techniques.

III Architecture
~~~~~~~~~~~~~~~~
	*) [PP] First-Time Compilation is done with current mini
		unchanged.

	*) [RP] A low priority level thread will be used for all the
		'background' work.

	*) In the following I'll stick to the design outlined by
    	   Kistler:
	   - All Optimization Objects need
	      - A method that returns the estimated speedup achievable
		(fast, without using/needing the IR or doing any
		 expensive computations)
	      - A method that actually performs the transformation on
		the SSA/CFG.. (IR)
	      - ..
	   - All Profiling Objects need
	      - A method that does aging (or whatever) on profile data
	      - A method for signaling phase changes
	      - ..
	   The Optimization/Profiling Objects are managed by some
	   manager (Profiler, Optimizer). On top there is another
	   Manager that is responsible for interaction of the
	   sub-components.

	Work is done like:
		1) [PP] do first time compilation (perhaps only
			sample profiled)
		2) [RP] Manager asks Profiler for any kind of Phase
			Change
		3) [RP] Call Optimizer to work on the candidates
			indicated in step 2
		4) [RP] Optimizer uses Profiler data to apply
			optimizations..
		5) [RP] Replace slow procedure/method by new one.

	Notes:	* It's also possible to replace a given method by one
		  that is instrumented for further profile generation
		  (IOW slower)..
		* Optimizer itself decides what OptimizationObjects it
		  will apply - based on the estimated speedup figure the
		  OptimizationObjects return.

	One addition, not found in the Kistler paper: There are
	applications (e.g. mcs)	where RAPGOs simply don't make sense at
	all. We therefore must have some metadata for 	applications
	indicating mean run time etc.

	IMO it would be beneficial to somehow cache compiled code on
	disk along with the executable, so that the first time
	compilation may be replaced by loading the cached version...
	(Some sort of implicit AOT compilation?) Of course this can't
	be done with fully optimized versions..

IV Profiling (AMD64)
~~~~~~~~~~~~~~~~~~~~

	I like the idea of 'Profile Buffers' introduced in "Using Branch
	Handling Hardware to Support Profile-Driven Optimization".
	The authors basically propose some hardware device that
	increments different counters (register) automatically on
	taking/not taking a branch.
	Of course we don't have that device available, but instead I
	propose	to use the legacy MMX registers for that purpose.
	(Except MMX0,1 which can be used for sin/cos calculation)

	If anyone knows a way of reading out the BTB on current x86
	CPUs, this information could be used too..

	In that context "Static Branch Frequency and Program Profile
	Analysis" by Wu and Larus seems to be interesting too.

	How to instrument a given method?
	Basically a ProfilerObject that needs to instrument code shall
	define a separate OptimizationObject that applies the
	transformations needed.

V Replacement
~~~~~~~~~~~~~
	I'm not really sure how to handle that - basically there are
	several	ways.
	I've come up with following 'idea': Each method is called
	indirectly via	call *rax (where rax points to some GOT). So by
	changing the offset every call will go to the new location.
	Another technique would be to replace the existing method by
	some code that patches the caller's address to jump to the new
	code the next time directly. This however means that we would
	have to take care how long a given 'Patcher' needs to be
	preserved... (some problem GC could take care of..)

	Before freeing/overwriting a method we also have to ensure no
	thread is executing this piece of code anymore. Simple
	Entry/Exit counters should be able to handle that..
	(In case of an endless loop, code could be patched so that this
	thread generates a signal..)

VI Placement
~~~~~~~~~~~~
	Currently every code that is being emitted gets copied to it's
	final location - smells like overhead to me..
	What about mmap and direct emit into this area?
	This would also save time for freeing/allocating memory for
	replacement code.

Ok, as you might have guessed by now, I don't have any code to share - 
but I'm confident that this will change in the next few weeks..

Thanks for wading through all this,
  Willi