[Mono-list] JVM performance: JVM as a basis for CLR

Soeren Sandmann sandmann@daimi.au.dk
23 Jul 2001 09:28:33 +0200

Rhys Weatherley <rweather@zip.com.au> writes:

> > I think we can do better than traditional macro expanding JITs still
> > by using traditional instruction selection in our JIT.
> Have you investigated HP's Dynamo project?  They got
> very good performance through dynamic re-optimization
> rather than static "compile once, run many" techniques.
> They start with simple code translation, and work their
> way up as a profile of the program's runtime is built.

The Dynamo project[1,2] is very impressive.  It is essentially a software
trace cache.  Here is roughly what it does:

The system contains an interpreter for PA-RISC code.  A counter is
associated with every target of a backwards loop, and every time that
target is reached, the counter is incremented.  When a counter reaches
some treshold, the interpreter enters a mode in which it writes down
in a buffer every instruction it executes.  This buffer is then
optimized and used instead of the old code.

Actually, it doesn't write down every instruction; jumps are removed.

Direct jumps are just removed, conditional jumps are possibly reversed
so that the fall-through path is the path actually taken.  Indirect
jumps are translated from this:

   call r


   cmp r, <target that was actually target>
   jne <address of stub that traps back into the interpreter/code cache>
   <inlined code>

The above allows virtual calls to be inlined an optimized.  This can
be very important for Java/C# languages where short virtual methods
are very common.

The code resulting from this has the very important property that it
contains no jumps into it.  It's only entry point is at the top.  This
makes optimizations and good register allocation very easy to do

HP was able to take binaries compiled with HP's compiler at level -O2
and make them run up to 20% faster.  This is on *real programs*, not

> I believe Sun was trying something similar with HotSpot.

HotSpot also does profiling and dynamic compilation, but their
compiler work at method granularity and does the same heavy-weight
optimizations that an ahead-of-time compiler does, whereas Dynamo
generates traces and lightweight optimizations.

Given that a traditional optimizer is only able to squezze a few
percent out of a C program, I don't believe that an optimizing JIT
will be able to achieve high performance with Java/C# programs that
contains many hard-to-analyze virtual calls.

> An oddity of their system is that an intepreter + JIT is faster
> than a JIT alone!  

That is not odd.  That is the entire idea behind dynamic compilation.

> The (main) reason for this is that many
> functions, especially initializers, are only called once and it
> is faster to interpret them than JIT them and then run them.

Do you have measurements that show that this is the case? 

I doubt very much that functions that are only called once have much
impact on performance in either Dynamo or HotSpot.  They get their
performance, essentially by trading space for speed at the 'hot
spots'.  Much of HotSpot's performance comes from inlining and
specialization of virtual methods (they can only do that because it
has an interpreter that collects type information).  Dynamo is able to
generate extremely good code with little effort because it optimizes
traces, not graphs of basic blocks.  This leads to (controlled) code
duplication, because code can be part of more than one hot trace.

In my opinion, a good way to write a virtual machine for IL will be 
along these lines:

There is an interpreter and dynamic compiler, much like in the Dynamo
project.  It interprets IL code and generates traces of IL code which
is then optimized (by a portable optimizer) and register allocated
(perhaps portable) and translated into native code (obviously not

The profiler would install a counter for every target of a backwards
branch and at the beginning of every method (so that the system will
detect loops that are made with recursive calls).

I am convinced this alone will provide very high performance, because
the (a) code that most of the time is spent in, is native code, and
(b) this native code is of very high quality.

The high quality includes 

- Very good code locality.  Traces may be scattered around the
  program, but the generated code will be contigous.  

- Inlining means much more to work on for the optimizer.  Dynamo's
  scheme inlines both non-virtual calls, virtual calls, and taken
  branches of if-statements.

I don't see any reason why the performance of such a system would be
any worse than Dynamo.  This would mean *better* performance than
statically optimized native code.

Assuming high performance of bytecodes, it is natural to write the
garbage collector in bytecode.  This would allow much less context
switching between bytecode and native code, and it would allow garbage
collection and object allocation to be inlined and optimized in

In the case where there is only one thread, this will solve a problem:
With a native garbage collector, the compiler would have to ensure
that all pointers are not mangled or hidden before allocating a new
object.  If the garbage collector was written in bytecode, the
compiler would only have to be consistent with itself.

Unfortunately, with more than one thread, it doesn't work. garbage
collections can happen at any time, and so it must be ensured that
threads are only stopped at 'safe points'.  This can get hairy.  I
have a feeling it may be possible to come up with something clever
here.  The best thing would be a garbage collector that didn't require
stopping the threads at all, but I don't know of such a collector.

[1] http://www.hpl.hp.com/cambridge/projects/Dynamo/PLDI2000.pdf

    This is a 12 page paper from 'Programming languages, design and
    implementations 2000'. It provides a good overview of the Dynamo

[2] http://www.hpl.hp.com/techreports/1999/HPL-1999-78.html

    This is a ~100 pages technical report that describes in detail how
    Dynamo works.