[Mono-dev] CIL to CIL optimizer
Bjarke Hammersholt Roune
bjarke.roune at gmail.com
Mon Aug 14 13:39:45 EDT 2006
Massimiliano Mantione wrote:
> First things first: it's nice you'll do the CIL-CIL optimizer, but
> there are plans to do this in the JIT anyway, where these things
> belong.
> I say "where these things belong" not to discourage you, but because
> the JIT generally has more visibility into how the code will really
> look (it can inline methods, knows that the address of certain
> things may already be a constant, and in the future will also have
> visibility of the method prologue-epilogue), so things like SCCP and
> PRE should be performed by the JIT anyway if we want to take full
> advantage of them.
> Not to mention the issues involved in tuning redundancy elimination,
> which can easily get architecture dependent...
> That said, obviously until the JIT "is not there" having an external
> optimizer can be useful :-)
>
I think optimization of something like CIL belongs in BOTH the JIT and
in a CIL-CIL optimizer (I abbreviate this as CCO in the rest of the email).
There are several reasons that I think this. You have already given some
good reasons that optimization belongs in the JIT even in the face of an
excellent CCO. The CCO also has to take extra care to preserve the
constraints of the CIL format, which can be burdensome. E.g. CIL
concepts cannot be eliminated by translating them to something simpler,
at least not without being able to translate them back later when the
optimized assembly is to be output.
There are reasons that a CCO is desirable too, though:
1. The JIT cannot afford to do expensive whole-program optimizations and
a CCO can, so the CCO can optimize the program in ways that the JIT
cannot. In the same vein, the JIT will only aggressively optimize hot
parts of the code while a CCO can take its time and optimize everything.
2. It takes time before the JIT has optimized the running program, while
the effect of the CCO will be available at start up.
3. A CCO can decrease the size of an assembly by such optimizations as
dead code/function/class elimination (the last two only in the absence
of reflection), better use of the stack, better instruction selection,
redundancy elimination, better ordering of basic blocks and so on. This
decreases start up time.
4. Some JITs do not optimize very aggressively. CCO optimizations are
available to all JITs, and different CCOs can be run in sequence.
5. Some optimizations, like dead code elimination, can be done in a CCO
and then the JIT does not have to do them (or does not have to do them
as much). The CCO could embed some metadata in the assembly to indicate
which optimizations have already been performed. The same trick applies
to the results of expensive analyses, though relying on that information
without checking it could be unsafe.
6. A CCO can output data that enables the JIT to start up faster and
optimize better. An example is SafeTSA which is a Java-equivalent
bytecode format where the code is already in SSA form and where some
array bounds checks and some null checks can be safely eliminated. On
top of this the SafeTSA format actually takes up less space than the
normal Java bytecode format. Another example is a paper I read that
explained how to annotate Java bytecode so that the JIT could do
high-quality register allocation in little time. This scheme is,
surprisingly, independent of the number of registers on the target machine.
The last two points have the drawback that they require slight and
extensive cooperation, respectively, between the CCO and the JIT. I
don't plan to implement point 6, btw.
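To make point 5 above concrete, here is a minimal sketch (in Python for brevity, not the ML the CCO would actually be written in) of the kind of dead code elimination a CCO could do once, ahead of time: a backward liveness pass over an invented toy IR. The IR encoding and names are mine, purely for illustration.

```python
# Toy sketch (not Mono's IR): a backward liveness pass drops
# assignments whose target is never read afterwards. Instructions are
# ('assign', target, [operands read]) or ('print', var); 'print' has a
# side effect, so its operand stays live.

def dead_code_elim(prog):
    live, kept = set(), []
    for instr in reversed(prog):
        if instr[0] == 'assign':
            _, target, operands = instr
            if target not in live:
                continue                  # target never read: dead, drop it
            live.discard(target)
            live.update(operands)
        else:                             # 'print': keeps its operand live
            live.add(instr[1])
        kept.append(instr)
    kept.reverse()
    return kept

prog = [
    ('assign', 'a', []),      # read below: kept
    ('assign', 'b', []),      # never read: removed
    ('assign', 'c', ['a']),
    ('print', 'c'),
]
print(dead_code_elim(prog))   # the 'b' assignment is gone
```

A real pass would of course need a CFG-wide dataflow fixpoint rather than a single straight-line sweep, but the principle is the same.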
As a practical matter, I also think that it will take less effort to
implement optimizations in ML than in C. On a personal note, I have been
writing a Gröbner basis computation program in C++ for quite a while,
and I am looking forward to programming in a functional-style
language again.
I cannot foresee how problematic the natural barrier to optimization
in a CCO that Miguel mentions will be. It is certainly there in the
sense that a JIT could do more optimization than a CCO if it had all
the time in the world. I guess I will have to write a CCO and see what
happens to find out.
In any case, my CCO would certainly serve the purpose that Miguel
mentions of cleaning up the output of mcs.
> Then, about the CFG shape and issues when try[/catch]/finally clauses
> are present, this discussion already reached mono-devel-list:
> http://marc.theaimsgroup.com/?l=mono-devel-list&m=111336503228737&w=2
> I'd really like you to comment on that thread.
>
I think that discussion is a good example of how the different
constraints on JITs and CCOs play out.
You are discussing whether to let the CFG reflect the possibility that
an instruction could throw an exception. On one hand this would improve
the precision of analyses, but it would also make the CFG much larger, which
could increase memory consumption and decrease performance of the JIT
optimizer. It sounded like you ended up agreeing on the less precise but
more efficient approach.
The less precise approach did not even occur to me as a CCO is not as
resource constrained as a JIT. I did think for quite a while about
whether or not it was a good idea to let all my analyses know about
extended basic blocks, which in a sense is the best of both worlds, but
I decided it was not worth the effort.
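For readers unfamiliar with the term: an extended basic block is a maximal tree of basic blocks in which every block except the root has exactly one predecessor, so forward analyses can propagate facts through it without hitting merge points. A toy sketch, with an invented CFG encoding that is nothing Mono-specific:

```python
# Sketch: partition a CFG into extended basic blocks (EBBs). A block
# starts a new EBB if it is the entry or has more than one
# predecessor; each EBB then grows through successors that no other
# path enters.

def extended_basic_blocks(succ, entry):
    """succ: {block: [successors]}; returns a list of EBBs (block lists)."""
    preds = {b: 0 for b in succ}
    for ss in succ.values():
        for s in ss:
            preds[s] = preds.get(s, 0) + 1
    roots = [b for b in succ if b == entry or preds.get(b, 0) > 1]
    ebbs = []
    for root in roots:
        ebb, stack = [], [root]
        while stack:
            b = stack.pop()
            ebb.append(b)
            # extend the EBB only into single-predecessor successors
            stack.extend(s for s in succ.get(b, []) if preds[s] == 1)
        ebbs.append(ebb)
    return ebbs

cfg = {'A': ['B', 'C'], 'B': ['D'], 'C': ['D'], 'D': []}
print(extended_basic_blocks(cfg, 'A'))
# D has two predecessors, so it starts its own EBB:
# [['A', 'C', 'B'], ['D']]
```

The appeal for the exception question is that a block full of potentially-throwing instructions can be treated as one EBB instead of many tiny basic blocks.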
I'm sure you know this, but for any other people reading: the issue I
raised in my initial email about the finally construct was how to make
the CFG even more precise than is considered in the discussion you
linked to, i.e. what to do about the CFG edges going out of the
finally block.
I have decided what to do about that. My first implementation will
simply eliminate finally by inlining the finally block. A more
sophisticated approach that I might look into later would be to extract
the finally block into a function with ref arguments and insert calls to
this function that are impervious to code motion. These function calls
could then be replaced by the finally construct just before emitting the
optimized CIL. This should work well together with whole-program
optimization. A hybrid approach that inlines small finally blocks and
preserves large finally blocks would probably be a good idea, especially
if Mono cannot analyze finally blocks precisely.
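A minimal sketch of the first strategy, on an invented statement-list form rather than real CIL (the exceptional exit paths, which must also run the finally block, are deliberately ignored here):

```python
# Toy sketch of inlining a finally block: duplicate the finalizer
# statements before every normal exit from the protected region.
# ('leave',) marks an exit point; everything else is opaque.

def inline_finally(body, finally_block):
    out = []
    for stmt in body:
        if stmt == ('leave',):
            out.extend(finally_block)   # run the finalizer, then exit
        out.append(stmt)
    return out

body = [('call', 'work'), ('leave',)]
fin = [('call', 'cleanup')]
print(inline_finally(body, fin))
# [('call', 'work'), ('call', 'cleanup'), ('leave',)]
```

The code-size cost is one copy of the finally block per exit, which is why the hybrid approach (inline only small blocks) mentioned above seems attractive.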
> Finally, about building SSA form, my favorite paper is this:
> http://citeseer.ist.psu.edu/sreedhar95linear.html
> I really like the dominance algorithm at the beginning of the paper
> you mentioned, and implemented it once; it's so easy it just took a
> few hours to have it running and fully debugged :-)
> But the rest of the paper (about dominance frontiers) was a bit hard
> for me, I think the one I linked is well explained.
> And BTW, there's already a C implementation of it for Mono in my
> old HSSA patch here:
> http://marc.theaimsgroup.com/?l=mono-devel-list&m=113235041825001&w=2
> Keep in mind that HSSA is a big beast, including GVN and full alias
> analysis, tracing also heap bases values, so one can do PRE and SCCP
> taking all those things into account, and also register promotion.
> That patch implements most of the infrastructure, enough to have a
> working deadce implementation.
>
Thanks for the tip!
Do you know if there is a performance impact of entering and exiting try
blocks in Mono, even if no exception is ever (dynamically) thrown?
/Bjarke Roune