[Mono-devel-list] Compiler thoughts, 2

Thu Mar 17 11:07:29 EST 2005

Hello Jb,

> There is a big loss of efforts in maintening multiple compilers. gmcs, 
> bmcs, jscript.net, etc. Can we not imagine some kind of library that 
> would be the kernel for each Mono compiler ? In my mind, this library 
> would rely on the Cecil library to emit assemblies. There is a plan to 
> write optimizers for CIL stream using Cecil. So we could have one code 
> base, producing optimized CIL stream, (ie like the C++/CLI does), that 
> compiler writer could use easily.

Cecil today is similar to Reflection.Emit;  Hopefully Cecil will not
have some of the most annoying limitations of Reflection.Emit

This topic is a recurrent one: why not write a library that all
compilers can reuse to minimize the compiler size?

I have had this discussion a number of times, so I will try to summarize
some of the aspects of it:

	* Compilers in the .NET world are only half-compilers, they take
	  an input language and have to produce CIL code, which is not
	  directly native code.  The translation of CIL to native code
	  happens in another place, something that traditional compilers
	  have had to address in the past.

	* What remains of a .NET compiler is fairly minimal: a 
	  tokenizer, a parser, an internal AST that matches the source
	  language, and a translation phase to CIL.

	  The parser, the tokenizer and the AST are likely going to be
	  intimately tied to the language being implemented, so sharing
	  there is almost minimal.

	* People have suggested introducing a new intermediate
	  representation between the AST and CIL, but there is no reason
	  to do so;  And nobody has shown that this would bring any
	  benefits, nor would it reduce the complexity of the compiler.

	  In addition, a new intermediate layer between the AST and the
	  CIL means that this layer would have to be prototyped and
	  tested with a number of compilers before we could consider it
	  ready for production and complete enough for production.

	  Keeping in mind that such new intermediate representation is
	  of dubious use.

	* If such a library existed, it also means that all the
	  compilers consuming it during the development process would
	  have to be synchronized to it: and making changes to the
	  library would probably break more than one compiler.

	  It is of course possible to do this, but with a vastly
	  under-staffed effort (we still do not have production
	  compilers for JScript, VB.NET nor C++, and none of those
	  compilers would benefit from this library) it seems unclear
	  why we would take into another task.

	* Maintenance of a library with a solid, stable API will take a
	  long time to develop.  

	  Not only this, but the resources that need to be committed to
	  develop a stable API are fairly large and the burden imposed
	  on the implementors of the compilers and the library will take
	  a large toll.

And the most important point, I think is:

	* Many compiler authors will want to write their compiler in 
	  their own language, as their own test of completeness and as
	  their own way of dogfooding the compiler which would make the
	  sharing of anything above (parser, tokenizer) almost minimal.

Now, I believe that it would be nice to build a higher-level abstraction
than CIL on top of Cecil for the purpose of doing the same kind of
decoding that the JIT compiler does, and to find the higher level
constructions of a program from the CIL stream: basic blocks, trees of
statements, the DU/UD relations and other bits that might be used for an
optimizer.

Such representation could be used by an optimizer to improve the output
produced by standard CIL compilers, removing the burden from normal
compilers to implement very complicated optimizations. 

> We can go even further, I think about the lexing/parsing, I would love 
> to write a fully object oriented lexer/parser. The compiler writer could 
> feed the library, telling it what are the primitives of its language, 
> and giving it its implementation of statements, using a pure object 
> model. This would obviously be slowler than a jay approach, but this 
> could simply lots of things.
> 
> What do you think about that ?

As for parsing and lexing: the options are infinite here, and the
standard practice of using generators or hand-tuned parsers and lexers
is not only fairly common, but has been tuned extensively.

Adding abstractions like the one your describe and trying to evolve it
to match all the possible languages in the world is in my opinion not
going to produce any results.  The worst problem with this approach is
that you would have to cater to new compiler writers, because existing
compilers are unlikely going to switch, and you do not know what these
new compilers might want to do.

Not only this, but a language designer wants to have flexibility,
flexibility that he can gain by writing a manual parser/tokenizer or
using a code generator, some of which are very advanced and embody many
man years of research and development.  Making people use and learn a
new abstraction is just getting in the way of innovation.

My opinion is that the compiler problem is very well understood today,
compilers are not hard to write, but they are not small projects: they
are like a marathon.  

Miguel.