[Mono-dev] inlining and performance of SIMD code

Sun Nov 22 16:40:21 EST 2009

Hello,

> - I am curious what the heuristics are. I looked at the function 
> mono_method_check_inlining, but even when the function returns TRUE, the 
> function might not be inlined. Could you point me to the relevant piece of code? Is 
> there any high level rule to make a guess, like complex control flow, use of 
> certain opcodes, etc?

Inlining happens in mini/method-to-ir.c, where a few rules govern it.

The routine that you mention holds most of the heuristics, but the two
call sites that use it impose additional limitations, so a TRUE result
does not guarantee inlining. Check the source for the details.
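
As a rough illustration of why a TRUE result from the method-level check
is not the final word, here is a minimal sketch in C of a two-level
decision. Everything in it is hypothetical: the MethodInfo structure, the
size threshold, the depth limit and both function names are invented for
the example and are not Mono's actual rules or data structures.

```c
/* Hypothetical summary of a method; this is not a Mono structure. */
typedef struct {
	int code_size;       /* size of the method body in bytes */
	int num_eh_clauses;  /* number of exception handling clauses */
	int no_inlining;     /* 1 if the method is marked as not inlinable */
} MethodInfo;

/* Illustrative threshold only; the real limit lives in the Mono source. */
#define INLINE_SIZE_LIMIT 20

/* Method-level check: looks only at properties of the callee. */
int
method_may_inline (const MethodInfo *m)
{
	if (m->no_inlining)
		return 0;
	if (m->num_eh_clauses > 0)
		return 0;
	return m->code_size < INLINE_SIZE_LIMIT;
}

/*
 * Call-site check: adds its own limit (here, a maximum inlining depth)
 * on top of the method-level answer, so it can still say no when
 * method_may_inline () says yes.
 */
int
call_site_may_inline (const MethodInfo *callee, int inline_depth, int max_depth)
{
	if (inline_depth >= max_depth)
		return 0;
	return method_may_inline (callee);
}
```

In this sketch a small method with no exception clauses passes
method_may_inline (), yet call_site_may_inline () still rejects it once
the depth limit is reached.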

> - Can I force inlining of a given function? Even a hack is fine, I am trying to 
> evaluate several code generation schemes, and I would like to measure the impact 
> of inlining. Whatever works is fine.

I do not think that this is possible currently, but a small patch that
forces inlining when an environment variable is set would do the trick
(as a temporary hack).
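
A minimal sketch in C of what such a hack could look like. The variable
name MONO_FORCE_INLINE and both function names are assumptions made up
for this example; Mono does not define them, and the place where the
check would be wired into method-to-ir.c is left open.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical environment variable: a comma-separated list of methods. */
#define FORCE_INLINE_ENV "MONO_FORCE_INLINE"

/* Returns 1 if `name` is one of the comma-separated items in `list`. */
int
name_in_list (const char *list, const char *name)
{
	size_t len = strlen (name);

	if (list == NULL || len == 0)
		return 0;

	while (*list != '\0') {
		const char *end = strchr (list, ',');
		size_t item_len = end ? (size_t) (end - list) : strlen (list);

		/* Exact match only: same length and same characters. */
		if (item_len == len && strncmp (list, name, len) == 0)
			return 1;
		if (end == NULL)
			break;
		list = end + 1;
	}
	return 0;
}

/* Returns 1 if the user asked for `method_name` to be inlined. */
int
force_inline_requested (const char *method_name)
{
	return name_in_list (getenv (FORCE_INLINE_ENV), method_name);
}
```

The inlining code would then treat a method as inlinable whenever
force_inline_requested () returns 1, for example after running with
MONO_FORCE_INLINE=Vector4f:op_Addition in the environment (the name
format is also just an assumption).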

> 
> - I tried to run code with calls to Mono.Simd on architectures that do not 
> support SIMD (or on x86 with the flag --optimize=-simd). A simple loop written 
> in C a[i]=b[i]+c[i] gets vectorized by GCC, the bytecode essentially contains 
> calls to Mono.Simd.Vector4f::LoadAligned, StoreAligned and op_Addition, plus 
> address computations. The generated code, however, is very inefficient, values 
> being copied around many times. Here is an example I captured with 'mono -v -v':
>    f8:	8b 11                	mov    (%ecx),%edx
>    fa:	89 55 b8             	mov    %edx,-0x48(%ebp)
>   12e:	8b 4d b8             	mov    -0x48(%ebp),%ecx
>   131:	89 4d 88             	mov    %ecx,-0x78(%ebp)
>   15e:	d9 45 88             	flds   -0x78(%ebp)
>   161:	d9 45 98             	flds   <...second op...>
>   164:	de c1                	faddp  %st,%st(1)
>   19c:	d9 9d 4c ff ff ff    	fstps  -0xb4(%ebp)
>   1b6:	d9 85 4c ff ff ff    	flds   -0xb4(%ebp)
>   1bc:	d9 5d a8             	fstps  -0x58(%ebp)
>   1da:	8b 4d a8             	mov    -0x58(%ebp),%ecx
>   1dd:	89 4d d8             	mov    %ecx,-0x28(%ebp)
>   1f2:	8b 4d d8             	mov    -0x28(%ebp),%ecx
>   1f5:	89 08                	mov    %ecx,(%eax)
> 
> It seems that a simple copy propagation followed by dead code elimination would 
> fix it. But I am not sure where I should look. Any comment or suggestion?

You are correct that the code quality is not great. Let me give you some
background on how we got here. Our previous JIT engine transformed the
CIL byte stream into a tree-based intermediate representation. Back then
we implemented various optimizations on this tree, including the
conversion of the tree into SSA form and several optimizations built on
top of SSA.

Although the optimizations were useful, the tree representation
prevented other optimizations from taking place and the code quality was
not that great.

A new engine was then created that replaced the tree representation
with a linear IR, which made a new set of optimizations possible. The
new IR is documented here:

http://www.mono-project.com/Linear_IL

The code generation improved, but some of the old optimizations that we
had created were lost in the process. Although their code still exists
in the source tree, they have not been ported to the new IR.

In particular, a full framework for partial redundancy elimination
(which also supported dead code elimination) existed in the file
ssapre.c.

I am not sure what is best: to port the ssapre.c framework to work on
Mono's new JIT, or to build a simplistic DCE/PRE framework on top of the
current IR.
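
To give a rough idea of the simplistic option, here is a minimal sketch
in C of copy propagation followed by dead code elimination over a single
basic block of a toy linear IR. The Ins structure, the OP_* opcodes and
both function names are made up for illustration and are not Mono's IR
or API. The sketch also assumes that no vreg is live after the end of
the block and that only stores have side effects.

```c
enum { OP_NOP, OP_LOAD, OP_MOVE, OP_ADD, OP_STORE };

/* Toy instruction: vreg numbers, or -1 when a field is unused.
 * OP_LOAD:  dst = mem [src1]      OP_MOVE:  dst = src1
 * OP_ADD:   dst = src1 + src2     OP_STORE: mem [src1] = src2 */
typedef struct {
	int op, dst, src1, src2;
} Ins;

#define MAX_VREGS 64

/* Forward pass: rewrite each source to the oldest vreg known to hold
 * the same value. */
void
copy_propagate (Ins *code, int n)
{
	int copy_of [MAX_VREGS];
	int i, v;

	for (v = 0; v < MAX_VREGS; v++)
		copy_of [v] = v;

	for (i = 0; i < n; i++) {
		Ins *ins = &code [i];

		if (ins->src1 >= 0)
			ins->src1 = copy_of [ins->src1];
		if (ins->src2 >= 0)
			ins->src2 = copy_of [ins->src2];
		if (ins->dst < 0)
			continue;

		/* dst is redefined: vregs that were copies of it keep the
		 * old value, so they stop being copies. */
		for (v = 0; v < MAX_VREGS; v++)
			if (copy_of [v] == ins->dst)
				copy_of [v] = v;
		copy_of [ins->dst] = (ins->op == OP_MOVE) ? ins->src1 : ins->dst;
	}
}

/* Backward pass: turn instructions whose result is never used into
 * OP_NOP. Returns the number of instructions that remain. */
int
eliminate_dead_code (Ins *code, int n)
{
	int live [MAX_VREGS] = { 0 };
	int i, remaining = 0;

	for (i = n - 1; i >= 0; i--) {
		Ins *ins = &code [i];

		if (ins->op == OP_NOP)
			continue;
		if (ins->op != OP_STORE && (ins->dst < 0 || !live [ins->dst])) {
			ins->op = OP_NOP;
			ins->dst = ins->src1 = ins->src2 = -1;
			continue;
		}
		if (ins->dst >= 0)
			live [ins->dst] = 0;
		if (ins->src1 >= 0)
			live [ins->src1] = 1;
		if (ins->src2 >= 0)
			live [ins->src2] = 1;
		remaining++;
	}
	return remaining;
}
```

Run on a sequence shaped like the dump quoted above (a load, two moves,
an add, two more moves, a store), the two passes leave only the load,
the add and the store.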

There is definitely plenty of low-hanging fruit in the Mono JIT at this
point.