[Mono-dev] inlining and performance of SIMD code

Tue Oct 20 12:10:31 EDT 2009

Hello,

I have a few questions about inlining:

- I am curious what the heuristics are. I looked at the function 
mono_method_check_inlining, but even when the function returns TRUE, the 
function might not be inlined. Could you point me to the relevant piece of code? 
Is there any high-level rule to make a guess, like complex control flow, use of 
certain opcodes, etc.?

- Can I force inlining of a given function? Even a hack is fine: I am trying to 
evaluate several code generation schemes, and I would like to measure the impact 
of inlining.

- I tried to run code with calls to Mono.Simd on architectures that do not 
support SIMD (or on x86 with the flag --optimize=-simd). A simple loop written 
in C, a[i]=b[i]+c[i], gets vectorized by GCC; the bytecode essentially contains 
calls to Mono.Simd.Vector4f::LoadAligned, StoreAligned and op_Addition, plus 
address computations. The generated code, however, is very inefficient, with 
values being copied around many times. Here is an example I captured with 
'mono -v -v':
   f8:	8b 11                	mov    (%ecx),%edx
   fa:	89 55 b8             	mov    %edx,-0x48(%ebp)
  12e:	8b 4d b8             	mov    -0x48(%ebp),%ecx
  131:	89 4d 88             	mov    %ecx,-0x78(%ebp)
  15e:	d9 45 88             	flds   -0x78(%ebp)
  161:	d9 45 98             	flds   <...second op...>
  164:	de c1                	faddp  %st,%st(1)
  19c:	d9 9d 4c ff ff ff    	fstps  -0xb4(%ebp)
  1b6:	d9 85 4c ff ff ff    	flds   -0xb4(%ebp)
  1bc:	d9 5d a8             	fstps  -0x58(%ebp)
  1da:	8b 4d a8             	mov    -0x58(%ebp),%ecx
  1dd:	89 4d d8             	mov    %ecx,-0x28(%ebp)
  1f2:	8b 4d d8             	mov    -0x28(%ebp),%ecx
  1f5:	89 08                	mov    %ecx,(%eax)

It seems that a simple copy propagation followed by dead code elimination would 
fix it. But I am not sure where I should look. Any comment or suggestion?
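
As a rough illustration of what copy propagation followed by dead code 
elimination would achieve here, the sketch below writes the same scalar add 
twice in C: once with a chain of redundant temporaries, once with the copies 
propagated away. The function names and the temporaries t1..t5 are made up for 
illustration and only loosely mirror the stack slots in the dump; this is not 
Mono's IR or its actual JIT output.

```c
#include <stddef.h>

/* Hypothetical model of the dump above: every value is copied through
 * several temporaries (standing in for stack slots) before it is used. */
void add_with_copies(float *a, const float *b, const float *c, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        float t1 = b[i];     /* load the first operand into a temporary   */
        float t2 = t1;       /* redundant copy of the first operand       */
        float t3 = c[i];     /* load the second operand                   */
        float sum = t2 + t3; /* the only useful computation               */
        float t4 = sum;      /* result copied to another temporary        */
        float t5 = t4;       /* ...and copied once more before the store  */
        a[i] = t5;
    }
}

/* The same loop after the copies are propagated (t2 -> b[i], t5 -> sum)
 * and the now-unused temporaries are removed as dead code. */
void add_direct(float *a, const float *b, const float *c, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        a[i] = b[i] + c[i];
    }
}
```

Both functions compute the same result; the second form is what one would 
hope to get for each vector element once the intermediate copies are gone.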

Thanks a lot,

--
Erven.