[Mono-dev] mono benchmark on arm no FPU => division optimisation problem?

Wed Feb 24 13:50:40 EST 2010

Hi,

I have run the benchmarks included with mono on an ARM platform
(Freescale iMX21, based on the ARM926EJ-S core)
using several runtimes:
  * mono 2.4.2.3 (built from openembedded)
  * mono 2.6.1 (built from the released tarball)
  * mono svn revision 152005

All the above under linux 2.6.33-rc8 using eabi
The CPU has no FPU (2.4.2.3 shows up as "vfp" due to openembedded patch) 
but soft float is used in all cases.

For good measure I also tried Microsoft .NET compact framework under 
WinCE 5.0 (on the same hardware)

Results are (times in seconds):
                       mono2.4.2.3-vfp  mono2.6.1-soft  monosvn-soft  ms-cf
boxtest                             19              36            43     24
bulkcpy                             31              32            32     26
castclass                           59              59            60    105
cmov1                               68              69            68     81
cmov2                               64              64            64     67
cmov3                               11              10            11     16
cmov4                                9               9             9     11
cmov5                               88              86            86    130
commute                             18              18            17     46
ctor-bench                         344             347           424  CRASH
fib                                 42              77            45     59
iconst-byte                          3               3             3      7
initlocals                          56              55            55    104
inline-readonly                     80              77            76    159
inline1                             18              18            18     25
inline2                             18              18            18     43
inline3                             22              22            22    114
inline4                             10              10            10     11
inline5                             46              46            46     98
inline6                             17              17            17     86
isinst                              61              66            65     88
life                                16              26            31     15
logic                               66              65            66    126
loops                                7               7             7     11
math                               482             555           570    804
max-min                             13              13            12     14
muldiv                             102             162           117     16
readonly                            16              16            16     31
readonly-byte-array                 23              23            23     33
readonly-inst                        9               9             9     13
readonly-vt                         10              10            10  CRASH
regalloc                            25              25            25     30
regalloc-2                          14               9             9     12
sbperf1                             16              28            33     13
sbperf2                             20              38            44     27
switch                              89              89            88    124
valuetype-hash-equals               44              54            57     89
vt2                                 10              10            10     32

Things I noticed:
1) The muldiv test is very slow on all mono versions compared with
.NET CF (102 to 162s vs 16s).
Looking at the JIT-generated machine code shows that the n = (n / 256)
operation is not being converted to a shift operation (whereas the
n = n * 128 operation _is_). Using a shift (at the C# source level)
gives ~4s (vs 102).

2) boxtest has become significantly slower in more recent mono versions

3) Apart from the above, mono compares pretty well to .NET CF
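
A note on point 1: for signed integers a divide by 256 is not quite the
same as a plain shift by 8, because division truncates toward zero while
an arithmetic right shift rounds toward negative infinity. Below is a
minimal sketch in C of the usual "bias, then shift" replacement. The
function name div_pow2 is made up for illustration and is not anything
from the mono sources, and the code assumes that >> on a negative
int32_t is an arithmetic shift (true for GCC and Clang, but
implementation-defined in ISO C).

```c
#include <assert.h>
#include <stdint.h>

/*
 * Signed division by 2^k without a divide instruction or helper call.
 * n / 2^k truncates toward zero, while an arithmetic right shift rounds
 * toward negative infinity, so negative inputs get a bias of 2^k - 1
 * added before the shift.
 * Assumption: >> on a negative int32_t is an arithmetic shift.
 */
static int32_t div_pow2(int32_t n, unsigned k)
{
    assert(k >= 1 && k <= 30);  /* keep 2^k - 1 inside int32_t */
    int32_t bias = (n < 0) ? (int32_t)((1u << k) - 1u) : 0;
    return (n + bias) >> k;     /* n < 0 and bias > 0, so no overflow */
}
```

With this, div_pow2(-1, 8) gives 0, the same as -1 / 256, whereas
-1 >> 8 gives -1. So the source-level shift used for the ~4s figure
matches the division only while n is non-negative.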

I've looked at the code to try to figure out the cause for 1) and it 
seems to be that mono uses emulation for the IDIV opcodes as 
MONO_ARCH_EMULATE_DIV is defined (since the ARM does not have that in 
hardware).

#if defined(MONO_ARCH_EMULATE_MUL_DIV) || defined(MONO_ARCH_EMULATE_DIV)
    mono_register_opcode_emulation (CEE_DIV, "__emul_idiv", "int32 int32 int32", mono_idiv, FALSE);
    mono_register_opcode_emulation (CEE_DIV_UN, "__emul_idiv_un", "int32 int32 int32", mono_idiv_un, FALSE);
    mono_register_opcode_emulation (CEE_REM, "__emul_irem", "int32 int32 int32", mono_irem, FALSE);
    mono_register_opcode_emulation (CEE_REM_UN, "__emul_irem_un", "int32 int32 int32", mono_irem_un, FALSE);
    mono_register_opcode_emulation (OP_IDIV, "__emul_op_idiv", "int32 int32 int32", mono_idiv, FALSE);
    mono_register_opcode_emulation (OP_IDIV_UN, "__emul_op_idiv_un", "int32 int32 int32", mono_idiv_un, FALSE);
    mono_register_opcode_emulation (OP_IREM, "__emul_op_irem", "int32 int32 int32", mono_irem, FALSE);
    mono_register_opcode_emulation (OP_IREM_UN, "__emul_op_irem_un", "int32 int32 int32", mono_irem_un, FALSE);
#endif

This results in a dispatch to mono_idiv:

gint32
mono_idiv (gint32 a, gint32 b)
{
    MONO_ARCH_SAVE_REGS;

#ifdef MONO_ARCH_NEED_DIV_CHECK
    if (!b)
        mono_raise_exception (mono_get_exception_divide_by_zero ());
    else if (b == -1 && a == (0x80000000))
        mono_raise_exception (mono_get_exception_arithmetic ());
#endif
    return a / b;
}

However, at this point the fact that we were dividing by a power of 2
constant has been lost.
Furthermore, the actual mechanics of getting to this function are quite
heavy (through an indirection table), as
is borne out by the disassembly of the jitted code for muldiv:

For n=n/256:
105c:       e1a00006        mov     r0, r6
1060:       e3a01f40        mov     r1, #256        ; 0x100
1064:       eb000412        bl      20b4 <plt+0x14>

plt+0x14:
20b4:       e28fc000        add     ip, pc, #0      ; 0x0
20b8:       e28ccc19        add     ip, ip, #6400   ; 0x1900
20bc:       e59cf058        ldr     pc, [ip, #88]

On top of that comes the implementation of mono_idiv itself.
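
To make the indirection concrete, here is a small worked calculation in
C of which table slot the three-instruction stub loads its target from.
It is only a sketch: plt_slot_address is a hypothetical helper name, the
addresses are the offsets printed in the listing rather than real load
addresses, and it relies on the ARM-state rule that reading pc yields
the address of the current instruction plus 8.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Address of the table slot read by the stub at plt+0x14.
 * stub_addr is the address of the first stub instruction
 * (0x20b4 in the listing).
 */
static uint32_t plt_slot_address(uint32_t stub_addr)
{
    assert((stub_addr & 3u) == 0);        /* ARM instructions are word aligned */
    uint32_t ip = (stub_addr + 8u) + 0u;  /* add ip, pc, #0     (pc reads as addr + 8) */
    ip += 0x1900u;                        /* add ip, ip, #6400 */
    return ip + 88u;                      /* ldr pc, [ip, #88] */
}
```

For the listing this gives plt_slot_address(0x20b4) == 0x3a14, so each
n / 256 costs a bl, two adds and a load into pc before mono_idiv even
starts, compared with a single lsl for the multiply.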

Compare to the code generated for the next two lines:
                n++;
                n = n * 128;
1068:       e2800001        add     r0, r0, #1      ; 0x1
106c:       e1a00380        lsl     r0, r0, #7

Would the correct way to fix this be to translate the opcodes in 
mono_arch_lowering_pass() of mini-arm.c?
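
To illustrate the kind of rewrite I have in mind, here is a sketch in C
on a made-up toy IR. None of these names (toy_op, toy_ins,
toy_lower_div, log2_exact) exist in mono; they are placeholders for
whatever the real instruction structure and opcodes look like, and the
real pass would have to emit the actual ARM shift sequence.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical toy IR, not mono's instruction structure or opcode set. */
enum toy_op {
    TOY_DIV_IMM,     /* signed   dst = src / imm */
    TOY_DIV_UN_IMM,  /* unsigned dst = src / imm */
    TOY_SHR_UN_IMM,  /* logical shift right by imm */
    TOY_DIV_POW2,    /* signed: add 2^imm - 1 if negative, then asr by imm */
    TOY_EMUL_CALL    /* fall back to the out-of-line helper */
};

struct toy_ins {
    enum toy_op op;
    int32_t imm;
};

/* Returns k if v == 2^k, otherwise -1. */
static int log2_exact(int32_t v)
{
    if (v <= 0 || (v & (v - 1)) != 0)
        return -1;
    int k = 0;
    while ((v >>= 1) != 0)
        k++;
    return k;
}

/* Rewrite a divide by a power-of-two constant into a shift form;
 * any other constant divisor falls back to the emulation call. */
static void toy_lower_div(struct toy_ins *ins)
{
    assert(ins != 0);
    if (ins->op != TOY_DIV_IMM && ins->op != TOY_DIV_UN_IMM)
        return;                       /* not a divide: leave untouched */
    int k = log2_exact(ins->imm);
    if (k < 0) {
        ins->op = TOY_EMUL_CALL;      /* divisor is not 2^k */
        return;
    }
    ins->op = (ins->op == TOY_DIV_UN_IMM) ? TOY_SHR_UN_IMM : TOY_DIV_POW2;
    ins->imm = k;                     /* immediate now holds the shift count */
}
```

The point is that the decision has to be made while the constant
divisor is still visible in the instruction, i.e. before the divide is
turned into a call to the emulation helper.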

Regards,

Martin