[Mono-dev] Delegates very slow on Mono 2.2/Linux (but not on Mono 2.4/Windows)

Sun Mar 15 17:19:00 EDT 2009

Hi all,

I just ran some tests to measure performance in OpenTK.Graphics and
Tao.OpenGl and uncovered some surprising results.

Some background first: OpenGL exports functions either statically ("core
functions") or dynamically ("extensions"). While you use a simple
[DllImport] to invoke core functions, you have to invoke extensions through
function pointers. Different platforms, video cards, even drivers expose
different subsets of OpenGL as extensions, which means you have to resolve
them at runtime.

To deal with this problem, the aforementioned libraries implement a
relatively complex solution:

   - The union of all core functions is declared as [DllImport] in a private
   class named "Core".
   - The union of all core and extension functions is declared as delegates
   in a private class named "Delegates".
   - Each delegate has one or more "wrapper" functions. This is the public
   API for the user.
   - During initialization, we probe each OpenGL function and "arm" the
   relevant delegate with the result of
   Marshal.GetDelegateForFunctionPointer, with a function from the Core
   class, or with null (if the function is exported dynamically, statically
   or not at all, respectively).
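
To make the arming step concrete, here is a minimal, language-neutral sketch
in Python. It is not the C# code from OpenTK or Tao: every name in it
(CORE, DRIVER_EXPORTS, get_proc_address, arm, gen_buffers) is hypothetical,
and the probing order (dynamic lookup first, then the static import) is an
assumption made for illustration.

```python
# Hypothetical stand-ins; none of these names come from OpenTK or Tao.
def core_clear():
    """Pretend statically exported ("core") function."""
    return "core:glClear"

def ext_gen_buffers():
    """Pretend extension entry point provided by the driver."""
    return "ext:glGenBuffers"

CORE = {"glClear": core_clear}                      # role of the "Core" class
DRIVER_EXPORTS = {"glGenBuffers": ext_gen_buffers}  # what this driver exposes

def get_proc_address(name):
    """Stand-in for the platform's extension lookup; None when missing."""
    return DRIVER_EXPORTS.get(name)

def arm(names):
    """Bind each name to a dynamic entry point, a core function, or None."""
    delegates = {}
    for name in names:
        entry = get_proc_address(name)   # exported dynamically?
        if entry is None:
            entry = CORE.get(name)       # exported statically? else None
        delegates[name] = entry
    return delegates

# Role of the "Delegates" class, filled in once during initialization.
DELEGATES = arm(["glClear", "glGenBuffers", "glFancyNewThing"])

def gen_buffers():
    """Public wrapper: the only thing user code calls."""
    target = DELEGATES["glGenBuffers"]
    if target is None:
        raise NotImplementedError("glGenBuffers is not available here")
    return target()
```

Every public call goes through the armed slot, so whatever the indirection
costs is paid on each OpenGL call.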

Most of the types used in OpenGL interop are blittable, which makes most
pinvokes pretty fast. That leaves the delegate call as the main remaining
cost, and it should be plenty fast (or so we thought).

To test the performance of this approach, I wrote a simple test that
simulates OpenGL calls (attached). The test measures the call overhead for
two function prototypes that are very common in OpenGL:

   - void SendFloat(int, int, int, float*)
   - void Send(int, int, int, int, void*)

The first function is wrapped as "void SendFloat(int, int, int, float[])"
and the array is pinned and passed as a simple pointer.  The second becomes
"void Send(int, int, int, int, object)" and the last parameter is also
pinned (with GCHandle.Alloc) and passed as a simple pointer (we assume
'object' is a blittable struct). Each of these functions is tested twice,
first through a delegate (as outlined above) and then directly with a simple
pinvoke.

The results are measured on a 2.66GHz Core 2 Duo with each function called
10^6 times (not nearly enough for ns accuracy, but the problem is
nonetheless obvious). The binaries were compiled with gmcs 2.2 (every test
used the same executable). The unmanaged dll was compiled with gcc on Linux
(x86_64) and msvc on Windows (x86):

[Mono 2.2, Linux x86_64]
Timing SendFloat (delegate): 0.7666697 seconds (766.6697 ns/call) with
3/3/3 collections.
Timing SendFloat (direct): 0.0170575 seconds (17.0575 ns/call) with
3/3/3 collections.
Timing Send (delegate): 1.3894752 seconds (1389.4752 ns/call) with
3/3/3 collections.
Timing Send (direct): 0.2461236 seconds (246.1236 ns/call) with
3/3/3 collections.

[Mono 2.4 RC1, Windows x86 (VirtualBox)]
Timing SendFloat (delegate): 0,0130416 seconds (13,0416 ns/call) with 1/1/1
collections.
Timing SendFloat (direct): 0,0140448 seconds (14,0448 ns/call) with 1/1/1
collections.
Timing Send (delegate): 0,1033469 seconds (103,3469 ns/call) with 1/1/1
collections.
Timing Send (direct): 0,1063392 seconds (106,3392 ns/call) with 1/1/1
collections.

[.Net 3.5 SP1, Windows x86 (VirtualBox)]
Timing SendFloat (delegate): 0,0117486 seconds (11,7486 ns/call) with
0/0/0 collections.
Timing SendFloat (direct): 0,0070824 seconds (7,0824 ns/call) with
0/0/0 collections.
Timing Send (delegate): 0,1087277 seconds (108,7277 ns/call) with
0/0/0 collections.
Timing Send (direct): 0,095304 seconds (95,304 ns/call) with
0/0/0 collections.

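As you can see, on Mono 2.2 / Linux x86_64 a call through a delegate is
roughly 5 - 45 times slower than the direct pinvoke - around 1us of overhead
for a single delegate call! In comparison, on Mono 2.4 / Windows x86 the
delegate and direct timings are within 1 - 3ns of each other, which is what
you would expect if a delegate call costs about as much as a simple virtual
call.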
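
For reference, the Linux figures can be recomputed from the ns/call numbers
above. A small Python sketch; the variable and function names are
illustrative, only the timings come from the results:

```python
# (delegate ns/call, direct ns/call) from the Mono 2.2 / Linux x86_64 run.
LINUX_NS = {
    "SendFloat": (766.6697, 17.0575),
    "Send": (1389.4752, 246.1236),
}

def delegate_cost(delegate_ns, direct_ns):
    """Return (extra ns per call, slowdown factor) of delegate vs direct."""
    return delegate_ns - direct_ns, delegate_ns / direct_ns

for name, (delegate_ns, direct_ns) in LINUX_NS.items():
    extra, factor = delegate_cost(delegate_ns, direct_ns)
    print(f"{name}: {extra:.0f} ns extra per call, {factor:.1f}x slower")
```

The extra cost per call is similar for both prototypes, which suggests a
fixed per-call penalty rather than something tied to the marshalling of the
last argument.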

A typical, state-of-the-art 3d program may contain somewhere between 1000
and 5000 draw calls per frame. Assuming the above results hold, the interop
layer will consume 5-30% of your total frame budget (16.6ms at 60 fps) - not
good!
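
The estimate can be checked with a few lines of Python. This is a rough
sketch under two assumptions that are not measured above: about 1000 ns of
delegate overhead per call, and one OpenGL call per draw call (real draw
calls usually involve several).

```python
FRAME_MS = 1000.0 / 60.0  # about 16.67 ms per frame at 60 fps

def budget_share(calls_per_frame, overhead_ns_per_call):
    """Fraction of one frame spent on per-call interop overhead."""
    overhead_ms = calls_per_frame * overhead_ns_per_call / 1e6
    return overhead_ms / FRAME_MS

# Assumed overhead: ~1000 ns per call (rounded from the Linux results).
for calls in (1000, 5000):
    print(f"{calls} calls/frame: {budget_share(calls, 1000.0):.0%} of the frame")
```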

Is there an explanation for this discrepancy? Can we expect better
performance in some future version of the runtime? Should we bite the bullet
and rewrite the bindings in ilasm (replacing pinvokes with calli
instructions)? Any possible workarounds / alternatives?

Thanks for your time!
-------------- next part --------------
A non-text attachment was scrubbed...
Name: InteropSpeed.7z
Type: application/x-7z-compressed
Size: 28651 bytes
Desc: not available
Url : http://lists.ximian.com/pipermail/mono-devel-list/attachments/20090315/c3dc4c14/attachment-0001.bin