[Mono-dev] difference in performance between mono OSX and linux?

Reimer Behrends behrends at gmail.com
Sun Jan 22 11:39:00 UTC 2012

On 21/01/2012 19:28, Jonathan Shore wrote:
> So I am wondering whether there are differences in implementation
> between mono on these platforms that could account for a significant
> performance difference?

First of all, since your code appears to be multi-threaded: does it use 
thread-static variables extensively (including as part of a library)? 
The Darwin ABI does not natively support thread-local storage, so Apple 
only supports it through pthread_getspecific() [1,2]. This makes 
thread-static variables comparatively slow in 2.10.

This is somewhat fixed in the current github master (and presumably will 
also be fixed in 2.12). The new code attempts to disassemble 
pthread_getspecific() to find the gs register offset that the OS uses 
and then uses that as a basis for generating thread-local code. The 
performance difference is pretty dramatic if you use thread-static 
variables a lot. (Caveat: if you want to experiment, from what I can 
tell it so far only works properly for the x86 target; the amd64 
target, i.e. 64-bit, for some reason doesn't, so you will want to build 
for a 32-bit host when experimenting with it.)
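For reference, the kind of code affected is any hot path that touches a 
[ThreadStatic] field. A minimal sketch (illustrative only, not from the 
original report):

```csharp
using System;
using System.Threading;

class ThreadStaticDemo
{
    // Each thread sees its own copy of this field. On Darwin with Mono
    // 2.10, every access goes through pthread_getspecific() rather than
    // a direct gs-relative load, which is what makes it comparatively
    // slow there.
    [ThreadStatic]
    static int counter;

    public static int[] RunWorkers(int nThreads, int iterations)
    {
        var results = new int[nThreads];
        var threads = new Thread[nThreads];
        for (int i = 0; i < nThreads; i++)
        {
            int idx = i; // capture the index, not the loop variable
            threads[i] = new Thread(() =>
            {
                // Hot loop of thread-static accesses; this is the access
                // pattern whose cost differs between the Linux and OS X
                // ports.
                for (int j = 0; j < iterations; j++)
                    counter++;
                results[idx] = counter; // each thread sees only its own count
            });
            threads[i].Start();
        }
        foreach (var t in threads)
            t.Join();
        return results;
    }

    static void Main()
    {
        foreach (int r in RunWorkers(4, 1000000))
            Console.WriteLine(r); // each thread counted to 1000000 on its own
    }
}
```

Without [ThreadStatic] the increments would race on one shared field; 
with it, each thread gets an independent counter, at the cost of one 
thread-local lookup per access.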

Second, if you're running a benchmark in which multiple threads 
aggressively use a single shared lock, that can lead to a form of 
"thrashing", independently of the OS used. Basically, if a thread blocks 
on a contended lock, most simple lock implementations suspend the 
thread (which involves an expensive kernel trap). If the timing is 
unfortunate, you can waste a lot of time having threads suspend 
themselves and get immediately reawakened; the specific overhead and the 
circumstances where that happens vary by OS, but the effect can be very 
unpretty (you can easily make a program 10x slower on most machines by 
parallelizing it in a way that the architecture doesn't like). You can 
recognize this scenario with /usr/bin/time or something similar: an 
otherwise CPU-bound process will have a disproportionate amount of time 
attributed to system rather than user time.
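A minimal way to provoke the symptom (a hypothetical example, not your 
benchmark): have several threads hammer one shared lock with trivial 
critical sections, then run the process under /usr/bin/time:

```csharp
using System;
using System.Threading;

class ContentionDemo
{
    static readonly object gate = new object();

    public static long Run(int nThreads, int iterations)
    {
        long shared = 0;
        var threads = new Thread[nThreads];
        for (int i = 0; i < nThreads; i++)
        {
            threads[i] = new Thread(() =>
            {
                // Every iteration takes the same lock for a trivial
                // amount of work, so under bad timing the threads spend
                // most of their time suspending and reawakening in the
                // kernel.
                for (int j = 0; j < iterations; j++)
                    lock (gate)
                        shared++;
            });
            threads[i].Start();
        }
        foreach (var t in threads)
            t.Join();
        return shared;
    }

    static void Main()
    {
        Console.WriteLine(Run(8, 500000)); // 8 * 500000 = 4000000
    }
}
```

Run it as "/usr/bin/time mono ContentionDemo.exe"; a "sys" figure that 
is disproportionately large relative to "user" for a CPU-bound program 
is the telltale sign described above.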

A relatively simple workaround, where you have this problem but expect 
the critical section to be short-lived, is to repeatedly use a "try 
lock" operation (such as Monitor.TryEnter()) before falling back to an 
actual lock-or-suspend type of operation. While this can be more 
expensive (and potentially problematic if you have more threads than 
available processors, or if you have a LOT of processors), in a lot of 
normal situations it prevents unnecessary thread suspensions. 
Essentially, it treats the lock as a spin lock and only falls back to a 
blocking implementation if that seems unworkable.
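A sketch of that workaround (the SpinThenBlock helper is hypothetical, 
not a Mono/.NET API; the C# lock statement itself compiles down to 
Monitor.Enter/Exit):

```csharp
using System;
using System.Threading;

static class SpinThenBlock
{
    // Hypothetical helper: poll the lock a bounded number of times with
    // Monitor.TryEnter before falling back to the blocking
    // Monitor.Enter, so that short critical sections are usually
    // entered without a kernel-level thread suspension.
    public static void Enter(object gate, int spins)
    {
        for (int i = 0; i < spins; i++)
        {
            if (Monitor.TryEnter(gate))
                return; // got the lock without blocking
            Thread.SpinWait(20); // brief busy-wait between attempts
        }
        Monitor.Enter(gate); // give up spinning and block normally
    }
}

class Demo
{
    static readonly object gate = new object();

    public static int Run(int nThreads, int iterations)
    {
        int counter = 0;
        var threads = new Thread[nThreads];
        for (int i = 0; i < nThreads; i++)
        {
            threads[i] = new Thread(() =>
            {
                for (int j = 0; j < iterations; j++)
                {
                    SpinThenBlock.Enter(gate, 100);
                    try { counter++; }
                    finally { Monitor.Exit(gate); }
                }
            });
            threads[i].Start();
        }
        foreach (var t in threads)
            t.Join();
        return counter;
    }

    static void Main()
    {
        Console.WriteLine(Run(4, 100000)); // 4 * 100000 = 400000
    }
}
```

The spin count is a tuning knob: too low and you still trap into the 
kernel on every contended acquire, too high and you burn CPU that other 
threads could use.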

Third, another cause might be that the Boehm GC is causing trouble here; 
it unavoidably has a central lock, and you say you're allocating 
millions of objects. While the Boehm GC specifically tries to mitigate 
the high-contention scenario above (and has thread-local allocation 
that, if enabled, largely avoids it in a lot of cases), there may still 
be system-specific differences. Trying to run with --gc=sgen may help to 
either identify or exclude this as the source of the performance 
difference.

And, of course, there are a gazillion other possible reasons for a 
performance difference, but the above are common ones you may encounter.

			Reimer Behrends

[1] As on Linux, Darwin stores thread-local variables relative to the 
segment register gs; unlike Linux, Darwin gives you no way to tell at 
what offset thread-local data is (or can be) stored, nor does it promise 
that the implementation won't change completely in a later version of 
the OS.

[2] There are alternative implementations of fast thread-local storage, 
but most of them have their own up- and downsides.
