[Mono-dev] difference in performance between mono OSX and linux?

Mon Jan 23 16:26:26 UTC 2012

Reimer Behrends <behrends <at> gmail.com> writes:

> 
> On 21/01/2012 19:28, Jonathan Shore wrote:
> > So I am wondering whether there are differences in implementation
> > between mono on these platforms that could account for a significant
> > performance difference?
> 
> First of all, since your code appears to be multi-threaded, is your code 
> using thread-static variables extensively (including as part of a 
> library)? The Darwin ABI does not natively support thread-local 
> storage, so Apple only supports it through pthread_getspecific() [1,2]. 
> This makes thread-static variables comparatively slow in 2.10.

I do have ThreadLocals in some parts of the code base, but they are not 
being exercised in this test.  Thanks for the pointer though.  Will keep 
it in mind.
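
For reference, a minimal C sketch of what key-based thread-local storage
looks like through the POSIX API mentioned above. The names (counter_key,
bump_counter, bump_on_new_thread) are made up for illustration, and this
is not meant to show how Mono implements thread-static variables; it only
shows that every access goes through a library call and a key lookup.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical per-thread counter stored via the POSIX key API. */
static pthread_key_t counter_key;
static pthread_once_t counter_once = PTHREAD_ONCE_INIT;

static void make_key(void)
{
    /* free() runs at thread exit for every non-NULL value stored under the key */
    if (pthread_key_create(&counter_key, free) != 0)
        abort();
}

/* Returns this thread's counter slot, allocating it on first use. */
static long *thread_counter(void)
{
    pthread_once(&counter_once, make_key);
    long *slot = pthread_getspecific(counter_key);
    if (slot == NULL) {
        slot = calloc(1, sizeof *slot);
        if (slot == NULL || pthread_setspecific(counter_key, slot) != 0)
            abort();
    }
    return slot;
}

/* Increments and returns the calling thread's private counter. */
long bump_counter(void)
{
    return ++*thread_counter();
}

static void *worker(void *out)
{
    *(long *)out = bump_counter();
    return NULL;
}

/* Runs bump_counter() once on a fresh thread and returns the value it saw. */
long bump_on_new_thread(void)
{
    pthread_t t;
    long seen = -1;
    int rc = pthread_create(&t, NULL, worker, &seen);
    assert(rc == 0);
    (void)rc;
    pthread_join(t, NULL);
    return seen;
}
```

Calling bump_counter() twice on one thread and then bump_on_new_thread()
shows that the new thread starts from its own zeroed counter rather than
continuing the caller's.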

> 
> Second, if you're running a benchmark that aggressively has multiple 
> threads use a single shared lock, that can lead to a form of 
> "thrashing", independently of the OS used. Basically, if a thread blocks 
> because of a contended lock, most simple lock implementations suspend the 
> thread (which involves an expensive kernel trap). If timing is 
> unfortunate, then you can waste a lot of time having threads suspending 
> themselves and getting immediately reawakened; the specific overhead and 

Although the code is intended to be hit by multiple threads, this test was on a
single thread.  The thread does enter and exit a SpinLock though.  I temporarily
removed it to see if there was a substantial performance difference.  The
difference was less than a second for 8 million passes through it.

So I am happy to report that SpinLocks (at least in my usage on a single thread) 
appear to be very efficient.  I expect some degradation in lock performance on
multiple threads, since it uses a queue and will also suspend the thread after a 
certain number of cycles.
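
A rough sketch of this kind of single-threaded, uncontended measurement,
written in C with a pthread mutex standing in for the .NET SpinLock (so
absolute numbers will not carry over). The function names are made up for
illustration, and CLOCK_MONOTONIC is assumed to be available.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L   /* for clock_gettime() under strict -std modes */
#endif
#include <assert.h>
#include <pthread.h>
#include <time.h>

static pthread_mutex_t pass_lock = PTHREAD_MUTEX_INITIALIZER;
static volatile long work_done;   /* volatile so the loop is not optimized away */

/* Runs `passes` iterations of a trivial critical section, with or without
   taking the lock, and returns the elapsed wall-clock time in seconds. */
double time_passes(long passes, int use_lock)
{
    struct timespec t0, t1;

    assert(passes >= 0);
    work_done = 0;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < passes; i++) {
        if (use_lock)
            pthread_mutex_lock(&pass_lock);
        work_done++;
        if (use_lock)
            pthread_mutex_unlock(&pass_lock);
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (double)(t1.tv_sec - t0.tv_sec)
         + (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
}

/* Number of iterations completed by the most recent time_passes() call. */
long passes_completed(void)
{
    return work_done;
}
```

Comparing time_passes(8000000, 1) against time_passes(8000000, 0) gives
the total cost of the uncontended enter/exit pairs on a given machine.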

> A relatively simple workaround where you have this problem but expect a 
> critical section to only be short-lived is to repeatedly use a "try 
> lock" statement (such as Monitor.TryEnter()) before actually using a 
> lock-or-suspend type of operation. While this can be more expensive (and 
> potentially problematic if you have more threads than available 
> processors, or if you have a LOT of processors), in a lot of normal 
> situations it prevents unnecessary thread suspensions (essentially, it 
> tries to treat the lock as a spin lock and only falls back to a blocking 
> implementation if that seems unworkable).

Since my transactions are short-lived, the SpinLock seems like a good choice.
I believe the Mono implementation is actually a hybrid that will spin for a 
while and then suspend (which is good behavior).
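
The try-then-block pattern described above can be sketched in C, using
pthread_mutex_trylock() in place of Monitor.TryEnter(). This is only an
illustration of the idea, not Mono's SpinLock implementation; the
SPIN_ATTEMPTS limit and the function names are assumptions picked for the
example.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical tuning knob: try-lock attempts before falling back to blocking. */
enum { SPIN_ATTEMPTS = 100 };

/* Returns 0 if the lock was taken by a try-lock attempt, 1 if the caller
   had to fall back to the blocking acquire, and -1 if that acquire failed. */
int spin_then_lock(pthread_mutex_t *m)
{
    for (int i = 0; i < SPIN_ATTEMPTS; i++) {
        if (pthread_mutex_trylock(m) == 0)
            return 0;   /* acquired without suspending the thread */
    }
    /* Still contended: block as usual, which may suspend the thread. */
    return pthread_mutex_lock(m) == 0 ? 1 : -1;
}

static pthread_mutex_t total_lock = PTHREAD_MUTEX_INITIALIZER;
static long total;

/* A short critical section of the kind the pattern is meant for. */
long add_to_total(long n)
{
    if (spin_then_lock(&total_lock) < 0)
        abort();
    total += n;
    long result = total;
    int rc = pthread_mutex_unlock(&total_lock);
    assert(rc == 0);
    (void)rc;
    return result;
}
```

With more runnable threads than processors, a fixed spin count like this
can waste time, which is the caveat raised in the quoted text.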

> 
> Third, another cause might be that the Boehm GC is causing trouble here; 
> it (unavoidably) has a central lock and you say you're allocating 
> millions of objects. While the Boehm GC specifically tries to mitigate 
> the high contention scenario above (and has thread-local allocation if 
> enabled that largely avoids it for a lot of cases), there may still be 
> system-specific differences. Trying to run with --gc=sgen may help to 
> either identify or exclude this as a source of performance difference.

In retrospect, I don't think I am GC bound.  Most of the objects are created 
up front and then run through a simulation (which is transaction based).   

I'll run with sgen and see how that differs.  For practical use, unfortunately, 
I've found that sgen is often much slower than Boehm for my application(s).
For some trivial tests where there is local object creation and discard, sgen
is much better.  Will give it a go.

> 
> And, of course, there are a gazillion more causes why there may be a 
> performance difference, but these are common reasons you may encounter.
> 

Sure.  I recognize that I have not provided much to go on.  I was curious 
what general implementation differences could contribute.

> 			Reimer Behrends
> 
> [1] As on Linux, Darwin stores thread-local variables relative to the 
> segment register gs; unlike Linux, Darwin gives you no way to tell at 
> what offset thread-local data is/can be stored nor does it promise that 
> it may not totally change its implementation in a later version of the OS.
> 
> [2] There are alternative implementations of fast thread-local storage, 
> but most of them have their own up- and downsides.
>