[Mono-list] numerical performance comparison [C++ vs JVM vs Mono]

Mon Oct 4 13:13:07 EDT 2010

I did the exact same thing and dumped the assembly generated by C and C# for nbody.
In addition to the sqrt issue, which I believe is a matter of pre- versus post-SSEn code (Mono isn't as up to date
on making full use of SSEn), the other issue was that Mono did a few sets of unnecessary register transfers
compared to the assembly generated by C. With these two issues resolved, the
benchmark times would have matched. I am not even sure Mono's older sqrt call accounted for the majority of the difference
in the benchmark; I believe it was the unnecessary register transfers.
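
One way to separate the cost of sqrt from the cost of the loop around it is to swap the same software square root into every implementation, as the quoted message below describes. The routine used there was not posted, so the following is only a minimal sketch of that idea using Newton's iteration; the name newton_sqrt, the iteration cap and the tolerance are my own assumptions.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical stand-in for sqrt(), NOT the routine from the quoted message.
// Newton's iteration for y*y = x:  y <- (y + x / y) / 2.
double newton_sqrt(double x) {
    assert(x >= 0.0);               // negative inputs are out of scope for this sketch
    if (x == 0.0) return 0.0;       // avoid dividing by zero in the loop
    double y = x > 1.0 ? x : 1.0;   // start at or above the root
    for (int i = 0; i < 64; ++i) {  // cap is enough for moderate magnitudes only
        double next = 0.5 * (y + x / y);
        if (std::fabs(next - y) <= 1e-15 * next) return next;
        y = next;
    }
    return y;                       // best estimate if the cap was reached
}
```

Calling newton_sqrt(d2) in place of std::sqrt(d2) in each language makes the square root cost roughly the same everywhere, so whatever timing gap remains comes from the rest of the generated loop.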

You will also notice in that benchmark game that many of the C versions (the most recent version of a given benchmark)
are often not even solutions that can be fairly compared. I think there is even one winner that is a threaded solution,
while Mono uses a single thread. Really, that "game" has far more to do with people writing better
and better algorithms for a given language solution than it is a comparison of languages.

Having said that,
it seems to me, given my comparison of the assembly code for a few of them, that aside from the obvious issues of
array bounds checks and so on for safety, the main performance killer (for Mono) appears to be
non-optimal use of registers, with too much unnecessary transfer/setup. This, however, is only really noticeable
in these huge loops, which, for many people using Mono, isn't an issue. Another issue I noticed is that
with the latest SSE4.? there are 16 registers to use (as far as I know, all 16 XMM registers are only available in 64-bit mode),
but I see Mono shuffling within 8 (I think, if I remember
correctly). Oddly enough, I didn't see GNU gcc using the available 16 either.
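
For anyone who has not looked at the benchmark, the hot loop in question has roughly the shape below. This is a sketch written from the general structure of the nbody programs, not a copy of any submitted solution; the names Body and advance_velocities are my own. The point is that each pair needs only a handful of live doubles (dx, dy, dz, d2, mag), so how well the compiler keeps them in registers across the loop body matters a lot once the loop runs hundreds of millions of times.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Hypothetical layout: position, velocity and mass of one body.
struct Body { double x, y, z, vx, vy, vz, mass; };

// One velocity-update pass over all unordered pairs of bodies.
void advance_velocities(Body* bodies, std::size_t n, double dt) {
    assert(dt > 0.0);  // this sketch assumes a positive time step
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = i + 1; j < n; ++j) {
            double dx = bodies[i].x - bodies[j].x;
            double dy = bodies[i].y - bodies[j].y;
            double dz = bodies[i].z - bodies[j].z;
            double d2 = dx * dx + dy * dy + dz * dz;
            // The one sqrt per pair that the thread is about.
            double mag = dt / (d2 * std::sqrt(d2));
            bodies[i].vx -= dx * bodies[j].mass * mag;
            bodies[i].vy -= dy * bodies[j].mass * mag;
            bodies[i].vz -= dz * bodies[j].mass * mag;
            bodies[j].vx += dx * bodies[i].mass * mag;
            bodies[j].vy += dy * bodies[i].mass * mag;
            bodies[j].vz += dz * bodies[i].mass * mag;
        }
    }
}
```

Compiling something like this with g++ -O3 -S and comparing the output against what the Mono JIT emits for the equivalent C# is the kind of comparison described above.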

tl

On Mon, 4 Oct 2010 10:43:18 -0400
Jonathan Shore <jonathan.shore at gmail.com> wrote:

> Hi,
> 
> I am looking forward to moving all of my code from Java / C++ to F# / C# in the very near future.   I took the nbody code from the language shootout and ran with 500 million iterations (much more than used in the shootout to provide a fair comparison) on ubuntu server on a core i7 920 box.
> 
> I used:
> 
> - C++ (g++ -O3 with various MMX related flags as done in the shootout)
> - Java 7  -server
> - Mono 2.4.4, compiling with -optimize:+
> 
> I had the following results in seconds:
> 
> 1.  C++: 		98 seconds
> 2.  JVM:		126 seconds,  a 28% performance gap against C++
> 3.  Mono:	191 seconds,  a 50% performance gap with the JVM
> 
> Because the nbody problem uses sqrt for the euclidean distance in each loop, I thought that maybe the discrepancy might be more related to the implementation of Sqrt().
> 
> I implemented a (very poor) numerical algorithm as a substitute for the sqrt() function in each implementation to provide an apples-to-apples comparison.    The new numbers became:
> 
> 1.  C++:		517 seconds
> 2.  JVM:		527 seconds
> 3.  Mono:		223 seconds (wow, a surprise here)
> 
> I noticed that the Mono runtime libraries use an internal implementation of Sqrt() that seems to resolve to an Op Code.   I am wondering, ultimately, what implementation this maps to?   Clearly the Sqrt implementation in Mono is 2x as slow (or access through the layers is 2x as slow) as the libc implementation.   
> 
> I do mostly numerical work, so I am concerned about sqrt as well as other fundamental functions in this regard.   Are these custom implementations in assembler for each arch?    Would it be reasonable to try to map these to the existing libc library when available?
> 
> Thanks
> 
> --
> Jonathan Shore
> Systematic Trading Group
> 

-- 
ted leslie <tleslie at tcn.net>