[Mono-bugs] [Bug 81663][Wis] Changed - Performance: Delegate optimization, DLR and IronPython

Wed May 23 10:34:57 EDT 2007

Please do not reply to this email- if you want to comment on the bug, go to the
URL shown below and enter your comments there.

Changed by lupus at ximian.com.

http://bugzilla.ximian.com/show_bug.cgi?id=81663

--- shadow/81663	2007-05-21 16:37:21.000000000 -0400
+++ shadow/81663.tmp.5826	2007-05-23 10:34:57.000000000 -0400
@@ -32,6 +32,67 @@
 tracking the lifetime of the delegate and now the GC does that
 instead. I don't know if that'll apply to Mono or not but maybe it
 gives you some ideas on where to look for own improvements".
 
 ------- Additional Comments From miguel at ximian.com  2007-05-18 12:29 -------
 We should create a small test case to measure the problem.
+
+------- Additional Comments From lupus at ximian.com  2007-05-23 10:34 -------
+On my system (pentium M 1.6) there is a small degradation in
+performance in pystone when going from 1.0.1/1.1 to 2.0 (54662.5 to
+53477.5). But note that pystone should be run with optimizations
+enabled (-O pystone.py) and in that case there is a small gain in 2.0
+vs 1.1 (55157.9 vs basically no change in earlier ironpythons).
+
+Also note that delegate invocation performance has nothing to do
+with the results of pystone (in a profile run the most expensive
+delegate invocation wrapper has a 0.6% impact on the total performance).
+It would be good for people experiencing big slowdowns to get a
+profile for the two runs and attach them to this bug for my review.
+The commands to run are:
+mono --profile=default:stat IPCE-r6/ipy.exe -O
+/usr/lib/python2.5/test/pystone.py 500000
+and
+mono --profile=default:stat IronPython-1.0.1/ipy.exe -O
+/usr/lib/python2.5/test/pystone.py 500000
+
+As for the delegate invocation performance, it is currently about 3
+times the time spent for a virtual call. There are a few optimizations
+possible.
+The first thing to remove is a check added in r27776 by lluis, but we
+need to investigate why it was added (the wrapper should not be called
+in unmanaged->managed transions as the comment says, if it is, it is a
+bug and it should be fixed some other way). This should be the
+simplest change.
+
+The most important overhead is reloading the arguments: we can avoid
+it in the most common scenario, a delegate that is not chained with
+other delegates. There are cases where removing this overhead is cheap
+and cases where it is hard to implement, depending on the calling
+convention. On x86, for example, for delegates that invoke instance
+methods we can just load the target object, place it on the stack
+where the delegate object was pushed and jump to the address. Static
+methods can't be handled the same way because the stack would end up
+imbalanced. Other architectures could be able to handle static methods
+as well with some signatures: sliding a few arguments in registers is
+very cheap. If we change the internal call convention to pass this in
+%ecx, handling static methods would be cheap as well.
+
+I had the above change prototyped a while ago, by prepending to the
+delegate-invoke wrapper a simple check+jump. In the production
+implementation we could do things a little differently by changing the
+call to delegate invoke from:
+
+  call delegate_invoke_impl
+to
+  call *delegate_object [offset]
+where offset is of a field in the delegate object where we store a
+delegate instance-specific invoker.
+The default would be the address of the current delegate_invoke_impl,
+but in the cases where we can optimize we would use the specific one.
+This change would result in a nice speedup for the common case and a
+tiny slowdown when invoking multicast delegates.
+The above is possible because we control in corlib where different
+delegates are chained and of course we control the delegate constructor.
+
+
+