[Mono-devel-list] Two IO-Layer performance ideas

Thu Oct 14 23:06:20 EDT 2004

Hey guys,

I just wanted to put out two io-layer performance ideas for comment.

Miguel was saying that io-layer is a bit slow for beagle. Also, in
tests for Monitors, I have noticed that the performance of semaphores is
as much as 10x worse than native pthreads code.

I think much of the overhead of io-layer is coming from two areas:

1) Excess locks for getting segment data.

Today in io-layer we have code like:

        static inline struct _WapiHandleShared_list *_wapi_handle_get_shared_segment (guint32 segment)
        {
        	struct _WapiHandleShared_list *shared;
        	int thr_ret;

        	pthread_cleanup_push ((void(*)(void *))pthread_mutex_unlock,
        			      (void *)&_wapi_shared_mutex);
        	thr_ret = pthread_mutex_lock (&_wapi_shared_mutex);
        	g_assert (thr_ret == 0);

        	shared=_wapi_shared_data[segment];

        	thr_ret = pthread_mutex_unlock (&_wapi_shared_mutex);
        	g_assert (thr_ret == 0);
        	pthread_cleanup_pop (0);

        	return(shared);
        }

This code is called basically every time we use a handle. We must lock
because _wapi_shared_data can be dynamically expanded, so the array
itself may move.

I think we can do a clever trick here. We can say:

        #define NUM_FASTPATH_SEGMENTS /* some number */

        struct _WapiHandleShared_list * _wapi_fast_shared_data [NUM_FASTPATH_SEGMENTS];

If `segment < NUM_FASTPATH_SEGMENTS' we can just index into
_wapi_fast_shared_data. Because that array is statically allocated, we
would not have to lock it. We can assume that in most cases, there will
not be more than a few segments (we need real life data for the number).

This should cut a little overhead from all the io type functions.
Really, the benefit of this fast path is for SMP boxes with multiple
threads running. We reduce the risk that a thread will need to block on
the mutex.
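
A rough sketch of the fast path described above, not a patch against
io-layer. The struct, the function names, the overflow table and the
value of NUM_FASTPATH_SEGMENTS are all hypothetical stand-ins, and C11
atomics are used here only as an assumption about how a slot could be
published safely without the mutex:

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical value; the real number needs real-life data. */
#define NUM_FASTPATH_SEGMENTS 4

/* Stand-in for struct _WapiHandleShared_list. */
struct shared_list { int placeholder; };

/* Fixed-size table: it never moves, so readers need no mutex. */
static _Atomic(struct shared_list *) fast_shared_data[NUM_FASTPATH_SEGMENTS];

/* Overflow table: may be reallocated, so it stays under the mutex. */
static struct shared_list **slow_shared_data;
static size_t slow_len;
static pthread_mutex_t shared_mutex = PTHREAD_MUTEX_INITIALIZER;

/* Publish a segment pointer into the fast table. */
static void set_fast_segment(uint32_t segment, struct shared_list *shared)
{
	assert(segment < NUM_FASTPATH_SEGMENTS);
	atomic_store_explicit(&fast_shared_data[segment], shared,
			      memory_order_release);
}

static struct shared_list *get_shared_segment(uint32_t segment)
{
	struct shared_list *shared = NULL;
	int thr_ret;

	/* Fast path: no lock for the first few segments. */
	if (segment < NUM_FASTPATH_SEGMENTS) {
		return atomic_load_explicit(&fast_shared_data[segment],
					    memory_order_acquire);
	}

	/* Slow path: same locking as today for everything else. */
	thr_ret = pthread_mutex_lock(&shared_mutex);
	assert(thr_ret == 0);
	if (segment - NUM_FASTPATH_SEGMENTS < slow_len)
		shared = slow_shared_data[segment - NUM_FASTPATH_SEGMENTS];
	thr_ret = pthread_mutex_unlock(&shared_mutex);
	assert(thr_ret == 0);

	return shared;
}
```

Only segments past the fixed table pay for the mutex, so the common
case becomes a bounds check and one load.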

2) Make the _wapi_handle_[un]ref functions in-process

Today, to ref or unref a handle, we must do IPC. We should be using
Interlocked-type functions to do the refcounting. Only when the refcount
goes to 0 do we need to do the IPC.

The only problem I saw here is that the daemon does:

        static void ref_handle (ChannelData *channel_data, guint32 handle)
        {
        	guint32 segment, idx;

        	if(handle==0) {
        		return;
        	}

        	_wapi_handle_segment (GUINT_TO_POINTER (handle), &segment, &idx);

        	_wapi_shared_data[segment]->handles[idx].ref++;
        	channel_data->open_handles[handle]++;

The _wapi_shared_data refcount is easy to do with interlocked
operations, because that data lives in shared memory. However, the
channel data does not. I am not sure what we want to do about this.
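
A minimal sketch of the in-process refcounting idea, covering only the
shared-memory count. C11 atomics stand in for the Interlocked functions,
and the struct and function names are hypothetical. It assumes the
counter type is lock-free so it behaves across processes, and it leaves
the per-channel open_handles bookkeeping unaddressed:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Stand-in for one entry of _wapi_shared_data[segment]->handles[]. */
struct handle_slot {
	atomic_uint ref;
};

/* Take a reference without talking to the daemon. */
static void handle_ref(struct handle_slot *h)
{
	atomic_fetch_add(&h->ref, 1);
}

/*
 * Drop a reference. Returns true when the caller released the last
 * one, which is the only case where IPC would still be needed.
 */
static bool handle_unref(struct handle_slot *h)
{
	unsigned int prev = atomic_fetch_sub(&h->ref, 1);

	assert(prev != 0); /* catch an unref without a matching ref */
	return prev == 1;
}
```

A caller would do the IPC only when handle_unref() returns true.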

I know that we are going to rewrite things to remove the daemon.
However, this function accounts for much of the overhead of the daemon
at runtime. Also, rewriting this function should be a step in the right
direction.

I think I know how to do 1. I will try cooking a patch over the weekend.
However, I need some more input on 2.

-- 
Ben Maurer <bmaurer at ximian.com>