[Mono-dev] Possible deadlock in sgen garbage collector

Burkhard Linke blinke at CeBiTec.Uni-Bielefeld.DE
Wed May 26 08:39:22 EDT 2010


Hi,

I stumbled over a possible deadlock in the Boehm GC some time ago. Since the 
sgen GC uses the same mechanism for stopping the world, it may also be a 
problem in that implementation.

Thread termination is signalled to the GC by means of a thread exit handler 
(Boehm) or a thread data key destructor (sgen). The function in question 
removes the thread from the internal management tables and does the necessary 
cleanup.
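
As an illustration of the data key destructor mechanism, here is a minimal
sketch in plain pthreads. It is not the sgen code: the names live_threads,
on_thread_exit and register_current_thread are hypothetical, and a simple
counter stands in for the GC's thread table.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-in for the GC's internal thread table. */
static int live_threads = 0;
static pthread_mutex_t table_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_key_t thread_key;
static pthread_once_t key_once = PTHREAD_ONCE_INIT;

/* Data key destructor: called by the pthread library while the thread exits. */
static void on_thread_exit(void *value) {
    (void)value;
    pthread_mutex_lock(&table_lock);
    live_threads--;               /* "remove the thread from the table" */
    pthread_mutex_unlock(&table_lock);
}

static void make_key(void) {
    int rc = pthread_key_create(&thread_key, on_thread_exit);
    assert(rc == 0);
    (void)rc;
}

static void register_current_thread(void) {
    pthread_once(&key_once, make_key);
    pthread_mutex_lock(&table_lock);
    live_threads++;
    pthread_mutex_unlock(&table_lock);
    /* The destructor only runs for threads whose key value is non-NULL. */
    pthread_setspecific(thread_key, &live_threads);
}

static void *worker(void *arg) {
    register_current_thread();
    return arg;                   /* returning triggers the destructor */
}

/* Runs one worker and returns how many threads are still registered
   after it has exited, or -1 if the thread could not be run. */
int registered_after_exit(void) {
    pthread_t t;
    int n;
    if (pthread_create(&t, NULL, worker, NULL) != 0) return -1;
    if (pthread_join(t, NULL) != 0) return -1;
    pthread_mutex_lock(&table_lock);
    n = live_threads;
    pthread_mutex_unlock(&table_lock);
    return n;
}
```

The point of the sketch is only that the cleanup runs on the exiting thread
itself, in whatever signal state the pthread library leaves it in.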

POSIX does not specify the state of the thread's signal mask during exit 
handlers or data key destructors. Linux pthreads afaik leaves signals enabled, 
so the signal-based suspend/restart mechanism is OK there. But Solaris/x86 
blocks signals during these handlers. From the pthread_exit(3) manpage:

     An exiting thread runs with all signals blocked. All  thread
     termination   functions,   including   cancellation  cleanup
     handlers and thread-specific data destructor functions,  are
     called with all signals blocked.
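
Whether a given platform blocks the suspend signal during destructors can be
checked with a small probe. The following is a sketch under assumptions:
SIGUSR1 is only a stand-in for whatever signal the GC actually uses, and the
function and variable names are made up for the example.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <signal.h>
#include <stddef.h>

#define SUSPEND_SIGNAL SIGUSR1    /* assumption: stand-in for the GC's signal */

static pthread_key_t probe_key;
static int blocked_in_destructor = -1;

/* Runs during thread exit and records whether the signal is blocked there. */
static void probe_destructor(void *value) {
    sigset_t current;
    (void)value;
    /* With a NULL new set, pthread_sigmask only reports the current mask. */
    if (pthread_sigmask(SIG_BLOCK, NULL, &current) == 0)
        blocked_in_destructor = sigismember(&current, SUSPEND_SIGNAL);
}

static void *probe_worker(void *arg) {
    sigset_t set;
    /* Make sure the signal is unblocked while the thread runs normally. */
    sigemptyset(&set);
    sigaddset(&set, SUSPEND_SIGNAL);
    pthread_sigmask(SIG_UNBLOCK, &set, NULL);
    /* A non-NULL value is required for the destructor to be called. */
    pthread_setspecific(probe_key, &blocked_in_destructor);
    return arg;
}

/* Returns 1 if the signal is blocked inside the destructor, 0 if it is
   not, and -1 if the mask could not be read. */
int suspend_signal_blocked_in_destructor(void) {
    pthread_t t;
    int rc;
    blocked_in_destructor = -1;
    rc = pthread_key_create(&probe_key, probe_destructor);
    assert(rc == 0);
    rc = pthread_create(&t, NULL, probe_worker, NULL);
    assert(rc == 0);
    rc = pthread_join(t, NULL);   /* the destructor has run by now */
    assert(rc == 0);
    (void)rc;
    pthread_key_delete(probe_key);
    return blocked_in_destructor;
}
```

Going by the manpage text quoted above, the probe should return 1 on
Solaris; the result on other platforms is not specified by POSIX.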

And at this point an (unlikely, but possible) race condition occurs. When 
thread A stops the world, it examines the thread table for active threads and 
sends a suspend signal to each of them. If this happens while thread B is 
terminating and executing its termination handlers, the signal will be blocked 
(and probably lost, since the manpage does not mention unblocking the signals 
again). The suspend handlers post to a semaphore that thread A is waiting on. 
The post from thread B never happens and thread A blocks forever. This error is 
not reliably reproducible, so no test case can be provided.
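
The handshake can be simulated outside the GC. In the sketch below (an
illustration, not the Boehm or sgen code) the target thread optionally blocks
the suspend signal to mimic the Solaris exit state. SIGUSR1, all names, and
the one second timeout are assumptions; the timeout exists only so that the
demo returns instead of hanging the way thread A would.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <semaphore.h>
#include <signal.h>
#include <string.h>
#include <time.h>

#define SUSPEND_SIGNAL SIGUSR1    /* assumption: stand-in for the GC's signal */

static sem_t suspend_ack, target_ready, target_release;

/* Suspend handler: acknowledges the request by posting the semaphore. */
static void suspend_handler(int sig) {
    int saved = errno;
    (void)sig;
    sem_post(&suspend_ack);       /* sem_post is async-signal-safe */
    errno = saved;
}

static void wait_sem(sem_t *s) {
    while (sem_wait(s) == -1 && errno == EINTR) { /* retry after a signal */ }
}

/* Thread B. With *arg != 0 it blocks the signal, as Solaris does on exit. */
static void *target_thread(void *arg) {
    sigset_t set;
    sigemptyset(&set);
    sigaddset(&set, SUSPEND_SIGNAL);
    pthread_sigmask(*(int *)arg ? SIG_BLOCK : SIG_UNBLOCK, &set, NULL);
    sem_post(&target_ready);
    wait_sem(&target_release);
    return NULL;
}

/* Thread A's side. Returns 1 if the acknowledgement arrived, 0 on timeout. */
int suspend_ack_received(int block_in_target) {
    struct sigaction sa, old;
    struct timespec deadline;
    pthread_t t;
    int rc;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = suspend_handler;
    sigemptyset(&sa.sa_mask);
    sigaction(SUSPEND_SIGNAL, &sa, &old);
    sem_init(&suspend_ack, 0, 0);
    sem_init(&target_ready, 0, 0);
    sem_init(&target_release, 0, 0);

    rc = pthread_create(&t, NULL, target_thread, &block_in_target);
    assert(rc == 0);
    wait_sem(&target_ready);

    pthread_kill(t, SUSPEND_SIGNAL);          /* "stop the world" request */
    clock_gettime(CLOCK_REALTIME, &deadline);
    deadline.tv_sec += 1;                     /* the real GC waits forever */
    do {
        rc = sem_timedwait(&suspend_ack, &deadline);
    } while (rc == -1 && errno == EINTR);

    sem_post(&target_release);
    pthread_join(t, NULL);
    sigaction(SUSPEND_SIGNAL, &old, NULL);
    sem_destroy(&suspend_ack);
    sem_destroy(&target_ready);
    sem_destroy(&target_release);
    return rc == 0;
}
```

Calling suspend_ack_received(0) models a normally running thread, and
suspend_ack_received(1) models a thread caught inside its termination
handlers with signals blocked.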

The patch for boehm GC requires adding another mutex for thread 
termination/garbage collection and explicitly checking for pending signals in 
the termination handler. I'll try to port this patch to sgen GC, unless 
someone else has a better solution.
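
A rough sketch of that idea follows. This is not the actual patch, only one
guess at its shape: the extra mutex (here called gc_exit_lock) serializes
"send suspend signals" against "unregister thread", and the termination
handler uses sigpending() to notice a suspend request it can no longer
receive and acknowledges it directly. SIGUSR1 and all names are assumptions.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <semaphore.h>
#include <signal.h>
#include <string.h>
#include <time.h>

#define SUSPEND_SIGNAL SIGUSR1    /* assumption: stand-in for the GC's signal */

static pthread_mutex_t gc_exit_lock = PTHREAD_MUTEX_INITIALIZER;
static sem_t gc_ack, exiting_ready, exiting_go;

static void noop_handler(int sig) { (void)sig; }

static void wait_sem(sem_t *s) {
    while (sem_wait(s) == -1 && errno == EINTR) { /* retry after a signal */ }
}

/* Termination handler, assumed to run with SUSPEND_SIGNAL blocked. */
static void termination_handler(void) {
    sigset_t pending;
    pthread_mutex_lock(&gc_exit_lock);
    if (sigpending(&pending) == 0 &&
        sigismember(&pending, SUSPEND_SIGNAL) == 1) {
        /* A suspend request was sent but cannot be delivered:
           acknowledge it on behalf of the suspend handler. */
        sem_post(&gc_ack);
    }
    /* A real handler would remove the thread from the GC tables here. */
    pthread_mutex_unlock(&gc_exit_lock);
}

static void *exiting_thread(void *arg) {
    sigset_t set;
    sigemptyset(&set);
    sigaddset(&set, SUSPEND_SIGNAL);
    pthread_sigmask(SIG_BLOCK, &set, NULL);   /* mimic the Solaris exit state */
    sem_post(&exiting_ready);
    wait_sem(&exiting_go);
    termination_handler();
    return arg;
}

/* Returns 1 if the collector side received the acknowledgement. */
int exiting_thread_acknowledged(void) {
    struct sigaction sa, old;
    struct timespec deadline;
    pthread_t t;
    int rc;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = noop_handler;
    sigemptyset(&sa.sa_mask);
    sigaction(SUSPEND_SIGNAL, &sa, &old);
    sem_init(&gc_ack, 0, 0);
    sem_init(&exiting_ready, 0, 0);
    sem_init(&exiting_go, 0, 0);

    rc = pthread_create(&t, NULL, exiting_thread, NULL);
    assert(rc == 0);
    wait_sem(&exiting_ready);

    pthread_mutex_lock(&gc_exit_lock);        /* signals are sent under the lock */
    pthread_kill(t, SUSPEND_SIGNAL);
    pthread_mutex_unlock(&gc_exit_lock);      /* released before waiting */
    sem_post(&exiting_go);

    clock_gettime(CLOCK_REALTIME, &deadline);
    deadline.tv_sec += 1;
    do {
        rc = sem_timedwait(&gc_ack, &deadline);
    } while (rc == -1 && errno == EINTR);

    pthread_join(t, NULL);
    sigaction(SUSPEND_SIGNAL, &old, NULL);
    sem_destroy(&gc_ack);
    sem_destroy(&exiting_ready);
    sem_destroy(&exiting_go);
    return rc == 0;
}
```

In this sketch the collector must drop the mutex before it waits for
acknowledgements; otherwise the exiting thread could never enter its
termination handler and the deadlock would simply move to the mutex.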

Burkhard Linke

