[Mono-devel-list] NPTL thread hang issue

Tue Sep 14 13:10:25 EDT 2004

Scott Mohekey wrote:

> Scott Mohekey wrote:
>
>> Gonzalo Paniagua Javier wrote:
>>
>>>On Mon, 2004-09-13 at 11:54, Scott Mohekey wrote:
>>>  
>>>
>>>>I've spent the last few days looking into the issue that has been 
>>>>reported here (http://bugs.ximian.com/show_bug.cgi?id=60576) and here 
>>>>(http://bugs.gentoo.org/show_bug.cgi?id=63734).
>>>>
>>>>This first code snippet exhibits the bug on a pure NPTL system:
>>>>
>>>>using System;
>>>>using System.Threading;
>>>>
>>>>class Test
>>>>{
>>>>        public static void Main( String[] args )
>>>>        {
>>>>                int i = 0;
>>>>                while( true ) {
>>>>                        Thread t = new Thread( new ThreadStart(Blah) );
>>>>                        t.Start();
>>>>                        i++;
>>>>                        Console.WriteLine( i+" threads" );
>>>>                }
>>>>        }
>>>>
>>>>        private static void Blah() {
>>>>                Console.WriteLine( "starting thread" );
>>>>        }
>>>>}
>>>>
>>>>Running this program inside gdb reveals that each of the threads created 
>>>>becomes a zombie as soon as it exits. Further investigation reveals that 
>>>>none of the mono thread cleanup code is being run (breaking on 
>>>>thread_cleanup() or handle_remove() gives no results) for any of the 
>>>>zombie threads. After the program has run for a certain length of time, 
>>>>the garbage collector is invoked, which forces a world stop of all 
>>>>threads. If the garbage collector is invoked when a zombie thread is 
>>>>present, mono_gc_stop_world() is called which iterates over the threads 
>>>>hashtable calling gc_stop_world() on each entry, which in turn calls 
>>>>SuspendThread() on the each. But because the thread cleanup code hasn't 
>>>>been run (in particular, handle_remove() which removes the thread handle 
>>>>from the hashtable), SuspendThread() gets called for threads which don't 
>>>>exist, or are zombies. The offending code appears to be:
>>>>
>>>>while (MONO_SEM_WAIT (&thread->suspend_sem) != 0 && errno == EINTR);
>>>>
>>>>
>>>>in _wapi_timed_thread_suspend(), which indirectly calls 
>>>>pthread_cond_wait() for a condition variable that is not going to be 
>>>>triggered, because the other thread is a zombie.
>>>>
>>>>As far as I can tell, mono_thread_manage runs it's 'join' loop once, at 
>>>>which point it waits for eternity inside WaitForMultipleObjectsEx() 
>>>>which is called from wait_for_tids(). Because of this, the 
>>>>thread_cleanup() call in wait_for_tids() is never reached, not for any 
>>>>of the threads.
>>>>
>>>>I'm going to continue investigating this. Any help would be greatly 
>>>>appreciated.
>>>>
>>>>    
>>>>
>>>
>>>Is that happening with CVS HEAD and not on 1.0? I removed some code to
>>>fix this bug:
>>>http://bugzilla.ximian.com/show_bug.cgi?id=65379
>>>
>>>and that might be the cause of yours if you're using CVS HEAD.
>>>
>>>-Gonzalo
>>>
>>>
>>>_______________________________________________
>>>Mono-devel-list mailing list
>>>Mono-devel-list at lists.ximian.com
>>>http://lists.ximian.com/mailman/listinfo/mono-devel-list
>>>
>>>
>>>  
>>>
>> Yes, I have been using cvs HEAD. I switched back to version 1.0.1 and 
>> this problem (handle_remove() not being called) was not present 
>> anymore. However, the program still freezes during a garbage 
>> collection. Is it possible that a thread exits, and the garbage 
>> collector is called in another thread before handle_remove() is 
>> called for the exiting thread?
>>
>> It appears not to be the case if the following output is anything to 
>> go by. I've built mono-1.0.1 with THREAD_DEBUG, THREAD_WAIT_DEBUG, 
>> and LIBGC_DEBUG defined in metadata/threads.c. I also modified the 
>> g_message calls in handle_store and handle_remove to be a bit more 
>> descriptive. This the output of the thread test case run inside gdb.
>>
>>starting thread
>>** Message: : returning 0x8230700
>>** Message: : (1098329008) Done launching thread 0x8230700 (1100430256)
>>197 threads
>>** Message: : returning 0x80fae00
>>** Message: : joining thread handle 0xd7d, -1 ms
>>** Message: : (1100430256) Abort requested for 0x8230700 (1100430256)
>>** Message: : returning 0x8230700
>>** Message: : returning 0x8230700
>>** Message: : removing thread ID 1100430256
>>** Message: : returning (nil)
>>** Message: : join successful
>>** Message: : Trying to start a new thread: this (0x8230690) start (0x8241b18)
>>[Thread 1100430256 (zombie) exited]
>>[New Thread 1102531504 (LWP 1187)]
>>** Message: : Started thread ID 1102531504 (handle 0xd80)
>>** Message: : (1098329008) Launching thread 0x8230690 (1102531504)
>>** Message: : storing thread 0x8230690 ID 1102531504
>>** Message: : (1098329008) waiting for thread 0x8230690 (1102531504) to start
>>** Message: : (1102531504) Start wrapper
>>** Message: : returning 0x8230690
>>** Message: : (1102531504,985) Setting thread stack to 0x41b749ec
>>** Message: : (1102531504) Setting current_object_key to 0x8230690
>>** Message: : (1098329008) Done launching thread 0x8230690 (1102531504)
>>198 threads
>>** Message: : returning 0x80fae00
>>** Message: : joining thread handle 0xd80, -1 ms
>>** Message: : returning 0x8230690
>>starting thread
>>** Message: : returning 0x8230690
>>** Message: : (1102531504) Abort requested for 0x8230690 (1102531504)
>>** Message: : returning 0x8230690
>>** Message: : returning 0x8230690
>>** Message: : join successful
>>** Message: : Trying to start a new thread: this (0x8230620) start (0x8241a78)
>>** Message: : removing thread ID 1102531504
>>** Message: : returning (nil)
>>[Thread 1102531504 (zombie) exited]
>>[New Thread 1100430256 (LWP 1188)]
>>Initiating full world-stop collection 4 after 466576 allocd bytes
>>
>>  
>>
>> Everything seems to be ok here. However, if I break on 
>> mono_gc_stop_world, the breakpoint isn't hit.
>>
>> Scott. 
>
> Heres another portion exhibiting slightly different behaviour:
>
>[Thread 1104632752 (zombie) exited]
>[New Thread 1102531504 (LWP 2205)]
>** Message: : Started thread ID 1102531504 (handle 0x683)
>** Message: : (1098329008) Launching thread 0x82be000 (1102531504)
>** Message: : storing thread 0x82be000 ID 1102531504
>** Message: : (1098329008) waiting for thread 0x82be000 (1102531504) to start
>** Message: : (1102531504) Start wrapper
>** Message: : returning 0x82be000
>** Message: : (1102531504,1723) Setting thread stack to 0x41b749ec
>** Message: : (1102531504) Setting current_object_key to 0x82be000
>** Message: : returning 0x82be000
>starting thread
>** Message: : returning 0x82be000
>** Message: : (1098329008) Done launching thread 0x82be000 (1102531504)
>480 threads
>** Message: : returning 0x80fae00
>** Message: : joining thread handle 0x683, -1 ms
>** Message: : (1102531504) Abort requested for 0x82be000 (1102531504)
>** Message: : returning 0x82be000
>** Message: : returning 0x82be000
>** Message: : join successful
>Initiating full world-stop collection 6 after 839344 allocd bytes
>
>Program received signal SIGINT, Interrupt.
>[Switching to Thread 1088395008 (LWP 1723)]
>0xffffe410 in ?? ()
>(gdb) info th
>  483 Thread 1102531504 (LWP 2205)  0xffffe410 in ?? ()
>  3 Thread 1098329008 (LWP 1725)  0xffffe410 in ?? ()
>  2 Thread 1096207280 (LWP 1724)  0xffffe410 in ?? ()
>* 1 Thread 1088395008 (LWP 1723)  0xffffe410 in ?? ()
>  
>
> I'm not sure, but it looks like the last thread created is about to be 
> removed (mono has joined with it), but handle_remove() has not been 
> called, and as far as gdb is aware, the thread is still active (no 
> "(zombie) exited" message).
>
> It seems that what the system sees as the list of active threads is 
> different to what mono sees as the list of active threads.
>
> Scott

Further investigation reveals that the thread clean up code in the 
included libgc is not being run for any threads. After GC_start_routine( 
) has called the start function it's given, control is never returned to 
it. A backtrace in gdb shows that the offending threads have called 
__sigsetjmp and are waiting on a mutex lock (_L_mutex_loc_24()).

(gdb) bt
#0  0xffffe410 in ?? ()
#1  0x419a1554 in ?? ()
#2  0x40d50000 in ?? () from /lib/libc.so.6
#3  0x00000008 in ?? ()
#4  0x40c6c345 in sigsuspend () from /lib/libc.so.6
#5  0x40156729 in GC_suspend_handler () from /home/mono/lib/libmono.so.0
#6  <signal handler called>
#7  0xffffe410 in ?? ()
#8  0x419a18b8 in ?? ()
#9  0x00000002 in ?? ()
#10 0x00000000 in ?? ()
#11 0x40c0fece in __lll_mutex_lock_wait () from /lib/libpthread.so.0
#12 0x40c0cceb in _L_mutex_lock_24 () from /lib/libpthread.so.0
#13 0x419a18ac in ?? ()
#14 0x419a18b0 in ?? ()
#15 0x40c6b994 in __sigsetjmp () from /lib/libc.so.6
#16 0x40ed4964 in ?? ()
#17 0x419a18ac in ?? ()
#18 0x00000000 in ?? ()
#19 0x401b6850 in ?? () from /home/mono/lib/libmono.so.0
#20 0x00001000 in ?? ()
#21 0x419a1a80 in ?? ()
#22 0x419a1978 in ?? ()
#23 0x40135010 in thread_exit (exitstatus=1075549660, handle=0x0) at threads.c:173
#24 0x40135010 in thread_exit (exitstatus=0, handle=0xf19) at threads.c:173
#25 0x40136f03 in _wapi_timed_thread_exit (exitstatus=0) at timed-thread.c:63
#26 0x401371bb in timed_thread_start_routine (args=0x8163950) at timed-thread.c:135
#27 0x40155dc5 in GC_start_routine () from /home/mono/lib/libmono.so.0
#28 0x40c0b8dc in start_thread () from /lib/libpthread.so.0
#29 0x40cf567a in clone () from /lib/libc.so.6

So we have the garbage collecting thread waiting for a signal from the 
stopping threads to say that its ok to begin garbage collection. And the 
zombie threads are waiting on a mutex which is (I think) related to 
pthread_cleanup_push?

Scott.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://lists.ximian.com/pipermail/mono-devel-list/attachments/20040914/eb1687e1/attachment.html