[Mono-devel-list] NPTL thread hang issue

Scott Mohekey scott.mohekey at telogis.com
Thu Sep 16 12:22:07 EDT 2004


Scott Mohekey wrote:

> Scott Mohekey wrote:
>
>> Scott Mohekey wrote:
>>
>>> Gonzalo Paniagua Javier wrote:
>>>
>>>>On Mon, 2004-09-13 at 11:54, Scott Mohekey wrote:
>>>>  
>>>>
>>>>>I've spent the last few days looking into the issue that has been 
>>>>>reported here (http://bugs.ximian.com/show_bug.cgi?id=60576) and here 
>>>>>(http://bugs.gentoo.org/show_bug.cgi?id=63734).
>>>>>
>>>>>This first code snippet exhibits the bug on a pure NPTL system:
>>>>>
>>>>>using System;
>>>>>using System.Threading;
>>>>>
>>>>>class Test
>>>>>{
>>>>>        public static void Main( String[] args )
>>>>>        {
>>>>>                int i = 0;
>>>>>                while( true ) {
>>>>>                        Thread t = new Thread( new ThreadStart(Blah) );
>>>>>                        t.Start();
>>>>>                        i++;
>>>>>                        Console.WriteLine( i+" threads" );
>>>>>                }
>>>>>        }
>>>>>
>>>>>        private static void Blah() {
>>>>>                Console.WriteLine( "starting thread" );
>>>>>        }
>>>>>}
>>>>>
>>>>>Running this program inside gdb reveals that each of the threads created 
>>>>>becomes a zombie as soon as it exits. Further investigation reveals that 
>>>>>none of the mono thread cleanup code is being run (breaking on 
>>>>>thread_cleanup() or handle_remove() gives no results) for any of the 
>>>>>zombie threads. After the program has run for a certain length of time, 
>>>>>the garbage collector is invoked, which forces a world stop of all 
>>>>>threads. If the garbage collector is invoked when a zombie thread is 
>>>>>present, mono_gc_stop_world() is called which iterates over the threads 
>>>>>hashtable calling gc_stop_world() on each entry, which in turn calls 
>>>>>SuspendThread() on the each. But because the thread cleanup code hasn't 
>>>>>been run (in particular, handle_remove() which removes the thread handle 
>>>>>from the hashtable), SuspendThread() gets called for threads which don't 
>>>>>exist, or are zombies. The offending code appears to be:
>>>>>
>>>>>while (MONO_SEM_WAIT (&thread->suspend_sem) != 0 && errno == EINTR);
>>>>>
>>>>>
>>>>>in _wapi_timed_thread_suspend(), which indirectly calls 
>>>>>pthread_cond_wait() for a condition variable that is not going to be 
>>>>>triggered, because the other thread is a zombie.
>>>>>
>>>>>As far as I can tell, mono_thread_manage runs it's 'join' loop once, at 
>>>>>which point it waits for eternity inside WaitForMultipleObjectsEx() 
>>>>>which is called from wait_for_tids(). Because of this, the 
>>>>>thread_cleanup() call in wait_for_tids() is never reached, not for any 
>>>>>of the threads.
>>>>>
>>>>>I'm going to continue investigating this. Any help would be greatly 
>>>>>appreciated.
>>>>>
>>>>>    
>>>>>
>>>>
>>>>Is that happening with CVS HEAD and not on 1.0? I removed some code to
>>>>fix this bug:
>>>>http://bugzilla.ximian.com/show_bug.cgi?id=65379
>>>>
>>>>and that might be the cause of yours if you're using CVS HEAD.
>>>>
>>>>-Gonzalo
>>>>
>>>>
>>>>_______________________________________________
>>>>Mono-devel-list mailing list
>>>>Mono-devel-list at lists.ximian.com
>>>>http://lists.ximian.com/mailman/listinfo/mono-devel-list
>>>>
>>>>
>>>>  
>>>>
>>> Yes, I have been using cvs HEAD. I switched back to version 1.0.1 
>>> and this problem (handle_remove() not being called) was not present 
>>> anymore. However, the program still freezes during a garbage 
>>> collection. Is it possible that a thread exits, and the garbage 
>>> collector is called in another thread before handle_remove() is 
>>> called for the exiting thread?
>>>
>>> It appears not to be the case if the following output is anything to 
>>> go by. I've built mono-1.0.1 with THREAD_DEBUG, THREAD_WAIT_DEBUG, 
>>> and LIBGC_DEBUG defined in metadata/threads.c. I also modified the 
>>> g_message calls in handle_store and handle_remove to be a bit more 
>>> descriptive. This the output of the thread test case run inside gdb.
>>>
>>>starting thread
>>>** Message: : returning 0x8230700
>>>** Message: : (1098329008) Done launching thread 0x8230700 (1100430256)
>>>197 threads
>>>** Message: : returning 0x80fae00
>>>** Message: : joining thread handle 0xd7d, -1 ms
>>>** Message: : (1100430256) Abort requested for 0x8230700 (1100430256)
>>>** Message: : returning 0x8230700
>>>** Message: : returning 0x8230700
>>>** Message: : removing thread ID 1100430256
>>>** Message: : returning (nil)
>>>** Message: : join successful
>>>** Message: : Trying to start a new thread: this (0x8230690) start (0x8241b18)
>>>[Thread 1100430256 (zombie) exited]
>>>[New Thread 1102531504 (LWP 1187)]
>>>** Message: : Started thread ID 1102531504 (handle 0xd80)
>>>** Message: : (1098329008) Launching thread 0x8230690 (1102531504)
>>>** Message: : storing thread 0x8230690 ID 1102531504
>>>** Message: : (1098329008) waiting for thread 0x8230690 (1102531504) to start
>>>** Message: : (1102531504) Start wrapper
>>>** Message: : returning 0x8230690
>>>** Message: : (1102531504,985) Setting thread stack to 0x41b749ec
>>>** Message: : (1102531504) Setting current_object_key to 0x8230690
>>>** Message: : (1098329008) Done launching thread 0x8230690 (1102531504)
>>>198 threads
>>>** Message: : returning 0x80fae00
>>>** Message: : joining thread handle 0xd80, -1 ms
>>>** Message: : returning 0x8230690
>>>starting thread
>>>** Message: : returning 0x8230690
>>>** Message: : (1102531504) Abort requested for 0x8230690 (1102531504)
>>>** Message: : returning 0x8230690
>>>** Message: : returning 0x8230690
>>>** Message: : join successful
>>>** Message: : Trying to start a new thread: this (0x8230620) start (0x8241a78)
>>>** Message: : removing thread ID 1102531504
>>>** Message: : returning (nil)
>>>[Thread 1102531504 (zombie) exited]
>>>[New Thread 1100430256 (LWP 1188)]
>>>Initiating full world-stop collection 4 after 466576 allocd bytes
>>>
>>>  
>>>
>>> Everything seems to be ok here. However, if I break on 
>>> mono_gc_stop_world, the breakpoint isn't hit.
>>>
>>> Scott. 
>>
>> Heres another portion exhibiting slightly different behaviour:
>>
>>[Thread 1104632752 (zombie) exited]
>>[New Thread 1102531504 (LWP 2205)]
>>** Message: : Started thread ID 1102531504 (handle 0x683)
>>** Message: : (1098329008) Launching thread 0x82be000 (1102531504)
>>** Message: : storing thread 0x82be000 ID 1102531504
>>** Message: : (1098329008) waiting for thread 0x82be000 (1102531504) to start
>>** Message: : (1102531504) Start wrapper
>>** Message: : returning 0x82be000
>>** Message: : (1102531504,1723) Setting thread stack to 0x41b749ec
>>** Message: : (1102531504) Setting current_object_key to 0x82be000
>>** Message: : returning 0x82be000
>>starting thread
>>** Message: : returning 0x82be000
>>** Message: : (1098329008) Done launching thread 0x82be000 (1102531504)
>>480 threads
>>** Message: : returning 0x80fae00
>>** Message: : joining thread handle 0x683, -1 ms
>>** Message: : (1102531504) Abort requested for 0x82be000 (1102531504)
>>** Message: : returning 0x82be000
>>** Message: : returning 0x82be000
>>** Message: : join successful
>>Initiating full world-stop collection 6 after 839344 allocd bytes
>>
>>Program received signal SIGINT, Interrupt.
>>[Switching to Thread 1088395008 (LWP 1723)]
>>0xffffe410 in ?? ()
>>(gdb) info th
>>  483 Thread 1102531504 (LWP 2205)  0xffffe410 in ?? ()
>>  3 Thread 1098329008 (LWP 1725)  0xffffe410 in ?? ()
>>  2 Thread 1096207280 (LWP 1724)  0xffffe410 in ?? ()
>>* 1 Thread 1088395008 (LWP 1723)  0xffffe410 in ?? ()
>>  
>>
>> I'm not sure, but it looks like the last thread created is about to 
>> be removed (mono has joined with it), but handle_remove() has not 
>> been called, and as far as gdb is aware, the thread is still active 
>> (no "(zombie) exited" message).
>>
>> It seems that what the system sees as the list of active threads is 
>> different to what mono sees as the list of active threads.
>>
>> Scott
>
> Further investigation reveals that the thread clean up code in the 
> included libgc is not being run for any threads. After 
> GC_start_routine( ) has called the start function it's given, control 
> is never returned to it. A backtrace in gdb shows that the offending 
> threads have called __sigsetjmp and are waiting on a mutex lock 
> (_L_mutex_loc_24()).
>
>(gdb) bt
>#0  0xffffe410 in ?? ()
>#1  0x419a1554 in ?? ()
>#2  0x40d50000 in ?? () from /lib/libc.so.6
>#3  0x00000008 in ?? ()
>#4  0x40c6c345 in sigsuspend () from /lib/libc.so.6
>#5  0x40156729 in GC_suspend_handler () from /home/mono/lib/libmono.so.0
>#6  <signal handler called>
>#7  0xffffe410 in ?? ()
>#8  0x419a18b8 in ?? ()
>#9  0x00000002 in ?? ()
>#10 0x00000000 in ?? ()
>#11 0x40c0fece in __lll_mutex_lock_wait () from /lib/libpthread.so.0
>#12 0x40c0cceb in _L_mutex_lock_24 () from /lib/libpthread.so.0
>#13 0x419a18ac in ?? ()
>#14 0x419a18b0 in ?? ()
>#15 0x40c6b994 in __sigsetjmp () from /lib/libc.so.6
>#16 0x40ed4964 in ?? ()
>#17 0x419a18ac in ?? ()
>#18 0x00000000 in ?? ()
>#19 0x401b6850 in ?? () from /home/mono/lib/libmono.so.0
>#20 0x00001000 in ?? ()
>#21 0x419a1a80 in ?? ()
>#22 0x419a1978 in ?? ()
>#23 0x40135010 in thread_exit (exitstatus=1075549660, handle=0x0) at threads.c:173
>#24 0x40135010 in thread_exit (exitstatus=0, handle=0xf19) at threads.c:173
>#25 0x40136f03 in _wapi_timed_thread_exit (exitstatus=0) at timed-thread.c:63
>#26 0x401371bb in timed_thread_start_routine (args=0x8163950) at timed-thread.c:135
>#27 0x40155dc5 in GC_start_routine () from /home/mono/lib/libmono.so.0
>#28 0x40c0b8dc in start_thread () from /lib/libpthread.so.0
>#29 0x40cf567a in clone () from /lib/libc.so.6
>  
>
> So we have the garbage collecting thread waiting for a signal from the 
> stopping threads to say that its ok to begin garbage collection. And 
> the zombie threads are waiting on a mutex which is (I think) related 
> to pthread_cleanup_push?
>
> Scott.

>From 
http://techpubs.sgi.com/library/tpl/cgi-bin/getdoc.cgi?coll=linux&db=bks&srch=&fname=/SGI_Admin/Porting_Guide/sgi_html/ch07.html 
:

The Native Posix Thread Library (described in 
http://people.redhat.com/drepper/nptl-design.pdf ) provides performance 
improvements and increased scalability and it aims to overcome most of 
the deficiencies of Linux Threads while remaining as compatible as 
possible to the Linux Thread API. It is also a 1:1 (rather than M:N) 
threading model, but it corrects many of the issues with signal handling 
in Linux Threads and is thus much more standard conformant. Applications 
that rely on behavior where the LinuxThreads implementation deviates 
from the POSIX standard will need to be fixed.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://lists.ximian.com/pipermail/mono-devel-list/attachments/20040916/2b5258b1/attachment.html 


More information about the Mono-devel-list mailing list