[Mono-devel-list] NPTL thread hang issue
Scott Mohekey
scott.mohekey at telogis.com
Tue Sep 14 13:10:25 EDT 2004
Scott Mohekey wrote:
> Scott Mohekey wrote:
>
>> Gonzalo Paniagua Javier wrote:
>>
>>>On Mon, 2004-09-13 at 11:54, Scott Mohekey wrote:
>>>
>>>
>>>>I've spent the last few days looking into the issue that has been
>>>>reported here (http://bugs.ximian.com/show_bug.cgi?id=60576) and here
>>>>(http://bugs.gentoo.org/show_bug.cgi?id=63734).
>>>>
>>>>This first code snippet exhibits the bug on a pure NPTL system:
>>>>
>>>>using System;
>>>>using System.Threading;
>>>>
>>>>class Test
>>>>{
>>>> public static void Main( String[] args )
>>>> {
>>>> int i = 0;
>>>> while( true ) {
>>>> Thread t = new Thread( new ThreadStart(Blah) );
>>>> t.Start();
>>>> i++;
>>>> Console.WriteLine( i+" threads" );
>>>> }
>>>> }
>>>>
>>>> private static void Blah() {
>>>> Console.WriteLine( "starting thread" );
>>>> }
>>>>}
>>>>
>>>>Running this program inside gdb reveals that each of the threads created
>>>>becomes a zombie as soon as it exits. Further investigation reveals that
>>>>none of the mono thread cleanup code is being run (breaking on
>>>>thread_cleanup() or handle_remove() gives no results) for any of the
>>>>zombie threads. After the program has run for a certain length of time,
>>>>the garbage collector is invoked, which forces a world stop of all
>>>>threads. If the garbage collector is invoked when a zombie thread is
>>>>present, mono_gc_stop_world() is called which iterates over the threads
>>>>hashtable calling gc_stop_world() on each entry, which in turn calls
>>>>SuspendThread() on the each. But because the thread cleanup code hasn't
>>>>been run (in particular, handle_remove() which removes the thread handle
>>>>from the hashtable), SuspendThread() gets called for threads which don't
>>>>exist, or are zombies. The offending code appears to be:
>>>>
>>>>while (MONO_SEM_WAIT (&thread->suspend_sem) != 0 && errno == EINTR);
>>>>
>>>>
>>>>in _wapi_timed_thread_suspend(), which indirectly calls
>>>>pthread_cond_wait() for a condition variable that is not going to be
>>>>triggered, because the other thread is a zombie.
>>>>
>>>>As far as I can tell, mono_thread_manage runs it's 'join' loop once, at
>>>>which point it waits for eternity inside WaitForMultipleObjectsEx()
>>>>which is called from wait_for_tids(). Because of this, the
>>>>thread_cleanup() call in wait_for_tids() is never reached, not for any
>>>>of the threads.
>>>>
>>>>I'm going to continue investigating this. Any help would be greatly
>>>>appreciated.
>>>>
>>>>
>>>>
>>>
>>>Is that happening with CVS HEAD and not on 1.0? I removed some code to
>>>fix this bug:
>>>http://bugzilla.ximian.com/show_bug.cgi?id=65379
>>>
>>>and that might be the cause of yours if you're using CVS HEAD.
>>>
>>>-Gonzalo
>>>
>>>
>>>_______________________________________________
>>>Mono-devel-list mailing list
>>>Mono-devel-list at lists.ximian.com
>>>http://lists.ximian.com/mailman/listinfo/mono-devel-list
>>>
>>>
>>>
>>>
>> Yes, I have been using cvs HEAD. I switched back to version 1.0.1 and
>> this problem (handle_remove() not being called) was not present
>> anymore. However, the program still freezes during a garbage
>> collection. Is it possible that a thread exits, and the garbage
>> collector is called in another thread before handle_remove() is
>> called for the exiting thread?
>>
>> It appears not to be the case if the following output is anything to
>> go by. I've built mono-1.0.1 with THREAD_DEBUG, THREAD_WAIT_DEBUG,
>> and LIBGC_DEBUG defined in metadata/threads.c. I also modified the
>> g_message calls in handle_store and handle_remove to be a bit more
>> descriptive. This the output of the thread test case run inside gdb.
>>
>>starting thread
>>** Message: : returning 0x8230700
>>** Message: : (1098329008) Done launching thread 0x8230700 (1100430256)
>>197 threads
>>** Message: : returning 0x80fae00
>>** Message: : joining thread handle 0xd7d, -1 ms
>>** Message: : (1100430256) Abort requested for 0x8230700 (1100430256)
>>** Message: : returning 0x8230700
>>** Message: : returning 0x8230700
>>** Message: : removing thread ID 1100430256
>>** Message: : returning (nil)
>>** Message: : join successful
>>** Message: : Trying to start a new thread: this (0x8230690) start (0x8241b18)
>>[Thread 1100430256 (zombie) exited]
>>[New Thread 1102531504 (LWP 1187)]
>>** Message: : Started thread ID 1102531504 (handle 0xd80)
>>** Message: : (1098329008) Launching thread 0x8230690 (1102531504)
>>** Message: : storing thread 0x8230690 ID 1102531504
>>** Message: : (1098329008) waiting for thread 0x8230690 (1102531504) to start
>>** Message: : (1102531504) Start wrapper
>>** Message: : returning 0x8230690
>>** Message: : (1102531504,985) Setting thread stack to 0x41b749ec
>>** Message: : (1102531504) Setting current_object_key to 0x8230690
>>** Message: : (1098329008) Done launching thread 0x8230690 (1102531504)
>>198 threads
>>** Message: : returning 0x80fae00
>>** Message: : joining thread handle 0xd80, -1 ms
>>** Message: : returning 0x8230690
>>starting thread
>>** Message: : returning 0x8230690
>>** Message: : (1102531504) Abort requested for 0x8230690 (1102531504)
>>** Message: : returning 0x8230690
>>** Message: : returning 0x8230690
>>** Message: : join successful
>>** Message: : Trying to start a new thread: this (0x8230620) start (0x8241a78)
>>** Message: : removing thread ID 1102531504
>>** Message: : returning (nil)
>>[Thread 1102531504 (zombie) exited]
>>[New Thread 1100430256 (LWP 1188)]
>>Initiating full world-stop collection 4 after 466576 allocd bytes
>>
>>
>>
>> Everything seems to be ok here. However, if I break on
>> mono_gc_stop_world, the breakpoint isn't hit.
>>
>> Scott.
>
> Heres another portion exhibiting slightly different behaviour:
>
>[Thread 1104632752 (zombie) exited]
>[New Thread 1102531504 (LWP 2205)]
>** Message: : Started thread ID 1102531504 (handle 0x683)
>** Message: : (1098329008) Launching thread 0x82be000 (1102531504)
>** Message: : storing thread 0x82be000 ID 1102531504
>** Message: : (1098329008) waiting for thread 0x82be000 (1102531504) to start
>** Message: : (1102531504) Start wrapper
>** Message: : returning 0x82be000
>** Message: : (1102531504,1723) Setting thread stack to 0x41b749ec
>** Message: : (1102531504) Setting current_object_key to 0x82be000
>** Message: : returning 0x82be000
>starting thread
>** Message: : returning 0x82be000
>** Message: : (1098329008) Done launching thread 0x82be000 (1102531504)
>480 threads
>** Message: : returning 0x80fae00
>** Message: : joining thread handle 0x683, -1 ms
>** Message: : (1102531504) Abort requested for 0x82be000 (1102531504)
>** Message: : returning 0x82be000
>** Message: : returning 0x82be000
>** Message: : join successful
>Initiating full world-stop collection 6 after 839344 allocd bytes
>
>Program received signal SIGINT, Interrupt.
>[Switching to Thread 1088395008 (LWP 1723)]
>0xffffe410 in ?? ()
>(gdb) info th
> 483 Thread 1102531504 (LWP 2205) 0xffffe410 in ?? ()
> 3 Thread 1098329008 (LWP 1725) 0xffffe410 in ?? ()
> 2 Thread 1096207280 (LWP 1724) 0xffffe410 in ?? ()
>* 1 Thread 1088395008 (LWP 1723) 0xffffe410 in ?? ()
>
>
> I'm not sure, but it looks like the last thread created is about to be
> removed (mono has joined with it), but handle_remove() has not been
> called, and as far as gdb is aware, the thread is still active (no
> "(zombie) exited" message).
>
> It seems that what the system sees as the list of active threads is
> different to what mono sees as the list of active threads.
>
> Scott
Further investigation reveals that the thread clean up code in the
included libgc is not being run for any threads. After GC_start_routine(
) has called the start function it's given, control is never returned to
it. A backtrace in gdb shows that the offending threads have called
__sigsetjmp and are waiting on a mutex lock (_L_mutex_loc_24()).
(gdb) bt
#0 0xffffe410 in ?? ()
#1 0x419a1554 in ?? ()
#2 0x40d50000 in ?? () from /lib/libc.so.6
#3 0x00000008 in ?? ()
#4 0x40c6c345 in sigsuspend () from /lib/libc.so.6
#5 0x40156729 in GC_suspend_handler () from /home/mono/lib/libmono.so.0
#6 <signal handler called>
#7 0xffffe410 in ?? ()
#8 0x419a18b8 in ?? ()
#9 0x00000002 in ?? ()
#10 0x00000000 in ?? ()
#11 0x40c0fece in __lll_mutex_lock_wait () from /lib/libpthread.so.0
#12 0x40c0cceb in _L_mutex_lock_24 () from /lib/libpthread.so.0
#13 0x419a18ac in ?? ()
#14 0x419a18b0 in ?? ()
#15 0x40c6b994 in __sigsetjmp () from /lib/libc.so.6
#16 0x40ed4964 in ?? ()
#17 0x419a18ac in ?? ()
#18 0x00000000 in ?? ()
#19 0x401b6850 in ?? () from /home/mono/lib/libmono.so.0
#20 0x00001000 in ?? ()
#21 0x419a1a80 in ?? ()
#22 0x419a1978 in ?? ()
#23 0x40135010 in thread_exit (exitstatus=1075549660, handle=0x0) at threads.c:173
#24 0x40135010 in thread_exit (exitstatus=0, handle=0xf19) at threads.c:173
#25 0x40136f03 in _wapi_timed_thread_exit (exitstatus=0) at timed-thread.c:63
#26 0x401371bb in timed_thread_start_routine (args=0x8163950) at timed-thread.c:135
#27 0x40155dc5 in GC_start_routine () from /home/mono/lib/libmono.so.0
#28 0x40c0b8dc in start_thread () from /lib/libpthread.so.0
#29 0x40cf567a in clone () from /lib/libc.so.6
So we have the garbage collecting thread waiting for a signal from the
stopping threads to say that its ok to begin garbage collection. And the
zombie threads are waiting on a mutex which is (I think) related to
pthread_cleanup_push?
Scott.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://lists.ximian.com/pipermail/mono-devel-list/attachments/20040914/eb1687e1/attachment.html
More information about the Mono-devel-list
mailing list