[Mono-devel-list] NPTL thread hang issue

Tue Sep 14 08:24:46 EDT 2004

Scott Mohekey wrote:

> Gonzalo Paniagua Javier wrote:
>
>>On Mon, 2004-09-13 at 11:54, Scott Mohekey wrote:
>>  
>>
>>>I've spent the last few days looking into the issue that has been 
>>>reported here (http://bugs.ximian.com/show_bug.cgi?id=60576) and here 
>>>(http://bugs.gentoo.org/show_bug.cgi?id=63734).
>>>
>>>This first code snippet exhibits the bug on a pure NPTL system:
>>>
>>>using System;
>>>using System.Threading;
>>>
>>>class Test
>>>{
>>>        public static void Main( String[] args )
>>>        {
>>>                int i = 0;
>>>                while( true ) {
>>>                        Thread t = new Thread( new ThreadStart(Blah) );
>>>                        t.Start();
>>>                        i++;
>>>                        Console.WriteLine( i+" threads" );
>>>                }
>>>        }
>>>
>>>        private static void Blah() {
>>>                Console.WriteLine( "starting thread" );
>>>        }
>>>}
>>>
>>>Running this program inside gdb reveals that each of the threads created 
>>>becomes a zombie as soon as it exits. Further investigation reveals that 
>>>none of the mono thread cleanup code is being run (breaking on 
>>>thread_cleanup() or handle_remove() gives no results) for any of the 
>>>zombie threads. After the program has run for a certain length of time, 
>>>the garbage collector is invoked, which forces a world stop of all 
>>>threads. If the garbage collector is invoked when a zombie thread is 
>>>present, mono_gc_stop_world() is called which iterates over the threads 
>>>hashtable calling gc_stop_world() on each entry, which in turn calls 
>>>SuspendThread() on the each. But because the thread cleanup code hasn't 
>>>been run (in particular, handle_remove() which removes the thread handle 
>>>from the hashtable), SuspendThread() gets called for threads which don't 
>>>exist, or are zombies. The offending code appears to be:
>>>
>>>while (MONO_SEM_WAIT (&thread->suspend_sem) != 0 && errno == EINTR);
>>>
>>>
>>>in _wapi_timed_thread_suspend(), which indirectly calls 
>>>pthread_cond_wait() for a condition variable that is not going to be 
>>>triggered, because the other thread is a zombie.
>>>
>>>As far as I can tell, mono_thread_manage runs it's 'join' loop once, at 
>>>which point it waits for eternity inside WaitForMultipleObjectsEx() 
>>>which is called from wait_for_tids(). Because of this, the 
>>>thread_cleanup() call in wait_for_tids() is never reached, not for any 
>>>of the threads.
>>>
>>>I'm going to continue investigating this. Any help would be greatly 
>>>appreciated.
>>>
>>>    
>>>
>>
>>Is that happening with CVS HEAD and not on 1.0? I removed some code to
>>fix this bug:
>>http://bugzilla.ximian.com/show_bug.cgi?id=65379
>>
>>and that might be the cause of yours if you're using CVS HEAD.
>>
>>-Gonzalo
>>
>>
>>_______________________________________________
>>Mono-devel-list mailing list
>>Mono-devel-list at lists.ximian.com
>>http://lists.ximian.com/mailman/listinfo/mono-devel-list
>>
>>
>>  
>>
> Yes, I have been using cvs HEAD. I switched back to version 1.0.1 and 
> this problem (handle_remove() not being called) was not present 
> anymore. However, the program still freezes during a garbage 
> collection. Is it possible that a thread exits, and the garbage 
> collector is called in another thread before handle_remove() is called 
> for the exiting thread?
>
> It appears not to be the case if the following output is anything to 
> go by. I've built mono-1.0.1 with THREAD_DEBUG, THREAD_WAIT_DEBUG, and 
> LIBGC_DEBUG defined in metadata/threads.c. I also modified the 
> g_message calls in handle_store and handle_remove to be a bit more 
> descriptive. This the output of the thread test case run inside gdb.
>
>starting thread
>** Message: : returning 0x8230700
>** Message: : (1098329008) Done launching thread 0x8230700 (1100430256)
>197 threads
>** Message: : returning 0x80fae00
>** Message: : joining thread handle 0xd7d, -1 ms
>** Message: : (1100430256) Abort requested for 0x8230700 (1100430256)
>** Message: : returning 0x8230700
>** Message: : returning 0x8230700
>** Message: : removing thread ID 1100430256
>** Message: : returning (nil)
>** Message: : join successful
>** Message: : Trying to start a new thread: this (0x8230690) start (0x8241b18)
>[Thread 1100430256 (zombie) exited]
>[New Thread 1102531504 (LWP 1187)]
>** Message: : Started thread ID 1102531504 (handle 0xd80)
>** Message: : (1098329008) Launching thread 0x8230690 (1102531504)
>** Message: : storing thread 0x8230690 ID 1102531504
>** Message: : (1098329008) waiting for thread 0x8230690 (1102531504) to start
>** Message: : (1102531504) Start wrapper
>** Message: : returning 0x8230690
>** Message: : (1102531504,985) Setting thread stack to 0x41b749ec
>** Message: : (1102531504) Setting current_object_key to 0x8230690
>** Message: : (1098329008) Done launching thread 0x8230690 (1102531504)
>198 threads
>** Message: : returning 0x80fae00
>** Message: : joining thread handle 0xd80, -1 ms
>** Message: : returning 0x8230690
>starting thread
>** Message: : returning 0x8230690
>** Message: : (1102531504) Abort requested for 0x8230690 (1102531504)
>** Message: : returning 0x8230690
>** Message: : returning 0x8230690
>** Message: : join successful
>** Message: : Trying to start a new thread: this (0x8230620) start (0x8241a78)
>** Message: : removing thread ID 1102531504
>** Message: : returning (nil)
>[Thread 1102531504 (zombie) exited]
>[New Thread 1100430256 (LWP 1188)]
>Initiating full world-stop collection 4 after 466576 allocd bytes
>
>  
>
> Everything seems to be ok here. However, if I break on 
> mono_gc_stop_world, the breakpoint isn't hit.
>
> Scott. 

Heres another portion exhibiting slightly different behaviour:

[Thread 1104632752 (zombie) exited]
[New Thread 1102531504 (LWP 2205)]
** Message: : Started thread ID 1102531504 (handle 0x683)
** Message: : (1098329008) Launching thread 0x82be000 (1102531504)
** Message: : storing thread 0x82be000 ID 1102531504
** Message: : (1098329008) waiting for thread 0x82be000 (1102531504) to start
** Message: : (1102531504) Start wrapper
** Message: : returning 0x82be000
** Message: : (1102531504,1723) Setting thread stack to 0x41b749ec
** Message: : (1102531504) Setting current_object_key to 0x82be000
** Message: : returning 0x82be000
starting thread
** Message: : returning 0x82be000
** Message: : (1098329008) Done launching thread 0x82be000 (1102531504)
480 threads
** Message: : returning 0x80fae00
** Message: : joining thread handle 0x683, -1 ms
** Message: : (1102531504) Abort requested for 0x82be000 (1102531504)
** Message: : returning 0x82be000
** Message: : returning 0x82be000
** Message: : join successful
Initiating full world-stop collection 6 after 839344 allocd bytes

Program received signal SIGINT, Interrupt.
[Switching to Thread 1088395008 (LWP 1723)]
0xffffe410 in ?? ()
(gdb) info th
  483 Thread 1102531504 (LWP 2205)  0xffffe410 in ?? ()
  3 Thread 1098329008 (LWP 1725)  0xffffe410 in ?? ()
  2 Thread 1096207280 (LWP 1724)  0xffffe410 in ?? ()
* 1 Thread 1088395008 (LWP 1723)  0xffffe410 in ?? ()

I'm not sure, but it looks like the last thread created is about to be 
removed (mono has joined with it), but handle_remove() has not been 
called, and as far as gdb is aware, the thread is still active (no 
"(zombie) exited" message).

It seems that what the system sees as the list of active threads is 
different to what mono sees as the list of active threads.

Scott
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://lists.ximian.com/pipermail/mono-devel-list/attachments/20040914/2330d7fd/attachment.html