[Mono-dev] FW: Random hangs while running mono app

George, Glover E ERDC-RDE-ITL-MS CIV Glover.E.George at erdc.dren.mil
Fri Jun 3 19:42:04 UTC 2016


Hi Rodrigo, 

I didn¹t realize our previous emails were off thread.  I was able to get
the STAT CUE CARD output.  See below:

STATE CUE CARD: (? means a positive number, usually 1 or 2, * means any
number)
        0x0     - starting (GOOD, unless the thread is running managed
code)
        0x1     - running (BAD, unless it's the gc thread)
        0x2     - detached (GOOD, unless the thread is running managed
code)
        0x?03   - async suspended (GOOD)
        0x?04   - self suspended (GOOD)
        0x?05   - async suspend requested (BAD)
        0x?06   - self suspend requested (BAD)
        0x*07   - blocking (GOOD)
        0x?08   - blocking with pending suspend (GOOD)
--thread 0xa38b30 id 0x7fffec637700 [(nil)] state 105
--thread 0x9d8640 id 0x7fffedae7780 [(nil)] state 1  GC INITIATOR
WAITING for 1 threads, got 0 suspended
suspend_thread suspend took 200 ms, which is more than the allowed 200 ms
STATE CUE CARD: (? means a positive number, usually 1 or 2, * means any
number)
        0x0     - starting (GOOD, unless the thread is running managed
code)
        0x1     - running (BAD, unless it's the gc thread)
        0x2     - detached (GOOD, unless the thread is running managed
code)
        0x?03   - async suspended (GOOD)
        0x?04   - self suspended (GOOD)
        0x?05   - async suspend requested (BAD)
        0x?06   - self suspend requested (BAD)
        0x*07   - blocking (GOOD)
        0x?08   - blocking with pending suspend (GOOD)
--thread 0xa38b30 id 0x7fffec637700 [(nil)] state 105
--thread 0x9d8640 id 0x7fffedae7780 [(nil)] state 1  GC INITIATOR
WAITING for 1 threads, got 0 suspended
suspend_thread suspend took 200 ms, which is more than the allowed 200 ms
STATE CUE CARD: (? means a positive number, usually 1 or 2, * means any
number)
        0x0     - starting (GOOD, unless the thread is running managed
code)
        0x1     - running (BAD, unless it's the gc thread)
        0x2     - detached (GOOD, unless the thread is running managed
code)
        0x?03   - async suspended (GOOD)
        0x?04   - self suspended (GOOD)
        0x?05   - async suspend requested (BAD)
        0x?06   - self suspend requested (BAD)
        0x*07   - blocking (GOOD)
        0x?08   - blocking with pending suspend (GOOD)
--thread 0xa38b30 id 0x7fffec637700 [(nil)] state 105
--thread 0x9d8640 id 0x7fffedae7780 [(nil)] state 1  GC INITIATOR
WAITING for 1 threads, got 0 suspended
suspend_thread suspend took 200 ms, which is more than the allowed 200 ms
STATE CUE CARD: (? means a positive number, usually 1 or 2, * means any
number)
        0x0     - starting (GOOD, unless the thread is running managed
code)
        0x1     - running (BAD, unless it's the gc thread)
        0x2     - detached (GOOD, unless the thread is running managed
code)
        0x?03   - async suspended (GOOD)
        0x?04   - self suspended (GOOD)
        0x?05   - async suspend requested (BAD)
        0x?06   - self suspend requested (BAD)
        0x*07   - blocking (GOOD)
        0x?08   - blocking with pending suspend (GOOD)
--thread 0xa38b30 id 0x7fffec637700 [(nil)] state 105
--thread 0x9d8640 id 0x7fffedae7780 [(nil)] state 1  GC INITIATOR
WAITING for 1 threads, got 0 suspended
suspend_thread suspend took 200 ms, which is more than the allowed 200 ms
STATE CUE CARD: (? means a positive number, usually 1 or 2, * means any
number)
        0x0     - starting (GOOD, unless the thread is running managed
code)
        0x1     - running (BAD, unless it's the gc thread)
        0x2     - detached (GOOD, unless the thread is running managed
code)
        0x?03   - async suspended (GOOD)
        0x?04   - self suspended (GOOD)
        0x?05   - async suspend requested (BAD)
        0x?06   - self suspend requested (BAD)
        0x*07   - blocking (GOOD)
        0x?08   - blocking with pending suspend (GOOD)
--thread 0xa38b30 id 0x7fffec637700 [(nil)] state 105
--thread 0x9d8640 id 0x7fffedae7780 [(nil)] state 1  GC INITIATOR
WAITING for 1 threads, got 0 suspended
suspend_thread suspend took 200 ms, which is more than the allowed 200 ms
STATE CUE CARD: (? means a positive number, usually 1 or 2, * means any
number)
        0x0     - starting (GOOD, unless the thread is running managed
code)
        0x1     - running (BAD, unless it's the gc thread)
        0x2     - detached (GOOD, unless the thread is running managed
code)
        0x?03   - async suspended (GOOD)
        0x?04   - self suspended (GOOD)
        0x?05   - async suspend requested (BAD)
        0x?06   - self suspend requested (BAD)
        0x*07   - blocking (GOOD)
        0x?08   - blocking with pending suspend (GOOD)
--thread 0xa38b30 id 0x7fffec637700 [(nil)] state 105
--thread 0x9d8640 id 0x7fffedae7780 [(nil)] state 1  GC INITIATOR
WAITING for 1 threads, got 0 suspended
suspend_thread suspend took 200 ms, which is more than the allowed 200 ms
STATE CUE CARD: (? means a positive number, usually 1 or 2, * means any
number)
        0x0     - starting (GOOD, unless the thread is running managed
code)
        0x1     - running (BAD, unless it's the gc thread)
        0x2     - detached (GOOD, unless the thread is running managed
code)
        0x?03   - async suspended (GOOD)
        0x?04   - self suspended (GOOD)
        0x?05   - async suspend requested (BAD)
        0x?06   - self suspend requested (BAD)
        0x*07   - blocking (GOOD)
        0x?08   - blocking with pending suspend (GOOD)
--thread 0xa38b30 id 0x7fffec637700 [(nil)] state 105
--thread 0x9d8640 id 0x7fffedae7780 [(nil)] state 1  GC INITIATOR


‹ ‹ ‹ 
Glover E. George
Computer Scientist
Information Technology Laboratory
US Army Engineer Research and Development Center
Vicksburg, MS 39180
601-634-4730





From:  Rodrigo Kumpera <kumpera at gmail.com>
Date:  Friday, April 29, 2016 at 9:12 AM
To:  Glover George <glover.e.george at erdc.dren.mil>
Cc:  "mono-devel-list at lists.ximian.com" <mono-devel-list at lists.ximian.com>
Subject:  Re: [Mono-dev] FW: Random hangs while running mono app


This looks like a shutdown bug in mono.

Do you have a reliable way to reproduce it?
How loaded are the machines running your workload?


On Thu, Apr 28, 2016 at 6:41 PM, George, Glover E ERDC-RDE-ITL-MS CIV
<Glover.E.George at erdc.dren.mil> wrote:

One more important point I forgot to mention, when this occurs, it is
generally only a few mono instances out of the 500+ simultaneous mono
instances.  Most of them terminate.  And more importantly, the mono
instances that do hang actually complete their
 work!  It seems to be when mono is trying to close down.

This has occurred with mono 4.2.x, 4.3.x, 4.4.x, and current git.

Happened today with this configuration:
Mono JIT compiler version 4.3.2 (Stable 4.3.2.467/ba2e5e4 Tue Mar 15
11:44:06 CDT 2016)
Copyright (C) 2002-2014 Novell, Inc, Xamarin Inc and Contributors.
Blockedwww.mono-project.comBlocked
<Blockedhttp://www.mono-project.comBlocked>
TLS:           __thread
SIGSEGV:       altstack
Notifications: epoll
Architecture:  amd64
Disabled:      none
Misc:          bigarrays softdebug
LLVM:          supported, not enabled.
GC:            sgen




From: <mono-devel-list-bounces at lists.ximian.com> on behalf of Glover
George <glover.e.george at erdc.dren.mil>
Date: Thursday, April 28, 2016 at 5:25 PM
To: "mono-devel-list at lists.ximian.com" <mono-devel-list at lists.ximian.com>
Subject: [Mono-dev] Random hangs while running mono app


Hi all,

Quick background: We¹re running mono on an HPC platform (SGI ICE-X / SUSE
Enterprise Linux 11) with a Lustre filesystem.  For several months, I¹ve
noticed random hangs with our application.  In a given batch job, we run
500+ instances of mono in parallel
 (no interprocess communication).   I haven¹t been able to reliably
reproduce the issue which leads me to believe it¹s a timing issue, and
more specifically, I have reason to believe it may be an issue with the
interaction of mono and the filesystem.  Being
 a clustered file-system, the load on the backing stores can vary due to
other users.

When I notice that my job hasn¹t finished in a reasonable amount of time,
I log into the compute node and do a  ³ ps -efL | grep mono ³ which
reveals:

george  16728 16575 16728  6    3 16:13 ?        00:03:58 mono
--runtime=v4.0.30319 /app/MyConsole.exe
ggeorge  16728 16575 16768  0    3 16:13 ?        00:00:02 mono
--runtime=v4.0.30319 /app/MyConsole.exe
ggeorge  16728 16575 16815  0    3 16:13 ?        00:00:04 mono
--runtime=v4.0.30319 /app/MyConsole.exe


16728 is the top proc which spawns 16768 and 16815 (main, gc and maybe
finalizer thread?).

Attaching to each of these 3 pid¹s with gdb and doing a backtrace reveals
the traces below.  

1.  Is it possible this is a GC hang?

2.  Is it possible this is a race condition that has anything to do with
the filesystem?

3.  Any idea how this sequence could occur?  If there¹s a way to gather
more information, please let me know.

The reason for the file system suspicion is my app writes several small
files, times the number of mono instances (500+).  It only happens
randomly, but seems to be more frequent the more mono instances I run.

Any help is HIGHLY appreciated.

PID 16728
‹‹‹
(gdb) bt
#0  0x00007fffecccd324 in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007fffeccc8684 in _L_lock_1091 () from /lib64/libpthread.so.0
#2  0x00007fffeccc84f6 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x00007fffed8f6dcc in _dl_open () from /lib64/ld-linux-x86-64.so.2
#4  0x00007fffec842530 in do_dlopen () from /lib64/libc.so.6
#5  0x00007fffed8f2e86 in _dl_catch_error () from
/lib64/ld-linux-x86-64.so.2
#6  0x00007fffec8425e5 in dlerror_run () from /lib64/libc.so.6
#7  0x00007fffec8426d7 in __libc_dlopen_mode () from /lib64/libc.so.6
#8  0x00007fffec81d2e5 in init () from /lib64/libc.so.6
#9  0x00007fffecccbd03 in pthread_once () from /lib64/libpthread.so.0
#10 0x00007fffec81d43c in backtrace () from /lib64/libc.so.6
#11 0x00000000004ac025 in mono_handle_native_sigsegv (signal=<optimized
out>, ctx=<optimized out>, info=<optimized out>)
    at mini-exceptions.c:2309
#12 <signal handler called>
#13 0x00007fffec75e875 in raise () from /lib64/libc.so.6
#14 0x00007fffec75fe51 in abort () from /lib64/libc.so.6
#15 0x000000000064528a in monoeg_log_default_handler (log_domain=0x0,
log_level=G_LOG_LEVEL_ERROR, message=
    0x1ac7660 "suspend_thread suspend took 200 ms, which is more than the
allowed 200 ms", unused_data=0x0) at goutput.c:233
#16 0x0000000000645077 in monoeg_g_logv (log_domain=0x0,
log_level=G_LOG_LEVEL_ERROR, format=
    0x7015d8 "suspend_thread suspend took %d ms, which is more than the
allowed %d ms", args=0x7fffffffce58) at goutput.c:113
#17 0x000000000064512d in monoeg_g_log (log_domain=0x0,
log_level=G_LOG_LEVEL_ERROR, format=
    0x7015d8 "suspend_thread suspend took %d ms, which is more than the
allowed %d ms") at goutput.c:123
#18 0x000000000063a13f in mono_threads_wait_pending_operations () at
mono-threads.c:238
#19 0x000000000063a8cd in suspend_sync (interrupt_kernel=1,
tid=140737159329536) at mono-threads.c:877
#20 suspend_sync_nolock (interrupt_kernel=1, id=140737159329536) at
mono-threads.c:892
#21 mono_thread_info_safe_suspend_and_run (id=140737159329536,
interrupt_kernel=interrupt_kernel at entry=1, callback=callback at entry=
    0x58d5c0 <abort_thread_critical>,
user_data=user_data at entry=0x7fffffffd3e0) at mono-threads.c:935
#22 0x0000000000591a86 in abort_thread_internal
(thread=thread at entry=0x7fffec6e0230,
install_async_abort=install_async_abort at entry=1,
    can_raise_exception=1) at threads.c:4728
#23 0x0000000000591b29 in mono_thread_internal_stop
(thread=0x7fffec6e0230) at threads.c:2385
---Type <return> to continue, or q <return> to quit---
#24 0x00000000005b123e in mono_gc_cleanup () at gc.c:842
#25 0x00000000005aab8e in mono_runtime_cleanup
(domain=domain at entry=0x9d9e00) at appdomain.c:356
#26 0x0000000000426c8b in mini_cleanup (domain=0x9d9e00) at
mini-runtime.c:4017
#27 0x000000000047fac6 in mono_main (argc=11, argv=<optimized out>) at
driver.c:2115
#28 0x0000000000424c68 in mono_main_with_options (argv=0x7fffffffd698,
argc=11) at main.c:20
#29 main (argc=<optimized out>, argv=<optimized out>) at main.c:53

Thread 16768
‹‹‹
(gdb) bt
#0  0x00007fffeccca66c in pthread_cond_wait@@GLIBC_2.3.2 () from
/lib64/libpthread.so.0
#1  0x000000000060c873 in mono_os_cond_wait (mutex=0x97e640 <lock>,
cond=0x97e600 <work_cond>) at ../../mono/utils/mono-os-mutex.h:105
#2  thread_func (thread_data=0x0) at sgen-thread-pool.c:118
#3  0x00007fffeccc6806 in start_thread () from /lib64/libpthread.so.0
#4  0x00007fffec80a9bd in clone () from /lib64/libc.so.6
#5  0x0000000000000000 in ?? ()
(gdb) 


Thread 16815
‹‹‹‹
Thread #0  0x00007fffec75ec8b in sigsuspend () from /lib64/libc.so.6
#1  0x000000000063cda6 in suspend_signal_handler (_dummy=<optimized out>,
info=<optimized out>, context=0x7fffec633dc0)
    at mono-threads-posix-signals.c:209
#2  <signal handler called>
#3  0x00007fffed8faf97 in open64 () from /lib64/ld-linux-x86-64.so.2
#4  0x00007fffed8ea82d in open_verify () from /lib64/ld-linux-x86-64.so.2
#5  0x00007fffed8eade0 in open_path () from /lib64/ld-linux-x86-64.so.2
#6  0x00007fffed8ece7c in _dl_map_object () from
/lib64/ld-linux-x86-64.so.2
#7  0x00007fffed8f7400 in dl_open_worker () from
/lib64/ld-linux-x86-64.so.2
#8  0x00007fffed8f2e86 in _dl_catch_error () from
/lib64/ld-linux-x86-64.so.2
#9  0x00007fffed8f6e3b in _dl_open () from /lib64/ld-linux-x86-64.so.2
#10 0x00007fffecedcf9b in dlopen_doit () from /lib64/libdl.so.2
#11 0x00007fffed8f2e86 in _dl_catch_error () from
/lib64/ld-linux-x86-64.so.2
#12 0x00007fffecedd33c in _dlerror_run () from /lib64/libdl.so.2
#13 0x00007fffecedcf01 in dlopen@@GLIBC_2.2.5 () from /lib64/libdl.so.2
#14 0x0000000000630b79 in mono_dl_open (name=name at entry=0x1c02790
"libSystem.Data", flags=flags at entry=1, error_msg=error_msg at entry=
    0x7fffec634e80) at mono-dl.c:150
#15 0x000000000054b9f0 in cached_module_load (name=name at entry=0x1c02790
"libSystem.Data", err=err at entry=0x7fffec634e80, flags=1)
    at loader.c:1398








_______________________________________________
Mono-devel-list mailing list
Mono-devel-list at lists.ximian.com
Blockedhttp://lists.ximian.com/mailman/listinfo/mono-devel-listBlocked








More information about the Mono-devel-list mailing list