[Mono-dev] Random hangs while running mono app

George, Glover E ERDC-RDE-ITL-MS CIV Glover.E.George at erdc.dren.mil
Thu Apr 28 22:25:56 UTC 2016


Hi all,

Quick background: We’re running mono on an HPC platform (SGI ICE-X / SUSE Enterprise Linux 11) with a Lustre filesystem.  For several months, I’ve noticed random hangs with our application.  In a given batch job, we run 500+ instances of mono in parallel (no interprocess communication).   I haven’t been able to reliably reproduce the issue which leads me to believe it’s a timing issue, and more specifically, I have reason to believe it may be an issue with the interaction of mono and the filesystem.  Being a clustered file-system, the load on the backing stores can vary due to other users.

When I notice that my job hasn’t finished in a reasonable amount of time,  I log into the compute node and do a  “ ps -efL | grep mono “ which reveals:

george  16728 16575 16728  6    3 16:13 ?        00:03:58 mono --runtime=v4.0.30319 /app/MyConsole.exe
ggeorge  16728 16575 16768  0    3 16:13 ?        00:00:02 mono --runtime=v4.0.30319 /app/MyConsole.exe
ggeorge  16728 16575 16815  0    3 16:13 ?        00:00:04 mono --runtime=v4.0.30319 /app/MyConsole.exe

16728 is the top proc which spawns 16768 and 16815 (main, gc and maybe finalizer thread?).

Attaching to each of these 3 pid’s with gdb and doing a backtrace reveals the traces below.

1.  Is it possible this is a GC hang?

2.  Is it possible this is a race condition that has anything to do with the filesystem?

3.  Any idea how this sequence could occur?  If there’s a way to gather more information, please let me know.

The reason for the file system suspicion is my app writes several small files, times the number of mono instances (500+).  It only happens randomly, but seems to be more frequent the more mono instances I run.

Any help is HIGHLY appreciated.

PID 16728
———
(gdb) bt
#0  0x00007fffecccd324 in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007fffeccc8684 in _L_lock_1091 () from /lib64/libpthread.so.0
#2  0x00007fffeccc84f6 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x00007fffed8f6dcc in _dl_open () from /lib64/ld-linux-x86-64.so.2
#4  0x00007fffec842530 in do_dlopen () from /lib64/libc.so.6
#5  0x00007fffed8f2e86 in _dl_catch_error () from /lib64/ld-linux-x86-64.so.2
#6  0x00007fffec8425e5 in dlerror_run () from /lib64/libc.so.6
#7  0x00007fffec8426d7 in __libc_dlopen_mode () from /lib64/libc.so.6
#8  0x00007fffec81d2e5 in init () from /lib64/libc.so.6
#9  0x00007fffecccbd03 in pthread_once () from /lib64/libpthread.so.0
#10 0x00007fffec81d43c in backtrace () from /lib64/libc.so.6
#11 0x00000000004ac025 in mono_handle_native_sigsegv (signal=<optimized out>, ctx=<optimized out>, info=<optimized out>)
    at mini-exceptions.c:2309
#12 <signal handler called>
#13 0x00007fffec75e875 in raise () from /lib64/libc.so.6
#14 0x00007fffec75fe51 in abort () from /lib64/libc.so.6
#15 0x000000000064528a in monoeg_log_default_handler (log_domain=0x0, log_level=G_LOG_LEVEL_ERROR, message=
    0x1ac7660 "suspend_thread suspend took 200 ms, which is more than the allowed 200 ms", unused_data=0x0) at goutput.c:233
#16 0x0000000000645077 in monoeg_g_logv (log_domain=0x0, log_level=G_LOG_LEVEL_ERROR, format=
    0x7015d8 "suspend_thread suspend took %d ms, which is more than the allowed %d ms", args=0x7fffffffce58) at goutput.c:113
#17 0x000000000064512d in monoeg_g_log (log_domain=0x0, log_level=G_LOG_LEVEL_ERROR, format=
    0x7015d8 "suspend_thread suspend took %d ms, which is more than the allowed %d ms") at goutput.c:123
#18 0x000000000063a13f in mono_threads_wait_pending_operations () at mono-threads.c:238
#19 0x000000000063a8cd in suspend_sync (interrupt_kernel=1, tid=140737159329536) at mono-threads.c:877
#20 suspend_sync_nolock (interrupt_kernel=1, id=140737159329536) at mono-threads.c:892
#21 mono_thread_info_safe_suspend_and_run (id=140737159329536, interrupt_kernel=interrupt_kernel at entry=1, callback=callback at entry=
    0x58d5c0 <abort_thread_critical>, user_data=user_data at entry=0x7fffffffd3e0) at mono-threads.c:935
#22 0x0000000000591a86 in abort_thread_internal (thread=thread at entry=0x7fffec6e0230, install_async_abort=install_async_abort at entry=1,
    can_raise_exception=1) at threads.c:4728
#23 0x0000000000591b29 in mono_thread_internal_stop (thread=0x7fffec6e0230) at threads.c:2385
---Type <return> to continue, or q <return> to quit---
#24 0x00000000005b123e in mono_gc_cleanup () at gc.c:842
#25 0x00000000005aab8e in mono_runtime_cleanup (domain=domain at entry=0x9d9e00) at appdomain.c:356
#26 0x0000000000426c8b in mini_cleanup (domain=0x9d9e00) at mini-runtime.c:4017
#27 0x000000000047fac6 in mono_main (argc=11, argv=<optimized out>) at driver.c:2115
#28 0x0000000000424c68 in mono_main_with_options (argv=0x7fffffffd698, argc=11) at main.c:20
#29 main (argc=<optimized out>, argv=<optimized out>) at main.c:53

Thread 16768
———
(gdb) bt
#0  0x00007fffeccca66c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x000000000060c873 in mono_os_cond_wait (mutex=0x97e640 <lock>, cond=0x97e600 <work_cond>) at ../../mono/utils/mono-os-mutex.h:105
#2  thread_func (thread_data=0x0) at sgen-thread-pool.c:118
#3  0x00007fffeccc6806 in start_thread () from /lib64/libpthread.so.0
#4  0x00007fffec80a9bd in clone () from /lib64/libc.so.6
#5  0x0000000000000000 in ?? ()
(gdb)


Thread 16815
————
Thread #0  0x00007fffec75ec8b in sigsuspend () from /lib64/libc.so.6
#1  0x000000000063cda6 in suspend_signal_handler (_dummy=<optimized out>, info=<optimized out>, context=0x7fffec633dc0)
    at mono-threads-posix-signals.c:209
#2  <signal handler called>
#3  0x00007fffed8faf97 in open64 () from /lib64/ld-linux-x86-64.so.2
#4  0x00007fffed8ea82d in open_verify () from /lib64/ld-linux-x86-64.so.2
#5  0x00007fffed8eade0 in open_path () from /lib64/ld-linux-x86-64.so.2
#6  0x00007fffed8ece7c in _dl_map_object () from /lib64/ld-linux-x86-64.so.2
#7  0x00007fffed8f7400 in dl_open_worker () from /lib64/ld-linux-x86-64.so.2
#8  0x00007fffed8f2e86 in _dl_catch_error () from /lib64/ld-linux-x86-64.so.2
#9  0x00007fffed8f6e3b in _dl_open () from /lib64/ld-linux-x86-64.so.2
#10 0x00007fffecedcf9b in dlopen_doit () from /lib64/libdl.so.2
#11 0x00007fffed8f2e86 in _dl_catch_error () from /lib64/ld-linux-x86-64.so.2
#12 0x00007fffecedd33c in _dlerror_run () from /lib64/libdl.so.2
#13 0x00007fffecedcf01 in dlopen@@GLIBC_2.2.5 () from /lib64/libdl.so.2
#14 0x0000000000630b79 in mono_dl_open (name=name at entry=0x1c02790 "libSystem.Data", flags=flags at entry=1, error_msg=error_msg at entry=
    0x7fffec634e80) at mono-dl.c:150
#15 0x000000000054b9f0 in cached_module_load (name=name at entry=0x1c02790 "libSystem.Data", err=err at entry=0x7fffec634e80, flags=1)
    at loader.c:1398

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.ximian.com/pipermail/mono-devel-list/attachments/20160428/296ae4c9/attachment.html>


More information about the Mono-devel-list mailing list