[Mono-dev] FW: Random hangs while running mono app

George, Glover E ERDC-RDE-ITL-MS CIV Glover.E.George at erdc.dren.mil
Thu Apr 28 22:41:29 UTC 2016


One more important point I forgot to mention, when this occurs, it is generally only a few mono instances out of the 500+ simultaneous mono instances.  Most of them terminate.  And more importantly, the mono instances that do hang actually complete their work!  It seems to be when mono is trying to close down.

This has occurred with mono 4.2.x, 4.3.x, 4.4.x, and current git.

Happened today with this configuration:
Mono JIT compiler version 4.3.2 (Stable 4.3.2.467/ba2e5e4 Tue Mar 15 11:44:06 CDT 2016)
Copyright (C) 2002-2014 Novell, Inc, Xamarin Inc and Contributors. www.mono-project.com
TLS:           __thread
SIGSEGV:       altstack
Notifications: epoll
Architecture:  amd64
Disabled:      none
Misc:          bigarrays softdebug
LLVM:          supported, not enabled.
GC:            sgen


From: <mono-devel-list-bounces at lists.ximian.com<mailto:mono-devel-list-bounces at lists.ximian.com>> on behalf of Glover George <glover.e.george at erdc.dren.mil<mailto:glover.e.george at erdc.dren.mil>>
Date: Thursday, April 28, 2016 at 5:25 PM
To: "mono-devel-list at lists.ximian.com<mailto:mono-devel-list at lists.ximian.com>" <mono-devel-list at lists.ximian.com<mailto:mono-devel-list at lists.ximian.com>>
Subject: [Mono-dev] Random hangs while running mono app

Hi all,

Quick background: We’re running mono on an HPC platform (SGI ICE-X / SUSE Enterprise Linux 11) with a Lustre filesystem.  For several months, I’ve noticed random hangs with our application.  In a given batch job, we run 500+ instances of mono in parallel (no interprocess communication).   I haven’t been able to reliably reproduce the issue which leads me to believe it’s a timing issue, and more specifically, I have reason to believe it may be an issue with the interaction of mono and the filesystem.  Being a clustered file-system, the load on the backing stores can vary due to other users.

When I notice that my job hasn’t finished in a reasonable amount of time,  I log into the compute node and do a  “ ps -efL | grep mono “ which reveals:

george  16728 16575 16728  6    3 16:13 ?        00:03:58 mono --runtime=v4.0.30319 /app/MyConsole.exe
ggeorge  16728 16575 16768  0    3 16:13 ?        00:00:02 mono --runtime=v4.0.30319 /app/MyConsole.exe
ggeorge  16728 16575 16815  0    3 16:13 ?        00:00:04 mono --runtime=v4.0.30319 /app/MyConsole.exe

16728 is the top proc which spawns 16768 and 16815 (main, gc and maybe finalizer thread?).

Attaching to each of these 3 pid’s with gdb and doing a backtrace reveals the traces below.

1.  Is it possible this is a GC hang?

2.  Is it possible this is a race condition that has anything to do with the filesystem?

3.  Any idea how this sequence could occur?  If there’s a way to gather more information, please let me know.

The reason for the file system suspicion is my app writes several small files, times the number of mono instances (500+).  It only happens randomly, but seems to be more frequent the more mono instances I run.

Any help is HIGHLY appreciated.

PID 16728
———
(gdb) bt
#0  0x00007fffecccd324 in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007fffeccc8684 in _L_lock_1091 () from /lib64/libpthread.so.0
#2  0x00007fffeccc84f6 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x00007fffed8f6dcc in _dl_open () from /lib64/ld-linux-x86-64.so.2
#4  0x00007fffec842530 in do_dlopen () from /lib64/libc.so.6
#5  0x00007fffed8f2e86 in _dl_catch_error () from /lib64/ld-linux-x86-64.so.2
#6  0x00007fffec8425e5 in dlerror_run () from /lib64/libc.so.6
#7  0x00007fffec8426d7 in __libc_dlopen_mode () from /lib64/libc.so.6
#8  0x00007fffec81d2e5 in init () from /lib64/libc.so.6
#9  0x00007fffecccbd03 in pthread_once () from /lib64/libpthread.so.0
#10 0x00007fffec81d43c in backtrace () from /lib64/libc.so.6
#11 0x00000000004ac025 in mono_handle_native_sigsegv (signal=<optimized out>, ctx=<optimized out>, info=<optimized out>)
    at mini-exceptions.c:2309
#12 <signal handler called>
#13 0x00007fffec75e875 in raise () from /lib64/libc.so.6
#14 0x00007fffec75fe51 in abort () from /lib64/libc.so.6
#15 0x000000000064528a in monoeg_log_default_handler (log_domain=0x0, log_level=G_LOG_LEVEL_ERROR, message=
    0x1ac7660 "suspend_thread suspend took 200 ms, which is more than the allowed 200 ms", unused_data=0x0) at goutput.c:233
#16 0x0000000000645077 in monoeg_g_logv (log_domain=0x0, log_level=G_LOG_LEVEL_ERROR, format=
    0x7015d8 "suspend_thread suspend took %d ms, which is more than the allowed %d ms", args=0x7fffffffce58) at goutput.c:113
#17 0x000000000064512d in monoeg_g_log (log_domain=0x0, log_level=G_LOG_LEVEL_ERROR, format=
    0x7015d8 "suspend_thread suspend took %d ms, which is more than the allowed %d ms") at goutput.c:123
#18 0x000000000063a13f in mono_threads_wait_pending_operations () at mono-threads.c:238
#19 0x000000000063a8cd in suspend_sync (interrupt_kernel=1, tid=140737159329536) at mono-threads.c:877
#20 suspend_sync_nolock (interrupt_kernel=1, id=140737159329536) at mono-threads.c:892
#21 mono_thread_info_safe_suspend_and_run (id=140737159329536, interrupt_kernel=interrupt_kernel at entry=1, callback=callback at entry=
    0x58d5c0 <abort_thread_critical>, user_data=user_data at entry=0x7fffffffd3e0) at mono-threads.c:935
#22 0x0000000000591a86 in abort_thread_internal (thread=thread at entry=0x7fffec6e0230, install_async_abort=install_async_abort at entry=1,
    can_raise_exception=1) at threads.c:4728
#23 0x0000000000591b29 in mono_thread_internal_stop (thread=0x7fffec6e0230) at threads.c:2385
---Type <return> to continue, or q <return> to quit---
#24 0x00000000005b123e in mono_gc_cleanup () at gc.c:842
#25 0x00000000005aab8e in mono_runtime_cleanup (domain=domain at entry=0x9d9e00) at appdomain.c:356
#26 0x0000000000426c8b in mini_cleanup (domain=0x9d9e00) at mini-runtime.c:4017
#27 0x000000000047fac6 in mono_main (argc=11, argv=<optimized out>) at driver.c:2115
#28 0x0000000000424c68 in mono_main_with_options (argv=0x7fffffffd698, argc=11) at main.c:20
#29 main (argc=<optimized out>, argv=<optimized out>) at main.c:53

Thread 16768
———
(gdb) bt
#0  0x00007fffeccca66c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x000000000060c873 in mono_os_cond_wait (mutex=0x97e640 <lock>, cond=0x97e600 <work_cond>) at ../../mono/utils/mono-os-mutex.h:105
#2  thread_func (thread_data=0x0) at sgen-thread-pool.c:118
#3  0x00007fffeccc6806 in start_thread () from /lib64/libpthread.so.0
#4  0x00007fffec80a9bd in clone () from /lib64/libc.so.6
#5  0x0000000000000000 in ?? ()
(gdb)


Thread 16815
————
Thread #0  0x00007fffec75ec8b in sigsuspend () from /lib64/libc.so.6
#1  0x000000000063cda6 in suspend_signal_handler (_dummy=<optimized out>, info=<optimized out>, context=0x7fffec633dc0)
    at mono-threads-posix-signals.c:209
#2  <signal handler called>
#3  0x00007fffed8faf97 in open64 () from /lib64/ld-linux-x86-64.so.2
#4  0x00007fffed8ea82d in open_verify () from /lib64/ld-linux-x86-64.so.2
#5  0x00007fffed8eade0 in open_path () from /lib64/ld-linux-x86-64.so.2
#6  0x00007fffed8ece7c in _dl_map_object () from /lib64/ld-linux-x86-64.so.2
#7  0x00007fffed8f7400 in dl_open_worker () from /lib64/ld-linux-x86-64.so.2
#8  0x00007fffed8f2e86 in _dl_catch_error () from /lib64/ld-linux-x86-64.so.2
#9  0x00007fffed8f6e3b in _dl_open () from /lib64/ld-linux-x86-64.so.2
#10 0x00007fffecedcf9b in dlopen_doit () from /lib64/libdl.so.2
#11 0x00007fffed8f2e86 in _dl_catch_error () from /lib64/ld-linux-x86-64.so.2
#12 0x00007fffecedd33c in _dlerror_run () from /lib64/libdl.so.2
#13 0x00007fffecedcf01 in dlopen@@GLIBC_2.2.5 () from /lib64/libdl.so.2
#14 0x0000000000630b79 in mono_dl_open (name=name at entry=0x1c02790 "libSystem.Data", flags=flags at entry=1, error_msg=error_msg at entry=
    0x7fffec634e80) at mono-dl.c:150
#15 0x000000000054b9f0 in cached_module_load (name=name at entry=0x1c02790 "libSystem.Data", err=err at entry=0x7fffec634e80, flags=1)
    at loader.c:1398

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.ximian.com/pipermail/mono-devel-list/attachments/20160428/1cefba2c/attachment-0001.html>
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: ATT00001.txt
URL: <http://lists.ximian.com/pipermail/mono-devel-list/attachments/20160428/1cefba2c/attachment-0001.txt>


More information about the Mono-devel-list mailing list