[Mono-dev] FW: Random hangs while running mono app

Rodrigo Kumpera kumpera at gmail.com
Fri Apr 29 14:12:39 UTC 2016


This looks like a shutdown bug in mono.
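
One possible reading of the traces below, offered as an assumption rather than a confirmed diagnosis: thread 16815 appears to have been suspended while inside dlopen(), and the main thread's abort handler then calls backtrace(), which itself enters _dl_open and waits on a lock. A minimal Python sketch of that lock pattern follows. The names loader_lock, worker and crash_path_gets_lock are hypothetical stand-ins, not mono or glibc APIs, and the timeout exists only so the sketch terminates.

```python
import threading


def demo(timeout=0.2):
    """Return True if the 'crash path' is blocked by a parked lock holder."""
    loader_lock = threading.Lock()   # stands in for the dynamic loader's lock
    holding = threading.Event()      # set once the worker owns the lock
    resume = threading.Event()       # stays clear while the worker is "suspended"

    def worker():
        # Like thread 16815 in the report: inside dlopen(), holding the
        # loader lock, then parked until something resumes it.
        with loader_lock:
            holding.set()
            resume.wait()

    def crash_path_gets_lock():
        # Like the main thread: the abort handler calls backtrace(), which
        # needs the same lock. A real pthread_mutex_lock has no timeout;
        # the timeout here only keeps the sketch from hanging.
        got = loader_lock.acquire(timeout=timeout)
        if got:
            loader_lock.release()
        return got

    t = threading.Thread(target=worker, daemon=True)
    t.start()
    holding.wait()                        # worker now owns the lock
    blocked = not crash_path_gets_lock()  # main thread cannot get it
    resume.set()                          # un-park the worker so it can exit
    t.join()
    return blocked


if __name__ == "__main__":
    print("crash path blocked:", demo())
```

In the sketch the worker is eventually resumed; in the hang described below nothing resumes the suspended thread, so the wait would never end.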

Do you have a reliable way to reproduce it?
How loaded are the machines running your workload?
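
If it helps to gather more data the next time this happens, here is a minimal sketch that groups the thread ids from `ps -efL` output by process and builds a non-interactive gdb command for each hung process. The helper names group_threads and gdb_command are hypothetical, the sample text is modeled on the listing quoted below, and the script only prints the commands rather than running them.

```python
import os
import shlex
from collections import defaultdict

# Sample modeled on the `ps -efL` listing in the report (one line per thread).
SAMPLE = """\
UID        PID  PPID   LWP  C NLWP STIME TTY          TIME CMD
ggeorge  16728 16575 16728  6    3 16:13 ?        00:03:58 mono --runtime=v4.0.30319 /app/MyConsole.exe
ggeorge  16728 16575 16768  0    3 16:13 ?        00:00:02 mono --runtime=v4.0.30319 /app/MyConsole.exe
ggeorge  16728 16575 16815  0    3 16:13 ?        00:00:04 mono --runtime=v4.0.30319 /app/MyConsole.exe
"""


def group_threads(ps_output, program="mono"):
    """Map each PID running `program` to its thread ids (the LWP column)."""
    threads = defaultdict(list)
    for line in ps_output.splitlines():
        fields = line.split()
        # ps -efL columns: UID PID PPID LWP C NLWP STIME TTY TIME CMD...
        if len(fields) < 10:
            continue
        if not (fields[1].isdigit() and fields[3].isdigit()):
            continue  # skips the header line
        if os.path.basename(fields[9]) != program:
            continue
        threads[int(fields[1])].append(int(fields[3]))
    return dict(threads)


def gdb_command(pid):
    """Build a gdb invocation that dumps every thread's backtrace and exits."""
    return ["gdb", "-p", str(pid), "-batch", "-ex", "thread apply all bt"]


if __name__ == "__main__":
    for pid, lwps in group_threads(SAMPLE).items():
        print(pid, lwps)
        print(" ".join(shlex.quote(arg) for arg in gdb_command(pid)))
```

Running gdb in batch mode avoids the interactive pager, so the full backtrace of each thread lands in one uninterrupted dump.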

On Thu, Apr 28, 2016 at 6:41 PM, George, Glover E ERDC-RDE-ITL-MS CIV <
Glover.E.George at erdc.dren.mil> wrote:

> One more important point I forgot to mention: when this occurs, it is
> generally only a few mono instances out of the 500+ simultaneous mono
> instances.  Most of them terminate.  And more importantly, the mono
> instances that do hang actually complete their work!  It seems to be when
> mono is trying to close down.
>
> This has occurred with mono 4.2.x, 4.3.x, 4.4.x, and current git.
>
> Happened today with this configuration:
> Mono JIT compiler version 4.3.2 (Stable 4.3.2.467/ba2e5e4 Tue Mar 15
> 11:44:06 CDT 2016)
> Copyright (C) 2002-2014 Novell, Inc, Xamarin Inc and Contributors.
> www.mono-project.com
> TLS:           __thread
> SIGSEGV:       altstack
> Notifications: epoll
> Architecture:  amd64
> Disabled:      none
> Misc:          bigarrays softdebug
> LLVM:          supported, not enabled.
> GC:            sgen
>
>
> From: <mono-devel-list-bounces at lists.ximian.com> on behalf of Glover
> George <glover.e.george at erdc.dren.mil>
> Date: Thursday, April 28, 2016 at 5:25 PM
> To: "mono-devel-list at lists.ximian.com" <mono-devel-list at lists.ximian.com>
> Subject: [Mono-dev] Random hangs while running mono app
>
> Hi all,
>
> Quick background: We’re running mono on an HPC platform (SGI ICE-X / SUSE
> Enterprise Linux 11) with a Lustre filesystem.  For several months, I’ve
> noticed random hangs with our application.  In a given batch job, we run
> 500+ instances of mono in parallel (no interprocess communication). I
> haven’t been able to reliably reproduce the issue, which leads me to believe
> it’s a timing issue and, more specifically, I have reason to believe it may
> be an issue with the interaction of mono and the filesystem.  Being a
> clustered file-system, the load on the backing stores can vary due to other
> users.
>
> When I notice that my job hasn’t finished in a reasonable amount of time,
> I log into the compute node and run “ps -efL | grep mono”, which
> reveals:
>
> ggeorge  16728 16575 16728  6    3 16:13 ?        00:03:58 mono
> --runtime=v4.0.30319 /app/MyConsole.exe
> ggeorge  16728 16575 16768  0    3 16:13 ?        00:00:02 mono
> --runtime=v4.0.30319 /app/MyConsole.exe
> ggeorge  16728 16575 16815  0    3 16:13 ?        00:00:04 mono
> --runtime=v4.0.30319 /app/MyConsole.exe
>
> 16728 is the top proc which spawns 16768 and 16815 (main, gc and maybe
> finalizer thread?).
>
> Attaching to each of these 3 pid’s with gdb and doing a backtrace reveals
> the traces below.
>
> 1.  Is it possible this is a GC hang?
>
> 2.  Is it possible this is a race condition that has anything to do with
> the filesystem?
>
> 3.  Any idea how this sequence could occur?  If there’s a way to gather
> more information, please let me know.
>
> The reason for the file-system suspicion is that each mono instance writes
> several small files, so the total scales with the number of instances
> (500+).  The hang only happens randomly, but seems to be more frequent the
> more mono instances I run.
>
> Any help is HIGHLY appreciated.
>
> PID 16728
> ———
> (gdb) bt
> #0  0x00007fffecccd324 in __lll_lock_wait () from /lib64/libpthread.so.0
> #1  0x00007fffeccc8684 in _L_lock_1091 () from /lib64/libpthread.so.0
> #2  0x00007fffeccc84f6 in pthread_mutex_lock () from /lib64/libpthread.so.0
> #3  0x00007fffed8f6dcc in _dl_open () from /lib64/ld-linux-x86-64.so.2
> #4  0x00007fffec842530 in do_dlopen () from /lib64/libc.so.6
> #5  0x00007fffed8f2e86 in _dl_catch_error () from
> /lib64/ld-linux-x86-64.so.2
> #6  0x00007fffec8425e5 in dlerror_run () from /lib64/libc.so.6
> #7  0x00007fffec8426d7 in __libc_dlopen_mode () from /lib64/libc.so.6
> #8  0x00007fffec81d2e5 in init () from /lib64/libc.so.6
> #9  0x00007fffecccbd03 in pthread_once () from /lib64/libpthread.so.0
> #10 0x00007fffec81d43c in backtrace () from /lib64/libc.so.6
> #11 0x00000000004ac025 in mono_handle_native_sigsegv (signal=<optimized
> out>, ctx=<optimized out>, info=<optimized out>)
>     at mini-exceptions.c:2309
> #12 <signal handler called>
> #13 0x00007fffec75e875 in raise () from /lib64/libc.so.6
> #14 0x00007fffec75fe51 in abort () from /lib64/libc.so.6
> #15 0x000000000064528a in monoeg_log_default_handler (log_domain=0x0,
> log_level=G_LOG_LEVEL_ERROR, message=
>     0x1ac7660 "suspend_thread suspend took 200 ms, which is more than the
> allowed 200 ms", unused_data=0x0) at goutput.c:233
> #16 0x0000000000645077 in monoeg_g_logv (log_domain=0x0,
> log_level=G_LOG_LEVEL_ERROR, format=
>     0x7015d8 "suspend_thread suspend took %d ms, which is more than the
> allowed %d ms", args=0x7fffffffce58) at goutput.c:113
> #17 0x000000000064512d in monoeg_g_log (log_domain=0x0,
> log_level=G_LOG_LEVEL_ERROR, format=
>     0x7015d8 "suspend_thread suspend took %d ms, which is more than the
> allowed %d ms") at goutput.c:123
> #18 0x000000000063a13f in mono_threads_wait_pending_operations () at
> mono-threads.c:238
> #19 0x000000000063a8cd in suspend_sync (interrupt_kernel=1,
> tid=140737159329536) at mono-threads.c:877
> #20 suspend_sync_nolock (interrupt_kernel=1, id=140737159329536) at
> mono-threads.c:892
> #21 mono_thread_info_safe_suspend_and_run (id=140737159329536,
> interrupt_kernel=interrupt_kernel at entry=1, callback=callback at entry=
>     0x58d5c0 <abort_thread_critical>, user_data=user_data at entry=0x7fffffffd3e0)
> at mono-threads.c:935
> #22 0x0000000000591a86 in abort_thread_internal (thread=thread at entry=0x7fffec6e0230,
> install_async_abort=install_async_abort at entry=1,
>     can_raise_exception=1) at threads.c:4728
> #23 0x0000000000591b29 in mono_thread_internal_stop
> (thread=0x7fffec6e0230) at threads.c:2385
> #24 0x00000000005b123e in mono_gc_cleanup () at gc.c:842
> #25 0x00000000005aab8e in mono_runtime_cleanup (domain=domain at entry=0x9d9e00)
> at appdomain.c:356
> #26 0x0000000000426c8b in mini_cleanup (domain=0x9d9e00) at
> mini-runtime.c:4017
> #27 0x000000000047fac6 in mono_main (argc=11, argv=<optimized out>) at
> driver.c:2115
> #28 0x0000000000424c68 in mono_main_with_options (argv=0x7fffffffd698,
> argc=11) at main.c:20
> #29 main (argc=<optimized out>, argv=<optimized out>) at main.c:53
>
> Thread 16768
> ———
> (gdb) bt
> #0  0x00007fffeccca66c in pthread_cond_wait@@GLIBC_2.3.2 () from
> /lib64/libpthread.so.0
> #1  0x000000000060c873 in mono_os_cond_wait (mutex=0x97e640 <lock>,
> cond=0x97e600 <work_cond>) at ../../mono/utils/mono-os-mutex.h:105
> #2  thread_func (thread_data=0x0) at sgen-thread-pool.c:118
> #3  0x00007fffeccc6806 in start_thread () from /lib64/libpthread.so.0
> #4  0x00007fffec80a9bd in clone () from /lib64/libc.so.6
> #5  0x0000000000000000 in ?? ()
> (gdb)
>
>
> Thread 16815
> ————
> #0  0x00007fffec75ec8b in sigsuspend () from /lib64/libc.so.6
> #1  0x000000000063cda6 in suspend_signal_handler (_dummy=<optimized out>,
> info=<optimized out>, context=0x7fffec633dc0)
>     at mono-threads-posix-signals.c:209
> #2  <signal handler called>
> #3  0x00007fffed8faf97 in open64 () from /lib64/ld-linux-x86-64.so.2
> #4  0x00007fffed8ea82d in open_verify () from /lib64/ld-linux-x86-64.so.2
> #5  0x00007fffed8eade0 in open_path () from /lib64/ld-linux-x86-64.so.2
> #6  0x00007fffed8ece7c in _dl_map_object () from
> /lib64/ld-linux-x86-64.so.2
> #7  0x00007fffed8f7400 in dl_open_worker () from
> /lib64/ld-linux-x86-64.so.2
> #8  0x00007fffed8f2e86 in _dl_catch_error () from
> /lib64/ld-linux-x86-64.so.2
> #9  0x00007fffed8f6e3b in _dl_open () from /lib64/ld-linux-x86-64.so.2
> #10 0x00007fffecedcf9b in dlopen_doit () from /lib64/libdl.so.2
> #11 0x00007fffed8f2e86 in _dl_catch_error () from
> /lib64/ld-linux-x86-64.so.2
> #12 0x00007fffecedd33c in _dlerror_run () from /lib64/libdl.so.2
> #13 0x00007fffecedcf01 in dlopen@@GLIBC_2.2.5 () from /lib64/libdl.so.2
> #14 0x0000000000630b79 in mono_dl_open (name=name at entry=0x1c02790
> "libSystem.Data", flags=flags at entry=1, error_msg=error_msg at entry=
>     0x7fffec634e80) at mono-dl.c:150
> #15 0x000000000054b9f0 in cached_module_load (name=name at entry=0x1c02790
> "libSystem.Data", err=err at entry=0x7fffec634e80, flags=1)
>     at loader.c:1398
>
>