[Mono-dev] FW: Random hangs while running mono app

George, Glover E ERDC-RDE-ITL-MS CIV Glover.E.George at erdc.dren.mil
Tue Jun 21 16:35:48 UTC 2016


Just realized I forgot one possibly important point.  All I/O is now being done in /dev/shm (tmpfs) so any clustered file system load/performance issues should not affect the code.

Cheers.

From: <mono-devel-list-bounces at lists.ximian.com<mailto:mono-devel-list-bounces at lists.ximian.com>> on behalf of Glover George <glover.e.george at erdc.dren.mil<mailto:glover.e.george at erdc.dren.mil>>
Date: Tuesday, June 21, 2016 at 11:30 AM
To: Rodrigo Kumpera <kumpera at gmail.com<mailto:kumpera at gmail.com>>
Cc: "mono-devel-list at lists.ximian.com<mailto:mono-devel-list at lists.ximian.com>" <mono-devel-list at lists.ximian.com<mailto:mono-devel-list at lists.ximian.com>>
Subject: Re: [Mono-dev] FW: Random hangs while running mono app


Hi Rodrigo,


I finally can semi-reliably (most of the time) reproduce these hung processes.  The strace, gdb backtrace, and CUE cards for a single hung mono instance is shown below.

  1.  I am now running on mono-4.4.0.  The hanging issue still exists.
  2.  The code is compiled in VS 2015 and run under mono.
  3.  I changed the code in mono-threads.c (line 72) to use 10 seconds instead of the 200ms just to test if there was a timing issue in mono_threads_wait_pending_operations.  You can see below that this timer expires as well.  My guess is something is deadlocked waiting on a condition to occur that isn’t going to occur.

Finally, can you give me an overall picture of what is occuring so I might be able to navigate the source a bit better.  There are three threads in the backtrace.  My best guess was that one is the GC, one is the Main Thread, and one is possibly a Finalizer, but I’m not sure.  I’m digging through mono/utils/mono-threads.c and related code.  Also, the CUE CARD state is 105, which according to the CUE CARD itself says :
0x?05   - async suspend requested (BAD)

Under what cirumstances would an async suspend be requested in this context?

Thanks again for your help with this issue.

Cheers.




strace of hung process/threads:

-----------------------------------------------------------------------------------------------

ggeorge  64515 64382 64515 31    3 17:06 ?        00:08:37 mono --runtime=v4.0.30319 mycode.exe

ggeorge  64515 64382 64574  0    3 17:06 ?        00:00:00 mono --runtime=v4.0.30319 mycode.exe

ggeorge  64515 64382 64596  0    3 17:06 ?        00:00:06 mono --runtime=v4.0.30319 mycode.exe

ggeorge  67340 67160 67340  0    1 17:33 pts/0    00:00:00 grep mono


ggeorge at r8i2n17:~> strace -p 64515

Process 64515 attached - interrupt to quit

futex(0x7fffedb05968, FUTEX_WAIT_PRIVATE, 2, NULL^C <unfinished ...>

Process 64515 detached

ggeorge at r8i2n17:~> strace -p 64382

Process 64382 attached - interrupt to quit

wait4(-1, ^C <unfinished ...>

Process 64382 detached

ggeorge at r8i2n17:~> strace -p 64574

Process 64574 attached - interrupt to quit

futex(0x97f604, FUTEX_WAIT_PRIVATE, 523, NULL^C <unfinished ...>

Process 64574 detached

ggeorge at r8i2n17:~> strace -p 64596

Process 64596 attached - interrupt to quit

rt_sigsuspend(~[PWR RTMIN RT_1 RT_5]^C <unfinished ...>

Process 64596 detached

ggeorge at r8i2n17:~>

gdb backtrace of hung process/threads:

-----------------------------------------------------------------------------------------------

Thread 3 (Thread 0x7fffebfff700 (LWP 64574)):

#0  0x00007fffecd3d66c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0

#1  0x000000000060db93 in mono_os_cond_wait (mutex=0x97f640 <lock>, cond=0x97f600 <work_cond>) at ../../mono/utils/mono-os-mutex.h:105

#2  thread_func (thread_data=0x0) at sgen-thread-pool.c:118

#3  0x00007fffecd39806 in start_thread () from /lib64/libpthread.so.0

#4  0x00007fffec87d9bd in clone () from /lib64/libc.so.6

#5  0x0000000000000000 in ?? ()


Thread 2 (Thread 0x7fffd8093700 (LWP 64596)):

#0  0x00007fffec7d1c8b in sigsuspend () from /lib64/libc.so.6

#1  0x000000000063e3e6 in suspend_signal_handler (_dummy=<optimized out>, info=<optimized out>, context=0x7fffd808ff80) at mono-threads-posix-signals.c:209

#2  <signal handler called>

#3  0x00007fffed8faf97 in open64 () from /lib64/ld-linux-x86-64.so.2

#4  0x00007fffed8ea82d in open_verify () from /lib64/ld-linux-x86-64.so.2

#5  0x00007fffed8ecca1 in _dl_map_object () from /lib64/ld-linux-x86-64.so.2

#6  0x00007fffed8f7400 in dl_open_worker () from /lib64/ld-linux-x86-64.so.2

#7  0x00007fffed8f2e86 in _dl_catch_error () from /lib64/ld-linux-x86-64.so.2

#8  0x00007fffed8f6e3b in _dl_open () from /lib64/ld-linux-x86-64.so.2

#9  0x00007fffecf4ff9b in dlopen_doit () from /lib64/libdl.so.2

#10 0x00007fffed8f2e86 in _dl_catch_error () from /lib64/ld-linux-x86-64.so.2

#11 0x00007fffecf5033c in _dlerror_run () from /lib64/libdl.so.2

#12 0x00007fffecf4ff01 in dlopen@@GLIBC_2.2.5 () from /lib64/libdl.so.2

#13 0x00000000006327f5 in mono_dl_open_file (file=<optimized out>, flags=<optimized out>) at mono-dl-posix.c:67

#14 0x0000000000632029 in mono_dl_open (name=name at entry=0x1c604d0 "/p/home/apps/unsupported/NAVAIR/build/mono-4.4.0/lib/mono/gac/System.Data/4.0.0.0__b77a5c561934e089/libSystem.Data.dll", flags=flags at entry=1,

    error_msg=error_msg at entry=0x7fffd8090e80) at mono-dl.c:150

#15 0x000000000054cf90 in cached_module_load (name=name at entry=0x1c604d0 "/p/home/apps/unsupported/NAVAIR/build/mono-4.4.0/lib/mono/gac/System.Data/4.0.0.0__b77a5c561934e089/libSystem.Data.dll", err=err at entry=0x7fffd8090e80,

    flags=1) at loader.c:1402

Python Exception <type 'exceptions.ValueError'> zero length field name in format:

#16 0x000000000054e218 in mono_lookup_pinvoke_call (method=method at entry=, exc_class=exc_class at entry=0x7fffd8091f00, exc_arg=exc_arg at entry=0x7fffd8091f08) at loader.c:1645

Python Exception <type 'exceptions.ValueError'> zero length field name in format:

#17 0x00000000005641f6 in mono_marshal_get_native_wrapper (method=method at entry=, check_exceptions=check_exceptions at entry=1, aot=0) at marshal.c:7396

Python Exception <type 'exceptions.ValueError'> zero length field name in format:

#18 0x00000000004602cf in mono_method_to_ir (cfg=cfg at entry=0x1c5f920, method=method at entry=, start_bblock=<optimized out>, start_bblock at entry=0x0, end_bblock=<optimized out>, end_bblock at entry=0x0, return_var=return_var at entry=0x0,

    inline_args=inline_args at entry=0x0, inline_offset=0, is_virtual_call=0) at method-to-ir.c:9280

Python Exception <type 'exceptions.ValueError'> zero length field name in format:

#19 0x000000000050acf9 in mini_method_compile (method=method at entry=, opts=opts at entry=370239999, domain=domain at entry=0x9c91f0, flags=flags at entry=JIT_FLAG_RUN_CCTORS, parts=parts at entry=0, aot_method_index=aot_method_index at entry=-1)

    at mini.c:3608

Python Exception <type 'exceptions.ValueError'> zero length field name in format:

#20 0x000000000050c4d5 in mono_jit_compile_method_inner (method=method at entry=, target_domain=target_domain at entry=0x9c91f0, opt=opt at entry=370239999, jit_ex=jit_ex at entry=0x7fffd8092678) at mini.c:4263

Python Exception <type 'exceptions.ValueError'> zero length field name in format:

#21 0x00000000004287f8 in mono_jit_compile_method_with_opt (method=method at entry=, opt=370239999, ex=ex at entry=0x7fffd8092678) at mini-runtime.c:1952

Python Exception <type 'exceptions.ValueError'> zero length field name in format:

#22 0x0000000000428fbb in mono_jit_compile_method (method=) at mini-runtime.c:2008

Python Exception <type 'exceptions.ValueError'> zero length field name in format:

#23 0x00000000004aeb43 in common_call_trampoline_inner (regs=regs at entry=0x7fffd8092890, code=code at entry=0x401c3ee4 "\270\001", m=m at entry=, vt=vt at entry=0x0, vtable_slot=<optimized out>, vtable_slot at entry=0x0)

    at mini-trampolines.c:694

Python Exception <type 'exceptions.ValueError'> zero length field name in format:

#24 0x00000000004af2a0 in common_call_trampoline (regs=0x7fffd8092890, code=0x401c3ee4 "\270\001", m=, vt=0x0, vtable_slot=0x0) at mini-trampolines.c:808

#25 0x0000000040000289 in ?? ()

#26 0x0000000000a25145 in ?? ()

#27 0x00000000401c3ee4 in ?? ()

#28 0x0000000000a25145 in ?? ()

#29 0x00007fffd80929d0 in ?? ()

#30 0x00007fffd8092890 in ?? ()

#31 0x00007fffd8093698 in ?? ()

#32 0x00007fffd80d6188 in ?? ()

#33 0x00007fffd80d61a0 in ?? ()

#34 0x00007fffd8092af0 in ?? ()

#35 0x0000000000000003 in ?? ()

#36 0x00007fffd80929d0 in ?? ()

#37 0x00007fffd8092a60 in ?? ()

#38 0x0000000000000001 in ?? ()

#39 0x00007fffd80d6188 in ?? ()

#40 0x0000000000000000 in ?? ()


Thread 1 (Thread 0x7fffedae7780 (LWP 64515)):

#0  0x00007fffecd40324 in __lll_lock_wait () from /lib64/libpthread.so.0

#1  0x00007fffecd3b684 in _L_lock_1091 () from /lib64/libpthread.so.0

#2  0x00007fffecd3b4f6 in pthread_mutex_lock () from /lib64/libpthread.so.0

#3  0x00007fffed8f6dcc in _dl_open () from /lib64/ld-linux-x86-64.so.2

#4  0x00007fffec8b5530 in do_dlopen () from /lib64/libc.so.6

#5  0x00007fffed8f2e86 in _dl_catch_error () from /lib64/ld-linux-x86-64.so.2

#6  0x00007fffec8b55e5 in dlerror_run () from /lib64/libc.so.6

#7  0x00007fffec8b56d7 in __libc_dlopen_mode () from /lib64/libc.so.6

#8  0x00007fffec8902e5 in init () from /lib64/libc.so.6

#9  0x00007fffecd3ed03 in pthread_once () from /lib64/libpthread.so.0

#10 0x00007fffec89043c in backtrace () from /lib64/libc.so.6

#11 0x00000000004ad425 in mono_handle_native_sigsegv (signal=<optimized out>, ctx=<optimized out>, info=<optimized out>) at mini-exceptions.c:2309

#12 <signal handler called>

#13 0x00007fffec7d1875 in raise () from /lib64/libc.so.6

#14 0x00007fffec7d2e51 in abort () from /lib64/libc.so.6

#15 0x00000000006468be in monoeg_log_default_handler (log_domain=0x0, log_level=G_LOG_LEVEL_ERROR, message=0x1c70ac0 "suspend_thread suspend took 10000 ms, which is more than the allowed 10000 ms", unused_data=0x0) at goutput.c:233

#16 0x00000000006466ab in monoeg_g_logv (log_domain=0x0, log_level=G_LOG_LEVEL_ERROR, format=0x703240 "suspend_thread suspend took %d ms, which is more than the allowed %d ms", args=0x7fffffffc588) at goutput.c:113

#17 0x0000000000646761 in monoeg_g_log (log_domain=0x0, log_level=G_LOG_LEVEL_ERROR, format=0x703240 "suspend_thread suspend took %d ms, which is more than the allowed %d ms") at goutput.c:123

#18 0x000000000063b77f in mono_threads_wait_pending_operations () at mono-threads.c:236

#19 0x000000000063bf0d in suspend_sync (interrupt_kernel=1, tid=140736817870592) at mono-threads.c:875

#20 suspend_sync_nolock (interrupt_kernel=1, id=140736817870592) at mono-threads.c:890

#21 mono_thread_info_safe_suspend_and_run (id=140736817870592, interrupt_kernel=interrupt_kernel at entry=1, callback=callback at entry=0x58ead0 <abort_thread_critical>, user_data=user_data at entry=0x7fffffffcb10) at mono-threads.c:933

#22 0x0000000000592f96 in abort_thread_internal (thread=thread at entry=0x7fffd813c230, install_async_abort=install_async_abort at entry=1, can_raise_exception=1) at threads.c:4728

#23 0x0000000000593039 in mono_thread_internal_stop (thread=0x7fffd813c230) at threads.c:2385

#24 0x00000000005b266e in mono_gc_cleanup () at gc.c:842

#25 0x00000000005abfbe in mono_runtime_cleanup (domain=domain at entry=0x9c91f0) at appdomain.c:356

#26 0x000000000042702b in mini_cleanup (domain=0x9c91f0) at mini-runtime.c:4017

#27 0x0000000000480dc6 in mono_main (argc=11, argv=<optimized out>) at driver.c:2119

#28 0x0000000000424cdf in mono_main_with_options (argv=0x7fffffffce78, argc=11) at main.c:28

#29 main (argc=11, argv=0x7fffffffce78) at main.c:177

(gdb)


STATE CUE CARD of above process:

-----------------------------------------------------------------------------------------------

STATE CUE CARD: (? means a positive number, usually 1 or 2, * means any number)

        0x0     - starting (GOOD, unless the thread is running managed code)

        0x1     - running (BAD, unless it's the gc thread)

        0x2     - detached (GOOD, unless the thread is running managed code)

        0x?03   - async suspended (GOOD)

        0x?04   - self suspended (GOOD)

        0x?05   - async suspend requested (BAD)

        0x?06   - self suspend requested (BAD)

        0x*07   - blocking (GOOD)

        0x?08   - blocking with pending suspend (GOOD)

--thread 0xa27f30 id 0x7fffd8093700 [(nil)] state 105

--thread 0x9c7a30 id 0x7fffedae7780 [(nil)] state 1  GC INITIATOR

WAITING for 1 threads, got 0 suspended

suspend_thread suspend took 10000 ms, which is more than the allowed 10000 ms

Shutting down finalizer thread timed out.

* Assertion at gc.c:867, condition `finalizer_thread_exited' not met


From: Rodrigo Kumpera <kumpera at gmail.com<mailto:kumpera at gmail.com>>
Date: Tuesday, June 7, 2016 at 2:07 PM
To: Glover George <glover.e.george at erdc.dren.mil<mailto:glover.e.george at erdc.dren.mil>>
Subject: Re: [Mono-dev] FW: Random hangs while running mono app

There should be only one CUE card dump per crash. You posted a bunch of those.

I need to know which CUR Card goes with with crash dump.

--
Rodrigo

On Tue, Jun 7, 2016 at 9:09 AM, George, Glover E ERDC-RDE-ITL-MS CIV <Glover.E.George at erdc.dren.mil<mailto:Glover.E.George at erdc.dren.mil>> wrote:
What do you mean by proper logs?  Was the email I sent with the CUE cards and the stack trace not correct?
— — —
Glover E. George
Computer Scientist
Information Technology Laboratory
US Army Engineer Research and Development Center
Vicksburg, MS 39180
601-634-4730<tel:601-634-4730>


From: Rodrigo Kumpera <kumpera at gmail.com<mailto:kumpera at gmail.com>>
Date: Tuesday, June 7, 2016 at 11:07 AM
To: Glover George <glover.e.george at erdc.dren.mil<mailto:glover.e.george at erdc.dren.mil>>

Subject: Re: [Mono-dev] FW: Random hangs while running mono app

Not really, without a test case and proper logs, nothing much we can do.


On Tue, Jun 7, 2016 at 9:02 AM, George, Glover E ERDC-RDE-ITL-MS CIV <Glover.E.George at erdc.dren.mil<mailto:Glover.E.George at erdc.dren.mil>> wrote:
Hi Rodrigo,

To avoid spamming the list, I’m emailing you off list.  Is there anything else I could be doing to help narrow down the problem?

Cheers.
— — —
Glover E. George
Computer Scientist
Information Technology Laboratory
US Army Engineer Research and Development Center
Vicksburg, MS 39180
601-634-4730<tel:601-634-4730>


From: Rodrigo Kumpera <kumpera at gmail.com<mailto:kumpera at gmail.com>>
Date: Friday, June 3, 2016 at 3:02 PM
To: Glover George <glover.e.george at erdc.dren.mil<mailto:glover.e.george at erdc.dren.mil>>
Cc: "mono-devel-list at lists.ximian.com<mailto:mono-devel-list at lists.ximian.com>" <mono-devel-list at lists.ximian.com<mailto:mono-devel-list at lists.ximian.com>>
Subject: Re: [Mono-dev] FW: Random hangs while running mono app

Hi,

The cue card output on itself is not useful.
I need the cue card output with the matched thread dump as that allows me to see what's going on on the problematic thread.

--
Rodrigo


On Fri, Jun 3, 2016 at 12:42 PM, George, Glover E ERDC-RDE-ITL-MS CIV <Glover.E.George at erdc.dren.mil<mailto:Glover.E.George at erdc.dren.mil>> wrote:
Hi Rodrigo,

I didn¹t realize our previous emails were off thread.  I was able to get
the STAT CUE CARD output.  See below:

STATE CUE CARD: (? means a positive number, usually 1 or 2, * means any
number)
        0x0     - starting (GOOD, unless the thread is running managed
code)
        0x1     - running (BAD, unless it's the gc thread)
        0x2     - detached (GOOD, unless the thread is running managed
code)
        0x?03   - async suspended (GOOD)
        0x?04   - self suspended (GOOD)
        0x?05   - async suspend requested (BAD)
        0x?06   - self suspend requested (BAD)
        0x*07   - blocking (GOOD)
        0x?08   - blocking with pending suspend (GOOD)
--thread 0xa38b30 id 0x7fffec637700 [(nil)] state 105
--thread 0x9d8640 id 0x7fffedae7780 [(nil)] state 1  GC INITIATOR
WAITING for 1 threads, got 0 suspended
suspend_thread suspend took 200 ms, which is more than the allowed 200 ms
STATE CUE CARD: (? means a positive number, usually 1 or 2, * means any
number)
        0x0     - starting (GOOD, unless the thread is running managed
code)
        0x1     - running (BAD, unless it's the gc thread)
        0x2     - detached (GOOD, unless the thread is running managed
code)
        0x?03   - async suspended (GOOD)
        0x?04   - self suspended (GOOD)
        0x?05   - async suspend requested (BAD)
        0x?06   - self suspend requested (BAD)
        0x*07   - blocking (GOOD)
        0x?08   - blocking with pending suspend (GOOD)
--thread 0xa38b30 id 0x7fffec637700 [(nil)] state 105
--thread 0x9d8640 id 0x7fffedae7780 [(nil)] state 1  GC INITIATOR
WAITING for 1 threads, got 0 suspended
suspend_thread suspend took 200 ms, which is more than the allowed 200 ms
STATE CUE CARD: (? means a positive number, usually 1 or 2, * means any
number)
        0x0     - starting (GOOD, unless the thread is running managed
code)
        0x1     - running (BAD, unless it's the gc thread)
        0x2     - detached (GOOD, unless the thread is running managed
code)
        0x?03   - async suspended (GOOD)
        0x?04   - self suspended (GOOD)
        0x?05   - async suspend requested (BAD)
        0x?06   - self suspend requested (BAD)
        0x*07   - blocking (GOOD)
        0x?08   - blocking with pending suspend (GOOD)
--thread 0xa38b30 id 0x7fffec637700 [(nil)] state 105
--thread 0x9d8640 id 0x7fffedae7780 [(nil)] state 1  GC INITIATOR
WAITING for 1 threads, got 0 suspended
suspend_thread suspend took 200 ms, which is more than the allowed 200 ms
STATE CUE CARD: (? means a positive number, usually 1 or 2, * means any
number)
        0x0     - starting (GOOD, unless the thread is running managed
code)
        0x1     - running (BAD, unless it's the gc thread)
        0x2     - detached (GOOD, unless the thread is running managed
code)
        0x?03   - async suspended (GOOD)
        0x?04   - self suspended (GOOD)
        0x?05   - async suspend requested (BAD)
        0x?06   - self suspend requested (BAD)
        0x*07   - blocking (GOOD)
        0x?08   - blocking with pending suspend (GOOD)
--thread 0xa38b30 id 0x7fffec637700 [(nil)] state 105
--thread 0x9d8640 id 0x7fffedae7780 [(nil)] state 1  GC INITIATOR
WAITING for 1 threads, got 0 suspended
suspend_thread suspend took 200 ms, which is more than the allowed 200 ms
STATE CUE CARD: (? means a positive number, usually 1 or 2, * means any
number)
        0x0     - starting (GOOD, unless the thread is running managed
code)
        0x1     - running (BAD, unless it's the gc thread)
        0x2     - detached (GOOD, unless the thread is running managed
code)
        0x?03   - async suspended (GOOD)
        0x?04   - self suspended (GOOD)
        0x?05   - async suspend requested (BAD)
        0x?06   - self suspend requested (BAD)
        0x*07   - blocking (GOOD)
        0x?08   - blocking with pending suspend (GOOD)
--thread 0xa38b30 id 0x7fffec637700 [(nil)] state 105
--thread 0x9d8640 id 0x7fffedae7780 [(nil)] state 1  GC INITIATOR
WAITING for 1 threads, got 0 suspended
suspend_thread suspend took 200 ms, which is more than the allowed 200 ms
STATE CUE CARD: (? means a positive number, usually 1 or 2, * means any
number)
        0x0     - starting (GOOD, unless the thread is running managed
code)
        0x1     - running (BAD, unless it's the gc thread)
        0x2     - detached (GOOD, unless the thread is running managed
code)
        0x?03   - async suspended (GOOD)
        0x?04   - self suspended (GOOD)
        0x?05   - async suspend requested (BAD)
        0x?06   - self suspend requested (BAD)
        0x*07   - blocking (GOOD)
        0x?08   - blocking with pending suspend (GOOD)
--thread 0xa38b30 id 0x7fffec637700 [(nil)] state 105
--thread 0x9d8640 id 0x7fffedae7780 [(nil)] state 1  GC INITIATOR
WAITING for 1 threads, got 0 suspended
suspend_thread suspend took 200 ms, which is more than the allowed 200 ms
STATE CUE CARD: (? means a positive number, usually 1 or 2, * means any
number)
        0x0     - starting (GOOD, unless the thread is running managed
code)
        0x1     - running (BAD, unless it's the gc thread)
        0x2     - detached (GOOD, unless the thread is running managed
code)
        0x?03   - async suspended (GOOD)
        0x?04   - self suspended (GOOD)
        0x?05   - async suspend requested (BAD)
        0x?06   - self suspend requested (BAD)
        0x*07   - blocking (GOOD)
        0x?08   - blocking with pending suspend (GOOD)
--thread 0xa38b30 id 0x7fffec637700 [(nil)] state 105
--thread 0x9d8640 id 0x7fffedae7780 [(nil)] state 1  GC INITIATOR


‹ ‹ ‹
Glover E. George
Computer Scientist
Information Technology Laboratory
US Army Engineer Research and Development Center
Vicksburg, MS 39180
601-634-4730<tel:601-634-4730>





From:  Rodrigo Kumpera <kumpera at gmail.com<mailto:kumpera at gmail.com>>
Date:  Friday, April 29, 2016 at 9:12 AM
To:  Glover George <glover.e.george at erdc.dren.mil<mailto:glover.e.george at erdc.dren.mil>>
Cc:  "mono-devel-list at lists.ximian.com<mailto:mono-devel-list at lists.ximian.com>" <mono-devel-list at lists.ximian.com<mailto:mono-devel-list at lists.ximian.com>>
Subject:  Re: [Mono-dev] FW: Random hangs while running mono app


This looks like a shutdown bug in mono.

Do you have a reliable way to reproduce it?
How loaded are the machines running your workload?


On Thu, Apr 28, 2016 at 6:41 PM, George, Glover E ERDC-RDE-ITL-MS CIV
<Glover.E.George at erdc.dren.mil<mailto:Glover.E.George at erdc.dren.mil>> wrote:

One more important point I forgot to mention, when this occurs, it is
generally only a few mono instances out of the 500+ simultaneous mono
instances.  Most of them terminate.  And more importantly, the mono
instances that do hang actually complete their
 work!  It seems to be when mono is trying to close down.

This has occurred with mono 4.2.x, 4.3.x, 4.4.x, and current git.

Happened today with this configuration:
Mono JIT compiler version 4.3.2 (Stable 4.3.2.467/ba2e5e4 Tue Mar 15
11:44:06 CDT 2016)
Copyright (C) 2002-2014 Novell, Inc, Xamarin Inc and Contributors.
BlockedBlockedBlockedBlockedBlockedwww.mono-project.comBlockedBlockedBlockedBlockedBlocked
<BlockedBlockedBlockedBlockedBlockedhttp://www.mono-project.comBlockedBlockedBlockedBlockedBlocked>

TLS:           __thread
SIGSEGV:       altstack
Notifications: epoll
Architecture:  amd64
Disabled:      none
Misc:          bigarrays softdebug
LLVM:          supported, not enabled.
GC:            sgen




From: <mono-devel-list-bounces at lists.ximian.com<mailto:mono-devel-list-bounces at lists.ximian.com>> on behalf of Glover
George <glover.e.george at erdc.dren.mil<mailto:glover.e.george at erdc.dren.mil>>
Date: Thursday, April 28, 2016 at 5:25 PM
To: "mono-devel-list at lists.ximian.com<mailto:mono-devel-list at lists.ximian.com>" <mono-devel-list at lists.ximian.com<mailto:mono-devel-list at lists.ximian.com>>
Subject: [Mono-dev] Random hangs while running mono app


Hi all,

Quick background: We¹re running mono on an HPC platform (SGI ICE-X / SUSE
Enterprise Linux 11) with a Lustre filesystem.  For several months, I¹ve
noticed random hangs with our application.  In a given batch job, we run
500+ instances of mono in parallel
 (no interprocess communication).   I haven¹t been able to reliably
reproduce the issue which leads me to believe it¹s a timing issue, and
more specifically, I have reason to believe it may be an issue with the
interaction of mono and the filesystem.  Being
 a clustered file-system, the load on the backing stores can vary due to
other users.

When I notice that my job hasn¹t finished in a reasonable amount of time,
I log into the compute node and do a  ³ ps -efL | grep mono ³ which
reveals:

george  16728 16575 16728  6    3 16:13 ?        00:03:58 mono
--runtime=v4.0.30319 /app/MyConsole.exe
ggeorge  16728 16575 16768  0    3 16:13 ?        00:00:02 mono
--runtime=v4.0.30319 /app/MyConsole.exe
ggeorge  16728 16575 16815  0    3 16:13 ?        00:00:04 mono
--runtime=v4.0.30319 /app/MyConsole.exe


16728 is the top proc which spawns 16768 and 16815 (main, gc and maybe
finalizer thread?).

Attaching to each of these 3 pid¹s with gdb and doing a backtrace reveals
the traces below.

1.  Is it possible this is a GC hang?

2.  Is it possible this is a race condition that has anything to do with
the filesystem?

3.  Any idea how this sequence could occur?  If there¹s a way to gather
more information, please let me know.

The reason for the file system suspicion is my app writes several small
files, times the number of mono instances (500+).  It only happens
randomly, but seems to be more frequent the more mono instances I run.

Any help is HIGHLY appreciated.

PID 16728
‹‹‹
(gdb) bt
#0  0x00007fffecccd324 in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007fffeccc8684 in _L_lock_1091 () from /lib64/libpthread.so.0
#2  0x00007fffeccc84f6 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x00007fffed8f6dcc in _dl_open () from /lib64/ld-linux-x86-64.so.2
#4  0x00007fffec842530 in do_dlopen () from /lib64/libc.so.6
#5  0x00007fffed8f2e86 in _dl_catch_error () from
/lib64/ld-linux-x86-64.so.2
#6  0x00007fffec8425e5 in dlerror_run () from /lib64/libc.so.6
#7  0x00007fffec8426d7 in __libc_dlopen_mode () from /lib64/libc.so.6
#8  0x00007fffec81d2e5 in init () from /lib64/libc.so.6
#9  0x00007fffecccbd03 in pthread_once () from /lib64/libpthread.so.0
#10 0x00007fffec81d43c in backtrace () from /lib64/libc.so.6
#11 0x00000000004ac025 in mono_handle_native_sigsegv (signal=<optimized
out>, ctx=<optimized out>, info=<optimized out>)
    at mini-exceptions.c:2309
#12 <signal handler called>
#13 0x00007fffec75e875 in raise () from /lib64/libc.so.6
#14 0x00007fffec75fe51 in abort () from /lib64/libc.so.6
#15 0x000000000064528a in monoeg_log_default_handler (log_domain=0x0,
log_level=G_LOG_LEVEL_ERROR, message=
    0x1ac7660 "suspend_thread suspend took 200 ms, which is more than the
allowed 200 ms", unused_data=0x0) at goutput.c:233
#16 0x0000000000645077 in monoeg_g_logv (log_domain=0x0,
log_level=G_LOG_LEVEL_ERROR, format=
    0x7015d8 "suspend_thread suspend took %d ms, which is more than the
allowed %d ms", args=0x7fffffffce58) at goutput.c:113
#17 0x000000000064512d in monoeg_g_log (log_domain=0x0,
log_level=G_LOG_LEVEL_ERROR, format=
    0x7015d8 "suspend_thread suspend took %d ms, which is more than the
allowed %d ms") at goutput.c:123
#18 0x000000000063a13f in mono_threads_wait_pending_operations () at
mono-threads.c:238
#19 0x000000000063a8cd in suspend_sync (interrupt_kernel=1,
tid=140737159329536) at mono-threads.c:877
#20 suspend_sync_nolock (interrupt_kernel=1, id=140737159329536) at
mono-threads.c:892
#21 mono_thread_info_safe_suspend_and_run (id=140737159329536,
interrupt_kernel=interrupt_kernel at entry=1, callback=callback at entry=
    0x58d5c0 <abort_thread_critical>,
user_data=user_data at entry=0x7fffffffd3e0) at mono-threads.c:935
#22 0x0000000000591a86 in abort_thread_internal
(thread=thread at entry=0x7fffec6e0230,
install_async_abort=install_async_abort at entry=1,
    can_raise_exception=1) at threads.c:4728
#23 0x0000000000591b29 in mono_thread_internal_stop
(thread=0x7fffec6e0230) at threads.c:2385
---Type <return> to continue, or q <return> to quit---
#24 0x00000000005b123e in mono_gc_cleanup () at gc.c:842
#25 0x00000000005aab8e in mono_runtime_cleanup
(domain=domain at entry=0x9d9e00) at appdomain.c:356
#26 0x0000000000426c8b in mini_cleanup (domain=0x9d9e00) at
mini-runtime.c:4017
#27 0x000000000047fac6 in mono_main (argc=11, argv=<optimized out>) at
driver.c:2115
#28 0x0000000000424c68 in mono_main_with_options (argv=0x7fffffffd698,
argc=11) at main.c:20
#29 main (argc=<optimized out>, argv=<optimized out>) at main.c:53

Thread 16768
‹‹‹
(gdb) bt
#0  0x00007fffeccca66c in pthread_cond_wait@@GLIBC_2.3.2 () from
/lib64/libpthread.so.0
#1  0x000000000060c873 in mono_os_cond_wait (mutex=0x97e640 <lock>,
cond=0x97e600 <work_cond>) at ../../mono/utils/mono-os-mutex.h:105
#2  thread_func (thread_data=0x0) at sgen-thread-pool.c:118
#3  0x00007fffeccc6806 in start_thread () from /lib64/libpthread.so.0
#4  0x00007fffec80a9bd in clone () from /lib64/libc.so.6
#5  0x0000000000000000 in ?? ()
(gdb)


Thread 16815
‹‹‹‹
Thread #0  0x00007fffec75ec8b in sigsuspend () from /lib64/libc.so.6
#1  0x000000000063cda6 in suspend_signal_handler (_dummy=<optimized out>,
info=<optimized out>, context=0x7fffec633dc0)
    at mono-threads-posix-signals.c:209
#2  <signal handler called>
#3  0x00007fffed8faf97 in open64 () from /lib64/ld-linux-x86-64.so.2
#4  0x00007fffed8ea82d in open_verify () from /lib64/ld-linux-x86-64.so.2
#5  0x00007fffed8eade0 in open_path () from /lib64/ld-linux-x86-64.so.2
#6  0x00007fffed8ece7c in _dl_map_object () from
/lib64/ld-linux-x86-64.so.2
#7  0x00007fffed8f7400 in dl_open_worker () from
/lib64/ld-linux-x86-64.so.2
#8  0x00007fffed8f2e86 in _dl_catch_error () from
/lib64/ld-linux-x86-64.so.2
#9  0x00007fffed8f6e3b in _dl_open () from /lib64/ld-linux-x86-64.so.2
#10 0x00007fffecedcf9b in dlopen_doit () from /lib64/libdl.so.2
#11 0x00007fffed8f2e86 in _dl_catch_error () from
/lib64/ld-linux-x86-64.so.2
#12 0x00007fffecedd33c in _dlerror_run () from /lib64/libdl.so.2
#13 0x00007fffecedcf01 in dlopen@@GLIBC_2.2.5 () from /lib64/libdl.so.2
#14 0x0000000000630b79 in mono_dl_open (name=name at entry=0x1c02790
"libSystem.Data", flags=flags at entry=1, error_msg=error_msg at entry=
    0x7fffec634e80) at mono-dl.c:150
#15 0x000000000054b9f0 in cached_module_load (name=name at entry=0x1c02790
"libSystem.Data", err=err at entry=0x7fffec634e80, flags=1)
    at loader.c:1398








_______________________________________________
Mono-devel-list mailing list
Mono-devel-list at lists.ximian.com<mailto:Mono-devel-list at lists.ximian.com>
Blockedhttp://lists.ximian.com/mailman/listinfo/mono-devel-listBlocked









-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.ximian.com/pipermail/mono-devel-list/attachments/20160621/f05db425/attachment-0001.html>


More information about the Mono-devel-list mailing list