[Mono-dev] [android-devel] Runtime crashes on Android

Tue Dec 13 17:26:48 UTC 2016

Hi all,

thanks Jonathan for the right pointers.  After a lot of debugging and digging through source code I think I sort of understand what is happening.

The audit message we see is indeed from SELinux. What happens is that we trigger a second SIGSEGV inside the handler. Unfortunately the kernel doesn't give us any further information. I managed to attach with lldb, and the reason for the crash is that the stack pointer points into the text segment of some shared library.  WTF?

Let me go back one step. In general the signal chain looks like this on Android:

(1) SIGSEGV happens, the ART handler catches it and does some stuff (e.g. "is it caused by my managed code?").
(2) if ART doesn't know what to do, it will chain into the remaining handlers.
(3) now it's the mono runtime's handler's turn: we do our business, figure out it's a native crash, etc.
(4) (in case we do *NOT* crash) we return to the ART handler
(5) the ART handler now chains into the next SIGSEGV handler, which was set up by the bionic linker.
(6) the libc/bionic handler communicates with debuggerd, which ptraces our process and delivers further information (e.g. register dump, native stack trace)
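
To make the chaining in the steps above concrete, here is a minimal sketch in plain POSIX C. It is not ART's or mono's actual code: all names (first_handler, second_handler, run_chain_demo) are made up for illustration, and it uses SIGUSR1 instead of SIGSEGV so it can run harmlessly. The handler installed later keeps the struct sigaction it replaced and calls into it once it is done.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <stddef.h>

/* Hypothetical names throughout; this only illustrates the chaining idea. */

volatile sig_atomic_t call_order[4];   /* which handler ran, in order */
volatile sig_atomic_t call_count;

/* The action that was installed before second_handler replaced it. */
struct sigaction previous_action;

static void record(int who)
{
    if (call_count < 4)
        call_order[call_count++] = who;
}

/* Stand-in for the handler that was registered first. */
void first_handler(int sig, siginfo_t *info, void *ctx)
{
    (void)sig; (void)info; (void)ctx;
    record(1);
}

/* Stand-in for a handler registered later: it does its own work,
 * then chains into whatever was registered before it. */
void second_handler(int sig, siginfo_t *info, void *ctx)
{
    record(2);

    if (previous_action.sa_flags & SA_SIGINFO) {
        if (previous_action.sa_sigaction != NULL)
            previous_action.sa_sigaction(sig, info, ctx);
    } else if (previous_action.sa_handler != SIG_DFL &&
               previous_action.sa_handler != SIG_IGN) {
        previous_action.sa_handler(sig);
    }
}

/* Installs fn for sig and optionally returns the replaced action in old. */
int install(int sig, void (*fn)(int, siginfo_t *, void *),
            struct sigaction *old)
{
    struct sigaction sa = {0};
    sigemptyset(&sa.sa_mask);
    sa.sa_sigaction = fn;
    sa.sa_flags = SA_SIGINFO;
    return sigaction(sig, &sa, old);
}

/* Installs both handlers, raises SIGUSR1 once, and returns how many
 * handlers ran (or -1 if the setup failed). */
int run_chain_demo(void)
{
    call_count = 0;
    if (install(SIGUSR1, first_handler, NULL) != 0)
        return -1;
    if (install(SIGUSR1, second_handler, &previous_action) != 0)
        return -1;
    if (raise(SIGUSR1) != 0)
        return -1;
    /* The most recently installed handler runs first, then the chained one. */
    assert(call_order[0] == 2 && call_order[1] == 1);
    return call_count;
}
```

Note that the chained call is an ordinary function call, so the earlier handler runs on whatever stack the later one was given.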

So the interesting bit here is that the mono runtime doesn't register its handler to be executed on an alternate stack, but our handler still happens to run on one. Why? Because ART registers its handler to be executed on an altstack and then chains into our handler. This matters because the altstack is only 8k on 32-bit systems or 16k on 64-bit systems. Some structures we need in libunwind, and even some things we do in mono, exceed those limits on the stack.

I made some changes here and there to reduce frame size requirements and bundled libunwind into mono (see relevant PRs at the end of this email). I re-ran my experiment on XTC:
https://gist.github.com/lewurm/0b271b406b7e194cadaf1340172fc178

or, here, a crash where I sneaked a segfault into the JIT:
https://gist.github.com/lewurm/c96c6236fc1b79b3c30473de174b71dd

Looking at this, my conclusion is: how about we do not even attempt to do a native stack trace in mono, but just let debuggerd do its business? The arguments to support that:

(a) native stack trace by debuggerd is at least as good as the one we get via libunwind (most of the time the trace provided by libunwind is useless?)
(b) we cannot screw up by accident (see example above with altstack mess)
(c) we don't need to maintain the libunwind integration into mono (which we sort of had to do because of the upcoming dlopen limitation in Android 7.0 Nougat).
(d) with libunwind included, libmonosgen-2.0.so has about one megabyte more footprint (16 MB -> 17 MB). I'm talking about a debug build here though.

Bonus: On *some* devices we get an even nicer dump, e.g. check samsung_galaxy_note_5-5.1.1.txt:
https://gist.github.com/lewurm/0b271b406b7e194cadaf1340172fc178#file-samsung_galaxy_note_5-5-1-1-txt

"CrashAnrDetector" seems to be yet another player in this whole story, so far I've only seen it on older Samsung devices.

Any thoughts?

-Bernhard

Relevant pull requests:
https://github.com/mono/mono/pull/4106
https://github.com/mono/mono/pull/4112
https://github.com/mono/mono/pull/4113
https://github.com/mono/mono/pull/4124
https://github.com/mono/mono/pull/4131

________________________________________
From: Jonathan Pryor <jonpryor at vt.edu>
Sent: Thursday, November 17, 2016 12:21:09 PM
To: Bernhard Urban
Cc: Mono-devel-list at lists.dot.net; android-devel at lists.dot.net; Alex Petersen
Subject: Re: [android-devel] Runtime crashes on Android

Reply inline…

On Nov 16, 2016, at 4:29 PM, Bernhard Urban via android-devel <android-devel at lists.dot.net> wrote:
> Every time I look at a runtime bug on Android, something doesn't feel right. Reports look different from each other. So I tried to get a better understanding of how we handle a SIGSEGV in the runtime and what the output is supposed to be. There are three basic steps [1]:
>
> (1) we print a managed stacktrace.
> (2) we print a native stacktrace: we do that either via libunwind or libcorkscrew depending on what is available. if neither is available, we do nothing.
> (3) we call `exit (-1)`, which might give us more information such as a register dump.

Unfortunately, there are (implicitly!) *more* than three basic steps, and I’m fairly sure I still don’t understand what all is going on. For more wonderful context:

        https://github.com/mono/mono/commit/5d07b77a67f61576318a30e8b1c5f65f7f26b1cf
> when a process crashes on Android, ideally:
>
> 1. The Android signal handler is executed,
> 2. Bionic will attempt to connect to /system/bin/debuggerd.
> 3. debuggerd will try to connect to the crashing process, then
>  retrieve "useful" information from the crashing process (stack
>  trace, register values, etc.)

The “fun” is in trying to intermix Mono’s SIGSEGV handling mechanism in with Android’s infrastructure, which involves having an extra process (`debuggerd`) connect to the process to dump process state.

Additionally, I *believe* (but have not retested or reverified) that the `exit(-1)` within `mini-exceptions.c` won’t be executed, because Xamarin.Android calls `mono_set_crash_chaining(1)`, which sets `mono_do_crash_chaining` to 1:

        https://github.com/xamarin/xamarin-android/blob/f862032/src/monodroid/jni/monodroid-glue.c#L2802

Not that any of the above in any way helps further improve reliability…

> That's the idea, unfortunately that is not always what we get.  In order to see the behaviour across different devices and versions of Android, I made this simple crashing app: [2]. As soon as you click the button the application segfaults. For that I wrote a UI test and sent it off to Xamarin Test Cloud and collected the logs: [3]. Note that every device ran the same APK.
>
> Out of 19 devices, there are really only two devices where the crash report looks like it should: samsung_google_nexus_10-4.4.txt and xiaomi_mi_4-4.4.4.txt.  On many devices we only get a managed stacktrace and then the fun is over.
>
> Why?
>
> Good question. Luckily I have a device on my desk where this is the case, so I did a bit of printf debugging. What I figured out is that the call to `mono_exception_native_unwind ()` in [4] is where the fun stops. The message I see on adb logcat:
>
> 11-15 20:51:44.790  7093  7093 E audit   : type=1701 msg=audit(1479239504.790:1839): auid=4294967295 uid=10288 gid=10288 ses=4294967295 subj=u:r:untrusted_app:s0:c512,c768 pid=14937 comm="artup.lulzcrash" exe="/system/bin/app_process32" sig=11

Are there any other `adb logcat` messages? The above looks like an SELinux-related message. (I have no idea what it *means*, but that’s what it looks like…)
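
For reading such a record, a minimal sketch in C that pulls numeric fields out of the audit line quoted above. The function name audit_field is hypothetical, and the parsing is deliberately simplistic (it only handles `key=<digits>` preceded by a space, a colon, or the start of the line). As far as I know, type 1701 is AUDIT_ANOM_ABEND in linux/audit.h ("process ended abnormally"), and sig=11 is SIGSEGV on Linux; treat both readings as assumptions to verify against your kernel headers.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* The record from the logcat line above, without the logcat prefix. */
const char AUDIT_SAMPLE[] =
    "type=1701 msg=audit(1479239504.790:1839): auid=4294967295 uid=10288 "
    "gid=10288 ses=4294967295 subj=u:r:untrusted_app:s0:c512,c768 "
    "pid=14937 comm=\"artup.lulzcrash\" "
    "exe=\"/system/bin/app_process32\" sig=11";

/* Returns the decimal value of "key=<digits>" in record, or -1 if the
 * key is absent or its value is not numeric. Hypothetical helper. */
long audit_field(const char *record, const char *key)
{
    assert(record != NULL && key != NULL);

    size_t klen = strlen(key);
    const char *p = record;

    while ((p = strstr(p, key)) != NULL) {
        /* Reject matches inside a longer key, e.g. "uid" inside "auid". */
        int at_start = (p == record) || p[-1] == ' ' || p[-1] == ':';
        if (at_start && p[klen] == '=') {
            char *end;
            long value = strtol(p + klen + 1, &end, 10);
            if (end != p + klen + 1)
                return value;
        }
        p += klen;
    }
    return -1;
}
```

Here audit_field(AUDIT_SAMPLE, "sig") yields 11 and audit_field(AUDIT_SAMPLE, "pid") yields 14937, i.e. the record says a process died from signal 11, but nothing about where the fault happened.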

> I see the text of a printf right before that call. A printf at the beginning of the function never prints. If I move `mono_exception_native_unwind ()` right before the managed stack unwinding, it crashes there, so it isn't a timeout mechanism. I have no idea why on earth this is the case. Unfortunately there is no clue which PC the signal is coming from (maybe we cause another fault in the handler? maybe android interferes somehow?)

`debuggerd`?

> Does anyone have an idea?  Please tell me I'm overlooking something obvious here.  (I haven't had success yet with gdb/lldb)

I’ve only had success with gdb when using 32-bit targets. 64-bit targets give me gdb protocol mismatch errors. :-(

> Regardless, I want to suggest some things:
>
> (a) we should get rid of the dynamic loading of libunwind/libcorkscrew. Some devices don't ship it. Instead, we should include it in the runtime. I think it's worth the extra footprint (if that is the concern why it wasn't done in the first place).

This is *absolutely* something we should consider. This is even more important in the context of Android 7.0 Nougat, which won’t allow us to load those native libraries, even if they do exist.

- Jon