[Mono-dev] Runtime crashes on Android

Wed Nov 16 21:29:05 UTC 2016

Hi team,

everytime I look at a runtime bug on Android, something doesn't feel right. Reports look different to each other. So I tried to get a better understanding on how we handle a SIGSEGV in the runtime and what the output is supposed to be. There are three basic steps [1]:

(1) we print a managed stacktrace.
(2) we print a native stacktrace: we do that either via libunwind or libcorkscrew depending on what is available. if neither is available, we do nothing.
(3) we call `exit (-1)`, which might give us more information such as a register dump.

That's the idea, unfortunately that is not always what we get.  In order to see the behaviour across different devices and versions of Android, I made this simple crashing app: [2]. As soon as you click the button the application segfaults. For that I wrote a UI test and sent it off to Xamarin Test Cloud and collected the logs: [3]. Note that every device ran the same APK.

Out of 19 devices, there are really only two devices where the crash report looks like it should: samsung_google_nexus_10-4.4.txt and xiaomi_mi_4-4.4.4.txt.  On many devices we only get a managed stacktrace and then the fun is over.

Why?

Good question. Luckily I have a device on my desk where this is the case, so I did a bit of printf debugging. What I figured out is, that the call to `mono_exception_native_unwind ()` in [4] is where the fun stops. The message I see on adb logcat:

11-15 20:51:44.790  7093  7093 E audit   : type=1701 msg=audit(1479239504.790:1839): auid=4294967295 uid=10288 gid=10288 ses=4294967295 subj=u:r:untrusted_app:s0:c512,c768 pid=14937 comm="artup.lulzcrash" exe="/system/bin/app_process32" sig=11

I see the text of a printf right before that call. printf at the beginning of the function doesn't happen. If I move `mono_exception_native_unwind ()` right before the managed stack unwinding, it crashes there, so it isn't a timeout mechanism. I have no idea why on earth this is the case. Unfortunately there is no clue from which PC the signal is coming from (maybe we cause another fault in the handler? maybe android interferes somehow?)

Anyone has some idea?  Please tell me I overlook something obvious here.  (I haven't had success yet with gdb/lldb)

Regardless, I want to suggest some things:

(a) we should get rid of the dynamic loading of libunwind/libcorkscrew. Some devices don't ship it. Instead, we should include it in the runtime. I think it's worth the extra footprint (if that is the concern why it wasn't done in the first place).

(b) we should handle dumping the registers and memory areas ourselves, so it's in our control and doesn't depend on the mood of some particular Android implementation.  Especially hexdumping around PC could be useful (combined with tools like [5]).

(c) print /proc/$PID/maps, so we get a useful mapping of the currently loaded modules.  this gives us some backup in case libunwind fails (at least I failed to get the base address of loaded modules like libmonosgen.so from the logs?).

Comments? :-)

Thanks,
-Bernhard

[1] https://github.com/mono/mono/blob/94b8270e9bdbd9280de1ec144af20877d8c8d055/mono/mini/mini-exceptions.c#L2348
[2] https://gist.github.com/lewurm/8203e7087c72388a820f67502eca19fd
[3] https://gist.github.com/lewurm/4130c23742bc5694898d2c39ced29e52
[4] https://github.com/mono/mono/blob/94b8270e9bdbd9280de1ec144af20877d8c8d055/mono/mini/mini-exceptions.c#L2434
[5] https://onlinedisassembler.com/odaweb/