[Mono-dev] Mono CI weather report 9/22
anmccl at microsoft.com
Thu Sep 22 22:29:34 UTC 2016
What this is: The Mono team has a CI (continuous integration) system which builds and runs automated tests on every commit checked in to git (specifically the master branch). We have a test log viewer<https://jenkins.mono-project.com/view/All/job/jenkins-testresult-viewer/Test_Result_View/> on Jenkins that tracks the results (currently only accessible to github project admins, sorry). Once a week I sweep through and write an email with a list of the most frequently-failing automated tests. This is both so that everyone on the team is aware of our current stability level, and so that when people see failures in the github PR tests they know whether to treat them as known bugs or new failures. In the interest of making our development process more open, I’ve started crossposting this weekly email on the public mailing list.
Stability in the C9/master tests is pretty bad lately and we’ve not been producing many green builds. The top test failures are the same as they’ve been the last couple of weeks (and mostly things that aren’t user-impacting: they only happen on the builder, or happen during process shutdown), but there are some new, less frequent bugs that are very worrisome. Please do not miss the bugs from 4 on; these bugs are new and some do not have owners.
The top recurring failures currently ruining Jenkins builds are:
Filed as https://bugzilla.xamarin.com/show_bug.cgi?id=43172 , currently assigned to Marcos Heinrich.
This has been failing for a pretty long time. It only occurs on Linux but on Linux it fails over 20% of the time. (It has also been seen on Android.) It is possible this is only an issue in CI (see akoeplinger note in bug).
The failure is consistent and looks like:
System.Exception : Could not abort registered blocking threads before closing socket.
at System.Net.Sockets.SafeSocketHandle.RegisterForBlockingSyscall () [0x00057] in /mnt/jenkins/workspace/test-mono-mainline-linux/label/ubuntu-1404-amd64/mcs/class/System/System.Net.Sockets/SafeSocketHandle.cs:114
at System.Net.Sockets.Socket.SendFile_internal (System.Net.Sockets.SafeSocketHandle safeHandle, System.String filename, System.Byte[] pre_buffer, System.Byte[] post_buffer, System.Net.Sockets.TransmitFileOptions flags) [0x00000] in /mnt/jenkins/workspace/test-mono-mainline-linux/label/ubuntu-1404-amd64/mcs/class/System/System.Net.Sockets/Socket.cs:2944
at System.Net.Sockets.Socket.SendFile (System.String fileName, System.Byte[] preBuffer, System.Byte[] postBuffer, System.Net.Sockets.TransmitFileOptions flags) [0x00028] in /mnt/jenkins/workspace/test-mono-mainline-linux/label/ubuntu-1404-amd64/mcs/class/System/System.Net.Sockets/Socket.cs:2893
On ARM64 only, when this test calls ChannelServices.UnregisterChannel(), sometimes a KeyNotFoundException is generated somewhere in the guts of Socket.Close. This is filed as https://bugzilla.xamarin.com/show_bug.cgi?id=43727 . It is possible this is the same issue as #2 above (see akoeplinger note in bug).
2. ThreadAbortException in System.Threading.Timer+Scheduler.SchedulerThread
Filed as https://bugzilla.xamarin.com/show_bug.cgi?id=43320 , currently assigned to Rodrigo.
This occurs in many different places but the crash message always looks the same. It is believed to be existing bad behavior brought into the light by recent fixes by Vargaz around finalizers and VM shutdown.
System.TypeInitializationException: The type initializer for 'System.Collections.Generic.List`1' threw an exception. ---> System.Threading.ThreadAbortException
--- End of inner exception stack trace ---
at System.Threading.Timer+Scheduler.SchedulerThread () [0x0000f] in <filename unknown>:0
at System.Threading.ThreadHelper.ThreadStart_Context (System.Object state) [0x00017] in <filename unknown>:0
at System.Threading.ExecutionContext.RunInternal (System.Threading.ExecutionContext executionContext, System.Threading.ContextCallback callback, System.Object state, System.Boolean preserveSyncCtx) [0x0008d] in <filename unknown>:0
at System.Threading.ExecutionContext.Run (System.Threading.ExecutionContext executionContext, System.Threading.ContextCallback callback, System.Object state, System.Boolean preserveSyncCtx) [0x00000] in <filename unknown>:0
at System.Threading.ExecutionContext.Run (System.Threading.ExecutionContext executionContext, System.Threading.ContextCallback callback, System.Object state) [0x00031] in <filename unknown>:0
at System.Threading.ThreadHelper.ThreadStart () [0x0000b] in <filename unknown>:0
[MVID] 0deb57f9de664ff681556c641423618d 0,1,2,3,4,5
[ERROR] FATAL UNHANDLED EXCEPTION: Nested exception trying to figure out what went wrong
Some places this failure is seen include MonoTests.gshared.generic-marshalbyref.2.exe, MonoTests.runtime.bug-415577.exe, and as an unknown-test failure when a test suite (such as mcs/class/corlib) is shutting down.
https://jenkins.mono-project.com/job/test-mono-mainline/label=osx-i386/4656/parsed_console/log_content.html#WARNING1 (test shutdown)
3. __icall_wrapper_mono_gc_alloc_vector crash
Filed as https://bugzilla.xamarin.com/show_bug.cgi?id=43921 , currently assigned to Aleksey.
There are two problems here:
1 There appears to be a race condition in coop; Aleksey is looking at this.
2 There appears to be a problem where we are not scanning pointers in SIMD registers. If a memory copy, such as the one in __icall_wrapper_mono_gc_alloc_vector, happens to use a SIMD register, and the copy is interrupted by a GC, it will lead to memory corruption. This issue is being targeted by https://github.com/mono/mono/pull/3364 , which is still under development.
The symptom we see is SIGSEGVs in a range of tests related to domain unloading, or thread creation around the same time as the GC stopping the world. This symptom occurs on Mac only; we think this is because Mac clang is more aggressive and is optimizing our memory copy routine to use SIMD instructions.
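To make the second problem concrete, here is a toy sketch of a collector that treats register contents as roots. It is not Mono's SGen code; the function name, the register lists, and the addresses are all hypothetical and exist only to show why a pointer that lives solely in a SIMD register at the moment of a GC can be lost.

```python
# Toy model only: this is NOT Mono's collector. All names are hypothetical.

def collect(heap, general_regs, simd_regs, scan_simd):
    """Return the set of heap addresses that survive a collection.

    heap         -- set of addresses of allocated objects
    general_regs -- values held in general-purpose registers when the GC stops the thread
    simd_regs    -- values held in SIMD registers at the same moment
    scan_simd    -- whether the collector treats SIMD registers as roots
    """
    roots = list(general_regs)
    if scan_simd:
        roots.extend(simd_regs)
    # Any root value that equals a heap address keeps that object alive.
    return {addr for addr in heap if addr in roots}


heap = {0x1000, 0x2000}
# A memory copy is interrupted mid-flight: the only reference to 0x2000
# currently sits in a SIMD register.
general_regs = [0x1000, 42]
simd_regs = [0x2000]

survivors_without = collect(heap, general_regs, simd_regs, scan_simd=False)
survivors_with = collect(heap, general_regs, simd_regs, scan_simd=True)

print(sorted(hex(a) for a in survivors_without))  # ['0x1000']
print(sorted(hex(a) for a in survivors_with))     # ['0x1000', '0x2000']
```

In the first case the object at 0x2000 is reclaimed while the interrupted copy still holds its address, so when the copy resumes it writes a dangling pointer. That would be consistent with the memory corruption described above.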
3.5 (?). AppDomain.internalUnload crash
This is also Mac-only; Aleksey is looking into whether it is the same failure as #X.
https://jenkins.mono-project.com/job/test-mono-mainline/label=osx-amd64/4812/testReport/ (both failures)
Crashes, managed stack looks like:
at (wrapper managed-to-native) System.AppDomain.InternalUnload (int) <0x00012>
at System.AppDomain.Unload (System.AppDomain) [0x00011] in /Users/builder/jenkins/workspace/test-mono-mainline/label/osx-i386/mcs/class/corlib/System/AppDomain.cs:1200
at MonoTests.System.AppDomainTest.TearDown () [0x0000b] in /Users/builder/jenkins/workspace/test-mono-mainline/label/osx-i386/mcs/class/corlib/Test/System/AppDomainTest.cs:71
at (wrapper runtime-invoke) object.runtime_invoke_void__this__ (object,intptr,intptr,intptr) <IL 0x0004f, 0x00092>
4. Tarjan GC bridge crashing for armel binaries on ARM64 host
sgen-bridge.exe and sgen-bridge-major-fragmentation.exe, which run with a simulated version of the Android GC bridge, have in the last week started segfaulting about 1/3 of the time on the ARM soft float build (but never on any other platform). I think the crashes are 1:1 with the soft float binary being run on an ARM64 builder (this only happens some percentage of the time; it depends on what’s available on the load balancer). The stacks are consistent.
Filed as https://bugzilla.xamarin.com/show_bug.cgi?id=44397 , currently assigned to me.
5. Crash doing thread join while closing process after ServiceModel tests
Both crashes and hangs have been seen recently while running the ServiceModel test suites. This has been seen on both Mac and Linux (I think it likes to crash on Mac and hang on Linux?). Nothing is filed. When the crash occurs, it happens in the test runner itself, while it is waiting for the tests to finish.
Managed stack looks like:
at (wrapper managed-to-native) System.Threading.Thread.JoinInternal (System.Threading.Thread,int) <IL 0x00014, 0x00067>
at System.Threading.Thread.Join () [0x00000] in /Users/builder/jenkins/workspace/test-mono-mainline/label/osx-amd64/mcs/class/referencesource/mscorlib/system/threading/thread.cs:697
at NUnit.Core.TestRunnerThread.Wait () [0x00010] in /Users/builder/jenkins/workspace/test-mono-mainline/label/osx-amd64/mcs/nunit24/NUnitCore/core/TestRunnerThread.cs:118
at NUnit.Core.ThreadedTestRunner.Wait () [0x0000b] in /Users/builder/jenkins/workspace/test-mono-mainline/label/osx-amd64/mcs/nunit24/NUnitCore/core/ThreadedTestRunner.cs:63
at NUnit.Core.ThreadedTestRunner.EndRun () [0x00000] in /Users/builder/jenkins/workspace/test-mono-mainline/label/osx-amd64/mcs/nunit24/NUnitCore/core/ThreadedTestRunner.cs:55
at NUnit.Core.ThreadedTestRunner.Run (NUnit.Core.EventListener,NUnit.Core.ITestFilter) [0x00008] in /Users/builder/jenkins/workspace/test-mono-mainline/label/osx-amd64/mcs/nunit24/NUnitCore/core/ThreadedTestRunner.cs:36
at NUnit.Core.ProxyTestRunner.Run (NUnit.Core.EventListener,NUnit.Core.ITestFilter) [0x00007] in /Users/builder/jenkins/workspace/test-mono-mainline/label/osx-amd64/mcs/nunit24/NUnitCore/core/ProxyTestRunner.cs:133
at NUnit.Core.RemoteTestRunner.Run (NUnit.Core.EventListener,NUnit.Core.ITestFilter) [0x0002b] in /Users/builder/jenkins/workspace/test-mono-mainline/label/osx-amd64/mcs/nunit24/NUnitCore/core/RemoteTestRunner.cs:63
Native stack, when we get one, looks like:
0 mono 0x00000001073a7d5a mono_handle_native_sigsegv + 282
1 libsystem_platform.dylib 0x00007fff91ff152a _sigtramp + 26
2 ??? 0x00000001081b9a00 0x0 + 4430993920
3 mono 0x000000010756f763 mono_os_cond_timedwait + 163
4 mono 0x000000010756e326 mono_w32handle_timedwait_signal_handle + 358
5 mono 0x000000010756e0e1 mono_w32handle_wait_one + 897
6 mono 0x00000001075535f9 wapi_WaitForSingleObjectEx + 9
7 mono 0x00000001074a2cfe ves_icall_System_Threading_Thread_Join_internal + 174
6. ServiceModel contract tests fail with the wrong exception
In about 4 random tests over a week, all on Linux, a test of the contract capability in ServiceModel failed with an ObjectDisposedException where it expected a contract-wrong exception. Filed, not assigned, but in a Class Libraries component so Marek Safar is aware of it: https://bugzilla.xamarin.com/show_bug.cgi?id=44650
7. handle_ops[type] exception in w32handle on thread abort
In about 4 random tests over a week, all on Mac, the regression test for bug 561239 is failing with the assert
"Assertion at w32handle.c:809, condition `handle_ops [type]' not met"
while aborting a thread.
Filed, not assigned: https://bugzilla.xamarin.com/show_bug.cgi?id=44651
8. JIT exception during XSL tests
In about 4 random tests over a week, all on ARM64, one of the XSL tests fails with
"Assertion at mini-arm64.c:937, condition `arm_is_bl_disp ((code), (target))' not met"
We have a managed stack but not a native one.
Filed, not assigned: https://bugzilla.xamarin.com/show_bug.cgi?id=44659
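For anyone who wants to check a PR test log against the known failures listed above, a minimal sketch follows. The signatures and bug numbers are copied from this email; the `KNOWN_FAILURES` table and the `triage` helper are hypothetical names and not part of any Mono or Jenkins tooling. Source line numbers in the assertion messages are matched loosely because they may drift between commits.

```python
import re

# Signatures taken from the failure text quoted in this email, paired with
# the bugzilla IDs listed alongside them. The table itself is illustrative.
KNOWN_FAILURES = [
    ("43172", re.compile(r"Could not abort registered blocking threads before closing socket")),
    ("43320", re.compile(r"ThreadAbortException[\s\S]*Timer\+Scheduler\.SchedulerThread")),
    ("44651", re.compile(r"Assertion at w32handle\.c:\d+, condition `handle_ops \[type\]' not met")),
    ("44659", re.compile(r"Assertion at mini-arm64\.c:\d+, condition `arm_is_bl_disp")),
]


def triage(log_text):
    """Return the bug IDs whose signature appears in log_text.

    An empty list means nothing matched, so the failure may be new.
    """
    return [bug for bug, pattern in KNOWN_FAILURES if pattern.search(log_text)]


sample = "System.Exception : Could not abort registered blocking threads before closing socket."
print(triage(sample))              # ['43172']
print(triage("all tests passed"))  # []
```

A match only says that the log resembles a known failure; it does not rule out a new bug that happens to produce the same message.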