[Mono-dev] Integrating heap-shot in the new logging profiler

Massimiliano Mantione massi at ximian.com
Tue Dec 18 10:23:09 EST 2007


Hello,
this is a follow-up to a short brainstorming session I had with Paolo.

The problem is that integrating heap-shot into the new logging
profiler currently causes deadlocks.
The logging profiler must, from time to time, write to its
output file (obviously), and these operations are protected
by a lock: any thread that fills up its buffer will want to
write, and before the write the id mapping tables must be
updated and flushed, hence the need for a lock.
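
Roughly, the write path looks like the sketch below (a minimal
sketch with made-up helper names standing in for the existing
id-map and buffer-writing routines, not the actual profiler
code):

#include <pthread.h>

/* Hypothetical stand-ins for the existing id-map and buffer
 * writing routines. */
void update_and_flush_id_mappings (void);
void write_full_event_buffer (void);

static pthread_mutex_t log_mutex = PTHREAD_MUTEX_INITIALIZER;

static void
flush_current_buffer (void)
{
	pthread_mutex_lock (&log_mutex);
	/* The id mapping tables must be brought up to date and
	 * written out before the events that refer to those ids. */
	update_and_flush_id_mappings ();
	write_full_event_buffer ();
	pthread_mutex_unlock (&log_mutex);
}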

Heap-shot does most of its work in the gc event callback (at
the end of the mark phase); it too needs to access global data
structures and write to the file, so it wants to acquire the
same lock.

The issue is that the gc stops the world, including any thread
that is currently writing to the file (and therefore holds the
lock). So trying to acquire the lock inside the gc profiler
event will eventually deadlock.

After the brainstorming, the idea was to make the work in
the gc handler lock-free (which is needed anyway, because that
event handler is executed in a "controlled" state, with the
world stopped, so using locks there is asking for trouble...).

The "object allocated" event handlers will have the duty
of collecting the MonoObject* of all the allocated objects
in a central place (like a global array), and the gc event
handler will scan this array and do its work to support the
heap-shot functionality.
The work of the gc event handler will be made easier by the
fact that the world is stopped, so no lock will be needed
to be sure that the MonoObject* buffer content is "fixed".
Then the gc event handler will submit the writing job to
a helper thread, so that again no lock will be needed.
We already have this thread around (it is the one writing
the statistical events), so this is not a big complication.
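
As a rough sketch of that idea (all names and the array size
are made up for illustration, handle_full_object_array() is
only a placeholder, and InterlockedIncrement is assumed to be
the runtime's atomic increment):

#define OBJECT_ARRAY_SIZE 65536

static MonoObject *allocated_objects [OBJECT_ARRAY_SIZE];
static volatile gint32 next_object_slot = 0;

void handle_full_object_array (MonoObject *obj); /* hypothetical */

/* "object allocated" handler: reserve a slot atomically and
 * store the MonoObject* there. */
static void
record_allocated_object (MonoObject *obj)
{
	gint32 slot = InterlockedIncrement (&next_object_slot) - 1;

	if (slot < OBJECT_ARRAY_SIZE) {
		allocated_objects [slot] = obj;
	} else {
		/* Array full: this is where things get tricky
		 * (see the points below). */
		handle_full_object_array (obj);
	}
}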

With this approach, there are a few tricky issues:
- The threads must access the global MonoObject* array
  without stomping on each other's toes. Using an interlocked
  increment on the current element index is easy; handling
  the case where the array is full gets trickier.
- When objects are freed (during collections), their slots
  in the global array become invalid but are not easy to
  reuse, so when the array gets full it must be reallocated,
  so that the copy operation can pack its slots and fill
  the holes.
  Dealing with this reallocation in the gc event handler
  can be tricky.
- The writer thread will also need to see the array, so
  handling reallocations is tricky for it as well.
Note that the above tricky things are perfectly doable!
I'm just saying they are tricky.

After a bit of thinking, and considering how the current
code works ("current" in the logging profiler!), I think
that we could modify this approach a bit and make it much
easier.
In particular, I would get rid of the global array and just
reuse the per-thread data structures we already have (which
work very well).

Initially I was thinking of also reusing the current
per-thread buffers, but for various reasons it is better to
record object allocations in separate buffers.

Each per-thread struct ("ProfilerPerThreadData") will also
point to a list of buffers, where each buffer will have the
following data:
- a pointer to the next buffer in the list,
- a "handled by gc event" index, and
- the array of MonoObject* (the actual buffer).
Each per-thread profiler struct will have the following:
- a pointer to the first (current) MonoObject* buffer (cannot
  be NULL), and
- a "next free slot in MonoObject* buffer" index.
The "object allocated" profiler event will do the following:
- if the "next free slot" index is still inside the buffer,
  put the MonoObject* there and increment the index,
- otherwise, allocate a new buffer and put it in the head
  of the list (easy to do atomically), and put the MonoObject*
  there.
- If there are buffers at the end of the list that have the
  "handled by gc event" flag set to true, unlink them from the
  list (again, easy to do atomically) and free them.
We could think about recycling the buffers instead of freeing
them, but this is just a detail.
The main point is: the buffer queue is *entirely* maintained
by the relevant thread. The current buffer is always the 1st,
the others are always full.
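
In code, the per-thread side could look roughly like the
sketch below (only "ProfilerPerThreadData" is an existing
name; the buffer struct, the buffer size and
free_fully_handled_buffers() are made-up illustrations of the
scheme described above):

#define OBJECT_BUFFER_SLOTS 4096

typedef struct _AllocatedObjectBuffer AllocatedObjectBuffer;
struct _AllocatedObjectBuffer {
	AllocatedObjectBuffer *next; /* next (older, full) buffer */
	int handled_by_gc; /* slots already consumed by the gc handler */
	MonoObject *objects [OBJECT_BUFFER_SLOTS];
};

typedef struct _ProfilerPerThreadData ProfilerPerThreadData;
struct _ProfilerPerThreadData {
	/* ... the fields the struct already has ... */
	AllocatedObjectBuffer *object_buffers; /* head = current buffer, never NULL */
	int next_free_slot; /* next free slot in the head buffer */
};

void free_fully_handled_buffers (ProfilerPerThreadData *data); /* hypothetical */

/* "object allocated" event, executed by the allocating thread
 * itself, so the list is only ever modified by its own thread. */
static void
object_allocated (ProfilerPerThreadData *data, MonoObject *obj)
{
	if (data->next_free_slot == OBJECT_BUFFER_SLOTS) {
		/* Current buffer full: push a new head buffer (the gc
		 * handler only reads the list, so publishing the new
		 * head is a single pointer store). */
		AllocatedObjectBuffer *new_buffer = g_new0 (AllocatedObjectBuffer, 1);
		new_buffer->next = data->object_buffers;
		data->object_buffers = new_buffer;
		data->next_free_slot = 0;

		/* Buffers at the tail that the gc handler has fully
		 * consumed can be unlinked and freed (or recycled). */
		free_fully_handled_buffers (data);
	}
	data->object_buffers->objects [data->next_free_slot++] = obj;
}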

The gc event handler must simply scan the list of per-thread
structures, and inside each of them also scan the buffer list,
working on each MonoObject* and producing the data for the
writer thread (updating the "handled by gc event" index to mark
how far it "consumed" the MonoObject* values); a sketch of this
scan follows this paragraph.
Paolo and I were thinking of reusing the same buffers for the
writer thread (with no copies), but this is not simple: the
gc event handler must take the heap snapshot, and for each
MonoObject* it must build the "list" of reference fields with
their values (list, array, whatever...).
All these values must be written to the output file, but their
snapshot must be taken inside the gc handler, while the world
is stopped.
So we simply *need* more memory than what we have in the
MonoObject* buffers, unless we handle them with tricks,
reserving slots for the reference field contents.
This would effectively make the MonoObject* buffers almost as
large as the heap itself, but it is doable anyway...
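
The scan itself could look roughly like this (again just a
sketch: per_thread_data_list, snapshot_object() and
hand_snapshot_to_writer_thread() are hypothetical names, the
per-thread structs are assumed to be reachable as a linked
list, and snapshot_object() stands for whichever of the two
snapshot-storage options we pick):

/* Runs inside the gc event callback, with the world stopped,
 * so no locks are taken. */
static void
heap_shot_at_gc_end (void)
{
	ProfilerPerThreadData *data;

	for (data = per_thread_data_list; data != NULL; data = data->next) {
		AllocatedObjectBuffer *buffer;

		for (buffer = data->object_buffers; buffer != NULL; buffer = buffer->next) {
			/* Only the head buffer can be partially filled. */
			int limit = (buffer == data->object_buffers) ?
				data->next_free_slot : OBJECT_BUFFER_SLOTS;
			int i;

			for (i = buffer->handled_by_gc; i < limit; i++) {
				/* Record class, size and reference fields
				 * for the writer thread. */
				snapshot_object (buffer->objects [i]);
			}
			/* Mark how far this buffer was consumed. */
			buffer->handled_by_gc = limit;
		}
	}
	/* Submit the snapshot to the helper thread that writes. */
	hand_snapshot_to_writer_thread ();
}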

IMHO, for simplicity the gc event handler should build yet
another set of buffers, containing the heap snapshot, and pass
them to the writer thread.
But the more I think about it, the more I see that reusing the
buffers is feasible, even if tricky.
The tradeoff is unclear, because reserving space in the buffers
will waste slots for references that in the end will be null,
and would take no space at all in the heap snapshot.
On the other hand, we would not need to "copy" buffers...
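
Just to make the "yet another set of buffers" option concrete,
one possible shape for a snapshot entry could be something like
the following (purely illustrative, not a proposed format):

/* One entry in the separate snapshot buffers: filled by the gc
 * handler while the world is stopped, serialized later by the
 * writer thread. */
typedef struct {
	MonoObject *object;   /* the object (also usable as its id) */
	MonoClass  *klass;    /* its class */
	guint32     size;     /* its size in bytes */
	guint32     num_refs; /* number of reference values that follow */
	MonoObject *refs [];  /* the non-NULL referenced objects */
} HeapSnapshotEntry;

With entries like this the null references are simply never
stored (which is the space advantage mentioned above), at the
cost of copying the data out of the MonoObject* buffers.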

In either case, the writer thread would then write the snapshot
to the output file.

Comments?
  Massi
