[Mono-devel-list] The first (attempt to checkin) managed collation patch

Wed Jul 20 18:05:02 EDT 2005

Some messages from your neighborhood optimizer :-).

On Thu, 2005-07-21 at 02:12 +0900, Atsushi Eno wrote:
> Hello,
> But besides the patch, I need to checkin 7 prebuilt binary resource
> files in mcs/class/corlib directory, though they can be built when
> you run "make" in mcs/class/corlib/Mono.Globalization.Unicode. [*1]
> I put all the binaries here:
> http://monkey.workarea.jp/tmp/20050720

Some things I noticed while viewing the files in a hex editor:

      * There are extremely long runs of the same char in many instances
      * The file seems to have tons of 0 bytes.
      * There are some runs of consecutive, incrementing values:

0002bfb0: 3c00 3d00 3e00 3f00 4000 4100 4200 4300  <.=.>.?.@.A.B.C.
0002bfc0: 4400 4500 4600 4700 4800 4900 4a00 4b00  D.E.F.G.H.I.J.K.
0002bfd0: 4c00 4d00 4e00 4f00 5000 5100 5200 5300  L.M.N.O.P.Q.R.S.
0002bfe0: 5400 5500 5600 5700 5800 5900 5a00 5b00  T.U.V.W.X.Y.Z.[.

        though they are somewhat smaller than the runs of the same char.
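
To make the observations above concrete, here is a minimal sketch (in
Python for brevity, not the C# the patch uses) of how repeated values
and incrementing sequences could be collapsed. It assumes the table is
a flat array of little-endian 16-bit values, which is only a guess based
on the dump; encode_runs, decode_runs and read_u16_le are hypothetical
names, not anything in the patch.

```python
import struct


def read_u16_le(data):
    """Interpret bytes as little-endian 16-bit values (assumed layout)."""
    n = len(data) // 2
    return list(struct.unpack("<%dH" % n, data[: n * 2]))


def encode_runs(values):
    """Collapse values into (start, step, count) runs.

    step 0 covers a repeated value, step 1 an incrementing sequence.
    Anything else becomes a run of length 1.
    """
    runs = []
    i = 0
    n = len(values)
    while i < n:
        start = values[i]
        step = 0
        count = 1
        if i + 1 < n and values[i + 1] - start in (0, 1):
            step = values[i + 1] - start
            count = 2
            while i + count < n and values[i + count] == start + step * count:
                count += 1
        runs.append((start, step, count))
        i += count
    return runs


def decode_runs(runs):
    """Expand (start, step, count) runs back into the original values."""
    out = []
    for start, step, count in runs:
        out.extend(start + step * k for k in range(count))
    return out
```

A real format would still need an index so that lookups do not have to
walk every run, so treat this as a way to estimate the redundancy rather
than as a proposed layout.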

As a datapoint:

[benm@omega ~]$ du -h collation.core.bin*
188K    collation.core.bin
16K     collation.core.bin.bz2
16K     collation.core.bin.gz

Obviously, the size a general-purpose compressor reaches on something
like this is a theoretical lowest size we could get. But the fact that
we can get over 10x compression on the file means there is quite a bit
of room for improvement.
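
The same kind of measurement can be scripted. This is a rough sketch
in Python, using zlib and bz2 from the standard library as stand-ins
for gzip and bzip2 (raw zlib output has slightly less header overhead
than a .gz file, and du reports disk blocks rather than exact bytes).
synthetic_table is a made-up stand-in for the real collation.core.bin,
built only from the patterns visible in the dump, so its numbers say
nothing about the real file.

```python
import bz2
import struct
import zlib


def compression_report(data):
    """Return the raw and compressed sizes of a blob, in bytes."""
    return {
        "raw": len(data),
        "zlib": len(zlib.compress(data, 9)),
        "bz2": len(bz2.compress(data, 9)),
    }


def synthetic_table():
    """Build hypothetical test data: zero runs plus an incrementing run."""
    seq = b"".join(struct.pack("<H", v) for v in range(0x3C, 0x5C))
    return (b"\x00" * 4096 + seq) * 40


if __name__ == "__main__":
    report = compression_report(synthetic_table())
    for name, size in report.items():
        print("%-5s %d bytes" % (name, size))
```

To measure a real file instead, pass open(path, "rb").read() to
compression_report.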

It'd be nice to optimize the format *before* we check in the binary
files, since optimizing will require some frequent changes.

> When this managed collation is enabled, it will eat huge managed
> resource

How are you reading in the resource? Do you use it as a byte [] array?
In mscorlib, when we load a resource, it icalls into the runtime, which
returns a void*/IntPtr to the data. This pointer comes from the mmap'd
assembly location. It would be a very large win either to use the stream
directly (i.e., with a BinaryReader that did seeking) or to use unsafe
operations on the void*. This has two advantages:
      * Only pages of data that are ever read get paged in by the OS.
        Other things can just stay on the disk.
      * Pages can be shared between mono processes.
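
As an illustration of the idea (in Python rather than C#, and not what
the runtime does internally), a table can be read straight out of a
read-only memory map instead of being copied into an array. MappedTable
is a hypothetical name, and the flat array of 16-bit values is an
assumption about the layout. Spelling out the byte order in the format
string ("<H") is one way to keep such a reader endian safe.

```python
import mmap
import struct


class MappedTable:
    """Read 16-bit little-endian values from a file through mmap.

    Only the pages that are actually indexed get faulted in by the OS,
    and read-only mappings of the same file can be shared by processes.
    """

    def __init__(self, path):
        self._file = open(path, "rb")
        self._map = mmap.mmap(self._file.fileno(), 0, access=mmap.ACCESS_READ)

    def __len__(self):
        return len(self._map) // 2

    def __getitem__(self, index):
        if not 0 <= index < len(self):
            raise IndexError(index)
        # "<H" fixes the byte order, so the result does not depend on the host.
        return struct.unpack_from("<H", self._map, index * 2)[0]

    def close(self):
        self._map.close()
        self._file.close()
```

Usage would look like table = MappedTable("collation.core.bin"),
followed by table[i] for the lookups and table.close() at the end.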

> I can make this into unmanaged header file if we want.

If we read the data this way, it wouldn't be that big of a win to make
it an unmanaged header. The only win that I can think of there is that
we can do arch-specific stuff, avoiding the need to be endian safe (btw,
have you tested this on a ppc box?).

It might be nice to put the files in /usr/share. A few things we win by
doing that:

      * We can split the files across packages (does everybody need the
        Asian languages data?)
      * It keeps the size of our rpms down because the data isn't
        duplicated for 1.0 and 2.0 mscorlib, nor is it triplicated for
        mono, libmono.so and libmono.a
      * It keeps the size of our tarballs and monolites down because the
        included mscorlib does not have the data

Of course, if the compression can make this data small, we wouldn't need
to think about this :-).

-- Ben