[Mono-devel-list] The first (attempt to checkin) managed collation patch

Atsushi Eno atsushi at ximian.com
Wed Jul 20 22:02:30 EDT 2005

Hola Opti-Ben,

Ben Maurer wrote:
> Some messages from your neighborhood optimizer :-).

Thanks ;-)

> Some things I noticed while viewing the files in a hex editor:
>       * There are extremely long runs of the same char in many instances
>       * The file seems to have tons of 0 bytes.
>       * There are some runs of sequences:
> 0002bfb0: 3c00 3d00 3e00 3f00 4000 4100 4200 4300  <.=.>.?.@.A.B.C.
> 0002bfc0: 4400 4500 4600 4700 4800 4900 4a00 4b00  D.E.F.G.H.I.J.K.
> 0002bfd0: 4c00 4d00 4e00 4f00 5000 5100 5200 5300  L.M.N.O.P.Q.R.S.
> 0002bfe0: 5400 5500 5600 5700 5800 5900 5a00 5b00  T.U.V.W.X.Y.Z.[.
>         though they are somewhat smaller than the runs of the same char.

Regarding the quality of the sortkey data content: the arrays are
already optimized to be usable directly as the live arrays used in
the code:

1) The sequences of zeros are not that big, I believe, at least now.
I have implemented an index-based table-mapping optimizer that skips
such runs of zeros. I could introduce more indexes to optimize the
table further, but that would harm performance by adding extra
codepoint comparisons.
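
To illustrate the trade-off, here is a rough sketch of such an
index-based table (the ranges and values below are made up, not the
actual collation data): zero runs are never stored, at the cost of
one extra binary-search step per lookup.

```python
import bisect

# Hypothetical index map: (first_codepoint, last_codepoint, offset_into_values).
# Codepoints outside every range map to 0 without storing the zeros.
ranges = [(0x0041, 0x005A, 0),    # A-Z (made-up values)
          (0x0061, 0x007A, 26)]   # a-z (made-up values)
values = list(range(1, 27)) + list(range(1, 27))

starts = [r[0] for r in ranges]

def lookup(cp):
    # Binary search for the range whose start is <= cp ...
    i = bisect.bisect_right(starts, cp) - 1
    if i >= 0:
        start, end, off = ranges[i]
        if cp <= end:
            return values[off + (cp - start)]
    return 0  # falls into one of the skipped zero runs

print(lookup(0x0041))  # 1
print(lookup(0x0040))  # 0 (not stored anywhere)
```

Each additional index range saves storage but adds one more
comparison on the lookup path, which is the performance concern
mentioned above.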

2) Actually there are a couple of reasons the table _looks_
extraneously massive. A large chunk of the "repetition" comes from
Hangul Syllables and CJK ideographs. Due to the silly Windows sortkey
design, the computation is not direct. (The details are described
at http://monkey.workarea.jp/lb/archive/2005/6-25.html )

The sequences of similar data appear because they are ushort arrays
that map ASCII characters (they are used to support
CompareOptions.IgnoreWidth). The mappings are *mostly* in the ASCII
range, but there are still a couple of non-ASCII ones. I don't think
that table is large now.

> As a datapoint:
> [benm at omega ~]$ du -h collation.core.bin*
> 188K    collation.core.bin
> 16K     collation.core.bin.bz2
> 16K     collation.core.bin.gz
> Obviously, the compressed size of something like this is a theoretical
> lower bound on what we could get. But the fact that we get over 10x
> compression on the file means there is quite a bit of room for
> improvement.
> It'd be nice to optimize the format *before* we check in the binary
> files, since optimizing will require some frequent changes.

Regarding the quality of the data storage: yes, the tables could be
made smaller. They could be much smaller if I introduced even simple
run-length compression.

But it also means that the live arrays (used in the collator code)
would have to be created separately, instead of pointing directly
into the managed resources. I wonder if that makes sense.
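
A minimal sketch of the trade-off being discussed (this is generic
run-length coding, not the actual format used): the long zero runs
compress very well on disk, but the table must be decompressed into
a separate heap array before use.

```python
def rle_compress(data: bytes) -> bytes:
    """Encode as (run_length, byte) pairs; runs are capped at 255."""
    out = bytearray()
    i = 0
    while i < len(data):
        j = i
        while j < len(data) and j - i < 255 and data[j] == data[i]:
            j += 1
        out += bytes([j - i, data[i]])
        i = j
    return bytes(out)

def rle_decompress(blob: bytes) -> bytes:
    # This rebuilt array is the "live array apart from the resource".
    out = bytearray()
    for k in range(0, len(blob), 2):
        out += bytes([blob[k + 1]]) * blob[k]
    return bytes(out)

# A toy table dominated by zero runs, like the collation data:
table = b"\x00" * 100 + b"ABC" + b"\x00" * 50
packed = rle_compress(table)
assert rle_decompress(packed) == table
print(len(table), len(packed))  # 153 10
```

The compressed form can no longer be indexed in place, which is why
decompression into a fresh managed array would be unavoidable.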

>>When this managed collation is enabled, it will eat huge managed
> How are you reading in the resource? Do you use it as a byte [] array?
> In mscorlib, when we load a resource, it icall's into the runtime which
> returns a void*/intptr to the data. This pointer comes from the mmap'd
> assembly location. It would be a very large win to use either the stream
> directly (ie, with a binaryreader that did seeking) or using unsafe
> operations on the void*. This has two advantages:
>       * Only pages of data that are ever read get paged in by the OS.
>         Other things can just stay on the disk
>       * Pages can be shared between mono processes.

Well, what I meant was the "managed heap" (so it creates managed
resources anyway, regardless of whether you use managed collation
or not).

The corresponding code is already in svn. It creates a BinaryReader
instance for each manifest resource stream, and for byte arrays it
does Read(array, 0, size).

For ushort arrays it calls ReadUInt16(). Except for one case
(the "widthCompat" array) it would be possible to split a ushort
array into two byte arrays.
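
The reading scheme above can be sketched like this (the blob layout
and length prefixes here are illustrative, not the actual resource
format): byte arrays are read in one call, ushort arrays element by
element, matching BinaryReader.ReadUInt16()'s little-endian behavior.

```python
import io
import struct

# A made-up resource blob: a length-prefixed byte[] followed by a
# length-prefixed ushort[], both little-endian like BinaryWriter output.
blob = io.BytesIO(
    struct.pack("<I", 3) + bytes([10, 20, 30])
    + struct.pack("<I", 2) + struct.pack("<2H", 0xFF21, 0x0041)
)

def read_byte_array(r):
    # Analogous to BinaryReader + Read(array, 0, size): one bulk read.
    n = struct.unpack("<I", r.read(4))[0]
    return r.read(n)

def read_ushort_array(r):
    # Analogous to calling ReadUInt16() once per element.
    n = struct.unpack("<I", r.read(4))[0]
    return [struct.unpack("<H", r.read(2))[0] for _ in range(n)]

print(list(read_byte_array(blob)))                 # [10, 20, 30]
print([hex(v) for v in read_ushort_array(blob)])   # ['0xff21', '0x41']
```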

>>I can make this into unmanaged header file if we want.
> If we do this, it wouldn't be that big of a win to make it an unmanaged
> header. The only win that I can think of there is that we can do
> arch-specific stuff, avoiding the need to be endian safe (btw, have you
> tested this on a ppc box?).

I don't have a PPC box. I use BinaryWriter to create those resources,
calling Write(byte), Write(int), etc., and BinaryReader to read the
manifest resource stream, with ReadByte(), ReadUInt16(),
ReadInt32(), etc.

If BinaryWriter.Write() (for anything other than a byte parameter)
wrote its stream output in a different byte order depending on the
platform, or BinaryReader read the stream likewise, how could I
detect that platform-dependent byte order?
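
For what it's worth, BinaryWriter and BinaryReader are documented to
always use little-endian byte order, regardless of the host CPU, so
the resources should read back identically on PPC. The same idea can
be demonstrated with Python's struct module: an explicitly specified
byte order ("<") is independent of the native one ("=").

```python
import struct
import sys

value = 0x12345678

native = struct.pack("=I", value)  # host-dependent byte order
little = struct.pack("<I", value)  # fixed little-endian, any host

print(sys.byteorder)
print(little.hex())  # '78563412' on every platform
```

On a big-endian machine such as PPC, `native` would differ from
`little`, but the explicitly little-endian bytes stay the same.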

> It might be nice to put the files in /usr/share. A few things we win by
> doing that:

How can we get the precise file location, especially when we specify
a different GAC from which to reference mscorlib?

>       * It keeps the size of our tarballs and monolites down because the
>         included mscorlib does not have the data

Similarly, if the collation resources are split out, then CompareInfo
in mscorlib can get out of sync with them. It would be similar to
what happens when we have inconsistent versions of mscorlib.dll and
the runtime.

> Of course, if the compression can make this data small, we wouldn't need
> to think about this :-).

Heh, yes ;-)

> Of course, one non-performance advantage of this is that it is easier
> for people to test your bug fixes (and easier for you as well!).

Actually, for debugging purposes, it ("make" under
Mono.Globalization.Unicode) also generates code that contains the
full arrays as managed code (created as
MSCompatUnicodeTableGenerated.cs under that directory). The file
looks like this:
(warning: it is about 1MB)

It is a complete drop-in replacement for the version that uses
managed resources.

Atsushi Eno
