[Mono-devel-list] The first (attempt to checkin) managedcollation patch

Wed Jul 20 18:29:04 EDT 2005

On Thu, 2005-07-21 at 00:12 +0200, Kornél Pál wrote:
> > From: "Ben Maurer"
> >      * There are extremely long runs of the same char in many instances
> >      * The file seems to have tons of 0 bytes.
> >      * There are some runs of sequences:
> >
> > 0002bfb0: 3c00 3d00 3e00 3f00 4000 4100 4200 4300  <.=.>.?.@.A.B.C.
> > 0002bfc0: 4400 4500 4600 4700 4800 4900 4a00 4b00  D.E.F.G.H.I.J.K.
> > 0002bfd0: 4c00 4d00 4e00 4f00 5000 5100 5200 5300  L.M.N.O.P.Q.R.S.
> > 0002bfe0: 5400 5500 5600 5700 5800 5900 5a00 5b00  T.U.V.W.X.Y.Z.[.
> >
> >        though they are somewhat smaller than the runs of the same char.
> 
> I see the problem as follows: if the file contains Unicode characters
> it eats disk space but is fast to read, so sorting is fast.
> If it is compressed but unbuffered, sorting is slow and eats CPU.
> If it is buffered, either because it is compressed or "just for fun", it
> eats RAM.

Compression does not mean "use bzip" in this context. It means "change
the file format so that we don't need long runs".
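
To make "change the file format" concrete, here is a rough sketch in
Python of one way such a table could be stored without long runs. It is
only an illustration, not the format of the actual patch: the names
encode_runs and decode_runs and the (start, step, count) layout are
made up for this example. A step of 0 covers the runs of the same
value, and a step of 1 covers incrementing sequences like the
3c 3d 3e ... dump quoted above.

```python
from typing import Iterable, List, Tuple

# Hypothetical record layout: (start value, step, count).
# step 0 = run of the same value, step 1 = incrementing sequence.
Run = Tuple[int, int, int]


def encode_runs(values: Iterable[int]) -> List[Run]:
    """Collapse constant and incrementing runs into (start, step, count)."""
    runs: List[Run] = []
    for v in values:
        if runs:
            start, step, count = runs[-1]
            if count == 1 and v - start in (0, 1):
                # The second element decides the kind of run.
                runs[-1] = (start, v - start, 2)
                continue
            if count > 1 and v == start + step * count:
                # The value continues the current run.
                runs[-1] = (start, step, count + 1)
                continue
        # Otherwise start a new single-element run.
        runs.append((v, 0, 1))
    return runs


def decode_runs(runs: Iterable[Run]) -> List[int]:
    """Expand (start, step, count) records back into the flat table."""
    out: List[int] = []
    for start, step, count in runs:
        out.extend(start + step * i for i in range(count))
    return out


# Toy table: a long run of zeros, then the 0x3c..0x5b sequence.
table = [0] * 1000 + list(range(0x3C, 0x5C))
packed = encode_runs(table)
assert decode_runs(packed) == table
```

A real lookup would not decode the whole table. It would search the
records for the one that covers the index it wants, which is the "few
extra instructions" being traded against touching an extra page.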

Compression will quite possibly make things faster:
      * Reading from disk is SLOOOOOOOOOW. In the time it takes to
        access one extra page from the disk, we could have done *tons*
        of sorts. Please see http://rlove.org/talks/rml_guadec_2005.ppt,
        slide 3.
      * Cache misses are slow (but not as slow). So a few extra
        instructions may well be worth avoiding one.

-- Ben