[Mono-devel-list] The first (attempt to checkin) managedcollation patch
Ben Maurer
bmaurer at ximian.com
Wed Jul 20 18:29:04 EDT 2005
On Thu, 2005-07-21 at 00:12 +0200, Kornél Pál wrote:
> > From: "Ben Maurer"
> > * There are extremely long runs of the same char in many instances
> > * The file seems to have tons of 0 bytes.
> > * There are some runs of sequences:
> >
> > 0002bfb0: 3c00 3d00 3e00 3f00 4000 4100 4200 4300 <.=.>.?.@.A.B.C.
> > 0002bfc0: 4400 4500 4600 4700 4800 4900 4a00 4b00 D.E.F.G.H.I.J.K.
> > 0002bfd0: 4c00 4d00 4e00 4f00 5000 5100 5200 5300 L.M.N.O.P.Q.R.S.
> > 0002bfe0: 5400 5500 5600 5700 5800 5900 5a00 5b00 T.U.V.W.X.Y.Z.[.
> >
> > though they are somewhat smaller than the runs of the same char.
>
> I see the problem as the following: If the file contains Unicode
> characters it eats disk space but is fast to read, thus sorting is fast.
> If it is compressed but unbuffered sorting is slow and eats CPU.
> If it's buffered either because it is compressed or "just for fun" it eats
> RAM.
Compression does not mean `use bzip' in this context. It means "change
the file format so that we don't need long runs".
Compression will quite possibly make things faster:
* Reading from disk is SLOOOOOOOOOW. In the time it takes to
access one extra page from the disk, we could have done *tons*
of sorts. Please see http://rlove.org/talks/rml_guadec_2005.ppt,
slide 3.
* Cache misses are slow (but not as slow). So a few extra
instructions may well be worth it to avoid one.
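To make "change the file format so that we don't need long runs" concrete,
here is a minimal sketch in Python. It is not the managed collation code and
not a proposed file format; the names encode_runs and lookup are made up for
illustration. It assumes the table is a flat list of integers and that runs
are either one repeated value (step 0) or an ascending sequence (step 1),
like the two patterns in the hex dump quoted above.

```python
from bisect import bisect_right


def encode_runs(values):
    """Collapse a table into (start_index, first_value, step) runs.

    step is 0 for a run of one repeated value and 1 for an ascending
    sequence such as 0x3c, 0x3d, 0x3e, ...
    """
    runs = []
    i = 0
    n = len(values)
    while i < n:
        step = 0
        j = i + 1
        # Only open a multi-element run if the next value repeats or
        # increments by one; otherwise this element is a run of length 1.
        if j < n and values[j] - values[i] in (0, 1):
            step = values[j] - values[i]
            while j < n and values[j] - values[j - 1] == step:
                j += 1
        runs.append((i, values[i], step))
        i = j
    return runs


def lookup(runs, starts, index):
    """Return the original table entry at index without expanding the runs.

    starts must be the list of start indices taken from runs, in order.
    """
    k = bisect_right(starts, index) - 1  # last run starting at or before index
    start, first, step = runs[k]
    return first + step * (index - start)
```

A lookup costs one binary search over the run list plus a multiply and an
add. That is the "few extra instructions" side of the trade; the other side
is that the run list is far smaller than the expanded table, so it touches
fewer pages and cache lines.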
-- Ben