[Mono-dev] Dictionary`2: optimized and serialization-compatible with MS.net

Juraj Skripsky js at hotfeet.ch
Mon Jun 11 16:46:46 EDT 2007


Hi Paolo,

Thanks for your detailed comments! I'll take a closer look at them as
soon as my tired brain allows it.

I've cleaned up and packaged my different versions of Dictionary along with
my test cases. Just extract the attached archive into an empty directory
and run "make". The test cases are in "DictBench.cs".

Juraj


On Mon, 2007-06-11 at 19:34 +0200, Paolo Molaro wrote:
> On 06/11/07 Juraj Skripsky wrote:
> > I changed the code accordingly, getting rid of the hashcodes completely.
> > Unfortunately, the change results in a slowdown which is often
> > considerable (especially for string keys):
> > 
> > Dictionary<int, int>:       6-12%
> > Dictionary<double, double>: 25-35%
> > Dictionary<string, int>:    30-80% *
> 
> Uhm, these results are similar to what you already posted with the
> GetHashCode() calls included. They are suspicious: they suggest we
> have lots of collisions (bad hash functions) and/or that the tests
> involve lots of presence checks that fail but happen to hash to the
> same bucket with different hashcodes. Can you post the test cases?
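
One quick way to test the collision hypothesis is to bucket the raw
hash codes the way the dictionary would and look at the chain lengths.
A rough sketch (the table size and key set below are placeholders, not
taken from the actual tests):

    // Rough collision diagnostic: distribute hash codes over a fixed
    // number of buckets and report the longest chain. A chain much
    // longer than the average load points at a poor hash function for
    // this key set.
    using System;

    class CollisionCheck {
        static void Main() {
            const int buckets = 12289;   // placeholder prime table size
            int[] counts = new int[buckets];
            for (int i = 0; i < 100000; i++) {
                string key = "key" + i;                    // placeholder keys
                int h = key.GetHashCode() & int.MaxValue;  // clear the sign bit
                counts[h % buckets]++;
            }
            int longest = 0, empty = 0;
            foreach (int c in counts) {
                if (c > longest) longest = c;
                if (c == 0) empty++;
            }
            Console.WriteLine("longest chain: {0}, empty buckets: {1}",
                              longest, empty);
        }
    }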
> 
> > For key types with slow Equals() methods (e.g. String.Equals, whose
> > running time is proportional to the length of the common prefix),
> > the slowdown will always be substantial.
> 
> As you pointed out, the Equals() methods return at the first
> difference, so if we see lots of expensive Equals() calls it is again
> an indication of a poor hash function. I suspect some of the overhead
> comes from the way Hashtable and Dictionary eventually call Equals...
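
For reference, the early-exit behaviour being discussed boils down to
something like this (a simplified sketch, not the corlib
implementation; EqualsSketch is a made-up name):

    // Simplified early-exit equality: the cost is proportional to the
    // length of the common prefix, so a full-length scan happens only
    // for strings that are equal or nearly so. Lots of expensive
    // Equals() calls therefore mean near-identical keys are landing in
    // the same bucket.
    static bool EqualsSketch(string a, string b) {
        if ((object)a == (object)b) return true;
        if (a == null || b == null || a.Length != b.Length) return false;
        for (int i = 0; i < a.Length; i++)
            if (a[i] != b[i]) return false;   // stop at first difference
        return true;
    }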
> 
> > I would vote for keeping the hashcode around, as my guess (read: gut
> > feeling) for the average slowdown in realistic scenarios would be
> > around 30%.
> 
> Well, if we exclude the rehash, which shouldn't be a big deal, we have
> two major cases: a hashtable hit and a hashtable miss.
> A well-behaved hashtable will have no collisions, or very few anyway
> (I don't think it's worth optimizing for the worst case; if collisions
> are the issue, we should improve the hashing functions instead).
> 
> So in the hit case we save an array element access, a compare/branch,
> and the need to keep the hashcode in a register (it's a value that
> gets used inside a loop, so we try to keep it in a register): using
> hashcodes here must be slower, even if just a bit.
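
To make the trade-off concrete, the two inner loops being compared look
roughly like this (a sketch with made-up field names such as table,
links and hashes, not Mono's actual implementation):

    using System.Collections.Generic;

    class DictSketch<K, V> {
        int[] table;      // bucket heads, -1 = empty
        int[] links;      // next slot in the chain, -1 = end
        int[] hashes;     // cached hash code per slot (variant A only)
        K[] keys;
        V[] values;
        IEqualityComparer<K> comparer;

        // Variant A: cached hashcode. Each visited slot costs an extra
        // array read and compare, and h stays live across the loop,
        // but Equals() is skipped whenever the cached codes differ.
        public bool TryGetA(K key, out V value) {
            int h = comparer.GetHashCode(key) & int.MaxValue;
            for (int i = table[h % table.Length]; i != -1; i = links[i]) {
                if (hashes[i] == h && comparer.Equals(keys[i], key)) {
                    value = values[i];
                    return true;
                }
            }
            value = default(V);
            return false;
        }

        // Variant B: no cached hashcode. h is dead once the bucket is
        // picked, freeing a register and a whole array, but every slot
        // in the chain pays a full Equals() call.
        public bool TryGetB(K key, out V value) {
            int h = comparer.GetHashCode(key) & int.MaxValue;
            for (int i = table[h % table.Length]; i != -1; i = links[i]) {
                if (comparer.Equals(keys[i], key)) {
                    value = values[i];
                    return true;
                }
            }
            value = default(V);
            return false;
        }
    }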
> 
> In the miss case the hash code is only relevant if the bucket is not
> empty and the values stored there happen to have different hashcodes:
> keeping the hashcodes is a win in the miss case with densely populated
> and relatively small hashtables (the bigger the hashtable, the less
> likely it is that two actually different hash codes hit the same
> bucket).
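
As a rough worked example, assuming well-distributed hash codes: in a
table with about 10,000 buckets, two actually different hash codes land
in the same bucket with probability of roughly 1 in 10,000, so the
cached-code compare mostly earns its keep in small, dense tables, as
noted above.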
> 
> So we have two different cases that favour two different
> implementations (the first of which also has a memory-usage advantage
> that doesn't show up in performance benchmarks).
> Which of the two scenarios we should prefer is debatable: some apps
> may see a large number of hits while others see a large number of
> misses. What are people's expectations here? Would you optimize for a
> hit or for a miss in a hashtable lookup?
> 
> lupus
> 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Dictionary.tar.gz
Type: application/x-compressed-tar
Size: 10982 bytes
Desc: not available
URL: http://lists.ximian.com/pipermail/mono-devel-list/attachments/20070611/d0d1958f/attachment.bin

