[Mono-dev] Dictionary`2: optimized and serialization-compatible with MS.net

Paolo Molaro lupus at ximian.com
Mon Jun 11 13:34:44 EDT 2007


On 06/11/07 Juraj Skripsky wrote:
> I changed the code accordingly, getting rid of the hashcodes completely.
> Unfortunately, the result of the change is a slowdown which is often
> considerable (especially for string keys):
> 
> Dictionary<int, int>:       6-12%
> Dictionary<double, double>: 25-35%
> Dictionary<string, int>:    30-80% *

Uhm, these results are similar to what you already posted with the
GetHashCode() calls included. They are suspicious: they suggest that we
have lots of collisions (bad hash functions) and/or that the tests
involve lots of presence checks that fail but happen to hash to the same
bucket as existing entries with different hashcodes. Can you post the
test cases?
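
One quick way to check that, short of posting the full test cases, would be
to wrap the key comparer and count how many Equals() calls each lookup really
makes; a high ratio would point at collisions rather than at the removal of
the stored hashcodes. A rough, hypothetical sketch (CountingComparer and the
key patterns are made up, not anyone's actual benchmark):

using System;
using System.Collections.Generic;

// Wraps the default comparer and counts calls, so a benchmark can report
// how many Equals()/GetHashCode() calls each lookup really costs.
class CountingComparer<T> : IEqualityComparer<T>
{
    readonly IEqualityComparer<T> inner = EqualityComparer<T>.Default;
    public long EqualsCalls;
    public long HashCalls;

    public bool Equals (T x, T y)
    {
        EqualsCalls++;
        return inner.Equals (x, y);
    }

    public int GetHashCode (T obj)
    {
        HashCalls++;
        return inner.GetHashCode (obj);
    }
}

class CollisionProbe
{
    static void Main ()
    {
        CountingComparer<string> cmp = new CountingComparer<string> ();
        Dictionary<string, int> dict = new Dictionary<string, int> (cmp);
        for (int i = 0; i < 100000; i++)
            dict ["key" + i] = i;

        cmp.EqualsCalls = 0; // only count the lookup phase
        int dummy;
        long lookups = 0;
        for (int i = 0; i < 100000; i++) {
            dict.TryGetValue ("key" + i, out dummy);   // hit
            dict.TryGetValue ("miss" + i, out dummy);  // miss
            lookups += 2;
        }
        Console.WriteLine ("Equals calls per lookup: {0:F2}",
            (double) cmp.EqualsCalls / lookups);
    }
}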

> For key types with slow Equals() methods (e.g. String.Equals, whose
> running time is proportional to the length of the matching prefix),
> the slowdown will always be substantial.

As you pointed out, the Equals() methods return at the first difference,
so if we see lots of Equals calls it is again an indication of a poor
hash function. I suspect some of the overhead comes from the way
Hashtable and Dictionary eventually call Equals...
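
To give an idea of what that per-probe indirection costs, here is a rough,
hypothetical micro-benchmark sketch (not from this thread; the class name
EqualsDispatchCost is made up) comparing a direct String.Equals() call with
one that goes through an IEqualityComparer<string>, which is roughly the
path Dictionary takes for every candidate key in a bucket:

using System;
using System.Collections.Generic;
using System.Diagnostics;

class EqualsDispatchCost
{
    static void Main ()
    {
        string a = "a reasonably long key string for the test";
        string b = new string (a.ToCharArray ()); // equal content, different instance
        const int N = 10000000;
        bool r = false;

        Stopwatch sw = Stopwatch.StartNew ();
        for (int i = 0; i < N; i++)
            r ^= String.Equals (a, b);             // direct call
        Console.WriteLine ("direct:       {0} ms", sw.ElapsedMilliseconds);

        IEqualityComparer<string> cmp = EqualityComparer<string>.Default;
        sw = Stopwatch.StartNew ();
        for (int i = 0; i < N; i++)
            r ^= cmp.Equals (a, b);                // interface dispatch, as Dictionary does
        Console.WriteLine ("via comparer: {0} ms", sw.ElapsedMilliseconds);

        if (r)
            Console.WriteLine ();                  // keep the loops from being optimized away
    }
}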

> I would vote for keeping the hashcode around, as my guess (read: gut
> feeling) for the average slowdown in realistic scenarios would be
> around 30%.

Well, if we exclude the rehash, which shouldn't be a big deal, we have
two major cases: a hashtable hit and a hashtable miss.
A well-behaved hashtable will have no collisions, or very few anyway
(I don't think it's worth optimizing for the worst case; we should try
to improve the hashing functions instead if that is what causes the issue).

So in the hit case we save an array element access, a compare/branch,
and the need to keep the hashcode in a register (it's a value that gets
used in the loop, so we try to keep it in a register): using hashcodes
here must therefore be slower, even if only by a bit.

In the miss case the stored hashcode only matters if the bucket is not
empty and the keys stored there happen to have different hashcodes:
keeping the hashcodes is a win in the miss case with densely populated
and relatively small hashtables (the bigger the hashtable, the less
likely it is that two different hashcodes end up in the same bucket).
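
To make the tradeoff concrete, here is a minimal sketch of the two probe
loops being compared. This is hypothetical code with made-up names
(TinyStringTable, no growth or rehash), not the actual Dictionary source:

class TinyStringTable
{
    struct Entry
    {
        public int hash;     // stored hashcode (only used by variant A)
        public string key;
        public int value;
        public int next;     // index of next entry in the same bucket, -1 ends the chain
    }

    int[] buckets;           // bucket -> index of first entry, -1 if empty
    Entry[] entries;
    int count;

    public TinyStringTable (int capacity)
    {
        buckets = new int [capacity];
        for (int i = 0; i < capacity; i++)
            buckets [i] = -1;
        entries = new Entry [capacity];  // sized up front: no rehash in this sketch
    }

    public void Add (string key, int value)
    {
        int hash = key.GetHashCode () & int.MaxValue;
        int bucket = hash % buckets.Length;
        entries [count].hash = hash;
        entries [count].key = key;
        entries [count].value = value;
        entries [count].next = buckets [bucket];
        buckets [bucket] = count++;
    }

    // Variant A: keep the hashcode in the entry. Costs an extra load and a
    // compare/branch per candidate, but skips Equals() whenever two
    // different hashcodes happen to share a bucket (helps the miss case).
    public bool TryGetWithStoredHash (string key, out int value)
    {
        int hash = key.GetHashCode () & int.MaxValue;
        for (int i = buckets [hash % buckets.Length]; i >= 0; i = entries [i].next) {
            if (entries [i].hash == hash && entries [i].key.Equals (key)) {
                value = entries [i].value;
                return true;
            }
        }
        value = 0;
        return false;
    }

    // Variant B: no stored hashcode. One less load and one less branch per
    // candidate (helps the hit case), but every candidate in the bucket
    // pays a full Equals() call.
    public bool TryGetWithoutStoredHash (string key, out int value)
    {
        int hash = key.GetHashCode () & int.MaxValue;
        for (int i = buckets [hash % buckets.Length]; i >= 0; i = entries [i].next) {
            if (entries [i].key.Equals (key)) {
                value = entries [i].value;
                return true;
            }
        }
        value = 0;
        return false;
    }
}

Variant A only pays off when the hash check actually prunes Equals() calls,
i.e. when different hashcodes share a bucket; variant B is strictly cheaper
when the first candidate already is the right key.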

So we have two different cases that favour two different
implementations (the first of which also has a memory-usage advantage
that doesn't show up in performance benchmarks).
Which of the two scenarios we should prefer is debatable:
some apps may see a large number of hits while others see a large
number of misses. What are people's expectations here? Would you
optimize for a hit or for a miss in a hashtable lookup?

lupus

-- 
-----------------------------------------------------------------
lupus at debian.org                                     debian/rules
lupus at ximian.com                             Monkeys do it better


