[Mono-dev] String.GetHashCode()

Paolo Molaro lupus at ximian.com
Mon Dec 3 08:13:32 EST 2007


On 12/01/07 Robert Jordan wrote:
> > Since strings are immutable, shouldn't it be possible to compute the
> > hashcode once and store it rather than recomputing it over and over again?
> > Is there some really obvious reason that i can't think of that would make
> > this a bad idea?
> 
> This idea already came up a couple of years ago (2002 or so).
[...]
> Therefore, the optimization should consider the string length
> and append the hash code after the string's bytes
> only if its length is greater than some amount (1K or so).

I'll reply to this mail and try to summarize all the responses and
the constraints we need to obey.

1) we can't change the C-side MonoString struct, because it is part of
the Mono ABI
2) we should try not to increase memory usage
3) we should try not to slow down the case where the hash code is
calculated for the first time on a string instance
4) we can't use UTF-8 to hold the string chars because this would break
the mscorlib/CLI ABI (see how fixed on a string is compiled by the C#
compiler, for example)
5) GetHashCode should never be called for a string that is not yet
fully built (as in StringBuilder), so there is no worry about the
string changing after the hash code field has been set

Point 1 means that if we were to add a new field to the string, it could
only be added at the end, after the character data (and the terminating
null char). It also means that if we use an int, we have to be careful
to align it properly (which in turn means more instructions to access
the field).
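
Something like this, to make it concrete (illustrative layout and names
only, not the real MonoString declaration):

    #include <stddef.h>
    #include <stdint.h>

    /* illustrative layout only: object header, length, then the
       UTF-16 character data */
    typedef struct {
        void    *vtable;          /* MonoObject: vtable pointer */
        void    *synchronisation; /* MonoObject: sync pointer */
        int32_t  length;          /* number of UTF-16 code units */
        uint16_t chars[1];        /* chars[length] is the trailing NUL */
        /* per constraint 1, a new hash field could only live here,
           after chars[length], at a string-dependent offset */
    } StringSketch;

    /* locate the appended 32-bit hash: note the extra arithmetic
       (and the 4-byte alignment fixup) compared to a fixed offset */
    static int32_t *
    hash_slot (StringSketch *str)
    {
        size_t offset = offsetof (StringSketch, chars)
            + ((size_t)str->length + 1) * sizeof (uint16_t);
        offset = (offset + 3) & ~(size_t)3; /* align to 4 bytes */
        return (int32_t *)((char *)str + offset);
    }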

About memory usage: objects in Mono are always aligned on 8-byte
boundaries, so in some cases the overhead is 50%, 33%, 25%, or 20%
(for strings up to 15 chars) and sometimes it is 0. I would
say that, on average, the memory overhead is 15% for up to 20 chars,
5% for up to 100 chars, and it doesn't matter above that.
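
As a rough sanity check on those percentages, assuming a 32-bit layout
(8-byte object header, 4-byte length, 2 bytes per char plus a 2-byte
NUL, allocations rounded up to 8 bytes; the exact header sizes may
differ):

    #include <stdio.h>

    #define ALIGN8(x) (((x) + 7u) & ~7u)

    int main (void)
    {
        /* assumed: 8 (header) + 4 (length) + 2*(n+1) bytes of chars,
           i.e. 14 + 2n bytes before rounding; +4 for an int hash */
        for (unsigned n = 1; n <= 15; n++) {
            unsigned base = ALIGN8 (14 + 2 * n);
            unsigned with_hash = ALIGN8 (14 + 2 * n + 4);
            printf ("%2u chars: %2u -> %2u bytes (overhead %u%%)\n",
                    n, base, with_hash,
                    (with_hash - base) * 100 / base);
        }
        return 0;
    }

This prints exactly the 50/33/25/20/0 pattern mentioned above.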

After having written the moving GC, where quickly computing an object's
size is important, I would also try to avoid allocating the extra field
only for big strings (a variable layout would also slow down string
allocation somewhat).

So, given the above constraints and considerations, I suggest the
following:
1) wait for the moving GC to become the default: in that case we can use
the same mechanism used to store object hash codes in the object header.
This means no extra memory usage for strings and just a small slowdown
when inserting the hash code in the header (it needs a locked op).
Implementing this mechanism with the current GC would slow down some
common operations (lock/unlock) a bit; with the moving GC we would take
the hit anyway.
2) alternatively, use a 16-bit hash code field: the memory overhead is
smaller (5-10% for up to 20 chars, 2-3% for up to 100 chars on average)
and it requires no alignment calculation, so it's fast to access. The
drawback is that for very big hashtables (> 100 K entries) we'd have too
few bits. A cheap way to get more bits would be to OR the hash with the
first char shifted left 16 bits (see the sketch below), though this
wouldn't help for strings that all begin with the same char. I'm sure
this drawback is not very significant, but I'm also sure someone will
write a specific benchmark to exploit this weakness and whine about it
in some blog or on the list :)
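
The widening trick for option 2 would be something like this (the
hash16 field and the names are made up for illustration):

    #include <stdint.h>

    /* OR the cached 16-bit hash with the first char shifted left
       16 bits: cheap extra bits, but degenerate when all strings
       share the same first char */
    static uint32_t
    widened_hash (const uint16_t *chars, int32_t length,
                  uint16_t hash16)
    {
        uint32_t hash = hash16;
        if (length > 0)
            hash |= (uint32_t)chars[0] << 16;
        return hash;
    }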

Option 2 is very simple to implement, so we could easily get some numbers (hint, hint).

lupus

-- 
-----------------------------------------------------------------
lupus at debian.org                                     debian/rules
lupus at ximian.com                             Monkeys do it better


