[Mono-dev] String.GetHashCode Discussion.

Wed Jul 16 19:13:04 EDT 2008

Hello,

Here are some data which hopefully bring some light to this topic. I 
didn't measure uniqueness of hashcodes as I consider it less important 
as range distribution. Enclosed you can find the result for you data 
with old and new string.GetHashCode implementation.

Some numbers to look at:

** Old implementation

Highest contentions: 2985, 1155, 1137

** Current implementation

Highest contentions: 236, 217, 210

Marek

> I have been asked to move this discussion to the e-mail list from IRC.
>  Basically we (my company and I) have new unit tests errors in 2.0
> that did not occur at 1.9.  The errors were traced to the
> String.GetHashCode change.  I had one of our interns (Mike) research
> the change and I wanted to share his findings with you.  His final
> document s attached.  I can also send the test case Mike used to
> generate his data.  I believe it is too large to send to all of your
> inboxes.
>
> http://docs.google.com/Doc?id=dhnnms3p_3fpmfq8dm
>
> (1:28:15 PM) holmes: Have there been any discussions about r106887
> which was a change to String.GetHashCode?
> (1:31:54 PM) holmes: This change is causing some of our tests to fail.
>  I will admit that our tests are relying on the sting hash code a bit
> too much.  However after some investigation I am wondering if the new
> algorithm is generating a unique enough result.  (I say that as
> respectfully as I can.)
> (1:34:33 PM) holmes: We conducted a short test where 58000 English
> string were tested and the new code only results in 60% unique results
> compared to the previous code's performance of 100%
> (1:38:36 PM) holmes: We have found the current algorithm only
> considers the last ~6 characters of a string, which may be the
> problem.  That is the last thing I will say here I promise.
> (1:42:21 PM) kumpera: marek: how have you tested the distribution of
> your new string hashing algorithm?
> (1:42:54 PM) marek: kumpera: I used english dictionary
> (1:43:57 PM) marek: holmes: how big is your dictionary?
> (1:45:05 PM) holmes: marek: 58110
> (1:46:12 PM) holmes: I think the real problem is not that a word
> hashes unique enough as a sentence producing a unique result.
> (1:46:56 PM) holmes: the last 6 characters define the whole algorithm.
>  (or at least I think so)
> (1:56:50 PM) lupus: it's likely better to revert the change, string
> hashing is also supposed to be synced with the implementation in C in
> the runtime anyway
>
> -bill
> _______________________________________________
> Mono-devel-list mailing list
> Mono-devel-list at lists.ximian.com
> http://lists.ximian.com/mailman/listinfo/mono-devel-list
>
>   

-------------- next part --------------
A non-text attachment was scrubbed...
Name: MonoStringHash.csv
Type: text/csv
Size: 28813 bytes
Desc: not available
Url : http://lists.ximian.com/pipermail/mono-devel-list/attachments/20080717/83376fd5/attachment-0001.bin