[Mono-list] How many strings do I have?

Jonathan Gilbert 2a5gjx302@sneakemail.com
Thu, 07 Oct 2004 19:21:29 -0400


At 03:49 PM 07/10/2004 -0400, you wrote:
>This is funny...
>
>http://weblogs.asp.net/bclteam/archive/2004/09/07/226487.aspx
>
>When I compile it with Visual Studio and run in on Windows [.NET 1.1], I get
>this:
>
>Hello
>Xello
>Not Found
>
>On GNU/Linux and Mono (mcs) I get this:
>
>Hello
>Xello
>string is Hello

The reason the 'switch' block doesn't work in .NET has to do with the way
'switch' on strings is implemented, as well as the implementation details
of the string intern pool.

'switch' performs about the same with strings as it does with ints. How
does it do this? It has to do with the fact that every separate object on
the system has an object ID, which can be treated as an 'int'. This alone
isn't enough, but coupled with the string intern pool, it is possible to
come up with a "definitive" object ID that represents a given string.

When the JIT is compiling code, by definition it is required to add every
literal string to the current AppDomain's string intern pool. This is a
very simple process; String.Intern is called for each string, and that
function operates in a simple way: It has a hash table of strings, and if a
string object with exactly the same value is not in the hash table, then it
adds it, otherwise it leaves the existing instance alone. In other words,
within a given run of a program, String.Intern("Hello") should *always*
return the same instance of the string "Hello".

Java also has a string intern pool, but the language does not support
'switch' on strings. The reason for this is that it lacks a critical
function that .NET has. The string intern pool should generally be kept
clean; you don't want to go interning every string you get. However, Java's
intern pool has only one function to access it: java.lang.String::intern().
This function, like .NET's System.String::Intern(), has the post-condition
that *some* instance of the string you passed be in the pool; it adds the
string you pass if it isn't there. Once a string is in the pool, it can
NEVER be removed (until the AppDomain is disposed). To permit 'switch' on
strings without polluting the string intern pool, therefore, .NET has an
additional method, System.String::IsInterned(). It never modifies the
intern pool, and returns 'null' if the string is not in the pool. If the
string IS in the pool, however, it returns the interned instance. If the
string being checked is one of the string literals in a 'case' label for
the 'switch' block, then by definition, it will return the same instance as
the one referenced by the 'case' label (since the interpreter is required
to intern all string literals when it loads the assembly).

After this, it is a simple matter of treating the object ID of the interned
strings as integers, and doing an integer 'switch' block.

Now, the implementation of the intern pool, for best performance, is almost
always going to be a hash table. A hash table takes the hash of the key to
determine the slot into which to put the data, but more than one value of
'string' can have the same 'GetHashCode' return value, so the hash table
needs to take collisions into account. Closed hashing certainly does work,
but open hashing is a lot easier to implement. This means that each slot in
the hash table is a list; once you hash the string, you need to check that
it is actually in the list. This is where the difference between the MS.NET
and mono implementations comes into play.

>From the output you paste, it looks to me like MS.NET uses a fully-managed
string intern pool. That is, it uses the same interned strings themselves
as the entries in the hash table. When the constant is changed from "Hello"
to "Xello", the 'switch' statement's call to 'IsInterned' hashes 'Hello',
finds a list with 1 entry, but that entry's data is 'Xello' (remember that
the intern pool is using the same constant that was modified by the test
program, which is technically not a valid program). Therefore, .NET can't
find the string in the intern pool, and it returns 'null', and the 'switch'
block fails.

On the other hand, mono's implementation seems to use unmanaged code (or at
least a separate instance of the string). Inside the hash table, the
entries in the list for each slot are NOT the actual interned 'string'
objects. When the lookup is done, the instance in the list found in the
corresponding slot in the hash table still says 'Hello', even though the
interned string object now contains the text 'Xello'. As such, it is able
to find the instance, and the 'switch' block succeeds. The intern pool says
that the instance with that object ID contains the text 'Hello', which is
why 'case "Hello":' is reached, even though the value of "Hello" is no
longer "Hello" :-)

Does that make sense?

Jonathan Gilbert