[Mono-devel-list] String constants and localization

Mon Jul 14 17:34:16 EDT 2003

Hi

First, some facts again:
* By the time the Mono class libraries are complete, they will probably
contain about 1-2MB of hardcoded strings.
* It is not possible to do ANY optimization or improvement on that other
than removing these strings.
* For a normal PC, 1-2MB is negligible today.
* For a memory limited device (e.g. a Palm, a PocketPC or a cell phone),
1-2MB permanently lost is HUGE (OK, maybe not for a PocketPC ;)

Things that I feel Mono is aiming at:
* Create a common code base that can be compiled into several 'versions'
like .Net 1.0, 1.1, Compact
* Mono wants to embed the entire string into the code base (e.g.
GetString("This is the original english text for the error NotValid"))

What MS did:
* They obviously have two different code bases.
* The .Net framework uses a short string identifier to identify strings
(e.g. GetString("Get_Error_NotValid"))
* The .Net compact framework seems to use an int value to identify strings
(e.g. GetString(45))
So it seems that they felt they could not afford the memory loss in the
compact framework.

What I suggested:
Using an enum value (after compiling it is represented as an int, because
the compiler compiles constant values into their native value):
(e.g. Developer sees/uses: GetString(MonoString.Get_Error_NotValid))
(e.g. After compiling, the assembly contains/calls: GetString(45))
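
To illustrate the idea, here is a minimal sketch in Python rather than C#.
The second enum member and the table contents are made up for the example.
In C# the enum constant would be folded into a plain int at compile time;
Python keeps the enum object around at runtime, so this only shows the
lookup by integer index, not the size saving.

```python
from enum import IntEnum


class MonoString(IntEnum):
    # Hypothetical identifiers; these names are what the developer types.
    Get_Error_NotValid = 0
    Get_Error_OutOfRange = 1


# Hypothetical string table, ordered so that entry i belongs to enum value i.
STRINGS_EN = [
    "This is the original english text for the error NotValid",
    "The value is out of range",
]


def get_string(string_id: int) -> str:
    """Look up a string by its integer index; no string key is involved."""
    return STRINGS_EN[string_id]


print(get_string(MonoString.Get_Error_NotValid))
```

A typo such as MonoString.Get_Error_NotValdi fails as soon as that line is
reached (AttributeError in Python); with a C# enum the compiler would
already reject it.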

Some calculations (assuming the translated strings are as long as the
english ones):

The absolute minimum size you can achieve (removing all strings, assuming
1MB of strings, without changing the code base):
Mono: 1000KB (cannot remove without removing every single string)
MS: estimated 250KB (assuming the identifier is average 1/4 of the string
itself)
MS Compact: about 40KB
Suggestion: about 40KB (assuming you remove the enumeration after compiling)

The minimum size you can achieve when using localization (one localized
resource set, assuming 1MB of strings):
Mono: 3000KB
MS: estimated 1500KB (assuming the identifier is average 1/4 of the string
itself)
MS Compact: about 1040KB
Suggestion: about 1290KB (assuming the enumeration entry is average 1/4 of
the string itself) (assuming you keep the enumeration after compiling)
Suggestion: about 1040KB (assuming you remove the enumeration after
compiling)
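
The figures for the localized case can be reproduced with a few lines of
arithmetic. This is only one reading of how they add up (keys stored in the
code, plus keys stored next to the translation, plus the translated text);
the 1290KB line is read here as the case where the enum names stay in the
assembly. The variable names are made up for the sketch.

```python
# All sizes in KB; the inputs are the assumptions stated in the text.
STRINGS = 1000         # English strings, and also one translated set
IDENT = STRINGS // 4   # string identifiers at about 1/4 of the text
INT_INDEX = 40         # int indexes, as estimated for the compact framework


def localized_size(keys_in_code: int, keys_kept_elsewhere: int) -> int:
    """Code-side keys + extra stored keys/names + one translated string set."""
    return keys_in_code + keys_kept_elsewhere + STRINGS


sizes = {
    "Mono": localized_size(STRINGS, STRINGS),
    "MS": localized_size(IDENT, IDENT),
    "MS Compact": localized_size(INT_INDEX, 0),
    "Suggestion (enum names kept)": localized_size(INT_INDEX, IDENT),
    "Suggestion (enum names removed)": localized_size(INT_INDEX, 0),
}

for name, kb in sizes.items():
    print(f"{name}: {kb}KB")
```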

RAM need at runtime when using localization for getting ONE/The first entry
(one localized resource set, full memory cache, assumed 1MB strings):
Mono: 2000KB (Hashtable implementation)
MS: estimated 1500KB (assuming the identifier is average 1/4 of the string
itself)
MS Compact: about 1040KB
Suggestion: about 1040KB

Typical RAM need at runtime when using localization for getting ONE/The
first entry (one localized resource set, index cached, assumed 1MB strings):
Mono: about 1040KB
MS: estimated 540KB (assuming the identifier is average 1/4 of the string
itself)
MS Compact: about 40KB
Suggestion: about 40KB

So we see two things:
Mono would use the most memory of all implementations (For the compiled
assembly as well as RAM for execution)
The memory need of the Mono implementation will never allow Mono to run on a
memory limited device. And there is NO way to do any optimization on the
assembly size.

Because it seems that I did not make my suggestion clear to some people, I
attach the following files to show it:
StringData.xml: Contains string definitions
StringData.bin: Contains compiled string information from StringData.xml
(starts with a 56-byte index table, look at it with a text editor)
MS.cs: Contains a simple implementation of the Suggestion with some sample
enum values, which assumes it is compiled into an assembly that has
StringData.bin added as resource; also contains implementation for direct
file access and fully cached direct file access.
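
For readers without the attachments, here is a rough sketch of what "an
index table followed by the strings" can look like. The layout below (a
count, then offsets, then UTF-8 text) is an assumption made up for this
example and is NOT the actual format of StringData.bin; it is written in
Python rather than C# to keep it short.

```python
import io
import struct


def compile_strings(strings):
    """Pack as: uint32 count, (count + 1) uint32 offsets, then UTF-8 data."""
    blobs = [s.encode("utf-8") for s in strings]
    offsets = [0]
    for blob in blobs:
        offsets.append(offsets[-1] + len(blob))
    header = struct.pack("<I", len(blobs))
    index = struct.pack(f"<{len(offsets)}I", *offsets)
    return header + index + b"".join(blobs)


def get_string(stream, string_id):
    """Fetch one string with a few small reads; nothing else is loaded."""
    stream.seek(0)
    (count,) = struct.unpack("<I", stream.read(4))
    if not 0 <= string_id < count:
        raise IndexError(string_id)
    # Entry i of the index table holds the start offset, entry i + 1 the end.
    stream.seek(4 + 4 * string_id)
    start, end = struct.unpack("<2I", stream.read(8))
    data_start = 4 + 4 * (count + 1)
    stream.seek(data_start + start)
    return stream.read(end - start).decode("utf-8")


data = compile_strings(["Value is not valid", "Index out of range"])
print(get_string(io.BytesIO(data), 1))
```

Because the key is an int, the lookup is a direct jump into the index table
and needs no search at all. Caching just the index table keeps RAM use small
while leaving the string data on disk.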

Additional comments are inline in the quoted text below.

> Hello!
>
> > I've read your answer, but it seems that at quite some points you
> > overlooked advantages (maybe I'm also wrong with any of these, but I
> > don't think so).
> > So I added some additional comments to it
>
> Thanks for getting back to me.

I'm back again ;)

> I do agree that there were various of advantages, but from a maintenance
> point of view, I did not get the feeling that those improvements were
> enough to justify the design change.  This is purely my personal
> feeling, so we should definitely continue exploring this topic.
>
> That being said, one of the successful policies we used in Gnumeric was
> that we aimed for maintainability, completeness and only in a third
> place about performance and memory consumption.
>
> This allowed us to focus on getting things done right, and getting the
> basic infrastructure in place.  And only later we did performance and
> memory improvements.  This turned out to be good, because most
> ahead-of-time optimizations turn out to be wrong.

I would not see this as ahead-of-time optimization, as it just opens up
possibilities for creating optimized implementations. Per se it is not an
optimization.

> Let me give you an example.  In the C# compiler I was very worried that
> using the various "Cast" classes was going to be very slow, and I
> decided that one day, I would rewrite the whole cast system to avoid
> these objects.
>
> Well, turns out that in the execution time profiles and in the memory
> profiles, these do not even show up.  So we are able to keep the elegant
> design, and profiling showed that the issues were elsewhere: in
> unsuspected places.
>
> So, anyways, after getting the philosophical bits out of the way, lets
> get to the meat of it:
>
> > * Much faster
> > * Much smaller Assembly size (see below)
> > * Much smaller RAM need
>
> These are the items that we can measure, and we will have to balance
> with the proposed changes, and the maintenance issues.
>
> > * More safe when programming because of compile errors for e.g. typos
>
> There are already tools in place to cope with these things.  For example,
> we can use gettext to pull the strings out, so this is actually an
> automated process, and one that existing translators are familiar with.

I don't know gettext, but I would assume that it cannot accomplish things
like:
* Determine which strings must not be translated
* Determine a level that indicates how important it is to translate a string
(Strings that are directly displayed to users are more important to
translate than e.g. Exception strings)
* Find typos in the strings (if the same string should be used at
multiple locations, but one has a typo/small difference)
* Find errors/typos at compile time

> The enumeration approach on the other hand, opens the doors to new
> problems in the build system and on the setup;  Maybe not terribly hard
> to fix, but they add to the plate.

I do not see any problems on setup if compiling the data as resource into
the assemblies, however I agree that it will make build a little more
complicated.

> > OK - but IMHO your solution just has two flaws:
> > * Reimplement the chunks we already have for handling resources
> >    in corlib to cope with all the CultureInfo bits (which is exactly
> >    what you wanted to avoid above)
>
> This piece is left intact.  The only change is that we have to expose an
> internal method that will perform the string -> index mapping in the
> ResourceReader without using the Enumerator-based API.
>
> This is fairly simple to do (and is in fact, what Microsoft does).

IMHO MS uses GetResourceStream to load the entire string table and index
mapping into memory

> > * Sooner or later you will always come to the GetResourceStream
> > function, which actually provides a memory stream, which is: loading
> > all things into memory (and if you want to provide a complete second
> > infrastructure for strings, then the work that has to be done would be
> > IMHO FAR more work than anything you might have to do to implement
> > something like my suggested solution)
>
> Well, we do not need to make GetResourceStream load everything into
> memory;  In fact, if this is the case today, it sounds like we should
> optimize that process as well.

This is what you are doing today and this is also what MS is doing, so I'm
not sure if it can be optimized.

> > Sorry but IMHO this is total overkill. You want to perform a binary
> > search DIRECTLY on a file containing an estimated 200KB of string values
> > EVERY time we do a string lookup. Are you sure this won't totally fry
> > your HDD? And what about if the assembly we are accessing is on e.g. a
> > network share that has slow access times?
>
> It works fine enough for Monodoc, I can really not tell the difference
> of disk access.  Now, lets assume we have 8k strings, that turns out to
> be 12 different seeks+reads, and for the later cases, they will probably
> hit the cache.

OK - just as an example: You are using a System.Windows.Forms PropertyGrid
to display a control.
The control has, say, 125 properties and methods (the very simple Button
control has that many) - each of these has an SRDescription.
At 12 seeks+reads per lookup, displaying the PropertyGrid for the Button
needs 125 x 12 = 1500 seeks+reads.
Now assume we are working over a (fast) network with 10ms latency.
Then we need 15 seconds! for that to load.
1500 seeks is a lot even on a HDD.
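
The same estimate as a small Python calculation. The function name is made
up; the inputs are the numbers from this example, and it assumes the worst
case where no lookup is answered from a cache.

```python
def load_time_seconds(lookups: int, seeks_per_lookup: int, latency_ms: float) -> float:
    """Total wait when every seek+read pays the full latency (no cache hits)."""
    total_seeks = lookups * seeks_per_lookup
    return total_seeks * latency_ms / 1000


# 125 descriptions, 12 seeks+reads per binary-search lookup, 10ms latency.
print(load_time_seconds(125, 12, 10))
```

With an int index and a cached index table, each lookup would need about one
read instead of 12, which in the same model gives 125 x 1 x 10ms = 1.25
seconds.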

> Gettext works like this today, and there are no complaints about fully
> localized systems today about the speed.  And keep in mind that with
> gettext, every app on the system is doing this process all the time.
>
> Anyways, the summary is that I do not think that deviating today from
> the .NET framework setup is worth it.
>
> Miguel.
>
>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: StringData.xml
Type: text/xml
Size: 2597 bytes
Desc: not available
Url : http://lists.ximian.com/pipermail/mono-devel-list/attachments/20030714/44b4bf18/attachment.xml 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: StringData.bin
Type: application/octet-stream
Size: 667 bytes
Desc: not available
Url : http://lists.ximian.com/pipermail/mono-devel-list/attachments/20030714/44b4bf18/attachment.bin 
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: MS.cs
Url: http://lists.ximian.com/pipermail/mono-devel-list/attachments/20030714/44b4bf18/attachment.pl