[MonoDevelop] Code completion problem

Lluis Sanchez lluis at ximian.com
Sat Jun 17 09:24:47 EDT 2006


Hi,

> I have stumbled upon a very disturbing problem. When I realized that 
> IClass doesn't have a list of IReturnType for its base classes I thought 
> 'why not change the string collection of base types into a list of 
> IReturnType' - surely, it is needed if we want generics in MonoDevelop. 
> Well, since this was the first compromising change of the current 
> architecture (the serialization process of IClass instances changed with 
> this), I had to change some things in the Persistent<Type> classes. 
> Soon, after having to change the behaviour of dozens of other classes, I 
> realized that we have a very serious problem.
> 
> Explanation: Currently the architecture's deserialization process has to 
> resolve EVERY language element it deserializes from the persistent 
> storage. Obviously this would not be necessary if the persistent storage 
> were serialized/deserialized as a whole (atomically). 

The problem is, there is no such 'whole'. Code completion information is
spread across many different .pidb files. Every referenced assembly and
every project has its own pidb file. When MonoDevelop opens a project it
"opens" its pidb file, and the pidb files of all the directly or
indirectly referenced assemblies. pidb files are not fully loaded into
memory; only a class index is initially loaded. Class information is
loaded from disk on demand. All this saves a lot of memory when
dealing with big projects.
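
For illustration, here is a rough sketch of the on-demand loading idea.
The class and member names are made up for the example; they are not the
actual MonoDevelop types:

using System.Collections.Generic;
using System.IO;

// Sketch only: a per-file database that keeps a small in-memory index
// and reads full class records from disk on demand.
class ClassIndexEntry
{
    public string FullyQualifiedName;
    public long FileOffset;     // where the serialized class record starts
}

class ParserDatabase
{
    Dictionary<string, ClassIndexEntry> index = new Dictionary<string, ClassIndexEntry> ();
    string databaseFile;

    public ParserDatabase (string file)
    {
        databaseFile = file;
        LoadIndex ();           // only the index is read at open time
    }

    void LoadIndex ()
    {
        // read (name, offset) pairs from the index section of the file
    }

    public byte[] GetClassRecord (string fullName)
    {
        ClassIndexEntry entry;
        if (!index.TryGetValue (fullName, out entry))
            return null;
        using (FileStream fs = File.OpenRead (databaseFile)) {
            // the class body is read only when somebody asks for it
            fs.Seek (entry.FileOffset, SeekOrigin.Begin);
            BinaryReader reader = new BinaryReader (fs);
            int length = reader.ReadInt32 ();
            return reader.ReadBytes (length);
        }
    }
}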

> This is not 
> only an unnecessary overhead far exceeding everything I've ever seen, 

Nope, you've already seen something like this. When Mono loads an
assembly it has to resolve every language element it deserializes. The
concept is the same.

> it's also a big problem for coders (like me), because adding only one 
> piece of support for whatever new code completion feature (in my case 
> generics) will result in inevitable neck breaking, endless coding and 
> total breakage of interrelated code (code so interrelated that no 
> foreign programmer will ever be capable of getting an overview of it).

Adding new information to classes means extending IClass and related
classes, implementing the serialization code, and implementing the
resolving code. Those basic operations need to be implemented whatever
serialization model you use.

Changing existing structures is harder work, since you'll need to
change the parser and resolver code of the language bindings. Such is
life.

> 
> My arguments underpinning the statement 'the current code completion 
> architecture is fundamentally flawed':
> - we get duplicated entries in the persistent storage (fully qualified 
> names of types being just one of these)

I've already explained that parser information is spread across several
pidb files. If you agree that this is necessary, then you must also agree
that storing fully qualified names is needed in this model. We can't
store 'pointers' from A.pidb to B.pidb because if B changes, pointer
values will change, and references in A.pidb will break with no chance
of recovery. Notice that B.pidb may change while A is not in memory
(for example, B may be loaded as part of another project, C).

The only way to keep a consistent structure of class references would
imply managing all parser information as a whole, so for example if a
class in mscorlib changes its location for some reason (its pointer or
id), the parser information for all projects and all assemblies on your
hard disk would need to be updated. Keeping track of all this huge
amount of information would be something really hard to implement in an
efficient way (if possible at all).

From the storage point of view, a pidb file has a string table where all
strings (including type names) are stored, so there is no string
duplication inside a pidb file. There is another string table in memory,
to avoid duplication when loading class information in memory.
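
If it helps, this is roughly the string table idea, sketched with
made-up names:

using System.Collections.Generic;

// Sketch of the string table idea: each distinct string (including
// type names) is stored once and referenced by index everywhere else.
class StringTable
{
    List<string> strings = new List<string> ();
    Dictionary<string, int> lookup = new Dictionary<string, int> ();

    public int Add (string s)
    {
        int id;
        if (lookup.TryGetValue (s, out id))
            return id;          // already stored, reuse the index
        id = strings.Count;
        strings.Add (s);
        lookup [s] = id;
        return id;
    }

    public string Get (int id)
    {
        return strings [id];
    }
}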

> - we get multiple utility classes which exist only because we have such 
> a curious way of (de)serializing type information (look at all those 
> 'Persistent<Type>' classes or the ITypeResolver) - making up the 
> spaghetti I was talking about...

I admit that the class structure is a bit weird here, since it was an
incremental change from the SharpDevelop code. All Persistent* classes
were originally real classes which implemented the serialization code
themselves. This was changed to utility classes to be able to serialize
any implementation of IClass and related information. All those Abstract*
base inheritance relations in Persistent* could be removed.
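
The idea is that serialization works through the interface, along these
lines (a heavily simplified sketch, not the real IClass or the real
Persistent* API):

using System.IO;

// Sketch: a static utility writes any IClass implementation through
// the interface, instead of each Persistent* instance serializing itself.
interface IClass
{
    string FullyQualifiedName { get; }
    string[] BaseTypes { get; }
}

static class PersistentClass
{
    public static void Write (IClass cls, BinaryWriter writer)
    {
        writer.Write (cls.FullyQualifiedName);
        writer.Write (cls.BaseTypes.Length);
        foreach (string baseType in cls.BaseTypes)
            writer.Write (baseType);    // base types stored as fully qualified names
    }
}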

> - many of these utility classes do per element array copies in 'foreach' 
> loops and a lot of O(n^2) or even more heavy algorithms to do things 
> that would otherwise never be needed <-- this is done in the resolving 
> process

ITypeResolver is not used when loading parser information from disk; it
is used to resolve partial type names provided by the parser. That's
NEEDED, you can't avoid it.

Let's say I have this code:

public class MyClass: ArrayList
{
}

The parsing process is like this:
      * The C# language binding parses the source code. The result is an
        IClass instance which has "ArrayList" in the list of base
        classes.
      * The MD parser service takes the IClass and 'resolves' it. This
        means converting partial class names into fully qualified type
        names. In this process, ArrayList is converted into
        System.Collections.ArrayList.
      * The resolved IClass is added to the parser database, and will be
        stored to disk when the parser service decides to. This class
        will never be resolved again (until replaced by a new re-parsed
        instance).

You can't avoid this process, although it might be improved. For
example, it might be possible to update the IClass instance instead of
creating a new one. It would require changes in all those interfaces,
since many name setters are missing (honestly, I would enjoy seeing all
those interfaces being replaced by sealed classes).
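
To make the resolve step concrete, here is a rough sketch of what it
does; the class is invented for the example and is much simpler than the
real ITypeResolver:

using System.Collections.Generic;

// Sketch: partial type names reported by the language parser are
// expanded to fully qualified names using the types known to the
// parser database.
class SimpleTypeResolver
{
    Dictionary<string, string> knownTypes = new Dictionary<string, string> ();

    public void AddKnownType (string fullName)
    {
        int dot = fullName.LastIndexOf ('.');
        string shortName = dot < 0 ? fullName : fullName.Substring (dot + 1);
        knownTypes [shortName] = fullName;
    }

    public string Resolve (string name)
    {
        string fullName;
        if (knownTypes.TryGetValue (name, out fullName))
            return fullName;    // "ArrayList" -> "System.Collections.ArrayList"
        return name;            // already qualified or unknown: leave as is
    }
}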

In any case, I don't think this is a performance issue, since this
resolving process is only executed when parsing the file being edited
(which will often have only one class), or when parsing other project
files due, for example, to an SVN update (in this case the parsing is
done in the background, so it's no big issue).

> - we have a lot of statements like these: 'return new ArrayList()' where 
> we create a lot of temporary arrays just because some utility classes 
> don't check whether some properties are 'null' or not. 

Only the utility classes? The whole IDE and all its add-ins have access
to IClass, and they expect to get collections even if they are empty,
instead of null.

> Now tell me, 
> honestly, which operation is more expensive (speed- and memory-wise): 
> creating a new ArrayList, or checking that it is 'null'?

Checking for null. However, in public APIs making a property return
either a collection of items OR null if there are no items is a bad
coding practice. It leads to user code that is more difficult to read
(you have to add a null check to every collection access) and more buggy
(sooner or later some user will forget to add the null check).
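
If the allocation itself is the concern, a shared read-only empty
instance avoids both the null checks and the temporary ArrayList
allocations. A sketch with illustrative names:

using System.Collections;

// Sketch: the property never returns null, so callers can iterate
// without null checks, and no temporary ArrayList is allocated when
// the class has no base types.
class ClassInfo
{
    static readonly ArrayList emptyList = ArrayList.ReadOnly (new ArrayList ());
    ArrayList baseTypes;

    public ArrayList BaseTypes {
        get { return baseTypes != null ? baseTypes : emptyList; }
    }
}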

> 
> Errr, why did Mike Krüger decide to store things this way? 

He did not. The original SharpDevelop code did not have the concept of
pidb files. I completely redesigned the way parser information is stored
and loaded. I also made some improvements in the resolving process.
However, the class structures (IClass and such), and the infrastructure
for language binding parsers are more or less the same.

> Is .NET's 
> serialization so ineffective (bloated, slow?) that he wanted to 
> sacrifice speed and in-memory space in favour of on-disk space 
> consumption? 

I once implemented a project reader for MonoDevelop which was able to
understand Mono's makefile format. So I could click on "Open File",
select the root Mono makefile, and that would load the entire Mono tree
of projects into MonoDevelop. The first time I tried it, the CPU jumped
to 100%, memory started growing, and MD crashed after several minutes,
having filled all the available memory. And all this was because of the
parser information.

With the current design, the entire MD tree loads in a few seconds, and
takes a few megs of memory.

> Well, if disk space is a problem, we can still compress the 
> code completion database... But essentially, I doubt this is a 
> problem... If there are some insiders among you guys, please, tell me if 
> and why I'm wrong - what have I overlooked.
> 
> Do I have a go for designing a new architecture, or do you guys have any 
> warnings or objections for me. I'm aware that I might be wrong with some 
> (or all) of my above assumptions.

You now have more info, judge it yourself. I'm open to ideas for
improving the performance and memory usage, as long as they have a good
cost/benefit trade-off (if not, there are other MD areas that need
work).

Thanks!
Lluis.




