[MonoDevelop] Code completion problem

Matej Urbas matej.urbas at gmail.com
Sat Jun 17 11:51:44 EDT 2006


Lluis Sanchez wrote:
> Hi,
>
>   
>> I have stumbled upon a very disturbing problem...
>> ...serialized/deserialized as a whole (atomically). 
>>     
>
> The problem is, there is no such 'whole'. Code completion information is
> spread across many different .pidb files. Every referenced assembly and
> every project has its own pidb file. When MonoDevelop opens a project it
> "opens" its pidb file, and the pidb files of all the directly or
> indirectly referenced assemblies. pidb files are not fully loaded into
> memory; only a class index is initially loaded. Class information is
> loaded from disk on demand. All this saves a lot of memory when
> dealing with big projects.
>   

Yes, indeed. I concentrated more on the ParserDatabase (which seems to 
be the centralized store - the 'whole' we were referring to), which I 
thought was broken into several pieces (on serialization) and then fully 
reassembled (on deserialization). What I didn't know is that there is a 
class index - and, with it, lazy loading of classes. (I have found it in 
the code now.) A very nice feature to have! It would be very hard (well, 
impossible) to do such a thing with plain serialization.
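
Just to check my understanding for the archives, here is how I now 
picture the scheme - a rough sketch with made-up names, not the actual 
MonoDevelop types:

using System.Collections.Generic;

// Sketch: only the class index stays in memory; full class records are
// deserialized from the pidb file on first access.
class ClassIndexEntry
{
    public string FullName;  // fully qualified name, always in memory
    public long Offset;      // position of the class record in the pidb file
}

class PidbFileSketch
{
    Dictionary<string, ClassIndexEntry> index =
        new Dictionary<string, ClassIndexEntry> ();
    Dictionary<string, object> loaded = new Dictionary<string, object> ();

    public object GetClass (string fullName)
    {
        object cls;
        if (loaded.TryGetValue (fullName, out cls))
            return cls;                             // already deserialized
        cls = ReadRecord (index [fullName].Offset); // hit the disk on demand
        loaded [fullName] = cls;
        return cls;
    }

    object ReadRecord (long offset)
    {
        // would seek to 'offset' in the pidb file and deserialize
        // a single class record
        return null;
    }
}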

Very useful information!

>   
>> This is not 
>> only an unnecessary overhead far exceeding everything I've ever seen, 
>>     
>
> Nope, you've already seen something like this. When Mono loads an
> assembly it has to resolve every language element it deserializes. The
> concept is the same.
>   

Yes, but I was thinking we could avoid this process if we used 
(de)serialized references to 'IClass' (and other constructs). True, in 
that case we would have other problems, such as: what should be done if 
the definition of an 'IClass' changes? The answer - all classes that 
reference it should be invalidated, then re-processed by a parser when 
actually accessed/needed - implies some logic that would have to be 
implemented (though I don't know whether it would need to be very 
complicated... just resolving, as it is implemented now, would suffice).
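
Roughly what I had in mind - a toy sketch, not proposed actual types:

using System.Collections.Generic;

// Toy sketch of the invalidation idea.
class ClassRecord
{
    public bool Valid = true;
    public List<ClassRecord> ReferencedBy = new List<ClassRecord> ();
}

class InvalidationSketch
{
    // When a class definition changes, everything that references it is
    // marked invalid; an invalid class would be re-parsed only when it
    // is actually accessed/needed again.
    public static void Invalidate (ClassRecord changed)
    {
        foreach (ClassRecord dependent in changed.ReferencedBy)
            dependent.Valid = false;
    }
}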

Well, from what you said, I gather that, given the drawbacks of 'whole' 
serialization, this just isn't worth it...

>   
>> it's also a big problem for coders (like me), because adding only one 
>> piece of support for whatever new code completion feature (in my case 
>> generics) will result in inevitable neck breaking, endless coding and 
>> total breakage of interrelated code (code so interrelated that no 
>> foreign programmer will ever be capable of getting an overview of it).
>>     
>
> Adding new information to classes means adding to IClass and related
> classes, implementing the serialization code and implementing the
> resolving code. Those basic operations need to be implemented whatever
> serialization model you use.
>   

Yes, I have discovered that, and adding, for example, GenericArguments 
to IReturnType does not seem to hurt much.
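
Concretely, what I have in mind looks roughly like this (a sketch only: 
ReturnTypeList and GenericArguments are my proposed names, and the 
existing IReturnType members are elided):

using System.Collections;

// Sketch: a typed collection of IReturnType, pre-generics style.
public class ReturnTypeList : CollectionBase
{
    public IReturnType this [int index] {
        get { return (IReturnType) List [index]; }
    }

    public void Add (IReturnType type)
    {
        List.Add (type);
    }
}

public interface IReturnType
{
    string FullyQualifiedName { get; }

    // New: the type arguments of a constructed generic type; e.g. for
    // List<int> this would contain one entry describing System.Int32.
    ReturnTypeList GenericArguments { get; }
}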

> Changing existing structures is harder work, since you'll need to
> change the parser and resolver code of the language bindings. Such is
> life.
>   

This, on the other hand, is a more delicate problem - the actual source 
of my panic attack ;-)

When I wanted to change the BaseTypes (in IClass) from StringCollection 
to ReturnTypeList, I discovered one problem after another - it started 
to seem like an endless cascade... I also started to worry because I saw 
how many heavyweight algorithms I would have to add. Like you said, 
c'est la vie.

Well, it will just take some more time for me to figure out what has to 
be done, but in the end it should be alright.
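
For reference, the interface side of the change is small (a sketch; the 
rest of IClass is elided):

// Sketch of the IClass change I'm attempting.
public interface IClass
{
    // before: StringCollection BaseTypes { get; }
    // after: each base type carries generic arguments, not just a name
    ReturnTypeList BaseTypes { get; }
}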

>> My arguments underpinning the statement 'the current code completion 
>> architecture is fundamentally flawed':
>> - we get duplicated entries in the persistent storage (fully qualified 
>> names of types being just one of these)
>>     
>
> I've already explained that parser information is spread across several
> pidb files. If you agree that this is necessary, then you must also
> agree that storing fully qualified names is needed in this model. We
> can't store 'pointers' from A.pidb to B.pidb because if B changes,
> pointer values will change, and references in A.pidb will break with no
> chance of recovery. Notice that B.pidb may change while A is not in
> memory (for example, B may be loaded as part of another project C).
>
> The only way to keep a consistent structure of class references would
> imply managing all parser information as a whole, so for example if a
> class in mscorlib changes its location for some reason (its pointer or
> id), the parse information for all projects and all assemblies in your
> hard disk would need to be updated. Keeping track of all this huge
> amount of information would be something really hard to implement in an
> efficient way (if possible at all).
>   

Well, if we had a centralized store, keeping track of changes in the 
projects of a solution wouldn't be hard. It is true that if we opened a 
project on its own - not in its solution - the pidb files would have to 
be recreated... but MonoDevelop currently only supports referencing 
projects that are open in the parent solution. If we wanted to implement 
cross-solution project references, we would have to add some 'file 
pointers' to pidb files anyway, wouldn't we?

If, however, we told a project to look for the central class information 
store at a certain location on the disk, many problems would be solved. 
If the central store weren't found, a new one could easily be created...

With it, we would be free to use direct IClass references instead of 
fully qualified names. Resolving would still work the same, except that 
the result would be an IClass reference rather than a fully qualified 
name.

> From the storage point of view, a pidb file has a string table where all
> strings (including type names) are stored, so there is no string
> duplication inside a pidb file. There is another string table in memory,
> to avoid duplication when loading class information in memory.
>   

Another thing I missed.
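
For my own notes: I imagine the string table works along these lines (a 
sketch, not the actual pidb code):

using System.Collections.Generic;

// Sketch: each distinct string is stored once and referenced everywhere
// else by its integer id.
class StringTableSketch
{
    Dictionary<string, int> ids = new Dictionary<string, int> ();
    List<string> strings = new List<string> ();

    // Returns the id of 's', adding it to the table on first sight.
    public int GetId (string s)
    {
        int id;
        if (!ids.TryGetValue (s, out id)) {
            id = strings.Count;
            strings.Add (s);
            ids [s] = id;
        }
        return id;
    }

    public string GetString (int id)
    {
        return strings [id];
    }
}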

>   
>> - we get multiple utility classes which exist only because we have such 
>> a curious way of (de)serializing type information (look at all those 
>> 'Persistent<Type>' classes or the ITypeResolver) - making up the 
>> spaghetti I was talking about...
>>     
>
> I admit that the class structure is a bit weird here, since it was an
> incremental change from the SharpDevelop code. All Persistent* classes
> used to be real instances which implemented the serialization code.
> They were changed to utility classes in order to be able to serialize
> any implementation of IClass and related information. All those
> Abstract* base inheritance relations in Persistent* could be removed.
>   

Perhaps I could change that in the process?

>   
>> - many of these utility classes do per element array copies in 'foreach' 
>> loops and a lot of O(n^2) or even more heavy algorithms to do things 
>> that would otherwise never be needed <-- this is done in the resolving 
>> process
>>     
>
> ITypeResolver is not being used when loading parser information from
> disk, but to resolve partial type names provided by the parser. That's
> NEEDED, you can't avoid it.
>   

Yes, I agree. When I saw the 'Resolve' method in 'PersistentClass' I 
immediately thought it was called in the 'Loading' stage.

> Let's say I have this code:
>
> public class MyClass: ArrayList
> {
> }
>
> The parsing process is like this:
>       * The c# language binding parses the source code. The result is an
>         IClass instance which has "ArrayList" in the list of base
>         classes.
>       * The MD parser service takes the IClass and 'resolves' it. This
>         means converting partial class names into fully qualified type
>         names. In this process, ArrayList is converted into
>         System.Collections.ArrayList.
>       * The resolved IClass is added to the parser database, and will be
>         stored to disk when the parser service decides it. This class
>         will never be resolved again (until replaced by a new re-parsed
>         instance).
>   

I see! I will surely remember this - it will be needed when I have to 
resolve the list of IReturnTypes (which will represent the base types of 
an IClass).
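
For my case, the resolve step would presumably look something like this 
(a sketch: 'ResolveName' is a stand-in for whatever ITypeResolver 
actually exposes, and 'MakeResolvedCopy' is an invented helper):

// Sketch: resolving the partial names in a parsed class's base types.
class ResolveSketch
{
    ReturnTypeList ResolveBaseTypes (IClass parsedClass)
    {
        ReturnTypeList resolved = new ReturnTypeList ();
        foreach (IReturnType baseType in parsedClass.BaseTypes) {
            // e.g. "ArrayList" -> "System.Collections.ArrayList",
            // depending on the usings and namespace of the parsed file
            string fqn = ResolveName (baseType.FullyQualifiedName, parsedClass);
            resolved.Add (MakeResolvedCopy (baseType, fqn));
        }
        return resolved;
    }

    string ResolveName (string partialName, IClass context)
    {
        // stand-in: would search the file's usings, the class's own
        // namespace and the referenced assemblies for a matching type
        return partialName;
    }

    IReturnType MakeResolvedCopy (IReturnType original, string fqn)
    {
        // stand-in: would build a persistent IReturnType carrying 'fqn'
        // plus the original's generic arguments
        return original;
    }
}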

> You can't avoid this process, although it might be improved. For
> example, it might be possible to update the IClass instance instead of
> creating a new one. It would require changes in all those interfaces
> since many name setters are missing (honestly, I would enjoy seeing all
> those interfaces being replaced by sealed classes).

Sealed classes? You mean that language binding implementors wouldn't be 
able to create their own specialized language elements to store whatever 
additional info they need? I like the abstract classes more - they 
should be used in place of the interfaces...

What is ClassProxy needed for? Grep tells me it's pretty much unused...

> In any case, I don't think this is a performance issue, since this
> resolving process is only executed when parsing the file being edited
> (which will often have only one class), or when parsing other project
> files due, for example, to an SVN update (in this case the parser
> process runs in the background, so it's no big issue).
>   

It's needed anyway.

>   
>> - we have a lot of statements like these: 'return new ArrayList()' where 
>> we create a lot of temporary arrays just because some utility classes 
>> don't check whether some properties are 'null' or not. 
>>     
>
> Only the utility classes? The whole IDE and all its add-ins have access
> to IClass, and they expect to get collections even if they are empty,
> instead of null.
>
>   
>> Now tell me, 
>> honestly, which operation is more expensive (speed- and memory-wise): 
>> creating a new ArrayList, or checking that it is 'null'?
>>     
>
> Checking for null. However, in public APIs, making a property return
> either a collection of items OR null if there are no items is a bad
> coding practice. It leads to user code that is more difficult to read
> (you have to add a null check to every collection access) and buggier
> (sooner or later some user will forget to add the null check).
>   

Yes, but null checks are among the easiest things to notice and fix... 
You see, I had already planned to add properties that may return 'null' 
to indicate that there is nothing there. Should I change that (for 
example, in my patch for IClass and IReturnType)?
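
If the answer is yes, a property could instead hand back a shared empty 
list, which would also avoid the temporary 'new ArrayList ()' 
allocations I complained about above (a sketch):

// Sketch: never return null from a collection property.
class ClassSketch
{
    ReturnTypeList baseTypes; // may legitimately be unset internally

    // One shared empty instance avoids allocating a fresh list on every
    // access; callers must treat returned collections as read-only.
    static readonly ReturnTypeList empty = new ReturnTypeList ();

    public ReturnTypeList BaseTypes {
        get { return baseTypes != null ? baseTypes : empty; }
    }
}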

>   
>> Errr, why did Mike Krüger decide to store things this way? 
>>     
>
> He did not. The original SharpDevelop code did not have the concept of
> pidb files. I completely redesigned the way parser information is stored
> and loaded. I also made some improvements in the resolving process.
> However, the class structures (IClass and such), and the infrastructure
> for language binding parsers are more or less the same.
>   

And you have convinced me that your design is reasonable and the only 
way to go.

>   
>> Is .NET's 
>> serialization so ineffective (bloated, slow?) that he wanted to 
>> sacrifice speed and in-memory space in favour of on-disk space 
>> consumption? 
>>     
>
> I once implemented a project reader for MonoDevelop which was able to
> understand Mono's makefile format. So I could click on "Open File",
> select the root Mono makefile, and that would load the entire Mono tree
> of projects into MonoDevelop. The first time I tried it, the CPU jumped
> to 100%, memory started growing, and MD crashed after several minutes,
> having filled all the available memory. And all this was because of the
> parser information.
>
> With the current design, the entire MD tree loads in a few seconds, and
> takes a few megs of memory.
>   

I'm convinced ;-)

>   
>> Well, if disk space is a problem, we can still compress the 
>> code completion database... But essentially, I doubt this is a 
>> problem... If there are some insiders among you guys, please, tell me if 
>> and why I'm wrong - what have I overlooked.
>>
>> Do I have a go-ahead for designing a new architecture, or do you guys
>> have any warnings or objections for me? I'm aware that I might be wrong
>> about some (or all) of my above assumptions.
>>     
>
> You now have more info, judge it yourself. I'm open to ideas for
> improving the performance and memory usage, as long as they have a good
> cost/benefit trade-off (if not, there are other MD areas that need
> work).
>   

Yes, that's the info I needed.

I panicked because I saw how many things would have to be changed when 
converting the StringCollection of BaseTypes into a ReturnTypeList. I 
had to research other parts of MonoDevelop and started seeing things I 
wasn't ready to see. I asked myself why some of these things were needed.

Don't be angry with me - I'm just trying to gather as much information 
as I need to finally start doing something productive :-) I had to say 
what was on my mind. I didn't mean to sound offensive.

In the end, generics have to be added before I can move on to other areas...

Enjoy,
---
Matej

