[Mono-dev] misc: C# request info

Mon Feb 23 17:35:02 EST 2009

----- Original Message ----- 
From: "Marek Safar" <marek.safar at seznam.cz>
To: "BGB" <cr88192 at hotmail.com>
Cc: <mono-devel-list at lists.ximian.com>
Sent: Monday, February 23, 2009 7:56 PM
Subject: Re: [Mono-dev] misc: C# request info

> Hi,
>> well, I was looking into C# some, and admittedly I have much less 
>> fammiliarity with the language than with others...
>>
>> (I have started looking at ECMA-334 some, but it is long and a little 
>> awkward to answer specific questions from absent some digging...).
>>
>>
>> so, firstly, it is my guess that in order to compile C# properly, it is 
>> required to load a whole group of files at once (I am uncertain whether 
>> the term 'assembly' also applies to the collection of input source files, 
>> or only to a produced DLL or EXE).
>>
> You can load C# source code files as well as assemblies or modules.
>

yes, ok.

>> my guess is that it works like this:
>> the group of files is loaded;
>> each file is preprocessed and parsed (it is looking like C# uses a 
>> context-independent syntax?...);
>>
> Incorrect, C# uses context dependent keywords.

I meant, context independent as in, the parser can parse things without 
knowing about previous declarations.
in C, it is necessary to know about declarations (typedefs, structs, ...) in 
order to be able to parse (otherwise... the syntax is ambiguous...).

sadly, I am not all that familiar in detail with C#'s syntax.

in C#, this would mean having to know about all of the classes in all of the 
visible namespaces before being able to parse, which would not seem to be 
the case.

actually, it is looking to me (just my guess from looking at ECMA-334), that 
rather than processing code a few-tokens at a time and using a syntax where 
the current parsing depends on prior declarations, C# is using a syntax 
where it is possible to parse everything without knowing the types at 
parse-time (I will presume this is assumed/required from the way the 
language is structured), but that the syntax is not disambiguated be the 
next 1-or-2 tokens, but may require trying to parse everything that could 
work and using the first syntactic form that does work.

this leads to a difference in parsing strategy, namely that rather than 
parsing along token-by-token and reporting a parse error the first time an 
unexpected token is seen, one descends into parsing branches, tries to 
parse, and if there would be a parse error then returning gracefully, 
allowing the next higher level to try the next possible interpretation.

for example:
'(Foo)' is ambiguous, and may be assumed to be a reference in parenthesis;
'(Foo)bar' is recognized as a cast, on the grounds that otherwise it would 
not parse correctly.

well, I am not strictly following the syntax structure given in the spec, 
mostly because to do so would require completely rewriting my existing 
parser, rather I am figuring out how to make it work (AFAICT, the 
descriptions in ECMA-334 are LR or LALR, but my parsers are typically 
hand-written recursive descent, and infact I am using a C parser as the 
basis for a C# parser, but am having to deal with many subtle but 
fundamental differences between the languages...).

basically, one of the earlier forms of my C compiler is being used as a 
template (I am using an earlier form which uses S-Exps as the basis, rather 
than a later form which used DOM instead, as S-Exps are faster and less 
memory-consuming than DOM nodes). not all is free though, as I had to 
redirect (via search-and-replace) the old typesystem API to the new 
typesystem API.

decided to leave out big chunk about my uncertainty over the convention of 
usage of letter-case in said APIs naming convention (which differs some from 
my newer naming rules, AKA: from 'alllower' to 'camelCase').

note: technically, I am doing all this for a few specific reasons:
a new compiler frontend will be less effort for the time being than getting 
either my JBC or CIL frontend working (JBC and CIL still require some 
work...);
the prior effort have already lead to me adding much of the needed machinery 
into the compiler core (I am still deciding on the details of how I will do 
exception handling, and am generally leaning against SEH, but there seems to 
be little standardization here, so I may do my own thing and possibly hook 
it into the others using VEH, ...);
it would be much less effort (and ugliness) to modify a C frontend into a C# 
frontend, than to hack OO and namespaces onto C while still preserving its 
"C-ness" (I beat around some ideas, they were ugly...).

as noted, this C# compiler would take the same basic compilation route as my 
existing C compiler (no intermediate CIL), and will use the same shared 
backend as my other compilers (since the backend and basic machinery are 
shared, it doesn't matter too much which frontend uses it, so no real lost 
effort here...).

I have opted against adding CPS to my backend for now (creates too many 
issues, and would be too much effort, and for now generalized tail-calls and 
continuations are not worth a potentially notable cost in performance).

another reason for preferring a C# route rather than a mock-up C++ route, is 
that of compilation time:
I am thinking C# should be able to compile much faster than C++ could hope 
for, due primarily to the lack of source-level inclusion (ignoring the 
possibility of a hack being used to compile headers separately and using 
'include' as a module-import feature, which had been considered in my 
C-hackery musings...).

misc bit of trivia:
I guess there exists a CIL frontend and backend for GCC, which I discovered 
recently (I guess the people here probably alrwady knew).
well, oh well, maybe slight competition against GCC, only that unless GCC 
gains dynamic compilation/JIT abilities and becomes much easier to build and 
use on Windows, my current efforts continue having at least "some" point 
(well, nevermind no one else has interest, but oh well, for my own uses it 
works at least, even if not amounting to much...).

I don't know, but last I looked (maybe some-odd years ago) GCC's code was 
filled with terror, and for all I know maybe they have cleaned it up with 
all the activity (however, the descriptions of how they have accomplished 
some things give me doubt...).

sadly, my code is ugly enough... keeping code clean and modular is never as 
easy as it seems.

there are invariably messes, and invariably dependencies (as one ends up 
with a dependency between their linker and their threading code in order to 
implement the '__thread' keyword, ...).

or such...

>> all of the namespaces and declarations are "lifted out" of the parse 
>> trees;
>> each file's parse tree can then be compiled.
>>
>> from what I can tell, types are like this:
>> type = <qualifiers>* <type-name>
>>
>> so, I can type:
>> static Foo bar;
>>
>>
>> and the parser will know that 'Foo' is the type, even if the type for Foo 
>> is not visible at the time of parsing (in C, this can't be done since 
>> there is no clear distinction or ordering between types and qualifiers, 
>> and so one would not know if 'Foo' is the type, or an intended variable 
>> name with the type being assumed to be 'int').
>>
> No, parser does not yet know the type of 'Foo'.
>

I am not certain which way you mean this.
'no', as in it does not know the type, but still parses it;
'no', as in it can't yet be parsed.

>> so, in C we can have:
>> unsigned int i;
>> int unsigned i;
>> int volatile i;
>> _Complex float f;
>> double _Complex g;
>>
>> unsigned i;
>> short int j;
>> int long k;
>> ..
>>
>> so, my guess then is that C# code is "just parsed", with no need to 
>> lookup, for example, is Foo a "struct or class or other typedef'ed type?" 
>> ...
>>
>> as far as the parser is concerned 'int' or 'byte' is syntactically not 
>> different from 'Foo' or 'Bar'?...
>>
> Correct.
>

C also has certain ugly syntactic forms which I will just assume that C# 
does not, such as:
int (*foo(int x, int y))(int z);

> Marek
>