Bart <bc@freeuk.com> wrote:
> I can also say that the C grammar is buggy:
>
>     assignment-expression:
>         conditional-expression
>         unary-expression assignment-operator assignment-expression
>
> When attempting to parse an assignment-expression, do you go for a
> conditional-expression or unary-expression?
Both. The grammar clearly is not LL. In the C90 grammar there was
one issue which meant that the grammar _alone_ could not be used for
parsing: at one (or maybe two) places, after recognizing a token, the
parser had to look it up in the symbol table and possibly transform
the token based on previous declarations. If you did that, then a
classic LR(1) parser would work. I think that changing the grammar
to do the transformation by grammar rules alone changed the grammar
so that it was no longer LR(1). I think that after the change it was
LR(2), but I may be remembering things incorrectly. For _defining_
language syntax it is enough to have an unambiguous grammar, which is
a much easier requirement to meet than LR(2). Or the standard could
give clear rules saying which of the parse trees produced by an
ambiguous grammar is the right one (AFAIK that is what C++ does).
Anyway, any troubles in the C grammar are _not_ due to the part
above; that part is handled fine by LALR(1) tools (LALR(1) is
somewhat hard to define, but is a subset of LR(1) which is easier to
implement).
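
The symbol table lookup above is the trick commonly called the
"lexer hack". A minimal illustration of why it is needed (the names
T, f and g are mine): the statement "T * x;" parses completely
differently depending on what T was declared to be earlier.

    typedef char T;

    void f(void)
    {
        T * x;      /* declaration: x is a pointer to T           */
        (void)x;
    }

    void g(int T, int x)
    {
        T * x;      /* expression statement: T multiplied by x    */
    }

The token stream of the two marked statements is identical; only the
symbol table can tell the parser that in f() the identifier T names
a type while in g() it names an ordinary variable.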
BTW: LL(k) requires the ability to decide which rule will apply
after seeing k symbols from the start of a construct. Clearly for a
nontrivial grammar one needs at least one symbol, so LL(1) is the
easiest to parse. LR(k) requires the ability to decide which rule
will apply after seeing the whole construct and at most k following
symbols. So, an LR parser keeps reading symbols as long as it cannot
recognize any construct (in other words, it considers all
alternatives as possible). In the process it collects information.
Once it has enough information it replaces part of the previously
read input by the left-hand side of a rule (this is called a
reduction). After one reduction there may be several more, since
several constructs may end in the same place. The parser starts with
the smallest recognized part, then proceeds to the bigger ones.
Usually LR parsers keep the input and the effects of reductions on a
stack. Reading a symbol puts it on the stack (this is called a
shift); a reduction removes a bunch of items from the stack and
replaces them by a single (different) item. At first glance the
control, that is deciding between shift and reduction, and in case
of a reduction deciding which rule to use, may look complicated.
But ignoring control, the mechanism is simple and quite efficient.
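
To make the shift/reduce mechanics concrete, here is a toy
hand-coded example of my own (the grammar and all names are my
invention, not from any real tool) for the little LR grammar

    E -> E '+' n
    E -> n

which evaluates the input while it reduces:

    #include <ctype.h>
    #include <stdio.h>

    int main(void)
    {
        const char *input = "1+2+3";
        char stack[64];     /* grammar symbols: 'E' or '+'         */
        int  vals[64];      /* value attached to each 'E'          */
        int  sp = 0;

        for (const char *p = input; *p; p++) {
            stack[sp] = *p;                          /* shift      */
            vals[sp] = isdigit((unsigned char)*p) ? *p - '0' : 0;
            sp++;
            if (isdigit((unsigned char)*p)) {
                /* a handle is now complete on the stack; reduce   */
                if (sp >= 3 && stack[sp-2] == '+' && stack[sp-3] == 'E') {
                    vals[sp-3] += vals[sp-1];        /* E -> E + n */
                    sp -= 2;
                } else {
                    stack[sp-1] = 'E';               /* E -> n     */
                }
            }
        }
        if (sp == 1 && stack[0] == 'E')
            printf("accepted, value = %d\n", vals[0]);
        return 0;
    }

Note that the decision which rule to reduce by is delayed until the
whole right-hand side sits on the stack; the parser never has to
choose between rules in advance, which is exactly what makes your
assignment-expression example unproblematic for LR.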
In the sixties it was discovered that in principle parser control
may also be quite simple: it can be done by a finite automaton. In
other words, there is a state which contains the collected
information and tells you what to do (shift or reduce), and you get
the new state by looking into a table indexed by the current state
and the read symbols (the currently considered symbol and the
lookahead symbols). At first there was doubt that the needed tables
might be too big and make this impractical, but in the seventies it
was discovered that for typical grammars of programming languages
one can use reasonably small tables. And there were tricks for
compressing the tables. So in practice one gets a few, maybe a few
tens of, kilobytes of tables.
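
For the toy grammar above the automaton is tiny. Here is my own
hand-worked SLR(1) table and driver (again just a sketch, not the
output of any real generator): the table, indexed by the current
state and the next input symbol, decides between shift, reduce and
accept.

    #include <stdio.h>

    /* rule 1: E -> E '+' n        rule 2: E -> n                  */
    enum { T_N, T_PLUS, T_END };          /* terminal indices      */
    #define ACC 99                        /* accept marker         */

    /* action[state][terminal]: >0 shift to that state,
       <0 reduce by that rule, 0 syntax error                      */
    static const int action[5][3] = {
        /*          n    +     $  */
        /* 0 */ {   1,   0,    0 },       /* expecting first n     */
        /* 1 */ {   0,  -2,   -2 },       /* E -> n .              */
        /* 2 */ {   0,   3,  ACC },       /* E -> E . + n          */
        /* 3 */ {   4,   0,    0 },       /* E -> E + . n          */
        /* 4 */ {   0,  -1,   -1 },       /* E -> E + n .          */
    };
    static const int rhs_len[] = { 0, 3, 1 };      /* pop counts   */
    static const int goto_E[]  = { 2, 0, 0, 0, 0 };/* goto on E    */

    int main(void)
    {
        int input[] = { T_N, T_PLUS, T_N, T_PLUS, T_N, T_END };
        int stack[64] = { 0 };   /* automaton states, start in 0   */
        int sp = 0, i = 0;

        for (;;) {
            int a = action[stack[sp]][input[i]];
            if (a == ACC)      { puts("accepted"); return 0; }
            else if (a > 0)    { stack[++sp] = a; i++; } /* shift  */
            else if (a < 0) {                            /* reduce */
                sp -= rhs_len[-a];               /* pop the handle */
                stack[sp+1] = goto_E[stack[sp]]; /* goto on E      */
                sp++;
                printf("reduce by rule %d\n", -a);
            }
            else               { puts("syntax error"); return 1; }
        }
    }

Real generated tables are of course far bigger, which is where the
compression tricks mentioned above come in, but the driver loop
stays this small.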
Access to compressed tables needs extra instructions, and LR
grammars tend to produce more reductions than LL ones, so LR parsers
tend to be slightly slower than LL ones: people got 500000 lines per
minute for LR parsers and 900000 for an LL parser on machines about
1000 times slower than a modern PC. Even at that time the speed
difference in the parser was not a deal breaker, and LR parsing was
preferred.
BTW2: Before C90 there was doubt whether a sensible grammar for C
was possible (earlier grammars had serious problems). But during the
standardization process it was discovered that a quite reasonable
grammar works. The current grammar is a development of the C90
grammar. And problems like the one you mention were understood long
ago.
--
                              Waldek Hebisch