On 12/4/2024 10:49 AM, Scott Lurndal wrote:
> BGB <cr88192@gmail.com> writes:
>> On 12/3/2024 1:27 PM, Janis Papanagnou wrote:
>>> On 03.12.2024 19:57, BGB wrote:
>>>> On 12/2/2024 12:13 PM, Scott Lurndal wrote:
>>>>>
>>>>> Indeed. One wonders at Bart's familiarity with formal grammars.
>>>>>
>>>> In my case, personally I haven't encountered much that doesn't work well
>>>> enough with a recursive-descent parser.
>>>>
>>> Is that meant as a contradiction? - If so, how?
>>>
>>>
>> Formal grammars and parser generators aren't usually necessary IME,
>> since recursive descent can deal with most everything (and is generally
>> more flexible than what one can deal with in a formal grammar).
> How can you write a correct recursive descent parser without a
> formal grammar (at least on paper) for the language being parsed?
It is usually a matter of pattern-matching on the next N tokens, based on the structure of the language.
You can write or reference a syntax in BNF or EBNF or similar, but it is not strictly necessary, and some languages (like C) contain things that can't be fully expressed in a BNF (for example, constructs whose parsing depends on prior typedefs, etc.).
One doesn't necessarily need a BNF though, if a bunch of syntax examples exist and the syntactic rules are sufficiently obvious.
Granted, when designing a language, it is better to avoid a bunch of obvious "gotchas" (such as tricky edge cases or ambiguities).
For parsing C, two tokens of lookahead works well. Some other languages may need only a single token at a time.
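Roughly, something like this (a rough sketch in terms of a made-up PeekToken() helper, not from any actual parser):

  typedef enum { TK_EOF, TK_IDENT, TK_NUMBER, TK_PUNCT } TokenType;

  typedef struct {
      TokenType type;
      const char *text;
  } Token;

  /* Assumed helpers: peek at the token 'n' positions ahead without
     consuming it, and check a name against the typedef table. */
  Token PeekToken(int n);
  int IsTypeName(const char *name);

  /* "foo bar" at statement level suggests a declaration, but only
     if 'foo' is known to name a type (see the typedef issue below). */
  int LooksLikeDeclaration(void)
  {
      Token t0 = PeekToken(0);
      Token t1 = PeekToken(1);
      return t0.type == TK_IDENT && t1.type == TK_IDENT &&
             IsTypeName(t0.text);
  }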
Sometimes, a path can "fail" (and return NULL) if something is encountered that does not match the expected pattern. Then one backs up and goes down the next path, to see if that parses.
For example, for every statement, one can first try to parse a declaration; if that fails, try to parse an "inner statement"; and if that also finds nothing, try to parse an expression.
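In code form, the try-each-path logic looks something like this (hypothetical names; each Parse* function returns NULL when its pattern doesn't match):

  struct ASTNode;
  typedef struct ASTNode ASTNode;

  /* Assumed parse functions, each following the fail-path convention. */
  ASTNode *ParseDeclaration(char **ppos);
  ASTNode *ParseInnerStatement(char **ppos);   /* if/while/return/... */
  ASTNode *ParseExpression(char **ppos);

  ASTNode *ParseStatement(char **ppos)
  {
      ASTNode *node;
      if ((node = ParseDeclaration(ppos)))
          return node;
      if ((node = ParseInnerStatement(ppos)))
          return node;
      return ParseExpression(ppos);
  }

The important detail is that a failing Parse* function restores the input position before returning NULL, so the next path starts from the same place.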
And, in expression parsing, if one encounters a '(' in prefix position, one may need to check whether the following token(s) can be parsed as a type: if so, assume a cast; if not, assume a parenthesized expression. On the other side, if one encounters a '(' where one might expect a binary operator, it can be assumed to be a function call.
One of the main fail paths in a C parser is, say, "can the following tokens be parsed as a type?" (which does often have the overhead of effectively requiring nearly every identifier to be checked against the prior typedefs).
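A sketch of how this might look (again with made-up names; TryParseType is the fail path just mentioned, returning NULL and restoring the position when the tokens don't form a type):

  typedef struct ASTNode ASTNode;   /* as in the earlier sketch */

  /* Assumed helpers (hypothetical): */
  void SkipToken(char **ppos);                     /* consume one token */
  int TryParsePunct(char **ppos, const char *p);   /* match punctuation */
  ASTNode *TryParseType(char **ppos);              /* fail-path type parse */
  ASTNode *ParsePrefixExpr(char **ppos);
  ASTNode *ParseExpression(char **ppos);
  ASTNode *NewCastNode(ASTNode *ty, ASTNode *expr);

  /* Called when a '(' is seen in prefix position. */
  ASTNode *ParseCastOrParen(char **ppos)
  {
      char *save = *ppos;
      ASTNode *ty, *expr;

      SkipToken(ppos);                    /* consume '(' */
      ty = TryParseType(ppos);            /* consults the typedef table */
      if (ty && TryParsePunct(ppos, ")"))
          return NewCastNode(ty, ParsePrefixExpr(ppos));

      *ppos = save;                       /* not a cast; back up */
      SkipToken(ppos);                    /* re-consume '(' */
      expr = ParseExpression(ppos);
      TryParsePunct(ppos, ")");           /* expect the closing ')' */
      return expr;
  }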
One can eliminate the "path failed" scenario, but then the language may end up looking more like JavaScript or similar (needing things like 'var'/'function' keywords, or "EXPR as TYPE" for casts, ...). Or, one can eliminate the checking for typedefs, at which point it might look more like C# (where "foo bar;" at statement level is always assumed to be a declaration, and identifier-identifier pairs are otherwise forbidden in this context).
In some of my previous languages, there was a "feature" where the last statement in a function could be an expression lacking a semicolon, and it would be understood as an implicit return:
  function foo(x, y) { var z=x*y; 2*z-y }
But, this has the downside of creating an ambiguity: one doesn't necessarily know whether the final statement is a tail expression until one can observe its lack of a semicolon.
Or, alternatively, one can use the fail-path strategy and require all other cases (which could potentially be mistaken for a tail expression) to have a semicolon:
  foo* bar; //declaration, implicitly mandates the semicolon.
My last significant original language (BS2, nearly a decade ago now) instead required the use of a 'return' statement. BS2 was meant to simplify and clean up a lot of the painful cases in its predecessor, essentially going from a weird JavaScript variant to something more like a Java/C# hybrid. This language has not seen much use though, as my recent projects have mostly used C.
...
There is some variation in how to deal with input and tokens:
Early parsers (still used in my C parser):
  "char **" pointer gives the input position;
  Token-reading function takes an input position,
    fills in a buffer supplied by the caller,
    sets a token type (via a pointer argument),
    and returns the next position.
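A minimal sketch of that interface (simplified, made-up code; no strings, comments, or multi-character operators):

  #include <ctype.h>

  enum { TK_EOF, TK_IDENT, TK_NUMBER, TK_PUNCT };

  /* Takes the current input position, fills the caller's buffer with
     the token text, stores the token type via 'ptokty', and returns
     the position just past the token. */
  char *NextToken(char *pos, char *buf, int bufsize, int *ptokty)
  {
      int n = 0;
      while (*pos && isspace((unsigned char)*pos))
          pos++;
      if (!*pos) {
          *ptokty = TK_EOF;
          buf[0] = 0;
          return pos;
      }
      if (isalpha((unsigned char)*pos) || *pos == '_') {
          *ptokty = TK_IDENT;
          while ((isalnum((unsigned char)*pos) || *pos == '_') && n < bufsize - 1)
              buf[n++] = *pos++;
      } else if (isdigit((unsigned char)*pos)) {
          *ptokty = TK_NUMBER;
          while (isdigit((unsigned char)*pos) && n < bufsize - 1)
              buf[n++] = *pos++;
      } else {
          *ptokty = TK_PUNCT;
          buf[n++] = *pos++;
      }
      buf[n] = 0;
      return pos;
  }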
Had later tried a different strategy (short-lived):
  Break the entire input buffer into an array of token strings;
  Each token string encodes its token type as a prefix character.
This strategy worked OK for small things, but did not scale well.
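A sketch of what that looked like (simplified, made-up code; prefixes match the example near the end of this post):

  #include <stdlib.h>
  #include <string.h>

  enum { TK_EOF, TK_IDENT, TK_NUMBER, TK_PUNCT };

  /* From the sketch above. */
  char *NextToken(char *pos, char *buf, int bufsize, int *ptokty);

  /* Pre-tokenize the whole input into an array of strings, each with
     a one-character type prefix ('I'=identifier, '|'=number,
     'X'=punctuation). */
  char **TokenizeAll(char *src, int *pcount)
  {
      int cap = 64, n = 0;
      char **toks = malloc(cap * sizeof(char *));
      char buf[256];
      int tokty;
      static const char prefix[] = { 'E', 'I', '|', 'X' };

      for (;;) {
          src = NextToken(src, buf, sizeof(buf), &tokty);
          if (tokty == TK_EOF)
              break;
          if (n >= cap)
              toks = realloc(toks, (cap *= 2) * sizeof(char *));
          toks[n] = malloc(strlen(buf) + 2);
          toks[n][0] = prefix[tokty];
          strcpy(toks[n] + 1, buf);
          n++;
      }
      *pcount = n;
      return toks;
  }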
More recent strategy:
  Read a token, passing a "char **" string;
    Called function advances the pointer to the next token;
    Returns a token, with the type as a prefix character.
  The token is held in a temporary circular buffer.
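Sketched roughly (made-up names, reusing the NextToken sketch from above):

  #include <string.h>

  /* From the first sketch. */
  char *NextToken(char *pos, char *buf, int bufsize, int *ptokty);

  #define TKBUF_SLOTS  64
  #define TKBUF_SLOTSZ 260

  /* Advances the caller's cursor and returns a prefix-typed token
     string held in a small circular buffer; the string stays valid
     only until the buffer wraps around, so copy it out if needed. */
  char *ReadToken(char **ppos)
  {
      static char slots[TKBUF_SLOTS][TKBUF_SLOTSZ];
      static int rov;
      char buf[TKBUF_SLOTSZ - 4];
      int tokty;
      char *slot = slots[rov];

      rov = (rov + 1) % TKBUF_SLOTS;
      *ppos = NextToken(*ppos, buf, sizeof(buf), &tokty);
      slot[0] = "EI|X"[tokty];    /* token type as prefix character */
      strcpy(slot + 1, buf);
      return slot;
  }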
In the first tokenizer strategy, a temporary lookup table is kept, mapping input-string positions to previously-read tokens (with the length and type); this reduces the cost of trying to read the same tokens again. Calling the tokenizer function with a NULL input clears this table (say, if one is reusing the same buffer multiple times and rewriting its contents each time, such as in a preprocessor).
The third strategy can also use a similar lookup table, which can reduce how quickly the circular buffer wraps around. But, if one wants to hold onto a token for later use, it is necessary to copy it somewhere else (once too many further tokens have been read, the memory for previous tokens is silently overwritten).
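The lookup table can be sketched as a small direct-mapped cache (hypothetical layout; on a hit, the tokenizer skips re-scanning and reuses the remembered length and type):

  #include <stdint.h>
  #include <string.h>

  #define TKCACHE_SIZE 4096

  typedef struct {
      char *pos;   /* position within the source buffer */
      int   len;   /* token length */
      int   type;  /* token type */
  } TkCacheEnt;

  static TkCacheEnt tkcache[TKCACHE_SIZE];

  /* Returns the cached entry for 'pos' if one exists; a NULL input
     clears the whole table (e.g., when the buffer is being reused). */
  TkCacheEnt *TkCacheLookup(char *pos)
  {
      TkCacheEnt *ent;
      if (!pos) {
          memset(tkcache, 0, sizeof(tkcache));
          return NULL;
      }
      ent = &tkcache[(uintptr_t)pos % TKCACHE_SIZE];
      return (ent->pos == pos) ? ent : NULL;
  }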
As for the prefix encoding itself: say, for example, the input
  foo ( 123 "test" )
might give the token strings:
  "Ifoo", "X(", "|123", "Stest", "X)"
...