General notes about formal language grammars and nom.
In medieval Europe, Grammar, Logic and Rhetoric (the trivium) formed the foundation of education. Today this seems a strange classification of knowledge and study, but Logic and Grammar are the basis of computer science.
I should state here that I am not a great formal-language theorist or high-level mathematician who can give insights into grammar theory. This page is just a set of heuristic notes about things I have discovered while parsing languages with the PEP NOM system.
You can use the beginswith and endswith modifiers to achieve token lookahead in nom. The scripts /eg/maths.parse.pss and /eg/maths.to.latex.pss both have good examples of using lookahead.
The document /doc/howto/nom.lookahead.html has detailed information about how to write lookahead into a nom script.
There is also the peep register, which holds a single character of lookahead and is not used directly by the script writer.
Associativity: how do we parse “a + b + c”? Do we parse this as “(a + b) + c” (left-associative) or as “a + (b + c)” (right-associative)?
With nom we need to build this associativity into the grammar. See the arithmetic parser example for an example of associativity (with +/- signs, which associate right).
Precedence: how do we parse “a + b * c”? Do we parse this as “(a + b) * c” ('+' has higher precedence) or as “a + (b * c)” ('*' has higher precedence)?
Again, this has to be factored into the grammar.
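One common way to factor precedence into a grammar is to layer the rules, so that higher-precedence operators live in a lower (more tightly binding) rule. Here is a minimal Python recursive-descent sketch of that layering (an illustration only, not part of pep/nom):

```python
# Precedence encoded by layered grammar rules:
#   expr   = term   (('+'|'-') term)*      <- loosest binding
#   term   = factor (('*'|'/') factor)*    <- tighter binding
#   factor = NUMBER
import re

def tokenize(s):
    return re.findall(r"\d+|[+\-*/]", s)

def parse(tokens):
    pos = 0
    def peek():
        return tokens[pos] if pos < len(tokens) else None
    def factor():
        nonlocal pos
        tok = tokens[pos]; pos += 1
        return int(tok)
    def term():
        nonlocal pos
        v = factor()
        while peek() in ('*', '/'):
            op = tokens[pos]; pos += 1
            v = v * factor() if op == '*' else v // factor()
        return v
    def expr():
        nonlocal pos
        v = term()
        while peek() in ('+', '-'):
            op = tokens[pos]; pos += 1
            v = v + term() if op == '+' else v - term()
        return v
    return expr()

print(parse(tokenize("1 + 2 * 3")))  # 7, because '*' is handled below '+'
```

Because term is called from inside expr, the '*' is consumed before control returns to the '+' loop, which is exactly “a + (b * c)”.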
A parser generator is basically a way to avoid having to write a recursive descent parser (and compiler) or some other style of hand-coded compiler.
The Kleene star is a grammar construct which basically means “zero or more of something”. It exists in regular expressions:
s/ab*c//
This means: 'a' followed by zero or more 'b's, followed by 'c'. A similar repetition construct appears in EBNF grammars (classic ISO EBNF writes repetition with braces, as in {b}, but the '*' notation is also widely used).
The rule below means that a 'statement' is a 'keyword', followed by zero or more 'parameters', followed by a semicolon.
statement = keyword parameter* ';' ;
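For illustration, the same star can be exercised with Python's re module (a sketch only; the 'stmt' pattern is a hypothetical token-level approximation of the rule above, not how nom works):

```python
import re

# 'a' followed by zero or more 'b's, followed by 'c'
assert re.fullmatch(r"ab*c", "ac")       # zero 'b's
assert re.fullmatch(r"ab*c", "abbbc")    # three 'b's
assert not re.fullmatch(r"ab*c", "abd")  # no 'c'

# A rough regex rendering of:  statement = keyword parameter* ';'
# where a keyword is lowercase letters and a parameter is quoted text.
stmt = re.compile(r"\s*[a-z]+(\s*'[^']*')*\s*;")
assert stmt.fullmatch("say 'good';")
assert stmt.fullmatch("fizz;")
```

A regex can recognise such a flat rule, but unlike a parser it produces no structure and cannot handle nesting, which is where a real grammar takes over.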
How can we translate this into nom? We cannot create a single block which represents the rule above, but by using several blocks we can create a nom fragment which represents it.
The code below is a complete “lexer” and “recogniser” for statements such as those shown below. Notice that white-space is completely ignored except inside parameters.
say 'good'; fizz; bang 'crash' ;
whirr 'ok' 'okay'; grow'big' 'small' 'tiny' ;
bird
'tweet' 'fly '
'nest' ;
The script that follows is a recogniser because it doesn't actually compile/transpile/transform the input text; it just determines whether the input text consists of valid “statements”. Also notice that the single ebnf rule with a Kleene star is broken up into 3 nom blocks (statements between braces).
statement = keyword ';' ;
statement = keyword parameter ';' ;
statement = keyword parameterset ';' ;
parameterset = parameter parameter ;
parameterset = parameterset parameter ;
Although there are 5 ebnf rules above, nom only requires 3 blocks because it can OR tests together (using the ',' comma character).
# "lexing"
read;
# literal token, the character is also the parse-token
';' { add "*"; push; }
[:space:] { while [:space:]; clear; }
[:alpha:] { while [:alpha:]; put; clear; add "keyword*"; push; }
"'" { until "'"; put; clear; add "parameter*"; push; }
!"" {
put; clear;
add "?? bad char ["; get; add "]\n";
print; quit;
}
# The nom version of the ebnf rule
# statement = keyword parameter* ';' ;
parse>
# show the stack reductions for debugging
# add "line "; lines; add " char "; chars; add ": "; print; clear;
# unstack; print; stack; add "\n"; print; clear;
pop;
pop;
"keyword*;*"
{ clear; add "statement*"; push; .reparse }
"parameter*parameter*","paramset*parameter*"
{ clear; add "paramset*"; push; .reparse }
pop;
"keyword*parameter*;*","keyword*paramset*;*" {
clear; add "statement!\n"; print;
clip; clip; add "*";
push; .reparse
}
push;push;push;
Nom can only match or recognise a fixed number of parse-tokens at any point in the script. This is because it needs to pop the parse-tokens off the stack before it can match parse-token sequences.
This is why ebnf rules of the form
a = b+ c ; a = c b* ;
need to be broken up into several nom blocks: b+ and b* represent a variable number of parse-tokens.
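The idea can be sketched in Python (a hypothetical illustration, not nom syntax): because only a fixed-size window at the top of the stack is ever examined, the repetition parameter* is handled by repeatedly reducing two- and three-token sequences, mirroring the three nom blocks above.

```python
# Fixed-window shift-reduce recognition of
#   statement = keyword parameter* ';'
# using only 2- and 3-token reductions.
def recognise(tokens):
    stack = []
    for tok in tokens:
        stack.append(tok)              # "shift" a token onto the stack
        reduced = True
        while reduced:                 # keep reducing while any rule matches
            reduced = False
            top2 = tuple(stack[-2:])
            top3 = tuple(stack[-3:])
            if top2 in {("parameter", "parameter"), ("paramset", "parameter")}:
                stack[-2:] = ["paramset"]; reduced = True
            elif top2 == ("keyword", ";"):
                stack[-2:] = ["statement"]; reduced = True
            elif top3 in {("keyword", "parameter", ";"),
                          ("keyword", "paramset", ";")}:
                stack[-3:] = ["statement"]; reduced = True
    # input is valid if everything reduced to 'statement' tokens
    return all(t == "statement" for t in stack)

print(recognise(["keyword", "parameter", "parameter", ";"]))  # True
print(recognise(["keyword", ";", "keyword", "parameter", ";"]))  # True
print(recognise(["parameter", ";"]))  # False
```

No matter how many parameters a statement has, the paramset reductions absorb them one pair at a time, so the final statement reduction still only needs a three-token window.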
LL grammars are the class of grammars that recursive descent parsers handle naturally.
Although grammar theorists talk about all sorts of different types of parsing, in practice the type of parser written by jobbing programmers seems to be recursive descent. There is a very surprising statement on Robert Nystrom's blog that compilers for essentially all modern languages in common use are recursive descent. Mr. Nystrom is involved in the design of the Dart language and has written a great book about parsing and compiling, so I am pretty convinced he knows what he is talking about. Recursive descent is also the type of parser/compiler that seems to be taught in university comp-sci courses. Niklaus Wirth, of Pascal fame, was a strong proponent of this type of parser/compiler.
I don't wish to criticise recursive descent parsers, because they obviously work, but it does seem odd to me that this is still the main way to recognise and translate formal-languages.
NOM does not do recursive descent parsing: in fact I wrote nom because I wanted a way to understand the grammars of languages that was more “natural” for my way of thinking. The pep & nom system does shift-reduce parsing (www.google.com/search?q=shift+reduce+parsing), but it does it purely as a (unix-style) text-stream filter.
The pep/nom lexing/parsing/compiling process is as follows:
In the lexical analysis phase of the nom script, pep/nom reads the input stream (more or less) one Unicode character at a time and creates parse-tokens and their “attributes” in a text buffer. The parse-tokens are pushed onto a stack and the “attributes” (the partially compiled/translated text) are put into the tape. Then, during the parsing or “shift-reduce” phase of the script, the parse-tokens are popped off the stack back into the workspace buffer. If a particular parse-token sequence is matched or recognised, then the workspace is cleared, and the token attributes are retrieved from the tape, compiled, and put back onto the tape. Then the workspace (string) buffer is cleared and the new parse-token is created in the workspace with the add command. Finally, the new parse-token or tokens are pushed back onto the stack.
This entire process is text-oriented: everything is a “string”, including the parse-tokens, and the parse-tokens and their attributes are manipulated in the same virtual machine with the same commands.
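The stack-and-tape model described above can be roughly illustrated in Python (a sketch under assumed token names, not actual pep/nom code): parse-tokens live on a stack, each token's attribute lives in a parallel tape cell, and a reduction pops a matched token sequence and pushes one new token whose attribute is the combined translation.

```python
# Sketch of the pep/nom model: a stack of parse-tokens and a parallel
# "tape" of attributes (partial translations). Reducing the sequence
#   expr + expr  ->  expr
# combines the two attribute cells into one translated attribute.
def reduce_all(lexed):
    stack, tape = [], []
    for token, attr in lexed:
        stack.append(token)            # shift the parse-token
        tape.append(attr)              # shift its attribute
        while stack[-3:] == ["expr", "+", "expr"]:   # match a token sequence
            a, _, b = tape[-3:]
            stack[-3:] = ["expr"]                    # push the new parse-token
            tape[-3:] = [f"(add {a} {b})"]           # push the new attribute
    return list(zip(stack, tape))

lexed = [("expr", "x"), ("+", "+"), ("expr", "y"), ("+", "+"), ("expr", "z")]
print(reduce_all(lexed))  # [('expr', '(add (add x y) z)')]
```

Because each reduction fires as soon as the sequence appears at the top of the stack, the '+' here ends up left-associative, which shows how associativity falls out of when the reductions are allowed to run.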
My knowledge of formal language grammar theory is quite limited; I am more interested in practical techniques. But there is a reasonably close correlation between bnf-type grammar rules, such as
E := E + E
E := T
and pep/nom script construction.
The right-hand-side of a (E)BNF grammar rule is represented by the quoted text before a brace block, and the left-hand-side correlates to the new token pushed onto the stack.
"article*noun*" { clear; add "nounphrase*"; push; }
Terminology: productions (another name for grammar rules), terminals, non-terminals, tokens. Tokens correspond to terminals, and BNF stands for Backus-Naur Form.
There is a close relationship between virtual machines, automata, grammars and formal languages.