The Fortran language recognizer here is an LL recursive descent parser
composed from a "parser combinator" library that defines a few fundamental
parsers and a few ways to compose them into more powerful parsers.

For our purposes here, a *parser* is any object that can attempt to recognize
an instance of some syntax from an input stream.  It may succeed or fail.
On success, it may return some semantic value to its caller.

In C++ terms, a parser is any instance of a class that
  (1) has a constexpr default constructor,
  (2) defines a resultType typedef, and
  (3) provides a member or static function

      std::optional<resultType> Parse(ParseState *) const;
      static std::optional<resultType> Parse(ParseState *);

      that accepts a pointer to a ParseState as its argument and returns
      a std::optional<resultType> as a result, with the presence or absence
      of a value in the std::optional<> signifying success or failure
      respectively.

The resultType of a parser is typically the class type of some particular
node type in the parse tree.

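The three requirements above can be sketched as follows.  This is only an
illustration: the ParseState here is a hypothetical stand-in that holds
nothing but a cursor, and DigitSketch is an invented name, not one of the
library's parsers.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string_view>

// Hypothetical stand-in for ParseState: only a cursor over the input.
// The class described in this document also collects messages and flags.
struct ParseState {
  std::string_view text;
  std::size_t at{0};
};

// Satisfies the three requirements: a constexpr default constructor,
// a resultType typedef, and a Parse() function returning std::optional<>.
struct DigitSketch {
  using resultType = int;
  constexpr DigitSketch() {}
  static std::optional<resultType> Parse(ParseState *state) {
    if (state->at < state->text.size()) {
      char ch{state->text[state->at]};
      if (ch >= '0' && ch <= '9') {
        ++state->at;  // advance only on success
        return ch - '0';
      }
    }
    return std::nullopt;  // failure: no value, nothing consumed
  }
};

inline void DigitSketchDemo() {
  ParseState state{"7x"};
  constexpr DigitSketch digit;
  assert(digit.Parse(&state) == 7);  // matches '7'
  assert(!digit.Parse(&state));  // 'x' is not a digit
  assert(state.at == 1);  // the failed attempt consumed nothing
}
```

The presence of a value in the returned std::optional<> is the only success
signal; no exceptions or error codes are involved in this sketch.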
ParseState is a class that encapsulates a position in the source stream,
collects messages, and holds a few state flags that can affect tokenization
(e.g., are we in a character literal?).  Instances of ParseState are
independent and complete -- they are cheap to duplicate when necessary to
implement backtracking.

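One way to picture backtracking by duplication is sketched below.  The
ParseState is again a hypothetical, much-simplified stand-in, and
WithBacktrack and ABSketch are invented names for illustration; the
library's own backtracking code may be organized differently.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string_view>

// Hypothetical, much-simplified ParseState: a cursor and one flag.
struct ParseState {
  std::string_view text;
  std::size_t at{0};
  bool inCharLiteral{false};
};

// Matches "ab", but consumes 'a' before noticing that 'b' is missing,
// so a failure can leave the state partly advanced.
struct ABSketch {
  using resultType = bool;
  constexpr ABSketch() {}
  static std::optional<bool> Parse(ParseState *state) {
    constexpr std::string_view wanted{"ab"};
    for (char expected : wanted) {
      if (state->at >= state->text.size() ||
          state->text[state->at] != expected) {
        return std::nullopt;
      }
      ++state->at;
    }
    return true;
  }
};

// Snapshot the state, run the parser, and copy the snapshot back on
// failure so that the caller sees no partial progress.
template <typename PA>
std::optional<typename PA::resultType> WithBacktrack(
    const PA &parser, ParseState *state) {
  ParseState saved = *state;  // cheap: the state is a small value type
  std::optional<typename PA::resultType> result{parser.Parse(state)};
  if (!result) {
    *state = saved;
  }
  return result;
}

inline void BacktrackDemo() {
  ParseState state{"ac"};
  assert(!ABSketch::Parse(&state));
  assert(state.at == 1);  // 'a' stayed consumed
  state.at = 0;
  assert(!WithBacktrack(ABSketch{}, &state));
  assert(state.at == 0);  // restored from the snapshot
}
```

Because the snapshot is a complete copy, restoring it also undoes any flag
changes made by the failed parser, not just the cursor movement.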
The constexpr default constructor of a parser is important.  The functions
(below) that operate on instances of parsers are themselves all constexpr.
This use of compile-time expressions allows the entirety of a recursive
descent parser for a language to be constructed at compilation time through
the use of templates.

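A minimal sketch of compile-time composition, assuming a hypothetical
cursor-only ParseState and the invented names CharSketch and SequenceSketch.
The point is only that a constexpr operator over literal parser types yields
a composite parser that is itself a compile-time constant.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string_view>

// Hypothetical stand-in for ParseState.
struct ParseState {
  std::string_view text;
  std::size_t at{0};
};

template <char C> struct CharSketch {
  using resultType = char;
  constexpr CharSketch() {}
  static std::optional<char> Parse(ParseState *state) {
    if (state->at < state->text.size() && state->text[state->at] == C) {
      ++state->at;
      return C;
    }
    return std::nullopt;
  }
};

// Sequence of two parsers; holds copies of both so that the composite
// is itself a literal type with a constexpr default constructor.
template <typename PA, typename PB> class SequenceSketch {
public:
  using resultType = typename PB::resultType;
  constexpr SequenceSketch() {}
  constexpr SequenceSketch(const PA &pa, const PB &pb) : pa_{pa}, pb_{pb} {}
  std::optional<resultType> Parse(ParseState *state) const {
    if (pa_.Parse(state)) {
      return pb_.Parse(state);
    }
    return std::nullopt;
  }

private:
  PA pa_;
  PB pb_;
};

// constexpr composition; the defaulted template arguments restrict the
// operator to types that declare a resultType.
template <typename PA, typename PB, typename = typename PA::resultType,
    typename = typename PB::resultType>
constexpr SequenceSketch<PA, PB> operator>>(const PA &pa, const PB &pb) {
  return SequenceSketch<PA, PB>{pa, pb};
}

// The whole composite parser is a compile-time constant.
constexpr auto abcSketch =
    CharSketch<'a'>{} >> CharSketch<'b'>{} >> CharSketch<'c'>{};

inline void ConstexprDemo() {
  ParseState good{"abc"};
  ParseState bad{"abd"};
  assert(abcSketch.Parse(&good) == 'c');
  assert(!abcSketch.Parse(&bad));
}
```

Only the construction of the parser happens at compile time; the calls to
Parse() still run against the input at run time.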
These objects and functions are (or return) the fundamental parsers:

  ok                    always succeeds without advancing
  pure(x)               always succeeds without advancing, returning some value x
  fail<T>(msg)          always fails with the given message; optionally typed
  cut                   always fails, with no message
  guard(pred)           succeeds if the predicate expression evaluates to true
  rawNextChar           returns the next raw character; fails at EOF
  cookedNextChar        returns the next character after preprocessing, skipping
                        Fortran line continuations and comments; fails at EOF

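To make the table concrete, here is a rough sketch of four parsers in the
spirit of ok, pure(x), fail<T>, and rawNextChar.  All names ending in
"Sketch", the Success placeholder type, and the cursor-only ParseState are
assumptions made for this example; in particular, FailSketch ignores the
message that the real fail<T>(msg) would record.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string_view>

// Hypothetical stand-in for ParseState.
struct ParseState {
  std::string_view text;
  std::size_t at{0};
};

struct Success {};  // placeholder result for parsers with nothing to return

// In the spirit of "ok": always succeeds, never advances.
struct OkSketch {
  using resultType = Success;
  constexpr OkSketch() {}
  static std::optional<Success> Parse(ParseState *) { return Success{}; }
};

// In the spirit of "pure(x)": always succeeds with a fixed value.
template <typename T> class PureSketch {
public:
  using resultType = T;
  constexpr PureSketch() {}
  constexpr explicit PureSketch(const T &x) : value_{x} {}
  std::optional<T> Parse(ParseState *) const { return value_; }

private:
  T value_{};
};

template <typename T> constexpr PureSketch<T> Pure(const T &x) {
  return PureSketch<T>{x};
}

// In the spirit of "fail<T>": always fails (no message in this sketch).
template <typename T> struct FailSketch {
  using resultType = T;
  constexpr FailSketch() {}
  static std::optional<T> Parse(ParseState *) { return std::nullopt; }
};

// In the spirit of "rawNextChar": one character, or failure at the end.
struct NextCharSketch {
  using resultType = char;
  constexpr NextCharSketch() {}
  static std::optional<char> Parse(ParseState *state) {
    if (state->at >= state->text.size()) {
      return std::nullopt;  // end of input
    }
    return state->text[state->at++];
  }
};

inline void FundamentalsDemo() {
  ParseState state{"x"};
  assert(OkSketch::Parse(&state).has_value());
  assert(Pure(42).Parse(&state) == 42);
  assert(!FailSketch<int>::Parse(&state));
  assert(state.at == 0);  // none of the three advanced
  assert(NextCharSketch::Parse(&state) == 'x');
  assert(!NextCharSketch::Parse(&state));  // now at the end
}
```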
These functions and operators generate new parsers from combinations of
other parsers:

  !p                    ok if p fails, cut if p succeeds
  p >> q                match p, then q, returning q's value
  p / q                 match p, then q, returning p's value
  p || q                match p if it succeeds, else match q; p and q must be
                        the same type
  lookAhead(p)          succeeds iff p does, but doesn't modify state
  attempt(p)            succeeds iff p does, safely preserving state on failure
  many(p)               a greedy sequence of zero or more nonempty successes of p;
                        returns std::list<> of values
  some(p)               a greedy sequence of one or more successes of p
  skipMany(p)           same as many(p), but discards the results
                        (a performance optimization)
  maybe(p)              try to match p, returning optional<T>
  defaulted(p)          matches p, or else returns a default-constructed instance
                        of p's resultType
  nonemptySeparated(p, q)  repeatedly match p q p q p q ... p, returning
                        the values of the p's
  extension(p)          parses p if strict standard compliance is disabled,
                        with a warning if nonstandard usage warnings are enabled
  deprecated(p)         parses p if strict standard compliance is disabled,
                        with a warning if deprecated usage warnings are enabled
  inContext("...", p)   run p within an error message context

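The repetition and optional combinators can be sketched along these lines.
ManySketch and MaybeSketch are invented names, ParseState is a hypothetical
cursor-only stand-in, and the choice to stop after a success that consumes
nothing is this sketch's reading of "nonempty successes".

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <optional>
#include <string_view>
#include <utility>

// Hypothetical stand-in for ParseState.
struct ParseState {
  std::string_view text;
  std::size_t at{0};
};

struct DigitSketch {
  using resultType = char;
  constexpr DigitSketch() {}
  static std::optional<char> Parse(ParseState *state) {
    if (state->at < state->text.size() && state->text[state->at] >= '0' &&
        state->text[state->at] <= '9') {
      return state->text[state->at++];
    }
    return std::nullopt;
  }
};

// Zero or more matches of PA, collected into a std::list<>.
template <typename PA> class ManySketch {
public:
  using resultType = std::list<typename PA::resultType>;
  constexpr ManySketch() {}
  constexpr explicit ManySketch(const PA &p) : parser_{p} {}
  std::optional<resultType> Parse(ParseState *state) const {
    resultType values;
    while (true) {
      ParseState saved = *state;
      std::optional<typename PA::resultType> x{parser_.Parse(state)};
      if (!x) {
        *state = saved;  // undo whatever the failed attempt consumed
        break;
      }
      values.push_back(std::move(*x));
      if (state->at <= saved.at) {
        break;  // success without progress: stop rather than loop forever
      }
    }
    std::optional<resultType> result;
    result.emplace(std::move(values));
    return result;  // always succeeds, possibly with an empty list
  }

private:
  PA parser_;
};

// Always succeeds; the inner optional says whether PA matched.
template <typename PA> class MaybeSketch {
public:
  using resultType = std::optional<typename PA::resultType>;
  constexpr MaybeSketch() {}
  constexpr explicit MaybeSketch(const PA &p) : parser_{p} {}
  std::optional<resultType> Parse(ParseState *state) const {
    ParseState saved = *state;
    resultType inner{parser_.Parse(state)};
    if (!inner) {
      *state = saved;
    }
    std::optional<resultType> result;
    result.emplace(std::move(inner));  // outer optional is always engaged
    return result;
  }

private:
  PA parser_;
};

inline void RepetitionDemo() {
  ParseState state{"12a"};
  ManySketch<DigitSketch> digits;
  auto values{digits.Parse(&state)};
  assert(values.has_value() && values->size() == 2);
  assert(values->front() == '1' && values->back() == '2');
  assert(state.at == 2);
  MaybeSketch<DigitSketch> optionalDigit;
  auto none{optionalDigit.Parse(&state)};  // 'a' is not a digit
  assert(none.has_value() && !none->has_value());
}
```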
Note that "a >> b >> c / d / e" matches a sequence of five parsers,
but returns only the result that was obtained by matching c.

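The note follows from ordinary C++ operator precedence: "/" binds more
tightly than ">>", so the expression groups as (a >> b) >> ((c / d) / e).
The sketch below reproduces that behavior with invented KeepRight and
KeepLeft classes over a hypothetical cursor-only ParseState; it is not the
library's implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string_view>

// Hypothetical stand-in for ParseState.
struct ParseState {
  std::string_view text;
  std::size_t at{0};
};

template <char C> struct CharSketch {
  using resultType = char;
  constexpr CharSketch() {}
  static std::optional<char> Parse(ParseState *state) {
    if (state->at < state->text.size() && state->text[state->at] == C) {
      ++state->at;
      return C;
    }
    return std::nullopt;
  }
};

// p >> q: match both, keep q's value.
template <typename PA, typename PB> struct KeepRight {
  using resultType = typename PB::resultType;
  constexpr KeepRight() {}
  constexpr KeepRight(const PA &a, const PB &b) : pa{a}, pb{b} {}
  std::optional<resultType> Parse(ParseState *state) const {
    if (pa.Parse(state)) {
      return pb.Parse(state);
    }
    return std::nullopt;
  }
  PA pa;
  PB pb;
};

// p / q: match both, keep p's value.
template <typename PA, typename PB> struct KeepLeft {
  using resultType = typename PA::resultType;
  constexpr KeepLeft() {}
  constexpr KeepLeft(const PA &a, const PB &b) : pa{a}, pb{b} {}
  std::optional<resultType> Parse(ParseState *state) const {
    std::optional<resultType> result{pa.Parse(state)};
    if (result && pb.Parse(state)) {
      return result;
    }
    return std::nullopt;
  }
  PA pa;
  PB pb;
};

template <typename PA, typename PB, typename = typename PA::resultType,
    typename = typename PB::resultType>
constexpr KeepRight<PA, PB> operator>>(const PA &a, const PB &b) {
  return KeepRight<PA, PB>{a, b};
}

template <typename PA, typename PB, typename = typename PA::resultType,
    typename = typename PB::resultType>
constexpr KeepLeft<PA, PB> operator/(const PA &a, const PB &b) {
  return KeepLeft<PA, PB>{a, b};
}

// Groups as (a >> b) >> ((c / d) / e), so the value comes from c.
constexpr auto fiveSketch = CharSketch<'a'>{} >> CharSketch<'b'>{} >>
    CharSketch<'c'>{} / CharSketch<'d'>{} / CharSketch<'e'>{};

inline void SequenceDemo() {
  ParseState state{"abcde"};
  assert(fiveSketch.Parse(&state) == 'c');
  assert(state.at == 5);  // all five characters were consumed
}
```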
The following "applicative" combinators modify or combine the values returned
by parsers:

  construct<T>{}(p1, p2, ...)
      matches zero or more parsers in succession, collecting their
      results and then passing them with move semantics to a
      constructor for the type T if they all succeed
  applyFunction(f, p1, p2, ...)
      matches one or more parsers in succession, collecting their
      results and passing them as rvalue reference arguments to
      some function, returning its result
  applyLambda([](&&x){}, p1, p2, ...)
      is the same thing, but for lambdas and other function objects
  applyMem(mf, p1, p2, ...)
      is the same thing, but invokes a member function of the
      result of the first parser

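The idea behind construct<T> can be sketched with a variadic template.
ConstructSketch, MakeConstruct, SumSketch, and the cursor-only ParseState
are all hypothetical names for this example, and unlike the combinator
described above this sketch needs at least one parser argument.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string_view>
#include <tuple>
#include <utility>

// Hypothetical stand-in for ParseState.
struct ParseState {
  std::string_view text;
  std::size_t at{0};
};

struct DigitSketch {
  using resultType = int;
  constexpr DigitSketch() {}
  static std::optional<int> Parse(ParseState *state) {
    if (state->at < state->text.size() && state->text[state->at] >= '0' &&
        state->text[state->at] <= '9') {
      return state->text[state->at++] - '0';
    }
    return std::nullopt;
  }
};

template <char C> struct CharSketch {
  using resultType = char;
  constexpr CharSketch() {}
  static std::optional<char> Parse(ParseState *state) {
    if (state->at < state->text.size() && state->text[state->at] == C) {
      ++state->at;
      return C;
    }
    return std::nullopt;
  }
};

// Runs every parser in order and, if all succeed, builds a T from the
// moved results.  This sketch requires one or more parsers.
template <typename T, typename... PARSER> class ConstructSketch {
public:
  using resultType = T;
  constexpr ConstructSketch() {}
  constexpr explicit ConstructSketch(const PARSER &...p) : parsers_{p...} {}
  std::optional<T> Parse(ParseState *state) const {
    return Run(state, std::index_sequence_for<PARSER...>{});
  }

private:
  template <std::size_t... J>
  std::optional<T> Run(ParseState *state, std::index_sequence<J...>) const {
    std::tuple<std::optional<typename PARSER::resultType>...> results;
    // The fold over && runs the parsers left to right and stops at the
    // first failure.
    bool ok{((std::get<J>(results) = std::get<J>(parsers_).Parse(state))
                 .has_value() &&
        ...)};
    if (!ok) {
      return std::nullopt;
    }
    return T{std::move(*std::get<J>(results))...};
  }
  std::tuple<PARSER...> parsers_;
};

template <typename T, typename... PARSER>
constexpr ConstructSketch<T, PARSER...> MakeConstruct(const PARSER &...p) {
  return ConstructSketch<T, PARSER...>{p...};
}

// Hypothetical parse tree node for "digit + digit".
struct SumSketch {
  int lhs;
  char op;
  int rhs;
};

inline void ConstructDemo() {
  auto sum{MakeConstruct<SumSketch>(
      DigitSketch{}, CharSketch<'+'>{}, DigitSketch{})};
  ParseState state{"3+4"};
  std::optional<SumSketch> node{sum.Parse(&state)};
  assert(node.has_value());
  assert(node->lhs == 3 && node->op == '+' && node->rhs == 4);
}
```

A failed component leaves T unconstructed, so a parse tree node exists only
when its entire syntax has been recognized.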
These are non-advancing state inquiry and update parsers:

  getColumn             returns 1-based column position
  inCharLiteral         succeeds under withinCharLiteral
  inFortran             succeeds unless in a preprocessing directive
  inFixedForm           succeeds in fixed-form source
  setInFixedForm        sets the fixed-form flag, returns prior value
  columns               returns the 1-based column number after which source is
                        clipped
  setColumns(c)         sets "columns", returns prior value

When parsing depends on the result values of earlier parses, the
"monadic bind" combinator is available (but please try to avoid using it,
as it makes automatic analysis of the grammar difficult):

  p >>= f               match p, yielding some value x on success, then match the
                        parser returned from the function call f(x)

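A count-prefixed field, loosely in the style of a Hollerith literal, shows
why a bind is sometimes needed: the number of characters to read is known
only after the count has been parsed.  The sketch below assumes a
hypothetical cursor-only ParseState, a single-digit count, and the invented
names BindSketch, TakeSketch, and TakeCount; taking f as a plain function
pointer is a choice made here so that the composite stays default
constructible.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>
#include <string_view>
#include <utility>

// Hypothetical stand-in for ParseState.
struct ParseState {
  std::string_view text;
  std::size_t at{0};
};

struct DigitSketch {
  using resultType = int;
  constexpr DigitSketch() {}
  static std::optional<int> Parse(ParseState *state) {
    if (state->at < state->text.size() && state->text[state->at] >= '0' &&
        state->text[state->at] <= '9') {
      return state->text[state->at++] - '0';
    }
    return std::nullopt;
  }
};

// Reads exactly count_ characters, or fails if too few remain.
class TakeSketch {
public:
  using resultType = std::string;
  constexpr TakeSketch() {}
  constexpr explicit TakeSketch(std::size_t n) : count_{n} {}
  std::optional<std::string> Parse(ParseState *state) const {
    if (state->text.size() - state->at < count_) {
      return std::nullopt;
    }
    std::string result(state->text.substr(state->at, count_));
    state->at += count_;
    return result;
  }

private:
  std::size_t count_{0};
};

// p >>= f: run p, then run the parser that f builds from p's value.
template <typename PA, typename PB> class BindSketch {
public:
  using resultType = typename PB::resultType;
  using funcType = PB (*)(typename PA::resultType &&);
  constexpr BindSketch() {}
  constexpr BindSketch(const PA &p, funcType f) : parser_{p}, func_{f} {}
  std::optional<resultType> Parse(ParseState *state) const {
    std::optional<typename PA::resultType> x{parser_.Parse(state)};
    if (x && func_) {
      return func_(std::move(*x)).Parse(state);
    }
    return std::nullopt;
  }

private:
  PA parser_;
  funcType func_{nullptr};
};

template <typename PA, typename PB>
constexpr BindSketch<PA, PB> operator>>=(
    const PA &p, PB (*f)(typename PA::resultType &&)) {
  return BindSketch<PA, PB>{p, f};
}

// The function whose result depends on the earlier parse.
inline TakeSketch TakeCount(int &&n) {
  return TakeSketch{static_cast<std::size_t>(n)};
}

inline void BindDemo() {
  auto counted = (DigitSketch{} >>= TakeCount);
  ParseState state{"3abcde"};
  assert(counted.Parse(&state) == std::string{"abc"});
  assert(state.at == 4);
}
```

The second parser does not exist until run time, which is why a grammar
analysis tool cannot see through this combinator.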
Last, we have these basic parsers on which the actual Fortran grammar
is built.  All of the following parsers consume characters acquired from
"cookedNextChar".

  spaces                always succeeds after consuming any spaces or tabs
  digit                 matches one cooked decimal digit (0-9)
  letter                matches one cooked letter (A-Z)
  CharMatch<'c'>{}      matches one specific cooked character
  "..."_tok             match contents, skipping spaces before and after, and
                        with multiple spaces accepted for any internal space
  "..." >> p            the _tok suffix is optional on a string before >> and
                        after /
  parenthesized(p)      shorthand for "(" >> p / ")"
  bracketed(p)          shorthand for "[" >> p / "]"

  withinCharLiteral(p)  apply p, tokenizing for CHARACTER/Hollerith literals
  nonEmptyListOf(p)     matches a comma-separated list of one or more p's
  optionalListOf(p)     ditto, but can be empty

  "..."_debug           emit the string and succeed, for parser debugging
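Token matching with space skipping might look roughly like the sketch below.
It reads raw characters from a hypothetical cursor-only ParseState rather
than from cookedNextChar, TokenSketch is an invented name, and treating an
internal space in the token as "one or more blanks in the input" is an
assumption of this example.  The literal operator reuses the _tok spelling
only to mirror the table above.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string_view>

// Hypothetical stand-in for ParseState.
struct ParseState {
  std::string_view text;
  std::size_t at{0};
};

class TokenSketch {
public:
  using resultType = bool;
  constexpr TokenSketch() {}
  constexpr TokenSketch(const char *str, std::size_t n)
    : str_{str}, bytes_{n} {}
  std::optional<bool> Parse(ParseState *state) const {
    ParseState saved = *state;
    SkipSpaces(state);  // leading blanks
    for (std::size_t j{0}; j < bytes_; ++j) {
      bool matched{false};
      if (str_[j] == ' ') {
        // Assumption: an internal space needs at least one blank.
        matched = SkipSpaces(state) > 0;
      } else if (state->at < state->text.size() &&
          state->text[state->at] == str_[j]) {
        ++state->at;
        matched = true;
      }
      if (!matched) {
        *state = saved;  // leave the state untouched on failure
        return std::nullopt;
      }
    }
    SkipSpaces(state);  // trailing blanks
    return true;
  }

private:
  // Consumes blanks and tabs; returns how many were skipped.
  static std::size_t SkipSpaces(ParseState *state) {
    std::size_t start{state->at};
    while (state->at < state->text.size() &&
        (state->text[state->at] == ' ' || state->text[state->at] == '\t')) {
      ++state->at;
    }
    return state->at - start;
  }
  const char *str_{nullptr};
  std::size_t bytes_{0};
};

// User-defined literal so that "..."_tok yields a token parser.
constexpr TokenSketch operator""_tok(const char *str, std::size_t n) {
  return TokenSketch{str, n};
}

constexpr auto endDoSketch = "end do"_tok;

inline void TokenDemo() {
  ParseState state{"  end   do  x"};
  assert(endDoSketch.Parse(&state).has_value());
  assert(state.text[state.at] == 'x');  // blanks on both sides were skipped
  ParseState other{"end if"};
  assert(!endDoSketch.Parse(&other));
  assert(other.at == 0);  // restored after the mismatch
}
```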