Original-commit: flang-compiler/f18@c4634a44b9
The Fortran language recognizer here is an LL recursive descent parser
composed from a "parser combinator" library that defines a few fundamental
parsers and a few ways to compose them into more powerful parsers.

For our purposes here, a *parser* is any object that can attempt to recognize
an instance of some syntax from an input stream. It may succeed or fail.
On success, it may return some semantic value to its caller.

In C++ terms, a parser is any instance of a class that
  (1) has a constexpr default constructor,
  (2) defines a resultType typedef, and
  (3) provides a member or static function

        std::optional<resultType> Parse(ParseState *) const;
        static std::optional<resultType> Parse(ParseState *);

      that accepts a pointer to a ParseState as its argument and returns
      a std::optional<resultType> as a result, with the presence or absence
      of a value in the std::optional<> signifying success or failure
      respectively.

The resultType of a parser is typically the class of some particular kind
of node in the parse tree.

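As a rough illustration of the three requirements, the sketch below defines a
tiny parser for one decimal digit. The ParseState shown is a hypothetical
stand-in that holds only the unread input (the real class carries more), and
DigitParser and ParseDigit are names invented for this example.

```cpp
#include <cassert>
#include <optional>
#include <string_view>

// Hypothetical, simplified stand-in for ParseState: only the unread input.
struct ParseState {
  std::string_view rest;
};

// A parser for one decimal digit that meets the three requirements.
struct DigitParser {
  using resultType = int;     // (2) the result type
  constexpr DigitParser() {}  // (1) constexpr default constructor
  // (3) the Parse member function
  std::optional<resultType> Parse(ParseState *state) const {
    assert(state != nullptr);
    if (state->rest.empty() || state->rest[0] < '0' || state->rest[0] > '9') {
      return std::nullopt;  // failure: no value, state untouched
    }
    int value{state->rest[0] - '0'};
    state->rest.remove_prefix(1);  // success: advance past the digit
    return value;
  }
};

// Convenience wrapper (hypothetical) that runs the parser on a string.
inline std::optional<int> ParseDigit(std::string_view text) {
  ParseState state{text};
  return DigitParser{}.Parse(&state);
}
```

With these definitions, ParseDigit("7x") holds the value 7, while
ParseDigit("x7") holds no value at all.
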
ParseState is a class that encapsulates a position in the source stream,
collects messages, and holds a few state flags that can affect tokenization
(e.g., are we in a character literal?). Instances of ParseState are
independent and complete -- they are cheap to duplicate when necessary to
implement backtracking.

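Backtracking by duplicating the state might look roughly like the following
sketch. It assumes a hypothetical ParseState that holds only the unread input.
DoKeyword, Backtracking, and the two helper functions are invented for the
example and are not the library's own classes.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string_view>

// Hypothetical, simplified stand-in for ParseState: only the unread input.
struct ParseState {
  std::string_view rest;
};

// Matches the two letters "DO". On a partial match it leaves the state
// advanced past the 'D', which is what makes backtracking necessary.
struct DoKeyword {
  using resultType = char;
  constexpr DoKeyword() {}
  std::optional<char> Parse(ParseState *state) const {
    assert(state != nullptr);
    if (state->rest.empty() || state->rest[0] != 'D') return std::nullopt;
    state->rest.remove_prefix(1);
    if (state->rest.empty() || state->rest[0] != 'O') return std::nullopt;
    state->rest.remove_prefix(1);
    return 'O';
  }
};

// Runs PA on the live state but keeps a copy to restore on failure.
template<typename PA> struct Backtracking {
  using resultType = typename PA::resultType;
  constexpr Backtracking() {}
  std::optional<resultType> Parse(ParseState *state) const {
    ParseState saved = *state;  // duplicate the whole state
    auto result = PA{}.Parse(state);
    if (!result) {
      *state = saved;  // failure: rewind to where we started
    }
    return result;
  }
};

// Hypothetical helpers: how many characters remain after an attempt?
inline std::size_t LeftAfterPlain(std::string_view text) {
  ParseState state{text};
  DoKeyword{}.Parse(&state);
  return state.rest.size();
}
inline std::size_t LeftAfterBacktracking(std::string_view text) {
  ParseState state{text};
  Backtracking<DoKeyword>{}.Parse(&state);
  return state.rest.size();
}
```

On the input "DX" the plain parser fails after consuming the 'D', whereas the
wrapped parser fails with the state exactly as it found it.
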
The constexpr default constructor of a parser is important. The functions
(below) that operate on instances of parsers are themselves all constexpr.
This use of compile-time expressions allows the entirety of a recursive
descent parser for a language to be constructed at compilation time through
the use of templates.

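To make the compile-time construction concrete, here is a minimal sketch in
which a composite parser object is built entirely inside a constexpr
initializer. The ParseState is a hypothetical stand-in, and InSequence,
inSequence, and ParseABC are illustrative names rather than library API.

```cpp
#include <cassert>
#include <optional>
#include <string_view>

// Hypothetical, simplified stand-in for ParseState: only the unread input.
struct ParseState {
  std::string_view rest;
};

// Matches exactly the character C.
template<char C> struct CharMatch {
  using resultType = char;
  constexpr CharMatch() {}
  std::optional<char> Parse(ParseState *state) const {
    assert(state != nullptr);
    if (state->rest.empty() || state->rest[0] != C) return std::nullopt;
    state->rest.remove_prefix(1);
    return C;
  }
};

// Holds two parsers by value and runs them in order, keeping the second
// result. Every constructor is constexpr, so instances are literal objects.
template<typename PA, typename PB> class InSequence {
public:
  using resultType = typename PB::resultType;
  constexpr InSequence() {}
  constexpr InSequence(const PA &pa, const PB &pb) : pa_(pa), pb_(pb) {}
  std::optional<resultType> Parse(ParseState *state) const {
    if (!pa_.Parse(state)) return std::nullopt;
    return pb_.Parse(state);
  }
private:
  PA pa_{};
  PB pb_{};
};

// A constexpr function that composes two parsers into a new one.
template<typename PA, typename PB>
constexpr InSequence<PA, PB> inSequence(const PA &pa, const PB &pb) {
  return InSequence<PA, PB>(pa, pb);
}

// The whole composite is constructed during compilation; this line would
// not compile if any step of the composition were not constexpr.
constexpr auto abc = inSequence(
    CharMatch<'a'>{}, inSequence(CharMatch<'b'>{}, CharMatch<'c'>{}));

inline std::optional<char> ParseABC(std::string_view text) {
  ParseState state{text};
  return abc.Parse(&state);
}
```

Only the call to Parse happens at run time; the object abc itself needs no
run-time initialization.
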
These objects and functions are (or return) the fundamental parsers:

  ok                always succeeds without advancing
  pure(x)           always succeeds without advancing, returning some value x
  fail<T>(msg)      always fails with the given message; optionally typed
  cut               always fails, with no message
  guard(pred)       succeeds if the predicate expression evaluates to true
  rawNextChar       returns the next raw character; fails at EOF
  cookedNextChar    returns the next character after preprocessing, skipping
                    Fortran line continuations and comments; fails at EOF

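Three of these can be sketched in a few lines. The version below is an
illustration under stated assumptions, not the library's code: ParseState is
a hypothetical stand-in holding the unread input and a message list, Success
is an assumed placeholder result type for parsers that produce no value, and
the helper functions at the end exist only for the example.

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical, simplified stand-in for ParseState.
struct ParseState {
  std::string_view rest;
  std::vector<std::string> messages;
};

struct Success {};  // assumed placeholder result for value-less parsers

// ok: always succeeds without advancing.
struct OkParser {
  using resultType = Success;
  constexpr OkParser() {}
  static std::optional<Success> Parse(ParseState *) { return Success{}; }
};
constexpr OkParser ok{};

// pure(x): always succeeds without advancing, returning x.
template<typename T> class PureParser {
public:
  using resultType = T;
  constexpr PureParser() {}
  constexpr explicit PureParser(T x) : value_(x) {}
  std::optional<T> Parse(ParseState *) const { return value_; }
private:
  T value_{};
};
template<typename T> constexpr PureParser<T> pure(T x) {
  return PureParser<T>(x);
}

// fail<T>(msg): always fails, recording msg; T defaults to Success.
template<typename T> class FailParser {
public:
  using resultType = T;
  constexpr FailParser() {}
  constexpr explicit FailParser(const char *msg) : message_(msg) {}
  std::optional<T> Parse(ParseState *state) const {
    assert(state != nullptr);
    state->messages.emplace_back(message_);
    return std::nullopt;
  }
private:
  const char *message_{""};
};
template<typename T = Success> constexpr FailParser<T> fail(const char *msg) {
  return FailParser<T>(msg);
}

constexpr auto fortyTwo = pure(42);
constexpr auto noInteger = fail<int>("expected an integer");

// Hypothetical helpers for trying the parsers out.
inline ParseState RunFundamentals(std::string_view text) {
  ParseState state{text, {}};
  ok.Parse(&state);
  fortyTwo.Parse(&state);
  noInteger.Parse(&state);
  return state;
}
inline int PureValue(std::string_view text) {
  ParseState state{text, {}};
  return *fortyTwo.Parse(&state);
}
```

None of the three consumes input, so the state returned by RunFundamentals
still has all of its text unread and carries exactly one message.
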
These functions and operators generate new parsers from combinations of
other parsers:

  !p                       ok if p fails, cut if p succeeds
  p >> q                   match p, then q, returning q's value
  p / q                    match p, then q, returning p's value
  p || q                   match p if it succeeds, else match q; p and q must
                           have the same resultType
  lookAhead(p)             succeeds iff p does, but doesn't modify state
  attempt(p)               succeeds iff p does, safely preserving state on
                           failure
  many(p)                  a greedy sequence of zero or more nonempty successes
                           of p; returns std::list<> of values
  some(p)                  a greedy sequence of one or more successes of p
  skipMany(p)              same as many(p), but discards the results (a
                           performance optimization)
  maybe(p)                 try to match p, returning optional<T>
  defaulted(p)             matches p, or else returns a default-constructed
                           instance of p's resultType
  nonemptySeparated(p, q)  repeatedly match p q p q p q ... p, returning
                           the values of the p's
  extension(p)             parses p if strict standard compliance is disabled,
                           with a warning if nonstandard usage warnings are
                           enabled
  deprecated(p)            parses p if strict standard compliance is disabled,
                           with a warning if deprecated usage warnings are
                           enabled
  inContext("...", p)      run p within an error message context

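Two of these combinators, p || q and many(p), could be sketched as follows.
This is a simplified illustration: ParseState is a hypothetical stand-in,
CharRange and CountAlnum are invented for the example, and the sketch assumes
that p || q restores the state before trying q.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <optional>
#include <string_view>
#include <type_traits>
#include <utility>

// Hypothetical, simplified stand-in for ParseState: only the unread input.
struct ParseState {
  std::string_view rest;
};

// Matches one character in the inclusive range [Lo, Hi].
template<char Lo, char Hi> struct CharRange {
  using resultType = char;
  constexpr CharRange() {}
  std::optional<char> Parse(ParseState *state) const {
    assert(state != nullptr);
    if (state->rest.empty() || state->rest[0] < Lo || state->rest[0] > Hi) {
      return std::nullopt;
    }
    char ch{state->rest[0]};
    state->rest.remove_prefix(1);
    return ch;
  }
};

// p || q: try p; if it fails, rewind and try q.
template<typename PA, typename PB> struct Alternative {
  static_assert(
      std::is_same_v<typename PA::resultType, typename PB::resultType>);
  using resultType = typename PA::resultType;
  constexpr Alternative() {}
  std::optional<resultType> Parse(ParseState *state) const {
    ParseState saved = *state;
    if (auto a = PA{}.Parse(state)) return a;
    *state = saved;
    return PB{}.Parse(state);
  }
};
template<typename PA, typename PB>
constexpr Alternative<PA, PB> operator||(PA, PB) { return {}; }

// many(p): collect results until p fails or succeeds without advancing.
template<typename PA> struct Many {
  using resultType = std::list<typename PA::resultType>;
  constexpr Many() {}
  std::optional<resultType> Parse(ParseState *state) const {
    resultType values;
    for (;;) {
      ParseState saved = *state;
      auto x = PA{}.Parse(state);
      if (!x || state->rest.size() == saved.rest.size()) {
        *state = saved;  // failed or empty match: stop and rewind
        break;
      }
      values.push_back(std::move(*x));
    }
    return values;  // zero matches still counts as success
  }
};
template<typename PA> constexpr Many<PA> many(PA) { return {}; }

constexpr auto letterOrDigit = CharRange<'A', 'Z'>{} || CharRange<'0', '9'>{};
constexpr auto alnumRun = many(letterOrDigit);

// Hypothetical helper: length of the leading run of letters and digits.
inline std::size_t CountAlnum(std::string_view text) {
  ParseState state{text};
  return alnumRun.Parse(&state)->size();
}
```

The emptiness check in Many is what the word "nonempty" in the table guards
against: without it, a parser that succeeds without advancing would loop
forever.
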
Note that "a >> b >> c / d / e" matches a sequence of five parsers,
but returns only the result that was obtained by matching c.

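This follows from ordinary C++ operator precedence: "/" binds more tightly
than ">>", so the expression groups as (a >> b) >> ((c / d) / e). The sketch
below checks that with single-character parsers. It uses a hypothetical
ParseState and simplified stand-ins for the two operators (KeepSecond and
KeepFirst are invented names).

```cpp
#include <cassert>
#include <optional>
#include <string_view>

// Hypothetical, simplified stand-in for ParseState: only the unread input.
struct ParseState {
  std::string_view rest;
};

// Matches exactly the character C and returns it.
template<char C> struct CharMatch {
  using resultType = char;
  constexpr CharMatch() {}
  std::optional<char> Parse(ParseState *state) const {
    assert(state != nullptr);
    if (state->rest.empty() || state->rest[0] != C) return std::nullopt;
    state->rest.remove_prefix(1);
    return C;
  }
};

// p >> q: match p, then q, returning q's value.
template<typename PA, typename PB> struct KeepSecond {
  using resultType = typename PB::resultType;
  constexpr KeepSecond() {}
  std::optional<resultType> Parse(ParseState *state) const {
    if (!PA{}.Parse(state)) return std::nullopt;
    return PB{}.Parse(state);
  }
};
template<typename PA, typename PB>
constexpr KeepSecond<PA, PB> operator>>(PA, PB) { return {}; }

// p / q: match p, then q, returning p's value.
template<typename PA, typename PB> struct KeepFirst {
  using resultType = typename PA::resultType;
  constexpr KeepFirst() {}
  std::optional<resultType> Parse(ParseState *state) const {
    auto first = PA{}.Parse(state);
    if (!first || !PB{}.Parse(state)) return std::nullopt;
    return first;
  }
};
template<typename PA, typename PB>
constexpr KeepFirst<PA, PB> operator/(PA, PB) { return {}; }

// Groups as (a >> b) >> ((c / d) / e).
constexpr auto five = CharMatch<'a'>{} >> CharMatch<'b'>{} >>
    CharMatch<'c'>{} / CharMatch<'d'>{} / CharMatch<'e'>{};

inline std::optional<char> ParseFive(std::string_view text) {
  ParseState state{text};
  return five.Parse(&state);
}
```

All five characters must be present for the match to succeed, but the value
handed back is the one produced by the third parser.
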
The following "applicative" combinators modify or combine the values returned
by parsers:

  construct<T>{}(p1, p2, ...)
        matches zero or more parsers in succession, collecting their
        results and then passing them with move semantics to a
        constructor for the type T if they all succeed
  applyFunction(f, p1, p2, ...)
        matches one or more parsers in succession, collecting their
        results and passing them as rvalue reference arguments to
        some function, returning its result
  applyLambda([](auto &&x){}, p1, p2, ...)
        is the same thing, but for lambdas and other function objects
  applyMem(mf, p1, p2, ...)
        is the same thing, but invokes a member function of the
        result of the first parser

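A reduced sketch of construct<T>{}(p1, p2) is shown here for exactly two
parsers; the description above allows any number. ParseState, Letter, Digit,
and the NameAndIndex node type are all hypothetical and exist only for the
illustration.

```cpp
#include <cassert>
#include <optional>
#include <string_view>
#include <utility>

// Hypothetical, simplified stand-in for ParseState: only the unread input.
struct ParseState {
  std::string_view rest;
};

// Matches one upper-case letter.
struct Letter {
  using resultType = char;
  constexpr Letter() {}
  std::optional<char> Parse(ParseState *state) const {
    assert(state != nullptr);
    if (state->rest.empty() || state->rest[0] < 'A' || state->rest[0] > 'Z') {
      return std::nullopt;
    }
    char ch{state->rest[0]};
    state->rest.remove_prefix(1);
    return ch;
  }
};

// Matches one decimal digit and returns its numeric value.
struct Digit {
  using resultType = int;
  constexpr Digit() {}
  std::optional<int> Parse(ParseState *state) const {
    assert(state != nullptr);
    if (state->rest.empty() || state->rest[0] < '0' || state->rest[0] > '9') {
      return std::nullopt;
    }
    int value{state->rest[0] - '0'};
    state->rest.remove_prefix(1);
    return value;
  }
};

// Hypothetical parse tree node whose constructor takes rvalues.
struct NameAndIndex {
  NameAndIndex(char &&n, int &&i) : name{n}, index{i} {}
  char name;
  int index;
};

// Two-parser version of construct<T>: run both, then move the results
// into T's constructor. There is no rewinding on failure in this sketch.
template<typename T> struct construct {
  template<typename PA, typename PB> struct Parser {
    using resultType = T;
    constexpr Parser() {}
    std::optional<T> Parse(ParseState *state) const {
      auto a = PA{}.Parse(state);
      if (!a) return std::nullopt;
      auto b = PB{}.Parse(state);
      if (!b) return std::nullopt;
      return T(std::move(*a), std::move(*b));
    }
  };
  constexpr construct() {}
  template<typename PA, typename PB>
  constexpr Parser<PA, PB> operator()(PA, PB) const { return {}; }
};

constexpr auto nameAndIndex = construct<NameAndIndex>{}(Letter{}, Digit{});

inline std::optional<NameAndIndex> ParseNameAndIndex(std::string_view text) {
  ParseState state{text};
  return nameAndIndex.Parse(&state);
}
```

Parsing "X7" with this sketch yields a node with name 'X' and index 7, while
"7X" and "XY" both fail.
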
These are non-advancing state inquiry and update parsers:

  getColumn       returns 1-based column position
  inCharLiteral   succeeds under withinCharLiteral
  inFortran       succeeds unless in a preprocessing directive
  inFixedForm     succeeds in fixed-form source
  setInFixedForm  sets the fixed-form flag, returns prior value
  columns         returns the 1-based column number after which source is
                  clipped
  setColumns(c)   sets "columns", returns prior value

When parsing depends on the result values of earlier parses, the
"monadic bind" combinator is available (but please try to avoid using it,
as it makes automatic analysis of the grammar difficult):

  p >>= f         match p, yielding some value x on success, then match the
                  parser returned from the function call f(x)

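A count-prefixed string is a typical case where the second parser depends on
the first result. The sketch below is an assumption-laden illustration, not
the library's implementation: ParseState is a hypothetical stand-in, and
Digit, TakeChars, TakeThatMany, BoundParser, and ParseCounted are invented
names. For brevity, BoundParser omits the default constructor.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>
#include <string_view>
#include <type_traits>
#include <utility>

// Hypothetical, simplified stand-in for ParseState: only the unread input.
struct ParseState {
  std::string_view rest;
};

// Matches one decimal digit and returns its numeric value.
struct Digit {
  using resultType = int;
  constexpr Digit() {}
  std::optional<int> Parse(ParseState *state) const {
    assert(state != nullptr);
    if (state->rest.empty() || state->rest[0] < '0' || state->rest[0] > '9') {
      return std::nullopt;
    }
    int value{state->rest[0] - '0'};
    state->rest.remove_prefix(1);
    return value;
  }
};

// Takes exactly count_ characters; the count is known only at parse time.
class TakeChars {
public:
  using resultType = std::string;
  constexpr TakeChars() {}
  constexpr explicit TakeChars(std::size_t count) : count_(count) {}
  std::optional<std::string> Parse(ParseState *state) const {
    if (state->rest.size() < count_) return std::nullopt;
    std::string text(state->rest.substr(0, count_));
    state->rest.remove_prefix(count_);
    return text;
  }
private:
  std::size_t count_{0};
};

// The function f: maps the value from the first parser to a new parser.
constexpr TakeChars TakeThatMany(int n) {
  return TakeChars(static_cast<std::size_t>(n));
}

// p >>= f: match p, then match the parser returned by f(x).
template<typename PA, typename F> class BoundParser {
public:
  using resultType =
      typename std::invoke_result_t<F, typename PA::resultType>::resultType;
  constexpr BoundParser(const PA &pa, F f) : pa_(pa), f_(f) {}
  std::optional<resultType> Parse(ParseState *state) const {
    auto x = pa_.Parse(state);
    if (!x) return std::nullopt;
    return f_(std::move(*x)).Parse(state);
  }
private:
  PA pa_;
  F f_;
};
template<typename PA, typename F>
constexpr BoundParser<PA, F> operator>>=(PA pa, F f) {
  return BoundParser<PA, F>(pa, f);
}

constexpr auto counted = Digit{} >>= TakeThatMany;

inline std::optional<std::string> ParseCounted(std::string_view text) {
  ParseState state{text};
  return counted.Parse(&state);
}
```

Because the second parser is produced by running code on a parsed value, its
shape cannot be read off the grammar ahead of time, which is the analysis
difficulty mentioned above.
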
Last, we have these basic parsers on which the actual grammar of Fortran
is built. All of the following parsers consume characters acquired from
"cookedNextChar".

  spaces                always succeeds after consuming any spaces or tabs
  digit                 matches one cooked decimal digit (0-9)
  letter                matches one cooked letter (A-Z)
  CharMatch<'c'>{}      matches one specific cooked character
  "..."_tok             matches the contents, skipping spaces before and
                        after, and with multiple spaces accepted for any
                        internal space
  "..." >> p            the _tok suffix is optional on a string before >>
                        and after /
  parenthesized(p)      shorthand for "(" >> p / ")"
  bracketed(p)          shorthand for "[" >> p / "]"

  withinCharLiteral(p)  apply p, tokenizing for CHARACTER/Hollerith literals
  nonEmptyListOf(p)     matches a comma-separated list of one or more p's
  optionalListOf(p)     ditto, but can be empty

  "..."_debug           emit the string and succeed, for parser debugging

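The space-skipping token behavior and the two shorthands parenthesized(p) and
nonEmptyListOf(p) might be approximated as in the sketch below. Everything in
it is a simplified assumption: ParseState is a hypothetical stand-in, Token
handles a single character only, Digit returns an int for convenience, and
CountItems is a helper invented for the example.

```cpp
#include <cassert>
#include <list>
#include <optional>
#include <string_view>
#include <utility>

// Hypothetical, simplified stand-in for ParseState: only the unread input.
struct ParseState {
  std::string_view rest;
};

inline void SkipSpaces(ParseState *state) {
  while (!state->rest.empty() &&
      (state->rest[0] == ' ' || state->rest[0] == '\t')) {
    state->rest.remove_prefix(1);
  }
}

// Single-character token: skips spaces before and after the character.
template<char C> struct Token {
  using resultType = char;
  constexpr Token() {}
  std::optional<char> Parse(ParseState *state) const {
    assert(state != nullptr);
    SkipSpaces(state);
    if (state->rest.empty() || state->rest[0] != C) return std::nullopt;
    state->rest.remove_prefix(1);
    SkipSpaces(state);
    return C;
  }
};

// Matches one decimal digit and returns its numeric value.
struct Digit {
  using resultType = int;
  constexpr Digit() {}
  std::optional<int> Parse(ParseState *state) const {
    if (state->rest.empty() || state->rest[0] < '0' || state->rest[0] > '9') {
      return std::nullopt;
    }
    int value{state->rest[0] - '0'};
    state->rest.remove_prefix(1);
    return value;
  }
};

// nonEmptyListOf(p): p, then any number of "," p pairs.
template<typename PA> struct NonEmptyList {
  using resultType = std::list<typename PA::resultType>;
  constexpr NonEmptyList() {}
  std::optional<resultType> Parse(ParseState *state) const {
    auto first = PA{}.Parse(state);
    if (!first) return std::nullopt;
    resultType values;
    values.push_back(std::move(*first));
    for (;;) {
      ParseState saved = *state;
      auto comma = Token<','>{}.Parse(state);
      auto next = comma ? PA{}.Parse(state) : std::nullopt;
      if (!next) {
        *state = saved;  // no further ", p": rewind and stop
        break;
      }
      values.push_back(std::move(*next));
    }
    return values;
  }
};
template<typename PA> constexpr NonEmptyList<PA> nonEmptyListOf(PA) {
  return {};
}

// parenthesized(p): "(" then p then ")", returning p's value.
template<typename PA> struct Parenthesized {
  using resultType = typename PA::resultType;
  constexpr Parenthesized() {}
  std::optional<resultType> Parse(ParseState *state) const {
    if (!Token<'('>{}.Parse(state)) return std::nullopt;
    auto result = PA{}.Parse(state);
    if (!result || !Token<')'>{}.Parse(state)) return std::nullopt;
    return result;
  }
};
template<typename PA> constexpr Parenthesized<PA> parenthesized(PA) {
  return {};
}

constexpr auto digitList = parenthesized(nonEmptyListOf(Digit{}));

// Hypothetical helper: number of items parsed, or -1 on failure.
inline int CountItems(std::string_view text) {
  ParseState state{text};
  auto items = digitList.Parse(&state);
  return items ? static_cast<int>(items->size()) : -1;
}
```

Spacing around the punctuation does not matter in this sketch, so
"( 1 , 2,3 )" is accepted as a list of three items, while "(1 2)" and "()"
are rejected.
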