llvm/flang/parser-combinators.txt


The Fortran language recognizer here is an LL recursive descent parser
composed from a "parser combinator" library that defines a few fundamental
parsers and a few ways to compose them into more powerful parsers.

For our purposes here, a *parser* is any object that can attempt to recognize
an instance of some syntax from an input stream. It may succeed or fail.
On success, it may return some semantic value to its caller.

In C++ terms, a parser is any instance of a class that
  (1) has a constexpr default constructor,
  (2) defines a resultType typedef, and
  (3) provides a member or static function
        std::optional<resultType> Parse(ParseState *) const;
      or
        static std::optional<resultType> Parse(ParseState *);
      that accepts a pointer to a ParseState as its argument and returns
      a std::optional<resultType> as a result, with the presence or absence
      of a value in the std::optional<> signifying success or failure
      respectively.

The resultType of a parser is typically the class type of some particular
node type in the parse tree.

ParseState is a class that encapsulates a position in the source stream,
collects messages, and holds a few state flags that can affect tokenization
(e.g., are we in a character literal?). Instances of ParseState are
independent and complete -- they are cheap to duplicate when necessary to
implement backtracking.
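
As an illustration of the three requirements and of backtracking by
duplication, here is a minimal sketch. It is not the library's code: the
ParseState below is a hypothetical stand-in that holds only a position in the
text (no messages or flags), and DigitValue, ParseOrRestore, and Demo are names
invented for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string_view>

// Hypothetical stand-in for ParseState: only a position in the source text.
struct ParseState {
  std::string_view text;
  std::size_t at{0};
};

// Recognizes one decimal digit and returns its numeric value.
struct DigitValue {
  using resultType = int;    // (2) the resultType typedef
  constexpr DigitValue() {}  // (1) constexpr default constructor
  // (3) Parse: a value on success, std::nullopt on failure
  static std::optional<resultType> Parse(ParseState *state) {
    if (state->at < state->text.size()) {
      char ch{state->text[state->at]};
      if (ch >= '0' && ch <= '9') {
        ++state->at;
        return ch - '0';
      }
    }
    return std::nullopt;
  }
};

// Backtracking by duplication: save a copy, restore it if the parser fails.
template <typename PA>
std::optional<typename PA::resultType> ParseOrRestore(
    const PA &parser, ParseState *state) {
  ParseState saved{*state};
  auto result = parser.Parse(state);
  if (!result) {
    *state = saved;
  }
  return result;
}

inline void Demo() {
  ParseState state{"7x"};
  assert(ParseOrRestore(DigitValue{}, &state) == 7);  // consumes '7'
  assert(!ParseOrRestore(DigitValue{}, &state));      // 'x' is not a digit
  assert(state.at == 1);                              // position unchanged
}
```

Copying the whole state is only reasonable because the state is small, which
is the point made above about ParseState being cheap to duplicate.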

The constexpr default constructor of a parser is important. The functions
(below) that operate on instances of parsers are themselves all constexpr.
This use of compile-time expressions allows the entirety of a recursive
descent parser for a language to be constructed at compilation time through
the use of templates.

These objects and functions are (or return) the fundamental parsers:

  ok              always succeeds without advancing
  pure(x)         always succeeds without advancing, returning some value x
  fail<T>(msg)    always fails with the given message; optionally typed
  cut             always fails, with no message
  guard(pred)     succeeds if the predicate expression evaluates to true
  rawNextChar     returns the next raw character; fails at EOF
  cookedNextChar  returns the next character after preprocessing, skipping
                  Fortran line continuations and comments; fails at EOF
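
To make a few of these concrete, the sketch below shows how ok, cut, and
rawNextChar could look as parser classes. It assumes a simplified, hypothetical
ParseState (a position in the text only) and an invented empty result type
named Success; the class names are illustrative, and preprocessing, messages,
and the remaining fundamental parsers are left out.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string_view>

// Hypothetical stand-in for ParseState: only a position in the source text.
struct ParseState {
  std::string_view text;
  std::size_t at{0};
};

// Hypothetical empty result for parsers that produce no useful value.
struct Success {};

// ok: always succeeds without advancing.
struct OkParser {
  using resultType = Success;
  constexpr OkParser() {}
  static std::optional<Success> Parse(ParseState *) { return Success{}; }
};
constexpr OkParser ok{};

// cut: always fails, with no message.
struct CutParser {
  using resultType = Success;
  constexpr CutParser() {}
  static std::optional<Success> Parse(ParseState *) { return std::nullopt; }
};
constexpr CutParser cut{};

// rawNextChar: returns the next raw character; fails at EOF.
struct RawNextChar {
  using resultType = char;
  constexpr RawNextChar() {}
  static std::optional<char> Parse(ParseState *state) {
    if (state->at >= state->text.size()) {
      return std::nullopt;
    }
    return state->text[state->at++];
  }
};
constexpr RawNextChar rawNextChar{};

inline void Demo() {
  ParseState state{"ab"};
  assert(ok.Parse(&state) && state.at == 0);    // success, no advance
  assert(!cut.Parse(&state) && state.at == 0);  // failure, no advance
  assert(rawNextChar.Parse(&state) == 'a');
  assert(rawNextChar.Parse(&state) == 'b');
  assert(!rawNextChar.Parse(&state));           // EOF
}
```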

These functions and operators generate new parsers from combinations of
other parsers:

  !p              ok if p fails, cut if p succeeds
  p >> q          match p, then q, returning q's value
  p / q           match p, then q, returning p's value
  p || q          match p if it succeeds, else match q; p and q must have
                  the same resultType
  lookAhead(p)    succeeds iff p does, but doesn't modify state
  attempt(p)      succeeds iff p does, safely preserving state on failure
  many(p)         a greedy sequence of zero or more nonempty successes of p;
                  returns std::list<> of values
  some(p)         a greedy sequence of one or more successes of p
  skipMany(p)     same as many(p), but discards result (performance
                  optimization)
  maybe(p)        try to match p, returning std::optional<T>
  defaulted(p)    matches p, or else returns a default-constructed instance
                  of p's resultType
  nonemptySeparated(p, q)
                  repeatedly match p q p q p q ... p, returning the values
                  of the p's
  extension(p)    parses p if strict standard compliance is disabled,
                  with a warning if nonstandard usage warnings are enabled
  deprecated(p)   parses p if strict standard compliance is disabled,
                  with a warning if deprecated usage warnings are enabled
  inContext("...", p)
                  run p within an error message context

Note that "a >> b >> c / d / e" matches a sequence of five parsers,
but returns only the result that was obtained by matching c. Because C++
gives / higher precedence than >>, the expression groups as
(a >> b) >> ((c / d) / e).
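
The sequencing and alternation operators can be sketched as follows. This is
an illustration under simplifying assumptions, not the library's
implementation: ParseState is a hypothetical position-only stand-in, Ch is a
cut-down single-character matcher invented for the example, and the combinators
are stateless class templates (SeqSecond, SeqFirst, Alt) that default-construct
their operands instead of storing them.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string_view>

// Hypothetical stand-in for ParseState: only a position in the source text.
struct ParseState {
  std::string_view text;
  std::size_t at{0};
};

// Matches one specific character (illustrative only).
template <char C> struct Ch {
  using resultType = char;
  constexpr Ch() {}
  static std::optional<char> Parse(ParseState *s) {
    if (s->at < s->text.size() && s->text[s->at] == C) {
      ++s->at;
      return C;
    }
    return std::nullopt;
  }
};

// p >> q: match p, then q, returning q's value.
template <typename PA, typename PB> struct SeqSecond {
  using resultType = typename PB::resultType;
  constexpr SeqSecond() {}
  static std::optional<resultType> Parse(ParseState *s) {
    if (PA{}.Parse(s)) {
      return PB{}.Parse(s);
    }
    return std::nullopt;
  }
};

// p / q: match p, then q, returning p's value.
template <typename PA, typename PB> struct SeqFirst {
  using resultType = typename PA::resultType;
  constexpr SeqFirst() {}
  static std::optional<resultType> Parse(ParseState *s) {
    auto first = PA{}.Parse(s);
    if (first && PB{}.Parse(s)) {
      return first;
    }
    return std::nullopt;
  }
};

// p || q: match p if it succeeds, else restore the state and match q.
template <typename PA, typename PB> struct Alt {
  using resultType = typename PA::resultType;
  constexpr Alt() {}
  static std::optional<resultType> Parse(ParseState *s) {
    ParseState saved{*s};
    if (auto a = PA{}.Parse(s)) {
      return a;
    }
    *s = saved;
    return PB{}.Parse(s);
  }
};

// The operators apply only to types that define resultType.
template <typename PA, typename PB, typename = typename PA::resultType,
    typename = typename PB::resultType>
constexpr SeqSecond<PA, PB> operator>>(const PA &, const PB &) { return {}; }
template <typename PA, typename PB, typename = typename PA::resultType,
    typename = typename PB::resultType>
constexpr SeqFirst<PA, PB> operator/(const PA &, const PB &) { return {}; }
template <typename PA, typename PB, typename = typename PA::resultType,
    typename = typename PB::resultType>
constexpr Alt<PA, PB> operator||(const PA &, const PB &) { return {}; }

inline void Demo() {
  // The whole grammar object is built at compilation time.
  constexpr auto grammar =
      Ch<'a'>{} >> Ch<'b'>{} >> Ch<'c'>{} / Ch<'d'>{} / Ch<'e'>{};
  ParseState state{"abcde"};
  assert(grammar.Parse(&state) == 'c');  // five parsers matched, c's value kept
  assert(state.at == 5);
  ParseState other{"xyz"};
  assert((Ch<'a'>{} || Ch<'x'>{}).Parse(&other) == 'x');
}
```

In this sketch a failed sequence may leave the state partially advanced;
restoring it is the job of a wrapper such as attempt(p).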

The following "applicative" combinators modify or combine the values returned
by parsers:

  construct<T>{}(p1, p2, ...)
        matches zero or more parsers in succession, collecting their
        results and then passing them with move semantics to a
        constructor for the type T if they all succeed
  applyFunction(f, p1, p2, ...)
        matches one or more parsers in succession, collecting their
        results and passing them as rvalue reference arguments to
        some function, returning its result
  applyLambda([](auto &&x){}, p1, p2, ...)
        is the same thing, but for lambdas and other function objects
  applyMem(mf, p1, p2, ...)
        is the same thing, but invokes a member function of the
        result of the first parser
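
The idea behind applyFunction can be sketched as below. This is a hedged
illustration rather than the library's interface: the function and the parsers
are passed here as template arguments to an invented ApplyFunction class
template (which needs C++17), whereas the combinator described above takes them
as ordinary arguments. ParseState, Digit, and MakeTwoDigitNumber are likewise
hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string_view>
#include <tuple>
#include <type_traits>
#include <utility>

// Hypothetical stand-in for ParseState: only a position in the source text.
struct ParseState {
  std::string_view text;
  std::size_t at{0};
};

// Recognizes one decimal digit and returns its numeric value.
struct Digit {
  using resultType = int;
  constexpr Digit() {}
  static std::optional<int> Parse(ParseState *s) {
    if (s->at < s->text.size() && s->text[s->at] >= '0' &&
        s->text[s->at] <= '9') {
      return s->text[s->at++] - '0';
    }
    return std::nullopt;
  }
};

// Runs each parser in turn; if all succeed, passes their results to F
// as rvalue references and returns F's result.
template <auto F, typename... PARSERS> struct ApplyFunction {
  using resultType = std::invoke_result_t<decltype(F),
      typename PARSERS::resultType &&...>;
  constexpr ApplyFunction() {}
  static std::optional<resultType> Parse(ParseState *s) {
    std::tuple<std::optional<typename PARSERS::resultType>...> results;
    return Run(s, results, std::index_sequence_for<PARSERS...>{});
  }

private:
  template <typename TUPLE, std::size_t... J>
  static std::optional<resultType> Run(
      ParseState *s, TUPLE &results, std::index_sequence<J...>) {
    // The fold over && evaluates left to right and stops at the first failure.
    if (((std::get<J>(results) = PARSERS{}.Parse(s)).has_value() && ...)) {
      return F(std::move(*std::get<J>(results))...);
    }
    return std::nullopt;
  }
};

// An example function to apply (hypothetical).
inline int MakeTwoDigitNumber(int &&tens, int &&ones) {
  return 10 * tens + ones;
}

inline void Demo() {
  using TwoDigits = ApplyFunction<MakeTwoDigitNumber, Digit, Digit>;
  ParseState state{"42"};
  assert(TwoDigits::Parse(&state) == 42);
  ParseState bad{"4x"};
  assert(!TwoDigits::Parse(&bad));  // second digit missing
}
```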

These are non-advancing state inquiry and update parsers:

  getColumn       returns 1-based column position
  inCharLiteral   succeeds under withinCharLiteral
  inFortran       succeeds unless in a preprocessing directive
  inFixedForm     succeeds in fixed-form source
  setInFixedForm  sets the fixed-form flag, returns prior value
  columns         returns the 1-based column number after which source
                  is clipped
  setColumns(c)   sets "columns", returns prior value

When parsing depends on the result values of earlier parses, the
"monadic bind" combinator is available (but please try to avoid using it,
as it makes automatic analysis of the grammar difficult):

  p >>= f         match p, yielding some value x on success, then match the
                  parser returned from the function call f(x)
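
A small sketch may help show when bind is needed: here the number of
characters to consume depends on a digit parsed just before, loosely in the
spirit of a counted literal. Everything in it is an assumption made for
illustration (the position-only ParseState, and the names Digit, TakeChars,
Bind, and CharsAfterCount); unlike the stateless parsers, Bind stores its
operands and so is constructed from them.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string_view>
#include <type_traits>
#include <utility>

// Hypothetical stand-in for ParseState: only a position in the source text.
struct ParseState {
  std::string_view text;
  std::size_t at{0};
};

// Recognizes one decimal digit and returns its numeric value.
struct Digit {
  using resultType = int;
  constexpr Digit() {}
  static std::optional<int> Parse(ParseState *s) {
    if (s->at < s->text.size() && s->text[s->at] >= '0' &&
        s->text[s->at] <= '9') {
      return s->text[s->at++] - '0';
    }
    return std::nullopt;
  }
};

// Consumes exactly count_ characters; fails if fewer remain.
class TakeChars {
public:
  using resultType = std::string_view;
  constexpr TakeChars() {}
  constexpr explicit TakeChars(std::size_t count) : count_{count} {}
  std::optional<resultType> Parse(ParseState *s) const {
    if (s->text.size() - s->at < count_) {
      return std::nullopt;
    }
    auto piece = s->text.substr(s->at, count_);
    s->at += count_;
    return piece;
  }

private:
  std::size_t count_{0};
};

// p >>= f: match p, then match the parser returned by f(x).
template <typename PA, typename F> class Bind {
public:
  using resultType = typename std::invoke_result_t<F,
      typename PA::resultType &&>::resultType;
  constexpr Bind(PA parser, F function)
      : parser_{parser}, function_{function} {}
  std::optional<resultType> Parse(ParseState *s) const {
    if (auto x = parser_.Parse(s)) {
      return function_(std::move(*x)).Parse(s);
    }
    return std::nullopt;
  }

private:
  PA parser_;
  F function_;
};

template <typename PA, typename F, typename = typename PA::resultType>
constexpr Bind<PA, F> operator>>=(const PA &parser, F function) {
  return Bind<PA, F>{parser, function};
}

// Builds the second parser from the first parser's value.
inline TakeChars CharsAfterCount(int &&count) {
  return TakeChars{static_cast<std::size_t>(count)};
}

inline void Demo() {
  const auto counted = Digit{} >>= CharsAfterCount;
  ParseState state{"3abcde"};
  assert(counted.Parse(&state) == std::string_view{"abc"});
  assert(state.at == 4);
  ParseState tooShort{"5ab"};
  assert(!counted.Parse(&tooShort));  // fewer than 5 characters remain
}
```

The second parser exists only after f has run on a parsed value, so the
grammar can no longer be inspected as a fixed structure, which is why bind is
discouraged above.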

Last, we have these basic parsers on which the actual grammar of Fortran
is built. All of the following parsers consume characters acquired from
"cookedNextChar".

  spaces                always succeeds after consuming any spaces or tabs
  digit                 matches one cooked decimal digit (0-9)
  letter                matches one cooked letter (A-Z)
  CharMatch<'c'>{}      matches one specific cooked character
  "..."_tok             match contents, skipping spaces before and after, and
                        with multiple spaces accepted for any internal space
  "..." >> p            the _tok suffix is optional on a string before >> and
                        after /
  parenthesized(p)      shorthand for "(" >> p / ")"
  bracketed(p)          shorthand for "[" >> p / "]"
  withinCharLiteral(p)  apply p, tokenizing for CHARACTER/Hollerith literals
  nonEmptyListOf(p)     matches a comma-separated list of one or more p's
  optionalListOf(p)     ditto, but can be empty
  "..."_debug           emit the string and succeed, for parser debugging