[flang] Convert parser combinator documentation file to Markdown.

Original-commit: flang-compiler/f18@263865c97a
peter klausler 2018-02-05 16:53:38 -08:00
parent 94c26b688e
commit 1e69ed0c1b
3 changed files with 148 additions and 128 deletions


@ -20,6 +20,8 @@ in foo.cc.)
1. In the source file "foo.cc", put the #include of "foo.h" first.
Then #include other project headers in alphabetic order; then C++ standard
headers, also alphabetically; then C and system headers.
1. Don't include the standard iostream header. If you need it for debugging,
remove the inclusion before committing.
### Naming
1. C++ names that correspond to STL names should look like those STL names
(e.g., *clear()* and *size()* member functions in a class that implements
@ -40,7 +42,7 @@ especially when you can declare them directly in a for()/while()/if()
condition. Otherwise, prefer complete English words to abbreviations
when creating names.
### Commentary
1. Use // for all comments except for short notes within statements.
1. Use // for all comments except for short notes within expressions.
1. When // follows code on a line, precede it with two spaces.
1. Comments should matter. Assume that the reader knows current C++ at least as
well as you do and avoid distracting her by calling out usage of new

flang/ParserCombinators.md

@ -0,0 +1,145 @@
## Concept
The Fortran language recognizer here can be classified as an LL recursive
descent parser. It is composed from a *parser combinator* library that
defines a few fundamental parsers and a few ways to compose them into more
powerful parsers.
For our purposes here, a *parser* is any object that can attempt to recognize
an instance of some syntax from an input stream. It may succeed or fail.
On success, it may return some semantic value to its caller.
In C++ terms, a parser is any instance of a class that
1. has a *constexpr* default constructor,
1. defines a resultType type, and
1. provides a member or static function that accepts a pointer to a
ParseState as its argument and returns a std::optional<resultType> as a
result, with the presence or absence of a value in the std::optional<>
signifying success or failure, respectively.
> std::optional<resultType> Parse(ParseState *) const;
The resultType of a parser is typically the class type of some particular
node type in the parse tree.
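
As an illustration of the three requirements above, here is a minimal sketch of a conforming parser class. The `ParseState` shown is a hypothetical, heavily simplified stand-in (just an input view and an offset), and `DigitValue` and `ParseDigit` are names invented for this example; none of them are taken from the actual library.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string_view>

// Hypothetical stand-in for the real ParseState: only a position in the input.
struct ParseState {
  std::string_view input;
  std::size_t offset{0};
};

// A parser that recognizes one decimal digit and returns its numeric value.
struct DigitValue {
  using resultType = int;  // requirement 2: a resultType type
  constexpr DigitValue() {}  // requirement 1: constexpr default constructor
  // Requirement 3: Parse() returns a value on success, nothing on failure.
  std::optional<resultType> Parse(ParseState *state) const {
    if (state->offset < state->input.size()) {
      char ch{state->input[state->offset]};
      if (ch >= '0' && ch <= '9') {
        ++state->offset;  // advance past the recognized character
        return ch - '0';
      }
    }
    return std::nullopt;  // failure; the state has not been advanced
  }
};

// Convenience wrapper for experimenting with the parser on a string.
inline std::optional<int> ParseDigit(std::string_view text) {
  ParseState state{text};
  return DigitValue{}.Parse(&state);
}
```

Calling `ParseDigit("7x")` yields an engaged optional holding 7, while `ParseDigit("x7")` yields an empty one.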
*ParseState* is a class that encapsulates a position in the source stream,
collects messages, and holds a few state flags that determine tokenization
(e.g., are we in a character literal?). Instances of *ParseState* are
independent and complete -- they are cheap to duplicate whenever necessary to
implement backtracking.
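
Because a state object is independent and cheap to copy, backtracking can be as simple as saving a copy and restoring it on failure. The following sketch shows the idea with a hypothetical two-field `ParseState`; `MatchLiteral` and `AttemptLiteral` are illustrative helpers, not library functions.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Hypothetical stand-in for the real ParseState.
struct ParseState {
  std::string_view input;
  std::size_t offset{0};
};

// Matches the given text character by character. On a mismatch it returns
// false and may leave the state partially advanced.
inline bool MatchLiteral(ParseState *state, std::string_view text) {
  for (char ch : text) {
    if (state->offset >= state->input.size() ||
        state->input[state->offset] != ch) {
      return false;
    }
    ++state->offset;
  }
  return true;
}

// Backtracking wrapper: copy the state first, restore the copy on failure.
inline bool AttemptLiteral(ParseState *state, std::string_view text) {
  ParseState backup{*state};
  if (MatchLiteral(state, text)) {
    return true;
  }
  *state = backup;  // undo any partial progress
  return false;
}
```

With the input `ENDIF`, a failed `AttemptLiteral(&state, "ENDDO")` leaves the offset where it started, so a following `AttemptLiteral(&state, "ENDIF")` can still succeed.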
The constexpr default constructor of a parser is important. The functions
(below) that operate on instances of parsers are themselves all constexpr.
This use of compile-time expressions allows the entirety of a recursive
descent parser for a language to be constructed at compilation time through
the use of templates.
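
To make the compile-time construction concrete, here is a small sketch in which a composite parser object is built entirely in a constant expression. The `ParseState`, this simplified `CharMatch`, the `Sequence` class, and `MatchesAB` are all assumptions made for the example; the real library's classes may be organized differently.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string_view>

// Hypothetical stand-in for the real ParseState.
struct ParseState {
  std::string_view input;
  std::size_t offset{0};
};

// Simplified single-character parser (no "cooking" of the input).
template<char C> struct CharMatch {
  using resultType = char;
  constexpr CharMatch() {}
  static std::optional<char> Parse(ParseState *state) {
    if (state->offset < state->input.size() &&
        state->input[state->offset] == C) {
      ++state->offset;
      return C;
    }
    return std::nullopt;
  }
};

// Runs pa, then pb, and returns pb's value.
template<typename PA, typename PB> class Sequence {
public:
  using resultType = typename PB::resultType;
  constexpr Sequence(const PA &pa, const PB &pb) : pa_{pa}, pb_{pb} {}
  std::optional<resultType> Parse(ParseState *state) const {
    if (pa_.Parse(state)) {
      return pb_.Parse(state);
    }
    return std::nullopt;
  }
private:
  PA pa_;
  PB pb_;
};

// The combinator itself is constexpr, so it can run during compilation.
template<typename PA, typename PB>
constexpr Sequence<PA, PB> operator>>(const PA &pa, const PB &pb) {
  return {pa, pb};
}

// The whole composite parser is a compile-time constant.
constexpr auto ab{CharMatch<'a'>{} >> CharMatch<'b'>{}};

inline bool MatchesAB(std::string_view text) {
  ParseState state{text};
  return ab.Parse(&state).has_value();
}
```

Only the call to `Parse` happens at run time; the structure of `ab` is fixed when the translation unit is compiled.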
### Fundamental Predefined Parsers
These objects and functions are (or return) the fundamental parsers:
* *ok* is a trivial parser that always succeeds without advancing.
* "pure(x)" returns a trivial parser that always succeeds without advancing,
returning some value *x*.
* "fail<T>(msg)" denotes a trivial parser that always fails, emitting the
given message. The template parameter is the type of the value that
the parser never returns.
* *cut* is a trivial parser that always fails silently.
* "guard(pred)" returns a parser that succeeds if and only if the predicate
expression evaluates to true.
* *rawNextChar* returns the next raw character, and fails at EOF.
* *cookedNextChar* returns the next character after preprocessing, skipping
Fortran line continuations and comments; it also fails at EOF.
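
The sketch below suggests how two of these trivial parsers, "pure(x)" and "fail<T>(msg)", might be written. It assumes a hypothetical `ParseState` that stores messages as plain strings; the class names `PureParser` and `FailParser` are invented here and the real message handling is certainly richer.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>
#include <string_view>
#include <utility>
#include <vector>

// Hypothetical stand-in for the real ParseState, with a simple message list.
struct ParseState {
  std::string_view input;
  std::size_t offset{0};
  std::vector<std::string> messages;
};

// Always succeeds with a fixed value and never advances.
template<typename T> class PureParser {
public:
  using resultType = T;
  constexpr explicit PureParser(T value) : value_{std::move(value)} {}
  std::optional<T> Parse(ParseState *) const { return value_; }
private:
  T value_;
};
template<typename T> constexpr PureParser<T> pure(T value) {
  return PureParser<T>{std::move(value)};
}

// Always fails, recording a message. T is the type it never returns.
template<typename T> class FailParser {
public:
  using resultType = T;
  constexpr explicit FailParser(const char *msg) : msg_{msg} {}
  std::optional<T> Parse(ParseState *state) const {
    state->messages.emplace_back(msg_);
    return std::nullopt;
  }
private:
  const char *msg_;
};
template<typename T> constexpr FailParser<T> fail(const char *msg) {
  return FailParser<T>{msg};
}
```

The template parameter of `fail` exists only so that the failing parser can stand in wherever a parser of that result type is expected.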
### Combinators
These functions and operators combine parsers to generate new parsers.
* "!p" succeeds if p fails, and fails if p succeeds.
* "p >> q" fails if p does, otherwise running q and returning its value when
it succeeds.
* "p / q" fails if p does, otherwise running q and returning *p's* value
if q succeeds.
* "p || q" succeeds if p does, otherwise running q. The two parsers must
have the same type, and the value returned by the first succeeding parser
is the value of the combination.
* "lookAhead(p)" succeeds if p does, but doesn't modify any state.
* "attempt(p)" succeeds if p does, safely preserving state on failure.
* "many(p)" recognizes a greedy sequence of zero or more nonempty successes
of *p*, and returns std::list<> of their values. It always succeeds.
* "some(p)" recognizes a greedy sequence of one or more successes of *p*.
It fails if p immediately fails.
* "skipMany(p)" is the same as "many(p)", but it discards the results.
* "maybe(p)" tries to match *p*, returning an "std::optional<T>" value.
It always succeeds.
* "defaulted(p)" matches *p*, and when *p* fails it returns a
default-constructed instance of *p*'s resultType. It always succeeds.
* "nonemptySeparated(p, q)" repeatedly matches "p q p q p q ... p",
returning a std::list<> of only the values of the p's. It fails if
*p* immediately fails.
* "extension(p)" parses *p* if strict standard compliance is disabled,
with a warning if nonstandard usage warnings are enabled.
* "deprecated(p)" parses *p* if strict standard compliance is disabled,
with a warning if deprecated usage warnings are enabled.
* "inContext(..., p)" runs *p* within an error message context.
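
As one possible reading of the "many(p)" entry above, the following sketch collects values greedily, stops at the first failure or empty success, and always succeeds. It assumes stateless parsers with static `Parse` functions and a hypothetical `ParseState`; `Digit`, `Many`, and `CountLeadingDigits` are names made up for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <optional>
#include <string_view>
#include <utility>

// Hypothetical stand-in for the real ParseState.
struct ParseState {
  std::string_view input;
  std::size_t offset{0};
};

// Simplified digit parser (no "cooking" of the input).
struct Digit {
  using resultType = char;
  constexpr Digit() {}
  static std::optional<char> Parse(ParseState *state) {
    if (state->offset < state->input.size()) {
      char ch{state->input[state->offset]};
      if (ch >= '0' && ch <= '9') {
        ++state->offset;
        return ch;
      }
    }
    return std::nullopt;
  }
};

template<typename P> struct Many {
  using resultType = std::list<typename P::resultType>;
  constexpr Many() {}
  static std::optional<resultType> Parse(ParseState *state) {
    resultType values;
    while (true) {
      ParseState backup{*state};
      auto value{P::Parse(state)};
      // Stop on failure, and also on a success that consumed nothing,
      // which would otherwise loop forever.
      if (!value || state->offset == backup.offset) {
        *state = backup;
        break;
      }
      values.push_back(std::move(*value));
    }
    return values;  // possibly empty, but always a success
  }
};
template<typename P> constexpr Many<P> many(P) { return {}; }

inline std::size_t CountLeadingDigits(std::string_view text) {
  ParseState state{text};
  return many(Digit{}).Parse(&state)->size();
}
```

The check for a nonempty success is what the word "nonempty" in the description guards against: a parser such as *ok* would otherwise be matched endlessly.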
Note that "a >> b >> c / d / e" matches a sequence of five parsers,
but returns only the result that was obtained by matching c.
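
The reason this works is ordinary C++ operator precedence: `/` binds more tightly than `>>`, so the expression groups as `(a >> b) >> ((c / d) / e)`. The sketch below demonstrates this with toy combinators. It assumes stateless parsers with static `Parse` functions and a hypothetical `ParseState`; `Ch`, `Then`, `Before`, and `ParseABCDE` are illustrative names only.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string_view>

// Hypothetical stand-in for the real ParseState.
struct ParseState {
  std::string_view input;
  std::size_t offset{0};
};

// Matches one specific character and returns it.
template<char C> struct Ch {
  using resultType = char;
  constexpr Ch() {}
  static std::optional<char> Parse(ParseState *state) {
    if (state->offset < state->input.size() &&
        state->input[state->offset] == C) {
      ++state->offset;
      return C;
    }
    return std::nullopt;
  }
};

// "p >> q": run both, keep q's value.
template<typename PA, typename PB> struct Then {
  using resultType = typename PB::resultType;
  constexpr Then() {}
  static std::optional<resultType> Parse(ParseState *state) {
    if (PA::Parse(state)) {
      return PB::Parse(state);
    }
    return std::nullopt;
  }
};

// "p / q": run both, keep p's value.
template<typename PA, typename PB> struct Before {
  using resultType = typename PA::resultType;
  constexpr Before() {}
  static std::optional<resultType> Parse(ParseState *state) {
    std::optional<resultType> result{PA::Parse(state)};
    if (result && PB::Parse(state)) {
      return result;
    }
    return std::nullopt;
  }
};

template<typename PA, typename PB>
constexpr Then<PA, PB> operator>>(PA, PB) { return {}; }
template<typename PA, typename PB>
constexpr Before<PA, PB> operator/(PA, PB) { return {}; }

inline std::optional<char> ParseABCDE(std::string_view text) {
  // Groups as (a >> b) >> ((c / d) / e), so the result comes from c.
  constexpr auto parser{
      Ch<'a'>{} >> Ch<'b'>{} >> Ch<'c'>{} / Ch<'d'>{} / Ch<'e'>{}};
  ParseState state{text};
  return parser.Parse(&state);
}
```

All five characters must be present for the parse to succeed, yet only the value matched by the third parser is returned.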
### Applicatives
The following *applicative* combinators combine parsers and modify or
collect the values that they return.
* "construct<T>{}(p1, p2, ...)" matches zero or more parsers in succession,
collecting their results and then passing them with move semantics to a
constructor for the type *T* if they all succeed.
* "applyFunction(f, p1, p2, ...)" matches one or more parsers in succession,
collecting their results and passing them as rvalue reference arguments to
some function, returning its result.
* "applyLambda([](auto &&x){}, p1, p2, ...)" is the same thing, but for lambdas
and other function objects.
* "applyMem(mf, p1, p2, ...)" is the same thing, but invokes a member
function of the result of the first parser for updates in place.
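
A rough sketch of the "construct<T>{}(...)" idea, restricted to exactly two argument parsers to keep it short, might look like this. It assumes stateless parsers with static `Parse` functions and a hypothetical `ParseState`; `Construct2`, `LabeledDigit`, and the simplified `Letter` and `Digit` parsers are invented for illustration, and the real combinator accepts any number of parsers.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string_view>
#include <utility>

// Hypothetical stand-in for the real ParseState.
struct ParseState {
  std::string_view input;
  std::size_t offset{0};
};

// Shared helper: match one character in [lo, hi].
inline std::optional<char> MatchRange(ParseState *state, char lo, char hi) {
  if (state->offset < state->input.size()) {
    char ch{state->input[state->offset]};
    if (ch >= lo && ch <= hi) {
      ++state->offset;
      return ch;
    }
  }
  return std::nullopt;
}

struct Letter {
  using resultType = char;
  constexpr Letter() {}
  static std::optional<char> Parse(ParseState *s) {
    return MatchRange(s, 'A', 'Z');
  }
};

struct Digit {
  using resultType = int;
  constexpr Digit() {}
  static std::optional<int> Parse(ParseState *s) {
    if (auto ch{MatchRange(s, '0', '9')}) {
      return *ch - '0';
    }
    return std::nullopt;
  }
};

// A toy parse tree node built from a letter and a digit.
struct LabeledDigit {
  LabeledDigit(char l, int v) : label{l}, value{v} {}
  char label;
  int value;
};

// Two-parser version of the applicative: run both, then construct T.
template<typename T, typename P1, typename P2> struct Construct2 {
  using resultType = T;
  constexpr Construct2() {}
  static std::optional<T> Parse(ParseState *state) {
    auto first{P1::Parse(state)};
    if (!first) return std::nullopt;
    auto second{P2::Parse(state)};
    if (!second) return std::nullopt;
    return T{std::move(*first), std::move(*second)};  // move into T's ctor
  }
};

template<typename T> struct construct {
  template<typename P1, typename P2>
  constexpr Construct2<T, P1, P2> operator()(P1, P2) const { return {}; }
};

inline std::optional<LabeledDigit> ParseLabeledDigit(std::string_view text) {
  constexpr auto parser{construct<LabeledDigit>{}(Letter{}, Digit{})};
  ParseState state{text};
  return parser.Parse(&state);
}
```

The node is constructed only after every argument parser has succeeded, so a failure partway through never produces a half-built value.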
### Non-Advancing State Inquiries and Updates
These are non-advancing state inquiry and update parsers:
* *getColumn* returns the 1-based column position.
* *inCharLiteral* succeeds under withinCharLiteral.
* *inFortran* succeeds unless in a preprocessing directive.
* *inFixedForm* succeeds in fixed-form source.
* *setInFixedForm* sets the fixed-form flag, returning its prior value.
* *columns* returns the 1-based column number after which source is clipped.
* "setColumns(c)" sets the column limit and returns its prior value.
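
For illustration, a parser in the style of "setColumns(c)" could be sketched as follows. The `ParseState` fields and the default limit of 72 (the traditional fixed-form line width) are assumptions of this example, as is the class name `SetColumns`.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string_view>

// Hypothetical stand-in for the real ParseState with a column limit flag.
struct ParseState {
  std::string_view input;
  std::size_t offset{0};
  int columns{72};  // assumed default for this sketch
};

// Updates the column limit without consuming input; yields the old limit.
class SetColumns {
public:
  using resultType = int;
  constexpr explicit SetColumns(int limit = 72) : limit_{limit} {}
  std::optional<int> Parse(ParseState *state) const {
    int prior{state->columns};
    state->columns = limit_;
    return prior;  // always succeeds
  }
private:
  int limit_;
};
```

Returning the prior value lets a caller restore the previous setting later by feeding that value into another such parser.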
### Monadic Combination
When parsing depends on the result values of earlier parses, the
"monadic bind" combinator is available.
Please try to avoid using it, as it makes automatic analysis of the
grammar difficult.
It has the syntax "p >>= f", and it constructs a parser that matches p,
yielding some value x on success, then matches the parser returned from
the function call "f(x)".
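
A length-prefixed construct, loosely in the spirit of a Hollerith literal, is a typical case where the second parser depends on the first result. The sketch below is only an approximation under stated assumptions: `ParseState` is a hypothetical stand-in, and `DigitValue`, `TakeChars`, `Bind`, `TakeThatMany`, and `ParseCounted` are names invented for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>
#include <string_view>
#include <type_traits>
#include <utility>

// Hypothetical stand-in for the real ParseState.
struct ParseState {
  std::string_view input;
  std::size_t offset{0};
};

// Parses one decimal digit and returns its numeric value.
struct DigitValue {
  using resultType = int;
  constexpr DigitValue() {}
  std::optional<int> Parse(ParseState *state) const {
    if (state->offset < state->input.size()) {
      char ch{state->input[state->offset]};
      if (ch >= '0' && ch <= '9') {
        ++state->offset;
        return ch - '0';
      }
    }
    return std::nullopt;
  }
};

// Takes exactly `count` characters; the count is only known at run time.
class TakeChars {
public:
  using resultType = std::string;
  constexpr explicit TakeChars(std::size_t count = 0) : count_{count} {}
  std::optional<std::string> Parse(ParseState *state) const {
    if (state->input.size() - state->offset < count_) {
      return std::nullopt;
    }
    std::string text(state->input.substr(state->offset, count_));
    state->offset += count_;
    return text;
  }
private:
  std::size_t count_;
};

// "p >>= f": match p, then match the parser that f returns for p's value.
template<typename P, typename F> class Bind {
public:
  using resultType =
      typename std::invoke_result_t<F, typename P::resultType>::resultType;
  constexpr Bind(P p, F f) : p_{p}, f_{f} {}
  std::optional<resultType> Parse(ParseState *state) const {
    if (auto x{p_.Parse(state)}) {
      return f_(std::move(*x)).Parse(state);
    }
    return std::nullopt;
  }
private:
  P p_;
  F f_;
};
template<typename P, typename F>
constexpr Bind<P, F> operator>>=(P p, F f) { return {p, f}; }

inline TakeChars TakeThatMany(int n) {
  return TakeChars{static_cast<std::size_t>(n)};
}

inline std::optional<std::string> ParseCounted(std::string_view text) {
  ParseState state{text};
  return (DigitValue{} >>= TakeThatMany).Parse(&state);
}
```

Because the second parser is produced by calling `f` at parse time, a tool inspecting the grammar cannot know in advance what it will be, which is the analysis difficulty mentioned above.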
### Token Parsers
Last, we have these basic parsers on which the actual Fortran grammar
is built. All of the following parsers consume characters acquired from
*cookedNextChar*.
* *spaces* always succeeds after consuming any spaces or tabs.
* *digit* matches one cooked decimal digit (0-9).
* *letter* matches one cooked letter (A-Z).
* "CharMatch<'c'>{}" matches one specific cooked character.
* "..."_tok matches the content of the string, skipping spaces before and
after, and with multiple spaces accepted for any internal space.
(Note that the _tok suffix is optional when the parser appears before
the combinator ">>" or after "/".)
* "parenthesized(p)" is shorthand for "(" >> p / ")".
* "bracketed(p)" is shorthand for "[" >> p / "]".
* "withinCharLiteral(p)" applies the parser *p*, tokenizing for
CHARACTER/Hollerith literals.
* "nonEmptyListOf(p)" matches a comma-separated list of one or more
instances of *p*.
* "optionalListOf(p)" is the same thing, but can be empty, and always succeeds.
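
To show how a token literal and "parenthesized(p)" might fit together, here is a simplified sketch. It assumes a hypothetical `ParseState`, reads raw rather than cooked characters, ignores the rule about internal spaces, and chooses to restore the state on failure; `TokenParser`, `Success`, `Parenthesized`, and `ParseParenDigit` are names made up for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string_view>

// Hypothetical stand-in for the real ParseState.
struct ParseState {
  std::string_view input;
  std::size_t offset{0};
};

struct Success {};  // result of parsers that return no interesting value

inline void SkipSpaces(ParseState *state) {
  while (state->offset < state->input.size() &&
         (state->input[state->offset] == ' ' ||
          state->input[state->offset] == '\t')) {
    ++state->offset;
  }
}

// Matches fixed text, skipping spaces and tabs before and after it.
class TokenParser {
public:
  using resultType = Success;
  constexpr TokenParser() {}
  constexpr explicit TokenParser(std::string_view text) : text_{text} {}
  std::optional<Success> Parse(ParseState *state) const {
    ParseState backup{*state};
    SkipSpaces(state);
    if (state->input.substr(state->offset, text_.size()) != text_) {
      *state = backup;
      return std::nullopt;
    }
    state->offset += text_.size();
    SkipSpaces(state);
    return Success{};
  }
private:
  std::string_view text_;
};
constexpr TokenParser operator""_tok(const char *str, std::size_t n) {
  return TokenParser{std::string_view{str, n}};
}

// Simplified digit parser.
struct Digit {
  using resultType = char;
  constexpr Digit() {}
  std::optional<char> Parse(ParseState *state) const {
    if (state->offset < state->input.size()) {
      char ch{state->input[state->offset]};
      if (ch >= '0' && ch <= '9') {
        ++state->offset;
        return ch;
      }
    }
    return std::nullopt;
  }
};

// Equivalent in effect to "(" >> p / ")": returns p's value.
template<typename P> class Parenthesized {
public:
  using resultType = typename P::resultType;
  constexpr explicit Parenthesized(P p) : p_{p} {}
  std::optional<resultType> Parse(ParseState *state) const {
    ParseState backup{*state};
    if ("("_tok.Parse(state)) {
      if (auto value{p_.Parse(state)}) {
        if (")"_tok.Parse(state)) {
          return value;
        }
      }
    }
    *state = backup;  // this sketch restores the state on any failure
    return std::nullopt;
  }
private:
  P p_;
};
template<typename P> constexpr Parenthesized<P> parenthesized(P p) {
  return Parenthesized<P>{p};
}

inline std::optional<char> ParseParenDigit(std::string_view text) {
  ParseState state{text};
  return parenthesized(Digit{}).Parse(&state);
}
```

Since each token swallows the blanks around it, the inner parser never has to deal with whitespace itself.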
### Debugging Parser
Finally, the parser "..."_debug emits the string to standard error and succeeds.
It is useful for tracing while debugging a parser, but it should not
be committed to production code.
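
A tracing parser of this kind could be sketched as below. The `ParseState`, the `Success` result type, the `DebugParser` class, and the exact output format (the string followed by the current offset) are assumptions of this example rather than a description of the real implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <optional>
#include <string_view>

// Hypothetical stand-in for the real ParseState.
struct ParseState {
  std::string_view input;
  std::size_t offset{0};
};

struct Success {};  // result of parsers that return no interesting value

// Prints its text to standard error, then succeeds without advancing.
class DebugParser {
public:
  using resultType = Success;
  constexpr DebugParser() {}
  constexpr DebugParser(const char *str, std::size_t n)
    : str_{str}, length_{n} {}
  std::optional<Success> Parse(ParseState *state) const {
    std::fprintf(stderr, "%.*s (at offset %zu)\n",
        static_cast<int>(length_), str_, state->offset);
    return Success{};
  }
private:
  const char *str_{""};
  std::size_t length_{0};
};
constexpr DebugParser operator""_debug(const char *str, std::size_t n) {
  return DebugParser{str, n};
}
```

Dropping such a parser into the middle of a sequence shows whether, and where in the input, that point of the grammar is reached.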


@ -1,127 +0,0 @@
The Fortran language recognizer here is an LL recursive descent parser
composed from a "parser combinator" library that defines a few fundamental
parsers and a few ways to compose them into more powerful parsers.
For our purposes here, a *parser* is any object that can attempt to recognize
an instance of some syntax from an input stream. It may succeed or fail.
On success, it may return some semantic value to its caller.
In C++ terms, a parser is any instance of a class that
(1) has a constexpr default constructor,
(2) defines a resultType typedef, and
(3) provides a member or static function
std::optional<resultType> Parse(ParseState *) const;
static std::optional<resultType> Parse(ParseState *);
that accepts a pointer to a ParseState as its argument and returns
a std::optional<resultType> as a result, with the presence or absence
of a value in the std::optional<> signifying success or failure
respectively.
The resultType of a parser is typically the class type of some particular
node type in the parse tree.
ParseState is a class that encapsulates a position in the source stream,
collects messages, and holds a few state flags that can affect tokenization
(e.g., are we in a character literal?). Instances of ParseState are
independent and complete -- they are cheap to duplicate when necessary to
implement backtracking.
The constexpr default constructor of a parser is important. The functions
(below) that operate on instances of parsers are themselves all constexpr.
This use of compile-time expressions allows the entirety of a recursive
descent parser for a language to be constructed at compilation time through
the use of templates.
These objects and functions are (or return) the fundamental parsers:
ok always succeeds without advancing
pure(x) always succeeds without advancing, returning some value x
fail<T>(msg) always fails with the given message; optionally typed
cut always fails, with no message
guard(pred) succeeds if the predicate expression evaluates to true
rawNextChar returns the next raw character; fails at EOF
cookedNextChar returns the next character after preprocessing, skipping
Fortran line continuations and comments; fails at EOF
These functions and operators generate new parsers from combinations of
other parsers:
!p ok if p fails, cut if p succeeds
p >> q match p, then q, returning q's value
p / q match p, then q, returning p's value
p || q match p if it succeeds, else match q; p and q must be same type
lookAhead(p) succeeds iff p does, but doesn't modify state
attempt(p) succeeds iff p does, safely preserving state on failure
many(p) a greedy sequence of zero or more nonempty successes of p;
returns std::list<> of values
some(p) a greedy sequence of one or more successes of p
skipMany(p) same as many(p), but discards result (performance optimizer)
maybe(p) try to match p, returning optional<T>
defaulted(p) matches p, or else returns a default-constructed instance
of p's resultType
nonemptySeparated(p, q) repeatedly match p q p q p q ... p, returning
the values of the p's
extension(p) parses p if strict standard compliance is disabled,
with a warning if nonstandard usage warnings are enabled
deprecated(p) parses p if strict standard compliance is disabled,
with a warning if deprecated usage warnings are enabled
inContext("...", p) run p within an error message context
Note that "a >> b >> c / d / e" matches a sequence of five parsers,
but returns only the result that was obtained by matching c.
The following "applicative" combinators modify or combine the values returned
by parsers:
construct<T>{}(p1, p2, ...)
matches zero or more parsers in succession, collecting their
results and then passing them with move semantics to a
constructor for the type T if they all succeed
applyFunction(f, p1, p2, ...)
matches one or more parsers in succession, collecting their
results and passing them as rvalue reference arguments to
some function, returning its result
applyLambda([](&&x){}, p1, p2, ...)
is the same thing, but for lambdas and other function objects
applyMem(mf, p1, p2, ...)
is the same thing, but invokes a member function of the
result of the first parser
These are non-advancing state inquiry and update parsers:
getColumn returns 1-based column position
inCharLiteral succeeds under withinCharLiteral
inFortran succeeds unless in a preprocessing directive
inFixedForm succeeds in fixed-form source
setInFixedForm sets the fixed-form flag, returns prior value
columns returns the 1-based column number after which source is clipped
setColumns(c) sets "columns", returns prior value
When parsing depends on the result values of earlier parses, the
"monadic bind" combinator is available (but please try to avoid using it,
as it makes automatic analysis of the grammar difficult):
p >>= f match p, yielding some value x on success, then match the
parser returned from the function call f(x)
Last, we have these basic parsers on which the actual grammar of the Fortran
is built. All of the following parsers consume characters acquired from
"cookedNextChar".
spaces always succeeds after consuming any spaces or tabs
digit matches one cooked decimal digit (0-9)
letter matches one cooked letter (A-Z)
CharMatch<'c'>{} matches one specific cooked character
"..."_tok match contents, skipping spaces before and after, and
with multiple spaces accepted for any internal space
"..." >> p the tok suffix is optional on a string before >> and after /
parenthesized(p) shorthand for "(" >> p / ")"
bracketed(p) shorthand for "[" >> p / "]"
withinCharLiteral(p) apply p, tokenizing for CHARACTER/Hollerith literals
nonEmptyListOf(p) matches a comma-separated list of one or more p's
optionalListOf(p) ditto, but can be empty
"..."_debug emit the string and succeed, for parser debugging