llvm/flang/parser-combinators.txt

The Fortran language recognizer here is an LL recursive descent parser
composed from a "parser combinator" library that defines a few fundamental
parsers and a few ways to compose them into more powerful parsers.

For our purposes here, a *parser* is any object that can attempt to recognize
an instance of some syntax from an input stream.  It may succeed or fail.
On success, it may return some semantic value to its caller.

In C++ terms, a parser is any instance of a class that
  (1) has a constexpr default constructor,
  (2) defines a resultType typedef, and
  (3) provides a Parse function, declared as either a const member function
      or a static member function,

        std::optional<resultType> Parse(ParseState *) const;
        static std::optional<resultType> Parse(ParseState *);

      that accepts a pointer to a ParseState as its argument and returns
      a std::optional<resultType> as a result, with the presence or absence
      of a value in the std::optional<> signifying success or failure
      respectively.
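
As an illustration only, here is a minimal sketch of a class that meets the
three requirements above.  The ParseState shown is a hypothetical stand-in
(just a cursor into a string), not the real class, and the names DigitValue
and parseLeadingDigit are invented for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string_view>

// Hypothetical stand-in for the real ParseState: only a cursor into a string.
struct ParseState {
  std::string_view text;
  std::size_t at{0};
};

// Recognizes one decimal digit and returns its numeric value.
struct DigitValue {
  using resultType = int;    // requirement (2)
  constexpr DigitValue() {}  // requirement (1)
  // Requirement (3), in its const member function form.
  std::optional<resultType> Parse(ParseState *state) const {
    if (state->at < state->text.size()) {
      const char ch{state->text[state->at]};
      if (ch >= '0' && ch <= '9') {
        ++state->at;      // advance only on success
        return ch - '0';  // presence of a value signifies success
      }
    }
    return std::nullopt;  // absence of a value signifies failure
  }
};

// Usage: run the parser against the start of some text.
inline std::optional<int> parseLeadingDigit(std::string_view text) {
  ParseState state{text};
  return DigitValue{}.Parse(&state);
}

inline void example() {
  assert(parseLeadingDigit("7x") == 7);
  assert(!parseLeadingDigit("x7"));
}
```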

The resultType of a parser is typically the class that represents some
particular kind of node in the parse tree.

ParseState is a class that encapsulates a position in the source stream,
collects messages, and holds a few state flags that can affect tokenization
(e.g., are we in a character literal?).  Instances of ParseState are
independent and complete -- they are cheap to duplicate when necessary to
implement backtracking.
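
The following sketch suggests how duplicating a ParseState can implement
backtracking.  It assumes a much-reduced, hypothetical ParseState, and the
helper names matchWord and tryWord are invented for this example; the real
class and combinators differ.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string_view>

// Hypothetical, much-reduced ParseState: a cursor plus one tokenization flag.
struct ParseState {
  std::string_view text;
  std::size_t at{0};
  bool inCharLiteral{false};
};

// Matches `word` one character at a time.  On a mismatch it fails and leaves
// the cursor wherever the mismatch was found.
inline std::optional<std::string_view> matchWord(
    ParseState *state, std::string_view word) {
  for (char ch : word) {
    if (state->at >= state->text.size() || state->text[state->at] != ch) {
      return std::nullopt;
    }
    ++state->at;
  }
  return word;
}

// Backtracking wrapper: snapshot the state by value and restore the snapshot
// if the match fails, so that the caller can try an alternative.
inline std::optional<std::string_view> tryWord(
    ParseState *state, std::string_view word) {
  const ParseState backup = *state;
  auto result = matchWord(state, word);
  if (!result) {
    *state = backup;
  }
  return result;
}

inline void example() {
  ParseState state{"ENDDO"};
  assert(!tryWord(&state, "ENDIF") && state.at == 0);  // failed, rewound
  assert(tryWord(&state, "ENDDO") && state.at == 5);   // alternative matches
}
```

Because the snapshot is a plain value, restoring it rewinds the position and
every flag together.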

The constexpr default constructor of a parser is important.  The functions
(below) that operate on instances of parsers are themselves all constexpr.
This use of compile-time expressions allows the entirety of a recursive
descent parser for a language to be constructed at compilation time through
the use of templates.
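
To make this concrete, here is a small sketch in which a combinator function
is constexpr and the composed parser is a constexpr object.  The names Ch,
Sequence, then, and power are hypothetical, and ParseState is again a toy
stand-in.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string_view>

// Hypothetical stand-in for the real ParseState.
struct ParseState {
  std::string_view text;
  std::size_t at{0};
};

// Matches one specific character.
template <char C> struct Ch {
  using resultType = char;
  constexpr Ch() {}
  static std::optional<resultType> Parse(ParseState *state) {
    if (state->at < state->text.size() && state->text[state->at] == C) {
      ++state->at;
      return C;
    }
    return std::nullopt;
  }
};

// Matches pa and then pb, returning pb's value.  The state is not restored
// on failure in this sketch.
template <typename PA, typename PB> class Sequence {
public:
  using resultType = typename PB::resultType;
  constexpr Sequence() {}
  constexpr Sequence(PA pa, PB pb) : pa_{pa}, pb_{pb} {}
  std::optional<resultType> Parse(ParseState *state) const {
    if (pa_.Parse(state)) {
      return pb_.Parse(state);
    }
    return std::nullopt;
  }

private:
  PA pa_;
  PB pb_;
};

// A constexpr combinator: builds a new parser object from two others.
template <typename PA, typename PB>
constexpr Sequence<PA, PB> then(PA pa, PB pb) {
  return Sequence<PA, PB>{pa, pb};
}

// The composed parser for "**" is built entirely at compilation time.
constexpr auto power = then(Ch<'*'>{}, Ch<'*'>{});

inline void example() {
  ParseState state{"**2"};
  assert(power.Parse(&state) == '*');
  assert(state.at == 2);
}
```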

These objects and functions are (or return) the fundamental parsers:

  ok           always succeeds without advancing
  pure(x)      always succeeds without advancing, returning some value x
  fail<T>(msg)  always fails with the given message; optionally typed
  cut          always fails, with no message
  guard(pred)  succeeds if the predicate expression evaluates to true
  rawNextChar  returns the next raw character; fails at EOF
  cookedNextChar returns the next character after preprocessing, skipping
                 Fortran line continuations and comments; fails at EOF

These functions and operators generate new parsers from combinations of
other parsers:

  !p           ok if p fails, cut if p succeeds
  p >> q       match p, then q, returning q's value
  p / q        match p, then q, returning p's value
  p || q       match p if it succeeds, else match q; p and q must have the
                 same resultType
  lookAhead(p) succeeds iff p does, but doesn't modify state
  attempt(p)   succeeds iff p does, safely preserving state on failure
  many(p)      a greedy sequence of zero or more nonempty successes of p;
                 returns std::list<> of values
  some(p)      a greedy sequence of one or more successes of p
  skipMany(p)  same as many(p), but discards results (performance optimization)
  maybe(p)     try to match p, returning std::optional<T>
  defaulted(p) matches p, or else returns a default-constructed instance
                     of p's resultType
  nonemptySeparated(p, q) repeatedly match p q p q p q ... p, returning
                            the values of the p's
  extension(p) parses p if strict standard compliance is disabled,
                 with a warning if nonstandard usage warnings are enabled
  deprecated(p) parses p if strict standard compliance is disabled,
                 with a warning if deprecated usage warnings are enabled
  inContext("...", p)  run p within an error message context

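Note that "a >> b >> c / d / e" matches a sequence of five parsers,
but returns only the result that was obtained by matching c.  C++ operator
precedence makes / bind more tightly than >>, so the expression groups as
"(a >> b) >> ((c / d) / e)".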
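
The sketch below checks this behavior with toy versions of the two
sequencing operators.  KeepSecond, KeepFirst, Ch, and five are hypothetical
names, ParseState is a stand-in, and the real operators are implemented
differently.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string_view>

// Hypothetical stand-in for the real ParseState.
struct ParseState {
  std::string_view text;
  std::size_t at{0};
};

// Matches one specific character and returns it.
template <char C> struct Ch {
  using resultType = char;
  constexpr Ch() {}
  static std::optional<resultType> Parse(ParseState *state) {
    if (state->at < state->text.size() && state->text[state->at] == C) {
      ++state->at;
      return C;
    }
    return std::nullopt;
  }
};

// Toy p >> q: match p, then q, keeping q's value.
template <typename PA, typename PB> struct KeepSecond {
  using resultType = typename PB::resultType;
  constexpr KeepSecond() {}
  constexpr KeepSecond(PA a, PB b) : pa{a}, pb{b} {}
  std::optional<resultType> Parse(ParseState *state) const {
    if (pa.Parse(state)) {
      return pb.Parse(state);
    }
    return std::nullopt;
  }
  PA pa;
  PB pb;
};

// Toy p / q: match p, then q, keeping p's value.
template <typename PA, typename PB> struct KeepFirst {
  using resultType = typename PA::resultType;
  constexpr KeepFirst() {}
  constexpr KeepFirst(PA a, PB b) : pa{a}, pb{b} {}
  std::optional<resultType> Parse(ParseState *state) const {
    auto result = pa.Parse(state);
    if (result && pb.Parse(state)) {
      return result;
    }
    return std::nullopt;
  }
  PA pa;
  PB pb;
};

// The operators accept only operand types that define resultType.
template <typename PA, typename PB, typename = typename PA::resultType,
    typename = typename PB::resultType>
constexpr KeepSecond<PA, PB> operator>>(PA a, PB b) {
  return {a, b};
}
template <typename PA, typename PB, typename = typename PA::resultType,
    typename = typename PB::resultType>
constexpr KeepFirst<PA, PB> operator/(PA a, PB b) {
  return {a, b};
}

// Groups as (a >> b) >> ((c / d) / e) under C++ operator precedence.
constexpr auto five =
    Ch<'a'>{} >> Ch<'b'>{} >> Ch<'c'>{} / Ch<'d'>{} / Ch<'e'>{};

inline void example() {
  ParseState state{"abcde"};
  assert(five.Parse(&state) == 'c');  // all five matched, c's value returned
  assert(state.at == 5);
}
```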

The following "applicative" combinators modify or combine the values returned
by parsers:

  construct<T>{}(p1, p2, ...)
               matches zero or more parsers in succession, collecting their
               results and then passing them with move semantics to a
               constructor for the type T if they all succeed
  applyFunction(f, p1, p2, ...)
               matches one or more parsers in succession, collecting their
               results and passing them as rvalue reference arguments to
               some function, returning its result
  applyLambda([](auto &&x) { ... }, p1, p2, ...)
               is the same thing, but for lambdas and other function objects
  applyMem(mf, p1, p2, ...)
               is the same thing, but invokes a member function of the
               result of the first parser
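
As a rough sketch of the applicative idea, the toy construct<T> below accepts
exactly two parsers and moves their results into T's constructor.  It is a
simplified, hypothetical re-implementation for illustration (the real
combinator accepts any number of parsers); Digit, TwoDigits, and Construct2
are invented names.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string_view>
#include <utility>

// Hypothetical stand-in for the real ParseState.
struct ParseState {
  std::string_view text;
  std::size_t at{0};
};

// Matches one decimal digit and returns its numeric value.
struct Digit {
  using resultType = int;
  constexpr Digit() {}
  static std::optional<int> Parse(ParseState *state) {
    if (state->at < state->text.size()) {
      const char ch{state->text[state->at]};
      if (ch >= '0' && ch <= '9') {
        ++state->at;
        return ch - '0';
      }
    }
    return std::nullopt;
  }
};

// Hypothetical parse tree node built from two parsed values.
struct TwoDigits {
  TwoDigits(int &&tens, int &&units) : value{10 * tens + units} {}
  int value;
};

// Runs both parsers in succession; if both succeed, moves their results
// into a constructor of T.
template <typename T, typename PA, typename PB> struct Construct2 {
  using resultType = T;
  constexpr Construct2() {}
  constexpr Construct2(PA a, PB b) : pa{a}, pb{b} {}
  std::optional<T> Parse(ParseState *state) const {
    auto a = pa.Parse(state);
    if (!a) {
      return std::nullopt;
    }
    auto b = pb.Parse(state);
    if (!b) {
      return std::nullopt;
    }
    return T{std::move(*a), std::move(*b)};
  }
  PA pa;
  PB pb;
};

// Two-parser toy version of construct<T>{}(p1, p2).
template <typename T> struct construct {
  template <typename PA, typename PB>
  constexpr Construct2<T, PA, PB> operator()(PA a, PB b) const {
    return {a, b};
  }
};

constexpr auto twoDigits = construct<TwoDigits>{}(Digit{}, Digit{});

inline void example() {
  ParseState state{"42"};
  auto node = twoDigits.Parse(&state);
  assert(node && node->value == 42);
}
```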

These are non-advancing state inquiry and update parsers:

  getColumn    returns 1-based column position
  inCharLiteral succeeds when parsing under withinCharLiteral(p)
  inFortran    succeeds unless in a preprocessing directive
  inFixedForm  succeeds in fixed-form source
  setInFixedForm  sets the fixed-form flag, returns prior value
  columns      returns the 1-based column number after which source is clipped
  setColumns(c) sets "columns", returns prior value

When parsing depends on the result values of earlier parses, the
"monadic bind" combinator is available (but please try to avoid using it,
as it makes automatic analysis of the grammar difficult):

  p >>= f      match p, yielding some value x on success, then match the
                 parser returned from the function call f(x)
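
A Hollerith-like form such as "3HABC" is a case where the count parsed first
determines what must be parsed next.  The sketch below shows a toy bind for
that situation.  It handles only a single-digit count, and the names Digit,
HollerithBody, BodyOfLength, Bind, and hollerith are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>
#include <string_view>
#include <utility>

// Hypothetical stand-in for the real ParseState.
struct ParseState {
  std::string_view text;
  std::size_t at{0};
};

// Matches one decimal digit and returns its numeric value.
struct Digit {
  using resultType = int;
  constexpr Digit() {}
  static std::optional<int> Parse(ParseState *state) {
    if (state->at < state->text.size()) {
      const char ch{state->text[state->at]};
      if (ch >= '0' && ch <= '9') {
        ++state->at;
        return ch - '0';
      }
    }
    return std::nullopt;
  }
};

// Matches 'H' followed by exactly `count` characters, returning them.
struct HollerithBody {
  using resultType = std::string;
  constexpr HollerithBody() {}
  constexpr explicit HollerithBody(int n)
      : count{static_cast<std::size_t>(n)} {}
  std::optional<std::string> Parse(ParseState *state) const {
    const std::size_t remaining{state->text.size() - state->at};
    if (remaining < count + 1 || state->text[state->at] != 'H') {
      return std::nullopt;
    }
    std::string body(state->text.substr(state->at + 1, count));
    state->at += count + 1;
    return body;
  }
  std::size_t count{0};
};

// The function f: maps the parsed count to the parser to run next.
struct BodyOfLength {
  constexpr BodyOfLength() {}
  HollerithBody operator()(int n) const { return HollerithBody{n}; }
};

// Toy p >>= f: match p, then match the parser returned by f(x).
template <typename PA, typename F> struct Bind {
  using nextParser =
      decltype(std::declval<F>()(std::declval<typename PA::resultType>()));
  using resultType = typename nextParser::resultType;
  constexpr Bind() {}
  constexpr Bind(PA p, F f) : pa{p}, fn{f} {}
  std::optional<resultType> Parse(ParseState *state) const {
    if (auto x = pa.Parse(state)) {
      return fn(std::move(*x)).Parse(state);
    }
    return std::nullopt;
  }
  PA pa;
  F fn;
};

template <typename PA, typename F, typename = typename PA::resultType>
constexpr Bind<PA, F> operator>>=(PA p, F f) {
  return {p, f};
}

constexpr auto hollerith = Digit{} >>= BodyOfLength{};

inline void example() {
  ParseState state{"3HABC+1"};
  auto body = hollerith.Parse(&state);
  assert(body && *body == "ABC");
  assert(state.at == 5);
}
```

The second parser exists only once f has run at parse time, which is why a
grammar written this way is harder to analyze automatically.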

Last, we have these basic parsers on which the actual grammar of Fortran
is built.  All of the following parsers consume characters acquired from
"cookedNextChar".

  spaces       always succeeds after consuming any spaces or tabs
  digit        matches one cooked decimal digit (0-9)
  letter       matches one cooked letter (A-Z)
  CharMatch<'c'>{} matches one specific cooked character
  "..."_tok    match contents, skipping spaces before and after, and
                 with multiple spaces accepted for any internal space
  "..." >> p   the _tok suffix is optional on a string literal before >> and
                 after /
  parenthesized(p)  shorthand for "(" >> p / ")"
  bracketed(p) shorthand for "[" >> p / "]"

  withinCharLiteral(p) apply p, tokenizing for CHARACTER/Hollerith literals
  nonEmptyListOf(p) matches a comma-separated list of one or more p's
  optionalListOf(p) like nonEmptyListOf(p), but the list may be empty

  "..."_debug  emit the string and succeed, for parser debugging
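
As a closing illustration, here is a sketch of how a comma-separated list
parser in the spirit of nonEmptyListOf(p) might be written.  It is a toy
re-implementation under stated assumptions: sketchNonEmptyListOf,
NonEmptyList, Digit, and skipBlanks are hypothetical names, ParseState is a
stand-in, and the choice to leave a dangling comma unconsumed belongs to this
sketch rather than to the real library.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <optional>
#include <string_view>
#include <utility>

// Hypothetical stand-in for the real ParseState.
struct ParseState {
  std::string_view text;
  std::size_t at{0};
};

// Matches one decimal digit and returns its numeric value.
struct Digit {
  using resultType = int;
  constexpr Digit() {}
  static std::optional<int> Parse(ParseState *state) {
    if (state->at < state->text.size()) {
      const char ch{state->text[state->at]};
      if (ch >= '0' && ch <= '9') {
        ++state->at;
        return ch - '0';
      }
    }
    return std::nullopt;
  }
};

// Advances past any spaces or tabs.
inline void skipBlanks(ParseState *state) {
  while (state->at < state->text.size() &&
      (state->text[state->at] == ' ' || state->text[state->at] == '\t')) {
    ++state->at;
  }
}

// Matches item (',' item)* and returns the items' values.
template <typename P> struct NonEmptyList {
  using resultType = std::list<typename P::resultType>;
  constexpr NonEmptyList() {}
  constexpr explicit NonEmptyList(P p) : item{p} {}
  std::optional<resultType> Parse(ParseState *state) const {
    auto first = item.Parse(state);
    if (!first) {
      return std::nullopt;  // at least one item is required
    }
    resultType values;
    values.push_back(std::move(*first));
    for (;;) {
      const ParseState backup = *state;  // snapshot before ", item"
      skipBlanks(state);
      if (state->at >= state->text.size() || state->text[state->at] != ',') {
        *state = backup;
        break;
      }
      ++state->at;
      skipBlanks(state);
      auto next = item.Parse(state);
      if (!next) {
        *state = backup;  // leave a dangling comma unconsumed
        break;
      }
      values.push_back(std::move(*next));
    }
    return values;
  }
  P item;
};

template <typename P> constexpr NonEmptyList<P> sketchNonEmptyListOf(P p) {
  return NonEmptyList<P>{p};
}

constexpr auto digitList = sketchNonEmptyListOf(Digit{});

inline void example() {
  ParseState state{"1, 2,3;"};
  auto values = digitList.Parse(&state);
  assert(values && values->size() == 3);
  assert(values->front() == 1 && values->back() == 3);
  assert(state.at == 6);  // stops before the ';'
}
```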