Bison

The Yacc-compatible Parser Generator

10 September 2021, Bison Version 3.8.1

by Charles Donnelly and Richard Stallman

1 The Concepts of Bison

1.1 Languages and Context-Free Grammars

In order for Bison to parse a language, it must be described by a
context-free grammar. This means that you specify one or more syntactic
groupings and give rules for constructing them from their parts. For
example, in the C language, one kind of grouping is called an
'expression'. Rules are often recursive, but there must be at least one
rule which leads out of the recursion.

There are various important subclasses of context-free grammars. Although
it can handle almost all context-free grammars, Bison is optimized for
what are called LR(1) grammars. In brief, in these grammars, it must be
possible to tell how to parse any portion of an input string with just
a single token of lookahead.

A context-free grammar can be ambiguous, meaning that there are multiple ways
to apply the grammar rules to get the same inputs. Even unambiguous grammars
can be nondeterministic, meaning that no fixed lookahead always suffices to
determine the next grammar rule to apply. With the proper declarations, Bison
is also able to parse these more general context-free grammars, using a
technique known as GLR parsing (for Generalized LR). Bison's GLR parsers
are able to handle any context-free grammar for which the number of possible
parses of any given string is finite.

Here is a simple C function subdivided into tokens:

int             /* keyword 'int' */
square (int x)  /* identifier, open-paren, keyword 'int',
                   identifier, close-paren */
{               /* open-brace */
  return x * x; /* keyword 'return', identifier, asterisk,
                   identifier, semicolon */
}               /* close-brace */

The syntactic groupings of C include the expression, the statement, the
declaration, and the function definition. These are represented in the
grammar of C by nonterminal symbols 'expression', 'statement', 'declaration'
and 'function definition'. The full grammar uses dozens of additional
language constructs, each with its own nonterminal symbol, in order to
express the meanings of these four. The example above is a function
definition; it contains one declaration, and one statement. In the statement,
each 'x' is an expression and so is 'x * x'.

One nonterminal symbol must be distinguished as the special one which defines
a complete utterance in the language. It is called the start symbol. In a
compiler, this means a complete input program. In the C language, the
nonterminal symbol 'sequence of definitions and declarations' plays this
role.

1.2 From Formal Rules to Bison Input

A formal grammar is a mathematical construct. To define the language for
Bison, you must write a file expressing the grammar in Bison syntax: a Bison
grammar file. See 3 [Bison Grammar Files].

A nonterminal symbol in the formal grammar is represented in Bison input
as an identifier, like an identifier in C. By convention, it should be in
lower case, such as expr, stmt or declaration.

The Bison representation for a terminal symbol is also called a token
kind. Token kinds as well can be represented as C-like identifiers. By
convention, these identifiers should be upper case to distinguish them from
nonterminals: for example, INTEGER, IDENTIFIER, IF or RETURN. A terminal
symbol that stands for a particular keyword in the language should be
named after that keyword converted to upper case. The terminal symbol
error is reserved for error recovery. See 3.2 [Symbols, Terminal and
Nonterminal].

A terminal symbol can also be represented as a character literal, just like
a C character constant. You should do this whenever a token is just a single
character (parenthesis, plus-sign, etc.): use that same character in a
literal as the terminal symbol for that token.

A third way to represent a terminal symbol is with a C string constant
containing several characters. See 3.2 [Symbols, Terminal and Nonterminal].

The grammar rules also have an expression in Bison syntax. For example, here
is the Bison rule for a C return statement. The semicolon in quotes is a
literal character token, representing part of the C syntax for the statement;
the naked semicolon, and the colon, are Bison punctuation used in every rule.

stmt: RETURN expr ';' ;

1.3 Semantic Values

A formal grammar selects tokens only by their classifications: for example,
if a rule mentions the terminal symbol 'integer constant', it means that any
integer constant is grammatically valid in that position. The precise
value of the constant is irrelevant to how to parse the input: if 'x+4' is
grammatical then 'x+1' or 'x+3989' is equally grammatical.

But the precise value is very important for what the input means once it is
parsed. A compiler is useless if it fails to distinguish between 4, 1 and
3989 as constants in the program! Therefore, each token in a Bison grammar
has both a token kind and a semantic value. See 3.4 [Defining Language
Semantics].

The semantic value has all the rest of the information about the meaning of
the token, such as the value of an integer, or the name of an
identifier. (A token such as ',' which is just punctuation doesn't need to
have any semantic value.)

When the parser accepts the token, it keeps track of the token's semantic
value. Each grouping can also have a semantic value as well as its nonterminal
symbol. For example, in a calculator, an expression typically has a semantic
value that is a number. In a compiler for a programming language, an
expression typically has a semantic value that is a tree structure
describing the meaning of the expression.
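
As a rough illustration of the split between token kind and semantic value,
the sketch below models a token as a C struct. The names token_kind,
semantic_value, make_integer, make_identifier and same_kind are invented for
this example and are not part of Bison's generated interface.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical token kinds, named for this sketch only. */
enum token_kind { TOK_INTEGER, TOK_IDENTIFIER, TOK_COMMA };

/* The semantic value: what the token means beyond its kind. */
union semantic_value {
  long integer;          /* value of an integer constant */
  const char *name;      /* spelling of an identifier */
};

struct token {
  enum token_kind kind;        /* the only thing the grammar looks at */
  union semantic_value value;  /* unused for pure punctuation such as ',' */
};

/* Build an integer-constant token. */
struct token make_integer (long n)
{
  struct token t;
  t.kind = TOK_INTEGER;
  t.value.integer = n;
  return t;
}

/* Build an identifier token; the caller keeps ownership of `name`. */
struct token make_identifier (const char *name)
{
  struct token t;
  assert (name != NULL);
  t.kind = TOK_IDENTIFIER;
  t.value.name = name;
  return t;
}

/* Two tokens are grammatically interchangeable when their kinds match. */
int same_kind (struct token a, struct token b)
{
  return a.kind == b.kind;
}
```

With these definitions, make_integer (4) and make_integer (3989) have the
same kind, so a grammar treats them alike, yet they carry different values.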

1.4 Semantic Actions

In order to be useful, a program must do more than parse input; it must also
produce some output based on the input. In a Bison grammar, a grammar rule
can have an action made up of C statements. Each time the parser
recognizes a match for that rule, the action is executed. See 3.4.6 [
Actions]. Most of the time, the purpose of an action is to compute the
semantic value of the whole construct from the semantic values of its
parts.

1.5 Writing GLR Parsers

1.5.1 Using GLR on Unambiguous Grammars

Consider the following Pascal type declarations:

type subrange = lo .. hi;
type enum = (a, b, c);

The original language standard allows only numeric literals and
constant identifiers for the subrange bounds ('lo' and 'hi'), but
Extended Pascal (ISO/IEC 10206) and many other Pascal implementations
allow arbitrary expressions there. This gives rise to the following
situation, containing a superfluous pair of parentheses:

type subrange = (a) .. b;

Compare this to the following declaration of an enumerated type:

type enum = (a);

These two declarations look identical until the '..' token. With normal
LR(1) one-token lookahead it is not possible to decide between the two
forms when the identifier 'a' is parsed. It is, however, desirable for a
parser to decide this, since in the latter case 'a' must become a
new identifier to represent the enumeration value, while in the former
case 'a' must be evaluated with its current meaning, which may be a
constant or even a function call.

You might think of using the lexer to distinguish between the two forms by
returning different tokens for currently defined and undefined
identifiers. But if these declarations occur in a local scope, and 'a' is
defined in an outer scope, then both forms are possible: either locally
redefining 'a', or using the value of 'a' from the outer scope.
So this approach cannot work.

A simple solution to this problem is to declare the parser to use the GLR
algorithm. When the GLR parser reaches the critical state, it merely splits
into two branches and pursues both syntax rules simultaneously. Sooner
or later, one of them runs into a parsing error. If there is a '..' token
before the next ';', the rule for enumerated types fails since it cannot
accept '..' anywhere; otherwise, the subrange type rule fails since it
requires a '..' token. So one of the branches fails silently, and the
other one continues normally, performing all the intermediate actions that
were postponed during the split.
If the input is syntactically incorrect, both branches fail and the parser
reports a syntax error as usual.
The effect of all this is that the parser seems to use more lookahead than
the underlying LR(1) algorithm actually allows for.
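
The idea of pursuing both rules and keeping whichever survives can be
imitated in plain C. The sketch below is not Bison's GLR machinery: it simply
tries each hypothesis in turn over the tokens that follow '=', using a
deliberately tiny expression grammar (an identifier or a parenthesized
expression). All function and type names are assumptions made for this
example.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

enum decl_kind { DECL_ERROR, DECL_ENUM, DECL_SUBRANGE };

static int is_id (const char *t)
{
  return isalpha ((unsigned char) t[0]);
}

/* Consume the token `want` if it is next; return 1 on success. */
static int expect (const char *const *tok, int n, int *i, const char *want)
{
  if (*i < n && strcmp (tok[*i], want) == 0) { ++*i; return 1; }
  return 0;
}

/* expr: ID | '(' expr ')'   (a stand-in for real Pascal expressions) */
static int parse_expr (const char *const *tok, int n, int *i)
{
  if (*i < n && is_id (tok[*i])) { ++*i; return 1; }
  if (expect (tok, n, i, "("))
    return parse_expr (tok, n, i) && expect (tok, n, i, ")");
  return 0;
}

/* Branch 1: enumerated type, '(' ID {',' ID} ')' ';' */
int enum_branch (const char *const *tok, int n)
{
  int i = 0;
  if (!expect (tok, n, &i, "("))
    return 0;
  do
    {
      if (i >= n || !is_id (tok[i]))
        return 0;
      ++i;
    }
  while (expect (tok, n, &i, ","));
  return expect (tok, n, &i, ")") && expect (tok, n, &i, ";") && i == n;
}

/* Branch 2: subrange type, expr '..' expr ';' */
int subrange_branch (const char *const *tok, int n)
{
  int i = 0;
  return parse_expr (tok, n, &i) && expect (tok, n, &i, "..")
         && parse_expr (tok, n, &i) && expect (tok, n, &i, ";") && i == n;
}

/* Try both branches and report the survivor, if any. */
enum decl_kind classify (const char *const *tok, int n)
{
  int e = enum_branch (tok, n);
  int s = subrange_branch (tok, n);
  assert (!(e && s));   /* unambiguous: at most one branch survives */
  if (e) return DECL_ENUM;
  if (s) return DECL_SUBRANGE;
  return DECL_ERROR;    /* both branches failed: a syntax error */
}
```

A real GLR parser advances both branches token by token rather than
rescanning, but the outcome is the same: the '..' token decides which branch
dies.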

When used as a normal LR(1) grammar, Bison correctly complains about one
reduce/reduce conflict. The parser can be turned into a GLR parser, while
also telling Bison to be silent about the one known reduce/reduce conflict,
by adding these two declarations to the top of the Bison grammar file:

%glr-parser
%expect-rr 1

There are at least two potential problems to beware. First, always analyze
the conflicts reported by Bison to make sure that GLR splitting is only done
where it is intended. A GLR parser splitting inadvertently may cause problems
less obvious than an LR parser statically choosing the wrong alternative in
a conflict. Second, consider interactions with the lexer (see 7.1 [Semantic
Info in Token Kinds]) with great care. Since a split parser consumes tokens
without performing any actions during the split, the lexer cannot obtain
information via parser actions. Some cases of lexer interactions can be
eliminated by using GLR to shift the complications from the lexer to the
parser. You must check the remaining cases for correctness.

1.5.2 Using GLR to Resolve Ambiguities

Let's consider an example:

%{
  #include <stdio.h>
  int yylex (void);
  void yyerror (char const *);
%}

%define api.value.type {char const *}

%token TYPENAME ID

%right '='
%left '+'

%glr-parser

%%

prog:
  %empty
| prog stmt      { printf ("\n"); }
;
stmt:
  expr ';' %dprec 1
| decl     %dprec 2
;
expr:
  ID             { printf ("%s ", $$); }
| TYPENAME '(' expr ')'
                 { printf ("%s <cast> ", $1); }
| expr '+' expr  { printf ("+ "); }
| expr '=' expr  { printf ("= "); }
;
decl:
  TYPENAME declarator ';'
                 { printf ("%s <declare> ", $1); }
| TYPENAME declarator '=' expr ';'
                 { printf ("%s <init-declare> ", $1); }
;
declarator:
  ID             { printf ("\"%s\" ", $1); }
| '(' declarator ')'
;

The input T (x) = y+z; parses as either an expr or a decl (assuming that
'T' is recognized as a TYPENAME and 'x' as an ID). Bison detects this as
a reduce/reduce conflict between the rules expr : ID and declarator : ID,
which it cannot resolve at the time it encounters x in the example above.
Since this is a GLR parser, it therefore splits the problem into two parses,
one for each choice of resolving the reduce/reduce conflict. Unlike the
example from the previous section (see 1.5.1 [Using GLR on Unambiguous
Grammars]), however, neither of these parses "dies", because
the grammar as it stands is ambiguous. One of the parsers eventually reduces
stmt : expr ';' and the other reduces stmt : decl, after which both parsers
are in an identical state: they've seen 'prog stmt' and have the same
unprocessed input remaining. We say that these parses have merged.

At this point, the GLR parser requires a specification in the grammar of
how to choose between the competing parses. In the example above, the two
%dprec declarations specify that Bison is to give precedence to the parse
that interprets the example as a decl, which implies that x is a declarator.
The parser therefore prints: "x" y z + T <init-declare>.

Suppose that instead of resolving the ambiguity, you wanted to see all the
possibilities. For this purpose, you must merge the semantic actions of
the two possible parsers, rather than choosing one over the other. To do so,
you could change the declaration of stmt as follows:

stmt:
  expr ';'  %merge <stmt_merge>
| decl      %merge <stmt_merge>
;

and define a stmt_merge function (with an accompanying forward declaration
in the C declarations at the beginning of the file), as below:

static YYSTYPE
stmt_merge (YYSTYPE x0, YYSTYPE x1)
{
  printf ("<OR> ");
  return "";
}

With these declarations, the resulting parser parses the first example as
both an expr and a decl, and prints:
"x" y z + T <init-declare> x T <cast> y z + = <OR>

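The signature of the merger depends on the type of the symbol. In the example
above, stmt has no declared type, so the merger returns YYSTYPE. However, if
stmt had a declared type, like %type <Node *> stmt; or

%union { Node *node; ... };
%type <node> stmt;

then the return type of the merger must be Node * instead, while its
parameters remain YYSTYPE:

Node *stmt_merge (YYSTYPE x0, YYSTYPE x1);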

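note 1.5.2

Judging from the printed output, actions run in this order: the actions of
the child components run first, one by one, and then the action of the
left-hand-side symbol as a whole. Printing in this way makes it easy to
generate a postfix expression. Seen from this angle, a postfix expression is
a process of first obtaining (processing) the operands and then applying the
operation to them, which yields a new operand.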
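
To make the "operands first, then the operation" reading concrete, here is a
minimal postfix evaluator in C. It is only a sketch under simplifying
assumptions: operands are single digits, the only operators are '+' and '*',
and the name eval_postfix is invented for this example.

```c
#include <assert.h>
#include <ctype.h>

/* Evaluate a postfix expression such as "23+4*".  Operands are single
   digits; '+' and '*' are the only operators; other characters (for
   example spaces) are skipped.  */
int eval_postfix (const char *s)
{
  int stack[64];
  int top = 0;                       /* number of operands on the stack */

  for (; *s != '\0'; ++s)
    {
      if (isdigit ((unsigned char) *s))
        {
          assert (top < 64);
          stack[top++] = *s - '0';   /* an operand: just push it */
        }
      else if (*s == '+' || *s == '*')
        {
          int rhs, lhs;
          assert (top >= 2);         /* both operands already exist */
          rhs = stack[--top];
          lhs = stack[--top];
          /* the result becomes a new operand */
          stack[top++] = (*s == '+') ? lhs + rhs : lhs * rhs;
        }
    }
  assert (top == 1);                 /* a well-formed input leaves one value */
  return stack[0];
}
```

For instance, eval_postfix ("23+4*") first produces 2 and 3, adds them, then
multiplies the new operand 5 by 4.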

1.5.3 GLR Semantic Actions

1.5.3.1 Deferred semantic actions

By definition, a deferred semantic action is not performed at the same time
as the associated reduction.

In any semantic action, you can examine yychar to determine the kind of
the lookahead token present at the time of the associated reduction. After
checking that yychar is not set to YYEMPTY or YYEOF, you can then examine
yylval and yylloc to determine the lookahead token’s semantic value and
location, if any. In a nondeferred semantic action, you can also modify any
of these variables to influence syntax analysis.
In a deferred semantic action, it’s too late to influence syntax analysis.
In this case, yychar, yylval, and yylloc are set to shallow copies of the
values they had at the time of the associated reduction. For this reason
alone, modifying them is dangerous. Moreover, the result of modifying them
is undefined and subject to change with future versions of Bison. For
example, if a semantic action might be deferred, you should never write it
to invoke yyclearin (clear the lookahead token) or to attempt to free memory
referenced by yylval.

5 The Bison Parser Algorithm

As Bison reads tokens, it pushes them onto a stack along with their
semantic values. The stack is called the parser stack. Pushing a token is
traditionally called shifting.

For example, suppose the infix calculator has read '1 + 5 *', with a '3' to
come. The stack will have four elements, one for each token that was shifted.

But the stack does not always have an element for each token read. When the
last n tokens and groupings shifted match the components of a grammar
rule, they can be combined according to that rule. This is called
reduction. Those tokens and groupings are replaced on the stack by a
single grouping whose symbol is the result (left hand side) of that rule.
Running the rule's action is part of the process of reduction, because this
is what computes the semantic value of the resulting grouping.

The parser tries, by shifts and reductions, to reduce the entire input down
to a single grouping whose symbol is the grammar's start-symbol (see 1.1
[Languages and Context-Free Grammars]).

This kind of parser is known in the literature as a bottom-up parser.

5.1 Lookahead Tokens

The Bison parser does not always reduce immediately as soon as the last n
tokens and groupings match a rule.

Here is a simple case where lookahead is needed. These three rules define
expressions which contain binary addition operators and postfix unary
factorial operators ('!'), and allow parentheses for grouping.

expr:
  term '+' expr
| term
;
term:
  '(' expr ')'
| term '!'
| "number"
;

Suppose that the tokens '1 + 2' have been read and shifted; what should be
done?

  • If the following token is '!', then it must be shifted immediately so
    that '2 !' can be reduced to make a term. If instead the parser were to
    reduce before shifting, '1 + 2' would become an expr. It would then be
    impossible to shift the '!' because doing so would produce on the stack the
    sequence of symbols expr '!'. No rule allows that sequence.
  • If the following token is ')', then the first three tokens must be reduced
    to form an expr. This is the only valid course, because shifting the ')'
    would produce a sequence of symbols term ')', and no rule allows this.
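
The two cases above amount to a decision keyed on the lookahead token. The C
sketch below encodes that decision for the moment after '1 + 2' has been read
and the 2 has been recognized as a term. The enum, the function name decide,
and the use of '\0' for end of input are assumptions of this example, not
Bison's internal representation.

```c
#include <assert.h>

enum action { ACT_SHIFT, ACT_REDUCE, ACT_ERROR };

/* The stack holds  term '+' term ; `lookahead` is the next token,
   with '\0' standing for end of input.  */
enum action decide (int lookahead)
{
  switch (lookahead)
    {
    case '!':             /* term: term '!'  can still extend the term */
    case '+':             /* expr: term '+' expr  continues to the right */
      return ACT_SHIFT;
    case ')':             /* the enclosing expr is complete */
    case '\0':
      return ACT_REDUCE;
    default:              /* e.g. '(' or a number cannot follow a term */
      return ACT_ERROR;
    }
}
```

Without the lookahead argument the function could not tell the '!' case from
the ')' case, which is exactly why the parser reads one token ahead.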

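note 5.1

This example also illustrates precedence (and associativity) well. In the
rule definitions, operands (symbols) with higher precedence are computed
first, as sub-rules of the rule that uses them. In other words, precedence
(and associativity) is implemented by separating the higher-precedence rules
from the lower-precedence ones and making the higher-precedence rules
(symbols) sub-rules (sub-symbols) of the lower-precedence ones.

In this example, that shows up as follows:

  • Addition: the left operand is a term and is computed first during the
    recursion, i.e. term sits below expr in the syntax tree. (Because the
    rule term '+' expr is right-recursive, addition groups to the right in
    this grammar.)
  • Factorial (higher precedence than addition) is computed first during the
    recursion, i.e. an expr is built from terms, not the other way round.

Lookahead, in turn, ensures that the current reduction does not interfere
with later ones. A later reduction may apply to a rule with higher precedence
(and associativity), that is, a rule lower in the syntax tree. In other
words, lookahead gives higher-precedence rules the chance to run before
lower-precedence ones. In that situation the higher-precedence rule, being
the single viable rule, keeps its operands intact during lookahead (they are
not "claimed" by a lower-precedence rule that appears earlier in the text).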

5.2 Shift/Reduce Conflicts

Suppose we are parsing a language which has if-then and if-then-else
statements, with a pair of rules like this:

if_stmt:
  "if" expr "then" stmt
| "if" expr "then" stmt "else" stmt
;

When the "else" token is read and becomes the lookahead token, the contents
of the stack (assuming the input is valid) are just right for reduction by
the first rule. But it is also legitimate to shift the "else", because that
would lead to eventual reduction by the second rule.

This situation, where either a shift or a reduction would be valid, is
called a shift/reduce conflict. Bison is designed to resolve these
conflicts by choosing to shift, unless otherwise directed by operator
precedence declarations. (It would ideally be cleaner to write an unambiguous
grammar, but that is very hard to do in this case.) This particular ambiguity
was first encountered in the specifications of Algol 60 and is called the
"dangling else" ambiguity.

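For example, when the parser reaches the position marked by the bullet in the
following input, the stack can be reduced by the first rule or the "else" can
be shifted:

  if e1 then  if e2 then s1  • else s2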
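
The effect of preferring the shift can be seen in a small recursive-descent
stand-in written in C. This is not how Bison's LR parser works internally; it
only mimics the outcome. As assumptions of this sketch, 'i' abbreviates
"if expr then", 'e' abbreviates "else", 's' is a simple statement, and the
names struct out, put, stmt and parse_if are invented.

```c
#include <assert.h>
#include <string.h>

struct out { char buf[128]; size_t len; };

/* Append `s` to the output buffer. */
static void put (struct out *o, const char *s)
{
  size_t n = strlen (s);
  assert (o->len + n < sizeof o->buf);
  memcpy (o->buf + o->len, s, n + 1);
  o->len += n;
}

/* stmt: 's' | 'i' stmt | 'i' stmt 'e' stmt
   Returns a pointer past the statement, or NULL on a syntax error.  */
static const char *stmt (const char *in, struct out *o)
{
  if (*in == 's')
    {
      put (o, "s");
      return in + 1;
    }
  if (*in != 'i')
    return NULL;
  put (o, "(i ");
  in = stmt (in + 1, o);
  if (in != NULL && *in == 'e')   /* prefer to take the else right away */
    {
      put (o, " e ");
      in = stmt (in + 1, o);
    }
  if (in != NULL)
    put (o, ")");
  return in;
}

/* Parse a whole input; on success o->buf shows the grouping. */
int parse_if (const char *in, struct out *o)
{
  o->len = 0;
  o->buf[0] = '\0';
  in = stmt (in, o);
  return in != NULL && *in == '\0';
}
```

Parsing "iises" yields the grouping (i (i s e s)): the else attaches to the
innermost if, which is the same result the shift preference gives.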

5.3.1 When Precedence is Needed

Consider the following ambiguous grammar fragment:

expr:
  expr '-' expr
| expr '*' expr
| expr '<' expr
| '(' expr ')'
...
;

Suppose the parser has seen the tokens '1', '-' and '2'; should it reduce
them via the rule for the subtraction operator, or shift the next operator
token first?

To decide which one Bison should do, we must consider the results. If the
next operator token op is shifted, then it must be reduced first in
order to permit another opportunity to reduce the difference. The result is
(in effect) '1 - (2 op 3)'. On the other hand, if the subtraction is reduced
before shifting op, the result is '(1 - 2) op 3'. Clearly, then, the
choice of shift or reduce should depend on the relative precedence of the
operators '-' and op: '*' should be shifted first, but not '<'.

What about input such as '1 - 2 - 5'? Should this be '(1 - 2) - 5' or should
it be '1 - (2 - 5)'? For most operators we prefer the former, which is called
left association. The latter alternative, right association, is desirable for
assignment operators. The choice of left or right association is a matter of
whether the parser chooses to shift or reduce when the stack contains '1 - 2'
and the lookahead token is '-': shifting makes right-associativity.
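
The shift-or-reduce choice described above can be tried out with a small
operator-precedence evaluator in C. It is a sketch under stated assumptions,
not the parser Bison generates: operands are single digits, there are no
spaces or parentheses, all three operators are left-associative, and the
names prec, apply and eval are invented for this example.

```c
#include <assert.h>
#include <ctype.h>

/* Precedence levels, higher binds tighter: '<' below '-' below '*'.
   Anything else, including the terminating '\0', gets 0.  */
static int prec (char op)
{
  switch (op)
    {
    case '<': return 1;
    case '-': return 2;
    case '*': return 3;
    default:  return 0;
    }
}

static int apply (char op, int lhs, int rhs)
{
  switch (op)
    {
    case '<': return lhs < rhs;
    case '-': return lhs - rhs;
    default:  return lhs * rhs;   /* '*' */
    }
}

/* Evaluate an expression such as "1-2*3". */
int eval (const char *s)
{
  int vals[32];
  char ops[32];
  int nv = 0, no = 0;

  for (;; ++s)
    {
      if (isdigit ((unsigned char) *s))
        {
          assert (nv < 32);
          vals[nv++] = *s - '0';           /* shift an operand */
          continue;
        }
      /* *s is the lookahead operator (or '\0').  Reduce while the
         operator on the stack has higher precedence, or equal
         precedence, which gives left association.  */
      while (no > 0 && prec (ops[no - 1]) >= prec (*s))
        {
          int rhs, lhs;
          assert (nv >= 2);
          rhs = vals[--nv];
          lhs = vals[--nv];
          --no;
          vals[nv++] = apply (ops[no], lhs, rhs);
        }
      if (*s == '\0')
        break;
      assert (prec (*s) > 0 && no < 32);   /* only known operators */
      ops[no++] = *s;                      /* shift the operator */
    }
  assert (nv == 1);
  return vals[0];
}
```

With '1-2*3' the '*' is shifted before the subtraction is reduced, giving
1 - (2 * 3); with '1-2<3' the subtraction is reduced first. Changing >= to >
in the loop condition would make every operator right-associative.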

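note 5.3.1

The data structure in use is a stack, so reductions proceed from right to
left, that is, from the top of the stack. This determines two things:

  • The components of a rule with higher precedence (and associativity) can
    be pushed before the lower-precedence rule already on the stack is
    reduced, because when popping they are reduced before the
    lower-precedence rule that sits deeper in the stack. This is a shift.
  • When a lower-precedence rule is encountered, the higher-precedence rule
    on the stack must be reduced first. This is a reduce.

Or, seen the other way round: to make this possible, and to let Bison decide
between shift and reduce with a single token of lookahead, higher-precedence
rules must be designed as separate, unique sub-rules. Then, at lookahead
time:

  • on meeting a separate, unique (child) rule: shift;
  • on meeting a rule higher in the syntax tree (a parent rule): reduce,
    then shift.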

5.3.2 Specifying Operator Precedence

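Bison provides four declarations for operator precedence and associativity:

  • %left declares left-associative operators.
  • %right declares right-associative operators.
  • %nonassoc declares that it is a syntax error to find the same operator
    twice in a row.
  • %precedence assigns precedence only, with no associativity.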

The relative precedence of different operators is controlled by the order in
which they are declared. The first precedence/associativity declaration in
the file declares the operators whose precedence is lowest, the next such
declaration declares the operators whose precedence is a little higher, and
so on.

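note 5.3.2

  • An associativity declaration also assigns a precedence. Associativity in
    effect settles the priority among symbols of the same precedence level.
  • Bison requires lower-precedence symbols to be declared earlier (higher up
    in the file) and higher-precedence symbols later (further down), which
    matches the levels of the corresponding nodes in the syntax tree.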

5.3.3 Specifying Precedence Only

Since POSIX Yacc defines only %left, %right, and %nonassoc, which all define
precedence and associativity, little attention is paid to the fact that
precedence cannot be defined without defining associativity. Yet, sometimes,
when trying to solve a conflict, precedence suffices. In such a case, using
%left, %right, or %nonassoc might hide future (associativity related)
conflicts that would remain hidden.

The dangling else ambiguity (see 5.2 [Shift/Reduce Conflicts]) can be solved
explicitly. The conflict occurs in the following situation, where the bullet
marks the current parsing position:

  if e1 then if e2 then s1 • else s2

The conflict involves the reduction of the rule 'IF expr THEN stmt', whose
precedence is by default that of its last token (THEN), and the shifting
of the token ELSE. Under the usual disambiguation (attach the else to the
closest if), shifting must be preferred, i.e., the precedence of ELSE must be
higher than that of THEN. But neither is expected to be involved in an
associativity-related conflict, which can be specified as follows.

%precedence THEN
%precedence ELSE

The unary-minus is another typical example where associativity is usually
over-specified.

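note 5.3.3

How the tokens are identified when associativity (and precedence)
declarations resolve a shift/reduce conflict:

  • Reduce: the token is the last token already shifted onto the stack.
  • Shift: the token is the lookahead token; shifting it means that it will
    later be popped from the stack and reduced first.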

5.3.5 How Precedence Works

The first effect of the precedence declarations is to assign precedence
levels to the terminal symbols declared. The second effect is to assign
precedence levels to certain rules: each rule gets its precedence from the
last terminal symbol mentioned in the components. (You can also specify
explicitly the precedence of a rule. See 5.4 [Context-Dependent Precedence].)

Finally, the resolution of conflicts works by comparing the precedence of the
rule being considered with that of the lookahead token. If the
token's precedence is higher, the choice is to shift. If the rule's
precedence is higher, the choice is to reduce. If they have equal precedence,
the choice is made based on the associativity of that precedence level. The
verbose output file made by -v (see 9 [Invoking Bison]) says how each
conflict was resolved.

Not all rules and not all tokens have precedence. If either the rule or the
lookahead token has no precedence, then the default is to shift.
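
The resolution procedure just described fits in one small C function. This is
a sketch of the rule as stated in the text, not Bison's source code; the enum
names, the function resolve, and the convention that level 0 means "no
precedence" are assumptions of this example.

```c
#include <assert.h>

enum assoc { ASSOC_LEFT, ASSOC_RIGHT, ASSOC_NONASSOC };
enum action { ACT_SHIFT, ACT_REDUCE, ACT_ERROR };

/* Resolve a shift/reduce conflict between a rule and the lookahead
   token.  A precedence level of 0 means that none was declared;
   `level_assoc` is consulted only when both levels are equal.  */
enum action resolve (int rule_prec, int token_prec, enum assoc level_assoc)
{
  assert (rule_prec >= 0 && token_prec >= 0);
  if (rule_prec == 0 || token_prec == 0)
    return ACT_SHIFT;               /* default when precedence is missing */
  if (token_prec > rule_prec)
    return ACT_SHIFT;
  if (rule_prec > token_prec)
    return ACT_REDUCE;
  switch (level_assoc)              /* equal precedence */
    {
    case ASSOC_LEFT:  return ACT_REDUCE;
    case ASSOC_RIGHT: return ACT_SHIFT;
    default:          return ACT_ERROR;   /* %nonassoc */
    }
}
```

For example, with '-' at level 1 and '*' at level 2, a subtraction rule
facing a '*' lookahead gives resolve (1, 2, ASSOC_LEFT), which shifts.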

5.3.6 Using Precedence For Non Operators

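The dangling else conflict can be fixed by giving "else" a higher precedence
than "then" (see 5.3.3 [Specifying Precedence Only]). Alternatively, you may
give both tokens the same precedence, in which case associativity is used to
solve the conflict. To preserve the shift action, use right associativity: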

%right "then" "else"

Neither solution is perfect however. Since Bison does not provide, so far,
scoped precedence, both force you to declare the precedence of these keywords
with respect to the other operators in your grammar. Therefore, instead of
being warned about new conflicts you would be unaware of (e.g., a shift/
reduce conflict due to 'if test then 1 else 2 + 3' being ambiguous: 'if test
then 1 else (2 + 3)' or '(if test then 1 else 2) + 3'?), the conflict will be
already fixed.

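5.4 Context-Dependent Precedence

The %prec modifier gives a rule the precedence of a named terminal symbol
instead of that of its last terminal. For example, to give unary minus the
highest precedence:

%left '+' '-'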
%left '*'
%left UMINUS

exp:
  ...
| exp '-' exp
| '-' exp  %prec UMINUS

Now the precedence of UMINUS is given to the specific rule ('-' exp).

5.5 Parser States

The values pushed on the parser stack are not simply token kind codes; they
represent the entire sequence of terminal and nonterminal symbols at or near
the top of the stack. The current state collects all the information about
previous input which is relevant to deciding what to do next.

Each time a lookahead token is read, the current parser state together with
the kind of lookahead token are looked up in a table.

  1. This table entry can say, "Shift the lookahead token." In this case, it
    also specifies the new parser state, which is pushed onto the top of
    the parser stack.
  2. Or it can say, "Reduce using rule number n." This means that a certain
    number of tokens or groupings are taken off the top of the stack, and replaced
    by one grouping. In other words, that number of states are popped from the
    stack, and one new state is pushed.
  3. There is one other alternative: the table can say that the lookahead token
    is erroneous in the current state. This causes error processing to begin
    (see 6 [Error Recovery]).
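
The three kinds of table entry can be illustrated with a hand-built table for
a toy grammar. The C sketch below is not output from Bison: the grammar (S is
either 'x' or a parenthesized S), the state numbering, and every identifier
are assumptions made for this example.

```c
#include <assert.h>

enum { N_STATES = 6, MAX_DEPTH = 64 };
enum kind { ERR, SHIFT, REDUCE, ACCEPT };
struct entry { enum kind kind; int arg; };  /* arg: new state or rule number */

/* Table column for a lookahead: '(' ')' 'x' and end of input. */
static int column (char c)
{
  switch (c)
    {
    case '(':  return 0;
    case ')':  return 1;
    case 'x':  return 2;
    case '\0': return 3;
    default:   return -1;
    }
}

/* Toy grammar:  rule 1:  S: '(' S ')'     rule 2:  S: 'x'  */
static const struct entry action[N_STATES][4] = {
  /* 0 */ { {SHIFT, 2}, {ERR, 0},    {SHIFT, 3}, {ERR, 0} },
  /* 1 */ { {ERR, 0},   {ERR, 0},    {ERR, 0},   {ACCEPT, 0} },
  /* 2 */ { {SHIFT, 2}, {ERR, 0},    {SHIFT, 3}, {ERR, 0} },
  /* 3 */ { {ERR, 0},   {REDUCE, 2}, {ERR, 0},   {REDUCE, 2} },
  /* 4 */ { {ERR, 0},   {SHIFT, 5},  {ERR, 0},   {ERR, 0} },
  /* 5 */ { {ERR, 0},   {REDUCE, 1}, {ERR, 0},   {REDUCE, 1} },
};
static const int rule_length[] = { 0, 3, 1 };  /* right-hand side sizes */
/* State entered after reducing to S, given the state left on top. */
static const int goto_S[N_STATES] = { 1, -1, 4, -1, -1, -1 };

/* Return 1 if `input` is a sentence of the grammar, 0 otherwise. */
int parse (const char *input)
{
  int stack[MAX_DEPTH];
  int top = 0;
  stack[0] = 0;                          /* start in state 0 */

  for (;;)
    {
      int col = column (*input);
      int next;
      struct entry e;
      if (col < 0)
        return 0;                        /* unknown character */
      e = action[stack[top]][col];
      switch (e.kind)
        {
        case SHIFT:
          if (top + 1 >= MAX_DEPTH)
            return 0;                    /* nesting too deep for this sketch */
          stack[++top] = e.arg;          /* push the new state */
          ++input;                       /* consume the lookahead */
          break;
        case REDUCE:
          top -= rule_length[e.arg];     /* pop one state per symbol */
          assert (top >= 0 && goto_S[stack[top]] >= 0);
          next = goto_S[stack[top]];
          stack[++top] = next;           /* push one state for S */
          break;
        case ACCEPT:
          return 1;
        default:
          return 0;                      /* erroneous lookahead */
        }
    }
}
```

Note that the stack holds only state numbers: each state already summarizes
which symbols lie beneath it, so no separate symbol stack is needed to decide
the next step.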

5.6 Reduce/Reduce Conflicts

A reduce/reduce conflict occurs if there are two or more rules that apply to
the same sequence of input. This usually indicates a serious error in the
grammar.

Bison resolves a reduce/reduce conflict by choosing to use the rule that
appears first in the grammar, but it is very risky to rely on this. Every
reduce/reduce conflict must be studied and usually eliminated.

For example:

sequence: /* empty */
| sequence word
| sequence redirect
;

%token word_t
%right word_t
%%
sequence: /* empty */
| sequence words  %prec word_t
;
words:
  word
| words word
;
word:
  word_t
;
posted @ 2024-06-12 20:18  joel-q