Temporary notes, interesting stuff

A brief introduction to some compiler theory, plus recent advances in modern parser research.

http://www.antlr.org/article/needlook.html

http://citeseer.comp.nus.edu.sg/440034.html

Tomita(GLR) Parser

Packrat parser (uses TDPL)

http://java.sun.com/docs/books/jls/first_edition/html/19.doc.html

http://www.cs.berkeley.edu/~smcpeak/elkhound/

http://www.mollypages.org/page/grammar/index.mp

http://lambda.uta.edu/cse5317/notes/node20.html

http://pages.cpsc.ucalgary.ca/~robin/class/411/LR.1.html

http://en.wikipedia.org/wiki/Memoization

http://en.wikipedia.org/wiki/Comparison_of_parser_generators

http://en.wikipedia.org/wiki/Parsing_expression_grammar



> The author states that he wrote the GLR parser generator solely to
> handle C++ language spec [and someone lapped it up to handle Java].
>
> What exactly is it about OO languages that an LALR(1) parser cannot
> handle?

As the moderator noted, there is nothing about "OO" languages that
LALR(1) parsers cannot handle, but C++ itself is problematic. There
are LALR(1) and LL grammars for Java.

One of the problems with C++ is that expressions and declarations can
look exactly the same (technically, any language containing the "or"
of those two productions is ambiguous). C++ gets around that by
saying, if it looks like a declaration, it is a declaration (forcing
the "or" to be resolved in a particular direction and so resolving the
ambiguity). However, that resolution is not expressed grammatically,
and one cannot take two arbitrary context-free rules, take their
difference, and expect the result to be a context-free language, which
is what the C++ ambiguity resolution requires one to do.

In contrast, GLR grammars are not required to be unambiguous. Any
ambiguity is resolved by producing a parse forest that represents all
the potential ambiguous choices and requiring a later "semantic" pass
to choose which parse tree in the forest is the desired one. Thus,
with a GLR parser, one can disambiguate the C++ problem by selecting
the parse tree that treats all the ambiguous expression/declaration
sub-trees as declarations.
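
As a rough illustration of the "forest first, choose later" idea, here
is a minimal Python sketch. It is not a GLR implementation: the toy
grammar (a statement is either `NAME * NAME ;` read as a pointer
declaration, or an expression statement) and all function names are
assumptions made up for this example. It only mimics the observable
behaviour, where the parser returns every tree and a later pass picks
the declaration.

```python
def parse_decl(tokens):
    """Return [tree] if the tokens read as `T * x ;`, otherwise []."""
    if (len(tokens) == 4 and tokens[0].isidentifier() and tokens[1] == "*"
            and tokens[2].isidentifier() and tokens[3] == ";"):
        return [("decl", tokens[0], tokens[2])]
    return []


def parse_expr_stmt(tokens):
    """Return [tree] if the tokens read as `a * b ;` or `a ;`, otherwise []."""
    if (len(tokens) == 4 and tokens[0].isidentifier() and tokens[1] == "*"
            and tokens[2].isidentifier() and tokens[3] == ";"):
        return [("expr", ("mul", tokens[0], tokens[2]))]
    if len(tokens) == 2 and tokens[0].isidentifier() and tokens[1] == ";":
        return [("expr", ("name", tokens[0]))]
    return []


def parse_forest(tokens):
    """Collect every tree the toy grammar allows (the 'forest')."""
    return parse_decl(tokens) + parse_expr_stmt(tokens)


def disambiguate(forest):
    """Later 'semantic' pass: prefer the declaration reading if present."""
    for tree in forest:
        if tree[0] == "decl":
            return tree
    return forest[0] if forest else None


forest = parse_forest("T * x ;".split())
print(len(forest), disambiguate(forest))
```

For `T * x ;` the forest holds two trees and the selection pass keeps
the declaration; for `x ;` there is only the expression tree.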

The only problem with GLR as a technology is that there are no
"warnings" from the grammar processing tool that the language is
ambiguous. Well, there are warnings that the language is not LR (or
LALR, or whatever technology the GLR parser uses as a base). However,
some of those grammars will actually not be ambiguous and some of them
will be. In either case, once your GLR generator has given a warning,
you must either prove that the language actually isn't ambiguous or
write your semantic phase assuming that it is and disambiguate the
resulting forest.

It is worth mentioning that there are other ways of handling ambiguous
grammars. In particular, one can use predicates to resolve
ambiguities. Predicates allow one to take the difference of two
productions in a controlled manner. In particular, it is possible to
write a syntactic rule that says: try to parse this as a declaration,
and if it isn't one, parse it as an expression. The difference between
the predicated and the GLR solution is that predicated grammars are
still deterministic. There are no hidden ambiguities in a predicated
grammar. If your predicated parser generator gives you an error, you
still have an unresolved ambiguity, and if it doesn't, the resulting
parser will always construct a parse tree (and not a forest).
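
The "try a declaration, otherwise an expression" rule can be sketched
in a few lines of Python. This is only a hand-written illustration
under assumed names and an assumed toy grammar (`NAME * NAME ;` as a
declaration, `NAME (* NAME)* ;` as an expression statement); it does
not reproduce any particular parser generator.

```python
def match_decl(tokens, pos):
    """Try `NAME * NAME ;` at pos; return (tree, next_pos) or None."""
    window = tokens[pos:pos + 4]
    if (len(window) == 4 and window[0].isidentifier() and window[1] == "*"
            and window[2].isidentifier() and window[3] == ";"):
        return ("decl", window[0], window[2]), pos + 4
    return None


def match_expr_stmt(tokens, pos):
    """Try `NAME (* NAME)* ;` at pos; return (tree, next_pos) or None."""
    if pos >= len(tokens) or not tokens[pos].isidentifier():
        return None
    tree = ("name", tokens[pos])
    pos += 1
    while (pos + 1 < len(tokens) and tokens[pos] == "*"
           and tokens[pos + 1].isidentifier()):
        tree = ("mul", tree, ("name", tokens[pos + 1]))
        pos += 2
    if pos < len(tokens) and tokens[pos] == ";":
        return ("expr_stmt", tree), pos + 1
    return None


def parse_stmt(tokens, pos=0):
    """Predicated choice: the declaration rule wins whenever it matches."""
    result = match_decl(tokens, pos)
    if result is not None:
        return result
    return match_expr_stmt(tokens, pos)


print(parse_stmt("T * x ;".split()))
print(parse_stmt("a * b * c ;".split()))
```

Exactly one tree (or a failure) comes back for every input, which is
the determinism the paragraph above describes.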

I would be remiss if I did not also point out backtracking parsers,
which are another solution to the problem. In fact, all the
implementations of predicated parsers that I know of use some form of
backtracking in their implementation. General backtracking parsers
share the characteristic with GLR parsers that they can parse
ambiguous grammars. Backtracking parsers generally also produce a
parse tree (although in theory they could also produce a forest).
Backtracking parsers have their own deficits though. Many
backtracking parsers will loop forever on some ambiguous grammars.
(Predicated backtracking parsers do not generally have this problem,
although they do not make the same linear time guarantees that pure LL
and LR parsers do (see note) -- of course, any parser generator that
can handle a significant class of ambiguous grammars must be
inherently non-linear for some grammars, and GLR parsers have a cubic
worst case, same as Earley parsers.) In addition, most backtracking
parsers resolve ambiguities by selecting one parse tree out of the
forest to return. This is generally done by the order of the rules in
the grammar (which determines the order in which the rules are tried
in ambiguous cases). If one looks closely, this is very similar to
using predicates "implicitly" in the grammar. The key difference is
that the tool inserts the predicates rather than the user, and does so
without warning and usually without the run-time termination
guarantees.

I would like to mention that it is possible to build a predicated
parser using GLR technology, although I don't know of anyone
attempting to do so right now. From thought-experiments I have done
considering whether to implement such a tool, it seems like there
would be some advantages to building such a tool.

Again, I do not want to imply that these are the only techniques for
dealing with ambiguity. For example, Ralph Boland is pursuing some
generalization of LR technology that I gather will handle a wider
class of languages and I don't think his technique is any of the
above.

Note: Bryan Ford recently published a paper on a "predicated" parsing
technique that made extensive use of memoization and lazy evaluation
to achieve (if I recall correctly) a linear time guarantee. His
technique shares a characteristic with general backtracking parsers
in that the order of rules determines what is matched and the
entire tree is disambiguated that way. He uses an "ordered" or clause
to implement this.
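
To make the combination of ordered choice and memoization concrete,
here is a small Python sketch. It is not taken from the paper: the toy
grammar (`Expr <- Term '+' Expr / Term`, `Term <- '(' Expr ')' / digit`)
and the function names are assumptions for illustration, and Python's
`functools.lru_cache` stands in for the memo table (lazy evaluation is
not modelled). Each rule is a function from an input position to the
position after a match, or `None` on failure.

```python
from functools import lru_cache


def make_parser(text):
    """Build memoized rule functions for one input string."""

    @lru_cache(maxsize=None)
    def expr(pos):
        # First alternative: Term '+' Expr
        after_term = term(pos)
        if after_term is not None and text.startswith("+", after_term):
            after_rest = expr(after_term + 1)
            if after_rest is not None:
                return after_rest
        # Second alternative: Term. The repeated term(pos) call is
        # answered from the cache, so the work is not done twice.
        return term(pos)

    @lru_cache(maxsize=None)
    def term(pos):
        # First alternative: '(' Expr ')'
        if text.startswith("(", pos):
            after_inner = expr(pos + 1)
            if after_inner is not None and text.startswith(")", after_inner):
                return after_inner + 1
        # Second alternative: a single decimal digit
        if pos < len(text) and text[pos] in "0123456789":
            return pos + 1
        return None

    return expr


def matches(text):
    """True if the whole string is an Expr under the toy grammar."""
    return make_parser(text)(0) == len(text)


print(matches("1+(2+3)"), matches("1+"), matches("(1"))
```

Because every (rule, position) pair is evaluated at most once, the
total work is bounded by the number of rules times the input length,
which is where the linear time behaviour comes from.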

Hope this helps,
-Chris


Chris Clark said (in part):

> It is worth mentioning that there are other ways of handling ambiguous
> grammars. In particular, one can use predicates to resolve
> ambiguities. Predicates allow one to take the difference of two
> productions in a controlled manner. In particular, it is possible to
> write a syntactic rule that says, try to parse this as a declaration
> and if it isn't parse it as an expression. The difference between the
> predicated and the GLR solution is that predicated grammars are still
> deterministic. There are no hidden ambiguities in a predicated
> grammar. If your predicated parser generator gives you an error, you
> still have an unresolved ambiguity and if it doesn't the resulting
> parser will always construct a parse tree (and not a forest)

I agree with Chris about the use of predicates.

Interestingly, the use of predicates alone can add some Type 1 power to a
grammar.

For instance:

L1 = {a^n b^n c+} // clearly a type 2 language
L2 = {a+ b^n c^n} // clearly a type 2 language

L1 intersect L2 = {a^n b^n c^n} // a type 1 language
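
A quick way to see the intersection at work is to write a recognizer
for each language and require both to accept. The Python sketch below
uses counting rather than grammars, assumes n >= 1 for both languages
(the ranges are not stated above), and uses made-up function names.

```python
import re

# Shared shape check: one or more a's, then b's, then c's.
SHAPE = re.compile(r"(a+)(b+)(c+)")


def in_l1(s):
    """a^n b^n c+ : as many b's as a's, any positive number of c's."""
    m = SHAPE.fullmatch(s)
    return m is not None and len(m.group(1)) == len(m.group(2))


def in_l2(s):
    """a+ b^n c^n : as many c's as b's, any positive number of a's."""
    m = SHAPE.fullmatch(s)
    return m is not None and len(m.group(2)) == len(m.group(3))


def in_intersection(s):
    """Both predicates must hold, which forces a^n b^n c^n."""
    return in_l1(s) and in_l2(s)


for word in ["aabbc", "abbcc", "aabbcc"]:
    print(word, in_l1(word), in_l2(word), in_intersection(word))
```

Each predicate on its own only ties two of the three counts together;
requiring both ties all three.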

Current research in the area of this class of grammars can be found here:

http://www.cs.queensu.ca/home/okhotin/

See the section on "Boolean grammars." Intersection can get quite a bit of
power out of a formalism.

My most recent paper deals with several difficult to parse languages of the
classical sort, including the particularly nasty to parse:

L = {a^m b^n c^mn}

The only grammar I've seen expressed for that one in classical form is in
Type 0 due to a length increasing production:

  (1) <S> ::= <H><S> | <H><B>
  (2) <B> ::= <B><B> | <C>
  (3) <H><B> ::= <A><X><N><B>
  (4) <N><B> ::= <B><N>
  (5) <B><M> ::= <M><B>
  (6) <N><C> ::= <M>c
  (7) <N>c ::= <M>cc
  (8) <X><M><B><B> ::= <B><X><N><B>
  (9) <X><B><M>c ::= <B>c
(10) <H><A> ::= <A><H>
(11) <A> ::= a
(12) <B> ::= b

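Because production (9) is length increasing when read as a reduction
(generatively, its right-hand side is shorter than its left-hand side,
which Type 1 grammars do not allow), the grammar is in Type 0 form,
even though the language itself is Type 1. I'd like to see that grammar
normalized to Type 1 -- but haven't been able to find such a form.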

The longest derivation I've been able to do with that by hand is aabcc,
which is:

(11) aabcc --> <A>abcc
(11) <A>abcc --> <A><A>bcc
(12) <A><A>bcc --> <A><A><B>cc
  (9) <A><A><B>cc --> <A><A><X><B><M>cc
  (7) <A><A><X><B><M>cc --> <A><A><X><B><N>c
  (4) <A><A><X><B><N>c --> <A><A><X><N><B>c
  (3) <A><A><X><N><B>c --> <A><H><B>c
  (9) <A><H><B>c --> <A><H><X><B><M>c
  (6) <A><H><X><B><M>c --> <A><H><X><B><N><C>
  (4) <A><H><X><B><N><C> --> <A><H><X><N><B><C>
  (2) <A><H><X><N><B><C> --> <A><H><X><N><B><B>
  (2) <A><H><X><N><B><B> --> <A><H><X><N><B>
(10) <A><H><X><N><B> --> <H><A><X><N><B>
  (3) <H><A><X><N><B> --> <H><H><B>
  (1) <H><H><B> --> <H><S>
  (1) <H><S> --> <S>
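
Hand derivations like this are easy to get wrong, so a small checker
helps. The Python sketch below treats each numbered production as a
string rewrite applied in the reduction direction (right-hand side
replaced by left-hand side) and walks the aabcc steps. The function
names are made up for this example. One assumption is flagged in the
code: the move from `<B><N><C>` to `<N><B><B>` is written as two
reductions, (4) followed by (2) with `<C>` reduced to `<B>`, because a
single application of (4) cannot produce it.

```python
# Productions (1)-(12) as (left-hand side, right-hand side) pairs.
RULES = {
    1: [("<S>", "<H><S>"), ("<S>", "<H><B>")],
    2: [("<B>", "<B><B>"), ("<B>", "<C>")],
    3: [("<H><B>", "<A><X><N><B>")],
    4: [("<N><B>", "<B><N>")],
    5: [("<B><M>", "<M><B>")],
    6: [("<N><C>", "<M>c")],
    7: [("<N>c", "<M>cc")],
    8: [("<X><M><B><B>", "<B><X><N><B>")],
    9: [("<X><B><M>c", "<B>c")],
    10: [("<H><A>", "<A><H>")],
    11: [("<A>", "a")],
    12: [("<B>", "b")],
}


def is_reduction(rule_no, before, after):
    """True if replacing one occurrence of the rule's right-hand side
    in `before` by its left-hand side yields `after`."""
    for lhs, rhs in RULES[rule_no]:
        start = before.find(rhs)
        while start != -1:
            if before[:start] + lhs + before[start + len(rhs):] == after:
                return True
            start = before.find(rhs, start + 1)
    return False


START = "aabcc"
STEPS = [
    (11, "<A>abcc"), (11, "<A><A>bcc"), (12, "<A><A><B>cc"),
    (9, "<A><A><X><B><M>cc"), (7, "<A><A><X><B><N>c"),
    (4, "<A><A><X><N><B>c"), (3, "<A><H><B>c"),
    (9, "<A><H><X><B><M>c"), (6, "<A><H><X><B><N><C>"),
    (4, "<A><H><X><N><B><C>"),
    (2, "<A><H><X><N><B><B>"),  # assumed intermediate step: <C> to <B>
    (2, "<A><H><X><N><B>"), (10, "<H><A><X><N><B>"),
    (3, "<H><H><B>"), (1, "<H><S>"), (1, "<S>"),
]


def first_bad_step(start, steps):
    """Return the 1-based index of the first invalid step, or None."""
    current = start
    for index, (rule_no, after) in enumerate(steps, 1):
        if not is_reduction(rule_no, current, after):
            return index
        current = after
    return None


print(first_bad_step(START, STEPS))
```

The same `is_reduction` helper could drive a search for longer strings
such as aaabbcccccc, although a naive search over rewrites may be slow.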

I started on aaabbcccccc -- but got lost in the shuffle. :-(

If anyone wants to try a purely "predicate" approach to the above
language -- I'd love to see it. (Or if anyone would care to post the
derivation of aaabbcccccc, I'd really love to see that, too.)

The $-grammar I wrote for that one accepts strings in O(n^2.3), and has 6
productions and 2 of those are predicates, but also makes use of 2
phi-expressions, which implies a total of 4 predicates (since there is an
implied predicate with every phi-expression) and at least 2 name-indexed
tries.

Also of note is that I allow for a-expressions in predicates, which allows
for substrings that have been parsed to be concatenated to form entirely new
input that is then passed to the predicates. (The $-grammar for a^m b^n c^mn
uses this.) This is similar to what is known as "length-increasing".

Anyway, a few nights ago, I was asking myself about this formation in C++:

class Foo
{
    int inline_function(int x)
    {
        return __y * x; // __y is used before being seen
    }

    int __y;
}; // this class is legal

class Bar
{
    int inline_function(int x)
    {
        return __y * x; // __y is an undeclared variable
    }
}; // this class is not legal because __y never gets declared

$-calculus was able to handle this ... without any code, accepting Foo and
rejecting Bar -- using all of the above mentioned techniques.

Although the resulting $-grammar uses more than 1 explicit predicate, it
does make use of phi-expressions, and these have implied predicates -- so I
was not able to do it without extensive overall use of predication.

Anyway -- I wrote it up and am looking for somewhere that is looking for a
3.5 page paper on such things. Any thoughts?

[BTW -- Chris -- direct email to you bounces from my account. Is that a spam
guard?]

Posted on 2008-06-20 20:30 by 怪怪
