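Introduction to Parsers
I have started reading Language Implementation Patterns and took the chance to brush up on parser basics. The English passages are quoted from Wikipedia; the notes that follow each excerpt are my own understanding of it.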
Wiki LALR parser
http://en.wikipedia.org/wiki/LALR_parser
In 1965, Donald Knuth invented the LR parser (Left to Right, Rightmost derivation). The LR parser can recognize any deterministic context-free language in linear-bounded time. However, rightmost derivation has very large memory requirements and implementing an LR parser was impractical due to the limited memory of computers at that time. To address this shortcoming, in 1969, Frank DeRemer proposed two simplified versions of the LR parser, namely the Look-Ahead LR (LALR) and the Simple LR parser that had much lower memory requirements at the cost of less language recognition power with the LALR parser being the most powerful alternative. Later, in 1977, memory optimizations for the LR parser were invented but still the LR parser was less memory efficient than the simplified alternatives.
In 1979, Frank DeRemer and Tom Pennello announced a series of optimizations for the LALR parser that would further improve its memory efficiency. The formal presentation of these optimizations was made in 1982.
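My reading: in 1965 Donald Knuth invented the LR parser (Left to right, Rightmost derivation). It can recognize any deterministic context-free language in linear time. The approach needed a lot of memory, though, and with the limited memory of the machines of that era, building an LR parser was not practical. To work around this, in 1969 Frank DeRemer proposed two simplified versions of the LR parser, Look-Ahead LR (LALR) and Simple LR (SLR). Both trade some language-recognition power for a much smaller memory footprint, and LALR is the more powerful of the two. Memory optimizations for the full LR parser followed in 1977, but it still remained less memory-efficient than the simplified versions.
In 1979 Frank DeRemer and Tom Pennello announced a series of optimizations for the LALR parser that improved its memory efficiency further. These optimizations were formally presented in 1982.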
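Overview
Like other kinds of LR parsers, an LALR parser scans the input string once from left to right and is very efficient at finding the single correct bottom-up parse, because it never needs to backtrack. As the name says, it is a look-ahead parser; LALR(1), with one token of lookahead, is the most common case.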
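Implementation issues
Because an LALR parser performs a rightmost derivation rather than the more intuitive leftmost derivation, it is hard to understand how it works. This makes finding a correct and efficient LALR grammar demanding and time-consuming. For the same reason, error reporting can be hard, because LALR parser errors do not map readily onto messages that mean something to the user. For this reason a recursive descent parser is sometimes a better choice than an LALR parser. It requires more hand-written code because its language-recognition power is weaker, but it does not have the particular difficulties of LALR parsers, because it performs a leftmost derivation.
A good LALR(1) reference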
http://web.cs.dal.ca/~sjackson/lalr1.html
Wiki LR parser
http://en.wikipedia.org/wiki/LR_parser
The name LR is an acronym. The L means that the parser reads input text in one direction without backing up; that direction is typically Left to right within each line, and top to bottom across the lines of the full input file. (This is true for most parsers.) The R means that the parser produces a reversed Rightmost derivation; it does a bottom-up parse, not a top-down LL parse or ad-hoc parse. The name LR is often followed by a numeric qualifier, as in LR(1) or sometimes LR(k). To avoid backtracking or guessing, the LR parser is allowed to peek ahead at k lookahead input symbols before deciding how to parse earlier symbols. Typically k is 1 and is not mentioned. The name LR is often preceded by other qualifiers, as in SLR and LALR.
LR parsers are deterministic; they produce a single correct parse without guesswork or backtracking, in linear time. This is ideal for computer languages. But LR parsers are not suited for human languages which need more flexible but slower methods. Other parser methods that backtrack or yield multiple parses may take N^2 or N^3 time when they guess badly.
The above properties of L, R, and k are actually shared by all shift-reduce parsers, including precedence parsers. But by convention, the LR name stands for the form of parsing invented by Donald Knuth, and excludes the earlier, less powerful precedence methods.[1] LR parsers can handle a larger range of languages and grammars than precedence parsers or top-down LL parsing[2]. This is because the LR parser waits until it has seen an entire instance of some grammar pattern before committing to what it has found. An LL parser has to decide or guess what it is seeing much sooner, when it has only seen the leftmost input symbol of that pattern. LR is also better at error reporting. It detects syntax errors as early in the input stream as possible.
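My reading: in LR, the L (Left-to-right) means the parser reads the input in one direction without backing up, typically left to right within each line and top to bottom across the lines of the input file. The R (Rightmost derivation) means the parser produces a rightmost derivation in reverse; it parses bottom-up, unlike a top-down LL parse or an ad-hoc parse. LR is often followed by a number, as in LR(1) or LR(k). To avoid backtracking or guessing, an LR parser may look ahead at k input symbols before deciding how to parse the symbols it has already read. The name also often carries a prefix for a variant, as in SLR or LALR.
LR parsers are deterministic: they produce a single correct parse in linear time, with no guessing or backtracking. That is ideal for computer languages. LR parsers are not suited to human languages, which need more flexible but slower methods. Other parsing methods that backtrack or yield multiple parses may take N^2 or N^3 time when they guess badly.
The L, R, and k properties above are actually shared by all shift-reduce parsers, including precedence parsers. By convention, though, LR refers to the form of parsing invented by Donald Knuth and excludes the earlier, less powerful precedence methods. LR parsers can handle a larger range of languages and grammars than top-down LL parsers. This is because an LR parser waits until it has seen a complete instance of a grammar pattern before committing to what it has found. An LL parser has to decide or guess what it is seeing much sooner, when it has only seen the leftmost input symbol of that pattern. LR parsers are also better at error reporting: they detect a syntax error as early in the input as possible.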
LR Tutorial
When using an LR parser within some larger program, you can usually ignore all the mathematical details about states, tables, and generators. All of the parsing actions and outputs and their timing can be simply understood by viewing the LR parser as just a shift-reduce parser with some nifty decision method. If the generator tool complains about some parts of your grammar, you may need some understanding of states and the difference between LR and LALR in order to tweak your grammar into an acceptable form. Full understanding of grammar and state analysis algorithms is needed only by the tool implementer and by students of parsing theory courses.
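My reading: when you use an LR parser inside a larger program, you can usually ignore the mathematical details of states, tables, and generators. All of the parsing actions, their outputs, and their timing can be understood by treating the LR parser as a shift-reduce parser with a clever decision method. If the generator complains about parts of your grammar, you may need some understanding of states and of the difference between LR and LALR in order to rework the grammar into a form it accepts. Only tool implementers and students of parsing theory need a full understanding of the grammar and state analysis algorithms.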
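To make the "shift-reduce parser with a decision method" view concrete, here is a minimal sketch in Python. It is not a real LR parser: there are no generated states or tables, the decision method is a pair of hard-coded stack checks, and the toy grammar `E -> E + n | n` and the function name `shift_reduce_parse` are made up for this illustration. Unlike a real LR parser, it also reports errors only once it gets stuck at the end of the input.

```python
def shift_reduce_parse(tokens):
    """Recognize tokens under the toy grammar  E -> E + n | n.

    Returns the list of shift/reduce actions taken; raises SyntaxError on bad input.
    """
    stack = []    # grammar symbols recognized so far
    actions = []  # trace of what the parser did
    pos = 0       # index of the next unread token

    while True:
        # Decision method (hard-coded here, table-driven in a real LR parser):
        # reduce when the stack ends with a complete right-hand side, else shift.
        if stack[-3:] == ["E", "+", "n"]:
            stack[-3:] = ["E"]
            actions.append("reduce E -> E + n")
        elif stack == ["n"]:
            stack[:] = ["E"]
            actions.append("reduce E -> n")
        elif pos < len(tokens):
            token = tokens[pos]
            if token not in ("n", "+"):
                raise SyntaxError(f"unexpected token {token!r}")
            stack.append(token)
            actions.append(f"shift {token}")
            pos += 1
        else:
            break  # nothing left to reduce or shift

    if stack != ["E"]:
        raise SyntaxError(f"input is not a sentence of the grammar; stack is {stack}")
    return actions


for action in shift_reduce_parse(["n", "+", "n"]):
    print(action)
# shift n
# reduce E -> n
# shift +
# shift n
# reduce E -> E + n
```

Reading the two reductions in reverse order gives E => E + n => n + n, which is a rightmost derivation of the input. That is the sense in which the R in LR stands for a reversed rightmost derivation.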
Wiki GLR parser
http://en.wikipedia.org/wiki/GLR_parser
In computer science, a GLR parser ("Generalized Left-to-right Rightmost derivation parser") is an extension of an LR parser algorithm to handle nondeterministic and ambiguous grammars. First described in a 1984 paper by Masaru Tomita, it has also been referred to as a "parallel parser". Tomita presented five stages in his original work,[1] though, in practice, it is the second stage that is recognized as the GLR parser.
Though the algorithm has evolved since its original form, the principles have remained intact: Tomita's goal was to parse natural language text thoroughly and efficiently. Standard LR parsers cannot accommodate the nondeterministic and ambiguous nature of natural language, and the GLR algorithm can.
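My reading: in computer science, a GLR (Generalized LR) parser is an extension of the LR parsing algorithm that can handle nondeterministic and ambiguous grammars. It was first described by Masaru Tomita in 1984 and has also been referred to as a "parallel parser".
Although the algorithm has evolved from its original form, the principles are unchanged: Tomita's goal was to parse natural-language text thoroughly and efficiently. A standard LR parser cannot cope with the nondeterminism and ambiguity inherent in natural language, and the GLR algorithm can.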
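To see what "ambiguous" means here, the following sketch counts how many distinct parse trees a sentence has under a deliberately ambiguous toy grammar, `E -> E + E | n`. It is only an illustration of ambiguity, not an implementation of the GLR algorithm; the grammar and the function name `count_parses` are assumptions made for this example.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def count_parses(operands):
    """Count parse trees of  n + n + ... + n  (with `operands` copies of n)
    under the ambiguous grammar  E -> E + E | n."""
    if operands == 1:
        return 1  # only E -> n applies
    # E -> E + E: pick which '+' sits at the root of the tree;
    # `left` operands go to the left subtree and the rest to the right.
    return sum(
        count_parses(left) * count_parses(operands - left)
        for left in range(1, operands)
    )


print(count_parses(3))  # 2: (n + n) + n  and  n + (n + n)
print(count_parses(4))  # 5
```

A deterministic LR parser must commit to exactly one of these trees, so a generator reports a conflict for a grammar like this one. A GLR parser instead pursues the alternatives in parallel, which is the property that matters for natural language.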
Wiki LL parser
http://en.wikipedia.org/wiki/LL_parser
In computer science, an LL parser is a top-down parser for a subset of the context-free grammars. It parses the input from Left to right, and constructs a Leftmost derivation of the sentence (hence LL, compared with LR parser). The class of grammars which are parsable in this way is known as the LL grammars.
The remainder of this article describes the table-based kind of parser, the alternative being a recursive descent parser which is usually coded by hand (although not always; see e.g. ANTLR for an LL(*) recursive-descent parser generator).
An LL parser is called an LL(k) parser if it uses k tokens of lookahead when parsing a sentence. If such a parser exists for a certain grammar and it can parse sentences of this grammar without backtracking then it is called an LL(k) grammar. A language that has an LL(k) grammar is known as an LL(k) language. There are LL(k+n) languages that are not LL(k) languages.[1] A corollary of this is that not all context-free languages are LL(k) languages.
LL(1) grammars are very popular because the corresponding LL parsers only need to look at the next token to make their parsing decisions. Languages based on grammars with a high value of k have traditionally been considered to be difficult to parse, although this is less true now given the availability and widespread use of parser generators supporting LL(k) grammars for arbitrary k.
An LL parser is called an LL(*) parser if it is not restricted to a finite k tokens of lookahead, but can make parsing decisions by recognizing whether the following tokens belong to a regular language (for example by use of a Deterministic Finite Automaton).
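My reading: in computer science, an LL parser is a top-down parser for a subset of the context-free grammars. It parses the input from left to right and constructs a leftmost derivation of the sentence (hence LL, as opposed to LR). The grammars that can be parsed this way are called LL grammars.
An LL parser is called an LL(k) parser if it uses k tokens of lookahead when parsing a sentence. If such a parser exists for a grammar and can parse its sentences without backtracking, the grammar is called an LL(k) grammar. Not every context-free grammar (or language) is LL(k).
LL(1) grammars are very popular because the corresponding parsers only need to look at the next token to make a parsing decision. Languages whose grammars need a large k have traditionally been considered hard to parse, although this is less true now that widely used parser generators support LL(k) grammars for arbitrary k.
An LL parser is called an LL(*) parser if it is not restricted to a finite k tokens of lookahead but can instead make parsing decisions by recognizing whether the following tokens belong to a regular language (for example, by using a deterministic finite automaton).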
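The LL(1) idea described above, where every parsing decision looks at just the next token, can be sketched as a small hand-written recursive descent parser. This is an illustrative sketch, not the code from the book: the nested-list grammar in the comments and the names `tokenize` and `ListParser` are assumptions chosen for the example.

```python
def tokenize(text):
    """Turn text into (kind, value) pairs and append an EOF marker."""
    tokens = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch.isspace():
            i += 1
        elif ch.isalpha():
            j = i
            while j < len(text) and text[j].isalpha():
                j += 1
            tokens.append(("NAME", text[i:j]))
            i = j
        elif ch in "[],":
            tokens.append((ch, ch))
            i += 1
        else:
            raise SyntaxError(f"unexpected character {ch!r}")
    tokens.append(("EOF", ""))
    return tokens


class ListParser:
    """LL(1) recursive descent parser for nested lists such as [a, [b, c], d]."""

    def __init__(self, text):
        self.tokens = tokenize(text)
        self.pos = 0

    @property
    def lookahead(self):
        # The single token of lookahead that drives every decision.
        return self.tokens[self.pos][0]

    def match(self, kind):
        if self.lookahead != kind:
            raise SyntaxError(f"expected {kind}, found {self.lookahead}")
        value = self.tokens[self.pos][1]
        self.pos += 1
        return value

    def parse(self):
        result = self.list_()
        self.match("EOF")
        return result

    def list_(self):
        # list : '[' elements ']'
        self.match("[")
        items = self.elements()
        self.match("]")
        return items

    def elements(self):
        # elements : element (',' element)*
        items = [self.element()]
        while self.lookahead == ",":
            self.match(",")
            items.append(self.element())
        return items

    def element(self):
        # element : NAME | list   (chosen from the next token alone)
        if self.lookahead == "NAME":
            return self.match("NAME")
        if self.lookahead == "[":
            return self.list_()
        raise SyntaxError(f"expected NAME or '[', found {self.lookahead}")


print(ListParser("[a, [b, c], d]").parse())  # ['a', ['b', 'c'], 'd']
```

Each grammar rule becomes one method, and the call order follows a leftmost derivation of the input. If two alternatives of a rule could start with the same token, one token of lookahead would no longer be enough, and the parser would need LL(k) or LL(*) style decisions instead.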
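I am reading Language Implementation Patterns. I have finished Part I, on getting started with parsing, and it has been good so far. I also wrote a simple LL(1) parser of my own. Should I go on and implement everything in the book?
Having started, I should stick with it and see it through to the end.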