6.2 Lexing raw delimited content

https://lalrpop.github.io/lalrpop/lexer_tutorial/002_raw_delimited_content.html

Our calculator example operated on numbers and arithmetic operators. There is no overlap between the characters for numeric digits (0, 1, ...), the characters representing operators (+, -, ...) and parentheses ((, )), so it was easy to embed those tokens directly in the grammar, as we saw in the earlier sections.

However, clean lexical separations can be hard to identify in some languages.

Consider parsing a language with string literals. We will define a simple one; all it can do is bind variables, which are always single characters, like this:

x = "a"
y = "bc"

Using what we have learned so far, we might try a grammar like the following one:

use super::{Var, Lit, Eql};

grammar;

pub Var: Var = <r"[x-z]"> => <>.chars().next().unwrap().into();

pub Lit: Lit = "\"" <r"[a-z]*"> "\"" => <>.into();

pub Eql: Eql = <Var> "=" <Lit> => (<>).into();

Unfortunately, this does not work; attempting to process the above grammar yields:

error: ambiguity detected between the terminal `r#"[x-z]"#` and the terminal `r#"[a-z]*"#`

We saw the explanation for why this happens in the previous section: the two regular expressions overlap, and the generated lexer does not know how to resolve the ambiguity between them. For example, the single character x is matched by both r"[x-z]" and r"[a-z]*", and nothing in the grammar tells the lexer which token it should emit.

Cut to the chase?

If you want to know "the right way" to solve this problem, you can skip straight to the end.

But if you want to understand why it is the right answer, you may benefit from taking the detour that starts now.

Exploring our options

A match declaration here, as suggested in the previous chapter, might seem like it fixes the problem (recall that entries in the match block take lexing precedence over entries in the else block, and _ stands in for every other terminal the grammar mentions):

use super::{Var, Lit, Eql};

grammar;

match {
   r"[x-z]"
} else {
   r"[a-z]*",
   _
}

pub Var: Var = <r"[x-z]"> => <>.chars().next().unwrap().into();

pub Lit: Lit = "\"" <r"[a-z]*"> "\"" => <>.into();

pub Eql: Eql = <Var> "=" <Lit> => (<>).into();

With that match declaration in place we can successfully run a test like this one:

#[test]
fn fair_ball() {
    assert_eq!(nobol2::EqlParser::new().parse(r#"z = "xyz""#), Ok(('z', "xyz").into()));
}
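
These tests lean on the Var, Lit, and Eql types imported by the grammar via use super::{...}; the tutorial does not show their definitions, which live in the surrounding library code. A minimal sketch that would make the grammar actions and the .into() conversions in the tests compile might look like this (every definition below is an assumption inferred from the actions and the assertions, not the tutorial's actual code):

#[derive(Debug, PartialEq)]
pub struct Var(pub char);

#[derive(Debug, PartialEq)]
pub struct Lit(pub String);

#[derive(Debug, PartialEq)]
pub struct Eql(pub Var, pub Lit);

// char -> Var, used by the Var rule's action.
impl From<char> for Var {
    fn from(c: char) -> Self { Var(c) }
}

// &str -> Lit, used by the Lit rule's action.
impl From<&str> for Lit {
    fn from(s: &str) -> Self { Lit(s.to_string()) }
}

// (Var, Lit) -> Eql, used by the Eql rule's action.
impl From<(Var, Lit)> for Eql {
    fn from((v, l): (Var, Lit)) -> Self { Eql(v, l) }
}

// (char, &str) -> Eql, used by test assertions like Ok(('z', "xyz").into()).
impl From<(char, &str)> for Eql {
    fn from((c, s): (char, &str)) -> Self { Eql(c.into(), s.into()) }
}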

Unfortunately, the match is actually only papering over the fundamental problem here. Consider this variant of the previous test:

#[test]
fn foul_ball() {
    assert_eq!(nobol2::EqlParser::new().parse(r#"z = "x""#), Ok(('z', "x").into()));
}

The above produces:

---- foul_ball stdout ----
thread 'foul_ball' panicked at 'assertion failed: `(left == right)`
  left: `Err(UnrecognizedToken { token: (5, Token(3, "x"), 6), expected: ["r#\"[a-z]*\"#"] })`,
 right: `Ok(Eql(Var('z'), Lit("x")))`', doc/nobol/src/main.rs:43:5

What is the problem?

Merely specifying a precedence to favor tokenizing r"[x-z]" over r"[a-z]*" does not address the real problem here. That precedence rule causes an input like z = "x" to be split into tokens such that the x only matches the regular expression for the Var. It will not match the r"[a-z]*" in the Lit rule, even if it intuitively seems like it should; they have already been lexically categorized as different tokens at this point.
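
Concretely, with that precedence in place, the input z = "x" is tokenized roughly as follows (an illustrative breakdown, not literal lexer output):

z  ->  r"[x-z]"   (a Var token)
=  ->  "="
"  ->  "\""
x  ->  r"[x-z]"   (also a Var token: both regular expressions match this single character, and the match precedence breaks the tie)
"  ->  "\""

Parsing then fails at the fourth token: the Lit rule requires an r"[a-z]*" token between the quotes, but the lexer has already classified the x as r"[x-z]".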

One could add further workarounds to deal with this. For example, one could change the Lit production to explicitly handle the r"[x-z]" regular expression as its own case:

pub Lit: Lit = {
    "\"" <r"[x-z]"> "\"" => <>.into(),
    "\"" <r"[a-z]*"> "\"" => <>.into(),
};

But this is a fragile workaround.

Specifically, this workaround is only applicable because we put artificial limits on this language.

If we wanted to generalize string literals to be able to contain other characters (such as whitespace), the technique described so far does not work out well. Consider this grammar:

match {
   r"[x-z]"
} else {
   r"[a-z ]*",
   _
}

pub Var: Var = <r"[x-z]"> => <>.chars().next().unwrap().into();

pub Lit: Lit = {
    "\"" <r"[x-z]"> "\"" => <>.into(),
    "\"" <r"[a-z ]*"> "\"" => <>.into(),
};

pub Eql: Eql = <Var> "=" <Lit> => (<>).into();

Now, if we run the same test as before:

#[test]
fn spaceballs() {
    assert_eq!(nobol4::EqlParser::new().parse(r#"z = "x""#), Ok(('z', "x").into()));
}

we get the following error output:

thread 'spaceballs' panicked at 'assertion failed: `(left == right)`
  left: `Err(UnrecognizedToken { token: (0, Token(2, "z "), 2), expected: ["r#\"[x-z]\"#"] })`,
 right: `Ok(Eql(Var('z'), Lit("x")))`', doc/nobol/src/main.rs:58:5

Our attempt to generalize what strings can contain has caused problems for how the rest of the input is tokenized. The r"[a-z ]*" regular expression now also matches the two-character sequence of z plus the space after it; that match is longer than the single character matched by r"[x-z]", and since the generated lexer always prefers the longest match, our precedence declaration never gets a chance to apply.

The right way to do this

Let us revisit the original rule in the grammar for string literals, from our first version:

pub Lit: Lit = "\"" <r"[a-z]*"> "\"" => <>.into();

The heart of our problem is that we have implicitly specified distinct tokens for the string delimiter ("\"") versus the string content (in this case, r"[a-z]*").

Intuitively, we only want to tokenize string content when we are in the process of reading a string. In other words, we only want to apply the r"[a-z]*" rule immediately after reading a "\"". But the generated lexer does not infer this from our rules; it just blindly looks for something matching the string content regular expression anywhere in the input.

You could solve this with a custom lexer (treated in the next section).

But a simpler solution is to read the string delimiters and the string content as a single token, like so:

pub Var: Var = <r"[a-z]"> => <>.chars().next().unwrap().into();

pub Lit: Lit = <l:r#""[a-z ]*""#> => l[1..l.len()-1].into();

pub Eql: Eql = <Var> "=" <Lit> => (<>).into();

(Note that this form of the grammar does not require any match statement; there is no longer any ambiguity between the different regular expressions that drive the tokenizer.)

With this definition of the grammar, all of these tests pass:

#[test]
fn homerun() {
    assert_eq!(nobol5::VarParser::new().parse("x"), Ok('x'.into()));
    assert_eq!(nobol5::LitParser::new().parse(r#""abc""#), Ok("abc".into()));
    assert_eq!(nobol5::EqlParser::new().parse(r#"x = "a""#), Ok(('x', "a").into()));
    assert_eq!(nobol5::EqlParser::new().parse(r#"y = "bc""#), Ok(('y', "bc").into()));
    assert_eq!(nobol5::EqlParser::new().parse(r#"z = "xyz""#), Ok(('z', "xyz").into()));
    assert_eq!(nobol5::EqlParser::new().parse(r#"z = "x""#), Ok(('z', "x").into()));
    assert_eq!(nobol5::EqlParser::new().parse(r#"z = "x y z""#), Ok(('z', "x y z").into()));
}

Furthermore, we can now remove other artificial limits in our language. For example, we can make our identifiers more than one character:

pub Var: Var = <r"[a-z]+"> => <>.into();

which, with suitable changes to the library code, works out fine.
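
The "suitable changes" to the library code would include making Var own a String rather than a single char. A minimal sketch, continuing the hypothetical type definitions from earlier:

// Hypothetical revision: Var now owns a String, so identifiers of any
// length can be stored; the From impl matches the <>.into() action above.
#[derive(Debug, PartialEq)]
pub struct Var(pub String);

impl From<&str> for Var {
    fn from(s: &str) -> Self { Var(s.to_string()) }
}

Any conversions that previously took the variable as a char (such as the tuple conversion used in the tests) would be updated to take a &str in the same way.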

Escape sequences

Our current string literals are allowed to hold a small subset of the full space of characters.

If we wanted to generalize it to be able to hold arbitrary characters, we would need some way to denote the delimiter character " in the string content.

The usual way to do this is via an escape sequence: \", which is understood by the lexical analyzer as not ending the string content.

We can generalize the regular expression in our new Lit rule to handle this:

pub Lit: Lit = <l:r#""(\\\\|\\"|[^"\\])*""#> => l[1..l.len()-1].into();

(Reading the new token from the inside out: between the quote delimiters, each repetition of the group matches the two-character input sequence \\ (an escaped backslash), the two-character sequence \" (an escaped quote), or any single character other than a bare quote or backslash.) However, depending on your data model, this is not quite right. In particular: the produced string still has the escaping backslashes embedded in it.

As a concrete example, with the above definition for Lit, this test:

#[test]
fn popfly() {
    assert_eq!(nobol6::EqlParser::new().parse(r#"z = "\"\\""#), Ok(('z', "\"\\").into()));
}

yields this output:

thread 'popfly' panicked at 'assertion failed: `(left == right)`
  left: `Ok(Eql(Var('z'), Lit("\\\"\\\\")))`,
 right: `Ok(Eql(Var('z'), Lit("\"\\")))`', doc/nobol/src/main.rs:91:5

This can be readily addressed by adding some code to post-process the token to remove the backslashes:

pub Lit: Lit = <l:r#""(\\\\|\\"|[^"\\])*""#> => Lit(apply_string_escapes(&l[1..l.len()-1]).into());

where apply_string_escapes is a helper routine that searches for backslashes in the content and performs the corresponding replacement with the character denoted by the escape sequence.
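
The tutorial leaves apply_string_escapes to the library code. Here is a minimal sketch of what such a helper could look like, assuming it takes the content between the delimiters and returns an owned String; it only needs to handle the two escape sequences (\\ and \") that our regular expression admits:

fn apply_string_escapes(content: &str) -> String {
    let mut out = String::with_capacity(content.len());
    let mut chars = content.chars();
    while let Some(c) = chars.next() {
        if c == '\\' {
            // The regex guarantees that a backslash in the content is always
            // followed by another backslash or a double quote, and in both
            // escapes the second character denotes itself.
            if let Some(escaped) = chars.next() {
                out.push(escaped);
            }
        } else {
            out.push(c);
        }
    }
    out
}

With a helper along these lines in place, the popfly test above passes, because the Lit now holds the unescaped content.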
