A comparative survey of recent natural language interfaces for databases: paper study notes

Research content

This paper gives an overview of 24 recently developed natural language interfaces (NLIs) for databases. Each system is evaluated against a curated list of ten sample questions to show its strengths and weaknesses.
The paper categorizes the NLIs into four groups based on the methodology they use: keyword-based, pattern-based, parsing-based and grammar-based NLIs.

Research method

The paper presents a small sample world that serves as the basis for analyzing the different NLIs. The sample world is a small movie database inspired by IMDb and extended with hierarchical relationships and semantic concepts.
The paper designs nine questions based on the operators of SQL and SPARQL: Joins, Filters (string, range, date or negation), Aggregations, Ordering, Union and Subqueries. The tenth question is based on a concept. All ten sample input questions are answerable in the sample world.
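To make the operator families concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The schema, the rows and the queries are hypothetical illustrations and are not the paper's actual sample world or its ten questions.

```python
import sqlite3

# Hypothetical miniature movie schema (an assumption, not the paper's sample world).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE movie (id INTEGER PRIMARY KEY, title TEXT, year INTEGER,
                    rating REAL, director_id INTEGER REFERENCES person(id));
INSERT INTO person VALUES (1, 'Director A'), (2, 'Director B');
INSERT INTO movie VALUES
    (1, 'Movie One',   1999, 8.1, 1),
    (2, 'Movie Two',   2005, 6.4, 1),
    (3, 'Movie Three', 2012, 7.3, 2);
""")

# One illustrative query per operator family listed in the text.
queries = {
    "join": """SELECT m.title FROM movie m
               JOIN person p ON m.director_id = p.id
               WHERE p.name = 'Director A' ORDER BY m.id""",
    "filter": "SELECT title FROM movie WHERE rating > 7 AND year >= 2000 ORDER BY id",
    "aggregation": "SELECT COUNT(*) FROM movie",
    "ordering": "SELECT title FROM movie ORDER BY year DESC",
    "union": """SELECT title FROM movie WHERE year < 2000
                UNION
                SELECT title FROM movie WHERE year > 2010
                ORDER BY title""",
    "subquery": """SELECT title FROM movie
                   WHERE rating > (SELECT AVG(rating) FROM movie) ORDER BY id""",
}

# Run every query and keep the first column of each result row.
results = {name: [row[0] for row in conn.execute(sql)] for name, sql in queries.items()}
```

An NLI has to produce queries of these shapes from a natural language question, which is why the sample questions are organized around the operators rather than around topics.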
To better understand what types of questions users pose and how representative the sample input questions are, the paper compares the ten sample input questions with two well-known question-answering corpora, Yahoo! QA Corpus L6 and GeoData250. The paper also summarizes the findings of Bonifati et al. and compares them to the sample input questions.
The survey's evaluation focuses on the ten sample questions, which are based on the operators of the formal query languages SQL and SPARQL. This leads to the following limitations of the evaluation:
- Theoretical. The evaluation is theoretical and based only on the published papers.
- Computational performance. The evaluation completely ignores the computational performance of the systems.
- Accuracy. The evaluation ignores the accuracy figures reported in the papers.

Part two

Keyword-based systems: The core of these systems is the lookup step, in which the system tries to match the given keywords against an inverted index of the base data and metadata. These systems cannot answer aggregation queries. To answer more complex questions involving subqueries, the system needs to apply some form of parsing to identify structural dependencies. The main advantage of this approach is its simplicity and adaptability. The keyword-based systems include SODA (Search Over DAta warehouse), NLP-reduce, Précis, QUICK (QUery Intent Constructor for Keywords), QUEST (QUEry generator for STructured sources), SINA and Aqqu.
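The lookup step described above can be sketched as follows. This is a minimal illustration under assumed data, not the implementation of any of the listed systems; the tables, columns, values and function names are all hypothetical.

```python
from collections import defaultdict

# Hypothetical base data, keyed by (table, column). The keys double as metadata.
base_data = {
    ("movie", "title"): ["Inception", "Brazil"],
    ("person", "name"): ["Terry Gilliam"],
}

def build_inverted_index(data):
    """Map each lowercase token to the places where it occurs."""
    index = defaultdict(set)
    for (table, column), values in data.items():
        # Metadata entries: the table and column names themselves.
        index[table.lower()].add(("metadata", table, None))
        index[column.lower()].add(("metadata", table, column))
        # Base data entries: every token of every stored value.
        for value in values:
            for token in value.lower().split():
                index[token].add(("data", table, column))
    return index

def lookup(index, question):
    """Lookup step: match each keyword of the question against the index."""
    hits = {}
    for keyword in question.lower().split():
        if keyword in index:
            # key=str avoids comparing None with str when sorting the tuples.
            hits[keyword] = sorted(index[keyword], key=str)
    return hits

index = build_inverted_index(base_data)
hits = lookup(index, "movie Gilliam")
```

Here `hits` tells the system that "movie" names a table and "gilliam" occurs in the `person.name` column. Nothing in the index expresses structure such as counting or nesting, which is consistent with the limitation on aggregations and subqueries noted above.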
Pattern-based systems: These systems extend the keyword-based systems with NLP technologies so that they can handle more than keywords, and they add natural language patterns. The patterns can be domain-independent or domain-dependent. The pattern-based systems include NLQ/A and QuestIO (QUESTion-based Interface to Ontologies).
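As a rough sketch of what a domain-independent pattern might look like, the example below maps phrase shapes to query operators with regular expressions. The patterns and operator names are assumptions chosen for illustration; the source does not specify the patterns used by NLQ/A or QuestIO.

```python
import re

# Hypothetical domain-independent patterns: phrase shape -> query operator.
PATTERNS = [
    (re.compile(r"\bhow many\b", re.IGNORECASE), "COUNT"),
    (re.compile(r"\b(?:not|without|except)\b", re.IGNORECASE), "NEGATION"),
    (re.compile(r"\b(?:more|greater|higher) than (\d+(?:\.\d+)?)", re.IGNORECASE),
     "GREATER_THAN"),
]

def match_patterns(question):
    """Return (operator, captured value or None) for every pattern that fires."""
    hits = []
    for regex, operator in PATTERNS:
        match = regex.search(question)
        if match:
            # Patterns without a capture group carry no value.
            value = match.group(1) if match.groups() else None
            hits.append((operator, value))
    return hits

found = match_patterns("How many movies have a rating higher than 8")
```

A domain-dependent pattern would look the same but mention vocabulary of one database, for example a phrase that only makes sense for movies.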
Parsing-based systems: These systems parse the input question and use the resulting structural information to understand its grammatical structure. The parse tree contains a lot of information about single tokens, and also about how tokens can be grouped together to form phrases. The main advantage of this approach is that the semantic meaning can be mapped to certain production rules (query generation). The parsing-based systems include ATHENA, Querix, FREyA (Feedback, Refinement and Extended vocabularY Aggregation), BELA, USI Answers, NaLIX (Natural Language Interface to XML), NaLIR (Natural Language Interface for Relational databases) and BioSmart.
Grammar-based systems: The core of these systems is a set of rules (a grammar) that defines the questions a user can ask the system. The main advantage of this approach is that the system can give users natural language suggestions while they type their questions. Every question that is formalized this way can be answered by the system. These systems are overall the most powerful ones, but they are highly dependent on their manually designed rules. The grammar-based systems include TR Discover, Ginseng (Guided Input Natural language Search ENGine), SQUALL (Semantic Query and Update High-Level Language), MEANS (MEdical question ANSwering), AskNow, SPARKLIS and GFMed.
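The idea of suggesting input from grammar rules can be sketched with a toy grammar. The grammar, its vocabulary and the function names below are hypothetical, and the sketch enumerates all sentences up front, which only works because this grammar is tiny and non-recursive.

```python
# Hypothetical toy grammar: each non-terminal maps to alternative token sequences.
GRAMMAR = {
    "QUESTION": [
        ["show", "ENTITY"],
        ["count", "ENTITY"],
        ["show", "ENTITY", "with", "PROPERTY"],
    ],
    "ENTITY": [["movies"], ["actors"]],
    "PROPERTY": [["rating"], ["year"]],
}

def expand(symbols):
    """Yield every terminal token sequence derivable from a list of symbols."""
    if not symbols:
        yield []
        return
    head, rest = symbols[0], symbols[1:]
    if head in GRAMMAR:
        # Non-terminal: try each alternative in place of the head symbol.
        for alternative in GRAMMAR[head]:
            yield from expand(alternative + rest)
    else:
        # Terminal: keep the token and expand the remainder.
        for tail in expand(rest):
            yield [head] + tail

SENTENCES = list(expand(["QUESTION"]))

def suggest(typed_tokens):
    """Return the next tokens that keep the input inside the grammar."""
    n = len(typed_tokens)
    return sorted({s[n] for s in SENTENCES if len(s) > n and s[:n] == typed_tokens})

def is_valid(tokens):
    """A question is answerable only if the grammar derives it exactly."""
    return tokens in SENTENCES
```

For example, `suggest(["show", "movies"])` offers only "with", so the user is steered toward questions the system can answer. The flip side is visible too: anything outside the hand-written rules, such as "list movies", is rejected by `is_valid`.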

Part three

A promising new avenue of research is to use deep learning techniques as the foundation for NLIDBs. The basic idea is to formulate the translation of natural language (NL) to SQL as an end-to-end machine translation problem. The approach is often called neural semantic parsing. In other words, translating from NL to SQL can be formulated as a supervised machine learning problem on pairs of natural language and SQL queries. In particular, machine translation can be modeled as a sequence-to-sequence problem in which the input sequence is represented by the words (tokens) of the NL question and the output sequence by the tokens of the SQL query. The main goal is, given an input sequence of tokens, to predict the output sequence based on patterns observed in the past.
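One common way to write this formulation down is the standard autoregressive factorization below. It is given here as an assumption about the usual model family, not as a formula taken from the survey. Let x = (x_1, ..., x_n) be the NL tokens, y = (y_1, ..., y_m) the SQL tokens, D the set of training pairs and θ the model parameters (all symbols introduced for illustration).

```latex
P(y \mid x; \theta) = \prod_{t=1}^{m} P(y_t \mid y_1, \dots, y_{t-1}, x; \theta),
\qquad
\hat{\theta} = \arg\max_{\theta} \sum_{(x, y) \in D} \log P(y \mid x; \theta)
```

The first expression predicts each SQL token from the question and the SQL tokens produced so far. The second fits the parameters to the observed (NL, SQL) pairs, which is where the need for training data comes from.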
The main advantage of machine learning-based approaches over traditional NLIDBs is that they support richer linguistic variability in query expressions, so users can formulate queries with greater flexibility. However, one of the major challenges of supervised machine learning approaches is that they require a large training data set to achieve good accuracy on the translation task.

Experimental evaluation

The paper analyzes how well the 24 recently developed systems can handle the ten sample questions. Complex questions (e.g., aggregations) cannot be phrased with keywords only. Therefore, the more complicated the users' questions are, the more likely users are to phrase them as grammatically correct sentences. In contrast, simple questions (e.g., string filters) can easily be asked with keywords, and users prefer to ask questions with keywords if possible. Parse trees are useful for identifying subqueries, but only in grammatically correct sentences (e.g., NaLIR).
The paper also poses the ten sample questions to three commercial systems: Google, Siri and the Internet Movie Database (IMDb). Based on the sample questions, Google is the best system in this category.

Conclusions

Lessons learned:
- Use distinct mechanisms for handling simple versus complex questions. Users pose questions differently depending on their complexity. Simple questions are often asked with keywords, while complex questions are posed as grammatically correct sentences.
- The identification of subqueries seems to be one of the most difficult problems for NLIs.
- When an ambiguity occurs, the system needs to clarify it with the user. This interaction should be designed so that the number of required user interactions is minimized.
- The biggest advantage of grammar-based NLIs is that they can use their grammar rules to guide users while they type their questions. This improves the interaction between system and users in two ways: first, the system will understand each question the users ask; second, the users will learn how certain questions have to be asked to receive a good result.
- A hybrid approach, in which traditional NLIs are enhanced by neural machine translation, might be a good direction for the future. Traditional approaches would guarantee better accuracy, while neural machine translation approaches would increase robustness to language variability.

posted @ 2021-12-28 15:01 bky-16
