【SIGMOD】Paper Reading: Machine Learning for Databases

Deep learning in databases. Research problems at the intersection of deep learning and databases are introduced in [54]; examples include query optimization and natural language query interfaces [54]. A feed-forward neural network (with one hidden layer) for cardinality estimation of simple range queries (without joins) is proposed in [40] and evaluated on a synthetic dataset. Recently, [61] developed a natural language interface for database systems using deep neural networks. In [22], an LSTM autoencoder and a paragraph2vec model were applied to query workload summarization and error prediction, with experiments on a private Snowflake query workload and on TPC-H [53]. Compared to the datasets in [22, 61], SDSS and SQLShare are publicly available and real-world.
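To make the cardinality-estimation setup in [40] more concrete, here is a minimal sketch (not the authors' implementation): a single-hidden-layer feed-forward network trained to map range-query bounds to selectivity over a synthetic table. The column count, network width, and training details are assumptions for illustration only.

```python
# Sketch: single-hidden-layer network for range-query cardinality estimation,
# in the spirit of [40]. Synthetic data; sizes and hyperparameters are assumptions.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
n_rows, n_cols = 100_000, 3
table = rng.random((n_rows, n_cols))          # synthetic table, values in [0, 1]

def make_query():
    """Random conjunctive range query: lo_i <= col_i <= hi_i for each column."""
    lo = rng.random(n_cols) * 0.8
    hi = lo + rng.random(n_cols) * (1.0 - lo)
    sel = np.mean(np.all((table >= lo) & (table <= hi), axis=1))  # true selectivity
    return np.concatenate([lo, hi]), sel

X, y = map(np.array, zip(*[make_query() for _ in range(5_000)]))

model = MLPRegressor(hidden_layer_sizes=(64,), max_iter=1000, random_state=0)
model.fit(X[:4_000], y[:4_000])

est = np.clip(model.predict(X[4_000:]), 0.0, 1.0) * n_rows   # estimated cardinalities
true = y[4_000:] * n_rows
print("mean absolute error (rows):", np.abs(est - true).mean())
```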

Modeling SQL query performance. Estimates of SQL query properties and performance are used in admission control, scheduling, and costing during query optimization. Commonly, these estimates come from manually constructed cost models in the query optimizer; however, such cost models can be imprecise and require access to the database instance. Prior work has used machine learning to estimate SQL query properties accurately [2, 14, 37, 39, 40]. Most of this work uses relatively small synthetic workloads, such as TPC-H and TPC-DS, together with traditional two-stage machine learning models, and obtains better results when query execution plans are given as input. Similar to us, the database-agnostic approach in [24] automatically learns features from large query workloads rather than devising task-specific heuristics and feature engineering for pre-determined conditions; however, it focuses on index selection and security audits. Devising robust prediction models that generalize well to unseen queries and changes in workloads is studied in [39]. That approach is based on operator-level feature engineering over query execution plans, targets CPU time and logical I/O for a query execution plan, and is evaluated on small-scale query workloads. We extend [39] by considering large-scale query workloads and by using data-driven machine learning models that learn features and their compositions.
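As a rough illustration of operator-level plan feature engineering in the style of [39] (a sketch under assumed features and model choice, not their actual pipeline), a query execution plan can be encoded as per-operator counts plus estimated row totals, and a standard regressor fit to observed CPU time:

```python
# Sketch: predicting query CPU time from operator-level plan features, loosely
# following the feature-engineering style of [39]. The operator set, feature
# encoding, model choice, and example data are assumptions for illustration.
from sklearn.ensemble import GradientBoostingRegressor

OPERATORS = ["SeqScan", "IndexScan", "HashJoin", "NestedLoop", "Sort", "Aggregate"]

def plan_features(plan):
    """plan: list of (operator_name, estimated_rows) nodes from an execution plan."""
    counts = {op: 0 for op in OPERATORS}
    rows = {op: 0.0 for op in OPERATORS}
    for op, est_rows in plan:
        counts[op] += 1
        rows[op] += est_rows
    return [counts[op] for op in OPERATORS] + [rows[op] for op in OPERATORS]

# Hypothetical training data: (plan, observed CPU time in ms) pairs.
training_plans = [
    ([("SeqScan", 1e6), ("HashJoin", 5e5), ("Aggregate", 10)], 820.0),
    ([("IndexScan", 2e3), ("NestedLoop", 2e3)], 15.0),
    ([("SeqScan", 5e5), ("Sort", 5e5)], 430.0),
]
X = [plan_features(p) for p, _ in training_plans]
y = [t for _, t in training_plans]

model = GradientBoostingRegressor().fit(X, y)
print(model.predict([plan_features([("SeqScan", 8e5), ("Sort", 8e5)])]))
```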

Facilitating SQL query composition. Earlier methods provided forms for querying databases [25], but forms are restrictive. Keyword queries are an alternative [6, 59], but it is difficult to identify user intent from a flat list of keywords; both [6, 59] address this by considering contextual dependencies between keywords and the database structure. Natural language interfaces, such as NaLIR [38], allow complex query intents to be expressed. The system first communicates its interpretation of the query to the user via a query tree structure; the user can then verify it or select among the likely interpretations, and the system translates the verified or corrected query tree into the corresponding SQL statement. Query recommendation by mining query logs [7, 12, 31] is another approach. QuerIE [7, 12] assumes access to database tuples and a SQL query log, and recommends queries by identifying data tuples related to the user's interests (tuples from past queries). Given the schema, tuples, and some keywords, the approach in [13] suggests SQL queries from templates; the evaluation is a user study with 10 experts. Additional query results are recommended for each query in [49]. However, apart from [31], these works require access to tuples. Other work assumes the user is familiar with samples from the query answer. AIDE [10] helps the user refine their query and iteratively guides them toward interesting data areas; it is limited to linear queries and predicts queries using decision tree classifiers. Finding minimal project-join queries based on a sample table of tuples contained in the query answer is studied in [47]. [?] rewrites queries into alternate forms with respect to their answer tuples. These works are complementary to ours.
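To illustrate the kind of exploration step AIDE [10] performs, here is a simplified sketch (not the actual system): fit a decision tree over tuples the user labeled as relevant or irrelevant, then read the learned splits back as range predicates for the next exploratory query. The sample data, labels, and attribute names are invented for illustration.

```python
# Sketch: decision-tree-based data exploration in the spirit of AIDE [10].
# The data, user-feedback labels, and predicate extraction are simplified assumptions.
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(1)
samples = rng.random((200, 2)) * 100          # sampled tuples over two numeric attributes
labels = ((samples[:, 0] > 30) & (samples[:, 0] < 60) &
          (samples[:, 1] > 10) & (samples[:, 1] < 40)).astype(int)  # user feedback

tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(samples, labels)

# The learned splits approximate the user's region of interest; in AIDE such
# regions become (linear) range predicates for the next exploratory query.
print(export_text(tree, feature_names=["attr_a", "attr_b"]))
```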

Mining SQL query workloads. Several usability works use the TPC-H benchmark dataset [53]. TPC-H has 8 tables and specifies 22 ad-hoc queries along with data content modifications; a synthetic workload can be simulated from the ad-hoc queries. WikiSQL [61] is a recent public query workload that contains natural language descriptions for SQL queries over small datasets collected from Wikipedia, but it does not contain the meta-information we require. We use two publicly available, real-world query workloads, SDSS and SQLShare [23, 45, 46, 51]. Query workloads are also used for tasks such as index selection [22], improving query optimization [39], and workload compression [8]. The motivation for workload compression is that large-scale SQL query workloads can create practical problems for tasks like index selection [8]. While data-driven machine learning models rely on data abundance to train models with many parameters, data redundancy and size can pose computational challenges; workload compression techniques can therefore provide an orthogonal extension to the data-extraction part of our work. SDSS has been used to identify user interests and access areas within the data space [41]. Ettu [35] is a system that identifies insider attacks by clustering the SQL queries in a query workload. We focus on different problems.
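As a minimal illustration of clustering the queries in a workload (in the spirit of Ettu [35], which actually uses parse-tree-based similarity rather than this bag-of-tokens encoding), one could vectorize the raw SQL text and apply k-means. The example statements below are hypothetical SDSS-style queries.

```python
# Sketch: grouping similar SQL statements in a workload by clustering
# bag-of-tokens representations. Ettu [35] relies on query parse-tree similarity;
# the TF-IDF + k-means pipeline here is a simplified stand-in.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

workload = [
    "SELECT objID, ra, dec FROM PhotoObj WHERE ra BETWEEN 10 AND 20",
    "SELECT objID, ra, dec FROM PhotoObj WHERE dec BETWEEN -5 AND 5",
    "SELECT COUNT(*) FROM SpecObj WHERE z > 0.1",
    "SELECT COUNT(*) FROM SpecObj WHERE z > 0.3",
]

vectors = TfidfVectorizer(token_pattern=r"[A-Za-z_]+").fit_transform(workload)
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)
print(clusters)   # queries with the same table and shape tend to share a cluster
```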

posted @ 2021-03-11 23:35  Ryan0v0