Database basics: writing a SQL database from scratch in Go, part 1
- A translation of "Database basics: writing a SQL database from scratch in Go"
- Original article: http://notes.eatonphil.com/database-basics.html
- Original author: https://github.com/eatonphil
- Translation repo: https://github.com/suhanyujie/article-transfer-rs/
- Translator: suhanyujie
- Translator's blog: suhanyujie
- PS: My ability is limited, so corrections to this translation are welcome.
- Tags: database, golang, parsing
In this series we'll write a rudimentary database from scratch in Go. Project source code is available on Github.
In this first post we'll build enough of a parser to run some simple CREATE, INSERT, and SELECT queries. Then we'll build an in-memory backend supporting TEXT and INT types and write a basic REPL.
We'll be able to support the following interaction:
$ go run *.go
Welcome to gosql.
# CREATE TABLE users (id INT, name TEXT);
ok
# INSERT INTO users VALUES (1, 'Phil');
ok
# SELECT id, name FROM users;
| id | name |
====================
| 1 | Phil |
ok
# INSERT INTO users VALUES (2, 'Kate');
ok
# SELECT name, id FROM users;
| name | id |
====================
| Phil | 1 |
| Kate | 2 |
ok
The first stage will be to map a SQL source into a list of tokens (lexing). Then we'll call parse functions to find individual SQL statements (such as SELECT). These parse functions will in turn call their own helper functions to find patterns of recursively parseable chunks, keywords, symbols (like parentheses), identifiers (like a table name), and numeric or string literals.
Then, we'll write an in-memory backend to do operations based on an AST. Finally, we'll write a REPL to accept SQL from a CLI and pass it to the in-memory backend.
This post assumes a basic understanding of parsing concepts. We won't skip any code, but we also won't go into great detail on why we structure things the way we do.
For a simpler introduction to parsing and parsing concepts, see this post on parsing JSON.
Lexing
The lexer is responsible for finding every distinct group of characters in source code: tokens. This will consist primarily of identifiers, numbers, strings, and symbols.
The gist of the logic will be to iterate over the source string and collect characters until we find a delimiting character such as a space or comma. In this first pass, we'll pretend users don't insert delimiting characters into strings. Once we've reached a delimiting character, we'll "finalize" the token and decide whether it is valid or not.
First off, we'll define a few types and constants for use in lexer.go:
package main

import (
    "fmt"
    "io"
    "strings"
)

type location struct {
    line uint
    col  uint
}

type keyword string

const (
    selectKeyword keyword = "select"
    fromKeyword   keyword = "from"
    asKeyword     keyword = "as"
    tableKeyword  keyword = "table"
    createKeyword keyword = "create"
    insertKeyword keyword = "insert"
    intoKeyword   keyword = "into"
    valuesKeyword keyword = "values"
    intKeyword    keyword = "int"
    textKeyword   keyword = "text"
)

type symbol string

const (
    semicolonSymbol  symbol = ";"
    asteriskSymbol   symbol = "*"
    commaSymbol      symbol = ","
    leftparenSymbol  symbol = "("
    rightparenSymbol symbol = ")"
)

type tokenKind uint

const (
    keywordKind tokenKind = iota
    symbolKind
    identifierKind
    stringKind
    numericKind
)

type token struct {
    value string
    kind  tokenKind
    loc   location
}

func (t *token) equals(other *token) bool {
    return t.value == other.value && t.kind == other.kind
}
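// finalize is a stub for now; the full version, which dispatches to the
// per-kind helpers defined below, replaces it at the end of this section.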
func (t *token) finalize() bool {
    return true
}
Next we'll write out the main loop:
func lex(source io.Reader) ([]*token, error) {
    buf := make([]byte, 1)
    tokens := []*token{}
    current := token{}
    var line uint = 0
    var col uint = 0

    for {
        _, err := source.Read(buf)
        if err != nil && err != io.EOF {
            return nil, err
        }

        // Add semi-colon for EOF
        var c byte = ';'
        if err == nil {
            c = buf[0]
        }

        switch c {
        case '\n':
            line++
            col = 0
            continue
        case ' ':
            fallthrough
        case ',':
            fallthrough
        case '(':
            fallthrough
        case ')':
            fallthrough
        case ';':
            if !current.finalize() {
                return nil, fmt.Errorf("Unexpected token '%s' at %d:%d", current.value, current.loc.line, current.loc.col)
            }

            if current.value != "" {
                copy := current
                tokens = append(tokens, &copy)
            }

            if c == ';' || c == ',' || c == '(' || c == ')' {
                tokens = append(tokens, &token{
                    loc:   location{col: col, line: line},
                    value: string(c),
                    kind:  symbolKind,
                })
            }

            current = token{}
            current.loc.col = col
            current.loc.line = line
        default:
            current.value += string(c)
        }

        if err == io.EOF {
            break
        }
        col++
    }

    return tokens, nil
}
Last, we'll write a finalizer helper for each kind of fundamental token and check each in a reasonable order.
Validating numbers
Numbers are the most complex. So we'll refer to the PostgreSQL documentation (section 4.1.2.6) for what constitutes a valid number.
func (t *token) finalizeNumeric() bool {
    if len(t.value) == 0 {
        return false
    }

    periodFound := false
    expMarkerFound := false

    i := 0
    for i < len(t.value) {
        c := t.value[i]

        isDigit := c >= '0' && c <= '9'
        isPeriod := c == '.'
        isExpMarker := c == 'e'

        // Must start with a digit or period
        if i == 0 {
            if !isDigit && !isPeriod {
                return false
            }

            periodFound = isPeriod
            i++
            continue
        }

        if isPeriod {
            if periodFound {
                return false
            }

            periodFound = true
            i++
            continue
        }

        if isExpMarker {
            if expMarkerFound {
                return false
            }

            // No periods allowed after expMarker
            periodFound = true
            expMarkerFound = true

            // expMarker must be followed by digits
            if i == len(t.value)-1 {
                return false
            }

            cNext := t.value[i+1]
            if cNext == '-' || cNext == '+' {
                i++
            }

            i++
            continue
        }

        if !isDigit {
            return false
        }

        i++
    }

    t.kind = numericKind
    return true
}
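To make the accepted grammar concrete, here is a hand-written spot check. This is my addition, not from the original post; it lives in an example-test file in the same package, say lexer_example_test.go:

package main

import "fmt"

func ExampleToken_finalizeNumeric() {
    // Spot checks against the PostgreSQL numeric grammar referenced above.
    for _, s := range []string{"105", "105.5", ".5", "1e5", "1.5e-3", "1e", "1.2.3", "e5"} {
        t := token{value: s}
        fmt.Println(s, t.finalizeNumeric())
    }
    // Output:
    // 105 true
    // 105.5 true
    // .5 true
    // 1e5 true
    // 1.5e-3 true
    // 1e false
    // 1.2.3 false
    // e5 false
}

go test runs Example functions and checks what they print against the // Output: comment, so this doubles as a small regression test.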
Validating strings
Strings must start and end with a single apostrophe. But once we identify a string, we'll rewrite the value to drop the apostrophes for easier use by the rest of the project.
func (t *token) finalizeString() bool {
    if len(t.value) == 0 {
        return false
    }

    if t.value[0] == '\'' && t.value[len(t.value)-1] == '\'' {
        t.kind = stringKind
        t.value = t.value[1 : len(t.value)-1]
        return true
    }

    return false
}
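A similar check in the same test file (again my addition) shows the apostrophes being stripped in place:

func ExampleToken_finalizeString() {
    t := token{value: "'Phil'"}
    fmt.Println(t.finalizeString(), t.value)
    // Output: true Phil
}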
Validating symbols and keywords
Symbols and keywords come from a fixed set of strings, so they're easy to compare against.
func (t *token) finalizeSymbol() bool {
    switch t.value {
    case "*":
        break
    case ";":
        break
    case "(":
        break
    case ")":
        break
    default:
        return false
    }

    t.kind = symbolKind
    return true
}

func (t *token) finalizeKeyword() bool {
    switch strings.ToLower(t.value) {
    case "select":
        break
    case "from":
        break
    case "as":
        break
    case "table":
        break
    case "create":
        break
    case "insert":
        break
    case "into":
        break
    case "values":
        break
    case "int":
        break
    case "text":
        break
    default:
        return false
    }

    t.value = strings.ToLower(t.value)
    t.kind = keywordKind
    return true
}
Validating identifiers
Now we can finish up the original finalize function and assume any token not matching one of these is a valid identifier.
func (t *token) finalizeIdentifier() bool {
    t.kind = identifierKind
    return true
}

func (t *token) finalize() bool {
    if t.finalizeSymbol() {
        return true
    }

    if t.finalizeKeyword() {
        return true
    }

    if t.finalizeNumeric() {
        return true
    }

    if t.finalizeString() {
        return true
    }

    if t.finalizeIdentifier() {
        return true
    }

    return false
}
And that's it for the lexer! If you copy lexer_test.go from the main project, the tests should now pass.
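If you'd rather eyeball the lexer's output directly, a throwaway example in the same test file (my addition; it also needs "strings" imported) prints each token. Kinds print as their iota values, so 0 is a keyword, 1 a symbol, and 2 an identifier:

func ExampleLex() {
    tokens, err := lex(strings.NewReader("SELECT id FROM users"))
    if err != nil {
        panic(err)
    }

    for _, t := range tokens {
        fmt.Println(t.kind, t.value)
    }
    // Output:
    // 0 select
    // 2 id
    // 0 from
    // 2 users
    // 1 ;
}

Note the trailing semicolon token: lex synthesizes a semicolon at EOF so the final token always gets finalized.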
AST model
At the highest level, an AST is a collection of statements:
package main
type Ast struct {
    Statements []*Statement
}
A statement, for now, is one of INSERT, CREATE, or SELECT:
type AstKind uint

const (
    SelectKind AstKind = iota
    CreateTableKind
    InsertKind
)

type Statement struct {
    SelectStatement      *SelectStatement
    CreateTableStatement *CreateTableStatement
    InsertStatement      *InsertStatement
    Kind                 AstKind
}
INSERT
An insert statement, for now, has a table name and a list of values to insert:
type InsertStatement struct {
    table  token
    values *[]*expression
}
An expression is a literal token or (in the future) a function call or inline operation:
type expressionKind uint

const (
    literalKind expressionKind = iota
)

type expression struct {
    literal *token
    kind    expressionKind
}
CREATE
A create statement, for now, has a table name and a list of column names and types:
type columnDefinition struct {
    name     token
    datatype token
}

type CreateTableStatement struct {
    name token
    cols *[]*columnDefinition
}
SELECT
A select statement, for now, has a table name and a list of column names:
type SelectStatement struct {
    item []*expression
    from token
}
And that's it for the AST.
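To see how these pieces fit together, here is the AST we'd expect for a small query, written out by hand (my illustration, not from the original post; the field values mirror what the lexer produces):

// The AST for: SELECT id FROM users;
func exampleSelectAst() Ast {
    return Ast{
        Statements: []*Statement{
            {
                Kind: SelectKind,
                SelectStatement: &SelectStatement{
                    item: []*expression{
                        {
                            kind:    literalKind,
                            literal: &token{kind: identifierKind, value: "id"},
                        },
                    },
                    from: token{kind: identifierKind, value: "users"},
                },
            },
        },
    }
}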
Parsing
The Parse entrypoint will take a list of tokens and attempt to parse statements, separated by a semi-colon, until it reaches the last token.
In general our strategy will be to increment and pass around a cursor containing the current position of unparsed tokens. Each helper will return the new cursor that the caller should start from.
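Concretely, every parse helper below shares one shape, sketched here schematically (parseX and X are hypothetical placeholders, not real project code):

// func parseX(tokens []*token, initialCursor uint /* ... */) (*X, uint, bool)
//
// On success: the parsed result, the cursor just past the consumed tokens,
// and true. On failure: nil, the initial cursor unchanged, and false, so the
// caller can try an alternative.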
package main

import (
    "errors"
    "fmt"
    "io"
)

func tokenFromKeyword(k keyword) token {
    return token{
        kind:  keywordKind,
        value: string(k),
    }
}

func tokenFromSymbol(s symbol) token {
    return token{
        kind:  symbolKind,
        value: string(s),
    }
}

func expectToken(tokens []*token, cursor uint, t token) bool {
    if cursor >= uint(len(tokens)) {
        return false
    }

    return t.equals(tokens[cursor])
}

func helpMessage(tokens []*token, cursor uint, msg string) {
    var c *token
    if cursor < uint(len(tokens)) {
        c = tokens[cursor]
    } else {
        c = tokens[cursor-1]
    }

    fmt.Printf("[%d,%d]: %s, got: %s\n", c.loc.line, c.loc.col, msg, c.value)
}

func Parse(source io.Reader) (*Ast, error) {
    tokens, err := lex(source)
    if err != nil {
        return nil, err
    }

    a := Ast{}
    cursor := uint(0)
    for cursor < uint(len(tokens)) {
        stmt, newCursor, ok := parseStatement(tokens, cursor, tokenFromSymbol(semicolonSymbol))
        if !ok {
            helpMessage(tokens, cursor, "Expected statement")
            return nil, errors.New("Failed to parse, expected statement")
        }
        cursor = newCursor

        a.Statements = append(a.Statements, stmt)

        atLeastOneSemicolon := false
        for expectToken(tokens, cursor, tokenFromSymbol(semicolonSymbol)) {
            cursor++
            atLeastOneSemicolon = true
        }

        if !atLeastOneSemicolon {
            helpMessage(tokens, cursor, "Expected semi-colon delimiter between statements")
            return nil, errors.New("Missing semi-colon between statements")
        }
    }

    return &a, nil
}
Parsing statements
Each statement will be one of INSERT, CREATE, or SELECT. The parseStatement helper will call a helper for each of these statement types and return true if one of them succeeds in parsing.
func parseStatement(tokens []*token, initialCursor uint, delimiter token) (*Statement, uint, bool) {
    cursor := initialCursor

    // Look for a SELECT statement
    semicolonToken := tokenFromSymbol(semicolonSymbol)
    slct, newCursor, ok := parseSelectStatement(tokens, cursor, semicolonToken)
    if ok {
        return &Statement{
            Kind:            SelectKind,
            SelectStatement: slct,
        }, newCursor, true
    }

    // Look for an INSERT statement
    inst, newCursor, ok := parseInsertStatement(tokens, cursor, semicolonToken)
    if ok {
        return &Statement{
            Kind:            InsertKind,
            InsertStatement: inst,
        }, newCursor, true
    }

    // Look for a CREATE statement
    crtTbl, newCursor, ok := parseCreateTableStatement(tokens, cursor, semicolonToken)
    if ok {
        return &Statement{
            Kind:                 CreateTableKind,
            CreateTableStatement: crtTbl,
        }, newCursor, true
    }

    return nil, initialCursor, false
}
Parsing select statements
Parsing SELECT statements is easy. We'll look for the following token pattern:
1. `SELECT`
2. $expression [, ...]
3. `FROM`
4. $table-name
Sketching that out we get:
func parseSelectStatement(tokens []*token, initialCursor uint, delimiter token) (*SelectStatement, uint, bool) {
    cursor := initialCursor
    if !expectToken(tokens, cursor, tokenFromKeyword(selectKeyword)) {
        return nil, initialCursor, false
    }
    cursor++

    slct := SelectStatement{}

    exps, newCursor, ok := parseExpressions(tokens, cursor, []token{tokenFromKeyword(fromKeyword), delimiter})
    if !ok {
        return nil, initialCursor, false
    }

    slct.item = *exps
    cursor = newCursor

    if expectToken(tokens, cursor, tokenFromKeyword(fromKeyword)) {
        cursor++

        from, newCursor, ok := parseToken(tokens, cursor, identifierKind)
        if !ok {
            helpMessage(tokens, cursor, "Expected FROM token")
            return nil, initialCursor, false
        }

        slct.from = *from
        cursor = newCursor
    }

    return &slct, cursor, true
}
The parseToken helper will look for a token of a particular token kind.
func parseToken(tokens []*token, initialCursor uint, kind tokenKind) (*token, uint, bool) {
    cursor := initialCursor

    if cursor >= uint(len(tokens)) {
        return nil, initialCursor, false
    }

    current := tokens[cursor]
    if current.kind == kind {
        return current, cursor + 1, true
    }

    return nil, initialCursor, false
}
The parseExpressions helper will look for tokens separated by a comma until a delimiter is found. It will use existing helpers plus parseExpression.
func parseExpressions(tokens []*token, initialCursor uint, delimiters []token) (*[]*expression, uint, bool) {
    cursor := initialCursor

    exps := []*expression{}
outer:
    for {
        if cursor >= uint(len(tokens)) {
            return nil, initialCursor, false
        }

        // Look for delimiter
        current := tokens[cursor]
        for _, delimiter := range delimiters {
            if delimiter.equals(current) {
                break outer
            }
        }

        // Look for comma
        if len(exps) > 0 {
            if !expectToken(tokens, cursor, tokenFromSymbol(commaSymbol)) {
                helpMessage(tokens, cursor, "Expected comma")
                return nil, initialCursor, false
            }

            cursor++
        }

        // Look for expression
        exp, newCursor, ok := parseExpression(tokens, cursor, tokenFromSymbol(commaSymbol))
        if !ok {
            helpMessage(tokens, cursor, "Expected expression")
            return nil, initialCursor, false
        }
        cursor = newCursor

        exps = append(exps, exp)
    }

    return &exps, cursor, true
}
The parseExpression helper (for now) will look for a numeric, string, or identifier token.
func parseExpression(tokens []*token, initialCursor uint, _ token) (*expression, uint, bool) {
    cursor := initialCursor

    kinds := []tokenKind{identifierKind, numericKind, stringKind}
    for _, kind := range kinds {
        t, newCursor, ok := parseToken(tokens, cursor, kind)
        if ok {
            return &expression{
                literal: t,
                kind:    literalKind,
            }, newCursor, true
        }
    }

    return nil, initialCursor, false
}
And that's it for parsing a SELECT statement!
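As a quick end-to-end check, we can run a SELECT through Parse. This example is my addition too; it goes in the same test file, and it assumes parseInsertStatement and parseCreateTableStatement (which parseStatement references and the next sections define) are in place so the package compiles:

func ExampleParse() {
    ast, err := Parse(strings.NewReader("SELECT id, name FROM users;"))
    if err != nil {
        panic(err)
    }

    stmt := ast.Statements[0]
    fmt.Println(stmt.Kind == SelectKind, stmt.SelectStatement.from.value)
    // Output: true users
}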
--- To be continued