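Elasticsearch Basics: Study Notes for Beginners
Preface
These notes record what I learned while studying Elasticsearch from scratch, following the official documentation and running my own tests along the way.
Installation
For a beginner, following the official documentation is enough. My earlier post on getting started with ELK mainly covered the installation process, so I won't repeat it here.
Tutorial
I searched for a long time, and most of the documentation is fairly old. Even the official guide is written against 2.x, while the latest release on the official site has already reached 6. It is still fine for learning the basics, so the rest of these notes follow the official guide.
After installing Elasticsearch and Kibana, open localhost:5601 and select Dev Tools.
Indexing (storing) employee documents
The test data is a list of company employee records. Each employee's record is called a document, and adding a record is called indexing a document.
Enter the following in the console: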
PUT /megacorp/employee/1
{
"first_name" : "John",
"last_name" : "Smith",
"age" : 25,
"about" : "I love to go rock climbing",
"interests": [ "sports", "music" ]
}
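- megacorp is the index name
- employee is the type name
- 1 is the id, which is also the employee's id
Place the cursor on the first line and click the green button to run the request.
This console syntax is a shorthand; underneath, the request still goes through the REST API that ES provides. The same request can be sent with Postman or curl, using the ES address as the host, i.e. localhost:9200. Postman does not allow a body on GET requests, so for requests that carry a body (such as the _search requests later in these notes) you can use POST instead.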
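To make the "it is just a REST call" point concrete, here is a minimal Python sketch that builds the same PUT request using only the standard library. The helper name build_index_request and the base_url default are my own assumptions for illustration; the request is only constructed, not sent, so the snippet runs without an ES node.

```python
import json
import urllib.request


def build_index_request(index, doc_type, doc_id, document,
                        base_url="http://localhost:9200"):
    """Build (but do not send) the PUT request that indexes one document."""
    url = f"{base_url}/{index}/{doc_type}/{doc_id}"
    body = json.dumps(document).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


john = {
    "first_name": "John",
    "last_name": "Smith",
    "age": 25,
    "about": "I love to go rock climbing",
    "interests": ["sports", "music"],
}
request = build_index_request("megacorp", "employee", 1, john)
print(request.get_method(), request.full_url)

# To actually send it to a running node, something like this should work:
# with urllib.request.urlopen(request) as response:
#     print(response.read().decode("utf-8"))
```

Keeping the construction separate from the sending makes it easy to inspect the URL, method, and body before anything touches the cluster.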
That stores one document in ES. Next, store a few more:
PUT /megacorp/employee/2
{
"first_name" : "Jane",
"last_name" : "Smith",
"age" : 32,
"about" : "I like to collect rock albums",
"interests": [ "music" ]
}
PUT /megacorp/employee/3
{
"first_name" : "Douglas",
"last_name" : "Fir",
"age" : 35,
"about": "I like to build cabinets",
"interests": [ "forestry" ]
}
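Then we can take a look: click Management, Index Patterns, Configure an index pattern, enter megacorp, and confirm.
Click Discover to see the records we stored.
Retrieving a document
Once the data is stored, we want to read it back. Query the employee whose id is 1: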
GET /megacorp/employee/1
Response:
{
"_index": "megacorp",
"_type": "employee",
"_id": "1",
"_version": 5,
"found": true,
"_source": {
"first_name": "John",
"last_name": "Smith",
"age": 25,
"about": "I love to go rock climbing",
"interests": [
"sports",
"music"
]
}
}
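Compared with saving a record, only the HTTP method differs:
- PUT adds a document
- GET retrieves it
- DELETE deletes it
- HEAD checks whether it exists
- to update a document, PUT it again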
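The list above can be sketched as a small lookup table in Python. The operation names and the document_request helper are hypothetical names chosen for this example; the requests are only built, never sent.

```python
import urllib.request

# Operation name -> HTTP method, following the list above.
METHODS = {
    "add": "PUT",
    "get": "GET",
    "delete": "DELETE",
    "exists": "HEAD",
    "update": "PUT",  # updating is just another PUT to the same URL
}


def document_request(operation, index, doc_type, doc_id,
                     base_url="http://localhost:9200"):
    """Build (but do not send) a request for one document operation.

    A real add or update would also carry the JSON document as the body.
    """
    url = f"{base_url}/{index}/{doc_type}/{doc_id}"
    return urllib.request.Request(url, method=METHODS[operation])


for operation in METHODS:
    req = document_request(operation, "megacorp", "employee", 1)
    print(f"{operation:7s} {req.get_method():6s} {req.full_url}")
```

Every operation targets the same URL; the method alone decides what happens to the document.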
Lightweight search
Besides findById, the most common operation is a conditional query.
First, list everything:
GET /megacorp/employee/_search
By the way, you can also get the number of records with _count:
GET /megacorp/employee/_count
To find the employees whose last_name is Smith:
GET /megacorp/employee/_search?q=last_name:Smith
Add a q parameter and write the condition in the form field:value.
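When the same request is issued from code rather than from the Kibana console, the q parameter has to be URL-encoded. A minimal sketch, assuming a node at localhost:9200 and a helper name (lite_search_url) invented for this example:

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def lite_search_url(index, doc_type, field, value,
                    base_url="http://localhost:9200"):
    """Build the URL for a query-string search of the form field:value."""
    # urlencode percent-encodes the colon and any spaces in the value.
    query = urlencode({"q": f"{field}:{value}"})
    return f"{base_url}/{index}/{doc_type}/_search?{query}"


url = lite_search_url("megacorp", "employee", "last_name", "Smith")
print(url)

# Decoding the query string gives back the original field:value pair.
print(parse_qs(urlsplit(url).query))
```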
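Query DSL
Query-string search is handy for ad hoc searches from the command line, but it has its limitations (see Lightweight search). Elasticsearch provides a rich, flexible query language called the query DSL, which supports building more complex and robust queries.
The domain-specific language (DSL) is expressed as a JSON request body. We can rewrite the earlier search for all Smiths like this: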
GET /megacorp/employee/_search
{
"query" : {
"match" : {
"last_name" : "Smith"
}
}
}
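The rewrite from the q parameter to a match body is mechanical for the simplest case, which the following Python sketch illustrates. The function names are hypothetical, and it handles only a single field:value pair; the real query-string syntax is much richer than this.

```python
import json


def match_query(field, value):
    """Return a request body containing a single match query."""
    return {"query": {"match": {field: value}}}


def from_lite_query(q):
    """Convert a simple 'field:value' string into the equivalent match body."""
    field, _, value = q.partition(":")
    if not field or not value:
        raise ValueError(f"expected 'field:value', got {q!r}")
    return match_query(field, value)


body = from_lite_query("last_name:Smith")
print(json.dumps(body, indent=2))
```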
More complex queries
Let's extend the query from the previous step:
GET /megacorp/employee/_search
{
"query" : {
"bool": {
"must": {
"match" : {
"last_name" : "smith"
}
},
"filter": {
"range" : {
"age" : { "gt" : 30 }
}
}
}
}
}
This adds a range filter that requires age to be greater than 30.
Result:
{
"took": 4,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 0.2876821,
"hits": [
{
"_index": "megacorp",
"_type": "employee",
"_id": "2",
"_score": 0.2876821,
"_source": {
"first_name": "Jane",
"last_name": "Smith",
"age": 32,
"about": "I like to collect rock albums",
"interests": [
"music"
]
}
}
]
}
}
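To see why only Jane is returned, the must and filter clauses can be modelled in memory. This is a simplified Python sketch over the three sample employees, not how ES evaluates the query internally: it lowercases and splits on whitespace as a stand-in for text analysis, and it ignores scoring.

```python
EMPLOYEES = {
    "1": {"first_name": "John", "last_name": "Smith", "age": 25},
    "2": {"first_name": "Jane", "last_name": "Smith", "age": 32},
    "3": {"first_name": "Douglas", "last_name": "Fir", "age": 35},
}


def bool_search(docs, must_field, must_value, range_field, gt):
    """Return ids of documents that satisfy both the must and the filter clause."""
    hits = []
    for doc_id, doc in docs.items():
        # must: the lowercased term has to occur in the lowercased field
        terms = doc[must_field].lower().split()
        matches = must_value.lower() in terms
        # filter: a plain yes/no check on the numeric field
        passes = doc[range_field] > gt
        if matches and passes:
            hits.append(doc_id)
    return hits


hits = bool_search(EMPLOYEES, "last_name", "smith", "age", 30)
print(hits)  # ['2']
```

John matches the name but fails the age filter, and Douglas passes the age filter but fails the name match, which leaves only document 2.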
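Full-text search
The searches so far have been simple: a single name, filtered by age. Now let's try something a little more advanced, a full-text search, which is a task traditional databases really struggle with.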
GET /megacorp/employee/_search
{
"query" : {
"match" : {
"about" : "rock climbing"
}
}
}
Result:
{
"took": 32,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 2,
"max_score": 0.53484553,
"hits": [
{
"_index": "megacorp",
"_type": "employee",
"_id": "1",
"_score": 0.53484553,
"_source": {
"first_name": "John",
"last_name": "Smith",
"age": 25,
"about": "I love to go rock climbing",
"interests": [
"sports",
"music"
]
}
},
{
"_index": "megacorp",
"_type": "employee",
"_id": "2",
"_score": 0.26742277,
"_source": {
"first_name": "Jane",
"last_name": "Smith",
"age": 32,
"about": "I like to collect rock albums",
"interests": [
"music"
]
}
}
]
}
}
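The results are sorted by the relevance score _score. Note that a document matching only one of the words (rock) is returned as well, with a lower score. If we want to match the exact phrase, we need a different kind of query.
match_phrase matches the whole phrase.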
GET /megacorp/employee/_search
{
"query" : {
"match_phrase" : {
"about" : "rock climbing"
}
}
}
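The difference between the two queries can be illustrated with a toy Python model. This is only a sketch under simplifying assumptions (lowercasing and whitespace splitting instead of a real analyzer, no scoring), not the way ES implements them: match accepts a document when any query term occurs, while match_phrase requires the terms to appear consecutively and in order.

```python
ABOUT = {
    "1": "I love to go rock climbing",
    "2": "I like to collect rock albums",
    "3": "I like to build cabinets",
}


def tokenize(text):
    """Very rough stand-in for an analyzer: lowercase and split on whitespace."""
    return text.lower().split()


def match(docs, query):
    """Ids of documents containing at least one of the query terms."""
    wanted = set(tokenize(query))
    return [i for i, text in docs.items() if wanted & set(tokenize(text))]


def match_phrase(docs, query):
    """Ids of documents containing the query terms consecutively, in order."""
    phrase = tokenize(query)
    n = len(phrase)
    hits = []
    for i, text in docs.items():
        tokens = tokenize(text)
        if any(tokens[k:k + n] == phrase for k in range(len(tokens) - n + 1)):
            hits.append(i)
    return hits


print(match(ABOUT, "rock climbing"))         # ['1', '2']
print(match_phrase(ABOUT, "rock climbing"))  # ['1']
```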
When we search on Baidu, the matched keywords are highlighted in the results. ES can return highlighted fragments too.
GET /megacorp/employee/_search
{
"query" : {
"match_phrase" : {
"about" : "rock climbing"
}
},
"highlight": {
"fields" : {
"about" : {}
}
}
}
Response:
{
"took": 1,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 0.5753642,
"hits": [
{
"_index": "megacorp",
"_type": "employee",
"_id": "1",
"_score": 0.5753642,
"_source": {
"first_name": "John",
"last_name": "Smith",
"age": 25,
"about": "I love to go rock climbing",
"interests": [
"sports",
"music"
]
},
"highlight": {
"about": [
"I love to go <em>rock</em> <em>climbing</em>"
]
}
}
]
}
}
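The highlight fragment in the response above wraps each matched term in em tags. A rough Python sketch of that idea follows; the real highlighter works on analyzed tokens and offsets, whereas this version simply splits on spaces, and the function name and the pre/post parameters are assumptions made for the example.

```python
def highlight(text, query, pre="<em>", post="</em>"):
    """Wrap every word of text that occurs in the query with pre/post tags."""
    wanted = set(query.lower().split())
    words = [
        f"{pre}{word}{post}" if word.lower() in wanted else word
        for word in text.split(" ")
    ]
    return " ".join(words)


fragment = highlight("I love to go rock climbing", "rock climbing")
print(fragment)  # I love to go <em>rock</em> <em>climbing</em>
```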
Aggregations (GROUP BY)
In SQL we often compute statistics such as sum, count, and avg. ES can do this as follows:
GET /megacorp/employee/_search
{
"aggs": {
"all_interests": {
"terms": { "field": "interests" }
}
}
}
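aggs declares an aggregation, all_interests is the name under which the result is returned, and terms groups the documents by each distinct value of the field and counts them. In other words, this request counts the documents for each value of interests. ES may then return: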
{
"error": {
"root_cause": [
{
"type": "illegal_argument_exception",
"reason": "Fielddata is disabled on text fields by default. Set fielddata=true on [interests] in order to load fielddata in memory by uninverting the inverted index. Note that this can however use significant memory. Alternatively use a keyword field instead."
}
],
"type": "search_phase_execution_exception",
"reason": "all shards failed",
"phase": "query",
"grouped": true,
"failed_shards": [
{
"shard": 0,
"index": "megacorp",
"node": "iqHCjOUkSsWM2Hv6jT-xUQ",
"reason": {
"type": "illegal_argument_exception",
"reason": "Fielddata is disabled on text fields by default. Set fielddata=true on [interests] in order to load fielddata in memory by uninverting the inverted index. Note that this can however use significant memory. Alternatively use a keyword field instead."
}
}
],
"caused_by": {
"type": "illegal_argument_exception",
"reason": "Fielddata is disabled on text fields by default. Set fielddata=true on [interests] in order to load fielddata in memory by uninverting the inverted index. Note that this can however use significant memory. Alternatively use a keyword field instead.",
"caused_by": {
"type": "illegal_argument_exception",
"reason": "Fielddata is disabled on text fields by default. Set fielddata=true on [interests] in order to load fielddata in memory by uninverting the inverted index. Note that this can however use significant memory. Alternatively use a keyword field instead."
}
}
},
"status": 400
}
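This means that aggregating on a text field requires enabling the setting fielddata=true (the error message also mentions using a keyword field as an alternative).
Enabling it requires changing the index mapping, which is comparable to altering a table's schema in a relational database.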
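The alternative named in the error message is to aggregate on a keyword field. Assuming the field has a keyword sub-field called interests.keyword (the mapping output in the next section shows one), the request body only needs a different field name. The Python sketch below just builds and prints that body; the helper name terms_aggregation is made up for the example.

```python
import json


def terms_aggregation(name, field):
    """Build a search body that returns only a terms aggregation on field."""
    return {
        "size": 0,  # no hits, only the aggregation result
        "aggs": {name: {"terms": {"field": field}}},
    }


# Assumption: the keyword sub-field is named "interests.keyword".
body = terms_aggregation("all_interests", "interests.keyword")
print("GET /megacorp/employee/_search")
print(json.dumps(body, indent=2))
```

A keyword field stores each value as a single unanalyzed term, so this route avoids loading fielddata into memory.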
Modifying the index mapping
First, let's look at the current mapping:
GET /megacorp/employee/_mapping
Result:
{
"megacorp": {
"mappings": {
"employee": {
"properties": {
"about": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"age": {
"type": "long"
},
"first_name": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"interests": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"last_name": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
}
}
}
}
}
}
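It is easy to see that the mapping defines the type of each field. To fix the earlier problem, one setting has to be added to the interests field:
"fielddata": true
The update looks like this: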
PUT /megacorp/employee/_mapping
{
"properties": {
"about": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"age": {
"type": "long"
},
"first_name": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"interests": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
},
"fielddata": true
},
"last_name": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
}
}
}
Response:
{
"acknowledged": true
}
This means the update succeeded. Now we can get back to the aggregation from before.
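Copying the whole mapping by hand just to add one flag is error-prone. As a sketch, the update body could be derived from the properties returned by GET _mapping. The helper with_fielddata is a hypothetical name, and the sample properties are abbreviated to two fields to keep the example short.

```python
import copy
import json


def with_fielddata(properties, field):
    """Return a mapping update body with fielddata enabled on one text field."""
    updated = copy.deepcopy(properties)  # leave the caller's dict untouched
    if updated[field].get("type") != "text":
        raise ValueError(f"{field!r} is not a text field")
    updated[field]["fielddata"] = True
    return {"properties": updated}


# Abbreviated copy of the properties shown in the mapping output above.
current = {
    "age": {"type": "long"},
    "interests": {
        "type": "text",
        "fields": {"keyword": {"type": "keyword", "ignore_above": 256}},
    },
}
body = with_fielddata(current, "interests")
print("PUT /megacorp/employee/_mapping")
print(json.dumps(body, indent=2))
```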
Aggregations: GROUP BY with count
The SQL equivalent is roughly:
select interests, count(*) from index_xxx
where last_name = 'smith'
group by interests;
In ES the query can be written like this:
GET /megacorp/employee/_search
{
"_source": false,
"query": {
"match": {
"last_name": "smith"
}
},
"size": 0,
"aggs": {
"all_interests": {
"terms": {
"field": "interests"
}
}
}
}
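"_source": false suppresses the fields of each matching hit; the default is true.
"size": 0 means that no hits are returned. By default a search returns the top 10 hits; we don't need them here, only the aggregation results.
aggs declares an aggregation.
all_interests is a name of our own choosing; any name will do.
terms groups the documents by each distinct value of the field, here interests, and counts the documents in each group.
Response: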
{
"took": 0,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 2,
"max_score": 0,
"hits": []
},
"aggregations": {
"all_interests": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "music",
"doc_count": 2
},
{
"key": "sports",
"doc_count": 1
}
]
}
}
}
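The buckets in the response above can be reproduced with a few lines of Python, which may help to see what the terms aggregation is counting. This is an in-memory sketch over the three sample employees, not ES's implementation; the case-insensitive comparison stands in for the match query on last_name.

```python
from collections import Counter

EMPLOYEES = [
    {"last_name": "Smith", "interests": ["sports", "music"]},
    {"last_name": "Smith", "interests": ["music"]},
    {"last_name": "Fir", "interests": ["forestry"]},
]


def terms_buckets(docs, last_name, field):
    """Count documents per distinct value of field among the matching docs."""
    counts = Counter()
    for doc in docs:
        if doc["last_name"].lower() == last_name.lower():
            # set() so that a document counts once per value
            counts.update(set(doc[field]))
    # Highest count first; ties broken alphabetically for a stable order.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [{"key": key, "doc_count": count} for key, count in ordered]


buckets = terms_buckets(EMPLOYEES, "smith", "interests")
print(buckets)
```

The forestry value does not appear because Douglas Fir is excluded by the query before the aggregation runs.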
This gives the count for each value of the field. sum and avg can be computed in the same way:
GET /megacorp/employee/_search
{
"_source": false,
"size": 0,
"aggs" : {
"avg_age" : {
"avg" : { "field" : "age" }
},
"sum_age" : {
"sum" : { "field" : "age" }
}
}
}
Response:
{
"took": 1,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 3,
"max_score": 0,
"hits": []
},
"aggregations": {
"avg_age": {
"value": 30.666666666666668
},
"sum_age": {
"value": 92
}
}
}
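As a sanity check on the numbers above, the same two metrics can be computed directly from the ages of the three sample employees. The result layout mirrors the aggregations part of the response; the function name metrics is an assumption for this sketch.

```python
AGES = {"1": 25, "2": 32, "3": 35}


def metrics(values):
    """Compute sum and average in the same shape as the aggregation result."""
    values = list(values)
    total = sum(values)
    return {
        "sum_age": {"value": total},
        "avg_age": {"value": total / len(values)},
    }


result = metrics(AGES.values())
print(result["sum_age"]["value"])  # 92
print(result["avg_age"]["value"])
```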
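Summary
The above covers the first part of the official guide, Getting Started. I only transcribed the examples and ran them once; there is nothing groundbreaking here, but I added my own understanding along the way. It is enough to know the basics of using ES. The more detailed syntax will come later, step by step.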