Elasticsearch Notes
A quick comparison of Elasticsearch and relational databases
RDBMS | Elasticsearch |
---|---|
Table | Index(Type) |
Row | Document |
Column | Field |
Schema | Mapping |
SQL | DSL |
## Index information
GET kibana_sample_data_ecommerce
## Total document count
GET kibana_sample_data_ecommerce/_count
## _cat indices API
## Wildcard match on index names
GET /_cat/indices/kibana_*
## Sort by document count
GET /_cat/indices?v&s=docs.count:desc
## View basic information about a single index
GET /_cat/indices/kibana_sample_data_ecommerce?v
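The cluster name defaults to elasticsearch.
Shards come in two kinds: Primary Shard and Replica Shard.
The number of primary shards is set when the index is created and cannot be changed afterwards without a Reindex.
The number of replica shards can be adjusted dynamically.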
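The fixed primary shard count follows from how a document is routed: the target shard is roughly `hash(routing) % number_of_primary_shards`, with the routing value defaulting to the document id. The Python sketch below illustrates the idea only. `pick_shard` is a hypothetical helper, and `zlib.crc32` is a stand-in for the hash Elasticsearch actually uses (Murmur3).

```python
import zlib


def pick_shard(doc_id: str, num_primary_shards: int) -> int:
    """Map a document id to a primary shard number.

    zlib.crc32 is only a stand-in for the hash Elasticsearch uses internally.
    """
    return zlib.crc32(doc_id.encode("utf-8")) % num_primary_shards


ids = [str(i) for i in range(1, 11)]
with_3 = {i: pick_shard(i, 3) for i in ids}
with_4 = {i: pick_shard(i, 4) for i in ids}

# Ids whose shard changes when the shard count changes could no longer be
# found where they were written, which is why a Reindex is required.
moved = [i for i in ids if with_3[i] != with_4[i]]
print("placement with 3 shards:", with_3)
print("ids that would land on a different shard with 4 shards:", moved)
```

Replicas are copies of a primary, so changing their number does not affect routing.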
## Cluster health
GET _cluster/health
GET _cat/nodes?v
GET _cat/shards?v
index shard prirep state docs store ip node
.apm-agent-configuration 0 p STARTED 0 208b 172.18.0.2 12b52a46e43f
.kibana_1 0 p STARTED 94 967.7kb 172.18.0.2 12b52a46e43f
kibana_sample_data_ecommerce 0 p STARTED 4675 4.5mb 172.18.0.2 12b52a46e43f
.apm-custom-link 0 p STARTED 0 208b 172.18.0.2 12b52a46e43f
.kibana_task_manager_1 0 p STARTED 5 55.2kb 172.18.0.2 12b52a46e43f
Basic CRUD
## Auto-generated id
POST my_index/_doc/
{
"user":"xiaoting",
"comment":"you know for search"
}
## User-specified id; repeated PUTs to the same id increment _version
PUT my_index/_doc/2
{
"user":"xiaoting",
"comment":"you know for search"
}
## Read
GET my_index/_doc/2
## Search
GET my_index/_search
{
"query":{
"match_all":{}
}
}
## Add a field to an existing document. A PUT replaces the whole document, so any field left out of the PUT body is lost
POST my_index/_update/2
{
"doc":{
"post_date":"2020-05-21"
}
}
## Delete
DELETE my_index/_doc/2
## Multi-get
GET _mget
{
"docs": [
{
"_index": "my_index",
"_id": 1
},
{
"_index": "my_index",
"_id": 2
}
]
}
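Inverted index
Forward index: like a book's table of contents (document to content)
Inverted index: like the index at the back of a book (term to documents)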
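The "index at the back of a book" analogy can be sketched in a few lines of Python. This is a toy illustration, not how Lucene stores its postings. The first document reuses the comment text from the CRUD examples, the second is made up, and `build_inverted_index` is a hypothetical helper.

```python
from collections import defaultdict

# Forward view: document id -> content.
docs = {
    1: "you know for search",
    2: "search is fast",  # made-up second document
}


def build_inverted_index(documents):
    """Return {term: sorted list of ids of the documents containing the term}."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


inverted = build_inverted_index(docs)
print(inverted["search"])  # ids of the documents that contain "search"
print(inverted["you"])
```

Looking up a term is a single dictionary access, whereas the forward view would need a scan over every document.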
Analyzers (Analysis)
An analyzer is made of three parts, applied in this order:
Character Filters, Tokenizer, Token Filters
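The three stages can be pictured as a simple function pipeline. The Python sketch below only mimics the order of the stages. The function names, the tag-stripping regex, and the tiny stop-word list are assumptions for illustration, not the behavior of any built-in analyzer.

```python
import re

STOP_WORDS = {"a", "is", "the", "in"}  # tiny made-up stop list


def char_filter(text):
    """Character filter: strip anything that looks like an HTML tag."""
    return re.sub(r"<[^>]+>", "", text)


def tokenizer(text):
    """Tokenizer: split the filtered text on whitespace."""
    return text.split()


def token_filters(tokens):
    """Token filters: lowercase every token, then drop stop words."""
    lowered = [t.lower() for t in tokens]
    return [t for t in lowered if t not in STOP_WORDS]


def analyze(text):
    """Run the stages in order: char filter, tokenizer, token filters."""
    return token_filters(tokenizer(char_filter(text)))


print(analyze("<b>Liuchenglong</b> is a student"))
```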
## Analyze text with a named analyzer
GET /_analyze
{
"analyzer": "standard",
"text": "liuchenglong is a student"
}
## Analyze text with the analyzer of a specific index field, to preview how that field would tokenize it
GET my_index/_analyze
{
"field": "user",
"text": "xiaoting"
}
## Analyze with a custom combination of tokenizer and token filters
GET /_analyze
{
"tokenizer": "standard",
"filter": [
"lowercase"
],
"text": "liuchenglong is a student"
}
The Standard Analyzer is the default analyzer
GET /_analyze
{
"analyzer": "standard",
"text": "Liuchenglong in the house"
}
GET /_analyze
{
"analyzer": "simple",
"text": "Liuchenglong in the house"
}
GET /_analyze
{
"analyzer": "whitespace",
"text": "Liuchenglong in the house"
}
GET /_analyze
{
"analyzer": "stop",
"text": "Liuchenglong in the house"
}
GET /_analyze
{
"analyzer": "keyword",
"text": "Liuchenglong in the house"
}
GET /_analyze
{
"analyzer": "pattern",
"text": "Liuchenglong in the house"
}
GET /_analyze
{
"analyzer": "english",
"text": "Liuchenglong in the house"
}
## ik, a Chinese analyzer plugin (must be downloaded and installed separately)
GET /_analyze
{
"analyzer": "ik_max_word",
"text": "江苏省无锡市滨湖区溪北新村"
}
GET /_analyze
{
"analyzer": "ik_smart",
"text": "江苏省无锡市滨湖区溪北新村"
}
Search API
1. URL Search: pass the query string with the q parameter
2. Request Body Search: send a GET or POST and write the query in the request body using the Elasticsearch Query DSL
/_search
/index1/_search
/index1,index2/_search
/index*/_search
URL Search
## q sets the query string, df sets the default field to search
GET my_index/_search?q=chenglong&df=user
GET my_index/_search?q=user:chenglong
## Adding "profile": true shows how the query was executed
GET my_index/_search?q=chenglong&df=user
{
"profile": true
}
## PhraseQuery
GET my_index/_search?q=comment:"you know"
## BooleanQuery
GET my_index/_search?q=comment:you know
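## Term queries on the same field: wrap the terms in ()
GET my_index/_search?q=comment:(you know)
## Parsed as "comment:you comment:and comment:know": lowercase and is treated as an ordinary term; the operator must be written as uppercase AND
GET my_index/_search?q=comment:(you and know)
## Parsed as "comment:you comment:not comment:know": lowercase not is also an ordinary term; the operator must be written as uppercase NOT
GET my_index/_search?q=comment:(you not know)
## Parsed as "comment:you +comment:know"; %2B is the URL-encoded + sign
GET my_index/_search?q=comment:(you %2Bknow)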
## Range query
GET my_index/_search?q=year:>2020
## Wildcard query
GET my_index/_search?q=user:ch*
## Fuzzy match: ~1 allows one edit, so chengleng matches chenglong
GET my_index/_search?q=user:chengleng~1
## Proximity search: ~2 allows up to two position moves, so this matches "you know for search"
GET my_index/_search?q=comment:"you for"~2
Request Body Search
## Pagination
GET my_index/_search
{
"query": {
"match_all": {}
},
"from": 0,
"size": 20
}
## Sort by a given field
GET my_index/_search
{
"query": {
"match_all": {}
},
"sort": [
{"_score": {"order": "desc"}}
]
}
## Return only the listed _source fields
GET my_index/_search
{
"query": {
"match_all": {}
},
"_source": ["user"]
}
## matchQuery TermQuery
GET my_index/_search
{
"query": {
"match": {
"user":"Chenglong"
}
}
}
## Set the operator used to combine the analyzed query terms
GET my_index/_search
{
"query": {
"match": {
"user":{
"query": "Chenglong",
"operator": "and"
}
}
}
}
## match_phrase with slop tolerates words in between; the query below matches "you know for search"
GET my_index/_search
{
"query": {
"match_phrase": {
"comment":{
"query": "you for",
"slop": 1
}
}
}
}
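Script fields
## doc values are read from the user.keyword sub-field, because text fields have no doc values by default
GET my_index/_search
{
"query": {
"match_all": {}
},
"script_fields": {
"userName": {
"script": {
"lang": "painless",
"source": "doc['user.keyword'].value + 's'"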
}
}
}
}
Mapping
A mapping is roughly the equivalent of a schema definition in a database.
- Simple types
Text / Keyword
Date
Integer / Floating
Boolean
IPv4 & IPv6
- Complex types: object and nested
Object type / Nested type
- Special types
geo_point & geo_shape / percolator
Dynamic Mapping
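When a document is written to an index that does not exist, the index is created automatically and the field types are inferred from the document.
## View the mapping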
GET my_index/_mapping
Once a field exists, its type cannot be changed; the index has to be rebuilt with the Reindex API.
## The dynamic setting can be given under mappings when the index is created: true (default) | false | strict
PUT movies
{
"mappings": {
"dynamic": "strict"
}
}
Custom mapping
## Create an index in which mobile is not indexed
PUT movies
{
"mappings": {
"properties": {
"firstName": {
"type": "text"
},
"lastName": {
"type": "text"
},
"mobile": {
"type": "text",
"index": false
}
}
}
}
## Insert a document
PUT movies/_doc/1
{
"firstName": "Liu",
"lastName": "Chenglong",
"mobile": "1234567890"
}
## The search below fails with:
## failed to create query: Cannot search on field [mobile] since it is not indexed.
POST /movies/_search
{
"query": {
"match": {
"mobile": "123"
}
}
}
## null_value (run DELETE movies first if the index already exists)
PUT movies
{
"mappings": {
"properties": {
"firstName": {
"type": "text"
},
"lastName": {
"type": "text"
},
"mobile": {
"type": "keyword",
"null_value": "NULL"
}
}
}
}
PUT movies/_doc/1
{
"firstName": "Liu",
"lastName": "Chenglong",
"mobile": null
}
PUT movies/_doc/2
{
"firstName": "Liu",
"lastName": "Chenglong2"
}
## Finds documents whose mobile is explicitly null, but not documents that have no mobile field at all
POST /movies/_search
{
"query": {
"match": {
"mobile": "NULL"
}
}
}
## copy_to (run DELETE movies first if the index already exists)
PUT movies
{
"mappings": {
"properties": {
"firstName": {
"type": "text",
"copy_to": "fullName"
},
"lastName": {
"type": "text",
"copy_to": "fullName"
}
}
}
}
PUT movies/_doc/1
{
"firstName": "Liu",
"lastName": "Chenglong"
}
## fullName can be queried directly even though no document contains such a field
## fullName does not appear in _source
POST movies/_search
{
"query": {
"match": {
"fullName": "chenglong"
}
}
}
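Elasticsearch has no dedicated array type: any field can hold several values of its own type, so an array of strings can be written directly into a field that is mapped as text.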
PUT movies/_doc/1
{
"firstName": "Liu",
"lastName": "Chenglong"
}
PUT movies/_doc/3
{
"firstName": "Liu",
"lastName": ["Chenglong"]
}
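Multi-fields
- Exact matching on a name
add a keyword sub-field
- Using a different analyzer on the same content
Exact Value (needs no analysis)
Covers dates, numbers, and specific strings (e.g. Apple Store)
Full Text
The text type in Elasticsearch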
Character Filters
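They process the text before it reaches the Tokenizer, for example by adding, removing, or replacing characters.
## Strip HTML tags from the text, useful for data collected by web crawlers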
GET _analyze
{
"tokenizer": "keyword",
"char_filter": [
"html_strip"
],
"text": "<b>hello world</b>"
}
## Replace characters
GET _analyze
{
"tokenizer": "standard",
"char_filter": [
{
"type": "mapping",
"mappings": [
"- => _"
]
}
],
"text": "hello-world"
}
## Tokenize by path hierarchy
GET _analyze
{
"tokenizer": "path_hierarchy",
"text": "user/local/nginx/conf"
}
## Tokenize on whitespace, then filter out stop words
## Only "You" and "house." are left
GET _analyze
{
"tokenizer": "whitespace",
"filter": ["stop"],
"text": "You are in the house."
}
## Adding a lowercase filter also turns the remaining tokens into lowercase
GET _analyze
{
"tokenizer": "whitespace",
"filter": [
"stop",
"lowercase"
],
"text": "You are in the house."
}
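Aggregations
Bucket: a set of documents that satisfy some criterion
Metric: mathematical computations over documents
Pipeline: a second-level aggregation over the results of other aggregations
Matrix: operates on several fields and produces a result matrix
Bucket is somewhat like GROUP BY in SQL
Metric is somewhat like the aggregate functions in SQL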
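The GROUP BY analogy can be made concrete with a small Python sketch that first buckets rows by a field and then computes a metric per bucket. The sample rows and the `terms_with_avg` helper are made up for illustration. Elasticsearch computes this per shard and merges the results, which the sketch does not model.

```python
from collections import defaultdict

# Made-up rows standing in for documents.
orders = [
    {"customer_gender": "FEMALE", "price": 10.0},
    {"customer_gender": "MALE", "price": 20.0},
    {"customer_gender": "FEMALE", "price": 30.0},
]


def terms_with_avg(rows, bucket_field, metric_field):
    """Group rows by bucket_field (bucket), then average metric_field (metric)."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[row[bucket_field]].append(row[metric_field])
    return {
        key: {"doc_count": len(values), "avg": sum(values) / len(values)}
        for key, values in buckets.items()
    }


result = terms_with_avg(orders, "customer_gender", "price")
print(result)
```

In SQL terms this corresponds to `SELECT customer_gender, COUNT(*), AVG(price) ... GROUP BY customer_gender`.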
## Count documents by gender
GET kibana_sample_data_ecommerce/_search
{
"size": 0,
"aggs": {
"flight_dest": {
"terms": {
"field": "customer_gender"
}
}
}
}
## Result
"buckets" : [
{
"key" : "FEMALE",
"doc_count" : 2433
},
{
"key" : "MALE",
"doc_count" : 2242
}
]
## Nest a metric inside each bucket: average product base price per day of week
GET kibana_sample_data_ecommerce/_search
{
"size": 0,
"aggs": {
"flight_dest": {
"terms": {
"field": "day_of_week"
},
"aggs": {
"avg_price": {
"avg": {
"field": "products.base_price"
}
}
}
}
}
}
Queries
A term is the smallest unit that carries meaning
## Add a few documents
POST /product/_doc/1
{
"productId":"XHDK-12-#f",
"desc":"iPhone"
}
POST /product/_doc/2
{
"productId":"BHDK-22-#f",
"desc":"iPad"
}
POST /product/_doc/3
{
"productId":"CHDK-32-#f",
"desc":"MBP"
}
## A term query does not analyze its input, while indexed text is analyzed (iPhone => iphone),
## so searching for "iPhone" returns nothing; use "value": "iphone" to get a hit
POST /product/_search
{
"query": {
"term": {
"desc": {
"value": "iPhone"
}
}
}
}
## Querying the keyword sub-field with the original casing also matches
POST /product/_search
{
"query": {
"term": {
"desc.keyword": {
"value": "iPhone"
}
}
}
}
## Analyze
POST /_analyze
{
"analyzer": "standard",
"text": ["iPhone"]
}
{
"tokens" : [
{
"token" : "iphone",
"start_offset" : 0,
"end_offset" : 6,
"type" : "<ALPHANUM>",
"position" : 0
}
]
}
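The difference between looking up a value verbatim and analyzing it first can be reproduced with a toy index in Python. The `analyze` function below is only a stand-in for the standard analyzer (lowercase plus whitespace split), and `term_query` / `match_query` are hypothetical helpers that imitate the lookup behavior, not the real query implementations.

```python
def analyze(text):
    """Stand-in for the standard analyzer: lowercase and split on whitespace."""
    return text.lower().split()


# Index time: the desc values are analyzed before being stored as terms.
indexed_terms = {}
for doc_id, desc in {1: "iPhone", 2: "iPad", 3: "MBP"}.items():
    for term in analyze(desc):
        indexed_terms.setdefault(term, set()).add(doc_id)


def term_query(value):
    """term: look the value up exactly as given, with no analysis."""
    return sorted(indexed_terms.get(value, set()))


def match_query(text):
    """match: analyze the query text first, then look up each resulting term."""
    hits = set()
    for term in analyze(text):
        hits |= indexed_terms.get(term, set())
    return sorted(hits)


print(term_query("iPhone"))   # looked up verbatim against lowercased terms
print(term_query("iphone"))
print(match_query("iPhone"))  # analyzed to "iphone" before the lookup
```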
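## Wrapping the query in constant_score turns it into a filter, which skips score computation and avoids that overhead
## Filters can also be cached effectively, which speeds up repeated queries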
POST /product/_search
{
"query": {
"constant_score": {
"filter": {
"term": {
"desc.keyword": "iPhone"
}
},
"boost": 1.2
}
}
}
Match Query / Match Phrase Query / Query String Query
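Text is analyzed both at index time and at search time: the query string is analyzed first, and the resulting list of terms is what actually gets searched.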
POST movies/_search
{
"query": {
"match": {
"name": "chenglong"
}
}
}
Structured search
Dates, booleans, and numbers are all structured data
They can be queried with term and prefix queries
## Add some documents
POST /product/_bulk
{ "index":{"_id":1}}
{"price":10,"avaliable":true,"date":"2020-05-22","productId":"XXX-1","tag":"one"}
{ "index":{"_id":2}}
{"price":20,"avaliable":false,"date":"2019-05-22","productId":"XXX-2","tag":["one","two"]}
{ "index":{"_id":3}}
{"price":30,"avaliable":false,"productId":"XXX-3"}
## term query on a boolean
POST /product/_search
{
"query": {
"constant_score": {
"filter": {
"term": {
"avaliable": true
}
}
}
}
}
## range query on a number
POST /product/_search
{
"query": {
"constant_score": {
"filter": {
"range": {
"price": {
"gte": 10,
"lte": 20
}
}
}
}
}
}
## range query on a date; date math units:
y years
M months
w weeks
d days
H/h hours
m minutes
s seconds
POST /product/_search
{
"query": {
"constant_score": {
"filter": {
"range": {
"date": {
"gte": "now-1y"
}
}
}
}
}
}
## Use exists to find documents in which the field is present
POST /product/_search
{
"query": {
"constant_score": {
"filter": {
"exists": {
"field": "date"
}
}
}
}
}
## On a multi-valued field, term means "contains", not "equals",
## so this returns both documents: the one tagged one and the one tagged one, two
POST /product/_search
{
"query": {
"constant_score": {
"filter": {
"term": {
"tag.keyword": "one"
}
}
}
}
}
## To return only the document whose single tag is one,
## add a tag_count field and combine it with a bool query
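One possible shape for that query is sketched below as a Python function that builds the request body. It assumes every document has been re-indexed with a `tag_count` field holding the number of tags (the sample data above does not contain it yet). `exact_single_tag_query` is a hypothetical helper name, and the body would still have to be sent to `/product/_search` by a client of your choice.

```python
import json


def exact_single_tag_query(tag):
    """Build a search body for documents whose only tag is `tag`.

    Assumes each document carries tag_count = number of values in tag.
    """
    return {
        "query": {
            "constant_score": {
                "filter": {
                    "bool": {
                        "must": [
                            {"term": {"tag.keyword": tag}},  # contains the tag
                            {"term": {"tag_count": 1}},      # and has no other tag
                        ]
                    }
                }
            }
        }
    }


body = exact_single_tag_query("one")
print(json.dumps(body, indent=2))
```

The first clause keeps the "contains" semantics of term, and the second clause narrows the result to documents with exactly one tag.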
Relevance scoring
TF-IDF
BM25
Adding "explain": true to a search request shows how each score was computed.
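To get a feel for what BM25 rewards, here is a Python sketch of the textbook single-term formula. The defaults `k1=1.2` and `b=0.75` match the Elasticsearch defaults, but the function itself is a simplification: the Lucene implementation differs in details, so the numbers will not match `explain` output exactly. All input values are made up.

```python
import math


def bm25_score(freq, doc_len, avg_doc_len, num_docs, docs_with_term,
               k1=1.2, b=0.75):
    """Score one term against one document with the textbook BM25 formula."""
    # Rare terms (few documents contain them) get a higher idf.
    idf = math.log(1 + (num_docs - docs_with_term + 0.5) / (docs_with_term + 0.5))
    # Longer-than-average documents are penalized through the length norm.
    norm = k1 * (1 - b + b * doc_len / avg_doc_len)
    tf = freq * (k1 + 1) / (freq + norm)
    return idf * tf


short_doc = bm25_score(freq=1, doc_len=3, avg_doc_len=6, num_docs=3, docs_with_term=1)
long_doc = bm25_score(freq=1, doc_len=12, avg_doc_len=6, num_docs=3, docs_with_term=1)
print(short_doc, long_doc)
```

With the same term frequency, the shorter document scores higher, and the term-frequency part saturates instead of growing without bound as in plain TF-IDF.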
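bool Query
must: must match; contributes to the score
should: matches optionally; contributes to the score
must_not: must not match
filter: must match; does not contribute to the score
bool queries can be nested
Changing the nesting structure changes the scoring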
## boost adjusts how much a clause contributes to the score
## Changing the boost on tag and price changes the order of the results
POST /product/_search
{
"query": {
"bool": {
"should": [
{
"match": {
"tag": {
"query": "one",
"boost": 1
}
}
},
{
"match": {
"price": {
"query": "30",
"boost": 1
}
}
}
]
}
}
}
## A boosting query returns the documents that match positive and lowers the score of those that also match negative
POST /product/_search
{
"query": {
"boosting": {
"positive": {
"match": {
"tag": "one"
}
},
"negative": {
"match": {
"tag": "two"
}
},
"negative_boost": 0.2
}
}
}
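The effect of `negative_boost` can be written down as a small Python function: a document that matches the positive query keeps its score unless it also matches the negative query, in which case the score is multiplied by `negative_boost`. `boosting_score` is a hypothetical helper and the scores are made up.

```python
def boosting_score(positive_score, matches_negative, negative_boost):
    """Return the final score of a document that matched the positive query."""
    if matches_negative:
        # Demote instead of exclude: the document stays in the result list.
        return positive_score * negative_boost
    return positive_score


only_one = boosting_score(1.0, matches_negative=False, negative_boost=0.2)
one_and_two = boosting_score(1.0, matches_negative=True, negative_boost=0.2)
print(only_one, one_and_two)
```

Unlike must_not, the negative clause only pushes documents down the ranking rather than removing them.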
Single query string across multiple fields
POST /product/_bulk
{ "index":{"_id":1}}
{"title":"Quick brown rabbits","body":"Brown rabbits are commonly seen"}
{ "index":{"_id":2}}
{"title":"Keeping pets healthy","body":"My quick brown fox eats rabbits on a regular basis"}
POST /product/_search
{
"query": {
"bool": {
"should": [
{
"match": {
"title": "Brown fox"
}
},
{
"match": {
"body": "Brown fox"
}
}
]
}
}
}
POST /product/_search
{
"query": {
"dis_max": {
"queries": [
{
"match": {
"title": "Quick fox"
}
},
{
"match": {
"body": "Quick fox"
}
}
]
}
}
}
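## When several documents end up with the same score, a tie_breaker coefficient can be added to tell them apart
## tie_breaker is a float between 0 and 1
## 0 means only the best-matching clause counts
## 1 means all clauses count equally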
POST /product/_search
{
"query": {
"dis_max": {
"queries": [
{
"match": {
"title": "Quick pets"
}
},
{
"match": {
"body": "Quick pets"
}
}
],
"tie_breaker": 0.7
}
}
}
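The difference between bool/should and dis_max, and the role of `tie_breaker`, comes down to how the per-clause scores are combined. The Python sketch below uses made-up per-field scores. Both helper functions are hypothetical and ignore everything else Elasticsearch does when scoring.

```python
def bool_should_score(clause_scores):
    """bool/should: add up the scores of all matching clauses."""
    return sum(clause_scores)


def dis_max_score(clause_scores, tie_breaker=0.0):
    """dis_max: best clause score plus tie_breaker times the other scores."""
    best = max(clause_scores)
    return best + tie_breaker * (sum(clause_scores) - best)


# Made-up per-field scores [title, body] for two documents.
doc_a = [0.6, 0.6]  # a partial match in both fields
doc_b = [0.0, 1.0]  # a strong match in one field only

print("should: ", bool_should_score(doc_a), bool_should_score(doc_b))
print("dis_max:", dis_max_score(doc_a), dis_max_score(doc_b))
print("dis_max with tie_breaker 0.7:",
      dis_max_score(doc_a, tie_breaker=0.7), dis_max_score(doc_b, tie_breaker=0.7))
```

With should, two mediocre field matches outrank one strong match. With dis_max the strong single-field match wins, and raising tie_breaker gradually moves the result back toward the should behavior.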
multi_match query
// LCLTODO: I don't fully understand this part yet
POST /product/_search
{
"query": {
"multi_match": {
"query": "brown",
"fields": ["title","body"]
}
}
}
Chinese analyzers
hanlp
icu
ik
pinyin
Search Template
Decoupling: the query definition is kept separate from the application that runs it
## Create a search template
POST _scripts/queryProduct
{
"script": {
"lang": "mustache",
"source": {
"query": {
"multi_match": {
"query": "{{q}}",
"fields": [
"title"
]
}
}
}
}
}
GET _scripts/queryProduct
## Search using the template
POST product/_search/template
{
"id":"queryProduct",
"params": {
"q":"pets"
}
}
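Function Score Query
After the query has run, every matching document can be re-scored by a series of functions, and the results are sorted by the new score.
Built-in score functions:
- Weight: applies a simple, non-normalized weight to each document
- Field Value Factor: uses the value of a field to modify _score
- Random Score
- Decay functions: score by distance from a reference value of a field; the closer the value, the higher the score
- Script Score: a custom script takes full control of the scoring logic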
PUT shop/_doc/1
{
"title": "Apple pie",
"price": 8
}
PUT shop/_doc/2
{
"title": "Orange pie",
"price": 3
}
PUT shop/_doc/3
{
"title": "Watermelon pie",
"price": 6
}
POST /shop/_search
{
"query": {
"function_score": {
"query": {
"multi_match": {
"query": "pie",
"fields": "title"
}
},
"field_value_factor": {
"field": "price"
}
}
}
}
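What field_value_factor does to the score can be approximated in Python. With the default settings the new score is the query score multiplied by the field value. The `factor` and `modifier` options change the multiplier to `modifier(factor * value)`. The sketch covers only three modifiers, assumes the default multiply boost mode, and uses a made-up query score of 1.0 for every document. `field_value_factor` here is a hypothetical helper.

```python
import math

MODIFIERS = {
    "none": lambda v: v,
    "log1p": lambda v: math.log10(v + 1),  # add 1, then take the common logarithm
    "sqrt": math.sqrt,
}


def field_value_factor(query_score, field_value, factor=1.0, modifier="none"):
    """Multiply the query score by modifier(factor * field_value)."""
    return query_score * MODIFIERS[modifier](factor * field_value)


# Prices from the shop documents, with an assumed query score of 1.0 each.
prices = {"Apple pie": 8, "Orange pie": 3, "Watermelon pie": 6}
ranked = sorted(prices, key=lambda t: field_value_factor(1.0, prices[t]), reverse=True)
print(ranked)
```

Because the raw price multiplies the score directly, a modifier such as log1p or sqrt is a common way to dampen the influence of large field values.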