Elasticsearch Notes

A brief comparison of Elasticsearch and relational databases

RDBMS	Elasticsearch
Table	Index(Type)
Row	Document
Column	Field
Schema	Mapping
SQL	DSL

## Index information
GET kibana_sample_data_ecommerce

## Total number of documents
GET kibana_sample_data_ecommerce/_count

## _cat indices API
## Wildcard match on the index name
GET /_cat/indices/kibana_*
## Sort by document count
GET /_cat/indices?v&s=docs.count:desc
## Basic information about an index
GET /_cat/indices/kibana_sample_data_ecommerce?v

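The default cluster name is elasticsearch.

Shards come in two kinds: Primary Shard and Replica Shard.

The number of primary shards is set when the index is created and cannot be changed afterwards, short of a Reindex.

The number of replica shards can be adjusted dynamically.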
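
The primary shard count is fixed because a document's shard is derived from a hash of its routing value (the _id by default) modulo the number of primary shards. Below is a minimal Python sketch of that idea. It assumes zlib.crc32 as a stand-in for the hash Elasticsearch really uses, and shard_for is a hypothetical helper, not an Elasticsearch API.

```python
import zlib


def shard_for(routing_value: str, number_of_primary_shards: int) -> int:
    """Pick a shard from a hash of the routing value (crc32 is only a stand-in hash)."""
    return zlib.crc32(routing_value.encode("utf-8")) % number_of_primary_shards


doc_ids = [str(i) for i in range(100)]

# Count how many documents would map to a different shard if the
# primary shard count changed from 5 to 6 without reindexing.
moved = sum(1 for d in doc_ids if shard_for(d, 5) != shard_for(d, 6))
print(f"{moved} of {len(doc_ids)} documents would be looked up on a different shard")
```

Replica shards are copies of a primary, so changing their number does not affect where a document is routed.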

## Cluster health
GET _cluster/health

GET _cat/nodes?v
GET _cat/shards?v

index                        shard prirep state   docs   store ip         node
.apm-agent-configuration     0     p      STARTED    0    208b 172.18.0.2 12b52a46e43f
.kibana_1                    0     p      STARTED   94 967.7kb 172.18.0.2 12b52a46e43f
kibana_sample_data_ecommerce 0     p      STARTED 4675   4.5mb 172.18.0.2 12b52a46e43f
.apm-custom-link             0     p      STARTED    0    208b 172.18.0.2 12b52a46e43f
.kibana_task_manager_1       0     p      STARTED    5  55.2kb 172.18.0.2 12b52a46e43f

Basic CRUD

## Auto-generated id
POST my_index/_doc/
{
  "user":"xiaoting",
  "comment":"you know for search"
}

## User-specified id; each repeated PUT replaces the document and increments _version
PUT my_index/_doc/2
{
  "user":"xiaoting",
  "comment":"you know for search"
}

## Read
GET my_index/_doc/2

## Search
GET my_index/_search
{
  "query":{
    "match_all":{}
  }
}

## Add a field to an existing document; with PUT the whole document must be supplied, otherwise the omitted fields are lost
POST my_index/_update/2
{
  "doc":{
    "post_date":"2020-05-21"
  }
}

## Delete
DELETE my_index/_doc/2

## Multi-get
GET _mget
{
  "docs": [
    {
      "_index": "my_index",
      "_id": 1
    },
    {
      "_index": "my_index",
      "_id": 2
    }
  ]
}

Inverted index

Forward index: like a book's table of contents

Inverted index: like the index at the back of a book
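
To make the analogy concrete, here is a small Python sketch of an inverted index: a map from each term to the ids of the documents that contain it. The sample documents and the lowercase-and-split tokenization are assumptions for illustration, not how Lucene stores its postings.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the sorted list of ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


# Forward view: document id -> content.
docs = {
    1: "you know for search",
    2: "you know for sure",
    3: "search for documents",
}

# Inverted view: term -> document ids.
inverted = build_inverted_index(docs)
print(inverted["search"])  # [1, 3]
print(inverted["you"])     # [1, 2]
```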

Analysis

An analyzer consists of three parts:

Character Filters, Tokenizer, Token Filters
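
The three stages run in that order: character filters rewrite the raw text, the tokenizer splits it into tokens, and token filters transform the token list. The Python sketch below mimics the pipeline with simplified stand-ins. The function names, the regular expression and the short stop word list are assumptions and do not reproduce Elasticsearch's built-in components exactly.

```python
import re

STOP_WORDS = {"a", "an", "are", "in", "is", "the"}  # small illustrative subset


def html_strip(text):
    """Character filter: drop anything that looks like an HTML tag."""
    return re.sub(r"<[^>]+>", "", text)


def whitespace_tokenizer(text):
    """Tokenizer: split the text on whitespace."""
    return text.split()


def lowercase_filter(tokens):
    """Token filter: lowercase every token."""
    return [t.lower() for t in tokens]


def stop_filter(tokens):
    """Token filter: remove stop words."""
    return [t for t in tokens if t not in STOP_WORDS]


def analyze(text, char_filters, tokenizer, token_filters):
    """Run character filters, then the tokenizer, then token filters, in order."""
    for char_filter in char_filters:
        text = char_filter(text)
    tokens = tokenizer(text)
    for token_filter in token_filters:
        tokens = token_filter(tokens)
    return tokens


tokens = analyze(
    "<b>You are in the House</b>",
    char_filters=[html_strip],
    tokenizer=whitespace_tokenizer,
    token_filters=[lowercase_filter, stop_filter],
)
print(tokens)  # ['you', 'house']
```

The order of the token filters matters: running stop_filter before lowercase_filter would let a capitalized stop word such as "The" slip through.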

## Analyze text with a named analyzer
GET /_analyze
{
  "analyzer": "standard",
  "text": "liuchenglong is a student"
}

## Analyze text with the analyzer of an index field, to see how that field would be tokenized
GET my_index/_analyze
{
  "field": "user",
  "text": "xiaoting"
}

## Analyze with an ad hoc combination of tokenizer and filters
GET /_analyze
{
  "tokenizer": "standard",
  "filter": [
    "lowercase"
  ],
  "text": "liuchenglong is a student"
}

The Standard Analyzer is the default analyzer

GET /_analyze
{
  "analyzer": "standard",
  "text": "Liuchenglong in the house"
}

GET /_analyze
{
  "analyzer": "simple",
  "text": "Liuchenglong in the house"
}

GET /_analyze
{
  "analyzer": "whitespace",
  "text": "Liuchenglong in the house"
}

GET /_analyze
{
  "analyzer": "stop",
  "text": "Liuchenglong in the house"
}

GET /_analyze
{
  "analyzer": "keyword",
  "text": "Liuchenglong in the house"
}

GET /_analyze
{
  "analyzer": "pattern",
  "text": "Liuchenglong in the house"
}

GET /_analyze
{
  "analyzer": "english",
  "text": "Liuchenglong in the house"
}

## ik, a Chinese analysis plugin (must be downloaded and installed separately)
GET /_analyze
{
  "analyzer": "ik_max_word",
  "text": "江苏省无锡市滨湖区溪北新村"
}

GET /_analyze
{
  "analyzer": "ik_smart",
  "text": "江苏省无锡市滨湖区溪北新村"
}

Search API

1. URL Search: the query string is passed in the q parameter

2. Request Body Search: sent with GET or POST; the request body uses the Elasticsearch Query DSL

/_search
/index1/_search
/index1,index2/_search
/index*/_search

URL Search

## q is the query string, df is the default field to search
GET my_index/_search?q=chenglong&df=user
GET my_index/_search?q=user:chenglong

## Adding "profile": true shows how the query was executed
GET my_index/_search?q=chenglong&df=user
{
  "profile": true
}

## PhraseQuery
GET my_index/_search?q=comment:"you know"
## BooleanQuery: only you is searched in comment; know is searched in the default field
GET my_index/_search?q=comment:you know
## Grouping: wrap the terms in () so that both apply to comment
GET my_index/_search?q=comment:(you know)
## Lowercase and is treated as a term: "comment:you comment:and comment:know"
GET my_index/_search?q=comment:(you and know)
## Lowercase not is also treated as a term: "comment:you comment:not comment:know"
GET my_index/_search?q=comment:(you not know)
## "comment:you +comment:know"; %2B is the URL-encoded + sign
GET my_index/_search?q=comment:(you %2Bknow)
## Range query
GET my_index/_search?q=year:>2020
## Wildcard query
GET my_index/_search?q=user:ch*
## Fuzzy match with edit distance 1; matches chenglong
GET my_index/_search?q=user:chengleng~1
## Phrase with slop 2; matches you know for search
GET my_index/_search?q=comment:"you for"~2
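
The ~1 suffix in user:chengleng~1 accepts terms within one edit of the query term. The Python sketch below computes plain Levenshtein distance to show why chengleng reaches chenglong. edit_distance is an illustrative helper. Elasticsearch's own fuzzy matching also counts a swap of two adjacent characters as a single edit, which this sketch does not model.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # delete char_a
                current[j - 1] + 1,                    # insert char_b
                previous[j - 1] + (char_a != char_b),  # substitute, free if equal
            ))
        previous = current
    return previous[-1]


print(edit_distance("chengleng", "chenglong"))  # 1
print(edit_distance("kitten", "sitting"))       # 3
```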

Request Body Search

## Pagination
GET my_index/_search
{
  "query": {
    "match_all": {}
  },
  "from": 0,
  "size": 20
}

## Sort by a given field
GET my_index/_search
{
  "query": {
    "match_all": {}
  },
  "sort": [
    {"_score": {"order": "desc"}}
  ]
}

## Return only the specified fields
GET my_index/_search
{
  "query": {
    "match_all": {}
  },
  "_source": ["user"]
}

## Match query: the input is analyzed and the resulting terms are searched as term queries
GET my_index/_search
{
  "query": {
    "match": {
      "user":"Chenglong"
    }
  }
}

## Specify the operator (and / or) used to combine the terms
GET my_index/_search
{
  "query": {
    "match": {
      "user":{
        "query": "Chenglong",
        "operator": "and"
      }
    }
  }
}

## match_phrase accepts a slop, the number of position moves allowed between the terms; the query below matches you know for search
GET my_index/_search
{
  "query": {
    "match_phrase": {
      "comment":{
        "query": "you for",
        "slop": 1
      }
    }
  }
}

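Script fields

## user is a text field, which has no doc values by default, so the script reads the user.keyword sub-field created by dynamic mapping
GET my_index/_search
{
  "query": {
    "match_all": {}
  },
  "script_fields": {
    "userName": {
      "script": {
        "lang": "painless",
        "source": "doc['user.keyword'].value + 's'"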
      }
    }
  }
}

Mapping

Roughly analogous to a schema definition in a database.

Simple types

Text / Keyword

Date

Integer / Floating point

Boolean

IPv4 & IPv6

Complex types: objects and nested objects

Object type / Nested type

Special types

geo_point & geo_shape / percolator

Dynamic Mapping

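When a document is written to an index that does not exist, the index is created automatically

## View the mapping
GET my_index/_mapping

Once a field exists, its type cannot be changed; the index has to be rebuilt with the Reindex API

## The dynamic setting can be given when the index is created: true (the default), false or strict
PUT movies
{
  "mappings": {
    "dynamic": "strict"
  }
}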

Custom Mapping

## Create an index in which mobile is not indexed (if movies already exists, run DELETE movies first)
PUT movies
{
  "mappings": {
    "properties": {
      "firstName": {
        "type": "text"
      },
      "lastName": {
        "type": "text"
      },
      "mobile": {
        "type": "text",
        "index": false
      }
    }
  }
}

## Insert a document
PUT movies/_doc/1
{
  "firstName": "Liu",
  "lastName": "Chenglong",
  "mobile": "1234567890"
}

## Searching on mobile fails with:
## failed to create query: Cannot search on field [mobile] since it is not indexed.
POST /movies/_search
{
  "query": {
    "match": {
      "mobile": "123"
    }
  }
}

## null_value (movies already exists from the previous example, so run DELETE movies first)
PUT movies
{
  "mappings": {
    "properties": {
      "firstName": {
        "type": "text"
      },
      "lastName": {
        "type": "text"
      },
      "mobile": {
        "type": "keyword",
        "null_value": "NULL"
      }
    }
  }
}

PUT movies/_doc/1
{
  "firstName": "Liu",
  "lastName": "Chenglong",
  "mobile": null
}

PUT movies/_doc/2
{
  "firstName": "Liu",
  "lastName": "Chenglong2"
}

## Finds the document whose mobile is null, but not the document that has no mobile field at all
POST /movies/_search
{
  "query": {
    "match": {
      "mobile": "NULL"
    }
  }
}

## copy_to (again, run DELETE movies first)
PUT movies
{
  "mappings": {
    "properties": {
      "firstName": {
        "type": "text",
        "copy_to": "fullName"
      },
      "lastName": {
        "type": "text",
        "copy_to": "fullName"
      }
    }
  }
}

PUT movies/_doc/1
{
  "firstName": "Liu",
  "lastName": "Chenglong"
}

## fullName can be queried directly even though the documents in movies do not contain that field
## fullName does not appear in _source
POST movies/_search
{
  "query": {
    "match": {
      "fullName": "chenglong"
    }
  }
}

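Elasticsearch has no dedicated array type: any field can hold zero or more values of its type, so an array can be written directly to a field that is mapped as text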

PUT movies/_doc/1
{
  "firstName": "Liu",
  "lastName": "Chenglong"
}

PUT movies/_doc/3
{
  "firstName": "Liu",
  "lastName": ["Chenglong"]
}

Multi-fields

Exact matching on a name

Add a keyword sub-field

Use different analyzers

Exact Value (no analysis needed)

Includes dates, numbers and specific strings (Apple Store)

Full Text

The text type in es

Character Filters

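They process the text before the Tokenizer, for example by adding, removing or replacing characters

## Strips HTML tags from the text; useful for data collected by a web crawler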
GET _analyze
{
  "tokenizer": "keyword",
  "char_filter": [
    "html_strip"
  ],
  "text": "<b>hello world</b>"
}

## Replace characters
GET _analyze
{
  "tokenizer": "standard",
  "char_filter": [
    {
      "type": "mapping",
      "mappings": [
        "- => _"
      ]
    }
  ],
  "text": "hello-world"
}

## Tokenize by path hierarchy
GET _analyze
{
  "tokenizer": "path_hierarchy",
  "text": "user/local/nginx/conf"
}

## Tokenize on whitespace and remove stop words
## Only You and house. (with its trailing period) remain
GET _analyze
{
  "tokenizer": "whitespace",
  "filter": ["stop"], 
  "text": "You are in the house."
}

## Adding a lowercase filter also lowercases the tokens
GET _analyze
{
  "tokenizer": "whitespace",
  "filter": [
    "stop",
    "lowercase"
  ],
  "text": "You are in the house."
}

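Aggregation

Bucket: a set of documents that satisfy a criterion

Metric: mathematical calculations over documents

Pipeline: a second aggregation over the results of other aggregations

Matrix: operates on multiple fields and returns a result matrix

Bucket is similar to GROUP BY in SQL

Metric is similar to the aggregate functions in SQL

## Count documents by gender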
GET kibana_sample_data_ecommerce/_search
{
  "size": 0,
  "aggs": {
    "flight_dest": {
      "terms": {
        "field": "customer_gender"
      }
    }
  }
}

## Result
"buckets" : [
  {
    "key" : "FEMALE",
    "doc_count" : 2433
  },
  {
    "key" : "MALE",
    "doc_count" : 2242
  }
]

## Nested aggregation: compute the average price inside each bucket
GET kibana_sample_data_ecommerce/_search
{
  "size": 0,
  "aggs": {
    "flight_dest": {
      "terms": {
        "field": "day_of_week"
      },
      "aggs": {
        "avg_price": {
          "avg": {
            "field": "products.base_price"
          }
        }
      }
    }
  }
}
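
What the nested request computes can be sketched in plain Python: group the documents into buckets by one field, then compute a metric inside each bucket. The sample orders and the helper name terms_with_avg are made up for illustration and are not taken from the kibana_sample_data_ecommerce index.

```python
from collections import defaultdict


def terms_with_avg(docs, bucket_field, metric_field):
    """Bucket documents by bucket_field, then average metric_field inside each bucket."""
    buckets = defaultdict(list)
    for doc in docs:
        buckets[doc[bucket_field]].append(doc[metric_field])
    return {
        key: {"doc_count": len(values), "avg": sum(values) / len(values)}
        for key, values in buckets.items()
    }


# Hypothetical documents, not real sample data.
orders = [
    {"day_of_week": "Monday", "base_price": 10.0},
    {"day_of_week": "Monday", "base_price": 20.0},
    {"day_of_week": "Tuesday", "base_price": 30.0},
]

result = terms_with_avg(orders, "day_of_week", "base_price")
print(result["Monday"])   # {'doc_count': 2, 'avg': 15.0}
print(result["Tuesday"])  # {'doc_count': 1, 'avg': 30.0}
```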

Queries

A term is the smallest unit that carries meaning

## Add a few documents
POST /product/_doc/1
{
  "productId":"XHDK-12-#f",
  "desc":"iPhone"
}
POST /product/_doc/2
{
  "productId":"BHDK-22-#f",
  "desc":"iPad"
}
POST /product/_doc/3
{
  "productId":"CHDK-32-#f",
  "desc":"MBP"
}

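## A term query does not analyze its input, whereas the indexed text was analyzed (iPhone => iphone)
## So "value": "iPhone" finds nothing; "value": "iphone" finds the document
POST /product/_search
{
  "query": {
    "term": {
      "desc": {
        "value": "iphone"
      }
    }
  }
}

## Querying the keyword sub-field with the original value also finds it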
POST /product/_search
{
  "query": {
    "term": {
      "desc.keyword": {
        "value": "iPhone"
      }
    }
  }
}

## Analyze the text
POST /_analyze
{
  "analyzer": "standard",
  "text": ["iPhone"]
}

{
  "tokens" : [
    {
      "token" : "iphone",
      "start_offset" : 0,
      "end_offset" : 6,
      "type" : "<ALPHANUM>",
      "position" : 0
    }
  ]
}
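
The difference between the two query styles comes down to whether the query input is analyzed before the lookup. Here is a tiny Python sketch of that behaviour. Lowercasing and splitting on whitespace stands in for the standard analyzer, and term_query and match_query are illustrative names, not client APIs.

```python
def analyze(text):
    """Stand-in for the standard analyzer: lowercase and split on whitespace."""
    return text.lower().split()


# At index time the text field is analyzed, so only 'iphone' is stored.
indexed_terms = set(analyze("iPhone"))


def term_query(value):
    """Look the value up exactly as given, without analysis."""
    return value in indexed_terms


def match_query(text):
    """Analyze the input first, then look up the resulting terms."""
    return any(term in indexed_terms for term in analyze(text))


print(term_query("iPhone"))   # False
print(term_query("iphone"))   # True
print(match_query("iPhone"))  # True
```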

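## Wrapping the query in a constant_score filter skips score calculation and avoids unnecessary overhead
## Filters can make effective use of caching, which speeds up repeated queries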
POST /product/_search
{
  "query": {
    "constant_score": {
      "filter": {
        "term": {
          "desc.keyword": "iPhone"
        }
      },
      "boost": 1.2
    }
  }
}

Match Query / Match Phrase Query / Query String Query

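Text is analyzed both at index time and at search time; at query time the input is first analyzed into a list of terms, which are then looked up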

POST movies/_search
{
  "query": {
    "match": {
      "name": "chenglong"
    }
  }
}

Structured search

Dates, booleans and numbers are structured data

They can be queried with Term and Prefix queries

## Add some data
POST /product/_bulk
{ "index":{"_id":1}}
{"price":10,"avaliable":true,"date":"2020-05-22","productId":"XXX-1","tag":"one"}
{ "index":{"_id":2}}
{"price":20,"avaliable":false,"date":"2019-05-22","productId":"XXX-2","tag":["one","two"]}
{ "index":{"_id":3}}
{"price":30,"avaliable":false,"productId":"XXX-3"}

## term query on a boolean
POST /product/_search
{
  "query": {
    "constant_score": {
      "filter": {
        "term": {
          "avaliable": true
        }
      }
    }
  }
}

## range query on a number
POST /product/_search
{
  "query": {
    "constant_score": {
      "filter": {
        "range": {
          "price": {
            "gte": 10,
            "lte": 20
          }
        }
      }
    }
  }
}

## range query on a date; date math units:
## y years
## M months
## w weeks
## d days
## H/h hours
## m minutes
## s seconds
POST /product/_search
{
  "query": {
    "constant_score": {
      "filter": {
        "range": {
          "date": {
            "gte": "now-1y"
          }
        }
      }
    }
  }
}

## exists query: find the documents in which the field is present
POST /product/_search
{
  "query": {
    "constant_score": {
      "filter": {
        "exists": {
          "field": "date"
        }
      }
    }
  }
}

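## On a multi-valued field, term matches documents that contain the value, not only those equal to it
## This returns both the document tagged one and the document tagged one, two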
POST /product/_search
{
  "query": {
    "constant_score": {
      "filter": {
        "term": {
          "tag.keyword": "one"
        }
      }
    }
  }
}

## To return only the document tagged exactly one,
## add a tag_count field and combine it with a bool query

Relevance scoring

TF-IDF

BM25

Adding "explain": true to a search request shows how each score was calculated
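
Both TF-IDF and BM25 combine how often a term occurs in a document with how rare the term is across the index. BM25 additionally saturates the term frequency and normalizes by document length. The Python sketch below uses the textbook BM25 formula with the common defaults k1 = 1.2 and b = 0.75. It is an illustration only: bm25_score, the pre-tokenized sample documents and the exact constant factors are assumptions, so the numbers will not match what "explain": true reports.

```python
import math


def bm25_score(query_terms, doc, docs, k1=1.2, b=0.75):
    """Textbook BM25 score of one tokenized document against a list of query terms."""
    total_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / total_docs
    score = 0.0
    for term in query_terms:
        docs_with_term = sum(1 for d in docs if term in d)
        if docs_with_term == 0:
            continue  # a term that occurs nowhere contributes nothing
        # Rare terms get a higher inverse document frequency.
        idf = math.log(1 + (total_docs - docs_with_term + 0.5) / (docs_with_term + 0.5))
        freq = doc.count(term)
        # Term frequency saturates with k1 and is normalized by document length with b.
        tf = freq * (k1 + 1) / (freq + k1 * (1 - b + b * len(doc) / avg_len))
        score += idf * tf
    return score


# Hypothetical, already tokenized documents.
docs = [
    ["quick", "brown", "rabbits"],
    ["keeping", "pets", "healthy"],
    ["brown", "fox"],
]

fox_score = bm25_score(["fox"], docs[2], docs)
brown_score = bm25_score(["brown"], docs[2], docs)
print(fox_score > brown_score)  # True: fox is rarer than brown, so it weighs more
```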

bool Query

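must: must match, contributes to the score

should: optional match, contributes to the score

must_not: must not match

filter: must match, does not contribute to the score

bool queries can be nested

Changing the nesting structure changes the scoring

## boost adjusts the score contributed by a clause
## Changing the boost of the tag and price clauses changes the order of the results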
POST /product/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "match": {
            "tag": {
              "query": "one",
              "boost": 1
            }
          }
        },
        {
          "match": {
            "price": {
              "query": "30",
              "boost": 1
            }
          }
        }
      ]
    }
  }
}

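## A boosting query keeps the documents that match positive and lowers the score of those that also match negative (multiplied by negative_boost)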
POST /product/_search
{
  "query": {
    "boosting": {
      "positive": {
        "match": {
          "tag": "one"
        }
      },
      "negative": {
         "match": {
          "tag": "two"
        }
      },
      "negative_boost": 0.2
    }
  }
}

Single query string across multiple fields

POST /product/_bulk
{ "index":{"_id":1}}
{"title":"Quick brown rabbits","body":"Brown rabbits are commonly seen"}
{ "index":{"_id":2}}
{"title":"Keeping pets healthy","body":"My quick brown fox eats rabbits on a regular basis"}

POST /product/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "match": {
            "title": "Brown fox"
          }
        },
        {
          "match": {
            "body": "Brown fox"
          }
        }
      ]
    }
  }
}

POST /product/_search
{
  "query": {
    "dis_max": {
      "queries": [
        {
          "match": {
            "title": "Quick fox"
          }
        },
        {
          "match": {
            "body": "Quick fox"
          }
        }
      ]
    }
  }
}

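## If several documents end up with the same score, a tie_breaker can be added to differentiate them
## tie_breaker is a float between 0 and 1
## 0 uses only the best-matching clause
## 1 treats all clauses as equally important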
POST /product/_search
{
  "query": {
    "dis_max": {
      "queries": [
        {
          "match": {
            "title": "Quick pets"
          }
        },
        {
          "match": {
            "body": "Quick pets"
          }
        }
      ],
      "tie_breaker": 0.7
    }
  }
}
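
A bool query with should clauses adds up the scores of the matching clauses, while dis_max takes the best clause and adds tie_breaker times the scores of the others. The Python sketch below applies both rules to made-up clause scores. The numbers and helper names are assumptions for illustration.

```python
def bool_should_score(clause_scores):
    """bool/should: the scores of all matching clauses are added together."""
    return sum(clause_scores)


def dis_max_score(clause_scores, tie_breaker=0.0):
    """dis_max: best clause score plus tie_breaker times the other clause scores."""
    if not clause_scores:
        return 0.0
    best = max(clause_scores)
    return best + tie_breaker * (sum(clause_scores) - best)


# Hypothetical scores of the title clause and the body clause for one document.
scores = [2.0, 1.0]

print(dis_max_score(scores))                   # 2.0, only the best clause counts
print(dis_max_score(scores, tie_breaker=0.5))  # 2.5
print(dis_max_score(scores, tie_breaker=1.0))  # 3.0, same as bool_should_score(scores)
```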

multi_match query

//LCLTODO I do not fully understand this one yet

POST /product/_search
{
  "query": {
    "multi_match": {
      "query": "brown",
      "fields": ["title","body"]
    }
  }
}

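Chinese analyzers

hanlp

icu

pinyin

Search Template

Decoupling: the query definition is stored in Elasticsearch, and the caller only supplies the parameters

## Create a search template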
POST _scripts/queryProduct
{
  "script": {
    "lang": "mustache",
    "source": {
      "query": {
        "multi_match": {
          "query": "{{q}}",
          "fields": [
            "title"
          ]
        }
      }
    }
  }
}

GET _scripts/queryProduct

## Search with the template
POST product/_search/template
{
  "id":"queryProduct",
  "params": {
    "q":"pets"
  }
}

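Function Score Query

After the query has run, every matching document can be rescored by a series of functions, and the results are sorted by the new score

The built-in score functions:

Weight: sets a simple, non-normalized weight for each document
Field Value Factor: uses the value of a field to modify _score
Random Score: assigns each document a random score
Decay functions: score by a field's distance from a given value; the closer, the higher the score
Script Score: a custom script takes full control of the scoring logic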

PUT shop/_doc/1
{
  "title": "Apple pie",
  "price": 8
}

PUT shop/_doc/2
{
  "title": "Orange pie",
  "price": 3
}

PUT shop/_doc/3
{
  "title": "Watermelon pie",
  "price": 6
}

POST /shop/_search
{
  "query": {
    "function_score": {
      "query": {
        "multi_match": {
          "query": "pie",
          "fields": "title"
        }
      },
      "field_value_factor": {
        "field": "price"
      }
    }
  }
}
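
With its default settings, field_value_factor multiplies the query score by the field value, optionally scaled by factor and passed through a modifier. The Python sketch below covers that case for two modifiers only. The function name, the fixed query score of 1.0 and the restriction to the multiply boost mode are assumptions for illustration.

```python
import math


def field_value_factor_score(query_score, field_value, factor=1.0, modifier=None):
    """New score = query score * modifier(factor * field value), i.e. boost_mode multiply."""
    value = factor * field_value
    if modifier == "log1p":
        value = math.log10(1 + value)  # add 1, then take the common logarithm
    elif modifier == "sqrt":
        value = math.sqrt(value)
    return query_score * value


# Prices of the three pies, with a hypothetical identical query score of 1.0.
prices = {"Apple pie": 8, "Orange pie": 3, "Watermelon pie": 6}
ranked = sorted(
    prices,
    key=lambda title: field_value_factor_score(1.0, prices[title]),
    reverse=True,
)
print(ranked)  # ['Apple pie', 'Watermelon pie', 'Orange pie']
```

A modifier such as log1p flattens large field values, so a very expensive item does not dominate the ranking.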

posted @ 2020-05-24 20:33 LiuChengloong
