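Getting Started with Elasticsearch

Overview

This section collects assorted background knowledge in one place.

Elasticsearch vs. relational databases

Elasticsearch is document-oriented, which means that the smallest unit of data it indexes and searches is the document.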

Relational database	Elasticsearch
Database	Index (indices)
Table	Type (types)
Row	Document (documents)
Column	Field (fields)

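Download and compatibility

The Elastic Chinese community hosts a download mirror, and it is very fast.
Each Elasticsearch version is compatible with a different set of JDK versions; see the official website for details. Operating system and other compatibility information is listed there as well.
Elasticsearch 7.x ships with a bundled JDK.

Ports

Elasticsearch port: 9200
Kibana port: 5601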

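Elasticsearch directory layout

bin  		startup scripts
config  	configuration files
		log4j2.properties  logging configuration
		jvm.options        JVM settings
		elasticsearch.yml  Elasticsearch configuration; the HTTP port defaults to 9200
lib    		required JAR files
logs  		log files
modules  	feature modules
plugins  	plugins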

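Shards

Shards are the key to how Elasticsearch distributes data across a cluster.
Think of a shard as a container for data. Documents are stored in shards, and shards are allocated to the nodes in the cluster. When the cluster grows or shrinks, Elasticsearch automatically migrates shards between nodes to keep the cluster balanced. A shard is the lowest-level "worker unit", and it holds only part of the data in an index.
Shards are either primary shards or replica shards.

Primary shard: every write goes to a primary shard first. The number of primary shards determines the maximum amount of data an index can hold.
Replica shard: a copy of a primary shard. It protects against data loss caused by hardware failure and can also serve read requests, such as searching or retrieving a document.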

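Inverted index

The inverted index is the core of a search engine, whose main goal is to find the documents that match a search condition quickly. The inverted index in Elasticsearch is simply Lucene's inverted index. Unlike a traditional forward index, an inverted index associates each term with the documents that contain it when the data is stored, and saves these associations in a posting list. At query time the query text is tokenized, the tokens are looked up in the posting list, and the matching documents are returned.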
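
To make the idea concrete, here is a minimal sketch of an inverted index in plain Python. It is a toy illustration only, not how Lucene stores its data: the whitespace tokenizer and the helper names (tokenize, build_inverted_index, search) are assumptions made for this example.

```python
from collections import defaultdict


def tokenize(text):
    """Naive tokenizer for the example: lowercase and split on whitespace."""
    return text.lower().split()


def build_inverted_index(docs):
    """Map each term to the sorted list of document IDs that contain it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


def search(index, query):
    """Return the IDs of documents that contain at least one query term."""
    hits = set()
    for term in tokenize(query):
        hits.update(index.get(term, []))
    return sorted(hits)


docs = {
    1: "Elasticsearch is built on Lucene",
    2: "Lucene uses an inverted index",
    3: "Kibana visualizes data",
}
index = build_inverted_index(docs)
print(index["lucene"])                   # [1, 2]
print(search(index, "inverted lucene"))  # [1, 2]
```

The lookup cost depends on the number of query terms rather than on the number of documents, which is why this structure suits full-text search.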

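IK analyzer

Elasticsearch's default analyzer does a poor job on Chinese text: it splits a field into individual characters, and the search phrase is split the same way at query time. This section therefore introduces the more capable IK analyzer.

The IK analyzer provides two tokenization modes: ik_smart (coarsest split) and ik_max_word (finest-grained split).

Installation
- Download the release that matches your Elasticsearch version from GitHub, or from the mirror site mentioned above.
- Unzip it into the Elasticsearch plugins directory: unzip elasticsearch-analysis-ik-7.9.3.zip -d /opt/es7.9.3/plugins/ik
- Restart Elasticsearch.
- Run elasticsearch-plugin list (found in the Elasticsearch bin directory) to list the installed plugins.

Testing (with Kibana)

ik_smart splits coarsely: the text is cut at break points into the largest possible words, and no tokens overlap.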

GET _analyze
{
  "analyzer": "ik_smart",
  "text": "逆水行舟"

}

Response:

{
  "tokens" : [
	{
	  "token" : "逆水行舟",
	  "start_offset" : 0,
	  "end_offset" : 4,
	  "type" : "CN_WORD",
	  "position" : 0
	}
  ]
}

ik_max_word exhausts the possible word combinations and gives the finest-grained split.

GET _analyze
{
  "analyzer": "ik_max_word",
  "text": "逆水行舟"

}

Response:

{
  "tokens" : [
	{
	  "token" : "逆水行舟",
	  "start_offset" : 0,
	  "end_offset" : 4,
	  "type" : "CN_WORD",
	  "position" : 0
	},
	{
	  "token" : "逆水",
	  "start_offset" : 0,
	  "end_offset" : 2,
	  "type" : "CN_WORD",
	  "position" : 1
	},
	{
	  "token" : "行舟",
	  "start_offset" : 2,
	  "end_offset" : 4,
	  "type" : "CN_WORD",
	  "position" : 2
	}
  ]
}

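Adding custom words
The IK analyzer may fail to recognize some words, such as personal names like “张继科” or our own nicknames, so we have to add them to the dictionary by hand.

Create a dictionary in the IK analyzer's config directory: vim /opt/elasticsearch-7.9.3/plugins/ik/config/name.dic
name.dic: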

张继科
马龙
王皓
许昕

Register the dictionary in the IK analyzer configuration: vim /opt/elasticsearch-7.9.3/plugins/ik/config/IKAnalyzer.cfg.xml
IKAnalyzer.cfg.xml:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
	<comment>IK Analyzer extension configuration</comment>
	<!-- Configure your own extension dictionaries here -->
	<entry key="ext_dict">name.dic</entry>
	<!-- Configure your own extension stopword dictionaries here -->
	<entry key="ext_stopwords"></entry>
	<!-- Configure a remote extension dictionary here -->
	<!-- <entry key="remote_ext_dict">words_location</entry> -->
	<!-- Configure a remote extension stopword dictionary here -->
	<!-- <entry key="remote_ext_stopwords">words_location</entry> -->
</properties>

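Restart Elasticsearch and run the analysis again; the custom words are now recognized.

Specifying the analyzer when creating an index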

curl -X PUT "localhost:9200/user" -H "Content-Type: application/json" -d '{
    "settings": {
        "analysis": {
            "analyzer": {
                "ik": {
                    "tokenizer": "ik_max_word"
                }
            }
        }
    }
}'

Changing the analyzer of an existing index
Approach: create a new index with the correct mapping, then use reindex to copy the data from the old index into the new one.
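
The reindex step can be sketched as follows. This is a minimal sketch using only the standard library: the target index name user_v2 is hypothetical, and post_json is a helper written for this example, not part of any client library. The request itself is left commented out because it needs a running cluster in which the new index has already been created with the desired mapping.

```python
import json
import urllib.request


def build_reindex_body(source_index, dest_index):
    """Request body for POST /_reindex: copy all documents from source to dest."""
    return {"source": {"index": source_index}, "dest": {"index": dest_index}}


def post_json(url, body):
    """Send a JSON POST request and return the decoded JSON response."""
    request = urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


# "user_v2" is a hypothetical name for the new index with the corrected mapping.
body = build_reindex_body("user", "user_v2")
print(json.dumps(body))

# With a running cluster and an existing user_v2 index:
# print(post_json("http://localhost:9200/_reindex", body))
```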

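Kibana

There are plenty of introductions online to what Kibana does, so it is not repeated here.

Installing Kibana

Download the RPM package from the official website or from a mirror site.
You can also download the sha512 file to verify it: sha512sum -c kibana-xxx.rpm.sha512
Install with rpm: rpm -ivh kibana-7.4.0-x86_64.rpm
Edit the configuration file
Configuration file: /etc/kibana/kibana.yml.
For the available settings, refer to the official documentation.

1. Start Kibana: systemctl start kibana
The Kibana program directory is /usr/share/kibana, where:

plugins is the plugin directory
bin is the directory of executables
The data directory is /var/lib/kibana
2. Start on boot: systemctl enable kibana
3. Firewall: firewall-cmd --add-port=5601/tcp --permanent, then firewall-cmd --reload
5601 is the default port; use whatever you have configured.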

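REST style

REST is an architectural style, and Elasticsearch exposes a RESTful API:

method	url	Description
PUT	`localhost:9200/index_name/type_name/doc_id`	Create a document (with the given ID)
POST	`localhost:9200/index_name/type_name`	Create a document (random ID)
POST	`localhost:9200/index_name/type_name/doc_id/_update`	Update a document
DELETE	`localhost:9200/index_name/type_name/doc_id`	Delete a document
GET	`localhost:9200/index_name/type_name/doc_id`	Get a document by ID
POST	`localhost:9200/index_name/type_name/_search`	Search with the conditions in the request body

In 7.x the type name is always _doc, and the update endpoint is written as `localhost:9200/index_name/_update/doc_id`.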
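Indices

Creating

Automatic creation

curl -X PUT "localhost:9200/user?pretty"

Creates an index named user.
The pretty parameter asks Elasticsearch to pretty-print the JSON response.

When a document is written to an index that does not exist, the index is created automatically.
With this mechanism there is no need to define mappings by hand; Elasticsearch infers the field types from the document contents.
Sometimes Elasticsearch infers the wrong type, for example for geolocation data.
When a type is inferred incorrectly, some features may not work properly, such as range queries.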
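
The following sketch loosely imitates how field types are inferred from a JSON document, and shows why geolocation data is a typical failure case. It is a simplified approximation, not Elasticsearch's actual implementation: the date pattern is a stand-in for real date detection, and infer_type is a name invented for this example.

```python
import re

# Simplified stand-in for date detection: only ISO-like strings are recognized.
ISO_DATE = re.compile(r"^\d{4}-\d{2}-\d{2}(T\d{2}:\d{2}:\d{2})?$")


def infer_type(value):
    """Guess a field type from a JSON value, roughly like dynamic mapping does."""
    if isinstance(value, bool):  # test bool first: bool is a subclass of int
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "float"
    if isinstance(value, str):
        return "date" if ISO_DATE.match(value) else "text"
    if isinstance(value, dict):
        return {"properties": {key: infer_type(val) for key, val in value.items()}}
    if isinstance(value, list) and value:
        return infer_type(value[0])  # arrays take the type of their first element
    return None  # null or empty array: nothing can be inferred


doc = {
    "name": "lczmx",
    "age": 22,
    "isteacher": False,
    "createdate": "2021-07-06",
    "location": {"lat": 41.12, "lon": -71.34},
}
mapping = {field: infer_type(value) for field, value in doc.items()}
print(mapping)
```

Note how location comes out as a plain object with two float fields rather than a geo_point, so geo queries would not work on it unless the mapping is defined explicitly.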

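Custom creation

Use a mapping. A mapping is similar to a schema in a relational database and mainly covers the following:

The names of the fields in the index
The data type of each field, such as string, number, or boolean
Inverted-index settings for each field, such as whether it is analyzed and which analyzer to use
Note: from 7.x on, a mapping belongs to exactly one type of an index, and the default type is _doc

Main mapping field types:

Category	Types
String	text, keyword
Numeric	long, integer, short, byte, double, float, half_float, scaled_float
Date	date, date_nanos
Boolean	boolean
Binary	binary
Range	integer_range, float_range, long_range, double_range, date_range
See the official documentation for all field types and their parameters.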

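Differences between text and keyword:

text
Analyzed (tokenized) by default and supports full-text matching (the string type has been deprecated since 5.x; use text instead).
keyword
Not analyzed. keyword fields have doc_values enabled by default to speed up aggregations and sorting, which costs a lot of disk I/O; disable doc_values if you do not need them.

text and keyword also use different analyzers: text uses standard, while keyword uses keyword.
You can compare the two with GET _analyze.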
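
The difference between the two analyzers can be illustrated without a cluster. The sketch below is only a rough imitation: standard_like_analyze approximates the standard analyzer by splitting on non-word characters and lowercasing, which is an assumption of this example and ignores the Unicode segmentation rules the real analyzer follows.

```python
import re


def standard_like_analyze(text):
    """Rough stand-in for the standard analyzer: split into words and lowercase."""
    return [token.lower() for token in re.findall(r"\w+", text)]


def keyword_analyze(text):
    """The keyword analyzer emits the whole input as one unchanged token."""
    return [text]


value = "Quick Brown-Fox"
print(standard_like_analyze(value))  # ['quick', 'brown', 'fox']
print(keyword_analyze(value))        # ['Quick Brown-Fox']
```

This is why a term query for "Quick Brown-Fox" matches a keyword field but not a text field: the text field only contains the individual lowercased tokens.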

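For nested data types and explanations of the types above, see: Elasticsearch 7.x 映射（Mapping）中的字段类型

Types added in 7.x:
alias: does not exist as stored data; it is an alias mapped onto an existing field. Searching the alias returns essentially the same results as searching the real field.
date_nanos: another date type with nanosecond precision, used like date.
rank_feature / rank_features: store numeric feature values. Values must be positive (zero and negative numbers are not allowed), and the fields can only be queried with the rank_feature query. They mainly prepare for later machine-learning features.
dense_vector / sparse_vector: store feature vectors, in dense or sparse form. They mainly prepare for later machine-learning features.

settings holds the settings of the index, such as how its data is distributed.
mappings defines the types of the document fields.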

curl -X PUT "localhost:9200/user" -H "Content-Type: application/json" -d '{
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 2
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "text"
      },
      "age": {
        "type": "integer"
      },
      "isteacher": {
        "type": "boolean"
      },
      "createdate": {
        "type": "date",
        "format": "yyyy-MM-dd HH:mm:ss||yyyy-MM-dd||epoch_millis"
      }
    }
  }
}'

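number_of_shards sets the number of primary shards of the index. It defaults to 1 and cannot be changed once the index is created.
number_of_replicas sets the number of replica shards per primary shard. It defaults to 1 and can be changed later:
curl -X PUT "localhost:9200/user/_settings" -H "Content-Type: application/json" -d '{
	"number_of_replicas": 2
}'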
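
One way to see why the primary shard count is fixed: a document's shard is chosen from a hash of its routing value (its ID by default) modulo the number of primary shards. The sketch below uses a simplified form of that rule. zlib.crc32 is only a stand-in hash chosen for this example (Elasticsearch uses Murmur3), and pick_shard is a hypothetical helper.

```python
import zlib


def pick_shard(routing, number_of_primary_shards):
    """Simplified routing: hash(routing) % number_of_primary_shards.

    zlib.crc32 is a stand-in hash for illustration only.
    """
    return zlib.crc32(routing.encode("utf-8")) % number_of_primary_shards


doc_ids = ["1", "2", "3", "4", "5", "6"]
with_3_shards = {doc_id: pick_shard(doc_id, 3) for doc_id in doc_ids}
with_4_shards = {doc_id: pick_shard(doc_id, 4) for doc_id in doc_ids}
moved = [d for d in doc_ids if with_3_shards[d] != with_4_shards[d]]

print(with_3_shards)
print(with_4_shards)
print("documents that would map to a different shard:", moved)
```

If the shard count changed, existing documents would no longer be found on the shard the formula points to, whereas adding replicas only copies shards and leaves the routing untouched.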

A mapping can also be created like this:

curl -X PUT "localhost:9200/user/_mapping" -H "Content-Type: application/json" -d '{
  "properties": {
    "name": {
      "type": "text",
      "index": true
    },
    "sex": {
      "type": "keyword",
      "index": true
    },
    "tel": {
      "type": "keyword",
      "index": false
    }
  }
}'

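Field parameters

Parameter	Description
index	Whether the field is indexed; if false, the field cannot be searched directly
type	Field type
doc_values	true or false. Column-oriented storage designed for fast aggregation and sorting; can be disabled for fields that are not used that way
ignore_malformed	Whether to ignore malformed data. If true, a value with the wrong format or type is ignored and the other fields are indexed normally; if false, the whole document is rejected as soon as a value does not conform.

For more (including the settings parameters), see the article elasticsearch简介和elasticsearch_dsl

Creating with Python

Note: an analyzer you specify must be installed first, and the mapping must no longer be nested under _doc when creating the index, otherwise an error is raised.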

from elasticsearch import Elasticsearch
# from elasticsearch import AsyncElasticsearch

es = Elasticsearch(host="localhost", port=9200)
# es = AsyncElasticsearch()

body = {
    "settings": {
        "number_of_shards": 3,
        "number_of_replicas": 2
    },
    "mappings": {
        # In 7.x, "properties" goes directly under "mappings" (no "_doc" level)
        "properties": {
            "id": {
                "type": "integer",
            },
            "text": {
                "type": "text",
                # "analyzer": "ik_max_word",  # IK analyzer for Chinese text; requires the IK plugin
                "index": False
            },
            "userId": {
                "type": "long",
            },
            "reprinted": {
                "type": "keyword",
            },
        }
    }
}

# Create the index
es.indices.create(index="test", body=body)

Specifying the analyzer

curl -X PUT "localhost:9200/user" -H "Content-Type: application/json" -d '{
    "settings": {
        "analysis": {
            "analyzer": {
                "ik": {
                    "tokenizer": "ik_max_word"
                }
            }
        }
    },
    "mappings": {
        "properties": {
            "title": {
                "type": "text"
            },
            "content": {
                "type": "text",
                "analyzer": "ik_max_word"
            }
        }
    }
}'

Updating

curl -X PUT "localhost:9200/user/_mapping" -H "Content-Type: application/json" -d '{
  "properties": {
    "name": {
      "type": "text",
      "index": true
    },
    "sex": {
      "type": "keyword",
      "index": true
    },
    "tel": {
      "type": "keyword",
      "index": false
    }
  }
}'

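Once an index has been created, its mapping can only be extended with new fields; existing fields in the mapping cannot be modified or deleted. By default dynamic mapping is on (dynamic = true), so the types of new fields are inferred and added to the mapping automatically.

For more, see the article: ES文档字段删除与索引重建.

Querying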

curl

curl -X GET "localhost:9200/_cat/indices?v"

Alternatively:

curl -X GET "localhost:9200/index_name/"
# index_name is the name of the index to look up; its definition is returned if it exists

python


from elasticsearch import Elasticsearch

es = Elasticsearch(["127.0.0.1:9200"])

index_name = 'student'

if es.indices.exists(index=index_name):
    print('Index exists')
else:
    print('Index does not exist')

Deleting

curl

curl -X DELETE  '127.0.0.1:9200/user'

python

from elasticsearch import Elasticsearch

es = Elasticsearch(["127.0.0.1:9200"])

es.indices.delete(index='student')

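Documents

Elasticsearch is document-oriented. A document is the smallest searchable unit, it is serialized as JSON, and every document has a unique ID (either specified by you or generated automatically).

Adding

curl

Creating a document with an explicit ID is idempotent, so PUT is used.

curl -X PUT "localhost:9200/customer/_doc/1" -H 'Content-Type: application/json' -d '{ "name": "Jane Doe", "age": 20 }'

If no ID is given (localhost:9200/customer/_doc/), one is generated automatically. This is not idempotent, so POST is used.

curl -X POST "localhost:9200/customer/_doc/" -H 'Content-Type: application/json' -d '{ "name": "Jane Doe", "age": 20 }'

You can also create a document through localhost:9200/customer/_create/ followed by the ID, which fails if a document with that ID already exists:

curl -X PUT "localhost:9200/customer/_create/3" -H 'Content-Type: application/json' -d '{ "name": "Jane Doe", "age": 20 }'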

python

from elasticsearch import Elasticsearch

index_name = 'my_index'
es = Elasticsearch(['127.0.0.1:9200'])

# Index (create or overwrite) the document with ID 1
es.index(index=index_name, id='1', body={
    'name': '法外狂徒-张三',
    'id': 1,
})

Deleting

curl

curl -X DELETE "localhost:9200/customer/_doc/2?pretty"

python

from elasticsearch import Elasticsearch
index_name = 'my_index'
 
es = Elasticsearch(['127.0.0.1:9200'])
es.delete(index=index_name, id=1)

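Updating

curl

There are two ways to update a document: full replacement and partial update. They differ in the HTTP method (PUT for the former, POST for the latter), and the URL and request body differ as well.

# Full replacement (overwrites the document)
curl -X PUT "localhost:9200/user/_doc/1?pretty" -H 'Content-Type: application/json' -d '{"name":"张三"}'

Partial update

curl -X POST "localhost:9200/user/_update/1?pretty" -H 'Content-Type: application/json' -d'
{
  "script" : "ctx._source.age += 5"
}'
# ctx._source refers to the current source document

# Or
curl -X POST "localhost:9200/user/_update/1?pretty" -H 'Content-Type: application/json' -d '{"doc":{"name":"张三"}}'
# doc holds the fields to merge into the current document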

python

from elasticsearch import Elasticsearch

index_name = 'my_index'
es = Elasticsearch(['127.0.0.1:9200'])

# Partial update: merge the given fields into document 1
es.update(index=index_name, id=1, body={"doc": {"name": "张三"}})
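
The difference between the two kinds of update can be shown on plain dictionaries. This is a simplified model written for illustration: full_replace and partial_update are hypothetical helpers, and the merge here is shallow, whereas a real partial update also merges nested objects.

```python
def full_replace(stored, new_doc):
    """Models PUT /index/_doc/id: the new body replaces the stored document."""
    return dict(new_doc)


def partial_update(stored, partial_doc):
    """Models POST /index/_update/id with {"doc": ...}: fields are merged in."""
    merged = dict(stored)
    merged.update(partial_doc)
    return merged


stored = {"name": "Jane Doe", "age": 20}
print(full_replace(stored, {"name": "张三"}))    # {'name': '张三'}
print(partial_update(stored, {"name": "张三"}))  # {'name': '张三', 'age': 20}
```

After a full replacement the age field is gone, while the partial update keeps it.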

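Querying

You can query with a URI search such as curl -X GET "localhost:9200/test/_search?q=name:lczmx", but this is not recommended; sending the query as JSON is better.

Lookup by primary key: fetch a single document by its ID.

curl -X GET "localhost:9200/test/_doc/1?pretty"

For full-text search, use curl -X GET "localhost:9200/test/_search"; see the DSL section.

DSL

The Query DSL is Elasticsearch's query syntax. A query has the following form: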

curl -X GET "localhost:9200/bank/_search" -H 'Content-Type: application/json' -d'
{
  "query": { "match_all": {} }
}
'

Response

For example:

{
  "took" : 12,
  "timed_out" : false,
  "_shards" : {
    "total" : 3,
    "successful" : 3,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "test",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 1.0,
        "_source" : {
          "name" : "lczmx"
        }
      }
    ]
  }
}

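took: how long Elasticsearch took to run the search, in milliseconds
timed_out: whether the search timed out
_shards: how many shards were searched, and how many of them succeeded or failed
hits: the search results
hits.total: the total number of documents that match the search
hits.hits: the actual array of results (the first 10 documents by default)
hits.hits.sort: the sort key of each result (absent when sorting by score)
hits.hits._score: the relevance score of each result; the better the match, the higher the score
hits.max_score: the highest score among the matching documents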
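
As a small illustration of reading these fields, the sketch below pulls the interesting parts out of a response dictionary. The response literal is the sample shown above; summarize is a helper name made up for this example.

```python
response = {
    "took": 12,
    "timed_out": False,
    "_shards": {"total": 3, "successful": 3, "skipped": 0, "failed": 0},
    "hits": {
        "total": {"value": 1, "relation": "eq"},
        "max_score": 1.0,
        "hits": [
            {
                "_index": "test",
                "_type": "_doc",
                "_id": "1",
                "_score": 1.0,
                "_source": {"name": "lczmx"},
            }
        ],
    },
}


def summarize(resp):
    """Collect the commonly used parts of a search response."""
    hits = resp["hits"]
    return {
        "took_ms": resp["took"],
        "timed_out": resp["timed_out"],
        "failed_shards": resp["_shards"]["failed"],
        "total": hits["total"]["value"],
        "documents": [hit["_source"] for hit in hits["hits"]],
    }


summary = summarize(response)
print(summary["total"])      # 1
print(summary["documents"])  # [{'name': 'lczmx'}]
```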

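query

The DSL uses query to describe how to search. The main kinds of query are described below.

Exact matching: term

A term query looks up the given term directly in the inverted index for an exact match, so it is efficient.

term
Use term to match a field against a single value.
The following finds the records whose biz_id is 1909190023901225: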

curl -XGET "localhost:9200/xyerp/order/_search" -H 'Content-Type: application/json' -d '
{
 "query": {
     "term": {
       "biz_id": "1909190023901225"
      }
 }
}'

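This can be optimized further. Because this is an exact lookup, no relevance scoring is needed; we only want to include or exclude documents. Wrapping the term query in constant_score runs it in non-scoring mode and gives every match a uniform score of 1. The recommended form is: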

curl -XGET "localhost:9200/xyerp/order/_search" -H 'Content-Type: application/json' -d '
{  
    "query" : {  
        "constant_score" : {  
             "filter" : {  
                "term" : {  
                    "biz_id" : "1909190023901225"  
                }  
            }  
        }  
    }  
}'

terms
Use terms to match a field against one or more values, given as an array.

curl -XGET "localhost:9200/xyerp/order/_search" -H 'Content-Type: application/json' -d '
{
    "query":{
        "terms":{
            "biz_id":["1909190023901225"]
        }
    }
}'

Multiple term conditions: use bool

curl -XGET "localhost:9200/xyerp/order/_search" -H 'Content-Type: application/json' -d '
{
  "query": {
    "bool": {
      "should": [
        {
          "term": {
            "biz_id": "1909190023901225"
          }
        },
        {
          "term": {
            "name": "lczmx"
          }
        }
      ]
    }
  }
}
'

constant_score query

constant_score runs a query in non-scoring mode. The recommended form is:

{  
    "query" : {  
        "constant_score" : {  
             "filter" : {  
                "terms" : {  
                    "biz_id" : ["1909190023901225","e1909190111365113"]  
                }  
            }  
        }  
    }  
}

Matching queries: match

match_all

curl -X GET "localhost:9200/bank/_search" -H 'Content-Type: application/json' -d'
{
  "query": { "match_all": {} },
  "_source": ["account_number", "balance"]
}
'

Equivalent to:

SELECT account_number, balance FROM bank

match

curl -X GET "localhost:9200/bank/_search" -H 'Content-Type: application/json' -d'
{
  "query": { "match": { "account_number": 20 } }
}
'

Equivalent to:

SELECT * FROM bank WHERE account_number = 20

For a string field:

curl -X GET "localhost:9200/bank/_search" -H 'Content-Type: application/json' -d'
{
  "query": { "match": { "address": "mill" } }
}
'

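Roughly equivalent to:

SELECT * FROM bank WHERE address LIKE '%mill%'

Note: terms separated by spaces, as in { "match": { "address": "mill link" } }, mean mill OR link. A document that matches either term is returned, and its score shows how well it matches.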

multi_match

{
    "query":{
        "multi_match":{
            "query":"2501",
            "fields":["merchant_id","_id"]
        }
    }
}

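match_phrase
Matches the whole phrase: a document matches only if it contains all the terms of the phrase, next to each other and in the same order.

curl -X GET "localhost:9200/test/_search?pretty" -H  "Content-Type: application/json" -d '{"query":{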
"match_phrase":{"name": "lcz"}}}'

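bool query

A bool query has four kinds of clause: must, should, must_not, and filter. Each takes an array of conditions.

must: must match; equivalent to AND. Contributes to the score.
must_not: must not match; equivalent to NOT. Often used as a filtering clause, and does not contribute to the score.
should: optional match; equivalent to OR. At least one should clause has to match when there is no must or filter clause. Contributes to the score.
filter: filtering clause that must match; documents that match are kept and the rest are filtered out. Does not contribute to the score.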
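
The four clause types can be combined freely in one request body. Here is a minimal sketch of a helper that assembles such a body; bool_query is a name invented for this example, and the address and age fields are assumed to exist in the bank index used throughout this section.

```python
import json


def bool_query(must=None, should=None, must_not=None, filter_=None):
    """Assemble a bool query body, leaving out clause types that were not given."""
    clauses = {
        "must": must,
        "should": should,
        "must_not": must_not,
        "filter": filter_,
    }
    return {"query": {"bool": {name: conds for name, conds in clauses.items() if conds}}}


body = bool_query(
    must=[{"match": {"address": "mill"}}],
    must_not=[{"match": {"address": "lane"}}],
    filter_=[{"range": {"age": {"gte": 20, "lte": 30}}}],
)
print(json.dumps(body, indent=2))
```

Only the must clause affects the score here; must_not and filter simply include or exclude documents.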

curl -X GET "localhost:9200/bank/_search" -H 'Content-Type: application/json' -d'
{
  "query": {
    "bool": {
      "should": [
        { "match": { "address": "mill" } },
        { "match": { "address": "lane" } }
      ]
    }
  }
}
'
# address must match mill or lane

curl -X GET "localhost:9200/bank/_search" -H 'Content-Type: application/json' -d'
{
  "query": {
    "bool": {
      "must": [
        { "match": { "address": "mill" } },
        { "match": { "name": "lczmx" } }
      ]
    }
  }
}
'

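# Both the address and the name condition must match

bool is often used together with filter and range; see the examples in the next subsection.

filter query

A filter caches its results and does not compute relevance scores, so it runs very fast.

The following finds the documents whose status is active: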


{
  "query": {
    "bool": {
      "filter": {
        "term": {
          "status": "active"
        }
      }
    }
  }
}

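filter is also often combined with range queries. The operators available in a range are:

gt: greater than
lt: less than
gte: greater than or equal to
lte: less than or equal to

The following finds the transactions of merchant_id 2501 within a time range: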

{
  "query": {
    "bool": {
      "must": {
        "term": {
          "merchant_id": "2501"
        }
      }, 
      "filter": {
        "range": {
          "trade_finished_time": {
            "gte": "2019-09-01T00:00:00", 
            "lte": "2019-09-30T23:59:59"
          }
        }
      }
    }
  }
}

In the following query, must does the matching, filter does the filtering, and range defines the range:

{    
    "query": {    
        "bool": {    
            "must": [    
                {   
                    "match": {   
                        "title": "Search"   
                        }  
                },  
                {   
                    "match": {   
                    "content": "Elasticsearch"   
                    }  
                }  
            ],    
            "filter": [  
                {   
                    "term": {   
                        "status": "1"   
                        }  
                },  
                {   
                    "range": {   
                        "publish_date": {   
                        "gte": "2015-01-01"   
                        }  
                    }  
                }  
            ]  
        }  
     }  
}

Sorting with sort

Specifies how the results are ordered.

curl -X GET "localhost:9200/bank/_search" -H 'Content-Type: application/json' -d'
{
    "query": {
        "match_all": {}
    },
    "sort": [
      {
        "age": {
          "order": "desc"
        }
      }
    ]
}
'
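# Sorts by age in descending order (desc); use asc for ascending order

Because we sort explicitly, _score is null.

Pagination with size and from

The from parameter (0-based) specifies the offset of the first document to return, and size specifies how many documents to return from that offset. This is useful for pagination.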

curl -X GET "localhost:9200/bank/_search" -H 'Content-Type: application/json' -d'
{
  "query": { "match_all": {} },
  "from": 10,
  "size": 10
}
'

For pagination: from = (page number - 1) * page size
For example, with 10 documents per page, from for page 2 is (2 - 1) * 10 = 10
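
The formula is easy to wrap in a helper. The sketch below assumes 1-based page numbers; page_to_from and paged_query are names invented for this example.

```python
def page_to_from(page, page_size):
    """Convert a 1-based page number into the 0-based from offset."""
    if page < 1:
        raise ValueError("page numbers start at 1")
    return (page - 1) * page_size


def paged_query(page, page_size):
    """Build a match_all search body for the given page."""
    return {
        "query": {"match_all": {}},
        "from": page_to_from(page, page_size),
        "size": page_size,
    }


print(page_to_from(1, 10))  # 0
print(page_to_from(2, 10))  # 10
print(paged_query(3, 10))   # {'query': {'match_all': {}}, 'from': 20, 'size': 10}
```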

Limiting the returned fields with `_source`

Specifies which fields to return.

curl -X GET "localhost:9200/bank/_search" -H 'Content-Type: application/json' -d'
{
  "query": { "match_all": {} },
  "_source": ["account_number", "balance"]
}
'

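Equivalent to:

SELECT account_number, balance FROM bank

Aggregations with aggs

Runs aggregations.
You give the aggregation result a name and say which aggregation to use. If you only need the statistics, set size to 0.

Grouping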

curl -X GET "localhost:9200/bank/_search" -H 'Content-Type: application/json' -d'{
	"aggs": {
		"name_group": {
			"terms": {
				"field": "name"
			}
		}
	}
}'

# name_group is just a name of your own choosing

Average

curl -X GET "localhost:9200/bank/_search" -H 'Content-Type: application/json' -d'{
	"aggs":{
		"name_avg": {
			"avg": {
				"field": "price"
				}
			}
	}
}'

Highlighting with highlight

Specifies which fields are highlighted.

curl -X GET "localhost:9200/test/_search?pretty" -H  "Content-Type: application/json" -d '{
    "query": {"match": {
        "name": "lczmx"
    }},
    "highlight":{
        "fields": {
            "name": {}
        }
    }
}'

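# fields lists the fields to highlight; the value of each field is {}

By default the match is wrapped as <em>lczmx</em>. You can specify your own opening and closing tags:

curl -X GET "localhost:9200/test/_search?pretty" -H  "Content-Type: application/json" -d '{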
  "query": {
    "match": {
      "name": "lczmx"
    }
  },
  "highlight": {
    "pre_tags": "<p class='key' style='color:red;'>",
    "post_tags": "</p>", 
    "fields": {
      "name": {}
    }
  }
}'

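The match then becomes: <p class='key' style='color:red;'>lczmx</p>

Querying with Python

Using the elasticsearch module

Call the search method. The body uses the same syntax as the DSL, and the return value is a dictionary identical to what curl, Kibana, and other tools return.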

from elasticsearch import Elasticsearch

es = Elasticsearch(host="192.168.43.128", port=9200)

# Build the query body
body = {
    "query": {"match_all": {}},
    "from": 0,
    "size": 1
}
res = es.search(index="account", body=body)
print(type(res))  # <class 'dict'>
print(res.get("hits").get("hits"))
# [{'_index': 'account', '_type': '_doc', '_id': '1', '_score': 1.0, '_source': {'name': 'lczmx', 'age': 22, 'birthday': '2020/12/22 12:22:39'}}]

The elasticsearch_dsl module

See my other article: python elasticsearch_dsl模块

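Clusters

A single Elasticsearch node already forms a cluster, whose name defaults to elasticsearch. Behind the scenes Elasticsearch splits each index into shards, and each shard can be moved between the servers in the cluster.
Before 7.0 an index was created with 5 primary shards by default; since 7.0 the default is 1. Each primary shard has one replica (replica shard) by default.

An Elasticsearch cluster has three types of node:

Master node: manages cluster-wide changes, such as creating or deleting indices and adding or removing nodes. The master node does not need to take part in document-level changes or searches. Set with the node.master property.
Data node: stores data and the corresponding inverted indices. By default every node (including the master node) is a data node. Set with the node.data property.
Coordinating node: a node with both node.master and node.data set to false. It answers client requests and balances the load across nodes. Every node implicitly acts as a coordinating node even without this configuration.

Deployment

Unpack Elasticsearch

tar -zxvf elasticsearch-xxx.tar.gz -C /opt/module

Create a new user
For security reasons Elasticsearch refuses to run as root, so create a dedicated user while logged in as root:

useradd es # add the es user
passwd es # set a password for the es user
userdel -r es # if you made a mistake, delete the user and add it again
chown -R es:es /opt/module/elasticsearch-xxx # make es the owner of the directory

Edit the configuration file
By default Elasticsearch loads the configuration file at $ES_HOME/config/elasticsearch.yml.
Note: anything that can be set in the configuration file can also be given on the command line with the -E syntax, for example: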
```
./bin/elasticsearch -d -Ecluster.name=my_cluster -Enode.name=node_1
```
cluster.name sets the cluster name. A node can join only one cluster, and the default cluster name is elasticsearch.
```
cluster.name: search-7.3.2
```
node.name sets the name of each node, giving the Elasticsearch instance a readable name. It defaults to the machine's hostname and can be customized:
```
node.name: node-1
```
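Node names must be unique within a cluster.

network.host sets the address the node is reachable at. By default it binds only to the loopback addresses 127.0.0.1 and [::1]. It is usually set to the host's IP address or to 0.0.0.0: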
```
network.host: 0.0.0.0
```
http.port is the HTTP port; the default is 9200.
```
http.port: 9200
```
transport.port is the port for communication between the nodes in the cluster; the range is 9300-9400.
```
transport.port: 9300
```
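Note: the transport.tcp.xxx form has been written as transport.xxx since ES 6, so keep this in mind when reading configurations found online.

discovery.seed_hosts configures discovery. Two discovery and cluster-formation settings matter for letting the nodes in a cluster find each other and elect a master. To form a cluster with nodes on other hosts you must set discovery.seed_hosts, which lists the other hosts in the cluster. Without it, a node binds to the available loopback addresses and scans local ports 9300-9305 for other nodes, so nodes on other hosts have to be listed explicitly. See the official documentation: Discovery and cluster formation settings.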
```
discovery.seed_hosts: ["abc.com:9300","192.168.43.112:9301","127.0.0.1:9300"]
```
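cluster.initial_master_nodes is the initial list of master-eligible nodes. The values in cluster.initial_master_nodes must match node.name exactly.
It drives the initial master election. In development you can leave it unset and let Elasticsearch handle this automatically, but that can fail, so set it explicitly in production.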
```
cluster.initial_master_nodes: ["node1", "node2"]
```
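1. A brand-new cluster needs cluster.initial_master_nodes to bootstrap the list of master-eligible nodes.
2. Once the cluster has formed, you should no longer keep this setting; remove it.
3. After the cluster has formed, the setting is ignored even if it is kept.
4. The setting does not have to be identical on every node, so set it with care. If every master-eligible node sets it, shutting down one of them may take the others down too: a node that was first started with this setting may try to reach the originally listed master nodes on its next start and shut down when it cannot. For availability, consider leaving the setting off some of the master-eligible nodes, so that when one of them fails the remaining ones do not depend on finding the initial masters.
More settings are covered in the article: ElasticSearch7.x 之 elasticsearch.yml详细配置

My configuration is listed below: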
```
# es-node-1
cluster.name: es7
node.name: node-1
node.master: true
node.data: false
node.ingest: false
network.host: 0.0.0.0
http.port: 9200
transport.port: 9300
discovery.seed_hosts: ["127.0.0.1:9300","127.0.0.1:9301","127.0.0.1:9302"]
cluster.initial_master_nodes: ["node-1"]

# es-node-2
cluster.name: es7
node.name: node-2
node.master: true
node.data: true
node.ingest: false
network.host: 0.0.0.0
http.port: 9201
transport.port: 9301
discovery.seed_hosts: ["127.0.0.1:9300","127.0.0.1:9301","127.0.0.1:9302"]

# es-node-3
cluster.name: es7
node.name: node-3
node.master: true
node.data: true
node.ingest: false
network.host: 0.0.0.0
http.port: 9202
transport.port: 9302
discovery.seed_hosts: ["127.0.0.1:9300","127.0.0.1:9301","127.0.0.1:9302"]
```
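The node-1 node is a dedicated master node; it is not a data node.

Adjust the system settings
Elasticsearch may report errors at startup; if it starts cleanly, skip this step.
Edit /etc/security/limits.conf:

# Append the following to the end of the file
# Limit on the number of files each process may open
es soft nofile 65536
es hard nofile 65536

Edit /etc/security/limits.d/20-nproc.conf:

# Append the following to the end of the file
# Limit on the number of files each process may open
es soft nofile 65536
es hard nofile 65536

# Operating-system limit on the number of processes each user may create
* hard nproc 4096
# Note: * stands for all Linux user names

Edit /etc/sysctl.conf:

# Add the following to the file
# Maximum number of VMAs (virtual memory areas) a process may own; the Linux default is 65530

vm.max_map_count=655360

Reload: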

sysctl -p

Starting

cd /opt/module/es/
# Start in the foreground
bin/elasticsearch
# Start in the background
bin/elasticsearch -d

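Configuration reference

elasticsearch.yml

Setting	Description
`cluster.name`	Name of the Elasticsearch cluster; defaults to elasticsearch. Changing it to a meaningful name is recommended.
`node.name`	Node name; defaults to the machine's hostname in 7.x (older versions picked a random name). A meaningful name is recommended for easier management.
`path.conf`	Location of the configuration files. For tar or zip installs it defaults to the config folder under the Elasticsearch root; for rpm installs it defaults to /etc/elasticsearch.
`path.data`	Location of the index data; defaults to the data folder under the Elasticsearch root. Several paths can be given, separated by commas.
`path.logs`	Location of the log files; defaults to the logs folder under the Elasticsearch root.
`path.plugins`	Location of the plugins; defaults to the plugins folder under the Elasticsearch root.
`bootstrap.memory_lock`	Set to true to lock the memory used by Elasticsearch and keep it from being swapped out.
`network.host`	Sets both bind_host and publish_host; 0.0.0.0 allows access from other machines.
`http.port`	HTTP port for external clients; defaults to 9200.
`transport.port`	Port for communication between cluster nodes.
`discovery.zen.ping.timeout`	Timeout for node discovery; defaults to 3 seconds. Increase it on high-latency networks.
`discovery.zen.minimum_master_nodes`	Minimum number of master-eligible nodes (before 7.0), computed as (master_eligible_nodes / 2) + 1. For example, with 3 master-eligible nodes set it to 2.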
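
The quorum formula uses integer division, which is easy to get wrong by hand. A minimal sketch, with minimum_master_nodes as a helper name chosen for this example:

```python
def minimum_master_nodes(master_eligible_nodes):
    """Quorum of master-eligible nodes: (n / 2) + 1 with integer division."""
    if master_eligible_nodes < 1:
        raise ValueError("a cluster needs at least one master-eligible node")
    return master_eligible_nodes // 2 + 1


for nodes in (1, 3, 4, 5):
    print(nodes, minimum_master_nodes(nodes))
# 1 1
# 3 2
# 4 3
# 5 3
```

With 4 master-eligible nodes the quorum is 3, so an even node count tolerates no more failures than the odd count below it.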

Details:

################################### Cluster ###################################
# Cluster name; defaults to elasticsearch
cluster.name: elasticsearch
#################################### Node #####################################
# Name of this node
node.name: "node-1"

# Whether this node is master-eligible. The master coordinates the cluster state; reads and document writes are handled by the nodes together, but changes to the cluster state are made only by the master
# Whether this node can be elected master: true or false
node.master: true
# Whether this node is a data node, i.e. stores data and indices: true or false
node.data: true
#node.rack: rack314
# Add the following to each hot/warm/cold node and restart the service
# These two settings attach attributes to the node; the names are not fixed and relate to the templates and policies configured later
node.attr.box_type: hot/warm/cold
node.attr.rack: rack1

# Number of nodes one server may run; 1 is usually enough, since a machine normally runs a single node
#node.max_local_storage_nodes: 1
#################################### Index ####################################
# Note: since ES 7.x, index settings can no longer be defined here
# Number of shards per index
#index.number_of_shards: 5
# Number of replicas
#index.number_of_replicas: 1
#################################### Paths ####################################
# Location of the configuration files
#path.conf: /path/to/conf
# Location of the index data
#path.data: /path/to/data
#path.data: /path/to/data1,/path/to/data2

# Location of temporary files
#path.work: /path/to/work

# Path to log files:
#path.logs: /path/to/logs

# Location of the plugins
#path.plugins: /path/to/plugins

# Whether to lock physical memory: true or false
bootstrap.memory_lock: false
# Whether to run the seccomp system call filter check: true or false
bootstrap.system_call_filter: false


#################################### Plugin ###################################
# Names of mandatory plugins; if a listed plugin is not installed, the node will not start. No plugin is mandatory by default
#plugin.mandatory: mapper-attachments,lang-groovy
################################### Memory ####################################
# Elasticsearch performs poorly when memory runs short and the JVM starts swapping; to avoid this, set bootstrap.memory_lock (above) to true to lock the memory Elasticsearch uses
############################## Network And HTTP ###############################
# Address this Elasticsearch node binds to
#network.bind_host: 192.168.0.1
# Address used to communicate with other nodes; detected automatically if not set
#network.publish_host: 192.168.0.1

# Set both 'bind_host' and 'publish_host':
#
#network.host: 192.168.0.1
network.host: 0.0.0.0
# Port for communication between nodes
#transport.port: 9300
# Whether to compress data sent over the transport TCP connection
#transport.compress: true
# Port the HTTP listener binds to (the port clients connect to)
http.port: 9200
# Maximum size of the content of an HTTP request
#http.max_content_length: 100mb
# Whether to enable HTTP; set to false if Elasticsearch should not serve HTTP
#http.enabled: false

# Whether to allow cross-origin requests (true to allow); needed when using the head plugin
http.cors.enabled: true

# "*" allows all origins
http.cors.allow-origin: "*"

################################### Gateway ###################################
# Underlying persistence for Elasticsearch; local by default, can also be set to AWS S3
#gateway.type: local
# Number of nodes that must be up before data recovery starts; this avoids incomplete shard allocation while the nodes are still discovering each other
# For example, in a cluster of 5 nodes you can set this to 5, so that recovery starts only after all 5 nodes are up; with 3, the other two nodes may end up without any shards
#gateway.recover_after_nodes: 1
# Timeout before initial recovery starts; if gateway.recover_after_nodes is 5, recovery begins 5 minutes after all 5 nodes are up
#gateway.recover_after_time: 5m

# Set how many nodes are expected in this cluster. Once these N nodes
# are up (and recover_after_nodes is met), begin recovery process immediately
# (without waiting for recover_after_time to expire):
#gateway.expected_nodes: 2
############################# Recovery Throttling #############################
# Number of concurrent recoveries per node
#cluster.routing.allocation.node_initial_primaries_recoveries: 4
#cluster.routing.allocation.node_concurrent_recoveries: 2
# Set to throttle throughput when recovering (eg. 100mb, by default 20mb):
# Bandwidth limit while recovering data; 0 means unlimited
#indices.recovery.max_bytes_per_sec: 20mb

# Maximum number of concurrent streams when recovering from other shards
#indices.recovery.concurrent_streams: 5

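################################## Discovery ##################################
# 1. In ES 7.0 the cluster discovery system was reworked: the discovery.zen.minimum_master_nodes setting that guarded against split brain is gone,
# the cluster now manages this itself, and a brand-new cluster needs cluster.initial_master_nodes to bootstrap.
# 2. In ES 7, some of the discovery.zen.* settings no longer take effect
# (old)
# Minimum number of master-eligible nodes required; for larger clusters the official recommendation is 2-4
#discovery.zen.minimum_master_nodes: 2
# Timeout for discovering other nodes in the cluster; increase it on high-latency networks to avoid false failures
#discovery.zen.ping.timeout: 3s
# Whether to enable multicast discovery
#discovery.zen.ping.multicast.enabled: false
# Initial list of master nodes in the cluster; the hosts listed here are discovered and joined automatically
#discovery.zen.ping.unicast.hosts: ["host1", "host2:port"]

# (new)
# Added in ES 7.x: addresses of the master-eligible nodes, which can be elected master once the service is up
# Since ES 7, discovery.seed_hosts replaces discovery.zen.ping.unicast.hosts above
# The port is the transport port (9300 by default), not the HTTP port
# discovery.zen.ping.unicast.hosts: ["10.19.1.9:9300","10.19.1.10:9300","10.19.1.11:9300"]
discovery.seed_hosts: ["10.19.1.9:9300","10.19.1.10:9300","10.19.1.11:9300"]

# Added in ES 7.x: required to elect a master when bootstrapping a new cluster
cluster.initial_master_nodes: ["node-1"]
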
################################## Slow Log ##################################
 
# Shard level query and fetch threshold logging.
 
#index.search.slowlog.threshold.query.warn: 10s
#index.search.slowlog.threshold.query.info: 5s
#index.search.slowlog.threshold.query.debug: 2s
#index.search.slowlog.threshold.query.trace: 500ms
 
#index.search.slowlog.threshold.fetch.warn: 1s
#index.search.slowlog.threshold.fetch.info: 800ms
#index.search.slowlog.threshold.fetch.debug: 500ms
#index.search.slowlog.threshold.fetch.trace: 200ms
 
#index.indexing.slowlog.threshold.index.warn: 10s
#index.indexing.slowlog.threshold.index.info: 5s
#index.indexing.slowlog.threshold.index.debug: 2s
#index.indexing.slowlog.threshold.index.trace: 500ms
 
################################## GC Logging ################################
#monitor.jvm.gc.young.warn: 1000ms
#monitor.jvm.gc.young.info: 700ms
#monitor.jvm.gc.young.debug: 400ms
#monitor.jvm.gc.old.warn: 10s
#monitor.jvm.gc.old.info: 5s
#monitor.jvm.gc.old.debug: 2s
################################## Security ################################
 
# Uncomment if you want to enable JSONP as a valid return transport on the
# http server. With this enabled, it may pose a security risk, so disabling
# it unless you need it is recommended (it is disabled by default).
#
#http.jsonp.enable: true

How Elasticsearch works

See my other article: ElasticSearch原理 (TODO)

References:
【ES】ElasticSearch 深入分片
elasticsearch简介和elasticsearch_dsl
elasticsearch7常见查询（term、match、bool、filter）
Elasticsearch 7 : 添加和更新文档
ElasticSearch7.4学习之基础操作（增删改查）
A blogger with many articles on Elasticsearch
ElasticSearch 7 教程 (a complete series)
Posted 2021-07-06 22:17 by 403·Forbidden
