Elasticsearch (3)
1. ES Query Operations

1.1 Filtering: query first, then filter (post_filter)

```
GET movie_index/_search
{
  "query": {
    "match": { "name": "red" }
  },
  "post_filter": {
    "term": { "actorList.id": "3" }
  }
}
```
1.2 Filtering: match and filter in one query

```
GET movie_index/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "name": "red" } }
      ],
      "filter": {
        "term": { "actorList.id": 3 }
      }
    }
  }
}
```
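Clauses under `filter` do not influence the relevance score and can be cached by Elasticsearch, so they are well suited for exact conditions. `filter` also accepts an array, so several conditions can be combined; a minimal sketch (the extra range condition is added here purely for illustration):

```
GET movie_index/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "name": "red" } }
      ],
      "filter": [
        { "term":  { "actorList.id": 3 } },
        { "range": { "doubanScore": { "gte": 5 } } }
      ]
    }
  }
}
```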
1.3 Filtering: by range

```
# doubanScore >= 6 and <= 9
GET movie_index/_search
{
  "query": {
    "range": {
      "doubanScore": { "gte": 6, "lte": 9 }
    }
  }
}
```
The supported comparison operators are:

| Operator | Meaning |
| --- | --- |
| gt | greater than |
| lt | less than |
| gte | greater than or equal to |
| lte | less than or equal to |
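For exclusive bounds, use gt/lt instead; a minimal sketch that matches scores strictly between 6 and 9:

```
GET movie_index/_search
{
  "query": {
    "range": {
      "doubanScore": { "gt": 6, "lt": 9 }
    }
  }
}
```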
1.4 Sorting

Sort descending by doubanScore:

```
GET movie_index/_search
{
  "sort": [
    {
      "doubanScore": {
        "order": "desc"
      }
    }
  ]
}
```

Sort only the documents whose name matches a given term:

```
GET movie_index/_search
{
  "query": {
    "match": { "name": "red" }
  },
  "sort": [
    { "doubanScore": { "order": "desc" } }
  ]
}
```
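Several sort keys can be listed; later keys act as tie-breakers. A minimal sketch that sorts by doubanScore and falls back to relevance (`_score`):

```
GET movie_index/_search
{
  "query": {
    "match": { "name": "red" }
  },
  "sort": [
    { "doubanScore": { "order": "desc" } },
    "_score"
  ]
}
```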
1.5 Pagination

`from` is the offset of the first hit to return, and `size` is the number of hits per page:

```
# Pagination
GET movie_index/_search
{
  "from": 0,
  "size": 2
}
```
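For page n (counting from 1) the offset is (n - 1) * size; a minimal sketch for the second page with two hits per page:

```
# page 2, 2 hits per page: from = (2 - 1) * 2 = 2
GET movie_index/_search
{
  "from": 2,
  "size": 2
}
```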
1.6 Returning specific fields

```
# Return only the specified fields
GET movie_index/_search
{
  "_source": ["id", "name"]
}
```
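`_source` also accepts an object with `includes` and `excludes` patterns, which is handy for dropping large nested fields; a minimal sketch (the field choice is only an example):

```
GET movie_index/_search
{
  "_source": {
    "includes": ["id", "name", "doubanScore"],
    "excludes": ["actorList.*"]
  }
}
```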
1.7 Highlighting

```
# Highlight matches in the name field
GET movie_index/_search
{
  "query": {
    "match": { "name": "red" }
  },
  "highlight": {
    "fields": { "name": {} }
  }
}
```
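By default the matched terms are wrapped in `<em>` tags; `pre_tags` and `post_tags` override this. A minimal sketch using a hypothetical CSS class:

```
GET movie_index/_search
{
  "query": {
    "match": { "name": "red" }
  },
  "highlight": {
    "pre_tags":  ["<span class='hl'>"],
    "post_tags": ["</span>"],
    "fields": { "name": {} }
  }
}
```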
1.8 Aggregations

Aggregations group and summarize data, much like GROUP BY and aggregate functions in SQL. Elasticsearch can return the search hits and the aggregation results computed over them in a single response, which is both powerful and efficient.

Requirement 1: count how many movies each actor has appeared in.

```
# How many movies has each actor appeared in?
# terms bucket aggregation
GET movie_index/_search
{
  "aggs": {
    "myAggs": {
      "terms": {
        "field": "actorList.name.keyword",
        "size": 10
      }
    }
  }
}
```

Requirement 2: compute the average score of each actor's movies and order the buckets by that average.

```
GET movie_index/_search
{
  "aggs": {
    "groupByname": {
      "terms": {
        "field": "actorList.name.keyword",
        "size": 10,
        "order": { "avgScore": "asc" }
      },
      "aggs": {
        "avgScore": {
          "avg": { "field": "doubanScore" }
        }
      }
    }
  }
}
```
Result:

{ "took" : 38, "timed_out" : false, "_shards" : { "total" : 5, "successful" : 5, "skipped" : 0, "failed" : 0 }, "hits" : { "total" : 3, "max_score" : 1.0, "hits" : [ { "_index" : "movie_index", "_type" : "movie", "_id" : "2", "_score" : 1.0, "_source" : { "id" : 101, "name" : "peration meigong river", "doubanScore" : 8.1, "actorList" : [ { "id" : 1, "name" : "zhang han yu" } ] } }, { "_index" : "movie_index", "_type" : "movie", "_id" : "1", "_score" : 1.0, "_source" : { "id" : 100, "name" : "operation red sea", "doubanScore" : 8.5, "actorlist" : [ { "id" : 1, "name" : "zhang yi" }, { "id" : 2, "name" : "hai qing" }, { "id" : 3, "name" : "zhang han yu" } ] } }, { "_index" : "movie_index", "_type" : "movie", "_id" : "3", "_score" : 1.0, "_source" : { "id" : 300, "name" : "incident red sea", "doubanScore" : 5.0, "actorList" : [ { "id" : 4, "name" : "zhang san feng" } ] } } ] }, "aggregations" : { "groupByname" : { "doc_count_error_upper_bound" : 0, "sum_other_doc_count" : 0, "buckets" : [ { "key" : "zhang san feng", "doc_count" : 1, "avgScore" : { "value" : 5.0 } }, { "key" : "zhang han yu", "doc_count" : 1, "avgScore" : { "value" : 8.100000381469727 } } ] } } }
2. Analysis (Tokenization)

2.1 The default analyzer

```
# Default analyzer
GET _analyze
{
  "text": "hello world"
}
```
Result:

{ "tokens" : [ { "token" : "hello", "start_offset" : 0, "end_offset" : 5, "type" : "<ALPHANUM>", "position" : 0 }, { "token" : "world", "start_offset" : 6, "end_offset" : 11, "type" : "<ALPHANUM>", "position" : 1 } ] }
2.2 Chinese text with the default analyzer

```
# Default analyzer
GET _analyze
{
  "text": "我是中国人"
}
```
Result:

{ "tokens" : [ { "token" : "我", "start_offset" : 0, "end_offset" : 1, "type" : "<IDEOGRAPHIC>", "position" : 0 }, { "token" : "是", "start_offset" : 1, "end_offset" : 2, "type" : "<IDEOGRAPHIC>", "position" : 1 }, { "token" : "中", "start_offset" : 2, "end_offset" : 3, "type" : "<IDEOGRAPHIC>", "position" : 2 }, { "token" : "国", "start_offset" : 3, "end_offset" : 4, "type" : "<IDEOGRAPHIC>", "position" : 3 }, { "token" : "人", "start_offset" : 4, "end_offset" : 5, "type" : "<IDEOGRAPHIC>", "position" : 4 } ] }
2.3 Chinese analyzers

As the query above shows, the built-in handling of Chinese simply splits the text into individual characters and has no notion of words. In practice, however, users search by words: if the text is segmented into words, queries match user intent much more closely and also run faster. The table below compares some common open-source Chinese analyzers; we use the IK analyzer.
| Analyzer | Strengths | Weaknesses |
| --- | --- | --- |
| Smart Chinese Analysis | Official plugin | Very poor Chinese segmentation quality |
| IKAnalyzer | Simple to use; supports custom and remote dictionaries | Dictionary must be maintained by hand; no part-of-speech tagging |
| jieba (结巴分词) | Can discover new words | No part-of-speech tagging |
| Ansj | Good segmentation accuracy; supports part-of-speech tagging | Smaller dictionary than HanLP; steeper learning curve |
| HanLP | The most complete dictionary and the richest feature set | Needs tuning to get the best segmentation; steeper learning curve |
2.4 The IK analyzer

Create a plugin directory under the Elasticsearch installation directory:

```
[hui@hadoop201 plugins]$ pwd
/opt/module/elasticsearch/plugins
[hui@hadoop201 plugins]$ mkdir ik
```

Unzip the plugin into that directory:

```
[hui@hadoop201 software]$ unzip elasticsearch-analysis-ik-6.6.0.zip -d /opt/module/elasticsearch/plugins/ik
[hui@hadoop201 software]$ cd /opt/module/elasticsearch/plugins/ik
[hui@hadoop201 ik]$ ll
total 1432
-rw-r--r--. 1 hui hui 263965 May  6  2018 commons-codec-1.9.jar
-rw-r--r--. 1 hui hui  61829 May  6  2018 commons-logging-1.2.jar
drwxr-xr-x. 2 hui hui   4096 Aug 26  2018 config
-rw-r--r--. 1 hui hui  54693 Jan 30  2019 elasticsearch-analysis-ik-6.6.0.jar
-rw-r--r--. 1 hui hui 736658 May  6  2018 httpclient-4.5.2.jar
-rw-r--r--. 1 hui hui 326724 May  6  2018 httpcore-4.4.4.jar
-rw-r--r--. 1 hui hui   1805 Jan 30  2019 plugin-descriptor.properties
-rw-r--r--. 1 hui hui    125 Jan 30  2019 plugin-security.policy
```
Distribute the ik plugin to the other nodes:

```
[hui@hadoop201 plugins]$ sxync.sh ik/
```

Remember to restart the Elasticsearch cluster after distributing the plugin.
```
# ik: coarse-grained segmentation (ik_smart)
GET /_analyze
{
  "text": "我是中国人",
  "analyzer": "ik_smart"
}
```

Result:

```
{
  "tokens" : [
    { "token" : "我", "start_offset" : 0, "end_offset" : 1, "type" : "CN_CHAR", "position" : 0 },
    { "token" : "是", "start_offset" : 1, "end_offset" : 2, "type" : "CN_CHAR", "position" : 1 },
    { "token" : "中国人", "start_offset" : 2, "end_offset" : 5, "type" : "CN_WORD", "position" : 2 }
  ]
}
```
```
# ik: fine-grained segmentation (ik_max_word)
GET /_analyze
{
  "text": "我是中国人",
  "analyzer": "ik_max_word"
}
```

Result:

```
{
  "tokens" : [
    { "token" : "我", "start_offset" : 0, "end_offset" : 1, "type" : "CN_CHAR", "position" : 0 },
    { "token" : "是", "start_offset" : 1, "end_offset" : 2, "type" : "CN_CHAR", "position" : 1 },
    { "token" : "中国人", "start_offset" : 2, "end_offset" : 5, "type" : "CN_WORD", "position" : 2 },
    { "token" : "中国", "start_offset" : 2, "end_offset" : 4, "type" : "CN_WORD", "position" : 3 },
    { "token" : "国人", "start_offset" : 3, "end_offset" : 5, "type" : "CN_WORD", "position" : 4 }
  ]
}
```
2.5 Customizing the IK dictionary

Sometimes the bundled dictionary does not contain domain-specific terms or new internet slang used in a project, so the dictionary needs to be extended. The steps are as follows.

Create the custom dictionary file:

```
[hui@hadoop201 config]$ pwd
/opt/module/elasticsearch/plugins/ik/config
[hui@hadoop201 config]$ less myword.txt
蓝瘦香菇
蓝廋
香菇
瘦香
```
Point the IK configuration at the custom dictionary:

```
[hui@hadoop201 config]$ vim IKAnalyzer.cfg.xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
    <comment>IK Analyzer 扩展配置</comment>
    <!--用户可以在这里配置自己的扩展字典 -->
    <entry key="ext_dict">./myword.txt</entry>
    <!--用户可以在这里配置自己的扩展停止词字典-->
    <entry key="ext_stopwords"></entry>
    <!--用户可以在这里配置远程扩展字典 -->
    <!-- <entry key="remote_ext_dict">words_location</entry> -->
    <!--用户可以在这里配置远程扩展停止词字典-->
    <!-- <entry key="remote_ext_stopwords">words_location</entry> -->
</properties>
```
Distribute the configuration:

```
[hui@hadoop201 config]$ sxync.sh myword.txt
[hui@hadoop201 config]$ sxync.sh IKAnalyzer.cfg.xml
```
Restart the cluster and test:

```
# ik: fine-grained segmentation
GET /_analyze
{
  "text": "蓝瘦香菇",
  "analyzer": "ik_max_word"
}
```
Result:

{ "tokens" : [ { "token" : "蓝瘦香菇", "start_offset" : 0, "end_offset" : 4, "type" : "CN_WORD", "position" : 0 }, { "token" : "瘦香", "start_offset" : 1, "end_offset" : 3, "type" : "CN_WORD", "position" : 1 }, { "token" : "香菇", "start_offset" : 2, "end_offset" : 4, "type" : "CN_WORD", "position" : 2 } ] }
3. Mapping

The data type of every field in a type is defined by its mapping. If no mapping is provided when an index is created, Elasticsearch infers the field types from the format of the first document it sees, roughly as follows:

- true/false → boolean
- 1020 → long
- 20.1 → float
- "2018-02-01" → date
- "hello world" → text + keyword

By default only text fields are analyzed; keyword is a string type that is not analyzed. Besides this automatic (dynamic) mapping, a mapping can also be defined manually, but only for newly added fields that hold no data yet; once a field contains data its mapping can no longer be changed.
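As an illustration (a sketch based on 6.x behaviour; the exact output can vary between versions), checking the dynamically generated mapping shows that a string field such as name becomes a text field with a keyword sub-field, which is why the aggregations above could use actorList.name.keyword:

```
GET movie_index/_mapping

# excerpt of the response for a dynamically mapped string field:
# "name" : {
#   "type" : "text",
#   "fields" : {
#     "keyword" : { "type" : "keyword", "ignore_above" : 256 }
#   }
# }
```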
Documents can be indexed directly; if the index does not exist yet it is created automatically with a dynamic mapping:

```
PUT /movie_chn_1/movie/1
{
  "id": 1,
  "name": "红海行动",
  "doubanScore": 8.5,
  "actorList": [
    { "id": 1, "name": "张译" },
    { "id": 2, "name": "海清" },
    { "id": 3, "name": "张涵予" }
  ]
}

PUT /movie_chn_1/movie/2
{
  "id": 2,
  "name": "湄公河行动",
  "doubanScore": 8.0,
  "actorList": [
    { "id": 3, "name": "张涵予" }
  ]
}

PUT /movie_chn_1/movie/3
{
  "id": 3,
  "name": "红海事件",
  "doubanScore": 5.0,
  "actorList": [
    { "id": 4, "name": "张三丰" }
  ]
}
```
Query tests:

```
GET /movie_chn_1/_search
GET /movie_chn_1
GET /movie_chn_1/_mapping

# match query against the dynamically mapped name field
# (the standard analyzer splits Chinese into single characters)
GET /movie_chn_1/_search
{
  "query": {
    "match": { "name": "海行" }
  }
}
```

Define the mapping manually (note the ik_smart analyzer on name and the keyword type for actor names):

```
PUT movie_chn_2
{
  "mappings": {
    "movie": {
      "properties": {
        "id":          { "type": "long" },
        "name":        { "type": "text", "analyzer": "ik_smart" },
        "doubanScore": { "type": "double" },
        "actorList": {
          "properties": {
            "id":   { "type": "long" },
            "name": { "type": "keyword" }
          }
        }
      }
    }
  }
}

# Index data into the manually mapped index
PUT /movie_chn_2/movie/1
{
  "id": 1,
  "name": "红海行动",
  "doubanScore": 8.5,
  "actorList": [
    { "id": 1, "name": "张译" },
    { "id": 2, "name": "海清" },
    { "id": 3, "name": "张涵予" }
  ]
}

PUT /movie_chn_2/movie/2
{
  "id": 2,
  "name": "湄公河行动",
  "doubanScore": 8.0,
  "actorList": [
    { "id": 3, "name": "张涵予" }
  ]
}

PUT /movie_chn_2/movie/3
{
  "id": 3,
  "name": "红海事件",
  "doubanScore": 5.0,
  "actorList": [
    { "id": 4, "name": "张三丰" }
  ]
}
```

An index can also be created with a manual mapping and an alias in one request; the alias is declared with `"aliases": { "movie_chn_3_aliase": {} }`:

```
PUT movie_chn_3
{
  "aliases": {
    "movie_chn_3_aliase": {}
  },
  "mappings": {
    "movie": {
      "properties": {
        "id":          { "type": "long" },
        "name":        { "type": "text", "analyzer": "ik_smart" },
        "doubanScore": { "type": "double" },
        "actorList": {
          "properties": {
            "id":   { "type": "long" },
            "name": { "type": "keyword" }
          }
        }
      }
    }
  }
}
```
4. Aliases

```
# List aliases
GET /_cat/aliases

# Add another alias to an existing index
POST /_aliases
{
  "actions": [
    { "add": { "index": "movie_chn_3", "alias": "movie_chn_3_wdh01" } }
  ]
}

# List aliases again
GET /_cat/aliases

# An alias is queried exactly like an index
GET /movie_chn_3_wdh01/_search
GET /movie_chn_3_aliase/_search

# Remove an alias
POST /_aliases
{
  "actions": [
    { "remove": { "index": "movie_chn_3", "alias": "movie_chn_3_aliase" } }
  ]
}

# Create a filtered alias over a subset of an index
POST _aliases
{
  "actions": [
    {
      "add": {
        "index": "movie_chn_1",
        "alias": "movie_chn_1_sub_query",
        "filter": {
          "term": { "actorList.id": "4" }
        }
      }
    }
  ]
}

GET /movie_chn_1_sub_query/_search
```
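Because all actions in one `_aliases` call are applied atomically, an alias can be switched from an old index to a new one without ever pointing at neither. A minimal sketch (movie_chn_4 is a hypothetical new index):

```
POST /_aliases
{
  "actions": [
    { "remove": { "index": "movie_chn_3", "alias": "movie_chn_3_wdh01" } },
    { "add":    { "index": "movie_chn_4", "alias": "movie_chn_3_wdh01" } }
  ]
}
```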
5. Index Templates

```
# Define a template
PUT _template/template_movie2020
{
  "index_patterns": ["movie_test*"],
  "settings": {
    "number_of_shards": 1
  },
  "aliases" : {
    "{index}-query": {},
    "movie_test-query": {}
  },
  "mappings": {
    "_doc": {
      "properties": {
        "id": { "type": "keyword" },
        "movie_name": { "type": "text", "analyzer": "ik_smart" }
      }
    }
  }
}

# Indexing into a name that matches the pattern applies the template automatically
POST movie_test_220101/_doc
{
  "id": "0101",
  "name": "令狐冲"
}

GET /movie_test_220101/_mapping
GET /movie_test_220101-query/_mapping

# List templates
GET /_cat/templates
```

An index template used later for the gmall2020 DAU data:

```
PUT _template/gmall2020_dau_info_template
{
  "index_patterns": ["gmall2020_dau_info*"],
  "settings": {
    "number_of_shards": 3
  },
  "aliases" : {
    "{index}-query": {},
    "gmall2020_dau_info-query": {}
  },
  "mappings": {
    "_doc": {
      "properties": {
        "mid": { "type": "keyword" },
        "uid": { "type": "keyword" },
        "ar":  { "type": "keyword" },
        "ch":  { "type": "keyword" },
        "vc":  { "type": "keyword" },
        "dt":  { "type": "keyword" },
        "hr":  { "type": "keyword" },
        "mi":  { "type": "keyword" },
        "ts":  { "type": "date" }
      }
    }
  }
}
```
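To inspect or remove a single template, the standard template endpoints can be used; a minimal sketch:

```
# Show the full definition of one template
GET /_template/template_movie2020

# Delete a template (indices already created from it are not affected)
DELETE /_template/template_movie2020
```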