Elasticsearch Summary
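1. Differences between matchQuery and termQuery
matchQuery: analyzes the search text into tokens and matches them against the target field. With the default OR operator, a document is returned if any one of the tokens matches the field.
matchPhrasePrefix: analyzes the search text into tokens and matches them against the target field. A document matches only when all tokens appear in the same order and positions as in the search text, with the last token treated as a prefix.
termQuery: does not analyze the search text. The text is compared as a whole against the terms indexed for the target field, and a document is returned only on an exact match.
wildcardQuery: pattern matching at the term level with wildcard support, e.g. QueryBuilders.wildcardQuery("content", "?全*"), where ? stands for exactly one character and * for zero or more characters.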
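FuzzyQuery: fuzzy matching based on edit distance. For example, with the older Lucene API:
Term t = new Term("content", "work");
FuzzyQuery query = new FuzzyQuery(t, 0.1f, 1);
The first argument is the term object. The second is the minimum similarity used by the Levenshtein algorithm (the default is 0.5): the smaller the value, the looser the match and the more documents are returned, and vice versa. The third is the number of leading characters that must match exactly (the prefix length). Note that this float-similarity constructor belongs to older Lucene releases; newer releases take a maximum number of edits instead.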
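To make the roles of the minimum similarity and the prefix length concrete, here is a minimal Python sketch. The helper names and the normalization `1 - distance / longer length` are assumptions chosen for illustration; this is not Lucene's exact scoring formula.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def similarity(a, b):
    """Assumed normalization: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def fuzzy_match(term, candidate, min_similarity, prefix_length):
    """Match when the required prefix is identical and the similarity is high enough."""
    if candidate[:prefix_length] != term[:prefix_length]:
        return False
    return similarity(term, candidate) >= min_similarity


if __name__ == "__main__":
    for word in ("word", "fork", "wxyz"):
        print(word, fuzzy_match("work", word, 0.5, 1), fuzzy_match("work", word, 0.1, 1))
```

Lowering `min_similarity` from 0.5 to 0.1 lets the distant candidate "wxyz" through, while "fork" is rejected either way because its first character differs from the required prefix.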
For example, a match_phrase_prefix query:
GET /test_index/_search
{
  "from": 0,
  "size": 100,
  "timeout": "60s",
  "query": {
    "bool": {
      "must": [
        {
          "match_phrase_prefix": {
            "deviceUuidFristLogin": {
              "query": "XXXXXXXXXXXXXXXXXXXXX",
              "slop": 0,
              "max_expansions": 50,
              "boost": 1.0
            }
          }
        }
      ],
      "adjust_pure_negative": true,
      "boost": 1.0
    }
  }
}
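The differences between the query types in this section can be sketched with a toy model in Python. The analyzer here (lowercase, split on non-alphanumeric characters) is a deliberately simplified assumption and not the real standard analyzer, and all function names are made up for illustration.

```python
import re
from fnmatch import fnmatchcase


def analyze(text):
    """Toy analyzer: lowercase, then split on anything that is not a-z or 0-9."""
    return [tok for tok in re.split(r"[^0-9a-z]+", text.lower()) if tok]


def match_query(indexed_tokens, search_text):
    """match: analyze the search text; any shared token is a hit (OR semantics)."""
    return any(tok in indexed_tokens for tok in analyze(search_text))


def term_query(indexed_tokens, search_text):
    """term: no analysis; the whole search text must equal one indexed term."""
    return search_text in indexed_tokens


def wildcard_query(indexed_tokens, pattern):
    """wildcard: term-level; '?' is one character, '*' is zero or more.
    fnmatchcase also understands [...] sets, which this sketch does not rely on."""
    return any(fnmatchcase(tok, pattern) for tok in indexed_tokens)


def match_phrase_prefix_query(indexed_tokens, search_text):
    """All tokens must be consecutive and in order; the last one may be a prefix."""
    query = analyze(search_text)
    if not query:
        return False
    n = len(query)
    for start in range(len(indexed_tokens) - n + 1):
        window = indexed_tokens[start:start + n]
        if window[:-1] == query[:-1] and window[-1].startswith(query[-1]):
            return True
    return False


if __name__ == "__main__":
    doc = analyze("Quick Brown Fox")
    print(match_query(doc, "brown dog"))            # one token is enough
    print(term_query(doc, "Quick Brown Fox"))       # the unanalyzed text is not an indexed term
    print(wildcard_query(doc, "qu?ck"))
    print(match_phrase_prefix_query(doc, "quick br"))
```

The term query on the full original string fails because the index only holds the individual lowercase tokens, which is why term queries are usually pointed at keyword fields.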
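2. Difference between must and should
must: equivalent to AND in MySQL; every clause has to match.
should: equivalent to OR in MySQL (generally less efficient). When the bool query also contains must or filter clauses, should clauses become optional and only influence scoring unless minimum_should_match is set.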
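As a rough illustration of the AND / OR semantics, the sketch below builds bool query bodies as Python dictionaries and evaluates them against a document. The evaluator only understands term clauses and is an assumption-laden model, not how Elasticsearch executes queries; the field names are hypothetical.

```python
def term(field, value):
    return {"term": {field: value}}


def bool_query(must=(), should=(), minimum_should_match=None):
    """Build a bool query body; only non-empty parts are included."""
    clause = {}
    if must:
        clause["must"] = list(must)
    if should:
        clause["should"] = list(should)
    if minimum_should_match is not None:
        clause["minimum_should_match"] = minimum_should_match
    return {"query": {"bool": clause}}


def evaluate(doc, query):
    """Tiny evaluator for term clauses only, to show must (AND) versus should (OR)."""
    clause = query["query"]["bool"]

    def hit(c):
        field, value = next(iter(c["term"].items()))
        return doc.get(field) == value

    must = clause.get("must", [])
    should = clause.get("should", [])
    # Without a must clause, at least one should clause has to match by default
    default_min = 1 if (should and not must) else 0
    needed = clause.get("minimum_should_match", default_min)
    return all(hit(c) for c in must) and sum(hit(c) for c in should) >= needed


if __name__ == "__main__":
    doc = {"status": 2000, "seller": "b"}
    print(evaluate(doc, bool_query(must=[term("status", 2000), term("seller", "a")])))
    print(evaluate(doc, bool_query(should=[term("seller", "a"), term("seller", "b")])))
```

The first query fails because both must clauses are required; the second succeeds because one matching should clause is enough.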
3. How to check how a string is analyzed
GET /test_index/_analyze
{
  "field": "deviceUuidFristLogin",
  "text": "xxxxxxxxxxxxxxxxxxxxxxxxxxxx"
}
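4. Since Elasticsearch 5.0 the string type has been split into two new data types: text, which is analyzed and used for full-text search, and keyword, which is not analyzed and used for exact keyword search. For string fields, ES by default also generates a keyword sub-field for exact matching. The default mapping looks like this:
"mappings": {
  "properties": {
    "id": { "type": "long" },
    "searchField": {
      "type": "text",
      "fields": {
        "keyword": { "type": "keyword", "ignore_above": 256 }
      }
    }
  }
}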
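5. fielddata
Doc values are a forward index (document to field value) used by sorting, aggregations and other operations that need to map a document to its field values. They are available for most field types and are stored on disk.
Fielddata applies only to text fields, i.e. analyzed fields. It is a forward index built in memory when such a field is used for sorting or aggregations; it consumes a lot of memory and is disabled by default.
Therefore sorting on a text field is not recommended, and text fields are not suitable for range queries either.
Likewise, aggregations (max, min, avg, sum, and bucket aggregations such as terms and range) should not be run on text fields.
In ES 5.x+, check whether a numeric-looking field really needs range queries. If it looks numeric but is only used for exact matching with term or terms queries, define it as keyword rather than long, for example userId, buyerId, sellerId.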
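The two index directions described in this section can be sketched in a few lines of Python. The sample documents are hypothetical and the structures are plain dictionaries, so this only illustrates the idea, not how Lucene stores either index.

```python
from collections import defaultdict

# Hypothetical documents: doc id -> field value
DOCS = {1: "red shoes", 2: "blue shoes", 3: "red hat"}


def build_inverted_index(docs):
    """term -> doc ids: answers 'which documents contain this term?' (searching)."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():
            index[token].add(doc_id)
    return index


def build_forward_index(docs):
    """doc id -> value: answers 'what is this document's value?' (sorting, aggregations)."""
    return dict(docs)


def search_and_sort(term, inverted, forward):
    hits = inverted.get(term, set())            # lookup in the inverted index
    return sorted(hits, key=forward.__getitem__)  # ordering needs the forward index


if __name__ == "__main__":
    inverted = build_inverted_index(DOCS)
    forward = build_forward_index(DOCS)
    print(search_and_sort("shoes", inverted, forward))
```

Searching never needs the forward index, but sorting the hits does, which is why a field without doc values has to build fielddata in memory before it can be sorted or aggregated.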
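Common solutions for pagination in ES:
Option 1: from/size can reach at most 10,000 records, and users rarely look at deep pages anyway, so cap the maximum page number, for example at 100 pages.
Option 2: do not cap the page number, but disallow jumping to an arbitrary page (as Baidu and Google do) and only offer a "next page" action. This can be implemented with search_after, for example:
First query, showing 5 records:
GET /test_index/_search
{
  "size": 5,
  "query": {
    "match": { "sellerUserId": "xxxxx" }
  },
  "sort": [
    { "orderAddtime": "desc" },
    { "_id": "desc" }
  ]
}
Response: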
{
  "took": 58,
  "timed_out": false,
  "_shards": { "total": 5, "successful": 5, "skipped": 0, "failed": 0 },
  "hits": {
    "total": 108783,
    "max_score": null,
    "hits": [
      {
        "_index": "test_index", "_type": "elasticsearch", "_id": "xxxxxxxxxxxxxxx", "_score": null,
        "_source": {
          "orderSubTypeId": 0, "productId": 22166, "orderId": xxxxxxxxxxxxxxx, "orderAddtime": "2020-04-02 10:25:32",
          "productSize": "", "orderTradeStatus": 2000, "poundage": 60.33, "sellerName": "大眼睛潮品代购",
          "orderNum": "xxxxxxxxxxxxxxx", "orderTypeId": 0, "sellerUserId": xxxxxxx, "sku_id": xxxxxxx,
          "buyerName": "单身A锥195", "productName": "Champion 袖口单标 加绒 男女款 带帽卫衣 美版 天蓝色",
          "paidMoney": 219.0, "price": 239.0, "freightCost": 18.0, "buyerUserId": xxxxxxx,
          "buyerReceiveAddress": "河南省周口市淮阳县润德第一城c栋", "buyerReceiveMobile": "xxxxxxxxxxxxx",
          "discountInfo": "包邮券:18.0元,优惠券:20.0元", "discountValue": 38.0, "buyerPayAccountId": "-1"
        },
        "sort": [ "2020-04-02 10:25:32", "xxxxxxxxxxxxx" ]
      },
      {
        "_index": "test_index", "_type": "elasticsearch", "_id": "xxxxxxxxxxxxx", "_score": null,
        "_source": {
          "orderSubTypeId": 0, "productId": 30739, "orderId": xxxxxxxxxxxxx, "orderAddtime": "2020-04-02 10:22:15",
          "productSize": "", "orderTradeStatus": 2000, "poundage": 59.23, "sellerName": "大眼睛潮品代购",
          "orderNum": "xxxxxxxxxxxxx", "orderTypeId": 0, "sellerUserId": xxxxxxxxxxxxx, "sku_id": 184429239,
          "buyerName": "发呆影子灰qRr", "productName": "Champion 圆领 薄长袖 T恤 美版 黑色",
          "paidMoney": 123.0, "price": 129.0, "freightCost": 14.0, "buyerUserId": xxxxxxxxxxxxx,
          "discountInfo": "优惠券:20.0元", "discountValue": 20.0,
          "buyerReceiveAddress": "北京市北京市顺义区旺泉街道 石门苑25栋3门302", "buyerReceiveMobile": "xxxxxxxxxxxxx",
          "buyerPayAccount": "152****0708", "buyerPayAccountId": "-1"
        },
        "sort": [ "2020-04-02 10:22:15", "xxxxxxxxxxxxx" ]
      },
      {
        "_index": "test_index", "_type": "elasticsearch", "_id": "xxxxxxxxxxxxx", "_score": null,
        "_source": {
          "buyerReceiveAddress": "贵州省遵义市正安县安场镇播州大道安页四井", "buyerReceiveMobile": "xxxxxxxxxxxxx",
          "orderSubTypeId": 0, "productId": 27657, "orderId": xxxxxxxxxxxxx, "orderAddtime": "2020-04-02 10:15:01",
          "productSize": "", "orderTradeStatus": 2000, "poundage": 60.63, "sellerName": "大眼睛潮品代购",
          "orderNum": "xxxxxxxxxxxxx", "orderTypeId": 0, "sellerUserId": xxxxxxxxxxxxx, "sku_id": xxxxxxxxxxxxx,
          "buyerName": "鹤鹤有鸣", "productName": "Champion 半拉链刺绣小Logo草写 冲锋衣 美版 藏青色",
          "paidMoney": 287.0, "price": 269.0, "freightCost": 18.0, "buyerUserId": xxxxxxxxxxxxx,
          "buyerPayAccount": "132****7962", "buyerPayAccountId": "-1"
        },
        "sort": [ "2020-04-02 10:15:01", "xxxxxxxxxxxxx" ]
      },
      {
        "_index": "test_index", "_type": "elasticsearch", "_id": "xxxxxxxxxxxxx", "_score": null,
        "_source": {
          "orderSubTypeId": 0, "productId": 29148, "orderId": xxxxxxxxxxxxx, "orderAddtime": "2020-04-02 10:13:33",
          "productSize": "", "orderTradeStatus": 2000, "poundage": 59.33, "sellerName": "大眼睛潮品代购",
          "orderNum": "xxxxxxxxxxxxx", "orderTypeId": 0, "sellerUserId": xxxxxxxxxxxxx, "sku_id": xxxxxxxxxxxxx,
          "buyerName": "重情义绿龙虾nJt", "productName": "champion 冠军 腿标 短裤 黑色",
          "paidMoney": 162.0, "price": 139.0, "freightCost": 23.0, "buyerUserId": xxxxxxxxxxxxx,
          "buyerReceiveAddress": "云南省玉溪市澄江县龙街镇高西村委会小官庄102号", "buyerReceiveMobile": "xxxxxxxxxxxxx",
          "buyerPayAccount": "187****7958", "buyerPayAccountId": "-1"
        },
        "sort": [ "2020-04-02 10:13:33", "xxxxxxxxxxxxx" ]
      },
      {
        "_index": "test_index", "_type": "elasticsearch", "_id": "xxxxxxxxxxxxx", "_score": null,
        "_source": {
          "orderSubTypeId": 0, "productId": 33977, "orderId": xxxxxxxxxxxxx, "orderAddtime": "2020-04-02 10:09:56",
          "productSize": "", "orderTradeStatus": 2000, "poundage": 59.03, "sellerName": "大眼睛潮品代购",
          "orderNum": "xxxxxxxxxxxxx", "orderTypeId": 0, "sellerUserId": xxxxxxxxxxxxx, "sku_id": xxxxxxxxxxxxx,
          "buyerName": "AAA送你到楼梯", "productName": "Champion 袖口单标基础款打底衫短袖T恤 美版 白色",
          "paidMoney": 132.0, "price": 109.0, "freightCost": 23.0, "buyerUserId": xxxxxxxxxxxxx,
          "buyerReceiveAddress": "四川省乐山市市中区八仙洞17号", "buyerReceiveMobile": "xxxxxxxxxxxxx",
          "buyerPayAccount": "182****3677", "buyerPayAccountId": "xxxxxxxxxxxxx"
        },
        "sort": [ "2020-04-02 10:09:56", "xxxxxxxxxxxxx" ]
      }
    ]
  }
}
The _id and orderAddtime values of the 5 returned records are:
xxxxxxxxxxxxxxx 2020-04-02 10:13:33
xxxxxxxxxxxxxxx 2020-04-02 10:09:56
xxxxxxxxxxxxxxx 2020-04-02 10:04:58
xxxxxxxxxxxxxxx 2020-04-02 10:03:26
xxxxxxxxxxxxxxx 2020-04-02 10:02:58
Using search_after, passing the sort values of the last hit of the previous page:
GET /test_index/_search
{
  "size": 2,
  "query": {
    "match": { "sellerUserId": "xxxx" }
  },
  "search_after": ["2020-04-02 10:09:56", "xxxxxxx"],
  "sort": [
    { "orderAddtime": "desc" },
    { "_id": "desc" }
  ]
}
Response:
{
  "took": 16,
  "timed_out": false,
  "_shards": { "total": 5, "successful": 5, "skipped": 0, "failed": 0 },
  "hits": {
    "total": 108784,
    "max_score": null,
    "hits": [
      {
        "_index": "test_index", "_type": "elasticsearch", "_id": "xxxxxxxxxxxxx", "_score": null,
        "_source": {
          "orderSubTypeId": 0, "productId": 27027, "orderId": xxxxxxxxxxxxx, "orderAddtime": "2020-04-02 10:04:58",
          "productSize": "", "orderTradeStatus": 2000, "poundage": 59.33, "sellerName": "大眼睛潮品代购",
          "orderNum": "xxxxxxxxxxxxx", "orderTypeId": 0, "sellerUserId": xxxxxxxxxxxxx, "sku_id": xxxxxxxxxxxxx,
          "buyerName": "可能是快跑", "productName": "Champion 纯棉短裤 黑色",
          "paidMoney": 157.0, "price": 139.0, "freightCost": 18.0, "buyerUserId": xxxxxxxxxxxxx,
          "buyerReceiveAddress": "辽宁省丹东市元宝区宗裕城c区天天超市", "buyerReceiveMobile": "xxxxxxxxxxxxx",
          "buyerPayAccount": "151****3091", "buyerPayAccountId": "-1"
        },
        "sort": [ "2020-04-02 10:04:58", "xxxxxxxxxxxxx" ]
      },
      {
        "_index": "test_index", "_type": "elasticsearch", "_id": "xxxxxxxxxxxxx", "_score": null,
        "_source": {
          "orderSubTypeId": 0, "productId": xxxxxxxxxxxxx, "orderId": xxxxxxxxxxxxx, "orderAddtime": "2020-04-02 10:03:26",
          "productSize": "", "orderTradeStatus": 2000, "poundage": 59.43, "sellerName": "大眼睛潮品代购",
          "orderNum": "xxxxxxxxxxxxx", "orderTypeId": 0, "sellerUserId": xxxxxxxxxxxxx, "sku_id": xxxxxxxxxxxxx,
          "buyerName": "可能是快跑", "productName": "Champion 圆领 薄长袖 T恤 美版 白色",
          "paidMoney": 167.0, "price": 149.0, "freightCost": 18.0, "buyerUserId": xxxxxxxxxxxxx,
          "buyerReceiveAddress": "辽宁省丹东市元宝区宗裕城c区天天超市", "buyerReceiveMobile": "xxxxxxxxxxxxx",
          "buyerPayAccount": "151****3091", "buyerPayAccountId": "-1"
        },
        "sort": [ "2020-04-02 10:03:26", "xxxxxxxxxxxxx" ]
      }
    ]
  }
}
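The search_after walk-through above can be mimicked with an in-memory list. This is only a sketch: the document ids are hypothetical, the "index" is a Python list, and the sort mirrors the orderAddtime desc, _id desc ordering used in the requests.

```python
# Hypothetical documents; two of them share a timestamp to show why _id is needed
DOCS = [
    {"_id": "o1", "orderAddtime": "2020-04-02 10:09:56"},
    {"_id": "o2", "orderAddtime": "2020-04-02 10:15:01"},
    {"_id": "o3", "orderAddtime": "2020-04-02 10:15:01"},
    {"_id": "o4", "orderAddtime": "2020-04-02 10:22:15"},
    {"_id": "o5", "orderAddtime": "2020-04-02 10:25:32"},
]


def sort_key(doc):
    return (doc["orderAddtime"], doc["_id"])


def search(docs, size, search_after=None):
    """Return one page ordered by (orderAddtime desc, _id desc)."""
    ordered = sorted(docs, key=sort_key, reverse=True)
    if search_after is not None:
        cursor = tuple(search_after)
        # Descending order: the next page holds keys strictly smaller than the cursor
        ordered = [d for d in ordered if sort_key(d) < cursor]
    return [{"_id": d["_id"], "sort": list(sort_key(d))} for d in ordered[:size]]


def iterate_pages(docs, size):
    """Walk all pages by feeding the last hit's sort values into the next request."""
    cursor = None
    while True:
        page = search(docs, size, cursor)
        if not page:
            return
        yield [hit["_id"] for hit in page]
        cursor = page[-1]["sort"]


if __name__ == "__main__":
    for page in iterate_pages(DOCS, 3):
        print(page)
```

With a page size of 3 the first page ends on o3, whose timestamp equals that of o2. Without the _id tiebreaker in the sort, a cursor built from the timestamp alone could not tell the two apart and o2 would be skipped or repeated.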
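In summary, the common approaches to paginating large result sets in ES are:
1. Normal pagination with from/size. It can only reach the first 10,000 records (from + size is limited by index.max_result_window, which defaults to 10000), and the hits are loaded into memory first, so it is memory-intensive.
2. Deep pagination, which comes in two forms. One is scroll: each request returns a scrollId, and the next request uses that scrollId to fetch the next batch. The other is search_after: sort on a unique field, and each request continues from the last sort value to fetch the next batch.
Differences:
scroll creates a snapshot with a specified time-to-live. Data written after the snapshot was taken is not part of it and cannot be found.
search_after sorts on a field with unique values (usually _id) and passes the sort values of the last returned hit as the search_after value of the next request.
Advantages of search_after: it avoids the performance problems of deep pagination and sees up-to-date data when fetching the next page.
Disadvantages of search_after: it cannot jump to a given page number; pages can only be traversed one after another.
scroll is also recommended for full exports; it is roughly twice as efficient as from/size.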
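The differences summarized above can be sketched in Python. Everything here is a simplified model with made-up names: the "index" is a plain list of integers, the scroll snapshot is a copied list, and the window check only reproduces the from + size limit.

```python
MAX_RESULT_WINDOW = 10_000  # default value of index.max_result_window


def within_window(from_, size, max_window=MAX_RESULT_WINDOW):
    """from/size paging is only allowed while from + size stays inside the window."""
    return from_ + size <= max_window


class Scroll:
    """Iterates over a frozen copy of the index taken when the scroll is opened."""

    def __init__(self, index, size):
        self._snapshot = sorted(index)  # later writes to `index` are not visible
        self._size = size
        self._pos = 0

    def next_batch(self):
        batch = self._snapshot[self._pos:self._pos + self._size]
        self._pos += len(batch)
        return batch


def search_after(index, size, after=None):
    """Reads the live index on every call, so newly written documents can appear."""
    live = sorted(index)
    if after is not None:
        live = [value for value in live if value > after]
    return live[:size]


if __name__ == "__main__":
    index = [1, 2, 3, 4]
    scroll = Scroll(index, size=2)
    print(scroll.next_batch())
    index.extend([5, 6])                      # written after the snapshot was taken
    print(scroll.next_batch(), scroll.next_batch())
    print(search_after(index, 2, after=4))    # the live view sees the new documents
```

The scroll finishes after the four documents that existed when it was opened, while the search_after request issued afterwards returns the two documents written in the meantime.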
6. ES performance tuning:
https://blog.csdn.net/hellozhxy/article/details/90938381
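7. ES search types:
1) query and fetch
The query is sent to all shards of the index, and each shard returns the documents together with the computed ranking information. This is the fastest search type, because unlike the ones below it queries each shard only once. However, the total number of results returned by all shards may be n times the size the user asked for.
2) query then fetch (the default search type)
This is the search type used when none is specified. It works in roughly two steps. First, the request is sent to all shards, and each shard returns only the sorting and ranking information (not the documents themselves); the results are then re-sorted and re-ranked by the scores returned from the shards, and the top size documents are selected. Second, the documents are fetched from the relevant shards. The number of documents returned equals the size the user asked for.
3) DFS query and fetch
Adds an initial scatter step to the first type. With this step, scoring and ranking are said to be more accurate.
4) DFS query then fetch
Adds an initial scatter step to the second type.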
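The difference between query and fetch and query then fetch in the list above can be sketched as follows. The shards, scores and document store are hypothetical in-memory structures; the point is only the shape of the two phases, not how Elasticsearch implements them.

```python
import heapq

# Hypothetical shards: doc id -> score computed locally on that shard
SHARDS = [
    {"d1": 0.9, "d2": 0.4, "d3": 0.1},
    {"d4": 0.8, "d5": 0.7},
    {"d6": 0.3},
]
STORE = {doc_id: {"_id": doc_id} for shard in SHARDS for doc_id in shard}


def query_phase(shard, size):
    """A shard returns only (score, doc_id) pairs for its local top `size` hits."""
    return heapq.nlargest(size, ((score, doc_id) for doc_id, score in shard.items()))


def query_then_fetch(shards, store, size):
    candidates = [hit for shard in shards for hit in query_phase(shard, size)]
    top = heapq.nlargest(size, candidates)        # coordinator re-ranks globally
    return [store[doc_id] for _, doc_id in top]   # fetch phase: load only the winners


def query_and_fetch(shards, store, size):
    """Every shard returns full documents in one step: up to size * shard count results."""
    return [store[doc_id] for shard in shards for _, doc_id in query_phase(shard, size)]


if __name__ == "__main__":
    print(query_then_fetch(SHARDS, STORE, size=2))
    print(len(query_and_fetch(SHARDS, STORE, size=2)))
```

With size 2, query then fetch returns exactly two documents after the global re-ranking, whereas query and fetch hands back every shard's local top hits, which is more than the requested size.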
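What does DFS stand for, and what does the initial scatter do?
According to the official Elasticsearch documentation, the initial scatter collects the term frequencies and document frequencies from every shard before the real query runs, so that each shard then searches and ranks using the global term and document frequencies. Clearly DFS_QUERY_THEN_FETCH is the least efficient search type, since a single search may need up to three round trips to the shards. In return, the DFS variants should give the most accurate scoring.
As for the abbreviation, I could not find authoritative material at the time. D is probably Distributed and F probably Frequency; DFS is commonly expanded as Distributed Frequency Search, which fits the process described above.
In short, QUERY_AND_FETCH is the fastest and DFS_QUERY_THEN_FETCH the slowest in terms of performance, while in terms of scoring accuracy the DFS variants beat the non-DFS ones. Note that recent Elasticsearch versions only let you request query_then_fetch and dfs_query_then_fetch explicitly.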
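Why collecting global frequencies first changes the ranking can be shown with a small Python sketch. The shard contents are hypothetical and the IDF formula is a simple smoothed variant picked for illustration, not the exact formula Elasticsearch uses.

```python
import math

# Hypothetical shards; 'apple' is common on the first and rare on the second
SHARDS = [
    ["apple pie", "apple tart", "apple juice", "banana bread"],
    ["apple cider", "cherry pie", "lemon tart", "plum jam"],
]


def doc_freq(shard, term):
    """Number of documents on this shard that contain the term."""
    return sum(term in doc.split() for doc in shard)


def idf(doc_count, df):
    """Illustrative smoothed inverse document frequency."""
    return math.log(1 + doc_count / (df + 1))


def local_idfs(shards, term):
    """Without DFS: every shard scores with its own statistics."""
    return [idf(len(shard), doc_freq(shard, term)) for shard in shards]


def global_idf(shards, term):
    """With DFS: statistics are summed across shards before scoring."""
    total_docs = sum(len(shard) for shard in shards)
    total_df = sum(doc_freq(shard, term) for shard in shards)
    return idf(total_docs, total_df)


if __name__ == "__main__":
    print(local_idfs(SHARDS, "apple"))
    print(global_idf(SHARDS, "apple"))
```

Without the initial scatter, the same term gets a different weight on each shard, so scores from different shards are not directly comparable. The global value sits between the two local ones and is identical everywhere.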
References:
https://www.cnblogs.com/xll970105/p/11561537.html
ES query types: https://www.colabug.com/2018/0902/4334463/
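8. Principles for creating ES mappings:
1) Turn off automatic (dynamic) mapping
The dynamic setting accepts true (the default), false and strict. Both true and false allow documents with unknown fields to be indexed, but the resulting mapping differs: with true (or when the setting is left unset) a mapping is generated automatically for the unknown field, while with false the unknown field gets no mapping at all. strict rejects documents that contain unknown fields.
2) The refresh interval defaults to 1s; change it to 60s
3) Bulk writes to ES can sustain 10,000+ QPS without problems
4) Remove the read-only block from indices:
PUT _settings
{
  "index": {
    "blocks": {
      "read_only_allow_delete": "false"
    }
  }
}
5) analyzer: when no analyzer is specified, the default standard analyzer is used
6) When creating an index, specify the number of shards and replicas explicitly (to avoid ending up with a default of 1 shard, which is the default on some ES 6.7 setups). Our cluster has 3 nodes, so 3 shards and 1 replica are recommended:
"settings": { "number_of_shards": 3, "number_of_replicas": 1 },
7) Time fields: use the date type consistently and set format to yyyy-MM-dd HH:mm:ss (ES stores JSON, and JSON has no date type, so the value can be passed as a string on write)
"auditTime": { "type": "date", "format": "yyyy-MM-dd HH:mm:ss" }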
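Field type design principles:
1) Prefer the keyword type for fields: it is fast to query and supports sorting
2) Use the text type for content fields: it is analyzed but does not support sorting
3) Store time as long: it supports range queries; rounding to the minute is recommended and improves query efficiency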
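Several of the principles in this section (explicit shards and replicas, a 60s refresh interval, controlled dynamic mapping, a formatted date field) can be combined into one create-index body. The following Python sketch only builds that body as a dictionary. The function names and the optional type_name parameter are assumptions for illustration; clusters that still use mapping types expect the mapping nested under the type name, newer ones do not.

```python
from datetime import datetime


def build_index_body(type_name=None, shards=3, replicas=1,
                     refresh_interval="60s", dynamic="strict"):
    """Assemble a create-index body from the principles listed in this section."""
    mapping = {
        "dynamic": dynamic,
        "properties": {
            "auditTime": {"type": "date", "format": "yyyy-MM-dd HH:mm:ss"},
        },
    }
    return {
        "settings": {
            "number_of_shards": shards,
            "number_of_replicas": replicas,
            "refresh_interval": refresh_interval,
        },
        # Assumption: pass type_name only on clusters that still use mapping types
        "mappings": {type_name: mapping} if type_name else mapping,
    }


def mapped_fields(known_fields, doc, dynamic):
    """Model which fields end up in the mapping under each dynamic mode."""
    unknown = set(doc) - set(known_fields)
    if dynamic == "strict" and unknown:
        raise ValueError(f"unknown fields rejected: {sorted(unknown)}")
    if dynamic == "true":
        return set(known_fields) | unknown  # new fields get a generated mapping
    return set(known_fields)                # "false": accepted but left unmapped


def format_audit_time(dt):
    """Render a datetime in the yyyy-MM-dd HH:mm:ss layout declared in the mapping."""
    return dt.strftime("%Y-%m-%d %H:%M:%S")


if __name__ == "__main__":
    print(build_index_body())
    print(mapped_fields({"id"}, {"id": 1, "extra": 2}, "false"))
    print(format_audit_time(datetime(2020, 4, 2, 10, 25, 32)))
```

Using strict here is one possible reading of "turn off automatic mapping"; false would also stop mappings from being generated while still accepting the documents.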
9. Common ES statements:
Query:
GET /trend_reply/_search
{
  "query": {
    "bool": {
      "must": [
        { "range": { "publishTime": { "gte": 1577808000, "lt": 1590940800 } } }
      ],
      "adjust_pure_negative": true,
      "boost": 1.0
    }
  }
}
GET /risk_order_his/_search
{
  "query": {
    "bool": {
      "must": [
        { "term": { "orderNum": { "value": 1120010815637847, "boost": 1.0 } } }
      ],
      "adjust_pure_negative": true,
      "boost": 1.0
    }
  }
}
GET /trend_reply/_search
{
  "from": 0,
  "size": 10000,
  "timeout": "60s",
  "query": {
    "bool": {
      "filter": [
        { "match_phrase_prefix": { "content": { "query": "c", "slop": 0, "max_expansions": 50, "boost": 1 } } },
        { "range": { "result": { "from": "0", "to": null, "include_lower": true, "include_upper": true, "boost": 1 } } },
        { "exists": { "field": "auditResultId", "boost": 1 } }
      ],
      "must_not": [
        { "term": { "auditResultId": { "value": "", "boost": 1 } } }
      ],
      "adjust_pure_negative": true,
      "boost": 1
    }
  }
}
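The epoch bounds in the publishTime range query can be derived from calendar dates. The Python sketch below assumes the timestamps are epoch seconds in UTC+8 (an assumption that happens to reproduce the two numbers used in the query) and rounds down to the minute; the helper names are made up.

```python
from datetime import datetime, timedelta, timezone

CST = timezone(timedelta(hours=8))  # assumption: times are meant as UTC+8


def to_epoch_minute(dt):
    """Epoch seconds rounded down to the start of the minute."""
    seconds = int(dt.timestamp())
    return seconds - seconds % 60


def range_query(field, start, end):
    """Build a half-open [start, end) range clause on an epoch-seconds field."""
    return {"range": {field: {"gte": to_epoch_minute(start), "lt": to_epoch_minute(end)}}}


if __name__ == "__main__":
    clause = range_query(
        "publishTime",
        datetime(2020, 1, 1, tzinfo=CST),
        datetime(2020, 6, 1, tzinfo=CST),
    )
    print(clause)
```

Under that time zone assumption the two bounds correspond to 2020-01-01 00:00 and 2020-06-01 00:00, so the query covers the first five months of 2020.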
Update:
POST /trend_reply/_update_by_query
{
  "script": {
    "source": "ctx._source['result']=8"
  },
  "query": {
    "term": { "_id": "F-n4XnIB5Q-NfXXmj7pa" }
  }
}
Delete:
POST /trend_reply/_delete_by_query?wait_for_completion=false
{
  "query": {
    "bool": {
      "must": [
        { "range": { "publishTime": { "gte": 1575302400, "lt": 1590940800 } } }
      ],
      "adjust_pure_negative": true,
      "boost": 1.0
    }
  }
}
Find records where two fields are equal:
# Compare two values as strings
GET xxx_index/_search
{
  "query": {
    "bool": {
      "must": [{ "match_all": {} }],
      "filter": [
        {
          "script": {
            "script": {
              "inline": "String.valueOf(doc['dataId'].value) == doc['_id'].value",
              "lang": "painless"
            }
          }
        }
      ],
      "must_not": [],
      "should": []
    }
  }
}
# Compare two long fields
GET xxx_index/_search
{
  "query": {
    "bool": {
      "must": [{ "match_all": {} }],
      "filter": [
        {
          "script": {
            "script": {
              "inline": "doc['dataId'].value - doc['userId'].value == 0",
              "lang": "painless"
            }
          }
        }
      ],
      "must_not": [],
      "should": []
    }
  }
}