ES Study Notes, Part 2 (Complex Search and Aggregation Queries)
1. Prefix search: in real applications, avoid it whenever you can; the performance is poor.
C3D0-KD345
C3K5-DFG65
C4I8-UI365

Searching "C3" should return the first two codes above, i.e. searching by string prefix. We won't reuse the forum-post case here; this is simple enough that a small hand-built index is all we need to demonstrate it.

PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "title": {
          "type": "keyword"
        }
      }
    }
  }
}

GET my_index/my_type/_search
{
  "query": {
    "prefix": {
      "title": {
        "value": "C3"
      }
    }
  }
}
2. Search suggestions (search-as-you-type)
Baidu --> elas --> elasticsearch --> Elasticsearch: The Definitive Guide
GET /my_index/my_type/_search
{
  "query": {
    "match_phrase_prefix": {
      "title": "hello d"
    }
  }
}
The principle is like match_phrase; the only difference is that the last term is searched as a prefix.
hello is matched normally, finding the corresponding docs.
d is treated as a prefix and scans the whole inverted index to find every term starting with d.
Then keep the docs that contain both hello and some term starting with d.
Finally, using your slop, check whether within that slop the positions of hello and the d-prefixed term in the doc can line up with the phrase "hello d".
You can also specify slop, but only the last term is ever treated as a prefix.
max_expansions: caps how many terms the prefix is allowed to match; past that count matching stops, which bounds the cost.
By default the prefix must scan every term in the inverted index looking for words that start with d, which performs terribly. With max_expansions, once the prefix has matched that many terms, ES stops scanning the inverted index.
Still, use this sparingly: the trailing prefix always has to scan a large number of index terms, so performance can be very poor. A sketch combining both parameters follows below.
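A minimal sketch combining slop and max_expansions in one match_phrase_prefix query (the parameter values here are illustrative, not from the original notes):

GET /my_index/my_type/_search
{
  "query": {
    "match_phrase_prefix": {
      "title": {
        "query": "hello d",
        "slop": 1,
        "max_expansions": 50
      }
    }
  }
}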
3. Negative boost
Unlike must_not, which excludes matching docs outright, the boosting query keeps docs that match the negative clause but multiplies their score by negative_boost (0.2 below), demoting them rather than dropping them.
GET /forum/article/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "content": "java" } }
      ],
      "must_not": [
        { "match": { "content": "spark" } }
      ]
    }
  }
}

GET /forum/article/_search
{
  "query": {
    "boosting": {
      "positive": {
        "match": { "content": "java" }
      },
      "negative": {
        "match": { "content": "spark" }
      },
      "negative_boost": 0.2
    }
  }
}
4. Letting a custom field influence scoring
Data preparation
POST /forum/article/_bulk
{ "update": { "_id": "1"} }
{ "doc" : {"follower_num" : 5} }
{ "update": { "_id": "2"} }
{ "doc" : {"follower_num" : 10} }
{ "update": { "_id": "3"} }
{ "doc" : {"follower_num" : 25} }
{ "update": { "_id": "4"} }
{ "doc" : {"follower_num" : 3} }
{ "update": { "_id": "5"} }
{ "doc" : {"follower_num" : 60} }
GET /forum/article/_search
{
  "query": {
    "function_score": {
      "query": {
        "multi_match": {
          "query": "java spark",
          "fields": ["title", "content"]
        }
      },
      "field_value_factor": {
        "field": "follower_num",
        "modifier": "log1p",
        "factor": 0.5
      },
      "boost_mode": "sum",
      "max_boost": 2
    }
  }
}
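A hedged reading of how this combines into a score, based on the documented field_value_factor semantics (treat the exact shape as an approximation): the function value is log10(1 + factor * follower_num), it is capped by max_boost, and boost_mode "sum" adds it to the relevance score:

final_score ≈ _score + min(2, log10(1 + 0.5 * follower_num))

For example, the doc with follower_num = 60 gets roughly log10(1 + 30) ≈ 1.49 added to its _score.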
5. Fuzzy search --> automatically corrects misspelled search text, then tries to match the corrected text against the indexed data
Data preparation
POST /my_index/my_type/_bulk
{ "index": { "_id": 1 }}
{ "text": "Surprise me!"}
{ "index": { "_id": 2 }}
{ "text": "That was surprising."}
{ "index": { "_id": 3 }}
{ "text": "I wasn't surprised."}
GET /my_index/my_type/_search
{
  "query": {
    "fuzzy": {
      "text": {
        "value": "surprize",
        "fuzziness": 2
      }
    }
  }
}

surprize --> misspelled --> surprise (s was typed as z)
surprize --> surprise: fix z -> s, one edit, within the fuzziness of 2, so it matches
surprize --> surprised: fix z -> s and append a d, two edits, still within the fuzziness of 2, so it matches
surprize --> surprising: fix z -> s, drop the e, add ing; that takes several edits, well past 2, so it can never match

A fuzzy search automatically tries to correct your search text and match the corrections against the indexed text.
fuzziness: the maximum number of character edits allowed between your search text and the data; if unset, the default is 2.

GET /my_index/my_type/_search
{
  "query": {
    "match": {
      "text": {
        "query": "SURPIZE ME",
        "fuzziness": "AUTO",
        "operator": "and"
      }
    }
  }
}
6. Two core concepts: bucket and metric (important!)
city name
北京 小李
北京 小王
上海 小张
上海 小丽
上海 小陈
Bucketing by city:
Two buckets come out, a 北京 bucket and a 上海 bucket.
北京 bucket: contains 2 people, 小李 and 小王
上海 bucket: contains 3 people, 小张, 小丽, 小陈
When you bucket on some field, all rows with the same value for that field land in the same bucket.
If you know some MySQL SQL: the first step of an aggregation is grouping, then you analyze the data within each group. The group is our bucket.
metric: a statistic computed over one group of data
Once we have a set of buckets, we can run aggregate analysis on each bucket's data, e.g. count the docs in a bucket, or compute the average, max, or min of some field within a bucket.
A metric is an aggregate operation executed against one bucket, such as avg, max, or min.
select count(*)
from access_log
group by user_id
bucket: group by user_id --> rows with the same user_id are placed into one bucket
metric: count(*) --> for each user_id bucket, count all the rows inside it (see the ES sketch below)
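A rough ES equivalent of that SQL, as a sketch (the access_log index and its user_id keyword field are assumptions for illustration, not from the notes):

GET /access_log/_search
{
  "size": 0,
  "aggs": {
    "group_by_user_id": {
      "terms": { "field": "user_id" }
    }
  }
}

Each returned bucket carries a doc_count, which plays the role of count(*).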
7. Data preparation:
Background: TV sales records from an appliance store. We will analyze the sales volume and revenue of TVs by brand and by color from various angles.

PUT /tvs
{
  "mappings": {
    "sales": {
      "properties": {
        "price": { "type": "long" },
        "color": { "type": "keyword" },
        "brand": { "type": "keyword" },
        "sold_date": { "type": "date" }
      }
    }
  }
}

POST /tvs/sales/_bulk
{ "index": {}}
{ "price" : 1000, "color" : "红色", "brand" : "长虹", "sold_date" : "2016-10-28" }
{ "index": {}}
{ "price" : 2000, "color" : "红色", "brand" : "长虹", "sold_date" : "2016-11-05" }
{ "index": {}}
{ "price" : 3000, "color" : "绿色", "brand" : "小米", "sold_date" : "2016-05-18" }
{ "index": {}}
{ "price" : 1500, "color" : "蓝色", "brand" : "TCL", "sold_date" : "2016-07-02" }
{ "index": {}}
{ "price" : 1200, "color" : "绿色", "brand" : "TCL", "sold_date" : "2016-08-19" }
{ "index": {}}
{ "price" : 2000, "color" : "红色", "brand" : "长虹", "sold_date" : "2016-11-05" }
{ "index": {}}
{ "price" : 8000, "color" : "红色", "brand" : "三星", "sold_date" : "2017-01-01" }
{ "index": {}}
{ "price" : 2500, "color" : "蓝色", "brand" : "小米", "sold_date" : "2017-02-12" }
8. Which TV color sells the most
GET /tvs/sales/_search
{
  "size" : 0,
  "aggs" : {
    "popular_colors" : {
      "terms" : {
        "field" : "color"
      }
    }
  }
}
size: 0 means return only the aggregation results, not the raw documents the aggregation ran over
aggs: fixed syntax for running a grouping/aggregation over the data
popular_colors: every agg must be given a name; the name is arbitrary, anything works
terms: group by a field's values
field: the field whose values define the groups
Default ordering: buckets are sorted by doc_count, descending
9. Average price per color
select avg(price)
from tvs.sales
group by color
GET /tvs/sales/_search
{
  "size" : 0,
  "aggs": {
    "colors": {
      "terms": {
        "field": "color"
      },
      "aggs": {
        "avg_price": {
          "avg": {
            "field": "price"
          }
        }
      }
    }
  }
}
10. Drill down along multiple dimensions (color + brand); note that each drill-down level (color, then color + brand) can run its own metric aggregation
This is where ES is stronger than MySQL: each level of grouping yields not only the aggregate statistics but also the nested group detail, similar to a window function.
Group statistics: select color,brand,avg(price) from tvs.sales group by color,brand;
Group detail: select color,brand,* from tvs.sales;
Window function: select *, avg(price) over(partition by color,brand order by color desc) from tvs.sales;
GET /tvs/sales/_search
{
  "size": 0,
  "aggs": {
    "group_by_color": {
      "terms": {
        "field": "color"
      },
      "aggs": {
        "color_avg_price": {
          "avg": {
            "field": "price"
          }
        },
        "group_by_brand": {
          "terms": {
            "field": "brand"
          },
          "aggs": {
            "brand_avg_price": {
              "avg": {
                "field": "price"
              }
            }
          }
        }
      }
    }
  }
}
11. In general, 90% of common data-analysis metrics boil down to count, avg, max, min, and sum. A stats shortcut is sketched after the query below.
GET /tvs/sales/_search
{
  "size" : 0,
  "aggs": {
    "colors": {
      "terms": {
        "field": "color"
      },
      "aggs": {
        "avg_price": { "avg": { "field": "price" } },
        "min_price": { "min": { "field": "price" } },
        "max_price": { "max": { "field": "price" } },
        "sum_price": { "sum": { "field": "price" } }
      }
    }
  }
}
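A side note not in the original notes: ES also ships a stats aggregation that returns count, min, max, avg, and sum in one request, covering the four separate metrics above in a single sub-aggregation. A sketch:

GET /tvs/sales/_search
{
  "size": 0,
  "aggs": {
    "colors": {
      "terms": { "field": "color" },
      "aggs": {
        "price_stats": {
          "stats": { "field": "price" }
        }
      }
    }
  }
}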
12. histogram
histogram: like terms, it performs a bucket-grouping operation; it takes a field and groups documents into buckets by numeric ranges of that field's values
"histogram":{
"field": "price",
"interval": 2000
},
interval: 2000 splits the range into buckets: 0~2000, 2000~4000, 4000~6000, 6000~8000, 8000~10000
For each document's price, say 2500, find the interval it falls in (2000~4000) and place the doc into that interval's bucket.
Compare with terms bucketing, which places documents with identical field values into the same bucket.
Once the buckets exist, exactly as before, run any metric (avg, count, sum, max, min, ...) on each bucket.
GET /tvs/sales/_search
{
  "size" : 0,
  "aggs": {
    "price": {
      "histogram": {
        "field": "price",
        "interval": 2000
      },
      "aggs": {
        "revenue": {
          "sum": {
            "field": "price"
          }
        }
      }
    }
  }
}
13. date_histogram
Recap: histogram is a bucket-grouping operation that divides buckets by a fixed interval over some numeric value.
date_histogram divides buckets by a date field, using a date interval.
With a one-month date interval:
2017-01-01~2017-01-31 is one bucket
2017-02-01~2017-02-28 is one bucket
ES then scans each document's date field, decides which bucket the date falls into, and places the doc there.
e.g. 2017-01-05 falls into the 2017-01-01~2017-01-31 bucket
min_doc_count: set to 0, even an interval like 2017-01-01~2017-01-31 that holds no documents at all is still returned; otherwise empty intervals would be filtered out
extended_bounds min/max: bucket creation is bounded by this start date and end date
GET /tvs/sales/_search
{
  "size" : 0,
  "aggs": {
    "sales": {
      "date_histogram": {
        "field": "sold_date",
        "interval": "month",
        "format": "yyyy-MM-dd",
        "min_doc_count" : 0,
        "extended_bounds" : {
          "min" : "2016-01-01",
          "max" : "2017-12-31"
        }
      }
    }
  }
}
14. Quarterly revenue per brand (date_histogram + nested terms)
GET /tvs/sales/_search
{
  "size": 0,
  "aggs": {
    "group_by_sold_date": {
      "date_histogram": {
        "field": "sold_date",
        "interval": "quarter",
        "format": "yyyy-MM-dd",
        "min_doc_count": 0,
        "extended_bounds": {
          "min": "2016-01-01",
          "max": "2017-12-31"
        }
      },
      "aggs": {
        "group_by_brand": {
          "terms": {
            "field": "brand"
          },
          "aggs": {
            "sum_price": {
              "sum": {
                "field": "price"
              }
            }
          }
        },
        "total_sum_price": {
          "sum": {
            "field": "price"
          }
        }
      }
    }
  }
}
15. Aggregation scope: aggregating within search results
select count(*)
from tvs.sales
where brand like "%长%"
group by price
ES aggregation scope: every aggregation executes against the data returned by the search; the search result set is the scope of the aggregation analysis.
GET /tvs/sales/_search
{
  "size": 0,
  "query": {
    "term": {
      "brand": {
        "value": "小米"
      }
    }
  },
  "aggs": {
    "group_by_color": {
      "terms": {
        "field": "color"
      }
    }
  }
}
16. global bucket
Aggregation scope again: an aggregation normally executes within the query's search results.
Here two results come back: one aggregated over the query results, one aggregated over all of the data.
GET /tvs/sales/_search
{
  "size": 0,
  "query": {
    "term": {
      "brand": {
        "value": "长虹"
      }
    }
  },
  "aggs": {
    "single_brand_avg_price": {
      "avg": {
        "field": "price"
      }
    },
    "all": {
      "global": {},
      "aggs": {
        "all_brand_avg_price": {
          "avg": {
            "field": "price"
          }
        }
      }
    }
  }
}
global: the global bucket pulls all documents into the aggregation scope, ignoring the preceding query.
17. Filter + aggregation
We have already done search + aggregation; filter + aggregation works the same way, with the filter narrowing the aggregation scope:
GET /tvs/sales/_search
{
  "size": 0,
  "query": {
    "constant_score": {
      "filter": {
        "range": {
          "price": {
            "gte": 1200
          }
        }
      }
    }
  },
  "aggs": {
    "avg_price": {
      "avg": {
        "field": "price"
      }
    }
  }
}
18. bucket filter
aggs.filter applies only to the aggregation it wraps.
A filter placed inside the query is global: it affects all of the data.
But suppose you want, for 长虹 TVs only, the average price over the last month, the last 3 months, and the last 6 months.
bucket filter: apply a different filter to the aggs under each bucket.
GET /tvs/sales/_search
{
  "size": 0,
  "query": {
    "term": {
      "brand": {
        "value": "长虹"
      }
    }
  },
  "aggs": {
    "recent_150d": {
      "filter": {
        "range": {
          "sold_date": {
            "gte": "now-150d"
          }
        }
      },
      "aggs": {
        "recent_150d_avg_price": {
          "avg": {
            "field": "price"
          }
        }
      }
    },
    "recent_140d": {
      "filter": {
        "range": {
          "sold_date": {
            "gte": "now-140d"
          }
        }
      },
      "aggs": {
        "recent_140d_avg_price": {
          "avg": {
            "field": "price"
          }
        }
      }
    },
    "recent_130d": {
      "filter": {
        "range": {
          "sold_date": {
            "gte": "now-130d"
          }
        }
      },
      "aggs": {
        "recent_130d_avg_price": {
          "avg": {
            "field": "price"
          }
        }
      }
    }
  }
}
19. Sorting buckets by a metric
Previously, buckets were ordered by each bucket's doc_count, descending.
But suppose we compute a metric per TV color (revenue, or average price as in the second example below) and need to order the color buckets by that metric instead?
GET /tvs/sales/_search
{
  "size": 0,
  "aggs": {
    "group_by_color": {
      "terms": {
        "field": "color"
      },
      "aggs": {
        "avg_price": {
          "avg": {
            "field": "price"
          }
        }
      }
    }
  }
}

GET /tvs/sales/_search
{
  "size": 0,
  "aggs": {
    "group_by_color": {
      "terms": {
        "field": "color",
        "order": {
          "avg_price": "asc"
        }
      },
      "aggs": {
        "avg_price": {
          "avg": {
            "field": "price"
          }
        }
      }
    }
  }
}
20. Sorting nested buckets by a metric

GET /tvs/sales/_search
{
  "size": 0,
  "aggs": {
    "group_by_color": {
      "terms": {
        "field": "color"
      },
      "aggs": {
        "group_by_brand": {
          "terms": {
            "field": "brand",
            "order": {
              "avg_price": "desc"
            }
          },
          "aggs": {
            "avg_price": {
              "avg": {
                "field": "price"
              }
            }
          }
        }
      }
    }
  }
}
21. Deduplication in ES: the cardinality metric deduplicates the given field within each bucket and returns the distinct count, analogous to count(distinct).
GET /tvs/sales/_search
{
  "size" : 0,
  "aggs" : {
    "months" : {
      "date_histogram": {
        "field": "sold_date",
        "interval": "month"
      },
      "aggs": {
        "distinct_brands" : {
          "cardinality" : {
            "field" : "brand"
          }
        }
      }
    }
  }
}
22. cardinality ≈ count(distinct): roughly a 5% error rate, with latency around 100ms
GET /tvs/sales/_search
{
  "size" : 0,
  "aggs" : {
    "distinct_brand" : {
      "cardinality" : {
        "field" : "brand",
        "precision_threshold" : 100
      }
    }
  }
}
Deduplicating brand: suppose brand has at most ~100 unique values (小米, 长虹, 三星, TCL, HTL...).
Within that many unique values, cardinality is almost guaranteed to be 100% accurate.
The cardinality algorithm consumes about precision_threshold * 8 bytes of memory: 100 * 8 = 800 bytes.
That is a tiny footprint, and if the unique-value count really stays within the threshold, accuracy is essentially 100%.
At 100, the error rate stays within 5% even for millions of unique values.
The larger you set precision_threshold, the more memory it takes: 1000 * 8 = 8000 bytes = 8KB, but it guarantees 100% accuracy across more unique values.
e.g. deduplicating and counting a field with 10000 unique values: precision_threshold = 10000 costs 10000 * 8 = 80000 bytes = 80KB.
23. HyperLogLog++ (HLL) algorithm performance optimization
Under the hood, cardinality uses the HLL algorithm: it takes a hash of every unique value and approximates the distinct count from those hashes, with some error. By default, a cardinality request hashes all the field values on the fly at query time; we can move the hashing forward to index time with a murmur3 sub-field:

PUT /tvs/
{
  "mappings": {
    "sales": {
      "properties": {
        "brand": {
          "type": "text",
          "fields": {
            "hash": {
              "type": "murmur3"
            }
          }
        }
      }
    }
  }
}

GET /tvs/sales/_search
{
  "size" : 0,
  "aggs" : {
    "distinct_brand" : {
      "cardinality" : {
        "field" : "brand.hash",
        "precision_threshold" : 100
      }
    }
  }
}
24. Requirement: a website logs the latency of every request; we need tp50, tp90, tp99
Data preparation:
tp50: the worst latency within which the fastest 50% of requests complete
tp90: the worst latency within which the fastest 90% of requests complete
tp99: the worst latency within which the fastest 99% of requests complete

PUT /website
{
  "mappings": {
    "logs": {
      "properties": {
        "latency": { "type": "long" },
        "province": { "type": "keyword" },
        "timestamp": { "type": "date" }
      }
    }
  }
}

POST /website/logs/_bulk
{ "index": {}}
{ "latency" : 105, "province" : "江苏", "timestamp" : "2016-10-28" }
{ "index": {}}
{ "latency" : 83, "province" : "江苏", "timestamp" : "2016-10-29" }
{ "index": {}}
{ "latency" : 92, "province" : "江苏", "timestamp" : "2016-10-29" }
{ "index": {}}
{ "latency" : 112, "province" : "江苏", "timestamp" : "2016-10-28" }
{ "index": {}}
{ "latency" : 68, "province" : "江苏", "timestamp" : "2016-10-28" }
{ "index": {}}
{ "latency" : 76, "province" : "江苏", "timestamp" : "2016-10-29" }
{ "index": {}}
{ "latency" : 101, "province" : "新疆", "timestamp" : "2016-10-28" }
{ "index": {}}
{ "latency" : 275, "province" : "新疆", "timestamp" : "2016-10-29" }
{ "index": {}}
{ "latency" : 166, "province" : "新疆", "timestamp" : "2016-10-29" }
{ "index": {}}
{ "latency" : 654, "province" : "新疆", "timestamp" : "2016-10-28" }
{ "index": {}}
{ "latency" : 389, "province" : "新疆", "timestamp" : "2016-10-28" }
{ "index": {}}
{ "latency" : 302, "province" : "新疆", "timestamp" : "2016-10-29" }
percentiles

GET /website/logs/_search
{
  "size": 0,
  "aggs": {
    "latency_percentiles": {
      "percentiles": {
        "field": "latency",
        "percents": [ 50, 95, 99 ]
      }
    },
    "latency_avg": {
      "avg": {
        "field": "latency"
      }
    }
  }
}
25. SLA and percentile_ranks
SLA: the standard your service commits to.
Our site's latency SLA: 100% of requests must complete within 200ms; big companies generally require 100% within 200ms.
If latency exceeds 1s, escalate to a severity-A incident: site performance and user experience have dropped sharply.
Requirement: what percentage of requests complete within 200ms, and what percentage within 1000ms. That is the percentile_ranks metric.
percentile_ranks is actually used even more often than percentiles.
An analogous TV example: group by brand and compute the share of TVs priced within 1000, 2000, and 3000 (see the sketch after the query below).
GET /website/logs/_search
{
  "size": 0,
  "aggs": {
    "group_by_province": {
      "terms": {
        "field": "province"
      },
      "aggs": {
        "latency_percentile_ranks": {
          "percentile_ranks": {
            "field": "latency",
            "values": [ 200, 1000 ]
          }
        }
      }
    }
  }
}
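A sketch of the TV variant described above, written against the earlier /tvs index (my own illustrative query, not from the original notes):

GET /tvs/sales/_search
{
  "size": 0,
  "aggs": {
    "group_by_brand": {
      "terms": { "field": "brand" },
      "aggs": {
        "price_ranks": {
          "percentile_ranks": {
            "field": "price",
            "values": [ 1000, 2000, 3000 ]
          }
        }
      }
    }
  }
}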
Optimizing percentiles
The TDigest algorithm computes percentiles from many nodes (centroids); it is an approximation with some error, and the more nodes, the more accurate it gets.
compression:
limits the node count to at most compression * 20; e.g. compression = 100 --> 2000 nodes
default is 100
the larger it is, the more memory it uses and the more accurate it gets, at the cost of performance
each node takes 32 bytes: 100 * 20 * 32 = 64KB
If you want the percentile computation to be more accurate, set compression higher (a sketch follows below).
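A sketch of setting compression explicitly on the earlier latency query (standard percentiles syntax; the value 200 is just illustrative):

GET /website/logs/_search
{
  "size": 0,
  "aggs": {
    "latency_percentiles": {
      "percentiles": {
        "field": "latency",
        "percents": [ 50, 95, 99 ],
        "tdigest": {
          "compression": 200
        }
      }
    }
  }
}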