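Elasticsearch (33): Document Data Modeling in Practice: File Search, Nested Objects, and Parent-Child/Grandparent Relationships
Preface
In "Elasticsearch (2): Elasticsearch Core Concepts" I briefly mentioned how the document data model differs from a relational database model. This article walks through several commonly used data models in detail.
File search data modeling
This section models data that has a multi-level hierarchy, such as a file system.
1. Building the file system data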
PUT /fs { "settings": { "analysis": { "analyzer": { "paths": { "tokenizer": "path_hierarchy" } } } } }
How the path_hierarchy tokenizer works:
/a/b/c/d --> path_hierarchy --> /a/b/c/d, /a/b/c, /a/b, /a
The index name fs stands for filesystem.
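The tokenizer's behaviour can be mimicked in a few lines of Python. This is only a sketch, not the Lucene implementation: the function name `path_hierarchy` is made up for illustration, and it assumes absolute paths that start with the `/` delimiter.

```python
def path_hierarchy(path, delimiter="/"):
    """Return every ancestor prefix of an absolute path, shortest first."""
    tokens = []
    current = ""
    for part in path.split(delimiter):
        if not part:
            continue  # skip the empty piece in front of the leading delimiter
        current += delimiter + part
        tokens.append(current)
    return tokens


print(path_hierarchy("/a/b/c/d"))
# ['/a', '/a/b', '/a/b/c', '/a/b/c/d']
```

Each of these tokens becomes a term in the index, which is what later allows a single term filter to match a whole subtree.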
PUT /fs/_mapping/file { "properties": { "name": { "type": "keyword" }, "path": { "type": "keyword", "fields": { "tree": { "type": "text", "analyzer": "paths" } } } } }
PUT /fs/file/1 { "name": "README.txt", "path": "/workspace/projects/helloworld", "contents": "这是我的第一个elasticsearch程序" }
2. Searching the file system
Requirement 1: find the files whose contents include elasticsearch and that sit directly in the /workspace/projects/helloworld directory
GET /fs/file/_search { "query": { "bool": { "must": [ { "match": { "contents": "elasticsearch" } }, { "constant_score": { "filter": { "term": { "path": "/workspace/projects/helloworld" } } } } ] } } }
{ "took": 2, "timed_out": false, "_shards": { "total": 5, "successful": 5, "failed": 0 }, "hits": { "total": 1, "max_score": 1.284885, "hits": [ { "_index": "fs", "_type": "file", "_id": "1", "_score": 1.284885, "_source": { "name": "README.txt", "path": "/workspace/projects/helloworld", "contents": "这是我的第一个elasticsearch程序" } } ] } }
Requirement 2: find all files anywhere under /workspace whose contents include elasticsearch
Because path.tree is analyzed with path_hierarchy, doc1 is indexed under each of its ancestor paths:
/workspace/projects/helloworld doc1
/workspace/projects doc1
/workspace doc1
GET /fs/file/_search { "query": { "bool": { "must": [ { "match": { "contents": "elasticsearch" } }, { "constant_score": { "filter": { "term": { "path.tree": "/workspace" } } } } ] } } }
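To see why the term filter on path.tree finds the document while the same filter on path would not, here is a small in-memory sketch. The helpers `ancestors`, `term_path`, and `term_path_tree` are hypothetical names used only for this illustration; they stand in for the keyword field and the path_hierarchy sub-field.

```python
# One indexed file, keyed by document id (taken from the article's example).
docs = {1: "/workspace/projects/helloworld"}


def ancestors(path):
    """All ancestor prefixes of an absolute path, as path_hierarchy would emit."""
    parts = [p for p in path.split("/") if p]
    return {"/" + "/".join(parts[:i]) for i in range(1, len(parts) + 1)}


def term_path(value):
    """Term filter on the keyword field: the whole path must be equal."""
    return [doc_id for doc_id, path in docs.items() if path == value]


def term_path_tree(value):
    """Term filter on the tree sub-field: any ancestor token may match."""
    return [doc_id for doc_id, path in docs.items() if value in ancestors(path)]


print(term_path("/workspace"))       # []
print(term_path_tree("/workspace"))  # [1]
```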
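Nested relationships
1. An experiment that shows why nested objects are needed
Modeling with redundant (denormalized) data relies on the object type. Here we introduce a second kind of object type: the nested object.
The data model is a blog post together with its comments.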
PUT /website/blogs/6 { "title": "花无缺发表的一篇帖子", "content": "我是花无缺,大家要不要考虑一下投资房产和买股票的事情啊。。。", "tags": [ "投资", "理财" ], "comments": [ { "name": "小鱼儿", "comment": "什么股票啊?推荐一下呗", "age": 28, "stars": 4, "date": "2016-09-01" }, { "name": "黄药师", "comment": "我喜欢投资房产,风,险大收益也大", "age": 31, "stars": 5, "date": "2016-10-22" } ] }
Search for the blog posts that have a comment from 黄药师, aged 28:
GET /website/blogs/_search { "query": { "bool": { "must": [ { "match": { "comments.name": "黄药师" }}, { "match": { "comments.age": 28 }} ] } } }
{ "took": 102, "timed_out": false, "_shards": { "total": 5, "successful": 5, "failed": 0 }, "hits": { "total": 1, "max_score": 1.8022683, "hits": [ { "_index": "website", "_type": "blogs", "_id": "6", "_score": 1.8022683, "_source": { "title": "花无缺发表的一篇帖子", "content": "我是花无缺,大家要不要考虑一下投资房产和买股票的事情啊。。。", "tags": [ "投资", "理财" ], "comments": [ { "name": "小鱼儿", "comment": "什么股票啊?推荐一下呗", "age": 28, "stars": 4, "date": "2016-09-01" }, { "name": "黄药师", "comment": "我喜欢投资房产,风,险大收益也大", "age": 31, "stars": 5, "date": "2016-10-22" } ] } } ] } }
The result looks wrong: 黄药师 is 31, not 28, yet the post is returned.
The reason lies in how the object type is stored under the hood:
{ "title": [ "花无缺", "发表", "一篇", "帖子" ], "content": [ "我", "是", "花无缺", "大家", "要不要", "考虑", "一下", "投资", "房产", "买", "股票", "事情" ], "tags": [ "投资", "理财" ], "comments.name": [ "小鱼儿", "黄药师" ], "comments.comment": [ "什么", "股票", "推荐", "我", "喜欢", "投资", "房产", "风险", "收益", "大" ], "comments.age": [ 28, 31 ], "comments.stars": [ 4, 5 ], "comments.date": [ 2016-09-01, 2016-10-22 ] }
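Under the hood, the object type flattens the objects of a JSON array into one multi-valued list per field, so the link between the name and the age of a single comment is lost.
The document is therefore a hit: 黄药师 appears in comments.name and 28 appears in comments.age, even though the two values come from different comments.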
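The flattening can be reproduced with plain Python dictionaries. This is a sketch of the idea rather than of Lucene's data structures; `flatten` and `object_match` are illustrative names, and only the name and age fields of the two comments are kept.

```python
# The two comments of blog 6, reduced to the fields used in the search.
comments = [
    {"name": "小鱼儿", "age": 28},
    {"name": "黄药师", "age": 31},
]


def flatten(objects, prefix="comments"):
    """Merge an array of objects into one multi-valued list per field."""
    flat = {}
    for obj in objects:
        for key, value in obj.items():
            flat.setdefault(prefix + "." + key, []).append(value)
    return flat


def object_match(flat, name, age):
    """Each condition is checked against its own list, independently."""
    return name in flat["comments.name"] and age in flat["comments.age"]


flat = flatten(comments)
print(flat)
print(object_match(flat, "黄药师", 28))  # True, although no comment has both values
```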
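2. Introducing the nested object type to fix the problem caused by how object is stored
Change the mapping so that comments has type nested instead of object. The type of an existing field cannot be changed in place, so the index has to be recreated and the data indexed again.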
PUT /website { "mappings": { "blogs": { "properties": { "comments": { "type": "nested", "properties": { "name": { "type": "text" }, "comment": { "type": "text" }, "age": { "type": "short" }, "stars": { "type": "short" }, "date": { "type": "date" } } } } } } }
With nested, each comment is indexed as its own hidden document, followed by the root document:
{ "comments.name": [ "小鱼儿" ], "comments.comment": [ "什么", "股票", "推荐" ], "comments.age": [ 28 ], "comments.stars": [ 4 ], "comments.date": [ 2016-09-01 ] } { "comments.name": [ "黄药师" ], "comments.comment": [ "我", "喜欢", "投资", "房产", "风险", "收益", "大" ], "comments.age": [ 31 ], "comments.stars": [ 5 ], "comments.date": [ 2016-10-22 ] } { "title": [ "花无缺", "发表", "一篇", "帖子" ], "content": [ "我", "是", "花无缺", "大家", "要不要", "考虑", "一下", "投资", "房产", "买", "股票", "事情" ], "tags": [ "投资", "理财" ] }
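The per-comment matching that nested provides can be sketched in the same style. `nested_match` is a hypothetical helper: it evaluates both conditions against one comment at a time, which is the behaviour the separate hidden documents make possible.

```python
# The two comments of blog 6, reduced to the fields used in the search.
comments = [
    {"name": "小鱼儿", "age": 28},
    {"name": "黄药师", "age": 31},
]


def nested_match(objects, name, age):
    """Both conditions must hold inside the same nested object."""
    return any(obj["name"] == name and obj["age"] == age for obj in objects)


print(nested_match(comments, "黄药师", 28))  # False: no single comment has both
print(nested_match(comments, "黄药师", 31))  # True
```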
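Search again, and this time the behaviour is correct: the nested query requires both conditions to hold within a single comment, so 黄药师 with age 28 no longer matches (age 31 would return the post).
GET /website/blogs/_search { "query": { "bool": { "must": [ { "match": { "title": "花无缺" } }, { "nested": { "path": "comments", "score_mode": "max", "query": { "bool": { "must": [ { "match": { "comments.name": "黄药师" } }, { "match": { "comments.age": 28 } } ] } } } } ] } } }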
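score_mode accepts max, min, avg, sum, and none; the default is avg.
It controls how the scores of several matching nested documents are combined into the single score contributed to the root document.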
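As a rough illustration of what score_mode does, the function below combines a list of nested-document scores into one value. `combine` is an illustrative name and the input scores are made-up numbers; the sketch only mirrors the documented meaning of each mode, including none, which gives the root document a score of 0.

```python
def combine(scores, score_mode="avg"):
    """Reduce the scores of matching nested docs to one score for the root doc."""
    if not scores:
        raise ValueError("at least one nested document must match")
    if score_mode == "max":
        return max(scores)
    if score_mode == "min":
        return min(scores)
    if score_mode == "sum":
        return sum(scores)
    if score_mode == "avg":
        return sum(scores) / len(scores)
    if score_mode == "none":
        return 0.0  # nested scores are ignored
    raise ValueError("unknown score_mode: " + score_mode)


scores = [1.0, 3.0]  # hypothetical scores of two matching comments
for mode in ("max", "min", "avg", "sum", "none"):
    print(mode, combine(scores, mode))
```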
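Next, let's look at aggregations over the data inside nested objects.
Aggregation requirement 1: bucket the comments by month of their date, then compute the average star rating of the comments in each month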
GET /website/blogs/_search { "size": 0, "aggs": { "comments_path": { "nested": { "path": "comments" }, "aggs": { "group_by_comments_date": { "date_histogram": { "field": "comments.date", "interval": "month", "format": "yyyy-MM" }, "aggs": { "avg_stars": { "avg": { "field": "comments.stars" } } } } } } } }
{ "took": 52, "timed_out": false, "_shards": { "total": 5, "successful": 5, "failed": 0 }, "hits": { "total": 2, "max_score": 0, "hits": [] }, "aggregations": { "comments_path": { "doc_count": 4, "group_by_comments_date": { "buckets": [ { "key_as_string": "2016-08", "key": 1470009600000, "doc_count": 1, "avg_stars": { "value": 3 } }, { "key_as_string": "2016-09", "key": 1472688000000, "doc_count": 2, "avg_stars": { "value": 4.5 } }, { "key_as_string": "2016-10", "key": 1475280000000, "doc_count": 1, "avg_stars": { "value": 5 } } ] } } } }
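The same bucketing logic, written out by hand, may make the response easier to read. The sketch below uses only the two comments of blog 6 shown earlier; the response above also covers a second blog whose data is not listed in this article, so the numbers differ. `avg_stars_by_month` is an illustrative name.

```python
from collections import defaultdict

# Date and stars of the two comments on blog 6.
comments = [
    {"date": "2016-09-01", "stars": 4},
    {"date": "2016-10-22", "stars": 5},
]


def avg_stars_by_month(items):
    """Group comments by yyyy-MM and average the stars in each group."""
    buckets = defaultdict(list)
    for item in items:
        month = item["date"][:7]  # "2016-09-01" -> "2016-09"
        buckets[month].append(item["stars"])
    return {month: sum(v) / len(v) for month, v in sorted(buckets.items())}


print(avg_stars_by_month(comments))
# {'2016-09': 4.0, '2016-10': 5.0}
```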
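When drilling down inside a nested aggregation, a reverse_nested aggregation (named reverse_path in the request below) steps back out to the root document, so that its other fields, such as tags, can be used for further buckets.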
GET /website/blogs/_search { "size": 0, "aggs": { "comments_path": { "nested": { "path": "comments" }, "aggs": { "group_by_comments_age": { "histogram": { "field": "comments.age", "interval": 10 }, "aggs": { "reverse_path": { "reverse_nested": {}, "aggs": { "group_by_tags": { "terms": { "field": "tags.keyword" } } } } } } } } } }
{ "took": 5, "timed_out": false, "_shards": { "total": 5, "successful": 5, "failed": 0 }, "hits": { "total": 2, "max_score": 0, "hits": [] }, "aggregations": { "comments_path": { "doc_count": 4, "group_by_comments_age": { "buckets": [ { "key": 20, "doc_count": 1, "reverse_path": { "doc_count": 1, "group_by_tags": { "doc_count_error_upper_bound": 0, "sum_other_doc_count": 0, "buckets": [ { "key": "投资", "doc_count": 1 }, { "key": "理财", "doc_count": 1 } ] } } }, { "key": 30, "doc_count": 3, "reverse_path": { "doc_count": 2, "group_by_tags": { "doc_count_error_upper_bound": 0, "sum_other_doc_count": 0, "buckets": [ { "key": "大侠", "doc_count": 1 }, { "key": "投资", "doc_count": 1 }, { "key": "理财", "doc_count": 1 }, { "key": "练功", "doc_count": 1 } ] } } } ] } } } }
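Here is a hand-written sketch of what the nested, histogram, and reverse_nested steps do together. It uses only blog 6 from this article (the response above also includes a second blog that is not listed), and `tags_by_age_bucket` is an illustrative name.

```python
from collections import Counter, defaultdict

# Blog 6, reduced to its tags and the ages of its commenters.
blogs = [
    {"id": 6, "tags": ["投资", "理财"], "comments": [{"age": 28}, {"age": 31}]},
]


def tags_by_age_bucket(items, interval=10):
    """Bucket comments by age, then count the tags of the distinct parent blogs."""
    parents = defaultdict(set)  # bucket key -> ids of blogs with a comment in it
    for blog in items:
        for comment in blog["comments"]:
            key = comment["age"] // interval * interval
            parents[key].add(blog["id"])
    by_id = {blog["id"]: blog for blog in items}
    result = {}
    for key, ids in sorted(parents.items()):
        counts = Counter()
        for blog_id in ids:  # each parent blog is counted once per bucket
            counts.update(set(by_id[blog_id]["tags"]))
        result[key] = dict(counts)
    return result


print(tags_by_age_bucket(blogs))
```

The set of parent ids per bucket is the important part: stepping back to the root document means a blog is counted once, no matter how many of its comments fall into the bucket.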
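Parent-child relationships
Nested object modeling has a drawback: like redundant-data modeling, it stores everything together in one document, so maintenance is costly.
Parent-child modeling is closer to normalized (third normal form) modeling in a relational database: each entity is stored separately and linked to the others through a parent-child association. The data does not all have to live together, and a parent doc and its child docs can each be updated without affecting the other.
This makes one-to-many relationships easy to maintain. As noted earlier, relational-style modeling with application-side joins performs poorly because it needs several searches. The parent-child model avoids that: although the entities are stored separately, es resolves the association itself at search time and takes measures to keep the search fast.
Compared with the nested model, the advantage of the parent-child model is that parent docs and child docs do not affect each other.
Key point: es keeps metadata that maps the parent-child relationship, which keeps queries fast, but there is one restriction: a parent and its children must live on the same shard.
Because parent and child data sit on one shard, together with the metadata that maps their relationship, a parent-child search does not have to cross shards; each shard resolves the join locally, which is why it performs well.
Case background: managing R&D center employees. An IT company has several R&D centers, and each center has many employees.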
PUT /company { "mappings": { "rd_center": {}, "employee": { "_parent": { "type": "rd_center" } } } }
The core of parent-child modeling: the types are related as parent and child, and _parent names the parent type.
POST /company/rd_center/_bulk
{ "index": { "_id": "1" }}
{ "name": "北京研发总部", "city": "北京", "country": "中国" }
{ "index": { "_id": "2" }}
{ "name": "上海研发中心", "city": "上海", "country": "中国" }
{ "index": { "_id": "3" }}
{ "name": "硅谷人工智能实验室", "city": "硅谷", "country": "美国" }
For shard routing, the rd_center doc with id=1 is routed by its own id by default and lands on some shard.
PUT /company/employee/1?parent=1 { "name": "张三", "birthday": "1970-10-24", "hobby": "爬山" }
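The core of maintaining the relationship is parent=1, which gives the id of this doc's parent doc.
The parent-child relationship thereby guarantees that a parent doc and its child docs are stored on the same shard. The mechanism is still document routing: the employee doc uses its parent id as the routing value, and the rd_center doc is routed by that same id, so both land on the same shard.
In other words, the employee doc with id=1 is no longer routed by its own id but by parent=1, the id of its parent doc. This underlying routing mechanism is what keeps parent and child data on the same shard.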
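The routing rule can be sketched as "hash the routing value, take it modulo the number of primary shards". In the snippet below, `zlib.crc32` is only a stand-in for the hash Elasticsearch actually uses, the shard count of 5 is taken from the responses in this article, and `routing_key` and `shard_for` are illustrative names.

```python
import zlib

NUM_SHARDS = 5  # matches "_shards": { "total": 5 } in the responses


def routing_key(doc_id, parent=None):
    """A child doc is routed by its parent's id, any other doc by its own id."""
    return parent if parent is not None else doc_id


def shard_for(routing, num_shards=NUM_SHARDS):
    """Stand-in for the real routing hash: crc32(routing) mod shard count."""
    return zlib.crc32(routing.encode("utf-8")) % num_shards


parent_shard = shard_for(routing_key("1"))             # rd_center 1
child_shard = shard_for(routing_key("2", parent="1"))  # employee 2, parent=1
print(parent_shard == child_shard)  # True: same routing value, same shard
```

Whatever hash is used, equal routing values always map to the same shard, which is all the parent-child model relies on.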
POST /company/employee/_bulk
{ "index": { "_id": 2, "parent": "1" }}
{ "name": "李四", "birthday": "1982-05-16", "hobby": "游泳" }
{ "index": { "_id": 3, "parent": "2" }}
{ "name": "王二", "birthday": "1979-04-01", "hobby": "爬山" }
{ "index": { "_id": 4, "parent": "3" }}
{ "name": "赵五", "birthday": "1987-05-11", "hobby": "骑马" }
With the parent-child data model in place, we can now run searches and aggregations against it.
1. Find the R&D centers that have employees born in 1980 or later
GET /company/rd_center/_search { "query": { "has_child": { "type": "employee", "query": { "range": { "birthday": { "gte": "1980-01-01" } } } } } }
{ "took": 33, "timed_out": false, "_shards": { "total": 5, "successful": 5, "failed": 0 }, "hits": { "total": 2, "max_score": 1, "hits": [ { "_index": "company", "_type": "rd_center", "_id": "1", "_score": 1, "_source": { "name": "北京研发总部", "city": "北京", "country": "中国" } }, { "_index": "company", "_type": "rd_center", "_id": "3", "_score": 1, "_source": { "name": "硅谷人工智能实验室", "city": "硅谷", "country": "美国" } } ] } }
2. Find the R&D centers that have an employee named 张三
GET /company/rd_center/_search { "query": { "has_child": { "type": "employee", "query": { "match": { "name": "张三" } } } } }
{ "took": 2, "timed_out": false, "_shards": { "total": 5, "successful": 5, "failed": 0 }, "hits": { "total": 1, "max_score": 1, "hits": [ { "_index": "company", "_type": "rd_center", "_id": "1", "_score": 1, "_source": { "name": "北京研发总部", "city": "北京", "country": "中国" } } ] } }
3. Find the R&D centers that have at least 2 employees
GET /company/rd_center/_search { "query": { "has_child": { "type": "employee", "min_children": 2, "query": { "match_all": {} } } } }
{ "took": 5, "timed_out": false, "_shards": { "total": 5, "successful": 5, "failed": 0 }, "hits": { "total": 1, "max_score": 1, "hits": [ { "_index": "company", "_type": "rd_center", "_id": "1", "_score": 1, "_source": { "name": "北京研发总部", "city": "北京", "country": "中国" } } ] } }
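The three has_child searches above share one pattern: filter the children, group them by parent, and keep the parents with enough matches. The sketch below replays that pattern on the four employees indexed earlier; `has_child` here is a plain Python function named after the query, not a client API.

```python
from collections import Counter

# The four employee docs from this article, with their parent rd_center ids.
employees = [
    {"id": "1", "parent": "1", "name": "张三", "birthday": "1970-10-24"},
    {"id": "2", "parent": "1", "name": "李四", "birthday": "1982-05-16"},
    {"id": "3", "parent": "2", "name": "王二", "birthday": "1979-04-01"},
    {"id": "4", "parent": "3", "name": "赵五", "birthday": "1987-05-11"},
]


def has_child(predicate, min_children=1):
    """Ids of parents with at least min_children children passing predicate."""
    counts = Counter(e["parent"] for e in employees if predicate(e))
    return sorted(p for p, n in counts.items() if n >= min_children)


# ISO dates compare correctly as strings.
print(has_child(lambda e: e["birthday"] >= "1980-01-01"))  # ['1', '3']
print(has_child(lambda e: e["name"] == "张三"))             # ['1']
print(has_child(lambda e: True, min_children=2))           # ['1']
```

The printed ids match the rd_center hits in the three responses above.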
4. Find the employees of the R&D centers located in 中国
GET /company/employee/_search { "query": { "has_parent": { "parent_type": "rd_center", "query": { "term": { "country.keyword": "中国" } } } } }
{ "took": 5, "timed_out": false, "_shards": { "total": 5, "successful": 5, "failed": 0 }, "hits": { "total": 3, "max_score": 1, "hits": [ { "_index": "company", "_type": "employee", "_id": "3", "_score": 1, "_routing": "2", "_parent": "2", "_source": { "name": "王二", "birthday": "1979-04-01", "hobby": "爬山" } }, { "_index": "company", "_type": "employee", "_id": "1", "_score": 1, "_routing": "1", "_parent": "1", "_source": { "name": "张三", "birthday": "1970-10-24", "hobby": "爬山" } }, { "_index": "company", "_type": "employee", "_id": "2", "_score": 1, "_routing": "1", "_parent": "1", "_source": { "name": "李四", "birthday": "1982-05-16", "hobby": "游泳" } } ] } }
5. Count, for each country, how many employees enjoy each hobby
GET /company/rd_center/_search { "size": 0, "aggs": { "group_by_country": { "terms": { "field": "country.keyword" }, "aggs": { "group_by_child_employee": { "children": { "type": "employee" }, "aggs": { "group_by_hobby": { "terms": { "field": "hobby.keyword" } } } } } } } }
{ "took": 15, "timed_out": false, "_shards": { "total": 5, "successful": 5, "failed": 0 }, "hits": { "total": 3, "max_score": 0, "hits": [] }, "aggregations": { "group_by_country": { "doc_count_error_upper_bound": 0, "sum_other_doc_count": 0, "buckets": [ { "key": "中国", "doc_count": 2, "group_by_child_employee": { "doc_count": 3, "group_by_hobby": { "doc_count_error_upper_bound": 0, "sum_other_doc_count": 0, "buckets": [ { "key": "爬山", "doc_count": 2 }, { "key": "游泳", "doc_count": 1 } ] } } }, { "key": "美国", "doc_count": 1, "group_by_child_employee": { "doc_count": 1, "group_by_hobby": { "doc_count_error_upper_bound": 0, "sum_other_doc_count": 0, "buckets": [ { "key": "骑马", "doc_count": 1 } ] } } } ] } } }
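The children aggregation joins each parent bucket to its child docs before the inner terms aggregation runs. A hand-written equivalent over the article's data might look like this; `hobbies_by_country` is an illustrative name.

```python
from collections import Counter, defaultdict

# rd_center id -> country, and (parent rd_center id, hobby) for each employee.
centers = {"1": "中国", "2": "中国", "3": "美国"}
employees = [("1", "爬山"), ("1", "游泳"), ("2", "爬山"), ("3", "骑马")]


def hobbies_by_country():
    """Group employees by their parent's country, then count each hobby."""
    result = defaultdict(Counter)
    for parent_id, hobby in employees:
        result[centers[parent_id]][hobby] += 1
    return {country: dict(counts) for country, counts in result.items()}


print(hobbies_by_country())
# {'中国': {'爬山': 2, '游泳': 1}, '美国': {'骑马': 1}}
```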
Parent-child relationships: modeling and searching three-level grandparent-grandchild data
PUT /company { "mappings": { "country": {}, "rd_center": { "_parent": { "type": "country" } }, "employee": { "_parent": { "type": "rd_center" } } } }
country -> rd_center -> employee: a three-level grandparent-grandchild data model
POST /company/country/_bulk
{ "index": { "_id": "1" }}
{ "name": "中国" }
{ "index": { "_id": "2" }}
{ "name": "美国" }
POST /company/rd_center/_bulk
{ "index": { "_id": "1", "parent": "1" }}
{ "name": "北京研发总部" }
{ "index": { "_id": "2", "parent": "1" }}
{ "name": "上海研发中心" }
{ "index": { "_id": "3", "parent": "2" }}
{ "name": "硅谷人工智能实验室" }
PUT /company/employee/1?parent=1&routing=1 { "name": "张三", "dob": "1970-10-24", "hobby": "爬山" }
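About the routing parameter: it must be the id of the grandparent doc, otherwise the employee may end up on a different shard from its country and the searches below break.
A country doc is routed by its own id. An rd_center doc, which specifies a parent, is routed by the country's id. An employee doc that specifies only a parent would be routed by the rd_center's id, so the three generations are not guaranteed to be on the same shard.
For the grandchild, therefore, set routing explicitly to the id of the grandparent doc.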
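The rule can be checked by comparing routing values rather than shards, since equal routing values are what guarantees co-location. The sketch below uses the rd_center-to-country mapping from the bulk request above; `employee_routing` and `same_routing_key` are illustrative names.

```python
# rd_center id -> id of its parent country (from the bulk request above).
rd_center_parent = {"1": "1", "2": "1", "3": "2"}


def employee_routing(rd_center_id, explicit_routing=None):
    """Without an explicit routing, an employee is routed by its parent's id."""
    return explicit_routing if explicit_routing is not None else rd_center_id


def same_routing_key(rd_center_id, explicit_routing=None):
    """True when the employee uses the same routing value as its country."""
    country_id = rd_center_parent[rd_center_id]  # country is routed by its own id
    return employee_routing(rd_center_id, explicit_routing) == country_id


print(same_routing_key("3"))                        # False: routed by "3", country is "2"
print(same_routing_key("3", explicit_routing="2"))  # True
print(same_routing_key("1"))                        # True, but only because the ids coincide
```

The last line shows why the example with parent=1 would appear to work even without routing=1: rd_center 1 and country 1 happen to share an id, which is not something to rely on.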
Find the countries that have employees whose hobby is 爬山:
GET /company/country/_search { "query": { "has_child": { "type": "rd_center", "query": { "has_child": { "type": "employee", "query": { "match": { "hobby": "爬山" } } } } } } }
{ "took": 10, "timed_out": false, "_shards": { "total": 5, "successful": 5, "failed": 0 }, "hits": { "total": 1, "max_score": 1, "hits": [ { "_index": "company", "_type": "country", "_id": "1", "_score": 1, "_source": { "name": "中国" } } ] } }