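Using Elasticsearch

Elasticsearch: the inverted index

Background

　　Search engines generally store their data in an inverted index; it is the basic underlying storage structure.

Definition

　　The inverted index arises from the practical need to look up records by the value of an attribute. Each entry in such an index holds an attribute value together with the addresses of all records that have that value.

　　Because the attribute value determines the location of the records, rather than the record determining its attribute values, it is called an inverted index.

　　A file that carries an inverted index is called an inverted index file, or inverted file for short.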
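
To make the definition concrete, here is a minimal sketch of an inverted index in Python. It assumes whitespace tokenization and lowercasing, which is far simpler than what Elasticsearch does; the function name build_inverted_index and the sample documents are made up for illustration.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the sorted list of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


docs = [
    "python distributed crawler",
    "java web developer",
    "python web developer",
]
index = build_inverted_index(docs)
print(index["python"])  # [0, 2]
print(index["web"])     # [1, 2]
```

Looking up a term is a single dictionary access, which is the point of going from value to records instead of scanning every record.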

TF-IDF

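　　A weighting technique commonly used in information retrieval and data mining. TF stands for term frequency, and IDF stands for inverse document frequency.

　　If a word or phrase occurs with high frequency (TF) in one article but rarely in other articles, it is considered to discriminate well between categories and is suitable for classification.

　　Elasticsearch's relevance scoring builds on this idea; since version 5.0 the default similarity is BM25, a refinement of TF-IDF.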
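
The weighting described above can be sketched in a few lines of Python. This is the textbook formulation (relative term frequency times the natural log of the inverse document frequency), not the exact formula Lucene or Elasticsearch uses; the corpus is invented for the example.

```python
import math


def tf(term, doc_tokens):
    """Relative frequency of the term inside one document."""
    return doc_tokens.count(term) / len(doc_tokens)


def idf(term, corpus):
    """Log of (number of documents / documents containing the term)."""
    df = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / df) if df else 0.0


def tf_idf(term, doc_tokens, corpus):
    return tf(term, doc_tokens) * idf(term, corpus)


corpus = [
    ["python", "crawler", "python"],
    ["java", "developer"],
    ["web", "developer"],
]
# "python" is frequent in one document and absent elsewhere,
# so it scores higher than "developer", which appears in two documents.
score_python = tf_idf("python", corpus[0], corpus)
score_developer = tf_idf("developer", corpus[1], corpus)
print(score_python > score_developer)
```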

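Problems an inverted index has to solve

　　1. Case folding: python and PYTHON should be treated as the same term.

　　2. Stemming: looking and look should be reduced to the same term.

　　3. Tokenization: should 屏蔽系统 ("shielding system") be split into "屏蔽" and "系统", or kept as "屏蔽系统"?

　　4. Inverted index files grow very large, so a compression encoding is needed to reduce storage overhead.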
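
Points 1, 2 and 4 can be illustrated with a short Python sketch. The stemmer here is deliberately naive (it only strips a trailing "ing") and the compression is plain delta encoding of sorted document ids; both are stand-ins chosen for this example, not what Elasticsearch ships.

```python
def normalize(token):
    """Lowercase a token and strip a trailing 'ing' (a deliberately naive stemmer)."""
    token = token.lower()
    if token.endswith("ing") and len(token) > 5:
        token = token[:-3]
    return token


def delta_encode(doc_ids):
    """Store a sorted posting list as the gaps between consecutive ids."""
    gaps, previous = [], 0
    for doc_id in doc_ids:
        gaps.append(doc_id - previous)
        previous = doc_id
    return gaps


def delta_decode(gaps):
    """Rebuild the original ids by summing the gaps."""
    doc_ids, current = [], 0
    for gap in gaps:
        current += gap
        doc_ids.append(current)
    return doc_ids


print(normalize("PYTHON"), normalize("looking"))
print(delta_encode([1000, 1003, 1010]))
```

Small gaps need fewer bits than large absolute ids, which is why posting lists are usually stored as differences before a variable-length encoding is applied.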

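Simple CRUD operations on Elasticsearch documents and indices

Built-in types

　　String types

　　　　text, keyword

　　Numeric types

　　　　long, integer, short, byte, double, float

　　Date type

　　　　date

　　Boolean type

　　　　boolean

　　Binary type

　　　　binary

　　Complex types

　　　　object, nested

　　Geo types

　　　　geo_point, geo_shape

　　Specialized types

　　　　ip, completion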

Creating an index

1. Create the index in the Kibana console

    # Once an index has been created, its number of shards can no longer be changed

    PUT myindex
    {
        "settings": {
            "index": {
                "number_of_shards": 5,
                "number_of_replicas": 1
            }
        }
    }

2. Create it from the "New Index" menu on the head management page
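
Why the shard count is fixed becomes clearer with a small sketch. In its simplest form a document is routed to a shard by hashing its routing key (the id by default) modulo the number of primary shards. The code below assumes zlib.crc32 as a stand-in hash (Elasticsearch uses a different hash function) and pick_shard is a hypothetical helper name.

```python
import zlib


def pick_shard(routing_key, number_of_primary_shards):
    """Route a document to a shard: hash(routing key) modulo shard count."""
    # zlib.crc32 is only a stand-in hash for this sketch.
    return zlib.crc32(routing_key.encode("utf-8")) % number_of_primary_shards


ids = [str(i) for i in range(1000)]
# Count how many documents would land on a different shard
# if the index went from 5 to 6 primary shards.
moved = sum(1 for i in ids if pick_shard(i, 5) != pick_shard(i, 6))
print(moved, "of", len(ids), "documents would be looked up on the wrong shard")
```

Since most documents would map to a different shard after the change, existing data could no longer be found where the formula points, so the number has to be chosen up front.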

Retrieving index settings

1. Retrieve index settings in the Kibana console

    GET myindex/_settings

    GET _all/_settings

    GET .kibana,myindex/_settings

    GET _settings

    GET myindex

Updating an index

1. Update index settings in the Kibana console

    PUT myindex/_settings
    {
        "number_of_replicas": 2
    }

Saving a document to an index

# If no document id is given after job, ES generates a unique id for us (use POST rather than PUT in that case)

PUT myindex/job/1
{
    "title": "python分布式爬虫开发",
    "salary_min": 15000,
    "city": "北京",
    "company": {
        "name": "百度",
        "company_addr": "北京市软件园"
    },
    "publish_date": "2018-05-20",
    "comments": 15
}

Checking that the document was saved

　　On the head management page, open the data browser and select myindex

Retrieving a document

GET myindex/job/1

GET myindex/job/1?_source=title,city

Modifying a document

1. Full replacement

    PUT myindex/job/1
    {
        "title": "python分布式爬虫开发",
        "salary_min": 15000,
        "city": "北京",
        "company": {
            "name": "百度",
            "company_addr": "北京市软件园"
        },
        "publish_date": "2018-05-20",
        "comments": 15
    }

2. Partial update

    POST myindex/job/1/_update
    {
        "doc": {
            "comments": 15
        }
    }
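
The difference between the two kinds of modification can be sketched in Python: a full replacement swaps the whole document, while a partial update merges the given fields into the stored one, with nested objects merged field by field. apply_partial_update is a hypothetical helper written for this example, not part of any Elasticsearch client.

```python
import copy


def apply_partial_update(stored, partial):
    """Return a new document with `partial` merged into `stored`."""
    merged = copy.deepcopy(stored)
    for key, value in partial.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Nested objects are merged recursively.
            merged[key] = apply_partial_update(merged[key], value)
        else:
            # Scalars and lists simply replace the old value.
            merged[key] = copy.deepcopy(value)
    return merged


stored = {"title": "python", "comments": 10, "company": {"name": "百度"}}
updated = apply_partial_update(stored, {"comments": 15, "company": {"employee_count": 50}})
print(updated)
```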

Deleting a document

DELETE myindex/job/1

Deleting a type

DELETE myindex/job

Deleting an index

DELETE myindex

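Batch operations in Elasticsearch

mget: batch retrieval

# Option 1: give the index, type and id of every document, so documents can come from different indices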

    GET _mget
    {
        "docs": [
            {
                "_index": "test",
                "_type": "job1",
                "_id": "1"
            },
            {
                "_index": "test",
                "_type": "job2",
                "_id": "2" 
            }
        ]
    }

# Option 2: documents in the same index; only the type and id are given per document

    GET test/_mget
    {
        "docs": [
            {
                "_type": "job1",
                "_id": "1"
            },
            {
                "_type": "job2",
                "_id": "2" 
            }
        ]
    }

# Option 3: documents in the same index and type; only the id is given per document

    GET test/job1/_mget
    {
        "docs": [
            {
                "_id": "1"
            },
            {
                "_id": "2" 
            }
        ]
    }

# Option 4: same index and type, using the ids shorthand

    GET test/job1/_mget
    {
        "ids": [1, 2]
    }

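bulk: batch operations

# A bulk request can combine several operations, such as index, delete, update and create

# It can also be used to load the documents of one index into another

Example: modifying documents and field values in bulk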

    POST _bulk
    {"index": {"_index": "test", "_type": "type1", "_id": "1"}}
    {"field1": "value1"}
    {"index": {"_index": "test", "_type": "type1", "_id": "2"}}
    {"field2": "value2"}
    {"delete": {"_index": "test", "_type": "type1", "_id": "2"}}
    {"create": {"_index": "test", "_type": "type1", "_id": "3"}}
    {"field3": "value3"}
    {"update": {"_index": "test", "_type": "type1", "_id": "2"}}
    {"doc": {"field2": "value22222"}}
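
The bulk body is newline-delimited JSON: one action line, optionally followed by one source line, and the body must end with a newline. A minimal Python sketch for assembling such a body is shown below; build_bulk_body and the tuple layout are assumptions made for this example, and sending the request is left out.

```python
import json


def build_bulk_body(operations):
    """Build an NDJSON bulk body from (action, metadata, source_or_None) tuples."""
    lines = []
    for action, metadata, source in operations:
        lines.append(json.dumps({action: metadata}))
        if source is not None:  # delete has no source line
            lines.append(json.dumps(source))
    # The bulk API expects a trailing newline after the last line.
    return "\n".join(lines) + "\n"


body = build_bulk_body([
    ("index", {"_index": "test", "_type": "type1", "_id": "1"}, {"field1": "value1"}),
    ("delete", {"_index": "test", "_type": "type1", "_id": "2"}, None),
    ("update", {"_index": "test", "_type": "type1", "_id": "1"}, {"doc": {"field1": "new"}}),
])
print(body)
```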

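Elasticsearch mappings

Common parameters

store

    true means the field value is stored separately; the default is false. Applies to all types

index

    true means the field is indexed and therefore searchable; the default is true. Applies to string types

null_value

    A default value, such as NA, to use when the field is null. Applies to all types

analyzer

    Sets the analyzer used at index time and search time. The default is the standard analyzer; whitespace, simple and english are also available. Applies to text fields

include_in_all

    By default ES defines a special _all field for every document so that every field can be searched through it. Set this to false to keep a field out of _all. Applies to all types

format

    The pattern string for date formats. Applies to the date type

More mapping parameters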

https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-params.html

Creating a mapping

PUT myindex
{
    "mappings": {
        "job": {
            "properties": {
                "title": {
                    "store": true,
                    "type": "text",
                    "analyzer": "ik_max_word"
                },
                "salary_min": {
                    "type": "integer"
                },
                "city": {
                    "type": "keyword"
                },
                "company": {
                    "properties": {
                        "name":{
                            "store": true,
                            "type": "text"
                        },
                        "company_addr": {
                            "type": "text"
                        },
                        "employee_count": {
                            "type": "integer"
                        }
                    }
                },
                "publish_date": {
                    "type": "date",
                    "format": "yyyy-MM-dd"
                },
                "comments": {
                    "type": "integer"
                }
            }
        }
    }
}

A matching document

PUT myindex/job/1
{
    "title": "python分布式爬虫开发",
    "salary_min": 15000,
    "city": "北京",
    "company": {
        "name": "百度",
        "company_addr": "北京市软件园",
        "employee_count": 50
    },
    "publish_date": "2018-05-20",
    "comments": 15
}

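Viewing index information

　　On the head management page, click the index and choose "Index Info" from the drop-down menu

Notes

　　If a field value does not match the type in the mapping, ES tries to convert it. If the conversion succeeds the document is stored; otherwise an error is returned

　　Once the type of a field in an index is fixed, it is very hard to change. Changing it means reindexing the data, which takes a long time when the index is large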
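
The first note can be pictured with a rough Python analogy: the value is converted to the mapped type if possible, and rejected otherwise. The coerce function below is a hypothetical stand-in that only covers integer and a yyyy-MM-dd date; the real conversion rules in Elasticsearch are broader and differ in detail.

```python
from datetime import datetime


def coerce(value, field_type):
    """Convert a raw value to the mapped type, raising ValueError on failure."""
    if field_type == "integer":
        return int(value)
    if field_type == "date":
        # Mirrors the "yyyy-MM-dd" format used in the mapping above.
        return datetime.strptime(value, "%Y-%m-%d").date()
    return value


print(coerce("15000", "integer"))     # conversion succeeds, value is stored
print(coerce("2018-05-20", "date"))
try:
    coerce("abc", "integer")          # conversion fails, document is rejected
except ValueError as error:
    print("rejected:", error)
```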

Retrieving an existing mapping

GET myindex/_mapping

GET myindex/job/_mapping

GET _all/_mapping

GET _all/_mapping/job

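More Elasticsearch queries

Query categories

　　Basic queries: search using Elasticsearch's built-in query clauses

　　Compound queries: combine several queries into one

　　Filters: narrow down results with filter clauses alongside a query, without affecting scoring

match query

# The query text is analyzed (tokenized) before searching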

GET myindex/job/_search
{
    "query": {
        "match": {
            "title": "python"
        }
    }
}

term query

# The query value is not analyzed at all; it is looked up exactly as given

GET myindex/_search
{
    "query": {
        "term": {
            "title": "python"
        }
    }
}

terms query

# A document is returned if it matches any value in the list

GET myindex/_search
{
    "query": {
        "terms": {
            "title": ["工程师", "django", "系统"]
        }
    }
}

Controlling the number of results

# from is the offset of the first hit to return; size is the number of hits to return

GET myindex/_search
{
    "query": {
        "terms": {
            "title": ["工程师", "django", "系统"]
        }
    },
    "from": 0,
    "size": 2
}

match_all query

# Match every document

GET myindex/_search
{
    "query": {
        "match_all": {}
    }
}

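match_phrase query

# Phrase query: the query text is analyzed into a list of terms, and only documents that contain all of those terms, close enough together to form the phrase, are returned

# slop limits how far apart the terms are allowed to be

GET /myindex/_search
{
    "query": {
        "match_phrase": {
            "title": {
                "query": "python系统",
                "slop": 6
            }
        }
    }
}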

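multi_match query

# Several fields can be given; a document matches if any of them contains the query value

# Appending ^3 to a field raises its weight: a match for python in title counts more, which affects the ordering of the results

GET /myindex/_search
{
    "query": {
        "multi_match": {
            "query": "python",
            "fields": ["title^3", "desc"]
        }
    }
}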

Specifying which fields to return

GET /myindex/_search
{
    "stored_fields": ["title", "company_name"],
    "query": {
        "match": {
            "title": "python"
        }
    }
}

Sorting results with sort

# asc for ascending, desc for descending

GET /myindex/_search
{
    "query": {
        "match_all": {}
    },
    "sort": [{
        "comments": {
            "order": "desc"
        }
    }]
}

Numeric range query

# boost sets the weight of the query

GET /myindex/_search
{
    "query": {
        "range": {
            "comments": {
                "gte": 10,
                "lte": 20,
                "boost": 2.0
            }
        }
    }
}

Date range query

# now is resolved by ES to the current time

GET /myindex/_search
{
    "query": {
        "range": {
            "add_time": {
                "gte": "2018-05-20",
                "lte": "now"
            }
        }
    }
}

wildcard query

GET /myindex/_search
{
    "query": {
        "wildcard": {
            "title":{
                "value": "py*n",
                "boost": 2.0
            }
        }
    }
}
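
The pattern py*n can be tried out locally with Python's fnmatch module, which happens to use the same * and ? wildcards. This is only an analogy for how the pattern is matched against indexed terms; fnmatch also supports [seq] classes that the wildcard query does not, and the term list is invented.

```python
from fnmatch import fnmatchcase

# Pretend these are the terms stored in the inverted index for "title".
terms = ["python", "pycharm", "pylon", "python3"]

# "*" matches any run of characters, so the term must start with "py" and end with "n".
matches = [term for term in terms if fnmatchcase(term, "py*n")]
print(matches)  # ['python', 'pylon']
```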

fuzzy query

GET myindex/_search
{
    "query": {
        "fuzzy": {
            "title": "linux"
        }
    },
    "_source": ["title"]  
}

# fuzziness is the maximum edit distance; prefix_length is the number of leading characters that must match exactly

GET myindex/_search
{
    "query": {
        "fuzzy": {
            "title": {
                "value": "linu",
                "fuzziness": 1,
                "prefix_length": 0
            }
        }
    },
    "_source": ["title"]
}
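
Edit distance is the number of single-character insertions, deletions or substitutions needed to turn one string into another. The sketch below implements plain Levenshtein distance and a hypothetical fuzzy_match helper that mimics fuzziness and prefix_length; Elasticsearch's own matching also treats a swap of two adjacent characters as one edit by default, which this sketch does not.

```python
def edit_distance(a, b):
    """Levenshtein distance computed row by row."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


def fuzzy_match(term, query, fuzziness=1, prefix_length=0):
    """True if the first prefix_length characters agree and the distance is small enough."""
    if term[:prefix_length] != query[:prefix_length]:
        return False
    return edit_distance(term, query) <= fuzziness


print(edit_distance("linu", "linux"))               # 1
print(fuzzy_match("linux", "linu", fuzziness=1))    # True
print(fuzzy_match("linux", "minux", prefix_length=1))  # False
```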

POST music/_search?pretty
{
    "suggest": {
        "song-suggest" : {
            "prefix" : "nor",
            "completion" : {
                "field" : "suggest",
                "fuzzy" : {
                    "fuzziness" : 2
                }
            }
        }
    }
}

regex query

POST music/_search?pretty
{
    "suggest": {
        "song-suggest" : {
            "regex" : "n[ever|i]r",
            "completion" : {
                "field" : "suggest"
            }
        }
    }
}

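Compound queries in Elasticsearch

bool query

　　The filtered query of older versions has been replaced by bool

bool query format

# filter: clauses that filter documents without contributing to the score

# must: every clause in the array must match

# should: at least one clause in the array should match

# must_not: the opposite of must; no clause in the array may match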

"bool": {
    "filter": [],
    "must": [],
    "should": [],
    "must_not": []
}
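
The matching logic of the four clause types can be sketched as a Python predicate. Scoring is left out entirely, clauses are modelled as plain functions, and bool_matches is a hypothetical name. One detail worth noting: should clauses are only required when there is no must or filter clause; otherwise they just influence the score.

```python
def bool_matches(doc, must=(), filter_=(), should=(), must_not=()):
    """Decide whether a document matches a bool query (no scoring)."""
    if not all(clause(doc) for clause in must):
        return False
    if not all(clause(doc) for clause in filter_):
        return False
    if any(clause(doc) for clause in must_not):
        return False
    if should and not must and not filter_:
        # With only should clauses, at least one of them has to match.
        return any(clause(doc) for clause in should)
    return True


def is_python(doc):
    return doc["title"] == "python"


def pays_20(doc):
    return doc["salary_min"] == 20


def pays_30(doc):
    return doc["salary_min"] == 30


doc_a = {"title": "python", "salary_min": 20}
doc_b = {"title": "java", "salary_min": 30}
print(bool_matches(doc_a, should=[is_python, pays_20], must_not=[pays_30]))  # True
print(bool_matches(doc_b, should=[is_python, pays_20], must_not=[pays_30]))  # False
```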

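Simple filtering with a bool query

# term could be replaced by match here; for an integer field the result is the same whether or not the value is analyzed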

GET /myindex/_search
{
    "query": {
        "bool": {
            "must":{
                "match_all": {}
            },
            "filter": {
                "term":{
                    "salary_min": 20
                }
            }
        }
    }
}

Filtering on multiple values

GET /myindex/_search
{
    "query": {
        "bool": {
            "must":{
                "match_all": {}
            },
            "filter": {
                "terms":{
                    "salary_min": [20, 30]
                }
            }
        }
    }
}

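Using term on a text field

# We stored Python, but by default a text field is analyzed and the term is stored in lowercase as python, so a term query for Python returns nothing

# Either query with the lowercase form of the value, or use match instead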

GET /myindex/_search
{
    "query": {
        "bool": {
            "must":{
                "match_all": {}
            },
            "filter": {
                "term":{
                    "title": "Python"
                }
            }
        }
    }
}
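
Why the term query above comes back empty can be reproduced with a toy index in Python. The analyze function here is a stand-in that only splits on whitespace and lowercases; the real analyzers do more, but the effect on Python versus python is the same.

```python
def analyze(text):
    """Stand-in for a text analyzer: split on whitespace and lowercase."""
    return [token.lower() for token in text.split()]


def build_index(docs):
    """Index every analyzed token of every document."""
    index = {}
    for doc_id, text in docs.items():
        for token in analyze(text):
            index.setdefault(token, set()).add(doc_id)
    return index


def term_query(index, value):
    # The value is looked up exactly as given, with no analysis.
    return index.get(value, set())


def match_query(index, text):
    # The query text goes through the same analyzer as the documents.
    hits = set()
    for token in analyze(text):
        hits |= index.get(token, set())
    return hits


index = build_index({1: "Python developer", 2: "Java developer"})
print(term_query(index, "Python"))   # set(): the index only holds "python"
print(term_query(index, "python"))   # {1}
print(match_query(index, "Python"))  # {1}
```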

Inspecting analyzer output

# Tokens: python, 网络, 络

GET _analyze
{
    "analyzer": "ik_max_word",
    "text": "python网络"
}

# Tokens: python, 网络

GET _analyze
{
    "analyzer": "ik_smart",
    "text": "python网络"
}

Combining multiple conditions

GET /myindex/_search
{
    "query": {
        "bool": {
            "should": [
                {
                    "term": {
                        "salary_min": 20
                    }
                },
                {
                    "term": {
                        "title": "python"
                    }
                }
            ],
            "must_not": [
                {
                    "term": {
                        "salary_min": 30
                    }
                }
            ]
        }
    }
}

Nesting bool queries

GET /myindex/_search
{
    "query": {
        "bool": {
            "should": [
                {
                    "term": {
                        "salary_min": 20
                    }
                },
                {
                    "bool": {
                        "must": [
                            {
                                "term": {
                                    "title": "django"
                                }
                            },
                            {
                                "term": {
                                    "salary_min": 30
                                }
                            }
                        ]
                    }
                }
            ]
        }
    }
}

Filtering null and non-null values

Creating test data

POST myindex/test2/_bulk
{"index":{"_id":1}}
{"tags":["search"]}
{"index":{"_id":2}}
{"tags":["search", "python"]}
{"index":{"_id":3}}
{"other_field":["data"]}
{"index":{"_id":4}}
{"tags":null}
{"index":{"_id":5}}
{"tags":["search", null]}

Getting documents with non-null values

# Return documents whose tags field exists and is not null

GET /myindex/_search
{
    "query": {
        "bool": {
            "filter":{
                "exists": {
                    "field": "tags"
                }
            }
        }
    }
}

Getting documents with null values

# Return documents that have no tags field or whose tags field is null

GET /myindex/_search
{
    "query": {
        "bool": {
            "must_not":{
                "exists": {
                    "field": "tags"
                }
            }
        }
    }
}
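
Which of the five test documents fall on each side can be checked with a short Python sketch of the exists rule: a field counts as present if it holds at least one non-null value. has_value is a hypothetical helper, and the documents are copied from the test data above.

```python
def has_value(doc, field):
    """True if the field holds at least one non-null value."""
    value = doc.get(field)
    if value is None:
        return False
    if isinstance(value, list):
        return any(item is not None for item in value)
    return True


docs = {
    1: {"tags": ["search"]},
    2: {"tags": ["search", "python"]},
    3: {"other_field": ["data"]},
    4: {"tags": None},
    5: {"tags": ["search", None]},
}
with_tags = [doc_id for doc_id, doc in docs.items() if has_value(doc, "tags")]
without_tags = [doc_id for doc_id, doc in docs.items() if not has_value(doc, "tags")]
print(with_tags)     # [1, 2, 5]
print(without_tags)  # [3, 4]
```

Document 5 still counts as having tags because one of its array entries is non-null.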

posted @ 2018-05-25 21:05 前路~
