Python Elasticsearch (ES) operations

This article uses Kibana for visualization, on Windows 10 with Python 3.8 (in PyCharm).

1. Connecting Python to ES in PyCharm

from elasticsearch import Elasticsearch

es=Elasticsearch("http://localhost:9200")
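
Before going further, it can help to confirm that the client actually reaches the cluster. The following is a minimal sketch, assuming a node is listening on http://localhost:9200. The helper name describe_cluster is hypothetical; ping() and info() are regular client methods.

```python
def describe_cluster(es):
    """Return a short status string for an Elasticsearch client.

    `es` is expected to be an elasticsearch.Elasticsearch instance;
    the helper name is illustrative only.
    """
    # ping() returns False instead of raising when the node is unreachable
    if not es.ping():
        return "unreachable"
    info = es.info()
    return f"{info['cluster_name']} (version {info['version']['number']})"


# Usage (requires a running cluster):
# from elasticsearch import Elasticsearch
# print(describe_cluster(Elasticsearch("http://localhost:9200")))
```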

2. Analyzers

2.1 ik Chinese analyzer

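The ik Chinese analyzer comes in two modes, ik_smart and ik_max_word. ik_smart splits text at a coarser granularity, while ik_max_word produces a finer, more exhaustive segmentation.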

# Create an index named "test" with a single text field, "name"
doc1={
    "mappings":{
        "properties":{
            "name":{
                "type":"text",
                "analyzer":"ik_max_word"    # this field is analyzed with ik_max_word
            }
        }
    }
}
es.indices.create(index="test",body=doc1)
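
One way to see the difference between the two modes is to run the same text through both with the _analyze API. This is a rough sketch, assuming the ik plugin is installed on the cluster; analyze_tokens and compare_ik_modes are hypothetical helper names, and no particular token output is claimed here.

```python
def analyze_tokens(es, analyzer, text):
    """Return the tokens an analyzer produces for `text` via the _analyze API."""
    resp = es.indices.analyze(body={"analyzer": analyzer, "text": text})
    return [t["token"] for t in resp["tokens"]]


def compare_ik_modes(es, text):
    """Run the same text through both ik modes and return both token lists."""
    return {
        "ik_smart": analyze_tokens(es, "ik_smart", text),
        "ik_max_word": analyze_tokens(es, "ik_max_word", text),
    }


# Usage (requires a running cluster with the ik plugin):
# for mode, tokens in compare_ik_modes(es, "中天公司").items():
#     print(mode, tokens)
```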

2.2 ik with stop words

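This is the ik Chinese analyzer combined with stop words. Stop words can be configured in a configuration file or set from Python code. Here I set them directly from Python, since that is probably simpler to write than editing the configuration file.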

stopwords1=["中天","时任"]
doc2={
  "settings": {
    "analysis": {
      "filter": {  # define the token filter
        "my_stop":
        {
          "type":"stop",  # token filter type: stop words
          "stopwords":stopwords1
        }
      },
      "analyzer": {
        "ik_stop":{  # custom analyzer
          "tokenizer":"ik_max_word",  # tokenizer used by this analyzer
          "filter":["my_stop"]  # token filters to apply
        }
      }
    }
  },
"mappings": {
        "properties": {
            "entity_name": {
                "type": "text",
                "analyzer": "ik_stop"
            },
         }
  }
}
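
Because the stop words are just a Python list, the same body can be generated for any list. The sketch below wraps the structure above in a function; build_ik_stop_body is a hypothetical helper name, and the index name in the commented call is an assumption, since the original does not show which index this body is created under.

```python
def build_ik_stop_body(stopwords, field="entity_name"):
    """Build an index body whose `field` uses ik_max_word plus a stop filter."""
    return {
        "settings": {
            "analysis": {
                "filter": {
                    # stop-word token filter fed from the Python list
                    "my_stop": {"type": "stop", "stopwords": list(stopwords)}
                },
                "analyzer": {
                    # custom analyzer: ik_max_word tokenizer, then the stop filter
                    "ik_stop": {"tokenizer": "ik_max_word", "filter": ["my_stop"]}
                },
            }
        },
        "mappings": {
            "properties": {field: {"type": "text", "analyzer": "ik_stop"}}
        },
    }


doc2 = build_ik_stop_body(["中天", "时任"])
# es.indices.create(index="test1", body=doc2)  # index name assumed; needs the ik plugin
```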

2.3 ngram tokenizer

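The ngram tokenizer splits text into character n-grams, which makes character-level matching possible. The example below emits one token per character. By default, the difference between max_gram and min_gram must not exceed 1 (the index.max_ngram_diff setting).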

doc3={
  "settings": {
      "analysis": {
        "analyzer": {
          "specialchar_analyzer": {
            "tokenizer": "specialchar_tokenizer"
          }
        },
        "tokenizer": {
          "specialchar_tokenizer": {
            "type": "ngram",
            "min_gram": 1,
            "max_gram": 1
          }
        }
      }
  },
    "mappings": {
        "properties": {
            "entity_name": {
                "type": "text",
                "analyzer": "specialchar_analyzer"
            },
        }
    }
}
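
To make the effect of min_gram and max_gram concrete, here is a small pure-Python approximation of how character n-grams are produced. It is only an illustration of the idea, not the tokenizer's actual implementation; char_ngrams is a hypothetical helper name.

```python
def char_ngrams(text, min_gram=1, max_gram=1):
    """Approximate ngram tokenization: every substring of length
    min_gram..max_gram, taken at each start position in order."""
    if min_gram < 1 or max_gram < min_gram:
        raise ValueError("need 1 <= min_gram <= max_gram")
    grams = []
    for start in range(len(text)):
        for size in range(min_gram, max_gram + 1):
            if start + size <= len(text):
                grams.append(text[start:start + size])
    return grams


print(char_ngrams("中天公司"))       # min_gram = max_gram = 1: one token per character
print(char_ngrams("中天公司", 1, 2))  # single characters plus adjacent pairs
```

With min_gram and max_gram both set to 1, as in doc3, every character becomes its own token, so a query for a single character can match any document containing it.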

2.4 ngram with stop words

An ngram analyzer with stop words simply combines the two previous configurations.

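stopwords2=["的"]  # placeholder value (assumption): the original does not show this list
doc4={
  "settings": {
      "analysis": {
        "analyzer": {
          "ngram-stop": {
            "tokenizer": "specialchar_tokenizer",
            "filter":["my_stop"]  # token filters to apply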
          }
        },
        "tokenizer": {
          "specialchar_tokenizer": {
            "type": "ngram",
            "min_gram": 1,
            "max_gram": 1
          }
        },
        "filter": {  # define the token filter
          "my_stop":
          {
            "type":"stop",  # token filter type: stop words
            "stopwords":stopwords2
          }
        }
      }
  },
    "mappings": {
        "properties": {
            "entity_name": {
                "type": "text",
                "analyzer": "ngram-stop"
            },
        }
    }
}
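
One thing worth keeping in mind with this combination: a stop filter drops whole tokens, and with min_gram and max_gram both set to 1 every token is a single character. The pure-Python sketch below imitates that behaviour to show that multi-character stop words would never match; apply_stop_filter is a hypothetical helper, not part of any library.

```python
def apply_stop_filter(tokens, stopwords):
    """Imitate a stop token filter: drop tokens that exactly equal a stop word."""
    stop = set(stopwords)
    return [t for t in tokens if t not in stop]


# Tokens as a 1-gram tokenizer would emit them: one per character
single_chars = list("时任中天公司")

# Two-character stop words never equal a one-character token, so nothing is removed
print(apply_stop_filter(single_chars, ["中天", "时任"]))

# Single-character stop words do match
print(apply_stop_filter(single_chars, ["司"]))
```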

3. Adding data

The example below inserts the data in body1 into the test1 index with its id set to 1. If no id is specified, ES generates one automatically.

es.index(index="test1",id="1",document=body1)
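
The two cases (explicit id versus auto-generated id) can be wrapped in one small helper. This is a sketch under assumptions: add_document is a hypothetical name, and the sample document in the usage comment is made up for illustration, since the original does not show the contents of body1.

```python
def add_document(es, index, doc, doc_id=None):
    """Index `doc` and return the id Elasticsearch stored it under.

    When doc_id is None the id argument is omitted, so the server
    generates an id itself.
    """
    kwargs = {"index": index, "document": doc}
    if doc_id is not None:
        kwargs["id"] = doc_id
    resp = es.index(**kwargs)
    return resp["_id"]


# Usage (requires a running cluster; the document is a made-up sample):
# body1 = {"entity_name": "中天公司"}
# print(add_document(es, "test1", body1, doc_id="1"))  # explicit id
# print(add_document(es, "test1", body1))              # auto-generated id
```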

4. Querying

# Query body
query={
    "query": {
    "match": {
      "entity_name": "公司"
    }
  }
}

# Run the query; the result can then be reformatted as needed
result1 =es.search(index="test1",body=query)
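
The search response nests the matching documents under hits.hits, each with _id, _score and _source. A possible way to flatten that into a simpler list is sketched below; extract_hits is a hypothetical helper name and the output shape is just one reasonable choice.

```python
def extract_hits(result):
    """Flatten a search response into a list of {id, score, source} dicts."""
    return [
        {"id": hit["_id"], "score": hit["_score"], "source": hit["_source"]}
        for hit in result["hits"]["hits"]
    ]


# Usage with the response from the query above:
# for row in extract_hits(result1):
#     print(row["id"], row["score"], row["source"]["entity_name"])
```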

5. Sorting

# Sort the documents in the index by max_score
body = {
  "sort":{
    "max_score":{         # sort by the max_score field, descending
      "order":"desc"    # asc: ascending, desc: descending
    }
    }

  }
}

# Search all documents, sorted by max_score in descending order
result=es.search(index="test",body=body)
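
The body above only works if the documents themselves contain a field called max_score. If the goal is to order hits by relevance, the usual sort key is the special _score, since max_score in a search response is metadata rather than a document field. The sketch below assumes that intent; match_sorted_by is a hypothetical helper name.

```python
def match_sorted_by(field, text, sort_field="_score", order="desc"):
    """Build a search body: a match query plus an explicit sort clause."""
    if order not in ("asc", "desc"):
        raise ValueError("order must be 'asc' or 'desc'")
    return {
        "query": {"match": {field: text}},
        "sort": [{sort_field: {"order": order}}],
    }


body = match_sorted_by("entity_name", "公司")
# result = es.search(index="test1", body=body)  # requires a running cluster
```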

posted @ 2023-09-13 18:43 bonel
