Elasticsearch Tutorial
Concepts
Mapping concepts across SQL and Elasticsearch
Although SQL and Elasticsearch use different terms for the way data is organized, their purposes are essentially the same.
SQL | Elasticsearch | Description |
---|---|---|
column | field | In both cases, at the lowest level, data is stored in named entries of a variety of data types, each containing one value. |
row | document | Columns and fields do not exist by themselves; they are part of a row or a document. |
table | index | The target against which queries, whether in SQL or Elasticsearch, are executed. |
database | cluster | In SQL, catalog and database are used interchangeably and represent a set of schemas, that is, a number of tables. In Elasticsearch the set of available indices is grouped in a cluster. |
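As a minimal sketch of the analogy with the Python client (the articles index and the author and title fields are hypothetical, purely to illustrate the table above):
# encoding=utf-8
from elasticsearch import Elasticsearch
from pprint import pprint

es = Elasticsearch()

# Roughly the equivalent of: SELECT title FROM articles WHERE author = 'Alice';
body = {
    "query": {
        "term": {"author": "Alice"}   # column -> field
    },
    "_source": ["title"]              # return only the "title" field
}
pprint(es.search(index="articles", body=body))  # table -> index, database -> cluster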
Field Data Type
Common types
type | description |
---|---|
binary | Binary value encoded as a Base64 string. |
boolean | true and false values. |
Keywords | The keyword family, including keyword, constant_keyword, and wildcard. |
Numbers | Numeric types, such as long and double, used to express amounts. |
Dates | Date types, including date and date_nanos. |
Text | A field to index full-text values. |
Mapping
Mapping is the process of defining how a document, and the fields it contains, are stored and indexed.
Each document is a collection of fields, which each have their own data type. When mapping your data, you create a mapping definition, which contains a list of fields that are pertinent to the document.
Dynamic mapping
Dynamic mapping allows you to experiment with and explore data when you’re just getting started. Elasticsearch adds new fields automatically, just by indexing a document.
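As a minimal sketch (the dynamic-demo index and its fields are made up for this example), indexing a single document into a non-existent index creates the index and infers the field types automatically:
# encoding=utf-8
from elasticsearch import Elasticsearch
from pprint import pprint

es = Elasticsearch()

# Index a document into an index that does not exist yet; Elasticsearch
# creates the index and dynamically maps "title" (text) and "views" (long).
es.index(index="dynamic-demo", body={"title": "hello world", "views": 42})

# Inspect the mapping that was inferred automatically.
pprint(es.indices.get_mapping(index="dynamic-demo"))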
Explicit mapping
Explicit mapping allows you to define the mapping yourself, choosing each field's type precisely. For example,
{
  "mappings": {
    "properties": {
      "uuid": {
        "type": "keyword"
      },
      "title": {
        "type": "text"
      },
      "main_body": {
        "type": "text",
        "index": false
      }
    }
  }
}
The "keyword" type indicates that the field is not analyzed and should be searched with exact-value queries such as term.
The "text" type indicates that the field is analyzed and should be searched with full-text queries such as match.
Setting "index": false means the field is not indexed at all, so it cannot be searched.
Query and filter context
Relevance scores
By default, Elasticsearch sorts matching search results by relevance score, which measures how well each document matches a query.
Query context
In the query context, a query clause answers the question “How well does this document match this query clause?” Besides deciding whether or not the document matches, the query clause also calculates a relevance score in the _score metadata field.
Filter context
In a filter context, a query clause answers the question “Does this document match this query clause?” The answer is a simple Yes or No — no scores are calculated.
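A minimal sketch of the usual combination, using the fields from the mapping above and the forward index used later in this tutorial: full-text clauses go under must (query context, scored), exact yes/no clauses go under filter (filter context, not scored):
# encoding=utf-8
from elasticsearch import Elasticsearch
from pprint import pprint

es = Elasticsearch()
body = {
    "query": {
        "bool": {
            "must": [
                {"match": {"title": "中国银行"}}   # query context: affects _score
            ],
            "filter": [
                {"term": {"uuid": "1000"}}         # filter context: yes/no only
            ]
        }
    }
}
pprint(es.search(index="forward", body=body))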
Query DSL
Elasticsearch provides a full Query DSL (Domain Specific Language) based on JSON to define queries.
Leaf query clauses
query type | description |
---|---|
match | Returns documents that match a provided text, number, date or boolean value. The provided text is analyzed before matching. |
term | Returns documents that contain an exact term in a provided field. |
range | Returns documents that contain terms within a provided range. |
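For illustration, here is a rough sketch of the three leaf queries against the forward index used later in this tutorial (the numeric word_count field in the range example is hypothetical):
# encoding=utf-8
from elasticsearch import Elasticsearch
from pprint import pprint

es = Elasticsearch()

# match: analyzed full-text search on a text field.
pprint(es.search(index="forward", body={"query": {"match": {"title": "中国银行"}}}))

# term: exact, un-analyzed match on a keyword field.
pprint(es.search(index="forward", body={"query": {"term": {"uuid": "1000"}}}))

# range: values between the given bounds (assumes a hypothetical numeric field).
pprint(es.search(index="forward",
                 body={"query": {"range": {"word_count": {"gte": 100, "lte": 500}}}}))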
Compound query clauses
query type | description |
---|---|
bool | A query that matches documents matching boolean combinations of other queries. It is built using one or more boolean clauses, each clause with a typed occurrence. |
dis_max | Returns documents matching one or more wrapped queries, called query clauses or clauses. If a returned document matches multiple query clauses, the dis_max query assigns the document the highest relevance score from any matching clause, plus a tie breaking increment for any additional matching subqueries. |
constant_score | Wraps a filter query and returns every matching document with a relevance score equal to the boost parameter value. |
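A brief sketch of dis_max and constant_score over the same fields (a bool example appears in the query and filter context section above):
# encoding=utf-8
from elasticsearch import Elasticsearch
from pprint import pprint

es = Elasticsearch()

# dis_max: score = best matching clause, plus tie_breaker times the other clauses.
body = {
    "query": {
        "dis_max": {
            "queries": [
                {"match": {"title": "中国银行"}},
                {"match": {"main_body": "中国银行"}}
            ],
            "tie_breaker": 0.3
        }
    }
}
pprint(es.search(index="forward", body=body))

# constant_score: every matching document gets the same score (the boost value).
body = {
    "query": {
        "constant_score": {
            "filter": {"term": {"uuid": "1000"}},
            "boost": 1.2
        }
    }
}
pprint(es.search(index="forward", body=body))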
Allow expensive queries
query type | description |
---|---|
script queries | Filters documents based on a provided script. The script query is typically used in a filter context. |
fuzzy queries | Returns documents that contain terms similar to the search term, as measured by a Levenshtein edit distance. |
regexp queries | Returns documents that contain terms matching a regular expression. |
prefix queries | Returns documents that contain a specific prefix in a provided field. |
wildcard queries | Returns documents that contain terms matching a wildcard pattern. A wildcard operator is a placeholder that matches one or more characters. |
range queries | Returns documents that contain terms within a provided range. |
Joining queries | Performing full SQL-style joins in a distributed system like Elasticsearch is prohibitively expensive. |
Geo-shape query | Filter documents indexed using the geo_shape or geo_point type. |
Script score query | Uses a script to provide a custom score for returned documents. The script_score query is useful if, for example, a scoring function is expensive and you only need to calculate the score of a filtered set of documents. |
Percolate query | The percolate query can be used to match queries stored in an index. The percolate query itself contains the document that will be used as query to match with the stored queries. |
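These query types can be disabled cluster-wide by setting search.allow_expensive_queries to false; with the default (true) they run normally. Two short sketches against the forward index:
# encoding=utf-8
from elasticsearch import Elasticsearch
from pprint import pprint

es = Elasticsearch()

# fuzzy: terms within a Levenshtein edit distance of the search term.
pprint(es.search(index="forward",
                 body={"query": {"fuzzy": {"uuid": {"value": "1001", "fuzziness": 1}}}}))

# wildcard: "*" and "?" act as placeholders in the search term.
pprint(es.search(index="forward",
                 body={"query": {"wildcard": {"uuid": {"value": "10*"}}}}))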
Python 3 Elasticsearch in Action
Index
create
# encoding=utf-8
from elasticsearch import Elasticsearch
from pprint import pprint

def main():
    es = Elasticsearch()
    # Explicit mapping for the "forward" index.
    body = {
        "mappings": {
            "properties": {
                "uuid": {
                    "type": "keyword"
                },
                "title": {
                    "type": "text"
                },
                "main_body": {
                    "type": "text"
                }
            }
        }
    }
    ret = es.indices.create(index="forward", body=body)
    pprint(ret)

if __name__ == '__main__':
    main()
delete
# encoding=utf-8
from elasticsearch import Elasticsearch
from pprint import pprint

def main():
    es = Elasticsearch()
    # Delete the "forward" index and all of its documents.
    ret = es.indices.delete(index="forward")
    pprint(ret)

if __name__ == '__main__':
    main()
update
Update mapping API
Adds new fields to an existing data stream or index. You can also use this API to change the search settings of existing fields.
# encoding=utf-8
from elasticsearch import Elasticsearch
from pprint import pprint

def main():
    es = Elasticsearch()
    # Add a new "publish_date" field to the existing mapping of "forward".
    body = {
        "properties": {
            "uuid": {
                "type": "keyword"
            },
            "title": {
                "type": "text"
            },
            "main_body": {
                "type": "text"
            },
            "publish_date": {
                "type": "keyword"
            }
        }
    }
    ret = es.indices.put_mapping(index="forward", body=body)
    pprint(ret)

if __name__ == '__main__':
    main()
Reindex API
Copies documents from a source to a destination. Reindex does not copy the settings or mappings of the source index, so the destination index should be set up beforehand.
# encoding=utf-8
from elasticsearch import Elasticsearch
from pprint import pprint

def main():
    es = Elasticsearch()
    # Copy all documents from the "forward" index into the "document" index.
    body = {
        "source": {
            "index": "forward"
        },
        "dest": {
            "index": "document"
        }
    }
    ret = es.reindex(body=body)
    pprint(ret)

if __name__ == '__main__':
    main()
get
# encoding=utf-8
from elasticsearch import Elasticsearch
from pprint import pprint

def main():
    es = Elasticsearch()
    # Retrieve the settings and mappings of the "forward" index.
    ret = es.indices.get(index="forward")
    pprint(ret)

if __name__ == '__main__':
    main()
Document
create
# encoding=utf-8
from elasticsearch import Elasticsearch
from pprint import pprint

def main():
    es = Elasticsearch()
    # Index a new document; Elasticsearch generates the document id.
    body = {
        "uuid": "1000",
        "title": "中国银行在港交所上市挂牌成功",
        "main_body": "中国银行在港交所上市挂牌成功,成为中国大陆首家在国际市场上市的银行。"
    }
    ret = es.index(index="forward", body=body)
    pprint(ret)

if __name__ == '__main__':
    main()
delete
# encoding=utf-8
from elasticsearch import Elasticsearch
from pprint import pprint

def main():
    es = Elasticsearch()
    # Delete the document with the given id from the "forward" index.
    ret = es.delete(index="forward", id="WRemuHkBd6vf16HuHzHq")
    pprint(ret)

if __name__ == '__main__':
    main()
update
To fully replace an existing document, use the index API, which creates a document or replaces an existing one in an index.
# encoding=utf-8
from elasticsearch import Elasticsearch
from pprint import pprint

def main():
    es = Elasticsearch()
    # Re-index with an explicit id to fully replace the existing document.
    body = {
        "uuid": "1000",
        "title": "<<中国银行在港交所上市挂牌成功>>",
        "main_body": "<<成为中国大陆首家在国际市场上市的银行>>"
    }
    ret = es.index(index="forward", body=body, id="WRemuHkBd6vf16HuHzHq")
    pprint(ret)

if __name__ == '__main__':
    main()
The update API updates a document with a script or partial document. For example, a partial-document update changes only the listed fields:
# encoding=utf-8
from elasticsearch import Elasticsearch
from pprint import pprint

def main():
    es = Elasticsearch()
    # Partial update: only the fields under "doc" are changed, the rest is kept.
    body = {
        "doc": {
            "title": "<<中国银行在港交所上市挂牌成功>>",
            "main_body": "<<成为中国大陆首家在国际市场上市的银行>>"
        }
    }
    ret = es.update(index="forward", body=body, id="WRemuHkBd6vf16HuHzHq")
    pprint(ret)

if __name__ == '__main__':
    main()
Updates a document using the specified script.
# encoding=utf-8
from elasticsearch import Elasticsearch
from pprint import pprint

def main():
    es = Elasticsearch()
    # Increment a numeric "counter" field (the document is assumed to have one).
    body = {
        "script": {
            "source": "ctx._source.counter += params.count",
            "lang": "painless",
            "params": {
                "count": 4
            }
        }
    }
    ret = es.update(index="forward", body=body, id="WRemuHkBd6vf16HuHzHq")
    pprint(ret)

if __name__ == '__main__':
    main()
get
Returns a document.
# encoding=utf-8
from elasticsearch import Elasticsearch
from pprint import pprint

def main():
    es = Elasticsearch()
    # Fetch a single document by id.
    ret = es.get(index="forward", id="WRemuHkBd6vf16HuHzHq")
    pprint(ret)

if __name__ == '__main__':
    main()
Search
The match_phrase query supports character-based boolean retrieval of Chinese text, enabling exact Chinese phrase matching and precise Chinese queries.
# encoding=utf-8
from elasticsearch import Elasticsearch
from pprint import pprint

def main():
    es = Elasticsearch()
    # Match the exact phrase in either the title or the main_body field.
    body = {
        "query": {
            "bool": {
                "should": [
                    {"match_phrase": {"title": "中国石油"}},
                    {"match_phrase": {"main_body": "中国石油"}}
                ]
            }
        }
    }
    ret = es.search(body=body, index="forward")
    pprint(ret)

if __name__ == '__main__':
    main()
Multi-match query: the multi_match query builds on the match query to allow multi-field queries.
{
  "query": {
    "multi_match": {
      "query": "中国石油",
      "fields": [ "title", "main_body" ]
    }
  }
}
Highlighting allows you to highlight search results on one or more fields.
{
  "query": {
    "match": { "title": "中国石油" }
  },
  "highlight": {
    "pre_tags": ["<tag1>"],
    "post_tags": ["</tag1>"],
    "fields": {
      "title": {}
    }
  }
}
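Both request bodies above can be sent with the Python client just like the earlier search example, for instance combining multi_match with highlighting:
# encoding=utf-8
from elasticsearch import Elasticsearch
from pprint import pprint

def main():
    es = Elasticsearch()
    # multi_match over both fields, highlighting the matches in the title field.
    body = {
        "query": {
            "multi_match": {
                "query": "中国石油",
                "fields": ["title", "main_body"]
            }
        },
        "highlight": {
            "pre_tags": ["<tag1>"],
            "post_tags": ["</tag1>"],
            "fields": {
                "title": {}
            }
        }
    }
    ret = es.search(index="forward", body=body)
    pprint(ret)

if __name__ == '__main__':
    main()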