ElasticSearch
Notes from Exploring ElasticSearch
Elasticsearch is a server for indexing and searching text, and installing it is very simple.
Elasticsearch is a standalone Java app and can be started easily from the command line. A copy can be obtained from the elasticsearch download page.
Microsoft Windows:
Download the .zip version and unpack it to a folder. Navigate to the bin folder, then double-click elasticsearch.bat to run it.
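On Linux or macOS the same unpacked archive can be started from a terminal. A minimal sketch, assuming the standard 1.x zip layout:
# from the unpacked elasticsearch folder (Linux/macOS)
./bin/elasticsearch        # run in the foreground
./bin/elasticsearch -d     # or run as a background daemon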
If the server starts successfully, you'll see output in the terminal like this:
[2015-02-04 20:43:12,747][INFO ][node ] [Joe Fixit] started
P.S.: One problem you may run into: if the terminal prints warnings like this:
[2014-12-17 09:31:03,820][WARN ][cluster.routing.allocation.decider]
[logstash test] high disk watermark [10%] exceeded on
[7drCr113QgSM8wcjNss_Mg][Blur] free: 632.3mb[8.4%], shards will be
relocated away from this node
[2014-12-17 09:31:03,820][INFO ][cluster.routing.allocation.decider]
[logstash test] high disk watermark exceeded on one or more nodes,
rerouting shards
It just means there isn't enough free space on the current disk, so you only need to delete some files to free up space.
After you've started your server, you can ensure it's running properly by opening your browser to the URL: http://localhost:9200. You should see a page like this:
{
  "status" : 200,
  "name" : "Joe Fixit",
  "cluster_name" : "elasticsearch",
  "version" : {
    "number" : "1.4.2",
    "build_hash" : "927caff6f05403e936c20bf4529f144f0c89fd8c",
    "build_timestamp" : "2014-12-16T14:11:12Z",
    "build_snapshot" : false,
    "lucene_version" : "4.10.2"
  },
  "tagline" : "You Know, for Search"
}
You're free to use any tool you wish to query elasticsearch; for instance, you can install curl (via cygwin on Windows) and query it from the command line.
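For example, a quick check that the server is up, assuming it is listening on the default port:
curl -XGET 'http://localhost:9200/?pretty=true'
This should print the same status document shown above.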
But if you're reading the book Exploring ElasticSearch, it's worth installing the tool made by the author: elastic-hammer. You can find the details on GitHub: https://github.com/andrewvc/elastic-hammer. It's easy to install as a plugin with the following steps:
- Simply run (in your elasticsearch bin folder)
./plugin -install andrewvc/elastic-hammer
- To use it visit:
http://<yourelasticsearchserver>/_plugin/elastic-hammer/. By default, <yourelasticsearchserver> is just localhost:9200.
- To upgrade the plugin run:
./plugin -remove elastic-hammer; ./plugin -install andrewvc/elastic-hammer
Modeling Data
field: the smallest individual unit of data.
documents: collections of fields; they comprise the base unit of storage in elasticsearch.
The primary data-format elasticsearch uses is JSON. A sample document:
{
  "_id" : 1,
  "handle" : "ron",
  "hobbies" : ["hacking", "the great outdoors"],
  "computer" : { "cpu" : "pentium pro", "mhz" : 200 }
}
A user-defined type is analogous to a database schema. Types are defined with the Mapping API:
{
  "user" : {
    "properties" : {
      "handle" : { "type" : "string" },
      "age" : { "type" : "integer" },
      "hobbies" : { "type" : "string" },
      "computer" : {
        "properties" : {
          "cpu" : { "type" : "string" },
          "speed" : { "type" : "integer" }
        }
      }
    }
  }
}
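As a sketch, this mapping could be submitted with curl to the Mapping API. The index name users below is just an assumption for illustration (the index must already exist), and the body is abridged to the handle field:
curl -XPUT 'http://localhost:9200/users/user/_mapping' -d '
{
  "user" : {
    "properties" : {
      "handle" : { "type" : "string" }
    }
  }
}'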
Basic CRUD
The full CRUD lifecycle in elasticsearch is Create, Read, Update, Delete. We'll create an index, then a type, and finally a document within that index using that type. The URL scheme is consistent for these operations: most URLs have the form /index/type/docid, and special operations on a given namespace are marked with an underscore prefix.
// create an index named 'planet'
PUT /planet

// create a type called 'hacker'
PUT /planet/hacker/_mapping
{
  "hacker" : {
    "properties" : {
      "handle" : { "type" : "string" },
      "age" : { "type" : "long" }
    }
  }
}

// create a document
PUT /planet/hacker/1
{ "handle" : "jean-michea", "age" : 18 }

// retrieve the document
GET /planet/hacker/1

// update the document's age field
POST /planet/hacker/1/_update
{ "doc" : { "age" : 19 } }

// delete the document
DELETE /planet/hacker/1
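If you'd rather drive the same lifecycle from a shell instead of elastic-hammer, the equivalent curl calls look roughly like this (a sketch, assuming the default localhost:9200 address):
# create the index
curl -XPUT 'http://localhost:9200/planet'
# index document 1
curl -XPUT 'http://localhost:9200/planet/hacker/1' -d '{"handle": "jean-michea", "age": 18}'
# read it back
curl -XGET 'http://localhost:9200/planet/hacker/1?pretty=true'
# update the age field
curl -XPOST 'http://localhost:9200/planet/hacker/1/_update' -d '{"doc": {"age": 19}}'
# delete it
curl -XDELETE 'http://localhost:9200/planet/hacker/1'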
Search Data
First, create our schema:
// Delete the document
DELETE /planet/hacker/1

// Delete any existing indexes named planet
DELETE /planet

// Create our index
PUT /planet/
{
  "mappings" : {
    "hacker" : {
      "properties" : {
        "handle" : { "type" : "string" },
        "hobbies" : { "type" : "string", "analyzer" : "snowball" }
      }
    }
  }
}
Then, seed some data using the hacker_planet.eloader dataset.
The data repository is available at http://github.com/andrewvc/ee-datasets. After cloning the repository, you can load examples into your server by running the included elastic-loader.jar program, providing the address of your elasticsearch server and the path to the data-file. For example, to load the hacker_planet dataset, open a command prompt in the ee-datasets folder and run:
java -jar elastic-loader.jar http://localhost:9200 datasets/hacker_planet.eloader
Finally, we can perform our search:
// Do the search
POST /planet/hacker/_search
{
  "query" : {
    "match" : {
      "hobbies" : "rollerblading"
    }
  }
}
The above query searches for users who like rollerblading among the three users we've loaded into the database.
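The same request can be made with curl; matching documents come back under hits.hits in the response, each carrying a relevance _score. A sketch:
curl -XPOST 'http://localhost:9200/planet/hacker/_search?pretty=true' -d '
{ "query" : { "match" : { "hobbies" : "rollerblading" } } }'
# matching documents are returned under hits.hits, each with a _score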
Searches in elasticsearch are handled by the aptly named search API, exposed through the _search endpoint. A search can be scoped to an index or to a document type:
- index search: /myidx/_search
- document type search: /myidx/mytype/_search
For example:
// index search
POST /planet/_search
...

// document type search
POST /planet/hacker/_search
...
The skeleton of a more complex search:
// Load Dataset: hacker_planet.eloader
POST /planet/_search
{
  "from" : 0,
  "size" : 15,
  "query" : { "match_all" : {} },
  "sort" : { "handle" : "desc" },
  "filter" : { "term" : { "_all" : "coding" } },
  "facets" : {
    "hobbies" : {
      "terms" : { "field" : "hobbies" }
    }
  }
}
All elasticsearch queries boil down to three tasks (a combined sketch follows this list):
- restricting the result set
- scoring (the default scoring algorithm is implemented in Lucene's TFIDFSimilarity class)
- sorting
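A minimal curl sketch combining these, again assuming the hacker_planet dataset is loaded: the match query both restricts and scores the result set, while the sort clause orders the hits (overriding ordering by score):
curl -XPOST 'http://localhost:9200/planet/hacker/_search?pretty=true' -d '
{
  "query" : { "match" : { "hobbies" : "coding" } },
  "sort" : { "handle" : "desc" }
}'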
Text Analysis
Elasticsearch has a toolbox with which we can slice and dice words so they can be searched efficiently. Using these tools we can narrow our search space and find common ground between linguistically similar terms.
The Snowball analyzer is great at figuring out what the stems of English words are. The stem of a word is its root.
The process by which documents are analyzed is as follows:
- A document update or create is received via a PUT or POST.
- The field values in the document are each run through an analyzer which converts each value to zero, one, or more indexable tokens.
- The tokenized values are stored in an index, pointing back to the full version of the document.
The easiest way to see analysis in action is with the Analyze API:
GET /_analyze?analyzer=snowball&text=candles%20candle&pretty=true
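The same request can be issued with curl; both words should come back reduced to the same stem, which is why a search for one will match the other. A sketch, with the expected result noted as a comment:
curl -XGET 'http://localhost:9200/_analyze?analyzer=snowball&text=candles%20candle&pretty=true'
# both tokens should come back with the same stem, roughly "candl"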
An analyzer is really a three-stage pipeline comprised of the following execution steps:
- Character Filtering Turns the input string into a different string
- Tokenization Turns the char-filtered string into an array of tokens
- Token Filtering Post-process the filtered tokens into a mutated token array
Let's dive in by building a custom analyzer for tokenizing CSV data. Custom analyzers can be stored at the index level either during or after index creation. Let's:
- create a "recipes" index
- close it
- update the analysis settings
- reopen it (in order to experiment with a custom analyzer)
// Create the index
PUT /recipes

// Close the index for settings update
POST /recipes/_close

// Create the analyzer
PUT /recipes/_settings
{
  "index" : {
    "analysis" : {
      "tokenizer" : {
        "comma" : { "type" : "pattern", "pattern" : "," }
      },
      "analyzer" : {
        "recipe_csv" : {
          "type" : "custom",
          "tokenizer" : "comma",
          "filter" : ["trim", "lowercase"]
        }
      }
    }
  }
}

// Reopen the index
POST /recipes/_open
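Once the index is reopened, the custom analyzer can be tried out with the Analyze API scoped to the recipes index. A sketch; the ingredient text is just an illustration:
curl -XGET 'http://localhost:9200/recipes/_analyze?analyzer=recipe_csv&pretty=true' -d 'Eggs, Flour, Whole Milk'
# expected tokens after the comma tokenizer, trim, and lowercase filters:
# "eggs", "flour", "whole milk"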
Faceting
Facets are always attached to a query, letting you return aggregate statistics alongside regular query results. We'll create a database of movies and return facets based on the movies' genres alongside standard query results. As usual, we need to load the movie_db.eloader data-set into the elasticsearch server.
Simple movie mapping:
// Load Dataset: movie_db.eloader
GET /movie_db/movie/_mapping?pretty=true
{
  "movie" : {
    "properties" : {
      "actors" : { "type" : "string", "analyzer" : "standard", "position_offset_gap" : 100 },
      "genre" : { "type" : "string", "index" : "not_analyzed" },
      "release_year" : { "type" : "integer", "index" : "not_analyzed" },
      "title" : { "type" : "string", "analyzer" : "snowball" },
      "description" : { "type" : "string", "analyzer" : "snowball" }
    }
  }
}
Simple terms faceting:
// Load Dataset: movie_db.eloader
POST /movie_db/_search
{
  "query" : { "match" : { "description" : "hacking" } },
  "facets" : {
    "genre" : {
      "terms" : { "field" : "genre", "size" : 10 }
    }
  }
}
This query searches for movies whose description contains "hacking". Alongside the regular hits, it returns a genre facet showing which genres those matching movies fall into and how many matching movies each genre contains.
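The same faceted search can be issued from the command line; a curl sketch, again assuming movie_db.eloader has been loaded:
curl -XPOST 'http://localhost:9200/movie_db/_search?pretty=true' -d '
{
  "query" : { "match" : { "description" : "hacking" } },
  "facets" : { "genre" : { "terms" : { "field" : "genre", "size" : 10 } } }
}'
# the response contains a "facets" section listing each genre term with its count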