ElasticSearch
Notes from Exploring ElasticSearch
Elasticsearch is a server for indexing and searching text. Installing it is very simple.
Elasticsearch is a standalone Java app and can be easily started from the command line. A copy can be obtained from the elasticsearch download page.
Microsoft Windows:
Download the .zip version and unpack it to a folder. Navigate to the bin folder, then double-click elasticsearch.bat to run it.
If the server starts successfully, you'll see output in the terminal like this:
[2015-02-04 20:43:12,747][INFO ][node ] [Joe Fixit] started
P.S.: there's a problem you may run into. If the terminal prints messages like this:
[2014-12-17 09:31:03,820][WARN ][cluster.routing.allocation.decider]
[logstash test] high disk watermark [10%] exceeded on
[7drCr113QgSM8wcjNss_Mg][Blur] free: 632.3mb[8.4%], shards will be
relocated away from this node
[2014-12-17 09:31:03,820][INFO ][cluster.routing.allocation.decider]
[logstash test] high disk watermark exceeded on one or more nodes,
rerouting shards
It just means there isn't enough free space on your current disk, so you only need to delete some files to free up space.
After you've started your server, you can ensure it's running properly by opening your browser to the URL: http://localhost:9200. You should see a page like this:
{ "status" : 200, "name" : "Joe Fixit", "cluster_name" : "elasticsearch", "version" : { "number" : "1.4.2", "build_hash" : "927caff6f05403e936c20bf4529f144f0c89fd8c", "build_timestamp" : "2014-12-16T14:11:12Z", "build_snapshot" : false, "lucene_version" : "4.10.2" }, "tagline" : "You Know, for Search" }
As you're free to use any tool you wish to query elasticsearch, one option is to install curl (on Windows, via cygwin) and query elasticsearch from the command line.
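For example, a quick sanity check with curl, assuming the server is running on the default port:

# Check that the server is up (same output as visiting the URL in a browser)
curl -XGET 'http://localhost:9200/?pretty'

This returns the same status JSON shown above.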
But if you're reading the book Exploring ElasticSearch, you may want to install the tool made by the author: elastic-hammer. You can find detailed information on Github: https://github.com/andrewvc/elastic-hammer. It's very easy to install as a plugin with the following steps:
- Simply run (in your elasticsearch bin folder)
./plugin -install andrewvc/elastic-hammer
- To use it visit:
http://<yourelasticsearchserver>/_plugin/elastic-hammer/. By default, <yourelasticsearchserver> is just localhost:9200.
- To upgrade the plugin run:
./plugin -remove elastic-hammer; ./plugin -install andrewvc/elastic-hammer
Modeling Data
field: the smallest individual unit of data.
documents: collections of fields; they comprise the base unit of storage in elasticsearch.
The primary data-format elasticsearch uses is JSON. A sample document:
{ "_id" : 1, "handle" : "ron", "hobbies" : ["hacking", "the great outdoors"], "computer" : {"cpu" : "pentium pro", "mhz" : 200} }
The user-defined type is analogous to a database schema. Types are defined with the Mapping APIs:
{ "user" : { "properties" : { "handle" : {"type" : "string"}, "age" : {"type" : "integer"}, "hobbies" : {"type" : "string"}, "computer" : { "properties" : { "cpu" : {"type" : string}, "speed" : {"type" : "integer"} } } } } }
Basic CRUD
The full CRUD lifecycle in elasticsearch is Create, Read, Update, Delete. We'll create an index, then a type, and finally a document within that index using that type. The URL scheme is consistent for these operations: most URLs have the form /index/type/docid, and special operations on a given namespace are marked with an underscore prefix.
// create an index named 'planet'
PUT /planet

// create a type called 'hacker'
PUT /planet/hacker/_mapping
{
  "hacker" : {
    "properties" : {
      "handle" : {"type" : "string"},
      "age" : {"type" : "long"}
    }
  }
}

// create a document
PUT /planet/hacker/1
{"handle" : "jean-michea", "age" : 18}

// retrieve the document
GET /planet/hacker/1

// update the document's age field
POST /planet/hacker/1/_update
{"doc" : {"age" : 19}}

// delete the document
DELETE /planet/hacker/1
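For reference, the GET step above doesn't return the bare document; in the 1.x line it wraps the stored source in metadata, roughly like this (the exact _version depends on how many writes you've issued):

{
  "_index" : "planet",
  "_type" : "hacker",
  "_id" : "1",
  "_version" : 1,
  "found" : true,
  "_source" : {"handle" : "jean-michea", "age" : 18}
}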
Search Data
First, create our schema:
// Delete the document
DELETE /planet/hacker/1

// Delete any existing indexes named planet
DELETE /planet

// Create our index
PUT /planet/
{
  "mappings" : {
    "hacker" : {
      "properties" : {
        "handle" : {"type" : "string"},
        "hobbies" : {"type" : "string", "analyzer" : "snowball"}
      }
    }
  }
}
Then, seed some data using the hacker_planet.eloader dataset.
The data repository can be found at http://github.com/andrewvc/ee-datasets. After cloning the repository, you can load examples into your server by executing the included elastic-loader.jar program, providing the address of your elasticsearch server and the path to the data file. For example, to load the hacker_planet dataset, open a command prompt in the ee-datasets folder and run:
java -jar elastic-loader.jar http://localhost:9200 datasets/hacker_planet.eloader
Finally, we can perform our search:
// Do the search
POST /planet/hacker/_search
{
  "query" : {
    "match" : {
      "hobbies" : "rollerblading"
    }
  }
}
The above code performs a search for those who like rollerblading out of the 3 users we've created in the database.
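For orientation, a search response wraps its matches in a hits object along with relevance scores. A rough sketch of the shape (the score, shard counts, and matching user below are illustrative, not actual output):

{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {"total" : 5, "successful" : 5, "failed" : 0},
  "hits" : {
    "total" : 1,
    "max_score" : 0.15,
    "hits" : [
      {
        "_index" : "planet",
        "_type" : "hacker",
        "_id" : "2",
        "_score" : 0.15,
        "_source" : {"handle" : "jaime", "hobbies" : ["rollerblading"]}
      }
    ]
  }
}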
Searches in elasticsearch are handled by the aptly named search API, which is exposed through the _search endpoint.
- index search: /myidx/_search
- document type search: /myidx/mytype/_search
For example:
// index search
POST /planet/_search
...

// document type search
POST /planet/hacker/_search
...
The skeleton of a complex search:
// Load Dataset: hacker_planet.eloader
POST /planet/_search
{
  "from" : 0,
  "size" : 15,
  "query" : {"match_all" : {}},
  "sort" : {"handle" : "desc"},
  "filter" : {"term" : {"_all" : "coding"}},
  "facets" : {
    "hobbies" : {
      "terms" : {
        "field" : "hobbies"
      }
    }
  }
}
All elasticsearch queries boil down to the task of:
- restricting the result set
- scoring (the default scoring algorithm is implemented in Lucene's TFIDFSimilarity class; see the formula sketch after this list)
- sorting
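For reference, classic Lucene TF-IDF scoring (the default in this era of elasticsearch) combines per-term statistics roughly as follows; this is standard Lucene documentation material, not something specific to the book:

\text{score}(q,d) = \text{coord}(q,d) \cdot \text{queryNorm}(q) \cdot \sum_{t \in q} \text{tf}(t,d) \cdot \text{idf}(t)^2 \cdot \text{boost}(t) \cdot \text{norm}(t,d)

Here tf counts how often term t occurs in document d, idf weighs terms by how rare they are across the index, and norm folds in field-length normalization, so matches in shorter fields score higher.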
Text Analysis
Elasticsearch has a toolbox with which we can slice and dice words so that they can be searched efficiently. Utilizing these tools we can narrow our search space and find common ground between linguistically similar terms.
The Snowball analyzer is great at figuring out what the stems of English words are. The stem of a word is its root.
The process by which documents are analyzed is as follows:
- A document update or create is received via a PUT or POST.
- The field values in the document are each run through an analyzer which converts each value to zero, one, or more indexable tokens.
- The tokenized values are stored in an index, pointing back to the full version of the document.
The easiest way to see analysis in action is with the Analyze API:
GET /_analyze?analyzer=snowball&text=candles%20candle&pretty=true
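Both words should come back stemmed to the same token, which is what lets them match each other at search time. The response looks something like this (offsets reflect positions in the input string):

{
  "tokens" : [
    {"token" : "candl", "start_offset" : 0, "end_offset" : 7, "type" : "<ALPHANUM>", "position" : 1},
    {"token" : "candl", "start_offset" : 8, "end_offset" : 14, "type" : "<ALPHANUM>", "position" : 2}
  ]
}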
An analyzer is really a three-stage pipeline comprised of the following execution steps:
- Character Filtering Turns the input string into a different string
- Tokenization Turns the char-filtered string into an array of tokens
- Token Filtering Post-processes the filtered tokens into a mutated token array
Let's dive in by building a custom analyzer for tokenizing CSV data. Custom analyzers can be stored at the index level, either during or after index creation. Let's:
- create a "recipes" index
- close it
- update the analysis settings
- reopen it (in order to experiment with a custom analyzer)
// Create the index
PUT /recipes

// Close the index for settings update
POST /recipes/_close

// Create the analyzer
PUT /recipes/_settings
{
  "index" : {
    "analysis" : {
      "tokenizer" : {
        "comma" : {"type" : "pattern", "pattern" : ","}
      },
      "analyzer" : {
        "recipe_csv" : {
          "type" : "custom",
          "tokenizer" : "comma",
          "filter" : ["trim", "lowercase"]
        }
      }
    }
  }
}

// Reopen the index
POST /recipes/_open
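Once the index is reopened, we can exercise the new analyzer via the same Analyze API, this time scoped to the index. The sample text below is made-up CSV; given the comma tokenizer plus the trim and lowercase filters, it should yield the tokens "flour", "brown sugar", and "walnuts":

// Analyze a sample string with the custom recipe_csv analyzer
GET /recipes/_analyze?analyzer=recipe_csv&text=Flour,%20Brown%20Sugar,%20Walnuts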
Faceting
Facets are always attached to a query, letting you return aggregate statistics alongside regular query results. We'll create a database of movies and return facets based on the movies' genres alongside standard query results. As usual, we need to load the movie_db.eloader data-set into the elasticsearch server.
Simple movie mapping:
// Load Dataset: movie_db.eloader
GET /movie_db/movie/_mapping?pretty=true
{
  "movie" : {
    "properties" : {
      "actors" : {"type" : "string", "analyzer" : "standard", "position_offset_gap" : 100},
      "genre" : {"type" : "string", "index" : "not_analyzed"},
      "release_year" : {"type" : "integer", "index" : "not_analyzed"},
      "title" : {"type" : "string", "analyzer" : "snowball"},
      "description" : {"type" : "string", "analyzer" : "snowball"}
    }
  }
}
Simple terms faceting:
// Load Dataset: movie_db.eloader
POST /movie_db/_search
{
  "query" : {"match" : {"description" : "hacking"}},
  "facets" : {
    "genre" : {
      "terms" : {"field" : "genre", "size" : 10}
    }
  }
}
This query searches for movies with a description containing "hacking". Alongside the regular results, it will return a genre facet showing which genres those matching movies fall into, and how many matching films are in each genre.
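The facet counts come back in the response next to the usual hits; a sketch of that portion of the response (the genre names and counts here are illustrative):

// Illustrative 'facets' portion of a search response
"facets" : {
  "genre" : {
    "_type" : "terms",
    "missing" : 0,
    "total" : 2,
    "other" : 0,
    "terms" : [
      {"term" : "thriller", "count" : 1},
      {"term" : "crime", "count" : 1}
    ]
  }
}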