So far, ES feels much like a NoSQL database, supporting a free schema.
Readers familiar with Lucene or Solr may wonder: how do you define the fields of a document? How do you configure properties such as store, index, and analyzer?
This is where ES Mapping comes in.
[reference]
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/mapping.html#mapping
Mapping is the process of defining how a document should be mapped to the Search Engine, including its searchable characteristics such as which fields are searchable and if/how they are tokenized. In ElasticSearch, an index may store documents of different "mapping types". ElasticSearch allows one to associate multiple mapping definitions for each mapping type.
Explicit mapping is defined on an index/type level. By default, there isn’t a need to define an explicit mapping, since one is automatically created and registered when a new type or new field is introduced (with no performance overhead) and has sensible defaults. Only when the defaults need to be overridden must a mapping definition be provided.
mapping types
Mapping types are a way to divide the documents in an index into logical groups. Think of it as tables in a database. Though there is separation between types, it’s not a full separation (all end up as a document within the same Lucene index).
Field names with the same name across types are highly recommended to have the same type and same mapping characteristics (analysis settings, for example). There is an effort to allow one to explicitly "choose" which field to use via a type prefix (my_type.my_field), but it is not complete, and there are places where it will never work (like faceting on the field).
In practice though, this restriction is almost never an issue. The field name usually ends up being a good indication to its "typeness" (e.g. "first_name" will always be a string). Note also, that this does not apply to the cross index case.
global settings
The index.mapping.ignore_malformed global setting can be set on the index level to ignore malformed content globally across all mapping types (an example of malformed content is trying to index a text string value as a numeric type).
The index.mapping.coerce global setting can be set on the index level to coerce numeric content globally across all mapping types. The default setting is true; the attempted coercions convert strings containing numbers into numeric types, and numeric values with fractions into integer/short/long values minus the fraction part. When a permitted conversion fails, the value is considered malformed and the ignore_malformed setting dictates what happens next.
Fields
1)_uid
Each document indexed is associated with an id and a type; the internal _uid field is the unique identifier of a document within an index, and is composed of the type and the id (meaning that different types can have the same id and still maintain uniqueness).
The _uid field is automatically used when _type is not indexed to perform type-based filtering, and does not require the _id to be indexed.
【_uid = type + id, so documents of different types can share the same id.】
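The type+id composition described above can be sketched in code (an illustrative sketch only; the "#" separator mirrors how the internal _uid is commonly rendered, and is an assumption here):

```python
def make_uid(doc_type: str, doc_id: str) -> str:
    """Compose an internal _uid from the mapping type and the document id."""
    return f"{doc_type}#{doc_id}"

# Two documents of different types may share the same _id ...
uid_a = make_uid("tweet", "1")
uid_b = make_uid("user", "1")

# ... yet their _uid values remain distinct within the index.
assert uid_a != uid_b
assert uid_a == "tweet#1"
```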
2)_id
Each document indexed is associated with an id and a type. The _id field can be used to index just the id, and possibly also store it. By default it is not indexed and not stored (thus, not created).
Note that even though the _id is not indexed, all the APIs still work (since they work with the _uid field), as does fetching by ids using term, terms or prefix queries/filters (including the specific ids query/filter).
【_id is neither indexed nor stored by default; the corresponding query operations are handled through _uid.】
The _id field can be enabled to be indexed, and possibly stored, using:
{
"tweet":{
"_id":{"index":"not_analyzed","store":false}
}
}
The _id mapping can also be associated with a path that will be used to extract the id from a different location in the source document. For example, given the following mapping:
{
"tweet":{
"_id":{
"path":"post_id"
}
}
}
Will cause 1 to be used as the id for:
{
"message":"You know, for Search",
"post_id":"1"
}
This does require an additional lightweight parsing step while indexing, in order to extract the id to decide which shard the index operation will be executed on.
3)_type
Each document indexed is associated with an id and a type. The type, when indexing, is automatically indexed into a _type field. By default, the _type field is indexed (but not analyzed) and not stored. This means that the _type field can be queried.
【A _type field indexes the type. Does that mean every type-scoped search has to add a _type filter condition?】
The _type field can be stored as well, for example:
{
"tweet":{
"_type":{"store":true}
}
}
The _type field can also be set to not be indexed; all the APIs will still work, except for specific queries (term queries / filters) or faceting done on the _type field.
{
"tweet":{
"_type":{"index":"no"}
}
}
4)_source
The _source field is an automatically generated field that stores the actual JSON that was used as the indexed document. It is not indexed (searchable), just stored. When executing "fetch" requests, like get or search, the _source field is returned by default.
【Think of _source as the stored forward copy of the document; when enabled it takes a non-trivial amount of extra space.】
Though very handy to have around, the source field does incur storage overhead within the index. For this reason, it can be disabled. For example:
{
"tweet":{
"_source":{"enabled":false}
}
}
includes / excludes
These allow specifying paths in the source that will be included / excluded when it is stored, supporting * as a wildcard. For example:
{
"my_type":{
"_source":{
"includes":["path1.*","path2.*"],
"excludes":["pat3.*"]
}
}
}
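The includes/excludes semantics above can be sketched with simple wildcard matching (a rough illustration using Python's fnmatch, not the actual Elasticsearch implementation):

```python
import fnmatch

def filter_source(source, includes, excludes, prefix=""):
    """Recursively keep only paths matched by `includes` and drop paths
    matched by `excludes` (a sketch of _source includes/excludes)."""
    out = {}
    for key, value in source.items():
        path = f"{prefix}{key}"
        if any(fnmatch.fnmatch(path, pat) for pat in excludes):
            continue
        if isinstance(value, dict):
            kept = filter_source(value, includes, excludes, prefix=path + ".")
            if kept:
                out[key] = kept
        elif not includes or any(fnmatch.fnmatch(path, pat) for pat in includes):
            out[key] = value
    return out

# Mirrors the mapping above: keep path1.* and path2.*, drop pat3.*.
doc = {"path1": {"a": 1}, "path2": {"b": 2}, "pat3": {"c": 3}}
filtered = filter_source(doc, includes=["path1.*", "path2.*"], excludes=["pat3.*"])
assert filtered == {"path1": {"a": 1}, "path2": {"b": 2}}
```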
5)_all
The idea of the _all field is that it includes the text of one or more other fields within the document indexed. It comes in very handy for search requests where we want to query against the content of a document without knowing which fields to search on. This comes at the expense of CPU cycles and index size.
The _all field can be completely disabled. Explicit field mappings and object mappings can be excluded from / included in the _all field. By default, it is enabled and all fields are included in it for ease of use.
When disabling the _all field, it is good practice to set index.query.default_field to a different value (for example, if you have a main "message" field in your data, set it to message).
【When the _all field is disabled, the best practice is to specify a default search field via index.query.default_field.】
One of the nice features of the _all field is that it takes specific fields' boost levels into account, meaning that if a title field is boosted more than content, the title (part) in the _all field will carry more weight than the content (part).
Here is a sample mapping:
{
"person":{
"_all":{"enabled":true},
"properties":{
"name":{
"type":"object",
"dynamic":false,
"properties":{
"first":{"type":"string","store":true,"include_in_all":false},
"last":{"type":"string","index":"not_analyzed"}
}
},
"address":{
"type":"object",
"include_in_all":false,
"properties":{
"first":{
"properties":{
"location":{"type":"string","store":true,"index_name":"firstLocation"}
}
},
"last":{
"properties":{
"location":{"type":"string"}
}
}
}
},
"simple1":{"type":"long","include_in_all":true},
"simple2":{"type":"long","include_in_all":false}
}
}
}
The _all field allows store, term_vector and analyzer (with specific index_analyzer and search_analyzer) to be set.
highlighting
For any field to allow highlighting it has to be either stored or part of the _source field. By default the _all field qualifies for neither, so highlighting it does not yield any data.
Although it is possible to store the _all field, it is basically an aggregation of all fields, which means more data will be stored, and highlighting it might produce strange results.
6)_analyzer
The _analyzer mapping allows using a document field's value as the name of the analyzer that will be used to index the document. That analyzer applies to any field that does not explicitly define an analyzer or index_analyzer when indexing.
Here is a simple mapping:
{
"type1":{
"_analyzer":{
"path":"my_field"
}
}
}
The above will use the value of my_field to look up an analyzer registered under that name. For example, indexing the following doc:
{
"my_field":"whitespace"
}
Will cause the whitespace analyzer to be used as the index analyzer for all fields without an explicit analyzer setting.
The default path value is _analyzer, so the analyzer can be chosen per document by setting an _analyzer field in it. If a custom JSON field name is needed, an explicit mapping with a different path should be set.
By default, the _analyzer field is indexed; it can be disabled by setting index to no in the mapping.
7)_boost
Boosting is the process of enhancing the relevancy of a document or field. Field-level mapping allows defining an explicit boost level on a specific field. The boost field mapping (applied on the root object) allows defining a boost field whose content will control the boost level of the document. For example, consider the following mapping:
{
"tweet":{
"_boost":{"name":"my_boost","null_value":1.0}
}
}
The above mapping defines a boost field named my_boost. If the my_boost field exists within the JSON document indexed, its value will control the boost level of the document. For example, the following JSON document will be indexed with a boost value of 2.2:
{
"my_boost":2.2,
"message":"This is a tweet!"
}
Support for document boosting via the _boost field has been removed from Lucene and is deprecated in Elasticsearch as of v1.0.0.RC1. The implementation in Lucene produced unpredictable results when used with multiple fields or multi-value fields.
Instead, the function_score query can be used to achieve the desired functionality, boosting each document by the value of any field in the document:
{
"query":{
"function_score":{
"query":{
"match":{
"title":"your main query"
}
},
"functions":[{
"script_score":{
"script":"doc['my_boost_field'].value"
}
}],
"score_mode":"multiply"
}
}
}
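The effect of that function_score query can be sketched in a few lines (a toy model of the scoring step only, with hypothetical document ids and scores):

```python
def function_score(query_scores, boost_field_values, score_mode="multiply"):
    """Toy model of the function_score query above: each document's match
    score is combined with the value of its boost field (my_boost_field),
    using multiply as the score_mode."""
    assert score_mode == "multiply"
    return {
        doc_id: score * boost_field_values.get(doc_id, 1.0)
        for doc_id, score in query_scores.items()
    }

# Hypothetical match scores and per-document my_boost_field values.
scores = function_score({"doc1": 0.5, "doc2": 0.5}, {"doc1": 2.2})
assert scores["doc1"] == 0.5 * 2.2   # boosted by its field value
assert scores["doc2"] == 0.5         # no boost field: score unchanged
```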
8)_parent
The parent field mapping is defined on a child mapping and points to the parent type this child relates to. For example, given a blog type and a blog_tag child document type, the mapping for blog_tag should be:
{
"blog_tag":{
"_parent":{
"type":"blog"
}
}
}
The mapping is automatically stored and indexed (meaning it can be searched on using the _parent field notation).
9)_routing
The routing field allows controlling the _routing aspect when indexing data, for when explicit routing control is required.
store / index
The first thing the _routing mapping does is to store the routing value provided (store set to true) and index it (index set to not_analyzed). The routing is stored by default so that reindexing data remains possible when the routing value is completely external and not part of the docs.
required
Another aspect of the _routing mapping is the ability to define it as required by setting required to true. This is very important to set when using routing features, as it allows different APIs to make use of it. For example, an index operation will be rejected if no routing value has been provided (or derived from the doc), and a delete operation will be broadcast to all shards if no routing value is provided while _routing is required.
path
The routing value can be provided as an external value when indexing (and is still stored as part of the document, in much the same way _source is stored). But it can also be automatically extracted from the indexed doc based on a path. For example, given the following mapping:
{
"comment":{
"_routing":{
"required":true,
"path":"blog.post_id"
}
}
}
Will cause the following doc to be routed based on the 111222 value:
{
"text":"the comment text",
"blog":{
"post_id":"111222"
}
}
Note that using path without an explicitly provided routing value requires an additional (though quite fast) parsing phase.
id uniqueness
When indexing documents with a custom _routing, the uniqueness of the _id is not guaranteed across all the shards the index is composed of. In fact, documents with the same _id might end up in different shards if indexed with different _routing values.
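The interaction above can be sketched as follows (illustrative only: the hash function here is a stand-in, Elasticsearch uses its own hashing of the routing value, but the hash-modulo-shards scheme is the same idea):

```python
import zlib

def shard_for(routing_value: str, number_of_shards: int = 5) -> int:
    """Pick the shard for a document: a hash of the routing value modulo
    the shard count. zlib.crc32 stands in for Elasticsearch's own hash."""
    return zlib.crc32(routing_value.encode("utf-8")) % number_of_shards

# By default the _id serves as the routing value. With a custom _routing,
# two documents sharing an _id can hash to different shards, so _id
# uniqueness is no longer guaranteed across the whole index.
s1 = shard_for("routing-a")
s2 = shard_for("routing-b")
assert 0 <= s1 < 5 and 0 <= s2 < 5
assert shard_for("routing-a") == s1  # routing is deterministic
```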
10)_index
The _index field provides the ability to store in a document the index it belongs to. By default it is disabled; in order to enable it, the following mapping should be defined:
{
"tweet":{
"_index":{"enabled":true}
}
}
11)_size
The _size field allows automatically indexing the size of the original _source. By default it is disabled. In order to enable it, set the mapping to:
【_size records the byte size of the _source field.】
{
"tweet":{
"_size":{"enabled":true}
}
}
In order to also store it, use:
{
"tweet":{
"_size":{"enabled":true,"store":true}
}
}
12)_timestamp
The _timestamp field allows automatically indexing the timestamp of a document. It can be provided externally via the index request or in the _source; if it is not provided externally, it will be automatically set to the date the document was processed by the indexing chain.
【Timestamp: if none is provided, one is generated automatically.】
enabled
By default it is disabled. In order to enable it, the following mapping should be defined:
{
"tweet":{
"_timestamp":{"enabled":true}
}
}
store / index
By default the _timestamp field has store set to false and index set to not_analyzed. It can be queried as a standard date field.
path
The _timestamp value can be provided as an external value when indexing, but it can also be automatically extracted from the document to index based on a path. For example, given the following mapping:
{
"tweet":{
"_timestamp":{
"enabled":true,
"path":"post_date"
}
}
}
Will cause 2009-11-15T14:12:12 to be used as the timestamp value for:
{
"message":"You know, for Search",
"post_date":"2009-11-15T14:12:12"
}
Note that using path without an explicitly provided timestamp value requires an additional (though quite fast) parsing phase.
format
You can define the date format used to parse the provided timestamp value. For example:
{
"tweet":{
"_timestamp":{
"enabled":true,
"path":"post_date",
"format":"YYYY-MM-dd"
}
}
}
Note, the default format is dateOptionalTime. The timestamp value will first be parsed as a number; only if that fails will the format be tried.
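The parse order described above (number first, then format) can be sketched like this (a sketch; Elasticsearch uses Joda-style patterns such as YYYY-MM-dd, approximated here with strftime codes):

```python
from datetime import datetime, timezone

def parse_timestamp(value, fmt="%Y-%m-%d"):
    """Mimic the documented order: try the value as epoch milliseconds
    first, then fall back to the configured date format."""
    try:
        millis = int(value)
    except (TypeError, ValueError):
        return datetime.strptime(value, fmt).replace(tzinfo=timezone.utc)
    return datetime.fromtimestamp(millis / 1000.0, tz=timezone.utc)

assert parse_timestamp("0") == datetime(1970, 1, 1, tzinfo=timezone.utc)
assert parse_timestamp("2009-11-15") == datetime(2009, 11, 15, tzinfo=timezone.utc)
```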
13)_ttl
A lot of documents naturally come with an expiration date. Documents can therefore have a _ttl (time to live), which causes expired documents to be deleted automatically.
【ttl = time to live; it can be used to set an expiration time for documents.】
enabled
By default it is disabled, in order to enable it, the following mapping should be defined:
{
"tweet":{
"_ttl":{"enabled":true}
}
}
store / index
By default the _ttl field has store set to true and index set to not_analyzed. Note that the index property has to be set to not_analyzed in order for the purge process to work.
default
You can provide a per-index/type default _ttl value as follows:
{
"tweet":{
"_ttl":{"enabled":true,"default":"1d"}
}
}
In this case, if you don’t provide a _ttl value in your query or in the _source, all tweets will have a _ttl of one day.
If you do not specify a time unit such as d (days), m (minutes), h (hours), ms (milliseconds) or w (weeks), milliseconds is used as the default unit.
If no default is set and no _ttl value is given, then the document has an infinite _ttl and will not expire.
You can dynamically update the default value using the put mapping API. It won’t change the _ttl of already indexed documents, but will be used for future documents.
note on documents expiration
Expired documents are deleted automatically on a regular basis. You can dynamically set indices.ttl.interval to fit your needs; the default value is 60s.
The deletion orders are processed in bulk. You can set indices.ttl.bulk_size to fit your needs; the default value is 10000.
Note that the expiration procedure handles versioning properly, so if a document is updated between the collection of documents to expire and the delete order, the document won’t be deleted.
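The _ttl unit rules above can be sketched as a small parser (illustrative; the unit values come from the list above, and a bare number defaults to milliseconds):

```python
import re

_UNIT_MS = {"ms": 1, "m": 60_000, "h": 3_600_000,
            "d": 86_400_000, "w": 604_800_000}

def ttl_to_millis(value: str) -> int:
    """Parse a _ttl value such as "1d" into milliseconds; a bare number
    is taken as milliseconds, matching the documented default unit."""
    match = re.fullmatch(r"(\d+)(ms|w|d|h|m)?", value)
    if not match:
        raise ValueError(f"bad ttl: {value!r}")
    amount, unit = match.groups()
    return int(amount) * _UNIT_MS[unit or "ms"]

assert ttl_to_millis("1d") == 86_400_000   # the "default":"1d" example above
assert ttl_to_millis("500") == 500         # no unit: milliseconds
assert ttl_to_millis("2h") == 7_200_000
```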
Types
1)core types
Each JSON field can be mapped to a specific core type. JSON itself already provides us with some typing, with its support for string, integer/long, float/double, boolean, and null.
The following sample tweet JSON document will be used to explain the core types:
{
"tweet":{
"user":"kimchy",
"message":"This is a tweet!",
"postDate":"2009-11-15T14:12:12",
"priority":4,
"rank":12.3
}
}
Explicit mapping for the above JSON tweet can be:
{
"tweet":{
"properties":{
"user":{"type":"string","index":"not_analyzed"},
"message":{"type":"string","null_value":"na"},
"postDate":{"type":"date"},
"priority":{"type":"integer"},
"rank":{"type":"float"}
}
}
}
string
The text based string type is the most basic type, and contains one or more characters. An example mapping can be:
{
"tweet":{
"properties":{
"message":{
"type":"string",
"store":true,
"index":"analyzed",
"null_value":"na"
},
"user":{
"type":"string",
"index":"not_analyzed",
"norms":{
"enabled":false
}
}
}
}
}
The above mapping defines a string message property/field within the tweet type. The field is stored in the index (so it can later be retrieved using selective loading when searching), and it gets analyzed (broken down into searchable terms). If the message has a null value, the value na will be stored instead. There is also a string user field which is indexed as-is (not broken down into tokens) and has norms disabled (so that matching this field is a binary decision, no match is better than another one).
The following table lists all the attributes that can be used with the string type:
Attribute | Description |
---|---|
index_name | The name of the field that will be stored in the index. Defaults to the property/field name. |
store | Set to true to store the actual field in the index, false to not store it. Defaults to false (note, the JSON document itself is already stored, so the field can still be retrieved from it). |
index | Set to analyzed for the field to be indexed and searchable after being broken down into tokens via an analyzer; not_analyzed to index the value as-is while keeping it searchable; no for no indexing at all. Defaults to analyzed. |
doc_values | Set to true to store field values in a column-stride fashion. |
term_vector | Possible values are no, yes, with_offsets, with_positions, with_positions_offsets. Defaults to no. |
boost | The boost value. Defaults to 1.0. |
null_value | When there is a (JSON) null value for the field, use the null_value as the field value. Defaults to not adding the field at all. |
norms.enabled | Boolean value if norms should be enabled or not. Defaults to true for analyzed fields and false for not_analyzed fields. |
norms.loading | Describes how norms should be loaded, possible values are eager and lazy (default). |
index_options | Allows to set the indexing options, possible values are docs (only doc numbers are indexed), freqs (doc numbers and term frequencies), and positions (doc numbers, term frequencies and positions). Defaults to positions for analyzed fields and docs for not_analyzed fields. |
analyzer | The analyzer used to analyze the text contents when analyzed during indexing and when searching using a query string. Defaults to the globally configured analyzer. |
index_analyzer | The analyzer used to analyze the text contents when analyzed during indexing. |
search_analyzer | The analyzer used to analyze the field when part of a query string. Can be updated on an existing field. |
include_in_all | Should the field be included in the _all field (if enabled). Defaults to true, unless index is set to no. |
ignore_above | The analyzer will ignore strings larger than this size. Useful for generic not_analyzed fields that should ignore long text. |
position_offset_gap | Position increment gap between field instances with the same field name. Defaults to 0. |
The string type also supports custom indexing parameters associated with the indexed value. For example:
{
"message":{
"_value": "boosted value",
"_boost": 2.0
}
}
The mapping is required to disambiguate the meaning of the document; otherwise the structure would interpret "message" as a value of type "object". The key _value (or value) in the inner document specifies the real string content that should eventually be indexed. The _boost (or boost) key specifies the per-field document boost (here 2.0).
norms
Norms store various normalization factors that are later used (at query time) in order to compute the score of a document relatively to a query.
Although useful for scoring, norms also require quite a lot of memory (typically in the order of one byte per document per field in your index, even for documents that don’t have this specific field). As a consequence, if you don’t need scoring on a specific field, it is highly recommended to disable norms on it. In particular, this is the case for fields that are used solely for filtering or aggregations.
Coming in 1.2.0.
In case you would like to disable norms after the fact, it is possible to do so using the PUT mapping API. Note, however, that norms won’t be removed instantly, but gradually, as your index receives new insertions or updates and segments get merged. Any score computation on a field whose norms were removed might return inconsistent results, since some documents will no longer have norms while other documents might still have them.
number
A number-based type supporting float, double, byte, short, integer, and long. It uses specific constructs within Lucene in order to support numeric values. The number types have the same ranges as the corresponding Java types. An example mapping can be:
{
"tweet":{
"properties":{
"rank":{
"type":"float",
"null_value":1.0
}
}
}
}
The following table lists all the attributes that can be used with a number type:
Attribute | Description |
---|---|
type | The type of the number. Can be float, double, integer, long, short, byte. Required. |
index_name | The name of the field that will be stored in the index. Defaults to the property/field name. |
store | Set to true to store the actual field in the index, false to not store it. Defaults to false. |
index | Set to no if the value should not be indexed. Defaults to not_analyzed. |
doc_values | Set to true to store field values in a column-stride fashion. |
precision_step | The precision step (number of terms generated for each number value). Defaults to 4. |
boost | The boost value. Defaults to 1.0. |
null_value | When there is a (JSON) null value for the field, use the null_value as the field value. Defaults to not adding the field at all. |
include_in_all | Should the field be included in the _all field (if enabled). Defaults to true, unless index is set to no. |
ignore_malformed | Ignore a malformed number. Defaults to false. |
coerce | Try to convert strings to numbers and truncate fractions for integers. Defaults to true. |
token count
The token_count type maps to the JSON string type, but indexes and stores the number of tokens in the string rather than the string itself. For example:
{
"tweet":{
"properties":{
"name":{
"type":"string",
"fields":{
"word_count":{
"type":"token_count",
"store":"yes",
"analyzer":"standard"
}
}
}
}
}
}
All the configuration that can be specified for a number can be specified for a token_count. The only extra configuration is the required analyzer field, which specifies which analyzer to use to break the string into tokens. For best performance, use an analyzer with no token filters.
Technically, the token_count type sums position increments rather than counting tokens. This means that even if the analyzer filters out stop words, they are included in the count.
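The position-increment behaviour described above can be sketched with a toy analyzer (not the Lucene implementation; the stop-word list is made up for the example):

```python
STOPWORDS = {"the", "a", "an"}

def analyze(text):
    """Toy whitespace analyzer with a stop-word filter that preserves
    position increments: dropped stop words still advance the position."""
    terms, position = [], 0
    for word in text.lower().split():
        position += 1                 # increment even for dropped stop words
        if word not in STOPWORDS:
            terms.append((position, word))
    return terms, position            # emitted terms, summed increments

terms, count = analyze("the quick brown fox")
assert [t for _, t in terms] == ["quick", "brown", "fox"]  # "the" filtered out
assert count == 4                                          # but still counted
```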
date
The date type is a special type which maps to the JSON string type. It follows a specific format that can be explicitly set. All dates are UTC. Internally, a date maps to a long number type, with an added parsing stage from string to long and from long to string. An example mapping:
{
"tweet":{
"properties":{
"postDate":{
"type":"date",
"format":"YYYY-MM-dd"
}
}
}
}
The date type will also accept a long number representing UTC milliseconds since the epoch, regardless of the format it can handle.
The following table lists all the attributes that can be used with a date type:
Attribute | Description |
---|---|
index_name | The name of the field that will be stored in the index. Defaults to the property/field name. |
format | The date format. Defaults to dateOptionalTime. |
store | Set to true to store the actual field in the index, false to not store it. Defaults to false. |
index | Set to no if the value should not be indexed. Defaults to not_analyzed. |
doc_values | Set to true to store field values in a column-stride fashion. |
precision_step | The precision step (number of terms generated for each number value). Defaults to 4. |
boost | The boost value. Defaults to 1.0. |
null_value | When there is a (JSON) null value for the field, use the null_value as the field value. Defaults to not adding the field at all. |
include_in_all | Should the field be included in the _all field (if enabled). Defaults to true, unless index is set to no. |
ignore_malformed | Ignore a malformed date. Defaults to false. |
boolean
The boolean type maps to the JSON boolean type. It ends up storing either T or F within the index, with automatic translation to true and false respectively.
{
"tweet":{
"properties":{
"hes_my_special_tweet":{
"type":"boolean"
}
}
}
}
The boolean type also supports passing the value as a number or a string (in this case 0, an empty string, F, false, off and no are false; all other values are true).
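The coercion rule above can be sketched directly (a sketch of the documented falsy set, not the actual parsing code):

```python
# The documented falsy values: 0, empty string, F, false, off, no.
FALSY = {0, "", "F", "false", "off", "no"}

def to_boolean(value):
    """Coerce a number or string to a boolean following the documented
    rule: values in the falsy set are false, everything else is true."""
    return value not in FALSY

assert to_boolean("no") is False
assert to_boolean(0) is False
assert to_boolean("T") is True
assert to_boolean(1) is True
```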
The following table lists all the attributes that can be used with the boolean type:
Attribute | Description |
---|---|
index_name | The name of the field that will be stored in the index. Defaults to the property/field name. |
store | Set to true to store the actual field in the index, false to not store it. Defaults to false. |
index | Set to no if the value should not be indexed. Defaults to not_analyzed. |
boost | The boost value. Defaults to 1.0. |
null_value | When there is a (JSON) null value for the field, use the null_value as the field value. Defaults to not adding the field at all. |
binary
The binary type is a base64 representation of binary data that can be stored in the index. The field is not stored by default and not indexed at all.
{
"tweet":{
"properties":{
"image":{
"type":"binary"
}
}
}
}
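Since the binary type expects base64 text, preparing a value for indexing is a plain base64 round trip (sketch; the payload bytes are made up):

```python
import base64

# Encode raw bytes for the JSON document, decode what comes back.
payload = b"\x89PNG\r\n\x1a\n...fake image bytes..."
encoded = base64.b64encode(payload).decode("ascii")   # goes into the doc
decoded = base64.b64decode(encoded)                   # recovered on retrieval

assert decoded == payload
assert encoded.isascii()      # safe to embed in the JSON source
```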
The following table lists all the attributes that can be used with the binary type:
Attribute | Description |
---|---|
index_name | The name of the field that will be stored in the index. Defaults to the property/field name. |
store | Set to true to store the actual field in the index, false to not store it. Defaults to false. |
fielddata filters
It is possible to control which field values are loaded into memory, which is particularly useful for faceting on string fields, using fielddata filters, which are explained in detail in the Fielddata section.
Fielddata filters can exclude terms which do not match a regex, or which don’t fall between a min and max frequency range:
{
"tweet":{
"type":"string",
"analyzer":"whitespace",
"fielddata":{
"filter":{
"regex":{
"pattern":"^#.*"
},
"frequency":{
"min":0.001,
"max":0.1,
"min_segment_size":500
}
}
}
}
}
These filters can be updated on an existing field mapping and will take effect the next time the fielddata for a segment is loaded. Use the Clear Cache API to reload the fielddata using the new filters.
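The frequency filter above can be sketched as a load-time filter over a segment's terms (an illustration of the semantics with made-up term statistics; min_segment_size skips filtering for small segments):

```python
import re

def load_fielddata(term_doc_freqs, total_docs, pattern=None,
                   min_freq=0.0, max_freq=1.0, min_segment_size=0):
    """Keep only terms whose document frequency (as a fraction of the
    segment's docs) falls within [min, max] and, optionally, which match
    a regex; segments below min_segment_size are loaded unfiltered."""
    if total_docs < min_segment_size:
        return dict(term_doc_freqs)      # segment too small: no filtering
    kept = {}
    for term, doc_freq in term_doc_freqs.items():
        if pattern and not re.match(pattern, term):
            continue
        if min_freq <= doc_freq / total_docs <= max_freq:
            kept[term] = doc_freq
    return kept

# Hypothetical term -> document-frequency stats for one segment.
freqs = {"#es": 50, "#java": 5, "the": 990, "word": 1}
kept = load_fielddata(freqs, total_docs=1000, pattern=r"^#.*",
                      min_freq=0.001, max_freq=0.1)
assert kept == {"#es": 50, "#java": 5}   # hashtags within the freq band
```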
postings format
Postings formats define how fields are written into the index and how fields are represented in memory. The postings format can be defined per field via the postings_format option. Postings formats are configurable; Elasticsearch has several built-in formats:
direct - A postings format that uses disk-based storage but loads its terms and postings directly into memory. Note this postings format is very memory intensive and has a limitation that doesn’t allow segments to grow beyond 2.1GB; see Lucene’s DirectPostingsFormat for details.
memory - A postings format that stores its entire terms, postings, positions and payloads in a finite state transducer. This format should only be used for primary keys or with fields where each term is contained in a very low number of documents.
pulsing - A postings format that inlines the posting lists of very low-frequency terms in the term dictionary. This is useful to improve lookup performance for low-frequency terms.
bloom_default - A postings format that uses a bloom filter to improve term lookup performance. This is useful for primary keys or fields that are used as a delete key.
bloom_pulsing - A postings format that combines the advantages of bloom and pulsing to further improve lookup performance.
default - The default Elasticsearch postings format, offering the best general-purpose performance. This format is used if no postings format is specified in the field mapping.
postings format example
On all field types it is possible to configure a postings_format attribute:
{
"person":{
"properties":{
"second_person_id":{"type":"string","postings_format":"pulsing"}
}
}
}
On top of using the built-in postings formats, it is possible to define custom postings formats. See the codec module for more information.
doc values format
Doc values formats define how fields are written into column-stride storage in the index for the purpose of sorting or faceting. Fields that have doc values enabled get special field data instances, which are not uninverted from the inverted index but read directly from disk. This makes _refresh faster and ultimately allows field data to be stored on disk, depending on the configured doc values format.
Doc values formats are configurable. Elasticsearch has several built-in formats:
memory - A doc values format which stores data in memory. Compared to the default field data implementations, using doc values with this format will have similar performance, but will be faster to load, making _refresh less time-consuming.
disk - A doc values format which stores all data on disk, requiring almost no memory from the JVM, at the cost of a slight performance degradation.
default - The default Elasticsearch doc values format, offering good performance with low memory usage. This format is used if no format is specified in the field mapping.
doc values format example
On all field types, it is possible to configure a doc_values_format attribute:
{
"product":{
"properties":{
"price":{"type":"integer","doc_values_format":"memory"}
}
}
}
On top of using the built-in doc values formats, it is possible to define custom doc values formats. See the codec module for more information.
similarity
Elasticsearch allows you to configure a similarity (scoring algorithm) per field. The similarity setting provides a simple way of choosing a similarity algorithm other than the default TF/IDF, such as BM25.
You can configure similarities via the similarity module.
configuring similarity per field
Defining the similarity for a field is done via the similarity mapping property, as this example shows:
{
"book":{
"properties":{
"title":{"type":"string","similarity":"BM25"}
}
}
}
The following Similarities are configured out-of-box:
default - The default TF/IDF algorithm used by Elasticsearch and Lucene in previous versions.
BM25 - The BM25 algorithm. See Okapi_BM25 for more details.
copy to field
Added in 1.0.0.RC2.
Adding the copy_to parameter to any field mapping will cause all values of this field to be copied to the fields specified in the parameter. In the following example, all values from the fields title and abstract will be copied to the field meta_data.
{
"book":{
"properties":{
"title":{"type":"string","copy_to":"meta_data"},
"abstract":{"type":"string","copy_to":"meta_data"},
"meta_data":{"type":"string"}
}
}
}
Multiple target fields are also supported:
{
"book":{
"properties":{
"title":{"type":"string","copy_to":["meta_data","article_info"]}
}
}
}
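The copy_to behaviour above can be sketched at index time (a rough model: each field holds a list of values, and copied values are appended before analysis):

```python
def apply_copy_to(doc, copy_rules):
    """Sketch of copy_to at index time: values of source fields are
    appended to the target field's list of values before analysis."""
    indexed = {field: [value] for field, value in doc.items()}
    for source_field, targets in copy_rules.items():
        if source_field not in doc:
            continue
        # copy_to accepts a single target or a list of targets
        for target in ([targets] if isinstance(targets, str) else targets):
            indexed.setdefault(target, []).append(doc[source_field])
    return indexed

doc = {"title": "On Search", "abstract": "All about search."}
indexed = apply_copy_to(doc, {"title": "meta_data", "abstract": "meta_data"})
assert indexed["meta_data"] == ["On Search", "All about search."]
```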
multi fields
Added in 1.0.0.RC1.
The fields option allows mapping several core-type fields to a single JSON source field. This can be useful if a single field needs to be used in different ways, for example both for free-text search and for sorting.
{
"tweet":{
"properties":{
"name":{
"type":"string",
"index":"analyzed",
"fields":{
"raw":{"type":"string","index":"not_analyzed"}
}
}
}
}
}
In the above example the field name gets processed twice. The first time, it is processed as an analyzed string; this version is accessible under the field name name, is the main field, and is in fact just like any other field. The second time, it is processed as a not-analyzed string and is accessible under the name name.raw.
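The name / name.raw split above can be sketched with a toy pair of analyzers (illustrative: lowercase-and-split stands in for the analyzed form):

```python
def index_multi_field(value):
    """Sketch of the multi-field example above: the same source value is
    processed twice, once analyzed (name) and once verbatim (name.raw)."""
    return {
        "name": value.lower().split(),   # analyzed: lowercased terms
        "name.raw": [value],             # not_analyzed: single exact term
    }

fields = index_multi_field("Shay Banon")
assert fields["name"] == ["shay", "banon"]     # good for free-text search
assert fields["name.raw"] == ["Shay Banon"]    # good for sorting/exact match
```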
include in all
The include_in_all setting is ignored on any field that is defined in the fields options. Setting include_in_all only makes sense on the main field, since the raw field value is copied to the _all field; the tokens aren’t copied.
updating a field
In essence, a field can’t be updated. However, multi fields can be added to existing fields. This allows, for example, having a different index_analyzer configuration in addition to the one already configured on the main field and other multi fields.
Note also that a new multi field will only be applied to documents added after the multi field was added; the new multi field does not exist in previously indexed documents.
Another important note is that new multi fields are merged into the list of existing multi fields, so when adding new multi fields, previously added ones don’t need to be specified again.
accessing fields
deprecated in 1.0.0.
Use copy_to instead.
The multi fields defined in fields are prefixed with the name of the main field and can be accessed by their full path using the navigation notation (name.raw) or the typed navigation notation (tweet.name.raw). The path option controls how fields are accessed: if path is set to full, the full path of the main field is prefixed; if path is set to just_name, the actual multi field name is used without any prefix. The default value for path is full.
The just_name setting, among other things, allows indexing the content of multiple fields under the same name. In the example below, the content of both fields first_name and last_name can be accessed using any_name or tweet.any_name.
{
"tweet":{
"properties":{
"first_name":{
"type":"string",
"index":"analyzed",
"path":"just_name",
"fields":{
"any_name":{"type":"string","index":"analyzed"}
}
},
"last_name":{
"type":"string",
"index":"analyzed",
"path":"just_name",
"fields":{
"any_name":{"type":"string","index":"analyzed"}
}
}
}
}
}
2)array type
JSON documents allow defining an array (list) of fields or objects. Mapping array types could not be simpler, since arrays are detected automatically and can be mapped using either Core Types or Object Type mappings. For example, the following JSON defines several arrays:
{
"tweet":{
"message":"some arrays in this tweet...",
"tags":["elasticsearch","wow"],
"lists":[
{
"name":"prog_list",
"description":"programming list"
},
{
"name":"cool_list",
"description":"cool stuff list"
}
]
}
}
The above JSON has the tags property defining a list of a simple string type, and the lists property is an object type array. Here is a sample explicit mapping:
{
"tweet":{
"properties":{
"message":{"type":"string"},
"tags":{"type":"string","index_name":"tag"},
"lists":{
"properties":{
"name":{"type":"string"},
"description":{"type":"string"}
}
}
}
}
}
That array types are automatically supported is demonstrated by the fact that the following JSON document is also perfectly fine:
{
"tweet":{
"message":"some arrays in this tweet...",
"tags":"elasticsearch",
"lists":{
"name":"prog_list",
"description":"programming list"
}
}
}
Note also that, because we used the index_name to use the non-plural form (tag instead of tags), we can refer to the field using the index_name as well. For example, we can execute a query using tweet.tags:wow or tweet.tag:wow. We could, of course, name the field tag and skip the index_name altogether.
3)object type
JSON documents are hierarchical in nature, allowing them to define inner "objects" within the actual JSON. Elasticsearch completely understands the nature of these inner objects and can map them easily, providing query support for their inner fields. Because each document can have objects with different fields each time, objects mapped this way are known as "dynamic". Dynamic mapping is enabled by default. Let’s take the following JSON as an example:
{
"tweet":{
"person":{
"name":{
"first_name":"Shay",
"last_name":"Banon"
},
"sid":"12345"
},
"message":"This is a tweet!"
}
}
The above shows an example where a tweet includes the actual person details. A person is an object, with a sid, and a name object which has first_name and last_name. It's important to note that tweet is also an object, although it is a special root object type which allows for additional mapping definitions.
The following is an example of explicit mapping for the above JSON:
{
"tweet":{
"properties":{
"person":{
"type":"object",
"properties":{
"name":{
"properties":{
"first_name":{"type":"string"},
"last_name":{"type":"string"}
}
},
"sid":{"type":"string","index":"not_analyzed"}
}
},
"message":{"type":"string"}
}
}
}
In order to mark a mapping of type object, set the type to object. This is an optional step, since if there are properties defined for it, it will automatically be identified as an object mapping.
properties
An object mapping can optionally define one or more properties using the properties tag for a field. Each property can be either another object, or one of the core_types.
dynamic
One of the most important features of Elasticsearch is its ability to be schema-less. This means that, in our example above, the person object can be indexed later with a new property (age, for example) and it will automatically be added to the mapping definitions. The same goes for the tweet root object.
This feature is turned on by default thanks to the dynamic nature of each mapped object. Each mapped object is automatically dynamic, though it can be explicitly turned off:
{
"tweet":{
"properties":{
"person":{
"type":"object",
"properties":{
"name":{
"dynamic":false,
"properties":{
"first_name":{"type":"string"},
"last_name":{"type":"string"}
}
},
"sid":{"type":"string","index":"not_analyzed"}
}
},
"message":{"type":"string"}
}
}
}
In the above example, the name object mapping is not dynamic, meaning that if, in the future, we try to index JSON with a middle_name within the name object, it will get discarded and not added.
There is no performance overhead when an object is dynamic; the ability to turn it off is provided as a safety mechanism so "malformed" objects won't, by mistake, index data that we do not wish to be indexed.
If a dynamic object contains yet another inner object, it will be automatically added to the index and mapped as well.
When processing dynamic new fields, their type is automatically derived. For example, if a value is a number, it will automatically be treated as a number core_type. Dynamic fields default to their default attributes; for example, they are not stored and they are always indexed.
Date fields are special since they are represented as a string. Date fields are detected if they can be parsed as a date when they are first introduced into the system. The set of date formats that are tested against can be configured using the dynamic_date_formats on the root object, which is explained later.
Note, once a field has been added, its type can not change. For example, if we added age and its value is a number, then it can’t be treated as a string.
The dynamic parameter can also be set to strict, meaning that not only will new fields not be introduced into the mapping, but also that parsing (indexing) docs with such new fields will fail.
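The three dynamic modes can be sketched with a small Python simulation (a sketch of the behavior described above, not of Elasticsearch internals; handle_new_field is a hypothetical helper):

```python
def handle_new_field(mapping, field, dynamic=True):
    # dynamic=True   -> the field is added to the mapping
    # dynamic=False  -> the field is silently discarded
    # dynamic="strict" (any other value here) -> indexing the doc fails
    if field in mapping:
        return "indexed"
    if dynamic is True:
        mapping[field] = {"type": "string"}  # the real type would be derived
        return "indexed"
    if dynamic is False:
        return "discarded"
    raise ValueError("strict mapping: dynamic field [%s] not allowed" % field)

name_props = {"first_name": {}, "last_name": {}}
print(handle_new_field(name_props, "middle_name", dynamic=False))  # discarded
```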
enabled
The enabled flag allows disabling the parsing and indexing of a named object completely. This is handy when a portion of the JSON document contains arbitrary JSON which should not be indexed, nor added to the mapping. For example:
{
"tweet":{
"properties":{
"person":{
"type":"object",
"properties":{
"name":{
"type":"object",
"enabled":false
},
"sid":{"type":"string","index":"not_analyzed"}
}
},
"message":{"type":"string"}
}
}
}
In the above, name and its content will not be indexed at all.
include_in_all
include_in_all can be set on the object type level. When set, it propagates down to all the inner mappings defined within the object that do not explicitly set it.
path
deprecated in 1.0.0. Use copy_to instead.
In the core_types section, a field can have an index_name associated with it in order to control the name of the field that will be stored within the index. When that field exists within an object (or objects) that is not the root object, the name of the field in the index can either include the full "path" to the field with its index_name, or just the index_name. For example (under a mapping of type person, with the tweet type removed for clarity):
{
"person":{
"properties":{
"name1":{
"type":"object",
"path":"just_name",
"properties":{
"first1":{"type":"string"},
"last1":{"type":"string","index_name":"i_last_1"}
}
},
"name2":{
"type":"object",
"path":"full",
"properties":{
"first2":{"type":"string"},
"last2":{"type":"string","index_name":"i_last_2"}
}
}
}
}
}
In the above example, the name1 and name2 objects within the person object have different combinations of path and index_name. The document fields that will be stored in the index as a result are:
JSON Name | Document Field Name |
---|---|
name1/first1 | first1 |
name1/last1 | i_last_1 |
name2/first2 | name2.first2 |
name2/last2 | name2.i_last_2 |
Note, when querying or using a field name in any of the APIs provided (search, query, selective loading, …), there is an automatic detection from the logical full path into the index_name and vice versa. For example, even though name1/last1 defines that it is stored with just_name and a different index_name, it can either be referred to using name1.last1 (its logical name), or its actual indexed name of i_last_1.
Moreover, where applicable (for example, in queries), the full path including the type can be used, such as person.name1.last1; in this case, the actual indexed name will be resolved to match against the index, and an automatic query filter will be added to only match person types.
4)root object type
The root object mapping is an object type mapping that maps the root object (the type itself). On top of all the different mappings that can be set using the object type mapping, it allows for additional, type level mapping definitions.
The root object mapping allows indexing a JSON document that either starts with the actual mapping type, or only contains its fields. For example, the following tweet JSON can be indexed:
{
"message":"This is a tweet!"
}
But, also the following JSON can be indexed:
{
"tweet":{
"message":"This is a tweet!"
}
}
Out of the two, it is preferable to use the document without the type explicitly set.
index / search analyzers
The root object allows to define type mapping level analyzers for index and search that will be used with all different fields that do not explicitly set analyzers on their own. Here is an example:
{
"tweet":{
"index_analyzer":"standard",
"search_analyzer":"standard"
}
}
The above simply explicitly defines both the index_analyzer and search_analyzer that will be used. There is also an option to use the analyzer attribute to set both the search_analyzer and index_analyzer.
dynamic_date_formats
dynamic_date_formats (an old setting called date_formats still works) is the ability to set one or more date formats that will be used to detect date fields. For example:
{
"tweet":{
"dynamic_date_formats":["yyyy-MM-dd","dd-MM-yyyy"],
"properties":{
"message":{"type":"string"}
}
}
}
In the above mapping, if a new JSON field of type string is detected, the date formats specified will be used to check if it's a date. If it passes parsing, then the field will be declared of date type, and will use the matching format as its format attribute. The date format itself is explained here.
The default formats are: dateOptionalTime (ISO) and yyyy/MM/dd HH:mm:ss Z||yyyy/MM/dd Z.
Note: dynamic_date_formats are used only for dynamically added date fields, not for date fields that you specify in your mapping.
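Date detection can be illustrated with a short Python sketch. Elasticsearch itself uses Joda-Time patterns; here we translate the two formats from the mapping into strptime equivalents purely for illustration:

```python
from datetime import datetime

# Joda-style patterns from the mapping, paired with strptime equivalents
DYNAMIC_DATE_FORMATS = [("yyyy-MM-dd", "%Y-%m-%d"), ("dd-MM-yyyy", "%d-%m-%Y")]

def detect_date_format(value):
    # Return the first configured format that parses the string, else None:
    # only then would the new field be declared of date type.
    for joda, strp in DYNAMIC_DATE_FORMATS:
        try:
            datetime.strptime(value, strp)
            return joda
        except ValueError:
            pass
    return None

print(detect_date_format("2014-01-31"))  # yyyy-MM-dd
print(detect_date_format("31-01-2014"))  # dd-MM-yyyy
print(detect_date_format("a tweet"))     # None
```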
date_detection
Allows to disable automatic date type detection (if a new field is introduced and matches the provided format), for example:
{
"tweet":{
"date_detection":false,
"properties":{
"message":{"type":"string"}
}
}
}
numeric_detection
Sometimes, even though JSON has support for native numeric types, numeric values are still provided as strings. In order to try and automatically detect numeric values from strings, numeric_detection can be set to true. For example:
{
"tweet":{
"numeric_detection":true,
"properties":{
"message":{"type":"string"}
}
}
}
dynamic_templates
Dynamic templates allow defining mapping templates that will be applied when dynamic introduction of fields / objects happens.
For example, we might want all fields to be stored by default, or all string fields to be stored, or string fields to always be indexed with the multi fields syntax, once analyzed and once not_analyzed. Here is a simple example:
{
"person":{
"dynamic_templates":[
{
"template_1":{
"match":"multi*",
"mapping":{
"type":"{dynamic_type}",
"index":"analyzed",
"fields":{
"org":{"type":"{dynamic_type}","index":"not_analyzed"}
}
}
}
},
{
"template_2":{
"match":"*",
"match_mapping_type":"string",
"mapping":{
"type":"string",
"index":"not_analyzed"
}
}
}
]
}
}
The above mapping will create a field with multi fields for all field names starting with multi, and will map all string types to be not_analyzed.
Dynamic templates are named to allow for simple merge behavior. A new mapping, just with a new template can be "put" and that template will be added, or if it has the same name, the template will be replaced.
The match option allows defining matching on the field name. An unmatch option is also available to exclude fields if they do match on match. The match_mapping_type controls if this template will be applied only for dynamic fields of the specified type (as guessed by the JSON format).
Another option is to use path_match, which allows matching the dynamic template against the "full" dot notation name of the field (for example obj1.*.value or obj1.obj2.*), with the respective path_unmatch.
The format of all the matching is a simple format, allowing the use of * as a matching element supporting simple patterns such as xxx*, *xxx, xxx*yyy (with an arbitrary number of pattern types), as well as direct equality. match_pattern can be set to regex to allow for regular expression based matching.
The mapping element provides the actual mapping definition. The {name} keyword can be used and will be replaced with the actual dynamic field name being introduced. The {dynamic_type} (or {dynamicType}) keyword can be used and will be replaced with the mapping derived based on the field type (or the derived type, like date).
Complete generic settings can also be applied, for example, to have all mappings be stored, just set:
{
"person":{
"dynamic_templates":[
{
"store_generic":{
"match":"*",
"mapping":{
"store":true
}
}
}
]
}
}
Such generic templates should be placed at the end of the dynamic_templates list, because when two or more dynamic templates match a field, only the first matching one from the list is used.
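The first-match-wins behavior can be sketched in Python using fnmatch for the simple * patterns. This combines the templates from the two examples above into one ordered list (a simulation, not ES code):

```python
from fnmatch import fnmatch

# Templates in list order: order matters, the first match wins.
TEMPLATES = [
    ("template_1", {"match": "multi*"}),
    ("template_2", {"match": "*", "match_mapping_type": "string"}),
    ("store_generic", {"match": "*"}),
]

def pick_template(field_name, field_type="string"):
    for name, cond in TEMPLATES:
        if not fnmatch(field_name, cond["match"]):
            continue
        # if match_mapping_type is absent, any type is accepted
        if cond.get("match_mapping_type", field_type) != field_type:
            continue
        return name
    return None

print(pick_template("multi_field"))             # template_1
print(pick_template("title"))                   # template_2
print(pick_template("age", field_type="long"))  # store_generic
```

Had store_generic been placed first, it would shadow the two more specific templates for every field.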
5)nested type
Nested objects/documents allow mapping certain sections of a document as nested, so that they can be queried as if they were separate docs joined with the owning parent doc.
One of the problems when indexing inner objects that occur several times in a doc is that "cross object" search matches will occur. For example:
{
"obj1":[
{
"name":"blue",
"count":4
},
{
"name":"green",
"count":6
}
]
}
Searching for name set to blue and count higher than 5 will match the doc, because in the first element the name matches blue, and in the second element, count matches "higher than 5".
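The reason is that, with a plain object mapping, Elasticsearch flattens the array into parallel per-field value lists, losing which name belongs to which count. A small Python sketch of the effect:

```python
# Flattened view of the doc above under a plain object mapping
flattened = {"obj1.name": ["blue", "green"], "obj1.count": [4, 6]}

def flat_match(doc, name, min_count):
    # Each condition may be satisfied by a *different* array element.
    return name in doc["obj1.name"] and any(c > min_count for c in doc["obj1.count"])

print(flat_match(flattened, "blue", 5))  # True: the false positive described above

# A nested mapping instead evaluates the query per inner object:
objs = [{"name": "blue", "count": 4}, {"name": "green", "count": 6}]
print(any(o["name"] == "blue" and o["count"] > 5 for o in objs))  # False
```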
Nested mapping allows mapping certain inner objects (usually multi instance ones), for example:
{
"type1":{
"properties":{
"obj1":{
"type":"nested",
"properties":{
"name":{"type":"string","index":"not_analyzed"},
"count":{"type":"integer"}
}
}
}
}
}
The above will cause all obj1 objects to be indexed as nested docs. The mapping is similar in nature to setting type to object, except that it's nested. Nested object fields can be defined explicitly, as in the example above, or added dynamically in the same way as for the root object.
Note: changing an object type to nested type requires reindexing.
The nested object fields can also be automatically added to the immediate parent by setting include_in_parent to true, and also included in the root object by setting include_in_root to true.
Nested docs will also automatically use the root doc _all field.
Searching on nested docs can be done using either the nested query or nested filter.
internal implementation
Internally, nested objects are indexed as additional documents, but, since they can be guaranteed to be indexed within the same "block", it allows for extremely fast joining with parent docs.
Those internal nested documents are automatically masked away when doing operations against the index (like searching with a match_all query), and they bubble out when using the nested query.
Because nested docs are always masked by the parent doc, nested docs can never be accessed outside the scope of the nested query. For example, stored fields can be enabled on fields inside nested objects, but there is no way of retrieving them, since stored fields are fetched outside of the nested query scope.
The _source field is always associated with the parent document, and because of that, field values for nested objects can be fetched via the source.
6)ip type
An ip mapping type allows storing ipv4 addresses in a numeric form, allowing to easily sort and range query them (using ip values).
The following table lists all the attributes that can be used with an ip type:
Attribute | Description |
---|---|
index_name | The name of the field that will be stored in the index. Defaults to the property/field name. |
store | Set to true to store the actual field in the index, false to not store it. Defaults to false (note, the JSON document itself is already stored, so the field can still be retrieved from it). |
index | Set to no if the value should not be indexed. Defaults to not_analyzed. |
precision_step | The precision step (number of terms generated for each number value). Defaults to 4. |
boost | The boost value. Defaults to 1.0. |
null_value | When there is a (JSON) null value for the field, use the null_value as the field value. Defaults to not adding the field at all. |
include_in_all | Should the field be included in the _all field (if enabled). Defaults to true or to the parent object type setting. |
7)geo point type
A mapper type called geo_point supports geo based points. The declaration looks as follows:
{
"pin":{
"properties":{
"location":{
"type":"geo_point"
}
}
}
}
indexed fields
The geo_point mapping will index a single field with the format of lat,lon. The lat_lon option can be set to also index the .lat and .lon as numeric fields, and geohash can be set to true to also index the .geohash value.
A good practice is to enable indexing lat_lon as well, since both the geo distance and bounding box filters can be executed either using in-memory checks or using the indexed lat/lon values, and which one performs better really depends on the data set. Note though, that indexed lat/lon values only make sense when there is a single geo point value for the field, not multiple values.
geohashes
Geohashes are a form of lat/lon encoding which divides the earth up into a grid. Each cell in this grid is represented by a geohash string. Each cell in turn can be further subdivided into smaller cells which are represented by a longer string. So the longer the geohash, the smaller (and thus more accurate) the cell is.
Because geohashes are just strings, they can be stored in an inverted index like any other string, which makes querying them very efficient.
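The bisect-and-interleave scheme behind geohashes can be sketched in a few lines of Python (an illustrative implementation, not the one Elasticsearch uses internally):

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # the geohash alphabet

def geohash(lat, lon, length=12):
    # Interleave longitude/latitude bisection bits (longitude first) and
    # pack them 5 bits per base32 character: longer hash -> smaller cell.
    lat_rng, lon_rng = [-90.0, 90.0], [-180.0, 180.0]
    bits, even = [], True
    while len(bits) < length * 5:
        rng, val = (lon_rng, lon) if even else (lat_rng, lat)
        mid = (rng[0] + rng[1]) / 2.0
        if val >= mid:
            bits.append(1)
            rng[0] = mid
        else:
            bits.append(0)
            rng[1] = mid
        even = not even
    return "".join(BASE32[int("".join(map(str, bits[i:i + 5])), 2)]
                   for i in range(0, length * 5, 5))

# The pin at lat 41.12, lon -71.34 used in the examples below
print(geohash(41.12, -71.34))
```

Truncating the hash widens the cell: the first three characters alone already identify the containing "drm" cell.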
If you enable the geohash option, a geohash "sub-field" will be indexed as, e.g., pin.geohash. The length of the geohash is controlled by the geohash_precision parameter, which can either be set to an absolute length (e.g. 12, the default) or to a distance (e.g. 1km).
More usefully, set the geohash_prefix option to true to index not only the geohash value, but all the enclosing cells as well. For instance, a geohash of u30 will be indexed as [u,u3,u30]. This option can be used by the Geohash Cell Filter to find geopoints within a particular cell very efficiently.
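Because enclosing cells are simply string prefixes, computing what geohash_prefix would index is trivial (a sketch; enclosing_cells is our own name):

```python
def enclosing_cells(hash_value):
    # All prefixes (enclosing grid cells) of a geohash, shortest first
    return [hash_value[:i] for i in range(1, len(hash_value) + 1)]

print(enclosing_cells("u30"))  # ['u', 'u3', 'u30']
```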
input structure
The above mapping defines a geo_point, which accepts different formats. The following formats are supported:
lat lon as properties
{
"pin":{
"location":{
"lat":41.12,
"lon":-71.34
}
}
}
lat lon as string
Format in lat,lon.
{
"pin":{
"location":"41.12,-71.34"
}
}
geohash
{
"pin":{
"location":"drm3btev3e86"
}
}
lat lon as array
Format in [lon, lat]. Note the order of lon/lat here, which conforms with GeoJSON.
{
"pin":{
"location":[-71.34,41.12]
}
}
mapping options
Option | Description |
---|---|
lat_lon | Set to true to also index the .lat and .lon as fields. Defaults to false. |
geohash | Set to true to also index the .geohash as a field. Defaults to false. |
geohash_precision | Sets the geohash precision. It can be set to an absolute geohash length or a distance value (eg 1km, 1m, 1ml) defining the size of the smallest cell. Defaults to an absolute length of 12. |
geohash_prefix | If this option is set to true, not only the geohash but also all its parent cells (true prefixes) will be indexed as well. The geohash_precision setting sets the length of the geohash. Defaults to false. |
validate | Set to true to reject geo points with invalid latitude or longitude (default is false). Note: validation only works when normalization has been disabled. |
validate_lat | Set to true to reject geo points with an invalid latitude. |
validate_lon | Set to true to reject geo points with an invalid longitude. |
normalize | Set to true to normalize latitude and longitude (default is true). |
normalize_lat | Set to true to normalize latitude. |
normalize_lon | Set to true to normalize longitude. |
precision_step | The precision step (number of terms generated for each number value) for the .lat and .lon fields. Defaults to 4. |
field data
By default, geo points use the array format, which loads geo points into two parallel double arrays, making sure there is no precision loss. However, this can require a non-negligible amount of memory (16 bytes per document), which is why Elasticsearch also provides a field data implementation with lossy compression called compressed:
{
"pin":{
"properties":{
"location":{
"type":"geo_point",
"fielddata":{
"format":"compressed",
"precision":"1cm"
}
}
}
}
}
This field data format comes with a precision option which allows configuring how much precision can be traded for memory. The default value is 1cm. The following table presents the memory savings for various precisions:
Precision | Bytes per point | Size reduction |
---|---|---|
1km | 4 | 75% |
3m | 6 | 62.5% |
1cm | 8 | 50% |
1mm | 10 | 37.5% |
Precision can be changed on a live index by using the update mapping API.
usage in scripts
When using doc[geo_field_name] (in the above mapping, doc['location']), doc[...].value returns a GeoPoint, which then allows access to lat and lon (for example, doc[...].value.lat). For performance, it is better to access lat and lon directly using doc[...].lat and doc[...].lon.
8)geo shape type
The geo_shape mapping type facilitates the indexing of and searching with arbitrary geo shapes such as rectangles and polygons. It should be used when either the data being indexed or the queries being executed contain shapes other than just points.
You can query documents using this type using geo_shape Filter or geo_shape Query.
Note, the geo_shape type uses Spatial4J and JTS, both of which are optional dependencies. Consequently you must add Spatial4J v0.3 and JTS v1.12 to your classpath in order to use this type.
mapping options
The geo_shape mapping maps geo_json geometry objects to the geo_shape type. To enable it, users must explicitly map fields to the geo_shape type.
Option | Description |
---|---|
tree | Name of the PrefixTree implementation to be used: geohash for GeohashPrefixTree and quadtree for QuadPrefixTree. Defaults to geohash. |
precision | This parameter may be used instead of tree_levels to set an appropriate value for the tree_levels parameter. The value specifies the desired precision and Elasticsearch will calculate the best tree_levels value to honor this precision. The value should be a number followed by a distance unit, e.g. 50m, 1km. |
tree_levels | Maximum number of layers to be used by the PrefixTree. This can be used to control the precision of shape representations and therefore how many terms are indexed. Defaults to the default value of the chosen PrefixTree implementation. Since this parameter requires a certain level of understanding of the underlying implementation, users may use the precision parameter instead. |
distance_error_pct | Used as a hint to the PrefixTree about how precise it should be. Defaults to 0.025 (2.5%) with 0.5 as the maximum supported value. |
prefix trees
To efficiently represent shapes in the index, Shapes are converted into a series of hashes representing grid squares using implementations of a PrefixTree. The tree notion comes from the fact that the PrefixTree uses multiple grid layers, each with an increasing level of precision to represent the Earth.
Multiple PrefixTree implementations are provided:
- GeohashPrefixTree - Uses geohashes for grid squares. Geohashes are base32 encoded strings of the bits of the latitude and longitude interleaved. So the longer the hash, the more precise it is. Each character added to the geohash represents another tree level and adds 5 bits of precision to the geohash. A geohash represents a rectangular area and has 32 sub rectangles. The maximum number of levels in Elasticsearch is 24.
- QuadPrefixTree - Uses a quadtree for grid squares. Similar to geohash, quad trees interleave the bits of the latitude and longitude; the resulting hash is a bit set. A tree level in a quad tree represents 2 bits in this bit set, one for each coordinate. The maximum number of levels for the quad trees in Elasticsearch is 50.
accuracy
Geo_shape does not provide 100% accuracy and depending on how it is configured it may return some false positives or false negatives for certain queries. To mitigate this, it is important to select an appropriate value for the tree_levels parameter and to adjust expectations accordingly. For example, a point may be near the border of a particular grid cell and may thus not match a query that only matches the cell right next to it — even though the shape is very close to the point.
example
{
"properties":{
"location":{
"type":"geo_shape",
"tree":"quadtree",
"precision":"1m"
}
}
}
This mapping maps the location field to the geo_shape type using the quad_tree implementation and a precision of 1m. Elasticsearch translates this into a tree_levels setting of 26.
performance considerations
Elasticsearch uses the paths in the prefix tree as terms in the index and in queries. The higher the levels is (and thus the precision), the more terms are generated. Of course, calculating the terms, keeping them in memory, and storing them on disk all have a price. Especially with higher tree levels, indices can become extremely large even with a modest amount of data. Additionally, the size of the features also matters. Big, complex polygons can take up a lot of space at higher tree levels. Which setting is right depends on the use case. Generally one trades off accuracy against index size and query performance.
The defaults in Elasticsearch for both implementations are a compromise between index size and a reasonable level of precision of 50m at the equator. This allows for indexing tens of millions of shapes without bloating the resulting index too much relative to the input size.
input structure
The GeoJSON format is used to represent Shapes as input as follows:
{
"location":{
"type":"point",
"coordinates":[45.0,-45.0]
}
}
Note, both the type and coordinates fields are required.
The supported types are point, linestring, polygon, multipoint and multipolygon.
Note, in GeoJSON the correct coordinate order is longitude, latitude. This differs from some APIs, such as Google Maps, that generally use latitude, longitude.
envelope
Elasticsearch supports an envelope type which consists of coordinates for the upper left and lower right points of the shape:
{
"location":{
"type":"envelope",
"coordinates":[[-45.0,45.0],[45.0,-45.0]]
}
}
polygon
A polygon is defined by a list of lists of points. The first and last points in each (outer) list must be the same (the polygon must be closed).
{
"location":{
"type":"polygon",
"coordinates":[
[[100.0,0.0],[101.0,0.0],[101.0,1.0],[100.0,1.0],[100.0,0.0]]
]
}
}
The first array represents the outer boundary of the polygon, the other arrays represent the interior shapes ("holes"):
{
"location":{
"type":"polygon",
"coordinates":[
[[100.0,0.0],[101.0,0.0],[101.0,1.0],[100.0,1.0],[100.0,0.0]],
[[100.2,0.2],[100.8,0.2],[100.8,0.8],[100.2,0.8],[100.2,0.2]]
]
}
}
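The closed-ring requirement can be checked with a few lines of Python (a sketch; is_closed_ring is our own helper, using the outer boundary and hole from the example above):

```python
def is_closed_ring(ring):
    # A GeoJSON linear ring needs at least 4 points and must repeat its
    # first point as its last point.
    return len(ring) >= 4 and ring[0] == ring[-1]

outer = [[100.0, 0.0], [101.0, 0.0], [101.0, 1.0], [100.0, 1.0], [100.0, 0.0]]
hole = [[100.2, 0.2], [100.8, 0.2], [100.8, 0.8], [100.2, 0.8], [100.2, 0.2]]
print(all(is_closed_ring(r) for r in [outer, hole]))  # True
print(is_closed_ring(outer[:-1]))                     # False: ring left open
```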
multipolygon
A list of geojson polygons.
{
"location":{
"type":"multipolygon",
"coordinates":[
[[[102.0,2.0],[103.0,2.0],[103.0,3.0],[102.0,3.0],[102.0,2.0]]],
[[[100.0,0.0],[101.0,0.0],[101.0,1.0],[100.0,1.0],[100.0,0.0]],
[[100.2,0.2],[100.8,0.2],[100.8,0.8],[100.2,0.8],[100.2,0.2]]]
]
}
}
sorting and retrieving index shapes
Due to the complex input structure and index representation of shapes, it is not currently possible to sort shapes or retrieve their fields directly. The geo_shape value is only retrievable through the _source field.
9)attachment type
The attachment type allows indexing different "attachment" type fields (encoded as base64), for example, Microsoft Office formats, open document formats, ePub, HTML, and so on (the full list can be found here).
The attachment type is provided as a plugin extension. The plugin is a simple zip file that can be downloaded and placed under the $ES_HOME/plugins location. It will be automatically detected and the attachment type will be added.
Note, the attachment type is experimental.
Using the attachment type is simple: in your mapping JSON, simply set a certain JSON element as attachment, for example:
{
"person":{
"properties":{
"my_attachment":{"type":"attachment"}
}
}
}
In this case, the JSON to index can be:
{
"my_attachment":"... base64 encoded attachment ..."
}
Or it is possible to use a more elaborate JSON if the content type or resource name needs to be set explicitly:
{
"my_attachment":{
"_content_type":"application/pdf",
"_name":"resource/name/of/my.pdf",
"content":"... base64 encoded attachment ..."
}
}
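Building either payload form is just base64 encoding plus a small dict. A Python sketch (attachment_payload is a hypothetical helper; the _content_type and _name keys follow the example above):

```python
import base64

def attachment_payload(data, content_type=None, name=None):
    # Encode raw bytes as the base64 content the attachment type expects
    content = base64.b64encode(data).decode("ascii")
    if content_type is None and name is None:
        return content  # simple form: just the base64 string
    doc = {"content": content}
    if content_type is not None:
        doc["_content_type"] = content_type
    if name is not None:
        doc["_name"] = name
    return doc

print(attachment_payload(b"hello world"))  # aGVsbG8gd29ybGQ=
elaborate = attachment_payload(b"hello world", content_type="text/plain",
                               name="hello.txt")
print(sorted(elaborate))  # ['_content_type', '_name', 'content']
```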
The attachment type not only indexes the content of the doc, but also automatically adds metadata on the attachment (when available). The supported metadata fields are: date, title, author, and keywords. They can be queried using the "dot notation", for example: my_attachment.author.
Both the meta data and the actual content are simple core type mappers (string, date, …), thus, they can be controlled in the mappings. For example:
{
"person":{
"properties":{
"file":{
"type":"attachment",
"fields":{
"file":{"index":"no"},
"date":{"store":true},
"author":{"analyzer":"myAnalyzer"}
}
}
}
}
}
In the above example, the actual content indexed is mapped under the fields name file, and we decide not to index it, so it will only be available in the _all field. The other fields map to their respective metadata names, but there is no need to specify the type (like string or date) since it is already known.
The plugin uses Apache Tika to parse attachments, so many formats are supported, listed here.