浙江省高等学校教师教育理论培训

微信搜索“毛凌志岗前心得”小程序

  博客园  :: 首页  :: 新随笔  :: 联系 :: 订阅 订阅  :: 管理
Basho: Riak Search
Riak Search

Introduction
Operations
Indexing
Querying
Persistence
Major Components
Replication
Further Reading

Introduction

Riak Search is a distributed, easily-scalable, failure-tolerant, real-time, full-text search engine built around Riak Core and tightly integrated with Riak KV.

Riak Search allows you to find and retrieve your Riak objects using the objects’ values. When a Riak KV bucket has been enabled for Search integration (by installing the Search pre-commit hook), any objects stored in that bucket are also indexed seamlessly in Riak Search.

The Riak Client API can then be used to perform Search queries that return a list of bucket/key pairs matching the query. Alternatively, the query results can be used as the input to a Riak map/reduce operation. Currently the PHP, Python, Ruby, and Erlang APIs support integration with Riak Search.
Riak KV and Riak Search

Riak Search is a superset of Riak KV, so if you are running Riak Search, then you are automatically running a Riak KV cluster. You don’t need to set up a separate Riak KV cluster to use Riak Search. Downloading a Riak Search binary will give you both Riak KV and Riak Search.
Search is “Beta Software”

Note that Riak Search should be considered beta software. Please be aware that there may be bugs and issues that we have not yet covered that may require a full reindexing of your data when migrating to a different release.
Operations

Operationally, Riak Search is very similar to Riak KV. An administrator can add nodes to a cluster on the fly with simple commands to increase performance or capacity. Index and query operations can be run from any node. Multiple replicas of data are stored, allowing the cluster to continue serving full results in the face of machine failure. Partitions are handed off and replicated across clusters using the same mechanisms as Riak KV.
Indexing

At index time, Riak Search tokenizes a document into an inverted index using standard Lucene Analyzers. (For improved performance, the team re-implemented some of these in Erlang to reduce hops between Erlang and Java.) Custom analyzers can be created in either Java or Erlang. The system consults a schema (defined per-index) to determine required fields, the unique key, the default analyzer, and which analyzer should be used for each field. Field aliases (grouping multiple fields into one field) and dynamic fields (wildcard field matching) are supported.

After analyzing a document into an inverted index, the system uses a consistent hash to divide the inverted index entries (called postings) by term across the cluster. This is called term-partitioning and is a key difference from other commonly used distributed indexes. Term-partitioning was chosen because it provides higher overall query throughput with large data sets. (This can come at the expense of higher-latency queries for especially large result sets.)
Querying

Search queries use the same syntax as Lucene, and support most Lucene operators including term searches, field searches, boolean operators, grouping, lexicographical range queries, and wildcards (at the end of a word only).

Querying has two distinct stages, planning and execution. During query planning, the system creates a directed graph of the query, grouping points on the graph in order to maximize data locality and minimize inter-node traffic. Single term queries can be executed on a single node, while range queries and fuzzy matches are executed using the minimal set of nodes that cover the query.

As the query executes, Riak Search uses a series of merge-joins, merge-intersections, and filters to generate the resulting set of matching bucket/key pairs.
Persistence

For a backing store, the Riak Search team developed merge\index. merge\index takes inspiration from the Lucene file format, Bitcask (our standard backing store for Riak KV), and SSTables (from Google’s BigTable paper), and was designed to have a simple, easily-recoverable data structure, to allow simultaneous reads and writes with no performance degredation, and to be forgiving of write bursts while taking advantage of low-write periods to perform data compactions and optimizations.
Major Components

Riak Search is comprised of:

Riak Core – Dynamo-inspired distributed-systems framework
Riak KV – Distributed Key/Value store inspired by Amazon’s Dynamo.
Bitcask – Default storage backend used by Riak KV.
Riak Search – Distributed index and full-text search engine.
Merge Index – Storage backend used by Riak Search. This is a pure Erlang storage format based roughly on ideas borrowed from other storage formats including log structured merge trees, sstables, bitcask, and the Lucene file format.
Qilr – Library for parsing queries into execution plans and documents into terms.
Riak Solr – Adds a subset of Solr HTTP interface capabilities to Riak Search.

Replication

Riak Search data is replicated in a manner similar to Riak KV data: A search index has an n_val setting that determines how many copies of the data exist. Copies are written across different partitions located on different physical nodes.

The underlying data for Riak Search lives in Riak KV and replicates in precisely the same manner. The Search index, created from the underlying data, replicates differently for technical reasons. Those differences are:

Riak Search uses timestamps, rather than vector clocks, to resolve version conflicts. This leads to fewer guarantees about your data (as depending on wall-clock time can cause problems if the clock is wrong) but was a necessary tradeoff for performance reasons.
Riak Search does not use quorum values when writing (indexing) data. The data is written in a fire and forget model. Riak Search does use hinted-handoff to remain write-available when a node goes offline.
Riak Search does not use quorum values when reading (querying) data. Only one copy of the data is read, and the partition is chosen based on what will create the most efficient query plan overall.

Further Reading

Riak Search - Installation and Setup
Riak Search - Schema
Riak Search - Indexing
Riak Search - Querying
Riak Search - Indexing and Querying Riak KV Data
Riak Search - Operations and Troubleshooting

Basho Technologies, Inc.


posted on 2011-10-11 23:18  lexus  阅读(462)  评论(0编辑  收藏  举报