MongoDB Text Search Tutorial

In my introduction to text search in MongoDB, we had a look at the basic features. Today we’ll have a closer look at the details.

API

You may have noticed that a text search is not executed with a find() command. Instead you call

db.foo.runCommand( "text", {search: "bar"} )
Remember that this is still an experimental feature. Adding it to the implementation of the find() command would have mixed critical production code with the new text search feature. When executed via a runCommand() call, text search can be run and tested in isolation.

I expect to see a new query operator like $text or $textsearch as soon as text search is integrated with the standard find() command.

Text Query Syntax

In the previous examples we just searched for a single word. We can do more than that. Let’s have a look at the following example:

db.foo.drop()
db.foo.ensureIndex( {txt: "text"} )
db.foo.insert( {txt: "Robots are superior to humans"} )
db.foo.insert( {txt: "Humans are weak"} )
db.foo.insert( {txt: "I, Robot - by Isaac Asimov"} )
A search for “robot” will find two documents, and the same is true for “human”:
> db.foo.runCommand("text", {search: "robot"}).results.length
2
> db.foo.runCommand("text", {search: "human"}).results.length
2
When searching for multiple terms, an OR search is performed, yielding three documents in our example:
> db.foo.runCommand("text", {search: "human robot"}).results.length
3
I would have expected the given search words to be AND-ed, not OR-ed.
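
To make the difference between OR and AND matching concrete, here is a minimal sketch in plain JavaScript. It is a toy model, not MongoDB's implementation: the `stem()` helper is a hypothetical stand-in for the real stemmer (it only strips a trailing "s"), and the `requireAll` flag is an assumption added to show what an AND search would return.

```javascript
// Toy model of the OR behavior described above; not MongoDB's implementation.
// stem() is a hypothetical stand-in for the real Snowball stemmer.
function stem(word) {
  const w = word.toLowerCase();
  return w.length > 3 && w.endsWith("s") ? w.slice(0, -1) : w;
}

// Turn a text into the set of its (naively) stemmed words.
function stems(text) {
  const words = text.match(/[A-Za-z]+/g) || [];
  return new Set(words.map(stem));
}

// A document matches if it contains ANY of the search stems (OR).
// Pass requireAll = true to get AND behavior instead.
function search(docs, query, requireAll = false) {
  const wanted = [...stems(query)];
  return docs.filter((doc) => {
    const have = stems(doc.txt);
    return requireAll
      ? wanted.every((s) => have.has(s))
      : wanted.some((s) => have.has(s));
  });
}

const docs = [
  { txt: "Robots are superior to humans" },
  { txt: "Humans are weak" },
  { txt: "I, Robot - by Isaac Asimov" },
];

console.log(search(docs, "robot").length);             // 2
console.log(search(docs, "human").length);             // 2
console.log(search(docs, "human robot").length);       // 3 (OR)
console.log(search(docs, "human robot", true).length); // 1 (AND)
```

With AND semantics only the first document would match, since it is the only one that mentions both robots and humans.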

Negation

By adding a leading minus sign to a search word, you can exclude documents containing that word. Let’s say we want all documents on “robot” but none on “humans”:

> db.foo.runCommand("text", {search: "robot -humans"})
{
        "queryDebugString" : "robot||human||||",
        "language" : "english",
        "results" : [
                {
                        "score" : 0.6666666666666666,
                        "obj" : {
                                "_id" : ObjectId("50ebc484214a1e88aaa4ada0"),
                                "txt" : "I, Robot - by Isaac Asimov"
                        }
                }
        ],
        "stats" : {
                "nscanned" : 2,
                "nscannedObjects" : 0,
                "n" : 1,
                "timeMicros" : 212
        },
        "ok" : 1
}

 

Phrase Search

 

By enclosing multiple words inside quotes (“foo bar”) you perform a phrase search. Inside a phrase, order is important and stop words are also taken into account:

> db.foo.runCommand("text", {search: '"robots are"'})
{
        "queryDebugString" : "robot||||robots are||",
        "language" : "english",
        "results" : [
                {
                        "score" : 0.6666666666666666,
                        "obj" : {
                                "_id" : ObjectId("50ebc482214a1e88aaa4ad9e"),
                                "txt" : "Robots are superior to humans"
                        }
                }
        ],
        "stats" : {
                "nscanned" : 2,
                "nscannedObjects" : 0,
                "n" : 1,
                "timeMicros" : 185
        },
        "ok" : 1
}
Please have a look at the “queryDebugString” field:
"queryDebugString" : "robot||||robots are||"
It tells us that our search string contains one stem “robot” but also the phrase “robots are”. That’s the reason we have only one hit. Compare that to these searches:
> // order matters inside phrase
> db.foo.runCommand("text", {search: '"are robots"'}).results.length
0
> // no phrase search --> OR query
> db.foo.runCommand("text", {search: 'are robots'}).results.length
2
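
The three kinds of search input seen so far (plain terms, negated terms and quoted phrases) can be told apart with a small parser. The following is a minimal sketch in plain JavaScript, not the server's parser: `parseSearch()` and `matchesPhrase()` are hypothetical helper names, and no stemming or stop word removal is done, so the output only loosely mirrors the sections of the queryDebugString shown above.

```javascript
// Sketch: split a search string into terms, negated terms and phrases.
// No stemming or stop word filtering is applied here.
function parseSearch(search) {
  const parsed = { terms: [], negatedTerms: [], phrases: [] };
  // Match either a "quoted phrase" or a run of non-space characters.
  const tokens = search.match(/"[^"]*"|\S+/g) || [];
  for (const token of tokens) {
    if (token.startsWith('"')) {
      parsed.phrases.push(token.replace(/"/g, "").toLowerCase());
    } else if (token.startsWith("-") && token.length > 1) {
      parsed.negatedTerms.push(token.slice(1).toLowerCase());
    } else {
      parsed.terms.push(token.toLowerCase());
    }
  }
  return parsed;
}

// A phrase only matches if its words appear in exactly this order.
function matchesPhrase(text, phrase) {
  return text.toLowerCase().includes(phrase.toLowerCase());
}

console.log(JSON.stringify(parseSearch("robot -humans")));
// {"terms":["robot"],"negatedTerms":["humans"],"phrases":[]}
console.log(JSON.stringify(parseSearch('"robots are"')));
// {"terms":[],"negatedTerms":[],"phrases":["robots are"]}
console.log(matchesPhrase("Robots are superior to humans", "robots are")); // true
console.log(matchesPhrase("Robots are superior to humans", "are robots")); // false
```

Note that the real command also lists the stem “robot” as a term for the phrase query; the sketch leaves that step out.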

 

 

Multi Language Support

 

Stemming and stop word filtering are both language dependent. So if you want to use a language other than the default, English, you have to tell MongoDB which language to use for indexing and searching. MongoDB uses the open source Snowball stemmer, which supports a number of languages.

In order to use another language for indexing and searching, you specify it when creating the index:

db.de.ensureIndex( {txt: "text"}, {default_language: "german"} )
With this setting, MongoDB assumes that all text in the field “txt” and all text searches on that collection are in German. Let’s see if it works:
> db.de.insert( {txt: "Ich bin Dein Vater, Luke." } )
> db.de.validate().keysPerIndex["text.de.$txt_text"]
2
As you can see, there are only two index keys, so stop word filtering did occur (this time with a German stop word list; “Vater” is the German word for father, not a typo for Vader). Let’s try some searches:
> db.de.runCommand("text", {search: "ich"}).results.length
0
> db.de.runCommand("text", {search: "Vater"}).results.length
1
> db.de.runCommand("text", {search: "Luke"}).results.length
1
Please note that we don’t have to give the language we are searching for because it is derived from the index. We have hits for the meaningful words “Vater” and “Luke”, but not for the stop word “ich” (which means “I”).

It is also possible to mix multiple languages in the same index. Each document can have its own language:

db.de.insert( {language:"english", txt: "Ich bin ein Berliner" } )
If a field “language” is present, its content defines the language for stemming and stop word filtering for the indexed field(s) of that document. The word “ich” is not a stop word in English, so it is indexed now.
// default language: german -> no hits
> db.de.runCommand("text", {search: "ich"})
{
        "queryDebugString" : "||||||",
        "language" : "german",
        "results" : [ ],
        "stats" : {
                "nscanned" : 0,
                "nscannedObjects" : 0,
                "n" : 0,
                "timeMicros" : 96
        },
        "ok" : 1
}
 
// search for English -> one hit
> db.de.runCommand("text", {search: "ich", language: "english"})
{
        "queryDebugString" : "ich||||||",
        "language" : "english",
        "results" : [
                {
                        "score" : 0.625,
                        "obj" : {
                                "_id" : ObjectId("50ed163b1e27d5e73741fafb"),
                                "language" : "english",
                                "txt" : "Ich bin ein Berliner"
                        }
                }
        ],
        "stats" : {
                "nscanned" : 1,
                "nscannedObjects" : 0,
                "n" : 1,
                "timeMicros" : 161
        },
        "ok" : 1
}
What happened here? The default language for searching is German, so the first search has no result (as before). In the second search we ask for English text (to be more precise: for index keys that were generated with an English stemmer and stop words). That’s why we find the famous sentence from JFK.

What does that mean? Well, you have a real multi-language text search at hand. You can store text messages from around the world in one collection and still search them according to their language.
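
The interplay of index default language, per-document language and search language can be modelled in a few lines of plain JavaScript. This is a toy model under stated assumptions, not MongoDB's implementation: the stop word lists are tiny hypothetical samples, stemming is left out entirely, and `keepWords()`, `indexKeys()` and `search()` are illustrative names.

```javascript
// Toy model: stop word filtering depends on the language in effect.
// The stop word lists are tiny hypothetical samples; stemming is omitted.
const DEFAULT_LANGUAGE = "german";
const STOP_WORDS = {
  german: new Set(["ich", "bin", "dein", "ein"]),
  english: new Set(["i", "am", "a", "the"]),
};

// Lowercase the text, split it into words, drop the language's stop words.
function keepWords(text, language) {
  const stopWords = STOP_WORDS[language] || new Set();
  const words = text.toLowerCase().match(/\p{L}+/gu) || [];
  return words.filter((w) => !stopWords.has(w));
}

// A document's own "language" field overrides the index default.
function indexKeys(doc) {
  return keepWords(doc.txt, doc.language || DEFAULT_LANGUAGE);
}

// The search language decides which query words survive as terms.
function search(docs, text, language = DEFAULT_LANGUAGE) {
  const terms = keepWords(text, language);
  return docs.filter((doc) => {
    const keys = indexKeys(doc);
    return terms.some((t) => keys.includes(t));
  });
}

const docs = [
  { txt: "Ich bin Dein Vater, Luke." },
  { language: "english", txt: "Ich bin ein Berliner" },
];

console.log(indexKeys(docs[0]));                    // [ 'vater', 'luke' ]
console.log(search(docs, "ich").length);            // 0, "ich" is a German stop word
console.log(search(docs, "ich", "english").length); // 1, the JFK sentence
console.log(search(docs, "Vater").length);          // 1
```

In this model the German search for “ich” ends up with no terms at all, which is consistent with the empty queryDebugString in the first result above.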

Multiple Fields

A text index can span more than one field. If you are using more than one field, each field can have its own weight. That enables you to index parts of your document that carry different significance.

> db.mail.ensureIndex( {subject: "text", body: "text"}, {weights: {subject: 10} } )
> db.mail.getIndices()
[
        ...
        {
                "v" : 0,
                "key" : {
                        "_fts" : "text",
                        "_ftsx" : 1
                },
                "ns" : "de.mail",
                "name" : "subject_text_body_text",
                "weights" : {
                        "body" : 1,
                        "subject" : 10
                },
                "default_language" : "english",
                "language_override" : "language"
        }
]
We created a text index spanning the fields “subject” and “body”, where the first got a weight of 10 and the latter the standard weight 1. Let’s see what impact these weights have:
> db.mail.insert( {subject: "Robot leader to minions", body: "Humans suck", prio: 0 } )
> db.mail.insert( {subject: "Human leader to minions", body: "Robots suck", prio: 1 } )
> db.mail.runCommand("text", {search: "robot"})
{
        "queryDebugString" : "robot||||||",
        "language" : "english",
        "results" : [
                {
                        "score" : 6.666666666666666,
                        "obj" : {
                                "_id" : ObjectId("50ed1be71e27d5e73741fafe"),
                                "subject" : "Robot leader to minions",
                                "body" : "Humans suck",
                                "prio" : 0
                        }
                },
                {
                        "score" : 0.75,
                        "obj" : {
                                "_id" : ObjectId("50ed1bfd1e27d5e73741faff"),
                                "subject" : "Human leader to minions",
                                "body" : "Robots suck",
                                "prio" : 1
                        }
                }
        ],
        "stats" : {
                "nscanned" : 2,
                "nscannedObjects" : 0,
                "n" : 2,
                "timeMicros" : 148
        },
        "ok" : 1
}
The document with “robot” in the “subject” field has a much higher score because the weight of 10 is taken as a multiplier.
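
The multiplier effect can be sketched as a weighted sum over fields. This is only an illustration in plain JavaScript: `weightedScore()` is a hypothetical helper, and the unweighted per-field scores are assumptions read off the output above (6.67 divided by the weight 10 for the subject hit, 0.75 for the body hit); how MongoDB computes those base scores is not covered here.

```javascript
// Sketch: combine per-field base scores using the index weights.
// Fields without an explicit weight get the standard weight 1.
function weightedScore(fieldScores, weights) {
  let total = 0;
  for (const [field, score] of Object.entries(fieldScores)) {
    const weight = field in weights ? weights[field] : 1;
    total += weight * score;
  }
  return total;
}

const weights = { subject: 10 };

// Assumed base scores, inferred from the shell output above.
const subjectHit = weightedScore({ subject: 2 / 3 }, weights);
const bodyHit = weightedScore({ body: 0.75 }, weights);

console.log(subjectHit.toFixed(2)); // 6.67
console.log(bodyHit.toFixed(2));    // 0.75
```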

Filtering and Projection

You can apply additional search criteria via filtering:

> db.mail.runCommand("text", {search: "robot", filter: {prio:0} } )
{
        "queryDebugString" : "robot||||||",
        "language" : "english",
        "results" : [
                {
                        "score" : 6.666666666666666,
                        "obj" : {
                                "_id" : ObjectId("50ed22621e27d5e73741fb04"),
                                "subject" : "Robot leader to minions",
                                "body" : "Humans suck",
                                "prio" : 0
                        }
                }
        ],
        "stats" : {
                "nscanned" : 2,
                "nscannedObjects" : 2,
                "n" : 1,
                "timeMicros" : 185
        },
        "ok" : 1
}
Please note that filtering does not use an index.

If you are interested only in a subset of fields, you can use projection (similar to the aggregation framework):

> db.mail.runCommand("text", {search: "robot", projection: {_id:0, prio:0} } )
{
        "queryDebugString" : "robot||||||",
        "language" : "english",
        "results" : [
                {
                        "score" : 6.666666666666666,
                        "obj" : {
                                "subject" : "Robot leader to minions",
                                "body" : "Humans suck"
                        }
                },
                {
                        "score" : 0.75,
                        "obj" : {
                                "subject" : "Human leader to minions",
                                "body" : "Robots suck"
                        }
                }
        ],
        "stats" : {
                "nscanned" : 2,
                "nscannedObjects" : 0,
                "n" : 2,
                "timeMicros" : 127
        },
        "ok" : 1
}
Filtering and projection can be combined, of course.
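
What the combination does to a result set can be sketched in plain JavaScript. This is a simplified model, not the server's code: `applyFilter()` only understands equality conditions, `applyProjection()` only understands exclusions (fields set to 0), both are hypothetical helpers, and the `_id` values are placeholder strings.

```javascript
// Keep only results whose document matches every equality condition.
function applyFilter(results, filter) {
  return results.filter((r) =>
    Object.entries(filter).every(([field, value]) => r.obj[field] === value)
  );
}

// Drop every field that the projection sets to 0.
function applyProjection(results, projection) {
  return results.map((r) => {
    const obj = {};
    for (const [field, value] of Object.entries(r.obj)) {
      if (projection[field] !== 0) obj[field] = value;
    }
    return { score: r.score, obj };
  });
}

// Sample results shaped like the shell output above (placeholder _id values).
const results = [
  { score: 6.666666666666666,
    obj: { _id: "id-1", subject: "Robot leader to minions", body: "Humans suck", prio: 0 } },
  { score: 0.75,
    obj: { _id: "id-2", subject: "Human leader to minions", body: "Robots suck", prio: 1 } },
];

const combined = applyProjection(applyFilter(results, { prio: 0 }), { _id: 0, prio: 0 });
console.log(JSON.stringify(combined));
// [{"score":6.666666666666666,"obj":{"subject":"Robot leader to minions","body":"Humans suck"}}]
```

The filter is applied before the projection here so that the filtered field can still be excluded from the output.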

Summary

With this second part on MongoDB text search we had a look at the more interesting features of the text search capability. For a start, that’s quite a good toolbox for implementing your own search engines. I’m looking forward to your feedback.

posted @ 2014-09-11 11:27  LAOS