Knowledge about the crawl

  • The components that are involved in a crawl:

1. Content Host

This is the server that hosts/stores the content that your indexer is crawling. For example, if you have a content source that crawls a SharePoint site, the content host would be the web front end server that hosts the site. If you are crawling a file share, the content host would be the server where the file share is physically located.

2. MSSdmn.exe

When a crawl is running, you can see this process at work in Task Manager on the indexer. This process is called the "Search Daemon".

When a crawl is started, this process is responsible for connecting to the content host (using a protocol handler and IFilter), requesting content from the content host, and crawling the content.

The Search Daemon has the biggest impact on the indexer in terms of resource utilization; it does the most work.

Each file type corresponds to its own mssdmn.exe process.

3. MSSearch.exe

Once the MSSDMN.EXE process is done crawling the content, it passes the crawled content on to MSSearch.exe (this process also runs on the indexer; you should see it in Task Manager during a crawl).

MSSearch.exe does two things: it writes the crawled content onto the disk of the indexer, and it passes the metadata properties of documents that are discovered during the crawl to the back-end database. Crawling metadata properties (document title, author, etc.) allows the use of Advanced Search in MOSS 2007. Unlike the crawled content index, which is stored on the physical disk of the indexer, crawled metadata properties are stored in the database.

 Note:

In Task Manager you may see one or several mssdmn.exe processes and one or two mssearch.exe processes.

One mssearch.exe is for WSS Search and the other is for OSearch.

Mssearch.exe is the controller; mssdmn.exe is the worker that does the actual crawl and filtering.

Mssearch.exe determines which URLs to crawl and sends each URL to mssdmn.exe along with other information such as the authentication type, username, and password. It writes the crawled content to the disk of the indexer and passes the metadata properties of documents discovered during the crawl to the back-end database.

Mssdmn.exe does the actual crawl and filtering, and feeds the property chunks and plain text it retrieves from each URL back to mssearch.exe for further processing. It is responsible for connecting to the content host.
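If you want to confirm which of these processes are running without opening Task Manager, the short sketch below (Python driving the standard Windows tasklist command on the indexer) lists them. The process names are the ones described above; everything else is illustrative.

    import subprocess

    # List the SharePoint search processes described above. tasklist and its
    # /FI (filter) switch are built into Windows; run this on the indexer.
    for image in ("mssearch.exe", "mssdmn.exe"):
        result = subprocess.run(
            ["tasklist", "/FI", f"IMAGENAME eq {image}"],
            capture_output=True, text=True,
        )
        print(result.stdout)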

4. SQL Server (Search Database)

The search database stores information such as the current status of crawls and the metadata properties of documents/list items that are discovered during the crawl.

Incremental crawls depend on the TimeLastModified column in the EventCache table of the relevant content database.
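As a quick way to see the timestamps an incremental crawl keys off, a minimal sketch like the one below could query that column. It assumes pyodbc is installed and that the server and database names (SQLSERVER01, WSS_Content) are replaced with your own; use it only for troubleshooting, since querying SharePoint content databases directly is generally discouraged.

    import pyodbc  # assumes the pyodbc package and a SQL Server ODBC driver

    # SQLSERVER01 and WSS_Content are placeholders; substitute your own
    # SQL Server instance and content database names.
    conn = pyodbc.connect(
        "DRIVER={SQL Server};SERVER=SQLSERVER01;"
        "DATABASE=WSS_Content;Trusted_Connection=yes;"
    )

    # The EventCache table and TimeLastModified column are the ones
    # mentioned above as the basis for incremental crawls.
    cursor = conn.cursor()
    cursor.execute(
        "SELECT TOP 10 TimeLastModified FROM EventCache "
        "ORDER BY TimeLastModified DESC"
    )
    for (time_last_modified,) in cursor.fetchall():
        print(time_last_modified)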

  • MSSearch.exe

MSSearch.exe contains one gatherer manager. The gatherer manager can have multiple gatherer applications (a gatherer application retrieves a copy of the content of the document; there are at least 3 per Search Service Application, one per component: Admin, Crawl, and Query). A gatherer application can have multiple gatherer projects (gatherer application configuration entities; two per Search Service Application: Portal_Content and Anchor_Project).

Components of the Gatherer Project handle a standard set of calls and perform different actions depending on the makeup of the chunk. Within the Gatherer Project, chunks are taken and given to all of the plug-ins that should evaluate a given chunk. A chunk is updated a great deal as it is processed through the Gatherer Project; we call this processing of a chunk a transaction. Chunks consist essentially of post-filtered document content, properties, and internal information that is populated during the lifecycle of the transaction. There is a pool of threads in MSSearch.exe called the filtering threads. For each transaction available in the pool, a filtering thread puts the transaction in motion. Any filtering thread can service any transaction, since all documents are handled in the same way at this level.
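The flow above (a pool of filtering threads, each picking up a transaction and pushing its chunk through every plug-in) can be modelled with the purely conceptual sketch below. All class and function names are illustrative; this is not the real MSSearch.exe implementation.

    from concurrent.futures import ThreadPoolExecutor
    from dataclasses import dataclass, field

    @dataclass
    class Chunk:
        """Post-filtered document content, properties, and internal state."""
        url: str
        text: str
        properties: dict = field(default_factory=dict)

    # Illustrative plug-in stubs: each one inspects and updates the chunk.
    def gatherer_plugin(chunk):
        chunk.properties["links_discovered"] = True   # would queue new URLs

    def archival_plugin(chunk):
        chunk.properties["archived"] = True           # would go to the property store

    PLUGINS = [gatherer_plugin, archival_plugin]

    def run_transaction(chunk):
        # A transaction pushes one chunk through every plug-in in turn.
        for plugin in PLUGINS:
            plugin(chunk)
        return chunk

    chunks = [Chunk(url=f"http://contenthost/doc{i}", text="...") for i in range(5)]

    # The "filtering threads": any thread can service any transaction.
    with ThreadPoolExecutor(max_workers=2) as pool:
        for processed in pool.map(run_transaction, chunks):
            print(processed.url, processed.properties)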

 Gatherer projects

(1)    Portal_Content: we store the crawled metadata in the Search Property Store DB. We also use the Full-Text Index on the file system, which is aligned with Portal_Content, to store the content.

(2)    Anchor Project: handles the anchor text from <a ...></a> tags in HTML. The project takes the content from the anchor text table and populates it into another Full-Text Index, the Anchor Index. You can think of this part of the crawl as incremental.

(3)    Plug-ins: after all of the filtering is done and internal properties are added, the chunks are passed through the plug-ins. Plug-ins can be active or passive and essentially execute specific actions against the processed chunks. For example, at: HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Office Server\14.0\Search\Applications\[GUID]\Gathering Manager\Projects\Portal_Content\ActivePlugins\0

the Feature Extractor is under the ActivePlugins key, which means it is Active. If you look under the same base path at \Portal_Content\Plugins\1, ARPI is under the Plugins key, which means it is Passive.
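To see how many plug-in entries are registered as Active versus Passive for Portal_Content, a small sketch over the registry path quoted above might look like this. The [GUID] segment is environment-specific and left as a placeholder; the rest of the path is taken from the example above.

    import winreg

    # Substitute your search application's GUID for [GUID]; the rest of the
    # path is the one quoted above.
    BASE = (r"SOFTWARE\Microsoft\Office Server\14.0\Search\Applications"
            r"\[GUID]\Gathering Manager\Projects\Portal_Content")

    def list_subkeys(path):
        """Enumerate the numbered subkeys (0, 1, ...), one per plug-in."""
        names = []
        try:
            with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, path) as key:
                i = 0
                while True:
                    try:
                        names.append(winreg.EnumKey(key, i))
                        i += 1
                    except OSError:
                        break
        except OSError:
            pass  # key not present on this machine
        return names

    print("ActivePlugins entries:", list_subkeys(BASE + r"\ActivePlugins"))
    print("Plugins (passive) entries:", list_subkeys(BASE + r"\Plugins"))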

a)       Gatherer Plug-in [Passive]  discovers new links and adds those URLs back to the MSSCrawlQueue. 

b)       Feature Extractor Plug-in [Active]  If a document does not have a title when it is crawled for the first time (based on crawl statistics for that document), the Feature Extractor will try to determine the title based on the content and filename.  The Feature Extractor Plug-in only extracts titles from PPT and DOC files. 

c)        Scopes Plug-in [Active]  adds basic scope keys to the pipeline.  The basic scope keys are Full-Text Index keys which encode property values for scopable properties.  These keys end up in the Full-Text Index and are used for resolving query restrictions of the type property = value.

d)       Archival Plug-in (ARPI) [Passive]  takes metadata and/or properties and writes them into the property store. 

e)       Indexer Plug-in (Tripoli Plug-in) [Passive]  takes word-broken text and basic scope keys, builds in-memory indexes (in chunk buffers), and then writes them into a Full-Text Shadow Index. The Full-Text Shadow Index is located on the file system as a .ci file.

f)        Matrix Plug-in [Active]  retrieves an ACL, puts it in a normalized form (mapping people aliases to names), and then decides whether it has previously processed the ACL.

g)       Anchor Plug-in [Active]  feeds the anchor text crawl.  It reads anchors from the MSSAnchorText table and emits chunks of anchor text on the pipeline of the Anchor Project.  The pipeline of the Anchor Project does not have protocol handlers.  Once the crawl is bootstrapped by a fake start address (anchorqh://anchor/qh/targetdocid>=0), all the chunks are emitted from the Anchor Plug-in.

h)       Simple Plug-in (simplepi) [Passive]  used for troubleshooting; it does not perform any actual actions against a chunk.

i)         Sample Plug-in [Passive]  a sample.

 
