Basic crawler concepts

Posted on 2010-06-11 09:48 by xuczhang

Difficulties

Three important characteristics of the Web make crawling very difficult: its large volume, its fast rate of change, and dynamic page generation.

The behavior of a Web crawler is the outcome of a combination of policies:

a selection policy that states which pages to download,
a re-visit policy that states when to check for changes to the pages,
a politeness policy that states how to avoid overloading Web sites, and
a parallelization policy that states how to coordinate distributed Web crawlers.
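
To make the four policies more concrete, here is a minimal Python sketch that expresses each one as a small function. Everything in it is an illustrative assumption rather than a standard design: the function names, the host allow-list for selection, the fixed re-visit interval, the fixed per-host delay, and hash-based assignment of hosts to workers are just one simple choice for each policy.

```python
import hashlib
from urllib.parse import urlparse


def should_download(url, allowed_hosts):
    """Selection policy (assumed): only fetch pages on hosts we care about."""
    return urlparse(url).hostname in allowed_hosts


def is_due_for_revisit(last_fetched, now, revisit_interval):
    """Re-visit policy (assumed): re-check a page after a fixed interval."""
    return now - last_fetched >= revisit_interval


def seconds_to_wait(last_request_to_host, now, min_delay):
    """Politeness policy (assumed): keep min_delay seconds between requests to one host."""
    return max(0.0, min_delay - (now - last_request_to_host))


def assigned_worker(url, num_workers):
    """Parallelization policy (assumed): hash the host so each site has one owner."""
    host = urlparse(url).hostname or ""
    digest = hashlib.sha256(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_workers


# Example usage with made-up URLs and timestamps (in seconds).
print(should_download("http://example.com/page", {"example.com"}))
print(is_due_for_revisit(last_fetched=0, now=3600, revisit_interval=3600))
print(seconds_to_wait(last_request_to_host=10.0, now=10.5, min_delay=2.0))
print(assigned_worker("http://example.com/a", 4) == assigned_worker("http://example.com/b", 4))
```

Hashing on the host rather than the full URL means all pages of one site go to the same worker, so that worker alone can enforce the politeness delay for that site.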

Glossary

1. seeds: the initial list of URLs the crawler starts from

2. crawl frontier: the list of URLs still to be visited. As the crawler visits the seed URLs, it identifies all the hyperlinks in each page and adds them to this list
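
The relationship between seeds and the crawl frontier can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the frontier is a plain FIFO queue (so pages are visited breadth-first), `fetch` is a hypothetical callable that returns the HTML for a URL, and the `PAGES` dictionary is a made-up in-memory "web" so the example runs without network access. A real crawler would also apply the policies listed earlier.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seeds, fetch, max_pages=100):
    """Visit pages breadth-first, starting from the seeds.

    `fetch` is a hypothetical callable: URL -> HTML string, or None on failure.
    Returns the URLs in the order they were visited.
    """
    frontier = deque(seeds)   # URLs waiting to be visited
    seen = set(seeds)         # URLs already queued, to avoid duplicates
    visited = []
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        html = fetch(url)
        visited.append(url)
        if html is None:
            continue
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            link = urljoin(url, href)   # resolve relative links
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited


# Toy in-memory "web" (made up for this example).
PAGES = {
    "http://example.com/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.com/a": '<a href="/b">B</a> <a href="/c">C</a>',
    "http://example.com/b": '<a href="/">home</a>',
    "http://example.com/c": "",
}

order = crawl(["http://example.com/"], PAGES.get)
print(order)
```

The `seen` set is what keeps the frontier from growing without bound: a URL is queued at most once, even if many pages link to it.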
