Difficulties
Three characteristics of the Web make crawling difficult:
- its large volume,
- its fast rate of change, and
- dynamic page generation.
The behavior of a Web crawler is the outcome of a combination of policies:
- a selection policy that states which pages to download,
- a re-visit policy that states when to check for changes to the pages,
- a politeness policy that states how to avoid overloading Web sites, and
- a parallelization policy that states how to coordinate distributed Web crawlers.
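As one illustration of these policies, a politeness policy can be sketched as a per-host minimum delay between requests. This is only a minimal sketch under stated assumptions: the class name `PolitenessPolicy`, its methods, and the fixed delay are hypothetical and not taken from any particular crawler. Timestamps are passed in explicitly so the logic can be checked without real waiting.

```python
from urllib.parse import urlsplit


class PolitenessPolicy:
    """Hypothetical helper: enforces a minimum delay between requests to one host."""

    def __init__(self, min_delay_seconds=1.0):
        self.min_delay = min_delay_seconds
        self.next_allowed = {}  # host -> earliest timestamp for the next request

    def wait_time(self, url, now):
        """Return how many seconds to wait before fetching `url` at time `now`."""
        host = urlsplit(url).netloc
        return max(0.0, self.next_allowed.get(host, 0.0) - now)

    def record_fetch(self, url, now):
        """Note that `url` was fetched at time `now`, blocking its host for a while."""
        host = urlsplit(url).netloc
        self.next_allowed[host] = now + self.min_delay


policy = PolitenessPolicy(min_delay_seconds=2.0)
policy.record_fetch("http://example.com/a", now=100.0)
same_host_wait = policy.wait_time("http://example.com/b", now=100.5)
other_host_wait = policy.wait_time("http://other.example/", now=100.5)
```

A real crawler would typically also consult each site's robots.txt; that is left out here to keep the sketch short.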
Glossary
1. seeds: the initial list of URLs to visit
2. crawl frontier: the list of URLs still to visit; as the crawler visits the seed URLs, it identifies all the hyperlinks in each page and adds them to this list
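The relationship between seeds and the crawl frontier can be sketched as a simple queue-driven loop. This is a minimal sketch, not a definitive crawler: the function `crawl`, the `get_links` callback, the `max_pages` limit, and the in-memory `TOY_WEB` link graph are all assumptions made for illustration, and the first-in, first-out visiting order is just one possible selection policy.

```python
from collections import deque


def crawl(seeds, get_links, max_pages=100):
    """Visit pages starting from `seeds`, growing the frontier with discovered links."""
    frontier = deque(dict.fromkeys(seeds))  # URLs still to visit, duplicates removed
    seen = set(frontier)                    # every URL ever added to the frontier
    visited = []                            # URLs in the order they were visited
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        visited.append(url)
        for link in get_links(url):         # hyperlinks found in the page
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited


# Hypothetical link graph standing in for real page downloads.
TOY_WEB = {
    "http://a.example/": ["http://b.example/", "http://c.example/"],
    "http://b.example/": ["http://c.example/", "http://a.example/"],
    "http://c.example/": ["http://d.example/"],
    "http://d.example/": [],
}

order = crawl(["http://a.example/"], lambda url: TOY_WEB.get(url, []))
```

In a real crawler, `get_links` would download the page and extract its hyperlinks; here it only looks them up in the toy graph. The `seen` set keeps a page that is linked from several places from being queued more than once.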