Web Scraping with Python (Second Edition)
Example website: http://example.python-scraping.com
Resources: https://www.epubit.com/
Chapter 1: Introduction to web scraping
1.1 When is web scraping useful?
- To gather large amounts of data from the web in a structured format (possible by hand in theory, but automation saves time and effort)
1.2 Is web scraping legal?
- If the scraped data is for personal use and falls under fair use of copyright law, it is usually not a problem
1.3 Python 3
- Tools (see the example commands below):
- Anaconda
- virtualenvwrapper (https://virtualenvwrapper.readthedocs.io/en/latest)
- conda (https://conda.io/docs/intro.html)
- Python version: Python 3.4+
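- A minimal sketch of creating an isolated environment with the standard venv module (the environment name wswp_env and the pip install line are illustrative, not from the book):
$ python3 -m venv wswp_env
$ source wswp_env/bin/activate
$ pip install requests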
1.4 Background research
- Research tools:
- robots.txt
- sitemap
- Google -> WHOIS
1.4.1 Checking robots.txt
- Learn the crawling restrictions of the target site (a small robotparser sketch follows below)
- Can also reveal clues about the site's structure
- See: http://robotstxt.org
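- A minimal robotparser sketch (not code from the book; the user agent names are illustrative) for checking whether a given user agent is allowed to fetch a URL:
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url('http://example.python-scraping.com/robots.txt')
rp.read()
# True or False depending on the rules the site's robots.txt actually declares
print(rp.can_fetch('BadCrawler', 'http://example.python-scraping.com/'))
print(rp.can_fetch('wswp', 'http://example.python-scraping.com/'))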
1.4.2 Checking the sitemap
- Helps a crawler locate the site's most recent content without having to crawl every single page
- Sitemap standard: http://www.sitemaps.org/protocol.html
1.4.3 Estimating the size of a website
- The size of the target site affects how we crawl it: it is a question of efficiency
- Tool: Google advanced search with the site: keyword (https://www.google.com/advanced_search)
- Adding a URL path after the domain filters the results to show only certain parts of the site (illustrative queries below)
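- Example queries (the /places/default/view path is taken from the sitemap output later in these notes):
site:example.python-scraping.com
site:example.python-scraping.com/places/default/view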
1.4.4 Identifying the technologies used by a website
- detectem module (pip install detectem)
- Tools:
- Install Docker (http://www.docker.com/products/overview)
- bash: $ docker pull scrapinghub/splash
- bash: $ pip install detectem
- Python virtual environments (https://docs.python.org/3/library/venv.html)
- conda environments (https://conda.io/docs/using/envs.html)
- See the project's README (https://github.com/spectresearch/detectem)
$ det http://example.python-scraping.com
'''
[{'name': 'jquery', 'version': '1.11.0'},
{'name': 'modernizr', 'version': '2.7.1'},
{'name': 'nginx', 'version': '1.12.2'}]
'''
$ docker pull wappalyzer/cli
$ docker run wappalyzer/cli http://example.python-scraping.com
1.4.5 Finding the owner of a website
- To find the owner, query the registered owner of the domain using the WHOIS protocol
- Python has a library that wraps this protocol (https://pypi.python.org/pypi/python-whois)
- Install: pip install python-whois
import whois
# pass a domain name to look up; the example site's domain is used here
print(whois.whois('example.python-scraping.com'))
1.5 Writing your first web crawler
- Crawling: downloading the web pages that contain the data of interest
- There are many ways to crawl, and the most suitable choice depends on the structure of the target site
- Three common approaches to crawling a site:
- Crawling the sitemap
- Iterating over each page using database IDs
- Following page links
1.5.1 Scraping versus crawling
- Scraping: targets a specific website and extracts specified information from that site
- Crawling: built in a generic way, targeting a set of top-level domains or the whole web. It can be used to collect more specific information, but more commonly it crawls the web broadly, gathering small, generic pieces of information from different sites or pages and then following links to other pages.
1.5.2 Downloading a web page
1.5.2.1 Downloading a web page
- Temporary errors are common when downloading:
- Server overload (503 Service Unavailable): wait briefly and then retry the download
- Page not found (404 Not Found)
- Client-side request problems (4xx): retrying will not help
- Server-side problems (5xx): retrying can help
1.5.2.2 Setting a user agent
import urllib.request
from urllib.error import URLError, HTTPError, ContentTooShortError

def download(url, num_retries=2, user_agent='wswp'):
    print('Downloading:', url)
    # set the user agent (default: 'wswp')
    request = urllib.request.Request(url)
    request.add_header('User-agent', user_agent)
    try:
        html = urllib.request.urlopen(request).read()
    except (URLError, HTTPError, ContentTooShortError) as e:
        print('Download error:', e.reason)
        html = None
        if num_retries > 0:
            if hasattr(e, 'code') and 500 <= e.code < 600:
                # recursively retry 5xx HTTP errors
                return download(url, num_retries - 1, user_agent)
    return html
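- Example call (assumes network access; the returned bytes are not shown here):
html = download('http://example.python-scraping.com')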
1.5.3 Sitemap crawler
- Use a regular expression to extract the URLs listed in the sitemap (discovered via robots.txt) from the <loc> tags
# import the URL handling library
import urllib.request
# import the regular expression library
import re
# import the download error classes
from urllib.error import URLError, HTTPError, ContentTooShortError

def download(url, num_retries=2, user_agent='wswp', charset='utf-8'):
    print('Downloading:', url)
    request = urllib.request.Request(url)
    request.add_header('User-agent', user_agent)
    try:
        resp = urllib.request.urlopen(request)
        cs = resp.headers.get_content_charset()
        if not cs:
            cs = charset
        html = resp.read().decode(cs)
    except (URLError, HTTPError, ContentTooShortError) as e:
        print('Download error:', e.reason)
        html = None
        if num_retries > 0:
            if hasattr(e, 'code') and 500 <= e.code < 600:
                # recursively retry 5xx HTTP errors
                return download(url, num_retries - 1, user_agent, charset)
    return html

def crawl_sitemap(url):
    # download the sitemap file
    sitemap = download(url)
    # extract the sitemap links
    links = re.findall('<loc>(.*?)</loc>', sitemap)
    # download each link
    for link in links:
        html = download(link)
        # scrape html here

test_url = 'http://example.python-scraping.com/sitemap.xml'
crawl_sitemap(test_url)
'''
Downloading: http://example.python-scraping.com/sitemap.xml
Downloading: http://example.python-scraping.com/places/default/view/Afghanistan-1
Downloading: http://example.python-scraping.com/places/default/view/Aland-Islands-2
Downloading: http://example.python-scraping.com/places/default/view/Albania-3
Downloading: http://example.python-scraping.com/places/default/view/Algeria-4
Downloading: http://example.python-scraping.com/places/default/view/American-Samoa-5
Downloading: http://example.python-scraping.com/places/default/view/Andorra-6
Downloading: http://example.python-scraping.com/places/default/view/Angola-7
Downloading: http://example.python-scraping.com/places/default/view/Anguilla-8
Downloading: http://example.python-scraping.com/places/default/view/Antarctica-9
Downloading: http://example.python-scraping.com/places/default/view/Antigua-and-Barbuda-10
Downloading: http://example.python-scraping.com/places/default/view/Argentina-11
...
'''
1.5.4 ID iteration crawler
import itertools
import urllib.request
from urllib.error import URLError, HTTPError, ContentTooShortError

def download(url, num_retries=2, user_agent='wswp', charset='utf-8'):
    print('Downloading:', url)
    request = urllib.request.Request(url)
    request.add_header('User-agent', user_agent)
    try:
        resp = urllib.request.urlopen(request)
        cs = resp.headers.get_content_charset()
        if not cs:
            cs = charset
        html = resp.read().decode(cs)
    except (URLError, HTTPError, ContentTooShortError) as e:
        print('Download error:', e.reason)
        html = None
        if num_retries > 0:
            if hasattr(e, 'code') and 500 <= e.code < 600:
                # recursively retry 5xx HTTP errors
                return download(url, num_retries - 1, user_agent, charset)
    return html

def crawl_site(url, max_errors=5):
    # iterate over numeric IDs until max_errors consecutive download errors occur
    num_errors = 0
    for page in itertools.count(1):
        pg_url = '{}{}'.format(url, page)
        html = download(pg_url)
        if html is None:
            num_errors += 1
            if num_errors == max_errors:
                # reached max number of consecutive errors, so exit
                break
        else:
            num_errors = 0
            # success - can scrape the result

test_url2 = 'http://example.python-scraping.com/view/-'
# there is still a problem here, to be debugged: judging by the sitemap output above,
# the path may need to be '/places/default/view/-' rather than '/view/-'
crawl_site(test_url2)
1.5.5 Link crawler
- Use a regular expression to decide which pages should be downloaded
# regular expressions
import re
# sending requests
import urllib.request
# URL parsing and joining relative links
from urllib.parse import urljoin
# download error types
from urllib.error import URLError, HTTPError, ContentTooShortError

def download(url, num_retries=2, user_agent='wswp', charset='utf-8'):
    print('Downloading:', url)
    request = urllib.request.Request(url)
    request.add_header('User-agent', user_agent)
    try:
        resp = urllib.request.urlopen(request)
        cs = resp.headers.get_content_charset()
        if not cs:
            cs = charset
        html = resp.read().decode(cs)
    except (URLError, HTTPError, ContentTooShortError) as e:
        print('Download error:', e.reason)
        html = None
        if num_retries > 0:
            if hasattr(e, 'code') and 500 <= e.code < 600:
                # recursively retry 5xx HTTP errors
                return download(url, num_retries - 1, user_agent, charset)
    return html

def link_crawler(start_url, link_regex):
    " Crawl from the given start URL following links matched by link_regex "
    crawl_queue = [start_url]
    # keep track of which URLs have been seen before
    seen = set(crawl_queue)
    while crawl_queue:
        url = crawl_queue.pop()
        html = download(url)
        if not html:
            continue
        # filter for links matching our regular expression
        for link in get_links(html):
            if re.match(link_regex, link):
                abs_link = urljoin(start_url, link)
                if abs_link not in seen:
                    seen.add(abs_link)
                    crawl_queue.append(abs_link)

def get_links(html):
    " Return a list of links from html "
    # a regular expression to extract all links from the webpage
    webpage_regex = re.compile("""<a[^>]+href=["'](.*?)["']""", re.IGNORECASE)
    # list of all links from the webpage
    return webpage_regex.findall(html)
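- Example call (illustrative; this basic version has no robots.txt handling, throttling, or depth limit yet; the same start URL and regex appear with the final version later):
link_crawler('http://example.python-scraping.com/index', '/(index|view)/')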
Advanced features (all implemented in the final version below):
- 1. Parse the robots.txt file to avoid downloading URLs that are disallowed for crawling; the robotparser module in Python's urllib library makes this easy
- 2. Proxy support: sometimes a site must be accessed through a proxy; Python's urllib supports proxies
- 3. Download throttling: to reduce the risk of being blocked, add a delay between two downloads to the same domain, slowing the crawler down
- 4. Avoiding spider traps: some pages generate an unbounded number of links and would keep the crawler downloading forever; record the current crawl depth and stop at a maximum depth
Final version
# final version
from urllib.parse import urlparse
import time

class Throttle:
    """ Add a delay between downloads to the same domain
    """
    def __init__(self, delay):
        # amount of delay between downloads for each domain
        self.delay = delay
        # timestamp of when a domain was last accessed
        self.domains = {}

    def wait(self, url):
        domain = urlparse(url).netloc
        last_accessed = self.domains.get(domain)
        if self.delay > 0 and last_accessed is not None:
            sleep_secs = self.delay - (time.time() - last_accessed)
            if sleep_secs > 0:
                # domain has been accessed recently, so need to sleep
                time.sleep(sleep_secs)
        # update the last accessed time
        self.domains[domain] = time.time()
import re
import urllib.request
from urllib import robotparser
from urllib.parse import urljoin
from urllib.error import URLError, HTTPError, ContentTooShortError
# from throttle import Throttle  # if the Throttle class is kept in a separate module
def download(url, num_retries=2, user_agent='wswp', charset='utf-8', proxy=None):
    """ Download a given URL and return the page content
        args:
            url (str): URL
        kwargs:
            user_agent (str): user agent (default: wswp)
            charset (str): charset if website does not include one in headers
            proxy (str): proxy url, ex 'http://IP' (default: None)
            num_retries (int): number of retries if a 5xx error is seen (default: 2)
    """
    print('Downloading:', url)
    request = urllib.request.Request(url)
    request.add_header('User-agent', user_agent)
    try:
        if proxy:
            proxy_support = urllib.request.ProxyHandler({'http': proxy})
            opener = urllib.request.build_opener(proxy_support)
            urllib.request.install_opener(opener)
        resp = urllib.request.urlopen(request)
        cs = resp.headers.get_content_charset()
        if not cs:
            cs = charset
        html = resp.read().decode(cs)
    except (URLError, HTTPError, ContentTooShortError) as e:
        print('Download error:', e.reason)
        html = None
        if num_retries > 0:
            if hasattr(e, 'code') and 500 <= e.code < 600:
                # recursively retry 5xx HTTP errors
                return download(url, num_retries - 1, user_agent, charset, proxy)
    return html
def get_robots_parser(robots_url):
    " Return the robots parser object using the robots_url "
    rp = robotparser.RobotFileParser()
    rp.set_url(robots_url)
    rp.read()
    return rp

def get_links(html):
    " Return a list of links (using simple regex matching) from the html content "
    # a regular expression to extract all links from the webpage
    webpage_regex = re.compile("""<a[^>]+href=["'](.*?)["']""", re.IGNORECASE)
    # list of all links from the webpage
    return webpage_regex.findall(html)
def link_crawler(start_url, link_regex, robots_url=None, user_agent='wswp',
                 proxy=None, delay=3, max_depth=4):
    """ Crawl from the given start URL following links matched by link_regex. In the current
        implementation, we do not actually scrape any information.
        args:
            start_url (str): web site to start crawl
            link_regex (str): regex to match for links
        kwargs:
            robots_url (str): url of the site's robots.txt (default: start_url + /robots.txt)
            user_agent (str): user agent (default: wswp)
            proxy (str): proxy url, ex 'http://IP' (default: None)
            delay (int): seconds to throttle between requests to one domain (default: 3)
            max_depth (int): maximum crawl depth (to avoid traps) (default: 4)
    """
    crawl_queue = [start_url]
    # keep track of which URLs have been seen before, and at what depth
    seen = {}
    if not robots_url:
        robots_url = '{}/robots.txt'.format(start_url)
    rp = get_robots_parser(robots_url)
    throttle = Throttle(delay)
    while crawl_queue:
        url = crawl_queue.pop()
        # check url passes robots.txt restrictions
        if rp.can_fetch(user_agent, url):
            depth = seen.get(url, 0)
            if depth == max_depth:
                print('Skipping %s due to depth' % url)
                continue
            throttle.wait(url)
            html = download(url, user_agent=user_agent, proxy=proxy)
            if not html:
                continue
            # TODO: add actual data scraping here
            # filter for links matching our regular expression
            for link in get_links(html):
                if re.match(link_regex, link):
                    abs_link = urljoin(start_url, link)
                    if abs_link not in seen:
                        seen[abs_link] = depth + 1
                        crawl_queue.append(abs_link)
        else:
            print('Blocked by robots.txt:', url)
link_regex = '/(index|view)/'
link_crawler('http://example.python-scraping.com/index', link_regex, max_depth=1)
1.5.6 Using the requests library
- Mainstream Python crawlers generally use the requests library to manage complex HTTP requests
- It is simple enough and easy to use
- Install: $ pip install requests
# advanced link crawler using the requests library
import re
from urllib import robotparser
from urllib.parse import urljoin

import requests

# the Throttle class defined earlier, kept in a local module
from chp1.throttle import Throttle

def download(url, num_retries=2, user_agent='wswp', proxies=None):
    """ Download a given URL and return the page content
        args:
            url (str): URL
        kwargs:
            user_agent (str): user agent (default: wswp)
            proxies (dict): proxy dict w/ keys 'http' and 'https', values
                            are strs (i.e. 'http(s)://IP') (default: None)
            num_retries (int): # of retries if a 5xx error is seen (default: 2)
    """
    print('Downloading:', url)
    headers = {'User-Agent': user_agent}
    try:
        resp = requests.get(url, headers=headers, proxies=proxies)
        html = resp.text
        if resp.status_code >= 400:
            print('Download error:', resp.text)
            html = None
            if num_retries and 500 <= resp.status_code < 600:
                # recursively retry 5xx HTTP errors
                return download(url, num_retries - 1, user_agent, proxies)
    except requests.exceptions.RequestException as e:
        print('Download error:', e)
        html = None
    return html
def get_robots_parser(robots_url):
    " Return the robots parser object using the robots_url "
    rp = robotparser.RobotFileParser()
    rp.set_url(robots_url)
    rp.read()
    return rp

def get_links(html):
    """ Return a list of links (using simple regex matching)
        from the html content """
    # a regular expression to extract all links from the webpage
    webpage_regex = re.compile("""<a[^>]+href=["'](.*?)["']""", re.IGNORECASE)
    # list of all links from the webpage
    return webpage_regex.findall(html)
def link_crawler(start_url, link_regex, robots_url=None, user_agent='wswp',
                 proxies=None, delay=3, max_depth=4):
    """ Crawl from the given start URL following links matched by link_regex.
        In the current implementation, we do not actually scrape any information.
        args:
            start_url (str): web site to start crawl
            link_regex (str): regex to match for links
        kwargs:
            robots_url (str): url of the site's robots.txt
                              (default: start_url + /robots.txt)
            user_agent (str): user agent (default: wswp)
            proxies (dict): proxy dict w/ keys 'http' and 'https', values
                            are strs (i.e. 'http(s)://IP') (default: None)
            delay (int): seconds to throttle between requests
                         to one domain (default: 3)
            max_depth (int): maximum crawl depth (to avoid traps) (default: 4)
    """
    crawl_queue = [start_url]
    # keep track of which URLs have been seen before, and at what depth
    seen = {}
    if not robots_url:
        robots_url = '{}/robots.txt'.format(start_url)
    rp = get_robots_parser(robots_url)
    throttle = Throttle(delay)
    while crawl_queue:
        url = crawl_queue.pop()
        # check url passes robots.txt restrictions
        if rp.can_fetch(user_agent, url):
            depth = seen.get(url, 0)
            if depth == max_depth:
                print('Skipping %s due to depth' % url)
                continue
            throttle.wait(url)
            html = download(url, user_agent=user_agent, proxies=proxies)
            if not html:
                continue
            # TODO: add actual data scraping here
            # filter for links matching our regular expression
            for link in get_links(html):
                if re.match(link_regex, link):
                    abs_link = urljoin(start_url, link)
                    if abs_link not in seen:
                        seen[abs_link] = depth + 1
                        crawl_queue.append(abs_link)
        else:
            print('Blocked by robots.txt:', url)
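- Example call (same usage as the urllib-based final version above; assumes the chp1.throttle module is importable):
link_crawler('http://example.python-scraping.com/index', '/(index|view)/', max_depth=1)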
1.6 Chapter summary
- 1. Introduced web crawlers
- 2. Presented a mature, reusable crawler
- 3. Covered the use of several external tools and modules (understanding a website, user agents, sitemaps, crawl delays, and other advanced crawling techniques)