Web Scraping with Python (Second Edition)
Example website: http://example.python-scraping.com
Resources: https://www.epubit.com/
Chapter 1: Introduction to web scraping
1.1 When is web scraping useful?
- To gather large amounts of data from the web in a structured format (possible by hand in theory, but automation saves time and effort)
1.2 Is web scraping legal?
- If the scraped data is for personal use and falls under fair use of copyright law, it is usually not a problem
1.3 Python 3
- Tools (see the example commands below):
- Anaconda
- virtualenvwrapper (https://virtualenvwrapper.readthedocs.io/en/latest)
- conda (https://conda.io/docs/intro.html)
- Python version: Python 3.4+
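- A minimal sketch of creating an isolated environment with the standard venv module (the environment name wswp_env and the pip install line are illustrative, not from the book):
$ python3 -m venv wswp_env
$ source wswp_env/bin/activate
$ pip install requests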
1.4 Background research
- Research tools:
- robots.txt
- sitemap
- Google -> WHOIS
1.4.1 Checking robots.txt
- Learn the crawling restrictions of the target site (a small robotparser sketch follows below)
- Can also reveal clues about the site's structure
- See: http://robotstxt.org
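- A minimal robotparser sketch (not code from the book; the user agent names are illustrative) for checking whether a given user agent is allowed to fetch a URL:
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url('http://example.python-scraping.com/robots.txt')
rp.read()
# True or False depending on the rules the site's robots.txt actually declares
print(rp.can_fetch('BadCrawler', 'http://example.python-scraping.com/'))
print(rp.can_fetch('wswp', 'http://example.python-scraping.com/'))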
1.4.2 Checking the sitemap
- Helps a crawler locate the site's most recent content without having to crawl every single page
- Sitemap standard: http://www.sitemaps.org/protocol.html
1.4.3 Estimating the size of a website
- The size of the target site affects how we crawl it: it is a question of efficiency
- Tool: Google advanced search with the site: keyword (https://www.google.com/advanced_search)
- Adding a URL path after the domain filters the results to show only certain parts of the site (illustrative queries below)
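- Example queries (the /places/default/view path is taken from the sitemap output later in these notes):
site:example.python-scraping.com
site:example.python-scraping.com/places/default/view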
1.4.4 Identifying the technologies used by a website
- detectem module (pip install detectem)
- Tools:
- Install Docker (http://www.docker.com/products/overview)
- bash: $ docker pull scrapinghub/splash
- bash: $ pip install detectem
- Python virtual environments (https://docs.python.org/3/library/venv.html)
- conda environments (https://conda.io/docs/using/envs.html)
- See the project's README (https://github.com/spectresearch/detectem)
$ det http://example.python-scraping.com
'''
[{'name': 'jquery', 'version': '1.11.0'},
{'name': 'modernizr', 'version': '2.7.1'},
{'name': 'nginx', 'version': '1.12.2'}]
'''
$ docker pull wappalyzer/cli
$ docker run wappalyzer/cli http://example.python-scraping.com
1.4.5 Finding the owner of a website
- To find the owner, query the registered owner of the domain using the WHOIS protocol
- Python has a library that wraps this protocol (https://pypi.python.org/pypi/python-whois)
- Install: pip install python-whois
import whois
# pass a domain name to look up; the example site's domain is used here
print(whois.whois('example.python-scraping.com'))
1.5 Writing your first web crawler
- Crawling: downloading the web pages that contain the data of interest
- There are many ways to crawl, and the most suitable choice depends on the structure of the target site
- Three common approaches to crawling a site:
- Crawling the sitemap
- Iterating over each page using database IDs
- Following page links
1.5.1 Scraping versus crawling
- Scraping: targets a specific website and extracts specified information from that site
- Crawling: built in a generic way, targeting a set of top-level domains or the whole web. It can be used to collect more specific information, but more commonly it crawls the web broadly, gathering small, generic pieces of information from different sites or pages and then following links to other pages.
1.5.2 Downloading a web page
1.5.2.1 Downloading a web page
- Temporary errors are common when downloading:
- Server overload (503 Service Unavailable): wait briefly and then retry the download
- Page not found (404 Not Found)
- Client-side request problems (4xx): retrying will not help
- Server-side problems (5xx): retrying can help
1.5.2.2 Setting a user agent
import urllib.request
from urllib.error import URLError, HTTPError, ContentTooShortError

def download(url, num_retries=2, user_agent='wswp'):
    print('Downloading:', url)
    # set the user agent (default: 'wswp')
    request = urllib.request.Request(url)
    request.add_header('User-agent', user_agent)
    try:
        html = urllib.request.urlopen(request).read()
    except (URLError, HTTPError, ContentTooShortError) as e:
        print('Download error:', e.reason)
        html = None
        if num_retries > 0:
            if hasattr(e, 'code') and 500 <= e.code < 600:
                # recursively retry 5xx HTTP errors
                return download(url, num_retries - 1, user_agent)
    return html
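- Example call (assumes network access; the returned bytes are not shown here):
html = download('http://example.python-scraping.com')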
1.5.3 Sitemap crawler
- Use a regular expression to extract the URLs listed in the sitemap (discovered via robots.txt) from the <loc> tags
# import the URL handling library
import urllib.request
# import the regular expression library
import re
# import the download error classes
from urllib.error import URLError, HTTPError, ContentTooShortError

def download(url, num_retries=2, user_agent='wswp', charset='utf-8'):
    print('Downloading:', url)
    request = urllib.request.Request(url)
    request.add_header('User-agent', user_agent)
    try:
        resp = urllib.request.urlopen(request)
        cs = resp.headers.get_content_charset()
        if not cs:
            cs = charset
        html = resp.read().decode(cs)
    except (URLError, HTTPError, ContentTooShortError) as e:
        print('Download error:', e.reason)
        html = None
        if num_retries > 0:
            if hasattr(e, 'code') and 500 <= e.code < 600:
                # recursively retry 5xx HTTP errors
                return download(url, num_retries - 1, user_agent, charset)
    return html

def crawl_sitemap(url):
    # download the sitemap file
    sitemap = download(url)
    # extract the sitemap links
    links = re.findall('<loc>(.*?)</loc>', sitemap)
    # download each link
    for link in links:
        html = download(link)
        # scrape html here

test_url = 'http://example.python-scraping.com/sitemap.xml'
crawl_sitemap(test_url)
'''
Downloading: http://example.python-scraping.com/sitemap.xml
Downloading: http://example.python-scraping.com/places/default/view/Afghanistan-1
Downloading: http://example.python-scraping.com/places/default/view/Aland-Islands-2
Downloading: http://example.python-scraping.com/places/default/view/Albania-3
Downloading: http://example.python-scraping.com/places/default/view/Algeria-4
Downloading: http://example.python-scraping.com/places/default/view/American-Samoa-5
Downloading: http://example.python-scraping.com/places/default/view/Andorra-6
Downloading: http://example.python-scraping.com/places/default/view/Angola-7
Downloading: http://example.python-scraping.com/places/default/view/Anguilla-8
Downloading: http://example.python-scraping.com/places/default/view/Antarctica-9
Downloading: http://example.python-scraping.com/places/default/view/Antigua-and-Barbuda-10
Downloading: http://example.python-scraping.com/places/default/view/Argentina-11
...
'''
1.5.4 ID iteration crawler
import itertools
import urllib.request
from urllib.error import URLError, HTTPError, ContentTooShortError

def download(url, num_retries=2, user_agent='wswp', charset='utf-8'):
    print('Downloading:', url)
    request = urllib.request.Request(url)
    request.add_header('User-agent', user_agent)
    try:
        resp = urllib.request.urlopen(request)
        cs = resp.headers.get_content_charset()
        if not cs:
            cs = charset
        html = resp.read().decode(cs)
    except (URLError, HTTPError, ContentTooShortError) as e:
        print('Download error:', e.reason)
        html = None
        if num_retries > 0:
            if hasattr(e, 'code') and 500 <= e.code < 600:
                # recursively retry 5xx HTTP errors
                return download(url, num_retries - 1, user_agent, charset)
    return html

def crawl_site(url, max_errors=5):
    # iterate over numeric IDs until max_errors consecutive download errors occur
    num_errors = 0
    for page in itertools.count(1):
        pg_url = '{}{}'.format(url, page)
        html = download(pg_url)
        if html is None:
            num_errors += 1
            if num_errors == max_errors:
                # reached max number of consecutive errors, so exit
                break
        else:
            num_errors = 0
            # success - can scrape the result

test_url2 = 'http://example.python-scraping.com/view/-'
# there is still a problem here, to be debugged: judging by the sitemap output above,
# the path may need to be '/places/default/view/-' rather than '/view/-'
crawl_site(test_url2)
1.5.5 Link crawler
- Use a regular expression to decide which pages should be downloaded
# regular expressions
import re
# sending requests
import urllib.request
# URL parsing and joining relative links
from urllib.parse import urljoin
# download error types
from urllib.error import URLError, HTTPError, ContentTooShortError

def download(url, num_retries=2, user_agent='wswp', charset='utf-8'):
    print('Downloading:', url)
    request = urllib.request.Request(url)
    request.add_header('User-agent', user_agent)
    try:
        resp = urllib.request.urlopen(request)
        cs = resp.headers.get_content_charset()
        if not cs:
            cs = charset
        html = resp.read().decode(cs)
    except (URLError, HTTPError, ContentTooShortError) as e:
        print('Download error:', e.reason)
        html = None
        if num_retries > 0:
            if hasattr(e, 'code') and 500 <= e.code < 600:
                # recursively retry 5xx HTTP errors
                return download(url, num_retries - 1, user_agent, charset)
    return html

def link_crawler(start_url, link_regex):
    " Crawl from the given start URL following links matched by link_regex "
    crawl_queue = [start_url]
    # keep track of which URLs have been seen before
    seen = set(crawl_queue)
    while crawl_queue:
        url = crawl_queue.pop()
        html = download(url)
        if not html:
            continue
        # filter for links matching our regular expression
        for link in get_links(html):
            if re.match(link_regex, link):
                abs_link = urljoin(start_url, link)
                if abs_link not in seen:
                    seen.add(abs_link)
                    crawl_queue.append(abs_link)

def get_links(html):
    " Return a list of links from html "
    # a regular expression to extract all links from the webpage
    webpage_regex = re.compile("""<a[^>]+href=["'](.*?)["']""", re.IGNORECASE)
    # list of all links from the webpage
    return webpage_regex.findall(html)
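- Example call (illustrative; this basic version has no robots.txt handling, throttling, or depth limit yet; the same start URL and regex appear with the final version later):
link_crawler('http://example.python-scraping.com/index', '/(index|view)/')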
Advanced features (all implemented in the final version below):
- 1. Parse the robots.txt file to avoid downloading URLs that are disallowed for crawling; the robotparser module in Python's urllib library makes this easy
- 2. Proxy support: sometimes a site must be accessed through a proxy; Python's urllib supports proxies
- 3. Download throttling: to reduce the risk of being blocked, add a delay between two downloads to the same domain, slowing the crawler down
- 4. Avoiding spider traps: some pages generate an unbounded number of links and would keep the crawler downloading forever; record the current crawl depth and stop at a maximum depth
Final version
# final version
from urllib.parse import urlparse
import time

class Throttle:
    """ Add a delay between downloads to the same domain
    """
    def __init__(self, delay):
        # amount of delay between downloads for each domain
        self.delay = delay
        # timestamp of when a domain was last accessed
        self.domains = {}

    def wait(self, url):
        domain = urlparse(url).netloc
        last_accessed = self.domains.get(domain)
        if self.delay > 0 and last_accessed is not None:
            sleep_secs = self.delay - (time.time() - last_accessed)
            if sleep_secs > 0:
                # domain has been accessed recently, so need to sleep
                time.sleep(sleep_secs)
        # update the last accessed time
        self.domains[domain] = time.time()
import re
import urllib.request
from urllib import robotparser
from urllib.parse import urljoin
from urllib.error import URLError, HTTPError, ContentTooShortError
# from throttle import Throttle  # if the Throttle class is kept in a separate module
def download(url, num_retries=2, user_agent='wswp', charset='utf-8', proxy=None):
    """ Download a given URL and return the page content
        args:
            url (str): URL
        kwargs:
            user_agent (str): user agent (default: wswp)
            charset (str): charset if website does not include one in headers
            proxy (str): proxy url, ex 'http://IP' (default: None)
            num_retries (int): number of retries if a 5xx error is seen (default: 2)
    """
    print('Downloading:', url)
    request = urllib.request.Request(url)
    request.add_header('User-agent', user_agent)
    try:
        if proxy:
            proxy_support = urllib.request.ProxyHandler({'http': proxy})
            opener = urllib.request.build_opener(proxy_support)
            urllib.request.install_opener(opener)
        resp = urllib.request.urlopen(request)
        cs = resp.headers.get_content_charset()
        if not cs:
            cs = charset
        html = resp.read().decode(cs)
    except (URLError, HTTPError, ContentTooShortError) as e:
        print('Download error:', e.reason)
        html = None
        if num_retries > 0:
            if hasattr(e, 'code') and 500 <= e.code < 600:
                # recursively retry 5xx HTTP errors
                return download(url, num_retries - 1, user_agent, charset, proxy)
    return html
def get_robots_parser(robots_url):
    " Return the robots parser object using the robots_url "
    rp = robotparser.RobotFileParser()
    rp.set_url(robots_url)
    rp.read()
    return rp

def get_links(html):
    " Return a list of links (using simple regex matching) from the html content "
    # a regular expression to extract all links from the webpage
    webpage_regex = re.compile("""<a[^>]+href=["'](.*?)["']""", re.IGNORECASE)
    # list of all links from the webpage
    return webpage_regex.findall(html)
def link_crawler(start_url, link_regex, robots_url=None, user_agent='wswp',
                 proxy=None, delay=3, max_depth=4):
    """ Crawl from the given start URL following links matched by link_regex. In the current
        implementation, we do not actually scrape any information.
        args:
            start_url (str): web site to start crawl
            link_regex (str): regex to match for links
        kwargs:
            robots_url (str): url of the site's robots.txt (default: start_url + /robots.txt)
            user_agent (str): user agent (default: wswp)
            proxy (str): proxy url, ex 'http://IP' (default: None)
            delay (int): seconds to throttle between requests to one domain (default: 3)
            max_depth (int): maximum crawl depth (to avoid traps) (default: 4)
    """
    crawl_queue = [start_url]
    # keep track of which URLs have been seen before, and at what depth
    seen = {}
    if not robots_url:
        robots_url = '{}/robots.txt'.format(start_url)
    rp = get_robots_parser(robots_url)
    throttle = Throttle(delay)
    while crawl_queue:
        url = crawl_queue.pop()
        # check url passes robots.txt restrictions
        if rp.can_fetch(user_agent, url):
            depth = seen.get(url, 0)
            if depth == max_depth:
                print('Skipping %s due to depth' % url)
                continue
            throttle.wait(url)
            html = download(url, user_agent=user_agent, proxy=proxy)
            if not html:
                continue
            # TODO: add actual data scraping here
            # filter for links matching our regular expression
            for link in get_links(html):
                if re.match(link_regex, link):
                    abs_link = urljoin(start_url, link)
                    if abs_link not in seen:
                        seen[abs_link] = depth + 1
                        crawl_queue.append(abs_link)
        else:
            print('Blocked by robots.txt:', url)
link_regex = '/(index|view)/'
link_crawler('http://example.python-scraping.com/index', link_regex, max_depth=1)
1.5.6 Using the requests library
- Mainstream Python crawlers generally use the requests library to manage complex HTTP requests
- It is simple enough and easy to use
- Install: $ pip install requests
# advanced link crawler using the requests library
import re
from urllib import robotparser
from urllib.parse import urljoin

import requests

# the Throttle class defined earlier, kept in a local module
from chp1.throttle import Throttle

def download(url, num_retries=2, user_agent='wswp', proxies=None):
    """ Download a given URL and return the page content
        args:
            url (str): URL
        kwargs:
            user_agent (str): user agent (default: wswp)
            proxies (dict): proxy dict w/ keys 'http' and 'https', values
                            are strs (i.e. 'http(s)://IP') (default: None)
            num_retries (int): # of retries if a 5xx error is seen (default: 2)
    """
    print('Downloading:', url)
    headers = {'User-Agent': user_agent}
    try:
        resp = requests.get(url, headers=headers, proxies=proxies)
        html = resp.text
        if resp.status_code >= 400:
            print('Download error:', resp.text)
            html = None
            if num_retries and 500 <= resp.status_code < 600:
                # recursively retry 5xx HTTP errors
                return download(url, num_retries - 1, user_agent, proxies)
    except requests.exceptions.RequestException as e:
        print('Download error:', e)
        html = None
    return html
def get_robots_parser(robots_url):
    " Return the robots parser object using the robots_url "
    rp = robotparser.RobotFileParser()
    rp.set_url(robots_url)
    rp.read()
    return rp

def get_links(html):
    """ Return a list of links (using simple regex matching)
        from the html content """
    # a regular expression to extract all links from the webpage
    webpage_regex = re.compile("""<a[^>]+href=["'](.*?)["']""", re.IGNORECASE)
    # list of all links from the webpage
    return webpage_regex.findall(html)
def link_crawler(start_url, link_regex, robots_url=None, user_agent='wswp',
                 proxies=None, delay=3, max_depth=4):
    """ Crawl from the given start URL following links matched by link_regex.
        In the current implementation, we do not actually scrape any information.
        args:
            start_url (str): web site to start crawl
            link_regex (str): regex to match for links
        kwargs:
            robots_url (str): url of the site's robots.txt
                              (default: start_url + /robots.txt)
            user_agent (str): user agent (default: wswp)
            proxies (dict): proxy dict w/ keys 'http' and 'https', values
                            are strs (i.e. 'http(s)://IP') (default: None)
            delay (int): seconds to throttle between requests
                         to one domain (default: 3)
            max_depth (int): maximum crawl depth (to avoid traps) (default: 4)
    """
    crawl_queue = [start_url]
    # keep track of which URLs have been seen before, and at what depth
    seen = {}
    if not robots_url:
        robots_url = '{}/robots.txt'.format(start_url)
    rp = get_robots_parser(robots_url)
    throttle = Throttle(delay)
    while crawl_queue:
        url = crawl_queue.pop()
        # check url passes robots.txt restrictions
        if rp.can_fetch(user_agent, url):
            depth = seen.get(url, 0)
            if depth == max_depth:
                print('Skipping %s due to depth' % url)
                continue
            throttle.wait(url)
            html = download(url, user_agent=user_agent, proxies=proxies)
            if not html:
                continue
            # TODO: add actual data scraping here
            # filter for links matching our regular expression
            for link in get_links(html):
                if re.match(link_regex, link):
                    abs_link = urljoin(start_url, link)
                    if abs_link not in seen:
                        seen[abs_link] = depth + 1
                        crawl_queue.append(abs_link)
        else:
            print('Blocked by robots.txt:', url)
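- Example call (same usage as the urllib-based final version above; assumes the chp1.throttle module is importable):
link_crawler('http://example.python-scraping.com/index', '/(index|view)/', max_depth=1)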
1.6 Chapter summary
- 1. Introduced web crawlers
- 2. Presented a mature, reusable crawler
- 3. Covered the use of several external tools and modules (understanding a website, user agents, sitemaps, crawl delays, and other advanced crawling techniques)