Writing a Web Crawler in Python from Scratch, Part 4: A Link Crawler Using Regular Expressions

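In the earlier parts of this series we wrote two basic crawlers. For sites with a lot of content, however, the crawler needs to follow links, using a regular expression to decide which pages should be downloaded.
1. Following links with a regular expression. The urlparse module is used to convert relative paths into absolute paths.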
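As a small illustration of the relative-to-absolute conversion mentioned above, here is a minimal sketch using urljoin. In Python 2 it lives in the urlparse module; in Python 3 that module was renamed urllib.parse, hence the fallback import. The base URL is the example site crawled in this post, and the relative paths are made-up samples rather than links taken from the real site.

```python
try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin  # Python 2, as used in this post

base_url = 'http://example.webscraping.com'

# hypothetical relative links, of the kind a page on the site might contain
relative_links = ['/index/1', '/view/1']

# resolve each relative link against the base URL
absolute_links = [urljoin(base_url, link) for link in relative_links]
for link in absolute_links:
    print(link)
```

A link that is already absolute passes through urljoin unchanged, so it is safe to apply it to every extracted link.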

import re
import urlparse




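def link_crawler(seed_url, link_regex):
    """Crawl from the given seed URL following links matched by link_regex
    """
    crawl_queue = [seed_url]  # the queue of URLs to download
    seen = set(crawl_queue)  # URLs already queued, so no page is crawled twice
    while crawl_queue:
        url = crawl_queue.pop()
        # download() is assumed to be the function written earlier in this series
        html = download(url)
        if html is None:
            continue  # skip this page (assumes download returns None on failure)
        # filter for links matching our regular expression
        for link in get_links(html):
            if re.match(link_regex, link):
                # convert the relative link into an absolute URL
                link = urlparse.urljoin(seed_url, link)
                if link not in seen:
                    # add this new link to the crawl queue
                    seen.add(link)
                    crawl_queue.append(link)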


def get_links(html):
    """Return a list of links from html
    """
    # a regular expression to extract all links from the webpage
    webpage_regex = re.compile('<a[^>]+href=["\'](.*?)["\']', re.IGNORECASE)
    # list of all links from the webpage
    return webpage_regex.findall(html)


if __name__ == '__main__':
    link_crawler('http://example.webscraping.com', '/(index|view)')
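To try the crawling logic without touching the network, the following is a self-contained sketch, not the author's exact code. It assumes a third parameter, download, which is any callable that takes a URL and returns HTML text or None on failure. The FAKE_SITE dictionary and its pages are invented for illustration. The sketch also resolves relative links with urljoin and tracks a seen set so that pages linking to each other do not cause an endless loop.

```python
import re

try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin  # Python 2, as used in this post


def get_links(html):
    """Return a list of href values found in <a> tags."""
    webpage_regex = re.compile('<a[^>]+href=["\'](.*?)["\']', re.IGNORECASE)
    return webpage_regex.findall(html)


def link_crawler(seed_url, link_regex, download):
    """Crawl from seed_url, following links that match link_regex.

    download is an assumed interface: a callable taking a URL and
    returning HTML text, or None on failure.
    Returns the list of URLs visited, in crawl order.
    """
    crawl_queue = [seed_url]
    seen = set(crawl_queue)  # every URL that has ever been queued
    visited = []
    while crawl_queue:
        url = crawl_queue.pop()
        html = download(url)
        if html is None:
            continue  # nothing to parse for this URL
        visited.append(url)
        for link in get_links(html):
            if re.match(link_regex, link):
                link = urljoin(seed_url, link)  # relative -> absolute
                if link not in seen:
                    seen.add(link)
                    crawl_queue.append(link)
    return visited


# A tiny in-memory "site" (made-up pages) standing in for real downloads
FAKE_SITE = {
    'http://example.webscraping.com':
        '<a href="/index/1">next</a> <a href="/about">about</a>',
    'http://example.webscraping.com/index/1':
        '<a href="/view/1">item</a> <a href="/index/1">self</a>',
    'http://example.webscraping.com/view/1':
        '<a href="/index/1">back</a>',
}

visited = link_crawler('http://example.webscraping.com',
                       '/(index|view)', FAKE_SITE.get)
print(visited)
```

Because crawl_queue.pop() takes the most recently added URL, the crawl proceeds depth-first; popping from the front instead (for example with collections.deque and popleft) would give a breadth-first crawl.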

posted @ 2017-10-08 20:58 逍遥游2
