Writing Web Crawlers in Python, From Scratch (1): Writing the First Web Crawler

This article starts from the simplest possible crawler and improves it step by step by adding download-error detection, a user agent, and proxy support.
First, a note on how to use the code: it runs under Python 2.7, either from the command line or in PyCharm. Each step defines a function, and you call that function to fetch a page, for example:

        download1("http://www.baidu.com")
        download2("http://www.baidu.com")
        download3("http://www.baidu.com")



1. The first and simplest crawler, in three lines of code

import urllib2
import urlparse


def download1(url):
    """Simple downloader"""
    return urllib2.urlopen(url).read()
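
A minimal usage sketch (the Baidu URL is just the example from the introduction; any reachable page works):

# fetch the page and check how many bytes came back
html = download1('http://www.baidu.com')
print len(html)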

2. An upgrade: a crawler that catches download errors
def download2(url):
    """Download function that catches errors"""
    print 'Downloading:', url
    try:
        html = urllib2.urlopen(url).read()
    except urllib2.URLError as e:
        print 'Download error:', e.reason
        html = None
    return html
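
A quick sketch of the error handling; the second hostname is made up (it uses the reserved .invalid domain) so the request is guaranteed to fail:

# a good URL returns the HTML; a bad one prints the error and returns None
html = download2('http://www.baidu.com')
bad = download2('http://no-such-host.invalid')
print bad is None   # True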
3. 5xx errors generally occur on the server side, so add a check to the crawler: when the error code is at least 500 and below 600, retry the download up to 2 more times.
def download3(url, num_retries=2):
    """Download function that also retries 5XX errors"""
    print 'Downloading:', url
    try:
        html = urllib2.urlopen(url).read()
    except urllib2.URLError as e:
        print 'Download error:', e.reason
        html = None
        if num_retries > 0:
            if hasattr(e, 'code') and 500 <= e.code < 600:
                # retry 5XX HTTP errors
                html = download3(url, num_retries-1)
    return html
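
To watch the retry logic fire you need a URL that actually returns a 5xx status. The sketch below assumes the public test service httpstat.us is reachable; any URL of your own that returns 500 behaves the same way:

# the 500 response triggers the retries: 'Downloading:' is printed
# three times (one try plus two retries), then None is returned
html = download3('http://httpstat.us/500')
print html is None   # True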

4. Setting a user agent
In general, a crawler that uses the default settings will be blocked by some websites, so here we set a custom user agent named "wswp".

def download4(url, user_agent='wswp', num_retries=2):
    """Download function that includes user agent support"""
    print 'Downloading:', url
    headers = {'User-agent': user_agent}
    request = urllib2.Request(url, headers=headers)
    try:
        html = urllib2.urlopen(request).read()
    except urllib2.URLError as e:
        print 'Download error:', e.reason
        html = None
        if num_retries > 0:
            if hasattr(e, 'code') and 500 <= e.code < 600:
                # retry 5XX HTTP errors
                html = download4(url, user_agent, num_retries-1)
    return html
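
To check which User-agent the server actually receives, one option (an assumption here, not part of the original code) is the public echo service httpbin.org:

# httpbin.org/user-agent echoes the User-agent header back as JSON
print download4('http://httpbin.org/user-agent')
# expected output is roughly: {"user-agent": "wswp"}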
5. Supporting proxies
Sometimes we need to access a website through a proxy. For example, Netflix blocks most countries outside the United States. The requests module also supports proxies, but the code below stays with urllib2 to add proxy support.
def download5(url, user_agent='wswp', proxy=None, num_retries=2):
    """Download function with support for proxies"""
    print 'Downloading:', url
    headers = {'User-agent': user_agent}
    request = urllib2.Request(url, headers=headers)
    opener = urllib2.build_opener()
    if proxy:
        proxy_params = {urlparse.urlparse(url).scheme: proxy}
        opener.add_handler(urllib2.ProxyHandler(proxy_params))
    try:
        html = opener.open(request).read()
    except urllib2.URLError as e:
        print 'Download error:', e.reason
        html = None
        if num_retries > 0:
            if hasattr(e, 'code') and 500 <= e.code < 600:
                # retry 5XX HTTP errors
                html = download5(url, user_agent, proxy, num_retries-1)
    return html
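
A usage sketch for the proxy version; the proxy address below is only a placeholder, so substitute one you actually have access to:

# route the request through an HTTP proxy (the address is hypothetical)
html = download5('http://www.baidu.com', proxy='127.0.0.1:8080')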

