Downloading and scraping web page content
Three common ways to download a file
import urllib.request
import requests
Method 1
print('downloading with urllib')
url = ""
# In Python 3 urlretrieve lives in urllib.request; calling urllib.urlretrieve fails with
# "module 'urllib' has no attribute 'urlretrieve'", which is the error noted in the original
urllib.request.urlretrieve(url, "lanzous")
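As a side note, urlretrieve also accepts a progress callback. The sketch below is only an illustration; the URL and file name are placeholders, not part of the original notes.

import urllib.request

def report(block_num, block_size, total_size):
    # called repeatedly during the download; total_size is -1 if the server sends no Content-Length
    if total_size > 0:
        percent = min(block_num * block_size * 100 / total_size, 100)
        print('downloaded %.1f%%' % percent)

# placeholder URL and file name, for illustration only
urllib.request.urlretrieve("http://example.com/file.pdf", "file.pdf", reporthook=report)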
Method 2 (the approach most commonly used in Python)
r = requests.get(url)
with open("lanzo" + ".pdf", "wb") as f:
    f.write(r.content)  # r.content is bytes, so the file must be opened in binary mode ("wb");
                        # opening it in text mode raises "write() argument must be str, not bytes"
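For large files it can help to stream the response instead of holding it all in memory. This is a minimal sketch using requests' stream=True and iter_content; the URL and file name are placeholders.

import requests

url = "http://example.com/big.pdf"  # placeholder URL
with requests.get(url, stream=True) as r:
    r.raise_for_status()
    with open("big.pdf", "wb") as f:
        for chunk in r.iter_content(chunk_size=8192):  # write the body piece by piece
            f.write(chunk)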
Method 3
print('downloading with urllib2')
# urllib2 exists only in Python 2; in Python 3 the equivalent call is urllib.request.urlopen
url = ''
f = urllib.request.urlopen(url)
data = f.read()
with open("demo2.zip", "wb") as code:
    code.write(data)
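A memory-friendlier variant of the same idea, assuming Python 3 and a placeholder URL, copies the response straight to disk with shutil.copyfileobj:

import shutil
import urllib.request

url = "http://example.com/demo2.zip"  # placeholder URL
with urllib.request.urlopen(url) as resp, open("demo2.zip", "wb") as out:
    shutil.copyfileobj(resp, out)  # stream the response body to the file without loading it all at once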
Scraping web page content
Method 1
import urllib.request

url = "http://www.xxx.com"
html = urllib.request.urlopen(url).read()  # the most basic fetch; read() returns raw bytes
print(html)
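read() returns bytes, so to get a proper str you can decode with the charset the server declares, falling back to UTF-8. A small sketch, assuming Python 3 and the same placeholder URL:

import urllib.request

url = "http://www.xxx.com"
with urllib.request.urlopen(url) as resp:
    charset = resp.headers.get_content_charset() or "utf-8"  # fall back to UTF-8 if no charset is declared
    text = resp.read().decode(charset, errors="replace")
print(text)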
Method 2
import requests

url = "http://www.xxx.com"
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
           'Accept': 'text/html;q=0.9,*/*;q=0.8',
           'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
           'Accept-Encoding': 'gzip',
           'Connection': 'close',
           'Referer': None  # note: if the page still cannot be fetched, set this to the target site's host
           }
html = requests.get(url, headers=headers)
html.encoding = html.apparent_encoding  # guess the encoding from the body (the original note reports that Method 2 failed to run)
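If Method 2 keeps failing, a common first step is to add a timeout and surface the actual HTTP error instead of failing silently. This is only a sketch of that idea, with a trimmed-down headers dict:

import requests

url = "http://www.xxx.com"
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11'}

try:
    resp = requests.get(url, headers=headers, timeout=10)  # give up after 10 seconds instead of hanging
    resp.raise_for_status()                                # turn 4xx/5xx status codes into exceptions
    resp.encoding = resp.apparent_encoding                 # guess the encoding from the body, as in Method 2
    print(resp.text[:500])                                 # print the first 500 characters as a sanity check
except requests.RequestException as e:
    print('request failed:', e)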
Keep striving; don't be afraid, don't over-plan, don't get lost. Just keep walking the road, and even if you stall along the way, keep walking.