Scraper 5: a single-page scraper for Zhuhai historical weather
I spent a few hours writing a Python scraper for Zhuhai's historical weather; here are my notes.
1 Import the requests and bs4 modules

```python
import requests
from bs4 import BeautifulSoup
```
2 Target URL

```python
url = 'http://lishi.tianqi.com/zhuhai/201512.html'
```
3 Define a headers dict so the request masquerades as a browser

```python
headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'zh-CN,zh;q=0.8,en-US;q=0.5,en;q=0.3',
    'Accept-Encoding': 'gzip, deflate',
    'Connection': 'keep-alive',
    'Host': 'lishi.tianqi.com',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:45.0) Gecko/20100101 Firefox/45.0',
}
```
4 Fetch the response and set its encoding. r.apparent_encoding detects the page's actual encoding (the site declares GB2312 in its <meta> tag, which you can check in the page source); without this, requests falls back to a default codec that garbles the Chinese text.

```python
r = requests.get(url, headers=headers)
r.encoding = r.apparent_encoding  # use the detected encoding instead of the default
```
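To see why this step matters, here is a minimal standalone illustration (not part of the scraper) of what happens when GB2312-encoded bytes are decoded with the wrong codec versus the detected one:

```python
# Standalone illustration: decoding GB2312 bytes with the wrong codec
# produces mojibake; decoding with the detected codec recovers the text.
raw = '珠海'.encode('gb2312')      # bytes as the server would send them
wrong = raw.decode('iso-8859-1')   # a wrong fallback codec: garbled output
right = raw.decode('gb2312')       # the codec apparent_encoding would detect
print(wrong)
print(right)                       # 珠海
```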
5 Open the output file in 'w+' mode (opened with encoding='utf-8' so Chinese text is written correctly)

```python
fd = open('w201512.txt', 'w+', encoding='utf-8')
```
6 Use bs4 to grab the <ul> elements directly under the div with class tqtongji2:

```python
soup = BeautifulSoup(r.text, "html.parser")
res_div = soup.select("div.tqtongji2 > ul")
```
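To make the selector concrete, here is a sketch run against a hypothetical miniature of the page's structure (the HTML below is invented for illustration; the real page has more columns):

```python
from bs4 import BeautifulSoup

# Invented miniature of the page layout: one <ul> per day, one <li> per field.
html = '''
<div class="tqtongji2">
  <ul><li>2015-12-01</li><li>22</li><li>15</li></ul>
  <ul><li>2015-12-02</li><li>21</li><li>14</li></ul>
</div>
'''
soup = BeautifulSoup(html, "html.parser")
rows = soup.select("div.tqtongji2 > ul")   # same selector as the scraper
print(len(rows))                           # 2
print(rows[0].select("li")[0].get_text())  # 2015-12-01
```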
7 Loop over the <li> elements inside each <ul>; get_text() returns a tag's text content (already a decoded str in Python 3, so no manual encode() call is needed). Each <ul> becomes one row, each <li> one comma-separated field:

```python
for item in res_div:
    res_li = item.select("li")
    for item_li in res_li:
        text = item_li.get_text()
        fd.write(text)
        fd.write(',')
        print(text)
    fd.write('\n')
```
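As a side note, if any field could itself contain a comma, the stdlib csv module quotes values safely; a sketch with invented sample rows standing in for the scraped data:

```python
import csv

# Sketch with invented sample rows: csv.writer handles quoting, so a
# field containing a comma would not break the output format.
rows = [['2015-12-01', '22', '15'], ['2015-12-02', '21', '14']]
with open('w201512.csv', 'w', newline='', encoding='utf-8') as f:
    csv.writer(f).writerows(rows)
```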
8 Close the output file

```python
fd.close()
```
Full source:

```python
import requests
from bs4 import BeautifulSoup

url = 'http://lishi.tianqi.com/zhuhai/201512.html'
headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'zh-CN,zh;q=0.8,en-US;q=0.5,en;q=0.3',
    'Accept-Encoding': 'gzip, deflate',
    'Connection': 'keep-alive',
    'Host': 'lishi.tianqi.com',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:45.0) Gecko/20100101 Firefox/45.0',
}

r = requests.get(url, headers=headers)
r.encoding = r.apparent_encoding                # use the detected encoding (GB2312)

fd = open('w201512.txt', 'w+', encoding='utf-8')

soup = BeautifulSoup(r.text, "html.parser")
res_div = soup.select("div.tqtongji2 > ul")     # one <ul> per row of the table

for item in res_div:
    res_li = item.select("li")                  # one <li> per field
    for item_li in res_li:
        text = item_li.get_text()
        fd.write(text)
        fd.write(',')
        print(text)
    fd.write('\n')

fd.close()
```
Result: the scraped data for one month of Zhuhai weather: