Spider: Scraping Web Page Content (Beautiful Soup)
http://jingyan.baidu.com/article/afd8f4de6197c834e386e96b.html
http://cuiqingcai.com/1319.html
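Installing Beautiful Soup on Windows:
1. Download the archive: https://www.crummy.com/software/BeautifulSoup/#Download
2. Extract it into the Python directory.
3. Navigate to the extracted directory (the one containing setup.py), then run the following commands: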
python setup.py build
python setup.py install
4. Start Python and import the bs4 module. If the import raises no error, the installation succeeded:
from bs4 import BeautifulSoup
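The check in step 4 can also be run from a script. The following is a minimal sketch, assuming Python 3.4 or later (the rest of these notes use Python 2); the helper name `is_installed` is made up for illustration and is not part of bs4.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be located, without importing it."""
    return importlib.util.find_spec(module_name) is not None


if is_installed("bs4"):
    import bs4
    print("Beautiful Soup", bs4.__version__, "is installed")
else:
    print("bs4 not found; repeat the install steps")
```

Looking the module up before importing it lets the script print a readable message instead of stopping with an ImportError traceback.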
Example: scraping a weather forecast with Beautiful Soup:
# -*- coding: UTF-8 -*-
import sys
import urllib2

from bs4 import BeautifulSoup as bs

# Python 2 only: make UTF-8 the default encoding so Chinese text prints cleanly
reload(sys)
sys.setdefaultencoding('utf-8')

url = 'http://www.weather.com.cn/weather/101010100.shtml'
req = urllib2.Request(url)
res = urllib2.urlopen(req).read()
# Name the parser explicitly so the result does not depend on which parsers are installed
soup = bs(res, 'html.parser')
# print soup.prettify()

# The whole 7-day forecast sits inside this div.
# find_all returns a ResultSet (a list), so take index 0.
divsw = soup.find_all('div', class_='c7d', id='7d')[0]

divs_date = divsw.find_all('h1')  # find dates
for h in divs_date:
    print h.string

divs_wea = divsw.find_all('p', class_='wea')  # find weather descriptions
for p in divs_wea:
    print p.get('title')

divs_tem = divsw.find_all('p', class_='tem')  # find temperatures
for tem in divs_tem:
    tem_max = tem.find('span').string  # daily high
    tem_min = tem.find('i').string     # daily low
    print tem_min, '-', tem_max
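The parsing logic can be tried without any network access. The following is a rough sketch in Python 3, run against a hand-written HTML fragment. `SAMPLE_HTML` and `parse_forecast` are hypothetical names, and the fragment only imitates the tags the script above searches for (`div#7d.c7d`, `h1`, `p.wea`, `p.tem`); it is not a copy of the real page. The guard for a missing `span` is likewise an assumption about how the markup might vary.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, reduced to the tags the scraper looks for.
SAMPLE_HTML = """
<div id="7d" class="c7d">
  <ul>
    <li>
      <h1>Day 1</h1>
      <p class="wea" title="Sunny">Sunny</p>
      <p class="tem"><i>12</i></p>
    </li>
    <li>
      <h1>Day 2</h1>
      <p class="wea" title="Cloudy">Cloudy</p>
      <p class="tem"><span>25</span>/<i>14</i></p>
    </li>
  </ul>
</div>
"""


def parse_forecast(html):
    """Return one dict per forecast day found in the given HTML string."""
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find("div", id="7d", class_="c7d")
    if container is None:
        return []  # forecast block not present

    dates = [h.get_text(strip=True) for h in container.find_all("h1")]
    weather = [p.get("title") for p in container.find_all("p", class_="wea")]

    temps = []
    for p in container.find_all("p", class_="tem"):
        high = p.find("span")  # may be absent in this sketch's sample data
        low = p.find("i")
        temps.append((
            low.get_text(strip=True) if low else None,
            high.get_text(strip=True) if high else None,
        ))

    return [
        {"date": d, "weather": w, "low": lo, "high": hi}
        for d, w, (lo, hi) in zip(dates, weather, temps)
    ]


for row in parse_forecast(SAMPLE_HTML):
    print(row["date"], row["weather"], row["low"], "-", row["high"])
```

Returning a list of dicts instead of printing inside the loops keeps each day's date, weather, and temperatures together, and a missing tag yields `None` rather than an AttributeError.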
Result: