spider - Scraping web page content (Beautiful Soup)

http://jingyan.baidu.com/article/afd8f4de6197c834e386e96b.html

http://cuiqingcai.com/1319.html

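Installing Beautiful Soup on Windows:

1. Download the archive: https://www.crummy.com/software/BeautifulSoup/#Download

2. Extract it into the Python directory

3. Navigate to the extracted directory (the one containing setup.py), then run the following commands: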

   python setup.py build

   python setup.py install

4. Start Python and import the BS module; if the import raises no error, the installation succeeded

   from bs4 import BeautifulSoup
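
The check in step 4 can also be scripted. The following is a minimal sketch, written for Python 3; the helper name is_installed is hypothetical and not part of Beautiful Soup. It only reports whether a module can be imported and does not install anything.

```python
import importlib


def is_installed(module_name):
    """Return True if the named module can be imported, False otherwise."""
    try:
        importlib.import_module(module_name)
    except ImportError:
        return False
    return True


# "bs4" is the package name that the import in step 4 relies on.
bs4_ok = is_installed("bs4")
print("bs4 importable:", bs4_ok)
```

If this prints False, the install step most likely ran against a different Python interpreter than the one executing the script.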

 

Example: scraping a weather forecast with bs:

# -*- coding: UTF-8 -*-

import urllib2, sys
from bs4 import BeautifulSoup as bs

reload(sys)
sys.setdefaultencoding('utf-8')

url='http://www.weather.com.cn/weather/101010100.shtml'
req = urllib2.Request(url)
res = urllib2.urlopen(req).read()

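soup = bs(res, 'html.parser')  # name the parser explicitly so results do not depend on which parsers are installed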
#print soup.prettify()


divsw = soup.find_all('div',class_='c7d',id='7d')[0]  # the 7-day forecast lives under this div; find_all returns a list-like ResultSet, so take index 0
divs_date = divsw.find_all('h1') #find date
for h in divs_date:
    print h.string

divs_wea = divsw.find_all('p',class_='wea') #find weather
for p in divs_wea:
    print p.get('title')

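divs_tem = divsw.find_all('p',class_='tem') #find temperatures
for tem in divs_tem:
    span = tem.find('span')  # the high may be missing for some entries, in which case find returns None
    tem_max = span.string if span else ''
    tem_min = tem.find('i').string
    print tem_min,'-',tem_max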
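
The same extraction can be tried offline before pointing it at the live site. The following is a rough sketch for Python 3 (the listing above targets Python 2). SAMPLE_HTML is made-up markup that only mimics the div/h1/p structure the listing assumes, and fetch_html and parse_forecast are hypothetical helper names. Whether the real page still uses this markup is not verified here.

```python
import urllib.request

from bs4 import BeautifulSoup

# Hypothetical sample that imitates the structure the listing expects.
# The first day deliberately has no <span> (no high temperature).
SAMPLE_HTML = """
<div class="c7d" id="7d">
  <ul>
    <li>
      <h1>18 (today)</h1>
      <p class="wea" title="Sunny">Sunny</p>
      <p class="tem"><i>3C</i></p>
    </li>
    <li>
      <h1>19 (tomorrow)</h1>
      <p class="wea" title="Cloudy">Cloudy</p>
      <p class="tem"><span>17C</span>/<i>5C</i></p>
    </li>
  </ul>
</div>
"""


def fetch_html(url, timeout=10):
    """Download a page and decode it to text (not called in this offline demo)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")


def text_or_none(tag):
    """Return the stripped text of a tag, or None if the tag was not found."""
    return tag.get_text(strip=True) if tag is not None else None


def parse_forecast(html):
    """Return one dict per forecast day found under the div with id '7d'."""
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find("div", id="7d")
    if container is None:
        return []
    days = []
    for item in container.find_all("li"):
        wea = item.find("p", class_="wea")
        tem = item.find("p", class_="tem")
        days.append({
            "date": text_or_none(item.find("h1")),
            "weather": wea.get("title") if wea is not None else None,
            "high": text_or_none(tem.find("span")) if tem is not None else None,
            "low": text_or_none(tem.find("i")) if tem is not None else None,
        })
    return days


forecast = parse_forecast(SAMPLE_HTML)
for day in forecast:
    print(day["date"], day["weather"], day["low"], "-", day["high"])
```

Grouping by li keeps each day's date, weather, and temperatures together, so a missing value shows up as None instead of shifting the other columns. To try it against the live page, pass the result of fetch_html(url) to parse_forecast.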



        

Result:

posted on 2016-03-18 10:53 by momingliu11