雅虎猫


Scraping Real-Estate Listing Data from a Website with a Python Web Crawler

I needed to collect real-estate listing information from the web, so I looked into how to do it with Python. The steps are as follows:

Step 1: fetch the HTML source of the page that contains the listing data. The urllib library handles the download:

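from urllib import request

# Page 1 of the listing; change the page parameter to fetch other pages
url = 'https://cd.ixiangzhu.com/House/lists.html?page=1'
resp = request.urlopen(url)
# resp.read() returns bytes; decode them as UTF-8 to get a str
html_data = resp.read().decode('utf-8')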

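Here url is the address of the page to open. After the call, html_data is a string holding the page's HTML source. resp.read() returns raw bytes, and decode('utf-8') converts them into text by decoding them as UTF-8. The listing data is at https://cd.ixiangzhu.com/House/lists.html?page=1. There are too many listings for one page, so the site paginates them: the 1 in page=1 means the first page, page=2 is the second page, and so on.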
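Since only the page number changes between listing pages, the URLs can be generated rather than typed by hand. The following is a minimal sketch, not the author's code: the helper names build_page_url and fetch_page and the timeout value are assumptions made for illustration.

```python
from urllib import request

# Listing URL from the article; only the page number varies.
BASE_URL = 'https://cd.ixiangzhu.com/House/lists.html?page={}'


def build_page_url(page):
    """Return the listing URL for a 1-based page number (hypothetical helper)."""
    return BASE_URL.format(page)


def fetch_page(page, timeout=10):
    """Download one listing page and return its HTML as a str.

    The 10-second timeout is an assumed value, not taken from the article.
    """
    with request.urlopen(build_page_url(page), timeout=timeout) as resp:
        # resp.read() returns bytes; decode them as UTF-8 to get text.
        return resp.read().decode('utf-8')
```

For example, [fetch_page(p) for p in range(1, 4)] would download the first three pages. The total number of pages is not stated in this post, so the upper bound has to be checked against the site.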

 

Step 2: analyze the downloaded HTML source and extract the listing data from it. The BeautifulSoup package makes this easy, since it can pull all kinds of tags out of a page.

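(1)   Inspecting the downloaded HTML source shows that all of the listing data sits inside <div class="house_lists">. An excerpt follows (it is cut off partway through the third listing):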

<!-- 楼盘列表 -->

                    <div class="house_lists">

                                                <div class="house_item clearfix" data-temp-id="WJ000795">

                            <div class="f_left house_item_img">

                                <a href="https://cd.ixiangzhu.com/House/detail/WJ000795.html"><img src="https://www.51xiangzhu.com:6080/app/file/img.do?img=/usr/local/img/property/WJ000795/1320x240.JPG" width="230" height="160" /></a>

                            </div>

                            <div class="f_left house_item_info">

                                <div class="title clearfix">

                                    <a class="f_left" href="https://cd.ixiangzhu.com/House/detail/WJ000795.html">金科星耀天都</a>

                                                                         <span class="price f_right">

                                        <span class="money">11892

                                                                                    元/m&sup2;

                                                                                </span>

                                    </span>

                                                                        </div>

                                <div class="info">

                                    <span>在售</span>

                                    <span class="info-price"></span>

                                </div>

                                <div class="area">

                                    面积区间: 37m&sup2;

                                </div>

                                <div class="address">

                                    <span>[成华区 - 驷马桥]</span>

                                    成华成都市成华区驷马桥昭觉寺南路12号

                                </div>

                                <div class="labels">

                                                                                                                                                                                                            <a>多条地铁</a>

                                                                                                                                                                            <a>读书方便</a>

                                                                                                                                                                            <a>熙悦广场</a>

                                                                                                                                                                            <a>便利社区商业</a>

                                                                                                                                                                                                                                            </div>

                            </div>

                            <!-- 客服头像 -->

                            <div class="head_img">

                                <a href='javascript:;' class='im ' onclick='easemobim.bind({tenantId:38251})'><p class="text">在线咨询</p></a>

 

                            </div>

                            <!--加入对比-->

                            <div class="add_compare ">

                                <p class="text" data-id="WJ000795"  data-img="https://www.51xiangzhu.com:6080/app/file/img.do?img=/usr/local/img/property/WJ000795/1320x240.JPG"><img src="https://cd.ixiangzhu.com/foreground/imgs/icon/add.png " alt="" > 添加对比</p>

                                <p class="have_add" style="display: none">已添加</p>

                            </div>

                        </div>

                                                <div class="house_item clearfix" data-temp-id="WJ000489">

                            <div class="f_left house_item_img">

                                <a href="https://cd.ixiangzhu.com/House/detail/WJ000489.html"><img src="https://www.51xiangzhu.com:6080/app/file/img.do?img=/usr/local/img/property/WJ000489/1320x240.JPG" width="230" height="160" /></a>

                            </div>

                            <div class="f_left house_item_info">

                                <div class="title clearfix">

                                    <a class="f_left" href="https://cd.ixiangzhu.com/House/detail/WJ000489.html">黄龙溪谷</a>

                                                                         <span class="price f_right">

                                        <span class="money">15747

                                                                                    元/m&sup2;

                                                                                </span>

                                    </span>

                                                                        </div>

                                <div class="info">

                                    <span>在售</span>

                                    <span class="info-price"></span>

                                </div>

                                <div class="area">

                                    面积区间: 167-328m&sup2;

                                </div>

                                <div class="address">

                                    <span>[眉山 - 彭山]</span>

                                    剑南大道南延线黄龙古镇旁

                                </div>

                                <div class="labels">

                                                                                                                                                                                                            <a>私家车出行方便</a>

                                                                                                                                                                            <a>读书方便</a>

                                                                                                                                                                                                                                            </div>

                            </div>

                            <!-- 客服头像 -->

                            <div class="head_img">

                                <a href='javascript:;' class='im ' onclick='easemobim.bind({tenantId:38251})'><p class="text">在线咨询</p></a>

 

                            </div>

                            <!--加入对比-->

                            <div class="add_compare ">

                                <p class="text" data-id="WJ000489"  data-img="https://www.51xiangzhu.com:6080/app/file/img.do?img=/usr/local/img/property/WJ000489/1320x240.JPG"><img src="https://cd.ixiangzhu.com/foreground/imgs/icon/add.png " alt="" > 添加对比</p>

                                <p class="have_add" style="display: none">已添加</p>

                            </div>

                        </div>

                                                <div class="house_item clearfix" data-temp-id="WJ000811">

                            <div class="f_left house_item_img">

                                <a href="https://cd.ixiangzhu.com/House/detail/WJ000811.html"><img src="https://www.51xiangzhu.com:6080/app/file/img.do?img=/usr/local/img/property/WJ000811/1320x240.JPG" width="230" height="160" /></a>

                            </div>

(2)   Looking at the markup above, each listing's name sits inside an <a> tag with class f_left, for example: <a class="f_left" href="https://cd.ixiangzhu.com/House/detail/WJ000795.html">金科星耀天都</a>

(3)   Each listing's price sits inside a <span class="money"> tag, for example: <span class="money">11892元/m&sup2;   </span>

(4)   Use BeautifulSoup to extract the <div class="house_lists"> tag that holds all the listing data:

from bs4 import BeautifulSoup as bs

# html_data is the page source downloaded in Step 1
soup = bs(html_data, 'html.parser')
house_div = soup.find_all('div', class_='house_lists')

(5)   Next, extract every <a> tag that holds a listing name:

house_list = house_div[0].find_all('a', class_='f_left')

Here house_list is a list of the matching <a> tags, one per listing; the listing name is the text of each tag.

(6)   Then extract every <span class="money"> tag that holds a listing price:

price_list = house_div[0].find_all('span', class_='money')

Here price_list is a list of the matching <span class="money"> tags, one per listing; the text of each tag contains the price.
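To see what these calls return, the three steps can be tried on a small inline snippet instead of the live page. This is a sketch under assumptions: SAMPLE_HTML is a hand-trimmed version of the markup shown earlier, not the real page, and the code needs the third-party beautifulsoup4 package installed.

```python
from bs4 import BeautifulSoup as bs

# Hand-trimmed sample modelled on the excerpt above (an assumption, not the real page).
SAMPLE_HTML = '''
<div class="house_lists">
  <div class="house_item clearfix">
    <div class="title clearfix">
      <a class="f_left" href="https://cd.ixiangzhu.com/House/detail/WJ000795.html">金科星耀天都</a>
      <span class="price f_right"><span class="money">11892 元/m&sup2;</span></span>
    </div>
  </div>
  <div class="house_item clearfix">
    <div class="title clearfix">
      <a class="f_left" href="https://cd.ixiangzhu.com/House/detail/WJ000489.html">黄龙溪谷</a>
      <span class="price f_right"><span class="money">15747 元/m&sup2;</span></span>
    </div>
  </div>
</div>
'''

soup = bs(SAMPLE_HTML, 'html.parser')
house_div = soup.find_all('div', class_='house_lists')

# find_all returns Tag objects; .text gives the string inside each tag.
house_list = house_div[0].find_all('a', class_='f_left')
price_list = house_div[0].find_all('span', class_='money')

names = [a.text for a in house_list]
prices = [span.text.strip() for span in price_list]
print(names)
print(prices)
```

The price strings still carry the unit after the number, which is why a separate extraction step is needed.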

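(7)   The price still has to be pulled out of each entry in price_list, because the text of the tag holds more than just the number. Each tag looks something like this: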

<span class="money">11892元/m&sup2;   </span>

Only the digits are needed, so a dedicated function handles the extraction.

(8)   The price-extraction function is as follows:

def getPrice(text):
    digits = '1234567890'
    start_position = 0
    # Find the index of the first digit
    for i, c in enumerate(text):
        if c in digits:
            start_position = i
            break

    # Slice from the first digit to the end of the string
    tempstr = text[start_position:]

    end_pos = len(tempstr)
    # Find the index of the first non-digit character after the number
    for i, c in enumerate(tempstr):
        if c not in digits:
            end_pos = i
            break

    # Slice out the leading run of digits, which is the price
    price_str = tempstr[:end_pos]
    return price_str
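The same result can be reached more compactly with a regular expression. The sketch below is an alternative to the function above, not the author's code. The name extract_price is hypothetical, and it assumes the price is the first run of ASCII digits in the text.

```python
import re


def extract_price(text):
    """Return the first run of ASCII digits in text, or '' if there is none."""
    match = re.search(r'[0-9]+', text)
    return match.group(0) if match else ''


# The span text on the page has whitespace and a unit after the number.
print(extract_price('11892\n    元/m²\n  '))
```

The pattern uses [0-9] rather than \d so that it matches only the ten ASCII digits, the same set the hand-written loop checks.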

(9)   Iterate over both lists in parallel to pull out each listing's name and price, and store them in a dictionary:

# Maps listing name -> price string
housedictionary = {}
for (ahref, aprice) in zip(house_list, price_list):
    housedictionary[ahref.text] = getPrice(aprice.text)
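One thing to keep in mind is that zip stops at the shorter list without warning, so if a listing ever lacked a price tag, names and prices would drift out of step. Whether the site ever omits a price is not something this post establishes. A minimal sketch of a guard follows; pair_listings is a hypothetical helper, and it takes plain strings rather than tags so that it stays self-contained.

```python
def pair_listings(names, prices):
    """Build a {name: price} dict, refusing lists of unequal length."""
    if len(names) != len(prices):
        raise ValueError(
            'got {} names but {} prices'.format(len(names), len(prices))
        )
    return dict(zip(names, prices))


# Sample values taken from the excerpt shown earlier.
housedictionary = pair_listings(['金科星耀天都', '黄龙溪谷'], ['11892', '15747'])
print(housedictionary)
```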

 

Step 3: save the data in the dictionary to a file:

# The with block closes the file automatically, so no explicit close() is needed
with open('allhouse.txt', 'w', encoding='utf-8') as f:
    f.write(str(housedictionary) + '\n')

In the end, all the listing data is saved to the file allhouse.txt as a single string.
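Text written with str(housedictionary) is awkward to load back into a program. If the data needs to be read again later, JSON is one option. The following sketch is an alternative under stated assumptions rather than the author's method: the file name allhouse.json and both helper names are made up for illustration.

```python
import json


def save_houses(houses, path='allhouse.json'):
    """Write the {name: price} dict as JSON (file name is an assumed example)."""
    with open(path, 'w', encoding='utf-8') as f:
        # ensure_ascii=False keeps the Chinese names readable in the file.
        json.dump(houses, f, ensure_ascii=False, indent=2)


def load_houses(path='allhouse.json'):
    """Read the dict back from the JSON file."""
    with open(path, encoding='utf-8') as f:
        return json.load(f)
```

Calling save_houses(housedictionary) and later load_houses() returns an equal dictionary, which a plain str() dump does not offer directly.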

The code above ran successfully under Python 3.6.3.

posted on 2017-12-17 19:22  雅虎猫