Python Crawler Tutorial for Beginners: Scraping Shijiazhuang Lianjia Rental Listings
1. Preface
This post scrapes rental listings from the Lianjia site; the data collected here will serve as material for data analysis in later posts.
The URL we want to crawl is: https://sjz.lianjia.com/zufang/
2. Analyzing the URL
First, let's decide which data we need.

The yellow boxes in the listing-page screenshot mark the fields we are after.
Next, work out the pagination pattern:
https://sjz.lianjia.com/zufang/pg1/
https://sjz.lianjia.com/zufang/pg2/
https://sjz.lianjia.com/zufang/pg3/
https://sjz.lianjia.com/zufang/pg4/
https://sjz.lianjia.com/zufang/pg5/
...
https://sjz.lianjia.com/zufang/pg80/
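Only the pg{n} segment changes, so the full URL list is trivial to build; a quick sketch:

# Build every pagination URL (80 pages, per the pattern above)
urls = ["https://sjz.lianjia.com/zufang/pg{}/".format(page) for page in range(1, 81)]
print(urls[0], urls[-1])  # first and last page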
3. Parsing the pages
With the pagination pattern in hand, all the page links can be built quickly. We use the lxml module to parse the page source and pull out the data we want.
This post introduces a new module, fake_useragent, which returns a random UA (User-Agent) string. It is simple to use, and plenty of tutorials for it can be found with a quick search. Here we only need it to supply a random UA:
self._ua = UserAgent()
self._headers = {"User-Agent": self._ua.random}  # grab a random User-Agent
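If fake_useragent is new to you, here is a minimal standalone example (each access to .random may return a different browser string):

from fake_useragent import UserAgent

ua = UserAgent()
print(ua.random)  # e.g. a Chrome User-Agent string
print(ua.random)  # usually a different one each time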
Since the page URLs can be generated so quickly, we fetch them with coroutines, and use the pandas module to write the results to a CSV file.
from fake_useragent import UserAgent
from lxml import etree
import asyncio
import aiohttp
import pandas as pd


class LianjiaSpider(object):

    def __init__(self):
        self._ua = UserAgent()
        self._headers = {"User-Agent": self._ua.random}
        self._data = list()

    async def get(self, url):
        # Fetch one page; returns None on error or timeout
        async with aiohttp.ClientSession() as session:
            try:
                async with session.get(url, headers=self._headers, timeout=3) as resp:
                    if resp.status == 200:
                        result = await resp.text()
                        return result
            except Exception as e:
                print(e.args)

    async def parse_html(self):
        for page in range(1, 77):  # pages 1 through 76
            url = "https://sjz.lianjia.com/zufang/pg{}/".format(page)
            print("Crawling {}".format(url))
            html = await self.get(url)  # fetch the page source
            html = etree.HTML(html)     # parse it with lxml
            self.parse_page(html)       # extract the fields we want

            print("Saving data....")
            ######################### write data
            data = pd.DataFrame(self._data)
            data.to_csv("链家网租房数据.csv", encoding='utf_8_sig')  # write to file
            ######################### write data

    def run(self):
        loop = asyncio.get_event_loop()
        tasks = [asyncio.ensure_future(self.parse_html())]
        loop.run_until_complete(asyncio.wait(tasks))


if __name__ == '__main__':
    l = LianjiaSpider()
    l.run()
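One thing worth noting: parse_html awaits each page before requesting the next, so despite the coroutine machinery the pages are fetched one at a time. A sketch of a truly concurrent variant (assuming the get and parse_page methods above stay as they are) would schedule every request up front with asyncio.gather:

    async def parse_html(self):
        # Alternative: fire off all page requests at once and run them concurrently
        urls = ["https://sjz.lianjia.com/zufang/pg{}/".format(p) for p in range(1, 77)]
        pages = await asyncio.gather(*(self.get(url) for url in urls))
        for html in pages:
            if html:  # self.get returns None on errors and timeouts
                self.parse_page(etree.HTML(html))
        pd.DataFrame(self._data).to_csv("链家网租房数据.csv", encoding='utf_8_sig')

In practice you would probably want to cap the concurrency (for example with an asyncio.Semaphore), since 76 simultaneous requests is an easy way to get rate-limited or blocked.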
The code above is still missing the method that parses each page; let's fill it in next.
    def parse_page(self, html):
        info_panel = html.xpath("//div[@class='info-panel']")
        for info in info_panel:
            region = self.remove_space(info.xpath(".//span[@class='region']/text()"))
            zone = self.remove_space(info.xpath(".//span[@class='zone']/span/text()"))
            meters = self.remove_space(info.xpath(".//span[@class='meters']/text()"))
            where = self.remove_space(info.xpath(".//div[@class='where']/span[4]/text()"))

            con = info.xpath(".//div[@class='con']/text()")
            floor = con[0]  # floor
            type = con[1]   # layout

            agent = info.xpath(".//div[@class='con']/a/text()")[0]
            has = info.xpath(".//div[@class='left agency']//text()")
            price = info.xpath(".//div[@class='price']/span/text()")[0]
            price_pre = info.xpath(".//div[@class='price-pre']/text()")[0]
            look_num = info.xpath(".//div[@class='square']//span[@class='num']/text()")[0]

            one_data = {
                "region": region,
                "zone": zone,
                "meters": meters,
                "where": where,
                "louceng": floor,
                "type": type,
                "xiaoshou": agent,
                "has": has,
                "price": price,
                "price_pre": price_pre,
                "num": look_num
            }
            self._data.append(one_data)  # append the record
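Note that parse_page calls self.remove_space, which never appears in the original code. Its exact definition is unknown; given that each xpath call returns a list of text nodes, a minimal reconstruction might look like this:

    def remove_space(self, data):
        # Hypothetical helper (not shown in the original post): join the list of
        # text nodes returned by xpath and strip all embedded whitespace/newlines
        return "".join("".join(data).split())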
Before long, the scraping is just about done.
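For the data analysis promised at the start, the CSV written above can be loaded straight back with pandas; a minimal sketch:

import pandas as pd

# Read the scraped listings back for analysis
df = pd.read_csv("链家网租房数据.csv")
print(df.shape)   # rows x columns
print(df.head())  # first few listings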
Stay Hungry, Stay Foolish
posted on 2019-01-17 04:44 by Anderson_An