Scraping Anjuke - Suzhou

Scrape all second-hand property listings for Wujiang, Suzhou from Anjuke and export them to a spreadsheet (the script below writes a CSV file that Excel opens directly).

The fields kept after scraping are "标题" (title), "楼盘名称" (community name), and "地址" (address).

https://suzhou.anjuke.com/sale/p{}
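The listing pages are paginated by substituting the page number into the template above. A quick sketch of how the template expands (the actual range scraped is set in get_url_list in the script below):

url_temp = "https://suzhou.anjuke.com/sale/p{}"
print(url_temp.format(1))   # https://suzhou.anjuke.com/sale/p1
print(url_temp.format(2))   # https://suzhou.anjuke.com/sale/p2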

import requests
from lxml import etree
import csv
  
class Anjuke():
    def __init__(self):
        self.url_temp = "https://suzhou.anjuke.com/sale/p{}"
        self.headers = {
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36"}
  
    def get_url_list(self):
        # Build the paginated listing URLs; range(1, 3) covers pages 1 and 2.
        return [self.url_temp.format(i) for i in range(1, 3)]

    def parse_url(self, url):
        # Fetch one listing page and return the decoded HTML.
        response = requests.get(url, headers=self.headers)
        return response.content.decode()
  
    def get_content_list(self, html_str):
        # Parse the page and pull the title, community name, and address out of each listing <li>.
        html = etree.HTML(html_str)
        content_list = []
        div_list = html.xpath('//ul[@id="houselist-mod-new"]/li')
        for div in div_list:
            item = {}
            item["标题"] = div.xpath(
                './/div[@class="house-title"]/a/text()')
            item["标题"] = item["标题"][0].strip()
            item["楼盘名称"] = div.xpath(
                './/div[@class="details-item"]/span[@class="comm-address"]/text()')
            item["楼盘名称"] = item['楼盘名称'][0].split("\xa0")[0].strip()
            item["地址"] = div.xpath(
                './/div[@class="details-item"]/span[@class="comm-address"]/text()')
            item["地址"] = item['地址'][0].split("\xa0")[-1].strip()
            content_list.append(item)
        return content_list
  
    def save_content_list(self, content_list):
        # Write the rows to CSV; utf-8-sig adds a BOM so Excel displays the Chinese text correctly.
        headers = ["标题", "楼盘名称", "地址"]
        with open("信息.csv", "w", encoding="utf-8-sig", newline="") as fp:
            writer = csv.DictWriter(fp, headers)
            writer.writeheader()
            writer.writerows(content_list)
  
  
    def run(self):
        # Scrape every page first, then save once; saving inside the loop would
        # reopen the CSV in "w" mode each time and keep only the last page.
        content_list = []
        for url in self.get_url_list():
            html_str = self.parse_url(url)
            content_list.extend(self.get_content_list(html_str))
        self.save_content_list(content_list)

if __name__ == '__main__':
    anjuke = Anjuke()  # avoid shadowing the class name with the instance
    anjuke.run()
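The write-up mentions an Excel sheet, while the script itself produces 信息.csv. If an actual .xlsx workbook is wanted, a minimal follow-up sketch (assuming pandas and openpyxl are installed; the file names simply reuse the ones above):

import pandas as pd

# Read the CSV written by the scraper; utf-8-sig matches the encoding used when saving.
df = pd.read_csv("信息.csv", encoding="utf-8-sig")

# Export to a real Excel workbook (pandas uses openpyxl for .xlsx output).
df.to_excel("信息.xlsx", index=False)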

  
