数据采集与融合技术实践第二次作业

作业①

爬取中国气象网实验

实验要求
在中国气象网（http://www.weather.com.cn）给定城市集的7日天气预报，并保存在数据库。
输出信息：

序号	地区	日期	天气信息	温度
1	北京	7日（今天）	晴间多云，被堵山区有阵雨或雷阵雨转晴转多云	31℃/17℃
2	北京	8日（明天）	多云转晴，北部地区有分散阵雨或者雷阵雨转晴	34℃、20℃
3	***	***	***	***

具体实现
代码思路：
1. 定义了一个WeatherDB类用于操作SQLite数据库，包括打开、关闭、插入和展示数据库。
2. 定义了一个WeatherForecast类用于获取天气预报数据。主要方法是forecastCity对url进行网页源码进行解析。

gitee链接https://gitee.com/qu-yapeng/crawl_project.git
代码展示：

from bs4 import BeautifulSoup
from bs4 import UnicodeDammit
import urllib.request
import sqlite3

class WeatherDB:
    # 打开数据库
    def openDB(self):
        self.con=sqlite3.connect("weathers.db")# 存储天气预报数据
        self.cursor=self.con.cursor()
        try:
            self.cursor.execute("create table weathers (wCity varchar(16),wDate varchar(16),wWeather varchar(64),"
                                "wTemp varchar(32),constraint pk_weather primary key (wCity,wDate))")# cursor.execute()执行sql语句
        except:
            self.cursor.execute("delete from weathers")
    # 关闭数据库
    def closeDB(self):
        self.con.commit()
        self.con.close()
    # 插入数据
    def insert(self, city, date, weather, temp):
        try:
            self.cursor.execute("insert into weathers (wCity,wDate,wWeather,wTemp) values (?,?,?,?)",
                                (city, date, weather, temp))
        except Exception as err:
            print(err)

    def show(self):
        self.cursor.execute("select * from weathers")
        rows = self.cursor.fetchall()# fetchall()返回多个元组，即返回多个记录(rows),如果没有结果 则返回 ()
        print("%-16s%-16s%-32s%-16s" % ("city", "date", "weather", "temp"))# 格式化输出
        for row in rows:
            print("%-16s%-16s%-32s%-16s" % (row[0], row[1], row[2], row[3]))

class WeatherForecast:
    def __init__(self):
        self.headers = {
            # Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/116.0.0.0 Safari/537.36
            "User-Agent": "Mozilla/5.0 (Windows; U; Windows NT 6.0 x64; en-US; rv:1.9pre) Gecko/2008072421 Minefield/3.0.2pre"}
        # 存放城市和对应编码的字典
        self.cityCode = {"北京": "101010100", "上海": "101020100", "广州": "101280101", "深圳": "101280601"}

    def forecastCity(self, city):
        if city not in self.cityCode.keys():
            print(city + " code cannot be found")
            return
        url = "http://www.weather.com.cn/weather/" + self.cityCode[city] + ".shtml"
        try:
            req = urllib.request.Request(url, headers=self.headers)
            data = urllib.request.urlopen(req)
            data = data.read()
            dammit = UnicodeDammit(data, ["utf-8", "gbk"])
            data = dammit.unicode_markup
            soup = BeautifulSoup(data, "html.parser")
            lis = soup.select("ul[class='t clearfix'] li")# 查找属性值为t clearfix的ul标签下的所有li标签
            for li in lis:
                try:
                    date = li.select('h1')[0].text# 日期
                    weather = li.select('p[class="wea"]')[0].text# 天气状态
                    temp = li.select('p[class="tem"] span')[0].text + "/" + li.select('p[class="tem"] i')[0].text# 温度
                    print(city, date, weather, temp)
                    self.db.insert(city, date, weather, temp)
                except Exception as err:
                    print(err)
        except Exception as err:
            print(err)

    def process(self, cities):
        self.db = WeatherDB()
        self.db.openDB()
        # 遍历城市
        for city in cities:
            self.forecastCity(city)
        # self.db.show()
        self.db.closeDB()

ws = WeatherForecast()#实例化天气预报对象
ws.process(["北京", "上海", "广州", "深圳"])
print("completed")

运行结果：

心得体会

这次的代码就是按照书上敲的，遇到不懂的也通过百度解决了，这次主要是加强了Beautiful的使用以及对sqlite数据库有初步了解。

作业②

爬取股票相关信息实验

实验要求
用 requests 和 BeautifulSoup 库方法定向爬取股票相关信息，并存储在数据库中。
技巧
在谷歌浏览器中进入 F12 调试模式进行抓包，查找股票列表加载使用的 url，并分析 api 返回的值，并根据所要求的参数可适当更改api 的请求参数。根据 URL 可观察请求的参数 f1、f2 可获取不同的数值，根据情况可删减请求的参数。
具体实现
代码思路：

通过requests库发送GET请求获取网页内容。使用BeautifulSoup库解析网页内容。
使用soup.find_all()方法查找所有的商品价格标签（div）和商品名称标签（em）。分别保存在ptl和tlt两个变量中。
通过find（）方法提取相关信息。

gitee链接https://gitee.com/qu-yapeng/crawl_project.git
代码展示：

import re
import requests
import sqlite3

pn = input("请输入要爬取的页面:")
count = 1
i = 1
fs = ""
headers = {
    "User-Agent": "Mozilla/5.0 (Windows; U; Windows NT 6.0 x64; en-US; rv:1.9pre) Gecko/2008072421 Minefield/3.0.2pre"
}

fs_list = {
    "沪深A股": "fs=m:0+t:6,m:0+t:13,m:0+t:80,m:1+t:2,m:1+t:23",
    # "上证A股":"fs=m:1+t:2,m:1+t:23",
    # "深圳A股":"fs=m:0+t:6,m:0+t:13,m:0+t:80"
}

tplt = "{0:{13}<5}{1:{13}<5}{2:{13}<5}{3:{13}<5}{4:{13}<5}{5:{13}<5}{6:{13}<5}{7:{13}^10}{8:{13}<5}{9:{13}<5}{10:{13}<5}{11:{13}<5}{12:{13}<5}"
print(tplt.format("序号", "股票代码", "股票名称", "最新报价", "涨跌幅", "涨跌额", "成交量", "成交额", "振幅", "最高", "最低", "今开", "昨收", chr(12288)))

# 创建数据库连接
conn = sqlite3.connect('stock_data.db')
cursor = conn.cursor()

# 创建stocks表
cursor.execute("CREATE TABLE IF NOT EXISTS stocks (id INTEGER PRIMARY KEY, code TEXT, name TEXT, price REAL, range TEXT, change REAL, volume INTEGER, turnover REAL, amplitude TEXT, high REAL, low REAL, open REAL, close REAL)")

for key in fs_list:
    print("正在爬取"+key+"的数据")
    fs = fs_list[key]
    for i in range(1, int(pn)+1):
        url = "http://38.push2.eastmoney.com/api/qt/clist/get?cb=jQuery112407987064832629414_1696658970374&pn=1&pz=20&po=1&np=1&ut=bd1d9ddb04089700cf9c27f6f7426281&fltt=2&invt=2&wbp2u=%7C0%7C0%7C0%7Cweb&fid=f3&fs=m:0+t:6,m:0+t:80,m:1+t:2,m:1+t:23,m:0+t:81+s:2048&fields=f1,f2,f3,f4,f5,f6,f7,f8,f9,f10,f12,f13,f14,f15,f16,f17,f18,f20,f21,f23,f24,f25,f22,f11,f62,f128,f136,f115,f152&_=1696658970375"
        r = requests.get(url=url, headers=headers)
        page_text = r.text

        reg = '"diff":\[(.*?)\]'
        list1 = re.findall(reg, page_text)
        list2 = re.findall('{(.*?)}', list1[0], re.S)

        for j in list2:
            list3 = []
            list3.append(j.split(","))
            num = list3[0][11].split(":")[-1].replace("\"", "")
            name = list3[0][13].split(":")[-1].replace("\"", "")
            new_price = list3[0][1].split(":")[-1].replace("\"", "")
            crease_range = list3[0][2].split(":")[-1].replace("\"", "") + "%"
            crease_price = list3[0][3].split(":")[-1].replace("\"", "")
            com_num = list3[0][4].split(":")[-1].replace("\"", "")
            com_price = list3[0][5].split(":")[-1].replace("\"", "")
            move_range = list3[0][6].split(":")[-1].replace("\"", "") + "%"
            top = list3[0][14].split(":")[-1].replace("\"", "")
            bottom = list3[0][15].split(":")[-1].replace("\"", "")
            today = list3[0][16].split(":")[-1].replace("\"", "")
            yestoday = list3[0][17].split(":")[-1].replace("\"", "")
            print('%-10s%-10s%-10s%-10s%-10s%-10s%-10s%-10s%10s%10s%10s%10s%10s' %(count,num,name,new_price,crease_range,crease_price,com_num,com_price,move_range,top,bottom,today,yestoday))

            # 将数据插入到数据库
            sql = "INSERT INTO stocks (code, name, price, range, change, volume, turnover, amplitude, high, low, open, close) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)"
            values = (num, name, new_price, crease_range, crease_price, com_num, com_price, move_range, top, bottom, today, yestoday)
            cursor.execute(sql, values)

            count += 1

        print("=================================================="+key+"第"+str(i)+"页内容打印成功===============================================")

# 提交事务并关闭数据库连接
conn.commit()
conn.close()

运行结果：

心得体会

本次实验花费了不少时间，一开始实验的时候用的是select，定位标签的时候调试了很久，定位到自己要爬取的内容时就没有返回值了，这才知道不能再使用Beautifulsoup来做了，后面就对比url也花费了不少时间，感觉本次实验收获还是蛮大的。

作业③爬取中国大学 2021 主榜

实验要求
爬取中国大学 2021 主榜
（https://www.shanghairanking.cn/rankings/bcur/2021）所有院校信息，并存储在数据库中，同时将浏览器 F12 调试分析的过程录制 Gif 加入至博客中
技巧：分析该网站的发包情况，分析获取数据的 api
输出信息：

排名	学校	省市	类型	总分
1	清华大学	北京	综合	969.2

具体实现
代码思路：

先定义了一个省份代码和对应省份名称的字典。
构造一个函数 save_data_to_db，用于连接数据库、创建表格并将数据插入表格中。
构造一个函数 fetch_university_ranking，该函数通过发送 HTTP 请求获取大学排名数据。然后从返回的数据中解析出学校名称、省份和分数，并以列表形式返回这些数据。
构造一个函数 print_university_ranking，用于以表格形式打印出大学排名数据。

gitee链接https://gitee.com/qu-yapeng/crawl_project.git
抓包过程

代码展示：

import requests
import sqlite3

# 省份代码和对应的省份名称字典
province_dict = {
    "y": "安徽",
    "q": "北京",
    "C": "上海",
    # 其他省份代码...
}

def save_data_to_db(data):
    # 连接数据库
    conn = sqlite3.connect('university_ranking.db')
    c = conn.cursor()

    # 创建表格（如果不存在的话）
    c.execute('''CREATE TABLE IF NOT EXISTS universities 
                 (id INTEGER PRIMARY KEY AUTOINCREMENT, name TEXT, province TEXT, score REAL)''')

    # 批量插入数据
    c.executemany("INSERT INTO universities (name, province, score) VALUES (?, ?, ?)", data)

    # 提交更改并关闭连接
    conn.commit()
    conn.close()


def fetch_university_ranking():
    # 请求获取大学排名数据的网址
    url = r'https://www.shanghairanking.cn/_nuxt/static/1697106492/rankings/bcur/2021/payload.js'
    r = requests.get(url, timeout=20)

    if r.status_code == 200:
        r.encoding = 'utf-8'
        content = r.text
        colleges = []

        while content.find('univNameCn:"') != -1:
            acollege = []
            content = content[content.find('univNameCn:"') + 12:]
            college_name = content[:content.find('"')]
            acollege.append(college_name)

            content = content[content.find('province:') + 9:]
            province_code = content[:content.find(',')]
            # 根据省份代码获取对应的省份名称，如果代码不存在则返回"未知"
            province = province_dict.get(province_code, "未知")
            acollege.append(province)

            content = content[content.find('score:') + 6:]
            score = content[:content.find(',')]
            acollege.append(score)

            colleges.append(acollege)

        return colleges

    else:
        print("请求失败！")
        return None


def print_university_ranking(colleges):
    print("{:^10}\t{:^6}\t{:^6}\t{:^6}".format("排名", "学校名称", '省份', '分数'))
    for i, college in enumerate(colleges):
        print("{:^10}\t{:^6}\t{:^6}\t{:^6}".format(i + 1, college[0], college[1], college[2]))


# 获取大学排名数据
colleges_data = fetch_university_ranking()
if colleges_data:
    # 打印大学排名数据
    print_university_ranking(colleges_data)
    # 保存数据到数据库
    save_data_to_db(colleges_data)
    print("数据已保存到数据库中。")

运行结果：

心得体会

这次实验并没有想的那么简单，js抓包花费了好多时间。另外，又练习了有关于数据库的操作，缺点是格式化输出没做好。

posted @ 2023-10-16 09:14 Qu-YaPeng 阅读(40) 评论(0) 编辑收藏举报

刷新页面返回顶部

登录后才能查看或发表评论，立即登录或者逛逛博客园首页

相关博文：

· 2023数据采集与融合技术实践作业一

· 数据采集与融合技术实践第三次作业

· 数据采集与融合技术实践第二次作业

· 数据采集与融合技术第二次作业

公告

昵称： Qu-YaPeng
园龄： 1年5个月
粉丝： 0
关注： 0

+加关注

2025年3月

日

一

二

三

四

五

六

qu-yapeng

数据采集与融合技术实践第二次作业

作业①

爬取中国气象网实验

心得体会

作业②

爬取股票相关信息实验

心得体会

作业③爬取中国大学 2021 主榜

心得体会

公告

搜索

常用链接

随笔档案

阅读排行榜

推荐排行榜