Data Collection Assignment 3

Course link: https://edu.cnblogs.com/campus/fzu/2024DataCollectionandFusiontechnology
Assignment link: https://edu.cnblogs.com/campus/fzu/2024DataCollectionandFusiontechnology/homework/13287
Lab 3 repository link: https://gitee.com/wd_b/party-soldier-data-collection/tree/master/数据采集实践3

I. Lab Preparation

1. Verify the Scrapy installation

2. Verify the database installation

3. Create the assignment folder

II. Lab Implementation

Task 1

  • Data collection experiment

Gitee link: https://gitee.com/wd_b/party-soldier-data-collection/tree/master/数据采集实践3/weather_spider
Requirements: pick a website and crawl all of its images, for example the China Weather Network (http://www.weather.com.cn). Implement the crawl with the Scrapy framework in both single-threaded and multi-threaded modes.
Output: print the downloaded URLs to the console and store the downloaded images in the images subfolder.
1. Create the Scrapy project
(1) Enter the assignment folder and create a new Scrapy project

(2) Inspect the directory structure

(3) Enter the project directory

2. Write the spider
(1) Create the spider file
Create a spider file named weather_spider.py under the weather_spider/spiders directory.
Gitee link: https://gitee.com/wd_b/party-soldier-data-collection/blob/master/数据采集实践3/weather_spider/weather_spider/spiders/weather_spider.py
(2) Edit the spider file (the full code is at the Gitee link above)

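The spider's job is to gather the URL of every image on the page into an item whose image_urls field the pipeline below consumes. A minimal sketch of such a spider, assuming an item class named WeatherItem (see the items.py sketch further down; the spider name "weather" matches the run command in step 3):

import scrapy
from weather_spider.items import WeatherItem  # assumed item class, see the items.py sketch below

class WeatherSpider(scrapy.Spider):
    name = 'weather'
    allowed_domains = ['weather.com.cn']
    start_urls = ['http://www.weather.com.cn/']

    def parse(self, response):
        # Collect every <img src> on the page, resolving relative URLs against the page URL
        image_urls = [
            response.urljoin(src)
            for src in response.xpath('//img/@src').getall()
        ]
        item = WeatherItem()
        item['image_urls'] = image_urls
        yield item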

(3) Edit the settings.py file; the code is as follows
Gitee link: https://gitee.com/wd_b/party-soldier-data-collection/blob/master/数据采集实践3/weather_spider/weather_spider/settings.py

BOT_NAME = "weather_spider"
SPIDER_MODULES = ["weather_spider.spiders"]
NEWSPIDER_MODULE = "weather_spider.spiders"
# Enable the image download pipeline
ITEM_PIPELINES = {
    'weather_spider.pipelines.ImagePipeline': 1,
}
# Image storage path
IMAGES_STORE = 'images'
# Number of concurrent requests
CONCURRENT_REQUESTS = 32
DOWNLOAD_DELAY = 0.5  # delay between requests
ROBOTSTXT_OBEY = True
REQUEST_FINGERPRINTER_IMPLEMENTATION = "2.7"
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
FEED_EXPORT_ENCODING = "utf-8"
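
The settings above give the multi-threaded (concurrent) crawl. For the single-threaded run that the assignment also asks for, one plausible variant is to drop the concurrency to a single request at a time (standard Scrapy settings; the specific values are an assumption):

# Single-threaded variant: handle one request at a time
CONCURRENT_REQUESTS = 1
CONCURRENT_REQUESTS_PER_DOMAIN = 1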

(4) Edit the pipelines.py file; the code is as follows
Gitee link: https://gitee.com/wd_b/party-soldier-data-collection/blob/master/数据采集实践3/weather_spider/weather_spider/pipelines.py

import scrapy
from itemadapter import ItemAdapter
import logging
from scrapy.pipelines.images import ImagesPipeline
from scrapy.exceptions import DropItem

class ImagePipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        # Issue a download request for every URL the spider collected
        for image_url in item['image_urls']:
            logging.info(f"Downloading image: {image_url}")
            yield scrapy.Request(image_url)

    def file_path(self, request, response=None, info=None, *, item=None):
        # Name each file after the last segment of its URL
        image_guid = request.url.split('/')[-1]
        return f'images/{image_guid}'

    def item_completed(self, results, item, info):
        # Drop items for which no image was downloaded successfully
        image_paths = [x['path'] for ok, x in results if ok]
        if not image_paths:
            raise DropItem("Item contains no images")
        return item

(5) Edit the items.py file (the full code is at the Gitee link below)
Gitee link: https://gitee.com/wd_b/party-soldier-data-collection/blob/master/数据采集实践3/weather_spider/weather_spider/items.py
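A minimal items.py that matches the pipeline above, which reads item['image_urls'] and lets ImagesPipeline fill in the download results, could look like the following sketch (the field names follow Scrapy's ImagesPipeline conventions; the class name WeatherItem is an assumption):

import scrapy

class WeatherItem(scrapy.Item):
    image_urls = scrapy.Field()  # list of image URLs for the pipeline to download
    images = scrapy.Field()      # download results filled in by ImagesPipeline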
3. Run the spider
(1) From a command line or terminal, navigate to the project root and run: scrapy crawl weather

(2) Screenshot of the results
Gitee link: https://gitee.com/wd_b/party-soldier-data-collection/tree/master/数据采集实践3/weather_spider/weather_spider/images/images

  • Reflections
    Although the target was still the weather site, this was my first time using the Scrapy framework, and I could feel that Scrapy is much more convenient, with good extensibility and flexibility.

Task 2

  • Data collection experiment

Gitee link: https://gitee.com/wd_b/party-soldier-data-collection/tree/master/数据采集实践3/stock_spider
Requirements: become proficient with the serialized output of Item and Pipeline data in Scrapy; use the Scrapy + XPath + MySQL storage technical route to crawl stock information and store it in a MySQL database.
Candidate sites:
Eastmoney: https://www.eastmoney.com/
Sina Finance: http://finance.sina.com.cn/stock/
1. Create the Scrapy project
(1) Enter the assignment folder and create a new Scrapy project

(2) Inspect the directory structure

(3) Enter the project directory

2. Write the spider
(1) Create the spider file
Create a spider file named stock_spider.py under the stock_spider/spiders directory.
Gitee link: https://gitee.com/wd_b/party-soldier-data-collection/blob/master/数据采集实践3/stock_spider/stock_spider/spiders/stock_spider.py
(2) Edit the spider file; the code is as follows

import scrapy
from stock_spider.items import StockItem
class StockSpider(scrapy.Spider):
    name = 'stock'
    allowed_domains = ['finance.sina.com.cn']
    start_urls = ['http://vip.stock.finance.sina.com.cn/q/go.php/vInvestConsult/kind/lsjy/index.phtml']
    def parse(self, response):
        rows = response.xpath('//table[@class="list_table"]//tr')[1:]  # skip the header row
        for row in rows:
            item = StockItem()
            item['id'] = row.xpath('td[1]/text()').get()
            item['bStockNo'] = row.xpath('td[2]/a/text()').get()
            item['stockName'] = row.xpath('td[3]/a/text()').get()
            item['latestPrice'] = row.xpath('td[4]/text()').get()
            item['changePercent'] = row.xpath('td[5]/text()').get()
            item['changeAmount'] = row.xpath('td[6]/text()').get()
            item['volume'] = row.xpath('td[7]/text()').get()
            item['turnover'] = row.xpath('td[8]/text()').get()
            item['amplitude'] = row.xpath('td[9]/text()').get()
            item['highest'] = row.xpath('td[10]/text()').get()
            item['lowest'] = row.xpath('td[11]/text()').get()
            item['openPrice'] = row.xpath('td[12]/text()').get()
            item['closePrice'] = row.xpath('td[13]/text()').get()
            yield item

(3) Edit the settings.py file; the code is as follows
Gitee link: https://gitee.com/wd_b/party-soldier-data-collection/blob/master/数据采集实践3/stock_spider/stock_spider/settings.py

BOT_NAME = "stock_spider"
SPIDER_MODULES = ["stock_spider.spiders"]
NEWSPIDER_MODULE = "stock_spider.spiders"
ITEM_PIPELINES = {
    'stock_spider.pipelines.MySQLPipeline': 300,
}
# settings.py
DB_SETTINGS = {
    'db': 'stock_spider',
    'user': 'root',
    'passwd': '123456',
    'host': 'localhost',
    'port': 3306,
}
ROBOTSTXT_OBEY = False
REQUEST_FINGERPRINTER_IMPLEMENTATION = "2.7"
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
FEED_EXPORT_ENCODING = "utf-8"

(4) Edit the pipelines.py file; the code is as follows
Gitee link: https://gitee.com/wd_b/party-soldier-data-collection/blob/master/数据采集实践3/stock_spider/stock_spider/pipelines.py

import mysql.connector
from mysql.connector import errorcode
class MySQLPipeline:
    def __init__(self):
        self.connection = None
        self.cursor = None
    def open_spider(self, spider):
        try:
            self.connection = mysql.connector.connect(
                host='localhost',
                user='root',
                password='123456',
                database='stock_spider'
            )
            self.cursor = self.connection.cursor()
            self.create_table()
        except mysql.connector.Error as err:
            if err.errno == errorcode.ER_ACCESS_DENIED_ERROR:
                print("Something is wrong with your user name or password")
            elif err.errno == errorcode.ER_BAD_DB_ERROR:
                print("Database does not exist")
            else:
                print(err)
            self.cursor = None
            self.connection = None

    def create_table(self):
        if self.cursor:
            create_table_query = """
            CREATE TABLE IF NOT EXISTS stocks (
                id INT AUTO_INCREMENT PRIMARY KEY,
                bStockNo VARCHAR(255),
                stockName VARCHAR(255),
                latestPrice VARCHAR(255),
                changePercent VARCHAR(255),
                changeAmount VARCHAR(255),
                volume VARCHAR(255),
                turnover VARCHAR(255),
                amplitude VARCHAR(255),
                highest VARCHAR(255),
                lowest VARCHAR(255),
                openPrice VARCHAR(255),
                closePrice VARCHAR(255)
            )
            """
            self.cursor.execute(create_table_query)
            self.connection.commit()

    def process_item(self, item, spider):
        if not self.cursor:
            return item  # if the cursor is None (connection failed), return the item unchanged

        insert_query = """
        INSERT INTO stocks (bStockNo, stockName, latestPrice, changePercent, changeAmount, volume, turnover, amplitude, highest, lowest, openPrice, closePrice)
        VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)
        """
        values = (
            item.get('bStockNo', ''),
            item.get('stockName', ''),
            item.get('latestPrice', ''),
            item.get('changePercent', ''),
            item.get('changeAmount', ''),
            item.get('volume', ''),
            item.get('turnover', ''),
            item.get('amplitude', ''),
            item.get('highest', ''),
            item.get('lowest', ''),
            item.get('openPrice', ''),
            item.get('closePrice', '')
        )
        try:
            self.cursor.execute(insert_query, values)
            self.connection.commit()
        except mysql.connector.Error as err:
            print(f"Error inserting data: {err}")
        return item

    def close_spider(self, spider):
        if self.cursor:
            self.cursor.close()
        if self.connection:
            self.connection.close()
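
The settings.py above defines a DB_SETTINGS dict, yet the pipeline hard-codes the same credentials. One way to wire the two together is Scrapy's from_crawler hook; a sketch under that assumption (the class name is hypothetical, and the keys follow the DB_SETTINGS shown above):

import mysql.connector

class MySQLSettingsPipeline:  # hypothetical variant of the pipeline above
    def __init__(self, db_settings):
        self.db_settings = db_settings
        self.connection = None
        self.cursor = None

    @classmethod
    def from_crawler(cls, crawler):
        # Read the DB_SETTINGS dict defined in settings.py
        return cls(crawler.settings.get('DB_SETTINGS', {}))

    def open_spider(self, spider):
        self.connection = mysql.connector.connect(
            host=self.db_settings.get('host', 'localhost'),
            port=self.db_settings.get('port', 3306),
            user=self.db_settings.get('user', 'root'),
            password=self.db_settings.get('passwd', ''),
            database=self.db_settings.get('db', 'stock_spider'),
        )
        self.cursor = self.connection.cursor()

If used, ITEM_PIPELINES would need to point at this class instead of the hard-coded one.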

(5) Edit the items.py file; the code is as follows
Gitee link: https://gitee.com/wd_b/party-soldier-data-collection/blob/master/数据采集实践3/stock_spider/stock_spider/items.py

import scrapy
class StockItem(scrapy.Item):
    id = scrapy.Field()
    bStockNo = scrapy.Field()  # stock code
    stockName = scrapy.Field()  # stock name
    latestPrice = scrapy.Field()  # latest price
    changePercent = scrapy.Field()  # change percentage
    changeAmount = scrapy.Field()  # change amount
    volume = scrapy.Field()  # trading volume
    turnover = scrapy.Field()  # turnover
    amplitude = scrapy.Field()  # amplitude
    highest = scrapy.Field()  # daily high
    lowest = scrapy.Field()  # daily low
    openPrice = scrapy.Field()  # opening price
    closePrice = scrapy.Field()  # previous close

3. Run the spider
(1) From a command line or terminal, navigate to the project root and run: scrapy crawl stock

(2) Screenshot of the results
Gitee link: https://gitee.com/wd_b/party-soldier-data-collection/tree/master/数据采集实践3/stock_spider/data
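
To spot-check what the pipeline wrote, a small standalone script can query the stocks table directly (a sketch that reuses the credentials shown in settings.py above):

import mysql.connector

# Connect with the same credentials the pipeline uses
conn = mysql.connector.connect(host='localhost', user='root',
                               password='123456', database='stock_spider')
cur = conn.cursor()
cur.execute("SELECT bStockNo, stockName, latestPrice FROM stocks LIMIT 5")
for row in cur.fetchall():
    print(row)
cur.close()
conn.close()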

  • Reflections
    In Task 2 I systematically learned the basics of crawling data with the Scrapy framework and of extracting information with XPath; working through the experiment step by step, I benefited a great deal.

Task 3

  • Data collection experiment

Gitee link: https://gitee.com/wd_b/party-soldier-data-collection/tree/master/数据采集实践3/currency_project
Requirements: become proficient with the serialized output of Item and Pipeline data in Scrapy; use the Scrapy framework + XPath + MySQL storage technical route to crawl data from a foreign-exchange website.
Candidate site: Bank of China: https://www.boc.cn/sourcedb/whpj
Output: (MySQL database storage and output format)
1. Create the Scrapy project
(1) Enter the assignment folder and create a new Scrapy project

(2) Inspect the directory structure

(3) Enter the project directory

2. Write the spider
(1) Create the spider file
Create a spider file named CurrencySpider.py under the currency_project/spiders directory.
Gitee link: https://gitee.com/wd_b/party-soldier-data-collection/blob/master/数据采集实践3/currency_project/currency_project/spiders/CurrencySpider.py
(2) Edit the spider file; the code is as follows

import scrapy
from scrapy import signals
from scrapy.utils.log import configure_logging
from ..items import CurrencyItem
class CurrencySpider(scrapy.Spider):
    name = "currency"
    allowed_domains = ["boc.cn"]
    start_urls = ["https://www.boc.cn/sourcedb/whpj/"]

    def __init__(self, *args, **kwargs):
        # Configure the log output format
        configure_logging({'LOG_FORMAT': '%(levelname)s: %(message)s'})
        super().__init__(*args, **kwargs)

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url, callback=self.parse, errback=self.errback)

    def parse(self, response):
        # Select all <tr> elements with XPath
        rows = response.xpath("//tr[position()>1]")  # skip the first <tr> (header row)
        # Iterate over each <tr> element
        for row in rows:
            # Select the <td> elements under the current <tr> and extract their text
            currencyname = row.xpath("./td[1]//text()").get()
            hui_in = row.xpath("./td[2]//text()").get()
            chao_in = row.xpath("./td[3]//text()").get()
            hui_out = row.xpath("./td[4]//text()").get()
            chao_out = row.xpath("./td[5]//text()").get()
            zhonghang = row.xpath("./td[6]//text()").get()
            date = row.xpath("./td[7]//text()").get()
            time = row.xpath("./td[8]//text()").get()
            currency = CurrencyItem()
            currency['currencyname'] = str(currencyname)
            currency['hui_in'] = str(hui_in)
            currency['chao_in'] = str(chao_in)
            currency['hui_out'] = str(hui_out)
            currency['chao_out'] = str(chao_out)
            currency['zhonghang'] = str(zhonghang)
            currency['date'] = str(date)
            currency['time'] = str(time)
            yield currency

    def errback(self, failure):
        self.logger.error(repr(failure))
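
Because every extracted value is wrapped in str(), a missing cell reaches the pipeline as the literal string 'None', which the pipeline below then has to special-case. An alternative sketch is to normalize at extraction time with a small helper (hypothetical, not part of the repository code):

def cell(row, i):
    # Text of the i-th <td> in the row, stripped; None if the cell is missing or empty
    value = row.xpath(f"./td[{i}]//text()").get()
    return value.strip() if value else None

# e.g. inside parse():  currency['hui_in'] = cell(row, 2)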


(3) Edit the settings.py file; the code is as follows
Gitee link: https://gitee.com/wd_b/party-soldier-data-collection/blob/master/数据采集实践3/currency_project/currency_project/settings.py

BOT_NAME = "currency_project"
SPIDER_MODULES = ["currency_project.spiders"]
NEWSPIDER_MODULE = "currency_project.spiders"
ITEM_PIPELINES = {
    'currency_project.pipelines.CurrencyPipeline': 300,
}
LOG_LEVEL = 'INFO'
MYSQL_HOST = 'localhost'
MYSQL_PORT = 3307
MYSQL_USER = 'root'
MYSQL_PASSWORD = '123456'
MYSQL_DB = 'data acquisition'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'waihui (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

(4) Edit the pipelines.py file; the code is as follows
Gitee link: https://gitee.com/wd_b/party-soldier-data-collection/blob/master/数据采集实践3/currency_project/currency_project/pipelines.py

import pymysql
class CurrencyPipeline:
    def __init__(self):
        # Initialize the database connection parameters
        self.host = "localhost"
        self.port = 3306
        self.user = "root"
        self.password = "123456"
        self.db = "data acquisition"
        self.charset = "utf8"
        self.table_name = "currency"

    def open_spider(self, spider):
        # Open the database connection when the spider starts
        self.client = pymysql.connect(
            host=self.host,
            port=self.port,
            user=self.user,
            password=self.password,
            db=self.db,
            charset=self.charset
        )
        self.cursor = self.client.cursor()

        # Create the table if it does not exist
        create_table_query = """
        CREATE TABLE IF NOT EXISTS {} (
            id INT AUTO_INCREMENT PRIMARY KEY,
            currencyname VARCHAR(255),
            hui_in FLOAT,
            chao_in FLOAT,
            hui_out FLOAT,
            chao_out FLOAT,
            zhonghang FLOAT,
            date VARCHAR(255),
            time VARCHAR(255)
        )
        """.format(self.table_name)
        self.cursor.execute(create_table_query)

    def process_item(self, item, spider):
        # Insert each scraped item into the database.
        # The numeric fields are converted with a helper: the spider's str() calls
        # turn missing cells into the literal string 'None', and unparsable text
        # is stored as NULL as well.
        def to_float(value):
            if value is None or value == 'None':
                return None
            try:
                return float(value)
            except (TypeError, ValueError):
                return None

        # Prepare the row to insert
        args = [
            item.get("currencyname"),
            to_float(item.get("hui_in")),
            to_float(item.get("chao_in")),
            to_float(item.get("hui_out")),
            to_float(item.get("chao_out")),
            to_float(item.get("zhonghang")),
            item.get("date"),
            item.get("time"),
        ]
        print("Inserting data:", args)
        # Insert the row into the database
        sql = "INSERT INTO {} (currencyname, hui_in, chao_in, hui_out, chao_out, zhonghang, date, time) VALUES (%s,%s,%s,%s,%s,%s,%s,%s)".format(
            self.table_name)
        self.cursor.execute(sql, args)
        self.client.commit()
        # Return the processed item
        return item

    def close_spider(self, spider):
        # Close the database connection when the spider closes
        self.cursor.close()
        self.client.close()

(5) Edit the items.py file; the code is as follows
Gitee link: https://gitee.com/wd_b/party-soldier-data-collection/blob/master/数据采集实践3/currency_project/currency_project/items.py

import scrapy
class CurrencyItem(scrapy.Item):
    currencyname = scrapy.Field()  # currency name
    hui_in = scrapy.Field()  # spot exchange buying rate
    chao_in = scrapy.Field()  # cash (banknote) buying rate
    hui_out = scrapy.Field()  # spot exchange selling rate
    chao_out = scrapy.Field()  # cash (banknote) selling rate
    zhonghang = scrapy.Field()  # BOC conversion rate
    date = scrapy.Field()  # publication date
    time = scrapy.Field()  # publication time

3. Run the spider
(1) From a command line or terminal, navigate to the project root and run: scrapy crawl currency

(2) Screenshot of the results
Gitee link: https://gitee.com/wd_b/party-soldier-data-collection/tree/master/数据采集实践3/currency_project/data@0020acquisition

  • Reflections
    Through this experiment I became familiar with the complete workflow of crawling foreign-exchange data with the Scrapy framework and XPath and storing it in a MySQL database: defining a CurrencyItem class to hold the data, writing a Spider class to extract the exchange-rate listings, and serializing the data into MySQL through a pipeline. The process deepened my understanding of Scrapy and improved my data crawling and processing skills.
posted @ 2024-11-04 18:03  布若