Data Collection Assignment 3
I. Experiment Preparation
1. Verify that the Scrapy library is installed
2. Verify that the database is installed
3. Create the assignment folder
II. Writing the Experiments
Task 1
- Data collection experiment
Gitee link: https://gitee.com/wd_b/party-soldier-data-collection/tree/master/数据采集实践3/weather_spider
Requirement: choose a website and crawl all of the images on it, for example the China Weather Network (http://www.weather.com.cn). Use the Scrapy framework to implement both a single-threaded and a multi-threaded crawl.
Output: print the URL of each downloaded image to the console and store the downloaded images in an images subfolder.
1. Create the Scrapy project
(1) Enter the assignment folder and create a new Scrapy project (scrapy startproject weather_spider)
(2) Inspect the generated directory structure
(3) Enter the project directory
2. Write the spider
(1) Create the spider file
Create a spider file named weather_spider.py under the weather_spider/spiders directory.
Gitee link: https://gitee.com/wd_b/party-soldier-data-collection/blob/master/数据采集实践3/weather_spider/weather_spider/spiders/weather_spider.py
(2) Edit the spider file (the full code is at the Gitee link above; an illustrative sketch is given below)
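The sketch below is not the repository code, only a minimal illustration of what an image-collecting spider for this task could look like (the class name, item fields, and parsing logic are assumptions; note that name = "weather" matches the scrapy crawl weather command used later):

import scrapy

class WeatherImageSpider(scrapy.Spider):
    """Illustrative sketch only, not the repository spider."""
    name = "weather"
    allowed_domains = ["weather.com.cn"]
    start_urls = ["http://www.weather.com.cn"]

    def parse(self, response):
        # Collect every <img src="..."> on the page as an absolute URL
        image_urls = [response.urljoin(src)
                      for src in response.xpath("//img/@src").getall()]
        # ImagePipeline (pipelines.py below) reads the image_urls field
        yield {"image_urls": image_urls}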
(3) Edit the settings.py file; the code is as follows
Gitee link: https://gitee.com/wd_b/party-soldier-data-collection/blob/master/数据采集实践3/weather_spider/weather_spider/settings.py
BOT_NAME = "weather_spider"
SPIDER_MODULES = ["weather_spider.spiders"]
NEWSPIDER_MODULE = "weather_spider.spiders"

# Enable the image download pipeline
ITEM_PIPELINES = {
    'weather_spider.pipelines.ImagePipeline': 1,
}

# Path where downloaded images are stored
IMAGES_STORE = 'images'

# Concurrency settings
CONCURRENT_REQUESTS = 32
DOWNLOAD_DELAY = 0.5  # delay between consecutive requests

ROBOTSTXT_OBEY = True
REQUEST_FINGERPRINTER_IMPLEMENTATION = "2.7"
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
FEED_EXPORT_ENCODING = "utf-8"
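The requirement asks for both single-threaded and multi-threaded crawling. In Scrapy this is normally expressed through the concurrency settings rather than explicit threads; a sketch of the two variants (only one would be active at a time, and the exact values are illustrative):

# Variant 1: effectively single-threaded, one request in flight at a time
CONCURRENT_REQUESTS = 1
CONCURRENT_REQUESTS_PER_DOMAIN = 1

# Variant 2: concurrent crawling, as configured above
# CONCURRENT_REQUESTS = 32
# CONCURRENT_REQUESTS_PER_DOMAIN = 16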
(4) Edit the pipelines.py file; the code is as follows
Gitee link: https://gitee.com/wd_b/party-soldier-data-collection/blob/master/数据采集实践3/weather_spider/weather_spider/pipelines.py
import scrapy
from itemadapter import ItemAdapter
import logging
from scrapy.pipelines.images import ImagesPipeline
from scrapy.exceptions import DropItem


class ImagePipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        for image_url in item['image_urls']:
            logging.info(f"Downloading image: {image_url}")
            yield scrapy.Request(image_url)

    def file_path(self, request, response=None, info=None, *, item=None):
        image_guid = request.url.split('/')[-1]
        return f'images/{image_guid}'

    def item_completed(self, results, item, info):
        image_paths = [x['path'] for ok, x in results if ok]
        if not image_paths:
            raise DropItem("Item contains no images")
        return item
(5) Edit the items.py file (a minimal sketch is given below; the full code is at the Gitee link)
Gitee link: https://gitee.com/wd_b/party-soldier-data-collection/blob/master/数据采集实践3/weather_spider/weather_spider/items.py
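The item definition is only linked above. For Scrapy's ImagesPipeline the item needs an image_urls field (which the ImagePipeline above iterates over) and usually an images field for the download results; a minimal sketch with an assumed class name:

import scrapy

class WeatherImageItem(scrapy.Item):
    # URLs collected by the spider; ImagesPipeline downloads from this field
    image_urls = scrapy.Field()
    # Filled in by ImagesPipeline with the download results
    images = scrapy.Field()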
3. Run the spider
(1) In a command line or terminal, navigate to the project root and run: scrapy crawl weather
(2) Results screenshot. Because IMAGES_STORE is 'images' and file_path returns 'images/<filename>', the downloaded files end up under images/images/.
Gitee link: https://gitee.com/wd_b/party-soldier-data-collection/tree/master/数据采集实践3/weather_spider/weather_spider/images/images
- Reflections
Although this task still crawls the weather site, it was my first time using the Scrapy framework. I could feel that Scrapy is noticeably more convenient and offers good extensibility and flexibility.
Task 2
- Data collection experiment
Gitee link: https://gitee.com/wd_b/party-soldier-data-collection/tree/master/数据采集实践3/stock_spider
Requirement: become proficient with the serialized output of Item and Pipeline data in Scrapy; crawl stock information along the Scrapy + XPath + MySQL storage technical route and save it to a MySQL database.
Candidate websites:
East Money: https://www.eastmoney.com/
Sina Finance: http://finance.sina.com.cn/stock/
1. Create the Scrapy project
(1) Enter the assignment folder and create a new Scrapy project (scrapy startproject stock_spider)
(2) Inspect the generated directory structure
(3) Enter the project directory
2. Write the spider
(1) Create the spider file
Create a spider file named stock_spider.py under the stock_spider/spiders directory.
Gitee link: https://gitee.com/wd_b/party-soldier-data-collection/blob/master/数据采集实践3/stock_spider/stock_spider/spiders/stock_spider.py
(2) Edit the spider file; the code is as follows
import scrapy
from stock_spider.items import StockItem


class StockSpider(scrapy.Spider):
    name = 'stock'
    allowed_domains = ['finance.sina.com.cn']
    start_urls = ['http://vip.stock.finance.sina.com.cn/q/go.php/vInvestConsult/kind/lsjy/index.phtml']

    def parse(self, response):
        rows = response.xpath('//table[@class="list_table"]//tr')[1:]  # skip the header row
        for row in rows:
            item = StockItem()
            item['id'] = row.xpath('td[1]/text()').get()
            item['bStockNo'] = row.xpath('td[2]/a/text()').get()
            item['stockName'] = row.xpath('td[3]/a/text()').get()
            item['latestPrice'] = row.xpath('td[4]/text()').get()
            item['changePercent'] = row.xpath('td[5]/text()').get()
            item['changeAmount'] = row.xpath('td[6]/text()').get()
            item['volume'] = row.xpath('td[7]/text()').get()
            item['turnover'] = row.xpath('td[8]/text()').get()
            item['amplitude'] = row.xpath('td[9]/text()').get()
            item['highest'] = row.xpath('td[10]/text()').get()
            item['lowest'] = row.xpath('td[11]/text()').get()
            item['openPrice'] = row.xpath('td[12]/text()').get()
            item['closePrice'] = row.xpath('td[13]/text()').get()
            yield item
(3) Edit the settings.py file; the code is as follows
Gitee link: https://gitee.com/wd_b/party-soldier-data-collection/blob/master/数据采集实践3/stock_spider/stock_spider/settings.py
BOT_NAME = "stock_spider"
SPIDER_MODULES = ["stock_spider.spiders"]
NEWSPIDER_MODULE = "stock_spider.spiders"

ITEM_PIPELINES = {
    'stock_spider.pipelines.MySQLPipeline': 300,
}

# settings.py
DB_SETTINGS = {
    'db': 'stock_spider',
    'user': 'root',
    'passwd': '123456',
    'host': 'localhost',
    'port': 3306,
}

ROBOTSTXT_OBEY = False
REQUEST_FINGERPRINTER_IMPLEMENTATION = "2.7"
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
FEED_EXPORT_ENCODING = "utf-8"
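DB_SETTINGS is defined here, while the pipeline below connects with the same values hardcoded. One common alternative is to read the dict from the settings through Scrapy's from_crawler hook; a rough sketch (not the repository code):

import mysql.connector

class MySQLPipeline:
    """Sketch: a pipeline that takes its connection parameters from DB_SETTINGS."""

    def __init__(self, db_settings):
        self.db_settings = db_settings
        self.connection = None
        self.cursor = None

    @classmethod
    def from_crawler(cls, crawler):
        # getdict() returns the DB_SETTINGS dict defined in settings.py
        return cls(db_settings=crawler.settings.getdict("DB_SETTINGS"))

    def open_spider(self, spider):
        self.connection = mysql.connector.connect(
            host=self.db_settings["host"],
            port=self.db_settings["port"],
            user=self.db_settings["user"],
            password=self.db_settings["passwd"],
            database=self.db_settings["db"],
        )
        self.cursor = self.connection.cursor()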
(4) Edit the pipelines.py file; the code is as follows
Gitee link: https://gitee.com/wd_b/party-soldier-data-collection/blob/master/数据采集实践3/stock_spider/stock_spider/pipelines.py
import mysql.connector
from mysql.connector import errorcode


class MySQLPipeline:
    def __init__(self):
        self.connection = None
        self.cursor = None

    def open_spider(self, spider):
        try:
            self.connection = mysql.connector.connect(
                host='localhost',
                user='root',
                password='123456',
                database='stock_spider'
            )
            self.cursor = self.connection.cursor()
            self.create_table()
        except mysql.connector.Error as err:
            if err.errno == errorcode.ER_ACCESS_DENIED_ERROR:
                print("Something is wrong with your user name or password")
            elif err.errno == errorcode.ER_BAD_DB_ERROR:
                print("Database does not exist")
            else:
                print(err)
            self.cursor = None
            self.connection = None

    def create_table(self):
        if self.cursor:
            create_table_query = """
            CREATE TABLE IF NOT EXISTS stocks (
                id INT AUTO_INCREMENT PRIMARY KEY,
                bStockNo VARCHAR(255),
                stockName VARCHAR(255),
                latestPrice VARCHAR(255),
                changePercent VARCHAR(255),
                changeAmount VARCHAR(255),
                volume VARCHAR(255),
                turnover VARCHAR(255),
                amplitude VARCHAR(255),
                highest VARCHAR(255),
                lowest VARCHAR(255),
                openPrice VARCHAR(255),
                closePrice VARCHAR(255)
            )
            """
            self.cursor.execute(create_table_query)
            self.connection.commit()

    def process_item(self, item, spider):
        if not self.cursor:
            return item  # no database connection; pass the item through unchanged
        insert_query = """
        INSERT INTO stocks (bStockNo, stockName, latestPrice, changePercent, changeAmount, volume, turnover, amplitude, highest, lowest, openPrice, closePrice)
        VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)
        """
        values = (
            item.get('bStockNo', ''),
            item.get('stockName', ''),
            item.get('latestPrice', ''),
            item.get('changePercent', ''),
            item.get('changeAmount', ''),
            item.get('volume', ''),
            item.get('turnover', ''),
            item.get('amplitude', ''),
            item.get('highest', ''),
            item.get('lowest', ''),
            item.get('openPrice', ''),
            item.get('closePrice', '')
        )
        try:
            self.cursor.execute(insert_query, values)
            self.connection.commit()
        except mysql.connector.Error as err:
            print(f"Error inserting data: {err}")
        return item

    def close_spider(self, spider):
        if self.cursor:
            self.cursor.close()
        if self.connection:
            self.connection.close()
(5) Edit the items.py file; the code is as follows
Gitee link: https://gitee.com/wd_b/party-soldier-data-collection/blob/master/数据采集实践3/stock_spider/stock_spider/items.py
import scrapy


class StockItem(scrapy.Item):
    id = scrapy.Field()
    bStockNo = scrapy.Field()       # stock code
    stockName = scrapy.Field()      # stock name
    latestPrice = scrapy.Field()    # latest price
    changePercent = scrapy.Field()  # change percentage
    changeAmount = scrapy.Field()   # change amount
    volume = scrapy.Field()         # trading volume
    turnover = scrapy.Field()       # turnover
    amplitude = scrapy.Field()      # amplitude
    highest = scrapy.Field()        # daily high
    lowest = scrapy.Field()         # daily low
    openPrice = scrapy.Field()      # opening price
    closePrice = scrapy.Field()     # previous close
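Since the requirement also mentions serialized output of Item data, the same items can additionally be exported to a file without extra code by enabling a feed export in settings.py; a sketch (the file name is illustrative):

# Optional: also serialize the scraped items to a JSON file (illustrative)
FEEDS = {
    "stocks.json": {"format": "json", "encoding": "utf8"},
}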
3. Run the spider
(1) In a command line or terminal, navigate to the project root and run: scrapy crawl stock
(2) Results screenshot
Gitee link: https://gitee.com/wd_b/party-soldier-data-collection/tree/master/数据采集实践3/stock_spider/data
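The stored rows can also be checked directly in MySQL; a minimal verification sketch using the same connection parameters as the pipeline above:

import mysql.connector

# Quick check of the rows inserted by MySQLPipeline (parameters match pipelines.py)
conn = mysql.connector.connect(host="localhost", user="root",
                               password="123456", database="stock_spider")
cur = conn.cursor()
cur.execute("SELECT COUNT(*) FROM stocks")
print("rows stored:", cur.fetchone()[0])
cur.close()
conn.close()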
- Reflections
In Task 2 I systematically learned the basics of crawling data with the Scrapy framework and extracting information with XPath. Working through each part of the experiment step by step, I benefited a great deal.
Task 3
- Data collection experiment
Gitee link: https://gitee.com/wd_b/party-soldier-data-collection/tree/master/数据采集实践3/currency_project
Requirement: become proficient with the serialized output of Item and Pipeline data in Scrapy; crawl foreign exchange data along the Scrapy + XPath + MySQL storage technical route and save it to a MySQL database.
Candidate website: Bank of China: https://www.boc.cn/sourcedb/whpj
Output: the data is stored in a MySQL database, following the required output format.
1. Create the Scrapy project
(1) Enter the assignment folder and create a new Scrapy project (scrapy startproject currency_project)
(2) Inspect the generated directory structure
(3) Enter the project directory
2. Write the spider
(1) Create the spider file
Create a spider file named CurrencySpider.py under the currency_project/spiders directory.
Gitee link: https://gitee.com/wd_b/party-soldier-data-collection/blob/master/数据采集实践3/currency_project/currency_project/spiders/CurrencySpider.py
(2) Edit the spider file; the code is as follows
import scrapy
from scrapy import signals
from scrapy.utils.log import configure_logging
from ..items import CurrencyItem


class CurrencySpider(scrapy.Spider):
    name = "currency"
    allowed_domains = ["boc.cn"]
    start_urls = ["https://www.boc.cn/sourcedb/whpj/"]

    def __init__(self, *args, **kwargs):
        # Configure the log output format
        configure_logging({'LOG_FORMAT': '%(levelname)s: %(message)s'})
        super().__init__(*args, **kwargs)

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url, callback=self.parse, errback=self.errback)

    def parse(self, response):
        # Select all <tr> elements except the first one (the header row)
        rows = response.xpath("//tr[position()>1]")
        # Iterate over each row and extract the text of its <td> cells
        for row in rows:
            currencyname = row.xpath("./td[1]//text()").get()
            hui_in = row.xpath("./td[2]//text()").get()
            chao_in = row.xpath("./td[3]//text()").get()
            hui_out = row.xpath("./td[4]//text()").get()
            chao_out = row.xpath("./td[5]//text()").get()
            zhonghang = row.xpath("./td[6]//text()").get()
            date = row.xpath("./td[7]//text()").get()
            time = row.xpath("./td[8]//text()").get()

            currency = CurrencyItem()
            currency['currencyname'] = str(currencyname)
            currency['hui_in'] = str(hui_in)
            currency['chao_in'] = str(chao_in)
            currency['hui_out'] = str(hui_out)
            currency['chao_out'] = str(chao_out)
            currency['zhonghang'] = str(zhonghang)
            currency['date'] = str(date)
            currency['time'] = str(time)
            yield currency

    def errback(self, failure):
        self.logger.error(repr(failure))
(3) Edit the settings.py file; the code is as follows
Gitee link: https://gitee.com/wd_b/party-soldier-data-collection/blob/master/数据采集实践3/currency_project/currency_project/settings.py
BOT_NAME = "currency_project"
SPIDER_MODULES = ["currency_project.spiders"]
NEWSPIDER_MODULE = "currency_project.spiders"

ITEM_PIPELINES = {
    'currency_project.pipelines.CurrencyPipeline': 300,
}

LOG_LEVEL = 'INFO'

MYSQL_HOST = 'localhost'
MYSQL_PORT = 3307
MYSQL_USER = 'root'
MYSQL_PASSWORD = '123456'
MYSQL_DB = 'data acquisition'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'waihui (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False
(4) Edit the pipelines.py file; the code is as follows
Gitee link: https://gitee.com/wd_b/party-soldier-data-collection/blob/master/数据采集实践3/currency_project/currency_project/pipelines.py
import pymysql


class CurrencyPipeline:
    def __init__(self):
        # Database connection parameters
        self.host = "localhost"
        self.port = 3306
        self.user = "root"
        self.password = "123456"
        self.db = "data acquisition"
        self.charset = "utf8"
        self.table_name = "currency"

    def open_spider(self, spider):
        # Open the database connection when the spider starts
        self.client = pymysql.connect(
            host=self.host,
            port=self.port,
            user=self.user,
            password=self.password,
            db=self.db,
            charset=self.charset
        )
        self.cursor = self.client.cursor()
        # Create the table if it does not exist yet
        create_table_query = """
        CREATE TABLE IF NOT EXISTS {} (
            id INT AUTO_INCREMENT PRIMARY KEY,
            currencyname VARCHAR(255),
            hui_in FLOAT,
            chao_in FLOAT,
            hui_out FLOAT,
            chao_out FLOAT,
            zhonghang FLOAT,
            date VARCHAR(255),
            time VARCHAR(255)
        )
        """.format(self.table_name)
        self.cursor.execute(create_table_query)

    def process_item(self, item, spider):
        # Insert each scraped item into the database
        # Convert the hui_in field, handling missing values
        hui_in_value = item.get("hui_in")
        if hui_in_value == 'None':
            hui_in_value = None
        else:
            try:
                hui_in_value = float(hui_in_value)
            except ValueError:
                hui_in_value = None
        # Convert the hui_out field, handling missing values
        hui_out_value = item.get("hui_out")
        if hui_out_value == 'None':
            hui_out_value = None
        else:
            try:
                hui_out_value = float(hui_out_value)
            except ValueError:
                hui_out_value = None
        # Convert the chao_in field, handling missing values
        chao_in_value = item.get("chao_in")
        if chao_in_value == 'None':
            chao_in_value = None
        else:
            try:
                chao_in_value = float(chao_in_value)
            except ValueError:
                chao_in_value = None
        # Convert the chao_out field, handling missing values
        chao_out_value = item.get("chao_out")
        if chao_out_value == 'None':
            chao_out_value = None
        else:
            try:
                chao_out_value = float(chao_out_value)
            except ValueError:
                chao_out_value = None
        # Convert the zhonghang field, handling missing values
        zhonghang_value = item.get("zhonghang")
        if zhonghang_value == 'None':
            zhonghang_value = None
        else:
            try:
                zhonghang_value = float(zhonghang_value)
            except ValueError:
                zhonghang_value = None
        # Assemble the values to insert
        args = [
            item.get("currencyname"),
            hui_in_value,
            chao_in_value,
            hui_out_value,
            chao_out_value,
            zhonghang_value,
            item.get("date"),
            item.get("time"),
        ]
        print("Inserting data:", args)
        # Insert the row into the database
        sql = "INSERT INTO {} (currencyname, hui_in, chao_in, hui_out, chao_out, zhonghang, date, time) VALUES (%s,%s,%s,%s,%s,%s,%s,%s)".format(
            self.table_name)
        self.cursor.execute(sql, args)
        self.client.commit()
        # Return the processed item
        return item

    def close_spider(self, spider):
        # Close the database connection when the spider finishes
        self.cursor.close()
        self.client.close()
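The five conversion blocks in process_item repeat the same pattern; as a design note, they could be collapsed into one small helper, sketched below (the helper name is illustrative and not part of the repository code):

def to_float_or_none(value):
    """Convert a scraped string to float, returning None for missing or invalid values."""
    if value in (None, "None", ""):
        return None
    try:
        return float(value)
    except (TypeError, ValueError):
        return None

# Usage inside process_item, e.g.:
#   hui_in_value = to_float_or_none(item.get("hui_in"))
#   chao_out_value = to_float_or_none(item.get("chao_out"))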
(5) Edit the items.py file; the code is as follows
Gitee link: https://gitee.com/wd_b/party-soldier-data-collection/blob/master/数据采集实践3/currency_project/currency_project/items.py
import scrapy


class CurrencyItem(scrapy.Item):
    currencyname = scrapy.Field()  # currency name
    hui_in = scrapy.Field()        # spot exchange buying rate
    chao_in = scrapy.Field()       # cash buying rate
    hui_out = scrapy.Field()       # spot exchange selling rate
    chao_out = scrapy.Field()      # cash selling rate
    zhonghang = scrapy.Field()     # BOC conversion rate
    date = scrapy.Field()          # publication date
    time = scrapy.Field()          # publication time
3. Run the spider
(1) In a command line or terminal, navigate to the project root and run: scrapy crawl currency
(2) Results screenshot
Gitee link: https://gitee.com/wd_b/party-soldier-data-collection/tree/master/数据采集实践3/currency_project/data@0020acquisition
- Reflections
Through this experiment I became proficient in the full workflow of crawling a foreign exchange website with the Scrapy framework and XPath and storing the results in a MySQL database: I defined the CurrencyItem class to hold the data, wrote the Spider class to extract the exchange rate listings, and serialized the data into MySQL through the Pipeline. The process deepened my understanding of the Scrapy framework and improved my data crawling and processing skills.