Scraping 同花顺 (10jqka) stock data
Requesting the data
import requests

url = 'https://q.10jqka.com.cn/index/index/board/all/field/zdf/order/desc/page/2/ajax/1/'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36'
}
response = requests.get(url=url, headers=headers)
html = response.text
print(html)
Output:
<html><body>
<script type="text/javascript" src="//s.thsi.cn/js/chameleon/chameleon.min.1719332.js"></script>
<script src="//s.thsi.cn/js/chameleon/chameleon.min.1719332.js" type="text/javascript"></script>
<script language="javascript" type="text/javascript">
window.location.href="//q.10jqka.com.cn/index/index/board/all/field/zdf/order/desc/page/2/ajax/1/";
</script>
</body></html>
The response does not contain the data we want to scrape; it only returns a JavaScript redirect page.
Thinking: the site probably has anti-scraping measures.
1. Check the request header fields.
2. Some request parameters may be encrypted.
Solution: 1. Add Cookie, Referer, etc. to the request headers and run again; the problem is solved.
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36',
    'Cookie': 'v=Awcc3GskLeIF9alu7ByNVWQylrDUDNolNeJfetn1IU-9lCmu4dxrPkWw77Xq',
    'Referer': 'https://q.10jqka.com.cn/'
}
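To confirm the fix, you can re-send the request with the new headers and check that the quote table appears in the HTML. A minimal sketch, reusing the url and headers defined above; 'm-table' is the table class used in the parsing step below:

response = requests.get(url=url, headers=headers)
# if the anti-scraping redirect is gone, the quote table should be present
print('m-table' in response.text)  # expected: True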
Parsing the data:
import parsel

response = requests.get(url=url, headers=headers)
html = response.text
selector = parsel.Selector(response.text)
# extract all tr tags (skip the header row)
data = selector.css('.m-table tr')[1:]
# print(data)
for i in data:
    info = i.css('td::text').getall()
    numberAndName = i.css('td a::text').getall()
    # print(numberAndName)
    # print(info)
    # store one row of data in a dictionary
    dit = {
        '序号': info[0],
        '代码': numberAndName[0],
        '名称': numberAndName[1],
        '现价': info[1],
        '涨跌幅(%)': info[2],
        '涨跌': info[3],
        '涨速(%)': info[4],
        '换手(%)': info[5],
        '量比': info[6],
        '振幅(%)': info[7],
        '成交额': info[8],
        '流通股': info[9],
        '流通市值': info[10],
        '市盈率': info[11]
    }
    print(dit)
Saving the data
Save the stock data as a CSV file
import csv

# create the file object
f = open('stockInformation.txt', mode='w', encoding='utf-8', newline='')
# dictionary writer
csv_write = csv.DictWriter(f, fieldnames=[
    '序号', '代码', '名称', '现价', '涨跌幅(%)', '涨跌', '涨速(%)',
    '换手(%)', '量比', '振幅(%)', '成交额', '流通股', '流通市值', '市盈率',
])
# write the header row
csv_write.writeheader()
......
for i in data:
    ......
    # write one row of data
    csv_write.writerow(dit)
    ......
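As a quick sanity check, the saved file can be read back with csv.DictReader. A minimal sketch, assuming the stockInformation.txt file written above:

import csv

# read the saved file back and print the first few rows
with open('stockInformation.txt', mode='r', encoding='utf-8', newline='') as f:
    reader = csv.DictReader(f)
    for index, row in enumerate(reader):
        print(row)
        if index >= 2:  # only show the first three rows
            break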
Output:
Scraping multiple pages
Analyze how the request URL changes from page to page; only the page/N/ segment changes:
https://q.10jqka.com.cn/index/index/board/all/field/zdf/order/desc/page/1/ajax/1/
https://q.10jqka.com.cn/index/index/board/all/field/zdf/order/desc/page/2/ajax/1/
https://q.10jqka.com.cn/index/index/board/all/field/zdf/order/desc/page/3/ajax/1/
Adding pagination:
for page in range(1, 3):
    print(f'Scraping page {page}')
    url = f'https://q.10jqka.com.cn/index/index/board/all/field/zdf/order/desc/page/{page}/ajax/1/'
    ......
    print(dit)
Output:
The target site's cookies change over time, so you can use requests.Session() to manage cookies automatically.
with requests.Session() as session:
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36',
        # no need to set a 'Cookie' header; the Session manages cookies automatically
        'Referer': 'https://q.10jqka.com.cn/'
    }
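Putting the pieces together, here is one way the full flow could look with requests.Session(): pagination, parsing, and CSV writing assembled from the snippets above. This is a sketch rather than the finished script; the output filename stockInformation.csv and the page range are illustrative choices, and whether the site accepts the session's cookies without extra steps depends on its current anti-scraping checks.

import csv
import parsel
import requests

fieldnames = [
    '序号', '代码', '名称', '现价', '涨跌幅(%)', '涨跌', '涨速(%)',
    '换手(%)', '量比', '振幅(%)', '成交额', '流通股', '流通市值', '市盈率',
]

with requests.Session() as session, \
        open('stockInformation.csv', mode='w', encoding='utf-8', newline='') as f:
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36',
        # no 'Cookie' header; the Session keeps cookies between requests
        'Referer': 'https://q.10jqka.com.cn/'
    }
    csv_write = csv.DictWriter(f, fieldnames=fieldnames)
    csv_write.writeheader()

    for page in range(1, 3):
        print(f'Scraping page {page}')
        url = f'https://q.10jqka.com.cn/index/index/board/all/field/zdf/order/desc/page/{page}/ajax/1/'
        response = session.get(url=url, headers=headers)
        selector = parsel.Selector(response.text)
        # skip the header row, then map each row onto the field names
        for i in selector.css('.m-table tr')[1:]:
            info = i.css('td::text').getall()
            numberAndName = i.css('td a::text').getall()
            dit = {
                '序号': info[0], '代码': numberAndName[0], '名称': numberAndName[1],
                '现价': info[1], '涨跌幅(%)': info[2], '涨跌': info[3],
                '涨速(%)': info[4], '换手(%)': info[5], '量比': info[6],
                '振幅(%)': info[7], '成交额': info[8], '流通股': info[9],
                '流通市值': info[10], '市盈率': info[11]
            }
            csv_write.writerow(dit)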