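Scraping Taobao Women's Clothing Listings and Visualizing the Data
This post is a hands-on exercise in web scraping plus data visualization. The scraper targets women's clothing listings on Taobao.
The full code and data can be downloaded from my Gitee repository: 爬取淘宝女装并可视化分析 (gitee.com), which scrapes Taobao product information, saves it locally, and visualizes it.
Part 1: Fetching the Taobao data
①: Find where the page stores its data. Search for 女装 (women's clothing), copy the title of the first product so you can search for it in the page source, then right-click and choose Inspect. As the screenshot shows, the data sits inside the braces of 'g_page_config = {}', so a regular expression can later be used to extract this JSON data.
②: Analyze the request URL. The request method is GET, so the request can be sent with requests.get().
③: Analyze the request headers. They carry the cookie, user-agent, and related fields, which also need to be added to the headers of our own request.
④: Analyze the JSON structure, that is, the data inside the braces found earlier. The detailed product data lives under mods → itemlist → data → auctions.
⑤: With the analysis done, you can start writing code. At first some data may fail to come down. If so, run in debug mode with a breakpoint on the line that sends the request.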
import json
import re

import requests

# Search results page for the keyword 女装 (URL-encoded in the q parameter)
url = 'https://s.taobao.com/search?q=%E5%A5%B3%E8%A3%85&commend=all&ssid=s5-e&search_type=item&sourceId=tb.index&spm=a21bo.jianhua.201856-taobao-item.2&ie=utf8&initiative_id=tbindexz_20170306'

headers = {
    # Paste the cookie string copied from your own logged-in browser session
    'cookie': '<your cookie string>',
    'sec-ch-ua': '"Chromium";v="106", "Google Chrome";v="106", "Not;A=Brand";v="99"',
    'sec-ch-ua-mobile': '?0',
    'sec-ch-ua-platform': "Windows",
    'sec-fetch-dest': 'document',
    'sec-fetch-mode': 'navigate',
    'sec-fetch-site': 'same-origin',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36'
}

response = requests.get(url=url, headers=headers)
# print(response.text)

# The page embeds its data as "g_page_config = {...};", so capture the braces
html_data = re.findall('g_page_config = (.*);', response.text)[0]
# print(html_data)
json_data = json.loads(html_data)
# print(json_data)
data = json_data['mods']['itemlist']['data']['auctions']

for auction in data:
    try:
        item = {
            '标题': auction['raw_title'],
            '价格': auction['view_price'],
            '店铺': auction['nick'],
            '购买人数': auction['view_sales'],
            '地点': auction['item_loc'],
            '商品详情页': 'https:' + auction['detail_url'],
            '店铺链接': 'https:' + auction['shopLink'],
            '图片链接': 'https:' + auction['pic_url']
        }
        print(item)
    except Exception as e:
        # A listing with a missing field is reported and skipped
        print(e)
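Note: re.findall() returns a list, so we take element 0, which is the string we need. json.loads() then parses that string into Python dicts and lists, ['mods']['itemlist']['data']['auctions'] gives the list of all products, and the loop walks through each product in turn.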
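The extraction step described in the note can be tried offline. The following is a minimal sketch, not the code from the repository: `SAMPLE_HTML` is a made-up, heavily shortened stand-in for a real results page, and `extract_auctions` is a hypothetical helper name. It also returns an empty list when the marker is missing, which may happen if the response is not the normal results page.

```python
import json
import re

# Hypothetical, heavily shortened stand-in for a real search results page.
SAMPLE_HTML = '''
<script>
    g_page_config = {"mods": {"itemlist": {"data": {"auctions": [{"raw_title": "dress", "view_price": "59.00"}]}}}};
    g_srp_loadCss();
</script>
'''


def extract_auctions(html):
    """Return the list of auctions embedded in the page, or [] if none is found."""
    # "." does not match newlines, so the greedy ".*" stops at the last ";" on that line.
    matches = re.findall(r'g_page_config = (.*);', html)
    if not matches:
        return []
    page_config = json.loads(matches[0])
    return (page_config.get('mods', {})
                       .get('itemlist', {})
                       .get('data', {})
                       .get('auctions', []))


auctions = extract_auctions(SAMPLE_HTML)
print(auctions[0]['raw_title'])
```

Checking for an empty result before indexing avoids the IndexError that `[0]` raises when the pattern is not found.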
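⑥: The code above fetches only one page. Comparing the URLs of the second and third pages shows that only the number at the end changes, and it is always a multiple of 44.
⑦: So the next step is to improve the code to loop over pages and save the results to a local CSV file.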
import csv
import json
import re
import time

import requests

headers = {
    # Paste the cookie string copied from your own logged-in browser session
    'cookie': '<your cookie string>',
    'sec-ch-ua': '"Chromium";v="106", "Google Chrome";v="106", "Not;A=Brand";v="99"',
    'sec-ch-ua-mobile': '?0',
    'sec-ch-ua-platform': "Windows",
    'sec-fetch-dest': 'document',
    'sec-fetch-mode': 'navigate',
    'sec-fetch-site': 'same-origin',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36'
}

# gbk matches the encoding used to read the file back in the analysis part
with open('taobao2.csv', 'w', encoding='gbk', newline='') as csvfile:
    csvwriter = csv.DictWriter(csvfile, fieldnames=['标题', '价格', '店铺', '购买人数', '地点', '商品详情页', '店铺链接', '图片链接'])
    csvwriter.writeheader()
    # The s parameter takes the values 44, 88, ..., 99 * 44
    for i in range(1, 100):
        time.sleep(2)  # pause between requests
        url = f'https://s.taobao.com/search?q=%E7%88%B1%E4%BE%9D%E6%9C%8D&imgfile=&js=1&stats_click=search_radio_all%3A1&initiative_id=staobaoz_20221026&ie=utf8&bcoffset=2&ntoffset=2&p4ppushleft=2%2C48&s={i*44}'
        response = requests.get(url=url, headers=headers)
        # print(response.text)
        html_data = re.findall('g_page_config = (.*);', response.text)[0]
        # print(html_data)
        json_data = json.loads(html_data)
        # print(json_data)
        data = json_data['mods']['itemlist']['data']['auctions']
        for auction in data:
            try:
                item = {
                    '标题': auction['raw_title'],
                    '价格': auction['view_price'],
                    '店铺': auction['nick'],
                    '购买人数': auction['view_sales'],
                    '地点': auction['item_loc'],
                    '商品详情页': 'https:' + auction['detail_url'],
                    '店铺链接': 'https:' + auction['shopLink'],
                    '图片链接': 'https:' + auction['pic_url']
                }
                csvwriter.writerow(item)
                print(item)
            except Exception as e:
                # A listing that cannot be built or written is reported and skipped
                print(e)
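The page offsets can also be generated with a small helper instead of a hand-edited URL string. This is only a sketch under assumptions: `build_search_url` and `PAGE_SIZE` are hypothetical names, the first page is assumed to correspond to `s=0`, and it is assumed that the `q`, `ie`, and `s` parameters are enough (the other query parameters from the copied URL are left out).

```python
from urllib.parse import urlencode

PAGE_SIZE = 44  # offset step observed between consecutive result pages


def build_search_url(keyword, page):
    """Build the search URL for a zero-based page number (assumption: page 0 means s=0)."""
    params = {'q': keyword, 'ie': 'utf8', 's': page * PAGE_SIZE}
    # urlencode percent-encodes the keyword as UTF-8
    return 'https://s.taobao.com/search?' + urlencode(params)


for page in range(3):
    print(build_search_url('女装', page))
```

Building the query with urlencode keeps the keyword readable in the source and makes it easy to switch search terms.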
Opening the CSV file confirms that the data was fetched and saved successfully.
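The same check can be done in code rather than by opening the file by hand. The sketch below is an illustration only: `summarize_csv` is a hypothetical helper, and gbk is assumed as the file encoding because that is what the analysis part uses when reading the file. The demo writes a tiny temporary file so that it runs without the real taobao2.csv.

```python
import csv
import os
import tempfile


def summarize_csv(path, encoding='gbk'):
    """Return (column names, number of data rows) for a CSV file."""
    with open(path, encoding=encoding, newline='') as csvfile:
        reader = csv.DictReader(csvfile)
        row_count = sum(1 for _ in reader)
        return reader.fieldnames, row_count


# Self-contained demo with a made-up one-row file
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, 'demo.csv')
    with open(demo_path, 'w', encoding='gbk', newline='') as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames=['标题', '价格'])
        writer.writeheader()
        writer.writerow({'标题': '连衣裙', '价格': '59.00'})
    demo_summary = summarize_csv(demo_path)

print(demo_summary)
```

For the real file, calling `summarize_csv('taobao2.csv')` should report the eight column names and the number of scraped rows.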
Part 2: Visualizing and analyzing the Taobao data
①: Importing modules and preprocessing the data
# Import modules
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pyecharts.charts import Line, Pie, Bar, Gauge, Funnel
import pyecharts.options as opt
from pyecharts.globals import ThemeType
import warnings
warnings.filterwarnings(action='ignore')

# Read the data
data = pd.read_csv('./taobao2.csv', encoding='gbk')
data.head()

# Inspect the column types
data.info()

# Check for missing values
data.isna().any()  # no missing values

# Drop duplicate rows
data.drop_duplicates(inplace=True)
data.duplicated().any()  # False once the duplicates are gone
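# Clean the purchase-count and location columns
data['购买人数'] = data['购买人数'].apply(lambda x: x[:-3])  # drop the 3-character text suffix; the result is still a string
data['省份'] = data['地点'].apply(lambda x: x[:2])  # the first two characters of the location give the province

# Convert the price column to integers
data['价格'] = data['价格'].astype('int64')
data['价格']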
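After slicing, the purchase-count column still holds strings, so it cannot be summed directly. Here is a hedged sketch of one way to turn it into numbers. It assumes the raw values look like '1000+人付款' or '1.5万+人付款' (the exact format is an assumption, and the CSV may contain other variants), and `parse_sales` is a hypothetical helper name.

```python
import re


def parse_sales(text):
    """Convert a sales string such as '1.5万+人付款' to an approximate integer."""
    match = re.match(r'(\d+(?:\.\d+)?)(万)?', text.strip())
    if match is None:
        return 0  # no leading number found
    number = float(match.group(1))
    if match.group(2):
        number *= 10000  # 万 means ten thousand
    return int(round(number))


for raw in ['36人付款', '1000+人付款', '1.5万+人付款']:
    print(raw, parse_sales(raw))
```

With pandas this could be applied as `data['购买人数'].apply(parse_sales)` on the original, unsliced column. The '+' means the result is a lower bound rather than an exact count.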
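# Add a column that stores the price band of each listing
data['新价格'] = pd.cut(data.loc[:, '价格'],
                     bins=[0, 100, 200, 300, 400, 500, 600, 700, 15150],
                     right=True,  # each band includes its upper edge, e.g. (0, 100]
                     labels=['0-100', '100-200', '200-300', '300-400', '400-500', '500-600', '600-700', '700以上'])
data['新价格']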
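To make the right=True behavior concrete, here is a small standard-library sketch that mimics the same binning for a single price. It is an illustration of the interval logic, not a replacement for pd.cut. `price_band`, `EDGES`, and `LABELS` are hypothetical names, with the edges and labels copied from the call above.

```python
from bisect import bisect_left

EDGES = [0, 100, 200, 300, 400, 500, 600, 700, 15150]
LABELS = ['0-100', '100-200', '200-300', '300-400',
          '400-500', '500-600', '600-700', '700以上']


def price_band(price):
    """Return the label of the right-closed interval containing price, or None."""
    position = bisect_left(EDGES, price)
    if position == 0 or position == len(EDGES):
        return None  # outside (0, 15150]; pd.cut gives NaN for such values
    return LABELS[position - 1]


for price in [0, 59, 100, 101, 15150, 20000]:
    print(price, price_band(price))
```

A price of exactly 100 falls in '0-100' while 101 falls in '100-200', and a price of 0 or anything above 15150 gets no band at all.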
②: Visualizing the data
# Top 10 shops
# Note: count() returns the number of listings per shop, not the summed purchase count
top_shops = data.groupby('店铺')['购买人数'].count().sort_values(ascending=False)[:10]
shop_sum = top_shops.to_list()
shop_name = top_shops.index.to_list()
bar = Bar(init_opts=opt.InitOpts(theme=ThemeType.DARK))
bar.add_xaxis(shop_name)
bar.add_yaxis('', shop_sum, itemstyle_opts=opt.ItemStyleOpts(color='#FF8C00'))
bar.set_global_opts(title_opts=opt.TitleOpts(title='购买人数排名前10的店铺'),
                    brush_opts=opt.BrushOpts(),      # enable the brush selection tool
                    toolbox_opts=opt.ToolboxOpts(),  # enable the chart toolbox
                    yaxis_opts=opt.AxisOpts(axislabel_opts=opt.LabelOpts(formatter="{value}/人"), name="总人数"),  # y-axis name and tick unit
                    xaxis_opts=opt.AxisOpts(name="店铺名", axislabel_opts=opt.LabelOpts(font_size=10, rotate=12)),  # x-axis name
                    )
bar.render_notebook()

# Top 10 provinces (again counted by number of listings)
top_provinces = data.groupby('省份')['购买人数'].count().sort_values(ascending=False)[:10]
shop_sum = top_provinces.to_list()
shop_name = top_provinces.index.to_list()
bar = Bar(init_opts=opt.InitOpts(theme=ThemeType.DARK))
bar.add_xaxis(shop_name)
bar.add_yaxis('', shop_sum, itemstyle_opts=opt.ItemStyleOpts(color='#FF8C00'))
bar.set_global_opts(title_opts=opt.TitleOpts(title='购买人数排名前10的省份'),
                    brush_opts=opt.BrushOpts(),      # enable the brush selection tool
                    toolbox_opts=opt.ToolboxOpts(),  # enable the chart toolbox
                    yaxis_opts=opt.AxisOpts(axislabel_opts=opt.LabelOpts(formatter="{value}/人"), name="总人数"),  # y-axis name and tick unit
                    xaxis_opts=opt.AxisOpts(name="省份", axislabel_opts=opt.LabelOpts(font_size=12)),  # x-axis name
                    )
bar.render_notebook()

# Ranking of price bands (number of listings in each band)
price_bands = data.groupby('新价格')['购买人数'].count().sort_values(ascending=False)
shop_sum = price_bands.to_list()
shop_name = price_bands.index.to_list()
bar = Bar(init_opts=opt.InitOpts(theme=ThemeType.DARK))
bar.add_xaxis(shop_name)
bar.add_yaxis('', shop_sum, itemstyle_opts=opt.ItemStyleOpts(color='#FF8C00'))
bar.set_global_opts(title_opts=opt.TitleOpts(title='价格分布排名'),
                    brush_opts=opt.BrushOpts(),      # enable the brush selection tool
                    toolbox_opts=opt.ToolboxOpts(),  # enable the chart toolbox
                    yaxis_opts=opt.AxisOpts(axislabel_opts=opt.LabelOpts(formatter="{value}/人"), name="总人数"),  # y-axis name and tick unit
                    xaxis_opts=opt.AxisOpts(name="价格分布区", axislabel_opts=opt.LabelOpts(font_size=12)),  # x-axis name
                    )
bar.render_notebook()
# Build a word cloud
# Import modules
import jieba                     # word segmentation
from wordcloud import WordCloud  # word cloud
import codecs                    # codecs.open() lets you specify a file's encoding and decodes to unicode on read
import matplotlib.pyplot as plt
%matplotlib inline
import matplotlib
matplotlib.rcParams['figure.figsize'] = (10.0, 5.0)

# tolist: convert the array of titles (object dtype) to a list
title = data.标题.values.tolist()
print(title[0])
print(jieba.lcut(title[0]))  # try the segmentation on one title

# Segmentation: loop over the whole title list and segment every title
segment = []  # collects every word produced by the segmentation
for line in title:  # line is the title of one row
    try:
        segs = jieba.lcut(line)  # segs is a list like the one printed above
        # Keep only tokens longer than one character that are not a line break
        for arg in segs:
            if len(arg) > 1 and arg != '\r\n':
                segment.append(arg)
    except Exception:
        print(line)
        continue

# Remove stop words
words_df = pd.DataFrame({'segment': segment})  # new DataFrame holding the raw segmentation result
stopwords = pd.read_csv('./stopwords.txt', quoting=3, delimiter='\t', names=['stopword'])
words_df = words_df[~words_df.segment.isin(stopwords.stopword)]  # drop words that appear in the stop-word list

# Count word frequencies
words_stat = words_df.groupby(by='segment')['segment'].agg([("计数", "count")])
words_stat = words_stat.reset_index().sort_values(by='计数', ascending=False)
words_stat.head()
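The filter-then-count logic of the last three cells can also be written with the standard library alone, which makes it easy to test on a few titles. This is a sketch, not the post's method: `count_words` is a hypothetical helper, and the token lists and stop-word set are made-up sample data standing in for the output of jieba.lcut and the contents of stopwords.txt.

```python
from collections import Counter


def count_words(token_lists, stopwords):
    """Count tokens longer than one character that are not stop words."""
    counts = Counter()
    for tokens in token_lists:
        for token in tokens:
            if len(token) > 1 and token not in stopwords:
                counts[token] += 1
    return counts


# Made-up, already segmented titles (jieba.lcut would produce lists like these)
segmented_titles = [
    ['秋季', '新款', '连衣裙', '女'],
    ['新款', '宽松', '卫衣', '女', '秋季'],
    ['新款', '连衣裙'],
]
stopwords = {'新款'}

word_counts = count_words(segmented_titles, stopwords)
print(word_counts.most_common(2))
```

A Counter is already a word-to-count mapping, so `dict(word_counts.most_common(2000))` would have the same shape as the frequency dictionary that a word cloud is built from.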
# Draw the word cloud
wordcloud = WordCloud(font_path='data/simhei.ttf', background_color='black', max_font_size=80)
# Take the 2000 most frequent words as a {word: count} dictionary
word_frequence = {x[0]: x[1] for x in words_stat.head(2000).values}
wordcloud = wordcloud.fit_words(word_frequence)  # build the cloud from the word frequencies
plt.imshow(wordcloud)
plt.show()
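PS: Only a few variables were available to analyze, so the analysis here is fairly simple. Still, it is a complete end-to-end exercise, so keep practicing! (●'◡'●)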