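Scraping the Baidu Hot Search Top 10

First, name of the topic-focused web crawler: scraping Baidu hot search
Second, content scraped by the topic-focused web crawler: the Baidu hot search top 10
Third, overview of the crawler design: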

1. Identify the Baidu hot search page: http://top.baidu.com/

2. Parse the HTML page

3. Scrape the page content

4. Visualize the data and persist the results

5. Full code

6. Summary

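1. Define the target: Baidu hot search, http://top.baidu.com/

2. Parse the HTML page

On the target page, press F12 or right-click and choose Inspect Element to locate the nodes you need. You can also right-click and view the page source.

This gives the URL: 'http://top.baidu.com/'

Node lookup method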

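import json


def count_album_songs(html, all_albumcount):
    # html: JSON text of a response; all_albumcount: dict mapping year -> song count
    js = json.loads(html)
    # Locate albumTime and albumCount
    albumlist = js['data']['album']['list']
    for song in albumlist:
        albumtime = song['publicTime']
        albumsongs = song['song_count']
        # Add the count only when the release year is already a key in the dict
        # (int() replaces the original eval() calls, which are unsafe on scraped text)
        year = int(albumtime.split('-')[0])
        if year in all_albumcount:
            all_albumcount[year] += int(albumsongs)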
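
For the HTML page itself, nodes are located by tag and class rather than by JSON key. The following is a minimal sketch of that kind of lookup, not the exact code used for the live page. It runs on a small hypothetical snippet (SAMPLE and find_text are made up for illustration) and uses the standard library's xml.etree.ElementTree in place of lxml, because its limited XPath subset is enough for an attribute match on well-formed markup.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample markup; the real page layout may differ.
SAMPLE = """
<div>
  <ul>
    <li><a class="list-title" href="#">topic A</a><span class="icon-rise">12345</span></li>
    <li><a class="list-title" href="#">topic B</a><span class="icon-fall">6789</span></li>
  </ul>
</div>
"""


def find_text(root, tag, css_class):
    """Return the text of every <tag> element whose class attribute equals css_class."""
    return [el.text for el in root.findall(f".//{tag}[@class='{css_class}']")]


root = ET.fromstring(SAMPLE)
titles = find_text(root, "a", "list-title")
rising = find_text(root, "span", "icon-rise")
print(titles)
print(rising)
```

In this sample the second entry carries a different class, so the title list and the index list have different lengths. If the real page mixes classes in the same way, pairing the two lists by position would attach an index to the wrong title, which is worth checking in the browser inspector.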

3. Scrape the page content

import requests
from lxml import etree

url = 'http://top.baidu.com/'
# Browser-like request headers
head = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:63.0) Gecko/20100101 Firefox/63.0",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "zh-CN,zh;q=0.8,zh-TW;q=0.7,zh-HK;q=0.5,en-US;q=0.3,en;q=0.2",
    "Connection": "keep-alive",
}


def main():
    print("Baidu hot search top 10:")
    res = requests.get(url, headers=head, timeout=10)
    # Save the raw bytes, then parse the saved file as GBK-encoded HTML
    with open("html.txt", "wb") as f:
        f.write(res.content)
    html = etree.parse('html.txt', etree.HTMLParser(encoding='gbk'))
    # Topic titles and their search indices
    top_list = html.xpath('//a[@class="list-title"]/text()')
    num_search = html.xpath('//span[@class="icon-rise"]/text()')

    for i, j in zip(top_list[:10], num_search[:10]):
        print(i, "search index:", j)
    return top_list[:10], num_search[:10]


if __name__ == '__main__':
    main()
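
The loop above prints the XPath results as raw strings. Before plotting or saving them, it helps to turn the two lists into cleaned rows. The sketch below shows one way to do that. top_n_pairs is a hypothetical helper, and it assumes every search index is a plain string of digits, possibly surrounded by whitespace.

```python
def top_n_pairs(titles, counts, n=10):
    """Pair each title with its search index, keeping at most n rows.

    Assumes each entry in counts is a digit string such as "12345".
    """
    rows = []
    for title, count in zip(titles[:n], counts[:n]):
        rows.append((title.strip(), int(count.strip())))
    return rows


# Example with made-up values
sample_rows = top_n_pairs([" topic A ", "topic B", "topic C"], ["300\n", " 200", "100"], n=2)
print(sample_rows)
```

Converting the index to int at this point means the charting code can sort and scale the values numerically instead of treating them as labels.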

4. Visualize the data and persist the results

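import matplotlib.pyplot as plt


def ChartBar(xaxis, yaxis):
    plt.figure()
    # The old `left=` keyword is no longer accepted by plt.bar; pass x positionally
    plt.bar(xaxis, yaxis, color='b', width=0.5)
    plt.ylabel('top_list')
    plt.xlabel('num_search')
    plt.title('Barplot')
    # Save the result to disk (data persistence)
    plt.savefig('Bar', dpi=600)
    print('Bar chart saved')
    plt.show()


def ChartBroken(x, y):
    # Show the figure in a separate window
    plt.figure()
    plt.plot(x, y)
    # Name the y axis
    plt.ylabel('num_search')
    # Name the x axis
    plt.xlabel('top_list')
    # Fixed axis limits; adjust them if your data falls outside this range
    plt.axis([0, 120000, 0, 20000])
    plt.title('Brokenplot')
    # Save the result to disk (data persistence)
    plt.savefig('Broken', dpi=600)
    print('Line chart saved')
    plt.show()


def sandian():
    # df is expected to be a pandas DataFrame with '排名' (rank) and '热度' (heat) columns
    x = df['排名']
    y = df['热度']
    plt.xlabel('Rank')
    plt.ylabel('Heat')
    plt.scatter(x, y, color="red", label="Heat distribution", linewidth=2)
    plt.title("Rank vs. heat scatter plot")
    plt.legend()
    plt.show()


# Call this only after df has been built from the scraped data
sandian()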
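
The functions above persist only the chart images. If the scraped rows themselves should be kept as well, one option is to write them to a CSV file. This is a minimal sketch using the standard csv module. save_rows and load_rows are hypothetical helpers, and the '标题' (title) column name is an assumption added next to the '排名' and '热度' columns used above.

```python
import csv


def save_rows(rows, path):
    """Write (rank, title, heat) rows to a CSV file with a header line."""
    # utf-8-sig adds a BOM so that spreadsheet tools detect the encoding
    with open(path, "w", newline="", encoding="utf-8-sig") as f:
        writer = csv.writer(f)
        writer.writerow(["排名", "标题", "热度"])
        writer.writerows(rows)


def load_rows(path):
    """Read the rows back, converting rank and heat to integers."""
    with open(path, newline="", encoding="utf-8-sig") as f:
        reader = csv.reader(f)
        next(reader)  # skip the header line
        return [(int(rank), title, int(heat)) for rank, title, heat in reader]
```

A file written this way can also be loaded with pandas.read_csv to rebuild the df that sandian() reads.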

5. Full code

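import json

import matplotlib.pyplot as plt
import pandas as pd
import requests
from lxml import etree

url = 'http://top.baidu.com/'
# Browser-like request headers
head = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:63.0) Gecko/20100101 Firefox/63.0",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "zh-CN,zh;q=0.8,zh-TW;q=0.7,zh-HK;q=0.5,en-US;q=0.3,en;q=0.2",
    "Connection": "keep-alive",
}


def main():
    print("Baidu hot search top 10:")
    res = requests.get(url, headers=head, timeout=10)
    # Save the raw bytes, then parse the saved file as GBK-encoded HTML
    with open("html.txt", "wb") as f:
        f.write(res.content)
    html = etree.parse('html.txt', etree.HTMLParser(encoding='gbk'))
    top_list = html.xpath('//a[@class="list-title"]/text()')
    num_search = html.xpath('//span[@class="icon-rise"]/text()')
    for i, j in zip(top_list[:10], num_search[:10]):
        print(i, "search index:", j)
    return top_list[:10], num_search[:10]


def count_album_songs(html, all_albumcount):
    # html: JSON text of a response; all_albumcount: dict mapping year -> song count
    js = json.loads(html)
    # Locate albumTime and albumCount
    albumlist = js['data']['album']['list']
    for song in albumlist:
        albumtime = song['publicTime']
        albumsongs = song['song_count']
        # Add the count only when the release year is already a key in the dict
        year = int(albumtime.split('-')[0])
        if year in all_albumcount:
            all_albumcount[year] += int(albumsongs)


def ChartBar(xaxis, yaxis):
    plt.figure()
    # The old `left=` keyword is no longer accepted by plt.bar; pass x positionally
    plt.bar(xaxis, yaxis, color='b', width=0.5)
    plt.ylabel('top_list')
    plt.xlabel('num_search')
    plt.title('Barplot')
    # Save the result to disk (data persistence)
    plt.savefig('Bar', dpi=600)
    print('Bar chart saved')
    plt.show()


def ChartBroken(x, y):
    # Show the figure in a separate window
    plt.figure()
    plt.plot(x, y)
    plt.ylabel('num_search')
    plt.xlabel('top_list')
    # Fixed axis limits; adjust them if your data falls outside this range
    plt.axis([0, 120000, 0, 20000])
    plt.title('Brokenplot')
    # Save the result to disk (data persistence)
    plt.savefig('Broken', dpi=600)
    print('Line chart saved')
    plt.show()


def sandian():
    # Reads the module-level df built in the __main__ block below
    x = df['排名']
    y = df['热度']
    plt.xlabel('Rank')
    plt.ylabel('Heat')
    plt.scatter(x, y, color="red", label="Heat distribution", linewidth=2)
    plt.title("Rank vs. heat scatter plot")
    plt.legend()
    plt.show()


if __name__ == '__main__':
    top_list, num_search = main()
    # Assumption: the original code never shows how df is built. Here it is
    # assembled from the scraped lists, treating each index as a plain integer.
    heat = [int(n.strip()) for _, n in zip(top_list, num_search)]
    df = pd.DataFrame({'排名': list(range(1, len(heat) + 1)), '热度': heat})
    sandian()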

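6. Summary

The charts show that heat on Baidu hot search falls as you move down the ranking: the higher a topic ranks, the higher its heat and the more people are discussing it.

I slacked off last semester, so now I have to sit down and write the code properly. To finish this assignment I went back over the course material. I had felt that I only half understood it, and working through the task carefully exposed those gaps right away.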
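
The observation can also be checked on the numbers rather than read off the chart. A minimal sketch, assuming the heat values are listed in rank order (is_non_increasing is a hypothetical helper, and the sample values are made up):

```python
def is_non_increasing(values):
    """Return True if no value is larger than the one before it."""
    return all(a >= b for a, b in zip(values, values[1:]))


# Made-up heat values listed from rank 1 downward
print(is_non_increasing([500, 420, 420, 310]))
print(is_non_increasing([500, 420, 450]))
```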

Posted 2020-09-25 21:40 by 流菏
