Python Web Scraping: Scraping the Zhihu Hot List (a Static Page)
1. Requesting the Zhihu hot list page
The reference code is as follows:
import requests

url = 'https://www.zhihu.com/hot'
headers = {
    # A desktop browser User-Agent, so Zhihu serves the normal desktop page
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36',
    # Paste the cookie value copied from a logged-in Zhihu session in your browser
    'cookie': 'the cookie value from a logged-in Zhihu account',
}
rsp2 = requests.get(url=url, headers=headers)
Zhihu requires you to be logged in to view any of its content, so a cookie field, copied from the request headers of a logged-in Zhihu account, is added to the headers alongside the User-Agent.
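Before parsing, it is worth checking that the cookie was actually accepted. The lines below are a minimal sanity-check sketch; the assumption that a rejected cookie redirects to a URL containing 'signin' reflects Zhihu's behavior at the time of writing and is not guaranteed:

# Assumption: with a missing or expired cookie, Zhihu redirects the request
# to its sign-in page instead of serving the hot list.
rsp2.raise_for_status()
if 'signin' in rsp2.url:
    raise RuntimeError('Cookie rejected; copy a fresh one from a logged-in browser session.')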
2. Parsing the hot-list entries
2.1 Parsing with the pyquery module
from pyquery import PyQuery as pq

pq2 = pq(rsp2.text)
# Every hot-list entry is an <a> tag inside a .HotItem-content block
list1 = pq2('.HotList-list section .HotItem-content a')
for index, ele in enumerate(list1.items()):
    print(index + 1, ele.attr('title'), ele.attr('href'))
2.2 Parsing with the lxml module (XPath syntax)
from lxml import etree

html2 = etree.HTML(rsp2.text)
list2 = html2.xpath('//*[@class="HotList-list"]/section/*[@class="HotItem-content"]/a')
# Each <a> element carries the question title and link as attributes
for i, ele in enumerate(list2, start=1):
    print(i, ele.xpath('./@title')[0], ele.xpath('./@href')[0])
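One caveat about the XPath above: a predicate like @class="HotItem-content" matches the class attribute as one exact string, so it silently fails if Zhihu ever adds a second class to the element (the CSS selectors in 2.1 and 2.3 do not have this problem). A looser variant, sketched here against the same assumed DOM, uses contains():

# contains() matches a substring of the class attribute, so extra classes
# on the element no longer break the query (at the risk of over-matching).
list2 = html2.xpath('//*[contains(@class, "HotList-list")]/section/*[contains(@class, "HotItem-content")]/a')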
2.3 Parsing with the bs4 module
from bs4 import BeautifulSoup

soup = BeautifulSoup(rsp2.text, 'lxml')
list3 = soup.select('.HotList-list section .HotItem-content a')
for i, ele in enumerate(list3, start=1):  # start=1 keeps the ranking consistent with the other two parsers
    print(i, ele['title'], ele['href'])
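A note on the 'lxml' argument: it names the parser backend BeautifulSoup should use, and it requires the lxml package to be installed. A hedged fallback sketch, using only names exported by the bs4 package itself:

import bs4

# Prefer the fast lxml backend; fall back to the standard-library parser
# if lxml is not installed. Both should parse this page equivalently.
try:
    soup = bs4.BeautifulSoup(rsp2.text, 'lxml')
except bs4.FeatureNotFound:
    soup = bs4.BeautifulSoup(rsp2.text, 'html.parser')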
2.4 Running results
All three approaches print the same ranked list: each entry's position on the hot list, the question title, and its link.
3. Complete reference code
from pyquery import PyQuery as pq
from lxml import etree
from bs4 import BeautifulSoup
import requests

url = 'https://www.zhihu.com/hot'
headers = {
    # A desktop browser User-Agent, so Zhihu serves the normal desktop page
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36',
    # Paste the cookie value copied from a logged-in Zhihu session in your browser
    'cookie': 'the cookie value from a logged-in Zhihu account',
}
rsp2 = requests.get(url=url, headers=headers)

# 1) pyquery (CSS selectors)
pq2 = pq(rsp2.text)
list1 = pq2('.HotList-list section .HotItem-content a')
for index, ele in enumerate(list1.items()):
    print(index + 1, ele.attr('title'), ele.attr('href'))

# 2) lxml (XPath)
html2 = etree.HTML(rsp2.text)
list2 = html2.xpath('//*[@class="HotList-list"]/section/*[@class="HotItem-content"]/a')
for i, ele in enumerate(list2, start=1):
    print(i, ele.xpath('./@title')[0], ele.xpath('./@href')[0])

# 3) bs4 (CSS selectors)
soup = BeautifulSoup(rsp2.text, 'lxml')
list3 = soup.select('.HotList-list section .HotItem-content a')
for i, ele in enumerate(list3, start=1):
    print(i, ele['title'], ele['href'])
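The three loops above only print to the console. To keep the results, a small extension (a sketch using only the standard-library csv module; the file name zhihu_hot.csv is an arbitrary choice) writes the bs4 results to a file:

import csv

# Write rank, title, and link for each hot-list entry to a CSV file.
# utf-8-sig keeps Chinese titles readable when the file is opened in Excel.
with open('zhihu_hot.csv', 'w', newline='', encoding='utf-8-sig') as f:
    writer = csv.writer(f)
    writer.writerow(['rank', 'title', 'link'])
    for i, ele in enumerate(list3, start=1):
        writer.writerow([i, ele['title'], ele['href']])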