Python Web Scraping Basics

A scraper based on BeautifulSoup:

1. Import the packages first:

import requests
from bs4 import BeautifulSoup

2. Disguise the request as a browser:

headers={'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:122.0) Gecko/20100101 Firefox/122.0'}

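To find your own User-Agent string, press F12 in the browser, open the Network tab, select any request, and look at its request headers.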
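To see what the header changes, here is a minimal sketch that builds a request without sending it and compares the User-Agent it would carry with the default one that requests announces. The URL https://example.com/ is a placeholder chosen for illustration; nothing is fetched.

```python
import requests

# Same header value as in the step above, split across two lines for readability.
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:122.0) "
                  "Gecko/20100101 Firefox/122.0"
}

# What requests identifies itself as when no custom header is given,
# typically "python-requests/<version>".
default_ua = requests.utils.default_user_agent()

# Build the request without sending it so the outgoing headers can be inspected.
# The URL is a placeholder; no network traffic happens here.
prepared = requests.Request("GET", "https://example.com/", headers=headers).prepare()
sent_ua = prepared.headers["User-Agent"]

print("default:", default_ua)
print("sent:   ", sent_ua)
```

A site that filters on the User-Agent can easily spot the default value, which is why the browser string is passed explicitly.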

3. Fetch the page, set the encoding explicitly (just in case), and build the BeautifulSoup object:

response = requests.get("", headers=headers)  # put the target URL between the quotes
response.encoding = 'utf-8'
html = BeautifulSoup(response.text, "html.parser")

For the parser, the first option, html.parser (the one built into Python), is enough.
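Why set the encoding at all? If the server does not declare a charset, requests may guess the wrong one and response.text comes out garbled. The sketch below imitates that offline with a made-up page body: the same bytes are decoded once with a wrong charset and once with UTF-8, and the correct text is then parsed with html.parser.

```python
from bs4 import BeautifulSoup

# Made-up page body with non-ASCII text, encoded as UTF-8 bytes,
# standing in for what a server would send.
raw = "<html><head><title>博客园 blog</title></head></html>".encode("utf-8")

# Decoding with the wrong charset produces mojibake ...
wrong = raw.decode("iso-8859-1")
# ... while the correct charset restores the original text.
right = raw.decode("utf-8")

soup = BeautifulSoup(right, "html.parser")
title_text = soup.title.get_text()

print(title_text)
print("decoded texts identical:", wrong == right)
```

Setting response.encoding = 'utf-8' before reading response.text plays the role of the second decode here. It is only correct when the page really is UTF-8.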

4. Inspect the source of the page you want to scrape and decide what to search for:

all_results = html.find_all("tag_name", attrs={'attribute_name': 'attribute_value'})

For example:
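A minimal sketch on a hand-written HTML snippet (the markup is made up for illustration, not copied from a real page). It also shows one detail worth knowing: a class value that contains a space is compared against the whole attribute string, so the order of the class names matters.

```python
from bs4 import BeautifulSoup

# Hand-written HTML for illustration only.
sample = """
<div>
  <a class="postTitle2 vertical-middle" href="/p/1.html">First post</a>
  <a class="postTitle2 vertical-middle" href="/p/2.html">Second post</a>
  <a class="vertical-middle postTitle2" href="/p/3.html">Reordered classes</a>
  <a class="nav" href="/about">About</a>
</div>
"""
html = BeautifulSoup(sample, "html.parser")

# The two-class string must equal the attribute value exactly,
# so the link with the classes in reverse order is skipped.
exact = html.find_all("a", attrs={"class": "postTitle2 vertical-middle"})

# A single class name matches every tag that carries that class.
by_one_class = html.find_all("a", class_="postTitle2")

print(len(exact), len(by_one_class))
```

If the site might reorder or add class names, searching by one distinctive class is the more forgiving choice.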

5. Loop over the results and print only the text inside each tag:

for title in all_results:
    title1 = title.get_text()
    print(title1)

Example:

Picking one lucky blog at random

Full code:

import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:122.0) Gecko/20100101 Firefox/122.0'}

# Loop over the paginated list, pages 1 through 19
for i in range(1, 20):
    response = requests.get(f"https://www.cnblogs.com/xxxxxxxxx?page={i}", headers=headers)
    response.encoding = 'utf-8'
    html = BeautifulSoup(response.text, "html.parser")
    # Each post title is an <a> tag with this exact class attribute
    all_results = html.find_all("a", attrs={'class': 'postTitle2 vertical-middle'})
    for title in all_results:
        title1 = title.get_text()
        print(title1)
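The script above can be hardened a little. The following is one possible sketch, not the only way to do it: the function names fetch_html, extract_titles and crawl are hypothetical, and the timeout, the pause between requests and the rule of stopping at the first page without titles are assumptions added here. Splitting fetching from parsing also lets the parsing part be tried on a saved page without touching the network.

```python
import time

import requests
from bs4 import BeautifulSoup

HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:122.0) "
                  "Gecko/20100101 Firefox/122.0"
}


def fetch_html(url):
    """Download one page and return its text; raise on HTTP errors."""
    response = requests.get(url, headers=HEADERS, timeout=10)
    response.raise_for_status()
    response.encoding = "utf-8"
    return response.text


def extract_titles(page_text):
    """Return the text of every post-title link found in one page."""
    soup = BeautifulSoup(page_text, "html.parser")
    links = soup.find_all("a", attrs={"class": "postTitle2 vertical-middle"})
    return [link.get_text(strip=True) for link in links]


def crawl(url_template, pages, fetch=fetch_html, delay=1.0):
    """Collect titles page by page, stopping at the first page with none."""
    titles = []
    for page in pages:
        found = extract_titles(fetch(url_template.format(page=page)))
        if not found:
            break  # assumed rule: an empty page means we ran past the last one
        titles.extend(found)
        time.sleep(delay)  # pause between requests to go easy on the server
    return titles


# Usage (performs real network requests, so it is left commented out):
# titles = crawl("https://www.cnblogs.com/xxxxxxxxx?page={page}", range(1, 20))
```

Passing a different fetch function, for example one that reads saved HTML files, lets crawl run without any network access.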