Web Scraping Basics: requests and BeautifulSoup

Environment: Python 3

First, use the requests library to send HTTP requests.

import requests

# HTTP request types
# GET
r = requests.get('url')
# POST
r = requests.post('url')
# PUT
r = requests.put('url')
# Only these three request types have been used so far, so only they are listed here.
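
As a small illustration of how these request types differ, the sketch below builds a GET and a POST request without sending them, so the resulting URL and body can be inspected. The `https://example.com/courses/` address, the `page` parameter and the `name` form field are placeholders chosen for this example, not values from the post.

```python
import requests

# Placeholder endpoint, used only for illustration; nothing is sent over the network.
BASE_URL = 'https://example.com/courses/'

# GET: query parameters are appended to the URL.
get_req = requests.Request('GET', BASE_URL, params={'page': 2}).prepare()
print(get_req.method, get_req.url)

# POST: form fields are URL-encoded into the request body.
post_req = requests.Request('POST', BASE_URL, data={'name': 'demo'}).prepare()
print(post_req.method, post_req.body)
print(post_req.headers['Content-Type'])
```

Calling requests.get(BASE_URL, params={'page': 2}) builds the same URL and then actually sends the request.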

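# Fetch the page
response = requests.get(url='address')
# Decode the response with the encoding detected from its content, so it matches the original page
response.encoding = response.apparent_encoding
# The page content as text
response.text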
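
The fetch-and-decode steps can be wrapped in a small helper. This is only a sketch: the function name `fetch_text` and the 10-second timeout are assumptions made for this example, and the status check is an addition the original snippet does not have.

```python
import requests

def fetch_text(url, timeout=10):
    """Download a page and return its text, decoded with the detected encoding."""
    # A timeout keeps the script from hanging forever on a slow server.
    response = requests.get(url, timeout=timeout)
    # Raise an exception for 4xx/5xx responses instead of parsing an error page.
    response.raise_for_status()
    # Use the encoding guessed from the body rather than the one from the headers.
    response.encoding = response.apparent_encoding
    return response.text
```

It would be called as fetch_text('address'), with the same placeholder address as in the snippet above.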

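Once the response has been retrieved, extract the information you need from the page content. In practice, most requests are GET or POST.

Next, parse the retrieved page with BeautifulSoup and pull out the content you want.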

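from bs4 import BeautifulSoup

# Parse the page; the result is a bs4.BeautifulSoup object
soup = BeautifulSoup(response.text, 'html.parser')  # 'lxml' can be used instead if it is installed
print(type(soup))
# find() is only available on parsed objects, not on plain strings
div = soup.find(name='div', id='auto-channel-lazyload-article')
li_list = div.find_all(name='li')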
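
To see what find() and find_all() return, the same calls can be tried on a small inline document. The HTML below is made up for this example (only the `id` is taken from the snippet above), and the `None` check is an extra safeguard, since find() returns None when nothing matches.

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for a downloaded page.
html = """
<div id="auto-channel-lazyload-article">
  <ul>
    <li><a href="/a/1.html">First article</a></li>
    <li><a href="/a/2.html">Second article</a></li>
  </ul>
</div>
"""

soup = BeautifulSoup(html, 'html.parser')
div = soup.find(name='div', id='auto-channel-lazyload-article')

articles = []
# Guard against a missing container before calling find_all() on it.
if div is not None:
    for li in div.find_all(name='li'):
        link = li.find('a')
        if link is not None:
            # Collect (link text, href) pairs.
            articles.append((link.get_text(strip=True), link.get('href')))

print(articles)
```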

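Anti-scraping strategies differ from site to site, so a scraper written for one site will not necessarily work for another. Check whether the headers of the returned page carry a token of some kind.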
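
Where a token lives depends on the site: it may be in `response.headers`, in a cookie, or in a hidden form field. The sketch below covers only the hidden-field case and also sets a browser-like User-Agent on a session. The field name `csrf_token`, the User-Agent string and the helper `find_form_token` are all hypothetical and would need to be adapted to the real page.

```python
import requests
from bs4 import BeautifulSoup

# Hypothetical header value, for illustration only.
HEADERS = {'User-Agent': 'Mozilla/5.0 (compatible; demo-scraper)'}

def find_form_token(html, field_name='csrf_token'):
    """Return the value of the <input> named field_name, or None if absent."""
    soup = BeautifulSoup(html, 'html.parser')
    field = soup.find('input', attrs={'name': field_name})
    return field.get('value') if field is not None else None

# A session sends the same headers (and cookies) on every request it makes.
session = requests.Session()
session.headers.update(HEADERS)

# Made-up form markup standing in for a real page.
sample = '<form><input type="hidden" name="csrf_token" value="abc123"></form>'
print(find_form_token(sample))
print(session.headers['User-Agent'])
```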

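For sites that require a username and password, first visit the site once to obtain a cookie, then POST the username and password so that the cookie becomes authorized, and finally use the authorized cookie to log in and perform further operations.

Below is a real example that scrapes the course list of a training site: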
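
One way to sketch this flow is with requests.Session, which stores the cookies from each response and sends them back on later requests. The function name, the URLs passed in and the form field names `username` and `password` are assumptions for this example; a real site may use different field names and may also require a token.

```python
import requests

def login_and_fetch(login_url, target_url, username, password, timeout=10):
    """Log in with a session and return the text of a page behind the login."""
    with requests.Session() as session:
        # First visit: the server sets its initial (not yet authorized) cookies.
        session.get(login_url, timeout=timeout)
        # Post the credentials; the session sends the cookies along automatically.
        resp = session.post(
            login_url,
            data={'username': username, 'password': password},
            timeout=timeout,
        )
        resp.raise_for_status()
        # Later requests reuse the cookies that are now tied to the login.
        page = session.get(target_url, timeout=timeout)
        page.raise_for_status()
        return page.text
```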

import requests
from bs4 import BeautifulSoup

# Write every course name to a local file
with open('./courses', 'w', encoding='utf-8') as f:
    # Walk through pages 1 to 23 of the course list
    for page in range(1, 24):
        ret = requests.get(
            url='https://www.shiyanlou.com/courses/?category=all&course_type=all&fee=all&tag=all&page=%s' % str(page)
        )
        ret.encoding = ret.apparent_encoding

        soup = BeautifulSoup(ret.text, 'html.parser')
        divs = soup.find_all(name='div', attrs={'class': 'course-name'})
        # These spans are collected but not used below
        spans = soup.find_all(name='span', attrs={'class': 'course-per-num pull-left'})

        for div in divs:
            # Take the text between the opening tag and the next tag, i.e. the course name
            name = str(div).split('>')[1].split('<')[0]
            f.write(name + '\n')
            f.write('=' * 30 + '\n')

Because the scraped information is written to a file, the result looks like this:

Linux 基础入门（新版）
==============================
用 C语言编写自己的编程语言
==============================
使用 Python3 生成分形图片
==============================
NumPy 百题大冲关
==============================
用 C 编写打字练习软件
==============================
C 语言入门教程
==============================
Python3 简明教程
==============================
GDB 简明教程
==============================
Linux 系统搭建及配置 DNS 服务器
==============================
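
The string splitting used in the script to pull out each course name could also be written with get_text(), which reads the text of a tag directly. This is a sketch only: `extract_course_names` is a name invented here, and the sample HTML merely imitates the `course-name` class from the script, so the real page markup may differ.

```python
from bs4 import BeautifulSoup

def extract_course_names(html):
    """Return the text of every <div class="course-name"> in the page."""
    soup = BeautifulSoup(html, 'html.parser')
    divs = soup.find_all('div', class_='course-name')
    # strip=True removes the whitespace around each name.
    return [div.get_text(strip=True) for div in divs]

# Made-up markup that imitates the class name used in the script.
sample = """
<div class="course-name">Linux 基础入门（新版）</div>
<div class="course-name">  Python3 简明教程 </div>
<div class="other">ignored</div>
"""
names = extract_course_names(sample)
print(names)
```

Keeping the parsing in its own function also means it can be tried on saved HTML without requesting the site again.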

posted @ 2018-07-05 16:41 孑然枫
