A Quick-Start Guide to Python Web Crawlers

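I recently needed to scrape data from a web page, so I spent some time learning how to write crawlers in Python. I am also studying Java Web and Spring at the moment, so this was a good opportunity to consolidate the underlying fundamentals.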

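1 Understand the fundamentals

① HTTP fundamentals

Key terms: URL, port, query, path, and so on.

How do HTTP and HTTPS differ? What is SSL?

How does the request-response exchange between the client and the server work?

What is a request? What is a response? What are they made of (for example, request headers and a request body)?

Use the browser's developer tools to inspect a request and its response in detail, including the relevant parameters.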
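
The URL terms listed above can be seen by splitting a URL with Python's standard library. This is a minimal sketch; the URL is a made-up example chosen only because it contains every component.

```python
from urllib.parse import urlsplit, parse_qs

# Hypothetical URL, chosen only to show every component.
url = "https://www.example.com:8443/movies/list?page=2&sort=score#top"

parts = urlsplit(url)
print(parts.scheme)    # protocol: https
print(parts.hostname)  # host: www.example.com
print(parts.port)      # port: 8443
print(parts.path)      # path: /movies/list
print(parts.query)     # raw query string: page=2&sort=score
print(parts.fragment)  # fragment: top

# parse_qs turns the query string into a dict of lists,
# because the same key may appear more than once.
query = parse_qs(parts.query)
print(query)           # {'page': ['2'], 'sort': ['score']}
```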

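② Web page basics

What is a web page made of? HTML, CSS, and JavaScript.

Nodes in a page and the relationships between them (the essential background for parsing a page and extracting the target content).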
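
To make the idea of nodes and parent-child relationships concrete, the sketch below walks a small HTML fragment with the standard-library `html.parser` module and records each element's depth and parent. The fragment and the `TreePrinter` class are invented for illustration; real projects would more likely use a dedicated parsing library.

```python
from html.parser import HTMLParser

# Hypothetical page fragment used only for illustration.
SAMPLE_HTML = """
<html>
  <body>
    <div class="item">
      <h2>Title A</h2>
      <p>Score: 9.5</p>
    </div>
  </body>
</html>
"""


class TreePrinter(HTMLParser):
    """Record each element together with its depth and its parent tag."""

    def __init__(self):
        super().__init__()
        self.stack = []  # currently open tags, outermost first
        self.nodes = []  # (depth, tag, parent) for every element seen

    def handle_starttag(self, tag, attrs):
        parent = self.stack[-1] if self.stack else None
        self.nodes.append((len(self.stack), tag, parent))
        self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to (and including) the most recent matching open tag.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


parser = TreePrinter()
parser.feed(SAMPLE_HTML)
for depth, tag, parent in parser.nodes:
    print("  " * depth + f"{tag} (parent: {parent})")
```

Here `h2` and `p` share the parent `div`, which makes them sibling nodes; `body` is an ancestor of all three.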

③ How a crawler works

A crawler is an automated program that fetches web pages, then extracts and saves information from them.
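
The three steps in that definition (fetch, extract, save) can be sketched as three small functions. This is only an illustration under assumptions: the function names are invented, the page is assumed to be UTF-8, and the `<h2>` pattern is the same one used later in this article.

```python
import json
import re
from urllib.request import urlopen


def fetch(url):
    """Step 1: download the page and return its source as text."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8")  # assumes a UTF-8 page


def extract(html):
    """Step 2: pull the target strings out of the page source."""
    return [t.strip() for t in re.findall(r"<h2.*?>(.*?)</h2>", html, re.S)]


def save(items, path):
    """Step 3: persist the extracted items as JSON."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(items, f, ensure_ascii=False, indent=2)


def crawl(url, path, fetcher=fetch):
    """Run the three steps in order; `fetcher` can be swapped out for testing."""
    items = extract(fetcher(url))
    save(items, path)
    return items
```

Calling `crawl('https://ssr1.scrape.center/', 'titles.json')` would run all three steps against the practice site used later in this article.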

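④ Session and Cookie

HTTP is stateless.

Understand that most pages today are dynamic, as opposed to static HTML.

Dynamic pages rely on sessions and cookies; learn the basic concepts of both.

Session cookies versus persistent cookies.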
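
The difference between a session cookie and a persistent cookie comes down to whether the `Set-Cookie` header carries an `Expires` or `Max-Age` attribute. The sketch below checks for that with the standard-library `http.cookies` module; both header values and the `is_persistent` helper are made up for illustration.

```python
from http.cookies import SimpleCookie

# Two hypothetical Set-Cookie header values a server might send.
session_header = "sessionid=abc123; Path=/; HttpOnly"
persistent_header = "theme=dark; Max-Age=2592000; Path=/"


def is_persistent(set_cookie_value):
    """Return True if the cookie carries an Expires or Max-Age attribute."""
    cookie = SimpleCookie()
    cookie.load(set_cookie_value)
    morsel = next(iter(cookie.values()))
    return bool(morsel["expires"] or morsel["max-age"])


print(is_persistent(session_header))     # False: dropped when the browser closes
print(is_persistent(persistent_header))  # True: kept until it expires
```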

⑤ How proxies work

Why are proxies needed?

What kinds of proxies are common?
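
As a rough sketch of what using a proxy looks like in code, the example below routes requests through a proxy with the standard-library `urllib.request`. The proxy address is a placeholder assumption, not a working server, and `fetch_via_proxy` is a name invented here.

```python
import urllib.request

# Hypothetical proxy address; replace it with a proxy you are allowed to use.
PROXY = "http://127.0.0.1:7890"

# Route both HTTP and HTTPS traffic through the proxy.
proxy_handler = urllib.request.ProxyHandler({"http": PROXY, "https": PROXY})
opener = urllib.request.build_opener(proxy_handler)


def fetch_via_proxy(url, timeout=10):
    """Fetch `url` through the proxy, so the request reaches the site from the proxy."""
    with opener.open(url, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

With the requests library, the same mapping can be passed through the `proxies` argument, for example `requests.get(url, proxies={"http": PROXY, "https": PROXY})`.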

⑥ Multithreading and multiprocessing basics

Threads and processes

Concurrency versus parallelism

Why use multithreading?
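
One answer to the last question is that a crawler spends most of its time waiting on the network, and threads let those waits overlap. The sketch below simulates this with `time.sleep` in place of real requests; `fake_fetch` is a stand-in invented for illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_fetch(page):
    """Stand-in for a network request: sleep instead of waiting on a server."""
    time.sleep(0.2)
    return f"page {page} done"


pages = range(1, 6)

# One request after another: the waits add up.
start = time.perf_counter()
sequential = [fake_fetch(p) for p in pages]
sequential_time = time.perf_counter() - start

# Five worker threads: the waits overlap.
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=5) as pool:
    # map returns results in the same order as the inputs.
    threaded = list(pool.map(fake_fetch, pages))
threaded_time = time.perf_counter() - start

print(f"sequential: {sequential_time:.2f}s, threaded: {threaded_time:.2f}s")
```

The threaded run should finish in roughly the time of a single wait, since all five sleeps happen concurrently.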

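2 Getting started with the basic libraries

With the fundamentals from Section 1 in hand, you can start by writing simple crawlers and then work step by step through the principles, usage, and case studies until you reach your actual goal.

First, install and configure a Python 3 environment, then install the required libraries with pip.

The following link provides a complete installation tutorial:

Scrape Center

This section introduces one of those libraries, requests.

Using the requests library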

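1 Send a POST request and parse the JSON response into a dictionary

import requests

# Form fields sent in the request body
data = {'name': 'germey', 'age': '25'}
r = requests.post("https://www.httpbin.org/post", data=data)

print(r.text)          # raw response body as a string
print(r.json())        # response body parsed from JSON into a dict
print(type(r.json()))  # confirms the parsed result is a dict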

Output:

{
  "args": {},
  "data": "",
  "files": {},
  "form": {
    "age": "25",
    "name": "germey"
  },
  "headers": {
    "Accept": "*/*",
    "Accept-Encoding": "gzip, deflate",
    "Content-Length": "18",
    "Content-Type": "application/x-www-form-urlencoded",
    "Host": "www.httpbin.org",
    "User-Agent": "python-requests/2.28.1",
    "X-Amzn-Trace-Id": "Root=1-632feafa-1e67dd967da4a55216a9fd7e"
  },
  "json": null,
  "origin": "27.17.104.42",
  "url": "https://www.httpbin.org/post"
}

{'args': {}, 'data': '', 'files': {}, 'form': {'age': '25', 'name': 'germey'}, 'headers': {'Accept': '*/*', 'Accept-Encoding': 'gzip, deflate', 'Content-Length': '18', 'Content-Type': 'application/x-www-form-urlencoded', 'Host': 'www.httpbin.org', 'User-Agent': 'python-requests/2.28.1', 'X-Amzn-Trace-Id': 'Root=1-632feafa-1e67dd967da4a55216a9fd7e'}, 'json': None, 'origin': '27.17.104.42', 'url': 'https://www.httpbin.org/post'}
<class 'dict'>

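2 Use a regular expression to extract specific text from a page

import re

import requests

r = requests.get('https://ssr1.scrape.center/')
# Capture the text inside every <h2> tag; re.S lets "." match newlines too
pattern = re.compile(r'<h2.*?>(.*?)</h2>', re.S)
titles = pattern.findall(r.text)
print(titles)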

Output:

['霸王别姬 - Farewell My Concubine', '这个杀手不太冷 - Léon', '肖申克的救赎 - The Shawshank Redemption', '泰坦尼克号 - Titanic', '罗马假日 - Roman Holiday', '唐伯虎点秋香 - Flirting Scholar', '乱世佳人 - Gone with the Wind', '喜剧之王 - The King of Comedy', '楚门的世界 - The Truman Show', '狮子王 - The Lion King']

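3 Download images, video, and audio

import requests

r = requests.get('https://scrape.center/favicon.ico')
# r.content holds the raw bytes, so the file is opened in binary mode
with open('favicon.ico', 'wb') as f:
    f.write(r.content)

Result: a favicon.ico file is saved in the current working directory.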
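
For large video or audio files, reading the whole body into memory through `r.content` may not be practical. One possible alternative, sketched below, writes the response to disk chunk by chunk. `save_stream` is a helper invented here; it assumes an object with an `iter_content(chunk_size=...)` method, as a `requests.Response` has.

```python
def save_stream(response, path, chunk_size=8192):
    """Write a streamed response to `path` chunk by chunk; return bytes written.

    `response` is expected to offer iter_content(chunk_size=...),
    as a requests.Response does.
    """
    written = 0
    with open(path, "wb") as f:
        for chunk in response.iter_content(chunk_size=chunk_size):
            if chunk:  # skip empty chunks
                f.write(chunk)
                written += len(chunk)
    return written
```

With requests, this could be called as `save_stream(requests.get(url, stream=True, timeout=10), 'video.mp4')`, where `url` and the file name are placeholders.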

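Regular expressions

With the requests library we can already fetch a page's source code; regular expressions are the tool for pulling the target strings out of that source.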
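
A detail that matters when extracting strings from page source is the difference between greedy and non-greedy matching. The sketch below uses a made-up fragment of HTML to show why the `<h2>` pattern earlier in this article uses `.*?` rather than `.*`.

```python
import re

# Hypothetical snippet of page source.
html = '<h2 class="name">Movie A</h2><h2 class="name">Movie B</h2>'

# Greedy .* runs as far as it can, so the single match spans both tags
# and only the last title is captured.
greedy = re.findall(r"<h2.*>(.*)</h2>", html)

# Non-greedy .*? stops at the first possible match: one title per <h2>.
lazy = re.findall(r"<h2.*?>(.*?)</h2>", html)

print(greedy)  # ['Movie B']
print(lazy)    # ['Movie A', 'Movie B']
```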

https://tool.oschina.net/regex

This site provides a very handy tool for building regular expressions.

https://www.cnblogs.com/fancy2022/p/16687764.html

JavaScript: regular expression basics

I hope this article helps anyone who is learning, or about to start learning, web crawling with Python.

Reference: 《Python3网络爬虫开发实战》 by 崔庆才 (Cui Qingcai)

posted @ 2022-09-25 22:13 Fancy[love]
