Python Crawler Basics

Python crawlers

bs4: parsing web pages and extracting data

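Tag: a tag together with its contents. Any tag that exists in HTML syntax can be accessed as soup.<tag name>, for example soup.a.
When the HTML document contains several tags with the same name, soup.<tag name> returns only the first one.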

for sibling in soup.a.next_siblings:
    print(sibling)  # iterate over the sibling nodes that follow
for sibling in soup.a.previous_siblings:
    print(sibling)  # iterate over the sibling nodes that come before
NavigableString: the content inside a tag, as a string
BeautifulSoup: the whole document
Comment: a special kind of NavigableString; its output does not include the comment markers
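
The four object types above can be seen side by side in a minimal sketch. The HTML snippet and the variable names are made up for illustration, and the built-in "html.parser" backend is an assumption chosen so that no extra parser has to be installed.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

# Hypothetical sample markup, invented for this sketch.
html = (
    '<div><a class="mnav" href="/news"><!--News--></a>'
    '<a class="mnav" href="/map">Map</a>'
    '<a class="bri" href="/more">More</a></div>'
)
soup = BeautifulSoup(html, "html.parser")

first_a = soup.a                             # Tag: only the first <a> is returned
second_a = first_a.next_sibling              # a single node: the second <a>
later = list(first_a.next_siblings)          # every sibling after the first <a>
earlier = list(later[-1].previous_siblings)  # siblings before the last <a>, nearest first

comment = first_a.string                     # Comment: printed without <!-- and -->
text = second_a.string                       # plain NavigableString

print(type(soup).__name__, type(first_a).__name__)
print(type(comment).__name__, comment)
print(type(text).__name__, text)
print([a["href"] for a in later])
print([a["href"] for a in earlier])
```

Note that next_sibling (singular) yields one node, while next_siblings (plural) is the iterator to use in a for loop.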

Searching the document

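find_all(): filters by string (tag name); it also accepts a function or keyword arguments, and the filter may be a list. limit restricts the number of results returned.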
```
t_list = bs.find_all("a")
```
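
The filter kinds mentioned for find_all() could look like the following sketch. The markup and the helper function has_href are hypothetical, introduced only for this example.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, invented for this sketch.
html = (
    '<div id="u1"><a class="mnav" href="/news">News</a>'
    '<a class="mnav" href="/map">Map</a>'
    '<a class="bri">More</a><span>x</span></div>'
)
bs = BeautifulSoup(html, "html.parser")

all_links = bs.find_all("a")                   # string filter: exact tag name
links_and_spans = bs.find_all(["a", "span"])   # list filter: any of these names
by_class = bs.find_all(class_="mnav")          # keyword-argument filter
first_two = bs.find_all("a", limit=2)          # stop after two results


def has_href(tag):
    """Function filter (hypothetical helper): keep tags that carry an href."""
    return tag.has_attr("href")


with_href = bs.find_all(has_href)

print(len(all_links), len(links_and_spans), len(by_class))
print([a.string for a in first_two])
print([t["href"] for t in with_href])
```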

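Regular-expression search: when a compiled pattern is passed to find_all(), tag names are tested with the pattern's search() method.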

t_list = bs.find_all(re.compile(r"\d"))  # tags whose name contains a digit
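
A small sketch of what the regular-expression filter matches. The markup is made up for illustration; the point is that the pattern is applied to tag names and only needs to match somewhere in the name.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical sample markup, invented for this sketch.
html = "<h1>Title</h1><h2>Sub</h2><p>Body <b>bold</b></p>"
bs = BeautifulSoup(html, "html.parser")

# The pattern is tested against each tag name.
with_digit = [t.name for t in bs.find_all(re.compile(r"\d"))]

# "1" is not at the start of "h1", yet it is found: search(), not match().
with_one = [t.name for t in bs.find_all(re.compile("1"))]

# Anchor the pattern explicitly when the name must start with it.
starts_with_h = [t.name for t in bs.find_all(re.compile(r"^h"))]

print(with_digit, with_one, starts_with_h)
```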

CSS selectors
1. bs.select('title') finds by tag name
2. bs.select('.mnav') finds by class name
3. bs.select('#u1') finds by id
4. bs.select("a[class='bri']") finds by attribute
5. bs.select('head > title') finds by child tag
6. bs.select('.mnav ~ .bri') finds by sibling tag
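
The six selectors can be tried on a small page. The markup below is a hypothetical stand-in that merely reuses the class names and id from the list; it is a sketch, not the page the notes were written against.

```python
from bs4 import BeautifulSoup

# Hypothetical sample page, invented for this sketch.
html = (
    "<html><head><title>Demo</title></head><body>"
    '<div id="u1"><a class="mnav" href="/news">News</a>'
    '<a class="mnav" href="/map">Map</a>'
    '<a class="bri" href="/more">More</a></div>'
    "</body></html>"
)
bs = BeautifulSoup(html, "html.parser")

by_tag = bs.select("title")              # tag name
by_class = bs.select(".mnav")            # class name
by_id = bs.select("#u1")                 # id
by_attr = bs.select("a[class='bri']")    # attribute value
by_child = bs.select("head > title")     # direct child
by_sibling = bs.select(".mnav ~ .bri")   # later sibling of a .mnav element

for name, result in [
    ("tag", by_tag), ("class", by_class), ("id", by_id),
    ("attr", by_attr), ("child", by_child), ("sibling", by_sibling),
]:
    print(name, [el.get_text() for el in result])
```

select() always returns a list, so a single expected hit still has to be taken with an index or with select_one().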

re: regular expressions, used for text matching

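search() is mainly used to check whether a regular expression matches anywhere in a string
1. re.findall("regular expression", "string to search") returns all matches as a list
2. re.sub("a", "b", "aacbs"): replaces every a in the string with b, giving "bbcbs"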
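
The three re functions above, run on the same sample string as a quick check. The variable names are arbitrary.

```python
import re

text = "aacbs"

match = re.search("c", text)        # Match object, or None if nothing matches
no_match = re.search("x", text)     # no "x" in the string
all_a = re.findall("a", text)       # every non-overlapping match, as a list
replaced = re.sub("a", "b", text)   # replace each "a" with "b"

print(match.span())   # (2, 3)
print(no_match)       # None
print(all_a)          # ['a', 'a']
print(replaced)       # bbcbs
```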

urllib.request, urllib.error: fetching page data from a given URL

import urllib.request
# GET request
response=urllib.request.urlopen("http://www.baidu.com")
print(response.read().decode("utf-8"))

httpbin.org: a site for testing requests

urllib.parse: URL parsing and encoding

import urllib.request
import urllib.parse
# POST request
data=bytes(urllib.parse.urlencode({"hello":"world"}),encoding="utf-8")
response=urllib.request.urlopen("http://httpbin.org/post",data=data)
print(response.read().decode("utf-8"))
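
What urlencode() produces, and why passing data turns the request into a POST, can be inspected without touching the network. This is a sketch: the second form field is made up for illustration, and nothing is actually sent.

```python
import urllib.parse
import urllib.request

# The "q" field is a hypothetical extra value, added to show escaping.
form = {"hello": "world", "q": "web crawler"}

encoded = urllib.parse.urlencode(form)   # str: key=value pairs joined by &
data = encoded.encode("utf-8")           # urlopen() requires bytes, not str

# Building the Request does not send anything yet.
req = urllib.request.Request("http://httpbin.org/post", data=data)

print(encoded)                           # hello=world&q=web+crawler
print(req.get_method())                  # POST, because data is not None
print(urllib.parse.parse_qs(encoded))    # decode back into a dict of lists
```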

Passing timeout=<seconds> to urlopen() sets a time limit, which makes it possible to handle requests that time out.
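
One possible way to wrap the timeout handling, as a sketch. The function name fetch and the default of 3 seconds are assumptions made for this example, not part of urllib.

```python
import socket
import urllib.error
import urllib.request


def fetch(url, timeout=3.0):
    """Hypothetical helper: return the decoded body, or None on failure."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read().decode("utf-8")
    except urllib.error.URLError as exc:
        # Connection problems, including a timeout while connecting,
        # are reported as URLError (HTTPError is a subclass).
        print("request failed:", exc.reason)
    except socket.timeout:
        # A timeout while reading the body is raised directly.
        print("request timed out while reading")
    return None


# Usage (needs network access, so it is left commented out):
# html = fetch("http://httpbin.org/get", timeout=0.01)
```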

response.status: the status code that was returned

response.getheaders(): gets all response headers

response.getheader("Server"): gets the value of the Server header
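
To try these attributes without depending on an external site, the sketch below starts a throwaway local server. DemoHandler and everything about the server are hypothetical scaffolding. The opener with an empty ProxyHandler is an assumption made so that a system proxy cannot intercept the local request; it returns the same kind of response object as urlopen().

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class DemoHandler(BaseHTTPRequestHandler):
    """Hypothetical handler standing in for a real website."""

    def do_GET(self):
        body = b"hello"
        self.send_response(200)  # also adds the Server and Date headers
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo output quiet


server = HTTPServer(("127.0.0.1", 0), DemoHandler)  # port 0: any free port
threading.Thread(target=server.serve_forever, daemon=True).start()

# Assumption: bypass any configured proxy for this local request.
opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
url = f"http://127.0.0.1:{server.server_port}/"

with opener.open(url, timeout=5) as response:
    status = response.status                          # integer status code
    headers = response.getheaders()                   # list of (name, value) tuples
    server_header = response.getheader("Server")      # one header's value
    missing = response.getheader("X-Not-Sent", "n/a") # default when absent
    body = response.read().decode("utf-8")

server.shutdown()
server.server_close()

print(status, server_header, missing, body)
print(headers)
```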

# Disguising the crawler: mainly by spoofing the browser identifier (User-Agent)
req=urllib.request.Request(url=url,data=data,headers=headers,method="POST")
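
A Request object can be built and inspected before anything is sent, which is a convenient way to confirm the disguise is in place. In this sketch the User-Agent string is shortened and the form data is a placeholder; both are assumptions for illustration.

```python
import urllib.parse
import urllib.request

url = "http://httpbin.org/post"
# Shortened, hypothetical User-Agent; a real browser string is longer.
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}
data = urllib.parse.urlencode({"hello": "world"}).encode("utf-8")

# method must be a string such as "POST", not a bare name.
req = urllib.request.Request(url=url, data=data, headers=headers, method="POST")

print(req.full_url)
print(req.get_method())
# Request normalizes header names with str.capitalize(): "User-agent".
print(req.get_header("User-agent"))
print(req.data)

# To actually send it (needs network access):
# response = urllib.request.urlopen(req, timeout=5)
```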

import urllib.request
import urllib.parse


url="https://movie.douban.com/"
headers={"User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.69 Safari/537.36 Edg/95.0.1020.53"}
req=urllib.request.Request(url=url,headers=headers)
response=urllib.request.urlopen(req)
print(response.read().decode("utf-8"))