Web Scraping: Autohome (汽车之家)

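Web Scraping

Today's Topics

1. Introduction to web scraping
2. Scraping Autohome
3. requests
4. bs4
5. Converting the content encoding to utf-8
Once you know requests and bs4, you can scrape most ordinary web pages, leaving aside CAPTCHAs and performance.
In real work later on, these two libraries plus the Scrapy framework are enough.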

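I. Introduction to Web Scraping

* 1. What is a web scraper?
	A program that retrieves information from a website, given its URL.
	Background: since 2015, China has been legislating on data scraping.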
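* 2. Scraping Autohome news
		a. Impersonate a browser: send an HTTP request to a URL and get the response back as a string
	import requests

	ret = requests.get(
    		url='https://www.autohome.com.cn/news/',
	)
	print(ret.text)
Note: ret is a Response object.
	requests imitates what a browser does when it requests a page.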

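	ret.encoding = ret.apparent_encoding
	Sets the encoding used to decode the body to the one detected from the content, so that Chinese text displays correctly.

	ret.text returns the body as a string (str)

	ret.content returns the body as bytes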
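
The difference between ret.content and ret.text, and why the encoding has to be right, can be seen without any network access. The sketch below is only an illustration: the sample string is made up, and GBK is an assumed page encoding, not something confirmed for the Autohome site.

```python
# Bytes as they might arrive over the wire (what ret.content would hold).
# Assumption: the page is GBK-encoded; the sample text is made up.
raw = "汽车之家新闻".encode("gbk")

# Decoding with the wrong charset produces mojibake instead of Chinese.
wrong = raw.decode("iso-8859-1")

# Decoding with the right charset recovers the text. This is what
# ret.text gives you once ret.encoding has been set correctly.
right = raw.decode("gbk")

print(type(raw).__name__)  # the raw body is bytes
print(repr(wrong))         # unreadable characters
print(right)               # readable Chinese text
```

Setting ret.encoding = ret.apparent_encoding asks requests to guess the charset from the bytes themselves, which usually avoids the mojibake case.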

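	b. Parsing: extract the content you want
		BeautifulSoup splits the HTML into tags that can be searched

	bs4 parses a string of HTML
		from bs4 import BeautifulSoup
		soup = BeautifulSoup(ret.text, 'html.parser')
		div = soup.find(name='tag name')
		div = soup.find(name='tag name', id='id value')
		div = soup.find(name='tag name', class_='')  # note the trailing underscore: class is a Python keyword
		div = soup.find(name='tag name', attrs={'id': 'user', 'class': 'page-one'})
	 
		div.text  the text inside the tag
		div.attrs  the tag's attributes, as a dict
		div.get('src')  the value of one attribute

		li_list = soup.find_all(name='tag name')
		li_list is a list of every matching tag
		To get a specific element, index into the list:
		li_list[0]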
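
The lookups above can be tried on a small inline HTML fragment, with no request involved. This is a minimal sketch: the markup, the paths, and the headline text are invented for illustration and do not reflect the real structure of the Autohome news page.

```python
from bs4 import BeautifulSoup

# Made-up markup standing in for a downloaded page (ret.text).
html = """
<div id="user" class="page-one">
  <ul>
    <li><a href="/news/1.html"><img src="/img/1.jpg"><h3>First headline</h3></a></li>
    <li><a href="/news/2.html"><img src="/img/2.jpg"><h3>Second headline</h3></a></li>
  </ul>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# find() returns the first matching tag (or None if nothing matches).
div = soup.find(name="div", attrs={"id": "user", "class": "page-one"})

# find_all() returns a list of every matching tag under div.
li_list = div.find_all(name="li")

# .text gives the tag's text; .get() reads a single attribute.
headlines = [li.find("h3").text for li in li_list]
first_img_src = li_list[0].find("img").get("src")

print(div.attrs)
print(headlines)
print(first_img_src)
```

Because find() returns None when nothing matches, real scraping code should check the result before reading .text or calling .get() on it.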

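The requests module

url: the address to request
params: query-string parameters appended to the URL
data: request body, sent form-encoded
json: request body, sent as JSON
headers: request headers
cookies: cookies to send with the request
proxies: proxies to route through, e.g. when your IP has been blocked

files: files to upload

auth: HTTP basic authentication

timeout: how long to wait, in seconds, before giving up
   
allow_redirects: True
whether to follow redirects

stream: for downloading large files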

ret = requests.get('http://127.0.0.1:8000/test/', stream=True)

for i in ret.iter_content():
    # iter_content() yields the body piece by piece, so it can be written to disk while it downloads
    print(i)

from contextlib import closing

with closing(requests.get('http://httpbin.org/get', stream=True)) as r:
    # Handle the response here; the connection is closed when the block exits.
    for i in r.iter_content():
        print(i)

cert: client-side certificate
verify: whether to verify the server's TLS certificate
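
To see what a few of the parameters listed above actually do to a request, one option is to build the request without sending it. The sketch below uses requests.Request(...).prepare() for that; the page parameter, the user field, and the User-Agent string are hypothetical values chosen for illustration.

```python
import json

import requests

URL = "https://www.autohome.com.cn/news/"

# params end up in the query string; headers are sent as given.
get_req = requests.Request(
    "GET",
    URL,
    params={"page": 2},                     # hypothetical query parameter
    headers={"User-Agent": "Mozilla/5.0"},  # look like a browser
).prepare()

# data is form-encoded; json is serialized as a JSON document.
form_req = requests.Request("POST", URL, data={"user": "alice"}).prepare()
json_req = requests.Request("POST", URL, json={"user": "alice"}).prepare()

print(get_req.url)
print(form_req.headers["Content-Type"], form_req.body)
print(json_req.headers["Content-Type"], json.loads(json_req.body))
```

Nothing is sent over the network here. In normal use the same keyword arguments are passed straight to requests.get() or requests.post(), together with options such as timeout, proxies, or verify.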

posted @ 2019-08-15 09:29 by 正直boy
