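Web Scraping Notes, Part 1

What is a web crawler? The quickest way to find out is to run the example below (this assumes the requests and BeautifulSoup4 packages are installed):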

import requests
from bs4 import BeautifulSoup

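# Download the HTML at this URL and store the result in response
response = requests.get('http://www.autohome.com.cn/news/')

# The response body is bytes and must be decoded; this page uses gbk (many other sites use utf-8)
response.encoding = 'gbk'

# Turn the text into a BeautifulSoup object
soup = BeautifulSoup(response.text,'html.parser')

# find(id=...) looks up the tag whose id attribute has the given value
tag = soup.find(id='auto-channel-lazyload-article')

# name is the tag type, such as div, input, or p
h3 = tag.find_all(name='h3')

# Print every h3 tag under the element whose id is auto-channel-lazyload-article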
for i in h3:
    print(i)

One example is not enough, so here is another:

import requests
from bs4 import BeautifulSoup
response = requests.get('http://tianqi.moji.com/weather/china/guangdong/shenzhen')
response.encoding = 'utf-8'
soup = BeautifulSoup(response.text,'html.parser')

# Get the first div tag on the page whose class attribute is "wea_weather clearfix"
tag = soup.find(name='div',attrs={'class':'wea_weather clearfix'})
print(tag.find('em').text, tag.find('img').get('alt'), tag.find('strong').text)  # sample output: 23 多云 今天20:21更新 (23, cloudy, updated today at 20:21)

import requests
from bs4 import BeautifulSoup
    
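obj = requests.get("url")   # "url" is a placeholder for the page address
obj.content                 # raw response body as bytes
obj.encoding = "gbk"        # encoding used to decode the body
obj.text                    # body decoded to str using obj.encoding

soup = BeautifulSoup(obj.text,'html.parser')
tag = soup.find(name='span')   # the first span tag on the page
tags = soup.find_all(...)      # a list containing every matching tag

Get the text
    tag.text
Get the attribute dictionary
    tag.attrs
Get the value of a single attribute with .get('attribute name')
    tag.get(...)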
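
As a small offline illustration of the calls summarized above, the sketch below parses a hand-written HTML snippet instead of a live page. The snippet, its id and class values, and the variable names are all made up for this example.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for a downloaded page
html = """
<div id="news">
  <h3><a href="/a/1.html" class="title">First headline</a></h3>
  <h3><a href="/a/2.html" class="title">Second headline</a></h3>
</div>
"""

soup = BeautifulSoup(html, 'html.parser')

# find returns the first matching tag, or None when nothing matches
container = soup.find(id='news')

# find_all returns a list-like collection of every matching tag
links = container.find_all(name='a')

first = links[0]
headline_text = first.text          # text inside the tag
headline_attrs = first.attrs        # all attributes as a dict
headline_href = first.get('href')   # one attribute value, None if absent

print(headline_text, headline_href)
print(len(links), headline_attrs)
```

Because find returns None when nothing matches, it is worth checking the result before calling find_all on it when scraping a real page.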

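How cookie authorization works:

Some sites hand you a cookie as soon as you GET the login page, but that cookie has not yet been authorized on the server. Only after you send a POST request carrying that cookie, and the server validates your credentials, does the server authorize the cookie.
Other sites give you an authorized cookie only after your POST request has passed validation.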

import requests
from bs4 import BeautifulSoup

# Get the token
response1 = requests.get('https://github.com/login')
s1 = BeautifulSoup(response1.text,'html.parser')
token = s1.find(name='input',attrs={'name':'authenticity_token'}).get('value')
# Get the not-yet-authorized cookies as a dict
response1_cookie_dict = response1.cookies.get_dict()

# POST the username, password, and token to the server
response2 = requests.post(
    'https://github.com/session',
    data={
        "utf8": '✓',
        "authenticity_token": token,
        'login': '317235332@qq.com',
        'password': 'xxxxxx',
        'commit': 'Sign in'
    },
    cookies=response1_cookie_dict  # send the not-yet-authorized cookies along with the form data
)

# response2 is the response after validation; take the authorized cookies from it
response2_cookie_dict = response2.cookies.get_dict()
# Create a dict that merges the unauthorized and the authorized cookies
cookie_dict = {}
cookie_dict.update(response1_cookie_dict)
cookie_dict.update(response2_cookie_dict)

# Send the merged cookies with the next request; the user is now logged in
response3 = requests.get(
    url='https://github.com/settings/emails',
    cookies=cookie_dict
)
print(response3.text)

Example above: GitHub login

import requests
### 1. Visit any page first to obtain a cookie

response1 = requests.get(url="http://dig.chouti.com/help/service")

### 2. Log in, sending the cookie from the previous step; the server authorizes the gpsd value in that cookie
response2 = requests.post(
    url="http://dig.chouti.com/login",
    data={
        'phone': "86<phone number>",
        'password': "<password>",
        'oneMonth': ""
    },
    cookies=response1.cookies.get_dict()
)

### 3. Upvote (only the authorized gpsd cookie needs to be sent)
gpsd = response1.cookies.get_dict()['gpsd']
response3 = requests.post(
    url="http://dig.chouti.com/link/vote?linksId=8589523",
    cookies={'gpsd': gpsd}
)
print(response3.text)

Example above: Chouti login and upvote

------------------------------------------------------------------------------------------------------------------------------------

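requests parameters

- method:  HTTP method
- url:     request URL
- params:  parameters passed in the URL query string (typical for GET)
- data:    data sent in the request body (form-encoded)
- json:    data sent in the request body, serialized as JSON
- headers: request headers
- cookies: cookies
- files:   file upload
- auth:    basic authentication (adds the Base64-encoded username and password to the headers)
- timeout: how long to wait for the connection and for the response
- allow_redirects: whether redirects are followed
- proxies: proxies
- verify:  whether to verify the server's TLS certificate (set to False to skip verification)
- cert:    client certificate file
- stream:  stream the response body instead of downloading it all at once (useful for large files)
- session: not a request parameter; a requests.Session object keeps client state such as cookies across requests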
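files: uploading files

import requests

requests.post(
    url='www',  # placeholder URL from the original notes
    files={
        'name1': open('a.txt','rb'),             # form field name mapped to a file object
        'name2': ('bbb.txt',open('b.txt','rb'))  # the file is uploaded to the server under the name bbb.txt
    }
)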


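auth: authentication

When you configure a router and visit 192.168.0.1, a small dialog pops up asking for a username and password. Clicking log in does not submit an HTML form. This is the browser's basic authentication dialog, which Base64-encodes the username and password (an encoding, not encryption) and sends them in the Authorization request header.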
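
A minimal sketch of what that header looks like, using only the standard library. The helper name basic_auth_header and the admin/admin credentials are made up for illustration; with requests the same header is normally produced by passing auth=('username', 'password').

```python
import base64

def basic_auth_header(username, password):
    """Build the Authorization header value for HTTP basic authentication."""
    credentials = f'{username}:{password}'.encode('utf-8')
    return 'Basic ' + base64.b64encode(credentials).decode('ascii')

# Hypothetical credentials
header_value = basic_auth_header('admin', 'admin')
print(header_value)

# Base64 can be reversed without any key, so this is not encryption
decoded = base64.b64decode(header_value.split(' ', 1)[1]).decode('utf-8')
print(decoded)
```

Since anyone who sees the header can decode it, basic authentication is only reasonably safe over HTTPS.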

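stream: streaming

import requests

# If the file on the server is large, download it in chunks
def param_stream():
    ret = requests.get('http://127.0.0.1:8000/test/', stream=True)
    print(ret.content)  # note: reading .content still loads the whole body into memory
    ret.close()

    # from contextlib import closing
    # with closing(requests.get('http://httpbin.org/get', stream=True)) as r:
    #     # process the response here, one chunk at a time
    #     for i in r.iter_content():
    #         print(i)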


session (not the same thing as a Django session). Example: a simpler Chouti upvote

import requests

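# Create a Session object; cookies are kept in the session throughout
session = requests.Session()

# 1. Visit any page first to obtain a cookie
response1 = session.get(url="http://dig.chouti.com/help/service")

# 2. Log in; the session sends the earlier cookie and the server authorizes the gpsd value in it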
response2 = session.post(
    url="http://dig.chouti.com/login",
    data={
        'phone': "8615131255089",
        'password': "xxxxxx",
        'oneMonth': ""
    }
)

response3 = session.post(
    url="http://dig.chouti.com/link/vote?linksId=8589623",
)
print(response3.text)
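
To show that a Session carries cookies from one request to the next, the sketch below puts a cookie into the session by hand and then prepares a later request without sending it. The cookie value demo-value is made up; on a real site the server sets the cookie through its response.

```python
import requests

session = requests.Session()

# Simulate a cookie that a server set on an earlier response
# ("gpsd" is the cookie name from the example above; the value is made up)
session.cookies.set('gpsd', 'demo-value')

# Prepare (but do not send) a later request through the same session
prepared = session.prepare_request(
    requests.Request('GET', 'http://dig.chouti.com/help/service')
)

cookie_header = prepared.headers.get('Cookie')
print(cookie_header)
```

This is why the session version needs no cookies= argument: the session attaches its stored cookies to every request it prepares.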


Addendum:

# Find the first tag whose class (class_) is 'action'
tag=soup.find(class_ = 'action')
tag.name # the type of the current tag, such as div or span
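
A short offline sketch of the class_ lookup, run against made-up markup. The HTML string and the class names other than action are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration
html = '<div><span class="action primary">Vote</span><a class="link">More</a></div>'
soup = BeautifulSoup(html, 'html.parser')

# class_ matches any tag that has "action" among its classes
tag = soup.find(class_='action')

tag_name = tag.name             # the tag type
tag_classes = tag.get('class')  # class is multi-valued, so this is a list

print(tag_name, tag_classes)
```

The keyword is spelled class_ because class is a reserved word in Python.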

posted @ 2017-11-30 23:02 _慕
