Web Crawlers - 一万年

Web Crawlers

Introduction to Web Crawlers

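1. What is a web crawler?

    A program that fetches information from a website, given its URL.

2. Scraping news from Autohome (汽车之家)

    a. Imitate a browser: send an HTTP request to a URL and read the string that comes back

    Install: pip3 install requests (the library used to imitate a browser)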
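
By default requests identifies itself with a python-requests User-Agent, so imitating a browser usually also means sending browser-like headers. The following is a minimal sketch of that idea, not part of the original script: the header value and the build_request helper are illustrative assumptions. It only prepares the request so the headers can be inspected, and does not send anything.

```python
import requests

# Hypothetical browser-like headers; the User-Agent value is an illustrative assumption.
BROWSER_HEADERS = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/120.0 Safari/537.36',
}


def build_request(url, headers=None):
    """Prepare a GET request without sending it, so its headers can be inspected."""
    request = requests.Request('GET', url, headers=headers or BROWSER_HEADERS)
    return request.prepare()


prepared = build_request('https://www.autohome.com.cn/news/')
print(prepared.method, prepared.url)
print(prepared.headers['User-Agent'])
```

To actually send the request with the same headers, requests.get(url, headers=BROWSER_HEADERS) should be enough.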

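# An example of scraping data with a crawler


import requests   # requests: sends HTTP requests the way a browser would


# 1. Download the page
ret = requests.get(url='https://www.autohome.com.cn/news/')   # ret is a Response object
# print(ret.content)    # content: the raw response bytes
# print(ret.apparent_encoding)  # apparent_encoding: the encoding guessed from the content
ret.encoding = 'gbk'   # the Autohome page is encoded as gbk
# ret.encoding = ret.apparent_encoding   # alternatively, decode with whatever encoding was detected
# print(ret.text)    # text: the bytes decoded to str using ret.encoding; garbled output means the wrong encoding was used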


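# 2. Parse the page and pull out the parts we want

# ********* BeautifulSoup ********** a third-party HTML parsing library; install with: pip3 install beautifulsoup4
from bs4 import BeautifulSoup  # bs4 parses the HTML string and turns it into a tree of objects

soup = BeautifulSoup(ret.text, 'html.parser')   # 'html.parser' is Python's built-in parser; 'lxml' is a faster third-party parser that must be installed separately
# print(soup)    # soup is an object

# find returns the first match as a Tag object, or None if nothing matches
div = soup.find(name='div', id='auto-channel-lazyload-article')   # the div tag whose id is auto-channel-lazyload-article
# print(div)

# All li tags inside the div; find_all returns every match (an empty list if the div is missing)
li_list = div.find_all(name='li') if div else []
# print(li_list)   # li_list behaves like a list, so it has no .find; call .find on an element instead, e.g. li_list[0].find(...)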

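# Loop over the li tags and pull out each h3
for li in li_list:
    h3 = li.find(name='h3')   # find searches all descendants; h3 is a Tag object holding the headline
    if not h3:   # skip li tags that contain no h3
        continue
    # print(h3.text)

    # The summary paragraph
    p = li.find(name='p')
    # print(p.text)

    a = li.find('a')   # name is the first positional parameter, so the keyword can be omitted
    # Attributes of the a tag
    # print(a.attrs)   # all attributes, as a dict

    # print(h3.text, a.get('href'))   # a single attribute: headline, link
    # print(p.text)   # the summary text
    # print('*'*20)


    # Download the image
    img = li.find('img')
    src = img.get('src') if img else None   # not every li is guaranteed to contain an img
    # print(src)   # only the image address; fetching the image itself takes another request to that URL

    if src:
        # File name: last path segment, then the part after the last '__' if there is one
        file_name = src.rsplit('/', maxsplit=1)[-1].rsplit('__', maxsplit=1)[-1]
        # src is assumed to be protocol-relative (//host/path), so a scheme is prepended
        ret_img = requests.get(url=src if src.startswith('http') else 'http:' + src)
        with open(file_name, 'wb') as f:
            f.write(ret_img.content)

    print(h3.text, a.get('href') if a else '')   # headline, link
    print(p.text if p else '')   # summary
    print('*'*20)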
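
The encoding step in the script above is where garbled text comes from: the same bytes turn into different strings depending on the codec used to decode them. Here is a small standard-library sketch of that effect. The sample headline is made up for illustration, and latin-1 stands in for a wrong guess (as far as I know, requests falls back to ISO-8859-1 for text responses whose headers declare no charset).

```python
# The sample headline is an illustrative assumption, not data from the site.
headline = '汽车之家新闻'

raw = headline.encode('gbk')      # the bytes a gbk-encoded page would contain
right = raw.decode('gbk')         # decoding with the matching codec restores the text
wrong = raw.decode('latin-1')     # latin-1 accepts any byte but produces garbled text

print(right == headline)
print(wrong == headline)

# Decoding these gbk bytes as utf-8 fails outright instead of producing garbled text.
try:
    raw.decode('utf-8')
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False
print(utf8_ok)
```

Setting ret.encoding before reading ret.text plays the role of choosing the right codec here.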

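
The image step in the script prepends a scheme to src and derives the file name from the text after '__'. The sketch below does the same two jobs with standard-library helpers so that absolute and root-relative addresses also work. The function names, PAGE_URL constant, and sample src are hypothetical, and note that urljoin reuses the page's scheme (https here) rather than the hard-coded http.

```python
import posixpath
from urllib.parse import urljoin, urlsplit

PAGE_URL = 'https://www.autohome.com.cn/news/'


def absolute_image_url(src, page_url=PAGE_URL):
    """Resolve src against the page URL; handles //host/path, /path, and full URLs."""
    return urljoin(page_url, src)


def image_file_name(src):
    """Take the last path segment, then keep what follows the last '__' if present."""
    last_segment = posixpath.basename(urlsplit(src).path)
    return last_segment.rsplit('__', maxsplit=1)[-1]


# Hypothetical src value, shaped like the protocol-relative addresses the script expects.
sample_src = '//img.example.com/news/120x90_0_autohomecar__example.jpg'
print(absolute_image_url(sample_src))
print(image_file_name(sample_src))
```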

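Summary:
     Two modules were used
          a. requests: imitates a browser, sends a request to a URL, and returns the response
               requests features used so far:
                     ret = requests.get(url='address')
                     print(ret.content)    # content: the raw bytes
                     ret.encoding = ret.apparent_encoding   # decode with the detected encoding
          b. bs4: parses a string of HTML
              bs4 features used so far:
                    soup = BeautifulSoup('<html>....</html>', 'html.parser')   # build a soup object
                    div = soup.find(name='tag name')
                    div = soup.find(name='tag name', id='li')
                    div = soup.find(name='tag name', class_='li')   # class is a Python keyword, so bs4 uses class_
                    div = soup.find(name='div', attrs={'id': 'auto-channel-lazyload-article', 'class': 'id'})
                    # the calls above each return the first matching tag
                    div.text   : the text inside the tag
                    div.attrs  : all attributes, as a dict
                    div.get('href') : a single attribute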


    
                    divs = soup.find_all(name='tag name')
                    divs = soup.find_all(name='tag name', id='li')
                    divs = soup.find_all(name='tag name', class_='li')
                    divs = soup.find_all(name='div', attrs={'id': 'auto-channel-lazyload-article', 'class': 'id'})

                    divs is a list of tags, so index it to get a single tag:
                    divs[0]
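
To make the difference between find and find_all concrete, here is a small sketch that runs against an inline HTML snippet, so no network access is needed. The markup is invented for illustration and only loosely resembles a news list. It uses the built-in html.parser.

```python
from bs4 import BeautifulSoup

# Invented markup for illustration; not copied from the real page.
HTML = """
<div id="articles">
  <ul>
    <li class="item"><a href="/news/1.html"><h3>First headline</h3></a><p>First summary</p></li>
    <li class="item"><a href="/news/2.html"><h3>Second headline</h3></a><p>Second summary</p></li>
    <li class="ad">no headline here</li>
  </ul>
</div>
"""

soup = BeautifulSoup(HTML, 'html.parser')

first_li = soup.find(name='li', class_='item')        # a single Tag, or None
all_items = soup.find_all(name='li', class_='item')   # a list-like collection of Tags
by_attrs = soup.find(name='div', attrs={'id': 'articles'})

print(first_li.find('h3').text)
print(len(all_items))
print(all_items[1].find('a').get('href'))
print(by_attrs.get('id'))
print(soup.find(name='table'))   # find returns None when nothing matches
```

Because find can return None, checking the result before chaining another call avoids an AttributeError on pages whose layout has changed.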

posted on 2018-05-08 03:48 一万年
