Web Scraping (Part 2): Scraping Images from Toutiao

Disclaimer: this post is mainly my notes from watching the 静觅 (Jingmi) tutorial videos. Original tutorial: https://cuiqingcai.com/

I'm still a beginner and learning slowly. I've only been at this for two days, so corrections are very welcome.

I. Workflow Overview

1. Analyze the Toutiao website

2. Fetch the index pages

3. Fetch the detail pages

4. Download the images and save the results to the database

II. Implementation

2.1 Analyzing the Toutiao website

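1. First open Toutiao and search for a keyword to reach the index page. By analyzing the site's requests, we need to obtain the URLs that lead to the detail pages.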

2. Clicking into data in the response shows the URL of each detail page, so this is the information we will need to extract shortly.

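3. As you scroll down and more gallery entries load, many new Ajax requests appear. They differ in the offset parameter, which tells us that changing offset returns different index pages, and therefore different gallery detail-page URLs.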

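4. Next, inspect the source of a gallery detail page to find the image URLs. I hit a snag here while learning: when I used Inspect in Google Chrome to analyze the page, the original address

  https://m.toutiao.com/a6511830952644182542/

was changed to

  https://m.toutiao.com/a6511830952644182542/

and the image information no longer showed up under Doc. Being a beginner, I searched for a long time without finding it. I then tried another browser and found that Firefox does not behave this way, so I used Firefox for the rest of the analysis.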

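
Reading the page source then shows where the image URLs live: in the gallery field. That basically completes the page analysis; what remains is to implement it in code.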
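To make step 4 concrete, here is a minimal offline sketch of pulling the image URLs out of the gallery field. The HTML snippet is made up for illustration and only mimics the `gallery: JSON.parse("...")` shape described above; the helper name `extract_image_urls` is hypothetical. Note that it decodes the JavaScript string literal with `json.loads` rather than stripping every backslash as the full script does, which is a swapped-in variation and not the tutorial's method.

```python
import json
import re

# Hypothetical page fragment; a real detail page is far larger.
SAMPLE_HTML = r'''
<script>
var BASE_DATA = {
    gallery: JSON.parse("{\"count\":2,\"sub_images\":[{\"url\":\"http:\\/\\/example.com\\/1.jpg\"},{\"url\":\"http:\\/\\/example.com\\/2.jpg\"}]}"),
    other: 1
};
</script>
'''

# Same pattern as in the script, written as a raw string.
GALLERY_PATTERN = re.compile(r'gallery: JSON\.parse\("(.*?)"\),', re.S)


def extract_image_urls(html):
    """Return the list of image URLs found in the gallery field, or []."""
    match = GALLERY_PATTERN.search(html)
    if not match:
        return []
    try:
        # First pass: undo the JavaScript string escaping (\" becomes ").
        inner_json = json.loads('"' + match.group(1) + '"')
        # Second pass: parse the gallery object itself.
        gallery = json.loads(inner_json)
    except ValueError:
        return []
    if not isinstance(gallery, dict):
        return []
    return [img['url'] for img in gallery.get('sub_images', []) if img.get('url')]


print(extract_image_urls(SAMPLE_HTML))
```

Working on a saved snippet like this makes it easy to check the pattern before sending any requests.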

2.2 Code implementation

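I'll keep the code notes brief. After two days of learning, I find the hard part is still analyzing the website; the rest is just using the tools to fetch the data.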
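Before the full script, steps 2 and 3 of the analysis in 2.1 can be tried offline. This is only a sketch under assumptions: the endpoint and parameters are copied from the script in this post and may no longer work, the sample response is invented to mimic the `data` list with `article_url` entries, and the helper names `build_index_url` and `extract_article_urls` are hypothetical.

```python
import json
from urllib.parse import urlencode

# Endpoint as used in the script in this post; it may have changed since.
SEARCH_URL = 'https://www.toutiao.com/search_content/'


def build_index_url(offset, keyword, count=20):
    """Build one index-page URL; only offset changes from page to page."""
    params = {
        'offset': offset,
        'format': 'json',
        'keyword': keyword,
        'autoload': 'true',
        'count': count,
        'cur_tab': 1,
    }
    return SEARCH_URL + '?' + urlencode(params)


def extract_article_urls(text):
    """Return every non-empty article_url under the top-level 'data' list."""
    try:
        payload = json.loads(text)
    except (TypeError, ValueError):
        # None or malformed JSON: treat as "no results".
        return []
    if not isinstance(payload, dict):
        return []
    items = payload.get('data') or []
    return [item['article_url'] for item in items
            if isinstance(item, dict) and item.get('article_url')]


# Hypothetical, heavily trimmed index response.
SAMPLE_INDEX_JSON = '''
{"data": [
    {"title": "gallery one", "article_url": "https://example.com/a1/"},
    {"title": "an entry without a link"},
    {"title": "gallery two", "article_url": "https://example.com/a2/"}
]}
'''

for page in range(3):
    print(build_index_url(page * 20, '街拍'))  # offsets 0, 20, 40
print(extract_article_urls(SAMPLE_INDEX_JSON))
```

Entries without an `article_url` are skipped, which is why the script checks `if url:` before requesting a detail page.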


import json
import re
from hashlib import md5
from json import JSONDecodeError
import os
from bs4 import BeautifulSoup
import requests
import pymongo
from requests import RequestException
from config import *
from multiprocessing import Pool
client = pymongo.MongoClient(MONGO_URL, connect=False)
db = client[MONGO_DB]


def get_page_index(offset, keyword):
    data = {
        'offset': offset,
        'format': 'json',
        'keyword': keyword,
        'autoload': 'true',
        'count': '20',
        'cur_tab': 1
    }
    headers = {'User-Agent': 'Mozilla/5.0'}
    # requests appends the query string built from params
    url = 'https://www.toutiao.com/search_content/'
    try:
        response = requests.get(url, params=data, headers=headers)
        if response.status_code == 200:
            return response.text
        return None
    except RequestException:
        print('error requesting the index page')
        return None


def get_page_detail(url):
    try:
        response = requests.get(url)
        if response.status_code == 200:
            return response.text
        return None
    except RequestException:
        print('request the web error', url)
        return None


def parse_page_detail(html, url):
    soup = BeautifulSoup(html, 'lxml')
    title = soup.select('title')[0].get_text()
    # Raw string so the backslashes reach the regex engine unchanged
    pattern = re.compile(r'gallery: JSON\.parse\("(.*?)"\),', re.S)
    gallery = re.search(pattern, html)
    if gallery:
        gallery = gallery.group(1)
        gallery = re.sub(r'\\', '', gallery)
        data = json.loads(gallery)
        if data and 'sub_images' in data:
            sub_images = data.get('sub_images')
            images = [item.get('url') for item in sub_images]
            for image in images:
                download_image(image)
            return {
                'title': title,
                'url': url,
                'images': images
            }


def parse_page_index(html):
    try:
        data = json.loads(html)
        if data and 'data' in data.keys():
            for item in data.get('data'):
                yield item.get('article_url')
    except JSONDecodeError:
        pass


def save_to_mongo(result):
    # insert() was removed in PyMongo 4; insert_one() is its replacement
    if db[MONGO_TABLE].insert_one(result):
        print('saved to MongoDB successfully', result)
        return True
    return False

def download_image(url):
    print('downloading ',url)
    try:
        response = requests.get(url)
        if response.status_code == 200:
            save_image(response.content)
        return None
    except RequestException:
        print('save photo error',url)
    return None


def save_image(content):
    # Name the file after the MD5 of its bytes so duplicates are written once
    file_path = '{0}/{1}.{2}'.format(os.getcwd(), md5(content).hexdigest(), 'jpg')
    if not os.path.exists(file_path):
        with open(file_path, 'wb') as f:
            f.write(content)


def main(offset):
    index_html = get_page_index(offset, KEYWORD)
    if not index_html:
        # The request failed; json.loads(None) would raise TypeError
        return
    for url in parse_page_index(index_html):
        if url:
            detail_html = get_page_detail(url)
            if detail_html:
                result = parse_page_detail(detail_html, url)
                if result:
                    save_to_mongo(result)


if __name__ == '__main__':
    groups = [x * 20 for x in range(GROUP_START, GROUP_END + 1)]
    pool = Pool()
    pool.map(main, groups)
    pool.close()
    pool.join()

config.py

MONGO_URL = 'localhost'
MONGO_DB = 'toutiao'
MONGO_TABLE = 'toutiao'
GROUP_START = 1
GROUP_END = 20
KEYWORD = '街拍'  # search keyword ("street snap")
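One detail of the script worth isolating: save_image names each file after the MD5 of its bytes, so the same picture downloaded twice is only written once. A minimal standalone sketch of that idea follows; the helper `save_unique` is hypothetical, and it writes into a temporary directory instead of the working directory so it can be run safely.

```python
import os
import tempfile
from hashlib import md5


def save_unique(content, directory, ext='jpg'):
    """Write content to <directory>/<md5>.<ext>; return (path, written)."""
    name = '{0}.{1}'.format(md5(content).hexdigest(), ext)
    path = os.path.join(directory, name)
    if os.path.exists(path):
        # Same bytes were saved before, so skip the write.
        return path, False
    with open(path, 'wb') as f:
        f.write(content)
    return path, True


with tempfile.TemporaryDirectory() as tmp:
    first = save_unique(b'fake image bytes', tmp)
    second = save_unique(b'fake image bytes', tmp)   # identical bytes
    third = save_unique(b'other image bytes', tmp)   # different bytes
    print(first[1], second[1], third[1], len(os.listdir(tmp)))
```

The trade-off is that file names carry no information about the gallery they came from; the MongoDB record is what links a title to its image URLs.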

Problems encountered:

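1. When matching with a regular expression, if the text you want to match contains characters such as '(', ')' or '.', you must put a '\' in front of each of them in the pattern:

   pattern = re.compile(r'gallery: JSON\.parse\("(.*?)"\),', re.S)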

2. db = client[MONGO_DB] must use square brackets, not parentheses; otherwise the database cannot be accessed properly.

3. I could not find the image URLs in Google Chrome; after switching to Firefox, there they were, haha.
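For the first problem above, the standard library can add the backslashes for you. The sketch below uses `re.escape` on a literal prefix and shows what goes wrong without escaping; the sample strings are made up for illustration.

```python
import re

literal = 'gallery: JSON.parse("'
text = 'x = { gallery: JSON.parse("{}"), }'

# re.escape puts a backslash before every regex metacharacter in the literal.
escaped = re.escape(literal)
match = re.search(escaped, text)
print(match.group(0))

# Unescaped, '(' opens a group that is never closed, so compiling fails.
try:
    re.compile(literal)
    unescaped_ok = True
except re.error:
    unescaped_ok = False
print(unescaped_ok)

# An unescaped '.' is more subtle: it compiles, but matches any character.
print(bool(re.fullmatch('a.c', 'abc')))             # '.' as a wildcard
print(bool(re.fullmatch(re.escape('a.c'), 'abc')))  # '.' as a literal dot
```

Escaping by hand, as in the script, works just as well for a short fixed pattern; `re.escape` mainly helps when the literal text comes from a variable.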

Once it runs, the images get downloaded, and then you can look at... emmmm, I'm here to learn the tech, not to look at pictures.

posted on 2018-01-17 14:50 by 繁华里流浪