Scraping Leslie Cheung photos from Baidu Tieba with Python

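I have always loved Leslie Cheung's songs and listen to them all the time. It occurred to me to collect some photos of him, so I wrote a crawler that fetches them from the 张国荣 (Leslie Cheung) bar on Baidu Tieba. I'm sharing it here for reference.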

I use the requests-html package here:

import json
import random
import re
import time
from requests_html import HTML, HTMLSession

Building the request

class CrawlSpider(object):
    def __init__(self):
        self.sess = HTMLSession()
        self.headers = {
            "Host": "tieba.baidu.com",
            "Referer": "https://www.baidu.com/",
            "User-Agent":"Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36"
        }

Inspecting the API call shows that it is a GET request with query parameters:

        self.params = {
            "kw": "张国荣",
            "tab": "album",
            "subtab": "album_good",
            "cat_id": ""
        }
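
To see what these parameters look like in the final URL, here is a minimal sketch using only the standard library. The helper name `build_url` is hypothetical and not part of the original spider; requests-html performs a comparable encoding itself when you pass `params=`.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_url(base, params):
    """Append URL-encoded query parameters to base (hypothetical helper)."""
    return base + urlencode(params)


params = {
    "kw": "张国荣",
    "tab": "album",
    "subtab": "album_good",
    "cat_id": "",
}

url = build_url("http://tieba.baidu.com/f?", params)
print(url)

# Decode the query string again to check that nothing was lost.
decoded = parse_qs(urlsplit(url).query, keep_blank_values=True)
print(decoded)
```

Note that the non-ASCII keyword is percent-encoded as UTF-8 and that the empty `cat_id` is still sent as `cat_id=`.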

Send the request and extract the category ids:

resp = self.sess.get(url="http://tieba.baidu.com/f?", params=self.params, headers=self.headers)
# Non-greedy group so that each <li> yields its own cat-id, even when several sit on one line.
category_list = re.findall(r'<li cat-id="(.*?)"><span>', resp.text)
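
One thing to watch with a pattern like this is greediness: when several `<li>` elements sit on the same line, a greedy `.*` swallows everything up to the last closing quote, while a non-greedy `.*?` stops at the first one. The snippet below illustrates the difference on made-up markup; the real Tieba page may be laid out differently.

```python
import re

# Hypothetical markup for illustration only.
sample = (
    '<ul><li cat-id="1"><span>A</span></li>'
    '<li cat-id="22"><span>B</span></li></ul>'
)

# Greedy: one oversized match spanning both <li> elements.
greedy = re.findall(r'<li cat-id="(.*)"><span>', sample)

# Non-greedy: one match per <li>.
lazy = re.findall(r'<li cat-id="(.*?)"><span>', sample)

print(greedy)
print(lazy)
```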

For each category, build an API request and collect the id of every album:

        all_tid_list = []
        for cat_id in category_list:
            self.params["cat_id"] = cat_id
            self.params["pagelets"] = 'album/pagelet/album_good'
            self.params["pagelets_stamp"] = "%013d" % (1000 * time.time())
            try:
                resp = self.sess.get(url="http://tieba.baidu.com/f?", params=self.params, headers=self.headers)
                resp.html.render()
                html = HTML(html=resp.html.text)
                # Strip backslashes from the extracted links, then match /p/<digits>.
                tid_list = re.findall(r"/p/\d+", re.sub(r"\\", '', str(html.links)))
                all_tid_list.extend(tid_list)
            except Exception as err:
                print("Failed to get tid: {}".format(err))
        return all_tid_list
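
The backslash-stripping step is easier to follow on a small example. This is a sketch on a made-up set of links, assuming they arrive backslash-escaped as the `re.sub` call suggests. The last line, which drops the `/p/` prefix, is also an assumption, since the post does not show whether the spider does this before using `tid`.

```python
import re

# Hypothetical links, shaped like backslash-escaped hrefs.
links = {
    '\\"\\/p\\/1234567890\\"',
    '\\"\\/p\\/42\\"',
    "http://tieba.baidu.com/f?kw=x",
}

# str() of a set contains the repr of each string; removing every
# backslash leaves plain paths such as /p/42 in the text.
cleaned = re.sub(r"\\", "", str(links))

# De-duplicate and sort so the result is stable.
tid_list = sorted(set(re.findall(r"/p/\d+", cleaned)))

# Assumption: only the numeric part is needed as the album id.
tids = [path.rsplit("/", 1)[-1] for path in tid_list]

print(tid_list)
print(tids)
```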

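Once we have the album ids, we can request the data for each album. "tid" is the album id, "_" is the current timestamp in milliseconds, "pe" means that 40 items are returned per page, and "pn" is the current page number: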

            params = {
                "kw": "张国荣",
                "alt": "jview",
                "rn": "200",
                "tid": tid,
                "pn": "1",
                "ps": "1",
                "pe": "40",
                "info": "1",
                "_": "%013d" % (1000 * time.time())
            }
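
The `"%013d" % (1000 * time.time())` expression used for `_` (and for `pagelets_stamp` earlier) turns the current Unix time into a zero-padded millisecond string. A small sketch follows; the helper name `ms_timestamp` is hypothetical, and the idea that the server treats the value as a cache-buster is my assumption.

```python
import time


def ms_timestamp(now=None):
    """Return Unix time in milliseconds as a zero-padded string of at least 13 characters."""
    if now is None:
        now = time.time()
    # %d truncates the float; 013 pads with leading zeros to width 13.
    return "%013d" % (1000 * now)


stamp = ms_timestamp()             # value for the current moment
fixed = ms_timestamp(1530760320.5) # a fixed instant, for illustration
small = ms_timestamp(1.5)          # shows the zero padding

print(stamp, fixed, small)
```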

Build the request and fetch the album data. The response gives us each album's title and the information for every picture in the album:

base_url = "http://tieba.baidu.com/photo/g/bw/picture/list?"
resp = self.sess.get(url=base_url, params=params, headers=self.headers).text
time.sleep(random.random() + 1)
resp = json.loads(resp)
title = resp["data"]["title"]
pic_list = resp["data"]["pic_list"]
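
The post does not show how these two values end up in `self.total_images`, so the following is only a sketch of one way to do it. The payload is made up and contains just the fields used in this post (`data`, `title`, `pic_list`, `purl`); the helper name `parse_album` is hypothetical, and the real response carries more fields.

```python
import json

# Hypothetical payload for illustration only.
sample_text = json.dumps({
    "data": {
        "title": "album",
        "pic_list": [
            {"purl": "http://example.com/a.jpg"},
            {"purl": "http://example.com/b.jpg"},
        ],
    }
})


def parse_album(text):
    """Extract the title and picture entries, tolerating missing keys."""
    payload = json.loads(text)
    data = payload.get("data") or {}
    return {
        "title": data.get("title", ""),
        "images_info": data.get("pic_list") or [],
    }


album = parse_album(sample_text)
urls = [pic["purl"] for pic in album["images_info"]]
empty = parse_album("{}")

print(album["title"], urls)
```

Using `.get()` instead of direct indexing means an album with an unexpected response produces an empty entry rather than a `KeyError`.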

Save the album titles and image links we need:

        try:
            image_list = []
            for item in self.total_images:
                image_dict = {}
                image_url = []
                image_dict["title"] = item["title"]
                for each in item["images_info"]:
                    image_url.append(each["purl"])
                image_dict["image_url"] = image_url
                image_list.append(image_dict)

            with open("image_urls.json", "w", encoding='GBK') as f:
                f.write(json.dumps(image_list))
        except Exception as err:
            print("Failed to write data: {}".format(err))

Now we have the links to the photos. All that remains is to request each image link and download the image.
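
The download step is not shown in this post, so here is a minimal sketch of it using only the standard library rather than the author's actual code. It assumes the list-of-dicts layout written to image_urls.json above (`title` and `image_url` keys). The names `fetch_bytes` and `download_images`, the folder naming scheme, and the one-second delay are all hypothetical.

```python
import os
import time
from urllib.parse import urlsplit
from urllib.request import urlopen


def fetch_bytes(url, timeout=10):
    """Download one URL and return its raw bytes."""
    with urlopen(url, timeout=timeout) as resp:
        return resp.read()


def download_images(image_list, out_dir, fetch=fetch_bytes, delay=1.0):
    """Save every image of every album under out_dir; return the saved paths."""
    saved = []
    for album_index, album in enumerate(image_list):
        # Number the folders: album titles may contain characters
        # that are not valid in file names.
        album_dir = os.path.join(out_dir, "album_%03d" % album_index)
        os.makedirs(album_dir, exist_ok=True)
        for pic_index, url in enumerate(album["image_url"]):
            ext = os.path.splitext(urlsplit(url).path)[1] or ".jpg"
            path = os.path.join(album_dir, "%04d%s" % (pic_index, ext))
            try:
                data = fetch(url)
            except OSError as err:  # URLError and timeouts are OSError subclasses
                print("Download failed for {}: {}".format(url, err))
                continue
            with open(path, "wb") as f:
                f.write(data)
            saved.append(path)
            time.sleep(delay)  # be polite to the server
    return saved
```

To use it, load image_urls.json with `json.load` (opened with the same GBK encoding it was written with) and pass the resulting list together with an output directory, for example `download_images(image_list, "images")`. The `fetch` argument can be swapped for a function built on `HTMLSession` if you want to reuse the spider's headers.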

The full code is available on my GitHub:

https://github.com/gongjiaqiang/my_spider

 

posted on 2018-07-05 11:12 by 逆行人
