A Handy Crawler Tool: Asynchronous Crawling - 蔚蓝色の天空

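1. Background

A project of mine needed a crawler, so I searched Google and found that, besides requests, scrapy, multithreading, and multiprocessing, there is another approach: asynchronous crawling, which uses aiohttp together with async to crawl websites.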

2. Usage

Install it with the following command:

# Install aiohttp with pip
pip install aiohttp

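When using aiohttp, note that the code has to be written in the asynchronous style with async, and the async/await syntax requires Python 3.5 or later. The crawling itself is done through aiohttp's asynchronous requests.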
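As a minimal sketch of the async style described above, the following uses only the standard library's asyncio. `fake_fetch` and its delays are hypothetical stand-ins for real network requests and are not part of aiohttp; the sketch also assumes Python 3.7+ for `asyncio.run`.

```python
import asyncio
import time


async def fake_fetch(name, delay):
    # Hypothetical stand-in for a network request: sleeping yields control
    # to the event loop, much as waiting on a real response would.
    await asyncio.sleep(delay)
    return f"{name} done"


async def crawl_all():
    # Start three simulated requests at once and wait for all of them.
    # gather returns results in the order the awaitables were passed in.
    return await asyncio.gather(
        fake_fetch("page-1", 0.2),
        fake_fetch("page-2", 0.2),
        fake_fetch("page-3", 0.2),
    )


def run_demo():
    # Time the whole batch to show that the waits overlap.
    start = time.perf_counter()
    results = asyncio.run(crawl_all())
    elapsed = time.perf_counter() - start
    return results, elapsed


if __name__ == "__main__":
    results, elapsed = run_demo()
    print(results)
    print(f"elapsed: {elapsed:.2f}s")
```

Because the three waits overlap, the batch should take roughly as long as the slowest single request rather than the sum of all three.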

3. Models (rough diagrams, not strictly accurate)

3.1 Multithreaded crawler model

3.2 Asynchronous crawler model
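The two models can also be contrasted in code. This is only a rough sketch under stated assumptions: `blocking_fetch` and `async_fetch` are hypothetical functions that simulate a request with a short sleep, and each returns the name of the thread it ran on so the difference between the models is visible.

```python
import asyncio
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def blocking_fetch(url):
    # Hypothetical blocking request: the worker thread is parked while it waits.
    time.sleep(0.1)
    return threading.current_thread().name


async def async_fetch(url):
    # Hypothetical non-blocking request: the coroutine is suspended and the
    # single event-loop thread moves on to other coroutines.
    await asyncio.sleep(0.1)
    return threading.current_thread().name


def threaded_model(urls, workers=4):
    # Multithreaded model: a pool of threads, each blocked on one request.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return set(pool.map(blocking_fetch, urls))


def async_model(urls):
    # Asynchronous model: one thread, many coroutines multiplexed on it.
    async def run():
        return set(await asyncio.gather(*(async_fetch(u) for u in urls)))

    return asyncio.run(run())


if __name__ == "__main__":
    demo_urls = [f"https://example.com/{i}" for i in range(8)]  # placeholder URLs
    print("threads used by the threaded model:", threaded_model(demo_urls))
    print("threads used by the async model:", async_model(demo_urls))
```

In the threaded model, concurrency is bounded by the number of worker threads; in the asynchronous model, all coroutines share one thread and take turns whenever one of them is waiting.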

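4. Core code

The code is much like an ordinary requests-based crawler. The difference is that aiohttp is asynchronous, so the functions have to be written with async.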

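import asyncio
import logging
import time

import aiohttp
import pandas as pd

# Note: `header` (request headers), `urls` (list of URLs to crawl) and the
# `parser` coroutine are defined elsewhere in the project and are not shown here.

# Configure the crawler's log format
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s: %(message)s')
logger = logging.getLogger(__name__)

# Table that collects the scraped fields
df = pd.DataFrame(columns=['occupation', 'companyName', 'location', 'salary', 'date', 'education', 'experience', 'companyType', 'companySize', 'type', 'request'])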


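# Asynchronous HTTP request
async def fetch(sem, session, url):
    # The semaphore caps how many requests are in flight at once
    async with sem:
        async with session.get(url, headers=header) as response:
            # Ignore undecodable bytes instead of raising on bad encodings
            return await response.text(errors='ignore')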

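# Download and process one page
async def download(sem, url):
    # One session per URL keeps the example simple; aiohttp generally
    # recommends sharing a single ClientSession across requests.
    async with aiohttp.ClientSession() as session:
        try:
            html = await fetch(sem, session, url)
            # Add the page-parsing code here
            await parser(html)
        except Exception as err:
            logger.error('Failed to crawl %s: %s', url, err)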


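async def main():
    # Allow at most 100 requests in flight at once
    sem = asyncio.Semaphore(100)
    # The URL list was split among three people:
    #   first person:  urls[:3413]
    #   second person: urls[3413:6826]
    #   third person:  urls[6826:]
    tasks = [download(sem, url) for url in urls[:3413]]
    await asyncio.gather(*tasks)


if __name__ == '__main__':
    # Measure the total time the crawler takes
    print('*' * 50)
    t_start_web = time.time()
    # asyncio.run requires Python 3.7+; on 3.5/3.6 use
    # asyncio.get_event_loop().run_until_complete(main()) instead
    asyncio.run(main())
    t_end_web = time.time()
    print('Total crawl time: %s' % (t_end_web - t_start_web))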
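To illustrate what the semaphore in the code above does, here is a small self-contained sketch that runs without network access. `Tracker`, `limited_fetch`, and `crawl` are hypothetical names introduced only for this illustration, and the request is simulated with a short sleep instead of a real aiohttp call.

```python
import asyncio


class Tracker:
    """Counts how many simulated requests are running at the same time."""

    def __init__(self):
        self.active = 0
        self.peak = 0


async def limited_fetch(sem, tracker, url):
    # Only `limit` coroutines can be inside this block at once;
    # the rest wait here until a slot is released.
    async with sem:
        tracker.active += 1
        tracker.peak = max(tracker.peak, tracker.active)
        await asyncio.sleep(0.01)  # simulated network wait
        tracker.active -= 1
        return url


async def crawl(urls, limit):
    # Create the semaphore inside the running event loop.
    sem = asyncio.Semaphore(limit)
    tracker = Tracker()
    results = await asyncio.gather(
        *(limited_fetch(sem, tracker, u) for u in urls)
    )
    return list(results), tracker.peak


if __name__ == "__main__":
    demo_urls = [f"https://example.com/{i}" for i in range(20)]  # placeholder URLs
    results, peak = asyncio.run(crawl(demo_urls, limit=5))
    print(f"fetched {len(results)} pages, peak concurrency {peak}")
```

All tasks are created up front, but the semaphore keeps the number of simultaneous requests at or below the limit, which helps avoid overwhelming the target site or exhausting local connections.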

5. References

5.1 Detailed introductions

http://www.ityouknow.com/python/2019/12/28/python-aiohttp-102.html

https://www.jianshu.com/p/b8010594557f

5.2 Performance comparison

https://www.cnblogs.com/jiyongjia/p/9803991.html

posted on 2020-07-20 20:49 蔚蓝色の天空
