
Parsing a site's robots.txt to check whether it may be crawled

Given a user_agent and a url, determine whether that page may be fetched.

>>> from urllib import robotparser
>>> rb = robotparser.RobotFileParser()
>>> rb.set_url("https://www.jd.com/robots.txt")
>>> rb.read()  # download and parse the robots.txt rules
>>> url = "https://www.jd.com"
>>> user_agent = "HuihuiSpider"
>>> rb.can_fetch(user_agent, url)  # this agent is explicitly disallowed
False
>>> rb.can_fetch("sougou", url)  # no rule blocks this agent
True
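The same check can be exercised without a network request: RobotFileParser.parse() accepts the rule lines directly. A minimal sketch with made-up rules (the rule text and example.com URLs below are assumptions for illustration, not from the actual jd.com robots.txt):

from urllib import robotparser

# Hypothetical rules mimicking a site that bans one crawler entirely
# and keeps /private/ off-limits to everyone else.
rules = """\
User-agent: HuihuiSpider
Disallow: /

User-agent: *
Disallow: /private/
"""

rb = robotparser.RobotFileParser()
rb.parse(rules.splitlines())  # parse rules without fetching anything

print(rb.can_fetch("HuihuiSpider", "https://example.com/"))         # False
print(rb.can_fetch("sougou", "https://example.com/"))               # True
print(rb.can_fetch("sougou", "https://example.com/private/data"))   # False

This is handy for unit-testing a crawler's politeness logic offline before pointing it at a real site.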


posted @ 2020-04-04 14:32  Lust4Life