Crawling free IPs from the Xichi proxy site (西刺)
When writing crawlers you often need to switch IPs, so it is well worth maintaining your own IP pool in a database: whenever you need a fresh IP, you can pick one from it at random. My approach is to crawl the free IPs from the Xichi site, store them in the database, and add a tools directory to the Scrapy project that holds common utilities, including this free IP pool:
The code for crawl_ip_from_xichi.py is as follows:
import time

import pymysql
import requests
from fake_useragent import UserAgent
from scrapy.selector import Selector


class GetIPFromXichi(object):
    """Fetch usable IPs from Xichi and store them in the database."""

    def crawl_ip(self):
        """Crawl Xichi's free IP list."""
        ip_list = []
        for page in range(1, 20):
            ua = {"User-Agent": UserAgent().random}
            url = "http://www.xicidaili.com/nn/" + str(page)
            response = requests.get(url, headers=ua)
            # Be polite: sleep 3 seconds between pages
            time.sleep(3)
            selector = Selector(text=response.text)
            all_tr = selector.css("#ip_list tr")
            for tr in all_tr[1:]:
                speed_str = tr.css(".bar::attr(title)").extract_first()
                # The title looks like "0.5秒"; strip the unit to get a float
                speed = float(speed_str.split("秒")[0]) if speed_str else 0
                all_text = tr.css("td ::text").extract()
                ip = all_text[0]
                port = all_text[1]
                proxy_type = all_text[6]
                if "HTTP" not in proxy_type.upper():
                    proxy_type = "HTTP"
                ip_list.append((ip, port, proxy_type, speed))

        conn = pymysql.connect(host="127.0.0.1", user="root",
                               password="root", db="outback")
        cursor = conn.cursor()
        insert_sql = """insert into ip_proxy(ip, port, type, speed)
                        VALUES (%s, %s, %s, %s)"""
        for record in ip_list:
            try:
                cursor.execute(insert_sql, record)
                conn.commit()
            except Exception as e:
                print(e)
                conn.rollback()
        # Close the connection only after the whole insert loop has finished
        cursor.close()
        conn.close()


if __name__ == "__main__":
    crawl_ip_from_xichi = GetIPFromXichi()
    crawl_ip_from_xichi.crawl_ip()
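To actually use the pool, each stored (ip, port, type, speed) row has to be turned back into a proxy URL that Scrapy understands. A minimal sketch below, assuming rows shaped like the tuples inserted above; the helper names are my own, not part of the tools repo:

```python
import random


def make_proxy_url(record):
    """Turn an (ip, port, type, speed) row from ip_proxy into a proxy URL
    suitable for request.meta['proxy'] in Scrapy."""
    ip, port, proxy_type, speed = record
    return "{0}://{1}:{2}".format(proxy_type.lower(), ip, port)


def pick_random_proxy(records):
    """Pick one proxy at random from rows fetched out of the pool."""
    return make_proxy_url(random.choice(records))


# Example rows as they would come back from "select * from ip_proxy"
rows = [("110.73.0.4", "8123", "HTTP", 0.5),
        ("119.28.1.1", "3128", "HTTPS", 1.2)]
print(pick_random_proxy(rows))
```

A query such as `select ip, port, type, speed from ip_proxy order by rand() limit 1` would push the random choice into MySQL instead, avoiding loading the whole table.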
A few spots here are easy to get wrong:

1. Call the crawl function only under the `if __name__ == "__main__"` guard, so that importing this class later does not trigger another crawl.
2. Close the database connection only after the entire insert loop has finished.
3. To keep the crawler polite, sleep 3 seconds after each page is fetched.
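To wire the pool into Scrapy, the usual place is a downloader middleware that attaches a random proxy to each outgoing request. A minimal sketch under the assumption that the proxy URLs were loaded from ip_proxy at startup; the class name and loading step are illustrative, not from the original project:

```python
import random


class RandomProxyMiddleware(object):
    """Hypothetical Scrapy downloader middleware: attach a random proxy
    from the pool to every outgoing request."""

    def __init__(self, proxies):
        # proxies: list of URLs like "http://110.73.0.4:8123",
        # e.g. built from the ip_proxy table when the spider starts
        self.proxies = proxies

    def process_request(self, request, spider):
        # Scrapy routes the request through whatever meta['proxy'] names
        request.meta["proxy"] = random.choice(self.proxies)
```

It would then be enabled in settings.py via DOWNLOADER_MIDDLEWARES, like any other downloader middleware.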
GitHub: https://github.com/573320328/tools