乐之之

Know and find joy in doing; find joy and act on it. Heaven rewards the diligent, and learning never ends.
Common anti-crawling measures: IP-based blocking

  When running a crawler you will often hit this situation: the crawler starts out fine and scrapes data normally, but within minutes it begins throwing errors, such as 403 Forbidden responses. Opening the page in a browser, you may find it empty, with the data that used to be there no longer shown, or you get a message that your IP is requesting too frequently, or a captcha pops up for you to solve; then, after a while, access returns to normal.

  The cause of all this is that some sites deploy anti-crawling measures: the server counts the requests made by a given IP per unit of time, and once that count exceeds a threshold it simply refuses service and returns an error. This is known as IP banning, and once it happens the crawler can no longer fetch data through normal means.
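The server-side logic just described can be sketched as a small sliding-window counter. This is only an illustration of the mechanism, not any real site's implementation; the window length and threshold below are made-up values.

```python
import time
from collections import defaultdict, deque

# Illustrative limits: at most 10 requests per IP per 60-second window.
WINDOW_SECONDS = 60
MAX_REQUESTS = 10

_hits = defaultdict(deque)  # ip -> timestamps of its recent requests

def allow_request(ip, now=None):
    """Return True if this IP is still under the limit, False otherwise."""
    now = time.time() if now is None else now
    q = _hits[ip]
    # Drop timestamps that have slid out of the window.
    while q and now - q[0] > WINDOW_SECONDS:
        q.popleft()
    if len(q) >= MAX_REQUESTS:
        return False  # this is where the server would answer 403 Forbidden
    q.append(now)
    return True
```

Once an IP trips the limit, every further request inside the window is rejected, which is exactly the "worked fine, then suddenly 403" pattern a crawler sees.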

  At this point we need a proxy: hide our own IP, send the requests for the data we want from the proxy's IP, and have the data relayed back to the crawler. Below I will briefly walk through the workflow and usage of proxy IPs.

I. The logic of IP proxying

(Flow diagram)

II. Setting up a proxy

  Before using proxy IPs we need some preparation: finding a proxy provider. There are many proxy-service sites online, such as Kuaidaili (快代理) and Liuguan proxy (流冠代理). Some of them offer free proxies, but the quality of free proxies is often disappointing. You can also scrape free proxies from the big free-proxy sites and build your own IP pool, but such pools tend to perform poorly in practice, so the more reliable route is a paid proxy.

  A proxy is simply the combination of an IP address and a port, in the format ip:port.

  • Proxy setup in requests

  requests is the library crawlers use most, and its proxy setup is very simple: just pass the proxies parameter. This section uses Liuguan proxy, generating IPs through its API link, as shown below:

The code is as follows:

import requests

proxy = '183.162.226.249:25020'
proxies = {
    # Many providers tunnel HTTPS through an HTTP proxy endpoint,
    # hence 'http://' for both schemes.
    'http': 'http://' + proxy,
    'https': 'http://' + proxy,
}
try:
    response = requests.get('https://www.httpbin.org/get', proxies=proxies, timeout=5)
    print(response.text)
except requests.exceptions.ConnectionError as e:
    print('Error', e.args)

The result is as follows:

  The simple example above shows that the proxy IP was configured successfully. However, each generated IP only lives for three minutes; by the time you have copied it over and adjusted the program, that time is nearly gone. So the next step is to build a basic proxy IP pool.

III. A paid proxy IP pool

1. A simple multithreaded proxy pool example

  • A proxy pool whose entries expire keeps a single banned proxy from causing data loss or halting the crawler. The best way to build one is an IP API endpoint plus the pool's core logic plus a database for screening and storage, as in Cui Qingcai's Python 3 crawler tutorial on maintaining an efficient proxy pool.
  • Below is a simple paid proxy pool I put together; its efficiency still leaves room for improvement. The code:
import json
import time
import threading

import requests
from fake_useragent import UserAgent

# Shared between the pool thread and the consumer thread.
use_ip_list = []


class IP_pool(object):

    def __init__(self):
        # Target site.
        self.url = "https://httpbin.org/get"
        # Paid-proxy API link; each call yields five IPs, valid for three minutes.
        self.get_url = "https://api.hailiangip.com:8522/api/getIpEncrypt?dataType=0&encryptParam=i6OcePKr4Cq0wPH1UJ%2FCOyBYXf0wSdR0KIhVhoMMPHvy912xFHA3Hogn7b2rQpv2HYJUsHY9eUYoxg%2BMnszqBry50mHL0pwmTLsDGPSw"
        # Candidate proxy IPs fetched from the API.
        self.ip_list = []
        # Failure counter used to decide when to fetch a fresh batch.
        self.result = 0
        # Usage count of the current IP.
        self.number = 0
        # Site used to check whether an IP works.
        self.test_url = "https://www.baidu.com"
        ua = UserAgent()
        self.headers = {"User-Agent": f"{ua.random}"}
        global use_ip_list
        # Refill the shared pool whenever it runs dry.
        while True:
            if len(use_ip_list) == 0:
                self.get_ip()
            else:
                time.sleep(10)

    # Fetch a fresh batch of IPs from the API.
    def get_ip(self):
        try:
            get_data = requests.get(self.get_url).text
            self.ip_list.clear()
            # The response is a JSON list of five IPs; collect each as ip:port.
            for data_ip in json.loads(get_data)["data"]:
                self.ip_list.append(
                    f'{data_ip["ip"]}:{data_ip["port"]}'
                )
            self.test_ip()
        except Exception as e:
            print(e)
            print("Failed to fetch IPs!")

    # Verify each fetched IP before handing it to the consumer.
    def test_ip(self):
        global use_ip_list
        try:
            for i in self.ip_list:
                print(i)
                proxies = {
                    "http": "http://" + i,
                    "https": "http://" + i,
                }
                response = requests.get(self.test_url, headers=self.headers,
                                        proxies=proxies, timeout=2)
                if response.status_code == 200:
                    use_ip_list.append(i)
                    print(response.status_code)
                else:
                    # After five consecutive failures, fetch a new batch.
                    if self.result == 5:
                        self.get_ip()
                        self.result = 0
                    self.result += 1
                    self.test_ip()

        except Exception:
            if self.result == 5:
                self.get_ip()
                self.result = 0
            else:
                self.result += 1
                self.test_ip()


class parse_data():
    def __init__(self):
        # Give the pool thread a head start before consuming.
        time.sleep(10)
        self.use_data()

    def listen(self):
        # Block until the pool has at least one verified IP.
        global use_ip_list
        time.sleep(2)
        test_number = 1
        while test_number:
            print("wait")
            print(use_ip_list)
            if len(use_ip_list) != 0:
                self.use_data()
                test_number = 0
            else:
                time.sleep(2)

    def use_data(self):
        ua = UserAgent()
        headers = ua.random
        print(headers)
        global use_ip_list
        number = 0   # index of the proxy currently in use
        sign = 0     # how many requests the current proxy has served
        test_number1 = 1
        print(use_ip_list)
        try:
            for i in range(15):
                proxie = use_ip_list[number]
                proxies = {
                    "http": "http://" + proxie,
                    "https": "http://" + proxie,
                }
                time.sleep(1)
                try:
                    response = requests.get('https://www.httpbin.org/ip',
                                            proxies=proxies, timeout=4)
                    print(response.text)
                except Exception as e:
                    # Skip a proxy that times out or fails; move on to the next.
                    print(e)
                    number += 1
                    continue
                sign += 1
                # Rotate to the next proxy after three uses.
                if sign == 3:
                    number += 1
                # All five proxies exhausted: clear the pool, wait for a refill.
                if number == 5:
                    use_ip_list.clear()
                    number = 0
                    time.sleep(10)
                    while test_number1:
                        if len(use_ip_list) != 0:
                            self.use_data()
                            test_number1 = 0
                test_number1 = 1
        except IndexError:
            use_ip_list.clear()
            number = 0
            time.sleep(10)
            while test_number1:
                if len(use_ip_list) != 0:
                    self.use_data()
                    test_number1 = 0


if __name__ == '__main__':
    t1 = threading.Thread(target=IP_pool)
    t2 = threading.Thread(target=parse_data)
    t1.start()
    t2.start()
  •  Proxies that time out or stop working are skipped, and the next proxy IP is used instead.
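That skip-and-rotate idea can be isolated into a small helper. This is a minimal sketch, not the pool code above: `fetch_with_rotation` is a hypothetical name, and the fetch function is injected so the rotation logic can run without a network; in real use it would wrap requests.get with proxies= and a timeout.

```python
def fetch_with_rotation(url, proxy_list, fetch, max_tries=None):
    """Try each proxy in turn; return (proxy, result) for the first success."""
    tries = max_tries or len(proxy_list)
    last_error = None
    for proxy in proxy_list[:tries]:
        try:
            return proxy, fetch(url, proxy)
        except Exception as e:  # timeout / connection refused: try the next one
            last_error = e
    raise RuntimeError(f"all {tries} proxies failed: {last_error}")
```

Dead proxies cost one failed attempt each instead of stopping the crawl, which is the behaviour the bullet point above describes.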

  This gives a simple working paid proxy pool. Next, let's build an IP proxy pool that works with scrapy.

2. A proxy pool example in the scrapy framework

  •  Configure the IP proxy pool in scrapy's downloader middleware: each IP is swapped out after four uses, and when an API link exceeds its maximum retries the middleware switches to the next API link.
import json
import random

import requests
from scrapy import signals



class IpTextSpiderMiddleware:
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the spider middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_spider_input(self, response, spider):
        # Called for each response that goes through the spider
        # middleware and into the spider.

        # Should return None or raise an exception.
        return None

    def process_spider_output(self, response, result, spider):
        # Called with the results returned from the Spider, after
        # it has processed the response.

        # Must return an iterable of Request, or item objects.
        for i in result:
            yield i

    def process_spider_exception(self, response, exception, spider):
        # Called when a spider or process_spider_input() method
        # (from other spider middleware) raises an exception.

        # Should return either None or an iterable of Request or item objects.
        pass

    def process_start_requests(self, start_requests, spider):
        # Called with the start requests of the spider, and works
        # similarly to the process_spider_output() method, except
        # that it doesn’t have a response associated.

        # Must return only requests (not items).
        for r in start_requests:
            yield r

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)


class IpTextDownloaderMiddleware:
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.
    def __init__(self):
        self.user_agent_list=["Mozilla/5.0 (Linux; U; Android 2.3.6; en-us; Nexus S Build/GRK39F) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
            "Avant Browser/1.2.789rel1 (http://www.avantbrowser.com)",
            "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/532.5 (KHTML, like Gecko) Chrome/4.0.249.0 Safari/532.5",
            "Mozilla/5.0 (Windows; U; Windows NT 5.2; en-US) AppleWebKit/532.9 (KHTML, like Gecko) Chrome/5.0.310.0 Safari/532.9",
            "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US) AppleWebKit/534.7 (KHTML, like Gecko) Chrome/7.0.514.0 Safari/534.7",
            "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US) AppleWebKit/534.14 (KHTML, like Gecko) Chrome/9.0.601.0 Safari/534.14",
            "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.14 (KHTML, like Gecko) Chrome/10.0.601.0 Safari/534.14",
            "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.20 (KHTML, like Gecko) Chrome/11.0.672.2 Safari/534.20",
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.27 (KHTML, like Gecko) Chrome/12.0.712.0 Safari/534.27",
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/13.0.782.24 Safari/535.1",
            "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/535.2 (KHTML, like Gecko) Chrome/15.0.874.120 Safari/535.2",
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.36 Safari/535.7",
            "Mozilla/5.0 (Windows; U; Windows NT 6.0 x64; en-US; rv:1.9pre) Gecko/2008072421 Minefield/3.0.2pre",
            "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.10) Gecko/2009042316 Firefox/3.0.10",
            "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-GB; rv:1.9.0.11) Gecko/2009060215 Firefox/3.0.11 (.NET CLR 3.5.30729)",
            "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6 GTB5",
            "Mozilla/5.0 (Windows; U; Windows NT 5.1; tr; rv:1.9.2.8) Gecko/20100722 Firefox/3.6.8 ( .NET CLR 3.5.30729; .NET4.0E)",
            "Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
            "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
            "Mozilla/5.0 (Windows NT 5.1; rv:5.0) Gecko/20100101 Firefox/5.0",
            "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:6.0a2) Gecko/20110622 Firefox/6.0a2",
            "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:7.0.1) Gecko/20100101 Firefox/7.0.1",
            "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:2.0b4pre) Gecko/20100815 Minefield/4.0b4pre",
            "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0 )",
            "Mozilla/4.0 (compatible; MSIE 5.5; Windows 98; Win 9x 4.90)",
            "Mozilla/5.0 (Windows; U; Windows XP) Gecko MultiZilla/1.6.1.0a",
            "Mozilla/2.02E (Win95; U)",
            "Mozilla/3.01Gold (Win95; I)",
            "Mozilla/4.8 [en] (Windows NT 5.1; U)",
            "Mozilla/5.0 (Windows; U; Win98; en-US; rv:1.4) Gecko Netscape/7.1 (ax)",
            "HTC_Dream Mozilla/5.0 (Linux; U; Android 1.5; en-ca; Build/CUPCAKE) AppleWebKit/528.5  (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
            "Mozilla/5.0 (hp-tablet; Linux; hpwOS/3.0.2; U; de-DE) AppleWebKit/534.6 (KHTML, like Gecko) wOSBrowser/234.40.1 Safari/534.6 TouchPad/1.0",
            "Mozilla/5.0 (Linux; U; Android 1.5; en-us; sdk Build/CUPCAKE) AppleWebkit/528.5  (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
            "Mozilla/5.0 (Linux; U; Android 2.1; en-us; Nexus One Build/ERD62) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
            "Mozilla/5.0 (Linux; U; Android 2.2; en-us; Nexus One Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
            "Mozilla/5.0 (Linux; U; Android 1.5; en-us; htc_bahamas Build/CRB17) AppleWebKit/528.5  (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
            "Mozilla/5.0 (Linux; U; Android 2.1-update1; de-de; HTC Desire 1.19.161.5 Build/ERE27) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
            "Mozilla/5.0 (Linux; U; Android 2.2; en-us; Sprint APA9292KT Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
            "Mozilla/5.0 (Linux; U; Android 1.5; de-ch; HTC Hero Build/CUPCAKE) AppleWebKit/528.5  (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
            "Mozilla/5.0 (Linux; U; Android 2.2; en-us; ADR6300 Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
            "Mozilla/5.0 (Linux; U; Android 2.1; en-us; HTC Legend Build/cupcake) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
            "Mozilla/5.0 (Linux; U; Android 1.5; de-de; HTC Magic Build/PLAT-RC33) AppleWebKit/528.5  (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1 FirePHP/0.3",
            "Mozilla/5.0 (Linux; U; Android 1.6; en-us; HTC_TATTOO_A3288 Build/DRC79) AppleWebKit/528.5  (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
            "Mozilla/5.0 (Linux; U; Android 1.0; en-us; dream) AppleWebKit/525.10  (KHTML, like Gecko) Version/3.0.4 Mobile Safari/523.12.2",
            "Mozilla/5.0 (Linux; U; Android 1.5; en-us; T-Mobile G1 Build/CRB43) AppleWebKit/528.5  (KHTML, like Gecko) Version/3.1.2 Mobile Safari 525.20.1",
            "Mozilla/5.0 (Linux; U; Android 1.5; en-gb; T-Mobile_G2_Touch Build/CUPCAKE) AppleWebKit/528.5  (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
            "Mozilla/5.0 (Linux; U; Android 2.0; en-us; Droid Build/ESD20) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
            "Mozilla/5.0 (Linux; U; Android 2.2; en-us; Droid Build/FRG22D) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
            "Mozilla/5.0 (Linux; U; Android 2.0; en-us; Milestone Build/ SHOLS_U2_01.03.1) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
            "Mozilla/5.0 (Linux; U; Android 2.0.1; de-de; Milestone Build/SHOLS_U2_01.14.0) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
            "Mozilla/5.0 (Linux; U; Android 3.0; en-us; Xoom Build/HRI39) AppleWebKit/525.10  (KHTML, like Gecko) Version/3.0.4 Mobile Safari/523.12.2",
            "Mozilla/5.0 (Linux; U; Android 0.5; en-us) AppleWebKit/522  (KHTML, like Gecko) Safari/419.3",
            "Mozilla/5.0 (Linux; U; Android 1.1; en-gb; dream) AppleWebKit/525.10  (KHTML, like Gecko) Version/3.0.4 Mobile Safari/523.12.2",
            "Mozilla/5.0 (Linux; U; Android 2.0; en-us; Droid Build/ESD20) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
            "Mozilla/5.0 (Linux; U; Android 2.1; en-us; Nexus One Build/ERD62) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
            "Mozilla/5.0 (Linux; U; Android 2.2; en-us; Sprint APA9292KT Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
            "Mozilla/5.0 (Linux; U; Android 2.2; en-us; ADR6300 Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
            "Mozilla/5.0 (Linux; U; Android 2.2; en-ca; GT-P1000M Build/FROYO) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
            "Mozilla/5.0 (Linux; U; Android 3.0.1; fr-fr; A500 Build/HRI66) AppleWebKit/534.13 (KHTML, like Gecko) Version/4.0 Safari/534.13",
            "Mozilla/5.0 (Linux; U; Android 3.0; en-us; Xoom Build/HRI39) AppleWebKit/525.10  (KHTML, like Gecko) Version/3.0.4 Mobile Safari/523.12.2",
            "Mozilla/5.0 (Linux; U; Android 1.6; es-es; SonyEricssonX10i Build/R1FA016) AppleWebKit/528.5  (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
            "Mozilla/5.0 (Linux; U; Android 1.6; en-us; SonyEricssonX10i Build/R1AA056) AppleWebKit/528.5  (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",]
        self.get_url="https://api.hailiangip.com:8522/api/getIpEncrypt?dataType=0&encryptParam=i6OcePKr4Cq0wPH1UJ%2FCOyBYXf0wSdR0KIhVhoMMPHvy912xFHA3Hogn7b2rQpv2lrmj76kLuq7e14sbfWQuzlOvHP1IbxTPA2LMO00qT4O%2F0UCy4GxAzlw0QepTrRUrV%2FcsKzI2Sft4udNK9vBeAMBuiCewJ1molaFh8MJxFydXwMMY2Pp7wRNtgRIJmPbHvs3ERyFHZ9FAgNS8WBDIMt0Jv%2FQlqwlcd4gkrYI6AFg%3D"
        self.ip_list = []   # pool of fetched proxy IPs
        self.result = 0     # index of the IP currently in use
        self.number = 0     # how many times the current IP has been used
        self.test_url = "https://www.baidu.com"
        self.test_headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36'}
        self.api_list=["https://api.hailiangip.com:8522/api/getIpEncrypt?dataType=0&encryptParam=i6OcePKr4Cq0wPH1UJ%2FCOyBYXf0wSdR0KIhVhoMMPHvy912xFHA3Hogn7b2rQpv2lrmj76kLuq7eOWHxckogBGUeCkpAfW2MwC6hSIUqvmJEuChn06y97xvK6RHiYy1dSqDX%2B368VRXwMMY2Pp7wRNtgRIJmPbHvs3ERyFHZ9FAgNS8WBDIMt0Jv%2FQlqwlcd4gkrYI6AFg%3D",
                       "https://api.hailiangip.com:8522/api/getIpEncrypt?dataType=0&encryptParam=i6OcePKr4Cq0wPH1UJ%2FCOyBYXf0wSdR0KIhVhoMMPHvy912xFHA3Hogn7b2rQpv2lrmj76kLuq7e14sbfWQuzlOvHP1IHWQ4A%2BVINCFoKbhydhvTlIRNXszr%2FHaGHGv939i1Nxu8r0dtugDHco%2BmanfXPh0hXwMMY2Pp7wRNtgRIJmPbHvs3ERyFHZ9FAgNS8WBDIMt0Jv%2FQlqwlcd4gkrYI6AFg%3D",
                       "https://api.hailiangip.com:8522/api/getIpEncrypt?dataType=0&encryptParam=i6OcePKr4Cq0wPH1UJ%2FCOyBYXf0wSdR0KIhVhoMMPHvy912xFHA3Hogn7b2rQpv2lrmj76kLuq7urq2dGXnKdp7EUQ4X9QpspzfvwQvOUpSUJb2phf%2B8byojswKKlj%2BTN0qRfulH7lBXwMMY2Pp7wRNtgRIJmPbHvs3ERyFHZ9FAgNS8WBDIMt0Jv%2FQlqwlcd4gkrYI6AFg%3D",
                       "https://api.hailiangip.com:8522/api/getIpEncrypt?dataType=0&encryptParam=i6OcePKr4Cq0wPH1UJ%2FCOyBYXf0wSdR0KIhVhoMMPHvy912xFHA3Hogn7b2rQpv2lrmj76kLiDdgWb%2BiP06kNQT1JBWiXSnS2Qt7WNIYYxM5GZIqxvz0lXwMMY2Pp7wRNtgRIJmPbHvs3ERyFHZ9FAgNS8WBDIMt0Jv%2FQlqwlcd4gkrYI6AFg%3D",
                       "https://api.hailiangip.com:8522/api/getIpEncrypt?dataType=0&encryptParam=i6OcePKr4Cq0wPH1UJ%2FCOyBYXf0wSdR0KIhVhoMMPHvy912xFHA3Hogn7b2rQpv2lrmj76kLuq7e14sbfWQu0qT4NvKCov%2B%2FIpmBc0mnbaMh3yTQ0le4s8ucZtiCLiOpFnV%2F02ZNRefDtCwn5VVr8I%2F1tXwMMY2Pp7wRNtgRIJmPbHvs3ERyFHZ9FAgNS8WBDIMt0Jv%2FQlqwlcd4gkrYI6AFg%3D",
                       "https://api.hailiangip.com:8522/api/getIpEncrypt?dataType=0&encryptParam=i6OcePKr4Cq0wPH1UJ%2FCOyBYXf0wSdR0KIhVhoMMPHvy912xFHA3Hogn7b2rQpv2lrmj76kLuq7A2LMO0jJ8Cc1cDves0KxrtmVpdlMty0gWL55wWjkCGznQdo0KP0RoNG5vrTavLIdXwMMY2Pp7wRNtgRIJmPbHvs3ERyFHZ9FAgNS8WBDIMt0Jv%2FQlqwlcd4gkrYI6AFg%3D"]
        self.api_number=0
    def get_ip(self):
        try:
            temp_data = requests.get(self.get_url).text
            self.ip_list.clear()
            for data_ip in json.loads(temp_data)["data"]:
                print(data_ip)
                self.ip_list.append({
                    "ip": data_ip["ip"],
                    "port": data_ip["port"],
                })
        except Exception:
            # The current API link failed: fall back to the next one in the list.
            if self.api_number >= len(self.api_list):
                print("Warning: the last API link has been reached; swap in new API links!")
                return
            self.get_url = self.api_list[self.api_number]
            self.api_number += 1

    def change_ip_data(self, request):
        # Attach the current proxy to the request, matching the URL scheme.
        current = str(self.ip_list[self.result - 1]["ip"]) + ":" + str(self.ip_list[self.result - 1]["port"])
        if request.url.split(":")[0] == "http":
            request.meta["proxy"] = "http://" + current
        if request.url.split(":")[0] == "https":
            request.meta["proxy"] = "https://" + current

    def test_ip(self):
        # Probe the test site through the current proxy; raises on failure.
        current = str(self.ip_list[self.result - 1]["ip"]) + ":" + str(self.ip_list[self.result - 1]["port"])
        requests.get(self.test_url, headers=self.test_headers,
                     proxies={"http": "http://" + current, "https": "http://" + current},
                     timeout=5)

    def ip_used(self, request):
        try:
            self.change_ip_data(request)
            self.test_ip()
        except Exception:
            # The current IP is dead: refill the batch or advance to the next IP.
            if self.result == 0 or self.result == 5:
                self.get_ip()
                self.result = 1
            else:
                self.result += 1
                self.ip_used(request)


    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        # Rotate the User-Agent on every request.
        request.headers["User-Agent"] = random.choice(self.user_agent_list)
        # Fetch a fresh batch when starting out or after all five IPs are spent.
        if self.result == 0 or self.result == 5:
            self.get_ip()
            self.result = 1
        # Move to the next IP once it has served four requests.
        if self.number == 3:
            self.result += 1
            self.number = 0
            self.change_ip_data(request)
        else:
            self.change_ip_data(request)
            self.number += 1


    def process_response(self, request, response, spider):
        # Called with the response returned from the downloader.

        # Must either;
        # - return a Response object
        # - return a Request object
        # - or raise IgnoreRequest
        return response

    def process_exception(self, request, exception, spider):
        # Called when a download handler or a process_request()
        # (from other downloader middleware) raises an exception.

        # Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops process_exception() chain
        # - return a Request object: stops process_exception() chain
        pass

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)
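For the middleware above to take effect, scrapy has to be told about it in the project settings. A minimal sketch, assuming the project module is named ip_text; adjust the dotted path to wherever the class actually lives:

```python
# settings.py -- the module path "ip_text.middlewares" is an assumption;
# point it at the module that contains IpTextDownloaderMiddleware.
DOWNLOADER_MIDDLEWARES = {
    "ip_text.middlewares.IpTextDownloaderMiddleware": 543,
}
```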

   That wraps up building the IP proxy pool. When dealing with IP-based anti-crawling, take care not to waste proxy IPs, and use the pool sensibly.

PS: there is one more way to change your IP: ADSL dial-up. Restarting the modem gets you a new dynamic IP. The mainstream approach online is dynamic dialing through a VPS, but the hard part is setting up the environment: it takes a VPS host, a fixed-IP management server, and the local crawler program working together. You can rent the VPS host and the fixed-IP management server online. A brief overview:

  • VPS host: there are plenty of cloud hosts online, such as 云立方, 阿斯云, 阳光NET and 无极网络. For personal dynamic dialing a basic configuration is enough; for large-scale crawling of, say, more than ten million records, pick a higher configuration. I personally recommend the last two, which are relatively affordable (70-88 RMB/month); you can also go with a more stable vendor like 云立方, but it costs more (110 RMB/month).
  • Fixed-IP management server: the VPS redials on a schedule to change its IP, then reports to this server, which records the VPS's current IP address. Alternatively, use a DDNS service: dynamic DNS resolution maps a domain to the IP the cloud host got after dialing, so you reach it through the domain.
  • Local crawler program: sends the dial commands and fetches the current IP on a schedule.
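The division of labour among the three parts can be sketched as a tiny report/lookup exchange. Everything here is hypothetical and only illustrates the shape of the protocol: the payload fields, the port and the function names are made up. The VPS would build a report after each redial and POST it to the management server, which parses it and hands the proxy address to crawlers.

```python
import json

def make_report(vps_name, new_ip, port=3128):
    """VPS side: payload sent to the management server after redialing."""
    return json.dumps({"name": vps_name, "proxy": f"{new_ip}:{port}"})

def parse_report(raw):
    """Management-server side: extract the proxy string to hand to crawlers."""
    data = json.loads(raw)
    return data["name"], data["proxy"]
```

With DDNS the report step disappears: the dial script updates the DNS record instead, and the crawler always connects to the same domain name.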

  If you are interested in switching proxies via automatic ADSL dialing, see this post: 如何使用adsl自动拨号实现换代理(保姆级教程)? - 乐之之 - 博客园 (cnblogs.com)

  With a Python script, automatic ADSL redialing to change IPs is straightforward to implement. That is all on IP-based anti-crawling for now...

posted on 2023-03-16 22:33 by 乐之之