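Python: scraping with a random User-Agent and a random proxy IP (assuming you have already collected the IPs and verified that they work), and retrying with a different IP when one fails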

2021-03-27 06:26


# Add the following class to middlewares.py to send a random User-Agent
import random


class NovelUserAgentMiddleWare(object):  # random User-Agent
    def __init__(self):
        self.user_agent_list = [
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
            "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
            "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
            "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
            "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
            "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
            "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
            "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
            "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
            "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
            "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
        ]

    def process_request(self, request, spider):
        ua = random.choice(self.user_agent_list)
        print("User-Agent: " + ua)
        # Assign directly: at priority 544 this runs after Scrapy's built-in
        # UserAgentMiddleware (500), so setdefault() would keep that value.
        request.headers["User-Agent"] = ua

Then add the following to settings.py:

DOWNLOADER_MIDDLEWARES = {
    "ImagesRename.middlewares.NovelUserAgentMiddleWare": 544,  # random User-Agent
    "ImagesRename.middlewares.NovelProxyMiddleWare": 543,  # random proxy IP; replace ImagesRename with your own project name
}
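
The numbers in DOWNLOADER_MIDDLEWARES decide the order in which process_request is called: lower values run first. The sketch below sorts the two custom entries together with three of Scrapy's built-in defaults (UserAgentMiddleware at 500, RetryMiddleware at 550, HttpProxyMiddleware at 750) to show where they land. The helper name request_order is made up for illustration and is not part of Scrapy.

```python
# A subset of Scrapy's built-in defaults (DOWNLOADER_MIDDLEWARES_BASE)
builtin = {
    "scrapy.downloadermiddlewares.useragent.UserAgentMiddleware": 500,
    "scrapy.downloadermiddlewares.retry.RetryMiddleware": 550,
    "scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware": 750,
}
# The entries from settings.py above
custom = {
    "ImagesRename.middlewares.NovelUserAgentMiddleWare": 544,
    "ImagesRename.middlewares.NovelProxyMiddleWare": 543,
}


def request_order(*configs):
    """Hypothetical helper: class names in process_request call order."""
    merged = {}
    for config in configs:
        merged.update(config)
    ordered = sorted(merged.items(), key=lambda item: item[1])
    return [path.rsplit(".", 1)[-1] for path, _ in ordered]


for name in request_order(builtin, custom):
    print(name)
```

Both custom middlewares sit between the built-in UserAgentMiddleware and RetryMiddleware, and well before HttpProxyMiddleware, which is the component that reads request.meta["proxy"].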

Next, add the random-IP middleware. It is already registered in the settings above, so nothing more needs to be added there.

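# Also add this class to middlewares.py
class NovelProxyMiddleWare(object):  # random proxy IP

    def process_request(self, request, spider):
        proxy = self.get_random_proxy()
        print("Request proxy is {}".format(proxy))
        request.meta["proxy"] = "http://" + proxy

    def get_random_proxy(self):
        import random

        # Open the proxy list; IP.txt must exist in the working directory,
        # with one proxy address per line
        with open("IP.txt", "r", encoding="utf-8") as f:
            # Skip blank lines so a trailing newline never yields an empty proxy
            proxies = [line.strip() for line in f if line.strip()]
        return random.choice(proxies)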
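
The middleware assumes that IP.txt only contains proxies that were checked beforehand. The post does not show that check, so here is a minimal sketch using only the standard library. The function names and the test URL (http://httpbin.org/ip) are assumptions for illustration; any page you are allowed to fetch would do.

```python
import http.client
import urllib.request


def proxy_works(proxy, test_url="http://httpbin.org/ip", timeout=3.0):
    """Return True if test_url can be fetched through proxy ("host:port").

    test_url is an assumed placeholder, not something the original post uses.
    """
    handler = urllib.request.ProxyHandler({"http": "http://" + proxy})
    opener = urllib.request.build_opener(handler)
    try:
        with opener.open(test_url, timeout=timeout) as response:
            return response.status == 200
    except (OSError, http.client.HTTPException):
        # URLError, timeouts and connection errors are OSError subclasses;
        # malformed replies from a bad proxy raise HTTPException
        return False


def filter_proxies(candidates, check=proxy_works):
    """Keep only the non-empty candidates for which check() returns True."""
    cleaned = (line.strip() for line in candidates)
    return [proxy for proxy in cleaned if proxy and check(proxy)]
```

One way to use it would be to read the raw candidate list line by line, pass it through filter_proxies, and write the survivors to IP.txt before starting the crawl.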

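That completes the random User-Agent and random IP setup. The IPs I use are free ones, though, and may stop working, so failed requests need to be retried. Add the following to settings.py: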

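RETRY_ENABLED = True  # turn retrying on
RETRY_TIMES = 20  # number of retries; can be lower with better-quality IPs, or if you don't mind missing some pages
DOWNLOAD_TIMEOUT = 3  # download timeout in seconds
RETRY_HTTP_CODES = [429, 404, 403]  # HTTP status codes that trigger a retry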
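
To get a feel for what these values cost, note that RETRY_TIMES counts retries in addition to the first attempt. The sketch below computes a rough upper bound on the time spent on a single URL when every attempt times out. It ignores download delays and scheduling, and the function name is made up for illustration.

```python
def worst_case_seconds(retry_times, download_timeout):
    """Rough upper bound on time spent on one URL if every attempt times out."""
    attempts = retry_times + 1  # the first download plus each retry
    return attempts * download_timeout


print(worst_case_seconds(20, 3))  # 63
```

With the settings above, a URL that never responds can occupy a download slot for roughly a minute before Scrapy gives up on it.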

That's it. The crawler should now run smoothly.

 


Original article: https://www.cnblogs.com/aotumandaren/p/13663713.html

