Simulated Login and Middleware

I. Handling Cookies in Scrapy

Back when we covered requests, there were two main ways to handle cookies. Option one: pull the cookie straight out of the browser and paste it into the headers. Simple and crude. Option two: go through the normal login flow and let a session record the cookies along the way. So how do we handle cookies in Scrapy? With the same two options.

First, as always, let's pin down the target. It's our old friend: https://user.17k.com/ck/author/shelf?page=1&appKey=2406394919

This URL (the user's bookshelf) can only be accessed after logging in, so for this page we have no choice but to deal with cookies. First, create the project, generate the spider, and fill in the usual blanks.

import scrapy
from scrapy import Request, FormRequest


class LoginSpider(scrapy.Spider):
    name = 'login'
    allowed_domains = ['17k.com']
    start_urls = ['https://user.17k.com/ck/author/shelf?page=1&appKey=2406394919']

    def parse(self, response):
        print(response.text)

If you run this now, the response says the user is not logged in. Whichever option we pick, the cookie has to be in hand before we request the URL in start_urls. By default, Scrapy builds the initial requests for us automatically, so to assemble that first request ourselves we have to override start_requests() in our spider; that method is responsible for putting the initial requests together. Let's first look at how the original start_requests() works.

# the Scrapy source code, for reference

def start_requests(self):
    cls = self.__class__
    if not self.start_urls and hasattr(self, 'start_url'):
        raise AttributeError(
            "Crawling could not start: 'start_urls' not found "
            "or empty (but found 'start_url' attribute instead, "
            "did you miss an 's'?)")
    if method_is_overridden(cls, Spider, 'make_requests_from_url'):
        warnings.warn(
            "Spider.make_requests_from_url method is deprecated; it "
            "won't be called in future Scrapy releases. Please "
            "override Spider.start_requests method instead (see %s.%s)." % (
                cls.__module__, cls.__name__
            ),
        )
        for url in self.start_urls:
            yield self.make_requests_from_url(url)
    else:
        for url in self.start_urls:
            # the core is just this one line: build a Request object. We can do the same thing ourselves.
            yield Request(url, dont_filter=True)

Let's write our own start_requests() and see.

def start_requests(self):
    print("I am the root of all evil")
    yield Request(
        url=LoginSpider.start_urls[0],
        callback=self.parse
    )

Next, let's deal with the cookie.

1. Option one: copy the cookie straight from the browser

def start_requests(self):
        # copied straight from the browser
        cookies = "GUID=bbb5f65a-2fa2-40a0-ac87-49840eae4ad1; c_channel=0; c_csc=web; Hm_lvt_9793f42b498361373512340937deb2a0=1627572532,1627711457,1627898858,1628144975; accessToken=avatarUrl%3Dhttps%253A%252F%252Fcdn.static.17k.com%252Fuser%252Favatar%252F16%252F16%252F64%252F75836416.jpg-88x88%253Fv%253D1610625030000%26id%3D75836416%26nickname%3D%25E5%25AD%25A4%25E9%25AD%2582%25E9%2587%258E%25E9%25AC%25BCsb%26e%3D1643697376%26s%3D73f8877e452e744c; sensorsdata2015jssdkcross=%7B%22distinct_id%22%3A%2275836416%22%2C%22%24device_id%22%3A%2217700ba9c71257-035a42ce449776-326d7006-2073600-17700ba9c728de%22%2C%22props%22%3A%7B%22%24latest_traffic_source_type%22%3A%22%E7%9B%B4%E6%8E%A5%E6%B5%81%E9%87%8F%22%2C%22%24latest_referrer%22%3A%22%22%2C%22%24latest_referrer_host%22%3A%22%22%2C%22%24latest_search_keyword%22%3A%22%E6%9C%AA%E5%8F%96%E5%88%B0%E5%80%BC_%E7%9B%B4%E6%8E%A5%E6%89%93%E5%BC%80%22%7D%2C%22first_id%22%3A%22bbb5f65a-2fa2-40a0-ac87-49840eae4ad1%22%7D; Hm_lpvt_9793f42b498361373512340937deb2a0=1628145672"
        cookie_dic = {}
        for c in cookies.split("; "):
            k, v = c.split("=", 1)  # split on the first '=' only, in case the value itself contains one
            cookie_dic[k] = v

        yield Request(
            url=LoginSpider.start_urls[0],
            cookies=cookie_dic,
            callback=self.parse
        )

This approach is almost identical to what we did with requests. One thing to note: the cookie must be passed through the cookies parameter!

2. Option two: go through the login flow.

    def start_requests(self):
        # login flow
        username = "18614075987"
        password = "q6035945"
        url = "https://passport.17k.com/ck/user/login"

        # option 1: send the POST request with a raw string body (kept for reference)
        # yield Request(
        #     url=url,
        #     method="post",
        #     body="loginName=18614075987&password=q6035945",
        #     callback=self.parse
        # )

        # option 2: send the POST request as form data
        yield FormRequest(
            url=url,
            formdata={
                "loginName": username,
                "password": password
            },
            callback=self.parse
        )

    def parse(self, response):
        # the login response has arrived; Scrapy's cookie middleware now holds the session
        # cookie, so we can request the original start_urls directly
        yield Request(
            url=LoginSpider.start_urls[0],
            callback=self.parse_detail
        )

    def parse_detail(self, resp):
        print(resp.text)

Note that there are two ways to send a POST request:

  1. scrapy.Request(url=url, method='post', body=data)

  2. scrapy.FormRequest(url=url, formdata=data) -> recommended

Difference: with option 1 the body can only be a string, which is a pain to build by hand, so option 2 is the one to use.
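
If you do go with option 1, you still have to produce that string yourself. Here is a minimal sketch (not from the original) that uses urlencode to build the same login body as above and sets the form Content-Type header by hand:

from urllib.parse import urlencode

from scrapy import Request


def start_requests(self):
    # build the x-www-form-urlencoded body manually, then send it as a plain Request
    body = urlencode({"loginName": "xxxxx", "password": "xxxxx"})
    yield Request(
        url="https://passport.17k.com/ck/user/login",
        method="POST",
        body=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        callback=self.parse,
    )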

II. Scrapy Middleware

What middleware does: it sits between the engine and the spider, and between the engine and the downloader, handling the requests and responses that pass through. In practice it lets us pre-process every request and response and set things up for the steps that follow. Scrapy provides two kinds of middleware: downloader middleware and spider middleware.

1. DownloaderMiddleware

Downloader middleware sits between the engine and the downloader. After the engine gets a Request, it hands it to the downloader to fetch, and downloader middleware lets us hook into that hand-off. The execution flow:

  • engine has a request -> middleware1 (process_request) -> middleware2 (process_request) ... -> downloader
  • engine gets the response <- middleware1 (process_response) <- middleware2 (process_response) ... <- downloader

class MidDownloaderMiddleware1:

    def process_request(self, request, spider):
        print("process_request", "ware1")
        return None

    def process_response(self, request, response, spider):
        print("process_response", "ware1")
        return response

    def process_exception(self, request, exception, spider):
        print("process_exception", "ware1")
        pass


class MidDownloaderMiddleware2:

    def process_request(self, request, spider):
        print("process_request", "ware2")
        return None

    def process_response(self, request, response, spider):
        print("process_response", "ware2")
        return response

    def process_exception(self, request, exception, spider):
        print("process_exception", "ware2")
        pass

Enable the middlewares in settings:

DOWNLOADER_MIDDLEWARES = {
   # 'mid.middlewares.MidDownloaderMiddleware': 542,
   'mid.middlewares.MidDownloaderMiddleware1': 543,
   'mid.middlewares.MidDownloaderMiddleware2': 544,
}

Priorities work the same way as they do for pipelines; lower numbers sit closer to the engine.

Running it, the console shows each middleware's process_request firing in priority order (ware1, then ware2) and each process_response firing in reverse order (ware2, then ware1).

Next, let's talk about the return values of these methods (this is the tricky part):

  1. process_request(request, spider): called for every request before it reaches the downloader

    a. return None: do not intercept; the request keeps moving on to the next middleware (the one closer to the downloader) and finally to the downloader itself

    b. return a Request: the request is intercepted and the new request is sent back to the scheduler (through the engine); the remaining middlewares and the downloader never see the original request

    c. return a Response: the request is intercepted and the downloader never receives it, but the engine still gets a response; in other words, the response content is produced right inside this method

  2. process_response(request, response, spider): called for every response on its way back from the downloader

    a. return a Response: the response is passed on, through the engine, to the other components, or to the remaining process_response() methods for further handling

    b. return a Request: the response is intercepted and the returned request is fed back to the scheduler (through the engine); later process_response() methods never see this response
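
To make rule 1-c concrete, here is a minimal sketch (not from the original project): a middleware that answers marked requests by itself with a canned HtmlResponse, so the downloader never sees them. The use_fake_response meta key is made up for this example.

from scrapy.http import HtmlResponse


class ShortCircuitMiddleware:

    def process_request(self, request, spider):
        # only intercept requests that are explicitly marked; everything else passes through
        if request.meta.get("use_fake_response"):
            # returning a Response here means the remaining middlewares and the
            # downloader never see this request, yet the engine still gets a response
            return HtmlResponse(
                url=request.url,
                body="<html><body>canned page</body></html>",
                encoding="utf-8",
                request=request,
            )
        return None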

OK, that wraps up what middleware means. So what is it actually good for? Time for some examples!

1.1 Setting a random User-Agent dynamically

Setting one fixed UA is easy: just put it in settings.

USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.36'

But that is not good enough; I want a random UA on every request. The design: first, define a pool of User-Agent strings in settings (a handy source: http://useragentstring.com/pages/useragentstring.php?name=Chrome).

USER_AGENT_LIST = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36',
    'Mozilla/5.0 (X11; Ubuntu; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2919.83 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2866.71 Safari/537.36',
    'Mozilla/5.0 (X11; Ubuntu; Linux i686 on x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2820.59 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2762.73 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2656.18 Safari/537.36',
    'Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML like Gecko) Chrome/44.0.2403.155 Safari/537.36',
    'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2227.1 Safari/537.36',
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2227.0 Safari/537.36',
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2227.0 Safari/537.36',
    'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2226.0 Safari/537.36',
    'Mozilla/5.0 (Windows NT 6.4; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2225.0 Safari/537.36',
    'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2225.0 Safari/537.36',
    'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2224.3 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/40.0.2214.93 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.124 Safari/537.36',
    'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2049.0 Safari/537.36',
    'Mozilla/5.0 (Windows NT 4.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2049.0 Safari/537.36',
]

The middleware:

from random import choice

from mid.settings import USER_AGENT_LIST  # the pool defined in settings.py above


class MyRandomUserAgentMiddleware:

    def process_request(self, request, spider):
        # pick a random UA from the pool and stamp it onto the request
        UA = choice(USER_AGENT_LIST)
        request.headers['User-Agent'] = UA
        # return nothing (None) so the request keeps moving towards the downloader

    def process_response(self, request, response, spider):
        return response

    def process_exception(self, request, exception, spider):
        pass
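
As with the earlier middlewares, this one only takes effect once it is enabled in settings. A minimal sketch, assuming the same 'mid' project layout as above; the priority is just an example:

DOWNLOADER_MIDDLEWARES = {
    'mid.middlewares.MyRandomUserAgentMiddleware': 545,
}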

1.2 Handling proxies

Proxies are a perpetual headache for a crawler engineer. Skip them and you are easy to detect; use them and throughput drops; usable free IPs are vanishingly rare. There is no way around it, so we have to face it. Here we'll use two approaches to show the logic of adding proxies in Scrapy.

  1. Free proxies

    from random import choice

    # PROXY_LIST is a list of "ip:port" strings; import it from your project's
    # settings.py (see the sketch right after this list)

    class ProxyMiddleware:

        def process_request(self, request, spider):
            print("here we go again")
            proxy = choice(PROXY_LIST)
            request.meta['proxy'] = "https://" + proxy  # set the proxy for this request
            return None

        def process_response(self, request, response, spider):
            print("did we get anything back???")
            if response.status != 200:
                print("this attempt failed")
                request.dont_filter = True  # send it back to the scheduler and try again
                return request
            return response

        def process_exception(self, request, exception, spider):
            print("something went wrong!")
            pass
    
  2. Paid proxies

    Free proxies are genuinely painful to use, so here we go straight to a paid one. I am sticking with Kuaidaili; swap in whichever provider you prefer.

    from w3lib.http import basic_auth_header

    class MoneyProxyMiddleware:
        def _get_proxy(self):
            """
            Tunnel proxy order details (copied from the provider's dashboard):
            endpoint tps138.kdlapi.com:15818, the exit IP changes on every request
            tunnel username: t12831993520578   password: t72a13xu
            :return: (proxy url, Proxy-Authorization header value)
            """
            url = "http://tps138.kdlapi.com:15818"
            auth = basic_auth_header(username="t12831993520578", password="t72a13xu")
            return url, auth

        def process_request(self, request, spider):
            print("......")
            url, auth = self._get_proxy()
            request.meta['proxy'] = url
            request.headers['Proxy-Authorization'] = auth
            request.headers['Connection'] = 'close'
            return None

        def process_response(self, request, response, spider):
            print(response.status, type(response.status))
            if response.status != 200:
                request.dont_filter = True
                return request
            return response

        def process_exception(self, request, exception, spider):
            pass
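
Neither proxy middleware does anything until it is enabled in settings, and the free-proxy version also needs the PROXY_LIST pool defined there. A minimal sketch; the module path, the priority, and the addresses are placeholders, not values from the original project:

# settings.py
PROXY_LIST = [
    "203.0.113.10:3128",   # placeholder addresses - fill in real proxies
    "203.0.113.11:8080",
]

DOWNLOADER_MIDDLEWARES = {
    # enable ONE of the two proxy middlewares; replace 'yourproject' with your package name
    'yourproject.middlewares.ProxyMiddleware': 543,
    # 'yourproject.middlewares.MoneyProxyMiddleware': 543,
}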
    

1.3 Scraping data with Selenium

First, we want Selenium to act as the downloader, which means the requests that should go through it need to be specially marked. So in this design I define a new request type called SeleniumRequest.

from scrapy.http.request import Request

class SeleniumRequest(Request):
    pass

Nothing needs to go inside it; it inherits everything from its parent class and only serves to mark which requests should be handled by Selenium.

Next, flesh out the spider:

import scrapy
from boss.request import SeleniumRequest

class BeijingSpider(scrapy.Spider):
    name = 'beijing'
    allowed_domains = ['zhipin.com']
    start_urls = ['https://www.zhipin.com/job_detail/?query=python&city=101010100&industry=&position=']

    def start_requests(self):
        yield SeleniumRequest(
            url=BeijingSpider.start_urls[0],
            callback=self.parse,
        )

    def parse(self, resp, **kwargs):
        li_list = resp.xpath('//*[@id="main"]/div/div[3]/ul/li')
        for li in li_list:
            href = li.xpath("./div[1]/div[1]/div[1]/div[1]/div[1]/span[1]/a[1]/@href").extract_first()
            name = li.xpath("./div[1]/div[1]/div[1]/div[1]/div[1]/span[1]/a[1]/text()").extract_first()

            print(name, href)
            print(resp.urljoin(href))
            yield SeleniumRequest(
                url=resp.urljoin(href),
                callback=self.parse_detail,
            )
        # next page .....

    def parse_detail(self, resp, **kwargs):
        print("recruiter", resp.xpath('//*[@id="main"]/div[3]/div/div[2]/div[1]/h2').extract())


The middleware:

import time

from scrapy import signals
from scrapy.http import HtmlResponse
from selenium.webdriver import Chrome

from boss.request import SeleniumRequest


class BossDownloaderMiddleware:

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        # This part matters:
        # run spider_opened when the spider starts
        # run spider_closed when the spider finishes
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        crawler.signals.connect(s.spider_closed, signal=signals.spider_closed)
        return s

    def process_request(self, request, spider):
        if isinstance(request, SeleniumRequest):
            # let the browser do the downloading, then hand back a ready-made response
            self.web.get(request.url)
            time.sleep(3)
            page_source = self.web.page_source
            return HtmlResponse(url=request.url, encoding='utf-8', request=request, body=page_source)

    def process_response(self, request, response, spider):
        return response

    def process_exception(self, request, exception, spider):
        pass

    def spider_opened(self, spider):
        self.web = Chrome()
        self.web.implicitly_wait(10)
        # this is also where you could log in and grab cookies - easy...
        print("browser created")

    def spider_closed(self, spider):
        self.web.close()
        print("browser closed")

settings

DOWNLOADER_MIDDLEWARES = {
    # put it in front of all the default middlewares: once a SeleniumRequest is answered here,
    # none of the later middlewares (or the downloader) ever see it
    'boss.middlewares.BossDownloaderMiddleware': 99,
}

1.4 Setting cookies with Selenium

With that example in hand, handling cookies with Selenium is easy too: do the login in spider_opened, then attach the cookies to each request in process_request().

import json
import time

import requests
from scrapy import signals
from selenium.webdriver import Chrome


class ChaojiyingDownloaderMiddleware:

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        # attach the login cookies to every request that doesn't already carry its own
        if not request.cookies:
            request.cookies = self.cookie
        return None

    def process_response(self, request, response, spider):
        return response

    def process_exception(self, request, exception, spider):
        pass

    def spider_opened(self, spider):
        # log in once with Selenium when the spider starts, then keep the cookies around
        web = Chrome()
        web.get("https://www.chaojiying.com/user/login/")
        web.find_element_by_xpath('/html/body/div[3]/div/div[3]/div[1]/form/p[1]/input').send_keys("18614075987")
        web.find_element_by_xpath('/html/body/div[3]/div/div[3]/div[1]/form/p[2]/input').send_keys('q6035945')
        img = web.find_element_by_xpath('/html/body/div[3]/div/div[3]/div[1]/form/div/img')
        # crack the captcha through the ttshitu OCR API below
        verify_code = self.base64_api("q6035945", "q6035945", img.screenshot_as_base64, 3)

        web.find_element_by_xpath('/html/body/div[3]/div/div[3]/div[1]/form/p[3]/input').send_keys(verify_code)

        web.find_element_by_xpath('/html/body/div[3]/div/div[3]/div[1]/form/p[4]/input').click()
        time.sleep(3)
        cookies = web.get_cookies()
        self.cookie = {dic['name']: dic['value'] for dic in cookies}
        web.close()

    def base64_api(self, uname, pwd, b64_img, typeid):
        # ttshitu captcha-recognition API
        data = {
            "username": uname,
            "password": pwd,
            "typeid": typeid,
            "image": b64_img
        }
        result = json.loads(requests.post("http://api.ttshitu.com/predict", json=data).text)
        if result['success']:
            return result["data"]["result"]
        else:
            return result["message"]
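
Like the other downloader middlewares, this one has to be enabled in settings. The module path below is a placeholder for whatever the project is called; the priority just needs to stay below 700 so the built-in CookiesMiddleware (which sits at 700 by default) still picks up the cookies we attach:

DOWNLOADER_MIDDLEWARES = {
    # replace 'yourproject' with the actual project package
    'yourproject.middlewares.ChaojiyingDownloaderMiddleware': 543,
}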

2. SpiderMiddleware (for reference)

Spider middleware sits between the engine and the spider. Its commonly used methods are:

from scrapy import signals

from cuowu.items import ErrorItem


class CuowuSpiderMiddleware:
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the spider middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_spider_input(self, response, spider):
        # called when a response comes back and is about to enter the spider
        # must either return None or raise an exception
        print("process_spider_input")
        return None

    def process_spider_output(self, response, result, spider):
        # called after the spider has processed the response and yielded its results
        # must yield items or requests
        print("process_spider_output (start)")
        for i in result:
            yield i
        print("process_spider_output (end)")

    def process_spider_exception(self, response, exception, spider):
        print("process_spider_exception")
        # called when the spider, or process_spider_input(), raises an exception
        # may yield None, Requests or items
        it = ErrorItem()
        it['name'] = "exception"
        it['url'] = response.url
        yield it

    def process_start_requests(self, start_requests, spider):
        print("process_start_requests")
        # called once with the start requests when the spider is launched

        # Must return only requests (not items).
        for r in start_requests:
            yield r

    def spider_opened(self, spider):
        pass

items

import scrapy


class ErrorItem(scrapy.Item):
    name = scrapy.Field()
    url = scrapy.Field()
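
The spider below also yields a CuowuItem, which the items.py excerpt above doesn't show; here is a minimal sketch with just the one field the spider uses:

class CuowuItem(scrapy.Item):
    name = scrapy.Field()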

spider:

import scrapy

from cuowu.items import CuowuItem


class BaocuoSpider(scrapy.Spider):
    name = 'baocuo'
    allowed_domains = ['baidu.com']
    start_urls = ['http://www.baidu.com/']

    def parse(self, resp, **kwargs):
        name = resp.xpath('//title/text()').extract_first()
        # print(1/0)  # uncomment this to raise an exception and watch process_spider_exception fire
        it = CuowuItem()
        it['name'] = name
        print(name)
        yield it

pipeline:

from cuowu.items import ErrorItem

class CuowuPipeline:
    def process_item(self, item, spider):
        if isinstance(item, ErrorItem):
            print("error item:", item)
        else:
            print("normal item:", item)
        return item
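
The pipeline only receives items once it is enabled in settings; 300 here is just the usual scaffold default:

ITEM_PIPELINES = {
    'cuowu.pipelines.CuowuPipeline': 300,
}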

Project layout:

cuowu
├── cuowu
│   ├── __init__.py
│   ├── items.py
│   ├── middlewares.py
│   ├── pipelines.py
│   ├── settings.py
│   └── spiders
│       ├── __init__.py
│       └── baocuo.py
└── scrapy.cfg
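
Finally, remember that a spider middleware is enabled through SPIDER_MIDDLEWARES, not DOWNLOADER_MIDDLEWARES; the priority below is only an example:

SPIDER_MIDDLEWARES = {
    'cuowu.middlewares.CuowuSpiderMiddleware': 543,
}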