
WeChat official account: uncle39py
2022/10/21
18. Crawlers: Scrapy Extensions
A Scrapy extension is just an application built on Scrapy signals: once you understand signals, you understand extensions.
The example below is taken from the official documentation. The code does three things:
- logs a message when the spider opens,
- logs a message when the spider closes,
- logs a running count every 1000 items scraped.
import logging

from scrapy import signals
from scrapy.exceptions import NotConfigured

logger = logging.getLogger(__name__)


class SpiderOpenCloseLogging:
    def __init__(self, item_count):
        self.item_count = item_count
        self.items_scraped = 0

    @classmethod
    def from_crawler(cls, crawler):
        # first check if the extension should be enabled and raise
        # NotConfigured otherwise
        if not crawler.settings.getbool('MYEXT_ENABLED'):
            raise NotConfigured

        # get the number of items from settings
        item_count = crawler.settings.getint('MYEXT_ITEMCOUNT', 1000)

        # instantiate the extension object
        ext = cls(item_count)

        # connect the extension object to signals
        crawler.signals.connect(ext.spider_opened, signal=signals.spider_opened)
        crawler.signals.connect(ext.spider_closed, signal=signals.spider_closed)
        crawler.signals.connect(ext.item_scraped, signal=signals.item_scraped)

        # return the extension object
        return ext

    def spider_opened(self, spider):
        logger.info("opened spider %s", spider.name)

    def spider_closed(self, spider):
        logger.info("closed spider %s", spider.name)

    def item_scraped(self, item, spider):
        self.items_scraped += 1
        if self.items_scraped % self.item_count == 0:
            logger.info("scraped %d items", self.items_scraped)
First, an extension is a class: you write your own extension class.
Second, the extension's entry point is the class method from_crawler(cls, crawler).
The most important job of this entry point is to bind each signal to the method it should trigger:
crawler.signals.connect(ext.spider_opened, signal=signals.spider_opened)
crawler.signals.connect(ext.spider_closed, signal=signals.spider_closed)
crawler.signals.connect(ext.item_scraped, signal=signals.item_scraped)
These three lines specify exactly which method runs when each signal fires.
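Signals are not limited to the built-in ones: a Scrapy signal is just a unique sentinel object used as a dispatch key, so you can define and fire your own. Below is a minimal sketch; the names item_milestone and MilestoneExtension are made up for illustration, while connect and send_catch_log are the real SignalManager methods:

from scrapy import signals

# a custom signal is nothing more than a unique object used as a key
item_milestone = object()


class MilestoneExtension:
    def __init__(self, crawler):
        self.crawler = crawler
        self.items = 0

    @classmethod
    def from_crawler(cls, crawler):
        ext = cls(crawler)
        # bind our own signal exactly like a built-in one
        crawler.signals.connect(ext.on_milestone, signal=item_milestone)
        crawler.signals.connect(ext.item_scraped, signal=signals.item_scraped)
        return ext

    def item_scraped(self, item, spider):
        self.items += 1
        if self.items % 100 == 0:
            # fire the custom signal; send_catch_log delivers it to every
            # connected receiver and logs (rather than raises) their errors
            self.crawler.signals.send_catch_log(
                signal=item_milestone, count=self.items, spider=spider
            )

    def on_milestone(self, count, spider):
        spider.logger.info("milestone reached: %d items", count)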
Third, know how to define your own parameters in the settings and how to read them inside the entry point, e.g. crawler.settings.getint('YOUR_SETTING'); if a required setting is missing, raise NotConfigured so that Scrapy disables the extension.
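As a concrete illustration of this step, here is a minimal sketch of an entry point that reads several typed settings (MYEXT_HOSTS is a hypothetical list-valued setting, included only to show getlist):

from scrapy.exceptions import NotConfigured


class MyExtension:
    def __init__(self, item_count, hosts):
        self.item_count = item_count
        self.hosts = hosts

    @classmethod
    def from_crawler(cls, crawler):
        s = crawler.settings
        # a missing or false flag disables the extension cleanly
        if not s.getbool('MYEXT_ENABLED'):
            raise NotConfigured
        # typed getters also convert string values passed via -s on the command line
        item_count = s.getint('MYEXT_ITEMCOUNT', 1000)
        hosts = s.getlist('MYEXT_HOSTS', [])  # hypothetical list-valued setting
        return cls(item_count, hosts)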
Fourth, register the extension in the settings:

EXTENSIONS = {
    'scrapy.extensions.corestats.CoreStats': 1,  # the number sets the loading order
}
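To wire up the example extension from this article, the settings would look like the following sketch (the dotted path assumes the class lives in myproject/extensions.py; adjust it to your project layout):

# settings.py
EXTENSIONS = {
    'myproject.extensions.SpiderOpenCloseLogging': 500,
}

MYEXT_ENABLED = True    # required: from_crawler raises NotConfigured without it
MYEXT_ITEMCOUNT = 100   # optional: log every 100 items instead of the default 1000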
Fifth, Scrapy ships with a number of built-in extensions (core stats, log stats, the telnet console, memory usage, and so on); they are well worth studying as reference implementations.
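A related trick: the built-ins are registered with default orders in EXTENSIONS_BASE, and any of them can be switched off from your project settings by mapping its path to None, for example:

EXTENSIONS = {
    'scrapy.extensions.telnet.TelnetConsole': None,  # disable the telnet console
}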

The accompanying video tutorial is here: https://mp.weixin.qq.com/s/DG-T965Y9yKLwNYPus2cXA