
scrapy_redis distributed crawler


Creating and starting a scrapy_redis distributed crawler

Distributed crawler writing process:
 1. Write an ordinary crawler
    1. Create the project
    2. Define the crawl targets
    3. Create the spider
    4. Save the content
 2. Transform it into a distributed crawler
    1. Modify the spider
       1. Import the distributed spider class from scrapy-redis
       2. Inherit from that class
       3. Remove start_urls & allowed_domains
       4. Set redis_key to supply the start_urls
       5. Add an __init__ that obtains the allowed domain names
    2. Modify the configuration file (settings): copy the configuration parameters below

# Set up scrapy-redis
#1. Enable the scheduler that stores requests in redis
#from scrapy_redis.scheduler import Scheduler
SCHEDULER = "scrapy_redis.scheduler.Scheduler"

#2. Ensure all spiders share the same duplicate filter through redis
# from scrapy_redis.dupefilter import RFPDupeFilter
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

#3. Keep the redis dedup set and request queue when the crawler finishes (optional)
SCHEDULER_PERSIST = True

#4. Specify the host and port used to connect to the Redis database
REDIS_HOST = "localhost"
REDIS_PORT = 6379

#5. Pipelines, including the one that stores data in the Redis database:
ITEM_PIPELINES = {
    'jd.pipelines.JdPipeline': 300,
    # 'jd.pipelines.JDSqlPipeline': 300,
    # When this pipeline is enabled, items are stored in the Redis database
    'scrapy_redis.pipelines.RedisPipeline': 400,
}
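A side note: if the Redis instance is remote or password-protected, scrapy-redis also accepts a single REDIS_URL setting in place of REDIS_HOST/REDIS_PORT; the URL below is only a placeholder:

# Optional alternative to REDIS_HOST / REDIS_PORT (placeholder credentials)
# REDIS_URL = "redis://:mypassword@localhost:6379/0"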

(1) Write an ordinary crawler:

We take crawling JD book information as an example! Click the link below to jump to the post with a detailed walkthrough:
Crawl JD books information!
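For readers who do not open the link, here is a minimal sketch of what the ordinary spider's skeleton looks like before the transformation (the parse logic is omitted; names such as BookSpider and book_test match the distributed version shown further below):

# -*- coding: utf-8 -*-
import scrapy


class BookSpider(scrapy.Spider):
    name = 'book_test'
    allowed_domains = ['book.jd.com']
    start_urls = ['https://book.jd.com/booksort.html']

    def parse(self, response):
        # Extraction of the book categories and items is covered in the linked post.
        pass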

(2) Transform it into a distributed crawler:

Usage scenarios:
 1. The amount of data is especially huge
 2. The data is needed within a tight timeframe

Distributed implementation:
 scrapy-redis implements the distribution

The transformation takes only 5 steps in the spider file, plus changes to settings.py!
First: the 5 steps in the spider file!
# -*- coding: utf-8 -*-
import json
from selenium import webdriver
import scrapy

from ..items import JdItem

# 1. Import the distributed spider class
from scrapy_redis.spiders import RedisSpider

# 2. Inherit from the distributed spider class
# class BookSpider(scrapy.Spider):
class BookSpider(RedisSpider):
    name = 'book_test'
    # 3. Remove start_urls and allowed_domains
    # allowed_domains = ['book.jd.com']
    # start_urls = ['https://book.jd.com/booksort.html']
    # 4. Set redis_key (the list that supplies the start urls)
    redis_key = 'jd'

    # 5. Define __init__ to obtain the allowed domain names
    def __init__(self, *args, **kwargs):
        domain = kwargs.pop('domain', '')
        self.allowed_domains = list(filter(None, domain.split(',')))
        super().__init__(*args, **kwargs)

        # Instantiate one browser object (only once) and turn on headless mode
        options = webdriver.ChromeOptions()
        options.add_argument("--headless")
        options.add_argument("--disable-gpu")
        self.driver = webdriver.Chrome(options=options, executable_path=r"C:\my\Chrome_guge\chromedriver.exe")
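Note that __init__ above starts a headless Chrome that is never shut down. One possible cleanup, assuming Scrapy's closed() hook, would be to add a method like this to BookSpider:

    def closed(self, reason):
        # Quit the headless browser when the spider finishes (cleanup sketch, not in the original spider).
        self.driver.quit()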
Second: modify the settings.py configuration file!
# Set up scrapy-redis
#1. Enable the scheduler that stores requests in redis
#from scrapy_redis.scheduler import Scheduler
SCHEDULER = "scrapy_redis.scheduler.Scheduler"

#2. Ensure all spiders share the same duplicate filter through redis
# from scrapy_redis.dupefilter import RFPDupeFilter
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

#3. Keep the redis dedup set and request queue when the crawler finishes (optional)
SCHEDULER_PERSIST = True

#4. Specify the host and port used to connect to the Redis database
REDIS_HOST = "localhost"
REDIS_PORT = 6379

#5. Pipelines, including the one that stores data in the Redis database:
ITEM_PIPELINES = {
    'jd.pipelines.JdPipeline': 300,
    # 'jd.pipelines.JDSqlPipeline': 300,
    # When this pipeline is enabled, items are stored in the Redis database
    'scrapy_redis.pipelines.RedisPipeline': 400,
}
The ordinary scrapy spider has now been transformed into a distributed crawler. Copy this project so you have an identical second copy, and run each one with the command (scrapy crawl <spider name>); you will find that both project terminals sit waiting. Now start Redis and push the start URL to them (lpush jd <start url>), and both crawler projects run normally. Finally, you will find that the data crawled by the two projects together adds up to the full data set the crawler was asked for!
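For example, the start URL can be pushed into the jd key from redis-cli (lpush jd <start url>) or from Python with the redis client; the sketch below assumes a local Redis on the default port and the booksort URL used above:

import redis

# Connect to the same Redis instance configured in settings.py (local, default port assumed).
r = redis.Redis(host="localhost", port=6379)

# Push the start URL into the list named by the spider's redis_key; every waiting
# crawler instance then pulls requests from the shared queue.
r.lpush("jd", "https://book.jd.com/booksort.html")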