
scrapy_redis distributed crawler


Creating and starting a scrapy_redis distributed crawler

Distributed crawler writing process:
 1. Write an ordinary crawler
    1. Create the project
    2. Define the crawl targets
    3. Create the spider
    4. Save the content
 2. Transform it into a distributed crawler
    1. Modify the spider
       1. Import the distributed spider class from scrapy-redis
       2. Inherit from that class
       3. Remove start_urls & allowed_domains
       4. Set redis_key to supply the start_urls
       5. Add an __init__ that obtains the allowed domain names
    2. Modify the configuration file (settings): copy the configuration parameters below

# Set up scrapy-redis
#1. Enable the scheduler that stores requests in redis
#from scrapy_redis.scheduler import Scheduler
SCHEDULER = "scrapy_redis.scheduler.Scheduler"

#2. Ensure all spiders share the same duplicate filter through redis
# from scrapy_redis.dupefilter import RFPDupeFilter
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

#3. Keep the redis dedup set and request queue when the crawler finishes (optional)
SCHEDULER_PERSIST = True

#4. Specify the host and port used to connect to the Redis database
REDIS_HOST = "localhost"
REDIS_PORT = 6379

#5. Pipelines, including the one that stores data in the Redis database:
ITEM_PIPELINES = {
    'jd.pipelines.JdPipeline': 300,
    # 'jd.pipelines.JDSqlPipeline': 300,
    # When this pipeline is enabled, items are stored in the Redis database
    'scrapy_redis.pipelines.RedisPipeline': 400,
}
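A side note: if the Redis instance is remote or password-protected, scrapy-redis also accepts a single REDIS_URL setting in place of REDIS_HOST/REDIS_PORT; the URL below is only a placeholder:

# Optional alternative to REDIS_HOST / REDIS_PORT (placeholder credentials)
# REDIS_URL = "redis://:mypassword@localhost:6379/0"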

(1) Write an ordinary crawler:

We take crawling JD book information as an example! Click the link below to jump to the post with a detailed walkthrough:
Crawl JD books information!
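For readers who do not open the link, here is a minimal sketch of what the ordinary spider's skeleton looks like before the transformation (the parse logic is omitted; names such as BookSpider and book_test match the distributed version shown further below):

# -*- coding: utf-8 -*-
import scrapy


class BookSpider(scrapy.Spider):
    name = 'book_test'
    allowed_domains = ['book.jd.com']
    start_urls = ['https://book.jd.com/booksort.html']

    def parse(self, response):
        # Extraction of the book categories and items is covered in the linked post.
        pass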

(2) Transform it into a distributed crawler:

Usage scenarios:
 1. The amount of data is especially huge
 2. The data is needed within a tight timeframe

Distributed implementation:
 scrapy-redis implements the distribution

The transformation takes only 5 steps in the spider file, plus changes to settings.py!
First: the 5 steps in the spider file!
# -*- coding: utf-8 -*-
import json
from selenium import webdriver
import scrapy

from ..items import JdItem

# 1. Import the distributed spider class
from scrapy_redis.spiders import RedisSpider

# 2. Inherit from the distributed spider class
# class BookSpider(scrapy.Spider):
class BookSpider(RedisSpider):
    name = 'book_test'
    # 3. Remove start_urls and allowed_domains
    # allowed_domains = ['book.jd.com']
    # start_urls = ['https://book.jd.com/booksort.html']
    # 4. Set redis_key (the list that supplies the start urls)
    redis_key = 'jd'

    # 5. Define __init__ to obtain the allowed domain names
    def __init__(self, *args, **kwargs):
        domain = kwargs.pop('domain', '')
        self.allowed_domains = list(filter(None, domain.split(',')))
        super().__init__(*args, **kwargs)

        # Instantiate one browser object (only once) and turn on headless mode
        options = webdriver.ChromeOptions()
        options.add_argument("--headless")
        options.add_argument("--disable-gpu")
        self.driver = webdriver.Chrome(options=options, executable_path=r"C:\my\Chrome_guge\chromedriver.exe")
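Note that __init__ above starts a headless Chrome that is never shut down. One possible cleanup, assuming Scrapy's closed() hook, would be to add a method like this to BookSpider:

    def closed(self, reason):
        # Quit the headless browser when the spider finishes (cleanup sketch, not in the original spider).
        self.driver.quit()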
Second: modify the settings.py configuration file!
# Set up scrapy-redis
#1. Enable the scheduler that stores requests in redis
#from scrapy_redis.scheduler import Scheduler
SCHEDULER = "scrapy_redis.scheduler.Scheduler"

#2. Ensure all spiders share the same duplicate filter through redis
# from scrapy_redis.dupefilter import RFPDupeFilter
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

#3. Keep the redis dedup set and request queue when the crawler finishes (optional)
SCHEDULER_PERSIST = True

#4. Specify the host and port used to connect to the Redis database
REDIS_HOST = "localhost"
REDIS_PORT = 6379

#5. Pipelines, including the one that stores data in the Redis database:
ITEM_PIPELINES = {
    'jd.pipelines.JdPipeline': 300,
    # 'jd.pipelines.JDSqlPipeline': 300,
    # When this pipeline is enabled, items are stored in the Redis database
    'scrapy_redis.pipelines.RedisPipeline': 400,
}
The ordinary scrapy spider has now been transformed into a distributed crawler. Copy this project so you have an identical second copy, and run each one with the command (scrapy crawl <spider name>); you will find that both project terminals sit waiting. Now start Redis and push the start URL to them (lpush jd <start url>), and both crawler projects run normally. Finally, you will find that the data crawled by the two projects together adds up to the full data set the crawler was asked for!
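For example, the start URL can be pushed into the jd key from redis-cli (lpush jd <start url>) or from Python with the redis client; the sketch below assumes a local Redis on the default port and the booksort URL used above:

import redis

# Connect to the same Redis instance configured in settings.py (local, default port assumed).
r = redis.Redis(host="localhost", port=6379)

# Push the start URL into the list named by the spider's redis_key; every waiting
# crawler instance then pulls requests from the shared queue.
r.lpush("jd", "https://book.jd.com/booksort.html")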