scrapy_redis distributed crawler
2022-06-21 13:05:00 【InfoQ】
Creating and starting a scrapy_redis distributed crawler
Workflow for writing a distributed crawler:
1. Write an ordinary crawler
    1. Create the project
    2. Define the target
    3. Create the spider
    4. Save the content
2. Transform it into a distributed crawler
    1. Modify the spider
        1. Import the distributed spider class from scrapy-redis
        2. Inherit from that class
        3. Remove start_urls & allowed_domains
        4. Set redis_key to obtain start_urls
        5. Define __init__ to obtain the allowed domains
    2. Modify the configuration file (settings.py)
Copy the following configuration parameters:
# scrapy-redis settings
# 1. Enable the scheduler that stores requests in redis
# from scrapy_redis.scheduler import Scheduler
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
# 2. Ensure all spiders share the same duplicate filter through redis
# from scrapy_redis.dupefilter import RFPDupeFilter
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
# 3. Whether to keep the de-duplication set and task queue in the redis database
#    when the crawler finishes (optional)
SCHEDULER_PERSIST = True
# 4. Host and port used to connect to the redis database
REDIS_HOST = "localhost"
REDIS_PORT = 6379
# 5. Item pipelines, including the one that stores the data in the redis database:
ITEM_PIPELINES = {
    'jd.pipelines.JdPipeline': 300,
    # 'jd.pipelines.JDSqlPipeline': 300,
    # When enabled, this pipeline stores the items in the redis database
    'scrapy_redis.pipelines.RedisPipeline': 400,
}
(1) Write the ordinary crawler:
(2) Transform it into a distributed crawler:
Usage scenarios:
1. The amount of data is extremely large
2. The data is needed urgently (the time requirements are tight)
Implementing distribution
Implementing a distributed crawler with scrapy-redis
First: the spider file, in five steps!
# -*- coding: utf-8 -*-
import json
from selenium import webdriver
import scrapy
from ..items import JdItem
# 1. Import the distributed spider class
from scrapy_redis.spiders import RedisSpider


# 2. Inherit from the distributed spider class
# class BookSpider(scrapy.Spider):
class BookSpider(RedisSpider):
    name = 'book_test'
    # 3. Remove start_urls and allowed_domains
    # allowed_domains = ['book.jd.com']
    # start_urls = ['https://book.jd.com/booksort.html']
    # 4. Set redis_key
    redis_key = 'jd'

    # 5. Define __init__
    def __init__(self, *args, **kwargs):
        domain = kwargs.pop('domain', '')
        self.allowed_domains = list(filter(None, domain.split(',')))
        super().__init__(*args, **kwargs)
        # Instantiate one browser object (only once) in headless mode
        options = webdriver.ChromeOptions()
        options.add_argument("--headless")
        options.add_argument("--disable-gpu")
        self.driver = webdriver.Chrome(options=options,
                                       executable_path=r"C:\my\Chrome_guge\chromedriver.exe")
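The listing above stops at __init__, but a RedisSpider still needs a parse callback, and since a Selenium driver is opened in __init__ it should also be released when the spider closes. The methods below are only a hedged sketch of what could be added inside BookSpider: the XPath selectors and the JdItem fields ('name', 'price') are assumptions for illustration, not the original project's code.
    # Hypothetical callbacks to add inside BookSpider; the selectors and
    # item fields are placeholders, not taken from the original project.
    def parse(self, response):
        # Assumed category-page structure: follow each book link
        for href in response.xpath('//div[@class="book-list"]//a/@href').getall():
            yield response.follow(href, callback=self.parse_book)

    def parse_book(self, response):
        # JdItem is assumed to define 'name' and 'price' fields
        item = JdItem()
        item['name'] = response.xpath('//div[@class="sku-name"]/text()').get()
        item['price'] = response.xpath('//span[@class="price"]/text()').get()
        yield item

    def closed(self, reason):
        # Called by Scrapy when the spider closes; release the browser
        self.driver.quit()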
Second: modify the settings.py configuration file!
# scrapy-redis settings
# 1. Enable the scheduler that stores requests in redis
# from scrapy_redis.scheduler import Scheduler
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
# 2. Ensure all spiders share the same duplicate filter through redis
# from scrapy_redis.dupefilter import RFPDupeFilter
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
# 3. Whether to keep the de-duplication set and task queue in the redis database
#    when the crawler finishes (optional)
SCHEDULER_PERSIST = True
# 4. Host and port used to connect to the redis database
REDIS_HOST = "localhost"
REDIS_PORT = 6379
# 5. Item pipelines, including the one that stores the data in the redis database:
ITEM_PIPELINES = {
    'jd.pipelines.JdPipeline': 300,
    # 'jd.pipelines.JDSqlPipeline': 300,
    # When enabled, this pipeline stores the items in the redis database
    'scrapy_redis.pipelines.RedisPipeline': 400,
}
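For reference, jd.pipelines.JdPipeline above is the project's own pipeline and its code is not shown in this article. A minimal hypothetical version, assuming it simply appends every crawled item to a JSON Lines file, could look like this:
import json

class JdPipeline:
    # Hypothetical stand-in for jd.pipelines.JdPipeline:
    # append each crawled item to a JSON Lines file.
    def open_spider(self, spider):
        self.file = open('books.jl', 'a', encoding='utf-8')

    def process_item(self, item, spider):
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + '\n')
        return item

    def close_spider(self, spider):
        self.file.close()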
Once the ordinary Scrapy spider has been transformed into a distributed crawler, copy the project so that you have two identical copies and run each one with the scrapy crawl command (scrapy crawl book_test). You will find that both terminals just sit there waiting. Now start Redis and push the start URL to the key (lpush jd https://book.jd.com/booksort.html), and both crawler projects begin to run normally. In the end, the data crawled by the two projects adds up to the full data set the crawler was meant to collect.
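Instead of the redis-cli lpush command, the start URL can also be pushed from Python with the redis client. The snippet below is a small sketch; the queue key names used for inspection ('book_test:requests', 'book_test:dupefilter') are assumed scrapy_redis defaults for this spider.
import redis

# Connect to the Redis instance configured in settings.py (REDIS_HOST / REDIS_PORT)
r = redis.Redis(host='localhost', port=6379)

# Push the start URL onto the key declared as redis_key = 'jd' in the spider;
# every waiting crawler process will start pulling requests from it.
r.lpush('jd', 'https://book.jd.com/booksort.html')

# Optionally inspect the queues scrapy_redis keeps for this spider
# (key names assume the library defaults for spider 'book_test')
print(r.zcard('book_test:requests'))     # pending requests (priority queue, a sorted set)
print(r.scard('book_test:dupefilter'))   # fingerprints of requests already seen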