Scrapy returns no data for duplicate URLs: fix with dont_filter=True
2022-08-03 09:32:00 【The moon give me copy code】
Scenario: The spider runs without errors and the XPath expressions are known to be correct, yet no data comes back.
Possible cause: The spider is requesting a URL that Scrapy has already requested.
Scrapy has a built-in duplicate filter for requests, and it is enabled by default.
In the following example, parse2 is never called:
    import scrapy


    class ExampleSpider(scrapy.Spider):
        name = "test"
        # allowed_domains = ["www.baidu.com"]
        start_urls = ["https://www.baidu.com/"]

        def parse(self, response):
            # Request the same URL that Scrapy has already fetched from start_urls.
            yield scrapy.Request(self.start_urls[0], callback=self.parse2)

        def parse2(self, response):
            print(response.url)

Scrapy requests start_urls[0] on its own before it calls parse. When parse requests start_urls[0] a second time, Scrapy's duplicate filter treats the URL as already requested and discards the new request without raising an error. That is why parse2 is never called.
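One way to check whether the duplicate filter is what dropped a request is to make Scrapy log every request it filters. By default only the first filtered duplicate is logged, at DEBUG level. The fragment below is a sketch of a project settings.py, assuming the project uses Scrapy's default duplicate filter class:

```python
# settings.py (fragment)

# Log every request dropped by the duplicate filter, not only the first one.
DUPEFILTER_DEBUG = True

# Filtered duplicates are logged at DEBUG level, so the log level must allow it.
LOG_LEVEL = "DEBUG"
```

With these settings, dropped requests should show up in the log as lines beginning with "Filtered duplicate request". The dupefilter/filtered counter in the stats printed at the end of the crawl gives the same information in aggregate.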
Workaround:
Pass dont_filter=True to scrapy.Request so that Scrapy does not filter out the duplicate request.
    import scrapy


    class ExampleSpider(scrapy.Spider):
        name = "test"
        # allowed_domains = ["www.baidu.com"]
        start_urls = ["https://www.baidu.com/"]

        def parse(self, response):
            # dont_filter=True tells Scrapy to skip the duplicate check for this request.
            yield scrapy.Request(
                self.start_urls[0],
                callback=self.parse2,
                dont_filter=True,
            )

        def parse2(self, response):
            print(response.url)

With this change, parse2 is called as expected.
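The effect of dont_filter can be illustrated with a simplified model of a scheduler that keeps a set of already-seen requests. This is only a sketch and not Scrapy's actual code: ToyRequest and ToyScheduler are hypothetical names, and the URL string stands in for the request fingerprint that Scrapy computes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyRequest:
    """Hypothetical stand-in for scrapy.Request (not the real class)."""
    url: str
    dont_filter: bool = False


class ToyScheduler:
    """Simplified model of a scheduler with a duplicate filter."""

    def __init__(self):
        self.seen = set()   # URLs that have already passed the filter
        self.queue = []     # requests accepted for download

    def enqueue(self, request):
        # Requests flagged dont_filter=True skip the duplicate check entirely.
        if not request.dont_filter:
            if request.url in self.seen:
                return False  # duplicate: silently dropped
            self.seen.add(request.url)
        self.queue.append(request)
        return True


scheduler = ToyScheduler()
url = "https://www.baidu.com/"

first = scheduler.enqueue(ToyRequest(url))                     # accepted
second = scheduler.enqueue(ToyRequest(url))                    # dropped as a duplicate
forced = scheduler.enqueue(ToyRequest(url, dont_filter=True))  # accepted again

print(first, second, forced)  # prints: True False True
```

Because dont_filter=True switches the check off for that request, a callback that keeps yielding the same URL back to itself with this flag would never stop, so it is best limited to requests that are known to terminate.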