Solving 301/302 redirects encountered by a Scrapy crawler
2022-08-02 04:00:00 【BIG_right】
Stopping redirects in Scrapy
When Scrapy crawls a site, it may receive a 301/302 redirect. This is especially troublesome for download links: Scrapy follows the redirect automatically, starts downloading the file, and only returns to the links you collected once the download has finished. In that case you need to stop the redirect.
The examples below use 302. Replace it with 301 where needed; the handling is the same.
Abort redirect
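yield Request(url, meta={'dont_redirect': True, 'handle_httpstatus_list': [302]}, callback=self.parse)
Here dont_redirect tells Scrapy not to follow the redirect, and handle_httpstatus_list lets the 302 response reach the callback instead of being discarded as a non-2xx response.
If the request is yielded from inside parse, dont_filter=True must be added as well. For details, see Scenario Two below.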
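Since 301 and 302 are handled the same way, the meta values can be built once and reused. The following is a minimal sketch, not part of Scrapy: the helper name no_redirect_meta and its parameters are assumptions made for illustration. It only builds the dictionary, so it runs without Scrapy installed.

```python
from typing import Iterable, Optional


def no_redirect_meta(statuses: Iterable[int] = (301, 302),
                     extra: Optional[dict] = None) -> dict:
    """Build a Request.meta dict that stops redirects for one request.

    Hypothetical helper for illustration; it is not a Scrapy API.
    """
    # Start from a copy so the caller's dict is never mutated.
    meta = dict(extra or {})
    # Ask Scrapy not to follow the redirect for this request.
    meta["dont_redirect"] = True
    # Let these status codes through to the spider callback.
    meta["handle_httpstatus_list"] = sorted(set(statuses))
    return meta


# Intended use inside a spider (needs Scrapy, so shown as a comment only):
#   yield Request(url, meta=no_redirect_meta(), callback=self.parse)
print(no_redirect_meta())
```

Passing only [302] as statuses reproduces the meta used in the request above.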
Get the Location value in the response
The redirect target is sent in the Location header of the response. Read it as follows; Scrapy returns the value as bytes, or None if the header is absent:
location = response.headers.get("Location")
Scenario One
If the URLs to crawl come from start_urls, set the meta values directly in the start_requests method:
def start_requests(self):
    yield Request(url, meta={'dont_redirect': True, 'handle_httpstatus_list': [302]}, callback=self.parse)
Complete example
import scrapy
from scrapy import Request


class xxSpider(scrapy.Spider):
    name = 'xx'
    allowed_domains = ['www.xxx.com']
    start_urls = ['http://www.xxx.com/download']

    def start_requests(self):
        # Abort the 302 redirect directly here
        yield Request(self.start_urls[0], meta={'dont_redirect': True, 'handle_httpstatus_list': [302]}, callback=self.parse)

    def parse(self, response):
        # Get the returned redirect target (bytes, or None if the header is absent)
        location = response.headers.get("Location")
Scenario Two
If the request is yielded from inside parse, also pass dont_filter=True so that the scheduler's duplicate filter does not drop it:
yield Request(url, meta={'dont_redirect': True, 'handle_httpstatus_list': [302]}, callback=self.parse, dont_filter=True)
Complete example
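import scrapy
from scrapy import Request


class xxSpider(scrapy.Spider):
    name = 'xx'
    allowed_domains = ['www.xxx.com']
    start_urls = ['http://www.xxx.com/download']

    def parse(self, response):
        url = "xxxxxxxxxx"
        # dont_filter=True needs to be added here
        yield Request(url, meta={'dont_redirect': True, 'handle_httpstatus_list': [302]}, callback=self.parse_redirect, dont_filter=True)

    def parse_redirect(self, response):
        # A separate callback (illustrative name) keeps parse from re-yielding the same unfiltered request in a loop
        location = response.headers.get("Location")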
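The value returned by response.headers.get("Location") is bytes, and a server may send a relative path rather than a full URL. The following is a minimal sketch of turning it into an absolute URL string using only the standard library. The helper name resolve_location, the UTF-8 decoding, and the sample paths are assumptions for illustration, not something the original code specifies.

```python
from typing import Optional
from urllib.parse import urljoin


def resolve_location(request_url: str, location: Optional[bytes]) -> Optional[str]:
    """Turn a raw Location header value into an absolute URL string.

    Hypothetical helper for illustration; it is not a Scrapy API.
    """
    if location is None:
        # No Location header, so the response was not a redirect.
        return None
    # Header values arrive as bytes; UTF-8 is assumed here.
    target = location.decode("utf-8", errors="replace")
    # urljoin keeps absolute URLs as they are and resolves relative ones.
    return urljoin(request_url, target)


# Sample values are made up for the demonstration.
print(resolve_location("http://www.xxx.com/download", b"/files/a.zip"))
print(resolve_location("http://www.xxx.com/download", b"http://cdn.example.com/a.zip"))
```

Inside a callback this would be called as resolve_location(response.url, response.headers.get("Location")).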