Script: crawl comics with a custom storage path and download them locally
2022-06-26 13:07:00 【Rowing in the waves】
Scrapy: crawl comics from manhua.sfacg.com with a custom storage path and download them to the local disk
- 1. Create the project and the main spider file
- 2. Create main.py under the project and write the following code:
- 3. Write the code in items.py, as follows:
- 4. Write the code in pipelines.py, as follows:
- 5. Write the code in Comic.py, as follows:
- 6. Write the code in settings.py, as follows:
- 7. Run main.py; the result is as follows:
1. Create the project and the main spider file
scrapy startproject comic
cd comic
scrapy genspider Comic manhua.sfacg.com
Note: run the commands above in a cmd/terminal window.
2. Create main.py under the project and write the following code:
from scrapy import cmdline
cmdline.execute("scrapy crawl Comic".split())
The project structure is shown in the figure below:
3. Write the code in items.py, as follows:
import scrapy

class ComicItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    comic_name = scrapy.Field()    # comic title
    chapter_name = scrapy.Field()  # chapter name
    chapter_url = scrapy.Field()   # chapter URL
    img_urls = scrapy.Field()      # list of image URLs for the chapter
    chapter_js = scrapy.Field()    # JS file corresponding to the chapter
4. Write the code in pipelines.py, as follows:
from scrapy import Request
from scrapy.pipelines.images import ImagesPipeline

class MyImageDownloadPipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        # parameters needed to name the image folders
        meta = {'comic_name': item['comic_name'], 'chapter_name': item['chapter_name']}
        # remember to pass meta along with each request, otherwise file_path
        # cannot build the path and an error is raised
        return [Request(url=x, meta=meta) for x in item.get('img_urls', [])]

    def file_path(self, request, response=None, info=None):
        comic_name = request.meta.get('comic_name')      # comic title as the top-level directory
        chapter_name = request.meta.get('chapter_name')  # chapter name as the sub-directory
        image_name = request.url.split('/')[-1]          # last URL segment as the image file name
        # construct the file name
        filename = '{0}/{1}/{2}'.format(comic_name, chapter_name, image_name)
        return filename
The code above subclasses ImagesPipeline and overrides two of its methods, which is all that is needed to implement the custom storage path.
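To see what file_path produces, here is a minimal stand-alone sketch of the same path-building logic; the comic name, chapter name, and URL below are made-up examples, not values from the site:

```python
# Stand-alone demonstration of the path logic used in file_path above.
# All names and the URL are hypothetical examples.
def build_image_path(comic_name, chapter_name, url):
    image_name = url.split('/')[-1]  # last URL segment, e.g. "002_260.jpg"
    return '{0}/{1}/{2}'.format(comic_name, chapter_name, image_name)

path = build_image_path('MyComic', 'Chapter01',
                        'http://example.com/pics/002_260.jpg')
print(path)  # MyComic/Chapter01/002_260.jpg
```

Scrapy treats the returned string as a path relative to IMAGES_STORE, which is why returning `comic/chapter/image` is enough to get a nested folder per comic and per chapter.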
5. Write the code in Comic.py, as follows:
import scrapy
from ..items import ComicItem

class ComicSpider(scrapy.Spider):
    name = 'Comic'
    allowed_domains = ['manhua.sfacg.com', 'comic.sfacg.com']
    start_urls = ['https://manhua.sfacg.com/mh/Gongsheng/']
    img_host = 'http://coldpic.sfacg.com'  # host for the image links
    host = 'https://manhua.sfacg.com'      # host for the chapter links

    def parse(self, response):
        # list of chapter links
        chapter_urls = response.xpath('//div[@class="comic_Serial_list"]//a/@href').extract()
        for chapter_url in chapter_urls:       # iterate over the chapter URLs
            chapter_url = self.host + chapter_url  # prepend the host
            yield scrapy.Request(chapter_url, callback=self.parse_chapter)

    def parse_chapter(self, response):  # parse the chapter HTML
        js_url = 'http:' + response.xpath('//head//script/@src').extract()[0]
        chapter_name = response.xpath('//div[@id="AD_j1"]/span/text()').extract_first().strip()
        yield scrapy.Request(url=js_url,
                             meta={'chapter_name': chapter_name, 'chapter_url': response.url},
                             callback=self.parse_js)

    def parse_js(self, response):  # parse the data returned by the JS file
        texts = response.text.split(';')
        comic_name = texts[0].split('=')[1].strip()[1:-1]  # comic title
        # list of image URLs for the chapter
        img_urls = [self.img_host + text.split('=')[1].strip()[1:-1] for text in texts[7:-1]]
        chapter_name = response.meta.get('chapter_name')  # chapter name
        chapter_url = response.meta.get('chapter_url')    # chapter URL
        comic_item = ComicItem()
        comic_item['comic_name'] = comic_name
        comic_item['chapter_name'] = chapter_name
        comic_item['chapter_js'] = response.url
        comic_item['img_urls'] = img_urls
        comic_item['chapter_url'] = chapter_url
        yield comic_item
Because the images on a chapter's detail page are loaded dynamically by JavaScript, the chapter HTML itself contains no image links. On inspection, the image links turn out to come from a JS file, and that JS file is referenced in the chapter page's source. The overall approach is therefore: first collect the links to all chapters, then fetch each chapter page's source to extract the JS URL, and finally request that JS URL to obtain all of the chapter's image links. The resulting comic_item looks like this:
{'chapter_js': 'http://comic.sfacg.com/Utility/1972/ZP/0083_6428.js',
'chapter_name': '共生 19话(五)',
'chapter_url': 'https://manhua.sfacg.com/mh/Gongsheng/53379/',
'comic_name': '共生',
'img_urls':
['http://coldpic.sfacg.com/Pic/OnlineComic4/Gongsheng/ZP/0083_6428/002_260.jpg',
'http://coldpic.sfacg.com/Pic/OnlineComic4/Gongsheng/ZP/0083_6428/003_12.jpg',
'http://coldpic.sfacg.com/Pic/OnlineComic4/Gongsheng/ZP/0083_6428/004_71.jpg',
'http://coldpic.sfacg.com/Pic/OnlineComic4/Gongsheng/ZP/0083_6428/005_659.jpg',
'http://coldpic.sfacg.com/Pic/OnlineComic4/Gongsheng/ZP/0083_6428/006_217.jpg',
'http://coldpic.sfacg.com/Pic/OnlineComic4/Gongsheng/ZP/0083_6428/007_611.jpg',
'http://coldpic.sfacg.com/Pic/OnlineComic4/Gongsheng/ZP/0083_6428/008_168.jpg',
'http://coldpic.sfacg.com/Pic/OnlineComic4/Gongsheng/ZP/0083_6428/009_335.jpg',
'http://coldpic.sfacg.com/Pic/OnlineComic4/Gongsheng/ZP/0083_6428/010_175.jpg',
'http://coldpic.sfacg.com/Pic/OnlineComic4/Gongsheng/ZP/0083_6428/011_178.jpg',
'http://coldpic.sfacg.com/Pic/OnlineComic4/Gongsheng/ZP/0083_6428/012_790.jpg']}
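The split-based parsing in parse_js assumes the JS file is a series of semicolon-separated assignments: the first one holds the comic title, and the assignments from index 7 up to the last hold the quoted image paths. Here is a stand-alone sketch of that parsing against a made-up JS snippet of the assumed shape (the variable names and paths are illustrative, not taken from the real file):

```python
img_host = 'http://coldpic.sfacg.com'

# Made-up JS text mimicking the assumed structure: the first assignment is
# the comic title, assignments 7..n-1 are quoted image paths.
js_text = (
    'var comic = "MyComic";'      # texts[0]: comic title
    'var a1 = "x";var a2 = "x";var a3 = "x";'
    'var a4 = "x";var a5 = "x";var a6 = "x";'  # texts[1..6]: filler
    'var p1 = "/Pic/0001.jpg";'   # texts[7:-1]: image paths
    'var p2 = "/Pic/0002.jpg";'
)

texts = js_text.split(';')
comic_name = texts[0].split('=')[1].strip()[1:-1]  # strip the quotes
img_urls = [img_host + t.split('=')[1].strip()[1:-1] for t in texts[7:-1]]
print(comic_name)  # MyComic
print(img_urls)    # ['http://coldpic.sfacg.com/Pic/0001.jpg', ...]
```

Note that the `[1:-1]` slice removes the surrounding quotation marks, and `texts[7:-1]` skips the first seven assignments plus the empty string after the final semicolon; the exact indices depend on the JS file's layout and would need re-checking if the site changes.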
6. Write the code in settings.py, as follows:
ROBOTSTXT_OBEY = False
DOWNLOAD_DELAY = 3
ITEM_PIPELINES = {
    'comic.pipelines.ComicPipeline': None,  # disable the default pipeline
    'comic.pipelines.MyImageDownloadPipeline': 300,
}

import os
project_dir = os.path.abspath(os.path.dirname(__file__))
IMAGES_STORE = os.path.join(project_dir, 'images')  # comics are stored under the project's images folder
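Putting the two pieces together: the relative path returned by the pipeline's file_path is joined under IMAGES_STORE, so each image ends up at `<project>/images/<comic>/<chapter>/<image>`. A minimal sketch of that composition, using hypothetical names:

```python
import os

project_dir = '/tmp/comic/comic'  # hypothetical project directory
IMAGES_STORE = os.path.join(project_dir, 'images')

# relative path as returned by file_path (hypothetical values)
relative = 'MyComic/Chapter01/002_260.jpg'
final_path = os.path.join(IMAGES_STORE, *relative.split('/'))
print(final_path)  # /tmp/comic/comic/images/MyComic/Chapter01/002_260.jpg
```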
7. Run main.py; the result is as follows:
Finally, once the run completes, you can see that the comic has been downloaded locally.

