Notes on passing items from the spider file to the pipeline in a Scrapy crawler
2022-06-26 00:04:00 【Falling ink painting snow】
1 Problem description
While teaching myself the Scrapy framework, I ran into pages that contain several li or ul tags whose text or attributes are exactly the data I need, so there is no reason to follow the link inside each tag and crawl the page it points to. From what I had learned so far, what the spider hands to the pipeline with return each time is a single item in dictionary form. That raises two questions for this case:
(1) It would be convenient to pass a list to the pipeline, but Scrapy will not hand a list of nested dictionaries to the pipeline as a single item.
(2) If I iterate over the page's ul or li tags with a for loop, I cannot use return inside it: return fires on the first iteration, so only the first li or ul tag's content reaches the pipeline, and the database ends up holding only the first li or ul tag of the page.
2 Solution
For question 1, I think the spider can store the item dictionary of every li or ul tag on the current page inside one outer dictionary, i.e. {a:{1:1,2:2…},b:{1:1,2:2…}…},
then return this combined item to pass it to the pipeline. The corresponding pipeline method then has to iterate over the combined item and hand each entry to the pipeline's other methods for processing.
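The idea for question 1 can be sketched roughly as follows. This is a minimal illustration in plain Python rather than the author's code: it does not import Scrapy, and the names build_combined_item, UnpackingPipeline, and save_record are hypothetical. Only the process_item(item, spider) method mirrors the shape Scrapy expects of a pipeline.

```python
def build_combined_item(li_texts):
    """Pack one dict per <li> into a single outer dict keyed by position.

    li_texts stands in for the text already extracted from each <li> tag.
    """
    return {index: {"text": text} for index, text in enumerate(li_texts)}


class UnpackingPipeline:
    """Pipeline-shaped class (hypothetical name) that unpacks a combined item."""

    def __init__(self):
        self.stored = []  # stands in for a database table

    def process_item(self, item, spider):
        # Walk the outer dict and handle each inner record separately.
        for record in item.values():
            self.save_record(record)
        return item

    def save_record(self, record):
        # Hypothetical helper; a real pipeline would write to the database here.
        self.stored.append(record)


combined = build_combined_item(["first", "second", "third"])
pipeline = UnpackingPipeline()
pipeline.process_item(combined, spider=None)
```

The cost of this design is that the pipeline has to know about the nesting, which is part of why the yield approach is simpler.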
For question 2, yield can be used instead of return. return ends the function immediately, which is what stops the for loop after its first iteration; yield hands back one item and lets the loop continue. This is the method I ended up using.
Note the yield info statement: after iterating directly over the li tags of the current page, using yield instead of return means there is no need to follow the link in each li tag, and every item is still returned to the pipeline.
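The difference between the two keywords can be seen without Scrapy at all. The sketch below is not the author's spider: parse_with_return and parse_with_yield are hypothetical stand-ins for a parse callback, and li_texts stands in for the text extracted from the page's li tags. Only the variable name info is taken from the post.

```python
def parse_with_return(li_texts):
    """Returns inside the loop, so the function exits on the first <li>."""
    for text in li_texts:
        info = {"text": text}
        return info  # the loop never reaches the second element


def parse_with_yield(li_texts):
    """Yields inside the loop, so every <li> produces its own item."""
    for text in li_texts:
        info = {"text": text}
        yield info  # execution resumes here when the next item is requested


li_texts = ["first", "second", "third"]

returned = parse_with_return(li_texts)       # a single dict
yielded = list(parse_with_yield(li_texts))   # one dict per <li>
```

In a real spider, Scrapy consumes the generator itself and sends each yielded item through the pipeline's process_item method in turn.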