Notes on passing items from the spider file to the pipeline in a Scrapy crawler
2022-06-26 00:04:00 【Falling ink painting snow】
1 Problem description
While teaching myself the Scrapy crawler framework, I ran into pages that contain several li or ul tags whose text or attributes are exactly the data I need, so there is no need to follow the link inside each tag and crawl the page it points to. From what I had learned, each time the spider uses return to hand data to the pipeline, the data takes the form of a single item, which is a dictionary. This raises two problems in the situation above:
(1) It would be convenient to pass a list to the pipeline, but Scrapy does not accept a list of nested dictionaries as a single item.
(2) If I iterate over the page's ul or li tags with a for loop, I cannot use return inside the loop, because only the content of the first li or ul tag is returned to the pipeline. The end result is that only the first li or ul tag of the page is stored in the database.
2 Solution
For problem 1, I think the spider could store the item dictionaries for all li or ul tags on the current page inside one outer dictionary, i.e. {a:{1:1,2:2…},b:{1:1,2:2…}…},
then return this combined item so that it is passed to the pipeline. Of course, the corresponding pipeline method then has to iterate over the combined item and hand each entry to the pipeline's other methods for processing.
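The idea for problem 1 can be sketched roughly as follows. This is only an illustration in plain Python, not code from the original post: `build_combined_item`, `CombinedItemPipeline`, `save_record`, and the `title` field are hypothetical names, and `rows` stands in for the text extracted from each li tag. The only part taken from Scrapy is the `process_item(self, item, spider)` method signature that pipelines use.

```python
def build_combined_item(rows):
    """Collect one small dictionary per li tag inside a single outer dictionary."""
    combined = {}
    for index, row in enumerate(rows):
        # Hypothetical field name; a real spider would fill in its own fields.
        combined[index] = {"title": row}
    return combined


class CombinedItemPipeline:
    """Hypothetical pipeline that unpacks the combined item entry by entry."""

    def __init__(self):
        self.saved = []  # stands in for a database in this sketch

    def process_item(self, item, spider):
        # The combined item arrives once; iterate over its inner dictionaries.
        for record in item.values():
            self.save_record(record)
        return item

    def save_record(self, record):
        # Placeholder for the real storage step (for example a database insert).
        self.saved.append(record)
```

The cost of this design is that the pipeline has to know the item is a container and unpack it, which is why the yield approach described next is usually simpler.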
For problem 2, yield can be used instead of return. A return statement ends the current function immediately, which stops the for loop after its first iteration; yield avoids this, because the function resumes after each item is handed over. This is the method I ended up using.
The key line is yield info: after iterating over the li tags on the current page, using yield instead of return passes every item to the pipeline, with the same effect as return, without having to follow the link inside each li tag.
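The difference between the two keywords can be seen in a minimal sketch like the one below. It is plain Python rather than the author's spider: `parse_with_return`, `parse_with_yield`, and the `title` field are hypothetical, and `rows` stands in for the selectors a Scrapy callback would get from something like `response.xpath("//li")`. Only the variable name `info` follows the original text.

```python
def parse_with_return(rows):
    """Mimics a parse callback that uses return inside the loop."""
    for row in rows:
        info = {"title": row}
        return info  # the function ends here, so later rows are never reached


def parse_with_yield(rows):
    """Mimics a parse callback that uses yield inside the loop."""
    for row in rows:
        info = {"title": row}
        yield info  # hands over one item, then the loop continues


rows = ["first", "second", "third"]
only_first = parse_with_return(rows)
all_items = list(parse_with_yield(rows))
```

Here `only_first` holds a single dictionary for the first row, while `all_items` holds one dictionary per row. In a real spider, Scrapy consumes the generator itself and sends each yielded item through the pipeline in turn.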