Notes on passing items from the spider file to the pipeline in a Scrapy crawler

2022-06-26 00:04:00 Falling ink painting snow

1 Problem description

While teaching myself crawling with the Scrapy framework, I found that a page can contain several li or ul tags whose text or attributes are exactly the data I need, so there is no need to follow the link inside each tag and crawl the page it points to. From what I had learned, each time the spider uses return to pass data to the pipeline, the data takes the form of one item, a dictionary. So the question arises, in the situation above:
(1) It would be convenient to pass a list to the pipeline, but Scrapy does not hand a list of nested dictionaries to the pipeline as a single item.
(2) If I iterate over the ul or li tags of the parsed page with a for loop, I cannot use return, because only the content of the first li or ul tag is returned to the pipeline for processing. The final result is that only the first li or ul tag of the page is stored in the database.
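The behavior in (2) comes from plain Python rather than from anything specific to Scrapy. A minimal sketch, using a hypothetical `parse_first_only` function and made-up label texts, of what happens when `return` sits inside the loop:

```python
def parse_first_only(labels):
    """Mimic a parse callback that uses return inside the for loop."""
    for text in labels:
        item = {"text": text}
        # return ends the function here, so later labels are never reached.
        return item


labels = ["first li", "second li", "third li"]  # made-up label texts
result = parse_first_only(labels)
print(result)
```

Only the dictionary built from the first label comes back; the second and third labels are never visited.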

2 Solution

For problem 1, I think the spider file can store the item dictionaries of all li or ul tags on the current page in one outer dictionary, i.e. {a:{1:1,2:2…},b:{1:1,2:2…}…},
and then return this combined item so that it is passed to the pipeline. Of course, the corresponding pipeline method then has to iterate over the combined item and hand each entry to the pipeline's other methods for processing.
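The idea for problem 1 could look roughly like the sketch below. The class name `BundlePipeline`, the `save_row` helper and the field names are assumptions made for illustration, not the author's code; `process_item(self, item, spider)` is the standard Scrapy pipeline hook, and a pipeline class needs no Scrapy import.

```python
class BundlePipeline:
    """Hypothetical pipeline that unpacks one combined item into rows."""

    def __init__(self):
        self.saved = []  # stands in for a database table

    def save_row(self, key, row):
        # Hypothetical helper: a real pipeline would write to a database here.
        self.saved.append((key, row))

    def process_item(self, item, spider):
        # item looks like {"a": {...}, "b": {...}}: one inner dict per label.
        for key, row in item.items():
            self.save_row(key, row)
        return item


# What the spider would return: one dict wrapping every per-label dict.
bundle = {"a": {"text": "first li"}, "b": {"text": "second li"}}
pipeline = BundlePipeline()
pipeline.process_item(bundle, spider=None)
```

The trade-off is that the pipeline now has to know how the spider packed the data, which is one reason yielding one item per label is usually simpler.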
For problem 2, yield can be used instead of return. A return statement ends the function immediately, which is why the for loop stopped after its first iteration; yield avoids this. This is the approach I ended up using. Without further ado, here is the code:
[Image: screenshot of the spider code]
Note the yield info statement in the code. After iterating directly over the li tags of the current page, yield is used instead of return, so each item keeps being passed to the pipeline without any need to follow the link inside each li tag.
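A minimal sketch of the yield pattern described above, not the author's actual spider. The spider name, start URL, CSS selectors and the field names in `info` are all assumptions; the `try`/`except` fallback is there only so the sketch can be imported where Scrapy is not installed.

```python
try:
    import scrapy
    SpiderBase = scrapy.Spider
except ImportError:
    # Fallback so the sketch still imports without Scrapy installed.
    SpiderBase = object


class LiSpider(SpiderBase):
    name = "li_demo"                      # hypothetical spider name
    start_urls = ["https://example.com"]  # placeholder URL

    def parse(self, response):
        # Walk every li on the page; no request is made for the links inside.
        for li in response.css("ul li"):
            info = {
                "text": li.css("a::text").get(),
                "href": li.css("a::attr(href)").get(),
            }
            # yield hands info to the pipeline and then resumes the loop,
            # so every li produces its own item.
            yield info
```

Inside a Scrapy project this would be run with `scrapy crawl li_demo`, and each yielded dictionary reaches the pipeline's `process_item` as a separate item.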

Original site

Copyright notice
This article was written by [Falling ink painting snow]. Please include the original link when reposting. Thank you.
https://yzsam.com/2022/176/202206252107243227.html