
Crawling Douban group data with a crawler

2022-06-25 03:54:00 Blockchain research

 

Douban groups have no anti-crawling measures, and you don't need to log in. Even so, do not crawl too frequently, and rotate proxy IPs.
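The advice above (crawl slowly, rotate proxy IPs) can be sketched in a few lines of Python for readers who want to see the idea outside the visual tool. This is only a minimal sketch: the proxy addresses, the function names, and the 2 to 3 second pause are all assumptions made for illustration, not values from the original tool.

```python
import itertools
import random

# Hypothetical proxy addresses; replace them with proxies you actually control.
PROXIES = ["http://10.0.0.1:8080", "http://10.0.0.2:8080"]


def polite_delay(base_seconds=2.0, jitter_seconds=1.0):
    """Return a randomized pause between base and base + jitter seconds."""
    return base_seconds + random.uniform(0.0, jitter_seconds)


def crawl_plan(urls, proxies, base_seconds=2.0, jitter_seconds=1.0):
    """Pair each URL with a proxy (round-robin) and the pause to take before it."""
    rotation = itertools.cycle(proxies)
    return [
        (url, next(rotation), polite_delay(base_seconds, jitter_seconds))
        for url in urls
    ]
```

A caller would walk the plan, sleep for the given pause with `time.sleep`, and then send the request through the paired proxy using whatever HTTP client it prefers.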

Douban's page structure is also clean, and there is no need to deal with AJAX.

Tool preparation: the Cloud gathering crawler

Data structure:

 

We need to grab the posts together with their replies (the comment data).

The data structure is as follows :
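Since the original figure is not available here, the following is a rough sketch of what such a structure might look like, based only on the fields the article mentions later (post title, content, release time, and each comment's content and time). The class and field names are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Comment:
    content: str  # text of the reply
    time: str     # reply time, kept as the raw string shown on the page


@dataclass
class Post:
    url: str           # link to the post's details page
    title: str
    content: str
    publish_time: str  # release time, kept as the raw string shown on the page
    comments: List[Comment] = field(default_factory=list)
```

Keeping the comments as a list nested inside the post mirrors the crawl flow: one details page yields one post and many comments.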

 

Flow chart design :

 

 

Let's analyze the whole process :

1. Extract list links:

 

For example, take this address: https://www.douban.com/group/240355/discussion?start=0
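The `start` parameter in that address is an offset into the discussion list, so list pages can be generated rather than typed by hand. A minimal sketch follows; the page size of 25 is an assumption about how many topics one list page shows, and the function name is hypothetical.

```python
from urllib.parse import urlencode

GROUP_ID = "240355"  # group id taken from the example address above
PAGE_SIZE = 25       # assumption: number of topics shown per list page


def list_page_url(group_id, page_index, page_size=PAGE_SIZE):
    """Build the discussion-list URL for a zero-based page index."""
    query = urlencode({"start": page_index * page_size})
    return f"https://www.douban.com/group/{group_id}/discussion?{query}"
```

Calling `list_page_url(GROUP_ID, 0)` reproduces the example address; higher page indexes only change the `start` offset.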

 

 

Analyzing the page shows that we can choose 【CSS selector】 and fill in directly:

 

td.title
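For readers curious what this selector does in code, here is a small sketch that collects the links found inside `td.title` cells using only Python's standard library. The sample markup in `SAMPLE` is invented for illustration and is not copied from Douban; with a library such as BeautifulSoup, the equivalent would roughly be `soup.select("td.title a")`.

```python
from html.parser import HTMLParser


class TitleLinkParser(HTMLParser):
    """Collect href values of <a> tags nested inside <td class="title">."""

    def __init__(self):
        super().__init__()
        self.in_title_cell = False
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "td" and "title" in (attrs.get("class") or "").split():
            self.in_title_cell = True
        elif tag == "a" and self.in_title_cell and attrs.get("href"):
            self.links.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_title_cell = False


def extract_topic_links(html):
    """Return the links found in td.title cells, in document order."""
    parser = TitleLinkParser()
    parser.feed(html)
    return parser.links


# Hypothetical list-page fragment, for illustration only.
SAMPLE = (
    '<table><tr>'
    '<td class="title"><a href="https://www.douban.com/group/topic/1/">A post</a></td>'
    '<td><a href="/people/someone/">someone</a></td>'
    '</tr></table>'
)
```

Only the link in the `title` cell is collected; the author link in the neighbouring cell is ignored, which is exactly why the selector targets `td.title`.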

Test result:

 

2. Extract the post title, content, release time, etc.

These fields are captured directly from the details page. As shown in the figure, a CSS selector is all it takes to get them.

 

3. Grab comments:

First, we use a 【Field loop area】 to capture the entire comment loop.

 

 

The test is shown in the figure :

 

 

This may not be intuitive enough, so we click trace to analyze it:

 

We then extract the content and time of comments from these loop units .

 

As shown in the figure, adding one more 【Data Extraction】 component is enough to extract them directly.
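The two-stage idea (first select the repeating comment units, then pull the fields out of each unit) can be sketched as follows. The markup and the class names `comment-item`, `pubtime`, and `reply-content` are hypothetical, and the sample is deliberately well-formed so that the standard library's XML parser can read it; a real page would need a proper HTML parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed comment fragment, for illustration only.
SAMPLE = """
<ul id="comments">
  <li class="comment-item"><span class="pubtime">2022-06-24 10:00:00</span><p class="reply-content">First reply</p></li>
  <li class="comment-item"><span class="pubtime">2022-06-24 10:05:00</span><p class="reply-content">Second reply</p></li>
</ul>
"""


def extract_comments(fragment):
    """Loop over comment units, then extract time and content from each one."""
    root = ET.fromstring(fragment.strip())
    comments = []
    # Stage 1: the "loop area" selects every repeating comment unit.
    for unit in root.findall(".//li[@class='comment-item']"):
        # Stage 2: field extraction runs inside a single unit only.
        time_node = unit.find(".//span[@class='pubtime']")
        content_node = unit.find(".//p[@class='reply-content']")
        comments.append({
            "time": time_node.text if time_node is not None else None,
            "content": content_node.text if content_node is not None else None,
        })
    return comments
```

Scoping the second lookup to each unit keeps a comment's time paired with its own content, even when a field is missing from one of the comments.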

 

Comment pagination

 

Comments are spread across several pages. How do we page through them to get all the comments?

 

In the 【Details page】 component, we directly attach a 【Next page】 component, which only needs to get the link to the next page.
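In code, following the next-page link amounts to a simple loop. The sketch below assumes a caller-supplied `fetch_page` function (hypothetical) that downloads one details page and returns its comments together with the next-page link, or `None` on the last page. The `max_pages` cap and the visited-URL check are extra safeguards added here, not features described in the original tool.

```python
def crawl_comment_pages(first_url, fetch_page, max_pages=50):
    """Collect comments from every page by following next-page links.

    fetch_page(url) must return (comments, next_url), with next_url set to
    None when there is no further page.
    """
    all_comments = []
    seen = set()
    url = first_url
    # Stop on the last page, on a repeated URL, or when the page cap is hit.
    while url and url not in seen and len(seen) < max_pages:
        seen.add(url)
        comments, url = fetch_page(url)
        all_comments.extend(comments)
    return all_comments
```

Passing the fetch function in as a parameter keeps the paging logic separate from downloading and parsing, so it can be tried out on a small in-memory dictionary of fake pages.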

 

 

Data preview :

 

Without writing a single line of code or needing any deep technical knowledge, it is easy to get the data.

 

Original site

Copyright notice
This article was written by [Blockchain research]. Please include a link to the original when reposting. Thank you.
https://yzsam.com/2022/02/202202210537010501.html