Scraping Douban group data with a crawler
2022-06-25 03:54:00 【Blockchain research】
Douban groups have no anti-crawling measures, and you don't need to log in. Just don't crawl too frequently, and rotate proxy IPs.
The page structure of Douban is also clean, and there is no AJAX to deal with.
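For readers who want to apply the same advice outside the tool, the throttling and proxy rotation can be sketched in Python. This is a minimal sketch, not what the tool does internally: the proxy addresses, the delay values, and the function names are all assumptions chosen for illustration.

```python
import itertools
import random
import time
import urllib.request

# Hypothetical proxy addresses; replace with proxies you actually control.
PROXIES = ["http://127.0.0.1:8001", "http://127.0.0.1:8002"]


def polite_delay(base=2.0, jitter=1.0):
    """Return a wait time in seconds: a fixed base plus random jitter."""
    return base + random.uniform(0.0, jitter)


def fetch(url, proxy, timeout=10):
    """Fetch one page through the given proxy and return the decoded text."""
    handler = urllib.request.ProxyHandler({"http": proxy, "https": proxy})
    opener = urllib.request.build_opener(handler)
    request = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with opener.open(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(urls, proxies=PROXIES, fetch_page=fetch, sleep=time.sleep):
    """Fetch each URL with the next proxy in turn, pausing between requests."""
    rotation = itertools.cycle(proxies)  # round-robin over the proxy list
    pages = []
    for index, url in enumerate(urls):
        if index > 0:
            sleep(polite_delay())  # no pause is needed before the first request
        pages.append(fetch_page(url, next(rotation)))
    return pages
```

Passing `fetch_page` and `sleep` as parameters keeps the rotation logic easy to try out without touching the network.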
Tool preparation: Cloud Gathering crawler
Data structure:
We need to scrape posts and their replies (comment data).
The data structure is as follows:
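As a rough sketch, the fields named in this article (post title, content, and publish time, plus the content and time of each comment) could be modeled as two record types. The class names, the field names, and the extra `url` field are assumptions for illustration, not the tool's actual schema.

```python
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class Comment:
    content: str  # text of the reply
    time: str     # time string as shown on the page


@dataclass
class Post:
    url: str           # assumed field: the detail page the post came from
    title: str
    content: str
    publish_time: str
    comments: List[Comment] = field(default_factory=list)


# Illustrative values only, not real scraped data.
example = Post(
    url="https://www.douban.com/group/topic/1/",
    title="Sample title",
    content="Sample body",
    publish_time="2022-06-25 03:54:00",
)
example.comments.append(Comment(content="Sample reply", time="2022-06-25 04:00:00"))
```

`asdict(example)` turns the nested record into plain dictionaries, which is convenient for writing JSON or CSV.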

Flow chart design :

Let's walk through the whole process:
1. Extract list links:
Take this address as an example: https://www.douban.com/group/240355/discussion?start=0


Analysis shows that we can choose 【CSS selector】 and fill in directly:
td.title
The test result is shown in the figure.
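Outside the tool, the same `td.title` extraction can be approximated with Python's standard library. This is a sketch under assumptions: the sample HTML is hypothetical markup written for this example, and the parser only handles the simple case of an `<a>` tag inside a `<td class="title">` cell.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class TitleLinkParser(HTMLParser):
    """Collect href values of <a> tags found inside <td class="title"> cells."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.in_title_cell = False
        self.links = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "td":
            classes = (attributes.get("class") or "").split()
            self.in_title_cell = "title" in classes
        elif tag == "a" and self.in_title_cell and attributes.get("href"):
            # Resolve relative links against the list page address.
            self.links.append(urljoin(self.base_url, attributes["href"]))

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_title_cell = False


def extract_post_links(html, base_url):
    """Return absolute post links from every td.title cell, in page order."""
    parser = TitleLinkParser(base_url)
    parser.feed(html)
    return parser.links


# Hypothetical list-page markup, for illustration only.
LIST_HTML = """
<table>
  <tr><td class="title"><a href="https://www.douban.com/group/topic/1/">First</a></td>
      <td class="author"><a href="/people/x/">x</a></td></tr>
  <tr><td class="title"><a href="/group/topic/2/">Second</a></td></tr>
</table>
"""
LIST_URL = "https://www.douban.com/group/240355/discussion?start=0"
links = extract_post_links(LIST_HTML, LIST_URL)
```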

2. Extract the post title, content, publish time, etc.
These fields are scraped directly from the details page. As pictured, a CSS selector is all it takes to get each one.
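The selectors used in the screenshot are not given in the text, so the following is only a sketch of the idea: take the text of the first element that matches a tag and class. The class names `post-title`, `post-time`, and `post-content` and the sample markup are hypothetical, not Douban's real markup.

```python
from html.parser import HTMLParser


class FirstMatchText(HTMLParser):
    """Capture the text of the first element matching a tag and a class."""

    def __init__(self, tag, css_class):
        super().__init__()
        self.tag = tag
        self.css_class = css_class
        self.depth = 0      # > 0 while inside the matched element
        self.done = False   # True once the first match has been closed
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if self.depth > 0:
            if tag == self.tag:
                self.depth += 1  # same tag nested inside the match
            return
        classes = (dict(attrs).get("class") or "").split()
        if tag == self.tag and self.css_class in classes:
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth > 0 and tag == self.tag:
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        if self.depth > 0:
            self.parts.append(data)


def first_text(html, tag, css_class):
    """Return the whitespace-normalized text of the first match, or ''."""
    parser = FirstMatchText(tag, css_class)
    parser.feed(html)
    return " ".join(" ".join(parser.parts).split())


# Hypothetical details-page markup, for illustration only.
DETAIL_HTML = """
<h1 class="post-title"> Looking for a flat share </h1>
<span class="post-time">2022-06-25 03:54:00</span>
<div class="post-content"><p>First paragraph.</p><div>Nested block.</div></div>
<div class="footer">ignored</div>
"""
post = {
    "title": first_text(DETAIL_HTML, "h1", "post-title"),
    "publish_time": first_text(DETAIL_HTML, "span", "post-time"),
    "content": first_text(DETAIL_HTML, "div", "post-content"),
}
```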

3. Scrape comments:
First, we use a 【Field loop area】 to capture the whole comment loop.

The test is shown in the figure:

This may not be intuitive enough, so we click trace to analyze it:

We then extract the content and time of each comment from these loop units.
As pictured, adding one more 【Data Extraction】 component extracts them directly.
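The loop-then-extract idea can be sketched in Python as follows: each comment block is one loop unit, and the content and time are read from inside it. The markup and the class names `comment-item`, `pubtime`, and `reply-content` are assumptions made for this example, and the parser does not handle nested tags inside a field.

```python
from html.parser import HTMLParser


class CommentParser(HTMLParser):
    """Treat each <li class="comment-item"> as one loop unit."""

    def __init__(self):
        super().__init__()
        self.comments = []
        self.current = None  # dict for the comment unit being read
        self.field = None    # "time" or "content" while inside that element

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "li" and "comment-item" in classes:
            self.current = {"time": "", "content": ""}
        elif self.current is not None:
            if tag == "span" and "pubtime" in classes:
                self.field = "time"
            elif tag == "p" and "reply-content" in classes:
                self.field = "content"

    def handle_endtag(self, tag):
        if tag in ("span", "p"):
            self.field = None
        elif tag == "li" and self.current is not None:
            self.comments.append(self.current)  # loop unit finished
            self.current = None

    def handle_data(self, data):
        if self.current is not None and self.field:
            self.current[self.field] += data.strip()


def extract_comments(html):
    """Return one {'time', 'content'} dict per comment unit, in page order."""
    parser = CommentParser()
    parser.feed(html)
    return parser.comments


# Hypothetical comment markup, for illustration only.
COMMENTS_HTML = """
<ul>
  <li class="comment-item">
    <span class="pubtime">2022-06-25 04:00:00</span>
    <p class="reply-content">First reply</p>
  </li>
  <li class="comment-item">
    <span class="pubtime">2022-06-25 04:05:00</span>
    <p class="reply-content">Second reply</p>
  </li>
</ul>
"""
comments = extract_comments(COMMENTS_HTML)
```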

Comment pagination
Comments span multiple pages, so how do we page through them to get all the comments?

In the 【Details page】 component we simply drag out a 【Next page】 component, which just needs to get the link to the next page.
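Following the next-page link until there is none left can be sketched as a small loop. The function and parameter names here are assumptions for illustration: `fetch_page` returns a page for a URL, and `parse_page` returns the items on that page plus the next-page link (or `None` on the last page).

```python
def collect_all_pages(start_url, fetch_page, parse_page, max_pages=50):
    """Follow next-page links from start_url and gather items from every page."""
    items = []
    seen = set()  # guards against a next link that points back to a visited page
    url = start_url
    while url and url not in seen and len(seen) < max_pages:
        seen.add(url)
        page_items, next_url = parse_page(fetch_page(url))
        items.extend(page_items)
        url = next_url
    return items


# Fake two-page site used only to show the call shape.
FAKE_PAGES = {
    "page-1": (["comment a", "comment b"], "page-2"),
    "page-2": (["comment c"], None),
}
all_comments = collect_all_pages(
    "page-1",
    fetch_page=lambda url: url,                # a real fetcher would download HTML
    parse_page=lambda page: FAKE_PAGES[page],  # a real parser would read the HTML
)
```

The `max_pages` cap and the `seen` set are safety limits added for the sketch, so a malformed next link cannot cause an endless crawl.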
Data preview:


Without writing a single line of code or needing any advanced knowledge, it is easy to get the data.