Scraping Douban Group Data with a Crawler
2022-06-25 03:54:00 【Blockchain research】
Douban groups have no anti-crawling measures, and you don't need to log in. Still, don't crawl too frequently, and rotate proxy IPs.
Douban's page structure is also clean: the content is in the page HTML itself, so there is no AJAX to deal with.
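The advice above (space out requests, rotate proxy IPs) can be sketched in plain Python for readers who are not using the visual tool. This is only a minimal sketch: the proxy addresses, the delay values, and the helper names `next_delay`, `build_opener_with_next_proxy`, and `fetch` are all assumptions for illustration, not part of the original workflow.

```python
import itertools
import random
import time
import urllib.request

# Hypothetical proxy addresses; replace them with proxies you actually control.
PROXIES = ["http://127.0.0.1:8001", "http://127.0.0.1:8002"]
_proxy_cycle = itertools.cycle(PROXIES)


def next_delay(base=2.0, jitter=1.0):
    """Return a pause in seconds: base plus a random jitter in [0, jitter]."""
    return base + random.uniform(0, jitter)


def build_opener_with_next_proxy():
    """Build a urllib opener that routes traffic through the next proxy in the rotation."""
    proxy = next(_proxy_cycle)
    handler = urllib.request.ProxyHandler({"http": proxy, "https": proxy})
    opener = urllib.request.build_opener(handler)
    opener.addheaders = [("User-Agent", "Mozilla/5.0")]
    return proxy, opener


def fetch(url):
    """Fetch one page through the next proxy, then sleep before returning."""
    _, opener = build_opener_with_next_proxy()
    with opener.open(url, timeout=10) as resp:
        html = resp.read().decode("utf-8", errors="replace")
    time.sleep(next_delay())  # keep consecutive requests spaced out
    return html
```

Calling `fetch(url)` in a loop then sends each request through a different proxy in turn, with a randomized pause of roughly two to three seconds between requests.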
Tool: a cloud-based crawler platform
Data structure:
We need to capture the posts and their reply (comment) data.
The data structure is as follows :
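The original diagram is not reproduced here, so as a rough sketch only, the fields named later in this article (post title, content, publish time, and each comment's content and time) could be modeled like this. The class and field names are assumptions, not the tool's actual schema.

```python
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class Comment:
    content: str  # text of the reply
    time: str     # publish time of the reply, kept as the raw string from the page


@dataclass
class Post:
    url: str           # link to the post's detail page
    title: str
    content: str
    publish_time: str
    comments: List[Comment] = field(default_factory=list)


# Example record with made-up values, converted to a plain dict for export.
post = Post(
    url="https://www.douban.com/group/topic/1/",  # hypothetical topic URL
    title="Example title",
    content="Example body",
    publish_time="2022-06-25 03:54:00",
)
post.comments.append(Comment(content="Example reply", time="2022-06-25 04:00:00"))
record = asdict(post)
```

Nesting the comments inside the post keeps one record per post, which matches the flow described below: one detail page, then a loop over its comments.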
Flow design:
Let's walk through the whole process:
1. Extract the list links:
Take this address as an example: https://www.douban.com/group/240355/discussion?start=0
Inspecting the page shows that we can choose 【CSS selector】 and enter directly:
td.title
Test result:
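For readers who want to see what the `td.title` selector does outside the tool, here is a small sketch using only the Python standard library. It collects the `href` of every link inside a `<td class="title">` cell. The sample HTML and topic URLs are made up for illustration and only assume the list page puts each topic link in such a cell.

```python
from html.parser import HTMLParser


class TitleLinkParser(HTMLParser):
    """Collect href values of <a> tags nested inside <td class="title">."""

    def __init__(self):
        super().__init__()
        self._in_title_cell = False
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "td" and "title" in (attrs.get("class") or "").split():
            self._in_title_cell = True
        elif tag == "a" and self._in_title_cell and attrs.get("href"):
            self.links.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_title_cell = False


def extract_topic_links(html):
    """Return the topic links found in td.title cells, in page order."""
    parser = TitleLinkParser()
    parser.feed(html)
    return parser.links


# Made-up sample that mimics a list page row layout.
SAMPLE_LIST_HTML = """
<table>
<tr><td class="title"><a href="https://www.douban.com/group/topic/1/">First</a></td>
    <td><a href="https://www.douban.com/people/someone/">author</a></td></tr>
<tr><td class="title"><a href="https://www.douban.com/group/topic/2/">Second</a></td>
    <td class="time">06-25</td></tr>
</table>
"""

links = extract_topic_links(SAMPLE_LIST_HTML)
```

Only links inside the title cells are kept; the author link in the neighboring cell is ignored, which is the point of scoping the selector to `td.title`.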
2. Extract the post title, content, publish time, etc.
These fields are captured directly from the detail page. As shown in the figure, a CSS selector is all it takes to get them.
3. Grab the comments:
First, we use a 【Field loop area】 to select the whole list of comments.
The test is shown in the figure:
This may not be intuitive enough, so we click trace to inspect it:
We then extract the content and time of each comment from these loop units.
As shown in the figure, one more 【Data extraction】 component is enough to extract them directly.
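The loop-then-extract idea above can be sketched in code: treat each comment element as one loop unit, and pull the content and time out of it. The class names `comment-item`, `pubtime`, and `reply-content` in this sketch are assumptions about the page markup, and the sample HTML is made up; check the real detail page before relying on them.

```python
from html.parser import HTMLParser


class CommentParser(HTMLParser):
    """Walk comment units (<li class="comment-item">) and extract content and time."""

    def __init__(self):
        super().__init__()
        self.comments = []
        self._current = None    # dict for the comment unit being read
        self._field = None      # field currently capturing text
        self._field_tag = None  # tag that opened the current field

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "li" and "comment-item" in classes:
            self._current = {"content": "", "time": ""}
        elif self._current is not None and self._field is None:
            if tag == "span" and "pubtime" in classes:
                self._field, self._field_tag = "time", tag
            elif tag == "p" and "reply-content" in classes:
                self._field, self._field_tag = "content", tag

    def handle_data(self, data):
        if self._current is not None and self._field:
            self._current[self._field] += data

    def handle_endtag(self, tag):
        if self._field and tag == self._field_tag:
            self._field = self._field_tag = None
        elif tag == "li" and self._current is not None:
            self.comments.append({k: v.strip() for k, v in self._current.items()})
            self._current = None


def extract_comments(html):
    """Return a list of {"content": ..., "time": ...} dicts, one per comment unit."""
    parser = CommentParser()
    parser.feed(html)
    return parser.comments


# Made-up sample that mimics a comment list.
SAMPLE_COMMENTS_HTML = """
<ul id="comments">
<li class="comment-item"><span class="pubtime">2022-06-24 10:00:00</span>
    <p class="reply-content">First reply</p></li>
<li class="comment-item"><span class="pubtime">2022-06-24 10:05:00</span>
    <p class="reply-content">Second reply</p></li>
</ul>
"""

comments = extract_comments(SAMPLE_COMMENTS_HTML)
```

This is a simplification: it does not handle a `<p>` nested inside the reply paragraph or an `<li>` nested inside a comment unit, which a full CSS-selector engine would.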
Comment pagination
Comments span multiple pages, so how do we page through them to get all the comments?
In the 【Detail page】 component, we simply attach a 【Next page】 component, which gets the link to the next page.
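The same next-page logic can be sketched as a loop that keeps following the next-page link until there is none. The sketch assumes the link sits inside a `<span class="next">` element, which is an assumption about the markup; `iter_pages` takes any `fetch` callable, so it is shown here against made-up in-memory pages rather than the live site.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class NextLinkParser(HTMLParser):
    """Find the first <a href> inside <span class="next">."""

    def __init__(self):
        super().__init__()
        self._in_next = False
        self.href = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "span" and "next" in (attrs.get("class") or "").split():
            self._in_next = True
        elif tag == "a" and self._in_next and self.href is None:
            self.href = attrs.get("href")

    def handle_endtag(self, tag):
        if tag == "span":
            self._in_next = False


def find_next_page(html, base_url):
    """Return the absolute URL of the next page, or None on the last page."""
    parser = NextLinkParser()
    parser.feed(html)
    return urljoin(base_url, parser.href) if parser.href else None


def iter_pages(start_url, fetch, max_pages=50):
    """Yield (url, html) for each page, following next links until they run out."""
    url, seen = start_url, set()
    while url and url not in seen and len(seen) < max_pages:
        seen.add(url)
        html = fetch(url)
        yield url, html
        url = find_next_page(html, url)


# Made-up pages standing in for a two-page comment thread.
FAKE_PAGES = {
    "https://example.com/topic/1/": '<span class="next"><a href="?start=100">next</a></span>',
    "https://example.com/topic/1/?start=100": '<span class="next">next</span>',
}

visited = [url for url, _ in iter_pages("https://example.com/topic/1/", FAKE_PAGES.__getitem__)]
```

The `seen` set and `max_pages` cap are safety guards so a page that links back to itself cannot cause an endless loop.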
Data preview:
With this tool you don't need to write a single line of code or have any advanced knowledge, and the data is easy to get.