Crawler framework: crawler
2022-07-25 17:11:00 【wangmcn】
crawler
Contents
- 1. Introduction
- 2. Installation and deployment
- 3. Framework description
- 4. Using the framework
1. Introduction
crawler crawls pages with requests + lxml. Both the content to extract and the URLs to follow are selected with XPath expressions (for XPath itself, see the XPath reference manual).
GitHub: https://github.com/shuizhubocai/crawler
requests is an excellent third-party Python library, an HTTP library "for humans". It wraps much of the tedious HTTP plumbing and greatly reduces the amount of code needed to make an HTTP request.
lxml is a Python parsing library. It parses HTML and XML, supports XPath queries, and is fast because it is built on the C libraries libxml2 and libxslt.
2. Installation and deployment
The environment used here is 64-bit Windows with Python 3.6.5.
1. Download the framework from its website (the GitHub page above). The downloaded file is crawler-master.zip.
2. Unzip the file to a directory of your choice (for example D:\crawler).
3. In the installation directory, run pip install -r requrements.txt on the command line to install the libraries the framework depends on.
Contents of requrements.txt:
certifi==2018.4.16
chardet==3.0.4
idna==2.7
requests==2.19.1
urllib3==1.23
4. Install lxml, version 4.2.5.
Download address: https://www.lfd.uci.edu/~gohlke/pythonlibs/#lxml
Download the file that matches your environment: cp36 stands for Python 3.6 and win_amd64 for 64-bit Windows. Choose the matching file, otherwise the installation fails with an error saying the platform does not match.
When the download is complete, open a command line in the directory that contains the downloaded file and run:
pip install lxml-4.2.5-cp36-cp36m-win_amd64.whl
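The tag matching described above can be checked before running pip. The following is only a rough sketch: the helper names `parse_wheel_tags` and `interpreter_tags` are made up for this illustration, and the `cp` prefix assumes the standard CPython interpreter.

```python
import struct
import sys


def parse_wheel_tags(filename):
    """Split a wheel filename into its name, version, and compatibility tags."""
    stem = filename[:-len(".whl")] if filename.endswith(".whl") else filename
    parts = stem.split("-")
    # A wheel is named name-version[-build]-python-abi-platform, so the last
    # three fields are always the compatibility tags.
    python_tag, abi_tag, platform_tag = parts[-3:]
    return {
        "name": parts[0],
        "version": parts[1],
        "python": python_tag,
        "abi": abi_tag,
        "platform": platform_tag,
    }


def interpreter_tags():
    """Return the cpXY tag (CPython assumed) and pointer width of this interpreter."""
    python_tag = "cp{}{}".format(sys.version_info.major, sys.version_info.minor)
    bits = struct.calcsize("P") * 8  # 32 or 64
    return python_tag, bits


tags = parse_wheel_tags("lxml-4.2.5-cp36-cp36m-win_amd64.whl")
current_python, current_bits = interpreter_tags()
print("wheel targets:", tags["python"], tags["platform"])
print("this interpreter:", current_python, "{}-bit".format(current_bits))
```

If the two printed lines disagree (for example a cp36 wheel on a Python 3.7 interpreter), pip will refuse the file, which is the platform mismatch error mentioned above.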
3. Framework description
1. The crawler.py file:
Urls class: URL manager
Download class: page downloader
Parser class: page parser
Output class: exports the data to HTML
Scheduler class: crawler scheduler
2. The modules\useragent directory contains chrome.py, firefox.py, and other browser user-agent modules.
3. data.html: the crawled data is written to this file.
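To make the division of labour between these classes more concrete, here is a minimal sketch of how a URL manager and a scheduler loop can fit together. It is not the framework's actual code: `UrlManager`, `run`, and their parameters are hypothetical names, and the downloader and parser are passed in as plain functions so the loop can be tried without network access.

```python
from collections import deque


class UrlManager:
    """Hypothetical URL manager: hands out each URL at most once."""

    def __init__(self):
        self._pending = deque()
        self._seen = set()

    def add(self, url):
        # Ignore URLs that were already queued or crawled.
        if url not in self._seen:
            self._seen.add(url)
            self._pending.append(url)

    def add_many(self, urls):
        for url in urls:
            self.add(url)

    def has_next(self):
        return bool(self._pending)

    def next(self):
        return self._pending.popleft()


def run(start_url, download, parse_data, parse_urls, max_pages=10):
    """Scheduler loop: download a page, extract data, queue newly found URLs."""
    urls = UrlManager()
    urls.add(start_url)
    results = []
    pages = 0
    while urls.has_next() and pages < max_pages:
        page = download(urls.next())
        results.extend(parse_data(page))
        urls.add_many(parse_urls(page))
        pages += 1
    return results


# Stand-in "site" (invented data) so the loop runs offline.
site = {
    "p1": {"titles": ["a", "b"], "links": ["p2", "p1"]},
    "p2": {"titles": ["c"], "links": ["p1", "p3"]},
    "p3": {"titles": ["d"], "links": []},
}
results = run("p1", site.get, lambda p: p["titles"], lambda p: p["links"])
print(results)
```

Keeping a set of seen URLs next to the queue is what stops the loop from revisiting p1 when p2 links back to it.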
4. Using the framework
Requirement: visit the 51testing forum and collect the post titles and URLs from a specified range of pages (pages 1-10).
Figure: the post titles to collect.
Figure: pages 1-10 to crawl.
1. Modify the script (crawler.py).
(1) In the Parser class, change the html.xpath value in the getDatas method to:
//tbody[contains(@id,'normalthread')]/tr/th/a[3]
Figure: debugging and locating the expression with Firefox + FirePath.
(2) In the Parser class, change the html.xpath value in the getUrls method to:
//span[@id='fd_page_bottom']/div//a[not(@class)]//@href
Figure: debugging and locating the expression with Firefox + FirePath.
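Without FirePath, the two expressions can also be tried directly with lxml. The sketch below runs them against a small HTML page that was invented for this example; it imitates the structure the expressions imply (thread rows in tbody elements whose id contains normalthread, pagination links inside span#fd_page_bottom) and is not copied from the forum.

```python
from lxml import html as lxml_html

# Invented sample page, shaped to match the two XPath expressions.
SAMPLE_PAGE = """
<html><body>
<table>
  <tbody id="stickthread_1">
    <tr><th>
      <a href="#">icon</a><a href="#">[tag]</a>
      <a href="thread-1-1-1.html">Pinned post</a>
    </th></tr>
  </tbody>
  <tbody id="normalthread_101">
    <tr><th>
      <a href="#">icon</a><a href="#">[tag]</a>
      <a href="thread-101-1-1.html">First post</a>
    </th></tr>
  </tbody>
  <tbody id="normalthread_102">
    <tr><th>
      <a href="#">icon</a><a href="#">[tag]</a>
      <a href="thread-102-1-1.html">Second post</a>
    </th></tr>
  </tbody>
</table>
<span id="fd_page_bottom"><div class="pg">
  <strong>1</strong>
  <a href="forum-279-2.html">2</a>
  <a href="forum-279-3.html">3</a>
  <a href="forum-279-2.html" class="nxt">Next</a>
</div></span>
</body></html>
"""

TITLE_XPATH = "//tbody[contains(@id,'normalthread')]/tr/th/a[3]"
PAGE_XPATH = "//span[@id='fd_page_bottom']/div//a[not(@class)]//@href"

tree = lxml_html.fromstring(SAMPLE_PAGE)

# The first expression yields <a> elements: read the title and the href from each.
posts = [(a.text_content().strip(), a.get("href")) for a in tree.xpath(TITLE_XPATH)]

# The second expression ends in @href, so it yields the attribute values directly.
page_links = [str(href) for href in tree.xpath(PAGE_XPATH)]

print(posts)
print(page_links)
```

In this sample, `contains(@id,'normalthread')` skips the pinned row, and `not(@class)` drops the "Next" link, which would otherwise duplicate the link to page 2.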
(3) Instantiation
Add the start URL: http://bbs.51testing.com/forum-279-1.html
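The links found on a page are relative, and the requirement only covers pages 1-10. One possible way to resolve the links against the start URL and keep only that range is sketched below. It assumes that the last number in forum-279-N.html is the page number, which is inferred from the start URL rather than documented, and `pages_in_range` is a hypothetical helper, not part of the framework.

```python
import re
from urllib.parse import urljoin

START_URL = "http://bbs.51testing.com/forum-279-1.html"
# Assumption: list pages look like forum-<board>-<page>.html.
PAGE_PATTERN = re.compile(r"forum-(\d+)-(\d+)\.html$")


def pages_in_range(hrefs, base_url=START_URL, first=1, last=10):
    """Resolve hrefs against base_url and keep list pages first..last, without duplicates."""
    kept = []
    for href in hrefs:
        url = urljoin(base_url, href)
        match = PAGE_PATTERN.search(url)
        if match and first <= int(match.group(2)) <= last and url not in kept:
            kept.append(url)
    return kept


found = [
    "forum-279-2.html",
    "forum-279-10.html",
    "forum-279-11.html",  # outside the 1-10 range
    "forum-279-2.html",   # duplicate
    "thread-1-1-1.html",  # not a list page
]
selected = pages_in_range(found)
print(selected)
```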
2. Run the script (crawler.py).
In the installation directory, run python crawler.py on the command line.
3. View the crawl results.
After the script finishes, data.html is generated automatically in the installation directory.
Open data.html to see the crawled data. Clicking a title opens the corresponding URL in a new window.
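For readers curious how such a result page can be produced, here is a minimal sketch of an HTML exporter. It is an illustration under assumptions, not the framework's Output class: `render_rows` and `write_report` are invented names, and `target="_blank"` is simply one way to get the "opens in a new window" behaviour described above.

```python
from html import escape


def render_rows(posts):
    """Build an HTML list from (title, url) pairs; links open in a new window."""
    items = []
    for title, url in posts:
        items.append(
            '<li><a href="{}" target="_blank">{}</a></li>'.format(
                escape(url, quote=True), escape(title)
            )
        )
    return "<ul>\n" + "\n".join(items) + "\n</ul>"


def write_report(posts, path="data.html"):
    """Write the rendered list into a small standalone HTML page."""
    page = (
        "<!DOCTYPE html>\n"
        '<html><head><meta charset="utf-8"><title>Crawl results</title></head>\n'
        "<body>\n{}\n</body></html>\n"
    ).format(render_rows(posts))
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(page)


markup = render_rows(
    [("A <b>bold</b> title", "http://bbs.51testing.com/thread-1-1-1.html")]
)
print(markup)
```

Escaping the title matters because post titles come from user input; without it, a title containing markup would be rendered as HTML in the report.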