
Crawler framework crawler

2022-07-25 17:11:00 【wangmcn】

crawler

Contents

1、 Introduction
2、 Installation and deployment
3、 Framework description
4、 Using the framework

1、 Introduction

crawler uses requests + lxml to crawl pages. Both the content to extract and the URLs to follow are selected with XPath expressions (for XPath, see the XPath reference manual chapter).

GitHub repository: https://github.com/shuizhubocai/crawler

requests is an excellent third-party Python library, an HTTP library "for humans". It wraps a lot of tedious HTTP functionality and greatly reduces the amount of code needed to make an HTTP request.

lxml is a Python parsing library. It supports HTML and XML parsing, supports XPath queries, and parses very efficiently.
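To make the requests + lxml + XPath combination concrete, here is a minimal sketch. It is not taken from the crawler source: the function names `fetch_html` and `extract_links`, the `User-Agent` string, and the sample HTML are assumptions made for illustration.

```python
import requests
from lxml import etree


def fetch_html(url, timeout=10):
    """Download a page with requests and return its decoded text."""
    # The User-Agent value is an illustrative placeholder, not the framework's.
    response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=timeout)
    response.raise_for_status()
    return response.text


def extract_links(html, xpath="//a"):
    """Return (text, href) pairs for every element matched by the XPath expression."""
    tree = etree.HTML(html)
    pairs = []
    for node in tree.xpath(xpath):
        text = "".join(node.itertext()).strip()
        pairs.append((text, node.get("href")))
    return pairs


# Hypothetical page fragment, used so the example runs without network access.
SAMPLE = """
<html><body>
  <a class="title" href="/thread-1.html">First post</a>
  <a class="title" href="/thread-2.html">Second post</a>
  <a href="/about.html">About</a>
</body></html>
"""

if __name__ == "__main__":
    for title, href in extract_links(SAMPLE, '//a[@class="title"]'):
        print(title, href)
```

In real use, the string returned by `fetch_html(url)` would be passed to `extract_links` in place of `SAMPLE`, with an XPath expression written for the target page.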

2、 Installation and deployment

The environment used here is Windows (64-bit) with Python 3.6.5.

1、 Download the framework from the project page. The downloaded file is crawler-master.zip.

2、 Unzip the file to a directory of your choice (for example D:\crawler).

3、 In the installation directory, run pip install -r requrements.txt from the command line to install the libraries the framework depends on.

Contents of requrements.txt:

certifi==2018.4.16
chardet==3.0.4
idna==2.7
requests==2.19.1
urllib3==1.23
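After installing, it can be useful to confirm that the pinned versions are the ones actually present. The sketch below parses the pins listed above and compares them with the installed distributions. The helper names `parse_pins` and `check_pins` are hypothetical, and `importlib.metadata` requires Python 3.8 or later, which is newer than the 3.6.5 interpreter used in this article.

```python
from importlib import metadata

# The pinned dependencies listed above.
REQUIREMENTS = """\
certifi==2018.4.16
chardet==3.0.4
idna==2.7
requests==2.19.1
urllib3==1.23
"""


def parse_pins(text):
    """Turn 'name==version' lines into a {name: version} dict."""
    pins = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if "==" not in line:
            continue  # skip blank lines and unpinned entries
        name, version = line.split("==", 1)
        pins[name.strip().lower()] = version.strip()
    return pins


def check_pins(pins):
    """Map each package to (wanted, installed); installed is None if missing."""
    report = {}
    for name, wanted in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        report[name] = (wanted, installed)
    return report


if __name__ == "__main__":
    for name, (wanted, installed) in check_pins(parse_pins(REQUIREMENTS)).items():
        status = "ok" if wanted == installed else "MISMATCH"
        print(f"{name}: wanted {wanted}, installed {installed} [{status}]")
```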

4、 Install lxml, version 4.2.5.

Download address: https://www.lfd.uci.edu/~gohlke/pythonlibs/#lxml

Download the build that matches your environment: cp36 stands for Python 3.6 and win_amd64 for 64-bit Windows. If you pick the wrong file, the installation fails with an error saying the platform does not match.

Once the download is complete, open a command line in the directory that contains the file and run:

pip install lxml-4.2.5-cp36-cp36m-win_amd64.whl
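The tags in the wheel filename can be checked against the running interpreter before installing. This is a minimal sketch, assuming CPython and the standard `name-version-python-abi-platform.whl` naming scheme. The helpers `parse_wheel_filename` and `interpreter_tags` are hypothetical names introduced here.

```python
import struct
import sys


def parse_wheel_filename(filename):
    """Split a wheel filename into its name, version and compatibility tags."""
    if not filename.endswith(".whl"):
        raise ValueError("not a wheel file: " + filename)
    parts = filename[: -len(".whl")].split("-")
    # 5 parts normally, 6 when an optional build tag is present.
    if len(parts) not in (5, 6):
        raise ValueError("unexpected wheel filename: " + filename)
    python_tag, abi_tag, platform_tag = parts[-3:]
    return {
        "name": parts[0],
        "version": parts[1],
        "python": python_tag,
        "abi": abi_tag,
        "platform": platform_tag,
    }


def interpreter_tags():
    """Return the CPython tag (e.g. 'cp36') and pointer width in bits (32 or 64)."""
    python_tag = "cp{}{}".format(sys.version_info.major, sys.version_info.minor)
    bits = struct.calcsize("P") * 8
    return python_tag, bits


if __name__ == "__main__":
    wheel = parse_wheel_filename("lxml-4.2.5-cp36-cp36m-win_amd64.whl")
    python_tag, bits = interpreter_tags()
    print("wheel tags:", wheel["python"], wheel["platform"])
    print("this interpreter:", python_tag, "{}-bit".format(bits))
    if wheel["python"] != python_tag:
        print("Python tag differs; pip would likely reject this wheel.")
```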

3、 Framework description

1、 The crawler.py file:

Urls class: address (URL) manager

Download class: page downloader

Parser class: page parser

Output class: exports data to HTML

Scheduler class: crawler scheduler

2、 The modules\useragent directory contains chrome.py, firefox.py, and other browser user-agent modules.

3、 data.html: the crawled data is written to this file.
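How an address manager and a scheduler typically cooperate can be sketched as follows. This is not the code in crawler.py. `UrlManager`, `run`, and the `download` and `parse` callables are hypothetical stand-ins that only illustrate the division of work between the classes described above.

```python
from collections import deque


class UrlManager:
    """Hypothetical address manager: queues new URLs and skips ones already seen."""

    def __init__(self):
        self._pending = deque()  # URLs waiting to be crawled, in insertion order
        self._seen = set()       # every URL ever added, used to drop duplicates

    def add(self, url):
        if url not in self._seen:
            self._seen.add(url)
            self._pending.append(url)

    def has_pending(self):
        return bool(self._pending)

    def pop(self):
        return self._pending.popleft()


def run(start_url, download, parse, max_pages=10):
    """Hypothetical scheduler loop.

    download(url) returns the page HTML.
    parse(html) returns (items, new_urls).
    """
    urls = UrlManager()
    urls.add(start_url)
    results = []
    crawled = 0
    while urls.has_pending() and crawled < max_pages:
        html = download(urls.pop())
        items, new_urls = parse(html)
        results.extend(items)
        for url in new_urls:
            urls.add(url)  # duplicates are ignored by the manager
        crawled += 1
    return results


if __name__ == "__main__":
    # Fake two-page site so the loop can run without network access.
    site = {"a": (["A"], ["b", "a"]), "b": (["B"], ["a"])}
    print(run("a", download=lambda url: url, parse=lambda html: site[html]))
```

Passing the downloader and parser in as callables keeps the loop testable without a network. An output step would then take the returned list and write it to a file such as data.html.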

4、 Using the framework

Requirement: visit the 51testing forum and get the post titles and URLs from the specified range of pages (1-10).