
Sorting out common problems after crawler deployment

2022-06-23 06:06:00 Python Yixi

Once a crawler passes its local test run, some developers can't wait to deploy it to a server for production use, only to hit all kinds of errors after a while, sometimes even program crashes. Here are some common problems for reference:

  1、 Passing local debugging only shows that the flow from request to data parsing works once; it does not mean the program can collect data stably over a long period. The target website should be tested automatically: it is generally recommended to run a stability test for a set number of requests or a set length of time, and watch the site's responses and anti-crawling behavior. A minimal sketch is shown below.
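For example, a minimal stability check could repeat the request on a schedule and log the success rate and any unexpected status codes. This is only a sketch: the target URL, number of rounds, and interval below are placeholders to be replaced with your own values.

# Stability-check sketch (placeholder URL and parameters; adapt to your target site)
import time
import requests

TARGET = "https://example.com/page"   # placeholder target page
ROUNDS = 100                          # number of test requests
INTERVAL = 5                          # seconds between requests

ok = 0
for i in range(ROUNDS):
    try:
        r = requests.get(TARGET, timeout=10)
        if r.status_code == 200:
            ok += 1
        else:
            print("round %d: unexpected status %d" % (i, r.status_code))
    except requests.RequestException as e:
        print("round %d: request failed: %s" % (i, e))
    time.sleep(INTERVAL)

print("success rate: %.1f%%" % (100.0 * ok / ROUNDS))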

  2、 Add exception protection around data processing. If the data volume requirements are low, a single thread is enough; if they are high, add multithreading to improve the program's processing throughput. See the sketch after this item.
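One way to combine both points is to wrap each page's handling in try/except so that a single bad response cannot crash the whole run, and to fan the work out over a thread pool. The parse() helper and URLs below are hypothetical placeholders.

# Sketch: per-task exception protection plus a thread pool
from concurrent.futures import ThreadPoolExecutor
import requests

def parse(html):
    # placeholder for the real data-extraction logic
    return len(html)

def crawl_one(url):
    try:
        r = requests.get(url, timeout=10)
        r.raise_for_status()
        return url, parse(r.text)
    except Exception as e:
        # protect the worker so one failure does not kill the run
        return url, "error: %s" % e

urls = ["https://example.com/a", "https://example.com/b"]   # placeholder URLs
with ThreadPoolExecutor(max_workers=4) as pool:
    for url, result in pool.map(crawl_one, urls):
        print(url, result)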

  3、 Configure a suitable crawler proxy based on your data requirements and the target website's behavior; this reduces the risk of triggering anti-crawling measures. When comparing proxy products, focus on network latency, IP pool size, and request success rate, so you can quickly pick the right product. A rough comparison sketch follows.
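A rough way to compare proxy products on latency and success rate is to send a batch of requests through each candidate proxy against a test URL that echoes the exit IP. The proxy address below is a placeholder, and httpbin.org/ip is used only as an example echo service; IP pool size usually has to come from the vendor's documentation.

# Sketch: measure a proxy's latency and request success rate (placeholder proxy and test URL)
import time
import requests

proxies = {
    "http": "http://user:pass@proxy.example.com:3128",
    "https": "http://user:pass@proxy.example.com:3128",
}
TEST_URL = "https://httpbin.org/ip"   # returns the exit IP seen by the target

latencies, ok = [], 0
for _ in range(20):
    start = time.time()
    try:
        r = requests.get(TEST_URL, proxies=proxies, timeout=10)
        if r.status_code == 200:
            ok += 1
            latencies.append(time.time() - start)
            print("exit IP:", r.json().get("origin"))
    except requests.RequestException:
        pass

print("success rate: %d/20" % ok)
if latencies:
    print("average latency: %.2f s" % (sum(latencies) / len(latencies)))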

   Below is a demo program that counts requests and the distribution of exit IPs; it can also be adapted into a data collection program as needed:

# -*- coding: utf-8 -*-

import random

import requests
import requests.adapters

# Target pages to visit (fill in the URLs)
targetUrlList = [
    "https://",
    "https://",
    "https://",
]

# Proxy server (the product's official website: h.shenlongip.com)
proxyHost = "h.shenlongip.com"
proxyPort = "  "

# Proxy authentication information
proxyUser = "username"
proxyPass = "password"

proxyMeta = "http://%(user)s:%(pass)s@%(host)s:%(port)s" % {
    "host": proxyHost,
    "port": proxyPort,
    "user": proxyUser,
    "pass": proxyPass,
}

# Route both http and https requests through the HTTP proxy
proxies = {
    "http": proxyMeta,
    "https": proxyMeta,
}

# Random tunnel value: requests sharing the same value keep the same exit IP
tunnel = random.randint(1, 10000)

class HTTPAdapter(requests.adapters.HTTPAdapter):
    # Inject the Proxy-Tunnel header into the request sent to the proxy
    def proxy_headers(self, proxy):
        headers = super(HTTPAdapter, self).proxy_headers(proxy)
        if hasattr(self, 'tunnel'):
            headers['Proxy-Tunnel'] = str(self.tunnel)
        return headers

# Visit the target pages three times; reusing the same tunnel value keeps the same exit IP
for i in range(3):
    s = requests.session()
    a = HTTPAdapter()
    # Attach the tunnel value so the adapter sends it as the Proxy-Tunnel header
    a.tunnel = tunnel
    s.mount('https://', a)
    for url in targetUrlList:
        r = s.get(url, proxies=proxies)
        print(r.text)
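A note on the design: the Proxy-Tunnel value is sent to the proxy server itself through the adapter's proxy_headers hook, so whether it actually pins the exit IP depends on the proxy provider supporting that header; generating a new random tunnel value switches the session to a different exit IP.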

Copyright notice
This article was written by [Python Yixi]. Please include a link to the original when reposting, thank you.
https://yzsam.com/2022/01/202201142043415890.html