
Sorting out common problems after crawler deployment

2022-06-23 06:06:00 Python Yixi

Once a crawler passes its local test run, some developers can't wait to deploy it to a server for production use, only to hit all kinds of errors after a while, sometimes even program crashes. Here are some common problems for reference:

  1、 Passing local debugging only shows that the flow from request to data parsing works once; it does not mean the program can collect data stably over a long period. The target website should be tested automatically: it is generally recommended to run a stability test for a set number of requests or a set length of time, and watch the site's responses and anti-crawling behavior. A minimal sketch is shown below.
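For example, a minimal stability check could repeat the request on a schedule and log the success rate and any unexpected status codes. This is only a sketch: the target URL, number of rounds, and interval below are placeholders to be replaced with your own values.

# Stability-check sketch (placeholder URL and parameters; adapt to your target site)
import time
import requests

TARGET = "https://example.com/page"   # placeholder target page
ROUNDS = 100                          # number of test requests
INTERVAL = 5                          # seconds between requests

ok = 0
for i in range(ROUNDS):
    try:
        r = requests.get(TARGET, timeout=10)
        if r.status_code == 200:
            ok += 1
        else:
            print("round %d: unexpected status %d" % (i, r.status_code))
    except requests.RequestException as e:
        print("round %d: request failed: %s" % (i, e))
    time.sleep(INTERVAL)

print("success rate: %.1f%%" % (100.0 * ok / ROUNDS))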

  2、 Add exception protection around data processing. If the data volume requirements are low, a single thread is enough; if they are high, add multithreading to improve the program's processing throughput. See the sketch after this item.
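One way to combine both points is to wrap each page's handling in try/except so that a single bad response cannot crash the whole run, and to fan the work out over a thread pool. The parse() helper and URLs below are hypothetical placeholders.

# Sketch: per-task exception protection plus a thread pool
from concurrent.futures import ThreadPoolExecutor
import requests

def parse(html):
    # placeholder for the real data-extraction logic
    return len(html)

def crawl_one(url):
    try:
        r = requests.get(url, timeout=10)
        r.raise_for_status()
        return url, parse(r.text)
    except Exception as e:
        # protect the worker so one failure does not kill the run
        return url, "error: %s" % e

urls = ["https://example.com/a", "https://example.com/b"]   # placeholder URLs
with ThreadPoolExecutor(max_workers=4) as pool:
    for url, result in pool.map(crawl_one, urls):
        print(url, result)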

  3、 Configure a suitable crawler proxy based on your data requirements and the target website's behavior; this reduces the risk of triggering anti-crawling measures. When comparing proxy products, focus on network latency, IP pool size, and request success rate, so you can quickly pick the right product. A rough comparison sketch follows.
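A rough way to compare proxy products on latency and success rate is to send a batch of requests through each candidate proxy against a test URL that echoes the exit IP. The proxy address below is a placeholder, and httpbin.org/ip is used only as an example echo service; IP pool size usually has to come from the vendor's documentation.

# Sketch: measure a proxy's latency and request success rate (placeholder proxy and test URL)
import time
import requests

proxies = {
    "http": "http://user:pass@proxy.example.com:3128",
    "https": "http://user:pass@proxy.example.com:3128",
}
TEST_URL = "https://httpbin.org/ip"   # returns the exit IP seen by the target

latencies, ok = [], 0
for _ in range(20):
    start = time.time()
    try:
        r = requests.get(TEST_URL, proxies=proxies, timeout=10)
        if r.status_code == 200:
            ok += 1
            latencies.append(time.time() - start)
            print("exit IP:", r.json().get("origin"))
    except requests.RequestException:
        pass

print("success rate: %d/20" % ok)
if latencies:
    print("average latency: %.2f s" % (sum(latencies) / len(latencies)))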

   Below is a demo program that counts requests and the distribution of exit IPs; it can also be adapted into a data collection program as needed:

# -*- coding: utf-8 -*-

import random

import requests
import requests.adapters

# Target pages to visit (fill in the URLs)
targetUrlList = [
    "https://",
    "https://",
    "https://",
]

# Proxy server (the product's official website: h.shenlongip.com)
proxyHost = "h.shenlongip.com"
proxyPort = "  "

# Proxy authentication information
proxyUser = "username"
proxyPass = "password"

proxyMeta = "http://%(user)s:%(pass)s@%(host)s:%(port)s" % {
    "host": proxyHost,
    "port": proxyPort,
    "user": proxyUser,
    "pass": proxyPass,
}

# Route both http and https requests through the HTTP proxy
proxies = {
    "http": proxyMeta,
    "https": proxyMeta,
}

# Random tunnel value: requests sharing the same value keep the same exit IP
tunnel = random.randint(1, 10000)

class HTTPAdapter(requests.adapters.HTTPAdapter):
    # Inject the Proxy-Tunnel header into the request sent to the proxy
    def proxy_headers(self, proxy):
        headers = super(HTTPAdapter, self).proxy_headers(proxy)
        if hasattr(self, 'tunnel'):
            headers['Proxy-Tunnel'] = str(self.tunnel)
        return headers

# Visit the target pages three times; reusing the same tunnel value keeps the same exit IP
for i in range(3):
    s = requests.session()
    a = HTTPAdapter()
    # Attach the tunnel value so the adapter sends it as the Proxy-Tunnel header
    a.tunnel = tunnel
    s.mount('https://', a)
    for url in targetUrlList:
        r = s.get(url, proxies=proxies)
        print(r.text)
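A note on the design: the Proxy-Tunnel value is sent to the proxy server itself through the adapter's proxy_headers hook, so whether it actually pins the exit IP depends on the proxy provider supporting that header; generating a new random tunnel value switches the session to a different exit IP.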

Copyright notice
This article was written by [Python Yixi]. Please include a link to the original when reposting, thank you.
https://yzsam.com/2022/01/202201142043415890.html