
A small crawler program written by a beginner

2022-06-25 00:41:00 A big abdominal muscle

I have been studying Python for three months and have gradually started my crawler journey. Following the guidance of a book, I wanted to write a general-purpose crawler applet, and I hope someone more experienced can give me some advice.

import datetime
import time
from selenium import webdriver
import re


class MyCommonSpider:
    def __init__(self):
        pass

I used Selenium to simulate mouse and keyboard actions; the goal is to crawl job listings from 51job.

    def get_data(self, url, send_keys='', pages_you_want=1, search_field='', search_button='', page_field='',
                 next_button=''):
        '''
        Fetch the page source for each page.
        :param url: address of the page to open
        :param send_keys: keyword to type into the search box
        :param pages_you_want: total number of pages to crawl
        :param search_field: XPath of the search box
        :param search_button: XPath of the search button
        :param page_field: XPath of the page-number input box
        :param next_button: XPath of the "next page" button
        :return: list[str] holding the source of every crawled page
        '''
        browser = webdriver.Chrome()  # get a browser object; use Chrome
        browser.maximize_window()  # display the full page
        browser.get(url)  # open the page
        time.sleep(3)
        if send_keys != '' and search_field != '':
            # note: find_element_by_xpath is the Selenium 3 API; Selenium 4 uses find_element(By.XPATH, ...)
            browser.find_element_by_xpath(search_field).clear()  # clear the search box
            browser.find_element_by_xpath(search_field).send_keys(send_keys)  # type the search keyword
        if search_button != '':
            browser.find_element_by_xpath(search_button).click()  # click the search button
        time.sleep(3)
        datas = []  # list that stores the source of every page
        for i in range(pages_you_want):
            time.sleep(3)
            print(f'Extracting page {i + 1}')
            datas.append(browser.page_source)  # save the current page's source
            if page_field != "" and pages_you_want > 1:  # the page has a box for typing a page number
                browser.find_element_by_xpath(page_field).clear()  # clear the page-number box
                browser.find_element_by_xpath(page_field).send_keys(str(i + 2))  # type the next page's number (i + 2)
            if i < pages_you_want - 1 and next_button != '':  # don't click "next" after the last page
                browser.find_element_by_xpath(next_button).click()  # click the next-page element
        print('Page extraction finished; moving on to the analysis stage')
        browser.quit()
        return datas

The get_data function takes some common page elements as parameters; the if checks control the flow, and the function collects the source string of every page that needs to be extracted.
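The loop logic above can be checked without launching Chrome. This is a minimal sketch of the same control flow; `FakeBrowser` and `collect_pages` are illustrative names I made up for the demonstration, not part of the original class.

```python
# A minimal sketch of get_data's loop: one page_source string is
# collected per page, and "next" is not clicked after the last page.
# FakeBrowser is a stand-in so the flow can be tested without a browser.
class FakeBrowser:
    def __init__(self):
        self.page = 1

    @property
    def page_source(self):
        return f"<html>page {self.page}</html>"

    def click_next(self):
        self.page += 1


def collect_pages(browser, pages_you_want):
    datas = []
    for i in range(pages_you_want):
        datas.append(browser.page_source)  # save the current page's source
        if i < pages_you_want - 1:         # no "next" click after the last page
            browser.click_next()
    return datas


pages = collect_pages(FakeBrowser(), 3)
# pages == ['<html>page 1</html>', '<html>page 2</html>', '<html>page 3</html>']
```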

    def data_analysis(self, data_source, p_list):
        '''
        Analyze all the extracted page data.
        :param data_source: list of page-source strings to analyze
        :param p_list: list of [key, pattern] pairs, one regular expression per page element
        :return analysis_dict: dict mapping each key to a list of per-page match lists
        '''
        analysis_dict = {}  # dict that stores the analyzed data
        for i in p_list:  # create an empty list in the dict for every pattern
            analysis_dict[i[0]] = []  # i[0] is the key name stored in p_list; each regular expression to match gets its own list
        for data in data_source:  # take one page's source string at a time
            for j in p_list:  # take one [key, pattern] pair at a time
                # j[0] is the key, j[1] the regular expression, analysis_dict[j[0]] the target list
                analysis_dict[j[0]].append(re.findall(j[1], data, re.S))  # match against the current page's data and save the results to the corresponding list in the dict
        return analysis_dict

The data-analysis function. I think the dictionary is a bit convoluted, but I don't know how to simplify it; I mainly chose it so that column headers can be generated automatically in Excel. The keys hold the column-header names, and each value is a list of per-page match lists (the matches extracted from each page are stored in their own list).
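To make that structure concrete, here is a toy run of the same logic on made-up HTML snippets and a made-up pattern (none of these strings are from 51job):

```python
import re

# Toy illustration of the dict that data_analysis builds: one key per
# field, and under each key one list of matches per crawled page.
pages = [
    '<span title="Java Dev"></span><span title="Backend Dev"></span>',
    '<span title="Data Engineer"></span>',
]
p_list = [['Job title', '<span title="(.*?)">']]

analysis_dict = {key: [] for key, _ in p_list}
for page in pages:
    for key, pattern in p_list:
        analysis_dict[key].append(re.findall(pattern, page, re.S))

# analysis_dict == {'Job title': [['Java Dev', 'Backend Dev'], ['Data Engineer']]}
```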

    def join_list_in_dict(self, data_dict):
        '''
        Convert the data: flatten every value in the dictionary in place.
        :param data_dict: dict holding all the data; the original format k: v[][] is
                          converted to k: v[], i.e. each value goes from a two-dimensional
                          array to a one-dimensional list
        :return:
        '''
        for k, v in data_dict.items():
            changed_list = []
            for i in range(len(v)):
                changed_list += v[i]
            data_dict[k] = changed_list

A follow-up data-processing function. Because of the dictionary structure above, every key-value pair has to be reassembled: the several per-page lists inside each value are merged into one list.
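The same flattening can be written more idiomatically with `itertools.chain.from_iterable`; this is just an alternative sketch of what join_list_in_dict already does in place:

```python
from itertools import chain

# Flatten each value's list-of-lists into one flat list, in place.
data_dict = {'Job title': [['Java Dev', 'Backend Dev'], ['Data Engineer']]}
for k, v in data_dict.items():
    data_dict[k] = list(chain.from_iterable(v))

# data_dict == {'Job title': ['Java Dev', 'Backend Dev', 'Data Engineer']}
```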

    def save_to_excel(self, analysis_dict, excel_name, sheet_name=''):
        '''
        Save the data to an Excel file.
        :param analysis_dict: the processed data dictionary
        :param excel_name: file name of the Excel file to write
        :param sheet_name: optional sheet name
        :return:
        '''
        import pandas as pd
        data = pd.DataFrame(analysis_dict)
        if sheet_name != "":
            data.to_excel(excel_name, index=True, sheet_name=sheet_name)
        else:
            data.to_excel(excel_name, index=True)
        print(f'Excel file {excel_name} saved')

The function that saves to Excel. Column headers are generated automatically, but it feels like more trouble than it's worth; the logic got a little too complicated and nearly knocked me out.
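One pitfall here: `pd.DataFrame(dict)` raises a ValueError when the lists have different lengths, which can happen if one of the regexes misses items on some pages. A minimal guard, sketched in pure Python (`pad_columns` is a hypothetical helper, not part of the original class), pads every column to the longest length before building the frame:

```python
# Pad every value in the dict with a filler so all columns have equal
# length; otherwise pd.DataFrame(analysis_dict) raises a ValueError.
def pad_columns(analysis_dict, filler=''):
    longest = max(len(v) for v in analysis_dict.values())
    return {k: v + [filler] * (longest - len(v)) for k, v in analysis_dict.items()}


padded = pad_columns({'Job title': ['a', 'b', 'c'], 'Release time': ['01-01', '01-02']})
# padded == {'Job title': ['a', 'b', 'c'], 'Release time': ['01-01', '01-02', '']}
```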

    def test(self):
        '''Test method.'''
        now = datetime.datetime.now()
        url = 'https://www.51job.com'  # test against 51job
        send_keys = 'java'
        pages_you_want = 5  # test with 5 pages of data
        search_field = '//*[@id="kwdselectid"]'
        search_button = '/html/body/div[3]/div/div[1]/div/button'
        page_field = '//*[@id="jump_page"]'
        next_button = '/html/body/div[2]/div[3]/div/div[2]/div[4]/div[2]/div/div/div/span[3]'

        datas = self.get_data(url=url,
                              send_keys=send_keys,
                              pages_you_want=pages_you_want,
                              search_field=search_field,
                              search_button=search_button,
                              page_field=page_field,
                              next_button=next_button
                              )
        # print(datas[0])
        p_job = ['Job title', '<p class="t"><span title="(.*?)".*?</span>']  # job title
        p_time = ['Posting time', '<span class="time">(.*?)</span>']  # posting time
        p_salary = ['Salary level', '<p class="info"><span class="sal">(.*?)</span>']  # salary level
        p_company = ['Company name', '<div class="er">.*?title="(.*?)".*?</a>']  # company name
        p_link = ['Link address', '<div class="e_icons ick"></div> <a href="(.*?)"']  # link address
        p_needs = ['Requirements', '<span class="d at">(.*?)</span>']  # requirements
        p_list = [p_job, p_salary, p_company, p_needs, p_time, p_link]  # patterns to match
        analysis_dict = self.data_analysis(data_source=datas,
                                           p_list=p_list)
        self.join_list_in_dict(data_dict=analysis_dict)  # pass the dict in to re-join the per-page lists
        for k, v in analysis_dict.items():
            print(k, len(v))
        self.save_to_excel(analysis_dict, f'{send_keys} job listings.xlsx', sheet_name=send_keys)  # save the dict to Excel
        print(f'Done, time taken: {datetime.datetime.now() - now}')  # measure the elapsed time

The test function. Target site: 51job, searching for the keyword java; it crawled five pages of listings and generated the Excel file normally. I think the next step should be to add some flow-control and error-handling statements.


Copyright notice
This article was written by [A big abdominal muscle]. Please include the original link when reposting, thanks:
https://yzsam.com/2022/02/202202210546262002.html