A small crawler program written by a beginner
2022-06-25 00:41:00 【A big abdominal muscle】
I have been studying Python for three months and have gradually started my crawler journey. Following the guidance of a book, I set out to write a general-purpose crawler applet. I hope an expert can give me some advice.
import datetime
import time
import re
from selenium import webdriver


class MyCommonSpider:
    def __init__(self):
        pass
I used Selenium to simulate mouse and keyboard operations; the goal is to crawl the job listings on 51job.
    def get_data(self, url, send_keys='', pages_you_want=1, search_field='', search_button='', page_field='',
                 next_button=''):
        '''
        Fetch page data.
        :param url: address of the page to fetch
        :param send_keys: keyword to type into the search box
        :param pages_you_want: total number of pages to crawl
        :param search_field: XPath of the search box
        :param search_button: XPath of the search button
        :param page_field: XPath of the page-number input box
        :param next_button: XPath of the "next page" button
        :return: list[str] holding the page source of every crawled page
        '''
        browser = webdriver.Chrome()  # create a browser object (using Chrome)
        browser.maximize_window()  # display the window full screen
        browser.get(url)  # open the page
        time.sleep(3)
        if send_keys != '' and search_field != '':
            browser.find_element_by_xpath(search_field).clear()  # clear the search box
            browser.find_element_by_xpath(search_field).send_keys(send_keys)  # type the search keyword
        if search_button != '':
            browser.find_element_by_xpath(search_button).click()  # click the search button
            time.sleep(3)
        datas = []  # list that stores the source of every page
        for i in range(pages_you_want):
            time.sleep(3)
            print(f'Extracting data of page {i + 1}')
            datas.append(browser.page_source)  # save the current page source
            if page_field != "":  # the page has an input box for the page number
                if i + 1 < pages_you_want:  # only jump while more pages are wanted (skips the needless jump after the last page)
                    browser.find_element_by_xpath(page_field).clear()  # clear the page-number box
                    browser.find_element_by_xpath(page_field).send_keys(str(i + 2))  # type the next page number (i + 2)
                    browser.find_element_by_xpath(next_button).click()  # click the button that jumps to that page
        print('Page extraction finished, moving on to the analysis stage')
        browser.quit()
        return datas
The get_data function: some common page elements are passed in as parameters, and if statements control which ones are used, so the function collects the page source of every page that needs to be extracted.
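One note on the element lookups: recent Selenium 4 releases (4.3 and later) removed the find_element_by_xpath shortcut, so the calls above only run on older versions. A minimal sketch of the same lookup with the By API, assuming Selenium 4 is installed (the XPath is reused from the test method below):

from selenium import webdriver
from selenium.webdriver.common.by import By

browser = webdriver.Chrome()
browser.get('https://www.51job.com')
search_field = '//*[@id="kwdselectid"]'
box = browser.find_element(By.XPATH, search_field)  # Selenium 4 replacement for find_element_by_xpath
box.clear()
box.send_keys('java')

A WebDriverWait with expected_conditions would likewise be a more robust replacement for the fixed time.sleep(3) pauses, though the sketch above keeps things minimal.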
    def data_analysis(self, data_source, p_list):
        '''
        Analyse the extracted page data.
        :param data_source: list of page sources to analyse
        :param p_list: list of [column name, regular expression] pairs matching page elements
        :return analysis_dict: dictionary mapping each column name to a list of per-page match lists
        '''
        analysis_dict = {}  # dictionary that stores the analysed data
        for i in p_list:  # create an empty list in the dictionary for every pattern
            analysis_dict[i[0]] = []  # i[0] is the key (column name) stored at index 0 of each entry in p_list
        for data in data_source:  # loop over the source of each page
            for j in p_list:  # loop over the regular expressions to match
                # j[0] is the key, j[1] the regular expression, analysis_dict[j[0]] the target list
                analysis_dict[j[0]].append(re.findall(j[1], data, re.S))  # match against the current page and store the result list
        return analysis_dict
The data-analysis function. I think the dictionary approach is a bit convoluted — I mainly chose it so that the column headers can be generated automatically in Excel, but I don't know how to optimise it. The column-header name is the key, and the value holds the data extracted from multiple pages as a list of lists (one list per page).
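For what it's worth, one possible simplification (a sketch, not the author's method): have the analysis step extend a flat list per key directly, so the per-page nesting — and therefore the join_list_in_dict step below — never arises:

import re

def data_analysis_flat(data_source, p_list):
    # build {column name: flat list of matches} in one pass
    analysis_dict = {name: [] for name, _ in p_list}
    for data in data_source:  # one page source at a time
        for name, pattern in p_list:
            analysis_dict[name].extend(re.findall(pattern, data, re.S))  # extend, don't append
    return analysis_dict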
    def join_list_in_dict(self, data_dict):
        '''
        Convert the data in the dictionary in place: each value changes from
        a list of per-page lists (two-dimensional) to one flat list (one-dimensional).
        :param data_dict: dictionary that stores all the data
        :return: None (the dictionary is modified in place)
        '''
        for k, v in data_dict.items():
            changed_list = []
            for i in range(len(v)):
                changed_list += v[i]
            data_dict[k] = changed_list
A follow-up data-processing function. Because of the dictionary problem described above, every key-value pair has to be re-integrated, merging the multiple per-page lists in each value into a single list.
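If the two-dimensional structure is kept, the flattening itself can also be written more compactly with itertools.chain — a sketch equivalent to join_list_in_dict above:

from itertools import chain

def join_list_in_dict_flat(data_dict):
    # flatten each list of per-page lists into one list, in place
    for k, v in data_dict.items():
        data_dict[k] = list(chain.from_iterable(v))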
    def save_to_excel(self, analysis_dict, excel_name, sheet_name=''):
        '''
        Save the data to an Excel file.
        :param analysis_dict: the processed data dictionary
        :param excel_name: name of the file to write
        :param sheet_name: optional sheet name
        :return: None
        '''
        import pandas as pd
        data = pd.DataFrame(analysis_dict)
        if sheet_name != "":
            data.to_excel(excel_name, index=True, sheet_name=sheet_name)
        else:
            data.to_excel(excel_name, index=True)
        print(f'Excel file {excel_name} saved')
The function that saves to Excel. The column headers are generated automatically, but I feel the result isn't worth the effort — the logic is a bit too convoluted and almost knocked me out.
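One pitfall worth guarding against here: pd.DataFrame raises a ValueError when the column lists have different lengths, which can happen whenever one regex matches fewer items on a page than another (the per-key len(v) printout in the test method below is presumably there to check exactly this). A defensive sketch that pads the shorter columns before building the frame — the fill value is my own choice, not part of the original:

import pandas as pd

def to_padded_dataframe(analysis_dict, fill=''):
    # pad every column to the longest one so DataFrame construction cannot fail
    longest = max(len(v) for v in analysis_dict.values())
    padded = {k: v + [fill] * (longest - len(v)) for k, v in analysis_dict.items()}
    return pd.DataFrame(padded)

Note also that writing .xlsx files with to_excel requires the openpyxl package to be installed.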
    def test(self):
        '''Test method.'''
        now = datetime.datetime.now()
        url = 'https://www.51job.com'  # use 51job for the test
        send_keys = 'java'
        pages_you_want = 5  # test with 5 pages of data
        search_field = '//*[@id="kwdselectid"]'
        search_button = '/html/body/div[3]/div/div[1]/div/button'
        page_field = '//*[@id="jump_page"]'
        next_button = '/html/body/div[2]/div[3]/div/div[2]/div[4]/div[2]/div/div/div/span[3]'
        datas = self.get_data(url=url,
                              send_keys=send_keys,
                              pages_you_want=pages_you_want,
                              search_field=search_field,
                              search_button=search_button,
                              page_field=page_field,
                              next_button=next_button
                              )
        # print(datas[0])
        p_job = ['Job title', '<p class="t"><span title="(.*?)".*?</span>']  # job title
        p_time = ['Release time', '<span class="time">(.*?)</span>']  # release time
        p_salary = ['Salary', '<p class="info"><span class="sal">(.*?)</span>']  # salary level
        p_company = ['Company name', '<div class="er">.*?title="(.*?)".*?</a>']  # company name
        p_link = ['Link', '<div class="e_icons ick"></div> <a href="(.*?)"']  # link address
        p_needs = ['Requirements', '<span class="d at">(.*?)</span>']  # requirements
        p_list = [p_job, p_salary, p_company, p_needs, p_time, p_link]  # patterns to match
        analysis_dict = self.data_analysis(data_source=datas,
                                           p_list=p_list)
        self.join_list_in_dict(data_dict=analysis_dict)  # re-join the per-page lists of every key into one flat list
        for k, v in analysis_dict.items():
            print(k, len(v))
        self.save_to_excel(analysis_dict, f'{send_keys} recruitment information.xlsx', sheet_name=send_keys)  # save the dictionary to Excel
        print(f'Done, time taken: {datetime.datetime.now() - now}')  # measure the elapsed time
The test function. The target site is 51job, searching for the keyword java and extracting 5 pages of elements; the Excel file is generated normally. I feel the next step should be to add some flow-control statements to the program.
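Finally, the post never shows how the class is invoked; assuming the test method is the intended entry point, a minimal guard at the bottom of the script would run it:

if __name__ == '__main__':
    spider = MyCommonSpider()
    spider.test()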