2-6. Automatic acquisition
2022-07-25 07:34:00 【Green bamboo memory】
1. Using Selenium
1. Preparation (installation)
This section uses Chrome as an example to explain the usage of Selenium. Before starting, make sure the Chrome browser is installed and ChromeDriver is configured. In addition, the Python Selenium library must be installed correctly:
pip install selenium
2. Declaring browser objects
1. Visit http://chromedriver.storage.googleapis.com/index.html and download the chromedriver.exe that matches your browser version (the driver and browser versions must match).
2. Unzip the downloaded file and place it in the Chrome browser directory (or anywhere on your system PATH).
from selenium import webdriver
# Any one of the following creates a driver for the corresponding browser
browser = webdriver.Chrome()
browser = webdriver.Firefox()
browser = webdriver.Edge()
browser = webdriver.PhantomJS()
browser = webdriver.Safari()
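If the driver is not on PATH, the older Selenium versions that match the find_element_by_* API used throughout this article also accept an explicit path (a hedged sketch; the path below is only a placeholder for wherever you unzipped the driver):

```python
from selenium import webdriver

# Hypothetical driver location; replace with your own chromedriver path
browser = webdriver.Chrome(executable_path=r'C:\tools\chromedriver.exe')
```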
3. Basic usage
With the preparation done, let's first get a general feel for Selenium's functionality. Example:
from selenium import webdriver
browser = webdriver.Chrome()
browser.get('https://www.baidu.com')
input = browser.find_element_by_id('kw')
input.send_keys('Python')
browser.find_element_by_id('su').click()
# Extract the page source
print(browser.page_source)
# Extract the cookies
print(browser.get_cookies())
# Extract the current URL
print(browser.current_url)
browser.close()
After running this code, a Chrome browser window pops up automatically. It first navigates to Baidu, then types Python into the search box, and finally jumps to the search results page.
from selenium import webdriver
options = webdriver.ChromeOptions()
# Disable image loading
prefs = {"profile.managed_default_content_settings.images": 2}
options.add_experimental_option("prefs", prefs)
# Headless mode
options.add_argument("--headless")
# Set the user-agent
user_ag = ('MQQBrowser/26 Mozilla/5.0 (Linux; U; Android 2.3.7; zh-cn; MB200 Build/GRJ22; '
           'CyanogenMod-7) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1')
options.add_argument('user-agent=%s' % user_ag)
# Hide the "Chrome is being controlled by automated software" infobar
options.add_experimental_option('useAutomationExtension', False)
options.add_experimental_option('excludeSwitches', ['enable-automation'])
# Set a proxy
options.add_argument('proxy-server=' + '192.168.0.28:808')
browser = webdriver.Chrome(chrome_options=options)
# Maximize the browser window
browser.maximize_window()
# Or set an explicit width and height
browser.set_window_size(480, 800)
# Open a new window via JavaScript
browser.execute_script('window.open("https://www.baidu.com");')
4. Finding nodes
Single node
from selenium import webdriver
from selenium.webdriver.common.keys import Keys  # simulate keyboard actions
browser = webdriver.Chrome()
browser.get('https://www.taobao.com/')
s = browser.find_element_by_xpath("//div[@class='search-combobox-input-wrap']/child::input")
s.send_keys('clothes')
s.send_keys(Keys.ENTER)  # press Enter to confirm
Here are all the methods for getting a single node:
find_element_by_id — by id
find_element_by_class_name — by class name
find_element_by_name — by name attribute
find_element_by_xpath — by XPath expression
find_element_by_link_text — by the exact text of a hyperlink (<a> tag)
find_element_by_partial_link_text — by a partial (fuzzy) match of link text
find_element_by_tag_name — by tag name
find_element_by_css_selector — by CSS selector
Extraction examples:
browser.get('https://www.taobao.com/')
# Locate by ID
s = browser.find_element_by_id('q')
s.send_keys('clothes')
# Locate by CSS selector
s = browser.find_element_by_css_selector('div.search-combobox-input-wrap>input')
s.send_keys('clothes')
# Locate by XPath selector
s = browser.find_element_by_xpath('//div[@class="search-combobox-input-wrap"]/input')
s.send_keys('clothes')
find_element()
In addition, Selenium provides a general method, find_element(), which takes two parameters: the lookup strategy By and the value. It is effectively the generic version of the find_element_by_id()-style methods; for example, find_element_by_id(id) is equivalent to find_element(By.ID, id), and both return exactly the same result. Let's see it in code:
from selenium import webdriver
from selenium.webdriver.common.by import By
browser = webdriver.Chrome()
browser.get('https://www.taobao.com')
input_first = browser.find_element(By.ID, 'q')
input_first.send_keys('clothes')
print(input_first)
browser.close()  # finally, remember to close the browser
Multiple nodes
If there is only one target node in the page, find_element() is enough. But when multiple nodes match and you use find_element(), only the first one is returned. To get all matching nodes, use find_elements() instead. Note the extra s in the method name — be careful to distinguish the two.
from selenium import webdriver
browser = webdriver.Chrome()
browser.get('https://www.taobao.com')
lis = browser.find_elements_by_css_selector('.service-bd li')
print(lis)
browser.close()
Here are all the methods for getting multiple nodes:
find_elements_by_id
find_elements_by_name
find_elements_by_xpath
find_elements_by_link_text
find_elements_by_partial_link_text
find_elements_by_tag_name
find_elements_by_class_name
find_elements_by_css_selector
Of course, we can also use the find_elements() method directly, written like this:
lis = browser.find_elements(By.CSS_SELECTOR, '.service-bd li')
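Since find_elements() returns a list of WebElement objects, a typical follow-up is to loop over it. A minimal sketch (assuming Taobao still renders the .service-bd li items):

```python
from selenium import webdriver
from selenium.webdriver.common.by import By

browser = webdriver.Chrome()
browser.get('https://www.taobao.com')
# Each item in the list is a WebElement; .text gives its visible text
for li in browser.find_elements(By.CSS_SELECTOR, '.service-bd li'):
    print(li.text)
browser.close()
```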
Node interaction
With the methods above we have covered actions on common nodes. For more operations, see the interaction section of the official documentation:
http://selenium-python.readthedocs.io/api.html
from selenium import webdriver
import time
browser = webdriver.Chrome()
browser.get('https://www.taobao.com')
input = browser.find_element_by_id('q')
input.send_keys('iPhone')
time.sleep(1)
input.clear()
input.send_keys('iPad')
button = browser.find_element_by_class_name('btn-search')
button.click()
Action chains
In the examples above, each interaction targets a specific node: for an input field we call its send_keys and clear methods; for a button we call click. However, some operations have no specific target object — mouse dragging, keyboard combinations, and so on. These are performed differently, through action chains.
For example, to drag a node from one place to another, you can do the following:
from selenium import webdriver
from selenium.webdriver import ActionChains
browser = webdriver.Chrome()
url = 'http://www.runoob.com/try/try.php?filename=jqueryui-api-droppable'
browser.get(url)
browser.switch_to.frame('iframeResult')
source = browser.find_element_by_css_selector('#draggable')
target = browser.find_element_by_css_selector('#droppable')
actions = ActionChains(browser)
actions.drag_and_drop(source, target)
actions.perform()
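Besides drag-and-drop, ActionChains supports other mouse actions such as hovering. A minimal sketch (the CSS selector is a placeholder you would adapt to the target page):

```python
from selenium import webdriver
from selenium.webdriver import ActionChains

browser = webdriver.Chrome()
browser.get('https://www.taobao.com')
menu = browser.find_element_by_css_selector('.service-bd li')  # placeholder element
# move_to_element() hovers the mouse over the node, which can reveal dropdown menus
ActionChains(browser).move_to_element(menu).perform()
```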
Page scrolling
# Scroll so that the top of element aligns with the top of the current window
driver.execute_script("arguments[0].scrollIntoView();", element)
driver.execute_script("arguments[0].scrollIntoView(true);", element)
# Scroll so that the bottom of element aligns with the bottom of the current window
driver.execute_script("arguments[0].scrollIntoView(false);", element)
# Scroll to the bottom of the page
driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
# Scroll to relative coordinates (offset from the current position)
driver.execute_script("window.scrollBy(0, 700)")
# Combined with the scrollBy above, this moves to the 700+800=1500 pixel position
driver.execute_script("window.scrollBy(0, 800)")
# Scroll to absolute window coordinates — here, vertical position 1600 pixels
driver.execute_script("window.scrollTo(0, 1600)")
# scrollTo coordinates do not accumulate, so this moves to vertical position 1200 pixels
driver.execute_script("window.scrollTo(0, 1200)")
Executing JavaScript
Selenium's API does not cover every operation — for example, scrolling a progress bar. In such cases we can simulate the operation directly with JavaScript via the execute_script() method. The code is as follows:
# document.body.scrollHeight gives the page height
import time
import random
from selenium import webdriver
browser = webdriver.Chrome()
browser.get('https://36kr.com/')
# Scroll to the bottom in one go
browser.execute_script('window.scrollTo(0, document.body.scrollHeight)')
# scrollTo coordinates are absolute and do not accumulate; scrollBy offsets do accumulate
# Scroll down gradually
for i in range(1, 9):
    time.sleep(random.randint(100, 300) / 1000)
    browser.execute_script('window.scrollTo(0, {})'.format(i * 700))  # scrollTo: 700, 1400, 2100, ...
Here we used the execute_script() method to scroll the page down to the bottom.
Getting node information
Getting attributes
We can use the get_attribute() method to get a node's attributes — the prerequisite is selecting the node first. Example:
from selenium import webdriver
browser = webdriver.Chrome()
url = 'https://pic.netbian.com/4kmeinv/index.html'
browser.get(url)
src = browser.find_elements_by_xpath('//ul[@class="clearfix"]/li/a/img')
for i in src:
    url = i.get_attribute('src')
    print(url)
Getting the ID, location, tag name, and size
from io import BytesIO
from PIL import Image
from selenium import webdriver
browser = webdriver.Chrome()
url = 'https://pic.netbian.com/4kmeinv/index.html'
browser.get(url)
img = browser.find_element_by_xpath('//ul[@class="clearfix"]/li[1]/a/img')
location = img.location
size = img.size
top, bottom, left, right = location['y'], location['y'] + size['height'], location['x'], location['x'] + size['width']
screen = browser.get_screenshot_as_png()
screen = Image.open(BytesIO(screen))
cap = screen.crop((left, top, right, bottom))
cap.save('asas.png')
Switching frames
Web pages can contain a node called iframe — a child frame, effectively a sub-page embedded in the page, with the same structure as an ordinary page. When Selenium opens a page, it operates in the parent frame by default and cannot access nodes inside a child frame. To do so, use the switch_to.frame() method to switch frames. Example:
browser.get('https://www.douban.com/')
login_iframe = browser.find_element_by_xpath('//div[@class="login"]/iframe')
browser.switch_to.frame(login_iframe)
browser.find_element_by_class_name('account-tab-account').click()
browser.find_element_by_id('username').send_keys('123123123')
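Once done inside the child frame, you need to switch back before touching nodes in the outer page. A minimal self-contained sketch (switch_to.parent_frame() would instead move up just one level):

```python
from selenium import webdriver

browser = webdriver.Chrome()
browser.get('https://www.douban.com/')
login_iframe = browser.find_element_by_xpath('//div[@class="login"]/iframe')
browser.switch_to.frame(login_iframe)    # enter the child frame
# ... operate inside the iframe ...
browser.switch_to.default_content()      # return to the top-level document
```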
Waiting
In Selenium, the get() method returns once the page frame has finished loading. If you grab page_source at that point, the browser may not have fully rendered the page — for pages with extra Ajax requests, the data may not yet be in the page source. Therefore we need to wait a while to make sure the nodes have loaded.
There are two kinds of wait: implicit waits and explicit waits.
For more detailed parameters and usage of wait conditions, please refer to the official documents :
http://selenium-python.readthedocs.io/api.html
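The article does not show the wait APIs themselves, so here is a minimal sketch of both kinds: implicitly_wait() sets a global retry timeout for element lookups, while WebDriverWait plus expected_conditions waits for one specific condition:

```python
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

browser = webdriver.Chrome()
# Implicit wait: every find_element* call retries for up to 10 seconds
browser.implicitly_wait(10)
browser.get('https://www.taobao.com')
# Explicit wait: block until the search box is present (or raise TimeoutException)
wait = WebDriverWait(browser, 10)
box = wait.until(EC.presence_of_element_located((By.ID, 'q')))
print(box)
browser.close()
```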
Going forward and back
Browsers normally have forward and back functions, and Selenium can do this too: the back() method goes back, and forward() goes forward. Example:
import time
from selenium import webdriver
browser = webdriver.Chrome()
browser.get('https://www.baidu.com/')
browser.get('https://www.taobao.com/')
browser.get('https://www.python.org/')
browser.back()
time.sleep(1)
browser.forward()
browser.close()
Exception handling
When using Selenium, it is inevitable to hit exceptions such as timeouts or node-not-found errors; once such an error occurs, the program stops running. We can use try...except statements to catch the various exceptions.
from selenium import webdriver
from selenium.common.exceptions import TimeoutException, NoSuchElementException
browser = webdriver.Chrome()
try:
    browser.get('https://www.baidu.com')
except TimeoutException:
    print('Time Out')
try:
    browser.find_element_by_id('hello')
except NoSuchElementException:
    print('No Element')
finally:
    browser.close()
Bypassing detection
# Without any countermeasures
browser.get('https://bot.sannysoft.com/')
# With the automation flag disabled
options = webdriver.ChromeOptions()
options.add_argument('--disable-blink-features=AutomationControlled')
browsers = webdriver.Chrome(chrome_options=options)
browsers.get('https://bot.sannysoft.com/')
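A further countermeasure, not shown in the original, is injecting a script before any page script runs via the Chrome DevTools Protocol so that navigator.webdriver reads as undefined. A hedged sketch:

```python
from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument('--disable-blink-features=AutomationControlled')
browser = webdriver.Chrome(chrome_options=options)
# The injected source runs before each document loads, masking the webdriver flag
browser.execute_cdp_cmd('Page.addScriptToEvaluateOnNewDocument', {
    'source': "Object.defineProperty(navigator, 'webdriver', {get: () => undefined})"
})
browser.get('https://bot.sannysoft.com/')
```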
2. Using Pyppeteer
Puppeteer is a Node.js-based tool developed by Google. So what is Pyppeteer? It is a Python implementation of Puppeteer — not developed by Google, but an unofficial port of some of Puppeteer's features written by an engineer from Japan.
Behind the scenes, Pyppeteer likewise drives a Chromium browser (the open-source sibling of Chrome) to perform actions and render web pages.
Environment installation
pip install pyppeteer
1. Example:
from pyppeteer import launch
import asyncio

async def main():
    # Launch a browser
    browser = await launch(headless=False, args=['--disable-infobars', '--window-size=1920,1080'])
    # Create a page
    page = await browser.newPage()
    # Navigate to Baidu
    await page.goto("http://www.baidu.com/")
    # Type the search keyword; type()'s first parameter is the element selector, the second is the text to enter
    await page.type('#kw', 'pyppeteer')
    # Click the submit button; click() targets the element matched by the selector
    await page.click('#su')
    await asyncio.sleep(3)
    await browser.close()

asyncio.get_event_loop().run_until_complete(main())
Next, let's look at the parameters of launch():
- ignoreHTTPSErrors (bool): whether to ignore HTTPS errors; defaults to False.
- headless (bool): whether to enable headless (no-UI) mode. If the devtools parameter is True, this is forced to False; otherwise it defaults to True, i.e. headless mode is enabled by default.
- executablePath (str): path to the executable. If specified, the bundled Chromium is not used; you can point it at an existing Chrome or Chromium.
- slowMo (int|float): slows down Pyppeteer's simulated operations by the given amount of time.
- args (List[str]): extra arguments passed to the browser at launch.
- ignoreDefaultArgs (bool): do not use Pyppeteer's default arguments. If you use this, you should set the necessary arguments via args yourself, otherwise unexpected problems may occur. This parameter is relatively dangerous — use with caution.
- handleSIGINT (bool): whether to respond to the SIGINT signal, i.e. whether Ctrl + C terminates the browser process; defaults to True.
- handleSIGTERM (bool): whether to respond to SIGTERM (usually the kill command); defaults to True.
- handleSIGHUP (bool): whether to respond to SIGHUP (the hangup signal, e.g. a terminal exit); defaults to True.
- dumpio (bool): whether to pipe Pyppeteer's output to process.stdout and process.stderr; defaults to False.
- userDataDir (str): the user data directory, which preserves personalized configuration and history between runs.
- env (dict): environment variables, passed as a dictionary.
- devtools (bool): whether to automatically open DevTools for each page; defaults to False. If set to True, the headless parameter is ignored and forced to False.
- logLevel (int|str): logging level; defaults to the same level as the root logger.
- autoClose (bool): whether to close the browser automatically when the script finishes; defaults to True.
- loop (asyncio.AbstractEventLoop): the event loop object to use.
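A short sketch combining several of these parameters (the userDataDir path is just a placeholder):

```python
import asyncio
from pyppeteer import launch

async def main():
    browser = await launch(
        headless=True,                # no UI
        ignoreHTTPSErrors=True,       # tolerate certificate problems
        userDataDir='./cache-data',   # persist cookies/config between runs
        args=['--window-size=1920,1080'],
    )
    page = await browser.newPage()
    await page.goto('http://www.baidu.com/')
    print(await page.title())
    await browser.close()

asyncio.get_event_loop().run_until_complete(main())
```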
2. Basic configuration
2.0 Basic parameters
params = {
    # Disable headless mode (show the browser window)
    "headless": False,
    "dumpio": True,  # prevent the browser from hanging
    "userDataDir": "./cache-data",  # user data directory
    "args": [
        '--disable-infobars',  # hide the automation infobar
        '--window-size=1920,1080',  # window size
        '--log-level=30',  # log level; keep it low or the logs take a lot of space (30 = warning)
        '--user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36',
        '--no-sandbox',  # disable sandbox mode
        '--start-maximized',  # start maximized
        '--proxy-server=http://localhost:1080'  # proxy
    ],
}
2.1 Window settings
# UI mode, with the automation warning infobar disabled
browser = await launch(headless=False, args=['--disable-infobars'])
page = await browser.newPage()
await page.setViewport({'width': 1200, 'height': 800})
2.2 Setting the user agent
await page.setUserAgent("Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5")
2.3 Taking a page screenshot
await page.screenshot(path='example.png')
2.4 Disguising the browser to bypass detection
The JavaScript Object.defineProperty() method defines a new property directly on an object, or modifies an existing one, and returns the object. Here it is used to make navigator.webdriver report false:
# Disguise
await page.evaluateOnNewDocument('() =>{ Object.defineProperties(navigator,'
                                 '{ webdriver:{ get: () => false } })}')
await page.goto('https://intoli.com/blog/not-possible-to-block-chrome-headless/chrome-headless-test.html')
2.5 Case: triggering JS
import asyncio
from pyppeteer import launch

async def run():
    browser = await launch()
    page = await browser.newPage()
    await page.setViewport({'width': 1200, 'height': 800})
    await page.goto('https://www.zhipin.com/job_detail/?query=%E8%85%BE%E8%AE%AF%E7%88%AC%E8%99%AB&city=101020100&industry=&position=')
    # evaluate() runs a JS function in the page and returns its (serializable) result
    dimensions = await page.evaluate('''() => {
        return {
            cookie: window.document.cookie,
        }
    }''')
    print(dimensions, type(dimensions))

asyncio.get_event_loop().run_until_complete(run())
3 Advanced usage
import asyncio
from pyppeteer import launch
from pyquery import PyQuery as pq

async def main():
    browser = await launch(headless=False)  # open the browser
    page = await browser.newPage()  # open a tab
    # Navigate to the page
    await page.goto('https://careers.tencent.com/search.html?keyword=python')
    # Wait for the target nodes to appear
    await page.waitForXPath('//div[@class="recruit-wrap recruit-margin"]/div')
    # Get the page source
    doc = pq(await page.content())
    # Extract the data
    title = [item.text() for item in doc('.recruit-title').items()]
    print('title:', title)
    # Close the browser
    await browser.close()

# Run the coroutine
asyncio.get_event_loop().run_until_complete(main())
4 Data extraction
# Runs document.querySelector within the page; returns None if no element matches the selector
J = querySelector
# Runs document.querySelector within the page, then passes the matched element as the first argument to pageFunction
Jeval = querySelectorEval
# Runs document.querySelectorAll within the page; returns [] if no element matches the selector
JJ = querySelectorAll
# Runs Array.from(document.querySelectorAll(selector)) within the page, then passes the matched element array as the first argument to pageFunction
JJeval = querySelectorAllEval
# XPath expression
Jx = xpath
# Pyppeteer's three lookup methods
Page.querySelector()  # CSS selector
Page.querySelectorAll()
Page.xpath()  # XPath expression
# Abbreviated as:
Page.J(), Page.JJ(), and Page.Jx()
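A short sketch of the *Eval variants in action (assuming Baidu's markup, which has a title tag and several links):

```python
import asyncio
from pyppeteer import launch

async def main():
    browser = await launch()
    page = await browser.newPage()
    await page.goto('http://www.baidu.com/')
    # Jeval: run a JS function against the first node matching the selector
    title = await page.Jeval('title', 'node => node.textContent')
    print(title)
    # JJeval: run a JS function against the array of all matching nodes
    count = await page.JJeval('a', 'nodes => nodes.length')
    print(count)
    await browser.close()

asyncio.get_event_loop().run_until_complete(main())
```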
5 Getting attributes
Goal: extract all the image resources at https://pic.netbian.com/4kmeinv/index.html
import asyncio
from pyppeteer import launch

async def mains1():
    browser = await launch(headless=False, args=['--disable-infobars'])
    page = await browser.newPage()
    await page.setUserAgent("Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5")
    await page.setViewport(viewport={'width': 1536, 'height': 768})
    await page.evaluateOnNewDocument('() =>{ Object.defineProperties(navigator,'
                                     '{ webdriver:{ get: () => false } }) }')
    await page.goto('https://pic.netbian.com/4kmeinv/index.html')
    elements = await page.querySelectorAll(".clearfix li a img")
    for item in elements:
        # Get the link
        title_link = await (await item.getProperty('src')).jsonValue()
        print(title_link)
    await browser.close()

asyncio.get_event_loop().run_until_complete(mains1())
6 Login case
import asyncio
from pyppeteer import launch

async def mains2():
    browser = await launch({'headless': False, 'args': ['--disable-infobars', '--window-size=1920,1080']})
    page = await browser.newPage()
    await page.setViewport({'width': 1920, 'height': 1080})
    # Register the disguise script before navigating, so it runs on the login page
    await page.evaluateOnNewDocument('() =>{ Object.defineProperties(navigator,'
                                     '{ webdriver:{ get: () => false } }) }')
    await page.goto('https://www.captainbi.com/amz_login.html')
    await page.type('#username', '13555553333')  # account
    await page.type('#password', '123456')  # password
    await asyncio.sleep(2)
    await page.click('#submit', {'timeout': 3000})
    # await browser.close()
    print('Login successful')

asyncio.get_event_loop().run_until_complete(mains2())
7 Comprehensive case
import requests
from lxml import etree
from loguru import logger
import pandas as pd
from utils import ua
import asyncio
from pyppeteer import launch

class Wph(object):
    def __init__(self, url, name):
        self.url = url
        self.name = name
        self.headers = {
            'user-agent': ua.get_random_useragent()
        }
        self.session = requests.session()
        self.hadlnone = lambda x: x[0] if x else ''

    async def main(self, url):
        global browser
        browser = await launch()
        page = await browser.newPage()
        await page.goto(url)
        text = await page.content()  # return the page HTML
        return text

    def spider(self):
        df = pd.DataFrame(columns=['brand', 'title', 'original price', 'current price', 'discount'])
        # Make the HTTP request, e.g.
        # https://category.vip.com/suggest.php?keyword=%E5%8F%A3%E7%BA%A2&brand_sn=10000359
        res = self.session.get(self.url, params={'keyword': self.name}, headers=self.headers, verify=False)
        html = etree.HTML(res.text)
        url_list = html.xpath('//div[@class="c-filter-group-content"]/div[contains(@class,"c-filter-group-scroll-brand")]/ul/li/a/@href')
        # Iterate over the brand URLs
        for i in url_list:
            ua.wait_some_time()
            # Drive the browser to request the page
            page_html = asyncio.get_event_loop().run_until_complete(self.main('http:' + i))
            # Parse the page source
            page = etree.HTML(page_html)
            htmls = page.xpath('//section[@id="J_searchCatList"]/div')
            # Iterate over the product list
            for h in htmls[1:]:
                # Brand
                pingpai = self.hadlnone(h.xpath('.//div[contains(@class,"c-breadcrumbs-cell-title")]/span/text()'))
                # Title
                title = self.hadlnone(h.xpath('.//div[contains(@class,"c-goods-item__name")]/text()'))
                # Original price
                y_price = self.hadlnone(h.xpath('.//div[contains(@class,"c-goods-item__market-price")]/text()'))
                # Sale price
                x_price = self.hadlnone(h.xpath('.//div[contains(@class,"c-goods-item__sale-price")]/text()'))
                # Discount
                zk = self.hadlnone(h.xpath('.//div[contains(@class,"c-goods-item__discount")]/text()'))
                logger.info(f'brand {pingpai}, title {title}, original price {y_price}, current price {x_price}, discount {zk}')
                # Build a record
                pro = {
                    'brand': pingpai,
                    'title': title,
                    'original price': y_price,
                    'current price': x_price,
                    'discount': zk
                }
                df = df.append([pro])
        df.to_excel('vipshop_data_2.xlsx', index=False)
        return df

    def __del__(self):
        browser.close()

if __name__ == '__main__':
    url = 'https://category.vip.com/suggest.php'
    name = 'perfume'
    w = Wph(url, name)
    w.spider()