2-6. Automatic acquisition
2022-07-25 07:34:00 【Green bamboo memory】
1. Using Selenium
1. Preparation (installation)
This section uses Chrome as an example to explain the usage of Selenium. Before starting, make sure the Chrome browser is installed and ChromeDriver is configured. In addition, the Python Selenium library must be installed correctly:
pip install selenium
2. Declaring browser objects
1. Visit http://chromedriver.storage.googleapis.com/index.html and download the chromedriver.exe that matches your browser version (the driver and browser versions must match).
2. Unzip the downloaded file and place it in the Chrome browser directory (or anywhere on your system PATH).
from selenium import webdriver
# Any one of the following creates a driver for the corresponding browser
browser = webdriver.Chrome()
browser = webdriver.Firefox()
browser = webdriver.Edge()
browser = webdriver.PhantomJS()
browser = webdriver.Safari()
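If the driver is not on PATH, the older Selenium versions that match the find_element_by_* API used throughout this article also accept an explicit path (a hedged sketch; the path below is only a placeholder for wherever you unzipped the driver):

```python
from selenium import webdriver

# Hypothetical driver location; replace with your own chromedriver path
browser = webdriver.Chrome(executable_path=r'C:\tools\chromedriver.exe')
```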
3. Basic usage
With the preparation done, let's first get a general feel for Selenium's functionality. Example:
from selenium import webdriver
browser = webdriver.Chrome()
browser.get('https://www.baidu.com')
input = browser.find_element_by_id('kw')
input.send_keys('Python')
browser.find_element_by_id('su').click()
# Extract the page source
print(browser.page_source)
# Extract the cookies
print(browser.get_cookies())
# Extract the current URL
print(browser.current_url)
browser.close()
After running this code, a Chrome browser window pops up automatically. It first navigates to Baidu, then types Python into the search box, and finally jumps to the search results page.
from selenium import webdriver
options = webdriver.ChromeOptions()
# Disable image loading
prefs = {"profile.managed_default_content_settings.images": 2}
options.add_experimental_option("prefs", prefs)
# Headless mode
options.add_argument("--headless")
# Set the user-agent
user_ag = ('MQQBrowser/26 Mozilla/5.0 (Linux; U; Android 2.3.7; zh-cn; MB200 Build/GRJ22; '
           'CyanogenMod-7) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1')
options.add_argument('user-agent=%s' % user_ag)
# Hide the "Chrome is being controlled by automated software" infobar
options.add_experimental_option('useAutomationExtension', False)
options.add_experimental_option('excludeSwitches', ['enable-automation'])
# Set a proxy
options.add_argument('proxy-server=' + '192.168.0.28:808')
browser = webdriver.Chrome(chrome_options=options)
# Maximize the browser window
browser.maximize_window()
# Or set an explicit width and height
browser.set_window_size(480, 800)
# Open a new window via JavaScript
browser.execute_script('window.open("https://www.baidu.com");')
4. Finding nodes
Single node
from selenium import webdriver
from selenium.webdriver.common.keys import Keys  # simulate keyboard actions
browser = webdriver.Chrome()
browser.get('https://www.taobao.com/')
s = browser.find_element_by_xpath("//div[@class='search-combobox-input-wrap']/child::input")
s.send_keys('clothes')
s.send_keys(Keys.ENTER)  # press Enter to confirm
Here are all the methods for getting a single node:
find_element_by_id — by id
find_element_by_class_name — by class name
find_element_by_name — by name attribute
find_element_by_xpath — by XPath expression
find_element_by_link_text — by the exact text of a hyperlink (<a> tag)
find_element_by_partial_link_text — by a partial (fuzzy) match of link text
find_element_by_tag_name — by tag name
find_element_by_css_selector — by CSS selector
Extraction examples:
browser.get('https://www.taobao.com/')
# Locate by ID
s = browser.find_element_by_id('q')
s.send_keys('clothes')
# Locate by CSS selector
s = browser.find_element_by_css_selector('div.search-combobox-input-wrap>input')
s.send_keys('clothes')
# Locate by XPath selector
s = browser.find_element_by_xpath('//div[@class="search-combobox-input-wrap"]/input')
s.send_keys('clothes')
find_element()
In addition, Selenium provides a general method, find_element(), which takes two parameters: the lookup strategy By and the value. It is effectively the generic version of the find_element_by_id()-style methods; for example, find_element_by_id(id) is equivalent to find_element(By.ID, id), and both return exactly the same result. Let's see it in code:
from selenium import webdriver
from selenium.webdriver.common.by import By
browser = webdriver.Chrome()
browser.get('https://www.taobao.com')
input_first = browser.find_element(By.ID, 'q')
input_first.send_keys('clothes')
print(input_first)
browser.close()  # finally, remember to close the browser
Multiple nodes
If there is only one target node in the page, find_element() is enough. But when multiple nodes match and you use find_element(), only the first one is returned. To get all matching nodes, use find_elements() instead. Note the extra s in the method name — be careful to distinguish the two.
from selenium import webdriver
browser = webdriver.Chrome()
browser.get('https://www.taobao.com')
lis = browser.find_elements_by_css_selector('.service-bd li')
print(lis)
browser.close()
Here are all the methods for getting multiple nodes:
find_elements_by_id
find_elements_by_name
find_elements_by_xpath
find_elements_by_link_text
find_elements_by_partial_link_text
find_elements_by_tag_name
find_elements_by_class_name
find_elements_by_css_selector
Of course, we can also use the find_elements() method directly, written like this:
lis = browser.find_elements(By.CSS_SELECTOR, '.service-bd li')
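Since find_elements() returns a list of WebElement objects, a typical follow-up is to loop over it. A minimal sketch (assuming Taobao still renders the .service-bd li items):

```python
from selenium import webdriver
from selenium.webdriver.common.by import By

browser = webdriver.Chrome()
browser.get('https://www.taobao.com')
# Each item in the list is a WebElement; .text gives its visible text
for li in browser.find_elements(By.CSS_SELECTOR, '.service-bd li'):
    print(li.text)
browser.close()
```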
Node interaction
With the methods above we have covered actions on common nodes. For more operations, see the interaction section of the official documentation:
http://selenium-python.readthedocs.io/api.html
from selenium import webdriver
import time
browser = webdriver.Chrome()
browser.get('https://www.taobao.com')
input = browser.find_element_by_id('q')
input.send_keys('iPhone')
time.sleep(1)
input.clear()
input.send_keys('iPad')
button = browser.find_element_by_class_name('btn-search')
button.click()
Action chains
In the examples above, each interaction targets a specific node: for an input field we call its send_keys and clear methods; for a button we call click. However, some operations have no specific target object — mouse dragging, keyboard combinations, and so on. These are performed differently, through action chains.
For example, to drag a node from one place to another, you can do the following:
from selenium import webdriver
from selenium.webdriver import ActionChains
browser = webdriver.Chrome()
url = 'http://www.runoob.com/try/try.php?filename=jqueryui-api-droppable'
browser.get(url)
browser.switch_to.frame('iframeResult')
source = browser.find_element_by_css_selector('#draggable')
target = browser.find_element_by_css_selector('#droppable')
actions = ActionChains(browser)
actions.drag_and_drop(source, target)
actions.perform()
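Besides drag-and-drop, ActionChains supports other mouse actions such as hovering. A minimal sketch (the CSS selector is a placeholder you would adapt to the target page):

```python
from selenium import webdriver
from selenium.webdriver import ActionChains

browser = webdriver.Chrome()
browser.get('https://www.taobao.com')
menu = browser.find_element_by_css_selector('.service-bd li')  # placeholder element
# move_to_element() hovers the mouse over the node, which can reveal dropdown menus
ActionChains(browser).move_to_element(menu).perform()
```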
Page scrolling
# Scroll so that the top of element aligns with the top of the current window
driver.execute_script("arguments[0].scrollIntoView();", element)
driver.execute_script("arguments[0].scrollIntoView(true);", element)
# Scroll so that the bottom of element aligns with the bottom of the current window
driver.execute_script("arguments[0].scrollIntoView(false);", element)
# Scroll to the bottom of the page
driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
# Scroll to relative coordinates (offset from the current position)
driver.execute_script("window.scrollBy(0, 700)")
# Combined with the scrollBy above, this moves to the 700+800=1500 pixel position
driver.execute_script("window.scrollBy(0, 800)")
# Scroll to absolute window coordinates — here, vertical position 1600 pixels
driver.execute_script("window.scrollTo(0, 1600)")
# scrollTo coordinates do not accumulate, so this moves to vertical position 1200 pixels
driver.execute_script("window.scrollTo(0, 1200)")
Executing JavaScript
Selenium's API does not cover every operation — for example, scrolling a progress bar. In such cases we can simulate the operation directly with JavaScript via the execute_script() method. The code is as follows:
# document.body.scrollHeight gives the page height
import time
import random
from selenium import webdriver
browser = webdriver.Chrome()
browser.get('https://36kr.com/')
# Scroll to the bottom in one go
browser.execute_script('window.scrollTo(0, document.body.scrollHeight)')
# scrollTo coordinates are absolute and do not accumulate; scrollBy offsets do accumulate
# Scroll down gradually
for i in range(1, 9):
    time.sleep(random.randint(100, 300) / 1000)
    browser.execute_script('window.scrollTo(0, {})'.format(i * 700))  # scrollTo: 700, 1400, 2100, ...
Here we used the execute_script() method to scroll the page down to the bottom.
Getting node information
Getting attributes
We can use the get_attribute() method to get a node's attributes — the prerequisite is selecting the node first. Example:
from selenium import webdriver
browser = webdriver.Chrome()
url = 'https://pic.netbian.com/4kmeinv/index.html'
browser.get(url)
src = browser.find_elements_by_xpath('//ul[@class="clearfix"]/li/a/img')
for i in src:
    url = i.get_attribute('src')
    print(url)
Getting the ID, location, tag name, and size
from io import BytesIO
from PIL import Image
from selenium import webdriver
browser = webdriver.Chrome()
url = 'https://pic.netbian.com/4kmeinv/index.html'
browser.get(url)
img = browser.find_element_by_xpath('//ul[@class="clearfix"]/li[1]/a/img')
location = img.location
size = img.size
top, bottom, left, right = location['y'], location['y'] + size['height'], location['x'], location['x'] + size['width']
screen = browser.get_screenshot_as_png()
screen = Image.open(BytesIO(screen))
cap = screen.crop((left, top, right, bottom))
cap.save('asas.png')
Switching frames
Web pages can contain a node called iframe — a child frame, effectively a sub-page embedded in the page, with the same structure as an ordinary page. When Selenium opens a page, it operates in the parent frame by default and cannot access nodes inside a child frame. To do so, use the switch_to.frame() method to switch frames. Example:
browser.get('https://www.douban.com/')
login_iframe = browser.find_element_by_xpath('//div[@class="login"]/iframe')
browser.switch_to.frame(login_iframe)
browser.find_element_by_class_name('account-tab-account').click()
browser.find_element_by_id('username').send_keys('123123123')
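Once done inside the child frame, you need to switch back before touching nodes in the outer page. A minimal self-contained sketch (switch_to.parent_frame() would instead move up just one level):

```python
from selenium import webdriver

browser = webdriver.Chrome()
browser.get('https://www.douban.com/')
login_iframe = browser.find_element_by_xpath('//div[@class="login"]/iframe')
browser.switch_to.frame(login_iframe)    # enter the child frame
# ... operate inside the iframe ...
browser.switch_to.default_content()      # return to the top-level document
```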
Waiting
In Selenium, the get() method returns once the page frame has finished loading. If you grab page_source at that point, the browser may not have fully rendered the page — for pages with extra Ajax requests, the data may not yet be in the page source. Therefore we need to wait a while to make sure the nodes have loaded.
There are two kinds of wait: implicit waits and explicit waits.
For more detailed parameters and usage of wait conditions, please refer to the official documents :
http://selenium-python.readthedocs.io/api.html
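The article does not show the wait APIs themselves, so here is a minimal sketch of both kinds: implicitly_wait() sets a global retry timeout for element lookups, while WebDriverWait plus expected_conditions waits for one specific condition:

```python
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

browser = webdriver.Chrome()
# Implicit wait: every find_element* call retries for up to 10 seconds
browser.implicitly_wait(10)
browser.get('https://www.taobao.com')
# Explicit wait: block until the search box is present (or raise TimeoutException)
wait = WebDriverWait(browser, 10)
box = wait.until(EC.presence_of_element_located((By.ID, 'q')))
print(box)
browser.close()
```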
Going forward and back
Browsers normally have forward and back functions, and Selenium can do this too: the back() method goes back, and forward() goes forward. Example:
import time
from selenium import webdriver
browser = webdriver.Chrome()
browser.get('https://www.baidu.com/')
browser.get('https://www.taobao.com/')
browser.get('https://www.python.org/')
browser.back()
time.sleep(1)
browser.forward()
browser.close()
Exception handling
When using Selenium, it is inevitable to hit exceptions such as timeouts or node-not-found errors; once such an error occurs, the program stops running. We can use try...except statements to catch the various exceptions.
from selenium import webdriver
from selenium.common.exceptions import TimeoutException, NoSuchElementException
browser = webdriver.Chrome()
try:
    browser.get('https://www.baidu.com')
except TimeoutException:
    print('Time Out')
try:
    browser.find_element_by_id('hello')
except NoSuchElementException:
    print('No Element')
finally:
    browser.close()
Bypassing detection
# Without any countermeasures
browser.get('https://bot.sannysoft.com/')
# With the automation flag disabled
options = webdriver.ChromeOptions()
options.add_argument('--disable-blink-features=AutomationControlled')
browsers = webdriver.Chrome(chrome_options=options)
browsers.get('https://bot.sannysoft.com/')
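A further countermeasure, not shown in the original, is injecting a script before any page script runs via the Chrome DevTools Protocol so that navigator.webdriver reads as undefined. A hedged sketch:

```python
from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument('--disable-blink-features=AutomationControlled')
browser = webdriver.Chrome(chrome_options=options)
# The injected source runs before each document loads, masking the webdriver flag
browser.execute_cdp_cmd('Page.addScriptToEvaluateOnNewDocument', {
    'source': "Object.defineProperty(navigator, 'webdriver', {get: () => undefined})"
})
browser.get('https://bot.sannysoft.com/')
```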
2. Using Pyppeteer
Puppeteer is a Node.js-based tool developed by Google. So what is Pyppeteer? It is a Python implementation of Puppeteer — not developed by Google, but an unofficial port of some of Puppeteer's features written by an engineer from Japan.
Behind the scenes, Pyppeteer likewise drives a Chromium browser (the open-source sibling of Chrome) to perform actions and render web pages.
Environment installation
pip install pyppeteer
1. Example:
from pyppeteer import launch
import asyncio

async def main():
    # Launch a browser
    browser = await launch(headless=False, args=['--disable-infobars', '--window-size=1920,1080'])
    # Create a page
    page = await browser.newPage()
    # Navigate to Baidu
    await page.goto("http://www.baidu.com/")
    # Type the search keyword; type()'s first parameter is the element selector, the second is the text to enter
    await page.type('#kw', 'pyppeteer')
    # Click the submit button; click() targets the element matched by the selector
    await page.click('#su')
    await asyncio.sleep(3)
    await browser.close()

asyncio.get_event_loop().run_until_complete(main())
Next, let's look at the parameters of launch():
- ignoreHTTPSErrors (bool): whether to ignore HTTPS errors; defaults to False.
- headless (bool): whether to enable headless (no-UI) mode. If the devtools parameter is True, this is forced to False; otherwise it defaults to True, i.e. headless mode is enabled by default.
- executablePath (str): path to the executable. If specified, the bundled Chromium is not used; you can point it at an existing Chrome or Chromium.
- slowMo (int|float): slows down Pyppeteer's simulated operations by the given amount of time.
- args (List[str]): extra arguments passed to the browser at launch.
- ignoreDefaultArgs (bool): do not use Pyppeteer's default arguments. If you use this, you should set the necessary arguments via args yourself, otherwise unexpected problems may occur. This parameter is relatively dangerous — use with caution.
- handleSIGINT (bool): whether to respond to the SIGINT signal, i.e. whether Ctrl + C terminates the browser process; defaults to True.
- handleSIGTERM (bool): whether to respond to SIGTERM (usually the kill command); defaults to True.
- handleSIGHUP (bool): whether to respond to SIGHUP (the hangup signal, e.g. a terminal exit); defaults to True.
- dumpio (bool): whether to pipe Pyppeteer's output to process.stdout and process.stderr; defaults to False.
- userDataDir (str): the user data directory, which preserves personalized configuration and history between runs.
- env (dict): environment variables, passed as a dictionary.
- devtools (bool): whether to automatically open DevTools for each page; defaults to False. If set to True, the headless parameter is ignored and forced to False.
- logLevel (int|str): logging level; defaults to the same level as the root logger.
- autoClose (bool): whether to close the browser automatically when the script finishes; defaults to True.
- loop (asyncio.AbstractEventLoop): the event loop object to use.
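A short sketch combining several of these parameters (the userDataDir path is just a placeholder):

```python
import asyncio
from pyppeteer import launch

async def main():
    browser = await launch(
        headless=True,                # no UI
        ignoreHTTPSErrors=True,       # tolerate certificate problems
        userDataDir='./cache-data',   # persist cookies/config between runs
        args=['--window-size=1920,1080'],
    )
    page = await browser.newPage()
    await page.goto('http://www.baidu.com/')
    print(await page.title())
    await browser.close()

asyncio.get_event_loop().run_until_complete(main())
```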
2. Basic configuration
2.0 Basic parameters
params = {
    # Disable headless mode (show the browser window)
    "headless": False,
    "dumpio": True,  # prevent the browser from hanging
    "userDataDir": "./cache-data",  # user data directory
    "args": [
        '--disable-infobars',  # hide the automation infobar
        '--window-size=1920,1080',  # window size
        '--log-level=30',  # log level; keep it low or the logs take a lot of space (30 = warning)
        '--user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36',
        '--no-sandbox',  # disable sandbox mode
        '--start-maximized',  # start maximized
        '--proxy-server=http://localhost:1080'  # proxy
    ],
}
2.1 Window settings
# UI mode, with the automation warning infobar disabled
browser = await launch(headless=False, args=['--disable-infobars'])
page = await browser.newPage()
await page.setViewport({'width': 1200, 'height': 800})
2.2 Setting the user agent
await page.setUserAgent("Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5")
2.3 Taking a page screenshot
await page.screenshot(path='example.png')
2.4 Disguising the browser to bypass detection
The JavaScript Object.defineProperty() method defines a new property directly on an object, or modifies an existing one, and returns the object. Here it is used to make navigator.webdriver report false:
# Disguise
await page.evaluateOnNewDocument('() =>{ Object.defineProperties(navigator,'
                                 '{ webdriver:{ get: () => false } })}')
await page.goto('https://intoli.com/blog/not-possible-to-block-chrome-headless/chrome-headless-test.html')
2.5 Case: triggering JS
import asyncio
from pyppeteer import launch

async def run():
    browser = await launch()
    page = await browser.newPage()
    await page.setViewport({'width': 1200, 'height': 800})
    await page.goto('https://www.zhipin.com/job_detail/?query=%E8%85%BE%E8%AE%AF%E7%88%AC%E8%99%AB&city=101020100&industry=&position=')
    # evaluate() runs a JS function in the page and returns its (serializable) result
    dimensions = await page.evaluate('''() => {
        return {
            cookie: window.document.cookie,
        }
    }''')
    print(dimensions, type(dimensions))

asyncio.get_event_loop().run_until_complete(run())
3 Advanced usage
import asyncio
from pyppeteer import launch
from pyquery import PyQuery as pq

async def main():
    browser = await launch(headless=False)  # open the browser
    page = await browser.newPage()  # open a tab
    # Navigate to the page
    await page.goto('https://careers.tencent.com/search.html?keyword=python')
    # Wait for the target nodes to appear
    await page.waitForXPath('//div[@class="recruit-wrap recruit-margin"]/div')
    # Get the page source
    doc = pq(await page.content())
    # Extract the data
    title = [item.text() for item in doc('.recruit-title').items()]
    print('title:', title)
    # Close the browser
    await browser.close()

# Run the coroutine
asyncio.get_event_loop().run_until_complete(main())
4 Data extraction
# Runs document.querySelector within the page; returns None if no element matches the selector
J = querySelector
# Runs document.querySelector within the page, then passes the matched element as the first argument to pageFunction
Jeval = querySelectorEval
# Runs document.querySelectorAll within the page; returns [] if no element matches the selector
JJ = querySelectorAll
# Runs Array.from(document.querySelectorAll(selector)) within the page, then passes the matched element array as the first argument to pageFunction
JJeval = querySelectorAllEval
# XPath expression
Jx = xpath
# Pyppeteer's three lookup methods
Page.querySelector()  # CSS selector
Page.querySelectorAll()
Page.xpath()  # XPath expression
# Abbreviated as:
Page.J(), Page.JJ(), and Page.Jx()
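A short sketch of the *Eval variants in action (assuming Baidu's markup, which has a title tag and several links):

```python
import asyncio
from pyppeteer import launch

async def main():
    browser = await launch()
    page = await browser.newPage()
    await page.goto('http://www.baidu.com/')
    # Jeval: run a JS function against the first node matching the selector
    title = await page.Jeval('title', 'node => node.textContent')
    print(title)
    # JJeval: run a JS function against the array of all matching nodes
    count = await page.JJeval('a', 'nodes => nodes.length')
    print(count)
    await browser.close()

asyncio.get_event_loop().run_until_complete(main())
```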
5 Getting attributes
Goal: extract all the image resources at https://pic.netbian.com/4kmeinv/index.html
import asyncio
from pyppeteer import launch

async def mains1():
    browser = await launch(headless=False, args=['--disable-infobars'])
    page = await browser.newPage()
    await page.setUserAgent("Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5")
    await page.setViewport(viewport={'width': 1536, 'height': 768})
    await page.evaluateOnNewDocument('() =>{ Object.defineProperties(navigator,'
                                     '{ webdriver:{ get: () => false } }) }')
    await page.goto('https://pic.netbian.com/4kmeinv/index.html')
    elements = await page.querySelectorAll(".clearfix li a img")
    for item in elements:
        # Get the link
        title_link = await (await item.getProperty('src')).jsonValue()
        print(title_link)
    await browser.close()

asyncio.get_event_loop().run_until_complete(mains1())
6 Login case
import asyncio
from pyppeteer import launch

async def mains2():
    browser = await launch({'headless': False, 'args': ['--disable-infobars', '--window-size=1920,1080']})
    page = await browser.newPage()
    await page.setViewport({'width': 1920, 'height': 1080})
    # Register the disguise script before navigating, so it runs on the login page
    await page.evaluateOnNewDocument('() =>{ Object.defineProperties(navigator,'
                                     '{ webdriver:{ get: () => false } }) }')
    await page.goto('https://www.captainbi.com/amz_login.html')
    await page.type('#username', '13555553333')  # account
    await page.type('#password', '123456')  # password
    await asyncio.sleep(2)
    await page.click('#submit', {'timeout': 3000})
    # await browser.close()
    print('Login successful')

asyncio.get_event_loop().run_until_complete(mains2())
7 Comprehensive case
import requests
from lxml import etree
from loguru import logger
import pandas as pd
from utils import ua
import asyncio
from pyppeteer import launch

class Wph(object):
    def __init__(self, url, name):
        self.url = url
        self.name = name
        self.headers = {
            'user-agent': ua.get_random_useragent()
        }
        self.session = requests.session()
        self.hadlnone = lambda x: x[0] if x else ''

    async def main(self, url):
        global browser
        browser = await launch()
        page = await browser.newPage()
        await page.goto(url)
        text = await page.content()  # return the page HTML
        return text

    def spider(self):
        df = pd.DataFrame(columns=['brand', 'title', 'original price', 'current price', 'discount'])
        # Make the HTTP request, e.g.
        # https://category.vip.com/suggest.php?keyword=%E5%8F%A3%E7%BA%A2&brand_sn=10000359
        res = self.session.get(self.url, params={'keyword': self.name}, headers=self.headers, verify=False)
        html = etree.HTML(res.text)
        url_list = html.xpath('//div[@class="c-filter-group-content"]/div[contains(@class,"c-filter-group-scroll-brand")]/ul/li/a/@href')
        # Iterate over the brand URLs
        for i in url_list:
            ua.wait_some_time()
            # Drive the browser to request the page
            page_html = asyncio.get_event_loop().run_until_complete(self.main('http:' + i))
            # Parse the page source
            page = etree.HTML(page_html)
            htmls = page.xpath('//section[@id="J_searchCatList"]/div')
            # Iterate over the product list
            for h in htmls[1:]:
                # Brand
                pingpai = self.hadlnone(h.xpath('.//div[contains(@class,"c-breadcrumbs-cell-title")]/span/text()'))
                # Title
                title = self.hadlnone(h.xpath('.//div[contains(@class,"c-goods-item__name")]/text()'))
                # Original price
                y_price = self.hadlnone(h.xpath('.//div[contains(@class,"c-goods-item__market-price")]/text()'))
                # Sale price
                x_price = self.hadlnone(h.xpath('.//div[contains(@class,"c-goods-item__sale-price")]/text()'))
                # Discount
                zk = self.hadlnone(h.xpath('.//div[contains(@class,"c-goods-item__discount")]/text()'))
                logger.info(f'brand {pingpai}, title {title}, original price {y_price}, current price {x_price}, discount {zk}')
                # Build a record
                pro = {
                    'brand': pingpai,
                    'title': title,
                    'original price': y_price,
                    'current price': x_price,
                    'discount': zk
                }
                df = df.append([pro])
        df.to_excel('vipshop_data_2.xlsx', index=False)
        return df

    def __del__(self):
        browser.close()

if __name__ == '__main__':
    url = 'https://category.vip.com/suggest.php'
    name = 'perfume'
    w = Wph(url, name)
    w.spider()