Crawling with Selenium: automatically favoriting all of a CSDN blogger's articles
2022-07-23 14:57:00 【Black horse Lanxi】
Daily share:
It's not that there are no flowers; it was flowers from the very beginning.
Contents
I. Things to note
1. For each load, slide twice
2. Before GETting a URL, check whether it has already been favorited
Preface (thought process):
I've written auto-like, auto-comment, and auto-read scripts before; recently I got auto-favoriting working too, so here's an article to record it. At this point I feel like I've nearly ruined CSDN single-handedly (just kidding).
My initial idea was: first crawl the URLs of all the blogger's articles into a txt file, then use Selenium to drive a browser to open each URL (each article) and click Favorite.
So, how do you crawl the URLs of all the articles?
I took a look: this means crawling data across the whole site, so without hesitation I reached for CrawlSpider, the framework made for exactly that. At first I still wanted to follow my old approach and switch the homepage to the old version, but I couldn't find the switch anymore, so the new homepage it was (the old version pages through articles; the new one shows 20 at a time and loads another 20 as you scroll down...). Then I noticed the blogger's homepage has category columns, so I could write three Rule extractors in CrawlSpider: the first extracts the URL of each column from the homepage, the second extracts the URL of each page within a column, and the third extracts each article's URL from those pages. A minimal sketch of that plan is below.
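The sketch is only an illustration of the three-rule idea; the allow patterns are hypothetical placeholders, not CSDN's actual URL structure, and would have to be read off the real pages:

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class BlogSpider(CrawlSpider):
    name = 'blog'
    # hypothetical blogger homepage
    start_urls = ['https://blog.csdn.net/some_blogger']

    rules = (
        # 1. from the homepage, extract the URL of each category column
        Rule(LinkExtractor(allow=r'/category_\d+\.html$'), follow=True),
        # 2. from each column, extract the URL of every page
        Rule(LinkExtractor(allow=r'/category_\d+_\d+\.html$'), follow=True),
        # 3. from each page, extract every article URL
        Rule(LinkExtractor(allow=r'/article/details/\d+$'), callback='parse_article'),
    )

    def parse_article(self, response):
        yield {'url': response.url}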

And then, sure enough, something went wrong. Looking at the pager markup on each page: huh? Why is there no URL in it? (So much for my regex matching...) Fine then, let's do it with Selenium:
The idea: use Selenium to scroll the browser down until it has loaded every article, then grab all the article URLs with XPath. (This works; the code is given below.)
That brings us to the next problem: getting the browser to click Favorite.
Automating the click itself isn't hard, but there's a catch: every browser Selenium launches starts out logged out. I thought: easy, I'll just carry the cookies over and be done with it. So I added the cookies (roughly as in the sketch after this paragraph), but it didn't work (I didn't know why at the time; now I suspect the automation was being detected). Since cookies were out, I had Selenium type in the account and password to log in, and hit another surprise: the clicks were fine, but after a few logins the site demands a slider captcha, and although dragging it isn't hard, it kept reporting failure and asking me to slide again. It was presumably detection: I tried many of the Selenium anti-detection tricks floating around online and none of them worked. (At first I thought my slider-dragging code was just bad, but optimizing it didn't help. Why do I believe it was detection? Because in the Selenium-launched browser even dragging the slider by hand failed, while in a manually opened browser logging in was no problem.) I was stuck here for a long time; eventually I got past it by taking over an existing browser (take over the browser and log in there; this sidesteps the detection rather than defeating it).
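For context, the cookie attempt was roughly of this shape. This is only a sketch: the cookie name is a placeholder, not CSDN's real session cookie, and the value would be copied from a logged-in browser:

from selenium import webdriver

driver = webdriver.Chrome()
# a cookie can only be added for the domain currently loaded,
# so open the site first
driver.get('https://blog.csdn.net')
driver.add_cookie({'name': 'UserToken', 'value': '<value copied from a logged-in browser>'})
driver.refresh()  # reload so the site sees the cookie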
So, change of plan: take over the browser (launched as sketched below). Log in manually once, keep using that same browser from the code (it stays logged in), then GET each URL and click Favorite. (Actually, before that, my idea had been to log in once with Selenium [because I found that for a while after your first login the site doesn't demand the slider verification], then click an article, click Favorite, close the current tab, click the next article, combined with scrolling. But that code had problems, and once I came across the takeover method I gave it up. How to put it: that approach probably isn't great anyway; if the articles on the homepage render at different heights, things can go wrong later, i.e. the click lands on the wrong spot. Still, it's one way to think about it.)
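For the takeover to work, the browser must be started with remote debugging enabled before the script connects to it. A minimal sketch of launching such a Chrome from Python; the executable path, port, and profile directory are my assumptions for a typical Windows setup, so adjust them to your machine:

import subprocess

subprocess.Popen([
    r'C:\Program Files\Google\Chrome\Application\chrome.exe',  # assumed install path
    '--remote-debugging-port=9222',                 # must match the port used in debuggerAddress
    r'--user-data-dir=C:\selenium\chrome_profile',  # separate profile; log in to CSDN here once
])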
I. Things to note
1. For each load, slide twice
Sometimes a single slide fails to trigger loading, and then some data never gets crawled.
Below is the code that collects the URLs:
from selenium import webdriver
import time

url = "the blogger's homepage URL"

options = webdriver.ChromeOptions()
# add the flag that makes the browser run without a visible window
options.add_argument('--headless')
driver = webdriver.Chrome(options=options)
driver.get(url)
time.sleep(4)

# The number of iterations depends on the total number of articles:
# each load brings in 20, so work it out for your case
for i in range(58):
    # Don't jump straight to the bottom with a single JS call every time --
    # sometimes the page fails to load new articles that way. Scrolling in
    # two steps makes sure the refresh actually happens.
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight / 2)")
    driver.execute_script("window.scrollTo(document.body.scrollHeight / 2, document.body.scrollHeight)")
    time.sleep(1)

# Grab the element for each article (not the URL itself;
# get_attribute('href') below is what yields the URL)
url_list = driver.find_elements('xpath', '//article[@class="blog-list-box"]/a')
with open('urls.txt', 'a+', encoding='utf8') as f:
    for link in url_list:
        href = link.get_attribute('href')
        print(href)
        f.write(href + '\n')
driver.quit()
2. Before GETting a URL, check whether it has already been favorited
At first there was no such check. After running for a while, the script would go back to articles that had already been favorited and click again (which cancels the favorite). I still don't know why (one plausible cause: urls.txt is opened in append mode, so re-running the collector script appends duplicate URLs). In short, adding the check is the right move.
Below is the code that clicks Favorite:
from selenium.webdriver import Chrome
from selenium.webdriver.chrome.options import Options
import time

# Take over the already-running browser and send the GET
def get_chrome_proxy(url):
    options = Options()
    # the remote-debugging port the browser was started with
    options.add_experimental_option("debuggerAddress", "127.0.0.1:<port number>")
    driver = Chrome(options=options)
    driver.get(url)
    driver.implicitly_wait(15)
    return driver

# Perform the click actions
def manage(url):
    try:
        driver = get_chrome_proxy(url)
        # click "Favorite"
        driver.find_element('xpath', '//*[@id="toolBarBox"]/div/div[2]/ul/li[4]').click()
        time.sleep(2)
        # in the favorites dialog, click the default folder
        driver.find_element('xpath', '//*[@id="csdn-collection"]/div/div[2]/ul/li/span[2]').click()
    except Exception as e:
        print(e)

if __name__ == '__main__':
    # URLs that have already been visited go in this list
    finish = []
    with open('urls.txt', 'r') as f:
        urls = f.readlines()
    # skip anything we have already visited
    for i in urls:
        i = i.strip()
        if i not in finish:
            manage(i)
            finish.append(i)

But the efficiency really is a bit low.
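One more thing worth noting: the finish list above only lives for a single run, so restarting the script forgets which articles were already favorited. A small sketch of one way to persist it across runs (my own addition, not part of the original workflow; finish.txt is a file name chosen for this example):

# A sketch of persisting visited URLs across runs, so that a restart
# does not re-click (and thereby un-favorite) an article.
import os

def load_finished(path='finish.txt'):
    # return the set of URLs already recorded as favorited
    if not os.path.exists(path):
        return set()
    with open(path, 'r', encoding='utf8') as f:
        return {line.strip() for line in f if line.strip()}

def mark_finished(url, path='finish.txt'):
    # append one finished URL per line
    with open(path, 'a', encoding='utf8') as f:
        f.write(url + '\n')

finish = load_finished()
with open('urls.txt', 'r', encoding='utf8') as f:
    urls = [line.strip() for line in f if line.strip()]
for url in urls:
    if url not in finish:
        manage(url)  # the click function defined above
        mark_finished(url)
        finish.add(url)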