Web crawler 2: crawl the user ID and home page address of Netease cloud music reviews
2022-06-26 21:33:00 【Romantic data analysis】
The goal of this article:
In the previous article we obtained the song IDs and URLs of a popular singer. This article goes one step further and collects the commenters' user IDs and home-page addresses. The ultimate goal is:
1. Starting from a popular singer, grab the song IDs.
2. Using the song IDs, grab the commenting users' IDs.
3. Using the user IDs, send targeted push messages.
The previous article completed step 1; this article completes step 2.
A side note: the pageless requests approach used in part 1 to fetch song IDs is quite fast, but after about 2,000 records the server recognizes it as a crawler and bans the IP. Connecting through a mobile hotspot, then toggling airplane mode and reconnecting to obtain a new IP, lets you fetch another 2,000 records.
As in part 1, we store the crawl results in MySQL. In addition, this article supports resuming after errors: each time a record is fully processed, its processing flag is set to 'Y', much like in a production system.
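The flag-driven resume pattern described above can be sketched as follows. This is a minimal illustration, not code from this series: it uses SQLite in place of MySQL so it runs anywhere, the table and column names (`songinf`, `clbz`) follow this article's schema, and `process` is a hypothetical stand-in for the real crawl step.

```python
import sqlite3  # stand-in for MySQL so the sketch is self-contained

conn = sqlite3.connect(':memory:')
conn.execute("CREATE TABLE songinf (song_url TEXT, clbz TEXT DEFAULT 'N')")
conn.executemany("INSERT INTO songinf (song_url) VALUES (?)",
                 [('url-1',), ('url-2',), ('url-3',)])

def process(url):
    # hypothetical placeholder for crawling one song's comment page
    return True

# Only rows still flagged 'N' are selected, so a restart after a crash
# resumes exactly where the previous run stopped.
pending = conn.execute(
    "SELECT song_url FROM songinf WHERE clbz = 'N'").fetchall()
for (url,) in pending:
    if process(url):
        conn.execute("UPDATE songinf SET clbz = 'Y' WHERE song_url = ?",
                     (url,))
        conn.commit()

remaining = conn.execute(
    "SELECT COUNT(*) FROM songinf WHERE clbz = 'N'").fetchone()[0]
print(remaining)  # 0: every row has been marked done
```

If the loop dies halfway, the already-committed 'Y' flags survive, and the next run's SELECT simply skips them.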
Step 1: create the MySQL table
We need another table, named userinf, to store each user's ID, comment time, and home-page address.
The DDL statement is as follows:
DROP TABLE IF EXISTS `userinf`;
CREATE TABLE `userinf` (
`id` int(12) NOT NULL AUTO_INCREMENT,
`user_id` varchar(30) CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL,
`user_name` varchar(200) CHARACTER SET utf8 COLLATE utf8_general_ci ,
`user_time` varchar(100) CHARACTER SET utf8 COLLATE utf8_general_ci ,
`user_url` varchar(400) CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL,
`clbz` varchar(1) CHARACTER SET utf8 COLLATE utf8_general_ci ,
`bysz` float(3, 0) NULL DEFAULT 0.00,
PRIMARY KEY (`id`) USING BTREE,
INDEX `user_id`(`user_id`) USING BTREE
) ENGINE = InnoDB CHARACTER SET = utf8 COLLATE = utf8_general_ci ROW_FORMAT = Dynamic;
Once the table exists, we need a small Python program to insert rows into it.
The program is named useridSpiderSQL.py; the code is:
#!/usr/bin/env python
# -*- coding:utf-8 -*-
__author__ = 'luoji'

import pymysql

# from, where, group by, select, having, order by, limit
class Mysql_pq(object):
    def __init__(self):
        self.conn = pymysql.Connect(host='127.0.0.1',
                                    port=3306,
                                    user='root',
                                    passwd='root',
                                    db='python',
                                    # Use utf8mb4 when creating the database,
                                    # so that emoji and other non-BMP
                                    # characters can be stored
                                    charset='utf8mb4')
        self.cursor = self.conn.cursor()

    def modify_sql(self, sql, data):
        self.cursor.execute(sql, data)
        self.conn.commit()

    def __del__(self):
        self.cursor.close()
        self.conn.close()

def insert_userinf(user_id, user_name, user_time, user_url, clbz):
    helper = Mysql_pq()
    print('Connected to database python, ready to insert user information')
    # insert one row of user data
    insert_sql = ('insert into userinf(user_id,user_name,user_time,user_url,clbz) '
                  'values (%s,%s,%s,%s,%s)')
    data = (user_id, user_name, user_time, user_url, clbz)
    helper.modify_sql(insert_sql, data)

if __name__ == '__main__':
    user_id = '519250015'
    user_name = 'Please remember me'
    user_url = 'https://music.163.com/#/song?id=1313052960&lv=-1&kv=-1&tv=-1'
    user_time = '2021年2月18日'
    clbz = 'N'
    insert_userinf(user_id, user_name, user_time, user_url, clbz)
    print('test over')
Supporting error redo: updating the songinf table
To support redoing after errors, we set the processing flag of a songinf row to 'Y' once it is done. If something goes wrong and the program restarts, it automatically skips records whose flag is 'Y' and only handles those flagged 'N', so the crawl resumes where it left off.
To make this work, after crawling one song's commenters we must write the flag back to songinf. Again we need a small Python program, this time to update the table.
The program is named updateSongURLSQL.py; the code is:
#!/usr/bin/env python
# -*- coding:utf-8 -*-
__author__ = 'luoji'

import pymysql

class Mysql_pq(object):
    def __init__(self):
        self.conn = pymysql.Connect(host='127.0.0.1',
                                    port=3306,
                                    user='root',
                                    passwd='root',
                                    db='python',
                                    # utf8mb4 so that emoji etc. can be stored
                                    charset='utf8mb4')
        self.cursor = self.conn.cursor()

    def __del__(self):
        self.cursor.close()
        self.conn.close()

def updater_songurl(url):
    helper = Mysql_pq()
    print('Connected to database python, ready to update the song flag')
    # Use a parameterized query rather than string formatting, which
    # avoids SQL injection and quoting problems
    sql = "UPDATE songinf SET clbz = 'Y' WHERE song_url = %s"
    print('sql is:', sql)
    helper.cursor.execute(sql, (url,))
    helper.conn.commit()

if __name__ == '__main__':
    url = 'https://music.163.com/#/song?id=569213220&lv=-1&kv=-1&tv=-1'
    updater_songurl(url)
    print('urllist = ', url)
    print('update over')
Crawling the commenting users:
To avoid being banned by the server, this time we use the Selenium automation module to drive a real browser, so the server cannot easily tell a crawler from a user. The drawback is speed: the current crawl rate is about 1,000 user records per hour.
Running overnight, it collected 100,000+ user IDs. The crawler needs the song URLs collected earlier, so we first write a Python program that reads them from the songinf table.
The program is named getSongURLSQL.py; the code is:
#!/usr/bin/env python
# -*- coding:utf-8 -*-
__author__ = 'luoji'

import pymysql

class Mysql_pq(object):
    def __init__(self):
        self.conn = pymysql.Connect(host='127.0.0.1',
                                    port=3306,
                                    user='root',
                                    passwd='root',
                                    db='python',
                                    # utf8mb4 so that emoji etc. can be stored
                                    charset='utf8mb4')
        self.cursor = self.conn.cursor()

    def __del__(self):
        self.cursor.close()
        self.conn.close()

def select_songurl():
    helper = Mysql_pq()
    print('Connected to database python, ready to read unprocessed song URLs')
    urllist = []
    sql = "SELECT * FROM songinf WHERE clbz = 'N'"
    helper.cursor.execute(sql)
    results = helper.cursor.fetchall()
    for row in results:
        id = row[0]
        song_id = row[1]
        song_name = row[2]
        song_url = row[3]
        clbz = row[4]
        # Print the results
        print('id =', id)
        print('song_url =', song_url)
        urllist.append(song_url)
    return urllist

if __name__ == '__main__':
    urllist = select_songurl()
    print('urllist = ', urllist)
    print('test over')
As you can see, MySQL sits at the center of this pipeline.
The main crawler program is:
import re
import time
import numpy as np
from selenium import webdriver

from getSongURLSQL import *
from useridSpiderSQL import *
from updateSongURLSQL import *

def is_number(s):
    try:
        float(s)
        return True
    except ValueError:
        pass
    try:
        import unicodedata
        unicodedata.numeric(s)
        return True
    except (TypeError, ValueError):
        pass
    return False

def geturl(urllist):
    # If the driver is not on the PATH, its location must be given explicitly
    # Verified on 2021-02-19
    driver = webdriver.Firefox()
    # driver = webdriver.Chrome()
    driver.maximize_window()
    driver.set_page_load_timeout(30)
    driver.set_window_size(1124, 850)
    for url in urllist:
        print('now the url is:', url)
        driver.get(url)
        time.sleep(3)
        print('Page loaded, start parsing')
        # Netease Cloud Music renders its content inside an iframe,
        # so we must switch into the frame first
        driver.switch_to.frame('g_iframe')
        href_xpath = "//div[contains(@class,'cntwrap')]//div[contains(@class,'cnt f-brk')]//a[contains(@class,'s-fc7')]"
        songid = driver.find_elements_by_xpath(href_xpath)
        useridlist = []
        usernamelist = []
        for i in songid:
            userurl = i.get_attribute('href')
            userid = userurl[35:]  # the numeric user id at the end of the URL
            print('userid = ', userid)
            username = i.text
            print('username = ', username)
            try:
                if is_number(userid):  # purely numeric: a valid user id
                    print('user id is numeric, keep it')
                    useridlist.append(userid)
                    usernamelist.append(username)
                else:
                    continue
            except (TypeError, ValueError):
                print('user id is not numeric, discard it')
                continue
        # Collect the comment timestamps
        commenttimelist = []
        time_xpath = "//div[contains(@class,'cntwrap')]//div[contains(@class,'rp')]//div[contains(@class,'time s-fc4')]"
        songtime = driver.find_elements_by_xpath(time_xpath)
        for itime in songtime:
            commenttime = itime.text
            print('commenttime = ', commenttime)
            commenttimelist.append(commenttime)
        # Pad missing timestamps with a placeholder date
        if len(commenttimelist) < len(useridlist):
            for i in np.arange(0, len(useridlist) - len(commenttimelist), 1):
                commenttimelist.append('2021年2月18日')
        print('len(useridlist) is =', len(useridlist))
        for i in np.arange(0, len(useridlist), 1):
            userid_i = useridlist[i]
            username_i = usernamelist[i]
            commenttime_i = commenttimelist[i]
            # Insert into the database
            print('userid_i=', userid_i)
            print('username_i=', username_i)
            print('commenttime_i=', commenttime_i)
            userurl_i = 'https://music.163.com/#/user/home?id=' + str.strip(userid_i)
            print('userurl_i=', userurl_i)
            clbz = 'N'
            try:
                insert_userinf(userid_i, username_i, commenttime_i, userurl_i, clbz)
            except Exception:
                print('Error inserting into database')
        time.sleep(5)
        updater_songurl(url)

def is_login(source):
    rs = re.search(r"CONFIG\['islogin'\]='(\d)'", source)
    if rs:
        return int(rs.group(1)) == 1
    else:
        return False

if __name__ == '__main__':
    # url = 'https://music.163.com/#/discover/toplist?id=2884035'
    urllist = select_songurl()
    geturl(urllist)
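One fragile spot in the crawler above is `userurl[35:]`, which assumes the URL prefix always has exactly 35 characters. A more robust alternative (a sketch, not part of the original code) is to parse the id out of the URL's query string with the standard library; the URL shapes are the ones seen in this article.

```python
from urllib.parse import urlparse, parse_qs

def extract_user_id(userurl):
    # Netease user links look like .../user/home?id=519250015; the query
    # part may also sit after a '#' fragment, so check both locations.
    parsed = urlparse(userurl)
    query = parsed.query or urlparse(parsed.fragment).query
    ids = parse_qs(query).get('id', [])
    return ids[0] if ids else None

print(extract_user_id('https://music.163.com/user/home?id=519250015'))    # 519250015
print(extract_user_id('https://music.163.com/#/user/home?id=519250015'))  # 519250015
```

This returns None for links without an id parameter, which also replaces the `is_number` filtering for the common case of malformed hrefs.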
The results are as follows:
A few notes:
1. The crawler does not page through the latest comments. To do that, you would need to locate the page-turn button, click it, and crawl the user IDs again.
2. The comment text itself is not stored for now.
3. The crawled comment dates come in very irregular formats and need post-processing.
The next part will complete step 3: pushing songs to the 100,000+ collected users.
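As an illustration of note 3, irregular Chinese date strings can be normalized with a small regex. This is a sketch only: the assumed formats ('2021年2月18日', and '2月18日' with the year implied) are examples, not an exhaustive list of what Netease actually displays.

```python
import re
from datetime import date

def normalize_date(s, default_year=2021):
    # Match an optional 4-digit year (年), then month (月) and day (日)
    m = re.match(r'(?:(\d{4})年)?(\d{1,2})月(\d{1,2})日', s.strip())
    if m:
        year = int(m.group(1)) if m.group(1) else default_year
        return date(year, int(m.group(2)), int(m.group(3))).isoformat()
    return None  # unrecognized format: leave for later handling

print(normalize_date('2021年2月18日'))  # 2021-02-18
print(normalize_date('2月18日'))        # 2021-02-18
```

Relative forms like '昨天 08:30' ("yesterday") fall through to None and would need a separate rule.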