1.3-1.4 web page data capture
2022-06-23 01:52:00 【I SONGFENG water month】
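1.3 Web data capture
Tools:
Linux: curl. It usually does not work well, since many websites try to block simple non-browser clients.
Use a headless browser (a browser without a GUI) to fetch web pages from the command line. The site administrator then cannot easily tell whether a crawler or a normal visitor is accessing the data, but a large number of visits from the same IP in a short time can still arouse suspicion.
Crawling at scale therefore needs many IP addresses.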
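from selenium import webdriver

url = "https://example.com"  # placeholder: replace with the page to crawl
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument("--headless")  # run Chrome without a visible window
chrome = webdriver.Chrome(options=chrome_options)
chrome.get(url)  # get() only loads the page and returns None
page = chrome.page_source  # the rendered HTML as a string
chrome.quit()  # close the browser when done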
How to process the HTML after crawling a page?
Crawling text
Suppose the pages have already been crawled and saved to disk. Parse the HTML with the Python package BeautifulSoup and extract the parts of interest, for example the ID and category inside a link: split the link on "/" to get these fields. Put the ID into a URL template; changing the ID then gives the URLs of other items, and opening the page for an ID gives its details. Other data can be obtained in the same way.
To capture specific information from every page, first find all the list entries. In Chrome, the Inspect tool shows where in the HTML the data you want is located. (The whole process is fairly tedious.) All of this covers only crawling text.
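The link-splitting step described above can be sketched as follows. This is a minimal sketch, not the author's code: the HTML snippet, the `/category/id` link layout, and the `DETAIL_TEMPLATE` URL are all hypothetical, and it uses Python's standard-library `html.parser` in place of BeautifulSoup so that it runs without extra packages.

```python
from html.parser import HTMLParser

# Hypothetical saved page; a real one would be read from disk.
SAMPLE_HTML = """
<ul>
  <li><a href="/homes/1001">House A</a></li>
  <li><a href="/homes/1002">House B</a></li>
  <li><a href="/about">About</a></li>
</ul>
"""

# Hypothetical URL template for the detail page of one item.
DETAIL_TEMPLATE = "https://example.com/{category}/{item_id}/details"


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def extract_items(html):
    """Return (category, item_id) pairs from links shaped like /category/id."""
    parser = LinkCollector()
    parser.feed(html)
    items = []
    for link in parser.links:
        parts = link.strip("/").split("/")  # cut the link open on "/"
        if len(parts) == 2 and parts[1].isdigit():
            items.append((parts[0], parts[1]))
    return items


items = extract_items(SAMPLE_HTML)
# Fill each ID into the template to get the URL of its detail page.
detail_urls = [DETAIL_TEMPLATE.format(category=c, item_id=i) for c, i in items]
```

Links that do not match the assumed layout (here `/about`) are skipped rather than guessed at.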
When crawling, it is best to fetch a page about once every 3 seconds, not more often.
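One way to keep to such a pace is to pause between requests. The sketch below assumes a hypothetical `fetch` callable that takes a URL and returns its HTML. The function name `crawl_politely` and the `sleep` parameter are illustrative choices, with `sleep` included so the pause can be replaced in tests.

```python
import time


def crawl_politely(urls, fetch, delay_seconds=3.0, sleep=time.sleep):
    """Fetch each URL in turn, pausing between consecutive requests."""
    pages = {}
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay_seconds)  # wait before every request except the first
        pages[url] = fetch(url)
    return pages
```

For example, `crawl_politely(urls, fetch=my_fetch)` would take roughly `3 * (len(urls) - 1)` seconds plus the fetch time itself, where `my_fetch` stands for whatever download function is in use.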
How to crawl images?
Fetching an image costs about as much as fetching an HTML page, but the storage cost is considerably higher.
Crawling also raises legal issues. Be careful with data that requires a login, contains personal information, or is copyrighted.
1.4 Data labeling
Are there enough labels? If so, consider a semi-supervised model.
If there are not enough labels, is there enough budget? If so, use crowdsourcing and pay people to label.
If there are not enough labels and no budget, consider weakly supervised learning.
Semi-supervised learning:
Addresses the case where only a small part of the data is labeled.
The most commonly used semi-supervised algorithm: self-training
Self-training: first train a model on the small labeled set, then use it to predict the unlabeled data and obtain pseudo-labels (labels assigned by the machine, not by people). Merge the pseudo-labeled data with the labeled data, train a new model, and repeat. (In each round, keep only the pseudo-labels with high confidence and discard those with low confidence.)
Because this model is only used to label data, its cost matters less, so a deeper network can be used.
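The self-training loop can be sketched in a few lines. This is a toy illustration under stated assumptions, not the method of any particular library: the "model" is a two-class nearest-centroid classifier on a single numeric feature, and the confidence formula and the `threshold` of 0.8 are choices made only for this example. In practice the model would be a much stronger one, such as a deep network.

```python
def fit_centroids(labeled):
    """Train a toy model: the mean feature value of each class."""
    sums, counts = {}, {}
    for x, y in labeled:
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}


def predict(centroids, x):
    """Return (label, confidence); confidence is 0.5 on a tie, 1.0 on a centroid."""
    distances = {y: abs(x - c) for y, c in centroids.items()}
    label = min(distances, key=distances.get)
    total = sum(distances.values())
    confidence = 1.0 if total == 0 else 1.0 - distances[label] / total
    return label, confidence


def self_train(labeled, unlabeled, threshold=0.8, max_rounds=10):
    """Repeatedly pseudo-label confident examples and retrain."""
    labeled = list(labeled)
    remaining = list(unlabeled)
    for _ in range(max_rounds):
        centroids = fit_centroids(labeled)
        confident, uncertain = [], []
        for x in remaining:
            label, confidence = predict(centroids, x)
            if confidence >= threshold:
                confident.append((x, label))  # keep the pseudo-label
            else:
                uncertain.append(x)  # discard it for this round
        if not confident:
            break  # nothing new to learn from
        labeled.extend(confident)
        remaining = uncertain
    return fit_centroids(labeled), labeled, remaining


model, final_labeled, leftover = self_train(
    labeled=[(0.0, "a"), (10.0, "b")],
    unlabeled=[1.0, 9.0, 5.0],
)
```

In this run the points 1.0 and 9.0 receive pseudo-labels, while 5.0 sits halfway between the classes and stays unlabeled.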
Labeling with crowdsourcing
The most widely used data labeling method
ImageNet was labeled through crowdsourcing
Issues to consider in crowdsourced labeling:
1. Task design: keep the task simple so that labeling is easy
2. Cost
3. Quality control
Use active learning so that people label only the difficult examples. More complex models can also be used here to judge which examples are difficult.
Active learning can also be combined with self-training.
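One common way to decide which examples are "difficult" is uncertainty sampling: send people the examples whose predicted class probabilities have the highest entropy. The text does not name a specific criterion, so this choice is an assumption, and the example IDs and probabilities below are made up.

```python
import math


def entropy(probabilities):
    """Shannon entropy of a probability distribution (natural log)."""
    return -sum(p * math.log(p) for p in probabilities if p > 0)


def select_for_labeling(predictions, budget):
    """Return the `budget` example IDs with the most uncertain predictions.

    predictions maps an example ID to its list of class probabilities.
    """
    ranked = sorted(predictions, key=lambda k: entropy(predictions[k]), reverse=True)
    return ranked[:budget]


# Hypothetical model outputs for three unlabeled images.
predictions = {
    "img1": [0.98, 0.02],
    "img2": [0.55, 0.45],
    "img3": [0.70, 0.30],
}
to_label = select_for_labeling(predictions, budget=2)
```

Examples that are not selected, such as the confidently predicted `img1`, are the natural candidates for machine-assigned pseudo-labels when active learning is combined with self-training.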
Quality control of the data
Send the same task to several people at once, or send only the highly doubtful examples to several people.
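When several people label the same task, their answers have to be merged somehow. A simple option is a majority vote with an agreement threshold, sketched below. The `min_agreement` value of 0.7 and the task IDs are assumptions made for illustration, and each task is assumed to have at least one vote.

```python
from collections import Counter


def aggregate_votes(votes, min_agreement=0.7):
    """Merge worker labels per task.

    votes maps a task ID to a non-empty list of labels from different workers.
    Returns (accepted, needs_review): accepted maps task ID to the majority
    label, and needs_review lists tasks where agreement was too low.
    """
    accepted, needs_review = {}, []
    for task_id, labels in votes.items():
        label, count = Counter(labels).most_common(1)[0]
        if count / len(labels) >= min_agreement:
            accepted[task_id] = label
        else:
            needs_review.append(task_id)
    return accepted, needs_review


accepted, needs_review = aggregate_votes({
    "t1": ["cat", "cat", "cat"],
    "t2": ["cat", "dog", "dog"],
})
```

Here `t1` is accepted as `cat`, while `t2` has only two of three workers agreeing and would be sent to more people.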
Weakly supervised learning
Generates labels semi-automatically
The most common approach is data programming
For example, judging whether an email is normal (not spam)
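In data programming, labels come from small heuristic rules (labeling functions) rather than from people. A minimal sketch for the email example follows. The three rules and the sample emails are invented for illustration, and the votes are combined by simple majority, which is a simplification chosen here rather than something the text specifies.

```python
SPAM, HAM, ABSTAIN = 1, 0, -1


def lf_mentions_prize(email):
    """Emails that mention a prize are likely spam."""
    return SPAM if "prize" in email.lower() else ABSTAIN


def lf_many_exclamations(email):
    """Three or more exclamation marks suggest spam."""
    return SPAM if email.count("!") >= 3 else ABSTAIN


def lf_greets_by_team(email):
    """Emails addressed to the team are likely normal."""
    return HAM if email.lower().startswith("hi team") else ABSTAIN


LABELING_FUNCTIONS = [lf_mentions_prize, lf_many_exclamations, lf_greets_by_team]


def weak_label(email):
    """Combine the labeling functions by majority vote, ignoring abstentions."""
    votes = [lf(email) for lf in LABELING_FUNCTIONS]
    spam_votes = votes.count(SPAM)
    ham_votes = votes.count(HAM)
    if spam_votes == ham_votes:
        return ABSTAIN  # no rule fired, or the rules disagree evenly
    return SPAM if spam_votes > ham_votes else HAM
```

The resulting labels are noisy, but they are cheap to produce and can be used to train a model when neither labels nor budget are available.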