
1.3-1.4 Web data scraping and data labeling

2022-06-23 01:52:00 I SONGFENG water month

1.3 Web data scraping

Tools:
Linux: curl, which usually does not work well, since many modern pages render their content with JavaScript.
Use a headless browser to fetch pages from the command line. The site administrator cannot easily tell whether the requests come from a crawler or a normal visitor, but a large number of visits from the same IP in a short time still arouses suspicion.
Large-scale crawlers therefore rotate through many IPs.

from selenium import webdriver

chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument("--headless")   # run Chrome without a visible window
chrome = webdriver.Chrome(options=chrome_options)
chrome.get(url)            # load and render the page (url defined elsewhere)
page = chrome.page_source  # get() returns None; the rendered HTML is in page_source

What do you do with the HTML after a page has been crawled?
First, extracting text:

Suppose the pages have been crawled and saved to disk. Then use the Python package BeautifulSoup to parse the HTML and extract the interesting parts, such as the id and class inside each link. Splitting the link with "/" yields a set of fields; put the id into a URL template, so that changing the id retrieves other entries, and following an id into its detail page retrieves the details. The other fields are obtained the same way.
To capture specific information from all pages, first find all the list entries. In Chrome, the Inspect tool shows exactly where in the HTML the data you want sits. (The whole process is fairly tedious.) This covers crawling text only.
It is best to crawl slowly, e.g., one page every 3 seconds, rather than as fast as possible.
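The parsing step described above can be sketched as follows. The HTML snippet, the CSS class name, and the URL layout are made up for illustration; a real page would need its own selectors, found with the Inspect tool.

```python
from bs4 import BeautifulSoup

# A stand-in for HTML that was crawled earlier and saved to disk.
html = """
<ul>
  <li><a class="list-card-link" href="/homedetail/12345/">House A</a></li>
  <li><a class="list-card-link" href="/homedetail/67890/">House B</a></li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")
# Cut each link apart with "/" to get the id:
# "/homedetail/12345/".split("/") -> ['', 'homedetail', '12345', '']
ids = [a["href"].split("/")[2]
       for a in soup.find_all("a", class_="list-card-link")]
print(ids)  # -> ['12345', '67890']

# Put each id into a URL template to reach the detail pages.
detail_urls = [f"https://example.com/homedetail/{i}/" for i in ids]
```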
How do you crawl images?
The network cost of crawling one image is comparable to crawling one HTML page, but the storage cost is much higher.
Crawling also raises legal issues: be careful with data that requires a login, contains personal information, or is under copyright.

1.4 Data labeling

Is part of the data already labeled? If so, consider a semi-supervised model.
If there are not enough labels, is there enough money? If so, you can crowdsource and pay people to label.
If there are not enough labels and no money, consider weakly supervised learning.
Semi-supervised learning:
Addresses the setting where only a small part of the data is labeled.
The most commonly used semi-supervised algorithm is self-training.
Self-training: first train a model on the small labeled set, then use it to predict the unlabeled data, producing pseudo-labeled data (labels produced by the machine, not by people). Merge the pseudo-labeled data with the labeled data, train a new model, and repeat. (In each round, keep only pseudo-labels with good confidence and discard the low-confidence ones.)
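The self-training loop above can be sketched with a toy 1-D nearest-centroid classifier. Everything here (the data, the confidence margin of 3.0, the three rounds) is illustrative, not the course's actual code; in practice the inner model would be a real classifier.

```python
def train_centroids(points, labels):
    """Fit the toy model: one centroid per class from the labeled points."""
    sums, counts = {}, {}
    for x, y in zip(points, labels):
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}

def predict(centroids, x):
    """Return (label, confidence margin between the two nearest centroids)."""
    dists = sorted((abs(x - c), y) for y, c in centroids.items())
    margin = dists[1][0] - dists[0][0] if len(dists) > 1 else float("inf")
    return dists[0][1], margin

# A few human-labeled points and many unlabeled ones.
labeled_x, labeled_y = [0.0, 1.0, 10.0, 11.0], [0, 0, 1, 1]
unlabeled = [0.5, 1.5, 2.0, 9.0, 9.5, 10.5, 5.4]

for _ in range(3):  # a few self-training rounds
    centroids = train_centroids(labeled_x, labeled_y)
    still_unlabeled = []
    for x in unlabeled:
        y, margin = predict(centroids, x)
        if margin > 3.0:              # keep only confident pseudo-labels
            labeled_x.append(x)
            labeled_y.append(y)
        else:                         # low confidence: discard, retry next round
            still_unlabeled.append(x)
    unlabeled = still_unlabeled

final = train_centroids(labeled_x, labeled_y)
print(predict(final, 9.8)[0])  # -> 1
```

Note that the ambiguous point 5.4, which sits between the two clusters, is never pseudo-labeled: its margin stays below the threshold, which is exactly the "keep only confident labels" rule.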
Since the pseudo-labels are produced by a machine rather than by paid annotators, labeling cost is not a concern here, so a deeper, more expensive model can be used for the pseudo-labeling step.
Labeling via crowdsourcing:
The most widely used data annotation method; ImageNet, for example, was labeled through crowdsourcing.
Problems to consider in crowdsourced annotation:
1. Task design: keep the task and the labeling instructions simple.
2. Cost.
3. Quality control.
Use active learning to send only the difficult examples to humans; a more complex model can be used to rank examples by how difficult (uncertain) they are.
Active learning is often combined with self-training: the model pseudo-labels the easy examples and routes the uncertain ones to human annotators.
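A common way to pick the "difficult" examples is uncertainty sampling: send humans the examples whose predicted probability is closest to 0.5. The toy probability function below is a stand-in for any trained classifier; the data and names are illustrative.

```python
def predict_proba(x):
    """Toy stand-in for a trained model: P(class 1) for a 1-D input."""
    return min(max(x / 10.0, 0.0), 1.0)

def most_uncertain(pool, k=2):
    """Return the k pool examples the model is least sure about
    (predicted probability closest to 0.5)."""
    return sorted(pool, key=lambda x: abs(predict_proba(x) - 0.5))[:k]

pool = [0.2, 4.8, 5.1, 9.7, 3.0]
print(most_uncertain(pool))  # -> [5.1, 4.8], the points nearest the boundary
```

The confident examples (0.2 and 9.7 here) would instead be candidates for self-training's pseudo-labels, which is how the two techniques combine.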
Quality control of the labels:
Give the same task to several annotators at once; send questions with high uncertainty or disagreement to multiple people and aggregate their answers.
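A minimal sketch of that aggregation step, assuming a simple majority vote (real crowdsourcing pipelines often weight annotators by reliability instead):

```python
from collections import Counter

def majority_label(votes):
    """Aggregate one task's labels from several annotators by majority vote.
    Also return the agreement ratio as a rough quality signal."""
    top_label, count = Counter(votes).most_common(1)[0]
    return top_label, count / len(votes)

label, agreement = majority_label(["cat", "cat", "dog"])
print(label)  # -> cat  (agreement 2/3; a low ratio flags the task for review)
```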
Weakly supervised learning:
Generates labels semi-automatically.
The most commonly used technique is data programming: write heuristic rules that assign noisy labels.
For example, rules that judge whether an email is spam or normal.
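The spam example can be sketched with data-programming-style labeling functions. The rules, thresholds, and names below are illustrative; each function either votes or abstains, and the votes are combined into a noisy label.

```python
SPAM, HAM, ABSTAIN = 1, 0, -1

def lf_keyword(email):
    """Rule: a spammy phrase strongly suggests spam."""
    return SPAM if "win money" in email.lower() else ABSTAIN

def lf_greeting(email):
    """Rule: a personalized greeting suggests a normal email."""
    return HAM if email.lower().startswith("hi ") else ABSTAIN

def lf_many_caps(email):
    """Rule: mostly upper-case text suggests spam."""
    letters = [c for c in email if c.isalpha()]
    if letters and sum(c.isupper() for c in letters) / len(letters) > 0.5:
        return SPAM
    return ABSTAIN

def label(email, lfs=(lf_keyword, lf_greeting, lf_many_caps)):
    """Combine the labeling functions by majority vote (ties count as spam);
    if no rule fires, the email stays unlabeled."""
    votes = [v for v in (lf(email) for lf in lfs) if v != ABSTAIN]
    if not votes:
        return ABSTAIN
    return SPAM if votes.count(SPAM) >= votes.count(HAM) else HAM

print(label("WIN MONEY NOW!!!"))        # -> 1 (spam)
print(label("Hi Alice, lunch today?"))  # -> 0 (normal)
```

The resulting labels are noisy, but they come almost for free, which is the point of weak supervision; frameworks for this approach additionally model how reliable each rule is instead of taking a flat vote.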


Copyright notice
This article was created by [I SONGFENG water month]. Please include a link to the original when reposting. Thanks.
https://yzsam.com/2022/02/202202220508509669.html