Splitting the Training, Validation and Test Sets in Link Prediction (Taking PyG's RandomLinkSplit as an Example)
2022-07-24 04:33:00 【Cyril_KI】
1. Basic Concepts of Link Prediction
In graph tasks, so-called link prediction generally carries two meanings: in a static network, link prediction is used to find missing links; in a dynamic network, it is used to predict links that may appear in the future.
Previous academic work on link prediction can be divided into three categories: heuristic methods (Heuristic Methods), network embedding methods, and graph neural network methods.
- Heuristic methods (Heuristic Methods): mainly use predefined features between two nodes to measure their similarity, such as the number of common neighbors (Common Neighbors), the Jaccard similarity, and the Katz index (see the sketch after this list). This approach assumes that nodes connected by an edge share certain structural characteristics, an assumption that does not necessarily hold for every network.
- Network embedding methods: use walk-based methods to obtain multiple paths passing through a node, then predict the masked nodes on those paths via Skip-Gram or CBOW. In this approach, the link prediction task is not directly embedded into the supervised learning process, and node attributes cannot be exploited well, so good prediction accuracy is hard to achieve.
- Graph neural networks (graph neural network, GNN): fall into two main categories, node-centric GNN models (Node-centric GNN) and edge-centric GNN models (Edge-centric GNN).
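To make the heuristic scores above concrete, here is a minimal sketch (the toy graph is made up for illustration) computing Common Neighbors and Jaccard similarity from adjacency sets:

# toy graph as adjacency sets (hypothetical example data)
adj = {0: {1, 2}, 1: {0, 2, 3}, 2: {0, 1}, 3: {1}}

def common_neighbors(u, v):
    # number of neighbors shared by u and v
    return len(adj[u] & adj[v])

def jaccard(u, v):
    # shared neighbors normalized by the size of the neighbor union
    union = adj[u] | adj[v]
    return len(adj[u] & adj[v]) / len(union) if union else 0.0

print(common_neighbors(0, 3), jaccard(0, 3))  # 1 0.5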
When doing link prediction with a GNN, we usually turn it into a binary classification problem: edges that exist in the graph are called positive samples, and edges that do not exist are called negative samples.
A practical problem is that the number of links present in a network is often far smaller than the number of absent ones; that is, the graph has far fewer positive samples than negative samples. To keep model training balanced, we usually first split the positive samples into training, validation and test sets, and then sample an equal number of negative samples within each of the three sets to take part in training, validation and testing respectively.
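As an illustration of this kind of negative sampling, here is a minimal sketch using PyG's negative_sampling utility on a made-up toy graph:

import torch
from torch_geometric.utils import negative_sampling

# a toy graph with 4 nodes and 3 directed positive edges
edge_index = torch.tensor([[0, 1, 2],
                           [1, 2, 3]])

# sample as many negative (non-existing) edges as there are positive ones
neg_edge_index = negative_sampling(edge_index, num_nodes=4,
                                   num_neg_samples=edge_index.size(1))
print(neg_edge_index.size())  # torch.Size([2, 3])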
2. Terminology
In PyG link prediction, there are four types of edges in total:
- training message edges: edges used for GNN message passing and aggregation, i.e., for computing the vector representations of the nodes.
- training supervision edges: divided into positive and negative samples; the node representations obtained from the training message edges are used to perform supervised learning on the training supervision edges.
- validation edges: edges used for validation.
- testing edges: edges used for testing.
Note that the validation edges and testing edges must not overlap with the two kinds of edges used in the training stage.
A question: during model training, are the training supervision edges and training message edges allowed to be the same edges? Does that cause data leakage?
Generally speaking, using the same edge set for message passing and supervision can cause data leakage in the training phase, but it depends on the capacity of the model. For example, GAE uses a GCN encoder and a dot-product decoder; both have limited capacity, so the model's ability to exploit leaked information is also limited.
The above comes from the PyG author's answer in an issue: RandomLinkSplit Split error in #3668.
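For reference, here is a minimal sketch of a GAE-style model of the kind mentioned in that answer: a GCN encoder plus a dot-product decoder. The class name and dimensions are illustrative, not the exact implementation discussed in the issue:

import torch
import torch.nn.functional as F
from torch_geometric.nn import GCNConv

class SimpleGAE(torch.nn.Module):
    def __init__(self, in_dim, hid_dim, out_dim):
        super().__init__()
        self.conv1 = GCNConv(in_dim, hid_dim)
        self.conv2 = GCNConv(hid_dim, out_dim)

    def encode(self, x, edge_index):
        # message passing over the (training) message edges
        return self.conv2(F.relu(self.conv1(x, edge_index)), edge_index)

    def decode(self, z, edge_label_index):
        # dot-product score for each supervision edge
        return (z[edge_label_index[0]] * z[edge_label_index[1]]).sum(dim=-1)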
3. Example
3.1 The Dataset
Here we take the CiteSeer network as an example. CiteSeer is a citation network: each node is a paper, 3327 papers in total. The papers fall into six classes: Agents, AI (artificial intelligence), DB (databases), IR (information retrieval), ML (machine learning) and HCI. If there is a citation relationship between two papers, there is a link between them.
Load the data:
from torch_geometric.datasets import Planetoid

dataset = Planetoid('data', name='CiteSeer')
print(dataset[0])
Output:
Data(x=[3327, 3703], edge_index=[2, 9104], y=[3327], train_mask=[3327], val_mask=[3327], test_mask=[3327])
x=[3327, 3703] means there are 3327 nodes, each with a 3703-dimensional feature vector. These features come from removing stop words and words that appear fewer than 10 times in the documents, leaving a vocabulary of 3703 unique words. edge_index=[2, 9104] means there are 9104 edge entries in total; the two rows hold the source and target node indices respectively.
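A quick way to inspect this structure (a sketch, reusing the dataset loaded above):

data = dataset[0]
# each column of edge_index is one directed entry (source, target);
# CiteSeer stores both directions of every citation link
print(data.edge_index[:, :5])
print(data.is_undirected())  # True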
3.2 RandomLinkSplit
After loading the dataset, we next need to split it into training, validation and test sets.
The splitting criterion is: the training set must not contain edges that belong to the validation or test set, and the validation set must not contain edges that belong to the test set.
PyG's packaged RandomLinkSplit transform makes it easy to split the dataset. Several of its common parameters:
- num_val: proportion of edges placed in the validation set; the default is 0.1.
- num_test: proportion of edges placed in the test set; the default is 0.2.
- is_undirected: if True, the graph is assumed to be undirected.
- add_negative_train_samples: whether to add negative training samples for link prediction. If the model performs its own negative sampling, this should be set to False. We generally set it to False, i.e., the training set contains no negative samples; in each training round we then resample as many negatives as there are training positives. This guarantees different negative samples in every round, which can effectively improve the model's generalization. The default is True.
- neg_sampling_ratio: ratio of negative to positive samples; the default is 1, i.e., the validation and test sets contain as many negatives as positives.
- disjoint_train_ratio: if set to a value greater than 0, that fraction of training edges is used only for supervision and is not shared with message passing (see the sketch after this list).
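To see what the last parameter does, here is a quick sketch; the value 0.3 is an arbitrary choice for demonstration:

import torch_geometric.transforms as T
from torch_geometric.datasets import Planetoid

data = Planetoid('data', name='CiteSeer')[0]
split = T.RandomLinkSplit(num_val=0.1, num_test=0.1, is_undirected=True,
                          disjoint_train_ratio=0.3)
train_data, val_data, test_data = split(data)
# with disjoint_train_ratio=0.3, roughly 30% of the training edges appear
# only in edge_label_index (supervision) and are dropped from edge_index
# (message passing), so the two edge sets no longer fully overlap
print(train_data)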
Splitting CiteSeer with RandomLinkSplit:
import torch
import torch_geometric.transforms as T
from torch_geometric.datasets import Planetoid

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
transform = T.Compose([
T.NormalizeFeatures(),
T.ToDevice(device),
T.RandomLinkSplit(num_val=0.1, num_test=0.1, is_undirected=True,
add_negative_train_samples=False),
])
dataset = Planetoid('data', name='CiteSeer', transform=transform)
train_data, val_data, test_data = dataset[0]
In the end we obtain train_data, val_data and test_data.
Printing the original dataset followed by the three splits:
Data(x=[3327, 3703], edge_index=[2, 9104], y=[3327], train_mask=[3327], val_mask=[3327], test_mask=[3327])
Data(x=[3327, 3703], edge_index=[2, 7284], y=[3327], train_mask=[3327], val_mask=[3327], test_mask=[3327], edge_label=[3642], edge_label_index=[2, 3642])
Data(x=[3327, 3703], edge_index=[2, 7284], y=[3327], train_mask=[3327], val_mask=[3327], test_mask=[3327], edge_label=[910], edge_label_index=[2, 910])
Data(x=[3327, 3703], edge_index=[2, 8194], y=[3327], train_mask=[3327], val_mask=[3327], test_mask=[3327], edge_label=[910], edge_label_index=[2, 910])
From top to bottom: the original dataset, the training set, the validation set and the test set.
Notice that both the validation set and the test set contain:
edge_label=[910], edge_label_index=[2, 910]
That is, each of them holds 910 labeled edges. Here edge_label is binary: positive samples are labeled 1 and negative samples 0. Summing edge_label for the validation set:
print(val_data.edge_label.sum())
tensor(455., device='cuda:0')
910 is twice 455, so the ratio of positive to negative samples in the validation and test sets is 1:1; this is because neg_sampling_ratio defaults to 1.
For the training set:
edge_index=[2, 7284], edge_label=[3642], edge_label_index=[2, 3642]
print(train_data.edge_label.sum())
tensor(3642., device='cuda:0')
This shows that the training set contains 3642 supervision edges in total, all of them positive samples, i.e., edges that exist in the original graph. Notably, since disjoint_train_ratio=0 by default, message passing and supervision overlap during training: the edges in edge_label_index used to supervise training are contained in the message passing and aggregation edges edge_index. We can verify this simply:
# collect edges as (source, target) tuples; a set gives O(1) membership tests
train_edge = set(zip(train_data.edge_index[0].tolist(),
                     train_data.edge_index[1].tolist()))
train_label_edge = list(zip(train_data.edge_label_index[0].tolist(),
                            train_data.edge_label_index[1].tolist()))
s = 0
for x in train_label_edge:
    if x in train_edge:
        s += 1
print('s=', s)
The output is s=3642, showing that edge_label_index is completely contained in edge_index.
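Conversely, we can check that the validation positives never appear among the training message edges, consistent with the splitting criterion above (a quick sketch reusing train_edge from the previous snippet):

# positive validation edges as (source, target) tuples
val_pos = val_data.edge_label_index[:, val_data.edge_label == 1]
val_set = set(zip(val_pos[0].tolist(), val_pos[1].tolist()))
print(len(val_set & train_edge))  # 0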
To summarize: the original graph has 9104 directed edge entries, i.e., 4552 undirected edges, since each undirected edge is stored in both directions. With validation and test proportions of 0.1 each, the validation and test sets each receive 455 positive edges (4552 × 0.1, rounded down) plus 455 sampled negatives, giving the 910 labeled samples seen above. The remaining 3642 positive edges all go to the training set (its negative samples are drawn during training). Because the graph is undirected, those 3642 training edges are stored in both directions in edge_index, hence edge_index=[2, 7284], while edge_label_index lists each edge once, hence [2, 3642]. Finally, the test set's edge_index has 8194 = 7284 + 910 entries: in this transductive split, the test set uses the training edges plus the validation edges (455 edges in both directions) for message passing.
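Putting the pieces together, here is a sketch of one training epoch with per-round negative resampling, matching add_negative_train_samples=False above. The model is assumed to expose encode()/decode() as in the earlier GAE sketch, and optimizer is any standard PyTorch optimizer; both are placeholders rather than the original post's code:

import torch
import torch.nn.functional as F
from torch_geometric.utils import negative_sampling

def train_one_epoch(model, optimizer, train_data):
    model.train()
    optimizer.zero_grad()
    z = model.encode(train_data.x, train_data.edge_index)

    # resample fresh negatives every epoch, one per training positive
    neg_edge_index = negative_sampling(
        edge_index=train_data.edge_index, num_nodes=train_data.num_nodes,
        num_neg_samples=train_data.edge_label_index.size(1))

    # concatenate positive supervision edges with the sampled negatives
    edge_label_index = torch.cat(
        [train_data.edge_label_index, neg_edge_index], dim=-1)
    edge_label = torch.cat(
        [train_data.edge_label,
         train_data.edge_label.new_zeros(neg_edge_index.size(1))], dim=0)

    out = model.decode(z, edge_label_index)
    loss = F.binary_cross_entropy_with_logits(out, edge_label)
    loss.backward()
    optimizer.step()
    return float(loss)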