Splitting the Training, Validation and Test Sets in Link Prediction (Taking PyG's RandomLinkSplit as an Example)
2022-07-24 04:33:00 【Cyril_KI】
1. Basic Concepts of Link Prediction
In graph tasks, so-called link prediction generally carries two meanings: in a static network, link prediction is used to find missing links; in a dynamic network, it is used to predict links that may appear in the future.
Previous academic work on link prediction can be divided into three categories: heuristic methods (Heuristic Methods), network embedding methods, and graph neural network methods.
- Heuristic methods (Heuristic Methods): mainly use predefined features between two nodes to measure their similarity, such as the number of common neighbors (Common Neighbors), the Jaccard similarity, and the Katz index (see the sketch after this list). This approach assumes that nodes connected by an edge share certain structural characteristics, an assumption that does not necessarily hold for every network.
- Network embedding methods: use walk-based methods to obtain multiple paths passing through a node, then predict the masked nodes on those paths via Skip-Gram or CBOW. In this approach, the link prediction task is not directly embedded into the supervised learning process, and node attributes cannot be exploited well, so good prediction accuracy is hard to achieve.
- Graph neural networks (graph neural network, GNN): fall into two main categories, node-centric GNN models (Node-centric GNN) and edge-centric GNN models (Edge-centric GNN).
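To make the heuristic scores above concrete, here is a minimal sketch (the toy graph is made up for illustration) computing Common Neighbors and Jaccard similarity from adjacency sets:

# toy graph as adjacency sets (hypothetical example data)
adj = {0: {1, 2}, 1: {0, 2, 3}, 2: {0, 1}, 3: {1}}

def common_neighbors(u, v):
    # number of neighbors shared by u and v
    return len(adj[u] & adj[v])

def jaccard(u, v):
    # shared neighbors normalized by the size of the neighbor union
    union = adj[u] | adj[v]
    return len(adj[u] & adj[v]) / len(union) if union else 0.0

print(common_neighbors(0, 3), jaccard(0, 3))  # 1 0.5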
When doing link prediction with a GNN, we usually turn it into a binary classification problem: edges that exist in the graph are called positive samples, and edges that do not exist are called negative samples.
A practical problem is that the number of links present in a network is often far smaller than the number of absent ones; that is, the graph has far fewer positive samples than negative samples. To keep model training balanced, we usually first split the positive samples into training, validation and test sets, and then sample an equal number of negative samples within each of the three sets to take part in training, validation and testing respectively.
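As an illustration of this kind of negative sampling, here is a minimal sketch using PyG's negative_sampling utility on a made-up toy graph:

import torch
from torch_geometric.utils import negative_sampling

# a toy graph with 4 nodes and 3 directed positive edges
edge_index = torch.tensor([[0, 1, 2],
                           [1, 2, 3]])

# sample as many negative (non-existing) edges as there are positive ones
neg_edge_index = negative_sampling(edge_index, num_nodes=4,
                                   num_neg_samples=edge_index.size(1))
print(neg_edge_index.size())  # torch.Size([2, 3])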
2. Terminology
In PyG link prediction, there are four types of edges in total:
- training message edges: edges used for GNN message passing and aggregation, i.e., for computing the vector representations of the nodes.
- training supervision edges: divided into positive and negative samples; the node representations obtained from the training message edges are used to perform supervised learning on the training supervision edges.
- validation edges: edges used for validation.
- testing edges: edges used for testing.
Note that the validation edges and testing edges must not overlap with the two kinds of edges used in the training stage.
A question: during model training, are the training supervision edges and training message edges allowed to be the same edges? Does that cause data leakage?
Generally speaking, using the same edge set for message passing and supervision can cause data leakage in the training phase, but it depends on the capacity of the model. For example, GAE uses a GCN encoder and a dot-product decoder; both have limited capacity, so the model's ability to exploit leaked information is also limited.
The above comes from the PyG author's answer in an issue: RandomLinkSplit Split error in #3668.
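For reference, here is a minimal sketch of a GAE-style model of the kind mentioned in that answer: a GCN encoder plus a dot-product decoder. The class name and dimensions are illustrative, not the exact implementation discussed in the issue:

import torch
import torch.nn.functional as F
from torch_geometric.nn import GCNConv

class SimpleGAE(torch.nn.Module):
    def __init__(self, in_dim, hid_dim, out_dim):
        super().__init__()
        self.conv1 = GCNConv(in_dim, hid_dim)
        self.conv2 = GCNConv(hid_dim, out_dim)

    def encode(self, x, edge_index):
        # message passing over the (training) message edges
        return self.conv2(F.relu(self.conv1(x, edge_index)), edge_index)

    def decode(self, z, edge_label_index):
        # dot-product score for each supervision edge
        return (z[edge_label_index[0]] * z[edge_label_index[1]]).sum(dim=-1)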
3. Example
3.1 The Dataset
Here we take the CiteSeer network as an example. CiteSeer is a citation network: each node is a paper, 3327 papers in total. The papers fall into six classes: Agents, AI (artificial intelligence), DB (databases), IR (information retrieval), ML (machine learning) and HCI. If there is a citation relationship between two papers, there is a link between them.
Load the data:
from torch_geometric.datasets import Planetoid

dataset = Planetoid('data', name='CiteSeer')
print(dataset[0])
Output:
Data(x=[3327, 3703], edge_index=[2, 9104], y=[3327], train_mask=[3327], val_mask=[3327], test_mask=[3327])
x=[3327, 3703] means there are 3327 nodes, each with a 3703-dimensional feature vector. These features come from removing stop words and words that appear fewer than 10 times in the documents, leaving a vocabulary of 3703 unique words. edge_index=[2, 9104] means there are 9104 edge entries in total; the two rows hold the source and target node indices respectively.
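A quick way to inspect this structure (a sketch, reusing the dataset loaded above):

data = dataset[0]
# each column of edge_index is one directed entry (source, target);
# CiteSeer stores both directions of every citation link
print(data.edge_index[:, :5])
print(data.is_undirected())  # True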
3.2 RandomLinkSplit
After loading the dataset, we next need to split it into training, validation and test sets.
The splitting criterion is: the training set must not contain edges that belong to the validation or test set, and the validation set must not contain edges that belong to the test set.
PyG's packaged RandomLinkSplit transform makes it easy to split the dataset. Several of its common parameters:
- num_val: proportion of edges placed in the validation set; the default is 0.1.
- num_test: proportion of edges placed in the test set; the default is 0.2.
- is_undirected: if True, the graph is assumed to be undirected.
- add_negative_train_samples: whether to add negative training samples for link prediction. If the model performs its own negative sampling, this should be set to False. We generally set it to False, i.e., the training set contains no negative samples; in each training round we then resample as many negatives as there are training positives. This guarantees different negative samples in every round, which can effectively improve the model's generalization. The default is True.
- neg_sampling_ratio: ratio of negative to positive samples; the default is 1, i.e., the validation and test sets contain as many negatives as positives.
- disjoint_train_ratio: if set to a value greater than 0, that fraction of training edges is used only for supervision and is not shared with message passing (see the sketch after this list).
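To see what the last parameter does, here is a quick sketch; the value 0.3 is an arbitrary choice for demonstration:

import torch_geometric.transforms as T
from torch_geometric.datasets import Planetoid

data = Planetoid('data', name='CiteSeer')[0]
split = T.RandomLinkSplit(num_val=0.1, num_test=0.1, is_undirected=True,
                          disjoint_train_ratio=0.3)
train_data, val_data, test_data = split(data)
# with disjoint_train_ratio=0.3, roughly 30% of the training edges appear
# only in edge_label_index (supervision) and are dropped from edge_index
# (message passing), so the two edge sets no longer fully overlap
print(train_data)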
Splitting CiteSeer with RandomLinkSplit:
import torch
import torch_geometric.transforms as T
from torch_geometric.datasets import Planetoid

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
transform = T.Compose([
T.NormalizeFeatures(),
T.ToDevice(device),
T.RandomLinkSplit(num_val=0.1, num_test=0.1, is_undirected=True,
add_negative_train_samples=False),
])
dataset = Planetoid('data', name='CiteSeer', transform=transform)
train_data, val_data, test_data = dataset[0]
In the end we obtain train_data, val_data and test_data.
Printing the original dataset followed by the three splits:
Data(x=[3327, 3703], edge_index=[2, 9104], y=[3327], train_mask=[3327], val_mask=[3327], test_mask=[3327])
Data(x=[3327, 3703], edge_index=[2, 7284], y=[3327], train_mask=[3327], val_mask=[3327], test_mask=[3327], edge_label=[3642], edge_label_index=[2, 3642])
Data(x=[3327, 3703], edge_index=[2, 7284], y=[3327], train_mask=[3327], val_mask=[3327], test_mask=[3327], edge_label=[910], edge_label_index=[2, 910])
Data(x=[3327, 3703], edge_index=[2, 8194], y=[3327], train_mask=[3327], val_mask=[3327], test_mask=[3327], edge_label=[910], edge_label_index=[2, 910])
From top to bottom: the original dataset, the training set, the validation set and the test set.
Notice that both the validation set and the test set contain:
edge_label=[910], edge_label_index=[2, 910]
That is, each of them holds 910 labeled edges. Here edge_label is binary: positive samples are labeled 1 and negative samples 0. Summing edge_label for the validation set:
print(val_data.edge_label.sum())
tensor(455., device='cuda:0')
910 is twice 455, so the ratio of positive to negative samples in the validation and test sets is 1:1; this is because neg_sampling_ratio defaults to 1.
For the training set:
edge_index=[2, 7284], edge_label=[3642], edge_label_index=[2, 3642]
print(train_data.edge_label.sum())
tensor(3642., device='cuda:0')
This shows that the training set contains 3642 supervision edges in total, all of them positive samples, i.e., edges that exist in the original graph. Notably, since disjoint_train_ratio=0 by default, message passing and supervision overlap during training: the edges in edge_label_index used to supervise training are contained in the message passing and aggregation edges edge_index. We can verify this simply:
# collect edges as (source, target) tuples; a set gives O(1) membership tests
train_edge = set(zip(train_data.edge_index[0].tolist(),
                     train_data.edge_index[1].tolist()))
train_label_edge = list(zip(train_data.edge_label_index[0].tolist(),
                            train_data.edge_label_index[1].tolist()))
s = 0
for x in train_label_edge:
    if x in train_edge:
        s += 1
print('s=', s)
The output is s=3642, showing that edge_label_index is completely contained in edge_index.
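Conversely, we can check that the validation positives never appear among the training message edges, consistent with the splitting criterion above (a quick sketch reusing train_edge from the previous snippet):

# positive validation edges as (source, target) tuples
val_pos = val_data.edge_label_index[:, val_data.edge_label == 1]
val_set = set(zip(val_pos[0].tolist(), val_pos[1].tolist()))
print(len(val_set & train_edge))  # 0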
To summarize: the original graph has 9104 directed edge entries, i.e., 4552 undirected edges, since each undirected edge is stored in both directions. With validation and test proportions of 0.1 each, the validation and test sets each receive 455 positive edges (4552 × 0.1, rounded down) plus 455 sampled negatives, giving the 910 labeled samples seen above. The remaining 3642 positive edges all go to the training set (its negative samples are drawn during training). Because the graph is undirected, those 3642 training edges are stored in both directions in edge_index, hence edge_index=[2, 7284], while edge_label_index lists each edge once, hence [2, 3642]. Finally, the test set's edge_index has 8194 = 7284 + 910 entries: in this transductive split, the test set uses the training edges plus the validation edges (455 edges in both directions) for message passing.
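Putting the pieces together, here is a sketch of one training epoch with per-round negative resampling, matching add_negative_train_samples=False above. The model is assumed to expose encode()/decode() as in the earlier GAE sketch, and optimizer is any standard PyTorch optimizer; both are placeholders rather than the original post's code:

import torch
import torch.nn.functional as F
from torch_geometric.utils import negative_sampling

def train_one_epoch(model, optimizer, train_data):
    model.train()
    optimizer.zero_grad()
    z = model.encode(train_data.x, train_data.edge_index)

    # resample fresh negatives every epoch, one per training positive
    neg_edge_index = negative_sampling(
        edge_index=train_data.edge_index, num_nodes=train_data.num_nodes,
        num_neg_samples=train_data.edge_label_index.size(1))

    # concatenate positive supervision edges with the sampled negatives
    edge_label_index = torch.cat(
        [train_data.edge_label_index, neg_edge_index], dim=-1)
    edge_label = torch.cat(
        [train_data.edge_label,
         train_data.edge_label.new_zeros(neg_edge_index.size(1))], dim=0)

    out = model.decode(z, edge_label_index)
    loss = F.binary_cross_entropy_with_logits(out, edge_label)
    loss.backward()
    optimizer.step()
    return float(loss)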