
PyG Tutorial (4): Customizing Datasets

2022-06-21 06:44:00 Si Xi is towering

1. Preface

In PyG, besides using the built-in benchmark datasets directly, users can also define their own datasets. The approach is similar to PyTorch's: you inherit from a dataset class. PyG provides two abstract dataset classes:

  • torch_geometric.data.Dataset: for building large (out-of-memory) datasets;
  • torch_geometric.data.InMemoryDataset: for building in-memory (small) datasets; it inherits from Dataset.

Both are introduced in detail below.

2. In-memory datasets

2.1 Creation steps

To build your own in-memory dataset in PyG, you need to inherit from the InMemoryDataset class and implement the following methods:

  • raw_file_names(): returns the list of raw file names; if any file in this list is missing from self.raw_dir, download() is called to fetch it;
  • processed_file_names(): returns the list of file names produced by process(); if any file in this list is missing from self.processed_dir, process() is called to create it;
  • download(): downloads the raw dataset into self.raw_dir;
  • process(): processes the raw dataset and saves the result into self.processed_dir.

For the first two methods, if there is only a single file, you can return the file name as a string directly instead of a list.

In addition, self.raw_dir and self.processed_dir above are actually methods; their source code is:

# with @property, the method can be accessed like an attribute
@property
def raw_dir(self) -> str:
    return osp.join(self.root, 'raw')

@property
def processed_dir(self) -> str:
    return osp.join(self.root, 'processed')

As the source shows, self.raw_dir and self.processed_dir are the raw-data folder and the processed-data folder under the given root path.
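As a minimal, PyG-independent sketch of how such @property methods resolve paths (the class name DirDemo is hypothetical, used only for illustration):

```python
import os.path as osp

class DirDemo:
    """Hypothetical class mimicking how PyG derives folders from root."""

    def __init__(self, root):
        self.root = root

    @property
    def raw_dir(self) -> str:
        # Accessed like an attribute thanks to @property
        return osp.join(self.root, 'raw')

    @property
    def processed_dir(self) -> str:
        return osp.join(self.root, 'processed')

demo = DirDemo('tmp')
print(demo.raw_dir)        # tmp/raw (on POSIX systems)
print(demo.processed_dir)  # tmp/processed
```

Note that `demo.raw_dir` is accessed without parentheses; this is exactly why PyG can treat these methods as attributes.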

2.2 A worked example

This article uses the Facebook social network from the SNAP datasets as an example to demonstrate how to create an InMemoryDataset subclass called FaceBook. The dataset contains 4,039 nodes and 88,234 edges. Visualizing the network with Gephi gives:

(Figure: the Facebook network visualized with Gephi)

Following the description in Section 2.1, here is the source code of the custom FaceBook class:

import os
import pandas as pd
import torch
from torch_geometric.data import Data
from torch_geometric.data import InMemoryDataset, download_url, extract_gz


class FaceBook(InMemoryDataset):
    url = "https://snap.stanford.edu/data/facebook_combined.txt.gz"

    def __init__(self,
                 root,
                 transform=None,
                 pre_transform=None,
                 pre_filter=None):
        super().__init__(root, transform, pre_transform, pre_filter)
        self.data, self.slices = torch.load(self.processed_paths[0])

    @property
    def raw_file_names(self):
        return ["facebook_combined.txt"]

    @property
    def processed_file_names(self):
        return "data.pt"

    def download(self):
        path = download_url(self.url, self.raw_dir)
        extract_gz(path, self.raw_dir)

    def process(self):
        # Load the raw edge list (one edge per line: "u v")
        path = os.path.join(self.raw_dir, "facebook_combined.txt")
        edges = pd.read_csv(path, header=None,
                            delimiter=" ").values.T
        # Build the Data object; edge_index must have shape [2, num_edges]
        edge_index = torch.from_numpy(edges).contiguous()
        g = Data(edge_index=edge_index, num_nodes=4039)
        data, slices = self.collate([g])
        torch.save((data, slices), self.processed_paths[0])


if __name__ == "__main__":
    dataset = FaceBook(root="tmp")
    data = dataset[0]
    print(data.num_edges, data.num_nodes)
    # 88234 4039
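The shape of edge_index deserves attention: the raw file stores one edge per row (an array of shape [num_edges, 2]), while PyG expects edge_index of shape [2, num_edges]. A transpose (.T) preserves each (source, target) pair, whereas reshape(2, -1) scrambles them. A quick NumPy check with toy edges (no PyG required):

```python
import numpy as np

# Three toy edges, one per row, as in the raw edge-list file
rows = np.array([[0, 1],
                 [2, 3],
                 [4, 5]])

edge_index = rows.T              # shape (2, num_edges): pairs preserved
print(edge_index[:, 0])          # [0 1] -> the first edge (0, 1)

scrambled = rows.reshape(2, -1)  # also shape (2, 3), but pairs are broken
print(scrambled[:, 0])           # [0 3] -> (0, 3) was never an edge
```

This is why the transpose, not a reshape, is the correct way to turn a row-per-edge array into an edge_index.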

A few notes:

  • download() and process() are only executed on the first run; afterwards the processed dataset is loaded directly from disk.
  • Not all four methods are required. For example, if you already have the dataset locally, there is no need to override download().

3. Large datasets

For large graph datasets, you need to inherit from the Dataset class. In addition to the four methods overridden for InMemoryDataset, you also need to override the following:

  • len(): returns the number of examples in the dataset;
  • get(): implements the logic for loading a single graph.

Since defining a large dataset is similar to InMemoryDataset, only a brief sketch is given here.

4. Conclusion


Customizing datasets is important, especially when you need to convert local data into PyG's standard graph-dataset format.

Copyright notice: this article was written by [Si Xi is towering]; please include the original link when reposting:
https://yzsam.com/2022/172/202206210624432954.html