
PyG Tutorial (4): Customizing Datasets

2022-06-21 06:44:00 Si Xi is towering

1. Preface

In PyG, besides using the built-in benchmark datasets directly, users can also define their own datasets. The approach is similar to PyTorch's: you inherit from a dataset class. PyG provides two abstract dataset classes:

  • torch_geometric.data.Dataset: for building large (out-of-memory) datasets;
  • torch_geometric.data.InMemoryDataset: for building in-memory (small) datasets; it inherits from Dataset.

Both are introduced in detail below.

2. In-memory datasets

2.1 Creation steps

To build your own in-memory dataset in PyG, you need to inherit from the InMemoryDataset class and implement the following methods:

  • raw_file_names(): returns the list of raw file names; if any file in this list is missing from self.raw_dir, download() is called to fetch it;
  • processed_file_names(): returns the list of file names produced by process(); if any file in this list is missing from self.processed_dir, process() is called to create it;
  • download(): downloads the raw dataset into self.raw_dir;
  • process(): processes the raw dataset and saves the result into self.processed_dir.

For the first two methods, if there is only a single file, you can return the file name as a string directly instead of a list.

In addition, self.raw_dir and self.processed_dir above are actually methods; their source code is:

# with @property, the method can be accessed like an attribute
@property
def raw_dir(self) -> str:
    return osp.join(self.root, 'raw')

@property
def processed_dir(self) -> str:
    return osp.join(self.root, 'processed')

As the source shows, self.raw_dir and self.processed_dir are the raw-data folder and the processed-data folder under the given root path.
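As a minimal, PyG-independent sketch of how such @property methods resolve paths (the class name DirDemo is hypothetical, used only for illustration):

```python
import os.path as osp

class DirDemo:
    """Hypothetical class mimicking how PyG derives folders from root."""

    def __init__(self, root):
        self.root = root

    @property
    def raw_dir(self) -> str:
        # Accessed like an attribute thanks to @property
        return osp.join(self.root, 'raw')

    @property
    def processed_dir(self) -> str:
        return osp.join(self.root, 'processed')

demo = DirDemo('tmp')
print(demo.raw_dir)        # tmp/raw (on POSIX systems)
print(demo.processed_dir)  # tmp/processed
```

Note that `demo.raw_dir` is accessed without parentheses; this is exactly why PyG can treat these methods as attributes.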

2.2 A worked example

This article uses the Facebook social network from the SNAP datasets as an example to demonstrate how to create an InMemoryDataset subclass called FaceBook. The dataset contains 4,039 nodes and 88,234 edges. Visualizing the network with Gephi gives:

(Figure: the Facebook network visualized with Gephi)

Following the description in Section 2.1, here is the source code of the custom FaceBook class:

import os
import pandas as pd
import torch
from torch_geometric.data import Data
from torch_geometric.data import InMemoryDataset, download_url, extract_gz


class FaceBook(InMemoryDataset):
    url = "https://snap.stanford.edu/data/facebook_combined.txt.gz"

    def __init__(self,
                 root,
                 transform=None,
                 pre_transform=None,
                 pre_filter=None):
        super().__init__(root, transform, pre_transform, pre_filter)
        self.data, self.slices = torch.load(self.processed_paths[0])

    @property
    def raw_file_names(self):
        return ["facebook_combined.txt"]

    @property
    def processed_file_names(self):
        return "data.pt"

    def download(self):
        path = download_url(self.url, self.raw_dir)
        extract_gz(path, self.raw_dir)

    def process(self):
        # Load the raw edge list (one edge per line: "u v")
        path = os.path.join(self.raw_dir, "facebook_combined.txt")
        edges = pd.read_csv(path, header=None,
                            delimiter=" ").values.T
        # Build the Data object; edge_index must have shape [2, num_edges]
        edge_index = torch.from_numpy(edges).contiguous()
        g = Data(edge_index=edge_index, num_nodes=4039)
        data, slices = self.collate([g])
        torch.save((data, slices), self.processed_paths[0])


if __name__ == "__main__":
    dataset = FaceBook(root="tmp")
    data = dataset[0]
    print(data.num_edges, data.num_nodes)
    # 88234 4039
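The shape of edge_index deserves attention: the raw file stores one edge per row (an array of shape [num_edges, 2]), while PyG expects edge_index of shape [2, num_edges]. A transpose (.T) preserves each (source, target) pair, whereas reshape(2, -1) scrambles them. A quick NumPy check with toy edges (no PyG required):

```python
import numpy as np

# Three toy edges, one per row, as in the raw edge-list file
rows = np.array([[0, 1],
                 [2, 3],
                 [4, 5]])

edge_index = rows.T              # shape (2, num_edges): pairs preserved
print(edge_index[:, 0])          # [0 1] -> the first edge (0, 1)

scrambled = rows.reshape(2, -1)  # also shape (2, 3), but pairs are broken
print(scrambled[:, 0])           # [0 3] -> (0, 3) was never an edge
```

This is why the transpose, not a reshape, is the correct way to turn a row-per-edge array into an edge_index.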

A few notes:

  • download() and process() are only executed on the first run; afterwards the processed dataset is loaded directly from disk.
  • Not all four methods are required. For example, if you already have the dataset locally, there is no need to override download().

3. Large datasets

For large graph datasets, you need to inherit from the Dataset class. In addition to the four methods overridden for InMemoryDataset, you also need to override the following:

  • len(): returns the number of examples in the dataset;
  • get(): implements the logic for loading a single graph.

Since defining a large dataset is similar to InMemoryDataset, only a brief sketch is given here.

4. Conclusion


Customizing datasets is important, especially when you need to convert local data into PyG's standard graph-dataset format.

Copyright notice: this article was written by [Si Xi is towering]; please include the original link when reposting:
https://yzsam.com/2022/172/202206210624432954.html