LinkNeighborLoader training edge sampling/leave in information #8295

arisp-biophy · 2023-10-30T20:43:38Z

Good day all. I have a question about the details behind some of the mini-batch loaders, in particular LinkNeighborLoader. My task is link prediction in the heterogeneous setting.

As a relatively new user, I am not familiar with some of the details behind these mini-batch loaders. LinkNeighborLoader looks like an expanded version of NeighborSampler that samples edges from the graph: we get a batch of edges, then sample the neighborhoods of the nodes involved in those edges, in the same way as in the GraphSAGE work (2017).

In my case I am using an undirected heterogeneous graph (all edges have rev_edges as well).

My questions are two-fold. I ran a simple experiment using the setup below. We want to make a prediction on a training edge L, which I denote p(L).

We set up a loader for training:

  edge_label_index = train_data[edge].edge_index
  edge_label = torch.ones(edge_label_index.shape[1], dtype=torch.float)

  train_loader = LinkNeighborLoader(
      data=test_data,
      num_neighbors=[50, 50],
      neg_sampling_ratio=1.0,
      edge_label_index=(edge, edge_label_index),
      edge_label=edge_label,
      batch_size=512,
      shuffle=True,
      neg_sampling='binary',
  )

When we grab a batch from this loader using

  batch = next(iter(train_loader))

and examine

  inpid = batch[('drug', 'indication', 'disease')].input_id.numpy()
  eid = batch[('drug', 'indication', 'disease')].e_id.numpy()

we notice an overlap in the global edge IDs of these objects: the intersection of inpid and eid is non-empty and significant.
This is a problem for my link prediction, since in application I would of course never have this information a priori.
Furthermore, we observe the same behavior in the training loop.
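One way to check for this overlap directly is to compare (src, dst) pairs rather than edge IDs. As far as I understand, input_id is a position in the edge_label_index handed to the loader, whereas e_id is a position in the graph's edge_index, so equal IDs only mean the same edge when both tensors list edges in the same order. The sketch below uses a hypothetical helper (count_leaked_seed_edges is not part of PyG) on toy tensors, and assumes both tensors use the same node indexing, as they do inside one mini-batch edge store.

```python
import torch


def count_leaked_seed_edges(edge_index, edge_label_index, edge_label=None):
    """Count positive seed edges that also appear among message-passing edges.

    Both index tensors have shape [2, E] and must share the same node indexing.
    """
    if edge_label is not None:
        # Keep only positive seeds; sampled negatives are not real edges.
        edge_label_index = edge_label_index[:, edge_label > 0]
    mp_pairs = set(map(tuple, edge_index.t().tolist()))
    seed_pairs = set(map(tuple, edge_label_index.t().tolist()))
    return len(mp_pairs & seed_pairs)


# Toy stand-in for one edge type of a sampled batch (hypothetical values).
edge_index = torch.tensor([[0, 0, 1, 2],
                           [0, 1, 1, 3]])        # edges used for message passing
edge_label_index = torch.tensor([[0, 2, 1],
                                 [1, 3, 0]])     # seed edges to predict
edge_label = torch.tensor([1.0, 1.0, 0.0])       # last seed is a negative

leaked = count_leaked_seed_edges(edge_index, edge_label_index, edge_label)
print(leaked)  # 2: both positive seeds are also message-passing edges
```

On a real batch this would be called as count_leaked_seed_edges(store.edge_index, store.edge_label_index, store.edge_label) with store = batch[('drug', 'indication', 'disease')].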


In the training loop, data[edge_type].edge_label_index and data.edge_index_dict show the same overlap:

    if pretrain:
        # We train both the GCN layers and the decoder.
        model.train() if train else model.eval()
        decoder.train() if train else decoder.eval()
    else:
        # Message passing is frozen during fine-tuning.
        decoder.train() if train else decoder.eval()
        model.eval()
    if train:
        optimizer.zero_grad()

    t_data = data[edge_type].edge_label_index
    true_labels = data[edge_type].edge_label.cpu().numpy()  # tensor -> NumPy array

    # Forward pass to compute the node embeddings.
    print(data.edge_index_dict)  # debug: edges used for message passing
    x = model(data.x_dict, data.edge_index_dict)

In this case it seems incorrect to leave the training edge in the graph when predicting on that same edge, as any function f(G) -> p(L) would be susceptible to overfitting to this edge's existence during training. To this end, I would like my batch loader to sample edges from my training data but exclude those edges from message passing in the GCN. I'm sure there is some simple way to do this; I think I'm just too new to the problem.

Since I am working in the heterogeneous setting, I have rev_edges left in the graph. These rev_edges would also contain the leave-in information. Again, when we sample around an edge we want to predict on, is it possible that we sample its rev_edge? (Obviously in this case my sampled neighborhood would have to be larger than the first hop, as it would require bi-directionality to find this.)
I am sure I am missing something fundamental about how LinkNeighborLoader works. Thanks, all, for your time.
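A minimal sketch of what excluding the seed edges could look like when applied to a batch after sampling: mask out the positive edges being predicted from the forward edge type, and their flipped copies from the reverse edge type, before message passing. drop_seed_edges is a hypothetical helper, not a LinkNeighborLoader option, and the toy tensors stand in for the local indices of a sampled batch. Neighbors that were only reached through a seed edge stay in the subgraph, and any per-edge attributes would need the same mask.

```python
import torch


def drop_seed_edges(edge_index, seed_index, num_dst):
    """Return edge_index without any edge that is listed in seed_index.

    num_dst is the number of destination nodes; it is used to encode each
    (src, dst) pair as a single integer so the two sets can be compared.
    """
    edge_key = edge_index[0] * num_dst + edge_index[1]
    seed_key = seed_index[0] * num_dst + seed_index[1]
    keep = ~torch.isin(edge_key, seed_key)
    return edge_index[:, keep]


# Toy batch with 3 "drug" nodes and 4 "disease" nodes (hypothetical values).
fwd = torch.tensor([[0, 0, 1, 2],
                    [1, 3, 2, 0]])     # drug -> disease
rev = fwd.flip(0)                      # disease -> drug (the rev_edges)
seeds = torch.tensor([[0, 2],
                      [3, 0]])         # positive edges we want to predict

fwd_mp = drop_seed_edges(fwd, seeds, num_dst=4)
rev_mp = drop_seed_edges(rev, seeds.flip(0), num_dst=3)

print(fwd_mp.tolist())  # [[0, 1], [1, 2]]
print(rev_mp.tolist())  # [[1, 2], [0, 1]]
```

The filtered tensors would then replace the corresponding entries of the edge_index_dict passed to the model.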

rusty1s · 2023-11-02T08:12:12Z

Maintainer

That is the right observation. LinkNeighborLoader currently does not discard the seed edge from the sampled subgraph. Most people in static link prediction ignore this problem, since a GNN naturally struggles to exploit this information to reach 100% train accuracy, but it needs to be tackled when incorporating something like identity-awareness into the model.

For this, you need to make sure that edge_index and edge_label_index do not overlap. You can achieve this, for example, via RandomLinkSplit(disjoint_train_ratio=...).

arisp-biophy Nov 14, 2023
Author

Thank you for this response. This is excellent information for future work, and your RandomLinkSplit suggestion certainly solves my problem. I appreciate your time.

minhhotboy9x · 2024-12-15T09:04:04Z

@rusty1s Can you explain in more detail how to use RandomLinkSplit? I use RandomLinkSplit, but edge_label_index still overlaps with edge_index. Here is my code:

    def split_data(self):
        transform = T.RandomLinkSplit(
                    num_val=self.data_config["val_ratio"],
                    num_test=self.data_config["test_ratio"],
                    add_negative_train_samples=False,
                    edge_types=("movie", "ratedby", "user"),
                    disjoint_train_ratio=0.5
                    # rev_edge_types=("movie", "rev_rates", "user"),
                )
        self.train_data, self.val_data, self.test_data = transform(self.data)
        print(self.train_data)

    def create_dataloader(self):
        batch_size = self.data_config['batch_size']
        self.trainloader = LinkNeighborLoader(
            self.train_data,
            batch_size = batch_size,
            shuffle = True,
            edge_label_index = ("movie", "ratedby", "user"), 
            edge_label = self.train_data["movie", "ratedby", "user"].edge_label,
            num_neighbors = self.data_config['num_neighbors'], 
        )
        self.valloader = LinkNeighborLoader(
            self.val_data,
            batch_size = batch_size,
            shuffle = False,
            edge_label_index = ("movie", "ratedby", "user"), 
            edge_label = self.val_data["movie", "ratedby", "user"].edge_label,  
            num_neighbors = self.data_config['num_neighbors'],  
        )
        self.testloader = LinkNeighborLoader(
            self.test_data,
            batch_size = batch_size,
            shuffle = False,
            edge_label_index = ("movie", "ratedby", "user"), 
            edge_label = self.test_data["movie", "ratedby", "user"].edge_label,  
            num_neighbors = self.data_config['num_neighbors'],  
        )
    
    def load_batches(self):
        for i, batch in enumerate(self.trainloader):
            print('-----------------')
            edge = batch["movie", "ratedby", "user"]
            edge_index, unique_edges, edge_label_index = utils.get_unlabel_label_edge(edge)
            print(edge_index.shape)
            print(unique_edges.shape)
            print(edge_label_index.shape)
            if i==0:
                break

Here are my results:

HeteroData(
  user={
    node_id=[610],
    num_nodes=610,
  },
  movie={
    node_id=[9742],
    num_nodes=9742,
  },
  genre={
    node_id=[20],
    num_nodes=20,
  },
  (movie, ratedby, user)={
    edge_index=[2, 35293],
    rating=[35293],
    edge_label=[35293],
    edge_label_index=[2, 35293],
  },
  (genre, of, movie)={ edge_index=[2, 22084] }
)
-----------------
torch.Size([2, 281])
torch.Size([2, 278])
torch.Size([2, 16])

The three torch.Size lines under the dashes are the shapes of edge_index, unique_edges, and edge_label_index, respectively. I tried to check for any overlap: unique_edges is smaller than edge_index, so there is some overlap.

Update: It was my mistake. I was passing all of edge_index as the edge_label_index argument of LinkNeighborLoader. It must be edge_label_index = (("movie", "ratedby", "user"), self.train_data["movie", "ratedby", "user"].edge_label_index)
