-
Hello,
There are multiple types of datasets (e.g., Dataset, FileDataset, TensorDataset, CSVDataset), and each has its own structure. Despite their different structures, all of them can be grounded and turned into an instance of BuiltDataset (a dataset with the samples field), which is then used for evaluation and training. I assume the problem here is that you didn't build the dataset. All dataset types can also be converted into a Dataset (a dataset of valued logic fact examples) by calling their to_dataset() method. But I would suggest avoiding turning a TensorDataset into a Dataset, as the conversion might be slow, considering PPI is quite large. Instead, it might be a better idea to build the TensorDataset and work with the resulting BuiltDataset directly.
-
Sorry for the late response. You were right, I didn't ground it. But I also realized there were no relations defined in the dataset. Thank you!
-
@erhc Hi, PyNeuraLogic supports loading datasets from databases, so even the relational.fit.cvut.cz repository fits. I would recommend loading the data, dumping it into files, and then using FileDataset:

import mariadb
from neuralogic.dataset import DBDataset, DBSource

conn = mariadb.connect(
    user="guest",
    password="relational",
    host="relational.fit.cvut.cz",
    port=3306,
    database="Toxicology",
)

# relation name, table name, which columns are mapped to terms, which column is mapped to the value
molecule = DBSource("molecule", "molecule", ["molecule_id"], "label", value_mapper=lambda x: -1 if x == "-" else 1)
atom = DBSource("atom", "atom", ["atom_id", "molecule_id", "element"])
bond = DBSource("bond", "bond", ["bond_id", "molecule_id", "bond_type"])
connected = DBSource("connected", "connected", ["atom_id", "atom_id2", "bond_id"])

dataset = DBDataset(
    conn,
    [atom, bond, connected],  # example sources
    molecule,  # query source
)

logic_dataset = dataset.to_dataset()
logic_dataset.dump_to_file("queries.txt", "examples.txt")
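As a side note, the value_mapper above is just a plain callable applied to each fetched label value. A minimal self-contained sketch of the same mapping logic (the raw label strings below are made up, mirroring the "-"/other convention of the Toxicology example):

```python
# Map raw DB label strings to numeric target values,
# mirroring the value_mapper lambda used for the molecule source.
value_mapper = lambda x: -1 if x == "-" else 1

raw_labels = ["-", "+", "+", "-"]  # hypothetical fetched labels
targets = [value_mapper(label) for label in raw_labels]
print(targets)  # -> [-1, 1, 1, -1]
```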
You should also install the latest version of PyNeuraLogic (which I just released). It fixes an issue where some capitalized constants fetched from the DB were not lowercased, which changed their meaning to variables.

Regarding TUDataset, you could load the datasets with PyG and then create the PyNeuraLogic dataset from the PyG dataset:

from torch_geometric.datasets import TUDataset
from neuralogic.dataset import TensorDataset, Data

ds = TUDataset(root=..., name=...)
dataset = TensorDataset(data=[Data.from_pyg(data)[0] for data in ds], number_of_classes=...)

Also, from looking at the format of the datasets you posted, they are just a bunch of CSV files. PyNeuraLogic has a CSVDataset (in fact, DBDataset uses it in the background), but I don't think it is usable in this case. You can open all the files and read them line by line (one line = one relation in the dataset). E.g.:

from neuralogic.core import R
from neuralogic.dataset import Dataset

examples = []
with open("BZR_A.txt", mode="r") as fp:
    for line in fp.readlines():
        terms = line.split(",")
        examples.append(R.edge(terms[0].strip(), terms[1].strip()))

# load the other files similarly
dataset = Dataset(examples, queries)
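The line-by-line parsing itself is plain string handling and works the same for any of those CSV-style files. A small self-contained sketch (no PyNeuraLogic; the in-memory lines below are a made-up stand-in for the contents of BZR_A.txt):

```python
# Parse comma-separated "source, target" lines into stripped term pairs,
# the same preprocessing the R.edge(...) loop performs per line.
lines = ["1, 2\n", "2, 3\n", "3, 1\n"]  # stand-in for fp.readlines()

edges = []
for line in lines:
    terms = line.split(",")
    edges.append((terms[0].strip(), terms[1].strip()))

print(edges)  # -> [('1', '2'), ('2', '3'), ('3', '1')]
```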
-
Hello,
I am trying to import some datasets that are available in PyTorch Geometric; you can see my code below.
I realized that, when building these datasets, they have a different structure than, for example, the Mutagenesis dataset that is available in your framework: the data is stored in ds.data for PPI, but in ds.samples for Mutag.
Is the structure supposed to be different for each dataset, or is there some common structure that every dataset should follow? If so, how do I obtain it in this case?
Also, how can I extract the relations from the dataset? In its current form, it can never be grounded; the data in PPI only has some x, y, and edge_index values.
Thank you in advance!