Train/test split #43

erhc · 2023-02-20T20:58:36Z

Hello,

I am using the Mutagenesis dataset that is built into your framework, and I would like to split it into train and test datasets. Plain Python indexing does not seem to be supported (for FileDataset), so is there a function designed for this? I tried reading the files and creating a new dataset, but it seems I have to build the examples and queries first, which produces more problems. Should I

I would appreciate it if you could give me some advice on this and tell me what the preferred method of doing a train/test split would be.
Thank you!

LukasZahradnik (Maintainer) · 2023-02-20T22:47:41Z

Hi, splitting FileDataset is a bit awkward. You could split the dataset text files into multiple files and load them as multiple datasets.
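The file-splitting route could look roughly like the sketch below. It is not part of PyNeuraLogic; `split_lines` is a hypothetical helper, and it assumes that each dataset text file stores one sample per line and that the examples file and the queries file are aligned line by line. If your files are laid out differently, the split has to follow that layout instead.

```python
from pathlib import Path


def split_lines(source, train_path, test_path, train_size):
    """Copy the first `train_size` non-empty lines of `source` to `train_path`
    and the remaining non-empty lines to `test_path`.

    Returns a tuple (number of train lines, number of test lines).
    Assumption: one sample per line.
    """
    text = Path(source).read_text()
    # Drop blank lines so they are not counted as samples.
    lines = [line.strip() for line in text.splitlines() if line.strip()]

    train, test = lines[:train_size], lines[train_size:]

    # Write each part back with one sample per line.
    Path(train_path).write_text("".join(line + "\n" for line in train))
    Path(test_path).write_text("".join(line + "\n" for line in test))
    return len(train), len(test)
```

Calling the helper once for the examples file and once for the queries file with the same `train_size` keeps each example paired with its query. The resulting four files can then be loaded as two separate file datasets.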

You can also do something like this:

from neuralogic.core import BuiltDataset

mutagenesis_dataset = ...
model = template.build(...)

dataset = model.build_dataset(mutagenesis_dataset)

train_size = ...

train_dataset = BuiltDataset(dataset.samples[:train_size], batch_size=1)
test_dataset = BuiltDataset(dataset.samples[train_size:], batch_size=1)

or

mutagenesis_dataset = ...
model = template.build(...)

dataset = model.build_dataset(mutagenesis_dataset)

train_size = ...

# Training
model(dataset.samples[:train_size], train=True)

# Testing
model(dataset.samples[train_size:], train=False)

Both of those solutions require building the dataset first. File datasets are passed directly into the backend and are not read on the Python side.
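Slicing with `[:train_size]` keeps the samples in file order. If the samples happen to be grouped (for example by class), a shuffled split may be preferable. The following is a minimal sketch in plain Python; `split_samples` is a hypothetical helper, not a PyNeuraLogic function, and it only assumes that the samples support `len()` and integer indexing, as the `dataset.samples` list used above does.

```python
import random


def split_samples(samples, train_fraction=0.8, seed=0):
    """Return (train, test) lists drawn from `samples` in shuffled order.

    The shuffle is seeded, so the same seed always gives the same split.
    """
    if not 0.0 <= train_fraction <= 1.0:
        raise ValueError("train_fraction must be between 0 and 1")

    # Shuffle indices rather than the samples, leaving the input untouched.
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)

    train_size = int(len(samples) * train_fraction)
    train = [samples[i] for i in indices[:train_size]]
    test = [samples[i] for i in indices[train_size:]]
    return train, test
```

The two returned lists can be used wherever the examples above use `dataset.samples[:train_size]` and `dataset.samples[train_size:]`.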

> but it seems I have to build the examples and queries first, which produces more problems

What were the problems you came across?

2 replies

erhc Feb 22, 2023
Author

Oh thank you, that makes much more sense. I tried building it separately using the DatasetBuilder, but it requires some java_factory.

LukasZahradnik Feb 22, 2023
Maintainer

Aha, the DatasetBuilder is only for internal use. I might add some helper functions on top of BuiltDataset for splitting etc.

erhc (Author) · 2023-02-24T19:29:24Z

Another question, when using this dataset, how can I get the queries? I would like to evaluate the model, but I haven't been able to find a way to get back the queries from the built dataset. My code looks somewhat like the following:

template, dataset = Mutagenesis()

settings = Settings(optimizer=Adam(lr=0.001), epochs=100, error_function=MSE())
evaluator = get_evaluator(template, settings)

built_dataset = evaluator.build_dataset(dataset)

train_dataset = built_dataset.samples[:train_size]
test_dataset = built_dataset.samples[train_size:]

for current_total_loss, number_of_samples in evaluator.train(train_dataset):
    ...  # training

results = evaluator.test(test_dataset, generator=False)
# I need to compare the results to test_dataset.queries

3 replies

LukasZahradnik Feb 24, 2023
Maintainer

I don't think you can pass a list of samples directly to an evaluator like that. You should create two BuiltDatasets, just like in the first example above. Alternatively, you can drop the evaluator and pass the sample list to the model directly.

Regarding the queries, I assume you want to get their values? If so, you should be able to access their target values like this:

query_value = built_dataset.samples[0].java_sample.target

It is used in one example notebook: https://github.com/LukasZahradnik/PyNeuraLogic/blob/master/examples/MolecularGNN.ipynb (the last cell)

Note that the target value will be a Java Value object, so you will have to cast it to a Python object somehow, which might be a little bit problematic.

If your query value is a matrix or vector, you can do the following:

query_value = list(built_dataset.samples[0].java_sample.target.values)
# or
query_value = np.array(built_dataset.samples[0].java_sample.target.values)

If your query value is a scalar, then you can do this instead:

query_value = built_dataset.samples[0].java_sample.target.value

There is also a unified way (for scalars, vectors, and matrices) to get the values: calling getAsArray. For scalars, this results in a one-element list (not a scalar).

query_value = list(built_dataset.samples[0].java_sample.target.getAsArray())
# or
query_value = np.array(built_dataset.samples[0].java_sample.target.getAsArray())
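To compare test results against the targets, the `getAsArray` access shown above can be wrapped in a few small helpers. This is only a sketch: the helper names are hypothetical, the only library access it relies on is `sample.java_sample.target.getAsArray()` from this thread, and it assumes that the predictions are already available as a list of plain Python floats with one scalar target per sample.

```python
def target_to_python(target):
    """Convert a target value to a float (scalar) or a list of floats."""
    values = [float(v) for v in target.getAsArray()]
    # getAsArray yields a one-element sequence for scalars, so unwrap it.
    return values[0] if len(values) == 1 else values


def collect_targets(samples):
    """Collect the target of every sample as a Python value."""
    return [target_to_python(s.java_sample.target) for s in samples]


def mean_squared_error(predictions, targets):
    """Mean squared error between two equally long lists of scalars."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets differ in length")
    if not predictions:
        raise ValueError("cannot compute the error of an empty list")
    squared = [(p - t) ** 2 for p, t in zip(predictions, targets)]
    return sum(squared) / len(squared)
```

With scalar targets, `collect_targets(built_dataset.samples[train_size:])` would give a list that can be passed to `mean_squared_error` together with the predictions.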

There is one more way to get those values, probably the slowest: serializing them to JSON and deserializing them again. All the approaches shown above have a problem with matrices - you get a flattened vector instead of a matrix (you have to reshape it yourself).

query_value = json.loads(str(built_dataset.samples[0].java_sample.target))

But you will usually not have matrices on the output, so there is no need to use the JSON approach.

Btw, thanks for those questions. They point out a lot of stuff that needs to be added to the BuiltDataset class.
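If a matrix target does come back flattened, it can be reshaped by hand once the expected dimensions are known. The sketch below uses plain Python; `reshape_flat` is a hypothetical helper, and it assumes the flattened values are in row-major order, which this thread does not confirm and which should be checked against a known target.

```python
def reshape_flat(values, rows, cols):
    """Turn a flat sequence into a list of `rows` lists of length `cols`.

    Assumption: the flat values are stored row by row (row-major order).
    """
    values = list(values)
    if len(values) != rows * cols:
        raise ValueError("number of values does not match rows * cols")
    return [values[r * cols:(r + 1) * cols] for r in range(rows)]
```

With NumPy available, `np.array(values).reshape(rows, cols)` does the same thing under the same ordering assumption.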

erhc Feb 25, 2023
Author

Thank you for your reply, this is very helpful!
Concerning passing the list of samples, it actually works perfectly for training, and for testing it only works if the generator flag is set to false, otherwise it produces this error:

I am putting it here in case you want to consider it as a bug.

LukasZahradnik Feb 25, 2023
Maintainer

Thanks. Well, evaluators were not supposed to support passing in lists, so this is probably a bug. But I'm considering removing evaluators completely because they add no value (they were meant to unify multiple backends, but now there is only one backend left).
