FragMolBuildingEnvContext function obj_to_graph unable to generate Graph for certain molecules #133

W-OK-E · 2024-05-13T07:02:12Z

The function obj_to_graph takes forever to generate graphs for some molecules even though the are of the same size as the molecules that it can randomly sample. So it takes like 50 minutes to generate the graph and even then it returns Nonetype

W-OK-E · 2024-05-13T07:10:09Z

Rephrasing the issue properly

bengioe · 2024-05-13T14:25:12Z

Not all molecules can be expressed by the default fragment set used in the code base. Normally the search stops after a while if that's the case, but I don't recall it taking up to 50 minutes.
If you can give more specific examples I may be able to give better insights.

W-OK-E · 2024-05-13T15:55:34Z

Yes, the _recursive_decompose function in frag_mol_env.py is meant to stop if it takes more than 1000 recursive iterations, but I removed it in my fork as it was being a bottle neck, and even then after about an hour, I get Nonetype for the molecules that I am trying to convert.

Here's how I was trying to generate graphs:

objs = [rdkit.Chem.MolFromSmiles(x) for x in df["SMILES"]]

//So first generating a bunch of molecules from the Smiles that I have in my dataframe

graphs = [ctx.obj_to_graph(x) for x in objs], where ctx is FragMolBuildingEnvContext object

Now here's how my molecules from the dataset look like:

And here are the corresponding Smiles Values:
"Nc1nnc(-c2ccc(N+[O-])o2)o1"

"O=C(C[C@H]1CC[C@@h]2C@HO1)NCCc1ccncc1"

"CC(C)CC@@HC(=O)N[C@H]1C(=O)NC@@HC(=O)N[C@H]2C(=O)N[C@H]3C(=O)NC@H[C@H]"

"(O)c3ccc(c(Cl)c3)Oc3cc2cc(c3O[C@@h]2OC@H C@@H C@H[C@H]2O[C@H]2CC@(N)C@H C@HO2)Oc2ccc(cc2Cl)[C@H]1O"

"O=C1CN(/N=C/c2ccc(N+[O-])o2)C(=O)N1"

"Cn1cnc(CCNC(=O)C[C@@h]2CC[C@@h]3C@HO2)c1"

"O=C(C[C@@h]1CC[C@H]2C@@HO1)NCCc1ccncc1"

"Cc1nnc(SCC2=C(C(=O)O)N3C(=O)C@@H[C@H]3SC2)s1"

"CCC@H C@HC1=N[C@H](C(=O)N[C@@h]
(CC(C)C)C(=O)N[C@@h]2CCC(=O)O[Zn]OC(=O)C[C@H]3NC(=O)C@HNC(=O)C@@HNC(=O)C@HNC(=O)C@@HNC(=O)C@HNC(=O)C@HNC2=O)CS1"

Now is it because, these molecules are MacroCycles i.e. too big?

Also, I found that it is possible to build their graphs if using MolBuildingEnvContext class, will the two produce different results? Because I was able to train my model, but the predictions don't seem right.

Also after you are done training a model, to get the model to generate new molecules , we just do the following right
model.eval()
algo.create_training_data_from_own_samples(model, 8)

Or is there some other method to get the model to generate molecules post training?

bengioe · 2024-05-13T19:01:43Z

It does look like these molecules contain subgraphs which are not within the default fragment set, so indeed converting them would not work. You may be able to generate your own fragment set which could express most of your compounds. I've been able to train fragment GFNs which were using 1000+ fragments.

I haven't published a proper script to do this fragment generation yet, but you should be able to reuse the code here in the original project from which GFlowNet was born.

An alternative is indeed to use the MolBuildingEnvContext, which instead of expressing molecules as graphs of fragments, expresses them as graphs of atoms. This is of course much harder because you're now training a model to generate something atom-by-atom, but it can in theory express "any" molecule.

Also after you are done training a model, to get the model to generate new molecules , we just do the following right
model.eval()
algo.create_training_data_from_own_samples(model, 8)

Yes this should be sufficient.

W-OK-E · 2024-05-14T04:00:20Z

Thanks a lot for the clarification. Will try creating appropriate fragments and putting them to use.

W-OK-E · 2024-05-15T16:39:41Z

I created fragments specific to my dataset using BRICS from rdkit, but I have a bunch of questions:

Take for instance this molecule:

When BRICS generates the fragments, they look something like:

i.e. They have a number in asterisk attached to the region where the bond was cleaved, now if we number the atoms, they look like:

Now, since for the fragments set, we need to have the SMILES of the molecules along with the atom index where other fragments can be connected, what I have done is to remove the [int*] from the molecules and take it's immediate neighbor as the connecting node, so for instance, for the above fragments, the unchanged smiles look like:
'[14*]c1ccc(N+[O-])o1', '[14*]c1nnc(N)o1'

After removing the [14*] part, the smiles and their respective fragments look like:
'c1ccc(N+[O-])o1' , 'c1nnc(N)o1'

Now, we can see that the Carbon atoms that had 14* attached to them have now taken upon their index positions, so is it right if finally I consider the fragments and their respective atom_idx to be:

('c1ccc(N+[O-])o1',[0]) and ('c1nnc(N)o1',[0])

For some molecules in my dataset, BRICS doesn't find any fragments, so the molecule itself ends up being the fragment. So in that case naturally the atom_idx list is empty, will that hinder the learning process in any way.
e.g. 'Cc1ncc[nH]1'

bengioe · 2024-05-15T17:14:56Z

is it right if finally I consider the fragments and their respective atom_idx to be:

Yes, that's more or less how we did it. We then look for fragments that are used more than once, and if in their repetitions they are using different attachment atoms we add those atoms to the list.

so the molecule itself ends up being the fragment

This is a bit trickier, but a reasonable heuristic is to find carbon atoms with free hydrogens/valence and use those as attachment atoms.

W-OK-E · 2024-05-25T19:20:31Z

@bengioe has something changed in the code base? Because the colab kernel seems to be crashing at the
algo.create_training_data... function. At first I thought, there might be some issues with my fork or the data, but even the getting started colab notebook linked in the implementation notes throws the same issue.

If the error is exclusively on my side, then I had one more issue prior to this. I was able to construct the graphs after feeding the model my own fragments for the molecules that I have, and I have the corresponding reward values as well. Here's the file:
https://drive.google.com/file/d/15VZ3ixBYKoWy_OUnEeuwnmsdPSZiTCdQ/view?usp=sharing
But while training, I get nan losses and the model doesn't learn anything. I tried to trace where it might be shooting off to nan but couldn't. Any chance you could help with this?

PS: The picke file in the drive link contains a dictionary with 'Graphs' key containing the graphs, and 'Rewards' Key containing the corresponding rewards

W-OK-E · 2024-05-26T11:27:13Z

Alright, I was able to resolve the kernel crashing error by going to tools->command palette->Use Fallback Runtime as it seems that Colab has gone through some updates, but still can't resolve the null value error

bengioe · 2024-05-27T18:58:12Z

You're right, colab updated their torch version which broke the torch_geometric pip install command. I updated it, should now be !pip install torch_geometric torch_sparse torch_scatter rdkit gitpython omegaconf wandb --find-links https://data.pyg.org/whl/torch-2.3.0+cu121.html.

bengioe · 2024-05-27T18:59:57Z

I can't really help with the specific files you're sharing, but maybe one thing to make sure of is that your reward is always strictly positive so that the log-reward is not NaN (or inf). Otherwise, when does it NaN? After a single gradient step, or many? Does the learning rate slow this down?

W-OK-E · 2024-05-28T15:55:38Z

Thank you for the torch-geometric clarification.
And on the null values, yes, I am making sure that the reward values are all positive, and the training steps all look something like this:

So it outright starts with null values. Initially, I remember getting real values for the first two epochs or so after which it shoots off to inf or Nan, but now it's just null everywhere.

From your suggestion, I tried inc/dec the learning rate, but the null values remained.

W-OK-E · 2024-05-29T06:25:47Z

Alright, @bengioe I think I know why these null values are there, there seems to be some mismatch in the way I have indexed the fragments, I need to look deeper into that. Thanks a lot for your guidance, will get back to you with the improvements in the fragments. Thanks again.

W-OK-E closed this as completed May 13, 2024

W-OK-E changed the title ~~ctx.obj_to_graph unable to generate Graph for certain molecules~~ FragMolBuildingEnvContext function obj_to_graph unable to generate Graph for certain molecules May 13, 2024

W-OK-E reopened this May 13, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

FragMolBuildingEnvContext function obj_to_graph unable to generate Graph for certain molecules #133

FragMolBuildingEnvContext function obj_to_graph unable to generate Graph for certain molecules #133

W-OK-E commented May 13, 2024 •

edited

Loading

W-OK-E commented May 13, 2024

bengioe commented May 13, 2024

W-OK-E commented May 13, 2024 •

edited

Loading

bengioe commented May 13, 2024

W-OK-E commented May 14, 2024

W-OK-E commented May 15, 2024 •

edited

Loading

bengioe commented May 15, 2024

W-OK-E commented May 25, 2024 •

edited

Loading

W-OK-E commented May 26, 2024

bengioe commented May 27, 2024

bengioe commented May 27, 2024 •

edited

Loading

W-OK-E commented May 28, 2024

W-OK-E commented May 29, 2024

FragMolBuildingEnvContext function obj_to_graph unable to generate Graph for certain molecules #133

FragMolBuildingEnvContext function obj_to_graph unable to generate Graph for certain molecules #133

Comments

W-OK-E commented May 13, 2024 • edited Loading

W-OK-E commented May 13, 2024

bengioe commented May 13, 2024

W-OK-E commented May 13, 2024 • edited Loading

bengioe commented May 13, 2024

W-OK-E commented May 14, 2024

W-OK-E commented May 15, 2024 • edited Loading

bengioe commented May 15, 2024

W-OK-E commented May 25, 2024 • edited Loading

W-OK-E commented May 26, 2024

bengioe commented May 27, 2024

bengioe commented May 27, 2024 • edited Loading

W-OK-E commented May 28, 2024

W-OK-E commented May 29, 2024

W-OK-E commented May 13, 2024 •

edited

Loading

W-OK-E commented May 13, 2024 •

edited

Loading

W-OK-E commented May 15, 2024 •

edited

Loading

W-OK-E commented May 25, 2024 •

edited

Loading

bengioe commented May 27, 2024 •

edited

Loading