Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FragMolBuildingEnvContext function obj_to_graph unable to generate Graph for certain molecules #133

Open
W-OK-E opened this issue May 13, 2024 · 13 comments

Comments

@W-OK-E
Copy link

W-OK-E commented May 13, 2024

The function obj_to_graph takes forever to generate graphs for some molecules even though the are of the same size as the molecules that it can randomly sample. So it takes like 50 minutes to generate the graph and even then it returns Nonetype

@W-OK-E W-OK-E closed this as completed May 13, 2024
@W-OK-E W-OK-E changed the title ctx.obj_to_graph unable to generate Graph for certain molecules FragMolBuildingEnvContext function obj_to_graph unable to generate Graph for certain molecules May 13, 2024
@W-OK-E W-OK-E reopened this May 13, 2024
@W-OK-E
Copy link
Author

W-OK-E commented May 13, 2024

Rephrasing the issue properly

@bengioe
Copy link
Collaborator

bengioe commented May 13, 2024

Not all molecules can be expressed by the default fragment set used in the code base. Normally the search stops after a while if that's the case, but I don't recall it taking up to 50 minutes.
If you can give more specific examples I may be able to give better insights.

@W-OK-E
Copy link
Author

W-OK-E commented May 13, 2024

Yes, the _recursive_decompose function in frag_mol_env.py is meant to stop if it takes more than 1000 recursive iterations, but I removed it in my fork as it was being a bottle neck, and even then after about an hour, I get Nonetype for the molecules that I am trying to convert.

Here's how I was trying to generate graphs:

objs = [rdkit.Chem.MolFromSmiles(x) for x in df["SMILES"]]

//So first generating a bunch of molecules from the Smiles that I have in my dataframe

graphs = [ctx.obj_to_graph(x) for x in objs], where ctx is FragMolBuildingEnvContext object

Now here's how my molecules from the dataset look like:
image
And here are the corresponding Smiles Values:
"Nc1nnc(-c2ccc(N+[O-])o2)o1"

"O=C(C[C@H]1CC[C@@h]2C@HO1)NCCc1ccncc1"

"CC(C)CC@@HC(=O)N[C@H]1C(=O)NC@@HC(=O)N[C@H]2C(=O)N[C@H]3C(=O)NC@H[C@H]"

"(O)c3ccc(c(Cl)c3)Oc3cc2cc(c3O[C@@h]2OC@HC@@HC@H[C@H]2O[C@H]2CC@(N)C@HC@HO2)Oc2ccc(cc2Cl)[C@H]1O"

"O=C1CN(/N=C/c2ccc(N+[O-])o2)C(=O)N1"

"Cn1cnc(CCNC(=O)C[C@@h]2CC[C@@h]3C@HO2)c1"

"O=C(C[C@@h]1CC[C@H]2C@@HO1)NCCc1ccncc1"

"Cc1nnc(SCC2=C(C(=O)O)N3C(=O)C@@H[C@H]3SC2)s1"

"CCC@HC@HC1=N[C@H](C(=O)N[C@@h]
(CC(C)C)C(=O)N[C@@h]2CCC(=O)O[Zn]OC(=O)C[C@H]3NC(=O)C@HNC(=O)C@@HNC(=O)C@HNC(=O)C@@HNC(=O)C@HNC(=O)C@HNC2=O)CS1"

Now is it because, these molecules are MacroCycles i.e. too big?

Also, I found that it is possible to build their graphs if using MolBuildingEnvContext class, will the two produce different results? Because I was able to train my model, but the predictions don't seem right.

Also after you are done training a model, to get the model to generate new molecules , we just do the following right
model.eval()
algo.create_training_data_from_own_samples(model, 8)

Or is there some other method to get the model to generate molecules post training?

@bengioe
Copy link
Collaborator

bengioe commented May 13, 2024

It does look like these molecules contain subgraphs which are not within the default fragment set, so indeed converting them would not work. You may be able to generate your own fragment set which could express most of your compounds. I've been able to train fragment GFNs which were using 1000+ fragments.

I haven't published a proper script to do this fragment generation yet, but you should be able to reuse the code here in the original project from which GFlowNet was born.

An alternative is indeed to use the MolBuildingEnvContext, which instead of expressing molecules as graphs of fragments, expresses them as graphs of atoms. This is of course much harder because you're now training a model to generate something atom-by-atom, but it can in theory express "any" molecule.

Also after you are done training a model, to get the model to generate new molecules , we just do the following right
model.eval()
algo.create_training_data_from_own_samples(model, 8)

Yes this should be sufficient.

@W-OK-E
Copy link
Author

W-OK-E commented May 14, 2024

Thanks a lot for the clarification. Will try creating appropriate fragments and putting them to use.

@W-OK-E
Copy link
Author

W-OK-E commented May 15, 2024

I created fragments specific to my dataset using BRICS from rdkit, but I have a bunch of questions:

  1. Take for instance this molecule:
    image
    When BRICS generates the fragments, they look something like:
    image

i.e. They have a number in asterisk attached to the region where the bond was cleaved, now if we number the atoms, they look like:
image
Now, since for the fragments set, we need to have the SMILES of the molecules along with the atom index where other fragments can be connected, what I have done is to remove the [int*] from the molecules and take it's immediate neighbor as the connecting node, so for instance, for the above fragments, the unchanged smiles look like:
'[14*]c1ccc(N+[O-])o1', '[14*]c1nnc(N)o1'

After removing the [14*] part, the smiles and their respective fragments look like:
'c1ccc(N+[O-])o1' , 'c1nnc(N)o1'
image
Now, we can see that the Carbon atoms that had 14* attached to them have now taken upon their index positions, so is it right if finally I consider the fragments and their respective atom_idx to be:

('c1ccc(N+[O-])o1',[0]) and ('c1nnc(N)o1',[0])

  1. For some molecules in my dataset, BRICS doesn't find any fragments, so the molecule itself ends up being the fragment. So in that case naturally the atom_idx list is empty, will that hinder the learning process in any way.
    e.g. 'Cc1ncc[nH]1'
    image

@bengioe
Copy link
Collaborator

bengioe commented May 15, 2024

is it right if finally I consider the fragments and their respective atom_idx to be:

Yes, that's more or less how we did it. We then look for fragments that are used more than once, and if in their repetitions they are using different attachment atoms we add those atoms to the list.

so the molecule itself ends up being the fragment

This is a bit trickier, but a reasonable heuristic is to find carbon atoms with free hydrogens/valence and use those as attachment atoms.

@W-OK-E
Copy link
Author

W-OK-E commented May 25, 2024

@bengioe has something changed in the code base? Because the colab kernel seems to be crashing at the
algo.create_training_data... function. At first I thought, there might be some issues with my fork or the data, but even the getting started colab notebook linked in the implementation notes throws the same issue.

If the error is exclusively on my side, then I had one more issue prior to this. I was able to construct the graphs after feeding the model my own fragments for the molecules that I have, and I have the corresponding reward values as well. Here's the file:
https://drive.google.com/file/d/15VZ3ixBYKoWy_OUnEeuwnmsdPSZiTCdQ/view?usp=sharing
But while training, I get nan losses and the model doesn't learn anything. I tried to trace where it might be shooting off to nan but couldn't. Any chance you could help with this?

PS: The picke file in the drive link contains a dictionary with 'Graphs' key containing the graphs, and 'Rewards' Key containing the corresponding rewards

@W-OK-E
Copy link
Author

W-OK-E commented May 26, 2024

Alright, I was able to resolve the kernel crashing error by going to tools->command palette->Use Fallback Runtime as it seems that Colab has gone through some updates, but still can't resolve the null value error

@bengioe
Copy link
Collaborator

bengioe commented May 27, 2024

You're right, colab updated their torch version which broke the torch_geometric pip install command. I updated it, should now be !pip install torch_geometric torch_sparse torch_scatter rdkit gitpython omegaconf wandb --find-links https://data.pyg.org/whl/torch-2.3.0+cu121.html.

@bengioe
Copy link
Collaborator

bengioe commented May 27, 2024

I can't really help with the specific files you're sharing, but maybe one thing to make sure of is that your reward is always strictly positive so that the log-reward is not NaN (or inf). Otherwise, when does it NaN? After a single gradient step, or many? Does the learning rate slow this down?

@W-OK-E
Copy link
Author

W-OK-E commented May 28, 2024

Thank you for the torch-geometric clarification.
And on the null values, yes, I am making sure that the reward values are all positive, and the training steps all look something like this:
image

So it outright starts with null values. Initially, I remember getting real values for the first two epochs or so after which it shoots off to inf or Nan, but now it's just null everywhere.

From your suggestion, I tried inc/dec the learning rate, but the null values remained.

@W-OK-E
Copy link
Author

W-OK-E commented May 29, 2024

Alright, @bengioe I think I know why these null values are there, there seems to be some mismatch in the way I have indexed the fragments, I need to look deeper into that. Thanks a lot for your guidance, will get back to you with the improvements in the fragments. Thanks again.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants