FragMolBuildingEnvContext function obj_to_graph unable to generate Graph for certain molecules #133
Comments
Rephrasing the issue properly |
Not all molecules can be expressed by the default fragment set used in the code base. Normally the search stops after a while if that's the case, but I don't recall it taking up to 50 minutes. |
Yes, the _recursive_decompose function in frag_mol_env.py is meant to stop once it exceeds 1000 recursive iterations, but I removed that limit in my fork because it was a bottleneck. Even then, after about an hour, I get None for the molecules that I am trying to convert.
Here's how I was trying to generate the graphs:
objs = [rdkit.Chem.MolFromSmiles(x) for x in df["SMILES"]]  # first, build molecules from the SMILES in my dataframe
graphs = [ctx.obj_to_graph(x) for x in objs]  # ctx is a FragMolBuildingEnvContext object
Here is what the molecules in my dataset look like:
"O=C(C[C@H]1CC[C@@h]2C@HO1)NCCc1ccncc1"
"CC(C)CC@@HC(=O)N[C@H]1C(=O)NC@@HC(=O)N[C@H]2C(=O)N[C@H]3C(=O)NC@H[C@H]"
"(O)c3ccc(c(Cl)c3)Oc3cc2cc(c3O[C@@h]2OC@HC@@HC@H[C@H]2O[C@H]2CC@(N)C@HC@HO2)Oc2ccc(cc2Cl)[C@H]1O"
"O=C1CN(/N=C/c2ccc(N+[O-])o2)C(=O)N1"
"Cn1cnc(CCNC(=O)C[C@@h]2CC[C@@h]3C@HO2)c1"
"O=C(C[C@@h]1CC[C@H]2C@@HO1)NCCc1ccncc1"
"Cc1nnc(SCC2=C(C(=O)O)N3C(=O)C@@H[C@H]3SC2)s1"
"CCC@HC@HC1=N[C@H](C(=O)N[C@@h]
Is it because these molecules are macrocycles, i.e. too big?
Also, I found that it is possible to build their graphs using the MolBuildingEnvContext class. Will the two produce different results? I ask because I was able to train my model, but the predictions don't seem right.
Also, once a model is trained, to get it to generate new molecules we just do the following, right? Or is there some other method to get the model to generate molecules after training? |
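To see which molecules fail rather than ending up with None entries scattered through the list, the conversion loop above can record the failures as it goes. The following is only a minimal sketch: `convert_with_report` is a hypothetical helper, not part of the library, and it assumes (as described in this thread) that the conversion function returns None when a molecule cannot be expressed. It also skips inputs that are already None, which is what `rdkit.Chem.MolFromSmiles` returns for a SMILES string it cannot parse.

```python
from typing import Callable, List, Optional, Sequence, Tuple, TypeVar

M = TypeVar("M")  # molecule type (e.g. an RDKit Mol)
G = TypeVar("G")  # graph type returned by the conversion function


def convert_with_report(
    mols: Sequence[Optional[M]],
    to_graph: Callable[[M], Optional[G]],
) -> Tuple[List[G], List[int]]:
    """Convert each molecule, returning the graphs and the indices that failed.

    Hypothetical helper: `to_graph` is assumed to return None on failure.
    """
    graphs: List[G] = []
    failed: List[int] = []
    for i, mol in enumerate(mols):
        # A None input means the SMILES could not be parsed upstream.
        graph = to_graph(mol) if mol is not None else None
        if graph is None:
            failed.append(i)
        else:
            graphs.append(graph)
    return graphs, failed
```

With the names from the comment above, this would be called as `graphs, failed = convert_with_report(objs, ctx.obj_to_graph)`, after which `df["SMILES"].iloc[failed]` lists the molecules the fragment set could not express.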
It does look like these molecules contain subgraphs which are not within the default fragment set, so indeed converting them would not work. You may be able to generate your own fragment set which could express most of your compounds. I've been able to train fragment GFNs which were using 1000+ fragments. I haven't published a proper script to do this fragment generation yet, but you should be able to reuse the code here in the original project from which GFlowNet was born. An alternative is indeed to use the MolBuildingEnvContext, which instead of expressing molecules as graphs of fragments, expresses them as graphs of atoms. This is of course much harder because you're now training a model to generate something atom-by-atom, but it can in theory express "any" molecule.
Yes this should be sufficient. |
Thanks a lot for the clarification. Will try creating appropriate fragments and putting them to use. |
I created fragments specific to my dataset using BRICS from RDKit, but I have a bunch of questions. The fragments have a numbered asterisk attached at the position where the bond was cleaved. Now, if we number the atoms, they look like: After removing the [14*] part, the SMILES and their respective fragments look like: ('c1ccc(N+[O-])o1',[0]) and ('c1nnc(N)o1',[0]) |
Yes, that's more or less how we did it. We then look for fragments that are used more than once, and if in their repetitions they are using different attachment atoms we add those atoms to the list.
This is a bit trickier, but a reasonable heuristic is to find carbon atoms with free hydrogens/valence and use those as attachment atoms. |
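The first heuristic above (merging the attachment atoms of a fragment that shows up more than once) can be sketched in a few lines. This is an illustrative sketch, not the script used in the project: `merge_attachment_points` is a hypothetical name, and it assumes each occurrence of a fragment is written as the same SMILES string with the same atom ordering (for example, a canonical SMILES), so that atom indices are comparable across occurrences.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def merge_attachment_points(
    occurrences: Iterable[Tuple[str, List[int]]],
) -> List[Tuple[str, List[int]]]:
    """Collapse repeated fragments, taking the union of their attachment atoms.

    Assumption: identical SMILES strings imply identical atom numbering.
    """
    merged: Dict[str, Set[int]] = defaultdict(set)
    for smiles, atoms in occurrences:
        merged[smiles].update(atoms)
    # dicts keep insertion order, so fragments appear in first-seen order
    return [(smiles, sorted(atoms)) for smiles, atoms in merged.items()]
```

The output has the same (SMILES, attachment atom list) shape as the tuples quoted earlier in the thread, with one entry per distinct fragment.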
@bengioe has something changed in the code base? I ask because the Colab kernel seems to be crashing at the
If the error is exclusively on my side, then I had one more issue prior to this. I was able to construct the graphs for my molecules after feeding the model my own fragments, and I have the corresponding reward values as well. Here's the file: PS: The pickle file in the drive link contains a dictionary with a 'Graphs' key containing the graphs and a 'Rewards' key containing the corresponding rewards |
Alright, I was able to resolve the kernel crashing error by going to tools->command palette->Use Fallback Runtime as it seems that Colab has gone through some updates, but still can't resolve the null value error |
You're right, colab updated their torch version which broke the torch_geometric |
I can't really help with the specific files you're sharing, but maybe one thing to make sure of is that your reward is always strictly positive so that the log-reward is not NaN (or inf). Otherwise, when does it NaN? After a single gradient step, or many? Does the learning rate slow this down? |
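The positivity check suggested above can be made explicit before training. The sketch below is one possible guard, not the library's own handling: `safe_log_rewards` and the floor `min_reward` are hypothetical names, and the floor value is an arbitrary assumption to tune. It rejects rewards that are already NaN or infinite and clamps zero or negative rewards to the floor so that the logarithm stays finite. In a real training loop the same logic would normally be applied to tensors rather than Python lists.

```python
import math
from typing import Iterable, List


def safe_log_rewards(rewards: Iterable[float], min_reward: float = 1e-8) -> List[float]:
    """Return log-rewards, clamping non-positive rewards to `min_reward`.

    Raises ValueError if a reward is NaN or infinite, since clamping would hide it.
    """
    log_rewards: List[float] = []
    for i, r in enumerate(rewards):
        if not math.isfinite(r):
            raise ValueError(f"reward at index {i} is not finite: {r!r}")
        # log(0) and log(negative) are undefined, so clamp to the floor first
        log_rewards.append(math.log(max(r, min_reward)))
    return log_rewards
```

Running this over the stored rewards once, before any gradient step, separates the case where bad reward values are the source of the NaN from the case where it only appears during optimisation.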
Alright, @bengioe I think I know why these null values are there, there seems to be some mismatch in the way I have indexed the fragments, I need to look deeper into that. Thanks a lot for your guidance, will get back to you with the improvements in the fragments. Thanks again. |
The function obj_to_graph takes forever to generate graphs for some molecules, even though they are the same size as the molecules that it can randomly sample. It takes around 50 minutes to generate the graph, and even then it returns None.