
Ligand parametrization #1

Open
AminaMnhr opened this issue Jan 30, 2025 · 23 comments
@AminaMnhr

I am interested in using the quantumbindRBFE method to compute RBFE for our protein datasets, using the workflow with AceForce 1.0 that you just published.

I would like to use the same protocol as described in the paper, but I have questions about the parameterization of the ligand in your pipeline and about the use of AceForce 1.0, as I am not sure at which step it is used.

For the parameterization of the ligand:

  • To produce the files for the ligand (.cif and .frcmod), do you parameterize the ligands using just the interface of the Parameterize tool (https://open.playmolecule.org/tools/parameterize?docs=true) with the force field GAFF2 and the charge-fitting method RESP-PSI4, or did you compute the charges before/after parameterization with a different tool? Was there any dihedral fitting with the Parameterize tool, or was it turned off?

  • Do you first compute the partial charges with the OpenFF Recharge package and then use Parameterize without any charge fitting? Or do you use the Parameterize tool first with charge fitting (like AM1-BCC or Gasteiger) and then use the OpenFF Recharge package to do the RESP?

Thank you so much for your help, I am excited to hear from you.

@qsabanes
Collaborator

Hi @AminaMnhr!

Everything was run from Parameterize; no changes were made before or after. Once the ligand was parameterized with that tool, it went into the RBFE part of the protocol.

In theory you could run AceForce with different parameters (ligand force field and/or charges), but for now the results we have are the ones with the combination reported.

@AminaMnhr
Author

AminaMnhr commented Jan 30, 2025

Amazing, thank you for the fast answer!

  1. Just to make sure I am doing the same parameterization with the Parameterize tool:
    FF is set to GAFF2.
    Is the dihedral fitting method set to Off as well, with AceForce 1.0 refitting the dihedrals later?
    And for the advanced options, did you keep the default parameters?
  • RESP-PSI4 charges,
  • QM wB97X-D theory,
  • basis set 6-311++G**,
  • rotamer minimized by MM,
  • and iterative dihedral
  2. Also, did you use the Parameterize tool directly with htmd? I do not see which package/module I should get to be able to use the tool with htmd.

  3. In your tutorial you use an atm module to run the rbfe, do you know how to install it?

Thanks a lot!

@qsabanes
Collaborator

qsabanes commented Jan 31, 2025

We only set the RESP-PSI4 flag for the charges and GAFF2 as the FF; the rest is left at default.

In the future we are planning to run tests with other combinations of parameters but these runs take some time.

If you think that another set of parameters will work best for your system, feel free to test them.

@AminaMnhr
Author

Hi,

Thanks a lot for your answer! That is very kind of you!
And lastly, may I ask how you implemented the Parameterize tool to use it directly with htmd on a cluster? Testing each target would take a while using just the web interface.

@stefdoerr

Parameterize with GAFF2 and RESP-PSI4 doesn't take that long, maybe a couple (or 15) minutes per ligand I would assume. I think @qsabanes meant that we set Dihedral fitting to "Off" on the web interface.
Internally at Acellera we do run them on a cluster; on PlayMolecule, however, these jobs run sequentially. But you can still do a couple by hand on the PM server.

@AminaMnhr
Author

Thank you so much for your help, both of you; it is definitely much clearer now!

I will get back to you if I have any question on the workflow.

Thanks a lot!

@AminaMnhr
Author

AminaMnhr commented Feb 8, 2025

Hi,

I would now like to reproduce the tutorial presented in your GitHub for QuantumbindRBFE: https://github.com/Acellera/quantumbind_rbfe/blob/main/tutorial/tutorial_atm.ipynb
However, I am not sure about my software installations, and I am still unfamiliar with the workflow.

As you said, acemd does not implement the ATM force, so it is my understanding that the minimization, equilibration and production runs are done with AToM-OpenMM as presented here: https://github.com/Gallicchio-Lab/AToM-OpenMM

Moreover, the NNP is no longer configured in a .yaml file (as with acemd) but in a .cntl file, the required input for AToM-OpenMM.

Sorry for my numerous questions; I am not yet familiar with the workflow.

Question 1: I have created my environment with htmd and pytorch-cuda, and integrated AToM-OpenMM for the ATM method, as well as openmm-ml and nnpops for the NNP.

However, in your tutorial you run the equilibration, production and UWHAM analysis using: from atm.rbfe_structprep import rbfe_structprep, from atm.rbfe_production import rbfe_production, and from atm.uwham import calculate_uwham.
I do not have those modules in my environment, so how do I integrate them into my conda env? So far I just have my conda environment as built above, plus the rbfe_structprep.py, rbfe_production.py, and analyze.sh from the AToM GitHub.

Question 2: the quantumbindRBFE tutorial seems to work on an input file with the .yaml extension. Is that not supposed to be a .cntl extension, since the simulations are done with AToM-OpenMM?

Question 3: The system equilibration in your tutorial does not include the NNP in the parameters. Is this step run without the potential computed by the NNP, so that the NNP only computes the potential during the production step?

Question 4 : In the ATM production of the tutorial, you wrote :
with open(os.path.join(sim_dir, "nodefile"), "w+") as f:
    f.write("localhost,0:0,1,CUDA,,/tmp")

Does it also need to be adapted somehow if I want to use the SLURM scheduler for my simulations? I am not sure about the localhost and /tmp notations.

Question 5 : If I want to integrate the workflow on a SLURM scheduler, you wrote :

We can also create a run.sh script to run the equilibration with a queue system later:
with open(os.path.join(equil_dir, 'run.sh'), "w+") as f:
    f.write(f"#!/bin/bash\nrbfe_structprep.py {basename}config.yaml > log.txt 2>&1")

I am not sure I understand this either. Does run.sh just contain the SLURM resources plus this line, to run the structprep with SLURM? And the same for the production step, I would guess?

Thanks a lot
Sorry for the numerous questions

Amina

@AdriaPerezCulubret
Contributor

AdriaPerezCulubret commented Feb 10, 2025

Hi @AminaMnhr

Apologies, as part of the issue is on our side. I see that we didn't specify correctly that you need this repository: Acellera-ATM. By installing it with HTMD, you'll have the commands you are missing in your environment, as well as all the required atm imports. This implementation is based on AToM-OpenMM, but with a few changes, such as the use of .yaml files as inputs and the removal of ConfigObj and the R scripts for the analysis. I've updated the README to include it now. Please try to install Acellera-ATM and run the tutorial with it.

By the way, to avoid any confusion, Acellera-ATM (and all the runs shown in the paper) ran entirely on OpenMM; there is no need to use Acemd here.

Now, on to the other questions:

Question 2: This will be solved with Acellera-ATM, which uses .yaml files as inputs.

Question 3: Another mistake on my side; I forgot to include it. However, we have tried both scenarios (equilibration with/without the NNP, production with the NNP) and we didn't observe any effect on performance. In any case, I'll update the tutorial now to also include NNPs in the equilibration. You just need to add the same parameters in the config file as for the production step.

Question 4: No, we also use SLURM and we did not have to adapt it. You can keep it as it is.

Question 5: This is just an example of a simple run.sh that shows how to run that specific part through the command line. You'll need to set up the SLURM resources yourself.
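To make that concrete, here is a minimal sketch of generating such a run.sh from Python, in the spirit of the tutorial snippet; the SLURM directives (partition, GPU count, walltime) and the config filename are placeholders you must adapt to your cluster:

```python
import os

equil_dir = "."              # placeholder: the equilibration directory
config_yaml = "config.yaml"  # placeholder: your RBFE config file

# SLURM resource directives, followed by the command from the tutorial.
script = "\n".join([
    "#!/bin/bash",
    "#SBATCH --job-name=rbfe_equil",  # placeholder job name
    "#SBATCH --partition=gpu",        # adapt to your cluster's partitions
    "#SBATCH --gres=gpu:1",           # one GPU per run
    "#SBATCH --time=24:00:00",        # adapt to the expected runtime
    f"rbfe_structprep.py {config_yaml} > log.txt 2>&1",
    "",
])
with open(os.path.join(equil_dir, "run.sh"), "w") as f:
    f.write(script)
```

The same pattern works for the production step by swapping in rbfe_production.py; submit with sbatch run.sh.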

@AminaMnhr
Author

Hi Adria,

Thank you, it is much clearer now; thank you so much!

One small note: I installed the Acellera-ATM package, however the command conda install acellera-atm python=3.10 -c acellera -c conda-forge does not install pytorch with the CUDA packages but the CPU versions of pytorch, torchvision, etc. I think this is due to the python=3.10 specification. So I first installed an environment with pytorch-cuda on Python 3.12 and CUDA 11.8, and then installed acellera-atm without specifying python=3.10 (otherwise it downgraded my packages to the CPU ones, making torch undetectable in my env). No idea if the workflow can work like that with Python 3.12 instead of 3.10, but I will let you know after trying your quantumbindRBFE tutorial.

For Question 2: Just to be sure, does your yaml input file follow the cntl nomenclature, with all the input variables in uppercase (https://www.compmolbiophysbc.org/atom-openmm/atom-control-file)? Does it have the same parameter settings (job settings, alchemical transfer schedule settings, binding site restraint settings)?

Or does it follow the acemd yaml nomenclature (https://software.acellera.com/acemd/manual.html) with the same acemd parameter settings?

Thanks so much for your help so far. I will keep you updated on my attempt at the tutorial.

@AdriaPerezCulubret
Contributor

The command I've recently used to create fresh environments with Acellera-ATM and HTMD, as well as all the libraries required to load AceForce as an NNP, is this one:
mamba install python=3.10 openmm openmm-ml pyyaml cuda-version=11.8 torchmd-net acellera-atm htmd -c acellera -c conda-forge

Try it and see if that works for you. Perhaps I'll add this to the README as well.

Question 2: It has the same parameter settings as the AToM-OpenMM control file. We removed some that are not used in our version, but the ones in there are equivalent to those in .cntl files.

@AminaMnhr
Author

Thank you so much for the fast answer! I will try this with the tutorial and will let you know how it goes! Thanks.

@AminaMnhr
Author

AminaMnhr commented Feb 13, 2025

Hi, I tried the tutorial and the commands went well, so the environment should be alright. However, I have some questions about the workflow, as I did not get exactly the same results. Sorry again, as it is quite a lot of questions (hopefully they are quick ones):

Results reproduction
For the tutorial TYK2, I tried to reproduce the data taking your input files.

  1. My potential energy for the equilibration step was slightly different from yours (not a big difference), is that normal? I thought this step was deterministic, so I expected to get the exact same value.
  2. I ran the production twice on the same files, and I do not get the same results for ddG in my two productions. And I do not get the same results as you at all:
    Prod 1:
    ddG: 3.205084514417841
    Error 2.8978712306503045
    Replicas 2
    Prod 2:
    ddG: -2.4984867117738645
    Error 3.179037326337165
    Replicas 2

This is far from your value, and also why is the value not the same between production runs? Is that because of the very few replicas?

  3. For the same ligand pair on a specific target, did you run the equilibration and production steps several times (for example three times), compute the ddG for each and then take the average ddG?

  4. We agree that equilibration and production use the same .yaml file?

Workflow

  1. Did you use the Schrodinger datasets as is (no modifications, whether in the pdb attribute format or the atom coordinates), with protein and ligands already docked?

  2. Do the inputs here : https://github.com/Acellera/quantumbind_rbfe/tree/main/inputs provide the files to run the benchmark proteins directly, doing the equilibration and then the production? (So you have already parameterized the ligands, built the system with amber etc., and I just have to run equilibration and production on those files directly.) No need for the pdb files of the structure as well?

  3. How did you define and choose the pairs of ligands for a target protein? And for the same pair, can ligand 1 and ligand 2 be exchanged in the calculation (i.e. are they interconvertible, for example placing ligand 1 far from the binding site, or ligand 2 far instead), or does it not matter? I also thought that for a given protein, every pair of ligands should have the same sign of delta delta G (either + or -)?

  4. Is there a reason why you relabeled the name of each ligand in your inputs folder?

  5. How did you pick the displacement vector for each target? I can see that the vector is not the same depending on the target, so I am wondering what the rules are to define it correctly for a specific ligand pair or target protein.

  6. In the tutorial for TYK2, I do not understand where the indexes [13, 4, 15] come from. I do not find those numbers in the bmol objects, nor in any file. I thought it was the serial number in the pdb file but it does not correspond either. Where do those numbers come from?

  7. Is there a way to visualize both ligands with their indexes (to then pick what atoms to choose for indexes), for example on pymol ?

  8. Does the choice of indexes for the restraint atoms (align ref atoms) strongly influence the value of DDG? I am not exactly sure I understood how to intuitively pick the atoms to restrain (according to the AToM-OpenMM documentation).

  9. For the restriction of the binding site, you do :
    config_params["LIGAND1_CM_ATOMS"] = [int(lig1_restr_idx[0] + lig1_atoms[0])]
    config_params["LIGAND2_CM_ATOMS"] = [int(lig2_restr_idx[0] + lig2_atoms[0])]

    I am not sure I understand the principle either (which atoms it picks and how those atoms are picked).

  10. For the neural network, you set in the tutorial :
    This last parameter depends on the number of atoms in your ligand.
    But be careful, as increasing this parameter greatly increases the computational cost!
    config_params["NNP_MAX_NUM_NEIGHBORS"] = 30

However, I do not understand where the 30 comes from, as ligand 1 and ligand 2 in the tutorial have more than 30 atoms. Also, is this parameter supposed to be the number of atoms of ligand 1, or the sum of the atoms of both ligands?

@qsabanes
Collaborator

Hi @AminaMnhr,

Results reproduction For the tutorial TYK2, I tried to reproduce the data taking your input files.

  1. My potential energy for equilibration step was slightly different than yours (not big difference), is that normal? As I thought that this step was deterministic so I should have got the exact same value
  2. I ran the production twice on the same files, and I do not get the same results for ddG in my two productions. And I do not get the same results as you at all:
    Prod 1:
    ddG: 3.205084514417841
    Error 2.8978712306503045
    Replicas 2
    Prod 2 :
    ddG: -2.4984867117738645
    Error 3.179037326337165
    Replicas 2

This is far from your value, and also why is the value between production runs not the same? Is that because of the very few replicas?

If you followed the tutorial then you only ran 2 samples, which is around 2 ns for the total ensemble. This is far from enough for your system to converge. The tutorial is set up like this to test that the system runs, but for a proper production you need to run for longer. As you can see in our https://github.com/Acellera/quantumbind_rbfe/tree/main/inputs folder, we run the systems for 400 samples, which corresponds to approximately 70 ns for the total ensemble. We have observed convergence with that simulation time in most cases, but there are cases where you would need more, or where less is sufficient. Another sign that something is off in your simulations is that you have a high error.

  3. For the same ligand pair on a specific target, did you run the equilibration and production steps several times (for example three times), compute the ddG for each and then take the average ddG?

Yes, indeed. It is better to run replicas because sometimes you can get outliers. The more replicas you run, the more consistent the average ddG you'll get. This is a time/cost balance that the user has to decide.

  4. We agree that equilibration and production use the same .yaml file?

Yes

Workflow

  1. Did you use the Schrodinger datasets as is (no modifications, whether in the pdb attribute format or the atom coordinates), with protein and ligands already docked?

Exactly, we used the inputs as is.

  2. Do the inputs here : https://github.com/Acellera/quantumbind_rbfe/tree/main/inputs provide the files to run the benchmark proteins directly doing the equilibration and then production? (So you already parameterize ligands, built the system with amber etc, and I just have to run equilibration and production on those files directly). No need of the pdb files format as well of the structure?

These systems are already equilibrated, you just need to run production on them.

  3. How did you define and choose the pairs of ligands for a target protein? And for the same pair, can ligand 1 and ligand 2 be exchanged in the calculation (i.e. are they interconvertible, for example placing ligand 1 far from the binding site, or ligand 2 far instead), or does it not matter? I also thought that for a given protein, every pair of ligands should have the same sign of delta delta G (either + or -)?

A way to measure the robustness of your setup and calculations is to test the simulations in both directions like you propose. For these benchmarks the edges/ligand pairs are already defined. But I suggest you check out Lomap or Konnektor (or similar) if you want to build your own perturbation network.
https://github.com/OpenFreeEnergy/Lomap
https://github.com/OpenFreeEnergy/konnektor

  4. Is there a reason why you relabeled the name of each ligand in your inputs folder?

Parameterize relabels all ligands with more than 3 characters to MOL. That's why we have to relabel them.

  5. How did you pick the displacement vector for each target? Because I can see that the vector is not the same depending on the target so I am wondering what are the rules to define it correctly for a specific pair ligand or target protein.

In our inputs we have an automatic approach to select the displacement vector, but you can select it manually. Just try to ensure that the ligand outside of the pocket is more than 10-15 Å away from the protein.
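As a quick sanity check for a manually chosen vector, you could measure the minimum ligand-protein distance after displacement; this is a hypothetical helper in plain Python, not part of the quantumbind_rbfe code:

```python
import math

def min_protein_distance(lig_coords, prot_coords, displacement):
    """Minimum distance (Å) between the displaced ligand and the protein.

    lig_coords, prot_coords: lists of (x, y, z) atom coordinates.
    displacement: (dx, dy, dz) translation applied to the ligand.
    """
    dx, dy, dz = displacement
    moved = [(x + dx, y + dy, z + dz) for x, y, z in lig_coords]
    return min(math.dist(a, p) for a in moved for p in prot_coords)

# Toy coordinates (not a real system): two ligand atoms at the origin,
# one protein atom 5 Å away along x, ligand displaced 30 Å along x.
lig = [(0.0, 0.0, 0.0), (0.0, 0.0, 0.0)]
prot = [(5.0, 0.0, 0.0)]
print(min_protein_distance(lig, prot, (30.0, 0.0, 0.0)))  # 25.0
```

If the returned distance falls below the suggested 10-15 Å margin, pick a longer displacement vector.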

  6. In the tutorial for TYK2, I do not understand where the indexes [13, 4, 15] come from. I do not find those numbers in the bmol objects, nor in any file. I thought it was the serial number in the pdb file but it does not correspond either. Where do those numbers come from?

These are indexes from the ligand .sdf file

  7. Is there a way to visualize both ligands with their indexes (to then pick what atoms to choose for indexes), for example on pymol ?

Yes with pymol you can do label > rank and the numbers you see correspond to the atom indexes
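If you prefer to check those indices outside of pymol, a small self-contained sketch can list the 0-based atom indices from a V2000 .sdf atom block (I am assuming the tutorial's indices are 0-based, matching pymol's rank labels); the tiny ethanol SDF below is just a stand-in for your ligand file:

```python
# Minimal V2000 SDF used as a stand-in for a real ligand file.
sdf = """ethanol
  toy
 comment
  3  2  0  0  0  0  0  0  0  0999 V2000
    0.0000    0.0000    0.0000 C   0  0
    1.5000    0.0000    0.0000 C   0  0
    2.0000    1.0000    0.0000 O   0  0
  1  2  1  0
  2  3  1  0
M  END
"""
lines = sdf.splitlines()
natoms = int(lines[3].split()[0])  # counts line: first field = atom count
atoms = []
for i, line in enumerate(lines[4:4 + natoms]):
    element = line.split()[3]      # fields: x, y, z, element, ...
    atoms.append((i, element))
    print(i, element)              # 0-based index and element symbol
```

Matching these printed indices against the pymol rank labels is an easy way to confirm you are selecting the atoms you intend.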

  8. Does the choice of indexes for restrain atoms (align ref atoms) influence a lot the value of DDG? As I am not exactly sure I understood correctly how to intuitively pick the atoms to restrain (according to AToM MM documentation)

Yes, a poor selection of restraint atoms can strongly influence the DDG value. I try to ensure that the ligands pre- and post-equilibration are as similar as possible (no weird torsions or displacements observed)

  9. For the restriction of the binding site, you do:
    config_params["LIGAND1_CM_ATOMS"] = [int(lig1_restr_idx[0] + lig1_atoms[0])]
    config_params["LIGAND2_CM_ATOMS"] = [int(lig2_restr_idx[0] + lig2_atoms[0])]

    I am not sure I understand the principle either (which atoms it picks and how those atoms are picked).

I think the comments in the tutorial explain it fine:
The following parameters are just required in case you want to use the binding site restraints. Otherwise they will be ignored.
We provide some reference atom for the ligands as a center, as well as all the carbon alpha atoms of residues near the binding site

  10. For the neural network, you set in the tutorial:
    This last parameter depends on the number of atoms in your ligand.
    But be careful, as increasing this parameter greatly increases the computational cost!
    config_params["NNP_MAX_NUM_NEIGHBORS"] = 30

However, I do not understand where the 30 comes from, as ligand 1 and ligand 2 in the tutorial have more than 30 atoms. Also, is this parameter supposed to be the number of atoms of ligand 1, or the sum of the atoms of both ligands?

Ok, this should be set to 40, as that is what we did in the paper; we'll change that. It should take into account the number of atoms of the bigger ligand. The issue is that the higher you set the number of neighbours, the slower the calculation. So even if we set it to 40, there are cases where the ligands are bigger. This is something that we still have to evaluate once we have time.

@AminaMnhr
Author

Thanks a lot for the fast answer! Happy to have the feedback today, as I asked a lot. Thanks again!

If you followed the tutorial then you only ran 2 samples, which is around 2 ns for the total ensemble. This is far from enough for your system to converge. The tutorial is set up like this to test that the system runs, but for a proper production you need to run for longer. As you can see in our https://github.com/Acellera/quantumbind_rbfe/tree/main/inputs folder, we run the systems for 400 samples, which corresponds to approximately 70 ns for the total ensemble. We have observed convergence with that simulation time in most cases, but there are cases where you would need more, or where less is sufficient. Another sign that something is off in your simulations is that you have a high error.

Okay, that is what I was thinking. So the fact that I am not getting the same value is due to MAX SAMPLES = 2. Similarly, I should then not be worried that my value is far from the one in the tutorial?

Yes, indeed. It is better to run replicas because sometimes you can get outliers. The more replicas you run, the more consistent the average ddG you'll get. This is a time/cost balance that the user has to decide.

Thanks for the answer. However, my question was more about the number of times you executed the equilibration and production for the same pair. Replicas are already run inside the production step itself (with the max samples), but I was wondering if you would also launch the commands rbfe_structprep(input) and rbfe_production(input) several times for the same pair, and then take the average of the ddG obtained for each?

A way to measure the robustness of your setup and calculations is to test the simulations in both directions like you propose. For these benchmarks the edges/ligand pairs are already defined. But I suggest you check out Lomap or Konnektor (or similar) if you want to build your own perturbation network. https://github.com/OpenFreeEnergy/Lomap https://github.com/OpenFreeEnergy/konnektor

Thanks, that is good to know as well! May I ask what you would check after running the simulations in both directions, to know which one to take? Also, are all my pairs for a target supposed to have the same sign (I remember seeing something like that for FEP best practices, but I am not sure)?

Okay I see, are the benchmark pairs the ones defined by Schrodinger directly?

Parameterize relabels all ligands with more than 3 characters to MOL. That's why we have to relabel them.

Good to know too, thanks! As Parameterize takes an sdf or mol2, which property in the sdf did you change for the relabeling? Is it just the first line of the sdf file for the ligand?

In our inputs we have an automatic approach to select the displacement vector, but you can select it manually. Just try to ensure that the ligand outside of the pocket is more than 10-15 Å away from the protein.

Okay, thanks a lot! We would like to try quantumbindRBFE on new pairs/new targets, so would it make sense if I put, for example, the ligand 15 Å away from the furthest protein atom in the X dimension?
If the diameter of my protein is about 10 Å, would I then put the ligand at [25 0 0] or [25 25 25]?

Yes, a poor selection of restraint atoms can strongly influence the DDG value. I try to ensure that the ligands pre- and post-equilibration are as similar as possible (no weird torsions or displacements observed)

Also good to know, thanks a lot! Just to make sure, the pre-equilibration structure is directly the structure.pdb after having built the system, right?
But for post-equilibration, I have:
TYK2_ejm31_jmc28_equil.pdb TYK2_ejm31_jmc28_min.pdb TYK2_ejm31_jmc28_npt.pdb TYK2_ejm31_jmc28_0.pdb TYK2_ejm31_jmc28_mdlambda.pdb TYK2_ejm31_jmc28_therm.pdb TYK2_ejm31_jmc28_min.pdb

Do you check the TYK2_ejm31_jmc28_0.pdb file?

I think the comments in the tutorial explain it fine: The following parameters are just required in case you want to use the binding site restraints. Otherwise they will be ignored. We provide some reference atom for the ligands as a center, as well as all the carbon alpha atoms of residues near the binding site

Yep, so for this: config_params["LIGAND1_CM_ATOMS"] = [int(lig1_restr_idx[0] + lig1_atoms[0])]
I guess I am asking because I am not sure I understand how to pick the CM_ATOMS for the ligand.
If I need to use the method on a new dataset, I think it is important to define a binding site restraint, as we are not sure about a strong binding.

Ok, this should be set to 40, as that is what we did in the paper; we'll change that. It should take into account the number of atoms of the bigger ligand. The issue is that the higher you set the number of neighbours, the slower the calculation. So even if we set it to 40, there are cases where the ligands are bigger. This is something that we still have to evaluate once we have time.

Thanks! I guess that this parameter would also influence the ddG a lot?
Is there a threshold for the size of my ligand that I should not exceed for now? (For example, would greater than 50 be too computationally expensive for now?)

Thanks for your help.

@qsabanes
Collaborator

Okay, that is what I was thinking. So the fact that I am not getting the same value is due to MAX SAMPLES = 2. Similarly, I should then not be worried that my value is far from the one in the tutorial?

Exactly, if you simulate for longer you should get closer values along the different runs.

Thanks for the answer. However, my question was more about the number of times you executed the equilibration and production for the same pair. Replicas are already run inside the production step itself (with the max samples), but I was wondering if you would also launch the commands rbfe_structprep(input) and rbfe_production(input) several times for the same pair, and then take the average of the ddG obtained for each?

Here by replicas I meant independent runs. I can understand this can be confusing since it has the same name.

Thanks, that is good to know as well! May I ask what you would check after running the simulations in both directions, to know which one to take? Also, are all my pairs for a target supposed to have the same sign (I remember seeing something like that for FEP best practices, but I am not sure)?

Okay I see, are the benchmark pairs the ones defined by Schrodinger directly?

If you do first A --> B and then B --> A, then you should expect that DDG(A-->B) is similar to -DDG(B-->A).
Yes, the pairs were defined by Schrodinger directly, most probably with their own tool. I did not reference it before since it's a commercial one.

Good to know too, thanks! As Parameterize takes an sdf or mol2, which property in the sdf did you change for the relabeling? Is it just the first line of the sdf file for the ligand?

Ligand name, first line, yes. But if you prepare/parameterize your ligand on your own, you should be able to leave it unchanged and make it work as well.

Okay, thanks a lot! We would like to try quantumbindRBFE on new pairs/new targets, so would it make sense if I put, for example, the ligand 15 Å away from the furthest protein atom in the X dimension?
If the diameter of my protein is about 10 Å, would I then put the ligand at [25 0 0] or [25 25 25]?

Play around with it. But take into account that the bigger the box --> more atoms --> slower sims

Also good to know, thanks a lot! Just to make sure, the pre-equilibration structure is directly the structure.pdb after having built the system, right?
But for post-equilibration, I have:
TYK2_ejm31_jmc28_equil.pdb TYK2_ejm31_jmc28_min.pdb TYK2_ejm31_jmc28_npt.pdb TYK2_ejm31_jmc28_0.pdb TYK2_ejm31_jmc28_mdlambda.pdb TYK2_ejm31_jmc28_therm.pdb TYK2_ejm31_jmc28_min.pdb

Do you check the TYK2_ejm31_jmc28_0.pdb file?

Yes

Yep, so for this : config_params["LIGAND1_CM_ATOMS"] = [int(lig1_restr_idx[0] + lig1_atoms[0])]
I guess I am asking because I am not sure I understand how to pick the CM_ATOMS for the ligand.
If I need to use the method on a new dataset, I think it is important to define a binding site restraint, as we are not sure about a strong binding.

In general this value is the first atom you select in the index restraints
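To illustrate my reading of that tutorial line (the numbers below are made up for illustration, not taken from the TYK2 system): lig1_restr_idx holds the ligand-local alignment-atom indices and lig1_atoms the ligand's atom indices in the full built system, so adding the two converts the first alignment atom into a system-wide index:

```python
# Made-up values for illustration only:
lig1_atoms = [4150, 4151, 4152]  # global indices of ligand 1's atoms in the system
lig1_restr_idx = [13, 4, 15]     # ligand-local alignment atom indices (sdf numbering)

# Offset the ligand-local index of the first alignment atom by the global
# index of the ligand's first atom to obtain the system-wide CM atom index.
cm_atom = int(lig1_restr_idx[0] + lig1_atoms[0])
print(cm_atom)  # 4163
```

Under this reading, you would not need to pick lig1_atoms[0] by hand; it is just the offset of the ligand within the built system.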

Thanks! I guess that this parameter would also influence the ddG a lot?
Is there a threshold for the size of my ligand that I should not exceed for now? (For example, would greater than 50 be too computationally expensive for now?)

Iirc jumping from 30 to 40 neighbours increased the computational cost by 30%, but I don't know if this is an exponential, linear or another kind of increase

@AminaMnhr
Author

AminaMnhr commented Feb 13, 2025

Thanks for the answer!!

Here by replicas I meant independent runs. I can understand this can be confusing since it has the same name.

Okay, thanks. So for the same ligand pair, I should then run rbfe_structprep and rbfe_production several times. Do you just re-run rbfe_production() keeping the same equilibration files, or do you also re-run the equilibration each time?

Each time you repeat an independent run, do you change some settings of the yaml file, or should everything remain the same?

In general this value is the first atom you select in the index restraints

config_params["LIGAND1_CM_ATOMS"] = [int(lig1_restr_idx[0] + lig1_atoms[0])]
Okay, then the first atom selected in the index restraints should be lig1_restr_idx[0]. But what about lig1_atoms[0]? I am not sure I understand what it is, and whether I also have to include it for every target I try.

Also:

  • Is it possible to restart an equilibration or a production job that ended before completing? That is, to restart the job from the last step before it ended.
  • Is it possible to submit the production run with SLURM using more than one GPU, to make it faster?

@qsabanes
Collaborator

Okay, thanks. So for the same ligand pair, I should then run rbfe_structprep and rbfe_production several times. Do you just re-run rbfe_production() keeping the same equilibration files, or do you also re-run the equilibration each time?

we re-run everything

Each time you repeat an independent run, do you change some settings of the yaml file, or should everything remain the same?

Same protocol

config_params["LIGAND1_CM_ATOMS"] = [int(lig1_restr_idx[0] + lig1_atoms[0])]
Okay, then the first atom selected in the index restraints should be lig1_restr_idx[0]. But what about lig1_atoms[0]? I am not sure I understand what it is, and whether I also have to include it for every target I try.

I am pretty sure that the protocol takes care of it, but maybe @AdriaPerezCulubret can confirm
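One hypothetical reading of that line, in case it helps (the names `lig1_restr_idx` and `lig1_atoms` come from the setup script; the interpretation below is my assumption, not confirmed by the protocol): `lig1_restr_idx` would hold atom indices numbered within the ligand, while `lig1_atoms` would hold the ligand's atom indices within the whole system, so adding the ligand's first global index converts the local restraint index to a global one.

```python
# Assumption (not confirmed by the protocol docs): lig1_restr_idx uses
# ligand-local numbering, lig1_atoms uses system-wide numbering, and the
# sum converts the first restraint atom to its system-wide index.
lig1_atoms = list(range(4613, 4650))  # hypothetical: ligand occupies these system indices
lig1_restr_idx = [5, 12, 20]          # hypothetical restraint atoms, ligand-local

LIGAND1_CM_ATOMS = [int(lig1_restr_idx[0] + lig1_atoms[0])]
print(LIGAND1_CM_ATOMS)  # → [4618]
```

If this reading is right, the protocol would indeed handle the offset automatically as long as `lig1_atoms` is taken from the built system.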

Also:

Is it possible to restart an equilibration or a production run for a job that ended before completing, i.e. resume from the last step before it stopped?
Is it possible to submit the production run through SLURM using more than one GPU, to make it faster?

The equilibration script does not have checkpointing; the production script does

You can run with more than one GPU by modifying the nodefile and adding the IDs of the additional GPUs you want to use

@AminaMnhr
Copy link
Author

AminaMnhr commented Feb 18, 2025

Thank you for the answer!

I am trying to reproduce the results for TYK2 and will let you know if I manage to reproduce the benchmark.

Some questions for speed of calculations and eq steps:

  • In the tutorial, the equilibration is done setting
    For the testing purpose of this tutorial, we are going to reduce the number of equilibration steps.
    To run the standard equilibration, just comment the following lines
    config_params['EQUIL_MINIM_STEPS'] = 10000
    config_params['EQUIL_ANNEALING_STEPS'] = 10000
    config_params['EQUIL_STEPS'] = 10000

However, if I want to run the equilibration without reducing the steps, should I just remove those parameters entirely? Those parameters are not in your input.yaml template example, so I was wondering whether I still need to set those keys for the equilibration and assign a value to each. If I do not have those keys in the yaml file, what are the default values?

  • It takes me about two days per ligand pair to do the production step. Did you see a similar calculation time? Also, since the equilibration was already done by you, could you tell me roughly how long an equilibration would take?

  • Is there any safe way to reduce the calculation time without affecting the ddG value too much? I guess with the maxsamples parameter, or something else?

  • Or would it also be faster if I assigned more than one GPU in the nodefile?

  • We would like to apply your method to some experimental structures and macrocycle ligands, where the experimental structure is the protein with the ligand in the binding site. Is there any preparation I would need to do before running the quantumbindRBFE setup? Apart from protonation, hydrogens and missing residues, I was wondering whether I would need to minimize the experimental structure, or do any docking to refine the experimental binding pose, before running the entire quantumbindRBFE workflow (building the system, eq and prod), or if that is not necessary.
    And would you advise modelling the system with different binding poses or not (which would add to the number of calculations to do...)?

Thanks so much.

@qsabanes
Copy link
Collaborator

In the tutorial, the equilibration is done setting
For the testing purpose of this tutorial, we are going to reduce the number of equilibration steps.
To run the standard equilibration, just comment the following lines
config_params['EQUIL_MINIM_STEPS'] = 10000
config_params['EQUIL_ANNEALING_STEPS'] = 10000
config_params['EQUIL_STEPS'] = 10000
However, if I want to run the equilibration without reducing the steps, should I just remove those parameters entirely? Those parameters are not in your input.yaml template example, so I was wondering whether I still need to set those keys for the equilibration and assign a value to each. If I do not have those keys in the yaml file, what are the default values?

Default values should be:
"EQUIL_MINIM_STEPS", 150000
"EQUIL_ANNEALING_STEPS", 250000
"EQUIL_STEPS", 150000

It takes me about two days per ligand pair to do the production step. Did you see a similar calculation time? Also, since the equilibration was already done by you, could you tell me roughly how long an equilibration would take?

This depends on your GPU, size of the system and amount of time you want to run it
A TYK2 edge with AceForce should take around 36-48h in a RTX4090 at 2fs

Is there any safe way to reduce the calculation time without affecting that much the ddG value? I guess with the maxsamples parameter, or something else?

You can reduce the number of samples, of course, but I cannot promise the simulations will be converged. This is very system/transformation dependent

Or would it make it faster also if I assign more than one GPU in the nodefile?

We haven't tested multi-GPU runs. We prefer to run several calculations in parallel

We would like to apply your method to some experimental structures and macrocycle ligands, where the experimental structure is the protein with the ligand in the binding site. Is there any preparation I would need to do before running the quantumbindRBFE setup? Apart from protonation, hydrogens and missing residues, I was wondering whether I would need to minimize the experimental structure, or do any docking to refine the experimental binding pose, before running the entire quantumbindRBFE workflow (building the system, eq and prod), or if that is not necessary.
And would you advise modelling the system with different binding poses or not (which would add to the number of calculations to do...)?

Sadly, I have no experience working with macrocycles and RBFE. Furthermore, I'm not really sure if AceForce 1.0 is adequate for this kind of ligand, but feel free to test them. For the moment we have only evaluated AceForce with ligands from the JACS/Schrodinger dataset.

@AminaMnhr
Copy link
Author

AminaMnhr commented Feb 20, 2025

Thanks for the answer:

  • For the default values
    "EQUIL_MINIM_STEPS", 150000
    "EQUIL_ANNEALING_STEPS", 250000
    "EQUIL_STEPS", 150000

Is that correct, 150000 and not 15000? Are these also the default values you found necessary for convergence of your systems?

  • "A TYK2 edge with AceForce should take around 36-48h in a RTX4090 at 2fs"
    Thank you! Does this time include the equilibration as well?
    Also, do you set up the NN for the equilibration too, or just for production? You said you did not find any difference applying the NN or not for the eq; I thought the equilibration parameters should be consistent with the production steps, or is that not necessary?

  • Thanks! Was AceForce trained on macrocycles as well, or just on linear ligands?

  • To check the equilibration dynamics, should I open the _0 .pdb or the .dcd file?

  • And to check the production dynamics, should I open each .dcd file in each folder r0, r1, ... etc.?

@AminaMnhr
Copy link
Author

Hi, I tried to reproduce the benchmark and here are my results:

Edge,ddG_my_value,ddG_std,ddG_your_value,ddG_exp
A02_A09,0.73,0.28,-1.31,-1.32
A03_A01,0.49,0.23,1.65,0.76
A03_A02,3.37,0.21,3.86,2.49
A03_A05,0.35,0.23,0.36,0.39
A03_A10,-3.05,0.21,1.69,0.42
A04_A02,1.21,0.22,1.07,0.49
A04_A15,2.27,0.22,0.78,0.16
A06_A08,-1.21,0.22,-0.89,-0.80
A08_A02,-0.99,0.21,1.03,0.57
A08_A07,-0.23,0.22,0.68,0.78
A08_A09,-0.30,0.21,-0.47,-0.75
A11_A08,-0.82,0.22,1.31,-0.22
A12_A01,-0.12,0.22,-0.84,0.04
A12_A10,-2.78,0.22,-0.39,-0.30
A13_A02,-1.00,0.23,-1.96,-1.70
A14_A02,0.44,0.22,-0.67,-0.95
A15_A07,0.036,0.24,0.55,0.54

*I ran each pair once. For the pairs I have put in italics, my results are really far from yours; even the sign of ddG differs. Do you know why the sign is not consistent?

*Also, since yours is an average over different runs, how many runs of eq + prod per edge did you do in general?

Thank you
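As a quick sanity check on the sign flips, a sketch (not part of the protocol) that embeds a subset of the table above and flags the edges where the two predictions disagree in sign:

```python
import csv, io

# Subset of the results table above (my ddG, its std, the reported ddG,
# and the experimental ddG, all per edge).
data = """Edge,ddG_my_value,ddG_std,ddG_your_value,ddG_exp
A02_A09,0.73,0.28,-1.31,-1.32
A03_A10,-3.05,0.21,1.69,0.42
A08_A02,-0.99,0.21,1.03,0.57
A03_A05,0.35,0.23,0.36,0.39"""

rows = list(csv.DictReader(io.StringIO(data)))
# An edge "flips" when my value and the reported value have opposite signs
flipped = [r["Edge"] for r in rows
           if float(r["ddG_my_value"]) * float(r["ddG_your_value"]) < 0]
print(flipped)  # → ['A02_A09', 'A03_A10', 'A08_A02']
```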

@qsabanes
Copy link
Collaborator

Hi Amina,
we usually run each pair three times in our benchmarks.
I re-checked my own data, since TYK2 is a target we have evaluated several times with different settings, and I have only seen outliers on a few occasions for A03_A10; the rest is pretty consistent across the replicas we have.

As for the reasons? Probably the main one is that RBFE calculations can be inconsistent, hence the importance of running several repeats to identify potential outliers. I don't know if you have read this paper, which relates to the approach we base QuantumBind on: https://pubs.acs.org/doi/full/10.1021/acs.jctc.3c01250

@AminaMnhr
Copy link
Author

AminaMnhr commented Feb 24, 2025

Thanks for your answer! I had not seen this paper; I will check it.

*As for the difference in sign, is that also part of the inconsistency of the method? (For example, for A02_A09, A08_A02, etc., my value has the opposite sign of yours.)

*And if the edge A02_A09 = 0.73, does that mean dG(A02) - dG(A09) = 0.73? (I just want to make sure which way it has been computed.) I am also wondering how we rank the compounds given the edges. (How do we go from the RBFEs to absolute delta G values to then rank them?)
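On going from edge RBFEs to a ranking: one common approach (used by network-analysis tools such as cinnabar/DiffNet, not specific to QuantumBind) is to solve for per-ligand dG values by least squares over the edge graph; only differences are meaningful, so one ligand is anchored at zero. The sign convention below, ddG of edge A_B = dG(B) - dG(A), is an assumption and should be checked against the protocol's own definition.

```python
import numpy as np

def solve_dg(edges, ddg, n):
    """Least-squares per-ligand dG from edge ddG values.
    edges: (i, j) ligand-index pairs; ddg[k] is assumed to equal
    dG[j] - dG[i] (sign convention is an assumption).
    Ligand 0 is anchored at dG = 0; only differences are meaningful."""
    A = np.zeros((len(edges) + 1, n))
    b = np.zeros(len(edges) + 1)
    for k, (i, j) in enumerate(edges):
        A[k, i], A[k, j] = -1.0, 1.0  # row encodes dG[j] - dG[i] = ddg[k]
        b[k] = ddg[k]
    A[-1, 0] = 1.0  # anchor row: dG[0] = 0
    dg, *_ = np.linalg.lstsq(A, b, rcond=None)
    return dg - dg[0]

# Toy, self-consistent cycle of three ligands
dg = solve_dg([(0, 1), (1, 2), (0, 2)], [1.0, 0.5, 1.5], n=3)
ranking = np.argsort(dg)  # most negative (most favourable) dG first
```

With inconsistent cycles (the usual case with real data), the least-squares solution distributes the cycle-closure error across the edges instead of propagating it along one path.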
