Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

make_pdb ESMFold jobs running even when input structure is provided #82

Closed
naailkhan28 opened this issue Jan 25, 2024 · 8 comments
Closed
Labels
bug Something isn't working

Comments

@naailkhan28
Copy link
Contributor

Description of the bug

When running the Search mode Snakemake pipeline, I see that the make_pdb rule is run and a request is made to the ESMFold API, even when I've provided a .pdb file containing my input structure. My experimentally solved crystal structure is being replaced with a less accurate ESMFold prediction which is not ideal.

I've attached my full Snakemake log (it failed due to a python error but ignore this), and my input .yml config, .pdb and .fasta files.

Command used and terminal output

snakemake --snakefile Snakefile --configfile demo/1OHG/config_1OHG.yml --use-conda --cores 64 --conda-frontend conda

# Here's the top part of my Snakemake log:

Config file ./config.yml is extended by additional config specified via the command line.
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 64
Rules claiming more threads will be scaled down.
Job stats:
job                                         count    min threads    max threads
----------------------------------------  -------  -------------  -------------
aggregate_features                              1              1              1
aggregate_foldseek_fraction_seq_identity        1              1              1
aggregate_lists                                 1              1              1
all                                             1              1              1
assess_pdbs                                     1              1              1
calculate_concordance                           1              1              1
dim_reduction                                   2              1              1
download_pdbs                                   1              1              1
extract_blast_hits                              1              1              1
extract_foldseek_hits                           1              1              1
fetch_uniprot_metadata                          1              1              1
filter_uniprot_hits                             1              1              1
foldseek_clustering                             1             16             16
get_source                                      1              1              1
input_distances                                 1              1              1
leiden_clustering                               1              1              1
make_pdb                                        1              1              1
map_refseqids                                   1              1              1
plot_cluster_distributions                      1              1              1
plot_interactive                                2              1              1
plot_semantic_analysis                          1              1              1
plot_similarity_leiden                          1              1              1
plot_similarity_strucluster                     1              1              1
run_blast                                       1              1              1
run_foldseek                                    1              1              1
total                                          27              1             16

Select jobs to execute...

[Thu Jan 25 12:10:17 2024]
rule run_blast:
    input: demo/1OHG/P49861.fasta
    output: demo/1OHG/output/blastresults/P49861.blastresults.tsv
    jobid: 13
    benchmark: demo/1OHG/output/benchmarks/P49861.run_blast.txt
    wildcards: protid=P49861
    resources: tmpdir=/tmp


Activating conda environment: /blockstorage/ProteinCartography/.snakemake/conda/47f42b1b7af144c2415a38423667989d
[Thu Jan 25 12:10:17 2024]
rule make_pdb:
    input: demo/1OHG/P49861.fasta
    output: demo/1OHG/P49861.pdb
    jobid: 10
    benchmark: demo/1OHG/output/benchmarks/P49861.make_pdb.txt
    wildcards: protid=P49861
    resources: tmpdir=/tmp

Relevant files

P49861_fasta_pdb_yml_config_snakemake_log.zip

System information

  • Snakemake version 6.15.1
  • VM.Standard.E4.Flex instance in Oracle Cloud - 64 cores, 1024 GB RAM, 25 TB storage available
  • Ubuntu 22.04.1 LTS
  • ProteinCartography version 0.4.2
@naailkhan28 naailkhan28 added the bug Something isn't working label Jan 25, 2024
@mezarque
Copy link
Member

Thanks for letting us know about this! We've seen this occur occasionally with other analyses as well and haven't been able to ID the cause. We'll look into it.

@mezarque
Copy link
Member

One hypothesis that @mertcelebi shared is that this might occur when the timestamp of the FASTA file is later than that of the PDB file. If so, it's possible that this is a more general issue with how Snakemake detects files. Could you try running this code from within the demo/1OHG/ directory to see if it resolves the issue?

touch P49861.fasta &&
sleep .5 &&
touch P49861.pdb

@mezarque
Copy link
Member

Separately from this issue, the experimentally-determined PDB provided appears to contain multiple chains, which unfortunately isn't supported yet by the pipeline. We're working to resolve the current issue, but I expect you'll run into a new error once the replacement issue is resolved.

@naailkhan28
Copy link
Contributor Author

I just did the experiment you suggested and that seemed to be the issue - when the FASTA is older than the PDB file then it works as expected and no make_pdb job is run.

However, if you swap those two touch calls (ie, the PDB is older than the FASTA) or you run them at the same time (with no sleep .5 command) then the make_pdb job is run. Looks like a tweak to the snakemake pipeline is needed!

@naailkhan28
Copy link
Contributor Author

Separately from this issue, the experimentally-determined PDB provided appears to contain multiple chains, which unfortunately isn't supported yet by the pipeline. We're working to resolve the current issue, but I expect you'll run into a new error once the replacement issue is resolved.

Before you implement a way of handling multiple chains, would it be helpful to handle multi-chain PDB files in a nicer way? Like maybe automatically taking chain A or throwing a warning or something.

@mezarque
Copy link
Member

That's a helpful suggestion; we can explore whether an intermediate version like this would help.

@keithchev
Copy link
Member

@naailkhan28 I think we've fixed this in #83, which should ensure that make_pdb is never called for PDB files that are in the input directory. Please let us know if this resolves this issue on your end!

@naailkhan28
Copy link
Contributor Author

Just tested the latest version, looks fixed! Great work guys :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

3 participants