`make_pdb` ESMFold jobs running even when input structure is provided #82

naailkhan28 · 2024-01-25T12:33:58Z

Description of the bug

When running the Search mode Snakemake pipeline, I see that the make_pdb rule is run and a request is made to the ESMFold API, even when I've provided a .pdb file containing my input structure. My experimentally solved crystal structure is being replaced with a less accurate ESMFold prediction which is not ideal.

I've attached my full Snakemake log (it failed due to a python error but ignore this), and my input .yml config, .pdb and .fasta files.

Command used and terminal output

snakemake --snakefile Snakefile --configfile demo/1OHG/config_1OHG.yml --use-conda --cores 64 --conda-frontend conda

# Here's the top part of my Snakemake log:

Config file ./config.yml is extended by additional config specified via the command line.
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 64
Rules claiming more threads will be scaled down.
Job stats:
job                                         count    min threads    max threads
----------------------------------------  -------  -------------  -------------
aggregate_features                              1              1              1
aggregate_foldseek_fraction_seq_identity        1              1              1
aggregate_lists                                 1              1              1
all                                             1              1              1
assess_pdbs                                     1              1              1
calculate_concordance                           1              1              1
dim_reduction                                   2              1              1
download_pdbs                                   1              1              1
extract_blast_hits                              1              1              1
extract_foldseek_hits                           1              1              1
fetch_uniprot_metadata                          1              1              1
filter_uniprot_hits                             1              1              1
foldseek_clustering                             1             16             16
get_source                                      1              1              1
input_distances                                 1              1              1
leiden_clustering                               1              1              1
make_pdb                                        1              1              1
map_refseqids                                   1              1              1
plot_cluster_distributions                      1              1              1
plot_interactive                                2              1              1
plot_semantic_analysis                          1              1              1
plot_similarity_leiden                          1              1              1
plot_similarity_strucluster                     1              1              1
run_blast                                       1              1              1
run_foldseek                                    1              1              1
total                                          27              1             16

Select jobs to execute...

[Thu Jan 25 12:10:17 2024]
rule run_blast:
    input: demo/1OHG/P49861.fasta
    output: demo/1OHG/output/blastresults/P49861.blastresults.tsv
    jobid: 13
    benchmark: demo/1OHG/output/benchmarks/P49861.run_blast.txt
    wildcards: protid=P49861
    resources: tmpdir=/tmp


Activating conda environment: /blockstorage/ProteinCartography/.snakemake/conda/47f42b1b7af144c2415a38423667989d
[Thu Jan 25 12:10:17 2024]
rule make_pdb:
    input: demo/1OHG/P49861.fasta
    output: demo/1OHG/P49861.pdb
    jobid: 10
    benchmark: demo/1OHG/output/benchmarks/P49861.make_pdb.txt
    wildcards: protid=P49861
    resources: tmpdir=/tmp

Relevant files

P49861_fasta_pdb_yml_config_snakemake_log.zip

System information

Snakemake version 6.15.1
VM.Standard.E4.Flex instance in Oracle Cloud - 64 cores, 1024 GB RAM, 25 TB storage available
Ubuntu 22.04.1 LTS
ProteinCartography version 0.4.2

The text was updated successfully, but these errors were encountered:

mezarque · 2024-01-25T17:31:21Z

Thanks for letting us know about this! We've seen this occur occasionally with other analyses as well and haven't been able to ID the cause. We'll look into it.

mezarque · 2024-01-25T17:45:34Z

One hypothesis that @mertcelebi shared is that this might occur when the timestamp of the FASTA file is later than that of the PDB file. If so, it's possible that this is a more general issue with how Snakemake detects files. Could you try running this code from within the demo/1OHG/ directory to see if it resolves the issue?

touch P49861.fasta &&
sleep .5 &&
touch P49861.pdb

mezarque · 2024-01-25T21:34:06Z

Separately from this issue, the experimentally-determined PDB provided appears to contain multiple chains, which unfortunately isn't supported yet by the pipeline. We're working to resolve the current issue, but I expect you'll run into a new error once the replacement issue is resolved.

naailkhan28 · 2024-01-25T21:36:02Z

I just did the experiment you suggested and that seemed to be the issue - when the FASTA is older than the PDB file then it works as expected and no make_pdb job is run.

However, if you swap those two touch calls (ie, the PDB is older than the FASTA) or you run them at the same time (with no sleep .5 command) then the make_pdb job is run. Looks like a tweak to the snakemake pipeline is needed!

naailkhan28 · 2024-01-25T21:39:12Z

Separately from this issue, the experimentally-determined PDB provided appears to contain multiple chains, which unfortunately isn't supported yet by the pipeline. We're working to resolve the current issue, but I expect you'll run into a new error once the replacement issue is resolved.

Before you implement a way of handling multiple chains, would it be helpful to handle multi-chain PDB files in a nicer way? Like maybe automatically taking chain A or throwing a warning or something.

mezarque · 2024-01-25T21:41:07Z

That's a helpful suggestion; we can explore whether an intermediate version like this would help.

keithchev · 2024-01-26T02:02:10Z

@naailkhan28 I think we've fixed this in #83, which should ensure that make_pdb is never called for PDB files that are in the input directory. Please let us know if this resolves this issue on your end!

naailkhan28 · 2024-01-26T13:11:56Z

Just tested the latest version, looks fixed! Great work guys :)

naailkhan28 added the bug Something isn't working label Jan 25, 2024

keithchev mentioned this issue Jan 25, 2024

Ensure user-provided PDBs are used for clustering #83

Merged

6 tasks

naailkhan28 closed this as completed Jan 30, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

`make_pdb` ESMFold jobs running even when input structure is provided #82

`make_pdb` ESMFold jobs running even when input structure is provided #82

naailkhan28 commented Jan 25, 2024

mezarque commented Jan 25, 2024

mezarque commented Jan 25, 2024

mezarque commented Jan 25, 2024

naailkhan28 commented Jan 25, 2024

naailkhan28 commented Jan 25, 2024

mezarque commented Jan 25, 2024

keithchev commented Jan 26, 2024

naailkhan28 commented Jan 26, 2024

make_pdb ESMFold jobs running even when input structure is provided #82

make_pdb ESMFold jobs running even when input structure is provided #82

Comments

naailkhan28 commented Jan 25, 2024

Description of the bug

Command used and terminal output

Relevant files

System information

mezarque commented Jan 25, 2024

mezarque commented Jan 25, 2024

mezarque commented Jan 25, 2024

naailkhan28 commented Jan 25, 2024

naailkhan28 commented Jan 25, 2024

mezarque commented Jan 25, 2024

keithchev commented Jan 26, 2024

naailkhan28 commented Jan 26, 2024

`make_pdb` ESMFold jobs running even when input structure is provided #82

`make_pdb` ESMFold jobs running even when input structure is provided #82