
Memory consumption for extremely large search spaces #97

Open
grosenberger opened this issue Oct 26, 2023 · 8 comments

Comments

@grosenberger

Hi Michael,

When using Sage with very large search spaces (e.g. many PTMs, non-specific digestion, etc.), the memory consumption frequently exceeds the resources available on standard workstations. In such scenarios, MSFragger partitions the search space and processes the partitions iteratively.

I was wondering whether similar functionality could be implemented in Sage. For example, a "batch size" parameter could be set manually (or estimated from the available memory) to partition the search space. There are several options for how this could be implemented; one would be to partition candidate peptide precursors by precursor m/z and score each partition separately. For DIA, the partitions could correspond to the precursor isolation windows; for DDA, it might make sense to simply pick ranges according to the batch size. The main search algorithm would then iterate over the partitions for scoring, and the individual partition results would be assembled before ML and statistical validation. The search space could be generated per partition on the fly and kept in memory, or alternatively exported to disk (similar to how MSFragger does it).
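
To illustrate, a rough sketch of what m/z-based partitioning could look like (purely illustrative; the window width and the (sequence, precursor m/z) peptide representation are my assumptions, not anything Sage exposes today):

def mz_partitions(min_mz, max_mz, n_batches):
    # Split the precursor m/z range into contiguous, equally wide windows.
    width = (max_mz - min_mz) / n_batches
    return [(min_mz + i * width, min_mz + (i + 1) * width) for i in range(n_batches)]

def peptides_in_window(peptides, window):
    # 'peptides' is assumed to be an iterable of (sequence, precursor_mz) pairs
    # generated on the fly for the current partition.
    lo, hi = window
    return [(seq, mz) for seq, mz in peptides if lo <= mz < hi]

# Example: ten windows covering 400-1600 m/z; each window would be scored
# independently and the results pooled before ML and statistical validation.
windows = mz_partitions(400.0, 1600.0, 10)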

What do you think about these options? Is there a preferred solution?

Best regards,
George

@lazear
Owner

lazear commented Oct 26, 2023

Hi George,

I agree that it's a necessity for large search spaces. I have been experimenting with some internal database splitting, but it's not ready for prime time yet.

In the meantime, it's possible to perform external database splitting: generate slices of the FASTA file, run Sage multiple times, then combine the results and rescore. Perhaps not ideal, but this is essentially what internal database splitting would do as well. See below for an example Python script that accomplishes this.

import subprocess

import pandas as pd
from Bio import SeqIO

SLICES = 5

# Load all FASTA records so they can be split into slices.
records = list(SeqIO.parse("fasta/human_contaminant.fasta", format="fasta"))

N = len(records) // SLICES
for i in range(SLICES):
    # The last slice takes any remainder so that no records are dropped.
    stop = (i + 1) * N if i < SLICES - 1 else len(records)
    with open(f"fasta/human_slice_{i}.fasta", "w") as f:
        SeqIO.write(records[i * N : stop], f, format="fasta")

    # Search this slice only, writing a Percolator pin file per slice.
    cmd = [
        "sage",
        "search.json",
        "-o",
        f"semi_{i}",
        "-f",
        f"fasta/human_slice_{i}.fasta",
        "--write-pin",
        "HeLa_chytry_HCD_1.mzML.gz",
    ]
    subprocess.run(cmd, check=True)

# Combine the per-slice pin files, keeping only the best-scoring PSM
# per spectrum across all slices.
dfs = [pd.read_csv(f"semi_{i}/results.sage.pin", sep="\t") for i in range(SLICES)]

pd.concat(dfs).sort_values(by="ln(hyperscore)", ascending=False).drop_duplicates(
    subset=["FileName", "ScanNr"], keep="first"
).to_csv("sliced.pin", sep="\t", index=False)  # index=False keeps the pin columns intact

@patrick-willems

patrick-willems commented Nov 2, 2023

Hey, just a question related to this issue. Could it be that, by sorting on the hyperscore and only retaining the best match, you might lose hits (and this also isn't compatible with chimeric searching)? Would a valid alternative be to split the work into consecutive searches over precursor m/z ranges, each against the whole FASTA rather than splitting the FASTA? I once tried it (by generating alternative JSONs in a loop), but the memory consumption did not decrease.

@grosenberger
Author

Thanks for the feedback! We have used similar workarounds before; FragPipe also relies on similar mechanisms for very large databases.

@lazear
Owner

lazear commented Nov 2, 2023

Hey, just a question related to this issue. Could it be that, by sorting on the hyperscore and only retaining the best match, you might lose hits (and this also isn't compatible with chimeric searching)? Would a valid alternative be to split the work into consecutive searches over precursor m/z ranges, each against the whole FASTA rather than splitting the FASTA? I once tried it (by generating alternative JSONs in a loop), but the memory consumption did not decrease.

Interesting that this didn't decrease memory consumption - setting peptide_min_mass and peptide_max_mass will restrict the number of final peptides kept and the fragments generated (the filter is applied after digestion).
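
Roughly something like this, untested - assuming peptide_min_mass / peptide_max_mass sit under the database section of search.json (check your config version), and reusing the CLI flags from the script above:

import json
import subprocess

# Hypothetical mass windows; tune the boundaries to the actual search space.
MASS_RANGES = [(500.0, 1500.0), (1500.0, 2500.0), (2500.0, 5000.0)]

with open("search.json") as f:
    base = json.load(f)

for i, (lo, hi) in enumerate(MASS_RANGES):
    cfg = dict(base)
    cfg["database"] = dict(base.get("database", {}))
    # Restrict the peptides kept after digestion to this mass window.
    cfg["database"]["peptide_min_mass"] = lo
    cfg["database"]["peptide_max_mass"] = hi

    cfg_path = f"search_mass_{i}.json"
    with open(cfg_path, "w") as f:
        json.dump(cfg, f, indent=2)

    # Same FASTA every time; only the retained peptide mass range changes.
    subprocess.run(
        ["sage", cfg_path, "-o", f"mass_{i}", "--write-pin", "HeLa_chytry_HCD_1.mzML.gz"],
        check=True,
    )

The per-range pin files could then be concatenated and deduplicated the same way as in the FASTA-slicing script I posted above.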

That is a valid point about chimeric searches, but those are already somewhat heuristic (a subtractive method versus something potentially smarter). One alternative would be to pre-digest the FASTA database (and pass "$" as the cleavage enzyme to Sage), and then chunk the resulting peptide-level FASTA by peptide mass. That should help improve chimeric searches and possibly speed things up as well - this is basically what would be implemented internally.
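
A rough, untested sketch of that pre-digestion step (the tryptic rule, mass bins, and file names are placeholders for illustration); each resulting peptide-level FASTA would then be searched with the cleavage enzyme set to "$" as described above, so the entries are used as-is:

import re
from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

# Monoisotopic residue masses plus water for the neutral peptide mass.
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406, "N": 114.04293,
    "D": 115.02694, "Q": 128.05858, "K": 128.09496, "E": 129.04259, "M": 131.04049,
    "H": 137.05891, "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565

def tryptic(seq):
    # Cleave C-terminal to K/R, not before P (no missed cleavages in this sketch).
    return [p for p in re.split(r"(?<=[KR])(?!P)", seq) if p]

def peptide_mass(pep):
    return sum(MONO.get(aa, 0.0) for aa in pep) + WATER

# Hypothetical mass bins; each bin becomes its own peptide-level FASTA chunk.
BINS = [(0.0, 1500.0), (1500.0, 2500.0), (2500.0, 6000.0)]
chunks = [[] for _ in BINS]

for record in SeqIO.parse("fasta/human_contaminant.fasta", format="fasta"):
    for j, pep in enumerate(tryptic(str(record.seq))):
        m = peptide_mass(pep)
        for k, (lo, hi) in enumerate(BINS):
            if lo <= m < hi:
                chunks[k].append(SeqRecord(Seq(pep), id=f"{record.id}_{j}", description=""))
                break

for k, peps in enumerate(chunks):
    SeqIO.write(peps, f"fasta/peptides_mass_{k}.fasta", "fasta")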

@lgatto
Contributor

lgatto commented Jul 18, 2024

Will splitting the FASTA file not lead to inconsistencies when defining the protein groups? Groups will only be defined within a FASTA slice/search, so protein A from slice 1 and protein B from slice 2 that share peptides won't get grouped together. Or will the groups somehow be updated after concatenation?

@RalfG
Contributor

RalfG commented Aug 9, 2024

Hi @lazear,

I was wondering if anyone has started working on implementing database splitting into Sage itself. If not, we might take a look at it.

Best,
Ralf

@sander-willems-bruker
Contributor

Draft implementation which works reasonably well: #154

@sander-willems-bruker
Contributor

Will splitting the FASTA file not lead to inconsistencies when defining the protein groups? Groups will only be defined within a FASTA slice/search, so protein A from slice 1 and protein B from slice 2 that share peptides won't get grouped together. Or will the groups somehow be updated after concatenation?

After iterating over the chunks, a final database is created from the found peptides as if it were a normal digest. As such, peptides from proteins in individual chunks are correctly merged.
