Speeding up comparisons for *very* large jobs on HPC #297
widdowquinn started this conversation in Ideas
-
I don't follow this. Does it have to be a list? We only care about order in terms of dependencies, which are not built into the joblist itself (unless I'm misreading the code). Sets should be faster, and will implicitly prevent any duplicates. This may not be the only optimisation worth making, but it might help.
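A minimal sketch of the idea, assuming the jobs can be represented as hashable objects; `make_job` and the genome names below are placeholders, not the actual code:

```python
from itertools import permutations

# Illustrative only: placeholders for however comparison jobs are really built.
genomes = [f"genome_{i:04d}" for i in range(50)]

def make_job(query, subject):
    # Represent a job as a hashable tuple so it can live in a set.
    return (query, subject)

# List version: appending is cheap, but checking "is this job already queued?"
# is an O(n) scan each time, which becomes quadratic over the whole run.
joblist = []
for q, s in permutations(genomes, 2):
    job = make_job(q, s)
    if job not in joblist:  # linear scan on every insertion
        joblist.append(job)

# Set version: membership tests and inserts are O(1) on average, and
# duplicates are dropped implicitly.
jobset = {make_job(q, s) for q, s in permutations(genomes, 2)}
```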
-
This has a proposed fix around garbage collection in #306
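For illustration, a common pattern for this kind of fix (not necessarily what #306 actually does) is to suspend the cyclic garbage collector while bulk-creating the job objects:

```python
import gc

def build_jobs(pairs, make_job):
    """Bulk-create job objects with the cyclic garbage collector suspended.

    Sketch of a generic pattern only, not necessarily the change in #306.
    Creating millions of small objects can trigger many full collection
    passes; disabling collection for the duration and running a single pass
    at the end often speeds up bulk construction noticeably.
    """
    gc.disable()
    try:
        return [make_job(q, s) for q, s in pairs]
    finally:
        gc.enable()
        gc.collect()
```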
-
Even the task of compiling command lines to run can be slow with large enough inputs. Currently, the process of compiling command lines is serial. We might get some speed-up if we used a different approach (a rough sketch follows at the end of this comment). I note:
I think this gets us two speed-ups:
I'm currently hitting this issue with a 2.5k genome job on a SLURM cluster - just generating the job list currently takes hours.
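One possible shape for that different approach is to farm command-line generation out to a process pool. This is a sketch only; the function names and the command template below are placeholders, not the tool's actual interface:

```python
from concurrent.futures import ProcessPoolExecutor
from itertools import permutations

def compile_command(pair):
    # Placeholder: stands in for whatever builds one comparison command line.
    query, subject = pair
    return f"nucmer --prefix={query}_vs_{subject} {query}.fna {subject}.fna"

def compile_all(genomes, workers=8):
    """Compile all pairwise command lines in parallel (illustrative sketch)."""
    pairs = list(permutations(genomes, 2))
    # A generous chunksize keeps inter-process overhead low when there are
    # millions of small tasks.
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(compile_command, pairs, chunksize=10_000))

if __name__ == "__main__":
    cmds = compile_all([f"genome_{i:04d}" for i in range(100)], workers=4)
    print(len(cmds), "command lines generated")
```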