Repcred runs out of memory #37

Open
bcorrie opened this issue Apr 15, 2024 · 8 comments

bcorrie commented Apr 15, 2024

I ran this on a repertoire of 4M sequences and it ran out of memory. Would you expect this, given that Repcred was set to downsample? This ran on a compute node with 8 GB of memory, so the job must have used more than that for a significant amount of time.

This is where the output got to:

1/53
2/53 [global-options]
3/53
4/53 [input-parameters]
5/53
6/53 [unnamed-chunk-1]

See #35 for details of scalability/performance testing.

  • 4,000,000 sequences, Sample ID: p1974_d60, Failed
ssnn-airr (Contributor) commented:

Is this in ipa1? I am using this command and I only get 4579 sequences. I found the repertoire_id using the Gateway.

curl -k -s --data '{"filters":{"op":"=","content":{"field":"repertoire_id", "value":"60"}}, "format":"tsv"}' https://ipa1.ireceptor.org/airr/v1/rearrangement > p1974_d60.tsv


bcorrie commented Apr 16, 2024

Sorry, that is on ipa3.ireceptor.org. Unfortunately, on our old repositories our repertoire_id fields are not unique, so this type of confusion can happen.

$ curl -k -s --data '{"filters":{"op":"=","content":{"field":"repertoire_id", "value":"60"}},"facets":"repertoire_id"}' https://ipa3.ireceptor.org/airr/v1/rearrangement
{
    "Info": DELETED
    "Facet": [
        {
            "repertoire_id": "60",
            "count": 3992474
        }
    ]
}

Also, unfortunately, on the Gateway there is no easy way to see which of the IPAs this repertoire is on. If the repertoire_id was unique, then you could search them all and it would only show up on one of them. This is an issue we need to address...
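In the meantime, a rough workaround is to just ask every repository. Here is a minimal sketch, reusing the facets query above; the hostname list only includes the two repositories mentioned in this thread, so extend it to cover the rest of the IPAs:

# Sketch only: hostnames are the two mentioned in this issue; add the other
# IPA repositories as appropriate. Asks each one how many rearrangements it
# holds for repertoire_id 60.
for host in ipa1.ireceptor.org ipa3.ireceptor.org; do
  echo "== ${host} =="
  curl -k -s --data '{"filters":{"op":"=","content":{"field":"repertoire_id","value":"60"}},"facets":"repertoire_id"}' "https://${host}/airr/v1/rearrangement"
  echo
done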

ssnn-airr (Contributor) commented:

And I ran out of patience. It takes forever to run the CDR3_Chimera_Check chunk. I need to figure out where the issue is. I will keep you posted.


bcorrie commented May 1, 2024

When I am running these jobs, repcred is reporting that it is downsampling:

Warning message:
In normalizePath(opt$OUTDIR) :
  path[1]="ipa1.ireceptor.org/370/370_repcred_report": No such file or directory

Running repcred
|- Repertoire:
|  /scratch/ireceptorgw/gateway-clean/jobs/c1dd2cf7-25e8-4647-9090-ce0b8040beee-007/gateway_analysis/ipa1.ireceptor.org/370/370.tsv
|- Reference germline(s):
|  
|- Downsample:
|  TRUE
|- Output dir:
|  /scratch/ireceptorgw/gateway-clean/jobs/c1dd2cf7-25e8-4647-9090-ce0b8040beee-007/gateway_analysis/ipa1.ireceptor.org/370/370_repcred_report
|- Output format:
|  all 



processing file: _main.Rmd
Killed
slurmstepd: error: Detected 1 oom_kill event in StepId=30273871.batch. Some of the step tasks have been OOM Killed.

So it is either running out of memory while downsampling, or maybe one of the analysis steps isn't downsampling?

The last job reported this before it was killed for exceeding memory limits.

IR-INFO: Running Repcred on ipa1.ireceptor.org/370/370.tsv - Tue Apr 30 04:32:58 PM PDT 2024
1/63                          
2/63 [global-options]         
3/63                          
4/63 [input-parameters]       
5/63                          
6/63 [unnamed-chunk-1]        
IR-ERROR: Repcred failed on file ipa1.ireceptor.org/370/370.tsv
IR-INFO: Done running Repcred on ipa1.ireceptor.org/370/370.tsv - Tue Apr 30 04:36:28 PM PDT 2024
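To narrow down whether it is the downsampling itself or a later analysis chunk that blows up, one thing I can try is wrapping the run with GNU time to capture peak memory. Rough sketch below; the Rscript line is just a placeholder for however the Gateway actually invokes repcred:

# Placeholder invocation -- substitute the actual repcred command the Gateway runs.
/usr/bin/time -v Rscript repcred.R <repcred arguments> 2> repcred_mem.log
# Peak RSS (reported in kB) shows how much memory the R process actually needed.
grep "Maximum resident set size" repcred_mem.log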


bcorrie commented May 1, 2024

Hmm, this failed when running with 15 GB of memory, so maybe this is a bug of some kind. It seems odd that 2M sequences work fine in 8 GB but 4M fails with 15 GB. I am re-running with 30 GB to confirm.


bcorrie commented May 1, 2024

Looks like my job isn't getting the memory allocation I think it is... so ignore my comment about it failing with 15 GB. I still need to test.


bcorrie commented May 1, 2024

It looks like 4M sequences require about 12 GB of memory, which is why it failed at 8 GB. If I run with 30 GB it works fine, and one of the job summary tools reports over 11 GB of memory used.
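For reference, the working run requests the memory explicitly in the Slurm batch script. A minimal sketch (the job name, time limit, and repcred invocation are placeholders):

#!/bin/bash
#SBATCH --job-name=repcred_370     # placeholder job name
#SBATCH --mem=30G                  # explicit request; the 4M-sequence run peaked over 11 GB
#SBATCH --time=04:00:00            # placeholder time limit

# Placeholder -- substitute the actual repcred command the Gateway runs.
Rscript repcred.R <repcred arguments>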


bcorrie commented May 1, 2024

The largest repertoire in the ADC is 16M annotations, so this would presumably require a very large amount of memory if usage scales linearly, and based on my quick testing it does seem to be close to linear.
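A quick back-of-the-envelope under that linear-scaling assumption, using the ~12 GB for 4M sequences figure from my runs:

# Assumes memory scales roughly linearly with sequence count (approximation from the runs above).
awk 'BEGIN { gb_per_seq = 12 / 4e6; printf "16M sequences ~= %.0f GB\n", gb_per_seq * 16e6 }'
# -> 16M sequences ~= 48 GB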
