Redis memory spike from task throttling and queue buildup #4455

Open
quevon24 opened this issue Sep 13, 2024 · 1 comment · May be fixed by #4472
Comments

@quevon24
Member

quevon24 commented Sep 13, 2024

This morning, we encountered a Redis memory usage issue that was escalating rapidly. The problem arose because all the free PACER documents we had been collecting with the full sweep started to queue.

Some tasks began to throttle due to the @throttle_task decorator on the process_free_opinion_result function, leading to extremely long wait times (~300,000 seconds and still increasing) and causing tasks to pile up in the unacked queue.
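For context, here is a minimal sketch of how a rate-limiting decorator that reschedules over-limit tasks can produce countdowns this large. The names and the countdown formula are assumptions for illustration, not the actual @throttle_task implementation in our codebase:

```python
# Illustrative sketch only; not the real @throttle_task code. It assumes a
# bound Celery task and a simple Redis counter per (task, key) window.
import functools

import redis

r = redis.Redis()


def throttle_task(rate: str, key: str | None = None):
    """Allow n executions per m seconds for a given key; otherwise retry."""
    n, m = (int(x) for x in rate.split("/"))

    def decorator(task_func):
        @functools.wraps(task_func)
        def wrapper(self, *args, **kwargs):
            throttle_key = f"throttle:{task_func.__name__}:{kwargs.get(key, '')}"
            count = r.incr(throttle_key)
            if count <= n:
                r.expire(throttle_key, m)
                return task_func(self, *args, **kwargs)
            # Each over-limit task reschedules itself one window further out,
            # so a backlog of ~200,000 tasks on one key yields countdowns on
            # the order of the ~300,000 s observed here, and the rescheduled
            # messages sit in Redis the whole time (the unacked buildup above).
            raise self.retry(countdown=(count - n) * m)

        return wrapper

    return decorator
```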

This likely happened due to a court blockage that triggered multiple retry attempts as we tried to collect a large volume of documents.

For safety, we halted the PACER free documents full sweep command, assuming it was the cause of the issue. However, the sweep only generates the report—it does not download anything from PACER or create new database entries.

We later realized that the issue began yesterday when the command started sending a large number of Celery tasks to ingest the documents we had already collected from the PACER free documents report. Yesterday, it sent ~80,000 documents (tasks), and today, it sent ~200,000 documents (tasks).

The cause was an outdated cron job, which, for unknown reasons, started running the old command and using an outdated image. The cron job had been updated and correctly configured since August 30th, and no issues were reported until yesterday.

This cron job problem meant the start and end dates (today and today minus 10 days) were not passed, which affected the downloading of documents. It didn't impact generation of the document report, because in the absence of a date range the report runs based on the last successful sweep of each court, but it did affect the download step by queuing every document in the PACERFreeDocumentRow table.
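To make the difference concrete, here is a hypothetical sketch of the download step's row selection; the function, argument names, and the date_filed field are assumptions for illustration, not the actual command code:

```python
# Hypothetical sketch: with a date range, only recent rows are queued;
# without one, the whole table is. Field names are assumptions.
from datetime import date, timedelta


def rows_to_queue(PACERFreeDocumentRow, start=None, end=None):
    qs = PACERFreeDocumentRow.objects.all()
    if start and end:
        # Intended cron behavior: roughly the last 10 days of rows.
        return qs.filter(date_filed__range=(start, end))
    # With no dates passed, every pending row in the table gets queued.
    return qs


# The cron was meant to pass something like:
end = date.today()
start = end - timedelta(days=10)
```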

The issue with the long throttling times is similar to what is mentioned here: #4077 (comment).

TL;DR: throttle.maybe_wait() and @throttle_task don’t work well together.
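To make the TL;DR concrete, here is a rough sketch of the producer-side half of the pair; the class and its internals are assumptions, not the actual CeleryThrottle code (see #4077 for the real discussion):

```python
# Hypothetical producer-side throttle; illustrative only.
import time

import redis

r = redis.Redis()


class QueueThrottle:
    def __init__(self, queue_name: str, min_items: int = 100) -> None:
        self.queue_name = queue_name
        self.min_items = min_items

    def maybe_wait(self) -> None:
        # Block the enqueueing command until the broker queue drains a bit.
        while r.llen(self.queue_name) > self.min_items:
            time.sleep(1)


# One plausible conflict (an assumption): messages that @throttle_task
# reschedules with a countdown are no longer in the plain queue list this
# check inspects, so the producer sees a short queue and keeps enqueueing
# while the worker-side countdowns keep growing.
```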

It was likely a temporary blockage, and Ramiro has now reconfigured the cron job correctly. We are going to disable it for the moment while we fix the throttle issue, and later we can run a sweep to cover the days the cron was disabled.

Moving forward, we should consider removing @throttle_task to avoid these long wait times when retrying tasks. Instead, we could use the CycleChecker to detect when we have finished cycling through all the courts, and when we are down to a single court, increase the per-document delay from 1 second to 4 seconds (i.e., the equivalent of a 1/4 rate in the decorator, one task every four seconds). A rough sketch of that approach is below.
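A minimal sketch of the idea, assuming a simplified CycleChecker; the real class in our codebase may have a different API, and the enqueue wiring here is hypothetical:

```python
# Illustrative only; pace enqueueing in the command itself instead of
# relying on @throttle_task retries.
import time


class CycleChecker:
    """Track distinct court IDs and report when a new cycle starts."""

    def __init__(self) -> None:
        self.seen: set[str] = set()
        self.courts_in_last_cycle = 0

    def check_if_cycled(self, court_id: str) -> bool:
        if court_id in self.seen:
            # A repeat means we've been through every remaining court once;
            # remember how many distinct courts that cycle contained.
            self.courts_in_last_cycle = len(self.seen)
            self.seen = {court_id}
            return True
        self.seen.add(court_id)
        return False


def enqueue_rows(rows, enqueue) -> None:
    """Queue one task per row, pacing by court.

    `rows` are assumed ordered so courts interleave round-robin; `enqueue`
    would be something like process_free_opinion_result.delay in the real
    command (hypothetical wiring).
    """
    checker = CycleChecker()
    for row in rows:
        checker.check_if_cycled(row.court_id)
        enqueue(row.pk)
        # Once only one court is left in the cycle, slow to ~1 document per
        # 4 seconds (the 1/4 rate mentioned above) instead of 1 per second.
        time.sleep(4.0 if checker.courts_in_last_cycle == 1 else 1.0)
```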

@mlissner
Member

Thanks for the analysis and write up!

A couple things:

  1. I forget if we automatically update all our cron jobs in k8s when we deploy new images. I think we do, but it's worth double-checking the GitHub Action to make sure. That could be the cause of the misconfigured pod, I suppose.

  2. Let's give the CycleChecker a try and see if it can fix this. Thank you!
