Update pacer free document command to avoid high memory usage #4472
base: main
Conversation
quevon24
commented
Sep 17, 2024
- Remove @throttle_task from the get-PDFs process: scheduling the retry of the task introduced very long waits, mainly when many documents from the same court had to be processed and there were no documents from other courts to interleave with.
- Wait longer (3 seconds) before queuing up more items from the same court, i.e. wait longer when cycling the same court over and over again.
🔍 Existing Issues For Review: Your pull request is modifying functions with the following pre-existing issues: 📄 File: cl/corpus_importer/tasks.py
Alberto has done more of the bulk scraping stuff than I have recently, so I'd like to get his eyes here too. I think, architecturally, if I'm understanding this correctly, the idea is to stop queueing up everything all at once and hammering Celery, and to instead iterate over all the courts in a loop, doing each one every three seconds. Accurate?
When there are documents from multiple courts, we wait 1 second during each cycle, but if the remaining documents are all from the same court, only that court will be cycled, so we wait 3s to give it extra time.
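The sleep-selection rule described here can be sketched as a small pure function (the name `pick_sleep` and the list-of-court-IDs input are hypothetical illustrations, not the command's actual code):

```python
def pick_sleep(court_ids):
    """Return the inter-cycle sleep in seconds: 3s when every remaining
    document belongs to a single court, 1s when documents from several
    courts are still being interleaved. (Hypothetical helper.)"""
    return 3 if len(set(court_ids)) == 1 else 1
```

With this shape, the loop body just calls `time.sleep(pick_sleep(remaining_court_ids))` after queuing each task.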
This looks good. Just a comment regarding the sleep value used to wait between court cycles.
 )
-time.sleep(1)
+time.sleep(sleep)
As we talked about, we could improve the sleep value here based on the number of courts being cycled through, to ensure we don't surpass the scrape rate of 1 task per 4s per court that we previously had via the throttle_task decorator. We could consider the time it takes to process and download a document, then compute a dynamic value or threshold based on the number of courts being processed. This way, even when only a few courts remain in the list, we still maintain the 1/4s per-court rate.
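One way to read this suggestion: if N courts are cycled round-robin with a fixed sleep between tasks, each court is hit once every N × sleep seconds, so solving for sleep preserves the old 1-task-per-4s-per-court rate. A minimal sketch, assuming a round-robin loop (the function name and signature are assumptions, not the PR's code):

```python
def dynamic_sleep(num_courts: int, per_court_interval: float = 4.0) -> float:
    """With num_courts cycled round-robin, each court is scraped once every
    num_courts * sleep seconds; choose sleep so that the per-court interval
    equals the throttle we want to preserve. (Illustrative sketch only.)"""
    return per_court_interval / max(num_courts, 1)
```

For example, with 4 courts in rotation a 1s sleep already yields a 4s per-court interval, while a single remaining court needs the full 4s sleep.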
1/4s per court rate
Is that 0.25s per court or am I misunderstanding?
That's 1 task every 4 seconds per court, according to the get_task_wait docstring.
Right, duh, thank you. So if sleep is set to four seconds, we'd do each court at most every four seconds, right? But if we use some timing info, we can set this dynamically so that each loop takes exactly four seconds? Like, if downloads take 2s, then we set the sleep to 2s, and boom, 4s is achieved?
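The timing idea here amounts to sleeping only the remainder of the 4s budget after subtracting how long the download took; a hedged sketch (the helper name is hypothetical, not the PR's actual code):

```python
def remaining_sleep(elapsed: float, target: float = 4.0) -> float:
    """Sleep only what's left of the target loop interval: if the download
    took 2s of a 4s budget, sleep the remaining 2s; never sleep a negative
    amount if processing already overran the budget. (Hypothetical helper.)"""
    return max(target - elapsed, 0.0)
```

In the loop, `elapsed` would come from something like `time.monotonic()` deltas around the download call.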