Update pacer free document command to avoid high memory usage #4472
As we discussed, we could improve the sleep value here based on the number of courts being cycled through, to ensure we don't surpass the scrape rate of 1/4s per court that we previously had via the `throttle_task` decorator. We could consider the time it takes to process and download a document, then compute a dynamic value or threshold based on the number of courts being processed. This way, even when only a few courts remain in the list, we still maintain the 1/4s per court rate.
Is that 0.25s per court or am I misunderstanding?
That's 1 task every 4 seconds per court, according to the `get_task_wait` docstring.
Right, duh, thank you. Um, so if sleep is set to four seconds, we'd do each court at most every four seconds, right? But if we use some timing info, we can set this dynamically so that each loop takes exactly four seconds? Like, if downloads take 2s, then we set the sleep for 2s, and boom, 4s is achieved?
Yeah, that's right. I think Kevin already has some timing info we can use here. The other scenario we need to consider is when the number of courts with remaining documents to scrape is reduced.
In this case, with the current approach, we would schedule one task per court per second, which exceeds the rate of one task every 4 seconds per court. So the idea is to take the number of courts in the last cycle and the average time to process a document, and use them to compute the sleep time for that cycle, ensuring the rate for these courts never exceeds one task every 4 seconds.
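For illustration, here is a rough sketch of how that sleep value might be computed. It is not the code in this PR: the function name, its parameters, and the assumption that each task is processed synchronously before the loop sleeps are all hypothetical. Under that assumption, cycling round-robin through N courts revisits the same court after N × (task time + sleep), so the sleep has to cover whatever part of the 4-second per-court interval is left over.

```python
def compute_sleep_seconds(
    courts_remaining: int,
    avg_task_seconds: float,
    per_court_interval: float = 4.0,
) -> float:
    """Return how long to sleep after each task (hypothetical helper).

    Assumes a round-robin loop that handles one document per court,
    blocks for roughly `avg_task_seconds` while doing so, and then
    sleeps. The goal is that any single court is hit at most once
    every `per_court_interval` seconds.
    """
    if courts_remaining < 1:
        # Nothing left to scrape, so there is nothing to wait for.
        return 0.0
    # Time each loop iteration must take so that a full pass over
    # all remaining courts lasts at least `per_court_interval`.
    per_task_budget = per_court_interval / courts_remaining
    # Processing time already counts toward the budget; never go negative.
    return max(0.0, per_task_budget - avg_task_seconds)


# One court left and downloads take 2s: sleep 2s, so 4s per loop.
print(compute_sleep_seconds(courts_remaining=1, avg_task_seconds=2.0))
# Many courts: the pass itself already takes longer than 4s.
print(compute_sleep_seconds(courts_remaining=8, avg_task_seconds=2.0))
```

With one court left and 2s downloads this gives the 2s sleep from the example above, and with enough courts in the cycle the sleep drops to zero because the pass through the list is already slower than the per-court limit.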
Got it. Sounds great!