Optimize memory usage for crawler #4
Optimize through something like streaming so it doesn't use a lot of memory during the initial crawl.

Comments
Hmm... I'd be happy to help you guys. I have helped OpenAQ with their fetch system, which does a similar task of crawling resources (different ones, but crawling is generally the same). I see some synchronous iteration over files which will use a lot of memory and won't be that efficient. I mean stuff like here: merge.js#L28. I'm not entirely sure if that is your actual problem (please point me in the right direction if I'm off the scent), but code like this with synchronous iteration usually works over large arrays, and those take up a lot of space. I maintain a framework (scramjet) that can handle this iteration asynchronously while keeping a number of parallel operations in flight, and you can use it in a streamed manner. For OpenAQ it resulted in 50% less memory utilization and almost 40% wall-clock speedup as well, so I think it's worth a try...
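For reference, a minimal sketch of the streamed, bounded-parallel iteration being described here, assuming scramjet's DataStream API; `fileList`, `processFile()`, and `saveResult()` are hypothetical stand-ins for whatever the merge.js-style code iterates over, and `maxParallel: 8` is an arbitrary choice:

```js
const { DataStream } = require("scramjet");

// fileList, processFile() and saveResult() are hypothetical stand-ins,
// not names from the planaria codebase.
DataStream.from(fileList)                      // items are pulled lazily, not held all at once
  .setOptions({ maxParallel: 8 })              // bounded concurrency with back-pressure
  .map(async (file) => processFile(file))      // async transform, up to 8 in flight
  .each(async (result) => saveResult(result))  // consume each result as it arrives
  .run();                                      // drain the stream; returns a Promise
```

Back-pressure is the point here: items are only pulled from the source as downstream capacity frees up, so the whole input never sits in memory at once.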
@MichalCz thanks! The merge.js file is not a problem because it's just a simple initializer script that runs when the node first boots up, so no need to worry about that one. Even the crawling part is done asynchronously in parallel (https://github.com/interplanaria/planaria/blob/master/bit.js#L118), so it's all good up to that point. The real bottleneck happens right after that: https://github.com/interplanaria/planaria/blob/master/bit.js#L129 aggregates all the results into an array. So when there are a lot of items, this single variable takes up a lot of memory and can crash, which is why I was thinking of approaches like streaming. But with streaming there's an issue: since that variable is essentially what gets passed into …
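To make the bottleneck concrete, here is a hedged sketch (not the actual planaria code) of replacing the single accumulator array with an async generator that yields results one at a time and flushes them to storage in bounded batches; `crawlBlock()`, `blockHeights`, and `saveBatch()` are hypothetical names, and the batch size is an assumption:

```js
// Hypothetical stand-ins for the crawl loop around bit.js#L129:
// crawlBlock(h) fetches one block's results, saveBatch(arr) persists a chunk.
async function* crawlAll(blockHeights) {
  for (const height of blockHeights) {
    yield await crawlBlock(height); // one result at a time; no giant accumulator
  }
}

async function crawlAndStore(blockHeights) {
  const BATCH_SIZE = 1000; // assumed flush threshold
  let batch = [];
  for await (const item of crawlAll(blockHeights)) {
    batch.push(item);
    if (batch.length >= BATCH_SIZE) {
      await saveBatch(batch); // memory stays bounded by BATCH_SIZE
      batch = [];
    }
  }
  if (batch.length > 0) await saveBatch(batch); // flush the remainder
}
```

The parallel fetch stage at bit.js#L118 could feed this consumer instead of the sequential loop shown, and each flushed batch could go to whatever currently receives the full array, so peak memory is bounded by the batch size rather than the total result count.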