
Optimize memory usage for crawler #4

Open
unwriter opened this issue Apr 12, 2019 · 2 comments

Comments

@unwriter
Contributor

Optimize through something like streaming so it doesn't use a lot of memory during initial crawl

@MichalCz

Hmm... I'd be happy to help you guys. I helped OpenAQ with their fetch system, which does a similar task of crawling resources (different ones, but crawling is generally the same).

I see some synchronous iteration over files that will use a lot of memory and won't be very efficient. I mean code like this: merge.js#L28. I'm not entirely sure that's your actual problem (please point me in the right direction if I'm off the scent), but synchronous iteration like this usually operates over large arrays, and those take up a lot of space.

I maintain a framework (scramjet) that can handle this kind of iteration asynchronously while keeping a number of parallel operations in flight, and you can use it in a streamed manner. For OpenAQ it resulted in 50% less memory utilization and almost a 40% wall-clock speedup as well, so I think it's worth a try...
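The core idea, sketched without scramjet itself (this is not scramjet's API, just the bounded-parallelism pattern it provides): iterate with at most `limit` tasks in flight, consuming each result as it arrives instead of accumulating one big array.

```javascript
// Run fn over items with at most `limit` concurrent tasks. Results are
// consumed inside fn (e.g. written to a DB), never collected into an array.
async function forEachLimit(items, limit, fn) {
  let i = 0;
  // Each worker repeatedly claims the next index; i++ is synchronous, so
  // two workers can never claim the same item.
  async function worker() {
    while (i < items.length) {
      await fn(items[i++]);
    }
  }
  await Promise.all(
    Array.from({ length: Math.min(limit, items.length) }, worker)
  );
}
```

Hypothetical usage for a crawler (fetchTx and saveToDb are made-up names): `await forEachLimit(txids, 8, async txid => saveToDb(await fetchTx(txid)))`.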

@unwriter
Contributor Author

@MichalCz thanks!

The merge.js file is not a problem because it's just a simple initializer script that runs when the node first boots up, so no need to worry about that one.

Even the crawling part is done asynchronously in parallel (https://github.com/interplanaria/planaria/blob/master/bit.js#L118), so it's all good up to that point.

The real bottleneck happens right after that: https://github.com/interplanaria/planaria/blob/master/bit.js#L129

It aggregates all the results into an array. So when there are a lot of items, this single variable takes up a lot of memory and can crash the process. That's why I was thinking of approaches like streaming.

But with streaming there's an issue. Since that variable is essentially what gets passed into onblock as an event, and since onblock is supposed to be an atomic event, it's not trivial to just incorporate streaming into the picture. I need to think this through further.
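One possible direction (a sketch only, with all names hypothetical rather than the actual bit.js API): instead of materializing the whole results array and emitting it as the onblock event, hand onblock an async iterator and treat the event as committed only after the stream has been fully drained, which keeps only one item resident at a time while preserving an all-or-nothing commit.

```javascript
// Lazily fetch one transaction at a time instead of aggregating everything.
async function* crawlResults(txids, fetchTx) {
  for (const txid of txids) {
    yield await fetchTx(txid);
  }
}

// onblock consumes the stream incrementally; atomicity is preserved by
// committing exactly once, only after every item was handled without error.
async function onblock(results, handleItem, commit) {
  for await (const item of results) {
    await handleItem(item); // e.g. stage a write to the DB
  }
  await commit(); // single atomic commit at the end
}
```

Whether this fits depends on whether downstream consumers of onblock can tolerate receiving an iterator instead of a ready-made array.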
