
Optimize memory usage for crawler #4

Open
unwriter opened this issue Apr 12, 2019 · 2 comments

Comments

@unwriter
Contributor

Optimize through something like streaming so it doesn't use a lot of memory during initial crawl

@MichalCz

Hmm... I'd be happy to help you guys. I helped OpenAQ with their fetch system, which does a similar task of crawling resources (different ones, but crawling is generally the same).

I see some synchronous iteration over files that will use a lot of memory and won't be very efficient. I mean code like this: merge.js#L28. I'm not entirely sure that's your actual problem (please point me in the right direction if I'm off the scent), but synchronous iteration like this usually operates over large arrays, and those take up a lot of space.

I maintain a framework (scramjet) that can handle this kind of iteration asynchronously while keeping a number of parallel operations in flight, and you can use it in a streamed manner. For OpenAQ it resulted in 50% less memory utilization and almost a 40% wall-clock speedup as well, so I think it's worth a try...
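The core idea, sketched without scramjet itself (this is not scramjet's API, just the bounded-parallelism pattern it provides): iterate with at most `limit` tasks in flight, consuming each result as it arrives instead of accumulating one big array.

```javascript
// Run fn over items with at most `limit` concurrent tasks. Results are
// consumed inside fn (e.g. written to a DB), never collected into an array.
async function forEachLimit(items, limit, fn) {
  let i = 0;
  // Each worker repeatedly claims the next index; i++ is synchronous, so
  // two workers can never claim the same item.
  async function worker() {
    while (i < items.length) {
      await fn(items[i++]);
    }
  }
  await Promise.all(
    Array.from({ length: Math.min(limit, items.length) }, worker)
  );
}
```

Hypothetical usage for a crawler (fetchTx and saveToDb are made-up names): `await forEachLimit(txids, 8, async txid => saveToDb(await fetchTx(txid)))`.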

@unwriter
Contributor Author

@MichalCz thanks!

The merge.js file is not a problem because it's just a simple initializer script that runs when the node first boots up, so no need to worry about that one.

Even the crawling part is done asynchronously in parallel (https://github.com/interplanaria/planaria/blob/master/bit.js#L118), so it's all good up to that point.

The real bottleneck happens right after that: https://github.com/interplanaria/planaria/blob/master/bit.js#L129

It aggregates all the results into an array. So when there are a lot of items, this single variable takes up a lot of memory and can crash the process. That's why I was thinking of approaches like streaming.

But with streaming there's an issue. Since that variable is essentially what gets passed into onblock as an event, and since onblock is supposed to be an atomic event, it's not trivial to just incorporate streaming into the picture. I need to think this through further.
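One possible direction (a sketch only, with all names hypothetical rather than the actual bit.js API): instead of materializing the whole results array and emitting it as the onblock event, hand onblock an async iterator and treat the event as committed only after the stream has been fully drained, which keeps only one item resident at a time while preserving an all-or-nothing commit.

```javascript
// Lazily fetch one transaction at a time instead of aggregating everything.
async function* crawlResults(txids, fetchTx) {
  for (const txid of txids) {
    yield await fetchTx(txid);
  }
}

// onblock consumes the stream incrementally; atomicity is preserved by
// committing exactly once, only after every item was handled without error.
async function onblock(results, handleItem, commit) {
  for await (const item of results) {
    await handleItem(item); // e.g. stage a write to the DB
  }
  await commit(); // single atomic commit at the end
}
```

Whether this fits depends on whether downstream consumers of onblock can tolerate receiving an iterator instead of a ready-made array.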
