Memory Leak (XML parsing, Templates, libzim ?) #243
This issue has been automatically marked as stale because it has not had recent activity. It will now be reviewed manually. Thank you for your contributions. |
@rgaudin @mgautierfr Maybe this new open-source tool would help us understand what is going on? https://github.com/bloomberg/memray
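For reference, memray can be pointed at a single suspect phase from Python itself. A minimal sketch, assuming nothing about the sotoki code base (`process_questions` below is a hypothetical stand-in for the phase under suspicion; only the `memray.Tracker` usage comes from memray's own documentation):

```python
# Sketch: wrap one suspect phase in memray's Tracker and inspect the capture
# afterwards with `memray flamegraph`. `process_questions` is a hypothetical
# stand-in for the real work, not sotoki code.
from memray import Tracker

def process_questions():
    # placeholder for the real work (XML parsing, templating, ZIM writing)
    return sum(len("x" * 1024) for _ in range(10_000))

if __name__ == "__main__":
    with Tracker("questions.bin"):
        process_questions()
    # then, from a shell: memray flamegraph questions.bin
```
|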
Note: a bounty can now be claimed for solving this issue. https://app.bountysource.com/issues/109534260-leaking-memory |
I guess you already made sure that no process is hanging, too? PS: |
@TheCrazyT those processes happen early on in the whole scraping process and complete before we start the intensive work. Now that we're expecting outsiders to take a look at this, I'll rewrite the issue in the coming day(s) with as much detail as possible. Yes, https://github.com/openzim/python-scraperlib is a possible culprit. |
I'm currently looking at: If your answer is yes, then I'm wondering if it could be a python internal problem. |
Here's a more complete description for those not familiar with sotoki and this issue.
Here's a graph that shows how it looks over time (this was an OOM'd run with ~60GB of RAM). The huge spike at the beginning is the preparation step. Then you can see sotoki processing the Users (storing info in redis and creating user pages in the ZIM), followed by a drop in usage: this is when we release some of the info that we won't need anymore (see

The latest stackoverflow run, using the 2022-03 dumps, used approx. 125GB of RAM (the run wasn't monitored).

What the RAM consumption shows is that the entirety of the content that we read/write (which of the two is unclear, because we both transform and compress while making the ZIM) goes into RAM and stays there. Although Python doesn't release memory back to the system, this memory is not available anymore, as we hit OOM on systems without enough memory. It seems (no hard evidence, but the match of XML size vs RAM usage) that the size of images (fetched online and added to the ZIM) is not impacting RAM consumption, which may indicate that the leak doesn't affect binary item addition in scraperlib/libzim.
|
Talking about the sax parser ...
I tested it in Google Colab and the info about the
(same info as here: https://github.com/python/cpython/blob/main/Lib/xml/sax/xmlreader.py#L144 ) The default parser I also didn't find anything that executes the |
It is still possible that the xml library is leaking. https://gist.github.com/TheCrazyT/376bb0aac2d283f9ba505b889ef5da51

Bad thing is ... I currently can't really figure out what the problem is, but it seems to be caused by the HTML entities in combination with long text inside the body attribute of the row tag. Probably will try to create a better proof of concept with a memory profiler ... just to be sure.

Edit:
Edit again:
Edit 04.07.2022:
Edit 06.07.2022:
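A minimal, self-contained reproduction attempt along those lines (a sketch only: the XML shape imitates the dumps' `<row Body="...">` format, and tracemalloc only sees Python-level allocations, so a leak inside expat itself would need valgrind or a native-mode profiler instead):

```python
# Sketch: feed xml.sax rows whose Body attribute is long and full of HTML
# entities, and watch Python-level allocations across repeated passes.
import tracemalloc
import xml.sax

class RowHandler(xml.sax.ContentHandler):
    def startElement(self, name, attrs):
        if name == "row":
            _ = attrs.get("Body", "")  # read the attribute and drop it

def make_dump(rows=1000):
    body = "&lt;p&gt;some long text &amp;amp; entities " * 200
    rows_xml = "".join(f'<row Id="{i}" Body="{body}"/>' for i in range(rows))
    return f"<posts>{rows_xml}</posts>"

dump = make_dump().encode()
tracemalloc.start()
for i in range(5):
    xml.sax.parseString(dump, RowHandler())
    current, peak = tracemalloc.get_traced_memory()
    print(f"pass {i}: current={current / 1e6:.1f}MB peak={peak / 1e6:.1f}MB")
```
|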
Guess I should run a comparative mem-profile with Python 3.10, because they seem to have changed something about garbage collection in this commit: Hm, weird, I just noticed that the tag on the commit is v3.11.0b3.

Edit:
Update 09.07.2022: Guess the biggest problem will be to add the patch to the Dockerfile.
Update 11.07.2022: |
Sorry if this is too much to ask, but I would like to take a look, though Jupyter notebooks aren't quite helpful for me. Can you please write a small Python script, with a provided XML file? (I recommend Wikipedia's XML dumps; they're beefy enough to illustrate trends like this.) Thank you! |
@Retkid sorry, besides fishing for strings in the core dump and profiling the whole of sotoki, I have no simple script that profiles the memory. I'm currently looking at the output of "log-malloc2" to figure out if there is something weird (something allocated, but not freed). Currently it leads me to "lxml", inside "xmlBufGrow", but that could also be a red herring. |
Thanks for the updates. Missed those edits above as they don't trigger a notification. FYI, a nopic (without any pictures) run of stackoverflow is about to end. This code includes the

It's important to note that the RAM usage is equivalent to the version with pictures, confirming that images are not involved, which incidentally can exclude libzim itself (for binary entries). |
Ok ... here is another thing I noticed. But here is the catch, the
sotoki/src/sotoki/utils/database.py, line 496 in 692f59b

A commit will happen if nb_seen is divisible by 1000:
sotoki/src/sotoki/utils/database.py, line 82 in 692f59b

But since you increase it by: So my guess is that the command_stack just increases and increases, because a commit barely ever happens. To be honest, I would simply remove the
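To illustrate the suspicion in isolation (a standalone sketch with hypothetical names, not the actual database.py code): once the counter is bumped by more than 1 per record, a `% 1000 == 0` check is jumped over most of the time, so commits become rare and the pending command stack keeps growing.

```python
# Standalone illustration: a modulo-based commit check is skipped most of the
# time once the counter moves in steps larger than 1.
import random

commit_every = 1000
nb_seen = 0
commits = 0

for _ in range(100_000):
    nb_seen += random.randint(2, 6)   # e.g. several tags recorded at once
    if nb_seen % commit_every == 0:   # the divisibility check discussed above
        commits += 1

print(f"nb_seen={nb_seen}, commits={commits}")
# With +1 steps we would expect roughly nb_seen / 1000 commits; with larger
# steps the counter usually jumps over the exact multiples of 1000, so the
# pending command stack keeps growing between (rare) commits.
```
|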
Good point ; I'd go with |
Please submit a PR once you have a working version ; thanks 🙏 |
I would go with a new last_commit attribute:

    if self.nb_seen > self.last_commit + self.commit_every:
        do_commit(...)
        self.last_commit += self.commit_every

This way we would not reset
This is indeed a limitation of your theory (even if this is a good catch). On such a long run, we should see a few drops from time to time. |
Well, if I saw it correctly, then every thread has its own Maybe it's just not noticeable if one thread is lucky and hits a commit.

Update: Oh btw, the reason I did not create a pull request yet is that I was just logging the "commit". For some reason I still only get the output at the end of the "Questions" progress, and that is driving me slightly insane atm.

Update again: Just now noticed that the actual recording phase is within "Questions_Meta" (within But that is near the beginning. And this means ... I probably still didn't find the main memory-leak problem, I guess. |
@parvit For 1, it does not work like this. If you find any proof that memory is not freed regularly (at the end of each cluster compression), then please point it out. |
@kelson42 I could not tell you if the memory is really being flushed at every cluster, but my run still had 1/4 of the total memory allocated at the end (so maybe there's an actual leak there?). You can find this in the drive folder posted earlier, in the file original.html in results.zip.

It might be possible that in a longer run than mine the usage from compression is bounded as you say, but we should then do a run with compression only and see if it aligns with the run where both are disabled.

EDIT: For context, the half on the left is the memory directly associated with Python code, while on the right are the blocks of memory associated with the native code of libzim. |
Thanks all for the investigations. Your remark @parvit is interesting. I've taken a fresh look at the xapian indexing and it seems we were creating a xapian transaction preventing xapian from flushing the data to the fs. I've made the PR openzim/libzim#719 to fix that. @rgaudin it would be nice to test again with this fix. However, we have tested the sotoki scraper with a fake libzim (creating nothing) and we found that memory consumption was still increasing. So even if xapian is taking (too much) memory, there is probably another leak somewhere on the Python side. |
The data from my test did not show any clear leak on the Python side, but if you could tell me how you set up the fake libzim, I may continue the investigation and hopefully find more on that. Also, my suggestion included the compression change as well; did you consider that one in your test?
|
The fakeCreator is here: https://github.com/openzim/sotoki/blob/v0.0.0.1/src/sotoki/utils/shared.py#L20
I had a look at the compression side but haven't found anything obvious.
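For anyone who can't check out that branch, the idea is roughly the following (an illustrative sketch only; method names are guesses at the creator surface sotoki uses, not the actual code behind the link):

```python
# Illustrative no-op creator: same call surface, but nothing is handed to
# libzim, so the scraper can run end to end with libzim out of the picture.
# Method names are illustrative; the real stand-in is in shared.py above.
class FakeCreator:
    def config_indexing(self, *args, **kwargs):
        return self  # keep the chained/fluent style of a real creator

    def start(self):
        return self

    def add_item_for(self, path, title=None, content=None, **kwargs):
        # drop the content instead of compressing/writing it
        return None

    def finish(self):
        return None
```
|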
Here's the |
@rgaudin |
Here's the noindex version, confirming that indexing has no visible impact on RAM usage but compression is memory intensive. @mgautierfr can you briefly explain how it works? What should it be compressing, and how, so we can assess whether this is just normal behavior that we incorrectly didn't expect, or if there's a problem. |
This is probably not expected. First, it is normal that libzim memory usage increases as items are added. This is what I was explaining in openzim/python-libzim#117. I suspect that the slow slope increase at the beginning and the end is due to this normal memory growth. But the big slope in the middle is surprising. How compression is done:
For compressing, we create a compressStream just before a cluster compression and free it just after. We don't reuse the context. It could be interesting to have the log of the creator: it gives some (timed) information about what has been processed, so we could see what is being processed during the middle slope. It may also be interesting to create a ZIM with lzma compression; if the memory usage is not the same, it may be related to zstd compression itself.
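In Python terms, the pattern described above is roughly the following (a sketch using the `zstandard` package purely as a stand-in for libzim's internal C++ zstd usage; nothing here is actual libzim code):

```python
# Rough Python analogue of "one fresh compression context per cluster":
# a new compressor is created per cluster and dropped right after, so nothing
# should accumulate across clusters if zstd really releases its buffers.
import zstandard

def compress_cluster(blobs, level=19):
    # fresh context per cluster, never reused; once it goes out of scope,
    # any memory still attributed to it is what this thread is questioning
    cctx = zstandard.ZstdCompressor(level=level)
    return cctx.compress(b"".join(blobs))

clusters = [[b"some item content %d " % i * 50 for i in range(40)] for _ in range(10)]
print([len(compress_cluster(cluster)) for cluster in clusters])
```
|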
Your comment indicates then that this memory block at the end of my run is surely leaking: the free is not actually freeing the context as expected (but is probably keeping the memory internally for reuse). |
|
@mgautierfr The frame at the bottom, "ZSTD_resetCCtx_internal", is the one that still holds a reference to memory at the end of the program run, and that call accounted for nearly 1/3 of the memory of the run (more precise data is contained in the files I uploaded; I can post them again if you need). I also would suggest that we all use the memray tool; without it this is kind of a guessing game. |
This is surprising as valgrind doesn't detect this |
Please reupload it |
Memray is specific to Python, so I think maybe valgrind, being more general, has more chance of seeing that as a false positive?
Attached are the aggregated results only; the raw data is quite big but I can arrange it if you want. |
We compress the content with a high compression level, which means that zstd allocates a lot of internal data. From #243 (comment), it seems that your peak heap size is about 750MB. I may have misunderstood the memray report, but from https://bloomberg.github.io/memray/flamegraph.html it seems that the flamegraph shows the memory usage at the point of peak memory usage. So it is normal that a lot of memory is allocated for compression and indexing. (And that's a constraint of this issue: libzim has a base memory allocation which is substantial. Using a memory usage profiler on a "small" data set may lead to wrong leads.)
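To get a feel for how much of that is simply the price of a high compression level (independently of any leak), the Python `zstandard` bindings can report zstd's own per-context estimate. A sketch, assuming the `zstandard` package is installed (this is a proxy, not libzim's actual allocation):

```python
# Sketch: ask zstd itself how much working memory a compression context needs
# at various levels, via the python `zstandard` bindings.
import zstandard

for level in (3, 9, 15, 19, 22):
    params = zstandard.ZstdCompressionParameters.from_level(level)
    estimate = params.estimated_compression_context_size()
    print(f"level {level}: ~{estimate / 1e6:.0f} MB per compression context")
# Higher levels use much bigger windows and internal tables, so a large chunk
# under ZSTD_resetCCtx_internal can be legitimate working memory, not a leak.
```
|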
@mgautierfr Thanks for the check. |
Sorry for the wait, this run took a really long time (18h+) even with the small site. Here's the run with memray that tracks only the memory leaks; note that to get accurate tracking I've had to set the Python memory allocator to malloc instead of the default pymalloc (which could also be why valgrind was confused by the results). test_malloc.zip.001

Following is my analysis of what I think are the most significant leaks that could contribute, on a long run, to the memory ballooning.

Note: this is with both compression and indexing disabled; I'll see if I can make the run with both on.

=======

The reported leaks not listed here seem to me too small to contribute to the issue, or are mostly static data allocated by the various imports / native classes.

- unique_ptr<> from the Dirent pool do not seem to ever be released
- seems that this is never collected by the GC even if the parser.close()
- A couple of illustrations are downloaded and converted even if the no image
- The parsers remain in memory and every request to parse the tags leaks some memory.
- SSL connection contexts are never released
- Internals of the iso639 package seem to leak the dictionary on every request
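One cheap way to sanity-check the "parsers are never collected" candidates above from pure Python (a hypothetical snippet, not taken from the sotoki codebase): run the suspect pattern in a loop, force a GC pass, and count surviving parser objects.

```python
# Hypothetical sanity check: create and use SAX parsers in a loop, force a
# garbage collection, then count XMLReader objects that are still alive.
import gc
import io
import xml.sax

class NullHandler(xml.sax.ContentHandler):
    pass

def live_readers():
    return sum(1 for obj in gc.get_objects()
               if isinstance(obj, xml.sax.xmlreader.XMLReader))

before = live_readers()
for _ in range(100):
    parser = xml.sax.make_parser()
    parser.setContentHandler(NullHandler())
    parser.parse(io.BytesIO(b"<posts><row Id='1' Body='x'/></posts>"))
del parser
gc.collect()
print(f"XMLReader objects alive: before={before}, after={live_readers()}")
# If the "after" count stays well above "before", something (a handler, an
# internal cache, a reference cycle) is still keeping the parsers alive.
```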
|
I've run math.stackoverflow.com which is one of the largest after SO (behind

Results are confusing: the baseline consumes less memory and the graph is very different. Not sure if it's usable at all. Haven't had time to unfold the discussion and won't for the days to come. |
This issue has been automatically marked as stale because it has not had recent activity. It will now be reviewed manually. Thank you for your contributions. |
Bounty Hunters: See #243 (comment) for additional details
When running sotoki on a very large domain such as StackOverflow, memory becomes an issue.
A full run (without images) on the athena worker (which has a slow CPU and disk) consumed nearly all of its 96GiB of configured RAM.
Note: because it took so long to complete, we've lost most of the netdata graph… Another run on a faster worker lasted half the time (but we have no RAM usage graph).
This is obviously unexpected and unsustainable. The expected RAM usage for StackOverflow is 36-38GiB (of which 15-18GiB is used by redis).
While it allowed us to discover some memory issues in libzim and scraperlib, it is clear that sotoki is leaking most of its generated content.
Below is a graph of memory consumption over a full run when using a fake ZIM creator (so no data written to the ZIM nor to disk).
Don't look at the initial spike, which is a debug artefact. While the absolute values are small (a max of 215MiB), the curve should be flat. Here, we're seeing a 70% RAM increase.