Memory Leak (XML parsing, Templates, libzim ?) #243

Open · rgaudin opened this issue Sep 10, 2021 · 52 comments
Labels: bounty!, bug

@rgaudin (Member) commented Sep 10, 2021

Bounty Hunters: See #243 (comment) for additional details

When running sotoki on a very large domain such as StackOverflow, memory becomes an issue.
A full run (without images) on the athena worker (which has a slow CPU and disk) consumed nearly all of its 96GiB of configured RAM.

Screen Shot 2021-09-10 at 08 12 05

Note: because it took so long to complete, we've lost most of the netdata graph… Another run on a faster worker lasted half the time (but we have no RAM usage graph).

This is obviously unexpected and unsustainable. The expected RAM usage for StackOverflow is 36-38GiB (of which 15-18GiB is used by redis).

While this allowed us to discover some memory issues in libzim and scraperlib, it is clear that sotoki is leaking most of its generated content.

Below is a graph of memory consumption over a full run using a fake ZIM creator (so no data is written to the ZIM or to disk).

full-sotoki-nozim

Ignore the initial spike, which is a debug artifact. While the absolute values are small (a maximum of 215MiB), the curve should be flat; here, we're seeing a 70% RAM increase.
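
For reference, here is a minimal sketch of how such a RAM curve can be captured without netdata, assuming the psutil package (the function name and output file are illustrative):

import csv
import time

import psutil

def sample_rss(pid, path="rss.csv", interval=5.0):
    # Sample the target process' RSS every `interval` seconds and append it to
    # a CSV that can be plotted afterwards; run this from a separate process.
    proc = psutil.Process(pid)
    with open(path, "w", newline="") as fh:
        writer = csv.writer(fh)
        writer.writerow(["timestamp", "rss_bytes"])
        while proc.is_running():
            writer.writerow([time.time(), proc.memory_info().rss])
            fh.flush()
            time.sleep(interval)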

rgaudin added the bug label Sep 10, 2021
rgaudin self-assigned this Sep 10, 2021
stale bot commented Mar 2, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will now be reviewed manually. Thank you for your contributions.

stale bot added the stale label Mar 2, 2022
@kelson42 (Contributor):

@rgaudin @mgautierfr Maybe this new open-source tool would help us understand what is going on? https://github.com/bloomberg/memray
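
For what it's worth, memray can be used through its CLI (e.g. memray run -o output.bin on the sotoki entry point, then memray flamegraph output.bin), or by wrapping a single suspect phase with its Python API. A minimal sketch of the latter; the wrapped function is purely hypothetical:

from memray import Tracker

# Wrap just the phase under suspicion so the capture stays small; afterwards,
# `memray flamegraph questions.bin` renders the report.
with Tracker("questions.bin"):
    process_questions()  # stand-in for whatever phase is being investigated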

kelson42 added this to the 2.1.0 milestone May 26, 2022
stale bot removed the stale label May 26, 2022
Popolechien added the bounty! label Jun 23, 2022
@Popolechien:

Note: a bounty can now be claimed for solving this issue. https://app.bountysource.com/issues/109534260-leaking-memory

@TheCrazyT (Contributor):

I guess you already made sure that no process is hanging, too?
(I guess a graph of actively running processes might also help.)
I just took a quick look through the code and saw some subprocess.run calls, but sadly there is no logging around them (when they start and when they finish).
Maybe there are options with a timeout that you could use to make sure they finish within a defined time... (a sketch follows below)
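
A sketch of that suggestion, assuming the steps are plain subprocess.run calls (the wrapper name is illustrative): log when each child starts and finishes, and bound it with a timeout so a hung child cannot be mistaken for a leak.

import logging
import subprocess

logger = logging.getLogger(__name__)

def run_step(args, timeout=3600):
    # Run one preparation step, logging start/end and failing loudly if the
    # child does not finish within `timeout` seconds.
    logger.info("starting %s", args)
    try:
        subprocess.run(args, check=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        logger.error("%s did not finish within %s seconds", args, timeout)
        raise
    logger.info("finished %s", args)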

PS:
It could also be that "zimscraperlib" is leaking, but I haven't taken a look at it yet...

@rgaudin (Member, Author) commented Jun 28, 2022

@TheCrazyT those processes happen early on in the whole scraping process and complete before we start the intensive work. Now that we're expecting outsiders to take a look at this, I'll rewrite the issue in the coming day(s) with as much detail as possible. Yes, https://github.com/openzim/python-scraperlib is a possible culprit.

@TheCrazyT (Contributor):

I'm currently looking at:
https://githubmemory.com/repo/openzim/python-libzim/issues/117?page=2
Does that mean that "4. Observe that memory consumption is stable" is wrong, because you still have problems?
If you let the proof-of-concept script on that page run long enough (with a bigger dataset), would you get the same curve?

If your answer is yes, then I'm wondering whether it could be a Python-internal problem, somewhere inside xml.sax.handler.
At least that would also explain why most of the process is leaking.
But the impact must be really, really small, because you don't notice it on short runs.
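
One simple way to test that hypothesis, independent of sotoki (a sketch, assuming nothing beyond the standard library): parse the same document repeatedly with xml.sax and check whether traced allocations keep growing between iterations. tracemalloc only sees Python-level allocations, so a flat curve here would rather point at the C side.

import tracemalloc
import xml.sax

# Rows mimic the SE dump shape: HTML (with entities) escaped inside the Body attribute.
DOC = b"<rows>" + b'<row Body="&lt;p&gt;some long text &amp;hellip;&lt;/p&gt;"/>' * 1000 + b"</rows>"

class NullHandler(xml.sax.ContentHandler):
    def startElement(self, name, attrs):
        attrs.get("Body")  # touch the attribute, then drop it

tracemalloc.start()
for i in range(50):
    xml.sax.parseString(DOC, NullHandler())
    if i % 10 == 0:
        current, peak = tracemalloc.get_traced_memory()
        print(f"iteration {i}: current={current // 1024} KiB, peak={peak // 1024} KiB")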

@rgaudin (Member, Author) commented Jun 29, 2022

Here's a more complete description for those not familiar with sotoki and this issue.

  • sotoki creates a ZIM file for a given stack-exchange domain (there are 356; see sotoki --list-all)
  • a ZIM is a binary archive of a website: every resource is stored at a path and non-binary resources are compressed
  • It takes a ZIM reader to use/browse/read a ZIM. Most readers rely on libzim. End-user-friendly ones are made by Kiwix; kiwix-serve is recommended for tests.
  • sotoki works off the (7z-compressed) XML dumps provided by stack-exchange.
  • the leak certainly affects all stack-exchange domains, but it's only a concern for StackOverflow because of its size: the dump is ~26GB and expands to somewhere near 120GB.
  • sotoki works in stages (see sotoki.scraper:start()). At the beginning, it downloads, extracts and then transforms (in subprocesses) the XML dumps into easier-to-work-with ones. This is sotoki.archives and sotoki.utils.preparation. Don't get distracted by how much memory those use, as they are tailored for speed and complete before the actual work begins.
  • then sotoki parses (xml.sax) those prepared XML dumps and, for each record, stores some information, generates one or several HTML pages that are added directly to the ZIM, or both (a hypothetical sketch of this flow follows the list)
  • sotoki stores the information to be used later in a Redis DB.
  • the redis DB is in-RAM and quite large. For SO I believe it was around 18GB, which is fine: that's actual content that we need to store.
  • In parallel, images found in entries are fetched online and added to the ZIM.
  • all our stackoverflow runs are done in docker.
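
To make the middle stages concrete, here is a hypothetical sketch of that parse → render → add flow (class and method names are illustrative, not sotoki's actual code); the point is that once the item has been added, nothing should keep a reference to the record or the rendered HTML:

import xml.sax

class PostHandler(xml.sax.ContentHandler):
    # renderer: e.g. a template wrapper; creator: the (real or fake) ZIM creator;
    # database: the redis-backed store used by later passes. All hypothetical names.
    def __init__(self, renderer, creator, database):
        self.renderer = renderer
        self.creator = creator
        self.database = database

    def startElement(self, name, attrs):
        if name != "row":
            return
        record = dict(attrs)
        self.database.record_post(record)          # keep only what later stages need
        html = self.renderer.render_post(record)   # build the HTML page
        self.creator.add_item_for(f"questions/{record['Id']}", content=html)
        # after this point, nothing should still reference `record` or `html`

def process_dump(path, renderer, creator, database):
    xml.sax.parse(path, PostHandler(renderer, creator, database))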

Here's a graph that shows how it looks over time (this was an OOM'd run with ~60GB of RAM):

Screen Shot 2021-08-30 at 08 43 52

The huge spike at the beginning is the preparation step. Then you can see sotoki processing the Users (storing info in redis and creating user pages in the ZIM), followed by a drop in usage: this is when we release some of the info that we won't need anymore (see :cleanup_users(), :purge() and :defrag_external() in sotoki.utils.database). Then it starts processing the questions/answers and the RAM increases linearly until it either OOMs or completes.

The latest stackoverflow run, using the 2022-03 dumps, used approx. 125GB of RAM (that run wasn't monitored).


What the RAM consumption suggests is that the entirety of the content we read/write (which of the two is unclear, because we both transform and compress while making the ZIM) goes into RAM and stays there.

While Python doesn't release memory back to the system, that alone doesn't explain it: this memory is genuinely not available anymore, as we hit OOM on systems without enough memory.

It seems (no hard evidence, just the match of XML size vs RAM usage) that the size of images (fetched online and added to the ZIM) is not impacting RAM consumption, which may indicate that the leak doesn't affect binary item addition in scraperlib/libzim.

rgaudin changed the title from "Leaking memory" to "Memory Leak (XML parsing, Templates, libzim ?)" Jun 29, 2022
@TheCrazyT (Contributor) commented Jun 29, 2022

Talking about the sax parser...
It is probably better to call close() after you are done:

import xml.sax

parser = xml.sax.make_parser()

# ...

try:
    parser.close()
except xml.sax.SAXException:
    # not sure if anything can be done here
    # just ignoring this error for now ;-)
    pass

I tested it in Google Colab, and the docstring of the close function says the following:

This method is called when the entire XML document has been
passed to the parser through the feed method, to notify the
parser that there are no more data. This allows the parser to
do the final checks on the document and empty the internal
data buffer.

The parser will not be ready to parse another document until
the reset method has been called.

close may raise SAXException.

(same info as here : https://github.com/python/cpython/blob/main/Lib/xml/sax/xmlreader.py#L144 )

The default parser, ExpatParser, implements IncrementalParser and seems to use a "pyexpat C module".
At least according to this comment:
https://github.com/python/cpython/blob/f4c03484da59049eb62a9bf7777b963e2267d187/Lib/xml/sax/expatreader.py#L88

I also didn't find anything that calls close automatically.
So my guess is that this "pyexpat C module" still holds some resources (probably not the whole document) and memray can't see it because it is not pure Python.
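
For reference, the contract from that docstring looks roughly like this when driving the parser incrementally (the file names below are illustrative):

import xml.sax

parser = xml.sax.make_parser()              # ExpatParser, an IncrementalParser
parser.setContentHandler(xml.sax.ContentHandler())

for dump in ("posts_prepared.xml", "users_prepared.xml"):   # illustrative names
    with open(dump, "rb") as fh:
        for chunk in iter(lambda: fh.read(64 * 1024), b""):
            parser.feed(chunk)
    parser.close()   # final checks, empties the internal pyexpat buffer
    parser.reset()   # required before the same parser can parse another document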

TheCrazyT added a commit to TheCrazyT/sotoki that referenced this issue Jun 30, 2022
@TheCrazyT (Contributor) commented Jul 2, 2022

It is still possible that the xml library is leaking.
At least I created a proof of concept with a Jupyter notebook:

https://gist.github.com/TheCrazyT/376bb0aac2d283f9ba505b889ef5da51

The bad thing is... I currently can't really figure out what the problem is, but it seems to be caused by the HTML entities in combination with long text inside the Body attribute of the row tag.
I was hoping it was just a specific entity that causes problems.

I will probably try to create a better proof of concept with a memory profiler... just to be sure.

Edit:
Oh well, profiling didn't show the result I was expecting.
Somehow I'm still evaluating this wrongly; there seems to be another condition I cannot figure out.
For example, I just duplicated the row and numbered it, because I was expecting the other rows to also stay in memory.
But for some reason only the content of the first row stayed in memory, although the content of the other rows was nearly the same.
I'm a bit clueless at the moment...

Edit again:
Guess that was just a red herring... the text stays in memory for some reason, but you can't really increase the memory usage through it.
At least I was not able to produce a case where the following rows would also stay in memory.
So you can only increase memory usage if your first row is very, very large, and that won't match the described behaviour...

Edit 04.07.2022:
Now analysing a core dump of a full run-through with a smaller dataset ("astronomy").
There seem to be many NavigableString objects left over from beautifulsoup, but not sure if that means anything.
Also, some URLs seem to be left in memory when comparing multiple runs.
Although I already checked the beautifulsoup commands themselves with memory profiling, there might be something special about when it can leak memory.
(According to https://bugs.launchpad.net/beautifulsoup there also seem to be problems with "recursion depth"; not sure if such errors could also lead to unclean memory.)

Edit 06.07.2022:
Just another Jupyter notebook to show that memory is leaking:
https://gist.github.com/TheCrazyT/117dd8371f63708e6f655e444f4bfaa4
This time by executing sotoki itself, with garbage collection forced near the sys.exit.
I guess it is kind of strange that the text of row number 12409 of 12504 is still within the dump.
The last row would be understandable... but that entry is kind of old.
I also had another test notebook that showed there are still about 24097 dict objects, 13144 list objects and 11848 tuple objects referenced within the core dump.
Not so sure I could figure out how/where those are still referenced.
My guess is that maybe some are still indirectly referenced by the Global object.
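
As a live-process complement to grepping the core dump, the same kind of census can be taken with nothing but the standard library, e.g. before and after one processing phase, and then diffed:

import gc
from collections import Counter

def object_census():
    gc.collect()
    return Counter(type(obj).__name__ for obj in gc.get_objects())

before = object_census()
# ... run one processing phase (e.g. one prepared XML dump) ...
after = object_census()
for type_name, delta in (after - before).most_common(15):
    print(f"{type_name}: +{delta}")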

@TheCrazyT (Contributor) commented Jul 6, 2022

Guess I should run a comparative memory profile with Python 3.10, because they seem to have changed something about garbage collection in this commit:

python/cpython@59af59c

Hm, weird: I just noticed that the tag on the commit is v3.11.0b3 (although somewhere else it said 3.10), so maybe it is not released yet...

Edit:
Well, a lot of packages just won't work with Python 3.11.0b3.
The only option left is to get the Python 3.8 code, cherry-pick that commit and cross my fingers that it is compatible 😞.

Update 09.07.2022:
I am now running a patched Python 3.8 with the following patch:
https://gist.github.com/TheCrazyT/db4323d9cb3d3cc71b2a69020b1fa535
With the "blender" dataset the peak memory dropped by about 1.2 MB, but I guess more would be possible with more of the modifications that were made in Python 3.11. (At least I still see room for improvement, since the curve from mprof still seems to increase.)
It is not much, but relative to that dataset it might be good enough for the "stackoverflow" dataset.
Because I do not have the resources, though, I'm not able to run it against that.

Guess the biggest problem will be adding the patch to the Dockerfile.
The only approaches I can think of would increase the size and build time of the Docker image dramatically, because you would need to compile two Python extensions (so the image would need compile tools and compile time).
Adding the binaries themselves is probably not a good solution either.

Update 11.07.2022:
I'm still missing something; the current changes (commits below) do not have much impact.
Not really satisfying, tbh.

TheCrazyT added a commit to TheCrazyT/sotoki that referenced this issue Jul 10, 2022
TheCrazyT added a commit to TheCrazyT/sotoki that referenced this issue Jul 10, 2022
@albassort:

Sorry if this is too much to ask, but I would like to take a look, though Jupyter notebooks aren't that helpful for me.

Could you please write a small Python script, with a provided XML file? (I recommend Wikipedia's XML dumps; they're beefy enough to illustrate trends like this.)

Thank you!

@TheCrazyT (Contributor):

@Retkid sorry, besides fishing for strings in the core dump and profiling the whole of sotoki, I have no simple script that profiles the memory.

I'm currently looking at the output of "log-malloc2" to figure out if there is something weird (something allocated but not freed).

Currently that leads me to "lxml", inside "xmlBufGrow", but it could also be a red herring.

@rgaudin (Member, Author) commented Jul 14, 2022

Thanks for the updates. I missed those edits above, as edits don't trigger a notification.

FYI, a nopic (without any pictures) run of stackoverflow is about to end. This code includes the parser.close() patch but, as expected, there is no visible improvement.

It's important to note that the RAM usage is equivalent to the version with pictures, confirming that images are not involved, which incidentally also excludes libzim itself (for binary entries).

Screen Shot 2022-07-14 at 09 17 33

@TheCrazyT (Contributor):

@rgaudin

OK... here is another thing I noticed.
The redis package's pipeline actually has a "command stack" (https://github.com/redis/redis-py/blob/c54dfa49dda6a7b3389dc230726293af3ffc68a3/redis/client.py#L2049) that only gets emptied after a commit.

But here is the catch: commit_maybe won't commit as often as you would think it would:

self.bump_seen(4 + len(post.get("Tags", [])))

A commit will only happen if nb_seen is divisible by 1000:

return self.nb_seen % self.commit_every == 0

But since you increase it by 4 + len(post.get("Tags", [])), it takes a bit of luck to land exactly on something divisible by 1000.
What if you get 1001, 1002, 1003, 1004, 1005... after incrementing nb_seen by 4 + len(post.get("Tags", []))?
It will need another loop until it gets lucky and hits 2000, etc.

So my guess is that the command_stack just grows and grows, because a commit barely ever happens.
(Currently this theory has the problem that you do not see any drop in the graph at all.)

To be honest, I would simply remove the + len(post.get("Tags", [])), but I'm not sure whether it has any special meaning.
Another solution would be to reset nb_seen inside should_commit when it is bigger than 1000, and return self.nb_seen > self.commit_every (sketched below).
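
A minimal sketch of that second option (attribute names follow the snippets quoted above; the real code lives in sotoki.utils.database, and the pipeline attribute here is hypothetical):

class Recorder:
    commit_every = 1000

    def __init__(self, pipeline):
        self.pipe = pipeline   # hypothetical redis pipeline holding the command_stack
        self.nb_seen = 0

    def bump_seen(self, by=1):
        self.nb_seen += by

    def should_commit(self):
        # old version: `self.nb_seen % self.commit_every == 0`, which a bump of
        # 4 + len(tags) can easily step over without ever hitting an exact multiple
        return self.nb_seen >= self.commit_every

    def commit_maybe(self):
        if self.should_commit():
            self.pipe.execute()   # flushes the accumulated command_stack
            self.nb_seen = 0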

TheCrazyT added a commit to TheCrazyT/sotoki that referenced this issue Jul 18, 2022
@rgaudin (Member, Author) commented Jul 18, 2022

Good point; I'd go with self.nb_seen > self.commit_every as well, as it seems safer.

@rgaudin (Member, Author) commented Jul 18, 2022

Please submit a PR once you have a working version; thanks 🙏

@mgautierfr:

I would go with a new last_commit and have something like:

if self.nb_seen > self.last_commit + self.commit_every:
    do_commit(...)
    self.last_commit += self.commit_every

This way we would not reset nb_seen and would still be able to use it for other statistics. (My two cents.)

> (Currently this theory has the problem that you do not see any drop in the graph at all.)

This is indeed a limitation of your theory (even if it is a good catch). On such a long run, we should see a few drops from time to time.
We will see how memory usage behaves with this fix.

@TheCrazyT (Contributor) commented Jul 18, 2022

Well, if I saw it correctly, every thread has its own nb_seen, starting from 0.

Maybe it's just not noticeable when one thread gets lucky and hits a commit.

Update:

Oh, by the way, the reason I did not create a pull request yet is that I was just logging the "commit".

For some reason I still only get the output at the end of the "Questions" progress, and that is driving me slightly insane atm.

Update again:

Just noticed that the actual recording phase is within "Questions_Meta" (within PostFirstPasser).

But that is near the beginning.

And this means... I probably still haven't found the main memory-leak problem, I guess.

@kelson42 (Contributor):

@parvit For 1, it does not work like this. If you find any proof that memory is not freed regularly (at the end of each cluster compression), then please point to it.

@parvit commented Jul 30, 2022

@kelson42 I could not tell you whether the memory is really being flushed at every cluster, but my run still had 1/4 of its total memory allocated at the end (so maybe there's an actual leak there?).

[memray flamegraph screenshot]

You can find this in the drive folder posted earlier, in the file original.html inside results.zip.

It might be possible that, in a longer run than mine, the usage from compression is bounded as you say, but we should then do a run with compression only and see if it aligns with the run where both are disabled.

EDIT: For context, the half on the left is the memory directly associated with Python code, while on the right are the blocks of memory associated with libzim's native code.

@mgautierfr:

Thanks all for the investigations.

Your remark @parvit is interesting. I took a fresh look at the xapian indexing and it seems we were creating a xapian transaction preventing xapian from flushing the data to the filesystem. I've made PR openzim/libzim#719 to fix that. @rgaudin, it would be nice to test again with this fix.

However, we have tested the sotoki scraper with a fake libzim (creating nothing) and we found that memory consumption was still increasing. So even if xapian is taking (too much) memory, there is probably another leak somewhere on the Python side.

@parvit commented Aug 3, 2022 via email

@mgautierfr:

The fakeCreator is here: https://github.com/openzim/sotoki/blob/v0.0.0.1/src/sotoki/utils/shared.py#L20
@rgaudin may give you more information on using it, but you mostly have to use it instead of the Creator imported in https://github.com/openzim/sotoki/blob/master/src/sotoki/utils/shared.py

> Also, my suggestion also included the compression change; did you consider that one in your test as well?

I had a look at the compression side but haven't found anything obvious.

@rgaudin (Member, Author) commented Aug 5, 2022

Here's the no-comp run that still had indexing. This is strangely identical to the nocomp-noindex one. I am starting a no-index run before testing @mgautierfr's branch.

nocomp

@parvit commented Aug 5, 2022

@rgaudin
Could you send me instructions on how to use the fake creator (maybe via email)? Thank you.

@rgaudin (Member, Author) commented Aug 8, 2022

Here's the noindex version, confirming that indexing has no visible impact on RAM usage but that compression is memory-intensive. @mgautierfr, can you briefly explain how it works? What should it be compressing, and how, so we can assess whether this is just normal behavior that we incorrectly didn't expect, or whether there's a problem.

noindex

@mgautierfr:

This is probably not expected.

First, it is normal that libzim memory usage increases as items are added. This is what I was explaining in openzim/python-libzim#117. I suspect that the slow slope increase at the beginning and the end is due to this normal memory growth.

But the big slope in the middle is surprising.

How compression is done:

  • We always have two clusters open: one for uncompressed content, one for compressed content.
  • When we add an item to the archive, we add the "data" to the right cluster (depending on whether the content is compressed or not). If we deactivate compression, we always add to the uncompressed cluster.
  • When we add "data" to the cluster, we actually use a contentProvider. We update the size of the cluster with contentProvider.getSize and we store the contentProvider itself. On the libzim side, we don't store the data. But if the contentProvider keeps the data, we then keep the data.
  • When a cluster is full (defined by creator.configClusterSize), we put it in a working queue to be processed (closed) in worker threads, and open a new one for the next items to be added.
  • We have creator.configNbWorkers threads working in parallel. When a worker processes a cluster, it closes it.
  • What happens when a cluster is closed depends on the compression:
    • For an uncompressed cluster, we do nothing.
    • For a compressed cluster, we compress the data (looping on the stored contentProviders), store the compressed data (in RAM) and clear the references to the contentProviders.
    • In all cases, the cluster is put in a queue to be written.
  • A dedicated thread takes the cluster and writes it:
    • For an uncompressed cluster, we loop on the contentProviders and write the data to the ZIM file.
    • For a compressed cluster, we write the compressed data.
    • In all cases, we then clear all data (contentProviders and compressed data). The cluster itself is kept, as we need to reference it at the end of the process.

For compressing, we create a compressStream just before a cluster compression and free it just after. We don't reuse contexts.

It could be interesting to have the creator's log. It gives some (timed) information about what has been processed, so we could see what is being processed during the middle slope.

It may also be interesting to create a ZIM with lzma compression. If the memory usage is not the same, it may be related to zstd compression itself.
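
To make that lifecycle easier to follow, here is a plain-Python illustration of the shape described above; this is not libzim code, and the provider callables, the zlib stand-in for zstd, and the queue names are all illustrative:

import queue
import threading
import zlib   # stand-in for zstd, just to have a real compressor

to_close = queue.Queue()   # full clusters waiting for a worker
to_write = queue.Queue()   # closed clusters waiting for the writer thread

class Cluster:
    def __init__(self, compressed):
        self.compressed = compressed
        self.providers = []   # content-provider callables; the data lives with them
        self.data = None

def worker():
    while True:
        cluster = to_close.get()
        if cluster.compressed:
            cluster.data = zlib.compress(b"".join(p() for p in cluster.providers), 9)
            cluster.providers.clear()   # provider references dropped once compressed
        to_write.put(cluster)

def writer(out):
    while True:
        cluster = to_write.get()
        if cluster.compressed:
            out.write(cluster.data)
        else:
            out.write(b"".join(p() for p in cluster.providers))
        cluster.providers.clear()
        cluster.data = None   # the Cluster object is kept, its payload is freed

for _ in range(4):   # mirrors creator.configNbWorkers
    threading.Thread(target=worker, daemon=True).start()
# plus one dedicated thread running writer(...) over the output file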

@parvit commented Aug 9, 2022

> For compressing, we create a compressStream just before a cluster compression and free it just after. We don't reuse contexts.

[memray flamegraph screenshot]

Your comment then indicates that this memory block at the end of my run is surely leaking: the free is not actually freeing the context as expected (but probably keeping the memory internally for reuse).

@mgautierfr:

zim::writer::taskRunner is the name of the worker thread.
Your capture probably just caught a frame where the worker thread was closing (zim::writer::Cluster::close) a cluster, and therefore compressing the content before it gets written to the ZIM file.

@parvit commented Aug 9, 2022

@mgautierfr
I think you might be misunderstanding the meaning of the flamegraph.

The frame at the bottom, "ZSTD_resetCCtx_internal", is the one that still holds a reference to memory at the end of the program run, and that call accounted for nearly 1/3 of the memory of the run (more precise data is contained in the files I uploaded; I can post them again if you need).

I would also suggest that we all use the memray tool; without it this is kind of a guessing game.

@mgautierfr:

This is surprising, as valgrind doesn't detect this.

@mgautierfr:

> more precise data is contained in the files I uploaded; I can post them again if you need

Please re-upload it.

@parvit commented Aug 9, 2022

> This is surprising, as valgrind doesn't detect this.

Memray is specific to Python, so I think maybe valgrind, being more general, has more chances of seeing that as a false positive? If you can, please share the valgrind command line and output.

> Please re-upload it

Attached are the aggregated results only; the raw data is quite big, but I can arrange it if you want.

results.zip

@mgautierfr commented Aug 9, 2022

We compress the content with a high compression level, which means that zstd allocates a lot of internal data.
Each compression stream allocates around 90MB (not including the compressed data itself). By default, there are 4 compression workers, so compression may consume up to 360MB (with the caveat that workers are shared between compression and indexation).

From #243 (comment), it seems that your peak heap size is about 750MB.

I may have misunderstood the memray report, but from https://bloomberg.github.io/memray/flamegraph.html it seems that the flamegraph shows memory usage at the point of peak memory usage. So it is normal that a lot of memory is allocated for compression and indexing.

(And that's a constraint of this issue: libzim has a substantial base memory allocation, so using a memory-usage profiler on a "small" dataset may produce wrong leads.)
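
For anyone who wants to sanity-check that per-stream figure outside libzim, here is a small sketch using the zstandard and psutil Python packages (libzim uses the C library directly, so numbers will not match exactly; this only illustrates the per-context cost of high compression levels):

import psutil
import zstandard

def rss_mib():
    return psutil.Process().memory_info().rss / (1024 * 1024)

print(f"baseline: {rss_mib():.0f} MiB")
streams = []
for i in range(4):   # mirrors the default of 4 compression workers
    cobj = zstandard.ZstdCompressor(level=19).compressobj()
    cobj.compress(b"x" * (1 << 20))   # feed some data so the context is fully allocated
    streams.append(cobj)
    print(f"after stream {i + 1}: {rss_mib():.0f} MiB")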

@parvit commented Aug 9, 2022

@mgautierfr
OK, I assumed wrong then, but checking the page you linked pointed me to the --leaks option and I'll be sure to use it.

Thanks for the check.

@parvit commented Aug 12, 2022

Sorry for the wait, this run took a really long time (18h+) even with the small site.

Here's the run with memray tracking only the memory leaks. Note that to get accurate tracking I had to set the Python memory allocator to malloc instead of the default pymalloc (which could also be why valgrind was confused by the results).

test_malloc.zip.001
test_malloc.zip.002
test_malloc.zip.003
test_malloc.zip.004
(If you can't extract them, rename the files so that the ".xxx" comes after the ".zip"; I had to change the names to be able to upload them to GitHub.)

Following is my analysis of what I think are the most significant leaks that could, over a long run, contribute to the memory ballooning.

Note: this is with both compression and indexing disabled; I'll see if I can do a run with both on.

=======

The reported leaks not listed here seem to me either too small to contribute to the issue or mostly static data allocated by the various imports / native classes.

unique_ptr<> instances from the Dirent pool do not ever seem to be released back into the pool:

  • super().add_item(item) and .add_item_for
    -> zim::writer::CreatorData::createItemDirent
    -> zim::writer::CreatorData::addDirent

  • super().add_redirection
    -> zim::writer::CreatorData::createRedirectDirent
    -> zim::writer::CreatorData::addDirent

  • return self.start() @ scraper.py
    -> Global.creator.start()
    -> zim::writer::Creator::addMetadata
    -> zim::writer::Creator::createDirent

It seems that this is never collected by the GC even if the parser.close() call is made and no native memory is attached (a sketch for checking this follows the list):

  • self.process_tags()
    -> TagGenerator().run()
    -> soup = BeautifulSoup(content, "lxml")

A couple of illustrations are downloaded and converted even if no-image was specified, and the formats used for the conversion are kept in memory:

  • return self.start() @ scraper.py
    -> self.add_illustrations()
    -> convert_image()
    -> fmt = format_for(dst) @ zimscraperlib/image/convertion.py
    -> init_pil() @ zimscraperlib/image/probing.py

The parsers remain in memory and every request to parse the tags leaks some memory.

  • self.process_tags_metadata()
    -> TagFinder().run()
    -> parser.parse(self.path)
    and
    -> parser = xml.sax.make_parser()

SSL connection contexts are never released. In actuality, most of the calls generated under Global.init(get_site(self.domain)) leak; maybe that points to the connections not being closed / released?

  • Global.init(get_site(self.domain)) @ scraper.py 170
    -> resp = session.get() @ zimscraperlib/download.py
    -> context = SSLContext(ssl_version) @ urllib3/util/ssl_.py 290
    -> SSL_CTX_new()

Internals of the iso639 package seem to leak the dictionary on every request for the language code:

  • Global.conf = Sotoconf(kwargs)
    -> self.iso_lang_1, self.iso_lang_3 = lang_for_domain(self.domain) @ sotoki/constants.py
    -> lang = get_language_details(so_code) @ sotoki/constants.py
    -> lang_data, macro_data = get_iso_lang_data(adjusted_query) @ zimscraperlib/i18n.py
    -> iso639_languages.get(
    {code_type: lang})
    -> return getattr(self, key)[value]
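
A small sketch for double-checking the BeautifulSoup entry above, assuming bs4 with the lxml parser (as in the trace): count live NavigableString objects once the soup is gone, with and without an explicit decompose():

import gc
from bs4 import BeautifulSoup, NavigableString

def live_navigable_strings():
    gc.collect()
    return sum(1 for obj in gc.get_objects() if isinstance(obj, NavigableString))

html = "<p>" + "some post body text " * 1000 + "</p>"

soup = BeautifulSoup(html, "lxml")
del soup
print("without decompose:", live_navigable_strings())

soup = BeautifulSoup(html, "lxml")
soup.decompose()   # explicitly breaks the parent/child reference cycles
del soup
print("with decompose:", live_navigable_strings())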

@rgaudin (Member, Author) commented Aug 17, 2022

I've run math.stackoverflow.com, which is one of the largest after SO (behind tex, askubuntu and superuser), both using the standard code and without compression. The goal was to find out whether the discrepancy we saw earlier with mathoverflow would scale.

Results are confusing:

baseline
math_baseline

nocomp
math_nocomp

The baseline consumes less memory and the graph is very different; not sure if it's usable at all. I haven't had time to go through the discussion and won't for the days to come.

kelson42 modified the milestones: 2.1.0, 2.2.0 Oct 24, 2022
stale bot commented May 26, 2023

This issue has been automatically marked as stale because it has not had recent activity. It will now be reviewed manually. Thank you for your contributions.

stale bot added the stale label May 26, 2023
kelson42 pinned this issue Oct 29, 2023
stale bot removed the stale label Oct 29, 2023
kelson42 modified the milestones: 2.1.0, 2.2.0 Oct 29, 2023
benoit74 modified the milestones: 2.2.0, 2.3.0 Mar 27, 2024