Indexing - Ran out of memory? Resume Indexing? #146
Replies: 3 comments
-
Hey @jared252016, how's it going? Thanks for the discussion. I'm glad that you're considering pupyl for this. A quick workaround for the memory issue is to turn off extreme mode:

```python
from pupyl.search import PupylImageSearch

SEARCH = PupylImageSearch(extreme_mode=False)
```

but a better solution was just proposed on #148. Databases are stored on the path defined by the `data_dir` parameter of `PupylImageSearch(data_dir: str)` at instantiation. For instance:

```python
from pupyl.search import PupylImageSearch

SEARCH = PupylImageSearch(data_dir='~/pupyl')
```

will create the database under `~/pupyl`.
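For context, here's a minimal sketch combining both suggestions, assuming the `index()` method shown in pupyl's README; the directory paths are just placeholders:

```python
from pupyl.search import PupylImageSearch

# Keep the database in an explicit directory and disable extreme
# mode to reduce memory pressure during indexing.
SEARCH = PupylImageSearch(data_dir='~/pupyl', extreme_mode=False)

# Index a collection of images; pupyl accepts local or remote URIs.
SEARCH.index('/path/to/images')
```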
-
Hi @jared252016, the first issue that you reported, about #147, has been addressed in #149.
-
@jared252016, the resume indexing feature was merged and it's part of the new release.
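In case it helps, resuming would then look something like this: a minimal sketch, assuming the feature works by re-opening the same `data_dir` (paths are placeholders):

```python
from pupyl.search import PupylImageSearch

# Point at the same data_dir used by the interrupted run; the
# existing database is re-opened instead of created from scratch.
SEARCH = PupylImageSearch(data_dir='~/pupyl')

# Re-running index() continues over the collection, skipping items
# that were already indexed (assumption based on this thread).
SEARCH.index('/path/to/images')
```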
-
A few questions about using Pupyl...
I'm looking at using this tool to start a free reverse image search service. I had been running my crawler for quite some time, looking into ways to compare images beyond plain hashes, when I stumbled upon this project. I have about 7 TB of media, much of which is probably video rather than strictly pictures, but when I tried to index just one folder it ran out of memory. The last count was 263,154 items. That took approximately 3 days, so I definitely don't want to start from scratch.
So, as the title says: does this resume indexing where it left off? And where is the actual database stored? Surely it uses more than just files scattered across hundreds of folders?
Also, is there any way to improve performance?
If this takes off, I will happily donate to the pupyl project. I'm not able to use other reverse image search APIs, since the sites I'm indexing are unique and often not indexed by the others.