Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix add_to_index overwrites and duplicate paths #157

Merged
merged 23 commits into from
Mar 5, 2024

Conversation

adharm
Copy link
Collaborator

@adharm adharm commented Feb 24, 2024

This fixes the overwriting issue in add_to_index (from #71) and the issue with paths having duplicated sections (from #82).

I updated the tests to better cover the "add" functionality (moving them to their own file will come in a future PR).

I also added a few more items to the .gitignore.

@adharm adharm requested a review from bclavie February 24, 2024 05:28
@adharm adharm changed the title Fix add to index overwrites and duplicate paths Fix add_to_index overwrites and duplicate paths Feb 24, 2024
@bclavie
Copy link
Collaborator

bclavie commented Feb 24, 2024

Thank you 🙏 I'll review in a bit.

Also, I thought you could be interested in checking out the other indexing approaches (flat & hnsw) and decoupling of model and index (see #158). This should make CRUD a lot easier to support, esp. for the non-PLAID indexes.

@bclavie
Copy link
Collaborator

bclavie commented Feb 25, 2024

LGTM except for one thing: It seems that when using the IndexUpdater, we never write the new collection to file, so there'll be errors when trying to load the index again and query (unless I'm missing something)

@jlscheerer
Copy link
Collaborator

Thank you very much @anirudhdharmarajan. I have integrated the changes in #158

…ed + consolidated repeat logic

Put all repeated read/write from/to disk index operations into functions and removed unnecessary path logic in add_to_index.
@adharm
Copy link
Collaborator Author

adharm commented Feb 27, 2024

First off, I'm super excited for the changes you've started off in #158, @jlscheerer! That's going to be incredibly useful, thanks for taking the lead on that.

@bclavie, good catch! I fixed that + consolidated repeated code and updated the path logic, lmk if these changes make sense.

@adharm
Copy link
Collaborator Author

adharm commented Feb 28, 2024

Actually, just realized that the path logic needs updating, fixing now. Please hold!

@adharm
Copy link
Collaborator Author

adharm commented Feb 28, 2024

I just reverted the logic for manipulating paths for the Searcher for now rather than changing anything for how paths are kept consistent (especially since index_root means different things here vs in the upstream colbert repo). I can tackle that in more detail once #158 goes through!
I'll also take a stab at restructuring the tests once that change goes through to avoid duplication of tests and expanding coverage.
Looks like huggingface is down again, so failed CI run 😞

@paulthemagno
Copy link

That would be super-useful!

@kevinningthu
Copy link

Thanks for your excellent work. May I ask when this PR will be merged into the main branch?

@bclavie bclavie merged commit 401913a into AnswerDotAI:main Mar 5, 2024
2 checks passed
@akshay-anv
Copy link

Does this fix all the issues with add_to_index function ?

@paulthemagno
Copy link

Will the commit of this PR will be tagged as a new version available on pip during this days ?

@bclavie
Copy link
Collaborator

bclavie commented Mar 8, 2024

Hey @Akshay1921, not yet. These will be improved immensely by #158 which I'm expecting to land very soon.

@paulthemagno @kevinningthu Next version release will likely be Monday (potentially today, but might be a bit short on time) and will include this + #158, which will greatly improve a lot of aspects with indexing!

@kevinningthu
Copy link

Hey @Akshay1921, not yet. These will be improved immensely by #158 which I'm expecting to land very soon.

@paulthemagno @kevinningthu Next version release will likely be Monday (potentially today, but might be a bit short on time) and will include this + #158, which will greatly improve a lot of aspects with indexing!

Thank you guys so much!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants