
Export to Huggingface #50

Closed
wants to merge 12 commits into from

Conversation

sutyum

@sutyum sutyum commented Jan 14, 2024

  • Pull Index form Hub
  • Push Index to Hub

Will close #37


Are there any other concerns this PR should look into? I'm waiting to understand how best to organise the repo in order to define the push-to-hub part.

@bclavie
Collaborator

bclavie commented Jan 14, 2024

Hey!

Thanks for taking this on, very appreciated! I'll review the changes tomorrow.

I don't think there are any major concerns with this; it's very straightforward and should be plug-and-play with the existing methods.

I think pushing to the hub should be implemented as a utils function (similar to how we implemented pushing a model to the hub and converting to vespa format), but there should also be a method in RAGPretrainedModel to invoke it for the currently loaded index.
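The utils-level push function described above could look something like the following sketch. This is an assumption for illustration, not RAGatouille's actual implementation: `push_index_to_hub` and `index_path_for` are hypothetical names, and the `.ragatouille/indexes/` layout is taken from the convention discussed later in this thread. The import is kept lazy so `huggingface_hub` stays an optional dependency.

```python
from pathlib import Path


def index_path_for(index_name: str) -> Path:
    # Assumed default index location, matching the lib's .ragatouille/ convention.
    return Path(".ragatouille/indexes") / index_name


def push_index_to_hub(index_name: str, repo_id: str) -> None:
    """Hypothetical utils function: upload a local index folder to the HF hub."""
    folder = index_path_for(index_name)
    if not folder.exists():
        raise FileNotFoundError(f"No local index found at {folder}")
    from huggingface_hub import HfApi  # lazy import keeps the dependency optional

    # upload_folder pushes every file in the directory in a single commit.
    HfApi().upload_folder(folder_path=str(folder), repo_id=repo_id)
```

A thin method on `RAGPretrainedModel` could then simply call this with the name of the currently loaded index.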

@bclavie
Collaborator

bclavie commented Jan 14, 2024

Just a quick thought: I think loading from the hub could be implemented in from_index rather than being its own function: first check if the path exists locally, otherwise attempt to download from the hub, and raise an OSError if it's not found there either.

This follows the style of the huggingface .from_pretrained() loaders.


Also, I think it may be good to allow the loading function to take a local_dir: Optional[str] = None argument, which if present we'd pass to hf_hub_download(), and if not present we'd set it to f".ragatouille/indexes/{NAME_OF_THE_REMOTE_INDEX}", ensuring the default behaviour is consistent with the rest of the lib.
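The resolution order described above can be sketched as follows; `resolve_index_path` is a hypothetical helper name, and the lazy import and fallback path mirror the conventions proposed in this comment:

```python
from pathlib import Path
from typing import Optional


def resolve_index_path(index_path: str, local_dir: Optional[str] = None) -> Path:
    """Hypothetical helper: local path first, hub fallback, OSError otherwise."""
    if Path(index_path).exists():
        return Path(index_path)
    # Not found locally: treat index_path as a hub repo id, mirroring the
    # behaviour of huggingface's .from_pretrained() loaders.
    target = local_dir or f".ragatouille/indexes/{Path(index_path).name}"
    try:
        from huggingface_hub import snapshot_download  # lazy import

        return Path(snapshot_download(repo_id=index_path, local_dir=target))
    except Exception as exc:
        raise OSError(
            f"Index '{index_path}' was found neither locally nor on the hub"
        ) from exc
```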

@sutyum
Author

sutyum commented Jan 15, 2024

converting to vespa format: Not sure how this format works.

So is the idea to have the repo structured in the form of this sample directory?: https://github.com/vespa-engine/sample-apps/tree/master/colbert

Containing:

services.xml
ext/            <-- the contents of ./ragatouille/indexes/index-name go here
  something.pt
  something-else.json
  etc.pt
schemas/
  doc.sd        (same one as in the sample folder)

Is this useful broadly or something specific to the Vespa project?

@bclavie
Collaborator

bclavie commented Jan 15, 2024

Hey, my bad -- I just wanted to cite the Vespa export as an example of how I think export functions should be implemented: https://github.com/bclavie/RAGatouille/blob/main/ragatouille/models/utils.py

No need to match Vespa's schema at all, or even export the Vespa ONNX file here at all!

You can probably copy the code from export_to_huggingface_hub(), remove the Vespa-related arguments/checks, and make it export the contents of an index folder rather than a model folder.

However, I think there are a handful of things to take into account here:

  • If using a model available on the huggingface hub, just pushing the index as-is is enough, because metadata.json logs the checkpoint used to create an index, and loading it will fetch the model with that name from the HF hub. E.g. if you create an index with colbert-ir/colbertv2.0, that'll be in your metadata.json, so the weights will be properly fetched from HF if you don't have them locally.
  • But this is not the case for any model not on the hub! I think for safety's sake, exporting an index should also involve exporting the associated model, maybe to a subfolder called model? This code should be easy to tweak to do it:
[...]
colbert_model = ColBERT(
    colbert_path,
    colbert_config=colbert_config,
)
print(f"Model loaded... saving export files to disk at {export_path}/model")
try:
    save_model = colbert_model.save
except Exception:
    save_model = colbert_model.module.save
save_model(str(Path(export_path) / "model"))
[...]

Then, before exporting the index, we'd need to update the Index's config (in metadata.json) to make sure checkpoint points to path/to/your/index/model, so it loads properly!

@sutyum sutyum marked this pull request as draft January 25, 2024 07:14
@bclavie
Collaborator

bclavie commented Jan 25, 2024

Hey @sutyum, let me know if you need any assistance with this! The exporting is a bit tricky because it's basically the behaviour we have now, except we need to actively load the ColBERT instance from colbert.modeling.colbert so we can dump the model with the index, without overwriting the artifact metadata. (or rather, only overwriting what we need to for the index in some cases)

I think a structure like the one you used for your manual export, but reversed, could make sense? e.g. the main folder would be the index, and the subfolder /colbert_checkpoint/ would be where we dump the model, before uploading the whole thing to the HF hub? Happy to help with the implementation of this!

@sutyum
Author

sutyum commented Jan 26, 2024

Hey @sutyum, let me know if you need any assistance with this! The exporting is a bit tricky because it's basically the behaviour we have now, except we need to actively load the ColBERT instance from colbert.modeling.colbert so we can dump the model with the index, without overwriting the artifact metadata. (or rather, only overwriting what we need to for the index in some cases)

Would be great if you could take a look and see if I am doing something wrong.

  • Need to figure out how to interface with metadata.json in order to update the checkpoint. Couldn't find an equivalent of ColbertConfig that loads it into a dict
  • Need to find out how to get the absolute path of the model snapshot stored by huggingface-cli
  • Need to manually test the various cases (existing model on hf + new index, fine-tuned model + new index)

I think a structure like the one you used for your manual export, but reversed, could make sense? e.g. the main folder would be the index, and the subfolder /colbert_checkpoint/ would be where we dump the model, before uploading the whole thing to the HF hub? Happy to help with the implementation of this!

If you try the additions to the 01-basic notebook, I have added a section for uploading and using an index from HF. Currently it stores the model in /model/ and the index in the repo root.

@bclavie
Collaborator

bclavie commented Jan 26, 2024

Hey! Overall I think it's shaping up pretty nicely. I think we could probably merge export_to_huggingface_hub and upload_index_and_model into a single export_to_huggingface_hub function that takes an is_index: bool argument to decide which steps must be performed. Happy to take a stab at this in a bit!

Need to figure out how to interface with the metadata.json in order to update the checkpoint. Couldn't find an equivalent of ColbertConfig which load it into a dict

At its core it's just a normal JSON file, so I think for this specific case there is no issue with loading it as JSON (RAGatouille tends to use srsly for file ops, so srsly.read_json/srsly.write_json), modifying the attribute, and saving it back to disk.
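The read-modify-write cycle described here can be sketched with the stdlib json module (srsly.read_json / srsly.write_json would replace the json calls one-for-one). Note the nesting of the checkpoint key inside metadata.json is an assumption for illustration; check the actual file before relying on it:

```python
import json
from pathlib import Path


def repoint_checkpoint(index_path: str, new_checkpoint: str = "model") -> None:
    """Rewrite metadata.json so 'checkpoint' points at the model bundled with the index."""
    metadata_file = Path(index_path) / "metadata.json"
    metadata = json.loads(metadata_file.read_text())
    metadata["config"]["checkpoint"] = new_checkpoint  # assumed key layout
    metadata_file.write_text(json.dumps(metadata, indent=2))
```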

Need to find out how to get the absolute path of the model snapshot stored by huggingface-cli

I'm not sure I understand what you mean here... do you mean the path of a model downloaded from the hub?
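If the question is about the local path of a model downloaded from the hub: snapshot_download returns the absolute path of the local snapshot directory, and is effectively a no-op when the files are already cached. The second helper below is an assumption that mirrors huggingface_hub's documented cache layout, shown only to illustrate where the CLI puts things:

```python
from pathlib import Path


def local_snapshot_path(repo_id: str) -> str:
    # Downloads only what is missing from the cache and returns the
    # absolute path of the local snapshot directory.
    from huggingface_hub import snapshot_download  # lazy import

    return snapshot_download(repo_id=repo_id)


def default_cache_dir_for(repo_id: str) -> Path:
    # The hub cache stores repos as models--{org}--{name} under the HF home dir.
    return Path.home() / ".cache/huggingface/hub" / ("models--" + repo_id.replace("/", "--"))
```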

Need to manually test the various cases (existing model on hf + new index, fine-tuned model + new index)

Yeah! Once this is done I think we'll be nearing this being releasable, which opens up a lot of niceties (e.g. easily download a pre-indexed Wikipedia for everyone to try out, etc). Thanks for taking this on 🙏

@sutyum sutyum closed this May 7, 2024
Successfully merging this pull request may close these issues.

Support exporting index to HuggingFace Hub