
Export to Huggingface #50

Closed
wants to merge 12 commits into from

Conversation

sutyum

@sutyum sutyum commented Jan 14, 2024

  • Pull Index form Hub
  • Push Index to Hub

Will close #37


Are there any other concerns this PR should look into? I'm waiting to understand how best to organise the repo in order to define the push-to-hub part.

@bclavie
Collaborator

bclavie commented Jan 14, 2024

Hey!

Thanks for taking this on, very appreciated! I'll review the changes tomorrow.

I don't think there are any major concerns with this; it's very straightforward and should be plug-and-play with the existing methods.

I think pushing to the hub should be implemented as a utils function (similar to how we implemented pushing a model to the hub and converting to vespa format), but there should also be a method in RAGPretrainedModel to invoke it for the currently loaded index.
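The utils-level push function described above could look something like the following sketch. This is an assumption for illustration, not RAGatouille's actual implementation: `push_index_to_hub` and `index_path_for` are hypothetical names, and the `.ragatouille/indexes/` layout is taken from the convention discussed later in this thread. The import is kept lazy so `huggingface_hub` stays an optional dependency.

```python
from pathlib import Path


def index_path_for(index_name: str) -> Path:
    # Assumed default index location, matching the lib's .ragatouille/ convention.
    return Path(".ragatouille/indexes") / index_name


def push_index_to_hub(index_name: str, repo_id: str) -> None:
    """Hypothetical utils function: upload a local index folder to the HF hub."""
    folder = index_path_for(index_name)
    if not folder.exists():
        raise FileNotFoundError(f"No local index found at {folder}")
    from huggingface_hub import HfApi  # lazy import keeps the dependency optional

    # upload_folder pushes every file in the directory in a single commit.
    HfApi().upload_folder(folder_path=str(folder), repo_id=repo_id)
```

A thin method on `RAGPretrainedModel` could then simply call this with the name of the currently loaded index.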

@bclavie
Collaborator

bclavie commented Jan 14, 2024

Just a quick thought: I think loading from the hub could be implemented in from_index rather than being its own function: first check if the path exists locally, otherwise attempt to download from the hub, and raise an OSError if it's not found there either.

This follows the style of the huggingface .from_pretrained() loaders.


Also, I think it may be good to allow the loading function to take a local_dir: Optional[str] = None argument, which if present we'd pass to hf_hub_download(), and if not present we'd set it to f".ragatouille/indexes/{NAME_OF_THE_REMOTE_INDEX}", ensuring the default behaviour is consistent with the rest of the lib.
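The resolution order described above can be sketched as follows; `resolve_index_path` is a hypothetical helper name, and the lazy import and fallback path mirror the conventions proposed in this comment:

```python
from pathlib import Path
from typing import Optional


def resolve_index_path(index_path: str, local_dir: Optional[str] = None) -> Path:
    """Hypothetical helper: local path first, hub fallback, OSError otherwise."""
    if Path(index_path).exists():
        return Path(index_path)
    # Not found locally: treat index_path as a hub repo id, mirroring the
    # behaviour of huggingface's .from_pretrained() loaders.
    target = local_dir or f".ragatouille/indexes/{Path(index_path).name}"
    try:
        from huggingface_hub import snapshot_download  # lazy import

        return Path(snapshot_download(repo_id=index_path, local_dir=target))
    except Exception as exc:
        raise OSError(
            f"Index '{index_path}' was found neither locally nor on the hub"
        ) from exc
```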

@sutyum
Author

sutyum commented Jan 15, 2024

converting to vespa format: Not sure how this format works.

So is the idea to have the repo structured in the form of this sample directory?: https://github.com/vespa-engine/sample-apps/tree/master/colbert

Containing:

services.xml
ext/            <-- the contents of ./ragatouille/indexes/index-name go here
  something.pt
  something-else.json
  etc.pt
schemas/
  doc.sd        (same one as in the sample folder)

Is this useful broadly or something specific to the Vespa project?

@bclavie
Collaborator

bclavie commented Jan 15, 2024

Hey, my bad -- I just wanted to cite the Vespa export as an example of how I think export functions should be implemented: https://github.com/bclavie/RAGatouille/blob/main/ragatouille/models/utils.py

No need to match Vespa's schema at all, or even export the Vespa ONNX file here at all!

You can probably copy the code from export_to_huggingface_hub(), remove the Vespa-related arguments/checks, and make it export the contents of an index folder rather than a model folder.

However, I think there are a handful of things to take into account here:

  • If using a model available on the huggingface hub, just pushing the index as-is is enough, because metadata.json logs the checkpoint used to create an index, and loading it will fetch the model with that name from the HF hub. E.g. if you create an index with colbert-ir/colbertv2.0, that'll be in your metadata.json, so the weights will be properly fetched from HF if you don't have them locally.
  • But this is not the case for any model not on the hub! I think for safety's sake, exporting an index should also involve exporting the associated model, maybe to a subfolder called model? This code should be easy to tweak to do it:
[...]
colbert_model = ColBERT(
    colbert_path,
    colbert_config=colbert_config,
)
print(f"Model loaded... saving export files to disk at {export_path}/model")
try:
    save_model = colbert_model.save
except Exception:
    save_model = colbert_model.module.save
save_model(str(Path(export_path) / "model"))
[...]

Then, before exporting the index, we'd need to update the Index's config (in metadata.json) to make sure checkpoint points to path/to/your/index/model, so it loads properly!

@sutyum sutyum marked this pull request as draft January 25, 2024 07:14
@bclavie
Collaborator

bclavie commented Jan 25, 2024

Hey @sutyum, let me know if you need any assistance with this! The exporting is a bit tricky because it's basically the behaviour we have now, except we need to actively load the ColBERT instance from colbert.modeling.colbert so we can dump the model with the index, without overwriting the artifact metadata. (or rather, only overwriting what we need to for the index in some cases)

I think a structure like the one you used for your manual export, but reversed, could make sense? e.g. the main folder would be the index, and the subfolder /colbert_checkpoint/ would be where we dump the model, before uploading the whole thing to the HF hub? Happy to help with the implementation of this!

@sutyum
Author

sutyum commented Jan 26, 2024

Hey @sutyum, let me know if you need any assistance with this! The exporting is a bit tricky because it's basically the behaviour we have now, except we need to actively load the ColBERT instance from colbert.modeling.colbert so we can dump the model with the index, without overwriting the artifact metadata. (or rather, only overwriting what we need to for the index in some cases)

Would be great if you could take a look and see if I am doing something wrong.

  • Need to figure out how to interface with metadata.json in order to update the checkpoint. Couldn't find an equivalent of ColbertConfig that loads it into a dict
  • Need to find out how to get the absolute path of the model snapshot stored by huggingface-cli
  • Need to manually test the various cases (existing model on hf + new index, fine-tuned model + new index)

I think a structure like the one you used for your manual export, but reversed, could make sense? e.g. the main folder would be the index, and the subfolder /colbert_checkpoint/ would be where we dump the model, before uploading the whole thing to the HF hub? Happy to help with the implementation of this!

If you try the additions to the 01-basic notebook, I have added a section for uploading and using an index from HF. Currently it stores the model in /model/ and the index in the repo root.

@bclavie
Collaborator

bclavie commented Jan 26, 2024

Hey! Overall I think it's shaping up pretty nicely. I think we could probably merge export_to_huggingface_hub and upload_index_and_model into a single export_to_huggingface_hub function that takes an is_index: bool argument to decide which steps must be performed. Happy to take a stab at this in a bit!

Need to figure out how to interface with the metadata.json in order to update the checkpoint. Couldn't find an equivalent of ColbertConfig which load it into a dict

At its core it's just a normal JSON file, so I think for this specific case there is no issue with loading it as JSON (RAGatouille tends to use srsly for file ops, so srsly.read_json/srsly.write_json), modifying the attribute, and saving it back to disk.
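The read-modify-write cycle described here can be sketched with the stdlib json module (srsly.read_json / srsly.write_json would replace the json calls one-for-one). Note the nesting of the checkpoint key inside metadata.json is an assumption for illustration; check the actual file before relying on it:

```python
import json
from pathlib import Path


def repoint_checkpoint(index_path: str, new_checkpoint: str = "model") -> None:
    """Rewrite metadata.json so 'checkpoint' points at the model bundled with the index."""
    metadata_file = Path(index_path) / "metadata.json"
    metadata = json.loads(metadata_file.read_text())
    metadata["config"]["checkpoint"] = new_checkpoint  # assumed key layout
    metadata_file.write_text(json.dumps(metadata, indent=2))
```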

Need to find out how to get the absolute path of the model snapshot stored by huggingface-cli

I'm not sure I understand what you mean here... do you mean the path of a model downloaded from the hub?
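If the question is about the local path of a model downloaded from the hub: snapshot_download returns the absolute path of the local snapshot directory, and is effectively a no-op when the files are already cached. The second helper below is an assumption that mirrors huggingface_hub's documented cache layout, shown only to illustrate where the CLI puts things:

```python
from pathlib import Path


def local_snapshot_path(repo_id: str) -> str:
    # Downloads only what is missing from the cache and returns the
    # absolute path of the local snapshot directory.
    from huggingface_hub import snapshot_download  # lazy import

    return snapshot_download(repo_id=repo_id)


def default_cache_dir_for(repo_id: str) -> Path:
    # The hub cache stores repos as models--{org}--{name} under the HF home dir.
    return Path.home() / ".cache/huggingface/hub" / ("models--" + repo_id.replace("/", "--"))
```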

Need to manually test the various cases (existing model on hf + new index, fine-tuned model + new index)

Yeah! Once this is done I think we'll be nearing this being releasable, which opens up a lot of niceties (e.g. easily download a pre-indexed Wikipedia for everyone to try out, etc). Thanks for taking this on 🙏

@sutyum sutyum closed this May 7, 2024
Successfully merging this pull request may close these issues.

Support exporting index to HuggingFace Hub