Allow saving / loading from Huggingface Hub preset #1510

Wauplin · 2024-03-13T17:37:14Z

Solves #1294. As mentioned in #1294 (comment), this PR adds support for the hf:// prefix to save presets to and load presets from the Hugging Face Hub.

The integration requires the huggingface_hub library. Authentication can be configured with the HF_TOKEN environment variable (only for private models or for uploads, similarly to KaggleHub). Here is a Colab notebook showcasing it.

import keras_nlp
from keras_nlp.models import BertClassifier
from keras_nlp.utils.preset_utils import save_to_preset, upload_preset

classifier = BertClassifier.from_preset("bert_base_en_uncased")
(...) # train/retrain/fine-tune

# Save to local folder
save_to_preset(classifier, "bert_base_en_uncased_retrained")

# Upload to Hugging Face Hub
upload_preset("hf://Wauplin/bert_base_en_uncased_retrained", "bert_base_en_uncased_retrained", allow_incomplete=True)

# Reload from the Hugging Face Hub
classifier_reloaded = keras_nlp.models.BertClassifier.from_preset(
    "hf://Wauplin/bert_base_en_uncased_retrained",
    num_classes=2,
    activation="softmax",
)
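The authentication rule described above (a token is only needed for private models or for uploads) can be sketched as a small pre-flight check. This is a minimal illustration and not part of the PR: `resolve_hf_token` and `check_hub_access` are hypothetical helper names, and in practice huggingface_hub reads HF_TOKEN on its own.

```python
import os
from typing import Mapping, Optional


def resolve_hf_token(env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return the token stored in HF_TOKEN, or None when unset or empty."""
    return env.get("HF_TOKEN") or None


def check_hub_access(uploading: bool, private: bool, token: Optional[str]) -> None:
    """Raise early if an operation that needs authentication has no token."""
    # Hypothetical helper: only uploads and private repos require a token.
    if (uploading or private) and token is None:
        raise PermissionError(
            "Set the HF_TOKEN environment variable to upload presets or to "
            "load private models from the Hugging Face Hub."
        )


# Loading a public preset works anonymously, so this does not raise.
check_hub_access(uploading=False, private=False, token=None)
```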

Here is what it looks like once uploaded to the Hub: https://huggingface.co/Wauplin/bert_base_en_uncased_retrained/tree/main
If we go this way, I think we should also upload a default model card with a keras-nlp tag to make all KerasNLP models discoverable on the Hub. On the Hugging Face side, we could make KerasNLP an official library (e.g. searchable, with code snippets, download counts, etc.).

In the current implementation, saving to "hf://Wauplin/bert_base_en_uncased_retrained" does three things: it saves the model locally to a Wauplin/bert_base_en_uncased_retrained subfolder, creates the repository on the Hub, and uploads the local folder to that repo. An alternative would be to save to a temporary folder before uploading to the Hub (to avoid keeping a local copy). Both solutions are valid in my opinion; it's more a matter of how the KerasNLP team envisions the save_to_preset method.
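The temporary-folder alternative could look roughly like the sketch below. It is only an illustration under stated assumptions: `save_and_upload_via_tempdir` is a hypothetical name, and the save and upload steps are passed in as callables (stand-ins for something like save_to_preset and upload_preset) so the example runs without a model or network access.

```python
import tempfile
from pathlib import Path
from typing import Callable


def save_and_upload_via_tempdir(
    save_fn: Callable[[str], None],
    upload_fn: Callable[[str, str], None],
    uri: str,
) -> None:
    """Save into a temporary directory, upload it, then discard the local copy."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        save_fn(tmp_dir)         # e.g. lambda d: save_to_preset(model, d)
        upload_fn(uri, tmp_dir)  # e.g. upload_preset(uri, d)
    # The temporary directory and its contents are deleted here.


# Stand-in callables, used only to show the order of operations.
def fake_save(directory: str) -> None:
    Path(directory, "config.json").write_text("{}")


def fake_upload(uri: str, directory: str) -> None:
    files = sorted(p.name for p in Path(directory).iterdir())
    print(f"uploading {files} to {uri}")


save_and_upload_via_tempdir(fake_save, fake_upload, "hf://user/model")
```

The trade-off is the one raised in the description: nothing is left on disk after the upload, so a caller who also wants a local preset has to save a second time.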

mattdangerw

Looks good! Love how simple this is to add. Let's hold tight for a bit while we nail down the public version of save_to_preset.

mattdangerw · 2024-03-13T18:46:26Z

keras_nlp/utils/preset_utils.py

@@ -109,6 +123,9 @@ def save_to_preset(
    weights_filename="model.weights.h5",
 ):
    """Save a KerasNLP layer to a preset directory."""
+    push_to_hf = preset.startswith(HF_PREFIX)


Here we might end up doing some tweaks. We want to support a split flow for saving preprocessing and model weights. So our final flow might need to allow something like this

tokenizer.save_to_preset(dir)
backbone.save_to_preset(dir)
upload_preset("hf://user/model", dir)

I don't think we need to solve that here though! @SamanehSaadat is working on a draft of our upload flow currently.

Makes perfect sense!

mattdangerw · 2024-03-25T17:19:33Z

@Wauplin with #1512 I think we are all ready to go here.

Note that we will keep landing features related to the whole upload flow (in particular uploading a classifier with extra head weights, uploading lora weights that are essentially a diff on another preset). But we will keep the upload_preset flow, so I think this is unblocked.

Wauplin · 2024-03-26T16:38:03Z

Thanks for the ping @mattdangerw! I've just updated the PR accordingly and we should now be good to go :) And agree with you, we shouldn't have much HF-specific logic apart from that given how isolated upload_preset and get_file are.

Regarding documentation, do I have to update some markdown somewhere or are the different preset handlers not really documented for now? Please let me know if I can be of any assistance here.

mattdangerw

Looks great! Just some comments on error messages.

mattdangerw · 2024-03-26T17:19:25Z

keras_nlp/utils/preset_utils.py

+                "Please install with `pip install huggingface_hub`."
+            )
+        hf_handle = preset.removeprefix(HF_PREFIX)
+        return huggingface_hub.hf_hub_download(repo_id=hf_handle, filename=path)


What do the error messages look like if a handle is malformed? Will it read well enough, or should we validate roughly here so we can have a message similar to the Kaggle error above?

At this point, hf_handle must correspond to a repo_id so something in the form username/repo_name (e.g. "google/gemma-7b") since the hf prefix has been removed. If it's not the case, an HFValidationError (which is a custom ValueError) is raised. Here are the validation rules we are checking.

I just pushed 24ef262 to raise a ValueError similar to the kaggle handle one. I tried to be as consistent as possible. Please let me know what you think :)
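A rough validation of the kind discussed here might look like the following sketch. It is not the code from 24ef262: `parse_hf_handle` and the error wording are hypothetical, and the check only enforces the username/repo_name shape mentioned above, leaving the stricter Hub naming rules to huggingface_hub itself.

```python
HF_PREFIX = "hf://"


def parse_hf_handle(preset: str) -> str:
    """Strip the hf:// prefix and check for a `username/repo_name` shape."""
    if not preset.startswith(HF_PREFIX):
        raise ValueError(f"Expected a preset starting with '{HF_PREFIX}'. Received: {preset!r}")
    handle = preset[len(HF_PREFIX):]
    parts = handle.split("/")
    # Assumption for this sketch: exactly two non-empty segments.
    if len(parts) != 2 or not all(parts):
        raise ValueError(
            "Unexpected Hugging Face preset. Handles should have the form "
            "'hf://<USERNAME>/<MODEL>', for example 'hf://google/gemma-7b'. "
            f"Received: {preset!r}"
        )
    return handle


print(parse_hf_handle("hf://google/gemma-7b"))
```

Failing early with a message that names the expected format keeps the error consistent with the Kaggle one, instead of surfacing a lower-level HFValidationError from the download call.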

mattdangerw · 2024-03-26T17:25:28Z

keras_nlp/utils/preset_utils.py

+            )
+        hf_handle = uri.removeprefix(HF_PREFIX)
+        repo_url = huggingface_hub.create_repo(repo_id=hf_handle, exist_ok=True)
+        huggingface_hub.upload_folder(repo_id=repo_url.repo_id, folder_path=preset)
    else:
        raise ValueError(
            f"Unexpected URI `'{uri}'`. Kaggle upload format should follow "


Same here, might want to reword this error message.

Side note: we are kinda inconsistent in how we refer to these model handles. I'm not sure if we should call these URIs or something else, but we should be consistent in our wording. No need to fix on this PR.

I've updated the error message in 24ef262. I kept the uri naming to be consistent with the rest of the logic but agree with you it would be good to harmonize naming. I think the inconsistency comes from the fact that in get_file, preset refers to either a local directory or a URI, while in upload_preset, preset refers to a local directory and uri to the URI. So maybe not so inconsistent after all?

SamanehSaadat · 2024-03-26T23:13:48Z

Thanks for the PR, @Wauplin! Looks great!
Regarding documentation, for now, you can update the upload_preset() docstring here to include info about hf://.
I'm working on preparing a Kaggle upload guide and I'll make sure to include info about HuggingFace as well.

Wauplin · 2024-03-27T08:40:58Z

Thanks both for the review and feedback! I have addressed all comments and completed the upload_preset docstring to mention hf://. I did not add it to get_file since kaggle preset was not documented there either but happy to document them both if you think it makes sense.

Wauplin · 2024-03-27T08:43:01Z

keras_nlp/utils/preset_utils.py

-            f"Unexpected URI `'{uri}'`. Kaggle upload format should follow "
-            "`kaggle://<KAGGLE_USERNAME>/<MODEL>/<FRAMEWORK>/<VARIATION>`."
+            "Unknown URI. An URI must be a one of:\n"
+            "1) a Kaggle Model handle like `'kaggle://<KAGGLE_USERNAME>/<MODEL>/<FRAMEWORK>/<VARIATION>'`\n"


Here I followed the existing message but I find it inconsistent with the error in get_file. In get_file, we provide real examples (e.g. 'kaggle://keras/bert/keras/bert_base_en') while here we only provide the format ('kaggle://<KAGGLE_USERNAME>/<MODEL>/<FRAMEWORK>/<VARIATION>'). Both are fine IMO but if you prefer one or the other, please let me know and I can update in this PR.

I think we can merge this as is, and I'll chat with folks later to figure out our broader naming and push a small fix.

mattdangerw

Thanks so much! Looking forward to being on the hub!

Wauplin · 2024-03-27T15:54:12Z

Great, thanks for the approval! I just pushed a commit to fix linting. Hope it's fine now :)

SamanehSaadat

Thanks a lot!

SamanehSaadat · 2024-03-28T01:22:54Z

Hi @Wauplin!

Thanks again for the PR!
I was trying to create a demo for HF upload and realized when you want to upload a model, you need to create the model on the HF web UI first. upload_folder() documentation mentions that the folder can be uploaded to an existing repo.

Just wanted to make sure my understanding is correct and there isn't any way to upload a model folder if the model hasn't been created on the HF website.

Wauplin · 2024-03-28T10:14:01Z

> I was trying to create a demo for HF upload and realized when you want to upload a model, you need to create the model on the HF web UI first. upload_folder() documentation mentions that the folder can be uploaded to an existing repo.

@SamanehSaadat yes that's true but you can use create_repo to create the repo on the Hub first. This is actually what we are doing here.

* first draft * update upload_preset * lint * consistent error messages * lint

first draft

9715aac

Wauplin mentioned this pull request Mar 13, 2024

Add from_huggingface method to KerasNLP models #1294

Open

mattdangerw reviewed Mar 13, 2024

Wauplin mentioned this pull request Mar 15, 2024

Upload Model to Kaggle #1512

Merged

mattdangerw requested a review from SamanehSaadat March 25, 2024 17:18

Wauplin added 3 commits March 26, 2024 17:28

Merge branch 'master' into huggingface-hub-integration

da5dceb

update upload_preset

92858aa

lint

9785423

Wauplin requested a review from mattdangerw March 26, 2024 16:38

mattdangerw reviewed Mar 26, 2024

consistent error messages

24ef262

Wauplin commented Mar 27, 2024

mattdangerw approved these changes Mar 27, 2024

lint

9dae14f

SamanehSaadat approved these changes Mar 27, 2024

mattdangerw merged commit 316f18c into keras-team:master Mar 27, 2024
6 checks passed

Wauplin mentioned this pull request Mar 28, 2024

[RfC] Ideas for better Hugging Face Hub integration #1529

Closed

abuelnasr0 pushed a commit to abuelnasr0/keras-nlp that referenced this pull request Apr 2, 2024

Allow saving / loading from Huggingface Hub preset (keras-team#1510)

6ea1e63

* first draft * update upload_preset * lint * consistent error messages * lint
