Allow saving / loading from Huggingface Hub preset #1510

Merged
merged 6 commits into from
Mar 27, 2024

Contributor

@Wauplin Wauplin commented Mar 13, 2024

Solves #1294. As mentioned in #1294 (comment), this PR adds support for the hf:// prefix to load presets from the Hugging Face Hub.

The integration requires the huggingface_hub library. Authentication can be configured with the HF_TOKEN environment variable (only for private models or for uploads, similarly to KaggleHub). Here is a Colab notebook showcasing it.

import keras_nlp
from keras_nlp.models import BertClassifier
from keras_nlp.utils.preset_utils import save_to_preset, upload_preset

classifier = BertClassifier.from_preset("bert_base_en_uncased")
(...) # train/retrain/fine-tune

# Save to local folder
save_to_preset(classifier, "bert_base_en_uncased_retrained")

# Upload to Hugging Face Hub
upload_preset("hf://Wauplin/bert_base_en_uncased_retrained", "bert_base_en_uncased_retrained", allow_incomplete=True)

# Reload from the Hugging Face Hub
classifier_reloaded = keras_nlp.models.BertClassifier.from_preset(
    "hf://Wauplin/bert_base_en_uncased_retrained",
    num_classes=2,
    activation="softmax",
)

Here is what it looks like once uploaded to the Hub: https://huggingface.co/Wauplin/bert_base_en_uncased_retrained/tree/main
If we go this way, I think we should also upload a default model card with a keras-nlp tag to make all KerasNLP models discoverable on the Hub. On the Hugging Face side, we could make KerasNLP an official library (e.g. searchable, with code snippets, download counts, etc.).

In the current implementation, saving to "hf://Wauplin/bert_base_en_uncased_retrained" does three things: it saves the model locally to the Wauplin/bert_base_en_uncased_retrained subfolder, creates the repository on the Hub, and uploads the local folder to that repo. An alternative would be to save to a temporary folder before uploading to the Hub (to avoid the local copy). Both solutions are correct in my opinion; it's more a matter of how the KerasNLP team envisions the save_to_preset method.
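
For illustration, the temporary-folder alternative could look roughly like the sketch below. This is not what the PR implements. The helper name `save_and_upload_via_tempdir` and the two callables `save_fn` and `upload_fn` are hypothetical stand-ins for `save_to_preset` and `upload_preset`, so the sketch runs with the standard library alone.

```python
import tempfile
from pathlib import Path


def save_and_upload_via_tempdir(save_fn, upload_fn, uri):
    """Save a preset into a throwaway directory, upload it, then clean up.

    `save_fn(local_dir)` and `upload_fn(uri, local_dir)` are hypothetical
    callables standing in for `save_to_preset` and `upload_preset`.
    Returns the sorted names of the files that were present at upload time.
    """
    with tempfile.TemporaryDirectory() as tmp_dir:
        # Write the preset files into the temporary directory.
        save_fn(tmp_dir)
        # Upload while the directory still exists; it is deleted on exit.
        upload_fn(uri, tmp_dir)
        return sorted(p.name for p in Path(tmp_dir).iterdir())
```

With KerasNLP this might be called as `save_and_upload_via_tempdir(lambda d: save_to_preset(classifier, d), upload_preset, "hf://Wauplin/bert_base_en_uncased_retrained")`, assuming the two functions keep the argument order shown in the example above.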

Member

@mattdangerw mattdangerw left a comment

Looks good! Love how simple this is to add. Let's hold tight for a bit while we nail down the public version of save_to_preset.

@@ -109,6 +123,9 @@ def save_to_preset(
weights_filename="model.weights.h5",
):
"""Save a KerasNLP layer to a preset directory."""
push_to_hf = preset.startswith(HF_PREFIX)
Member

Here we might end up making some tweaks. We want to support a split flow for saving preprocessing and model weights, so our final flow might need to allow something like this:

tokenizer.save_to_preset(dir)
backbone.save_to_preset(dir)
upload_preset("hf://user/model", dir)

I don't think we need to solve that here though! @SamanehSaadat is working on a draft of our upload flow currently.

Contributor Author

Makes perfect sense!

@mattdangerw
Member

@Wauplin with #1512 I think we are all ready to go here.

Note that we will keep landing features related to the whole upload flow (in particular, uploading a classifier with extra head weights, and uploading LoRA weights that are essentially a diff on another preset). But we will keep the upload_preset flow, so I think this is unblocked.

@Wauplin
Contributor Author

Wauplin commented Mar 26, 2024

Thanks for the ping @mattdangerw! I've just updated the PR accordingly and we should now be good to go :) And I agree with you: we shouldn't need much HF-specific logic beyond that, given how isolated upload_preset and get_file are.

Regarding documentation, do I have to update some markdown somewhere or are the different preset handlers not really documented for now? Please let me know if I can be of any assistance here.

@Wauplin Wauplin requested a review from mattdangerw March 26, 2024 16:38
Member

@mattdangerw mattdangerw left a comment

Looks great! Just some comments on error messages.

"Please install with `pip install huggingface_hub`."
)
hf_handle = preset.removeprefix(HF_PREFIX)
return huggingface_hub.hf_hub_download(repo_id=hf_handle, filename=path)
Member

What do the error messages look like if a handle is malformed? Will it read well enough, or should we validate roughly here so we can have a message similar to the Kaggle error above?

Contributor Author

At this point, hf_handle must correspond to a repo_id, i.e. something of the form username/repo_name (e.g. "google/gemma-7b"), since the hf:// prefix has been removed. If that's not the case, an HFValidationError (which is a custom ValueError) is raised. Here are the validation rules we are checking.

Contributor Author

I just pushed 24ef262 to raise a ValueError similar to the kaggle handle one. I tried to be as consistent as possible. Please let me know what you think :)
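
As a rough illustration of that kind of early check (a sketch only, not the code from 24ef262), the handle could be validated along these lines. The function name `parse_hf_handle` and the error wording are assumptions, and this only checks the overall `username/repo_name` shape; it does not reproduce the stricter rules that huggingface_hub applies.

```python
HF_PREFIX = "hf://"


def parse_hf_handle(preset):
    """Return the `username/repo_name` part of an `hf://` preset string.

    Hypothetical helper: raises a ValueError with a readable message
    when the handle does not have exactly two non-empty segments.
    """
    if not preset.startswith(HF_PREFIX):
        raise ValueError(
            f"Expected a handle starting with `{HF_PREFIX}`. "
            f"Received: preset={preset!r}"
        )
    # Slice instead of str.removeprefix so this also runs on Python < 3.9.
    hf_handle = preset[len(HF_PREFIX):]
    parts = hf_handle.split("/")
    if len(parts) != 2 or not all(parts):
        raise ValueError(
            "Unexpected Hugging Face preset. Handles should have the form "
            "'hf://<username>/<repo_name>'. "
            f"Received: preset={preset!r}"
        )
    return hf_handle
```

For example, `parse_hf_handle("hf://google/gemma-7b")` returns `"google/gemma-7b"`, while `"hf://gemma-7b"` or `"hf://a/b/c"` raise a ValueError before any network call is made.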

keras_nlp/utils/preset_utils.py
)
hf_handle = uri.removeprefix(HF_PREFIX)
repo_url = huggingface_hub.create_repo(repo_id=hf_handle, exist_ok=True)
huggingface_hub.upload_folder(repo_id=repo_url.repo_id, folder_path=preset)
else:
raise ValueError(
f"Unexpected URI `'{uri}'`. Kaggle upload format should follow "
Member

Same here, might want to reword this error message.

Side note: we are kinda inconsistent in how we refer to these model handles. I'm not sure if we should call these URIs or something else, but we should be consistent in our wording. No need to fix on this PR.

Contributor Author

I've updated the error message in 24ef262. I kept the uri naming to be consistent with the rest of the logic, but I agree with you that it would be good to harmonize naming. I think the inconsistency comes from the fact that in get_file, preset refers to either a local directory or a URI, while in upload_preset, preset refers to a local directory and uri to the URI. So maybe not so inconsistent after all?
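
To make the prefix-based dispatch in upload_preset concrete, here is a minimal sketch. The helper name `pick_upload_backend`, the `KAGGLE_PREFIX` constant, and the exact error wording are illustrative assumptions, not the merged code.

```python
KAGGLE_PREFIX = "kaggle://"  # assumed constant name, mirroring HF_PREFIX
HF_PREFIX = "hf://"


def pick_upload_backend(uri):
    """Return a (backend, handle) pair for a supported upload URI.

    Hypothetical helper: the real upload_preset performs the upload
    itself; this only shows how the URI prefix selects the backend.
    """
    if uri.startswith(KAGGLE_PREFIX):
        return "kaggle", uri[len(KAGGLE_PREFIX):]
    if uri.startswith(HF_PREFIX):
        return "hf", uri[len(HF_PREFIX):]
    # One message that lists every accepted format, so both backends
    # are described consistently.
    raise ValueError(
        "Unknown URI. A URI must be one of:\n"
        "1) a Kaggle Model handle like "
        "`'kaggle://<KAGGLE_USERNAME>/<MODEL>/<FRAMEWORK>/<VARIATION>'`\n"
        "2) a Hugging Face handle like `'hf://<HF_USERNAME>/<MODEL>'`\n"
        f"Received: uri='{uri}'."
    )
```

Keeping a single fallthrough error means a new backend only adds one branch and one line to the message.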

@SamanehSaadat
Member

Thanks for the PR, @Wauplin! Looks great!
Regarding documentation, for now, you can update the upload_preset() docstring here to include info about hf://.
I'm working on preparing a Kaggle upload guide and I'll make sure to include info about HuggingFace as well.

@Wauplin
Contributor Author

Wauplin commented Mar 27, 2024

Thanks both for the review and feedback! I have addressed all comments and completed the upload_preset docstring to mention hf://. I did not add it to get_file since the Kaggle preset was not documented there either, but I'm happy to document them both if you think it makes sense.

f"Unexpected URI `'{uri}'`. Kaggle upload format should follow "
"`kaggle://<KAGGLE_USERNAME>/<MODEL>/<FRAMEWORK>/<VARIATION>`."
"Unknown URI. An URI must be a one of:\n"
"1) a Kaggle Model handle like `'kaggle://<KAGGLE_USERNAME>/<MODEL>/<FRAMEWORK>/<VARIATION>'`\n"
Contributor Author

@Wauplin Wauplin Mar 27, 2024

Here I followed the existing message but I find it inconsistent with the error in get_file. In get_file, we provide real examples (e.g. 'kaggle://keras/bert/keras/bert_base_en') while here we only provide the format ('kaggle://<KAGGLE_USERNAME>/<MODEL>/<FRAMEWORK>/<VARIATION>'). Both are fine IMO but if you prefer one or the other, please let me know and I can update in this PR.

Member

I think we can merge this as is, and I'll chat with folks later to figure out our broader naming and push a small fix.

Member

@mattdangerw mattdangerw left a comment

Thanks so much! Looking forward to being on the hub!

@Wauplin
Contributor Author

Wauplin commented Mar 27, 2024

Great, thanks for the approval! I just pushed a commit to fix linting. Hope it's fine now :)

Member

@SamanehSaadat SamanehSaadat left a comment

Thanks a lot!

@mattdangerw mattdangerw merged commit 316f18c into keras-team:master Mar 27, 2024
6 checks passed
@SamanehSaadat
Member

Hi @Wauplin!

Thanks again for the PR!
I was trying to create a demo for HF upload and realized when you want to upload a model, you need to create the model on the HF web UI first. upload_folder() documentation mentions that the folder can be uploaded to an existing repo.

Just wanted to make sure my understanding is correct and there isn't any way to upload a model folder if the model hasn't been created on the HF website.

@Wauplin
Contributor Author

Wauplin commented Mar 28, 2024

> I was trying to create a demo for HF upload and realized when you want to upload a model, you need to create the model on the HF web UI first. upload_folder() documentation mentions that the folder can be uploaded to an existing repo.

@SamanehSaadat yes that's true but you can use create_repo to create the repo on the Hub first. This is actually what we are doing here.
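
A minimal sketch of that two-step pattern, using the same `create_repo` and `upload_folder` calls as the diff above. The wrapper name `push_folder_to_hub` and its `hub` parameter are hypothetical; the parameter only exists so the function can be exercised without network access.

```python
def push_folder_to_hub(repo_id, folder_path, hub=None):
    """Create the repo if needed, then upload a local folder to it.

    `hub` is a hypothetical injection point for tests; by default the
    real `huggingface_hub` module is imported lazily.
    """
    if hub is None:
        import huggingface_hub as hub

    # With exist_ok=True the call does not raise if the repo already exists,
    # so no manual creation through the web UI is needed.
    repo_url = hub.create_repo(repo_id=repo_id, exist_ok=True)
    # Upload into the repo that create_repo resolved.
    hub.upload_folder(repo_id=repo_url.repo_id, folder_path=folder_path)
    return repo_url.repo_id
```

Calling `push_folder_to_hub("user/model", "./my_preset")` with the real library requires write access to that namespace, for example via the HF_TOKEN environment variable.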

abuelnasr0 pushed a commit to abuelnasr0/keras-nlp that referenced this pull request Apr 2, 2024
* first draft

* update upload_preset

* lint

* consistent error messages

* lint