Improve Message API code snippets #700

Merged: xenova merged 4 commits into main from improve-code-snippets on Jun 28, 2024

Conversation

@xenova (Contributor) commented on May 22, 2024

This PR is the first step towards improving auto-generated code snippets, focusing mainly on chat model inputs.

Highlights of the PR:

  • JS snippets were missing the content type header ("Content-Type": "application/json")
  • Code adapted from https://huggingface.co/blog/tgi-messages-api
  • Moved snippet generation code into separate folders, though I think we can eventually move it all into tasks/[task]/snippet.ts. The reason against keeping it all in a single file (previously snippets/inputs.ts) is that it will grow in complexity as we improve code snippets across all other tasks.
  • Some models don't support system messages and will either throw an error or ignore the system message. How should we handle this? (EDIT: Fixed by only specifying a user message by default, which almost all models should support.)

xenova added 2 commits May 22, 2024 12:50
Align with new text-generation generator and make codebase more future-proof (when we improve inputs across other tasks)
@julien-c (Member) commented:
For me, we would just change the default input for conversational to the following:

messages = [
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
    {"role": "user", "content": "Who are you?"},
]

and let the API and/or libraries handle the rest, no?

(That's what @Wauplin, @SBrandeis, and @mishig25 have been doing on the huggingface.js/inference, widgets, and huggingface_hub side, I think.)

@xenova (Contributor, Author) commented on Jun 13, 2024

Finally getting back to this! Just a few notes:

  1. The code I'm adding is adapted from https://huggingface.co/blog/tgi-messages-api, which uses the OpenAI (py/js) library for inference. This requires the user to install the appropriate library (pip install openai or npm i openai), which I will note in a comment just above the import. Is this the best approach? (A rough sketch of this approach is included after this list.)

  2. There are cases where the model isn't available for some reason. I came across these while testing:

    • For certain models like facebook/blenderbot_small-90M - presumably those not backed by TGI - streaming is not supported:

      BadRequestError: Error code: 400 - {'error': ['Error in `stream`: `stream` is not supported for this model']}
      

      Disabling streaming also produces an error:

      BadRequestError: Error code: 400 - {'error': " `args[0]`: {'messages': [{'role': 'system', 'content': 'You are a helpful assistant.'}, {'role': 'user', 'content': 'Why is open-source software important?'}], 'model': 'tgi', 'max_tokens': 500} have the wrong format. The should be either of type `str` or type `list`", 'warnings': ["There was an inference error:  `args[0]`: {'messages': [{'role': 'system', 'content': 'You are a helpful assistant.'}, {'role': 'user', 'content': 'Why is open-source software important?'}], 'model': 'tgi', 'max_tokens': 500} have the wrong format. The should be either of type `str` or type `list`"]}
      

      This is even though it works correctly with the current Inference API snippet. Is there a way to detect whether TGI supports the model before suggesting the code snippet?

    • Never loads (even after the timeout)

      InternalServerError: Error code: 503 - {'error': 'Model Qwen/Qwen2-0.5B-Instruct is currently loading', 'estimated_time': 39.522621154785156}
      

      There's probably nothing to do here, since this error also occurs with the current code snippet.

    • Too large

      PermissionDeniedError: Error code: 403 - {'error': 'The model Qwen/Qwen2-72B-Instruct is too large to be loaded automatically (145GB > 10GB). Please use Spaces (https://huggingface.co/spaces) or Inference Endpoints (https://huggingface.co/inference-endpoints).'}
      

      Nothing really to do here either, and the error message is quite descriptive; just adding it here for reference.

  3. I will revert the separation I started with, mainly to keep the diff small for now, and since it's probably overkill for just adding example code for conversational models.

  4. Not sure if there's a good way to detect whether a model supports system prompts or not. One way is to search the chat template for common error messages if a user attempts to use a system prompt.
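
For context, the OpenAI-client approach described in the blog post looks roughly like the sketch below (the model ID and token are placeholders, and this is not the exact snippet the PR generates):

# pip install openai
from openai import OpenAI

# Point the OpenAI client at the model's Messages API endpoint on the Inference API.
client = OpenAI(
	base_url="https://api-inference.huggingface.co/models/meta-llama/Meta-Llama-3-8B-Instruct/v1/",
	api_key="hf_xxx",
)

# Stream a chat completion; "tgi" is the model name used when calling a Messages API endpoint directly.
chat_completion = client.chat.completions.create(
	model="tgi",
	messages=[{"role": "user", "content": "Why is open-source software important?"}],
	max_tokens=500,
	stream=True,
)

for chunk in chat_completion:
	print(chunk.choices[0].delta.content or "", end="")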

@Wauplin (Contributor) commented on Jun 14, 2024

Thanks for looking into it @xenova!

In Python, it's best to promote InferenceClient.chat_completion from the huggingface_hub package. While it's true that pure TGI is OpenAI-compatible, our Inference API service is not. As you've noticed, transformers-backed models do not provide the same interface: transformers handles messages as input (as mentioned in #700 (comment)) but not the full OpenAI specification. InferenceClient.chat_completion takes care of detecting the correct format to send and handles things like retries on HTTP 503.
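
For reference, a minimal non-streaming call with that client looks roughly like this (model ID and token are placeholders):

from huggingface_hub import InferenceClient

client = InferenceClient("meta-llama/Meta-Llama-3-8B-Instruct", token="hf_xxx")

# chat_completion detects the right request format for the backend and retries on HTTP 503.
response = client.chat_completion(
	messages=[{"role": "user", "content": "What is the capital of France?"}],
	max_tokens=500,
)
print(response.choices[0].message.content)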

For JS, I know that @radames also added chat_completion support to our inference.js package (see #645). I haven't tried it myself, but it's supposed to have more or less the same logic as the Python client. @radames, is there anything specific to know?

Not sure if there's a good way to detect whether a model supports system prompts or not. One way is to search the chat template for common error messages if a user attempts to use a system prompt.

Not that I'm aware of, no, except by looking at the config. I think it's also fine not to showcase system prompts at all in code snippets. Describing the Messages API + user/assistant roles is already a big win!

@xenova (Contributor, Author) commented on Jun 15, 2024

Thanks for the resources @Wauplin! Exactly what I was looking for! I've updated the code to use this now :) Here are some sample snippets generated for meta-llama/Meta-Llama-3-8B-Instruct:

JavaScript:

import { HfInference } from "@huggingface/inference";

const inference = new HfInference("hf_xxx");

for await (const chunk of inference.chatCompletionStream({
	model: "meta-llama/Meta-Llama-3-8B-Instruct",
	messages: [{ role: "user", content: "What is the capital of France?" }],
	max_tokens: 500,
})) {
	process.stdout.write(chunk.choices[0]?.delta?.content || "");
}

Python:

from huggingface_hub import InferenceClient

client = InferenceClient(
	"meta-llama/Meta-Llama-3-8B-Instruct",
	token="hf_xxx",
)

for message in client.chat_completion(
	messages=[{"role": "user", "content": "What is the capital of France?"}],
	max_tokens=500,
	stream=True,
):
	print(message.choices[0].delta.content, end="")

cURL:

curl 'https://api-inference.huggingface.co/models/meta-llama/Meta-Llama-3-8B-Instruct/v1/chat/completions' \
-H "Authorization: Bearer hf_xxx" \
-H 'Content-Type: application/json' \
-d '{
	"model": "meta-llama/Meta-Llama-3-8B-Instruct",
	"messages": [{"role": "user", "content": "What is the capital of France?"}],
	"max_tokens": 500,
	"stream": false
}'

Other notes:

  • Should we show the streaming or non-streaming version by default? I can imagine different users wanting different things. Another option is to display the non-streaming version but include the stream parameter so they can switch if they want, although that would change how the output is handled. So, up for discussion. The exception is cURL, where we set stream to false (but still include the parameter in the example code); I just think it's easier to parse a single JSON response than multiple partial messages, since there's no convenient interface like the JS/Python clients.

  • It still seems to fail for non-TGI models (e.g., blenderbot) with a similar error as before:

    Bad request:
    `args[0]`: {'role': 'user', 'content': 'What is the capital of France?'} have the wrong format. The should be either of type `str` or type `list`
    
  • Next, for models that aren't loaded, it just does nothing forever without indicating any loading, progress, or error. 🤔 Is there a way to specify a timeout or something?

  • For the JS snippet, I think it's okay to use process.stdout (server-side only) instead of console.log, since running the snippet in a web environment (e.g., in-browser) would require exposing the access token anyway.

  • Like the example in the InferenceClient docs, I think it's best to only provide a single user message as an example to avoid errors that are raised for models that don't support system roles.

  • Any thoughts on the formatting and parameters shown? I don't want to overwhelm the user with parameters, but it's also good to include the most common ones (like max_tokens).

@Wauplin (Contributor) commented on Jun 17, 2024

Nice wrap-up! The snippets look good to me (at least the Python and cURL ones, but I trust you on the JS one as well).

Should we show the streaming or non-streaming versions by default?

Thoughts on formatting and parameters shown? Don't want to overwhelm the user with parameters, but also good to include the most common ones (like max_tokens).

What you did there is nice, I think. Showcasing only two parameters is OK; it lets the user know they have more options. It's only a small snippet anyway, so we won't showcase everything (the docs are better for that).
And using stream for Python but not for cURL makes sense, yes 👍

Next, for models that aren't loaded, it just does nothing forever without indicating any loading, progress, or error. 🤔 Is there a way to specify a timeout or something?

In Python, one can do

client = InferenceClient(
	"meta-llama/Meta-Llama-3-8B-Instruct",
	token="hf_xxx",
	timeout=30,
)

But I don't think it's that bad to have an infinite timeout. FYI, there is an info log printed when the model is not loaded yet. Info logs are not shown by default, so maybe I could log a warning instead on the Python side.

Like the example in the InferenceClient docs, I think it's best to only provide a single user message as an example to avoid errors that are raised for models that don't support system roles.

👍

It still seems to fail for non-TGI models (e.g., blenderbot) with a similar error as before:

Which model are you talking about? facebook/blenderbot-3B, facebook/blenderbot-400M-distill and facebook/blenderbot_small-90M are all text2text-generation models. They don't have a chat template in their tokenizer_config.json.

EDIT: I just realized the blenderbot models have the conversational tag hardcoded in their model card metadata even though they don't have a chat template => I assume that's why you get an error. So we should either remove this tag from the model cards or add a chat template.
@xenova That's unrelated to your PR though. To test a non-TGI model, you can use microsoft/DialoGPT-medium (spoiler alert: technically "it works", but the output is nonsense 🤷‍♂️).
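
As an aside, a rough way to check whether a repo ships a chat template is to look at its tokenizer_config.json (a sketch assuming huggingface_hub is installed; the repo ID is just an example):

import json
from huggingface_hub import hf_hub_download

# Fetch the tokenizer config and check whether it defines a chat template.
config_path = hf_hub_download("facebook/blenderbot_small-90M", "tokenizer_config.json")
with open(config_path) as f:
	tokenizer_config = json.load(f)

print("chat_template" in tokenizer_config)  # expected False for the blenderbot models above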

@SBrandeis (Contributor) left a review comment:

The code, snippets, and overall logic look sane to me - especially since all text-generation models are now served with TGI (IIUC).

@xenova (Contributor, Author) commented on Jun 24, 2024

Info logs are not shown by default so maybe I could log a warning instead Python-side.

I think this would be good, since in cases where it hangs forever, it would be a good idea to inform the user why 😇
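
(For what it's worth, users can already surface those info logs themselves; a small sketch, assuming huggingface_hub's logging helper:)

from huggingface_hub import logging

# Show INFO-level logs (e.g. the "model is currently loading" message) that are hidden by default.
logging.set_verbosity_info()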

EDIT: I just realized blenderbot models have the conversational tag hardcoded in their model card metadata even though they don't have a chat template => I assume that's why you get an error. So we should either remove this tag from the model cards or add a chat template.

Good point! This won't be an issue now then, since we only consider this path for text-generation + conversational models:
[screenshot of the relevant code]


In that case, I think this PR is pretty much ready - just final reviews left! 😎

xenova marked this pull request as ready for review on June 24, 2024
@Wauplin (Contributor) left a review comment:

Made a final review and it looks good to me 👍 Thanks for all the iterations! Please wait for another approval before merging, though.

Wauplin requested a review from SBrandeis on June 26, 2024
Wauplin changed the title from "[WIP] Improve code snippets" to "Improve Message API code snippets" on Jun 26, 2024
@SBrandeis (Contributor) left a review comment:

Thank you!

xenova merged commit 7004980 into main on Jun 28, 2024
4 checks passed
xenova deleted the improve-code-snippets branch on June 28, 2024
@xenova (Contributor, Author) commented on Jun 28, 2024

Merged! 🚀 Shall we put out a new release of @huggingface/tasks now too?

@Wauplin (Contributor) commented on Jun 28, 2024

You can, yes :) (by triggering it here)
