Improve Message API code snippets #700

Merged: xenova merged 4 commits into main from improve-code-snippets on Jun 28, 2024

Conversation

@xenova (Contributor) commented on May 22, 2024

This PR is the first step towards improving auto-generated code snippets, focusing mainly on chat model inputs.

Highlights of the PR:

  • JS snippets were missing the content type header ("Content-Type": "application/json")
  • Code adapted from https://huggingface.co/blog/tgi-messages-api
  • Moved snippet generation code into separate folders, though I think we can eventually move it all into tasks/[task]/snippet.ts. The reason against keeping it all in a single file (previously snippets/inputs.ts) is that it will grow in complexity as we improve code snippets across all other tasks.
  • Some models don't support system messages and will either throw an error or ignore the system message. How should we handle this? (EDIT: Fixed by only specifying a user message by default, which almost all models should support.)

xenova added 2 commits May 22, 2024 12:50
Align with new text-generation generator and make codebase more future-proof (when we improve inputs across other tasks)
@julien-c (Member) commented:
For me, we would just change the default input for conversational to the following:

messages = [
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
    {"role": "user", "content": "Who are you?"},
]

and let the API and/or libraries handle the rest, no?

(That's what @Wauplin, @SBrandeis, and @mishig25 have been doing on the huggingface.js/inference, widgets, and huggingface_hub side, I think.)

@xenova (Contributor, Author) commented on Jun 13, 2024

Finally getting back to this! Just a few notes:

  1. The code I'm adding is adapted from https://huggingface.co/blog/tgi-messages-api, which uses the OpenAI (py/js) library for inference. This requires the user to install the appropriate library (pip install openai or npm i openai), which I will note in a comment just above the import. Is this the best approach? (A rough sketch of this approach is included after this list.)

  2. There are cases where the model isn't available for some reason. I came across these while testing:

    • For certain models like facebook/blenderbot_small-90M - presumably those not backed by TGI - streaming is not supported:

      BadRequestError: Error code: 400 - {'error': ['Error in `stream`: `stream` is not supported for this model']}
      

      Disabling streaming also produces an error:

      BadRequestError: Error code: 400 - {'error': " `args[0]`: {'messages': [{'role': 'system', 'content': 'You are a helpful assistant.'}, {'role': 'user', 'content': 'Why is open-source software important?'}], 'model': 'tgi', 'max_tokens': 500} have the wrong format. The should be either of type `str` or type `list`", 'warnings': ["There was an inference error:  `args[0]`: {'messages': [{'role': 'system', 'content': 'You are a helpful assistant.'}, {'role': 'user', 'content': 'Why is open-source software important?'}], 'model': 'tgi', 'max_tokens': 500} have the wrong format. The should be either of type `str` or type `list`"]}
      

      This is even though it works correctly with the current Inference API snippet. Is there a way to detect whether TGI supports the model before suggesting the code snippet?

    • Never loads (even after the timeout)

      InternalServerError: Error code: 503 - {'error': 'Model Qwen/Qwen2-0.5B-Instruct is currently loading', 'estimated_time': 39.522621154785156}
      

      There's probably nothing to do here, since this error also occurs with the current code snippet.

    • Too large

      PermissionDeniedError: Error code: 403 - {'error': 'The model Qwen/Qwen2-72B-Instruct is too large to be loaded automatically (145GB > 10GB). Please use Spaces (https://huggingface.co/spaces) or Inference Endpoints (https://huggingface.co/inference-endpoints).'}
      

      Nothing really to do here either, and the error message is quite descriptive; just adding it here for reference.

  3. I will revert the separation I started with, mainly to keep the diff small for now, and since it's probably overkill for just adding example code for conversational models.

  4. Not sure if there's a good way to detect whether a model supports system prompts or not. One way is to search the chat template for common error messages if a user attempts to use a system prompt.
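
For context, the OpenAI-client approach described in the blog post looks roughly like the sketch below (the model ID and token are placeholders, and this is not the exact snippet the PR generates):

# pip install openai
from openai import OpenAI

# Point the OpenAI client at the model's Messages API endpoint on the Inference API.
client = OpenAI(
	base_url="https://api-inference.huggingface.co/models/meta-llama/Meta-Llama-3-8B-Instruct/v1/",
	api_key="hf_xxx",
)

# Stream a chat completion; "tgi" is the model name used when calling a Messages API endpoint directly.
chat_completion = client.chat.completions.create(
	model="tgi",
	messages=[{"role": "user", "content": "Why is open-source software important?"}],
	max_tokens=500,
	stream=True,
)

for chunk in chat_completion:
	print(chunk.choices[0].delta.content or "", end="")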

@Wauplin (Contributor) commented on Jun 14, 2024

Thanks for looking into it @xenova!

In Python, it's best to promote InferenceClient.chat_completion from the huggingface_hub package. While it's true that pure TGI is OpenAI-compatible, our Inference API service is not. As you've noticed, transformers-backed models do not provide the same interface: transformers handles messages as input (as mentioned in #700 (comment)) but not the full OpenAI specification. InferenceClient.chat_completion takes care of detecting the correct format to send and handles things like retries on HTTP 503.
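
For reference, a minimal non-streaming call with that client looks roughly like this (model ID and token are placeholders):

from huggingface_hub import InferenceClient

client = InferenceClient("meta-llama/Meta-Llama-3-8B-Instruct", token="hf_xxx")

# chat_completion detects the right request format for the backend and retries on HTTP 503.
response = client.chat_completion(
	messages=[{"role": "user", "content": "What is the capital of France?"}],
	max_tokens=500,
)
print(response.choices[0].message.content)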

For JS, I know that @radames also added chat_completion support to our inference.js package (see #645). I haven't tried it myself, but it's supposed to have more or less the same logic as the Python client. @radames, is there anything specific to know?

Not sure if there's a good way to detect whether a model supports system prompts or not. One way is to search the chat template for common error messages if a user attempts to use a system prompt.

Not that I'm aware of, no, except by looking at the config. I think it's also fine not to showcase system prompts at all in code snippets. Describing the Messages API + user/assistant roles is already a big win!

@xenova (Contributor, Author) commented on Jun 15, 2024

Thanks for the resources @Wauplin! Exactly what I was looking for! I've updated the code to use this now :) Here are some sample snippets generated for meta-llama/Meta-Llama-3-8B-Instruct:

JavaScript:

import { HfInference } from "@huggingface/inference";

const inference = new HfInference("hf_xxx");

for await (const chunk of inference.chatCompletionStream({
	model: "meta-llama/Meta-Llama-3-8B-Instruct",
	messages: [{ role: "user", content: "What is the capital of France?" }],
	max_tokens: 500,
})) {
	process.stdout.write(chunk.choices[0]?.delta?.content || "");
}

Python:

from huggingface_hub import InferenceClient

client = InferenceClient(
	"meta-llama/Meta-Llama-3-8B-Instruct",
	token="hf_xxx",
)

for message in client.chat_completion(
	messages=[{"role": "user", "content": "What is the capital of France?"}],
	max_tokens=500,
	stream=True,
):
	print(message.choices[0].delta.content, end="")

cURL:

curl 'https://api-inference.huggingface.co/models/meta-llama/Meta-Llama-3-8B-Instruct/v1/chat/completions' \
-H "Authorization: Bearer hf_xxx" \
-H 'Content-Type: application/json' \
-d '{
	"model": "meta-llama/Meta-Llama-3-8B-Instruct",
	"messages": [{"role": "user", "content": "What is the capital of France?"}],
	"max_tokens": 500,
	"stream": false
}'

Other notes:

  • Should we show the streaming or non-streaming version by default? I can imagine different users wanting different things. Another option is to display the non-streaming version but include the stream parameter so they can switch if they want, although that would change how the output is handled. So, up for discussion. The exception is cURL, where we set stream to false (but still include the parameter in the example code); I just think it's easier to parse a single JSON response than multiple partial messages, since there's no convenient interface like the JS/Python clients.

  • It still seems to fail for non-TGI models (e.g., blenderbot) with a similar error as before:

    Bad request:
    `args[0]`: {'role': 'user', 'content': 'What is the capital of France?'} have the wrong format. The should be either of type `str` or type `list`
    
  • Next, for models that aren't loaded, it just does nothing forever without indicating any loading, progress, or error. 🤔 Is there a way to specify a timeout or something?

  • For the JS snippet, I think it's okay to use process.stdout (server-side only) instead of console.log, since running the snippet in a web environment (e.g., in-browser) would require exposing the access token anyway.

  • Like the example in the InferenceClient docs, I think it's best to only provide a single user message as an example to avoid errors that are raised for models that don't support system roles.

  • Any thoughts on the formatting and parameters shown? I don't want to overwhelm the user with parameters, but it's also good to include the most common ones (like max_tokens).

@Wauplin (Contributor) commented on Jun 17, 2024

Nice wrap-up! The snippets look good to me (at least the Python and cURL ones, but I trust you on the JS one as well).

Should we show the streaming or non-streaming versions by default?

Thoughts on formatting and parameters shown? Don't want to overwhelm the user with parameters, but also good to include the most common ones (like max_tokens).

What you did there is nice, I think. Showcasing only two parameters is OK; it lets the user know they have more options. It's only a small snippet anyway, so we won't showcase everything (the docs are better for that).
And using stream for Python but not for cURL makes sense, yes 👍

Next, for models that aren't loaded, it just does nothing forever without indicating any loading, progress, or error. 🤔 Is there a way to specify a timeout or something?

In Python, one can do

client = InferenceClient(
	"meta-llama/Meta-Llama-3-8B-Instruct",
	token="hf_xxx",
	timeout=30,
)

But I don't think it's that bad to have an infinite timeout. FYI, there is an info log printed when the model is not loaded yet. Info logs are not shown by default, so maybe I could log a warning instead on the Python side.

Like the example in the InferenceClient docs, I think it's best to only provide a single user message as an example to avoid errors that are raised for models that don't support system roles.

👍

It still seems to fail for non-TGI models (e.g., blenderbot) with a similar error as before:

Which model are you talking about? facebook/blenderbot-3B, facebook/blenderbot-400M-distill and facebook/blenderbot_small-90M are all text2text-generation models. They don't have a chat template in their tokenizer_config.json.

EDIT: I just realized the blenderbot models have the conversational tag hardcoded in their model card metadata even though they don't have a chat template => I assume that's why you get an error. So we should either remove this tag from the model cards or add a chat template.
@xenova That's unrelated to your PR though. To test a non-TGI model, you can use microsoft/DialoGPT-medium (spoiler alert: technically "it works", but the output is nonsense 🤷‍♂️).
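
As an aside, a rough way to check whether a repo ships a chat template is to look at its tokenizer_config.json (a sketch assuming huggingface_hub is installed; the repo ID is just an example):

import json
from huggingface_hub import hf_hub_download

# Fetch the tokenizer config and check whether it defines a chat template.
config_path = hf_hub_download("facebook/blenderbot_small-90M", "tokenizer_config.json")
with open(config_path) as f:
	tokenizer_config = json.load(f)

print("chat_template" in tokenizer_config)  # expected False for the blenderbot models above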

@SBrandeis (Contributor) left a review comment:

The code, snippets, and overall logic look sane to me - especially since all text-generation models are now served with TGI (IIUC).

@xenova (Contributor, Author) commented on Jun 24, 2024

Info logs are not shown by default so maybe I could log a warning instead Python-side.

I think this would be good, since in cases where it hangs forever, it would be a good idea to inform the user why 😇
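
(For what it's worth, users can already surface those info logs themselves; a small sketch, assuming huggingface_hub's logging helper:)

from huggingface_hub import logging

# Show INFO-level logs (e.g. the "model is currently loading" message) that are hidden by default.
logging.set_verbosity_info()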

EDIT: I just realized blenderbot models have the conversational tag hardcoded in their model card metadata even though they don't have a chat template => I assume that's why you get an error. So we should either remove this tag from the model cards or add a chat template.

Good point! This won't be an issue now then, since we only consider this path for text-generation + conversational models:
[screenshot of the relevant code]


In that case, I think this PR is pretty much ready - just final reviews left! 😎

xenova marked this pull request as ready for review on June 24, 2024
@Wauplin (Contributor) left a review comment:

Made a final review and it looks good to me 👍 Thanks for all the iterations! Please wait for another approval before merging, though.

Wauplin requested a review from SBrandeis on June 26, 2024
Wauplin changed the title from "[WIP] Improve code snippets" to "Improve Message API code snippets" on Jun 26, 2024
@SBrandeis (Contributor) left a review comment:

Thank you!

xenova merged commit 7004980 into main on Jun 28, 2024
4 checks passed
xenova deleted the improve-code-snippets branch on June 28, 2024
@xenova (Contributor, Author) commented on Jun 28, 2024

Merged! 🚀 Shall we put out a new release of @huggingface/tasks now too?

@Wauplin (Contributor) commented on Jun 28, 2024

You can, yes :) (by triggering it here)
