From 689ac5e8c3ba8885561acb3dbc76bd9cf4354a1c Mon Sep 17 00:00:00 2001
From: MathieuBsqt
Date: Mon, 29 Sep 2025 18:23:27 +0200
Subject: [PATCH 1/6] new guide for audio models

---
 .../guide.en-gb.md | 641 ++++++++++++++++++
 .../meta.yaml      |   2 +
 2 files changed, 643 insertions(+)
 create mode 100644 pages/public_cloud/ai_machine_learning/endpoints_guide_08_audio_transcriptions/guide.en-gb.md
 create mode 100644 pages/public_cloud/ai_machine_learning/endpoints_guide_08_audio_transcriptions/meta.yaml

diff --git a/pages/public_cloud/ai_machine_learning/endpoints_guide_08_audio_transcriptions/guide.en-gb.md b/pages/public_cloud/ai_machine_learning/endpoints_guide_08_audio_transcriptions/guide.en-gb.md
new file mode 100644
index 00000000000..60e5d8651a5
--- /dev/null
+++ b/pages/public_cloud/ai_machine_learning/endpoints_guide_08_audio_transcriptions/guide.en-gb.md
@@ -0,0 +1,641 @@
---
title: AI Endpoints - Speech to Text
excerpt: Learn how to transcribe audio files with OVHcloud AI Endpoints
updated: 2025-09-30
---

> [!primary]
>
> AI Endpoints is covered by the **[OVHcloud AI Endpoints Conditions](https://storage.gra.cloud.ovh.net/v1/AUTH_325716a587c64897acbef9a4a4726e38/contracts/48743bf-AI_Endpoints-ALL-1.1.pdf)** and the **[OVHcloud Public Cloud Special Conditions](https://storage.gra.cloud.ovh.net/v1/AUTH_325716a587c64897acbef9a4a4726e38/contracts/d2a208c-Conditions_particulieres_OVH_Stack-WE-9.0.pdf)**.
>

## Introduction

[AI Endpoints](https://endpoints.ai.cloud.ovh.net/) is a serverless platform provided by OVHcloud that offers easy access to a selection of world-renowned, pre-trained AI models. The platform is designed to be simple, secure, and intuitive, making it an ideal solution for developers who want to enhance their applications with AI capabilities without extensive AI expertise or concerns about data privacy.

**Speech to Text** is a powerful feature that enables the conversion of spoken language into written text.

The Speech to Text endpoints on AI Endpoints allow you to easily integrate this technology into your applications, enabling you to transcribe audio files with high accuracy. Our endpoints support various audio formats and provide flexible configuration options to suit your specific use cases.

## Objective

This documentation provides an overview of the Speech to Text endpoints offered on [AI Endpoints](https://endpoints.ai.cloud.ovh.net/).

Visit our [Catalog](https://endpoints.ai.cloud.ovh.net/catalog) to find out which models are compatible with Audio Analysis.

The examples provided in this guide can be used with one of the following environments:

> [!tabs]
> **Python**
>>
>> A [Python](https://www.python.org/) environment with the [openai client](https://pypi.org/project/openai/), [requests](https://pypi.org/project/requests/), and pydantic libraries installed.
>>
>> ```sh
>> pip install openai requests pydantic
>> ```
>>
> **JavaScript**
>>
>> A [Node.js](https://nodejs.org/en) environment with the [openai](https://www.npmjs.com/package/openai) library.
>> It can be installed using [NPM](https://www.npmjs.com/):
>>
>> ```sh
>> npm install openai
>> ```
>>
> **cURL**
>>
>> A standard terminal, with [cURL](https://curl.se/) installed on the system.
>>

*These examples use the [Whisper-large-v3](https://endpoints.ai.cloud.ovh.net/models/whisper-large-v3) model.*

## Authentication & Rate Limiting

All the examples provided in this guide use anonymous authentication, which is simpler to use but may be subject to rate limiting.
If you wish to enable authentication using your own token, simply specify your API key within the requests.

Follow the instructions in the [AI Endpoints - Getting Started](/pages/public_cloud/ai_machine_learning/endpoints_guide_01_getting_started) guide for more information on authentication.
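
In practice, enabling authentication simply means uncommenting the credential lines shown in the examples below. Here is a minimal, illustrative sketch, assuming your API key is stored in the `OVH_AI_ENDPOINTS_ACCESS_TOKEN` environment variable used throughout this guide:

```python
import os
from openai import OpenAI

# Assumption: the API key has been exported as an environment variable beforehand.
token = os.getenv("OVH_AI_ENDPOINTS_ACCESS_TOKEN")

# Direct HTTP calls (requests, cURL): send the key as a Bearer token header.
headers = {"Authorization": f"Bearer {token}"}

# OpenAI client: pass the key when instantiating the client.
client = OpenAI(
    base_url="https://oai.endpoints.kepler.ai.cloud.ovh.net/v1/",
    api_key=token,
)
```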

## Request Body

### Parameters Overview

The request body for the audio transcription endpoint is of type `multipart/form-data` and includes the following fields:

| Parameter | Required | Type | Allowed Values / Format | Default | Description |
|-----------|----------|------|-------------------------|---------|-------------|
| **file** | Yes | binary | `mp3`, `mp4`, `aac`, `m4a`, `wav`, `flac`, `ogg`, `opus`, `webm`, `mpeg`, `mpga` | - | The **audio file object (not file name)** to transcribe. |
| **chunking_strategy** | No | `string`/`server_vad object`/`null` | - | null | Strategy for dividing the audio into chunks. More details [here](#chunking-strategy). |
| **diarize** | No | `boolean`/`null` | `true`/`false` | false | Enables speaker separation in the transcript. When set to `true`, the system splits the audio into segments based on speakers and adds labels such as "Speaker 0" and "Speaker 1", so you can see who said what in conversations such as interviews, meetings, or phone calls. More details [here](#diarization). |
| **language** | No | `string`/`null` | [ISO-639-1 format](https://en.wikipedia.org/wiki/List_of_ISO_639_language_codes) | - | The language spoken in the input audio. Providing it can improve transcription accuracy and reduce latency (e.g. `en` for English, `fr` for French, `de` for German, `es` for Spanish, `zh` for Chinese, `ar` for Arabic, ...). If not provided, the system will attempt automatic language detection, which may be slightly slower and less accurate in some cases. |
| **model** | No | `string`/`null` | ID of the model to use | - | Specifies the model to use for transcription. Useful when using our [unified endpoint](/pages/public_cloud/ai_machine_learning/endpoints_guide_07_virtual_models). |
| **prompt** | No | `string`/`null` | - | - | Text used to guide the model's style, translate the transcript to English, or continue a previous audio segment. The prompt must be written in the same language as the audio. More details about prompt usage [here](#prompt). |
| **response_format** | No | `enum`/`null` | `json`, `text`, `srt`, `verbose_json`, `vtt` | `verbose_json` | Determines how the transcription data is returned. For detailed examples of each output type, visit the [Response Formats](#response-formats) section. |
| **stream** | No | `boolean`/`null` | `true`/`false` | false | If set to `true`, the model response data is streamed to the client. Currently not supported for Whisper models. |
| **timestamp_granularities** | No | `array`/`null` | `["segment"]`, `["word"]`, `["word", "segment"]` | `["segment"]` | Controls the level of detail in the timestamps provided in the transcription. More details [here](#timestamp-granularities). |
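
Apart from `file`, every field is optional and falls back to the default listed above. As a minimal, illustrative sketch (reusing the hypothetical `my_audio.mp3` file and the `whisper-large-v3` model from the examples in the next section), a request relying on those defaults could look like this:

```python
import requests

# Only the audio file (and, here, the model name) is sent; every other field
# falls back to its default: automatic language detection, temperature 0,
# verbose_json output, segment-level timestamps.
with open("my_audio.mp3", "rb") as f:
    response = requests.post(
        "https://oai.endpoints.kepler.ai.cloud.ovh.net/v1/audio/transcriptions",
        files={"file": f},
        data={"model": "whisper-large-v3"},
    )

print(response.json()["text"])
```

The next section shows complete requests that set these parameters explicitly.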

### Example Usage

Now that you know which parameters are available, let’s look at how to put them into practice. Below are sample requests in **Python**, **cURL** and **JavaScript**:

> [!tabs]
> **Python (using requests)**
>>
>> ```python
>> import os
>> import requests
>>
>> url = "https://oai.endpoints.kepler.ai.cloud.ovh.net/v1/audio/transcriptions"
>>
>> audio_file_path = "my_audio.mp3"
>>
>> headers = {
>>     "accept": "application/json",
>>     # "Authorization": f"Bearer {os.getenv('OVH_AI_ENDPOINTS_ACCESS_TOKEN')}",
>> }
>>
>> files = {"file": open(audio_file_path, "rb")}
>>
>> data = {
>>     "model": "whisper-large-v3",
>>     "language": "en",
>>     "temperature": "0",
>>     "prompt": "<|transcribe|>",
>>     "diarize": "false",
>>     "timestamp_granularities": ["segment"],
>>     "response_format": "verbose_json"
>> }
>>
>> response = requests.post(url, headers=headers, files=files, data=data)
>>
>> if response.status_code == 200:
>>     # Handle response
>>     print(response.json())
>> else:
>>     print("Error:", response.status_code, response.text)
>> ```
>>
> **Python (using OpenAI client)**
>>
>> ```python
>> from openai import OpenAI
>> import os
>>
>> url = "https://oai.endpoints.kepler.ai.cloud.ovh.net/v1/"
>> audio_file_path = "my_audio.mp3"
>>
>> client = OpenAI(
>>     base_url=url,
>>     # api_key=os.getenv('OVH_AI_ENDPOINTS_ACCESS_TOKEN'),
>> )
>>
>> with open(audio_file_path, "rb") as f:
>>     transcript = client.audio.transcriptions.create(
>>         file=f,
>>         model="whisper-large-v3",
>>         language="en",
>>         temperature=0,
>>         prompt="<|transcribe|>",
>>         timestamp_granularities=["segment"],
>>         response_format="verbose_json"
>>     )
>>
>> print(transcript)
>> ```
>>
>> > [!warning]
>> >
>> > **Warning**: The `diarize` parameter is not supported when using the OpenAI client library.
>> >
>> > To use diarization, you must make a direct HTTP request using `requests` or `cURL` with `diarize` set to `true`.
>> >
>>
> **cURL**
>>
>> ```sh
>> curl -X POST "https://oai.endpoints.kepler.ai.cloud.ovh.net/v1/audio/transcriptions" \
>>      -F "file=@my_audio.mp3" \
>>      -F "model=whisper-large-v3" \
>>      -F "language=en" \
>>      -F "temperature=0" \
>>      -F "prompt=<|transcribe|>" \
>>      -F "diarize=false" \
>>      -F "timestamp_granularities[]=segment" \
>>      -F "response_format=verbose_json"
>> ```
>>
>> To [**authenticate with your API key**](/pages/public_cloud/ai_machine_learning/endpoints_guide_01_getting_started), add an Authorization header:
>>
>> ```sh
>> -H "Authorization: Bearer $OVH_AI_ENDPOINTS_ACCESS_TOKEN" \
>> ```
>>
> **JavaScript (using OpenAI client)**
>>
>> ```javascript
>> import OpenAI from "openai";
>> import fs from "fs";
>>
>> const openai = new OpenAI({
>>     baseURL: "https://oai.endpoints.kepler.ai.cloud.ovh.net/v1/",
>>     // apiKey: process.env.OVH_AI_ENDPOINTS_ACCESS_TOKEN,
>> });
>>
>> const transcript = await openai.audio.transcriptions.create({
>>     file: fs.createReadStream("my_audio.mp3"),
>>     model: "whisper-large-v3",
>>     language: "en",
>>     temperature: 0,
>>     prompt: "<|transcribe|>",
>>     timestamp_granularities: ["segment"],
>>     response_format: "verbose_json"
>> });
>>
>> console.log(transcript);
>> ```
>>
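
Whichever client you use, the fields of the default `verbose_json` response (shown in the output example below) are easy to post-process. Here is a minimal, illustrative sketch that pretty-prints each transcribed segment with its timestamps, assuming a response already parsed into a Python dictionary (for instance `response.json()` from the `requests` example above):

```python
def print_segments(result: dict) -> None:
    """Pretty-print the segments of a verbose_json transcription response."""
    print(f"Detected language: {result['language']} ({result['duration']}s of audio)")
    for segment in result["segments"]:
        # Each segment carries start/end timestamps (in seconds) and the transcribed text.
        print(f"[{segment['start']:7.2f} - {segment['end']:7.2f}] {segment['text']}")


# For example, with the requests-based call shown above:
# print_segments(response.json())
```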

**Output example**

By default, the transcription endpoint returns output in `verbose_json` format. This includes detailed metadata such as language, segments, tokens, and diarization information:

```json
{
  "task": "transcribe",
  "success": true,
  "language": "en",
  "duration": 4.46975,
  "text": "My name is Octave and I am working at OVHcloud",
  "words": [],
  "segments": [
    {
      "id": 1,
      "seek": 0,
      "start": 0,
      "end": 3.48,
      "text": "My name is Octave and I am working at OVHcloud",
      "tokens": [
        50365,
        2588,
        275,
        ...
      ],
      "temperature": 0,
      "avg_logprob": -0.38066408,
      "compression_ratio": 0.9,
      "no_speech_prob": 0
    }
  ],
  "diarization": [],
  "usage": {
    "type": "duration",
    "duration": 5
  }
}
```

For **detailed examples** of each available output type, see the [Response Formats](#response-formats) section.

### Parameters Details

While the previous overview gives a quick reference, certain parameters require more context to understand how and when to use them.

#### Diarization

The `diarize` parameter enables speaker separation in the generated transcript. When set to `true`, the system labels different voices as `Speaker 0`, `Speaker 1`, etc.

This is useful for meetings, debates, or interviews where multiple people are speaking.

> [!warning]
> **Warning**:
> - This parameter is only available with the default `verbose_json` [response format](#response-formats). Using any other format will result in an error.
> - `diarize` is not supported when using the OpenAI client libraries. You must use a direct HTTP request with `requests`, `cURL`, or another HTTP client.

**Output Example**: Transcribing an audio file with `diarize` enabled:

Request:

```json
{
  "file": "