152 changes: 151 additions & 1 deletion API_DOCS.md
@@ -1 +1,151 @@
Coming soon
# StyleTTS2 HTTP Streaming API Documentation

## Overview

The HTTP Streaming API provides text-to-speech synthesis with real-time audio streaming. The server uses Flask and returns WAV audio data.

## Base URL

```
http://localhost:5000
```

## Endpoints

### GET /

Returns API documentation in HTML format.

---

### POST /api/v1/stream

Synthesizes speech from text and streams the WAV audio back in chunks as it is generated.

**Request Body (form fields, e.g. `application/x-www-form-urlencoded`):**

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `text` | string | Yes | Text to synthesize |
| `voice` | string | Yes | Voice ID (see available voices below) |
| `steps` | integer | No | Number of diffusion steps (default: 7; higher values improve quality but slow down synthesis) |

**Response:**
- Content-Type: `audio/x-wav`
- Streams WAV audio data in chunks
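
Since the body is a WAV stream, the first bytes should contain the RIFF/WAVE header, which tells a client how to decode the samples that follow. A minimal sketch for reading that header from the first chunk, assuming the server sends a canonical PCM header with the `fmt ` chunk directly after the `WAVE` tag (this document does not specify the exact layout). `parse_wav_header` is a hypothetical helper, not part of the API:

```python
import struct


def parse_wav_header(first_bytes: bytes) -> dict:
    """Parse the RIFF/WAVE header at the start of a streamed WAV response.

    Assumes a canonical PCM layout: "RIFF", size, "WAVE", then a "fmt " chunk.
    """
    if len(first_bytes) < 36:
        raise ValueError("need at least 36 bytes to read the fmt chunk")
    riff, _riff_size, wave_id = struct.unpack("<4sI4s", first_bytes[:12])
    if riff != b"RIFF" or wave_id != b"WAVE":
        raise ValueError("not a RIFF/WAVE stream")
    fmt_id, _fmt_size = struct.unpack("<4sI", first_bytes[12:20])
    if fmt_id != b"fmt ":
        raise ValueError("expected a 'fmt ' chunk right after the WAVE tag")
    # fmt body: format tag, channels, sample rate, byte rate, block align, bits per sample
    audio_format, channels, sample_rate, _byte_rate, _align, bits = struct.unpack(
        "<HHIIHH", first_bytes[20:36]
    )
    return {
        "audio_format": audio_format,  # 1 = integer PCM
        "channels": channels,
        "sample_rate": sample_rate,
        "bits_per_sample": bits,
    }
```

For example, calling `parse_wav_header` on the first chunk of a streaming `requests` response (assuming that chunk is at least 36 bytes long) reveals the sample rate before playback starts.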

**Example with curl:**

```bash
curl -X POST http://localhost:5000/api/v1/stream \
  -d "text=Hello, this is a test of the streaming API." \
  -d "voice=f-us-1" \
  -d "steps=7" \
  --output output.wav
```

**Example with Python:**

```python
import requests

response = requests.post(
    "http://localhost:5000/api/v1/stream",
    data={
        "text": "Hello, this is a test.",
        "voice": "f-us-1",
        "steps": 7,
    },
    stream=True,
)
response.raise_for_status()  # fail early on a 400 instead of saving the JSON error as .wav

# Write audio chunks to disk as they arrive
with open("output.wav", "wb") as f:
    for chunk in response.iter_content(chunk_size=8192):
        f.write(chunk)
```
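
The loop above writes chunks as they arrive. When evaluating streaming latency it can help to also record how long the first audio chunk took. A small sketch that works with any iterable of byte chunks, such as `response.iter_content(...)`; `save_stream` is a hypothetical helper name:

```python
import time
from typing import BinaryIO, Iterable, Tuple


def save_stream(chunks: Iterable[bytes], out: BinaryIO) -> Tuple[int, float]:
    """Write streamed audio chunks to `out` as they arrive.

    Returns (total_bytes, seconds until the first non-empty chunk), measured
    from when iteration starts. The latency is NaN if nothing was received.
    """
    start = time.monotonic()
    first_chunk_latency = float("nan")
    total = 0
    for chunk in chunks:
        if not chunk:  # skip empty keep-alive chunks
            continue
        if total == 0:
            first_chunk_latency = time.monotonic() - start
        out.write(chunk)
        total += len(chunk)
    return total, first_chunk_latency
```

With `requests`, call `response.raise_for_status()` first, then `total, latency = save_stream(response.iter_content(chunk_size=8192), f)` inside the `with open(...)` block. Because `requests.post(..., stream=True)` returns once the response headers arrive, the latency measured here only covers the wait for the first body chunk.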

---

### POST /api/v1/static

Synthesizes speech from text and returns the complete WAV file in a single response.

**Request Body (form fields, e.g. `application/x-www-form-urlencoded`):**

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `text` | string | Yes | Text to synthesize |
| `voice` | string | Yes | Voice ID |

**Response:**
- Content-Type: `audio/wav`
- Returns complete WAV file

**Example:**

```bash
curl -X POST http://localhost:5000/api/v1/static \
  -d "text=Hello world" \
  -d "voice=m-us-1" \
  --output output.wav
```
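
Because `/api/v1/static` returns a complete file, its header should contain the final sizes, so Python's standard `wave` module can read it directly. A short sketch for checking the duration of the returned audio (`wav_duration_seconds` is a hypothetical helper); it is less suited to the streaming endpoint, whose header may not carry the final length:

```python
import io
import wave


def wav_duration_seconds(wav_bytes: bytes) -> float:
    """Return the duration in seconds of a complete WAV file held in memory."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as w:
        return w.getnframes() / w.getframerate()
```

For example: `wav_duration_seconds(requests.post('http://localhost:5000/api/v1/static', data={'text': 'Hello world', 'voice': 'm-us-1'}).content)`.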

---

## Available Voices

| Voice ID | Description |
|----------|-------------|
| `f-us-1` | Female US English #1 |
| `f-us-2` | Female US English #2 |
| `f-us-3` | Female US English #3 |
| `f-us-4` | Female US English #4 |
| `m-us-1` | Male US English #1 |
| `m-us-2` | Male US English #2 |
| `m-us-3` | Male US English #3 |
| `m-us-4` | Male US English #4 |
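
Every voice ID above follows a `<gender>-<locale>-<index>` pattern. A hypothetical client-side check (not something the server provides) that rejects unknown voices before a request is sent, assuming this table is the complete list; the server itself remains the source of truth if voices change:

```python
from typing import NamedTuple

# Voice IDs copied from the Available Voices table: f-us-1..4 and m-us-1..4.
KNOWN_VOICES = frozenset(f"{g}-us-{i}" for g in ("f", "m") for i in range(1, 5))


class Voice(NamedTuple):
    gender: str  # "female" or "male"
    locale: str  # e.g. "us"
    index: int


def parse_voice(voice_id: str) -> Voice:
    """Split a voice ID like 'f-us-1' into its parts, rejecting unknown IDs."""
    if voice_id not in KNOWN_VOICES:
        raise ValueError(f"unknown voice {voice_id!r}; choose one of {sorted(KNOWN_VOICES)}")
    gender, locale, index = voice_id.split("-")
    return Voice({"f": "female", "m": "male"}[gender], locale, int(index))
```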

---

## Error Responses

All errors return JSON with an `error` field:

```json
{
  "error": "Missing required fields. Please include \"text\" and \"voice\" in your request."
}
```

**Common errors:**
- `400`: Missing required fields or invalid voice selection
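
Since errors come back as JSON while successful responses are audio, a client should check the status before treating the body as WAV data. A minimal sketch of that check (`check_response` and `StyleTTSAPIError` are hypothetical names); it falls back to the raw body if the error is not valid JSON:

```python
import json


class StyleTTSAPIError(Exception):
    """Raised when the server answers with a non-2xx status."""

    def __init__(self, status: int, message: str):
        super().__init__(f"HTTP {status}: {message}")
        self.status = status
        self.message = message


def check_response(status: int, body: bytes) -> None:
    """Raise StyleTTSAPIError for error responses, using the JSON `error` field if present."""
    if 200 <= status < 300:
        return
    try:
        message = json.loads(body).get("error", "")
    except (ValueError, AttributeError):  # not JSON, or JSON that is not an object
        message = ""
    raise StyleTTSAPIError(status, message or body.decode("utf-8", "replace"))
```

With `requests`, `if not response.ok: check_response(response.status_code, response.content)` avoids reading the body of a successful streamed response.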

---

## Testing

Use the provided test client:

```bash
# List available voices
python test_api_client.py --list-voices

# Check server status
python test_api_client.py --check-server

# Synthesize speech
python test_api_client.py -t "Hello world" -v f-us-1 -o output.wav

# With custom diffusion steps
python test_api_client.py -t "Hello world" -v m-us-2 -o output.wav -s 10
```

---

## Starting the Server

```bash
python api.py
```

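By default the server binds to `0.0.0.0` on port 5000, so it accepts connections on all network interfaces and is reachable locally at `http://localhost:5000`.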
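
The server may take a while to become ready, for example while models load. A sketch that polls `GET /` (which returns the HTML documentation) until it answers with HTTP 200; `wait_for_server` and `http_probe` are hypothetical helpers, and the 120-second default timeout is an arbitrary choice:

```python
import http.client
import time
import urllib.request
from typing import Callable


def http_probe(url: str = "http://localhost:5000/") -> bool:
    """Return True if GET `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=2) as resp:
            return resp.status == 200
    # URLError, HTTPError and socket timeouts are all OSError subclasses
    except (OSError, http.client.HTTPException):
        return False


def wait_for_server(probe: Callable[[], bool] = http_probe,
                    timeout: float = 120.0, interval: float = 1.0) -> bool:
    """Poll `probe` until it succeeds (True) or `timeout` seconds elapse (False)."""
    deadline = time.monotonic() + timeout
    while True:
        if probe():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```
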
47 changes: 41 additions & 6 deletions README.md
@@ -1,8 +1,5 @@
# StyleTTS 2 API

> [!CAUTION]
> The Streaming API is not fully implemented yet.

[Original Repo](https://github.com/yl4579/StyleTTS2) - [CLI Tool](https://github.com/fakerybakery/styletTS2-cli) - **Streaming API**

(GPL licensed due to Phonemizer. Should I switch to OpenPhonemizer and make it MIT-licensed?)
@@ -27,9 +24,9 @@ Online demo: [Hugging Face](https://huggingface.co/spaces/styletts2/styletts2) (
- [x] Add a finetuning script for new speakers with base pre-trained multispeaker models
- [x] REST API
- [x] Importable inference script (PR #78)
- [x] Streaming API and WebSocket support
- [ ] Fix DDP (accelerator) for `train_second.py` **(I have tried everything I could to fix this but had no success, so if you are willing to help, please see [#7](https://github.com/yl4579/StyleTTS2/issues/7))**
- [ ] Pip package
- [ ] Demo of audio streaming

## Pre-requisites
1. Python >= 3.7
@@ -56,16 +53,54 @@ For LibriTTS, you will need to combine train-clean-360 with train-clean-100 and

## Streaming API

You can use StyleTTS 2 in your projects by launching the HTTP API with streaming support. Synthesize speech from your frontend apps or other services by making HTTP requests to the API server. The server uses Flask.

API documentation may be found in the [`API_DOCS.md`](API_DOCS.md) file.

Launch the server:

```bash
python api.py
```

## WebSocket API

For real-time TTS streaming with chunked text input and low-latency audio output, use the WebSocket API powered by FastAPI.

**Features:**
- Real-time bidirectional communication
- Chunked text input (send text incrementally)
- Base64-encoded MP3 audio output
- GPU queue management for concurrent requests
- Idle timeout and connection management
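
Audio arrives as Base64-encoded MP3. The exact message schema is described in `WEBSOCKET_DOCS.md`, which is not part of this diff, so the sketch below assumes each audio message is a JSON object carrying the Base64 payload under a hypothetical `audio` key; adjust `field` to the real name:

```python
import base64
import json
from typing import Iterable


def collect_mp3(messages: Iterable[str], field: str = "audio") -> bytes:
    """Concatenate Base64-encoded MP3 payloads from JSON WebSocket messages.

    `field` is a hypothetical key name; check WEBSOCKET_DOCS.md for the real schema.
    Messages without that key (status updates, errors, etc.) are skipped.
    """
    audio = bytearray()
    for raw in messages:
        msg = json.loads(raw)
        payload = msg.get(field) if isinstance(msg, dict) else None
        if payload:
            audio += base64.b64decode(payload)
    return bytes(audio)
```

MP3 frames can usually be concatenated, so writing the returned bytes to a `.mp3` file generally yields a playable file.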

**Quick Start:**

```bash
# Start WebSocket server (default port 8765)
python ws_server.py

# Or with custom port
python ws_server.py 9000
```

**Endpoints:**
- WebSocket: `ws://localhost:8765/ws/tts`
- Health Check: `http://localhost:8765/health`
- Voice List: `http://localhost:8765/voices`

**Test the WebSocket API:**

```bash
# Simple test
python test_ws_client.py --text "Hello world" --voice f-us-1 --output output.mp3

# Chunked streaming test
python test_ws_client.py --text "This is a longer text" --voice m-us-2 --chunked
```
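
The `--chunked` mode sends text incrementally. How `test_ws_client.py` splits its input is not documented here; as one possible approach, a client could cut text at sentence boundaries so each piece is a natural unit to synthesize. A sketch (`split_for_streaming` is a hypothetical helper):

```python
import re
from typing import List


def split_for_streaming(text: str, max_chars: int = 200) -> List[str]:
    """Group sentences into pieces of at most `max_chars` characters.

    A single sentence longer than `max_chars` is kept whole rather than cut mid-word.
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= max_chars or not current:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

For example, `split_for_streaming("Hello world. This is a test! Is it working? Yes.", max_chars=20)` returns `["Hello world.", "This is a test!", "Is it working? Yes."]`.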

Full WebSocket documentation: [`WEBSOCKET_DOCS.md`](WEBSOCKET_DOCS.md)

## Python API

You can now use StyleTTS 2 directly in your programs! A `pip`-compatible package is coming soon.