LinTO-diarization

LinTO-diarization is the LinTO service for speaker diarization. It can estimate the number of speakers in a recording, and identify known speakers when samples of their voices are provided.

LinTO-diarization can either be used as a standalone diarization service or deployed as a micro-service.


Pre-requisites

Docker

The diarization service requires Docker up and running.

(micro-service) Service broker and shared folder

In job mode, the diarization service's only entry point is tasks posted on a Redis message broker. Furthermore, to prevent large audio files from transiting through the message broker, diarization uses a shared storage folder mounted on /opt/audio.

Deploy

linto-diarization can be deployed:

  • As a standalone diarization service through an HTTP API.
  • As a micro-service connected to a message broker.

1- The first step is to build or pull the image:

git clone https://github.com/linto-ai/linto-diarization.git
cd linto-diarization
docker build . -t linto-diarization-pyannote:latest -f pyannote/Dockerfile 

or

docker pull lintoai/linto-diarization-pyannote

HTTP

1- Fill the .env

An example of .env file is provided in pyannote/.envdefault.

Parameters:

| Variables | Description | Example |
|:----------|:------------|:--------|
| SERVING_MODE | (Required) Specify launch mode | http |
| CONCURRENCY | Number of worker(s) in addition to the main worker | 0 \| 1 \| 2 \| ... |
| DEVICE | Device to use for the embedding model (by default, GPU/CUDA is used if it is available, CPU otherwise) | cpu \| cuda \| cuda:0 |
| DEVICE_CLUSTERING | Device to use for clustering (same as DEVICE by default) | cpu \| cuda \| cuda:0 |
| DEVICE_IDENTIFICATION | Device to use for speaker identification, if it is enabled (same as DEVICE by default) | cpu \| cuda \| cuda:0 |
| NUM_THREADS | Maximum number of threads to use for work running on CPU | 1 \| 4 \| ... |
| CUDA_VISIBLE_DEVICES | GPU device index to use when running on GPU/CUDA. We also recommend setting CUDA_DEVICE_ORDER=PCI_BUS_ID on multi-GPU machines | 0 \| 1 \| 2 \| ... |
| SPEAKER_SAMPLES_FOLDER | (default: /opt/speaker_samples) Folder containing audio samples of the target speakers | /path/to/folder |
| SPEAKER_PRECOMPUTED_FOLDER | (default: /opt/speaker_precomputed) Folder where precomputed embeddings of the target speakers are stored | /path/to/folder |
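For illustration, a minimal .env for HTTP mode might look like this (the values are examples, not defaults; adjust them to your hardware):

```
SERVING_MODE=http
CONCURRENCY=2
DEVICE=cuda
NUM_THREADS=4
SPEAKER_SAMPLES_FOLDER=/opt/speaker_samples
SPEAKER_PRECOMPUTED_FOLDER=/opt/speaker_precomputed
```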

2- Run the container

This will run a container providing an HTTP API bound to port <HOST_SERVING_PORT> on the host (for instance 8080):

docker run --rm \
-v <SHARED_FOLDER>:/opt/audio \
-p <HOST_SERVING_PORT>:80 \
--env-file .env \
linto-diarization-pyannote:latest

If you want to enable speaker identification, you have to provide samples of the target speakers' voices: either one folder per speaker (named after the speaker, containing that speaker's audio files), or one audio file per speaker (named after the speaker). The parent folder of the samples must then be mounted as a volume in the container under /opt/speaker_samples (or a custom folder set with the SPEAKER_SAMPLES_FOLDER environment variable).

docker run ... -v </path/to/speaker/samples/folder>:/opt/speaker_samples
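For example, the samples folder could be organised as follows (speaker names here are hypothetical):

```
speaker_samples/
├── alice/            # folder name = speaker name
│   ├── sample1.wav
│   └── sample2.wav
└── bob.wav           # file name = speaker name
```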

When speaker identification is enabled, you can also mount a volume (empty at first) on /opt/speaker_precomputed (or a custom folder set with the SPEAKER_PRECOMPUTED_FOLDER environment variable), where the precomputed embeddings of the speakers will be stored. This avoids an initialisation time at each new docker run, as long as the set of target speakers remains the same or only grows.

docker run ... -v </path/to/precomputed/embeddings/folder>:/opt/speaker_precomputed

You may also want to add --gpus all to enable GPU capabilities, and set CUDA_VISIBLE_DEVICES if several GPU cards are available.

Using celery

LinTO-diarization can be deployed as a micro-service using celery. Used this way, the container spawns a Celery worker waiting for diarization tasks on a message broker.

You need a message broker up and running at SERVICES_BROKER.

1- Fill the .env

An example of .env file is provided in pyannote/.envdefault.

Parameters are the same as for the HTTP API, with the addition of the following:

| Variables | Description | Example |
|:----------|:------------|:--------|
| SERVING_MODE | (Required) Specify launch mode | task |
| SERVICES_BROKER | Service broker URI | redis://my_redis_broker:6379 |
| BROKER_PASS | Service broker password (leave empty if there is no password) | my_password |
| QUEUE_NAME | Override the generated queue's name (see Queue name below) | my_queue |
| SERVICE_NAME | Service's name | diarization-ml |
| LANGUAGE | Language code as a BCP-47 code | en-US, *, or languages separated by "\|" |
| MODEL_INFO | Human readable description of the model | Multilingual diarization model |
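Putting these together, a minimal .env for task mode might look like this (values are examples taken from the table above):

```
SERVING_MODE=task
SERVICES_BROKER=redis://my_redis_broker:6379
BROKER_PASS=
SERVICE_NAME=diarization-ml
LANGUAGE=*
MODEL_INFO=Multilingual diarization model
CONCURRENCY=1
```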

2- Fill the docker-compose.yml

#docker-compose.yml

version: '3.7'

services:
  diarization-service:
    image: linto-diarization-pyannote:latest
    volumes:
      - /path/to/shared/folder:/opt/audio
    env_file: .env
    deploy:
      replicas: 1
    networks:
      - your-net

networks:
  your-net:
    external: true

3- Run with docker compose

docker stack deploy --resolve-image always --compose-file docker-compose.yml your_stack

Queue name:

By default the service queue name is generated as SERVICE_NAME.

The queue name can be overridden using the QUEUE_NAME env variable.

Service discovery:

As a micro-service, the instance registers itself in the service registry for discovery. The service information is stored as a JSON object in redis's db0 under the id service:{HOST_NAME}.

The following information is registered:

{
  "service_name": $SERVICE_NAME,
  "host_name": $HOST_NAME,
  "service_type": "diarization",
  "service_language": $LANGUAGE,
  "queue_name": $QUEUE_NAME,
  "version": "1.2.0", # This repository's version
  "info": $MODEL_INFO,
  "last_alive": 65478213,
  "concurrency": 1
}
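As a sketch of how a client could consume such an entry, the JSON can be decoded and the queue name read from it (the values below are illustrative; in a real deployment the string would be fetched from redis db0):

```python
import json

# In practice this string would be fetched from redis db0, e.g. with redis-py:
#   redis.Redis(host="my_redis_broker", port=6379, db=0).get("service:" + host_name)
registration = json.loads("""
{
  "service_name": "diarization-ml",
  "host_name": "worker-1",
  "service_type": "diarization",
  "service_language": "*",
  "queue_name": "diarization-ml",
  "version": "1.2.0",
  "info": "Multilingual diarization model",
  "last_alive": 65478213,
  "concurrency": 1
}
""")

# The queue_name field tells clients where to post diarization tasks
print(registration["queue_name"])
```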

Usages

HTTP API

/healthcheck

Returns the state of the API

Method: GET

Returns "1" if healthcheck passes.

/diarization

Diarization API

Input arguments are:

  • file: A Wave file
  • speaker_count: (integer - optional) Number of speakers. If empty, diarization will determine the number of speakers automatically through clustering.
  • max_speaker: (integer - optional) Maximum number of speakers when speaker_count is unknown.
  • speaker_names: (string - optional) List of target speaker names for speaker identification (only relevant if speaker samples are provided). Possible values are
    • empty string "": no speaker identification
    • wild card "*": speaker identification for all speakers
    • list of speaker names in json format (ex: "["speaker1", ..., "speakerN"]") or separated by | (ex: "speaker1|...|speakerN"): speaker identification for the listed speakers only
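The three accepted forms of speaker_names can be produced as follows (a small sketch; the helper name and speaker names are hypothetical):

```python
import json

def format_speaker_names(names, style="json"):
    """Build the speaker_names argument for a diarization request."""
    if not names:
        return ""           # empty string: no speaker identification
    if names == "*":
        return "*"          # wild card: identify all speakers
    if style == "json":
        return json.dumps(list(names))  # e.g. '["alice", "bob"]'
    return "|".join(names)              # e.g. 'alice|bob'

print(format_speaker_names(["alice", "bob"]))          # json-format list
print(format_speaker_names(["alice", "bob"], "pipe"))  # pipe-separated list
```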

The response (application/json) is a JSON object structured as follows:

{
  "speakers": [
      {"spk_id": "spk5", "duration": 2.2, "nbr_seg": 1},
      ...
  ],
  "segments": [
      {"seg_id": 1, "spk_id": "spk5", "seg_begin": 0.0, "seg_end": 2.2},
      ...
  ]
}
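As a sketch of how a client might consume this response, each speaker's total speaking time can be recomputed from the segments field (the response below is the example above, hard-coded for illustration):

```python
# Example /diarization response, as documented above
response = {
    "speakers": [{"spk_id": "spk5", "duration": 2.2, "nbr_seg": 1}],
    "segments": [
        {"seg_id": 1, "spk_id": "spk5", "seg_begin": 0.0, "seg_end": 2.2},
    ],
}

# Sum segment lengths per speaker; this should match each speaker's "duration"
durations = {}
for seg in response["segments"]:
    length = seg["seg_end"] - seg["seg_begin"]
    durations[seg["spk_id"]] = durations.get(seg["spk_id"], 0.0) + length

print(durations)
```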

/docs

The /docs route offers an OpenAPI/Swagger interface.

Through the message broker

The diarization worker accepts requests with the following arguments:

  • file: (str) The relative path of the file in the shared folder.
  • speaker_count: (int, default None) Fixed number of speakers.
  • max_speaker: (int, default None) Maximum number of speakers when speaker_count=None.
  • speaker_names: (string, optional) List of target speaker names for speaker identification (only relevant if speaker samples are provided). Possible values are
    • empty string "": no speaker identification
    • wild card "*": speaker identification for all speakers
    • list of speaker names in json format (ex: "["speaker1", ..., "speakerN"]") or separated by | (ex: "speaker1|...|speakerN"): speaker identification for the listed speakers only
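Putting the arguments together, a task payload could be built along these lines (the file name is hypothetical, and the commented celery call is an assumption about the deployment, not taken from the source; check your worker's actual task and queue names):

```python
# Arguments for a diarization request posted through the broker.
# The shared folder is mounted on /opt/audio inside the worker,
# so "file" is a path relative to that folder.
task_kwargs = {
    "file": "meeting.wav",   # hypothetical file in the shared folder
    "speaker_count": None,   # let the service cluster automatically
    "max_speaker": 4,
    "speaker_names": "*",    # identify all speakers with known samples
}

# With celery installed, the task could then be sent along these lines
# (task and queue names below are assumptions):
#   from celery import Celery
#   app = Celery(broker="redis://my_redis_broker:6379")
#   result = app.send_task("diarization_task", kwargs=task_kwargs,
#                          queue="diarization-ml")

print(task_kwargs)
```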

Return format

On a successful diarization, the returned object is a JSON object structured as follows:

{
  "speakers": [
      {"spk_id": "spk5", "duration": 2.0, "nbr_seg": 1},
      ...
  ],
  "segments": [
      {"seg_id": 1, "spk_id": "spk5", "seg_begin": 0.0, "seg_end": 2.0},
      ...
  ]
}
  • The speakers field contains an array of speakers, each with its overall duration and number of segments.
  • The segments field contains each audio segment with the associated speaker id, start time, and end time.

Test

Curl

You can test your HTTP API using curl:

curl -X POST "http://YOUR_SERVICE:PORT/diarization" -H  "accept: application/json" -H  "Content-Type: multipart/form-data" -F "file=@YOUR_FILE.wav;type=audio/x-wav" -F "speaker_count=NUMBER_OF_SPEAKERS"

License

This project is developed under the AGPLv3 License (see LICENSE).

Acknowledgments

  • PyAnnote diarization framework (License MIT).