
Add whisperx support (including diarization) #123

Merged · 24 commits merged into main from add-whisperx-support on Jan 31, 2025

Conversation

@philmcmahon (Contributor) commented Jan 23, 2025

What does this change?

This adds support for using https://github.com/m-bain/whisperX on the transcription workers.

The main benefit of whisperx is that it supports diarization, or speaker recognition. This is a frequently requested feature of the transcription tool.

The downside of whisperx is that it needs to be run on a GPU instance. So far I've been running it on a g5.xlarge instance. These instances are roughly 2x the price of the c7g.4xlarge instances we've been using to run whisper.cpp on. I think we might get some savings moving to a g4dn.2xlarge instance, but performance testing would be required for that. See here for the cost figures. I think that the cost increase may be a little more than 2x, as I imagine that GPU instances are less available on the spot market.

Update: Performance is actually better on a g4dn.2xlarge than on the g5.xlarge instance, and 25% cheaper at $0.75/hr - so only a 30% increase over our existing instances (the spot issue probably remains)
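
For reference, the hourly rates implied by those ratios (back-of-envelope from the figures above, not quoted prices):

```
g4dn.2xlarge    $0.75/hr (stated)
g5.xlarge     ≈ $0.75 / (1 - 0.25) ≈ $1.00/hr
c7g.4xlarge   ≈ $0.75 / 1.30       ≈ $0.58/hr
```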

As well as cost, the other downside of whisperx is that getting it running (and using the GPU) is a right ole faff. To get it working I needed to:

I'm not yet certain that whisperx has sufficient performance improvements such that we'll want to use it for all transcripts - it may be we only want to use it where the user has requested diarization. With that in mind, I've set whisperx up as a totally separate pipeline, with its own SQS queue and autoscaling group. Currently the use of whisperx (and the gpu instances) is controlled by a parameter store parameter, and is enabled for DEV/CODE and disabled for PROD.
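
To make that wiring concrete, here's a minimal sketch of how a parameter store flag can route jobs between the two queues. This is not the actual implementation; the parameter name, queue URLs, and job shape are all hypothetical:

```ts
import { GetParameterCommand, SSMClient } from '@aws-sdk/client-ssm';
import { SendMessageCommand, SQSClient } from '@aws-sdk/client-sqs';

const ssm = new SSMClient({ region: 'eu-west-1' });
const sqs = new SQSClient({ region: 'eu-west-1' });

// Hypothetical parameter: 'true' on DEV/CODE, 'false' on PROD.
const useWhisperX = async (stage: string): Promise<boolean> => {
	const result = await ssm.send(
		new GetParameterCommand({ Name: `/${stage}/transcription-service/use-whisperx` }),
	);
	return result.Parameter?.Value === 'true';
};

// Send the job to the whisperx (GPU) queue or the existing whisper.cpp (CPU) queue.
const enqueueJob = async (
	stage: string,
	jobJson: string,
	queueUrls: { gpu: string; cpu: string },
): Promise<void> => {
	const gpu = await useWhisperX(stage);
	await sqs.send(
		new SendMessageCommand({
			QueueUrl: gpu ? queueUrls.gpu : queueUrls.cpu,
			MessageBody: jobJson,
		}),
	);
};
```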

I decided to install whisperx onto the base AMI rather than using a containerised version as we are doing with whisper.cpp. Whilst you can fairly easily allow docker containers to make use of the GPU, I didn't see much benefit in adding an extra layer of virtualisation. This change has led to some awkwardness, as 'useWhisperX' also means 'don't use the whisper.cpp container for ffmpeg or transcription' - it will be good if whisperX proves superior across the board so that we can remove all this conditional logic.

Reviewing this PR

The most interesting changes are to the worker app - everything else is really just wiring. Even the worker app change mostly amounts to running whisperx instead of whisper.cpp.
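
For a flavour of that, here's a rough sketch of a worker shelling out to whisperx and collecting stderr for the metadata parsing mentioned further down this page; the flags and helper names are illustrative, not the real code:

```ts
import { spawn } from 'node:child_process';

// Illustrative only: the real worker's flags and helpers will differ.
const runWhisperX = (wavPath: string, outputDir: string): Promise<{ stderr: string }> =>
	new Promise((resolve, reject) => {
		// whisperx logs progress and metadata to stderr, which we collect for parsing.
		const proc = spawn('whisperx', [wavPath, '--diarize', '--output_dir', outputDir]);
		let stderr = '';
		proc.stderr.on('data', (chunk: Buffer) => {
			stderr += chunk.toString();
		});
		proc.on('error', reject);
		proc.on('close', (code) => {
			if (code === 0) {
				resolve({ stderr });
			} else {
				reject(new Error(`whisperx exited with code ${code}`));
			}
		});
	});
```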

How to test

I've tested this on CODE using an AMI baked on amigo CODE - I'll need to get guardian/amigo#1607 merged before we can get a PROD AMI.

How can we measure success?

Happy users thanks to the new diarization feature!

Current performance tests:

I don't think diarization adds much to the time.

I think the performance boost would be much higher if we could pre-warm the GPU instance somehow, but I need to do more research there.

@philmcmahon (Contributor, Author) commented Jan 27, 2025

An update - the offline issue is fixed in m-bain/whisperX#1021 and 163505e

Some more performance measurements:

@philmcmahon philmcmahon marked this pull request as ready for review January 27, 2025 15:12
@philmcmahon philmcmahon requested a review from a team as a code owner January 27, 2025 15:12
"worker::build": "npm run build --workspace worker; npm run build --workspace worker",
"worker::package": "npm run package --workspace worker",
"worker::start": "AWS_REGION=eu-west-1 STAGE=DEV npm run start --workspace worker",
"worker::start": "APP=transcription-service-gpu-worker AWS_REGION=eu-west-1 STAGE=DEV npm run start --workspace worker",
Contributor Author:

todo: add worker-gpu::start, worker-cpu::start

Contributor Author:

done
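
(The resulting scripts presumably look something like the following; the script names come from the todo above and the gpu APP value from the diff, but the cpu APP value is my guess:)

```json
"worker-gpu::start": "APP=transcription-service-gpu-worker AWS_REGION=eu-west-1 STAGE=DEV npm run start --workspace worker",
"worker-cpu::start": "APP=transcription-service-worker AWS_REGION=eu-west-1 STAGE=DEV npm run start --workspace worker"
```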

```ts
	name: `${mediaDownloadApp}-temp-volume`,
};
mediaDownloadTask.taskDefinition.addVolume(downloadVolume);
mediaDownloadTask.taskDefinition.addVolume(tempVolume);
```
Contributor Author:

todo - pull this out into a different PR as it's unrelated to whisperX

Contributor Author:

done

@zekehuntergreen (Contributor) left a comment:

Looks good 👏

Thanks for the walkthrough

```diff
@@ -7,18 +7,18 @@
 "prettier:check": "prettier . --check",
 "prettier:fix": "prettier . --write",
 "api::build": "npm run build --workspace api",
-"api::start": "AWS_REGION=eu-west-1 STAGE=DEV npm run start --workspace api",
+"api::start": "APP=api AWS_REGION=eu-west-1 STAGE=DEV npm run start --workspace api",
```
Contributor:

non-blocking: it looks like we only look at config.app.app in the worker so might be simpler to put an env variable like GPU=true or processor type?

Contributor Author:

I went for APP because we are already tagging the instances with APP - it's just the start script that needs changing, whereas with a new variable I'd need to tag the instances accordingly

Comment on lines 378 to 381
```ts
machineImage: MachineImage.genericLinux({
	'eu-west-1': workerAmi.valueAsString,
}),
instanceType: InstanceType.of(InstanceClass.C7G, InstanceSize.XLARGE4),
```
Contributor:

might be a little easier to follow if this object only holds the props in common between the two launch templates

Contributor:

might not be worth the effort now if we get rid of non-gpu pipeline soon

Contributor Author:

You're right, I was just being lazy. Resolved in c34debc
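
(For readers following along, the refactor presumably looks something like hoisting the shared props and spreading them into each template. This is a sketch under assumed names, not the contents of c34debc; it assumes `workerAmi`/`gpuWorkerAmi` CfnParameters and a surrounding construct in scope:)

```ts
import {
	InstanceClass,
	InstanceSize,
	InstanceType,
	LaunchTemplate,
	MachineImage,
} from 'aws-cdk-lib/aws-ec2';

// Props common to both launch templates (illustrative).
const commonLaunchTemplateProps = {
	// ...anything shared between the CPU and GPU workers
};

const cpuLaunchTemplate = new LaunchTemplate(this, 'WorkerLaunchTemplate', {
	...commonLaunchTemplateProps,
	machineImage: MachineImage.genericLinux({ 'eu-west-1': workerAmi.valueAsString }),
	instanceType: InstanceType.of(InstanceClass.C7G, InstanceSize.XLARGE4),
});

const gpuLaunchTemplate = new LaunchTemplate(this, 'GpuWorkerLaunchTemplate', {
	...commonLaunchTemplateProps,
	machineImage: MachineImage.genericLinux({ 'eu-west-1': gpuWorkerAmi.valueAsString }),
	instanceType: InstanceType.of(InstanceClass.G4DN, InstanceSize.XLARGE2),
});
```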

```ts
const metadata = extractWhisperXStderrData(result.stderr);
logger.info('Whisper finished successfully', metadata);
return {
	fileName: `${fileName}`,
```
Contributor:

Suggested change:

```diff
-	fileName: `${fileName}`,
+	fileName,
```

```ts
	return Promise.resolve('auto');
}
const dlParams = whisperParams(true, whisperBaseParams.wavPath);
const { metadata } = await runWhisper(whisperBaseParams, dlParams);
```
Contributor:

this was the case before your change, but are we running whisper twice (or thrice if we're translating)? once here and once in runTranscription?

Contributor:

might be worth adding a comment to explain

Contributor Author:

I've added quite a bit of extra documentation here; it's confusing because this bit of code is only used by Giant: 68967d9
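
(Piecing the two snippets in this thread together, the flow being documented is roughly the following. This is a reconstruction for readers, not the committed code, and the metadata field name is my guess:)

```ts
// Giant-only path: decide the language code to pass to the transcription run.
const getLanguageCode = async (
	whisperBaseParams: WhisperBaseParams,
	whisperX: boolean,
): Promise<LanguageCode> => {
	if (whisperX) {
		// whisperx is slow to start up, so rather than a separate detection pass
		// we let it detect the language during the main transcription run.
		return 'auto';
	}
	// whisper.cpp: run a cheap detect-language-only pass first (the `true` flag);
	// runTranscription then performs the full run with the detected language.
	const dlParams = whisperParams(true, whisperBaseParams.wavPath);
	const { metadata } = await runWhisper(whisperBaseParams, dlParams);
	return metadata.detectedLanguage; // hypothetical field name
};
```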

Contributor:

thanks!

```ts
	whisperX: boolean,
): Promise<LanguageCode> => {
	if (whisperX) {
		return Promise.resolve('auto');
```
Contributor:

does whisperx not take a language arg?

Contributor Author:

whisperx takes so long to start up that I didn't think there was any point running a pre-pass of language detection, so I decided to just detect it every time

```diff
@@ -0,0 +1,136 @@
+import torchaudio
```
Contributor:

might be worth adding a comment linking to amigo role where this is used
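
(Something like this at the top of the file would do it; a sketch, where the role reference is my guess based on the amigo PR linked in the description:)

```python
# This script is baked into the GPU worker AMI by the amigo role added in
# guardian/amigo#1607; it is not invoked directly from this repo's build.
import torchaudio
```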

@philmcmahon philmcmahon merged commit 854105b into main Jan 31, 2025
4 checks passed
@philmcmahon philmcmahon deleted the add-whisperx-support branch January 31, 2025 10:01