
Add whisperx support (including diarization) #123

Merged · 24 commits merged into main from add-whisperx-support on Jan 31, 2025

Conversation

@philmcmahon (Contributor) commented Jan 23, 2025

What does this change?

This adds support for using https://github.com/m-bain/whisperX on the transcription workers.

The main benefit of whisperx is that it supports diarization, or speaker recognition. This is a frequently requested feature of the transcription tool.

The downside of whisperx is that it needs to be run on a GPU instance. So far I've been running it on a g5.xlarge instance. These instances are roughly 2x the price of the c7g.4xlarge instances we've been using to run whisper.cpp on. I think we might get some savings moving to a g4dn.2xlarge instance, but performance testing would be required for that. See here for the cost figures. I think that the cost increase may be a little more than 2x, as I imagine that GPU instances are less available on the spot market.

Update: Performance is actually better on a g4dn.2xlarge than on the g5.xlarge instance, and 25% cheaper at $0.75/hr - so only a 30% increase over our existing instances (the spot issue probably remains)
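
For reference, the hourly rates implied by those ratios (back-of-envelope from the figures above, not quoted prices):

```
g4dn.2xlarge    $0.75/hr (stated)
g5.xlarge     ≈ $0.75 / (1 - 0.25) ≈ $1.00/hr
c7g.4xlarge   ≈ $0.75 / 1.30       ≈ $0.58/hr
```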

As well as cost, the other downside of whisperx is that getting it running (and using the GPU) is a right ole faff. To get it working I needed to:

I'm not yet certain that whisperx has sufficient performance improvements such that we'll want to use it for all transcripts - it may be we only want to use it where the user has requested diarization. With that in mind, I've set whisperx up as a totally separate pipeline, with its own SQS queue and autoscaling group. Currently the use of whisperx (and the gpu instances) is controlled by a parameter store parameter, and is enabled for DEV/CODE and disabled for PROD.
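
To make that wiring concrete, here's a minimal sketch of how a parameter store flag can route jobs between the two queues. This is not the actual implementation; the parameter name, queue URLs, and job shape are all hypothetical:

```ts
import { GetParameterCommand, SSMClient } from '@aws-sdk/client-ssm';
import { SendMessageCommand, SQSClient } from '@aws-sdk/client-sqs';

const ssm = new SSMClient({ region: 'eu-west-1' });
const sqs = new SQSClient({ region: 'eu-west-1' });

// Hypothetical parameter: 'true' on DEV/CODE, 'false' on PROD.
const useWhisperX = async (stage: string): Promise<boolean> => {
	const result = await ssm.send(
		new GetParameterCommand({ Name: `/${stage}/transcription-service/use-whisperx` }),
	);
	return result.Parameter?.Value === 'true';
};

// Send the job to the whisperx (GPU) queue or the existing whisper.cpp (CPU) queue.
const enqueueJob = async (
	stage: string,
	jobJson: string,
	queueUrls: { gpu: string; cpu: string },
): Promise<void> => {
	const gpu = await useWhisperX(stage);
	await sqs.send(
		new SendMessageCommand({
			QueueUrl: gpu ? queueUrls.gpu : queueUrls.cpu,
			MessageBody: jobJson,
		}),
	);
};
```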

I decided to install whisperx onto the base AMI rather than using a containerised version as we are doing with whisper.cpp. Whilst you can fairly easily allow docker containers to make use of the GPU, I didn't see much benefit in adding an extra layer of virtualisation. This change has led to some awkwardness, as 'useWhisperX' also means 'don't use the whisper.cpp container for ffmpeg or transcription' - it will be good if whisperX proves superior across the board so that we can remove all this conditional logic.

Reviewing this PR

The most interesting changes are to the worker app - everything else is really just wiring. Even the worker app change mostly amounts to running whisperx instead of whisper.cpp.
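
For a flavour of that, here's a rough sketch of a worker shelling out to whisperx and collecting stderr for the metadata parsing mentioned further down this page; the flags and helper names are illustrative, not the real code:

```ts
import { spawn } from 'node:child_process';

// Illustrative only: the real worker's flags and helpers will differ.
const runWhisperX = (wavPath: string, outputDir: string): Promise<{ stderr: string }> =>
	new Promise((resolve, reject) => {
		// whisperx logs progress and metadata to stderr, which we collect for parsing.
		const proc = spawn('whisperx', [wavPath, '--diarize', '--output_dir', outputDir]);
		let stderr = '';
		proc.stderr.on('data', (chunk: Buffer) => {
			stderr += chunk.toString();
		});
		proc.on('error', reject);
		proc.on('close', (code) => {
			if (code === 0) {
				resolve({ stderr });
			} else {
				reject(new Error(`whisperx exited with code ${code}`));
			}
		});
	});
```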

How to test

I've tested this on CODE using an AMI baked on amigo CODE - I'll need to get guardian/amigo#1607 merged before we can get a PROD AMI.

How can we measure success?

Happy users thanks to the new diarization feature!

Current performance tests:

I don't think diarization adds much to the time.

I think the performance boost would be much higher if we could pre-warm the GPU instance somehow, but I need to do more research there.

@philmcmahon (Contributor, Author) commented Jan 27, 2025

An update - the offline issue is fixed in m-bain/whisperX#1021 and 163505e

Some more performance measurements:

@philmcmahon philmcmahon marked this pull request as ready for review January 27, 2025 15:12
@philmcmahon philmcmahon requested a review from a team as a code owner January 27, 2025 15:12
"worker::build": "npm run build --workspace worker; npm run build --workspace worker",
"worker::package": "npm run package --workspace worker",
"worker::start": "AWS_REGION=eu-west-1 STAGE=DEV npm run start --workspace worker",
"worker::start": "APP=transcription-service-gpu-worker AWS_REGION=eu-west-1 STAGE=DEV npm run start --workspace worker",
Contributor Author:

todo: add worker-gpu::start, worker-cpu::start

Contributor Author:

done
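
(The resulting scripts presumably look something like the following; the script names come from the todo above and the gpu APP value from the diff, but the cpu APP value is my guess:)

```json
"worker-gpu::start": "APP=transcription-service-gpu-worker AWS_REGION=eu-west-1 STAGE=DEV npm run start --workspace worker",
"worker-cpu::start": "APP=transcription-service-worker AWS_REGION=eu-west-1 STAGE=DEV npm run start --workspace worker"
```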

```ts
	name: `${mediaDownloadApp}-temp-volume`,
};
mediaDownloadTask.taskDefinition.addVolume(downloadVolume);
mediaDownloadTask.taskDefinition.addVolume(tempVolume);
```
Contributor Author:

todo - pull this out into a different PR as it's unrelated to whisperX

Contributor Author:

done

@zekehuntergreen (Contributor) left a comment:

Looks good 👏

Thanks for the walkthrough

```diff
@@ -7,18 +7,18 @@
 "prettier:check": "prettier . --check",
 "prettier:fix": "prettier . --write",
 "api::build": "npm run build --workspace api",
-"api::start": "AWS_REGION=eu-west-1 STAGE=DEV npm run start --workspace api",
+"api::start": "APP=api AWS_REGION=eu-west-1 STAGE=DEV npm run start --workspace api",
```
Contributor:

non-blocking: it looks like we only look at config.app.app in the worker so might be simpler to put an env variable like GPU=true or processor type?

Contributor Author:

I went for APP because we are already tagging the instances with APP - it's just the start script that needs changing, whereas with a new variable I'd need to tag the instances accordingly

Comment on lines 378 to 381
```ts
machineImage: MachineImage.genericLinux({
	'eu-west-1': workerAmi.valueAsString,
}),
instanceType: InstanceType.of(InstanceClass.C7G, InstanceSize.XLARGE4),
```
Contributor:

might be a little easier to follow if this object only holds the props in common between the two launch templates

Contributor:

might not be worth the effort now if we get rid of non-gpu pipeline soon

Contributor Author:

You're right, I was just being lazy. Resolved in c34debc
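
(For readers following along, the refactor presumably looks something like hoisting the shared props and spreading them into each template. This is a sketch under assumed names, not the contents of c34debc; it assumes `workerAmi`/`gpuWorkerAmi` CfnParameters and a surrounding construct in scope:)

```ts
import {
	InstanceClass,
	InstanceSize,
	InstanceType,
	LaunchTemplate,
	MachineImage,
} from 'aws-cdk-lib/aws-ec2';

// Props common to both launch templates (illustrative).
const commonLaunchTemplateProps = {
	// ...anything shared between the CPU and GPU workers
};

const cpuLaunchTemplate = new LaunchTemplate(this, 'WorkerLaunchTemplate', {
	...commonLaunchTemplateProps,
	machineImage: MachineImage.genericLinux({ 'eu-west-1': workerAmi.valueAsString }),
	instanceType: InstanceType.of(InstanceClass.C7G, InstanceSize.XLARGE4),
});

const gpuLaunchTemplate = new LaunchTemplate(this, 'GpuWorkerLaunchTemplate', {
	...commonLaunchTemplateProps,
	machineImage: MachineImage.genericLinux({ 'eu-west-1': gpuWorkerAmi.valueAsString }),
	instanceType: InstanceType.of(InstanceClass.G4DN, InstanceSize.XLARGE2),
});
```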

```ts
const metadata = extractWhisperXStderrData(result.stderr);
logger.info('Whisper finished successfully', metadata);
return {
	fileName: `${fileName}`,
```
Contributor:

Suggested change:

```diff
-	fileName: `${fileName}`,
+	fileName,
```

```ts
	return Promise.resolve('auto');
}
const dlParams = whisperParams(true, whisperBaseParams.wavPath);
const { metadata } = await runWhisper(whisperBaseParams, dlParams);
```
Contributor:

this was the case before your change, but are we running whisper twice (or thrice if we're translating)? once here and once in runTranscription?

Contributor:

might be worth adding a comment to explain

Contributor Author:

I've added quite a bit of extra documentation here; it's confusing because this bit of code is only used by Giant: 68967d9
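
(Piecing the two snippets in this thread together, the flow being documented is roughly the following. This is a reconstruction for readers, not the committed code, and the metadata field name is my guess:)

```ts
// Giant-only path: decide the language code to pass to the transcription run.
const getLanguageCode = async (
	whisperBaseParams: WhisperBaseParams,
	whisperX: boolean,
): Promise<LanguageCode> => {
	if (whisperX) {
		// whisperx is slow to start up, so rather than a separate detection pass
		// we let it detect the language during the main transcription run.
		return 'auto';
	}
	// whisper.cpp: run a cheap detect-language-only pass first (the `true` flag);
	// runTranscription then performs the full run with the detected language.
	const dlParams = whisperParams(true, whisperBaseParams.wavPath);
	const { metadata } = await runWhisper(whisperBaseParams, dlParams);
	return metadata.detectedLanguage; // hypothetical field name
};
```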

Contributor:

thanks!

```ts
	whisperX: boolean,
): Promise<LanguageCode> => {
	if (whisperX) {
		return Promise.resolve('auto');
```
Contributor:

does whisperx not take a language arg?

Contributor Author:

whisperx takes so long to start up that I didn't think there was any point running a pre-pass of language detection, so I decided to just detect it every time

```diff
@@ -0,0 +1,136 @@
+import torchaudio
```
Contributor:

might be worth adding a comment linking to amigo role where this is used
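
(Something like this at the top of the file would do it; a sketch, where the role reference is my guess based on the amigo PR linked in the description:)

```python
# This script is baked into the GPU worker AMI by the amigo role added in
# guardian/amigo#1607; it is not invoked directly from this repo's build.
import torchaudio
```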

@philmcmahon philmcmahon merged commit 854105b into main Jan 31, 2025
4 checks passed
@philmcmahon philmcmahon deleted the add-whisperx-support branch January 31, 2025 10:01