Simple tools for dealing with transcripts generated by AWS Transcribe
This class is specifically for handling multi-speaker JSON files generated by the Amazon Web Services (AWS) "Transcribe" service. It will not work with single-speaker files.
Simply instantiate the class with a JSON file object and the names of the speakers, in order. (AWS lists them as 'spk_0', 'spk_1', etc.) AWS breaks the file up into segments, each of which roughly corresponds to a short burst of spoken words. After a pause, a new segment starts. With the resulting instance you can use the print_segment method to retrieve all the words from a specific segment, along with the speaker who spoke them. If you'd like to print a whole transcript, find the total number of segments with the count_segment method and call print_segment once for each segment.
"Line" numbers can be included by setting "show_seg_num" to True before using the print_segment method.
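If you're curious how segments and speaker labels fit together, here's a rough sketch of the idea. It is not the actual code of this class, just a minimal illustration under a few assumptions: the SAMPLE dictionary is hand-written to mimic the general shape of multi-speaker Transcribe output, and the segment_lines function is a hypothetical helper that exists only for this example.

```python
# Hand-written sample that mimics the general shape of multi-speaker
# Transcribe output (an assumption for illustration, not real job output).
SAMPLE = {
    "results": {
        "speaker_labels": {
            "segments": [
                {"speaker_label": "spk_1", "start_time": "0.0", "end_time": "0.9",
                 "items": [{"start_time": "0.0", "end_time": "0.4"},
                           {"start_time": "0.5", "end_time": "0.9"}]},
                {"speaker_label": "spk_0", "start_time": "1.5", "end_time": "1.9",
                 "items": [{"start_time": "1.5", "end_time": "1.9"}]},
            ]
        },
        "items": [
            {"type": "pronunciation", "start_time": "0.0", "end_time": "0.4",
             "alternatives": [{"content": "All"}]},
            {"type": "pronunciation", "start_time": "0.5", "end_time": "0.9",
             "alternatives": [{"content": "right"}]},
            {"type": "punctuation", "alternatives": [{"content": "."}]},
            {"type": "pronunciation", "start_time": "1.5", "end_time": "1.9",
             "alternatives": [{"content": "Okay"}]},
            {"type": "punctuation", "alternatives": [{"content": "."}]},
        ],
    }
}


def segment_lines(data, *speaker_names):
    """Hypothetical helper: return one 'Speaker: words' string per segment."""
    results = data["results"]
    # Map 'spk_0', 'spk_1', ... to the names supplied, in order.
    names = {f"spk_{i}": name for i, name in enumerate(speaker_names)}
    segments = results["speaker_labels"]["segments"]

    # Remember which segment each spoken word belongs to, keyed by start time.
    owner = {}
    for idx, seg in enumerate(segments):
        for item in seg["items"]:
            owner[item["start_time"]] = idx

    texts = [""] * len(segments)
    current = None
    for item in results["items"]:
        content = item["alternatives"][0]["content"]
        if item["type"] == "pronunciation":
            current = owner.get(item["start_time"], current)
            if current is None:
                continue  # word appears before any known segment
            texts[current] += (" " if texts[current] else "") + content
        elif current is not None:
            # Punctuation has no timestamp; attach it to the preceding word.
            texts[current] += content

    return [
        f"{names.get(seg['speaker_label'], seg['speaker_label'])}: {text}"
        for seg, text in zip(segments, texts)
    ]


for line in segment_lines(SAMPLE, "Deponent", "Lawyer"):
    print(line)
```

Punctuation entries carry no timestamps in this sample, so the sketch simply attaches each one to whichever segment received the previous word.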
There's also a method, get_segment_start, that retrieves the start_time of a segment (in seconds).
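Since the start time comes back in seconds, you may want to turn it into a clock-style timestamp when lining the text up against the audio. Here's a small sketch of one way to do that. The format_timestamp function is a hypothetical helper and not part of this class, and it accepts either a number or a numeric string because the exact return type of get_segment_start isn't assumed here.

```python
def format_timestamp(seconds):
    """Hypothetical helper: convert a start time in seconds to HH:MM:SS."""
    total = int(float(seconds))  # drop fractional seconds
    hours, remainder = divmod(total, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


print(format_timestamp("3725.42"))  # prints 01:02:05
```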
Here's an example:
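# Open the Transcribe JSON output and name the speakers in order (spk_0, spk_1, ...)
f = open('/Users/efstone/Downloads/depo_audio_6052.json', 'r')
transcript = AwsTranscript(f, 'Deponent', 'Lawyer', 'unknown1', 'unknown2')
# Print the first eight segments
for i in range(8):
    print(transcript.print_segment(i))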
Which will generate the following:
Lawyer: And you also understand that you cannot consult with your attorneys before answering the question unless it regards a matter of privilege.
Lawyer: Just All
Lawyer: All right.
Lawyer: You talk a little bit about your personal history when
Deponent: when you say personal. Start went from bird dust on El Richard. Grow up.
Deponent: No, the city of New Orleans.
Lawyer: Okay.
Lawyer: And where'd you go to school
Amazon Transcribe isn't perfect, but it's SUPER cheap and this is DAMN handy when you've got to wait a few weeks for, say, a deposition transcript to be prepared, but need to start picking through the specifics of the deposition ASAP.
I hope someone finds this as useful as I did!