This corpus contains 1332 hours of conversational speech in 47 languages. Scraped from 106 public chat groups, it supports studies of the effects of device, language, speaker, and speech variability on the performance of speaker recognition and automatic language detection systems.
The corpus is hosted on Google Drive. To download it, complete the Letter of Consent here and send a signed copy, along with your Gmail address, to ali.janalizadeh@outlook.com.
JSpeech contains 452,007 audio messages scraped from public groups, comprising a total of 1332 hours of conversational audio data. The discussions in these groups are unstructured and involve multiple speakers. JSpeech is a multilingual corpus with speech data in 47 languages from more than 12,140 speakers. The most notable feature of this audio data is the variety of uncontrolled environments surrounding the speakers, which makes it useful for developing speech technologies that are robust to different kinds of background noise.
The audio data has been downloaded directly from Telegram using the Telethon API in OGG format. Metadata of each file is stored in an SQLite database.
To convert the audio to WAV format, run the following commands while in the directory containing the OGG files:
apt-get install ffmpeg
ffmpeg -i audio.ogg audio.wav
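The command above converts a single file. To convert the whole directory, a simple loop like the following sketch works (it assumes the files use the `.ogg` extension and writes a `.wav` file next to each one):

```shell
# Batch-convert every OGG file in the current directory to WAV.
# "${f%.ogg}" strips the .ogg suffix, so audio.ogg becomes audio.wav.
for f in *.ogg; do
  ffmpeg -i "$f" "${f%.ogg}.wav"
done
```

The original OGG files are left untouched, so the conversion can be re-run safely after deleting any bad WAV output.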
To ensure the diversity and adequacy of the corpus, a set of 106 group chats from different backgrounds and languages was scraped from the public groups of the Telegram messaging application. Each voice message has eight fields, described in the table below.
| Field Name | Description |
|---|---|
| Voice_id | Unique ID assigned to each voice message |
| User_id | Unique ID assigned to each speaker |
| Fwd_from | ID of the user this message was forwarded from |
| Reply_to_msg_id | ID of the message this message replies to |
| Date | Timestamp of the message |
| Size | Size of the voice message, in bytes |
| Duration | Duration of the voice message, in seconds |
| Chat_name | Name of the group |
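Because the metadata lives in an SQLite database, these fields can be queried directly. The sketch below lists the groups contributing the most audio; the database file name (`jspeech.db`) and table name (`messages`) are assumptions, so check the actual schema with `.tables` and `.schema` in the `sqlite3` shell first:

```shell
# Hypothetical query: total hours of audio per group, largest first.
# Assumes a database file jspeech.db with a table named "messages"
# holding the fields from the table above — adjust to the real schema.
sqlite3 jspeech.db \
  "SELECT Chat_name, COUNT(*) AS n_messages, SUM(Duration)/3600.0 AS hours
   FROM messages
   GROUP BY Chat_name
   ORDER BY hours DESC
   LIMIT 10;"
```

Since `Duration` is stored in seconds, dividing the sum by 3600.0 gives hours per group.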
As shown in the bar chart, the majority of the voice messages are in English, but a substantial amount of audio is also available in other languages such as Farsi, Spanish, and French.
In addition, the distribution of the number of speakers per language is shown.
For more details about JSpeech, you can read the paper here.
We expect that the availability of multilingual speech corpora recorded in diverse background environments will boost R&D in automatic speaker recognition and voice activity detection.
At Miras Technologies International, we are using JSpeech to develop speaker and speech recognition systems.
If you use JSpeech in your research, please cite the following paper:
@article{choobbastijspeech,
title={JSPEECH: A MULTI-LINGUAL CONVERSATIONAL SPEECH CORPUS},
author={Choobbasti, Ali Janalizadeh and Gholamian, Mohammad Erfan and Vaheb, Amir and Safavi, Saeid}
}