This repository contains the code and resources for the paper:
Audio-Text Multimodal Speech Recognition via Dual-Tower Architecture for Mandarin Air Traffic Control Communications
In air traffic control communications (ATCC), miscommunications between pilots and controllers can lead to serious aviation accidents. This work introduces a novel speech-text multimodal dual-tower architecture designed to address these issues by:
- Semantic Alignment: Enhancing semantic alignment between speech and text during the encoding phase.
- Modeling Long-Distance Context: Strengthening the ability to model long-distance acoustic dependencies in extended ATCC data.
The architecture employs:
- Cross-modal interactions for close semantic alignment during encoding.
- A two-stage training strategy to:
  - Pre-train the multimodal encoding module.
  - Fine-tune the entire network to bridge the modality gap and boost performance.
The method demonstrates significant improvements over existing speech recognition techniques.
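For orientation, here is a minimal sketch of how the two towers can interact during encoding, assuming PyTorch; every class and attribute name below is illustrative, not the repository's actual API:

```python
import torch
import torch.nn as nn

class DualTowerEncoder(nn.Module):
    """Illustrative dual-tower encoder: a speech tower and a text tower are
    fused through cross-modal attention (a sketch, not the repo's code)."""

    def __init__(self, d_model=256, nhead=4, num_layers=4, vocab_size=5000):
        super().__init__()
        self.speech_tower = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead, batch_first=True),
            num_layers)
        self.text_embed = nn.Embedding(vocab_size, d_model)  # character-level vocab
        self.text_tower = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead, batch_first=True),
            num_layers)
        # Cross-modal interaction: speech queries attend to text keys/values.
        self.cross_attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)

    def forward(self, speech_feats, text_ids):
        # speech_feats: (B, T_audio, d_model) projected acoustic frames
        # text_ids:     (B, T_text) character indices
        a = self.speech_tower(speech_feats)
        t = self.text_tower(self.text_embed(text_ids))
        fused, _ = self.cross_attn(query=a, key=t, value=t)
        return a + fused, t  # residual fusion keeps the acoustic stream intact

if __name__ == "__main__":
    enc = DualTowerEncoder()
    audio = torch.randn(2, 100, 256)        # dummy acoustic features
    text = torch.randint(0, 5000, (2, 20))  # dummy character ids
    fused, text_repr = enc(audio, text)
    print(fused.shape, text_repr.shape)     # (2, 100, 256) and (2, 20, 256)
```

The residual fusion lets each acoustic frame borrow semantic context from the text tower while keeping the speech tower's own representation intact; the sketch only illustrates the encoding-phase interaction.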
- Dual-Tower Architecture: Incorporates audio and text modalities for joint semantic representation.
- Two-Stage Training:
  - Stage 1: Enhances inter-modal semantic alignment and long-distance acoustic context modeling (see the alignment-loss sketch after this list).
  - Stage 2: Fine-tunes the full model for improved generalization.
- Improved Recognition Accuracy: Achieves notable reductions in character error rate (CER) on benchmark datasets.
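The paper's exact Stage-1 objective is not reproduced in this README. As one common way to realize inter-modal semantic alignment, the sketch below uses a symmetric contrastive (InfoNCE) loss over pooled speech and text embeddings; treat it as an assumption-labeled illustration, not the paper's loss:

```python
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(speech_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss that pulls paired speech/text embeddings
    together and pushes mismatched pairs apart (illustrative Stage-1
    objective; the paper's actual loss may differ).

    speech_emb, text_emb: (B, D) pooled, unnormalized embeddings.
    """
    s = F.normalize(speech_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = s @ t.T / temperature          # (B, B) similarity matrix
    targets = torch.arange(s.size(0), device=s.device)
    # Matched pairs sit on the diagonal; cross-entropy in both directions.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.T, targets)) / 2

if __name__ == "__main__":
    s, t = torch.randn(8, 256), torch.randn(8, 256)
    print(contrastive_alignment_loss(s, t).item())
```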
- Datasets:
  - ATCC
  - AISHELL-1
- Performance:
  - CER on ATCC: 6.54%
  - CER on AISHELL-1: 8.73%
  - Performance gains compared to baselines:
    - ATCC: +28.76%
    - AISHELL-1: +23.82%
- Python >= 3.8
- Required libraries (listed in `requirements.txt`), installed via:

  ```bash
  pip install -r requirements.txt
  ```
- Clone the Repository:

  ```bash
  git clone https://github.com/yourusername/Audio-Text-Multimodal-ASR.git
  cd Audio-Text-Multimodal-ASR
  ```
- Prepare Data:
  - Place the ATCC and AISHELL-1 datasets in the `data/` directory.
  - Preprocess the data using:

    ```bash
    python preprocess.py
    ```
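The internals of `preprocess.py` are repository-specific and not shown above. As a minimal sketch of a typical Mandarin ASR front end (log-Mel filterbank features plus a character-level vocabulary), assuming torchaudio is installed, with illustrative paths and parameters:

```python
import torch
import torchaudio

def extract_fbank(wav_path, n_mels=80):
    """Load a waveform and compute log-Mel filterbank features
    (illustrative front end; the repo's preprocess.py may differ)."""
    waveform, sr = torchaudio.load(wav_path)
    if sr != 16000:  # ATCC and AISHELL-1 audio is commonly 16 kHz
        waveform = torchaudio.functional.resample(waveform, sr, 16000)
    fbank = torchaudio.compliance.kaldi.fbank(
        waveform, num_mel_bins=n_mels, sample_frequency=16000)
    # Per-utterance mean/variance normalization.
    return (fbank - fbank.mean(0)) / (fbank.std(0) + 1e-8)

def build_char_vocab(transcripts):
    """Character-level vocabulary for Mandarin transcripts."""
    chars = sorted({ch for line in transcripts for ch in line})
    return {ch: i + 1 for i, ch in enumerate(chars)}  # 0 reserved for blank/pad

if __name__ == "__main__":
    print(build_char_vocab(["塔台收到", "保持高度"]))
    # feats = extract_fbank("data/train/example.wav")  # hypothetical path
```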
- Train the Model:
  - Stage 1 (Pre-training):

    ```bash
    python train.py --stage 1
    ```

  - Stage 2 (Fine-tuning):

    ```bash
    python train.py --stage 2
    ```
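How the two stages differ inside `train.py` is repository-specific. A hedged sketch of one common realization, where Stage 1 trains only a hypothetical `model.encoder` module and Stage 2 unfreezes everything at a lower learning rate:

```python
import torch

def configure_stage(model, stage: int):
    """Illustrative two-stage setup: Stage 1 pre-trains only the multimodal
    encoding module; Stage 2 fine-tunes the whole network.
    `model.encoder` is a hypothetical attribute name, not the repo's API."""
    if stage == 1:
        for p in model.parameters():
            p.requires_grad = False
        for p in model.encoder.parameters():  # train the encoding module only
            p.requires_grad = True
        lr = 1e-3
    else:
        for p in model.parameters():          # unfreeze the entire network
            p.requires_grad = True
        lr = 1e-4                             # smaller LR for fine-tuning
    params = [p for p in model.parameters() if p.requires_grad]
    return torch.optim.Adam(params, lr=lr)
```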
- Evaluate:

  ```bash
  python evaluate.py --checkpoint <model_checkpoint>
  ```
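CER, the metric reported above, is the character-level Levenshtein distance between hypothesis and reference, normalized by reference length. `evaluate.py`'s own implementation is not shown here; a self-contained version of the metric:

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: Levenshtein edit distance between the reference
    and hypothesis character sequences, divided by the reference length."""
    r, h = list(reference), list(hypothesis)
    # Dynamic-programming edit distance table.
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(r)][len(h)] / max(len(r), 1)

if __name__ == "__main__":
    # One substitution over seven characters -> CER of 1/7.
    print(f"{cer('洞两拐保持高度', '洞两拐保持高多'):.4f}")
```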
If you use this code in your research, please cite our paper:
```bibtex
@Article{GE20243215,
  AUTHOR  = {Shuting Ge and Jin Ren and Yihua Shi and Yujun Zhang and Shunzhi Yang and Jinfeng Yang},
  TITLE   = {Audio-Text Multimodal Speech Recognition via Dual-Tower Architecture for Mandarin Air Traffic Control Communications},
  JOURNAL = {Computers, Materials \& Continua},
  VOLUME  = {78},
  NUMBER  = {3},
  YEAR    = {2024},
  PAGES   = {3215--3245},
  URL     = {http://www.techscience.com/cmc/v78n3/55898},
  ISSN    = {1546-2226},
  DOI     = {10.32604/cmc.2023.046746}
}
```
This project is licensed under the MIT License. See the `LICENSE` file for details.