Papers, code, and resources for speech language models and end-to-end speech dialogue systems.
- LTU: Listen, Think, and Understand - ICLR 2024
- SALMONN: Towards Generic Hearing Abilities for Large Language Models - ICLR 2024
- LTU-AS: Joint Audio and Speech Understanding - ASRU 2023
- Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models - arXiv 2023
- Audio Flamingo: A Novel Audio Language Model with Few-Shot Learning and Dialogue Abilities - ICML 2024
- Qwen2-Audio Technical Report - arXiv 2024
- WavLLM: Towards Robust and Adaptive Speech Large Language Model - EMNLP 2024
- DiVA: Distilling an End-to-End Voice Assistant Without Instruction Training Data - arXiv 2024
- Dynamic-SUPERB: Towards A Dynamic, Collaborative, and Comprehensive Instruction-Tuning Benchmark for Speech - ICASSP 2024
- AIR-Bench: Benchmarking Large Audio-Language Models via Generative Comprehension - ACL 2024
- SD-Eval: A Benchmark Dataset for Spoken Dialogue Understanding Beyond Words - arXiv 2024
- AudioBench: A Universal Benchmark for Audio Large Language Models - arXiv 2024
- SALMon: A Suite for Acoustic Language Model Evaluation - arXiv 2024
- MMAU: A Massive Multi-Task Audio Understanding and Reasoning Benchmark - arXiv 2024
- Dynamic-SUPERB Phase-2: A Collaboratively Expanding Benchmark for Measuring the Capabilities of Spoken Language Models with 180 Tasks - ICLR 2025 (OpenReview)
- SpeechGPT: Empowering Large Language Models with Intrinsic Cross-Modal Conversational Abilities - EMNLP 2023
- GPT-4o Voice Mode - API 2024
- PSLM: Parallel Generation of Text and Speech with LLMs for Low-Latency Spoken Dialogue Systems - EMNLP 2024
- VITA: Towards Open-Source Interactive Omni Multimodal LLM - arXiv 2024
- Mini-Omni: Language Models Can Hear, Talk While Thinking in Streaming - arXiv 2024
- LLaMA-Omni: Seamless Speech Interaction with Large Language Models - arXiv 2024
- Moshi: a speech-text foundation model for real-time dialogue - arXiv 2024
- Westlake-Omni - GitHub 2024
- EMOVA: Empowering Language Models to See, Hear and Speak with Vivid Emotions - arXiv 2024
- IntrinsicVoice: Empowering LLMs with Intrinsic Real-time Voice Interaction Abilities - arXiv 2024
- Mini-Omni2: Towards Open-source GPT-4o with Vision, Speech and Duplex Capabilities - arXiv 2024
- MooER-omni - GitHub 2024
- GLM-4-Voice - GitHub 2024
- Freeze-Omni: A Smart and Low Latency Speech-to-speech Dialogue Model with Frozen LLM - arXiv 2024
- Hertz-dev - GitHub 2024
- Fish Agent - GitHub 2024
- VoiceBench: Benchmarking LLM-Based Voice Assistants - arXiv 2024
- A Full-duplex Speech Dialogue Scheme Based On Large Language Models - NeurIPS 2024
- MiniCPM-duplex: Beyond the Turn-Based Game: Enabling Real-Time Conversations with Duplex Models - EMNLP 2024
- LSLM: Language Model Can Listen While Speaking - arXiv 2024
- SyncLLM: Beyond Turn-Based Interfaces: Synchronous LLMs as Full-Duplex Dialogue Agents - EMNLP 2024
- Enabling Real-Time Conversations with Minimal Training Costs - arXiv 2024
- Towards audio language modeling -- an overview - arXiv 2024
- Recent Advances in Speech Language Models: A Survey - arXiv 2024
- A Survey on Speech Large Language Models - arXiv 2024
- Speech Trident - GitHub