This study investigates progress in visual speech recognition (VSR), the task of interpreting spoken language from lip movements. It examines the architectural components of VSR systems, focusing on convolutional and recurrent neural network architectures, and evaluates them on the GRID corpus. The proposed advancements aim to improve accuracy and broaden applicability by integrating state-of-the-art deep learning techniques, multi-modal audio-visual learning, and attention mechanisms. At the core of the work is a deep learning model that converts video sequences of lip movements into spoken text, using sequence-to-sequence learning with encoder-decoder components to map visual features to textual representations. Potential applications include speech recognition, accessibility tools, and audio-visual synchronization, with particular emphasis on diverse speaker populations and languages. The study traces the evolution of VSR and highlights the importance of multi-modal strategies, user-centric design, and robust evaluation metrics for building more inclusive and effective communication systems.
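As a concrete illustration, the sketch below shows one way the described CNN + recurrent encoder-decoder pipeline could be assembled in PyTorch. It is a minimal sketch, not the repository's actual implementation: the layer sizes, the 75-frame 64×128 grayscale mouth crops (a GRID-like clip length), and the 28-token character vocabulary are all illustrative assumptions.

```python
import torch
import torch.nn as nn

class LipReadingSeq2Seq(nn.Module):
    """Sketch of a CNN + GRU encoder-decoder mapping lip-movement frame
    sequences to character sequences (hyperparameters are illustrative)."""

    def __init__(self, vocab_size: int, hidden_dim: int = 256):
        super().__init__()
        # 3D convolutional front-end extracts spatio-temporal visual features
        # from grayscale mouth-region crops shaped (B, 1, T, H, W).
        self.frontend = nn.Sequential(
            nn.Conv3d(1, 32, kernel_size=(3, 5, 5), stride=(1, 2, 2), padding=(1, 2, 2)),
            nn.ReLU(),
            nn.MaxPool3d(kernel_size=(1, 2, 2)),
            nn.Conv3d(32, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d((None, 4, 4)),  # keep the time axis, pool space to 4x4
        )
        # Bidirectional GRU encoder over the per-frame feature vectors.
        self.encoder = nn.GRU(64 * 4 * 4, hidden_dim, batch_first=True, bidirectional=True)
        # GRU decoder driven by embeddings of the previous target token (teacher forcing).
        self.embed = nn.Embedding(vocab_size, hidden_dim)
        self.decoder = nn.GRU(hidden_dim, 2 * hidden_dim, batch_first=True)
        self.out = nn.Linear(2 * hidden_dim, vocab_size)

    def forward(self, frames: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
        # frames: (B, 1, T, H, W); targets: (B, L) token ids for teacher forcing.
        feats = self.frontend(frames)                      # (B, 64, T, 4, 4)
        b, c, t, h, w = feats.shape
        feats = feats.permute(0, 2, 1, 3, 4).reshape(b, t, c * h * w)
        # enc_out would feed an attention mechanism; unused in this bare sketch.
        enc_out, enc_hidden = self.encoder(feats)
        # Concatenate the two directions to initialise the decoder state.
        dec_hidden = torch.cat([enc_hidden[0], enc_hidden[1]], dim=-1).unsqueeze(0)
        dec_out, _ = self.decoder(self.embed(targets), dec_hidden)
        return self.out(dec_out)                           # (B, L, vocab)


# Shape check with dummy data: 75 frames of 64x128 mouth crops, 20 target tokens.
model = LipReadingSeq2Seq(vocab_size=28)
logits = model(torch.randn(2, 1, 75, 64, 128), torch.randint(0, 28, (2, 20)))
print(logits.shape)  # torch.Size([2, 20, 28])
```

The decoder here uses plain teacher forcing; the attention mechanisms mentioned above would normally sit between the encoder outputs and each decoder step so the model can attend to the relevant frames when emitting each character.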
keshh22/SpeechRecogMotion
About
This model seeks to decipher sequences of lip movements captured in video frames and translate them into meaningful spoken language or phonetic representations.
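A typical first step for such a model is turning raw video into a sequence of mouth-region crops. The snippet below is a minimal preprocessing sketch assuming OpenCV and NumPy are available; the `video_path` argument and the fixed crop box are placeholders, and a real pipeline would locate the mouth with a face or landmark detector per frame.

```python
import cv2
import numpy as np

def load_mouth_crops(video_path, size=(128, 64)):
    """Read a video clip and return grayscale mouth-region crops as a (T, 64, 128) array.

    The hard-coded crop box is a placeholder for roughly centred, frontal
    GRID-style recordings; replace it with detected mouth landmarks in practice.
    """
    cap = cv2.VideoCapture(video_path)
    frames = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        h, w = gray.shape
        # Placeholder crop: lower-central region of the frame where the mouth usually sits.
        mouth = gray[int(0.6 * h):int(0.9 * h), int(0.3 * w):int(0.7 * w)]
        mouth = cv2.resize(mouth, size)              # OpenCV size is (width, height)
        frames.append(mouth.astype(np.float32) / 255.0)
    cap.release()
    return np.stack(frames)                          # (T, 64, 128), normalised to [0, 1]
```

The resulting crops match the 64×128 frame shape assumed by the model sketch above and can be stacked with a channel axis to form the `(B, 1, T, H, W)` input tensor.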