The `ser` package is a Python package for identifying and analyzing emotions in spoken language. Using machine learning and audio-processing techniques, it classifies the emotions expressed in speech, providing insight into the emotional states conveyed in audio recordings.
```mermaid
sequenceDiagram
    participant A as Emotion Prediction
    participant B as Transcript Extraction
    participant C as Timeline Integration
    A->>C: Emotions with Timestamps
    B->>C: Transcript with Timestamps
    C->>C: Integrate and Align
```
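In code, this integration step amounts to merging two timestamped streams. Below is a minimal sketch of one way to align them, assuming each stream arrives as a list of segments with start and end times; the `Segment` type and `build_timeline` function are illustrative stand-ins, not the package's actual API:

```python
from dataclasses import dataclass

@dataclass
class Segment:
    start: float  # seconds
    end: float    # seconds
    label: str    # emotion name, or transcript text

def build_timeline(emotions: list, transcript: list) -> list:
    """For each transcript segment, attach the emotion whose span overlaps it most."""
    timeline = []
    for words in transcript:
        best, best_overlap = None, 0.0
        for emo in emotions:
            overlap = min(words.end, emo.end) - max(words.start, emo.start)
            if overlap > best_overlap:
                best, best_overlap = emo, overlap
        timeline.append((words.start, words.end, words.label,
                         best.label if best else "unknown"))
    return timeline

emotions = [Segment(0.0, 2.5, "happy"), Segment(2.5, 5.0, "sad")]
transcript = [Segment(0.2, 2.0, "hello there"), Segment(2.8, 4.6, "goodbye now")]
for row in build_timeline(emotions, transcript):
    print(row)
```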
- Emotion Classification Model: Trains on a dataset of audio files for accurate emotion recognition.
- Emotion Prediction: Predicts emotions from provided audio files.
- Transcript Extraction: Extracts a transcript of spoken words in the audio file.
- Timeline Integration: Builds a comprehensive timeline integrating recognized emotions with the corresponding transcript.
- CLI Interface: Offers command-line options for user interaction.
```mermaid
graph TD;
    A[Audio Input] --> B[Feature Extraction];
    B --> C[Emotion Classification Model];
    A --> D[Transcript Extraction];
    C --> E[Emotion Prediction];
    D --> F[Transcript];
    E --> G[Timeline Integration];
    F --> G;
    G --> H[Output];
```
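The feature-extraction stage can be pictured with `librosa`, which the project depends on. The sketch below turns an audio file into a fixed-length feature vector for the classifier; the use of time-averaged MFCCs is an assumption about the feature set, not a description of `ser`'s exact features:

```python
import librosa
import numpy as np

def extract_features(path: str) -> np.ndarray:
    """Load an audio file and summarize it as a fixed-length MFCC vector."""
    signal, sr = librosa.load(path, sr=None)  # keep the native sample rate
    mfccs = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=40)
    return np.mean(mfccs, axis=1)  # average over time frames -> one vector per file

# A trained classifier (e.g. from scikit-learn) would then consume this vector:
# emotion = clf.predict(extract_features("audio.mp3").reshape(1, -1))[0]
```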
To install `ser` from source:

```bash
git clone https://github.com/jsugg/ser/
cd ser
pip install -r requirements.txt
```
To train the emotion classification model:
```bash
python -m ser --train
```
```mermaid
graph TD;
    A[Data Loading] --> B[Data Splitting];
    B --> C[Train Model];
    B --> D[Test Model];
    C --> E[Model Validation];
    E --> F[Trained Model];
```
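This flow maps onto a standard scikit-learn training loop. The sketch below uses synthetic placeholder features standing in for vectors extracted from the training audio; the actual model class and hyperparameters used by `ser` may differ:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Placeholder data: one 40-dimensional feature vector per audio file.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 40))
y = rng.choice(["happy", "sad", "angry", "neutral"], size=200)

# Data Splitting
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Train Model
model = MLPClassifier(hidden_layer_sizes=(128,), max_iter=500, random_state=0)
model.fit(X_train, y_train)

# Model Validation on the held-out split
print(f"Held-out accuracy: {model.score(X_test, y_test):.2f}")
```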
To predict emotions in an audio file:
```bash
python -m ser --file audio.mp3
```
```mermaid
graph LR;
    A[Audio Data] -->|Preprocessing| B[Feature Extraction];
    B -->|Feature Set| C[Model Prediction];
    C -->|Emotion Labels| D[Output];
    A -->|Transcription| E[Transcript Extraction];
    E -->|Transcript| D;
```
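The transcription branch builds on `openai-whisper` (see the acknowledgements). Here is a minimal sketch of obtaining a segment-level, timestamped transcript; whether `ser` invokes Whisper exactly this way is an assumption:

```python
import whisper  # the openai-whisper package

model = whisper.load_model("base")      # small general-purpose model
result = model.transcribe("audio.mp3")  # returns text plus timestamped segments

for segment in result["segments"]:
    print(f"[{segment['start']:6.2f}-{segment['end']:6.2f}] {segment['text']}")
```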
- Specify language: `--language <language>`
- Save transcript: `--save_transcript`
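These flags combine with the prediction command, for example (the `en` language code is an assumption about the expected value format):

```bash
python -m ser --file audio.mp3 --language en --save_transcript
```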
- `transcript_extractor`: Extracts transcripts from audio files.
- `audio_utils`: Utilities for audio processing.
- `feature_extractor`: Extracts audio features for model training.
- `emotion_model`: Contains the emotion classification model.
```mermaid
graph TD;
    A[User Input] -->|Train Command| B[Train Model];
    A -->|Predict Command| C[Predict Emotion];
    C --> D[Display Emotion];
    A -->|Transcript Command| E[Extract Transcript];
    E --> F[Display Transcript];
```
Edit `ser/config.py` to modify default configurations, including model paths, dataset paths, and feature extraction settings.
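As a rough picture of what such a config module centralizes, here is an illustrative sketch; the names and default values below are guesses, not the file's actual contents:

```python
# Illustrative settings only; the real ser/config.py may use different names.
MODEL_PATH = "models/emotion_classifier.pkl"  # where the trained model is saved
DATASET_PATH = "data/ravdess"                 # root of the training dataset
SAMPLE_RATE = 22050                           # target sample rate for audio loading
N_MFCC = 40                                   # number of MFCC coefficients to extract
```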
Contributions to SER are welcome!
This project is licensed under the MIT License; see the `LICENSE.md` file for details.
- Libraries and Frameworks: Special thanks to the developers and maintainers of `librosa`, `openai-whisper`, `stable-whisper`, `numpy`, `scikit-learn`, `soundfile`, and `tqdm` for their invaluable tools that made this project possible.
- Datasets: Gratitude to the creators of the RAVDESS and Emo-DB datasets for providing high-quality audio data essential for training the models.
- Inspirational Sources: Inspired by *Models-based representations for speech emotion recognition*.