Real-time Voice Phishing (Lie) Classifier using Echo State Networks
All code was written for Python 3.7 or later.
To install the libraries used in this project, run the following command:
!pip install -r requirement.txt
1. Collecting
We collected 10 hours of Korean voice phishing recordings from YouTube. All recordings were checked for duplicates and anomalies, and unnecessary sound effects were removed. To keep training inputs consistent, every recording was then segmented into 10-second clips.
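The segmentation step can be sketched as below. This is a minimal illustration, not the project's actual preprocessing code: the function name `split_into_clips`, the 16 kHz sample rate, and the choice to drop (or optionally zero-pad) the final partial clip are all assumptions.

```python
import numpy as np


def split_into_clips(samples, sample_rate, clip_seconds=10.0, keep_remainder=False):
    """Split a 1-D audio array into consecutive fixed-length clips.

    Hypothetical helper: by default the trailing partial clip is dropped;
    with keep_remainder=True it is zero-padded to full length instead.
    """
    clip_len = int(round(clip_seconds * sample_rate))
    if clip_len <= 0:
        raise ValueError("clip length must be positive")

    n_full = len(samples) // clip_len
    clips = [samples[i * clip_len:(i + 1) * clip_len] for i in range(n_full)]

    remainder = samples[n_full * clip_len:]
    if keep_remainder and len(remainder) > 0:
        # Zero-pad the tail so every clip has the same length.
        padded = np.zeros(clip_len, dtype=samples.dtype)
        padded[:len(remainder)] = remainder
        clips.append(padded)
    return clips


# Example: 25 seconds of silence at an assumed 16 kHz sample rate.
audio = np.zeros(25 * 16000, dtype=np.float32)
clips = split_into_clips(audio, 16000)
clips_padded = split_into_clips(audio, 16000, keep_remainder=True)
```

Dropping the remainder keeps every training clip fully populated with real audio, at the cost of discarding up to 10 seconds per recording.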
2. Labeling
For speaker diarization, we used a pretrained model provided by the Pyannote library.
- The voices of the scam callers (voice phishing scammers) were labeled as 1.
- The voices of the recipients (ordinary conversation) were labeled as 0.
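Once diarization has produced speaker turns, the labeling rule above reduces to a simple mapping. The sketch below assumes the diarization output has already been converted to `(start, end, speaker)` tuples and that the scammer's speaker ID is known for each recording; the README does not say how the scammer was identified, so here it is passed in by hand. The function name and the speaker IDs are placeholders.

```python
def label_segments(segments, scammer_speaker):
    """Map (start, end, speaker) turns to (start, end, label).

    Hypothetical helper: label 1 marks the scam caller,
    label 0 marks the call recipient.
    """
    labeled = []
    for start, end, speaker in segments:
        label = 1 if speaker == scammer_speaker else 0
        labeled.append((start, end, label))
    return labeled


# Example turns for one 10-second clip (times in seconds, made-up values).
segments = [
    (0.0, 4.2, "SPEAKER_00"),
    (4.2, 7.9, "SPEAKER_01"),
    (7.9, 10.0, "SPEAKER_00"),
]
labeled = label_segments(segments, scammer_speaker="SPEAKER_00")
```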
3. Augmentation
We applied data augmentation to expand the amount of data.
Time stretching, pitch shifting, and noise addition were used for augmentation.
With these methods we produced 40 hours of augmented data, which was also used for training.
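Of the three methods, noise addition is the easiest to illustrate without an audio library. The sketch below adds white Gaussian noise at a chosen signal-to-noise ratio; the function name, the 20 dB value, and the use of Gaussian noise are assumptions, since the README does not state the noise type or level. Time stretching and pitch shifting are available in audio libraries such as librosa (`librosa.effects.time_stretch`, `librosa.effects.pitch_shift`), though which library the project used is not stated.

```python
import numpy as np


def add_noise(samples, snr_db, rng=None):
    """Return a copy of `samples` with white Gaussian noise at `snr_db` dB SNR.

    Hypothetical helper; the noise power is derived from the signal power.
    """
    rng = np.random.default_rng() if rng is None else rng
    signal_power = np.mean(samples.astype(np.float64) ** 2)
    # SNR(dB) = 10 * log10(signal_power / noise_power)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=samples.shape)
    return (samples + noise).astype(samples.dtype)


# Example: a 1-second 220 Hz tone at an assumed 16 kHz sample rate.
rng = np.random.default_rng(0)
t = np.arange(16000) / 16000.0
clean = np.sin(2 * np.pi * 220.0 * t).astype(np.float32)
noisy = add_noise(clean, snr_db=20.0, rng=rng)
```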
The following audio features were extracted:
- MFCC (20 feature vectors in total)
- Pitch
- F0 (Fundamental Frequency)
- Spectral Flux
- Spectral Frequency
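As an illustration of one feature from the list, spectral flux measures how much the magnitude spectrum changes from one frame to the next. The sketch below computes it with NumPy only; the frame length, hop length, Hann window, and L2 definition of flux are assumptions rather than the project's settings.

```python
import numpy as np


def spectral_flux(samples, frame_len=1024, hop_len=512):
    """Return the L2 spectral flux between consecutive frames.

    Hypothetical helper: one value per pair of adjacent frames.
    """
    if len(samples) < frame_len:
        raise ValueError("signal is shorter than one frame")
    window = np.hanning(frame_len)
    n_frames = 1 + (len(samples) - frame_len) // hop_len
    frames = np.stack([
        samples[i * hop_len:i * hop_len + frame_len] * window
        for i in range(n_frames)
    ])
    # Magnitude spectrum of every windowed frame.
    magnitudes = np.abs(np.fft.rfft(frames, axis=1))
    # Change in magnitude between adjacent frames, summed over frequency.
    diff = np.diff(magnitudes, axis=0)
    return np.sqrt(np.sum(diff ** 2, axis=1))


# Example: a tone that jumps from 220 Hz to 880 Hz halfway through.
sr = 16000
t = np.arange(sr) / sr
tone = np.where(t < 0.5, np.sin(2 * np.pi * 220.0 * t), np.sin(2 * np.pi * 880.0 * t))
flux = spectral_flux(tone)
```

For this example the flux stays near zero while the tone is steady and peaks around the frames that straddle the frequency jump.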
Classifier: Echo State Network
- An Echo State Network (ESN) is a kind of recurrent neural network (RNN), based on Reservoir Computing, that is designed to handle sequential data efficiently. Its recurrent weights stay fixed and only the output layer is trained.
- Real-time prediction during a call requires a model with low computational cost that still reflects the temporal nature of the data. We chose an ESN because it meets both requirements.
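The idea behind an ESN can be sketched in a few lines of NumPy. This is a minimal, generic ESN classifier under stated assumptions, not the project's model: the class name, reservoir size, spectral radius, leak rate, ridge penalty, and the choice to classify from the final reservoir state are all illustrative, and the data is synthetic.

```python
import numpy as np


class EchoStateNetwork:
    """Minimal leaky ESN with a ridge-regression readout (illustrative only)."""

    def __init__(self, n_inputs, n_reservoir=50, spectral_radius=0.9,
                 leak_rate=0.3, ridge=1e-2, seed=0):
        rng = np.random.default_rng(seed)
        # Fixed random input weights (first column multiplies a bias of 1).
        self.w_in = rng.uniform(-0.5, 0.5, size=(n_reservoir, n_inputs + 1))
        # Fixed random reservoir, rescaled to the requested spectral radius.
        w = rng.uniform(-0.5, 0.5, size=(n_reservoir, n_reservoir))
        w *= spectral_radius / np.max(np.abs(np.linalg.eigvals(w)))
        self.w = w
        self.leak_rate = leak_rate
        self.ridge = ridge
        self.w_out = None

    def _final_state(self, sequence):
        # Run one (time, features) sequence through the reservoir.
        x = np.zeros(self.w.shape[0])
        for u in sequence:
            pre = self.w_in @ np.concatenate(([1.0], u)) + self.w @ x
            x = (1.0 - self.leak_rate) * x + self.leak_rate * np.tanh(pre)
        return x

    def _states(self, sequences):
        return np.stack([np.concatenate(([1.0], self._final_state(s)))
                         for s in sequences])

    def fit(self, sequences, labels):
        # Only the linear readout is trained, via ridge regression.
        s = self._states(sequences)
        y = np.asarray(labels, dtype=float)
        a = s.T @ s + self.ridge * np.eye(s.shape[1])
        self.w_out = np.linalg.solve(a, s.T @ y)
        return self

    def predict(self, sequences):
        return (self._states(sequences) @ self.w_out > 0.5).astype(int)


# Synthetic data: class 1 sequences have a positive mean, class 0 a negative one.
rng = np.random.default_rng(1)
labels = np.array([i % 2 for i in range(120)])
sequences = [rng.normal(0.5 if y == 1 else -0.5, 0.3, size=(20, 3)) for y in labels]

esn = EchoStateNetwork(n_inputs=3).fit(sequences[:80], labels[:80])
predictions = esn.predict(sequences[80:])
accuracy = float(np.mean(predictions == labels[80:]))
```

Because training only solves one linear system for the readout, fitting is cheap, and inference costs one matrix-vector product per time step.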
Performance Metrics
The ESN-based model outperformed the other machine learning and deep learning models, and its inference was significantly faster than that of the deep learning models. Because the amount of data is limited, the SVM also outperformed the deep learning-based models (this trend is expected to reverse as the data size increases).
Hyperparameters