Awesome Speech
Hi, everyone! I’m Junjie Li [Homepage], currently a Ph.D. student at The Hong Kong Polytechnic University (PolyU) 🇭🇰. This repository aims to help students become familiar with speech-related tasks such as speech separation, speaker verification, automatic speech recognition (ASR), and text-to-speech (TTS).
- Understanding Deep Learning [pdf]
- Computer Vision: Models, Learning, and Inference [pdf]
- 深入浅出强化学习:原理入门 (Reinforcement Learning Made Simple: An Introduction to the Principles) [pdf]
- Reinforcement Learning [pdf]
- R. Tao, K. A. Lee, R. K. Das, V. Hautamäki and H. Li, "Self-Supervised Speaker Recognition with Loss-Gated Learning," ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, Singapore, 2022, pp. 6142-6146
- Chen, T., Kornblith, S., Norouzi, M. & Hinton, G. (2020). A Simple Framework for Contrastive Learning of Visual Representations. Proceedings of the 37th International Conference on Machine Learning, in Proceedings of Machine Learning Research 119:1597-1607.
- D. Cai, W. Wang and M. Li, "An Iterative Framework for Self-Supervised Deep Speaker Representation Learning," ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 2021, pp. 6728-6732
- H. Zhang, Y. Zou and H. Wang, "Contrastive Self-Supervised Learning for Text-Independent Speaker Verification," ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 2021, pp. 6713-6717
- He, K., Fan, H., Wu, Y., Xie, S., & Girshick, R. (2020). Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 9729-9738).
- Wu, Z., Xiong, Y., Yu, S. X., & Lin, D. (2018). Unsupervised feature learning via non-parametric instance discrimination. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 3733-3742).
- Cai, D., Wang, W., & Li, M. (2021, June). An iterative framework for self-supervised deep speaker representation learning. In ICASSP 2021-2021 IEEE international conference on acoustics, speech and signal processing (ICASSP) (pp. 6728-6732). IEEE.
- Hadsell, R., Chopra, S., & LeCun, Y. (2006, June). Dimensionality reduction by learning an invariant mapping. In 2006 IEEE computer society conference on computer vision and pattern recognition (CVPR'06) (Vol. 2, pp. 1735-1742). IEEE.
- Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., ... & Sutskever, I. (2021, July). Learning transferable visual models from natural language supervision. In International conference on machine learning (pp. 8748-8763). PMLR.
- Wang, S., Chen, Z., Lee, K. A., Qian, Y., & Li, H. (2024). Overview of speaker modeling and its applications: From the lens of deep speaker representation learning. IEEE/ACM Transactions on Audio, Speech, and Language Processing.
speaker model
- i-vector
- d-vector: Variani, E., Lei, X., McDermott, E., Moreno, I. L., & Gonzalez-Dominguez, J. (2014, May). Deep neural networks for small footprint text-dependent speaker verification. In 2014 IEEE international conference on acoustics, speech and signal processing (ICASSP) (pp. 4052-4056). IEEE.
- x-vector:
  - Snyder, D., Garcia-Romero, D., Sell, G., Povey, D., & Khudanpur, S. (2018, April). X-vectors: Robust DNN embeddings for speaker recognition. In 2018 IEEE international conference on acoustics, speech and signal processing (ICASSP) (pp. 5329-5333). IEEE.
  - Snyder, D., Garcia-Romero, D., Povey, D., & Khudanpur, S. (2017, August). Deep neural network embeddings for text-independent speaker verification. In Interspeech (Vol. 2017, pp. 999-1003).
- Desplanques, B., Thienpondt, J., & Demuynck, K. (2020). ECAPA-TDNN: Emphasized channel attention, propagation and aggregation in TDNN based speaker verification. arXiv preprint arXiv:2005.07143.
- r-vector: Zeinali, H., Wang, S., Silnova, A., Matějka, P., & Plchot, O. (2019). BUT system description to VoxCeleb speaker recognition challenge 2019. arXiv preprint arXiv:1910.12592.
- Lee, K. A., Wang, Q., & Koshinaka, T. (2021). Xi-vector embedding for speaker recognition. IEEE Signal Processing Letters, 28, 1385-1389.
- Wang, Q., & Lee, K. A. (2024). Cosine Scoring with Uncertainty for Neural Speaker Embedding. IEEE Signal Processing Letters.
- Chen, L., Lee, K. A., Guo, W., & Ling, Z. H. (2024, April). Modeling Pseudo-Speaker Uncertainty in Voice Anonymization. In ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 11601-11605). IEEE.
- Wang, Q., Lee, K. A., & Liu, T. (2023, June). Incorporating uncertainty from speaker embedding estimation to speaker verification. In ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 1-5). IEEE.
- Liu, T., Lee, K. A., Wang, Q., & Li, H. (2023). Disentangling voice and content with self-supervision for speaker recognition. Advances in Neural Information Processing Systems, 36, 50221-50236.
- Ravanelli, M., & Bengio, Y. (2018, December). Speaker recognition from raw waveform with SincNet. In 2018 IEEE spoken language technology workshop (SLT) (pp. 1021-1028). IEEE.
- Zhou, D., Wang, L., Lee, K. A., Wu, Y., Liu, M., Dang, J., & Wei, J. (2020, October). Dynamic Margin Softmax Loss for Speaker Verification. In INTERSPEECH (pp. 3800-3804).
- Cai, D., & Li, M. (2024). Leveraging ASR pretrained Conformers for speaker verification through transfer learning and knowledge distillation. IEEE/ACM Transactions on Audio, Speech, and Language Processing.
- L-softmax: Liu, W., Wen, Y., Yu, Z., & Yang, M. (2016). Large-margin softmax loss for convolutional neural networks. arXiv preprint arXiv:1612.02295.
- A-softmax:
  - Li, Y., Gao, F., Ou, Z., & Sun, J. (2018, November). Angular softmax loss for end-to-end speaker verification. In 2018 11th International Symposium on Chinese Spoken Language Processing (ISCSLP) (pp. 190-194). IEEE.
  - Liu, W., Wen, Y., Yu, Z., Li, M., Raj, B., & Song, L. (2017). SphereFace: Deep hypersphere embedding for face recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 212-220).
- SphereFace2:
  - Han, B., Chen, Z., & Qian, Y. (2023, June). Exploring binary classification loss for speaker verification. In ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 1-5). IEEE.
- AM-softmax:
  - Wang, H., Wang, Y., Zhou, Z., Ji, X., Gong, D., Zhou, J., ... & Liu, W. (2018). CosFace: Large margin cosine loss for deep face recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 5265-5274).
  - Wang, F., Cheng, J., Liu, W., & Liu, H. (2018). Additive margin softmax for face verification. IEEE Signal Processing Letters, 25(7), 926-930.
- AAM-softmax: Deng, J., Guo, J., Xue, N., & Zafeiriou, S. (2019). ArcFace: Additive angular margin loss for deep face recognition. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 4690-4699).
- DAM-softmax:
  - Zhou, D., Wang, L., Lee, K. A., Wu, Y., Liu, M., Dang, J., & Wei, J. (2020, October). Dynamic Margin Softmax Loss for Speaker Verification. In INTERSPEECH (pp. 3800-3804).
- MV-softmax:
  - Wang, X., Zhang, S., Wang, S., Fu, T., Shi, H., & Mei, T. (2020, April). Mis-classified vector guided softmax loss for face recognition. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 34, No. 07, pp. 12241-12248).
- Huang, Y., Wang, Y., Tai, Y., Liu, X., Shen, P., Li, S., ... & Huang, F. (2020). CurricularFace: Adaptive curriculum learning loss for deep face recognition. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 5901-5910).
- Kim, M., Jain, A. K., & Liu, X. (2022). AdaFace: Quality adaptive margin for face recognition. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 18750-18759).
- Vox1: Nagrani, A., Chung, J. S., Xie, W., & Zisserman, A. (2020). VoxCeleb: Large-scale speaker verification in the wild. Computer Speech & Language, 60, 101027.
- Vox2: Chung, J. S., Nagrani, A., & Zisserman, A. (2018). VoxCeleb2: Deep speaker recognition. arXiv preprint arXiv:1806.05622.
- Tao, R., Das, R. K., & Li, H. (2020). Audio-visual speaker recognition with a cross-modal discriminative network. arXiv preprint arXiv:2008.03894.
- Ju, Z., Wang, Y., Shen, K., Tan, X., Xin, D., Yang, D., ... & Zhao, S. (2024). NaturalSpeech 3: Zero-shot speech synthesis with factorized codec and diffusion models. arXiv preprint arXiv:2403.03100.
- Défossez, A., Copet, J., Synnaeve, G., & Adi, Y. (2022). High fidelity neural audio compression. arXiv preprint arXiv:2210.13438.
- CycleGAN-VC: Non-parallel Voice Conversion Using Cycle-Consistent Adversarial Networks
- Ashihara, T., Moriya, T., Horiguchi, S., Peng, J., Ochiai, T., Delcroix, M., ... & Sato, H. (2024, December). Investigation of Speaker Representation for Target-Speaker Speech Processing. In 2024 IEEE Spoken Language Technology Workshop (SLT) (pp. 423-430). IEEE.
- Veluri, B., Itani, M., Chen, T., Yoshioka, T., & Gollakota, S. (2024, May). Look Once to Hear: Target Speech Hearing with Noisy Examples. In Proceedings of the CHI Conference on Human Factors in Computing Systems (pp. 1-16).
- Towards Quantifying and Reducing Language Mismatch Effects in Cross-Lingual Speech Anti-Spoofing
- Emerging Properties in Self-Supervised Vision Transformers
- MuQ: Self-Supervised Music Representation Learning with Mel Residual Vector Quantization