Skip to content

mrjunjieli/awesome_speech

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

18 Commits
 
 

Repository files navigation

title top
Awesome Speech
true

Hi, everyone! I’m Junjie Li [Homepage], currently a Ph.D. student at Hong Kong Polytechnic University (PolyU) 🇭🇰. This repository aims to help students become familiar with speech-related tasks, such as speech separation, speaker verification, ASR, TTS and so on.

Book recommendations

  • Understanding Deep learning [pdf]
  • Computer vision: models learning and inference [pdf]
  • 深入浅出强化学习:原理入门 [pdf]
  • Reinforcement Learning [pdf]

Self-Supervised Learning

  • R. Tao, K. Aik Lee, R. Kumar Das, V. Hautamäki and H. Li, "Self-Supervised Speaker Recognition with Loss-Gated Learning," ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, Singapore, 2022, pp. 6142-6146
  • Chen, T., Kornblith, S., Norouzi, M. & Hinton, G.. (2020). A Simple Framework for Contrastive Learning of Visual Representations. Proceedings of the 37th International Conference on Machine Learning, in Proceedings of Machine Learning Research 119:1597-1607.
  • D. Cai, W. Wang and M. Li, "An Iterative Framework for Self-Supervised Deep Speaker Representation Learning," ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 2021, pp. 6728-6732]
  • H. Zhang, Y. Zou and H. Wang, "Contrastive Self-Supervised Learning for Text-Independent Speaker Verification," ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 2021, pp. 6713-6717
  • He, K., Fan, H., Wu, Y., Xie, S., & Girshick, R. (2020). Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 9729-9738).
  • Wu, Z., Xiong, Y., Yu, S. X., & Lin, D. (2018). Unsupervised feature learning via non-parametric instance discrimination. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 3733-3742).
  • Cai, D., Wang, W., & Li, M. (2021, June). An iterative framework for self-supervised deep speaker representation learning. In ICASSP 2021-2021 IEEE international conference on acoustics, speech and signal processing (ICASSP) (pp. 6728-6732). IEEE.
  • Hadsell, R., Chopra, S., & LeCun, Y. (2006, June). Dimensionality reduction by learning an invariant mapping. In 2006 IEEE computer society conference on computer vision and pattern recognition (CVPR'06) (Vol. 2, pp. 1735-1742). IEEE.
  • Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., ... & Sutskever, I. (2021, July). Learning transferable visual models from natural language supervision. In International conference on machine learning (pp. 8748-8763). PMLR.

Speaker Recognition/Verification:

  • Overview

    • Wang, S., Chen, Z., Lee, K. A., Qian, Y., & Li, H. (2024). Overview of speaker modeling and its applications: From the lens of deep speaker representation learning. IEEE/ACM Transactions on Audio, Speech, and Language Processing.
  • speaker model

    • i-vector
    • d-vector: Variani, E., Lei, X., McDermott, E., Moreno, I. L., & Gonzalez-Dominguez, J. (2014, May). Deep neural networks for small footprint text-dependent speaker verification. In 2014 IEEE international conference on acoustics, speech and signal processing (ICASSP) (pp. 4052-4056). IEEE.
    • x-vector
      • Snyder, D., Garcia-Romero, D., Sell, G., Povey, D., & Khudanpur, S. (2018, April). X-vectors: Robust dnn embeddings for speaker recognition. In 2018 IEEE international conference on acoustics, speech and signal processing (ICASSP) (pp. 5329-5333). IEEE.
      • Snyder, D., Garcia-Romero, D., Povey, D., & Khudanpur, S. (2017, August). Deep neural network embeddings for text-independent speaker verification. In Interspeech (Vol. 2017, pp. 999-1003).
      • Desplanques, B., Thienpondt, J., & Demuynck, K. (2020). Ecapa-tdnn: Emphasized channel attention, propagation and aggregation in tdnn based speaker verification. arXiv preprint arXiv:2005.07143.
    • r-vector Zeinali, H., Wang, S., Silnova, A., Matějka, P., & Plchot, O. (2019). But system description to voxceleb speaker recognition challenge 2019. arXiv preprint arXiv:1910.12592.
  • Uncertainty

    • Lee, K. A., Wang, Q., & Koshinaka, T. (2021). Xi-vector embedding for speaker recognition. IEEE Signal Processing Letters, 28, 1385-1389.
    • Wang, Q., & Lee, K. A. (2024). Cosine Scoring with Uncertainty for Neural Speaker Embedding. IEEE Signal Processing Letters.
    • Chen, L., Lee, K. A., Guo, W., & Ling, Z. H. (2024, April). Modeling Pseudo-Speaker Uncertainty in Voice Anonymization. In ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 11601-11605). IEEE.
    • Wang, Q., Lee, K. A., & Liu, T. (2023, June). Incorporating uncertainty from speaker embedding estimation to speaker verification. In ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 1-5). IEEE.
    • Liu, T., Lee, K. A., Wang, Q., & Li, H. (2023). Disentangling voice and content with self-supervision for speaker recognition. Advances in Neural Information Processing Systems, 36, 50221-50236.
  • Ravanelli, M., & Bengio, Y. (2018, December). Speaker recognition from raw waveform with sincnet. In 2018 IEEE spoken language technology workshop (SLT) (pp. 1021-1028). IEEE.

  • Zhou, D., Wang, L., Lee, K. A., Wu, Y., Liu, M., Dang, J., & Wei, J. (2020, October). Dynamic Margin Softmax Loss for Speaker Verification. In INTERSPEECH (pp. 3800-3804).

  • Cai, D., & Li, M. (2024). Leveraging asr pretrained conformers for speaker verification through transfer learning and knowledge distillation. IEEE/ACM Transactions on Audio, Speech, and Language Processing.

  • softmax

    • L-softmax: Liu, W., Wen, Y., Yu, Z., & Yang, M. (2016). Large-margin softmax loss for convolutional neural networks. arXiv preprint arXiv:1612.02295.
    • A-softmax:
      • Li, Y., Gao, F., Ou, Z., & Sun, J. (2018, November). Angular softmax loss for end-to-end speaker verification. In 2018 11th International Symposium on Chinese Spoken Language Processing (ISCSLP) (pp. 190-194). IEEE.
      • Liu, W., Wen, Y., Yu, Z., Li, M., Raj, B., & Song, L. (2017). Sphereface: Deep hypersphere embedding for face recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 212-220).
    • SphereFace2:
      • Han, B., Chen, Z., & Qian, Y. (2023, June). Exploring binary classification loss for speaker verification. In ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 1-5). IEEE.
    • AM-softmax:
      • Wang, H., Wang, Y., Zhou, Z., Ji, X., Gong, D., Zhou, J., ... & Liu, W. (2018). Cosface: Large margin cosine loss for deep face recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 5265-5274).
      • Wang, F., Cheng, J., Liu, W., & Liu, H. (2018). Additive margin softmax for face verification. IEEE Signal Processing Letters, 25(7), 926-930.
    • AAM-softmax: Deng, J., Guo, J., Xue, N., & Zafeiriou, S. (2019). Arcface: Additive angular margin loss for deep face recognition. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 4690-4699).
    • DAM-softmax:
      • Zhou, D., Wang, L., Lee, K. A., Wu, Y., Liu, M., Dang, J., & Wei, J. (2020, October). Dynamic Margin Softmax Loss for Speaker Verification. In INTERSPEECH (pp. 3800-3804).
    • MV-Softmax
      • Wang, X., Zhang, S., Wang, S., Fu, T., Shi, H., & Mei, T. (2020, April). Mis-classified vector guided softmax loss for face recognition. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 34, No. 07, pp. 12241-12248).
    • Huang, Y., Wang, Y., Tai, Y., Liu, X., Shen, P., Li, S., ... & Huang, F. (2020). Curricularface: adaptive curriculum learning loss for deep face recognition. In proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 5901-5910).
    • Kim, M., Jain, A. K., & Liu, X. (2022). Adaface: Quality adaptive margin for face recognition. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 18750-18759).

    summary: https://zhuanlan.zhihu.com/p/23089590666

  • Datasets

    • Vox1: Nagrani, A., Chung, J. S., Xie, W., & Zisserman, A. (2020). Voxceleb: Large-scale speaker verification in the wild. Computer Speech & Language, 60, 101027.
    • Vox2: Chung, J. S., Nagrani, A., & Zisserman, A. (2018). Voxceleb2: Deep speaker recognition. arXiv preprint arXiv:1806.05622.
    • SRE: NIST SRE
  • Tao, R., Das, R. K., & Li, H. (2020). Audio-visual speaker recognition with a cross-modal discriminative network. arXiv preprint arXiv:2008.03894.

Text-to-Speech

  • Ju, Z., Wang, Y., Shen, K., Tan, X., Xin, D., Yang, D., ... & Zhao, S. (2024). Naturalspeech 3: Zero-shot speech synthesis with factorized codec and diffusion models. arXiv preprint arXiv:2403.03100.

Audio Codec

  • Défossez, A., Copet, J., Synnaeve, G., & Adi, Y. (2022). High fidelity neural audio compression. arXiv preprint arXiv:2210.13438.

Voice Conversion

  • CycleGAN-VC: Non-parallel Voice Conversion Using Cycle-Consistent Adversarial Networks

Speech Separation

  • Ashihara, T., Moriya, T., Horiguchi, S., Peng, J., Ochiai, T., Delcroix, M., ... & Sato, H. (2024, December). Investigation of Speaker Representation for Target-Speaker Speech Processing. In 2024 IEEE Spoken Language Technology Workshop (SLT) (pp. 423-430). IEEE.
  • Veluri, B., Itani, M., Chen, T., Yoshioka, T., & Gollakota, S. (2024, May). Look Once to Hear: Target Speech Hearing with Noisy Examples. In Proceedings of the CHI Conference on Human Factors in Computing Systems (pp. 1-16).

Speaker Diarization

Spoofing

  • Towards Quantifying and Reducing Language Mismatch Effects in Cross-Lingual Speech Anti-Spoofing

Targer Speaker ASR

Personalized VAD

Others

  • Emerging Properties in Self-Supervised Vision Transformers

Music

  • MuQ: Self-Supervised Music Representation Learning with Mel Residual Vector Quantization

Toolkit

  • Wespeaker
  • Wesep

Some interesting repos

  • MuQ

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published