This repository is a collection of step-by-step programs and notebooks for learning and experimenting with Natural Language Processing (NLP), with a special focus on Vietnamese text.
It is being developed during my participation in the Sudo Code Program.
- Unicode & diacritic normalization
- Remove HTML tags, URLs, emails, numbers, emojis
- Vietnamese tokenization
- Smart lowercase handling with POS tagging
- More steps to be added later
👉 Notebook
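
As a rough illustration of the cleaning steps listed above, here is a minimal sketch that uses only the Python standard library. The function name `clean_text` and every regular expression are assumptions made for this example, not the notebook's actual code, and the emoji range is deliberately incomplete. Tokenization and POS-aware lowercasing are left out because they need a dedicated Vietnamese NLP library.

```python
import re
import unicodedata

# Illustrative patterns only (assumptions); real-world text usually needs broader rules.
HTML_TAG = re.compile(r"<[^>]+>")
URL = re.compile(r"https?://\S+|www\.\S+")
EMAIL = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")
NUMBER = re.compile(r"\d+(?:[.,]\d+)*")
# A rough emoji range, not a complete one.
EMOJI = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")


def clean_text(text: str) -> str:
    """Normalize Unicode, then strip tags, URLs, emails, numbers, and emojis."""
    # NFC merges a base letter and its combining diacritics into one code point,
    # so visually identical Vietnamese strings also compare equal.
    text = unicodedata.normalize("NFC", text)
    # Each match is replaced by a space so neighbouring words do not fuse together.
    for pattern in (HTML_TAG, URL, EMAIL, NUMBER, EMOJI):
        text = pattern.sub(" ", text)
    # Collapse the whitespace left behind by the substitutions.
    return " ".join(text.split())


print(clean_text("<p>Xin chào 👋 2024</p>"))  # Xin chào
```

URLs are removed before emails and numbers so that digits or `@` signs inside a link are not processed separately. Replacing matches with a space rather than an empty string is a design choice that keeps word boundaries intact for the later tokenization step.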
- N-gram representation (bi-gram, tri-gram)
- Bag-of-Words (BoW) vectorization
- TF-IDF weighting for word importance
👉 Notebook
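
To make the three representations above concrete, here is a small standard-library sketch. The helper names (`ngrams`, `bag_of_words`, `tf_idf`) are hypothetical and chosen for this example only. The TF-IDF formula is the plain textbook form, term frequency times `log(N / df)`; libraries such as scikit-learn use smoothed variants, so their numbers will differ.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def bag_of_words(docs):
    """Return a sorted vocabulary and one raw count vector per document."""
    vocab = sorted({tok for doc in docs for tok in doc})
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append([counts[tok] for tok in vocab])
    return vocab, vectors


def tf_idf(docs):
    """Return one {token: weight} dict per document (unsmoothed TF-IDF)."""
    n_docs = len(docs)
    # Document frequency: in how many documents each token appears.
    df = Counter(tok for doc in docs for tok in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            tok: (count / len(doc)) * math.log(n_docs / df[tok])
            for tok, count in counts.items()
        })
    return weights


# Toy, already-tokenized documents (made up for illustration).
docs = [["tôi", "yêu", "hà_nội"], ["tôi", "yêu", "phở"]]
print(ngrams(docs[0], 2))
print(bag_of_words(docs))
print(tf_idf(docs))
```

In this toy corpus, tokens that occur in every document get a weight of zero, because `log(N / df)` is `log(1)`. That is the sense in which TF-IDF measures word importance: only tokens that help tell documents apart keep a non-zero weight.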
- Train Skip-gram & CBOW models using Gensim
- Generate 300-dimensional Vietnamese word embeddings
- Evaluate with similarity & analogy tests
- Visualize embeddings via PCA 2D plot
👉 Notebook
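
The difference between Skip-gram and CBOW is easiest to see in the training examples each one builds from a sentence. The sketch below is conceptual and uses only the standard library; the function names are hypothetical, and it omits details that Gensim handles internally, such as subsampling and negative sampling. In Gensim itself the choice is made with the `sg` parameter of `Word2Vec` (`sg=1` for Skip-gram, `sg=0` for CBOW), and the embedding size with `vector_size`.

```python
def skipgram_pairs(tokens, window=2):
    """Skip-gram: predict each context word from the center word."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


def cbow_examples(tokens, window=2):
    """CBOW: predict the center word from all of its context words at once."""
    examples = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = tokens[lo:i] + tokens[i + 1:hi]
        if context:  # skip single-token sentences with no context
            examples.append((context, target))
    return examples


# Toy, already-tokenized sentence (made up for illustration).
sentence = ["tôi", "yêu", "hà_nội"]
print(skipgram_pairs(sentence, window=1))
print(cbow_examples(sentence, window=1))
```

Skip-gram produces one example per (center, context) pair, while CBOW produces one example per position with the whole context grouped together. This is why Skip-gram sees more training examples from the same text.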