A Jupyter-based repository for exploring key concepts and techniques in Natural Language Processing (NLP), from fundamental text processing to advanced methods with modern language models.

hanguyenai/sudo-code-nlp

🧠 Sudo Code Program NLP

This repository is a collection of step-by-step programs and notebooks for learning and experimenting with Natural Language Processing (NLP), with a special focus on Vietnamese text.

It is being developed during my participation in the Sudo Code Program.


📚 Table of Contents

1. Text Preprocessing

  • Unicode & diacritic normalization
  • Remove HTML tags, URLs, emails, numbers, emojis
  • Vietnamese tokenization
  • Smart lowercase handling with POS tagging
  • More steps to be added later
    👉 Notebook
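
As a rough illustration of the cleaning steps listed above, here is a minimal sketch that uses only the Python standard library. The function name `clean_text` and all of the regular expressions are assumptions made for this example and are not taken from the notebook; the emoji range in particular is approximate. Vietnamese tokenization and POS-aware lowercasing need a dedicated word segmenter and tagger, so they are left out here.

```python
import html
import re
import unicodedata

# Hypothetical patterns for illustration; the notebook may use different ones.
TAG_RE = re.compile(r"<[^>]+>")                  # HTML tags
URL_RE = re.compile(r"https?://\S+|www\.\S+")    # URLs
EMAIL_RE = re.compile(r"\S+@\S+\.\S+")           # email addresses
NUMBER_RE = re.compile(r"\d+(?:[.,]\d+)*")       # integers and decimals
# Approximate emoji coverage: pictographs, misc symbols, variation selector.
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF\uFE0F]")
SPACE_RE = re.compile(r"\s+")


def clean_text(text: str) -> str:
    """Normalize Unicode, then strip tags, URLs, emails, numbers, and emojis."""
    # NFC composes base letters and combining diacritics into single code
    # points, so visually identical Vietnamese strings compare equal.
    text = unicodedata.normalize("NFC", text)
    # Drop tags first, then decode entities such as &amp;.
    text = html.unescape(TAG_RE.sub(" ", text))
    # URLs and emails go before numbers so digits inside them vanish with them.
    text = URL_RE.sub(" ", text)
    text = EMAIL_RE.sub(" ", text)
    text = NUMBER_RE.sub(" ", text)
    text = EMOJI_RE.sub(" ", text)
    # Collapse the whitespace left behind by the removals.
    return SPACE_RE.sub(" ", text).strip()
```

Calling `clean_text` on a string with decomposed diacritics returns the composed (NFC) form, which matters for Vietnamese because the same word can be encoded in more than one way.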

2. Text Representation

  • N-gram representation (bi-gram, tri-gram)
  • Bag-of-Words (BoW) vectorization
  • TF-IDF weighting for word importance
    👉 Notebook
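
The three representations above can be sketched in plain Python as follows. This is a minimal illustration that assumes the documents are already tokenized; the function names are hypothetical and the notebook may rely on a library instead. The TF-IDF here is the unsmoothed textbook form (term frequency times log(N / df)); libraries such as scikit-learn apply smoothing by default, so their numbers will differ.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-token windows as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def bag_of_words(docs):
    """Return a sorted vocabulary and one raw count vector per document."""
    vocab = sorted({tok for doc in docs for tok in doc})
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append([counts[tok] for tok in vocab])
    return vocab, vectors


def tf_idf(docs):
    """Weight each count by tf = count / doc length and idf = log(N / df)."""
    vocab, counts = bag_of_words(docs)
    n_docs = len(docs)
    # Document frequency: in how many documents each vocabulary term occurs.
    df = [sum(1 for row in counts if row[j] > 0) for j in range(len(vocab))]
    idf = [math.log(n_docs / d) for d in df]
    weighted = []
    for row, doc in zip(counts, docs):
        total = len(doc) or 1  # guard against empty documents
        weighted.append([(c / total) * w for c, w in zip(row, idf)])
    return vocab, weighted
```

A term that appears in every document gets an IDF of zero under this formula, which is the intended effect: words shared by all documents carry no discriminating weight.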

3. Word2Vec Model

  • Train Skip-gram & CBOW models using Gensim
  • Generate 300-dimensional Vietnamese word embeddings
  • Evaluate with similarity & analogy tests
  • Visualize embeddings via PCA 2D plot
    👉 Notebook
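
To make the similarity and analogy tests above concrete, here is a small pure-Python sketch. It does not train anything and does not use Gensim; it only shows the arithmetic behind the evaluation, assuming cosine similarity and the common "b - a + c" analogy rule. The function names and the 3-dimensional `TOY` vectors are hand-made for illustration and are not trained embeddings from this repository.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def most_similar(vectors, query, exclude=(), topn=3):
    """Rank words by cosine similarity to the query vector."""
    scored = [(word, cosine(query, vec))
              for word, vec in vectors.items() if word not in exclude]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


def analogy(vectors, a, b, c, topn=1):
    """Answer 'a is to b as c is to ?' by searching near b - a + c."""
    target = [vb - va + vc
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    # The three input words are excluded so they cannot be returned.
    return most_similar(vectors, target, exclude={a, b, c}, topn=topn)


# Hand-made toy vectors (assumption, for illustration only).
TOY = {
    "vua": [1.0, 1.0, 0.0],        # king
    "hoàng_hậu": [1.0, 0.0, 1.0],  # queen
    "đàn_ông": [0.0, 1.0, 0.0],    # man
    "phụ_nữ": [0.0, 0.0, 1.0],     # woman
    "sách": [0.0, 0.5, 0.5],       # book
}

if __name__ == "__main__":
    # man is to woman as king is to ?
    print(analogy(TOY, "đàn_ông", "phụ_nữ", "vua"))
```

With trained 300-dimensional embeddings the same arithmetic applies, only over a much larger vocabulary, where the top-ranked word is not guaranteed to be the expected answer.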
