This repository shows how NLP can tackle real problems. It includes source code, datasets, and state-of-the-art NLP techniques.
- Data Augmentation in NLP
- Data Augmentation library for Text
- Is Your NLP Model Able to Prevent Adversarial Attacks?
- How does Data Noising Help to Improve your NLP Model?
- Data Augmentation library for Speech Recognition
- Data Augmentation library for Audio
- Unsupervised Data Augmentation
- Adversarial Attacks in Textual Deep Neural Networks

| Section | Sub-Section | Description | Story |
|---|---|---|---|
| Tokenization | Subword Tokenization | | Medium |
| Tokenization | Word Tokenization | | Medium Github |
| Tokenization | Sentence Tokenization | | Medium Github |
| Part of Speech | | | Medium Github |
| Lemmatization | | | Medium Github |
| Stemming | | | Medium Github |
| Stop Words | | | Medium Github |
| Phrase Word Recognition | | | |
| Spell Checking | Lexicon-based | Peter Norvig algorithm | Medium Github |
| Spell Checking | Lexicon-based | SymSpell | Medium Github |
| Machine Translation | Statistical Machine Translation | | Medium |
| Machine Translation | Attention | | Medium |
| String Matching | Fuzzywuzzy | | Medium Github |

| Section | Sub-Section | Research Lab | Story | Source |
|---|---|---|---|---|
| Traditional Method | Bag-of-words (BoW) | | Medium Github | |
| Traditional Method | Latent Semantic Analysis (LSA) and Latent Dirichlet Allocation (LDA) | | Medium Github | |
| Character Level | Character Embedding | NYU | Medium Github | Paper |
| Word Level | Negative Sampling and Hierarchical Softmax | | Medium | |
| Word Level | Word2Vec, GloVe, fastText | | Medium Github | |
| Word Level | Contextualized Word Vectors (CoVe) | Salesforce | Medium Github | Paper Code |
| Word Level | Misspelling Oblivious (word) Embeddings | | Medium | Paper |
| Word Level | Embeddings from Language Models (ELMo) | AI2 | Medium Github | Paper Code |
| Word Level | Contextual String Embeddings | Zalando Research | Medium | Paper Code |
| Sentence Level | Skip-thoughts | | Medium Github | Paper Code |
| Sentence Level | InferSent | | Medium Github | Paper Code |
| Sentence Level | Quick-Thoughts | | Medium | Paper Code |
| Sentence Level | General Purpose Sentence (GenSen) | | Medium | Paper Code |
| Sentence Level | Bidirectional Encoder Representations from Transformers (BERT) | | Medium | Paper(2019) Code |
| Sentence Level | Generative Pre-Training (GPT) | OpenAI | Medium | Paper(2019) Code |
| Sentence Level | Self-Governing Neural Networks (SGNN) | | Medium | Paper |
| Sentence Level | Multi-Task Deep Neural Networks (MT-DNN) | Microsoft | Medium | Paper(2019) |
| Sentence Level | Generative Pre-Training-2 (GPT-2) | OpenAI | Medium | Paper(2019) Code |
| Sentence Level | Universal Language Model Fine-tuning (ULMFiT) | fast.ai | Medium | Paper Code |
| Sentence Level | BERT in Science Domain | | Medium | Paper(2019) Paper(2019) |
| Sentence Level | BERT in Clinical Domain | NYU/PU | Medium | Paper(2019) Paper(2019) |
| Sentence Level | RoBERTa | UW/Facebook | Medium | Paper(2019) Paper |
| Sentence Level | Unified Language Model for NLP and NLU (UNILM) | Microsoft | Medium | Paper(2019) |
| Sentence Level | Cross-lingual Language Model (XLMs) | | Medium | Paper(2019) |
| Sentence Level | Transformer-XL | CMU/Google | Medium | Paper(2019) |
| Sentence Level | XLNet | CMU/Google | Medium | Paper(2019) |
| Sentence Level | CTRL | Salesforce | Medium | Paper(2019) |
| Document Level | lda2vec | | Medium | Paper |
| Document Level | doc2vec | | Medium Github | Paper |

| Section | Sub-Section | Description | Research Lab | Story | Paper & Code |
|---|---|---|---|---|---|
| Named Entity Recognition (NER) | Pattern-based Recognition | | | Medium | |
| Named Entity Recognition (NER) | Lexicon-based Recognition | | | Medium | |
| Named Entity Recognition (NER) | spaCy Pre-trained NER | | | Medium Github | |
| Optical Character Recognition (OCR) | Printed Text | Google Cloud Vision API | | Medium | Paper |
| Optical Character Recognition (OCR) | Handwriting | LSTM | | Medium | Paper |
| Text Summarization | Extractive Approach | | | Medium Github | |
| Text Summarization | Abstractive Approach | | | Medium | |
| Emotion Recognition | Audio, Text, Visual | 3 Multimodals for Emotion Recognition | | Medium | |

| Section | Sub-Section | Description | Research Lab | Story | Paper & Code |
|---|---|---|---|---|---|
| Feature Representation | Unsupervised Learning | Introduction to Audio Feature Learning | | Medium | Paper 1 Paper 2 Paper 3 |
| Feature Representation | Unsupervised Learning | Speech2Vec and Sentence Level Embeddings | | Medium | Paper 1 Paper 2 |
| Speech-to-text | | Introduction to Speech-to-text | | Medium | |

| Section | Sub-Section | Description | Research Lab | Story | Paper & Code |
|---|---|---|---|---|---|
| Euclidean Distance, Cosine Similarity and Jaccard Similarity | | | | Medium Github | |
| Edit Distance | Levenshtein Distance | | | Medium Github | |
| Word Mover's Distance (WMD) | | | | Medium Github | |
| Supervised Word Mover's Distance (S-WMD) | | | | Medium | |
| Manhattan LSTM | | | | Medium | Paper |
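
As a quick illustration of three of the measures listed above, here is a minimal sketch using only the Python standard library. It is not the code from the linked stories. The function names and the whitespace tokenization are assumptions made for this example.

```python
import math
from collections import Counter


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions."""
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # delete ch_a
                               current[j - 1] + 1,       # insert ch_b
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]


def jaccard(a: str, b: str) -> float:
    """Size of the token intersection divided by the size of the token union."""
    set_a, set_b = set(a.split()), set(b.split())
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 1.0


def cosine(a: str, b: str) -> float:
    """Cosine similarity between raw term-count vectors (assumed whitespace tokens)."""
    vec_a, vec_b = Counter(a.split()), Counter(b.split())
    dot = sum(vec_a[t] * vec_b[t] for t in vec_a)
    norm_a = math.sqrt(sum(v * v for v in vec_a.values()))
    norm_b = math.sqrt(sum(v * v for v in vec_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


print(levenshtein("kitten", "sitting"))       # 3
print(jaccard("the cat sat", "the cat ran"))  # 0.5
print(cosine("the cat sat", "the cat ran"))
```

Levenshtein distance works on characters, while the Jaccard and cosine sketches work on word tokens, so they answer different questions about how close two strings are.
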

| Section | Sub-Section | Description | Research Lab | Story | Paper & Code |
|---|---|---|---|---|---|
| ELI5, LIME and Skater | | | | Medium Github | |
| SHapley Additive exPlanations (SHAP) | | | | Medium Github | |
| Anchors | | | | Medium Github | |

| Section | Sub-Section | Description | Link |
|---|---|---|---|
| Spellcheck | | | Github |
| InferSent | | | Github |