londogard/londogard-nlp-toolkit

londogard-nlp-toolkit

Londogard Natural Language Processing Toolkit written in Kotlin for the JVM.
This toolkit will be used throughout Londogard libraries/products such as our Summarizer, Text-Generation & more.

The LanguageSupport enum tells you which languages each tool, such as Embeddings or Stopwords, supports out-of-the-box.

| Tool | Info | Docs | Samples (Kotlin Notebook) |
| --- | --- | --- | --- |
| Word Embeddings | Word & Subword Embeddings available in 157 (fastText.cc) & 275 languages (bpemb) out-of-the-box. | embeddings | wordembeddings.ipynb |
| Sentence Embeddings | Average & Unsupervised Random Walk Sentence Embeddings | sentence-embeddings | sentence-embeddings.ipynb |
| Stopwords | Supports 23 languages out-of-the-box through NLTK's stopword lists | stopwords | stopwords.ipynb |
| Word Frequencies | Supports 34 languages out-of-the-box through LuminosoInsight word frequency tables | wordfrequency | wordfreq.ipynb |
| Stemming | Supports 14 languages out-of-the-box using Snowball Stemmer under the hood | stemming | stemmer.ipynb |
| Tokenizers | Char, Word, Subword & Sentence Tokenizer support! SentencePiece? HuggingFace? It's there! | tokenizers; sentence-tokenizers | tokenizer.ipynb |
| Vectorizers & Encoders | BagOfWords, TF-IDF, BM25 & OneHot | vectorizers (TF-IDF, BM-25, ..); count-vectorizers (Count, Hash, ..); encoders (OneHot); transforms (TF-IDF, BM-25, ..) | TODO |
| Keyword Extractions | CooccurenceKeywords based on the algorithm proposed in DOI:10.1142/S0218213004001466 | | keywords.ipynb |
| Machine Learning | LogisticRegression Classifier (using Gradient Descent), NaïveBayes (binary) & Hidden Markov Model (HMM) as Sequence Classifier | classifiers (LogisticRegression, NaïveBayes); regression (LinearRegression); sequence classifier (HiddenMarkovModel) | See e2e-examples |
| Deep Learning (Transformers / HuggingFace) | ClassifierPipeline and TokenClassifierPipeline, which support HuggingFace ONNX model-names & PyTorch from local files | transformers | See e2e-examples |
| spaCy-like API | 🚧WIP🚧 | | |
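
To make the "Average" sentence embedding entry concrete, here is a minimal stand-alone Kotlin sketch of the idea: a sentence vector is the mean of the vectors of its known words. It deliberately does not use this toolkit's API. The function name `averageSentenceEmbedding`, its parameters and the toy vectors are all assumptions made up for illustration.

```kotlin
// Toy illustration only: NOT the londogard-nlp-toolkit API.
// Averages word vectors into one sentence vector, skipping unknown words.
fun averageSentenceEmbedding(
    tokens: List<String>,
    wordVectors: Map<String, FloatArray>,
    dimensions: Int
): FloatArray {
    val sum = FloatArray(dimensions)
    var known = 0
    for (token in tokens) {
        val vector = wordVectors[token] ?: continue // skip out-of-vocabulary tokens
        for (i in 0 until dimensions) sum[i] += vector[i]
        known++
    }
    // When no token is known, the zero vector is returned (avoids dividing by zero).
    if (known > 0) {
        for (i in 0 until dimensions) sum[i] = sum[i] / known.toFloat()
    }
    return sum
}

fun main() {
    // Made-up 2-dimensional vectors, purely for demonstration.
    val toyVectors = mapOf(
        "hello" to floatArrayOf(1f, 0f),
        "world" to floatArrayOf(0f, 1f)
    )
    val sentence = "hello brave world".split(" ")
    // "brave" has no vector, so only "hello" and "world" are averaged.
    println(averageSentenceEmbedding(sentence, toyVectors, dimensions = 2).toList()) // prints [0.5, 0.5]
}
```

In practice the word vectors would come from pretrained embeddings such as the fastText or bpemb ones listed above. See the linked docs and notebooks for the toolkit's actual classes.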

Installation

Maven Central

implementation("com.londogard:nlp:$version")
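
For context, the dependency line might sit in a `build.gradle.kts` roughly as sketched here. This assumes a Gradle project using the Kotlin DSL with a plugin applied that provides the `implementation` configuration (for example the Kotlin JVM plugin). The variable name `londogardNlpVersion` is hypothetical, and the version is left as a placeholder to be filled in from Maven Central.

```kotlin
// build.gradle.kts (sketch; assumes the Kotlin JVM or Java plugin is applied)
repositories {
    mavenCentral() // the artifact is published on Maven Central
}

// Hypothetical variable name; replace the placeholder with the release you want.
val londogardNlpVersion = "<version>"

dependencies {
    implementation("com.londogard:nlp:$londogardNlpVersion")
}
```

A dedicated variable is used rather than `$version` because, inside a Gradle Kotlin script, `version` resolves to the version of your own project.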

Guides

Simple end-to-end guides are available as notebooks in docs/samples.

This includes:

  1. IMDB Sentiment Analysis using Logistic Regression or Naïve Bayes
  2. IMDB Sentiment Analysis using HuggingFace Transformers, using ClassifierPipeline.create(<model-name>)
  3. POS-Tagging using Hidden Markov Model
  4. POS-Tagging using HuggingFace Transformers, using TokenClassifierPipeline.create(<model-name>)

More guides may be added over time.