# jtokkit
Here are 2 public repositories matching this topic...
This project implements Byte Pair Encoding (BPE) tokenization together with a Word2Vec model to generate word embeddings from a text corpus. It uses Apache Hadoop for distributed processing and includes evaluation metrics for selecting the embedding dimensionality.
scala logback word2vec scalatest hadoop-mapreduce deeplearning4j nd4j apache-hadoop llm bpe-tokenizer jtokkit
Updated Nov 2, 2024 - Scala
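
For readers unfamiliar with the BPE step named in the description above, here is a minimal, self-contained sketch of how merge rules can be learned from a corpus. It is an illustration only: it is not the listed repository's code and not jtokkit's API, and the function names (`most_frequent_pair`, `merge_pair`, `learn_bpe`) and the tie-breaking rule are assumptions made for this example.

```python
from collections import Counter


def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    if not pairs:
        return None
    # Assumed tie-break: highest count first, then lexicographic order of the pair.
    return min(pairs, key=lambda p: (-pairs[p], p))


def merge_pair(words, pair):
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def learn_bpe(corpus, num_merges):
    """Learn up to `num_merges` merge rules from a whitespace-split corpus."""
    # Each word starts as a tuple of single characters.
    words = Counter(tuple(w) for w in corpus.split())
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:  # nothing left to merge
            break
        merges.append(pair)
        words = merge_pair(words, pair)
    return merges
```

With the toy corpus `"low low low lower lowest"`, `learn_bpe(corpus, 3)` returns `[('l', 'o'), ('lo', 'w'), ('low', 'e')]`. A production tokenizer typically operates on bytes and loads a pretrained merge table rather than learning one this way.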