basic dictionary-based tokenization
for example:
use {
    std::{
        fs::File,
        io::{BufRead, BufReader},
    },
    token_dict::TokenDict,
};

pub fn main() {
    // build the dictionary from words.txt, one entry per line
    let tokenizer: TokenDict = BufReader::new(File::open("words.txt").unwrap())
        .lines()
        .filter_map(Result::ok)
        .collect();
    // encode the text to token ids, then decode the ids back to text
    let tokens: Vec<u32> = tokenizer.tokenize_str("some text to tokenize").collect();
    let detokens: String = tokenizer.detokenize_str(&tokens).collect();
    // print the ids as a comma-separated list, then the round-tripped text
    print!("[");
    for id in tokens.iter().take(tokens.len().saturating_sub(1)) { print!("{id}, "); }
    if let Some(id) = tokens.last() { print!("{id}"); }
    println!("]");
    println!("\"{detokens}\"");
}

possible output (the id sequence depends on what's in words.txt):
[375018, 32, 403933, 32, 410301, 32, 410782]
"some text to tokenize"
this tokenizer finds the next token by checking whether a dictionary entry is a prefix of the remaining text, so although it's designed for word-level tokenization it doesn't need to split the input on word boundaries first
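one common way to implement that lookup is greedy longest-prefix matching: at each step, take the longest dictionary entry that is a prefix of the remaining input, emit its id, and advance past it. the sketch below only illustrates that idea with a toy five-entry dictionary; it is not TokenDict's actual algorithm or data structures:

fn main() {
    // illustration only: greedy longest-prefix matching over a toy dictionary
    // where a token's id is just its index in this list
    let dict = ["some", "text", "to", "tokenize", " "];

    let mut rest = "some text to tokenize";
    let mut ids = Vec::new();
    while !rest.is_empty() {
        // among entries that are a prefix of the remaining text, pick the longest
        let (id, word) = dict
            .iter()
            .enumerate()
            .filter(|&(_, w)| rest.starts_with(w))
            .max_by_key(|&(_, w)| w.len())
            .expect("no dictionary entry matches the remaining text");
        ids.push(id as u32);
        rest = &rest[word.len()..];
    }
    // prints [0, 4, 1, 4, 2, 4, 3]; the last step picks "tokenize" over the shorter match "to"
    println!("{ids:?}");
}

TokenDict handles this matching internally, so the same program as the first example, with only the input string changed, works on text that contains no separators at all: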
use {
    std::{
        fs::File,
        io::{BufRead, BufReader},
    },
    token_dict::TokenDict,
};

pub fn main() {
    let tokenizer: TokenDict = BufReader::new(File::open("words.txt").unwrap())
        .lines()
        .filter_map(Result::ok)
        .collect();
    let tokens: Vec<u32> = tokenizer.tokenize_str("スペースは不要です").collect();
    let detokens: String = tokenizer.detokenize_str(&tokens).collect();
    print!("[");
    for id in tokens.iter().take(tokens.len().saturating_sub(1)) { print!("{id}, "); }
    if let Some(id) = tokens.last() { print!("{id}"); }
    println!("]");
    println!("\"{detokens}\"");
}

possible output (the id sequence depends on what's in words.txt):
[470364, 467937, 471716, 467952]
"スペースは不要です"