toki-pona-pi

01/05/2024: I'm on vacation to refresh some ideas.

(This file is a stub.)

Some statistical programs for frequency analysis of Toki Pona. Want to know more about Toki Pona?

Steven D. Rogers said: "One of the language's main goals is a focus on minimalism. It is designed to express maximal meaning with minimal complexity."

Toki Pona

Toki Pona (TP) is a minimal conlang, both in vocabulary, with 14 letters and 124 lemmas, and in grammar, with only about 10 syntax rules.

The language also has a glyph for each word, and each word refers to a concept with multiple meanings.

Renato Fabbri made a methodological study in "Basic concepts and tools for the Toki Pona minimal and constructed language: description of the language and main issues; analysis of the vocabulary; text synthesis and syntax highlighting; Wordnet synsets;" [https://arxiv.org/abs/1712.09359]

Vocabularies and Dictionaries

Most meanings in Toki Pona are built by using words as concepts and combining those concepts to describe things.

The choice and order of words form the combinations that translate and transmit ideas.

Lexicalization [https://sona.pona.la/wiki/Lexicalization] goes against the philosophy of Toki Pona, which prefers personal meaning over the fixed patterns of any dictionary.

But the common use of word composition is inherent to any language, as a productive method to create predicates without creating new words by concatenation.

By taking common Toki Pona texts and calculating the frequency of single words and of words in sequence, the most used n-grams are shown. The most frequent n-grams indicate which concepts are most often used together (auto-correlated).
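As an illustration of the counting described above, here is a minimal Python sketch. It is not the repository's awk code: the function name `count_ngrams` and the sample sentences are made up for the example, and it assumes that n-grams never cross a line boundary.

```python
from collections import Counter

def count_ngrams(lines, n):
    """Count n-grams of whitespace-separated words.

    Assumption: an n-gram never spans two lines.
    """
    counts = Counter()
    for line in lines:
        words = line.split()
        for i in range(len(words) - n + 1):
            counts[tuple(words[i:i + n])] += 1
    return counts

# Hypothetical sample text, one sentence per line.
sample = ["mi moku e kili", "sina moku e telo", "mi moku"]

unigrams = count_ngrams(sample, 1)
bigrams = count_ngrams(sample, 2)

print(unigrams.most_common(3))
print(bigrams.most_common(3))
```

Keys are tuples of words, so the same function works for trigrams or longer sequences by changing `n`.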

Programs

For simplicity the programs use bash, tr and awk. They take as inputs a vocabulary reference (a list of words, one per line) and a text source (lines composed of words separated by spaces). They output a parsed text with only the words in the vocabulary, and a file with counts for words and bigrams.

Some cleanup is done before processing: removing non a-z characters, converting to lowercase and sorting lines.
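The cleanup and vocabulary filter could look roughly like the Python sketch below. It is only an illustration, not the repository's bash, tr and awk scripts. The names `clean_line` and `keep_vocabulary` and the tiny vocabulary are hypothetical. Two assumptions are made: text is lowercased before non a-z characters are dropped (otherwise capital letters would be lost), and dropped characters are replaced by a space so that neighbouring words do not merge.

```python
import re

def clean_line(line):
    """Lowercase the line, then turn every run of non a-z characters into one space."""
    return " ".join(re.sub(r"[^a-z]+", " ", line.lower()).split())

def keep_vocabulary(lines, vocabulary):
    """Clean each line, keep only words found in the vocabulary, and sort the lines."""
    kept = []
    for line in lines:
        words = [w for w in clean_line(line).split() if w in vocabulary]
        if words:  # skip lines left empty after filtering
            kept.append(" ".join(words))
    return sorted(kept)

# Hypothetical inputs: a tiny vocabulary and a few raw lines.
vocabulary = {"mi", "sina", "moku", "e", "kili", "li", "pona"}
text = ["Sina moku e kili!", "mi moku e pizza.", "???"]

print(keep_vocabulary(text, vocabulary))
```

Holding the vocabulary in a set keeps each membership test cheap, which matters on a corpus with hundreds of thousands of words.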

Note: these programs can be used for statistics on any language.

Corpus

The Toki Pona corpus tatoeba.tok was obtained from tatoeba.org, with 56517 lines and 547408 words.

Frequency

Please see Frequency Analysis.

Be binary?

Each of the 124 glyphs of Toki Pona, plus null, space, new line and full stop, could be represented by one byte.

Reserve 0x00 for null, 0x01 for space, 0x02 for new line and 0x03 for full stop (like this one --->).

Reserve 0x04 to 0x0E for the 11 particles: a, ala, e, kin, la, li, lon, nanpa, o, pi, seme.

Next, 0x0F to 0xFF are used for the other glyphs, sorted alphabetically, by frequency or by any other defined criterion.

Each of the 124 words in Toki Pona can be represented as one byte.

... and no more symbols.
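The byte layout above can be tried out with a small Python sketch. It is only an illustration under stated assumptions. The function names (`build_codebook`, `encode`, `decode`) are hypothetical, the non-particle words are sorted alphabetically (one of the criteria mentioned above), and only four of them are used in the example, not the full word list.

```python
import re

# Reserved codes from the text: null, space, new line, full stop.
CONTROL = {"\x00": 0x00, " ": 0x01, "\n": 0x02, ".": 0x03}
# The 11 particles, mapped in this order to 0x04..0x0E.
PARTICLES = ["a", "ala", "e", "kin", "la", "li", "lon", "nanpa", "o", "pi", "seme"]

def build_codebook(other_words):
    """Map particles to 0x04..0x0E and the other words, sorted alphabetically, to 0x0F upward."""
    codebook = {word: 0x04 + i for i, word in enumerate(PARTICLES)}
    for i, word in enumerate(sorted(set(other_words) - set(PARTICLES))):
        code = 0x0F + i
        if code > 0xFF:
            raise ValueError("too many words for a one-byte code")
        codebook[word] = code
    return codebook

def encode(text, codebook):
    """Encode text as one byte per word, space, new line or full stop.

    Characters outside a-z and the reserved symbols are skipped;
    a word missing from the codebook raises KeyError.
    """
    tokens = re.findall(r"[a-z]+|[ \n.\x00]", text)
    return bytes(CONTROL[t] if t in CONTROL else codebook[t] for t in tokens)

def decode(data, codebook):
    """Invert encode(): turn each byte back into its word or symbol."""
    reverse = {code: word for word, code in codebook.items()}
    reverse.update({code: ch for ch, code in CONTROL.items()})
    return "".join(reverse[b] for b in data)

# Hypothetical partial word list, just enough for the example sentence.
codebook = build_codebook(["mi", "moku", "kili", "sina"])
sentence = "mi moku e kili.\n"
packed = encode(sentence, codebook)

print(list(packed))
print(len(sentence), "characters ->", len(packed), "bytes")
```

Decoding is a plain table lookup, so `decode(packed, codebook)` gives back the original sentence as long as the same codebook is used on both sides.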

