Skip to content

Latest commit

 

History

History
71 lines (56 loc) · 1.99 KB

README.mkd

File metadata and controls

71 lines (56 loc) · 1.99 KB

Markov Chain Generator of fixed-length sequences of words.

I use it to generate random species names (Linnean binomial nomenclature).

Takes in input a text file containing one example name per line (e.g one species name per line).

Output a serie of randomly-generated (letter-wise) sequence of words. There is a transition matrix for each position of the word in the line (e.g a transition matrix for the genus name, and one for the species name).

The first time it is run on a training dataset, it will save the transition matrix and word length distribution in .npy files in the current directory. You can delete those files to rerun the learning step.

Usage

# Generate random species names (2 words: Genus+species)
python3 species_markovi.py trainingsets/Opisthokonta_from_timetree.txt -o 3

# To generate random genus names (one word)
./species_markovi.py trainingsets/Opisthokonta-genera_from_timetree.txt -w 1 -o 3

Output example

An example of names generated with a chain of order 3, with a training set of 47850 Opisthokonta species from TimeTree:

Pseudosophilus furicoloroensis
Hydroas morachyuratus
Cetheria dicoloralis
Feylios placontademontalis
Melaconius thymetterina
Amphorus zymalendicus
Mokopis pyrans
Tarhombodina trix
Chus flexaspardelanevana
Panoplodon neva
Notamia zospinis
Aiti planthurenae
Napeonocohyllonotaena atrilis
Spiza virinensis
Eoscelis sphaemariatus
Chla cata
Phylamydoides elongchir
Eum tanum
Scylidura tatulater
Pana pardwickleylosus

And another example of order 3 trained on mammalian genera:

Pomys
Hericrotheirotis
Mineociurus
Braccotocomyscius
Asteaseutes
Phoecopomyomus
Equattinus
Martia
Peomyscophintodorhicomys
Stenobudellorex

Licensing: Do what you want with it.