assassinsurvivor/boosting_with_bert

Boosting with BERT

The tokenizer.py script trains a vocabulary from the data that will later be used to train BERT & HAN.

Here is a detailed description:

1. The frequent_bichar function generates the bigrams that maximize the likelihood of occurrence in the corpus.

2. The decode function uses the vocabulary generated by frequent_bichar to tokenize a word.
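The two functions above describe a byte-pair-encoding-style loop: repeatedly merge the most frequent adjacent pair into a new vocabulary entry, then tokenize words greedily against that vocabulary. The repository's actual implementation is not shown here, so the sketch below is an assumption about how such a pair-merging trainer and greedy decoder could look; the names frequent_bichar and decode mirror the script, but merge_pair and build_vocab are hypothetical helpers.

```python
from collections import Counter

def frequent_bichar(words, min_freq):
    """Sketch: count adjacent symbol pairs over the corpus and return the
    most frequent pair, or None if no pair reaches min_freq."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    if not pairs:
        return None
    pair, freq = pairs.most_common(1)[0]
    return pair if freq >= min_freq else None

def merge_pair(words, pair):
    """Replace every occurrence of the chosen pair with its concatenation."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged

def build_vocab(corpus, max_size, min_freq):
    """Hypothetical trainer: merge the most frequent bigram until the
    vocabulary reaches max_size or no bigram clears min_freq."""
    words = Counter(tuple(w) for w in corpus)
    vocab = {c for w in words for c in w}       # start from single characters
    while len(vocab) < max_size:
        pair = frequent_bichar(words, min_freq)
        if pair is None:
            break
        words = merge_pair(words, pair)
        vocab.add(pair[0] + pair[1])
    return vocab

def decode(word, vocab):
    """Greedy longest-match tokenization of one word with the vocabulary.
    Unknown single characters fall through as their own tokens."""
    tokens, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in vocab or j == i + 1:
                tokens.append(word[i:j])
                i = j
                break
    return tokens
```

With this sketch, a corpus of repeated words produces merged entries such as "we" from "newest"/"lower", and decode always covers the full word because single characters are accepted as a fallback.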

The script expects three command-line arguments:

        - the complete path to the .txt file containing the data

        - the maximum number of words the dictionary should contain

        - the minimum frequency a bigram must have

python tokenizer.py "sample.txt" 100 7
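The invocation above passes the three positional arguments in order: data path, maximum vocabulary size, minimum bigram frequency. A minimal sketch of how the script's entry point might read them (the parse_args helper is an assumption, not the repository's code):

```python
import sys

def parse_args(argv):
    """Hypothetical sketch: read the three positional arguments
    (data path, max vocabulary size, min bigram frequency)."""
    if len(argv) != 4:
        raise SystemExit(
            "usage: python tokenizer.py <data.txt> <max_vocab> <min_freq>")
    path = argv[1]
    max_vocab = int(argv[2])   # e.g. 100
    min_freq = int(argv[3])    # e.g. 7
    return path, max_vocab, min_freq
```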
