Here is the code I used to compile a Wikipedia corpus, preprocess it, and run topic modeling.

pollyndos/tm_compcorp_2021

tm_compcorp_2021

Here is the code I used to compile a Wikipedia corpus, preprocess it, and run topic modeling.


wikiscrap contains 4 functions for creating a comparable corpus from Wikipedia (I used the Wikipedia-API library). The output is a set of txt files.
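
A minimal sketch of how one English/Russian article pair could be fetched and saved with Wikipedia-API. This is not the code from wikiscrap: the function names `fetch_pair` and `save_pair`, the `en/` and `ru/` folder layout, and the user agent string are all assumptions made for illustration.

```python
from pathlib import Path


def fetch_pair(title, user_agent="tm_compcorp-example (contact@example.org)"):
    """Return (english_text, russian_text) for an English article title.

    Returns None if the article or its Russian counterpart is missing.
    Hypothetical helper, not part of wikiscrap.
    """
    import wikipediaapi  # pip install Wikipedia-API

    # Recent releases of Wikipedia-API require a user agent string.
    wiki_en = wikipediaapi.Wikipedia(user_agent=user_agent, language="en")
    page_en = wiki_en.page(title)
    if not page_en.exists():
        return None
    # langlinks maps a language code to the linked page in that language.
    page_ru = page_en.langlinks.get("ru")
    if page_ru is None:
        return None
    return page_en.text, page_ru.text


def save_pair(doc_id, en_text, ru_text, out_dir):
    """Write both texts as <out_dir>/en/<doc_id>.txt and <out_dir>/ru/<doc_id>.txt."""
    out = Path(out_dir)
    (out / "en").mkdir(parents=True, exist_ok=True)
    (out / "ru").mkdir(parents=True, exist_ok=True)
    en_path = out / "en" / f"{doc_id}.txt"
    ru_path = out / "ru" / f"{doc_id}.txt"
    en_path.write_text(en_text, encoding="utf-8")
    ru_path.write_text(ru_text, encoding="utf-8")
    return en_path, ru_path
```

Giving both files the same `doc_id` keeps the two halves of the corpus aligned, which the later steps rely on.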

preprocess contains functions for preprocessing texts in Russian and English.
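
For orientation, here is a standard-library sketch of the kind of steps such preprocessing usually covers: lowercasing, tokenization, and stop-word removal. It is not the code from preprocess. The stop-word lists are tiny placeholders, and lemmatization (which matters a lot for Russian) is left out, so treat the whole thing as an assumption about the general shape only.

```python
import re

# Tiny illustrative stop-word lists (assumption); real lists are much longer.
STOPWORDS = {
    "en": {"the", "and", "was", "for", "are", "with"},
    "ru": {"это", "что", "как", "для", "или", "его"},
}

# One token pattern per language: runs of Latin or Cyrillic letters.
TOKEN_PATTERNS = {
    "en": re.compile(r"[a-z]+"),
    "ru": re.compile(r"[а-яё]+"),
}


def preprocess(text, lang):
    """Lowercase, tokenize, and drop stop words and tokens shorter than 3 letters."""
    tokens = TOKEN_PATTERNS[lang].findall(text.lower())
    return [t for t in tokens if len(t) > 2 and t not in STOPWORDS[lang]]
```

Calling `preprocess(text, "ru")` on an English text returns an empty list, because the pattern only matches Cyrillic letters. That can be a cheap sanity check for files saved under the wrong language.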

BasicLDAmethods contains code for experiments with a comparable corpus using a standard LDA model from the gensim library.
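
A minimal sketch of training a standard gensim LDA model on tokenized documents. It is not the code from BasicLDAmethods. In particular, `merge_pairs` (concatenating each English document with its Russian counterpart so that one monolingual model sees both languages) is only one possible setup and is an assumption here, as are the function names and default parameter values.

```python
def merge_pairs(en_docs, ru_docs):
    """Concatenate aligned token lists: one merged document per article pair.

    Hypothetical helper; the repository may combine the corpora differently.
    """
    if len(en_docs) != len(ru_docs):
        raise ValueError("corpora must be aligned document by document")
    return [en + ru for en, ru in zip(en_docs, ru_docs)]


def train_lda(texts, num_topics=10, passes=10, seed=0):
    """Train a gensim LdaModel on a list of token lists."""
    from gensim import corpora, models  # pip install gensim

    # Map each unique token to an integer id.
    dictionary = corpora.Dictionary(texts)
    # Convert every document to a bag-of-words vector of (id, count) pairs.
    bow_corpus = [dictionary.doc2bow(doc) for doc in texts]
    lda = models.LdaModel(
        corpus=bow_corpus,
        id2word=dictionary,
        num_topics=num_topics,
        passes=passes,
        random_state=seed,  # fixed seed for repeatable runs
    )
    return lda, dictionary
```

After training, `lda.print_topics(num_words=10)` lists the top words per topic.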

pd2txt creates one file with all English documents and one with all Russian documents. These files can then be used to train a polylingual topic model (PLTM) from the MALLET package (http://mallet.cs.umass.edu/topics-polylingual.php)
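
A sketch of writing such files, assuming the one-instance-per-line layout (name, label, text) that MALLET's `import-file` command reads by default, with the documents in the same order in both files. It is not the code from pd2txt: the function name `write_mallet_files`, the output file names, and the use of plain lists as input are assumptions.

```python
from pathlib import Path


def write_mallet_files(doc_ids, en_docs, ru_docs, out_dir):
    """Write en.txt and ru.txt with one document per line: id, language, tokens.

    Line i of both files describes the same article pair, so the two
    corpora stay aligned. Document ids must not contain whitespace.
    """
    if not (len(doc_ids) == len(en_docs) == len(ru_docs)):
        raise ValueError("ids and both corpora must have the same length")
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = {}
    for lang, docs in (("en", en_docs), ("ru", ru_docs)):
        path = out / f"{lang}.txt"
        with path.open("w", encoding="utf-8") as f:
            for doc_id, tokens in zip(doc_ids, docs):
                f.write(f"{doc_id}\t{lang}\t{' '.join(tokens)}\n")
        paths[lang] = path
    return paths
```

Each resulting file is then imported into MALLET separately before training the polylingual model, as described on the MALLET page linked above.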


My comparable corpus is available here: https://drive.google.com/file/d/19HuC0MpxNc-WYKNF9zpc4NWmeWkRxtz4/view?usp=sharing