tm_compcorp_2021

This repository contains the code I used to compile a Wikipedia corpus, preprocess it, and run topic modeling on it.

wikiscrap contains 4 functions for creating a comparable corpus from Wikipedia (I used the Wikipedia-API library). The output is a set of txt files.
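As a rough illustration of how a comparable pair can be collected with Wikipedia-API, here is a minimal sketch. It is not the notebook's actual code: the function names `save_comparable_pair` and `fetch_pair`, the `<title>_<lang>.txt` naming scheme, and the user agent string are all assumptions made for this example. It relies only on the `exists()`, `title`, `text`, and `langlinks` members of a Wikipedia-API page.

```python
from pathlib import Path


def save_comparable_pair(en_page, out_dir, target_lang="ru"):
    """Save an English page and its counterpart in `target_lang` as txt files.

    `en_page` is expected to behave like a wikipediaapi.WikipediaPage.
    Returns True if both files were written, False otherwise.
    """
    if not en_page.exists():
        return False
    # langlinks maps language codes to the corresponding pages.
    other_page = en_page.langlinks.get(target_lang)
    if other_page is None:
        return False  # no counterpart, so there is no comparable pair

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    # Hypothetical naming scheme: <title>_<lang>.txt
    stem = en_page.title.replace("/", "_").replace(" ", "_")
    (out / f"{stem}_en.txt").write_text(en_page.text, encoding="utf-8")
    (out / f"{stem}_{target_lang}.txt").write_text(other_page.text, encoding="utf-8")
    return True


def fetch_pair(title, out_dir, target_lang="ru"):
    """Fetch `title` from English Wikipedia and save the pair (needs network)."""
    # Imported here so the helper above works without the library installed.
    import wikipediaapi  # pip install Wikipedia-API

    # Recent releases require a user agent; this value is a placeholder.
    wiki = wikipediaapi.Wikipedia(
        user_agent="comparable-corpus-sketch (you@example.com)", language="en"
    )
    return save_comparable_pair(wiki.page(title), out_dir, target_lang)
```

Calling `fetch_pair("Topic model", "corpus")` would then try to write one English and one Russian txt file into `corpus/`, skipping articles that have no Russian counterpart.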

preprocess contains functions for preprocessing texts in Russian and English.
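To give an idea of what such preprocessing can look like, here is a small standard-library sketch. It only lowercases, tokenizes, and removes stopwords; the notebook may well do more (for example lemmatization). The function name `preprocess`, the tiny stopword lists, and the minimum token length are assumptions chosen for illustration.

```python
import re

# Tiny illustrative stopword lists; a real pipeline would use fuller ones.
STOPWORDS = {
    "en": {"the", "a", "an", "and", "of", "in", "on", "is", "are", "to"},
    "ru": {"и", "в", "на", "не", "что", "это", "с", "по", "как", "к"},
}

# One pattern per language: Latin letters for English, Cyrillic for Russian.
TOKEN_RE = {
    "en": re.compile(r"[a-z]+"),
    "ru": re.compile(r"[а-яё]+"),
}


def preprocess(text, lang):
    """Lowercase `text`, keep alphabetic tokens of `lang`, drop stopwords and short tokens."""
    tokens = TOKEN_RE[lang].findall(text.lower())
    return [t for t in tokens if len(t) > 2 and t not in STOPWORDS[lang]]


print(preprocess("The cats are on the mat.", "en"))        # ['cats', 'mat']
print(preprocess("Кошка сидит на окне и смотрит.", "ru"))  # ['кошка', 'сидит', 'окне', 'смотрит']
```

Keeping one token pattern per language also filters out stray words in the other script, which are common in Wikipedia articles.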

BasicLDAmethods contains code for experiments on a comparable corpus using a standard LDA model from the gensim library.
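For readers unfamiliar with gensim, a minimal sketch of training a standard LDA model on tokenized documents follows. It is not the notebook's experiment setup: the helper name `train_lda`, the toy documents in `DOCS`, and the parameter values are assumptions for illustration only.

```python
def train_lda(tokenized_docs, num_topics=2, passes=10, seed=0):
    """Train a gensim LDA model on a list of token lists.

    Returns the model, the dictionary, and the bag-of-words corpus.
    """
    # Imported here so this file can be loaded where gensim is not installed.
    from gensim.corpora import Dictionary
    from gensim.models import LdaModel

    dictionary = Dictionary(tokenized_docs)                       # token <-> id mapping
    corpus = [dictionary.doc2bow(doc) for doc in tokenized_docs]  # (token id, count) pairs
    lda = LdaModel(
        corpus=corpus,
        id2word=dictionary,
        num_topics=num_topics,
        passes=passes,
        random_state=seed,  # fixed seed for repeatable runs
    )
    return lda, dictionary, corpus


# Toy, already preprocessed documents (made up for this example).
DOCS = [
    ["topic", "model", "corpus", "document"],
    ["corpus", "document", "word", "topic"],
    ["тема", "модель", "корпус", "документ"],
    ["корпус", "документ", "слово", "тема"],
]
```

After `lda, dictionary, corpus = train_lda(DOCS)`, the topics can be inspected with `lda.print_topics(num_words=5)` and the topic mixture of one document with `lda.get_document_topics(corpus[0])`.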

pd2txt creates files containing all documents of the English and Russian corpora, which can then be used to train a polylingual topic model (PLTM) from the MALLET package (http://mallet.cs.umass.edu/topics-polylingual.php).
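The idea can be sketched as follows: write one file per language with one document per line, keeping the lines aligned so that line N in both files refers to the same article pair. This is a sketch under assumptions, not the actual contents of pd2txt.py: the function name `write_aligned_files` and the `name<TAB>label<TAB>text` layout (chosen to match the default "name label text" line format of MALLET's `import-file`) are mine.

```python
def write_aligned_files(pairs, en_path, ru_path):
    """Write aligned one-document-per-line files for English and Russian.

    `pairs` is an iterable of (doc_id, en_text, ru_text) tuples.
    Returns the number of document pairs written.
    """
    count = 0
    with open(en_path, "w", encoding="utf-8") as en_file, \
         open(ru_path, "w", encoding="utf-8") as ru_file:
        for doc_id, en_text, ru_text in pairs:
            # Collapse all whitespace so each document stays on a single line.
            en_line = " ".join(en_text.split())
            ru_line = " ".join(ru_text.split())
            en_file.write(f"{doc_id}\ten\t{en_line}\n")
            ru_file.write(f"{doc_id}\tru\t{ru_line}\n")
            count += 1
    return count
```

With a pandas DataFrame holding hypothetical columns `doc_id`, `en`, and `ru`, the pairs could be supplied as `df[["doc_id", "en", "ru"]].itertuples(index=False)`.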

My comparable corpus is available here: https://drive.google.com/file/d/19HuC0MpxNc-WYKNF9zpc4NWmeWkRxtz4/view?usp=sharing

pollyndos/tm_compcorp_2021
