Cross-lingual supervised text classification

Hauke Licht
Cologne Center for Comparative Politics, University of Cologne
hauke.licht@wiso.uni-koeln.de

This repository contains the slides for my short tutorial on cross-lingual supervised text classification, which I prepared for the COMPTEXT 2022 conference.

Below you will find links to the data and the interactive Google Colab notebooks I use during the tutorial:

Link	Description
file	This CSV file is a cleaned version of the data set compiled by Pola Lehmann and Malisa Zobel (2018). The file can be downloaded or read from my Google Drive. Note that the data already include machine-translated versions of the sentences' original texts.
notebook	This notebook is a short walk-through of the Lehmann+Zobel data that reports the label and language distributions in the data.
notebook	In this notebook I use the Lehmann+Zobel data as an example to show how to use the `easyNMT` Python package to machine-translate a multilingual corpus free of charge.
notebook	In this notebook I use the Lehmann+Zobel data as an example to show how to use the `sentence-transformers` Python package to sentence-embed documents in a multilingual corpus.
folder	This folder in my Google Drive contains zipped TSV files that record multilingual sentence embeddings of the sentences in the Lehmann+Zobel data, which I generated using the knowledge-distilled XLM-R model available through the `sentence-transformers` package. The folder can be downloaded or read from my Google Drive.
notebook	In this notebook I sample quasi-sentences in the Lehmann+Zobel data into training, test, and cross-validation folds.
file	This JSON file records the training, test, and cross-validation indices sampled in the notebook described above. The file can be downloaded or read from my Google Drive.
notebook	In this notebook I train and evaluate supervised text classifiers on bag-of-words representations of machine-translated versions of quasi-sentences in the Lehmann+Zobel data.
notebook	In this notebook I train and evaluate supervised text classifiers using multilingual sentence embeddings of quasi-sentences in the Lehmann+Zobel data.
notebook	In this notebook I assess the language independence of a classifier trained on multilingual sentence embeddings (MSE) of quasi-sentences in the Lehmann+Zobel data.
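
The sampling step listed above (splitting items into training, test, and cross-validation folds and storing the indices as JSON) can be illustrated with a minimal sketch. This is not the code from the notebook: the function name `sample_splits`, the 20% test share, the five folds, the seed, and the JSON key names are all assumptions made for illustration, and the sketch samples uniformly at random without stratifying by label or language.

```python
import json
import random


def sample_splits(n_items, test_share=0.2, n_folds=5, seed=1234):
    """Split item indices into a held-out test set and k cross-validation folds.

    Hypothetical helper for illustration; parameters are assumed defaults.
    """
    rng = random.Random(seed)  # local RNG so the split is reproducible
    indices = list(range(n_items))
    rng.shuffle(indices)

    # Set aside the first share of the shuffled indices as the test set
    n_test = int(round(n_items * test_share))
    test = sorted(indices[:n_test])
    train = indices[n_test:]

    # Deal the remaining training indices round-robin into k disjoint folds
    folds = [sorted(train[i::n_folds]) for i in range(n_folds)]

    return {"test": test, "train": sorted(train), "folds": folds}


# Example with 100 hypothetical items
splits = sample_splits(n_items=100)

# Indices are plain integers, so they round-trip through JSON unchanged
serialized = json.dumps(splits)
restored = json.loads(serialized)
```

Storing only the indices, rather than copies of the data, keeps the file small and lets every downstream notebook evaluate on exactly the same folds.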

Note: If you haven't worked with Google Colab notebooks before, check out this short tutorial.
