This repository contains the slides for my short tutorial on cross-lingual supervised text classification I have prepared for the COMPTEXT 2022 conference.

haukelicht/crosslingual-supervised-text-classification-tuorial

Cross-lingual supervised text classification

Hauke Licht
Cologne Center for Comparative Politics, University of Cologne
hauke.licht@wiso.uni-koeln.de

This repository contains the slides for my short tutorial on cross-lingual supervised text classification, which I prepared for the COMPTEXT 2022 conference.

Below you will find links to the data and the interactive Google Colab notebooks I use during the tutorial:

| Link | Description |
|------|-------------|
| file | This CSV file is a cleaned version of the data set compiled by Pola Lehmann and Malisa Zobel (2018). The file can be downloaded or read from my Google Drive. Note that the data already include machine-translated versions of the sentences' original texts. |
| notebook | This notebook is a short walk-through of the Lehmann+Zobel data that reports the label and language distributions in the data. |
| notebook | In this notebook I use the Lehmann+Zobel data as an example to show how to use the EasyNMT Python package to machine-translate a multilingual corpus free of charge. |
| notebook | In this notebook I use the Lehmann+Zobel data as an example to show how to use the sentence-transformers Python package to sentence-embed documents in a multilingual corpus. |
| folder | This folder in my Google Drive contains zipped TSV files that record multilingual sentence embeddings of the sentences in the Lehmann+Zobel data, which I generated using the knowledge-distilled XLM-R model available through the sentence-transformers package. The folder can be downloaded or read from my Google Drive. |
| notebook | In this notebook I sample quasi-sentences in the Lehmann+Zobel data into training, test, and cross-validation folds. |
| file | This JSON file records the training, test, and cross-validation indices sampled in the notebook described above. The file can be downloaded or read from my Google Drive. |
| notebook | In this notebook I train and evaluate supervised text classifiers on bag-of-words representations of machine-translated versions of quasi-sentences in the Lehmann+Zobel data. |
| notebook | In this notebook I train and evaluate supervised text classifiers using multilingual sentence embeddings of quasi-sentences in the Lehmann+Zobel data. |
| notebook | In this notebook I assess the language independence of a classifier based on multilingual sentence embeddings (MSE) that was trained on quasi-sentences in the Lehmann+Zobel data. |
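
To give a rough idea of what the fold-sampling step listed above involves, here is a minimal sketch using only the Python standard library. It is not the code from the notebook: the function name `sample_folds`, the 20% test share, the five folds, and the seed are all assumptions made for illustration, and the notebook may well sample differently (for example, stratified by label or language).

```python
import json
import random


def sample_folds(n_items, test_share=0.2, n_folds=5, seed=1234):
    """Split item indices into a held-out test set and k cross-validation folds.

    Hypothetical helper for illustration; all defaults are assumptions.
    """
    rng = random.Random(seed)  # local generator, so the split is reproducible
    indices = list(range(n_items))
    rng.shuffle(indices)

    # Reserve the first share of the shuffled indices as the test set
    n_test = round(n_items * test_share)
    test = sorted(indices[:n_test])
    train = indices[n_test:]

    # Deal the remaining (shuffled) indices round-robin into k folds
    folds = [sorted(train[i::n_folds]) for i in range(n_folds)]
    return {"train": sorted(train), "test": test, "cv_folds": folds}


splits = sample_folds(n_items=100)

# Serialize the indices as JSON so that every later step reuses the same split
serialized = json.dumps(splits)
restored = json.loads(serialized)
```

Storing the indices rather than the sampled rows keeps the split independent of how the sentences are represented, so the same folds can be applied to both translated texts and embeddings.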

Note: If you haven't worked with Google Colab notebooks before, check out this short tutorial.
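
One simple way to think about the language-independence assessment listed in the table above is to compare a classifier's held-out performance language by language. The sketch below assumes binary labels and uses toy values; the function name `f1_by_language` and the data are hypothetical, and the notebook may rely on a different metric or procedure.

```python
from collections import defaultdict


def f1_by_language(y_true, y_pred, languages, positive_label=1):
    """Compute the F1 score of the positive class separately for each language."""
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0})
    for truth, pred, lang in zip(y_true, y_pred, languages):
        c = counts[lang]  # also registers languages with only true negatives
        if pred == positive_label and truth == positive_label:
            c["tp"] += 1
        elif pred == positive_label:
            c["fp"] += 1
        elif truth == positive_label:
            c["fn"] += 1

    scores = {}
    for lang, c in counts.items():
        denom = 2 * c["tp"] + c["fp"] + c["fn"]
        # F1 = 2TP / (2TP + FP + FN); defined as 0.0 here if there are no positives
        scores[lang] = 2 * c["tp"] / denom if denom else 0.0
    return scores


# Toy labels, predictions, and language codes (made up for illustration)
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 1, 1]
languages = ["de", "de", "de", "es", "es", "es"]

scores = f1_by_language(y_true, y_pred, languages)
for lang, score in sorted(scores.items()):
    print(f"{lang}: F1 = {score:.2f}")
```

Large gaps between the per-language scores would suggest that the classifier's performance depends on the language of the input text.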
