Cologne Center for Comparative Politics, University of Cologne
hauke.licht@wiso.uni-koeln.de
This repository contains the slides for my short tutorial on cross-lingual supervised text classification, which I prepared for the COMPTEXT 2022 conference.
Below you will find links to the data and the interactive Google Colab notebooks I use during the tutorial:
Link | Description |
---|---|
file | This CSV file is a cleaned version of the data set compiled by Pola Lehmann and Malisa Zobel (2018). The file can be downloaded or read from my Google Drive. Note that the data already include machine-translated versions of the sentences' original texts. |
notebook | This notebook is a short walk-through of the Lehmann+Zobel data that reports the label and language distributions in the data. |
notebook | In this notebook I use the Lehmann+Zobel data as an example to show how to use the easyNMT Python package to machine-translate a multilingual corpus free of charge. |
notebook | In this notebook I use the Lehmann+Zobel data as an example to show how to use the sentence-transformers Python package to compute sentence embeddings for the documents in a multilingual corpus. |
folder | This folder in my Google Drive contains zipped TSV files that record multilingual sentence embeddings of the sentences in the Lehmann+Zobel data. I generated them using the knowledge-distilled XLM-R model available through the sentence-transformers package. The folder can be downloaded or read from my Google Drive. |
notebook | In this notebook I sample quasi-sentences in the Lehmann+Zobel data into training, test, and cross validation folds. |
file | This JSON file records the training, test, and cross-validation indices sampled in the notebook described above. The file can be downloaded or read from my Google Drive. |
notebook | In this notebook I train and evaluate supervised text classifiers on bag-of-words representations of machine-translated versions of quasi-sentences in the Lehmann+Zobel data. |
notebook | In this notebook I train and evaluate supervised text classifiers using multilingual sentence embeddings of quasi-sentences in the Lehmann+Zobel data. |
notebook | In this notebook I assess the language independence of a classifier trained on multilingual sentence embeddings (MSE) of quasi-sentences in the Lehmann+Zobel data. |
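
One simple way to look at language independence is to compare a classifier's performance language by language. The following is a minimal sketch of that idea, not the code used in the notebook: the function name `per_language_f1`, the binary toy labels, and the language codes are all hypothetical and only serve as illustration.

```python
from collections import defaultdict


def per_language_f1(y_true, y_pred, languages, positive_label=1):
    """Compute the F1 score of the positive class separately per language.

    All three inputs are equal-length sequences; `languages` holds one
    language code per sentence (hypothetical input format).
    """
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0})
    for truth, pred, lang in zip(y_true, y_pred, languages):
        c = counts[lang]  # registers the language even if nothing is counted
        if pred == positive_label and truth == positive_label:
            c["tp"] += 1
        elif pred == positive_label:
            c["fp"] += 1
        elif truth == positive_label:
            c["fn"] += 1

    scores = {}
    for lang, c in counts.items():
        denom = 2 * c["tp"] + c["fp"] + c["fn"]
        # F1 = 2TP / (2TP + FP + FN); defined as 0.0 if there are no positives
        scores[lang] = 2 * c["tp"] / denom if denom else 0.0
    return scores


# Toy example with made-up labels and predictions
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 1, 1]
languages = ["de", "de", "de", "en", "en", "en"]

scores = per_language_f1(y_true, y_pred, languages)
for lang, f1 in sorted(scores.items()):
    print(f"{lang}: F1 = {f1:.3f}")
```

Large gaps between the per-language scores would suggest that the classifier does not transfer equally well across languages.
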
Note: If you haven't worked with Google Colab notebooks before, check out this short tutorial.
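
If you prefer to work outside Colab, sampling row indices into training, test, and cross-validation folds and storing them as JSON can be sketched with the Python standard library alone. This is only an illustration under assumed settings: the function `sample_splits`, the 20% test share, the five folds, the seed, and the key names are hypothetical and do not necessarily match the JSON file listed above.

```python
import json
import random


def sample_splits(n_items, test_share=0.2, n_folds=5, seed=1234):
    """Randomly split row indices into a test set and k training folds."""
    indices = list(range(n_items))
    rng = random.Random(seed)  # fixed seed makes the split reproducible
    rng.shuffle(indices)

    n_test = int(round(n_items * test_share))
    test = sorted(indices[:n_test])
    train = indices[n_test:]

    # Deal the shuffled training indices round-robin into n_folds folds
    folds = [sorted(train[i::n_folds]) for i in range(n_folds)]
    return {"train": sorted(train), "test": test, "folds": folds}


splits = sample_splits(n_items=100)

# Serialize to JSON and read it back, as one would with a file on disk
as_json = json.dumps(splits)
restored = json.loads(as_json)

print(len(restored["train"]), len(restored["test"]), len(restored["folds"]))
```

Storing indices rather than copies of the data keeps the file small and lets every notebook reuse exactly the same splits.
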