Releases: transducens/parallel-urls-classifier
PyTorch model
PyTorch model for use with the code provided in this repository. A manually converted, Hugging Face-compatible model is also available: https://huggingface.co/Transducens/xlm-roberta-base-parallel-urls-classifier
Prefer this version over the Hugging Face one if, for example, you want to run the Gunicorn server with the available scripts and without writing new code.
Dataset
Dataset used to train and evaluate the released model. Before using it with the code in this repository, decompress the files:
xz -d train.tsv.xz
xz -d dev.tsv.xz
xz -d test.tsv.xz
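If you would rather keep the compressed archives next to the extracted files, the following is a minimal sketch of an alternative. It assumes the three archives sit in the current directory; `decompress_split` is a hypothetical helper name, not part of this repository. It uses `xz -t` to test archive integrity and `xz -dk` to decompress while keeping the `.xz` file.

```shell
# Hypothetical helper: verify an archive, then decompress it and keep the .xz copy.
decompress_split() {
  # $1: path to a .tsv.xz file
  xz -t "$1" && xz -dk "$1"
}

# Process whichever of the three splits are present in the current directory.
for split in train dev test; do
  if [ -f "${split}.tsv.xz" ]; then
    decompress_split "${split}.tsv.xz"
  fi
done
```

Note that `xz -dk` refuses to overwrite an existing `.tsv` file, so remove any earlier extraction first.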
MaCoCu v1 wordfreq files
Created following the method described in the Bicleaner AI repo:
l="bg"
cat monolingual.${l} \
| sacremoses -l ${l} tokenize -x \
| awk '{print tolower($0)}' \
| tr ' ' '\n' \
| LC_ALL=C sort | uniq -c \
| LC_ALL=C sort -nr \
| grep -v '[[:space:]]*1' \
| pigz -c > wordfreq-${l}.gz
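To see the shape of the resulting file, here is a toy sketch of the counting and sorting stages only. It assumes already tokenized, lowercased input, so the `sacremoses` step and the `grep` filter are left out, and it uses `gzip` in place of `pigz`. The sample text and the file name `wordfreq-toy.gz` are made up for illustration.

```shell
# Toy input standing in for tokenized, lowercased monolingual text (made-up data).
printf 'a b a\nb a c\n' \
  | tr ' ' '\n' \
  | LC_ALL=C sort | uniq -c \
  | LC_ALL=C sort -nr \
  | gzip -c > wordfreq-toy.gz

# Each line holds a count followed by a token, most frequent first.
# awk strips the padding that uniq -c adds before the count.
gzip -dc wordfreq-toy.gz | awk '{print $1 "\t" $2}'
```

A real `wordfreq-${l}.gz` file can be inspected the same way, for example with `pigz -dc wordfreq-bg.gz | head`.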