Releases: transducens/parallel-urls-classifier

PyTorch model

05 Mar 14:09
878ad78

PyTorch model that can be used with the code provided in this repository. A manually converted, HuggingFace-compatible version of the model is also available: https://huggingface.co/Transducens/xlm-roberta-base-parallel-urls-classifier

Use this version rather than the HuggingFace one if, for example, you want to run the Gunicorn server and the available scripts without writing new code.

Dataset

05 Mar 14:22
878ad78

Dataset used to train and evaluate the released model. To use the dataset with the code, first decompress the files:

xz -d train.tsv.xz
xz -d dev.tsv.xz
xz -d test.tsv.xz

MaCoCu v1 wordfreq files

03 Jan 01:22

Word frequency files created following the method described in the Bicleaner AI repository (shown here with `l="bg"`):

l="bg"

cat monolingual.${l} \
    | sacremoses -l ${l} tokenize -x \
    | awk '{print tolower($0)}' \
    | tr ' ' '\n' \
    | LC_ALL=C sort | uniq -c \
    | LC_ALL=C sort -nr \
    | grep -v '[[:space:]]*1' \
    | pigz -c > wordfreq-${l}.gz
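
To see what the `grep -v` step in the pipeline above keeps, the following sketch runs the same pattern on a small, made-up sample of `uniq -c` output. The sample tokens and counts are hypothetical. The anchored variant is only an assumption about a stricter way to drop tokens with frequency 1; it is not the command used to build the released files.

```shell
# Hypothetical sample shaped like `uniq -c | sort -nr` output:
# a right-aligned count followed by the token.
sample=$(printf '%7d %s\n' 120 the 35 of 12 casa 2 perro 1 gato)

# Pattern as written in the pipeline. It is unanchored, so it removes
# every line that contains the character "1" anywhere.
kept_unanchored=$(printf '%s\n' "$sample" \
    | grep -v '[[:space:]]*1' \
    | awk '{print $2}' | paste -sd' ' -)

# Anchored variant (assumption, not the released method): remove only
# lines whose count field is exactly 1.
kept_anchored=$(printf '%s\n' "$sample" \
    | grep -v '^[[:space:]]*1[[:space:]]' \
    | awk '{print $2}' | paste -sd' ' -)

echo "unanchored: $kept_unanchored"   # unanchored: of perro
echo "anchored:   $kept_anchored"     # anchored:   the of casa perro
```

When comparing a word frequency file you build yourself against the released ones, it may help to keep this difference in mind, since the two patterns produce different vocabularies.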