Releases · transducens/parallel-urls-classifier

05 Mar 14:09

cgr71ii

878ad78

PyTorch model

PyTorch model that can be used with the code provided in this repository. A manually converted, HuggingFace-compliant model is also available: https://huggingface.co/Transducens/xlm-roberta-base-parallel-urls-classifier

Prefer this version over the HuggingFace one if, for example, you want to run the Gunicorn server with the scripts already available in this repository, without writing new code.

Assets 3

05 Mar 14:22

cgr71ii

dataset-model-v1

878ad78

Dataset

Dataset used to train and evaluate the released model. To use the dataset with the code in this repository, first decompress the files:

xz -d train.tsv.xz
xz -d dev.tsv.xz
xz -d test.tsv.xz
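
The decompression step can also be wrapped in a small helper that checks each archive before unpacking it. This is only a sketch: the function name `decompress_splits` is hypothetical, and it assumes the three `.tsv.xz` assets were downloaded into a single directory. It uses `xz -t` to test integrity and `xz -dk` to decompress while keeping the original `.xz` file.

```shell
# Hypothetical helper (not part of the repository): verify and decompress
# train/dev/test splits found in the directory given as the first argument.
decompress_splits() {
    dir="${1:-.}"
    for split in train dev test; do
        f="${dir}/${split}.tsv.xz"
        if [ ! -f "$f" ]; then
            echo "missing: $f" >&2
            return 1
        fi
        xz -t "$f" || return 1     # integrity check of the archive
        xz -dk "$f" || return 1    # decompress, keep the .xz file
        # Report how many lines each decompressed split contains.
        printf '%s\t%s lines\n' "${split}.tsv" "$(wc -l < "${dir}/${split}.tsv")"
    done
}
```

Calling `decompress_splits .` from the download directory should leave `train.tsv`, `dev.tsv` and `test.tsv` next to the compressed files.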

Assets 5

03 Jan 01:22

cgr71ii

MaCoCu-v1-wordfreq

41a5e03

MaCoCu v1 wordfreq files

Created following the method described in the Bicleaner AI repo:

l="bg"

cat monolingual.${l} \
    | sacremoses -l ${l} tokenize -x \
    | awk '{print tolower($0)}' \
    | tr ' ' '\n' \
    | LC_ALL=C sort | uniq -c \
    | LC_ALL=C sort -nr \
    | grep -v '[[:space:]]*1' \
    | pigz -c > wordfreq-${l}.gz
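
Note on the final `grep` in the pipeline above: the pattern `[[:space:]]*1` is not anchored, so it matches any line containing the digit 1 and therefore also drops entries such as a count of 10 or a token like `a1`. The released files were produced with the command as shown. If the intent is to drop only tokens that occur once (an assumption), a stricter filter could look like the sketch below; the function name `drop_singletons` is hypothetical.

```shell
# Hypothetical stricter filter: remove only lines whose count field is
# exactly 1 in `uniq -c` output (leading spaces, count, space, token).
drop_singletons() {
    grep -v -E '^[[:space:]]*1[[:space:]]'
}

# Small demonstration on made-up `uniq -c`-style input.
printf '     10 the\n      2 a1\n      1 rare\n' | drop_singletons
```

In the demonstration, only the `rare` line is removed; the lines with counts 10 and 2 are kept even though they contain the digit 1.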

Assets 23
