    Repositories list

    • Scripts for parallelized extraction of plain text from WARC archives, aiming at a common and reproducible extraction approach.
      HTML
      0351 Updated Oct 10, 2024
    • Shell
      0130 Updated Oct 8, 2024
    • OpusPocus

      Public
      Marian machine translation training pipeline for thousands of models
      Python
      02221 Updated Oct 8, 2024
    • Data Analytics Tool
      JavaScript
      1800 Updated Oct 7, 2024
    • Shell
      0000 Updated Oct 6, 2024
    • Scripts for running bitextor jobs
      Shell
      1000 Updated Sep 26, 2024
    • Curriculum training
      Python
      MIT License
      515190 Updated Sep 14, 2024
    • OpusCleaner is a web interface that helps you select, clean and schedule your data for training machine translation models.
      Python
      1346561 Updated Sep 7, 2024
    • Set of scripts to run a monotextor-like pipeline on Slurm HPC clusters
      Rust
      GNU General Public License v3.0
      0200 Updated Sep 5, 2024
    • Internet archive downloader
      Jupyter Notebook
      0210 Updated Aug 7, 2024
    • HPLT-WP4

      Public
      Information and pipelines on WP4: language models training
      Python
      Creative Commons Zero v1.0 Universal
      2100 Updated Jul 11, 2024
    • Python port of Moses tokenizer, truecaser and normalizer
      Python
      MIT License
      59486265 Updated May 26, 2024
    • tf/idf-based document aligner from Bitextor
      C++
      Apache License 2.0
      0001 Updated Mar 19, 2024
    • PHP
      MIT License
      1000 Updated Mar 9, 2024
    • This contains the configuration and scripts for HPLT MT model releases.
      Python
      0410 Updated Mar 6, 2024
    • Monolingual or Multilingual Instruction Tuning: Which Makes a Better Alpaca
      Python
      0800 Updated Mar 6, 2024
    • OpusFilter - Parallel corpus processing toolkit
      Python
      MIT License
      18000 Updated Jan 3, 2024
    • clianer

      Public
      A lightweight command-line frontend to OpusCleaner
      Python
      MIT License
      1000 Updated Nov 27, 2023
    • Makeshift interface for managing Paracrawl processing and exploring its outputs
      HTML
      1000 Updated Oct 10, 2023
    • 0100 Updated Feb 7, 2023