    Repositories list

    • Step-by-step schematic description of data processing in HPLT
      Python
      1040 Updated Aug 4, 2025
    • HPLT-WP4 (Public)
      Information and pipelines on WP4: language models training
      Jupyter Notebook
      3350 Updated Aug 4, 2025
    • HPLT Analytics
      JavaScript
      31411 Updated Aug 4, 2025
    • This contains the configuration and scripts for HPLT MT model releases.
      Python
      0820 Updated Aug 4, 2025
    • Scripts for parallelized extraction of plain text from WARC archives, aiming at a common and reproducible extraction approach.
      Jupyter Notebook
      1440 Updated Jul 24, 2025
    • Set of scripts to run a monotextor-like pipeline on Slurm HPC clusters
      Rust
      0300 Updated Jul 15, 2025
    • OpusCleaner is a web interface that helps you select, clean and schedule your data for training machine translation models.
      Python
      1551561 Updated Jul 9, 2025
    • Curriculum training
      Python
      618190 Updated Jun 25, 2025
    • Jupyter Notebook
      7110 Updated May 27, 2025
    • OpusPocus (Public)
      Marian machine translation training pipeline for thousands of models
      Python
      02200 Updated May 16, 2025
    • hplt-e (Public)
      Jupyter Notebook
      0140 Updated May 13, 2025
    • Shell
      0130 Updated Feb 24, 2025
    • Internet archive downloader
      Jupyter Notebook
      0220 Updated Jan 26, 2025
    • Shell
      0000 Updated Jan 26, 2025
    • Scripts for running bitextor jobs
      Shell
      1010 Updated Jan 20, 2025
    • Monolingual or Multilingual Instruction Tuning: Which Makes a Better Alpaca
      Python
      11000 Updated Nov 2, 2024
    • Python port of Moses tokenizer, truecaser and normalizer
      Python
      60495274 Updated May 26, 2024
    • TF-IDF-based document aligner from Bitextor
      C++
      0001 Updated Mar 19, 2024
    • PHP
      1000 Updated Mar 9, 2024
    • OpusFilter - Parallel corpus processing toolkit
      Python
      24000 Updated Jan 3, 2024
    • clianer (Public)
      A lightweight command-line frontend to OpusCleaner
      Python
      1000 Updated Nov 27, 2023
    • Makeshift interface for managing Paracrawl processing and exploring its outputs
      HTML
      1000 Updated Oct 10, 2023
    • 0100 Updated Feb 7, 2023