
OpusPocus

Modular NLP pipeline manager.

OpusPocus aims to simplify the description and execution of popular and custom NLP pipelines, including dataset preprocessing, model training, fine-tuning, and evaluation. The pipeline manager supports execution either directly from the command line (Bash) or through common HPC schedulers (Slurm).

It uses OpusCleaner for data preparation and OpusTrainer for training scheduling (integration is still in development).

Structure

  • go.py - pipeline manager entry script
  • opuspocus/ - OpusPocus modules
  • opuspocus_cli/ - OpusPocus CLI subcommands
  • config/ - default configuration files (pipeline config, marian training config, ...)
  • examples/ - pipeline manager usage examples
  • scripts/ - helper scripts that are not yet integrated directly into OpusPocus
  • tests/ - unit tests

Installation

  1. Install MarianNMT
$ ./scripts/install_marian_gpu.sh PATH_TO_CUDA CUDNN_VERSION [NUM_THREADS]

Alternatively, you can use scripts/install_marian_cpu.sh for the CPU version. Note that both scripts may require modification depending on your system. (A hypothetical invocation is sketched after the installation steps below.)

  2. (Optional) Set up a Python virtual environment (using virtualenv):
$ /usr/bin/virtualenv -p /usr/bin/python3.10 python-venv
  3. Install the Python dependencies:
$ source python-venv/bin/activate  # if using the virtual environment
$ pip install --upgrade pip setuptools
$ pip install -r requirements.txt
  4. Set up a separate Python virtual environment for OpusCleaner. (OpusCleaner does not currently support Python >= 3.10.)
$ /usr/bin/virtualenv -p /usr/bin/python3.9 opuscleaner-venv
  5. Activate the OpusCleaner virtualenv and install OpusCleaner's dependencies:
$ source opuscleaner-venv/bin/activate
$ pip install --upgrade pip setuptools
$ pip install -r requirements-opuscleaner.txt
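
For illustration, a hypothetical GPU installation (step 1) might look like the following. The CUDA path, cuDNN version, and thread count are placeholders; the exact argument format expected by the script may differ on your system.

$ ./scripts/install_marian_gpu.sh /usr/local/cuda 8.9 8  # hypothetical arguments; check the script before running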

Usage (Simple Pipeline)

Run either the main script go.py or the subcommand scripts from the opuspocus_cli/ directory. Run the scripts directly from the root directory of this repository. (You may need to add the path to the local OpusPocus repository to your PYTHONPATH; see the example below.)
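
If Python cannot locate the opuspocus modules, you can add the repository root to your PYTHONPATH, for example:

$ export PYTHONPATH="${PYTHONPATH}:$(pwd)"  # run from the OpusPocus repository root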

Pipeline execution

There are two main subcommands (init, run) that are executed separately. ./go.py init prepares the pipeline directory structure and infers basic information about the datasets used in the pipeline. ./go.py run executes the pipeline graph, running the code of each pipeline step in the order defined by the graph.

(See the examples/ directory for example executions.)

I. Data preprocessing example

  1. Download the data and set up the dataset directory structure.
$ scripts/prepare_data.en-eu.sh
  2. Initialize the (data preprocessing) pipeline.
$ mkdir -p experiments/en-eu/preprocess.simple
$ ./go.py init \
    --pipeline-config config/pipeline.preprocess.yml \
    --pipeline-dir experiments/en-eu/preprocess.simple
  • --pipeline-config (required) provides the details about the pipeline steps and their dependencies
  • --pipeline-dir (optional) overrides the pipeline.pipeline_dir value from the pipeline-config
  3. Execute the (data preprocessing) pipeline.
$ ./go.py run \
    --pipeline-dir experiments/en-eu/preprocess.simple \
    --runner bash 
  • --pipeline-dir (required) path to the initialized pipeline directory.
  • --runner (required) the runner to be used for pipeline execution. Use --runner slurm for more efficient execution on HPC clusters (if Slurm is available); see the sketch after the status commands below.
  4. Check the pipeline status.
$ ./go.py traceback --pipeline-dir experiments/en-eu/preprocess.simple

OR

$ ./go.py status --pipeline-dir experiments/en-eu/preprocess.simple
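
As noted in step 3, the same pipeline can also be executed through Slurm instead of Bash. A minimal sketch, assuming Slurm is available on your cluster (additional runner options may apply):

$ ./go.py run \
    --pipeline-dir experiments/en-eu/preprocess.simple \
    --runner slurm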

II. Model training example (preprocessing follow-up)

  1. Check the preprocessing pipeline status. (The data preprocessing pipeline must be finished, i.e., all steps must be in the DONE state.)
$ ./go.py status --pipeline-dir experiments/en-eu/preprocess.simple
  2. Initialize the training pipeline.
$ mkdir -p experiments/en-eu/train.simple
$ ./go.py init \
    --pipeline-config config/pipeline.train.simple.yml \
    --pipeline-dir experiments/en-eu/train.simple 
  3. Execute the training pipeline.
$ ./go.py run \
    --pipeline-dir experiments/en-eu/train.simple \
    --runner bash 
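
As with the preprocessing pipeline, you can then check the training pipeline's progress with the status (or traceback) subcommand, for example:

$ ./go.py status --pipeline-dir experiments/en-eu/train.simple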

(Advanced) Config modification examples

TBD
