LUMI setup

Recommended reads:

Installation and Configuration

Please ignore the "Compiling Software" section in the README and follow these steps instead. The conda container that will be used contains most of the needed software.

Clone the repo (do not clone recursively), switch to the lumi branch and initialize only the needed submodules:

git clone https://github.com/paracrawl/cirrus-scripts
cd cirrus-scripts
git checkout lumi
git submodule update --init env/src/preprocess

Edit env/init.d/lumi.sh and set the PATH variable to the bin directory of the conda container. It is currently set to project_462000252/zaragoza/bitextor-8.1/bin, which is a working environment that you can use, so there is no need to change it.
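
A minimal sketch of what the relevant line could look like, assuming the container environment lives under /projappl (the exact path and the rest of lumi.sh may differ):

# Prepend the conda container's bin directory to PATH (placeholder path, adjust to your environment).
export PATH=/projappl/project_462000252/<your-env>/bitextor-8.1/bin:$PATH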

Edit config.d/10.lumi.sh to set up the working directories for processed data (a sketch of the relevant variables follows this list):

  • Change PROJ_DIR and SCRATCH_DIR to your directories in the projappl and scratch partitions of the project (e.g. /projappl/project_462000252/user). The project partition will be used to store the code and models, scratch to store the data.
  • Set up collection names and directories. For the test runs there is no need for additional changes, only to copy the data (explained afterwards).
  • Other relevant variables that may not need modifications for the test runs:
    • SBATCH_ACCOUNT specifies the project that will be billed for the computing hours.
    • SBATCH_PARTITION: we will be using small for the tests but will probably switch to standard.
    • SBATCH_MEM_PER_CPU: only needed for the small partition. Comment this line out for the standard partition.
    • SLURM_LOGS: directory to store the logs of all the jobs. THIS DIRECTORY NEEDS TO BE CREATED before running jobs, otherwise they will fail. Also note that this directory grows significantly in number of files, so make sure to clean it out from time to time.
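
A minimal sketch of the variables discussed above, with placeholder values (the actual file may define more variables and different defaults):

# config.d/10.lumi.sh (sketch; placeholder values, adjust to your project and user)
PROJ_DIR=/projappl/project_462000252/$USER       # code and models
SCRATCH_DIR=/scratch/project_462000252/$USER     # processed data
SBATCH_ACCOUNT=project_462000252                 # project billed for the compute hours
SBATCH_PARTITION=small                           # switch to standard for bigger runs
SBATCH_MEM_PER_CPU=1750                          # small partition only; comment out for standard
SLURM_LOGS=$SCRATCH_DIR/slurm-logs               # must exist before scheduling jobs (mkdir -p it)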

To install the software that is not included in the container, run:

cd env
./setup.sh install paracrawl

Container creation

For users without access to the project, the bitextor container mentioned above is not available, but it can be created with this configuration file for the LUMI conda container wrapper:

channels:
  - conda-forge
  - bitextor
  - dmnapolitano
  - esarrias
dependencies:
  - bitextor=8.1
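
With that file saved (e.g. as bitextor.yml, a name used here only for illustration), the container can be built with the LUMI container wrapper. The commands below follow the wrapper's documented workflow; the module names and installation prefix are assumptions to adjust for your project and software stack:

# Build the conda container into a projappl directory (sketch, adjust paths).
module load LUMI lumi-container-wrapper
mkdir -p /projappl/project_462000252/$USER/bitextor-8.1
conda-containerize new --prefix /projappl/project_462000252/$USER/bitextor-8.1 bitextor.yml

The resulting bin directory is the one that PATH should point to in env/init.d/lumi.sh.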

Configure translation

To configure the translation step with a Bergamot student, the following steps are required:

  • Create the language pair directory, e.g. models/es-en.
  • Download the student model files to models/es-en/esen.student.tiny11 and create a model symlink pointing to that directory.
  • Create a translate.sh symlink pointing to models/translate-bergamot.sh.
zaragoza2@uan01:~/proj_462000252/zaragoza/cirrus-scripts> ll models/es-en/
total 8.0K
drwxrws--- 2 zaragoza2 project_462000252 4.0K May 11 13:03 esen.student.tiny11
lrwxrwxrwx 1 zaragoza2 project_462000252   19 May 11 13:14 model -> esen.student.tiny11
lrwxrwxrwx 1 zaragoza2 project_462000252   84 May 11 13:00 translate.sh -> /users/zaragoza2/proj_462000252/zaragoza/cirrus-scripts/models/translate-bergamot.sh
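
The commands to reach that layout could look roughly like this (a sketch for es-en; adjust the names for other language pairs):

# Set up the es-en Bergamot student (sketch).
mkdir -p models/es-en/esen.student.tiny11
# ...download/unpack the student model files into that directory...
cd models/es-en
ln -s esen.student.tiny11 model              # translate-bergamot.sh expects models/es-en/model
ln -s ../translate-bergamot.sh translate.sh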

Note that translate-bergamot.sh will look for the marian-decoder config at models/es-en/model/config.yml. This is an example optimized for Bergamot models:

quiet-translation: true
relative-paths: true
models:
    - model.intgemm.alphas.bin
vocabs:
    - vocab.esen.spm
    - vocab.esen.spm
shortlist:
    - lex.s2t.bin
    - false

beam-size: 1
normalize: 1.0
word-penalty: 0
mini-batch: 16
maxi-batch: 100
maxi-batch-sort: src
workspace: 256
max-length: 300
max-length-crop: true
gemm-precision: int8shiftAlphaAll

max-length-crop avoids very long lines freezing Marian.

The Marian Bergamot CPU version is already compiled and configured in translate-bergamot.sh, so there is no need to compile it.

To use other types of translators, you will need to compile/install them yourself and configure translate.sh. Take a look at the translation template scripts in the models/ directory to get an idea of what is needed. Note the use of the foldfilter wrapper to chop very long lines before translation and rejoin them at the output. WARNING: foldfilter can mess up spaces in some cases and cannot handle languages without spaces, for example. Check the outputs before using it.
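
As a rough illustration, a custom translate.sh could look like the sketch below, where my-translator is a hypothetical command and the foldfilter options should be checked against its actual interface before use:

#!/bin/bash
# Sketch of a custom models/xx-yy/translate.sh (hypothetical translator command).
set -euo pipefail
# foldfilter chops long input lines, runs the wrapped command, and rejoins the output.
foldfilter -w 1000 my-translator --config model/config.yml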

Sharding

The sharding step can be used as is, but for test runs with small amounts of data we do not need that level of parallelization and we want to keep the number of files low, so we can configure it to create only 4 (2²) shards:

diff --git a/02.giashard b/02.giashard
index e367863..6c2a6f4 100755
--- a/02.giashard
+++ b/02.giashard
@@ -17,7 +17,7 @@ mkdir $SHARD_PATH.$$

 cat "$BATCH_LIST" \
 | awk "NR > $GROUP_START && NR <= $GROUP_END" \
-| xargs giashard -d $SCRIPTS/domain-suffixes.txt -f text,url -b 1024 -n 8 -o $SHARD_PATH.$$
+| xargs giashard -d $SCRIPTS/domain-suffixes.txt -f text,url -b 1024 -n 2 -o $SHARD_PATH.$$

 # Fix filenames
 for BATCH in $SHARD_PATH.$$/*/*/; do
diff --git a/02.giashard.sh b/02.giashard.sh
index 99fd834..0ed3904 100755
--- a/02.giashard.sh
+++ b/02.giashard.sh
@@ -35,7 +35,7 @@ esac
 export BATCHES_PER_TASK

 export TASKS_PER_BATCH=1 # more than 1 is not supported by 02.giashard
-export SHARDS_PER_TASK=16 # for 02.giamerge -> 1-16 * 16 = 256 shards
+export SHARDS_PER_TASK=1 # for 02.giamerge -> 1-16 * 16 = 256 shards

 for language in $@; do
        batch_list=$(make_batch_list $collection $language)
@@ -62,7 +62,7 @@ for language in $@; do
                        merge_job_id=$(schedule \
                                -J merge-shard-${language}-${collection} \
                                --dependency afterok:$shard_job_id \
-                               -a 1-16 \
+                               -a 1-2 \
                                --time 24:00:00 \
                                --cpus-per-task 8 `#really just need 4, but 8 for more memory and better spread` \
                                -e ${SLURM_LOGS}/02.merge-${language}-%A_%a.err \

CAUTION: both languages of a language pair that is going to be aligned need the same number of shards, i.e. the same sharding configuration.

Copy the data

For each collection we need $collection-{text,shards,batches} directories. The -{shards,batches} dirs need to be created manually, otherwise job scheduling will fail. The data that sharding will use as a starting point needs to be located in the $collection-text directory. For the test runs you can copy the data from /scratch/project_462000252/zaragoza/data/*-text.
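
A sketch of the preparation for a single hypothetical collection called sample (adjust names and paths to your setup):

# Create the directories sharding expects and copy the test data (sketch).
cd $SCRATCH_DIR
mkdir -p sample-shards sample-batches
cp -r /scratch/project_462000252/zaragoza/data/sample-text sample-text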

Run the pipeline

Once everything is configured, all the steps can be followed as the README explains.

General recommendations

Each processing step follows the scheme "run the process writing to an output file with a temporary suffix, then remove the suffix to mark it as finished". So every time a job fails or does not finish properly, it will leave temporary files all over the place. Cleaning them up regularly is advised in order to keep the number of files down.
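
A hypothetical way to locate such leftovers, assuming the temporary suffix is a numeric process id as in 02.giashard above (review the matches carefully before deleting anything):

# List files whose names end in ".<digits>", e.g. outputs left behind by failed jobs.
find $SCRATCH_DIR -regextype posix-extended -regex '.*\.[0-9]+$'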

Scheduling in 'small' partition

Some steps, like tokenise and split-text, use serial jobs, so allocating more CPUs per job does not parallelize them. However, in the case of the 'small' partition, allocating more than one CPU provides more memory and prevents running out of it. So running

TPB=1 TPN=3 ./05.tokenise.sh output_wide15_filtered_sample12 es

will schedule an array job whose size is the total number of batches we have for that language, where each task will process its batch serially. Having 3 CPUs will provide more RAM.

In a scenario where we have more batches, and therefore a job array size, larger than the job limit (200), we can increase TPB so that each task processes more than one batch. Note that this won't increase the parallelization, it only avoids the scheduler limits.
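
For example, with 600 batches a hypothetical invocation could be

TPB=3 TPN=3 ./05.tokenise.sh output_wide15_filtered_sample12 es

which schedules a 200-task array where each task processes 3 batches one after another.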

To allocate more CPUs per job and let threading parallelize the non-serial steps, do

TPN=1 SBATCH_CPUS_PER_TASK=128 ./04.translate.sh

Scheduling in 'standard' partition

We are not using this partition for now, but will probably use it in the next runs.

It is important to know that the standard partition allocates full nodes, not sub-node resources, so the TPN variable won't affect the number of CPUs allocated: each job in the array will have 128 cores. To take advantage of the full node in serial steps, we will need to run with

TPB=128 ./05.tokenise.sh ...

or something high like 64, so that most of the cores are each processing a batch. Note that 128 would spawn a lot of processes and can lead to OOM, so decreasing it a bit could be reasonable. This has not been tested, though.