Merge pull request #380 from sillsdev/setup_update
Update setup documentation and Docker images
isaac091 authored May 13, 2024
2 parents 17d0f13 + 8005a67 commit e8c4516
Showing 8 changed files with 248 additions and 271 deletions.
28 changes: 12 additions & 16 deletions .devcontainer/Dockerfile
@@ -8,16 +8,9 @@ WORKDIR /app
ENV POETRY_HOME=/opt/poetry
ENV POETRY_VENV=/opt/poetry-venv
ENV POETRY_CACHE_DIR=/opt/.cache
ENV DOTNET_ROLL_FORWARD=LatestMajor
ENV PIP_DISABLE_PIP_VERSION_CHECK=on
ENV TZ=America/New_York
RUN ln -snf /usr/share/zoneinfo/$TZ /etc/localtime && echo $TZ > /etc/timezone
# Install .NET SDK
RUN apt-get update
RUN apt-get install --no-install-recommends -y wget
RUN wget https://packages.microsoft.com/config/ubuntu/20.04/packages-microsoft-prod.deb -O packages-microsoft-prod.deb && \
dpkg -i packages-microsoft-prod.deb && \
rm packages-microsoft-prod.deb
# Install apt packages
RUN apt-get update
RUN apt-get upgrade -y
@@ -29,23 +22,26 @@ RUN apt-get install --no-install-recommends -y \
build-essential \
gdb \
curl \
unzip \
dotnet-sdk-7.0
unzip
# Make some useful symlinks that are expected to exist
RUN ln -sfn /usr/bin/python${PYTHON_VERSION} /usr/bin/python3 & \
ln -sfn /usr/bin/python${PYTHON_VERSION} /usr/bin/python
# Install poetry separated from system interpreter
RUN python3 -m venv $POETRY_VENV \
&& $POETRY_VENV/bin/pip install -U pip setuptools \
&& $POETRY_VENV/bin/pip install poetry==${POETRY_VERSION}
# Add `poetry` to PATH
# Add `poetry` to PATH and configure
ENV PATH="${PATH}:${POETRY_VENV}/bin"
# Install AWS CLI
RUN curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip" && \
unzip awscliv2.zip && \
./aws/install && \
rm awscliv2.zip
RUN rm -rf /var/lib/apt/lists/*
RUN poetry config virtualenvs.create true && \
poetry config virtualenvs.in-project true
# Clean up
RUN rm -rf /var/lib/apt/lists/*
# Create caches
RUN mkdir -p /root/.cache/silnlp/experiments
RUN mkdir /root/.cache/silnlp/projects
ENV SIL_NLP_CACHE_EXPERIMENT_DIR=/root/.cache/silnlp/experiments
ENV SIL_NLP_CACHE_PROJECT_DIR=/root/.cache/silnlp/projects
# Set environment variables
ENV CLEARML_API_HOST="https://api.sil.hosted.allegro.ai"
ENV SIL_NLP_DATA_PATH=/aqua-ml-data
CMD ["bash"]
7 changes: 5 additions & 2 deletions .devcontainer/devcontainer.json
@@ -14,7 +14,7 @@
"-v",
"${env:HOME}/.aws:/root/.aws", // Mount user's AWS credentials into the container
"-v",
"/home/clearml/.clearml/hf-cache:/root/.cache/huggingface"
"${env:HOME}/clearml/.clearml/hf-cache:/root/.cache/huggingface"
],
"containerEnv": {
"AWS_REGION": "${localEnv:AWS_REGION}",
@@ -44,7 +44,10 @@
},
"editor.formatOnSave": true,
"editor.formatOnType": true,
"isort.args":["--profile", "black"]
"isort.args": [
"--profile",
"black"
]
},
// Add the IDs of extensions you want installed when the container is created.
"extensions": [
18 changes: 5 additions & 13 deletions Dockerfile
@@ -55,14 +55,6 @@ RUN apt-get install --no-install-recommends -y \
RUN ln -sfn /usr/bin/python${PYTHON_VERSION} /usr/bin/python3 & \
ln -sfn /usr/bin/python${PYTHON_VERSION} /usr/bin/python

# Install .NET SDK
RUN wget https://packages.microsoft.com/config/ubuntu/20.04/packages-microsoft-prod.deb -O packages-microsoft-prod.deb && \
dpkg -i packages-microsoft-prod.deb && \
rm packages-microsoft-prod.deb
RUN apt-get update && \
apt-get install --no-install-recommends -y dotnet-sdk-7.0
ENV DOTNET_ROLL_FORWARD=LatestMajor

# Install dependencies from poetry
COPY --from=builder /src/requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt && rm requirements.txt
@@ -105,11 +97,11 @@ RUN mv meteor-1.5/meteor-1.5.jar /usr/local/bin
RUN rm -rf meteor-1.5
ENV METEOR_PATH=/usr/local/bin

# Other environment variables
ENV SIL_NLP_DATA_PATH=/aqua-ml-data
RUN mkdir -p .cache/silnlp
ENV SIL_NLP_CACHE_EXPERIMENT_DIR=/root/.cache/silnlp
ENV CLEARML_API_HOST="https://api.sil.hosted.allegro.ai"
# Create caches
RUN mkdir -p .cache/silnlp/experiments
RUN mkdir .cache/silnlp/projects
ENV SIL_NLP_CACHE_EXPERIMENT_DIR=/root/.cache/silnlp/experiments
ENV SIL_NLP_CACHE_PROJECT_DIR=/root/.cache/silnlp/projects

# Clone silnlp and make it the starting directory
RUN git clone https://github.com/sillsdev/silnlp.git
245 changes: 56 additions & 189 deletions README.md

Large diffs are not rendered by default.

7 changes: 2 additions & 5 deletions clear_ml_linux_setup.md → clear_ml_setup.md
@@ -1,10 +1,7 @@
# Instructions for setting up Clear-ML on Linux.

These were tested on Pop!_OS.
See [Clear-ML Windows setup](clear_ml_windows_setup.md) for instructions to set up Clear-ML on Windows.
# Instructions for setting up Clear-ML.

## Install the clearml python package.
Open a terminal and use pip to install Clear-ML.
Open a terminal (or Command Prompt on Windows) and use pip to install Clear-ML.
`pip install clearml`

## Add your AWS storage vault credentials (If using AWS S3).
46 changes: 0 additions & 46 deletions clear_ml_windows_setup.md

This file was deleted.

113 changes: 113 additions & 0 deletions manual_setup.md
@@ -0,0 +1,113 @@
# Manual Setup

## SILNLP Prerequisites
These are the main requirements for running the SILNLP code on a local machine. Because the code depends on many Python packages with complex versioning requirements, we use a Python package called Poetry to manage them. Here is a rough hierarchy of SILNLP's major dependencies.

| Requirement | Reason |
| --------------------- | ----------------------------------------------------------------- |
| GIT | to get the repo from [github](https://github.com/sillsdev/silnlp) |
| Python | to run the silnlp code |
| Poetry | to manage all the Python packages and versions |
| NVIDIA GPU | Required to run on a local machine |
| Nvidia drivers | Required for the GPU |
| CUDA Toolkit | Required for the Machine learning with the GPU |
| Environment variables | To tell SILNLP where to find the data, etc. |

## Setup

The SILNLP code can be run on either Windows or Linux operating systems. If using an Ubuntu distribution, the only compatible version is 20.04.

__Download and install__ the following before creating any projects or starting any code, preferably in this order to avoid most warnings:

1. If using a local GPU: [NVIDIA driver](https://www.nvidia.com/download/index.aspx)
* On Ubuntu, the driver can alternatively be installed through the GUI by opening Software & Updates, navigating to Additional Drivers in the top menu, and selecting the newest NVIDIA driver with the labels proprietary and tested.
* After installing the driver, reboot your system.
2. [Git](https://git-scm.com/downloads)
3. [Python 3.8](https://www.python.org/downloads/) (latest patch release, e.g. 3.8.19)
* Can alternatively install Python using [miniconda](https://docs.conda.io/projects/conda/en/latest/user-guide/install/windows.html) if you're planning to use more than one version of Python. If following this method, activate your conda environment before installing Poetry.
4. [Poetry](https://python-poetry.org/docs/#installation)
* Note that whether the command should call python or python3 depends on which is required on your machine.
* It may (or may not) be possible to run the curl command within a VS Code terminal. If that causes permission errors, close VS Code and try it in an elevated CMD prompt.

Windows:
At an administrator CMD prompt or a terminal within VS Code run:
```
curl -sSL https://install.python-poetry.org | python - --version 1.7.1
```
In Powershell, run:
```
(Invoke-WebRequest -Uri https://install.python-poetry.org -UseBasicParsing).Content | python -
```

Linux:
In terminal, run:
```
curl -sSL https://install.python-poetry.org | python3 - --version 1.7.1
```
Add the following line to your .bashrc file in your home directory:
```
export PATH="$HOME/.local/bin:$PATH"
```
5. C++ Redistributable
* Note - this may already be installed. If it is not installed you may get cryptic errors such as "System.DllNotFoundException: Unable to load DLL 'thot' or one of its dependencies"
* Windows: Download from https://support.microsoft.com/en-us/topic/the-latest-supported-visual-c-downloads-2647da03-1eea-4433-9aff-95f26a218cc0 and install
* Linux: Instead of installing the redistributable, run the following commands:
```
sudo apt-get update
sudo apt-get install build-essential gdb
```

### Visual Studio Code setup

1. Install Visual Studio Code
2. Install Python extension for VS Code
3. Open up silnlp folder in VSC
4. In CMD window, type `poetry install` to create the virtual environment for silnlp
* If using conda, activate your conda environment first before `poetry install`. Poetry will then install all the dependencies into the conda environment.
5. Choose the newly created virtual environment as the "Python Interpreter" in the command palette (ctrl+shift+P)
* If using conda, choose the conda environment as the interpreter
6. Open the command palette and select "Preferences: Open User Settings (JSON)". In the `settings.json` file, add the following options:
``` json
"python.formatting.provider": "black",
"python.linting.pylintEnabled": true,
"editor.formatOnSave": true,
```

### S3 bucket setup

See [S3 bucket setup](s3_bucket_setup.md).

### ClearML setup

See [ClearML setup](clear_ml_setup.md).

### Create SILNLP cache
* Create the directory "/home/user/.cache/silnlp", replacing "user" with your username.
* Create the directory "/home/user/.cache/silnlp/experiments" and set the environment variable SIL_NLP_CACHE_EXPERIMENT_DIR to that path.
* Create the directory "/home/user/.cache/silnlp/projects" and set the environment variable SIL_NLP_CACHE_PROJECT_DIR to that path.
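
The steps above can be sketched in a Linux shell as follows. This is a minimal sketch, assuming your home directory is available as `$HOME` (so `$HOME/.cache/silnlp` stands in for "/home/user/.cache/silnlp"); `SILNLP_CACHE_ROOT` is a hypothetical helper variable used only in this example.

```shell
# Assumption: the cache lives under the current user's home directory.
# SILNLP_CACHE_ROOT is a hypothetical name used only in this sketch.
SILNLP_CACHE_ROOT="$HOME/.cache/silnlp"

# -p creates missing parent directories and does not fail if they already exist.
mkdir -p "$SILNLP_CACHE_ROOT/experiments" "$SILNLP_CACHE_ROOT/projects"

# Point SILNLP at the two cache directories (current shell session only).
export SIL_NLP_CACHE_EXPERIMENT_DIR="$SILNLP_CACHE_ROOT/experiments"
export SIL_NLP_CACHE_PROJECT_DIR="$SILNLP_CACHE_ROOT/projects"
```

These exports last only for the current session; to keep them across logins they would need to go into a startup file such as `.bashrc`.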

### Additional Environment Variables
* Set the following environment variables with your respective credentials: CLEARML_API_ACCESS_KEY, CLEARML_API_SECRET_KEY, AWS_ACCESS_KEY_ID, and AWS_SECRET_ACCESS_KEY.
* Set SIL_NLP_DATA_PATH to "/aqua-ml-data" and CLEARML_API_HOST to "https://api.sil.hosted.allegro.ai".
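
As a quick sanity check, something like the following shell function could report which of the variables listed above are still unset. It is only an illustrative sketch: `missing_silnlp_vars` is a hypothetical helper, not part of SILNLP.

```shell
# Hypothetical helper: print the names of the expected variables that are
# unset or empty, separated by spaces. Prints an empty line if all are set.
missing_silnlp_vars() {
    missing=""
    for name in CLEARML_API_ACCESS_KEY CLEARML_API_SECRET_KEY \
                AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY \
                SIL_NLP_DATA_PATH CLEARML_API_HOST; do
        # Look up the variable whose name is stored in $name.
        # The names are fixed constants, so eval is safe here.
        eval "value=\${$name:-}"
        [ -n "$value" ] || missing="$missing $name"
    done
    # Strip the leading space before printing.
    printf '%s\n' "${missing# }"
}
```

Running `missing_silnlp_vars` in the terminal you plan to use for experiments should print nothing once every variable is set.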

### Setting Up and Running Experiments

See the [wiki](https://github.com/sillsdev/silnlp/wiki) for information on setting up and running experiments. The most important pages for getting started are the ones on [file structure](https://github.com/sillsdev/silnlp/wiki/Folder-structure-and-file-naming-conventions), [model configuration](https://github.com/sillsdev/silnlp/wiki/Configure-a-model), and [running experiments](https://github.com/sillsdev/silnlp/wiki/NMT:-Usage). A lot of the instructions are specific to NMT, but are still helpful starting points for doing other things like [alignment](https://github.com/sillsdev/silnlp/wiki/Alignment:-Usage).

See [this](https://github.com/sillsdev/silnlp/wiki/Using-the-Python-Debugger) page for information on using the VS code debugger.

If you need to use a tool that is supported by SILNLP but is not installable as a Python library (which is probably the case if you get an error like "RuntimeError: eflomal is not installed."), follow the appropriate instructions [here](https://github.com/sillsdev/silnlp/wiki/Installing-External-Libraries).

## Setting environment variables permanently
Windows users: see [here](https://github.com/sillsdev/silnlp/wiki/Install-silnlp-on-Windows-10#permanently-set-environment-variables) for instructions on setting environment variables permanently.

Linux users: To set environment variables permanently, add each variable as a new line to the `.bashrc` file in your home directory with the format
```
export VAR="VAL"
```
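
To avoid adding the same line twice when re-running setup, the `export VAR="VAL"` format above could be wrapped in a small helper like the one below. This is a sketch under assumptions: `add_export` is a hypothetical function, and it defaults to `~/.bashrc` only because that is the file named above.

```shell
# Hypothetical helper: append export NAME="VALUE" to a startup file,
# but only if that exact line is not already present.
# Usage: add_export NAME VALUE [FILE]   (FILE defaults to ~/.bashrc)
add_export() {
    name="$1"
    value="$2"
    file="${3:-$HOME/.bashrc}"
    line="export $name=\"$value\""
    # Make sure the file exists so grep does not fail on a missing file.
    touch "$file"
    # -x matches whole lines, -F treats the pattern as a fixed string.
    grep -qxF -- "$line" "$file" || printf '%s\n' "$line" >> "$file"
}

# Example (commented out so that sourcing this sketch changes nothing):
# add_export SIL_NLP_DATA_PATH "/aqua-ml-data"
```

New terminals pick up the change automatically; in an already open terminal, run `source ~/.bashrc` to load it.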

## .NET Machine alignment models

If you need to run the .NET versions of the Machine alignment models, you will need to install .NET Core SDK 8.0. After installing, run `dotnet tool restore`.
* Windows: [.NET Core SDK](https://dotnet.microsoft.com/download)
* Linux: Installation instructions can be found [here](https://learn.microsoft.com/en-us/dotnet/core/install/linux-ubuntu-2004).
55 changes: 55 additions & 0 deletions s3_bucket_setup.md
@@ -0,0 +1,55 @@
# S3 bucket setup

We use Amazon S3 storage for storing our experiment data. Here is some workspace setup to enable a decent workflow.

### Install and configure AWS S3 storage
* Install the aws-cli from: https://aws.amazon.com/cli/
* In cmd, type: `aws configure` and enter your AWS access_key_id and secret_access_key and the region (we use region = us-east-1).
* The `aws configure` command creates a folder named `.aws` in your home directory. It should contain two plain-text files named `config` and `credentials`: the config file holds the region, and the credentials file holds your access_key_id and your secret_access_key.
(The home directory is usually C:\Users\<Username>\ on Windows and /home/username on Linux.)
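
On Linux, a check along these lines could confirm that the two files described above exist after running `aws configure`. It is a minimal sketch: `check_aws_files` is a hypothetical helper, and it only tests for the files' presence, not their contents.

```shell
# Hypothetical helper: verify that the AWS CLI settings folder contains
# both expected files. Usage: check_aws_files [DIR]  (DIR defaults to ~/.aws)
check_aws_files() {
    dir="${1:-$HOME/.aws}"
    for f in config credentials; do
        if [ ! -f "$dir/$f" ]; then
            echo "missing: $dir/$f" >&2
            return 1
        fi
    done
    echo "found config and credentials in $dir"
}
```

Calling `check_aws_files` with no argument inspects `~/.aws`; a non-zero exit status means at least one of the files is absent.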

### Install and configure rclone

**Windows**

The following will mount /aqua-ml-data on your S drive and allow you to explore, read and write.
* Install WinFsp: http://www.secfs.net/winfsp/rel/ (Click the button to "Download WinFsp Installer" not the "SSHFS-Win (x64)" installer)
* Download rclone from: https://rclone.org/downloads/
* Unzip to your desktop (or some convenient location).
* Add the folder that contains rclone.exe to your PATH environment variable.
* Take the `scripts/rclone/rclone.conf` file from this SILNLP repo and copy it to `~\AppData\Roaming\rclone` (creating folders if necessary)
* Add your credentials in the appropriate fields in `~\AppData\Roaming\rclone\rclone.conf`
* Take the `scripts/rclone/mount_to_s.bat` file from this SILNLP repo and copy it to the folder that contains the unzipped rclone.
* Double-click the bat file. A command window should open and remain open. You should see something like:
```
C:\Users\David\Software\rclone>call rclone mount --vfs-cache-mode full --use-server-modtime s3aqua:aqua-ml-data S:
The service rclone has been started.
```

**Linux**

The following will mount /aqua-ml-data to an S folder in your home directory and allow you to explore, read and write.
* Download rclone from: https://rclone.org/install/
* Take the `scripts/rclone/rclone.conf` file from this SILNLP repo and copy it to `~/.config/rclone/rclone.conf` (creating folders if necessary)
* Add your credentials in the appropriate fields in `~/.config/rclone/rclone.conf`
* Create a folder called "S" in your user directory
* Run the following command:
```
rclone mount --vfs-cache-mode full --use-server-modtime s3aqua:aqua-ml-data ~/S
```
### To start S: drive on start up

**Windows**

Put a shortcut to the mount_to_s.bat file in the Startup folder.
* In Windows Explorer put `shell:startup` in the address bar or open `C:\Users\<Username>\AppData\Roaming\Microsoft\Windows\Start Menu\Programs\Startup`
* Right click to add a new shortcut. Choose `mount_to_s.bat` as the target; you can leave the name as the default.

Now your AWS S3 bucket should be mounted as S: drive when you start Windows.

**Linux**
* Run `crontab -e`
* Paste `@reboot rclone mount --vfs-cache-mode full --use-server-modtime s3aqua:aqua-ml-data ~/S` into the file, save and exit
* Reboot Linux

Now your AWS S3 bucket should be mounted as ~/S when you start Linux.
