Update setup documentation and Docker images
isaac091 committed May 9, 2024
1 parent 1fbe518 commit bec0ade
Showing 8 changed files with 231 additions and 240 deletions.
19 changes: 12 additions & 7 deletions .devcontainer/Dockerfile
@@ -38,14 +38,19 @@ RUN ln -sfn /usr/bin/python${PYTHON_VERSION} /usr/bin/python3 & \
RUN python3 -m venv $POETRY_VENV \
&& $POETRY_VENV/bin/pip install -U pip setuptools \
&& $POETRY_VENV/bin/pip install poetry==${POETRY_VERSION}
# Add `poetry` to PATH
# Add `poetry` to PATH and configure
ENV PATH="${PATH}:${POETRY_VENV}/bin"
# Install AWS CLI
RUN curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip" && \
unzip awscliv2.zip && \
./aws/install && \
rm awscliv2.zip
RUN rm -rf /var/lib/apt/lists/*
RUN poetry config virtualenvs.create true && \
poetry config virtualenvs.in-project true
# Clean up
RUN rm -rf /var/lib/apt/lists/*
# Create caches
RUN mkdir -p /root/.cache/silnlp/experiments
RUN mkdir /root/.cache/silnlp/projects
ENV SIL_NLP_CACHE_EXPERIMENT_DIR=/root/.cache/silnlp/experiments
ENV SIL_NLP_CACHE_PROJECT_DIR=/root/.cache/silnlp/projects
# Set environment variables
ENV CLEARML_API_HOST="https://api.sil.hosted.allegro.ai"
ENV SIL_NLP_DATA_PATH=/aqua-ml-data
ENV AWS_REGION="us-east-1"
CMD ["bash"]
6 changes: 4 additions & 2 deletions .devcontainer/devcontainer.json
@@ -17,7 +17,6 @@
"/home/clearml/.clearml/hf-cache:/root/.cache/huggingface"
],
"containerEnv": {
"AWS_REGION": "${localEnv:AWS_REGION}",
"AWS_ACCESS_KEY_ID": "${localEnv:AWS_ACCESS_KEY_ID}",
"AWS_SECRET_ACCESS_KEY": "${localEnv:AWS_SECRET_ACCESS_KEY}",
"CLEARML_API_ACCESS_KEY": "${localEnv:CLEARML_API_ACCESS_KEY}",
@@ -44,7 +43,10 @@
},
"editor.formatOnSave": true,
"editor.formatOnType": true,
"isort.args":["--profile", "black"]
"isort.args": [
"--profile",
"black"
]
},
// Add the IDs of extensions you want installed when the container is created.
"extensions": [
9 changes: 7 additions & 2 deletions Dockerfile
@@ -105,11 +105,16 @@ RUN mv meteor-1.5/meteor-1.5.jar /usr/local/bin
RUN rm -rf meteor-1.5
ENV METEOR_PATH=/usr/local/bin

# Create caches
RUN mkdir -p .cache/silnlp/experiments
RUN mkdir .cache/silnlp/projects
ENV SIL_NLP_CACHE_EXPERIMENT_DIR=/root/.cache/silnlp/experiments
ENV SIL_NLP_CACHE_PROJECT_DIR=/root/.cache/silnlp/projects

# Other environment variables
ENV SIL_NLP_DATA_PATH=/aqua-ml-data
RUN mkdir -p .cache/silnlp
ENV SIL_NLP_CACHE_EXPERIMENT_DIR=/root/.cache/silnlp
ENV CLEARML_API_HOST="https://api.sil.hosted.allegro.ai"
ENV AWS_REGION="us-east-1"

# Clone silnlp and make it the starting directory
RUN git clone https://github.com/sillsdev/silnlp.git
222 changes: 44 additions & 178 deletions README.md

Large diffs are not rendered by default.

7 changes: 2 additions & 5 deletions clear_ml_linux_setup.md → clear_ml_setup.md
@@ -1,10 +1,7 @@
# Instructions for setting up Clear-ML on Linux.

These were tested on Pop!_OS.
See [Clear-ML Windows setup](clear_ml_windows_setup.md) for instructions to set up Clear-ML on Windows.
# Instructions for setting up Clear-ML.

## Install the clearml python package.
Open a terminal and use pip to install Clear-ML.
Open a terminal (or Command Prompt on Windows) and use pip to install Clear-ML.
`pip install clearml`

## Add your AWS storage vault credentials (If using AWS S3).
46 changes: 0 additions & 46 deletions clear_ml_windows_setup.md

This file was deleted.

105 changes: 105 additions & 0 deletions manual_setup.md
@@ -0,0 +1,105 @@
# Manual Setup

## SILNLP Prerequisites
These are the main requirements for running the SILNLP code on a local machine. The SILNLP repo is hosted on GitHub, is written mainly in Python, and calls SIL.Machine.Tool. 'Machine', as we tend to call it, is a .NET application that has many functions for manipulating USFM data. Most of the language data we have for low-resource languages is in USFM format. Since Machine is a .NET application, it depends on the __.NET Core SDK__, which works on Windows and Linux. Because SILNLP uses many Python packages with complex versioning requirements, we use a Python package called Poetry to manage them. Here is a rough hierarchy of SILNLP with its major dependencies.

| Requirement | Reason |
| --------------------- | ----------------------------------------------------------------- |
| Git | To get the repo from [GitHub](https://github.com/sillsdev/silnlp) |
| Python | To run the silnlp code |
| Poetry | To manage all the Python packages and versions |
| SIL.Machine.Tool | To support many functions for data manipulation |
| .NET Core SDK | Required by SIL.Machine.Tool |
| NVIDIA GPU | Required to run on a local machine |
| NVIDIA drivers | Required for the GPU |
| CUDA Toolkit | Required for machine learning on the GPU |
| Environment variables | To tell SILNLP where to find the data, etc. |

## Setup

The SILNLP code can be run on either Windows or Linux operating systems. If using an Ubuntu distribution, the only compatible version is 20.04.

__Download and install__ the following before creating any projects or starting any code, preferably in this order to avoid most warnings:

1. If using a local GPU: [NVIDIA driver](https://www.nvidia.com/download/index.aspx)
* On Ubuntu, the driver can alternatively be installed through the GUI by opening Software & Updates, navigating to Additional Drivers in the top menu, and selecting the newest NVIDIA driver with the labels proprietary and tested.
* After installing the driver, reboot your system.
2. [Git](https://git-scm.com/downloads)
3. [Python 3.8](https://www.python.org/downloads/) (latest patch release, e.g. 3.8.19)
* Can alternatively install Python using [miniconda](https://docs.conda.io/projects/conda/en/latest/user-guide/install/windows.html) if you're planning to use more than one version of Python. If following this method, activate your conda environment before installing Poetry.
4. [Poetry](https://python-poetry.org/docs/#installation)
* Note that whether the command should call python or python3 depends on which is required on your machine.
* It may (or may not) be possible to run the curl command within a VS Code terminal. If that causes permission errors, close VS Code and try it in an elevated CMD prompt.

Windows:
At an administrator CMD prompt or a terminal within VS Code run:
```
curl -sSL https://install.python-poetry.org | python - --version 1.7.1
```
In Powershell, run:
```
(Invoke-WebRequest -Uri https://install.python-poetry.org -UseBasicParsing).Content | python - --version 1.7.1
```

Linux:
In terminal, run:
```
curl -sSL https://install.python-poetry.org | python3 - --version 1.7.1
```
Add the following line to your .bashrc file in your home directory:
```
export PATH="$HOME/.local/bin:$PATH"
```
5. .NET Core SDK
* The necessary versions are 7.0 and 3.1. If your machine is only able to install version 7.0, you can set the DOTNET_ROLL_FORWARD environment variable to "LatestMajor", which will allow you to run anything that depends on dotnet 3.1.
* Note - the .NET SDK is needed for [SIL.Machine.Tool](https://github.com/sillsdev/machine). Many of the scripts in this repo require this .NET package. The .NET package will be installed and updated when silnlp is initialized in `__init__.py`.
* Windows: [.NET Core SDK](https://dotnet.microsoft.com/download)
* Linux: Installation instructions can be found [here](https://learn.microsoft.com/en-us/dotnet/core/install/linux-ubuntu-2004)
6. C++ Redistributable
* Note - this may already be installed. If it is not installed you may get cryptic errors such as "System.DllNotFoundException: Unable to load DLL 'thot' or one of its dependencies"
* Windows: Download from https://support.microsoft.com/en-us/topic/the-latest-supported-visual-c-downloads-2647da03-1eea-4433-9aff-95f26a218cc0 and install
* Linux: Instead of installing the redistributable, run the following commands:
```
sudo apt-get update
sudo apt-get install build-essential gdb
```
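As a quick sanity check after working through the list above, the following minimal sketch reports which command-line tools are not yet visible on your PATH. It is not part of SILNLP; the helper name `find_missing_tools` and the command names in `REQUIRED_TOOLS` are assumptions for illustration (for example, the Python command may be `python3` on your machine).

```python
import shutil

# Command names are assumptions; adjust them to your machine
# (e.g. "python3" instead of "python" on many Linux systems).
REQUIRED_TOOLS = ["git", "python", "poetry", "dotnet"]


def find_missing_tools(tools, which=shutil.which):
    """Return the subset of `tools` that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = find_missing_tools(REQUIRED_TOOLS)
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All listed tools were found on PATH.")
```

This only confirms that the commands can be located; it does not check versions, the NVIDIA driver, or the C++ redistributable.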

### Visual Studio Code setup

1. Install Visual Studio Code
2. Install Python extension for VS Code
3. Open up silnlp folder in VSC
4. In CMD window, type `poetry install` to create the virtual environment for silnlp
* If using conda, activate your conda environment first before `poetry install`. Poetry will then install all the dependencies into the conda environment.
5. Choose the newly created virtual environment as the "Python Interpreter" in the command palette (ctrl+shift+P)
* If using conda, choose the conda environment as the interpreter
6. Open the command palette and select "Preferences: Open User Settings (JSON)". In the `settings.json` file, add the following options:
``` json
"python.formatting.provider": "black",
"python.linting.pylintEnabled": true,
"editor.formatOnSave": true,
```

### S3 bucket setup

See [S3 bucket setup](s3_bucket_setup.md).

### ClearML setup

See [ClearML setup](clear_ml_setup.md).

### Additional Environment Variables
Set the following environment variables with your respective credentials: CLEARML_API_ACCESS_KEY, CLEARML_API_SECRET_KEY, AWS_ACCESS_KEY_ID, and AWS_SECRET_ACCESS_KEY
* Windows users: see [here](https://github.com/sillsdev/silnlp/wiki/Install-silnlp-on-Windows-10#permanently-set-environment-variables) for instructions on setting environment variables permanently
* Linux users: To set environment variables permanently, add each variable as a new line to the `.bashrc` file in your home directory with the format
```
export VAR="VAL"
```
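
To confirm that the four credentials named above are visible to Python, a small check along these lines may help. It is a sketch rather than anything shipped with SILNLP; the helper name `missing_env_vars` is hypothetical, and a variable set to an empty string is treated as missing.

```python
import os

# The variable names come from the section above.
REQUIRED_VARS = [
    "CLEARML_API_ACCESS_KEY",
    "CLEARML_API_SECRET_KEY",
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
]


def missing_env_vars(names, environ=os.environ):
    """Return the names that are unset or empty in `environ`."""
    return [name for name in names if not environ.get(name)]


if __name__ == "__main__":
    missing = missing_env_vars(REQUIRED_VARS)
    if missing:
        print("Unset or empty:", ", ".join(missing))
    else:
        print("All required environment variables are set.")
```

Run it from a freshly opened terminal, since changes to `.bashrc` or to the Windows environment settings only apply to new sessions.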

### Setting Up and Running Experiments

See the [wiki](https://github.com/sillsdev/silnlp/wiki) for information on setting up and running experiments. The most important pages for getting started are the ones on [file structure](https://github.com/sillsdev/silnlp/wiki/Folder-structure-and-file-naming-conventions), [model configuration](https://github.com/sillsdev/silnlp/wiki/Configure-a-model), and [running experiments](https://github.com/sillsdev/silnlp/wiki/NMT:-Usage). A lot of the instructions are specific to NMT, but are still helpful starting points for doing other things like [alignment](https://github.com/sillsdev/silnlp/wiki/Alignment:-Usage).

See [this](https://github.com/sillsdev/silnlp/wiki/Using-the-Python-Debugger) page for information on using the VS code debugger.

If you need to use a tool that is supported by SILNLP but is not installable as a Python library (which is probably the case if you get an error like "RuntimeError: eflomal is not installed."), follow the appropriate instructions [here](https://github.com/sillsdev/silnlp/wiki/Installing-External-Libraries).
57 changes: 57 additions & 0 deletions s3_bucket_setup.md
@@ -0,0 +1,57 @@
# S3 bucket setup

We use Amazon S3 storage for storing our experiment data. Here is some workspace setup to enable a decent workflow.

### Install and configure AWS S3 storage
The following will allow the boto3 and S3Path libraries in Python to communicate correctly with the S3 bucket.
* Install the aws-cli from: https://aws.amazon.com/cli/
* In cmd, type: `aws configure` and enter your AWS access_key_id and secret_access_key and the region (we use region = us-east-1).
* The `aws configure` command creates a folder named `.aws` in your home directory. It should contain two plain-text files named `config` and `credentials`. The config file should contain the region, and the credentials file should contain your access_key_id and your secret_access_key.
(The home directory is usually `C:\Users\<Username>\` on Windows and `/home/<username>` on Linux.)
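
The two files written by `aws configure` use an INI-style layout, so they can be inspected with Python's standard `configparser`. The sketch below is an illustration only: the helper name `check_aws_files` is hypothetical, and it assumes the settings live under the `[default]` profile, which is what `aws configure` writes when no profile is specified.

```python
import configparser
from pathlib import Path


def check_aws_files(aws_dir=Path.home() / ".aws", profile="default"):
    """Return a list of problems found in the AWS config and credentials files."""
    problems = []

    # A missing file is silently skipped by read(), so it shows up
    # below as missing options rather than as an exception.
    credentials = configparser.ConfigParser()
    credentials.read(Path(aws_dir) / "credentials")
    for key in ("aws_access_key_id", "aws_secret_access_key"):
        if not credentials.has_option(profile, key):
            problems.append(f"credentials: no {key} in [{profile}]")

    config = configparser.ConfigParser()
    config.read(Path(aws_dir) / "config")
    if not config.has_option(profile, "region"):
        problems.append(f"config: no region in [{profile}]")

    return problems


if __name__ == "__main__":
    for problem in check_aws_files():
        print(problem)
```

An empty result means all three values are present; it does not verify that the keys are valid for the bucket.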

### Install and configure rclone


**Windows**

The following will mount /aqua-ml-data on your S drive and allow you to explore, read and write.
* Install WinFsp: http://www.secfs.net/winfsp/rel/ (Click the button to "Download WinFsp Installer" not the "SSHFS-Win (x64)" installer)
* Download rclone from: https://rclone.org/downloads/
* Unzip to your desktop (or some convenient location).
* Add the folder that contains rclone.exe to your PATH environment variable.
* Take the `scripts/rclone/rclone.conf` file from this SILNLP repo and copy it to `~\AppData\Roaming\rclone` (creating folders if necessary)
* Add your credentials in the appropriate fields of the `rclone.conf` file in `~\AppData\Roaming\rclone`
* Take the `scripts/rclone/mount_to_s.bat` file from this SILNLP repo and copy it to the folder that contains the unzipped rclone.
* Double-click the bat file. A command window should open and remain open. You should see something like:
```
C:\Users\David\Software\rclone>call rclone mount --vfs-cache-mode full --use-server-modtime s3aqua:aqua-ml-data S:
The service rclone has been started.
```

**Linux**

The following will mount /aqua-ml-data to an S folder in your home directory and allow you to explore, read and write.
* Download rclone from: https://rclone.org/install/
* Take the `scripts/rclone/rclone.conf` file from this SILNLP repo and copy it to `~/.config/rclone/rclone.conf` (creating folders if necessary)
* Add your credentials in the appropriate fields in `~/.config/rclone/rclone.conf`
* Create a folder called "S" in your user directory
* Run the following command:
```
rclone mount --vfs-cache-mode full --use-server-modtime s3aqua:aqua-ml-data ~/S
```
### To start S: drive on start up

**Windows**

Put a shortcut to the mount_to_s.bat file in the Startup folder.
* In Windows Explorer put `shell:startup` in the address bar or open `C:\Users\<Username>\AppData\Roaming\Microsoft\Windows\Start Menu\Programs\Startup`
* Right click to add a new shortcut. Choose `mount_to_s.bat` as the target. You can leave the name as the default.

Now your AWS S3 bucket should be mounted as S: drive when you start Windows.

**Linux**
* Run `crontab -e`
* Paste `@reboot rclone mount --vfs-cache-mode full --use-server-modtime s3aqua:aqua-ml-data ~/S` into the file, save and exit
* Reboot Linux

Now your AWS S3 bucket should be mounted as ~/S when you start Linux.
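
After a reboot on Linux, one way to confirm that the mount came up is to ask Python whether `~/S` is a mount point. This is a minimal sketch under the assumption that the bucket is mounted at `~/S` as described above; the helper name `is_mounted` is hypothetical.

```python
import os
from pathlib import Path


def is_mounted(path):
    """Return True if `path` is currently a mount point."""
    return os.path.ismount(path)


if __name__ == "__main__":
    mount_dir = Path.home() / "S"  # mount location used in the steps above
    if is_mounted(mount_dir):
        print(f"{mount_dir} is mounted.")
    else:
        print(f"{mount_dir} is not mounted; check rclone and the crontab entry.")
```

A plain, unmounted `~/S` folder reports as not mounted, which helps distinguish an empty bucket from a failed mount.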
