Developer guides – Docker & UI
This page introduces the usage of Docker and the architecture of the user interface/webserver in greater detail.
This diagram should give a good overview of the architecture of the webserver.
The local client interacts with the remotely deployed code through a RESTful API. This API uses the Python framework FastAPI with a uvicorn server. It provides the following routes:
- `/start`: Starts a Docker container with the configuration given in the request body in `.json` format. The response to this request will contain the `id` of the started container.
- `/data?id=<container_id>`: Returns a stream of a `.tar` or `.zip` file of the `results` folder of the experiment.
- `/data/tensorboard?id=<container_id>`: Returns the link to the TensorBoard of the running container.
- `/health?id=<container_id>`: Returns the state of the container with the given `id`.
- `/logs?id=<container_id>`: Returns the command line output of the running process in the container.
- `/remove?id=<container_id>`: Stops and removes a running container.
The API will return a `.json` response with the following fields:

- `id`: The id of the Docker container.
- `status`: The current status of the container, e.g. `running` or `paused`.
- `data`: Any miscellaneous data (e.g. a link to a running TensorBoard or the container's logs).
- `stream`: A stream generator, e.g. for streaming files.
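The sketch below shows how a client could talk to these routes. It assumes the API runs on `localhost:8000`, that `/start` expects a POST request, and that the token is passed in the `Authorization` header; the placeholder configuration is not a real `recommerce` configuration.

```python
# Minimal client sketch; URL, header name and config keys are assumptions.
import requests

API_URL = 'http://localhost:8000'
HEADERS = {'Authorization': '<your_token>'}

# Start a container with a configuration given as JSON (placeholder config)
start_response = requests.post(f'{API_URL}/start', json={'config': {}}, headers=HEADERS)
container_id = start_response.json()['id']

# Query the state of the started container
health_response = requests.get(f'{API_URL}/health', params={'id': container_id}, headers=HEADERS)
print(health_response.json()['status'])
```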
To start the API on port 8000, go to the `/docker` directory and run:
uvicorn app:app --reload
Don't use `--reload` when deploying in production.
Alternatively, you can also run `app.py` from the `docker` directory.
Keep in mind that all output will be written to log files located in `/docker/log_files`.
To use the API, an `AUTHORIZATION_TOKEN` must be provided in the environment variables.
For each API request, the value of the authorization header is checked: the token given by the request is compared to a token calculated on the API side from the `AUTHORIZATION_TOKEN` and the current time.
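The exact token calculation is defined in the API code; the following is only a sketch of the general idea, assuming an HMAC over the current time window derived from the `AUTHORIZATION_TOKEN`.

```python
# Sketch only: the real API may use a different window size, hash, or encoding.
import hashlib
import hmac
import os
import time

def expected_token(window_seconds: int = 3600) -> str:
    secret = os.environ['AUTHORIZATION_TOKEN'].encode()
    time_window = str(int(time.time()) // window_seconds).encode()
    return hmac.new(secret, time_window, hashlib.sha256).hexdigest()

def is_authorized(request_token: str) -> bool:
    # compare_digest avoids leaking information through timing differences
    return hmac.compare_digest(request_token, expected_token())
```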
WARNING: Please keep in mind that the `AUTHORIZATION_TOKEN` must be kept secret.
If it is revealed, it must be revoked and replaced with a new secret.
Furthermore, consider using transport encryption to ensure that the token cannot be stolen in transit.
We suggest running the API with `https` only for additional security.
Doing this is straightforward with uvicorn: create your own SSL certificates and set the `ssl_keyfile` and `ssl_certfile` arguments accordingly.
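A minimal sketch of serving the app over HTTPS programmatically; the certificate file names and the host/port are assumptions and should be adjusted to your setup.

```python
# Sketch: start the FastAPI app with TLS; key.pem/cert.pem are placeholder names.
import uvicorn

if __name__ == '__main__':
    uvicorn.run('app:app', host='0.0.0.0', port=8000,
                ssl_keyfile='./key.pem', ssl_certfile='./cert.pem')
```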
We use Docker to isolate `recommerce` experiments.
Before continuing, please make sure you understand the basic concepts of Docker.
Make sure Docker is installed on the machine on which you want to run multiple isolated `recommerce` experiments.
There are different ways to use Docker with the `recommerce` framework.
The API and the Docker containers must run on the same machine.
This should usually be a remote machine with a lot of GPU and CPU power.
There is a `docker_manager` which can be used by the API to manage Docker containers.
For this, the `docker_manager` makes use of the Docker SDK for Python.
Whenever the code of the `recommerce` framework changes, it is necessary to update the Docker image.
When executing the `docker_manager`, it will automatically update or create the image.
Depending on the internet connection, this might take a while.
python ./docker/docker_manager.py
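For illustration, this is roughly what a manager built on the Docker SDK for Python does; the function names and the way the configuration is passed are assumptions, not the actual `docker_manager` interface.

```python
# Illustrative sketch using the Docker SDK for Python; not the real docker_manager API.
import docker

client = docker.from_env()

def ensure_image(path: str = './docker', tag: str = 'recommerce') -> None:
    # (Re-)builds the image so it matches the current recommerce code
    client.images.build(path=path, tag=tag)

def start_container(tag: str = 'recommerce') -> str:
    # Starts a detached container from the image and returns its id
    container = client.containers.run(tag, detach=True)
    return container.id
```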
When using Docker on your local machine, we recommend using Docker Desktop.
Once Docker is available, the `recommerce` image can be built.
This can be done either by running the `docker_manager` or by using the following command in the directory where the `dockerfile` is located:
docker build . -t recommerce
Building the image may take a while; it is about 7 GB in size. To see all current images on the system, use:
docker images
A container can be created and run from the `recommerce` image using the following command.
Note that if the machine does not have a dedicated GPU, it might be necessary to omit the `--gpus all` flag:
docker run -it --entrypoint /bin/bash --gpus all recommerce
Running this command will start a container and automatically open an interactive shell. To list all running Docker containers, use:
docker ps -a
Stop a specific container with a given `<container_id>` by using:
docker stop <container_id>
And remove it with:
docker rm <container_id>
Any errors described here should only occur when trying to deploy the webserver/docker to a new environment/virtual machine.
When trying to run a docker container (with a GPU device request), there might be the following error:
failed to create shim: OCI runtime create failed: container_linux.go:380: starting container process caused: process_linux.go:545: container init caused: Running hook #0:: error running hook: signal: segmentation fault, stdout: , stderr:: unknown
This error is caused by the local Linux distribution (on Windows this pertains to the WSL instance used by Docker) not having the packages installed that are needed to support CUDA. A proposed workaround is to update/downgrade the following packages:
apt install libnvidia-container1=1.4.0-1 libnvidia-container-tools=1.4.0-1 nvidia-container-toolkit=1.5.1-1
Issues in the nvidia-docker repository that describe this error can be found here, here, and here. Please note that we have not confirmed that the workaround solves this problem.
Note: This error should no longer occur if the recommerce
package was installed with the correct extra selected. We are still including this section for completeness.
When trying to start a training session on a remote or local machine (e.g. using recommerce -c training
) there might be the following error:
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasSgemm( handle, opa, apb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)`
This error occurs when the torch installation does not have CUDA support although the machine supports CUDA.
Confirm that the correct version of recommerce
is installed, see Installation.
In this case, recommerce
should be installed with the gpu
extra, which installs the following versions of torch:
torch==1.11.0+cu115
torchvision==0.12.0+cu115
torchaudio==0.11.0+cu115
It is possible to manually update these versions using:
pip install torch==1.11.0+cu115 torchvision==0.12.0+cu115 torchaudio==0.11.0+cu115 -f https://download.pytorch.org/whl/torch_stable.html
The provided webserver acts as a client to the API and as a UI for users of the `recommerce` framework.
It is implemented using the Python library Django.
Furthermore, jQuery 3.6 and Bootstrap v5.0 are used.
There is an app called `alpha_business_app` which implements all webserver logic.
The second app, `users`, is provided by Django to implement user management.
Django follows the Model-View-Controller pattern (in Django's own terminology, Model-Template-View).
The views are HTML files found in `/webserver/templates`.
Controllers and models can be found in `webserver/alpha_business_app`.
All files in the `webserver/alpha_business_app/models` directory are models which represent the database.
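As an illustration, a model in that directory could look roughly like this; the class and field names are placeholders and do not reflect the actual schema.

```python
# Illustrative Django model; names and fields are made up for this example.
from django.db import models

class Container(models.Model):
    container_id = models.CharField(max_length=64)
    health_status = models.CharField(max_length=32, default='unknown')
    started_at = models.DateTimeField(auto_now_add=True)
```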
All listed commands can only be executed in the `webserver` directory. Before starting the webserver for the first time, the database needs to be initialized; therefore, run:
python ./manage.py migrate
To start the webserver on 127.0.0.1:2709
use the following command:
python ./manage.py runserver 2709
When starting the webserver, you will be greeted by the login page. To create a superuser and log in to the page, run:
python ./manage.py createsuperuser
Django will ask you for a username, a password and an e-mail address.
It should be possible to leave the e-mail field blank.
To manage other users, go to `127.0.0.1:2709/admin` and log in with the admin/superuser credentials.
When more fields are added to a model or existing fields are changed, the database must be modified. To do so, run:
python ./manage.py makemigrations
This will write migration files. Do not forget to apply the migrations afterwards by running `python ./manage.py migrate`.
Run tests by using the following command. Warning for the CI pipeline: if Django does not run any tests, the pipeline will pass.
python ./manage.py test -v 2
The configuration files are represented as database models in the webserver. This is an overview of the current classes:
Each class has all possible attributes, even if some agents or marketplaces do not implement them.
There is one configuration object for each saved configuration file, so the objects are always "complete"; values that are not needed are set to `None`.
In order to have "complete" model classes, it is necessary to know all fields that can occur in any configuration file.
Therefore, the agents and markets in the `recommerce` framework need to provide the `get_configurable_fields` classmethod.
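As a rough illustration, such a classmethod could look like the following; the field names, types, and return format shown here are assumptions, not the actual `recommerce` signature.

```python
# Illustrative sketch of a get_configurable_fields classmethod; the real
# return format in the recommerce framework may differ.
class ExampleAgent:
    @classmethod
    def get_configurable_fields(cls) -> list:
        # (field name, field type, validation rule)
        return [
            ('learning_rate', float, 'greater_zero'),
            ('batch_size', int, 'greater_zero'),
            ('use_gpu', bool, None),
        ]
```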
Whenever the parameters for an agent or a market change, the webserver needs to update its models.
At the moment, this can be done by using the `on_recommerce_change.py` script, located in `/webserver/alpha_business_app`.
Run this script before starting the server.
It will overwrite the model files for the rl config and the sim_market config, as well as the template files for rl and sim_market.
The supported types for input fields and database fields are int, float, string and boolean.
After running this script, make and run migrations to apply these changes.
To use the webserver, either an `.env.txt` file is needed in the `BP2021/webserver` directory or the environment variables `SECRET_KEY` and `API_TOKEN` must be set.
Here is an example for an `.env.txt`:
this_line_contains_the_secret_key_for_the_django_server
this_line_contains_the_master_secret_for_the_api
Remember to change these secrets if they are leaked to the public.
Both secrets should be long, random strings.
Keep in mind that the master secret for the API (`API_TOKEN`) should be equal to the `AUTHORIZATION_TOKEN` on the API side.
Sometimes it might be useful to monitor the performance of the Docker containers.
To do so, there is a monitoring tool on the API side.
It uses a database to store information about the system performance during an experiment and a separate process to monitor containers.
If monitoring is enabled in `app.py` (set `should_run_monitoring` to `True`), the `docker_manager` will report all actions on a container to the database.
Furthermore, the `container_health_checker` is started.
The `container_health_checker` tells the database about stopped containers. This makes it possible to figure out how long a container was running.
When the API is shut down, the monitoring tool will stop working as well.
Find more information about this monitoring tool in this thesis.
There is a websocket which can be used to receive notifications about exited containers.
The websocket can be started by running the `container_notification_websocket.py` file within the `/docker` directory.
The websocket will run on port `8001`.
The webserver will automatically connect to this websocket. Whenever a container exits, users will receive a push notification in their interface.
There is still no stable version of the websocket, see PR #519
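For reference, a client could listen for these notifications roughly as follows; this assumes plain text messages on `ws://localhost:8001` and is not necessarily the protocol implemented in `container_notification_websocket.py`.

```python
# Sketch of a websocket client; message format and URL are assumptions.
import asyncio
import websockets

async def listen(url: str = 'ws://localhost:8001') -> None:
    async with websockets.connect(url) as socket:
        async for message in socket:
            print('container notification:', message)

if __name__ == '__main__':
    asyncio.run(listen())
```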
Online Marketplace Simulation: A Testbed for Self-Learning Agents is the 2021/2022 bachelor's project of the Enterprise Platform and Integration Concepts (@hpi-epic, epic.hpi.de) research group of the Hasso Plattner Institute.