The RLlib integration brings support for using the CARLA simulator as an environment of the Ray/RLlib library. This repository handles the creation and use of the CARLA simulator as a Ray environment, which users can then rely on for training and inference purposes. This is complemented by an example, as well as some files that ease the use of AWS instances. These functionalities are divided as follows:
- rllib_integration contains all the infrastructure related to CARLA. Here, we set up the CARLA server, clients and actors. The basic structure that all training and testing experiments must follow is also included here.
- aws has the files needed to run this on an AWS instance. Specifically, aws_helper.py provides several functionalities that ease the management of EC2 instances, including their creation as well as retrieving and sending data.
- dqn_example, as well as the other dqn_* files, provides an easy-to-understand example of how to set up a Ray experiment using CARLA as its environment.
As CARLA is the environment that Ray will be using, the first step is to set it up. To do so, a packaged version must be installed (see all CARLA releases). This integration has been done using CARLA 0.9.11, and therefore it is recommended to use that version. While other versions might be compatible, they haven't been fully tested, so proceed at your own discretion.
Additionally, so that the integration knows where this package is located, set the CARLA_ROOT environment variable to the path of that folder.
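For example, export CARLA_ROOT=/path/to/CARLA_0.9.11 (adjust the path to wherever you extracted the package). If you want a quick, optional check that the variable points at a usable package, the snippet below can help; it is not part of the repository, and it assumes the standard layout of packaged releases, with the Python API under PythonAPI/carla/dist:

```python
# Optional sanity check: confirm CARLA_ROOT is set and contains the Python API
# .egg shipped with packaged releases (layout assumed, adjust if needed).
import glob
import os
import sys

carla_root = os.environ.get("CARLA_ROOT")
if not carla_root:
    sys.exit("CARLA_ROOT is not set; point it to your CARLA package folder")

eggs = glob.glob(os.path.join(carla_root, "PythonAPI", "carla", "dist", "carla-*.egg"))
if not eggs:
    sys.exit("No carla .egg found under %s/PythonAPI/carla/dist" % carla_root)

print("Found CARLA package at", carla_root)
```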
Note: Installing CARLA is only needed if you want to run this repository locally.
With CARLA installed, we can install the rest of the prerequisites with:
```
pip3 install -r requirements.txt
```
Let's start by explaining how to create your own experiment. To do so, you'll need to create:
- Experiment class
- Configuration file
- Training and inference files
The first step is to define a training experiment. For all environments to work with Ray, they have to return specific information (see CarlaEnv), which will depend on your chosen experiment. As such, all experiments should inherit from BaseExperiment, overriding all of its methods.
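As a rough illustration only, a custom experiment could look like the sketch below. The method names and signatures shown here are assumptions, so check the BaseExperiment class in the rllib_integration folder for the real hooks to override:

```python
# Illustrative skeleton only: the hook names and signatures below are
# assumptions about BaseExperiment, not the repository's exact API.
import numpy as np
from gym.spaces import Box, Discrete

from rllib_integration.base_experiment import BaseExperiment  # import path assumed


class MyExperiment(BaseExperiment):
    """Toy experiment skeleton; override every BaseExperiment hook you rely on."""

    def get_action_space(self):
        # e.g. a small discrete set of throttle/steer combinations (hook name assumed)
        return Discrete(4)

    def get_observation_space(self):
        # e.g. an 84x84 RGB-like observation (hook name assumed)
        return Box(low=0.0, high=1.0, shape=(84, 84, 3), dtype=np.float32)

    def get_observation(self, sensor_data):
        # Convert raw CARLA sensor output into the agent's observation
        return np.zeros((84, 84, 3), dtype=np.float32)

    def compute_reward(self, observation, core):
        # Reward shaping is experiment-specific
        return 0.0

    def get_done_status(self, observation, core):
        # End the episode on collision, timeout, route completion, etc.
        return False
```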
Additionally, a configuration file is also required. Any settings here update the default ones (a generic sketch of this merging follows the list below). It can be divided into the following parts:
- Ray trainer configuration: everything related to the specific trainer used. If you are using a built-in model, you can set up its settings here.
- CARLA environment: CARLA-related settings. These can be divided into the simulation settings, such as timeout or map quality (default values here), and the experiment configuration, related to the ego vehicle and its sensors (check the default settings for how to specify the sensors to use), as well as the town conditions (default values here).
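The merging behaviour itself amounts to a recursive dictionary update; the following generic sketch (not the repository's actual code, and with made-up default keys) shows the idea of user settings overriding defaults:

```python
# Generic illustration of "user settings update the defaults";
# this is not the repository's actual merging code.
import yaml


def deep_update(defaults, overrides):
    """Recursively replace default values with those from the user's YAML."""
    result = dict(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(result.get(key), dict):
            result[key] = deep_update(result[key], value)
        else:
            result[key] = value
    return result


default_settings = {"timeout": 10.0, "quality_level": "Low"}  # made-up defaults
with open("my_experiment_config.yaml") as f:                  # your own config file
    config = deep_update(default_settings, yaml.safe_load(f))
```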
The last step is to create your own training and inference files. This part is completely up to you and depends on the Ray API. Remember to check Ray's custom model docs if you want to create your own specific model.
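For orientation, a minimal training entry point built on the Ray 1.x API might look roughly like the sketch below; the CarlaEnv import path and every configuration value here are assumptions, and the repository's own training file is the reference:

```python
# Minimal sketch of a training entry point in the style of the Ray 1.x API.
# The CarlaEnv import path and all configuration values below are assumptions.
import ray
from ray import tune

from rllib_integration.carla_env import CarlaEnv  # import path assumed


if __name__ == "__main__":
    ray.init()
    tune.run(
        "DQN",                       # any built-in RLlib trainer name
        name="my_carla_experiment",
        config={
            "env": CarlaEnv,
            "env_config": {},        # experiment and simulation settings go here
            "framework": "torch",
            "num_gpus": 1,
            "num_workers": 0,
        },
        stop={"timesteps_total": 100_000},
        checkpoint_freq=10,
    )
```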
To put the previous section into practice, we also provide a simple example. It uses the BirdView pseudosensor, along with Ray's DQNTrainer. The files are:
- The DQNExperiment is the experiment class, which overrides the methods of its parent class.
- The configuration file is dqn_example/dqn_config.yaml.
- dqn_train.py is responsible for the training; dqn_inference_ray.py runs inference using Ray's API; and dqn_inference.py runs inference without it.
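To give a feel for the difference, inference through Ray's API usually means restoring a trainer from a checkpoint and querying it directly. The sketch below uses the Ray 1.x DQNTrainer API with placeholder paths and environment wiring; it is not the repository's dqn_inference_ray.py:

```python
# Sketch of checkpoint-based inference with the Ray 1.x DQNTrainer API.
# Paths, the import path and the env wiring are placeholders.
import ray
from ray.rllib.agents.dqn import DQNTrainer

from rllib_integration.carla_env import CarlaEnv  # import path assumed

ray.init()
trainer = DQNTrainer(config={"env": CarlaEnv, "env_config": {}, "framework": "torch"})
trainer.restore("/path/to/checkpoint")  # placeholder checkpoint path

env = CarlaEnv({})  # placeholder env_config
obs = env.reset()
done = False
while not done:
    action = trainer.compute_action(obs)
    obs, reward, done, info = env.step(action)
```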
To run this example locally, you need to install PyTorch:

```
pip3 install -r dqn_example/dqn_requirements.txt
```

and run the training file:

```
python3 dqn_train.py dqn_example/dqn_config.yaml --name dqn
```
Note: The default configuration uses 1 GPU and 12 CPUs, so if your current instance doesn't have that capacity, lower the numbers in dqn_example/dqn_config.yaml. Additionally, if you are having out-of-memory problems, consider reducing the buffer_size parameter.
We also provide tools to automatically run the training on AWS EC2 instances, using the Ray autoscaler API.
First, make sure your boto3 environment is configured correctly; you can follow the instructions here.
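As a quick, optional sanity check (not part of the repository), you can verify that boto3 picks up your credentials and region before creating anything:

```python
# Optional check that boto3 can see your AWS credentials and region.
import boto3

session = boto3.session.Session()
print("Region:", session.region_name)
print("Account:", boto3.client("sts").get_caller_identity()["Account"])
```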
The first step is to create the image needed for training. We provide a script that automatically creates it, given the base image and the installation script:
```
python3 aws_helper.py create-image --name <AMI-name> --installation-scripts <installation-scripts> --instance-type <instance-type> --volume-size <volume-size>
```
Note: This script will end by outputting information about the created image. In order to use the Ray autoscaler, manually update the image id and security group id in your autoscaler configuration file with the values provided.
With the image created, we can use Ray's API to run the training at the cluster:
- Initialize the cluster:
    ```
    ray up <autoscaler_configuration_file>
    ```
- (Optional) Update remote files with local changes:
If the local code has been modified after the cluster initialization, use the following command to update it:

    ```
    ray rsync-up <autoscaler_configuration_file> <path_to_local_folder> <path_to_remote_folder>
    ```
- Run the training:
    ```
    ray submit <autoscaler_configuration_file> <training_file>
    ```
- (Optional) Monitor the cluster status:
    ```
    ray attach <autoscaler_configuration_file>
    watch -n 1 ray status
    ```
- Shutdown the cluster:
    ```
    ray down <autoscaler_configuration_file>
    ```
For this example, we use the autoscaler configuration at dqn_example/dqn_autoscaler.yaml. To execute it, you just need to run:
```
# Create the training image
python3 aws_helper.py create-image --name <AMI-name> --installation-scripts install/install.sh --instance-type <instance-type> --volume-size <volume-size>
```
Note: Remember to manually change the image id and security group id in dqn_example/dqn_autoscaler.yaml after running this command.
```
# Initialize the cluster
ray up dqn_example/dqn_autoscaler.yaml

# (Optional) Update remote files with local changes
ray rsync-up dqn_example/dqn_autoscaler.yaml dqn_example .
ray rsync-up dqn_example/dqn_autoscaler.yaml rllib_integration .

# Run the training
ray submit dqn_example/dqn_autoscaler.yaml dqn_train.py -- dqn_example/dqn_config.yaml --auto

# (Optional) Monitor the cluster status
ray attach dqn_example/dqn_autoscaler.yaml
watch -n 1 ray status

# Shutdown the cluster
ray down dqn_example/dqn_autoscaler.yaml
```