Jon Goikoetxea, Jesús F. Palacián
[Project page]
This repository contains the code necessary to replicate the experiments in the work titled Learning efficient goal-conditioned policies by imitating optimal trajectories, defended by Jon Goikoetxea as his Bachelor's thesis in Computer Science. The thesis was directed by Prof. Jesús F. Palacián and defended at the Public University of Navarre (UPNA/NUP).
First, clone the repository:
git clone https://github.com/jongoiko/gcimopt.git
cd gcimopt

Then, create a virtual environment to install the project's dependencies.
We will use Python 3.11; depending on your package manager of choice, there will be different ways to set up a specific Python version. In this example we use venv and pip, and assume that Python 3.11 is installed as python3.11:
python3.11 -m venv .venv
source .venv/bin/activate
python3 -m pip install --upgrade pip

Running the experiments requires the environments in safe-control-gym;
clone their repository and install it in the virtual environment:
git clone https://github.com/utiasDSL/safe-control-gym.git
pip install -e ./safe-control-gym

We can now install the rest of the dependencies:
pip install -e .

All experiments consist of generating a dataset of optimal trajectories, training MLP policies on the dataset, and evaluating those policies. The dataset generation step may be skipped by downloading and unzipping our ZIP file of generated datasets:
wget https://github.com/jongoiko/gcimopt/releases/download/v0.1.0/data.zip
unzip data.zip

Similarly, you can skip policy training by downloading our pre-trained policies:
wget https://github.com/jongoiko/gcimopt/releases/download/v0.1.0/policies.zip
unzip policies.zip

To generate the dataset for an experiment name (one of double_integrator, cartpole, drone2d, drone3d or robot_reach), run
python -m experiments.run +experiment=$EXPERIMENT_NAME generate=true

To generate datasets for all experiments, we provide a shell script which may be run directly:
./generate.sh

The URDF file for the Franka Emika Panda robot is stored in experiments/resources/fer_franka_hand.urdf and was generated with franka_description.
At this point, there should be a data directory with the following structure:
data
├── cartpole
│ └── ...
├── double_integrator
│ └── ...
├── drone2d
│ └── ...
├── drone3d
│ └── ...
└── robot_reach
└── ...
To train a policy of a specific architecture for an experiment name, run
python -m experiments.run +experiment=$EXPERIMENT_NAME +experiment/policy_model=$POLICY_ARCH train=true

where $POLICY_ARCH corresponds to the name of an architecture configuration in experiments/config/experiment/policy_model, such as mlp_2_32.
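For example, using the cartpole experiment and the mlp_2_32 architecture configuration:

python -m experiments.run +experiment=cartpole +experiment/policy_model=mlp_2_32 train=true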
We provide a script train.sh to train all policies as specified in the thesis report:
./train.sh

At this point, there should be a policies directory with the following structure:
policies
├── cartpole
│ └── ...
├── double_integrator
│ └── ...
├── drone2d
│ └── ...
├── drone3d
│ └── ...
└── robot_reach
└── ...
To evaluate a trained policy of a specific architecture for an experiment name, run
python -m experiments.run +experiment=$EXPERIMENT_NAME +experiment/policy_model=$POLICY_ARCH evaluate=true

We provide a script eval.sh to evaluate all policies as specified in the thesis report:
./eval.sh

This section explains how to add a new experiment (i.e. a different dynamical system to be controlled by a trained policy) and also describes the structure of the already included experiments.
We recommend writing the following code in a new .py file in the experiments/ directory;
the files already in that directory (double_integrator.py, cartpole.py, etc.) can also be used as examples to guide code structure.
We start by defining all components of an Optimal Control Problem (OCP), which will be passed to the OCP class constructor (an illustrative sketch follows the list below):
- Given the ODE $\dot x = f(x, u)$ describing the system dynamics, the dynamics $f$ should be implemented as a CasADi function. This function should take two vectors x and u (in that order, of dimension $n_x$ and $n_u$ respectively) and output a vector with $n_x$ dimensions.
- From the OCP's cost functional $\int_0^{t_f} L(x, u)\,dt + L_f(x_f, t_f)$, we declare two more CasADi functions: one for the Lagrangian $L$ and another for the Mayer term $L_f$. The function for $L$ should take two vector inputs x and u and return a scalar; the one for $L_f$ should take a vector $x$ and a scalar $t$, and also output a scalar. Either one of the two terms may be omitted by passing None in its place to the constructor, but not both.
- A pair (tuple) of scalars indicating the lower and upper bounds $[t_{min}, t_{max}]$ for the trajectory duration.
- Lower and upper bounds for the system state, as a pair of arrays/lists of $n_x$ dimensions.
- Lower and upper bounds for the controls, as a pair of arrays/lists of $n_u$ dimensions.
- Optionally, a state-to-goal mapping function $y$ implemented as a CasADi function taking an $n_x$-dimensional vector (state) as input and outputting an $n_g$-dimensional output. If not provided, an identity state-to-goal mapping is assumed ($y(x) = x$).
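As an illustration, the OCP components for a toy one-dimensional double integrator could be declared with CasADi as follows (this is only a sketch with made-up costs and bounds, not the exact formulation used in the thesis experiments):

import casadi as ca

# Toy system: state x = [position, velocity], control u = [acceleration]
x = ca.MX.sym("x", 2)
u = ca.MX.sym("u", 1)
t = ca.MX.sym("t")

dynamics = ca.Function("f", [x, u], [ca.vertcat(x[1], u[0])])
lagrangian = ca.Function("L", [x, u], [ca.sumsqr(u)])  # integrated control effort
mayer = ca.Function("L_f", [x, t], [t])                # penalize trajectory duration

time_bounds = (0.1, 5.0)
state_bounds = ([-1.0, -2.0], [1.0, 2.0])
control_bounds = ([-1.0], [1.0])
state_to_goal = ca.Function("y", [x], [x[0]])          # the goal is the position only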
From all these components, we can instantiate an OCP as
ocp = OCP(
    dynamics,
    lagrangian,
    mayer,
    time_bounds,
    state_bounds,
    control_bounds,
    state_to_goal,
)

To randomly sample initial and final states for dataset generation and policy evaluation, we need to write a function with the signature
def sample_initial_final_states(
    gen: np.random.Generator,
) -> tuple[np.ndarray, np.ndarray]:
    # ...
    return initial, final

which, given a NumPy Generator object, returns a pair of initial and final states. Use the passed Generator for random sampling, in order to maintain reproducibility.
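A minimal sketch of such a function, with illustrative sampling ranges:

import numpy as np

def sample_initial_final_states(
    gen: np.random.Generator,
) -> tuple[np.ndarray, np.ndarray]:
    # Draw both states uniformly from an illustrative box; all randomness goes
    # through the passed Generator so that runs remain reproducible.
    initial = gen.uniform(low=-1.0, high=1.0, size=2)
    final = gen.uniform(low=-1.0, high=1.0, size=2)
    return initial, final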
Neural network inputs and outputs may be preprocessed before training; also, the output of the NN may have to be postprocessed before being executed as a control on the dynamical system. We thus need to define four functions:
buffer_transform_states_goals: this function will be called on the states and goals of the dataset before training. Its signature is

def buffer_transform_states_goals(
    states: jax.Array, goals: jax.Array, params: jax.Array | None, fit: bool
) -> tuple[jax.Array, jax.Array | None]:
    pass

It takes the dataset's states and goals, an optional parameter array params, and a boolean flag fit.
The function should combine and preprocess the states and goals, returning the resulting array of network inputs as its first output. The function is called twice; on the first call, fit will be set to True:
if we need to fit certain parameters from the training set (such as mean and variance for normalization), this should be done when fit is true.
As a second output, the function should then return an array with the fit parameters (or None if none are necessary for preprocessing).
On the second call, fit will be False, and if parameters were calculated during the first call, they will be passed through the params parameter.
Then, the function should transform the states and goals using the parameters fit from the training set.
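For instance, a simple normalization-based sketch (assuming the network input is just the concatenation of state and goal; the actual experiments may preprocess differently):

import jax
import jax.numpy as jnp

def buffer_transform_states_goals(
    states: jax.Array, goals: jax.Array, params: jax.Array | None, fit: bool
) -> tuple[jax.Array, jax.Array | None]:
    inputs = jnp.concatenate([states, goals], axis=-1)
    if fit:
        # First call: compute normalization statistics from the training set.
        params = jnp.stack([inputs.mean(axis=0), inputs.std(axis=0) + 1e-8])
    # Apply the normalization using the fit parameters.
    mean, std = params[0], params[1]
    return (inputs - mean) / std, params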
buffer_transform_controls works exactly the same, except that it is applied to the controls in the dataset. Its signature is

def buffer_transform_controls(
    controls: jax.Array, params: jax.Array | None, fit: bool
) -> tuple[jax.Array, jax.Array | None]:
    pass

transform_state_goal is similar to buffer_transform_states_goals, except that it takes a single state and a single goal as input:
def transform_state_goal(
    state: jax.Array, goal: jax.Array, params: jax.Array | None
) -> jax.Array:
    pass

model_output_to_control should convert NN outputs to controls that can be applied to the dynamical system. Thus, it should apply the inverse transformation of buffer_transform_controls:
def model_output_to_control(
    model_output: jax.Array, params: jax.Array | None
) -> jax.Array:
    pass

where model_output is an array containing the NN output for a single state-goal input pair.
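Continuing the normalization sketch above, the remaining three functions could look like this, assuming transform_state_goal receives the params returned by buffer_transform_states_goals and model_output_to_control receives those returned by buffer_transform_controls:

import jax
import jax.numpy as jnp

def buffer_transform_controls(
    controls: jax.Array, params: jax.Array | None, fit: bool
) -> tuple[jax.Array, jax.Array | None]:
    if fit:
        params = jnp.stack([controls.mean(axis=0), controls.std(axis=0) + 1e-8])
    mean, std = params[0], params[1]
    return (controls - mean) / std, params

def transform_state_goal(
    state: jax.Array, goal: jax.Array, params: jax.Array | None
) -> jax.Array:
    # Single-sample counterpart of buffer_transform_states_goals.
    mean, std = params[0], params[1]
    return (jnp.concatenate([state, goal]) - mean) / std

def model_output_to_control(
    model_output: jax.Array, params: jax.Array | None
) -> jax.Array:
    # Invert the normalization applied by buffer_transform_controls.
    mean, std = params[0], params[1]
    return model_output * std + mean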
To decide whether a state reaches a goal up to a certain tolerance, we need to define a function
def goal_reached(state: jax.Array, goal: jax.Array) -> jax.Array:
    pass

The function takes a state and a goal as input, and should return a bool (or a Boolean JAX array) which is True if state reaches goal, and False otherwise.
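As a sketch, assuming an identity state-to-goal mapping and an illustrative tolerance:

import jax
import jax.numpy as jnp

def goal_reached(state: jax.Array, goal: jax.Array) -> jax.Array:
    # The goal is reached when the state is within a small Euclidean distance of it.
    return jnp.linalg.norm(state - goal) < 1e-2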
All the aforementioned components are packaged into an Experiment object:
experiment = Experiment(
    ocp,
    sample_initial_final_states,
    transform_state_goal,
    model_output_to_control,
    buffer_transform_states_goals,
    buffer_transform_controls,
    goal_reached,
)

To perform dataset generation, policy training and evaluation with this experiment, we need to "register" it in experiments/run.py.
Thus, at the start of this file, add
from my_new_experiment import experiment as new_experiment

EXPERIMENTS = {
    # ...
    "EXPERIMENT_NAME": new_experiment,
}

where EXPERIMENT_NAME is a human-readable name of your choosing.
We now need to provide configurations for all steps in the method:
in experiments/config, create a file EXPERIMENT_NAME.yaml specifying the hyperparameters for dataset generation and the training and evaluation of policies.
Make sure to set the name attribute in the file to EXPERIMENT_NAME, and feel free to use the existing configs as a base to configure your experiment.
See the Hydra documentation for more details about this YAML-based configuration.
Once the experiment is registered and configured, you may run
python -m experiments.run +experiment=$EXPERIMENT_NAME [options]

where [options] may contain generate=true, train=true and/or evaluate=true depending on which step(s) of the method you wish to execute.
You can also override the configurations of the YAML file directly from the command line.
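For example, the following command (using the cartpole experiment and the mlp_2_32 architecture as placeholders) runs dataset generation, training and evaluation in a single invocation; any other key from the YAML config can be overridden with the same key=value syntax:

python -m experiments.run +experiment=cartpole +experiment/policy_model=mlp_2_32 generate=true train=true evaluate=true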
