Docs work for V3 #503

Open
wants to merge 16 commits into master

4 changes: 2 additions & 2 deletions docs/README.md
@@ -1,5 +1,5 @@
# Metaworld documentation
# Meta-World documentation

This directory contains the documentation for Metaworld.
This directory contains the documentation for Meta-World.

For more information about how to contribute to the documentation go to our [CONTRIBUTING.md](https://github.com/Farama-Foundation/Celshast/blob/main/CONTRIBUTING.md)
4 changes: 4 additions & 0 deletions docs/_static/img/metaworld-text.svg
Binary file added docs/_static/img/ml1-1.gif
Binary file added docs/_static/img/ml1.gif
Binary file added docs/_static/img/ml10-1.gif
Binary file added docs/_static/img/ml10.gif
Binary file added docs/_static/img/ml45-1.gif
Binary file added docs/_static/img/ml45.gif
Binary file added docs/_static/img/mt1-1.gif
Binary file added docs/_static/img/mt1.gif
Binary file added docs/_static/img/mt10-1.gif
Binary file added docs/_static/img/mt10.gif
Binary file added docs/_static/ml1.gif
Binary file added docs/_static/ml10.gif
Binary file added docs/_static/ml45.gif
Binary file added docs/_static/mt1.gif
22 changes: 22 additions & 0 deletions docs/benchmark/action_space.md
@@ -0,0 +1,22 @@
---
layout: "contents"
title: Action Space
firstpage:
---

# Action Space

In the Meta-World benchmark, the agent must simultaneously solve multiple tasks, each of which could be defined by its own Markov decision process.
Because current approaches solve this with a single policy/model, the action space must have the same size for all tasks and therefore shares a common structure.

The action space of the Sawyer robot is a ```Box(-1.0, 1.0, (4,), float32)```.
An action represents the Cartesian displacement `dx`, `dy`, and `dz` of the end-effector, and an additional action for gripper control.

For tasks that do not require the gripper, the gripper action can be masked or ignored, e.g., set to a constant value that keeps the fingers permanently closed.

| Num | Action | Control Min | Control Max | Name (in XML file) | Joint | Unit |
|-----|--------|-------------|-------------|---------------------|-------|------|
| 0 | Displacement of the end-effector in x direction (dx) | -1 | 1 | mocap | N/A | position (m) |
| 1 | Displacement of the end-effector in y direction (dy) | -1 | 1 | mocap | N/A | position (m) |
| 2 | Displacement of the end-effector in z direction (dz) | -1 | 1 | mocap | N/A | position (m) |
| 3 | Gripper adjustment (closing/opening) | -1 | 1 | rightclaw, leftclaw | r_close, l_close | position (normalized) |
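
An action is therefore a 4-dimensional array with values in `[-1, 1]`. The following is a minimal sketch (it assumes the Gymnasium-style `gym.make('MetaWorld/reach-v3')` entry point used elsewhere in these docs):

```python
import gymnasium as gym
import numpy as np

import metaworld  # registers the MetaWorld/* environments with Gymnasium

env = gym.make('MetaWorld/reach-v3')
obs, info = env.reset(seed=42)

# dx, dy, dz displacement of the end-effector plus one gripper dimension
action = np.array([0.1, 0.0, -0.05, 1.0], dtype=np.float32)
assert env.action_space.contains(action)

obs, reward, terminated, truncated, info = env.step(action)
```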
112 changes: 112 additions & 0 deletions docs/benchmark/benchmark_descriptions.md
@@ -0,0 +1,112 @@
---
layout: "contents"
title: Benchmark Descriptions
firstpage:
---

# Benchmark Descriptions

The benchmark provides a selection of tasks used to study generalization in reinforcement learning (RL).
Different combinations of tasks provide benchmark scenarios suitable for multi-task RL and meta-RL.
Unlike typical RL benchmarks, the agent's evaluation is strictly split into training and testing phases.

## Task Configuration

Meta-World distinguishes between parametric and non-parametric variations.
Parametric variations concern the configuration of the goal or object position, such as changing the location of the puck in the `push` task.

```
TODO: Add code snippets
```
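
As a rough sketch of what such a snippet might look like (using the classic `MT1` interface that appears in the pre-V3 examples of these docs; exact environment names may differ in V3), parametric variations can be enumerated by iterating over a benchmark's `train_tasks`:

```python
import metaworld

# Each entry in train_tasks encodes one parametric variation
# (a different object/goal configuration) of the same skill.
mt1 = metaworld.MT1('push-v2', seed=42)
env = mt1.train_classes['push-v2']()

for task in mt1.train_tasks[:5]:
    env.set_task(task)       # select one parametric variation
    obs, info = env.reset()  # the puck and goal positions differ per task
```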

Non-parametric variations arise in the settings that contain multiple tasks, where the agent faces challenges such as `push` and `open window` that require a different set of skills.


## Multi-Task Problems

The multi-task setting challenges the agent to learn a predefined set of skills simultaneously.
Below, different levels of difficulty are described.


### Multi-Task (MT1)

In the easiest setting, **MT1**, a single task needs to be learned where the agent must, e.g., *reach*, *push*, or *pick place* a goal object.
There is no testing of generalization involved in this setting.

```{figure} ../_static/mt1.gif
:alt: Multi-Task 1
:width: 500
```

### Multi-Task (MT10)

The **MT10** evaluation uses 10 tasks: *reach*, *push*, *pick and place*, *open door*, *open drawer*, *close drawer*, *press button top-down*, *insert peg side*, *open window*, and *open box*.
The policy should be provided with a one-hot vector indicating the current task.
The positions of objects and goal positions are fixed in all tasks to focus solely on skill acquisition. <!-- TODO: check this -->


```{figure} ../_static/mt10.gif
:alt: Multi-Task 10
:width: 500
```
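
A minimal sketch of how such a task indicator might be constructed by hand (assuming the classic `MT10` interface with `train_classes`; the one-hot helper below is illustrative, not a library API):

```python
import numpy as np

import metaworld

mt10 = metaworld.MT10(seed=42)
env_names = list(mt10.train_classes.keys())  # the 10 task names


def one_hot(task_index: int, num_tasks: int = 10) -> np.ndarray:
    """Hypothetical helper: one-hot task indicator appended to the observation."""
    vec = np.zeros(num_tasks, dtype=np.float32)
    vec[task_index] = 1.0
    return vec


# e.g. the policy input for task 3 would be np.concatenate([obs, one_hot(3)])
```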

### Multi-Task (MT50)

The **MT50** evaluation uses all 50 Meta-World tasks.
This is the most challenging multi-task setting and involves no evaluation on test tasks.
As with **MT10**, the policy is provided with a one-hot vector indicating the current task, and object and goal positions are fixed.

See [Task Descriptions](task_descriptions) for more details.

## Meta-Learning Problems

Meta-RL evaluates the [transfer learning](https://en.wikipedia.org/wiki/Transfer_learning)
capabilities of agents that learn skills from a predefined set of training
tasks, measuring generalization on a held-out set of test tasks.
In other words, this setting benchmarks an algorithm's
ability to adapt to or learn new tasks.

### Meta-RL (ML1)

The simplest meta-RL setting, **ML1**, involves few-shot adaptation to goal
variation within one task. ML1 uses a single Meta-World task, with the
meta-training "tasks" corresponding to 50 random initial object and goal
positions, and meta-testing on 10 held-out positions. We evaluate algorithms
on three individual tasks from Meta-World: *reaching*, *pushing*, and *pick and
place*, where the variation is over reaching position or goal object position.
The goal positions are not provided in the observation, forcing meta-RL
algorithms to adapt to the goal through trial-and-error.

```{figure} ../_static/ml1.gif
:alt: Meta-RL 1
:width: 500
```
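
As a sketch (again assuming the classic `ML1` interface), the meta-training and meta-testing variations are exposed as separate task lists:

```python
import metaworld

ml1 = metaworld.ML1('reach-v2', seed=42)

train_env = ml1.train_classes['reach-v2']()
test_env = ml1.test_classes['reach-v2']()

print(len(ml1.train_tasks))  # 50 meta-training goal/object configurations
print(len(ml1.test_tasks))   # held-out configurations for meta-testing

train_env.set_task(ml1.train_tasks[0])
obs, info = train_env.reset()
```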

### Meta-RL (ML10)

The **ML10** evaluation involves few-shot adaptation to new test tasks with 10
meta-training tasks. We hold out 5 tasks and meta-train policies on 10 tasks.
We randomize object and goal positions and intentionally select training tasks
with structural similarity to the test tasks. Task IDs are not provided as
input, requiring a meta-RL algorithm to identify the tasks from experience.

```{figure} ../_static/ml10.gif
:alt: Meta-RL 10
:width: 500
```

### Meta-RL (ML45)

The most difficult environment setting of Meta-World, **ML45**, challenges the
agent with few-shot adaptation to new test tasks using 45 meta-training tasks.
Similar to ML10, we hold out 5 tasks for testing and meta-train policies on 45
tasks. Object and goal positions are randomized, and training tasks are
selected for structural similarity to test tasks. As with ML10, task IDs are
not provided, requiring the meta-RL algorithm to identify tasks from experience.


```{figure} ../_static/ml45.gif
:alt: Meta-RL 45
:width: 500
```
@@ -1,10 +1,10 @@
---
layout: "contents"
title: Generate data with expert policies
title: Expert Trajectories
firstpage:
---

# Generate data with expert policies
# Expert Trajectories

## Expert Policies
For each individual environment in Meta-World (e.g., reach, basketball, sweep) there are expert policies that solve the task. These policies can be used to generate expert data for imitation learning tasks.
@@ -14,13 +14,12 @@ The below example provides sample code for the reach environment. This code can


```diff
-from metaworld import MT1
-from metaworld.policies.sawyer_reach_v2_policy import SawyerReachV2Policy as p
+import gymnasium as gym
+import metaworld
+from metaworld.policies.sawyer_reach_v3_policy import SawyerReachV3Policy as p

-mt1 = MT1('reach-v2', seed=42)
-env = mt1.train_classes['reach-v2']()
-env.set_task(mt1.train_tasks[0])
+env = gym.make('MetaWorld/reach-v3')

 obs, info = env.reset()

 policy = p()
```
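
A rough sketch of how such a scripted policy might be rolled out to collect expert data, continuing from the new (V3) version of the snippet above (this assumes the scripted policies expose a `get_action(obs)` method):

```python
# Roll out the scripted expert and record (observation, action) pairs
# for imitation learning.
trajectory = []
obs, info = env.reset()

for _ in range(500):
    action = policy.get_action(obs)
    next_obs, reward, terminated, truncated, info = env.step(action)
    trajectory.append((obs, action))
    obs = next_obs
    if terminated or truncated or info.get('success', 0.0) > 0.0:
        break
```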
7 changes: 7 additions & 0 deletions docs/benchmark/resetting.md
@@ -0,0 +1,7 @@
---
layout: "contents"
title: Resetting to a Specific State
firstpage:
---

# Resetting to a Specific State
27 changes: 27 additions & 0 deletions docs/benchmark/reward_functions.md
@@ -0,0 +1,27 @@
---
layout: "contents"
title: Reward Functions
firstpage:
---

# Reward Functions

As with the [action space](action_space) and [state space](state_space), the reward functions share a common structure across tasks.
Meta-World provides well-shaped reward functions for the individual tasks that are solvable by current single-task reinforcement learning approaches.
To ensure comparable learning across the multi-task settings, all task rewards have the same magnitude.

## Options

Meta-World currently implements two types of reward functions that can be selected
by passing the `reward_func_version` keyword argument to `gym.make(...)`.
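
For example (a minimal sketch; it assumes the version is passed as the string `'v1'` and reuses the `reach-v3` environment ID from the other examples):

```python
import gymnasium as gym

import metaworld  # registers the MetaWorld/* environments

# Select the primary (v1) reward function for this task.
env = gym.make('MetaWorld/reach-v3', reward_func_version='v1')
obs, info = env.reset()
```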

### Version 1

Passing `reward_func_version=v1` configures the benchmark with the primary
reward function of Meta-World, which is a version of the `pick-place-wall`
task's reward that has been modified to also work for the other tasks.


### Version 2

TBA
17 changes: 17 additions & 0 deletions docs/benchmark/state_space.md
@@ -0,0 +1,17 @@
---
layout: "contents"
title: State Space
firstpage:
---

# State Space


Like the [action space](action_space), the state space must keep the same structure across tasks so that current approaches can employ a single policy/model.
Meta-World contains tasks that either require manipulation of a single object with a potentially variable goal position (e.g., `reach`, `push`, `pick place`) or two objects with a fixed goal position (e.g., `hammer`, `soccer`, `shelf place`).
To account for this variability, large parts of the observation space are kept as placeholders, e.g., for the second object, if only one object is present.

The observation array consists of the end-effector's 3D Cartesian position, together with either the position of a single object and its goal coordinates, or the positions of the first and second object.
This always results in a 9D state vector.

TODO: Provide table
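
In the meantime, the observation layout can be inspected directly from the environment (a minimal sketch reusing the `reach-v3` ID from the other examples):

```python
import gymnasium as gym

import metaworld  # registers the MetaWorld/* environments

env = gym.make('MetaWorld/reach-v3')
obs, info = env.reset()

print(env.observation_space)  # bounds and dimensionality of the state vector
print(obs.shape)              # shape of a single observation
```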