Deep offline reinforcement learning (RL) algorithms are known to be highly sensitive to hyperparameters and small implementation details, as can be seen by skimming through various papers and comparing their results. Surprisingly, even different deep neural network (DNN) libraries can produce different results despite identical code logic.
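One concrete source of such library-level discrepancies is default weight initialization: Flax's `nn.Dense` uses LeCun-normal kernels and zero biases, whereas PyTorch's `nn.Linear` draws weights and biases from a uniform distribution scaled by the fan-in. The sketch below is illustrative rather than code from JAX-CoRL (the `Critic` module and `pytorch_like_init` helper are made up for this example); it shows how a JAX port might set initializers explicitly to match a PyTorch reference instead of relying on framework defaults.

```python
import jax
import jax.numpy as jnp
import flax.linen as nn


def pytorch_like_init():
    # U(-1/sqrt(fan_in), 1/sqrt(fan_in)), matching torch.nn.Linear's default
    # kernel initialization; Flax's own default is LeCun-normal.
    return jax.nn.initializers.variance_scaling(
        scale=1.0 / 3.0, mode="fan_in", distribution="uniform"
    )


class Critic(nn.Module):
    """Illustrative Q-network, not taken from JAX-CoRL."""
    hidden_dim: int = 256

    @nn.compact
    def __call__(self, obs, action):
        x = jnp.concatenate([obs, action], axis=-1)
        x = jax.nn.relu(nn.Dense(self.hidden_dim, kernel_init=pytorch_like_init())(x))
        x = jax.nn.relu(nn.Dense(self.hidden_dim, kernel_init=pytorch_like_init())(x))
        # Note: biases keep Flax's zero default here; PyTorch's bias init also differs.
        return nn.Dense(1, kernel_init=pytorch_like_init())(x)
```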
In such a context, ensuring consistent performance across different codebases is challenging. Essentially, there is no single, unified performance value for an algorithm like Conservative Q-Learning (CQL). Instead, what exists is the performance of CQL with specific hyperparameters implemented in a particular codebase.
Given this situation, our approach was as follows:
- We chose a single reliable existing codebase for each algorithm.
- We transferred that codebase into a single-file JAX implementation with the same hyperparameters (a schematic configuration example is sketched below).
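Concretely, "the same hyperparameters" means each single-file implementation pins the reference codebase's settings in one explicit configuration, so every score reported below is tied to one concrete setting rather than to the algorithm in the abstract. A minimal sketch of this idea (the `CQLConfig` name and all values are illustrative placeholders, not JAX-CoRL's actual defaults):

```python
from dataclasses import dataclass


@dataclass
class CQLConfig:
    # Placeholder hyperparameters for illustration only.
    env_name: str = "halfcheetah-medium-v2"  # D4RL dataset id
    actor_lr: float = 3e-4
    critic_lr: float = 3e-4
    discount: float = 0.99
    tau: float = 0.005            # target-network soft-update rate
    cql_alpha: float = 5.0        # conservatism penalty weight
    batch_size: int = 256
    max_steps: int = 1_000_000    # "1M steps" in the notes below
    eval_episodes: int = 5        # averaged to produce one table entry
```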
For each algorithm, we report:
- The codebase we referred to (also listed in the README)
- Published papers using the codebase for baseline experiments (if available)
- The performance reported in those papers
Although we could run the reference codebases ourselves, results from published papers provide a more reliable point of comparison for those who wish to use JAX-CoRL as a baseline in their own research. For detailed performance reports of our implementations, please refer to the README.
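The scores in the tables below follow the standard D4RL normalized-score convention (0 corresponds to a random policy, 100 to an expert), averaged over a small number of evaluation episodes as stated in the notes. A rough sketch of how one such entry is computed (our exact evaluation loop may differ; `policy_fn` stands in for the trained policy):

```python
import gym
import d4rl  # noqa: F401  (registers the D4RL environments)
import numpy as np


def normalized_score(env_name: str, policy_fn, num_episodes: int = 5) -> float:
    env = gym.make(env_name)  # e.g. "halfcheetah-medium-v2"
    returns = []
    for _ in range(num_episodes):
        obs, done, ep_return = env.reset(), False, 0.0
        while not done:
            obs, reward, done, _ = env.step(policy_fn(obs))
            ep_return += reward
        returns.append(ep_return)
    # d4rl rescales the raw return so that 0 = random policy, 100 = expert.
    return float(env.get_normalized_score(np.mean(returns)) * 100.0)
```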
ver | halfcheetah-m | halfcheetah-me | hopper-m | hopper-me | walker2d-m | walker2d-me |
---|---|---|---|---|---|---|
Reference | 49 | 72 | 58 | 30 | 75 | 86 |
Ours | 42 | 77 | 51 | 52 | 68 | 91 |

ver | halfcheetah-m | halfcheetah-me | hopper-m | hopper-me | walker2d-m | walker2d-me |
---|---|---|---|---|---|---|
Reference | 53 | 59 | 78 | 86 | 80 | 100 |
Ours | 49 | 54 | 78 | 90 | 80 | 110 |

ver | halfcheetah-m | halfcheetah-me | hopper-m | hopper-me | walker2d-m | walker2d-me | Note |
---|---|---|---|---|---|---|---|
Reference | 47.4 | 89.6 | 63.9 | 64.2 | 84.2 | 108.9 | 1M steps, average over 10 episodes |
Ours | 43.3 | 92.9 | 52.2 | 53.4 | 75.3 | 109.2 | 1M steps, average over 5 episodes |

ver | halfcheetah-m | halfcheetah-me | hopper-m | hopper-me | walker2d-m | walker2d-me | Note |
---|---|---|---|---|---|---|---|
Reference | 48.1 | 93.7 | 59.1 | 98.1 | 84.3 | 110.5 | 1M steps, average over 10 episodes |
Ours | 48.1 | 93.0 | 46.5 | 105.5 | 72.7 | 109.2 | 1M steps, average over 5 episodes |
- Codebase: min-decision-transformer
- Paper using the codebase: None
- Results

ver | halfcheetah-m | halfcheetah-me | hopper-m | hopper-me | walker2d-m | walker2d-me |
---|---|---|---|---|---|---|
Reference | - | - | - | - | - | - |
Ours | - | - | - | - | - | - |
- Codebase: TD7 (original)
- Paper using the codebase: None
- Results

ver | halfcheetah-m | halfcheetah-me | hopper-m | hopper-me | walker2d-m | walker2d-me |
---|---|---|---|---|---|---|
Reference | - | - | - | - | - | - |
Ours | - | - | - | - | - | - |
- [1] Tarasov, Denis, et al. "CORL: Research-oriented deep offline reinforcement learning library." Advances in Neural Information Processing Systems 36 (2024).
- [2] Nakamoto, Mitsuhiko, et al. "Cal-QL: Calibrated offline RL pre-training for efficient online fine-tuning." Advances in Neural Information Processing Systems 36 (2024).
- [3] Fujimoto, Scott, et al. "For SALE: State-action representation learning for deep reinforcement learning." Advances in Neural Information Processing Systems 36 (2024).