Soichiro Nishimori edited this page May 27, 2024 · 22 revisions

Validity of the reported results and details of hyperparameters

As is well known, deep offline RL algorithms are highly sensitive to hyperparameters and to small implementation details. You can feel this by skimming a few papers and comparing their reported results. Surprisingly, even different DNN libraries are known to produce different results with identical code logic [1].

In such a situation, it is difficult to guarantee the same performance across different codebases. In other words, there is no such thing as "the performance of CQL" as a single unified value. What exists, rather, is the performance of CQL with xxx hyperparameters as implemented in xxx codebase.

Considering this situation, we:

  • Chose a **single reliable existing codebase** for each algorithm.
  • Tried to transfer that codebase into single-file JAX code with the same hyperparameters.
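To illustrate the "same hyperparameters" constraint, a single-file port typically pins every hyperparameter in one config object so that values can be checked line by line against the reference codebase. The sketch below is hypothetical; the names and values are illustrative, not taken from any specific referenced codebase:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlgoConfig:
    """Hypothetical config for a single-file offline RL implementation.

    In a real port, each value is copied verbatim from the reference
    codebase (e.g. JaxCQL for CQL) so results stay comparable.
    """
    actor_lr: float = 3e-4
    critic_lr: float = 3e-4
    discount: float = 0.99
    tau: float = 0.005        # target-network soft-update rate
    batch_size: int = 256
    max_steps: int = 1_000_000


config = AlgoConfig()
print(config.discount)
```

Freezing the dataclass prevents a hyperparameter from silently drifting from the reference value during experimentation.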

Here, for each algorithm, we report:

  • The codebase we referred to (also listed in the README)
  • A published paper using that codebase for baseline experiments (if one exists)
  • The performance reported by that paper (if none exists, an accepted report using a different codebase)

We could run each referenced codebase ourselves, but that takes time. Furthermore, for those who would like to use jax-corl as a baseline in their own research, results from published papers are a more reliable certification. For details of our own performance, please refer to the README.
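The numbers in the tables below are D4RL normalized scores, where 0 corresponds to a random policy and 100 to an expert. A minimal sketch of that normalization, assuming the standard D4RL convention (d4rl itself exposes this as `env.get_normalized_score`; the reference returns below are illustrative, not the actual D4RL constants):

```python
def d4rl_normalized_score(ret: float, random_ret: float, expert_ret: float) -> float:
    """Scale a raw episode return so that random = 0 and expert = 100."""
    return 100.0 * (ret - random_ret) / (expert_ret - random_ret)


# Illustrative values only: a return halfway between random and expert
# normalizes to 50.
print(d4rl_normalized_score(2500.0, 0.0, 5000.0))  # → 50.0
```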

AWAC

  • Codebase: jaxrl
  • Paper using the codebase: Cal-QL [2]
  • Results: from Table 5 (means only)

| ver | halfcheetah-m | halfcheetah-me | hopper-m | hopper-me | walker2d-m | walker2d-me |
|-----|---------------|----------------|----------|-----------|------------|-------------|
| Reference | 49 | 72 | 58 | 30 | 75 | 86 |
| Ours | 42 | 77 | 51 | 52 | 68 | 91 |

CQL

  • Codebase: JaxCQL
  • Paper using the codebase: Cal-QL [2]
  • Results: from Table 5 (means only)

| ver | halfcheetah-m | halfcheetah-me | hopper-m | hopper-me | walker2d-m | walker2d-me |
|-----|---------------|----------------|----------|-----------|------------|-------------|
| Reference | 53 | 59 | 78 | 86 | 80 | 100 |
| Ours | 49 | 54 | 78 | 90 | 80 | 110 |

IQL

  • Codebase: Original
  • Paper using the codebase: TD7 [3]
  • Results: from Table 2 (means only)

| ver | halfcheetah-m | halfcheetah-me | hopper-m | hopper-me | walker2d-m | walker2d-me | Note |
|-----|---------------|----------------|----------|-----------|------------|-------------|------|
| Reference | 47.4 | 89.6 | 63.9 | 64.2 | 84.2 | 108.9 | 1M steps, average over 10 episodes |
| Ours | 43.9 | 89.1 | 46.5 | 52.7 | 77.9 | 109.1 | 1M steps, average over 5 episodes |

TD3+BC

  • Codebase: Original
  • Paper using the codebase: TD7 [3]
  • Results: from Table 2 (means only)

| ver | halfcheetah-m | halfcheetah-me | hopper-m | hopper-me | walker2d-m | walker2d-me | Note |
|-----|---------------|----------------|----------|-----------|------------|-------------|------|
| Reference | 48.1 | 93.7 | 59.1 | 98.1 | 84.3 | 110.5 | 1M steps, average over 10 episodes |
| Ours | 48.1 | 93.0 | 46.5 | 105.5 | 72.7 | 109.2 | 1M steps, average over 5 episodes |

DT

| ver | halfcheetah-m | halfcheetah-me | hopper-m | hopper-me | walker2d-m | walker2d-me |
|-----|---------------|----------------|----------|-----------|------------|-------------|
| - | - | - | - | - | - | - |
| - | - | - | - | - | - | - |

References

  • [1] Tarasov, Denis, et al. "CORL: Research-Oriented Deep Offline Reinforcement Learning Library." Advances in Neural Information Processing Systems 36 (2024).
  • [2] Nakamoto, Mitsuhiko, et al. "Cal-QL: Calibrated Offline RL Pre-Training for Efficient Online Fine-Tuning." Advances in Neural Information Processing Systems 36 (2024).
  • [3] Fujimoto, Scott, et al. "For SALE: State-Action Representation Learning for Deep Reinforcement Learning." Advances in Neural Information Processing Systems 36 (2024).