Home

Validity of the reported results and the details on hyperparameters.

As well known, deep offline RL algorithms are highly sensitive to hyperparameters and small details of implementations. You feel that by skimming through some papers and comparing results. Surprisingly it is known that even different DNN libraries produce different results with the code logics identical [1].

In such situation, it is difficult to ensure the same performance between different codebases. In other words, there is no such thing as the performance of CQL as a unified value. What exists is, or rather, the performance of CQL with xxx hyperparameters written in xxx.

Considering this situation,

Chose single reliable existing codebase for each algorithm.
Tried to transfer that codebase into single-file jax code with the same hyperparameters.

Here for each algorithm, we report

The codebase we referred to (Also in README)
Published paper using the codebase for baseline experiment (If exists)
The performance report by the paper, (If there is not, accepted report with different codebase.)

We can run the codebase we refer by ourselves, but it takes time. Furthermore, for those who would like to use jax-corl as baselines in your own research, results from published papers would be more reliable certification to use. For details of the performance of ours, please refer to README.

AWAC

Codebase: jaxrl
Paper using the codebase: Cal-QL [2]
Results: From Table 5 (Only mean)

ver	halfcheetah-m	halfcheetah-me	hopper-m	hopper-me	walker2d-m	walker2d-me
Reference	49	72	58	30	75	86
Ours	42	77	51	52	68	91

CQL

Codebase: JaxCQL
Paper using the codebase: Cal-QL [2]
Results: From Table 5 (Only mean)

ver	halfcheetah-m	halfcheetah-me	hopper-m	hopper-me	walker2d-m	walker2d-me
Reference	53	59	78	86	80	100
Ours	49	54	78	90	80	110

IQL

Codebase: Original
Paper using the codebase: TD7 [3]
Results: From Table 2 (Only mean)

ver	halfcheetah-m	halfcheetah-me	hopper-m	hopper-me	walker2d-m	walker2d-me	Note
Reference	47.4	89.6	63.9	64.2	84.2	108.9	1M steps, average over 10 episodes
Ours	43.9	89.1	46.5	52.7	77.9	109.1	1M steps, average over 5 episodes

TD3+BC

Codebase: Original
Paper using the codebase: TD7 [3]
Results: From Table 2 (Only mean)

ver	halfcheetah-m	halfcheetah-me	hopper-m	hopper-me	walker2d-m	walker2d-me	Note
Reference	48.1	93.7	59.1	98.1	84.3	110.5	1M steps, average over 10 episodes
Ours	48.1	93.0	46.5	105.5	72.7	109.2	1M steps, average over 5 episodes

DT

Codebase: min-decision-trainsformer
Paper using the codebase: None
Results

ver	halfcheetah-m	halfcheetah-me	hopper-m	hopper-me	walker2d-m	walker2d-me
-	-	-	-	-	-	-
-	-	-	-	-	-	-

Reference

[1] Tarasov, Denis, et al. "CORL: Research-oriented deep offline reinforcement learning library." Advances in Neural Information Processing Systems 36 (2024).
[2] Nakamoto, Mitsuhiko, et al. "Cal-ql: Calibrated offline rl pre-training for efficient online fine-tuning." Advances in Neural Information Processing Systems 36 (2024).
[3] Fujimoto, Scott, et al. "For sale: State-action representation learning for deep reinforcement learning." Advances in Neural Information Processing Systems 36 (2024).

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Home

Validity of the reported results and the details on hyperparameters.

AWAC

CQL

IQL

TD3+BC

DT

Reference

Clone this wiki locally