
Validity of Reported Results and Hyperparameter Details

Deep offline reinforcement learning (RL) algorithms are known to be highly sensitive to hyperparameters and small implementation details. This can be observed simply by skimming through various papers and comparing the reported results. Surprisingly, even different deep neural network (DNN) libraries can produce different results despite identical code logic.

In such a context, ensuring consistent performance across different codebases is challenging. Essentially, there is no single, unified performance value for an algorithm like Conservative Q-Learning (CQL). Instead, what exists is the performance of CQL with specific hyperparameters implemented in a particular codebase.

Given this situation, our approach was as follows:

  • We chose a single reliable existing codebase for each algorithm.
  • We ported that codebase to a single-file JAX implementation with the same hyperparameters (a minimal illustrative sketch follows below).
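
As a concrete (but hypothetical) illustration of what "the same hyperparameters" means in practice, here is a minimal sketch of how reference settings can be pinned in a single-file JAX implementation. The dataclass name, fields, and values below roughly follow common TD3+BC defaults and are placeholders, not the actual JAX-CoRL configuration.

```python
from dataclasses import dataclass

@dataclass
class TD3BCConfig:
    # Hypothetical example: values copied verbatim from the reference
    # codebase so that the JAX port matches it exactly.
    actor_lr: float = 3e-4      # actor learning rate
    critic_lr: float = 3e-4     # critic learning rate
    batch_size: int = 256       # mini-batch size per gradient step
    discount: float = 0.99      # discount factor
    tau: float = 0.005          # target-network soft-update rate
    policy_noise: float = 0.2   # target policy smoothing noise
    noise_clip: float = 0.5     # clip range for the smoothing noise
    policy_freq: int = 2        # delayed policy update frequency
    alpha: float = 2.5          # behavior-cloning regularization weight

config = TD3BCConfig()  # keep defaults identical to the reference run
```

Keeping all such values in one place makes it easy to diff the JAX port against the reference codebase.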

For each algorithm, we report:

  • The codebase we referred to (also listed in the README)
  • Published papers using the codebase for baseline experiments (if available)
  • The performance reported in that paper

Although we could run the referenced codebases ourselves, results from published papers provide a more reliable certification for those who wish to use JAX-CoRL as a baseline in their own research. For detailed performance reports of our implementations, please refer to the README.
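
The scores in the tables below are D4RL normalized scores (0 = random policy, 100 = expert policy), with "-m" and "-me" denoting the medium and medium-expert datasets. As a rough sketch of how such a score is typically computed, the snippet below rolls out a policy and converts the average return with D4RL's get_normalized_score; the environment name, episode count, and policy_fn are placeholder assumptions, not part of the JAX-CoRL API.

```python
import gym
import numpy as np
import d4rl  # noqa: F401  (registers the D4RL offline environments)

def evaluate(policy_fn, env_name="halfcheetah-medium-v2", n_episodes=5):
    """Roll out a policy and return its D4RL normalized score (x100)."""
    env = gym.make(env_name)
    returns = []
    for _ in range(n_episodes):
        obs, done, ep_return = env.reset(), False, 0.0
        while not done:
            action = policy_fn(obs)                  # policy under evaluation
            obs, reward, done, _ = env.step(action)  # old gym API used by D4RL
            ep_return += reward
        returns.append(ep_return)
    # D4RL convention: 0 = random policy, 100 = expert policy
    return 100.0 * env.get_normalized_score(np.mean(returns))
```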

AWAC

  • Codebase: jaxrl
  • Paper using the codebase: Cal-QL [2]
  • Results: from Table 5 of [2] (mean only)
| Version | halfcheetah-m | halfcheetah-me | hopper-m | hopper-me | walker2d-m | walker2d-me |
|---|---|---|---|---|---|---|
| Reference | 49 | 72 | 58 | 30 | 75 | 86 |
| Ours | 42 | 77 | 51 | 52 | 68 | 91 |

CQL

  • Codebase: JaxCQL
  • Paper using the codebase: Cal-QL [2]
  • Results: from Table 5 of [2] (mean only)
| Version | halfcheetah-m | halfcheetah-me | hopper-m | hopper-me | walker2d-m | walker2d-me |
|---|---|---|---|---|---|---|
| Reference | 53 | 59 | 78 | 86 | 80 | 100 |
| Ours | 49 | 54 | 78 | 90 | 80 | 110 |

IQL

  • Codebase: Original (the authors' implementation)
  • Paper using the codebase: TD7 [3]
  • Results: from Table 2 of [3] (mean only)
| Version | halfcheetah-m | halfcheetah-me | hopper-m | hopper-me | walker2d-m | walker2d-me | Note |
|---|---|---|---|---|---|---|---|
| Reference | 47.4 | 89.6 | 63.9 | 64.2 | 84.2 | 108.9 | 1M steps, average over 10 episodes |
| Ours | 43.3 | 92.9 | 52.2 | 53.4 | 75.3 | 109.2 | 1M steps, average over 5 episodes |

TD3+BC

  • Codebase: Original (the authors' implementation)
  • Paper using the codebase: TD7 [3]
  • Results: from Table 2 of [3] (mean only)
| Version | halfcheetah-m | halfcheetah-me | hopper-m | hopper-me | walker2d-m | walker2d-me | Note |
|---|---|---|---|---|---|---|---|
| Reference | 48.1 | 93.7 | 59.1 | 98.1 | 84.3 | 110.5 | 1M steps, average over 10 episodes |
| Ours | 48.1 | 93.0 | 46.5 | 105.5 | 72.7 | 109.2 | 1M steps, average over 5 episodes |

DT (Decision Transformer)

| Version | halfcheetah-m | halfcheetah-me | hopper-m | hopper-me | walker2d-m | walker2d-me |
|---|---|---|---|---|---|---|
| - | - | - | - | - | - | - |
| - | - | - | - | - | - | - |

References

  • [1] Tarasov, Denis, et al. "CORL: Research-oriented deep offline reinforcement learning library." Advances in Neural Information Processing Systems 36 (2024).
  • [2] Nakamoto, Mitsuhiko, et al. "Cal-QL: Calibrated offline RL pre-training for efficient online fine-tuning." Advances in Neural Information Processing Systems 36 (2024).
  • [3] Fujimoto, Scott, et al. "For SALE: State-action representation learning for deep reinforcement learning." Advances in Neural Information Processing Systems 36 (2024).