RLPD algorithm #1727

Open
wants to merge 12 commits into base: pytorch
Conversation

@runjerry (Contributor) commented Feb 3, 2025

Extend SAC to RLPD. Tested on 5 DM control tasks with 4 seeds each to confirm its performance improvement over SAC.

@runjerry mentioned this pull request Feb 3, 2025
critics = critics.reshape(-1, self._num_critic_replicas,
                          *self._reward_spec.shape,
                          *remaining_shape)
if self._act_type == ActionType.Discrete:
Contributor:
According to __init__(), only ActionType.Continuous is supported

Contributor Author:

Removed.

Comment on lines 165 to 168
if self.has_multidim_reward():
    sign = self.reward_weights.sign()
    critics = (critics * sign).mean(dim=1) * sign
else:
Contributor:

This branch is unnecessary for 'mean'

Contributor:

Also curious about the motivation for the mean branch? Any benefits?

Contributor Author:

> This branch is unnecessary for 'mean'

I feel that it is needed in order to get the mean critic values across all critics for actor training?

Contributor Author:

> Also curious about the motivation for the mean branch? Any benefits?

This is the default setting of RLPD to get a consensus q_value for actor training. Given that RLPD might use more than two critics, taking a min as in SAC would be too conservative. Moreover, in my previous experience, as long as the critics are trained with conservative target critic values, there is no need to be conservative in actor training.
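For illustration, a minimal sketch of the two consensus choices with made-up shapes (the conservative min is still used when forming target critic values; only the actor-side aggregation differs):

import torch

critics = torch.randn(8, 10)         # [batch, num_critic_replicas], made-up shapes

q_sac = critics.min(dim=1).values    # SAC-style consensus: pessimistic min over replicas
q_rlpd = critics.mean(dim=1)         # RLPD default for actor training: mean over replicas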

Contributor:

I see. Thanks!

Contributor:

> This branch is unnecessary for 'mean'

> I feel that it is needed in order to get the mean critic values across all critics for actor training?

`(critics * sign).mean(dim=1) * sign` is the same as `critics.mean(dim=1)`
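A quick check of this equivalence, as a minimal sketch with made-up shapes (the cancellation holds as long as no reward weight is exactly zero):

import torch

critics = torch.randn(8, 5, 3)        # [batch, num_critic_replicas, reward_dim]
sign = torch.tensor([1., -1., 1.])    # reward_weights.sign(), all non-zero

# sign broadcasts over the reward dim, mean reduces over the replica dim,
# so the two sign factors cancel elementwise (sign * sign == 1).
assert torch.allclose((critics * sign).mean(dim=1) * sign, critics.mean(dim=1))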

Contributor Author:

Oh I see, thanks, updated.

Contributor Author:

I had to revert and keep this branch, since self.reward_weights is None by default for scalar rewards.

Contributor:

Why not change the whole branch to the following?

elif replica_consensus == 'mean':
    critics = critics.mean(dim=1)

Contributor Author:

Oh yes, that should work well.

name="RlpdAlgorithm"):
# **kwargs):
"""
Refer to SacAlgorithm for more details for kwargs
Contributor:

for other arguments

Contributor Author:

Updated.

Refer to SacAlgorithm for more details for kwargs

Args:
    name (str): The name of this algorithm.
Contributor:

can be removed.

Contributor Author:

Removed.

action,
critics_state,
replica_consensus='mean',
sample_subset=False,
@Haichao-Zhang (Contributor) commented Feb 4, 2025:

Can add some comments to the sample_subset argument, and the reason for using True for target critics and False for critics. It seems that using False for critics is to ensure that all the critics have gradients.
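For context, a minimal sketch of what such a flag typically controls; the function name and shapes are hypothetical, not the exact code in this PR:

import torch

def select_replicas(critics, sample_subset, num_critic_targets):
    # critics: [batch, num_critic_replicas, ...]
    # sample_subset=True (target critics): keep a random subset of replicas,
    # which is then reduced conservatively (e.g. a min) to form the TD target.
    # sample_subset=False (critics being trained): keep all replicas so that
    # every critic receives gradients from the critic loss.
    num_replicas = critics.shape[1]
    if sample_subset and num_critic_targets < num_replicas:
        idx = torch.randperm(num_replicas, device=critics.device)[:num_critic_targets]
        critics = critics[:, idx]
    return critics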

Contributor Author:

Updated.


Args:
    name (str): The name of this algorithm.
    num_critic_targets (int): Number of sampled subset of target critics
Contributor:

num_critic_targets: maybe change its name to reflect the sampled aspect? Currently it is very close in meaning to num_critic_replicas.

Contributor:

Should ensure that num_critic_targets <= num_critic_replicas.
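For example, a simple check at construction time (a sketch; the helper name and exact placement in __init__ are hypothetical):

def _check_ensemble_sizes(num_critic_targets: int, num_critic_replicas: int):
    # Sampling a subset larger than the ensemble is not meaningful.
    assert num_critic_targets <= num_critic_replicas, (
        f"num_critic_targets ({num_critic_targets}) must not exceed "
        f"num_critic_replicas ({num_critic_replicas})")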

Contributor Author:

Updated.

critics = critics.permute(*order)

if sample_subset:
    critics = critics[:,
Contributor:

When self._num_critic_targets and num_critic_replicas are equal, the `randperm` is not necessary and can be removed to save computation.
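A minimal sketch of the suggested guard, written as a fragment in the style of the surrounding method (attribute names follow the thread; placement is hypothetical):

if sample_subset and self._num_critic_targets < self._num_critic_replicas:
    # Only shuffle when a strict subset is actually sampled; when the sizes
    # are equal, skip the randperm and keep all replicas.
    idx = torch.randperm(self._num_critic_replicas, device=critics.device)
    critics = critics[:, idx[:self._num_critic_targets]]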

Contributor Author:

Good point, updated.

Haichao-Zhang previously approved these changes Feb 4, 2025

@Haichao-Zhang (Contributor) left a comment:

LGTM

checkpoint=None,
debug_summaries=False,
name="RlpdAlgorithm"):
# **kwargs):
Contributor:

Can this line be removed?

Contributor Author:

Removed.

* update actor and critics separately, each with own utd

* add an option to use bootstrapped critics
@runjerry (Contributor Author) commented:
Pushed an updated version that works consistently better than the previous version. Please take a look.
