Issue with 'rotate' and 'push' tasks in training taskD_D dataset #113
Description
Hi,
I tried to evaluate the training dataset of taskD_D and calvin_debug_dataset by performing rollouts of the episodes. To do this, I adapted the script evaluate_policy_singlestep.py to use episode["language"] as the goal, and to apply episode["action"] to update the simulator — following a similar approach to this pull request.
I downloaded the dataset in September 2024 and used the annotations located in:
/lang_annotations/auto_lang_ann.npy
Problem Summary
This experiment is very similar to issue #32, and I observed the same symptoms:
very low success rates for rotate and push tasks when rolling out the dataset in the simulator.
After deeper investigation, I found that this is likely caused by a mismatch between the dataset labels and the updated task success conditions introduced in the following commits:
Details on the Mismatch
rotate task
The new task definition requires the object to be in contact with a surface after rotation, but the old dataset contains many successful rotations without final surface contact.
Relevant condition (in calvin_env/envs/tasks.py, line ~78):
end_contacts = set(c[2] for c in obj_end_info["contacts"])
robot_uid = {start_info["robot_info"]["uid"]}
if len(end_contacts - robot_uid) == 0:
    return False
push task
The new logic assumes that the object must remain in contact with the same surface as before, which is not always the case in the old dataset (e.g., object ends up in contact with a drawer rather than the table).
Condition (around line 109):
robot_uid = start_info["robot_info"]["uid"]
start_contacts = set((c[2], c[4]) for c in obj_start_info["contacts"] if c[2] != robot_uid)
end_contacts = set((c[2], c[4]) for c in obj_end_info["contacts"] if c[2] != robot_uid)
surface_contact = len(start_contacts) > 0 and len(end_contacts) > 0 and start_contacts <= end_contacts
Example of Failure: push_pink_block_right
In the debug_dataset, the push_pink_block_right episode shows this clearly:
- The pink block starts in contact with the table,
- but ends in contact with the drawer.
The current task oracle labels this as a failure, even though the push was visually successful.
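To make the failure mode concrete, here is a minimal sketch that reproduces the subset check from the push condition quoted above on toy data. It assumes the contact tuples follow the layout of PyBullet's getContactPoints (index 2 is the other body's unique id, index 4 is the other body's link index). The uids, link indices, and the helper names make_contact and surface_contact are hypothetical and not taken from calvin_env.

```python
# Hypothetical ids, chosen only for illustration.
ROBOT_UID = 0
TABLE = (1, -1)   # (body uid, link index) standing in for the table surface
DRAWER = (1, 3)   # (body uid, link index) standing in for the drawer


def make_contact(other_uid, other_link):
    # Pad the tuple so that c[2] is the other body's uid and c[4] its link index.
    # The remaining fields are placeholders and are ignored by the check.
    return (0, 99, other_uid, -1, other_link)


def surface_contact(start_raw, end_raw, robot_uid):
    # Same logic as the condition quoted above: drop robot contacts, then
    # require every starting contact to still be present at the end.
    start = {(c[2], c[4]) for c in start_raw if c[2] != robot_uid}
    end = {(c[2], c[4]) for c in end_raw if c[2] != robot_uid}
    return len(start) > 0 and len(end) > 0 and start <= end


start = [make_contact(*TABLE)]
end_on_table = [make_contact(*TABLE), make_contact(ROBOT_UID, 7)]
end_on_drawer = [make_contact(*DRAWER)]

print(surface_contact(start, end_on_table, ROBOT_UID))   # True
print(surface_contact(start, end_on_drawer, ROBOT_UID))  # False
```

The second case mirrors the push_pink_block_right episode: the end contacts are non-empty, but the starting contact is no longer among them, so the subset test fails regardless of how far the block moved.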
Here’s a GIF illustrating the issue:
Evaluation Results
I evaluated the full taskD_D training set, using 5 trials per episode (to reduce the impact of environment stochasticity).
The results clearly show low success rates for rotate and push tasks, indicating the issue is systematic.
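For reference, the per-task aggregation over repeated trials can be sketched as below. This is not the exact script I used; the function name success_rate_per_task and the outcome list are hypothetical, and the numbers are made up purely to show the shape of the computation, not actual results.

```python
from collections import defaultdict


def success_rate_per_task(results):
    """Aggregate (task_name, success) pairs, one pair per trial, into rates."""
    counts = defaultdict(lambda: [0, 0])  # task -> [successes, trials]
    for task, success in results:
        counts[task][0] += int(success)
        counts[task][1] += 1
    return {task: s / n for task, (s, n) in counts.items()}


# Made-up outcomes for illustration only: 5 trials for each of two tasks.
trials = (
    [("push_pink_block_right", False)] * 4
    + [("push_pink_block_right", True)]
    + [("open_drawer", True)] * 5
)

rates = success_rate_per_task(trials)
for task, rate in sorted(rates.items()):
    print(f"{task}: {rate:.2f}")
```

Grouping by task name rather than by episode keeps the stochasticity of individual trials from dominating the comparison between task families.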
Conclusion
It seems this issue was previously identified in issue #32, but the dataset was not re-labeled to match the updated task success criteria.
Would it be possible to:
- Clarify whether a re-labeling effort is planned or in progress, or whether using auto_lang_ann is not the right thing to do?
- Or clarify whether we are supposed to use the calvin_env submodule version from before these commits?
Thank you!

