Issue with 'rotate' and 'push' tasks in training taskD_D dataset

Hi,

I tried to evaluate the **training dataset of taskD\_D** and **calvin\_debug\_dataset** by performing rollouts of the episodes. To do this, I adapted the script [`evaluate_policy_singlestep.py`](https://github.com/mees/calvin/blob/main/calvin_models/calvin_agent/evaluation/evaluate_policy_singlestep.py) to use `episode["language"]` as the goal and to apply `episode["action"]` to update the simulator, following a similar approach to [this pull request](https://github.com/mees/calvin/pull/33#commits-pushed-93e85b5).

I downloaded the dataset in **September 2024** and used the annotations located in:

```
/lang_annotations/auto_lang_ann.npy
```

---

### Problem Summary

This experiment is very similar to [issue #32](https://github.com/mees/calvin/issues/32), and I observed the same symptoms:
**very low success rates for `rotate` and `push` tasks** when rolling out the dataset in the simulator.

After deeper investigation, I found that this is likely caused by a **mismatch between the dataset labels and the updated task success conditions** introduced in the following commits:

* [`141fc1173b32ffd1d5a538c85640f53341fdf657`](https://github.com/mees/calvin/commit/141fc1173b32ffd1d5a538c85640f53341fdf657)
* [`37c8abe860b6cfbbd94dd9ad64b4245b1d402d60`](https://github.com/mees/calvin/commit/37c8abe860b6cfbbd94dd9ad64b4245b1d402d60)

---

### Details on the Mismatch

#### `rotate` task

The updated task definition requires the object to end the rotation **in contact with something other than the robot** (i.e. a surface), but the old dataset contains many rotations annotated as successful in which the object has no surface contact at the end.

Relevant condition (in `calvin_env/envs/tasks.py`, line \~78):

```python
end_contacts = set(c[2] for c in obj_end_info["contacts"])
robot_uid = {start_info["robot_info"]["uid"]}
if len(end_contacts - robot_uid) == 0:
    return False
```

#### `push` task

The new logic requires every non-robot contact the object had at the start to still be present at the end, i.e. the object must remain on the **same surface** as before. This is not always the case in the old dataset (e.g., the object ends up in contact with a drawer rather than the table).

Condition (around line 109):

```python
robot_uid = start_info["robot_info"]["uid"]
start_contacts = set((c[2], c[4]) for c in obj_start_info["contacts"] if c[2] != robot_uid)
end_contacts = set((c[2], c[4]) for c in obj_end_info["contacts"] if c[2] != robot_uid)
surface_contact = len(start_contacts) > 0 and len(end_contacts) > 0 and start_contacts <= end_contacts
```
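The subset test `start_contacts <= end_contacts` is what rejects a table-to-drawer push. Below is a small sketch on toy data, again assuming PyBullet-style contact tuples (index 2 is the other body's id, index 4 its link index). The function name `surface_contact_ok` and the body and link ids are hypothetical.

```python
def surface_contact_ok(start_raw, end_raw, robot_uid):
    """Mirror of the quoted condition, taking raw contact lists as input."""
    start = {(c[2], c[4]) for c in start_raw if c[2] != robot_uid}
    end = {(c[2], c[4]) for c in end_raw if c[2] != robot_uid}
    return len(start) > 0 and len(end) > 0 and start <= end


# Hypothetical ids: the drawer is modeled here as another link of the table body.
ROBOT_UID, TABLE_UID, BLOCK_UID = 0, 5, 3
TABLE_LINK, DRAWER_LINK = -1, 2

on_table = [(0, BLOCK_UID, TABLE_UID, -1, TABLE_LINK)]
on_drawer = [(0, BLOCK_UID, TABLE_UID, -1, DRAWER_LINK)]

print(surface_contact_ok(on_table, on_table, ROBOT_UID))              # True
print(surface_contact_ok(on_table, on_drawer, ROBOT_UID))             # False
print(surface_contact_ok(on_table, on_table + on_drawer, ROBOT_UID))  # True
```

The last case shows that gaining an extra contact is tolerated; only losing a start contact makes the condition fail.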

---

### Example of Failure: `push_pink_block_right`

In the `debug_dataset`, the `push_pink_block_right` episode shows this clearly:

* The pink block starts in contact with the **table**,
* but ends in contact with the **drawer**.

The current task oracle labels this as a **failure**, even though the push was visually successful.
Here’s a GIF illustrating the issue:


![Image](https://github.com/user-attachments/assets/49123917-af51-4c5d-8f2b-0df8fb29c839)



---

### Evaluation Results

I evaluated the full **taskD\_D training set**, using **5 trials per episode** (to reduce the impact of environment stochasticity).
The results clearly show **low success rates for `rotate` and `push` tasks**, indicating the issue is systematic.

![Image](https://github.com/user-attachments/assets/654e4c11-2af8-4da2-9e82-7dccb32f4f11)
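For reference, per-task success rates of this kind can be aggregated roughly as follows. This is only a sketch: the function name `success_rate_per_task` is hypothetical, and the input values are made-up placeholders, not my measured results.

```python
from collections import defaultdict


def success_rate_per_task(results):
    """Aggregate (task_name, success) pairs, one pair per rollout trial, into per-task rates."""
    counts = defaultdict(lambda: [0, 0])  # task -> [successes, trials]
    for task, success in results:
        counts[task][0] += int(success)
        counts[task][1] += 1
    return {task: successes / trials for task, (successes, trials) in counts.items()}


# Placeholder outcomes for one episode per task, 5 trials each (illustrative only).
toy_results = (
    [("open_drawer", True)] * 4 + [("open_drawer", False)]
    + [("push_pink_block_right", True)] + [("push_pink_block_right", False)] * 4
)

rates = success_rate_per_task(toy_results)
for task, rate in sorted(rates.items()):
    print(f"{task}: {rate:.2f}")
```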

---

### Conclusion

It seems this issue was previously identified in [issue #32](https://github.com/mees/calvin/issues/32), but **the dataset was not re-labeled** to match the updated task success criteria.

Would it be possible to clarify:

* Whether a re-labeling effort is planned or in progress, or whether `auto_lang_ann` is simply not the right annotation file to use?
* Or whether we are supposed to use the `calvin_env` submodule version from before these commits?

Thank you!

