
How to evaluate Hugging Face per‑task checkpoints (e.g. walker-walk.pt) with this repo? #1

@keyboardAnt

Description


Hi @nicklashansen, thanks a lot for releasing Newt, MMBench, and the checkpoints.

I’m trying to run a sanity‑check evaluation of one of the per‑task checkpoints from the HF repo (https://huggingface.co/nicklashansen/newt/blob/main/walker-walk.pt) using the current code in this repo (commit d0cea95).

I ran:

cd tdmpc2

python train.py \
  task=walker-walk \
  model_size=B \
  checkpoint=/path/to/hf_checkpoints/walker-walk.pt \
  steps=1 \
  num_envs=2 \
  use_demos=False \
  tasks_fp=/path/to/newt/tasks.json \
  exp_name=eval_hf_walker_walk \
  save_video=True \
  env_mode=sync \
  compile=False

This fails with:

AssertionError: pad should be positive
...
File "tdmpc2/common/layers.py", line 190, in api_model_conversion
    assert pad > 0, 'pad should be positive'

I inspected the HF checkpoint:

import torch

state = torch.load("hf_checkpoints/walker-walk.pt", map_location="cpu", weights_only=False)
state = state["model"]

print("_task_emb.weight", state["_task_emb.weight"].shape)                # torch.Size([10, 512])
print("_action_masks", state["_action_masks"].shape)                      # torch.Size([10, 7])
print("_encoder.state.0.weight", state["_encoder.state.0.weight"].shape)  # torch.Size([256, 554])
print("_dynamics.0.weight", state["_dynamics.0.weight"].shape)            # torch.Size([512, 1031])

From this I infer the HF walker-walk.pt was trained with something like:

  • model_size = B (latent_dim=512, enc_dim=256, mlp_dim=512),
  • task_dim = 512 (10 tasks × 512‑dim embedding),
  • obs_state_dim = 42 (since 554 = 42 + 512),
  • action_dim = 7 (since 1031 = 512 (latent) + 7 (action) + 512 (task)).
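For clarity, here is a quick arithmetic check of the inferred decomposition against the shapes printed above (the split into obs/latent/action/task dims is my inference from the shapes, not a confirmed config):

```python
# Values observed in the checkpoint (see shapes above); the
# decomposition below is inferred, not confirmed by the authors.
obs_state_dim, task_dim, latent_dim, action_dim = 42, 512, 512, 7

# _encoder.state.0.weight is [256, 554]: input = obs + task embedding
assert obs_state_dim + task_dim == 554

# _dynamics.0.weight is [512, 1031]: input = latent + action + task embedding
assert latent_dim + action_dim + task_dim == 1031
```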

However, the current single‑task path in this repo builds a model with:

  • task != "soup" ⇒ task_dim = 0 in parse_cfg,
  • padded state observations: obs_shape['state'] = (128,) via VecWrapper,
  • padded actions: action shape (16,) via VecWrapper.

So the locally built _encoder.state.0.weight is [256, 128], which is smaller than the HF checkpoint’s [256, 554]. This violates api_model_conversion’s assumption (“the target has more input channels than the source, so pad the source”), giving pad < 0 and triggering the assertion.
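To make the failure mode concrete, here is a minimal sketch of the mismatch; the pad computation below is my paraphrase of the check in tdmpc2/common/layers.py (line 190), not the exact code:

```python
# Input widths of _encoder.state.0.weight in each model
# (numbers taken from the shapes reported above).
target_in = 128   # locally built single-task model: [256, 128]
source_in = 554   # HF walker-walk.pt checkpoint:    [256, 554]

# The conversion assumes the target is at least as wide as the source
# and zero-pads the source up to the target width. Here the target is
# narrower, so the computed pad is negative and `assert pad > 0` fires.
pad = target_in - source_in
print(pad)  # negative
```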

Environment: Docker image built from this repo’s docker/Dockerfile.


Questions

  1. What is the exact config used to train the per‑task HF checkpoints, e.g. walker-walk.pt?

    • model_size?
    • obs (state vs rgb vs state+rgb)?
    • task_dim?
    • How many tasks / which tasks correspond to the 10 rows in _task_emb.weight and _action_masks?
  2. What is the recommended way to evaluate these HF per‑task checkpoints with this implementation?

    • Is there a matching config/script you use internally (e.g. a specific train.py invocation or a separate eval script)?
    • Should we be using a multitask (task="soup" or similar) config with task_dim=512 and then fix the task index at eval time, rather than the current task_dim=0 single‑task path?
  3. More generally: are the HF single‑task .pt files intended to be evaluated with this early code release as‑is, or should we wait for a dedicated evaluation script / config that matches those checkpoints?

Any guidance or an example eval command for walker-walk.pt (or any other per‑task checkpoint) would be greatly appreciated.
