
How to evaluate Hugging Face per‑task checkpoints (e.g. walker-walk.pt) with this repo? #1

@keyboardAnt

Description


Hi @nicklashansen, thanks a lot for releasing Newt, MMBench, and the checkpoints.

I’m trying to run a sanity‑check evaluation of one of the per‑task checkpoints from the HF repo (https://huggingface.co/nicklashansen/newt/blob/main/walker-walk.pt) using the current code in this repo (commit d0cea95).

I ran:

cd tdmpc2

python train.py \
  task=walker-walk \
  model_size=B \
  checkpoint=/path/to/hf_checkpoints/walker-walk.pt \
  steps=1 \
  num_envs=2 \
  use_demos=False \
  tasks_fp=/path/to/newt/tasks.json \
  exp_name=eval_hf_walker_walk \
  save_video=True \
  env_mode=sync \
  compile=False

This fails with:

AssertionError: pad should be positive
...
File "tdmpc2/common/layers.py", line 190, in api_model_conversion
    assert pad > 0, 'pad should be positive'

I inspected the HF checkpoint:

import torch

state = torch.load("hf_checkpoints/walker-walk.pt", map_location="cpu", weights_only=False)
state = state["model"]

print("_task_emb.weight", state["_task_emb.weight"].shape)                # torch.Size([10, 512])
print("_action_masks", state["_action_masks"].shape)                      # torch.Size([10, 7])
print("_encoder.state.0.weight", state["_encoder.state.0.weight"].shape)  # torch.Size([256, 554])
print("_dynamics.0.weight", state["_dynamics.0.weight"].shape)            # torch.Size([512, 1031])

From this I infer the HF walker-walk.pt was trained with something like:

  • model_size = B (latent_dim=512, enc_dim=256, mlp_dim=512),
  • task_dim = 512 (10 tasks × 512‑dim embedding),
  • obs_state_dim = 42 (since 554 = 42 + 512),
  • action_dim = 7 (since 1031 = 512 (latent) + 7 (action) + 512 (task)).
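For clarity, here is a quick arithmetic check of the inferred decomposition against the shapes printed above (the split into obs/latent/action/task dims is my inference from the shapes, not a confirmed config):

```python
# Values observed in the checkpoint (see shapes above); the
# decomposition below is inferred, not confirmed by the authors.
obs_state_dim, task_dim, latent_dim, action_dim = 42, 512, 512, 7

# _encoder.state.0.weight is [256, 554]: input = obs + task embedding
assert obs_state_dim + task_dim == 554

# _dynamics.0.weight is [512, 1031]: input = latent + action + task embedding
assert latent_dim + action_dim + task_dim == 1031
```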

However, the current single‑task path in this repo builds a model with:

  • task != "soup" ⇒ task_dim = 0 in parse_cfg,
  • padded state observations: obs_shape['state'] = (128,) via VecWrapper,
  • padded actions: action shape (16,) via VecWrapper.

So the locally built _encoder.state.0.weight is [256, 128], which is smaller than the HF checkpoint’s [256, 554]. This violates api_model_conversion’s assumption (“the target has more input channels than the source, so pad the source”), giving pad < 0 and triggering the assertion.
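To make the failure mode concrete, here is a minimal sketch of the mismatch; the pad computation below is my paraphrase of the check in tdmpc2/common/layers.py (line 190), not the exact code:

```python
# Input widths of _encoder.state.0.weight in each model
# (numbers taken from the shapes reported above).
target_in = 128   # locally built single-task model: [256, 128]
source_in = 554   # HF walker-walk.pt checkpoint:    [256, 554]

# The conversion assumes the target is at least as wide as the source
# and zero-pads the source up to the target width. Here the target is
# narrower, so the computed pad is negative and `assert pad > 0` fires.
pad = target_in - source_in
print(pad)  # negative
```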

Environment: Docker image built from this repo’s docker/Dockerfile.


Questions

  1. What is the exact config used to train the per‑task HF checkpoints, e.g. walker-walk.pt?

    • model_size?
    • obs (state vs rgb vs state+rgb)?
    • task_dim?
    • How many tasks / which tasks correspond to the 10 rows in _task_emb.weight and _action_masks?
  2. What is the recommended way to evaluate these HF per‑task checkpoints with this implementation?

    • Is there a matching config/script you use internally (e.g. a specific train.py invocation or a separate eval script)?
    • Should we be using a multitask (task="soup" or similar) config with task_dim=512 and then fix the task index at eval time, rather than the current task_dim=0 single‑task path?
  3. More generally: are the HF single‑task .pt files intended to be evaluated with this early code release as‑is, or should we wait for a dedicated evaluation script / config that matches those checkpoints?

Any guidance or an example eval command for walker-walk.pt (or any other per‑task checkpoint) would be greatly appreciated.
