
confusion about coordinates #5

@BroJunn

Description


Thanks for this amazing work!

I'm confused about one thing: how is the MLLM aware of the size of the image?

I looked through the JSON a bit; for example, there is this entry:

{ "id": 3559, "image": "bridge_data_v2/bridge_data_v1/berkeley/toysink3_bww/put_pot_or_pan_from_sink_into_drying_rack/2021-06-14_18-01-38/raw/traj_group0/traj12/images0/im_0.jpg", "height": 480, "width": 640, "conversations": [ { "from": "human", "value": "<image>\nYou are currently a robot performing robotic manipulation tasks. Your task instruction: put pot or pan from sink into drying rack. Observe the image, use 2D points to mark the manipulated object-centric waypoints to guide the robot to manipulate the object.Typically, the waypoints consists of an ordered sequence of eight 2D points. The format is <point>[[x1, y1], [x2, y2], ...]</point>." }, { "from": "gpt", "value": "<Description>\nThe image shows a robotic arm positioned above a <ref>metal pot</ref><box>[[411, 452, 712, 667]]</box> in a sink. To the right of the pot is a blue cup, and above the pot is a <ref>drying rack</ref><box>[[74, 191, 554, 495]]</box>.\n</Description>\n\n<Reasoning>\nTo move the metal pot from the sink into the drying rack, start by identifying the initial position of the pot in the sink at <point>[[674, 565]]</point>. The robotic arm should lift the pot upwards and slightly to the left, moving towards the drying rack. The path should include intermediate points such as <point>[[611, 579]]</point> and <point>[[607, 530]]</point> as it moves out of the sink. Continue moving the pot towards the drying rack, passing through points like <point>[[594, 458]]</point> and <point>[[566, 396]]</point>. Finally, place the pot into the drying rack at the target location <box>[[173, 263, 406, 424]]</box>, ending the trace at <point>[[575, 301]]</point>.\n</Reasoning>\n\n<Answer>\nThe visual trace for moving the metal pot into the drying rack is <point>[[674, 565], [611, 579], [607, 530], [594, 458], [566, 396], [527, 343], [561, 304], [575, 301]]</point>.\n</Answer>" } ] },

If the image size is (480, 640), how can a point like [674, 565] exist on this image? I'm quite confused here.

And if the coordinates are not normalized to [0, 1] or, say, [0, 336], how can the model be aware of the size of the given image?
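
For what it's worth, my current guess is that the coordinates live on a fixed 0–1000 grid (as in some VLM annotation schemes) and have to be rescaled per image; the sketch below shows what I mean. The grid size of 1000 and the helper name are my own assumptions, not something I found in the repo.

```python
# Minimal sketch of my assumption: annotation coordinates are on a fixed
# [0, 1000) grid regardless of the actual image size, so they must be
# rescaled by (width / 1000, height / 1000) to recover pixel positions.
# The grid size 1000 is my guess, not something confirmed by the repo.

def denormalize_points(points, width, height, grid=1000):
    """Map grid coordinates back to pixel coordinates."""
    return [(x * width / grid, y * height / grid) for x, y in points]

# The first waypoint from the example above (image is 640x480):
print(denormalize_points([(674, 565)], width=640, height=480))
# -> [(431.36, 271.2)]  # lands inside the image if the assumption holds
```

Is something like this what the training data expects, or is the convention different?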

Looking forward to your reply!
