
confusion about coordinates #5

@BroJunn

Description


Thanks for this amazing work!

I'm confused about one thing: how is the MLLM aware of the size of the image?

I looked through the JSON a bit; for example, there is this entry:

{ "id": 3559, "image": "bridge_data_v2/bridge_data_v1/berkeley/toysink3_bww/put_pot_or_pan_from_sink_into_drying_rack/2021-06-14_18-01-38/raw/traj_group0/traj12/images0/im_0.jpg", "height": 480, "width": 640, "conversations": [ { "from": "human", "value": "<image>\nYou are currently a robot performing robotic manipulation tasks. Your task instruction: put pot or pan from sink into drying rack. Observe the image, use 2D points to mark the manipulated object-centric waypoints to guide the robot to manipulate the object.Typically, the waypoints consists of an ordered sequence of eight 2D points. The format is <point>[[x1, y1], [x2, y2], ...]</point>." }, { "from": "gpt", "value": "<Description>\nThe image shows a robotic arm positioned above a <ref>metal pot</ref><box>[[411, 452, 712, 667]]</box> in a sink. To the right of the pot is a blue cup, and above the pot is a <ref>drying rack</ref><box>[[74, 191, 554, 495]]</box>.\n</Description>\n\n<Reasoning>\nTo move the metal pot from the sink into the drying rack, start by identifying the initial position of the pot in the sink at <point>[[674, 565]]</point>. The robotic arm should lift the pot upwards and slightly to the left, moving towards the drying rack. The path should include intermediate points such as <point>[[611, 579]]</point> and <point>[[607, 530]]</point> as it moves out of the sink. Continue moving the pot towards the drying rack, passing through points like <point>[[594, 458]]</point> and <point>[[566, 396]]</point>. Finally, place the pot into the drying rack at the target location <box>[[173, 263, 406, 424]]</box>, ending the trace at <point>[[575, 301]]</point>.\n</Reasoning>\n\n<Answer>\nThe visual trace for moving the metal pot into the drying rack is <point>[[674, 565], [611, 579], [607, 530], [594, 458], [566, 396], [527, 343], [561, 304], [575, 301]]</point>.\n</Answer>" } ] },

If the image size is (480, 640), how can a point like [674, 565] exist on this image? I'm quite confused here.

And if the coordinates are not normalized to [0, 1] or, say, [0, 336], how can the model be aware of the size of the given image?
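
For what it's worth, my current guess is that the coordinates live on a fixed 0–1000 grid (as in some VLM annotation schemes) and have to be rescaled per image; the sketch below shows what I mean. The grid size of 1000 and the helper name are my own assumptions, not something I found in the repo.

```python
# Minimal sketch of my assumption: annotation coordinates are on a fixed
# [0, 1000) grid regardless of the actual image size, so they must be
# rescaled by (width / 1000, height / 1000) to recover pixel positions.
# The grid size 1000 is my guess, not something confirmed by the repo.

def denormalize_points(points, width, height, grid=1000):
    """Map grid coordinates back to pixel coordinates."""
    return [(x * width / grid, y * height / grid) for x, y in points]

# The first waypoint from the example above (image is 640x480):
print(denormalize_points([(674, 565)], width=640, height=480))
# -> [(431.36, 271.2)]  # lands inside the image if the assumption holds
```

Is something like this what the training data expects, or is the convention different?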

Looking forward to your reply!
