-
Notifications
You must be signed in to change notification settings - Fork 0
Description
Thanks for this amazing work!
I'm very confused about one thing that how MLLM is aware of the size of the image.
I looked a bit through the json, e.g. there's an example
{ "id": 3559, "image": "bridge_data_v2/bridge_data_v1/berkeley/toysink3_bww/put_pot_or_pan_from_sink_into_drying_rack/2021-06-14_18-01-38/raw/traj_group0/traj12/images0/im_0.jpg", "height": 480, "width": 640, "conversations": [ { "from": "human", "value": "<image>\nYou are currently a robot performing robotic manipulation tasks. Your task instruction: put pot or pan from sink into drying rack. Observe the image, use 2D points to mark the manipulated object-centric waypoints to guide the robot to manipulate the object.Typically, the waypoints consists of an ordered sequence of eight 2D points. The format is <point>[[x1, y1], [x2, y2], ...]</point>." }, { "from": "gpt", "value": "<Description>\nThe image shows a robotic arm positioned above a <ref>metal pot</ref><box>[[411, 452, 712, 667]]</box> in a sink. To the right of the pot is a blue cup, and above the pot is a <ref>drying rack</ref><box>[[74, 191, 554, 495]]</box>.\n</Description>\n\n<Reasoning>\nTo move the metal pot from the sink into the drying rack, start by identifying the initial position of the pot in the sink at <point>[[674, 565]]</point>. The robotic arm should lift the pot upwards and slightly to the left, moving towards the drying rack. The path should include intermediate points such as <point>[[611, 579]]</point> and <point>[[607, 530]]</point> as it moves out of the sink. Continue moving the pot towards the drying rack, passing through points like <point>[[594, 458]]</point> and <point>[[566, 396]]</point>. Finally, place the pot into the drying rack at the target location <box>[[173, 263, 406, 424]]</box>, ending the trace at <point>[[575, 301]]</point>.\n</Reasoning>\n\n<Answer>\nThe visual trace for moving the metal pot into the drying rack is <point>[[674, 565], [611, 579], [607, 530], [594, 458], [566, 396], [527, 343], [561, 304], [575, 301]]</point>.\n</Answer>" } ] },
If the size of image is (480, 640), how point like[674, 565] exists on this image? I'm quite confused here.
And if the coordinates are not normalized to [0,1] or e.g. [0, 336], how could the model be aware of the size of given image
Looking forward to you reply!