When manipulating objects, humans typically draw on a constrained repertoire of discrete maneuvers. This interaction can often be characterized by a handful of low-dimensional latent actions, such as opening and closing a drawer. While the exact motion varies across object types, the interaction mode itself, such as opening versus closing, is discrete. In this paper, we explore how a learned prior can emulate this limited repertoire of interactions and whether such a prior can be learned from unsupervised play data. We take a perspective that decomposes the policy into two distinct components: a mode selector and a low-level action predictor, where the mode selector operates within a discretely structured latent space.
We introduce ActAIM2, which, given an RGBD image of an articulated object and a robot, identifies meaningful interaction modes such as opening or closing a drawer. ActAIM2 represents the interaction modes as discrete clusters of embeddings, then trains a policy that takes a cluster embedding as input and produces control actions for the corresponding interaction.
Our model training aims to uncover the policy's distribution
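The mode-selector/action-predictor split described above suggests a mixture factorization of the policy. The following is a sketch of that factorization under our own notation, with $o$ the initial observation, $z$ the discrete-mode task embedding, and $a$ the low-level action; the paper's exact symbols may differ:

```latex
\pi(a \mid o) \;=\; \sum_{z} \, p(z \mid o)\; \pi(a \mid o, z)
```

Here $p(z \mid o)$ is the mode selector over the discretely structured latent space, and $\pi(a \mid o, z)$ is the low-level action predictor conditioned on the chosen mode.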
Our dataset was constructed through a combination of random sampling, heuristic grasp sampling, and Gaussian Mixture Model (GMM)-based adaptive sampling, featuring the Franka Emika robot engaging with various articulated objects across multiple interaction modes.
The figure below illustrates how we achieve diverse interaction-mode sampling using GMM-based adaptive sampling.
We also formulate our data collection algorithm here.
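As a rough illustration of the GMM-based adaptive sampling step, the sketch below fits a Gaussian Mixture Model to embeddings of previously logged interaction outcomes and draws new proposals from the fitted mixture. The array names, dimensions, and the use of scikit-learn are our assumptions for illustration, not the paper's implementation.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Hypothetical stand-in for logged interaction outcomes: each row is a
# low-dimensional embedding of the state change an action produced
# (e.g. drawer opened vs. drawer closed vs. no effect).
observed = np.vstack([
    rng.normal(loc=-2.0, scale=0.3, size=(40, 2)),  # one interaction mode
    rng.normal(loc=+2.0, scale=0.3, size=(40, 2)),  # another mode
    rng.normal(loc=0.0, scale=0.3, size=(40, 2)),   # e.g. failed attempts
])

# Fit a GMM so each mixture component roughly covers one interaction mode.
gmm = GaussianMixture(n_components=3, random_state=0).fit(observed)

# Adaptive step: draw new proposals from the fitted mixture, so data
# collection keeps revisiting every discovered mode instead of
# collapsing onto the easiest one.
proposals, components = gmm.sample(n_samples=30)
print(proposals.shape, components.shape)
```

Sampling from the fitted mixture, rather than uniformly at random, is what biases data collection toward covering all discovered modes.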
In this part, we show how we train the mode selector and infer from it to extract the discrete task embeddings used for action-predictor training. Our mode selector is a VAE-style generative model that replaces the simple Gaussian prior with a Mixture of Gaussians.
This figure illustrates the training procedure of the mode selector, which mirrors a conditional generative model. The initial and final observations are contrasted, with the final observation serving as the ground truth for the task embedding and the encoded initial image acting as the conditional variable. Both the generated task embedding and the conditional variable are fed into a 4-layer residual-network mode encoder, which then predicts the categorical variable
In the inference phase, the agent discretely samples a cluster from the trained Gaussian Mixture Variational Autoencoder (GMVAE) model to compute the Mixture-of-Gaussians variable
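A minimal sketch of this discrete-then-continuous sampling, assuming K mixture components with learned means and log-variances; the tensor names and the uniform cluster choice are illustrative stand-ins, not the trained model's actual parameters or posterior:

```python
import numpy as np

rng = np.random.default_rng(7)

K, D = 4, 8                        # number of clusters, embedding dimension
mu = rng.normal(size=(K, D))       # stand-in for learned component means
log_var = rng.normal(size=(K, D))  # stand-in for learned component log-variances

# 1) Discretely pick an interaction mode (uniform here; the trained
#    model would use its learned categorical distribution).
y = int(rng.integers(K))

# 2) Sample the task embedding from the chosen Gaussian component.
z = mu[y] + np.exp(0.5 * log_var[y]) * rng.normal(size=D)

# z is then passed to the action predictor as the task embedding.
print(y, z.shape)
```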
This visualization illustrates the efficacy of the Conditional Gaussian Mixture Variational Autoencoder (CGMVAE) in disentangling interaction modes for the "single drawer" object (ID: 20411), using a t-SNE plot. Task embeddings
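To reproduce this kind of plot, task embeddings can be projected to 2D with t-SNE and colored by cluster. The sketch below uses synthetic embeddings whose cluster structure and dimensionality are made up for illustration; only the t-SNE projection step reflects the visualization described above.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)

# Synthetic stand-ins for task embeddings from three interaction modes.
embeddings = np.vstack([
    rng.normal(loc=c, scale=0.5, size=(50, 16))
    for c in (-3.0, 0.0, 3.0)
])

# Project to 2D; perplexity must stay below the number of samples.
xy = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(
    embeddings.astype(np.float32)
)
print(xy.shape)  # one 2D point per embedding
```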
Our objective is to infer a sequence of low-level actions
Interaction mode
Here, we provide more qualitative results showing how our agent interacts with articulated objects in different interaction modes. Given different task embeddings, the action predictor produces actions representing distinct interaction modes. We visualize the camera view and the prediction heatmap from the top for several object instances, and show the corresponding video of the robot interacting with the object, with the gripper pose extracted from the predicted heatmap.
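One simple way to realize the gripper-pose extraction mentioned above is to take the argmax pixel of the predicted heatmap and deproject it into a 3D camera-frame point using the depth image and a pinhole camera model. The intrinsics, image size, and array names below are hypothetical, and the actual pipeline may also predict orientation; this sketch covers only the position:

```python
import numpy as np

rng = np.random.default_rng(0)

H, W = 64, 64
heatmap = rng.random((H, W))      # stand-in for the predicted heatmap
depth = np.full((H, W), 0.8)      # stand-in depth image, in meters
fx = fy = 60.0                    # hypothetical pinhole intrinsics
cx, cy = W / 2.0, H / 2.0

# Pick the most confident pixel as the grasp point.
v, u = np.unravel_index(np.argmax(heatmap), heatmap.shape)

# Deproject pixel (u, v) with its depth into a 3D camera-frame point.
z = depth[v, u]
point = np.array([(u - cx) * z / fx, (v - cy) * z / fy, z])
print(point.shape)
```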
Click the video here to see the robot interacting with the faucet above in three different interaction modes.
Click the video here to see the robot interacting with the table with multiple drawers above in three different interaction modes.
Here are more qualitative results of how the robot interacts with different types of articulated objects:
Interacting with a switch and turning it on and off
Interacting with a single-drawer table and opening and closing the drawer
Interacting with a door and opening and closing it on either side