GR-MG: Leveraging Partially Annotated Data via Multi-Modal Goal Conditioned Policy



Abstract

The robotics community has consistently aimed to achieve generalizable robot manipulation with flexible natural-language instructions. One of the primary challenges is that obtaining robot data fully annotated with both actions and texts is time-consuming and labor-intensive. However, partially annotated data, such as human activity videos without action labels and robot play data without language labels, is much easier to collect. Can we leverage these data to enhance the generalization capability of robots? In this paper, we propose GR-MG, a novel method which supports conditioning on both a language instruction and a goal image. During training, GR-MG samples goal images from trajectories and conditions on both the text and the goal image, or solely on the image when text is unavailable. During inference, where only the text is provided, GR-MG generates the goal image via a diffusion-based image-editing model and conditions on both the text and the generated image. This approach enables GR-MG to leverage large amounts of partially annotated data while still using language to flexibly specify tasks. To generate accurate goal images, we propose a novel progress-guided goal image generation model which injects task progress information into the generation process, significantly improving the fidelity of the generated images and the overall task performance.
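To make this conditioning scheme concrete, below is a minimal Python sketch of how a training example could be assembled. All names (make_training_example, frames, horizon_max) are hypothetical illustrations of the idea, not the actual GR-MG implementation.

import random

def make_training_example(frames, text=None, horizon_max=20):
    """Sample one (observation, goal image, instruction) tuple from a trajectory.

    frames is a list of image observations; text is the language label,
    or None for data without language annotations.
    """
    t = random.randrange(len(frames) - 1)
    # The goal image is simply a future frame of the same trajectory,
    # so it requires no manual annotation.
    g = min(t + random.randint(1, horizon_max), len(frames) - 1)
    return {
        "observation": frames[t],
        "goal_image": frames[g],
        # Trajectories without a language label condition on the goal image
        # alone; an empty instruction stands in for the missing text.
        "instruction": text if text is not None else "",
    }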

GR-MG is a model designed to support multi-modal goals: a language instruction and a goal image. It consists of two modules: a progress-guided goal image generation model and a multi-modal goal-conditioned policy. The goal image generation model generates a goal image based on the current observation, a language description of the task, and the task progress. The multi-modal goal-conditioned policy takes the generated goal image, the current observation, and the language instruction as inputs to predict the action trajectory, the task progress, and future images. The predicted task progress is fed back into the goal image generation model; it is initialized to zero at the beginning of a rollout.
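The data flow between the two modules can be summarized in a short sketch. The goal_gen and policy callables below are hypothetical stand-ins for the two neural modules; the sketch illustrates the feedback loop, not the actual implementation.

def inference_step(goal_gen, policy, obs, text, progress):
    # 1. The progress-guided generation model edits the current observation
    #    into a goal image, conditioned on the instruction and the progress.
    goal_image = goal_gen(obs, text, progress)
    # 2. The policy consumes the goal image, the current observation, and
    #    the instruction; it predicts actions, an updated progress estimate,
    #    and future images (the latter serving as an auxiliary output).
    actions, progress, future_images = policy(goal_image, obs, text)
    # 3. The updated progress is fed back into the generation model at the
    #    next step, closing the loop.
    return actions, progress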

Example Rollout

Below, we show an example rollout. Specifically, the goal image generation model generates a goal image every n timesteps. The goal image indicates the state the robot should move towards according to the given language instruction.
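Using the same hypothetical goal_gen and policy interfaces as above, the rollout schedule might look as follows; n controls how often the goal image is refreshed, and the env interface is likewise assumed for illustration.

def rollout(env, goal_gen, policy, text, max_steps=300, n=10):
    obs = env.reset()
    progress = 0.0  # task progress starts at zero
    goal_image = None
    for step in range(max_steps):
        if step % n == 0:  # regenerate the goal image every n timesteps
            goal_image = goal_gen(obs, text, progress)
        actions, progress, _future = policy(goal_image, obs, text)
        obs = env.step(actions[0])  # execute the first predicted action
    return obs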

Multi-Task Learning Experiments

GR-MG is able to perform 47 tasks, including object pick-and-place, articulated object manipulation, and pouring.

Generalization Experiments

GR-MG showcases powerful generalization capability. We evaluate its performance in four different settings: Simple, Unseen Distractor, Unseen Background, and Unseen Object. Below, we show example rollouts in the generalization settings.

Robust to Different Object Poses

press the toaster switch

open the oven

Robust to Unseen Distractors

pick up the mandarin from the green plate & place the picked object on the table

pick up the red mug from the rack & place the picked object on the table

Robust to Unseen Backgrounds

pick up the red mug from the rack & place the picked object on the table

pick up the potato from the vegetable basket & place the picked object on the cutting board

pick up the eggplant from the green plate & place the picked object on the red plate


Robust to Unseen Objects

pick up the tiger from the red plate & place the picked object on the green plate

pick up the red apple from the red plate & place the picked object on the green plate

pick up the white bottle from the red plate & place the picked object on the green plate

pick up the yellow bottle from the vegetable basket & place the picked object on the cutting board

pick up the banana from the vegetable basket & place the picked object on the cutting board

pick up the red apple from the vegetable basket & place the picked object on the cutting board

Quantitative Results

We evaluate the performance of GR-MG on the CALVIN benchmark and on a real robot. On CALVIN, we evaluate GR-MG on the ABC->D challenge, where it outperforms all baselines. The advantage becomes even more pronounced when using only 10% of the data with language and action labels. The results are presented in the following table, with a sketch of the benchmark's chained-task metric after it.


Results on CALVIN ABC->D
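For context, the CALVIN long-horizon protocol chains five consecutive language instructions per evaluation episode and reports, for each k, the fraction of episodes that complete at least k tasks in a row, along with the average number of completed tasks. Below is a minimal sketch of that metric computation; it is our paraphrase of the protocol, not code from the benchmark.

def calvin_metrics(episodes):
    """episodes: per-episode task outcomes, e.g. [True, True, False, ...]."""
    completed = []
    for tasks in episodes:
        k = 0
        for ok in tasks:  # count consecutive successes from the start
            if not ok:
                break
            k += 1
        completed.append(k)
    n = len(episodes)
    rates = {k: sum(c >= k for c in completed) / n for k in range(1, 6)}
    avg_len = sum(completed) / n  # the "Avg. Len." metric
    return rates, avg_len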

In real-robot experiments, we evaluate GR-MG in a simple setting as well as three challenging generalization settings. GR-MG consistently outperforms competitive baselines. Results are shown below.


Success Rates of Real-Robot Experiments.

More details about our method and experimental settings can be found in our paper.

Citation

@article{li2024gr,
  title={GR-MG: Leveraging Partially Annotated Data via Multi-Modal Goal Conditioned Policy},
  author={Li, Peiyan and Wu, Hongtao and Huang, Yan and Cheang, Chilam and Wang, Liang and Kong, Tao},
  journal={arXiv preprint arXiv:2408.14368},
  year={2024}
}