USC at the Conference on Robot Learning (CoRL) 2024

November 4, 2024

USC computer science faculty and students present their latest work at the intersection of robotics and machine learning.


USC computer science students and faculty are presenting their latest work at the Conference on Robot Learning (CoRL) this week. Photo/USC.

This week, USC computer science students and faculty are presenting their latest work at the Conference on Robot Learning (CoRL), one of the largest and most influential robotics conferences, held Nov. 6-9. From programming robots to perform tasks with household objects to robots that can excavate and move objects across difficult terrain, the research aims to advance the development of reliable and efficient robots.


“My ultimate goal is to bring robots to everyone’s homes,” says Yue Wang.

One exceptional paper, RAM: Retrieval-Based Affordance Transfer for Generalizable Zero-Shot Robotic Manipulation, secured a coveted spot in the conference’s oral track, which has an acceptance rate of only 6.6%. We asked the paper’s senior author, USC Viterbi Assistant Professor of Computer Science Yue Wang, to share his thoughts and takeaways from the research.

What inspired you to pursue this research? 

My ultimate goal is to bring robots to everyone’s homes, to assist with humanity’s daily tasks and augment human intelligence. Foundation models have recently made significant strides in other fields, and I want to leverage these powerful tools to boost the performance of robots.

If there is one key insight you hope readers take away from the paper, what should it be and why do you think it’s significant? 

The key insight is that, using foundation models, we can design algorithms that leverage the vast amount of actionable knowledge embedded in massive video data. This knowledge can be obtained nearly for free and improves the generalization capability of robots.

Were there any breakthrough moments or lessons? 

Recently, foundation models and generative AI have greatly improved 3D computer vision and robotics, turning many robotic tasks that were once considered impossible into reality.

How do you envision the findings from this paper influencing the broader field of robotic learning?

The findings in this paper can be seamlessly applied to any robotic manipulation pipeline to improve its generalization capability. In addition, the data pseudo-labeling module can greatly democratize data collection and annotation in robot learning.

What are the next steps for this line of research, and how might it evolve in future studies?

We are working on making the proposed method more general so that it can handle complex robotic tasks. We are also studying humanoid robots as more general embodiments for everyday human tasks. We hope robots will soon achieve agility and intelligence comparable to that of humans.

Accepted papers with USC affiliation (USC authors highlighted):

Oral Session 1

RAM: Retrieval-Based Affordance Transfer for Generalizable Zero-Shot Robotic Manipulation

Authors: Yuxuan Kuang, Junjie Ye, Haoran Geng, Jiageng Mao, Congyue Deng, Leonidas Guibas, He Wang, Yue Wang

Abstract: This work proposes a retrieve-and-transfer framework for zero-shot robotic manipulation, dubbed RAM, featuring generalizability across various objects, environments, and embodiments. Unlike existing approaches that learn manipulation from expensive in-domain demonstrations, RAM capitalizes on a retrieval-based affordance transfer paradigm to acquire versatile manipulation capabilities from abundant out-of-domain data. First, RAM extracts unified affordance at scale from diverse sources of demonstrations including robotic data, human-object interaction (HOI) data, and custom data to construct a comprehensive affordance memory. Then given a language instruction, RAM hierarchically retrieves the most similar demonstration from the affordance memory and transfers such out-of-domain 2D affordance to in-domain 3D executable affordance in a zero-shot and embodiment-agnostic manner. Extensive simulation and real-world evaluations demonstrate that our RAM consistently outperforms existing works in diverse daily tasks. Additionally, RAM shows significant potential for downstream applications such as automatic and efficient data collection, one-shot visual imitation, and LLM/VLM-integrated long-horizon manipulation. For more details, please check our website at https://yxkryptonite.github.io/RAM/.
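To make the retrieve-and-transfer idea concrete, here is a minimal, hypothetical sketch, not the authors’ implementation: a stand-in “affordance memory” is searched by embedding similarity, and the retrieved 2D contact point is back-projected to 3D using a stored depth value. In RAM itself, the embeddings come from foundation models and the memory is built from robotic, human-object interaction, and custom data.

```python
# Illustrative sketch only (not the authors' code): a toy retrieve-then-transfer
# loop in the spirit of RAM. Real embeddings would come from vision-language
# foundation models; here random vectors stand in, and the 2D -> 3D "transfer"
# is reduced to reading a stored depth value. All names are hypothetical.
import numpy as np

rng = np.random.default_rng(0)

# Affordance memory: each entry pairs a (stand-in) instruction embedding with
# a 2D contact point, an approach direction, and a depth for lifting to 3D.
memory = [
    {"embed": rng.normal(size=64), "contact_uv": (120, 88),
     "direction": np.array([0.0, 0.0, -1.0]), "depth_m": 0.42},
    {"embed": rng.normal(size=64), "contact_uv": (64, 200),
     "direction": np.array([1.0, 0.0, 0.0]), "depth_m": 0.31},
]

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def retrieve_and_transfer(instruction_embed, fx=600.0, fy=600.0, cx=160.0, cy=120.0):
    """Retrieve the most similar demonstration, then lift its 2D affordance to 3D."""
    best = max(memory, key=lambda e: cosine(instruction_embed, e["embed"]))
    u, v = best["contact_uv"]
    z = best["depth_m"]
    # Pinhole back-projection of the retrieved 2D contact point into 3D.
    point_3d = np.array([(u - cx) * z / fx, (v - cy) * z / fy, z])
    return point_3d, best["direction"]

point, direction = retrieve_and_transfer(rng.normal(size=64))
print("grasp point (m):", point, "approach:", direction)
```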

Poster and Spotlight Session 1

Trajectory Improvement and Reward Learning from Comparative Language Feedback

Authors: Zhaojing Yang, Miru Jun, Jeremy Tien, Stuart Russell, Anca Dragan, Erdem Biyik

Abstract: Learning from human feedback has gained traction in fields like robotics and natural language processing in recent years. While prior works mostly rely on human feedback in the form of comparisons, language is a preferable modality that provides more informative insights into user preferences. In this work, we aim to incorporate comparative language feedback to iteratively improve robot trajectories and to learn reward functions that encode human preferences. To achieve this goal, we learn a shared latent space that integrates trajectory data and language feedback, and subsequently leverage the learned latent space to improve trajectories and learn human preferences. To the best of our knowledge, we are the first to incorporate comparative language feedback into reward learning. Our simulation experiments demonstrate the effectiveness of the learned latent space and the success of our learning algorithms. We also conduct human subject studies that show our reward learning algorithm achieves a 23.9% higher subjective score on average and is 11.3% more time-efficient compared to preference-based reward learning, underscoring the superior performance of our method.
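As a rough illustration of the shared latent space idea, and not the paper’s implementation, the sketch below nudges a trajectory’s latent code along the direction suggested by a language feedback embedding and decodes the result; the encoders and decoder are random stand-ins for what the paper learns from data.

```python
# Illustrative sketch only (not the paper's implementation): once trajectories
# and comparative language feedback share a latent space, one simple way to use
# feedback is to nudge the trajectory's latent code along the feedback
# direction and decode. Encoders and decoder here are stand-in linear maps.
import numpy as np

rng = np.random.default_rng(1)
LATENT = 16

# Hypothetical stand-ins for learned encoders/decoder.
W_traj = rng.normal(size=(LATENT, 30))   # flattened trajectory -> latent
W_lang = rng.normal(size=(LATENT, 64))   # language feedback embedding -> latent
W_dec = rng.normal(size=(30, LATENT))    # latent -> trajectory

def improve(trajectory, feedback_embed, step=0.5):
    z = W_traj @ trajectory.ravel()
    delta = W_lang @ feedback_embed
    delta /= np.linalg.norm(delta) + 1e-8        # unit feedback direction
    z_new = z + step * delta                     # move toward "better" per feedback
    return (W_dec @ z_new).reshape(10, 3)        # decoded, improved trajectory

traj = rng.normal(size=(10, 3))                  # 10 waypoints in 3D
better = improve(traj, rng.normal(size=64))
print(better.shape)
```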

Poster and Spotlight Session 2

Efficient Policy Learning by Extracting Transferrable Robot Skills from Offline Data

Authors: Jesse Zhang, Minho Heo, Zuxin Liu, Erdem Biyik, Joseph J Lim, Yao Liu, Rasool Fakoor

Abstract: Most reinforcement learning (RL) methods focus on learning optimal policies over low-level action spaces. While these methods can perform well in their training environments, they lack the flexibility to transfer to new tasks. Instead, RL agents that can act over useful, temporally extended skills rather than low-level actions can learn new tasks more easily. Prior work in skill-based RL either requires expert supervision to define useful skills, which is hard to scale, or learns a skill-space from offline data with heuristics that limit the adaptability of the skills, making them difficult to transfer during downstream RL. Our approach, EXTRACT, instead utilizes pre-trained vision language models to extract a discrete set of semantically meaningful skills from offline data, each of which is parameterized by continuous arguments, without human supervision. This skill parameterization allows robots to learn new tasks by only needing to learn when to select a specific skill and how to modify its arguments for the specific task. We demonstrate through experiments in sparse-reward, image-based, robot manipulation environments that EXTRACT can more quickly learn new tasks than prior works, with major gains in sample efficiency and performance over prior skill-based RL. Website at https://www.jessezhang.net/projects/extract/.
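The interface the abstract describes, a policy that selects a discrete skill and fills in its continuous arguments, can be sketched as follows; the skill names, argument shapes, and policy are hypothetical stand-ins, not EXTRACT’s learned components.

```python
# Illustrative sketch only (not EXTRACT itself): the policy picks a discrete
# skill and fills in its continuous arguments, instead of emitting low-level
# actions. Skill names and argument shapes below are hypothetical.
import numpy as np

rng = np.random.default_rng(2)

SKILLS = {
    0: ("reach", 3),   # (name, number of continuous arguments)
    1: ("grasp", 1),
    2: ("place", 3),
}

def toy_policy(observation):
    """Stand-in for a learned policy head: returns (skill_id, continuous args)."""
    skill_id = int(rng.integers(len(SKILLS)))
    _, arg_dim = SKILLS[skill_id]
    args = rng.uniform(-1.0, 1.0, size=arg_dim)
    return skill_id, args

def execute_skill(skill_id, args):
    name, _ = SKILLS[skill_id]
    # A real controller would roll out low-level actions for the chosen skill.
    print(f"executing skill '{name}' with args {np.round(args, 2)}")

obs = rng.normal(size=8)
execute_skill(*toy_policy(obs))
```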

Learning Granular Media Avalanche Behavior for Indirectly Manipulating Obstacles on a Granular Slope

Authors: Haodi Hu, Feifei Qian, Daniel Seita

Abstract: Legged robot locomotion on sand slopes is challenging due to the complex dynamics of granular media and how the lack of solid surfaces can hinder locomotion. A promising strategy, inspired by ghost crabs and other organisms in nature, is to strategically interact with rocks, debris, and other obstacles to facilitate movement. To provide legged robots with this ability, we present a novel approach that leverages avalanche dynamics to indirectly manipulate objects on a granular slope. We use a Vision Transformer (ViT) to process image representations of granular dynamics and robot excavation actions. The ViT predicts object movement, which we use to determine which leg excavation action to execute. We collect training data from 100 real physical trials and, at test time, deploy our trained model in novel settings. Experimental results suggest that our model can accurately predict object movements and achieve a success rate ≥80% in a variety of manipulation tasks with up to four obstacles, and can also generalize to objects with different physics properties. To our knowledge, this is the first paper to leverage granular media avalanche dynamics to indirectly manipulate objects on granular slopes. Supplementary material is available at https://sites.google.com/view/grain-corl2024/home.
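A simplified view of the decision loop, not the authors’ system: score each candidate leg excavation action by the object movement a learned model predicts for it, then pick the action closest to the desired movement. The predictor below is a random stand-in for the trained Vision Transformer.

```python
# Illustrative sketch only (not the authors' system): choose the excavation
# action whose predicted object movement best matches the goal. The predictor
# is a random stand-in for the trained ViT.
import numpy as np

rng = np.random.default_rng(3)

CANDIDATE_ACTIONS = ["dig_left", "dig_center", "dig_right"]

def predict_object_motion(image, action):
    """Stand-in for the ViT: returns a predicted 2D displacement of the object."""
    return rng.normal(scale=0.05, size=2)

def choose_action(image, goal_displacement):
    """Pick the excavation action whose predicted motion is closest to the goal."""
    def error(action):
        return np.linalg.norm(predict_object_motion(image, action) - goal_displacement)
    return min(CANDIDATE_ACTIONS, key=error)

image = np.zeros((64, 64))                 # placeholder slope image
print(choose_action(image, np.array([0.1, 0.0])))
```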

Poster and Spotlight Session 3

Contrast Sets for Evaluating Language-Guided Robot Policies

Authors: Abrar Anwar, Rohan Gupta, Jesse Thomason

Abstract: Robot evaluations in language-guided, real world settings are time-consuming and often sample only a small space of potential instructions across complex scenes. In this work, we introduce contrast sets for robotics as an approach to make small, but specific, perturbations to otherwise independent, identically distributed (i.i.d.) test instances. We investigate the relationship between experimenter effort to carry out an evaluation and the resulting estimated test performance as well as the insights that can be drawn from performance on perturbed instances. We use contrast sets to characterize policies at reduced experimenter effort in both a simulated manipulation task and a physical robot vision-and-language navigation task. We encourage the use of contrast set evaluations as a more informative alternative to small scale, i.i.d. demonstrations on physical robots, and as a scalable alternative to industry-scale real world evaluations.
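A toy version of the evaluation protocol, not the paper’s code, shows the idea: pair each i.i.d. test instance with a small, targeted perturbation and compare success rates on the two; the gap is the signal an i.i.d. evaluation alone would miss. The policy and perturbation here are hypothetical.

```python
# Illustrative sketch only (not the paper's code): contrast-set evaluation pairs
# each i.i.d. test instance with a targeted perturbation and compares success.
# The policy and perturbation below are hypothetical stand-ins.
import random

random.seed(0)

def perturb(instance):
    # e.g. swap the referenced object while keeping the rest of the scene fixed.
    return {**instance, "target": "green mug" if instance["target"] == "red mug" else "red mug"}

def run_policy(instance):
    # Stand-in: a policy that happens to handle "red mug" better.
    return random.random() < (0.8 if instance["target"] == "red mug" else 0.5)

test_set = [{"target": "red mug"} for _ in range(20)]

iid_success = sum(run_policy(x) for x in test_set) / len(test_set)
contrast_success = sum(run_policy(perturb(x)) for x in test_set) / len(test_set)
print(f"i.i.d. success: {iid_success:.2f}, contrast-set success: {contrast_success:.2f}")
```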

Poster and Spotlight Session 4

VoxAct-B: Voxel-Based Acting and Stabilizing Policy for Bimanual Manipulation

Authors: I-Chun Arthur Liu, Sicheng He, Daniel Seita, Gaurav S. Sukhatme

Abstract: Bimanual manipulation is critical to many robotics applications. In contrast to single-arm manipulation, bimanual manipulation tasks are challenging due to higher-dimensional action spaces. Prior works leverage large amounts of data and primitive actions to address this problem, but may suffer from sample inefficiency and limited generalization across various tasks. To this end, we propose VoxAct-B, a language-conditioned, voxel-based method that leverages Vision Language Models (VLMs) to prioritize key regions within the scene and reconstruct a voxel grid. We provide this voxel grid to our bimanual manipulation policy to learn acting and stabilizing actions. This approach enables more efficient policy learning from voxels and is generalizable to different tasks. In simulation, we show that VoxAct-B outperforms strong baselines on fine-grained bimanual manipulation tasks. Furthermore, we demonstrate VoxAct-B on real-world Open Drawer and Open Jar tasks using two UR5s. Code, data, and videos are available at https://voxact-b.github.io/.
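In rough outline, and not VoxAct-B’s actual code, the pipeline the abstract describes can be sketched as: ask a VLM for the task-relevant region, voxelize the point cloud around it, and feed the grid to separate acting and stabilizing policy heads. The VLM call and policy heads below are stand-ins.

```python
# Illustrative sketch only (not VoxAct-B itself): crop a voxel grid around a
# VLM-prioritized region and feed it to acting/stabilizing heads. The VLM call
# and policy heads are hypothetical stand-ins.
import numpy as np

rng = np.random.default_rng(4)

def vlm_region_of_interest(instruction):
    """Stand-in for a VLM: returns a 3D center (m) to crop the voxel grid around."""
    return np.array([0.5, 0.0, 0.2])

def voxelize(points, center, grid=32, extent=0.3):
    """Bin points into a grid^3 occupancy volume centered on the region of interest."""
    vol = np.zeros((grid, grid, grid), dtype=np.float32)
    idx = ((points - center + extent / 2) / extent * grid).astype(int)
    keep = np.all((idx >= 0) & (idx < grid), axis=1)
    vol[tuple(idx[keep].T)] = 1.0
    return vol

def acting_head(vol):       return rng.normal(size=7)   # stand-in 7-DoF action
def stabilizing_head(vol):  return rng.normal(size=7)

cloud = rng.uniform([0.3, -0.2, 0.0], [0.7, 0.2, 0.4], size=(2000, 3))
vol = voxelize(cloud, vlm_region_of_interest("open the jar"))
print(acting_head(vol).shape, stabilizing_head(vol).shape)
```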

