AW-Opt: Learning Robotic Skills with
Imitation and Reinforcement Learning at Scale
CoRL 2021


Robotic skills can be learned via imitation learning (IL) using user-provided demonstrations, or via reinforcement learning (RL) using large amounts of autonomously collected experience. Both methods have complementary strengths and weaknesses: RL can reach a high level of performance, but requires exploration, which can be very time-consuming and unsafe; IL does not require exploration, but only learns skills that are as good as the provided demonstrations. Can a single method combine the strengths of both approaches? A number of prior methods have aimed to address this question, proposing a variety of techniques that integrate elements of IL and RL. However, scaling up such methods to complex robotic skills that integrate diverse offline data and generalize meaningfully to real-world scenarios still presents a major challenge. In this paper, our aim is to test the scalability of prior IL + RL algorithms and devise a system based on detailed empirical experimentation that combines existing components in the most effective and scalable way. To that end, we present a series of experiments aimed at understanding the implications of each design decision, so as to develop a combined approach that can utilize demonstrations and heterogeneous prior data to attain the best performance on a range of real-world and realistic simulated robotic problems. Our complete method, which we call AW-Opt, combines elements of advantage-weighted regression and QT-Opt, providing a unified approach for integrating demonstrations and offline data for robotic manipulation.


We began our investigation with two existing methods: AWAC, which combines IL and RL, and QT-Opt, a scalable RL algorithm we have been using on our robots. Our testbed consists of 6 tasks: a navigation task with dense reward and 5 manipulation tasks with sparse rewards. The manipulation tasks run on two different robot platforms with different control modalities (KUKA and our proprietary robot). Our tasks cover varying levels of difficulty, from indiscriminate grasping (figures (a) and (c) below) to semantic grasping (figure (d), grasping compostable objects) to instance grasping (figure (b), grasping the green bowl).

Both algorithms are provided with demonstrations for offline pretraining, either from humans or from previous successful RL rollouts. Afterwards, they switch to on-policy data collection and training. We found that QT-Opt fails to learn from only successful rollouts, and even fails to make progress during on-policy training for tasks with a 7 DoF action space. AWAC, on the other hand, does attain non-zero success rates from the demonstrations, but its performance is still poor and collapses during online fine-tuning for all our sparse-reward manipulation tasks.

We introduced a series of modifications to AWAC that bring it closer to QT-Opt, improving overall learning performance while retaining the ability to utilize demonstrations, culminating in our full AW-Opt algorithm.

Positive Sample Filtering: One possible explanation for the poor performance of AWAC is that, because the success rate after pretraining is relatively low, the large number of failed episodes collected during online exploration drowns out the initial successful demonstrations, and the actor unlearns the promising policy. To address this issue, we used positive filtering for the actor, applying the AWAC actor update only on successful samples. As a result, the algorithm no longer collapses during on-policy training.
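As a rough illustration of positive filtering, the sketch below computes an advantage-weighted actor loss in which only transitions from successful episodes contribute. It is a minimal NumPy sketch, not the authors' implementation: the function name, the `temperature` and `max_weight` parameters, and the assumption that every transition carries its episode's success flag are all hypothetical choices made for this example.

```python
import numpy as np

def positive_filtered_actor_loss(log_probs, advantages, success,
                                 temperature=1.0, max_weight=20.0):
    """Advantage-weighted negative log-likelihood over successful samples only.

    log_probs:  log pi(a|s) of each stored action under the current actor.
    advantages: critic-based advantage estimate for each transition.
    success:    1 if the transition comes from a successful episode, else 0
                (assumed labeling; the source only says "successful samples").
    """
    log_probs = np.asarray(log_probs, dtype=float)
    advantages = np.asarray(advantages, dtype=float)
    mask = np.asarray(success, dtype=float)

    # AWAC-style weights exp(A / temperature); clipping is an assumed
    # numerical safeguard, not something stated in the text.
    weights = np.minimum(np.exp(advantages / temperature), max_weight)

    num_positive = mask.sum()
    if num_positive == 0:
        return 0.0  # no successful samples in this batch: skip the actor update

    # Failed samples are masked out, so they cannot pull the actor
    # away from the demonstrated behavior.
    return float(-(mask * weights * log_probs).sum() / num_positive)

# Example: the second transition comes from a failed episode and is ignored.
loss = positive_filtered_actor_loss(
    log_probs=[-1.0, -2.0, -3.0],
    advantages=[0.0, 0.0, 0.0],
    success=[1, 0, 1],
)
```

In this sketch the critic would still be trained on all data; only the actor term is filtered.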

Hybrid Actor-Critic Exploration: QT-Opt uses the cross-entropy method (CEM) to optimize the actions with respect to the critic’s prediction, which can be viewed as an implicit policy (CEM policy). The CEM process has intrinsic noise due to sampling and can act as an exploration policy. AWAC, on the other hand, explores by sampling from the actor network, although we could also obtain a CEM policy from its critic. We found that using both the actor and the CEM policies for exploration, switching randomly between the two on a per-episode basis, performs better than using either one alone.
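The per-episode switching can be sketched as follows. This is a toy illustration under stated assumptions: `run_episode`, `collect_hybrid`, the `step_fn` environment stand-in, and the 50/50 switching probability `p_actor` are hypothetical names and values chosen for the example, and the two policies are passed in as plain callables.

```python
import numpy as np

def run_episode(policy, initial_state, step_fn, horizon):
    """Roll out one episode with a single fixed policy."""
    state, transitions = initial_state, []
    for _ in range(horizon):
        action = policy(state)
        next_state = step_fn(state, action)
        transitions.append((state, action, next_state))
        state = next_state
    return transitions

def collect_hybrid(actor_policy, cem_policy, initial_state, step_fn,
                   num_episodes, horizon, rng, p_actor=0.5):
    """Pick the actor or the CEM policy once per episode, never per step."""
    episodes = []
    for _ in range(num_episodes):
        use_actor = rng.random() < p_actor  # p_actor is an assumed setting
        policy = actor_policy if use_actor else cem_policy
        label = "actor" if use_actor else "cem"
        episodes.append((label, run_episode(policy, initial_state, step_fn, horizon)))
    return episodes

# Toy usage with stand-in policies and a 1-D additive "environment".
rng = np.random.default_rng(0)
episodes = collect_hybrid(
    actor_policy=lambda s: 1.0,    # stand-in for sampling from the actor network
    cem_policy=lambda s: -1.0,     # stand-in for CEM over the critic
    initial_state=0.0,
    step_fn=lambda s, a: s + a,
    num_episodes=4,
    horizon=3,
    rng=rng,
)
```

Choosing the policy once per episode keeps each trajectory internally consistent, which is what "per-episode basis" implies in the text.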

Action Selection in Bellman Update: QT-Opt uses CEM to find the optimal action for the Bellman backup target. AWAC, on the other hand, samples from the actor network. We compared both methods as well as two new ones that combine the actor and critic networks: (a) using the actor-predicted action (Gaussian mean) as the initial mean for CEM; (b) using the actor-predicted action as an additional candidate in each round of CEM. We found that option (b) gave the best performance.
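Option (b) can be sketched as a small CEM loop in which the actor's action is appended to the sampled candidates in every round. This is an illustrative NumPy sketch, not the authors' code: the function names, the batched `q_fn(state, actions)` interface, and the default iteration, population, and elite counts are assumptions made for the example.

```python
import numpy as np

def cem_with_actor_candidate(q_fn, state, actor_action, rng,
                             iterations=3, population=64, num_elites=6):
    """Maximize q_fn over actions with CEM, adding the actor's action
    as one extra candidate in every round.

    q_fn(state, actions) takes an (n, action_dim) array and returns n Q-values.
    """
    actor_action = np.asarray(actor_action, dtype=float)
    action_dim = actor_action.shape[0]
    mean, std = np.zeros(action_dim), np.ones(action_dim)

    # Start from the actor's action so the result is never worse than it.
    best_action = actor_action
    best_q = q_fn(state, actor_action[None])[0]

    for _ in range(iterations):
        samples = rng.normal(mean, std, size=(population, action_dim))
        candidates = np.vstack([samples, actor_action[None]])  # extra candidate
        q_values = q_fn(state, candidates)

        elite_idx = np.argsort(q_values)[-num_elites:]  # ascending, best last
        elites = candidates[elite_idx]
        mean, std = elites.mean(axis=0), elites.std(axis=0) + 1e-6

        top = elite_idx[-1]
        if q_values[top] > best_q:
            best_action, best_q = candidates[top], q_values[top]

    return best_action, float(best_q)

def bellman_target(reward, gamma, done, next_q):
    """One-step backup target r + gamma * max_a' Q(s', a')."""
    return reward + gamma * (0.0 if done else 1.0) * next_q

# Toy usage: a quadratic stand-in critic whose maximum is at `target`.
rng = np.random.default_rng(0)
target = np.array([0.5, -0.5])
toy_q = lambda s, a: -np.sum((a - target) ** 2, axis=1)
action, next_q = cem_with_actor_candidate(toy_q, None, np.array([0.4, -0.4]), rng)
y = bellman_target(reward=0.0, gamma=0.9, done=False, next_q=next_q)
```

One property of this variant is that the maximized Q-value is at least as high as the Q-value of the actor's own action, since that action is always among the candidates.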


Our results suggest that AW-Opt can be a powerful IL+RL method for scaling up robotic skill learning. With AW-Opt, we have shown that, depending on task difficulty, a few hours or a few days of human demonstrations plus additional simulated on-policy training can yield high-performing manipulation or navigation policies without task-specific engineering. A compilation of AW-Opt evaluation videos on several tasks is shown below.



The authors would like to give special thanks to Dao Tran, Conrad Villondo, and Clayton Tan for collecting demonstration data, as well as Julian Ibarz and Kanishka Rao for valuable discussions.

The website template was borrowed from Jon Barron.