Abstract

In policy learning for robotic manipulation, sample efficiency is of paramount importance. Thus, learning and extracting more compact representations from camera observations is a promising avenue. However, current methods often assume full observability of the scene and struggle with scale invariance. In many tasks and settings, this assumption does not hold as objects in the scene are often occluded or lie outside the field of view of the camera, rendering the camera observation ambiguous with regard to their location. To tackle this problem, we present BASK, a Bayesian approach to tracking scale-invariant keypoints over time. Our approach can successfully resolve inherent ambiguities in images, enabling keypoint tracking on symmetric, occluded, and out-of-view objects. We employ our method to learn challenging multi-object robot manipulation tasks from wrist camera observations and demonstrate superior utility for policy learning compared to other representation learning techniques. Furthermore, we show outstanding robustness towards disturbances such as clutter, occlusions, and noisy depth measurements, as well as generalization to unseen objects both in simulation and real-world robotic experiments.


How Does It Work?

Figure: Individual camera observations are often ambiguous. For example, from the observation on the left, the rotation of the saucepan cannot be uniquely inferred. When tracking object keypoints, this leads to multimodal localization hypotheses. We overcome this problem by considering the image in context. We find likely correspondences across image scales and then use spatial or temporal context to resolve the ambiguities. Our model further detects when a keypoint is likely not observed, enabling our approach to track occluded objects and objects outside the current field of view as shown on the right.


We generate 3D keypoints as an efficient representation for downstream policy learning. To be applicable to wrist camera observations, these keypoints need to be scale and occlusion invariant. Furthermore, they should be able to track multiple relevant scene objects and represent objects that are temporarily occluded or outside the camera's field of view. Our approach, Bayesian Scene Keypoints (BASK), is two-pronged. First, we find semantic correspondences between images. To this end, we train Dense Object Nets (DON) directly on multi-object scenes in a self-supervised manner for improved scale and occlusion invariance. Comparing a fixed set of reference vectors to the semantic embeddings lets us generate localization hypotheses for keypoints in the scene. Ambiguous images lead to multimodal hypotheses. Second, we integrate these hypotheses over time using a Bayes filter to resolve the ambiguities.
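To make the second step concrete, the following minimal sketch illustrates a recursive Bayesian update over a discrete set of candidate keypoint locations (a histogram filter). The number of candidates, the diffusion constant, and the function names are illustrative assumptions rather than the exact implementation.

```python
# Minimal sketch of a discrete Bayes filter over candidate keypoint locations.
# All names, shapes, and constants are illustrative assumptions.
import torch


def predict(belief: torch.Tensor, diffusion: float = 0.05) -> torch.Tensor:
    """Prediction step: spread a little probability mass uniformly to account
    for possible object motion between frames."""
    n = belief.shape[0]
    return (1.0 - diffusion) * belief + diffusion / n


def update(belief: torch.Tensor, likelihood: torch.Tensor) -> torch.Tensor:
    """Measurement step: weight each candidate by the likelihood derived from
    the (possibly multimodal) descriptor-matching hypotheses."""
    posterior = belief * likelihood
    z = posterior.sum()
    if z < 1e-12:         # keypoint likely unobserved (occluded or out of view):
        return belief     # keep the previous belief instead of resetting it
    return posterior / z


# Usage with placeholder data: start from a uniform belief over N candidate
# locations and fold in one likelihood vector per camera frame.
N = 1024
belief = torch.full((N,), 1.0 / N)
frame_likelihoods = torch.rand(10, N)        # stand-in for per-frame hypotheses
for likelihood in frame_likelihoods:
    belief = update(predict(belief), likelihood)
estimate = belief.argmax()                   # most probable candidate location
```

Because the posterior carries information across frames, a single ambiguous or keypoint-free observation does not corrupt the estimate, which is what enables tracking through occlusions and outside the field of view.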

Figure: Dense Object Nets are pretrained using a self-supervised pixel-wise contrastive loss, optimizing the descriptor distance on scans of static scenes. During downstream policy learning, the descriptors of the current observation are compared to a previously selected set of reference descriptors, and the pixel coordinates of each reference's most likely match are used as the keypoint locations.
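As a rough illustration of this matching step, the sketch below compares a single reference descriptor against a dense descriptor image and returns a similarity heatmap together with the best-matching pixel. The tensor shapes, the choice of cosine similarity, and the softmax temperature are assumptions for illustration, not a definitive implementation.

```python
# Rough sketch of keypoint localization by descriptor matching.
import torch
import torch.nn.functional as F


def localize(descriptor_image: torch.Tensor,   # (D, H, W) dense descriptors
             reference: torch.Tensor,          # (D,) descriptor of one keypoint
             temperature: float = 0.1):
    """Return a per-pixel probability heatmap and the best-matching pixel."""
    d, h, w = descriptor_image.shape
    flat = descriptor_image.reshape(d, h * w)                        # (D, H*W)
    sim = F.cosine_similarity(flat, reference.unsqueeze(1), dim=0)   # (H*W,)
    heatmap = torch.softmax(sim / temperature, dim=0).reshape(h, w)
    best = torch.argmax(sim)
    return heatmap, (best // w, best % w)      # heatmap may be multimodal


# Usage with random placeholder data; a 3D keypoint would then be obtained by
# back-projecting the matched pixel with the depth image and camera intrinsics.
descriptor_image = torch.randn(64, 120, 160)
reference = torch.randn(64)
heatmap, (v, u) = localize(descriptor_image, reference)
```

The heatmap can be interpreted as the per-frame localization hypothesis that is fed into the Bayes filter above; for ambiguous views it will have several modes.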


BASK overcomes the inherent limitations of current representation learning methods with respect to scale invariance and ambiguities, e.g. due to occlusions and a limited field of view. It enables efficient policy learning in multi-object scenes, especially when learning from wrist camera observations on a real robot. Moreover, it provides robustness towards visual clutter and disturbances, as well as effective generalization to unseen objects. This opens up a plethora of applications, from learning with wrist cameras to flexible deployment in homes and in mobile manipulation. Our Bayesian framework is agnostic to both the representation learning method and the policy learning approach. Hence, it can be employed with alternative representations and pretraining schemes as well as novel policy learning methods.

Video

Code and Models

A PyTorch implementation of this project, including trained models, is available in our GitHub repository for academic usage and is released under the GPLv3 license. For any commercial purpose, please contact the authors.

Publication

Jan Ole von Hartz, Eugenio Chisari, Tim Welschehold, Wolfram Burgard, Joschka Boedecker, Abhinav Valada
The Treachery of Images: Bayesian Scene Keypoints for Deep Policy Learning in Robotic Manipulation
IEEE Robotics and Automation Letters (RA-L), 2023.

(Pdf) (Bibtex)


People

Acknowledgements

This work was funded by the BrainLinks-BrainTools center of the University of Freiburg, the Carl Zeiss Foundation with the ReScaLe project, and an academic grant from Nvidia.