Arrow Research search

Author name cluster

Ales Leonardis

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

21 papers
2 author rows

Possible papers (21)

AAAI Conference 2025 Conference Paper

Collaborative Learning for 3D Hand-Object Reconstruction and Compositional Action Recognition from Egocentric RGB Videos Using Superquadrics

  • Tze Ho Elden Tse
  • Runyang Feng
  • Linfang Zheng
  • Jiho Park
  • Yixing Gao
  • Jihie Kim
  • Ales Leonardis
  • Hyung Jin Chang

With the availability of egocentric 3D hand-object interaction datasets, there is increasing interest in developing unified models for hand-object pose estimation and action recognition. However, existing methods still struggle to recognise seen actions on unseen objects due to the limitations in representing object shape and movement using 3D bounding boxes. Additionally, the reliance on object templates at test time limits their generalisability to unseen objects. To address these challenges, we propose to leverage superquadrics as an alternative 3D object representation to bounding boxes and demonstrate their effectiveness on both template-free object reconstruction and action recognition tasks. Moreover, as we find that pure appearance-based methods can outperform the unified methods, the potential benefits from 3D geometric information remain unclear. Therefore, we study the compositionality of actions by considering a more challenging task where the training combinations of verbs and nouns do not overlap with the testing split. We extend H2O and FPHA datasets with compositional splits and design a novel collaborative learning framework that can explicitly reason about the geometric relations between hands and the manipulated object. Through extensive quantitative and qualitative evaluations, we demonstrate significant improvements over the state of the art in (compositional) action recognition.

ICLR Conference 2024 Conference Paper

Multi-task Learning with 3D-Aware Regularization

  • Wei-Hong Li 0001
  • Steven McDonagh 0001
  • Ales Leonardis
  • Hakan Bilen

Deep neural networks have become the standard solution for designing models that can perform multiple dense computer vision tasks such as depth estimation and semantic segmentation thanks to their ability to capture complex correlations in high dimensional feature space across tasks. However, the cross-task correlations that are learned in the unstructured feature space can be extremely noisy and susceptible to overfitting, consequently hurting performance. We propose to address this problem by introducing a structured 3D-aware regularizer which interfaces multiple tasks through the projection of features extracted from an image encoder to a shared 3D feature space and decodes them into their task output space through differentiable rendering. We show that the proposed method is architecture agnostic and can be plugged into various prior multi-task backbones to improve their performance, as we evidence using the standard benchmarks NYUv2 and PASCAL-Context.

IROS Conference 2022 Conference Paper

Conditional Patch-Based Domain Randomization: Improving Texture Domain Randomization Using Natural Image Patches

  • Mohammad Ani
  • Hector Basevi
  • Ales Leonardis

Using Domain Randomized synthetic data for training deep learning systems is a promising approach for addressing the data and the labeling requirements for supervised techniques to bridge the gap between simulation and the real world. We propose a novel approach for generating and applying class-specific Domain Randomization textures by using randomly cropped image patches from real-world data. In evaluation against the current Domain Randomization texture application techniques, our approach outperforms the highest performing technique by 4.94 AP and 6.71 AP when solving object detection and semantic segmentation tasks on the YCB-M [1] real-world robotics dataset. Our approach is a fast and inexpensive way of generating Domain Randomized textures while avoiding the need to handcraft texture distributions currently being used.

AAAI Conference 2022 Conference Paper

Model-Based Image Signal Processors via Learnable Dictionaries

  • Marcos V. Conde
  • Steven McDonagh
  • Matteo Maggioni
  • Ales Leonardis
  • Eduardo Pérez-Pellitero

Digital cameras transform sensor RAW readings into RGB images by means of their Image Signal Processor (ISP). Computational photography tasks such as image denoising and colour constancy are commonly performed in the RAW domain, in part due to the inherent hardware design, but also due to the appealing simplicity of noise statistics that result from the direct sensor readings. Despite this, the availability of RAW images is limited in comparison with the abundance and diversity of available RGB data. Recent approaches have attempted to bridge this gap by estimating the RGB to RAW mapping: handcrafted model-based methods that are interpretable and controllable usually require manual parameter fine-tuning, while end-to-end learnable neural networks require large amounts of training data, at times with complex training procedures, and generally lack interpretability and parametric control. Towards addressing these existing limitations, we present a novel hybrid model-based and data-driven ISP that builds on canonical ISP operations and is both learnable and interpretable. Our proposed invertible model, capable of bidirectional mapping between RAW and RGB domains, employs end-to-end learning of rich parameter representations, i.e., dictionaries, that are free from direct parametric supervision and additionally enable simple and plausible data augmentation. We evidence the value of our data generation process by extensive experiments under both RAW image reconstruction and RAW image denoising tasks, obtaining state-of-the-art performance in both. Additionally, we show that our ISP can learn meaningful mappings from few data samples, and that denoising models trained with our dictionary-based data augmentation are competitive despite having only few or zero ground-truth labels.

IJCAI Conference 2022 Conference Paper

Residual Contrastive Learning for Image Reconstruction: Learning Transferable Representations from Noisy Images

  • Nanqing Dong
  • Matteo Maggioni
  • Yongxin Yang
  • Eduardo Pérez-Pellitero
  • Ales Leonardis
  • Steven McDonagh

This paper is concerned with contrastive learning (CL) for low-level image restoration and enhancement tasks. We propose a new label-efficient learning paradigm based on residuals, residual contrastive learning (RCL), and derive an unsupervised visual representation learning framework, suitable for low-level vision tasks with noisy inputs. While supervised image reconstruction aims to minimize residual terms directly, RCL alternatively builds a connection between residuals and CL by defining a novel instance discrimination pretext task, using residuals as the discriminative feature. Our formulation mitigates the severe task misalignment between instance discrimination pretext tasks and downstream image reconstruction tasks, present in existing CL frameworks. Experimentally, we find that RCL can learn robust and transferable representations that improve the performance of various downstream tasks, such as denoising and super resolution, in comparison with recent self-supervised methods designed specifically for noisy inputs. Additionally, our unsupervised pre-training can significantly reduce annotation costs whilst maintaining performance competitive with fully-supervised image reconstruction.
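The core idea above, treating residuals as the discriminative feature in an instance-discrimination pretext task, rests on a standard InfoNCE-style contrastive loss. The sketch below is illustrative only, not the authors' implementation: it shows a generic InfoNCE loss, with synthetic correlated row pairs standing in for the residual features a reconstruction network would produce.

```python
import numpy as np

def info_nce(anchors, positives, temperature=0.1):
    """InfoNCE: anchors[i]/positives[i] form a positive pair; every
    other row of `positives` serves as a negative for anchors[i]."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature                 # (N, N) cosine similarities
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))             # matched pairs on diagonal

# Toy stand-in for residual features: matched rows share structure,
# mismatched rows do not. In RCL the rows would instead be residuals
# computed from noisy inputs by a reconstruction network.
rng = np.random.default_rng(0)
shared = rng.normal(size=(16, 32))
view_a = shared + 0.05 * rng.normal(size=shared.shape)
view_b = shared + 0.05 * rng.normal(size=shared.shape)
matched_loss = info_nce(view_a, view_b)
shuffled_loss = info_nce(view_a, view_b[rng.permutation(16)])
```

As expected for any useful discriminative feature, the loss is much lower when positive pairs genuinely share structure (`matched_loss`) than when pairings are random (`shuffled_loss`).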

ICRA Conference 2022 Conference Paper

TP-AE: Temporally Primed 6D Object Pose Tracking with Auto-Encoders

  • Linfang Zheng
  • Ales Leonardis
  • Tze Ho Elden Tse
  • Nora Horanyi
  • Hua Chen 0007
  • Wei Zhang 0013
  • Hyung Jin Chang

Fast and accurate tracking of an object's motion is one of the key functionalities of a robotic system for achieving reliable interaction with the environment. This paper focuses on the instance-level six-dimensional (6D) pose tracking problem with a symmetric and textureless object under occlusion. We propose a Temporally Primed 6D pose tracking framework with Auto-Encoders (TP-AE) to tackle the pose tracking problem. The framework consists of a prediction step and a temporally primed pose estimation step. The prediction step aims to quickly and efficiently generate a guess on the object's real-time pose based on historical information about the target object's motion. Once the prior prediction is obtained, the temporally primed pose estimation step embeds the prior pose into the RGB-D input, and leverages auto-encoders to reconstruct the target object with higher quality under occlusion, thus improving the framework's performance. Extensive experiments show that the proposed 6D pose tracking method can accurately estimate the 6D pose of a symmetric and textureless object under occlusion, and significantly outperforms the state of the art on the T-LESS dataset while running in real-time at 26 FPS.

EUMAS Conference 2020 Conference Paper

Integrated Commonsense Reasoning and Deep Learning for Transparent Decision Making in Robotics

  • Tiago Mota
  • Mohan Sridharan
  • Ales Leonardis

A robot’s ability to provide explanatory descriptions of its decisions and beliefs promotes effective collaboration with humans. Providing such transparency in decision making is particularly challenging in integrated robot systems that include knowledge-based reasoning methods and data-driven learning algorithms. Towards addressing this challenge, our architecture couples the complementary strengths of non-monotonic logical reasoning with incomplete commonsense domain knowledge, deep learning, and inductive learning. During reasoning and learning, the architecture enables a robot to provide on-demand explanations of its decisions, beliefs, and the outcomes of hypothetical actions, in the form of relational descriptions of relevant domain objects, attributes, and actions. The architecture’s capabilities are illustrated and evaluated in the context of scene understanding tasks and planning tasks performed using simulated images and images from a physical robot manipulating tabletop objects. Experimental results indicate the ability to reliably acquire and merge new information about the domain in the form of constraints, and to provide accurate explanations in the presence of noisy sensing and actuation.

NeurIPS Conference 2018 Conference Paper

Learning to Exploit Stability for 3D Scene Parsing

  • Yilun Du
  • Zhijian Liu
  • Hector Basevi
  • Ales Leonardis
  • Bill Freeman
  • Josh Tenenbaum
  • Jiajun Wu

Human scene understanding uses a variety of visual and non-visual cues to perform inference on object types, poses, and relations. Physics is a rich and universal cue which we exploit to enhance scene understanding. We integrate the physical cue of stability into the learning process using a REINFORCE approach coupled to a physics engine, and apply this to the problem of producing the 3D bounding boxes and poses of objects in a scene. We first show that applying physics supervision to an existing scene understanding model increases performance, produces more stable predictions, and allows training to an equivalent performance level with fewer annotated training examples. We then present a novel architecture for 3D scene parsing named Prim R-CNN, learning to predict bounding boxes as well as their 3D size, translation, and rotation. With physics supervision, Prim R-CNN outperforms existing scene understanding approaches on this problem. Finally, we show that applying physics supervision on unlabeled real images improves real domain transfer of models trained on synthetic data.

ICRA Conference 2017 Conference Paper

Visual stability prediction for robotic manipulation

  • Wenbin Li 0003
  • Ales Leonardis
  • Mario Fritz

Understanding physical phenomena is a key competence that enables humans and animals to act and interact under uncertain perception in previously unseen environments containing novel objects and their configurations. Developmental psychology has shown that such skills are acquired by infants from observations at a very early stage. In this paper, we contrast a more traditional approach of taking a model-based route with explicit 3D representations and physical simulation by an end-to-end approach that directly predicts stability from appearance. We ask the question if and to what extent and quality such a skill can directly be acquired in a data-driven way — bypassing the need for an explicit simulation at run-time. We present a learning-based approach based on simulated data that predicts stability of towers comprised of wooden blocks under different conditions and quantities related to the potential fall of the towers. We first evaluate the approach on synthetic data and compare the results to human judgments on the same stimuli. Further, we extend this approach to reason about future states of such towers, which in turn enables successful stacking.

ICRA Conference 2016 Conference Paper

Hierarchical spatial model for 2D range data based room categorization

  • Peter Ursic
  • Ales Leonardis
  • Danijel Skocaj
  • Matej Kristan

Next-generation service robots are expected to co-exist with humans in their homes. Such a mobile robot requires an efficient representation of space, which should be compact and expressive, for effective operation in real-world environments. In this paper we present a novel approach for 2D ground-plan-like laser-range-data-based room categorization that builds on a compositional hierarchical representation of space, and show how an additional abstraction layer, whose parts are formed by merging partial views of the environment followed by graph extraction, can achieve improved categorization performance. A new algorithm is presented that finds a dictionary of exemplar elements from a multi-category set, based on the affinity measure defined among pairs of elements. This algorithm is used for part selection in new layer construction. Room categorization experiments have been performed on a challenging publicly available dataset, which has been extended in this work. State-of-the-art results were obtained by achieving the most balanced performance over all categories.

ICRA Conference 2016 Conference Paper

Part-based room categorization for household service robots

  • Peter Ursic
  • Rok Mandeljc
  • Ales Leonardis
  • Matej Kristan

A service robot that operates in a previously-unseen home environment should be able to recognize the functionality of the rooms it visits, such as a living room, a bathroom, etc. We present a novel part-based model and an approach for room categorization using data obtained from a visual sensor. Images are represented with sets of unordered parts that are obtained by object-agnostic region proposals, and encoded using a state-of-the-art image descriptor extractor, a convolutional neural network (CNN). An approach is proposed that learns category-specific discriminative parts for the part-based model. The proposed approach was compared to the state-of-the-art CNN trained specifically for place recognition. Experimental results show that the proposed approach outperforms the holistic CNN by being robust to image degradation, such as occlusions, modifications of image scaling, and aspect changes. In addition, we report non-negligible annotation errors and image duplicates in a popular dataset for place categorization and discuss annotation ambiguities.

IROS Conference 2016 Conference Paper

Task-relevant grasp selection: A joint solution to planning grasps and manipulative motion trajectories

  • Amir M. Ghalamzan E.
  • Nikos Mavrakis
  • Marek Sewer Kopicki
  • Rustam Stolkin
  • Ales Leonardis

This paper addresses the problem of jointly planning both grasps and subsequent manipulative actions. Previously, these two problems have typically been studied in isolation; however, joint reasoning is essential to enable robots to complete real manipulative tasks. In this paper, the two problems are addressed jointly and a solution that takes both into consideration is proposed. To do so, a manipulation capability index is defined, which is a function of both the task execution waypoints and the object grasping contact points. We build on recent state-of-the-art grasp-learning methods, to show how this index can be combined with a likelihood function computed by a probabilistic model of grasp selection, enabling the planning of grasps which have a high likelihood of being stable, but which also maximise the robot's capability to deliver a desired post-grasp task trajectory. We also show how this paradigm can be extended, from a single arm and hand, to enable efficient grasping and manipulation with a bi-manual robot. We demonstrate the effectiveness of the approach using experiments on a simulated as well as a real robot.

ICRA Conference 2014 Conference Paper

A hierarchical approach for joint multi-view object pose estimation and categorization

  • Mete Ozay
  • Krzysztof Walas
  • Ales Leonardis

We propose a joint object pose estimation and categorization approach which extracts information about object poses and categories from the object parts and compositions constructed at different layers of a hierarchical object representation algorithm, namely Learned Hierarchy of Parts (LHOP) [7]. In the proposed approach, we first employ the LHOP to learn hierarchical part libraries which represent entity parts and compositions across different object categories and views. Then, we extract statistical and geometric features from the part realizations of the objects in the images in order to represent the information about object pose and category at each different layer of the hierarchy. Unlike the traditional approaches which consider specific layers of the hierarchies in order to extract information to perform specific tasks, we combine the information extracted at different layers to solve a joint object pose estimation and categorization problem using distributed optimization algorithms. We examine the proposed generative-discriminative learning approach and the algorithms on two benchmark 2-D multi-view image datasets. The proposed approach and the algorithms outperform state-of-the-art classification, regression and feature extraction algorithms. In addition, the experimental results shed light on the relationship between object categorization, pose estimation and the part realizations observed at different layers of the hierarchy.

IROS Conference 2012 Conference Paper

Room classification using a hierarchical representation of space

  • Peter Ursic
  • Matej Kristan
  • Danijel Skocaj
  • Ales Leonardis

Mobile robots need an effective spatial model for successful operation in real-world environments. The model should be compact and simultaneously possess large expressive power. Moreover, it should scale well. In this paper we propose a new hierarchical representation of space, whose compositional structure is learned based on statistically significant observations. We have focused on a two dimensional space, since many robots perceive their surroundings in two dimensions with the use of a laser range finder or a sonar. We also propose the use of a low-level image descriptor for addressing the room classification problem, by which we demonstrate the performance of our representation. Using only the lower layers of the hierarchy, we obtain state-of-the-art classification results on demanding datasets.

ICRA Conference 2010 Conference Paper

Self-supervised cross-modal online learning of basic object affordances for developmental robotic systems

  • Barry Ridge
  • Danijel Skocaj
  • Ales Leonardis

For a developmental robotic system to function successfully in the real world, it is important that it be able to form its own internal representations of affordance classes based on observable regularities in sensory data. Usually successful classifiers are built using labeled training data, but it is not always realistic to assume that labels are available in a developmental robotics setting. There does, however, exist an advantage in this setting that can help circumvent the absence of labels: co-occurrence of correlated data across separate sensory modalities over time. The main contribution of this paper is an online classifier training algorithm based on Kohonen's learning vector quantization (LVQ) that, by taking advantage of this co-occurrence information, does not require labels during training, either dynamically generated or otherwise. We evaluate the algorithm in experiments involving a robotic arm that interacts with various household objects on a table surface where camera systems extract features for two separate visual modalities. It is shown to improve its ability to classify the affordances of novel objects over time, coming close to the performance of equivalent fully-supervised algorithms.
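The paper's contribution is removing the label requirement from LVQ-style training via cross-modal co-occurrence. For contrast, the sketch below shows plain supervised LVQ1 (Kohonen's basic rule: attract the nearest prototype on a correct match, repel it on a mismatch); it is a minimal baseline, not the paper's self-supervised variant, and all names here are illustrative.

```python
import numpy as np

def lvq1_fit(X, y, n_prototypes_per_class=1, lr=0.1, epochs=20, seed=0):
    """Supervised LVQ1: prototypes initialised from random class samples,
    then attracted to same-class inputs and repelled from others."""
    rng = np.random.default_rng(seed)
    protos, proto_y = [], []
    for c in np.unique(y):
        idx = rng.choice(np.flatnonzero(y == c), n_prototypes_per_class,
                         replace=False)
        protos.append(X[idx])
        proto_y.append(np.full(len(idx), c))
    P, Py = np.vstack(protos).astype(float), np.concatenate(proto_y)
    for _ in range(epochs):
        for i in rng.permutation(len(X)):
            j = np.argmin(((P - X[i]) ** 2).sum(axis=1))   # nearest prototype
            step = lr if Py[j] == y[i] else -lr            # attract / repel
            P[j] += step * (X[i] - P[j])
    return P, Py

def lvq1_predict(P, Py, X):
    """Classify each row of X by its nearest prototype's label."""
    d = ((X[:, None, :] - P[None, :, :]) ** 2).sum(axis=2)
    return Py[np.argmin(d, axis=1)]

# Two well-separated clusters: LVQ1 should separate them perfectly.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0.0, 0.3, size=(30, 2)),
               rng.normal(5.0, 0.3, size=(30, 2))])
y = np.array([0] * 30 + [1] * 30)
P, Py = lvq1_fit(X, y)
pred = lvq1_predict(P, Py, X)
```

The self-supervised setting described in the abstract replaces the label check `Py[j] == y[i]` with an agreement signal derived from co-occurring features in a second sensory modality.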

IROS Conference 2009 Conference Paper

A computer vision integration model for a multi-modal cognitive system

  • Alen Vrecko
  • Danijel Skocaj
  • Nick Hawes
  • Ales Leonardis

We present a general method for integrating visual components into a multi-modal cognitive system. The integration is very generic and can work with an arbitrary set of modalities. We illustrate our integration approach with a specific instantiation of the architecture schema that focuses on integration of vision and language: a cognitive system able to collaborate with a human, learn and display some understanding of its surroundings. As examples of cross-modal interaction we describe mechanisms for clarification and visual learning.

NeurIPS Conference 2009 Conference Paper

Evaluating multi-class learning strategies in a generative hierarchical framework for object detection

  • Sanja Fidler
  • Marko Boben
  • Ales Leonardis

Multiple object class learning and detection is a challenging problem due to the large number of object classes and their high visual variability. Specialized detectors usually excel in performance, while joint representations optimize sharing and reduce inference time, but are complex to train. Conveniently, sequential learning of categories cuts down training time by transferring existing knowledge to novel classes, but cannot fully exploit the richness of shareability and might depend on ordering in learning. In hierarchical frameworks these issues have been little explored. In this paper, we show how different types of multi-class learning can be done within one generative hierarchical framework and provide a rigorous experimental analysis of various object class learning strategies as the number of classes grows. Specifically, we propose, evaluate and compare three important types of multi-class learning: 1) independent training of individual categories, 2) joint training of classes, and 3) sequential learning of classes. We explore and compare their computational behavior (space and time) and detection performance as a function of the number of learned classes on several recognition data sets.

IROS Conference 2005 Conference Paper

Panoramic volumes for robot localization

  • Matej Artac
  • Matjaz Jogan
  • Ales Leonardis
  • Hynek Bakstein

We propose a method for visual robot localization using a panoramic image volume as the representation from which we can generate views from virtual viewpoints and match them to the current view. We use a geometric image-based rendering formalism in combination with a subspace representation of images, which allows us to synthesize views at arbitrary virtual viewpoints from a compact low-dimensional representation.

ICRA Conference 2002 Conference Paper

Mobile Robot Localization using an Incremental Eigenspace Model

  • Matej Artac
  • Matjaz Jogan
  • Ales Leonardis

When using appearance-based recognition for self-localization of mobile robots, the images obtained during the exploration of the environment need to be efficiently stored in the memory. PCA offers means for representing the images in a low-dimensional subspace, which allows for efficient matching and recognition. For active exploration it is necessary to use an incremental method for the computation of the subspace. We propose to use an incremental PCA algorithm with the updating of partial image representations in a way that allows the robot to discard the acquired images immediately after the update. Such a model is open-ended, meaning that we can easily update it with new images. We show that the performance of the proposed method is comparable to the performance of the batch method in terms of compression, computational cost and the precision of localization. We also show that by applying repetitive learning, the subspace converges to that constructed with the batch method.
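The key property described above, updating a low-dimensional subspace one image at a time so each image can be discarded after its update, can be sketched with a standard incremental SVD update: keep only the scaled right singular vectors, stack each new centered sample onto them, and re-factorize the small matrix. This is a simplified illustration under my own assumptions (the class name and the approximate handling of the shifting mean are mine), not the paper's exact algorithm.

```python
import numpy as np

class IncrementalPCA:
    """Rank-k subspace updated one sample at a time; samples can be
    discarded after each update. The running mean is exact, but samples
    are centered with the mean available at update time, so the basis
    is approximate when the mean drifts."""

    def __init__(self, k):
        self.k = k
        self.mean = None
        self.components = None   # (<=k, d) orthonormal rows
        self.singular = None
        self.n = 0

    def partial_fit(self, x):
        x = np.asarray(x, dtype=float)
        if self.mean is None:
            self.mean = x.copy()
            self.components = np.zeros((0, x.size))
            self.singular = np.zeros(0)
            self.n = 1
            return self
        old_mean = self.mean
        self.n += 1
        self.mean = old_mean + (x - old_mean) / self.n   # exact running mean
        xc = x - old_mean
        # Stack scaled components with the new centered sample; the SVD of
        # this small (<=k+1, d) matrix updates the dominant right subspace.
        stacked = np.vstack([self.singular[:, None] * self.components,
                             xc[None, :]])
        _, s, vt = np.linalg.svd(stacked, full_matrices=False)
        self.singular = s[:self.k]
        self.components = vt[:self.k]
        return self

    def transform(self, X):
        return (np.asarray(X) - self.mean) @ self.components.T

# Samples lying in an affine 2-plane of R^5 are captured exactly by k=2.
rng = np.random.default_rng(1)
B, t = rng.normal(size=(2, 5)), rng.normal(size=5)
X = t + rng.normal(size=(6, 2)) @ B
ipca = IncrementalPCA(k=2)
for x in X:
    ipca.partial_fit(x)
recon = ipca.mean + ipca.transform(X) @ ipca.components
```

Each update touches only a `(k+1) × d` matrix rather than the whole image history, which is what lets the robot discard images as it explores.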

ICRA Conference 1995 Conference Paper

Grasping Arbitrarily Shaped 3-D Objects from a Pile

  • Marjan Trobina
  • Ales Leonardis

Presents a reliable and robust approach to the problem of grasping arbitrarily shaped 3-D objects from a pile. The approach adheres to the paradigm of purposive vision, which says that one should only extract as much information as is needed to perform a certain task, e.g. grasping, while a complete and precise recovery of the shape of the objects is not necessary. The authors show that planar patches obtained by the recover-and-select paradigm contain enough information to enable generating object hypotheses and to estimate grasping points for the objects. The authors present some results for objects with polyhedral as well as with curved surfaces obtained on real range images.

ICRA Conference 1994 Conference Paper

A Direct Part-Level Segmentation of Range Images Using Volumetric Models

  • Franc Solina
  • Ales Leonardis
  • Alenka Macerl

Volumetric part models play an important part in robotic applications such as grasping, path planning, object avoidance, and modeling kinematic chains. The authors present a novel method for reliable and efficient recovery of part-descriptions in terms of superquadric models from range images. In contrast to usual approaches which perform the recovery of volumetric models in several steps (from curves, surfaces to volumes), the authors show that a direct recovery is possible. This is achieved by combining two existing methods: the recover-and-select paradigm and the recovery of superquadric models. A redundant set of superquadrics is initiated in the image and only the recovered models resulting in the simplest overall description are selected. The authors show results on several real range images.