Arrow Research search

Author name cluster

Niki Trigoni

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

42 papers
2 author rows

Possible papers (42)

NeurIPS 2025 · Conference Paper

COOPERA: Continual Open-Ended Human-Robot Assistance

  • Chenyang Ma
  • Kai Lu
  • Ruta Desai
  • Xavier Puig
  • Andrew Markham
  • Niki Trigoni

To understand and collaborate with humans, robots must account for individual human traits, habits, and activities over time. However, most robotic assistants lack these abilities, as they primarily focus on predefined tasks in structured environments and lack a human model to learn from. This work introduces COOPERA, a novel framework for COntinual, OPen-Ended human-Robot Assistance, where simulated humans, driven by psychological traits and long-term intentions, interact with robots in complex environments. By integrating continuous human feedback, our framework, for the first time, enables the study of long-term, open-ended human-robot collaboration (HRC) in different collaborative tasks across various time-scales. Within COOPERA, we introduce a benchmark and an approach to personalize the robot's collaborative actions by learning human traits and context-dependent intents. Experiments validate the extent to which our simulated humans reflect realistic human behaviors and demonstrate the value of inferring and personalizing to human intents for open-ended and long-term HRC.

IROS 2025 · Conference Paper

Ray Visual Odometry

  • Fanqi Xu
  • Yasin Almalioglu
  • Niki Trigoni

Learning-based Visual Odometry (VO) has seen significant advancements over the past decades. However, all existing methods rely on the six degrees of freedom (6-DoF) representation for pose prediction, which is sparse and less conducive to neural network learning. In this work, we introduce a novel dense and distributed representation by modeling VO as ray bundles, referred to as RayVO. This richly parameterized representation is tightly coupled with corresponding spatial features, making it highly effective for neural learning. Additionally, the ray-based approach enables simultaneous prediction of both intrinsic and extrinsic parameters. To prove its effectiveness against the traditional 6-DoF representation, we propose three specialized loss functions for training with rays: a ray-based loss, a 6-DoF-based loss and a hybrid loss. We extensively evaluate RayVO on both indoor and outdoor benchmark datasets and show that it outperforms state-of-the-art VO methods.

NeurIPS 2025 · Conference Paper

Target Speaker Extraction through Comparing Noisy Positive and Negative Audio Enrollments

  • Shitong Xu
  • Yiyuan Yang
  • Niki Trigoni
  • Andrew Markham

Target speaker extraction focuses on isolating a specific speaker's voice from an audio mixture containing multiple speakers. To provide information about the target speaker's identity, prior works have utilized clean audio samples as conditioning inputs. However, such clean audio examples are not always readily available. For instance, obtaining a clean recording of a stranger's voice at a cocktail party without leaving the noisy environment is generally infeasible. Limited prior research has explored extracting the target speaker's characteristics from noisy enrollments, which may contain overlapping speech from interfering speakers. In this work, we explore a novel enrollment strategy that encodes target speaker information from the noisy enrollment by comparing segments where the target speaker is talking (Positive Enrollments) with segments where the target speaker is silent (Negative Enrollments). Experiments show the effectiveness of our model architecture, which achieves over 2.1 dB higher SI-SNRi compared to prior works in extracting the monaural speech from the mixture of two speakers. Additionally, the proposed two-stage training strategy accelerates convergence, reducing the number of optimization steps required to reach 3 dB SNR by 60%. Overall, our method achieves state-of-the-art performance in monaural target speaker extraction conditioned on noisy enrollments. Our implementation is available at https://github.com/xu-shitong/TSE-through-Positive-Negative-Enroll.
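
As a reference for the SI-SNRi figures quoted above, here is a minimal numpy sketch of the underlying scale-invariant SNR metric; this is the standard definition, not code from the paper.

```python
import numpy as np

def si_snr(estimate: np.ndarray, target: np.ndarray, eps: float = 1e-8) -> float:
    """Scale-invariant signal-to-noise ratio (dB) between 1-D waveforms."""
    estimate = estimate - estimate.mean()
    target = target - target.mean()
    # Project the estimate onto the target, removing any scale mismatch.
    s_target = (np.dot(estimate, target) / (np.dot(target, target) + eps)) * target
    e_noise = estimate - s_target
    return 10.0 * np.log10((np.dot(s_target, s_target) + eps)
                           / (np.dot(e_noise, e_noise) + eps))
```

SI-SNRi is then simply si_snr(output, target) - si_snr(mixture, target).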

ICRA 2024 · Conference Paper

Dusk Till Dawn: Self-supervised Nighttime Stereo Depth Estimation using Visual Foundation Models

  • Madhu Vankadari
  • Samuel Hodgson
  • Sangyun Shin
  • Kaichen Zhou
  • Andrew Markham
  • Niki Trigoni

Self-supervised depth estimation algorithms rely heavily on frame-warping relationships and exhibit substantial performance degradation when applied in challenging circumstances, such as low-visibility and nighttime scenarios with varying illumination conditions. Addressing this challenge, we introduce an algorithm designed to achieve accurate self-supervised stereo depth estimation, focusing on nighttime conditions. Specifically, we use pretrained visual foundation models to extract generalised features across challenging scenes and present an efficient method for matching and integrating these features from stereo frames. Moreover, to prevent pixels that violate the photometric consistency assumption from negatively affecting the depth predictions, we propose a novel masking approach designed to filter out such pixels. Lastly, addressing weaknesses in the evaluation of current depth estimation algorithms, we present novel evaluation metrics. Our experiments, conducted on challenging datasets including Oxford RobotCar and MultiSpectral Stereo, demonstrate the robust improvements realized by our approach.
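
For intuition, a minimal PyTorch sketch of masking out pixels that violate the photometric consistency assumption, using the common auto-masking heuristic of keeping only pixels where warping beats doing nothing; the paper's actual masking approach is its own design and may differ.

```python
import torch

def masked_photometric_loss(warped: torch.Tensor, target: torch.Tensor,
                            identity_error: torch.Tensor) -> torch.Tensor:
    # warped, target: (B, C, H, W); identity_error: (B, H, W), the per-pixel
    # error between the *unwarped* source frame and the target frame.
    warp_error = (warped - target).abs().mean(dim=1)        # (B, H, W)
    # Keep only pixels where warping reduces the error; violators of the
    # static-scene assumption (moving objects, light changes) fail this test.
    mask = (warp_error < identity_error).float()
    return (warp_error * mask).sum() / mask.sum().clamp(min=1.0)
```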

AAAI 2024 · Conference Paper

SoundCount: Sound Counting from Raw Audio with Dyadic Decomposition Neural Network

  • Yuhang He
  • Zhuangzhuang Dai
  • Niki Trigoni
  • Long Chen
  • Andrew Markham

In this paper, we study an underexplored, yet important and challenging problem: counting the number of distinct sounds in raw audio characterized by a high degree of polyphonicity. We do so by systematically proposing a novel end-to-end trainable neural network (which we call DyDecNet, consisting of a dyadic decomposition front-end and a backbone network), and by quantifying the difficulty level of counting depending on sound polyphonicity. The dyadic decomposition front-end progressively decomposes the raw waveform dyadically along the frequency axis to obtain a time-frequency representation in a multi-stage, coarse-to-fine manner. Each intermediate waveform convolved by a parent filter is further processed by a pair of child filters that evenly split the parent filter's carried frequency response, with the higher-half child filter encoding the detail and the lower-half child filter encoding the approximation. We further introduce an energy gain normalization to normalize sound loudness variance and spectrum overlap, and apply it to each intermediate parent waveform before feeding it to the two child filters. To better quantify the sound counting difficulty level, we further design three polyphony-aware metrics: polyphony ratio, max polyphony and mean polyphony. We test DyDecNet on various datasets to show its superiority, and we further show that the dyadic decomposition network can be used as a general front-end to tackle other acoustic tasks.
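
For illustration, one plausible reading of the three polyphony-aware metrics, inferred from their names rather than taken from the paper's definitions.

```python
import numpy as np

def polyphony_metrics(events, duration, dt=0.01):
    """events: list of (onset, offset) pairs in seconds for one recording."""
    t = np.arange(0.0, duration, dt)
    count = np.zeros_like(t)
    for onset, offset in events:
        count[(t >= onset) & (t < offset)] += 1      # sounds active per frame
    active = count > 0
    if not active.any():
        return {"polyphony_ratio": 0.0, "max_polyphony": 0, "mean_polyphony": 0.0}
    return {
        "polyphony_ratio": float(np.mean(count[active] > 1)),  # overlapping share
        "max_polyphony": int(count.max()),
        "mean_polyphony": float(count[active].mean()),
    }
```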

NeurIPS 2024 · Conference Paper

SpatialPIN: Enhancing Spatial Reasoning Capabilities of Vision-Language Models through Prompting and Interacting 3D Priors

  • Chenyang Ma
  • Kai Lu
  • Ta-Ying Cheng
  • Niki Trigoni
  • Andrew Markham

Current state-of-the-art spatial reasoning-enhanced VLMs are trained to excel at spatial visual question answering (VQA). However, we believe that higher-level 3D-aware tasks, such as articulating dynamic scene changes and motion planning, require a fundamental and explicit 3D understanding beyond current spatial VQA datasets. In this work, we present SpatialPIN, a framework designed to enhance the spatial reasoning capabilities of VLMs through prompting and interacting with priors from multiple 3D foundation models in a zero-shot, training-free manner. Extensive experiments demonstrate that our spatial reasoning-imbued VLM performs well on various forms of spatial VQA and can extend to help in various downstream robotics tasks such as pick and stack and trajectory planning.

NeurIPS 2024 · Conference Paper

Towards Learning Group-Equivariant Features for Domain Adaptive 3D Detection

  • Sangyun Shin
  • Yuhang He
  • Madhu Vankadari
  • Ta-Ying Cheng
  • Qian Xie
  • Andrew Markham
  • Niki Trigoni

The performance of 3D object detection in large outdoor point clouds deteriorates significantly in an unseen environment due to the inter-domain gap. To address this, most existing methods for domain adaptation harness self-training schemes and attempt to bridge the gap by focusing on a single factor that causes the inter-domain gap, such as objects' sizes, shapes, and foreground density variation. However, the resulting adaptations suggest that there is still a substantial inter-domain gap left to be minimized. We argue that this is due to two limitations: 1) biased pseudo-label collection from self-training; 2) multiple factors jointly contributing to how the object is perceived in the unseen target domain. In this work, we propose a grouping-exploration strategy framework, Group Explorer Domain Adaptation (GroupEXP-DA), to address these two issues. Specifically, our grouping divides the available label sets into multiple clusters and ensures all of them receive equal learning attention with the group-equivariant spatial feature, avoiding imbalance problems caused by dominant object types. Moreover, grouping learns to divide objects by considering inherent factors in a data-driven manner, rather than considering each factor separately as existing works do. On top of the group-equivariant spatial feature that selectively detects objects similar to the input group, we additionally introduce an explorative group update strategy that reduces false negative detections in the target domain, further reducing the inter-domain gap. During inference, only the learned group features are necessary for computing the group-equivariant spatial feature, making our method a simple add-on applicable to most existing detectors. We show how each module contributes to substantially bridging the inter-domain gaps compared to existing works across large urban outdoor datasets such as nuScenes, Waymo, and KITTI.

IROS 2024 · Conference Paper

WSCLoc: Weakly-Supervised Sparse-View Camera Relocalization via Radiance Field

  • Jialu Wang
  • Kaichen Zhou
  • Andrew Markham
  • Niki Trigoni

Despite the advancements in deep learning for camera relocalization tasks, obtaining the ground truth pose labels required for training remains costly. While current weakly supervised methods excel in lightweight label generation, their performance notably declines in scenarios with sparse views. In response to this challenge, we introduce WSCLoc, a system that can be customized to various deep learning-based relocalization models to enhance their performance under weakly-supervised and sparse-view conditions. This is realized in two stages. In the initial stage, WSCLoc employs a multilayer perceptron-based structure called WFT-NeRF to co-optimize image reconstruction quality and initial pose information. To ensure a stable learning process, we incorporate temporal information as input. Furthermore, instead of optimizing SE(3), we opt for Sim(3) optimization to explicitly enforce a scale constraint. In the second stage, we co-optimize the pre-trained WFT-NeRF and WFT-Pose. This optimization is enhanced by Time-Encoding based Random View Synthesis and supervised by inter-frame geometric constraints that consider pose, depth, and RGB information. We validate our approaches on two publicly available datasets, one outdoor and one indoor. Our experimental results demonstrate that our weakly-supervised relocalization solutions achieve superior pose estimation accuracy in sparse-view scenarios, comparable to state-of-the-art camera relocalization methods. We will make our code publicly available.

NeurIPS 2023 · Conference Paper

DynPoint: Dynamic Neural Point For View Synthesis

  • Kaichen Zhou
  • Jia-Xing Zhong
  • Sangyun Shin
  • Kai Lu
  • Yiyuan Yang
  • Andrew Markham
  • Niki Trigoni

The introduction of neural radiance fields has greatly improved the effectiveness of view synthesis for monocular videos. However, existing algorithms face difficulties when dealing with uncontrolled or lengthy scenarios, and require extensive training time specific to each new scenario. To tackle these limitations, we propose DynPoint, an algorithm designed to facilitate the rapid synthesis of novel views for unconstrained monocular videos. Rather than encoding the entirety of the scenario information into a latent representation, DynPoint concentrates on predicting the explicit 3D correspondence between neighboring frames to realize information aggregation. Specifically, this correspondence prediction is achieved through the estimation of consistent depth and scene flow information across frames. Subsequently, the acquired correspondence is utilized to aggregate information from multiple reference frames to a target frame, by constructing hierarchical neural point clouds. The resulting framework enables swift and accurate view synthesis for desired views of target frames. Our experimental results demonstrate that the proposed method considerably accelerates training, typically by an order of magnitude, while yielding outcomes comparable to prior approaches. Furthermore, our method exhibits strong robustness in handling long-duration videos without learning a canonical representation of video content.

NeurIPS 2023 · Conference Paper

Multi-body SE(3) Equivariance for Unsupervised Rigid Segmentation and Motion Estimation

  • Jia-Xing Zhong
  • Ta-Ying Cheng
  • Yuhang He
  • Kai Lu
  • Kaichen Zhou
  • Andrew Markham
  • Niki Trigoni

A truly generalizable approach to rigid segmentation and motion estimation is fundamental to 3D understanding of articulated objects and moving scenes. In view of the closely intertwined relationship between segmentation and motion estimates, we present an SE(3) equivariant architecture and a training strategy to tackle this task in an unsupervised manner. Our architecture is composed of two interconnected, lightweight heads. These heads predict segmentation masks using point-level invariant features and estimate motion from SE(3) equivariant features, all without the need for category information. Our training strategy is unified and can be implemented online, which jointly optimizes the predicted segmentation and motion by leveraging the interrelationships among scene flow, segmentation mask, and rigid transformations. We conduct experiments on four datasets to demonstrate the superiority of our method. The results show that our method excels in both model performance and computational efficiency, with only 0.25M parameters and 0.92G FLOPs. To the best of our knowledge, this is the first work designed for category-agnostic part-level SE(3) equivariance in dynamic point clouds.

IROS 2023 · Conference Paper

RADA: Robust Adversarial Data Augmentation for Camera Localization in Challenging Conditions

  • Jialu Wang
  • Muhamad Risqi Utama Saputra
  • Chris Xiaoxuan Lu
  • Niki Trigoni
  • Andrew Markham

Camera localization is a fundamental problem for many applications in computer vision, robotics, and autonomy. Despite recent deep learning-based approaches, a lack of robustness persists in challenging conditions, due to changes in appearance caused by texture-less planes, repeating structures, reflective surfaces, motion blur, and illumination changes. Data augmentation is an attractive solution, but standard image perturbation methods fail to improve localization robustness. To address this, we propose RADA, which concentrates on perturbing the most vulnerable pixels, generating relatively small image perturbations that still perplex the network. Our method outperforms previous augmentation techniques, achieving up to twice the accuracy of state-of-the-art models even under ‘unseen’ challenging weather conditions. Videos of our results can be found at https://youtu.be/niOv7-fJeCA. The source code for RADA is publicly available at https://github.com/jialuwang123321/RADA.
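
A hypothetical PyTorch sketch of the core idea of perturbing only the most vulnerable pixels, read here as an FGSM-style step masked to the highest-gradient locations; RADA's actual perturbation scheme may differ.

```python
import torch

def vulnerable_pixel_augment(model, image, pose_gt, loss_fn, eps=0.02, frac=0.05):
    """Perturb the `frac` most gradient-sensitive pixels of `image`."""
    image = image.clone().detach().requires_grad_(True)
    loss_fn(model(image), pose_gt).backward()
    grad = image.grad.abs().sum(dim=1, keepdim=True)          # (B, 1, H, W)
    # Threshold at the k-th largest gradient magnitude per image.
    k = max(1, int(frac * grad.shape[-1] * grad.shape[-2]))
    thresh = grad.flatten(1).topk(k, dim=1).values[:, -1].view(-1, 1, 1, 1)
    mask = (grad >= thresh).float()
    perturbed = image + eps * image.grad.sign() * mask        # FGSM-style step
    return perturbed.detach().clamp(0.0, 1.0)
```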

ICRA 2023 · Conference Paper

Sample, Crop, Track: Self-Supervised Mobile 3D Object Detection for Urban Driving LiDAR

  • Sangyun Shin
  • Stuart Golodetz
  • Madhu Vankadari
  • Kaichen Zhou
  • Andrew Markham
  • Niki Trigoni

Deep learning has led to great progress in the detection of mobile (i.e. movement-capable) objects in urban driving scenes in recent years. Supervised approaches typically require the annotation of large training sets; there has thus been great interest in leveraging weakly-, semi- or self-supervised methods to avoid this, with much success. Whilst weakly and semi-supervised methods require some annotation, self-supervised methods have used cues such as motion to relieve the need for annotation altogether. However, a complete absence of annotation typically degrades their performance, and ambiguities that arise during motion grouping can inhibit their ability to find accurate object boundaries. In this paper, we propose a new self-supervised mobile object detection approach called SCT. This uses both motion cues and expected object sizes to improve detection performance, and predicts a dense grid of 3D oriented bounding boxes to improve object discovery. We significantly outperform the state-of-the-art self-supervised mobile object detection method TCR on the KITTI tracking benchmark, and achieve performance that is within 30% of the fully supervised PV-RCNN++ method for IoUs ≤ 0.5. Our source code will be made available online.

IROS 2022 · Conference Paper

DeepCIR: Insights into CIR-based Data-driven UWB Error Mitigation

  • Vu Tran
  • Zhuangzhuang Dai
  • Niki Trigoni
  • Andrew Markham

Ultra-Wide-Band (UWB) ranging sensors have been widely adopted for robotic navigation thanks to their extremely high bandwidth and hence high resolution. However, off-the-shelf devices may output ranges with significant errors in cluttered, severe non-line-of-sight (NLOS) environments. Recently, neural networks have been actively studied to improve the ranging accuracy of UWB sensors using the channel impulse response (CIR) as input. However, previous works have not systematically evaluated the efficacy of various packet types and their possible combinations in a two-way-ranging transaction, including poll, response and final packets. In this paper, we first investigate the utility of different packet types and their combinations when used as input for a neural network. Second, we propose two novel data-driven approaches, namely FMCIR and WMCIR, that leverage two-sided CIRs for efficient UWB error mitigation. Our approaches outperform the state of the art by a significant margin, further reducing range errors by up to 45%. Finally, we create and release a dataset of transaction-level synchronized CIRs (each sample consists of the CIR of the poll, response and final packets), which will enable further studies in this area.

AAAI 2022 · Conference Paper

Pose Adaptive Dual Mixup for Few-Shot Single-View 3D Reconstruction

  • Ta-Ying Cheng
  • Hsuan-Ru Yang
  • Niki Trigoni
  • Hwann-Tzong Chen
  • Tyng-Luh Liu

We present a pose adaptive few-shot learning procedure and a two-stage data interpolation regularization, termed Pose Adaptive Dual Mixup (PADMix), for single-image 3D reconstruction. While augmentations via interpolating feature-label pairs are effective in classification tasks, they fall short in shape prediction, potentially due to inconsistencies between interpolated products of two images and volumes when rendering viewpoints are unknown. PADMix targets this issue with two sets of mixup procedures performed sequentially. We first perform an input mixup which, combined with a pose adaptive learning procedure, is helpful in learning 2D feature extraction and pose adaptive latent encoding. The stagewise training allows us to build upon the pose invariant representations to perform a follow-up latent mixup under one-to-one correspondences between features and ground-truth volumes. PADMix significantly outperforms previous literature on few-shot settings over the ShapeNet dataset and sets new benchmarks on the more challenging real-world Pix3D dataset.
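
For reference, the basic feature-label interpolation that both of PADMix's mixup stages build on; the pose-adaptive machinery is omitted here.

```python
import torch

def mixup(x1, x2, y1, y2, alpha=0.4):
    # Sample an interpolation coefficient and blend inputs and labels alike.
    lam = torch.distributions.Beta(alpha, alpha).sample()
    return lam * x1 + (1 - lam) * x2, lam * y1 + (1 - lam) * y2
```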

IROS 2022 · Conference Paper

Real-Time Hybrid Mapping of Populated Indoor Scenes using a Low-Cost Monocular UAV

  • Stuart Golodetz
  • Madhu Vankadari
  • Aluna Everitt
  • Sangyun Shin
  • Andrew Markham
  • Niki Trigoni

Unmanned aerial vehicles (UAVs) have been used for many applications in recent years, from urban search and rescue, to agricultural surveying, to autonomous underground mine exploration. However, deploying UAVs in tight, indoor spaces, especially close to humans, remains a challenge. One solution, when limited payload is required, is to use micro-UAVs, which pose less risk to humans and typically cost less to replace after a crash. However, micro-UAVs can only carry a limited sensor suite, e.g. a monocular camera instead of a stereo pair or LiDAR, complicating tasks like dense mapping and markerless multi-person 3D human pose estimation, which are needed to operate in tight environments around people. Monocular approaches to such tasks exist, and dense monocular mapping approaches have been successfully deployed for UAV applications. However, despite many recent works on both marker-based and markerless multi-UAV single-person motion capture, markerless single-camera multi-person 3D human pose estimation remains a much earlier-stage technology, and we are not aware of existing attempts to deploy it in an aerial context. In this paper, we present what is thus, to our knowledge, the first system to perform simultaneous mapping and multi-person 3D human pose estimation from a monocular camera mounted on a single UAV. In particular, we show how to loosely couple state-of-the-art monocular depth estimation and monocular 3D human pose estimation approaches to reconstruct a hybrid map of a populated indoor scene in real time. We validate our component-level design choices via extensive experiments on the large-scale ScanNet and GTA-IM datasets. To evaluate our system-level performance, we also construct a new Oxford Hybrid Mapping dataset of populated indoor scenes.

ICRA 2021 · Conference Paper

3D Motion Capture of an Unmodified Drone with Single-chip Millimeter Wave Radar

  • Peijun Zhao
  • Chris Xiaoxuan Lu
  • Bing Wang 0013
  • Niki Trigoni
  • Andrew Markham

Accurate motion capture of aerial robots in 3D is a key enabler for autonomous operation in indoor environments such as warehouses or factories, as well as driving forward research in these areas. The most commonly used solutions at present are optical motion capture (e.g. VICON) and Ultra-wideband (UWB), but these are costly and cumbersome to deploy, due to their requirement of multiple cameras/anchors spaced around the tracking area. They also require the drone to be modified to carry an active or passive marker. In this work, we present an inexpensive system that can be rapidly installed, based on single-chip millimeter wave (mmWave) radar. Importantly, the drone does not need to be modified or equipped with any markers, as we exploit the Doppler signals from the rotating propellers. Furthermore, 3D tracking is possible from a single point, greatly simplifying deployment. We develop a novel deep neural network and demonstrate decimeter-level 3D tracking at 10 Hz, achieving better performance than classical baselines. Our hope is that this low-cost system will act to catalyse inexpensive drone research and increased autonomy.

ICRA 2021 · Conference Paper

RadarLoc: Learning to Relocalize in FMCW Radar

  • Wei Wang 0226
  • Pedro P. B. de Gusmao
  • Bo Yang 0027
  • Andrew Markham
  • Niki Trigoni

Relocalization is a fundamental task in the fields of robotics and computer vision. There is considerable work in deep camera relocalization, which directly estimates poses from raw images. However, learning-based methods have not yet been applied to radar sensory data. In this work, we investigate how to exploit deep learning to predict global poses from emerging Frequency-Modulated Continuous Wave (FMCW) radar scans. Specifically, we propose a novel end-to-end neural network with self-attention, termed RadarLoc, which is able to estimate 6-DoF global poses directly. We also propose to improve the localization performance by utilizing geometric constraints between radar scans. We validate our approach on the recently released challenging outdoor dataset Oxford Radar RobotCar. Comprehensive experiments demonstrate that the proposed method outperforms radar-based localization and deep camera relocalization methods by a significant margin.

ICML 2021 · Conference Paper

SoundDet: Polyphonic Moving Sound Event Detection and Localization from Raw Waveform

  • Yuhang He
  • Niki Trigoni
  • Andrew Markham

We present SoundDet, a new end-to-end trainable and light-weight framework for polyphonic moving sound event detection and localization. Prior methods typically approach this problem by preprocessing the raw waveform into time-frequency representations, which are more amenable to processing with well-established image processing pipelines. Prior methods also detect in a segment-wise manner, leading to incomplete and partial detections. SoundDet takes a novel approach: it directly consumes the raw, multichannel waveform and treats the spatio-temporal sound event as a complete "sound-object" to be detected. Specifically, SoundDet consists of a backbone neural network and two parallel heads for temporal detection and spatial localization, respectively. Given the large sampling rate of the raw waveform, the backbone network first learns a phase-sensitive and frequency-selective bank of filters to explicitly retain direction-of-arrival information, whilst being far more computationally and parametrically efficient than standard 1D/2D convolution. A dense sound event proposal map is then constructed to handle the challenge of predicting events with widely varying temporal duration. Accompanying the dense proposal map are a temporal overlapness map and a motion smoothness map that measure a proposal's confidence of being an event from the perspectives of temporal detection accuracy and movement consistency. Involving the two maps guarantees that SoundDet is trained in a spatio-temporally unified manner. Experimental results on the public DCASE dataset show the advantage of SoundDet on both segment-based evaluation and our newly proposed event-based evaluation system.

AAAI 2021 · Conference Paper

VMLoc: Variational Fusion For Learning-Based Multimodal Camera Localization

  • Kaichen Zhou
  • Changhao Chen
  • Bing Wang
  • Muhamad Risqi U. Saputra
  • Niki Trigoni
  • Andrew Markham

Recent learning-based approaches have achieved impressive results in the field of single-shot camera localization. However, how best to fuse multiple modalities (e.g. image and depth) and to deal with degraded or missing input is less well studied. In particular, we note that previous approaches towards deep fusion do not perform significantly better than models employing a single modality. We conjecture that this is because of naive approaches to feature space fusion through summation or concatenation, which do not take into account the different strengths of each modality. To address this, we propose an end-to-end framework, termed VMLoc, to fuse different sensor inputs into a common latent space through a variational Product-of-Experts (PoE) followed by attention-based fusion. Unlike previous multimodal variational works that directly adapt the objective function of the vanilla variational auto-encoder, we show how camera localization can be accurately estimated through an unbiased objective function based on importance weighting. Our model is extensively evaluated on RGB-D datasets and the results prove the efficacy of our model. The source code is available at https://github.com/Zalex97/VMLoc.
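
For intuition, a textbook Product-of-Experts fusion of diagonal Gaussians, in which precisions simply add; VMLoc layers attention-based fusion and an importance-weighted objective on top of this basic operation.

```python
import torch

def product_of_experts(mus, logvars):
    """Fuse per-modality Gaussian posteriors N(mu_m, var_m) into one Gaussian."""
    precisions = [torch.exp(-lv) for lv in logvars]   # 1 / var_m
    total_precision = sum(precisions)
    mu = sum(p * m for p, m in zip(precisions, mus)) / total_precision
    return mu, -torch.log(total_precision)            # fused mean, log-variance
```

A degraded modality then contributes little automatically: a large predicted variance means a small precision, and hence a small weight in the fused estimate.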

AAAI 2020 · Conference Paper

AtLoc: Attention Guided Camera Localization

  • Bing Wang
  • Changhao Chen
  • Chris Xiaoxuan Lu
  • Peijun Zhao
  • Niki Trigoni
  • Andrew Markham

Deep learning has achieved impressive results in camera localization, but current single-image techniques typically suffer from a lack of robustness, leading to large outliers. To some extent, this has been tackled by sequential (multi-image) or geometry-constraint approaches, which can learn to reject dynamic objects and illumination conditions to achieve better performance. In this work, we show that attention can be used to force the network to focus on more geometrically robust objects and features, achieving state-of-the-art performance on common benchmarks even when using only a single image as input. Extensive experimental evidence is provided through public indoor and outdoor datasets. Through visualization of the saliency maps, we demonstrate how the network learns to reject dynamic objects, yielding superior global camera pose regression performance. The source code is available at https://github.com/BingCS/AtLoc.

ICRA 2020 · Conference Paper

Heart Rate Sensing with a Robot Mounted mmWave Radar

  • Peijun Zhao
  • Chris Xiaoxuan Lu
  • Bing Wang 0013
  • Changhao Chen
  • Linhai Xie
  • Mengyu Wang
  • Niki Trigoni
  • Andrew Markham

Heart rate monitoring at home is useful for assessing the health of, e.g., the elderly or patients in post-operative recovery. Although non-contact heart rate monitoring has been widely explored, typically using a static, wall-mounted device, measurements are limited to a single room and sensitive to user orientation and position. In this work, we propose mBeats, a robot-mounted millimeter wave (mmWave) radar system that provides periodic heart rate measurements under different user poses, without interfering with a user's daily activities. mBeats contains a mmWave servoing module that adaptively adjusts the sensor angle to the best reflection profile. Furthermore, mBeats features a deep neural network predictor, which can estimate heart rate from the lower leg and additionally provides estimation uncertainty. Through extensive experiments, we demonstrate accurate and robust operation of mBeats in a range of scenarios. We believe that by integrating mobility and adaptability, mBeats can empower many downstream healthcare applications at home, such as palliative care, post-operative rehabilitation and telemedicine.

ICRA 2020 · Conference Paper

SnapNav: Learning Mapless Visual Navigation with Sparse Directional Guidance and Visual Reference

  • Linhai Xie
  • Andrew Markham
  • Niki Trigoni

Learning-based visual navigation remains a challenging problem in robotics, with two overarching issues: how to transfer the learnt policy to unseen scenarios, and how to deploy the system on real robots. In this paper, we propose a deep neural network based visual navigation system, SnapNav. Unlike map-based navigation or Visual-Teach-and-Repeat (VT&R), SnapNav only receives a few snapshots of the environment combined with directional guidance to allow it to execute the navigation task. Additionally, SnapNav can be easily deployed on real robots due to a two-level hierarchy: a high-level commander that provides directional commands and a low-level controller that provides real-time control and obstacle avoidance. This also allows us to effectively use simulated and real data to train the different layers of the hierarchy, facilitating robust control. Extensive experimental results show that SnapNav achieves a highly autonomous navigation ability compared to baseline models, enabling sparse, map-less navigation in previously unseen environments.

IROS 2019 · Conference Paper

DeepPCO: End-to-End Point Cloud Odometry through Deep Parallel Neural Network

  • Wei Wang 0226
  • Muhamad Risqi Utama Saputra
  • Peijun Zhao
  • Pedro P. B. de Gusmao
  • Bo Yang 0027
  • Changhao Chen
  • Andrew Markham
  • Niki Trigoni

Odometry is of key importance for localization in the absence of a map. There is considerable work in the area of visual odometry (VO), and recent advances in deep learning have brought novel approaches to VO, which directly learn salient features from raw images. These learning-based approaches have led to more accurate and robust VO systems. However, they have not yet been applied well to point cloud data. In this work, we investigate how to exploit deep learning to estimate point cloud odometry (PCO), which may serve as a critical component in point cloud-based downstream tasks or learning-based systems. Specifically, we propose a novel end-to-end deep parallel neural network called DeepPCO, which can estimate 6-DOF poses using consecutive point clouds. It consists of two parallel sub-networks that estimate 3D translation and orientation respectively, rather than a single neural network. We validate our approach on the KITTI Visual Odometry/SLAM benchmark dataset against different baselines. Experiments demonstrate that the proposed approach achieves good performance in terms of pose accuracy.
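
A toy sketch of the parallel-head idea: a shared encoder feeding separate translation and orientation regressors. Dimensions and the flattened input are placeholders, not DeepPCO's actual architecture.

```python
import torch
import torch.nn as nn

class DualBranchOdometry(nn.Module):
    def __init__(self, in_dim=1024, feat_dim=256):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, feat_dim), nn.ReLU())
        self.trans_head = nn.Linear(feat_dim, 3)   # 3D translation
        self.rot_head = nn.Linear(feat_dim, 3)     # orientation, e.g. Euler angles

    def forward(self, x):                          # x: encoded point-cloud pair
        f = self.encoder(x)
        return self.trans_head(f), self.rot_head(f)
```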

ICRA 2019 · Conference Paper

GANVO: Unsupervised Deep Monocular Visual Odometry and Depth Estimation with Generative Adversarial Networks

  • Yasin Almalioglu
  • Muhamad Risqi Utama Saputra
  • Pedro P. B. de Gusmao
  • Andrew Markham
  • Niki Trigoni

In the last decade, supervised deep learning approaches have been extensively employed in visual odometry (VO) applications; however, such approaches are not feasible in environments where labelled data is not abundant. On the other hand, unsupervised deep learning approaches for localization and mapping in unknown environments from unlabelled data have received comparatively less attention in VO research. In this study, we propose a generative unsupervised learning framework that predicts 6-DoF camera motion and the monocular depth map of the scene from unlabelled RGB image sequences, using deep convolutional Generative Adversarial Networks (GANs). We create a supervisory signal by warping view sequences and assigning re-projection minimization as the objective loss function adopted in the multi-view pose estimation and single-view depth generation networks. Detailed quantitative and qualitative evaluations of the proposed framework on the KITTI [1] and Cityscapes [2] datasets show that the proposed method outperforms both existing traditional and unsupervised deep VO methods, providing better results for both pose estimation and depth recovery.

ICRA 2019 · Conference Paper

Learning Monocular Visual Odometry through Geometry-Aware Curriculum Learning

  • Muhamad Risqi Utama Saputra
  • Pedro P. B. de Gusmao
  • Sen Wang 0002
  • Andrew Markham
  • Niki Trigoni

Inspired by the cognitive process of humans and animals, Curriculum Learning (CL) trains a model by gradually increasing the difficulty of the training data. In this paper, we study whether CL can be applied to complex geometry problems like estimating monocular Visual Odometry (VO). Unlike existing CL approaches, we present a novel CL strategy for learning the geometry of monocular VO by gradually making the learning objective more difficult during training. To this end, we propose a novel geometry-aware objective function that jointly optimizes relative and composite transformations over small windows via a bounded pose regression loss. A cascade optical flow network followed by a recurrent network with a differentiable windowed composition layer, termed CL-VO, is devised to learn the proposed objective. Evaluation on three real-world datasets shows superior performance of CL-VO over state-of-the-art feature-based and learning-based VO.
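
One plausible reading of "jointly optimizing relative and composite transformations over small windows", sketched with 4x4 homogeneous matrices; the actual objective additionally bounds the pose regression loss and schedules difficulty over training.

```python
import torch

def windowed_pose_loss(T_pred, T_gt, w=0.5):
    """T_pred, T_gt: lists of relative poses (4x4 matrices) over one window."""
    rel = sum(torch.linalg.matrix_norm(p - g) for p, g in zip(T_pred, T_gt))
    compose = lambda Ts: Ts[0] if len(Ts) == 1 else torch.linalg.multi_dot(Ts)
    comp = torch.linalg.matrix_norm(compose(T_pred) - compose(T_gt))
    return rel + w * comp   # per-step error plus composed window error
```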

NeurIPS 2019 · Conference Paper

Learning Object Bounding Boxes for 3D Instance Segmentation on Point Clouds

  • Bo Yang
  • Jianan Wang
  • Ronald Clark
  • Qingyong Hu
  • Sen Wang
  • Andrew Markham
  • Niki Trigoni

We propose a novel, conceptually simple and general framework for instance segmentation on 3D point clouds. Our method, called 3D-BoNet, follows the simple design philosophy of per-point multilayer perceptrons (MLPs). The framework directly regresses 3D bounding boxes for all instances in a point cloud, while simultaneously predicting a point-level mask for each instance. It consists of a backbone network followed by two parallel network branches for 1) bounding box regression and 2) point mask prediction. 3D-BoNet is single-stage, anchor-free and end-to-end trainable. Moreover, it is remarkably computationally efficient as, unlike existing approaches, it does not require any post-processing steps such as non-maximum suppression, feature sampling, clustering or voting. Extensive experiments show that our approach surpasses existing work on both ScanNet and S3DIS datasets while being approximately 10x more computationally efficient. Comprehensive ablation studies demonstrate the effectiveness of our design.

AAAI 2019 · Conference Paper

MotionTransformer: Transferring Neural Inertial Tracking between Domains

  • Changhao Chen
  • Yishu Miao
  • Chris Xiaoxuan Lu
  • Linhai Xie
  • Phil Blunsom
  • Andrew Markham
  • Niki Trigoni

Inertial information processing plays a pivotal role in egomotion awareness for mobile agents, as inertial measurements are entirely egocentric and not environment dependent. However, they are affected greatly by changes in sensor placement/orientation or motion dynamics, and it is infeasible to collect labelled data from every domain. To overcome the challenges of domain adaptation on long sensory sequences, we propose MotionTransformer - a novel framework that extracts domain-invariant features of raw sequences from arbitrary domains, and transforms to new domains without any paired data. Through the experiments, we demonstrate that it is able to efficiently and effectively convert the raw sequence from a new unlabelled target domain into an accurate inertial trajectory, benefiting from the motion knowledge transferred from the labelled source domain. We also conduct real-world experiments to show our framework can reconstruct physically meaningful trajectories from raw IMU measurements obtained with a standard mobile phone in various attachments.

IJCAI 2018 · Conference Paper

3D-PhysNet: Learning the Intuitive Physics of Non-Rigid Object Deformations

  • Zhihua Wang
  • Stefano Rosa
  • Bo Yang
  • Sen Wang
  • Niki Trigoni
  • Andrew Markham

The ability to interact with and understand the environment is a fundamental prerequisite for a wide range of applications, from robotics to augmented reality. In particular, predicting how deformable objects will react to applied forces in real time is a significant challenge. This is further confounded by the fact that shape information about encountered objects in the real world is often impaired by occlusions, noise and missing regions, e.g. a robot manipulating an object will only be able to observe a partial view of the entire solid. In this work we present a framework, 3D-PhysNet, which is able to predict how a three-dimensional solid will deform under an applied force using intuitive physics modelling. In particular, we propose a new method to encode the physical properties of the material and the applied force, enabling generalisation over materials. The key is to combine deep variational autoencoders with adversarial training, conditioned on the applied force and the material properties. We further propose a cascaded architecture that takes a single 2.5D depth view of the object and predicts its deformation. Training data is provided by a physics simulator. The network is fast enough to be used in real-time applications from partial views. Experimental results show the viability and the generalisation properties of the proposed architecture.

ICRA 2018 · Conference Paper

DEFO-NET: Learning Body Deformation Using Generative Adversarial Networks

  • Zhihua Wang 0005
  • Stefano Rosa
  • Linhai Xie
  • Bo Yang 0027
  • Sen Wang 0002
  • Niki Trigoni
  • Andrew Markham

Modelling the physical properties of everyday objects is a fundamental prerequisite for autonomous robots. We present a novel generative adversarial network (DEFO-NET), able to predict body deformations under external forces from a single RGB-D image. The network is based on an invertible conditional Generative Adversarial Network (IcGAN) and is trained on a collection of different objects of interest generated by a physical finite element model simulator. DEFO-NET inherits the generalisation properties of GANs. This means that the network is able to reconstruct the whole 3-D appearance of the object given a single depth view of the object and to generalise to unseen object configurations. Contrary to traditional finite element methods, our approach is fast enough to be used in real-time applications. We apply the network to the problem of safe and fast navigation of mobile robots carrying payloads over different obstacles and floor materials. Experimental results in real scenarios show how a robot equipped with an RGB-D camera can use the network to predict terrain deformations under different payload configurations and use this to avoid unsafe areas.

ICRA 2018 · Conference Paper

iMag: Accurate and Rapidly Deployable Inertial Magneto-Inductive Localisation

  • Bo Wei 0003
  • Niki Trigoni
  • Andrew Markham

Localisation is of importance for many applications; our motivating scenarios are short-term construction work and emergency rescue. Not only is accuracy necessary, but these scenarios also require rapid setup and robustness to environmental conditions. These requirements preclude the use of many traditional methods, e.g. vision-based, laser-based, Ultra-wide band (UWB) and Global Positioning System (GPS)-based localisation systems. To solve these challenges, we introduce iMag, an accurate and rapidly deployable inertial magneto-inductive (MI) localisation system. It localises monitored workers using a single MI transmitter and inertial measurement units with minimal setup effort. However, MI location estimates can be distorted and ambiguous. To solve this problem, we suggest a novel method to use MI devices for sensing environmental distortions, and use these to correctly close inertial loops. By applying robust simultaneous localisation and mapping (SLAM), our proposed localisation method achieves excellent tracking accuracy, and can improve performance significantly compared with only using an inertial measurement unit (IMU) and MI device for localisation.

AAAI 2018 · Conference Paper

IONet: Learning to Cure the Curse of Drift in Inertial Odometry

  • Changhao Chen
  • Xiaoxuan Lu
  • Andrew Markham
  • Niki Trigoni

Inertial sensors play a pivotal role in indoor localization, which in turn lays the foundation for pervasive personal applications. However, low-cost inertial sensors, as commonly found in smartphones, are plagued by bias and noise, which leads to unbounded growth in error when accelerations are double integrated to obtain displacement. Small errors in state estimation propagate to make odometry virtually unusable in a matter of seconds. We propose to break the cycle of continuous integration, and instead segment inertial data into independent windows. The challenge becomes estimating the latent states of each window, such as velocity and orientation, as these are not directly observable from sensor data. We demonstrate how to formulate this as an optimization problem, and show how deep recurrent neural networks can yield highly accurate trajectories, outperforming state-of-the-art shallow techniques, on a wide range of tests and attachments. In particular, we demonstrate that IONet can generalize to estimate odometry for non-periodic motion, such as a shopping trolley or baby-stroller, an extremely challenging task for existing techniques.
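
An illustrative sketch of the windowing idea: map each independent window of IMU samples to a polar displacement instead of double-integrating accelerations. IONet's actual network is deeper than this toy version.

```python
import torch
import torch.nn as nn

class WindowedInertialNet(nn.Module):
    def __init__(self, hidden=128):
        super().__init__()
        self.lstm = nn.LSTM(input_size=6, hidden_size=hidden, batch_first=True)
        self.head = nn.Linear(hidden, 2)      # (delta_l, delta_psi)

    def forward(self, imu_window):            # (B, T, 6): accel + gyro samples
        _, (h, _) = self.lstm(imu_window)
        return self.head(h[-1])               # displacement and heading change
```

Chaining the per-window (delta_l, delta_psi) outputs reconstructs the trajectory without ever integrating raw accelerations, which is what breaks the cycle of unbounded drift.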

ICRA 2018 · Conference Paper

Learning with Training Wheels: Speeding up Training with a Simple Controller for Deep Reinforcement Learning

  • Linhai Xie
  • Sen Wang 0002
  • Stefano Rosa
  • Andrew Markham
  • Niki Trigoni

Deep Reinforcement Learning (DRL) has been applied successfully to many robotic applications. However, the large number of trials needed for training is a key issue. Most existing techniques developed to improve training efficiency (e.g. imitation) target general tasks rather than being tailored to robot applications, which have specific context to benefit from. We propose a novel framework, Assisted Reinforcement Learning, where a classical controller (e.g. a PID controller) is used as an alternative, switchable policy to speed up the training of DRL for local planning and navigation problems. The core idea is that the simple control law allows the robot to rapidly learn sensible primitives, like driving in a straight line, instead of random exploration. As the actor network becomes more advanced, it can then take over to perform more complex actions, like obstacle avoidance. Eventually, the simple controller can be discarded entirely. We show that not only does this technique train faster, it is also less sensitive to the structure of the DRL network and consistently outperforms a standard Deep Deterministic Policy Gradient network. We demonstrate the results in both simulation and real-world experiments.
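
A minimal sketch of the switchable-policy idea, assuming the controller is chosen with a probability annealed towards zero as the actor improves; the paper's actual switching criterion may differ.

```python
import random

def assisted_action(actor, pid_controller, obs, p_controller):
    """With probability p_controller act with the simple controller."""
    if random.random() < p_controller:
        return pid_controller(obs)   # sensible primitives early in training
    return actor(obs)                # learned policy gradually takes over
```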

ICRA 2017 · Conference Paper

DeepVO: Towards end-to-end visual odometry with deep Recurrent Convolutional Neural Networks

  • Sen Wang 0002
  • Ronald Clark
  • Hongkai Wen 0001
  • Niki Trigoni

This paper studies the monocular visual odometry (VO) problem. Most existing VO algorithms are developed under a standard pipeline including feature extraction, feature matching, motion estimation, local optimisation, etc. Although some of them have demonstrated superior performance, they usually need to be carefully designed and specifically fine-tuned to work well in different environments. Some prior knowledge is also required to recover an absolute scale for monocular VO. This paper presents a novel end-to-end framework for monocular VO by using deep Recurrent Convolutional Neural Networks (RCNNs). Since it is trained and deployed in an end-to-end manner, it infers poses directly from a sequence of raw RGB images (videos) without adopting any module from the conventional VO pipeline. Based on the RCNNs, it not only automatically learns effective feature representations for the VO problem through Convolutional Neural Networks, but also implicitly models sequential dynamics and relations using deep Recurrent Neural Networks. Extensive experiments on the KITTI VO dataset show competitive performance to state-of-the-art methods, verifying that the end-to-end Deep Learning technique can be a viable complement to traditional VO systems.

IROS 2017 · Conference Paper

GraphTinker: Outlier rejection and inlier injection for pose graph SLAM

  • Linhai Xie
  • Sen Wang 0002
  • Andrew Markham
  • Niki Trigoni

In pose graph Simultaneous Localization and Mapping (SLAM) systems, incorrect loop closures can seriously hinder optimizers from converging to correct solutions, significantly degrading both localization accuracy and map consistency. Therefore, it is crucial to enhance their robustness in the presence of numerous false-positive loop closures. Existing approaches tend to fail when working with very unreliable front-end systems, where the majority of inferred loop closures are incorrect. In this paper, we propose a novel middle layer, seamlessly embedded between front and back ends, to boost the robustness of the whole SLAM system. The main contributions of this paper are two-fold: 1) the proposed middle layer offers a new mechanism to reliably detect and remove false-positive loop closures, even if they form the overwhelming majority; 2) artificial loop closures are automatically reconstructed and injected into pose graphs in the framework of an Extended Rauch-Tung-Striebel smoother, reinforcing reliable loop closures. The proposed algorithm alters the graph generated by the front-end, which can then be optimized by any back-end system. Extensive experiments demonstrate significantly improved accuracy and robustness compared with state-of-the-art methods and various back-ends, verifying the effectiveness of the proposed algorithm.

AAAI 2017 · Conference Paper

VINet: Visual-Inertial Odometry as a Sequence-to-Sequence Learning Problem

  • Ronald Clark
  • Sen Wang
  • Hongkai Wen
  • Andrew Markham
  • Niki Trigoni

In this paper we present an on-manifold sequence-to-sequence learning approach to motion estimation using visual and inertial sensors. It is, to the best of our knowledge, the first end-to-end trainable method for visual-inertial odometry which performs fusion of the data at an intermediate feature-representation level. Our method has numerous advantages over traditional approaches. Specifically, it eliminates the need for tedious manual synchronization of the camera and IMU, as well as the need for manual calibration between the IMU and camera. A further advantage is that our model naturally and elegantly incorporates domain-specific information which significantly mitigates drift. We show that our approach is competitive with state-of-the-art traditional methods when accurate calibration data is available and can be trained to outperform them in the presence of calibration and synchronization errors.

IROS 2016 · Conference Paper

Keyframe based large-scale indoor localisation using geomagnetic field and motion pattern

  • Sen Wang 0002
  • Hongkai Wen 0001
  • Ronald Clark
  • Niki Trigoni

This paper studies the indoor localisation problem using low-cost and pervasive sensors. Most existing indoor localisation algorithms rely on a camera, laser scanner, floor plan or other pre-installed infrastructure to achieve sub-meter or sub-centimetre localisation accuracy. However, in some circumstances these required devices or information may be unavailable or too expensive in terms of cost or deployment. This paper presents a novel keyframe-based Pose Graph Simultaneous Localisation and Mapping (SLAM) method, which correlates the ambient geomagnetic field with motion pattern and employs low-cost sensors commonly equipped in mobile devices, to provide positioning in both unknown and known environments. Extensive experiments are conducted in large-scale indoor environments to verify that the proposed method can achieve high localisation accuracy, similar to the state of the art such as the vision-based Google Project Tango.

ICRA 2010 · Conference Paper

Probabilistic search with agile UAVs

  • Sonia Waharte
  • Andrew Symington
  • Niki Trigoni

Through their ability to rapidly acquire aerial imagery, Unmanned Aerial Vehicles (UAVs) have the potential to aid target search tasks. Many of the core algorithms used to plan search tasks use occupancy grid-based representations and are often based on two main assumptions: first, that the altitude of the UAV is constant; second, that the onboard sensors can measure the entire state of a grid cell. Although these assumptions are sufficient for fixed-wing, high-speed UAVs, we do not believe that they are appropriate for small, lightweight, low-speed and agile UAVs such as quadrotors. These platforms have the ability to change altitude, and their low speed means that multiple measurements may easily overlap multiple cells for substantial periods of time. In this paper we extend a framework for probabilistic search based on decision making to incorporate multiple observations of grid cells and changes in UAV altitude. We account for observation areas that completely and partially cover multiple grid cells. We show the resultant impact in a number of simulation examples.

ICRA 2010 · Conference Paper

Probabilistic target detection by camera-equipped UAVs

  • Andrew Symington
  • Sonia Waharte
  • Simon Julier
  • Niki Trigoni

This paper is motivated by the real-world problem of search and rescue by unmanned aerial vehicles (UAVs). We consider the problem of tracking a static target from a bird's-eye view camera mounted to the underside of a quadrotor UAV. We begin by proposing a target detection algorithm, which we then execute on a collection of video frames acquired from four different experiments. We show how the efficacy of the target detection algorithm changes as a function of altitude. We summarise this efficacy in a table, which we denote the observation model. We then run the target detection algorithm on a sequence of video frames and use parameters from the observation model to update a recursive Bayesian estimator. The estimator keeps track of the probability that a target is currently in view of the camera, which we refer to more simply as target presence. Between each target detection event the UAV changes position, and so the sensing region changes. Under certain assumptions regarding the movement of the UAV, the proportion of new information may be approximated by a single value, which we then use to weight the prior in each iteration of the estimator. Through a series of experiments we show how the value of the prior for unseen regions, the altitude of the UAV and the camera sampling rate affect the accuracy of the estimator. Our results indicate that there is no single optimal sampling rate for all tested scenarios. We also show how the prior may be used as a mechanism for tuning the estimator according to whether a high false-positive or high false-negative probability is preferable.
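
A minimal sketch of the recursive Bayesian update for target presence, assuming detector rates taken from the altitude-dependent observation model; the prior-weighting step for newly sensed regions is omitted.

```python
def presence_update(prior, detected, p_tp, p_fp):
    """One Bayes step for P(target in view) given one detector output."""
    l_present = p_tp if detected else 1.0 - p_tp      # P(obs | target present)
    l_absent = p_fp if detected else 1.0 - p_fp       # P(obs | target absent)
    evidence = l_present * prior + l_absent * (1.0 - prior)
    return l_present * prior / evidence
```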

IROS 2008 · Conference Paper

HybridExploration: A distributed approach to terrain exploration using mobile and fixed sensor nodes

  • Ettore Ferranti
  • Niki Trigoni
  • Mark Levene

When an emergency occurs within a building, it may be initially safer to send autonomous mobile nodes, instead of human responders, to explore the area and identify hazards and victims. Exploring all the area in the minimum amount of time and reporting back interesting findings to the human personnel outside the building is an essential part of rescue operations. Our assumptions are that the area map is unknown, there is no existing network infrastructure, long-range wireless communication is unreliable and nodes are not location-aware. We take into account these limitations, and propose a novel algorithm, HybridExploration, that makes use of both mobile nodes (robots, called agents) and stationary nodes (inexpensive smart devices, called tags). As agents enter the emergency area, they sprinkle tags within the space to label the environment with states. By reading and updating the state of the local tags, agents are able to coordinate indirectly with each other, without relying on direct agent-to-agent communication. In addition, tags wirelessly exchange local information with nearby tags to further assist agents in their exploration task. Our simulation results show that the proposed algorithm, which exploits both tag-to-tag and agent-to-tag communication, outperforms previous algorithms that rely only on agent-to-tag communication.

JAAMAS 2008 · Journal Article

Rapid exploration of unknown areas through dynamic deployment of mobile and stationary sensor nodes

  • Ettore Ferranti
  • Niki Trigoni
  • Mark Levene

When an emergency occurs within a building, it may be initially safer to send autonomous mobile nodes, instead of human responders, to explore the area and identify hazards and victims. Exploring all the area in the minimum amount of time and reporting back interesting findings to the human personnel outside the building is an essential part of rescue operations. Our assumptions are that the area map is unknown, there is no existing network infrastructure, long-range wireless communication is unreliable and nodes are not location-aware. We take into account these limitations, and propose an architecture consisting of both mobile nodes (robots, called agents) and stationary nodes (inexpensive smart devices, called tags). As agents enter the emergency area, they sprinkle tags within the space to label the environment with states. By reading and updating the state of the local tags, agents are able to coordinate indirectly with each other, without relying on direct agent-to-agent communication. In addition, tags wirelessly exchange local information with nearby tags to further assist agents in their exploration task. Our simulation results show that the proposed algorithm, which exploits both tag-to-tag and agent-to-tag communication, outperforms previous algorithms that rely only on agent-to-tag communication.

ICRA 2008 · Conference Paper

Robot-assisted discovery of evacuation routes in emergency scenarios

  • Ettore Ferranti
  • Niki Trigoni

When an emergency occurs within a building, it is crucial to guide victims towards emergency exits or human responders towards the locations of victims and hazards. The objective of this work is thus to devise distributed algorithms that allow agents to dynamically discover and maintain short evacuation routes connecting emergency exits to critical cells in the area. We propose two Evacuation Route Discovery mechanisms, Agent2Tag-ERD and Tag2Tag-ERD, and show how they can be seamlessly integrated with existing exploration algorithms, like Ants, MDFS and Brick&Mortar. We then examine the interplay between the tasks of area exploration and evacuation route discovery; our goal is to assess whether the exploration algorithm influences the length of evacuation paths and the time that they are first discovered. Finally, we perform an extensive simulation to assess the impact of the area topology on the quality of discovered evacuation paths.

ICRA 2007 · Conference Paper

Brick & Mortar: an on-line multi-agent exploration algorithm

  • Ettore Ferranti
  • Niki Trigoni
  • Mark Levene

When an emergency occurs within a building, it is critical to explore the area as fast as possible in order to find victims and identify hazards. We propose Brick & Mortar, an algorithm for the autonomous exploration of unknown terrains by a team of mobile nodes, referred to as agents. Because of the unreliability and short range of wireless communications in an indoor environment, we suggest that agents communicate indirectly with each other by tagging the environment. Agents have no prior knowledge of the terrain map, but are able to coordinate in order to explore a variety of terrains with different topological features. In our experimental evaluation, we show that Brick & Mortar significantly outperforms the competing algorithms, namely Ants and Multiple Depth First Search, in terms of exploration time. The observed performance benefits suggest that our algorithm is suitable for safety-critical applications that require rapid area coverage for real-time event detection and response.
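
A heavily simplified skeleton of indirect coordination through environment tags; Brick & Mortar's actual rules (wall marking, loop control, backtracking) are considerably richer.

```python
from enum import Enum

class Tag(Enum):
    UNEXPLORED = 0
    EXPLORED = 1

def explore_step(cell, tags, neighbours):
    """Mark the current cell, then move towards any unexplored neighbour."""
    tags[cell] = Tag.EXPLORED                 # leave state in the environment
    for n in neighbours(cell):
        if tags.get(n, Tag.UNEXPLORED) is Tag.UNEXPLORED:
            return n                          # advance into unexplored space
    return None                               # dead end: real agents backtrack
```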