Arrow Research search

Author name cluster

Andrew Markham

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

39 papers
2 author rows

Possible papers

39

NeurIPS Conference 2025 Conference Paper

COOPERA: Continual Open-Ended Human-Robot Assistance

  • Chenyang Ma
  • Kai Lu
  • Ruta Desai
  • Xavier Puig
  • Andrew Markham
  • Niki Trigoni

To understand and collaborate with humans, robots must account for individual human traits, habits, and activities over time. However, most robotic assistants lack these abilities, as they primarily focus on predefined tasks in structured environments and lack a human model to learn from. This work introduces COOPERA, a novel framework for COntinual, OPen-Ended human-Robot Assistance, where simulated humans, driven by psychological traits and long-term intentions, interact with robots in complex environments. By integrating continuous human feedback, our framework, for the first time, enables the study of long-term, open-ended human-robot collaboration (HRC) in different collaborative tasks across various time-scales. Within COOPERA, we introduce a benchmark and an approach to personalize the robot's collaborative actions by learning human traits and context-dependent intents. Experiments validate the extent to which our simulated humans reflect realistic human behaviors and demonstrate the value of inferring and personalizing to human intents for open-ended and long-term HRC.

NeurIPS Conference 2025 Conference Paper

Target Speaker Extraction through Comparing Noisy Positive and Negative Audio Enrollments

  • Shitong Xu
  • Yiyuan Yang
  • Niki Trigoni
  • Andrew Markham

Target speaker extraction focuses on isolating a specific speaker's voice from an audio mixture containing multiple speakers. To provide information about the target speaker's identity, prior works have utilized clean audio samples as conditioning inputs. However, such clean audio examples are not always readily available. For instance, obtaining a clean recording of a stranger's voice at a cocktail party without leaving the noisy environment is generally infeasible. Limited prior research has explored extracting the target speaker's characteristics from noisy enrollments, which may contain overlapping speech from interfering speakers. In this work, we explore a novel enrollment strategy that encodes target speaker information from the noisy enrollment by comparing segments where the target speaker is talking (Positive Enrollments) with segments where the target speaker is silent (Negative Enrollments). Experiments show the effectiveness of our model architecture, which achieves over 2.1 dB higher SI-SNRi compared to prior works in extracting the monaural speech from the mixture of two speakers. Additionally, the proposed two-stage training strategy accelerates convergence, reducing the number of optimization steps required to reach 3 dB SNR by 60%. Overall, our method achieves state-of-the-art performance in monaural target speaker extraction conditioned on noisy enrollments. Our implementation is available at https://github.com/xu-shitong/TSE-through-Positive-Negative-Enroll.
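
The comparison mechanism lends itself to a compact sketch. Below is a minimal, hypothetical PyTorch encoder (not the authors' code) that embeds both enrollments with a shared network and contrasts them to isolate target-specific information; the layer sizes and the subtraction-based fusion are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ContrastiveEnrollmentEncoder(nn.Module):
    """Hypothetical sketch: derive a target-speaker embedding by comparing
    noisy positive (target talking) and negative (target silent) enrollments."""
    def __init__(self, dim=128):
        super().__init__()
        self.encoder = nn.Sequential(        # shared waveform encoder
            nn.Conv1d(1, dim, kernel_size=16, stride=8), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten())
        self.project = nn.Linear(2 * dim, dim)

    def forward(self, pos_wav, neg_wav):
        # Both enrollments contain the same interfering speakers; only the
        # positive one contains the target, so contrasting the embeddings
        # isolates target-specific information.
        e_pos = self.encoder(pos_wav.unsqueeze(1))
        e_neg = self.encoder(neg_wav.unsqueeze(1))
        return self.project(torch.cat([e_pos - e_neg, e_pos], dim=-1))

enc = ContrastiveEnrollmentEncoder()
pos = torch.randn(2, 16000)    # 1 s positive enrollment (batch of 2)
neg = torch.randn(2, 16000)    # 1 s negative enrollment
spk_embedding = enc(pos, neg)  # -> (2, 128), conditions the separator
```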

ICML Conference 2024 Conference Paper

Deep Neural Room Acoustics Primitive

  • Yuhang He
  • Anoop Cherian
  • Gordon Wichern
  • Andrew Markham

The primary objective of room acoustics is to model the intricate sound propagation dynamics from any source to receiver position within enclosed 3D spaces. These dynamics are encapsulated in the form of a 1D room impulse response (RIR). Precisely measuring RIR is difficult due to the complexity of sound propagation encompassing reflection, diffraction, and absorption. In this work, we propose to learn a continuous neural room acoustics field that implicitly encodes all essential sound propagation primitives for each enclosed 3D space, so that we can infer the RIR corresponding to arbitrary source-receiver positions unseen in the training dataset. Our framework, dubbed DeepNeRAP, is trained in a self-supervised manner without requiring direct access to RIR ground truth that is often needed in prior methods. The key idea is to design two cooperative acoustic agents to actively probe a 3D space, one emitting and the other receiving sound at various locations. Analyzing this sound helps to inversely characterize the acoustic primitives. Our framework is well-grounded in the fundamental physical principles of sound propagation, including reciprocity and globality, and thus is acoustically interpretable and meaningful. We present experiments on both synthetic and real-world datasets, demonstrating superior quality in RIR estimation against closely related methods.
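
As a rough illustration of a continuous neural acoustics field, the sketch below (an assumed architecture, not DeepNeRAP itself) maps a source/receiver position pair to an RIR with a plain MLP and adds a reciprocity regularizer, reflecting the physical principle that swapping source and receiver should leave the impulse response unchanged.

```python
import torch
import torch.nn as nn

class NeuralRIRField(nn.Module):
    """Illustrative sketch (not the authors' code): an MLP mapping a
    source/receiver position pair to a fixed-length RIR."""
    def __init__(self, rir_len=512):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(6, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, rir_len))

    def forward(self, src, rcv):
        return self.mlp(torch.cat([src, rcv], dim=-1))

def reciprocity_loss(field, src, rcv):
    # Acoustic reciprocity: swapping source and receiver should
    # leave the impulse response unchanged.
    return (field(src, rcv) - field(rcv, src)).pow(2).mean()

field = NeuralRIRField()
src, rcv = torch.rand(8, 3), torch.rand(8, 3)  # positions in a unit cube
rir = field(src, rcv)                          # (8, 512) predicted RIRs
loss = reciprocity_loss(field, src, rcv)       # physics-based regularizer
```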

ICRA Conference 2024 Conference Paper

Dusk Till Dawn: Self-supervised Nighttime Stereo Depth Estimation using Visual Foundation Models

  • Madhu Vankadari
  • Samuel Hodgson
  • Sangyun Shin
  • Kaichen Zhou
  • Andrew Markham
  • Niki Trigoni

Self-supervised depth estimation algorithms rely heavily on frame-warping relationships, exhibiting substantial performance degradation when applied in challenging circumstances, such as low-visibility and nighttime scenarios with varying illumination conditions. Addressing this challenge, we introduce an algorithm designed to achieve accurate self-supervised stereo depth estimation focusing on nighttime conditions. Specifically, we use pretrained visual foundation models to extract generalised features across challenging scenes and present an efficient method for matching and integrating these features from stereo frames. Moreover, to prevent pixels violating the photometric consistency assumption from negatively affecting the depth predictions, we propose a novel masking approach designed to filter out such pixels. Lastly, addressing weaknesses in the evaluation of current depth estimation algorithms, we present novel evaluation metrics. Our experiments, conducted on challenging datasets including Oxford RobotCar and MultiSpectral Stereo, demonstrate the robust improvements realized by our approach.

IROS Conference 2024 Conference Paper

Learning Generalizable Manipulation Policy with Adapter-Based Parameter Fine-Tuning

  • Kai Lu 0003
  • Kim Tien Ly
  • William Hebberd
  • Kaichen Zhou
  • Ioannis Havoutis
  • Andrew Markham

This study investigates the use of adapters in reinforcement learning for robotic skill generalization across multiple robots and tasks. Traditional methods are typically reliant on robot-specific retraining and face challenges such as efficiency and adaptability, particularly when scaling to robots with varying kinematics. We propose an alternative approach where a disembodied (virtual) hand manipulator learns a task (i.e., an abstract skill) and then transfers it to various robots with different kinematic constraints without retraining the entire model (i.e., the concrete, physical implementation of the skill). Whilst adapters are commonly used in other domains with strong supervision available, we show how weaker feedback from robotic control can be used to optimize task execution by preserving the abstract skill dynamics whilst adapting to new robotic domains. We demonstrate the effectiveness of our method with experiments conducted in the SAPIEN ManiSkill environment, showing improvements in generalization and task success rates. All code, data, and additional videos are at this GitHub link: https://kl-research.github.io/genrob.
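
A minimal sketch of the adapter mechanism referenced here, under the usual bottleneck-adapter formulation (the dimensions and trunk below are hypothetical): a small residual module is trained per robot embodiment while the shared skill trunk stays frozen.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Standard bottleneck adapter with a residual connection."""
    def __init__(self, dim, bottleneck=16):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        nn.init.zeros_(self.up.weight)  # start as an identity mapping
        nn.init.zeros_(self.up.bias)

    def forward(self, x):
        return x + self.up(torch.relu(self.down(x)))

# Hypothetical frozen policy trunk learned by the virtual hand;
# only the small adapter (and head) is trained per robot embodiment.
trunk = nn.Sequential(nn.Linear(64, 256), nn.ReLU(), nn.Linear(256, 256))
for p in trunk.parameters():
    p.requires_grad = False

policy = nn.Sequential(trunk, Adapter(256), nn.Linear(256, 7))  # 7-DoF action
action = policy(torch.randn(1, 64))
```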

ICRA Conference 2024 Conference Paper

Learning to Catch Reactive Objects with a Behavior Predictor

  • Kai Lu 0003
  • Jia-Xing Zhong
  • Bo Yang 0027
  • Bing Wang 0013
  • Andrew Markham

Tracking and catching moving objects is an important ability for robots in a dynamic world. Whilst some objects have highly predictable state evolution, e.g., the ballistic trajectory of a tennis ball, reactive targets alter their behavior in response to motion of the manipulator. Reactive applications range from gently capturing living animals such as snakes or fish for biological investigations, to smoothly interacting with and assisting a person. Existing works for dynamic catching usually perform target prediction followed by planning, but seldom account for highly non-linear reactive behaviors. Alternatively, Reinforcement Learning (RL) based methods simply treat the target and its motion as part of the observation of the world-state, but perform poorly due to the weak reward signal. In this work, we blend the approach of an explicit, yet learned, target state predictor with RL. We further show how a tightly coupled predictor which ‘observes’ the state of the robot leads to significantly improved anticipatory action, especially with targets that seek to evade the robot following a simple policy. Experiments show that our method achieves an 86.4% (open plane area) and a 73.8% (room) success rate on evasive objects, outperforming monolithic reinforcement learning and other techniques. We also demonstrate the efficacy of our approach across varied targets and trajectories. All code, data, and additional videos are at this GitHub link: https://kl-research.github.io/dyncatch.
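
The coupling between predictor and policy can be sketched as follows; the sizes, GRU choice, and observation layout are assumptions for illustration, not the paper's released configuration.

```python
import torch
import torch.nn as nn

class BehaviorPredictor(nn.Module):
    """Sketch of a reactive-target predictor that also 'observes' the robot:
    it conditions on the manipulator state, so evasive responses to the
    robot's own motion can be anticipated."""
    def __init__(self, target_dim=6, robot_dim=14, hidden=128):
        super().__init__()
        self.gru = nn.GRU(target_dim + robot_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, target_dim)

    def forward(self, target_hist, robot_hist):
        x = torch.cat([target_hist, robot_hist], dim=-1)
        h, _ = self.gru(x)
        return self.head(h[:, -1])  # predicted next target state

predictor = BehaviorPredictor()
target_hist = torch.randn(1, 10, 6)   # last 10 target poses/velocities
robot_hist = torch.randn(1, 10, 14)   # last 10 robot joint states
pred = predictor(target_hist, robot_hist)
# The RL policy then observes [robot_state, target_state, pred], blending
# an explicit learned predictor with reinforcement learning.
```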

AAAI Conference 2024 Conference Paper

SoundCount: Sound Counting from Raw Audio with Dyadic Decomposition Neural Network

  • Yuhang He
  • Zhuangzhuang Dai
  • Niki Trigoni
  • Long Chen
  • Andrew Markham

In this paper, we study an underexplored, yet important and challenging problem: counting the number of distinct sounds in raw audio characterized by a high degree of polyphonicity. We do so by systematically proposing a novel end-to-end trainable neural network (which we call DyDecNet, consisting of a dyadic decomposition front-end and backbone network), and quantifying the difficulty level of counting depending on sound polyphonicity. The dyadic decomposition front-end progressively decomposes the raw waveform dyadically along the frequency axis to obtain a time-frequency representation in a multi-stage, coarse-to-fine manner. Each intermediate waveform convolved by a parent filter is further processed by a pair of child filters that evenly split the parent filter's carried frequency response, with the higher-half child filter encoding the detail and the lower-half child filter encoding the approximation. We further introduce an energy gain normalization to normalize sound loudness variance and spectrum overlap, and apply it to each intermediate parent waveform before feeding it to the two child filters. To better quantify the sound counting difficulty level, we further design three polyphony-aware metrics: polyphony ratio, max polyphony and mean polyphony. We test DyDecNet on various datasets to show its superiority, and we further show that the dyadic decomposition network can be used as a general front-end to tackle other acoustic tasks.
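
To make the dyadic splitting concrete, here is a toy NumPy version using fixed Haar filters as a stand-in for the learned parent/child filter pairs (the paper learns these filters and adds energy gain normalization, omitted here).

```python
import numpy as np

def dyadic_decompose(x, depth):
    """Haar-like stand-in for DyDecNet's learned child filters: each parent
    band is split into a lower-half (approximation) and an upper-half
    (detail) band, halving the sample rate at every stage."""
    bands = []
    for _ in range(depth):
        even, odd = x[..., ::2], x[..., 1::2]
        low = (even + odd) / np.sqrt(2)    # approximation child
        high = (even - odd) / np.sqrt(2)   # detail child
        bands.append(high)                 # keep detail at this scale
        x = low                            # recurse on the approximation
    bands.append(x)
    return bands  # coarse-to-fine time-frequency representation

wave = np.random.randn(16000)              # 1 s of 16 kHz audio
bands = dyadic_decompose(wave, depth=4)
print([b.shape[-1] for b in bands])        # [8000, 4000, 2000, 1000, 1000]
```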

NeurIPS Conference 2024 Conference Paper

SpatialPIN: Enhancing Spatial Reasoning Capabilities of Vision-Language Models through Prompting and Interacting 3D Priors

  • Chenyang Ma
  • Kai Lu
  • Ta-Ying Cheng
  • Niki Trigoni
  • Andrew Markham

Current state-of-the-art spatial reasoning-enhanced VLMs are trained to excel at spatial visual question answering (VQA). However, we believe that higher-level 3D-aware tasks, such as articulating dynamic scene changes and motion planning, require a fundamental and explicit 3D understanding beyond current spatial VQA datasets. In this work, we present SpatialPIN, a framework designed to enhance the spatial reasoning capabilities of VLMs through prompting and interacting with priors from multiple 3D foundation models in a zero-shot, training-free manner. Extensive experiments demonstrate that our spatial reasoning-imbued VLM performs well on various forms of spatial VQA and extends to various downstream robotics tasks such as pick-and-stack and trajectory planning.

NeurIPS Conference 2024 Conference Paper

Towards Learning Group-Equivariant Features for Domain Adaptive 3D Detection

  • Sangyun Shin
  • Yuhang He
  • Madhu Vankadari
  • Ta-Ying Cheng
  • Qian Xie
  • Andrew Markham
  • Niki Trigoni

The performance of 3D object detection in large outdoor point clouds deteriorates significantly in an unseen environment due to the inter-domain gap. To address these challenges, most existing methods for domain adaptation harness self-training schemes and attempt to bridge the gap by focusing on a single factor that causes the inter-domain gap, such as objects' sizes, shapes, and foreground density variation. However, the resulting adaptations suggest that there is still a substantial inter-domain gap left to be minimized. We argue that this is due to two limitations: 1) Biased pseudo-label collection from self-training. 2) Multiple factors jointly contributing to how the object is perceived in the unseen target domain. In this work, we propose a grouping-exploration strategy framework, Group Explorer Domain Adaptation (GroupEXP-DA), to address those two issues. Specifically, our grouping divides the available label sets into multiple clusters and ensures all of them receive equal learning attention with the group-equivariant spatial feature, avoiding imbalance problems caused by dominant types of objects. Moreover, grouping learns to divide objects by considering inherent factors in a data-driven manner, rather than considering each factor separately as in existing works. On top of the group-equivariant spatial feature that selectively detects objects similar to the input group, we additionally introduce an explorative group update strategy that reduces false negative detections in the target domain, further reducing the inter-domain gap. During inference, only the learned group features are necessary for making the group-equivariant spatial feature, placing our method as a simple add-on applicable to most existing detectors. We show how each module contributes to substantially bridging the inter-domain gaps compared to existing works across large urban outdoor datasets such as NuScenes, Waymo, and KITTI.

IROS Conference 2024 Conference Paper

WSCLoc: Weakly-Supervised Sparse-View Camera Relocalization via Radiance Field

  • Jialu Wang
  • Kaichen Zhou
  • Andrew Markham
  • Niki Trigoni

Despite the advancements in deep learning for camera relocalization tasks, obtaining ground truth pose labels required for the training process remains a costly endeavor. While current weakly supervised methods excel in lightweight label generation, their performance notably declines in scenarios with sparse views. In response to this challenge, we introduce WSCLoc, a system capable of being customized to various deep learning-based relocalization models to enhance their performance under weakly-supervised and sparse view conditions. This is realized in two stages. In the initial stage, WSCLoc employs a multilayer perceptron-based structure called WFT-NeRF to co-optimize image reconstruction quality and initial pose information. To ensure a stable learning process, we incorporate temporal information as input. Furthermore, instead of optimizing SE(3), we opt for sim(3) optimization to explicitly enforce a scale constraint. In the second stage, we co-optimize the pre-trained WFT-NeRF and WFT-Pose. This optimization is enhanced by Time-Encoding based Random View Synthesis and supervised by inter-frame geometric constraints that consider pose, depth, and RGB information. We validate our approaches on two publicly available datasets, one outdoor and one indoor. Our experimental results demonstrate that our weakly-supervised relocalization solutions achieve superior pose estimation accuracy in sparse-view scenarios, comparable to state-of-the-art camera relocalization methods. We will make our code publicly available.

ICRA Conference 2023 Conference Paper

Decoupling Skill Learning from Robotic Control for Generalizable Object Manipulation

  • Kai Lu 0003
  • Bo Yang 0027
  • Bing Wang 0013
  • Andrew Markham

Recent works in robotic manipulation through reinforcement learning (RL) or imitation learning (IL) have shown potential for tackling a range of tasks, e.g., opening a drawer or a cupboard. However, these techniques generalize poorly to unseen objects. We conjecture that this is due to the high-dimensional action space for joint control. In this paper, we take an alternative approach and separate the task of learning ‘what to do’ from ‘how to do it’, i.e., whole-body control. We pose the RL problem as one of determining the skill dynamics for a disembodied virtual manipulator interacting with articulated objects. The whole-body robotic kinematic control is optimized to execute the high-dimensional joint motion to reach the goals in the workspace. It does so by solving a quadratic programming (QP) model with robotic singularity and kinematic constraints. Our experiments on manipulating complex articulated objects show that the proposed approach is more generalizable to unseen objects with large intra-class variations, outperforming previous approaches. The evaluation results indicate that our approach generates more compliant robotic motion and outperforms the pure RL and IL baselines in task success rates. Additional information and videos are available at https://kl-research.github.io/decoupskill.

NeurIPS Conference 2023 Conference Paper

DynPoint: Dynamic Neural Point For View Synthesis

  • Kaichen Zhou
  • Jia-Xing Zhong
  • Sangyun Shin
  • Kai Lu
  • Yiyuan Yang
  • Andrew Markham
  • Niki Trigoni

The introduction of neural radiance fields has greatly improved the effectiveness of view synthesis for monocular videos. However, existing algorithms face difficulties when dealing with uncontrolled or lengthy scenarios, and require extensive training time specific to each new scenario. To tackle these limitations, we propose DynPoint, an algorithm designed to facilitate the rapid synthesis of novel views for unconstrained monocular videos. Rather than encoding the entirety of the scenario information into a latent representation, DynPoint concentrates on predicting the explicit 3D correspondence between neighboring frames to realize information aggregation. Specifically, this correspondence prediction is achieved through the estimation of consistent depth and scene flow information across frames. Subsequently, the acquired correspondence is utilized to aggregate information from multiple reference frames to a target frame, by constructing hierarchical neural point clouds. The resulting framework enables swift and accurate view synthesis for desired views of target frames. Experimental results demonstrate that our method accelerates training considerably, typically by an order of magnitude, while yielding outcomes comparable to those of prior approaches. Furthermore, our method exhibits strong robustness in handling long-duration videos without learning a canonical representation of video content.

NeurIPS Conference 2023 Conference Paper

Multi-body SE(3) Equivariance for Unsupervised Rigid Segmentation and Motion Estimation

  • Jia-Xing Zhong
  • Ta-Ying Cheng
  • Yuhang He
  • Kai Lu
  • Kaichen Zhou
  • Andrew Markham
  • Niki Trigoni

A truly generalizable approach to rigid segmentation and motion estimation is fundamental to 3D understanding of articulated objects and moving scenes. In view of the closely intertwined relationship between segmentation and motion estimates, we present an SE(3) equivariant architecture and a training strategy to tackle this task in an unsupervised manner. Our architecture is composed of two interconnected, lightweight heads. These heads predict segmentation masks using point-level invariant features and estimate motion from SE(3) equivariant features, all without the need for category information. Our training strategy is unified and can be implemented online, which jointly optimizes the predicted segmentation and motion by leveraging the interrelationships among scene flow, segmentation mask, and rigid transformations. We conduct experiments on four datasets to demonstrate the superiority of our method. The results show that our method excels in both model performance and computational efficiency, with only 0.25M parameters and 0.92G FLOPs. To the best of our knowledge, this is the first work designed for category-agnostic part-level SE(3) equivariance in dynamic point clouds.

IROS Conference 2023 Conference Paper

RADA: Robust Adversarial Data Augmentation for Camera Localization in Challenging Conditions

  • Jialu Wang
  • Muhamad Risqi Utama Saputra
  • Chris Xiaoxuan Lu
  • Niki Trigoni
  • Andrew Markham

Camera localization is a fundamental problem for many applications in computer vision, robotics, and autonomy. Despite recent deep learning-based approaches, the lack of robustness in challenging conditions persists due to changes in appearance caused by texture-less planes, repeating structures, reflective surfaces, motion blur, and illumination changes. Data augmentation is an attractive solution, but standard image perturbation methods fail to improve localization robustness. To address this, we propose RADA, which concentrates on perturbing the most vulnerable pixels, generating fewer image perturbations that nonetheless perplex the network. Our method outperforms previous augmentation techniques, achieving up to twice the accuracy of state-of-the-art models even under ‘unseen’ challenging weather conditions. Videos of our results can be found at https://youtu.be/niOv7-fJeCA. The source code for RADA is publicly available at https://github.com/jialuwang123321/RADA.
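
One plausible reading of "perturbing the most vulnerable pixels" is gradient-guided augmentation, sketched below under assumed interfaces (`model`, `loss_fn`, and the top-k size `k` are hypothetical, and `k` must not exceed the pixel count): only the pixels with the largest loss gradient are perturbed.

```python
import torch

def rada_style_perturb(model, image, pose_gt, loss_fn, k=1000, eps=0.05):
    """Sketch of gradient-guided augmentation in the spirit of RADA:
    perturb only the k pixels whose loss gradient is largest, rather
    than applying a uniform image-wide perturbation."""
    image = image.clone().requires_grad_(True)
    loss = loss_fn(model(image), pose_gt)
    grad, = torch.autograd.grad(loss, image)

    saliency = grad.abs().sum(dim=1, keepdim=True)   # per-pixel vulnerability
    flat = saliency.flatten(1)
    thresh = flat.topk(k, dim=1).values[:, -1]       # k-th largest value
    mask = (saliency >= thresh.view(-1, 1, 1, 1)).float()

    # FGSM-style step restricted to the most vulnerable pixels.
    return (image + eps * grad.sign() * mask).detach().clamp(0, 1)
```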

ICRA Conference 2023 Conference Paper

Sample, Crop, Track: Self-Supervised Mobile 3D Object Detection for Urban Driving LiDAR

  • Sangyun Shin
  • Stuart Golodetz
  • Madhu Vankadari
  • Kaichen Zhou
  • Andrew Markham
  • Niki Trigoni

Deep learning has led to great progress in the detection of mobile (i.e. movement-capable) objects in urban driving scenes in recent years. Supervised approaches typically require the annotation of large training sets; there has thus been great interest in leveraging weakly-, semi- or self-supervised methods to avoid this, with much success. Whilst weakly and semi-supervised methods require some annotation, self-supervised methods have used cues such as motion to relieve the need for annotation altogether. However, a complete absence of annotation typically degrades their performance, and ambiguities that arise during motion grouping can inhibit their ability to find accurate object boundaries. In this paper, we propose a new self-supervised mobile object detection approach called SCT. This uses both motion cues and expected object sizes to improve detection performance, and predicts a dense grid of 3D oriented bounding boxes to improve object discovery. We significantly outperform the state-of-the-art self-supervised mobile object detection method TCR on the KITTI tracking benchmark, and achieve performance that is within 30% of the fully supervised PV-RCNN++ method for IoUs ≤ 0.5. Our source code will be made available online.

IROS Conference 2022 Conference Paper

DeepCIR: Insights into CIR-based Data-driven UWB Error Mitigation

  • Vu Tran
  • Zhuangzhuang Dai
  • Niki Trigoni
  • Andrew Markham

Ultra-Wide-Band (UWB) ranging sensors have been widely adopted for robotic navigation thanks to their extremely high bandwidth and hence high resolution. However, off-the-shelf devices may output ranges with significant errors in cluttered, severe non-line-of-sight (NLOS) environments. Recently, neural networks have been actively studied to improve the ranging accuracy of UWB sensors using the channel impulse response (CIR) as input. However, previous works have not systematically evaluated the efficacy of various packet types and their possible combinations in a two-way-ranging transaction, including poll, response and final packets. In this paper, we first investigate the utility of different packet types and their combinations when used as input for a neural network. Second, we propose two novel data-driven approaches, namely FMCIR and WMCIR, that leverage two-sided CIRs for efficient UWB error mitigation. Our approaches outperform the state-of-the-art by a significant margin, further reducing range errors by up to 45%. Finally, we create and release a dataset of transaction-level synchronized CIRs (each sample consists of the CIR of the poll, response and final packets), which will enable further studies in this area.

IROS Conference 2022 Conference Paper

Real-Time Hybrid Mapping of Populated Indoor Scenes using a Low-Cost Monocular UAV

  • Stuart Golodetz
  • Madhu Vankadari
  • Aluna Everitt
  • Sangyun Shin
  • Andrew Markham
  • Niki Trigoni

Unmanned aerial vehicles (UAVs) have been used for many applications in recent years, from urban search and rescue, to agricultural surveying, to autonomous underground mine exploration. However, deploying UAVs in tight, indoor spaces, especially close to humans, remains a challenge. One solution, when limited payload is required, is to use micro-UAVs, which pose less risk to humans and typically cost less to replace after a crash. However, micro-UAVs can only carry a limited sensor suite, e.g. a monocular camera instead of a stereo pair or LiDAR, complicating tasks like dense mapping and markerless multi-person 3D human pose estimation, which are needed to operate in tight environments around people. Monocular approaches to such tasks exist, and dense monocular mapping approaches have been successfully deployed for UAV applications. However, despite many recent works on both marker-based and markerless multi-UAV single-person motion capture, markerless single-camera multi-person 3D human pose estimation remains a much earlier-stage technology, and we are not aware of existing attempts to deploy it in an aerial context. In this paper, we present what is thus, to our knowledge, the first system to perform simultaneous mapping and multi-person 3D human pose estimation from a monocular camera mounted on a single UAV. In particular, we show how to loosely couple state-of-the-art monocular depth estimation and monocular 3D human pose estimation approaches to reconstruct a hybrid map of a populated indoor scene in real time. We validate our component-level design choices via extensive experiments on the large-scale ScanNet and GTA-IM datasets. To evaluate our system-level performance, we also construct a new Oxford Hybrid Mapping dataset of populated indoor scenes.

ICRA Conference 2021 Conference Paper

3D Motion Capture of an Unmodified Drone with Single-chip Millimeter Wave Radar

  • Peijun Zhao
  • Chris Xiaoxuan Lu
  • Bing Wang 0013
  • Niki Trigoni
  • Andrew Markham

Accurate motion capture of aerial robots in 3D is a key enabler for autonomous operation in indoor environments such as warehouses or factories, as well as for driving forward research in these areas. The most commonly used solutions at present are optical motion capture (e.g. VICON) and Ultra-wideband (UWB), but these are costly and cumbersome to deploy, due to their requirement of multiple cameras/anchors spaced around the tracking area. They also require the drone to be modified to carry an active or passive marker. In this work, we present an inexpensive system that can be rapidly installed, based on single-chip millimeter wave (mmWave) radar. Importantly, the drone does not need to be modified or equipped with any markers, as we exploit the Doppler signals from the rotating propellers. Furthermore, 3D tracking is possible from a single point, greatly simplifying deployment. We develop a novel deep neural network and demonstrate decimeter-level 3D tracking at 10 Hz, achieving better performance than classical baselines. Our hope is that this low-cost system will act to catalyse inexpensive drone research and increased autonomy.

ICRA Conference 2021 Conference Paper

RadarLoc: Learning to Relocalize in FMCW Radar

  • Wei Wang 0226
  • Pedro P. B. de Gusmao
  • Bo Yang 0027
  • Andrew Markham
  • Niki Trigoni

Relocalization is a fundamental task in the field of robotics and computer vision. There is considerable work in the field of deep camera relocalization, which directly estimates poses from raw images. However, learning-based methods have not yet been applied to radar sensory data. In this work, we investigate how to exploit deep learning to predict global poses from emerging Frequency-Modulated Continuous Wave (FMCW) radar scans. Specifically, we propose a novel end-to-end neural network with self-attention, termed RadarLoc, which is able to estimate 6-DoF global poses directly. We also propose to improve the localization performance by utilizing geometric constraints between radar scans. We validate our approach on the recently released challenging outdoor dataset Oxford Radar RobotCar. Comprehensive experiments demonstrate that the proposed method outperforms radar-based localization and deep camera relocalization methods by a significant margin.

ICML Conference 2021 Conference Paper

SoundDet: Polyphonic Moving Sound Event Detection and Localization from Raw Waveform

  • Yuhang He
  • Niki Trigoni
  • Andrew Markham

We present SoundDet, an end-to-end trainable and lightweight framework for polyphonic moving sound event detection and localization. Prior methods typically approach this problem by preprocessing the raw waveform into time-frequency representations, which are more amenable to processing with well-established image processing pipelines. Prior methods also detect in a segment-wise manner, leading to incomplete and partial detections. SoundDet takes a novel approach and directly consumes the raw, multichannel waveform and treats the spatio-temporal sound event as a complete “sound-object” to be detected. Specifically, SoundDet consists of a backbone neural network and two parallel heads for temporal detection and spatial localization, respectively. Given the large sampling rate of the raw waveform, the backbone network first learns a set of phase-sensitive and frequency-selective filter banks to explicitly retain direction-of-arrival information, whilst being far more computationally and parametrically efficient than standard 1D/2D convolution. A dense sound event proposal map is then constructed to handle the challenge of predicting events with largely varying temporal durations. Accompanying the dense proposal map are a temporal overlapness map and a motion smoothness map that measure a proposal’s confidence to be an event from the perspectives of temporal detection accuracy and movement consistency. The two maps ensure that SoundDet is trained in a spatio-temporally unified manner. Experimental results on the public DCASE dataset show the advantage of SoundDet on both segment-based evaluation and our newly proposed event-based evaluation system.

AAAI Conference 2021 Conference Paper

VMLoc: Variational Fusion For Learning-Based Multimodal Camera Localization

  • Kaichen Zhou
  • Changhao Chen
  • Bing Wang
  • Muhamad Risqi U. Saputra
  • Niki Trigoni
  • Andrew Markham

Recent learning-based approaches have achieved impressive results in the field of single-shot camera localization. However, how best to fuse multiple modalities (e.g., image and depth) and to deal with degraded or missing input are less well studied. In particular, we note that previous approaches towards deep fusion do not perform significantly better than models employing a single modality. We conjecture that this is because of the naive approaches to feature space fusion through summation or concatenation which do not take into account the different strengths of each modality. To address this, we propose an end-to-end framework, termed VMLoc, to fuse different sensor inputs into a common latent space through a variational Product-of-Experts (PoE) followed by attention-based fusion. Unlike previous multimodal variational works directly adapting the objective function of the vanilla variational auto-encoder, we show how camera localization can be accurately estimated through an unbiased objective function based on importance weighting. Our model is extensively evaluated on RGB-D datasets and the results demonstrate the efficacy of our model. The source code is available at https://github.com/Zalex97/VMLoc.
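
The variational Product-of-Experts at the heart of the fusion admits a short closed form for Gaussian experts: precisions add, so a degraded modality with high predicted variance is automatically down-weighted. A minimal sketch (toy dimensions and hypothetical encoder outputs):

```python
import torch

def product_of_experts(mus, logvars):
    """Gaussian Product-of-Experts: the fused posterior's precision is the
    sum of the experts' precisions, so a degraded modality (high variance)
    automatically contributes less to the fused estimate."""
    precisions = [torch.exp(-lv) for lv in logvars]
    precision = sum(precisions)
    mu = sum(p * m for p, m in zip(precisions, mus)) / precision
    return mu, -torch.log(precision)  # fused mean and log-variance

# Hypothetical encoder outputs for RGB and depth modalities.
mu_rgb, logvar_rgb = torch.randn(4, 64), torch.zeros(4, 64)
mu_d, logvar_d = torch.randn(4, 64), torch.full((4, 64), 4.0)  # noisy depth
mu, logvar = product_of_experts([mu_rgb, mu_d], [logvar_rgb, logvar_d])
# With logvar_d = 4, depth carries only ~2% of the total precision,
# so the fused mean stays close to mu_rgb.
```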

AAAI Conference 2020 Conference Paper

AtLoc: Attention Guided Camera Localization

  • Bing Wang
  • Changhao Chen
  • Chris Xiaoxuan Lu
  • Peijun Zhao
  • Niki Trigoni
  • Andrew Markham

Deep learning has achieved impressive results in camera localization, but current single-image techniques typically suffer from a lack of robustness, leading to large outliers. To some extent, this has been tackled by sequential (multi-image) or geometry-constraint approaches, which can learn to reject dynamic objects and illumination conditions to achieve better performance. In this work, we show that attention can be used to force the network to focus on more geometrically robust objects and features, achieving state-of-the-art performance on common benchmarks, even when using only a single image as input. Extensive experimental evidence is provided through public indoor and outdoor datasets. Through visualization of the saliency maps, we demonstrate how the network learns to reject dynamic objects, yielding superior global camera pose regression performance. The source code is available at https://github.com/BingCS/AtLoc.
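
A sketch of the attention-then-regress pattern described here, using a generic multi-head self-attention layer as a stand-in for AtLoc's attention mechanism (the dimensions and the 6-parameter pose output are assumptions):

```python
import torch
import torch.nn as nn

class AttentionPoseHead(nn.Module):
    """Sketch of AtLoc-style attention: self-attention re-weights a CNN
    feature map so geometrically stable regions dominate before the
    global pose (3D translation + 3D rotation params) is regressed."""
    def __init__(self, dim=512):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.fc = nn.Linear(dim, 6)

    def forward(self, feats):                    # feats: (B, N, dim) tokens
        attended, _ = self.attn(feats, feats, feats)
        return self.fc((feats + attended).mean(dim=1))

head = AttentionPoseHead()
feats = torch.randn(2, 49, 512)   # e.g. a 7x7 CNN feature map, flattened
pose = head(feats)                # (2, 6) global camera pose
```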

ICRA Conference 2020 Conference Paper

Heart Rate Sensing with a Robot Mounted mmWave Radar

  • Peijun Zhao
  • Chris Xiaoxuan Lu
  • Bing Wang 0013
  • Changhao Chen
  • Linhai Xie
  • Mengyu Wang
  • Niki Trigoni
  • Andrew Markham

Heart rate monitoring at home is a useful metric for assessing health, e.g. of the elderly or patients in post-operative recovery. Although non-contact heart rate monitoring has been widely explored, typically using a static, wall-mounted device, measurements are limited to a single room and sensitive to user orientation and position. In this work, we propose mBeats, a robot-mounted millimeter wave (mmWave) radar system that provides periodic heart rate measurements under different user poses, without interfering with a user's daily activities. mBeats contains a mmWave servoing module that adaptively adjusts the sensor angle to the best reflection profile. Furthermore, mBeats features a deep neural network predictor, which can estimate heart rate from the lower leg and additionally provides estimation uncertainty. Through extensive experiments, we demonstrate accurate and robust operation of mBeats in a range of scenarios. We believe that by integrating mobility and adaptability, mBeats can empower many downstream healthcare applications at home, such as palliative care, post-operative rehabilitation and telemedicine.

ICRA Conference 2020 Conference Paper

SnapNav: Learning Mapless Visual Navigation with Sparse Directional Guidance and Visual Reference

  • Linhai Xie
  • Andrew Markham
  • Niki Trigoni

Learning-based visual navigation still remains a challenging problem in robotics, with two overarching issues: how to transfer the learnt policy to unseen scenarios, and how to deploy the system on real robots. In this paper, we propose a deep neural network based visual navigation system, SnapNav. Unlike map-based navigation or Visual-Teach-and-Repeat (VT&R), SnapNav only receives a few snapshots of the environment combined with directional guidance to allow it to execute the navigation task. Additionally, SnapNav can be easily deployed on real robots due to a two-level hierarchy: a high level commander that provides directional commands and a low level controller that provides real-time control and obstacle avoidance. This also allows us to effectively use simulated and real data to train the different layers of the hierarchy, facilitating robust control. Extensive experimental results show that SnapNav achieves a highly autonomous navigation ability compared to baseline models, enabling sparse, map-less navigation in previously unseen environments.

IROS Conference 2019 Conference Paper

DeepPCO: End-to-End Point Cloud Odometry through Deep Parallel Neural Network

  • Wei Wang 0226
  • Muhamad Risqi Utama Saputra
  • Peijun Zhao
  • Pedro P. B. de Gusmao
  • Bo Yang 0027
  • Changhao Chen
  • Andrew Markham
  • Niki Trigoni

Odometry is of key importance for localization in the absence of a map. There is considerable work in the area of visual odometry (VO), and recent advances in deep learning have brought novel approaches to VO, which directly learn salient features from raw images. These learning-based approaches have led to more accurate and robust VO systems. However, they have not been well applied to point cloud data yet. In this work, we investigate how to exploit deep learning to estimate point cloud odometry (PCO), which may serve as a critical component in point cloud-based downstream tasks or learning-based systems. Specifically, we propose a novel end-to-end deep parallel neural network called DeepPCO, which can estimate the 6-DOF poses using consecutive point clouds. It consists of two parallel sub-networks to estimate 3D translation and orientation respectively rather than a single neural network. We validate our approach on the KITTI Visual Odometry/SLAM benchmark dataset with different baselines. Experiments demonstrate that the proposed approach achieves good performance in terms of pose accuracy.

ICRA Conference 2019 Conference Paper

GANVO: Unsupervised Deep Monocular Visual Odometry and Depth Estimation with Generative Adversarial Networks

  • Yasin Almalioglu
  • Muhamad Risqi Utama Saputra
  • Pedro P. B. de Gusmao
  • Andrew Markham
  • Niki Trigoni

In the last decade, supervised deep learning approaches have been extensively employed in visual odometry (VO) applications, yet they are not feasible in environments where labelled data is not abundant. On the other hand, unsupervised deep learning approaches for localization and mapping in unknown environments from unlabelled data have received comparatively less attention in VO research. In this study, we propose a generative unsupervised learning framework that predicts 6-DoF camera motion and the monocular depth map of the scene from unlabelled RGB image sequences, using deep convolutional Generative Adversarial Networks (GANs). We create a supervisory signal by warping view sequences and assigning the re-projection minimization to the objective loss function that is adopted in the multi-view pose estimation and single-view depth generation networks. Detailed quantitative and qualitative evaluations of the proposed framework on the KITTI [1] and Cityscapes [2] datasets show that the proposed method outperforms both existing traditional and unsupervised deep VO methods, providing better results for both pose estimation and depth recovery.
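
The warping-based supervisory signal is standard enough to sketch: given predicted depth and relative pose, target pixels are projected into the source view, the source image is sampled there, and the photometric error supervises both networks. The version below is a minimal single-scale sketch (no occlusion handling or GAN discriminator):

```python
import torch
import torch.nn.functional as F

def photometric_warp_loss(src, tgt, depth, K, T):
    """Minimal sketch of the view-warping supervisory signal.
    src/tgt: (B,3,H,W) images, depth: (B,1,H,W), K: (B,3,3), T: (B,4,4)."""
    B, _, H, W = tgt.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)], 0).float()  # (3,H,W)
    pix = pix.view(1, 3, -1).expand(B, -1, -1)                   # (B,3,HW)

    cam = torch.linalg.inv(K) @ pix * depth.view(B, 1, -1)       # back-project
    cam = torch.cat([cam, torch.ones(B, 1, H * W)], dim=1)       # homogeneous
    proj = K @ (T @ cam)[:, :3]                                  # into source view
    uv = proj[:, :2] / proj[:, 2:].clamp(min=1e-6)

    # Normalise pixel coordinates to [-1, 1] for grid_sample, then
    # penalise the photometric (L1) reconstruction error.
    u = 2 * uv[:, 0] / (W - 1) - 1
    v = 2 * uv[:, 1] / (H - 1) - 1
    grid = torch.stack([u, v], dim=-1).view(B, H, W, 2)
    warped = F.grid_sample(src, grid, align_corners=True)
    return (warped - tgt).abs().mean()

B, H, W = 2, 16, 16
src, tgt = torch.rand(B, 3, H, W), torch.rand(B, 3, H, W)
depth = torch.ones(B, 1, H, W)              # predicted by the depth network
K = torch.eye(3).expand(B, 3, 3)            # camera intrinsics
T = torch.eye(4).expand(B, 4, 4)            # predicted relative pose
loss = photometric_warp_loss(src, tgt, depth, K, T)
```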

ICRA Conference 2019 Conference Paper

Learning Monocular Visual Odometry through Geometry-Aware Curriculum Learning

  • Muhamad Risqi Utama Saputra
  • Pedro P. B. de Gusmao
  • Sen Wang 0002
  • Andrew Markham
  • Niki Trigoni

Inspired by the cognitive process of humans and animals, Curriculum Learning (CL) trains a model by gradually increasing the difficulty of the training data. In this paper, we study whether CL can be applied to complex geometry problems like estimating monocular Visual Odometry (VO). Unlike existing CL approaches, we present a novel CL strategy for learning the geometry of monocular VO by gradually making the learning objective more difficult during training. To this end, we propose a novel geometry-aware objective function by jointly optimizing relative and composite transformations over small windows via a bounded pose regression loss. A cascade optical flow network followed by a recurrent network with a differentiable windowed composition layer, termed CL-VO, is devised to learn the proposed objective. Evaluation on three real-world datasets shows superior performance of CL-VO over state-of-the-art feature-based and learning-based VO.

NeurIPS Conference 2019 Conference Paper

Learning Object Bounding Boxes for 3D Instance Segmentation on Point Clouds

  • Bo Yang
  • Jianan Wang
  • Ronald Clark
  • Qingyong Hu
  • Sen Wang
  • Andrew Markham
  • Niki Trigoni

We propose a novel, conceptually simple and general framework for instance segmentation on 3D point clouds. Our method, called 3D-BoNet, follows the simple design philosophy of per-point multilayer perceptrons (MLPs). The framework directly regresses 3D bounding boxes for all instances in a point cloud, while simultaneously predicting a point-level mask for each instance. It consists of a backbone network followed by two parallel network branches for 1) bounding box regression and 2) point mask prediction. 3D-BoNet is single-stage, anchor-free and end-to-end trainable. Moreover, it is remarkably computationally efficient as, unlike existing approaches, it does not require any post-processing steps such as non-maximum suppression, feature sampling, clustering or voting. Extensive experiments show that our approach surpasses existing work on both ScanNet and S3DIS datasets while being approximately 10x more computationally efficient. Comprehensive ablation studies demonstrate the effectiveness of our design.

AAAI Conference 2019 Conference Paper

MotionTransformer: Transferring Neural Inertial Tracking between Domains

  • Changhao Chen
  • Yishu Miao
  • Chris Xiaoxuan Lu
  • Linhai Xie
  • Phil Blunsom
  • Andrew Markham
  • Niki Trigoni

Inertial information processing plays a pivotal role in egomotion awareness for mobile agents, as inertial measurements are entirely egocentric and not environment dependent. However, they are affected greatly by changes in sensor placement/orientation or motion dynamics, and it is infeasible to collect labelled data from every domain. To overcome the challenges of domain adaptation on long sensory sequences, we propose MotionTransformer - a novel framework that extracts domain-invariant features of raw sequences from arbitrary domains, and transforms to new domains without any paired data. Through the experiments, we demonstrate that it is able to efficiently and effectively convert the raw sequence from a new unlabelled target domain into an accurate inertial trajectory, benefiting from the motion knowledge transferred from the labelled source domain. We also conduct real-world experiments to show our framework can reconstruct physically meaningful trajectories from raw IMU measurements obtained with a standard mobile phone in various attachments.

AAMAS Conference 2019 Conference Paper

Optimising Worlds to Evaluate and Influence Reinforcement Learning Agents

  • Richard Everett
  • Adam Cobb
  • Andrew Markham
  • Stephen Roberts

Training reinforcement learning agents on a distribution of procedurally generated environments has become an increasingly common method for obtaining more generalisable agents. However, this makes evaluation challenging, as the space of possible environment settings is large; simply looking at the average performance is insufficient for understanding how well - or how poorly - the agents perform. To address this, we introduce a method for strategically evaluating and influencing the behaviour of reinforcement learning agents. Using deep generative modelling to encode the environment, we propose a World Agent which efficiently generates and optimises worlds (i.e. environment settings) relative to the performance of the agents. Through the use of our method on two distinct environments, we demonstrate the existence of worlds which minimise and maximise agent reward beyond the typically reported average reward. Additionally, we show how our method can also be used to modify the distribution of worlds that agents train on, influencing their emergent behaviour to be more desirable.

IJCAI Conference 2018 Conference Paper

3D-PhysNet: Learning the Intuitive Physics of Non-Rigid Object Deformations

  • Zhihua Wang
  • Stefano Rosa
  • Bo Yang
  • Sen Wang
  • Niki Trigoni
  • Andrew Markham

The ability to interact and understand the environment is a fundamental prerequisite for a wide range of applications from robotics to augmented reality. In particular, predicting how deformable objects will react to applied forces in real time is a significant challenge. This is further confounded by the fact that shape information about encountered objects in the real world is often impaired by occlusions, noise and missing regions, e.g. a robot manipulating an object will only be able to observe a partial view of the entire solid. In this work we present a framework, 3D-PhysNet, which is able to predict how a three-dimensional solid will deform under an applied force using intuitive physics modelling. In particular, we propose a new method to encode the physical properties of the material and the applied force, enabling generalisation over materials. The key is to combine deep variational autoencoders with adversarial training, conditioned on the applied force and the material properties. We further propose a cascaded architecture that takes a single 2.5D depth view of the object and predicts its deformation. Training data is provided by a physics simulator. The network is fast enough to be used in real-time applications from partial views. Experimental results show the viability and the generalisation properties of the proposed architecture.

ICRA Conference 2018 Conference Paper

DEFO-NET: Learning Body Deformation Using Generative Adversarial Networks

  • Zhihua Wang 0005
  • Stefano Rosa
  • Linhai Xie
  • Bo Yang 0027
  • Sen Wang 0002
  • Niki Trigoni
  • Andrew Markham

Modelling the physical properties of everyday objects is a fundamental prerequisite for autonomous robots. We present a novel generative adversarial network (DEFO-NET), able to predict body deformations under external forces from a single RGB-D image. The network is based on an invertible conditional Generative Adversarial Network (IcGAN) and is trained on a collection of different objects of interest generated by a physical finite element model simulator. DEFO-NET inherits the generalisation properties of GANs. This means that the network is able to reconstruct the whole 3-D appearance of the object given a single depth view of the object and to generalise to unseen object configurations. Contrary to traditional finite element methods, our approach is fast enough to be used in real-time applications. We apply the network to the problem of safe and fast navigation of mobile robots carrying payloads over different obstacles and floor materials. Experimental results in real scenarios show how a robot equipped with an RGB-D camera can use the network to predict terrain deformations under different payload configurations and use this to avoid unsafe areas.

ICRA Conference 2018 Conference Paper

iMag: Accurate and Rapidly Deployable Inertial Magneto-Inductive Localisation

  • Bo Wei 0003
  • Niki Trigoni
  • Andrew Markham

Localisation is of importance for many applications. Our motivating scenarios are short-term construction work and emergency rescue. Not only is accuracy necessary, these scenarios also require rapid setup and robustness to environmental conditions. These requirements preclude the use of many traditional methods, e.g. vision-based, laser-based, Ultra-wideband (UWB) and Global Positioning System (GPS)-based localisation systems. To solve these challenges, we introduce iMag, an accurate and rapidly deployable inertial magneto-inductive (MI) localisation system. It localises monitored workers using a single MI transmitter and inertial measurement units with minimal setup effort. However, MI location estimates can be distorted and ambiguous. To solve this problem, we suggest a novel method to use MI devices for sensing environmental distortions, and use these to correctly close inertial loops. By applying robust simultaneous localisation and mapping (SLAM), our proposed localisation method achieves excellent tracking accuracy, and can improve performance significantly compared with only using an inertial measurement unit (IMU) and MI device for localisation.

AAAI Conference 2018 Conference Paper

IONet: Learning to Cure the Curse of Drift in Inertial Odometry

  • Changhao Chen
  • Xiaoxuan Lu
  • Andrew Markham
  • Niki Trigoni

Inertial sensors play a pivotal role in indoor localization, which in turn lays the foundation for pervasive personal applications. However, low-cost inertial sensors, as commonly found in smartphones, are plagued by bias and noise, which leads to unbounded growth in error when accelerations are double integrated to obtain displacement. Small errors in state estimation propagate to make odometry virtually unusable in a matter of seconds. We propose to break the cycle of continuous integration, and instead segment inertial data into independent windows. The challenge becomes estimating the latent states of each window, such as velocity and orientation, as these are not directly observable from sensor data. We demonstrate how to formulate this as an optimization problem, and show how deep recurrent neural networks can yield highly accurate trajectories, outperforming state-of-the-art shallow techniques, on a wide range of tests and attachments. In particular, we demonstrate that IONet can generalize to estimate odometry for non-periodic motion, such as a shopping trolley or baby-stroller, an extremely challenging task for existing techniques.
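
The windowing idea is easy to make concrete. Below is a minimal IONet-style model (the layer sizes and the 2 s / 100 Hz window are assumptions for illustration, not the released configuration): each independent window is mapped to a polar displacement, and windows are chained to form a trajectory.

```python
import torch
import torch.nn as nn

class IONetStyle(nn.Module):
    """Sketch of window-based inertial odometry: instead of double-
    integrating raw IMU data (which drifts within seconds), an RNN maps
    each independent window to a polar displacement (dl, d_heading)."""
    def __init__(self, hidden=128):
        super().__init__()
        self.lstm = nn.LSTM(6, hidden, num_layers=2, batch_first=True)
        self.head = nn.Linear(hidden, 2)   # (step length, heading change)

    def forward(self, imu_window):          # (B, 200, 6): 2 s at 100 Hz
        h, _ = self.lstm(imu_window)
        return self.head(h[:, -1])

model = IONetStyle()
window = torch.randn(1, 200, 6)             # accel xyz + gyro xyz
dl, dpsi = model(window)[0]
# Chaining windows: x += dl*cos(psi); y += dl*sin(psi); psi += dpsi,
# which breaks the cycle of continuous integration described above.
```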

ICRA Conference 2018 Conference Paper

Learning with Training Wheels: Speeding up Training with a Simple Controller for Deep Reinforcement Learning

  • Linhai Xie
  • Sen Wang 0002
  • Stefano Rosa
  • Andrew Markham
  • Niki Trigoni

Deep Reinforcement Learning (DRL) has been applied successfully to many robotic applications. However, the large number of trials needed for training is a key issue. Most existing techniques developed to improve training efficiency (e.g. imitation) target general tasks rather than being tailored to robot applications, which have specific context to benefit from. We propose a novel framework, Assisted Reinforcement Learning, where a classical controller (e.g. a PID controller) is used as an alternative, switchable policy to speed up training of DRL for local planning and navigation problems. The core idea is that the simple control law allows the robot to rapidly learn sensible primitives, like driving in a straight line, instead of random exploration. As the actor network becomes more advanced, it can then take over to perform more complex actions, like obstacle avoidance. Eventually, the simple controller can be discarded entirely. We show that not only does this technique train faster, it is also less sensitive to the structure of the DRL network and consistently outperforms a standard Deep Deterministic Policy Gradient network. We demonstrate the results in both simulation and real-world experiments.
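
A toy sketch of the switchable-policy idea under assumed details (the P-controller, the annealing schedule, and the probabilistic switch are illustrative; the paper's actual switching criterion may differ):

```python
import numpy as np

class AssistedPolicy:
    """Sketch of the switchable-policy idea: early in training the simple
    controller (a P-controller here) acts most of the time, and the DRL
    actor gradually takes over as training progresses."""
    def __init__(self, actor, kp=1.5):
        self.actor = actor      # assumed: callable obs -> action
        self.kp = kp
        self.p_actor = 0.0      # probability of using the DRL actor

    def controller(self, obs):
        # Drive toward the goal direction encoded in obs[:2].
        return np.clip(self.kp * obs[:2], -1.0, 1.0)

    def act(self, obs):
        if np.random.rand() < self.p_actor:
            return self.actor(obs)
        return self.controller(obs)

    def anneal(self, step, total_steps):
        # Hand over control as the actor matures; eventually the
        # simple controller is discarded entirely.
        self.p_actor = min(1.0, step / (0.5 * total_steps))

policy = AssistedPolicy(actor=lambda obs: np.tanh(obs[:2]))
obs = np.array([0.3, -0.8, 0.1])
for step in range(3):
    policy.anneal(step, total_steps=10)
    action = policy.act(obs)
```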

AAAI Conference 2017 Short Paper

Evolutionary Machine Learning for RTS Game StarCraft

  • Lianlong Wu
  • Andrew Markham

Real-Time Strategy (RTS) games involve multiple agents acting simultaneously, and result in enormous state dimensionality. In this paper, we propose an abstracted and simplified model for the famous game StarCraft, and design a dynamic programming algorithm to solve the build-order problem, i.e. reaching a specific build target in minimal time. In addition, Genetic Algorithms (GA) are used to find an optimal target for the opening stage.
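
The build-order idea can be illustrated with a deliberately tiny model (a toy, not the paper's StarCraft abstraction): a state of (workers, minerals), one build action, and a dynamic program over time to reach a target worker count in minimal ticks.

```python
# Toy sketch of the build-order idea: each worker adds income per tick,
# and building a new worker costs 50 minerals. Assumes workers >= 1.
from functools import lru_cache

COST, INCOME_PER_WORKER, TICK = 50, 1, 1

@lru_cache(maxsize=None)
def min_time(workers, minerals, target):
    if workers >= target:
        return 0
    if minerals >= COST:
        # Build now; in this toy model waiting with enough minerals
        # never helps, since more workers only increase income.
        return TICK + min_time(workers + 1,
                               minerals - COST + workers * INCOME_PER_WORKER,
                               target)
    # Otherwise wait one tick, collecting income.
    return TICK + min_time(workers,
                           minerals + workers * INCOME_PER_WORKER,
                           target)

print(min_time(6, 0, 8))  # minimal ticks from 6 to 8 workers -> 17
```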

IROS Conference 2017 Conference Paper

GraphTinker: Outlier rejection and inlier injection for pose graph SLAM

  • Linhai Xie
  • Sen Wang 0002
  • Andrew Markham
  • Niki Trigoni

In pose graph Simultaneous Localization and Mapping (SLAM) systems, incorrect loop closures can seriously hinder optimizers from converging to correct solutions, significantly degrading both localization accuracy and map consistency. Therefore, it is crucial to enhance their robustness in the presence of numerous false-positive loop closures. Existing approaches tend to fail when working with very unreliable front-end systems, where the majority of inferred loop closures are incorrect. In this paper, we propose a novel middle layer, seamlessly embedded between front and back ends, to boost the robustness of the whole SLAM system. The main contributions of this paper are two-fold: 1) the proposed middle layer offers a new mechanism to reliably detect and remove false-positive loop closures, even if they form the overwhelming majority; 2) artificial loop closures are automatically reconstructed and injected into pose graphs in the framework of an Extended Rauch-Tung-Striebel smoother, reinforcing reliable loop closures. The proposed algorithm alters the graph generated by the front-end, which can then be optimized by any back-end system. Extensive experiments are conducted to demonstrate significantly improved accuracy and robustness compared with state-of-the-art methods and various back-ends, verifying the effectiveness of the proposed algorithm.

UAI Conference 2017 Conference Paper

Interpreting Lion Behaviour as Probabilistic Programs

  • Neil Dhir
  • Matthijs Vákár
  • Matthew Wijers
  • Andrew Markham
  • Frank Wood

We consider the problem of unsupervised learning of meaningful behavioural segments of high-dimensional time-series observations, collected from a pride of African lions. We demonstrate, by way of a probabilistic programming system (PPS), a methodology which allows for quick iteration over models and Bayesian inferences, which enables us to learn meaningful behavioural segments. We introduce a new Bayesian nonparametric (BNP) state-space model, which extends the hierarchical Dirichlet process (HDP) hidden Markov model (HMM) with an explicit BNP treatment of duration distributions, to deal with different levels of granularity of the latent behavioural space of the lions. The ease with which this is done exemplifies the flexibility that a PPS gives a scientist. Furthermore, we combine this approach with unsupervised feature learning, using variational autoencoders.

AAAI Conference 2017 Conference Paper

VINet: Visual-Inertial Odometry as a Sequence-to-Sequence Learning Problem

  • Ronald Clark
  • Sen Wang
  • Hongkai Wen
  • Andrew Markham
  • Niki Trigoni

In this paper we present an on-manifold sequence-to-sequence learning approach to motion estimation using visual and inertial sensors. It is to the best of our knowledge the first end-to-end trainable method for visual-inertial odometry which performs fusion of the data at an intermediate feature-representation level. Our method has numerous advantages over traditional approaches. Specifically, it eliminates the need for tedious manual synchronization of the camera and IMU as well as eliminating the need for manual calibration between the IMU and camera. A further advantage is that our model naturally and elegantly incorporates domain specific information which significantly mitigates drift. We show that our approach is competitive with state-of-the-art traditional methods when accurate calibration data is available and can be trained to outperform them in the presence of calibration and synchronization errors.