Arrow Research search

Author name cluster

Dinesh Manocha

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

175 papers
2 author rows

Possible papers

175

AAAI Conference 2026 Conference Paper

Bi-VLM: Binary Post-Training Quantization for Vision-Language Models

  • Xijun Wang
  • Rayyan Abdalla
  • Junyun Huang
  • Chengyuan Zhang
  • Ruiqi Xian
  • Dinesh Manocha

We address the critical gap between the computational demands of vision-language models and the ultra-low-bit weight precision (bitwidth <= 2 bits) that can be used for higher efficiency. Our work is motivated by the substantial computational cost and memory requirements of VLMs, which restrict their applicability in hardware-constrained environments. We propose Bi-VLM, which separates model weights non-uniformly based on Gaussian quantiles. Our formulation groups the model weights into an outlier subset and multiple inlier subsets, ensuring that each subset contains a proportion of weights corresponding to its quantile in the distribution. We propose a saliency-aware hybrid quantization algorithm and use it to quantize weights by imposing different constraints on the scalar and binary matrices based on the saliency metric and compression objective. We evaluate our approach on different VLMs. For the language model part of the VLM, Bi-VLM outperforms the SOTA by 3%-47% on the visual question answering task across four different benchmarks and three different models. For the overall VLM, Bi-VLM outperforms the SOTA by 4%-45%.
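
The quantile-based grouping described above can be sketched as follows; the magnitude-based partition and the cut points (0.5, 0.9) are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def quantile_partition(weights, cuts=(0.5, 0.9)):
    """Split weights by magnitude into inlier bands and an outlier subset,
    so each subset holds the proportion of weights implied by its quantile.
    Sketch only: the paper uses Gaussian quantiles; the cut points here
    are hypothetical."""
    w = np.asarray(weights, dtype=np.float64)
    mag = np.abs(w)
    thresholds = np.quantile(mag, cuts)          # magnitude cut points
    groups, lo = [], -np.inf
    for t in thresholds:
        groups.append(w[(mag > lo) & (mag <= t)])  # inlier band
        lo = t
    groups.append(w[mag > lo])                   # outlier subset (largest weights)
    return groups

rng = np.random.default_rng(0)
sizes = [len(g) for g in quantile_partition(rng.normal(size=10_000))]
```

With these cuts, roughly 50% and 40% of the weights land in the two inlier bands and the largest 10% form the outlier subset, which would then be quantized under different constraints.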

AAAI Conference 2026 Conference Paper

MMAU-Pro: A Challenging and Comprehensive Benchmark for Holistic Evaluation of Audio General Intelligence

  • Sonal Kumar
  • Šimon Sedláček
  • Vaibhavi Lokegaonkar
  • Fernando López
  • Wenyi Yu
  • Nishit Anand
  • Hyeonggon Ryu
  • Lichang Chen

Audio comprehension (including speech, non-speech sounds, and music) is essential for achieving human-level intelligence. Consequently, AI agents must demonstrate holistic audio understanding to qualify as generally intelligent. However, evaluating auditory intelligence comprehensively remains challenging. To address this gap, we introduce MMAU-Pro, the most comprehensive and rigorously curated benchmark for assessing audio intelligence in AI systems. MMAU-Pro contains 5,305 instances, where each instance has one or more audio clips paired with human expert-generated question-answer pairs, spanning speech, sound, music, and their combinations. Unlike existing benchmarks, MMAU-Pro evaluates auditory intelligence across 49 unique skills and multiple complex dimensions, including long-form audio comprehension, spatial audio reasoning, and multi-audio understanding, among others. All questions are meticulously designed to require deliberate multi-hop reasoning and include both multiple-choice and open-ended response formats. Importantly, audio data is sourced directly "from the wild" rather than from existing datasets with known distributions. We evaluate 22 leading open-source and proprietary multimodal AI models, revealing significant limitations: even state-of-the-art models such as Gemini 2.5 Flash and Audio Flamingo 3 achieve only 57.33% and 45.9% accuracy, respectively, approaching random performance in multiple categories. Our extensive analysis highlights specific shortcomings and provides novel insights, offering actionable perspectives for the community to enhance future AI systems' progression toward audio general intelligence.

AAAI Conference 2026 Conference Paper

UAV4D: Dynamic Neural Rendering of Human-Centric UAV Imagery Using Gaussian Splatting

  • Jaehoon Choi
  • Dongki Jung
  • Chris Maxey
  • Sungmin Eum
  • Yonghan Lee
  • Dinesh Manocha
  • Heesung Kwon

Despite significant advancements in dynamic neural rendering, existing methods fail to address the unique challenges posed by UAV-captured scenarios, particularly those involving monocular camera setups, top-down perspective, and multiple small, moving humans, which are not adequately represented in existing datasets. In this work, we introduce UAV4D, a framework for enabling photorealistic rendering for dynamic real-world scenes captured by UAVs. Specifically, we address the challenge of reconstructing dynamic scenes with multiple moving pedestrians from monocular video data without the need for additional sensors. We use a combination of a 3D foundation model and a human mesh reconstruction model to reconstruct both the scene background and humans. We propose a novel approach to resolve the scene scale ambiguity and place both humans and the scene in world coordinates by identifying human-scene contact points. Additionally, we exploit the SMPL model and background mesh to initialize Gaussian splats, enabling holistic scene rendering. We evaluated our method on three complex UAV-captured datasets: VisDrone, Manipal-UAV, and Okutama-Action, each with distinct characteristics and 10-50 humans. Our results demonstrate the benefits of our approach over existing methods in novel view synthesis, achieving a 1.5 dB PSNR improvement and superior visual sharpness.

ICML Conference 2025 Conference Paper

Audio Flamingo 2: An Audio-Language Model with Long-Audio Understanding and Expert Reasoning Abilities

  • Sreyan Ghosh
  • Zhifeng Kong
  • Sonal Kumar
  • S. Sakshi
  • Jaehyeon Kim
  • Wei Ping
  • Rafael Valle
  • Dinesh Manocha

Understanding and reasoning over non-speech sounds and music are crucial for both humans and AI agents to interact effectively with their environments. In this paper, we introduce Audio Flamingo 2 (AF2), an Audio-Language Model (ALM) with advanced audio understanding and reasoning capabilities. AF2 leverages (i) a custom CLAP model, (ii) synthetic Audio QA data for fine-grained audio reasoning, and (iii) a multi-stage curriculum learning strategy. AF2 achieves state-of-the-art performance with only a 3B parameter small language model, surpassing large open-source and proprietary models across over 20 benchmarks. Next, for the first time, we extend audio understanding to long audio segments (30 secs to 5 mins) and propose LongAudio, a large and novel dataset for training ALMs on long audio captioning and question-answering tasks. Fine-tuning AF2 on LongAudio leads to exceptional performance on our proposed LongAudioBench, an expert-annotated benchmark for evaluating ALMs on long audio understanding capabilities. We conduct extensive ablation studies to confirm the efficacy of our approach.

NeurIPS Conference 2025 Conference Paper

Audio Flamingo 3: Advancing Audio Intelligence with Fully Open Large Audio Language Models

  • Sreyan Ghosh
  • Arushi Goel
  • Jaehyeon Kim
  • Sonal Kumar
  • Zhifeng Kong
  • Sang-gil Lee
  • Chao-Han Yang
  • Ramani Duraiswami

We present Audio Flamingo 3 (AF3), a fully open state-of-the-art (SOTA) large audio-language model that advances reasoning and understanding across speech, sound, and music. AF3 introduces: (i) AF-Whisper, a unified audio encoder trained using a novel strategy for joint representation learning across all three modalities of speech, sound, and music; (ii) flexible, on-demand thinking, allowing the model to do chain-of-thought-type reasoning before answering; (iii) multi-turn, multi-audio chat; (iv) long audio understanding and reasoning (including speech) up to 10 minutes; and (v) voice-to-voice interaction. To enable these capabilities, we propose several large-scale training datasets curated using novel strategies, including AudioSkills-XL, LongAudio-XL, AF-Think, and AF-Chat, and train AF3 with a novel five-stage curriculum-based training strategy. Trained on only open-source audio data, AF3 achieves new SOTA results on 20+ (long) audio understanding and reasoning benchmarks, surpassing both open-weight and closed-source models trained on much larger datasets.

IROS Conference 2025 Conference Paper

AutoSpatial: Visual-Language Reasoning for Social Robot Navigation through Efficient Spatial Reasoning Learning

  • Yangzhe Kong
  • Daeun Song
  • Jing Liang 0006
  • Dinesh Manocha
  • Ziyu Yao 0002
  • Xuesu Xiao

We present AutoSpatial, a novel and efficient approach with structured spatial grounding to enhance VLMs' spatial reasoning. By combining minimal manual supervision with large-scale auto-labeling of Visual Question-Answering (VQA) pairs, our approach tackles the challenge of VLMs' limited spatial understanding in social navigation tasks. By applying a hierarchical two-round VQA strategy during training, AutoSpatial achieves both global and detailed understanding of scenarios, demonstrating more accurate spatial perception, movement prediction, Chain of Thought (CoT) reasoning, final action, and explanation compared to other SOTA approaches. These five components are essential for comprehensive social navigation reasoning. Our approach was evaluated using both expert systems (GPT-4o, Gemini 2.0 Flash, and Claude 3.5 Sonnet) that provided cross-validation scores and human evaluators who assigned relative rankings to compare model performances across four key aspects. Augmented by the enhanced spatial reasoning capabilities, AutoSpatial demonstrates substantial improvements in averaged cross-validation score from expert systems in perception & prediction (up to 10.71%), reasoning (up to 16.26%), action (up to 20.50%), and explanation (up to 18.73%) compared to baseline models trained only on manually annotated data.

ICRA Conference 2025 Conference Paper

BehAV: Behavioral Rule Guided Autonomy Using VLMs for Robot Navigation in Outdoor Scenes

  • Kasun Weerakoon
  • Mohamed Elnoor
  • Gershom Seneviratne
  • Vignesh Rajagopal
  • Senthil Hariharan Arul
  • Jing Liang 0006
  • Mohamed Khalid M. Jaffar
  • Dinesh Manocha

We present BehAV, a novel approach for autonomous robot navigation in outdoor scenes guided by human instructions and leveraging Vision Language Models (VLMs). Our method interprets human commands using a Large Language Model (LLM) and categorizes the instructions into navigation and behavioral guidelines. Navigation guidelines consist of directional commands (e.g., "move forward until") and associated landmarks (e.g., "the building with blue windows"), while behavioral guidelines encompass regulatory actions (e.g., "stay on") and their corresponding objects (e.g., "pavements"). We use VLMs for their zero-shot scene understanding capabilities to estimate landmark locations from RGB images for robot navigation. Further, we introduce a novel scene representation that utilizes VLMs to ground behavioral rules into a behavioral cost map. This cost map encodes the presence of behavioral objects within the scene and assigns costs based on their regulatory actions. The behavioral cost map is integrated with a LiDAR-based occupancy map for navigation. To navigate outdoor scenes while adhering to the instructed behaviors, we present an unconstrained Model Predictive Control (MPC)-based planner that prioritizes both reaching landmarks and following behavioral guidelines. We evaluate the performance of BehAV on a quadruped robot across diverse real-world scenarios, demonstrating a 22.49% improvement in alignment with human-teleoperated actions, as measured by Fréchet distance, and achieving a 40% higher navigation success rate compared to state-of-the-art methods.
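
As a toy illustration of fusing a behavioral cost map with an occupancy map, the following sketch uses made-up grids and an illustrative weight, not the paper's values:

```python
import numpy as np

# Toy 3x3 grids: occupancy (1 = obstacle) from LiDAR, and a behavioral
# cost in [0, 1] from VLM-grounded rules (e.g. cells off the pavement).
occupancy = np.array([[0, 1, 0],
                      [0, 0, 0],
                      [0, 0, 0]])
behavior = np.array([[0.0, 0.0, 0.9],
                     [0.2, 0.0, 0.9],
                     [0.2, 0.0, 0.0]])

# Fused navigation cost: obstacles are impassable (infinite cost);
# elsewhere the behavioral cost is weighted in. w_b is illustrative.
w_b = 10.0
cost = np.where(occupancy == 1, np.inf, w_b * behavior)
```

An MPC planner would then score candidate trajectories by summing `cost` along the cells they traverse, trading off landmark progress against behavioral penalties.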

TMLR Journal 2025 Journal Article

Beyond Joint Demonstrations: Personalized Expert Guidance for Efficient Multi-Agent Reinforcement Learning

  • Peihong Yu
  • Manav Mishra
  • Alec Koppel
  • Carl Busart
  • Priya Narayan
  • Dinesh Manocha
  • Amrit Singh Bedi
  • Pratap Tokekar

Multi-Agent Reinforcement Learning (MARL) algorithms face the challenge of efficient exploration due to the exponential increase in the size of the joint state-action space. While demonstration-guided learning has proven beneficial in single-agent settings, its direct applicability to MARL is hindered by the practical difficulty of obtaining joint expert demonstrations. In this work, we introduce a novel concept of personalized expert demonstrations, tailored for each individual agent or, more broadly, each individual type of agent within a heterogeneous team. These demonstrations solely pertain to single-agent behaviors and how each agent can achieve personal goals without encompassing any cooperative elements, thus naively imitating them will not achieve cooperation due to potential conflicts. To this end, we propose an approach that selectively utilizes personalized expert demonstrations as guidance and allows agents to learn to cooperate, namely personalized expert-guided MARL (PegMARL). This algorithm utilizes two discriminators: the first provides incentives based on the alignment of individual agent behavior with demonstrations, and the second regulates incentives based on whether the behaviors lead to the desired outcome. We evaluate PegMARL using personalized demonstrations in both discrete and continuous environments. The results demonstrate that PegMARL learns near-optimal policies even when provided with suboptimal demonstrations and outperforms state-of-the-art MARL algorithms in solving coordinated tasks. We also showcase PegMARL’s capability of leveraging joint demonstrations in the StarCraft scenario and converging effectively even with demonstrations from non-co-trained policies.

ICML Conference 2025 Conference Paper

Bounded Rationality for LLMs: Satisficing Alignment at Inference-Time

  • Mohamad Fares El Hajj Chehade
  • Soumya Suvra Ghosal
  • Souradip Chakraborty
  • Avinash Reddy
  • Dinesh Manocha
  • Hao Zhu
  • Amrit Singh Bedi

Aligning large language models with humans is challenging due to the inherently multifaceted nature of preference feedback. While existing approaches typically frame this as a multi-objective optimization problem, they often overlook how humans actually make decisions. Research on bounded rationality suggests that human decision making follows satisficing strategies: optimizing primary objectives while ensuring others meet acceptable thresholds. To bridge this gap and operationalize the notion of satisficing alignment, we propose SITAlign: an inference-time framework that addresses the multifaceted nature of alignment by maximizing a primary objective while satisfying threshold-based constraints on secondary criteria. We provide theoretical insights by deriving sub-optimality bounds of our satisficing-based inference alignment approach. We empirically validate SITAlign's performance through extensive experimentation on multiple benchmarks. For instance, on the PKU-SafeRLHF dataset with the primary objective of maximizing helpfulness while ensuring a threshold on harmlessness, SITAlign outperforms the state-of-the-art multi-objective decoding strategy by a margin of 22.3% in terms of GPT-4 win-tie rate for helpfulness reward while adhering to the threshold on harmlessness.
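
The satisficing idea can be sketched at the level of whole responses; the candidates, scores, and threshold below are hypothetical, and SITAlign itself operates at inference time on reward-model scores rather than a fixed list:

```python
def satisficing_select(candidates, threshold=0.8):
    """Maximize the primary reward (helpfulness) subject to a threshold
    on the secondary reward (harmlessness); fall back to the full pool
    if nothing satisfices."""
    feasible = [c for c in candidates if c[1]["harmless"] >= threshold]
    pool = feasible or candidates
    return max(pool, key=lambda c: c[1]["helpful"])

# Hypothetical responses scored by two reward models.
candidates = [
    ("refuse outright",       {"helpful": 0.20, "harmless": 0.99}),
    ("detailed risky answer", {"helpful": 0.95, "harmless": 0.40}),
    ("safe detailed answer",  {"helpful": 0.85, "harmless": 0.90}),
]
best, _ = satisficing_select(candidates)
```

An unconstrained argmax on helpfulness would pick the risky answer; the harmlessness threshold steers selection to the safe one.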

ICLR Conference 2025 Conference Paper

Collab: Controlled Decoding using Mixture of Agents for LLM Alignment

  • Souradip Chakraborty
  • Sujay Bhatt
  • Udari Madhushani
  • Soumya Suvra Ghosal
  • Jiahao Qiu
  • Mengdi Wang 0001
  • Dinesh Manocha
  • Furong Huang

Alignment of Large Language models (LLMs) is crucial for safe and trustworthy deployment in applications. Reinforcement learning from human feedback (RLHF) has emerged as an effective technique to align LLMs to human preferences and broader utilities, but it requires updating billions of model parameters, which is computationally expensive. Controlled Decoding, by contrast, provides a mechanism for aligning a model at inference time without retraining. However, single-agent decoding approaches often struggle to adapt to diverse tasks due to the complexity and variability inherent in these tasks. To strengthen test-time performance w.r.t. the target task, we propose a mixture-of-agents-based decoding strategy leveraging existing off-the-shelf aligned LLM policies. Treating each prior policy as an agent in the spirit of mixture-of-agents collaboration, we develop a decoding method that allows for inference-time alignment through a token-level selection strategy among multiple agents. For each token, the most suitable LLM is dynamically chosen from a pool of models based on a long-term utility metric. This policy-switching mechanism ensures optimal model selection at each step, enabling efficient collaboration and alignment among LLMs during decoding. Theoretical analysis of our proposed algorithm establishes optimal performance with respect to the target task represented via a target reward, for the given off-the-shelf models. We conduct comprehensive empirical evaluations with open-source aligned models on diverse tasks and preferences, which demonstrates the merits of this approach over single-agent decoding baselines. Notably, Collab surpasses the current SoTA decoding strategy, achieving an improvement of up to 1.56x in average reward and 71.89% in GPT-4 based win-tie rate.
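
A minimal sketch of token-level policy switching among agents; the toy agents and their utility scores below are invented, whereas the paper uses a learned long-term utility metric over real LLM policies:

```python
# Toy agents: each maps the current prefix to a proposed next token and
# a utility estimate (a stand-in for the paper's long-term utility).
def agent_a(prefix):   # hypothetical agent that is strong at openings
    return ("Hello", 0.9) if not prefix else ("world", 0.3)

def agent_b(prefix):   # hypothetical agent that is strong at continuations
    return ("Hi", 0.4) if not prefix else ("there", 0.8)

def collab_decode(agents, steps=2):
    """At every step, emit the token proposed by whichever agent reports
    the highest utility -- per-token policy switching in sketch form."""
    prefix = []
    for _ in range(steps):
        token, _ = max((a(prefix) for a in agents), key=lambda t: t[1])
        prefix.append(token)
    return prefix

out = collab_decode([agent_a, agent_b])
```

Here the decoder takes the opening token from one agent and the continuation from the other, which no single agent alone would produce.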

IROS Conference 2025 Conference Paper

Confidence-Controlled Exploration: Efficient Sparse-Reward Policy Learning for Robot Navigation

  • Bhrij Patel
  • Kasun Weerakoon
  • Wesley A. Suttle
  • Alec Koppel
  • Brian M. Sadler
  • Tianyi Zhou 0001
  • Dinesh Manocha
  • Amrit Singh Bedi

Reinforcement learning (RL) is a promising approach for robotic navigation, allowing robots to learn through trial and error. However, real-world robotic tasks often suffer from sparse rewards, leading to inefficient exploration and suboptimal policies due to sample inefficiency of RL. In this work, we introduce Confidence-Controlled Exploration (CCE), a novel method that improves sample efficiency in RL-based robotic navigation without modifying the reward function. Unlike existing approaches, such as entropy regularization and reward shaping, which can introduce instability by altering rewards, CCE dynamically adjusts trajectory length based on policy entropy. Specifically, it shortens trajectories when uncertainty is high to enhance exploration and extends them when confidence is high to prioritize exploitation. CCE is a principled and practical solution inspired by a theoretical connection between policy entropy and gradient estimation. It integrates seamlessly with on-policy and off-policy RL methods and requires minimal modifications. We validate CCE across REINFORCE, PPO, and SAC in both simulated and real-world navigation tasks. CCE outperforms fixed-trajectory and entropy-regularized baselines, achieving an 18% higher success rate, 20-38% shorter paths, and 9.32% lower elevation costs under a fixed training sample budget. Finally, we deploy CCE on a Clearpath Husky robot, demonstrating its effectiveness in complex outdoor environments.
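
The entropy-to-length schedule can be sketched as below; the linear rule and the bounds `k_min`/`k_max` are assumptions for illustration, not the paper's exact mechanism:

```python
import math

def trajectory_length(entropy, h_max, k_min=50, k_max=500):
    """Map policy entropy (uncertainty) to a rollout length: high entropy
    yields short trajectories (more exploration), low entropy yields long
    ones (exploitation). The linear schedule is a sketch."""
    frac = max(0.0, min(1.0, entropy / h_max))   # normalized uncertainty
    return int(round(k_max - frac * (k_max - k_min)))

h_max = math.log(4)   # maximum entropy of a uniform 4-action policy
```

A fully uncertain policy (entropy at `h_max`) gets the shortest rollouts, and a confident one gets the longest.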

IROS Conference 2025 Conference Paper

CROSS-GAiT: Cross-Attention-Based Multimodal Representation Fusion for Parametric Gait Adaptation in Complex Terrains

  • Gershom Seneviratne
  • Kasun Weerakoon
  • Mohamed Elnoor
  • Vignesh Rajgopal
  • Harshavarthan Varatharajan
  • Mohamed Khalid M. Jaffar
  • Jason L. Pusey
  • Dinesh Manocha

We present CROSS-GAiT, a novel algorithm for quadruped robots that uses Cross Attention to fuse terrain representations derived from visual and time-series inputs, including linear accelerations, angular velocities, and joint efforts. These fused representations are used to continuously adjust two critical gait parameters (step height and hip splay), enabling adaptive gaits that respond dynamically to varying terrain conditions. To generate terrain representations, we process visual inputs through a masked Vision Transformer (ViT) encoder and time-series data through a dilated causal convolutional encoder. The Cross Attention mechanism then selects and integrates the most relevant features from each modality, combining terrain characteristics with robot dynamics for informed gait adaptation. This fused representation allows CROSS-GAiT to continuously adjust gait parameters in response to unpredictable terrain conditions in real-time. We train CROSS-GAiT on a diverse set of terrains, including asphalt, concrete, brick pavements, grass, dense vegetation, pebbles, gravel, and sand, and validate its generalization ability on unseen environments. Our hardware implementation on the Ghost Robotics Vision 60 demonstrates superior performance in challenging terrains, such as high-density vegetation, unstable surfaces, sandbanks, and deformable substrates. We observe at least a 7.04% reduction in IMU energy density and a 27.3% reduction in total joint effort, which directly correlates with increased stability and reduced energy usage when compared to state-of-the-art methods. Furthermore, CROSS-GAiT demonstrates at least a 64.5% increase in success rate and a 4.91% reduction in time to reach the goal in four complex scenarios. Additionally, the learned representations perform 4.48% better than the state-of-the-art on a terrain classification task.

NeurIPS Conference 2025 Conference Paper

Does Thinking More Always Help? Mirage of Test-Time Scaling in Reasoning Models

  • Soumya Suvra Ghosal
  • Souradip Chakraborty
  • Avinash Reddy
  • Yifu Lu
  • Mengdi Wang
  • Dinesh Manocha
  • Furong Huang
  • Mohammad Ghavamzadeh

Recent trends in test-time scaling for reasoning models (e.g., OpenAI o1, DeepSeek R1) have led to a popular belief that extending thinking traces using prompts like “Wait” or “Let me rethink” can improve performance. This raises a natural question: Does thinking more at test-time truly lead to better reasoning? To answer this question, we perform a detailed empirical study across models and benchmarks, which reveals a consistent pattern of initial performance improvements from additional thinking followed by a decline, due to "overthinking". To understand this non-monotonic trend, we consider a simple probabilistic model, which reveals that additional thinking increases output variance—creating an illusion of improved reasoning while ultimately undermining precision. Thus, observed gains from "more thinking" are not true indicators of improved reasoning, but artifacts stemming from the connection between model uncertainty and evaluation metric. This suggests that test-time scaling through extended thinking is not an effective way to utilize the inference thinking budget. Recognizing these limitations, we introduce an alternative test-time scaling approach, parallel thinking, inspired by Best-of-N sampling. Our method generates multiple independent reasoning paths within the same inference budget and selects the most consistent response via majority vote, achieving up to 20% higher accuracy compared to extended thinking. This provides a simple yet effective mechanism for test-time scaling of reasoning models.
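
The parallel-thinking procedure reduces to Best-of-N sampling with a majority vote; a sketch with a stand-in sampler (the mock model below is invented purely for illustration):

```python
from collections import Counter

def parallel_thinking(sample_fn, n=8):
    """Run n independent reasoning paths and return the majority-vote
    final answer, instead of extending a single thinking trace."""
    answers = [sample_fn(seed) for seed in range(n)]
    return Counter(answers).most_common(1)[0][0]

# Stand-in for sampling a model: answers "42" on most seeds, "41" otherwise.
mock_model = lambda seed: "42" if seed % 3 else "41"
voted = parallel_thinking(mock_model)
```

Even when individual samples are noisy, the vote recovers the modal answer, which is the effect the abstract attributes to parallel thinking.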

IROS Conference 2025 Conference Paper

ET-Former: Efficient Triplane Deformable Attention for 3D Semantic Scene Completion From Monocular Camera

  • Jing Liang 0006
  • He Yin
  • Xuewei Tony Qi
  • Jong Jin Park
  • Min Sun 0001
  • Rajasimman Madhivanan
  • Dinesh Manocha

We introduce ET-Former, a novel end-to-end algorithm for semantic scene completion using a single monocular camera. Our approach generates a semantic occupancy map from a single RGB observation while simultaneously providing uncertainty estimates for semantic predictions. By designing a triplane-based deformable attention mechanism, our approach improves geometric understanding of the scene compared to other SOTA approaches and reduces noise in semantic predictions. Additionally, through the use of a Conditional Variational AutoEncoder (CVAE), we estimate the uncertainties of these predictions. The generated semantic and uncertainty maps will help formulate navigation strategies that facilitate safe and permissible decision making in the future. Evaluated on the Semantic-KITTI dataset, ET-Former achieves the highest Intersection over Union (IoU) and mean IoU (mIoU) scores while maintaining the lowest GPU memory usage, surpassing state-of-the-art (SOTA) methods. It improves the SOTA scores of IoU from 44.71 to 51.49 and mIoU from 15.04 to 16.30 on the Semantic-KITTI test set, with a notably low training memory consumption of 10.9 GB, achieving at least a 25% reduction compared to previous methods. Project page: https://github.com/amazon-science/ET-Former.

ICRA Conference 2025 Conference Paper

GND: Global Navigation Dataset With Multi-Modal Perception and Multi-Category Traversability in Outdoor Campus Environments

  • Jing Liang 0006
  • Dibyendu Das
  • Daeun Song
  • Md Nahid Hasan Shuvo
  • Mohammad Durrani
  • Karthik Taranath
  • Ivan Penskiy
  • Dinesh Manocha

Navigating large-scale outdoor environments requires complex reasoning in terms of geometric structures, environmental semantics, and terrain characteristics, which are typically captured by onboard sensors such as LiDAR and cameras. While current mobile robots can navigate such environments using pre-defined, high-precision maps based on hand-crafted rules catered for the specific environment, they lack commonsense reasoning capabilities, especially the traversability analysis, that most humans possess when navigating unknown outdoor spaces. To address this gap, we introduce the Global Navigation Dataset (GND), a large-scale dataset that integrates multi-modal sensory data, including 3D LiDAR point clouds and RGB and 360° images, as well as multi-category traversability maps (pedestrian walkways, vehicle roadways, stairs, off-road terrain, and obstacles) from ten university campuses. These environments encompass a variety of parks, urban settings, elevation changes, and campus layouts of different scales. The dataset covers approximately 2.7 km² and includes at least 350 buildings in total. We also present a set of novel applications of GND to showcase its utility to enable global robot navigation, such as map-based global navigation, mapless navigation, and global place recognition. GND's website can be found at https://cs.gmu.edu/xiao/Research/GND/.

ICLR Conference 2025 Conference Paper

How Learnable Grids Recover Fine Detail in Low Dimensions: A Neural Tangent Kernel Analysis of Multigrid Parametric Encodings

  • Samuel Audia
  • Soheil Feizi
  • Matthias Zwicker
  • Dinesh Manocha

Neural networks that map between low dimensional spaces are ubiquitous in computer graphics and scientific computing; however, in their naive implementation, they are unable to learn high frequency information. We present a comprehensive analysis comparing the two most common techniques for mitigating this spectral bias: Fourier feature encodings (FFE) and multigrid parametric encodings (MPE). FFEs are seen as the standard for low dimensional mappings, but MPEs often outperform them and learn representations with higher resolution and finer detail. FFE's roots in the Fourier transform make it susceptible to aliasing if pushed too far, while MPEs, which use a learned grid structure, have no such limitation. To understand the difference in performance, we use the neural tangent kernel (NTK) to evaluate these encodings through the lens of an analogous kernel regression. By finding a lower bound on the smallest eigenvalue of the NTK, we prove that MPEs improve a network's performance through the structure of their grid and not their learnable embedding. This mechanism is fundamentally different from FFEs, which rely solely on their embedding space to improve performance. Results are empirically validated on a 2D image regression task using images taken from 100 synonym sets of ImageNet and 3D implicit surface regression on objects from the Stanford graphics dataset. Using peak signal-to-noise ratio (PSNR) and multiscale structural similarity (MS-SSIM) to evaluate how well fine details are learned, we show that the MPE increases the minimum eigenvalue by 8 orders of magnitude over the baseline and 2 orders of magnitude over the FFE. The increase in spectrum corresponds to a 15 dB (PSNR) / 0.65 (MS-SSIM) increase over baseline and a 12 dB (PSNR) / 0.33 (MS-SSIM) increase over the FFE.
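
The core lookup of a single-level, 1D parametric grid encoding is just linear interpolation into a feature table; a sketch (a multigrid MPE stacks several such resolutions, and its entries are trained parameters rather than the fixed values used here):

```python
import numpy as np

def grid_encode(x, grid):
    """Interpolate a 1D feature grid at coordinate x in [0, 1].
    In an MPE the grid entries are learnable; here they are fixed
    values for illustration."""
    n = len(grid) - 1                 # number of cells
    t = x * n
    i = min(int(t), n - 1)            # left vertex index
    frac = t - i
    return (1.0 - frac) * grid[i] + frac * grid[i + 1]

grid = np.linspace(0.0, 1.0, 5)       # 5 vertices; features = their positions
val = grid_encode(0.625, grid)
```

Because the identity function is exactly representable by this grid, `val` equals the query coordinate here; with trained features, the grid's local support is what lets it capture high-frequency detail that a plain MLP misses.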

ICRA Conference 2025 Conference Paper

Improving Zero-Shot ObjectNav with Generative Communication

  • Vishnu Sashank Dorbala
  • Vishnu Dutt Sharma
  • Pratap Tokekar
  • Dinesh Manocha

We propose a new method for improving zero-shot ObjectNav that aims to utilize potentially available environmental percepts for navigational assistance. Our approach takes into account that the ground agent may have limited and sometimes obstructed view. Our formulation encourages Generative Communication (GC) between an assistive overhead agent with a global view containing the target object and the ground agent with an obfuscated view; both equipped with Vision-Language Models (VLMs) for vision-to-language translation. In this assisted setup, the embodied agents communicate environmental information before the ground agent executes actions towards a target. Despite the overhead agent having a global view with the target, we note a drop in performance (−13% in OSR and −13% in SPL) of a fully cooperative assistance scheme over an unassisted baseline. In contrast, a selective assistance scheme where the ground agent retains its independent exploratory behaviour shows a 10% OSR and 7.65% SPL improvement. To explain navigation performance, we analyze the GC for unique traits, quantifying the presence of hallucination and cooperation. Specifically, we identify the novel linguistic trait of preemptive hallucination in our embodied setting, where the overhead agent assumes that the ground agent has executed an action in the dialogue when it is yet to move, and note its strong correlation with navigation performance. We conduct real-world experiments and present some qualitative examples where we mitigate hallucinations via prompt finetuning to improve ObjectNav performance.

IROS Conference 2025 Conference Paper

Is the House Ready For Sleeptime? Generating and Evaluating Situational Queries for Embodied Question Answering

  • Vishnu Sashank Dorbala
  • Prasoon Goyal
  • Robinson Piramuthu
  • Michael Johnston
  • Reza Ghanadan
  • Dinesh Manocha

We present and tackle the problem of Embodied Question Answering (EQA) with Situational Queries (S-EQA) in a household environment. Unlike prior EQA work tackling simple queries that directly reference target objects and properties ("What is the color of the car?"), situational queries (such as "Is the house ready for sleeptime?") are challenging as they require the agent to correctly identify multiple object-states (Doors: Closed, Lights: Off, etc.) and reach a consensus on their states for an answer. Towards this objective, we first introduce a novel Prompt-Generate-Evaluate (PGE) scheme that wraps around an LLM's output to generate unique situational queries and corresponding consensus object information. PGE is used to generate 2K datapoints in the VirtualHome simulator, which is then annotated for ground truth answers via a large scale user-study conducted on M-Turk. With a high rate of answerability (97.26%) on this study, we establish that LLMs are good at generating situational data. However, in evaluating the data using an LLM, we observe a low correlation of 46.2% with the ground truth human annotations, indicating that while LLMs are good at generating situational data, they struggle to answer them according to consensus. When asked for reasoning, we observe the LLM often goes against commonsense in justifying its answer. Finally, we utilize PGE to generate situational data in a real-world environment, exposing LLM hallucination in generating reliable object-states when a structured scene graph is unavailable. To the best of our knowledge, this is the first work to introduce EQA in the context of situational queries and also the first to present a generative approach for query creation. We aim to foster research on improving the real-world usability of embodied agents through this work.

IROS Conference 2025 Conference Paper

LBAP: Improved Uncertainty Alignment of LLM Planners using Bayesian Inference

  • James F. Mullen Jr.
  • Dinesh Manocha

Large language models (LLMs) showcase many desirable traits for intelligent and helpful robots. However, they are also known to hallucinate predictions. This issue is exacerbated in robotics, where LLM hallucinations may result in robots confidently executing plans that are contrary to user goals or relying more frequently on human assistance. In this work, we present LBAP, a novel approach for utilizing off-the-shelf LLMs, alongside Bayesian inference for uncertainty Alignment in robotic Planners, that minimizes hallucinations and human intervention. Our key finding is that we can use Bayesian inference to more accurately calibrate a robot's confidence measure by accounting for both scene grounding and world knowledge. This process allows us to mitigate hallucinations and better align the LLM's confidence measure with the probability of success. Through experiments in both simulation and the real world on tasks with a variety of ambiguities, we show that LBAP significantly increases success rate and decreases the amount of human intervention required relative to prior art. For example, in our real-world testing paradigm, LBAP decreases the human help rate of previous methods by over 33% at a success rate of 70%.

NeurIPS Conference 2025 Conference Paper

MAGNET: A Multi-agent Framework for Finding Audio-Visual Needles by Reasoning over Multi-Video Haystacks

  • Sanjoy Chowdhury
  • Mohamed Elmoghany
  • Yohan Abeysinghe
  • Junjie Fei
  • Sayan Nag
  • Salman Khan
  • Mohamed Elhoseiny
  • Dinesh Manocha

Large multimodal models (LMMs) have shown remarkable progress in audiovisual understanding, yet they struggle with real-world scenarios that require complex reasoning across extensive video collections. Existing benchmarks for video question answering remain limited in scope, typically involving one clip per query, which falls short of representing the challenges of large-scale, audiovisual retrieval and reasoning encountered in practical applications. To bridge this gap, we introduce a novel task named AVHaystacksQA, where the goal is to identify salient segments across different videos in response to a query and link them together to generate the most informative answer. To this end, we present AVHaystacks, an audio-visual benchmark comprising 3100 annotated QA pairs designed to assess the capabilities of LMMs in multi-video retrieval and temporal grounding tasks. Additionally, we propose a model-agnostic, multi-agent framework MAGNET to address this challenge, achieving up to 89% and 65% relative improvements over baseline methods on BLEU@4 and GPT evaluation scores in the QA task on our proposed AVHaystacks. To enable robust evaluation of multi-video retrieval and temporal grounding for optimal response generation, we introduce two new metrics: STEM, which captures alignment errors between a ground-truth and a predicted step sequence, and MTGS, which facilitates balanced and interpretable evaluation of segment-level grounding performance.

ICLR Conference 2025 Conference Paper

MMAU: A Massive Multi-Task Audio Understanding and Reasoning Benchmark

  • S. Sakshi
  • Utkarsh Tyagi
  • Sonal Kumar
  • Ashish Seth
  • Ramaneswaran S.
  • Oriol Nieto
  • Ramani Duraiswami
  • Sreyan Ghosh

The ability to comprehend audio—which includes speech, non-speech sounds, and music—is crucial for AI agents to interact effectively with the world. We present MMAU, a novel benchmark designed to evaluate multimodal audio understanding models on tasks requiring expert-level knowledge and complex reasoning. MMAU comprises 10k carefully curated audio clips paired with human-annotated natural language questions and answers spanning speech, environmental sounds, and music. It includes information extraction and reasoning questions, requiring models to demonstrate 27 distinct skills across unique and challenging tasks. Unlike existing benchmarks, MMAU emphasizes advanced perception and reasoning with domain-specific knowledge, challenging models to tackle tasks akin to those faced by experts. We assess 18 open-source and proprietary (Large) Audio-Language Models, demonstrating the significant challenges posed by MMAU. Notably, even the most advanced Gemini 2.0 Flash achieves only 59.93% accuracy, and the state-of-the-art open-source Qwen2-Audio achieves only 52.50%, highlighting considerable room for improvement. We believe MMAU will drive the audio and multimodal research community to develop more advanced audio understanding models capable of solving complex audio tasks.

IROS Conference 2025 Conference Paper

On the Vulnerability of LLM/VLM-Controlled Robotics

  • Xiyang Wu
  • Souradip Chakraborty
  • Ruiqi Xian
  • Jing Liang 0006
  • Tianrui Guan
  • Fuxiao Liu
  • Brian M. Sadler
  • Dinesh Manocha

In this work, we highlight vulnerabilities in robotic systems integrating large language models (LLMs) and vision-language models (VLMs) due to input modality sensitivities. While LLM/VLM-controlled robots show impressive performance across various tasks, their reliability under slight input variations remains underexplored yet critical. These models are highly sensitive to instruction or perceptual input changes, which can trigger misalignment issues, leading to execution failures with severe real-world consequences. To study this issue, we analyze the misalignment-induced vulnerabilities within LLM/VLM-controlled robotic systems and present a mathematical formulation for failure modes arising from variations in input modalities. We propose empirical perturbation strategies to expose these vulnerabilities and validate their effectiveness through experiments on multiple robot manipulation tasks. Our results show that simple input perturbations reduce task execution success rates by 22.2% and 14.6% in two representative LLM/VLM-controlled robotic systems. These findings underscore the importance of input modality robustness and motivate further research to ensure the safe and reliable deployment of advanced LLM/VLM-controlled robotic systems.

NeurIPS Conference 2025 Conference Paper

RPG360: Robust 360 Depth Estimation with Perspective Foundation Models and Graph Optimization

  • Dongki Jung
  • Jaehoon Choi
  • Yonghan Lee
  • Dinesh Manocha

The increasing use of 360$^\circ$ images across various domains has emphasized the need for robust depth estimation techniques tailored for omnidirectional images. However, obtaining large-scale labeled datasets for 360$^\circ$ depth estimation remains a significant challenge. In this paper, we propose RPG360, a training-free robust 360$^\circ$ monocular depth estimation method that leverages perspective foundation models and graph optimization. Our approach converts 360$^\circ$ images into six-face cubemap representations, where a perspective foundation model is employed to estimate depth and surface normals. To address depth scale inconsistencies across different faces of the cubemap, we introduce a novel depth scale alignment technique using graph-based optimization, which parameterizes the predicted depth and normal maps while incorporating an additional per-face scale parameter. This optimization ensures depth scale consistency across the six-face cubemap while preserving 3D structural integrity. Furthermore, as foundation models exhibit inherent robustness in zero-shot settings, our method achieves superior performance across diverse datasets, including Matterport3D, Stanford2D3D, and 360Loc. We also demonstrate the versatility of our depth estimation approach by validating its benefits in downstream tasks such as feature matching (3.2 ∼ 5.4%) and Structure from Motion (0.2 ∼ 9.7% in AUC@5$^\circ$).

IROS Conference 2025 Conference Paper

Social-LLaVA: Enhancing Social Robot Navigation through Human-Language Reasoning

  • Amirreza Payandeh
  • Daeun Song
  • Mohammad Nazeri
  • Jing Liang 0006
  • Praneel Mukherjee
  • Amir Hossain Raj
  • Yangzhe Kong
  • Dinesh Manocha

As mobile robots become increasingly common in human-centric environments, social navigation—adhering to unwritten social norms rather than merely avoiding pedestrians—has drawn growing attention. Existing methods, from hand-crafted techniques to learning-based approaches, often overlook the nuanced context and scene understanding that humans naturally exhibit. Inspired by studies indicating the critical role of language in cognition and reasoning, we propose a new approach to bridge robot perception and socially aware actions through human-like language reasoning. We introduce Social robot Navigation via Explainable Interactions (SNEI), a human-annotated vision-language dataset comprising over 40K Visual Question Answering (VQA) pairs across 2K unique social scenarios, drawn from diverse, unstructured public spaces. SNEI contains perception, prediction, chain-of-thought reasoning, action, and explanation, thereby allowing robots to interpret social contexts in human language. We fine-tune a Vision-Language Model, Social-LLaVA, on SNEI to demonstrate the potential of language-guided reasoning for high-level navigation tasks. Experimental evaluations—both quantitative and qualitative—demonstrate that Social-LLaVA can outperform state-of-the-art models.

ICLR Conference 2025 Conference Paper

Synthio: Augmenting Small-Scale Audio Classification Datasets with Synthetic Data

  • Sreyan Ghosh
  • Sonal Kumar
  • Zhifeng Kong
  • Rafael Valle
  • Bryan Catanzaro
  • Dinesh Manocha

We present Synthio, a novel approach for augmenting small-scale audio classification datasets with synthetic data. Our goal is to improve audio classification accuracy with limited labeled data. Traditional data augmentation techniques, which apply artificial transformations (e.g., adding random noise or masking segments), struggle to create data that captures the true diversity present in real-world audios. To address this shortcoming, we propose to augment the dataset with synthetic audio generated from text-to-audio (T2A) diffusion models. However, synthesizing effective augmentations is challenging because not only should the generated data be acoustically consistent with the underlying small-scale dataset, but they should also have sufficient compositional diversity. To overcome the first challenge, we align the generations of the T2A model with the small-scale dataset using preference optimization. This ensures that the acoustic characteristics of the generated data remain consistent with the small-scale dataset. To address the second challenge, we propose a novel caption generation technique that leverages the reasoning capabilities of Large Language Models to (1) generate diverse and meaningful audio captions and (2) iteratively refine their quality. The generated captions are then used to prompt the aligned T2A model. We extensively evaluate Synthio on ten datasets and four simulated limited-data settings. Results indicate our method consistently outperforms all baselines by 0.1%-39% using a T2A model trained only on weakly-captioned AudioSet.

IROS Conference 2025 Conference Paper

TK-Planes: Tiered K-Planes with High Dimensional Feature Vectors for Dynamic UAV-based Scenes

  • Christopher Maxey
  • Jaehoon Choi
  • Yonghan Lee 0001
  • Hyungtae Lee
  • Dinesh Manocha
  • Heesung Kwon

In this paper, we present a new approach to improve the neural rendering fidelity of in-the-wild unmanned aerial vehicle (UAV)-based scenes. Our formulation is designed for dynamic scenes, consisting of small moving objects or human actions in particular. We propose an extension of K-Planes Neural Radiance Field (NeRF), wherein our algorithm stores a set of tiered high dimensional feature vectors. The tiered feature vectors are generated to effectively model conceptual information about a scene as well as to be processed by an image decoder that transforms output feature maps into RGB images. Our technique leverages the information among both static and dynamic objects within a scene and is able to capture salient scene attributes of high altitude videos. We evaluate its performance on challenging datasets, including Okutama Action and UG2, and observe considerable improvement in accuracy over state-of-the-art neural rendering methods.

ICLR Conference 2025 Conference Paper

Towards Optimal Multi-draft Speculative Decoding

  • Zhengmian Hu
  • Tong Zheng
  • Vignesh Viswanathan
  • Ziyi Chen 0002
  • Ryan A. Rossi
  • Yihan Wu
  • Dinesh Manocha
  • Heng Huang 0001

Large Language Models (LLMs) have become an indispensable part of natural language processing tasks. However, autoregressive sampling has become an efficiency bottleneck. Multi-Draft Speculative Decoding (MDSD) is a recent approach where, when generating each token, a small draft model generates multiple drafts, and the target LLM verifies them in parallel, ensuring that the final output conforms to the target model distribution. The two main design choices in MDSD are the draft sampling method and the verification algorithm. For a fixed draft sampling method, the optimal acceptance rate is a solution to an optimal transport problem, but the complexity of this problem makes it difficult to solve for the optimal acceptance rate and measure the gap between existing verification algorithms and the theoretical upper bound. This paper discusses the dual of the optimal transport problem, providing a way to efficiently compute the optimal acceptance rate. For the first time, we measure the theoretical upper bound of MDSD efficiency for vocabulary sizes in the thousands and quantify the gap between existing verification algorithms and this bound. We also compare different draft sampling methods based on their optimal acceptance rates. Our results show that the draft sampling method strongly influences the optimal acceptance rate, with sampling without replacement outperforming sampling with replacement. Additionally, existing verification algorithms do not reach the theoretical upper bound for both without replacement and with replacement sampling. Our findings suggest that carefully designed draft sampling methods can potentially improve the optimal acceptance rate and enable the development of verification algorithms that closely match the theoretical upper bound.

NeurIPS Conference 2025 Conference Paper

VideoHallu: Evaluating and Mitigating Multi-modal Hallucinations on Synthetic Video Understanding

  • Zongxia Li
  • Xiyang Wu
  • Guangyao Shi
  • Yubin Qin
  • Hongyang Du
  • Tianyi Zhou
  • Dinesh Manocha
  • Jordan Boyd-Graber

Vision Language models (VLMs) have achieved remarkable success in video understanding tasks. Yet, a key question remains: Do they comprehend visual information or merely learn superficial mappings between visual and textual patterns? Understanding visual cues, particularly those related to physics and common sense, is crucial for AI systems interacting with the physical world. However, existing VLM evaluations primarily rely on positive-control tests using real-world videos that resemble training distributions. While VLMs perform well on such benchmarks, it is unclear whether they grasp underlying visual and contextual signals or simply exploit visual-language correlations. To fill this gap, we propose incorporating negative-control tests, i.e., videos depicting physically impossible or logically inconsistent scenarios, and evaluating whether models can recognize these violations. True visual understanding should evince comparable performance across both positive and negative tests. Since such content is rare in the real world, we introduce VideoHallu, a synthetic video dataset featuring physics- and commonsense-violating scenes generated using state-of-the-art tools such as Veo2, Sora, and Kling. The dataset includes expert-annotated question-answer pairs spanning four categories of physical and commonsense violations, designed to be straightforward for human reasoning. We evaluate several leading VLMs, including Qwen-2.5-VL, Video-R1, and VideoChat-R1. Despite their strong performance on real-world benchmarks (e.g., MVBench, MMVU), these models hallucinate or fail to detect physical or logical violations, revealing fundamental weaknesses in visual understanding. Finally, we explore reinforcement learning-based post-training on our negative dataset: fine-tuning improves performance on VideoHallu without degrading results on standard benchmarks, indicating enhanced visual reasoning in VLMs. Our data is available at https://github.com/zli12321/VideoHallu.git.

ICLR Conference 2025 Conference Paper

Visual Description Grounding Reduces Hallucinations and Boosts Reasoning in LVLMs

  • Sreyan Ghosh
  • Chandra Kiran Reddy Evuru
  • Sonal Kumar
  • Utkarsh Tyagi
  • Oriol Nieto
  • Zeyu Jin
  • Dinesh Manocha

Large Vision-Language Models (LVLMs) often produce responses that misalign with factual information, a phenomenon known as hallucinations. While hallucinations are well-studied, the exact causes behind them remain underexplored. In this paper, we first investigate the root causes of hallucinations in LVLMs. Our findings reveal that existing mitigation techniques primarily reduce hallucinations for visual recognition prompts—those that require simple descriptions of visual elements—but fail for cognitive prompts that demand deliberate reasoning. We identify the core issue as a lack of true visual perception in LVLMs: although they can accurately recognize visual elements, they struggle to fully interpret these elements in the context of the input prompt and effectively link this recognition to their internal knowledge, which is critical for reasoning. To address this gap, we introduce Visual Description Grounded Decoding (VDGD), a simple, robust, and training-free method designed to enhance visual perception and improve reasoning capabilities in LVLMs. VDGD works by first generating a detailed description of the image and appending it as a prefix to the instruction. During response generation, tokens are sampled based on their KL divergence to the description, favoring candidates with lower divergence. Experimental results on multiple visual reasoning benchmarks and LVLMs demonstrate that VDGD consistently outperforms existing baselines by 2%-33%. Finally, we introduce VaLLu, a benchmark designed for comprehensive evaluation of the cognitive capabilities of LVLMs.

ICRA Conference 2025 Conference Paper

VLM-GroNav: Robot Navigation Using Physically Grounded Vision-Language Models in Outdoor Environments

  • Mohamed Elnoor
  • Kasun Weerakoon
  • Gershom Seneviratne
  • Ruiqi Xian
  • Tianrui Guan
  • Mohamed Khalid M. Jaffar
  • Vignesh Rajagopal
  • Dinesh Manocha

We present a novel autonomous robot navigation algorithm for outdoor environments that is capable of handling diverse terrain traversability conditions. Our approach, VLM-GroNav, uses vision-language models (VLMs) and integrates them with physical grounding that is used to assess intrinsic terrain properties such as deformability and slipperiness. We use proprioceptive-based sensing, which provides direct measurements of these physical properties and enhances the overall semantic understanding of the terrains. Our formulation uses in-context learning to ground the VLM's semantic understanding with proprioceptive data to allow dynamic updates of traversability estimates based on the robot's real-time physical interactions with the environment. We use the updated traversability estimations to inform both the local and global planners for real-time trajectory replanning. We validate our method on a legged robot (Ghost Vision 60) and a wheeled robot (Clearpath Husky), in diverse real-world outdoor environments with different deformable and slippery terrains. In practice, we observe significant improvements over state-of-the-art methods, with up to a 50% increase in navigation success rate.

ICRA Conference 2025 Conference Paper

ZSORN: Language-Driven Object-Centric Zero-Shot Object Retrieval and Navigation

  • Tianrui Guan
  • Yurou Yang
  • Harry Cheng 0003
  • Muyuan Lin
  • Richard Kim
  • Rajasimman Madhivanan
  • Arnie Sen
  • Dinesh Manocha

In this paper, we present ZSORN, a novel language-driven object-centric image representation for the object retrieval and navigation task within complex scenes. We propose an object-centric image representation and corresponding losses for visual-language model (VLM) fine-tuning, which can handle complex object-level queries. In addition, we design a novel LLM-based augmentation and prompt templates for stability during training and zero-shot inference. We implement our method on the Astro robot and deploy it in both simulated and real-world environments for zero-shot object navigation. We show that our proposed method can achieve an improvement of 1.38-13.38% in terms of text-to-image recall on different benchmark settings for the retrieval task. For object navigation, we show the benefit of our approach in simulation and the real world, with 5% and 16.67% improvements in terms of navigation success rate, respectively.

ICML Conference 2024 Conference Paper

A Closer Look at the Limitations of Instruction Tuning

  • Sreyan Ghosh
  • Chandra Kiran Reddy Evuru
  • Sonal Kumar
  • Ramaneswaran S.
  • Deepali Aneja
  • Zeyu Jin
  • Ramani Duraiswami
  • Dinesh Manocha

Instruction Tuning (IT), the process of training large language models (LLMs) using instruction-response pairs, has emerged as the predominant method for transforming base pre-trained LLMs into open-domain conversational agents. While IT has achieved notable success and widespread adoption, its limitations and shortcomings remain underexplored. In this paper, through rigorous experiments and an in-depth analysis of the changes LLMs undergo through IT, we reveal various limitations of IT. In particular, we show that (1) IT fails to enhance knowledge or skills in LLMs. LoRA fine-tuning is limited to learning response initiation and style tokens, and full-parameter fine-tuning leads to knowledge degradation. (2) Copying response patterns from IT datasets derived from knowledgeable sources leads to a decline in response quality. (3) Full-parameter fine-tuning increases hallucination by inaccurately borrowing tokens from conceptually similar instances in the IT dataset for generating responses. (4) Popular methods to improve IT do not lead to performance improvements over a simple LoRA fine-tuned model. Our findings reveal that responses generated solely from pre-trained knowledge consistently outperform responses by models that learn any form of new knowledge from IT on open-source datasets. We hope the insights and challenges revealed in this paper inspire future work in related directions.

ICRA Conference 2024 Conference Paper

AG-Cvg: Coverage Planning with a Mobile Recharging UGV and an Energy-Constrained UAV

  • Nare Karapetyan
  • Ahmad Bilal Asghar
  • Amisha Bhaskar
  • Guangyao Shi
  • Dinesh Manocha
  • Pratap Tokekar

In this paper, we present an approach for coverage path planning for a team of an energy-constrained Unmanned Aerial Vehicle (UAV) and an Unmanned Ground Vehicle (UGV). Both the UAV and the UGV have predefined areas that they have to cover. The goal is to perform complete coverage by both robots while minimizing the coverage time. The UGV can also serve as a mobile recharging station. The UAV and UGV need to occasionally rendezvous for recharging. We propose a heuristic method to address this NP-Hard planning problem. Our approach involves initially determining coverage paths without factoring in energy constraints. Subsequently, we cluster segments of these paths and employ graph matching to assign UAV clusters to UGV clusters for efficient recharging management. We perform numerical analysis on real-world coverage applications and show that compared with a greedy approach our method reduces rendezvous overhead on average by 11.33%. We demonstrate proof-of-concept with a team of a VOXL m500 drone and a Clearpath Jackal ground vehicle, providing a complete system from the offline algorithm to the field execution.

IROS Conference 2024 Conference Paper

AGL-Net: Aerial-Ground Cross-Modal Global Localization with Varying Scales

  • Tianrui Guan
  • Ruiqi Xian
  • Xijun Wang 0002
  • Xiyang Wu
  • Mohamed Elnoor
  • Daeun Song
  • Dinesh Manocha

We present AGL-NET, a novel learning-based method for global localization using LiDAR point clouds and satellite maps. AGL-Net tackles two critical challenges: bridging the representation gap between the image and point-cloud modalities for robust feature matching, and handling inherent scale discrepancies between the global view and the local view. To address these challenges, AGL-Net leverages a unified network architecture with a novel two-stage matching design. The first stage extracts informative neural features directly from raw sensor data and performs initial feature matching. The second stage refines this matching process by extracting informative skeleton features and incorporating a novel scale alignment step to rectify scale variations between LiDAR and map data. Furthermore, a novel scale and skeleton loss function guides the network toward learning scale-invariant feature representations, eliminating the need for pre-processing satellite maps. This significantly improves real-world applicability in scenarios with unknown map scales. To facilitate rigorous performance evaluation, we introduce a meticulously designed dataset within the CARLA simulator specifically tailored for metric localization training and assessment.

IROS Conference 2024 Conference Paper

AMCO: Adaptive Multimodal Coupling of Vision and Proprioception for Quadruped Robot Navigation in Outdoor Environments

  • Mohamed Elnoor
  • Kasun Weerakoon
  • Adarsh Jagan Sathyamoorthy
  • Tianrui Guan
  • Vignesh Rajagopal
  • Dinesh Manocha

We present AMCO, a novel navigation method for quadruped robots that adaptively combines vision-based and proprioception-based perception capabilities. Our approach uses three cost maps: a general knowledge map, a traversability history map, and a current proprioception map, which are derived from a robot's vision and proprioception data, and couples them to obtain a coupled traversability cost map for navigation. The general knowledge map encodes terrains semantically segmented from visual sensing, and represents a terrain's typically expected traversability. The traversability history map encodes the robot's recent proprioceptive measurements on a terrain and its semantic segmentation as a cost map. Further, the robot's present proprioceptive measurement is encoded as a cost map in the current proprioception map. As the general knowledge map and traversability history map rely on semantic segmentation, we evaluate the reliability of the visual sensory data by estimating the brightness and motion blur of input RGB images and accordingly combine the three cost maps to obtain the coupled traversability cost map used for navigation. Leveraging this adaptive coupling, the robot can depend on the most reliable input modality available. Finally, we present a novel planner that selects appropriate gaits and velocities for traversing challenging outdoor environments using the coupled traversability cost map. We demonstrate AMCO's navigation performance in different real-world outdoor environments and observe a 10.8%-34.9% reduction w.r.t. two stability metrics, and up to 50% improvement in terms of success rate compared to current navigation methods.

ICLR Conference 2024 Conference Paper

CompA: Addressing the Gap in Compositional Reasoning in Audio-Language Models

  • Sreyan Ghosh
  • Ashish Seth
  • Sonal Kumar
  • Utkarsh Tyagi
  • Chandra Kiran Reddy Evuru
  • Ramaneswaran S.
  • Sakshi Singh
  • Oriol Nieto

A fundamental characteristic of audio is its compositional nature. Audio-language models (ALMs) trained using a contrastive approach (e.g., CLAP) that learns a shared representation between audio and language modalities have improved performance in many downstream applications, including zero-shot audio classification, audio retrieval, etc. However, the ability of these models to effectively perform compositional reasoning remains largely unexplored and necessitates additional research. In this paper, we propose CompA, a collection of two expert-annotated benchmarks with a majority of real-world audio samples, to evaluate compositional reasoning in ALMs. Our proposed CompA-order evaluates how well an ALM understands the order or occurrence of acoustic events in audio, and CompA-attribute evaluates attribute-binding of acoustic events. An instance from either benchmark consists of two audio-caption pairs, where both audios have the same acoustic events but with different compositions. An ALM is evaluated on how well it matches the right audio to the right caption. Using this benchmark, we first show that current ALMs perform only marginally better than random chance, thereby struggling with compositional reasoning. Next, we propose CompA-CLAP, where we fine-tune CLAP using a novel learning method to improve its compositional reasoning abilities. To train CompA-CLAP, we first propose improvements to contrastive training with composition-aware hard negatives, allowing for more focused training. Next, we propose a novel modular contrastive loss that helps the model learn fine-grained compositional understanding and overcomes the acute scarcity of openly available compositional audios. CompA-CLAP significantly improves over all our baseline models on the CompA benchmark, indicating its superior compositional reasoning capabilities.

IROS Conference 2024 Conference Paper

CoNVOI: Context-aware Navigation using Vision Language Models in Outdoor and Indoor Environments

  • Adarsh Jagan Sathyamoorthy
  • Kasun Weerakoon
  • Mohamed Elnoor
  • Anuj Zore
  • Brian Ichter
  • Fei Xia 0002
  • Jie Tan 0001
  • Wenhao Yu 0003

We present CoNVOI, a novel method for autonomous robot navigation in real-world indoor and outdoor environments using Vision Language Models (VLMs). We employ VLMs in two ways: first, we leverage their zero-shot image classification capability to identify the context or scenario (e.g., indoor corridor, outdoor terrain, crosswalk, etc.) of the robot's surroundings, and formulate context-based navigation behaviors as simple text prompts (e.g., "stay on the pavement"). Second, we utilize their state-of-the-art semantic understanding and logical reasoning capabilities to compute a suitable trajectory given the identified context. To this end, we propose a novel multi-modal visual marking approach to annotate the obstacle-free regions in the RGB image used as input to the VLM with numbers, by correlating it with a local occupancy map of the environment. The marked numbers ground image locations in the real world, direct the VLM's attention solely to navigable locations, and elucidate the spatial relationships between them and terrains depicted in the image to the VLM. Next, we query the VLM to select numbers on the marked image that satisfy the context-based behavior text prompt, and construct a reference path using the selected numbers. Finally, we propose a method to extrapolate the reference trajectory when the robot's environmental context has not changed to prevent unnecessary VLM queries. We use the reference trajectory to guide a motion planner, and demonstrate that it leads to human-like behaviors (e.g., not cutting through a group of people, using crosswalks, etc.) in various real-world indoor and outdoor scenarios. We perform several ablations and navigation comparisons and demonstrate that CoNVOI's trajectories are most similar to human-teleoperated ground truth in terms of Fréchet distance (9.7-58.2% closer), have the lowest path errors (up to 88.13% lower), and yield up to an 86.09% lower percentage of unacceptable paths.

IROS Conference 2024 Conference Paper

DTG: Diffusion-based Trajectory Generation for Mapless Global Navigation

  • Jing Liang 0006
  • Amirreza Payandeh
  • Daeun Song
  • Xuesu Xiao
  • Dinesh Manocha

We present a novel end-to-end diffusion-based trajectory generation method, DTG, for mapless global navigation in challenging outdoor scenarios with occlusions and unstructured off-road features like grass, buildings, bushes, etc. Given a distant goal, our approach computes a trajectory that satisfies the following goals: (1) minimize the travel distance to the goal; (2) maximize the traversability by choosing paths that do not lie in undesirable areas. Specifically, we present a novel Conditional RNN (CRNN) for diffusion models to efficiently generate trajectories. Furthermore, we propose an adaptive training method that ensures that the diffusion model generates more traversable trajectories. We evaluate our methods in various outdoor scenes and compare the performance with other global navigation algorithms on a Husky robot. In practice, we observe at least a 15% improvement in traveling distance and around a 7% improvement in traversability. Video and Code: https://github.com/jingGM/DTG.git.

IROS Conference 2024 Conference Paper

LANCAR: Leveraging Language for Context-Aware Robot Locomotion in Unstructured Environments

  • Chak Lam Shek
  • Xiyang Wu
  • Wesley A. Suttle
  • Carl E. Busart
  • Erin G. Zaroukian
  • Dinesh Manocha
  • Pratap Tokekar
  • Amrit Singh Bedi

Navigating robots through unstructured terrains is challenging, primarily due to the dynamic environmental changes. While humans adeptly navigate such terrains by using context from their observations, creating a similar context-aware navigation system for robots is difficult. The essence of the issue lies in the acquisition and interpretation of context information, a task complicated by the inherent ambiguity of human language. In this work, we introduce LANCAR, which addresses this issue by combining a context translator with reinforcement learning (RL) agents for context-aware locomotion. LANCAR allows robots to comprehend context information through Large Language Models (LLMs) sourced from human observers and convert this information into actionable context embeddings. These embeddings, combined with the robot’s sensor data, provide a complete input for the RL agent’s policy network. We provide an extensive evaluation of LANCAR under different levels of context ambiguity and compare it with alternative methods. The experimental results showcase its superior generalizability and adaptability across different terrains. Notably, LANCAR shows at least a 7.4% increase in episodic reward over the best alternatives, highlighting its potential to enhance robotic navigation in unstructured environments. More details and experiment videos can be found via this link.

ICML Conference 2024 Conference Paper

MaxMin-RLHF: Alignment with Diverse Human Preferences

  • Souradip Chakraborty
  • Jiahao Qiu
  • Hui Yuan 0002
  • Alec Koppel
  • Dinesh Manocha
  • Furong Huang
  • Amrit Singh Bedi
  • Mengdi Wang 0001

Reinforcement Learning from Human Feedback (RLHF) aligns language models to human preferences by employing a singular reward model derived from preference data. However, the single reward model overlooks the rich diversity of human preferences inherent in data collected from multiple users. In this work, we first derive an impossibility result of alignment with single reward RLHF, thereby highlighting its insufficiency in representing diverse human preferences. Next, we propose to learn a mixture of reward models via an expectation-maximization algorithm and solve a MaxMin alignment objective inspired by the Egalitarian principle in social choice theory to better honor diverse human preferences. We present comprehensive experimental results on small-scale (GPT-2) and large-scale (Tulu2-7B) language models and show the efficacy of the proposed approach in the presence of diversity among human preferences. We remark that our findings in this work are not only limited to language models but also extend to reinforcement learning in general.
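As a toy illustration of the MaxMin objective described in the abstract (our own sketch, not the authors' code; `maxmin_select` and the score matrix are hypothetical), the Egalitarian selection rule prefers the candidate whose worst-case reward across the learned mixture of reward models is highest:

```python
import numpy as np

def maxmin_select(rewards):
    """rewards: (n_candidates, n_reward_models) array of scores.
    Picks the candidate maximizing the minimum reward over models."""
    worst_case = rewards.min(axis=1)   # each candidate's worst reward
    return int(worst_case.argmax())    # maximize the minimum

scores = np.array([
    [0.9, 0.1],   # A: favored by group 1's reward model, disliked by group 2's
    [0.6, 0.5],   # B: acceptable to both groups
])
assert maxmin_select(scores) == 1     # the Egalitarian choice is B
```

A single averaged reward model would pick A here (mean 0.5 vs. 0.55 is close, but with weights skewed toward group 1 it flips), which is exactly the failure mode the mixture-plus-MaxMin formulation avoids.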

ICRA Conference 2024 Conference Paper

MIM: Indoor and Outdoor Navigation in Complex Environments Using Multi-Layer Intensity Maps

  • Adarsh Jagan Sathyamoorthy
  • Kasun Weerakoon
  • Mohamed Elnoor
  • Mason Russell
  • Jason L. Pusey
  • Dinesh Manocha

We present MIM (Multi-Layer Intensity Map), a novel 3D object representation for robot perception and autonomous navigation. MIMs consist of multiple stacked layers of 2D grid maps, each derived from reflected point cloud intensities corresponding to a certain height interval. The different layers of MIMs can be used to simultaneously estimate obstacles’ height, solidity/density, and opacity. We demonstrate that MIMs can help accurately differentiate obstacles that are safe to navigate through (e.g., beaded/string curtains, pliable tall grass) from ones that must be avoided (e.g., transparent surfaces such as glass walls, bushes, trees, etc.) in indoor and outdoor environments. Further, to handle narrow passages and navigate through non-solid obstacles in dense environments, we propose an approach to adaptively inflate or enlarge the obstacles detected on MIMs based on their solidity and the robot’s preferred velocity direction. We demonstrate these improved navigation capabilities in real-world narrow, dense environments using real Turtlebot and Boston Dynamics Spot robots. We observe significant increases in success rates to more than 50%, up to a 9.5% decrease in normalized trajectory length, and up to a 22.6% increase in the F-score compared to current navigation methods using other sensor modalities.
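The layered representation the abstract describes can be sketched as binning LiDAR returns by height interval and accumulating mean intensity per 2D cell. This is our illustrative reconstruction, not the paper's implementation; `build_mim` and all parameter names are hypothetical:

```python
import numpy as np

def build_mim(points, intensities, height_edges, xy_res=0.1, grid=64):
    """Stack one 2D mean-intensity grid per height interval.
    points: (N, 3) x/y/z; intensities: (N,); height_edges: layer bounds."""
    n_layers = len(height_edges) - 1
    layers = np.zeros((n_layers, grid, grid))
    counts = np.zeros_like(layers)
    ix = np.clip((points[:, 0] / xy_res).astype(int), 0, grid - 1)
    iy = np.clip((points[:, 1] / xy_res).astype(int), 0, grid - 1)
    iz = np.digitize(points[:, 2], height_edges) - 1   # which height layer
    ok = (iz >= 0) & (iz < n_layers)                   # drop out-of-range z
    np.add.at(layers, (iz[ok], ix[ok], iy[ok]), intensities[ok])
    np.add.at(counts, (iz[ok], ix[ok], iy[ok]), 1)
    return np.divide(layers, counts, out=np.zeros_like(layers),
                     where=counts > 0)                 # mean intensity per cell

pts = np.array([[0.05, 0.05, 0.5], [0.05, 0.05, 1.5]])
mim = build_mim(pts, np.array([2.0, 4.0]), height_edges=[0.0, 1.0, 2.0])
assert mim.shape == (2, 64, 64) and mim[0, 0, 0] == 2.0 and mim[1, 0, 0] == 4.0
```

Comparing a cell's intensity pattern across layers is what lets the representation separate, say, a tall solid wall (consistent returns at all heights) from pliable grass (returns mostly in the lowest layers).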

ICRA Conference 2024 Conference Paper

MTG: Mapless Trajectory Generator with Traversability Coverage for Outdoor Navigation

  • Jing Liang 0006
  • Peng Gao 0007
  • Xuesu Xiao
  • Adarsh Jagan Sathyamoorthy
  • Mohamed Elnoor
  • Ming Lin 0003
  • Dinesh Manocha

We present a novel learning-based trajectory generation algorithm for outdoor robot navigation. Our goal is to compute collision-free paths that also satisfy the environment-specific traversability constraints. Our approach is designed for global planning using limited onboard robot perception in mapless environments while ensuring comprehensive coverage of all traversable directions. Our formulation uses a Conditional Variational Autoencoder (CVAE) generative model that is enhanced with traversability constraints and an optimization formulation used for the coverage. We highlight the benefits of our approach over state-of-the-art trajectory generation approaches and demonstrate its performance in challenging and large outdoor environments, including around buildings, across intersections, along trails, and off-road terrain, using a Clearpath Husky and a Boston Dynamics Spot robot. In practice, our approach results in a 6% improvement in coverage of traversable areas and an 89% reduction in trajectory portions residing in non-traversable regions. Our video is here: https://youtu.be/3eJ2soAzXnU

ICLR Conference 2024 Conference Paper

PARL: A Unified Framework for Policy Alignment in Reinforcement Learning from Human Feedback

  • Souradip Chakraborty
  • Amrit Singh Bedi
  • Alec Koppel
  • Huazheng Wang
  • Dinesh Manocha
  • Mengdi Wang 0001
  • Furong Huang

We present a novel unified bilevel optimization-based framework, PARL, formulated to address the recently highlighted critical issue of policy alignment in reinforcement learning using utility or preference-based feedback. We identify a major gap within current algorithmic designs for solving policy alignment due to a lack of precise characterization of the dependence of the alignment objective on the data generated by policy trajectories. This shortfall contributes to the sub-optimal performance observed in contemporary algorithms. Our framework addresses these concerns by explicitly parameterizing the distribution of the upper alignment objective (reward design) by the lower optimal variable (the optimal policy for the designed reward). Interestingly, from an optimization perspective, our formulation leads to a new class of stochastic bilevel problems where the stochasticity at the upper objective depends upon the lower-level variable. To the best of our knowledge, this work presents the first formulation of RLHF as a bilevel optimization problem, which generalizes existing RLHF formulations and addresses their distribution-shift issues. To demonstrate the efficacy of our formulation in resolving alignment issues in RL, we devised an algorithm named A-PARL to solve the PARL problem, establishing sample complexity bounds of order $\mathcal{O}(1/T)$. Our empirical results substantiate that the proposed PARL can address the alignment concerns in RL by showing significant improvements (up to 63% in terms of required samples) for policy alignment in large-scale environments of the DeepMind Control Suite and Meta-World tasks.

IROS Conference 2024 Conference Paper

PoCo: Point Context Cluster for RGBD Indoor Place Recognition

  • Jing Liang 0006
  • Zhuo Deng
  • Zheming Zhou
  • Omid Ghasemalizadeh
  • Dinesh Manocha
  • Min Sun 0001
  • Cheng-Hao Kuo
  • Arnie Sen

We present a novel end-to-end algorithm (PoCo) for the indoor RGB-D place recognition task, aimed at identifying the most likely match for a given query frame within a reference database. The task presents inherent challenges attributed to the constrained field of view and limited range of perception sensors. We propose a new network architecture, which generalizes the recent Context of Clusters (CoCs) to extract global descriptors directly from the noisy point clouds through end-to-end learning. Moreover, we develop the architecture by integrating both color and geometric modalities into the point features to enhance the global descriptor representation. We conducted evaluations on public datasets ScanNet-PR and ARKit with 807 and 5047 scenarios, respectively. PoCo achieves SOTA performance: on ScanNet-PR, we achieve R@1 of 64.63%, a 5.7% improvement from the best-published result CGis (61.12%); on ARKit, we achieve R@1 of 45.12%, a 13.3% improvement from the best-published result CGis (39.82%). In addition, PoCo shows higher efficiency than CGis in inference time (1.75× faster), and we demonstrate the effectiveness of PoCo in recognizing places within a real-world laboratory environment. Video: https://youtu.be/D8dObAeMiCw

ICML Conference 2024 Conference Paper

Position: On the Possibilities of AI-Generated Text Detection

  • Souradip Chakraborty
  • Amrit Singh Bedi
  • Sicheng Zhu
  • Bang An 0001
  • Dinesh Manocha
  • Furong Huang

Our study addresses the challenge of distinguishing human-written text from Large Language Model (LLM) outputs. We provide evidence that this differentiation is consistently feasible, except when human and machine text distributions are indistinguishable across their entire support. Employing information theory, we show that while detecting machine-generated text becomes harder as it nears human quality, it remains possible with adequate text data. We introduce guidelines on the required text data quantity, either through sample size or sequence length, for reliable AI text detection, through derivations of sample complexity bounds. This research paves the way for advanced detection methods. Our comprehensive empirical tests, conducted across various datasets (Xsum, Squad, IMDb, and Kaggle FakeNews) and with several state-of-the-art text generators (GPT-2, GPT-3.5-Turbo, Llama, Llama-2-13B-Chat-HF, Llama-2-70B-Chat-HF), assess the viability of enhanced detection methods against detectors like RoBERTa-Large/Base-Detector and GPTZero, with increasing sample sizes and sequence lengths. Our findings align with OpenAI’s empirical data related to sequence length, marking the first theoretical substantiation for these observations.
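The core sample-complexity intuition in the abstract can be simulated: if human and machine text induce slightly different score distributions, aggregating more samples (or longer sequences) drives detection error down. This is a toy Gaussian stand-in of ours, not the paper's bound or detector; `detect_error` and its parameters are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

def detect_error(n, trials=2000, gap=0.3):
    """Balanced error of a threshold detector that averages n
    per-sample scores; human ~ N(0,1), machine ~ N(gap,1)."""
    human = rng.normal(0.0, 1.0, (trials, n)).mean(axis=1)
    machine = rng.normal(gap, 1.0, (trials, n)).mean(axis=1)
    thresh = gap / 2                       # midpoint decision rule
    return 0.5 * ((human > thresh).mean() + (machine <= thresh).mean())

# Near-random with one sample, much better with a hundred.
assert detect_error(100) < detect_error(1)
```

The mean of n scores has standard deviation shrinking as 1/sqrt(n), so even a small distributional gap becomes detectable with enough text, mirroring the "possible with adequate text data" claim.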

IROS Conference 2024 Conference Paper

SCP: Soft Conditional Prompt Learning for Aerial Video Action Recognition

  • Xijun Wang 0002
  • Ruiqi Xian
  • Tianrui Guan
  • Fuxiao Liu
  • Dinesh Manocha

We present a new learning approach, Soft Conditional Prompt Learning (SCP), which leverages the strengths of prompt learning for aerial video action recognition. Our approach is designed to predict the action of each agent by helping the models focus on the descriptions or instructions associated with actions in the input videos for aerial/robot visual perception. Our formulation supports various prompts, including learnable prompts, auxiliary visual information, and large vision models to improve the recognition performance. We present a soft conditional prompt method that learns to dynamically generate prompts from a pool of prompt experts under different video inputs. By sharing the same objective with the task, our proposed SCP can optimize prompts that guide the model’s predictions while explicitly learning input-invariant (prompt experts pool) and input-specific (data-dependent) prompt knowledge. In practice, we observe a 3.17-10.2% accuracy improvement on the aerial video datasets (Okutama [1], NECDrone [2]), which consist of scenes with single-agent and multi-agent actions. We further evaluate our approach on ground camera videos to verify the effectiveness and generalization and achieve a 1.0-3.6% improvement on SSV2 [3]. We integrate our method into ROS2 as well.

ICRA Conference 2024 Conference Paper

Sim-to-Real Robotic Sketching using Behavior Cloning and Reinforcement Learning

  • Biao Jia
  • Dinesh Manocha

Robotic sketching in real-world scenarios poses a challenging problem with diverse applications in art, robotics, and digital design. We present a novel approach that bridges the gap between simulated and real-world robotic sketching by integrating behavior cloning and reinforcement learning techniques. Our approach trains painting policies that operate effectively in both virtual environments and real-world robotic sketching systems. We have implemented a robotic sketching system featuring an UltraArm robot equipped with a RealSense D415 camera, closely emulating the MyPaint virtual environment. Our system can perceive its environment and adapt painting policies to natural painting media. Our results highlight the effectiveness of our agent in acquiring policies for high-dimensional continuous action spaces, enabling the seamless transfer of brush manipulation techniques from simulation to practical robotic sketching. Furthermore, we demonstrate our robotic sketching system’s capability to generate complex images and strokes using various configurations. https://sites.google.com/view/sketchingrobot

ICML Conference 2024 Conference Paper

Towards Global Optimality for Practical Average Reward Reinforcement Learning without Mixing Time Oracles

  • Bhrij Patel
  • Wesley A. Suttle
  • Alec Koppel
  • Vaneet Aggarwal
  • Brian M. Sadler
  • Dinesh Manocha
  • Amrit Singh Bedi

In the context of average-reward reinforcement learning, the requirement for oracle knowledge of the mixing time, a measure of the duration a Markov chain under a fixed policy needs to achieve its stationary distribution, poses a significant challenge for the global convergence of policy gradient methods. This requirement is particularly problematic due to the difficulty and expense of estimating mixing time in environments with large state spaces, leading to the necessity of impractically long trajectories for effective gradient estimation in practical applications. To address this limitation, we consider the Multi-level Actor-Critic (MAC) framework, which incorporates a Multi-level Monte-Carlo (MLMC) gradient estimator. With our approach, we effectively alleviate the dependency on mixing time knowledge, the first such result for global convergence in average-reward MDPs. Furthermore, our approach achieves the tightest known dependence of $\mathcal{O}(\sqrt{\tau_{mix}})$ on the mixing time. With a 2D grid-world goal-reaching navigation experiment, we demonstrate that MAC outperforms the existing state-of-the-art policy gradient-based method for average-reward settings.
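The MLMC idea the abstract relies on can be sketched on a toy slow-mixing chain: draw a random level J, run trajectories of geometrically growing length, and add a telescoping correction so that, in expectation, the estimate matches a long-trajectory estimate without ever fixing a mixing-time-dependent length. This is our illustration of the generic MLMC mean estimator, not the paper's MAC algorithm; all names are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(1)

def chain_mean(n, phi=0.9, mu=0.5, x0=2.0):
    """Average of n steps of an AR(1) chain started away from its
    stationary mean mu; short runs are biased toward x0 (slow mixing)."""
    x, total = x0, 0.0
    for _ in range(n):
        x = mu + phi * (x - mu) + 0.1 * rng.normal()
        total += x
    return total / n

def mlmc_estimate(t_max=2**8):
    """MLMC estimator: cheap base estimate plus a randomized
    telescoping correction. E[estimate] equals the t_max-step mean,
    yet no mixing-time oracle is used to choose a trajectory length."""
    est = chain_mean(1)
    J = int(rng.geometric(0.5))            # P(J = j) = 2**-j
    if 2**J <= t_max:
        n = 2**J
        est += (chain_mean(n) - chain_mean(n // 2)) * 2**J
    return est
```

Averaging many `mlmc_estimate()` calls lands near the stationary mean 0.5, while the naive one-step estimate stays badly biased toward the start state.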

NeurIPS Conference 2024 Conference Paper

Transfer Q-star: Principled Decoding for LLM Alignment

  • Souradip Chakraborty
  • Soumya Suvra Ghosal
  • Ming Yin
  • Dinesh Manocha
  • Mengdi Wang
  • Amrit Singh Bedi
  • Furong Huang

Aligning foundation models is essential for their safe and trustworthy deployment. However, traditional fine-tuning methods are computationally intensive and require updating billions of model parameters. A promising alternative, alignment via decoding, adjusts the response distribution directly without model updates to maximize a target reward $r$, thus providing a lightweight and adaptable framework for alignment. However, principled decoding methods rely on oracle access to an optimal Q-function ($Q^*$), which is often unavailable in practice. Hence, prior SoTA methods either approximate this $Q^*$ using $Q^{\pi_{\text{sft}}}$ (derived from the reference $\texttt{SFT}$ model) or rely on short-term rewards, resulting in sub-optimal decoding performance. In this work, we propose $\texttt{Transfer Q}^*$, which implicitly estimates the optimal value function for a target reward $r$ through a baseline model $\rho_{\texttt{BL}}$ aligned with a baseline reward $r_{\texttt{BL}}$ (which can be different from the target reward $r$). Theoretical analyses of $\texttt{Transfer Q}^*$ provide a rigorous characterization of its optimality, deriving an upper bound on the sub-optimality gap and identifying a hyperparameter to control the deviation from the pre-trained reference $\texttt{SFT}$ model based on user needs. Our approach significantly reduces the sub-optimality gap observed in prior SoTA methods and demonstrates superior empirical performance across key metrics such as coherence, diversity, and quality in extensive tests on several synthetic and real datasets.
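Decoding-time alignment of the kind the abstract describes can be sketched as reweighting the SFT model's next-token distribution by exponentiated value estimates, $\pi(y) \propto \pi_{\text{sft}}(y)\,e^{\alpha Q(y)}$. This is a generic sketch under our own assumptions, not the paper's Transfer Q* estimator; `q_values` stands in for the transferred value function and `alpha` for the deviation-control hyperparameter:

```python
import numpy as np

def aligned_decode_dist(sft_logits, q_values, alpha=1.0):
    """Next-token distribution proportional to
    softmax(sft_logits + alpha * q_values). alpha = 0 recovers the
    SFT model; larger alpha tilts decoding toward high-value tokens."""
    z = sft_logits + alpha * q_values
    z = z - z.max()                # stabilize before exponentiating
    p = np.exp(z)
    return p / p.sum()

logits = np.array([1.0, 0.0, -1.0])
q = np.array([0.0, 2.0, 0.0])      # value estimates favor token 1
base = np.exp(logits - logits.max()); base = base / base.sum()
assert np.allclose(aligned_decode_dist(logits, q, alpha=0.0), base)
assert aligned_decode_dist(logits, q, alpha=1.0)[1] > base[1]
```

The quality of such decoding hinges entirely on how well `q_values` approximates the optimal Q-function, which is the gap the paper's transfer construction targets.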

ICRA Conference 2024 Conference Paper

UAV-Sim: NeRF-based Synthetic Data Generation for UAV-based Perception

  • Christopher Maxey
  • Jaehoon Choi
  • Hyungtae Lee
  • Dinesh Manocha
  • Heesung Kwon

Tremendous variations coupled with large degrees of freedom in UAV-based imaging conditions lead to a significant lack of data in adequately learning UAV-based perception models. Using various synthetic renderers in conjunction with perception models is prevalent to create synthetic data to augment the learning in the ground-based imaging domain. However, severe challenges in the austere UAV-based domain require distinctive solutions to image synthesis for data augmentation. In this work, we leverage recent advancements in neural rendering to improve static and dynamic novel-view UAV-based image synthesis, especially from high altitudes, capturing salient scene attributes. Finally, we demonstrate a considerable performance boost is achieved when a state-of-the-art detection model is optimized primarily on hybrid sets of real and synthetic data instead of the real or synthetic data separately.

ICRA Conference 2024 Conference Paper

Unconstrained Model Predictive Control for Robot Navigation under Uncertainty

  • Senthil Hariharan Arul
  • Jong Jin Park
  • Vishnu Prem
  • Yang Zhang
  • Dinesh Manocha

In this paper, we present a probabilistic and unconstrained model predictive control formulation for robot navigation under uncertainty. We present (1) a closed-form approximation of the probability of collision that naturally models the propagation of uncertainty over the planning horizon and is computationally cheap to evaluate, and (2) a collision-cost formulation which provably preserves forward invariance (i.e., keeps the robot away from obstacles) when combined with the probability formulation. Notably, our formulation avoids hard constraints by construction, which in turn avoids abrupt transitions in robot behavior around the constraint boundaries, ensuring graceful navigation. Further, we present proofs of the forward invariance and stability of the approach. We compare the efficacy of our method with the baseline [1], which the proposed approach builds on. We demonstrate that the approach results in confident and safe robot navigation in tight spaces by smoothly slowing down the robot in low-survivability environments (e.g., tight corridors), while also allowing it to move away from obstacles safely when needed.
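A minimal 1-D sketch of the two ingredients above, under our own simplifying assumptions (not the paper's actual closed form): a Gaussian belief over the signed distance to an obstacle gives a closed-form collision probability, and propagating the covariance over the horizon makes that probability grow, so a soft cost built on it naturally slows the robot down:

```python
import numpy as np
from math import erf, sqrt

def collision_prob_1d(mean_dist, var):
    """P(signed distance < 0) under a Gaussian belief N(mean_dist, var);
    a 1-D stand-in for a closed-form collision probability."""
    return 0.5 * (1.0 + erf((0.0 - mean_dist) / sqrt(2.0 * var)))

# Uncertainty propagation over the horizon: Sigma_{t+1} = A Sigma A^T + Q
# (scalar A = 1, hypothetical process noise Q = 0.02).
var, probs = 0.05, []
for t in range(5):
    probs.append(collision_prob_1d(mean_dist=1.0, var=var))
    var = var + 0.02

assert probs == sorted(probs)   # collision risk grows along the horizon
assert probs[0] < 0.01          # near-term risk is tiny at 1 m clearance
```

Because the probability enters the objective as a smooth cost rather than a hard constraint, the optimizer trades speed against risk continuously instead of switching behavior at a constraint boundary.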

ICRA Conference 2024 Conference Paper

VAPOR: Legged Robot Navigation in Unstructured Outdoor Environments using Offline Reinforcement Learning

  • Kasun Weerakoon
  • Adarsh Jagan Sathyamoorthy
  • Mohamed Elnoor
  • Dinesh Manocha

We present VAPOR, a novel method for autonomous legged robot navigation in unstructured, densely vegetated outdoor environments using offline Reinforcement Learning (RL). Our method trains a novel RL policy using an actor-critic network and arbitrary data collected in real outdoor vegetation. Our policy uses height and intensity-based cost maps derived from 3D LiDAR point clouds, a goal cost map, and processed proprioception data as state inputs, and learns the physical and geometric properties of the surrounding obstacles such as height, density, and solidity/stiffness. The fully trained policy’s critic network is then used to evaluate the quality of dynamically feasible velocities generated from a novel context-aware planner. Our planner adapts the robot’s velocity space based on the presence of entrapping obstacles such as vegetation and narrow passages in dense environments. We demonstrate our method’s capabilities on a Spot robot in complex real-world outdoor scenes, including dense vegetation. We observe that VAPOR’s actions improve success rates by up to 40%, decrease the average current consumption by up to 2.9%, and decrease the normalized trajectory length by up to 11.2% compared to existing end-to-end offline RL and other outdoor navigation methods.

IROS Conference 2024 Conference Paper

VLPG-Nav: Object Navigation Using Visual Language Pose Graph and Object Localization Probability Maps

  • Senthil Hariharan Arul
  • Dhruva Kumar
  • Vivek Sugirtharaj
  • Richard Kim
  • Xuewei Tony Qi
  • Rajasimman Madhivanan
  • Arnie Sen
  • Dinesh Manocha

We present VLPG-Nav, a visual language navigation method for guiding robots to specified objects within household scenes. Unlike existing methods primarily focused on navigating the robot toward objects, our approach considers the additional challenge of centering the object within the robot’s camera view. Our method builds a visual language pose graph (VLPG) that functions as a spatial map of VL embeddings. Given an open-vocabulary object query, we plan a viewpoint for object navigation using the VLPG. Even after the robot navigates to the viewpoint, real-world challenges such as object occlusion, object displacement, and localization errors can prevent the object from being visible. We build an object localization probability map that leverages the robot’s current observations and the prior VLPG. When the object is not visible, the probability map is updated, and an alternate viewpoint is computed. In addition, we propose an object-centering formulation that locally adjusts the robot’s pose to center the object in the camera view. We evaluate the effectiveness of our approach through simulations and real-world experiments, measuring its ability to successfully view and center the object within the camera’s field of view. VLPG-Nav demonstrates improved performance in locating the object, navigating around occlusions, and centering the object within the robot’s camera view, outperforming selected baselines in the evaluation settings.

IROS Conference 2024 Conference Paper

When, What, and with Whom to Communicate: Enhancing RL-based Multi-Robot Navigation through Selective Communication

  • Senthil Hariharan Arul
  • Amrit Singh Bedi
  • Dinesh Manocha

Decentralized navigation methods rely primarily on local observations, lacking the global awareness needed to coordinate effectively within a multi-agent system. Exchanging relevant messages between agents can promote cooperation and improve navigation efficiency. We present a Reinforcement Learning (RL)-based decentralized navigation approach that learns ‘when,’ ‘what,’ and ‘with whom’ to communicate for safe and cooperative navigation. Our method leverages a visual transformer and self-attention mechanism to encode the local occupancy map and the state information of neighbors into fixed-length encodings, allowing it to handle an arbitrary number of neighbors for collision-free navigation. In addition, the network encodes the agent’s state information and observations of neighboring agents into a concise message vector by learning what information is crucial to communicate, which is shared with neighboring agents upon request. Moreover, to avoid indiscriminate broadcasting, the network learns when and with whom to communicate and request message vectors. Subsequently, the messages communicated alongside the local information are used to guide navigation decisions. We evaluate our method against state-of-the-art baselines in complex scenarios, including narrow corridors and environments with multiple agents. We observe considerable improvements in terms of navigation performance, showing up to ∼2× improvement in navigation success rates and a reduction of up to ∼20% in path length.

TMLR Journal 2023 Journal Article

A Survey on the Possibilities & Impossibilities of AI-generated Text Detection

  • Soumya Suvra Ghosal
  • Souradip Chakraborty
  • Jonas Geiping
  • Furong Huang
  • Dinesh Manocha
  • Amrit Bedi

Large Language Models (LLMs) have revolutionized the domain of natural language processing (NLP) with remarkable capabilities of generating human-like text responses. However, despite these advancements, several works in the existing literature have raised serious concerns about the potential misuse of LLMs such as spreading misinformation, generating fake news, plagiarism in academia, and contaminating the web. To address these concerns, a consensus among the research community is to develop algorithmic solutions to detect AI-generated text. The basic idea is that whenever we can tell if the given text is either written by a human or an AI, we can utilize this information to address the above-mentioned concerns. To that end, a plethora of detection frameworks have been proposed, highlighting the possibilities of AI-generated text detection. But in parallel to the development of detection frameworks, researchers have also concentrated on designing strategies to elude detection, i.e., focusing on the impossibilities of AI-generated text detection. This is a crucial step in order to make sure the detection frameworks are robust enough and it is not too easy to fool a detector. Despite the huge interest and the flurry of research in this domain, the community currently lacks a comprehensive analysis of recent developments. In this survey, we aim to provide a concise categorization and overview of current work encompassing both the prospects and the limitations of AI-generated text detection. To enrich the collective knowledge, we engage in an exhaustive discussion on critical and challenging open questions related to ongoing research on AI-generated text detection.

ICRA Conference 2023 Conference Paper

AZTR: Aerial Video Action Recognition with Auto Zoom and Temporal Reasoning

  • Xijun Wang 0002
  • Ruiqi Xian
  • Tianrui Guan
  • Celso M. de Melo
  • Stephen M. Nogar
  • Aniket Bera
  • Dinesh Manocha

We propose a novel approach for aerial video action recognition. Our method is designed for videos captured using UAVs and can run on edge or mobile devices. We present a learning-based approach that uses customized auto zoom to automatically identify the human target and scale it appropriately. This makes it easier to extract the key features and reduces the computational overhead. We also present an efficient temporal reasoning algorithm to capture the action information along the spatial and temporal domains within a controllable computational cost. Our approach has been implemented and evaluated both on the desktop with high-end GPUs and on the low power Robotics RB5 Platform for robots and drones. In practice, we achieve 6.1-7.4% improvement over SOTA in Top-1 accuracy on the RoCoG-v2 dataset, 8.3-10.4% improvement on the UAV-Human dataset and 3.2% improvement on the Drone Action dataset.

ICML Conference 2023 Conference Paper

Beyond Exponentially Fast Mixing in Average-Reward Reinforcement Learning via Multi-Level Monte Carlo Actor-Critic

  • Wesley A. Suttle
  • Amrit Singh Bedi
  • Bhrij Patel
  • Brian M. Sadler
  • Alec Koppel
  • Dinesh Manocha

Many existing reinforcement learning (RL) methods employ stochastic gradient iteration on the back end, whose stability hinges upon a hypothesis that the data-generating process mixes exponentially fast with a rate parameter that appears in the step-size selection. Unfortunately, this assumption is violated for large state spaces or settings with sparse rewards, and the mixing time is unknown, making the step size inoperable. In this work, we propose an RL methodology attuned to the mixing time by employing a multi-level Monte Carlo estimator for the critic, the actor, and the average reward embedded within an actor-critic (AC) algorithm. This method, which we call Multi-level Actor-Critic (MAC), is developed specifically for infinite-horizon average-reward settings and neither relies on oracle knowledge of the mixing time in its parameter selection nor assumes its exponential decay; it is therefore readily applicable to applications with slower mixing times. Nonetheless, it achieves a convergence rate comparable to SOTA actor-critic algorithms. We experimentally show that these alleviated restrictions on the technical conditions required for stability translate to superior performance in practice for RL problems with sparse rewards.

ICRA Conference 2023 Conference Paper

Dealing with Sparse Rewards in Continuous Control Robotics via Heavy-Tailed Policy Optimization

  • Souradip Chakraborty
  • Amrit Singh Bedi
  • Kasun Weerakoon
  • Prithvi Poddar
  • Alec Koppel
  • Pratap Tokekar
  • Dinesh Manocha

In this paper, we present a novel Heavy-Tailed Stochastic Policy Gradient (HT-SPG) algorithm to deal with the challenges of sparse rewards in continuous control problems. Sparse rewards are common in continuous control robotics tasks such as manipulation and navigation and make the learning problem hard due to the non-trivial estimation of value functions over the state space. This demands either reward shaping or expert demonstrations for the sparse reward environment. However, obtaining high-quality demonstrations is quite expensive and sometimes even impossible. We propose a heavy-tailed policy parametrization along with a modified momentum-based policy gradient tracking scheme to induce a stable exploratory behavior in the algorithm. The proposed algorithm does not require access to expert demonstrations. We test the performance of HT-SPG on various benchmark tasks of continuous control with sparse rewards such as 1D Mario, Pathological Mountain Car, Sparse Pendulum in OpenAI Gym, and Sparse MuJoCo environments (Hopper-v2, Half-Cheetah, Walker-2D). We show consistent performance improvement across all tasks in terms of high average cumulative reward without requiring access to expert demonstrations. We further demonstrate that a navigation policy trained using HT-SPG can be easily transferred into a Clearpath Husky robot to perform real-world navigation tasks.
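The exploratory behavior attributed to the heavy-tailed parametrization can be seen in a toy comparison of action samples (our illustration, not the paper's policy class): a Cauchy policy occasionally takes large actions that a same-scale Gaussian policy essentially never takes, which is what helps reach distant sparse rewards:

```python
import numpy as np

rng = np.random.default_rng(2)

# Sample 100k actions from a Gaussian policy and a heavy-tailed
# (Cauchy) policy with comparable scale, and count large excursions.
gauss_actions = rng.normal(0.0, 1.0, 100_000)
cauchy_actions = rng.standard_cauchy(100_000)

far_gauss = float((np.abs(gauss_actions) > 5).mean())
far_cauchy = float((np.abs(cauchy_actions) > 5).mean())

# Heavy tails mean a non-negligible fraction of far-reaching actions.
assert far_cauchy > far_gauss
```

Analytically, P(|N(0,1)| > 5) is below 1e-6, while a standard Cauchy puts roughly 12-13% of its mass beyond ±5, so the heavy-tailed policy keeps exploring regions a Gaussian policy would need reward shaping to reach.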

ICRA Conference 2023 Conference Paper

DifFAR: Differentiable Frequency-based Disentanglement for Aerial Video Action Recognition

  • Divya Kothandaraman
  • Ming Lin 0003
  • Dinesh Manocha

We present a learning algorithm, DifFAR, for human activity recognition in videos. Our approach is designed for UAV videos, which are mainly acquired from obliquely placed dynamic cameras that contain a human actor along with background motion. Typically, the human actors occupy less than one-tenth of the spatial resolution. DifFAR simultaneously harnesses the benefits of frequency domain representations, a classical analysis tool in signal processing, and data-driven neural networks. We build a differentiable static-dynamic frequency mask prior to model the salient static and dynamic pixels in the video, crucial for the underlying task of action recognition. We use this differentiable mask prior to enable the neural network to intrinsically learn disentangled feature representations via an identity loss function. Our formulation empowers the network to inherently compute disentangled salient features within its layers. Further, we propose a cost function encapsulating temporal relevance and spatial content to sample the most important frame within uniformly spaced video segments. We conduct extensive experiments on the UAV Human dataset and the NEC Drone dataset and demonstrate relative improvements of 5.72%-13.00% over the state-of-the-art and 14.28%-38.05% over the corresponding baseline model.

AAAI Conference 2023 Conference Paper

DocEdit: Language-Guided Document Editing

  • Puneet Mathur
  • Rajiv Jain
  • Jiuxiang Gu
  • Franck Dernoncourt
  • Dinesh Manocha
  • Vlad I. Morariu

Professional document editing tools require a certain level of expertise to perform complex edit operations. To make editing tools accessible to increasingly novice users, we investigate intelligent document assistant systems that can make or suggest edits based on a user's natural language request. Such a system should be able to understand the user's ambiguous requests and contextualize them to the visual cues and textual content found in a document image to edit localized unstructured text and structured layouts. To this end, we propose a new task of language-guided localized document editing, where the user provides a document and an open vocabulary editing request, and the intelligent system produces a command that can be used to automate edits in real-world document editing software. In support of this task, we curate the DocEdit dataset, a collection of approximately 28K instances of user edit requests over PDF and design templates along with their corresponding ground truth software executable commands. To our knowledge, this is the first dataset that provides a diverse mix of edit operations with direct and indirect references to the embedded text and visual objects such as paragraphs, lists, tables, etc. We also propose DocEditor, a Transformer-based localization-aware multimodal (textual, spatial, and visual) model that performs the new task. The model attends to both document objects and related text contents which may be referred to in a user edit request, generating a multimodal embedding that is used to predict an edit command and associated bounding box localizing it. Our proposed model empirically outperforms other baseline deep learning approaches by 15-18%, providing a strong starting point for future work.

IROS Conference 2023 Conference Paper

DS-MPEPC: Safe and Deadlock-Avoiding Robot Navigation in Cluttered Dynamic Scenes

  • Senthil Hariharan Arul
  • Jong Jin Park
  • Dinesh Manocha

We present an algorithm for safe robot navigation in complex dynamic environments using a variant of model predictive equilibrium point control. We use an optimization formulation to navigate robots gracefully in dynamic environments by optimizing over a trajectory cost function at each timestep. We present a novel trajectory cost formulation that significantly reduces conservative and deadlocking behaviors and generates smooth trajectories. In particular, we propose a new collision probability function that effectively captures the risk associated with a given configuration and the time to avoid collisions based on the velocity direction. Moreover, we propose a terminal state cost based on the expected time-to-goal and time-to-collision values that helps in avoiding trajectories that could result in deadlock. We evaluate our cost formulation in multiple simulated scenarios, including narrow corridors with dynamic obstacles, and observe significantly improved navigation behavior and reduced deadlocks as compared to prior methods.

ICRA Conference 2023 Conference Paper

METEOR: A Dense, Heterogeneous, and Unstructured Traffic Dataset with Rare Behaviors

  • Rohan Chandra
  • Xijun Wang 0002
  • Mridul Mahajan
  • Rahul Kala
  • Rishitha Palugulla
  • Chandrababu Naidu
  • Alok Jain
  • Dinesh Manocha

We present a new traffic dataset, Meteor, which captures traffic patterns and multi-agent driving behaviors in unstructured scenarios. Meteor consists of more than 1000 one-minute videos, over 2 million annotated frames with bounding boxes and GPS trajectories for 16 unique agent categories, and more than 13 million bounding boxes for traffic agents. Meteor is a dataset for rare and interesting multi-agent driving behaviors that are grouped into traffic violations, atypical interactions, and diverse scenarios. Every video in Meteor is tagged using a diverse range of factors corresponding to weather, time of the day, road conditions, and traffic density. We use Meteor to benchmark perception methods for object detection and multi-agent behavior prediction. Our key finding is that state-of-the-art models for object detection and behavior prediction, which otherwise succeed on existing datasets such as Waymo, fail on the Meteor dataset. Meteor is a step towards developing more sophisticated perception models for dense, heterogeneous, and unstructured scenarios.

AAAI Conference 2023 Conference Paper

Posterior Coreset Construction with Kernelized Stein Discrepancy for Model-Based Reinforcement Learning

  • Souradip Chakraborty
  • Amrit Singh Bedi
  • Pratap Tokekar
  • Alec Koppel
  • Brian Sadler
  • Furong Huang
  • Dinesh Manocha

Model-based approaches to reinforcement learning (MBRL) exhibit favorable performance in practice, but their theoretical guarantees in large spaces are mostly restricted to the setting when the transition model is Gaussian or Lipschitz, and demand a posterior estimate whose representational complexity grows unbounded with time. In this work, we develop a novel MBRL method (i) which relaxes the assumptions on the target transition model to belong to a generic family of mixture models; (ii) is applicable to large-scale training by incorporating a compression step such that the posterior estimate consists of a Bayesian coreset of only statistically significant past state-action pairs; and (iii) exhibits a sublinear Bayesian regret. To achieve these results, we adopt an approach based upon Stein's method, which, under a smoothness condition on the constructed posterior and target, allows distributional distance to be evaluated in closed form as the kernelized Stein discrepancy (KSD). The aforementioned compression step is then computed in terms of greedily retaining only those samples which are more than a certain KSD away from the previous model estimate. Experimentally, we observe that this approach is competitive with several state-of-the-art RL methodologies, and can achieve up to a 50% reduction in wall-clock time in some continuous control environments.
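
The greedy retention rule described above can be sketched with a simple kernel-based novelty score standing in for the kernelized Stein discrepancy (computing the true KSD requires the posterior's score function, which is omitted here; all names below are illustrative, not the paper's implementation):

```python
import math

def rbf(x, y, h=1.0):
    """Gaussian (RBF) kernel similarity between two scalar samples."""
    return math.exp(-(x - y) ** 2 / (2 * h * h))

def novelty(point, coreset, h=1.0):
    """Distance of a candidate from its nearest coreset representative
    (a cheap stand-in for the kernelized Stein discrepancy test)."""
    if not coreset:
        return float("inf")
    return 1.0 - max(rbf(point, c, h) for c in coreset)

def greedy_coreset(stream, eps=0.5, h=1.0):
    """Retain a sample only if it is more than eps 'away' from the
    current coreset, so the posterior summary stays small."""
    coreset = []
    for x in stream:
        if novelty(x, coreset, h) > eps:
            coreset.append(x)
    return coreset

# Near-duplicate state-action samples collapse to a few representatives.
stream = [0.0, 0.05, 0.1, 5.0, 5.02, 10.0, 0.02, 9.98]
print(greedy_coreset(stream))  # → [0.0, 5.0, 10.0]
```

The point of the sketch is the thresholded growth: the summary expands only when a sample is genuinely novel, which is what keeps the posterior's representational complexity from growing unbounded with time.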

ICRA Conference 2023 Conference Paper

Real-Time Decentralized Navigation of Nonholonomic Agents Using Shifted Yielding Areas

  • Liang He 0008
  • Zherong Pan
  • Dinesh Manocha

We present a lightweight, decentralized algorithm for navigating multiple nonholonomic agents through challenging environments with narrow passages. Our key idea is to allow agents to yield to each other in large open areas instead of narrow passages, to increase the success rate of conventional decentralized algorithms. At pre-processing time, our method computes a medial axis for the freespace. A reference trajectory is then computed and projected onto the medial axis for each agent. During run time, when an agent senses other agents moving in the opposite direction, our algorithm uses the medial axis to estimate a Point of Impact (POI) as well as the available area around the POI. If the area around the POI is not large enough for yielding behaviors to be successful, we shift the POI to nearby large areas by modulating the agent's reference trajectory and traveling speed. We evaluate our method on a set of 4 environments with up to 15 robots, and we find our method incurs a marginal computational overhead of 10-30 ms on average, achieving real-time performance. The planned reference trajectories can then be tracked using local navigation algorithms to achieve up to a 100% higher success rate over local navigation algorithms alone.

ICRA Conference 2023 Conference Paper

RTAW: An Attention Inspired Reinforcement Learning Method for Multi-Robot Task Allocation in Warehouse Environments

  • Aakriti Agrawal
  • Amrit Singh Bedi
  • Dinesh Manocha

We present a novel reinforcement learning based algorithm for the multi-robot task allocation problem in warehouse environments. We formulate it as a Markov Decision Process and solve it via a novel deep multi-agent reinforcement learning method (called RTAW) with an attention-inspired policy architecture. Hence, our proposed policy network uses global embeddings that are independent of the number of robots/tasks. We utilize the proximal policy optimization algorithm for training and use a carefully designed reward to obtain a converged policy. The converged policy ensures cooperation among different robots to minimize the total travel delay (TTD), which ultimately improves the makespan for a sufficiently large task list. In our extensive experiments, we compare the performance of our RTAW algorithm to state-of-the-art methods such as myopic pickup distance minimization (greedy) and regret-based baselines on different navigation schemes. We show an improvement of up to 14% (25-1000 seconds) in TTD on scenarios with hundreds or thousands of tasks for different challenging warehouse layouts and task generation schemes. We also demonstrate the scalability of our approach by showing performance with up to 1000 robots in simulations.

ICML Conference 2023 Conference Paper

STEERING: Stein Information Directed Exploration for Model-Based Reinforcement Learning

  • Souradip Chakraborty
  • Amrit Singh Bedi
  • Alec Koppel
  • Mengdi Wang 0001
  • Furong Huang
  • Dinesh Manocha

Directed Exploration is a crucial challenge in reinforcement learning (RL), especially when rewards are sparse. Information-directed sampling (IDS), which optimizes the information ratio, seeks to do so by augmenting regret with information gain. However, estimating information gain is computationally intractable or relies on restrictive assumptions which prohibit its use in many practical instances. In this work, we posit an alternative exploration incentive in terms of the integral probability metric (IPM) between a current estimate of the transition model and the unknown optimal, which under suitable conditions, can be computed in closed form with the kernelized Stein discrepancy (KSD). Based on KSD, we develop a novel algorithm STEERING: STEin information dirEcted exploration for model-based Reinforcement LearnING. To enable its derivation, we develop fundamentally new variants of KSD for discrete conditional distributions. We further establish that STEERING achieves sublinear Bayesian regret, improving upon prior learning rates of information-augmented MBRL, IDS included. Experimentally, we show that the proposed algorithm is computationally affordable and outperforms several prior approaches.

ICRA Conference 2023 Conference Paper

Synthetic-to-Real Domain Adaptation for Action Recognition: A Dataset and Baseline Performances

  • Arun V. Reddy
  • Ketul Shah
  • William Paul
  • Rohita Mocharla
  • Judy Hoffman
  • Kapil D. Katyal
  • Dinesh Manocha
  • Celso M. de Melo

Human action recognition is a challenging problem, particularly when there is high variability in factors such as subject appearance, backgrounds and viewpoint. While deep neural networks (DNNs) have been shown to perform well on action recognition tasks, they typically require large amounts of high-quality labeled data to achieve robust performance across a variety of conditions. Synthetic data has shown promise as a way to avoid the substantial costs and potential ethical concerns associated with collecting and labeling enormous amounts of data in the real world. However, synthetic data may differ from real data in important ways. This phenomenon, known as domain shift, can limit the utility of synthetic data in robotics applications. To mitigate the effects of domain shift, substantial effort is being dedicated to the development of domain adaptation (DA) techniques. Yet, much remains to be understood about how best to develop these techniques. In this paper, we introduce a new dataset called Robot Control Gestures (RoCoG-v2). The dataset is composed of both real and synthetic videos from seven gesture classes, and is intended to support the study of synthetic-to-real domain shift for video-based action recognition. Our work expands upon existing datasets by focusing the action classes on gestures for human-robot teaming, as well as by enabling investigation of domain shift in both ground and aerial views. We present baseline results using state-of-the-art action recognition and domain adaptation algorithms and offer initial insight on tackling the synthetic-to-real and ground-to-air domain shifts. Instructions on accessing the dataset can be found at https://github.com/reddyav1/RoCoG-v2.

IROS Conference 2023 Conference Paper

VERN: Vegetation-Aware Robot Navigation in Dense Unstructured Outdoor Environments

  • Adarsh Jagan Sathyamoorthy
  • Kasun Weerakoon
  • Tianrui Guan
  • Mason Russell
  • Damon Conover
  • Jason L. Pusey
  • Dinesh Manocha

We propose a novel method for autonomous legged robot navigation in densely vegetated environments with a variety of pliable/traversable and non-pliable/untraversable vegetation. We present a novel few-shot learning classifier that can be trained on a few hundred RGB images to differentiate flora that can be navigated through, from the ones that must be circumvented. Using the vegetation classification and 2D lidar scans, our method constructs a vegetation-aware traversability cost map that accurately represents the pliable and non-pliable obstacles with lower, and higher traversability costs, respectively. Our cost map construction accounts for misclassifications of the vegetation and further lowers the risk of collisions, freezing and entrapment in vegetation during navigation. Furthermore, we propose holonomic recovery behaviors for the robot for scenarios where it freezes, or gets physically entrapped in dense, pliable vegetation. We demonstrate our method on a Boston Dynamics Spot robot in real-world unstructured environments with sparse and dense tall grass, bushes, trees, etc. We observe an increase of 25-90% in success rates, 10-90% decrease in freezing rate, and up to 65% decrease in the false positive rate compared to existing methods.

IROS Conference 2022 Conference Paper

AFR: An Efficient Buffering Algorithm for Cloud Robotic Systems

  • Yu-Ping Wang 0001
  • Hao-Ning Wang
  • Zi-Xin Zou
  • Dinesh Manocha

Communication between robots and the server is a major problem for cloud robotic systems. In this paper, we address the problem caused by data loss during such communications and propose an efficient buffering algorithm, called AFR, to solve it. We model the problem as an optimization problem to maximize the received Quantity of Information (QoI). Our AFR algorithm is formally proved to achieve near-optimal QoI, which has a lower bound that is a constant multiple of the unrealizable optimal QoI. We implement our AFR algorithm in ROS without changing the API for the applications. Our experiments on two cloud robot applications show that our AFR algorithm can efficiently and effectively reduce the impact of data loss. For the remote mapping application, the RMSE caused by data loss can be reduced by about 20%. For the remote tracking application, the probability of tracking failure caused by data loss can be reduced from about 40%-60% to under 10%. Meanwhile, our AFR algorithm introduces a time overhead of under 10 microseconds.

IROS Conference 2022 Conference Paper

CGLR: Dense Multi-Agent Navigation Using Voronoi Cells and Congestion Metric-based Replanning

  • Senthil Hariharan Arul
  • Dinesh Manocha

We present a decentralized path-planning algorithm for navigating multiple differential-drive robots in dense environments. In contrast to prior decentralized methods, we propose a novel congestion metric-based replanning that couples local and global planning techniques to efficiently navigate in scenarios with multiple corridors. To handle dense scenes with narrow passages, our approach computes the initial path for each agent to its assigned goal using a lattice planner. Based on neighbors' information, each agent performs online replanning using a congestion metric that tends to reduce the collisions and improves the navigation performance. Furthermore, we use the Voronoi cells of each agent to plan the local motion as well as a corridor selection strategy to limit the congestion in narrow passages. We evaluate the performance of our approach in complex scenes with tens of agents and narrow passages. We show that our Coupled Global-Local approach and Replanning (CGLR) improves the performance and efficiency over prior decentralized methods. In addition, our approach results in a higher success rate in terms of collision-free navigation to the goals, showing improvement in the range of 3-70% over prior decentralized solutions in certain scenarios.

IROS Conference 2022 Conference Paper

DC-MRTA: Decentralized Multi-Robot Task Allocation and Navigation in Complex Environments

  • Aakriti Agrawal
  • Senthil Hariharan Arul
  • Amrit Singh Bedi
  • Dinesh Manocha

We present a novel reinforcement learning (RL) based task allocation and decentralized navigation algorithm for mobile robots in warehouse environments. Our approach is designed for scenarios in which multiple robots are used to perform various pick up and delivery tasks. We consider the problem of joint decentralized task allocation and navigation and present a two-level approach to solve it. At the higher level, we solve the task allocation by formulating it in terms of Markov Decision Processes and choosing the appropriate rewards to minimize the Total Travel Delay (TTD). At the lower level, we use a decentralized navigation scheme based on ORCA that enables each robot to perform these tasks in an independent manner, and avoid collisions with other robots and dynamic obstacles. We combine these lower and upper levels by defining rewards for the higher level as the feedback from the lower level navigation algorithm. We perform extensive evaluation in complex warehouse layouts with a large number of agents and highlight the benefits over state-of-the-art algorithms based on myopic pickup distance minimization and regret-based task selection. We observe improvements of up to 14% in terms of task completion time and up to 40% in terms of computing collision-free trajectories for the robots.

ICRA Conference 2022 Conference Paper

Game-Theoretic Planning for Autonomous Driving among Risk-Aware Human Drivers

  • Rohan Chandra
  • Mingyu Wang 0002
  • Mac Schwager
  • Dinesh Manocha

We present a novel approach for risk-aware planning with human agents in multi-agent traffic scenarios. Our approach takes into account the wide range of human driver behaviors on the road, from aggressive maneuvers like speeding and overtaking, to conservative traits like driving slowly and conforming to the right-most lane. In our approach, we learn a mapping from a data-driven human driver behavior model called the CMetric to a driver's entropic risk preference. We then use the derived risk preference within a game-theoretic risk-sensitive planner to model risk-aware interactions among human drivers and an autonomous vehicle in various traffic scenarios. We demonstrate our method in a merging scenario, where our results show that the final trajectories obtained from the risk-aware planner generate desirable emergent behaviors. Particularly, our planner recognizes aggressive human drivers and yields to them while maintaining a greater distance from them. In a user study, participants were able to distinguish between aggressive and conservative simulated drivers based on trajectories generated from our risk-sensitive planner. We also observe that aggressive human driving results in more frequent lane-changing in the planner. Finally, we compare the performance of our modified risk-aware planner with existing methods and show that modeling human driver behavior leads to safer navigation.
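
The entropic risk preference used to characterize drivers has a standard closed form, ρ_θ(X) = (1/θ) log E[exp(θX)], which recovers the plain mean as θ → 0 and weights rare high-cost outcomes heavily for larger θ. A small numerical illustration (the cost values are hypothetical, not taken from the paper):

```python
import math

def entropic_risk(costs, theta):
    """Entropic risk (1/theta) * log E[exp(theta * X)] of sampled costs.

    theta -> 0 recovers the ordinary mean; larger theta (more risk-averse)
    is dominated by rare high-cost outcomes such as near-collisions.
    """
    mgf = sum(math.exp(theta * c) for c in costs) / len(costs)  # empirical E[exp(theta X)]
    return math.log(mgf) / theta

costs = [1.0, 1.0, 1.0, 10.0]      # one rare high-cost outcome
print(entropic_risk(costs, 1e-6))  # ~3.25, essentially the mean
print(entropic_risk(costs, 1.0))   # ~8.6, dominated by the rare 10.0
```

Learning a θ per driver from observed behavior (via the CMetric mapping) is what lets the game-theoretic planner treat aggressive and conservative drivers differently.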

ICRA Conference 2022 Conference Paper

MotionHint: Self-Supervised Monocular Visual Odometry with Motion Constraints

  • Cong Wang 0045
  • Yu-Ping Wang 0001
  • Dinesh Manocha

We present a novel self-supervised algorithm named MotionHint for monocular visual odometry (VO) that takes motion constraints into account. A key aspect of our approach is to use an appropriate motion model that can help existing self-supervised monocular VO (SSM-VO) algorithms to overcome issues related to the local minima within their self-supervised loss functions. The motion model is expressed with a neural network named PPnet. It is trained to coarsely predict the next pose of the camera and the uncertainty of this prediction. Our self-supervised approach combines the original loss and the motion loss, which is the weighted difference between the prediction and the generated ego-motion. Taking two existing SSM-VO systems as our baseline, we evaluate our MotionHint algorithm on the standard KITTI benchmark. Experimental results show that our MotionHint algorithm can be easily applied to existing open-sourced state-of-the-art SSM-VO systems to greatly improve the performance by reducing the resulting ATE by up to 28.73%.

IROS Conference 2022 Conference Paper

Multi-Robot Path Planning Using Medial-Axis-Based Pebble-Graph Embedding

  • Liang He 0008
  • Zherong Pan
  • Kiril Solovey
  • Biao Jia
  • Dinesh Manocha

We present a centralized algorithm for labeled, disk-shaped Multi-Robot Path Planning (MPP) in a continuous planar workspace with polygonal boundaries. Our method automatically transforms the continuous problem into a discrete, graph-based variant termed the pebble motion problem, which can be solved efficiently. To construct the underlying pebble graph, we identify inscribed circles in the workspace via a medial axis transform and organize robots into layers within each inscribed circle. We show that our layered pebble-graph enables collision-free motions, allowing all graph-restricted MPP instances to be feasible. MPP instances with continuous start and goal positions can then be solved via local navigations that route robots from and to graph vertices. We tested our method on several environments with high robot-packing densities (up to 61.6% of the workspace). For environments with narrow passages, such density violates the well-separated assumptions made by state-of-the-art MPP planners, while our method achieves an average success rate of 83%.

ICML Conference 2022 Conference Paper

N-Penetrate: Active Learning of Neural Collision Handler for Complex 3D Mesh Deformations

  • Qingyang Tan
  • Zherong Pan
  • Breannan Smith
  • Takaaki Shiratori
  • Dinesh Manocha

We present a robust learning algorithm to detect and handle collisions in 3D deforming meshes. We first train a neural network to detect collisions and then use a numerical optimization algorithm to resolve penetrations guided by the network. Our learned collision handler can resolve collisions for unseen, high-dimensional meshes with thousands of vertices. To obtain stable network performance in such large and unseen spaces, we apply active learning by progressively inserting new collision data based on the network inferences. We automatically label these new data using an analytical collision detector and progressively fine-tune our detection networks. We evaluate our method for collision handling of complex, 3D meshes coming from several datasets with different shapes and topologies, including datasets corresponding to dressed and undressed human poses, cloth simulations, and human hand poses acquired using multi-view capture systems.

ICRA Conference 2022 Conference Paper

SelfTune: Metrically Scaled Monocular Depth Estimation through Self-Supervised Learning

  • Jaehoon Choi
  • Dongki Jung
  • Yonghan Lee 0001
  • Deokhwa Kim
  • Dinesh Manocha
  • Donghwan Lee

Monocular depth estimation in the wild inherently predicts depth up to an unknown scale. To resolve the scale ambiguity issue, we present a learning algorithm that leverages monocular simultaneous localization and mapping (SLAM) with proprioceptive sensors. Such monocular SLAM systems can provide metrically scaled camera poses. Given these metric poses and monocular sequences, we propose a self-supervised learning method for the pre-trained supervised monocular depth networks to enable metrically scaled depth estimation. Our approach is based on a teacher-student formulation which guides our network to predict high-quality depths. We demonstrate that our approach is useful for various applications such as mobile robot navigation and is applicable to diverse environments. Our full system shows improvements over recent self-supervised depth estimation and completion methods on EuRoC, OpenLORIS, and ScanNet datasets.

ICRA Conference 2022 Conference Paper

TERP: Reliable Planning in Uneven Outdoor Environments using Deep Reinforcement Learning

  • Kasun Weerakoon
  • Adarsh Jagan Sathyamoorthy
  • Utsav Patel
  • Dinesh Manocha

We present a novel method for reliable robot navigation in uneven outdoor terrains. Our approach employs a fully-trained Deep Reinforcement Learning (DRL) network that uses elevation maps of the environment, robot pose, and goal as inputs to compute an attention mask of the environment. The attention mask is used to identify reduced stability regions in the elevation map and is computed using channel and spatial attention modules and a novel reward function. We continuously compute and update a navigation cost-map that encodes the elevation information or the level-of-flatness of the terrain using the attention mask. We then generate locally least-cost waypoints on the cost-map and compute the final dynamically feasible trajectory using another DRL-based method. Our approach guarantees safe, locally least-cost paths and dynamically feasible robot velocities in uneven terrains. We observe an increase of 35.18% in terms of success rate and a decrease of 26.14% in the cumulative elevation gradient of the robot's trajectory compared to prior navigation methods in high-elevation regions. We evaluate our method on a Husky robot in real-world uneven terrains (∼4 m of elevation gain) and demonstrate its benefits.

IROS Conference 2022 Conference Paper

TerraPN: Unstructured Terrain Navigation using Online Self-Supervised Learning

  • Adarsh Jagan Sathyamoorthy
  • Kasun Weerakoon
  • Tianrui Guan
  • Jing Liang 0006
  • Dinesh Manocha

We present TerraPN, a novel method that learns the surface properties (traction, bumpiness, deformability, etc.) of complex outdoor terrains directly from robot-terrain interactions through self-supervised learning, and uses it for autonomous robot navigation. Our method uses RGB images of terrain surfaces and the robot's velocities as inputs, and the IMU vibrations and odometry errors experienced by the robot as labels for self-supervision. Our method computes a surface cost map that differentiates smooth, high-traction surfaces (low navigation costs) from bumpy, slippery, deformable surfaces (high navigation costs). We compute the cost map by non-uniformly sampling patches from the input RGB image by detecting boundaries between surfaces, resulting in low inference times (47.27% lower) compared to uniform sampling and existing segmentation methods. We present a novel navigation algorithm that accounts for a surface's cost, computes cost-based acceleration limits for the robot, and generates dynamically feasible, collision-free trajectories. TerraPN's surface cost prediction can be trained in ∼25 minutes for five different surfaces, compared to several hours for previous learning-based segmentation methods. In terms of navigation, our method outperforms previous works in terms of success rates (up to 35.84% higher), vibration cost of the trajectories (up to 21.52% lower), and slowing the robot on bumpy, deformable surfaces (up to 46.76% slower) in different scenarios.

ICRA Conference 2021 Conference Paper

DWA-RL: Dynamically Feasible Deep Reinforcement Learning Policy for Robot Navigation among Mobile Obstacles

  • Utsav Patel
  • Nithish K. Sanjeev Kumar
  • Adarsh Jagan Sathyamoorthy
  • Dinesh Manocha

We present a novel Deep Reinforcement Learning (DRL) based policy to compute dynamically feasible and spatially aware velocities for a robot navigating among mobile obstacles. Our approach combines the benefits of the Dynamic Window Approach (DWA) in terms of satisfying the robot’s dynamics constraints with state-of-the-art DRL-based navigation methods that can handle moving obstacles and pedestrians well. Our formulation achieves these goals by embedding the environmental obstacles’ motions in a novel low-dimensional observation space. It also uses a novel reward function to positively reinforce velocities that move the robot away from the obstacle’s heading direction, leading to a significantly lower number of collisions. We evaluate our method in realistic 3-D simulated environments and on a real differential drive robot in challenging dense indoor scenarios with several walking pedestrians. We compare our method with state-of-the-art collision avoidance methods and observe significant improvements in terms of success rate (up to 33% increase), number of dynamics constraint violations (up to 61% decrease), and smoothness. We also conduct ablation studies to highlight the advantages of our observation space formulation and reward structure.

AAAI Conference 2021 Conference Paper

LCollision: Fast Generation of Collision-Free Human Poses using Learned Non-Penetration Constraints

  • Qingyang Tan
  • Zherong Pan
  • Dinesh Manocha

We present LCollision, a learning-based method that synthesizes collision-free 3D human poses. At the crux of our approach is a novel deep architecture that simultaneously decodes new human poses from the latent space and predicts colliding body parts. These two components of our architecture are used as the objective function and surrogate hard constraints in a constrained optimization for collision-free human pose generation. A novel aspect of our approach is the use of a bilevel autoencoder that decomposes whole-body collisions into groups of collisions between localized body parts. By solving the constrained optimizations, we show that a significant amount of collision artifacts can be resolved. Furthermore, in a large test set of 2.5 × 10⁶ randomized poses from SCAPE, our architecture achieves a collision-prediction accuracy of 94.1% with 80× speedup over exact collision detection algorithms. To the best of our knowledge, LCollision is the first approach that accelerates collision detection and resolves penetrations using a neural network.

ICRA Conference 2021 Conference Paper

Multi-Agent Ergodic Coverage in Urban Environments

  • Shivang Patel
  • Senthil Hariharan Arul
  • Pranav Dhulipala
  • Ming Lin 0003
  • Dinesh Manocha
  • Huan Xu 0002
  • Michael W. Otte

An important aspect of dynamic urban coverage is how building collision avoidance is incorporated into the overall coverage mission. We consider a multi-agent urban dynamic coverage problem in which a team of flying agents uses downward facing cameras to observe the street-level environment outside of buildings. Cameras are assumed to be ineffective above a maximum altitude (lower than building height), such that agents must move around or over buildings to complete their mission. The main objective of this paper is to compare three different building avoidance strategies that are compatible with dynamic ergodic methods. To provide context for these results, we also compare our results to three other common coverage methods: boustrophedon coverage (lawn-mower sweep), Voronoi region based coverage, and a naive grid method. All algorithms are evaluated in simulation with respect to four performance metrics (percent coverage, revisit count, revisit time, and the integral of area viewed over time), across team sizes ranging from 1 to 25 agents, and in five types of urban environments of varying density and height. We find that the relative performance of algorithms changes based on the ratio of team size to search area, as well as the height and density characteristics of the urban environment.

IROS Conference 2021 Conference Paper

ORBBuf: A Robust Buffering Method for Remote Visual SLAM

  • Yu-Ping Wang 0001
  • Zi-Xin Zou
  • Cong Wang 0045
  • Yue-Jiang Dong
  • Lei Qiao 0002
  • Dinesh Manocha

Data loss caused by unreliable networks seriously impacts the results of remote visual SLAM systems. From our experiment, a loss of less than 1 second of data can cause a visual SLAM algorithm to lose tracking. We present a novel buffering method, ORBBuf, to reduce the impact of data loss on remote visual SLAM systems. We model the buffering problem as an optimization problem by introducing a similarity metric between frames. To solve the buffering problem, we present an efficient greedy algorithm to discard the frames that have the least impact on the quality of SLAM results. We implement our ORBBuf method on ROS, a widely used middleware framework. Through an extensive evaluation on real-world scenarios and tens of gigabytes of datasets, we demonstrate that our ORBBuf method can be applied to different state-estimation algorithms (DSO and VINS-Fusion), different sensor data (both monocular images and stereo images), different scenes (both indoor and outdoor), and different network environments (both WiFi networks and 4G networks). Our experimental results indicate that network losses indeed affect the SLAM results, and our ORBBuf method can reduce the RMSE by up to 50× compared with the Drop-Oldest and Random buffering methods.
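The greedy idea behind this kind of buffering can be sketched in a few lines. The following is an illustrative sketch, not the authors' implementation: `similarity` is a toy stand-in for ORBBuf's inter-frame similarity metric, and the discard rule here simply drops the interior frame whose removal leaves the most mutually similar neighbors.

```python
# Illustrative sketch (not ORBBuf's code): a bounded frame buffer that,
# when full, greedily discards the frame whose loss should least hurt
# downstream tracking, judged by a pairwise similarity metric.

def similarity(a, b):
    """Toy metric: frames are numbers; closer values = more similar."""
    return 1.0 / (1.0 + abs(a - b))

class GreedyBuffer:
    def __init__(self, capacity):
        self.capacity = capacity
        self.frames = []

    def push(self, frame):
        self.frames.append(frame)
        if len(self.frames) > self.capacity:
            # Never drop the endpoints; remove the interior frame whose
            # neighbors remain most similar to each other after removal.
            idx = max(
                range(1, len(self.frames) - 1),
                key=lambda i: similarity(self.frames[i - 1], self.frames[i + 1]),
            )
            del self.frames[idx]
```

With capacity 3, pushing the frames 0, 0.1, 0.2, 5 drops the redundant 0.1 rather than the oldest frame, which is the behavioral difference from a Drop-Oldest buffer.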

IJCAI Conference 2021 Conference Paper

Point-based Acoustic Scattering for Interactive Sound Propagation via Surface Encoding

  • Hsien-Yu Meng
  • Zhenyu Tang
  • Dinesh Manocha

We present a novel geometric deep learning method to compute the acoustic scattering properties of geometric objects. Our learning algorithm uses a point cloud representation of objects to compute the scattering properties and integrates them with ray tracing for interactive sound propagation in dynamic scenes. We use discrete Laplacian-based surface encoders and approximate the neighborhood of each point using a shared multi-layer perceptron. We show that our formulation is permutation invariant and present a neural network that computes the scattering function using spherical harmonics. Our approach can handle objects with arbitrary topologies and deforming models, and takes less than 1 ms per object on a commodity GPU. We analyze the accuracy of our approach, perform validation on thousands of unseen 3D objects, and highlight its benefits over other point-based geometric deep learning methods. To the best of our knowledge, this is the first real-time learning algorithm that can approximate the acoustic scattering properties of arbitrary objects with high accuracy.

ICRA Conference 2021 Conference Paper

SelfDeco: Self-Supervised Monocular Depth Completion in Challenging Indoor Environments

  • Jaehoon Choi
  • Dongki Jung
  • Yonghan Lee 0001
  • Deokhwa Kim
  • Dinesh Manocha
  • Donghwan Lee

We present a novel algorithm for self-supervised monocular depth completion. Our approach is based on training a neural network that requires only sparse depth measurements and corresponding monocular video sequences without dense depth labels. Our self-supervised algorithm is designed for challenging indoor environments with textureless regions, glossy and transparent surfaces, moving people, long and diverse depth ranges, and scenes captured with complex ego-motions. Our novel architecture leverages both deep stacks of sparse convolution blocks to extract sparse depth features and pixel-adaptive convolutions to fuse image and depth features. We compare with existing approaches on the NYUv2, KITTI and NAVERLABS indoor datasets, and observe 5–34% reductions in root-mean-square error (RMSE).

IROS Conference 2021 Conference Paper

V-RVO: Decentralized Multi-Agent Collision Avoidance using Voronoi Diagrams and Reciprocal Velocity Obstacles

  • Senthil Hariharan Arul
  • Dinesh Manocha

We present a decentralized collision avoidance method for dense environments based on buffered Voronoi cells (BVC) and reciprocal velocity obstacles (RVO). Our approach is designed for scenarios with a large number of agents in close proximity and provides passive-friendly collision avoidance guarantees. The Voronoi cells are superimposed with RVO cones to compute a suitable direction for each agent, and we use that direction to compute a local collision-free path. Our approach can also satisfy double-integrator dynamics, and we use the properties of the BVC to formulate a simple, decentralized deadlock resolution strategy. We demonstrate the benefits of V-RVO in complex scenarios with tens of agents in close proximity. In practice, V-RVO’s performance is comparable to prior velocity-obstacle methods, and the collision avoidance behavior is significantly less conservative than ORCA.

IROS Conference 2021 Conference Paper

XAI-N: Sensor-based Robot Navigation using Expert Policies and Decision Trees

  • Aaron M. Roth
  • Jing Liang 0006
  • Dinesh Manocha

We present a novel sensor-based learning navigation algorithm to compute a collision-free trajectory for a robot in dense and dynamic environments with moving obstacles or targets. Our approach uses a deep reinforcement learning-based expert policy that is trained using a sim2real paradigm. In order to increase the reliability and handle the failure cases of the expert policy, we combine it with a policy extraction technique to transform the resulting policy into a decision tree format. We use properties of decision trees to analyze and modify the policy and improve the performance of the navigation algorithm, including smoothness, frequency of oscillation, frequency of immobilization, and obstruction of the target. Overall, we are able to modify the policy to design an improved learning algorithm without retraining. We highlight the benefits of our approach in simulated environments and in navigating a Clearpath Jackal robot among moving pedestrians. (Videos at this url: https://gamma.umd.edu/researchdirections/xrl/navviper)

IROS Conference 2020 Conference Paper

CMetric: A Driving Behavior Measure using Centrality Functions

  • Rohan Chandra
  • Uttaran Bhattacharya
  • Trisha Mittal
  • Aniket Bera
  • Dinesh Manocha

We present a new measure, CMetric, to classify driver behaviors using centrality functions. Our formulation combines concepts from computational graph theory and social traffic psychology to quantify and classify the behavior of human drivers. CMetric is used to compute the probability of a vehicle executing a driving style, as well as the intensity used to execute the style. Our approach is designed for realtime autonomous driving applications, where the trajectory of each vehicle or road-agent is extracted from a video. We compute a dynamic geometric graph (DGG) based on the positions and proximity of the road-agents and centrality functions corresponding to closeness and degree. These functions are used to compute the CMetric based on style likelihood and style intensity estimates. Our approach is general and makes no assumption about traffic density, heterogeneity, or how driving behaviors change over time. We present an algorithm to compute CMetric and demonstrate its performance on real-world traffic datasets. To test the accuracy of CMetric, we introduce a new evaluation protocol (called "Time Deviation Error") that measures the difference between human prediction and the prediction made by CMetric.
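The graph construction this abstract describes can be illustrated compactly. The following is a hedged sketch, not the paper's code: it builds a proximity-based dynamic geometric graph from 2-D road-agent positions (the radius and helper names are assumptions) and computes the degree and closeness centralities that CMetric-style behavior measures build on.

```python
# Illustrative sketch (not CMetric itself): a dynamic geometric graph
# over road-agent positions -- an edge joins two agents within a
# proximity radius -- plus the degree and closeness centralities.
from itertools import combinations

def proximity_graph(positions, radius):
    """Adjacency sets for agents closer than `radius` (Euclidean)."""
    n = len(positions)
    adj = {i: set() for i in range(n)}
    for i, j in combinations(range(n), 2):
        dx = positions[i][0] - positions[j][0]
        dy = positions[i][1] - positions[j][1]
        if (dx * dx + dy * dy) ** 0.5 <= radius:
            adj[i].add(j)
            adj[j].add(i)
    return adj

def degree_centrality(adj, v):
    return len(adj[v])

def closeness_centrality(adj, v):
    # BFS shortest paths; closeness = (n - 1) / sum of distances.
    dist = {v: 0}
    frontier = [v]
    while frontier:
        nxt = []
        for u in frontier:
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    nxt.append(w)
        frontier = nxt
    total = sum(d for u, d in dist.items() if u != v)
    return (len(dist) - 1) / total if total else 0.0
```

Recomputing these centralities on each video frame yields the per-agent time series from which style likelihood and intensity estimates could then be derived.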

IJCAI Conference 2020 Conference Paper

Crowd-Steer: Realtime Smooth and Collision-Free Robot Navigation in Densely Crowded Scenarios Trained using High-Fidelity Simulation

  • Jing Liang
  • Utsav Patel
  • Adarsh Jagan Sathyamoorthy
  • Dinesh Manocha

We present a novel high fidelity 3-D simulator that significantly reduces the sim-to-real gap for collision avoidance in dense crowds using Deep Reinforcement Learning (DRL). Our simulator models realistic crowd and pedestrian behaviors, along with friction, sensor noise and delays in the simulated robot model. We also describe a technique to incrementally control the randomness and complexity of training scenarios to achieve better convergence and generalization capabilities. We demonstrate the effectiveness of our simulator by training a policy that fuses data from multiple perception sensors such as a 2-D lidar and a depth camera to detect pedestrians and computes smooth, collision-free velocities. Our novel reward function and multi-sensor formulation result in smooth and unobtrusive navigation. We evaluate the learned policy on two differential drive robots in new dense crowd scenarios, narrow corridors, T- and L-junctions, etc. We observe that our algorithm outperforms prior dynamic navigation techniques in terms of metrics such as success rate, trajectory length, mean time to goal, and smoothness.

IROS Conference 2020 Conference Paper

DeepMNavigate: Deep Reinforced Multi-Robot Navigation Unifying Local & Global Collision Avoidance

  • Qingyang Tan
  • Tingxiang Fan
  • Jia Pan 0001
  • Dinesh Manocha

We present a novel algorithm (DeepMNavigate) for global multi-agent navigation in dense scenarios using deep reinforcement learning (DRL). Our approach uses local and global information for each robot from motion information maps. We use a three-layer CNN that takes these maps as input to generate a suitable action to drive each robot to its goal position. Our approach is general, learns an optimal policy using a multi-scenario, multi-state training algorithm, and can directly handle raw sensor measurements for local observations. We demonstrate the performance on dense, complex benchmarks with narrow passages and environments with tens of agents. We highlight the algorithm’s benefits over prior learning methods and geometric decentralized algorithms in complex scenarios.

ICRA Conference 2020 Conference Paper

DenseCAvoid: Real-time Navigation in Dense Crowds using Anticipatory Behaviors

  • Adarsh Jagan Sathyamoorthy
  • Jing Liang 0006
  • Utsav Patel
  • Tianrui Guan
  • Rohan Chandra
  • Dinesh Manocha

We present DenseCAvoid, a novel algorithm for navigating a robot through dense crowds and avoiding collisions by anticipating pedestrian behaviors. Our formulation uses visual sensors and a pedestrian trajectory prediction algorithm to track pedestrians in a set of input frames and compute bounding boxes that extrapolate to the pedestrian positions in a future time. Our hybrid approach combines this trajectory prediction with a Deep Reinforcement Learning-based collision avoidance method to train a policy to generate smoother, safer, and more robust trajectories during run-time. We train our policy in realistic 3-D simulations of static and dynamic scenarios with multiple pedestrians. In practice, our hybrid approach generalizes well to unseen, real-world scenarios and can navigate a robot through dense crowds (~1-2 humans per square meter) in indoor scenarios, including narrow corridors and lobbies. As compared to cases where prediction was not used, we observe that our method reduces the occurrence of the robot freezing in a crowd by up to 48%, and performs comparably with respect to trajectory lengths and mean arrival times to goal.

ICRA Conference 2020 Conference Paper

GraphRQI: Classifying Driver Behaviors Using Graph Spectrums

  • Rohan Chandra
  • Uttaran Bhattacharya
  • Trisha Mittal
  • Xiaoyu Li
  • Aniket Bera
  • Dinesh Manocha

We present a novel algorithm (GraphRQI) to identify driver behaviors from road-agent trajectories. Our approach assumes that the road-agents exhibit a range of driving traits, such as aggressive or conservative driving. Moreover, these traits affect the trajectories of nearby road-agents as well as the interactions between road-agents. We represent these inter-agent interactions using unweighted and undirected traffic graphs. Our algorithm classifies the driver behavior using a supervised learning algorithm by reducing the computation to the spectral analysis of the traffic graph. Moreover, we present a novel eigenvalue algorithm to compute the spectrum efficiently. We provide theoretical guarantees for the running time complexity of our eigenvalue algorithm and show that it is 2× faster than previous methods. We evaluate the classification accuracy of our approach on traffic videos and autonomous driving datasets corresponding to urban traffic. In practice, GraphRQI achieves an accuracy improvement of up to 25% over prior driver behavior classification algorithms. We also use our classification algorithm to predict the future trajectories of road-agents.
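The reduction to spectral analysis can be shown in miniature. The following is an illustrative sketch, not GraphRQI itself: the paper's fast incremental eigenvalue algorithm is replaced here by a plain dense eigensolver, and the function name is an assumption.

```python
# Illustrative sketch (not GraphRQI): represent inter-agent interactions
# as an unweighted, undirected traffic graph and use the spectrum of its
# Laplacian as the feature vector a downstream supervised classifier
# would consume.
import numpy as np

def laplacian_spectrum(adjacency):
    A = np.asarray(adjacency, dtype=float)
    assert (A == A.T).all(), "traffic graph must be undirected"
    L = np.diag(A.sum(axis=1)) - A         # graph Laplacian L = D - A
    return np.sort(np.linalg.eigvalsh(L))  # real eigenvalues, ascending
```

For a path graph on three agents (agent 1 interacting with agents 0 and 2), the spectrum is (0, 1, 3); such fixed-length spectral vectors are what a standard supervised classifier can be trained on.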

ICRA Conference 2020 Conference Paper

Grasping Fragile Objects Using A Stress-Minimization Metric

  • Zherong Pan
  • Xifeng Gao
  • Dinesh Manocha

We present a new method to generate optimal grasps for brittle and fragile objects using a novel stress-minimization (SM) metric. Our approach is designed for objects that are composed of homogeneous isotropic materials. Our SM metric measures the maximal resistible external wrenches that would not result in fractures in the target objects. In this paper, we propose methods to compute our new metric. We also use our SM metric to design optimal grasp planning algorithms. Finally, we compare the performance of our metric and conventional grasp metrics, including Q1, Q∞, QG11, QMSV, QVEW. Our experiments show that our SM metric takes into account the material characteristics and object shapes to indicate the fragile regions, where prior methods may not work well. We also show that the computational cost of our SM metric is on par with prior methods. Finally, we show that grasp planners guided by our metric can lower the probability of breaking target objects.

ICRA Conference 2020 Conference Paper

Learning Resilient Behaviors for Navigation Under Uncertainty

  • Tingxiang Fan
  • Pinxin Long
  • Wenxi Liu
  • Jia Pan 0001
  • Ruigang Yang
  • Dinesh Manocha

Deep reinforcement learning has great potential to acquire complex, adaptive behaviors for autonomous agents automatically. However, the underlying neural network policies have not been widely deployed in real-world applications, especially in safety-critical tasks (e.g., autonomous driving). One of the reasons is that the learned policy cannot perform flexible and resilient behaviors, as traditional methods can, to adapt to diverse environments. In this paper, we consider the problem of a mobile robot learning adaptive and resilient behaviors for navigating in unseen uncertain environments while avoiding collisions. We present a novel approach for uncertainty-aware navigation by introducing an uncertainty-aware predictor to model the environmental uncertainty, and we propose a novel uncertainty-aware navigation network to learn resilient behaviors in previously unknown environments. To train the proposed uncertainty-aware network more stably and efficiently, we present the temperature decay training paradigm, which balances exploration and exploitation during the training process. Our experimental evaluation demonstrates that our approach can learn resilient behaviors in diverse environments and generate adaptive trajectories according to environmental uncertainties.

AAAI Conference 2020 Conference Paper

M3ER: Multiplicative Multimodal Emotion Recognition using Facial, Textual, and Speech Cues

  • Trisha Mittal
  • Uttaran Bhattacharya
  • Rohan Chandra
  • Aniket Bera
  • Dinesh Manocha

We present M3ER, a learning-based method for emotion recognition from multiple input modalities. Our approach combines cues from multiple co-occurring modalities (such as face, text, and speech) and is also more robust to sensor noise in any of the individual modalities than other methods. M3ER models a novel, data-driven multiplicative fusion method to combine the modalities, which learns to emphasize the more reliable cues and suppress others on a per-sample basis. By introducing a check step which uses Canonical Correlational Analysis to differentiate between ineffective and effective modalities, M3ER is robust to sensor noise. M3ER also generates proxy features in place of the ineffectual modalities. We demonstrate the efficiency of our network through experimentation on two benchmark datasets, IEMOCAP and CMU-MOSEI. We report a mean accuracy of 82.7% on IEMOCAP and 89.0% on CMU-MOSEI, which, collectively, is an improvement of about 5% over prior work.

AAAI Conference 2020 Conference Paper

NeoNav: Improving the Generalization of Visual Navigation via Generating Next Expected Observations

  • Qiaoyun Wu
  • Dinesh Manocha
  • Jun Wang
  • Kai Xu

We propose improving the cross-target and cross-scene generalization of visual navigation through learning an agent that is guided by conceiving the next observations it expects to see. This is achieved by learning a variational Bayesian model, called NeoNav, which generates the next expected observations (NEO) conditioned on the current observations of the agent and the target view. Our generative model is learned through optimizing a variational objective encompassing two key designs. First, the latent distribution is conditioned on current observations and the target view, leading to a model-based, target-driven navigation. Second, the latent space is modeled with a Mixture of Gaussians conditioned on the current observation and the next best action. Our use of a mixture-of-posteriors prior effectively alleviates the issue of over-regularized latent space, thus significantly boosting the model generalization for new targets and in novel scenes. Moreover, the NEO generation models the forward dynamics of agent-environment interaction, which improves the quality of approximate inference and hence benefits data efficiency. We have conducted extensive evaluations on both real-world and synthetic benchmarks, and show that our model consistently outperforms the state-of-the-art models in terms of success rate, data efficiency, and generalization.

IROS Conference 2020 Conference Paper

ProxEmo: Gait-based Emotion Learning and Multi-view Proxemic Fusion for Socially-Aware Robot Navigation

  • Venkatraman Narayanan
  • Bala Murali Manoghar
  • Vishnu Sashank Dorbala
  • Dinesh Manocha
  • Aniket Bera

We present ProxEmo, a novel end-to-end emotion prediction algorithm for socially aware robot navigation among pedestrians. Our approach predicts the perceived emotions of a pedestrian from walking gaits, which is then used for emotion-guided navigation taking into account social and proxemic constraints. To classify emotions, we propose a multi-view skeleton graph convolution-based model that works with a commodity camera mounted on a moving robot. Our emotion recognition is integrated into a mapless navigation scheme and makes no assumptions about the environment of pedestrian motion. It achieves a mean average emotion prediction precision of 82.47% on the Emotion-Gait benchmark dataset. We outperform current state-of-the-art algorithms for emotion recognition from 3D gaits. We highlight its benefits in terms of navigation in indoor scenes using a Clearpath Jackal robot.

ICRA Conference 2020 Conference Paper

RoadTrack: Realtime Tracking of Road Agents in Dense and Heterogeneous Environments

  • Rohan Chandra
  • Uttaran Bhattacharya
  • Tanmay Randhavane
  • Aniket Bera
  • Dinesh Manocha

We present a realtime tracking algorithm, RoadTrack, to track heterogeneous road-agents in dense traffic videos. Our approach is designed for dense traffic scenarios that consist of different road-agents such as pedestrians, two-wheelers, cars, buses, etc. sharing the road. We use the tracking-by-detection approach where we track a road-agent by matching the appearance or bounding box region in the current frame with the predicted bounding box region propagated from the previous frame. RoadTrack uses a novel motion model called the Simultaneous Collision Avoidance and Interaction (SimCAI) model to predict the motion of road-agents by modeling collision avoidance and interactions between the road-agents for the next frame. We demonstrate the advantage of RoadTrack on a dataset of dense traffic videos and observe an accuracy of 75.8% on this dataset, outperforming prior state-of-the-art tracking algorithms by at least 5.2%. RoadTrack operates in realtime at approximately 30 fps and is at least 4× faster than prior tracking algorithms on standard tracking datasets.

AAAI Conference 2020 Conference Paper

STEP: Spatial Temporal Graph Convolutional Networks for Emotion Perception from Gaits

  • Uttaran Bhattacharya
  • Trisha Mittal
  • Rohan Chandra
  • Tanmay Randhavane
  • Aniket Bera
  • Dinesh Manocha

We present a novel classifier network called STEP, to classify perceived human emotion from gaits, based on a Spatial Temporal Graph Convolutional Network (ST-GCN) architecture. Given an RGB video of an individual walking, our formulation implicitly exploits the gait features to classify the perceived emotion of the human into one of four emotions: happy, sad, angry, or neutral. We train STEP on annotated real-world gait videos, augmented with annotated synthetic gaits generated using a novel generative network called STEP-Gen, built on an ST-GCN based Conditional Variational Autoencoder (CVAE). We incorporate a novel push-pull regularization loss in the CVAE formulation of STEP-Gen to generate realistic gaits and improve the classification accuracy of STEP. We also release a novel dataset (E-Gait), which consists of 4,227 human gaits annotated with perceived emotions along with thousands of synthetic gaits. In practice, STEP can learn the affective features and exhibits a classification accuracy of 88% on E-Gait, which is 14–30% more accurate than prior methods.

IROS Conference 2019 Conference Paper

DensePeds: Pedestrian Tracking in Dense Crowds Using Front-RVO and Sparse Features

  • Rohan Chandra
  • Uttaran Bhattacharya
  • Aniket Bera
  • Dinesh Manocha

We present a pedestrian tracking algorithm, DensePeds, that tracks individuals in highly dense crowds (>2 pedestrians per square meter). Our approach is designed for videos captured from front-facing or elevated cameras. We present a new motion model called Front-RVO (FRVO) for predicting pedestrian movements in dense situations using collision avoidance constraints and combine it with state-of-the-art Mask R-CNN to compute sparse feature vectors that reduce the loss of pedestrian tracks (false negatives). We evaluate DensePeds on the standard MOT benchmarks as well as a new dense crowd dataset. In practice, our approach is 4.5× faster than prior tracking algorithms on the MOT benchmark and outperforms the prior state of the art on dense crowd videos by over 2.6% absolute on average.

ICRA Conference 2019 Conference Paper

Diffraction-Aware Sound Localization for a Non-Line-of-Sight Source

  • Inkyu An
  • Doheon Lee
  • Jung-Woo Choi
  • Dinesh Manocha
  • Sung-Eui Yoon

We present a novel sound localization algorithm for a non-line-of-sight (NLOS) sound source in indoor environments. Our approach exploits the diffraction properties of sound waves as they bend around a barrier or an obstacle in the scene. We combine a ray tracing-based sound propagation algorithm with a Uniform Theory of Diffraction (UTD) model, which simulates bending effects by placing a virtual sound source on a wedge in the environment. We precompute the wedges of a reconstructed mesh of an indoor scene and use them to generate diffraction acoustic rays to localize the 3D position of the source. Our method identifies the convergence region of those generated acoustic rays as the estimated source position based on a particle filter. We have evaluated our algorithm in multiple scenarios consisting of static and dynamic NLOS sound sources. In our tested cases, our approach can localize a source position with an average accuracy error of 0.7 m, measured by the L2 distance between estimated and actual source locations in a 7 m × 7 m × 3 m room. Furthermore, we observe 37% to 130% improvement in accuracy over a state-of-the-art localization method that does not model diffraction effects, especially when a sound source is not visible to the robot.

ICRA Conference 2019 Conference Paper

Efficient Generation of Motion Plans from Attribute-Based Natural Language Instructions Using Dynamic Constraint Mapping

  • Jae Sung Park
  • Biao Jia
  • Mohit Bansal
  • Dinesh Manocha

We present an algorithm for combining natural language processing (NLP) and fast robot motion planning to automatically generate robot movements. Our formulation uses a novel concept called Dynamic Constraint Mapping to transform complex, attribute-based natural language instructions into appropriate cost functions and parametric constraints for optimization-based motion planning. We generate a factor graph from natural language instructions called the Dynamic Grounding Graph (DGG), which takes latent parameters into account. The coefficients of this factor graph are learned based on conditional random fields (CRFs) and are used to dynamically generate the constraints for motion planning. We map the cost function directly to the motion parameters of the planner and compute smooth trajectories in dynamic scenes. We highlight the performance of our approach in a simulated environment and via a human interacting with a 7-DOF Fetch robot using intricate language commands including negation, orientation specification, and distance constraints.

ICRA Conference 2019 Conference Paper

Fast Motion Planning for High-DOF Robot Systems Using Hierarchical System Identification

  • Biao Jia
  • Zherong Pan
  • Dinesh Manocha

We present an efficient algorithm for motion planning and control of a robot system with a high number of degrees-of-freedom (DOF). These systems include high-DOF soft robots and articulated robots interacting with a deformable environment. We present a novel technique to accelerate the evaluations of the forward dynamics function by storing the results of costly computations in a hierarchical adaptive grid. Furthermore, we exploit the underactuated properties of the robot systems and build the grid in a low-dimensional space. Our approach approximates the forward dynamics function with guaranteed error bounds and can be used in optimization-based motion planning and reinforcement-learning-based feedback control. We highlight the performance on two high-DOF robot systems: a line-actuated elastic robot arm and an underwater swimming robot. Compared to prior techniques based on exact dynamics evaluation, we observe one to two orders of magnitude improvement in performance.

IROS Conference 2019 Conference Paper

Generating Grasp Poses for a High-DOF Gripper Using Neural Networks

  • Min Liu 0019
  • Zherong Pan
  • Kai Xu 0004
  • Kanishka Ganguly
  • Dinesh Manocha

We present a learning-based method for representing grasp poses of a high-DOF hand using neural networks. Due to redundancy in such high-DOF grippers, there exists a large number of equally effective grasp poses for a given target object, making it difficult for the neural network to find consistent grasp poses. We resolve this ambiguity by generating an augmented dataset that covers many possible grasps for each target object and train our neural networks using a consistency loss function to identify a one-to-one mapping from objects to grasp poses. We further enhance the quality of neural-network-predicted grasp poses using a collision loss function to avoid penetrations. We use an object dataset that combines the BigBIRD Database, the KIT Database, the YCB Database, and the Grasp Dataset to show that our method can generate high-DOF grasp poses with higher accuracy than supervised learning baselines. The quality of the grasp poses is on par with the ground-truth poses in the dataset. In addition, our method is robust and can handle noisy object models such as those constructed from multi-view depth images, allowing our method to be implemented on a 25-DOF Shadow Hand hardware platform.

ICRA Conference 2019 Conference Paper

Pedestrian Dominance Modeling for Socially-Aware Robot Navigation

  • Tanmay Randhavane
  • Aniket Bera
  • Emily Kubin
  • Austin Wang
  • Kurt Gray
  • Dinesh Manocha

We present a Pedestrian Dominance Model (PDM) to identify the dominance characteristics of pedestrians for robot navigation. Through a perception study on a simulated dataset of pedestrians, PDM models the perceived dominance levels of pedestrians with varying motion behaviors corresponding to trajectory, speed, and personal space. At runtime, we use PDM to identify the dominance levels of pedestrians to facilitate socially-aware navigation for the robots. PDM can predict dominance levels from trajectories with ~85% accuracy. Prior studies in the psychology literature indicate that when interacting with humans, people are more comfortable around those who exhibit complementary movement behaviors. Our algorithm leverages this by enabling the robots to exhibit complementary responses to pedestrian dominance. We also present an application of PDM for generating dominance-based collision-avoidance behaviors in the navigation of autonomous vehicles among pedestrians. We demonstrate the benefits of our algorithm for robots navigating among tens of pedestrians in simulated environments.

AAAI Conference 2019 Conference Paper

TrafficPredict: Trajectory Prediction for Heterogeneous Traffic-Agents

  • Yuexin Ma
  • Xinge Zhu
  • Sibo Zhang
  • Ruigang Yang
  • Wenping Wang
  • Dinesh Manocha

To safely and efficiently navigate in complex urban traffic, autonomous vehicles must make responsible predictions in relation to surrounding traffic-agents (vehicles, bicycles, pedestrians, etc.). A challenging and critical task is to explore the movement patterns of different traffic-agents and predict their future trajectories accurately to help the autonomous vehicle make reasonable navigation decisions. To solve this problem, we propose a long short-term memory-based (LSTM-based) realtime traffic prediction algorithm, TrafficPredict. Our approach uses an instance layer to learn instances’ movements and interactions and has a category layer to learn the similarities of instances belonging to the same type to refine the prediction. In order to evaluate its performance, we collected trajectory datasets in a large city consisting of varying conditions and traffic densities. The dataset includes many challenging scenarios where vehicles, bicycles, and pedestrians move among one another. We evaluate the performance of TrafficPredict on our new dataset and highlight its higher accuracy for trajectory prediction by comparing with prior prediction methods.

ICRA Conference 2019 Conference Paper

Transferring Grasp Configurations using Active Learning and Local Replanning

  • Hao Tian 0003
  • Changbo Wang
  • Dinesh Manocha
  • Xinyu Zhang 0002

We present a new approach to transfer grasp configurations from prior example objects to novel objects. We assume the novel and example objects have the same topology and similar shapes. We perform 3D segmentation on these objects using geometric and semantic shape characteristics. We compute a grasp space for each part of the example object using active learning. We build a bijective contact mapping between these model parts and compute the corresponding grasps for novel objects. Finally, we assemble the individual parts and use local replanning to adjust grasp configurations while maintaining stability and satisfying physical constraints. Our approach is general and can handle all kinds of objects represented using meshes or point clouds, as well as a variety of robotic hands.

IROS Conference 2019 Conference Paper

TZC: Efficient Inter-Process Communication for Robotics Middleware with Partial Serialization

  • Yu-Ping Wang 0001
  • Wende Tan
  • Xu-Qiang Hu
  • Dinesh Manocha
  • Shi-Min Hu 0001

Inter-process communication (IPC) is one of the core functions of modern robotics middleware. We propose an efficient IPC technique called TZC (Towards Zero-Copy). As a core component of TZC, we design a novel algorithm called partial serialization. Our formulation can generate messages that can be divided into two parts. During message transmission, one part is transmitted through a socket and the other part uses shared memory. The part within shared memory is never copied or serialized during its lifetime. We have integrated TZC with ROS and ROS2 and find that TZC can be easily combined with current open-source platforms. By using TZC, the overhead of IPC remains constant when the message size grows. In particular, when the message size is 4MB (less than the size of a full HD image), TZC can reduce the overhead of ROS IPC from tens of milliseconds to hundreds of microseconds and can reduce the overhead of ROS2 IPC from hundreds of milliseconds to less than 1 millisecond. We also demonstrate the benefits of TZC by integrating it with TurtleBot2 to be used in autonomous driving scenarios. We show that by using TZC, the braking distance can be 16% shorter than with ROS.
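The partial-serialization idea described in the abstract can be sketched in a few lines. This is a minimal illustration only, not the TZC implementation: the `publish`/`subscribe` helper names and the pickled control message are assumptions of this sketch, which simply shows a small header traveling through a control channel while the bulk payload sits untouched in shared memory.

```python
import pickle
from multiprocessing import shared_memory

def publish(header, payload):
    """Place the bulk payload in shared memory once; only the small
    pickled control message would travel over a socket (TZC-style
    partial serialization, illustrative only)."""
    shm = shared_memory.SharedMemory(create=True, size=len(payload))
    shm.buf[:len(payload)] = payload            # one-time placement
    control = pickle.dumps({"header": header,
                            "shm_name": shm.name,
                            "size": len(payload)})
    return control, shm

def subscribe(control):
    """Reattach to the shared block by name; the payload itself is
    never copied or re-serialized."""
    meta = pickle.loads(control)
    shm = shared_memory.SharedMemory(name=meta["shm_name"])
    view = shm.buf[:meta["size"]]               # zero-copy view
    return meta["header"], view, shm
```

Because only the control message scales with the number of fields rather than the payload size, the transfer cost stays roughly constant as the message grows, which is the behavior the paper reports.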

AAMAS Conference 2018 Conference Paper

ACMICS: An Agent Communication Model for Interacting Crowd Simulation

  • Kurtulus Kullu
  • Uğur Güdükbay
  • Dinesh Manocha

We present and evaluate a novel approach to simulate communication between agents. Our approach distinguishes between low- and high-level communication tasks; this separation makes it easy to extend and use in new scenarios. We highlight the benefits of our approach using different simulation scenarios consisting of hundreds of agents. We also model evacuation behavior in unknown environments and highlight the benefits of our approach particularly in simulating such behavior.

AAMAS Conference 2018 Conference Paper

Efficient Reciprocal Collision Avoidance between Heterogeneous Agents Using CTMAT

  • Yuexin Ma
  • Dinesh Manocha
  • Wenping Wang

We present a novel algorithm for reciprocal collision avoidance between heterogeneous agents of different shapes and sizes. We introduce CTMAT, a novel representation based on the medial axis transform that computes a tight-fitting bounding shape for each agent. Each CTMAT is represented using tuples, which are composed of circular arcs and line segments. Based on the reciprocal velocity obstacle formulation, we reduce the problem to solving a low-dimensional linear program between each pair of tuples belonging to adjacent agents. We precompute the Minkowski Sums of tuples to accelerate the runtime performance. Finally, we provide an efficient method to update the orientation of each agent in a local manner. We have implemented the algorithm and highlight its performance on benchmarks corresponding to road traffic scenarios and different vehicles. The overall runtime performance is comparable to prior multi-agent collision avoidance algorithms that use circular or elliptical agents. Our approach is less conservative and results in fewer false collisions.
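As a point of reference for the velocity-obstacle machinery this line of work builds on, here is a minimal collision-time test for two circular agents. This is an illustrative baseline under the simplest bounding shape, not the paper's algorithm: CTMAT replaces the discs with arc/segment tuples and solves a linear program per tuple pair.

```python
import math

def collides_within_horizon(p_a, v_a, r_a, p_b, v_b, r_b, tau):
    """Velocity-obstacle test for two circular agents: returns True if,
    at constant velocities, the discs overlap at some t in [0, tau]."""
    px, py = p_b[0] - p_a[0], p_b[1] - p_a[1]   # relative position
    vx, vy = v_b[0] - v_a[0], v_b[1] - v_a[1]   # relative velocity
    R = r_a + r_b                                # combined radius
    # Solve |p + v t|^2 = R^2 for the earliest root.
    a = vx * vx + vy * vy
    b = 2.0 * (px * vx + py * vy)
    c = px * px + py * py - R * R
    if c <= 0.0:
        return True                              # already overlapping
    if a == 0.0:
        return False                             # no relative motion
    disc = b * b - 4.0 * a * c
    if disc < 0.0:
        return False                             # paths never meet
    t_hit = (-b - math.sqrt(disc)) / (2.0 * a)
    return 0.0 <= t_hit <= tau
```

A collision-avoidance scheme forbids any relative velocity for which this test returns True; tighter bounding shapes such as CTMAT shrink the forbidden region and so are less conservative.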

IROS Conference 2018 Conference Paper

Identifying Driver Behaviors Using Trajectory Features for Vehicle Navigation

  • Ernest Cheung
  • Aniket Bera
  • Emily Kubin
  • Kurt Gray
  • Dinesh Manocha

We present a novel approach to automatically identify driver behaviors from vehicle trajectories and use them for safe navigation of autonomous vehicles. We propose a novel set of features that can be easily extracted from car trajectories. We derive a data-driven mapping between these features and six driver behaviors using an elaborate web-based user study. We also compute a summarized score indicating a level of awareness that is needed while driving next to other vehicles. We also incorporate our algorithm into a vehicle navigation simulation system and demonstrate its benefits in terms of safer realtime navigation, while driving next to aggressive or dangerous drivers.

ICRA Conference 2018 Conference Paper

Manipulating Highly Deformable Materials Using a Visual Feedback Dictionary

  • Biao Jia
  • Zhe Hu
  • Jia Pan 0001
  • Dinesh Manocha

The complex physical properties of highly deformable materials such as clothes pose significant challenges for autonomous robotic manipulation systems. We present a novel visual feedback dictionary-based method for manipulating deformable objects towards a desired configuration. Our approach is based on visual servoing and we use an efficient technique to extract key features from the RGB sensor stream in the form of a histogram of deformable model features. These histogram features serve as high-level representations of the state of the deformable material. Next, we collect manipulation data and use a visual feedback dictionary that maps the velocity in the high-dimensional feature space to the velocity of the robotic end-effectors for manipulation. We have evaluated our approach on a set of complex manipulation tasks and human-robot manipulation tasks on different cloth pieces with varying material characteristics.

AAAI Conference 2018 Conference Paper

MixedPeds: Pedestrian Detection in Unannotated Videos Using Synthetically Generated Human-Agents for Training

  • Ernest Cheung
  • Anson Wong
  • Aniket Bera
  • Dinesh Manocha

We present a new method for training pedestrian detectors on an unannotated set of images. We produce a mixed reality dataset that is composed of real-world background images and synthetically generated static human-agents. Our approach is general, robust, and makes few assumptions about the unannotated dataset. We automatically extract from the dataset: i) the vanishing point to calibrate the virtual camera, and ii) the pedestrians’ scales to generate a Spawn Probability Map, which is a novel concept that guides our algorithm to place the pedestrians at appropriate locations. After putting synthetic human-agents in the unannotated images, we use these augmented images to train a Pedestrian Detector, with the annotations generated along with the synthetic agents. We conducted our experiments using Faster R-CNN by comparing the detection results on the unannotated dataset performed by the detector trained using our approach and detectors trained with other manually labeled datasets. We showed that our approach improves the average precision by 5-13% over these detectors.

IROS Conference 2018 Conference Paper

Position-Based Time-Integrator for Frictional Articulated Body Dynamics

  • Zherong Pan
  • Dinesh Manocha

We present a new time-integrator for modeling the frictional dynamics of articulated bodies. Our formulation represents the configuration of the articulated body using position variables and then uses those variables to model the friction forces between the articulated body and the environment. Our approach corresponds to a Newton-type optimization scheme that is guaranteed to converge, making it stable with large timestep sizes. We evaluate the accuracy and stability of our time-integrator by comparing it with conventional formulations based on the Newton-Euler equations and demonstrate the benefits on standard controller-optimization applications. We achieve a 3-5 times speedup over a Newton-Euler-based simulator on a CPU. Our approach can be easily parallelized on a GPU, which results in an additional 4-15 times performance improvement.

ICRA Conference 2018 Conference Paper

Realtime Planning for High-DOF Deformable Bodies Using Two-Stage Learning

  • Zherong Pan
  • Dinesh Manocha

We present a method for planning the motion of arbitrarily-shaped volumetric deformable bodies or robots through complex environments. Such robots have very high-dimensional configuration spaces and we compute trajectories that satisfy the dynamics constraints using a two-stage learning method. First, we train a multitask controller parameterized using dynamic movement primitives (DMP), which encodes various locomotion or movement skills. Next, we train a neural-network controller to select the DMP task to navigate the robot through environments while avoiding obstacles. By combining the finite element method (FEM), model reduction, and contact invariant optimization (CIO), the DMP controller's parameters can be optimized efficiently using a gradient-based method, while the neural-network's parameters are optimized using Deep Q-Learning (DQL). This two-stage learning algorithm also allows us to reuse the trained DMP controller for different navigation tasks, such as moving through different environmental types and to different goal positions. Our results show that the learned motion planner can navigate swimming and walking deformable robots with thousands of DOFs at realtime.

ICRA Conference 2018 Conference Paper

Reflection-Aware Sound Source Localization

  • Inkyu An
  • Myung-Bae Son
  • Dinesh Manocha
  • Sung-Eui Yoon

We present a novel, reflection-aware method for 3D sound localization in indoor environments. Unlike prior approaches, which are mainly based on continuous sound signals from a stationary source, our formulation is designed to localize the position instantaneously from signals within a single frame. We consider direct sound and indirect sound signals that reach the microphones after reflecting off surfaces such as ceilings or walls. We then generate and trace direct and reflected acoustic paths using inverse acoustic ray tracing and utilize these paths with Monte Carlo localization to estimate a 3D sound source position. We have implemented our method on a robot with a cube-shaped microphone array and tested it against different settings with continuous and intermittent sound signals from a stationary or a mobile source. Across different settings, our approach can localize the sound with an average distance error of 0.8 m, tested in a 7 m by 7 m room with a 3 m ceiling, including a mobile and non-line-of-sight sound source. We also show that modeling indirect rays increases the localization accuracy by 40% compared to using only direct acoustic rays.
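The Monte Carlo localization step can be illustrated with a deliberately simplified particle filter over 2D source positions. This is an assumption-laden sketch: real microphone arrays, reflections, and inverse ray tracing are omitted, and the Gaussian weighting and jitter constants are arbitrary choices for the illustration.

```python
import math
import random

def monte_carlo_localize(mics, measured_dists, n=2000, iters=30, seed=0):
    """Toy Monte Carlo localization: particles guess the 2D source
    position, are weighted by how well mic-to-source distances match
    the measurements, then resampled with small jitter."""
    rng = random.Random(seed)
    particles = [(rng.uniform(0, 7), rng.uniform(0, 7)) for _ in range(n)]
    for _ in range(iters):
        weights = []
        for (x, y) in particles:
            err = sum((math.hypot(x - mx, y - my) - d) ** 2
                      for (mx, my), d in zip(mics, measured_dists))
            weights.append(math.exp(-err / 0.1))
        # Resample proportionally to weight, then perturb slightly.
        chosen = rng.choices(particles, weights=weights, k=n)
        particles = [(x + rng.gauss(0, 0.05), y + rng.gauss(0, 0.05))
                     for (x, y) in chosen]
    xs, ys = zip(*particles)
    return sum(xs) / n, sum(ys) / n
```

In the paper, the "measurements" are not raw distances but traced direct and reflected acoustic paths, which is what lets the filter handle non-line-of-sight sources.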

IROS Conference 2018 Conference Paper

The Socially Invisible Robot: Navigation in the Social World Using Robot Entitativity

  • Aniket Bera
  • Tanmay Randhavane
  • Emily Kubin
  • Austin Wang
  • Kurt Gray
  • Dinesh Manocha

We present a real-time, data-driven algorithm to enhance the social invisibility of robots within crowds. Our approach is based on prior psychological research, which reveals that people notice, and importantly react negatively to, groups of social actors when they have high entitativity, moving in a tight group with similar appearances and trajectories. In order to evaluate that behavior, we performed a user study to develop navigational algorithms that minimize entitativity. This study establishes a mapping between emotional reactions and multi-robot trajectories and appearances, and further generalizes the finding across various environmental conditions. We demonstrate the applicability of our entitativity modeling for trajectory computation for active surveillance and dynamic intervention in simulated robot-human interaction scenarios. Our approach empirically shows that robots with various levels of entitativity can be used to both avoid and influence pedestrians without eliciting strong emotional reactions, giving multi-robot systems social invisibility.

JAAMAS Journal 2017 Journal Article

ACMICS: an agent communication model for interacting crowd simulation

  • Kurtulus Kullu
  • Uğur Güdükbay
  • Dinesh Manocha

Behavioral plausibility is one of the major aims of crowd simulation research. We present a novel approach that simulates communication between agents and assess its influence on overall crowd behavior. Our formulation uses a communication model that approximates human-like communication capabilities. The underlying formulation is based on a message structure that corresponds to a simplified version of the Foundation for Intelligent Physical Agents Agent Communication Language Message Structure Specification. Our algorithm distinguishes between low- and high-level communication tasks so that ACMICS can be easily extended and employed in new simulation scenarios. We highlight the performance of our communication model on different crowd simulation scenarios. We also extend our approach to model evacuation behavior in unknown environments. Overall, our communication model has a small runtime overhead and can be used for interactive simulation with tens or hundreds of agents.

IJCAI Conference 2017 Conference Paper

Aggressive, Tense or Shy? Identifying Personality Traits from Crowd Videos

  • Aniket Bera
  • Tanmay Randhavane
  • Dinesh Manocha

We present a real-time algorithm to automatically classify the behavior or personality of a pedestrian based on his or her movements in a crowd video. Our classification criterion is based on Personality Trait theory. We present a statistical scheme that dynamically learns the behavior of every pedestrian and computes its motion model. This model is combined with global crowd characteristics to compute the movement patterns and motion dynamics and use them for crowd prediction. Our learning scheme is general and we highlight its performance in identifying the personality of different pedestrians in low and high density crowd videos. We also evaluate the accuracy by comparing the results with a user study.

IROS Conference 2017 Conference Paper

AutonoVi: Autonomous vehicle planning with dynamic maneuvers and traffic constraints

  • Andrew Best
  • Sahil Narang
  • Daniel Barber
  • Dinesh Manocha

We present AutonoVi, a novel algorithm for autonomous vehicle navigation that supports dynamic maneuvers and integrates traffic constraints and norms. Our approach is based on optimization-based maneuver planning that supports dynamic lane-changes, swerving, and braking in all traffic scenarios and guides the vehicle to its goal position. We take into account various traffic constraints, including collision avoidance with other vehicles, pedestrians, and cyclists using control velocity obstacles. We use a data-driven approach to model the vehicle dynamics for control and collision avoidance. Furthermore, our trajectory computation algorithm takes into account traffic rules and behaviors, such as stopping at intersections and stoplights, based on an arc-spline representation. We have evaluated our algorithm in a simulated environment and tested its interactive performance in urban and highway driving scenarios with tens of vehicles, pedestrians, and cyclists. These scenarios include jaywalking pedestrians, sudden stops from high speeds, safely passing cyclists, a vehicle suddenly swerving into the roadway, and high-density traffic where the vehicle must change lanes to progress more effectively.

ICRA Conference 2017 Conference Paper

Efficient multi-agent global navigation using interpolating bridges

  • Liang He 0008
  • Jia Pan 0001
  • Dinesh Manocha

We present a novel approach for collision-free global navigation for continuous-time multi-agent systems with general linear dynamics. Our approach is general and can be used to perform collision-free navigation in 2D and 3D workspaces with narrow passages and crowded regions. As part of pre-computation, we compute multiple bridges in the narrow or tight regions in the workspace using kinodynamic RRT algorithms. Our bridge has certain geometric properties that enable us to calculate a collision-free trajectory for each agent using simple interpolation at runtime. Moreover, we combine interpolated bridge trajectories with local multi-agent navigation algorithms to compute global collision-free paths for each agent. The overall approach combines the performance benefits of coupled multi-agent algorithms with the precomputed trajectories of the bridges to handle challenging scenarios. In practice, our approach can perform global navigation for tens to hundreds of agents on a single CPU core in 2D and 3D workspaces.

ICRA Conference 2017 Conference Paper

Efficient probabilistic collision detection for non-convex shapes

  • Jae Sung Park
  • Chonhyon Park
  • Dinesh Manocha

We present new algorithms to perform fast probabilistic collision queries between convex as well as non-convex objects. Our approach is applicable to general shapes, where one or more objects are represented using Gaussian probability distributions. We present a fast new algorithm for a pair of convex objects, and extend the approach to non-convex models using hierarchical representations. We highlight the performance of our algorithms with various convex and non-convex shapes on complex synthetic benchmarks and trajectory planning benchmarks for a 7-DOF Fetch robot arm.

IROS Conference 2017 Conference Paper

Feedback motion planning for liquid pouring using supervised learning

  • Zherong Pan
  • Dinesh Manocha

We present a novel motion planning algorithm for pouring a liquid body from a source to a target container. Our approach uses a receding-horizon optimization strategy that considers liquid dynamics and various other constraints. To handle liquid dynamics without costly fluid simulations, we use a neural network to infer a set of key liquid-related parameters from the observation of the current liquid configuration. To train the neural network, we generate a dataset of successful pouring examples using stochastic optimization in a problem-specific search space. These parameters are then used in the objective function for trajectory optimization. Our feedback motion planner achieves real-time performance, and we observe a high success rate in our simulated 2D and 3D liquid pouring benchmarks.

IROS Conference 2017 Conference Paper

Multi-contact frictional rigid dynamics using impulse decomposition

  • Sheng Li 0008
  • Tianxiang Zhang
  • Guoping Wang
  • Hanqiu Sun
  • Dinesh Manocha

We present an interactive and stable multi-contact dynamic simulation algorithm for rigid bodies. Our approach is based on fast frictional dynamics (FFD) [14], which is designed for large sets of non-convex rigid bodies. We use a new friction model that performs velocity-level multi-contact simulation using impulse decomposition. Moreover, we accurately handle friction at each contact point using contact distribution and frictional impulse solvers, which also account for relative motion. We evaluate our algorithm's performance on many complex multi-body benchmarks with thousands of contacts. In practice, our dynamics simulation algorithm takes a few milliseconds per timestep and exhibits more stable behaviors.

IROS Conference 2017 Conference Paper

PRVO: Probabilistic Reciprocal Velocity Obstacle for multi robot navigation under uncertainty

  • Bharath Gopalakrishnan
  • Arun Kumar Singh 0001
  • Meha Kaushik
  • K. Madhava Krishna
  • Dinesh Manocha

We present PRVO, a probabilistic variant of the Reciprocal Velocity Obstacle (RVO) for decentralized multi-robot navigation under uncertainty. PRVO characterizes the space of velocities that would allow each robot to fulfill its share in collision avoidance with a specified probability. PRVO is modeled as chance constraints over the velocity-level constraints defined by RVO and takes into account the uncertainty associated with both the state estimation and the actuation of each robot. Since chance constraints are in general computationally intractable, we propose a series of reformulations which, when combined with time-scaling concepts, lead to a closed-form characterization of the solution space of PRVO for a given probability of collision avoidance. We validate our formulation through numerical simulations in which we highlight the advantages of PRVO over related existing formulations.

IROS Conference 2017 Conference Paper

SocioSense: Robot navigation amongst pedestrians with social and psychological constraints

  • Aniket Bera
  • Tanmay Randhavane
  • Rohan Prinja
  • Dinesh Manocha

We present a real-time algorithm, SocioSense, for socially-aware navigation of a robot amongst pedestrians. Our approach computes time-varying behaviors of each pedestrian using Bayesian learning and Personality Trait theory. These psychological characteristics are used for long-term path prediction and generating proxemic characteristics for each pedestrian. We combine these psychological constraints with social constraints to perform human-aware robot navigation in low- to medium-density crowds. The estimation of time-varying behaviors and pedestrian personalities can improve the performance of long-term path prediction by 21%, as compared to prior interactive path prediction algorithms. We also demonstrate the benefits of our socially-aware navigation in simulated environments with tens of pedestrians.

ICRA Conference 2016 Conference Paper

GLMP - realtime pedestrian path prediction using global and local movement patterns

  • Aniket Bera
  • Sujeong Kim
  • Tanmay Randhavane
  • Srihari Pratapa
  • Dinesh Manocha

We present a novel real-time algorithm to predict the path of pedestrians in cluttered environments. Our approach makes no assumption about pedestrian motion or crowd density, and is useful for short-term as well as long-term prediction. We interactively learn the characteristics of pedestrian motion and movement patterns from 2D trajectories using Bayesian inference. These include local movement patterns corresponding to the current and preferred velocities and global characteristics such as entry points and movement features. Our approach involves no precomputation and we demonstrate the real-time performance of our prediction algorithm on sparse and noisy trajectory data extracted from dense indoor and outdoor crowd videos. The combination of local and global movement patterns can improve the accuracy of long-term prediction by 12–18% over prior methods in high-density videos.

IROS Conference 2016 Conference Paper

Motion planning for fluid manipulation using simplified dynamics

  • Zherong Pan
  • Dinesh Manocha

We present an optimization-based motion planning algorithm to compute a smooth, collision-free trajectory for a manipulator used to transfer a liquid from a source to a target container. We take into account fluid dynamics constraints as part of the trajectory computation. In order to avoid the high complexity of exact fluid simulation, we introduce a simplified dynamics model based on physically inspired approximations and system identification. Our optimization approach can incorporate various other constraints such as collision avoidance with obstacles, kinematic and dynamics constraints of the manipulator, and fluid dynamics characteristics. We demonstrate the performance of our planner on different benchmarks corresponding to various obstacles and container shapes. We also evaluate its accuracy by validating the motion plan using an accurate but computationally costly Navier-Stokes fluid simulation.

ICRA Conference 2016 Conference Paper

Proxemic group behaviors using reciprocal multi-agent navigation

  • Liang He 0008
  • Jia Pan 0001
  • Wenping Wang 0001
  • Dinesh Manocha

We present a decentralized algorithm for group-based coherent and reciprocal multi-agent navigation. In addition to generating collision-free trajectories for each agent, our approach is able to simulate macroscopic group movements and proxemic behaviors that result in coherent navigation. Our approach is general, makes no assumptions about the size or shape of the group, and can generate smooth trajectories for the agents. Furthermore, it can dynamically adapt to obstacles or the behavior of other agents. The additional overhead of generating proxemic group behaviors is relatively small and our approach can simulate hundreds of agents in real-time. We highlight its benefits on different benchmarks.

ICRA Conference 2016 Conference Paper

Real-time reciprocal collision avoidance with elliptical agents

  • Andrew Best
  • Sahil Narang
  • Dinesh Manocha

We present a novel algorithm for real-time collision-free navigation between elliptical agents. Each robot or agent is represented using a tight-fitting 2D ellipse in the plane. We extend the reciprocal velocity obstacle formulation by using conservative linear approximations of ellipses and derive sufficient conditions for collision-free motion based on low-dimensional linear programming. We use precomputed Minkowski Sum approximations for real-time and conservative collision avoidance in large multi-agent environments. Finally, we present efficient techniques to update the orientation to compute collision-free trajectories. Our algorithm can handle thousands of elliptical agents in real-time on a single core and provides significant speedups over prior algorithms for elliptical agents. We compare the runtime performance and behavior with circular agents on different benchmarks.

ICAPS Conference 2016 Conference Paper

Robot Motion Planning for Pouring Liquids

  • Zherong Pan
  • Chonhyon Park
  • Dinesh Manocha

We present a new algorithm to compute a collision-free trajectory for a robot manipulator to pour liquid from one container to the other. Our formulation uses a physical fluid model to predict the liquid's highly deformable motion. We present a simulation-guided and optimization-based method to automatically compute the transfer trajectory. Instead of abstract or simplified liquid models, we use the full-featured and accurate Navier-Stokes model, which provides fine-grained information about the velocity distribution inside the liquid body. Moreover, this information is used as an additional guiding energy term for the planner. One of our key contributions is the tight integration between the fine-grained fluid simulator, the liquid transfer controller, and the optimization-based planner. We have implemented the method using a hybrid particle-mesh fluid simulator (FLIP) and demonstrated its performance on four benchmarks with different cup shapes and viscosity coefficients.

IROS Conference 2015 Conference Paper

Hybrid penetration depth computation using local projection and machine learning

  • Yeojin Kim
  • Dinesh Manocha
  • Young J. Kim

We present a new hybrid approach to computing penetration depth (PD) for general polygonal models. Our approach exploits both local and global approaches to PD computation and can compute error-bounded PD approximations for both deep and shallow penetrations. We use a two-step formulation: the first step corresponds to a global approximation approach that samples the configuration space with bounded error using support vector machines; the second step corresponds to a local optimization that performs a projection operation refining the penetration depth. We have implemented this hybrid algorithm on a standard PC platform and tested its performance with various benchmarks. The experimental results show that our algorithm offers significant benefits over previously developed local-only and global-only methods used to compute the PD.

ICRA Conference 2015 Conference Paper

REACH - Realtime crowd tracking using a hybrid motion model

  • Aniket Bera
  • Dinesh Manocha

We present a novel, real-time algorithm to extract the trajectory of each pedestrian in moderately dense crowd videos. In order to improve the tracking accuracy, we use a hybrid motion model that combines discrete and continuous flow models. The discrete model is based on microscopic agent formulation and is used for local navigation, interaction, and collision avoidance. The continuum model accounts for macroscopic behaviors, including crowd orientation and flow. We use our hybrid model with particle filters to compute the trajectories at interactive rates. We demonstrate its performance in moderately-dense crowd videos with tens of pedestrians and highlight the improved accuracy on different datasets.

ICRA Conference 2014 Conference Paper

AdaPT: Real-time adaptive pedestrian tracking for crowded scenes

  • Aniket Bera
  • Nico Galoppo
  • Dillon Sharlet
  • Adam T. Lake
  • Dinesh Manocha

We present a novel realtime algorithm to compute the trajectory of each pedestrian in a crowded scene. Our formulation is based on an adaptive scheme that uses a combination of deterministic and probabilistic trackers to achieve high accuracy and efficiency simultaneously. Furthermore, we integrate it with a multi-agent motion model and local interaction scheme to accurately compute the trajectory of each pedestrian. We highlight the performance and benefits of our algorithm on well-known datasets with tens of pedestrians.

ICRA Conference 2014 Conference Paper

Poisson-RRT

  • Chonhyon Park
  • Jia Pan 0001
  • Dinesh Manocha

We present an RRT-based motion planning algorithm that uses the maximal Poisson-disk sampling scheme. Our approach exploits the free-disk property of the maximal Poisson-disk samples to generate nodes and perform tree expansion. Furthermore, we use an adaptive scheme to generate more samples in challenging regions of the configuration space. Our approach can be easily parallelized on multi-core CPUs and many-core GPUs. We highlight the performance of our algorithm on different benchmarks.
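The free-disk property the abstract relies on can be illustrated with naive dart throwing. This is only a sketch: Poisson-RRT uses precomputed *maximal* Poisson-disk sample sets, which dart throwing merely approximates, and the function name and parameters here are assumptions of the illustration.

```python
import math
import random

def poisson_disk_darts(width, height, r, tries=3000, seed=0):
    """Dart-throwing approximation of Poisson-disk sampling: accept a
    uniformly random point only if it keeps distance >= r to every
    previously accepted point."""
    rng = random.Random(seed)
    pts = []
    for _ in range(tries):
        p = (rng.uniform(0, width), rng.uniform(0, height))
        # The free-disk guarantee: no two samples closer than r.
        if all(math.hypot(p[0] - q[0], p[1] - q[1]) >= r for q in pts):
            pts.append(p)
    return pts
```

Every accepted sample owns a collision-free disk of radius r/2, which is the geometric property the planner exploits for node generation and tree expansion.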

AAMAS Conference 2013 Conference Paper

Goal Velocity Obstacles for Spatial Navigation of Multiple Virtual Agents

  • Jamie Snape
  • Dinesh Manocha

We present the goal velocity obstacle for the spatial navigation of multiple virtual agents to planar goal regions in the two-dimensional workspace. Our approach uses velocity obstacles not only to compute collision-avoiding velocities with respect to other agents, but also to specify velocities that will direct an agent toward its spatial goal region. We demonstrate shorter path lengths and fewer collisions, with only microseconds of additional computation per agent per time step, compared to velocity-based methods that optimize a single preferred velocity toward each agent's goal.

ICRA Conference 2013 Conference Paper

Real-time collision detection and distance computation on point cloud sensor data

  • Jia Pan 0001
  • Ioan Alexandru Sucan
  • Sachin Chitta
  • Dinesh Manocha

Most prior techniques for proximity computations are designed for synthetic models and assume exact geometric representations. However, real robots construct representations of the environment using their sensors, and the generated representations are more cluttered and less precise than synthetic models. Furthermore, this sensor data is updated at high frequency. In this paper, we present new collision- and distance-query algorithms, which can efficiently handle large amounts of point cloud sensor data received at real-time rates. We present two novel techniques to accelerate the computation of broad-phase data structures: 1) we present a progressive technique that incrementally computes a high-quality dynamic AABB tree for fast culling, and 2) we directly use an octree representation of the point cloud data as a proximity data structure. We assign a probability value to each leaf node of the tree, and the algorithm computes the nodes corresponding to high collision probability. In practice, our new approaches can be an order of magnitude faster than previous methods. We demonstrate the performance of the new methods on both synthetic data and on sensor data collected using a Kinect™ for motion planning for a mobile manipulator robot.
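The octree-with-probabilities idea can be illustrated with a flat voxel map, a simplification of the paper's hierarchical structure. The function names, the per-point weights, and the max-probability box query are assumptions of this sketch, not the paper's API.

```python
def build_voxel_map(points, weights, voxel=0.1):
    """Bucket noisy sensor points into voxels (a flat stand-in for
    octree leaves), keeping the highest occupancy probability seen
    in each voxel."""
    grid = {}
    for (x, y, z), p in zip(points, weights):
        key = (int(x // voxel), int(y // voxel), int(z // voxel))
        grid[key] = max(grid.get(key, 0.0), p)
    return grid

def max_collision_probability(grid, lo, hi, voxel=0.1):
    """Broad-phase query: maximum occupancy probability over voxels
    whose centers lie inside the axis-aligned box [lo, hi]."""
    best = 0.0
    for (i, j, k), p in grid.items():
        c = ((i + 0.5) * voxel, (j + 0.5) * voxel, (k + 0.5) * voxel)
        if all(lo[d] <= c[d] <= hi[d] for d in range(3)):
            best = max(best, p)
    return best
```

A planner can then treat a query result above some threshold as a probable collision, which mirrors how the paper reports nodes with high collision probability instead of a binary answer.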

ICRA Conference 2013 Conference Paper

Real-time optimization-based planning in dynamic environments using GPUs

  • Chonhyon Park
  • Jia Pan 0001
  • Dinesh Manocha

We present a novel algorithm to compute collision-free trajectories in dynamic environments. Our approach is general and does not require a priori knowledge about the obstacles or their motion. We use a replanning framework that interleaves optimization-based planning with execution. Furthermore, we describe a parallel formulation that exploits a high number of cores on commodity graphics processors (GPUs) to compute a high-quality path in a given time interval. We derive bounds on how parallelization can improve the responsiveness of the planner and the quality of the trajectory.

ICRA Conference 2012 Conference Paper

FCL: A general purpose library for collision and proximity queries

  • Jia Pan 0001
  • Sachin Chitta
  • Dinesh Manocha

We present a new collision and proximity library that integrates several techniques for fast and accurate collision checking and proximity computation. Our library is based on hierarchical representations and designed to perform multiple proximity queries on different model representations. The set of queries includes discrete collision detection, continuous collision detection, separation distance computation and penetration depth estimation. The input models may correspond to triangulated rigid or deformable models and articulated models. Moreover, FCL can perform probabilistic collision checking between noisy point clouds that are captured using cameras or LIDAR sensors. The main benefit of FCL lies in the fact that it provides a unified interface that can be used by various applications. Furthermore, its flexible architecture makes it easier to implement new algorithms within this framework. The runtime performance of the library is comparable to state of the art collision and proximity algorithms. We demonstrate its performance on synthetic datasets as well as motion planning and grasping computations performed using a two-armed mobile manipulation robot.

ICAPS Conference 2012 Conference Paper

ITOMP: Incremental Trajectory Optimization for Real-Time Replanning in Dynamic Environments

  • Chonhyon Park
  • Jia Pan 0001
  • Dinesh Manocha

We present a novel optimization-based algorithm for motion planning in dynamic environments. Our approach uses a stochastic trajectory optimization framework to avoid collisions and satisfy smoothness and dynamics constraints. Our algorithm does not require a priori knowledge about global motion or trajectories of dynamic obstacles. Rather, we compute a conservative local bound on the position or trajectory of each obstacle over a short time and use the bound to compute a collision-free trajectory for the robot in an incremental manner. Moreover, we interleave planning and execution of the robot in an adaptive manner to balance between the planning horizon and responsiveness to obstacles. We highlight the performance of our planner in a simulated dynamic environment with the 7-DOF PR2 robot arm and dynamic obstacles.

ICRA Conference 2012 Conference Paper

LQG-obstacles: Feedback control with collision avoidance for mobile robots with motion and sensing uncertainty

  • Jur van den Berg
  • David Wilkie
  • Stephen J. Guy
  • Marc Niethammer
  • Dinesh Manocha

This paper presents LQG-Obstacles, a new concept that combines linear-quadratic feedback control of mobile robots with guaranteed avoidance of collisions with obstacles. Our approach generalizes the concept of Velocity Obstacles [3] to any robotic system with a linear Gaussian dynamics model. We integrate a Kalman filter for state estimation and an LQR feedback controller into a closed-loop dynamics model of which a higher-level control objective is the “control input”. We then define the LQG-Obstacle as the set of control objectives that result in a collision with high probability. Selecting a control objective outside the LQG-Obstacle then produces collision-free motion. We demonstrate the potential of LQG-Obstacles by safely and smoothly navigating a simulated quadrotor helicopter with complex non-linear dynamics and motion and sensing uncertainty through three-dimensional environments with obstacles and narrow passages.

ICRA Conference 2012 Conference Paper

Real-time footstep planning for humanoid robots among 3D obstacles using a hybrid bounding box

  • Nicolas Perrin-Gilbert
  • Olivier Stasse
  • Florent Lamiraux
  • Young J. Kim
  • Dinesh Manocha

In this paper we introduce a new bounding box method for footstep planning for humanoid robots. Similar to the classic bounding box method (which uses a single rectangular box to encompass the robot) it is computationally efficient, easy to implement and can be combined with any rigid body motion planning library. However, unlike the classic bounding box method, our method takes into account the stepping over capabilities of the robot, and generates precise leg trajectories to avoid obstacles on the ground. We demonstrate that this method is well suited for footstep planning in cluttered environments.

SoCS Conference 2012 Conference Paper

Real-Time Optimization-Based Planning in Dynamic Environments Using GPUs

  • Chonhyon Park
  • Jia Pan 0001
  • Dinesh Manocha

We present a novel algorithm to compute collision-free trajectories in dynamic environments. Our approach is general and makes no assumption about the obstacles or their motion. We use a replanning framework that interleaves optimization-based planning with execution. Furthermore, we describe a parallel formulation that exploits a high number of cores on commodity graphics processors (GPUs) to compute a high-quality path in a given time interval. Overall, we show that search in configuration spaces can be significantly accelerated by using GPU parallelism.

ICRA Conference 2011 Conference Paper

Reciprocal collision avoidance with acceleration-velocity obstacles

  • Jur van den Berg
  • Jamie Snape
  • Stephen J. Guy
  • Dinesh Manocha

We present an approach for collision avoidance for mobile robots that takes into account acceleration constraints. We discuss both the case of navigating a single robot among moving obstacles, and the case of multiple robots reciprocally avoiding collisions with each other while navigating a common workspace. Inspired by the concept of velocity obstacles [3], we introduce the acceleration-velocity obstacle (AVO) to let a robot avoid collisions with moving obstacles while obeying acceleration constraints. AVO characterizes the set of new velocities the robot can safely reach and adopt using proportional control of the acceleration. We extend this concept to reciprocal collision avoidance for multi-robot settings, by letting each robot take half of the responsibility of avoiding pairwise collisions. Our formulation guarantees collision-free navigation even as the robots act independently and simultaneously, without coordination. Our approach is designed for holonomic robots, but can also be applied to kinematically constrained non-holonomic robots such as cars. We have implemented our approach, and we show simulation results in challenging environments with large numbers of robots and obstacles.

AAAI Conference 2011 Conference Paper

Self-Aware Traffic Route Planning

  • David Wilkie
  • Jur van den Berg
  • Ming Lin
  • Dinesh Manocha

One of the most ubiquitous AI applications is vehicle route planning. While state-of-the-art systems take into account current traffic conditions or historic traffic data, current planning approaches ignore the impact of their own plans on future traffic conditions. We present a novel algorithm for self-aware route planning that uses the routes it plans for current vehicle traffic to more accurately predict future traffic conditions for subsequent cars. Our planner uses a roadmap with stochastic, time-varying traffic densities that are defined by a combination of historical data and the densities predicted by the planned routes for the cars ahead of the current traffic. We have applied our algorithm to large-scale traffic route planning, and demonstrated that our self-aware route planner can more accurately predict future traffic conditions, which results in a reduction of the travel time for those vehicles that use our algorithm.
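The feedback loop described above can be sketched in a heavily simplified form. The paper uses stochastic, time-varying densities; the sketch below replaces them with a static per-edge count of previously planned routes, inflating each edge's travel time accordingly, so later cars divert away from edges the planner has already loaded. The graph, cost model, and inflation factor are all illustrative assumptions.

```python
import heapq

def dijkstra(graph, src, dst, density):
    """Shortest path where an edge's travel time grows with the number of
    routes already planned on it (a static stand-in for the paper's
    stochastic, time-varying densities). Assumes dst is reachable."""
    dist, prev, pq = {src: 0.0}, {}, [(0.0, src)]
    while pq:
        d, u = heapq.heappop(pq)
        if u == dst:
            break
        if d > dist.get(u, float("inf")):
            continue
        for v, base in graph[u]:
            cost = base * (1.0 + density.get((u, v), 0))  # congestion inflation
            if d + cost < dist.get(v, float("inf")):
                dist[v] = d + cost
                prev[v] = u
                heapq.heappush(pq, (dist[v], v))
    path, node = [], dst
    while node != src:
        path.append(node)
        node = prev[node]
    path.append(src)
    return path[::-1]

def plan_fleet(graph, src, dst, n_cars):
    """Plan cars one at a time; each planned route feeds back into the
    densities seen by subsequent cars."""
    density, routes = {}, []
    for _ in range(n_cars):
        route = dijkstra(graph, src, dst, density)
        for u, v in zip(route, route[1:]):
            density[(u, v)] = density.get((u, v), 0) + 1
        routes.append(route)
    return routes
```

With two parallel routes, the first car takes the faster one and the predicted congestion it creates pushes the second car onto the alternative.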

ICRA Conference 2010 Conference Paper

A fast n-dimensional ray-shooting algorithm for grasping force optimization

  • Yu Zheng 0001
  • Ming Lin 0003
  • Dinesh Manocha

We present an efficient algorithm for solving the ray-shooting problem on high dimensional sets. Our algorithm computes the intersection of the boundary of a compact convex set with a ray emanating from an interior point of the set and represents the intersection point as a convex combination of a set of affinely independent points. We use our intersection algorithm to compute two types of optimal grasping forces, where either the sum or the maximum of normal force components is minimized. In our simulation, the algorithm converges well and performs the computations in tens of milliseconds on a laptop.
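The paper's solver represents the intersection point as a convex combination of affinely independent points; as a much simpler illustration of the ray-shooting problem itself, the sketch below finds the boundary crossing by bisection against a membership oracle, using the n-dimensional unit ball as a stand-in compact convex set. Convexity makes the crossing along the ray unique.

```python
def ray_shoot(inside, origin, direction, t_hi=1.0, tol=1e-9):
    """Find t* where origin + t*direction crosses the boundary of a compact
    convex set, given only a membership oracle `inside`."""
    # grow an upper bound until the ray point leaves the set
    while inside([o + t_hi * d for o, d in zip(origin, direction)]):
        t_hi *= 2.0
    t_lo = 0.0
    while t_hi - t_lo > tol:
        mid = 0.5 * (t_lo + t_hi)
        if inside([o + mid * d for o, d in zip(origin, direction)]):
            t_lo = mid
        else:
            t_hi = mid
    return 0.5 * (t_lo + t_hi)

# example: unit ball in R^4, ray shot from the interior point at the origin
unit_ball = lambda x: sum(c * c for c in x) <= 1.0
t_star = ray_shoot(unit_ball, [0.0] * 4, [1.0, 0.0, 0.0, 0.0])
```

In the grasping application the oracle would be membership in the wrench set; the paper's algorithm avoids this generic bisection and converges far faster.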

IROS Conference 2010 Conference Paper

A walking pattern generator for biped robots on uneven terrains

  • Yu Zheng 0001
  • Ming Lin 0003
  • Dinesh Manocha
  • Albertus H. Adiwahono
  • Chee-Meng Chew

We present a new method to generate biped walking patterns for biped robots on uneven terrains. Our formulation uses a universal stability criterion that checks whether the resultant of the gravity wrench and the inertia wrench of a robot lies in the convex cone of the wrenches resulting from contacts between the robot and the environment. We present an algorithm to compute the feasible acceleration of the robot's CoM (center of mass) and use that algorithm to generate biped walking patterns. Our approach is more general and applicable to uneven terrains as compared with prior methods based on the ZMP (zero-moment point) criterion. We highlight its applications on some benchmarks.

ICRA Conference 2010 Conference Paper

Continuous collision detection for non-rigid contact computations using local advancement

  • Min Tang 0004
  • Young J. Kim
  • Dinesh Manocha

We present a novel algorithm to perform continuous collision detection(CCD) between non-rigid, deformable models using local advancement. Given the initial and final configurations of a deformable model, our algorithm computes linear deformation by interpolating the vertices from the initial to the final configurations with a straight line path and checks for collision along that path. Our approach is applicable to polygon-soup models with arbitrary topology, handles self-collisions and makes no assumption about the underlying non-rigid motion. We accelerate the algorithm by computing motion bounds on the primitives and their bounding volumes. These bounds are combined with hierarchical culling techniques and used for fast collision checking. In practice, we have observed up to four times improvement in running time because of local advancement.

IROS Conference 2010 Conference Paper

Efficient nearest-neighbor computation for GPU-based motion planning

  • Jia Pan 0001
  • Christian Lauterbach
  • Dinesh Manocha

We present a novel k-nearest neighbor search (KNNS) algorithm for proximity computation in motion planning that exploits the computational capabilities of many-core GPUs. Our approach uses locality-sensitive hashing and cuckoo hashing to construct an efficient KNNS algorithm that has linear space and time complexity and exploits the multiple cores and data parallelism effectively. In practice, we see an order-of-magnitude improvement in speed and scalability over prior GPU-based KNNS algorithms. On some benchmarks, our KNNS algorithm improves the performance of the overall planner by 20-40 times for CPU-based planners and up to 2 times for GPU-based planners.
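A serial, CPU-only sketch of the locality-sensitive-hashing half of this idea (the paper's contribution is the GPU formulation with cuckoo hashing, which is not reproduced here): random-hyperplane signatures bucket nearby configurations together, and a query ranks only the candidates that share a bucket. Table and bit counts are illustrative.

```python
import random

def signature(p, planes):
    """Bit per hyperplane: which side of the plane the point falls on."""
    return tuple(int(sum(a * b for a, b in zip(p, h)) >= 0) for h in planes)

def make_tables(points, n_tables=8, n_bits=6, dim=3, seed=0):
    """Build several independent LSH hash tables over the point set."""
    rng = random.Random(seed)
    tables = []
    for _ in range(n_tables):
        planes = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_bits)]
        buckets = {}
        for idx, p in enumerate(points):
            buckets.setdefault(signature(p, planes), []).append(idx)
        tables.append((planes, buckets))
    return tables

def knn(query, points, tables, k=1):
    """Approximate k-NN: gather candidates from matching buckets across all
    tables, then rank the (hopefully small) candidate set exactly."""
    cand = set()
    for planes, buckets in tables:
        cand.update(buckets.get(signature(query, planes), []))
    dist2 = lambda i: sum((a - b) ** 2 for a, b in zip(points[i], query))
    return sorted(cand, key=dist2)[:k]
```

Each hash table is independent, which is what makes the GPU mapping in the paper natural: tables and buckets can be built and probed in parallel.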

AAAI Conference 2010 Conference Paper

g-Planner: Real-time Motion Planning and Global Navigation using GPUs

  • Jia Pan
  • Christian Lauterbach
  • Dinesh Manocha

We present novel randomized algorithms for solving global motion planning problems that exploit the computational capabilities of many-core GPUs. Our approach uses thread and data parallelism to achieve high performance for all components of sample-based algorithms, including random sampling, nearest neighbor computation, local planning, collision queries and graph search. This approach can efficiently solve both the multi-query and single-query versions of the problem and obtain considerable speedups over prior CPU-based algorithms. We demonstrate the efficiency of our algorithms by applying them to a number of 6DOF planning benchmarks in 3D environments. Overall, this is the first algorithm that can perform real-time motion planning and global navigation using commodity hardware.

AAMAS Conference 2010 Conference Paper

Independent Navigation of Multiple Robots and Virtual Agents

  • Jamie Snape
  • Stephen J. Guy
  • Jur van den Berg
  • Sean Curtis
  • Sachin Patil
  • Ming C. Lin
  • Dinesh Manocha

We demonstrate an approach for collision- and oscillation-free navigation of multiple robots or virtual agents amongst each other. Each entity acts independently and uses only the position and velocity of nearby entities to predict their future trajectories in order to avoid collisions. Entities take into account that the other entities are responding to them likewise to prevent oscillations.

AAMAS Conference 2010 Conference Paper

Modeling Collision Avoidance Behavior for Virtual Humans

  • Stephen J. Guy
  • Ming C. Lin
  • Dinesh Manocha

In this paper, we present a new trajectory planning algorithm for virtual humans. Our approach focuses on implicit cooperation between multiple virtual agents in order to share the work of avoiding collisions with each other. Specifically, we extend recent work on multi-robot planning to better model how humans avoid collisions by introducing new parameters that model human traits, such as reaction time and biomechanical limitations. We validate this new model based on data of real humans walking captured by the Locanthrope project. We also show how our model extends to complex scenarios with multiple agents interacting with each other and avoiding nearby obstacles.

ICRA Conference 2010 Conference Paper

Navigating multiple simple-airplanes in 3D workspace

  • Jamie Snape
  • Dinesh Manocha

We present an algorithm for collision-free navigation of multiple flying robots in three-dimensional workspace. Our approach extends the model of a simple car to a simple-airplane, which has constraints on speed and steering angle and includes a configuration variable for the altitude. We use a locally optimal reciprocal collision avoidance scheme that computes the trajectory without any collisions or oscillations for each airplane independently. In addition, our algorithm explicitly considers the kinematic and dynamic constraints of a simple-airplane and uses the notion of variable reciprocity when choosing velocities to ensure that simple-airplanes that are less constrained take more responsibility for avoiding collisions. We test our approach in two simulations and compute collision-free and oscillation-free trajectories that satisfy the kinematic and dynamic constraints of each simple-airplane.

ICRA Conference 2010 Conference Paper

Retraction-based RRT planner for articulated models

  • Jia Pan 0001
  • Liangjun Zhang
  • Dinesh Manocha

We present a new retraction algorithm for high DOF articulated models and use our algorithm to improve the performance of RRT planners in narrow passages. The retraction step is formulated as a constrained optimization problem and performs iterative refinement on the boundary of C-Obstacle space. We also combine the retraction algorithm with decomposition planners to handle very high DOF articulated models. The performance of our approach is analyzed using Voronoi diagrams and we show that our retraction algorithm provides a good approximation to the ideal RRT-extension in constrained environments. We have implemented our algorithm and tested its performance on robots with more than 40 DOFs in complex environments. In practice, we observe significant performance (2-80X) improvement over prior RRT planners on challenging scenarios with narrow passages.

IROS Conference 2010 Conference Paper

Smooth and collision-free navigation for multiple robots under differential-drive constraints

  • Jamie Snape
  • Jur van den Berg
  • Stephen J. Guy
  • Dinesh Manocha

We present a method for smooth and collision-free navigation for multiple independent robots under differential-drive constraints. Our algorithm is based on the optimal reciprocal collision avoidance formulation and guarantees both smoothness in the trajectories of the robots and locally collision-free paths. We provide proofs of these guarantees and demonstrate the effectiveness of our method in experimental scenarios using iRobot Create mobile robots navigating amongst each other.

ICRA Conference 2009 Conference Paper

C²A: Controlled conservative advancement for continuous collision detection of polygonal models

  • Min Tang 0004
  • Young J. Kim
  • Dinesh Manocha

We present a simple and fast algorithm to perform continuous collision detection between polygonal models undergoing rigid motion for interactive applications. Our approach can handle all triangulated models and makes no assumption about the underlying geometry and topology. The algorithm uses the notion of conservative advancement (CA), originally developed for convex polytopes. We extend this formulation to general models using swept sphere volume hierarchy and present a compact formulation to compute the motion bounds along with a novel controlling scheme. We have implemented the algorithm and highlight its performance on various benchmarks. In practice, our algorithm can perform continuous collision queries in a few milliseconds on models composed of tens of thousands of triangles.
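The bare CA recurrence that this paper extends to general models can be sketched in its simplest setting: two discs in linear relative motion, where the relative speed bounds how fast the closest distance can shrink, so time can safely advance by distance divided by that bound until contact. The hierarchy, motion bounds, and controlling scheme of the paper are not reproduced.

```python
import math

def conservative_advancement(p_rel, v_rel, r_sum, eps=1e-6, t_max=10.0):
    """Time of first contact between two discs with relative offset p_rel,
    relative velocity v_rel, and combined radius r_sum, via the CA loop
    t <- t + d(t) / mu with motion bound mu = |v_rel|."""
    mu = math.hypot(*v_rel)
    t = 0.0
    while t < t_max:
        px = p_rel[0] + t * v_rel[0]
        py = p_rel[1] + t * v_rel[1]
        d = math.hypot(px, py) - r_sum   # closest distance at time t
        if d < eps:
            return t                      # contact reached
        t += d / mu                       # safe step: no contact can occur sooner
    return None                           # no contact within [0, t_max]
```

For linear motion the loop lands on the exact contact time; for the rotating rigid bodies of the paper, the motion bound is computed per bounding-volume node and the step becomes conservative rather than exact.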

IROS Conference 2009 Conference Paper

Generalized velocity obstacles

  • David Wilkie
  • Jur van den Berg
  • Dinesh Manocha

We address the problem of real-time navigation in dynamic environments for car-like robots. We present an approach to identify controls that will lead to a collision with a moving obstacle at some point in the future. Our approach generalizes the concept of velocity obstacles, which have been used for navigation among dynamic obstacles, and takes into account the constraints of a car-like robot. We use this formulation to find controls that will allow collision free navigation in dynamic environments. Finally, we demonstrate the performance of our algorithm on a simulated car-like robot among moving obstacles.
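A minimal sketch of the generalization described above (not the authors' closed-form construction): a control is excluded from the generalized velocity obstacle if forward-simulating the car-like robot under that control, against the predicted obstacle motion, leads to a collision. The bicycle-model parameters, horizon, and candidate controls are illustrative assumptions.

```python
import math

def control_collides(steer, speed, obstacle, obs_vel, radius,
                     wheelbase=1.0, horizon=3.0, dt=0.05):
    """Forward-simulate a bicycle-model car under a fixed control and test
    whether it comes within `radius` of a moving disc obstacle."""
    x = y = heading = 0.0
    ox, oy = obstacle
    for _ in range(int(horizon / dt)):
        x += speed * math.cos(heading) * dt
        y += speed * math.sin(heading) * dt
        heading += speed * math.tan(steer) / wheelbase * dt
        ox += obs_vel[0] * dt
        oy += obs_vel[1] * dt
        if math.hypot(x - ox, y - oy) < radius:
            return True
    return False

def safe_controls(candidates, obstacle, obs_vel, radius):
    """The generalized-VO idea: keep only the steering controls whose
    simulated trajectories stay clear of the predicted obstacle."""
    return [u for u in candidates
            if not control_collides(u, 1.0, obstacle, obs_vel, radius)]
```

Driving straight at an obstacle dead ahead is excluded, while a firm turn survives the filter.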

ICRA Conference 2009 Conference Paper

Global vector field computation for feedback motion planning

  • Liangjun Zhang
  • Steven M. LaValle
  • Dinesh Manocha

We present a global vector field computation algorithm in configuration spaces for smooth feedback motion planning. Our algorithm performs approximate cell decomposition in the configuration space and approximates the free space using rectanguloid cells. We compute a smooth local vector field for each cell in the free space and address the issue of the smooth composition of the local vector fields between the non-uniform adjacent cells. We show that the integral curve over the computed vector field is guaranteed to converge to the goal configuration, be collision-free, and maintain C∞ smoothness. As compared to prior approaches, our algorithm works well on non-convex robots and obstacles. We demonstrate its performance on planar robots with 2 or 3 DOFs, articulated robots composed of 3 serial links and multi-robot systems with 6 DOFs.

IROS Conference 2009 Conference Paper

Independent navigation of multiple mobile robots with hybrid reciprocal velocity obstacles

  • Jamie Snape
  • Jur van den Berg
  • Stephen J. Guy
  • Dinesh Manocha

We present an approach for smooth and collision-free navigation of multiple mobile robots amongst each other. Each robot senses its surroundings and acts independently without central coordination or communication with other robots. Our approach uses both the current position and the velocity of other robots to predict their future trajectory in order to avoid collisions. Moreover, our approach is reciprocal and avoids oscillations by explicitly taking into account that the other robots also sense their surroundings and change their trajectories accordingly. We build on prior work related to velocity obstacles and reciprocal velocity obstacles and introduce the concept of hybrid reciprocal velocity obstacles for collision avoidance that takes into account the kinematics of the robots and uncertainty in sensor data. We apply our approach to a set of iRobot Create robots using centralized sensing and show natural, direct, and collision-free navigation in several challenging scenarios.

ICRA Conference 2009 Conference Paper

Multi-robot coordination using generalized social potential fields

  • Russell Gayle
  • William Moss
  • Ming Lin 0003
  • Dinesh Manocha

We present a novel approach to compute collision-free paths for multiple robots subject to local coordination constraints. More specifically, given a set of robots, their initial and final configurations, and possibly some additional coordination constraints, our goal is to compute a collision-free path between the initial and final configuration that maintains the constraints. To solve this problem, our approach generalizes the social potential field method to be applicable to both convex and nonconvex polyhedra. Social potential fields are then integrated into a “physics-based motion planning” framework which uses constrained dynamics to solve the motion planning problem. Our approach is able to plan for over 200 robots while averaging about 110 ms per step in a variety of environments.

ICRA Conference 2008 Conference Paper

An efficient retraction-based RRT planner

  • Liangjun Zhang
  • Dinesh Manocha

We present a novel optimization-based retraction algorithm to improve the performance of sample-based planners in narrow passages for 3D rigid robots. The retraction step is formulated as an optimization problem using an appropriate distance metric in the configuration space. Our algorithm computes samples near the boundary of C-obstacle using local contact analysis and uses those samples to improve the performance of RRT planners in narrow passages. We analyze the performance of our planner using Voronoi diagrams and show that the tree can grow closely towards any randomly generated sample. Our algorithm is general and applicable to all polygonal models. In practice, we observe significant speedups over prior RRT planners on challenging scenarios with narrow passages.

ICRA Conference 2008 Conference Paper

Reciprocal Velocity Obstacles for real-time multi-agent navigation

  • Jur van den Berg
  • Ming Lin 0003
  • Dinesh Manocha

In this paper, we propose a new concept — the ‘Reciprocal Velocity Obstacle’— for real-time multi-agent navigation. We consider the case in which each agent navigates independently without explicit communication with other agents. Our formulation is an extension of the Velocity Obstacle concept [3], which was introduced for navigation among (passively) moving obstacles. Our approach takes into account the reactive behavior of the other agents by implicitly assuming that the other agents make a similar collision-avoidance reasoning. We show that this method guarantees safe and oscillation-free motions for each of the agents. We apply our concept to navigation of hundreds of agents in densely populated environments containing both static and moving obstacles, and we show that real-time and scalable performance is achieved in such challenging scenarios.
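A compact sketch of the reciprocity idea for two disc agents (a sampling approximation, not the paper's geometric construction): a relative velocity is forbidden if it drives the discs into collision, and the reciprocal test shifts the velocity-obstacle apex so that a candidate velocity v is forbidden iff 2v - v_a - v_b would collide, splitting the avoidance effort between the two agents. Sample counts and speeds are illustrative.

```python
import math

def hits(p_rel, v_rel, r):
    """True if relative velocity v_rel drives two discs (combined radius r,
    current offset p_rel, assumed not yet overlapping) into collision."""
    a = v_rel[0] ** 2 + v_rel[1] ** 2
    b = -2.0 * (p_rel[0] * v_rel[0] + p_rel[1] * v_rel[1])
    c = p_rel[0] ** 2 + p_rel[1] ** 2 - r * r
    disc = b * b - 4.0 * a * c
    if a == 0.0 or disc < 0.0:
        return False
    t = (-b - math.sqrt(disc)) / (2.0 * a)   # earliest boundary crossing
    return t >= 0.0

def rvo_velocity(p_a, v_a, p_b, v_b, r, v_pref, samples=200):
    """Pick the sampled velocity closest to the preferred one that passes
    the reciprocal velocity obstacle test."""
    p_rel = (p_b[0] - p_a[0], p_b[1] - p_a[1])
    best, best_d = None, float("inf")
    for i in range(samples):
        ang = 2.0 * math.pi * i / samples
        for speed in (0.5, 1.0, 1.5):
            v = (speed * math.cos(ang), speed * math.sin(ang))
            vr = (2 * v[0] - v_a[0] - v_b[0], 2 * v[1] - v_a[1] - v_b[1])
            if hits(p_rel, vr, r):
                continue                      # v lies inside the RVO
            d = (v[0] - v_pref[0]) ** 2 + (v[1] - v_pref[1]) ** 2
            if d < best_d:
                best, best_d = v, d
    return best
```

For two agents approaching head-on, the straight-ahead preferred velocity is forbidden and each agent sidesteps by the minimal amount, which is what suppresses the oscillations seen with plain velocity obstacles.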

IROS Conference 2007 Conference Paper

A hybrid approach for complete motion planning

  • Liangjun Zhang
  • Young J. Kim
  • Dinesh Manocha

We present an efficient algorithm for complete motion planning that combines approximate cell decomposition (ACD) with probabilistic roadmaps (PRM). Our approach uses ACD to subdivide the configuration space into cells and computes localized roadmaps by generating samples within these cells. We augment the connectivity graph for adjacent cells in ACD with pseudo-free edges that are computed based on localized roadmaps. These roadmaps are used to capture the connectivity of free space and guide the adaptive subdivision algorithm. At the same time, we use cell decomposition to check for path non-existence and generate samples in narrow passages. Overall, our hybrid algorithm combines the efficiency of PRM methods with the completeness of ACD-based algorithms. We have implemented our algorithm on 3-DOF and 4-DOF robots. We demonstrate its performance on planning scenarios with narrow passages or no collision-free paths. In practice, we observe up to 10 times improvement in performance over prior complete motion planning algorithms.

ICRA Conference 2007 Conference Paper

Efficient Motion Planning of Highly Articulated Chains using Physics-based Sampling

  • Russell Gayle
  • Stephane Redon
  • Avneesh Sud
  • Ming Lin 0003
  • Dinesh Manocha

We present a novel motion planning algorithm that efficiently generates physics-based samples in a kinematically and dynamically constrained space of a highly articulated chain. Similar to prior kinodynamic planning methods, the sampled nodes in our roadmaps are generated based on dynamic simulation. Moreover, we bias these samples by using constraint forces designed to avoid collisions while moving toward the goal configuration. We adaptively reduce the complexity of the state space by determining a subset of joints that contribute most towards the motion and only simulate these joints. Based on these configurations, we compute a valid path that satisfies non-penetration, kinematic, and dynamics constraints. Our approach can be easily combined with a variety of motion planning algorithms including probabilistic roadmaps (PRMs) and rapidly-exploring random trees (RRTs) and applied to articulated robots with hundreds of joints. We demonstrate the performance of our algorithm on several challenging benchmarks.

IROS Conference 2007 Conference Paper

Reactive deformation roadmaps: motion planning of multiple robots in dynamic environments

  • Russell Gayle
  • Avneesh Sud
  • Ming Lin 0003
  • Dinesh Manocha

We present a novel algorithm for motion planning of multiple robots amongst dynamic obstacles. Our approach is based on a new roadmap representation that uses deformable links and dynamically retracts to capture the connectivity of the free space. We use Newtonian physics and Hooke's Law to update the position of the milestones and deform the links in response to the motion of other robots and the obstacles. Based on this roadmap representation, we describe our planning algorithms that can compute collision-free paths for tens of robots in complex dynamic environments.

ICRA Conference 2006 Conference Paper

Fast C-obstacle Query Computation for Motion Planning

  • Liangjun Zhang
  • Young J. Kim
  • Gokul Varadhan
  • Dinesh Manocha

The configuration space of a robot is partitioned into free space and C-obstacle space. Most of the prior work in collision detection and motion planning algorithms is targeted towards checking whether a configuration or a 1D path lies in the free space. In this paper, we address the problem of checking whether a C-space primitive or a spatial cell lies completely inside C-obstacle space, without explicitly computing the boundary of C-obstacle. We refer to the problem as the C-obstacle query. We present a fast and conservative algorithm to perform this C-obstacle query. Our algorithm uses the notion of generalized penetration depth that takes into account both translational and rotational motion. We compute the generalized penetration depth for polyhedral objects and compare it with the extent of the motion that the polyhedral robot can undergo. Our approach is general and useful for designing practical algorithms for complete motion planning of rigid robots. We have integrated our query computation algorithm with star-shaped roadmaps (G. Varadhan and D. Manocha, 2005) - a deterministic sampling approach for complete motion planning. We have applied our modified planning algorithm to planar robots undergoing translational and rotational motion in complex 2D environments. Our algorithm is able to perform the C-obstacle query in milliseconds and improves the performance of the complete motion planning algorithm.

ICRA Conference 2006 Conference Paper

Topology Preserving Approximation of Free Configuration Space

  • Gokul Varadhan
  • Young J. Kim
  • Shankar Krishnan
  • Dinesh Manocha

We present a simple algorithm for approximating the free configuration space of robots with low degrees of freedom (DOFs). We represent the free space as an arrangement of contact surfaces. We approximate the free space using an adaptive volumetric grid that is computed by performing simple geometric tests on the contact surfaces. We use an isosurface extraction algorithm to compute a piecewise-linear approximation to the boundary of the free space. We prove that our approximation is topologically equivalent to the exact free space boundary. We also ensure that our approximation is geometrically close to the exact free space boundary by bounding its two-sided Hausdorff error. We have applied our algorithm to compute the free configuration space for the following instances: (1) a 2D polygonal robot with translational and rotational DOFs navigating among polygonal obstacles, and (2) a 3D polyhedral robot translating among polyhedral obstacles. In practice, our algorithm works well on robots with three DOFs.

ICRA Conference 2005 Conference Paper

Constraint-Based Motion Planning of Deformable Robots

  • Russell Gayle
  • Ming Lin 0003
  • Dinesh Manocha

We present a novel algorithm for motion planning of a deformable robot in a static environment. Given the initial and final configuration of the robot, our algorithm computes an approximate path using the probabilistic roadmap method. We use "constraint-based planning" to simulate robot deformation and make appropriate path adjustments and corrections to compute a collision-free path. Our algorithm takes into account geometric constraints like non-penetration and physical constraints like volume preservation. We highlight the performance of our planner on different scenarios of varying complexity.

ICRA Conference 2002 Conference Paper

DEEP: Dual-Space Expansion for Estimating Penetration Depth Between Convex Polytopes

  • Young J. Kim
  • Ming Lin 0003
  • Dinesh Manocha

We present an incremental algorithm to estimate the penetration depth between convex polytopes in 3D. The algorithm incrementally seeks a "locally optimal solution" by walking on the surface of the Minkowski sums. The surface of the Minkowski sums is computed implicitly by constructing a local Gauss map. In practice, the algorithm works well when there is high motion coherence in the environment and is able to compute the optimal solution in most cases.

IROS Conference 2001 Conference Paper

A Voronoi-based hybrid motion planner

  • Mark Foskey
  • Maxim Garber
  • Ming Lin 0003
  • Dinesh Manocha

We present a hybrid path planning algorithm for rigid and articulated bodies translating and rotating in a 3D workspace. Our approach generates a Voronoi roadmap in the workspace and combines it with "bridges" computed by a randomized path planner with Voronoi-biased sampling. The Voronoi roadmap is computed from a discrete approximation to the generalized Voronoi diagram (GVD) of the workspace, which is generated using graphics hardware. By using this GVD, portions of the path can be generated without random sampling, substantially reducing the number of random samples needed for the full query. The planner has been implemented and tested on a number of benchmarks. Some preliminary comparisons with a randomized motion planner indicate that our planner performs more than an order of magnitude faster in several challenging scenarios.

ICRA Conference 2000 Conference Paper

Fast Distance Queries with Rectangular Swept Sphere Volumes

  • Eric Larsen
  • Stefan Gottschalk
  • Ming Lin 0003
  • Dinesh Manocha

We present new distance computation algorithms using hierarchies of rectangular swept spheres. Each bounding volume of the tree is described as the Minkowski sum of a rectangle and a sphere, and fits tightly to the underlying geometry. We present accurate and efficient algorithms to build the hierarchies and perform distance queries between the bounding volumes. We also present traversal techniques for accelerating distance queries using coherence and priority directed search. These algorithms have been used to perform proximity queries for applications including virtual prototyping, dynamic simulation, and motion planning on complex models. As compared to earlier algorithms based on bounding volume hierarchies for separation distance and approximate distance computation, our algorithms have achieved significant speedups on many benchmarks.
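
The swept-sphere principle the bounding volumes rely on is that the distance between two volumes "core shape + sphere of radius r" is the distance between the cores minus the two radii. A 2-D sketch using line segments as cores (capsules) rather than the paper's rectangles, which keeps the core-distance routine short; the segment names and helpers below are illustrative, not from the paper:

```python
import math

def _pt_seg(p, a, b):
    """Distance from point p to segment ab."""
    ax, ay = b[0] - a[0], b[1] - a[1]
    t = ((p[0] - a[0]) * ax + (p[1] - a[1]) * ay) / ((ax * ax + ay * ay) or 1.0)
    t = max(0.0, min(1.0, t))
    return math.dist(p, (a[0] + t * ax, a[1] + t * ay))

def _orient(a, b, c):
    return (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])

def seg_seg_dist(a, b, c, d):
    """Distance between segments ab and cd in the plane."""
    if _orient(a, b, c) * _orient(a, b, d) < 0 and \
       _orient(c, d, a) * _orient(c, d, b) < 0:
        return 0.0  # proper crossing
    return min(_pt_seg(a, c, d), _pt_seg(b, c, d),
               _pt_seg(c, a, b), _pt_seg(d, a, b))

def swept_sphere_dist(a, b, ra, c, d, rb):
    """Distance between two line-swept-sphere volumes: core distance
    minus the sphere radii, clamped at zero when the volumes overlap."""
    return max(0.0, seg_seg_dist(a, b, c, d) - ra - rb)
```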

ICRA Conference 2000 Conference Paper

Interactive Motion Planning Using Hardware-Accelerated Computation of Generalized Voronoi Diagrams

  • Kenneth E. Hoff III
  • Tim Culver
  • John Keyser
  • Ming Lin 0003
  • Dinesh Manocha

We present techniques for fast motion planning by using discrete approximations of generalized Voronoi diagrams, computed with graphics hardware. Approaches based on this diagram computation are applicable to both static and dynamic environments of fairly high complexity. We compute a discrete Voronoi diagram by rendering a 3D distance mesh for each Voronoi site. The sites can be points, line segments, polygons, polyhedra, curves and surfaces. The computation of the generalized Voronoi diagram provides fast proximity query toolkits for motion planning. The tools provide the distance to the nearest obstacle stored in the Z-buffer, as well as the Voronoi boundaries, Voronoi vertices and weighted Voronoi graphs extracted from the frame buffer using continuation methods. We have implemented these algorithms and demonstrated their performance for path planning in a complex dynamic environment composed of more than 140,000 polygons.
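
The discrete Voronoi computation amounts to labeling each cell of a grid with its nearest site; the paper obtains this labeling by rasterizing per-site distance meshes and letting the Z-buffer resolve the minimum. A software sketch of the same labeling plus boundary extraction (point sites on a 2-D grid only, for illustration):

```python
def discrete_voronoi(width, height, sites):
    """Brute-force discrete Voronoi diagram: each grid cell gets the index
    of its nearest site (ties go to the lowest index)."""
    return [[min(range(len(sites)),
                 key=lambda s: (x - sites[s][0]) ** 2 + (y - sites[s][1]) ** 2)
             for x in range(width)] for y in range(height)]

def boundary_cells(grid):
    """Approximate Voronoi boundary: cells with a 4-neighbor owned by a
    different site, analogous to reading boundaries from the frame buffer."""
    h, w = len(grid), len(grid[0])
    return {(x, y) for y in range(h) for x in range(w)
            if any(0 <= x + dx < w and 0 <= y + dy < h
                   and grid[y + dy][x + dx] != grid[y][x]
                   for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)))}
```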

ICRA Conference 1995 Conference Paper

Fast Algorithms for Penetration and Contact Determination between Non-Convex Polyhedral Models

  • Ming Lin 0003
  • Dinesh Manocha
  • Madhav K. Ponamgi

We present fast algorithms for penetration detection and contact determination between polyhedral models in dynamic environments. They are based on a distance computation algorithm for convex polytopes and a hierarchical coherence-based algorithm to compute contacts. In particular, we extend an earlier expected constant time algorithm for distance computation between convex polytopes to detect penetrations. The algorithm computes all the contacts between the convex hulls of the polytopes. After identifying the contact regions it traverses the features lying beneath them to more precisely determine the contact regions. The traversal employs a dynamic technique, sweep and prune, to overcome the O(n^2) pairwise feature checks. The complexity of the overall algorithm is output sensitive. We demonstrate its performance on the dynamic simulation of a threaded insertion.
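
The sweep-and-prune idea, restricted to a single axis, can be sketched in a few lines: sort interval endpoints and report a pair whenever an interval opens while another is still active. A simplified illustration, not the paper's multi-axis implementation:

```python
def sweep_and_prune(intervals):
    """1-D sweep and prune: report index pairs whose [lo, hi] projections
    overlap, the broad-phase filter that avoids O(n^2) pairwise checks.
    Touching endpoints count as an overlap."""
    points = []
    for i, (lo, hi) in enumerate(intervals):
        points.append((lo, 0, i))   # 0 = interval opens
        points.append((hi, 1, i))   # 1 = interval closes
    points.sort()                   # at ties, opens sort before closes
    active, pairs = set(), set()
    for _, kind, i in points:
        if kind == 0:
            pairs.update((min(i, j), max(i, j)) for j in active)
            active.add(i)
        else:
            active.discard(i)
    return pairs
```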

ICRA Conference 1994 Conference Paper

A Fast Algorithm and System for the Inverse Kinematics of General Serial Manipulators

  • Dinesh Manocha
  • Yunshan Zhu

We present fast and robust algorithms for the inverse kinematics of serial manipulators consisting of six or fewer joints. When stated mathematically, the problem of inverse kinematics reduces to simultaneously solving a system of algebraic equations. In this paper, we use a series of algebraic and numeric transformations to reduce the problem to computing the eigenstructure of a matrix pencil. To efficiently compute the eigenstructure, we make use of the symbolic formulation of the matrix and use a number of techniques from linear algebra and matrix computations. The resulting algorithm computes all the solutions for a serial manipulator with six or fewer joints in the order of tens of milliseconds on current workstations. It has been implemented as part of a generic package, KINEM, for the inverse kinematics of serial manipulators.
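
The "roots as eigenstructure" reduction can be shown in its simplest form: the roots of a monic univariate polynomial are the eigenvalues of its companion matrix. The paper applies the same principle to the far larger matrix pencil arising from a 6R manipulator's kinematic equations; this is only a one-variable sketch:

```python
import numpy as np

def companion_roots(coeffs):
    """Roots of p(x) = x^n + c_{n-1} x^{n-1} + ... + c_1 x + c_0
    via the eigenvalues of the companion matrix.

    coeffs: [c_0, c_1, ..., c_{n-1}] in ascending order.
    """
    n = len(coeffs)
    C = np.zeros((n, n))
    C[1:, :-1] = np.eye(n - 1)        # subdiagonal of ones
    C[:, -1] = -np.asarray(coeffs)    # last column holds -c_0 ... -c_{n-1}
    return np.linalg.eigvals(C)
```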

ICRA Conference 1994 Conference Paper

Fast Contact Determination in Dynamic Environments

  • Ming Lin 0003
  • Dinesh Manocha
  • John F. Canny

We present an efficient contact determination algorithm for objects undergoing rigid motion. The environment consists of polytopes and models described by algebraic sets. We extend an expected constant time collision detection algorithm between convex polytopes to concave polytopes and curved models. The algorithm makes use of hierarchical representations for concave polytopes and local and global methods for solving polynomial equations to determine possible contact points. We also propose techniques to reduce the O(n^2) pairwise intersection tests for a large environment of n objects. These algorithms work well in practice and give real-time performance for most environments.

ICRA Conference 1992 Conference Paper

Real time inverse kinematics for general 6R manipulators

  • Dinesh Manocha
  • John F. Canny

The authors present a real-time algorithm for the inverse kinematics of general 6R robot manipulators. The algorithm involves symbolic preprocessing, matrix computation and a variety of numerical techniques. The numerical accuracy of these operations is well understood, and for most cases it is possible to compute accurate solutions using 64-bit IEEE floating point arithmetic available on most workstations. The average running time of the algorithm, for most cases, is 11 ms on an IBM RS/6000 workstation.