Arrow Research search

Author name cluster

Jean Oh

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

44 papers
2 author rows

Possible papers (44)

ICRA Conference 2025 Conference Paper

Bridging Spectral-Wise and Multi-Spectral Depth Estimation Via Geometry-Guided Contrastive Learning

  • Ukcheol Shin
  • Kyunghyun Lee 0004
  • Jean Oh

Deploying depth estimation networks in the real world requires high-level robustness against various adverse conditions to ensure safe and reliable autonomy. For this purpose, many autonomous vehicles employ multi-modal sensor systems, including an RGB camera, NIR camera, thermal camera, LiDAR, or Radar. They mainly adopt two strategies for using multiple sensors: modality-wise inference and multi-modal fused inference. The former is flexible but memory-inefficient, unreliable, and vulnerable; multi-modal fusion can provide high-level reliability, yet it needs a specialized architecture. In this paper, we propose an effective solution, named the align-and-fuse strategy, for depth estimation from multi-spectral images. In the align stage, we align the embedding spaces of multiple spectral bands to learn a shareable representation across multi-spectral images by minimizing a contrastive loss over global and spatially aligned local features with a geometry cue. After that, in the fuse stage, we train an attachable feature-fusion module that can selectively aggregate the multi-spectral features for reliable and robust prediction results. Based on the proposed method, a single depth network can achieve both spectral-invariant and multi-spectral fused depth estimation while preserving reliability, memory efficiency, and flexibility.
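As an illustration of the align-stage idea, below is a minimal contrastive-alignment sketch in PyTorch. It is not the paper's implementation; the function name, embedding shapes, and temperature are assumptions, and it covers only the global-feature case, omitting the geometry-guided local term.

```python
import torch
import torch.nn.functional as F

def info_nce(z_a, z_b, tau=0.07):
    # Symmetric InfoNCE between embeddings of the same scenes in two
    # spectral bands; matching rows are positives, other rows negatives.
    z_a = F.normalize(z_a, dim=1)
    z_b = F.normalize(z_b, dim=1)
    logits = z_a @ z_b.t() / tau                  # (B, B) similarities
    targets = torch.arange(z_a.size(0), device=z_a.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Align stage (illustrative): pull RGB and thermal embeddings of the
# same scene together so one depth decoder can consume either band.
rgb_z = torch.randn(8, 128)   # stand-ins for pooled encoder features
thr_z = torch.randn(8, 128)
loss = info_nce(rgb_z, thr_z)
```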

IJCAI Conference 2025 Conference Paper

Sample-Efficient Behavior Cloning Using General Domain Knowledge

  • Feiyu Zhu
  • Jean Oh
  • Reid Simmons

Behavior cloning has shown success in many sequential decision-making tasks by learning from expert demonstrations, yet it can be very sample-inefficient and fail to generalize to unseen scenarios. One approach to these problems is to introduce general domain knowledge, so that the policy can focus on the essential features and may generalize to unseen states by applying that knowledge. Although this knowledge is easy to acquire from experts, it is hard to combine with learning from individual examples due to the lack of semantic structure in neural networks and the time-consuming nature of feature engineering. To enable learning from both general knowledge and specific demonstration trajectories, we use a large language model's coding capability to instantiate a policy structure based on expert domain knowledge expressed in natural language, and we tune the parameters of the policy with demonstrations. We name this approach the Knowledge Informed Model (KIM), as the structure reflects the semantics of expert knowledge. In our experiments on lunar-lander and car-racing tasks, our approach learns to solve the tasks with as few as 5 demonstrations and is robust to action noise, outperforming the baseline model without domain knowledge. This indicates that, with the help of large language models, we can incorporate domain knowledge into the structure of the policy, increasing sample efficiency for behavior cloning.
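To make the KIM idea concrete, here is a toy sketch under stated assumptions: a hypothetical two-parameter policy skeleton of the kind an LLM might emit from a natural-language domain hint, with its free parameters tuned against demonstrations by brute-force search. The real system's policy structures and tuning procedure may differ.

```python
import numpy as np

# Hypothetical skeleton an LLM might emit for a lunar-lander-style hint:
# "fire the engine when descending too fast near the ground".
def policy(obs, theta):
    y, vy = obs                      # altitude, vertical velocity
    return 1.0 if vy < theta[0] + theta[1] * y else 0.0

def fit(demos, grid=np.linspace(-2.0, 2.0, 41)):
    # Tune the two free parameters against (obs, action) demonstrations.
    best, best_acc = None, -1.0
    for a in grid:
        for b in grid:
            acc = np.mean([policy(o, (a, b)) == act for o, act in demos])
            if acc > best_acc:
                best, best_acc = (a, b), acc
    return best

demos = [((1.0, -1.5), 1.0), ((1.0, -0.1), 0.0), ((0.2, -0.5), 1.0)]
theta = fit(demos)   # structure from knowledge, parameters from demos
```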

ICRA Conference 2025 Conference Paper

Soft Robotic Dynamic in-Hand Pen Spinning

  • Yunchao Yao
  • Uksang Yoo
  • Jean Oh
  • Christopher G. Atkeson
  • Jeffrey Ichnowski

Dynamic in-hand manipulation remains challenging for soft robotic systems, which have demonstrated advantages in safe, compliant interactions but struggle with high-speed dynamic tasks. In this work, we present SWIFT, a system for learning dynamic tasks using a soft and compliant robotic hand. Unlike previous works that rely on simulation, quasi-static actions, and precise object models, SWIFT learns to spin a pen through trial and error using only real-world data and without requiring explicit knowledge of the pen's physical attributes. With self-labeled trials sampled from the real world, SWIFT discovers the set of pen-grasping and spinning primitive parameters that enables a soft hand to spin a pen reliably. After 130 sampled actions per object, SWIFT achieves a 10/10 success rate across three pens with different weights and weight distributions, demonstrating generalizability and robustness to changes in object properties. The results highlight the potential for soft robotic end-effectors to perform dynamic tasks. We also demonstrate generalization to different shapes and weights, such as a brush and a screwdriver, with 10/10 and 5/10 success rates, respectively. Videos, data, and code are available at https://soft-spin.github.io.
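A heavily simplified sketch of the self-labeled parameter search described above, with assumed parameter names and a toy stand-in for the hardware rollout; the actual primitive parameterization and success labeling are specific to the paper's system.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_params():
    # Hypothetical primitive parameters: grasp height (m), squeeze
    # pressure (normalized), spin duration (s).
    return rng.uniform([0.0, 0.2, 0.1], [0.05, 1.0, 0.6])

def trial_succeeded(params):
    # Placeholder for a real-hardware rollout whose outcome would be
    # self-labeled, e.g., by tracking the pen in camera video.
    return params[1] > 0.6 and params[2] < 0.4   # toy "success region"

# Sample, execute, self-label, and keep the parameters that worked.
successes = [p for p in (sample_params() for _ in range(130))
             if trial_succeeded(p)]
```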

ICLR Conference 2025 Conference Paper

Stabilizing Reinforcement Learning in Differentiable Multiphysics Simulation

  • Eliot Xing
  • Vernon Luk
  • Jean Oh

Recent advances in GPU-based parallel simulation have enabled practitioners to collect large amounts of data and train complex control policies using deep reinforcement learning (RL), on commodity GPUs. However, such successes for RL in robotics have been limited to tasks sufficiently simulated by fast rigid-body dynamics. Simulation techniques for soft bodies are comparatively several orders of magnitude slower, thereby limiting the use of RL due to sample complexity requirements. To address this challenge, this paper presents both a novel RL algorithm and a simulation platform to enable scaling RL on tasks involving rigid bodies and deformables. We introduce Soft Analytic Policy Optimization (SAPO), a maximum entropy first-order model-based actor-critic RL algorithm, which uses first-order analytic gradients from differentiable simulation to train a stochastic actor to maximize expected return and entropy. Alongside our approach, we develop Rewarped, a parallel differentiable multiphysics simulation platform that supports simulating various materials beyond rigid bodies. We re-implement challenging manipulation and locomotion tasks in Rewarped, and show that SAPO outperforms baselines over a range of tasks that involve interaction between rigid bodies, articulations, and deformables. Additional details at https://rewarped.github.io/.
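The core mechanism, first-order gradients flowing from the return back through the simulator into the policy, can be illustrated with a toy differentiable dynamics model. This is a sketch only: SAPO additionally uses a stochastic actor, a critic, and a maximum-entropy objective, all omitted here.

```python
import torch

def step(state, action, dt=0.1):
    # Toy differentiable "simulator": a point mass pushed by the action.
    pos, vel = state
    vel = vel + dt * action
    pos = pos + dt * vel
    return (pos, vel)

policy = torch.nn.Linear(2, 1)
opt = torch.optim.Adam(policy.parameters(), lr=1e-2)

for _ in range(100):
    state = (torch.zeros(1), torch.zeros(1))
    ret = torch.zeros(1)
    for _ in range(20):
        action = policy(torch.cat(state))
        state = step(state, action.squeeze())
        ret = ret - (state[0] - 1.0) ** 2        # reward: reach pos = 1
    loss = -ret.sum()   # analytic gradient of return w.r.t. policy params
    opt.zero_grad()
    loss.backward()
    opt.step()
```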

IROS Conference 2025 Conference Paper

VPOcc: Exploiting Vanishing Point for 3D Semantic Occupancy Prediction

  • Junsu Kim
  • Junhee Lee
  • Ukcheol Shin
  • Jean Oh
  • Kyungdon Joo

Understanding 3D scenes semantically and spatially is crucial for the safe navigation of robots and autonomous vehicles, aiding obstacle avoidance and accurate trajectory planning. Camera-based 3D semantic occupancy prediction, which infers complete voxel grids from 2D images, is gaining importance in robot vision for its resource efficiency compared to 3D sensors. However, this task inherently suffers from a 2D-3D discrepancy, where objects of the same size in 3D space appear at different scales in a 2D image depending on their distance from the camera due to perspective projection. To tackle this issue, we propose a novel framework called VPOcc that leverages a vanishing point (VP) to mitigate the 2D-3D discrepancy at both the pixel and feature levels. As a pixel-level solution, we introduce a VPZoomer module, which warps images by counteracting the perspective effect using a VP-based homography transformation. In addition, as a feature-level solution, we propose a VP-guided cross-attention (VPCA) module that performs perspective-aware feature aggregation, utilizing 2D image features that are more suitable for 3D space. Lastly, we integrate two feature volumes extracted from the original and warped images so that they compensate for each other through a spatial volume fusion (SVF) module. By effectively incorporating the VP into the network, our framework achieves improvements in both IoU and mIoU on the SemanticKITTI and SSCBench-KITTI360 datasets. Additional details are available at https://vision3d-lab.github.io/vpocc/.

ICRA Conference 2024 Conference Paper

CoFRIDA: Self-Supervised Fine-Tuning for Human-Robot Co-Painting

  • Peter Schaldenbrand
  • Gaurav Parmar
  • Jun-Yan Zhu
  • James McCann
  • Jean Oh

Prior robot painting and drawing work, such as FRIDA, has focused on decreasing the sim-to-real gap and expanding input modalities for users, but interaction with these systems generally exists only in the input stages. To support interactive, human-robot collaborative painting, we introduce the Collaborative FRIDA (CoFRIDA) robot painting framework, which can co-paint by modifying and engaging with content already painted by a human collaborator. To improve text-image alignment, FRIDA's major weakness, our system uses pre-trained text-to-image models; however, pre-trained models in the context of real-world co-painting do not perform well because they (1) do not understand the constraints and abilities of the robot and (2) cannot perform co-painting without making unrealistic edits to the canvas and overwriting content. We propose a self-supervised fine-tuning procedure that can tackle both issues, allowing the use of pre-trained state-of-the-art text-image alignment models with robots to enable co-painting in the physical world. Our open-source approach, CoFRIDA, creates paintings and drawings that match the input text prompt more clearly than FRIDA, both from a blank canvas and from one with human-created work. More generally, our fine-tuning procedure successfully encodes the robot's constraints and abilities into a foundation model, showcasing promising results as an effective method for reducing sim-to-real gaps. https://pschaldenbrand.github.io/cofrida/

ICRA Conference 2024 Conference Paper

Complementary Random Masking for RGB-Thermal Semantic Segmentation

  • Ukcheol Shin
  • Kyunghyun Lee 0004
  • In-So Kweon
  • Jean Oh

RGB-thermal semantic segmentation is one potential solution for achieving reliable semantic scene understanding in adverse weather and lighting conditions. However, previous studies have mostly focused on designing a multi-modal fusion module without considering the nature of multi-modal inputs. As a result, the networks easily become over-reliant on a single modality, making it difficult to learn complementary and meaningful representations for each modality. This paper proposes 1) a complementary random masking strategy for RGB-T images and 2) a self-distillation loss between clean and masked input modalities. The proposed masking strategy prevents over-reliance on a single modality. It also improves the accuracy and robustness of the neural network by forcing the network to segment and classify objects even when one modality is only partially available. In addition, the proposed self-distillation loss encourages the network to extract complementary and meaningful representations from a single modality or from complementarily masked modalities. We achieve state-of-the-art performance on three RGB-T semantic segmentation benchmarks. Our source code is available at https://github.com/UkcheolShin/CRM_RGBTSeg.
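The masking strategy itself is simple to sketch. A minimal version under assumed patch size and masking ratio (not necessarily the paper's exact scheme) pairs a random patch mask with its complement so that every patch stays visible in exactly one modality:

```python
import torch

def complementary_masks(h, w, patch=16, p=0.5):
    # Random patch-level mask and its complement: each patch is kept in
    # exactly one of the two modalities.
    gh, gw = h // patch, w // patch
    keep = (torch.rand(gh, gw) > p).float()
    m_rgb = keep.repeat_interleave(patch, 0).repeat_interleave(patch, 1)
    return m_rgb, 1.0 - m_rgb

rgb = torch.randn(3, 224, 224)
thermal = torch.randn(1, 224, 224)
m_rgb, m_thr = complementary_masks(224, 224)
rgb_masked, thermal_masked = rgb * m_rgb, thermal * m_thr
```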

IROS Conference 2024 Conference Paper

Density-aware Domain Generalization for LiDAR Semantic Segmentation

  • Jaeyeul Kim
  • Jungwan Woo
  • Ukcheol Shin
  • Jean Oh
  • Sunghoon Im 0001

3D LiDAR-based perception has made remarkable advancements, leading to the widespread adoption of LiDAR in autonomous driving systems. Despite these technological strides, variations in LiDAR sensors and environmental conditions can significantly deteriorate the performance of perception models, primarily due to changes in the density of point clouds. Recent studies in domain generalization have aimed to mitigate this challenge; however, they often rely on the availability of sequential data and ego-motion, which limits their applicability. To address these limitations, we propose two novel methods that enable network operation in a density-aware fashion without any constraints, thereby ensuring consistent performance despite fluctuations in point cloud density. First, we design the network to be density-aware by utilizing the kernel occupancy information from the 3D sparse convolution as geometric features. Subsequently, we further enhance density awareness by incorporating voxel-wise density prediction as an auxiliary task in a self-supervised manner. Our method demonstrates superior performance over current state-of-the-art approaches, achieving this without the need for specific data prerequisites. Our approach is compatible with a variety of 3D backbone architectures, enhancing domain generalization performance by 18.4% while adding a minimal computational overhead of only 7 ms.

ICRA Conference 2024 Conference Paper

POE: Acoustic Soft Robotic Proprioception for Omnidirectional End-effectors

  • Uksang Yoo
  • Ziven Lopez
  • Jeffrey Ichnowski
  • Jean Oh

Shape estimation is crucial for precise control of soft robots. However, soft robot shape estimation and proprioception are challenging due to their complex deformation behaviors and infinite degrees of freedom. Their continuously deforming bodies complicate integrating rigid sensors and reliably estimating their shape. In this work, we present the Proprioceptive Omnidirectional End-effector (POE), a tendon-driven soft robot with six embedded microphones. We first introduce novel applications of 3D reconstruction methods to acoustic signals from the microphones for soft robot shape proprioception. To improve the proprioception pipeline's training efficiency and model prediction consistency, we present POE-M. POE-M predicts key point positions from acoustic signal observations and uses an energy-minimization method to reconstruct a physically admissible high-resolution mesh of POE. We evaluate mesh reconstruction on simulated data and the POE-M pipeline with real-world experiments. Ablation studies suggest that POE-M's guidance of the key points during the mesh reconstruction process provides robustness and stability to the pipeline. POE-M reduced the maximum Chamfer distance error by 23.1% compared to state-of-the-art end-to-end soft robot proprioception models and achieved a 4.91 mm average Chamfer distance error during evaluation. Supplemental materials, experiment data, and visualizations are available at sites.google.com/view/acoustic-poe.

IROS Conference 2024 Conference Paper

Robot Synesthesia: A Sound and Emotion Guided Robot Painter

  • Vihaan Misra
  • Peter Schaldenbrand
  • Jean Oh

If a picture paints a thousand words, sound may voice a million. While recent robotic painting and image synthesis methods have achieved progress in generating visuals from text inputs, the translation of sound into images is vastly unexplored. Generally, sound-based interfaces and sonic interactions have the potential to expand accessibility and control for the user and provide a means to convey complex emotions and the dynamic aspects of the real world. In this paper, we propose an approach for using sound and speech to guide a robotic painting process, known here as robot synesthesia. For general sound, we encode the simulated paintings and input sounds into the same latent space. For speech, we decouple speech audio into its transcribed text and the tone of the speech. Whereas we use the text to control the content, we estimate the emotions from the tone to guide the mood of the painting. Our approach has been fully integrated with FRIDA, a robotic painting framework, adding sound and speech to FRIDA's existing input modalities such as text and style. In two surveys, participants correctly guessed the emotion or natural sound used to generate a given painting at more than twice the rate of random chance. We also discuss qualitative results on sound-guided image manipulation and music-guided paintings.

IROS Conference 2024 Conference Paper

Towards Human-Centered Construction Robotics: A Reinforcement Learning-Driven Companion Robot for Contextually Assisting Carpentry Workers

  • Yuning Wu 0003
  • Jiaying Wei
  • Jean Oh
  • Daniel Cardoso Llach

In the dynamic construction industry, traditional robotic integration has primarily focused on automating specific tasks, often overlooking the complexity and variability of human aspects in construction workflows. This paper introduces a human-centered approach with a "work companion rover" designed to assist construction workers within their existing practices, aiming to enhance safety and workflow fluency while respecting the skilled nature of construction labor. We conduct an in-depth study on deploying a robotic system in carpentry formwork, showcasing a prototype that emphasizes mobility, safety, and comfortable worker-robot collaboration in dynamic environments through a contextual Reinforcement Learning (RL)-driven modular framework. Our research advances robotic applications in construction, advocating for collaborative models in which adaptive robots support rather than replace humans, and underscoring the potential for an interactive and collaborative human-robot workforce.

IROS Conference 2024 Conference Paper

Translating Agent-Environment Interactions from Humans to Robots

  • Tanmay Shankar
  • Chaitanya Chawla
  • Almutwakel Hassan
  • Jean Oh

Humans are remarkably adept at imitating other people performing tasks, afforded by their ability to abstract away irrelevant details and focus on the task strategy of the demonstrator. In this paper, we take steps towards enabling robots with this ability and present a framework, TransAct, to do so. TransAct first builds on prior skill learning work to learn temporally abstract representations of common agent-environment interactions in manipulation tasks, e.g., a robot pouring from a cup. Given a human demonstration of an unseen task, TransAct then translates the underlying sequence of interactions (i.e., the human task strategy) to a robot learner. Through experiments on real-world human and robot datasets, we demonstrate TransAct's ability to accurately represent diverse agent-environment interactions. Moreover, TransAct empowers robots to consume human task demonstrations and compose corresponding interactions with similar environmental effects to perform the tasks themselves in a zero-shot manner, without access to paired demonstrations or dense annotations. We present visualizations of our results at https://sites.google.com/view/interaction-abstractions.

IJCAI Conference 2023 Conference Paper

Core Challenges in Embodied Vision-Language Planning (Extended Abstract)

  • Jonathan Francis
  • Nariaki Kitamura
  • Felix Labelle
  • Xiaopeng Lu
  • Ingrid Navarro
  • Jean Oh

Recent advances in the areas of Multimodal Machine Learning and Artificial Intelligence (AI) have led to the development of challenging tasks at the intersection of Computer Vision, Natural Language Processing, and Robotics. Whereas many approaches and previous survey pursuits have characterised one or two of these dimensions, there has not been a holistic analysis at the center of all three. Moreover, even when combinations of these topics are considered, more focus is placed on describing, e.g., current architectural methods, as opposed to also illustrating high-level challenges and opportunities for the field. In this survey paper, we discuss Embodied Vision-Language Planning (EVLP) tasks, a family of prominent embodied navigation and manipulation problems that jointly leverage computer vision and natural language for interaction in physical environments. We propose a taxonomy to unify these tasks and provide an in-depth analysis and comparison of the new and current algorithmic approaches, metrics, simulators, and datasets used for EVLP tasks. Finally, we present the core challenges that we believe new EVLP works should seek to address, and we advocate for task construction that enables model generalisability and furthers real-world deployment.

ICRA Conference 2023 Conference Paper

Follow The Rules: Online Signal Temporal Logic Tree Search for Guided Imitation Learning in Stochastic Domains

  • Jasmine Jerry Aloor
  • Jay Patrikar
  • Parv Kapoor
  • Jean Oh
  • Sebastian A. Scherer

Seamlessly integrating rules into Learning-from-Demonstrations (LfD) policies is a critical requirement for enabling the real-world deployment of AI agents. Recently, Signal Temporal Logic (STL) has been shown to be an effective language for encoding rules as spatio-temporal constraints. This work uses Monte Carlo Tree Search (MCTS) as a means of integrating STL specifications into a vanilla LfD policy to improve constraint satisfaction. We propose augmenting the MCTS heuristic with STL robustness values to bias the tree search towards branches with higher constraint satisfaction. While this domain-independent method can be applied to integrate STL rules online into any pre-trained LfD algorithm, we choose goal-conditioned Generative Adversarial Imitation Learning as the offline LfD policy. We apply the proposed method to the domain of planning trajectories for General Aviation aircraft around a non-towered airfield. Results using the simulator trained on real-world data showcase 60% improved performance over baseline LfD methods that do not use STL heuristics. Codebase: https://github.com/castacks/mcts-stl-planning. Video: https://youtu.be/fiFCwc57MQs
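The biasing idea can be sketched as a plain UCB1 score with an added robustness term; the weighting scheme and names here are assumptions, not the paper's exact heuristic.

```python
import math

def ucb_with_stl(total_value, visits, parent_visits, stl_robustness,
                 c=1.4, lam=0.5):
    # Standard UCB1 plus a bonus proportional to the STL robustness of
    # the child's partial trajectory, biasing the search toward branches
    # with higher constraint satisfaction.
    if visits == 0:
        return float("inf")
    exploit = total_value / visits
    explore = c * math.sqrt(math.log(parent_visits) / visits)
    return exploit + explore + lam * stl_robustness

score = ucb_with_stl(total_value=3.2, visits=4, parent_visits=50,
                     stl_robustness=0.8)
```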

ICRA Conference 2023 Conference Paper

FRIDA: A Collaborative Robot Painter with a Differentiable, Real2Sim2Real Planning Environment

  • Peter Schaldenbrand
  • James McCann
  • Jean Oh

Painting is an artistic process of rendering visual content that achieves the high-level communication goals of an artist, which may change dynamically throughout the creative process. In this paper, we present a Framework and Robotics Initiative for Developing Arts (FRIDA) that enables humans to produce paintings on canvases by collaborating with a painter robot using simple inputs such as language descriptions or images. FRIDA introduces several technical innovations for computationally modeling a creative painting process. First, we develop a fully differentiable simulation environment for painting, adopting the idea of real to simulation to real (real2sim2real). We show that our proposed simulated painting environment is of higher fidelity to reality than existing simulation environments used for robot painting. Second, to model the evolving dynamics of a creative process, we develop a planning approach that can continuously optimize the painting plan based on the evolving canvas with respect to the high-level goals. In contrast to existing approaches where content generation and action planning are performed independently and sequentially, FRIDA adapts to the stochastic nature of using paint and a brush by continually re-planning and re-assessing its semantic goals based on its visual perception of the painting progress. We describe the details of the technical approach as well as the system integration. FRIDA software is freely available at: https://github.com/cmubig/Frida.

IROS Conference 2023 Conference Paper

T2FPV: Dataset and Method for Correcting First-Person View Errors in Pedestrian Trajectory Prediction

  • Benjamin Stoler
  • Meghdeep Jana
  • Soonmin Hwang
  • Jean Oh

Predicting pedestrian motion is essential for developing socially-aware robots that interact in a crowded environment. While the natural visual perspective for a social interaction setting is an egocentric view, the majority of existing work in trajectory prediction therein has been investigated purely in the top-down trajectory space. To support first-person view trajectory prediction research, we present T2FPV, a method for constructing high-fidelity first-person view (FPV) datasets given a real-world, top-down trajectory dataset; we showcase our approach on the ETH/UCY pedestrian dataset to generate the egocentric visual data of all interacting pedestrians, creating the T2FPV-ETH dataset. In this setting, FPV-specific errors arise due to imperfect detection and tracking, occlusions, and field-of-view (FOV) limitations of the camera. To address these errors, we propose CoFE, a module that further refines the imputation of missing data in an end-to-end manner with trajectory forecasting algorithms. Our method reduces the impact of such FPV errors on downstream prediction performance, decreasing displacement error by more than 10% on average. To facilitate research engagement, we release our T2FPV-ETH dataset and software tools at https://github.com/cmubig/T2FPV.

ICRA Conference 2022 Conference Paper

Autonomous Exploration Development Environment and the Planning Algorithms

  • Chao Cao
  • Hongbiao Zhu
  • Fan Yang 0092
  • Yukun Xia
  • Howie Choset
  • Jean Oh
  • Ji Zhang 0003

The Autonomous Exploration Development Environment is an open-source repository released to facilitate the development of high-level planning algorithms and the integration of complete autonomous navigation systems. The repository contains representative simulation environment models, fundamental navigation modules, e.g., a local planner, terrain traversability analysis, waypoint following, and visualization tools. Together with two of our high-level planner releases, the TARE planner for exploration and the FAR planner for route planning, we detail the usage of the three open-source repositories and share experiences in the integration of autonomous navigation systems. We use the DARPA Subterranean Challenge as a use case, where the repositories together form the main navigation system of the CMU-OSU Team. In the end, we discuss a few potential use cases in extended applications.

JAIR Journal 2022 Journal Article

Core Challenges in Embodied Vision-Language Planning

  • Jonathan Francis
  • Nariaki Kitamura
  • Felix Labelle
  • Xiaopeng Lu
  • Ingrid Navarro
  • Jean Oh

Recent advances in the areas of multimodal machine learning and artificial intelligence (AI) have led to the development of challenging tasks at the intersection of Computer Vision, Natural Language Processing, and Embodied AI. Whereas many approaches and previous survey pursuits have characterised one or two of these dimensions, there has not been a holistic analysis at the center of all three. Moreover, even when combinations of these topics are considered, more focus is placed on describing, e.g., current architectural methods, as opposed to also illustrating high-level challenges and opportunities for the field. In this survey paper, we discuss Embodied Vision-Language Planning (EVLP) tasks, a family of prominent embodied navigation and manipulation problems that jointly use computer vision and natural language. We propose a taxonomy to unify these tasks and provide an in-depth analysis and comparison of the new and current algorithmic approaches, metrics, simulated environments, as well as the datasets used for EVLP tasks. Finally, we present the core challenges that we believe new EVLP works should seek to address, and we advocate for task construction that enables model generalizability and furthers real-world deployment.

IROS Conference 2022 Conference Paper

FAR Planner: Fast, Attemptable Route Planner using Dynamic Visibility Update

  • Fan Yang 0092
  • Chao Cao
  • Hongbiao Zhu
  • Jean Oh
  • Ji Zhang 0003

Path planning in unknown environments remains a challenging problem: as the environment is gradually observed during navigation, the underlying planner has to update the environment representation and replan, promptly and constantly, to account for new observations. In this paper, we present a visibility graph-based planning framework capable of dealing with navigation tasks in both known and unknown environments. The planner employs a polygonal representation of the environment and constructs the representation by extracting edge points around obstacles to form enclosed polygons. With that, the method dynamically updates a global visibility graph using a two-layered data structure, expanding the visibility edges along with the navigation and removing edges that become occluded by newly observed obstacles. When navigating in unknown environments, the method attempts to discover a way to the goal by picking up the environment layout on the fly, updating the visibility graph, and quickly replanning in response to the newly observed environment. We evaluate the method in simulated and real-world settings. The method shows the capability to attempt and navigate through unknown environments, reducing travel time by 12-47% compared to search-based methods (A*, D* Lite) and by more than 24-35% compared to sampling-based methods (RRT*, BIT*, and SPARS).

ICRA Conference 2022 Conference Paper

Predicting Like A Pilot: Dataset and Method to Predict Socially-Aware Aircraft Trajectories in Non-Towered Terminal Airspace

  • Jay Patrikar
  • Brady G. Moon
  • Jean Oh
  • Sebastian A. Scherer

Pilots operating aircraft in non-towered terminal airspace rely on their situational awareness and prior knowledge to predict the future trajectories of other agents. These predictions are conditioned on the past trajectories of other agents, agent-agent social interactions, and environmental context such as airport location and weather. This paper provides a dataset, TrajAir, that captures this behaviour in non-towered terminal airspace around a regional airport. We also present a baseline socially-aware trajectory prediction algorithm, TrajAirNet, that uses the dataset to predict the trajectories of all agents. The dataset is collected for 111 days over 8 months and contains ADS-B transponder data along with the corresponding METAR weather data. The data is processed to be used as a benchmark with other publicly available social navigation datasets. To the best of the authors' knowledge, this is the first 3D social aerial navigation dataset, thus introducing social navigation for autonomous aviation. TrajAirNet combines state-of-the-art modules in social navigation to provide predictions in a static environment with a dynamic context. Both the TrajAir dataset and the TrajAirNet prediction algorithm are open-source. Dataset: https://theairlab.org/trajair/. Codebase: https://github.com/castacks/trajairnet. Video: https://youtu.be/e1AQXrxB2gw

IROS Conference 2022 Conference Paper

RCA: Ride Comfort-Aware Visual Navigation via Self-Supervised Learning

  • Xinjie Yao
  • Ji Zhang 0003
  • Jean Oh

Under shared autonomy, wheelchair users expect vehicles to provide safe and comfortable rides while following users' high-level navigation plans. To find such a path, vehicles negotiate with different terrains and assess their traversal difficulty. Most prior works model surroundings either through geometric representations or semantic classifications, which do not reflect perceived motion intensity and ride comfort in downstream navigation tasks. We propose to model ride comfort explicitly in traversability analysis using proprioceptive sensing. We develop a self-supervised learning framework to predict a traversability costmap from first-person-view images by leveraging vehicle states as training signals. Our approach estimates how the vehicle would “feel” when traversing a terrain, based on its appearance. We then show that our navigation system provides human-preferred ride comfort through robot experiments together with a human evaluation study. The project can be found at https://sites.google.com/view/rca-navigation.

IROS Conference 2022 Conference Paper

Social-PatteRNN: Socially-Aware Trajectory Prediction Guided by Motion Patterns

  • Ingrid Navarro
  • Jean Oh

As robots across domains start collaborating with humans in shared environments, algorithms that enable them to reason over human intent are important for achieving safe interplay. In our work, we study human intent through the problem of predicting trajectories in dynamic environments. We explore domains where navigation guidelines are relatively strictly defined but not clearly marked in their physical environments. We hypothesize that within these domains, agents tend to exhibit short-term motion patterns that reveal context information related to the agent's general direction, intermediate goals, and rules of motion, e.g., social behavior. From this intuition, we propose Social-PatteRNN, an algorithm for recurrent, multi-modal trajectory prediction that exploits motion patterns to encode the aforesaid contexts. Our approach guides long-term trajectory prediction by learning to predict short-term motion patterns. It then extracts sub-goal information from the patterns and aggregates it as social context. We assess our approach across three domains: human crowds, humans in sports, and manned aircraft in terminal airspace, achieving state-of-the-art performance.

IJCAI Conference 2022 Conference Paper

StyleCLIPDraw: Coupling Content and Style in Text-to-Drawing Translation

  • Peter Schaldenbrand
  • Zhixuan Liu
  • Jean Oh

Generating images that fit a given text description using machine learning has improved greatly with the release of technologies such as the CLIP image-text encoder model; however, current methods lack artistic control over the style of the image to be generated. We present an approach for generating styled drawings for a given text description, where a user can specify a desired drawing style using a sample image. Inspired by a theory in art that style and content are generally inseparable during the creative process, we propose a coupled approach, known here as StyleCLIPDraw, whereby the drawing is generated by optimizing for style and content simultaneously throughout the process, as opposed to applying style transfer after creating content in sequence. Based on human evaluation, the styles of images generated by StyleCLIPDraw are strongly preferred to those produced by the sequential approach. Although the quality of content generation degrades for certain styles, StyleCLIPDraw is found to be far preferred overall when considering both content and style, indicating the importance of the style, look, and feel of machine-generated images to people, as well as indicating that style is coupled in the drawing process itself. Our code, a demonstration, and style evaluation data are publicly available.

ICML Conference 2022 Conference Paper

Translating Robot Skills: Learning Unsupervised Skill Correspondences Across Robots

  • Tanmay Shankar
  • Yixin Lin
  • Aravind Rajeswaran
  • Vikash Kumar
  • Stuart Anderson
  • Jean Oh

In this paper, we explore how we can endow robots with the ability to learn correspondences between their own skills and those of morphologically different robots in different domains, in an entirely unsupervised manner. Our key insight is that morphologically different robots use similar task strategies to solve similar tasks. Based on this insight, we frame learning skill correspondences as a problem of matching distributions of sequences of skills across robots. We then present an unsupervised objective that encourages a learnt skill translation model to match these distributions across domains, inspired by recent advances in unsupervised machine translation. Our approach is able to learn semantically meaningful correspondences between skills across multiple robot-robot and human-robot domain pairs despite being completely unsupervised. Further, the learnt correspondences enable the transfer of task strategies across robots and domains. We present dynamic visualizations of our results at https://sites.google.com/view/translatingrobotskills/home.

AAAI Conference 2021 Conference Paper

Content Masked Loss: Human-Like Brush Stroke Planning in a Reinforcement Learning Painting Agent

  • Peter Schaldenbrand
  • Jean Oh

The objective of most Reinforcement Learning painting agents is to minimize the loss between a target image and the paint canvas. Human painters, by contrast, emphasize the important features of the target image rather than simply reproducing it. RL painting models that use adversarial or L2 losses, although their final outputs are generally works of finesse, produce stroke sequences vastly different from those a human would produce, since the model has no knowledge of the abstract features in the target image. In order to increase the human-like planning of the model without the use of expensive human data, we introduce a new loss function for use with the model's reward function: Content Masked Loss. In the context of robot painting, Content Masked Loss employs an object detection model to extract features that are used to assign higher weight to regions of the canvas that a human would find important for recognizing content. The results, based on 332 human evaluators, show that the digital paintings produced by our Content Masked model show detectable subject matter earlier in the stroke sequence than existing methods, without compromising the quality of the final painting. Our code is available at https://github.com/pschaldenbrand/ContentMaskedLoss.
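A minimal sketch of the loss idea, assuming a precomputed saliency mask in place of the paper's object-detection features:

```python
import torch

def content_masked_loss(canvas, target, saliency):
    # L2 loss reweighted by a saliency mask so that regions a human
    # would find important for recognizing content dominate the signal.
    weight = 1.0 + saliency
    return (weight * (canvas - target) ** 2).mean()

canvas = torch.rand(3, 64, 64)
target = torch.rand(3, 64, 64)
saliency = torch.rand(1, 64, 64)   # stand-in for a detector-derived mask
loss = content_masked_loss(canvas, target, saliency)
```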

ICRA Conference 2020 Conference Paper

A Generative Approach for Socially Compliant Navigation

  • Chieh-En Tsai
  • Jean Oh

Robots navigating in human crowds need to optimize their paths not only for task performance but also for compliance with social norms. One of the key challenges in this context is the lack of standard metrics for evaluating and optimizing socially compliant behavior. Existing works in social navigation can be grouped according to the differences in their optimization objectives. For instance, reinforcement learning approaches tend to optimize the comfort aspect of socially compliant navigation, whereas inverse reinforcement learning approaches are designed to achieve natural behavior. In this paper, we propose NaviGAN, a generative navigation algorithm that jointly optimizes both the comfort and naturalness aspects. Our approach is designed as an adversarial training framework that can learn to generate a navigation path that is optimized both for achieving a goal and for complying with latent social rules. A set of experiments has been carried out on multiple datasets to demonstrate the strengths of the proposed approach quantitatively. We also perform extensive experiments using a physical robot in a real-world environment to qualitatively evaluate the trained social navigation behavior. Video recordings of the robot experiments can be found at: https://youtu.be/61blDymjCpw.

ICRA Conference 2020 Conference Paper

CNN-Based Simultaneous Dehazing and Depth Estimation

  • Byeong-Uk Lee
  • Kyunghyun Lee 0004
  • Jean Oh
  • In-So Kweon

It is difficult for both cameras and depth sensors to obtain reliable information in hazy scenes. Therefore, image dehazing remains one of the most challenging problems in computer vision and robotics. With the development of convolutional neural networks (CNNs), many dehazing and depth estimation algorithms using CNNs have emerged. However, very few of them try to solve these two problems at the same time. Building on the fact that traditional haze modeling contains depth information in its formula, we propose a CNN-based simultaneous dehazing and depth estimation network. Our network aims to estimate both a dehazed image and a fully scaled depth map from a single hazy RGB input with end-to-end training. The network contains a single dense encoder and four separate decoders; each decoder shares the encoded image representation while performing its individual task. We suggest a novel depth-transmission consistency loss in the training scheme to fully utilize the correlation between the depth information and the transmission map. To demonstrate the robustness and effectiveness of our algorithm, we performed various ablation studies and compared our results to those of state-of-the-art algorithms in dehazing and single-image depth estimation, both qualitatively and quantitatively. Furthermore, we show the generality of our network by applying it to real-world examples.
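The depth-transmission link that the consistency loss exploits comes from the standard haze model I(x) = J(x)t(x) + A(1 - t(x)) with t(x) = exp(-beta d(x)). Below is a sketch of a consistency term built on that relation; the paper's exact formulation may differ.

```python
import torch

def transmission_from_depth(depth, beta=1.0):
    # Standard haze model: t(x) = exp(-beta * d(x))
    return torch.exp(-beta * depth)

def depth_transmission_consistency(t_pred, d_pred, beta=1.0):
    # Penalize disagreement between the predicted transmission map and
    # the transmission implied by the predicted depth map.
    return torch.abs(t_pred - transmission_from_depth(d_pred, beta)).mean()

t_pred = torch.rand(1, 128, 128)          # predicted transmission map
d_pred = torch.rand(1, 128, 128) * 10.0   # predicted depth map
loss = depth_transmission_consistency(t_pred, d_pred)
```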

ICRA Conference 2020 Conference Paper

Learning Shape-based Representation for Visual Localization in Extremely Changing Conditions

  • Hae-Gon Jeon
  • Sunghoon Im 0001
  • Jean Oh
  • Martial Hebert

Visual localization is an important task for applications such as navigation and augmented reality, but is a challenging problem when there are changes in scene appearances through day, seasons, or environments. In this paper, we present a convolutional neural network (CNN)-based approach for visual localization across normal to drastic appearance variations such as pre- and post-disaster cases. Our approach aims to address two key challenges: (1) to reduce the biases based on scene textures as in traditional CNNs, our model learns a shape-based representation by training on stylized images; (2) to make the model robust against layout changes, our approach uses the estimated dominant planes of query images as approximate scene coordinates. Our method is evaluated on various scenes including a simulated disaster dataset to demonstrate the effectiveness of our method in significant changes of scene layout. Experimental results show that our method provides reliable camera pose predictions in various changing conditions.

IJCAI Conference 2019 Conference Paper

Image Captioning with Compositional Neural Module Networks

  • Junjiao Tian
  • Jean Oh

In image captioning, where fluency is an important factor in evaluation (e.g., via n-gram metrics), sequential models are commonly used; however, sequential models generally result in overgeneralized expressions that lack the details that may be present in an input image. Inspired by the idea of compositional neural module networks in the visual question answering task, we introduce a hierarchical framework for image captioning that explores both the compositionality and sequentiality of natural language. Our algorithm learns to compose a detail-rich sentence by selectively attending to different modules corresponding to unique aspects of each object detected in an input image, to include specific descriptions such as counts and color. In a set of experiments on the MSCOCO dataset, the proposed model outperforms a state-of-the-art model across multiple evaluation metrics and, more importantly, presents visually interpretable results. Furthermore, the breakdown of subcategory F-scores of the SPICE metric and human evaluation on Amazon Mechanical Turk show that our compositional module networks effectively generate accurate and detailed captions.

ICRA Conference 2018 Conference Paper

Social Attention: Modeling Attention in Human Crowds

  • Anirudh Vemula
  • Katharina Muelling
  • Jean Oh

Robots that navigate through human crowds need to be able to plan safe, efficient, and human-predictable trajectories. This is a particularly challenging problem, as it requires the robot to predict future human trajectories within a crowd where everyone implicitly cooperates with everyone else to avoid collisions. Previous approaches to human trajectory prediction have modeled the interactions between humans as a function of proximity. However, that is not necessarily true: some people in our immediate vicinity moving in the same direction may not be as important as others who are further away but might collide with us in the future. In this work, we propose Social Attention, a novel trajectory prediction model that captures the relative importance of each person when navigating in the crowd, irrespective of their proximity. We demonstrate the performance of our method against a state-of-the-art approach on two publicly available crowd datasets and analyze the trained attention model to gain a better understanding of which surrounding agents humans attend to when navigating in a crowd.
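The key departure from proximity-based models is that attention weights come from learned features rather than distance. A minimal sketch, with assumed dimensions rather than the paper's architecture:

```python
import torch
import torch.nn.functional as F

d = 32
q_proj, k_proj, v_proj = (torch.nn.Linear(d, d) for _ in range(3))

ego = torch.randn(1, d)      # ego pedestrian's hidden state
others = torch.randn(5, d)   # hidden states of surrounding pedestrians

# Scores come from learned features, not proximity, so a distant agent
# on a collision course can still receive a high weight.
scores = q_proj(ego) @ k_proj(others).t() / d ** 0.5
weights = F.softmax(scores, dim=-1)           # (1, 5) attention weights
social_context = weights @ v_proj(others)     # aggregated crowd context
```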

ICRA Conference 2017 Conference Paper

Modeling cooperative navigation in dense human crowds

  • Anirudh Vemula
  • Katharina Muelling
  • Jean Oh

For robots to be a part of our daily life, they need to be able to navigate among crowds not only safely but also in a socially compliant fashion. This is a challenging problem because humans tend to navigate by implicitly cooperating with one another to avoid collisions while heading toward their respective destinations. Previous approaches have used hand-crafted functions based on proximity to model human-human and human-robot interactions. However, these approaches can only model simple interactions and fail to generalize to complex crowded settings. In this paper, we develop an approach that models the joint distribution over the future trajectories of all interacting agents in the crowd through a local interaction model that we train using real human trajectory data. The interaction model infers the velocity of each agent based on the spatial orientation of other agents in its vicinity. During prediction, our approach infers the goal of an agent from its past trajectory and uses the learned model to predict its future trajectory. We demonstrate the performance of our method against a state-of-the-art approach on a public dataset and show that our model outperforms it when predicting future trajectories over longer horizons.

AAAI Conference 2017 Conference Paper

Vision-Language Fusion for Object Recognition

  • Sz-Rung Shiang
  • Stephanie Rosenthal
  • Anatole Gershman
  • Jaime Carbonell
  • Jean Oh

While recent advances in computer vision have caused object recognition rates to spike, there is still much room for improvement. In this paper, we develop an algorithm to improve object recognition by integrating human-generated contextual information with vision algorithms. Specifically, we examine how interactive systems such as robots can utilize two types of context information: verbal descriptions of an environment and human-labeled datasets. We propose a re-ranking schema, MultiRank, for object recognition that can efficiently combine such information with the computer vision results. In our experiments, we achieve up to 9.4% and 16.6% accuracy improvements using the oracle and the detected bounding boxes, respectively, over the vision-only recognizers. We conclude that our algorithm has the ability to make a significant impact on object recognition in robotics and beyond.

IJCAI Conference 2016 Conference Paper

Learning Qualitative Spatial Relations for Robotic Navigation

  • Abdeslam Boularias
  • Felix Duvallet
  • Jean Oh
  • Anthony Stentz

We consider the problem of robots following natural language commands through previously unknown outdoor environments. A robot receives commands in natural language, such as Navigate around the building to the car left of the fire hydrant and near the tree. The robot first needs to classify its surrounding objects into categories, using images obtained from its sensors. The result of this classification is a map of the environment, where each object is given a list of semantic labels, such as tree or car, with varying degrees of confidence. Then, the robot needs to ground the nouns in the command, i.e., map each noun in the command onto a physical object in the environment. The robot also needs to ground a specified navigation mode, such as navigate quickly or navigate covertly, as a cost map. In this work, we show how to ground nouns and navigation modes by learning from examples demonstrated by humans.

SoCS Conference 2016 Conference Paper

Path Planning in Dynamic Environments with Adaptive Dimensionality

  • Anirudh Vemula
  • Katharina Muelling
  • Jean Oh

Path planning in the presence of dynamic obstacles is a challenging problem due to the added time dimension in search space. In approaches that ignore the time dimension and treat dynamic obstacles as static, frequent re-planning is unavoidable as the obstacles move, and their solutions are generally sub-optimal and can be incomplete. To achieve both optimality and completeness, it is necessary to consider the time dimension during planning. The notion of adaptive dimensionality has been successfully used in high-dimensional motion planning such as manipulation of robot arms, but has not been used in the context of path planning in dynamic environments. In this paper, we apply the idea of adaptive dimensionality to speed up path planning in dynamic environments for a robot with no assumptions on its dynamic model. Specifically, our approach considers the time dimension only in those regions of the environment where a potential collision may occur, and plans in a low-dimensional state-space elsewhere. We show that our approach is complete and is guaranteed to find a solution, if one exists, within a cost sub-optimality bound. We experimentally validate our method on the problem of 3D vehicle navigation (x, y, heading) in dynamic environments. Our results show that the presented approach achieves substantial speedups in planning time over 4D heuristic-based A*, especially when the resulting plan deviates significantly from the one suggested by the heuristic.

ICRA Conference 2015 Conference Paper

Grounding spatial relations for outdoor robot navigation

  • Abdeslam Boularias
  • Felix Duvallet
  • Jean Oh
  • Anthony Stentz

We propose a language-driven navigation approach for commanding mobile robots in outdoor environments. We consider unknown environments that contain previously unseen objects. The proposed approach aims at making interactions in human-robot teams natural. Robots receive from human teammates commands in natural language, such as “Navigate around the building to the car left of the fire hydrant and near the tree”. A robot needs first to classify its surrounding objects into categories, using images obtained from its sensors. The result of this classification is a map of the environment, where each object is given a list of semantic labels, such as “tree” and “car”, with varying degrees of confidence. Then, the robot needs to ground the nouns in the command. Grounding, the main focus of this paper, is mapping each noun in the command into a physical object in the environment. We use a probabilistic model for interpreting the spatial relations, such as “left of” and “near”. The model is learned from examples provided by humans. For each noun in the command, a distribution on the objects in the environment is computed by combining spatial constraints with a prior given as the semantic classifier's confidence values. The robot needs also to ground the navigation mode specified in the command, such as “navigate quickly” and “navigate covertly”, as a cost map. The cost map is also learned from examples, using Inverse Optimal Control (IOC). The cost map and the grounded goal are used to generate a path for the robot. This approach is evaluated on a robot in a real-world environment. Our experiments clearly show that the proposed approach is efficient for commanding outdoor robots.
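The grounding step described above is, at its core, a Bayesian combination of a spatial-relation likelihood with the semantic classifier's prior. A minimal sketch with made-up numbers:

```python
import numpy as np

def ground_noun(spatial_likelihood, semantic_prior):
    # Posterior over candidate objects for one noun: spatial-relation
    # likelihood times the semantic classifier's confidence, normalized.
    post = np.asarray(spatial_likelihood) * np.asarray(semantic_prior)
    return post / post.sum()

# Three candidates for "the car left of the fire hydrant":
spatial_likelihood = [0.7, 0.2, 0.1]   # fit to "left of the fire hydrant"
semantic_prior = [0.5, 0.4, 0.1]       # classifier confidence of "car"
print(ground_noun(spatial_likelihood, semantic_prior))
```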

IROS Conference 2015 Conference Paper

Inferring door locations from a teammate's trajectory in stealth human-robot team operations

  • Jean Oh
  • Luis Ernesto Navarro-Serment
  • Arne Suppé
  • Anthony Stentz
  • Martial Hebert

Robot perception is generally viewed as the interpretation of data from various types of sensors such as cameras. In this paper, we study indirect perception, where a robot can perceive new information by making inferences from non-visual observations of human teammates. As a proof-of-concept study, we specifically focus on a door detection problem in a stealth mission setting, where a team operation must not be exposed to the visibility of the team's opponents. We use a special type of Noisy-OR Bayesian inference network, known as the BN2O model, to represent the inter-visibility and to infer the locations of doors, i.e., potential locations of the opponents. Experimental results on both synthetic data and real person-tracking data achieve an F-measure of over 0.9 on average, suggesting further investigation into the use of non-visual perception in human-robot team operations.
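For reference, the Noisy-OR combination underlying the BN2O model can be sketched in a few lines. This is a generic leaky Noisy-OR, not the paper's full two-layer network:

```python
import numpy as np

def noisy_or(parent_probs, leak=0.01):
    # P(effect) when each active cause independently triggers the effect
    # with its own probability, plus a small leak term.
    parent_probs = np.asarray(parent_probs)
    return 1.0 - (1.0 - leak) * np.prod(1.0 - parent_probs)

# Three trajectory observations, each weakly suggesting a hidden door:
print(noisy_or([0.3, 0.5, 0.2]))   # combined belief that a door is there
```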

AAAI Conference 2015 Conference Paper

Toward Mobile Robots Reasoning Like Humans

  • Jean Oh
  • Arne Suppé
  • Felix Duvallet
  • Abdeslam Boularias
  • Luis Navarro-Serment
  • Martial Hebert
  • Anthony Stentz
  • Jerry Vinokurov

Robots are increasingly becoming key players in human-robot teams. To become effective teammates, robots must possess profound understanding of an environment, be able to reason about the desired commands and goals within a specific context, and be able to communicate with human teammates in a clear and natural way. To address these challenges, we have developed an intelligence architecture that combines cognitive components to carry out high-level cognitive tasks, semantic perception to label regions in the world, and a natural language component to reason about the command and its relationship to the objects in the world. This paper describes recent developments using this architecture on a fielded mobile robot platform operating in unknown urban environments. We report a summary of extensive outdoor experiments; the results suggest that a multidisciplinary approach to robotics has the potential to create competent human-robot teams.

AAMAS Conference 2013 Conference Paper

Enhancing Robot Perception Using Human Teammates

  • Jean Oh
  • Arne Suppe
  • Anthony Stentz
  • Martial Hebert

In robotics research, perception is one of the most challenging tasks. In contrast to existing approaches that rely only on computer vision, we propose an alternative method for improving perception by learning from human teammates. To evaluate, we apply this idea to a door detection problem. A set of preliminary experiments has been completed using software agents with real vision data. Our results demonstrate that information inferred from teammate observations significantly improves the perception precision.

AAMAS Conference 2012 Conference Paper

A cognitive architecture for emergency response

  • Felipe Meneguzzi
  • Siddharth Mehrotra
  • James Tittle
  • Jean Oh
  • Nilanjan Chakraborty
  • Katia Sycara
  • Michael Lewis

Plan recognition, cognitive workload estimation, and human assistance have been extensively studied in the AI and human factors communities, but have seldom been integrated and evaluated as complete systems. In this paper, we develop an assistant agent architecture integrating plan recognition, current and future user information needs, workload estimation, and adaptive information presentation to aid an emergency response manager in making high-quality decisions under time stress while avoiding cognitive overload. We describe its main components as well as results from an experiment simulating various possible executions of emergency response plans used in the real world, comparing the reaction times of an assisted versus an unassisted human.

IJCAI Conference 2011 Conference Paper

An Agent Architecture for Prognostic Reasoning Assistance

  • Jean Oh
  • Felipe Meneguzzi
  • Katia Sycara
  • Timothy J. Norman

In this paper we describe a software assistant agent that can proactively assist human users situated in a time-constrained environment to perform normative reasoning (reasoning about prohibitions and obligations) so that the user can focus on her planning objectives. In order to provide proactive assistance, the agent must be able to 1) recognize the user's planned activities, 2) reason about potential needs of assistance associated with those predicted activities, and 3) plan to provide appropriate assistance suitable for newly identified user needs. To address these specific requirements, we develop an agent architecture that integrates user intention recognition, normative reasoning over a user's intention, and planning, execution, and replanning for assistive actions. This paper presents the agent architecture and discusses practical applications of this approach.

AAMAS Conference 2011 Conference Paper

Prognostic Normative Reasoning in Coalition Planning

  • Jean Oh
  • Felipe Meneguzzi
  • Katia Sycara
  • Timothy J. Norman

In this paper we describe a software assistant agent that can proactively assist human users situated in a time-constrained coalition environment. Cognitive workload increases significantly when the user must cope not only with a complex environment but also with a set of unaccustomed rules that prescribe how the coalition planning process must be carried out. In this context, we introduce the notion of prognostic norm reasoning to predict the user's likely normative violations, allowing the assistant agent to plan and take remedial actions before the violations actually occur. To the best of our knowledge, our approach is the first that manages norms in a proactive and autonomous manner.

ECAI Conference 2010 Conference Paper

ANTIPA: an agent architecture for intelligent information assistance

  • Jean Oh
  • Felipe Meneguzzi
  • Katia P. Sycara

Human users trying to plan and accomplish information-dependent goals in highly dynamic environments with prevalent uncertainty must consult various types of information sources in their decision-making processes while the information requirements change as they plan and re-plan. When the users must make time-critical decisions in information-intensive tasks they become cognitively overloaded not only by the planning activities but also by the information-gathering activities at various points in the planning process. We have developed the ANTicipatory Information and Planning Agent (ANTIPA) to manage information adaptively in order to mitigate user cognitive overload. To this end, the agent brings information to the user as a result of user requests but most crucially, it proactively predicts the user's prospective information needs by recognizing the user's plan; pre-fetches information that is likely to be used in the future; and offers the information when it is relevant to the current or future planning decisions. This paper introduces a fully implemented agent of the ANTIPA architecture using a decision-theoretic user model.

AAMAS Conference 2008 Conference Paper

A Few Good Agents: Multi-Agent Social Learning

  • Jean Oh
  • Stephen Smith

In this paper, we investigate multi-agent learning (MAL) in a multi-agent resource selection problem (MARS) in which a large group of agents are competing for common resources. Since agents in such a setting are self-interested, MAL in MARS domains typically focuses on the convergence to a set of non-cooperative equilibria. As seen in the example of the prisoner's dilemma, however, selfish equilibria are not necessarily optimal with respect to the natural objective function of a target problem, e.g., resource utilization in the case of MARS. Conversely, a centrally administered optimization of physically distributed agents is infeasible in many real-life applications such as transportation traffic problems. In order to explore the possibility of a middle-ground solution, we analyze two types of costs for evaluating MAL algorithms in this context. The quality loss of a selfish algorithm can be quantitatively measured by the price of anarchy, i.e., the ratio of the objective function value of a selfish solution to that of an optimal solution. Analogously, we introduce the price of monarchy of a learning algorithm to quantify the practical cost of coordination in terms of communication cost. We then introduce a multi-agent social learning approach named A Few Good Agents (AFGA) that motivates self-interested agents to cooperate with one another to reduce the price of anarchy, while bounding the price of monarchy at the same time. A preliminary set of experiments on the El Farol bar problem, a simple example of MARS, shows promising results.
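The price of anarchy admits a compact formalization. The sketch below follows the abstract's definition, assuming a maximization objective f; the price of monarchy is described only qualitatively (as communication cost), so no closed form is given for it here.

```latex
% Price of anarchy of a learning algorithm A, per the definition above:
% the objective value of the selfish solution s_A it converges to,
% relative to a centrally optimized solution s*.
\[
  \mathrm{PoA}(A) \;=\; \frac{f(s_A)}{f(s^{*})}
\]
```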