Arrow Research search

Author name cluster

Ruohan Gao

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

7 papers
2 author rows

Possible papers

7

AAAI Conference 2025 Conference Paper

Multisensory Machine Intelligence

  • Ruohan Gao

The future of Artificial Intelligence demands a paradigm shift towards multisensory perception—to systems that can digest ongoing multisensory observations, that can discover structure in unlabeled raw sensory data, and that can intelligently fuse useful information from different sensory modalities for decision making. While we humans perceive the world by looking, listening, touching, smelling, and tasting, traditional forms of machine intelligence mostly focus on a single sensory modality, particularly vision. Therefore, my research, which I call multisensory machine intelligence, aims to empower machines to emulate and enhance human capabilities in seeing, hearing, and feeling, ultimately enabling them to comprehensively perceive, understand, and interact with the multisensory world. In my AAAI-25 new faculty highlight talk, I will present my research that studies two important aspects of the multisensory world: 1) multisensory objects, and 2) multisensory space. In both aspects, I will talk about how we design systems to reliably capture multisensory data from real-world objects and space, how we effectively model them with differentiable simulation algorithms that build a unified multisensory representation to virtualize real objects, and how we explore creative cross-modal/multi-modal applications with sight, sound, and touch in vision, graphics, and robotics. In the end, I will briefly conclude with my future plans.

ECAI Conference 2024 Conference Paper

VMFTransformer: An Angle-Preserving and Auto-Scaling Machine for Multi-Horizon Probabilistic Forecasting

  • Yunyi Zhou
  • Ruohan Gao
  • Xinping Zheng
  • Yuchen Huang
  • Zhixuan Chu

As deep learning develops, the major research methodologies of time series forecasting can be divided into two categories, i.e., iterative and direct methods. In the iterative methods, since a small amount of error is produced at each time step, the recursive structure can potentially lead to large error accumulation over longer forecasting horizons. Although the direct methods avoid this pitfall of the iterative methods, they typically assume conditional independence among future time points. This impractical assumption can also lead to biased models. To solve these challenges, we propose a direct approach for multi-horizon probabilistic forecasting, which can effectively characterize the dependence across future horizons. Specifically, we consider the multi-horizon target as a random vector. The direction of the vector embodies the temporal dependence, and the length of the vector measures the overall scale across each horizon. Therefore, we apply the von Mises-Fisher (VMF) distribution to characterize the target vector's angle and the truncated normal distribution to characterize its magnitude in our model. Extensive results demonstrate the superiority of our framework over eight state-of-the-art methods.
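As a rough illustration of the decomposition described above (not code from the paper; all function names and parameters are hypothetical), the sketch below scores a multi-horizon target vector by evaluating its direction under a von Mises-Fisher distribution and its length under a truncated normal distribution.

```python
import numpy as np
from scipy.special import ive          # exponentially scaled modified Bessel function I_v
from scipy.stats import truncnorm

def vmf_logpdf(x_unit, mu, kappa):
    """Log-density of a von Mises-Fisher distribution on the unit sphere S^(d-1).

    x_unit, mu : unit-norm vectors of dimension d
    kappa      : concentration parameter (> 0)
    """
    d = mu.shape[0]
    v = d / 2.0 - 1.0
    # log C_d(kappa), using ive for numerical stability: I_v(kappa) = ive(v, kappa) * exp(kappa)
    log_norm = v * np.log(kappa) - (d / 2.0) * np.log(2 * np.pi) \
               - (np.log(ive(v, kappa)) + kappa)
    return log_norm + kappa * mu @ x_unit

def multi_horizon_loglik(y_future, mu_dir, kappa, loc, scale, upper=np.inf):
    """Score a multi-horizon target y_future = (y_{t+1}, ..., y_{t+H}):
    direction under a vMF distribution, length under a normal truncated to [0, upper]."""
    r = np.linalg.norm(y_future)
    direction = y_future / r
    a, b = (0.0 - loc) / scale, (upper - loc) / scale   # standardized truncation bounds
    return vmf_logpdf(direction, mu_dir, kappa) + truncnorm.logpdf(r, a, b, loc=loc, scale=scale)

# Example: a 4-step-ahead target scored against predicted distribution parameters.
y = np.array([1.2, 1.5, 1.1, 0.9])
mu = np.array([0.5, 0.6, 0.45, 0.43]); mu /= np.linalg.norm(mu)
print(multi_horizon_loglik(y, mu, kappa=10.0, loc=2.0, scale=0.5))
```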

ICLR Conference 2023 Conference Paper

An Extensible Multi-modal Multi-task Object Dataset with Materials

  • Trevor Standley
  • Ruohan Gao
  • Dawn Chen
  • Jiajun Wu 0001
  • Silvio Savarese

We present EMMa, an Extensible, Multimodal dataset of Amazon product listings that contains rich Material annotations. It contains more than 2.8 million objects, each with image(s), listing text, mass, price, product ratings, and position in Amazon’s product-category taxonomy. We also design a comprehensive taxonomy of 182 physical materials (e.g., Plastic → Thermoplastic → Acrylic). Objects are annotated with one or more materials from this taxonomy. With the numerous attributes available for each object, we develop a Smart Labeling framework to quickly add new binary labels to all objects with very little manual labeling effort, making the dataset extensible. Each object attribute in our dataset can be included in either the model inputs or outputs, leading to combinatorial possibilities in task configurations. For example, we can train a model to predict the object category from the listing text, or the mass and price from the product listing image. EMMa offers a new benchmark for multi-task learning in computer vision and NLP, and allows practitioners to efficiently add new tasks and object attributes at scale.
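To make the "attributes as inputs or outputs" idea concrete, here is a minimal, purely illustrative sketch (attribute names are hypothetical and not EMMa's actual schema) of how a task configuration over such a dataset could be expressed.

```python
from dataclasses import dataclass

# Hypothetical attribute set for an EMMa-style record; any subset can serve as model
# inputs and any disjoint subset as prediction targets, giving combinatorially many tasks.
ATTRIBUTES = ["image", "listing_text", "mass", "price", "rating", "category", "materials"]

@dataclass
class TaskConfig:
    inputs: list
    targets: list

    def __post_init__(self):
        assert set(self.inputs).isdisjoint(self.targets), "an attribute cannot be both input and target"
        unknown = (set(self.inputs) | set(self.targets)) - set(ATTRIBUTES)
        assert not unknown, f"unknown attributes: {unknown}"

# Two of the many possible tasks mentioned in the abstract:
category_from_text = TaskConfig(inputs=["listing_text"], targets=["category"])
mass_price_from_image = TaskConfig(inputs=["image"], targets=["mass", "price"])
```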

TMLR Journal 2023 Journal Article

Learning Object-Centric Neural Scattering Functions for Free-viewpoint Relighting and Scene Composition

  • Hong-Xing Yu
  • Michelle Guo
  • Alireza Fathi
  • Yen-Yu Chang
  • Eric Ryan Chan
  • Ruohan Gao
  • Thomas Funkhouser
  • Jiajun Wu

Photorealistic object appearance modeling from 2D images is a longstanding topic in vision and graphics. While neural implicit methods (such as Neural Radiance Fields) have shown high-fidelity view synthesis results, they cannot relight the captured objects. More recent neural inverse rendering approaches have enabled object relighting, but they represent surface properties as simple BRDFs, and therefore cannot handle translucent objects. We propose Object-Centric Neural Scattering Functions (OSFs) for learning to reconstruct object appearance from only images. OSFs not only support free-viewpoint object relighting, but also can model both opaque and translucent objects. While accurately modeling subsurface light transport for translucent objects can be highly complex and even intractable for neural methods, OSFs learn to approximate the radiance transfer from a distant light to an outgoing direction at any spatial location. This approximation avoids explicitly modeling complex subsurface scattering, making learning a neural implicit model tractable. Experiments on real and synthetic data show that OSFs accurately reconstruct appearances for both opaque and translucent objects, allowing faithful free-viewpoint relighting as well as scene composition. In our supplementary material, we include a video for an overview. Project website with video results: https://kovenyu.com/OSF/
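As a loose sketch of the interface such a representation implies (this is not the paper's architecture; layer sizes and outputs are assumptions), an object-centric scattering function can be thought of as a neural field that maps a 3D point, a distant-light direction, and an outgoing direction to a radiance-transfer value plus a density for volume rendering.

```python
import torch
import torch.nn as nn

# Minimal sketch, not the paper's model: a neural field that, per the abstract, maps a
# 3D location, a distant-light direction, and an outgoing direction to an RGB
# radiance-transfer value, plus a density used for compositing along camera rays.
class ObjectScatteringField(nn.Module):
    def __init__(self, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3 + 3 + 3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 4),                 # 3 transfer channels + 1 density
        )

    def forward(self, x, light_dir, view_dir):
        out = self.mlp(torch.cat([x, light_dir, view_dir], dim=-1))
        transfer = torch.sigmoid(out[..., :3])    # fraction of incoming radiance re-emitted
        density = torch.relu(out[..., 3:])        # volume density for ray compositing
        return transfer, density

# Relighting then amounts to weighting the transfer by each light's intensity and
# compositing along rays, as in standard neural volume rendering.
field = ObjectScatteringField()
x = torch.rand(1024, 3); l = torch.randn(1024, 3); v = torch.randn(1024, 3)
transfer, density = field(x, l / l.norm(dim=-1, keepdim=True), v / v.norm(dim=-1, keepdim=True))
```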

ICRA Conference 2023 Conference Paper

Sonicverse: A Multisensory Simulation Platform for Embodied Household Agents that See and Hear

  • Ruohan Gao
  • Hao Li 0076
  • Gokul Dharan
  • Zhuzhu Wang
  • Chengshu Li 0002
  • Fei Xia 0002
  • Silvio Savarese
  • Li Fei-Fei 0001

Developing embodied agents in simulation has been a key research topic in recent years. Exciting new tasks, algorithms, and benchmarks have been developed in various simulators. However, most of them assume deaf agents in silent environments, while we humans perceive the world with multiple senses. We introduce Sonicverse, a multisensory simulation platform with integrated audio-visual simulation for training household agents that can both see and hear. Sonicverse models realistic continuous audio rendering in 3D environments in real time. Together with a new audio-visual VR interface that allows humans to interact with agents via audio, Sonicverse enables a series of embodied AI tasks that need audio-visual perception. For semantic audio-visual navigation in particular, we also propose a new multi-task learning model that achieves state-of-the-art performance. In addition, we demonstrate Sonicverse's realism via sim-to-real transfer, which has not been achieved by other simulators: an agent trained in Sonicverse can successfully perform audio-visual navigation in real-world environments. Sonicverse is available at: https://github.com/StanfordVL/Sonicverse.

NeurIPS Conference 2023 Conference Paper

SoundCam: A Dataset for Finding Humans Using Room Acoustics

  • Mason Wang
  • Samuel Clarke
  • Jui-Hsien Wang
  • Ruohan Gao
  • Jiajun Wu

A room’s acoustic properties are a product of the room’s geometry, the objects within the room, and their specific positions. A room’s acoustic properties can be characterized by its room impulse response (RIR) between a source and listener location, or roughly inferred from recordings of natural signals present in the room. Variations in the positions of objects in a room can effect measurable changes in the room’s acoustic properties, as characterized by the RIR. Existing datasets of RIRs either do not systematically vary positions of objects in an environment, or they consist of only simulated RIRs. We present SoundCam, the largest dataset of unique RIRs from in-the-wild rooms publicly released to date. It includes 5,000 10-channel real-world measurements of room impulse responses and 2,000 10-channel recordings of music in three different rooms, including a controlled acoustic lab, an in-the-wild living room, and a conference room, with different humans in positions throughout each room. We show that these measurements can be used for interesting tasks, such as detecting and identifying humans, and tracking their positions.
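For readers unfamiliar with RIRs, the following toy sketch (synthetic signals only, not SoundCam's loading code) shows the basic relationship the dataset is built around: the room acts as a linear filter, so a microphone recording is the dry source signal convolved with the RIR between source and listener.

```python
import numpy as np
from scipy.signal import fftconvolve

fs = 48_000                                   # sample rate in Hz
rng = np.random.default_rng(0)

# Stand-in RIR: a direct-path impulse followed by an exponentially decaying diffuse tail.
t = np.arange(int(0.3 * fs)) / fs
rir = np.zeros_like(t)
rir[0] = 1.0
rir += 0.05 * rng.standard_normal(t.shape) * np.exp(-t / 0.08)

dry = rng.standard_normal(fs)                         # 1 s of a stand-in source signal
mic = fftconvolve(dry, rir, mode="full")[: len(dry)]  # what the microphone would record

# Moving a person in the room changes the RIR, and hence `mic`, which is the signal
# the SoundCam tasks (detecting, identifying, and tracking humans) exploit.
```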

ICLR Conference 2021 Conference Paper

Learning to Set Waypoints for Audio-Visual Navigation

  • Changan Chen
  • Sagnik Majumder
  • Ziad Al-Halah
  • Ruohan Gao
  • Santhosh Kumar Ramakrishnan
  • Kristen Grauman

In audio-visual navigation, an agent intelligently travels through a complex, unmapped 3D environment using both sights and sounds to find a sound source (e.g., a phone ringing in another room). Existing models learn to act at a fixed granularity of agent motion and rely on simple recurrent aggregations of the audio observations. We introduce a reinforcement learning approach to audio-visual navigation with two key novel elements: 1) waypoints that are dynamically set and learned end-to-end within the navigation policy, and 2) an acoustic memory that provides a structured, spatially grounded record of what the agent has heard as it moves. Both new ideas capitalize on the synergy of audio and visual data for revealing the geometry of an unmapped space. We demonstrate our approach on two challenging datasets of real-world 3D scenes, Replica and Matterport3D. Our model improves the state of the art by a substantial margin, and our experiments reveal that learning the links between sights, sounds, and space is essential for audio-visual navigation.
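The acoustic memory described above can be pictured as a spatial map of past audio observations. Below is a small, entirely hypothetical sketch (not the paper's implementation) of such a memory as an egocentric 2D grid that a navigation policy could read alongside its visual maps.

```python
import numpy as np

# Hypothetical acoustic memory: a 2D grid where each cell keeps the strongest audio
# intensity the agent observed while standing there, giving a spatially grounded
# record of what it has heard as it moves.
class AcousticMemory:
    def __init__(self, size=20, cell_m=1.0):
        self.size, self.cell_m = size, cell_m
        self.grid = np.zeros((size, size), dtype=np.float32)

    def update(self, agent_xy, intensity):
        """Write the current audio intensity into the cell under the agent."""
        i = int(np.clip(agent_xy[0] / self.cell_m + self.size // 2, 0, self.size - 1))
        j = int(np.clip(agent_xy[1] / self.cell_m + self.size // 2, 0, self.size - 1))
        self.grid[i, j] = max(self.grid[i, j], intensity)

    def as_observation(self):
        return self.grid[None]          # channel-first map fed to the policy

memory = AcousticMemory()
memory.update(agent_xy=(0.0, 0.0), intensity=0.8)
memory.update(agent_xy=(2.0, -1.0), intensity=0.3)
obs = memory.as_observation()           # consumed alongside visual maps when setting waypoints
```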