Author name cluster

Yu Hao

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

14 papers
2 author rows

Possible papers (14)

IROS 2025 Conference Paper

Socially-Aware Robot Navigation Enhanced by Bidirectional Natural Language Conversations Using Large Language Models

  • Congcong Wen
  • Yifan Liu
  • Geeta Chandra Raju Bethala
  • Shuaihang Yuan
  • Hao Huang 0003
  • Yu Hao
  • Mengyu Wang 0001
  • Yu-Shen Liu

Robotic navigation plays a pivotal role in a wide range of real-world applications. While traditional navigation systems focus on efficiency and obstacle avoidance, their inability to model complex human behaviors in shared spaces has underscored the growing need for socially aware navigation. In this work, we explore a novel paradigm of socially aware robot navigation empowered by large language models (LLMs), and propose HSAC-LLM, a hybrid framework that seamlessly integrates deep reinforcement learning with the reasoning and communication capabilities of LLMs. Unlike prior approaches that passively predict pedestrian trajectories or issue pre-scripted alerts, HSAC-LLM enables bidirectional natural language interaction, allowing robots to proactively engage in dialogue with pedestrians to resolve potential conflicts and negotiate path decisions. Extensive evaluations across 2D simulations, Gazebo environments, and real-world deployments demonstrate that HSAC-LLM consistently outperforms state-of-the-art DRL baselines under our proposed socially aware navigation metric, which covers safety, efficiency, and human comfort. By bridging linguistic reasoning and interactive motion planning, our results highlight the potential of LLM-augmented agents for robust, adaptive, and human-aligned navigation in real-world settings. Project page: https://hsacllm.github.io/.
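
The abstract gives no implementation details; as a minimal sketch of the hybrid idea only, the loop below (all names hypothetical, the LLM call stubbed out) shows how a DRL policy might hand control to a language-negotiation step when a pedestrian conflict is detected.

```python
import numpy as np

def drl_policy(state: np.ndarray) -> np.ndarray:
    """Stand-in for a trained DRL policy: maps a state to a velocity command."""
    return np.tanh(state[:2])

def llm_negotiate(situation: str) -> str:
    """Stub for the bidirectional dialogue step; a real system would run a
    conversation with the pedestrian through a chat model."""
    return "PASS_LEFT"  # e.g., the maneuver agreed on in the conversation

def step(state: np.ndarray, pedestrian_conflict: bool) -> np.ndarray:
    if pedestrian_conflict:
        decision = llm_negotiate("Robot and pedestrian approach a narrow corridor.")
        # Map the negotiated outcome back to a motion command.
        return np.array([0.2, 0.5]) if decision == "PASS_LEFT" else np.array([0.2, -0.5])
    return drl_policy(state)

print(step(np.array([0.4, -0.1, 1.0]), pedestrian_conflict=True))
```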

IROS 2024 Conference Paper

ChatMap: A Wearable Platform Based on the Multi-modal Foundation Model to Augment Spatial Cognition for People with Blindness and Low Vision

  • Yu Hao
  • Alexey Magay
  • Hao Huang 0003
  • Shuaihang Yuan
  • Congcong Wen
  • Yi Fang 0006

Spatial cognition refers to the ability to acquire knowledge about one's surroundings and to use this information to identify one's location, find resources, and navigate back to familiar places. People with blindness and low vision (pBLV) face significant challenges with spatial cognition due to its reliance on visual input. Without the full range of visual cues, pBLV individuals often find it difficult to form a comprehensive understanding of their environment, leading to difficulties in scene recognition and precise object localization, especially in unfamiliar settings. This limitation extends to their ability to independently detect and avoid potential tripping hazards, making navigation and interaction with their environment more challenging. In this paper, we present a pioneering wearable platform tailored to enhance the spatial cognition of pBLV through the integration of a multi-modal foundation model. The proposed platform integrates a wearable camera with an audio module and leverages the advanced capabilities of vision-language foundation models (i.e., GPT-4 and GPT-4V) for the nuanced processing of visual and textual data. Specifically, we employ vision-language models to bridge the gap between visual information and the proprioception of visually impaired users, offering more intelligible guidance by aligning visual data with the natural perception of space and movement. We then apply prompt engineering to guide the large language model to act as an assistant tailored specifically to pBLV users and produce accurate answers. Another innovation in our model is the incorporation of a chain-of-thought reasoning process, which enhances the accuracy and interpretability of the model, facilitating more precise responses to complex user inquiries across diverse environmental contexts. To assess the practical impact of our proposed wearable platform, we carried out a series of real-world experiments across three tasks that are commonly challenging for people with blindness and low vision: risk assessment, object localization, and scene recognition. Additionally, through an ablation study conducted on the VizWiz dataset, we rigorously assess the contribution of each individual module, substantiating each module's integral role in the overall performance.
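
As an illustration only, the snippet below sketches the prompt-engineering and chain-of-thought idea described above, with hypothetical prompt wording and the vision-language model call stubbed out.

```python
SYSTEM_PROMPT = (
    "You are an assistant for a blind or low-vision user. "
    "Reason step by step before answering: (1) list visible objects and their "
    "rough positions, (2) identify tripping hazards, (3) give one short, "
    "actionable instruction."
)

def ask_assistant(image_description: str, question: str) -> str:
    """Stub for a GPT-4V-style call; a real system would send the camera
    frame itself rather than a text description."""
    prompt = f"{SYSTEM_PROMPT}\n\nScene: {image_description}\nUser: {question}"
    return f"[model response to: {prompt[:60]}...]"

print(ask_assistant("A hallway with a wet-floor sign two meters ahead.",
                    "Is it safe to walk forward?"))
```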

IROS 2024 Conference Paper

EMBOSR: Embodied Spatial Reasoning for Enhanced Situated Question Answering in 3D Scenes

  • Yu Hao
  • Fan Yang
  • Nicholas Fang
  • Yu-Shen Liu

3D Embodied Spatial Reasoning, which emphasizes an agent's interaction with its surroundings to infer spatial information, is naturally framed as Situated Question Answering in 3D Scenes (SQA3D). SQA3D requires an agent to comprehend its position and orientation within a 3D scene based on a textual situation and then use this understanding to answer questions about the surrounding environment in that context. Previous methods face substantial challenges, including a dependency on constant retraining on limited datasets, which leads to poor performance in unseen scenarios, limited expandability, and inadequate generalization. To address these challenges, we present a new embodied spatial reasoning paradigm for enhanced SQA3D that fuses the capabilities of foundation models with chain-of-thought methodology. This approach is designed to improve adaptability and scalability across a wide array of 3D environments. A new aspect of our model is the integration of a chain-of-thought reasoning process, which significantly augments the model's capability for spatial reasoning and complex query handling in diverse 3D environments. In structured experiments, we compare our approach against methods with varying architectures, demonstrating its efficacy on multiple tasks including SQA3D and 3D captioning. We also assess the informativeness of the generated answers to complex queries. Ablation studies further delineate the individual contributions of our method's components to its overall performance. The results consistently affirm the effectiveness and efficiency of the proposed method.
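
A minimal sketch of how a situated chain-of-thought prompt for SQA3D might be assembled (the prompt wording and scene format are assumptions, not the paper's):

```python
def build_sqa3d_prompt(scene_objects, situation, question):
    """Assemble a situated chain-of-thought prompt from a scene description."""
    obj_list = "; ".join(f"{o['name']} at {o['pos']}" for o in scene_objects)
    return (
        f"Scene objects: {obj_list}\n"
        f"Situation: {situation}\n"
        "First infer the agent's position and orientation from the situation, "
        "then reason step by step about the spatial layout, then answer: "
        f"{question}"
    )

scene = [{"name": "sofa", "pos": (1.0, 0.0)}, {"name": "lamp", "pos": (2.5, 1.0)}]
print(build_sqa3d_prompt(scene, "I am sitting on the sofa facing the lamp.",
                         "What is to my left?"))
```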

NeurIPS 2024 Conference Paper

GAMap: Zero-Shot Object Goal Navigation with Multi-Scale Geometric-Affordance Guidance

  • Shuaihang Yuan
  • Hao Huang
  • Yu Hao
  • Congcong Wen
  • Anthony Tzes
  • Yi Fang

Zero-Shot Object Goal Navigation (ZS-OGN) enables robots to navigate toward objects of unseen categories without prior training. Traditional approaches often rely on categorical semantic information for navigation guidance, which breaks down when only parts of objects are observed or when detailed, functional representations of the environment are lacking. To resolve these two issues, we propose Geometric-part and Affordance Maps (GAMap), a novel method that integrates object parts and affordance attributes for navigation guidance. Our method includes a multi-scale scoring approach that captures geometric-part and affordance attributes of objects at different scales. Comprehensive experiments on the HM3D and Gibson benchmark datasets demonstrate improvements in Success Rate and Success weighted by Path Length, underscoring the efficacy of our geometric-part and affordance-guided navigation approach in enhancing robot autonomy and versatility, without any additional task-specific training or fine-tuning on the semantics of unseen objects or the locomotion of the robot.
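
The multi-scale scoring could plausibly look like the sketch below, which scores square patches of a feature map at several scales against attribute embeddings and accumulates the best match into a guidance map (all shapes and names are assumptions, not the paper's code):

```python
import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)

def multiscale_score(feature_map, attribute_embs, scales=(1, 2, 4)):
    """Accumulate, per cell, the best attribute match of the patch covering it,
    averaged over several patch scales."""
    H, W, D = feature_map.shape
    score = np.zeros((H, W))
    for s in scales:
        for i in range(H - s + 1):
            for j in range(W - s + 1):
                patch = feature_map[i:i + s, j:j + s].reshape(-1, D).mean(axis=0)
                best = max(cosine(patch, a) for a in attribute_embs)
                score[i:i + s, j:j + s] += best / len(scales)
    return score

rng = np.random.default_rng(0)
fmap = rng.normal(size=(8, 8, 16))   # per-cell visual features (toy)
attrs = rng.normal(size=(5, 16))     # part/affordance attribute embeddings (toy)
gamap = multiscale_score(fmap, attrs)
print(gamap.shape, round(float(gamap.max()), 3))
```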

JBHI 2024 Journal Article

Multi-Domain Based Dynamic Graph Representation Learning for EEG Emotion Recognition

  • Hao Tang
  • Songyun Xie
  • Xinzhou Xie
  • Yujie Cui
  • Bohan Li
  • Dalu Zheng
  • Yu Hao
  • Xiangming Wang

Graph neural networks (GNNs) have demonstrated efficient processing of graph-structured data, making them a promising method for electroencephalogram (EEG) emotion recognition. However, due to dynamic functional connectivity and nonlinear relationships between brain regions, representing EEG as graph data remains a great challenge. To solve this problem, we propose a multi-domain based graph representation learning (MD²GRL) framework to model EEG signals as graph data. Specifically, MD²GRL leverages gated recurrent units (GRUs) and power spectral density (PSD) features to construct the node features of two subgraphs. A self-attention mechanism is then adopted to learn the similarity matrix between nodes and fuse it with the intrinsic spatial matrix of the EEG to compute the corresponding adjacency matrix. In addition, we introduce a learnable soft-thresholding operator that sparsifies the adjacency matrix to reduce noise in the graph structure. For the downstream task, we design a dual-branch GNN and incorporate spatial asymmetry for graph coarsening. We conduct experiments on the publicly available SEED and DEAP datasets, under both subject-dependent and subject-independent settings, to evaluate the performance of our model in emotion classification. Experimental results demonstrate that our method achieves state-of-the-art (SOTA) classification performance in both settings. Furthermore, visualization of the learned graph structure reveals EEG channel connections that are significantly related to emotion while suppressing irrelevant noise. These findings are consistent with established neuroscience research and demonstrate the potential of our approach for understanding the neural underpinnings of emotion.
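
As a rough sketch of the adjacency construction described above, with a fixed scalar threshold and fusion weight standing in for the paper's learnable parameters:

```python
import numpy as np

def soft_threshold(x, tau):
    """Soft-thresholding operator; learnable in the paper, fixed here."""
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def build_adjacency(node_feats, spatial_adj, tau=0.1, alpha=0.5):
    """Fuse an attention-style similarity matrix with the intrinsic spatial
    matrix, then sparsify the result by soft thresholding."""
    sim = node_feats @ node_feats.T / np.sqrt(node_feats.shape[1])
    sim = np.exp(sim) / np.exp(sim).sum(axis=1, keepdims=True)  # row-wise softmax
    fused = alpha * sim + (1 - alpha) * spatial_adj
    return soft_threshold(fused, tau)

rng = np.random.default_rng(0)
feats = rng.normal(size=(6, 8))                     # e.g., GRU/PSD node features
spatial = (rng.random((6, 6)) > 0.5).astype(float)  # electrode proximity (toy)
print(build_adjacency(feats, spatial).round(2))
```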

ICRA 2024 Conference Paper

Noisy Few-shot 3D Point Cloud Scene Segmentation

  • Hao Huang 0003
  • Shuaihang Yuan
  • Congcong Wen
  • Yu Hao
  • Yi Fang 0006

3D scene semantic segmentation plays a crucial role in robotics by enabling robots to understand and interpret their environment in a detailed and context-aware manner, facilitating tasks such as navigation, object manipulation, and interaction within complex spaces. Most existing methods adopt a fully supervised framework for 3D point cloud scene semantic segmentation. Such paradigms depend on extensive labeled datasets, which are hard to acquire, and cannot segment novel classes, especially when the training data are contaminated by noisy samples. To address these limitations, this study introduces a novel few-shot segmentation approach that robustly segments 3D point cloud scenes with noisy labels using a meta-learning scheme. Specifically, we first build a multi-prototype graph and then suppress samples with noisy labels based on the graph structure. A subgraph bagging scheme then conducts semi-supervised transductive learning to propagate labels. To optimize the graph structure and learn discriminative prototype features, we design a triplet contrastive loss that increases the compactness of these subgraphs. We evaluated our method on two widely used 3D point cloud scene segmentation benchmarks under few-shot (i.e., 2/3-way 5-shot) settings with noisy samples. Experimental results demonstrate the improvement of our method over the compared baselines, illustrating its robustness against noisy samples in few-shot 3D scene segmentation. The code is available at: https://github.com/hhuang-code/Noisy_Fewshot_Segmentation.
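
The triplet contrastive loss mentioned above is a standard construction; a minimal version over multi-prototype features might look like this (margin and shapes are illustrative):

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.5):
    """Hinge-style triplet loss: pull same-class prototypes together and push
    different-class prototypes at least `margin` further away."""
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(0.0, d_pos - d_neg + margin)

rng = np.random.default_rng(0)
protos = {c: rng.normal(size=(3, 16)) for c in ("chair", "table")}  # multi-prototype
loss = triplet_loss(protos["chair"][0], protos["chair"][1], protos["table"][0])
print(round(loss, 3))
```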

IROS 2024 Conference Paper

Weakly Supervised Scene Segmentation Using Efficient Transformer

  • Hao Huang 0003
  • Shuaihang Yuan
  • Congcong Wen
  • Yu Hao
  • Yi Fang 0006

Current methods for large-scale point cloud scene semantic segmentation rely on manually annotated dense point-wise labels, which are costly, labor-intensive, and prone to errors. Consequently, gathering point cloud scenes with billions of labeled points is impractical in real-world scenarios. In this paper, we introduce a novel weak supervision approach to semantically segment large-scale indoor scenes, requiring only 1‰ of the points to be labeled. Specifically, we develop an efficient point-neighbor Transformer to capture the geometry of local point cloud patches. To address the quadratic complexity of self-attention in Transformers, particularly for large-scale point clouds, we propose approximating the self-attention matrix using a low-rank plus sparse decomposition. Building on the point-neighbor Transformer as the foundational block, we design a Low-rank Sparse Transformer Network (LST-Net) for weakly supervised large-scale point cloud scene semantic segmentation. Experimental results on two commonly used indoor point cloud scene segmentation benchmarks demonstrate that our model achieves performance comparable to that of both weakly supervised and fully supervised methods. Our code can be found at https://github.com/hhuang-code/LST-Net.
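
To make the low-rank-plus-sparse idea concrete, here is a toy decomposition of an attention score matrix: a truncated SVD plus a near-diagonal sparse residual. Note this toy version still forms the full N×N matrix, which the actual method is designed to avoid; it only illustrates the decomposition itself.

```python
import numpy as np

def lowrank_sparse_attention(q, k, rank=4, window=2):
    """Approximate the attention score matrix as a rank-`rank` SVD term plus a
    sparse residual kept only near the diagonal (local neighbors)."""
    scores = q @ k.T / np.sqrt(q.shape[1])
    u, s, vt = np.linalg.svd(scores)
    lowrank = (u[:, :rank] * s[:rank]) @ vt[:rank]
    sparse = np.zeros_like(scores)
    n = scores.shape[0]
    for i in range(n):
        lo, hi = max(0, i - window), min(n, i + window + 1)
        sparse[i, lo:hi] = scores[i, lo:hi] - lowrank[i, lo:hi]
    return lowrank + sparse

rng = np.random.default_rng(0)
q, k = rng.normal(size=(10, 8)), rng.normal(size=(10, 8))
exact = q @ k.T / np.sqrt(8)
print(round(float(np.abs(lowrank_sparse_attention(q, k) - exact).mean()), 4))
```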

IROS 2023 Conference Paper

Understanding the Impact of Image Quality and Distance of Objects to Object Detection Performance

  • Yu Hao
  • Haoyang Pei
  • Yixuan Lyu
  • Zhongzheng Yuan
  • John-Ross Rizzo
  • Yao Wang 0001
  • Yi Fang 0006

Object detection is a fundamental task for autonomous driving, which aims to identify and localize objects within an image. Deep learning has made great strides in object detection, with popular models including Faster R-CNN, YOLO, and SSD. The detection accuracy and computational cost of object detection depend on the spatial resolution of an image, which may be constrained by both the camera and storage considerations. Furthermore, original images are often compressed and uploaded to a remote server for object detection. Compression is often achieved by reducing spatial resolution, amplitude resolution, or both, and each reduction has well-known effects on detection performance. Detection accuracy also depends on the distance of the object of interest from the camera. Our work examines the impact of spatial and amplitude resolution, as well as object distance, on object detection accuracy and computational cost. Because existing models are optimized for uncompressed (or lightly compressed) images over a narrow range of spatial resolutions, we develop a resolution-adaptive variant of YOLOv5 (RA-YOLO), which varies the number of scales in the feature pyramid and detection head based on the spatial resolution of the input image. To train and evaluate this method, we created a dataset of images with diverse spatial and amplitude resolutions by combining images from the TJU and Eurocity datasets and generating different resolutions through spatial resizing and compression. We first show that RA-YOLO achieves a good trade-off between detection accuracy and inference time over a large range of spatial resolutions. We then evaluate the impact of spatial and amplitude resolution on detection accuracy using the proposed RA-YOLO model, and demonstrate that the spatial resolution yielding the highest detection accuracy depends on the "tolerated" image size (constrained by the available bandwidth or storage). We further assess the impact of an object's distance from the camera on detection accuracy and show that higher spatial resolution enables a greater detection range. These results provide practical guidelines for choosing image spatial resolution and compression settings based on available bandwidth, storage, desired inference time, and/or desired detection range.
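
A resolution-adaptive detection head could be gated by a rule like the sketch below; the abstract does not give RA-YOLO's actual thresholds or scale counts, so these numbers are placeholders.

```python
def num_pyramid_scales(height: int, width: int) -> int:
    """Placeholder rule: fewer feature-pyramid scales for small inputs, more
    for large ones (thresholds are illustrative, not RA-YOLO's)."""
    short_side = min(height, width)
    if short_side < 512:
        return 2
    if short_side < 1024:
        return 3
    return 4

for h, w in [(480, 640), (720, 1280), (1080, 1920)]:
    print((h, w), "->", num_pyramid_scales(h, w), "scales")
```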

NeurIPS 2021 Conference Paper

KS-GNN: Keywords Search over Incomplete Graphs via Graph Neural Network

  • Yu Hao
  • Xin Cao
  • Yufan Sheng
  • Yixiang Fang
  • Wei Wang

Keyword search is a fundamental task for retrieving the information most relevant to the query keywords. Keyword search over graphs aims to find subtrees or subgraphs containing all query keywords, ranked according to some criteria. Existing studies all assume that the graphs have complete information. However, real-world graphs may have missing information (such as edges or keywords), which makes the problem much more challenging. To solve the problem of keyword search over incomplete graphs, we propose a novel model named KS-GNN, based on a graph neural network and an auto-encoder. By considering the latent relationships and frequencies of different keywords, the proposed KS-GNN alleviates the effect of missing information and learns low-dimensional representative node embeddings that preserve both graph structure and keyword features. Our model can effectively answer keyword search queries with linear time complexity over incomplete graphs. Experiments on four real-world datasets show that our model consistently outperforms state-of-the-art baseline methods on graphs with missing information.
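
Answering a query against learned embeddings in linear time might look like the following sketch (hypothetical embedding tables; the model's actual scoring function may differ):

```python
import numpy as np

def answer_keyword_query(node_embs, keyword_embs, query_keywords, top=3):
    """Score every node against the summed query-keyword embedding and return
    the best matches; a single pass over nodes, hence linear time."""
    q = sum(keyword_embs[w] for w in query_keywords)
    scores = node_embs @ q / (np.linalg.norm(node_embs, axis=1)
                              * np.linalg.norm(q) + 1e-8)
    return np.argsort(-scores)[:top]

rng = np.random.default_rng(0)
nodes = rng.normal(size=(100, 32))   # learned node embeddings (toy)
kw = {"database": rng.normal(size=32), "graph": rng.normal(size=32)}
print(answer_keyword_query(nodes, kw, ["database", "graph"]))
```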

IJCAI 2020 Conference Paper

Inductive Link Prediction for Nodes Having Only Attribute Information

  • Yu Hao
  • Xin Cao
  • Yixiang Fang
  • Xike Xie
  • Sibo Wang

Predicting the link between two nodes is a fundamental problem for graph data analytics. In attributed graphs, both the structure and attribute information can be utilized for link prediction. Most existing studies focus on transductive link prediction, where both nodes are already present in the graph. However, many real-world applications require inductive prediction for new nodes that have only attribute information. This is more challenging because the new nodes have no structure information and cannot be seen during model training. To solve this problem, we propose a model called DEAL, which consists of three components: two node embedding encoders and one alignment mechanism. The two encoders output the attribute-oriented node embedding and the structure-oriented node embedding, and the alignment mechanism aligns the two types of embeddings to build the connections between attributes and links. Our model DEAL is versatile in the sense that it works for both inductive and transductive link prediction. Extensive experiments on several benchmark datasets show that our proposed model significantly outperforms existing inductive link prediction methods, and also outperforms the state-of-the-art methods on transductive link prediction.
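
A minimal sketch of the inductive scoring path, assuming random stand-ins for the two trained encoders and the alignment matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
W_attr = rng.normal(size=(32, 16))   # attribute-oriented encoder (stand-in)
W_align = rng.normal(size=(16, 16))  # alignment between the two embedding spaces

def score_link(new_node_attrs, existing_struct_emb):
    """Inductive link score: embed the unseen node from attributes alone,
    align it into the structure space, then compare by inner product."""
    z = np.tanh(new_node_attrs @ W_attr) @ W_align
    return float(z @ existing_struct_emb)

new_attrs = rng.normal(size=32)      # a brand-new node: attributes only
struct_emb = rng.normal(size=16)     # structure embedding of an existing node
print(round(score_link(new_attrs, struct_emb), 3))
```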

AAAI 2019 Conference Paper

Exploiting Sentence Embedding for Medical Question Answering

  • Yu Hao
  • Xien Liu
  • Ji Wu
  • Ping Lv

Despite the great success of word embeddings, sentence embedding remains a not-well-solved problem. In this paper, we present a supervised learning framework that exploits sentence embeddings for the medical question answering task. The framework consists of two main parts: 1) a sentence embedding module and 2) a scoring module. The former is developed with contextual self-attention and multi-scale techniques to encode a sentence into an embedding tensor; we call this module Contextual self-Attention Multi-scale Sentence Embedding (CAMSE). The latter employs two scoring strategies: Semantic Matching Scoring (SMS) and Semantic Association Scoring (SAS). SMS measures similarity while SAS captures association between sentence pairs: a medical question concatenated with a candidate choice, and a piece of corresponding supportive evidence. The proposed framework is evaluated on two Medical Question Answering (MedicalQA) datasets collected from real-world applications: medical exams and clinical diagnosis based on electronic medical records (EMR). The comparison results show that our framework achieves significant improvements over competitive baseline approaches. Additionally, a series of controlled experiments illustrate that the multi-scale strategy and the contextual self-attention layer play important roles in producing effective sentence embeddings, and that the two scoring strategies are highly complementary for question answering.
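
To illustrate why two scoring strategies can complement each other, here is a toy version: cosine similarity as a stand-in for SMS and a bilinear form as a stand-in for SAS (the paper's actual scoring functions are more elaborate).

```python
import numpy as np

def sms(q, e):
    """Stand-in for Semantic Matching Scoring: plain cosine similarity."""
    return q @ e / (np.linalg.norm(q) * np.linalg.norm(e) + 1e-8)

def sas(q, e, W):
    """Stand-in for Semantic Association Scoring: a learned bilinear form can
    reward associated-but-dissimilar pairs that cosine similarity misses."""
    return q @ W @ e

rng = np.random.default_rng(0)
q_emb = rng.normal(size=64)   # question + candidate choice embedding
ev_emb = rng.normal(size=64)  # supportive evidence embedding
W = rng.normal(size=(64, 64))
print(round(float(sms(q_emb, ev_emb)), 3), round(float(sas(q_emb, ev_emb, W)), 3))
```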

KR 2016 Short Paper

Knowledge Graph Embedding by Flexible Translation

  • Jun Feng
  • Minlie Huang
  • Mingdong Wang
  • Mantong Zhou
  • Yu Hao
  • Xiaoyan Zhu

Knowledge graph embedding refers to projecting the entities and relations of a knowledge graph into continuous vector spaces. Current state-of-the-art models are translation-based: they build embeddings by treating a relation as a translation from the head entity to the tail entity. However, previous models are too strict to model complex and diverse entities and relations (e.g., symmetric, transitive, one-to-many, and many-to-many relations). To address these issues, we propose a new principle that allows flexible translation between entity and relation vectors. A novel score function favoring flexible translation can be designed for each translation-based model without increasing model complexity. To evaluate the proposed principle, we incorporate it into previous methods and conduct triple classification on benchmark datasets. Experimental results show that the principle remarkably improves performance compared with several state-of-the-art baselines. [Figure 1: Illustration of TransE versus the proposed Flexible Translation. Three triples share the same head entity ("Michael Jackson") and the same relation ("publish song") but have different tail entities ("Beat it", "Billie Jean", and "Thriller"). (a) TransE can hardly distinguish the different tails, as each is approximated by the sum of the head and relation vectors. (b) Instead of strictly constraining h + r = t, the flexible principle enforces that h + r has the same direction as t.]
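
One plausible reading of the principle, in code: replace the strict distance ||h + r - t|| with a direction-based score that is invariant to the magnitude of t.

```python
import numpy as np

def strict_score(h, r, t):
    """TransE-style score: distance between h + r and t (lower is better)."""
    return np.linalg.norm(h + r - t)

def flexible_score(h, r, t):
    """Direction-based score: small when h + r points the same way as t,
    regardless of t's magnitude."""
    v = h + r
    return 1.0 - v @ t / (np.linalg.norm(v) * np.linalg.norm(t) + 1e-8)

rng = np.random.default_rng(0)
h, r = rng.normal(size=8), rng.normal(size=8)
t_long = 3.0 * (h + r)   # same direction as h + r, different length
print(round(float(strict_score(h, r, t_long)), 3),
      round(float(flexible_score(h, r, t_long)), 3))  # flexible score ~ 0
```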

IJCAI 2015 Conference Paper

Tackling Data Sparseness in Recommendation using Social Media based Topic Hierarchy Modeling

  • Xingwei Zhu
  • Zhao-Yan Ming
  • Yu Hao
  • Xiaoyan Zhu

Recommendation systems play an important role in e-commerce. However, their usefulness in real-world applications is greatly limited by the availability of historical rating records from customers. This paper presents a novel method to tackle the problem of data sparseness in user ratings with rich and timely domain information from social media. We first extract multiple kinds of side information for products from their relevant social media content. Next, we convert this information into weighted topic-item ratings and inject them into an extended latent factor based recommendation model in an optimized manner. Our evaluation on two real-world datasets demonstrates the superiority of our method over state-of-the-art methods.
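
One simple way to inject topic-item ratings into a latent factor model is to treat topics as extra pseudo-users, as in this toy SGD sketch (the paper's optimized injection scheme is more involved):

```python
import numpy as np

def sgd_step(P, Q, u, i, r, lr=0.01, reg=0.02):
    """One SGD update of a latent factor model on a (row, item, rating) triple;
    rows may be real users or topic pseudo-users."""
    err = r - P[u] @ Q[i]
    pu = P[u].copy()
    P[u] += lr * (err * Q[i] - reg * P[u])
    Q[i] += lr * (err * pu - reg * Q[i])

rng = np.random.default_rng(0)
n_users, n_topics, n_items, k = 4, 2, 5, 8
P = rng.normal(scale=0.1, size=(n_users + n_topics, k))  # user rows + topic rows
Q = rng.normal(scale=0.1, size=(n_items, k))
ratings = [(0, 1, 4.0), (2, 3, 5.0)]   # sparse real user ratings (toy)
topic_ratings = [(n_users, 1, 3.5)]    # weighted topic-item rating (toy)
for _ in range(200):
    for u, i, r in ratings + topic_ratings:
        sgd_step(P, Q, u, i, r)
print(round(float(P[0] @ Q[1]), 2))    # user 0's predicted rating of item 1
```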

IJCAI 2011 Conference Paper

Semantic Relationship Discovery with Wikipedia Structure

  • Fan Bu
  • Yu Hao
  • Xiaoyan Zhu

Thanks to the idea of social collaboration, Wikipedia has accumulated a vast amount of semi-structured knowledge in which the link structure reflects, to some extent, human cognition of semantic relationships. In this paper, we propose a novel method, RCRank, to jointly compute concept-concept relatedness and concept-category relatedness, based on the assumption that the information carried in concept-concept links and concept-category links can mutually reinforce each other. Different from previous work, RCRank can not only find semantically related concepts but also interpret their relations through categories. Experimental results on concept recommendation and relation interpretation show that our method substantially outperforms classical methods.
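
A toy HITS-style mutual reinforcement iteration captures the flavor of the assumption (this is an illustration, not the paper's actual update rule):

```python
import numpy as np

def rcrank_iterate(cc, ccat, seed_concept, iters=20):
    """Toy mutual reinforcement: concept scores flow over concept-concept links
    (cc) and concept-category links (ccat); category scores feed back into
    concepts, HITS-style, with renormalization each round."""
    concept = np.zeros(cc.shape[0])
    concept[seed_concept] = 1.0
    for _ in range(iters):
        category = ccat.T @ concept                 # concepts reinforce categories
        concept = cc @ concept + ccat @ category    # categories reinforce concepts
        concept /= np.linalg.norm(concept) + 1e-8
    return concept

rng = np.random.default_rng(0)
cc = (rng.random((6, 6)) > 0.6).astype(float)      # concept-concept links (toy)
ccat = (rng.random((6, 3)) > 0.6).astype(float)    # concept-category links (toy)
print(rcrank_iterate(cc, ccat, seed_concept=0).round(3))
```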