Arrow Research search

Author name cluster

Congcong Wen

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

10 papers
2 author rows

Possible papers (10)

ICLR 2025 · Conference Paper

DiffGAD: A Diffusion-based Unsupervised Graph Anomaly Detector

  • Jinghan Li
  • Yuan Gao
  • Jinda Lu
  • Junfeng Fang
  • Congcong Wen
  • Hui Lin
  • Xiang Wang 0010

Graph Anomaly Detection (GAD) is crucial for identifying abnormal entities within networks and has garnered significant attention across various fields. Traditional unsupervised methods, which decode encoded latent representations of unlabeled data with a reconstruction focus, often fail to capture critical discriminative content, leading to suboptimal anomaly detection. To address these challenges, we present a Diffusion-based Graph Anomaly Detector (DiffGAD). At the heart of DiffGAD is a novel latent space learning paradigm, meticulously designed to enhance the model's proficiency by guiding it with discriminative content. This approach leverages diffusion sampling to infuse the latent space with discriminative content and introduces a content-preservation mechanism that retains valuable information across different scales, significantly improving the model's ability to identify anomalies while keeping time and space complexity low. Our comprehensive evaluation of DiffGAD, conducted on six real-world and large-scale datasets with various metrics, demonstrates its exceptional performance. Our code is available at https://github.com/fortunato-all/DiffGAD.
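
The entry above is an abstract only; as a rough illustration of the reconstruction-plus-guidance idea it describes (not the DiffGAD implementation), the sketch below scores nodes by reconstruction error after one guided update in latent space. The encoder, decoder, and denoising step are hypothetical placeholders.

```python
# Illustrative sketch only: scores graph nodes by reconstruction error after a
# single guided update in latent space. All components here are hypothetical
# stand-ins, not the DiffGAD implementation.
import numpy as np

rng = np.random.default_rng(0)

def encode(x):            # placeholder encoder: random linear projection
    return x @ rng.standard_normal((x.shape[1], 8))

def decode(z):            # placeholder decoder: random projection back
    return z @ rng.standard_normal((z.shape[1], 16))

def denoise(z, guidance, strength=0.3):
    # one "diffusion-style" update pulling latents toward discriminative guidance
    return z + strength * (guidance - z)

x = rng.standard_normal((100, 16))            # node features (100 nodes)
z = encode(x)
z_guided = denoise(z, guidance=z.mean(axis=0, keepdims=True))
x_rec = decode(z_guided)

# anomaly score: per-node reconstruction error (higher = more anomalous)
scores = np.linalg.norm(x - x_rec, axis=1)
print(scores.argsort()[-5:])                  # indices of the 5 most anomalous nodes
```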

IROS 2025 · Conference Paper

ResLPR: A LiDAR Data Restoration Network and Benchmark for Robust Place Recognition Against Weather Corruptions

  • Wenqing Kuang
  • Xiongwei Zhao
  • Yehui Shen
  • Congcong Wen
  • Huimin Lu 0002
  • Zongtan Zhou
  • Xieyuanli Chen

LiDAR-based place recognition (LPR) is a key component for autonomous driving, and its resilience to environmental corruption is critical for safety in high-stakes applications. While state-of-the-art (SOTA) LPR methods perform well in clean weather, they still struggle with weather-induced corruption commonly encountered in driving scenarios. To tackle this, we propose ResLPRNet, a novel LiDAR data restoration network that largely enhances LPR performance under adverse weather by restoring corrupted LiDAR scans using a wavelet transform-based network. ResLPRNet is efficient, lightweight and can be integrated plug-and-play with pretrained LPR models without substantial additional computational cost. Given the lack of LPR datasets under adverse weather, we introduce ResLPR, a novel benchmark that examines SOTA LPR methods under a wide range of LiDAR distortions induced by severe snow, fog, and rain conditions. Experiments on our proposed WeatherKITTI and WeatherNCLT datasets demonstrate the resilience and notable gains achieved by using our restoration method with multiple LPR approaches in challenging weather scenarios. Our code and benchmark are publicly available here: https://github.com/nubot-nudt/ResLPR.
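
As a hedged illustration of the plug-and-play wiring described above (restore the scan, then run a pretrained LPR model), the sketch below uses classical wavelet soft-thresholding via PyWavelets on a range image; ResLPRNet itself is a learned network, and `lpr_descriptor` is a hypothetical stand-in.

```python
# Illustrative stand-in: classical wavelet soft-thresholding of a LiDAR range
# image before feeding it to a (hypothetical) pretrained place-recognition model.
# ResLPRNet is a learned restoration network; this only shows the plug-and-play wiring.
import numpy as np
import pywt

def restore_range_image(range_img, wavelet="db2", level=2, thresh=0.1):
    coeffs = pywt.wavedec2(range_img, wavelet, level=level)
    # soft-threshold detail coefficients to suppress weather-induced noise
    denoised = [coeffs[0]] + [
        tuple(pywt.threshold(d, thresh, mode="soft") for d in detail)
        for detail in coeffs[1:]
    ]
    return pywt.waverec2(denoised, wavelet)

def lpr_descriptor(range_img):
    # hypothetical pretrained LPR head; here just a crude global signature
    return range_img.mean(axis=1)

noisy_scan = np.random.rand(64, 900)          # 64-beam range image with noise
descriptor = lpr_descriptor(restore_range_image(noisy_scan))
print(descriptor.shape)
```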

IROS 2025 · Conference Paper

Socially-Aware Robot Navigation Enhanced by Bidirectional Natural Language Conversations Using Large Language Models

  • Congcong Wen
  • Yifan Liu
  • Geeta Chandra Raju Bethala
  • Shuaihang Yuan
  • Hao Huang 0003
  • Yu Hao
  • Mengyu Wang 0001
  • Yu-Shen Liu

Robotic navigation plays a pivotal role in a wide range of real-world applications. While traditional navigation systems focus on efficiency and obstacle avoidance, their inability to model complex human behaviors in shared spaces has underscored the growing need for socially aware navigation. In this work, we explore a novel paradigm of socially aware robot navigation empowered by large language models (LLMs), and propose HSAC-LLM, a hybrid framework that seamlessly integrates deep reinforcement learning with the reasoning and communication capabilities of LLMs. Unlike prior approaches that passively predict pedestrian trajectories or issue pre-scripted alerts, HSAC-LLM enables bidirectional natural language interaction, allowing robots to proactively engage in dialogue with pedestrians to resolve potential conflicts and negotiate path decisions. Extensive evaluations across 2D simulations, Gazebo environments, and real-world deployments demonstrate that HSAC-LLM consistently outperforms state-of-the-art DRL baselines under our proposed socially aware navigation metric, which covers safety, efficiency, and human comfort. By bridging linguistic reasoning and interactive motion planning, our results highlight the potential of LLM-augmented agents for robust, adaptive, and human-aligned navigation in real-world settings. Project page: https://hsacllm.github.io/.
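
A minimal sketch of the hybrid control loop described above, assuming hypothetical `drl_policy`, `predict_conflict`, and `ask_llm` stand-ins; it is not the HSAC-LLM API, only an illustration of when the LLM dialogue step would be invoked.

```python
# Minimal sketch of the hybrid idea: a DRL policy drives normal motion, and an
# LLM-backed dialogue step is triggered only when a pedestrian conflict is
# predicted. All functions below are hypothetical placeholders.
import random

def drl_policy(state):
    return {"linear": 0.5, "angular": random.uniform(-0.2, 0.2)}

def predict_conflict(state):
    return state["min_pedestrian_dist"] < 1.0

def ask_llm(state, utterance):
    # placeholder for a bidirectional natural-language exchange
    return {"action": "yield", "reply": "Please go ahead, I will wait."}

def navigation_step(state, pedestrian_utterance=None):
    if predict_conflict(state) or pedestrian_utterance:
        decision = ask_llm(state, pedestrian_utterance)
        print("robot says:", decision["reply"])
        if decision["action"] == "yield":
            return {"linear": 0.0, "angular": 0.0}
    return drl_policy(state)

print(navigation_step({"min_pedestrian_dist": 0.7}))
```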

NeurIPS 2025 · Conference Paper

SplitFlow: Flow Decomposition for Inversion-Free Text-to-Image Editing

  • Sung-Hoon Yoon
  • Minghan Li
  • Gaspard Beaudouin
  • Congcong Wen
  • Muhammad Rafay Azhar
  • Mengyu Wang

Rectified flow models have become a de facto standard in image generation due to their stable sampling trajectories and high-fidelity outputs. Despite their strong generative capabilities, they face critical limitations in image editing tasks: inaccurate inversion when mapping real images back into the latent space, and gradient entanglement during editing, both of which often result in outputs that do not faithfully reflect the target prompt. Recent efforts have attempted to directly map source and target distributions via ODE-based approaches without inversion; however, these methods still yield suboptimal editing quality. In this work, we propose a flow decomposition-and-aggregation framework built upon an inversion-free formulation to address these limitations. Specifically, we semantically decompose the target prompt into multiple sub-prompts, compute an independent flow for each, and aggregate them to form a unified editing trajectory. While we empirically observe that decomposing the original flow enhances diversity in the target space, generating semantically aligned outputs still requires consistent guidance toward the full target prompt. To this end, we design a projection and soft-aggregation mechanism for flow, inspired by gradient conflict resolution in multi-task learning. This approach adaptively weights the sub-target velocity fields, suppressing semantic redundancy while emphasizing distinct directions, thereby preserving both diversity and consistency in the final edited output. Experimental results demonstrate that our method outperforms existing zero-shot editing approaches in terms of semantic fidelity and attribute disentanglement. The code is available at https://github.com/Harvard-AI-and-Robotics-Lab/SplitFlow.
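
To make the projection and soft-aggregation step concrete, here is a toy numpy sketch under assumed shapes: sub-prompt velocity fields are first cleared of pairwise conflicting components (in the spirit of gradient conflict resolution) and then combined with softmax weights based on alignment with the full-prompt velocity. This is illustrative only, not the SplitFlow code.

```python
# Toy sketch: resolve conflicts between sub-prompt velocity fields, then
# soft-aggregate them using alignment with the full-prompt velocity as weights.
import numpy as np

def resolve_conflicts(velocities):
    out = [v.copy() for v in velocities]
    for i, vi in enumerate(out):
        for j, vj in enumerate(velocities):
            if i != j and vi @ vj < 0:                   # conflicting directions
                vi -= (vi @ vj) / (vj @ vj + 1e-8) * vj  # project out the conflict
    return out

def soft_aggregate(velocities, v_full, temperature=1.0):
    v_adj = resolve_conflicts(velocities)
    align = np.array([v @ v_full for v in v_adj])        # alignment with full prompt
    w = np.exp(align / temperature)
    w /= w.sum()
    return sum(wi * vi for wi, vi in zip(w, v_adj))

rng = np.random.default_rng(1)
sub_velocities = [rng.standard_normal(4) for _ in range(3)]   # one per sub-prompt
v_target = rng.standard_normal(4)                             # full-prompt velocity
print(soft_aggregate(sub_velocities, v_target))
```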

AAAI 2025 · Conference Paper

Towards Robust Visual Question Answering via Prompt-Driven Geometric Harmonization

  • Yishu Liu
  • Jiawei Zhu
  • Congcong Wen
  • Guangming Lu
  • Hui Lin
  • Bingzhi Chen

Visual Question Answering (VQA) has garnered significant attention as a crucial link between vision and language, aimed at generating accurate responses to visual queries. However, current VQA models still struggle with the challenges of minority class collapse and spurious semantic correlations posed by language bias and imbalanced distributions. To address these challenges, this paper proposes a novel Prompt-Driven Geometric Harmonization (PDGH) paradigm, which integrates both geometric structure and information entropy principles to enhance the ability of VQA models to generalize effectively across diverse scenarios. Specifically, our PDGH approach is meticulously designed to generate image-generated prompts that are guided by specific question cues, facilitating a more accurate and context-aware understanding of the visual content. Moreover, we project the prompt-visual-question and visual-question joint representations into a unified hypersphere space, applying feature weight self-orthogonality and prompt-information entropy correction constraints to optimize the margin, further alleviating minority class collapse and correcting language bias. To maintain the geometric integrity of the representation space, we introduce multi-space geometric contrast constraints to minimize the impact of spurious priors introduced during training. Finally, a semantic matrix is constructed for the coordinated joint representation to ensure that the learned instances are semantically consistent and improve reasoning ability. Extensive experiments on various general and medical VQA datasets demonstrate the consistent superiority of our PDGH approach over existing state-of-the-art baselines.
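
As a small illustration of two ingredients mentioned in the abstract, the sketch below projects joint representations onto a unit hypersphere and computes a weight self-orthogonality penalty; all shapes and names are assumptions, not the PDGH implementation.

```python
# Toy sketch of two ingredients: hypersphere projection of joint features and a
# self-orthogonality penalty on classifier weights. Hypothetical shapes only.
import numpy as np

def project_to_hypersphere(features, eps=1e-8):
    return features / (np.linalg.norm(features, axis=1, keepdims=True) + eps)

def weight_self_orthogonality(W):
    # penalize off-diagonal entries of W W^T so class directions stay spread out
    gram = W @ W.T
    off_diag = gram - np.diag(np.diag(gram))
    return float(np.sum(off_diag ** 2))

rng = np.random.default_rng(2)
joint_repr = rng.standard_normal((32, 64))     # prompt-visual-question features
W = rng.standard_normal((10, 64))              # 10 answer classes

z = project_to_hypersphere(joint_repr)
print(np.linalg.norm(z, axis=1)[:3])           # all ~1.0 on the hypersphere
print(weight_self_orthogonality(W))
```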

IROS 2024 · Conference Paper

ChatMap: A Wearable Platform Based on the Multi-modal Foundation Model to Augment Spatial Cognition for People with Blindness and Low Vision

  • Yu Hao
  • Alexey Magay
  • Hao Huang 0003
  • Shuaihang Yuan
  • Congcong Wen
  • Yi Fang 0006

Spatial cognition refers to the ability to gain knowledge about one's surroundings and use that information to identify one's location, acquire resources, and navigate back to familiar places. People with blindness and low vision (pBLV) face significant challenges with spatial cognition due to the reliance on visual input. Without the full range of visual cues, pBLV individuals often find it difficult to form a comprehensive understanding of their environment, leading to obstacles in scene recognition and precise object localization, especially in unfamiliar environments. This limitation extends to their ability to independently detect and avoid potential tripping hazards, making navigation and interaction with their environment more challenging. In this paper, we present a pioneering wearable platform tailored to enhance the spatial cognition of pBLV through the integration of a multi-modal foundation model. The proposed platform integrates a wearable camera with an audio module and leverages the advanced capabilities of vision-language foundation models (i.e., GPT-4 and GPT-4V) for the nuanced processing of visual and textual data. Specifically, we employ vision-language models to bridge the gap between visual information and the proprioception of visually impaired users, offering more intelligible guidance by aligning visual data with the natural perception of space and movement. We then apply prompt engineering to guide the large language model to act as an assistant tailored specifically for pBLV users and produce accurate answers. Another innovation in our model is the incorporation of a chain-of-thought reasoning process, which enhances the accuracy and interpretability of the model, facilitating the generation of more precise responses to complex user inquiries across diverse environmental contexts. To assess the practical impact of our proposed wearable platform, we carried out a series of real-world experiments across three tasks that are commonly challenging for people with blindness and low vision: risk assessment, object localization, and scene recognition. Additionally, through an ablation study conducted on the VizWiz dataset, we rigorously assess the contribution of each individual module, substantiating each module's integral role in the model's overall performance.
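
As a hedged illustration of the prompt-engineering and chain-of-thought pattern described above, the sketch below builds a role-conditioned prompt for an assistive query; `query_vlm` is a hypothetical placeholder, and no real vision-language API is assumed.

```python
# Sketch of the prompting pattern (role prompt + step-by-step reasoning request)
# for an assistive query. `query_vlm` is a hypothetical stand-in for whatever
# vision-language model the platform calls; no real API is assumed.
def build_prompt(user_question: str) -> str:
    return (
        "You are an assistant for a blind or low-vision user wearing a camera.\n"
        "Reason step by step about the attached image, then answer concisely:\n"
        "1. Describe the scene and nearby objects.\n"
        "2. Identify any tripping hazards and their direction.\n"
        f"3. Answer the user's question: {user_question}\n"
    )

def query_vlm(prompt: str, image_bytes: bytes) -> str:
    raise NotImplementedError("plug in the vision-language model of your choice")

print(build_prompt("Where is the nearest door?"))
```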

NeurIPS 2024 · Conference Paper

GAMap: Zero-Shot Object Goal Navigation with Multi-Scale Geometric-Affordance Guidance

  • Shuaihang Yuan
  • Hao Huang
  • Yu Hao
  • Congcong Wen
  • Anthony Tzes
  • Yi Fang

Zero-Shot Object Goal Navigation (ZS-OGN) enables robots to navigate toward objects of unseen categories without prior training. Traditional approaches often leverage categorical semantic information for navigation guidance, which struggles when only partial objects are observed or when detailed and functional representations of the environment are lacking. To resolve these two issues, we propose Geometric-part and Affordance Maps (GAMap), a novel method that integrates object parts and affordance attributes for navigation guidance. Our method includes a multi-scale scoring approach to capture geometric-part and affordance attributes of objects at different scales. Comprehensive experiments conducted on the HM3D and Gibson benchmark datasets demonstrate improvements in Success Rate and Success weighted by Path Length, underscoring the efficacy of our geometric-part and affordance-guided navigation approach in enhancing robot autonomy and versatility, without any additional task-specific training or fine-tuning on the semantics of unseen objects or the locomotion of the robot.
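
A toy sketch of the multi-scale scoring idea, assuming a hypothetical image-text `similarity` function: each observation is cropped at several scales, scored against geometric-part and affordance queries, and the scores are averaged into one guidance value. Not the GAMap implementation.

```python
# Toy illustration of multi-scale scoring with a stand-in similarity function.
import numpy as np

def similarity(crop, query):
    # stand-in for an image-text similarity model (e.g. CLIP-style); random here
    seed = abs(hash((crop.shape, query))) % 2**32
    return float(np.random.default_rng(seed).random())

def multi_scale_score(image, queries, scales=(1.0, 0.5, 0.25)):
    h, w = image.shape[:2]
    scores = []
    for s in scales:
        ch, cw = int(h * s), int(w * s)
        crop = image[(h - ch) // 2:(h + ch) // 2, (w - cw) // 2:(w + cw) // 2]
        scores.append(np.mean([similarity(crop, q) for q in queries]))
    return float(np.mean(scores))

obs = np.zeros((240, 320, 3))
queries = ["chair leg", "seat surface", "somewhere to sit"]   # parts + affordance
print(multi_scale_score(obs, queries))
```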

ICRA 2024 · Conference Paper

Noisy Few-shot 3D Point Cloud Scene Segmentation

  • Hao Huang 0003
  • Shuaihang Yuan
  • Congcong Wen
  • Yu Hao
  • Yi Fang 0006

3D scene semantic segmentation plays a crucial role in robotics by enabling robots to understand and interpret their environment in a detailed and context-aware manner, facilitating tasks such as navigation, object manipulation, and interaction within complex spaces. Most existing methods adopt a fully supervised framework for 3D point cloud scene semantic segmentation. Such paradigms depend on extensive labeled datasets, which are challenging to acquire, and cannot segment novel classes, especially when the training data are contaminated by noisy samples. To address these limitations, this study introduces a novel few-shot segmentation approach to robustly segment 3D point cloud scenes with noisy labels using a meta-learning scheme. Specifically, we first build a multi-prototype graph and then suppress samples with noisy labels based on the graph structure. A subgraph bagging scheme is then proposed to conduct semi-supervised transductive learning to propagate labels. To optimize the graph structure and learn discriminative prototype features, we design a triplet contrastive loss to increase the compactness of these subgraphs. We evaluated our method on two widely used 3D point cloud scene segmentation benchmarks under few-shot (i.e., 2/3-way 5-shot) segmentation settings with noisy samples. Experimental results demonstrate the improvement of our method over the compared baselines, illustrating the robustness of our method in few-shot 3D scene segmentation against noisy samples. The code is available at: https://github.com/hhuang-code/Noisy_Fewshot_Segmentation.
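
As a small illustration of the triplet contrastive objective mentioned in the abstract, the sketch below computes a standard margin-based triplet loss over prototype features; the shapes and margin value are assumptions, not taken from the paper.

```python
# Minimal sketch of a margin-based triplet contrastive loss over prototype features.
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.5):
    d_pos = np.linalg.norm(anchor - positive, axis=1)   # pull same-class prototypes together
    d_neg = np.linalg.norm(anchor - negative, axis=1)   # push different-class prototypes apart
    return float(np.mean(np.maximum(d_pos - d_neg + margin, 0.0)))

rng = np.random.default_rng(3)
anchor = rng.standard_normal((16, 32))                     # prototype features
positive = anchor + 0.1 * rng.standard_normal((16, 32))    # same-class prototypes
negative = rng.standard_normal((16, 32))                   # different-class prototypes
print(triplet_loss(anchor, positive, negative))
```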

IROS 2024 · Conference Paper

Weakly Scene Segmentation Using Efficient Transformer

  • Hao Huang 0003
  • Shuaihang Yuan
  • Congcong Wen
  • Yu Hao
  • Yi Fang 0006

Current methods for large-scale point cloud scene semantic segmentation rely on manually annotated dense point-wise labels, which are costly, labor-intensive, and prone to errors. Consequently, gathering point cloud scenes with billions of labeled points is impractical in real-world scenarios. In this paper, we introduce a novel weak supervision approach to semantically segment large-scale indoor scenes, requiring only 1‰ of the points to be labeled. Specifically, we develop an efficient point neighbor Transformer to capture the geometry of local point cloud patches. To address the quadratic complexity of self-attention computation in Transformers, particularly for large-scale point clouds, we propose approximating the self-attention matrix using low-rank and sparse decomposition. Building on the point neighbor Transformer as the foundational block, we design a Low-rank Sparse Transformer Network (LST-Net) for weakly supervised large-scale point cloud scene semantic segmentation. Experimental results on two commonly used indoor point cloud scene segmentation benchmarks demonstrate that our model achieves performance comparable to that of both weakly supervised and fully supervised methods. Our code can be found at https://github.com/hhuang-code/LST-Net.
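
To illustrate the low-rank plus sparse idea in general terms (not the LST-Net code), the numpy sketch below approximates attention as a landmark-based low-rank term plus exact attention within a small local window, so the full N x N matrix is never formed; all shapes and the landmark choice are assumptions.

```python
# Toy sketch: attention approximated as a low-rank term (through k landmark
# points) plus exact attention over a small local window, with crude normalization.
import numpy as np

def lowrank_sparse_attention(Q, K, V, k=16, window=8):
    n, d = Q.shape
    idx = np.linspace(0, n - 1, k).astype(int)       # landmark points
    # low-rank part: attend through k landmarks instead of all n keys
    A_ql = np.exp(Q @ K[idx].T / np.sqrt(d))         # (n, k)
    A_lk = np.exp(Q[idx] @ K.T / np.sqrt(d))         # (k, n)
    low_rank = A_ql @ (A_lk @ V) / n
    # sparse part: exact attention within a local window around each point
    sparse = np.zeros_like(V)
    for i in range(n):
        lo, hi = max(0, i - window), min(n, i + window + 1)
        w = np.exp(Q[i] @ K[lo:hi].T / np.sqrt(d))
        sparse[i] = (w / w.sum()) @ V[lo:hi]
    return low_rank + sparse

rng = np.random.default_rng(4)
Q = rng.standard_normal((256, 32))
K = rng.standard_normal((256, 32))
V = rng.standard_normal((256, 32))
print(lowrank_sparse_attention(Q, K, V).shape)
```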

ICRA 2023 · Conference Paper

Pyramid Learnable Tokens for 3D LiDAR Place Recognition

  • Congcong Wen
  • Hao Huang 0003
  • Yu-Shen Liu
  • Yi Fang 0006

3D LiDAR place recognition plays a vital role in various robot applications, including robotic navigation, autonomous driving, and simultaneous localization and mapping. However, most previous studies evaluated their models on accumulated 2D scans instead of real-world 3D LiDAR scans with a larger number of points, which limits their application in real scenarios. To address this limitation, we propose a point transformer network with pyramid learnable tokens (PTNet-PLT) to learn global descriptors for place recognition on actual scanned 3D LiDAR data. Specifically, we first present a novel shifted cube attention module that consists of a self-attention module for local feature extraction and a cross-attention module for regional feature aggregation. The self-attention module constrains attention computation to a locally partitioned cube and builds connections across cubes based on the shifted cube scheme. In addition, the cross-attention module introduces several learnable tokens that aggregate points with similar features but distant spatial locations into an arbitrarily shaped region, which enables the model to capture long-range dependencies among points. Next, we build a pyramid architecture to learn multi-scale features, with a decreasing number of tokens at each layer so that features are aggregated over larger regions. Finally, we obtain the global descriptor by concatenating the learned region tokens of all layers. Experiments on three datasets, including USyd Campus, Oxford RobotCar, and KITTI, demonstrate the effectiveness and generalization of the proposed model for large-scale 3D LiDAR place recognition.
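
As a rough illustration of the learnable-token cross-attention aggregation described above, the sketch below lets a small set of tokens attend over per-point features and concatenates the resulting region tokens into a global descriptor; all shapes are illustrative assumptions, not the PTNet-PLT implementation.

```python
# Toy sketch: learnable tokens cross-attend to point features and pool them into
# region descriptors, which are concatenated into one global descriptor.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def token_cross_attention(tokens, point_feats):
    # tokens: (t, d) learnable queries; point_feats: (n, d) keys/values
    d = tokens.shape[1]
    attn = softmax(tokens @ point_feats.T / np.sqrt(d), axis=1)   # (t, n)
    return attn @ point_feats                                     # (t, d) region tokens

rng = np.random.default_rng(5)
points = rng.standard_normal((4096, 64))       # per-point features from one scan
tokens = rng.standard_normal((32, 64))         # learnable region tokens
region_tokens = token_cross_attention(tokens, points)
global_descriptor = region_tokens.reshape(-1)  # concatenate region tokens
print(global_descriptor.shape)                 # (2048,)
```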