Arrow Research search

Author name cluster

Guang Tan

Possible papers associated with this exact author name in Arrow. This page groups papers whose author names match exactly (case-insensitive); it is not a full identity-disambiguation profile.

10 papers
2 author rows

Possible papers

10

AAAI Conference 2026 Conference Paper

COVR: Collaborative Optimization of VLMs and RL Agent for Visual-Based Control

  • Canming Xia
  • Peixi Peng
  • Guang Tan
  • Zhan Su
  • Haoran Xu
  • Zhenxian Liu
  • Luntong Li

Visual reinforcement learning (RL) suffers from poor sample efficiency due to high-dimensional observations in complex tasks. While existing works have shown that vision-language models (VLMs) can assist RL, they often focus on knowledge distillation from the VLM to RL, overlooking the potential of RL-generated interaction data to enhance the VLM. To address this, we propose COVR, a collaborative optimization framework that enables the mutual enhancement of the VLM and RL policies. Specifically, COVR fine-tunes the VLM with RL-generated data to enhance its semantic reasoning ability on the target task, and uses the enhanced VLM to further guide policy learning via action priors. To improve fine-tuning efficiency, we introduce two key modules: (1) an Exploration-Driven Dynamic Filter module that preserves valuable exploration samples using adaptive thresholds based on the degree of exploration, and (2) a Return-Aware Adaptive Loss Weight module that improves training stability by quantifying the inconsistency of sampled actions via RL return signals. We further design a progressive fine-tuning strategy to reduce resource consumption. Extensive experiments show that COVR achieves strong performance across various challenging visual control tasks.

AAAI Conference 2026 Conference Paper

HouseTune: Two-Stage Floorplan Generation with LLM Assistance

  • Ziyang Zong
  • Guanying Chen
  • Zhaohuan Zhan
  • Fengcheng Yu
  • Guang Tan

This paper proposes a two-stage text-to-floorplan generation framework that combines the reasoning capability of Large Language Models (LLMs) with the generative power of diffusion models. In the first stage, we leverage a Chain-of-Thought (CoT) prompting strategy to guide an LLM in generating an initial layout, Layout-Init, from natural language descriptions, which ensures a user-friendly and intuitive design process. However, Layout-Init may lack precise geometric alignment and fine-grained structural details due to the inherent limitations of LLMs. To address this, in the second stage we propose a Dual-Noise Prior-Preserved Diffusion (DNPP-Diffusion) model to refine Layout-Init into a final floorplan that better adheres to physical constraints and user requirements. By combining LLMs and a dedicated refining model, our approach is able to generate high-quality floorplans without requiring large-scale domain-specific training data. Experimental results demonstrate its advantages in comparison with state-of-the-art methods, and validate its effectiveness in home design applications.

AAAI Conference 2025 Conference Paper

BEVSync: Asynchronous Data Alignment for Camera-based Vehicle-Infrastructure Cooperative Perception Under Uncertain Delays

  • Wentao Wang
  • Jiaqian Wang
  • Yuxin Deng
  • Guang Tan

Vehicle-to-infrastructure (V2I) cooperative perception systems can enhance the sensing abilities of autonomous vehicles. Existing V2I solutions often rely on LiDAR devices rather than cameras, which are the most prevalent sensors thanks to their low cost and wide deployment. In addition, a major challenge that has been underexplored is the time asynchrony between image frames from different sources. This asynchrony arises from clock differences and varying data processing and transmission times, causing uncertain delays that complicate data alignment and potentially reduce perception accuracy. We propose BEVSync, a camera-based V2I cooperative perception system that adaptively aligns frames from the ego-vehicle and infrastructure by compensating for motion deviations. Specifically, we develop an extractor-compensator model to extract and predict perceptual features using historical frames, thereby smoothing out the data misalignment. Experiments on the real-world dataset DAIR-V2X show that our approach surpasses existing methods in terms of performance and robustness.

ICML Conference 2025 Conference Paper

Commute Graph Neural Networks

  • Wei Zhuo 0006
  • Han Yu 0001
  • Guang Tan
  • Xiaoxiao Li

Graph Neural Networks (GNNs) have shown remarkable success in learning from graph-structured data. However, their application to directed graphs (digraphs) presents unique challenges, primarily due to the inherent asymmetry in node relationships. Traditional GNNs are adept at capturing unidirectional relations but fall short in encoding the mutual path dependencies between nodes, such as asymmetrical shortest paths typically found in digraphs. Recognizing this gap, we introduce Commute Graph Neural Networks (CGNN), an approach that seamlessly integrates node-wise commute time into the message passing scheme. The cornerstone of CGNN is an efficient method for computing commute time using a newly formulated digraph Laplacian. Commute time is then integrated into the neighborhood aggregation process, with neighbor contributions weighted according to their respective commute time to the central node in each layer. This enables CGNN to directly capture the mutual, asymmetric relationships in digraphs. Extensive experiments on 8 benchmark datasets confirm the superiority of CGNN over 13 state-of-the-art methods.
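The commute-time-weighted aggregation described in the abstract could be sketched roughly as follows. This is an illustrative sketch only: the paper's digraph-Laplacian computation of commute time is abstracted into a precomputed lookup table, and the softmax-style weighting and all names are assumptions, not the authors' exact formulation.

```python
import numpy as np

def commute_weighted_aggregate(x, neighbors, commute_time):
    """Aggregate neighbor features with weights that decay with commute
    time to the center node: a shorter round-trip means a larger weight."""
    out = np.zeros_like(x)
    for v in range(x.shape[0]):
        nbrs = neighbors[v]
        if not nbrs:
            out[v] = x[v]  # isolated node keeps its own features
            continue
        c = np.array([commute_time[(v, u)] for u in nbrs], dtype=float)
        w = np.exp(-c)     # smaller commute time -> larger weight
        w /= w.sum()
        out[v] = sum(wi * x[u] for wi, u in zip(w, nbrs))
    return out
```

With equal commute times the scheme reduces to plain mean aggregation; the asymmetry only appears when commute_time[(v, u)] differs across neighbors.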

AAAI Conference 2025 Conference Paper

Exploiting Continuous Motion Clues for Vision-Based Occupancy Prediction

  • Haoran Xu
  • Peixi Peng
  • Xinyi Zhang
  • Guang Tan
  • Yaokun Li
  • Shuaixian Wang
  • Luntong Li

Occupancy networks aim to reconstruct the surroundings with occupied semantic voxels. However, frequent object occlusions often occur in dynamic real-world scenarios, which cannot be captured by independent frames. Most existing occupancy networks generate results without explicitly considering past occupancy states and continuous visual changes over time, limiting their temporal accuracy. We tackle it by treating the task from a new continuous updating perspective, which considers historical data and continuous motion clues. We propose a new approach termed Continuous Motion clue exploitation for Occupancy Prediction (CMOP), which incorporates three key designs: (i) Propagator: which forecasts future occupancy states based on historical data; (ii) Tracker: which updates the occupancy on a per-frame basis using dynamic visual motion information; and (iii) Fuser: which aggregates results from the Propagator and Tracker into more robust and accurate occupancy results. Experiments on several benchmarks demonstrate that CMOP outperforms state-of-the-art baselines.

ICRA Conference 2024 Conference Paper

InterCoop: Spatio-Temporal Interaction Aware Cooperative Perception for Networked Vehicles

  • Wentao Wang
  • Haoran Xu 0004
  • Guang Tan

In autonomous driving, cooperative perception through vehicle-to-vehicle (V2V) communication is considered crucial for enhancing traffic safety and efficiency. However, existing methods often simplify the handling of perception data from multiple vehicles. In these approaches, the ego-vehicle aggregates observations from all neighboring connected cooperative vehicles (CCV), without considering the interactions between the vehicles or making differentiated use of the acquired sensing data. This can result in suboptimal performance due to increased noise and large transmission delays. In this paper, we introduce a novel approach to cooperative perception. By fusing both the road topology and trajectory histories of neighboring CCVs, our model learns an interaction score for each CCV. These scores prioritize vehicles that are most relevant to the current driving scenario, offering valuable guidance for selective fusion of sensor data, thereby enhancing driving decision-making. The proposed method is validated through experiments conducted on the CARLA simulator. Results demonstrate that our approach surpasses existing methods in terms of performance and robustness.

ICLR Conference 2024 Conference Paper

Partitioning Message Passing for Graph Fraud Detection

  • Wei Zhuo 0006
  • Zemin Liu
  • Bryan Hooi
  • Bingsheng He
  • Guang Tan
  • Rizal Fathony
  • Jia Chen 0011

Label imbalance and homophily-heterophily mixture are the fundamental problems encountered when applying Graph Neural Networks (GNNs) to Graph Fraud Detection (GFD) tasks. Existing GNN-based GFD models are designed to augment graph structure to accommodate the inductive bias of GNNs towards homophily, by excluding heterophilic neighbors during message passing. In our work, we argue that the key to applying GNNs for GFD is not to exclude but to distinguish neighbors with different labels. Grounded in this perspective, we introduce Partitioning Message Passing (PMP), an intuitive yet effective message passing paradigm expressly crafted for GFD. Specifically, in the neighbor aggregation stage of PMP, neighbors with different classes are aggregated with distinct node-specific aggregation functions. By this means, the center node can adaptively adjust the information aggregated from its heterophilic and homophilic neighbors, thus avoiding the model gradient being dominated by benign nodes, which occupy the majority of the population. We theoretically establish a connection between the spatial formulation of PMP and spectral analysis, showing that PMP operates as an adaptive node-specific spectral graph filter, which demonstrates the capability of PMP to handle heterophily-homophily mixed graphs. Extensive experimental results show that PMP can significantly boost the performance on GFD tasks.
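The class-partitioned aggregation idea could be sketched as below. This is a minimal sketch under stated assumptions: neighbors are split by (known) label into fraud and benign groups and transformed through separate weight matrices; the paper's node-specific parameterization and handling of unlabeled neighbors are omitted, and all names are hypothetical.

```python
import numpy as np

def pmp_layer(x, neighbors, labels, W_fraud, W_benign, W_self):
    """One PMP-style layer (sketch): messages from fraud-labeled and
    benign-labeled neighbors pass through distinct weight matrices, so
    heterophilic and homophilic information is not mixed before the
    center node combines it."""
    out = []
    for v in range(x.shape[0]):
        h = x[v] @ W_self
        fraud = [u for u in neighbors[v] if labels[u] == 1]
        benign = [u for u in neighbors[v] if labels[u] == 0]
        if fraud:
            h = h + np.mean(x[fraud], axis=0) @ W_fraud
        if benign:
            h = h + np.mean(x[benign], axis=0) @ W_benign
        out.append(h)
    return np.stack(out)
```

Because W_fraud and W_benign are learned independently, gradients from the abundant benign class cannot overwrite the transformation applied to the rare fraud class.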

ICRA Conference 2024 Conference Paper

Resolving Loop Closure Confusion in Repetitive Environments for Visual SLAM through AI Foundation Models Assistance

  • Hongzhou Li
  • Sijie Yu
  • Shengkai Zhang
  • Guang Tan

In visual SLAM (VSLAM) systems, loop closure plays a crucial role in reducing accumulated errors. However, VSLAM systems relying on low-level visual features often suffer from the problem of perceptual confusion in repetitive environments, where scenes in different locations are incorrectly identified as the same. Existing work has attempted to introduce object-level features or artificial landmarks. The former approach struggles to distinguish visually similar but different objects, while the latter is both time-consuming and labor-intensive. This paper introduces a novel loop closure detection method that leverages pretrained AI foundation models to extract rich semantic information about specific types of objects (e.g., door numbers), referred to as semantic anchors, which help to better distinguish similar scenes. In settings such as office buildings, hotels, and warehouses, this approach helps to improve the robustness of loop closure detection. We validate the effectiveness of our method through experiments conducted in both simulated and real-world environments.

NeurIPS Conference 2022 Conference Paper

Efficient Graph Similarity Computation with Alignment Regularization

  • Wei Zhuo
  • Guang Tan

We consider the graph similarity computation (GSC) task based on graph edit distance (GED) estimation. State-of-the-art methods treat GSC as a learning-based prediction task using Graph Neural Networks (GNNs). To capture fine-grained interactions between pair-wise graphs, these methods mostly contain a node-level matching module in the end-to-end learning pipeline, which causes high computational costs in both the training and inference stages. We show that the expensive node-to-node matching module is not necessary for GSC, and high-quality learning can be attained with a simple yet powerful regularization technique, which we call the Alignment Regularization (AReg). In the training stage, the AReg term imposes a node-graph correspondence constraint on the GNN encoder. In the inference stage, the graph-level representations learned by the GNN encoder are directly used to compute the similarity score without using AReg again to speed up inference. We further propose a multi-scale GED discriminator to enhance the expressive ability of the learned representations. Extensive experiments on real-world datasets demonstrate the effectiveness, efficiency and transferability of our approach.
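The split between training-time alignment and matching-free inference could be sketched as follows. This is an illustrative sketch, not the paper's implementation: the GNN encoder is replaced by mean pooling, the similarity function and the exact form of the alignment term are assumptions, and the multi-scale GED discriminator is omitted.

```python
import numpy as np

def graph_embed(node_feats):
    # mean pooling as a stand-in for the GNN encoder's graph readout
    return node_feats.mean(axis=0)

def similarity(g1_nodes, g2_nodes):
    """Inference-time GSC (sketch): compare pooled graph embeddings
    directly, with no node-to-node matching module."""
    h1, h2 = graph_embed(g1_nodes), graph_embed(g2_nodes)
    return float(np.exp(-np.linalg.norm(h1 - h2)))

def areg_loss(g1_nodes, g2_nodes):
    """Training-only alignment regularizer (sketch): pull each node
    embedding toward the paired graph's embedding; dropped entirely at
    inference, which is what keeps prediction cheap."""
    h1, h2 = graph_embed(g1_nodes), graph_embed(g2_nodes)
    l1 = np.mean(np.linalg.norm(g1_nodes - h2, axis=1))
    l2 = np.mean(np.linalg.norm(g2_nodes - h1, axis=1))
    return float(l1 + l2)
```

The key design point survives even in this toy form: the cross-graph correspondence signal shapes the encoder during training, but inference touches only two graph-level vectors.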

IJCAI Conference 2022 Conference Paper

Proximity Enhanced Graph Neural Networks with Channel Contrast

  • Wei Zhuo
  • Guang Tan

We consider graph representation learning in an unsupervised manner. Graph neural networks use neighborhood aggregation as a core component that results in feature smoothing among nodes in proximity. While successful in various prediction tasks, such a paradigm falls short of capturing nodes' similarities over a long distance, which proves to be important for high-quality learning. To tackle this problem, we strengthen the graph with three types of additional graph views, in which each node is directly linked to a set of nodes with the highest similarity in terms of node features, neighborhood features or local structures. Not restricted by connectivity in the original graph, the generated views provide new and complementary perspectives from which to look at the relationship between nodes. Inspired by the recent success of contrastive learning approaches, we propose a self-supervised method that aims to learn node representations by maximizing the agreement between representations across generated views and the original graph, without the requirement of any label information. We also propose a channel-level contrast approach that greatly reduces computation cost. Extensive experiments on six assortative graphs and three disassortative graphs demonstrate the effectiveness of our approach.
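One of the additional views described above, the feature-similarity view, could be sketched as a k-nearest-neighbor graph over node features. This is a hedged sketch: cosine similarity and the function name are assumptions, and the neighborhood-feature and local-structure views are built analogously but not shown.

```python
import numpy as np

def feature_knn_view(x, k=2):
    """Build a feature-similarity graph view (sketch): link each node to
    its k most feature-similar nodes by cosine similarity, ignoring the
    original graph's connectivity entirely."""
    xn = x / np.linalg.norm(x, axis=1, keepdims=True)
    sim = xn @ xn.T
    np.fill_diagonal(sim, -np.inf)  # exclude self-links
    return {v: [int(u) for u in np.argsort(-sim[v])[:k]]
            for v in range(x.shape[0])}
```

Because the links come from feature space rather than edges, two nodes far apart in the original graph can become direct neighbors in this view, which is precisely the long-distance similarity the original aggregation misses.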