Arrow Research search

Author name cluster

Kun Wang

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

50 papers
2 author rows

Possible papers (50)

AAAI Conference 2026 Conference Paper

Drift-aware Collaborative Assistance Mixture of Experts for Heterogeneous Multistream Learning

  • En Yu
  • Jie Lu
  • Kun Wang
  • Xiaoyu Yang
  • Guangquan Zhang

Learning from multiple data streams in real-world scenarios is fundamentally challenging due to intrinsic heterogeneity and unpredictable concept drifts. Existing methods typically assume homogeneous streams and employ static architectures with indiscriminate knowledge fusion, limiting generalizability in complex dynamic environments. To tackle this gap, we propose CAMEL, a dynamic Collaborative Assistance Mixture of Experts Learning framework. It addresses heterogeneity by assigning each stream an independent system with a dedicated feature extractor and task-specific head. Meanwhile, a dynamic pool of specialized private experts captures stream-specific idiosyncratic patterns. Crucially, collaboration across these heterogeneous streams is enabled by a dedicated assistance expert. This expert employs a multi-head attention mechanism to distill and integrate relevant context autonomously from all other concurrent streams. It facilitates targeted knowledge transfer while inherently mitigating negative transfer from irrelevant sources. Furthermore, we propose an Autonomous Expert Tuner (AET) strategy, which dynamically manages expert lifecycles in response to drift. It instantiates new experts for emerging concepts (freezing prior ones to prevent catastrophic forgetting) and prunes obsolete ones. This expert-level plasticity provides a robust and efficient mechanism for online model capacity adaptation. Extensive experiments demonstrate CAMEL’s superior generalizability across diverse multistreams and exceptional resilience against complex concept drifts.
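To make the collaboration mechanism concrete, here is a minimal PyTorch sketch of an assistance expert that distills context from concurrent streams via multi-head attention. The module name, dimensionality, and residual fusion are illustrative assumptions, not CAMEL's released code.

```python
# Sketch (not the authors' code) of an "assistance expert" that attends over
# the representations of other concurrent streams via multi-head attention.
import torch
import torch.nn as nn

class AssistanceExpert(nn.Module):
    def __init__(self, d_model: int = 64, n_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, own: torch.Tensor, others: torch.Tensor) -> torch.Tensor:
        # own:    (batch, 1, d_model)          current stream's representation
        # others: (batch, n_streams, d_model)  representations of other streams
        ctx, _ = self.attn(query=own, key=others, value=others)
        return own + ctx  # residual fusion of distilled cross-stream context

x_own = torch.randn(8, 1, 64)      # one query token per stream
x_others = torch.randn(8, 3, 64)   # three concurrent streams as context
fused = AssistanceExpert()(x_own, x_others)
print(fused.shape)  # torch.Size([8, 1, 64])
```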

IS Journal 2026 Journal Article

Graph-Augmented Large Language Model Agents: Current Progress and Future Prospects

  • Yixin Liu
  • Guibin Zhang
  • Kun Wang
  • Shiyuan Li
  • Shirui Pan
  • Bo An

Autonomous agents based on large language models (LLMs) have demonstrated impressive capabilities in numerous real-world applications. While most LLMs remain limited in several key agentic procedures, graphs can serve as a powerful auxiliary structure to enhance structure, continuity, and coordination in complex agent workflows. Given the rapid growth and fragmentation of research on Graph-augmented LLM Agents (GLA), this article offers a timely and comprehensive overview of recent advances and highlights key directions for future work. Specifically, we categorize existing GLA methods by their primary functions in LLM agent systems, including planning, memory, and tool usage, and then analyze how graphs and graph learning algorithms contribute to each. For multi-agent systems (MAS), we further discuss how GLA solutions facilitate their orchestration, efficiency optimization, and trustworthiness. Finally, we highlight key future directions to advance this field, from improving structural adaptability to enabling unified, scalable, and multimodal GLA systems.

AAAI Conference 2026 Conference Paper

Hidden in the Noise: Unveiling Backdoors in Audio LLMs Alignment Through Latent Acoustic Pattern Triggers

  • Liang Lin
  • Miao Yu
  • Kaiwen Luo
  • Yibo Zhang
  • Lilan Peng
  • Dexian Wang
  • Xuehai Tang
  • Yuanhe Zhang

As Audio Large Language Models (ALLMs) emerge as powerful tools for speech processing, their safety implications demand urgent attention. While considerable research has explored textual and vision safety, audio's distinct characteristics present significant challenges. This paper first investigates: Are ALLMs vulnerable to backdoor attacks that exploit acoustic triggers? In response to this question, we introduce Hidden in the Noise (HIN), a novel backdoor attack framework designed to exploit subtle, audio-specific features. HIN applies acoustic modifications to raw audio waveforms, such as alterations to temporal dynamics and strategic injection of spectrally tailored noise. These changes introduce consistent patterns that an ALLM's acoustic feature encoder captures, embedding robust triggers within the audio stream. To evaluate ALLM robustness against audio-feature-based triggers, we develop the AudioSafe benchmark, assessing nine distinct risk types. Extensive experiments on AudioSafe and three established safety datasets reveal critical vulnerabilities in existing ALLMs: (I) audio features like environmental noise and speech rate variations achieve over 90% average attack success rate, (II) ALLMs exhibit significant sensitivity differences across acoustic features, particularly showing minimal response to volume as a trigger, and (III) poisoned sample inclusion causes only marginal loss curve fluctuations, highlighting the attack's stealth.
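For intuition, here is a small NumPy sketch of the two kinds of acoustic modification the abstract names: tempo alteration and spectrally confined noise. The specific rate, band, and SNR are hypothetical; HIN's actual trigger design is not specified in the abstract.

```python
# Illustrative sketch only: acoustic modifications of the kind described
# (tempo change and spectrally shaped noise); parameters are assumptions.
import numpy as np

def stretch_tempo(wave: np.ndarray, rate: float = 1.05) -> np.ndarray:
    """Resample the waveform to alter temporal dynamics (rate > 1 speeds up)."""
    n_out = int(len(wave) / rate)
    idx = np.linspace(0, len(wave) - 1, n_out)
    return np.interp(idx, np.arange(len(wave)), wave)

def add_band_noise(wave: np.ndarray, sr: int, lo: float, hi: float,
                   snr_db: float = 30.0) -> np.ndarray:
    """Inject noise confined to the [lo, hi] Hz band via an FFT-domain mask."""
    spec = np.fft.rfft(np.random.randn(len(wave)))
    freqs = np.fft.rfftfreq(len(wave), d=1.0 / sr)
    spec[(freqs < lo) | (freqs > hi)] = 0.0
    noise = np.fft.irfft(spec, n=len(wave))
    noise *= np.linalg.norm(wave) / (np.linalg.norm(noise) * 10 ** (snr_db / 20))
    return wave + noise

sr = 16000
clean = np.sin(2 * np.pi * 440 * np.arange(sr) / sr)  # 1 s test tone
poisoned = add_band_noise(stretch_tempo(clean), sr, lo=6000, hi=7000)
```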

AAAI Conference 2026 Conference Paper

NeuralOM: Neural Ocean Model for Subseasonal-to-Seasonal Simulation

  • Yuan Gao
  • Hao Wu
  • Fan Xu
  • Yanfei Xiang
  • Ruijian Gou
  • Ruiqi Shu
  • Qingsong Wen
  • Xian Wu

Long-term, high-fidelity simulation of slow-changing physical systems, such as the ocean and climate, presents a fundamental challenge in scientific computing. Traditional autoregressive machine learning models often fail in these tasks as minor errors accumulate and lead to rapid forecast degradation. To address this problem, we propose NeuralOM, a general neural operator framework designed for simulating complex, slow-changing dynamics. NeuralOM's core consists of two key innovations: (1) a Progressive Residual Correction Framework that decomposes the forecasting task into a series of fine-grained refinement steps, effectively suppressing long-term error accumulation; and (2) a Physics-Guided Graph Network whose built-in adaptive messaging mechanism explicitly models multi-scale physical interactions, such as gradient-driven flows and multiplicative couplings, thereby enhancing physical consistency while maintaining computational efficiency. We validate NeuralOM on the challenging task of global Subseasonal-to-Seasonal (S2S) ocean simulation. Extensive experiments demonstrate that NeuralOM not only surpasses state-of-the-art models in forecast accuracy and long-term stability, but also excels in simulating extreme events. For instance, at a 60-day lead time, NeuralOM achieves a 13.3% lower RMSE compared to the best-performing baseline, offering a stable, efficient, and physically-aware paradigm for data-driven scientific computing.
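The progressive residual-correction idea can be sketched as a cascade of correctors, each adding a learned refinement to the running state. The stand-in models below are toy MLPs, not NeuralOM.

```python
# Minimal sketch of a progressive residual-correction rollout: a coarse
# forecast is refined by a series of learned correctors, each predicting
# a residual. The networks are illustrative stand-ins.
import torch
import torch.nn as nn

class ResidualCorrector(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return state + self.net(state)  # predict and add a residual refinement

dim, n_stages = 32, 3
stages = nn.ModuleList(ResidualCorrector(dim) for _ in range(n_stages))
state = torch.randn(4, dim)   # coarse initial forecast
for stage in stages:          # fine-grained refinement steps
    state = stage(state)
```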

JBHI Journal 2026 Journal Article

Pathology-Guided AI System for Accurate Segmentation and Diagnosis of Cervical Spondylosis

  • Qi Zhang
  • Xiuyuan Chen
  • Ziyi He
  • Lianming Wu
  • Kun Wang
  • Jianqi Sun
  • Hongxing Shen

Cervical spondylosis, a complex and prevalent condition, demands precise and efficient diagnostic techniques for accurate assessment. While MRI offers detailed visualization of cervical spine anatomy, manual interpretation remains labor-intensive and prone to error. To address this, we developed an innovative AI-assisted Expert-based Diagnosis System that automates both segmentation and diagnosis of cervical spondylosis using MRI. Leveraging multi-center datasets of cervical MRI images from patients with cervical spondylosis, our system features a pathology-guided segmentation model capable of accurately segmenting key cervical anatomical structures. The segmentation is followed by an expert-based diagnostic framework that automates the calculation of critical clinical indicators. Our segmentation model achieved an impressive average Dice coefficient exceeding 0.90 across four cervical spinal anatomies and demonstrated enhanced accuracy in herniation areas. Diagnostic evaluation further showcased the system's precision, with the lowest mean average errors (MAE) for the C2-C7 Cobb angle and the Maximum Spinal Cord Compression (MSCC) coefficient. In addition, our method delivered high accuracy, precision, recall, and F1 scores in herniation localization, K-line status assessment, T2 hyperintensity detection, and Kang grading. Comparative analysis and external validation demonstrate that our system outperforms existing methods, establishing a new benchmark for segmentation and diagnostic tasks for cervical spondylosis.
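For reference, the reported Dice coefficient is the standard overlap metric, computable in a few lines (a generic definition, not the paper's code):

```python
# Dice = 2|P ∩ G| / (|P| + |G|) for binary masks; standard definition.
import numpy as np

def dice(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-8) -> float:
    pred, gt = pred.astype(bool), gt.astype(bool)
    return 2.0 * np.logical_and(pred, gt).sum() / (pred.sum() + gt.sum() + eps)

a = np.zeros((4, 4), dtype=bool); a[:2] = True   # prediction: top two rows
b = np.zeros((4, 4), dtype=bool); b[1:3] = True  # ground truth: middle rows
print(round(dice(a, b), 3))  # 0.5
```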

AAAI Conference 2026 Conference Paper

SafeSieve: From Heuristics to Experience in Progressive Pruning for LLM-based Multi-Agent Communication

  • Ruijia Zhang
  • Xinyan Zhao
  • Ruixiang Wang
  • Sigen Chen
  • Guibin Zhang
  • An Zhang
  • Kun Wang
  • Qingsong Wen

LLM-based multi-agent systems exhibit strong collaborative capabilities but often suffer from redundant communication and excessive token overhead. Existing methods typically enhance efficiency through pretrained GNNs or greedy algorithms, but often isolate pre- and post-task optimization, lacking a unified strategy. To this end, we present SafeSieve, a progressive and adaptive multi-agent pruning algorithm that dynamically refines inter-agent communication through a novel dual mechanism. SafeSieve integrates initial LLM-based semantic evaluation with accumulated performance feedback, enabling a smooth transition from heuristic initialization to experience-driven refinement. Unlike existing greedy Top-k pruning methods, SafeSieve employs 0-extension clustering to preserve structurally coherent agent groups while eliminating ineffective links. Experiments across benchmarks (SVAMP, HumanEval, etc.) showcase that SafeSieve achieves 94.01% average accuracy while reducing token usage by 12.4%-27.8%. Results further demonstrate robustness under prompt injection attacks (1.23% average accuracy drop). In heterogeneous settings, SafeSieve reduces deployment costs by 13.3% while maintaining performance. These results establish SafeSieve as an efficient, GPU-free, and scalable framework for practical multi-agent systems. Our code can be found below.
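A toy sketch of the heuristics-to-experience transition: each inter-agent link blends an LLM-derived prior with accumulated task feedback, and low-scoring links are pruned. The blending rule, threshold, and names are assumptions; SafeSieve's 0-extension clustering is omitted.

```python
# Hypothetical illustration of progressive pruning, not SafeSieve's algorithm.
from dataclasses import dataclass

@dataclass
class Link:
    prior: float          # initial LLM-based semantic score in [0, 1]
    successes: int = 0
    trials: int = 0

    def score(self) -> float:
        if self.trials == 0:
            return self.prior
        w = min(1.0, self.trials / 20)   # trust experience as it accrues
        return (1 - w) * self.prior + w * self.successes / self.trials

def prune(links: dict, threshold: float = 0.4) -> dict:
    return {k: v for k, v in links.items() if v.score() >= threshold}

links = {("planner", "coder"): Link(0.9), ("planner", "critic"): Link(0.5)}
links[("planner", "critic")].trials = 10
links[("planner", "critic")].successes = 2
print(prune(links).keys())  # the low-performing link is removed
```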

JBHI Journal 2026 Journal Article

WOADNet: A Wavelet-Inspired Orientational Adaptive Dictionary Network for CT Metal Artifact Reduction

  • Tong Jin
  • Jin Liu
  • Diandian Wang
  • Kun Wang
  • Chenlong Miao
  • Yikun Zhang
  • Dianlin Hu
  • Zhan Wu

In computed tomography (CT), metal artifacts pose a persistent challenge to achieving high-quality imaging. Despite advancements in metal artifact reduction (MAR) techniques, many existing approaches have not fully leveraged the intrinsic a priori knowledge related to metal artifacts, improved model interpretability, or addressed the complex texture of CT images effectively. To address these limitations, we propose a novel and interpretable framework, the wavelet-inspired oriented adaptive dictionary network (WOADNet). WOADNet builds on sparse coding with orientational information in the wavelet domain. By exploring the discriminative features of artifacts and anatomical tissues, we adopt a high-precision filter parameterization strategy that incorporates multiangle rotations. Furthermore, we integrate a reweighted sparse constraint framework into the convolutional dictionary learning process and employ a cross-space, multiscale attention mechanism to construct an adaptive convolutional dictionary unit for the artifact feature encoder. This innovative design allows for flexible adjustment of weights and convolutional representations, resulting in significant image quality improvements. The experimental results using synthetic and clinical datasets demonstrate that WOADNet outperforms both traditional and state-of-the-art MAR methods in terms of suppressing artifacts.
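The sparse-coding machinery the abstract builds on can be illustrated with a reweighted iterative soft-thresholding (ISTA) loop over a plain dictionary; WOADNet's learned oriented filters and attention-based dictionary units are not reproduced here.

```python
# Sketch of reweighted-l1 sparse coding via ISTA over a generic dictionary D.
import numpy as np

def soft(x, t):
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def reweighted_ista(D, x, lam=0.1, n_outer=3, n_inner=50, eps=1e-3):
    """Approximately solve min_z 0.5||Dz - x||^2 + lam * sum_i w_i |z_i|."""
    L = np.linalg.norm(D, 2) ** 2            # Lipschitz constant of the gradient
    z, w = np.zeros(D.shape[1]), np.ones(D.shape[1])
    for _ in range(n_outer):
        for _ in range(n_inner):
            z = soft(z - D.T @ (D @ z - x) / L, lam * w / L)
        w = 1.0 / (np.abs(z) + eps)          # reweight: penalize small coefficients
    return z

rng = np.random.default_rng(0)
D = rng.standard_normal((64, 128))
x = D @ (rng.standard_normal(128) * (rng.random(128) < 0.05))  # sparse signal
z = reweighted_ista(D, x)
```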

NeurIPS Conference 2025 Conference Paper

AgentAuditor: Human-level Safety and Security Evaluation for LLM Agents

  • Hanjun Luo
  • Shenyu Dai
  • Chiming Ni
  • Xinfeng Li
  • Guibin Zhang
  • Kun Wang
  • Tongliang Liu
  • Hanan Salam

Despite the rapid advancement of LLM-based agents, the reliable evaluation of their safety and security remains a significant challenge. Existing rule-based or LLM-based evaluators often miss dangers in agents' step-by-step actions, overlook subtle meanings, fail to see how small issues compound, and get confused by unclear safety or security rules. To overcome this evaluation crisis, we introduce AgentAuditor, a universal, training-free, memory-augmented reasoning framework that empowers LLM evaluators to emulate human expert evaluators. AgentAuditor constructs an experiential memory by having an LLM adaptively extract structured semantic features (e.g., scenario, risk, behavior) and generate associated chain-of-thought reasoning traces for past interactions. A multi-stage, context-aware retrieval-augmented generation process then dynamically retrieves the most relevant reasoning experiences to guide the LLM evaluator's assessment of new cases. Moreover, we developed ASSEBench, the first benchmark designed to check how well LLM-based evaluators can spot both safety risks and security threats. ASSEBench comprises 2293 meticulously annotated interaction records, covering 15 risk types across 29 application scenarios. A key feature of ASSEBench is its nuanced approach to ambiguous risk situations, employing "Strict" and "Lenient" judgment standards. Experiments demonstrate that AgentAuditor not only consistently improves the evaluation performance of LLMs across all benchmarks but also sets a new state-of-the-art in LLM-as-a-judge for agent safety and security, achieving human-level accuracy. Our work is openly accessible at https://github.com/Astarojth/AgentAuditor-ASSEBench.
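The retrieval step can be sketched as nearest-neighbor lookup over stored reasoning traces; the embeddings below are random stand-ins and the similarity measure is an assumption.

```python
# Sketch of memory retrieval: top-k cosine similarity over stored experiences.
import numpy as np

def top_k_experiences(query_emb, memory_embs, traces, k=3):
    q = query_emb / np.linalg.norm(query_emb)
    m = memory_embs / np.linalg.norm(memory_embs, axis=1, keepdims=True)
    idx = np.argsort(m @ q)[::-1][:k]        # most similar past cases first
    return [traces[i] for i in idx]

rng = np.random.default_rng(1)
memory = rng.standard_normal((100, 384))     # stand-in feature embeddings
traces = [f"reasoning trace #{i}" for i in range(100)]
print(top_k_experiences(rng.standard_normal(384), memory, traces))
```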

JBHI Journal 2025 Journal Article

An Online Adaptation Framework for Enhancing Calibration-Free SSVEP-Based BCI Performance

  • Weize Chen
  • Jie Mei
  • Xiaolin Xiao
  • Ang Li
  • Lingling Tao
  • Kun Wang
  • Minpeng Xu
  • Dong Ming

Accomplishing a plug-and-play steady-state visual evoked potential (SSVEP)-based brain-computer interface (BCI) remains a critical challenge, due to the unsatisfactory performance of calibration-free decoding algorithms. A current method called online adaptive canonical correlation analysis (OACCA) has proved efficient in enhancing calibration-free performance by self-adaptation merely with online data. However, OACCA only adapts the spatial filters and excludes other useful adaptive procedures such as individual template estimation, preventing full exploitation of model decoding and adaptation. This study proposes a new online adaptation framework, termed online adaptive extended correlation analysis (OAECA), to augment the calibration-free online adaptation loop. OAECA first recalls and cleans the online trials for reliable data learning, then tunes individual templates and spatial filters for complete model updating, and finally adopts extended feature matching to improve target recognition. The simulation results on two public SSVEP datasets revealed that OAECA significantly outperformed OACCA for almost all 105 subjects, and both offline and online experiments further confirmed the effectiveness of OAECA. In particular, OAECA achieved the highest average information transfer rate (ITR) of 202.17 bits/min in the online experiment, significantly exceeding the state-of-the-art OACCA's 177.02 bits/min. This study enhanced calibration-free performance through comprehensive online adaptation, hopefully advancing SSVEP-based BCIs toward practical plug-and-play real-world applications.
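One ingredient named in the abstract, online individual-template estimation, reduces to an incremental mean per target plus correlation-based matching, sketched below; OAECA's trial cleaning and spatial-filter updates are omitted.

```python
# Sketch only: running-mean templates and correlation matching, not OAECA.
import numpy as np

class OnlineTemplates:
    def __init__(self, n_targets, n_channels, n_samples):
        self.templates = np.zeros((n_targets, n_channels, n_samples))
        self.counts = np.zeros(n_targets, dtype=int)

    def update(self, trial, label):
        """Incremental mean: T_k <- T_k + (x - T_k) / n_k."""
        self.counts[label] += 1
        self.templates[label] += (trial - self.templates[label]) / self.counts[label]

    def classify(self, trial):
        """Pick the target whose template correlates best with the trial."""
        scores = [np.corrcoef(trial.ravel(), t.ravel())[0, 1] if c else -np.inf
                  for t, c in zip(self.templates, self.counts)]
        return int(np.argmax(scores))

ot = OnlineTemplates(n_targets=4, n_channels=8, n_samples=250)
ot.update(np.random.randn(8, 250), label=2)   # learn from an online trial
```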

NeurIPS Conference 2025 Conference Paper

Auditing Meta-Cognitive Hallucinations in Reasoning Large Language Models

  • Haolang Lu
  • Yilian Liu
  • Jingxin Xu
  • Guoshun Nan
  • Yuanlong Yu
  • Zhican Chen
  • Kun Wang

The development of Reasoning Large Language Models (RLLMs) has significantly improved multi-step reasoning capabilities, but it has also made hallucination problems more frequent and harder to eliminate. While existing approaches address hallucination through external knowledge integration, model parameter analysis, or self-verification mechanisms, they fail to provide a comprehensive insight into how hallucinations emerge and evolve throughout the reasoning chain. In this work, we investigate hallucination causality under constrained knowledge domains by auditing the Chain-of-Thought (CoT) trajectory and assessing the model's cognitive confidence in potentially erroneous or biased claims. Analysis reveals that in long-CoT settings, RLLMs may iteratively reinforce biases and errors through flawed reflective processes, ultimately inducing hallucinated reasoning paths. Counterintuitively, even with interventions at hallucination origins, reasoning chains display pronounced "chain disloyalty", resisting correction and sustaining flawed trajectories. We further point out that existing hallucination detection methods are less reliable and interpretable than previously assumed, especially in complex multi-step reasoning contexts. Unlike circuit tracing that requires access to model parameters, our auditing enables more interpretable long-chain hallucination attribution in black-box settings, demonstrating stronger generalizability and practical utility. Our code is available at this link.

NeurIPS Conference 2025 Conference Paper

Breaking the Discretization Barrier of Continuous Physics Simulation Learning

  • Fan Xu
  • Hao Wu
  • Nan Wang
  • Lilan Peng
  • Kun Wang
  • Wei Gong
  • Xibin Zhao

The modeling of complicated time-evolving physical dynamics from partial observations is a long-standing challenge. Particularly, observations can be sparsely distributed in a seemingly random or unstructured manner, making it difficult to capture highly nonlinear features in a variety of scientific and engineering problems. However, existing data-driven approaches are often constrained by fixed spatial and temporal discretization. While some researchers attempt to achieve spatio-temporal continuity by designing novel strategies, they either overly rely on traditional numerical methods or fail to truly overcome the limitations imposed by discretization. To address these limitations, we propose CoPS, a purely data-driven method, to effectively model continuous physics simulation from partial observations. Specifically, we employ a multiplicative filter network to fuse and encode spatial information with the corresponding observations. Then we customize geometric grids and use a message-passing mechanism to map features from the original spatial domain to the customized grids. Subsequently, CoPS models continuous-time dynamics by designing multi-scale graph ODEs, while introducing a Markov-based neural auto-correction module to assist and constrain the continuous extrapolations. Comprehensive experiments demonstrate that CoPS advances the state-of-the-art methods in space-time continuous modeling across various scenarios. The source code is available at https://github.com/Sunxkissed/CoPS.

NeurIPS Conference 2025 Conference Paper

Can LLMs Outshine Conventional Recommenders? A Comparative Evaluation

  • Qijiong Liu
  • Jieming Zhu
  • Lu Fan
  • Kun Wang
  • Hengchang Hu
  • Wei Guo
  • Yong Liu
  • Xiao-ming Wu

Integrating large language models (LLMs) into recommender systems has created new opportunities for improving recommendation quality. However, a comprehensive benchmark is needed to thoroughly evaluate and compare the recommendation capabilities of LLMs with traditional recommender systems. In this paper, we introduce RecBench, which systematically investigates various item representation forms (including unique identifier, text, semantic embedding, and semantic identifier) and evaluates two primary recommendation tasks, i.e., click-through rate prediction (CTR) and sequential recommendation (SeqRec). Our extensive experiments cover up to 17 large models and are conducted across five diverse datasets from fashion, news, video, books, and music domains. Our findings indicate that LLM-based recommenders outperform conventional recommenders, achieving up to a 5% AUC improvement in CTR and up to a 170% NDCG@10 improvement in SeqRec. However, these substantial performance gains come at the expense of significantly reduced inference efficiency, rendering LLMs impractical as real-time recommenders. We have released our code and data to enable other researchers to reproduce and build upon our experimental results.
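For reference, NDCG@10, the SeqRec metric quoted above, has the standard form (a generic definition, not the benchmark's code):

```python
# NDCG@k: discounted cumulative gain of the ranked list, normalized by the
# ideal ordering of the same relevance labels.
import numpy as np

def ndcg_at_k(ranked_relevance, k=10):
    rel = np.asarray(ranked_relevance, dtype=float)[:k]
    dcg = (rel / np.log2(np.arange(2, rel.size + 2))).sum()
    ideal = np.sort(np.asarray(ranked_relevance, dtype=float))[::-1][:k]
    idcg = (ideal / np.log2(np.arange(2, ideal.size + 2))).sum()
    return dcg / idcg if idcg > 0 else 0.0

print(ndcg_at_k([0, 1, 0, 0, 1]))  # relevant items at ranks 2 and 5
```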

NeurIPS Conference 2025 Conference Paper

CellVerse: Do Large Language Models Really Understand Cell Biology?

  • Fan Zhang
  • Tianyu Liu
  • Zhihong Zhu
  • Hao Wu
  • Haixin Wang
  • Donghao Zhou
  • Yefeng Zheng
  • Kun Wang

Recent studies have demonstrated the feasibility of modeling single-cell data as natural languages and the potential of leveraging powerful large language models (LLMs) for understanding cell biology. However, a comprehensive evaluation of LLMs' performance on language-driven single-cell analysis tasks remains unexplored. Motivated by this challenge, we introduce CellVerse, a unified language-centric question-answering benchmark that integrates four types of single-cell multi-omics data and encompasses three hierarchical levels of single-cell analysis tasks: cell type annotation (cell-level), drug response prediction (drug-level), and perturbation analysis (gene-level). Going beyond this, we systematically evaluate the performance across 14 open-source and closed-source LLMs ranging from 160M to 671B parameters on CellVerse. Remarkably, the experimental results reveal: (1) Existing specialist models (C2S-Pythia) fail to make reasonable decisions across all sub-tasks within CellVerse, while generalist models such as Qwen, Llama, GPT, and DeepSeek family models exhibit preliminary understanding capabilities within the realm of cell biology. (2) The performance of current LLMs falls short of expectations and has substantial room for improvement. Notably, in the widely studied drug response prediction task, none of the evaluated LLMs demonstrate significant performance improvement over random guessing. CellVerse offers the first large-scale empirical demonstration that significant challenges still remain in applying LLMs to cell biology. By introducing CellVerse, we lay the foundation for advancing cell biology through natural languages and hope this paradigm could facilitate next-generation single-cell analysis. Project Page: https://cellverse-cuhk.github.io

JBHI Journal 2025 Journal Article

Decoding Arm Movement Direction Using Ultra-High-Density EEG

  • Zhen Ma
  • Xinyi Yang
  • Jiayuan Meng
  • Kun Wang
  • Minpeng Xu
  • Dong Ming

Detecting arm movement direction is significant for individuals with upper-limb motor disabilities to restore independent self-care abilities. It involves accurately decoding the fine movement patterns of the arm, which has become feasible using invasive brain-computer interfaces (BCIs). However, it is still a significant challenge for traditional electroencephalography (EEG) based BCIs to decode multi-directional arm movements effectively. This study designed an ultra-high-density (UHD) EEG system to decode multi-directional arm movements. The system contains 200 electrodes with an interval of about 4 mm. We analyzed the patterns of the UHD EEG signals induced by arm movements in different directions. To extract discriminative features from UHD EEG, we proposed a spatial filtering method combining principal component analysis (PCA) and discriminative spatial pattern (DSP). We collected EEG signals from five healthy subjects (two left-handed and three right-handed) to verify the system's feasibility. The movement-related cortical potentials (MRCPs) showed a certain degree of separability both in waveforms and spatial patterns for arm movements in different directions. This study achieved an average classification accuracy of 63.15 (8.71)% for both arms (eight-class task) with a peak accuracy of 77.24%. For the dominant arm (four-class task), we obtained an average accuracy of 75.31 (9.21)% with a peak accuracy of 85.00%. For the first time, this study simultaneously decodes multi-directional movements of both arms using UHD EEG. This study provides a promising approach for detecting information about arm movement directions, which is significant for the development of BCIs.
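The PCA-plus-DSP pipeline can be sketched as a generalized eigendecomposition of between-class versus within-class scatter; the dimensions below are illustrative and assume the channels have already been PCA-reduced.

```python
# Sketch of discriminative spatial pattern (DSP) filtering, not the paper's code.
import numpy as np
from scipy.linalg import eigh

def dsp_filters(trials, labels, n_filters=4):
    """trials: (n_trials, n_channels, n_samples); returns spatial filters W."""
    classes = np.unique(labels)
    means = {c: trials[labels == c].mean(axis=0) for c in classes}
    grand = trials.mean(axis=0)
    Sb = sum((means[c] - grand) @ (means[c] - grand).T for c in classes)
    Sw = sum(((tr - means[c]) @ (tr - means[c]).T
              for c in classes for tr in trials[labels == c]),
             np.zeros_like(Sb))
    # Generalized eigenproblem: maximize between- over within-class scatter.
    vals, vecs = eigh(Sb, Sw + 1e-6 * np.eye(Sw.shape[0]))
    return vecs[:, np.argsort(vals)[::-1][:n_filters]]

rng = np.random.default_rng(0)
X = rng.standard_normal((40, 20, 250))       # PCA-reduced channels (assumed)
W = dsp_filters(X, rng.integers(0, 4, 40))
features = np.einsum('ck,nct->nkt', W, X)    # spatially filtered trials
```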

AAAI Conference 2025 Conference Paper

Depth-Centric Dehazing and Depth-Estimation from Real-World Hazy Driving Video

  • Junkai Fan
  • Kun Wang
  • Zhiqiang Yan
  • Xiang Chen
  • Shangbing Gao
  • Jun Li
  • Jian Yang

In this paper, we study the challenging problem of simultaneously removing haze and estimating depth from real monocular hazy videos. These tasks are inherently complementary: enhanced depth estimation improves dehazing via the atmospheric scattering model (ASM), while superior dehazing contributes to more accurate depth estimation through the brightness consistency constraint (BCC). To tackle these intertwined tasks, we propose a novel depth-centric learning framework that integrates the ASM model with the BCC constraint. Our key idea is that both ASM and BCC rely on a shared depth estimation network. This network simultaneously exploits adjacent dehazed frames to enhance depth estimation via BCC and uses the refined depth cues to more effectively remove haze through ASM. Additionally, we leverage a non-aligned clear video and its estimated depth to independently regularize the dehazing and depth estimation networks. This is achieved by designing two discriminator networks: D_MFIR enhances high-frequency details in dehazed videos, and D_MDR reduces the occurrence of black holes in low-texture regions. Extensive experiments demonstrate that the proposed method outperforms current state-of-the-art techniques in both video dehazing and depth estimation tasks, especially in real-world hazy scenes.
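The atmospheric scattering model (ASM) that couples the two tasks has a closed form, I(x) = J(x)t(x) + A(1 - t(x)) with transmission t(x) = exp(-beta * d(x)), so a depth estimate directly yields a dehazed image. A synthetic sanity check:

```python
# Invert the atmospheric scattering model given depth and airlight A.
import numpy as np

def dehaze_with_depth(I, depth, A=0.9, beta=1.0, t_min=0.1):
    t = np.clip(np.exp(-beta * depth), t_min, 1.0)  # transmission from depth
    return (I - A * (1.0 - t)) / t                  # recover clear scene J

rng = np.random.default_rng(0)
J_true = rng.random((32, 32))
depth = rng.random((32, 32)) * 2
I_hazy = J_true * np.exp(-depth) + 0.9 * (1 - np.exp(-depth))  # synthesize haze
J_hat = dehaze_with_depth(I_hazy, depth)
print(np.abs(J_hat - J_true).max() < 1e-6)  # exact when t is not clipped
```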

ICML Conference 2025 Conference Paper

Discovering Latent Causal Graphs from Spatiotemporal Data

  • Kun Wang
  • Sumanth Varambally
  • Duncan Watson-Parris
  • Yian Ma
  • Rose Yu

Many important phenomena in scientific fields like climate, neuroscience, and epidemiology are naturally represented as spatiotemporal gridded data with complex interactions. Inferring causal relationships from these data is a challenging problem compounded by the high dimensionality of such data and the correlations between spatially proximate points. We present SPACY (SPAtiotemporal Causal discoverY), a novel framework based on variational inference, designed to model latent time series and their causal relationships from spatiotemporal data. SPACY alleviates the high-dimensional challenge by discovering causal structures in the latent space. To aggregate spatially proximate, correlated grid points, we use spatial factors, parametrized by spatial kernel functions, to map observational time series to latent representations. Theoretically, we generalize the problem to a continuous spatial domain and establish identifiability when the observations arise from a nonlinear, invertible function of the product of latent series and spatial factors. Using this approach, we avoid assumptions that are often unverifiable, including those about instantaneous effects or sufficient variability. Empirically, SPACY outperforms state-of-the-art baselines on synthetic data, even in challenging settings where existing methods struggle, while remaining scalable for large grids. SPACY also identifies key known phenomena from real-world climate data. An implementation of SPACY is available at https://github.com/Rose-STL-Lab/SPACY/
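The spatial-factor construction can be sketched with RBF kernels mapping a handful of latent series onto a grid; the kernel centers, scale, and linear observation map below are illustrative assumptions.

```python
# Sketch of kernel-parametrized spatial factors: X = F Z, with F built from
# RBF kernels over grid locations. Illustrative, not SPACY's implementation.
import numpy as np

def rbf_spatial_factors(grid_xy, centers, scale=0.2):
    """F[l, k] = exp(-||s_l - c_k||^2 / (2 scale^2)) over grid locations s_l."""
    d2 = ((grid_xy[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * scale ** 2))

g = np.linspace(0, 1, 16)
grid = np.stack(np.meshgrid(g, g), -1).reshape(-1, 2)   # 256 grid points
centers = np.array([[0.25, 0.25], [0.75, 0.75], [0.25, 0.75]])
F = rbf_spatial_factors(grid, centers)                  # (256, 3)
Z = np.random.default_rng(0).standard_normal((3, 100))  # 3 latent time series
X = F @ Z                                               # observed grid series
```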

NeurIPS Conference 2025 Conference Paper

G-Memory: Tracing Hierarchical Memory for Multi-Agent Systems

  • Guibin Zhang
  • Muxin Fu
  • Kun Wang
  • Frank Wan
  • Miao Yu
  • Shuicheng Yan

Large language model (LLM)-powered multi-agent systems (MAS) have demonstrated cognitive and execution capabilities that far exceed those of single LLM agents, yet their capacity for self-evolution remains hampered by underdeveloped memory architectures. Upon close inspection, we are alarmed to discover that prevailing MAS memory mechanisms (1) are overly simplistic, completely disregarding the nuanced inter-agent collaboration trajectories, and (2) lack cross-trial and agent-specific customization, in stark contrast to the expressive memory developed for single agents. To bridge this gap, we introduce G-Memory, a hierarchical, agentic memory system for MAS inspired by organizational memory theory, which manages the lengthy MAS interaction via a three-tier graph hierarchy: insight, query, and interaction graphs. Upon receiving a new user query, G-Memory performs bi-directional memory traversal to retrieve both high-level, generalizable insights that enable the system to leverage cross-trial knowledge, and fine-grained, condensed interaction trajectories that compactly encode prior collaboration experiences. Upon task execution, the entire hierarchy evolves by assimilating new collaborative trajectories, nurturing the progressive evolution of agent teams. Extensive experiments across five benchmarks, three LLM backbones, and three popular MAS frameworks demonstrate that G-Memory improves success rates in embodied action and accuracy in knowledge QA by up to 20.89% and 10.12%, respectively, without any modifications to the original frameworks.

NeurIPS Conference 2025 Conference Paper

GoT: Unleashing Reasoning Capability of MLLM for Visual Generation and Editing

  • Rongyao Fang
  • Chengqi Duan
  • Kun Wang
  • Linjiang Huang
  • Hao Li
  • Hao Tian
  • Shilin Yan
  • Weihao Yu

Current image generation and editing methods primarily process textual prompts as direct inputs without explicit reasoning about visual composition or operational steps. We present Generation Chain-of-Thought (GoT), a novel paradigm that empowers a Multimodal Large Language Model (MLLM) to first generate an explicit, structured reasoning chain in natural language—detailing semantic relationships, object attributes, and, crucially, precise spatial coordinates—before any image synthesis occurs. This intermediate reasoning output directly guides the subsequent visual generation or editing process. This approach transforms conventional text-to-image generation and editing into a reasoning-guided framework that analyzes semantic relationships and spatial arrangements. We define the formulation of GoT and construct large-scale GoT datasets containing over 9M samples with detailed reasoning chains capturing semantic-spatial relationships. To leverage the advantages of GoT, we implement a unified framework that integrates Qwen2.5-VL for reasoning chain generation with an end-to-end diffusion model enhanced by our novel Semantic-Spatial Guidance Module. Experiments show our GoT framework achieves excellent performance on both generation and editing tasks, with significant improvements over baselines. Additionally, our approach enables interactive visual generation, allowing users to explicitly modify reasoning steps for precise image adjustments. GoT pioneers a new direction for reasoning-driven visual generation and editing, producing images that better align with human intent. We will release our datasets and models to facilitate future research.

NeurIPS Conference 2025 Conference Paper

GuardReasoner-VL: Safeguarding VLMs via Reinforced Reasoning

  • Yue Liu
  • Shengfang Zhai
  • Mingzhe Du
  • Yulin Chen
  • Tri Cao
  • Hongcheng Gao
  • Cheng Wang
  • Xinfeng Li

To enhance the safety of VLMs, this paper introduces a novel reasoning-based VLM guard model dubbed GuardReasoner-VL. The core idea is to incentivize the guard model to deliberatively reason before making moderation decisions via online RL. First, we construct GuardReasoner-VLTrain, a reasoning corpus with 123K samples and 631K reasoning steps, spanning text, image, and text-image inputs. Then, based on it, we cold-start our model's reasoning ability via SFT. In addition, we further enhance reasoning regarding moderation through online RL. Concretely, to enhance diversity and difficulty of samples, we conduct rejection sampling followed by data augmentation via the proposed safety-aware data concatenation. Besides, we use a dynamic clipping parameter to encourage exploration in early stages and exploitation in later stages. To balance performance and token efficiency, we design a length-aware safety reward that integrates accuracy, format, and token cost. Extensive experiments demonstrate the superiority of our model. Remarkably, it surpasses the runner-up by 19.27% F1 score on average, as shown in Figure 1. We release data, code, and models (3B/7B) of GuardReasoner-VL: https://github.com/yueliu1999/GuardReasoner-VL.
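A sketch of what a length-aware safety reward combining accuracy, format, and token cost might look like; the weights and functional form below are assumptions, not the paper's exact reward.

```python
# Hypothetical length-aware reward: accuracy plus format bonus minus a
# token-cost penalty beyond a budget.
def length_aware_reward(correct: bool, well_formatted: bool, n_tokens: int,
                        budget: int = 512, lam: float = 0.2) -> float:
    acc = 1.0 if correct else 0.0
    fmt = 0.5 if well_formatted else 0.0
    cost = lam * max(0, n_tokens - budget) / budget  # penalize overlong traces
    return acc + fmt - cost

print(length_aware_reward(True, True, n_tokens=400))   # 1.5 (within budget)
print(length_aware_reward(True, True, n_tokens=1024))  # 1.3 (penalized)
```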

JBHI Journal 2025 Journal Article

HealthiVert-GAN: A Novel Framework of Pseudo-Healthy Vertebral Image Synthesis for Interpretable Compression Fracture Grading

  • Qi Zhang
  • Cheng Chuang
  • Shunan Zhang
  • Ziqi Zhao
  • Kun Wang
  • Jun Xu
  • Jianqi Sun

Osteoporotic vertebral compression fractures (OVCFs) are prevalent in the elderly population, typically assessed on computed tomography (CT) scans by evaluating vertebral height loss. This assessment helps determine the fracture's impact on spinal stability and the need for surgical intervention. However, the absence of pre-fracture CT scans and standardized vertebral references leads to measurement errors and inter-observer variability, while irregular compression patterns further challenge the precise grading of fracture severity. While deep learning methods have shown promise in aiding OVCFs screening, they often lack interpretability and sufficient sensitivity, limiting their clinical applicability. To address these challenges, we introduce a novel vertebra synthesis-height loss quantification-OVCFs grading framework. Our proposed model, HealthiVert-GAN, utilizes a coarse-to-fine synthesis network designed to generate pseudo-healthy vertebral images that simulate the pre-fracture state of fractured vertebrae. This model integrates three auxiliary modules that leverage the morphology and height information of adjacent healthy vertebrae to ensure anatomical consistency. Additionally, we introduce the Relative Height Loss of Vertebrae (RHLV) as a quantification metric, which divides each vertebra into three sections to measure height loss between pre-fracture and post-fracture states, followed by fracture severity classification using a Support Vector Machine (SVM). Our approach achieves state-of-the-art classification performance on both the Verse2019 dataset and an in-house dataset, and it provides cross-sectional distribution maps of vertebral height loss. This practical tool enhances diagnostic accuracy in clinical settings and assists in surgical decision-making.
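The RHLV metric can be sketched on binary masks: split the vertebra into three sections and compare per-section column heights between the synthesized pre-fracture and observed post-fracture masks. The section geometry below is a simplification.

```python
# Sketch of a relative-height-loss computation over thirds of a vertebra mask.
import numpy as np

def section_heights(mask: np.ndarray) -> np.ndarray:
    """Mean column height (in pixels) for each horizontal third of the mask."""
    cols = np.array_split(np.arange(mask.shape[1]), 3)
    return np.array([mask[:, c].sum(axis=0).mean() for c in cols])

def rhlv(pre_mask: np.ndarray, post_mask: np.ndarray) -> np.ndarray:
    pre, post = section_heights(pre_mask), section_heights(post_mask)
    return (pre - post) / np.maximum(pre, 1e-8)  # relative loss per section

pre = np.zeros((40, 30), bool); pre[5:35] = True        # healthy: 30 px tall
post = pre.copy();              post[5:17, :10] = False  # anterior collapse
print(rhlv(pre, post).round(2))  # first section shows ~40% height loss
```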

NeurIPS Conference 2025 Conference Paper

Improved Regret and Contextual Linear Extension for Pandora's Box and Prophet Inequality

  • Junyan Liu
  • Ziyun Chen
  • Kun Wang
  • Haipeng Luo
  • Lillian Ratliff

We study the Pandora’s Box problem in an online learning setting with semi-bandit feedback. In each round, the learner sequentially pays to open up to $n$ boxes with unknown reward distributions, observes rewards upon opening, and decides when to stop. The utility of the learner is the maximum observed reward minus the cumulative cost of opened boxes, and the goal is to minimize regret defined as the gap between the cumulative expected utility and that of the optimal policy. We propose a new algorithm that achieves $\widetilde{O}(\sqrt{nT})$ regret after $T$ rounds, which improves the $\widetilde{O}(n\sqrt{T})$ bound of Agarwal et al. [2024] and matches the known lower bound up to logarithmic factors. To better capture real-life applications, we then extend our results to a natural but challenging contextual linear setting, where each box's expected reward is linear in some known but time-varying $d$-dimensional context and the noise distribution is fixed over time. We design an algorithm that learns both the linear function and the noise distributions, achieving $\widetilde{O}(nd\sqrt{T})$ regret. Finally, we show that our techniques also apply to the online Prophet Inequality problem, where the learner must decide immediately whether or not to accept a revealed reward. In both non-contextual and contextual settings, our approach achieves similar improvements and regret bounds.
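For concreteness, the round utility and regret described above can be written as follows, where S_t is the set of boxes opened in round t and u* is the expected per-round utility of the optimal policy (notation assumed, not taken verbatim from the paper):

```latex
u_t \;=\; \max_{i \in S_t} X_{t,i} \;-\; \sum_{i \in S_t} c_i,
\qquad
\mathrm{Reg}_T \;=\; T \cdot u^{\star} \;-\; \mathbb{E}\Big[\sum_{t=1}^{T} u_t\Big].
```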

NeurIPS Conference 2025 Conference Paper

Improving Deep Learning for Accelerated MRI With Data Filtering

  • Kang Lin
  • Anselm Krainovic
  • Kun Wang
  • Reinhard Heckel

Deep neural networks achieve state-of-the-art results for accelerated MRI reconstruction. Most research on deep learning based imaging focuses on improving neural network architectures trained and evaluated on fixed and homogeneous training and evaluation data. In this work, we investigate data curation strategies for improving MRI reconstruction. We assemble a large dataset of raw k-space data from 18 public sources consisting of 1.1M images and construct a diverse evaluation set comprising 48 test sets, capturing variations in anatomy, contrast, number of coils, and other key factors. We propose and study different data filtering strategies to enhance performance of current state-of-the-art neural networks for accelerated MRI reconstruction. Our experiments show that filtering the training data leads to consistent, albeit modest, performance gains. These performance gains are robust across different training set sizes and accelerations, and we find that filtering is particularly beneficial when the proportion of in-distribution data in the unfiltered training set is low.

NeurIPS Conference 2025 Conference Paper

LIFEBENCH: Evaluating Length Instruction Following in Large Language Models

  • Wei Zhang
  • Zhenhong Zhou
  • Kun Wang
  • Junfeng Fang
  • Rongwu Xu
  • Yuanhe Zhang
  • Rui Wang
  • Ge Zhang

While large language models (LLMs) can solve PhD-level reasoning problems over long context inputs, they still struggle with a seemingly simpler task: following explicit length instructions, e.g., write a 10,000-word novel. Additionally, models often generate outputs that are far too short, terminate prematurely, or even refuse the request. Existing benchmarks focus primarily on evaluating generation quality, but often overlook whether the generations meet length constraints. To this end, we introduce the Length Instruction Following Evaluation Benchmark (LIFEBench) to comprehensively evaluate LLMs' ability to follow length instructions across diverse tasks and a wide range of specified lengths. LIFEBench consists of 10,800 instances across 4 task categories in both English and Chinese, covering length constraints ranging from 16 to 8192 words. We evaluate 26 widely used LLMs and find that most models reasonably follow short-length instructions but deteriorate sharply beyond a certain threshold. Surprisingly, almost all models fail to reach the vendor-claimed maximum output lengths in practice, as further confirmed by our evaluations extending up to 32K words. Even long-context LLMs, despite their extended input-output windows, counterintuitively fail to improve length-instruction following. Notably, reasoning LLMs outperform even specialized long-text generation models, achieving state-of-the-art length following. Overall, LIFEBench uncovers fundamental limitations in current LLMs' length-instruction-following ability, offering critical insights for future progress.
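A sketch of a length-compliance score of the kind such a benchmark needs: deviation of the generated word count from the instructed target, mapped to [0, 1]. The scoring curve is an assumption, not LIFEBench's official metric.

```python
# Hypothetical length-following score: 1.0 within a tolerance band around
# the target, decaying linearly to 0 at 100% relative error.
def length_score(n_words: int, target: int, tol: float = 0.1) -> float:
    err = abs(n_words - target) / target
    return 1.0 if err <= tol else max(0.0, 1.0 - (err - tol) / (1.0 - tol))

for target in (16, 512, 8192):
    print(target, round(length_score(int(target * 0.6), target), 3))  # 40% short
```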

NeurIPS Conference 2025 Conference Paper

Seg-VAR: Image Segmentation with Visual Autoregressive Modeling

  • Rongkun Zheng
  • Lu Qi
  • Xi Chen
  • Yi Wang
  • Kun Wang
  • Hengshuang Zhao

While visual autoregressive modeling (VAR) strategies have shed light on image generation with the autoregressive models, their potential for segmentation, a task that requires precise low-level spatial perception, remains unexplored. Inspired by the multi-scale modeling of classic Mask2Former-based models, we propose Seg-VAR, a novel framework that rethinks segmentation as a conditional autoregressive mask generation problem. This is achieved by replacing the discriminative learning with the latent learning process. Specifically, our method incorporates three core components: (1) an image encoder generating latent priors from input images, (2) a spatial-aware seglat (a latent expression of segmentation mask) encoder that maps segmentation masks into discrete latent tokens using a location-sensitive color mapping to distinguish instances, and (3) a decoder reconstructing masks from these latents. A multi-stage training strategy is introduced: first learning seglat representations via image-seglat joint training, then refining latent transformations, and finally aligning image-encoder-derived latents with seglat distributions. Experiments show Seg-VAR outperforms previous discriminative and generative methods on various segmentation tasks and validation benchmarks. By framing segmentation as a sequential hierarchical prediction task, Seg-VAR opens new avenues for integrating autoregressive reasoning into spatial-aware vision systems.

NeurIPS Conference 2025 Conference Paper

Unleashing Foundation Vision Models: Adaptive Transfer for Diverse Data-Limited Scientific Domains

  • Qiankun Li
  • Feng He
  • Huabao Chen
  • Xin Ning
  • Kun Wang
  • Zengfu Wang

In the big data era, the computer vision field benefits from large-scale datasets such as LAION-2B, LAION-400M, ImageNet-21K, and Kinetics, on which popular models like the ViT and ConvNeXt series have been pre-trained, acquiring substantial knowledge. However, numerous downstream tasks in specialized and data-limited scientific domains continue to pose significant challenges. In this paper, we propose a novel Cluster Attention Adapter (CLAdapter), which refines and adapts the rich representations learned from large-scale data to various data-limited downstream tasks. Specifically, CLAdapter introduces attention mechanisms and cluster centers to personalize the enhancement of transformed features through distribution correlation and transformation matrices. This enables models fine-tuned with CLAdapter to learn distinct representations tailored to different feature sets, facilitating the models' adaptation from rich pre-trained features to various downstream scenarios effectively. In addition, CLAdapter's unified interface design allows for seamless integration with multiple model architectures, including CNNs and Transformers, in both 2D and 3D contexts. Through extensive experiments on 10 datasets spanning generic, multimedia, biological, medical, industrial, agricultural, environmental, geographical, materials science, out-of-distribution (OOD), and 3D analysis domains, CLAdapter achieves state-of-the-art performance across diverse data-limited scientific domains, demonstrating its effectiveness in unleashing the potential of foundation vision models via adaptive transfer. Code is available at https://github.com/qklee-lz/CLAdapter.

AAAI Conference 2025 Conference Paper

UrbanVLP: Multi-Granularity Vision-Language Pretraining for Urban Socioeconomic Indicator Prediction

  • Xixuan Hao
  • Wei Chen
  • Yibo Yan
  • Siru Zhong
  • Kun Wang
  • Qingsong Wen
  • Yuxuan Liang

Urban socioeconomic indicator prediction aims to infer various metrics related to sustainable development in diverse urban landscapes using data-driven methods. However, prevalent pretrained models, particularly those reliant on satellite imagery, face dual challenges. Firstly, concentrating solely on macro-level patterns from satellite data may introduce bias, lacking nuanced details at micro levels, such as architectural details at a place. Secondly, the text generated by the precursor work UrbanCLIP, which fully utilizes the extensive knowledge of LLMs, frequently exhibits issues such as hallucination and homogenization, resulting in a lack of reliable quality. In response to these issues, we devise a novel framework entitled UrbanVLP based on Vision-Language Pretraining. Our UrbanVLP seamlessly integrates multi-granularity information from both macro (satellite) and micro (street-view) levels, overcoming the limitations of prior pretrained models. Moreover, it introduces automatic text generation and calibration, providing a robust guarantee for producing high-quality text descriptions of urban imagery. Rigorous experiments conducted across six socioeconomic indicator prediction tasks underscore its superior performance.

AAAI Conference 2024 Conference Paper

AltNeRF: Learning Robust Neural Radiance Field via Alternating Depth-Pose Optimization

  • Kun Wang
  • Zhiqiang Yan
  • Huang Tian
  • Zhenyu Zhang
  • Xiang Li
  • Jun Li
  • Jian Yang

Neural Radiance Fields (NeRF) have shown promise in generating realistic novel views from sparse scene images. However, existing NeRF approaches often encounter challenges due to the lack of explicit 3D supervision and imprecise camera poses, resulting in suboptimal outcomes. To tackle these issues, we propose AltNeRF, a novel framework designed to create resilient NeRF representations using self-supervised monocular depth estimation (SMDE) from monocular videos, without relying on known camera poses. SMDE in AltNeRF masterfully learns depth and pose priors to regulate NeRF training. The depth prior enriches NeRF's capacity for precise scene geometry depiction, while the pose prior provides a robust starting point for subsequent pose refinement. Moreover, we introduce an alternating algorithm that harmoniously melds NeRF outputs into SMDE through a consistency-driven mechanism, thus enhancing the integrity of depth priors. This alternation empowers AltNeRF to progressively refine NeRF representations, yielding the synthesis of realistic novel views. Extensive experiments showcase the compelling capabilities of AltNeRF in generating high-fidelity and robust novel views that closely resemble reality.

NeurIPS Conference 2024 Conference Paper

Causal Deciphering and Inpainting in Spatio-Temporal Dynamics via Diffusion Model

  • Yifan Duan
  • Jian Zhao
  • Junyuan Mao
  • Hao Wu
  • Jingyu Xu
  • Shilong Wang
  • Caoyuan Ma
  • Kai Wang

Spatio-temporal (ST) prediction has garnered de facto attention in the earth sciences, with applications such as meteorological prediction and human mobility perception. However, the scarcity of data coupled with the high expenses involved in sensor deployment results in notable data imbalances. Furthermore, models that are excessively customized and devoid of causal connections further undermine generalizability and interpretability. To this end, we establish a causal framework for ST predictions, termed CaPaint, which aims to identify causal regions in data and endow the model with causal reasoning ability in a two-stage process. Going beyond this process, we utilize the back-door adjustment to specifically address the sub-regions identified as non-causal in the upstream phase. Specifically, we employ a novel image inpainting technique. By using a fine-tuned unconditional Diffusion Probabilistic Model (DDPM) as the generative prior, we in-fill the masks defined as environmental parts, offering the possibility of reliable extrapolation for potential data distributions. CaPaint overcomes the high-complexity dilemma of optimal ST causal discovery models by reducing the data generation complexity from exponential to quasi-linear levels. Extensive experiments conducted on five real-world ST benchmarks demonstrate that integrating the CaPaint concept allows models to achieve improvements ranging from 4.3% to 77.3%. Moreover, compared to traditional mainstream ST augmenters, CaPaint underscores the potential of diffusion models in ST enhancement, offering a novel paradigm for this field. Our project is available at https://anonymous.4open.science/r/12345-DFCC.
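The back-door adjustment invoked above has the standard form, with e ranging over the non-causal (environmental) components identified upstream (notation assumed):

```latex
P\big(Y \mid \mathrm{do}(X)\big) \;=\; \sum_{e} P\big(Y \mid X, e\big)\, P(e).
```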

NeurIPS Conference 2024 Conference Paper

DCDepth: Progressive Monocular Depth Estimation in Discrete Cosine Domain

  • Kun Wang
  • Zhiqiang Yan
  • Junkai Fan
  • Wanlu Zhu
  • Xiang Li
  • Jun Li
  • Jian Yang

In this paper, we introduce DCDepth, a novel framework for the long-standing monocular depth estimation task. Moving beyond conventional pixel-wise depth estimation in the spatial domain, our approach estimates the frequency coefficients of depth patches after transforming them into the discrete cosine domain. This unique formulation allows for the modeling of local depth correlations within each patch. Crucially, the frequency transformation segregates the depth information into various frequency components, with low-frequency components encapsulating the core scene structure and high-frequency components detailing the finer aspects. This decomposition forms the basis of our progressive strategy, which begins with the prediction of low-frequency components to establish a global scene context, followed by successive refinement of local details through the prediction of higher-frequency components. We conduct comprehensive experiments on NYU-Depth-V2, TOFDC, and KITTI datasets, and demonstrate the state-of-the-art performance of DCDepth. Code is available at https://github.com/w2kun/DCDepth.
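The frequency-domain decomposition is easy to verify directly: transform a depth patch with the DCT, keep only the low-frequency block, and invert to recover the coarse structure that is predicted first. A synthetic example (not the paper's code):

```python
# DCT round-trip on a synthetic depth patch: low-frequency coefficients carry
# the coarse structure; the full set reconstructs the patch exactly.
import numpy as np
from scipy.fft import dctn, idctn

rng = np.random.default_rng(0)
patch = rng.random((8, 8)).cumsum(0).cumsum(1)   # smooth synthetic depth patch
coef = dctn(patch, norm='ortho')

coarse = np.zeros_like(coef)
coarse[:2, :2] = coef[:2, :2]                    # keep low-frequency block only
recon_coarse = idctn(coarse, norm='ortho')       # global layout, no fine detail
recon_full = idctn(coef, norm='ortho')           # exact up to float error
print(np.abs(recon_full - patch).max() < 1e-9)
```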

NeurIPS Conference 2024 Conference Paper

Dueling over Dessert, Mastering the Art of Repeated Cake Cutting

  • Simina Brânzei
  • MohammadTaghi Hajiaghayi
  • Reed Phillips
  • Suho Shin
  • Kun Wang

We consider the setting of repeated fair division between two players, denoted Alice and Bob, with private valuations over a cake. In each round, a new cake arrives, which is identical to the ones in previous rounds. Alice cuts the cake at a point of her choice, while Bob chooses the left piece or the right piece, leaving the remainder for Alice. We consider two versions: sequential, where Bob observes Alice's cut point before choosing left/right, and simultaneous, where he only observes her cut point after making his choice. The simultaneous version was first considered by Aumann and Maschler. We observe that if Bob is almost myopic and chooses his favorite piece too often, then he can be systematically exploited by Alice through a strategy akin to a binary search. This strategy allows Alice to approximate Bob's preferences with increasing precision, thereby securing a disproportionate share of the resource over time. We analyze the limits of how much a player can exploit the other one and show that fair utility profiles are in fact achievable. Specifically, the players can enforce the equitable utility profile of $(1/2, 1/2)$ in the limit on every trajectory of play, by keeping the other player's utility to approximately $1/2$ on average while guaranteeing they themselves get at least approximately $1/2$ on average. We show this theorem using a connection with Blackwell approachability. Finally, we analyze a natural dynamic known as fictitious play, where players best respond to the empirical distribution of the other player. We show that fictitious play converges to the equitable utility profile of $(1/2, 1/2)$ at a rate of $O(1/\sqrt{T})$.
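The binary-search exploitation can be sketched directly: against a myopic Bob, Alice halves the interval containing his indifference point each round. The threshold model of Bob's choices below is a simplification of the paper's setting.

```python
# Sketch: Alice binary-searches the cut point b at which a myopic Bob is
# indifferent between the left and right pieces.
def locate_indifference(bob_choice, lo=0.0, hi=1.0, rounds=30):
    """bob_choice(cut) -> 'left' or 'right'; narrows Bob's threshold b."""
    for _ in range(rounds):
        cut = (lo + hi) / 2
        if bob_choice(cut) == 'left':
            hi = cut   # Bob prefers [0, cut]: threshold is at or below cut
        else:
            lo = cut   # Bob prefers (cut, 1]: threshold is above cut
    return (lo + hi) / 2

b_true = 0.37  # Bob's private indifference point in this toy model
est = locate_indifference(lambda c: 'left' if c >= b_true else 'right')
print(round(est, 4))  # ~0.37 after 30 rounds
```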

AAAI Conference 2024 Conference Paper

Earthfarseer: Versatile Spatio-Temporal Dynamical Systems Modeling in One Model

  • Hao Wu
  • Yuxuan Liang
  • Wei Xiong
  • Zhengyang Zhou
  • Wei Huang
  • Shilong Wang
  • Kun Wang

Efficiently modeling spatio-temporal (ST) physical processes and observations presents a challenging problem for the deep learning community. Many recent studies have concentrated on meticulously reconciling various advantages, leading to designed models that are neither simple nor practical. To address this issue, this paper presents a systematic study of the existing shortcomings faced by off-the-shelf models, including lack of local fidelity, poor prediction performance over long time-steps, low scalability, and inefficiency. To systematically address these problems, we propose EarthFarseer, a concise framework that combines parallel local convolutions and global Fourier-based transformer architectures to dynamically capture local-global spatial interactions and dependencies. EarthFarseer also incorporates multi-scale fully convolutional and Fourier architectures to efficiently and effectively capture the temporal evolution. Our proposal demonstrates strong adaptability across various tasks and datasets, with fast convergence and better local fidelity in long time-step predictions. Extensive experiments and visualizations over eight human-society and natural physical datasets demonstrate the state-of-the-art performance of EarthFarseer. We release our code at https://github.com/easylearningscores/EarthFarseer.

NeurIPS Conference 2024 Conference Paper

GDeR: Safeguarding Efficiency, Balancing, and Robustness via Prototypical Graph Pruning

  • Guibin Zhang
  • Haonan Dong
  • Yuchen Zhang
  • Zhixun Li
  • Dingshuo Chen
  • Kai Wang
  • Tianlong Chen
  • Yuxuan Liang

Training high-quality deep models necessitates vast amounts of data, resulting in overwhelming computational and memory demands. Recently, data pruning, distillation, and coreset selection have been developed to streamline data volume by retaining, synthesizing, or selecting a small yet informative subset from the full set. Among these methods, data pruning incurs the least additional training cost and offers the most practical acceleration benefits. However, it is the most vulnerable, often suffering significant performance degradation with imbalanced or biased data schema, thus raising concerns about its accuracy and reliability in on-device deployment. Therefore, there is a looming need for a new data pruning paradigm that maintains the efficiency of previous practices while ensuring balance and robustness. Unlike the fields of computer vision and natural language processing, where mature solutions have been developed to address these issues, graph neural networks (GNNs) continue to struggle with increasingly large-scale, imbalanced, and noisy datasets, lacking a unified dataset pruning solution. To fill this gap, we introduce a novel dynamic soft-pruning method, GDeR, designed to update the training "basket" during the process using trainable prototypes. GDeR first constructs a well-modeled graph embedding hypersphere and then samples representative, balanced, and unbiased subsets from this embedding space, which achieves the goal we call Graph Training Debugging. Extensive experiments on four datasets across three GNN backbones demonstrate that GDeR (I) achieves or surpasses the performance of the full dataset with 30%-50% fewer training samples, (II) attains up to a 2.81x lossless training speedup, and (III) outperforms state-of-the-art pruning methods in imbalanced training and noisy training scenarios by 0.3%-4.3% and 3.6%-7.8%, respectively.

ICML Conference 2024 Conference Paper

Gradient-based Visual Explanation for Transformer-based CLIP

  • Chenyang Zhao 0011
  • Kun Wang
  • Xingyu Zeng
  • Rui Zhao
  • Antoni B. Chan

Significant progress has been achieved on the improvement and downstream usages of the Contrastive Language-Image Pre-training (CLIP) vision-language model, while less attention is paid to the interpretation of CLIP. We propose a Gradient-based visual Explanation method for CLIP (Grad-ECLIP), which interprets the matching result of CLIP for a specific input image-text pair. By decomposing the architecture of the encoder and discovering the relationship between the matching similarity and intermediate spatial features, Grad-ECLIP produces effective heat maps that show the influence of image regions or words on the CLIP results. Different from the previous Transformer interpretation methods that focus on the utilization of self-attention maps, which are typically extremely sparse in CLIP, we produce high-quality visual explanations by applying channel and spatial weights on token features. Qualitative and quantitative evaluations verify the superiority of Grad-ECLIP compared with the state-of-the-art methods. A series of analyses is conducted based on our visual explanation results, from which we explore the working mechanism of image-text matching, and the strengths and limitations in attribution identification of CLIP. Codes are available here: https://github.com/Cyang-Zhao/Grad-Eclip.

NeurIPS Conference 2024 Conference Paper

MotionTTT: 2D Test-Time-Training Motion Estimation for 3D Motion Corrected MRI

  • Tobit Klug
  • Kun Wang
  • Stefan Ruschke
  • Reinhard Heckel

A major challenge of the long measurement times in magnetic resonance imaging (MRI), an important medical imaging technology, is that patients may move during data acquisition. This leads to severe motion artifacts in the reconstructed images and volumes. In this paper, we propose MotionTTT, a deep-learning-based test-time-training (TTT) method for accurate motion estimation. The key idea is that a neural network trained for motion-free reconstruction has a small loss if there is no motion, thus optimizing over motion parameters passed through the reconstruction network enables accurate estimation of motion. The estimated motion parameters enable correcting for the motion and reconstructing accurate motion-corrected images. Our method uses 2D reconstruction networks to estimate rigid motion in 3D, and constitutes the first deep-learning-based method for 3D rigid motion estimation towards 3D-motion-corrected MRI. We show that our method can provably reconstruct motion parameters for a simple signal and neural network model. We demonstrate the effectiveness of our method for both retrospectively simulated motion and prospectively collected real motion-corrupted data. Code is available at https://github.com/MLI-lab/MRI_MotionTTT.
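The test-time-training principle is easy to demonstrate on a toy differentiable forward model: optimize a motion parameter so that a motion-compensated prediction matches the measurement. MotionTTT's 2D networks and 3D rigid motion are not reproduced here; the translation model below is an illustrative stand-in.

```python
# Toy TTT-style motion estimation: fit a shift parameter by gradient descent
# through a differentiable forward model (Fourier-domain translation).
import torch

def translate(signal, shift):
    """Differentiable sub-pixel translation via a Fourier phase ramp."""
    n = signal.shape[-1]
    k = torch.fft.fftfreq(n)
    phase = torch.exp(-2j * torch.pi * k * shift)
    return torch.fft.ifft(torch.fft.fft(signal) * phase).real

x = torch.exp(-(torch.arange(64.0) - 32) ** 2 / (2 * 8 ** 2))  # smooth signal
measured = translate(x, torch.tensor(3.0))                     # motion-corrupted

shift = torch.tensor(0.0, requires_grad=True)   # motion parameter to estimate
opt = torch.optim.Adam([shift], lr=0.1)
for _ in range(200):
    opt.zero_grad()
    loss = ((translate(x, shift) - measured) ** 2).mean()
    loss.backward()
    opt.step()
print(round(shift.item(), 2))  # ~3.0: motion recovered through the model
```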

IJCAI Conference 2024 Conference Paper

Predicting Carpark Availability in Singapore with Cross-Domain Data: A New Dataset and A Data-Driven Approach

  • Huaiwu Zhang
  • Yutong Xia
  • Siru Zhong
  • Kun Wang
  • Zekun Tong
  • Qingsong Wen
  • Roger Zimmermann
  • Yuxuan Liang

The increasing number of vehicles highlights the need for efficient parking space management. Predicting real-time Parking Availability (PA) can help mitigate traffic congestion and the associated social problems, a pressing issue in densely populated cities like Singapore. In this study, we aim to collectively predict future PA across Singapore using complex factors from various domains. The contributions of this paper are as follows: (1) A New Dataset: We introduce the SINPA dataset, containing a year's worth of PA data from 1,687 parking lots in Singapore, enriched with various spatial and temporal factors. (2) A Data-Driven Approach: We present DeepPA, a novel deep learning framework, to collectively and efficiently predict future PA across thousands of parking lots. (3) Extensive Experiments and Deployment: DeepPA achieves a 9.2% reduction in prediction error for up to 3-hour forecasts compared to existing advanced models. Furthermore, we deploy DeepPA in a practical web-based platform that provides real-time PA predictions to aid drivers and inform urban planning decisions in Singapore. We release the dataset and source code at https://github.com/yoshall/SINPA.

NeurIPS Conference 2024 Conference Paper

SyncVIS: Synchronized Video Instance Segmentation

  • Rongkun Zheng
  • Lu Qi
  • Xi Chen
  • Yi Wang
  • Kun Wang
  • Yu Qiao
  • Hengshuang Zhao

Recent DETR-based methods have advanced Video Instance Segmentation (VIS) through the transformers' efficiency and capability in modeling spatial and temporal information. Despite remarkable progress, existing works follow asynchronous designs that model video sequences either via video-level queries only or through query-sensitive cascade structures, leading to difficulties in complex and challenging video scenarios. In this work, we analyze the cause of this phenomenon and the limitations of current solutions, and propose synchronized modeling via a new framework named SyncVIS. Specifically, SyncVIS explicitly introduces video-level query embeddings and designs two key modules to synchronize video-level queries with frame-level query embeddings: a synchronized video-frame modeling paradigm and a synchronized embedding optimization strategy. The former promotes mutual learning between frame- and video-level embeddings, while the latter divides long video sequences into small clips for easier optimization. Extensive experiments are conducted on the challenging YouTube-VIS 2019, 2021, and 2022 and OVIS benchmarks, where SyncVIS achieves state-of-the-art results, demonstrating the effectiveness and generality of the proposed approach. The code is available at https://github.com/rkzheng99/SyncVIS.

NeurIPS Conference 2024 Conference Paper

Towards Neuron Attributions in Multi-Modal Large Language Models

  • Junfeng Fang
  • Zongze Bi
  • Ruipeng Wang
  • Houcheng Jiang
  • Yuan Gao
  • Kun Wang
  • An Zhang
  • Jie Shi

As Large Language Models (LLMs) demonstrate impressive capabilities, demystifying their internal mechanisms becomes increasingly vital. Neuron attribution, which attributes LLM outputs to specific neurons to reveal the semantic properties they learn, has emerged as a key interpretability approach. However, while neuron attribution has made significant progress in deciphering text-only LLMs, its application to Multimodal LLMs (MLLMs) remains less explored. To address this gap, we propose a novel Neuron Attribution method tailored for MLLMs, termed NAM. Specifically, NAM not only reveals the modality-specific semantic knowledge learned by neurons within MLLMs, but also highlights several intriguing properties of neurons, such as cross-modal invariance and semantic sensitivity. These properties collectively elucidate the inner working mechanisms of MLLMs, providing a deeper understanding of how MLLMs process and generate multi-modal content. Through theoretical analysis and empirical validation, we demonstrate the efficacy of NAM and the valuable insights it offers. Furthermore, leveraging NAM, we introduce a multi-modal knowledge editing paradigm, underscoring the practical significance of our approach for downstream applications of MLLMs.
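
As a point of reference, a generic activation-times-gradient neuron attribution looks like the sketch below; this is not NAM's actual formulation, which the paper tailors to multimodal inputs, and the names are ours.

import torch

def neuron_attribution(logit, hidden):
    # hidden: (B, T, D) activations of one MLP layer, kept in the autograd
    # graph; logit: the scalar output being attributed.
    grads = torch.autograd.grad(logit, hidden, retain_graph=True)[0]
    return (grads * hidden).sum(dim=(0, 1))   # one score per neuron, shape (D,)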

IJCAI Conference 2024 Conference Paper

Towards Robust Trajectory Representations: Isolating Environmental Confounders with Causal Learning

  • Kang Luo
  • Yuanshao Zhu
  • Wei Chen
  • Kun Wang
  • Zhengyang Zhou
  • Sijie Ruan
  • Yuxuan Liang

Trajectory modeling refers to characterizing human movement behavior and serves as a pivotal step in understanding mobility patterns. Nevertheless, existing studies typically ignore the confounding effects of geospatial context, leading to spurious correlations and limited generalization capabilities. To bridge this gap, we first formulate a Structural Causal Model (SCM) to decipher the trajectory representation learning process from a causal perspective. Building upon the SCM, we present TrajCL, a trajectory modeling framework based on causal learning, which leverages backdoor adjustment as an intervention tool to eliminate the spurious correlations between geospatial context and trajectories. Extensive experiments on two real-world datasets verify that TrajCL markedly enhances performance on trajectory classification tasks while showcasing superior generalization and interpretability.
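
For readers unfamiliar with the intervention tool named here, the standard back-door adjustment stratifies over the confounder, in this case the geospatial context $E$:

$$P(Y \mid do(X)) \;=\; \sum_{e} P(Y \mid X, E = e)\, P(E = e).$$

Intuitively, instead of letting the context both shape trajectories and influence labels, the effect of $X$ on $Y$ is averaged over contexts weighted by their marginal probability.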

NeurIPS Conference 2023 Conference Paper

Deciphering Spatio-Temporal Graph Forecasting: A Causal Lens and Treatment

  • Yutong Xia
  • Yuxuan Liang
  • Haomin Wen
  • Xu Liu
  • Kun Wang
  • Zhengyang Zhou
  • Roger Zimmermann

Spatio-Temporal Graph (STG) forecasting is a fundamental task in many real-world applications. Spatio-temporal graph neural networks have emerged as the most popular method for STG forecasting, but they often struggle with temporal out-of-distribution (OoD) issues and dynamic spatial causation. In this paper, we propose a novel framework called CaST to tackle these two challenges via causal treatments. Concretely, leveraging a causal lens, we first build a structural causal model to decipher the data generation process of STGs. To handle the temporal OoD issue, we apply back-door adjustment via a novel disentanglement block that separates the temporal environments from the input data. Moreover, we utilize front-door adjustment and adopt edge-level convolution to model the ripple effect of causation. Experimental results on three real-world datasets demonstrate the effectiveness of CaST, which consistently outperforms existing methods with good interpretability. Our source code is available at https://github.com/yutong-xia/CaST.
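
For completeness alongside the back-door formula above, the front-door adjustment mentioned here identifies the causal effect through an observed mediator $M$ even when the confounder itself is unobserved:

$$P(Y \mid do(X)) \;=\; \sum_{m} P(M = m \mid X) \sum_{x'} P(Y \mid M = m, X = x')\, P(X = x').$$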

AAAI Conference 2023 Conference Paper

DesNet: Decomposed Scale-Consistent Network for Unsupervised Depth Completion

  • Zhiqiang Yan
  • Kun Wang
  • Xiang Li
  • Zhenyu Zhang
  • Jun Li
  • Jian Yang

Unsupervised depth completion aims to recover dense depth from sparse measurements without ground-truth annotation. Although depth measurements obtained from LiDAR are usually sparse, they contain valid and real distance information, i.e., scale-consistent absolute depth values. Meanwhile, scale-agnostic counterparts estimate relative depth and have achieved impressive performance. To leverage both inherent characteristics, we propose to model scale-consistent depth on top of unsupervised scale-agnostic frameworks. Specifically, we propose the decomposed scale-consistent learning (DSCL) strategy, which disentangles absolute depth into relative depth prediction and global scale estimation, so that each component enjoys its own learning benefits. Unfortunately, most existing unsupervised scale-agnostic frameworks suffer heavily from depth holes due to the extremely sparse depth input and weak supervisory signal. To tackle this issue, we introduce the global depth guidance (GDG) module, which attentively propagates dense depth references into the sparse target via a novel dense-to-sparse attention. Extensive experiments show the superiority of our method on outdoor KITTI, where it ranks 1st and outperforms the previous best method, KBNet, by more than 12% in RMSE. Our approach also achieves state-of-the-art performance on the indoor NYUv2 benchmark.
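
On our reading of the abstract, the DSCL decomposition amounts to predicting a relative depth map and a single global scale whose product recovers absolute depth:

$$\hat{D}_{\mathrm{abs}}(u, v) \;=\; \hat{s} \cdot \hat{D}_{\mathrm{rel}}(u, v),$$

where $\hat{s}$ could, for instance, be anchored to the sparse LiDAR points, which carry absolute scale; the paper's exact estimation procedure may differ.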

JBHI Journal 2023 Journal Article

Dual-Input Transformer: An End-to-End Model for Preoperative Assessment of Pathological Complete Response to Neoadjuvant Chemotherapy in Breast Cancer Ultrasonography

  • Tong Tong
  • Dongyang Li
  • Jionghui Gu
  • Guo Chen
  • Guotao Bai
  • Xin Yang
  • Kun Wang
  • Tianan Jiang

Neoadjuvant chemotherapy (NAC) is the primary method to reduce tumor burden and metastasis; in breast cancer treatment, it may provide additional opportunities for breast-conserving surgery. Preoperative assessment of pathological complete response (PCR) to NAC is important for developing individualized treatment approaches and predicting patient prognosis. Compared with magnetic resonance imaging (MRI) and mammography, ultrasonography (US) has the advantages of simplicity, flexibility, and real-time imaging. Moreover, it does not require radiation and allows the tumor to be imaged at multiple time points during NAC treatment. Recently, deep learning radiomics models based on multi-time-point US images have been proposed for predicting NAC effectiveness. To further improve prediction performance, we carefully designed four supporting modules for our proposed dual-input transformer (DiT): an isolated tokens-to-token patch embedding module, shared position embeddings, time embeddings, and a weighted average pooling feature representation module. The design of each module considers the characteristics of US images taken at multiple time points. We validated our model on a retrospective US dataset of 484 cases collected from two centers with limited cross-center consistency. Patients were allocated to training (n = 297), validation (n = 99), and external test (n = 88) sets. The results show that our model achieves better performance than the Siamese CNN and the standard tokens-to-token vision transformer without using multi-time-point images. An ablation study also verified the effectiveness of each module designed for DiT.
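
The shared-position-plus-time embedding scheme can be sketched as below; the module and parameter names are ours, not the paper's, and the real DiT adds further components around this.

import torch
import torch.nn as nn

class DualTimeEmbed(nn.Module):
    # Both time points share one learned position embedding, while a small
    # time embedding tells the encoder which acquisition each token came from.
    def __init__(self, n_patches, d_model, n_timepoints=2):
        super().__init__()
        self.pos = nn.Parameter(torch.zeros(1, n_patches, d_model))
        self.time = nn.Embedding(n_timepoints, d_model)

    def forward(self, tokens_t0, tokens_t1):
        # tokens_t*: (B, n_patches, d_model) patch tokens of each US image
        z0 = tokens_t0 + self.pos + self.time.weight[0]
        z1 = tokens_t1 + self.pos + self.time.weight[1]
        return torch.cat([z0, z1], dim=1)   # joint sequence for the encoder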

NeurIPS Conference 2023 Conference Paper

TMT-VIS: Taxonomy-aware Multi-dataset Joint Training for Video Instance Segmentation

  • Rongkun Zheng
  • Lu Qi
  • Xi Chen
  • Yi Wang
  • Kun Wang
  • Yu Qiao
  • Hengshuang Zhao

Training on large-scale datasets can boost the performance of video instance segmentation (VIS), but annotated VIS datasets are hard to scale up due to high labor costs. What we possess instead are numerous isolated field-specific datasets; it is therefore appealing to jointly train models across an aggregation of datasets to enhance data volume and diversity. However, due to heterogeneity in the category space, simply combining multiple datasets dilutes the model's attention across different taxonomies, even though mask precision increases with data volume. Thus, increasing the data scale and enriching the taxonomy space while improving classification precision is important. In this work, we show that providing extra taxonomy information can help models concentrate on a specific taxonomy, and we propose Taxonomy-aware Multi-dataset Joint Training for Video Instance Segmentation (TMT-VIS) to address this vital challenge. Specifically, we design a two-stage taxonomy aggregation module that first compiles taxonomy information from input videos and then aggregates these taxonomy priors into instance queries before the transformer decoder. We conduct extensive experimental evaluations on four popular and challenging benchmarks: YouTube-VIS 2019, YouTube-VIS 2021, OVIS, and UVO. Our model shows significant improvements over baseline solutions and sets new state-of-the-art records on all these benchmarks. These encouraging results demonstrate the effectiveness and generality of our proposed approach. The code and trained models will be publicly available.
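
One plausible form of "aggregating taxonomy priors into instance queries" is cross-attention from the queries to taxonomy embeddings; the sketch below is illustrative, with names of our own choosing, and is not the paper's actual module.

import torch
import torch.nn as nn

class TaxonomyInjection(nn.Module):
    def __init__(self, num_classes, d_model, n_heads=8):
        super().__init__()
        self.tax_embed = nn.Embedding(num_classes, d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, queries, taxonomy_ids):
        # queries: (B, Q, D) instance queries before the transformer decoder;
        # taxonomy_ids: (B, T) ids of classes believed present in the video.
        tax = self.tax_embed(taxonomy_ids)        # (B, T, D) taxonomy priors
        out, _ = self.attn(queries, tax, tax)     # queries attend to priors
        return queries + out                      # residual injection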

ICRA Conference 2022 Conference Paper

IPS300+: a Challenging multi-modal data sets for Intersection Perception System

  • Huanan Wang
  • Xinyu Zhang 0001
  • Zhiwei Li
  • Jun Li 0082
  • Kun Wang
  • Zhu Lei
  • Haibing Ren

Due to high complexity and occlusion, insufficient perception at crowded urban intersections poses a serious safety risk for both human drivers and autonomous algorithms; CVIS (Cooperative Vehicle Infrastructure System) has been proposed as a solution for full-participant perception in this scenario. However, research on roadside multi-modal perception is still in its infancy, and no open-source dataset exists for such scenes. This paper fills that gap. Using an IPS (Intersection Perception System) installed at the diagonal of the intersection, we present a high-quality multi-modal dataset for the intersection perception task. The center of the experimental intersection covers an area of 3,000 m², and the extended distance reaches 300 m, which is typical for CVIS. The first batch of open-source data includes 14,198 frames, each with an average of 319.84 labels, 9.6 times more than the most crowded dataset to date (H3D, 2019). Our dataset is available at: http://www.openmpd.com/column/IPS300.

IROS Conference 2021 Conference Paper

A Novel 2-SUR 6-DOF Parallel Manipulator Actuated by Spherical Motion Generators

  • Kun Wang
  • Xiaoyong Wu
  • Yujin Wang
  • Bo Li
  • Bo Yuan
  • Shaoping Bai

A novel 6-DOF parallel manipulator with two spherical-universal-revolute limbs is proposed in this work. Compared with general 6-DOF parallel manipulators with six kinematic limbs, this new manipulator, actuated by spherical motion generators, has only two limbs, which brings kinematic advantages such as a small footprint and a large workspace. The inverse position problem of the manipulator is solved analytically, upon which velocity equations are formulated. Kinematic performance, including workspace and manipulability, is evaluated to show the advantages of the new design.

NeurIPS Conference 2021 Conference Paper

Exploring Forensic Dental Identification with Deep Learning

  • Yuan Liang
  • Weikun Han
  • Liang Qiu
  • Chen Wu
  • Yiting Shao
  • Kun Wang
  • Lei He

Dental forensic identification aims to identify persons from dental traces. The task is vital for investigating crime scenes and mass disasters because of the resistance of dental structures and the wide availability of dental imaging. However, no widely accepted automated solution exists for this labor-intensive task. In this work, we pioneer the study of deep learning for dental forensic identification based on panoramic radiographs. We construct a comprehensive benchmark with various dental variations that adequately reflects the difficulties of the task. Considering the task's unique challenges, we propose FoID, a deep learning method featuring: (i) clinically inspired attention localization, (ii) domain-specific augmentations that enable instance-discriminative learning, and (iii) a transformer-based self-attention mechanism that dynamically reasons about the relative importance of attentions. We show that FoID outperforms traditional approaches by at least 22.98% in Rank-1 accuracy and strong CNN baselines by at least 10.50% in mean Average Precision (mAP). Moreover, extensive ablation studies verify the effectiveness of each building block of FoID. Our work can be a first step towards automated forensic identification among large-scale multi-site databases, and the proposed techniques, e.g., the self-attention mechanism, can also be meaningful for other identification tasks such as pedestrian re-identification. Related data and code can be found at https://github.com/liangyuandg/FoID.

AAAI Conference 2021 Conference Paper

Oral-3D: Reconstructing the 3D Structure of Oral Cavity from Panoramic X-ray

  • Weinan Song
  • Yuan Liang
  • Jiawei Yang
  • Kun Wang
  • Lei He

Panoramic X-ray (PX) provides a 2D picture of the patient's mouth in a panoramic view to help dentists observe disease hidden inside the gum. However, it provides limited 2D information compared with cone-beam computed tomography (CBCT), another dental imaging method that generates a 3D picture of the oral cavity but at a higher radiation dose and price. Consequently, it is of great interest to reconstruct the 3D structure from a 2D X-ray image, which could greatly broaden the application of X-ray imaging in dental surgery. In this paper, we propose a framework, named Oral-3D, to reconstruct the 3D oral cavity from a single PX image and prior information about the dental arch. Specifically, we first train a generative model to learn the cross-dimension transformation from 2D to 3D. We then restore the shape of the oral cavity with a deformation module guided by the dental arch curve, which can be obtained simply by taking a photo of the patient's mouth. Notably, Oral-3D can restore both the density of bony tissues and the curved mandible surface. Experimental results show that Oral-3D can efficiently and effectively reconstruct the 3D oral structure and reveal information critical for clinical applications, e.g., tooth extraction and dental implants. To the best of our knowledge, we are the first to explore this domain transformation problem between these two imaging methods.

AAAI Conference 2020 Conference Paper

Alignment-Enhanced Transformer for Constraining NMT with Pre-Specified Translations

  • Kai Song
  • Kun Wang
  • Heng Yu
  • Yue Zhang
  • Zhongqiang Huang
  • Weihua Luo
  • Xiangyu Duan
  • Min Zhang

We investigate the task of constraining NMT with pre-specified translations, which has practical significance for a number of research and industrial applications. Existing works impose pre-specified translations as lexical constraints during decoding, based on word alignments derived from target-to-source attention weights. However, multiple recent studies have found that word alignments derived from generic attention heads in the Transformer are unreliable. We address this problem by introducing a dedicated head in the multi-head Transformer architecture to capture external supervision signals. Results on five language pairs show that our method is highly effective in constraining NMT with pre-specified translations, consistently outperforming previous methods in translation quality.
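
A common way to give one attention head external alignment supervision, which may approximate the dedicated head described here, is a KL loss between that head's cross-attention and a normalized alignment matrix; the loss name and shapes below are our own illustrative choices.

import torch
import torch.nn.functional as F

def alignment_head_loss(attn_weights, align_matrix, head_idx=0):
    # attn_weights: (B, H, T_tgt, T_src) decoder cross-attention weights;
    # align_matrix: (B, T_tgt, T_src) 0/1 float external alignment labels.
    head = attn_weights[:, head_idx]                          # dedicated head
    target = align_matrix / align_matrix.sum(-1, keepdim=True).clamp(min=1)
    return F.kl_div(head.clamp(min=1e-9).log(), target,
                    reduction="batchmean")                    # added to NMT loss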

ICRA Conference 2010 Conference Paper

Internal force compensating method for wall-climbing caterpillar robot

  • Wei Wang 0034
  • Kun Wang
  • Houxiang Zhang
  • Jianwei Zhang 0001

The redundant driving problem is an inherent phenomenon in modular caterpillar robots. To limit the internal forces arising from redundant driving, this paper proposes a joint torque control method based on the assumption that only one joint in the four-link mechanism actively drives the climbing gait. Except for the active joint, the other three joints are treated as passive joints whose torques are regulated toward zero, although they are driven by motors in reality. Based on an analysis of static forces in the closed-chain state, the ideal torque of the designated active joint is calculated and then tracked by that joint during real climbing locomotion. The experiments demonstrate the reasonableness and feasibility of the proposed joint control method, as well as the limitations of the current prototype and control algorithm.