Arrow Research search

Author name cluster

Chen Wu

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

14 papers
2 author rows

Possible papers

AAAI Conference 2026 Conference Paper

CAST-LUT: Tokenizer-Guided HSV Look-Up Tables for Purple Flare Removal

  • Pu Wang
  • Shuning Sun
  • Jialang Lu
  • Chen Wu
  • Zhihua Zhang
  • Youshan Zhang
  • Chenggang Shan
  • Dianjie Lu

Purple flare, a diffuse chromatic aberration artifact commonly found around highlight areas, severely degrades the tone transitions and color of an image. Existing traditional methods are based on hand-crafted features, which lack flexibility and rely entirely on fixed priors, while the scarcity of paired training data critically hampers deep-learning approaches. To address these issues, we propose a novel network built upon decoupled HSV Look-Up Tables (LUTs). The method aims to simplify color correction by adjusting the Hue (H), Saturation (S), and Value (V) components independently, resolving the inherent color-coupling problems of traditional methods. Our model adopts a two-stage architecture: first, a Chroma-Aware Spectral Tokenizer (CAST) converts the input image from RGB space to HSV space and independently encodes the Hue (H) and Value (V) channels into a set of semantic tokens describing the purple flare status; second, the HSV-LUT module takes these tokens as input and dynamically generates independent correction curves (1D LUTs) for the H, S, and V channels. To effectively train and validate our model, we built the first large-scale purple flare dataset with diverse scenes. We also propose new metrics and a loss function specifically designed for this task. Extensive experiments demonstrate that our model not only significantly outperforms existing methods in visual quality but also achieves state-of-the-art performance on all quantitative metrics.
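
The decoupled correction is easy to picture in code. Below is a minimal sketch of applying independent 1D LUTs to the H, S, and V channels, assuming fixed hand-written curves and the matplotlib color-space helpers; in the paper the curves are generated dynamically from the CAST tokens rather than fixed in advance.

```python
import numpy as np
from matplotlib.colors import rgb_to_hsv, hsv_to_rgb

def apply_1d_lut(channel, lut):
    """Map channel values in [0, 1] through a 1D LUT with linear interpolation."""
    positions = np.linspace(0.0, 1.0, len(lut))
    return np.interp(channel, positions, lut)

def correct_hsv(image_rgb, lut_h, lut_s, lut_v):
    """Adjust H, S and V independently, avoiding the coupling of RGB-space edits."""
    hsv = rgb_to_hsv(image_rgb)                     # (H, W, 3), all channels in [0, 1]
    hsv[..., 0] = apply_1d_lut(hsv[..., 0], lut_h)  # hue curve
    hsv[..., 1] = apply_1d_lut(hsv[..., 1], lut_s)  # saturation curve
    hsv[..., 2] = apply_1d_lut(hsv[..., 2], lut_v)  # value curve
    return hsv_to_rgb(hsv)

# Identity curves, except a hypothetical hue nudge in the purple band (~0.75)
# toward blue; the paper would generate all three curves from CAST tokens.
n = 33
identity = np.linspace(0.0, 1.0, n)
lut_h = identity.copy()
band = (identity > 0.70) & (identity < 0.85)
lut_h[band] -= 0.05

out = correct_hsv(np.random.rand(64, 64, 3), lut_h, identity, identity)
```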

AAAI Conference 2026 Conference Paper

DCA-LUT: Deep Chromatic Alignment with 5D LUT for Purple Fringing Removal

  • Jialang Lu
  • Shuning Sun
  • Pu Wang
  • Chen Wu
  • Feng Gao
  • Lina Gong
  • Dianjie Lu
  • Guijuan Zhang

Purple fringing, a persistent artifact caused by Longitudinal Chromatic Aberration (LCA) in camera lenses, has long degraded the clarity and realism of digital imaging. Traditional solutions rely on complex and expensive apochromatic (APO) lens hardware and the extraction of handcrafted features, leaving data-driven approaches unexplored. To fill this gap, we introduce DCA-LUT, the first deep learning framework for purple fringing removal. Inspired by the physical root of the problem, the spatial misalignment of the RGB color channels due to lens dispersion, we introduce a novel Chromatic-Aware Coordinate Transformation (CA-CT) module that learns an image-adaptive color space to decouple and isolate fringing into a dedicated dimension. This targeted separation allows the network to learn a precise "purple fringe channel," which then guides the accurate restoration of the luminance channel. The final color correction is performed by a learned 5D Look-Up Table (5D LUT), enabling efficient and powerful non-linear color mapping. To enable robust training and fair evaluation, we constructed a large-scale synthetic purple fringing dataset (PF-Synth). Extensive experiments on synthetic and real-world datasets demonstrate that our method achieves state-of-the-art performance in purple fringing removal.
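
For intuition, here is a minimal sketch of a 5D LUT lookup. The two extra coordinates (a fringe channel and luminance) and the nearest-neighbor sampling are illustrative assumptions; the paper learns the table end-to-end, derives its coordinates from the CA-CT module, and would use a differentiable interpolation.

```python
import numpy as np

N = 9                                    # grid points per dimension
lut = np.random.rand(N, N, N, N, N, 3)   # stand-in for a learned table: 5D -> RGB

def lookup_5d(r, g, b, fringe, luma):
    """Quantize the 5D coordinate to the nearest grid point and fetch corrected RGB."""
    coord = np.array([r, g, b, fringe, luma]) * (N - 1)
    idx = tuple(np.clip(np.round(coord), 0, N - 1).astype(int))
    return lut[idx]

print(lookup_5d(0.8, 0.4, 0.9, 0.7, 0.6))  # corrected RGB for a purple-ish pixel
```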

AAAI Conference 2026 Conference Paper

Mixture-of-Trees: Learning to Select and Weigh Reasoning Paths for Efficient LLM Inference

  • Yangbo Wei
  • Zhen Huang
  • Shaoqiang Lu
  • Junhong Qian
  • Dongge Qin
  • Ting Jung Lin
  • Wei W. Xing
  • Chen Wu

We introduce Mixture-of-Trees (MoT), a novel framework that integrates sparse expert activation with structured tree-based reasoning for efficient LLM inference. MoT employs a learned gating mechanism to selectively activate only the most relevant expert reasoning trees for each problem, where experts use models of varying capacities based on task complexity. The framework features three key innovations: (1) sparse expert activation through unified gating networks, (2) specialized expert trees that leverage domain-specific expertise while optimizing the quality-efficiency trade-off, and (3) collaborative debate mechanisms for conflicting solutions. Additionally, MoT includes a shared baseline tree with early stopping—activated experts perform lightweight validation and terminate early when confidence is high. Experiments across five benchmarks (GSM8K, MATH, AIME 2024, MMLU, HotpotQA) show that MoT achieves 2-7 percentage point accuracy improvements while reducing LLM calls by 37-40% compared to existing multi-path methods.
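
A minimal sketch of the sparse-gating step, assuming a simple linear scorer over hypothetical problem features: the gate scores every expert tree, activates only the top-k, and normalizes their weights. The paper's full system adds specialized experts, debate, and the shared baseline tree on top of this.

```python
import torch
import torch.nn as nn

class TreeGate(nn.Module):
    def __init__(self, feat_dim, n_experts, k=2):
        super().__init__()
        self.scorer = nn.Linear(feat_dim, n_experts)
        self.k = k

    def forward(self, problem_feat):
        logits = self.scorer(problem_feat)        # score every expert tree
        topv, topi = logits.topk(self.k, dim=-1)  # sparse activation: top-k only
        weights = torch.softmax(topv, dim=-1)     # normalized weights among winners
        return topi, weights

gate = TreeGate(feat_dim=128, n_experts=6, k=2)
idx, w = gate(torch.randn(1, 128))
print(idx, w)  # which expert trees to run, and how to weigh their outputs
```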

AAAI Conference 2026 Conference Paper

MMMamba: A Versatile Cross-Modal In-Context Fusion Framework for Pan-Sharpening and Zero-Shot Image Enhancement

  • Yingying Wang
  • Xuanhua He
  • Chen Wu
  • Jialing Huang
  • Suiyun Zhang
  • Rui Liu
  • Xinghao Ding
  • Haoxuan Che

Pan-sharpening aims to generate high-resolution multispectral (HRMS) images by integrating a high-resolution panchromatic (PAN) image with its corresponding low-resolution multispectral (MS) image. To achieve effective fusion, it is crucial to fully exploit the complementary information between the two modalities. Traditional CNN-based methods typically rely on channel-wise concatenation with fixed convolutional operators, which limits their adaptability to diverse spatial and spectral variations. While cross-attention mechanisms enable global interactions, they are computationally inefficient and may dilute fine-grained correspondences, making it difficult to capture complex semantic relationships. Recent advances in the Multimodal Diffusion Transformer (MMDiT) architecture have demonstrated impressive success in image generation and editing tasks. Unlike cross-attention, MMDiT employs in-context conditioning to facilitate more direct and efficient cross-modal information exchange. In this paper, we propose MMMamba, a cross-modal in-context fusion framework for pan-sharpening, with the flexibility to support image super-resolution in a zero-shot manner. Built upon the Mamba architecture, our design ensures linear computational complexity while maintaining strong cross-modal interaction capacity. Furthermore, we introduce a novel multimodal interleaved (MI) scanning mechanism that facilitates effective information exchange between the PAN and MS modalities. Extensive experiments demonstrate the superior performance of our method compared to existing state-of-the-art (SOTA) techniques across multiple tasks and benchmarks.
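
A minimal sketch of the interleaved-scan idea, under the assumption that the two modalities have been projected to feature maps of matching shape: PAN and MS tokens are alternated into one sequence so a linear-time sequence model sees both modalities in context. The paper's actual MI scanning order may differ.

```python
import torch

def interleave_scan(pan_feat, ms_feat):
    """pan_feat, ms_feat: (B, C, H, W) with matching shapes -> (B, 2*H*W, C)."""
    B, C, H, W = pan_feat.shape
    pan_tok = pan_feat.flatten(2).transpose(1, 2)    # (B, H*W, C)
    ms_tok = ms_feat.flatten(2).transpose(1, 2)      # (B, H*W, C)
    stacked = torch.stack((pan_tok, ms_tok), dim=2)  # (B, H*W, 2, C)
    return stacked.flatten(1, 2)                     # PAN, MS, PAN, MS, ...

tokens = interleave_scan(torch.randn(1, 32, 16, 16), torch.randn(1, 32, 16, 16))
print(tokens.shape)  # torch.Size([1, 512, 32])
```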

AAAI Conference 2026 Conference Paper

Reality vs Counterfactual: Multi-World Contrastive Reinforcement Learning for Enhancing MLLM’s Theory of Mind in Egocentric Videos

  • Guiyang Hou
  • Yihui Fu
  • Chen Wu
  • Xiang Huang
  • Zhe Zheng
  • Wenqi Zhang
  • Yongliang Shen
  • Weiming Lu

Theory of Mind (ToM) refers to the ability to infer others' mental states, an essential capability for embodied AI agents to collaborate and interact effectively with humans. While improving Large Language Models' ability to reason about characters' mental states in text-based stories and dialogues has been extensively studied, enhancing Multimodal Large Language Models' ToM capabilities, particularly in egocentric video from an embodied perspective, remains unexplored. In this paper, we propose a contrastive Reinforcement Learning (RL) paradigm that explicitly encourages models to leverage temporal and causal evolutionary patterns in user action sequences to infer the user's mental states (goals, beliefs, and potential next actions). Evaluation results on in-domain and out-of-domain benchmarks demonstrate that our method achieves performance improvements of (+30.00%, +2.00%) and (+5.83%, +5.00%) over the backbone model and the vanilla Group Relative Policy Optimization (GRPO) model, respectively. Additionally, we compare the performance of two post-training paradigms (Supervised Fine-Tuning and RL) and systematically analyze the reasoning trajectories of the base model, the vanilla GRPO model, and our proposed method.
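
For reference, this is a minimal sketch of the group-relative advantage behind vanilla GRPO, the baseline the paper extends: each rollout's reward is normalized against its sampling group. The multi-world contrastive shaping of the rewards themselves is not reproduced here.

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-8):
    """rewards: (group_size,) scalar rewards for rollouts of one prompt."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)  # group-relative normalization

print(grpo_advantages([1.0, 0.0, 1.0, 0.5]))  # above-average rollouts get positive advantage
```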

ICRA Conference 2025 Conference Paper

DAP-LED: Learning Degradation-Aware Priors with CLIP for Joint Low-Light Enhancement and Deblurring

  • Ling Wang
  • Chen Wu
  • Lin Wang

Autonomous vehicles and robots often struggle with reliable visual perception at night due to low illumination and the motion blur caused by the long exposure times of RGB cameras. Existing methods address this challenge by sequentially connecting off-the-shelf pretrained low-light enhancement and deblurring models. Unfortunately, these methods often produce noticeable artifacts (e.g., color distortions) in over-exposed regions or struggle to learn the motion cues of dark regions. In this paper, we find that vision-language models, e.g., Contrastive Language-Image Pretraining (CLIP), can comprehensively perceive diverse degradation levels at night. In light of this, we propose a novel transformer-based joint learning framework, named DAP-LED, which jointly achieves low-light enhancement and deblurring, benefiting downstream tasks such as depth estimation, segmentation, and detection in the dark. The key insight is to leverage CLIP to adaptively learn the degradation levels of images captured at night. This enables the model to learn rich semantic information and visual representations for optimizing the joint tasks. To achieve this, we first introduce a CLIP-guided cross-fusion module to obtain multi-scale patch-wise degradation heatmaps from the image embeddings. The heatmaps are then fused via the designed CLIP-enhanced transformer blocks to retain useful degradation information for effective model optimization. Experimental results show that, compared to existing methods, our DAP-LED achieves state-of-the-art performance in the dark. Meanwhile, the enhanced results are demonstrated to be effective for three downstream tasks. For a demo and more results, please check the project page: https://vlislab22.github.io/dap-led/.
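
A minimal sketch of the patch-scoring intuition, with random tensors standing in for real CLIP embeddings: each image-patch embedding is compared against text embeddings for degradation prompts, and the similarities form a patch-wise heatmap. The paper's cross-fusion module is considerably richer than this dot-product scoring.

```python
import torch
import torch.nn.functional as F

patch_emb = F.normalize(torch.randn(14 * 14, 512), dim=-1)  # stand-in CLIP patch embeddings
text_emb = F.normalize(torch.randn(2, 512), dim=-1)         # e.g. "a dark blurry photo", "a sharp bright photo"

sims = patch_emb @ text_emb.T                                 # (196, 2) patch-vs-prompt similarity
degradation_map = sims.softmax(dim=-1)[:, 0].reshape(14, 14)  # probability of "degraded" per patch
print(degradation_map.shape)
```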

NeurIPS Conference 2025 Conference Paper

OpenCUA: Open Foundations for Computer-Use Agents

  • Xinyuan Wang
  • Bowen Wang
  • Dunjie Lu
  • Junlin Yang
  • Tianbao Xie
  • Junli Wang
  • Jiaqi Deng
  • Xiaole Guo

Vision-language models have demonstrated impressive capabilities as computer-use agents (CUAs) capable of automating diverse computer tasks. As their commercial potential grows, critical details of the most capable CUA systems remain closed. As these agents will increasingly mediate digital interactions and execute consequential decisions on our behalf, the research community needs access to open CUA frameworks to study their capabilities, limitations, and risks. To bridge this gap, we propose OpenCUA, a comprehensive open-source framework for scaling CUA data and foundation models. Our framework consists of: (1) an annotation infrastructure that seamlessly captures human computer-use demonstrations; (2) AgentNet, the first large-scale computer-use task dataset spanning 3 operating systems and 200+ applications and websites; and (3) a scalable pipeline that transforms demonstrations into state–action pairs with reflective long Chain-of-Thought reasoning that sustains robust performance gains as data scales. Our end-to-end agent models demonstrate strong performance across CUA benchmarks. In particular, OpenCUA-72B achieves an average success rate of 45.0% on OSWorld-Verified, establishing a new state of the art (SOTA) among open-source models. Further analysis confirms that our approach generalizes well across domains and benefits significantly from increased test-time computation. We release our annotation tool, datasets, code, and models to build open foundations for further CUA research.
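
One plausible shape for the pipeline's output, sketched as a dataclass; the field names are assumptions for illustration, not the released schema.

```python
from dataclasses import dataclass

@dataclass
class CUAStep:
    screenshot_path: str  # observation: the screen state at this step
    instruction: str      # the task being carried out
    thought: str          # reflective long chain-of-thought for this step
    action: str           # executable action, e.g. a click or key press

step = CUAStep(
    screenshot_path="demo_001/step_03.png",
    instruction="Export the spreadsheet as PDF",
    thought="The File menu is open; the Export option should be under it...",
    action='click(element="Export as PDF")',
)
```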

IROS Conference 2025 Conference Paper

Runtime Energy-Efficient Control Policy for Mobile Robots with Computing Workload and Battery Awareness

  • Chen Wu
  • M. Hashem Haghbayan
  • Abdul Malik
  • Antonio Miele
  • Juha Plosila

Energy efficiency is a fundamental goal in robotic control. Various components within a robot, such as mechanical systems, computational units, and sensors, consume energy, all powered by the battery unit. Each component features several actuators and individual controllers that optimize energy usage locally, often without regard to one another. In this paper, we highlight a significant phenomenon: a considerable dependency between the mechanical and computational parts of the robot as energy consumers and the battery state of charge (SOC) as the energy provider. We demonstrate that as the battery SOC fluctuates, the behavior of energy consumption also varies, necessitating a unified controller that is aware of this relationship. Motivated by this observation, we propose a battery-aware co-optimization strategy for the mechanical and computational units, leveraging configuration-space exploration to optimize the motor speed and the CPU frequency under different environmental conditions and battery SOC levels. Experimental results demonstrate the effectiveness of our approach in extending the operational lifetime of a robot under varying battery SOC and workload conditions, enhancing the energy efficiency of a case-study rover by up to 53.93% w.r.t. selected baselines and similar past approaches.
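
A minimal sketch of configuration-space exploration with a toy, hypothetical energy model: sweep (motor speed, CPU frequency) pairs and keep the pair that minimizes energy per meter at the current SOC. The real controller uses measured, SOC-dependent consumption models and runtime constraints.

```python
import itertools

def energy_per_meter(speed, freq, soc):
    """Toy model: mechanical cost grows with speed, compute cost with frequency,
    and a low state of charge (soc in [0, 1]) inflates the effective drain."""
    mech = 2.0 + 0.8 * speed ** 2
    comp = 1.0 + 0.5 * freq ** 1.5
    return (mech + comp) / speed * (1.0 + 0.3 * (1.0 - soc))

speeds = [0.25, 0.5, 0.75, 1.0]  # m/s
freqs = [0.6, 1.0, 1.4, 1.8]     # GHz
best = min(itertools.product(speeds, freqs),
           key=lambda c: energy_per_meter(*c, soc=0.4))
print("best (speed, freq):", best)
```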

NeurIPS Conference 2021 Conference Paper

Exploring Forensic Dental Identification with Deep Learning

  • Yuan Liang
  • Weikun Han
  • Liang Qiu
  • Chen Wu
  • Yiting Shao
  • Kun Wang
  • Lei He

Dental forensic identification aims to identify persons from their dental traces. The task is vital for the investigation of criminal scenes and mass disasters because of the resilience of dental structures and the widespread availability of dental imaging. However, no widely accepted automated solution is available for this labour-intensive task. In this work, we pioneer the study of deep learning for dental forensic identification based on panoramic radiographs. We construct a comprehensive benchmark with various dental variations that adequately reflects the difficulties of the task. Considering the task's unique challenges, we propose FoID, a deep learning method featuring: (i) clinically-inspired attention localization, (ii) domain-specific augmentations that enable instance-discriminative learning, and (iii) a transformer-based self-attention mechanism that dynamically reasons about the relative importance of attentions. We show that FoID outperforms traditional approaches by at least 22.98% in terms of Rank-1 accuracy, and outperforms strong CNN baselines by at least 10.50% in terms of mean Average Precision (mAP). Moreover, extensive ablation studies verify the effectiveness of each building block of FoID. Our work can be a first step towards an automated system for forensic identification across large-scale multi-site databases. The proposed techniques, e.g., the self-attention mechanism, can also be meaningful for other identification tasks, e.g., pedestrian re-identification. Related data and code can be found at https://github.com/liangyuandg/FoID.
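
For reference, a minimal sketch of the Rank-1 identification metric reported above, with random embeddings standing in for FoID's learned dental representations: for each query, check whether its nearest gallery embedding shares the same identity.

```python
import numpy as np

def rank1_accuracy(query_emb, query_ids, gallery_emb, gallery_ids):
    """Fraction of queries whose nearest gallery neighbor has the same identity."""
    dists = np.linalg.norm(query_emb[:, None] - gallery_emb[None, :], axis=-1)
    nearest = dists.argmin(axis=1)  # best gallery match per query
    return float(np.mean(gallery_ids[nearest] == query_ids))

q, g = np.random.randn(10, 64), np.random.randn(50, 64)
print(rank1_accuracy(q, np.arange(10), g, np.random.randint(0, 10, size=50)))
```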

AAAI Conference 2021 Conference Paper

Learning to Truncate Ranked Lists for Information Retrieval

  • Chen Wu
  • Ruqing Zhang
  • Jiafeng Guo
  • Yixing Fan
  • Yanyan Lan
  • Xueqi Cheng

Ranked list truncation is of critical importance in a variety of professional information retrieval applications such as patent search or legal search. The goal is to dynamically determine the number of returned documents according to some user-defined objectives, in order to reach a balance between the overall utility of the results and user effort. Existing methods formulate this task as a sequential decision problem and take some pre-defined loss as a proxy objective, which suffers from the limitations of local decision-making and indirect optimization. In this work, we propose a global-decision-based truncation model named AttnCut, which directly optimizes user-defined objectives for ranked list truncation. Specifically, we adopt the transformer architecture to capture the global dependency within the ranked list for the truncation decision, and employ reward augmented maximum likelihood (RAML) for direct optimization. We consider two types of user-defined objectives of practical usage: one is a widely adopted balanced metric such as F1, and the other is the best F1 under some minimal recall constraint, which represents a typical objective in professional search. Empirical results over the Robust04 and MQ2007 datasets demonstrate the effectiveness of our approach compared with state-of-the-art baselines.
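
A minimal sketch of the truncation objective itself: sweep every cut position over a ranked list of relevance labels and return the F1-optimal cut. AttnCut learns to predict this decision with a transformer trained via RAML; the brute-force oracle below only illustrates the target.

```python
def best_f1_cut(labels):
    """labels: 1 = relevant, 0 = not, in ranked order -> (best cut, best F1)."""
    total_rel = sum(labels)
    best_k, best_f1, hits = 0, 0.0, 0
    for k, y in enumerate(labels, start=1):
        hits += y
        precision = hits / k
        recall = hits / total_rel if total_rel else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        if f1 > best_f1:
            best_k, best_f1 = k, f1
    return best_k, best_f1

print(best_f1_cut([1, 1, 0, 1, 0, 0, 0, 1]))  # cut after the 4th document, F1 = 0.75
```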

IJCAI Conference 2021 Conference Paper

Learning Visual Words for Weakly-Supervised Semantic Segmentation

  • Lixiang Ru
  • Bo Du
  • Chen Wu

Current weakly-supervised semantic segmentation (WSSS) methods with image-level labels mainly adopt class activation maps (CAM) to generate the initial pseudo labels. However, CAM usually only identifies the most discriminative object extents, because the network does not need to discover the integral object to recognize image-level labels. To tackle this problem, we propose to simultaneously learn image-level labels and local visual-word labels. Specifically, in each forward propagation, the feature maps of the input image are encoded into visual words with a learnable codebook. By enforcing the network to classify the encoded fine-grained visual words, the generated CAM can cover more semantic regions. In addition, we propose a hybrid spatial pyramid pooling module that preserves both the local maximum and global average values of feature maps, so that more object details and less background are considered. Based on the proposed methods, we conducted experiments on the PASCAL VOC 2012 dataset. Our method achieved 67.2% mIoU on the val set and 67.3% mIoU on the test set, outperforming recent state-of-the-art methods.
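
A minimal sketch of the visual-word encoding step, assuming a random codebook in place of the learnable one trained in the paper: each spatial feature is assigned to its nearest code vector, producing the fine-grained word labels the network is asked to classify.

```python
import torch

def assign_visual_words(feat, codebook):
    """feat: (C, H, W); codebook: (K, C) -> (H, W) map of visual-word indices."""
    C, H, W = feat.shape
    flat = feat.flatten(1).T             # (H*W, C) spatial features
    dists = torch.cdist(flat, codebook)  # (H*W, K) distances to each code vector
    return dists.argmin(dim=1).reshape(H, W)

words = assign_visual_words(torch.randn(256, 28, 28), torch.randn(64, 256))
print(words.shape, words.max())  # (28, 28) map of word ids in [0, 63]
```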

AAAI Conference 2020 Conference Paper

Generating Well-Formed Answers by Machine Reading with Stochastic Selector Networks

  • Bin Bi
  • Chen Wu
  • Ming Yan
  • Wei Wang
  • Jiangnan Xia
  • Chenliang Li

Question answering (QA) based on machine reading comprehension has seen a recent surge in popularity, yet most work has focused on extractive methods. We instead address the more challenging QA problem of generating a well-formed answer by reading and summarizing the paragraph for a given question. For this generative QA task, we introduce a new neural architecture, LatentQA, in which a novel stochastic selector network composes a well-formed answer from words selected from the question, the paragraph, and the global vocabulary, based on a sequence of discrete latent variables. Bayesian inference over the latent variables is performed to train the LatentQA model. Experiments on public datasets of natural answer generation confirm the effectiveness of LatentQA in generating high-quality well-formed answers.
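
A minimal sketch of the selector's mixture step at a single decoding position, with random placeholder distributions: a discrete latent chooses among copying from the question, copying from the paragraph, or generating from the vocabulary, and the output distribution is the selector-weighted mixture. The paper trains these components with Bayesian inference over the latent variables.

```python
import torch

vocab = 1000
p_question = torch.rand(vocab).softmax(0)   # copy distribution over question words (projected to vocab)
p_paragraph = torch.rand(vocab).softmax(0)  # copy distribution over paragraph words
p_generate = torch.rand(vocab).softmax(0)   # generation distribution over the global vocabulary

selector = torch.tensor([0.2, 0.5, 0.3])    # P(source) from the selector network
p_word = (selector[0] * p_question +
          selector[1] * p_paragraph +
          selector[2] * p_generate)          # mixture over the vocabulary
print(p_word.sum())                          # ~1.0
```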

AAAI Conference 2019 Conference Paper

A Deep Cascade Model for Multi-Document Reading Comprehension

  • Ming Yan
  • Jiangnan Xia
  • Chen Wu
  • Bin Bi
  • Zhongzhou Zhao
  • Ji Zhang
  • Luo Si
  • Rui Wang

A fundamental trade-off between effectiveness and efficiency needs to be balanced when designing an online question answering system. Effectiveness comes from sophisticated functions such as extractive machine reading comprehension (MRC), while efficiency is obtained from improvements in preliminary retrieval components such as candidate document selection and paragraph ranking. Given the complexity of the real-world multi-document MRC scenario, it is difficult to jointly optimize both in an end-to-end system. To address this problem, we develop a novel deep cascade learning model, which progressively evolves from document-level and paragraph-level ranking of candidate texts to more precise answer extraction with machine reading comprehension. Specifically, irrelevant documents and paragraphs are first filtered out with simple functions for efficiency. We then jointly train three modules on the remaining texts to better locate the answer: document extraction, paragraph extraction, and answer extraction. Experimental results show that the proposed method outperforms previous state-of-the-art methods on two large-scale multi-document benchmark datasets, i.e., TriviaQA and DuReader. In addition, our online system can stably serve typical scenarios with millions of daily requests in less than 50 ms.
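
A minimal sketch of the cascade control flow, with placeholder scoring functions standing in for the paper's ranking and MRC modules: cheap filters prune documents and paragraphs first, and the expensive reader runs only on the survivors.

```python
def cascade_answer(question, documents, doc_score, para_score, reader,
                   top_docs=3, top_paras=5):
    """documents: list of dicts with a 'paragraphs' list; scorers and reader are
    hypothetical callables standing in for the trained modules."""
    docs = sorted(documents, key=lambda d: doc_score(question, d), reverse=True)[:top_docs]
    paras = [p for d in docs for p in d["paragraphs"]]
    paras = sorted(paras, key=lambda p: para_score(question, p), reverse=True)[:top_paras]
    return reader(question, paras)  # precise answer extraction on a small candidate set
```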