Arrow Research search

Author name cluster

Pengwei Wang

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

5 papers
1 author row

Possible papers (5)

AAAI Conference 2026 Conference Paper

WIET: Harmonizing Group-aware Model Weighting and Worker Allocation for Ensemble Temporal Prediction MaaS

  • Binbin Feng
  • Shikun He
  • Yingxin Wang
  • Pengwei Wang
  • Xiang Gao
  • Zhijun Ding

Ensemble Temporal Prediction Model-as-a-Service (ETP-MaaS) has become crucial in fields like financial modeling and cloud monitoring. Existing solutions fail to jointly address the two-fold challenge of dynamic collaboration and heterogeneity, treating models as independent entities and employing simplistic worker allocation rules. At the model level, data volatility means that optimal performance requires identifying and weighting constantly shifting subgroups of base models, not just individual ones; at the system level, these model groups must be efficiently mapped to a pool of heterogeneous and dynamically available workers. To this end, we introduce WIET, an efficient ETP-MaaS system that co-optimizes model weighting and worker allocation. For adaptive weighting, WIET identifies evolving group behaviors among base models and proposes a novel group temporal locality-enhanced weighting method. Additionally, WIET develops an efficient, multi-dimensional worker allocation method powered by hybrid heuristic optimization, effectively reducing bottlenecks and resource waste. Experiments show that WIET consistently outperforms state-of-the-art methods in accuracy, latency, and resource usage across various workloads and tasks.
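The abstract does not spell out the weighting formula; as a rough illustration of "group temporal locality-enhanced weighting", a recency-weighted error per model group turned into softmax ensemble weights might look like the sketch below. The decay factor, grouping, and softmax mapping are assumptions for illustration, not the paper's actual method.

```python
import numpy as np

def group_temporal_weights(errors, groups, decay=0.9):
    """Hypothetical sketch: recency-weighted (temporal-locality) error per
    group of base models, mapped to softmax ensemble weights.

    errors: (T, M) array of per-step errors for M base models
    groups: list of index lists, one per model group
    """
    T = errors.shape[0]
    # exponential decay so recent errors dominate (temporal locality)
    decay_w = decay ** np.arange(T - 1, -1, -1)
    decay_w /= decay_w.sum()
    model_err = decay_w @ errors                      # (M,) recency-weighted error
    group_err = np.array([model_err[g].mean() for g in groups])
    logits = -group_err                               # lower error -> higher weight
    w = np.exp(logits - logits.max())
    return w / w.sum()

# toy usage: 3 base models split into 2 groups
errs = np.abs(np.random.randn(50, 3))
print(group_temporal_weights(errs, groups=[[0, 1], [2]]))
```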

NeurIPS Conference 2025 Conference Paper

AC-DiT: Adaptive Coordination Diffusion Transformer for Mobile Manipulation

  • Sixiang Chen
  • Jiaming Liu
  • Siyuan Qian
  • Han Jiang
  • Zhuoyang Liu
  • Chenyang Gu
  • Xiaoqi Li
  • Chengkai Hou

Recently, mobile manipulation has attracted increasing attention for enabling language-conditioned robotic control in household tasks. However, existing methods still face challenges in coordinating the mobile base and the manipulator, primarily due to two limitations. On the one hand, they fail to explicitly model the influence of the mobile base on manipulator control, which easily leads to error accumulation under high degrees of freedom. On the other hand, they treat the entire mobile manipulation process with the same visual observation modality (e.g., either all 2D or all 3D), overlooking the distinct multimodal perception requirements at different stages of mobile manipulation. To address this, we propose the Adaptive Coordination Diffusion Transformer (AC-DiT), which enhances mobile base and manipulator coordination for end-to-end mobile manipulation. First, since the motion of the mobile base directly influences the manipulator's actions, we introduce a mobility-to-body conditioning mechanism that guides the model to first extract base motion representations, which are then used as a context prior for predicting whole-body actions. This enables whole-body control that accounts for the potential impact of the mobile base's motion. Second, to meet the perception requirements at different stages of mobile manipulation, we design a perception-aware multimodal conditioning strategy that dynamically adjusts the fusion weights between various 2D visual images and 3D point clouds, yielding visual features tailored to the current perceptual needs. This allows the model to, for example, adaptively rely more on 2D inputs when semantic information is crucial for action prediction, while placing greater emphasis on 3D geometric information when precise spatial understanding is required. We empirically validate AC-DiT through extensive experiments on both simulated and real-world mobile manipulation tasks, demonstrating superior performance compared to existing methods.
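As a hedged illustration of the perception-aware multimodal conditioning idea (dynamically weighting 2D image features against 3D point-cloud features), a minimal gating module could look like the following; the module name, dimensions, and gate design are assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class PerceptionAwareFusion(nn.Module):
    """Illustrative sketch (not the authors' code): learn a gate that mixes
    pooled 2D image features and 3D point-cloud features, so the policy can
    lean on 2D semantics or 3D geometry depending on the task stage."""

    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(),
                                  nn.Linear(dim, 2))

    def forward(self, feat_2d: torch.Tensor, feat_3d: torch.Tensor):
        # feat_2d, feat_3d: (batch, dim) pooled visual features
        w = torch.softmax(self.gate(torch.cat([feat_2d, feat_3d], dim=-1)), dim=-1)
        fused = w[:, :1] * feat_2d + w[:, 1:] * feat_3d
        return fused, w  # w exposes how much each modality is trusted

fusion = PerceptionAwareFusion(dim=256)
fused, weights = fusion(torch.randn(4, 256), torch.randn(4, 256))
print(fused.shape, weights[0])
```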

NeurIPS Conference 2025 Conference Paper

Reason-RFT: Reinforcement Fine-Tuning for Visual Reasoning of Vision Language Models

  • Huajie Tan
  • Yuheng Ji
  • Xiaoshuai Hao
  • Xiansheng Chen
  • Pengwei Wang
  • Zhongyuan Wang
  • Shanghang Zhang

Visual reasoning abilities play a crucial role in understanding complex multimodal data, advancing both domain-specific applications and artificial general intelligence (AGI). Existing methods enhance Vision-Language Models (VLMs) through Chain-of-Thought (CoT) supervised fine-tuning using meticulously annotated data. However, this approach may lead to overfitting and cognitive rigidity, limiting the model’s generalization ability under domain shifts and reducing real-world applicability. To overcome these limitations, we propose Reason-RFT, a two-stage reinforcement fine-tuning framework for visual reasoning. First, Supervised Fine-Tuning (SFT) with curated CoT data activates the reasoning potential of VLMs. This is followed by reinforcement learning based on Group Relative Policy Optimization (GRPO), which generates multiple reasoning-response pairs to enhance adaptability to domain shifts. To evaluate Reason-RFT, we reconstructed a comprehensive dataset covering visual counting, structural perception, and spatial transformation, serving as a benchmark for systematic assessment across three key dimensions. Experimental results highlight three advantages: (1) performance enhancement, with Reason-RFT achieving state-of-the-art results and outperforming both open-source and proprietary models; (2) generalization superiority, maintaining robust performance under domain shifts across various tasks; and (3) data efficiency, excelling in few-shot learning scenarios and surpassing full-dataset SFT baselines. Reason-RFT introduces a novel training paradigm for visual reasoning and marks a significant step forward in multimodal research.
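The second stage builds on Group Relative Policy Optimization (GRPO), whose core step is to sample several responses per prompt and normalize each response's reward against its own group rather than a learned value function. A minimal sketch of that group-relative advantage computation (the reward values and group size below are toy examples, not from the paper):

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantages as used in GRPO: normalize each sampled
    response's reward by the mean and std of its own prompt's group.

    rewards: (num_prompts, group_size) array of scalar rewards
    """
    mean = rewards.mean(axis=1, keepdims=True)
    std = rewards.std(axis=1, keepdims=True)
    return (rewards - mean) / (std + eps)

# toy usage: 2 prompts, 4 sampled reasoning-response pairs each
r = np.array([[1.0, 0.0, 1.0, 0.0],
              [0.2, 0.9, 0.4, 0.5]])
print(grpo_advantages(r))
```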

NeurIPS Conference 2025 Conference Paper

RoboRefer: Towards Spatial Referring with Reasoning in Vision-Language Models for Robotics

  • Enshen Zhou
  • Jingkun An
  • Cheng Chi
  • Yi Han
  • Shanyu Rong
  • Chi Zhang
  • Pengwei Wang
  • Zhongyuan Wang

Spatial referring is a fundamental capability of embodied robots to interact with the 3D physical world. However, even with powerful pretrained VLMs, recent approaches still struggle to accurately understand complex 3D scenes and dynamically reason about the instruction-indicated locations for interaction. To this end, we propose RoboRefer, a 3D-aware vision language model (VLM) that first achieves precise spatial understanding by integrating a disentangled but dedicated depth encoder via supervised fine-tuning (SFT). Moreover, RoboRefer advances generalized multi-step spatial reasoning via reinforcement fine-tuning (RFT), with metric-sensitive process reward functions tailored for spatial referring tasks. To support SFT and RFT training, we introduce RefSpatial, a large-scale dataset of 20M QA pairs (2x prior), covering 31 spatial relations (vs. 15 prior) and supporting complex reasoning processes (up to 5 steps). In addition, we introduce RefSpatial-Bench, a challenging benchmark that fills the gap in evaluating spatial referring with multi-step reasoning. Experiments show that SFT-trained RoboRefer achieves state-of-the-art spatial understanding, with an average success rate of 89.6%. RFT-trained RoboRefer further outperforms all other baselines by a large margin, even surpassing Gemini-2.5-Pro by 12.4% in average accuracy on RefSpatial-Bench. Notably, RoboRefer can be integrated with various control policies to execute long-horizon, dynamic tasks across diverse robots (e.g., UR5, G1 humanoid) in cluttered real-world scenes.
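The abstract only names "metric-sensitive process reward functions" without defining them; one purely illustrative reading is a reward that decays smoothly with the metric distance between the predicted point and the referred ground-truth location, rather than a hard hit/miss. The function and tolerance below are hypothetical.

```python
import math

def metric_sensitive_reward(pred_xy, gt_xy, tol=0.05):
    """Hypothetical sketch of a metric-sensitive reward for spatial referring:
    reward decays with the metric distance from the ground-truth point.
    (The paper's actual process reward functions are not given in the abstract.)
    """
    dist = math.dist(pred_xy, gt_xy)
    return math.exp(-dist / tol)   # 1.0 at the target, near 0 far away

print(metric_sensitive_reward((0.52, 0.31), (0.50, 0.30)))
```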

TIST Journal 2018 Journal Article

Concept and Attention-Based CNN for Question Retrieval in Multi-View Learning

  • Pengwei Wang
  • Lei Ji
  • Jun Yan
  • Dejing Dou
  • Nisansa De Silva
  • Yong Zhang
  • Lianwen Jin

Question retrieval, which aims to find similar versions of a given question, plays a pivotal role in various question answering (QA) systems. This task is quite challenging, mainly in regard to five aspects: synonymy, polysemy, word order, question length, and data sparsity. In this article, we propose a unified framework to simultaneously handle these five problems. We combine each word with its corresponding concept information to handle the synonymy and polysemy problems. Concept embedding and word embedding are learned at the same time from both the context-dependent and context-independent views. To handle the word-order problem, we propose a high-level feature-embedded convolutional semantic model that learns question embeddings from concept embeddings and word embeddings. Because some questions are long, we propose a value-based convolutional attention method that helps the high-level feature-embedded convolutional semantic model focus on the key parts of the question and the answer. The proposed model nicely represents the hierarchical structures of word information and concept information in sentences through layer-by-layer convolution and pooling. Finally, to resolve data sparsity, we propose using multi-view learning to train the attention-based convolutional semantic model on question–answer pairs. To the best of our knowledge, we are the first to simultaneously handle the above five problems in question retrieval within one framework. Experiments on three real question-answering datasets show that the proposed framework significantly outperforms state-of-the-art solutions.
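A minimal sketch of the concept-plus-word encoding idea (concatenating word and concept embeddings per token, then convolution and pooling to obtain a fixed-size question vector for similarity-based retrieval); layer sizes, pooling, and the class name are assumptions rather than the paper's exact high-level feature-embedded model.

```python
import torch
import torch.nn as nn

class ConceptWordCNN(nn.Module):
    """Illustrative sketch (not the authors' exact model): concatenate word and
    concept embeddings per token, then 1D convolution + max-pooling to get a
    fixed-size question representation."""

    def __init__(self, vocab, concepts, dim=100, filters=128, width=3):
        super().__init__()
        self.word_emb = nn.Embedding(vocab, dim)
        self.concept_emb = nn.Embedding(concepts, dim)
        self.conv = nn.Conv1d(2 * dim, filters, kernel_size=width, padding=1)

    def forward(self, word_ids, concept_ids):
        x = torch.cat([self.word_emb(word_ids),
                       self.concept_emb(concept_ids)], dim=-1)  # (B, L, 2*dim)
        h = torch.relu(self.conv(x.transpose(1, 2)))            # (B, filters, L)
        return h.max(dim=-1).values                             # (B, filters)

enc = ConceptWordCNN(vocab=5000, concepts=800)
q = enc(torch.randint(0, 5000, (2, 12)), torch.randint(0, 800, (2, 12)))
print(q.shape)  # cosine similarity between such vectors would rank candidate questions
```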