Arrow Research search

Author name cluster

Qi Wang

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

87 papers
2 author rows

Possible papers

87

AAAI Conference 2026 Conference Paper

AbductiveMLLM: Boosting Visual Abductive Reasoning Within MLLMs

  • Boyu Chang
  • Qi Wang
  • Xi Guo
  • Zhixiong Nan
  • Yazhou Yao
  • Tianfei Zhou

Visual abductive reasoning (VAR) is a challenging task that requires AI systems to infer the most likely explanation for incomplete visual observations. While recent MLLMs have developed strong general-purpose multimodal reasoning capabilities, they still fall short of humans in abductive inference. To bridge this gap, we draw inspiration from the interplay between verbal and pictorial abduction in human cognition, and propose to strengthen abduction of MLLMs by mimicking such dual-mode behavior. Concretely, we introduce AbductiveMLLM, comprising two synergistic components: REASONER and IMAGINER. The REASONER operates in the verbal domain. It first explores a broad space of possible explanations using a blind LLM and then prunes visually incongruent hypotheses based on cross-modal causal alignment. The remaining hypotheses are introduced into the MLLM as targeted priors, steering its reasoning toward causally coherent explanations. The IMAGINER, on the other hand, further guides MLLMs by emulating human-like pictorial thinking. It conditions a text-to-image diffusion model on both the input video and the REASONER’s output embeddings to “imagine” plausible visual scenes that correspond to the verbal explanations, thereby enriching MLLMs' contextual grounding. The two components are trained jointly in an end-to-end manner. Experiments on standard VAR benchmarks show that AbductiveMLLM achieves state-of-the-art performance, consistently outperforming traditional solutions and advanced MLLMs.

AAAI Conference 2026 Conference Paper

Beyond Retraining: Training-Free Unknown Class Filtering for Source-Free Open Set Domain Adaptation of Vision–Language Models

  • Yongguang Li
  • Jindong Li
  • Qi Wang
  • QianLi Xing
  • Runliang Niu
  • Shengsheng Wang
  • Menglin Yang

Vision-language models (VLMs) have gained widespread attention for their strong zero-shot capabilities across numerous downstream tasks. However, these models assume that each test image’s class label is drawn from a predefined label set and lack a reliable mechanism to reject samples from emerging unknown classes when only unlabeled data are available. To address this gap, open-set domain adaptation methods retrain models to push potential unknowns away from known clusters. Yet, some unknown samples remain stably anchored to specific known classes in the VLM feature space due to semantic relevance, a phenomenon termed Semantic Affinity Anchoring (SAA). Forcibly repelling these samples unavoidably distorts the native geometry of VLMs and degrades performance. Meanwhile, existing score-based unknown detectors use simplistic thresholds and suffer from threshold sensitivity, resulting in sub-optimal performance. To address the aforementioned issues, we propose VLM-OpenXpert, which comprises two training-free, plug-and-play inference modules. SUFF performs SVD on high-confidence unknowns to extract a low-rank "unknown subspace". Each sample’s projection onto this subspace is weighted and softly removed from its feature, suppressing unknown components while preserving semantics. BGAT corrects score skewness via a Box–Cox transform, then fits a bimodal Gaussian mixture to adaptively estimate the optimal threshold balancing known-class recognition and unknown-class rejection. Experiments on nine benchmarks and three backbones (CLIP, SigLIP, ALIGN) under Source-Free OSDA settings show that our training-free pipeline matches or outperforms retraining-heavy state-of-the-art methods, establishing a powerful lightweight inference calibration paradigm for open-set VLM deployment.
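The abstract names BGAT's recipe (Box–Cox skew correction, then a bimodal Gaussian mixture to place the threshold) without implementation details. The sketch below is an illustrative reconstruction of that generic recipe only; the function names, the fixed lambda of 0.5, and the minimal EM loop are all assumptions, not the paper's code:

```python
import numpy as np

def boxcox(x, lam):
    """Box-Cox transform to correct score skewness (x must be positive)."""
    return np.log(x) if lam == 0 else (x ** lam - 1.0) / lam

def fit_bimodal_gmm_1d(s, iters=200):
    """Minimal EM for a two-component 1D Gaussian mixture.
    Returns (weights, means, stds)."""
    mu = np.percentile(s, [25, 75]).astype(float)   # rough init: one mode per tail
    sd = np.full(2, s.std() + 1e-8)
    w = np.array([0.5, 0.5])
    for _ in range(iters):
        # E-step: responsibility of each component for each sample
        p = w * np.exp(-0.5 * ((s[:, None] - mu) / sd) ** 2) / sd
        r = p / p.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means, and stds
        n = r.sum(axis=0)
        w = n / len(s)
        mu = (r * s[:, None]).sum(axis=0) / n
        sd = np.sqrt((r * (s[:, None] - mu) ** 2).sum(axis=0) / n) + 1e-8
    return w, mu, sd

def adaptive_threshold(scores, lam=0.5):
    """Skew-correct the scores, fit two Gaussians (known vs. unknown),
    and place the threshold where the two component densities cross."""
    t = boxcox(np.asarray(scores, dtype=float), lam)
    w, mu, sd = fit_bimodal_gmm_1d(t)
    lo, hi = np.sort(mu)
    grid = np.linspace(lo, hi, 1000)
    # component densities on the grid (shared normalizing constants cancel)
    d0, d1 = (w[k] * np.exp(-0.5 * ((grid - mu[k]) / sd[k]) ** 2) / sd[k] for k in (0, 1))
    thr_t = grid[np.argmin(np.abs(d0 - d1))]
    # map the threshold back to the original score space
    return np.exp(thr_t) if lam == 0 else (thr_t * lam + 1.0) ** (1.0 / lam)
```

On scores that form two separated clusters, the returned threshold lands between the clusters, which is the point of the adaptive scheme: no manually tuned cut-off is needed.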

AAAI Conference 2026 Conference Paper

CoGrad3D: Spatially-Coupled Timestep Optimization with Orthogonal Gradient Fusion for 3D Generation

  • Haoyang Tong
  • Hongbo Wang
  • Jin Liu
  • Qi Wang
  • Jie Cao
  • Ran He

Score Distillation Sampling has driven recent advances in text-to-3D generation. However, current approaches often fail to produce 3D assets that are both rich in detail and consistent across viewpoints. These limitations primarily arise from imbalanced guidance on fine-grained details and an overdependence on single-view optimization—issues exacerbated by the excessive randomness in selecting diffusion timesteps and camera configurations. Such deficiencies commonly lead to blurry textures and inter-view inconsistencies, which degrade visual realism and hinder practical deployment. To tackle these challenges, we introduce CoGrad3D, a unified generative refinement framework that adopts a continuously adaptive optimization strategy. By dynamically modulating the optimization focus based on real-time convergence signals, CoGrad3D ensures balanced progress toward both geometric completeness and high-fidelity detail. Concretely, we propose an adaptive region sampling strategy that emphasizes under-converged viewing areas, promoting stable and uniform optimization. To facilitate the transition from coarse geometry to fine-grained reconstruction, we develop a region-aware temporal scheduling scheme that integrates global training dynamics with local convergence feedback. Furthermore, we introduce a gradient fusion mechanism that consolidates historical gradients from adjacent viewpoints, mitigating view-specific artifacts and promoting the emergence of coherent 3D structures. Extensive experiments demonstrate that CoGrad3D substantially surpasses existing methods in both geometric consistency and texture fidelity, enabling the generation of high-quality, view-consistent 3D models from textual descriptions.

JBHI Journal 2026 Journal Article

Dual-Student Adversarial Framework With Discriminator and Consistency-Driven Learning for Semi-Supervised Medical Image Segmentation

  • Haifan Wu
  • Yuhan Geng
  • Di Gai
  • Jieying Tu
  • Xin Xiong
  • Qi Wang
  • Zheng Huang

Semi-supervised medical image segmentation is essential for alleviating the cost of manual annotation in clinical applications. However, existing methods often suffer from unreliable pseudo-labels and confirmation bias in consistency-based training, which can lead to unstable optimization and degraded performance. To address these issues, a novel dual-student adversarial framework with discriminator and consistency-driven learning is proposed for semi-supervised medical image segmentation. Specifically, an adversarial learning-based segmentation refinement (ALSR) module is designed to encourage prediction diversity between two student networks and leverage a shared discriminator for adversarial refinement of pseudo-labels. To further stabilize the consistency process, a residual exponential moving average (R-EMA) is applied in the uncertainty estimation with inter-instance consistency measurement (UIM) module to construct a robust teacher model, while noisy voxel predictions are selectively filtered based on uncertainty estimation. In addition, a Contrastive Representation Stabilization (CRS) module is developed to enhance voxel-level semantic alignment by performing contrastive learning only on confident regions, improving feature discriminability and structural consistency. Extensive experiments on benchmark datasets demonstrate that our method consistently outperforms prior state-of-the-art approaches.
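The teacher-student machinery underlying such frameworks is standard; a minimal sketch of plain EMA plus entropy-based voxel filtering (the paper's residual R-EMA variant and its exact uncertainty measure are not reproduced here, and the function names are illustrative):

```python
import numpy as np

def ema_update(teacher, student, decay=0.99):
    """Standard exponential moving average: the teacher tracks a smoothed
    copy of the student's parameters (R-EMA adds a residual term on top
    of this; that variant is paper-specific and omitted)."""
    return {k: decay * teacher[k] + (1.0 - decay) * student[k] for k in teacher}

def uncertainty_mask(probs, thresh=0.5):
    """Keep only voxels whose predictive entropy (in nats) is below a
    threshold, mimicking selective filtering of noisy voxel predictions."""
    ent = -np.sum(probs * np.log(probs + 1e-12), axis=-1)
    return ent < thresh
```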

AAAI Conference 2026 Conference Paper

Exploring Generalizable Remote Sensing Change Detection via Low-Rank Exchange Adaptation of Vision Foundation Model

  • Mingwei Zhang
  • Jingtao Hu
  • Qiang Li
  • Qi Wang

Remote sensing change detection (CD) has achieved remarkable progress in recent years. However, little attention has been paid to generalizable change detection (GCD) methods that can effectively generalize to unseen scenarios or domains beyond the training distribution. The major challenges in GCD arise from domain diversity and bitemporal domain shifts in remote sensing images, caused by variations in imaging platforms, acquisition times, geographic regions, and observed events. To tackle these challenges, we propose GenCD, a GCD framework built upon vision foundation models (VFMs). Specifically, GenCD introduces two key components: (1) a Low-Rank Exchange Adaptation (LREA) strategy of VFMs that aligns bitemporal representations while preserving the generalization capacity of VFMs on single-temporal inputs; and (2) a Token-Guided Feature Refinement (TGFR) mechanism that leverages an input-independent token as a guide to refine difference features, improving the discrimination between changed and unchanged regions. We conduct extensive cross-dataset evaluations on eight diverse datasets across three binary CD tasks: land cover, land use, and building-only CD. The results consistently demonstrate the superior generalization of GenCD over SoTA methods, highlighting its effectiveness in GCD.

AAAI Conference 2026 Conference Paper

HISE-KT: Synergizing Heterogeneous Information Networks and LLMs for Explainable Knowledge Tracing with Meta-Path Optimization

  • Zhiyi Duan
  • Zixing Shi
  • Hongyu Yuan
  • Qi Wang

Knowledge Tracing (KT) aims to mine students’ evolving knowledge states and predict their future question-answering performance. Existing methods based on heterogeneous information networks (HINs) are prone to introducing noise due to manual or random selection of meta-paths and lack necessary quality assessment of meta-path instances. Conversely, recent large language models (LLMs)-based methods ignore the rich information across students, and both paradigms struggle to deliver consistently accurate and evidence-based explanations. To address these issues, we propose an innovative framework, HIN-LLM Synergistic Enhanced Knowledge Tracing (HISE-KT), which seamlessly integrates HINs with LLMs. HISE-KT first builds a multi-relationship HIN containing diverse node types to capture the structural relations through multiple meta-paths. The LLM is then employed to intelligently score and filter meta-path instances and retain high-quality paths, pioneering automated meta-path quality assessment. Inspired by educational psychology principles, a similar student retrieval mechanism based on meta-paths is designed to provide a more valuable context for prediction. Finally, HISE-KT uses a structured prompt to integrate the target student's history with the retrieved similar trajectories, enabling the LLM to generate not only accurate predictions but also evidence-backed, explainable analysis reports. Experiments on four public datasets show that HISE-KT outperforms existing KT baselines in both prediction performance and interpretability.

JBHI Journal 2026 Journal Article

HyperSynergyX: Synergistic Drug Combination Prediction via Hypergraph Modeling and Knowledge Graph-Enhanced Retrieval-Augmented Generation

  • Qi Wang
  • Bingzheng Wu
  • Minglang Xu
  • Xiya Liu
  • Yiming Mao
  • Zhiheng Zhou
  • Guiying Yan

Drug combination therapy is pivotal for complex diseases, but identifying synergistic three-drug regimens remains challenging due to both combinatorial explosion and the opacity of existing computational models. To address this, we introduce HyperSynergyX, an explainable framework that integrates synergy prediction with mechanistic explanation. Its core predictive component, a Dual-Biased Random Walk on Hypergraphs (DBRWH), models higher-order interactions among drugs on a three-drug hypergraph and identifies latent combination patterns via tensor decomposition. To enhance interpretability, we couple DBRWH with a knowledge-graph-enhanced retrieval-augmented generation (KG-RAG) module that retrieves mechanistically relevant subgraphs and uses them to generate biologically grounded hypotheses for predicted synergies. On breast cancer data, DBRWH achieves AUROC/AUPRC of 0.9593/0.9453 under 5-fold cross-validation, and on lung cancer data it achieves 0.9262/0.9481, outperforming strong deep learning and hypergraph baselines. By linking predictive performance with mechanistic interpretability, HyperSynergyX provides a robust and transparent tool to accelerate multi-drug discovery and support rational regimen design in precision oncology. The code is available at: https://github.com/wangqi27/HyperSynergyX.

AAAI Conference 2026 Conference Paper

PIMRL: Physics-Informed Multi-Scale Recurrent Learning for Burst-Sampled Spatiotemporal Dynamics

  • Han Wan
  • Qi Wang
  • Yuan Mi
  • Rui Zhang
  • Hao Sun

Deep learning has shown strong potential in modeling complex spatiotemporal dynamics. However, most existing methods depend on densely and uniformly sampled data, which is often unavailable in practice due to sensor and cost limitations. In many real-world settings, such as mobile sensing and physical experiments, data are burst-sampled with short high-frequency segments followed by long gaps, making it difficult to learn accurate dynamics from sparse observations. To address this issue, we propose Physics-Informed Multi-Scale Recurrent Learning (PIMRL), a novel framework specifically designed for burst-sampled spatiotemporal data. PIMRL combines macro-scale latent dynamics inference with micro-scale adaptive refinement guided by incomplete prior information from partial differential equations (PDEs). It further introduces a temporal message-passing mechanism to effectively propagate information across burst intervals. This multi-scale architecture enables PIMRL to model complex systems accurately even under severe data scarcity. We evaluate our approach on five benchmark datasets involving 1D to 3D multi-scale PDEs. The results show that PIMRL consistently outperforms state-of-the-art baselines, achieving substantial improvements and reducing errors by up to 80% in the most challenging settings. Our work demonstrates the effectiveness of physics-informed recurrent learning for accurate and efficient modeling of sparse spatiotemporal systems.

AAAI Conference 2026 Conference Paper

Reasoning via Implicit Self-supervised Emergence for Instruction Segmentation

  • Qing Zhou
  • Lichang Yang
  • Yuyu Jia
  • Junyu Gao
  • Weiping Ni
  • Junzheng Wu
  • Qi Wang

We challenge the assumption that complex instruction-guided segmentation tasks necessitate equally complex and explicit supervision. This paper introduces RISE (Reasoning via Implicit Self-supervised Emergence), a framework that learns intricate compositional reasoning, spanning spatial relations to world knowledge, without a single ground-truth mask. To achieve this, RISE employs reinforcement learning with GRPO guided by a single, strikingly simple reward: the semantic alignment score between the textual instruction and the predicted image region. Our primary discovery is the implicit emergence of a high-quality chain-of-thought process from this minimalist signal. Within a structured format, the model autonomously learns to understand instructions by accessing its latent knowledge, inferring spatial relationships—capabilities inherent in its architecture but unlocked by our simple objective. Remarkably, our emergent reasoning yields highly competitive results: RISE achieves 58.7 gIoU on the ReasonSeg benchmark, on par with methods using geometric rewards. Furthermore, we show extreme data efficiency: a variant trained on only 2,000 ImageNet-label pairs establishes a new state-of-the-art for annotation-free referring segmentation with 79.6 cIoU on RefCOCO.

AAAI Conference 2026 Conference Paper

Slender3D: Curve-Guided Multi-View Reconstruction of Slender Structures

  • Suqin Wang
  • Zeyi Wang
  • Min Shi
  • Zhaoxin Li
  • Qi Wang
  • Xiujuan Chai
  • Dengming Zhu

Although geometric reconstruction of general objects from images has made remarkable progress in recent years, slender structures remain largely underexplored, despite their critical importance in engineering, biomedical, and agricultural applications. To bridge this gap, we propose a dedicated 2DGS-based geometric reconstruction framework tailored for slender structures, achieving accurate and faithful geometry recovery. Our method first addresses the challenge that most slender objects are texture-less, which hinders reliable feature matching and pose estimation in traditional SfM pipelines. By leveraging the curve-like nature of slender structures, we perform a curve-guided SfM process that provides robust camera poses and accurate 3D curve initialization for Gaussian primitives. To ensure SfM reliability, we introduce a high-precision mask extraction strategy that integrates geometric priors with a segmentation network, effectively handling self-occlusion and thin geometry. Furthermore, to enhance fine geometric recovery, we incorporate a differentiable Poisson reconstruction module to extract an initial mesh during training, which is then refined via image-space iterative optimization using differentiable mesh rasterization. In contrast to conventional approaches that rely on differentiable Gaussian rasterization followed by TSDF-based mesh extraction, our method avoids the additional geometric errors and artifacts introduced during the intermediate TSDF conversion, thereby improving the overall reconstruction quality. Comprehensive experiments on both synthetic and real-world datasets validate that our method achieves superior reconstruction quality compared to state-of-the-art approaches.

AAAI Conference 2026 Conference Paper

Symphony-MoE: Harmonizing Disparate Pre-trained Models into a Coherent Mixture-of-Experts

  • Qi Wang
  • Hanyang Peng
  • Yue Yu

Mixture-of-Experts (MoE) models enable scalable performance by activating large parameter sets sparsely, minimizing computational overhead. To mitigate the prohibitive cost of training MoEs from scratch, recent work employs upcycling, reusing a single pre-trained dense model by replicating its feed-forward network (FFN) layers into experts. However, this limits expert diversity, as all experts originate from a single pre-trained dense model. This paper addresses this limitation by constructing powerful MoE models using experts sourced from multiple identically-architected but disparate pre-trained models (e.g., Qwen2.5-Coder and Qwen2). A key challenge lies in the fact that these source models occupy disparate, dissonant regions of the parameter space, making direct upcycling prone to severe performance degradation. To overcome this, we propose Symphony-MoE, a novel two-stage framework designed to harmonize these models into a single, coherent expert mixture. First, we establish this harmony in a training-free manner: we construct a shared backbone via a layer-aware fusion strategy and, crucially, alleviate parameter misalignment among experts using activation-based functional alignment. Subsequently, a stage of post-training coordinates the entire architecture. Experiments demonstrate that our method successfully integrates experts from heterogeneous sources, achieving an MoE model that significantly surpasses baselines in multi-domain tasks and out-of-distribution generalization.

AAAI Conference 2026 Conference Paper

TAPO: Dynamic Teacher and Perturbed Answer Injection for Policy Optimization

  • Maowei Jiang
  • Zihang Wang
  • Qi Wang
  • Peter Búš
  • Moquan Cheng
  • Yifan Wang
  • Quangao Liu
  • Ruiqi Li

Reinforcement learning (RL) has emerged as a powerful framework for improving the reasoning performance of large language models (LLMs), with approaches such as Group Relative Policy Optimization (GRPO) showing promising results. However, GRPO and its variants struggle with collapsed groups (i.e., all-correct or all-incorrect completions), leading to zero-variance rewards and ineffective gradient signals. Moreover, focusing solely on final answer correctness while ignoring the reasoning process, along with rigid length penalties, can hinder training stability and output quality. To address these issues, we introduce TAPO, a reinforcement learning framework that enhances optimization signals by modifying sampled completions within training groups. TAPO incorporates three core techniques: (1) Dynamic Teacher Injection (DTI), which selectively injects high-quality or adversarial examples to restore effective gradient signals in collapsed groups; (2) Perturbed Answer Injection (PAI), which constructs partially correct completions, providing contrastive supervision that separates trajectories with correct reasoning but wrong final answers; and (3) InfoLen-Aware Reward Shaping, a fine-grained reward strategy that penalizes outputs based on both length and semantic redundancy, encouraging concise yet informative responses. Extensive experimental results demonstrate that TAPO significantly improves the mathematical reasoning capabilities of LLMs across multiple challenging benchmarks, outperforming the GRPO baseline by a substantial margin. Component-wise ablations further validate the contribution of each proposed technique.
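The collapsed-group failure mode that motivates TAPO is easy to see in the commonly published group-relative advantage computation used by GRPO (the normalization form below is the standard one, not taken from this paper):

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: standardize rewards within a sampled group.
    When the group is "collapsed" (all-correct or all-incorrect), the
    within-group variance is zero and every advantage vanishes, which is
    exactly the dead-gradient situation described above."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)
```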

AAAI Conference 2026 Conference Paper

Target-Balanced Score Distillation

  • Zhou Xu
  • Qi Wang
  • Yuxiao Yang
  • Luyuan Zhang
  • Zhang Liang
  • Yang Li

Score Distillation Sampling (SDS) enables 3D asset generation by distilling priors from pretrained 2D text-to-image diffusion models, but vanilla SDS suffers from over-saturation and over-smoothing. To mitigate this issue, recent variants have incorporated negative prompts. However, these methods face a critical trade-off: limited texture optimization, or significant texture gains with shape distortion. In this work, we first conduct a systematic analysis and reveal that this trade-off is fundamentally governed by how the negative prompts are utilized: Target Negative Prompts (TNP), which embed target information in the negative prompts, dramatically enhance texture realism and fidelity but induce shape distortions. Informed by this key insight, we introduce Target-Balanced Score Distillation (TBSD). It formulates generation as a multi-objective optimization problem and introduces an adaptive strategy that effectively resolves the aforementioned trade-off. Extensive experiments demonstrate that TBSD significantly outperforms existing state-of-the-art methods, yielding 3D assets with high-fidelity textures and geometrically accurate shapes.
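For context, the vanilla SDS objective this line of work builds on is the well-known DreamFusion formulation (reproduced here for reference, not taken from this paper):

```latex
\nabla_{\theta} \mathcal{L}_{\mathrm{SDS}}(\theta)
  = \mathbb{E}_{t,\epsilon}\!\left[\, w(t)\,\big(\hat{\epsilon}_{\phi}(\mathbf{x}_t;\, y,\, t) - \epsilon\big)\,\frac{\partial \mathbf{x}}{\partial \theta} \right]
```

where x = g(θ) is the rendered view, x_t its noised version at timestep t, ε̂_φ the diffusion model's noise prediction for prompt y, and w(t) a timestep weighting. Negative-prompt variants modify ε̂_φ through classifier-free guidance, which is where the TNP trade-off discussed in the abstract arises.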

NeurIPS Conference 2025 Conference Paper

Adaptive Neighborhood-Constrained Q Learning for Offline Reinforcement Learning

  • Yixiu Mao
  • Yun Qu
  • Qi Wang
  • Xiangyang Ji

Offline reinforcement learning (RL) suffers from extrapolation errors induced by out-of-distribution (OOD) actions. To address this, offline RL algorithms typically impose constraints on action selection, which can be systematically categorized into density, support, and sample constraints. However, we show that each category has inherent limitations: density and sample constraints tend to be overly conservative in many scenarios, while the support constraint, though least restrictive, faces challenges in accurately modeling the behavior policy. To overcome these limitations, we propose a new neighborhood constraint that restricts action selection in the Bellman target to the union of neighborhoods of dataset actions. Theoretically, the constraint not only bounds extrapolation errors and distribution shift under certain conditions, but also approximates the support constraint without requiring behavior policy modeling. Moreover, it retains substantial flexibility and enables pointwise conservatism by adapting the neighborhood radius for each data point. In practice, we employ data quality as the adaptation criterion and design an adaptive neighborhood constraint. Building on an efficient bilevel optimization framework, we develop a simple yet effective algorithm, Adaptive Neighborhood-constrained Q learning (ANQ), to perform Q learning with target actions satisfying this constraint. Empirically, ANQ achieves state-of-the-art performance on standard offline RL benchmarks and exhibits strong robustness in scenarios with noisy or limited data.
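The union-of-neighborhoods constraint admits a compact membership test; the sketch below is a hypothetical helper (the function name and the way radii are supplied are illustrative, and the paper's adaptive radius criterion based on data quality is not reproduced):

```python
import numpy as np

def in_neighborhood(candidates, dataset_actions, radii):
    """Union-of-neighborhoods membership test: a candidate action is
    admissible if it lies within the (per-datapoint, adaptive) radius
    of at least one dataset action."""
    # pairwise distances: (num_candidates, num_dataset_actions)
    d = np.linalg.norm(candidates[:, None, :] - dataset_actions[None, :, :], axis=-1)
    return (d <= radii[None, :]).any(axis=1)
```

Shrinking a radius toward zero recovers a sample constraint at that data point, while growing it relaxes toward a support-style constraint, which mirrors the pointwise conservatism the abstract describes.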

AAAI Conference 2025 Conference Paper

CLIP-driven View-aware Prompt Learning for Unsupervised Vehicle Re-identification

  • Jiyang Xu
  • Qi Wang
  • Xin Xiong
  • Di Gai
  • Ruihua Zhou
  • Dong Wang

With the emergence of vision-language pre-trained models, such as CLIP, textual prompts have gradually been introduced into re-identification (Re-ID) tasks to obtain considerably robust multimodal information. However, most textual descriptions in vehicle Re-ID tasks only contain identity index words, without specific words to describe vehicle view information, making them difficult to apply widely to vehicle Re-ID tasks with view variations. This inspires us to propose a CLIP-driven view-aware prompt learning framework for unsupervised vehicle Re-ID. We first design a learnable textual prompt template called view-aware context optimization (ViewCoOp) based on dynamic multi-view word embeddings, which can fully obtain the proportion and position encoding of each view in the whole vehicle body region. Subsequently, a cross-modal mutual graph is constructed to explore both inter-modal and intra-modal connections. Each sample is treated as a graph node, with textual features extracted via ViewCoOp and visual features extracted from images. Moreover, the inter-cluster and intra-cluster correlations in the bimodal clustering results are leveraged to determine the connectivity between graph node pairs. Lastly, the proposed cross-modal mutual graph method utilizes supervised information from the bimodal gap to directly fine-tune the image encoder of CLIP for downstream unsupervised vehicle Re-ID tasks. Extensive experiments verify that the proposed method effectively obtains cross-modal description ability from multiple views.

JBHI Journal 2025 Journal Article

Edge-Guided Multi-Scale Frequency Attention Network for Gastrointestinal Cancer Image Segmentation

  • Zhiwen Liao
  • Qi Wang
  • Xinyi Tang
  • Han Wang
  • Jun Hu
  • Pengxiang Su
  • Evangelos K. Markakis
  • Peng Luo

Image segmentation is a critical technology for improving the accuracy of clinical decisions and treatments in computer-aided diagnostic systems. However, the diverse morphology and fuzzy boundaries of gastrointestinal tumors pose substantial challenges to existing segmentation models, leading to inaccurate feature capture and suboptimal results. To solve these problems, we design an edge-guided multi-scale frequency attention network for the gastrointestinal tumor segmentation task, termed EGMFA-Net, which consists of a Kernel Adaptive Enhancement Module (KAEM) and a Frequency-domain Self-attention Module (FDSA). Specifically, KAEM adaptively adjusts the feature extraction kernel based on the morphology of different lesion regions, enhancing the recognition of different morphology regions via a progressive optimization strategy of feature expression. Furthermore, FDSA effectively aggregates multi-scale features in the frequency domain to achieve global receptive fields while preserving more high-frequency details, thereby enhancing adaptability to complex pathological contexts. Extensive experiments on eight medical image benchmark datasets, including SEED, Kvasir, ClinicDB, ColonDB, ETIS, BKAI, CVC-300, and Synapse, show that EGMFA-Net attains state-of-the-art performance over existing methods. Our implementation is available at https://github.com/med-segment/egmfa-net.
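Frequency-domain global mixing of the kind FDSA builds on is a known pattern (learned spectral filters); a minimal sketch, with the paper's actual self-attention over spectra not reproduced and the filter `w` treated as a stand-in for learned weights:

```python
import numpy as np

def frequency_mix(x, w):
    """Global feature mixing in the frequency domain: FFT, elementwise
    spectral filter, inverse FFT. A global receptive field comes for free,
    since every output position depends on every input position."""
    X = np.fft.rfft2(x)           # real-input 2D FFT: shape (H, W//2 + 1)
    return np.fft.irfft2(X * w, s=x.shape)
```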

NeurIPS Conference 2025 Conference Paper

Gains: Fine-grained Federated Domain Adaptation in Open Set

  • Zhengyi Zhong
  • Wenzheng Jiang
  • Weidong Bao
  • Ji Wang
  • Qi Wang
  • Guanbo Wang
  • Yongheng Deng
  • Ju Ren

Conventional federated learning (FL) assumes a closed world with a fixed total number of clients. In contrast, new clients continuously join the FL process in real-world scenarios, introducing new knowledge. This raises two critical demands: detecting new knowledge, i.e., knowledge discovery, and integrating it into the global model, i.e., knowledge adaptation. Existing research focuses on coarse-grained knowledge discovery, and often sacrifices source domain performance and adaptation efficiency. To this end, we propose a fine-grained federated domain adaptation approach in open set (Gains). Gains splits the model into an encoder and a classifier, empirically revealing that features extracted by the encoder are sensitive to domain shifts while classifier parameters are sensitive to class increments. Based on this, we develop fine-grained knowledge discovery and contribution-driven aggregation techniques to identify and incorporate new knowledge. Additionally, an anti-forgetting mechanism is designed to preserve source domain performance, ensuring balanced adaptation. Experimental results on multi-domain datasets across three typical data-shift scenarios demonstrate that Gains significantly outperforms other baselines in performance for both source-domain and target-domain clients. Code is available at: https://github.com/Zhong-Zhengyi/Gains.

AAAI Conference 2025 Conference Paper

GTDE: Grouped Training with Decentralized Execution for Multi-agent Actor-Critic

  • Mengxian Li
  • Qi Wang
  • Yongjun Xu

The rapid advancement of multi-agent reinforcement learning (MARL) has given rise to diverse training paradigms for learning the policies of each agent in the multi-agent system. The paradigms of decentralized training and execution (DTDE) and centralized training with decentralized execution (CTDE) have been proposed and widely applied. However, as the number of agents increases, the inherent limitations of these frameworks significantly degrade performance metrics such as win rate and total reward. To reduce the influence of the increasing number of agents on these metrics, we propose a novel training paradigm, grouped training with decentralized execution (GTDE). This framework eliminates the need for a centralized module and relies solely on local information, effectively meeting the training requirements of large-scale multi-agent systems. Specifically, we first introduce an adaptive grouping module, which divides agents into different groups based on their observation history. To implement end-to-end training, GTDE uses Gumbel-Sigmoid for efficient point-to-point sampling on the grouping distribution while ensuring gradient backpropagation. To adapt to the uncertainty in the number of members in a group, two methods are used to implement a group information aggregation module that merges member information within the group. Empirical results show that in a cooperative environment with 495 agents, GTDE increased the total reward by an average of 382% compared to the baseline. In a competitive environment with 64 agents, GTDE achieved a 100% win rate against the baseline.
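Gumbel-Sigmoid sampling is a standard trick for (near-)differentiable binary decisions; a NumPy sketch of the sampling step (in an actual training loop the hard rounding would use a straight-through estimator so gradients flow through the soft value; that framework-specific part is omitted here):

```python
import numpy as np

def gumbel_sigmoid(logits, tau=1.0, rng=None, hard=True):
    """Sample (near-)binary group memberships from logits by adding
    logistic (difference-of-Gumbels) noise and squashing with a
    temperature-scaled sigmoid."""
    rng = rng if rng is not None else np.random.default_rng()
    u1 = rng.uniform(1e-9, 1.0, logits.shape)
    u2 = rng.uniform(1e-9, 1.0, logits.shape)
    g = -np.log(-np.log(u1)) + np.log(-np.log(u2))  # Gumbel(0,1) - Gumbel(0,1)
    y = 1.0 / (1.0 + np.exp(-(logits + g) / tau))   # soft membership in (0, 1)
    return (y > 0.5).astype(float) if hard else y
```

Lowering `tau` sharpens the soft samples toward 0/1, trading gradient smoothness for decisiveness of the grouping.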

NeurIPS Conference 2025 Conference Paper

H3D-DGS: Exploring Heterogeneous 3D Motion Representation for Deformable 3D Gaussian Splatting

  • Bing He
  • Yunuo Chen
  • Guo Lu
  • Qi Wang
  • Qunshan Gu
  • Rong Xie
  • Li Song
  • Wenjun Zhang

Dynamic scene reconstruction poses a persistent challenge in 3D vision. Deformable 3D Gaussian Splatting has emerged as an effective method for this task, offering real-time rendering and high visual fidelity. This approach decomposes a dynamic scene into a static representation in a canonical space and time-varying scene motion. Scene motion is defined as the collective movement of all Gaussian points, and for compactness, existing approaches commonly adopt implicit neural fields or sparse control points. However, these methods predominantly rely on gradient-based optimization for all motion information. Due to the high degree of freedom, they struggle to converge on real-world datasets exhibiting complex motion. To preserve the compactness of motion representation and address convergence challenges, this paper proposes heterogeneous 3D control points, termed H3D control points, whose attributes are obtained using a hybrid strategy combining optical flow back-projection and gradient-based methods. This design decouples directly observable motion components from those that are geometrically occluded. Specifically, components of 3D motion that project onto the image plane are directly acquired via optical flow back projection, while unobservable portions are refined through gradient-based optimization. Experiments on the Neu3DV and CMU-Panoptic datasets demonstrate that our method achieves superior performance over state-of-the-art deformable 3D Gaussian splatting techniques. Remarkably, our method converges within just 100 iterations and achieves a per-frame processing speed of 2 seconds on a single NVIDIA RTX 4070 GPU.

IJCAI Conference 2025 Conference Paper

Leveraging Pretrained Diffusion Models for Zero-Shot Part Assembly

  • Ruiyuan Zhang
  • Qi Wang
  • Jiaxiang Liu
  • Yuchi Huo
  • Chao Wu

3D part assembly aims to understand part relationships and predict their 6-DoF poses to construct realistic 3D shapes, addressing the growing demand for autonomous assembly, which is crucial for robots. Existing methods mainly estimate the transformation of each part by training neural networks under supervision, which requires a substantial quantity of manually labeled data. However, the high cost of data collection and the immense variability of real-world shapes and parts make traditional methods impractical for large-scale applications. In this paper, we propose the first zero-shot part assembly method, which utilizes pre-trained point cloud diffusion models as discriminators in the assembly process, guiding the manipulation of parts to form realistic shapes. Specifically, we theoretically demonstrate that utilizing a diffusion model for zero-shot part assembly can be transformed into an Iterative Closest Point (ICP) process. Then, we propose a novel pushing-away strategy to address overlapping parts, thereby further enhancing the robustness of the method. To verify our work, we conduct extensive experiments and quantitative comparisons against several strong baseline methods, demonstrating the effectiveness of the proposed approach, which even surpasses supervised learning methods. The code has been released at https://github.com/Ruiyuan-Zhang/Zero-Shot-Assembly.

IJCAI Conference 2025 Conference Paper

PeSANet: Physics-encoded Spectral Attention Network for Simulating PDE-Governed Complex Systems

  • Han Wan
  • Rui Zhang
  • Qi Wang
  • Yang Liu
  • Hao Sun

Accurately modeling and forecasting complex systems governed by partial differential equations (PDEs) is crucial in various scientific and engineering domains. However, traditional numerical methods struggle in real-world scenarios due to incomplete or unknown physical laws. Meanwhile, machine learning approaches often fail to generalize effectively when faced with scarce observational data and the challenge of capturing local and global features. To this end, we propose the Physics-encoded Spectral Attention Network (PeSANet), which integrates local and global information to forecast complex systems with limited data and incomplete physical priors. The model consists of two key components: a physics-encoded block that uses hard constraints to approximate local differential operators from limited data, and a spectral-enhanced block that captures long-range global dependencies in the frequency domain. Specifically, we introduce a novel spectral attention mechanism to model inter-spectrum relationships and learn long-range spatial features. Experimental results demonstrate that PeSANet outperforms existing methods across all metrics, particularly in long-term forecasting accuracy, providing a promising solution for simulating complex systems with limited data and incomplete physics.

JBHI Journal 2025 Journal Article

Re-Visible Dual-Domain Self-Supervised Deep Unfolding Network for MRI Reconstruction

  • Hao Zhang
  • Qi Wang
  • Jian Sun
  • Zhijie Wen
  • Jun Shi
  • Shihui Ying

Magnetic Resonance Imaging (MRI) is widely used in clinical practice, but suffers from prolonged acquisition time. Although deep learning methods have been proposed to accelerate acquisition and demonstrate promising performance, they rely on high-quality fully-sampled datasets for training in a supervised manner. However, such datasets are time-consuming and expensive to collect, which constrains their broader applications. On the other hand, self-supervised methods offer an alternative by enabling learning from under-sampled data alone, but most existing methods rely on further partitioned under-sampled k-space data as the model's input for training, which causes an input distribution shift between the training stage and the inference stage. Additionally, their models have not effectively incorporated comprehensive image priors, leading to degraded reconstruction performance. In this paper, we propose a novel re-visible dual-domain self-supervised deep unfolding network to address these issues when only under-sampled datasets are available. Specifically, by incorporating a re-visible dual-domain loss, all under-sampled k-space data are utilized during training to mitigate the input distribution shift caused by further partitioning. This design enables the model to implicitly adapt to all under-sampled k-space data as input. Additionally, we design a Deep Unfolding Network based on the Chambolle and Pock Proximal Point Algorithm (DUN-CP-PPA) to achieve end-to-end reconstruction. By employing a Spatial-Frequency Feature Extraction (SFFE) block to capture both global and local representations, the model effectively integrates imaging physics with comprehensive image priors to enhance reconstruction performance. Experiments on both single-coil and multi-coil datasets demonstrate that our method outperforms state-of-the-art approaches in terms of reconstruction performance and generalization capability.

NeurIPS Conference 2025 Conference Paper

RHYTHM: Reasoning with Hierarchical Temporal Tokenization for Human Mobility

  • Haoyu He
  • Haozheng Luo
  • Yan Chen
  • Qi Wang

Predicting human mobility is inherently challenging due to complex long-range dependencies and multi-scale periodic behaviors. To address this, we introduce RHYTHM (Reasoning with Hierarchical Temporal Tokenization for Human Mobility), a unified framework that leverages large language models (LLMs) as general-purpose spatio-temporal predictors and trajectory reasoners. Methodologically, RHYTHM employs temporal tokenization to partition each trajectory into daily segments and encode them as discrete tokens with hierarchical attention that captures both daily and weekly dependencies, thereby quadratically reducing the sequence length while preserving cyclical information. Additionally, we enrich token representations by adding pre-computed prompt embeddings for trajectory segments and prediction targets via a frozen LLM, and feeding these combined embeddings back into the LLM backbone to capture complex interdependencies. Computationally, RHYTHM keeps the pretrained LLM backbone frozen, yielding faster training and lower memory usage. We evaluate our model against state-of-the-art methods using three real-world datasets. Notably, RHYTHM achieves a 2.4% improvement in overall accuracy, a 5.0% increase on weekends, and a 24.6% reduction in training time. Code is publicly available at https://github.com/he-h/rhythm.

NeurIPS Conference 2025 Conference Paper

Selective Learning for Deep Time Series Forecasting

  • Yisong Fu
  • Zezhi Shao
  • Chengqing Yu
  • Yujie Li
  • Zhulin An
  • Qi Wang
  • Yongjun Xu
  • Fei Wang

Benefiting from high capacity for capturing complex temporal patterns, deep learning (DL) has significantly advanced time series forecasting (TSF). However, deep models tend to suffer from severe overfitting due to the inherent vulnerability of time series to noise and anomalies. The prevailing DL paradigm uniformly optimizes all timesteps through the MSE loss and learns those uncertain and anomalous timesteps without difference, ultimately resulting in overfitting. To address this, we propose a novel selective learning strategy for deep TSF. Specifically, selective learning screens a subset of the whole timesteps to calculate the MSE loss in optimization, guiding the model to focus on generalizable timesteps while disregarding non-generalizable ones. Our framework introduces a dual-mask mechanism to target timesteps: (1) an uncertainty mask leveraging residual entropy to filter uncertain timesteps, and (2) an anomaly mask employing residual lower bound estimation to exclude anomalous timesteps. Extensive experiments across eight real-world datasets demonstrate that selective learning can significantly improve the predictive performance for typical state-of-the-art deep models, including 37.4% MSE reduction for Informer, 8.4% for TimesNet, and 6.5% for iTransformer.

NeurIPS Conference 2025 Conference Paper

VideoRFT: Incentivizing Video Reasoning Capability in MLLMs via Reinforced Fine-Tuning

  • Qi Wang
  • Yanrui Yu
  • Ye Yuan
  • Rui Mao
  • Tianfei Zhou

Reinforcement fine-tuning (RFT) has shown great promise in achieving human-level reasoning capabilities of Large Language Models (LLMs), and has recently been extended to MLLMs. Nevertheless, reasoning about videos, which is a fundamental aspect of human intelligence, remains a persistent challenge due to the complex logic, temporal and causal structures inherent in video data. To fill this gap, we propose VideoRFT, a novel approach that extends the RFT paradigm to cultivate human-like video reasoning capabilities in MLLMs. VideoRFT follows the standard two-stage scheme in RFT: supervised fine-tuning (SFT) with chain-of-thought (CoT) annotations, followed by reinforcement learning (RL) to improve generalization. A central challenge to achieve this in the video domain lies in the scarcity of large-scale, high-quality video CoT datasets. We address this by building a multi-expert-driven, cognition-inspired CoT curation pipeline. First, we devise a cognition-inspired prompting strategy to elicit a reasoning LLM to generate preliminary CoTs based solely on rich, structured, and literal representations of video content. Subsequently, these CoTs are revised by an MLLM conditioned on the actual video, ensuring visual consistency and reducing visual hallucinations. This pipeline results in two new datasets, i.e., VideoRFT-CoT-102K for SFT and VideoRFT-RL-310K for RL. To further strengthen the RL phase, we introduce a novel semantic-consistency reward that explicitly promotes the alignment between textual reasoning and visual evidence. This reward encourages the model to produce coherent, context-aware reasoning outputs grounded in visual input. Extensive experiments show that VideoRFT achieves state-of-the-art performance on six video reasoning benchmarks.

AAAI Conference 2024 Conference Paper

Cross-Sentence Gloss Consistency for Continuous Sign Language Recognition

  • Qi Rao
  • Ke Sun
  • Xiaohan Wang
  • Qi Wang
  • Bang Zhang

Continuous sign language recognition (CSLR) aims to recognize gloss sequences from continuous sign videos. Recent works enhance gloss representation consistency by mining correlations between visual and contextual modules within individual sentences. However, there remain much richer correlations among glosses across different sentences. In this paper, we present a simple yet effective Cross-Sentence Gloss Consistency (CSGC) method, which enforces glosses belonging to the same category to be more consistent in representation than those belonging to different categories, across all training sentences. Specifically, in CSGC, a prototype is maintained for each gloss category and benefits gloss discrimination in a contrastive way. Thanks to the well-distinguished gloss prototypes, an auxiliary similarity classifier is devised to enhance the recognition clues, thus yielding more accurate results. Extensive experiments conducted on three CSLR datasets show that our proposed CSGC significantly boosts the performance of CSLR, surpassing existing state-of-the-art works by large margins (i.e., 1.6% on PHOENIX14, 2.4% on PHOENIX14-T, and 5.7% on CSL-Daily).

NeurIPS Conference 2024 Conference Paper

Doubly Mild Generalization for Offline Reinforcement Learning

  • Yixiu Mao
  • Qi Wang
  • Yun Qu
  • Yuhang Jiang
  • Xiangyang Ji

Offline Reinforcement Learning (RL) suffers from extrapolation error and value overestimation. From a generalization perspective, this issue can be attributed to the over-generalization of value functions or policies towards out-of-distribution (OOD) actions. Significant efforts have been devoted to mitigating such generalization, and recent in-sample learning approaches have further succeeded in entirely eschewing it. Nevertheless, we show that mild generalization beyond the dataset can be trusted and leveraged to improve performance under certain conditions. To appropriately exploit generalization in offline RL, we propose Doubly Mild Generalization (DMG), comprising (i) mild action generalization and (ii) mild generalization propagation. The former refers to selecting actions in a close neighborhood of the dataset to maximize the Q values. Even so, the potential erroneous generalization can still be propagated, accumulated, and exacerbated by bootstrapping. In light of this, the latter concept is introduced to mitigate the generalization propagation without impeding the propagation of RL learning signals. Theoretically, DMG guarantees better performance than the in-sample optimal policy in the oracle generalization scenario. Even under worst-case generalization, DMG can still control value overestimation at a certain level and lower bound the performance. Empirically, DMG achieves state-of-the-art performance across Gym-MuJoCo locomotion tasks and challenging AntMaze tasks. Moreover, benefiting from its flexibility in both generalization aspects, DMG enjoys a seamless transition from offline to online learning and attains strong online fine-tuning performance.

IJCAI Conference 2024 Conference Paper

Error-aware Sampling in Adaptive Shells for Neural Surface Reconstruction

  • Qi Wang
  • Yuchi Huo
  • Qi Ye
  • Rui Wang
  • Hujun Bao

Neural implicit surfaces with signed distance functions (SDFs) achieve superior quality in 3D geometry reconstruction. However, training SDFs is time-consuming because it requires a great number of samples to calculate accurate weight distributions and a considerable amount of samples drawn from the distribution for integrating the rendering results. Some existing sampling strategies focus on this problem, but during training they assume a spatially consistent convergence speed of the kernel size, and thus still suffer from slow convergence or errors. Instead, we introduce an error-aware sampling method based on thin intervals of valid weight distributions, dubbed adaptive shells, to reduce the number of samples while still maintaining the reconstruction accuracy. To this end, we first extend Laplace-based neural implicit surfaces with learned spatially-varying kernel sizes, which indicate the range of valid weight distributions. Then, the adaptive shell for each ray is determined by an efficient double-clipping strategy with spatially-varying SDF values and kernel sizes, fitting larger kernel sizes to wider shells. Finally, we calculate the error-bounded cumulative distribution functions (CDFs) of shells to conduct efficient importance sampling, achieving low-variance rendering with fewer calculations. Extensive results in various scenes demonstrate the superiority of our sampling technique, including significantly reducing sample counts and training time, and even improving the reconstruction quality. The code is available at https://github.com/erernan/ESampling.

IJCAI Conference 2024 Conference Paper

FLDM-VTON: Faithful Latent Diffusion Model for Virtual Try-on

  • Chenhui Wang
  • Tao Chen
  • Zhihao Chen
  • Zhizhong Huang
  • Taoran Jiang
  • Qi Wang
  • Hongming Shan

Despite their impressive generative performance, latent diffusion model-based virtual try-on (VTON) methods lack faithfulness to crucial details of the clothes, such as style, pattern, and text. To alleviate these issues caused by the diffusion stochastic nature and latent supervision, we propose a novel Faithful Latent Diffusion Model for VTON, termed FLDM-VTON. FLDM-VTON improves the conventional latent diffusion process in three major aspects. First, we propose incorporating warped clothes as both the starting point and local condition, supplying the model with faithful clothes priors. Second, we introduce a novel clothes flattening network to constrain generated try-on images, providing clothes-consistent faithful supervision. Third, we devise a clothes-posterior sampling for faithful inference, further enhancing the model performance over conventional clothes-agnostic Gaussian sampling. Extensive experimental results on the benchmark VITON-HD and Dress Code datasets demonstrate that our FLDM-VTON outperforms state-of-the-art baselines and is able to generate photo-realistic try-on images with faithful clothing details.

NeurIPS Conference 2024 Conference Paper

GO4Align: Group Optimization for Multi-Task Alignment

  • Jiayi Shen
  • Qi Wang
  • Zehao Xiao
  • Nanne van Noord
  • Marcel Worring

This paper proposes GO4Align, a multi-task optimization approach that tackles task imbalance by explicitly aligning the optimization across tasks. To achieve this, we design an adaptive group risk minimization strategy, comprising two techniques in implementation: (i) dynamical group assignment, which clusters similar tasks based on task interactions; (ii) risk-guided group indicators, which exploit consistent task correlations with risk information from previous iterations. Comprehensive experimental results on diverse benchmarks demonstrate our method's performance superiority with even lower computational costs.

JBHI Journal 2024 Journal Article

Improving Needle Tip Tracking and Detection in Ultrasound-Based Navigation System Using Deep Learning-Enabled Approach

  • Hui Che
  • Jiaxin Qin
  • Yao Chen
  • Zihan Ji
  • Yibo Yan
  • Jing Yang
  • Qi Wang
  • Chaofeng Liang

Ultrasound-guided percutaneous interventions have numerous advantages over traditional techniques. Accurate needle placement in the target anatomy is crucial for successful intervention, and reliable visual information is essential to achieve this. However, previous studies have revealed several challenges, such as the variability in needle echogenicity and the common misalignment of the ultrasound beam and the needle. Advanced techniques have been developed to optimize needle visualization, including hardware-based and image-processing-based methods. This paper proposes a novel strategy of integrating ultrasound-based deep learning approaches into an optical navigation system to enhance needle visualization and improve tip positioning accuracy. Both the tracking and detection algorithms are optimized utilizing optical tracking information. The information is introduced into the tracking network to define the search patch update strategy and form a trajectory reference to correct tracking results. In the detection network, the original image is processed according to the needle insertion position and current position given by the optical localization system to locate a coarse region, and the depth-score criterion is adopted to optimize detection results. Extensive experiments demonstrate that our approach achieves promising tip tracking and detection performance with tip localization errors of 1.11 $\pm$ 0.59 mm and 1.17 $\pm$ 0.70 mm, respectively. Moreover, we establish a paired dataset consisting of ultrasound images and their corresponding spatial tip coordinates acquired from the optical tracking system and conduct real puncture experiments to verify the effectiveness of the proposed methods. Our approach significantly improves needle visualization and provides physicians with visual guidance for posture adjustment.

TMLR Journal 2024 Journal Article

Large Language Models can be Guided to Evade AI-generated Text Detection

  • Ning Lu
  • Shengcai Liu
  • Rui He
  • Yew-Soon Ong
  • Qi Wang
  • Ke Tang

Large language models (LLMs) have shown remarkable performance in various tasks and have been extensively utilized by the public. However, the increasing concerns regarding the misuse of LLMs, such as plagiarism and spamming, have led to the development of multiple detectors, including fine-tuned classifiers and statistical methods. In this study, we equip LLMs with prompts, rather than relying on an external paraphraser, to evaluate the vulnerability of these detectors. We propose a novel Substitution-based In-Context example Optimization method (SICO) to automatically construct prompts for evading the detectors. SICO is cost-efficient as it requires only 40 human-written examples and a limited number of LLM inferences to generate a prompt. Moreover, once a task-specific prompt has been constructed, it can be universally used against a wide range of detectors. Extensive experiments across three real-world tasks demonstrate that SICO significantly outperforms the paraphraser baselines and enables GPT-3.5 to successfully evade six detectors, decreasing their AUC by 0.5 on average. Furthermore, a comprehensive human evaluation shows that the SICO-generated text achieves human-level readability and task completion rates, while preserving high imperceptibility. Finally, we propose an ensemble approach to enhance the robustness of detectors against the SICO attack.

NeurIPS Conference 2024 Conference Paper

Making Offline RL Online: Collaborative World Models for Offline Visual Reinforcement Learning

  • Qi Wang
  • Junming Yang
  • Yunbo Wang
  • Xin Jin
  • Wenjun Zeng
  • Xiaokang Yang

Training offline RL models using visual inputs poses two significant challenges, i.e., the overfitting problem in representation learning and the overestimation bias for expected future rewards. Recent work has attempted to alleviate the overestimation bias by encouraging conservative behaviors. This paper, in contrast, tries to build more flexible constraints for value estimation without impeding the exploration of potential advantages. The key idea is to leverage off-the-shelf RL simulators, which can be easily interacted with in an online manner, as the “test bed” for offline policies. To enable effective online-to-offline knowledge transfer, we introduce CoWorld, a model-based RL approach that mitigates cross-domain discrepancies in state and reward spaces. Experimental results demonstrate the effectiveness of CoWorld, outperforming existing RL approaches by large margins.

IJCAI Conference 2024 Conference Paper

MuChin: A Chinese Colloquial Description Benchmark for Evaluating Language Models in the Field of Music

  • Zihao Wang
  • Shuyu Li
  • Tao Zhang
  • Qi Wang
  • Pengfei Yu
  • Jinyang Luo
  • Yan Liu
  • Ming Xi

The rapidly evolving multimodal Large Language Models (LLMs) urgently require new benchmarks to uniformly evaluate their performance on understanding and textually describing music. However, due to semantic gaps between Music Information Retrieval (MIR) algorithms and human understanding, discrepancies between professionals and the public, and low precision of annotations, existing music description datasets cannot serve as benchmarks. To this end, we present MuChin, the first open-source music description benchmark in Chinese colloquial language, designed to evaluate the performance of multimodal LLMs in understanding and describing music. We established the Caichong Music Annotation Platform (CaiMAP) that employs an innovative multi-person, multi-stage assurance method, and recruited both amateurs and professionals to ensure the precision of annotations and alignment with popular semantics. Utilizing this method, we built a large-scale, private dataset with multi-dimensional, high-precision music annotations, the Caichong Music Dataset (CaiMD), and carefully selected 1,000 high-quality entries to serve as the test set for MuChin. Based on MuChin, we analyzed the discrepancies between professionals and amateurs in terms of music description, and empirically demonstrated the effectiveness of CaiMD for fine-tuning LLMs. Ultimately, we employed MuChin to evaluate existing music understanding models on their ability to provide colloquial descriptions of music.

NeurIPS Conference 2024 Conference Paper

Offline Reinforcement Learning with OOD State Correction and OOD Action Suppression

  • Yixiu Mao
  • Qi Wang
  • Chen Chen
  • Yun Qu
  • Xiangyang Ji

In offline reinforcement learning (RL), addressing the out-of-distribution (OOD) action issue has been a focus, but we argue that there exists an OOD state issue that also impairs performance yet has been underexplored. Such an issue describes the scenario when the agent encounters states out of the offline dataset during the test phase, leading to uncontrolled behavior and performance degradation. To this end, we propose SCAS, a simple yet effective approach that unifies OOD state correction and OOD action suppression in offline RL. Technically, SCAS achieves value-aware OOD state correction, capable of correcting the agent from OOD states to high-value in-distribution states. Theoretical and empirical results show that SCAS also exhibits the effect of suppressing OOD actions. On standard offline RL benchmarks, SCAS achieves excellent performance without additional hyperparameter tuning. Moreover, benefiting from its OOD state correction feature, SCAS demonstrates enhanced robustness against environmental perturbations.

NeurIPS Conference 2024 Conference Paper

P$^2$C$^2$Net: PDE-Preserved Coarse Correction Network for efficient prediction of spatiotemporal dynamics

  • Qi Wang
  • Pu Ren
  • Hao Zhou
  • Xin-Yang Liu
  • Zhiwen Deng
  • Yi Zhang
  • Ruizhi Chengze
  • Hongsheng Liu

When solving partial differential equations (PDEs), classical numerical methods often require fine mesh grids and small time stepping to meet stability, consistency, and convergence conditions, leading to high computational cost. Recently, machine learning has been increasingly utilized to solve PDE problems, but such methods often encounter challenges related to interpretability, generalizability, and strong dependency on rich labeled data. Hence, we introduce a new PDE-Preserved Coarse Correction Network (P$^2$C$^2$Net) to efficiently solve spatiotemporal PDE problems on coarse mesh grids in small data regimes. The model consists of two synergistic modules: (1) a trainable PDE block that learns to update the coarse solution (i.e., the system state), based on a high-order numerical scheme with boundary condition encoding, and (2) a neural network block that consistently corrects the solution on the fly. In particular, we propose a learnable symmetric Conv filter, with weights shared over the entire model, to accurately estimate the spatial derivatives of the PDE based on the neural-corrected system state. The resulting physics-encoded model is capable of handling limited training data (e.g., 3--5 trajectories) and accelerates the prediction of PDE solutions on coarse spatiotemporal grids while maintaining a high accuracy. P$^2$C$^2$Net achieves consistent state-of-the-art performance with over 50\% gain (e.g., in terms of relative prediction error) across four datasets covering complex reaction-diffusion processes and turbulent flows.

NeurIPS Conference 2024 Conference Paper

Resource-Aware Federated Self-Supervised Learning with Global Class Representations

  • Mingyi Li
  • Xiao Zhang
  • Qi Wang
  • Tengfei Liu
  • Ruofan Wu
  • Weiqiang Wang
  • Fuzhen Zhuang
  • Hui Xiong

Due to heterogeneous architectures and class skew, global representation models trained in resource-adaptive federated self-supervised learning face tricky challenges: $\textit{deviated representation abilities}$ and $\textit{inconsistent representation spaces}$. In this work, we are the first to propose a multi-teacher knowledge distillation framework, namely $\textit{FedMKD}$, to learn global representations with whole-class knowledge from heterogeneous clients even under extreme class skew. Firstly, an adaptive knowledge integration mechanism is designed to learn better representations from all heterogeneous models with deviated representation abilities. Then the weighted combination of the self-supervised loss and the distillation loss can support the global model to encode all classes from clients into a unified space. Besides, the global knowledge anchored alignment module can make the local representation spaces close to the global spaces, which further improves the representation abilities of local ones. Finally, extensive experiments conducted on two datasets demonstrate the effectiveness of $\textit{FedMKD}$, which outperforms state-of-the-art baselines by 4.78\% under linear evaluation on average.

IJCAI Conference 2024 Conference Paper

ScreenAgent: A Vision Language Model-driven Computer Control Agent

  • Runliang Niu
  • Jindong Li
  • Shiqi Wang
  • Yali Fu
  • Xiyu Hu
  • Xueyuan Leng
  • He Kong
  • Yi Chang

Large Language Models (LLMs) can invoke a variety of tools and APIs to complete complex tasks. The computer, as the most powerful and universal tool, could potentially be controlled by a trained LLM agent. Powered by the computer, we can hopefully build a more generalized agent to assist humans in various daily digital works. In this paper, we construct an environment for a Vision Language Model (VLM) agent to interact with a real computer screen. Within this environment, the agent can observe screenshots and manipulate the Graphical User Interface (GUI) by outputting mouse and keyboard actions. We also design an automated control pipeline that includes planning, acting, and reflecting phases, guiding the agent to continuously interact with the environment and complete multi-step tasks. Additionally, we construct the ScreenAgent Dataset, which collects screenshots and action sequences when completing daily computer tasks. Finally, we train a model, ScreenAgent, which achieves computer control capabilities comparable to GPT-4V and demonstrates more precise UI positioning capabilities. Our attempts could inspire further research on building a generalist LLM agent. The code and more detailed information are at https://github.com/niuzaisheng/ScreenAgent.

NeurIPS Conference 2024 Conference Paper

Theoretical Investigations and Practical Enhancements on Tail Task Risk Minimization in Meta Learning

  • Yiqin Lv
  • Qi Wang
  • Dong Liang
  • Zheng Xie

Meta learning is a promising paradigm in the era of large models and task distributional robustness has become an indispensable consideration in real-world scenarios. Recent advances have examined the effectiveness of tail task risk minimization in fast adaptation robustness improvement \citep{wang2023simple}. This work contributes to more theoretical investigations and practical enhancements in the field. Specifically, we reduce the distributionally robust strategy to a max-min optimization problem, constitute the Stackelberg equilibrium as the solution concept, and estimate the convergence rate. In the presence of tail risk, we further derive the generalization bound, establish connections with estimated quantiles, and practically improve the studied strategy. Accordingly, extensive evaluations demonstrate the significance of our proposal in boosting robustness.

NeurIPS Conference 2023 Conference Paper

A Simple Yet Effective Strategy to Robustify the Meta Learning Paradigm

  • Qi Wang
  • Yiqin Lv
  • Yanghe Feng
  • Zheng Xie
  • Jincai Huang

Meta learning is a promising paradigm to enable skill transfer across tasks. Most previous methods employ the empirical risk minimization principle in optimization. However, the resulting worst fast adaptation to a subset of tasks can be catastrophic in risk-sensitive scenarios. To robustify fast adaptation, this paper optimizes meta learning pipelines from a distributionally robust perspective and meta trains models with the measure of tail task risk. We take the two-stage strategy as heuristics to solve the robust meta learning problem, controlling the worst fast adaptation cases at a certain probabilistic level. Experimental results show that our simple method can improve the robustness of meta learning to task distributions and reduce the conditional expectation of the worst fast adaptation risk.

NeurIPS Conference 2023 Conference Paper

Episodic Multi-Task Learning with Heterogeneous Neural Processes

  • Jiayi Shen
  • Xiantong Zhen
  • Qi Wang
  • Marcel Worring

This paper focuses on the data-insufficiency problem in multi-task learning within an episodic training setup. Specifically, we explore the potential of heterogeneous information across tasks and meta-knowledge among episodes to effectively tackle each task with limited data. Existing meta-learning methods often fail to take advantage of crucial heterogeneous information in a single episode, while multi-task learning models neglect reusing experience from earlier episodes. To address the problem of insufficient data, we develop Heterogeneous Neural Processes (HNPs) for the episodic multi-task setup. Within the framework of hierarchical Bayes, HNPs effectively capitalize on prior experiences as meta-knowledge and capture task-relatedness among heterogeneous tasks, mitigating data-insufficiency. Meanwhile, transformer-structured inference modules are designed to enable efficient inferences toward meta-knowledge and task-relatedness. In this way, HNPs can learn more powerful functional priors for adapting to novel heterogeneous tasks in each meta-test episode. Experimental results show the superior performance of the proposed HNPs over typical baselines, and ablation studies verify the effectiveness of the designed inference modules.

IROS Conference 2023 Conference Paper

Hierarchical Attention Network for Planning-Informed Multi-Agent Trajectory Prediction

  • Wenyi Xiong
  • Jian Chen
  • Xinfang Zhang
  • Qi Wang
  • Ziheng Qi

The accurate prediction of neighboring vehicles' trajectories is critical to the safety of autonomous driving vehicles. However, it is challenging for existing methods to anticipate the trajectories of nearby vehicles due to the uncertainty of driving behaviors and the complex interaction patterns of traffic flows. In this study, incorporating the planning information of the ego vehicle, we propose a novel trajectory prediction approach based on a hierarchical attention mechanism. First, a spatio-temporal attention module is presented to extract the social interactions of surrounding vehicles and capture the temporal dependence of continuous-frame historical information and planning information. Then, a hard-soft attention module is designed to perform two tasks: weighing the importance of both historical and future information, and learning different location information about the target vehicles. Our method is evaluated on two national highway datasets. The experimental results show that our algorithm achieves state-of-the-art performance.

IJCAI Conference 2023 Conference Paper

Multi-level Graph Contrastive Prototypical Clustering

  • Yuchao Zhang
  • Yuan Yuan
  • Qi Wang

Recently, graph neural networks (GNNs) have drawn a surge of investigation in deep graph clustering. Nevertheless, existing approaches are predominantly semantic-agnostic, since GNNs exhibit inherent limitations in capturing global underlying semantic structures. Meanwhile, multiple objectives are imposed within one latent space, whereas representations of different granularities may conflict with each other, yielding severe performance degradation for clustering. To this end, we propose a novel Multi-Level Graph Contrastive Prototypical Clustering (MLG-CPC) framework for end-to-end clustering. Specifically, a Prototype Discrimination (ProDisc) objective function is proposed to explicitly capture semantic information via cluster assignments. Moreover, to alleviate the issue of conflicting objectives, we perceive representations of different granularities within separate feature-, prototype-, and cluster-level spaces via feature decorrelation, prototype contrast, and cluster-space consistency, respectively. Extensive experiments on four benchmarks demonstrate the superiority of the proposed MLG-CPC against state-of-the-art graph clustering approaches.

IJCAI Conference 2022 Conference Paper

A Speech-driven Sign Language Avatar Animation System for Hearing Impaired Applications

  • Li Hu
  • Jiahui Li
  • Jiashuo Zhang
  • Qi Wang
  • Bang Zhang
  • Ping Tan

Sign language is the communication language used in the hearing-impaired community. Recently, research on sign language production has made great progress but still needs to cope with some critical challenges. In this paper, we propose a system-level scheme and push forward the implementation of sign language production for practical usage. We build a system capable of translating speech into a sign language avatar animation. Different from previous approaches focusing on a single technology, we systematically combine algorithms for language translation, body gesture animation, and facial avatar generation. We also develop two applications, a Sign Language Interpretation APP and a Virtual Sign Language Anchor, to facilitate easy and clear communication for hearing-impaired people.

IJCAI Conference 2022 Conference Paper

AttExplainer: Explain Transformer via Attention by Reinforcement Learning

  • Runliang Niu
  • Zhepei Wei
  • Yan Wang
  • Qi Wang

Transformer and its variants, built on attention mechanisms, have recently achieved remarkable performance in many NLP tasks. Most existing works on Transformer explanation tend to reveal and utilize the attention matrix with human subjective intuitions in a qualitative manner. However, the huge dimensionality of the attention matrix directly challenges these methods to analyze it quantitatively. Therefore, in this paper, we propose a novel reinforcement learning (RL) based framework for Transformer explanation via the attention matrix, namely AttExplainer. The RL agent learns to perform step-by-step masking operations by observing the change in attention matrices. We have adapted our method to two scenarios: perturbation-based model explanation and text adversarial attack. Experiments on three widely used text classification benchmarks validate the effectiveness of the proposed method compared to state-of-the-art baselines. Additional studies show that our method is highly transferable and consistent with human intuition. The code of this paper is available at https://github.com/niuzaisheng/AttExplainer.

NeurIPS Conference 2022 Conference Paper

Learning Expressive Meta-Representations with Mixture of Expert Neural Processes

  • Qi Wang
  • Herke van Hoof

Neural processes (NPs) formulate exchangeable stochastic processes and are promising models for meta learning that do not require gradient updates during the testing phase. However, most NP variants place a strong emphasis on a global latent variable. This weakens the approximation power and restricts the scope of applications using NP variants, especially when data generative processes are complicated. To resolve these issues, we propose to combine the Mixture of Expert models with Neural Processes to develop more expressive exchangeable stochastic processes, referred to as Mixture of Expert Neural Processes (MoE-NPs). Then we apply MoE-NPs to both few-shot supervised learning and meta reinforcement learning tasks. Empirical results demonstrate MoE-NPs' strong generalization capability to unseen tasks in these benchmarks.
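The core idea of combining a mixture-of-experts with a predictive model is a gate that softly weights expert outputs per input. As a minimal, generic sketch (not the MoE-NP architecture itself; the function and variable names are hypothetical), a softmax-gated mixture prediction looks like:

```python
import numpy as np

def moe_predict(x, experts, gate_weights):
    """Mixture-of-experts prediction: a softmax gate, conditioned on the
    input, mixes the outputs of several expert predictors."""
    logits = gate_weights @ x                    # per-expert gate logits
    g = np.exp(logits - logits.max())
    g = g / g.sum()                              # softmax gate weights
    outputs = np.array([e(x) for e in experts])  # each expert's prediction
    return g @ outputs

# Two toy experts; the gate matrix strongly prefers expert 0 for this input.
experts = [lambda x: x.sum(), lambda x: 2.0 * x.sum()]
W = np.array([[10.0, 10.0], [-10.0, -10.0]])
pred = moe_predict(np.array([1.0, 1.0]), experts, W)
```

In MoE-NPs the "experts" would be latent-variable branches of the neural process rather than fixed callables, but the gating principle is the same.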

JBHI Journal 2022 Journal Article

MRI Generated From CT for Acute Ischemic Stroke Combining Radiomics and Generative Adversarial Networks

  • Eryan Feng
  • Pinle Qin
  • Rui Chai
  • Jianchao Zeng
  • Qi Wang
  • Yanfeng Meng
  • Peng Wang

Compared to computed tomography (CT), magnetic resonance imaging (MRI) is more sensitive to acute ischemic stroke lesions. However, MRI is time-consuming, expensive, and susceptible to interference from metal implants. Generating MRI images from CT images can address the limitations of MRI. The key problem in this process is obtaining lesion information from CT. In this study, we propose a cross-modal image generation algorithm from CT to MRI for acute ischemic stroke by combining radiomics with generative adversarial networks. First, the lesion candidate region was obtained using radiomics, the radiomic features of the region were extracted, and the feature with the largest information gain was selected and visualized as a feature map. Then, the concatenation of the extracted feature map and the CT image was fed into the generator. We added a residual module after the downsampling stage of the generator, following the general shape of U-Net, which can deepen the network without causing degradation problems. In addition, we introduced a lesion feature similarity loss function to focus the model on the similarity of the lesion. Through the subjective judgment of two experienced radiologists and using evaluation metrics, the results showed that the generated MRI images were very similar to the real MRI images. Moreover, the locations of the lesions were correct, and the shapes of the lesions were similar to those of the real lesions, which can help doctors with timely diagnosis and treatment.

JBHI Journal 2021 Journal Article

iPhantom: A Framework for Automated Creation of Individualized Computational Phantoms and Its Application to CT Organ Dosimetry

  • Wanyi Fu
  • Shobhit Sharma
  • Ehsan Abadi
  • Alexandros-Stavros Iliopoulos
  • Qi Wang
  • Joseph Y. Lo
  • Xiaobai Sun
  • William P. Segars

Objective: This study aims to develop and validate a novel framework, iPhantom, for automated creation of patient-specific phantoms or “digital twins” (DT) using patient medical images. The framework is applied to assess radiation dose to radiosensitive organs in CT imaging of individual patients. Method: Given a volume of patient CT images, iPhantom segments selected anchor organs and structures (e.g., liver, bones, pancreas) using a learning-based model developed for multi-organ CT segmentation. Organs which are challenging to segment (e.g., intestines) are incorporated from a matched phantom template, using a diffeomorphic registration model developed for multi-organ phantom voxels. The resulting digital-twin phantoms are used to assess organ doses during routine CT exams. Results: iPhantom was validated both on a set of XCAT digital phantoms (n = 50) and on an independent clinical dataset (n = 10) with similar accuracy. iPhantom precisely predicted all organ locations, yielding Dice Similarity Coefficients (DSC) of 0.6-1.0 for anchor organs and 0.3-0.9 for all other organs. iPhantom showed <10% errors in estimated radiation dose for the majority of organs, notably superior to the state-of-the-art baseline method (20-35% dose errors). Conclusion: iPhantom enables automated and accurate creation of patient-specific phantoms and, for the first time, provides sufficient and automated patient-specific dose estimates for CT dosimetry. Significance: The new framework brings the creation and application of computational human phantoms (CHPs) to the level of individual CHPs through automation, achieving wide and precise organ localization and paving the way for clinical monitoring, personalized optimization, and large-scale research.

NeurIPS Conference 2020 Conference Paper

Unsupervised Semantic Aggregation and Deformable Template Matching for Semi-Supervised Learning

  • Tao Han
  • Junyu Gao
  • Yuan Yuan
  • Qi Wang

Unlabeled data learning has attracted considerable attention recently. However, it is still elusive to extract the expected high-level semantic feature with mere unsupervised learning. In the meantime, semi-supervised learning (SSL) demonstrates a promising future in leveraging few samples. In this paper, we combine both to propose an Unsupervised Semantic Aggregation and Deformable Template Matching (USADTM) framework for SSL, which strives to improve the classification performance with few labeled data and then reduce the cost in data annotating. Specifically, unsupervised semantic aggregation based on Triplet Mutual Information (T-MI) loss is explored to generate semantic labels for unlabeled data. Then the semantic labels are aligned to the actual class by the supervision of labeled data. Furthermore, a feature pool that stores the labeled samples is dynamically updated to assign proxy labels for unlabeled data, which are used as targets for cross-entropy minimization. Extensive experiments and analysis across four standard semi-supervised learning benchmarks validate that USADTM achieves top performance (e.g., 90.46% accuracy on CIFAR-10 with 40 labels and 95.20% accuracy with 250 labels). The code is released at https://github.com/taohan10200/USADTM.
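The proxy-labeling step above can be pictured as nearest-prototype assignment against the labeled feature pool. The following is a simplified stand-in sketch (nearest class centroid under Euclidean distance), not USADTM's actual assignment rule; all names are hypothetical:

```python
import numpy as np

def assign_proxy_labels(unlabeled_feats, pool_feats, pool_labels):
    """Assign each unlabeled feature the label of its nearest class
    centroid, where centroids are computed from the labeled feature pool."""
    classes = np.unique(pool_labels)
    centroids = np.stack([pool_feats[pool_labels == c].mean(axis=0)
                          for c in classes])
    # Euclidean distance of every unlabeled feature to every centroid.
    d = np.linalg.norm(unlabeled_feats[:, None, :] - centroids[None], axis=-1)
    return classes[d.argmin(axis=1)]

pool = np.array([[0., 0.], [0., 1.], [5., 5.], [6., 5.]])
labels = np.array([0, 0, 1, 1])
proxy = assign_proxy_labels(np.array([[0.2, 0.4], [5.5, 4.9]]), pool, labels)
```

The proxy labels returned this way would then serve as cross-entropy targets for the unlabeled batch, as the abstract describes.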

AAAI Conference 2020 Conference Paper

Weakly-Supervised Video Moment Retrieval via Semantic Completion Network

  • Zhijie Lin
  • Zhou Zhao
  • Zhu Zhang
  • Qi Wang
  • Huasheng Liu

Video moment retrieval is to search the moment that is most relevant to the given natural language query. Existing methods are mostly trained in a fully-supervised setting, which requires full annotations of the temporal boundary for each query. However, manually labeling the annotations is actually time-consuming and expensive. In this paper, we propose a novel weakly-supervised moment retrieval framework requiring only coarse video-level annotations for training. Specifically, we devise a proposal generation module that aggregates the context information to generate and score all candidate proposals in one single pass. We then devise an algorithm that considers both exploitation and exploration to select top-K proposals. Next, we build a semantic completion module to measure the semantic similarity between the selected proposals and the query, compute reward, and provide feedback to the proposal generation module for scoring refinement. Experiments on ActivityNet Captions and Charades-STA demonstrate the effectiveness of our proposed method.

AAAI Conference 2019 Conference Paper

ACM: Adaptive Cross-Modal Graph Convolutional Neural Networks for RGB-D Scene Recognition

  • Yuan Yuan
  • Zhitong Xiong
  • Qi Wang

RGB image classification has achieved significant performance improvement with the resurgence of deep convolutional neural networks. However, mono-modal deep models for RGB images still have several limitations when applied to RGB-D scene recognition. 1) Images for scene classification usually contain more than one typical object with flexible spatial distribution, so object-level local features should also be considered in addition to the global scene representation. 2) Multi-modal features in RGB-D scene classification are still under-utilized. Simply combining these modal-specific features suffers from the semantic gaps between different modalities. 3) Most existing methods neglect the complex relationships among multiple modality features. Considering these limitations, this paper proposes an adaptive cross-modal (ACM) feature learning framework based on graph convolutional neural networks for RGB-D scene recognition. In order to make better use of the modal-specific cues, this approach mines the intra-modality relationships among the selected local features from one modality. To leverage the multi-modal knowledge more effectively, the proposed approach models the inter-modality relationships between two modalities through the cross-modal graph (CMG). We evaluate the proposed method on two public RGB-D scene classification datasets, SUN-RGBD and NYUD V2, and the proposed method achieves state-of-the-art performance.

AAAI Conference 2019 Conference Paper

Memory-Augmented Temporal Dynamic Learning for Action Recognition

  • Yuan Yuan
  • Dong Wang
  • Qi Wang

Human actions captured in video sequences contain two crucial factors for action recognition, i.e., visual appearance and motion dynamics. To model these two aspects, Convolutional and Recurrent Neural Networks (CNNs and RNNs) are adopted in most existing successful methods for recognizing actions. However, CNN-based methods are limited in modeling long-term motion dynamics. RNNs are able to learn temporal motion dynamics but lack effective ways to tackle unsteady dynamics in long-duration motion. In this work, we propose a memory-augmented temporal dynamic learning network, which learns to write the most evident information into an external memory module and ignore irrelevant information. In particular, we present a differentiable memory controller to make a discrete decision on whether the external memory module should be updated with the current feature. The discrete memory controller takes the memory history, context embedding, and current feature as inputs and controls information flow into the external memory module. Additionally, we train this discrete memory controller using the straight-through estimator. We evaluate this end-to-end system on benchmark datasets (UCF101 and HMDB51) of human action recognition. The experimental results show consistent improvements on both datasets over prior works and our baselines.
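The straight-through estimator mentioned above is a standard trick for backpropagating through a hard, non-differentiable decision: threshold in the forward pass, pretend the threshold was the identity in the backward pass. A minimal numpy sketch of the idea (illustrative only, not the paper's controller; function names are hypothetical):

```python
import numpy as np

def ste_forward(logit):
    """Hard binary write/skip decision: 1.0 writes to memory, 0.0 skips."""
    return (logit > 0.0).astype(float)

def ste_backward(grad_out):
    """Straight-through estimator: treat the threshold as the identity,
    so the upstream gradient passes through to the logits unchanged."""
    return grad_out

decision = ste_forward(np.array([-0.3, 1.2]))  # hard 0/1 decisions
grad = ste_backward(np.array([0.5, -0.1]))     # gradient w.r.t. the logits
```

In an autograd framework the same effect is usually obtained by combining the hard value with a detached copy of the soft value, so the forward pass is discrete while the backward pass sees a smooth function.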

AAAI Conference 2018 Conference Paper

Inferring Emotion from Conversational Voice Data: A Semi-Supervised Multi-Path Generative Neural Network Approach

  • Suping Zhou
  • Jia Jia
  • Qi Wang
  • Yufei Dong
  • Yufeng Yin
  • Kehua Lei

To give a more humanized response in Voice Dialogue Applications (VDAs), inferring emotion states from users' queries may play an important role. However, in VDAs, we have a tremendous number of users and a massive scale of unlabeled data with high-dimensional features from multimodal information, which challenge traditional speech emotion recognition methods. In this paper, to better infer emotion from conversational voice data, we propose a semi-supervised multi-path generative neural network. Specifically, we first build a novel supervised multi-path deep neural network framework. To avoid high-dimensional input, raw features are trained by groups in local classifiers. Then the high-level features of each local classifier are concatenated as input to a global classifier. These two kinds of classifiers are trained simultaneously through a single objective function to achieve more effective and discriminative emotion inference. To further solve the labeled-data-scarcity problem, we extend the multi-path deep neural network to a generative model based on a semi-supervised variational autoencoder (semi-VAE), which is able to train on the labeled and unlabeled data simultaneously. Experiments based on a 24,000-sample real-world dataset collected from Sogou Voice Assistant (SVAD13) and the benchmark dataset IEMOCAP show that our method significantly outperforms the existing state-of-the-art results.

IJCAI Conference 2018 Conference Paper

Nonrigid Points Alignment with Soft-weighted Selection

  • Xuelong Li
  • Jian Yang
  • Qi Wang

Point set registration (PSR) is a crucial problem in computer vision and pattern recognition. Existing PSR methods cannot align point sets robustly due to degradations, such as deformation, noise, occlusion, outlier, and multi-view changes. In this paper, we present a self-selected regularized Gaussian fields criterion for nonrigid point matching. Unlike most existing methods, we formulate the registration problem as a sparse approximation task with low rank constraint in reproducing kernel Hilbert space (RKHS). A self-selected mechanism is used to dynamically assign real-valued label for each point in an accuracy-aware weighting manner, which makes the model focus more on the reliable points in position. Based on the label, an equivalent matching number optimization is embedded into the non-rigid criterion to enhance the reliability of the approximation. Experimental results show that the proposed method can achieve a better result in both registration accuracy and correct matches compared to state-of-the-art approaches.

AAAI Conference 2017 Conference Paper

A Multiview-Based Parameter Free Framework for Group Detection

  • Xuelong Li
  • Mulin Chen
  • Feiping Nie
  • Qi Wang

Group detection is fundamentally important for analyzing crowd behaviors, and has attracted plenty of attention in artificial intelligence. However, existing works mostly have limitations due to insufficient utilization of crowd properties and arbitrary processing of individuals. In this paper, we propose the Multiview-based Parameter Free (MPF) approach to detect groups in crowd scenes. The main contributions made in this study are threefold: (1) a new structural context descriptor is designed to characterize the structural property of individuals in crowd motions; (2) a self-weighted multiview clustering method is proposed to cluster feature points by incorporating their motion and context similarities; (3) a novel framework is introduced for group detection, which is able to determine the group number automatically without any parameter or threshold to be tuned. Extensive experiments on various real-world datasets demonstrate the effectiveness of the proposed approach, and show its superiority against state-of-the-art group detection techniques.

IJCAI Conference 2017 Conference Paper

Convolutional 2D LDA for Nonlinear Dimensionality Reduction

  • Qi Wang
  • Zequn Qin
  • Feiping Nie
  • Yuan Yuan

Representing high-volume and high-order data is an essential problem, especially in the machine learning field. Although existing two-dimensional (2D) discriminant analysis achieves promising performance, its single, linear projection features make it difficult to analyze more complex data. In this paper, we propose a novel convolutional two-dimensional linear discriminant analysis (2D LDA) method for data representation. In order to deal with nonlinear data, a specially designed Convolutional Neural Network (CNN) is presented, which can be proved to have an objective function equivalent to that of common 2D LDA. In this way, the discriminant ability can benefit from not only the nonlinearity of Convolutional Neural Networks, but also the powerful learning process. Experimental results on several datasets show that the proposed method performs better than other state-of-the-art methods in terms of classification accuracy.

IJCAI Conference 2017 Conference Paper

Locality Adaptive Discriminant Analysis

  • Xuelong Li
  • Mulin Chen
  • Feiping Nie
  • Qi Wang

Linear Discriminant Analysis (LDA) is a popular technique for supervised dimensionality reduction, and its performance is satisfying when dealing with Gaussian distributed data. However, the neglect of local data structure makes LDA inapplicable to many real-world situations. So some works focus on the discriminant analysis between neighbor points, which can be easily affected by the noise in the original data space. In this paper, we propose a new supervised dimensionality reduction method, Locality Adaptive Discriminant Analysis (LADA), to learn a representative subspace of the data. Compared to LDA and its variants, the proposed method has three salient advantages: (1) it finds the principal projection directions without imposing any assumption on the data distribution; (2) it is able to exploit the local manifold structure of data in the desired subspace; (3) it exploits the points' neighbor relationships automatically without introducing any additional parameter to be tuned. Performance on synthetic datasets and real-world benchmark datasets demonstrates the superiority of the proposed method.
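The classic LDA baseline that LADA builds on reduces, in the two-class case, to Fisher's criterion: the projection direction is proportional to the inverse within-class scatter times the difference of class means. A small self-contained sketch of that baseline (not LADA itself):

```python
import numpy as np

def lda_direction(X, y):
    """Fisher's two-class LDA: w is proportional to Sw^{-1} (mu1 - mu0),
    maximizing between-class scatter relative to within-class scatter."""
    X0, X1 = X[y == 0], X[y == 1]
    mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
    # Within-class scatter as the sum of per-class covariance matrices.
    Sw = np.cov(X0, rowvar=False) + np.cov(X1, rowvar=False)
    w = np.linalg.solve(Sw, mu1 - mu0)
    return w / np.linalg.norm(w)

# Two well-separated 2D classes along the diagonal.
X = np.array([[0., 0.], [1., 1.], [0., 1.], [4., 4.], [5., 5.], [4., 5.]])
y = np.array([0, 0, 0, 1, 1, 1])
w = lda_direction(X, y)
```

LADA's departure from this baseline, per the abstract, is that the scatter is reweighted adaptively by local neighbor relationships rather than computed globally as above.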

AAAI Conference 2017 Conference Paper

Quantifying and Detecting Collective Motion by Manifold Learning

  • Qi Wang
  • Mulin Chen
  • Xuelong Li

The analysis of collective motion has attracted many researchers in artificial intelligence. Though plenty of works have been done on this topic, the achieved performance is still unsatisfying due to the complex nature of collective motions. By investigating the similarity of individuals, this paper proposes a novel framework for both quantifying and detecting collective motions. Our main contributions are threefold: (1) the time-varying dynamics of individuals are deeply investigated to better characterize the individual motion; (2) a structure-based collectiveness measurement is designed to precisely quantify both individual-level and scene-level properties of collective motions; (3) a multi-stage clustering strategy is presented to discover a more comprehensive understanding of the crowd scenes, containing both local and global collective motions. Extensive experimental results on real world data sets show that our method is capable of handling crowd scenes with complicated structures and various dynamics, and demonstrate its superior performance against state-of-the-art competitors.

IROS Conference 2005 Conference Paper

The Pantograph Mk-II: a haptic instrument

  • Gianni Campion
  • Qi Wang
  • Vincent Hayward

We describe the redesign and the performance evaluation of a high-performance haptic device system called the Pantograph. The device is based on a two degree-of-freedom parallel mechanism which was designed for optimized dynamic performance, but which also is well kinematically conditioned. The results show that the system is capable of producing accurate tactile signals in the DC-400 Hz range and can resolve displacements of the order of 10 μm. Future improvements are discussed.

IROS Conference 2002 Conference Paper

A prototype virtual haptic bronchoscope

  • Qi Wang
  • Yongsheng Ou
  • Yangsheng Xu

In this paper, we describe the design of the hardware and software for a virtual bronchoscope with force feedback. A haptic interface allows surgeons to feel the reaction force of virtual pneumonic surgery as if they were touching the area directly. We present novel algorithms for haptic force rendering, and examine its ability to display force. The rendering algorithms have been interfaced with a force-reflecting device. This virtual haptic bronchoscope is of significance in training inexperienced doctors in pneumonic diagnosis and surgery.

ICRA Conference 2000 Conference Paper

On Tracking Control of Mobile Manipulators

  • Wenjie Dong
  • Yangsheng Xu
  • Qi Wang

This paper studies the tracking control problem of mobile manipulators with consideration of the interaction between the mobile platform and the manipulator. A global tracking controller is proposed based on the dynamics of the defined tracking error and the extended Barbalat's lemma. The proposed controller ensures that the full state of the system asymptotically tracks the given desired trajectory globally in the presence of the system coupling. Extensive simulations presented in the paper show the effectiveness of the proposed approach.

ICRA Conference 1998 Conference Paper

Towards Real-Time Robot Programming by Human Demonstration for 6D Force Controlled Actions

  • Qi Wang
  • Joris De Schutter

An approach for real-time robot programming by human demonstration for 6D force controlled actions is presented. A human operator utilises a joystick to guide a robot with a force sensor to execute a task including continuous contact between a manipulated object and an unmodelled environment. During the demonstration, the position, velocity and force of the manipulated object as well as the human commands via the joystick are recorded. In real-time, the recorded information is translated into a textual robot program providing more robust execution in the presence of uncertainties. This approach has three main features: (1) online control type adjustment; (2) automatic subtask termination; (3) real-time program generation. Experiments show the potential industrial applicability.

IROS Conference 1996 Conference Paper

An environment for compliant motion programming by human demonstration

  • Sean Graves
  • Qi Wang
  • Wim Witvrouw
  • Joris De Schutter

An integrated system for programming by demonstration, visualizing, and executing compliant motion programs is described. A human operator utilises a joystick to guide a robot with a force sensor to do a task including continuous contact between manipulator and environment. The demonstration may be executed either on an actual robot, or in a graphically simulated environment. During the demonstration, the position, velocity and force of the object manipulated are acquired. Then the recorded data are processed, analysed, and translated into a textual robot program, which provides more robust execution in the presence of uncertainties. The system is composed of a model-based reaction force simulator, a visualization package, a rule-based translator, and an interpreter for compliant motion programs. Experiments show the industrial applicability.

ICRA Conference 1996 Conference Paper

Derivation of compliant motion programs based on human demonstration

  • Qi Wang
  • Joris De Schutter
  • Wim Witvrouw
  • Sean Graves

An approach to force controlled robot programming by human demonstration is presented. A human operator utilises a joystick to guide a robot with a force sensor to do a task including continuous contact between a manipulated object and an un-modelled environment. During the demonstration, the position, velocity and force of the object manipulated are acquired. Then the recorded data are processed, analysed, and translated into a textual robot program, which provides more robust execution in the presence of uncertainties. This approach consists of three key techniques-data processing, subtask segmentation and termination condition identification. A software package is developed to generate the programs automatically. Experiments show the industrial applicability.