Arrow Research search

Author name cluster

Qi Wang

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

87 papers
2 author rows

Possible papers

87

AAAI Conference 2026 Conference Paper

AbductiveMLLM: Boosting Visual Abductive Reasoning Within MLLMs

  • Boyu Chang
  • Qi Wang
  • Xi Guo
  • Zhixiong Nan
  • Yazhou Yao
  • Tianfei Zhou

Visual abductive reasoning (VAR) is a challenging task that requires AI systems to infer the most likely explanation for incomplete visual observations. While recent MLLMs have developed strong general-purpose multimodal reasoning capabilities, they still fall short of humans in abductive inference. To bridge this gap, we draw inspiration from the interplay between verbal and pictorial abduction in human cognition, and propose to strengthen abduction of MLLMs by mimicking such dual-mode behavior. Concretely, we introduce AbductiveMLLM, comprising two synergistic components: REASONER and IMAGINER. The REASONER operates in the verbal domain. It first explores a broad space of possible explanations using a blind LLM and then prunes visually incongruent hypotheses based on cross-modal causal alignment. The remaining hypotheses are introduced into the MLLM as targeted priors, steering its reasoning toward causally coherent explanations. The IMAGINER, on the other hand, further guides MLLMs by emulating human-like pictorial thinking. It conditions a text-to-image diffusion model on both the input video and the REASONER’s output embeddings to “imagine” plausible visual scenes that correspond to the verbal explanations, thereby enriching MLLMs' contextual grounding. The two components are trained jointly in an end-to-end manner. Experiments on standard VAR benchmarks show that AbductiveMLLM achieves state-of-the-art performance, consistently outperforming traditional solutions and advanced MLLMs.

AAAI Conference 2026 Conference Paper

Beyond Retraining: Training-Free Unknown Class Filtering for Source-Free Open Set Domain Adaptation of Vision–Language Models

  • Yongguang Li
  • Jindong Li
  • Qi Wang
  • QianLi Xing
  • Runliang Niu
  • Shengsheng Wang
  • Menglin Yang

Vision-language models (VLMs) have gained widespread attention for their strong zero-shot capabilities across numerous downstream tasks. However, these models assume that each test image’s class label is drawn from a predefined label set and lack a reliable mechanism to reject samples from emerging unknown classes when only unlabeled data are available. To address this gap, open-set domain adaptation methods retrain models to push potential unknowns away from known clusters. Yet, some unknown samples remain stably anchored to specific known classes in the VLM feature space due to semantic relevance, a phenomenon termed Semantic Affinity Anchoring (SAA). Forcibly repelling these samples unavoidably distorts the native geometry of VLMs and degrades performance. Meanwhile, existing score-based unknown detectors use simplistic thresholds and suffer from threshold sensitivity, resulting in sub-optimal performance. To address the aforementioned issues, we propose VLM-OpenXpert, which comprises two training-free, plug-and-play inference modules. SUFF performs SVD on high-confidence unknowns to extract a low-rank "unknown subspace". Each sample’s projection onto this subspace is weighted and softly removed from its feature, suppressing unknown components while preserving semantics. BGAT corrects score skewness via a Box–Cox transform, then fits a bimodal Gaussian mixture to adaptively estimate the optimal threshold balancing known-class recognition and unknown-class rejection. Experiments on nine benchmarks and three backbones (CLIP, SigLIP, ALIGN) under Source-Free OSDA settings show that our training-free pipeline matches or outperforms retraining-heavy state-of-the-art methods, establishing a powerful lightweight inference calibration paradigm for open-set VLM deployment.
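The abstract names BGAT's recipe (Box–Cox skew correction, then a bimodal Gaussian mixture to place the threshold) without implementation details. The sketch below is an illustrative reconstruction of that generic recipe only; the function names, the fixed lambda of 0.5, and the minimal EM loop are all assumptions, not the paper's code:

```python
import numpy as np

def boxcox(x, lam):
    """Box-Cox transform to correct score skewness (x must be positive)."""
    return np.log(x) if lam == 0 else (x ** lam - 1.0) / lam

def fit_bimodal_gmm_1d(s, iters=200):
    """Minimal EM for a two-component 1D Gaussian mixture.
    Returns (weights, means, stds)."""
    mu = np.percentile(s, [25, 75]).astype(float)   # rough init: one mode per tail
    sd = np.full(2, s.std() + 1e-8)
    w = np.array([0.5, 0.5])
    for _ in range(iters):
        # E-step: responsibility of each component for each sample
        p = w * np.exp(-0.5 * ((s[:, None] - mu) / sd) ** 2) / sd
        r = p / p.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means, and stds
        n = r.sum(axis=0)
        w = n / len(s)
        mu = (r * s[:, None]).sum(axis=0) / n
        sd = np.sqrt((r * (s[:, None] - mu) ** 2).sum(axis=0) / n) + 1e-8
    return w, mu, sd

def adaptive_threshold(scores, lam=0.5):
    """Skew-correct the scores, fit two Gaussians (known vs. unknown),
    and place the threshold where the two component densities cross."""
    t = boxcox(np.asarray(scores, dtype=float), lam)
    w, mu, sd = fit_bimodal_gmm_1d(t)
    lo, hi = np.sort(mu)
    grid = np.linspace(lo, hi, 1000)
    # component densities on the grid (shared normalizing constants cancel)
    d0, d1 = (w[k] * np.exp(-0.5 * ((grid - mu[k]) / sd[k]) ** 2) / sd[k] for k in (0, 1))
    thr_t = grid[np.argmin(np.abs(d0 - d1))]
    # map the threshold back to the original score space
    return np.exp(thr_t) if lam == 0 else (thr_t * lam + 1.0) ** (1.0 / lam)
```

On scores that form two separated clusters, the returned threshold lands between the clusters, which is the point of the adaptive scheme: no manually tuned cut-off is needed.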

AAAI Conference 2026 Conference Paper

CoGrad3D: Spatially-Coupled Timestep Optimization with Orthogonal Gradient Fusion for 3D Generation

  • Haoyang Tong
  • Hongbo Wang
  • Jin Liu
  • Qi Wang
  • Jie Cao
  • Ran He

Score Distillation Sampling has driven recent advances in text-to-3D generation. However, current approaches often fail to produce 3D assets that are both rich in detail and consistent across viewpoints. These limitations primarily arise from imbalanced guidance on fine-grained details and an overdependence on single-view optimization—issues exacerbated by the excessive randomness in selecting diffusion timesteps and camera configurations. Such deficiencies commonly lead to blurry textures and inter-view inconsistencies, which degrade visual realism and hinder practical deployment. To tackle these challenges, we introduce CoGrad3D, a unified generative refinement framework that adopts a continuously adaptive optimization strategy. By dynamically modulating the optimization focus based on real-time convergence signals, CoGrad3D ensures balanced progress toward both geometric completeness and high-fidelity detail. Concretely, we propose an adaptive region sampling strategy that emphasizes under-converged viewing areas, promoting stable and uniform optimization. To facilitate the transition from coarse geometry to fine-grained reconstruction, we develop a region-aware temporal scheduling scheme that integrates global training dynamics with local convergence feedback. Furthermore, we introduce a gradient fusion mechanism that consolidates historical gradients from adjacent viewpoints, mitigating view-specific artifacts and promoting the emergence of coherent 3D structures. Extensive experiments demonstrate that CoGrad3D substantially surpasses existing methods in both geometric consistency and texture fidelity, enabling the generation of high-quality, view-consistent 3D models from textual descriptions.

JBHI Journal 2026 Journal Article

Dual-Student Adversarial Framework With Discriminator and Consistency-Driven Learning for Semi-Supervised Medical Image Segmentation

  • Haifan Wu
  • Yuhan Geng
  • Di Gai
  • Jieying Tu
  • Xin Xiong
  • Qi Wang
  • Zheng Huang

Semi-supervised medical image segmentation is essential for alleviating the cost of manual annotation in clinical applications. However, existing methods often suffer from unreliable pseudo-labels and confirmation bias in consistency-based training, which can lead to unstable optimization and degraded performance. To address these issues, a novel dual-student adversarial framework with discriminator and consistency-driven learning is proposed for semi-supervised medical image segmentation. Specifically, an adversarial learning-based segmentation refinement (ALSR) module is designed to encourage prediction diversity between two student networks and leverage a shared discriminator for adversarial refinement of pseudo-labels. To further stabilize the consistency process, a residual exponential moving average (R-EMA) is applied in the uncertainty estimation with inter-instance consistency measurement (UIM) module to construct a robust teacher model, while noisy voxel predictions are selectively filtered based on uncertainty estimation. In addition, a Contrastive Representation Stabilization (CRS) module is developed to enhance voxel-level semantic alignment by performing contrastive learning only on confident regions, improving feature discriminability and structural consistency. Extensive experiments on benchmark datasets demonstrate that our method consistently outperforms prior state-of-the-art approaches.
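The teacher-student machinery underlying such frameworks is standard; a minimal sketch of plain EMA plus entropy-based voxel filtering (the paper's residual R-EMA variant and its exact uncertainty measure are not reproduced here, and the function names are illustrative):

```python
import numpy as np

def ema_update(teacher, student, decay=0.99):
    """Standard exponential moving average: the teacher tracks a smoothed
    copy of the student's parameters (R-EMA adds a residual term on top
    of this; that variant is paper-specific and omitted)."""
    return {k: decay * teacher[k] + (1.0 - decay) * student[k] for k in teacher}

def uncertainty_mask(probs, thresh=0.5):
    """Keep only voxels whose predictive entropy (in nats) is below a
    threshold, mimicking selective filtering of noisy voxel predictions."""
    ent = -np.sum(probs * np.log(probs + 1e-12), axis=-1)
    return ent < thresh
```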

AAAI Conference 2026 Conference Paper

Exploring Generalizable Remote Sensing Change Detection via Low-Rank Exchange Adaptation of Vision Foundation Model

  • Mingwei Zhang
  • Jingtao Hu
  • Qiang Li
  • Qi Wang

Remote sensing change detection (CD) has achieved remarkable progress in recent years. However, little attention has been paid to generalizable change detection (GCD) methods that can effectively generalize to unseen scenarios or domains beyond the training distribution. The major challenges in GCD arise from domain diversity and bitemporal domain shifts in remote sensing images, caused by variations in imaging platforms, acquisition times, geographic regions, and observed events. To tackle these challenges, we propose GenCD, a GCD framework built upon vision foundation models (VFMs). Specifically, GenCD introduces two key components: (1) a Low-Rank Exchange Adaptation (LREA) strategy of VFMs that aligns bitemporal representations while preserving the generalization capacity of VFMs on single-temporal inputs; and (2) a Token-Guided Feature Refinement (TGFR) mechanism that leverages an input-independent token as a guide to refine difference features, improving the discrimination between changed and unchanged regions. We conduct extensive cross-dataset evaluations on eight diverse datasets across three binary CD tasks: land cover, land use, and building-only CD. The results consistently demonstrate the superior generalization of GenCD over SoTA methods, highlighting its effectiveness in GCD.

AAAI Conference 2026 Conference Paper

HISE-KT: Synergizing Heterogeneous Information Networks and LLMs for Explainable Knowledge Tracing with Meta-Path Optimization

  • Zhiyi Duan
  • Zixing Shi
  • Hongyu Yuan
  • Qi Wang

Knowledge Tracing (KT) aims to mine students’ evolving knowledge states and predict their future question-answering performance. Existing methods based on heterogeneous information networks (HINs) are prone to introducing noise due to manual or random selection of meta-paths and lack necessary quality assessment of meta-path instances. Conversely, recent large language models (LLMs)-based methods ignore the rich information across students, and both paradigms struggle to deliver consistently accurate and evidence-based explanations. To address these issues, we propose an innovative framework, HIN-LLM Synergistic Enhanced Knowledge Tracing (HISE-KT), which seamlessly integrates HINs with LLMs. HISE-KT first builds a multi-relationship HIN containing diverse node types to capture the structural relations through multiple meta-paths. The LLM is then employed to intelligently score and filter meta-path instances and retain high-quality paths, pioneering automated meta-path quality assessment. Inspired by educational psychology principles, a similar student retrieval mechanism based on meta-paths is designed to provide a more valuable context for prediction. Finally, HISE-KT uses a structured prompt to integrate the target student's history with the retrieved similar trajectories, enabling the LLM to generate not only accurate predictions but also evidence-backed, explainable analysis reports. Experiments on four public datasets show that HISE-KT outperforms existing KT baselines in both prediction performance and interpretability.

JBHI Journal 2026 Journal Article

HyperSynergyX: Synergistic Drug Combination Prediction via Hypergraph Modeling and Knowledge Graph-Enhanced Retrieval-Augmented Generation

  • Qi Wang
  • Bingzheng Wu
  • Minglang Xu
  • Xiya Liu
  • Yiming Mao
  • Zhiheng Zhou
  • Guiying Yan

Drug combination therapy is pivotal for complex diseases, but identifying synergistic three-drug regimens remains challenging due to both combinatorial explosion and the opacity of existing computational models. To address this, we introduce HyperSynergyX, an explainable framework that integrates synergy prediction with mechanistic explanation. Its core predictive component, a Dual-Biased Random Walk on Hypergraphs (DBRWH), models higher-order interactions among drugs on a three-drug hypergraph and identifies latent combination patterns via tensor decomposition. To enhance interpretability, we couple DBRWH with a knowledge-graph-enhanced retrieval-augmented generation (KG-RAG) module that retrieves mechanistically relevant subgraphs and uses them to generate biologically grounded hypotheses for predicted synergies. On breast cancer data, DBRWH achieves AUROC/AUPRC of 0.9593/0.9453 under 5-fold cross-validation, and on lung cancer data it achieves 0.9262/0.9481, outperforming strong deep learning and hypergraph baselines. By linking predictive performance with mechanistic interpretability, HyperSynergyX provides a robust and transparent tool to accelerate multi-drug discovery and support rational regimen design in precision oncology. The code is available at: https://github.com/wangqi27/HyperSynergyX.

AAAI Conference 2026 Conference Paper

PIMRL: Physics-Informed Multi-Scale Recurrent Learning for Burst-Sampled Spatiotemporal Dynamics

  • Han Wan
  • Qi Wang
  • Yuan Mi
  • Rui Zhang
  • Hao Sun

Deep learning has shown strong potential in modeling complex spatiotemporal dynamics. However, most existing methods depend on densely and uniformly sampled data, which is often unavailable in practice due to sensor and cost limitations. In many real-world settings, such as mobile sensing and physical experiments, data are burst-sampled with short high-frequency segments followed by long gaps, making it difficult to learn accurate dynamics from sparse observations. To address this issue, we propose Physics-Informed Multi-Scale Recurrent Learning (PIMRL), a novel framework specifically designed for burst-sampled spatiotemporal data. PIMRL combines macro-scale latent dynamics inference with micro-scale adaptive refinement guided by incomplete prior information from partial differential equations (PDEs). It further introduces a temporal message-passing mechanism to effectively propagate information across burst intervals. This multi-scale architecture enables PIMRL to model complex systems accurately even under severe data scarcity. We evaluate our approach on five benchmark datasets involving 1D to 3D multi-scale PDEs. The results show that PIMRL consistently outperforms state-of-the-art baselines, achieving substantial improvements and reducing errors by up to 80% in the most challenging settings. Our work demonstrates the effectiveness of physics-informed recurrent learning for accurate and efficient modeling of sparse spatiotemporal systems.

AAAI Conference 2026 Conference Paper

Reasoning via Implicit Self-supervised Emergence for Instruction Segmentation

  • Qing Zhou
  • Lichang Yang
  • Yuyu Jia
  • Junyu Gao
  • Weiping Ni
  • Junzheng Wu
  • Qi Wang

We challenge the assumption that complex instruction-guided segmentation tasks necessitate equally complex and explicit supervision. This paper introduces RISE (Reasoning via Implicit Self-supervised Emergence), a framework that learns intricate compositional reasoning, spanning spatial relations to world knowledge, without a single ground-truth mask. To achieve this, RISE employs reinforcement learning with GRPO guided by a single, strikingly simple reward: the semantic alignment score between the textual instruction and the predicted image region. Our primary discovery is the implicit emergence of a high-quality chain-of-thought process from this minimalist signal. Within a structured format, the model autonomously learns to understand instructions by accessing its latent knowledge, inferring spatial relationships—capabilities inherent in its architecture but unlocked by our simple objective. Remarkably, our emergent reasoning yields highly competitive results: RISE achieves 58.7 gIoU on the ReasonSeg benchmark, on par with methods using geometric rewards. Furthermore, we show extreme data efficiency: a variant trained on only 2,000 ImageNet-label pairs establishes a new state-of-the-art for annotation-free referring segmentation with 79.6 cIoU on RefCOCO.

AAAI Conference 2026 Conference Paper

Slender3D: Curve-Guided Multi-View Reconstruction of Slender Structures

  • Suqin Wang
  • Zeyi Wang
  • Min Shi
  • Zhaoxin Li
  • Qi Wang
  • Xiujuan Chai
  • Dengming Zhu

Although geometric reconstruction of general objects from images has made remarkable progress in recent years, slender structures remain largely underexplored, despite their critical importance in engineering, biomedical, and agricultural applications. To bridge this gap, we propose a dedicated 2DGS-based geometric reconstruction framework tailored for slender structures, achieving accurate and faithful geometry recovery. Our method first addresses the challenge that most slender objects are texture-less, which hinders reliable feature matching and pose estimation in traditional SfM pipelines. By leveraging the curve-like nature of slender structures, we perform a curve-guided SfM process that provides robust camera poses and accurate 3D curve initialization for Gaussian primitives. To ensure SfM reliability, we introduce a high-precision mask extraction strategy that integrates geometric priors with a segmentation network, effectively handling self-occlusion and thin geometry. Furthermore, to enhance fine geometric recovery, we incorporate a differentiable Poisson reconstruction module to extract an initial mesh during training, which is then refined via image-space iterative optimization using differentiable mesh rasterization. In contrast to conventional approaches that rely on differentiable Gaussian rasterization followed by TSDF-based mesh extraction, our method avoids the additional geometric errors and artifacts introduced during the intermediate TSDF conversion, thereby improving the overall reconstruction quality. Comprehensive experiments on both synthetic and real-world datasets validate that our method achieves superior reconstruction quality compared to state-of-the-art approaches.

AAAI Conference 2026 Conference Paper

Symphony-MoE: Harmonizing Disparate Pre-trained Models into a Coherent Mixture-of-Experts

  • Qi Wang
  • Hanyang Peng
  • Yue Yu

Mixture-of-Experts (MoE) models enable scalable performance by activating large parameter sets sparsely, minimizing computational overhead. To mitigate the prohibitive cost of training MoEs from scratch, recent work employs upcycling, reusing a single pre-trained dense model by replicating its feed-forward network (FFN) layers into experts. However, this limits expert diversity, as all experts originate from a single pre-trained dense model. This paper addresses this limitation by constructing powerful MoE models using experts sourced from multiple identically-architected but disparate pre-trained models (e.g., Qwen2.5-Coder and Qwen2). A key challenge lies in the fact that these source models occupy disparate, dissonant regions of the parameter space, making direct upcycling prone to severe performance degradation. To overcome this, we propose Symphony-MoE, a novel two-stage framework designed to harmonize these models into a single, coherent expert mixture. First, we establish this harmony in a training-free manner: we construct a shared backbone via a layer-aware fusion strategy and, crucially, alleviate parameter misalignment among experts using activation-based functional alignment. Subsequently, a stage of post-training coordinates the entire architecture. Experiments demonstrate that our method successfully integrates experts from heterogeneous sources, achieving an MoE model that significantly surpasses baselines in multi-domain tasks and out-of-distribution generalization.

AAAI Conference 2026 Conference Paper

TAPO: Dynamic Teacher and Perturbed Answer Injection for Policy Optimization

  • Maowei Jiang
  • Zihang Wang
  • Qi Wang
  • Peter Búš
  • Moquan Cheng
  • Yifan Wang
  • Quangao Liu
  • Ruiqi Li

Reinforcement learning (RL) has emerged as a powerful framework for improving the reasoning performance of large language models (LLMs), with approaches such as Group Relative Policy Optimization (GRPO) showing promising results. However, GRPO and its variants struggle with collapsed groups (i.e., all-correct or all-incorrect completions), leading to zero-variance rewards and ineffective gradient signals. Moreover, focusing solely on final answer correctness while ignoring the reasoning process, along with rigid length penalties, can hinder training stability and output quality. To address these issues, we introduce TAPO, a reinforcement learning framework that enhances optimization signals by modifying sampled completions within training groups. TAPO incorporates three core techniques: (1) Dynamic Teacher Injection (DTI), which selectively injects high-quality or adversarial examples to restore effective gradient signals in collapsed groups; (2) Perturbed Answer Injection (PAI), which constructs partially correct completions, providing contrastive supervision that separates trajectories with correct reasoning but wrong final answers; and (3) InfoLen-Aware Reward Shaping, a fine-grained reward strategy that penalizes outputs based on both length and semantic redundancy, encouraging concise yet informative responses. Extensive experimental results demonstrate that TAPO significantly improves the mathematical reasoning capabilities of LLMs across multiple challenging benchmarks, outperforming the GRPO baseline by a substantial margin. Component-wise ablations further validate the contribution of each proposed technique.
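The collapsed-group failure mode that motivates TAPO is easy to see in the commonly published group-relative advantage computation used by GRPO (the normalization form below is the standard one, not taken from this paper):

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: standardize rewards within a sampled group.
    When the group is "collapsed" (all-correct or all-incorrect), the
    within-group variance is zero and every advantage vanishes, which is
    exactly the dead-gradient situation described above."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)
```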

AAAI Conference 2026 Conference Paper

Target-Balanced Score Distillation

  • Zhou Xu
  • Qi Wang
  • Yuxiao Yang
  • Luyuan Zhang
  • Zhang Liang
  • Yang Li

Score Distillation Sampling (SDS) enables 3D asset generation by distilling priors from pretrained 2D text-to-image diffusion models, but vanilla SDS suffers from over-saturation and over-smoothing. To mitigate this issue, recent variants have incorporated negative prompts. However, these methods face a critical trade-off: limited texture optimization, or significant texture gains with shape distortion. In this work, we first conduct a systematic analysis and reveal that this trade-off is fundamentally governed by how the negative prompts are utilized: Target Negative Prompts (TNP), which embed target information in the negative prompts, dramatically enhance texture realism and fidelity but induce shape distortions. Informed by this key insight, we introduce Target-Balanced Score Distillation (TBSD). It formulates generation as a multi-objective optimization problem and introduces an adaptive strategy that effectively resolves the aforementioned trade-off. Extensive experiments demonstrate that TBSD significantly outperforms existing state-of-the-art methods, yielding 3D assets with high-fidelity textures and geometrically accurate shapes.
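For context, the vanilla SDS objective this line of work builds on is the well-known DreamFusion formulation (reproduced here for reference, not taken from this paper):

```latex
\nabla_{\theta} \mathcal{L}_{\mathrm{SDS}}(\theta)
  = \mathbb{E}_{t,\epsilon}\!\left[\, w(t)\,\big(\hat{\epsilon}_{\phi}(\mathbf{x}_t;\, y,\, t) - \epsilon\big)\,\frac{\partial \mathbf{x}}{\partial \theta} \right]
```

where x = g(θ) is the rendered view, x_t its noised version at timestep t, ε̂_φ the diffusion model's noise prediction for prompt y, and w(t) a timestep weighting. Negative-prompt variants modify ε̂_φ through classifier-free guidance, which is where the TNP trade-off discussed in the abstract arises.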

NeurIPS Conference 2025 Conference Paper

Adaptive Neighborhood-Constrained Q Learning for Offline Reinforcement Learning

  • Yixiu Mao
  • Yun Qu
  • Qi Wang
  • Xiangyang Ji

Offline reinforcement learning (RL) suffers from extrapolation errors induced by out-of-distribution (OOD) actions. To address this, offline RL algorithms typically impose constraints on action selection, which can be systematically categorized into density, support, and sample constraints. However, we show that each category has inherent limitations: density and sample constraints tend to be overly conservative in many scenarios, while the support constraint, though least restrictive, faces challenges in accurately modeling the behavior policy. To overcome these limitations, we propose a new neighborhood constraint that restricts action selection in the Bellman target to the union of neighborhoods of dataset actions. Theoretically, the constraint not only bounds extrapolation errors and distribution shift under certain conditions, but also approximates the support constraint without requiring behavior policy modeling. Moreover, it retains substantial flexibility and enables pointwise conservatism by adapting the neighborhood radius for each data point. In practice, we employ data quality as the adaptation criterion and design an adaptive neighborhood constraint. Building on an efficient bilevel optimization framework, we develop a simple yet effective algorithm, Adaptive Neighborhood-constrained Q learning (ANQ), to perform Q learning with target actions satisfying this constraint. Empirically, ANQ achieves state-of-the-art performance on standard offline RL benchmarks and exhibits strong robustness in scenarios with noisy or limited data.
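The union-of-neighborhoods constraint admits a compact membership test; the sketch below is a hypothetical helper (the function name and the way radii are supplied are illustrative, and the paper's adaptive radius criterion based on data quality is not reproduced):

```python
import numpy as np

def in_neighborhood(candidates, dataset_actions, radii):
    """Union-of-neighborhoods membership test: a candidate action is
    admissible if it lies within the (per-datapoint, adaptive) radius
    of at least one dataset action."""
    # pairwise distances: (num_candidates, num_dataset_actions)
    d = np.linalg.norm(candidates[:, None, :] - dataset_actions[None, :, :], axis=-1)
    return (d <= radii[None, :]).any(axis=1)
```

Shrinking a radius toward zero recovers a sample constraint at that data point, while growing it relaxes toward a support-style constraint, which mirrors the pointwise conservatism the abstract describes.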

AAAI Conference 2025 Conference Paper

CLIP-driven View-aware Prompt Learning for Unsupervised Vehicle Re-identification

  • Jiyang Xu
  • Qi Wang
  • Xin Xiong
  • Di Gai
  • Ruihua Zhou
  • Dong Wang

With the emergence of vision-language pre-trained models, such as CLIP, textual prompts have gradually been introduced into re-identification (Re-ID) tasks to obtain considerably robust multimodal information. However, most textual descriptions in vehicle Re-ID tasks only contain identity index words, without specific words to describe vehicle view information, making them difficult to apply widely to vehicle Re-ID tasks with view variations. This inspires us to propose a CLIP-driven view-aware prompt learning framework for unsupervised vehicle Re-ID. We first design a learnable textual prompt template called view-aware context optimization (ViewCoOp) based on dynamic multi-view word embeddings, which can fully obtain the proportion and position encoding of each view in the whole vehicle body region. Subsequently, a cross-modal mutual graph is constructed to explore both inter-modal and intra-modal connections. Each sample is treated as a graph node, with textual features extracted via ViewCoOp and visual features extracted from images. Moreover, the inter-cluster and intra-cluster correlations in the bimodal clustering results are leveraged to determine the connectivity between graph node pairs. Lastly, the proposed cross-modal mutual graph method utilizes supervised information from the bimodal gap to directly fine-tune the image encoder of CLIP for downstream unsupervised vehicle Re-ID tasks. Extensive experiments verify that the proposed method effectively obtains cross-modal description ability from multiple views.

JBHI Journal 2025 Journal Article

Edge-Guided Multi-Scale Frequency Attention Network for Gastrointestinal Cancer Image Segmentation

  • Zhiwen Liao
  • Qi Wang
  • Xinyi Tang
  • Han Wang
  • Jun Hu
  • Pengxiang Su
  • Evangelos K. Markakis
  • Peng Luo

Image segmentation is a critical technology for improving the accuracy of clinical decisions and treatments in computer-aided diagnostic systems. However, the diverse morphology and fuzzy boundaries of gastrointestinal tumors pose substantial challenges to existing segmentation models, leading to inaccurate feature capture and suboptimal results. To solve these problems, we design an edge-guided multi-scale frequency attention network for the gastrointestinal tumor segmentation task, termed EGMFA-Net, which consists of a Kernel Adaptive Enhancement Module (KAEM) and a Frequency-domain Self-attention Module (FDSA). Specifically, KAEM adaptively adjusts the feature extraction kernel based on the morphology of different lesion regions, enhancing the recognition of different morphology regions via a progressive optimization strategy of feature expression. Furthermore, FDSA effectively aggregates multi-scale features in the frequency domain to achieve global receptive fields while preserving more high-frequency details, thereby enhancing adaptability to complex pathological contexts. Extensive experiments on eight medical image benchmark datasets, including SEED, Kvasir, ClinicDB, ColonDB, ETIS, BKAI, CVC-300, and Synapse, show that EGMFA-Net attains state-of-the-art performance over existing methods. Our implementation is available at https://github.com/med-segment/egmfa-net.
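Frequency-domain global mixing of the kind FDSA builds on is a known pattern (learned spectral filters); a minimal sketch, with the paper's actual self-attention over spectra not reproduced and the filter `w` treated as a stand-in for learned weights:

```python
import numpy as np

def frequency_mix(x, w):
    """Global feature mixing in the frequency domain: FFT, elementwise
    spectral filter, inverse FFT. A global receptive field comes for free,
    since every output position depends on every input position."""
    X = np.fft.rfft2(x)           # real-input 2D FFT: shape (H, W//2 + 1)
    return np.fft.irfft2(X * w, s=x.shape)
```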

NeurIPS Conference 2025 Conference Paper

Gains: Fine-grained Federated Domain Adaptation in Open Set

  • Zhengyi Zhong
  • Wenzheng Jiang
  • Weidong Bao
  • Ji Wang
  • Qi Wang
  • Guanbo Wang
  • Yongheng Deng
  • Ju Ren

Conventional federated learning (FL) assumes a closed world with a fixed total number of clients. In contrast, new clients continuously join the FL process in real-world scenarios, introducing new knowledge. This raises two critical demands: detecting new knowledge, i.e., knowledge discovery, and integrating it into the global model, i.e., knowledge adaptation. Existing research focuses on coarse-grained knowledge discovery, and often sacrifices source domain performance and adaptation efficiency. To this end, we propose a fine-grained federated domain adaptation approach in open set (Gains). Gains splits the model into an encoder and a classifier, empirically revealing that features extracted by the encoder are sensitive to domain shifts while classifier parameters are sensitive to class increments. Based on this, we develop fine-grained knowledge discovery and contribution-driven aggregation techniques to identify and incorporate new knowledge. Additionally, an anti-forgetting mechanism is designed to preserve source domain performance, ensuring balanced adaptation. Experimental results on multi-domain datasets across three typical data-shift scenarios demonstrate that Gains significantly outperforms other baselines in performance for both source-domain and target-domain clients. Code is available at: https://github.com/Zhong-Zhengyi/Gains.

AAAI Conference 2025 Conference Paper

GTDE: Grouped Training with Decentralized Execution for Multi-agent Actor-Critic

  • Mengxian Li
  • Qi Wang
  • Yongjun Xu

The rapid advancement of multi-agent reinforcement learning (MARL) has given rise to diverse training paradigms for learning the policies of each agent in the multi-agent system. The paradigms of decentralized training and execution (DTDE) and centralized training with decentralized execution (CTDE) have been proposed and widely applied. However, as the number of agents increases, the inherent limitations of these frameworks significantly degrade performance metrics such as win rate and total reward. To reduce the influence of the increasing number of agents on these metrics, we propose a novel training paradigm, grouped training with decentralized execution (GTDE). This framework eliminates the need for a centralized module and relies solely on local information, effectively meeting the training requirements of large-scale multi-agent systems. Specifically, we first introduce an adaptive grouping module, which divides agents into different groups based on their observation history. To implement end-to-end training, GTDE uses Gumbel-Sigmoid for efficient point-to-point sampling on the grouping distribution while ensuring gradient backpropagation. To adapt to the uncertainty in the number of members in a group, two methods are used to implement a group information aggregation module that merges member information within the group. Empirical results show that in a cooperative environment with 495 agents, GTDE increased the total reward by an average of 382% compared to the baseline. In a competitive environment with 64 agents, GTDE achieved a 100% win rate against the baseline.
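Gumbel-Sigmoid sampling is a standard trick for (near-)differentiable binary decisions; a NumPy sketch of the sampling step (in an actual training loop the hard rounding would use a straight-through estimator so gradients flow through the soft value; that framework-specific part is omitted here):

```python
import numpy as np

def gumbel_sigmoid(logits, tau=1.0, rng=None, hard=True):
    """Sample (near-)binary group memberships from logits by adding
    logistic (difference-of-Gumbels) noise and squashing with a
    temperature-scaled sigmoid."""
    rng = rng if rng is not None else np.random.default_rng()
    u1 = rng.uniform(1e-9, 1.0, logits.shape)
    u2 = rng.uniform(1e-9, 1.0, logits.shape)
    g = -np.log(-np.log(u1)) + np.log(-np.log(u2))  # Gumbel(0,1) - Gumbel(0,1)
    y = 1.0 / (1.0 + np.exp(-(logits + g) / tau))   # soft membership in (0, 1)
    return (y > 0.5).astype(float) if hard else y
```

Lowering `tau` sharpens the soft samples toward 0/1, trading gradient smoothness for decisiveness of the grouping.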

NeurIPS Conference 2025 Conference Paper

H3D-DGS: Exploring Heterogeneous 3D Motion Representation for Deformable 3D Gaussian Splatting

  • Bing He
  • Yunuo Chen
  • Guo Lu
  • Qi Wang
  • Qunshan Gu
  • Rong Xie
  • Li Song
  • Wenjun Zhang

Dynamic scene reconstruction poses a persistent challenge in 3D vision. Deformable 3D Gaussian Splatting has emerged as an effective method for this task, offering real-time rendering and high visual fidelity. This approach decomposes a dynamic scene into a static representation in a canonical space and time-varying scene motion. Scene motion is defined as the collective movement of all Gaussian points, and for compactness, existing approaches commonly adopt implicit neural fields or sparse control points. However, these methods predominantly rely on gradient-based optimization for all motion information. Due to the high degree of freedom, they struggle to converge on real-world datasets exhibiting complex motion. To preserve the compactness of motion representation and address convergence challenges, this paper proposes heterogeneous 3D control points, termed H3D control points, whose attributes are obtained using a hybrid strategy combining optical flow back-projection and gradient-based methods. This design decouples directly observable motion components from those that are geometrically occluded. Specifically, components of 3D motion that project onto the image plane are directly acquired via optical flow back projection, while unobservable portions are refined through gradient-based optimization. Experiments on the Neu3DV and CMU-Panoptic datasets demonstrate that our method achieves superior performance over state-of-the-art deformable 3D Gaussian splatting techniques. Remarkably, our method converges within just 100 iterations and achieves a per-frame processing speed of 2 seconds on a single NVIDIA RTX 4070 GPU.

IJCAI Conference 2025 Conference Paper

Leveraging Pretrained Diffusion Models for Zero-Shot Part Assembly

  • Ruiyuan Zhang
  • Qi Wang
  • Jiaxiang Liu
  • Yuchi Huo
  • Chao Wu

3D part assembly aims to understand part relationships and predict their 6-DoF poses to construct realistic 3D shapes, addressing the growing demand for autonomous assembly, which is crucial for robots. Existing methods mainly estimate the transformation of each part by training neural networks under supervision, which requires a substantial quantity of manually labeled data. However, the high cost of data collection and the immense variability of real-world shapes and parts make traditional methods impractical for large-scale applications. In this paper, we propose the first zero-shot part assembly method, which utilizes pre-trained point cloud diffusion models as discriminators in the assembly process, guiding the manipulation of parts to form realistic shapes. Specifically, we theoretically demonstrate that utilizing a diffusion model for zero-shot part assembly can be transformed into an Iterative Closest Point (ICP) process. Then, we propose a novel pushing-away strategy to address overlapping parts, thereby further enhancing the robustness of the method. To verify our work, we conduct extensive experiments and quantitative comparisons against several strong baseline methods, demonstrating the effectiveness of the proposed approach, which even surpasses supervised learning methods. The code has been released at https://github.com/Ruiyuan-Zhang/Zero-Shot-Assembly.

IJCAI Conference 2025 Conference Paper

PeSANet: Physics-encoded Spectral Attention Network for Simulating PDE-Governed Complex Systems

  • Han Wan
  • Rui Zhang
  • Qi Wang
  • Yang Liu
  • Hao Sun

Accurately modeling and forecasting complex systems governed by partial differential equations (PDEs) is crucial in various scientific and engineering domains. However, traditional numerical methods struggle in real-world scenarios due to incomplete or unknown physical laws. Meanwhile, machine learning approaches often fail to generalize effectively when faced with scarce observational data and the challenge of capturing local and global features. To this end, we propose the Physics-encoded Spectral Attention Network (PeSANet), which integrates local and global information to forecast complex systems with limited data and incomplete physical priors. The model consists of two key components: a physics-encoded block that uses hard constraints to approximate local differential operators from limited data, and a spectral-enhanced block that captures long-range global dependencies in the frequency domain. Specifically, we introduce a novel spectral attention mechanism to model inter-spectrum relationships and learn long-range spatial features. Experimental results demonstrate that PeSANet outperforms existing methods across all metrics, particularly in long-term forecasting accuracy, providing a promising solution for simulating complex systems with limited data and incomplete physics.

JBHI Journal 2025 Journal Article

Re-Visible Dual-Domain Self-Supervised Deep Unfolding Network for MRI Reconstruction

  • Hao Zhang
  • Qi Wang
  • Jian Sun
  • Zhijie Wen
  • Jun Shi
  • Shihui Ying

Magnetic Resonance Imaging (MRI) is widely used in clinical practice, but suffers from prolonged acquisition time. Although deep learning methods have been proposed to accelerate acquisition and demonstrate promising performance, they rely on high-quality fully-sampled datasets for training in a supervised manner. However, such datasets are time-consuming and expensive to collect, which constrains their broader applications. On the other hand, self-supervised methods offer an alternative by enabling learning from under-sampled data alone, but most existing methods rely on further partitioned under-sampled k-space data as the model's input for training, which causes an input distribution shift between the training stage and the inference stage. Additionally, their models have not effectively incorporated comprehensive image priors, leading to degraded reconstruction performance. In this paper, we propose a novel re-visible dual-domain self-supervised deep unfolding network to address these issues when only under-sampled datasets are available. Specifically, by incorporating a re-visible dual-domain loss, all under-sampled k-space data are utilized during training to mitigate the input distribution shift caused by further partitioning. This design enables the model to implicitly adapt to all under-sampled k-space data as input. Additionally, we design a Deep Unfolding Network based on the Chambolle and Pock Proximal Point Algorithm (DUN-CP-PPA) to achieve end-to-end reconstruction. By employing a Spatial-Frequency Feature Extraction (SFFE) block to capture both global and local representations, the model effectively integrates imaging physics with comprehensive image priors to enhance reconstruction performance. Experiments on both single-coil and multi-coil datasets demonstrate that our method outperforms state-of-the-art approaches in terms of reconstruction performance and generalization capability.

NeurIPS Conference 2025 Conference Paper

RHYTHM: Reasoning with Hierarchical Temporal Tokenization for Human Mobility

  • Haoyu He
  • Haozheng Luo
  • Yan Chen
  • Qi Wang

Predicting human mobility is inherently challenging due to complex long-range dependencies and multi-scale periodic behaviors. To address this, we introduce RHYTHM (Reasoning with Hierarchical Temporal Tokenization for Human Mobility), a unified framework that leverages large language models (LLMs) as general-purpose spatio-temporal predictors and trajectory reasoners. Methodologically, RHYTHM employs temporal tokenization to partition each trajectory into daily segments and encode them as discrete tokens with hierarchical attention that captures both daily and weekly dependencies, thereby quadratically reducing the sequence length while preserving cyclical information. Additionally, we enrich token representations by adding pre-computed prompt embeddings for trajectory segments and prediction targets via a frozen LLM, and feeding these combined embeddings back into the LLM backbone to capture complex interdependencies. Computationally, RHYTHM keeps the pretrained LLM backbone frozen, yielding faster training and lower memory usage. We evaluate our model against state-of-the-art methods using three real-world datasets. Notably, RHYTHM achieves a 2.4% improvement in overall accuracy, a 5.0% increase on weekends, and a 24.6% reduction in training time. Code is publicly available at https://github.com/he-h/rhythm.

NeurIPS Conference 2025 Conference Paper

Selective Learning for Deep Time Series Forecasting

  • Yisong Fu
  • Zezhi Shao
  • Chengqing Yu
  • Yujie Li
  • Zhulin An
  • Qi Wang
  • Yongjun Xu
  • Fei Wang

Benefiting from high capacity for capturing complex temporal patterns, deep learning (DL) has significantly advanced time series forecasting (TSF). However, deep models tend to suffer from severe overfitting due to the inherent vulnerability of time series to noise and anomalies. The prevailing DL paradigm uniformly optimizes all timesteps through the MSE loss and learns those uncertain and anomalous timesteps without difference, ultimately resulting in overfitting. To address this, we propose a novel selective learning strategy for deep TSF. Specifically, selective learning screens a subset of the whole timesteps to calculate the MSE loss in optimization, guiding the model to focus on generalizable timesteps while disregarding non-generalizable ones. Our framework introduces a dual-mask mechanism to target timesteps: (1) an uncertainty mask leveraging residual entropy to filter uncertain timesteps, and (2) an anomaly mask employing residual lower bound estimation to exclude anomalous timesteps. Extensive experiments across eight real-world datasets demonstrate that selective learning can significantly improve the predictive performance for typical state-of-the-art deep models, including 37.4% MSE reduction for Informer, 8.4% for TimesNet, and 6.5% for iTransformer.

NeurIPS Conference 2025 Conference Paper

VideoRFT: Incentivizing Video Reasoning Capability in MLLMs via Reinforced Fine-Tuning

  • Qi Wang
  • Yanrui Yu
  • Ye Yuan
  • Rui Mao
  • Tianfei Zhou

Reinforcement fine-tuning (RFT) has shown great promise in achieving human-level reasoning capabilities of Large Language Models (LLMs), and has recently been extended to MLLMs. Nevertheless, reasoning about videos, which is a fundamental aspect of human intelligence, remains a persistent challenge due to the complex logic, temporal and causal structures inherent in video data. To fill this gap, we propose VideoRFT, a novel approach that extends the RFT paradigm to cultivate human-like video reasoning capabilities in MLLMs. VideoRFT follows the standard two-stage scheme in RFT: supervised fine-tuning (SFT) with chain-of-thought (CoT) annotations, followed by reinforcement learning (RL) to improve generalization. A central challenge to achieve this in the video domain lies in the scarcity of large-scale, high-quality video CoT datasets. We address this by building a multi-expert-driven, cognition-inspired CoT curation pipeline. First, we devise a cognition-inspired prompting strategy to elicit a reasoning LLM to generate preliminary CoTs based solely on rich, structured, and literal representations of video content. Subsequently, these CoTs are revised by an MLLM conditioned on the actual video, ensuring visual consistency and reducing visual hallucinations. This pipeline results in two new datasets, i.e., VideoRFT-CoT-102K for SFT and VideoRFT-RL-310K for RL. To further strengthen the RL phase, we introduce a novel semantic-consistency reward that explicitly promotes the alignment between textual reasoning and visual evidence. This reward encourages the model to produce coherent, context-aware reasoning outputs grounded in visual input. Extensive experiments show that VideoRFT achieves state-of-the-art performance on six video reasoning benchmarks.

AAAI Conference 2024 Conference Paper

Cross-Sentence Gloss Consistency for Continuous Sign Language Recognition

  • Qi Rao
  • Ke Sun
  • Xiaohan Wang
  • Qi Wang
  • Bang Zhang

Continuous sign language recognition (CSLR) aims to recognize gloss sequences from continuous sign videos. Recent works enhance gloss representation consistency by mining correlations between visual and contextual modules within individual sentences. However, there remain much richer correlations among glosses across different sentences. In this paper, we present a simple yet effective Cross-Sentence Gloss Consistency (CSGC) method, which enforces glosses belonging to the same category to be more consistent in representation than those belonging to different categories, across all training sentences. Specifically, in CSGC, a prototype is maintained for each gloss category and benefits gloss discrimination in a contrastive way. Thanks to the well-distinguished gloss prototypes, an auxiliary similarity classifier is devised to enhance the recognition clues, thus yielding more accurate results. Extensive experiments conducted on three CSLR datasets show that our proposed CSGC significantly boosts the performance of CSLR, surpassing existing state-of-the-art works by large margins (i.e., 1.6% on PHOENIX14, 2.4% on PHOENIX14-T, and 5.7% on CSL-Daily).

NeurIPS Conference 2024 Conference Paper

Doubly Mild Generalization for Offline Reinforcement Learning

  • Yixiu Mao
  • Qi Wang
  • Yun Qu
  • Yuhang Jiang
  • Xiangyang Ji

Offline Reinforcement Learning (RL) suffers from extrapolation error and value overestimation. From a generalization perspective, this issue can be attributed to the over-generalization of value functions or policies towards out-of-distribution (OOD) actions. Significant efforts have been devoted to mitigating such generalization, and recent in-sample learning approaches have further succeeded in entirely eschewing it. Nevertheless, we show that mild generalization beyond the dataset can be trusted and leveraged to improve performance under certain conditions. To appropriately exploit generalization in offline RL, we propose Doubly Mild Generalization (DMG), comprising (i) mild action generalization and (ii) mild generalization propagation. The former refers to selecting actions in a close neighborhood of the dataset to maximize the Q values. Even so, the potential erroneous generalization can still be propagated, accumulated, and exacerbated by bootstrapping. In light of this, the latter concept is introduced to mitigate the generalization propagation without impeding the propagation of RL learning signals. Theoretically, DMG guarantees better performance than the in-sample optimal policy in the oracle generalization scenario. Even under worst-case generalization, DMG can still control value overestimation at a certain level and lower bound the performance. Empirically, DMG achieves state-of-the-art performance across Gym-MuJoCo locomotion tasks and challenging AntMaze tasks. Moreover, benefiting from its flexibility in both generalization aspects, DMG enjoys a seamless transition from offline to online learning and attains strong online fine-tuning performance.

IJCAI Conference 2024 Conference Paper

Error-aware Sampling in Adaptive Shells for Neural Surface Reconstruction

  • Qi Wang
  • Yuchi Huo
  • Qi Ye
  • Rui Wang
  • Hujun Bao

Neural implicit surfaces with signed distance functions (SDFs) achieve superior quality in 3D geometry reconstruction. However, training SDFs is time-consuming because it requires a great number of samples to calculate accurate weight distributions and a considerable amount of samples drawn from the distribution for integrating the rendering results. Some existing sampling strategies focus on this problem, but during training they assume a spatially consistent convergence speed of the kernel size, and thus still suffer from slow convergence or errors. Instead, we introduce an error-aware sampling method based on thin intervals of valid weight distributions, dubbed adaptive shells, to reduce the number of samples while still maintaining the reconstruction accuracy. To this end, we first extend Laplace-based neural implicit surfaces with learned spatially-varying kernel sizes, which indicate the range of valid weight distributions. Then, the adaptive shell for each ray is determined by an efficient double-clipping strategy with spatially-varying SDF values and kernel sizes, fitting larger kernel sizes to wider shells. Finally, we calculate the error-bounded cumulative distribution functions (CDFs) of shells to conduct efficient importance sampling, achieving low-variance rendering with fewer calculations. Extensive results in various scenes demonstrate the superiority of our sampling technique, including significantly reducing sample counts and training time, and even improving the reconstruction quality. The code is available at https://github.com/erernan/ESampling.

IJCAI Conference 2024 Conference Paper

FLDM-VTON: Faithful Latent Diffusion Model for Virtual Try-on

  • Chenhui Wang
  • Tao Chen
  • Zhihao Chen
  • Zhizhong Huang
  • Taoran Jiang
  • Qi Wang
  • Hongming Shan

Despite their impressive generative performance, latent diffusion model-based virtual try-on (VTON) methods lack faithfulness to crucial details of the clothes, such as style, pattern, and text. To alleviate these issues caused by the diffusion stochastic nature and latent supervision, we propose a novel Faithful Latent Diffusion Model for VTON, termed FLDM-VTON. FLDM-VTON improves the conventional latent diffusion process in three major aspects. First, we propose incorporating warped clothes as both the starting point and local condition, supplying the model with faithful clothes priors. Second, we introduce a novel clothes flattening network to constrain generated try-on images, providing clothes-consistent faithful supervision. Third, we devise a clothes-posterior sampling for faithful inference, further enhancing the model performance over conventional clothes-agnostic Gaussian sampling. Extensive experimental results on the benchmark VITON-HD and Dress Code datasets demonstrate that our FLDM-VTON outperforms state-of-the-art baselines and is able to generate photo-realistic try-on images with faithful clothing details.

NeurIPS Conference 2024 Conference Paper

GO4Align: Group Optimization for Multi-Task Alignment

  • Jiayi Shen
  • Qi Wang
  • Zehao Xiao
  • Nanne van Noord
  • Marcel Worring

This paper proposes GO4Align, a multi-task optimization approach that tackles task imbalance by explicitly aligning the optimization across tasks. To achieve this, we design an adaptive group risk minimization strategy, comprising two techniques in implementation: (i) dynamical group assignment, which clusters similar tasks based on task interactions; (ii) risk-guided group indicators, which exploit consistent task correlations with risk information from previous iterations. Comprehensive experimental results on diverse benchmarks demonstrate our method's performance superiority with even lower computational costs.

JBHI Journal 2024 Journal Article

Improving Needle Tip Tracking and Detection in Ultrasound-Based Navigation System Using Deep Learning-Enabled Approach

  • Hui Che
  • Jiaxin Qin
  • Yao Chen
  • Zihan Ji
  • Yibo Yan
  • Jing Yang
  • Qi Wang
  • Chaofeng Liang

Ultrasound-guided percutaneous interventions have numerous advantages over traditional techniques. Accurate needle placement in the target anatomy is crucial for successful intervention, and reliable visual information is essential to achieve this. However, previous studies have revealed several challenges, such as the variability in needle echogenicity and the common misalignment of the ultrasound beam and the needle. Advanced techniques have been developed to optimize needle visualization, including hardware-based and image-processing-based methods. This paper proposes a novel strategy of integrating ultrasound-based deep learning approaches into an optical navigation system to enhance needle visualization and improve tip positioning accuracy. Both the tracking and detection algorithms are optimized utilizing optical tracking information. The information is introduced into the tracking network to define the search patch update strategy and form a trajectory reference to correct tracking results. In the detection network, the original image is processed according to the needle insertion position and current position given by the optical localization system to locate a coarse region, and the depth-score criterion is adopted to optimize detection results. Extensive experiments demonstrate that our approach achieves promising tip tracking and detection performance with tip localization errors of 1.11 $\pm$ 0.59 mm and 1.17 $\pm$ 0.70 mm, respectively. Moreover, we establish a paired dataset consisting of ultrasound images and their corresponding spatial tip coordinates acquired from the optical tracking system and conduct real puncture experiments to verify the effectiveness of the proposed methods. Our approach significantly improves needle visualization and provides physicians with visual guidance for posture adjustment.

TMLR Journal 2024 Journal Article

Large Language Models can be Guided to Evade AI-generated Text Detection

  • Ning Lu
  • Shengcai Liu
  • Rui He
  • Yew-Soon Ong
  • Qi Wang
  • Ke Tang

Large language models (LLMs) have shown remarkable performance in various tasks and have been extensively utilized by the public. However, the increasing concerns regarding the misuse of LLMs, such as plagiarism and spamming, have led to the development of multiple detectors, including fine-tuned classifiers and statistical methods. In this study, we equip LLMs with prompts, rather than relying on an external paraphraser, to evaluate the vulnerability of these detectors. We propose a novel Substitution-based In-Context example Optimization method (SICO) to automatically construct prompts for evading the detectors. SICO is cost-efficient as it requires only 40 human-written examples and a limited number of LLM inferences to generate a prompt. Moreover, once a task-specific prompt has been constructed, it can be universally used against a wide range of detectors. Extensive experiments across three real-world tasks demonstrate that SICO significantly outperforms the paraphraser baselines and enables GPT-3.5 to successfully evade six detectors, decreasing their AUC by 0.5 on average. Furthermore, a comprehensive human evaluation shows that the SICO-generated text achieves human-level readability and task completion rates, while preserving high imperceptibility. Finally, we propose an ensemble approach to enhance the robustness of detectors against the SICO attack.

NeurIPS Conference 2024 Conference Paper

Making Offline RL Online: Collaborative World Models for Offline Visual Reinforcement Learning

  • Qi Wang
  • Junming Yang
  • Yunbo Wang
  • Xin Jin
  • Wenjun Zeng
  • Xiaokang Yang

Training offline RL models using visual inputs poses two significant challenges, i.e., the overfitting problem in representation learning and the overestimation bias for expected future rewards. Recent work has attempted to alleviate the overestimation bias by encouraging conservative behaviors. This paper, in contrast, tries to build more flexible constraints for value estimation without impeding the exploration of potential advantages. The key idea is to leverage off-the-shelf RL simulators, which can be easily interacted with in an online manner, as the “test bed” for offline policies. To enable effective online-to-offline knowledge transfer, we introduce CoWorld, a model-based RL approach that mitigates cross-domain discrepancies in state and reward spaces. Experimental results demonstrate the effectiveness of CoWorld, outperforming existing RL approaches by large margins.

IJCAI Conference 2024 Conference Paper

MuChin: A Chinese Colloquial Description Benchmark for Evaluating Language Models in the Field of Music

  • Zihao Wang
  • Shuyu Li
  • Tao Zhang
  • Qi Wang
  • Pengfei Yu
  • Jinyang Luo
  • Yan Liu
  • Ming Xi

The rapidly evolving multimodal Large Language Models (LLMs) urgently require new benchmarks to uniformly evaluate their performance on understanding and textually describing music. However, due to semantic gaps between Music Information Retrieval (MIR) algorithms and human understanding, discrepancies between professionals and the public, and low precision of annotations, existing music description datasets cannot serve as benchmarks. To this end, we present MuChin, the first open-source music description benchmark in Chinese colloquial language, designed to evaluate the performance of multimodal LLMs in understanding and describing music. We established the Caichong Music Annotation Platform (CaiMAP) that employs an innovative multi-person, multi-stage assurance method, and recruited both amateurs and professionals to ensure the precision of annotations and alignment with popular semantics. Utilizing this method, we built a large-scale, private dataset with multi-dimensional, high-precision music annotations, the Caichong Music Dataset (CaiMD), and carefully selected 1,000 high-quality entries to serve as the test set for MuChin. Based on MuChin, we analyzed the discrepancies between professionals and amateurs in terms of music description, and empirically demonstrated the effectiveness of CaiMD for fine-tuning LLMs. Ultimately, we employed MuChin to evaluate existing music understanding models on their ability to provide colloquial descriptions of music.

NeurIPS Conference 2024 Conference Paper

Offline Reinforcement Learning with OOD State Correction and OOD Action Suppression

  • Yixiu Mao
  • Qi Wang
  • Chen Chen
  • Yun Qu
  • Xiangyang Ji

In offline reinforcement learning (RL), addressing the out-of-distribution (OOD) action issue has been a focus, but we argue that there exists an OOD state issue that also impairs performance yet has been underexplored. Such an issue describes the scenario when the agent encounters states out of the offline dataset during the test phase, leading to uncontrolled behavior and performance degradation. To this end, we propose SCAS, a simple yet effective approach that unifies OOD state correction and OOD action suppression in offline RL. Technically, SCAS achieves value-aware OOD state correction, capable of correcting the agent from OOD states to high-value in-distribution states. Theoretical and empirical results show that SCAS also exhibits the effect of suppressing OOD actions. On standard offline RL benchmarks, SCAS achieves excellent performance without additional hyperparameter tuning. Moreover, benefiting from its OOD state correction feature, SCAS demonstrates enhanced robustness against environmental perturbations.

NeurIPS Conference 2024 Conference Paper

P$^2$C$^2$Net: PDE-Preserved Coarse Correction Network for efficient prediction of spatiotemporal dynamics

  • Qi Wang
  • Pu Ren
  • Hao Zhou
  • Xin-Yang Liu
  • Zhiwen Deng
  • Yi Zhang
  • Ruizhi Chengze
  • Hongsheng Liu

When solving partial differential equations (PDEs), classical numerical methods often require fine mesh grids and small time stepping to meet stability, consistency, and convergence conditions, leading to high computational cost. Recently, machine learning has been increasingly utilized to solve PDE problems, but such methods often encounter challenges related to interpretability, generalizability, and strong dependency on rich labeled data. Hence, we introduce a new PDE-Preserved Coarse Correction Network (P$^2$C$^2$Net) to efficiently solve spatiotemporal PDE problems on coarse mesh grids in small data regimes. The model consists of two synergistic modules: (1) a trainable PDE block that learns to update the coarse solution (i.e., the system state), based on a high-order numerical scheme with boundary condition encoding, and (2) a neural network block that consistently corrects the solution on the fly. In particular, we propose a learnable symmetric Conv filter, with weights shared over the entire model, to accurately estimate the spatial derivatives of the PDE based on the neural-corrected system state. The resulting physics-encoded model is capable of handling limited training data (e.g., 3--5 trajectories) and accelerates the prediction of PDE solutions on coarse spatiotemporal grids while maintaining a high accuracy. P$^2$C$^2$Net achieves consistent state-of-the-art performance with over 50\% gain (e.g., in terms of relative prediction error) across four datasets covering complex reaction-diffusion processes and turbulent flows.

NeurIPS Conference 2024 Conference Paper

Resource-Aware Federated Self-Supervised Learning with Global Class Representations

  • Mingyi Li
  • Xiao Zhang
  • Qi Wang
  • Tengfei Liu
  • Ruofan Wu
  • Weiqiang Wang
  • Fuzhen Zhuang
  • Hui Xiong

Due to heterogeneous architectures and class skew, global representation models trained in resource-adaptive federated self-supervised learning face tricky challenges: $\textit{deviated representation abilities}$ and $\textit{inconsistent representation spaces}$. In this work, we are the first to propose a multi-teacher knowledge distillation framework, namely $\textit{FedMKD}$, to learn global representations with whole-class knowledge from heterogeneous clients even under extreme class skew. Firstly, an adaptive knowledge integration mechanism is designed to learn better representations from all heterogeneous models with deviated representation abilities. Then the weighted combination of the self-supervised loss and the distillation loss can support the global model to encode all classes from clients into a unified space. Besides, the global knowledge anchored alignment module can make the local representation spaces close to the global spaces, which further improves the representation abilities of local ones. Finally, extensive experiments conducted on two datasets demonstrate the effectiveness of $\textit{FedMKD}$, which outperforms state-of-the-art baselines by 4.78\% under linear evaluation on average.

IJCAI Conference 2024 Conference Paper

ScreenAgent: A Vision Language Model-driven Computer Control Agent

  • Runliang Niu
  • Jindong Li
  • Shiqi Wang
  • Yali Fu
  • Xiyu Hu
  • Xueyuan Leng
  • He Kong
  • Yi Chang

Large Language Models (LLMs) can invoke a variety of tools and APIs to complete complex tasks. The computer, as the most powerful and universal tool, could potentially be controlled by a trained LLM agent. Powered by the computer, we can hopefully build a more generalized agent to assist humans in various daily digital works. In this paper, we construct an environment for a Vision Language Model (VLM) agent to interact with a real computer screen. Within this environment, the agent can observe screenshots and manipulate the Graphical User Interface (GUI) by outputting mouse and keyboard actions. We also design an automated control pipeline that includes planning, acting, and reflecting phases, guiding the agent to continuously interact with the environment and complete multi-step tasks. Additionally, we construct the ScreenAgent Dataset, which collects screenshots and action sequences when completing daily computer tasks. Finally, we train a model, ScreenAgent, which achieves computer control capabilities comparable to GPT-4V and demonstrates more precise UI positioning capabilities. Our attempts could inspire further research on building a generalist LLM agent. The code and more detailed information are at https://github.com/niuzaisheng/ScreenAgent.

NeurIPS Conference 2024 Conference Paper

Theoretical Investigations and Practical Enhancements on Tail Task Risk Minimization in Meta Learning

  • Yiqin Lv
  • Qi Wang
  • Dong Liang
  • Zheng Xie

Meta learning is a promising paradigm in the era of large models and task distributional robustness has become an indispensable consideration in real-world scenarios. Recent advances have examined the effectiveness of tail task risk minimization in fast adaptation robustness improvement \citep{wang2023simple}. This work contributes to more theoretical investigations and practical enhancements in the field. Specifically, we reduce the distributionally robust strategy to a max-min optimization problem, constitute the Stackelberg equilibrium as the solution concept, and estimate the convergence rate. In the presence of tail risk, we further derive the generalization bound, establish connections with estimated quantiles, and practically improve the studied strategy. Accordingly, extensive evaluations demonstrate the significance of our proposal in boosting robustness.

NeurIPS Conference 2023 Conference Paper

A Simple Yet Effective Strategy to Robustify the Meta Learning Paradigm

  • Qi Wang
  • Yiqin Lv
  • Yanghe Feng
  • Zheng Xie
  • Jincai Huang

Meta learning is a promising paradigm to enable skill transfer across tasks. Most previous methods employ the empirical risk minimization principle in optimization. However, the resulting worst fast adaptation to a subset of tasks can be catastrophic in risk-sensitive scenarios. To robustify fast adaptation, this paper optimizes meta learning pipelines from a distributionally robust perspective and meta trains models with the measure of tail task risk. We take the two-stage strategy as heuristics to solve the robust meta learning problem, controlling the worst fast adaptation cases at a certain probabilistic level. Experimental results show that our simple method can improve the robustness of meta learning to task distributions and reduce the conditional expectation of the worst fast adaptation risk.

NeurIPS Conference 2023 Conference Paper

Episodic Multi-Task Learning with Heterogeneous Neural Processes

  • Jiayi Shen
  • Xiantong Zhen
  • Qi Wang
  • Marcel Worring

This paper focuses on the data-insufficiency problem in multi-task learning within an episodic training setup. Specifically, we explore the potential of heterogeneous information across tasks and meta-knowledge among episodes to effectively tackle each task with limited data. Existing meta-learning methods often fail to take advantage of crucial heterogeneous information in a single episode, while multi-task learning models neglect reusing experience from earlier episodes. To address the problem of insufficient data, we develop Heterogeneous Neural Processes (HNPs) for the episodic multi-task setup. Within the framework of hierarchical Bayes, HNPs effectively capitalize on prior experiences as meta-knowledge and capture task-relatedness among heterogeneous tasks, mitigating data-insufficiency. Meanwhile, transformer-structured inference modules are designed to enable efficient inferences toward meta-knowledge and task-relatedness. In this way, HNPs can learn more powerful functional priors for adapting to novel heterogeneous tasks in each meta-test episode. Experimental results show the superior performance of the proposed HNPs over typical baselines, and ablation studies verify the effectiveness of the designed inference modules.

IROS Conference 2023 Conference Paper

Hierarchical Attention Network for Planning-Informed Multi-Agent Trajectory Prediction

  • Wenyi Xiong
  • Jian Chen
  • Xinfang Zhang
  • Qi Wang
  • Ziheng Qi

The accurate prediction of neighboring vehicles' trajectories is critical to the safety of autonomous driving vehicles. However, it is challenging for existing methods to anticipate the trajectories of nearby vehicles due to the uncertainty of driving behaviors and the complex interaction patterns of traffic flows. In this study, incorporating the planning information of the ego vehicle, we propose a novel trajectory prediction approach based on a hierarchical attention mechanism. First, a spatio-temporal attention module is presented to extract the social interactions of surrounding vehicles and capture the temporal dependence of continuous-frame historical information and planning information. Then, a hard-soft attention module is designed to perform two tasks: weighing the importance of both historical and future information, and learning different location information about the target vehicles. Our method is evaluated on two national highway datasets. The experimental results show that our algorithm achieves state-of-the-art performance.

IJCAI Conference 2023 Conference Paper

Multi-level Graph Contrastive Prototypical Clustering

  • Yuchao Zhang
  • Yuan Yuan
  • Qi Wang

Recently, graph neural networks (GNNs) have drawn a surge of investigation in deep graph clustering. Nevertheless, existing approaches are predominantly semantic-agnostic, since GNNs exhibit inherent limitations in capturing global underlying semantic structures. Meanwhile, multiple objectives are imposed within one latent space, whereas representations of different granularities may conflict with each other, yielding severe performance degradation for clustering. To this end, we propose a novel Multi-Level Graph Contrastive Prototypical Clustering (MLG-CPC) framework for end-to-end clustering. Specifically, a Prototype Discrimination (ProDisc) objective function is proposed to explicitly capture semantic information via cluster assignments. Moreover, to alleviate the issue of conflicting objectives, we perceive representations of different granularities within separate feature-, prototype-, and cluster-level spaces via feature decorrelation, prototype contrast, and cluster-space consistency, respectively. Extensive experiments on four benchmarks demonstrate the superiority of the proposed MLG-CPC against state-of-the-art graph clustering approaches.

IJCAI Conference 2022 Conference Paper

A Speech-driven Sign Language Avatar Animation System for Hearing Impaired Applications

  • Li Hu
  • Jiahui Li
  • Jiashuo Zhang
  • Qi Wang
  • Bang Zhang
  • Ping Tan

Sign language is the communication language used in the hearing-impaired community. Recently, research on sign language production has made great progress but still needs to cope with some critical challenges. In this paper, we propose a system-level scheme and push forward the implementation of sign language production for practical usage. We build a system capable of translating speech into a sign language avatar animation. Different from previous approaches focusing on a single technology, we systematically combine algorithms for language translation, body gesture animation, and facial avatar generation. We also develop two applications, a Sign Language Interpretation APP and a Virtual Sign Language Anchor, to facilitate easy and clear communication for hearing-impaired people.

IJCAI Conference 2022 Conference Paper

AttExplainer: Explain Transformer via Attention by Reinforcement Learning

  • Runliang Niu
  • Zhepei Wei
  • Yan Wang
  • Qi Wang

Transformer and its variants, built on attention mechanisms, have recently achieved remarkable performance in many NLP tasks. Most existing works on Transformer explanation tend to reveal and utilize the attention matrix with human subjective intuitions in a qualitative manner. However, the huge dimensionality of the attention matrix directly challenges these methods to analyze it quantitatively. Therefore, in this paper, we propose a novel reinforcement learning (RL) based framework for Transformer explanation via the attention matrix, namely AttExplainer. The RL agent learns to perform step-by-step masking operations by observing the change in attention matrices. We have adapted our method to two scenarios: perturbation-based model explanation and text adversarial attack. Experiments on three widely used text classification benchmarks validate the effectiveness of the proposed method compared to state-of-the-art baselines. Additional studies show that our method is highly transferable and consistent with human intuition. The code of this paper is available at https://github.com/niuzaisheng/AttExplainer.

NeurIPS Conference 2022 Conference Paper

Learning Expressive Meta-Representations with Mixture of Expert Neural Processes

  • Qi Wang
  • Herke van Hoof

Neural processes (NPs) formulate exchangeable stochastic processes and are promising models for meta learning that do not require gradient updates during the testing phase. However, most NP variants place a strong emphasis on a global latent variable. This weakens the approximation power and restricts the scope of applications using NP variants, especially when data generative processes are complicated. To resolve these issues, we propose to combine the Mixture of Expert models with Neural Processes to develop more expressive exchangeable stochastic processes, referred to as Mixture of Expert Neural Processes (MoE-NPs). Then we apply MoE-NPs to both few-shot supervised learning and meta reinforcement learning tasks. Empirical results demonstrate MoE-NPs' strong generalization capability to unseen tasks in these benchmarks.
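The core idea of combining a mixture-of-experts with a predictive model is a gate that softly weights expert outputs per input. As a minimal, generic sketch (not the MoE-NP architecture itself; the function and variable names are hypothetical), a softmax-gated mixture prediction looks like:

```python
import numpy as np

def moe_predict(x, experts, gate_weights):
    """Mixture-of-experts prediction: a softmax gate, conditioned on the
    input, mixes the outputs of several expert predictors."""
    logits = gate_weights @ x                    # per-expert gate logits
    g = np.exp(logits - logits.max())
    g = g / g.sum()                              # softmax gate weights
    outputs = np.array([e(x) for e in experts])  # each expert's prediction
    return g @ outputs

# Two toy experts; the gate matrix strongly prefers expert 0 for this input.
experts = [lambda x: x.sum(), lambda x: 2.0 * x.sum()]
W = np.array([[10.0, 10.0], [-10.0, -10.0]])
pred = moe_predict(np.array([1.0, 1.0]), experts, W)
```

In MoE-NPs the "experts" would be latent-variable branches of the neural process rather than fixed callables, but the gating principle is the same.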

JBHI Journal 2022 Journal Article

MRI Generated From CT for Acute Ischemic Stroke Combining Radiomics and Generative Adversarial Networks

  • Eryan Feng
  • Pinle Qin
  • Rui Chai
  • Jianchao Zeng
  • Qi Wang
  • Yanfeng Meng
  • Peng Wang

Compared to computed tomography (CT), magnetic resonance imaging (MRI) is more sensitive to acute ischemic stroke lesions. However, MRI is time-consuming, expensive, and susceptible to interference from metal implants. Generating MRI images from CT images can address the limitations of MRI. The key problem in this process is obtaining lesion information from CT. In this study, we propose a cross-modal image generation algorithm from CT to MRI for acute ischemic stroke by combining radiomics with generative adversarial networks. First, the lesion candidate region was obtained using radiomics, the radiomic features of the region were extracted, and the feature with the largest information gain was selected and visualized as a feature map. Then, the concatenation of the extracted feature map and the CT image was fed into the generator. We added a residual module after the downsampling stage of the generator, following the general shape of U-Net, which can deepen the network without causing degradation problems. In addition, we introduced a lesion feature similarity loss function to focus the model on the similarity of the lesion. Through the subjective judgment of two experienced radiologists and using evaluation metrics, the results showed that the generated MRI images were very similar to the real MRI images. Moreover, the locations of the lesions were correct, and the shapes of the lesions were similar to those of the real lesions, which can help doctors with timely diagnosis and treatment.

JBHI Journal 2021 Journal Article

iPhantom: A Framework for Automated Creation of Individualized Computational Phantoms and Its Application to CT Organ Dosimetry

  • Wanyi Fu
  • Shobhit Sharma
  • Ehsan Abadi
  • Alexandros-Stavros Iliopoulos
  • Qi Wang
  • Joseph Y. Lo
  • Xiaobai Sun
  • William P. Segars

Objective: This study aims to develop and validate a novel framework, iPhantom, for automated creation of patient-specific phantoms or “digital twins” (DT) using patient medical images. The framework is applied to assess radiation dose to radiosensitive organs in CT imaging of individual patients. Method: Given a volume of patient CT images, iPhantom segments selected anchor organs and structures (e.g., liver, bones, pancreas) using a learning-based model developed for multi-organ CT segmentation. Organs which are challenging to segment (e.g., intestines) are incorporated from a matched phantom template, using a diffeomorphic registration model developed for multi-organ phantom voxels. The resulting digital-twin phantoms are used to assess organ doses during routine CT exams. Results: iPhantom was validated both on a set of XCAT digital phantoms (n = 50) and on an independent clinical dataset (n = 10) with similar accuracy. iPhantom precisely predicted all organ locations, yielding Dice Similarity Coefficients (DSC) of 0.6-1.0 for anchor organs and 0.3-0.9 for all other organs. iPhantom showed <10% errors in estimated radiation dose for the majority of organs, notably superior to the state-of-the-art baseline method (20-35% dose errors). Conclusion: iPhantom enables automated and accurate creation of patient-specific phantoms and, for the first time, provides sufficient and automated patient-specific dose estimates for CT dosimetry. Significance: The new framework brings the creation and application of computational human phantoms (CHPs) to the level of individual CHPs through automation, achieving wide and precise organ localization and paving the way for clinical monitoring, personalized optimization, and large-scale research.

NeurIPS Conference 2020 Conference Paper

Unsupervised Semantic Aggregation and Deformable Template Matching for Semi-Supervised Learning

  • Tao Han
  • Junyu Gao
  • Yuan Yuan
  • Qi Wang

Unlabeled data learning has attracted considerable attention recently. However, it is still elusive to extract the expected high-level semantic feature with mere unsupervised learning. In the meantime, semi-supervised learning (SSL) demonstrates a promising future in leveraging few samples. In this paper, we combine both to propose an Unsupervised Semantic Aggregation and Deformable Template Matching (USADTM) framework for SSL, which strives to improve the classification performance with few labeled data and then reduce the cost in data annotating. Specifically, unsupervised semantic aggregation based on Triplet Mutual Information (T-MI) loss is explored to generate semantic labels for unlabeled data. Then the semantic labels are aligned to the actual class by the supervision of labeled data. Furthermore, a feature pool that stores the labeled samples is dynamically updated to assign proxy labels for unlabeled data, which are used as targets for cross-entropy minimization. Extensive experiments and analysis across four standard semi-supervised learning benchmarks validate that USADTM achieves top performance (e.g., 90.46% accuracy on CIFAR-10 with 40 labels and 95.20% accuracy with 250 labels). The code is released at https://github.com/taohan10200/USADTM.
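The proxy-labeling step above can be pictured as nearest-prototype assignment against the labeled feature pool. The following is a simplified stand-in sketch (nearest class centroid under Euclidean distance), not USADTM's actual assignment rule; all names are hypothetical:

```python
import numpy as np

def assign_proxy_labels(unlabeled_feats, pool_feats, pool_labels):
    """Assign each unlabeled feature the label of its nearest class
    centroid, where centroids are computed from the labeled feature pool."""
    classes = np.unique(pool_labels)
    centroids = np.stack([pool_feats[pool_labels == c].mean(axis=0)
                          for c in classes])
    # Euclidean distance of every unlabeled feature to every centroid.
    d = np.linalg.norm(unlabeled_feats[:, None, :] - centroids[None], axis=-1)
    return classes[d.argmin(axis=1)]

pool = np.array([[0., 0.], [0., 1.], [5., 5.], [6., 5.]])
labels = np.array([0, 0, 1, 1])
proxy = assign_proxy_labels(np.array([[0.2, 0.4], [5.5, 4.9]]), pool, labels)
```

The proxy labels returned this way would then serve as cross-entropy targets for the unlabeled batch, as the abstract describes.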

AAAI Conference 2020 Conference Paper

Weakly-Supervised Video Moment Retrieval via Semantic Completion Network

  • Zhijie Lin
  • Zhou Zhao
  • Zhu Zhang
  • Qi Wang
  • Huasheng Liu

Video moment retrieval is to search the moment that is most relevant to the given natural language query. Existing methods are mostly trained in a fully-supervised setting, which requires full annotations of the temporal boundary for each query. However, manually labeling the annotations is actually time-consuming and expensive. In this paper, we propose a novel weakly-supervised moment retrieval framework requiring only coarse video-level annotations for training. Specifically, we devise a proposal generation module that aggregates the context information to generate and score all candidate proposals in one single pass. We then devise an algorithm that considers both exploitation and exploration to select top-K proposals. Next, we build a semantic completion module to measure the semantic similarity between the selected proposals and the query, compute reward, and provide feedback to the proposal generation module for scoring refinement. Experiments on ActivityNet Captions and Charades-STA demonstrate the effectiveness of our proposed method.

AAAI Conference 2019 Conference Paper

ACM: Adaptive Cross-Modal Graph Convolutional Neural Networks for RGB-D Scene Recognition

  • Yuan Yuan
  • Zhitong Xiong
  • Qi Wang

RGB image classification has achieved significant performance improvement with the resurgence of deep convolutional neural networks. However, mono-modal deep models for RGB images still have several limitations when applied to RGB-D scene recognition. 1) Images for scene classification usually contain more than one typical object with flexible spatial distribution, so object-level local features should also be considered in addition to the global scene representation. 2) Multi-modal features in RGB-D scene classification are still under-utilized. Simply combining these modal-specific features suffers from the semantic gaps between different modalities. 3) Most existing methods neglect the complex relationships among multiple modality features. Considering these limitations, this paper proposes an adaptive cross-modal (ACM) feature learning framework based on graph convolutional neural networks for RGB-D scene recognition. In order to make better use of the modal-specific cues, this approach mines the intra-modality relationships among the selected local features from one modality. To leverage the multi-modal knowledge more effectively, the proposed approach models the inter-modality relationships between two modalities through the cross-modal graph (CMG). We evaluate the proposed method on two public RGB-D scene classification datasets, SUN-RGBD and NYUD V2, and the proposed method achieves state-of-the-art performance.

AAAI Conference 2019 Conference Paper

Memory-Augmented Temporal Dynamic Learning for Action Recognition

  • Yuan Yuan
  • Dong Wang
  • Qi Wang

Human actions captured in video sequences contain two crucial factors for action recognition, i.e., visual appearance and motion dynamics. To model these two aspects, Convolutional and Recurrent Neural Networks (CNNs and RNNs) are adopted in most existing successful methods for recognizing actions. However, CNN-based methods are limited in modeling long-term motion dynamics. RNNs are able to learn temporal motion dynamics but lack effective ways to tackle unsteady dynamics in long-duration motion. In this work, we propose a memory-augmented temporal dynamic learning network, which learns to write the most evident information into an external memory module and ignore irrelevant information. In particular, we present a differentiable memory controller to make a discrete decision on whether the external memory module should be updated with the current feature. The discrete memory controller takes the memory history, context embedding, and current feature as inputs and controls information flow into the external memory module. Additionally, we train this discrete memory controller using the straight-through estimator. We evaluate this end-to-end system on benchmark datasets (UCF101 and HMDB51) of human action recognition. The experimental results show consistent improvements on both datasets over prior works and our baselines.
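The straight-through estimator mentioned above is a standard trick for backpropagating through a hard, non-differentiable decision: threshold in the forward pass, pretend the threshold was the identity in the backward pass. A minimal numpy sketch of the idea (illustrative only, not the paper's controller; function names are hypothetical):

```python
import numpy as np

def ste_forward(logit):
    """Hard binary write/skip decision: 1.0 writes to memory, 0.0 skips."""
    return (logit > 0.0).astype(float)

def ste_backward(grad_out):
    """Straight-through estimator: treat the threshold as the identity,
    so the upstream gradient passes through to the logits unchanged."""
    return grad_out

decision = ste_forward(np.array([-0.3, 1.2]))  # hard 0/1 decisions
grad = ste_backward(np.array([0.5, -0.1]))     # gradient w.r.t. the logits
```

In an autograd framework the same effect is usually obtained by combining the hard value with a detached copy of the soft value, so the forward pass is discrete while the backward pass sees a smooth function.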

AAAI Conference 2018 Conference Paper

Inferring Emotion from Conversational Voice Data: A Semi-Supervised Multi-Path Generative Neural Network Approach

  • Suping Zhou
  • Jia Jia
  • Qi Wang
  • Yufei Dong
  • Yufeng Yin
  • Kehua Lei

To give a more humanized response in Voice Dialogue Applications (VDAs), inferring emotion states from users' queries may play an important role. However, in VDAs, we have a tremendous number of users and a massive scale of unlabeled data with high-dimensional features from multimodal information, which challenge traditional speech emotion recognition methods. In this paper, to better infer emotion from conversational voice data, we propose a semi-supervised multi-path generative neural network. Specifically, we first build a novel supervised multi-path deep neural network framework. To avoid high-dimensional input, raw features are trained by groups in local classifiers. Then the high-level features of each local classifier are concatenated as input to a global classifier. These two kinds of classifiers are trained simultaneously through a single objective function to achieve more effective and discriminative emotion inference. To further solve the labeled-data-scarcity problem, we extend the multi-path deep neural network to a generative model based on a semi-supervised variational autoencoder (semi-VAE), which is able to train on the labeled and unlabeled data simultaneously. Experiments based on a 24,000-sample real-world dataset collected from Sogou Voice Assistant (SVAD13) and the benchmark dataset IEMOCAP show that our method significantly outperforms the existing state-of-the-art results.

IJCAI Conference 2018 Conference Paper

Nonrigid Points Alignment with Soft-weighted Selection

  • Xuelong Li
  • Jian Yang
  • Qi Wang

Point set registration (PSR) is a crucial problem in computer vision and pattern recognition. Existing PSR methods cannot align point sets robustly due to degradations, such as deformation, noise, occlusion, outlier, and multi-view changes. In this paper, we present a self-selected regularized Gaussian fields criterion for nonrigid point matching. Unlike most existing methods, we formulate the registration problem as a sparse approximation task with low rank constraint in reproducing kernel Hilbert space (RKHS). A self-selected mechanism is used to dynamically assign real-valued label for each point in an accuracy-aware weighting manner, which makes the model focus more on the reliable points in position. Based on the label, an equivalent matching number optimization is embedded into the non-rigid criterion to enhance the reliability of the approximation. Experimental results show that the proposed method can achieve a better result in both registration accuracy and correct matches compared to state-of-the-art approaches.

AAAI Conference 2017 Conference Paper

A Multiview-Based Parameter Free Framework for Group Detection

  • Xuelong Li
  • Mulin Chen
  • Feiping Nie
  • Qi Wang

Group detection is fundamentally important for analyzing crowd behaviors, and has attracted plenty of attention in artificial intelligence. However, existing works mostly have limitations due to insufficient utilization of crowd properties and arbitrary processing of individuals. In this paper, we propose the Multiview-based Parameter Free (MPF) approach to detect groups in crowd scenes. The main contributions made in this study are threefold: (1) a new structural context descriptor is designed to characterize the structural property of individuals in crowd motions; (2) a self-weighted multiview clustering method is proposed to cluster feature points by incorporating their motion and context similarities; (3) a novel framework is introduced for group detection, which is able to determine the group number automatically without any parameter or threshold to be tuned. Extensive experiments on various real-world datasets demonstrate the effectiveness of the proposed approach, and show its superiority against state-of-the-art group detection techniques.

IJCAI Conference 2017 Conference Paper

Convolutional 2D LDA for Nonlinear Dimensionality Reduction

  • Qi Wang
  • Zequn Qin
  • Feiping Nie
  • Yuan Yuan

Representing high-volume and high-order data is an essential problem, especially in the machine learning field. Although existing two-dimensional (2D) discriminant analysis achieves promising performance, its single, linear projection features make it difficult to analyze more complex data. In this paper, we propose a novel convolutional two-dimensional linear discriminant analysis (2D LDA) method for data representation. In order to deal with nonlinear data, a specially designed Convolutional Neural Network (CNN) is presented, which can be proved to have an objective function equivalent to that of common 2D LDA. In this way, the discriminant ability can benefit from not only the nonlinearity of Convolutional Neural Networks, but also the powerful learning process. Experimental results on several datasets show that the proposed method performs better than other state-of-the-art methods in terms of classification accuracy.

IJCAI Conference 2017 Conference Paper

Locality Adaptive Discriminant Analysis

  • Xuelong Li
  • Mulin Chen
  • Feiping Nie
  • Qi Wang

Linear Discriminant Analysis (LDA) is a popular technique for supervised dimensionality reduction, and its performance is satisfying when dealing with Gaussian distributed data. However, the neglect of local data structure makes LDA inapplicable to many real-world situations. So some works focus on the discriminant analysis between neighbor points, which can be easily affected by the noise in the original data space. In this paper, we propose a new supervised dimensionality reduction method, Locality Adaptive Discriminant Analysis (LADA), to learn a representative subspace of the data. Compared to LDA and its variants, the proposed method has three salient advantages: (1) it finds the principal projection directions without imposing any assumption on the data distribution; (2) it is able to exploit the local manifold structure of data in the desired subspace; (3) it exploits the points' neighbor relationships automatically without introducing any additional parameter to be tuned. Performance on synthetic datasets and real-world benchmark datasets demonstrates the superiority of the proposed method.
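The classic LDA baseline that LADA builds on reduces, in the two-class case, to Fisher's criterion: the projection direction is proportional to the inverse within-class scatter times the difference of class means. A small self-contained sketch of that baseline (not LADA itself):

```python
import numpy as np

def lda_direction(X, y):
    """Fisher's two-class LDA: w is proportional to Sw^{-1} (mu1 - mu0),
    maximizing between-class scatter relative to within-class scatter."""
    X0, X1 = X[y == 0], X[y == 1]
    mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
    # Within-class scatter as the sum of per-class covariance matrices.
    Sw = np.cov(X0, rowvar=False) + np.cov(X1, rowvar=False)
    w = np.linalg.solve(Sw, mu1 - mu0)
    return w / np.linalg.norm(w)

# Two well-separated 2D classes along the diagonal.
X = np.array([[0., 0.], [1., 1.], [0., 1.], [4., 4.], [5., 5.], [4., 5.]])
y = np.array([0, 0, 0, 1, 1, 1])
w = lda_direction(X, y)
```

LADA's departure from this baseline, per the abstract, is that the scatter is reweighted adaptively by local neighbor relationships rather than computed globally as above.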

AAAI Conference 2017 Conference Paper

Quantifying and Detecting Collective Motion by Manifold Learning

  • Qi Wang
  • Mulin Chen
  • Xuelong Li

The analysis of collective motion has attracted many researchers in artificial intelligence. Though plenty of works have been done on this topic, the achieved performance is still unsatisfying due to the complex nature of collective motions. By investigating the similarity of individuals, this paper proposes a novel framework for both quantifying and detecting collective motions. Our main contributions are threefold: (1) the time-varying dynamics of individuals are deeply investigated to better characterize the individual motion; (2) a structure-based collectiveness measurement is designed to precisely quantify both individual-level and scene-level properties of collective motions; (3) a multi-stage clustering strategy is presented to discover a more comprehensive understanding of the crowd scenes, containing both local and global collective motions. Extensive experimental results on real world data sets show that our method is capable of handling crowd scenes with complicated structures and various dynamics, and demonstrate its superior performance against state-of-the-art competitors.

IROS Conference 2005 Conference Paper

The Pantograph Mk-II: a haptic instrument

  • Gianni Campion
  • Qi Wang
  • Vincent Hayward

We describe the redesign and the performance evaluation of a high-performance haptic device system called the Pantograph. The device is based on a two degree-of-freedom parallel mechanism which was designed for optimized dynamic performance, but which also is well kinematically conditioned. The results show that the system is capable of producing accurate tactile signals in the DC-400 Hz range and can resolve displacements of the order of 10 μm. Future improvements are discussed.

IROS Conference 2002 Conference Paper

A prototype virtual haptic bronchoscope

  • Qi Wang
  • Yongsheng Ou
  • Yangsheng Xu

In this paper, we describe the design of the hardware and software for a virtual bronchoscope with force feedback. A haptic interface allows surgeons to feel the reaction force of virtual pneumonic surgery as if they were touching the area directly. We present novel algorithms for haptic force rendering, and examine its ability to display force. The rendering algorithms have been interfaced with a force-reflecting device. This virtual haptic bronchoscope is of significance in training inexperienced doctors in pneumonic diagnosis and surgery.

ICRA Conference 2000 Conference Paper

On Tracking Control of Mobile Manipulators

  • Wenjie Dong
  • Yangsheng Xu
  • Qi Wang

This paper studies the tracking control problem of mobile manipulators with consideration of the interaction between the mobile platform and the manipulator. A global tracking controller is proposed based on the dynamics of the defined tracking error and the extended Barbalat's lemma. The proposed controller ensures that the full state of the system asymptotically tracks the given desired trajectory globally in the presence of the system coupling. Extensive simulations presented in the paper show the effectiveness of the proposed approach.

ICRA Conference 1998 Conference Paper

Towards Real-Time Robot Programming by Human Demonstration for 6D Force Controlled Actions

  • Qi Wang
  • Joris De Schutter

An approach for real-time robot programming by human demonstration for 6D force controlled actions is presented. A human operator utilises a joystick to guide a robot with a force sensor to execute a task including continuous contact between a manipulated object and an unmodelled environment. During the demonstration, the position, velocity and force of the manipulated object as well as the human commands via the joystick are recorded. In real-time, the recorded information is translated into a textual robot program providing more robust execution in the presence of uncertainties. This approach has three main features: (1) online control type adjustment; (2) automatic subtask termination; (3) real-time program generation. Experiments show the potential industrial applicability.

IROS Conference 1996 Conference Paper

An environment for compliant motion programming by human demonstration

  • Sean Graves
  • Qi Wang
  • Wim Witvrouw
  • Joris De Schutter

An integrated system for programming by demonstration, visualizing, and executing compliant motion programs is described. A human operator utilises a joystick to guide a robot with a force sensor to do a task including continuous contact between manipulator and environment. The demonstration may be executed either on an actual robot, or in a graphically simulated environment. During the demonstration, the position, velocity and force of the object manipulated are acquired. Then the recorded data are processed, analysed, and translated into a textual robot program, which provides more robust execution in the presence of uncertainties. The system is composed of a model-based reaction force simulator, a visualization package, a rule-based translator, and an interpreter for compliant motion programs. Experiments show the industrial applicability.

ICRA Conference 1996 Conference Paper

Derivation of compliant motion programs based on human demonstration

  • Qi Wang
  • Joris De Schutter
  • Wim Witvrouw
  • Sean Graves

An approach to force controlled robot programming by human demonstration is presented. A human operator utilises a joystick to guide a robot with a force sensor to do a task including continuous contact between a manipulated object and an un-modelled environment. During the demonstration, the position, velocity and force of the object manipulated are acquired. Then the recorded data are processed, analysed, and translated into a textual robot program, which provides more robust execution in the presence of uncertainties. This approach consists of three key techniques-data processing, subtask segmentation and termination condition identification. A software package is developed to generate the programs automatically. Experiments show the industrial applicability.