Arrow Research search

Author name cluster

Peng Wang

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

115 papers
2 author rows

Possible papers

EAAI Journal 2026 Journal Article

A highly deterministic defect detection method for high-resolution weld based on deep learning and entropy quantization theory

  • Liangliang Li
  • Peng Wang
  • Zhigang Lü
  • RuoHai Di
  • Mengyu Sun
  • Xueren Wang
  • Bin Wang

In the field of welding quality control, precise defect detection is crucial for ensuring structural safety and extending service life. Addressing the challenges in welding defect detection, this paper introduces a high-certainty defect detection model that integrates a time series model of weld feature information with entropy quantization theory. The model initially employs a weld localization model based on SCLT (Stack-CNN-LSTM-Transfer), which combines convolutional neural network (CNN) and long short-term memory network (LSTM) and utilizes transfer learning to process sequential data, thereby achieving high-precision localization of the weld area. Additionally, to overcome the limitations of existing datasets, a data augmentation method that dynamically adjusts the quality of recombined images is designed, enhancing the diversity of the datasets in terms of annotation pixel coverage and segmented area size distribution. Furthermore, this paper proposes a strategy of hybrid feature enhancement and multi-pool fusion coding. By designing multi-path feature fusion and multi-pool fusion modules, along with a cross-layer adaptive feature fusion decoding module, it achieves deep feature fusion and processing. Moreover, to address the insufficiency of deterministic output, a high-certainty dynamic kernel defect detection module is designed based on entropy quantization theory to enhance the certainty of defect detection outputs. Experimental results indicate that the model's localization accuracy at the upper and lower boundaries of the weld is 5.5213 and 6.1313, respectively, demonstrating superior localization capabilities. On the WFR (Welding feature reorganization) dataset, the model's DICE, Precision, Recall, and Jaccard reached 0.9090, 0.8646, 0.9623, and 0.8364, respectively, achieving the best detection accuracy compared to existing methods.
Concurrently, significant performance improvements have been observed on the DAR (Dynamically adjustable recombination) dataset constructed in this paper. This work can effectively advance the development of welding defect detection technology and provide robust technical support for industrial automation quality control.
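
As background for the DICE, Precision, Recall, and Jaccard figures quoted in this abstract, here is a minimal sketch of how these four scores are conventionally computed from a predicted binary mask and a ground-truth binary mask. The function name `segmentation_metrics` and the toy masks are illustrative assumptions, not code from the paper, and the paper may aggregate its scores differently (for example, by averaging per image).

```python
def segmentation_metrics(pred, truth):
    """Return Dice, precision, recall and Jaccard for two flat binary masks."""
    if len(pred) != len(truth):
        raise ValueError("masks must have the same length")

    # Count true positives, false positives and false negatives pixel by pixel.
    tp = sum(1 for p, t in zip(pred, truth) if p and t)
    fp = sum(1 for p, t in zip(pred, truth) if p and not t)
    fn = sum(1 for p, t in zip(pred, truth) if not p and t)

    def safe_div(num, den):
        # Avoid division by zero when a mask is empty.
        return num / den if den else 0.0

    return {
        "dice": safe_div(2 * tp, 2 * tp + fp + fn),
        "precision": safe_div(tp, tp + fp),
        "recall": safe_div(tp, tp + fn),
        "jaccard": safe_div(tp, tp + fp + fn),
    }


# Toy example (hypothetical data): 2 true positives, 1 false positive, 1 false negative.
pred = [1, 1, 1, 0, 0, 0]
truth = [1, 1, 0, 1, 0, 0]
m = segmentation_metrics(pred, truth)
print(m)
```

On a single mask pair, Dice and Jaccard are tied by Dice = 2J / (1 + J), so the two scores always rank a pair of predictions the same way; scores averaged over many images need not satisfy this identity exactly.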

EAAI Journal 2026 Journal Article

A novel multi-modal attentional collaborative learning framework with semantic enhancement for audio–visual question answering

  • Jie Yang
  • Miao Ma
  • Peng Wang
  • Yutong Li
  • Zhao Pei
  • Chao Yao
  • Longjiang Guo

The Audio–Visual Question Answering (AVQA) task aims to extract audio and visual cues from videos for answering the questions. The popular two-stage method, such as Progressive Spatio-Temporal Perception Network (PSTP-Net), first locates key segments in the audio–visual scene based on the question and then identifies the most relevant audio–visual regions. While this reduces cue redundancy, it overlooks the complementary role of rich cues, which is crucial for a comprehensive understanding of audio–visual content. In this paper, we propose a novel framework to start from the question itself, guide the entire multi-modal collaborative learning process, and conduct audio–visual question answering. This method includes a semantically enhanced strategy using Multi-modal Large Language Models (MLLMs) applied as an engineering solution, and a multi-modal attentional collaborative learning process, which is the core algorithmic innovation. Extensive experiments on the Music Audio–Visual Question Answering dataset (MUSIC-AVQA) and Music Audio–Visual Question Answering dataset version 2 (MUSIC-AVQA v2) demonstrate the effectiveness of our method. Compared to the PSTP-Net, our method reduces the number of training parameters by 61.23% and Floating-point Operations (FLOPs) by 60.83%, while achieving 2.61 percentage-point improvement in accuracy. This indicates that our method effectively captures and aligns rich audio–visual cues, significantly enhancing reasoning efficiency. Our code will be publicly available soon.

JBHI Journal 2026 Journal Article

APSevLM: Acute Pancreatitis Severity Language Model

  • Leqi Zheng
  • Jiajun Fang
  • Hongyi Chen
  • Naiqing Li
  • Yunyuan Huang
  • Qiulin Ge
  • Yang Gu
  • Tao Yu

Approximately one-fifth of patients with acute pancreatitis (AP) develop severe forms, which are associated with high mortality rates, making early prediction of severity crucial for effective patient management. In this study, we present APSevLM (Acute Pancreatitis Severity Language Model), a large language model (LLM)-based approach that integrates admission-time clinical data, imaging reports, and expert knowledge to predict AP severity at an early stage. Through a comprehensive evaluation using data from over five hundred patients, APSevLM outperforms traditional scoring systems (BISAP and MCTSI), conventional machine learning algorithms, and state-of-the-art deep learning models, achieving an AUC of 0.857. Attention visualizations of the model explain complex mechanisms that dynamically weigh different information modalities based on case severity. Furthermore, a systematic feature importance analysis identifies key predictive factors, particularly hematological parameters and cardiac markers, offering valuable insights for clinical practice. Our study positions APSevLM as an accurate predictive model and highlights potential biomarkers for the early diagnosis of severe AP.

AAAI Conference 2026 Conference Paper

Balanced Knowledge Distillation for Large Language Models with Mix-of-Experts

  • Jiajun Liu
  • Yao He
  • Wenjun Ke
  • Peng Wang
  • Ziyu Shang
  • Guozheng Li
  • Zijie Xu

Mixture-of-Experts (MoE) architectures have recently become a more prevalent choice for large language models (LLMs) than dense architectures due to their superior performance. However, billions of parameters bring MoE LLMs a huge cost for deployment and inference. To address these issues, knowledge distillation (KD) has become a widely adopted technique to compress LLMs. Existing KD methods for LLMs can be divided into dense-to-dense and moe-to-dense distillation. Dense-to-dense distillation transfers knowledge between single dense LLMs, while moe-to-dense distillation attempts to transfer knowledge between the MoE LLMs and the dense LLMs. However, the architectural mismatch prevents the student from fully absorbing knowledge when distilling MoE LLMs. To address this limitation, we investigate a new distillation setting, moe-to-moe, which aims to fully leverage expert knowledge of teachers and enable the student to absorb it more effectively. Compared to dense-to-dense and moe-to-dense, moe-to-moe suffers from two imbalance issues. First, expert-coverage deficiency reflects an imbalanced knowledge transfer of teacher experts: traditional distillation utilizes only the few experts activated by the teacher router. Second, routing imbalance appears when the student routing distribution drifts from the teacher, which makes it difficult for students to learn how to distribute different experts. To overcome these issues, we propose a novel distillation framework for moe-to-moe, Balanced Distillation (B-Distill), which equally spreads teacher expertise across student experts while regularizing the student router toward teacher-consistent balance. First, to mitigate expert-coverage deficiency, we introduce Monte Carlo exploration, which stochastically perturbs router probabilities so every teacher and student expert is sampled without enlarging the search space. 
Second, to correct routing imbalance and avert load collapse, we propose an entropy-aware router distillation mechanism that aligns the student router with the teacher while curbing over-concentration. Experiments show that B-Distill outperforms baselines by up to 6.6% in Rouge-L.
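
The abstract does not spell out the entropy-aware router distillation objective, so the following is only a sketch of one plausible form under stated assumptions: a KL term that pulls the student's routing distribution toward the teacher's, minus a weighted entropy bonus that discourages the student router from collapsing onto a single expert. The function `router_distill_loss` and the `entropy_weight` parameter are hypothetical names, not taken from B-Distill.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of router logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def entropy(p, eps=1e-12):
    """Shannon entropy of a discrete distribution (in nats)."""
    return -sum(pi * math.log(pi + eps) for pi in p)


def router_distill_loss(teacher_logits, student_logits, entropy_weight=0.1):
    """Assumed loss: align student routing with the teacher, reward spread."""
    teacher = softmax(teacher_logits)
    student = softmax(student_logits)
    # The KL term is zero when the student routes exactly like the teacher.
    alignment = kl_divergence(teacher, student)
    # Subtracting entropy penalises routing mass concentrated on one expert.
    return alignment - entropy_weight * entropy(student)


# Hypothetical router logits over four experts.
teacher_logits = [1.0, 0.5, 0.0, 0.0]
aligned = router_distill_loss(teacher_logits, [1.0, 0.5, 0.0, 0.0])
collapsed = router_distill_loss(teacher_logits, [10.0, 0.0, 0.0, 0.0])
print(aligned < collapsed)
```

In this toy setup, a student that copies the teacher's routing receives a lower loss than one that sends nearly all tokens to a single expert, which is the qualitative behaviour the abstract describes.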

AAAI Conference 2026 Conference Paper

Benchmarking and Enhancing Rule Knowledge-Driven Reasoning of Large Language Models

  • Zijie Xu
  • Wenjun Ke
  • Peng Wang
  • Guozheng Li
  • Qingjian Ni
  • Jiajun Liu
  • Ziyu Shang
  • Jing Zhou

Large Language Models (LLMs) have demonstrated strong capabilities across diverse tasks under the example-driven learning paradigm. However, in high-stakes domains such as emergency response and industrial safety, historical incidents are scarce, confidential, or both, while concise rule books are abundant. We formalize this underexplored setting as rule knowledge-driven reasoning and ask: Can LLMs reason reliably when rules are plentiful but examples are nearly absent? To study this question, we introduce RULER, an automatic benchmark that generates 32K rigorously verified questions from 1K expert-curated emergency response rules to probe three core abilities: rule memorization, single-rule application, and multi-rule complex reasoning. RULER is further equipped with a hallucination-aware evaluation suite and novel relational metrics. A comprehensive empirical study of five representative LLMs and five enhancement strategies shows that, even when models achieve reliable performance on rule memorization and single-rule application, multi-rule complex reasoning plateaus at 5.4 on a 10-point scale. To address this limitation, we propose RAMPS, a Rule knowledge-Aware Monte Carlo Tree Search Process-reward Supervision framework. RAMPS injects rule knowledge priors into MCTS, distills 12K step-level traces without human annotation, and trains an advantage-based reward model that scores candidate reasoning paths during beam search inference. Experimental results show that RAMPS significantly improves multi-rule complex reasoning performance to 7.7.

AAAI Conference 2026 Conference Paper

Better Matching, Less Forgetting: A Quality-Guided Matcher for Transformer-based Incremental Object Detection

  • Qirui Wu
  • Shizhou Zhang
  • De Cheng
  • Yinghui Xing
  • Lingyan Ran
  • Dahu Shi
  • Peng Wang

Incremental Object Detection (IOD) aims to continuously learn new object classes without forgetting previously learned ones. A persistent challenge is catastrophic forgetting, primarily attributed to background shift in conventional detectors. While pseudo-labeling mitigates this in dense detectors, we identify a novel, distinct source of forgetting specific to DETR-like architectures: background foregrounding. This arises from the exhaustiveness constraint of the Hungarian matcher, which forcibly assigns every ground truth target to one prediction, even when predictions primarily cover background regions (i.e., low IoU). This erroneous supervision compels the model to misclassify background features as specific foreground classes, disrupting learned representations and accelerating forgetting. To address this, we propose a Quality-guided Min-Cost Max-Flow (Q-MCMF) matcher. To avoid forced assignments, Q-MCMF builds a flow graph and prunes implausible matches based on geometric quality. It then optimizes for the final matching that minimizes cost and maximizes valid assignments. This strategy eliminates harmful supervision from background foregrounding while maximizing foreground learning signals. Extensive experiments on the COCO dataset under various incremental settings demonstrate that our method consistently outperforms existing state-of-the-art approaches.
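
To illustrate the general idea of pruning implausible matches before solving a min-cost max-flow assignment, here is a small self-contained sketch. It is not the authors' Q-MCMF implementation: the IoU threshold `min_iou`, the edge cost `1 - IoU`, and the function `pruned_min_cost_matching` are all assumptions chosen for illustration. Ground-truth boxes with no sufficiently overlapping prediction simply stay unmatched instead of being forced onto a background prediction.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def pruned_min_cost_matching(gt_boxes, pred_boxes, min_iou=0.5):
    """Match ground truths to predictions via min-cost max-flow on a pruned graph."""
    n_gt, n_pred = len(gt_boxes), len(pred_boxes)
    source, sink = 0, n_gt + n_pred + 1
    n_nodes = sink + 1
    # Each edge is [target, residual capacity, cost, index of reverse edge].
    graph = [[] for _ in range(n_nodes)]

    def add_edge(u, v, cost):
        graph[u].append([v, 1, cost, len(graph[v])])
        graph[v].append([u, 0, -cost, len(graph[u]) - 1])

    for g in range(n_gt):
        add_edge(source, 1 + g, 0.0)
    for p in range(n_pred):
        add_edge(1 + n_gt + p, sink, 0.0)
    for g, gt in enumerate(gt_boxes):
        for p, pred in enumerate(pred_boxes):
            quality = iou(gt, pred)
            if quality >= min_iou:  # prune geometrically implausible pairs
                add_edge(1 + g, 1 + n_gt + p, 1.0 - quality)

    inf = float("inf")
    while True:
        # Bellman-Ford shortest path in the residual graph.
        dist = [inf] * n_nodes
        prev = [None] * n_nodes
        dist[source] = 0.0
        for _ in range(n_nodes - 1):
            updated = False
            for u in range(n_nodes):
                if dist[u] == inf:
                    continue
                for i, (v, cap, cost, _) in enumerate(graph[u]):
                    if cap > 0 and dist[u] + cost < dist[v] - 1e-12:
                        dist[v] = dist[u] + cost
                        prev[v] = (u, i)
                        updated = True
            if not updated:
                break
        if prev[sink] is None:
            break  # no augmenting path left: flow is maximal
        v = sink
        while v != source:  # push one unit of flow along the path
            u, i = prev[v]
            graph[u][i][1] -= 1
            graph[v][graph[u][i][3]][1] += 1
            v = u

    # A saturated ground-truth -> prediction edge is a match.
    matches = {}
    for g in range(n_gt):
        for v, cap, _, _ in graph[1 + g]:
            if n_gt < v < sink and cap == 0:
                matches[g] = v - 1 - n_gt
    return matches


# Hypothetical boxes: the second ground truth has no overlapping prediction.
gt_boxes = [(0, 0, 10, 10), (20, 20, 30, 30)]
pred_boxes = [(1, 1, 11, 11), (100, 100, 110, 110)]
matches = pruned_min_cost_matching(gt_boxes, pred_boxes)
print(matches)
```

An exhaustive one-to-one matcher would pair the second ground truth with the distant prediction; in this sketch it is left unmatched because no edge survives pruning.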

AAAI Conference 2026 Conference Paper

Beyond Static: Related Questions Retrieval Through Conversations in Community Question Answering

  • Xiao Ao
  • Jie Zou
  • Yibiao Wei
  • Peng Wang
  • Weikang Guo

In community question answering (cQA) platforms like Stack Overflow, related question retrieval is recognized as a fundamental task that allows users to retrieve related questions to answer user queries automatically. Although many traditional approaches have been proposed for investigating this research field, they mostly rely on static approaches and neglect the interaction property. We argue that the conversational way can well distinguish the fine-grained representations of questions and has great potential to improve the performance of question retrieval. In this paper, we propose a related question retrieval model through conversations, called TeCQR, to locate related questions in cQA. Specifically, we build conversations by utilizing tag-enhanced clarifying questions. In addition, we design a noise tolerance model that evaluates the semantic similarity between questions and tags, enabling the model to effectively handle noisy feedback. Moreover, the tag-enhanced two-stage offline training is proposed to fully exploit the mutual relationships among user queries, questions, and tags to learn their fine-grained representations. Based on the learned representations and contextual conversations, TeCQR incorporates conversational feedback by learning to ask tag-enhanced clarifying questions to retrieve related questions more effectively. Experimental results demonstrate that our model significantly outperforms state-of-the-art baselines.

AAAI Conference 2026 Conference Paper

Do Large Language Models Reason About Uncertainty Like Humans? A Benchmark on Hurricane Forecast Visualization Comprehension

  • Le Liu
  • Yuhao Wang
  • Bohan Shen
  • Wei Zeng
  • Shizhou Zhang
  • Di Xu
  • Peng Wang

Uncertainty visualizations, such as hurricane cones and ensemble tracks, are essential for risk communication but are often misinterpreted, leading to harmful decisions. As AI assistants like large language models (LLMs) increasingly support understanding of graphics and decision-making, they offer a promising pathway to enhance the interpretation of complex visualizations and a new opportunity to examine and improve the interpretation of uncertainty. We introduce UnReason, the first benchmark that systematically compares how humans and LLMs reason about hurricane forecast uncertainty visualizations. UnReason spans two escalating phases, seven representative visualization formats, six real hurricane cases, and three agent types (humans, LLMs with context, and LLMs without context), including 880 visualizations and 117,600 structured question–answer pairs under matched evaluation conditions. Phase 1 evaluates reasoning across implicit and explicit uncertainty encodings; Phase 2 examines reasoning under single- versus multi-dimensional uncertainty representations. We thoroughly assess damage estimation, reasoning strategies, and comprehension patterns, revealing that LLMs have a stronger semantic and conceptual understanding of uncertainty, and are less misled by visual variability, but still replicate key human biases during decision-making. Our findings offer insights into aligning LLM behavior with human cognition in uncertainty-rich visual reasoning tasks.

EAAI Journal 2026 Journal Article

Fractal-guided multi-scale contrastive learning for robust liver tumor classification in ultrasound

  • Xuping Zhang
  • Qingyuan Zhang
  • Tao Zhang
  • Ge Song
  • Peng Wang

Ultrasound-based liver lesion classification remains challenging because diagnostically relevant cues are often localized, structurally heterogeneous, and easily corrupted by speckle noise, boundary ambiguity, and acquisition variability. Existing deep learning methods mainly emphasize global image-level representations and may therefore be suboptimal for modeling lesion-relevant local morphology. We propose a structure-constrained multi-scale contrastive learning framework for liver ultrasound classification. During training, semantic-guided anchoring is used to identify informative local regions, within which positive samples are selected under joint constraints of semantic similarity, local statistical consistency, and fractal-based structural complexity. A multi-scale local positive mining strategy and multi-positive supervised contrastive objective are further introduced to capture lesion patterns across different spatial granularities and enhance class discrimination. Importantly, the framework requires no auxiliary annotations during inference and operates directly on full ultrasound images. Experiments on a private four-class liver ultrasound dataset and two public ultrasound datasets demonstrate consistent improvements over representative convolutional neural network-based, Transformer-based, self-supervised, and ultrasound-specific methods. On the private dataset, the proposed method achieved 97.27% accuracy, and maintained 91.15% accuracy on an independent external validation cohort. Ablation studies and interpretability analyses further show that the method improves both structural discriminability and lesion-focused attention. Overall, the proposed framework provides an effective and practically deployable solution for robust liver lesion classification in ultrasound imaging.

AAAI Conference 2026 Conference Paper

From Dialogue to Destination: Geography-Aware Large Language Models with Multimodal Fusion for Conversational Recommendation

  • Yeming Li
  • Chenxi Liu
  • Jie Zou
  • Cheng Long
  • Chaoning Zhang
  • Peng Wang
  • Yang Yang

Conversational Recommender Systems (CRS) aim to provide personalized recommendations by interacting with users through natural language dialogue. However, in scenarios requiring deep geospatial awareness, existing methods, including those based on Large Language Models (LLMs), still face significant challenges in effectively fusing heterogeneous, multimodal geographic information with dynamic dialogue context. Simple fusion strategies struggle to resolve the asymmetric dependencies between dynamic user intent and static geographic context and fail to bridge the semantic gap between LLMs and structured geospatial data. To address these issues, we propose a framework for geography-aware CRS, named GeoCRS. Our core idea is to empower a frozen LLM with powerful geospatial reasoning capabilities by conditioning it on a dynamic, multimodal guidance signal generated by an external fusion architecture, all without altering the LLM's internal parameters. Specifically, we first design a hierarchical geographical encoder to uniformly represent heterogeneous geographic data. Subsequently, we introduce a contextual feature modulation module that asymmetrically injects the geographic context into the user's dialogue intent via a novel modulation mechanism to improve conversational recommendation via both geographic and dialogue context. Extensive experiments on public benchmark datasets demonstrate that our proposed GeoCRS significantly outperforms state-of-the-art baselines on the geography-aware conversational recommendation task.

JBHI Journal 2026 Journal Article

FusionMVSA: Multi-View Fusion Strategy With Self-Attention for Enhancing Drug Recommendation

  • Yajie Meng
  • Zhuang Zhang
  • Xudong Shang
  • Xianfang Tang
  • Jincan Li
  • Zilong Zhang
  • Feifei Cui
  • Shuting Jin

Leveraging the wealth of biomedical data available, we can derive insights into the relationships between biological entities from various angles. This underscores the complexity and significance of developing a dynamic approach for integrating data from multiple sources, a critical endeavor in drug recommendation. In this study, we introduce an innovative deep learning approach termed “Multi-View Fusion Strategy with Self-Attention” (FusionMVSA), designed to predict associations between drugs and diseases. To effectively amalgamate data from diverse sources and extract representative features, we have developed a feature extraction mechanism that capitalizes on similarities. This mechanism computes self-attention across multiple perspectives using shared group parameters, thereby highlighting common characteristics. Simultaneously, we utilize biomedical similarities among multi-source data as guiding factors for calculating similarity, enabling the capture of more nuanced features. Subsequently, we integrate these features through a feature fusion process, where known associations between drugs and diseases act as guiding terms. This strategy allows us to uncover the complementary aspects of different viewpoints. Ultimately, we predict potential drug-disease associations using a multi-layer perceptron neural network. Our methodology has undergone rigorous testing through various cross-validation experiments and case studies. We are confident that FusionMVSA will prove to be a valuable tool in drug recommendation, offering new avenues for exploration and discovery in the quest to combat diseases.

EAAI Journal 2026 Journal Article

Hierarchical detection and evaluation method for surface defects based on dynamic feature selection and uncertainty-guided optimization

  • Liangliang Li
  • Peng Wang
  • RuoHai Di
  • Chao Xu
  • Mengyu Sun
  • Zhigang Lü
  • Yu Zhang

Steel is an essential industrial foundational material, and surface defects can severely compromise product performance and service life. Conventional detection methods are subject to numerous limitations. While deep learning-based approaches excel at defect localization, they remain deficient in extracting precise edge contour information and quantifying model uncertainty. To address these challenges, this paper proposes a hierarchical steel surface defect detection and evaluation framework based on dynamic feature selection and uncertainty-guided optimization. The proposed ConfidenceSeg-Net model integrates feature enhancement and dynamic selection mechanisms. During encoding, the ReSidual U-block module extracts and adaptively enhances multi-scale features. In the decoding phase, multi-scale feature fusion and attention-guided feature fusion modules are employed to integrate feature maps. The segmentation head combines depth feature space and side-output modules to generate final segmentation maps, thereby improving both accuracy and fine-detail representation. Furthermore, an uncertainty-constrained, credibility-driven loss optimization function is designed, comprising basic loss, credibility-weighted loss, uncertainty regularization, and credibility consistency loss terms. The weighted combination of these components significantly enhances model performance and reliability. Finally, a multi-dimensional comprehensive evaluation system is established to assess defect detection reliability. This system encompasses seven key metrics: primary prediction reliability, average auxiliary prediction reliability, mean uncertainty, multi-scale feature expressiveness, prediction stability, inter-layer feature consistency, and prediction diversity. Dimension-specific scores are computed and fused through weighted integration to generate a comprehensive credibility score, enabling thorough and nuanced performance assessment. 
Experimental results on public datasets demonstrate superior performance, with Precision, Recall, and Jaccard indices reaching 0.8808, 0.8877, and 0.7917, respectively. The proposed framework also provides a portable, comprehensive credibility evaluation mechanism, substantially enhancing practical applicability and reliability.

AAAI Conference 2026 Conference Paper

Learning from Human Gaze: Human-like Robot Social Navigation in Dense Crowds

  • Zhecheng Yu
  • Yan Lyu
  • Chen Yang
  • Tao Chen
  • Yishuang Zhang
  • Bo Ling
  • Peng Wang
  • Guanyu Gao

Robot navigation in dense crowds requires understanding social cues that humans naturally use, yet existing methods struggle with real-world complexity. We investigate two questions: (1) Where do pedestrians look when navigating crowds? and (2) Can eye tracking improve robot navigation? To answer, we introduce GazeNav, an egocentric dataset collected via wearable eye trackers, featuring synchronized video, gaze, and trajectories in crowded environments. Analysis reveals that the gaze of pedestrians is closely related to the semantic presence and movement of other individuals, exhibiting distinct attention patterns across navigation behaviors. Building on this, we propose Gaze2Nav, a modular framework that first predicts human gaze to infer socially salient pedestrians, then incorporates the semantic attention into motion planning alongside visual inputs. Our method achieves 87.6% salient pedestrian prediction accuracy and reduces trajectory error by 15.4% over state-of-the-art baselines. By aligning with human gaze, our framework improves both performance and interpretability, advancing toward human-like, socially intelligent robot navigation.

AAAI Conference 2026 Conference Paper

Mem4D: Decoupling Static and Dynamic Memory for Dynamic Scene Reconstruction

  • Xudong Cai
  • Shuo Wang
  • Peng Wang
  • Yongcai Wang
  • Zhaoxin Fan
  • Wanting Li
  • Tianbao Zhang
  • Jianrong Tao

Reconstructing dense geometry for dynamic scenes from a monocular video is a critical yet challenging task. Recent memory-based methods enable efficient online reconstruction, but they fundamentally suffer from a Memory Demand Dilemma: The memory representation faces an inherent conflict between the long-term stability required for static structures and the rapid, high-fidelity detail retention needed for dynamic motion. This conflict forces existing methods into a compromise, leading to either geometric drift in static structures or blurred, inaccurate reconstructions of dynamic objects. To address this dilemma, we propose Mem4D, a novel framework that decouples the modeling of static geometry and dynamic motion. Guided by this insight, we design a dual-memory architecture: 1) The Transient Dynamics Memory (TDM) focuses on capturing high-frequency motion details from recent frames, enabling accurate and fine-grained modeling of dynamic content; 2) The Persistent Structure Memory (PSM) compresses and preserves long-term spatial information, ensuring global consistency and drift-free reconstruction for static elements. By alternating queries to these specialized memories, Mem4D simultaneously maintains static geometry with global consistency and reconstructs dynamic elements with high fidelity. Experiments on challenging benchmarks demonstrate that our method achieves state-of-the-art or competitive performance while maintaining high efficiency.

AAAI Conference 2026 Conference Paper

Optimizing LoRA Allocation of MoE with the Alignment of Topic Correlation

  • Hengyuan Xu
  • Wenjun Ke
  • Yao He
  • Jiajun Liu
  • Dong Nie
  • Peng Wang
  • Ziyu Shang
  • Zijie Xu

Mixture of experts (MoE) dynamically routes inputs to specialized expert networks to scale model capacity with low inference overhead. However, the excessive parameter growth in MoE models poses challenges in low-resource settings. To address these issues, MoE with parameter-efficient fine-tuning (PEFT) methods have emerged as a lightweight adaptation paradigm that distributes knowledge among experts via multiple LoRA blocks. Existing MoE-PEFT methods can be broadly categorized into External and Internal PEFT methods. External PEFT methods incorporate lightweight models into existing MoE architectures without modifying their routing, which limits the model’s parameter efficiency. To overcome these issues, Internal PEFT methods integrate MoE architectures into PEFT, enabling minimal parameter overhead. However, they still face two major challenges: (1) lack of expert functional differentiation, resulting in overlapping specialization across modules, and (2) absence of a structured attribution mechanism to guide expert selection based on semantic relevance. To alleviate these challenges, we propose TopicLoRA, a novel three-stage framework that leverages topic knowledge as semantic anchors to guide expert allocation. Specifically, (1) to address expert redundancy, we construct a topic-level prior graph using Graph Neural Network-enhanced representation learning over Big-Bench categories, enforcing structural separation among expert embeddings, and (2) to introduce semantic attribution, we design a dual-loss training mechanism that softly aligns input-query relevance with topic-guided routing distributions via KL divergence. Extensive experiments on representative datasets (e.g., MMLU, GSM8K, Flanv2) demonstrate that TopicLoRA outperforms state-of-the-art PEFT baselines by 2.40% on average in accuracy. Notably, the maximum improvement is 4.21%. 
Furthermore, ablation studies demonstrate our framework's robustness to intricate topics and input sequence variations, which stems from the dual-loss training mechanism.

AAAI Conference 2026 Conference Paper

Seeing Beyond Illusion: Generalized and Efficient Mirror Detection

  • Mingfeng Zha
  • Guoqing Wang
  • Tianyu Li
  • Wei Dong
  • Peng Wang
  • Yang Yang

Reflective imaging enables the mirror imagings and physical entities to possess identical attributes, e.g., color and shape. Current mirror detection (MD) methods primarily rely on designing functional components to establish the correlation and disparities between the imagings and entities, thereby identifying the mirror regions. However, the exploration of extended scenes with dynamic content changes is rarely investigated. Therefore, we propose the MirrorSAM designed for MD based on the Segment Anything Model (SAM). Specifically, due to the varying reflections produced by mirrors in different positions and the complex visual space that interferes with localization, we design the hierarchical mixture of direction experts (HMDE) in the low-rank space to reduce biases towards entities in SAM and dynamically adjust experts based on the input scene. We observe differences in depth between mirrors and adjacent areas, and propose the depth token calibration (DTC), which introduces a learnable depth token to generate the depth map and serve as an error correction factor. We further formulate the selective pixel-prototype contrastive (SPPC) loss, selecting partially confusable samples to promote the decoupling of mirror and non-mirror representations. Extensive experiments conducted on four mirror benchmarks and two settings demonstrate that our approach surpasses state-of-the-art methods with few trainable parameters and FLOPs. We further extend to four transparent surface benchmarks to validate generalization.

AAAI Conference 2026 Conference Paper

TargetVAU: Multimodal Anomaly-Aware Reasoning for Target Behavior Understanding in Videos

  • Lingru Zhou
  • Peng Wu
  • Manqing Zhang
  • Qingsheng Wang
  • Guansong Pang
  • Peng Wang

Understanding anomalous human behaviors at a fine-grained level remains a major challenge in complex scenarios. Existing video anomaly understanding (VAU) methods often rely on coarse frame-level cues or overlook structured modeling of individual actions, limiting their capacity for reasoning about human interactions and accountability. To address these challenges, we propose TargetVAU, a multimodal anomaly-aware reasoning framework designed for individual-level anomaly recognition and explanation. TargetVAU first extracts both global-level and human-centric visual features using a frozen Vision Transformer (ViT) encoder. An Anomaly-focused Temporal Sampler is then employed to select behaviorally informative frames via a density-aware strategy guided by predicted anomaly scores. A Spatio-Temporal Interaction Graph is constructed to explicitly model interactions among individuals across time and space. These structured representations are fused with prompt embeddings via a frozen Q-Former to form a unified semantic representation. Finally, a large language model fine-tuned with low-rank adaptation (LoRA) performs instruction-guided reasoning to identify anomalous individuals and generate natural language explanations. Extensive experiments on UCCD and HIVAU-70K demonstrate that TargetVAU significantly outperforms existing methods in both accuracy and interpretability, advancing the state of individual-level anomaly understanding in surveillance videos.

EAAI Journal 2025 Journal Article

A 6-dimensional pose estimation method combining sparse viewpoint classification initialization and optical flow-guided iterative refinement

  • Huan Yang
  • Yue Wang
  • Xinghang Yin
  • Yongxu Liu
  • Peng Wang

Electronic equipment is typically a complex and high-precision electromechanical system, where the routing and bundling of Radio Frequency (RF) cables are crucial to equipment performance. Traditional assembly methods require workers to assemble according to the assembly process card, which can easily lead to incorrect or missing assembly, poor assembly consistency, and low efficiency. Augmented Reality (AR) assembly guidance can effectively improve efficiency and reduce errors. 6-dimensional (6D) pose estimation is a key technology for AR assembly guidance. In the assembly process of complex electronic products, existing deep learning methods suffer from poor tracking and localization robustness and real-time performance due to factors such as arm occlusion, resulting in slow tracking recovery. This article proposes a two-stage, coarse-to-fine, real-time 6D pose estimation method, which can estimate the pose of target objects in complex backgrounds at a speed of about 20 frames per second and quickly recover after losing the tracking target. Its real-time performance and effectiveness were verified through experiments on the red squirrel and electronic chassis.

EAAI Journal 2025 Journal Article

A general framework for chromosomal anomaly detection based on dual constraints of nearest-neighbor and regionality

  • Yue Hao
  • Xin Wang
  • Ge Song
  • Zhiyuan Li
  • Lei Wang
  • Lingwei Li
  • Yongqi Nie
  • Peng Wang

The precise identification of structural chromosomal abnormalities (SCA) is essential for the diagnosis of genetic disorders and malignancies. Traditional karyotype analysis is labor-intensive and necessitates the expertise of cytogeneticists. We propose a dual-constraint enhanced framework that combines nearest-neighbor contrastive learning with one-class classification, facilitating automated abnormality detection without the need for anomalous data. Initially, positive sample pairs are constructed utilizing a Chromosomal Query Library (CQL). This process involves the dynamic selection of nearest neighbors, employing soft nearest neighbor selection and cosine similarity to improve feature consistency. Gaussian noise injection enhances generalization by diversifying representations, whereas a momentum update refines CQL embeddings. The Chromosome Banding module (CB module) extracts chromosomal features at multiple scales, whereas the Chromosome Batch Perception module (CBP module) emphasizes challenging samples through spatial and channel attention mechanisms. In the second stage, we present ChromosomeCutMix to create synthetic chromosomal anomalies, enhancing inter-class separation and improving anomaly detection. The proposed framework attains a classification accuracy of 97.32% and an F1-score of 96.69%, surpassing current methodologies in terms of sensitivity and robustness. Validated on public and clinical datasets, it offers dependable localization of biological anomalies and automated cytogenetic diagnostics, thereby enhancing the analysis of genetic disorders.

AAAI Conference 2025 Conference Paper

A Lightweight Sparse Interaction Network for Time Series Forecasting

  • Xu Zhang
  • Qitong Wang
  • Peng Wang
  • Wei Wang

Recent work shows that linear models can outperform several transformer models in long-term time-series forecasting (TSF). However, instead of explicitly performing temporal interaction through self-attention, linear models implicitly perform it based on stacked MLP structures, which may be insufficient for capturing complex temporal dependencies, leaving room for improvement. To this end, we propose a Lightweight Sparse Interaction Network (LSINet) for the TSF task. Inspired by the sparsity of self-attention, we propose a Multihead Sparse Interaction Mechanism (MSIM). Different from self-attention, MSIM learns the important connections between time steps through a sparsity-induced Bernoulli distribution to capture temporal dependencies for TSF. The sparsity is ensured by the proposed self-adaptive regularization loss. Moreover, we observe the shareability of temporal interactions and propose to perform Shared Interactions Learning (SIL) for MSIM to further enhance efficiency and improve convergence. LSINet is a linear model comprising only MLP structures with low overhead and equipped with explicit temporal interaction mechanisms. Extensive experiments on public datasets show that LSINet achieves both higher accuracy and better efficiency than advanced linear models and transformer models in TSF tasks.

ICLR Conference 2025 Conference Paper

Autoregressive Pretraining with Mamba in Vision

  • Sucheng Ren
  • Xianhang Li
  • Haoqin Tu
  • Feng Wang
  • Fangxun Shu
  • Lei Zhang
  • Jieru Mei
  • Linjie Yang

The vision community has started to build with the recently developed state space model, Mamba, as the new backbone for a range of tasks. This paper shows that Mamba's visual capability can be significantly enhanced through autoregressive pretraining, a direction not previously explored. Efficiency-wise, the autoregressive nature can well capitalize on the Mamba's unidirectional recurrent structure, enabling faster overall training speed compared to other training strategies like mask modeling. Performance-wise, autoregressive pretraining equips the Mamba architecture with markedly higher accuracy over its supervised-trained counterparts and, more importantly, successfully unlocks its scaling potential to large and even huge model sizes. For example, with autoregressive pretraining, a base-size Mamba attains 83.2% ImageNet accuracy, outperforming its supervised counterpart by 2.0%; our huge-size Mamba, the largest Vision Mamba to date, attains 85.0% ImageNet accuracy (85.5% when finetuned with 384×384 inputs), notably surpassing all other Mamba variants in vision. The code is available at https://github.com/OliverRensu/ARM.

IJCAI Conference 2025 Conference Paper

Can Large Models Teach Student Models to Solve Mathematical Problems Like Human Beings? A Reasoning Distillation Method via Multi-LoRA Interaction

  • Xinhe Li
  • Jiajun Liu
  • Peng Wang

Recent studies have demonstrated that Large Language Models (LLMs) have strong mathematical reasoning abilities but rely on hundreds of billions of parameters. To tackle the challenge of poor reasoning in Small Language Models (SLMs), existing methods typically leverage LLMs to generate massive amounts of data for cramming training. In psychology, they are akin to System 1 thinking, which resolves reasoning problems rapidly based on experience and intuition. However, human learning also requires System 2 thinking, where knowledge is first acquired and then reinforced through practice. Inspired by such two distinct modes of thinking, we propose a novel method based on the multi-LoRA Interaction for mathematical reasoning Distillation (LoRID). First, we input the question and reasoning of each sample into an LLM to create knowledge-enhanced datasets. Subsequently, we train a LoRA block on the student model as an Intuitive Reasoner (IR), which directly generates Chain-of-Thoughts for problem-solving. Then, to imitate System 2 thinking, we train the Knowledge Generator (KG) and Deep Reasoner (DR), respectively. The former outputs only knowledge after receiving problems, while the latter uses that knowledge to perform reasoning. Finally, to address the randomness in the generation of IR and DR, we evaluate whether their outputs are consistent, and the inference process needs to be iterated if not. This step can enhance the mathematical reasoning ability of SLMs through mutual feedback. Experimental results show that LoRID achieves state-of-the-art performance, especially on the GSM8K dataset, where it outperforms the second-best method by 2.3%, 16.1%, 2.4%, 12.3%, and 1.8% accuracy across the five base models, respectively. Meanwhile, we select four strong baselines as System 1, and after integrating them with our method, the reasoning ability of student models is consistently and significantly improved. The datasets and codes are available at https://github.com/Xinhe-Li/LoRID.
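
The consistency check between the Intuitive Reasoner and the Deep Reasoner described in this abstract can be sketched as a simple loop. This is a minimal illustration under stated assumptions, not the authors' implementation: `intuitive_reasoner`, `knowledge_generator`, and `deep_reasoner` are hypothetical callables standing in for the three LoRA-adapted models, and the `max_rounds` cap is an assumption that the abstract does not specify.

```python
def solve_with_consistency(question, intuitive_reasoner, knowledge_generator,
                           deep_reasoner, max_rounds=5):
    """Repeat inference until the two reasoning paths agree on an answer.

    All callables are hypothetical placeholders: each reasoner returns a final
    answer, and knowledge_generator returns knowledge text for a question.
    """
    for _ in range(max_rounds):
        # System 1 path: answer directly from the question.
        ir_answer = intuitive_reasoner(question)
        # System 2 path: generate knowledge first, then reason with it.
        knowledge = knowledge_generator(question)
        dr_answer = deep_reasoner(question, knowledge)
        if ir_answer == dr_answer:
            return ir_answer
    # No agreement within the assumed round limit.
    return None
```

With sampling-based generation, repeated calls may return different answers, which is why the loop re-runs both paths instead of caching the first result.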

EAAI Journal 2025 Journal Article

Consistency-based decision-making method with linguistic Q-rung orthopair fuzzy preference relation for power battery selection of new energy vehicles

  • Xin Dong
  • Peide Liu
  • Peng Wang
  • Xiaoming Wu

In an era of global petrochemical depletion and increasingly serious environmental pollution, new energy vehicles, a key industry for building a sustainable low-carbon society, have received growing attention from countries around the world. As the “heart” of new energy vehicles, the power battery plays an important role in the core competitiveness of enterprises. To address the fuzziness and uncertainty of complex power battery selection, this paper establishes a two-stage consistency optimization model based on preference relations and an interactive consistency improvement process. First, by considering the interaction between membership and non-membership, this paper proposes an improved linguistic q-rung orthopair fuzzy weighted averaging operator. Then, the concept of linguistic q-rung orthopair fuzzy preference relation (Lq-ROFPR) is proposed, and its additive consistency index is given based on a linguistic scaling function. Subsequently, for an Lq-ROFPR with unacceptable consistency, an interactive mechanism is proposed to improve the consistency level, which considers the minimum adjustment size of preference modification and the minimum number of adjustment elements in turn. Moreover, a method for solving multi-attribute decision-making problems is formed and applied to the selection of power batteries at XP automobile company. Finally, the simulation experiment and comparative analysis with other methods show the effectiveness and rationality of this method in consistency optimization.

IROS Conference 2025 Conference Paper

ContextCache: Task-Aware Lifecycle Management for Memory-Efficient LLM Agent Deployment

  • Tao Liu
  • Ping Guo
  • Dong Feng
  • Peng Wang

LLM-based agents have demonstrated remarkable capabilities in multi-step reasoning and task execution across domains such as robotics and autonomous systems. However, deploying these agents on resource-constrained platforms presents a fundamental challenge: minimizing latency while optimizing memory usage. Existing caching techniques (KVCache, PrefixCache, PromptCache) improve inference speed by reusing cached context but overlook LLM dependency relationships in agent workflows, leading to excessive memory usage or redundant recomputation across LLM calls. To address this, we propose ContextCache, a task-aware lifecycle management framework that optimizes context fragment caching for multi-step LLM agents. ContextCache predicts the lifespan of each context fragment and dynamically allocates and releases GPU memory accordingly. We evaluate our approach on a newly constructed dataset, covering logistics coordination, assembly tasks, and health management. Experimental results demonstrate a 15% reduction in memory usage compared to state-of-the-art caching strategies, with no loss in inference efficiency, making our approach well-suited for real-world deployment in resource-constrained environments.
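
The idea of releasing a context fragment once its predicted lifespan has ended can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the `FragmentCache` class, its methods, and the step-based notion of lifespan are hypothetical and are not taken from ContextCache, whose prediction and GPU memory management the abstract does not detail.

```python
from dataclasses import dataclass, field


@dataclass
class FragmentCache:
    """Toy cache that frees a context fragment after its predicted last use."""

    # Fragment name -> last workflow step at which it is predicted to be needed.
    last_use: dict = field(default_factory=dict)

    def add(self, name, predicted_last_step):
        """Register a fragment together with its (assumed) predicted lifespan."""
        self.last_use[name] = predicted_last_step

    def release_expired(self, current_step):
        """Drop fragments whose predicted last use precedes current_step."""
        expired = sorted(n for n, s in self.last_use.items() if s < current_step)
        for name in expired:
            del self.last_use[name]
        return expired
```

In this sketch a fragment shared by later LLM calls stays cached, while one needed only by an early call is released as soon as the workflow moves past it.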

AAAI Conference 2025 Conference Paper

DoGA: Enhancing Grounded Object Detection via Grouped Pre-Training with Attributes

  • Yang Liu
  • Feng Hou
  • Yunjie Peng
  • Gangjian Zhang
  • Yao Zhang
  • Dong Xie
  • Peng Wang
  • Yang Zhang

Recent advances in vision-language pre-training have significantly enhanced the model capabilities on grounded object detection. However, these studies often pre-train with coarse-grained text prompts, such as plain category names and brief grounded phrases. This limitation curtails the model's capacity for fine-grained linguistic comprehension and leads to a significant decline in performance when faced with detailed descriptions or contextual information. To tackle these problems, we develop DoGA: Detect objects with Grouped Attributes, which employs commonly apparent attributes to bridge different granular semantics and uses specific attributes to identify the object discrepancy. Our DoGA incorporates three principal components: 1) Generation of attribute-based prompts, consisting of linguistic definitions enriched with common-sense visible attributes and hard negative notations deriving from the image-specific attribute features; 2) Paralleled entity fusion and optimization, designed to manage long attribute-based descriptions and negative concepts efficiently; and 3) Prompt-wise grouped training that enables the model to perform many-to-many assignments, facilitating simultaneous training and inference with multiple attribute-based synonyms. Extensive experiments demonstrate that training with synonymous attribute-based prompts allows DoGA to generalize to multi-granular prompts and surpass previous state-of-the-art approaches, yielding 50.2 on the COCO and 38.0 on the LVIS benchmarks under the zero-shot setting. We will make our code publicly available upon acceptance.

IROS Conference 2025 Conference Paper

Edge-Guided Lighting Adaptation: Real-Time Detection of Transparent Objects for Cell Culture Robot

  • Qingze Huang
  • Peng Wang
  • Xiangyan Zhang
  • Jian Li
  • Shimin Wei

In robot-assisted cell culture tasks, fluctuations in lighting conditions can result in blurred boundaries, intensified reflections, and pronounced refractions of transparent objects. These optical phenomena collectively escalate the complexity of image processing and target recognition. To address these challenges, this paper takes a dual-strategy approach. Firstly, it utilizes the Unity platform to construct a synthetic dataset (STTO-9k) containing 9,000 images of six types of transparent objects, providing abundant training samples for the detection and recognition of transparent objects. Secondly, it proposes an improved YOLOv8 visual detection algorithm (YOLO-Edge-Guided Lighting Adaptation, YL-EGLA). The algorithm realizes feature fusion by combining high-dimensional features of the input, extracted dynamically through the self-attention mechanism, with enhanced edge features extracted by the edge detection operator, and is equipped with an adaptive image enhancement module to ensure stable detection under different lighting conditions. Algorithm comparison results demonstrate that YL-EGLA can be fully trained on the synthetic dataset and directly applied to real-world scenarios without additional fine-tuning. Furthermore, physical experiments validate the efficiency and practicality of this algorithm in transparent object manipulation, fully showcasing its significant value in practical applications.

IJCAI Conference 2025 Conference Paper

FedCM: Client Clustering and Migration in Federated Learning via Gradient Path Similarity and Update Direction Deviation

  • Peng Wang
  • Shoupeng Lu
  • Hao Yin
  • Banglie Yang
  • Tianli Zhu
  • Cheng Dai

Federated learning (FL) enables collaborative training among multiple clients while preserving data privacy. However, its practical application is significantly limited by two major challenges: statistical heterogeneity and data distribution drift. Statistical heterogeneity causes the direction of local model updates to deviate from the global training objective, while data distribution drift leads to a mismatch between local models and their cluster models. To address these challenges, this paper proposes an adaptive clustered federated learning framework, Fed-CM. Initially, by capturing the dynamic patterns of personalized layer parameters in clients' models, Fed-CM effectively characterizes the correlations and distributional similarities among clients, reflecting the underlying statistical heterogeneity. Subsequently, this framework leverages client similarities to construct an undirected graph and adaptively performs effective cluster discovery with minimal dependence on hyperparameters. Furthermore, a monitoring strategy tracks the deviation between clients’ update directions and the dominant update direction of their clusters and then adaptively migrates clients experiencing data drift. Such a dynamic strategy helps maintain intra-cluster homogeneity and addresses the mismatch between local models and their cluster models. Compared to other state-of-the-art methods, experimental results on multiple datasets demonstrate that the proposed Fed-CM framework effectively addresses the challenges posed by statistical heterogeneity and data drift, significantly improving the performance and robustness of federated learning models.

EAAI Journal 2025 Journal Article

IMobileTransformer: A fusion-based lightweight model for rice disease identification

  • Yang Lu
  • Haoyang Zhou
  • Peng Wang
  • Erzhi Wang
  • Gongfa Li
  • Tongjian Yu

Rice blast, sheath blight, leaf scald, bacterial leaf blight, and brown spot severely threaten rice yield. To address the limitations of current deep learning methods in rice disease recognition, particularly their insufficient integration of local and global features, this study proposes an Improved MobileTransformer (IMobileTransformer) model. The proposed architecture synergistically combines MobileNet’s strengths in local feature extraction and lightweight architecture with Transformer’s superior capability in global information processing. Specifically, the model is designed with three functional branches: a) a MobileNet branch utilizing inverted residual structure with depthwise separable convolution layers to reduce parameters and computational complexity, b) a Transformer branch modified from Swin-Transformer architecture, where the Multilayer Perceptron (MLP) layer is enhanced by splitting input feature channels through an Inception-based structure to maintain global feature extraction efficiency while minimizing computational overhead, and c) a feature fusion branch that concatenates reshaped outputs from both branches through channel-wise stacking, enabling effective integration of local and global representations. Experimental results show that compared to classical models such as MobileNetV3-Large, EfficientNet-B0, Vision Transformer Base/16 (ViT-B/16), Shifted Window Transformer (Swin-Transformer), Tiny Vision Transformer (TinyViT), Mobile Vision Transformer (MobileViT), LocalViT-S, IMobileTransformer achieves a recognition accuracy of 99.62% for rice diseases, with improvements of 1.71%, 0.91%, 38.09%, 4.17%, 1.99%, 1.5% and 0.42%, respectively, providing an effective solution for rice disease recognition.

IJCAI Conference 2025 Conference Paper

Improving Consistency Identification in Task-oriented Dialogue Through Multi-Agent Collaboration

  • Peng Wang
  • Shuo Li
  • Ruoxi Zhou
  • Qiguang Chen
  • Xiao Xu
  • Hao Fei
  • Dagang Li
  • Wanxiang Che

Consistency identification in task-oriented dialog (CI-ToD) typically consists of three sub-tasks: User Query Inconsistency (QI) identification, Dialogue History Inconsistency (HI) identification, and Knowledge Base Inconsistency (KBI) identification, which aim to determine inconsistent relationships between system response and user query, dialogue history, and knowledge base. Previous approaches focus on the exploration of deep learning models for CI-ToD. While these models achieve remarkable progress, they still rely on large amounts of labeled data, which is hard to obtain in real-world scenarios. Motivated by this, in this paper, we aim to explore large language models for CI-ToD, which do not require any training data. In addition, we introduce a multi-agent collaboration framework (MAC-CIToD) to model the interaction across the three sub-tasks in CI-ToD, including (1) Full Connection paradigm, (2) Cycle Connection paradigm, and (3) Central Connection paradigm, which effectively builds interaction across QI, HI, and KBI. Experiments on the standard benchmark reveal that our framework achieves superior performance. Additionally, we compare MAC-CIToD with the most advanced trained approaches and find that its zero-shot performance on most metrics even surpasses that of models after training on the CI-ToD dataset.

YNIMG Journal 2025 Journal Article

Mapping subtype-specific disease epicenters and brain aging characteristics in major depressive disorder through normative model-driven analysis of brain structural alterations

  • Peng Wang
  • Yuhong Zheng
  • Li Sun
  • Yang Xiao
  • Xuelian Zang
  • Jinghua Wang
  • Jinhui Wang
  • Shao-Wei Xue

Major depressive disorder (MDD), a prevalent mental health condition, manifests intricate alterations in brain structure that evolve gradually over time and across various brain regions. Despite significant research efforts, two fundamental questions remain unsettled: the precise brain origins of MDD and whether MDD contributes to accelerated brain aging. To this end, we conducted a comprehensive investigation leveraging data from 830 MDD patients and 853 matched healthy controls (HC). Normative models, established on HC gray matter volume (GMV) data, were utilized to quantify individual deviations in GMV among MDD patients. Applying k-means clustering to these deviation profiles, we successfully discerned two clinically distinct subtypes. Subtype 1 is characterized by GMV atrophy, coupled with indications of accelerated brain aging processes. In contrast, subtype 2 exhibits increased GMV without significant acceleration of aging phenomena. Intriguingly, both subtypes converge on the default mode network as a common disease epicenter, highlighting a shared neurophysiological underpinning. However, subtype-specific epicenters diverge, with subtype 1 featuring unique foci primarily in the hippocampus and amygdala, whereas subtype 2 distinguishes itself with epicenters primarily located in the accumbens. This nuanced examination of subtype-specific brain alterations, incorporating their intricate spatiotemporal dynamics, provides profound insights into the heterogeneity and complexity inherent in MDD.

YNICL Journal 2025 Journal Article

Mechanisms underlying the spontaneous reorganization of depression network after stroke

  • Yirong Fang
  • Xian Chao
  • Zeyu Lu
  • Hongmei Huang
  • Ran Shi
  • Dawei Yin
  • Hao Chen
  • Yanan Lu

Exploring the causal relationship between focal brain lesions and post-stroke depression (PSD) can provide therapeutic insights. However, a gap exists between causal and therapeutic information. Exploring post-stroke brain repair processes could bridge this gap. We defined a depression network using the normative connectome and investigated the predictive capacity of lesion-induced network damage on depressive symptoms in a discovery cohort of 96 patients, at baseline and six months post-stroke. Stepwise functional connectivity (SFC) was used to examine topological changes in the depression network over time to identify patterns of network reorganization. The predictive value of reorganization information was evaluated for follow-up symptoms in the discovery cohort and validation cohort 1 (22 worsening PSD patients) as well as for treatment responsiveness in validation cohort 2 (23 antidepressant-treated patients). We evaluated the consistency of significant reorganization areas with neuromodulation targets. Spatial correlations of network reorganization patterns with gene expression and neurotransmitter maps were analyzed. The predictive power of network damage for symptoms diminished at follow-up compared to baseline (Δadjusted R2 = -0.070, p < 0.001). Reorganization information effectively predicted symptoms at follow-up in the discovery cohort (adjusted R2 = 0.217, 95% CI: 0.010 to 0.431), as well as symptom exacerbation (r = 0.421, p = 0.033) and treatment responsiveness (r = 0.587, p = 0.012) in the validation cohorts. Regions undergoing significant reorganization overlapped with neuromodulatory targets known to be effective in treating depression. The reorganization of the depression network was associated with the expression of immune-inflammatory response genes and with gamma-aminobutyric acid. Our findings may yield important insights into the repair mechanisms of PSD and provide a critical context for developing post-stroke treatment strategies.

AAAI Conference 2025 Conference Paper

PoseLLaVA: Pose Centric Multimodal LLM for Fine-Grained 3D Pose Manipulation

  • Dong Feng
  • Ping Guo
  • Encheng Peng
  • Mingmin Zhu
  • Wenhao Yu
  • Peng Wang

Manipulating human poses based on natural language is an emerging research field that has traditionally focused on coarse commands such as “walking” or “dancing.” However, fine-grained pose manipulation, like instructing “put both hands in front of the stomach,” remains underexplored. In this paper, we introduce PoseLLaVA, a pioneering model that integrates SMPL-based pose representations into the multimodal LLaVA framework. Through a novel pose encoder-decoder mechanism, PoseLLaVA achieves precise alignment between pose, textual, and visual modalities, enabling detailed control over pose manipulation tasks. PoseLLaVA excels in three key tasks: pose estimation, generation, and adjustment, all driven by detailed language instructions. We further introduce a fine-grained pose adjustment dataset, PosePart, where each sample contains an initial pose and a target pose, along with specific instructions for adjustments, mimicking the guidance a human instructor might provide. Extensive evaluations across these tasks demonstrate significant improvements over existing methods on metrics such as MPJPE and PA-MPJPE, which measure SMPL reconstruction errors, and Recall rates, which assess feature alignment across modalities. Specifically, PoseLLaVA reduces MPJPE errors by more than 20% compared to state-of-the-art methods in pose adjustment and generation tasks. Additionally, we demonstrate the feasibility of combining PoseLLaVA with generative models, such as diffusion, for pose image editing, highlighting its potential applications in language-controlled pose manipulation.

IROS Conference 2025 Conference Paper

Quaternion Approximate Networks for Enhanced Image Classification and Oriented Object Detection

  • Bryce Grant
  • Peng Wang

This paper introduces Quaternion Approximate Networks (QUAN), a novel deep learning framework that leverages quaternion algebra for rotation equivariant image classification and object detection. Unlike conventional quaternion neural networks attempting to operate entirely in the quaternion domain, QUAN approximates quaternion convolution through Hamilton product decomposition using real-valued operations. This approach preserves geometric properties while enabling efficient implementation with custom CUDA kernels. We introduce Independent Quaternion Batch Normalization (IQBN) for training stability and extend quaternion operations to spatial attention mechanisms. QUAN is evaluated on image classification (CIFAR-10/100, ImageNet), object detection (COCO, DOTA), and robotic perception tasks. In classification tasks, QUAN achieves higher accuracy with fewer parameters and faster convergence compared to existing convolution and quaternion-based models. For object detection, QUAN demonstrates improved parameter efficiency and rotation handling over standard Convolutional Neural Networks (CNNs) while establishing the SOTA for quaternion CNNs in this downstream task. These results highlight its potential for deployment in resource-constrained robotic systems requiring rotation-aware perception and application in other domains.
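
For readers unfamiliar with the Hamilton product named in this abstract, it can be written entirely with real-valued multiplications and additions. The sketch below shows only the standard scalar definition; it is not QUAN's convolution or its CUDA kernels, and the function name `hamilton_product` is chosen here for illustration.

```python
def hamilton_product(p, q):
    """Multiply quaternions p and q, each given as (w, x, y, z) real components."""
    a1, b1, c1, d1 = p
    a2, b2, c2, d2 = q
    return (
        a1 * a2 - b1 * b2 - c1 * c2 - d1 * d2,  # real part
        a1 * b2 + b1 * a2 + c1 * d2 - d1 * c2,  # i component
        a1 * c2 - b1 * d2 + c1 * a2 + d1 * b2,  # j component
        a1 * d2 + b1 * c2 - c1 * b2 + d1 * a2,  # k component
    )
```

The product is not commutative: multiplying i by j gives k, while multiplying j by i gives -k, which is the property that lets quaternions encode rotations.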

NeurIPS Conference 2025 Conference Paper

Styl3R: Instant 3D Stylized Reconstruction for Arbitrary Scenes and Styles

  • Peng Wang
  • Xiang Liu
  • Peidong Liu

Stylizing 3D scenes instantly while maintaining multi-view consistency and faithfully resembling a style image remains a significant challenge. Current state-of-the-art 3D stylization methods typically involve computationally intensive test-time optimization to transfer artistic features into a pretrained 3D representation, often requiring dense posed input images. In contrast, leveraging recent advances in feed-forward reconstruction models, we demonstrate a novel approach to achieve direct 3D stylization in less than a second using unposed sparse-view scene images and an arbitrary style image. To address the inherent decoupling between reconstruction and stylization, we introduce a branched architecture that separates structure modeling and appearance shading, effectively preventing stylistic transfer from distorting the underlying 3D scene structure. Furthermore, we adapt an identity loss to facilitate pre-training our stylization model through the novel view synthesis task. This strategy also allows our model to retain its original reconstruction capabilities while being fine-tuned for stylization. Comprehensive evaluations, using both in-domain and out-of-domain datasets, demonstrate that our approach produces high-quality stylized 3D content that achieves a superior blend of style and scene appearance, while also outperforming existing methods in terms of multi-view consistency and efficiency.

NeurIPS Conference 2025 Conference Paper

Table2LaTeX-RL: High-Fidelity LaTeX Code Generation from Table Images via Reinforced Multimodal Language Models

  • Jun Ling
  • Yao Qi
  • Tao Huang
  • Shibo Zhou
  • Yanqin Huang
  • Jiang Yang
  • Ziqi Song
  • Ying Zhou

In this work, we address the task of table image to LaTeX code generation, with the goal of automating the reconstruction of high-quality, publication-ready tables from visual inputs. A central challenge of this task lies in accurately handling complex tables—those with large sizes, deeply nested structures, and semantically rich or irregular cell content—where existing methods often fail. We begin with a comprehensive analysis, identifying key challenges and highlighting the limitations of current evaluation protocols. To overcome these issues, we propose a reinforced multimodal large language model (MLLM) framework, where a pre-trained MLLM is fine-tuned on a large-scale table-to-LaTeX dataset. To further improve generation quality, we introduce a dual-reward reinforcement learning strategy based on Group Relative Policy Optimization (GRPO). Unlike standard approaches that optimize purely over text outputs, our method incorporates both a structure-level reward on LaTeX code and a visual fidelity reward computed from rendered outputs, enabling direct optimization of the visual output quality. We adopt a hybrid evaluation protocol combining TEDS-Structure and CW-SSIM, and show that our method achieves state-of-the-art performance, particularly on structurally complex tables, demonstrating the effectiveness and robustness of our approach.

JMLR Journal 2025 Journal Article

Understanding Deep Representation Learning via Layerwise Feature Compression and Discrimination

  • Peng Wang
  • Xiao Li
  • Can Yaras
  • Zhihui Zhu
  • Laura Balzano
  • Wei Hu
  • Qing Qu

Over the past decade, deep learning has proven to be a highly effective tool for learning meaningful features from raw data. However, it remains an open question how deep networks perform hierarchical feature learning across layers. In this work, we attempt to unveil this mystery by investigating the structures of intermediate features. Motivated by our empirical findings that linear layers mimic the roles of deep layers in nonlinear networks for feature learning, we explore how deep linear networks transform input data into output by investigating the output (i.e., features) of each layer after training in the context of multi-class classification problems. Toward this goal, we first define metrics to measure within-class compression and between-class discrimination of intermediate features, respectively. Through theoretical analysis of these two metrics, we show that the evolution of features follows a simple and quantitative pattern from shallow to deep layers when the input data is nearly orthogonal and the network weights are minimum-norm, balanced, and approximately low-rank: each layer of the linear network progressively compresses within-class features at a geometric rate and discriminates between-class features at a linear rate with respect to the number of layers that data have passed through. To the best of our knowledge, this is the first quantitative characterization of feature evolution in hierarchical representations of deep linear networks. Moreover, our extensive experiments not only validate our theoretical results but also reveal a similar pattern in deep nonlinear networks, which aligns well with recent empirical studies. Finally, we demonstrate the practical value of our results in transfer learning.

NeurIPS Conference 2025 Conference Paper

Understanding Representation Dynamics of Diffusion Models via Low-Dimensional Modeling

  • Xiao Li
  • Zekai Zhang
  • Xiang Li
  • Siyi Chen
  • Zhihui Zhu
  • Peng Wang
  • Qing Qu

Diffusion models, though originally designed for generative tasks, have demonstrated impressive self-supervised representation learning capabilities. A particularly intriguing phenomenon in these models is the emergence of unimodal representation dynamics, where the quality of learned features peaks at an intermediate noise level. In this work, we conduct a comprehensive theoretical and empirical investigation of this phenomenon. Leveraging the inherent low-dimensional structure of image data, we theoretically demonstrate that the unimodal dynamic emerges when the diffusion model successfully captures the underlying data distribution. The unimodality arises from an interplay between denoising strength and class confidence across noise scales. Empirically, we further show that, in classification tasks, the presence of unimodal dynamics reliably reflects the diffusion model’s generalization: it emerges when the model generates novel images and gradually transitions to a monotonically decreasing curve as the model begins to memorize the training data.

AAAI Conference 2025 Conference Paper

VarCMP: Adapting Cross-Modal Pre-Training Models for Video Anomaly Retrieval

  • Peng Wu
  • Wanshun Su
  • Xiangteng He
  • Peng Wang
  • Yanning Zhang

Video anomaly retrieval (VAR) aims to retrieve pertinent abnormal or normal videos from collections of untrimmed and long videos through cross-modal queries such as textual descriptions and synchronized audio. Cross-modal pre-training (CMP) models, by pre-training on large-scale cross-modal pairs, e.g., image and text, can learn the rich associations between different modalities, and this cross-modal association capability gives CMP an advantage in conventional retrieval tasks. Inspired by this, how to utilize the robust cross-modal association capabilities of CMP in VAR to search for crucial visual components in these untrimmed and long videos becomes a critical research problem. Therefore, this paper proposes a VAR method based on CMP models, named VarCMP. First, a unified hierarchical alignment strategy is proposed to constrain the semantic and spatial consistency between video and text, as well as the semantic, temporal, and spatial consistency between video and audio. It fully leverages the efficient cross-modal association capabilities of CMP models by considering cross-modal similarities at multiple granularities, enabling VarCMP to achieve effective all-round information matching for both video-text and video-audio VAR tasks. Moreover, to further solve the problem of untrimmed and long video alignment, an anomaly-biased weighting is devised in the fine-grained alignment, which identifies key segments in untrimmed long videos using anomaly priors and gives them more attention, thereby discarding irrelevant segment information and achieving more accurate matching with cross-modal queries. Extensive experiments demonstrate the high efficacy of VarCMP in both video-text and video-audio VAR tasks, achieving significant improvements on both text-video (UCFCrime-AR) and audio-video (XDViolence-AR) datasets against the best competitors by 5.0% and 5.3% R@1.

NeurIPS Conference 2024 Conference Paper

A Full-duplex Speech Dialogue Scheme Based On Large Language Model

  • Peng Wang
  • Songshuo Lu
  • Yaohua Tang
  • Sijie Yan
  • Wei Xia
  • Yuanjun Xiong

We present a generative dialogue system capable of operating in a full-duplex manner, allowing for seamless interaction. It is based on a large language model (LLM) carefully aligned to be aware of a perception module, a motor function module, and the concept of a simple finite state machine (called neural FSM) with two states. The perception and motor function modules operate in tandem, allowing the system to speak and listen to the user simultaneously. The LLM generates textual tokens for inquiry responses and makes autonomous decisions to start responding to, wait for, or interrupt the user by emitting control tokens to the neural FSM. All these tasks of the LLM are carried out as next token prediction on a serialized view of the dialogue in real-time. In automatic quality evaluations simulating real-life interaction, the proposed system reduces the average conversation response latency by more than threefold compared with LLM-based half-duplex dialogue systems while responding within less than 500 milliseconds in more than 50% of evaluated interactions. Running an LLM with only 8 billion parameters, our system exhibits an 8% higher interruption precision rate than the best available commercial LLM for voice-based dialogue.
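The two-state control loop described in this abstract can be pictured with a minimal sketch. The state names and the control-token strings (`<speak>`, `<listen>`) are assumptions made for illustration only; the abstract does not give the actual token vocabulary or transition rules.

```python
from enum import Enum
from typing import List


class State(Enum):
    LISTENING = "listening"
    SPEAKING = "speaking"


# Hypothetical control tokens; the real system's tokens are not given in the abstract.
START_SPEAKING = "<speak>"
STOP_SPEAKING = "<listen>"


def step(state: State, token: str) -> State:
    """Advance the two-state machine by one emitted token."""
    if state is State.LISTENING and token == START_SPEAKING:
        return State.SPEAKING  # start responding to (or interrupt) the user
    if state is State.SPEAKING and token == STOP_SPEAKING:
        return State.LISTENING  # yield the floor back to the user
    return state  # ordinary text tokens, or "wait", leave the state unchanged


def run(tokens: List[str], state: State = State.LISTENING) -> List[State]:
    """Return the sequence of states visited, including the initial one."""
    trace = [state]
    for token in tokens:
        state = step(state, token)
        trace.append(state)
    return trace


if __name__ == "__main__":
    for s in run(["hello", START_SPEAKING, "hi", STOP_SPEAKING]):
        print(s.value)
```

In this toy version, "waiting" is simply the absence of a control token, so the machine stays in its current state.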

IJCAI Conference 2024 Conference Paper

C3L: Content Correlated Vision-Language Instruction Tuning Data Generation via Contrastive Learning

  • Ji Ma
  • Wei Suo
  • Peng Wang
  • Yanning Zhang

Vision-Language Instruction Tuning (VLIT) is a critical training phase for Large Vision-Language Models (LVLMs). With the improving capabilities of open-source LVLMs, researchers have increasingly turned to generating VLIT data with open-source LVLMs and have achieved significant progress. However, such data generation approaches are bottlenecked by the following challenges: 1) Since multi-modal models tend to be influenced by prior language knowledge, directly using LVLMs to generate VLIT data would inevitably lead to low content relevance between generated data and images. 2) To improve the ability of the models to generate VLIT data, previous methods have incorporated an additional training phase to boost the generative capacity. This process hurts the generalization of the models to unseen inputs (i.e., the “exposure bias” problem). In this paper, we propose a new Content Correlated VLIT data generation via Contrastive Learning (C3L). Specifically, we design a new content relevance module which enhances the content relevance between VLIT data and images by computing Image Instruction Correspondence Scores S(I2C). Moreover, a contrastive learning module is introduced to further boost the VLIT data generation capability of the LVLMs. A large number of automatic measures on four benchmarks show the effectiveness of our method.

AAAI Conference 2024 Conference Paper

ConsistNER: Towards Instructive NER Demonstrations for LLMs with the Consistency of Ontology and Context

  • Chenxiao Wu
  • Wenjun Ke
  • Peng Wang
  • Zhizhao Luo
  • Guozheng Li
  • Wanyi Chen

Named entity recognition (NER) aims to identify and classify specific entities mentioned in textual sentences. Most existing superior NER models employ the standard fully supervised paradigm, which requires a large amount of annotated data during training. In order to maintain performance with insufficient annotation resources (i.e., low resources), in-context learning (ICL) has drawn a lot of attention, due to its plug-and-play nature compared to other methods (e.g., meta-learning and prompt learning). In this setting, retrieving highly correlated demonstrations for target sentences is the key to eliciting ICL ability. For the NER task, the correlation implies the consistency of both ontology (i.e., generalized entity type) and context (i.e., sentence semantics), which is ignored by previous NER demonstration retrieval techniques. To address this issue, we propose ConsistNER, a novel three-stage framework that incorporates ontological and contextual information for low-resource NER. Firstly, ConsistNER employs large language models (LLMs) to pre-recognize potential entities in a zero-shot manner. Secondly, ConsistNER retrieves the sentence-specific demonstrations for each target sentence based on the two following considerations: (1) Regarding ontological consistency, demonstrations are filtered into a candidate set based on ontology distribution. (2) Regarding contextual consistency, an entity-aware self-attention mechanism is introduced to focus more on the potential entities and semantic-correlated tokens. Finally, ConsistNER feeds the retrieved demonstrations for all target sentences into LLMs for prediction. We conduct experiments on four widely-adopted NER datasets, including both general and specific domains. Experimental results show that ConsistNER achieves a 6.01%-26.37% and 3.07%-21.18% improvement over the state-of-the-art baselines on Micro-F1 scores under 1- and 5-shot settings, respectively.

IJCAI Conference 2024 Conference Paper

Domain-Hierarchy Adaptation via Chain of Iterative Reasoning for Few-shot Hierarchical Text Classification

  • Ke Ji
  • Peng Wang
  • Wenjun Ke
  • Guozheng Li
  • Jiajun Liu
  • Jingsheng Gao
  • Ziyu Shang

Recently, various pre-trained language models (PLMs) have been proposed and have shown impressive performance on a wide range of few-shot tasks. However, limited by the unstructured prior knowledge in PLMs, it is difficult to maintain consistent performance on complex hierarchically dependent tasks, especially when the downstream data is extremely scarce. The main challenge is how to transfer the unstructured semantic space in PLMs to the downstream domain hierarchy. Unlike previous work on hierarchical text classification (HTC) which directly performs multi-label classification or uses graph neural network (GNN) to inject label hierarchy, in this work, we study the HTC problem under a few-shot setting to adapt knowledge in PLMs from an unstructured manner to the downstream hierarchy. Technically, we design a simple yet effective method named Hierarchical Iterative Conditional Random Field (HierICRF), which searches for the most domain-challenging directions and carefully frames domain-hierarchy adaptation as a hierarchical iterative language modeling problem. It then encourages the model to make hierarchical consistency self-corrections during inference, thereby achieving knowledge transfer with hierarchical consistency preservation. We perform HierICRF on various architectures, and extensive experiments on two popular HTC datasets demonstrate that prompt with HierICRF significantly boosts the few-shot HTC performance with an average Micro-F1 by 28.80% to 1.50% and Macro-F1 by 36.29% to 1.5% over the previous state-of-the-art (SOTA) baselines under few-shot settings (1->16), while retaining SOTA hierarchical consistency performance.

NeurIPS Conference 2024 Conference Paper

Efficient Adaptation of Pre-trained Vision Transformer via Householder Transformation

  • Wei Dong
  • Yuan Sun
  • Yiting Yang
  • Xing Zhang
  • Zhijun Lin
  • Qingsen Yan
  • Haokui Zhang
  • Peng Wang

A common strategy for Parameter-Efficient Fine-Tuning (PEFT) of pre-trained Vision Transformers (ViTs) involves adapting the model to downstream tasks by learning a low-rank adaptation matrix. This matrix is decomposed into a product of down-projection and up-projection matrices, with the bottleneck dimensionality being crucial for reducing the number of learnable parameters, as exemplified by prevalent methods like LoRA and Adapter. However, these low-rank strategies typically employ a fixed bottleneck dimensionality, which limits their flexibility in handling layer-wise variations. To address this limitation, we propose a novel PEFT approach inspired by Singular Value Decomposition (SVD) for representing the adaptation matrix. SVD decomposes a matrix into the product of a left unitary matrix, a diagonal matrix of scaling values, and a right unitary matrix. We utilize Householder transformations to construct orthogonal matrices that efficiently mimic the unitary matrices, requiring only a vector. The diagonal values are learned in a layer-wise manner, allowing them to flexibly capture the unique properties of each layer. This approach enables the generation of adaptation matrices with varying ranks across different layers, providing greater flexibility in adapting pre-trained models. Experiments on standard downstream vision tasks demonstrate that our method achieves promising fine-tuning performance.
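The SVD-style construction described in this abstract can be sketched in a few lines of plain Python. The function names, the square-matrix restriction, and the use of a single reflection per side are assumptions made for illustration and are not taken from the paper's implementation. The sketch builds a Householder reflection H(v) = I - 2vvᵀ/(vᵀv) from one vector and composes an adaptation matrix as H(u)·diag(s)·H(v).

```python
from typing import List

Matrix = List[List[float]]


def householder(v: List[float]) -> Matrix:
    """Return H = I - 2 v v^T / (v^T v), an orthogonal (and symmetric) reflection."""
    norm_sq = sum(x * x for x in v)
    if norm_sq == 0.0:
        raise ValueError("v must be a non-zero vector")
    n = len(v)
    return [
        [(1.0 if i == j else 0.0) - 2.0 * v[i] * v[j] / norm_sq for j in range(n)]
        for i in range(n)
    ]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain triple-loop matrix product."""
    inner = len(b)
    return [
        [sum(a[i][k] * b[k][j] for k in range(inner)) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def adaptation_matrix(u: List[float], s: List[float], v: List[float]) -> Matrix:
    """Illustrative adaptation matrix H(u) @ diag(s) @ H(v) for the square case."""
    n = len(s)
    diag = [[s[i] if i == j else 0.0 for j in range(n)] for i in range(n)]
    return matmul(matmul(householder(u), diag), householder(v))


if __name__ == "__main__":
    delta_w = adaptation_matrix([1.0, 2.0, 2.0], [3.0, 0.0, 4.0], [0.0, 1.0, 0.0])
    for row in delta_w:
        print([round(x, 4) for x in row])
```

Because both reflections are invertible, the rank of the resulting matrix equals the number of non-zero entries in `s`, which is how learned scaling values can yield different ranks in different layers.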

NeurIPS Conference 2024 Conference Paper

Exploring Low-Dimensional Subspace in Diffusion Models for Controllable Image Editing

  • Siyi Chen
  • Huijie Zhang
  • Minzhe Guo
  • Yifu Lu
  • Peng Wang
  • Qing Qu

Recently, diffusion models have emerged as a powerful class of generative models. Despite their success, there is still limited understanding of their semantic spaces. This makes it challenging to achieve precise and disentangled image generation without additional training, especially in an unsupervised way. In this work, we improve the understanding of their semantic spaces from intriguing observations: among a certain range of noise levels, (1) the learned posterior mean predictor (PMP) in the diffusion model is locally linear, and (2) the singular vectors of its Jacobian lie in low-dimensional semantic subspaces. We provide a solid theoretical basis to justify the linearity and low-rankness in the PMP. These insights allow us to propose an unsupervised, single-step, training-free LOw-rank COntrollable image editing (LOCO Edit) method for precise local editing in diffusion models. LOCO Edit identifies editing directions with nice properties: homogeneity, transferability, composability, and linearity. These properties of LOCO Edit benefit greatly from the low-dimensional semantic subspace. Our method can further be extended to unsupervised or text-supervised editing in various text-to-image diffusion models (T-LOCO Edit). Finally, extensive empirical experiments demonstrate the effectiveness and efficiency of LOCO Edit. The code and the arXiv version can be found on the project website.

IJCAI Conference 2024 Conference Paper

Fast and Continual Knowledge Graph Embedding via Incremental LoRA

  • Jiajun Liu
  • Wenjun Ke
  • Peng Wang
  • Jiahao Wang
  • Jinhua Gao
  • Ziyu Shang
  • Guozheng Li
  • Zijie Xu

Continual Knowledge Graph Embedding (CKGE) aims to efficiently learn new knowledge and simultaneously preserve old knowledge. Dominant approaches primarily focus on alleviating catastrophic forgetting of old knowledge but neglect efficient learning for the emergence of new knowledge. However, in real-world scenarios, knowledge graphs (KGs) are continuously growing, which brings a significant challenge to fine-tuning KGE models efficiently. To address this issue, we propose a fast CKGE framework (FastKGE), incorporating an incremental low-rank adapter (IncLoRA) mechanism to efficiently acquire new knowledge while preserving old knowledge. Specifically, to mitigate catastrophic forgetting, FastKGE isolates and allocates new knowledge to specific layers based on the fine-grained influence between old and new KGs. Subsequently, to accelerate fine-tuning, FastKGE devises an efficient IncLoRA mechanism, which embeds the specific layers into incremental low-rank adapters with fewer training parameters. Moreover, IncLoRA introduces adaptive rank allocation, which makes the LoRA aware of the importance of entities and adjusts its rank scale adaptively. We conduct experiments on four public datasets and two new datasets with a larger initial scale. Experimental results demonstrate that FastKGE can reduce training time by 34%-49% while still achieving competitive link prediction performance against state-of-the-art models on four public datasets (average MRR score of 21.0% vs. 21.1%). Meanwhile, on two newly constructed datasets, FastKGE saves 51%-68% training time and improves link prediction performance by 1.5%.

ICML Conference 2024 Conference Paper

Generalization Analysis of Stochastic Weight Averaging with General Sampling

  • Peng Wang
  • Li Shen 0008
  • Zerui Tao
  • Shuaida He
  • Dacheng Tao

The stochastic weight averaging (SWA) method has empirically proven its advantages compared to stochastic gradient descent (SGD). Despite its widespread use, theoretical investigations have been limited, particularly in scenarios beyond the ideal setting of convexity and sampling with replacement. However, non-convex cases and sampling without replacement are very practical in real-world applications. The main challenges under the above settings are twofold: (i) All the historical gradient information introduced by SWA is considered, while the analysis of SGD using the tool of uniform stability requires bounding only the current gradient. (ii) The $(1+\alpha\beta)$-expansion property makes the bound of each gradient step depend on the previous step, making the bounds of the historical gradients in SWA nested and the theoretical analysis even harder. To address the theoretical challenges, we adopt mathematical induction to find a recursive representation that bounds the gradient at each step. Based on this, we establish stability bounds supporting sampling with and without replacement in the non-convex setting. Furthermore, the derived generalization bounds of SWA are sharper than those of SGD. Finally, experimental results on several benchmarks verify our theoretical results.
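As a rough illustration of the setting this abstract analyzes, the sketch below runs SGD with sampling without replacement on a toy one-dimensional least-squares objective and keeps a running average of the iterates. The objective, the function name `sgd_with_swa`, the hyperparameters, and the choice to average every iterate are assumptions made for the example; practical SWA recipes often average only late checkpoints under a particular learning-rate schedule.

```python
import random
from typing import List, Tuple


def sgd_with_swa(
    data: List[float],
    lr: float = 0.1,
    epochs: int = 5,
    w0: float = 0.0,
    seed: int = 0,
) -> Tuple[float, float]:
    """Minimize the mean of 0.5 * (w - x_i)^2 with SGD; return (last iterate, averaged iterate)."""
    rng = random.Random(seed)
    w = w0
    w_swa = 0.0
    num_averaged = 0
    for _ in range(epochs):
        order = list(range(len(data)))
        rng.shuffle(order)  # sampling without replacement: each sample is used once per epoch
        for i in order:
            grad = w - data[i]  # gradient of 0.5 * (w - x_i)^2 with respect to w
            w -= lr * grad
            num_averaged += 1
            w_swa += (w - w_swa) / num_averaged  # incremental mean of all iterates so far
    return w, w_swa


if __name__ == "__main__":
    last, averaged = sgd_with_swa([1.0, 2.0, 3.0, 4.0])
    print("last iterate:", last)
    print("averaged iterate:", averaged)
```

The averaged iterate depends on every past gradient step rather than only the current one, which is the source of the nested dependence mentioned in challenge (i).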

JBHI Journal 2024 Journal Article

Identifying Associations Between Small Nucleolar RNAs and Diseases via Graph Convolutional Network and Attention Mechanism

  • Shuchen Liu
  • Wen Zhu
  • Peng Wang
  • Shaoyou Yu
  • Fangxiang Wu

Research has shown that small nucleolar RNAs (snoRNAs) play crucial roles in various biological processes, and understanding disease pathogenesis by studying their relationship with diseases is beneficial. Currently, known associations are insufficient, and conventional biological experiments are costly and time-consuming. Therefore, developing efficient computational methods is crucial for identifying potential snoRNA-disease associations. In this paper, a method to identify snoRNA-disease associations based on graph convolutional network and multi-view graph attention mechanism (GCASDA) is proposed. Firstly, the similarity matrices of snoRNAs and diseases are calculated based on biological entity-related information, and the weights of the edges between the snoRNA nodes and the disease nodes are supplemented by random forest. Then two homogeneous graphs and one heterogeneous graph are constructed. Subsequently, different types of embedded features are extracted from the graphs using specific graph convolutional network structure and integrated through a multi-view graph attention mechanism to obtain node embedded feature representations. Finally, for each pair of nodes, in addition to their global features, node interaction features are passed together to a multilayer perceptron neural network (MLP) to identify snoRNA-disease associations. Experimental results show that GCASDA achieves 0.9356 and 0.9294 in AUC and AUPR, respectively, and significantly outperforms other state-of-the-art methods on the basis of different evaluation metrics. Furthermore, the case study further demonstrates the realistic feasibility of GCASDA.

ICML Conference 2024 Conference Paper

Image Fusion via Vision-Language Model

  • Zixiang Zhao
  • Lilun Deng
  • Haowen Bai
  • Yukun Cui
  • Zhipeng Zhang
  • Yulun Zhang 0001
  • Haotong Qin
  • Dongdong Chen

Image fusion integrates essential information from multiple images into a single composite, enhancing structures, textures, and refining imperfections. Existing methods predominantly focus on pixel-level and semantic visual features for recognition, but often overlook the deeper text-level semantic information beyond vision. Therefore, we introduce a novel fusion paradigm named image Fusion via vIsion-Language Model (FILM), for the first time, utilizing explicit textual information from source images to guide the fusion process. Specifically, FILM generates semantic prompts from images and inputs them into ChatGPT for comprehensive textual descriptions. These descriptions are fused within the textual domain and guide the visual information fusion, enhancing feature extraction and contextual understanding, directed by textual semantic information via cross-attention. FILM has shown promising results in four image fusion tasks: infrared-visible, medical, multi-exposure, and multi-focus image fusion. We also propose a vision-language dataset containing ChatGPT-generated paragraph descriptions for the eight image fusion datasets across four fusion tasks, facilitating future research in vision-language model-based image fusion. Code and dataset are available at https://github.com/Zhaozixiang1228/IF-FILM.

IJCAI Conference 2024 Conference Paper

Incorporating Schema-Aware Description into Document-Level Event Extraction

  • Zijie Xu
  • Peng Wang
  • Wenjun Ke
  • Guozheng Li
  • Jiajun Liu
  • Ke Ji
  • Xiye Chen
  • Chenxiao Wu

Document-level event extraction (DEE) aims to extract the structured event information from a given document, facing two critical challenges: (1) event arguments always scatter across sentences (arguments-scattering); (2) multiple events can co-occur in one document (multi-event). Most recent studies mainly follow two simplified settings to ease the challenges: one simplifies DEE with the no-trigger-words design (NDEE), and the other focuses on event argument extraction (DEAE), a sub-task of DEE. However, the former excludes trigger extraction and suffers from error propagation in the sub-tasks. The latter relies heavily on the gold triggers as prerequisites and struggles to distinguish multiple arguments playing the same role in different events. To address the limitations above, we propose a novel joint trigger and argument extraction paradigm SEELE to enhance the DEE model via incorporating SchEma-awarE descriptions into Document-Level Event extraction. Specifically, the schema-aware descriptions are leveraged from two aspects: (1) guiding the attention mechanism among event-aware tokens across sentences, which relieves arguments-scattering without error propagation; (2) performing the fine-grained contrastive learning to distinguish different events, which mitigates multi-event without gold triggers. Extensive experiments show the superiority of SEELE, achieving notable improvements (2.1% to 9.7% F1) on three NDEE datasets and competitive performance on two DEAE datasets. Our code is available at https://github.com/TheoryRhapsody/SEELE.

IJCAI Conference 2024 Conference Paper

Learning Multi-Granularity and Adaptive Representation for Knowledge Graph Reasoning

  • Ziyu Shang
  • Peng Wang
  • Wenjun Ke
  • Jiajun Liu
  • Hailang Huang
  • Guozheng Li
  • Chenxiao Wu
  • Jianghan Liu

Knowledge graph reasoning (KGR) aims to infer new factual triples from existing knowledge graphs (KGs). Recently, a new category of methods, possessing both transductive and inductive reasoning capabilities, has been proposed to tackle this task via learning entity-independent representations from local neighboring structures. However, these methods are plagued by inefficiency issues and they exclusively capture evidence from well-designed local structures, ignoring the correlation between the query and different structures within KGs. In this work, we first propose a novel multi-granularity and adaptive representation framework, MulGA, exploiting the connectivity subgraph to uniformly and hierarchically model query-related triples, relation paths, and subgraphs without explicitly extracting any graph structure, hence mitigating inefficiency issues. Second, we introduce a message-passing mechanism across connectivity subgraphs, facilitating all entities to attain query-related structural representations of diverse granularity levels, i.e., triples and relation paths of different lengths. Third, we design a self-attention-based merging mechanism that allocates weights to different granularities and then consolidates them into subgraph granularity representations for reasoning. Systematic experiments have been conducted on 15 benchmarks, and MulGA achieves a significant improvement in MRR by an average of 1.5% on transductive and 2.7% on inductive tasks over existing state-of-the-art methods. Moreover, MulGA boasts faster convergence speed and competitive inference time, and alleviates the over-smoothing prevalent in graph neural networks.

IJCAI Conference 2024 Conference Paper

Meta In-Context Learning Makes Large Language Models Better Zero and Few-Shot Relation Extractors

  • Guozheng Li
  • Peng Wang
  • Jiajun Liu
  • Yikai Guo
  • Ke Ji
  • Ziyu Shang
  • Zijie Xu

Relation extraction (RE) is an important task that aims to identify the relationships between entities in texts. While large language models (LLMs) have revealed remarkable in-context learning (ICL) capability for general zero and few-shot learning, recent studies indicate that current LLMs still struggle with zero and few-shot RE. Previous studies are mainly dedicated to designing prompt formats and selecting good examples for improving ICL-based RE. Although both factors are vital for ICL, if one can fundamentally boost the ICL capability of LLMs in RE, the zero and few-shot RE performance via ICL would be significantly improved. To this end, we introduce Micre (Meta In-Context learning of LLMs for Relation Extraction), a new meta-training framework for zero and few-shot RE where an LLM is tuned to do ICL on a diverse collection of RE datasets (i.e., learning to learn in context for RE). Through meta-training, the model learns a new RE task in context more effectively by conditioning on a few training examples with no parameter updates or task-specific templates at inference time, enabling better zero and few-shot task generalization. We experiment with Micre on various LLMs with different model scales and 12 public RE datasets, and then evaluate it on unseen RE benchmarks under zero and few-shot settings. Micre delivers comparable or superior performance compared to a range of baselines including supervised fine-tuning and typical in-context learning methods. We find that the gains are particularly significant for larger model scales, and using a diverse set of the meta-training RE datasets is key to improvements. Empirically, we show that Micre can transfer the relation semantic knowledge via relation label name during inference on target RE datasets.

NeurIPS Conference 2024 Conference Paper

Meta-Exploiting Frequency Prior for Cross-Domain Few-Shot Learning

  • Fei Zhou
  • Peng Wang
  • Lei Zhang
  • Zhenghua Chen
  • Wei Wei
  • Chen Ding
  • Guosheng Lin
  • Yanning Zhang

Meta-learning offers a promising avenue for few-shot learning (FSL), enabling models to glean a generalizable feature embedding through episodic training on synthetic FSL tasks in a source domain. Yet, in practical scenarios where the target task diverges from that in the source domain, meta-learning-based methods are susceptible to over-fitting. To overcome this, we introduce a novel framework, Meta-Exploiting Frequency Prior for Cross-Domain Few-Shot Learning, which is crafted to comprehensively exploit the cross-domain transferable image prior that each image can be decomposed into complementary low-frequency content details and high-frequency robust structural characteristics. Motivated by this insight, we propose to decompose each query image into its high-frequency and low-frequency components, and incorporate them in parallel into the feature embedding network to enhance the final category prediction. More importantly, we introduce a feature reconstruction prior and a prediction consistency prior to separately encourage the consistency of the intermediate feature as well as the final category prediction between the original query image and its decomposed frequency components. This allows for collectively guiding the network's meta-learning process with the aim of learning generalizable image feature embeddings, while not introducing any extra computational cost in the inference phase. Our framework establishes new state-of-the-art results on multiple cross-domain few-shot learning benchmarks.

ICLR Conference 2024 Conference Paper

MVDream: Multi-view Diffusion for 3D Generation

  • Yichun Shi
  • Peng Wang
  • Jianglong Ye
  • Long Mai
  • Kejie Li
  • Xiao Yang

We introduce MVDream, a diffusion model that is able to generate consistent multi-view images from a given text prompt. Learning from both 2D and 3D data, a multi-view diffusion model can achieve the generalizability of 2D diffusion models and the consistency of 3D renderings. We demonstrate that such a multi-view diffusion model is implicitly a generalizable 3D prior agnostic to 3D representations. It can be applied to 3D generation via Score Distillation Sampling, significantly enhancing the consistency and stability of existing 2D-lifting methods. It can also learn new concepts from a few 2D examples, akin to DreamBooth, but for 3D generation.

AAAI Conference 2024 Conference Paper

OntoFact: Unveiling Fantastic Fact-Skeleton of LLMs via Ontology-Driven Reinforcement Learning

  • Ziyu Shang
  • Wenjun Ke
  • Nana Xiu
  • Peng Wang
  • Jiajun Liu
  • Yanhui Li
  • Zhizhao Luo
  • Ke Ji

Large language models (LLMs) have demonstrated impressive proficiency in information retrieval, while they are prone to generating incorrect responses that conflict with reality, a phenomenon known as intrinsic hallucination. The critical challenge lies in the unclear and unreliable fact distribution within LLMs trained on vast amounts of data. The prevalent approach frames the factual detection task as a question-answering paradigm, where the LLMs are asked about factual knowledge and examined for correctness. However, existing studies primarily focused on deriving test cases only from several specific domains, such as movies and sports, limiting the comprehensive observation of missing knowledge and the analysis of unexpected hallucinations. To address this issue, we propose OntoFact, an adaptive framework for detecting unknown facts of LLMs, devoted to mining the ontology-level skeleton of the missing knowledge. Specifically, we argue that LLMs could expose the ontology-based similarity among missing facts and introduce five representative knowledge graphs (KGs) as benchmarks. We further devise a sophisticated ontology-driven reinforcement learning (ORL) mechanism to produce error-prone test cases with specific entities and relations automatically. The ORL mechanism rewards the KGs for navigating toward a feasible direction for unveiling factual errors. Moreover, empirical efforts demonstrate that dominant LLMs are biased towards answering Yes rather than No, regardless of whether this knowledge is included. To mitigate the overconfidence of LLMs, we leverage a hallucination-free detection (HFD) strategy to tackle unfair comparisons between baselines, thereby boosting the result robustness. Experimental results on 5 datasets, using 32 representative LLMs, reveal a general lack of fact in current LLMs. Notably, ChatGPT exhibits fact error rates of 51.6% on DBpedia and 64.7% on YAGO, respectively. Additionally, the ORL mechanism demonstrates promising error prediction scores, with F1 scores ranging from 70% to 90% across most LLMs. Compared to the exhaustive testing, ORL achieves an average recall of 80% while reducing evaluation time by 35.29% to 63.12%.

IJCAI Conference 2024 Conference Paper

Recall, Retrieve and Reason: Towards Better In-Context Relation Extraction

  • Guozheng Li
  • Peng Wang
  • Wenjun Ke
  • Yikai Guo
  • Ke Ji
  • Ziyu Shang
  • Jiajun Liu
  • Zijie Xu

Relation extraction (RE) aims to identify relations between entities mentioned in texts. Although large language models (LLMs) have demonstrated impressive in-context learning (ICL) abilities in various tasks, they still suffer from poor performance compared to most supervised fine-tuned RE methods. Utilizing ICL for RE with LLMs encounters two challenges: (1) retrieving good demonstrations from training examples, and (2) enabling LLMs to exhibit strong ICL abilities in RE. On the one hand, retrieving good demonstrations is a non-trivial process in RE, which easily results in low relevance regarding entities and relations. On the other hand, ICL with an LLM achieves poor performance in RE when RE is different from language modeling in nature or the LLM is not large enough. In this work, we propose a novel recall-retrieve-reason RE framework that synergizes LLMs with retrieval corpora (training examples) to enable relevant retrieving and reliable in-context reasoning. Specifically, we distill the consistent ontological knowledge from training datasets to let LLMs generate relevant entity pairs grounded by retrieval corpora as valid queries. These entity pairs are then used to retrieve relevant training examples from the retrieval corpora as demonstrations for LLMs to conduct better ICL via instruction tuning. Extensive experiments on different LLMs and RE datasets demonstrate that our method generates relevant and valid entity pairs and boosts ICL abilities of LLMs, achieving competitive or new state-of-the-art performance on sentence-level RE compared to previous supervised fine-tuning methods and ICL-based methods.

EAAI Journal 2024 Journal Article

Synthetic data augmentation for high-resolution X-ray welding defect detection and classification based on a small number of real samples

  • Liangliang Li
  • Peng Wang
  • Jia Ren
  • Zhigang Lü
  • Xiaoyan Li
  • Hui Gao
  • RuoHai Di

Deep learning has become the dominant technology in most computer vision tasks. These methods often rely on large labeled datasets for training, but in non-destructive testing of welds in industrial manufacturing, weld images with defects are very scarce, and constructing high-resolution weld defect datasets that meet the requirements remains challenging. To overcome this limitation, a new data augmentation method for high-resolution X-ray welding defect classification and synthesis based on a small number of real samples is proposed to realize the data augmentation of industrial nondestructive inspection X-ray film defect images. Firstly, to overcome the scarcity of weld X-ray defect classification data, the Weld Defect Classification (WDC) dataset is constructed. Secondly, the performance of 16 common deep classification models on the WDC dataset is explored. Then, the images of the real local welding defects and the non-defective weld area are fused at random locations; two data augmentation modes, Single Image Single Defect (SISD) and Single Image Multi Defects (SIMD), generate defect files and Visual Object Classes (VOC) annotation files at the same time, which saves considerable manual labeling time. Finally, compared with traditional data augmentation methods, the proposed method effectively improves the accuracy and generalization of defect detection: the mAP (Mean Average Precision) @0.5 of YOLOV8X (You Only Look Once, YOLO) and YOLOV5.6.1X is 66.6% and 72.8%, respectively, which provides an effective solution for data sample generation in the industrial field.

AAAI Conference 2024 Conference Paper

Towards Continual Knowledge Graph Embedding via Incremental Distillation

  • Jiajun Liu
  • Wenjun Ke
  • Peng Wang
  • Ziyu Shang
  • Jinhua Gao
  • Guozheng Li
  • Ke Ji
  • Yanhe Liu

Traditional knowledge graph embedding (KGE) methods typically require preserving the entire knowledge graph (KG) with significant training costs when new knowledge emerges. To address this issue, the continual knowledge graph embedding (CKGE) task has been proposed to train the KGE model by learning emerging knowledge efficiently while simultaneously preserving decent old knowledge. However, the explicit graph structure in KGs, which is critical for the above goal, has been heavily ignored by existing CKGE methods. On the one hand, existing methods usually learn new triples in a random order, destroying the inner structure of new KGs. On the other hand, old triples are preserved with equal priority, failing to alleviate catastrophic forgetting effectively. In this paper, we propose a competitive method for CKGE based on incremental distillation (IncDE), which considers the full use of the explicit graph structure in KGs. First, to optimize the learning order, we introduce a hierarchical strategy, ranking new triples for layer-by-layer learning. By employing the inter- and intra-hierarchical orders together, new triples are grouped into layers based on the graph structure features. Secondly, to preserve the old knowledge effectively, we devise a novel incremental distillation mechanism, which facilitates the seamless transfer of entity representations from the previous layer to the next one, promoting old knowledge preservation. Finally, we adopt a two-stage training paradigm to avoid the over-corruption of old knowledge influenced by under-trained new knowledge. Experimental results demonstrate the superiority of IncDE over state-of-the-art baselines. Notably, the incremental distillation mechanism contributes to improvements of 0.2%-6.5% in the mean reciprocal rank (MRR) score. More exploratory experiments validate the effectiveness of IncDE in proficiently learning new knowledge while preserving old knowledge across all time steps.
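As a reading aid for the MRR figure quoted in the abstract above, the following is a minimal sketch of how mean reciprocal rank is conventionally computed for link prediction. The function name and the input format (a list of 1-based ranks of the correct entity) are illustrative assumptions, not taken from the paper.

```python
def mean_reciprocal_rank(ranks):
    """Mean of 1/rank over all test queries.

    `ranks` holds the 1-based rank of the correct entity for each query
    (hypothetical input format chosen for this sketch).
    """
    if not ranks:
        raise ValueError("ranks must be non-empty")
    if any(r < 1 for r in ranks):
        raise ValueError("ranks are 1-based positive integers")
    return sum(1.0 / r for r in ranks) / len(ranks)


if __name__ == "__main__":
    # Three queries whose correct answers were ranked 1st, 2nd and 4th.
    print(mean_reciprocal_rank([1, 2, 4]))
```

A higher MRR means correct entities tend to appear nearer the top of the ranked candidate list; a value of 1.0 means every query ranked the correct entity first.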

AAAI Conference 2024 Conference Paper

Unify Named Entity Recognition Scenarios via Contrastive Real-Time Updating Prototype

  • Yanhe Liu
  • Peng Wang
  • Wenjun Ke
  • Guozheng Li
  • Xiye Chen
  • Jiteng Zhao
  • Ziyu Shang

Supervised named entity recognition (NER) aims to classify entity mentions into a fixed number of pre-defined types. However, in real-world scenarios, unknown entity types are continually involved. Naive fine-tuning will result in catastrophic forgetting on old entity types. Existing continual methods usually depend on knowledge distillation to alleviate forgetting, which is less effective on long task sequences. Moreover, most of them are specific to the class-incremental scenario and cannot adapt to the online scenario, which is more common in practice. In this paper, we propose a unified framework called Contrastive Real-time Updating Prototype (CRUP) that can handle different scenarios for NER. Specifically, we train a Gaussian projection model by a regularized contrastive objective. After training on each batch, we store the mean vectors of representations belonging to new entity types as their prototypes. Meanwhile, we update existing prototypes belonging to old types based only on representations of the current batch. The final prototypes will be used for the nearest class mean classification. In this way, CRUP can handle different scenarios through its batch-wise learning. Moreover, CRUP can alleviate forgetting in continual scenarios only with current data instead of old data. To comprehensively evaluate CRUP, we construct extensive benchmarks based on various datasets. Experimental results show that CRUP significantly outperforms baselines in continual scenarios and is also competitive in the supervised scenario.
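The prototype bookkeeping and nearest-class-mean step described in the abstract above can be sketched roughly as follows. This is not the authors' implementation: the `PrototypeStore` class, the count-weighted running-mean update, and the Euclidean distance are assumptions made for illustration, and the Gaussian projection model and contrastive training are omitted entirely.

```python
import math


class PrototypeStore:
    """Keeps one mean vector (prototype) per entity type, updated batch by batch."""

    def __init__(self):
        self.protos = {}  # label -> (mean vector, number of samples seen)

    def update(self, reps, labels):
        # Group the current batch's representations by label.
        batch = {}
        for vec, lab in zip(reps, labels):
            batch.setdefault(lab, []).append(vec)
        for lab, vecs in batch.items():
            n = len(vecs)
            dim = len(vecs[0])
            batch_mean = [sum(v[i] for v in vecs) / n for i in range(dim)]
            if lab not in self.protos:
                # New entity type: store the batch mean as its prototype.
                self.protos[lab] = (batch_mean, n)
            else:
                # Known type: merge using only the current batch
                # (assumed rule: count-weighted running mean).
                old_mean, old_n = self.protos[lab]
                total = old_n + n
                merged = [
                    (old_mean[i] * old_n + batch_mean[i] * n) / total
                    for i in range(dim)
                ]
                self.protos[lab] = (merged, total)

    def predict(self, vec):
        # Nearest class mean: pick the label whose prototype is closest.
        return min(self.protos, key=lambda lab: math.dist(vec, self.protos[lab][0]))


if __name__ == "__main__":
    store = PrototypeStore()
    store.update([[0.0, 0.0], [2.0, 0.0]], ["PER", "PER"])
    store.update([[10.0, 10.0]], ["LOC"])
    print(store.predict([9.0, 9.0]))
```

Because each update touches only the current batch and the stored means, no old samples need to be retained, which matches the abstract's claim that forgetting is alleviated with current data only.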

NeurIPS Conference 2024 Conference Paper

Unveiling LoRA Intrinsic Ranks via Salience Analysis

  • Wenjun Ke
  • Jiahao Wang
  • Peng Wang
  • Jiajun Liu
  • Dong Nie
  • Guozheng Li
  • Yining Li

The immense parameter scale of large language models underscores the necessity for parameter-efficient fine-tuning methods. Methods based on Low-Rank Adaptation (LoRA) assume the low-rank characteristics of the incremental matrix and optimize the matrix obtained from low-rank decomposition. Although effective, these methods are constrained by a fixed and unalterable intrinsic rank, neglecting the variable importance of matrices. Consequently, methods for adaptive rank allocation are proposed, among which AdaLoRA demonstrates excellent fine-tuning performance. AdaLoRA conducts adaptation based on singular value decomposition (SVD), dynamically allocating intrinsic ranks according to importance. However, it still struggles to achieve a balance between fine-tuning effectiveness and efficiency, leading to limited rank allocation space. Additionally, the importance measurement focuses only on parameters with minimal impact on the loss, neglecting the dominant role of singular values in SVD-based matrices and the fluctuations during training. To address these issues, we propose SalientLoRA, which adaptively optimizes intrinsic ranks of LoRA via salience measurement. Firstly, during rank allocation, the salience measurement analyses the variation of singular value magnitudes across multiple time steps and establishes their inter-dependency relationships to assess the matrix importance. This measurement mitigates instability and randomness that may arise during importance assessment. Secondly, to achieve a balance between fine-tuning performance and efficiency, we propose an adaptive adjustment of time-series window, which adaptively controls the size of time-series for significance measurement and rank reduction during training, allowing for rapid rank allocation while maintaining training stability. This mechanism enables matrices to set a higher initial rank, thus expanding the allocation space for ranks.
To evaluate the generality of our method across various tasks, we conduct experiments on natural language understanding (NLU), natural language generation (NLG), and large model instruction tuning tasks. Experimental results demonstrate the superiority of SalientLoRA, which outperforms state-of-the-art methods by 0.96%-3.56% on multiple datasets. Furthermore, as the rank allocation space expands, our method ensures fine-tuning efficiency, achieving a speed improvement of 94.5% compared to AdaLoRA. The code is publicly available at https://github.com/Heyest/SalientLoRA.
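To make the role of the intrinsic rank in the abstract above concrete, the sketch below compares the trainable parameter count of a dense weight update with that of a standard LoRA update, where the increment is factored into two matrices of rank r. The function names and the 4096 x 4096 layer size are illustrative assumptions; nothing here reflects SalientLoRA's salience measurement itself.

```python
def full_update_params(d_out, d_in):
    """Trainable parameters when the whole d_out x d_in increment is learned."""
    return d_out * d_in


def lora_params(d_out, d_in, rank):
    """Trainable parameters for a rank-`rank` factorisation B (d_out x r) @ A (r x d_in)."""
    return rank * (d_out + d_in)


if __name__ == "__main__":
    d = 4096  # hypothetical square projection size
    for r in (2, 8, 32):
        ratio = lora_params(d, d, r) / full_update_params(d, d)
        print(f"rank={r}: {lora_params(d, d, r)} params ({ratio:.4%} of dense)")
```

The count grows linearly in the rank, which is why the choice of rank per matrix (fixed in plain LoRA, adaptive in AdaLoRA and SalientLoRA) directly trades capacity against cost.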

AAAI Conference 2024 Conference Paper

VadCLIP: Adapting Vision-Language Models for Weakly Supervised Video Anomaly Detection

  • Peng Wu
  • Xuerong Zhou
  • Guansong Pang
  • Lingru Zhou
  • Qingsen Yan
  • Peng Wang
  • Yanning Zhang

The recent contrastive language-image pre-training (CLIP) model has shown great success in a wide range of image-level tasks, revealing remarkable ability for learning powerful visual representations with rich semantics. An open and worthwhile problem is efficiently adapting such a strong model to the video domain and designing a robust video anomaly detector. In this work, we propose VadCLIP, a new paradigm for weakly supervised video anomaly detection (WSVAD) by leveraging the frozen CLIP model directly without any pre-training and fine-tuning process. Unlike current works that directly feed extracted features into the weakly supervised classifier for frame-level binary classification, VadCLIP makes full use of fine-grained associations between vision and language on the strength of CLIP and involves a dual-branch design. One branch simply utilizes visual features for coarse-grained binary classification, while the other fully leverages the fine-grained language-image alignment. With the benefit of the dual-branch design, VadCLIP achieves both coarse-grained and fine-grained video anomaly detection by transferring pre-trained knowledge from CLIP to the WSVAD task. We conduct extensive experiments on two commonly-used benchmarks, demonstrating that VadCLIP achieves the best performance on both coarse-grained and fine-grained WSVAD, surpassing the state-of-the-art methods by a large margin. Specifically, VadCLIP achieves 84.51% AP and 88.02% AUC on XD-Violence and UCF-Crime, respectively. Code and features are released at https://github.com/nwpu-zxr/VadCLIP.

NeurIPS Conference 2024 Conference Paper

Visual Prompt Tuning in Null Space for Continual Learning

  • Yue Lu
  • Shizhou Zhang
  • De Cheng
  • Yinghui Xing
  • Nannan Wang
  • Peng Wang
  • Yanning Zhang

Existing prompt-tuning methods have demonstrated impressive performance in continual learning (CL) by selecting and updating relevant prompts in vision-transformer models. In contrast, this paper aims to learn each task by tuning the prompts in the direction orthogonal to the subspace spanned by previous tasks' features, so as to ensure no interference with tasks that have already been learned and thereby overcome catastrophic forgetting in CL. However, unlike orthogonal projection in the traditional CNN architecture, prompt gradient orthogonal projection in the ViT architecture poses distinct and greater challenges, i.e., 1) the high-order and non-linear self-attention operation; 2) the drift of prompt distribution brought by the LayerNorm in the transformer block. Theoretically, we deduce two consistency conditions to achieve the prompt gradient orthogonal projection, which provide a theoretical guarantee of eliminating interference on previously learned knowledge via the self-attention mechanism in visual prompt tuning. In practice, an effective null-space-based approximation solution is proposed to implement the prompt gradient orthogonal projection. Extensive experimental results demonstrate the effectiveness of anti-forgetting on four class-incremental benchmarks with diverse pre-trained baseline models, and our approach achieves superior performance to state-of-the-art methods. Our code is available at https://github.com/zugexiaodui/VPTinNSforCL
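The generic idea behind the orthogonal projection mentioned in the abstract above, removing from an update every component that lies in the span of previously seen features, can be sketched as follows. This shows only the plain linear-algebra step under simplifying assumptions (small dense vectors, Gram-Schmidt orthogonalisation); it does not implement the paper's consistency conditions for self-attention and LayerNorm, and all function names are hypothetical.

```python
def orthonormal_basis(vectors, tol=1e-10):
    """Modified Gram-Schmidt: orthonormal basis for the span of `vectors`."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            coeff = sum(wi * bi for wi, bi in zip(w, b))
            w = [wi - coeff * bi for wi, bi in zip(w, b)]
        norm = sum(wi * wi for wi in w) ** 0.5
        if norm > tol:  # skip vectors already inside the span
            basis.append([wi / norm for wi in w])
    return basis


def project_to_null_space(grad, prev_features):
    """Remove from `grad` its components along the span of `prev_features`."""
    g = list(grad)
    for b in orthonormal_basis(prev_features):
        coeff = sum(gi * bi for gi, bi in zip(g, b))
        g = [gi - coeff * bi for gi, bi in zip(g, b)]
    return g


if __name__ == "__main__":
    old_features = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
    update = project_to_null_space([3.0, 4.0, 5.0], old_features)
    # The projected update has zero inner product with every old feature.
    print(update, [sum(u * f for u, f in zip(update, feat)) for feat in old_features])
```

For a linear layer, an update that is orthogonal to all old inputs leaves the layer's outputs on those inputs unchanged, which is the intuition the abstract extends to prompts in a ViT.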

NeurIPS Conference 2024 Conference Paper

WISE: Rethinking the Knowledge Memory for Lifelong Model Editing of Large Language Models

  • Peng Wang
  • Zexi Li
  • Ningyu Zhang
  • Ziwen Xu
  • Yunzhi Yao
  • Yong Jiang
  • Pengjun Xie
  • Fei Huang

Large language models (LLMs) need knowledge updates to meet the ever-growing world facts and correct the hallucinated responses, facilitating the methods of lifelong model editing. Where the updated knowledge resides in memories is a fundamental question for model editing. In this paper, we find that editing either long-term memory (direct model parameters) or working memory (non-parametric knowledge of neural network activations/representations by retrieval) will result in an impossible triangle: reliability, generalization, and locality cannot be realized together in the lifelong editing settings. For long-term memory, directly editing the parameters will cause conflicts with irrelevant pretrained knowledge or previous edits (poor reliability and locality). For working memory, retrieval-based activations can hardly make the model understand the edits and generalize (poor generalization). Therefore, we propose WISE to bridge the gap between memories. In WISE, we design a dual parametric memory scheme, which consists of the main memory for the pretrained knowledge and a side memory for the edited knowledge. We only edit the knowledge in the side memory and train a router to decide which memory to go through when given a query. For continual editing, we devise a knowledge-sharding mechanism where different sets of edits reside in distinct subspaces of parameters, and are subsequently merged into a shared memory without conflicts. Extensive experiments show that WISE can outperform previous model editing methods and overcome the impossible triangle under lifelong model editing of question answering, hallucination, and out-of-distribution settings across trending LLM architectures, e.g., GPT, LLaMA, and Mistral.

EAAI Journal 2023 Journal Article

A geometry-aware deep network for depth estimation in monocular endoscopy

  • Yongming Yang
  • Shuwei Shao
  • Tao Yang
  • Peng Wang
  • Zhuo Yang
  • Chengdong Wu
  • Hao Liu

Monocular depth estimation is critical for endoscopists to perform spatial perception and 3D navigation of surgical sites. However, most of the existing methods ignore the important geometric structural consistency, which inevitably leads to performance degradation and distortion of 3D reconstruction. To address this issue, we introduce a gradient loss to penalize ambiguous edge fluctuations around stepped edge structures and a normal loss to explicitly express the sensitivity to frequently small structures, and propose a geometric consistency loss that spreads the spatial information across the sample grids to constrain the global geometric anatomy structures. In addition, we develop a synthetic RGB-Depth dataset that captures the anatomical structures under reflections and illumination variations. The proposed method is extensively validated across different datasets and clinical images and achieves mean RMSE values of 0.066 (stomach), 0.029 (small intestine), and 0.139 (colon) on the EndoSLAM dataset. To assess generalizability, the proposed method is also evaluated on the ColonDepth dataset, where it achieves mean RMSE values of 12.604 (T1-L1), 9.930 (T2-L2), and 13.893 (T3-L3). The experimental results show that our method exceeds previous state-of-the-art competitors and generates more consistent depth maps and reasonable anatomical structures. The quality of intraoperative 3D structure perception from endoscopic videos of the proposed method meets the accuracy requirements of video-CT registration algorithms for endoscopic navigation. The dataset and the source code will be available at https://github.com/YYM-SIA/LINGMI-MR.

EAAI Journal 2023 Journal Article

A model adaptive updating kernel correlation filter tracker with deep CNN features

  • Zhigang Feng
  • Peng Wang

Trackers based on correlation filters show excellent performance in tracking accuracy and running speed. However, the models of correlation filter trackers are always updated with fixed weights, which can degrade tracking performance when the target is in a variety of challenging scenarios. In this paper, we present a model adaptive updating method based on a fuzzy system, which can set different updating weights on each frame to effectively deal with the challenging scenarios in the tracking process. Notably, this method can be applied to all trackers based on correlation filtering. In addition, we combine deep CNN features that can describe target semantics with HOG features that have spatial descriptions. Using their complementarity in describing the target, we establish a HOG-based filter model and a CNN-based filter model. For the two response maps of the models, we propose a fusion strategy based on quality measurement of tracking results, which can balance the accuracy and robustness of the tracker. Experiments on OTB-2013, OTB-2015, benchmark videos, and the VOT2018 dataset show that our tracker (called MACF) is effective and exhibits competitive results compared with the recent state-of-the-art (SOTA) trackers.

AAAI Conference 2023 Conference Paper

Bidirectional Optical Flow NeRF: High Accuracy and High Quality under Fewer Views

  • Shuo Chen
  • Binbin Yan
  • Xinzhu Sang
  • Duo Chen
  • Peng Wang
  • Xiao Guo
  • Chongli Zhong
  • Huaming Wan

Neural Radiance Fields (NeRF) can implicitly represent 3D-consistent RGB images and geometry by optimizing an underlying continuous volumetric scene function using a sparse set of input views, which has greatly benefited view synthesis tasks. However, NeRF fails to estimate correct geometry when given fewer views, resulting in failure to synthesize novel views. Existing works rely on introducing depth images or adding depth estimation networks to resolve the problem of poor view synthesis in NeRF with fewer views. However, due to the lack of spatial consistency of a single depth image and the poor performance of depth estimation with fewer views, the existing methods still struggle to address this problem. This paper therefore proposes Bidirectional Optical Flow NeRF (BOF-NeRF), which addresses this problem by mining optical flow information between 2D images. Our key insight is that utilizing 2D optical flow images to design a loss can effectively guide NeRF to learn the correct geometry and synthesize the right novel view. We also propose a view-enhanced fusion method based on geometry and color consistency to solve the problem of detail loss in novel views in NeRF. We conduct extensive experiments on the NeRF-LLFF and DTU MVS benchmarks for novel view synthesis tasks with fewer images in different complex real scenes. We further demonstrate the robustness of BOF-NeRF under different baseline distances on the Middlebury dataset. In all cases, BOF-NeRF outperforms current state-of-the-art baselines for novel view synthesis and scene geometry estimation.

NeurIPS Conference 2023 Conference Paper

Efficient Adaptation of Large Vision Transformer via Adapter Re-Composing

  • Wei Dong
  • Dawei Yan
  • Zhijun Lin
  • Peng Wang

The advent of high-capacity pre-trained models has revolutionized problem-solving in computer vision, shifting the focus from training task-specific models to adapting pre-trained models. Consequently, effectively adapting large pre-trained models to downstream tasks in an efficient manner has become a prominent research area. Existing solutions primarily concentrate on designing lightweight adapters and their interaction with pre-trained models, with the goal of minimizing the number of parameters requiring updates. In this study, we propose a novel Adapter Re-Composing (ARC) strategy that addresses efficient pre-trained model adaptation from a fresh perspective. Our approach considers the reusability of adaptation parameters and introduces a parameter-sharing scheme. Specifically, we leverage symmetric down-/up-projections to construct bottleneck operations, which are shared across layers. By learning low-dimensional re-scaling coefficients, we can effectively re-compose layer-adaptive adapters. This parameter-sharing strategy in adapter design allows us to further reduce the number of new parameters while maintaining satisfactory performance, thereby offering a promising approach to compress the adaptation cost. We conduct experiments on 24 downstream image classification tasks using various Vision Transformer variants to evaluate our method. The results demonstrate that our approach achieves compelling transfer learning performance with a reduced parameter count. Our code is available at https://github.com/DavidYanAnDe/ARC.
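One way to picture the parameter sharing described in the abstract above is sketched below: a single down-projection matrix is shared by all layers, the up-projection is taken to be its transpose (the "symmetric" part), and each layer owns only a short vector of re-scaling coefficients applied in the bottleneck. The residual connection, the absence of a nonlinearity, and every name in the code are assumptions made for illustration and may differ from the released ARC code.

```python
def matvec(m, v):
    """Multiply matrix `m` (list of rows) by vector `v`."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def transpose(m):
    return [list(col) for col in zip(*m)]


def arc_adapter(x, w_down, coeffs):
    """Shared bottleneck adapter with layer-specific re-scaling.

    x       : input features, length d
    w_down  : shared down-projection, shape r x d (reused by every layer)
    coeffs  : this layer's r re-scaling coefficients (the only per-layer parameters)
    """
    hidden = matvec(w_down, x)                          # d -> r
    scaled = [c * h for c, h in zip(coeffs, hidden)]    # layer-specific re-scaling
    update = matvec(transpose(w_down), scaled)          # r -> d, symmetric up-projection
    return [xi + ui for xi, ui in zip(x, update)]       # assumed residual connection


if __name__ == "__main__":
    shared = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]        # r = 2, d = 3
    layer_coeffs = {"layer1": [0.5, -1.0], "layer2": [0.0, 0.0]}
    for name, c in layer_coeffs.items():
        print(name, arc_adapter([1.0, 2.0, 3.0], shared, c))
```

Under these assumptions, adding a layer costs only r extra parameters instead of a new pair of projection matrices, which is the compression the abstract refers to.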

AAAI Conference 2023 Conference Paper

fmLRE: A Low-Resource Relation Extraction Model Based on Feature Mapping Similarity Calculation

  • Peng Wang
  • Tong Shao
  • Ke Ji
  • Guozheng Li
  • Wenjun Ke

Low-resource relation extraction (LRE) aims to extract relations from limited labeled corpora. Existing work takes advantage of self-training or distant supervision to expand the limited labeled data in data-driven approaches, while the selection bias of pseudo labels may cause error accumulation in subsequent relation classification. To address this issue, this paper proposes fmLRE, an iterative feedback method based on feature mapping similarity calculation to improve the accuracy of pseudo labels. First, it calculates the similarities between pseudo-label and real-label data of the same category in a feature mapping space based on semantic features of the labeled dataset after feature projection. Then, it fine-tunes the initial model according to the iterative process of reinforcement learning. Finally, the similarity is used as a threshold for screening high-precision pseudo-labels and as the basis for setting different rewards, and it also acts as a penalty term for the loss function of the relation classifier. Experimental results demonstrate that fmLRE achieves state-of-the-art performance compared with strong baselines on two public datasets.

AAAI Conference 2023 Conference Paper

IterDE: An Iterative Knowledge Distillation Framework for Knowledge Graph Embeddings

  • Jiajun Liu
  • Peng Wang
  • Ziyu Shang
  • Chenxiao Wu

Knowledge distillation for knowledge graph embedding (KGE) aims to reduce the KGE model size to address the challenges of storage limitations and knowledge reasoning efficiency. However, current work still suffers from performance drops when compressing a high-dimensional original KGE model to a low-dimensional distilled KGE model. Moreover, most work focuses on the reduction of inference time but ignores the time-consuming training process of distilling KGE models. In this paper, we propose IterDE, a novel knowledge distillation framework for KGEs. First, IterDE introduces an iterative distillation scheme that enables a KGE model to alternately be a student model and a teacher model during the iterative distillation process. Consequently, knowledge can be transferred in a smooth manner between high-dimensional teacher models and low-dimensional student models, while preserving good KGE performance. Furthermore, in order to optimize the training process, we consider that the differing optimization objectives of hard label loss and soft label loss can affect the efficiency of training, and we propose a soft-label weighting dynamic adjustment mechanism that can balance the inconsistency of optimization direction between hard and soft label loss by gradually increasing the weighting of soft label loss. Our experimental results demonstrate that IterDE achieves a new state-of-the-art distillation performance for KGEs compared to strong baselines on the link prediction task. Significantly, IterDE can reduce the training time by 50% on average. Finally, more exploratory experiments show that the soft-label weighting dynamic adjustment mechanism and more fine-grained iterations can improve distillation performance.
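The abstract above states only that the weight on the soft-label loss is increased gradually during training. A minimal sketch of such a schedule is given below; the linear ramp, the start and end values, and the way the two losses are combined are all assumptions chosen for illustration, not details from the paper.

```python
def soft_label_weight(step, total_steps, w_start=0.1, w_end=1.0):
    """Linearly ramp the soft-label weight from w_start to w_end (assumed schedule)."""
    if total_steps <= 1:
        return w_end
    frac = min(max(step / (total_steps - 1), 0.0), 1.0)
    return w_start + (w_end - w_start) * frac


def distillation_loss(hard_loss, soft_loss, step, total_steps):
    """Combine hard-label and soft-label losses with a growing soft-label weight."""
    return hard_loss + soft_label_weight(step, total_steps) * soft_loss


if __name__ == "__main__":
    # Early steps lean on the hard labels; later steps trust the teacher more.
    for step in (0, 5, 10):
        print(step, soft_label_weight(step, total_steps=11))
```

Starting with a small soft-label weight lets the student first fit the ground-truth triples, then progressively follow the teacher as its own predictions become more reliable.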

YNIMG Journal 2023 Journal Article

Laminar neural dynamics of auditory evoked responses: Computational modeling of local field potentials in auditory cortex of non-human primates

  • Vincent S.C. Chien
  • Peng Wang
  • Burkhard Maess
  • Yonatan Fishman
  • Thomas R. Knösche

Evoked neural responses to sensory stimuli have been extensively investigated in humans and animal models both to enhance our understanding of brain function and to aid in clinical diagnosis of neurological and neuropsychiatric conditions. Recording and imaging techniques such as electroencephalography (EEG), magnetoencephalography (MEG), local field potentials (LFPs), and calcium imaging provide complementary information about different aspects of brain activity at different spatial and temporal scales. Modeling and simulations provide a way to integrate these different types of information to clarify underlying neural mechanisms. In this study, we aimed to shed light on the neural dynamics underlying auditory evoked responses by fitting a rate-based model to LFPs recorded via multi-contact electrodes which simultaneously sampled neural activity across cortical laminae. Recordings included neural population responses to best-frequency (BF) and non-BF tones at four representative sites in primary auditory cortex (A1) of awake monkeys. The model considered major neural populations of excitatory, parvalbumin-expressing (PV), and somatostatin-expressing (SOM) neurons across layers 2/3, 4, and 5/6. Unknown parameters, including the connection strength between the populations, were fitted to the data. Our results revealed similar population dynamics, fitted model parameters, predicted equivalent current dipoles (ECD), tuning curves, and lateral inhibition profiles across recording sites and animals, in spite of quite different extracellular current distributions. We found that PV firing rates were higher in BF than in non-BF responses, mainly due to different strengths of tonotopic thalamic input, whereas SOM firing rates were higher in non-BF than in BF responses due to lateral inhibition. 
In conclusion, we demonstrate the feasibility of the model-fitting approach in identifying the contributions of cell-type specific population activity to stimulus-evoked LFPs across cortical laminae, providing a foundation for further investigations into the dynamics of neural circuits underlying cortical sensory processing.

IJCAI Conference 2023 Conference Paper

LION: Label Disambiguation for Semi-supervised Facial Expression Recognition with Progressive Negative Learning

  • Zhongjing Du
  • Xu Jiang
  • Peng Wang
  • Qizheng Zhou
  • Xi Wu
  • Jiliu Zhou
  • Yan Wang

Semi-supervised deep facial expression recognition (SS-DFER) has recently attracted rising research interest due to its more practical setting of abundant unlabeled data. However, two main problems are not considered in current SS-DFER methods: 1) label ambiguity, i.e., given labels that mismatch the facial expressions; 2) inefficient utilization of unlabeled data with low confidence. In this paper, we propose a novel SS-DFER method, including a Label DIsambiguation module and a PrOgressive Negative Learning module, namely LION, to simultaneously address both problems. Specifically, the label disambiguation module operates on labeled data, including data with accurate labels (clear data) and ambiguous labels (ambiguous data). It first uses clear data to calculate prototypes for all the expression classes, and then re-assigns a candidate label set to all the ambiguous data. Based on the prototypes and the candidate label set, the ambiguous data can be relabeled more accurately. As for unlabeled data with low confidence, the progressive negative learning module is developed to iteratively mine more complete complementary labels, which can guide the model to reduce the association between data and corresponding complementary labels. Experiments on three challenging datasets show that our method significantly outperforms the current state-of-the-art approaches in SS-DFER and surpasses fully-supervised baselines. Code will be available at https://github.com/NUM-7/LION.

EAAI Journal 2023 Journal Article

mmSignature: Semi-supervised human identification system based on millimeter wave radar

  • Yicheng Yao
  • Hao Zhang
  • Pan Xia
  • Changyu Liu
  • Fanglin Geng
  • Zhongrui Bai
  • Lidong Du
  • Xianxiang Chen

Human identification is vital in health monitoring, human-computer interaction, safety detection, and other fields. Compared with traditional vision-based methods, millimeter wave radar sensors can protect users' privacy and work in dark environments, giving them a wide range of application prospects in IoT fields such as smart homes and smart medical care. Previous studies need to collect labeled data manually, which requires substantial human resources and is unsuitable for wide deployment. We automatically collect multi-modal radar signals in users' daily lives without requiring researchers to label data manually. Based on the proposed data collection method, we established the first semi-supervised dataset for human identification, which includes synchronous radar point cloud data and range-velocity map data. The dataset contains four experiments, including ten monitoring users and ten other users. We propose a semi-supervised co-training framework based on multi-modal data fusion for human identification. The framework guides the models to learn from unlabeled data using the complementary characteristics of point cloud data and range-velocity map data. In addition, we propose an information fusion method to fuse the radar data of the two modes to further improve the model's performance. The experimental results show that the proposed method achieves 93.7% human identification accuracy, demonstrating the application and deployment potential of radar-based human identification technology.

NeurIPS Conference 2023 Conference Paper

MVDiffusion: Enabling Holistic Multi-view Image Generation with Correspondence-Aware Diffusion

  • Shitao Tang
  • Fuyang Zhang
  • Jiacheng Chen
  • Peng Wang
  • Yasutaka Furukawa

This paper introduces MVDiffusion, a simple yet effective method for generating consistent multi-view images from text prompts given pixel-to-pixel correspondences (e.g., perspective crops from a panorama or multi-view images given depth maps and poses). Unlike prior methods that rely on iterative image warping and inpainting, MVDiffusion simultaneously generates all images with a global awareness, effectively addressing the prevalent error accumulation issue. At its core, MVDiffusion processes perspective images in parallel with a pre-trained text-to-image diffusion model, while integrating novel correspondence-aware attention layers to facilitate cross-view interactions. For panorama generation, while only trained with 10k panoramas, MVDiffusion is able to generate high-resolution photorealistic images for arbitrary texts or extrapolate one perspective image to a 360-degree view. For multi-view depth-to-image generation, MVDiffusion demonstrates state-of-the-art performance for texturing a scene mesh. The project page is at https://mvdiffusion.github.io/.

AAAI Conference 2023 Conference Paper

Online Noisy Continual Relation Learning

  • Guozheng Li
  • Peng Wang
  • Qiqing Luo
  • Yanhe Liu
  • Wenjun Ke

Recent work on continual relation learning has achieved remarkable progress. However, most existing methods only focus on tackling catastrophic forgetting to improve performance in the existing setup, while continually learning relations in the real world must overcome many other challenges. One is that the data may arrive in an online streaming fashion, with data distributions gradually changing and without distinct task boundaries. Another is that noisy labels are inevitable in the real world, as relation samples may be contaminated by label inconsistencies or labeled with distant supervision. In this work, therefore, we propose a novel continual relation learning framework that simultaneously addresses both online and noisy relation learning challenges. Our framework contains three key modules: (i) a sample separated online purifying module that divides the online data stream into clean and noisy samples, (ii) a self-supervised online learning module that circumvents inferior training signals caused by noisy data, and (iii) a semi-supervised offline finetuning module that ensures the participation of both clean and noisy samples. Experimental results on FewRel, TACRED and NYT-H with real-world noise demonstrate that our framework greatly outperforms the combinations of the state-of-the-art online continual learning and noisy label learning methods.

IJCAI Conference 2023 Conference Paper

PasCore: A Chinese Overlapping Relation Extraction Model Based on Global Pointer Annotation Strategy

  • Peng Wang
  • Jiafeng Xie
  • Xiye Chen
  • Guozheng Li
  • Wei Li

Recent work for extracting relations from texts has achieved excellent performance. However, existing studies mainly focus on simple relation extraction and perform poorly on the overlapping triple problem because the tags of shared entities conflict with each other. In particular, overlapping entities are common and indispensable in Chinese. To address this issue, this paper proposes PasCore, which utilizes a global pointer annotation strategy for overlapping relation extraction in Chinese. PasCore first obtains the sentence vector via a general pre-trained model encoder, and uses a classifier to predict relations. Subsequently, it uses the global pointer annotation strategy for head entity annotation, which uses global tags to label the start and end positions of the entities. Finally, PasCore integrates the relation, head entity and its type to mark the tail entity. Furthermore, PasCore performs conditional layer normalization to fuse features, which connects all stages and greatly enriches the association between relations and entities. Experimental results on both Chinese and English real-world datasets demonstrate that PasCore outperforms strong baselines on relation extraction and, especially, shows superior performance on overlapping relation extraction.

AAAI Conference 2023 Conference Paper

Stop-Gradient Softmax Loss for Deep Metric Learning

  • Lu Yang
  • Peng Wang
  • Yanning Zhang

Deep metric learning aims to learn a feature space that models the similarity between images, and feature normalization is a critical step for boosting performance. However, directly optimizing the L2-normalized softmax loss causes the network to fail to converge. Therefore, some SOTA approaches append a scale layer after the inner product to relieve the convergence problem, but this introduces a new problem: it is difficult to learn the best scaling parameters. In this letter, we look into the characteristics of softmax-based approaches and propose a novel learning objective function, Stop-Gradient Softmax Loss (SGSL), to solve the convergence problem in softmax-based deep metric learning with L2-normalization. In addition, we found a useful trick named Remove the last BN-ReLU (RBR). It removes the last BN-ReLU in the backbone to reduce the learning burden of the model. Experimental results on four fine-grained image retrieval benchmarks show that our proposed approach outperforms most existing approaches, i.e., our approach achieves 75.9% on CUB-200-2011, 94.7% on CARS196 and 83.1% on SOP, which outperforms other approaches by at least 1.7%, 2.9% and 1.7% on Recall@1.
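
As a rough illustration of the scale-layer baseline mentioned in this abstract (not SGSL itself, whose formulation the abstract does not give), the sketch below computes a cross-entropy loss over scaled cosine similarities in plain Python. The function names and the `scale` parameter are assumptions made for this example.

```python
import math


def l2_normalize(v):
    """Return v rescaled to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def scaled_cosine_softmax_loss(feature, class_weights, label, scale):
    """Cross-entropy over scale * cosine(feature, class weight).

    Hypothetical helper: `scale` plays the role of the scale layer
    appended after the inner product of L2-normalized vectors.
    """
    f = l2_normalize(feature)
    logits = []
    for w in class_weights:
        w_hat = l2_normalize(w)
        cosine = sum(a * b for a, b in zip(f, w_hat))
        logits.append(scale * cosine)
    # Numerically stable log-sum-exp.
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[label]
```

With `scale=1` the logits are confined to [-1, 1], so even a perfectly aligned feature keeps a loss well above zero, which is one commonly cited reason the unscaled normalized loss is hard to optimize. A larger `scale` removes that floor, at the cost of having to choose or learn its value.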

NeurIPS Conference 2023 Conference Paper

Toward Re-Identifying Any Animal

  • Bingliang Jiao
  • Lingqiao Liu
  • Liying Gao
  • Ruiqi Wu
  • Guosheng Lin
  • Peng Wang
  • Yanning Zhang

The current state of re-identification (ReID) models poses limitations to their applicability in the open world, as they are primarily designed and trained for specific categories like person or vehicle. In light of the importance of ReID technology for tracking wildlife populations and migration patterns, we propose a new task called "Re-identify Any Animal in the Wild" (ReID-AW). This task aims to develop a ReID model capable of handling any unseen wildlife category it encounters. To address this challenge, we have created a comprehensive dataset called Wildlife-71, which includes ReID data from 71 different wildlife categories. This dataset is the first of its kind to encompass multiple object categories in the realm of ReID. Furthermore, we have developed a universal re-identification model named UniReID specifically for the ReID-AW task. To enhance the model's adaptability to the target category, we employ a dynamic prompting mechanism using category-specific visual prompts. These prompts are generated based on knowledge gained from a set of pre-selected images within the target category. Additionally, we leverage explicit semantic knowledge derived from the large-scale pre-trained language model, GPT-4. This allows UniReID to focus on regions that are particularly useful for distinguishing individuals within the target category. Extensive experiments have demonstrated the remarkable generalization capability of our UniReID model. It showcases promising performance in handling arbitrary wildlife categories, offering significant advancements in the field of ReID for wildlife conservation and research purposes.

IJCAI Conference 2023 Conference Paper

Towards Incremental NER Data Augmentation via Syntactic-aware Insertion Transformer

  • Wenjun Ke
  • Zongkai Tian
  • Qi Liu
  • Peng Wang
  • Jinhua Gao
  • Rui Qi

Named entity recognition (NER) aims to locate and classify named entities in natural language texts. Most existing high-performance NER models employ a supervised paradigm, which requires a large quantity of high-quality annotated data during training. In order to help NER models perform well in few-shot scenarios, data augmentation approaches attempt to build extra data by means of random editing or by using end-to-end generation with PLMs. However, these methods focus on only the fluency of generated sentences, ignoring the syntactic correlation between the new and raw sentences. Such uncorrelation also brings low diversity and inconsistent labeling of synthetic samples. To fill this gap, we present SAINT (Syntactic-Aware InsertioN Transformer), a hard-constraint controlled text generation model that incorporates syntactic information. The proposed method operates by inserting new tokens between existing entities in a parallel manner. During the insertion procedure, new tokens are added taking both semantic and syntactic factors into account. Hence the resulting sentence can retain the syntactic correctness with respect to the raw data. Experimental results on two benchmark datasets, i.e., Ontonotes and Wikiann, demonstrate the comparable performance of SAINT over the state-of-the-art baselines.

IJCAI Conference 2022 Conference Paper

Corner Affinity: A Robust Grouping Algorithm to Make Corner-guided Detector Great Again

  • Haoran Wei
  • Chenglong Liu
  • Ping Guo
  • Yangguang Zhu
  • Jiamei Fu
  • Bing Wang
  • Peng Wang

Corner-guided detector enjoys potential ability to yield precise bounding boxes. However, unreliable corner pairs, generated by heuristic grouping guidance, hinder the development of this detector. In this paper, we propose a novel corner grouping algorithm, termed as Corner Affinity, to significantly boost the reliability and robustness of corner grouping. The proposed Corner Affinity is a couple of two interactional factors, namely, 1) the structure affinity (SA), applying to generate preliminary corner pairs through the corresponding object's shallow construction information, and 2) the contexts affinity (CA), running as optimizing corner pairs via embedding deeper semantic features of affiliated instances. Equipped with the Corner Affinity, a detector can produce high-quality bounding boxes upon preferable paired corner keypoints. Experimental results show the superiority of our design on multiple benchmark datasets. Specifically, for CornerNet baseline, the proposed Corner Affinity brings AP boostings of 5.8% on COCO, 35.8% on Citypersons, and 17.2% on UCAS-AOD without bells and whistles.

IJCAI Conference 2022 Conference Paper

FastRE: Towards Fast Relation Extraction with Convolutional Encoder and Improved Cascade Binary Tagging Framework

  • Guozheng Li
  • Xu Chen
  • Peng Wang
  • Jiafeng Xie
  • Qiqing Luo

Recent work for extracting relations from texts has achieved excellent performance. However, most existing methods pay less attention to the efficiency, making it still challenging to quickly extract relations from massive or streaming text data in realistic scenarios. The main efficiency bottleneck is that these methods use a Transformer-based pre-trained language model for encoding, which heavily affects the training speed and inference speed. To address this issue, we propose a fast relation extraction model (FastRE) based on convolutional encoder and improved cascade binary tagging framework. Compared to previous work, FastRE employs several innovations to improve efficiency while also keeping promising performance. Concretely, FastRE adopts a novel convolutional encoder architecture combined with dilated convolution, gated unit and residual connection, which significantly reduces the computation cost of training and inference, while maintaining the satisfactory performance. Moreover, to improve the cascade binary tagging framework, FastRE first introduces a type-relation mapping mechanism to accelerate tagging efficiency and alleviate relation redundancy, and then utilizes a position-dependent adaptive thresholding strategy to obtain higher tagging accuracy and better model generalization. Experimental results demonstrate that FastRE is well balanced between efficiency and performance, and achieves 3-10× training speed, 7-15× inference speed faster, and 1/100 parameters compared to the state-of-the-art models, while the performance is still competitive. Our code is available at https://github.com/seukgcode/FastRE.
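
The three encoder ingredients named in this abstract (dilated convolution, gated unit, residual connection) can be combined in many ways. The sketch below shows one generic combination on a single-channel 1-D sequence in plain Python. It is an assumption-laden illustration, not FastRE's actual architecture; the function names, the sigmoid gate, and the zero padding are all choices made for this example.

```python
import math


def dilated_conv1d(xs, kernel, dilation):
    """Same-length 1-D convolution with zero padding and a dilation step."""
    center = len(kernel) // 2
    out = []
    for i in range(len(xs)):
        total = 0.0
        for j, w in enumerate(kernel):
            idx = i + (j - center) * dilation
            if 0 <= idx < len(xs):  # positions outside the input count as zero
                total += w * xs[idx]
        out.append(total)
    return out


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def gated_residual_block(xs, value_kernel, gate_kernel, dilation):
    """Residual block: x + conv_value(x) * sigmoid(conv_gate(x))."""
    values = dilated_conv1d(xs, value_kernel, dilation)
    gates = dilated_conv1d(xs, gate_kernel, dilation)
    return [x + v * sigmoid(g) for x, v, g in zip(xs, values, gates)]
```

Stacking such blocks with growing dilation widens the receptive field without recurrence or self-attention, which is the usual efficiency argument for convolutional encoders of this kind.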

NeurIPS Conference 2022 Conference Paper

HumanLiker: A Human-like Object Detector to Model the Manual Labeling Process

  • Haoran Wei
  • Ping Guo
  • Yangguang Zhu
  • Chenglong Liu
  • Peng Wang

Popular object detection models generate bounding boxes in a different way than we humans. As an example, modern detectors yield object box either upon the regression of its center and width/height (center-guided detector), or by grouping paired estimated corners (corner-guided detector). However, that is not the pattern we manually label an object due to high degrees of freedom in searching centers or low efficiency of grouping corners. Empirically, humans run two steps to locate an object bounding box manually: 1) click the mouse at the top-left corner of object, and then drag the mouse to the bottom-right corner; 2) refine the corner positions to make the bounding box more precise, if necessary. Inspired by this manual labeling process, we propose a novel human-like detector, termed as HumanLiker, which is devised as a two-stage end-to-end detector to simulate the two aforementioned steps. Like we humans in manual labeling, HumanLiker can effectively avert both the thorny center searching and heuristic corner grouping. Different from the mainstream detector branches, i.e., the center/corner-guided methods, the HumanLiker provides a new paradigm which integrates the advantages of both branches to balance the detection efficiency and bounding box quality. On MS-COCO test-dev set, HumanLiker can achieve 50.2%/51.6% and 53.8%/55.6% in terms of AP with ResNeXt-101 and SwinTransformer backbones in single/multi-scale testing, outperforming current popular center/corner-guided baselines (e.g., DETR/CornerNet) by a large margin, with far fewer training epochs and higher inference FPS. Code will be available soon.

JBHI Journal 2022 Journal Article

MRI Generated From CT for Acute Ischemic Stroke Combining Radiomics and Generative Adversarial Networks

  • Eryan Feng
  • Pinle Qin
  • Rui Chai
  • Jianchao Zeng
  • Qi Wang
  • Yanfeng Meng
  • Peng Wang

Compared to computed tomography (CT), magnetic resonance imaging (MRI) is more sensitive to acute ischemic stroke lesion. However, MRI is time-consuming, expensive, and susceptible to interference from metal implants. Generating MRI images from CT images can address the limitations of MRI. The key problem in the process is obtaining lesion information from CT. In this study, we propose a cross-modal image generation algorithm from CT to MRI for acute ischemic stroke by combining radiomics with generative adversarial networks. First, the lesion candidate region was obtained using radiomics, the radiomic features of the region were extracted, and the feature with the largest information gain was selected and visualized as a feature map. Then, the concatenation of the extracted feature map and the CT image was input in the generator. We added a residual module after the downsampling of the generator, following the general shape of U-Net, which can deepen the network without causing degradation problems. In addition, we introduced the lesion feature similarity loss function to focus the model on the similarity of the lesion. Through the subjective judgment of two experienced radiologists and using evaluation metrics, the results showed that the generated MRI images were very similar to the real MRI images. Moreover, the locations of the lesions were correct, and the shapes of lesions were similar to those of the real lesions, which can help doctors with timely diagnosis and treatment.

NeurIPS Conference 2022 Conference Paper

Neural Collapse with Normalized Features: A Geometric Analysis over the Riemannian Manifold

  • Can Yaras
  • Peng Wang
  • Zhihui Zhu
  • Laura Balzano
  • Qing Qu

When training overparameterized deep networks for classification tasks, it has been widely observed that the learned features exhibit a so-called "neural collapse" phenomenon. More specifically, for the output features of the penultimate layer, for each class the within-class features converge to their means, and the means of different classes exhibit a certain tight frame structure, which is also aligned with the last layer's classifier. As feature normalization in the last layer becomes a common practice in modern representation learning, in this work we theoretically justify the neural collapse phenomenon under normalized features. Based on an unconstrained feature model, we simplify the empirical loss function in a multi-class classification task into a nonconvex optimization problem over the Riemannian manifold by constraining all features and classifiers over the sphere. In this context, we analyze the nonconvex landscape of the Riemannian optimization problem over the product of spheres, showing a benign global landscape in the sense that the only global minimizers are the neural collapse solutions while all other critical points are strict saddle points with negative curvature. Experimental results on practical deep networks corroborate our theory and demonstrate that better representations can be learned faster via feature normalization. Code for our experiments can be found at https://github.com/cjyaras/normalized-neural-collapse.

IROS Conference 2021 Conference Paper

A Collaborative Visual SLAM Framework for Service Robots

  • Ming Ouyang
  • Xuesong Shi
  • Yujie Wang
  • Yuxin Tian
  • Yingzhe Shen
  • Dawei Wang
  • Peng Wang
  • Zhiqiang Cao

We present a collaborative visual simultaneous localization and mapping (SLAM) framework for service robots. With an edge server maintaining a map database and performing global optimization, each robot can register to an existing map, update the map, or build new maps, all with a unified interface and low computation and memory cost. We design an elegant communication pipeline to enable real-time information sharing between robots. With a novel landmark organization and retrieval method on the server, each robot can acquire landmarks predicted to be in its view, to augment its local map. The framework is general enough to support both RGB-D and monocular cameras, as well as robots with multiple cameras, taking the rigid constraints between cameras into consideration. The proposed framework has been fully implemented and verified with public datasets and live experiments.

ICML Conference 2021 Conference Paper

AdaXpert: Adapting Neural Architecture for Growing Data

  • Shuaicheng Niu
  • Jiaxiang Wu 0001
  • Guanghui Xu 0002
  • Yifan Zhang 0004
  • Yong Guo
  • Peilin Zhao
  • Peng Wang
  • Mingkui Tan

In real-world applications, data often come in a growing manner, where the data volume and the number of classes may increase dynamically. This will bring a critical challenge for learning: given the increasing data volume or the number of classes, one has to instantaneously adjust the neural model capacity to obtain promising performance. Existing methods either ignore the growing nature of data or seek to independently search an optimal architecture for a given dataset, and thus are incapable of promptly adjusting the architectures for the changed data. To address this, we present a neural architecture adaptation method, namely Adaptation eXpert (AdaXpert), to efficiently adjust previous architectures on the growing data. Specifically, we introduce an architecture adjuster to generate a suitable architecture for each data snapshot, based on the previous architecture and the different extent between current and previous data distributions. Furthermore, we propose an adaptation condition to determine the necessity of adjustment, thereby avoiding unnecessary and time-consuming adjustments. Extensive experiments on two growth scenarios (increasing data volume and number of classes) demonstrate the effectiveness of the proposed method.

IJCAI Conference 2021 Conference Paper

Chop Chop BERT: Visual Question Answering by Chopping VisualBERT’s Heads

  • Chenyu Gao
  • Qi Zhu
  • Peng Wang
  • Qi Wu

Vision-and-Language (VL) pre-training has shown great potential on many related downstream tasks, such as Visual Question Answering (VQA), one of the most popular problems in the VL field. All of these pre-trained models (such as VisualBERT, ViLBERT, LXMERT and UNITER) are built with Transformer, which extends the classical attention mechanism to multiple layers and heads. To investigate why and how these models work on VQA so well, in this paper we explore the roles of individual heads and layers in Transformer models when handling 12 different types of questions. Specifically, we manually remove (chop) heads (or layers) from a pre-trained VisualBERT model at a time, and test it on different levels of questions to record its performance. As shown in the interesting echelon shape of the result matrices, experiments reveal different heads and layers are responsible for different question types, with higher-level layers activated by higher-level visual reasoning questions. Based on this observation, we design a dynamic chopping module that can automatically remove heads and layers of the VisualBERT at an instance level when dealing with different questions. Our dynamic chopping module can effectively reduce the parameters of the original model by 50%, while only damaging the accuracy by less than 1% on the VQA task.

EAAI Journal 2021 Journal Article

Feature-refined box particle filtering for autonomous vehicle localisation with OpenStreetMap

  • Peng Wang
  • Lyudmila Mihaylova
  • Philippe Bonnifait
  • Philippe Xu
  • Jianwen Jiang

Vehicle localisation is an important and challenging task in achieving autonomous driving. This work presents a box particle filter framework for vehicle self-localisation in the presence of sensor and map uncertainties. The proposed feature-refined box particle filter incorporates line features extracted from a multi-layer Light Detection And Ranging (LiDAR) sensor and information from OpenStreetMap to estimate vehicle states. A particle weight balance strategy is incorporated to account for the OpenStreetMap positional inaccuracy, which is assessed by comparing it to a high definition road map. The performance of the proposed framework is evaluated on a LiDAR dataset and compared with box particle filter variants. Experimental results show that the proposed framework achieves respectively 10% and 53% localisation performance improvement with reduced box volumes of 25% and 41%, when compared with the state-of-the-art interval analysis based box regularisation particle filter and the box particle filter.

NeurIPS Conference 2021 Conference Paper

NeuS: Learning Neural Implicit Surfaces by Volume Rendering for Multi-view Reconstruction

  • Peng Wang
  • Lingjie Liu
  • Yuan Liu
  • Christian Theobalt
  • Taku Komura
  • Wenping Wang

We present a novel neural surface reconstruction method, called NeuS, for reconstructing objects and scenes with high fidelity from 2D image inputs. Existing neural surface reconstruction approaches, such as DVR [Niemeyer et al., 2020] and IDR [Yariv et al., 2020], require foreground mask as supervision, easily get trapped in local minima, and therefore struggle with the reconstruction of objects with severe self-occlusion or thin structures. Meanwhile, recent neural methods for novel view synthesis, such as NeRF [Mildenhall et al., 2020] and its variants, use volume rendering to produce a neural scene representation with robustness of optimization, even for highly complex objects. However, extracting high-quality surfaces from this learned implicit representation is difficult because there are not sufficient surface constraints in the representation. In NeuS, we propose to represent a surface as the zero-level set of a signed distance function (SDF) and develop a new volume rendering method to train a neural SDF representation. We observe that the conventional volume rendering method causes inherent geometric errors (i.e., bias) for surface reconstruction, and therefore propose a new formulation that is free of bias in the first order of approximation, thus leading to more accurate surface reconstruction even without the mask supervision. Experiments on the DTU dataset and the BlendedMVS dataset show that NeuS outperforms the state-of-the-arts in high-quality surface reconstruction, especially for objects and scenes with complex structures and self-occlusion.
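
To make the phrase "zero-level set of a signed distance function" concrete, the sketch below uses an analytic sphere SDF in place of a learned network and locates the surface along a camera ray by bisection on the sign change. It illustrates the surface representation only, not the volume rendering formulation of NeuS; `sphere_sdf` and `find_surface_along_ray` are hypothetical names introduced here.

```python
import math


def sphere_sdf(point, center=(0.0, 0.0, 0.0), radius=1.0):
    """Signed distance to a sphere: negative inside, zero on the surface."""
    return math.dist(point, center) - radius


def find_surface_along_ray(sdf, origin, direction, t_near, t_far, iters=60):
    """Bisect for a zero crossing of sdf along origin + t * direction.

    Returns the ray parameter t of the crossing, or None when the SDF has
    the same sign at both ends of [t_near, t_far].
    """
    def value_at(t):
        return sdf(tuple(o + t * d for o, d in zip(origin, direction)))

    lo, hi = t_near, t_far
    if value_at(lo) * value_at(hi) > 0:
        return None
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if value_at(lo) * value_at(mid) <= 0:
            hi = mid  # the crossing lies in the lower half
        else:
            lo = mid  # the crossing lies in the upper half
    return 0.5 * (lo + hi)
```

For a ray starting at (0, 0, -3) and pointing along +z, the unit sphere at the origin is first hit at t = 2, which is where the SDF changes sign.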

YNIMG Journal 2021 Journal Article

Non-rhythmic temporal prediction involves phase resets of low-frequency delta oscillations

  • Jonathan Daume
  • Peng Wang
  • Alexander Maye
  • Dan Zhang
  • Andreas K. Engel

The phase of neural oscillatory signals aligns to the predicted onset of upcoming stimulation. Whether such phase alignments represent phase resets of underlying neural oscillations or just rhythmically evoked activity, and whether they can be observed in a rhythm-free visual context, however, remains unclear. Here, we recorded the magnetoencephalogram while participants were engaged in a temporal prediction task, judging the visual or tactile reappearance of a uniformly moving stimulus. The prediction conditions were contrasted with a control condition to dissociate phase adjustments of neural oscillations from stimulus-driven activity. We observed stronger delta band inter-trial phase consistency (ITPC) in a network of sensory, parietal and frontal brain areas, but no power increase reflecting stimulus-driven or prediction-related evoked activity. Delta ITPC further correlated with prediction performance in the cerebellum and visual cortex. Our results provide evidence that phase alignments of low-frequency neural oscillations underlie temporal predictions in a non-rhythmic visual and crossmodal context.

IJCAI Conference 2021 Conference Paper

Proposal-free One-stage Referring Expression via Grid-Word Cross-Attention

  • Wei Suo
  • MengYang Sun
  • Peng Wang
  • Qi Wu

Referring Expression Comprehension (REC) has become one of the most important tasks in visual reasoning, since it is an essential step for many vision-and-language tasks such as visual question answering. However, it has not been widely used in many downstream tasks because it suffers from two problems: 1) two-stage methods incur heavy computation cost and inevitable error accumulation, and 2) one-stage methods have to depend on lots of hyper-parameters (such as anchors) to generate bounding boxes. In this paper, we present a proposal-free one-stage (PFOS) model that is able to regress the region-of-interest from the image, based on a textual query, in an end-to-end manner. Instead of using the dominant anchor proposal fashion, we directly take the dense-grid of image as input for a cross-attention transformer that learns grid-word correspondences. The final bounding box is predicted directly from the image without the time-consuming anchor selection process that previous methods suffer from. Our model achieves the state-of-the-art performance on four referring expression datasets with higher efficiency, compared to previous best one-stage and two-stage methods.

AAAI Conference 2021 Conference Paper

Simple is not Easy: A Simple Strong Baseline for TextVQA and TextCaps

  • Qi Zhu
  • Chenyu Gao
  • Peng Wang
  • Qi Wu

Texts appearing in daily scenes that can be recognized by OCR (Optical Character Recognition) tools contain significant information, such as street name, product brand and prices. Two tasks – text-based visual question answering and text-based image captioning, with a text extension from existing vision-language applications, are catching on rapidly. To address these problems, many sophisticated multi-modality encoding frameworks (such as heterogeneous graph structure) are being used. In this paper, we argue that a simple attention mechanism can do the same or even better job without any bells and whistles. Under this mechanism, we simply split OCR token features into separate visual- and linguistic-attention branches, and send them to a popular Transformer decoder to generate answers or captions. Surprisingly, we find this simple baseline model is rather strong – it consistently outperforms state-of-the-art (SOTA) models on two popular benchmarks, TextVQA and all three tasks of ST-VQA, although these SOTA models use far more complex encoding mechanisms. Transferring it to text-based image captioning, we also surpass the TextCaps Challenge 2020 winner. We wish this work to set the new baseline for these two OCR text related applications and to inspire new thinking of multi-modality encoder design. Code is available at https://github.com/ZephyrZhuQi/ssbaseline

IROS Conference 2021 Conference Paper

Super Odometry: IMU-centric LiDAR-Visual-Inertial Estimator for Challenging Environments

  • Shibo Zhao
  • Hengrui Zhang
  • Peng Wang
  • Lucas Nogueira
  • Sebastian A. Scherer

We propose Super Odometry, a high-precision multi-modal sensor fusion framework, providing a simple but effective way to fuse multiple sensors such as LiDAR, camera, and IMU sensors and achieve robust state estimation in perceptually-degraded environments. Different from traditional sensor-fusion methods, Super Odometry employs an IMU-centric data processing pipeline, which combines the advantages of loosely coupled methods with tightly coupled methods and recovers motion in a coarse-to-fine manner. The proposed framework is composed of three parts: IMU odometry, Visual-inertial odometry, and LiDAR-inertial odometry. The Visual-inertial odometry and LiDAR-inertial odometry provide the pose prior to constrain the IMU bias and receive the motion prediction from IMU odometry. To ensure high performance in real-time, we apply a dynamic octree that only consumes 10% of the running time compared with a static KD-tree. The proposed system was deployed on drones and ground robots, as part of Team Explorer’s effort to the DARPA Subterranean Challenge where the team won 1st and 2nd place in the Tunnel and Urban Circuits, respectively.

ICRA Conference 2021 Conference Paper

Vanishing Point Aided LiDAR-Visual-Inertial Estimator

  • Peng Wang
  • Zheng Fang 0001
  • Shibo Zhao
  • Yongnan Chen
  • Ming Zhou
  • Shan An

In this paper, we propose a vanishing point aided LiDAR-Visual-Inertial estimator to achieve real-time, low-drift and robust pose estimation. The proposed method is mainly composed of 3 sequential modules, namely IMU-aided vanishing point (VP) detection module, voxel-map based feature depth association module, and visual inertial fixed-lag smoother module. The IMU-aided VP detection module will detect feature points, line segments and vanishing points to establish robust correspondences in successive frames. In particular, we propose to use 1-line RANSAC method to provide stable VP hypotheses and polar grid to accelerate vanishing point hypothesis validation. After that, we propose a novel voxel-map based feature depth association method, to retrieve depth and assign depth to visual feature efficiently. Finally, the visual inertial fixed-lag smoother module is proposed to jointly minimize error terms. Experiments show that our method outperforms the state-of-the-art visual-inertial odometry and LiDAR-visual estimator in both indoor and outdoor environments.

EAAI Journal 2020 Journal Article

An approach based on linguistic spherical fuzzy sets for public evaluation of shared bicycles in China

  • Peide Liu
  • Baoying Zhu
  • Peng Wang
  • Mengjiao Shen

In the context of the booming sharing economy, shared bicycles as an important part of the sharing economy have been studied by many scholars, and these studies mainly focus on the socioeconomic characteristics of users and the system design level of shared bicycles. It is therefore necessary to study the evaluation method of shared bicycles, which is a typical multi-attribute decision-making (MADM) problem. Firstly, the linguistic spherical fuzzy numbers (Lt-SFNs) are proposed to express the public’s language evaluation information. Compared with the linguistic intuitionistic fuzzy numbers (LIFNs) and the linguistic q-rung orthopair fuzzy numbers (Lq-ROFNs), Lt-SFNs have a wider information expression range. Then, in order to integrate the language evaluation information, the linguistic spherical fuzzy weighted averaging (Lt-SFSWA) operator is proposed, which can aggregate the group linguistic evaluation information. Further, the MABAC (Multi-Attributive Border Approximation area Comparison) method is extended to the linguistic spherical fuzzy environment and the Lt-SFS-MABAC method is proposed, which can process linguistic evaluation information and select an optimal alternative from a plurality of alternatives. At the same time, the TODIM (an acronym in Portuguese of Interactive and Multicriteria Decision Making) method is extended to the linguistic spherical fuzzy environment and the Lt-SFS-TODIM method is proposed. Lastly, we conducted sensitivity analysis and comparative analysis of the Lt-SFS-MABAC method, the Lt-SFS-TODIM method and the Lt-SFSWA method. The results show that the Lt-SFS-MABAC method is sensitive to weights, so decision makers can use the Lt-SFS-MABAC method to make a realistic evaluation based on the actual environment.

AAAI Conference 2020 Conference Paper

AutoRemover: Automatic Object Removal for Autonomous Driving Videos

  • Rong Zhang
  • Wei Li
  • Peng Wang
  • Chenye Guan
  • Jin Fang
  • Yuhang Song
  • Jinhui Yu
  • Baoquan Chen

Motivated by the need for photo-realistic simulation in autonomous driving, in this paper we present a video inpainting algorithm AutoRemover, designed specifically for generating street-view videos without any moving objects. In our setup we have two challenges: the first is shadows, which are usually unlabeled but tightly coupled with the moving objects. The second is the large ego-motion in the videos. To deal with shadows, we build up an autonomous driving shadow dataset and design a deep neural network to detect shadows automatically. To deal with large ego-motion, we take advantage of the multi-source data, in particular the 3D data, in autonomous driving. More specifically, the geometric relationship between frames is incorporated into an inpainting deep neural network to produce high-quality structurally consistent video output. Experiments show that our method outperforms other state-of-the-art (SOTA) object removal algorithms, reducing the RMSE by over 19%.

AAAI Conference 2020 Conference Paper

CSPN++: Learning Context and Resource Aware Convolutional Spatial Propagation Networks for Depth Completion

  • Xinjing Cheng
  • Peng Wang
  • Chenye Guan
  • Ruigang Yang

Depth completion deals with the problem of converting a sparse depth map to a dense one, given the corresponding color image. The convolutional spatial propagation network (CSPN) is one of the state-of-the-art (SoTA) methods for depth completion, which recovers structural details of the scene. In this paper, we propose CSPN++, which further improves its effectiveness and efficiency by learning adaptive convolutional kernel sizes and the number of iterations for the propagation, so that the context and computational resources needed at each pixel can be dynamically assigned on request. Specifically, we formulate the learning of the two hyper-parameters as an architecture selection problem where various configurations of kernel sizes and numbers of iterations are first defined, and then a set of soft weighting parameters is trained to either properly assemble or select from the pre-defined configurations at each pixel. In our experiments, we find that weighted assembling, which we refer to as “context-aware CSPN”, can lead to significant accuracy improvements, while weighted selection, “resource-aware CSPN”, can reduce the computational resources significantly with similar or better accuracy. Besides, the resources needed for CSPN++ can be adjusted w.r.t. the computational budget automatically. Finally, to avoid the side effects of noisy or inaccurate sparse depths, we embed a gated network inside CSPN++, which further improves the performance. We demonstrate the effectiveness of CSPN++ on the KITTI depth completion benchmark, where it significantly improves over CSPN and other SoTA methods.

AAAI Conference 2020 Conference Paper

Discriminative and Robust Online Learning for Siamese Visual Tracking

  • Jinghao Zhou
  • Peng Wang
  • Haoyang Sun

The problem of visual object tracking has traditionally been handled by different tracking paradigms, either learning a model of the object’s appearance exclusively online or matching the object with the target in an offline-trained embedding space. Despite the recent success, each method suffers from its intrinsic constraint. The online-only approaches suffer from a lack of generalization of the model they learn and are thus inferior in target regression, while the offline-only approaches (e.g., convolutional siamese trackers) lack target-specific context information and are thus neither discriminative enough to handle distractors nor robust enough to deformation. Therefore, we propose an online module with an attention mechanism for offline siamese networks to extract target-specific features under L2 error. We further propose a filter update strategy adaptive to treacherous background noise for discriminative learning, and a template update strategy to handle large target deformations for robust learning. Effectiveness is validated by the consistent improvement over three siamese baselines: SiamFC, SiamRPN++, and SiamMask. Beyond that, our model based on SiamRPN++ obtains the best results over six popular tracking benchmarks and can operate beyond real-time.

IJCAI Conference 2020 Conference Paper

Label-Attended Hashing for Multi-Label Image Retrieval

  • Yanzhao Xie
  • Yu Liu
  • Yangtao Wang
  • Lianli Gao
  • Peng Wang
  • Ke Zhou

For multi-label image retrieval, the existing hashing algorithms neglect the dependency between objects and thus fail to capture the attention information in the feature extraction, which affects the precision of hash codes. To address this problem, we explore the inter-dependency between objects through their co-occurrence correlation from the label set and adopt a Multi-modal Factorized Bilinear (MFB) pooling component so that the image representation learning can capture this attention information. We propose a Label-Attended Hashing (LAH) algorithm which enables an end-to-end hash model with inter-dependency feature extraction. LAH first combines a Convolutional Neural Network (CNN) and a Graph Convolution Network (GCN) to separately generate the image representation and label co-occurrence embeddings, then adopts MFB to fuse these two modal vectors, and finally learns the hash function with a Cauchy distribution based loss function via back propagation. Extensive experiments on public multi-label datasets demonstrate that (1) LAH can achieve state-of-the-art retrieval results and (2) the usage of the co-occurrence relationship and MFB not only promotes the precision of hash codes but also accelerates the hash learning. GitHub address: https://github.com/IDSM-AI/LAH.

AAAI Conference 2020 Conference Paper

Pixel-Aware Deep Function-Mixture Network for Spectral Super-Resolution

  • Lei Zhang
  • Zhiqiang Lang
  • Peng Wang
  • Wei Wei
  • Shengcai Liao
  • Ling Shao
  • Yanning Zhang

Spectral super-resolution (SSR) aims at generating a hyperspectral image (HSI) from a given RGB image. Recently, a promising direction is to learn a complicated mapping function from the RGB image to the HSI counterpart using a deep convolutional neural network. This essentially involves mapping the RGB context within a size-specific receptive field centered at each pixel to its spectrum in the HSI. The focus is thus to appropriately determine the receptive field size and establish the mapping function from RGB context to the corresponding spectrum. Due to their differences in category or spatial position, pixels in HSIs often require different-sized receptive fields and distinct mapping functions. However, few efforts have been made to explicitly exploit this prior. To address this problem, we propose a pixel-aware deep function-mixture network for SSR, which is composed of a new class of modules, termed function-mixture (FM) blocks. Each FM block is equipped with some basis functions, i.e., parallel subnets with different-sized receptive fields. Besides, it incorporates an extra subnet as a mixing function to generate pixel-wise weights, and then linearly mixes the outputs of all basis functions with those generated weights. This enables us to determine the receptive field size and the mapping function pixel-wise. Moreover, we stack several such FM blocks to further increase the flexibility of the network in learning the pixel-wise mapping. To encourage feature reuse, intermediate features generated by the FM blocks are fused at a late stage, which proves to be effective for boosting the SSR performance. Experimental results on three benchmark HSI datasets demonstrate the superiority of the proposed method.

AIIM Journal 2020 Journal Article

Real-world data medical knowledge graph: construction and applications

  • Linfeng Li
  • Peng Wang
  • Jun Yan
  • Yao Wang
  • Simin Li
  • Jinpeng Jiang
  • Zhe Sun
  • Buzhou Tang

Objective: Medical knowledge graph (KG) is attracting attention from both academia and the healthcare industry due to its power in intelligent healthcare applications. In this paper, we introduce a systematic approach to build a medical KG from electronic medical records (EMRs), evaluated by both technical experiments and end-to-end application examples. Materials and Methods: The original data set contains 16,217,270 de-identified clinical visit records of 3,767,198 patients. The KG construction procedure includes 8 steps: data preparation, entity recognition, entity normalization, relation extraction, property calculation, graph cleaning, related-entity ranking, and graph embedding. We propose a novel quadruplet structure to represent medical knowledge instead of the classical triplet in KG. A novel related-entity ranking function considering probability, specificity and reliability (PSR) is proposed. Besides, the probabilistic translation on hyperplanes (PrTransH) algorithm is used to learn graph embedding for the generated KG. Results: A medical KG with 9 entity types including disease, symptom, etc. was established, which contains 22,508 entities and 579,094 quadruplets. Compared with the term frequency - inverse document frequency (TF/IDF) method, the normalized discounted cumulative gain (NDCG@10) increased from 0.799 to 0.906 with the proposed ranking function. Embedding representations for all entities and relations were learned and shown to be effective using disease clustering. Conclusion: The established systematic procedure can efficiently construct a high-quality medical KG from large-scale EMRs. The proposed ranking function PSR achieves the best performance under all relations, and the disease clustering result validates the efficacy of the learned embedding vector as an entity’s semantic representation. Moreover, the obtained KG finds many successful applications due to its statistics-based quadruplet.
where $N_{co}^{min}$ is a minimum co-occurrence number and $R$ is the basic reliability value. The reliability value measures how reliable the relationship between $S_i$ and $O_{ij}$ is. The rationale for the definition is that the higher the value of $N_{co}(S_i, O_{ij})$, the more reliable the relationship. However, the reliability values of two relationships should not differ greatly if both of their co-occurrence numbers are very large. In our study, we finally set $N_{co}^{min} = 10$ and $R = 1$ after some experiments. For instance, if the co-occurrence numbers of three relationships are 1, 100 and 10000, their reliability values are 1, 2.96 and 5 respectively.
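
The defining equation for the reliability value is not present in this excerpt. As a rough illustration only, the sketch below uses one logarithmic form that happens to reproduce the three example values quoted above (1, 2.96 and 5) with a minimum co-occurrence number of 10 and a basic reliability value of 1. The function name `reliability`, its parameter names, and the formula itself are assumptions, not the authors' stated definition.

```python
import math


def reliability(n_co, n_co_min=10, r=1.0):
    """Hypothetical reliability value for a relationship.

    ASSUMPTION: the formula below is inferred from the quoted example
    values; the original equation is missing from this excerpt.
    """
    if n_co <= n_co_min:
        # Rare co-occurrences only receive the basic reliability value.
        return r
    # Logarithmic growth keeps very large counts from differing greatly.
    return r + math.log10(n_co - n_co_min + 1)


for n in (1, 100, 10000):
    print(n, round(reliability(n), 2))
```

The logarithm reflects the stated design goal: reliability rises with the co-occurrence count, yet two relationships with very large counts end up with similar values.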

IROS Conference 2020 Conference Paper

TP-TIO: A Robust Thermal-Inertial Odometry with Deep ThermalPoint

  • Shibo Zhao
  • Peng Wang
  • Hengrui Zhang
  • Zheng Fang 0001
  • Sebastian A. Scherer

To achieve robust motion estimation in visually degraded environments, thermal odometry has attracted attention in the robotics community. However, most thermal odometry methods are purely based on classical feature extractors, which makes it difficult to establish robust correspondences in successive frames due to sudden photometric changes and large thermal noise. To solve this problem, we propose ThermalPoint, a lightweight feature detection network specifically tailored for producing keypoints on thermal images, providing notable anti-noise improvements compared with other state-of-the-art methods. After that, we combine ThermalPoint with a novel radiometric feature tracking method, which directly makes use of full radiometric data and establishes reliable correspondences between sequential frames. Finally, taking advantage of an optimization-based visual-inertial framework, a deep feature-based thermal-inertial odometry (TP-TIO) framework is proposed and evaluated thoroughly in various visually degraded environments. Experiments show that our method outperforms state-of-the-art visual and laser odometry methods in smoke-filled environments and achieves competitive accuracy in normal environments.

AAAI Conference 2020 Conference Paper

V-PROM: A Benchmark for Visual Reasoning Using Visual Progressive Matrices

  • Damien Teney
  • Peng Wang
  • Jiewei Cao
  • Lingqiao Liu
  • Chunhua Shen
  • Anton van den Hengel

Advances in machine learning have generated increasing enthusiasm for tasks that require high-level reasoning on top of perceptual capabilities, particularly over visual data. Such tasks include, for example, image captioning, visual question answering, and visual navigation. Their evaluation is however hindered by task-specific confounding factors and dataset biases. In parallel, the existing benchmarks for abstract reasoning are limited to synthetic stimuli (e.g. images of simple shapes) and do not capture the challenges of real-world data. We propose a new large-scale benchmark to evaluate abstract reasoning over real visual data. The test involves visual questions that require operations fundamental to many high-level vision tasks, such as comparisons of counts and logical operations on complex visual properties. The benchmark measures a method’s ability to infer high-level relationships and to generalise them over image-based concepts. We provide multiple training/test splits that require controlled levels of generalization. We evaluate a range of deep learning architectures, and find that existing models, including those popular for vision-and-language tasks, are unable to solve seemingly-simple instances. Models using relational networks fare better but leave substantial room for improvement.

YNIMG Journal 2019 Journal Article

Long-range functional coupling predicts performance: Oscillatory EEG networks in multisensory processing

  • Peng Wang
  • Florian Göschl
  • Uwe Friese
  • Peter König
  • Andreas K. Engel

The integration of sensory signals from different modalities requires flexible interaction of remote brain areas. One candidate mechanism to establish communication in the brain is transient synchronization of oscillatory neural signals. Although there is abundant evidence for the involvement of cortical oscillations in brain functions based on the analysis of local power, assessment of the phase dynamics among spatially distributed neuronal populations and their relevance for behavior is still sparse. In the present study, we investigated the interaction between remote brain areas by analyzing high-density electroencephalogram (EEG) data obtained from human participants engaged in a visuotactile pattern matching task. We deployed an approach for purely data-driven clustering of neuronal phase coupling in source space, which allowed imaging of large-scale functional networks in space, time and frequency without defining a priori constraints. Based on the phase coupling results, we further explored how brain areas interacted across frequencies by computing phase-amplitude coupling. Several networks of interacting sources were identified with our approach, synchronizing their activity within and across the theta (∼5 Hz), alpha (∼10 Hz), and beta (∼20 Hz) frequency bands and involving multiple brain areas that have previously been associated with attention and motor control. We demonstrate the functional relevance of these networks by showing that phase delays – in contrast to spectral power – were predictive of task performance. The data-driven analysis approach employed in the current study allowed an unbiased examination of functional brain networks based on EEG source level connectivity data. Showcased for multisensory processing, our results provide evidence that large-scale neuronal coupling is vital to long-range communication in the human brain and relevant for the behavioral outcome in a cognitive task.

AAAI Conference 2019 Conference Paper

Show, Attend and Read: A Simple and Strong Baseline for Irregular Text Recognition

  • Hui Li
  • Peng Wang
  • Chunhua Shen
  • Guyu Zhang

Recognizing irregular text in natural scene images is challenging due to the large variance in text appearance, such as curvature, orientation and distortion. Most existing approaches rely heavily on sophisticated model designs and/or extra fine-grained annotations, which, to some extent, increase the difficulty in algorithm implementation and data collection. In this work, we propose an easy-to-implement strong baseline for irregular scene text recognition, using off-the-shelf neural network components and only word-level annotations. It is composed of a 31-layer ResNet, an LSTM-based encoder-decoder framework and a 2-dimensional attention module. Despite its simplicity, the proposed method is robust. It achieves state-of-the-art performance on irregular text recognition benchmarks and comparable results on regular text datasets. The code will be released.

ICRA Conference 2019 Conference Paper

Training a Binary Weight Object Detector by Knowledge Transfer for Autonomous Driving

  • Jiaolong Xu
  • Yiming Nie
  • Peng Wang
  • Antonio M. López 0001

Autonomous driving has harsh requirements of small model size and energy efficiency, in order to enable the embedded system to achieve real-time on-board object detection. Recent deep convolutional neural network based object detectors have achieved state-of-the-art accuracy. However, such models are trained with numerous parameters, and their high computational costs and large storage prohibit deployment to memory- and computation-limited systems. Low-precision neural networks are popular techniques for reducing the computation requirements and memory footprint. Among them, binary weight neural networks (BWNs) are the extreme case, quantizing floating-point weights into just 1 bit. BWNs are difficult to train and suffer from accuracy degradation due to the extremely low-bit representation. To address this problem, we propose a knowledge transfer (KT) method to aid the training of BWNs using a full-precision teacher network. We built DarkNet- and MobileNet-based binary weight YOLOv2 detectors and conducted experiments on the KITTI benchmark for car, pedestrian and cyclist detection. The experimental results show that the proposed method maintains high detection accuracy while reducing the model size of DarkNet-YOLO from 257 MB to 8.8 MB and MobileNet-YOLO from 193 MB to 7.9 MB.

IJCAI Conference 2018 Conference Paper

Neural Networks Incorporating Unlabeled and Partially-labeled Data for Cross-domain Chinese Word Segmentation

  • Lujun Zhao
  • Qi Zhang
  • Peng Wang
  • Xiaoyu Liu

Most existing Chinese word segmentation (CWS) methods are usually supervised. Hence, large-scale annotated domain-specific datasets are needed for training. In this paper, we seek to address the problem of CWS for the resource-poor domains that lack annotated data. A novel neural network model is proposed to incorporate unlabeled and partially-labeled data. To make use of unlabeled data, we combine a bidirectional LSTM segmentation model with two character-level language models using a gate mechanism. These language models can capture co-occurrence information. To make use of partially-labeled data, we modify the original cross entropy loss function of RNN. Experimental results demonstrate that the method performs well on CWS tasks in a series of domains.

AAAI Conference 2018 Conference Paper

Unsupervised Learning of Geometry From Videos With Edge-Aware Depth-Normal Consistency

  • Zhenheng Yang
  • Peng Wang
  • Wei Xu
  • Liang Zhao
  • Ramakant Nevatia

Learning to reconstruct depths from a single image by watching unlabeled videos via a deep convolutional network (DCN) has attracted significant attention in recent years, e.g. (Zhou et al. 2017). In this paper, we propose to use a surface normal representation in an unsupervised depth estimation framework. Our estimated depths are constrained to be compatible with predicted normals, yielding more robust geometry results. Specifically, we formulate an edge-aware depth-normal consistency term, and solve it by constructing a depth-to-normal layer and a normal-to-depth layer inside the DCN. The depth-to-normal layer takes estimated depths as input, and computes normal directions using cross products based on neighboring pixels. Then, given the estimated normals, the normal-to-depth layer outputs a regularized depth map through local planar smoothness. Both layers are computed with awareness of edges inside the image to help address the issue of depth/normal discontinuity and preserve sharp edges. Finally, to train the network, we apply the photometric error and gradient smoothness to supervise both depth and normal predictions. We conducted experiments on both outdoor (KITTI) and indoor (NYUv2) datasets, and showed that our algorithm vastly outperforms the state of the art, which demonstrates the benefits of our approach.
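
As a rough illustration of the depth-to-normal idea described in this abstract, the sketch below back-projects a pixel and two of its neighbours into 3D and takes the cross product of the resulting difference vectors. It is a minimal, non-differentiable sketch and not the paper's layer: the pinhole intrinsics `f`, `cx`, `cy`, the choice of the right and lower neighbours, the sign convention, and all function names are assumptions, and the edge-aware weighting is omitted.

```python
import math


def backproject(u, v, depth, f, cx, cy):
    """Pinhole back-projection of pixel (u, v) to a camera-frame 3D point."""
    return ((u - cx) * depth / f, (v - cy) * depth / f, depth)


def cross(a, b):
    """Cross product of two 3-vectors."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def normal_at(depth_map, u, v, f=1.0, cx=0.0, cy=0.0):
    """Unit normal at (u, v) from its right and lower neighbours (assumed choice)."""
    p = backproject(u, v, depth_map[v][u], f, cx, cy)
    right = backproject(u + 1, v, depth_map[v][u + 1], f, cx, cy)
    down = backproject(u, v + 1, depth_map[v + 1][u], f, cx, cy)
    a = tuple(r - q for r, q in zip(right, p))
    b = tuple(d - q for d, q in zip(down, p))
    n = cross(a, b)
    length = math.sqrt(sum(c * c for c in n))
    return tuple(c / length for c in n)


# A fronto-parallel plane at constant depth yields a normal along the optical axis.
flat = [[2.0, 2.0], [2.0, 2.0]]
print(normal_at(flat, 0, 0))
```

With this neighbour ordering the normal of a fronto-parallel plane points along +z; swapping the cross-product arguments flips it toward the camera.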

AAAI Conference 2017 Conference Paper

Event Video Mashup: From Hundreds of Videos to Minutes of Skeleton

  • Lianli Gao
  • Peng Wang
  • Jingkuan Song
  • Zi Huang
  • Jie Shao
  • Heng Shen

The explosive growth of video content on the Web has been revolutionizing the way people share, exchange and perceive information, such as events. While an individual video usually concerns a specific aspect of an event, the videos that are uploaded by different users at different locations and times can embody different emphases and complement each other in describing the event. Combining these videos from different sources can unveil a more complete picture of the event. Simply concatenating videos is an intuitive solution, but it may degrade user experience since it is time-consuming and tedious to view such highly redundant, noisy and disorganized content. Therefore, we develop a novel approach, termed event video mashup (EVM), to automatically generate a unified short video from a collection of Web videos to describe the storyline of an event. We propose a submodular based content selection model that embodies both importance and diversity to depict the event from comprehensive aspects in an efficient way. Importantly, the video content is organized temporally and semantically, conforming to the event evolution. We evaluate our approach on a real-world YouTube event dataset collected by ourselves. The extensive experimental results demonstrate the effectiveness of the proposed framework.

IJCAI Conference 2017 Conference Paper

Explicit Knowledge-based Reasoning for Visual Question Answering

  • Peng Wang
  • Qi Wu
  • Chunhua Shen
  • Anthony Dick
  • Anton van den Hengel

We describe a method for visual question answering which is capable of reasoning about an image on the basis of information extracted from a large-scale knowledge base. The method not only answers natural language questions using concepts not contained in the image, but can explain the reasoning by which it developed its answer. It is capable of answering far more complex questions than the predominant long short-term memory-based approach, and outperforms it significantly in testing. We also provide a dataset and a protocol by which to evaluate general visual question answering methods.

JBHI Journal 2016 Journal Article

Comparison of Three Different Types of Wrist Pulse Signals by Their Physical Meanings and Diagnosis Performance

  • Wangmeng Zuo
  • Peng Wang
  • David Zhang

Increasing interest has been focused on computational pulse diagnosis, where sensors are developed to acquire pulse signals, and machine learning techniques are exploited to analyze health conditions based on the acquired pulse signals. So far, a number of sensors have been employed for pulse signal acquisition, which can be grouped into three major categories, i.e., pressure, photoelectric, and ultrasonic sensors. To guide the sensor selection for computational pulse diagnosis, in this paper, we analyze the physical meanings and sensitivities of signals acquired by these three types of sensors. The dependence and complementarity of the different sensors are discussed from the perspectives of both cardiovascular fluid dynamics and comparative experiments evaluating disease classification performance. Experimental results indicate that each sensor is more appropriate for diagnosing specific diseases whose changes in physiological factors are effectively reflected by that sensor, e.g., the ultrasonic sensor for diabetes and the pressure sensor for arteriosclerosis, and that improved diagnosis performance can be obtained by combining the three types of signals.

AAAI Conference 2016 Conference Paper

Pose-Guided Human Parsing by an AND/OR Graph Using Pose-Context Features

  • Fangting Xia
  • Jun Zhu
  • Peng Wang
  • Alan Yuille

Parsing humans into semantic parts is crucial to human-centric analysis. In this paper, we propose a human parsing pipeline that uses pose cues, i.e., estimates of human joint locations, to provide pose-guided segment proposals for semantic parts. These segment proposals are ranked using standard appearance cues, deep-learned semantic features, and a novel pose feature called pose-context. Then these proposals are selected and assembled using an And-Or graph to output a parse of the person. The And-Or graph is able to deal with large human appearance variability due to pose, choice of clothes, etc. We evaluate our approach on the popular Penn-Fudan pedestrian parsing dataset, showing that it significantly outperforms the state of the art, and perform diagnostics to demonstrate the effectiveness of different stages of our pipeline.

NeurIPS Conference 2016 Conference Paper

SURGE: Surface Regularized Geometry Estimation from a Single Image

  • Peng Wang
  • Xiaohui Shen
  • Bryan Russell
  • Scott Cohen
  • Brian Price
  • Alan Yuille

This paper introduces an approach to regularize 2.5D surface normal and depth predictions at each pixel given a single input image. The approach infers and reasons about the underlying 3D planar surfaces depicted in the image to snap predicted normals and depths to inferred planar surfaces, all while maintaining fine detail within objects. Our approach comprises two components: (i) a four-stream convolutional neural network (CNN) where depths, surface normals, and likelihoods of planar region and planar boundary are predicted at each pixel, followed by (ii) a dense conditional random field (DCRF) that integrates the four predictions such that the normals and depths are compatible with each other and regularized by the planar region and planar boundary information. The DCRF is formulated such that gradients can be passed to the surface normal and depth CNNs via backpropagation. In addition, we propose new planar-wise metrics to evaluate geometry consistency within planar surfaces, which are more tightly related to dependent 3D editing applications. We show that our regularization yields a 30% relative improvement in planar consistency on the NYU v2 dataset.

IJCAI Conference 2015 Conference Paper

Convolutional Neural Networks for Text Hashing

  • Jiaming Xu
  • Peng Wang
  • Guanhua Tian
  • Bo Xu
  • Jun Zhao
  • Fangyuan Wang
  • Hongwei Hao

Hashing, as a popular approximate nearest neighbor search technique, has been widely used for large-scale similarity search. Recently, a spectrum of machine learning methods has been utilized to learn similarity-preserving binary codes. However, most of them directly encode the explicit features (keywords), and so fail to preserve accurate semantic similarities in binary code beyond keyword matching, especially on short texts. Here we propose a novel text hashing framework with convolutional neural networks. In particular, we first embed the keyword features into compact binary code with a locality preserving constraint. Meanwhile, word features and position features are together fed into a convolutional network to learn the implicit features, which are further incorporated with the explicit features to fit the pre-trained binary code. This base method can be accomplished without any external tags/labels, and three other model variations are designed to integrate tags/labels. Experimental results show the superiority of our proposed approach over several state-of-the-art hashing methods when tested on one short text dataset as well as one normal text dataset.

IJCAI Conference 2015 Conference Paper

Sparse Probabilistic Matrix Factorization by Laplace Distribution for Collaborative Filtering

  • Liping Jing
  • Peng Wang
  • Liu Yang

In recommendation systems, probabilistic matrix factorization (PMF) is a state-of-the-art collaborative filtering method that determines latent features to represent users and items. However, two major issues limiting the usefulness of PMF are the sparsity problem and the long-tail distribution. Sparsity refers to the situation in which the observed rating data are sparse, so that only part of the latent features are informative for describing each item/user. The long-tail distribution implies that a large fraction of items have few ratings. In this work, we propose a sparse probabilistic matrix factorization method (SPMF) that utilizes a Laplacian distribution to model the item/user factor vector. The Laplacian distribution has the ability to generate sparse coding, which helps SPMF distinguish the relevant and irrelevant latent features with respect to each item/user. Meanwhile, the tails of the Laplacian distribution are comparatively heavy, which helps SPMF recommend the tail items. Furthermore, a distributed Gibbs sampling algorithm is developed to efficiently train the proposed sparse probabilistic model. A series of experiments on the Netflix and Movielens datasets have been conducted to demonstrate that SPMF outperforms the existing PMF and its extended version Bayesian PMF (BPMF), especially for the recommendation of tail items.

IJCAI Conference 2011 Conference Paper

Matching Large Ontologies Based on Reduction Anchors

  • Peng Wang
  • Yuming Zhou
  • Baowen Xu

Matching large ontologies is a challenge due to the high time complexity. This paper proposes a new matching method for large ontologies based on reduction anchors. This method has a distinct advantage over the divide-and-conquer methods because it does not need to partition large ontologies. In particular, two kinds of reduction anchors, positive and negative reduction anchors, are proposed to reduce the time complexity in matching. Positive reduction anchors use the concept hierarchy to predict the ignorable similarity calculations. Negative reduction anchors use the locality of matching to predict the ignorable similarity calculations. Our experimental results on real-world data sets show that the proposed method is efficient for matching large ontologies.

IROS Conference 2008 Conference Paper

Development of a multi-DOF exoskeleton based machine for injured fingers

  • Yili Fu 0001
  • Peng Wang
  • Shuguo Wang

In order to offer a method for the rehabilitation of injured fingers and a means of quantitative detection and evaluation, an exoskeleton based continuous passive motion (CPM) machine is presented in this paper. Corresponding to each finger of the human hand, the CPM machine has 4 degrees of freedom (DOF) driven by two DC motors. The joint force and position sensors are all integrated into the machine. The device can be easily attached and adjusted to fit different hand sizes. During the injured finger’s flexion and extension motion, the machine can always exert perpendicular forces on the finger phalanges, while achieving precise control of the range, force and speed of the moving fingers. In order to control the CPM machine, we have also designed an embedded control system based on the S3C2410 (a 32-bit RISC microprocessor). The whole system is open-ended for new functions and applications. The function modularization method provides a new design approach for the control system.