Arrow Research search

Author name cluster

Wei Wei

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

89 papers
2 author rows

Possible papers

89

AAAI Conference 2026 Conference Paper

Appearance Discrepancy-guided Sequence Hybrid Masking for Robust Scene Text Recognition

  • Shihao Zou
  • Wei Wei
  • Leyang Xu
  • Kaihe Xu
  • Wenfeng Xie

Masked Image Modeling (MIM) has been widely recognized as a powerful self-supervised paradigm for learning general-purpose visual representations. However, standard MIM based on random masking tends to underperform in domain-specific tasks like Scene Text Recognition (STR), due to challenges such as information sparsity and appearance discrepancies caused by partial occlusion or distortion. To address this issue, we propose a novel pre-training framework called Appearance Discrepancy-guided Sequence Hybrid Masking (DSHM), specifically designed to learn robust representations for STR. At its core, we introduce an Appearance Discrepancy Metric that quantifies the discrepancy level of each image patch by measuring its anisotropic local discrepancy and intra-instance global style discrepancy. The resulting discrepancy scores are utilized in two key components: (1) a Sequence Hybrid Masking strategy, which prioritizes masking high-discrepancy patches in coherent block forms, thereby elevating the pretext task from simple pixel-level completion to more complex structural reasoning; and (2) Discrepancy-Conditioned Tokens (DC-Tokens), which encode prior knowledge about patch difficulty into the decoder, enabling an adaptive reconstruction process and improving model robustness under partial occlusion or text distortion. We achieve competitive performance on multiple benchmark datasets, including common benchmarks, Union14M benchmarks, and Chinese benchmarks.

AAAI Conference 2026 Conference Paper

ComoRAG: A Cognitive-Inspired Memory-Organized RAG for Stateful Long Narrative Reasoning

  • Juyuan Wang
  • Rongchen Zhao
  • Wei Wei
  • Yufeng Wang
  • Mo Yu
  • Jie Zhou
  • Jin Xu
  • Liyan Xu

Narrative comprehension of long stories and novels has been a challenging domain owing to their intricate plotlines and entangled, often evolving relations among characters and entities. Given LLMs' diminished reasoning over extended context and their high computational cost, retrieval-based approaches retain a pivotal role in practice. However, traditional RAG methods can fall short due to their stateless, single-step retrieval process, which often overlooks the dynamic, interconnected relations within long-range context. In this work, we propose ComoRAG, holding the principle that narrative reasoning is not a one-shot process, but a dynamic, evolving interplay between new evidence acquisition and past knowledge consolidation, analogous to human cognition when reasoning with memory-related signals in the brain. Specifically, when encountering a reasoning impasse, ComoRAG undergoes iterative reasoning cycles while interacting with a dynamic memory workspace. In each cycle, it generates probing queries to devise new exploratory paths, then integrates the retrieved evidence of new aspects into a global memory pool, thereby supporting the emergence of a coherent context for query resolution. Across four challenging long-context narrative benchmarks (200K+ tokens), ComoRAG outperforms strong RAG baselines, with consistent relative gains of up to 11% over the strongest baseline. Further analysis reveals that ComoRAG is particularly advantageous for complex queries requiring global comprehension, offering a principled, cognitively motivated paradigm for retrieval-based stateful reasoning.

AAAI Conference 2026 Conference Paper

JoDiffusion: Jointly Diffusing Image with Pixel-Level Annotations for Semantic Segmentation Promotion

  • Haoyu Wang
  • Lei Zhang
  • Wenrui Liu
  • Dengyang Jiang
  • Wei Wei
  • Chen Ding

Given the inherently costly and time-intensive nature of pixel-level annotation, the generation of synthetic datasets comprising sufficiently diverse synthetic images paired with ground-truth pixel-level annotations has recently garnered increasing attention for training high-performance semantic segmentation models. However, existing methods must either predict pseudo annotations after image generation or generate images conditioned on manual annotation masks, which incurs image-annotation semantic inconsistency or scalability problems. To mitigate both problems with one stone, we present a novel dataset-generative diffusion framework for semantic segmentation, termed JoDiffusion. First, given a standard latent diffusion model, JoDiffusion incorporates an independent annotation variational auto-encoder (VAE) network to map annotation masks into the latent space shared by images. Then, the diffusion model is tailored to capture the joint distribution of each image and its annotation mask conditioned on a text prompt. By doing so, JoDiffusion can simultaneously generate paired images and semantically consistent annotation masks conditioned solely on text prompts, thereby demonstrating superior scalability. Additionally, a mask optimization strategy is developed to mitigate the annotation noise produced during generation. Experiments on the Pascal VOC, COCO, and ADE20K datasets show that the annotated dataset generated by JoDiffusion yields substantial performance improvements in semantic segmentation compared to existing methods.

AAAI Conference 2026 Short Paper

ProRefine: Inference-Time Prompt Refinement with Textual Feedback (Student Abstract)

  • Deepak Pandita
  • Tharindu Cyril Weerasooriya
  • Ankit Shah
  • Isabelle Diana May-Xin Ng
  • Christopher M. Homan
  • Wei Wei

Agentic workflows, where multiple AI agents collaborate to accomplish complex tasks like reasoning or planning, play a substantial role in many cutting-edge commercial applications. These workflows depend critically on the prompts that define the roles models play in them. Poorly designed prompts that fail even slightly to guide individual agents can lead to sub-optimal performance that may snowball within a system of agents, limiting their reliability and scalability. To address this important problem of inference-time prompt optimization, we introduce ProRefine, an innovative inference-time optimization method that uses an agentic loop of LLMs to generate and apply textual feedback. ProRefine dynamically refines prompts for multi-step reasoning tasks without additional training or ground-truth labels. Evaluated on five benchmark mathematical reasoning datasets, ProRefine significantly surpasses zero-shot Chain-of-Thought baselines by 3 to 37 percentage points. This approach not only boosts accuracy but also allows smaller models to approach the performance of their larger counterparts, highlighting its potential for building cost-effective and powerful hybrid AI systems and thereby democratizing access to high-performing AI.

AAAI Conference 2026 Conference Paper

Robust-R1: Degradation-Aware Reasoning for Robust Visual Understanding

  • Jiaqi Tang
  • Jianmin Chen
  • Wei Wei
  • Xiaogang Xu
  • Runtao Liu
  • Xiangyu Wu
  • Qipeng Xie
  • Jiafei Wu

Multimodal Large Language Models (MLLMs) struggle to maintain reliable performance under extreme real-world visual degradations, which impedes their practical robustness. Existing robust MLLMs predominantly rely on implicit training/adaptation that focuses solely on visual encoder generalization, suffering from limited interpretability and isolated optimization. To overcome these limitations, we propose Robust-R1, a novel framework that explicitly models visual degradations through structured reasoning chains. Our approach integrates: (i) supervised fine-tuning for degradation-aware reasoning foundations, (ii) reward-driven alignment for accurately perceiving degradation parameters, and (iii) dynamic reasoning depth scaling adapted to degradation intensity. To facilitate this approach, we introduce a specialized 11K dataset featuring realistic degradations synthesized across four critical real-world visual processing stages, each annotated with structured chains connecting degradation parameters, perceptual influence, a pristine semantic reasoning chain, and a conclusion. Comprehensive evaluations demonstrate state-of-the-art robustness: Robust-R1 outperforms all general and robust baselines on the real-world degradation benchmark R-Bench, while maintaining superior anti-degradation performance under multi-intensity adversarial degradations on MMMB, MMStar, and RealWorldQA.

JBHI Journal 2025 Journal Article

Bridged Semantic Alignment for Zero-Shot 3D Medical Image Diagnosis

  • Haoran Lai
  • Zihang Jiang
  • Qingsong Yao
  • Rongsheng Wang
  • Zhiyang He
  • Xiaodong Tao
  • Weifu Lv
  • Wei Wei

3D medical images such as computed tomography are widely used in clinical practice, offering great potential for automatic diagnosis. Supervised learning-based approaches have achieved significant progress but rely heavily on extensive manual annotations, limited by the availability of training data and the diversity of abnormality types. Vision-language alignment (VLA) offers a promising alternative by enabling zero-shot learning without additional annotations. However, we empirically discover that the visual and textual embeddings produced by existing VLA methods form two well-separated clusters, leaving a wide gap to be bridged. To bridge this gap, we propose a Bridged Semantic Alignment (BrgSA) framework. First, we utilize a large language model to perform semantic summarization of reports, extracting high-level semantic information. Second, we design a Cross-Modal Knowledge Interaction module that leverages a cross-modal knowledge bank as a semantic bridge, facilitating interaction between the two modalities, narrowing the gap, and improving their alignment. To comprehensively evaluate our method, we construct a benchmark dataset that includes 15 underrepresented abnormalities and also utilize two existing benchmark datasets. Experimental results demonstrate that BrgSA achieves state-of-the-art performance on both public benchmark datasets and our custom-labeled dataset, with significant improvements in zero-shot diagnosis of underrepresented abnormalities.

AAAI Conference 2025 Conference Paper

Dynamic Uncertainty Estimation for Offline Reinforcement Learning

  • Jiesheng Wang
  • Lin Li
  • Wei Wei
  • Yujia Zhang
  • Xin Yang

Offline reinforcement learning confronts the distributional shift challenge, a consequence of learning policy from static datasets. Current methods primarily handle this issue by aligning the learned policy with the behavior policy or conservatively estimating Q-values for out-of-distribution (OOD) actions. However, these approaches can lead to overly pessimistic estimation of Q-values of the OOD actions in unfamiliar situations, resulting in a suboptimal policy. To address this, we propose a new method, Dynamic Uncertainty estimation for Offline Reinforcement Learning. This method introduces a base density-truncated OOD data sampling approach to reduce the impact of extrapolation errors on uncertainty estimation. It enables conservative estimation of Q-values for OOD actions while avoiding negative impacts on in-distribution data. We also develop a dynamic uncertainty estimation mechanism to prevent excessive pessimism and enhance the generalization of the Q-function. This mechanism dynamically adjusts the degree of pessimism in the Q-function by minimizing the error between target and estimated values. Our method outperforms existing algorithms, as demonstrated by experimental results based on the D4RL benchmark, and proves its superiority in addressing the distributional shift challenge.

AAAI Conference 2025 Conference Paper

Editing Memories Through Few Targeted Neurons

  • Wei Zhou
  • Wei Wei
  • Guibang Cao
  • Fei Wang

Model editing is a novel research topic in large language models (LLMs), aimed at efficiently handling various knowledge editing tasks. Since irrelevant knowledge is difficult to measure, existing editing methods often lack explicit ways to preserve it, especially methods based on the fine-tuning paradigm. They generally control the locality performance of model editing by constraining the range of changes in model parameters. However, their performance improvements are not always ideal and may even decrease editing reliability. In this paper, we explore effective editing locality control methods based on the relationship between stored knowledge and the strongly associated model components. Building on the discovery of "knowledge neurons" and extensive experimental results, we further explore the characteristics linking knowledge and model components, and confirm that: (1) only 1% of neurons contribute significantly to the storage of a specific piece of knowledge, and (2) these targeted neurons often overlap heavily for knowledge with similar relational descriptions, which means that knowledge with similar relations may be severely affected when these targeted neurons are modified. Based on these findings, we propose Targeted Neurons Fine-tuning with Data Augmentation (TNF-DA), which performs data augmentation based on the relational representation of the edited knowledge to improve editing locality. By freezing most of the model parameters and fine-tuning only the highly contributing neurons corresponding to the edited knowledge, we obtain desirable results in terms of generalization and specificity compared with previous fine-tuning-based methods. Extensive experiments demonstrate the superior editing performance of our proposed method.

AAAI Conference 2025 Conference Paper

Enhanced Sample Selection with Confidence Tracking: Identifying Correctly Labeled Yet Hard-to-Learn Samples in Noisy Data

  • Weiran Pan
  • Wei Wei
  • Feida Zhu
  • Yong Deng

We propose a novel sample selection method for image classification in the presence of noisy labels. Existing methods typically consider small-loss samples as correctly labeled. However, some correctly labeled samples are inherently difficult for the model to learn and can exhibit high loss similar to mislabeled samples in the early stages of training. Consequently, setting a threshold on per-sample loss to select correct labels results in a trade-off between precision and recall in sample selection: a lower threshold may miss many correctly labeled hard-to-learn samples (low recall), while a higher threshold may include many mislabeled samples (low precision). To address this issue, our goal is to accurately distinguish correctly labeled yet hard-to-learn samples from mislabeled ones, thus alleviating the trade-off dilemma. We achieve this by considering the trends in model prediction confidence rather than relying solely on loss values. Empirical observations show that only for correctly labeled samples, the model's prediction confidence for the annotated labels typically increases faster than for any other classes. Based on this insight, we propose tracking the confidence gaps between the annotated labels and other classes during training and evaluating their trends using the Mann-Kendall Test. A sample is considered potentially correctly labeled if all its confidence gaps tend to increase. Our method functions as a plug-and-play component that can be seamlessly integrated into existing sample selection techniques. Experiments on several standard benchmarks and real-world datasets demonstrate that our method enhances the performance of existing methods for learning with noisy labels.
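The selection rule sketched in this abstract is concrete enough to illustrate in code. Below is a minimal, illustrative Python sketch (not the authors' implementation): per-class confidence gaps are recorded over training epochs, and a one-sided Mann-Kendall trend test flags a sample as potentially correctly labeled only if every gap tends to increase. The function names, the `alpha` threshold, and the dictionary layout are assumptions for illustration.

```python
import math
from itertools import combinations

def mann_kendall_increasing(series, alpha=0.05):
    """One-sided Mann-Kendall test: does the series show an upward trend?"""
    n = len(series)
    # S statistic: net count of increasing pairs (i < j)
    s = sum((series[j] > series[i]) - (series[j] < series[i])
            for i, j in combinations(range(n), 2))
    var_s = n * (n - 1) * (2 * n + 5) / 18.0  # variance under no-trend null
    if s > 0:
        z = (s - 1) / math.sqrt(var_s)
    elif s < 0:
        z = (s + 1) / math.sqrt(var_s)
    else:
        z = 0.0
    p = 1.0 - 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))  # P(Z > z)
    return p < alpha

def looks_correctly_labeled(gap_history):
    """gap_history maps each non-annotated class to the per-epoch confidence
    gap (annotated-label confidence minus that class's confidence).
    Keep the sample only if every gap trends upward."""
    return all(mann_kendall_increasing(gaps) for gaps in gap_history.values())
```

In this sketch, a hard-to-learn but correctly labeled sample with slowly rising gaps would still pass the test, while a mislabeled sample whose gap to the true class shrinks would be rejected.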

NeurIPS Conference 2025 Conference Paper

FANS: A Flatness-Aware Network Structure for Generalization in Offline Reinforcement Learning

  • Da Wang
  • Yi Ma
  • Ting Guo
  • Hongyao Tang
  • Wei Wei
  • Jiye Liang

Offline reinforcement learning (RL) aims to learn optimal policies from static datasets while enhancing generalization to out-of-distribution (OOD) data. To mitigate overfitting to suboptimal behaviors in offline datasets, existing methods often relax constraints on policy and data or extract informative patterns through data-driven techniques. However, there has been limited exploration into structurally guiding the optimization process toward flatter regions of the solution space that offer better generalization. Motivated by this observation, we present FANS, a generalization-oriented structured network framework that promotes flatter, more robust policy learning by guiding the optimization trajectory through modular architectural design. FANS comprises four key components: (1) Residual Blocks, which facilitate compact and expressive representations; (2) Gaussian Activation, which promotes smoother gradients; (3) Layer Normalization, which mitigates overfitting; and (4) Ensemble Modeling, which reduces estimation variance. By integrating FANS into a standard actor-critic framework, we show that this remarkably simple architecture achieves superior performance across various tasks compared to many existing advanced methods. Moreover, we validate the effectiveness of FANS in mitigating overestimation and promoting generalization, demonstrating the promising potential of architectural design in advancing offline RL.

AAAI Conference 2025 Conference Paper

Improving Generalization in Offline Reinforcement Learning via Latent Distribution Representation Learning

  • Da Wang
  • Lin Li
  • Wei Wei
  • Qixian Yu
  • Jianye Hao
  • Jiye Liang

Dealing with the distribution shift is a significant challenge when building offline reinforcement learning (RL) models that can generalize from a static dataset to out-of-distribution (OOD) scenarios. Previous approaches have employed pessimism or conservatism strategies. More recently, data-driven work has taken a distributional perspective, treating offline data as a domain adaptation problem. However, these methods use heuristic techniques to simulate distribution shifts, resulting in a limited diversity of artificially created distribution gaps. In this paper, we propose a novel perspective: offline datasets inherently contain multiple latent distributions, with behavior data from diverse policies potentially following different distributions and data from the same policy across various time phases also exhibiting distribution variance. We introduce the Latent Distribution Representation Learning (LAD) framework, which aims to characterize the multiple latent distributions within offline data and reduce the distribution gaps between any pair of them. LAD consists of a min-max adversarial process: it first identifies the "worst-case" distributions to enlarge the diversity of distribution gaps and then reduces these gaps to learn invariant representations for generalization. We derive a generalization error bound to support LAD theoretically and verify its effectiveness through extensive experiments.

IJCAI Conference 2025 Conference Paper

Indirect Alignment and Relationship Preservation for Domain Generalization

  • Wei Wei
  • Zixiong Li
  • Jing Yan
  • Mingwen Shao
  • Lin Li

Domain generalization (DG) aims to train models on multiple source domains to generalize effectively to unseen target domains, addressing performance degradation caused by domain shifts. Many existing methods rely on direct feature alignment, which disrupts natural sequence relationships, causes misalignment and feature distortion, and leads to overfitting, especially with significant domain gaps. To tackle these issues, we propose a novel DG approach with two key modules: the Sample Difference Keeping (SDK) module, which preserves natural sequence relationships to enhance feature diversity and separability, and the Sample Consistency Alignment (SCA) module, which achieves indirect alignment by modeling inter-class and inter-domain relationship consistencies. This approach mitigates overfitting and misalignment, ensuring adaptability to significant domain gaps. Extensive experiments demonstrate that our framework consistently outperforms state-of-the-art methods.

ICRA Conference 2025 Conference Paper

Intraoperative 3D Shape Estimation of Magnetic Soft Guidewire

  • Yiting Zhao
  • Liwei Shi
  • Wei Wei
  • Nan Xiao

This paper introduces a 3D shape reconstruction technique for interventional devices in endovascular surgery, utilizing a flexible magnetic tip guidewire that preserves the fundamental attributes of standard guidewires. We developed a model that correlates the magnetic tip's shape with the surrounding magnetic field distribution to estimate the shape through the magnetic field. The inherently nonlinear relationship between the magnetic field distribution and the shape of the magnetic guidewire presents challenges for direct shape estimation. To address this, we incorporated image and physical constraints to streamline the estimation process. This method shows high accuracy and stability in shape estimation, with root mean square error (RMSE) and Hausdorff distance (HD) both below 1 mm, which is better than other existing estimation methods. Notably, the interventional guidewire requires no embedded sensors or wiring, and the fluoroscopic images used are standard in clinical practice. The reconstruction process is non-disruptive to clinical procedures, suggesting broad applicability in vascular interventional navigation.

NeurIPS Conference 2025 Conference Paper

KVLink: Accelerating Large Language Models via Efficient KV Cache Reuse

  • Jingbo Yang
  • Bairu Hou
  • Wei Wei
  • Yujia Bao
  • Shiyu Chang

We describe KVLink, an approach for efficient key-value (KV) cache reuse in large language models (LLMs). In many LLM applications, different inputs can share overlapping context, such as the same retrieved document appearing in multiple queries. However, LLMs still need to encode the entire context for each query, leading to redundant computation. In this paper, we investigate a new strategy to eliminate this inefficiency, in which the KV cache of each document is precomputed independently. During inference, the KV caches of retrieved documents are concatenated, allowing the model to reuse cached representations instead of recomputing them. To mitigate the performance degradation when using KV caches computed independently for each document, KVLink introduces two key techniques: adjusting the positional embeddings of the KV cache at inference to match the global position after concatenation, and using trainable special tokens to restore self-attention across independently encoded documents. Experiments across 7 datasets demonstrate that KVLink improves question answering accuracy by an average of 4% over state-of-the-art methods. Furthermore, by leveraging precomputed KV caches, our approach reduces time-to-first-token by up to 96% compared to standard LLM inference, making it a scalable and efficient solution for context reuse. Additionally, KVLink can be combined with KV cache compression to further reduce cache loading and storage overhead while still outperforming the baselines.
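The positional adjustment the abstract mentions works because rotary position embeddings (RoPE) are additive in position: a key cached at a local, document-relative position can be moved to its global position after concatenation by applying only the rotation for the offset. A toy NumPy sketch of that property (illustrative only, not KVLink's code; `rope` is a hypothetical rotate-half helper):

```python
import numpy as np

def rope(x, pos):
    """Apply rotate-half RoPE to one head vector x at integer position pos."""
    d = x.shape[-1]
    inv_freq = 1.0 / (10000.0 ** (np.arange(0, d, 2) / d))
    ang = pos * inv_freq                     # one angle per 2-D pair
    cos = np.concatenate([np.cos(ang), np.cos(ang)])
    sin = np.concatenate([np.sin(ang), np.sin(ang)])
    x1, x2 = x[: d // 2], x[d // 2:]
    rotated = np.concatenate([-x2, x1])      # the "rotate half" trick
    return x * cos + rotated * sin

# A key encoded at local position 3 inside its own document can be shifted
# to global position 3 + offset by rotating it through the offset alone:
k = np.random.default_rng(0).standard_normal(64)
offset = 17
assert np.allclose(rope(k, 3 + offset), rope(rope(k, 3), offset))
```

Because each 2-D pair undergoes a plain rotation, composing rotations adds their angles, so cached keys never need to be recomputed from the hidden states, only re-rotated.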

IROS Conference 2025 Conference Paper

MaskSem: Semantic-Guided Masking for Learning 3D Hybrid High-Order Motion Representation

  • Wei Wei
  • Shaojie Zhang
  • Yonghao Dang
  • Jianqin Yin

Human action recognition is a crucial task for intelligent robotics, particularly within the context of human-robot collaboration research. In self-supervised skeleton-based action recognition, the mask-based reconstruction paradigm learns the spatial structure and motion patterns of the skeleton by masking joints and reconstructing the target from unlabeled data. However, existing methods focus on a limited set of joints and low-order motion patterns, limiting the model's ability to understand complex motion. To address this issue, we introduce MaskSem, a novel semantic-guided masking method for learning 3D hybrid high-order motion representations. The framework leverages Grad-CAM based on relative motion to guide the masking of joints, which correspond to the most semantically rich temporal regions. This semantic-guided masking process encourages the model to explore more discriminative features. Furthermore, we propose using hybrid high-order motion as the reconstruction target, enabling the model to learn multi-order motion patterns. Specifically, low-order motion (velocity) and high-order motion (acceleration) are used together as the reconstruction target. This approach offers a more comprehensive description of the dynamic motion process, enhancing the model's understanding of motion patterns. Experiments on the NTU60, NTU120, and PKU-MMD datasets show that MaskSem, combined with a vanilla transformer, improves skeleton-based action recognition, making it more suitable for applications in human-robot interaction. The source code of our MaskSem is available at https://github.com/JayEason66/MaskSem.
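The hybrid reconstruction targets described here (velocity as low-order motion, acceleration as high-order motion) are typically computed as finite differences over the joint sequence. A minimal sketch of that preprocessing step, assuming a `(frames, joints, coords)` array; the function name is hypothetical and this is not the authors' code:

```python
import numpy as np

def hybrid_motion_targets(joints):
    """joints: (T, J, C) array of joint coordinates over T frames.
    Returns first-order (velocity) and second-order (acceleration)
    finite differences, the multi-order targets the abstract describes."""
    velocity = np.diff(joints, n=1, axis=0)      # (T-1, J, C)
    acceleration = np.diff(joints, n=2, axis=0)  # (T-2, J, C)
    return velocity, acceleration

# toy check on a uniformly accelerating single joint: x(t) = t^2 / 2
t = np.arange(6, dtype=float)
joints = (0.5 * t ** 2).reshape(-1, 1, 1)
vel, acc = hybrid_motion_targets(joints)
assert np.allclose(acc, 1.0)                     # constant unit acceleration
```

Reconstructing both targets forces the masked model to capture not just where joints are, but how their movement changes over time.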

IJCAI Conference 2025 Conference Paper

Multi-granularity Knowledge Transfer for Continual Reinforcement Learning

  • Chaofan Pan
  • Lingfei Ren
  • Yihui Feng
  • Linbo Xiong
  • Wei Wei
  • Yonghao Li
  • Xin Yang

Continual reinforcement learning (CRL) empowers RL agents with the ability to learn a sequence of tasks, accumulating knowledge learned in the past and using that knowledge for problem-solving or future task learning. However, existing methods often focus on transferring fine-grained knowledge across similar tasks, neglecting the multi-granularity structure of human cognitive control and resulting in insufficient knowledge transfer across diverse tasks. To enhance coarse-grained knowledge transfer, we propose a novel framework called MT-Core (shorthand for Multi-granularity knowledge Transfer for Continual reinforcement learning). MT-Core's key characteristic is multi-granularity policy learning: 1) coarse-grained policy formulation, which utilizes the powerful reasoning ability of a large language model (LLM) to set goals, and 2) fine-grained policy learning through RL, oriented by those goals. We also construct a new policy library (knowledge base) to store policies that can be retrieved for multi-granularity knowledge transfer. Experimental results demonstrate the superiority of the proposed MT-Core in handling diverse CRL tasks versus popular baselines.

JBHI Journal 2025 Journal Article

NRAG: A Knowledge-Enhanced LLM Framework for Interpretable Neurosurgical Disease Diagnosis in Outpatient and Emergency Settings

  • Haoyu Tian
  • Yiming Liu
  • Xinyu Dai
  • Xin Dong
  • Jian Yu
  • Wei Wei
  • Boran Wang
  • Xuezhong Zhou

Large language models (LLMs) have achieved state-of-the-art performance in numerous domains, yet their clinical deployment faces critical barriers, particularly insufficient reasoning in complex scenarios and limited interpretability. These challenges are exacerbated in neurosurgical diagnosis for outpatient and emergency settings, where time-sensitive decision-making, fragmented data, and complex comorbidities render conventional free-text-based modeling approaches unreliable. To address the limitations of existing LLMs in medical auxiliary diagnosis, particularly in interpretability and predictive performance, this study proposes NRAG, an auxiliary diagnosis method that combines LLMs with knowledge graphs (KGs). It extracts symptom descriptions from clinical records, performs personalized retrieval of associated paths in the KG, and supplements potential patient symptoms to optimize the diagnosis model. Comparative experiments involving multiple general-domain and medical-domain LLMs, along with case studies, were conducted to validate NRAG's effectiveness. Experimental results demonstrate that integrating the KG significantly improves diagnosis accuracy, achieving an F1-score of 0.8150; it also substantially improves model interpretability and performs excellently in expert evaluations. Ablation studies and comparative experiments with other general-domain and medical-domain LLMs confirm the superior performance of the proposed NRAG. NRAG effectively supplements missing symptom information and provides knowledge-path-based evidence for diagnosis results, while improving the precision and interpretability of intelligent diagnosis. Furthermore, this approach lays the foundation for intelligent diagnosis in neurosurgery while providing a methodological framework for integrating in-depth clinical data mining with medical knowledge base resources.

IJCAI Conference 2025 Conference Paper

Prompt-Free Conditional Diffusion for Multi-object Image Augmentation

  • Haoyu Wang
  • Lei Zhang
  • Wei Wei
  • Chen Ding
  • Yanning Zhang

Diffusion models have underpinned many recent advances in dataset augmentation for various computer vision tasks. However, when generating multi-object images, as in real scenarios, most existing methods either rely entirely on the text condition, resulting in a deviation between the generated objects and the original data, or rely too heavily on the original images, resulting in a lack of diversity in the generated images, which is of limited help to downstream tasks. To mitigate both problems with one stone, we propose a prompt-free conditional diffusion framework for multi-object image augmentation. Specifically, we introduce a local-global semantic fusion strategy to extract semantics from images to replace text, and inject knowledge into the diffusion model through LoRA to alleviate the category deviation between the original model and the target dataset. In addition, we design a reward-model-based counting loss to assist the traditional reconstruction loss during model training. By constraining the object counts of each category instead of applying pixel-by-pixel constraints, we bridge the quantity deviation between the generated data and the original data while improving the diversity of the generated data. Experimental results demonstrate the superiority of the proposed method over several representative state-of-the-art baselines and showcase strong downstream task gains and out-of-domain generalization capabilities. Code is available at https://github.com/00why00/PFCD.

NeurIPS Conference 2025 Conference Paper

Risk-aware Direct Preference Optimization under Nested Risk Measure

  • Lijun Zhang
  • Lin Li
  • Yajie Qi
  • Huizhong Song
  • Yaodong Yang
  • Jun Wang
  • Wei Wei

When fine-tuning pre-trained Large Language Models (LLMs) to align with human values and intentions, maximizing the estimated reward can lead to superior performance, but it also introduces potential risks due to deviations from the reference model's intended behavior. Most existing methods introduce KL divergence to constrain deviations between the trained model and the reference model; however, this may not be sufficient in applications that require tight risk control. In this paper, we introduce Risk-aware Direct Preference Optimization (Ra-DPO), a novel approach that incorporates risk-awareness by employing a class of nested risk measures. This approach formulates a constrained risk-aware advantage function maximization problem and then converts the Bradley-Terry model into a token-level representation. The objective function maximizes the likelihood of the policy while suppressing the deviation between the trained model and the reference model using a sequential risk ratio, thereby enhancing the model's risk-awareness. Experimental results on three open-source datasets (IMDb, Anthropic HH, and AlpacaEval) demonstrate the proposed method's superior performance in balancing alignment performance and model drift.

AAAI Conference 2025 Conference Paper

Semantic Enhanced Heterogeneous Hypergraph Network for Collaborative Filtering

  • Mingtao Xu
  • Wei Wei
  • Peixuan Yang
  • Hulong Wu

Collaborative Filtering (CF) based on graph neural networks (GNNs) has yielded immense success for recommendation systems by capturing high-order dependencies from implicit feedback. Recently, the outstanding text comprehension ability of the Large Language Models (LLMs) has shown promising potential to provide auxiliary semantics for collaborative representation. However, when aligning textual information with collaborative signals, inconsistent semantics between user-item and item-item text pairs may lead to the degradation of the alignment model, thus hindering the recommender system from effectively utilizing heterogeneous information. In this paper, we propose a novel method: Semantic Enhanced Heterogeneous Hypergraph Network (SEHHN), which enhances the representations of CF correlations with semantics, thereby avoiding alignment degradation. To better model the collaborative signals, we design a graph autoencoder that captures the bidirectional relationship between user preferences and item features in review semantics. Furthermore, we develop an LLM-based item classifier to adaptively exploit potential correlations of items via the co-occurrences of item features. Finally, we design a heterogeneous hypergraph network to achieve efficient alignment and propagation of heterogeneous information, thereby alleviating the impact of semantic inconsistency on CFs. Extensive experiments on three real-world datasets demonstrate that our proposed SEHHN outperforms existing SOTA methods and validates the effectiveness of each component.

IROS Conference 2025 Conference Paper

Towards Physically Realizable Adversarial Attacks in Embodied Vision Navigation

  • Meng Chen
  • Jiawei Tu
  • Chao Qi
  • Yonghao Dang
  • Feng Zhou
  • Wei Wei
  • Jianqin Yin

The significant advancements in embodied vision navigation have raised concerns about its susceptibility to adversarial attacks exploiting deep neural networks. Investigating the adversarial robustness of embodied vision navigation is crucial, especially given the threat of 3D physical attacks that could pose risks to human safety. However, existing attack methods for embodied vision navigation often lack physical feasibility due to challenges in transferring digital perturbations into the physical world. Moreover, current physical attacks for object detection struggle to achieve both multi-view effectiveness and visual naturalness in navigation scenarios. To address this, we propose a practical attack method for embodied navigation by attaching adversarial patches to objects, where both opacity and textures are learnable. Specifically, to ensure effectiveness across varying viewpoints, we employ a multi-view optimization strategy based on object-aware sampling, which optimizes the patch’s texture based on feedback from the vision-based perception model used in navigation. To make the patch inconspicuous to human observers, we introduce a two-stage opacity optimization mechanism, in which opacity is fine-tuned after texture optimization. Experimental results demonstrate that our adversarial patches decrease the navigation success rate by an average of 22.39%, outperforming previous methods in practicality, effectiveness, and naturalness. Code is available at: github.com/chen37058/Physical-Attacks-in-Embodied-Nav.

AAAI Conference 2024 Conference Paper

Detection-Based Intermediate Supervision for Visual Question Answering

  • Yuhang Liu
  • Daowan Peng
  • Wei Wei
  • Yuanyuan Fu
  • Wenfeng Xie
  • Dangyang Chen

Recently, neural module networks (NMNs) have yielded ongoing success in answering compositional visual questions, especially those involving multi-hop visual and logical reasoning. NMNs decompose the complex question into several sub-tasks using instance-modules from the reasoning paths of that question and then exploit intermediate supervisions to guide answer prediction, thereby improving inference interpretability. However, their performance may be hindered due to sketchy modeling of intermediate supervisions. For instance, (1) a prior assumption that each instance-module refers to only one grounded object yet overlooks other potentially associated grounded objects, impeding full cross-modal alignment learning; (2) IoU-based intermediate supervisions may introduce noise signals as the bounding box overlap issue might guide the model's focus towards irrelevant objects. To address these issues, a novel method, Detection-based Intermediate Supervision (DIS), is proposed, which adopts a generative detection framework to facilitate multiple grounding supervisions via sequence generation. As such, DIS offers more comprehensive and accurate intermediate supervisions, thereby boosting answer prediction performance. Furthermore, by considering intermediate results, DIS enhances the consistency in answering compositional questions and their sub-questions. Extensive experiments demonstrate the superiority of our proposed DIS, showcasing both improved accuracy and state-of-the-art reasoning consistency compared to prior approaches.

AAAI Conference 2024 Conference Paper

Enhancing Low-Resource Relation Representations through Multi-View Decoupling

  • Chenghao Fan
  • Wei Wei
  • Xiaoye Qu
  • Zhenyi Lu
  • Wenfeng Xie
  • Yu Cheng
  • Dangyang Chen

Recently, prompt-tuning with pre-trained language models (PLMs) has demonstrated a significant ability to enhance relation extraction (RE) tasks. However, in low-resource scenarios, where available training data is scarce, previous prompt-based methods may still perform poorly for prompt-based representation learning due to a superficial understanding of the relation. To this end, we highlight the importance of learning high-quality relation representations in low-resource scenarios for RE, and propose a novel prompt-based relation representation method, named MVRE (Multi-View Relation Extraction), to better leverage the capacity of PLMs to improve the performance of RE within the low-resource prompt-tuning paradigm. Specifically, MVRE decouples each relation into different perspectives to encompass multi-view relation representations for maximizing the likelihood during relation inference. Furthermore, we also design a Global-Local loss and a Dynamic-Initialization method for better alignment of the multi-view relation-representing virtual words, which contain the semantics of relation labels, during the optimization learning process and initialization. Extensive experiments on three benchmark datasets show that our method can achieve state-of-the-art performance in low-resource settings.

NeurIPS Conference 2024 Conference Paper

FineCLIP: Self-distilled Region-based CLIP for Better Fine-grained Understanding

  • Dong Jing
  • Xiaolong He
  • Yutian Luo
  • Nanyi Fei
  • Guoxing Yang
  • Wei Wei
  • Huiwen Zhao
  • Zhiwu Lu

Contrastive Language-Image Pre-training (CLIP) achieves impressive performance on tasks like image classification and image-text retrieval by learning on large-scale image-text datasets. However, CLIP struggles with dense prediction tasks due to the poor grasp of the fine-grained details. Although existing works pay attention to this issue, they achieve limited improvements and usually sacrifice the important visual-semantic consistency. To overcome these limitations, we propose FineCLIP, which keeps the global contrastive learning to preserve the visual-semantic consistency and further enhances the fine-grained understanding through two innovations: 1) A real-time self-distillation scheme that facilitates the transfer of representation capability from global to local features. 2) A semantically-rich regional contrastive learning paradigm with generated region-text pairs, boosting the local representation capabilities with abundant fine-grained knowledge. Both cooperate to fully leverage diverse semantics and multi-grained complementary information. To validate the superiority of our FineCLIP and the rationality of each design, we conduct extensive experiments on challenging dense prediction and image-level tasks. All the observations demonstrate the effectiveness of FineCLIP.

JBHI Journal 2024 Journal Article

Guest Editorial Special Issue on Data-driven Cognitive Computing for Smart Healthcare Systems

  • Syed Hassan Shah
  • Wei Wei
  • Wei Wang

With the rising costs of drugs, medical devices, and diagnostic development, the topic of data-driven cognitive computing is currently an emerging research area in smart healthcare construction. With the support of machine learning and artificial intelligence empowered cognitive computing, the significant insights and knowledge hidden behind medical data can be capitalized for process optimization, anomaly detection, energy management, and so on. The special issue is an effort to provide a platform for researchers to explore healthcare issues supported by data-driven cognitive computing-related technologies from both theoretical and practical perspectives.

IJCAI Conference 2024 Conference Paper

Improving Pseudo Labels with Global-Local Denoising Framework for Cross-lingual Named Entity Recognition

  • Zhuojun Ding
  • Wei Wei
  • Xiaoye Qu
  • Dangyang Chen

Cross-lingual named entity recognition (NER) aims to train an NER model for the target language leveraging only labeled source language data and unlabeled target language data. Prior approaches either perform label projection on translated source language data or employ a source model to assign pseudo labels for target language data and train a target model on these pseudo-labeled data to generalize to the target language. However, these automatic labeling procedures inevitably introduce noisy labels, thus leading to a performance drop. In this paper, we propose a Global-Local Denoising framework (GLoDe) for cross-lingual NER. Specifically, GLoDe introduces a progressive denoising strategy to rectify incorrect pseudo labels by leveraging both global and local distribution information in the semantic space. The refined pseudo-labeled target language data significantly improves the model's generalization ability. Moreover, previous methods only consider improving the model with language-agnostic features; however, we argue that target language-specific features are also important and should never be ignored. To this end, we employ a simple auxiliary task to achieve this goal. Experimental results on two benchmark datasets with six target languages demonstrate that our proposed GLoDe significantly outperforms current state-of-the-art methods.

NeurIPS Conference 2024 Conference Paper

Large Language Models-guided Dynamic Adaptation for Temporal Knowledge Graph Reasoning

  • Jiapu Wang
  • Kai Sun
  • Linhao Luo
  • Wei Wei
  • Yongli Hu
  • Alan W. Liew
  • Shirui Pan
  • Baocai Yin

Temporal Knowledge Graph Reasoning (TKGR) is the process of utilizing temporal information to capture complex relations within a Temporal Knowledge Graph (TKG) to infer new knowledge. Conventional methods in TKGR typically depend on deep learning algorithms or temporal logical rules. However, deep learning-based TKGRs often lack interpretability, whereas rule-based TKGRs struggle to effectively learn temporal rules that capture temporal patterns. Recently, Large Language Models (LLMs) have demonstrated extensive knowledge and remarkable proficiency in temporal reasoning. Consequently, the employment of LLMs for Temporal Knowledge Graph Reasoning (TKGR) has sparked increasing interest among researchers. Nonetheless, LLMs are known to function as black boxes, making it challenging to comprehend their reasoning process. Additionally, due to the resource-intensive nature of fine-tuning, promptly updating LLMs to integrate evolving knowledge within TKGs for reasoning is impractical. To address these challenges, in this paper, we propose a Large Language Models-guided Dynamic Adaptation (LLM-DA) method for reasoning on TKGs. Specifically, LLM-DA harnesses the capabilities of LLMs to analyze historical data and extract temporal logical rules. These rules unveil temporal patterns and facilitate interpretable reasoning. To account for the evolving nature of TKGs, a dynamic adaptation strategy is proposed to update the LLM-generated rules with the latest events. This ensures that the extracted rules always incorporate the most recent knowledge and better generalize to the predictions on future events. Experimental results show that without the need of fine-tuning, LLM-DA significantly improves the accuracy of reasoning over several common datasets, providing a robust framework for TKGR tasks.

NeurIPS Conference 2024 Conference Paper

Meta-Exploiting Frequency Prior for Cross-Domain Few-Shot Learning

  • Fei Zhou
  • Peng Wang
  • Lei Zhang
  • Zhenghua Chen
  • Wei Wei
  • Chen Ding
  • Guosheng Lin
  • Yanning Zhang

Meta-learning offers a promising avenue for few-shot learning (FSL), enabling models to glean a generalizable feature embedding through episodic training on synthetic FSL tasks in a source domain. Yet, in practical scenarios where the target task diverges from that in the source domain, meta-learning-based methods are susceptible to over-fitting. To overcome this, we introduce a novel framework, Meta-Exploiting Frequency Prior for Cross-Domain Few-Shot Learning, which is crafted to comprehensively exploit the cross-domain transferable image prior that each image can be decomposed into complementary low-frequency content details and high-frequency robust structural characteristics. Motivated by this insight, we propose to decompose each query image into its high-frequency and low-frequency components, and incorporate them in parallel into the feature embedding network to enhance the final category prediction. More importantly, we introduce a feature reconstruction prior and a prediction consistency prior to separately encourage the consistency of the intermediate feature as well as the final category prediction between the original query image and its decomposed frequency components. This allows for collectively guiding the network's meta-learning process with the aim of learning generalizable image feature embeddings, while not introducing any extra computational cost in the inference phase. Our framework establishes new state-of-the-art results on multiple cross-domain few-shot learning benchmarks.

NeurIPS Conference 2024 Conference Paper

On Giant's Shoulders: Effortless Weak to Strong by Dynamic Logits Fusion

  • Chenghao Fan
  • Zhenyi Lu
  • Wei Wei
  • Jie Tian
  • Xiaoye Qu
  • Dangyang Chen
  • Yu Cheng

Efficient fine-tuning of large language models for task-specific applications is imperative, yet the vast number of parameters in these models makes their training increasingly challenging. Despite numerous proposals for effective methods, a substantial memory overhead remains for gradient computations during updates. Can we fine-tune a series of task-specific small models and transfer their knowledge directly to a much larger model without additional training? In this paper, we explore weak-to-strong specialization using logit arithmetic, facilitating a direct answer to this question. Existing weak-to-strong methods often employ a static knowledge transfer ratio and a single small model for transferring complex knowledge, which leads to suboptimal performance. To surmount these limitations, we propose a dynamic logit fusion approach that works with a series of task-specific small models, each specialized in a different task. This method adaptively allocates weights among these models at each decoding step, learning the weights through Kullback-Leibler divergence constrained optimization problems. We conduct extensive experiments across various benchmarks in both single-task and multi-task settings, achieving leading results. By transferring expertise from the 7B model to the 13B model, our method closes the performance gap by 96.4\% in single-task scenarios and by 86.3\% in multi-task scenarios compared to full fine-tuning of the 13B model. Notably, it even achieves superior performance on unseen tasks. Moreover, we further demonstrate that our method can effortlessly integrate in-context learning for single tasks and task arithmetic for multi-task scenarios.
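The logit-arithmetic transfer described in this abstract might look roughly like the sketch below; the function signature is hypothetical, and the fusion weights are simply passed in here, whereas the paper learns them at each decoding step via KL-constrained optimization.

```python
import numpy as np

def fuse_logits(large_logits, small_base_logits, expert_logits, weights):
    """Hypothetical weak-to-strong fusion at one decoding step: add a
    weighted sum of each small expert's logit offset (expert minus the
    small base model) onto the large model's logits over the vocabulary."""
    fused = np.asarray(large_logits, dtype=float).copy()
    small = np.asarray(small_base_logits, dtype=float)
    for w, ex in zip(weights, expert_logits):
        # Each offset encodes what the expert learned on its task,
        # relative to the shared small base model.
        fused += w * (np.asarray(ex, dtype=float) - small)
    return fused

# One expert, half weight: the large model is nudged toward token 0.
step_logits = fuse_logits([1.0, 2.0], [0.0, 0.0], [[1.0, 0.0]], [0.5])
```

Since only output logits are combined, no gradients through the large model are needed, which is consistent with the "without additional training" claim.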

IJCAI Conference 2024 Conference Paper

Position Debiasing Fine-Tuning for Causal Perception in Long-Term Dialogue

  • Shixuan Fan
  • Wei Wei
  • Wendi Li
  • Xian-Ling Mao
  • Wenfeng Xie
  • Dangyang Chen

The core of the dialogue system is to generate relevant, informative, and human-like responses based on extensive dialogue history. Recently, the dialogue generation domain has seen mainstream adoption of large language models (LLMs), due to their powerful capability in generating utterances. However, there is a natural deficiency in such models, namely inherent position bias, which may lead them to pay more attention to nearby utterances instead of causally relevant ones, resulting in irrelevant and generic responses in long-term dialogue. To alleviate this problem, in this paper, we propose a novel method, named Causal Perception long-term Dialogue framework (CPD), which employs a perturbation-based causal variable discovery method to extract causally relevant utterances from the dialogue history and enhances model causal perception during fine-tuning. Specifically, a local-position awareness method is proposed in CPD for inter-sentence position correlation elimination, which helps models extract causally relevant utterances based on perturbations. Then, a causal-perception fine-tuning strategy is also proposed, to enhance the capability of discovering causal invariant factors, by differently perturbing causally relevant and non-causally relevant utterances for response generation. Experimental results on two datasets prove that our proposed method can effectively alleviate the position bias for multiple LLMs and achieve significant progress compared with existing baselines.

NeurIPS Conference 2024 Conference Paper

Scalable Constrained Policy Optimization for Safe Multi-agent Reinforcement Learning

  • Lijun Zhang
  • Lin Li
  • Wei Wei
  • Huizhong Song
  • Yaodong Yang
  • Jiye Liang

A challenging problem in seeking to bring multi-agent reinforcement learning (MARL) techniques into real-world applications, such as autonomous driving and drone swarms, is how to control multiple agents safely and cooperatively to accomplish tasks. Most existing safe MARL methods learn the centralized value function by introducing a global state to guide safety cooperation. However, the global coupling arising from agents’ safety constraints and the exponential growth of the state-action space size limit their applicability in instant communication or computing resource-constrained systems and larger multi-agent systems. In this paper, we develop a novel scalable and theoretically-justified multi-agent constrained policy optimization method. This method utilizes the rigorous bounds of the trust region method and the bounds of the truncated advantage function to provide a new local policy optimization objective for each agent. Also, we prove that the safety constraints and the joint policy improvement can be met when each agent adopts a sequential update scheme to optimize a $\kappa$-hop policy. Then, we propose a practical algorithm called Scalable MAPPO-Lagrangian (Scal-MAPPO-L). The proposed method’s effectiveness is verified on a collection of benchmark tasks, and the results support our theory that decentralized training with local interactions can still improve reward performance and satisfy safe constraints.

NeurIPS Conference 2024 Conference Paper

Twin-Merging: Dynamic Integration of Modular Expertise in Model Merging

  • Zhenyi Lu
  • Chenghao Fan
  • Wei Wei
  • Xiaoye Qu
  • Dangyang Chen
  • Yu Cheng

In the era of large language models, model merging is a promising way to combine multiple task-specific models into a single multitask model without extra training. However, two challenges remain: (a) interference between different models and (b) heterogeneous data during testing. Traditional model merging methods often show significant performance gaps compared to fine-tuned models due to these issues. Additionally, a one-size-fits-all model lacks flexibility for diverse test data, leading to performance degradation. We show that both shared and exclusive task-specific knowledge are crucial for merging performance, but directly merging exclusive knowledge hinders overall performance. In view of this, we propose Twin-Merging, a method that encompasses two principal stages: (1) modularizing knowledge into shared and exclusive components, with compression to reduce redundancy and enhance efficiency; (2) dynamically merging shared and task-specific knowledge based on the input. This approach narrows the performance gap between merged and fine-tuned models and improves adaptability to heterogeneous data. Extensive experiments on $20$ datasets for both language and vision tasks demonstrate the effectiveness of our method, showing an average improvement of $28.34\%$ in absolute normalized score for discriminative tasks and even surpassing the fine-tuned upper bound on the generative tasks.
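The two-stage scheme in this abstract could be sketched on parameter deltas as follows; treating the mean task vector as the shared component and a linear router over exclusive residuals are simplifying assumptions for illustration, not the paper's exact design.

```python
import numpy as np

def decompose(task_vectors):
    """Split per-task parameter deltas (fine-tuned minus base weights)
    into a shared component and per-task exclusive residuals.
    Using the mean as the shared part is an assumed simplification."""
    tv = np.asarray(task_vectors, dtype=float)
    shared = tv.mean(axis=0)
    return shared, tv - shared

def dynamic_merge(base, shared, exclusive, router_weights):
    """Compose parameters for one input: base weights plus shared
    knowledge plus a router-weighted mixture of exclusive experts."""
    w = np.asarray(router_weights, dtype=float)
    mix = np.tensordot(w, np.asarray(exclusive, dtype=float), axes=1)
    return np.asarray(base, dtype=float) + shared + mix

# Two toy 2-parameter task vectors; route everything to expert 0.
shared, exclusive = decompose([[2.0, 0.0], [0.0, 2.0]])
merged = dynamic_merge([0.0, 0.0], shared, exclusive, [1.0, 0.0])
```

Making the exclusive mixture input-dependent is what distinguishes this from static merging: different test inputs can receive different expert combinations.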

JBHI Journal 2024 Journal Article

WDFF-Net: Weighted Dual-Branch Feature Fusion Network for Polyp Segmentation With Object-Aware Attention Mechanism

  • Jie Cao
  • Xin Wang
  • Zhiwei Qu
  • Li Zhuo
  • Xiaoguang Li
  • Hui Zhang
  • Yang Yang
  • Wei Wei

Colon polyps in colonoscopy images exhibit significant differences in color, size, shape, appearance, and location, posing significant challenges to accurate polyp segmentation. In this paper, a Weighted Dual-branch Feature Fusion Network is proposed for polyp segmentation, named WDFF-Net, which adopts HarDNet68 as the backbone network. First, a dual-branch feature fusion network architecture is constructed, which includes a shared feature extractor and two feature fusion branches, i.e., a Progressive Feature Fusion (PFF) branch and a Scale-aware Feature Fusion (SFF) branch. The branches fuse the deep features of multiple layers for different purposes and in different ways. The PFF branch addresses the under-segmentation or over-segmentation of flat polyps with low edge contrast by iteratively fusing features from low, medium, and high layers. The SFF branch tackles the problem of drastic variations in polyp size and shape, especially the missed segmentation of small polyps. These two branches are complementary and play different roles in improving segmentation accuracy. Second, an Object-aware Attention Mechanism (OAM) is proposed to enhance the features of the target regions and suppress those of the background regions, which would otherwise interfere with segmentation performance. Third, a weighted dual-branch segmentation loss function is specifically designed, which dynamically assigns weight factors to the loss functions of the two branches to optimize their collaborative training. Experimental results on five public colon polyp datasets demonstrate that the proposed WDFF-Net can achieve superior segmentation performance with lower model complexity and faster inference speed, while maintaining good generalization ability.

IJCAI Conference 2023 Conference Paper

An Empirical Study on the Language Modal in Visual Question Answering

  • Daowan Peng
  • Wei Wei
  • Xian-Ling Mao
  • Yuanyuan Fu
  • Dangyang Chen

Generalization beyond in-domain experience to out-of-distribution data is of paramount significance in the AI domain. Of late, state-of-the-art Visual Question Answering (VQA) models have shown impressive performance on in-domain data, partially due to the language prior bias which, however, hinders the generalization ability in practice. This paper attempts to provide new insights into the influence of language modality on VQA performance from an empirical study perspective. To achieve this, we conducted a series of experiments on six models. The results of these experiments revealed that, 1) apart from prior bias caused by question types, there is a notable influence of postfix-related bias in inducing biases, and 2) training VQA models with word-sequence-related variant questions demonstrated improved performance on the out-of-distribution benchmark, and the LXMERT even achieved a 10-point gain without adopting any debiasing methods. We delved into the underlying reasons behind these experimental results and put forward some simple proposals to reduce the models' dependency on language priors. The experimental results demonstrated the effectiveness of our proposed method in improving performance on the out-of-distribution benchmark, VQA-CPv2. We hope this study can inspire novel insights for future research on designing bias-reduction approaches.

JBHI Journal 2023 Journal Article

CoIn: Correlation Induced Clustering for Cognition of High Dimensional Bioinformatics Data

  • Zeng Zeng
  • Ziyuan Zhao
  • Kaixin Xu
  • Yangfan Li
  • Cen Chen
  • Xiaofeng Zou
  • Yulan Wang
  • Wei Wei

Analysis of high dimensional biomedical data such as microarray gene expression data and mass spectrometry images, is crucial to provide better medical services including cancer subtyping, protein homology detection, etc. Clustering is a fundamental cognitive task which aims to group unlabeled data into multiple clusters based on their intrinsic similarities. However, for most clustering methods, including the most widely used $K$-means algorithm, all features of the high dimensional data are considered equally in relevance, which distorts the performance when clustering high-dimensional data where there exist many redundant variables and correlated variables. In this paper, we aim at addressing the problem of the high dimensional bioinformatics data clustering and propose a new correlation induced clustering, CoIn, to capture complex correlations among high dimensional data and guarantee the correlation consistency within each cluster. We evaluate the proposed method on a high dimensional mass spectrometry dataset of liver cancer tumor to explore the metabolic differences on tissues and discover the intra-tumor heterogeneity (ITH). By comparing the results of baselines and ours, it has been found that our method produces more explainable and understandable results for clinical analysis, which demonstrates the proposed clustering paradigm has the potential with application to knowledge discovery in high dimensional bioinformatics data.

JBHI Journal 2023 Journal Article

Development of Prognostic Biomarkers by TMB-Guided WSI Analysis: A Two-Step Approach

  • Xiangyu Liu
  • Zhenyu Liu
  • Ye Yan
  • Kai Wang
  • Aodi Wang
  • Xiongjun Ye
  • Liwei Wang
  • Wei Wei

The rapid development of computational pathology has brought new opportunities for prognosis prediction using histopathological images. However, the existing deep learning frameworks lack exploration of the relationship between images and other prognostic information, resulting in poor interpretability. Tumor mutation burden (TMB) is a promising biomarker for predicting the survival outcomes of cancer patients, but its measurement is costly. Its heterogeneity may be reflected in histopathological images. Here, we report a two-step framework for prognostic prediction using whole-slide images (WSIs). First, the framework adopts a deep residual network to encode the phenotype of WSIs and classifies patient-level TMB by the deep features after aggregation and dimensionality reduction. Then, the patients' prognosis is stratified by the TMB-related information obtained during the classification model development. Deep learning feature extraction and TMB classification model construction are performed on an in-house dataset of 295 Haematoxylin & Eosin stained WSIs of clear cell renal cell carcinoma (ccRCC). The development and evaluation of prognostic biomarkers are performed on The Cancer Genome Atlas-Kidney ccRCC (TCGA-KIRC) project with 304 WSIs. Our framework achieves good performance for TMB classification with an area under the receiver operating characteristic curve (AUC) of 0.813 on the validation set. Through survival analysis, our proposed prognostic biomarkers can achieve significant stratification of patients' overall survival ($P < 0.05$) and outperform the original TMB signature in risk stratification of patients with advanced disease. The results indicate the feasibility of mining TMB-related information from WSI to achieve stepwise prognosis prediction.

AAAI Conference 2023 Conference Paper

Mind the Gap: Polishing Pseudo Labels for Accurate Semi-supervised Object Detection

  • Lei Zhang
  • Yuxuan Sun
  • Wei Wei

Exploiting pseudo labels (e.g., categories and bounding boxes) of unannotated objects produced by a teacher detector has underpinned much of recent progress in semi-supervised object detection (SSOD). However, due to the limited generalization capacity of the teacher detector caused by the scarce annotations, the produced pseudo labels often deviate from ground truth, especially those with relatively low classification confidences, thus limiting the generalization performance of SSOD. To mitigate this problem, we propose a dual pseudo-label polishing framework for SSOD. Instead of directly exploiting the pseudo labels produced by the teacher detector, we take the first attempt at reducing their deviation from ground truth using dual polishing learning, where two differently structured polishing networks are elaborately developed and trained using synthesized paired pseudo labels and the corresponding ground truth for categories and bounding boxes on the given annotated objects, respectively. By doing this, both polishing networks can infer more accurate pseudo labels for unannotated objects through sufficiently exploiting their context knowledge based on the initially produced pseudo labels, and thus improve the generalization performance of SSOD. Moreover, such a scheme can be seamlessly plugged into the existing SSOD framework for joint end-to-end learning. In addition, we propose to disentangle the polished pseudo categories and bounding boxes of unannotated objects for separate category classification and bounding box regression in SSOD, which enables introducing more unannotated objects during model training and thus further improves the performance. Experiments on both PASCAL VOC and MS-COCO benchmarks demonstrate the superiority of the proposed method over existing state-of-the-art baselines. The code can be found at https://github.com/snowdusky/DualPolishLearning.

TIST Journal 2023 Journal Article

Representation Learning of Enhanced Graphs Using Random Walk Graph Convolutional Network

  • Xing Li
  • Wei Wei
  • Ruizhi Zhang
  • Zhenyu Shi
  • Zhiming Zheng
  • Xiangnan Feng

Nowadays, graph-structured data plays a key role in machine learning because of its simple topological structure, and therefore graph representation learning methods have attracted great attention. It turns out that the low-dimensional embedding representations obtained by graph representation learning are extremely useful in various typical tasks, such as node classification and content recommendation. However, most of the existing methods do not further dig out the potential structural information in the original graph structure. Here, we propose wGCN, which utilizes random walks to obtain the node-specific mesoscopic structures (high-order local structures) of the graph and uses these mesoscopic structures to enhance the graph and organize the characteristic information of the nodes. Our method can effectively generate node embeddings for data of previously unknown categories, which has been proven in a series of experiments conducted on many types of graph networks. Compared to baselines, our method shows the best performance on most datasets and achieves competitive results on the others. We believe that combining the mesoscopic structure to further explore the structural information of the graph will greatly improve the learning efficiency of graph neural networks.

NeurIPS Conference 2023 Conference Paper

Semi-Implicit Denoising Diffusion Models (SIDDMs)

  • Yanwu Xu
  • Mingming Gong
  • Shaoan Xie
  • Wei Wei
  • Matthias Grundmann
  • Kayhan Batmanghelich
  • Tingbo Hou

Despite the proliferation of generative models, achieving fast sampling during inference without compromising sample diversity and quality remains challenging. Existing models such as Denoising Diffusion Probabilistic Models (DDPM) deliver high-quality, diverse samples but are slowed by an inherently high number of iterative steps. The Denoising Diffusion Generative Adversarial Networks (DDGAN) attempted to circumvent this limitation by integrating a GAN model for larger jumps in the diffusion process. However, DDGAN encountered scalability limitations when applied to large datasets. To address these limitations, we introduce a novel approach that tackles the problem by matching implicit and explicit factors. More specifically, our approach involves utilizing an implicit model to match the marginal distributions of noisy data and the explicit conditional distribution of the forward diffusion. This combination allows us to effectively match the joint denoising distributions. Unlike DDPM but similar to DDGAN, we do not enforce a parametric distribution for the reverse step, enabling us to take large steps during inference. Similar to the DDPM but unlike DDGAN, we take advantage of the exact form of the diffusion process. We demonstrate that our proposed method obtains comparable generative performance to diffusion-based models and vastly superior results to models with a small number of sampling steps.

AAAI Conference 2023 Conference Paper

STAGE: Span Tagging and Greedy Inference Scheme for Aspect Sentiment Triplet Extraction

  • Shuo Liang
  • Wei Wei
  • Xian-Ling Mao
  • Yuanyuan Fu
  • Rui Fang
  • Dangyang Chen

Aspect Sentiment Triplet Extraction (ASTE) has become an emerging task in sentiment analysis research, aiming to extract triplets of the aspect term, its corresponding opinion term, and its associated sentiment polarity from a given sentence. Recently, many neural network-based models with different tagging schemes have been proposed, but almost all of them have limitations: they rely heavily on 1) the prior assumption that each word is associated with only a single role (e.g., aspect term or opinion term) and 2) word-level interactions, treating each opinion/aspect as a set of independent words. Hence, they perform poorly on complex ASTE cases, such as a word associated with multiple roles or an aspect/opinion term consisting of multiple words. To address this, we propose a novel approach, Span TAgging and Greedy infErence (STAGE), to extract sentiment triplets at the span level, where each span may consist of multiple words and play different roles simultaneously. To this end, this paper formulates the ASTE task as a multi-class span classification problem. Specifically, STAGE generates more accurate aspect sentiment triplet extractions by exploring span-level information and constraints, and consists of two components: a span tagging scheme and a greedy inference strategy. The former tags all possible candidate spans based on a newly defined tagging set, while the latter retrieves the aspect/opinion term with the maximum length from the candidate sentiment snippet to output sentiment triplets. Furthermore, we propose a simple but effective model based on STAGE, which outperforms the state-of-the-art methods by a large margin on four widely used datasets. Moreover, STAGE can be easily generalized to other pair/triplet extraction tasks, which further demonstrates the superiority of the proposed scheme.

IJCAI Conference 2023 Conference Paper

Towards Hierarchical Policy Learning for Conversational Recommendation with Hypergraph-based Reinforcement Learning

  • Sen Zhao
  • Wei Wei
  • Yifan Liu
  • Ziyang Wang
  • Wendi Li
  • Xian-Ling Mao
  • Shuai Zhu
  • Minghui Yang

Conversational recommendation systems (CRS) aim to timely and proactively acquire users' dynamically preferred attributes through conversations for item recommendation. In each turn of a CRS, there are naturally two decision-making processes with different roles that influence each other: 1) the director, which selects the follow-up option (i.e., ask or recommend) that is more effective for reducing the action space and acquiring user preferences; and 2) the actor, which accordingly chooses primitive actions (i.e., the attribute to ask about or the item to recommend) to estimate the effectiveness of the director's option. However, existing methods heavily rely on a unified decision-making module or heuristic rules, neglecting to distinguish the roles of the different decision procedures as well as the mutual influences between them. To address this, we propose a novel Director-Actor Hierarchical Conversational Recommender (DAHCR), where the director selects the most effective option, and the actor accordingly chooses primitive actions that satisfy user preferences. Specifically, we develop a dynamic hypergraph to model user preferences and introduce an intrinsic motivation to train the director from weak supervision. Finally, to alleviate the adverse effect of model bias on the mutual influence between the director and the actor, we model the director's option by sampling from a categorical distribution. Extensive experiments demonstrate that DAHCR outperforms state-of-the-art methods.

JBHI Journal 2023 Journal Article

TranSDFNet: Transformer-Based Truncated Signed Distance Fields for the Shape Design of Removable Partial Denture Clasps

  • Xinze Shen
  • Changdong Zhang
  • Xiuyi Jia
  • Dawei Li
  • Tingting Liu
  • Sukun Tian
  • Wei Wei
  • Yuchun Sun

The ever-growing aging population has led to an increasing need for removable partial dentures (RPDs), since they are typically the least expensive treatment option for partial edentulism. However, the digital design of RPDs remains challenging for dental technicians due to the variety of partially edentulous scenarios and the complex combinations of denture components. To accelerate the design of RPDs, we propose a U-shaped network incorporating Transformer blocks to automatically generate RPD clasps, one of the most frequently used RPD components. Unlike existing dental restoration design algorithms, we introduce the voxel-based truncated signed distance field (TSDF) as an intermediate representation, which reduces the network's sensitivity to resolution and contributes to smoother reconstruction. Besides, a selective insertion scheme is proposed that solves the memory issue caused by the Transformer blocks and enables the algorithm to work well in scenarios with insufficient data. We further design two weighted loss functions to filter out the noisy signals generated in the zero-gradient areas of the TSDF. Ablation and comparison studies demonstrate that our algorithm outperforms state-of-the-art reconstruction methods by a large margin and can serve as an intelligent auxiliary in denture design.

JBHI Journal 2022 Journal Article

3DMol-Net: Learn 3D Molecular Representation Using Adaptive Graph Convolutional Network Based on Rotation Invariance

  • Chunyan Li
  • Wei Wei
  • Jin Li
  • Junfeng Yao
  • Xiangxiang Zeng
  • Zhihan Lv

Deep learning-based molecular representations are of great significance for predicting molecular properties, promoting drug screening and new drug discovery, and ultimately improving human well-being by helping to avoid illness. Learning characterizations of drugs is essential for various downstream tasks, such as molecular property prediction. In particular, the 3D structural features of molecules play an important role in predicting biochemical function and activity: the 3D characteristics of a molecule largely determine the properties of the drug and its binding characteristics to the target. However, most current methods rely merely on 1D or 2D properties while ignoring the 3D topological structure, thereby degrading the performance of molecular inference. In this paper, we propose 3DMol-Net to enhance the molecular representation, considering both the topology and the rotation invariance (RI) of the 3D molecular structure. Specifically, we construct a molecular graph with soft relations related to the spatial arrangement of the 3D coordinates to learn the 3D topology of arbitrary graph structures, and employ an adaptive graph convolutional network to predict molecular properties and biochemical activities. Compared with current graph-based methods, 3DMol-Net demonstrates superior performance on both regression and classification tasks. Further verification of RI and visualization also show the better robustness and representation capacity of our model.

IJCAI Conference 2022 Conference Paper

Automatic Noisy Label Correction for Fine-Grained Entity Typing

  • Weiran Pan
  • Wei Wei
  • Feida Zhu

Fine-grained entity typing (FET) aims to assign proper semantic types to entity mentions according to their context, which is a fundamental task in various entity-leveraging applications. Current FET systems are usually built on large-scale weakly-supervised/distantly-annotated data, which may contain abundant noise and thus severely hinder performance on the FET task. Although previous studies have had great success in automatically identifying noisy labels in FET, they usually rely on auxiliary resources that may be unavailable in real-world applications (e.g., pre-defined hierarchical type structures, human-annotated subsets). In this paper, we propose a novel approach to automatically correct noisy labels for FET without external resources. Specifically, it first identifies potentially noisy labels by estimating the posterior probability of a label being positive or negative from the logits output by the model, and then relabels the candidate noisy labels by training a robust model over the remaining clean labels. Experiments on two popular benchmarks prove the effectiveness of our method. Our source code can be obtained from https://github.com/CCIIPLab/DenoiseFET.

AAAI Conference 2022 Conference Paper

Controlling Underestimation Bias in Reinforcement Learning via Quasi-median Operation

  • Wei Wei
  • Yujia Zhang
  • Jiye Liang
  • Lin Li
  • Yyuze Li

Obtaining good value estimates is one of the key problems in reinforcement learning (RL). Current off-policy methods, such as Maxmin Q-learning, TD3, and TADD, suffer from an underestimation problem when solving the overestimation problem. In this paper, we propose the quasi-median operation, a novel way to mitigate the underestimation bias by selecting the quasi-median from multiple state-action values. Based on the quasi-median operation, we propose Quasi-Median Q-learning (QMQ) for discrete action tasks and Quasi-Median Delayed Deep Deterministic Policy Gradient (QMD3) for continuous action tasks. Theoretically, the underestimation bias of our method is improved while the estimation variance is significantly reduced compared to Maxmin Q-learning, TD3, and TADD. We conduct extensive experiments on discrete and continuous action tasks, and the results show that our method outperforms the state-of-the-art methods.
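
The core operation admits a compact sketch. Assuming "quasi-median" here means a middle order statistic of an ensemble of Q estimates, sitting between the max (overestimating) and min (underestimating, as in Maxmin Q-learning), a hypothetical target computation looks like this; the paper's exact definition may differ:

```python
import numpy as np

def quasi_median_target(q_values):
    """Select a per-action target from an ensemble of Q estimates.

    q_values: array of shape (n_estimators, n_actions).
    Taking a middle order statistic (rather than the min of the
    ensemble, as in Maxmin Q-learning) trades the overestimation
    of max against the underestimation of min.
    """
    sorted_q = np.sort(q_values, axis=0)   # sort each action's estimates
    mid = (q_values.shape[0] - 1) // 2     # lower-median index
    return sorted_q[mid]                   # shape: (n_actions,)
```

A TD target would then be formed as `r + gamma * quasi_median_target(q_next).max()`, replacing the min-over-estimators step of Maxmin Q-learning.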

IJCAI Conference 2022 Conference Paper

Declaration-based Prompt Tuning for Visual Question Answering

  • Yuhang Liu
  • Wei Wei
  • Daowan Peng
  • Feida Zhu

In recent years, the pre-training-then-fine-tuning paradigm has yielded immense success on a wide spectrum of cross-modal tasks, such as visual question answering (VQA), in which a visual-language (VL) model is first optimized via self-supervised task objectives, e.g., masked language modeling (MLM) and image-text matching (ITM), and then fine-tuned to adapt to the downstream task (e.g., VQA) via a brand-new objective function, e.g., answer prediction. However, the inconsistency of the objective forms not only severely limits the generalization of pre-trained VL models to downstream tasks, but also requires a large amount of labeled data for fine-tuning. To alleviate the problem, we propose an innovative VL fine-tuning paradigm (named Declaration-based Prompt Tuning, abbreviated as DPT), which fine-tunes the model for downstream VQA using the pre-training objectives, boosting the effective adaptation of pre-trained models to the downstream task. Specifically, DPT reformulates the VQA task via (1) textual adaptation, which converts the given questions into declarative sentence form for prompt-tuning, and (2) task adaptation, which optimizes the objective function of the VQA problem in the manner of the pre-training phase. Experimental results on the GQA dataset show that DPT outperforms the fine-tuned counterpart by a large margin regarding accuracy in both fully-supervised (2.68%) and zero-shot/few-shot (over 31%) settings. All the data and codes will be available to facilitate future research.

JBHI Journal 2022 Journal Article

MobileUNet-FPN: A Semantic Segmentation Model for Fetal Ultrasound Four-Chamber Segmentation in Edge Computing Environments

  • Bin Pu
  • Yuhuan Lu
  • Jianguo Chen
  • Shengli Li
  • Ningbo Zhu
  • Wei Wei
  • Kenli Li

The apical four-chamber (A4C) view in fetal echocardiography is a prenatal examination widely used for the early diagnosis of congenital heart disease (CHD). Accurate segmentation of the key A4C anatomical structures is the basis for the automatic measurement of growth parameters and the necessary disease diagnosis. However, due to artefacts and scattering noise in ultrasound imaging, the variability of anatomical structures across gestational weeks, and the discontinuity of anatomical structure boundaries, accurately segmenting the fetal heart in the A4C view is a very challenging task. To this end, we propose to combine an explicit Feature Pyramid Network (FPN), MobileNet, and UNet, i.e., MobileUNet-FPN, for the segmentation of 13 key heart structures. To our knowledge, this is the first AI-based method that can segment this many anatomical structures in the fetal A4C view. We split the MobileNet backbone network into four stages, using the features of these four stages as the encoder and upsampling operations as the decoder. We build an explicit FPN to enhance multi-scale semantic information and ultimately generate segmentation masks of the key anatomical structures. In addition, we design a multi-level edge computing system and deploy distributed edge nodes in different hospitals and city servers, respectively. We then train the MobileUNet-FPN model in parallel at each edge node to effectively reduce the network communication overhead. Extensive experiments are conducted, and the results show the superior performance of the proposed model on fetal A4C and femoral-length images.

AAAI Conference 2022 Conference Paper

Multi-View Intent Disentangle Graph Networks for Bundle Recommendation

  • Sen Zhao
  • Wei Wei
  • Ding Zou
  • Xianling Mao

Bundle recommendation aims to recommend a bundle of items to the user as a whole. Previous models capture the user's preferences on both items and the associations among items. Nevertheless, they usually neglect the diversity of the user's intents in adopting items and fail to disentangle these intents in the learned representations. In real bundle recommendation scenarios, a user's intents may be naturally distributed across that user's different bundles (global view), while a single bundle may contain multiple intents of a user (local view). Each view has its advantages for intent disentangling: 1) from the global view, more items are involved in presenting each intent, which can demonstrate the user's preference under each intent more clearly; 2) from the local view, it can reveal the associations among items under each intent, since items within the same bundle are highly correlated with each other. To this end, we propose a novel model named Multi-view Intent Disentangle Graph Networks (MIDGN), which is capable of precisely and comprehensively capturing the diversity of the user's intents and items' associations at a finer granularity. Specifically, MIDGN disentangles the user's intents from two different perspectives: 1) at the global level, MIDGN disentangles the user's intents coupled with inter-bundle items; 2) at the local level, MIDGN disentangles the user's intents coupled with items within each bundle. Meanwhile, we compare the user's intents disentangled from the different views under a contrastive learning framework to improve the learned intents. Extensive experiments conducted on two benchmark datasets demonstrate that MIDGN outperforms the state-of-the-art methods by over 10.7% and 26.8%, respectively.

IJCAI Conference 2022 Conference Paper

Relational Triple Extraction: One Step is Enough

  • Yu-Ming Shang
  • Heyan Huang
  • Xin Sun
  • Wei Wei
  • Xian-Ling Mao

Extracting relational triples from unstructured text is an essential task in natural language processing and knowledge graph construction. Existing approaches usually contain two fundamental steps: (1) finding the boundary positions of head and tail entities; (2) concatenating specific tokens to form triples. However, nearly all previous methods suffer from the problem of error accumulation, i.e., the boundary recognition error for each entity in step (1) is accumulated into the final combined triples. To solve this problem, in this paper we introduce a fresh perspective to revisit the triple extraction task and propose a simple but effective model, named DirectRel. Specifically, the proposed model first generates candidate entities by enumerating token sequences in a sentence, and then transforms the triple extraction task into a linking problem on a "head -> tail" bipartite graph. By doing so, all triples can be directly extracted in a single step. Extensive experimental results on two widely used datasets demonstrate that the proposed model performs better than the state-of-the-art baselines.
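
The one-step extraction idea, enumerating candidate spans and then scoring head-tail links directly, can be illustrated with a toy sketch. The span-length limit and the dot-product scorer below are placeholder assumptions, not the model's actual scoring function:

```python
import numpy as np

def enumerate_spans(n_tokens, max_len=4):
    """All contiguous token spans up to max_len, as (start, end) inclusive."""
    return [(i, j) for i in range(n_tokens)
            for j in range(i, min(i + max_len, n_tokens))]

def link_head_tail(span_reprs, threshold=0.5):
    """One-step triple extraction as bipartite linking.

    span_reprs: (n_spans, d) candidate-entity vectors. Every (head, tail)
    pair whose sigmoid-scored link exceeds the threshold becomes a triple
    candidate, so no boundary decision precedes the pairing step.
    """
    scores = span_reprs @ span_reprs.T       # toy pairwise link scorer
    probs = 1.0 / (1.0 + np.exp(-scores))
    heads, tails = np.where(probs > threshold)
    return [(h, t) for h, t in zip(heads, tails) if h != t]
```

In the full model, `span_reprs` would come from a pre-trained encoder and the scorer would be relation-aware, so each surviving link also carries a relation label.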

JBHI Journal 2022 Journal Article

Skeleton-Based Abnormal Behavior Detection Using Secure Partitioned Convolutional Neural Network Model

  • Jiefan Qiu
  • Xinlei Yan
  • Wei Wang
  • Wei Wei
  • Kai Fang

Abnormal behavior detection is vital for evaluating the daily health status of patients with cognitive impairment. Previous studies on abnormal behavior detection indicate that convolutional neural network (CNN)-based computer vision offers high robustness and accuracy. However, executing a CNN model on the cloud can incur privacy disclosure during data transmission, while the high computation overhead makes it difficult to execute the model on edge-end IoT devices with good real-time performance. In this paper, we realize skeleton-based abnormal behavior detection and propose a secure partitioned CNN model (SP-CNN) that extracts human skeleton keypoints and achieves safe collaborative computing by deploying different CNN layers on the cloud and on the IoT device. Because the data output by the IoT device has already passed through several CNN layers, rather than being sensitive raw video, the risk of privacy disclosure is objectively reduced. Moreover, we design an encryption method based on channel state information (CSI) to guarantee the security of the sensitive data. Finally, we apply SP-CNN to abnormal behavior detection to evaluate its effectiveness. The experimental results show that the efficiency of abnormal behavior detection based on SP-CNN is at least 33.2% higher than that of state-of-the-art methods, while its detection accuracy reaches 97.54%.

TMLR Journal 2022 Journal Article

Unsupervised Mismatch Localization in Cross-Modal Sequential Data with Application to Mispronunciations Localization

  • Wei Wei
  • Hengguan Huang
  • Xiangming Gu
  • Hao Wang
  • Ye Wang

Content mismatch usually occurs when data from one modality is translated to another, e.g. language learners producing mispronunciations (errors in speech) when reading a sentence (target text) aloud. However, most existing alignment algorithms assume that the content involved in the two modalities is perfectly matched, thus leading to difficulty in locating such mismatch between speech and text. In this work, we develop an unsupervised learning algorithm that can infer the relationship between content-mismatched cross-modal sequential data, especially for speech-text sequences. More specifically, we propose a hierarchical Bayesian deep learning model, dubbed mismatch localization variational autoencoder (ML-VAE), which decomposes the generative process of the speech into hierarchically structured latent variables, indicating the relationship between the two modalities. Training such a model is very challenging due to the discrete latent variables with complex dependencies involved. To address this challenge, we propose a novel and effective training procedure that alternates between estimating the hard assignments of the discrete latent variables over a specifically designed mismatch localization finite-state acceptor (ML-FSA) and updating the parameters of neural networks. In this work, we focus on the mismatch localization problem for speech and text, and our experimental results show that ML-VAE successfully locates the mismatch between text and speech, without the need for human annotations for model training.

AAAI Conference 2021 Conference Paper

A Student-Teacher Architecture for Dialog Domain Adaptation Under the Meta-Learning Setting

  • Kun Qian
  • Wei Wei
  • Zhou Yu

Numerous new dialog domains are being created every day, while collecting data for these domains is extremely costly since it involves human interactions. Therefore, it is essential to develop algorithms that can adapt to different domains efficiently when building data-driven dialog models. Most recent research on domain adaptation focuses on giving the model a better initialization, rather than optimizing the adaptation process. We propose an efficient domain-adaptive task-oriented dialog system model, which incorporates a meta-teacher model to emphasize the different impacts of generated tokens with respect to the context. We first train our base dialog model and meta-teacher model adversarially in a meta-learning setting on rich-resource domains. The meta-teacher learns to quantify the importance of tokens under different contexts across different domains. During adaptation, the meta-teacher guides the dialog model to focus on important tokens in order to achieve better adaptation efficiency. We evaluate our model on two multi-domain datasets, MultiWOZ and Google Schema-Guided Dialogue, and achieve state-of-the-art performance.

JBHI Journal 2021 Journal Article

Boundary Aware U-Net for Retinal Layers Segmentation in Optical Coherence Tomography Images

  • Bo Wang
  • Wei Wei
  • Shuang Qiu
  • Shengpei Wang
  • Dan Li
  • Huiguang He

Retinal layer segmentation in optical coherence tomography (OCT) images is a critical step in the diagnosis of numerous ocular diseases. Automatic layer segmentation requires separating each individual layer instance with accurate boundary detection, but remains a challenging task since it suffers from speckle noise, intensity inhomogeneity, and low contrast around boundaries. In this work, we propose a boundary-aware U-Net (BAU-Net) for retinal layer segmentation through accurate boundary detection. Based on an encoder-decoder architecture, we design a dual-task framework with low-level outputs for boundary detection and high-level outputs for layer segmentation. Specifically, we first use a multi-scale input strategy to enrich the spatial information in the deep features of the encoder. For low-level features from the encoder, we design an edge-aware (EA) module in the skip connection to extract pure edge features. Then, a U-structure feature-enhanced (UFE) module is designed in all skip connections to enlarge the receptive fields of the features from the encoder. Besides, a canny edge fusion (CEF) module is introduced into the aforementioned architecture, which fuses the prior edge information from the segmentation task into the boundary detection branch for better prediction. Furthermore, we model each boundary as a distribution over vertical coordinates for boundary detection. Based on this distribution, a topology-guarantee loss combining an A-scan regression loss and a structure loss is proposed to produce an accurate boundary set with guaranteed topology. The method is evaluated on two public datasets, and the results demonstrate that BAU-Net achieves more promising performance than other state-of-the-art methods.

JBHI Journal 2021 Journal Article

CSU-Net: A Context Spatial U-Net for Accurate Blood Vessel Segmentation in Fundus Images

  • Bo Wang
  • Shengpei Wang
  • Shuang Qiu
  • Wei Wei
  • Haibao Wang
  • Huiguang He

Blood vessel segmentation in fundus images is a critical procedure in the diagnosis of ophthalmic diseases. Recent deep learning methods achieve high accuracy in vessel segmentation but still struggle to segment the microvasculature and detect vessel boundaries. This is because common Convolutional Neural Networks (CNN) are unable to preserve rich spatial information and a large receptive field simultaneously. Besides, CNN models for vessel segmentation are usually trained with a uniformly weighted pixel-level cross-entropy loss, which tends to miss fine vessel structures. In this paper, we propose a novel Context Spatial U-Net (CSU-Net) for blood vessel segmentation. In contrast with other U-Net-based models, we design a two-channel encoder: a context channel with multi-scale convolutions to capture a larger receptive field, and a spatial channel with large kernels to retain spatial information. To combine and strengthen the features extracted from the two paths, we introduce a feature fusion module (FFM) and an attention skip module (ASM). Furthermore, we propose a structure loss, which adds a spatial weight to the cross-entropy loss and guides the network to focus more on thin vessels and boundaries. We evaluated this model on three public datasets: DRIVE, CHASE-DB1, and STARE. The results show that CSU-Net achieves higher segmentation accuracy than the current state-of-the-art methods.
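
The structure-loss idea, a spatial weight added to the per-pixel cross-entropy, can be sketched as follows. Deriving the weight map from label boundaries with a fixed up-weighting factor is an illustrative assumption; the paper's exact weighting scheme may differ:

```python
import numpy as np

def boundary_weight(labels, alpha=4.0):
    """Crude boundary map: up-weight pixels whose 4-neighbourhood is not uniform."""
    pad = np.pad(labels, 1, mode='edge')
    diff = (np.abs(pad[1:-1, 2:] - labels) + np.abs(pad[1:-1, :-2] - labels)
            + np.abs(pad[2:, 1:-1] - labels) + np.abs(pad[:-2, 1:-1] - labels))
    return 1.0 + alpha * (diff > 0)

def structure_loss(probs, labels, weight):
    """Spatially weighted binary cross-entropy.

    probs, labels, weight: arrays of shape (H, W). `weight` up-weights
    pixels near thin structures and boundaries so that errors there
    dominate the loss instead of being averaged away.
    """
    eps = 1e-7
    p = np.clip(probs, eps, 1 - eps)
    bce = -(labels * np.log(p) + (1 - labels) * np.log(1 - p))
    return float((weight * bce).sum() / weight.sum())
```

With `alpha = 0` this reduces to plain pixel-level cross-entropy, which is exactly the uniform loss the abstract argues misses fine vessels.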

JBHI Journal 2021 Journal Article

Guest Editorial AI and 5G Empowered Internet of Medical Things

  • Syed Hassan Ahmed
  • Victor Hugo de Albuquerque
  • Wei Wei
  • Wei Wang

The papers in this special section focus on artificial intelligence (AI) and the 5G Internet of Medical Things. Recent developments in biomedical sensors, wireless communication systems, and information networks are transforming conventional healthcare systems. The transformed healthcare systems enable distributed healthcare services for patients who may not be co-located with the healthcare providers, provide early diagnoses, and reduce costs in the healthcare sector. The Internet of Medical Things (IoMT), which includes medical devices, wearable devices, sensors, and apps, is a critical piece of the digital transformation of healthcare, as it allows new business models to emerge and enables changes in work processes, productivity improvements, cost containment, and enhanced customer experiences. IoMT can help monitor, inform, and notify not only caregivers, but also provide healthcare providers with actual data to identify issues bef

AAAI Conference 2021 Conference Paper

Reinforced History Backtracking for Conversational Question Answering

  • Minghui Qiu
  • Xinjing Huang
  • Cen Chen
  • Feng Ji
  • Chen Qu
  • Wei Wei
  • Jun Huang
  • Yin Zhang

Modeling the context history in multi-turn conversations has become a critical step towards a better understanding of the user query in question answering systems. To utilize the context history, most existing studies treat the whole context as input, which inevitably faces two challenges. First, modeling a long history can be costly, as it requires more computational resources. Second, a long context history contains a lot of irrelevant information, which makes it difficult to model the information relevant to the user query. To alleviate these problems, in this paper we propose a reinforcement-learning-based method to capture and backtrack the related conversation history to boost model performance. Our method seeks to automatically backtrack the history information with implicit feedback from the model performance. We further consider both immediate and delayed rewards to guide the reinforced backtracking policy. Extensive experiments on a large conversational question answering dataset show that the proposed method helps alleviate the problems arising from longer context histories. Meanwhile, experiments show that the method yields better performance than other strong baselines, and that the actions taken by the method are insightful.

AAAI Conference 2021 Conference Paper

Spatiotemporal Graph Neural Network based Mask Reconstruction for Video Object Segmentation

  • Daizong Liu
  • Shuangjie Xu
  • Xiao-Yang Liu
  • Zichuan Xu
  • Wei Wei
  • Pan Zhou

This paper addresses the task of segmenting class-agnostic objects in the semi-supervised setting. Although previous detection-based methods achieve relatively good performance, these approaches extract the best proposal with a greedy strategy, which may lose the local patch details outside the chosen candidate. In this paper, we propose a novel spatiotemporal graph neural network (STG-Net) to reconstruct more accurate masks for video object segmentation, which captures the local contexts by utilizing all proposals. In the spatial graph, we treat the object proposals of a frame as nodes and represent their correlations with an edge-weight strategy for mask context aggregation. To capture temporal information from previous frames, we use a memory network to refine the mask of the current frame by retrieving historic masks in a temporal graph. The joint use of both local patch details and temporal relationships allows us to better address challenges such as object occlusion and missing objects. Without online learning or fine-tuning, our STG-Net achieves state-of-the-art performance on four large benchmarks (DAVIS, YouTube-VOS, SegTrackv2, and YouTube-Objects), demonstrating the effectiveness of the proposed approach.

AAAI Conference 2020 Conference Paper

Are Noisy Sentences Useless for Distant Supervised Relation Extraction?

  • Yuming Shang
  • He-Yan Huang
  • Xian-Ling Mao
  • Xin Sun
  • Wei Wei

The noisy labeling problem has been one of the major obstacles for distant supervised relation extraction. Existing approaches usually consider that the noisy sentences are useless and will harm the model’s performance. Therefore, they mainly alleviate this problem by reducing the influence of noisy sentences, such as applying bag-level selective attention or removing noisy sentences from sentence-bags. However, the underlying cause of the noisy labeling problem is not the lack of useful information, but the missing relation labels. Intuitively, if we can allocate credible labels for noisy sentences, they will be transformed into useful training data and benefit the model’s performance. Thus, in this paper, we propose a novel method for distant supervised relation extraction, which employs unsupervised deep clustering to generate reliable labels for noisy sentences. Specifically, our model contains three modules: a sentence encoder, a noise detector and a label generator. The sentence encoder is used to obtain feature representations. The noise detector detects noisy sentences from sentence-bags, and the label generator produces high-confidence relation labels for noisy sentences. Extensive experimental results demonstrate that our model outperforms the state-of-the-art baselines on a popular benchmark dataset, and can indeed alleviate the noisy labeling problem.

NeurIPS Conference 2020 Conference Paper

Differentiable Top-k with Optimal Transport

  • Yujia Xie
  • Hanjun Dai
  • Minshuo Chen
  • Bo Dai
  • Tuo Zhao
  • Hongyuan Zha
  • Wei Wei
  • Tomas Pfister

Finding the k largest or smallest elements from a collection of scores, i.e., the top-k operation, is an important model component widely used in information retrieval, machine learning, and data mining. However, if the top-k operation is implemented algorithmically, e.g., with a sorting procedure such as bubble sort, the resulting model cannot be trained end-to-end with prevalent gradient descent algorithms. This is because such implementations typically involve swapping indices, whose gradient cannot be computed. Moreover, the corresponding mapping from the input scores to the indicator vector of whether an element belongs to the top-k set is essentially discontinuous. To address this issue, we propose a smoothed approximation, namely the SOFT (Scalable Optimal transport-based diFferenTiable) top-k operator. Specifically, our SOFT top-k operator approximates the output of the top-k operation as the solution of an Entropic Optimal Transport (EOT) problem. The gradient of the SOFT operator can then be efficiently approximated based on the optimality conditions of the EOT problem. We apply the proposed operator to the k-nearest neighbors and beam search algorithms, and numerical experiments demonstrate improved performance.
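As an illustration of the entropic-OT view of top-k (a toy sketch of the idea in the abstract, not the authors' implementation; all names and parameter choices are made up), the snippet below transports n normalised scores onto two anchors, a "top-k" bin of mass k/n and a "rest" bin of mass (n-k)/n, using plain Sinkhorn iterations. The mass each score sends to the top bin is a smooth surrogate for the hard top-k indicator.

```python
import numpy as np

def soft_topk(scores, k, eps=0.01, n_iter=500):
    """Soft top-k membership via entropic optimal transport (Sinkhorn)."""
    x = np.asarray(scores, dtype=float)
    n = x.size
    x = (x - x.min()) / (x.max() - x.min())    # normalise scores to [0, 1]
    y = np.array([0.0, 1.0])                   # anchors: "rest" / "top-k"
    C = (x[:, None] - y[None, :]) ** 2         # n x 2 transport cost
    mu = np.full(n, 1.0 / n)                   # uniform source marginal
    nu = np.array([(n - k) / n, k / n])        # target marginal enforces |top-k| = k
    K = np.exp(-C / eps)
    u = np.ones(n)
    for _ in range(n_iter):                    # Sinkhorn scaling updates
        v = nu / (K.T @ u)
        u = mu / (K @ v)
    P = u[:, None] * K * v[None, :]            # entropic transport plan
    return n * P[:, 1]                         # soft indicator in [0, 1]
```

With a small entropy weight `eps` the soft indicator concentrates near 0/1 while remaining smooth in the scores, which is what makes this usable inside gradient-based training.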

AAAI Conference 2020 Conference Paper

InstaNAS: Instance-Aware Neural Architecture Search

  • An-Chieh Cheng
  • Chieh Hubert Lin
  • Da-Cheng Juan
  • Wei Wei
  • Min Sun

Conventional Neural Architecture Search (NAS) aims at finding a single architecture that achieves the best performance, usually optimizing task-related objectives such as accuracy. However, a single architecture may not be representative enough for a whole dataset with high diversity and variety. Intuitively, electing domain-expert architectures that are proficient in domain-specific features can further benefit architecture-related objectives such as latency. In this paper, we propose InstaNAS, an instance-aware NAS framework that employs a controller trained to search for a "distribution of architectures" instead of a single architecture. This allows the model to use sophisticated architectures, which usually come with a large architecture-related cost, for difficult samples and shallow architectures for easy samples. During the inference phase, the controller assigns each unseen input sample a domain-expert architecture that achieves high accuracy at a customized inference cost. Experiments within a search space inspired by MobileNetV2 show InstaNAS can achieve up to 48.8% latency reduction without compromising accuracy on a series of datasets against MobileNetV2.

NeurIPS Conference 2020 Conference Paper

Mitigating Forgetting in Online Continual Learning via Instance-Aware Parameterization

  • Hung-Jen Chen
  • An-Chieh Cheng
  • Da-Cheng Juan
  • Wei Wei
  • Min Sun

Online continual learning is a challenging scenario in which a model must learn from a continuous stream of data without revisiting any previously encountered data instances. Catastrophic forgetting is worsened because the model must address forgetting not only at the task level but also at the data-instance level within the same task. To mitigate this, we leverage the concept of "instance awareness" in the neural network, where each data instance is classified by a path in the network searched by a controller from a meta-graph. To preserve the knowledge learned from previous instances, we propose a method that protects a path by restricting the gradient updates of one instance from overriding past updates calculated from previous instances when the instances are not similar; conversely, it encourages fine-tuning the path when the incoming instance is similar to previous instances. The mechanism of selecting paths according to instance similarity is naturally determined by the controller, which is compact and updated online. Experimental results show that the proposed method outperforms state-of-the-art methods in online continual learning. Furthermore, the proposed method is evaluated in a realistic setting where the boundaries between tasks are blurred. Experimental results confirm that it outperforms the state of the art on CIFAR-10, CIFAR-100, and Tiny-ImageNet.

IJCAI Conference 2020 Conference Paper

MLS3RDUH: Deep Unsupervised Hashing via Manifold based Local Semantic Similarity Structure Reconstructing

  • Rong-Cheng Tu
  • Xian-Ling Mao
  • Wei Wei

Most unsupervised hashing methods map images into semantic similarity-preserving hash codes by constructing a local semantic similarity structure as guiding information, i.e., treating each point as similar to its k nearest neighbours. However, some of an image's k nearest neighbours may be dissimilar to it; such noisy datapoints damage retrieval performance. To tackle this problem, we propose a novel deep unsupervised hashing method, called MLS3RDUH, which reduces the impact of noisy datapoints to further enhance retrieval performance. Specifically, the proposed method first defines a novel similarity matrix that reconstructs the local semantic similarity structure by utilising the intrinsic manifold structure of the feature space and the cosine similarity of datapoints. A novel log-cosh hashing loss function is then used to optimize the hashing network to generate compact hash codes, with the defined similarity as guiding information. Extensive experiments on three public datasets show that the proposed method outperforms state-of-the-art baselines.
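As a toy illustration of the loss described in the abstract (not the paper's code; the manifold-based similarity reconstruction is omitted and the names are hypothetical), the sketch below compares the inner-product similarity of relaxed hash codes against a guide similarity matrix with a log-cosh penalty, which behaves roughly quadratically for small residuals and linearly for large ones:

```python
import numpy as np

def guide_similarity(features):
    """Cosine similarity of pre-extracted features, used as guiding information."""
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    return f @ f.T

def log_cosh_loss(codes, S):
    """Log-cosh loss between relaxed hash-code similarity and the guide matrix S."""
    d = codes.shape[1]
    S_hat = codes @ codes.T / d      # code similarity, in [-1, 1] for +/-1 codes
    return np.mean(np.log(np.cosh(S_hat - S)))
```

The loss is zero exactly when the code similarities match the guide, and its bounded gradient for large residuals makes it less sensitive to the remaining noisy pairs than a squared loss would be.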

AAAI Conference 2020 Conference Paper

Pixel-Aware Deep Function-Mixture Network for Spectral Super-Resolution

  • Lei Zhang
  • Zhiqiang Lang
  • Peng Wang
  • Wei Wei
  • Shengcai Liao
  • Ling Shao
  • Yanning Zhang

Spectral super-resolution (SSR) aims at generating a hyperspectral image (HSI) from a given RGB image. Recently, a promising direction is to learn a complicated mapping function from the RGB image to its HSI counterpart using a deep convolutional neural network. This essentially involves mapping the RGB context within a size-specific receptive field centered at each pixel to its spectrum in the HSI. The focus thereon is to appropriately determine the receptive field size and establish the mapping function from RGB context to the corresponding spectrum. Due to their differences in category or spatial position, pixels in HSIs often require different-sized receptive fields and distinct mapping functions. However, few efforts have been invested to explicitly exploit this prior. To address this problem, we propose a pixel-aware deep function-mixture network for SSR, which is composed of a new class of modules termed function-mixture (FM) blocks. Each FM block is equipped with several basis functions, i.e., parallel subnets with different-sized receptive fields. In addition, it incorporates an extra subnet as a mixing function to generate pixel-wise weights, and then linearly mixes the outputs of all basis functions with those weights. This enables us to determine the receptive field size and the mapping function pixel-wise. Moreover, we stack several such FM blocks to further increase the flexibility of the network in learning the pixel-wise mapping. To encourage feature reuse, intermediate features generated by the FM blocks are fused at a late stage, which proves effective for boosting SSR performance. Experimental results on three benchmark HSI datasets demonstrate the superiority of the proposed method.
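To make the mixing mechanics concrete, here is a deliberately tiny single-channel NumPy sketch (hypothetical names, fixed rather than learned kernels): two basis branches with different receptive fields are combined by a pixel-wise softmax produced by a third branch.

```python
import numpy as np

def conv2d_same(x, w):
    """Naive single-channel 'same' convolution with zero padding."""
    k = w.shape[0]
    xp = np.pad(x, k // 2)
    out = np.empty_like(x, dtype=float)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = np.sum(xp[i:i + k, j:j + k] * w)
    return out

def fm_block(x, basis_kernels, mix_kernels):
    """Mix basis branches of different receptive fields with pixel-wise weights."""
    basis = np.stack([conv2d_same(x, w) for w in basis_kernels])   # (B, H, W)
    logits = np.stack([conv2d_same(x, m) for m in mix_kernels])    # (B, H, W)
    e = np.exp(logits - logits.max(axis=0, keepdims=True))
    weights = e / e.sum(axis=0, keepdims=True)                     # pixel-wise softmax
    return (weights * basis).sum(axis=0), weights
```

In the paper the basis subnets and the mixing subnet are learned convolutions; fixed averaging kernels stand in here so that the per-pixel convex combination of branch outputs is easy to see.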

NeurIPS Conference 2019 Conference Paper

Abstract Reasoning with Distracting Features

  • Kecheng Zheng
  • Zheng-Jun Zha
  • Wei Wei

Abstract reasoning is a long-standing challenge in artificial intelligence. Recent studies suggest that many deep architectures that have triumphed in other domains fail to work well in abstract reasoning. In this paper, we first illustrate that one of the main challenges in such reasoning tasks is the presence of distracting features, which requires the learning algorithm to leverage counter-evidence and reject false hypotheses in order to learn the true patterns. We then show that a carefully designed learning trajectory over different categories of training data can effectively boost learning performance by mitigating the impact of distracting features. Inspired by this, we propose the feature-robust abstract reasoning (FRAR) model, which consists of a reinforcement learning-based teacher network that determines the sequence of training and a student network that makes predictions. Experimental results demonstrate strong improvements over baseline algorithms, and we beat state-of-the-art models by 18.7% on the RAVEN dataset and 13.3% on the PGM dataset.

NeurIPS Conference 2019 Conference Paper

Meta Architecture Search

  • Albert Shaw
  • Wei Wei
  • Weiyang Liu
  • Le Song
  • Bo Dai

Neural Architecture Search (NAS) has been quite successful in constructing state-of-the-art models on a variety of tasks. Unfortunately, its computational cost can make it difficult to scale. In this paper, we make the first attempt to study Meta Architecture Search, which aims at learning a task-agnostic representation that can be used to speed up architecture search on a large number of tasks. We propose the Bayesian Meta Architecture SEarch (BASE) framework, which takes advantage of a Bayesian formulation of the architecture search problem to learn over an entire set of tasks simultaneously. We show that on ImageNet classification, we can find a model that achieves 25.7% top-1 error and 8.1% top-5 error by adapting the architecture in less than an hour from a meta-network pretrained for 8 GPU-days. By learning a good prior for NAS, our method dramatically decreases the required computation while achieving performance comparable to current state-of-the-art methods, even finding competitive models for unseen datasets with very quick adaptation. We believe our framework will open up new possibilities for efficient and massively scalable architecture search research across multiple tasks.

ICML Conference 2019 Conference Paper

Policy Certificates: Towards Accountable Reinforcement Learning

  • Christoph Dann
  • Lihong Li 0001
  • Wei Wei
  • Emma Brunskill

The performance of a reinforcement learning algorithm can vary drastically during learning because of exploration. Existing algorithms provide little information about the quality of their current policy before executing it, and thus have limited use in high-stakes applications like healthcare. We address this lack of accountability by proposing that algorithms output policy certificates. These certificates bound the sub-optimality and return of the policy in the next episode, allowing humans to intervene when the certified quality is not satisfactory. We further introduce two new algorithms with certificates and present a new framework for theoretical analysis that guarantees the quality of their policies and certificates. For tabular MDPs, we show that computing certificates can even improve the sample-efficiency of optimism-based exploration. As a result, one of our algorithms is the first to achieve minimax-optimal PAC bounds up to lower-order terms, and this algorithm also matches (and in some settings slightly improves upon) existing minimax regret bounds.

IJCAI Conference 2018 Conference Paper

Learning to Explain Ambiguous Headlines of Online News

  • Tianyu Liu
  • Wei Wei
  • Xiaojun Wan

With the purpose of attracting clicks, online news publishers and editors use diverse strategies to make their headlines catchy, at the expense of accuracy. In particular, a considerable portion of news headlines is ambiguous: such headlines are unclear relative to the content of the story and largely degrade the reading experience of the audience. In this paper, we focus on bridging the information gap caused by ambiguous news headlines. We define a new task of explaining ambiguous headlines with short informative texts, and build a benchmark dataset for evaluation. We address the task by selecting a suitable sentence from the news body to resolve the ambiguity in an ambiguous headline. Both feature engineering and neural network methods are explored. For feature engineering, we improve a standard SVM classifier with elaborately designed features. For neural networks, we propose an ambiguity-aware neural matching model based on a previous model. Using automatic and manual evaluation metrics, we demonstrate the efficacy and complementarity of the two methods; the ambiguity-aware neural matching model achieves state-of-the-art performance on this challenging task.

IJCAI Conference 2017 Conference Paper

Dynamic Programming Bipartite Belief Propagation For Hyper Graph Matching

  • Zhen Zhang
  • Julian McAuley
  • Yong Li
  • Wei Wei
  • Yanning Zhang
  • Qinfeng Shi

Hypergraph matching problems have recently drawn attention due to their ability to embed higher-order relations between nodes. In this paper, we formulate hypergraph matching as a constrained MAP inference problem in graphical models. Whereas previous discrete approaches introduce several global correspondence vectors, we introduce only one global correspondence vector together with several local correspondence vectors. This allows us to decompose the problem into a (linear) bipartite matching problem and several belief propagation sub-problems. Bipartite matching can be solved by traditional approaches, while each belief propagation sub-problem is further decomposed into two sub-problems with optimal substructure. A newly proposed dynamic programming procedure is then used to solve the belief propagation sub-problem. Experiments show that the proposed methods outperform state-of-the-art techniques for hypergraph matching.

IJCAI Conference 2017 Conference Paper

Learning to Identify Ambiguous and Misleading News Headlines

  • Wei Wei
  • Xiaojun Wan

Accuracy is one of the basic principles of journalism. However, it is increasingly hard to maintain given the diversity of news media. Some editors of online news use catchy headlines that trick readers into clicking. These headlines are either ambiguous or misleading, degrading the reading experience of the audience. Thus, identifying inaccurate news headlines is a task worth studying. Previous work names these headlines "clickbait" and mainly focuses on features extracted from the headlines themselves, which limits performance since the consistency between headlines and news bodies is underappreciated. In this paper, we clearly redefine the problem and identify ambiguous and misleading headlines separately. We utilize class sequential rules to exploit structural information when detecting ambiguous headlines. For the identification of misleading headlines, we extract features based on the congruence between headlines and bodies. To make use of a large unlabeled dataset, we apply a co-training method and gain an increase in performance. The experimental results show the effectiveness of our methods. We then use our classifiers to detect inaccurate headlines crawled from different sources and conduct a data analysis.

AAAI Conference 2017 Conference Paper

Solving Constrained Combinatorial Optimisation Problems via MAP Inference without High-Order Penalties

  • Zhen Zhang
  • Qinfeng Shi
  • Julian McAuley
  • Wei Wei
  • Yanning Zhang
  • Rui Yao
  • Anton van den Hengel

Solving constrained combinatorial optimization problems via MAP inference is often achieved by introducing extra potential functions for each constraint. This can result in very high-order potentials: e.g., a 2nd-order objective with pairwise potentials and a quadratic constraint over all N variables would correspond to an unconstrained objective with an order-N potential. This limits the practicality of such an approach, since inference with high-order potentials is tractable only for a few special classes of functions. We propose an approach that solves constrained combinatorial problems using belief propagation without increasing the order; for example, in our scheme the 2nd-order problem above remains order 2 instead of order N. Experiments on applications including foreground detection, image reconstruction, quadratic knapsack, and the M-best solutions problem demonstrate the effectiveness and efficiency of our method. Moreover, we show several situations in which our approach outperforms commercial solvers like CPLEX and others designed for specific constrained MAP inference problems.

ICRA Conference 2014 Conference Paper

Development of a symmetrical spiral wireless microrobot in pipe for biomedical applications

  • Shuxiang Guo
  • Xiang Wei
  • Jian Guo 0003
  • Wei Wei
  • Yuehui Ji
  • Yunliang Wang

Colonoscopy is an important procedure for the diagnosis of various pathologies, in particular cancer of the colon and rectum. However, colonoscopy is often painful for the patient and complex for the doctor, so in the biomedical field there is urgent demand for a wireless in-pipe microrobot that can move smoothly in water or aqueous media. In this paper, we develop a new kind of wireless microrobot with a symmetrical spiral structure, which also has symmetrical kinematic characteristics. Based on hydromechanical lubrication theory and Newton's law of viscosity, we build a motion model of the microrobot, which provides a theoretical basis for designing its optimal structural parameters. Through analysis, simulations, and experiments, this paper evaluates the effect of the spiral angle; the microrobot can realize forward-backward and upward-downward motion and stop at any desired position in the pipe. In addition, we measured the moving speeds of forward-backward and upward-downward motion in the pipe. The experimental results indicate that the maximum moving speed is 36.5 mm/s at 14 Hz in the horizontal direction and 4.6 mm/s at 16 Hz in the vertical direction with an input current of 0.7 A. Finally, we designed a control panel for the system, which controls the microrobot's motion states intuitively and easily and makes the system more portable and compact. The developed wireless microrobot can move smoothly in water and other liquid media and is very useful in industrial applications.

IJCAI Conference 2013 Conference Paper

Joint and Coupled Bilingual Topic Model Based Sentence Representations for Language Model Adaptation

  • Shixiang Lu
  • Xiaoyin Fu
  • Wei Wei
  • Xingyuan Peng
  • Bo Xu

This paper is concerned with data selection for adapting the language model (LM) in statistical machine translation (SMT), and aims to find LM training sentences that are topically similar to the translation task. Although traditional approaches have achieved significant performance, they ignore topic information and the distribution of words when selecting similar training sentences. In this paper, we present two bilingual topic model (BLTM) based sentence representations (joint and coupled BLTM) for cross-lingual data selection. We map the data selection task into cross-lingual semantic representations that are language independent, then rank and select sentences in the target-language LM training corpus for each sentence in the translation task by semantics-based likelihood. The semantic representations are learned from the parallel corpus, under the assumption that the two sides of a bilingual pair share the same or similar distribution over semantic topics. Large-scale experimental results demonstrate that our approaches significantly outperform state-of-the-art approaches on both LM perplexity and translation performance.

AAAI Conference 2011 Conference Paper

Integrating Community Question and Answer Archives

  • Wei Wei
  • Gao Cong
  • Xiaoli Li
  • See-Kiong Ng
  • Guohui Li

Question and answer pairs in Community Question Answering (CQA) services are organized into hierarchical structures, or taxonomies, so that users can conveniently find answers to their questions. We observe that different CQA services have their own knowledge focus and use different taxonomies to organize the question and answer pairs in their archives. As there are no simple semantic mappings between the taxonomies of different CQA services, integrating CQA services is a challenging task. Existing approaches to integrating taxonomies ignore the hierarchical structure of the source taxonomy. In this paper, we propose a novel approach that incorporates the parent-child and sibling information in the hierarchical structure of the source taxonomy for accurate taxonomy integration. Our experimental results with real-world CQA data demonstrate that the proposed method significantly outperforms state-of-the-art methods.

ICRA Conference 2007 Conference Paper

Design and Theoretical Evaluation of Micro-Surgical Manipulators for Orbital Manipulation and Intraocular Dexterity

  • Wei Wei
  • Roger E. Goldman
  • Nabil Simaan
  • Howard F. Fine
  • Stanley Chang

This paper addresses the design considerations and dexterity evaluation of a novel hybrid two-armed micro-surgical slave robot equipped with intraocular dexterity devices. A unified framework for the kinematic modeling of this robot is presented while using the kinematic constraints stemming from the constrained motion of the eye. An augmented Jacobian describing the kinematics of the eye and the relative motion of each one of the two Intra-Ocular Dexterity Robots (IODR) is presented. Using this framework, the capabilities of this two-armed robot in performing dexterous intraocular operations are evaluated and compared to a similar robot without intra-ocular dexterity. The Kinematic Conditioning Index (KCI) for the proposed robot is shown to be significant. The results presented show an increase of approximately 33% and 47% in translational and rotational KCI respectively.

ICRA Conference 2005 Conference Paper

A New Multi-Robot Self-Determination Cooperation Method Based on Immune Agent Network

  • Yunyuan Gao
  • Wei Wei

In this paper, we propose an immune agent model that combines the artificial immune system with agent technology. In the model, a robot is regarded as an antibody and each environmental condition as an antigen. Based on this model, a new multi-robot cooperation algorithm is designed to build self-determined cooperation among robots, even in a new environment. Inspired by the pheromone of ant algorithms, a new pheromone serving as inter-stimulus between robots is introduced into the algorithm. By comparing the inter-stimulus values between antigen and antibody and among antibodies, the system autonomously produces appropriate antibodies to kill the antigen. Finally, the model is verified by simulation.

AAAI Conference 2004 Conference Paper

Towards Efficient Sampling: Exploiting Random Walk Strategies

  • Wei Wei

From a computational perspective, there is a close connection between various probabilistic reasoning tasks and the problem of counting or sampling satisfying assignments of a propositional theory. We consider the question of whether state-of-the-art satisfiability procedures, based on random walk strategies, can be used to sample uniformly or near-uniformly from the space of satisfying assignments. We first show that random walk SAT procedures often do reach the full set of solutions of complex logical theories. Moreover, by interleaving random walk steps with Metropolis transitions, we also show how the sampling becomes near-uniform.
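A minimal sketch of the hybrid sampler described above (illustrative only, not the paper's implementation; the `walk_prob` and `temp` parameters are arbitrary choices): random-walk flips drawn from unsatisfied clauses are interleaved with Metropolis flips whose acceptance depends on the change in the number of unsatisfied clauses.

```python
import math
import random

def sample_sat(clauses, n_vars, walk_prob=0.5, temp=0.5,
               max_flips=20000, rng=None):
    """Hybrid random-walk / Metropolis search for a satisfying assignment.

    Clauses are lists of nonzero ints; literal v means variable v is true,
    -v means it is false (DIMACS-style, variables numbered from 1).
    """
    rng = rng or random.Random(0)
    assign = [False] + [rng.random() < 0.5 for _ in range(n_vars)]

    def clause_sat(cl):
        return any(assign[abs(l)] == (l > 0) for l in cl)

    def n_unsat():
        return sum(not clause_sat(cl) for cl in clauses)

    for _ in range(max_flips):
        unsat = [cl for cl in clauses if not clause_sat(cl)]
        if not unsat:
            return assign[1:]                     # satisfying assignment found
        if rng.random() < walk_prob:
            # Random-walk move: flip a variable from an unsatisfied clause.
            v = abs(rng.choice(rng.choice(unsat)))
            assign[v] = not assign[v]
        else:
            # Metropolis move: uniform flip, accepted by energy difference.
            v = rng.randrange(1, n_vars + 1)
            before = n_unsat()
            assign[v] = not assign[v]
            delta = n_unsat() - before
            if delta > 0 and rng.random() >= math.exp(-delta / temp):
                assign[v] = not assign[v]         # reject the uphill flip
    return None                                   # no solution within budget
```

The random-walk moves supply fast hill-descending search, while the Metropolis moves satisfy detailed balance and push repeated runs toward a more uniform spread over the solution set.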