Author name cluster

Wei Ye

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

35 papers

1 author row

AAAI Conference 2026 Conference Paper

ASKD: Reinforcement Learning-Style Knowledge Distillation with Quality-Adaptive Skewness

Mingjie Zhang
Xiaoling Zhou
Yuxiao Luo
Yiyu Liu
Shikun Zhang
Wei Ye

Knowledge distillation (KD) is a widely adopted technique for transferring the capabilities of large teacher models to smaller student models, thereby significantly reducing inference costs and memory consumption. However, existing KD methods are all constrained by an inherent greedy optimization objective, rooted in the assumption of teacher superiority: "Trust all teacher-generated outputs (TGOs)" and "Distrust any student-generated outputs (SGOs) unsupported by the teacher". We propose ASKD, a novel KD method with adaptive skewness determined by sample quality, refining this objective to: "Learn TGOs proportionally to their quality, and distrust only low-quality unsupported SGOs". ASKD comprises three key components: (1) A reinforcement learning-style optimization formulation to mitigate the inherent approximation bias in sample-based Kullback-Leibler (KL) divergence approximations used by previous KD methods; (2) Well-designed quality supervision signals to map and achieve adaptive skewness in skewed KL loss, pioneering the usage of sample quality to adjust learning magnitudes; (3) A gradient-clip function on high-quality SGOs, motivated by the finding that high-quality SGOs in the KL loss fail to yield positive updates and even cause adverse effects on some samples. Extensive experiments indicate that ASKD builds high-performance student models across various tasks, including instruction following, mathematical reasoning, and code generation, outperforming state-of-the-art methods comprehensively and surpassing GRPO-like approaches that use advantages as multiplicative factors. We also provide detailed mathematical proofs demonstrating properties such as Lipschitz continuity of the update coefficient and uniform convergence of the loss function, ensuring theoretical rigor for key components of ASKD.
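As a rough illustration of what a skewed KL loss with sample-dependent skewness might look like, the sketch below computes the standard skew divergence KL(p || alpha*p + (1-alpha)*q) for discrete distributions. The helper `alpha_from_quality` and its linear mapping are assumptions made for illustration only; the abstract does not specify how ASKD maps quality signals to skewness.

```python
import math
from typing import Sequence


def skew_kl(p: Sequence[float], q: Sequence[float], alpha: float) -> float:
    """Skew KL divergence KL(p || alpha*p + (1-alpha)*q) for discrete distributions."""
    if not 0.0 <= alpha < 1.0:
        raise ValueError("alpha must be in [0, 1)")
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0.0:
            mix = alpha * pi + (1.0 - alpha) * qi
            if mix == 0.0:
                # Only possible when alpha == 0 and q assigns zero mass here.
                return math.inf
            total += pi * math.log(pi / mix)
    return total


def alpha_from_quality(quality: float, alpha_min: float = 0.1, alpha_max: float = 0.9) -> float:
    """Hypothetical mapping (an assumption, not the paper's): higher quality gives
    lower skew, so the loss pulls harder toward that sample."""
    quality = min(max(quality, 0.0), 1.0)
    return alpha_max - quality * (alpha_max - alpha_min)


teacher = [0.9, 0.1]
student = [0.5, 0.5]
for quality in (0.0, 0.5, 1.0):
    alpha = alpha_from_quality(quality)
    print(f"quality={quality:.1f} alpha={alpha:.2f} loss={skew_kl(teacher, student, alpha):.4f}")
```

In this sketch, a larger alpha moves the mixture closer to p and shrinks the divergence, which is one way a per-sample skew could scale the learning magnitude.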


AAAI Conference 2026 Conference Paper

DensiCrafter: Physically-Constrained Generation and Fabrication of Self-Supporting Hollow Structures

Shengqi Dang
Fu Chai
Jiaxin Li
Chao Yuan
Wei Ye
Nan Cao

The rise of 3D generative models has enabled automatic 3D geometry and texture synthesis from multimodal inputs (e.g., text or images). However, these methods often ignore physical constraints and manufacturability considerations. In this work, we address the challenge of producing 3D designs that are both lightweight and self-supporting. We present DensiCrafter, a framework for generating lightweight, self-supporting 3D hollow structures by optimizing the density field. Starting from coarse voxel grids produced by Trellis, we interpret these as continuous density fields to optimize and introduce three differentiable, physically constrained, and simulation-free loss terms. Additionally, a mass regularization penalizes unnecessary material, while a restricted optimization domain preserves the outer surface. Our method seamlessly integrates with pretrained Trellis-based models (e.g., Trellis, DSO) without any architectural changes. In extensive evaluations, we achieve up to 43% reduction in material mass on the text-to-3D task. Compared to state-of-the-art baselines, our method improves stability while maintaining high geometric fidelity. Real-world 3D-printing experiments confirm that our hollow designs can be reliably fabricated and are self-supporting.


AAAI Conference 2026 Conference Paper

Edge Self-Adversarial Augmentation Enhances Graph Contrastive Learning Against Neighborhood Inconsistency

Chunchun Chen
Xing Wei
Jiayi Yang
Chenrun Wang
Yiwei Fu
Yuxing Zhang
Xin Sun
Rui Fan

Recent studies have shown that unsupervised graph contrastive learning (GCL) is vulnerable to adversarial attacks. Automatic adversarial augmentation techniques are proposed to improve both the effectiveness and robustness of GCL. Existing methods typically regard unsupervised contrastive loss as the adversarial goal, essentially aiming to maximize inter-view instance-wise discrepancies between adversarial and original views. However, such attacks overlook intra-view neighborhood inconsistency, which hinders the robustness of GCL models against local neighborhood noises, resulting in performance degradation on low-homophily graphs. To tackle this issue, we propose a novel adversarial contrastive paradigm, named Edge self-aDversarial Augmentation for Graph Contrastive Learning (EDA-GCL). We theoretically establish that the adversarial objective of the intra-view neighborhood is equivalent to maximizing the discrepancy between bidirectional edge features. Hence, we build our adversarial framework based on edge self-adversarial learning. It generates pairwise adversarial augmentations from the original view by learning distinct neighborhood connectivity structures. The learned pairwise adversarial views are utilized for GCL model training in the minimization stage. Notably, this edge-level adversarial approach reduces the computational complexity to the level of the edge number. Experiments on various graph tasks and complex noise scenarios demonstrate the superiority and robustness of our EDA-GCL.


AAAI Conference 2026 Conference Paper

Modeling Uncertainty Trends for Timely Retrieval in Dynamic RAG

Bo Li
Tian Tian
Zhenghua Xu
Hao Cheng
Shikun Zhang
Wei Ye

Dynamic retrieval-augmented generation (RAG) allows large language models (LLMs) to fetch external knowledge on demand, offering greater adaptability than static RAG. A central challenge in this setting lies in determining the optimal timing for retrieval. Existing methods often trigger retrieval based on low token-level confidence, which may lead to delayed intervention after errors have already propagated. We introduce Entropy-Trend Constraint (ETC), a training-free method that determines optimal retrieval timing by modeling the dynamics of token-level uncertainty. Specifically, ETC utilizes first- and second-order differences of the entropy sequence to detect emerging uncertainty trends, enabling earlier and more precise retrieval. Experiments on six QA benchmarks with three LLM backbones demonstrate that ETC consistently outperforms strong baselines while reducing retrieval frequency. ETC is particularly effective in domain-specific scenarios, exhibiting robust generalization capabilities. Ablation studies and qualitative analyses further confirm that trend-aware uncertainty modeling yields more effective retrieval timing. The method is plug-and-play, model-agnostic, and readily integrable into existing decoding pipelines. Implementation code is included in the supplementary materials.
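A minimal sketch of a trend-based retrieval trigger built on first- and second-order differences of a token-entropy sequence follows. The function names, the two thresholds, and the exact decision rule are assumptions chosen for illustration; the abstract does not give ETC's actual criterion.

```python
import math
from typing import Sequence


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def should_retrieve(
    entropies: Sequence[float],
    slope_threshold: float = 0.3,  # hypothetical value
    accel_threshold: float = 0.0,  # hypothetical value
) -> bool:
    """Hypothetical rule: trigger retrieval when entropy is rising quickly
    (first-order difference) and the rise is accelerating (second-order difference)."""
    if len(entropies) < 3:
        return False  # not enough history to form a second-order difference
    d1_prev = entropies[-2] - entropies[-3]
    d1_last = entropies[-1] - entropies[-2]
    d2_last = d1_last - d1_prev
    return d1_last > slope_threshold and d2_last > accel_threshold


history = [0.1, 0.2, 0.9]  # per-token entropies observed so far
print(should_retrieve(history))
```

Because the rule looks at the direction of change rather than the absolute entropy level, it can fire while confidence is still moderate but deteriorating, which matches the motivation of intervening before errors propagate.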


AAAI Conference 2026 Conference Paper

Rethinking the Sampling Criteria in Reinforcement Learning for LLM Reasoning: A Competence-Difficulty Alignment Perspective

Deyang Kong
Qi Guo
Xiangyu Xi
Wei Wang
Jingang Wang
Xunliang Cai
Shikun Zhang
Wei Ye

The low sampling efficiency during the rollout phase poses a significant challenge to scaling reinforcement learning for large language model reasoning. Existing methods attempt to improve efficiency by scheduling problems based on problem difficulties. However, these approaches suffer from unstable and biased estimations of problem difficulty and fail to capture the alignment between model competence and problem difficulty in RL training, leading to suboptimal results. To address these challenges, we introduce Competence-Difficulty Alignment Sampling (CDAS). This approach allows for accurate and stable estimation of problem difficulties by aggregating historical performance discrepancies across problems. Subsequently, model competence is quantified to adaptively select problems whose difficulties align with the model's current competence using a fixed-point system. Extensive experiments in mathematical RL training show that CDAS consistently outperforms strong baselines, achieving the highest average accuracy of 45.89%. Furthermore, CDAS reduces the training step time overhead by 57.06% compared to the widely-used Dynamic Sampling strategy, verifying the efficiency of CDAS. Additional experiments on different tasks, model architectures, and model sizes demonstrate the generalization capability of CDAS.


NeurIPS Conference 2025 Conference Paper

Boosting Resilience of Large Language Models through Causality-Driven Robust Optimization

Xiaoling Zhou
Mingjie Zhang
Zhemg Lee
Yuncheng Hua
Chengli Xing
Wei Ye
Flora Salim
Shikun Zhang

Large language models (LLMs) have achieved remarkable success across diverse applications; however, they remain plagued by spurious correlations and the generation of hallucinated content. Despite extensive efforts to enhance the resilience of LLMs, existing approaches either rely on indiscriminate fine-tuning of all parameters, resulting in parameter inefficiency and lack of specificity, or depend on post-processing techniques that offer limited adaptability and flexibility. This study introduces a novel Causality-driven Robust Optimization (CdRO) approach that selectively updates model components sensitive to causal reasoning, enhancing model causality while preserving valuable pretrained knowledge to mitigate overfitting. Our method begins by identifying the parameter components within LLMs that capture causal relationships, achieved through comparing the training dynamics of parameter matrices associated with the original samples, as well as augmented counterfactual and paraphrased variants. These comparisons are then fed into a lightweight logistic regression model, optimized in real time to dynamically identify and adapt the causal components within LLMs. The identified parameters are subsequently optimized using an enhanced policy optimization algorithm, where the reward function is designed to jointly promote both model generalization and robustness. Extensive experiments across various tasks using twelve different LLMs demonstrate the superior performance of our framework, underscoring its significant effectiveness in reducing the model’s dependence on spurious associations and mitigating hallucinations.


AAAI Conference 2025 Conference Paper

Concept Matching with Agent for Out-of-Distribution Detection

Yuxiao Lee
Xiaofeng Cao
Jingcai Guo
Wei Ye
Qing Guo
Yi Chang

The remarkable achievements of Large Language Models (LLMs) have captivated the attention of both academia and industry, transcending their initial role in dialogue generation. To expand the usage scenarios of LLMs, some works enhance the effectiveness and capabilities of the model by introducing more external information, which is called the agent paradigm. Based on this idea, we propose a new method that integrates the agent paradigm into the out-of-distribution (OOD) detection task, aiming to improve its robustness and adaptability. Our proposed method, Concept Matching with Agent (CMA), employs neutral prompts as agents to augment the CLIP-based OOD detection process. These agents function as dynamic observers and communication hubs, interacting with both in-distribution (ID) labels and data inputs to form vector triangle relationships. This triangular framework offers a more nuanced approach than the traditional binary relationship, allowing for better separation and identification of ID and OOD inputs. Our extensive experimental results showcase the superior performance of CMA over both zero-shot and training-required methods in a diverse array of real-world scenarios.


TMLR Journal 2025 Journal Article

Distributed Hierarchical Decomposition Framework for Multimodal Timeseries Prediction

Wei Ye
Prashant Khanduri
Jiangweizhi Peng
Feng Tian
Jun Gao
Jie Ding
Zhi-Li Zhang
Mingyi Hong

We consider a distributed time series forecasting problem where multiple distributed nodes each observing a local time series (of potentially different modality) collaborate to make both local and global forecasts. This problem is particularly challenging because each node only observes time series generated from a subset of sources, making it challenging to utilize correlations among different streams for accurate forecasting; and the data streams observed at each node may represent different modalities, leading to heterogeneous computational requirements among nodes. To tackle these challenges, we propose a hierarchical learning framework, consisting of multiple local models and a global model, and provide a suite of efficient training algorithms to achieve high local and global forecasting accuracy. We theoretically establish the convergence of the proposed framework and demonstrate the effectiveness of the proposed approach using several time series forecasting tasks, with the (somewhat surprising) observation that the proposed distributed models can match, or even outperform centralized ones.


IJCAI Conference 2025 Conference Paper

GETMusic: Generating Music Tracks with a Unified Representation and Diffusion Framework

Ang Lv
Xu Tan
Peiling Lu
Wei Ye
Shikun Zhang
Jiang Bian
Rui Yan

Symbolic music generation aims to create musical notes, which can help users compose music, such as generating target instrument tracks based on provided source tracks. In practical scenarios where there’s a predefined ensemble of tracks and various composition needs, an efficient and effective generative model that can generate any target tracks based on the other tracks becomes crucial. However, previous efforts have fallen short in addressing this necessity due to limitations in their music representations and models. In this paper, we introduce a framework known as GETMusic, with "GET" standing for "GEnerate music Tracks." This framework encompasses a novel music representation "GETScore" and a diffusion model "GETDiff." GETScore represents musical notes as tokens and organizes tokens in a 2D structure, with tracks stacked vertically and progressing horizontally over time. At a training step, each track of a music piece is randomly selected as either the target or source. The training involves two processes: In the forward process, target tracks are corrupted by masking their tokens, while source tracks remain as the ground truth; in the denoising process, GETDiff is trained to predict the masked target tokens conditioning on the source tracks. Our proposed representation, coupled with the non-autoregressive generative model, empowers GETMusic to generate music with any arbitrary source-target track combinations. Our experiments demonstrate that the versatile GETMusic outperforms prior works proposed for certain specific composition tasks.
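The forward (corruption) step on a 2D track-by-time token grid can be sketched as below. The token strings, the `[MASK]` symbol, the `mask_prob` parameter, and the function name are illustrative assumptions; the sketch only mirrors the idea that target tracks are masked while source tracks stay intact, not the authors' actual GETScore vocabulary or diffusion schedule.

```python
import random
from typing import List, Set

MASK = "[MASK]"  # hypothetical mask token


def corrupt_target_tracks(
    score: List[List[str]],
    target_tracks: Set[int],
    mask_prob: float,
    rng: random.Random,
) -> List[List[str]]:
    """Mask tokens in target tracks with probability mask_prob; copy source tracks unchanged.

    score[i][t] is the token of track i at time step t (tracks stacked vertically,
    time progressing horizontally).
    """
    corrupted = []
    for track_idx, track in enumerate(score):
        if track_idx in target_tracks:
            corrupted.append([MASK if rng.random() < mask_prob else tok for tok in track])
        else:
            corrupted.append(list(track))  # source track remains ground truth
    return corrupted


score = [
    ["C4", "E4", "G4", "E4"],          # track 0: melody (source)
    ["kick", "snare", "kick", "snare"],  # track 1: drums (target)
]
noisy = corrupt_target_tracks(score, target_tracks={1}, mask_prob=1.0, rng=random.Random(0))
for row in noisy:
    print(row)
```

A denoising model would then be trained to recover the masked entries of track 1 given the untouched track 0; choosing a different `target_tracks` set at each step covers arbitrary source-target combinations.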


AAAI Conference 2025 Conference Paper

NightHaze: Nighttime Image Dehazing via Self-Prior Learning

Beibei Lin
Yeying Jin
Yan Wending
Wei Ye
Yuan Yuan
Robby T. Tan

Masked autoencoder (MAE) shows that severe augmentation during training produces robust representations for high-level tasks. This paper brings the MAE-like framework to nighttime image enhancement, demonstrating that severe augmentation during training produces strong network priors that are resilient to real-world night haze degradations. We propose a novel nighttime image dehazing method with self-prior learning. Our main novelty lies in the design of severe augmentation, which allows our model to learn robust priors. Unlike MAE that uses masking, we leverage two key challenging factors of nighttime images as augmentation: light effects and noise. During training, we intentionally degrade clear images by blending them with light effects as well as by adding noise, and subsequently restore the clear images. This enables our model to learn clear background priors. By increasing the noise values to approach as high as the pixel intensity values of the glow and light effect blended images, our augmentation becomes severe, resulting in stronger priors. While our self-prior learning is considerably effective in suppressing glow and revealing details of background scenes, in some cases, there are still some undesired artifacts that remain, particularly in the forms of over-suppression. To address these artifacts, we propose a self-refinement module based on the semi-supervised teacher-student framework. Our NightHaze, especially our MAE-like self-prior learning, shows that models trained with severe augmentation effectively improve the visibility of input haze images, approaching the clarity of clear nighttime images. Extensive experiments demonstrate that our NightHaze achieves state-of-the-art performance, outperforming existing nighttime image dehazing methods by a substantial margin of 15.5% for MUSIQ and 23.5% for ClipIQA.


NeurIPS Conference 2025 Conference Paper

Preference-driven Knowledge Distillation for Few-shot Node Classification

Xing Wei
Chunchun Chen
Rui Fan
Xiaofeng Cao
Sourav Medya
Wei Ye

Graph neural networks (GNNs) can efficiently process text-attributed graphs (TAGs) due to their message-passing mechanisms, but their training heavily relies on the human-annotated labels. Moreover, the complex and diverse local topologies of nodes of real-world TAGs make it challenging for a single mechanism to handle. Large language models (LLMs) perform well in zero-/few-shot learning on TAGs but suffer from a scalability challenge. Therefore, we propose a preference-driven knowledge distillation (PKD) framework to synergize the complementary strengths of LLMs and various GNNs for few-shot node classification. Specifically, we develop a GNN-preference-driven node selector that effectively promotes prediction distillation from LLMs to teacher GNNs. To further tackle nodes' intricate local topologies, we develop a node-preference-driven GNN selector that identifies the most suitable teacher GNN for each node, thereby facilitating tailored knowledge distillation from teacher GNNs to the student GNN. Extensive experiments validate the efficacy of our proposed framework in few-shot node classification on real-world TAGs. Our code can be available at.


IJCAI Conference 2025 Conference Paper

Robustness to Spurious Correlations via Dynamic Knowledge Transfer

Xiaoling Zhou
Wei Ye
Zhemg Lee
Shikun Zhang

Spurious correlations pose a significant challenge to the robustness of statistical models, often resulting in unsatisfactory performance when distributional shifts occur between training and testing data. To address this, we propose to transfer knowledge across spuriously correlated categories within the deep feature space. Specifically, samples' deep features are enriched using semantic vectors extracted from both their respective category distributions and those of their spuriously correlated counterparts, enabling the generation of diverse class-specific factual and counterfactual augmented deep features. We then demonstrate the feasibility of optimizing a surrogate robust loss instead of conducting explicit augmentations by considering an infinite number of augmentations. As spurious correlations between samples and classes evolve during training, we develop a reinforcement learning-based training framework called Dynamic Knowledge Transfer (DKT) to facilitate dynamic adjustments in the direction and intensity of knowledge transfer. Within this framework, a target network is trained using the derived robust loss to enhance robustness, while a strategy network generates sample-wise augmentation strategies in a dynamic and automatic way. Extensive experiments validate the effectiveness of the DKT framework in mitigating spurious correlations, achieving state-of-the-art performance across three typical learning scenarios susceptible to such correlations.


NeurIPS Conference 2025 Conference Paper

SAEMark: Steering Personalized Multilingual LLM Watermarks with Sparse Autoencoders

Zhuohao Yu
Xingru Jiang
Weizheng Gu
Yidong Wang
Qingsong Wen
Shikun Zhang
Wei Ye

Watermarking LLM-generated text is critical for content attribution and misinformation prevention, yet existing methods compromise text quality and require white-box model access with logit manipulation or training, which exclude API-based models and multilingual scenarios. We propose SAEMark, an inference-time framework for multi-bit watermarking that embeds personalized information through feature-based rejection sampling, fundamentally different from logit-based or rewriting-based approaches: we do not modify model outputs directly and require only black-box access, while naturally supporting multi-bit message embedding and generalizing across diverse languages and domains. We instantiate the framework using Sparse Autoencoders as deterministic feature extractors and provide theoretical worst-case analysis relating watermark accuracy to computational budget. Experiments across 4 datasets demonstrate strong watermarking performance on English, Chinese, and code while preserving text quality. SAEMark establishes a new paradigm for scalable, quality-preserving watermarks that work seamlessly with closed-source LLMs across languages and domains.


NeurIPS Conference 2025 Conference Paper

VLM-R³: Region Recognition, Reasoning, and Refinement for Enhanced Multimodal Chain-of-Thought

Chaoya Jiang
Yongrui Heng
Wei Ye
Haiyang Xu
Ming Yan
Ji Zhang
Fei Huang
Shikun Zhang

Recently, reasoning-based MLLMs have achieved a degree of success in generating long-form textual reasoning chains. However, they still struggle with complex tasks that necessitate dynamic and iterative focusing on and revisiting of visual regions to achieve precise grounding of textual reasoning in visual evidence. We introduce VLM-R³ (Visual Language Model with Region Recognition, Reasoning, and Refinement), a framework that equips an MLLM with the ability to (i) decide when additional visual evidence is needed, (ii) determine where to ground within the image, and (iii) seamlessly weave the relevant sub-image content back into an interleaved chain-of-thought. The core of our method is Region-Conditioned Reinforcement Policy Optimization (R-GRPO), a training paradigm that rewards the model for selecting informative regions, formulating appropriate transformations (e.g., crop, zoom), and integrating the resulting visual context into subsequent reasoning steps. To bootstrap this policy, we compile a modest but carefully curated Visuo-Lingual Interleaved Rationale (VLIR) corpus that provides step-level supervision on region selection and textual justification. Extensive experiments on MathVista, ScienceQA, and other benchmarks show that VLM-R³ sets a new state of the art in zero-shot and few-shot settings, with the largest gains appearing on questions demanding subtle spatial reasoning or fine-grained visual cue extraction.


TIST Journal 2024 Journal Article

A Survey on Evaluation of Large Language Models

Yupeng Chang
Xu Wang
Jindong Wang
Yuan Wu
Linyi Yang
Kaijie Zhu
Hao Chen
Xiaoyuan Yi

Large language models (LLMs) are gaining increasing popularity in both academia and industry, owing to their unprecedented performance in various applications. As LLMs continue to play a vital role in both research and daily use, their evaluation becomes increasingly critical, not only at the task level, but also at the society level for better understanding of their potential risks. Over the past years, significant efforts have been made to examine LLMs from various perspectives. This paper presents a comprehensive review of these evaluation methods for LLMs, focusing on three key dimensions: what to evaluate, where to evaluate, and how to evaluate. Firstly, we provide an overview from the perspective of evaluation tasks, encompassing general natural language processing tasks, reasoning, medical usage, ethics, education, natural and social sciences, agent applications, and other areas. Secondly, we answer the ‘where’ and ‘how’ questions by diving into the evaluation methods and benchmarks, which serve as crucial components in assessing the performance of LLMs. Then, we summarize the success and failure cases of LLMs in different tasks. Finally, we shed light on several future challenges that lie ahead in LLMs evaluation. Our aim is to offer invaluable insights to researchers in the realm of LLMs evaluation, thereby aiding the development of more proficient LLMs. Our key point is that evaluation should be treated as an essential discipline to better assist the development of LLMs. We consistently maintain the related open-source materials at: https://github.com/MLGroupJLU/LLM-eval-survey


NeurIPS Conference 2024 Conference Paper

AutoSurvey: Large Language Models Can Automatically Write Surveys

Yidong Wang
Qi Guo
Wenjin Yao
Hongbo Zhang
Xin Zhang
Zhen Wu
Meishan Zhang
Xinyu Dai

This paper introduces AutoSurvey, a speedy and well-organized methodology for automating the creation of comprehensive literature surveys in rapidly evolving fields like artificial intelligence. Traditional survey paper creation faces challenges due to the vast volume and complexity of information, prompting the need for efficient survey methods. While large language models (LLMs) offer promise in automating this process, challenges such as context window limitations, parametric knowledge constraints, and the lack of evaluation benchmarks remain. AutoSurvey addresses these challenges through a systematic approach that involves initial retrieval and outline generation, subsection drafting by specialized LLMs, integration and refinement, and rigorous evaluation and iteration. Our contributions include a comprehensive solution to the survey problem, a reliable evaluation method, and experimental validation demonstrating AutoSurvey's effectiveness.


IJCAI Conference 2024 Conference Paper

Boosting Model Resilience via Implicit Adversarial Data Augmentation

Xiaoling Zhou
Wei Ye
Zhemg Lee
Rui Xie
Shikun Zhang

Data augmentation plays a pivotal role in enhancing and diversifying training data. Nonetheless, consistently improving model performance in varied learning scenarios, especially those with inherent data biases, remains challenging. To address this, we propose to augment the deep features of samples by incorporating their adversarial and anti-adversarial perturbation distributions, enabling adaptive adjustment in the learning difficulty tailored to each sample’s specific characteristics. We then theoretically reveal that our augmentation process approximates the optimization of a surrogate loss function as the number of augmented copies increases indefinitely. This insight leads us to develop a meta-learning-based framework for optimizing classifiers with this novel loss, introducing the effects of augmentation while bypassing the explicit augmentation process. We conduct extensive experiments across four common biased learning scenarios: long-tail learning, generalized long-tail learning, noisy label learning, and subpopulation shift learning. The empirical results demonstrate that our method consistently achieves state-of-the-art performance, highlighting its broad adaptability.


AAAI Conference 2024 Conference Paper

COMBHelper: A Neural Approach to Reduce Search Space for Graph Combinatorial Problems

Hao Tian
Sourav Medya
Wei Ye

Combinatorial Optimization (CO) problems over graphs appear routinely in many applications such as in optimizing traffic, viral marketing in social networks, and matching for job allocation. Due to their combinatorial nature, these problems are often NP-hard. Existing approximation algorithms and heuristics rely on the search space to find the solutions and become time-consuming when this space is large. In this paper, we design a neural method called COMBHelper to reduce this space and thus improve the efficiency of the traditional CO algorithms based on node selection. Specifically, it employs a Graph Neural Network (GNN) to identify promising nodes for the solution set. This pruned search space is then fed to the traditional CO algorithms. COMBHelper also uses a Knowledge Distillation (KD) module and a problem-specific boosting module to bring further efficiency and efficacy. Our extensive experiments show that the traditional CO algorithms with COMBHelper are at least 2 times faster than their original versions.


IJCAI Conference 2024 Conference Paper

Deep Hierarchical Graph Alignment Kernels

Shuhao Tang
Hao Tian
Xiaofeng Cao
Wei Ye

Typical R-convolution graph kernels invoke the kernel functions that decompose graphs into non-isomorphic substructures and compare them. However, overlooking implicit similarities and topological position information between those substructures limits their performances. In this paper, we introduce Deep Hierarchical Graph Alignment Kernels (DHGAK) to resolve this problem. Specifically, the relational substructures are hierarchically aligned to cluster distributions in their deep embedding space. The substructures belonging to the same cluster are assigned the same feature map in the Reproducing Kernel Hilbert Space (RKHS), where graph feature maps are derived by kernel mean embedding. Theoretical analysis guarantees that DHGAK is positive semi-definite and has linear separability in the RKHS. Comparison with state-of-the-art graph kernels on various benchmark datasets demonstrates the effectiveness and efficiency of DHGAK. The code is available at GitHub (https://github.com/EWesternRa/DHGAK).


AAAI Conference 2024 Conference Paper

DeS3: Adaptive Attention-Driven Self and Soft Shadow Removal Using ViT Similarity

Yeying Jin
Wei Ye
Wenhan Yang
Yuan Yuan
Robby T. Tan

Removing soft and self shadows that lack clear boundaries from a single image is still challenging. Self shadows are shadows that are cast on the object itself. Most existing methods rely on binary shadow masks, without considering the ambiguous boundaries of soft and self shadows. In this paper, we present DeS3, a method that removes hard, soft and self shadows based on adaptive attention and ViT similarity. Our novel ViT similarity loss utilizes features extracted from a pre-trained Vision Transformer. This loss helps guide the reverse sampling towards recovering scene structures. Our adaptive attention is able to differentiate shadow regions from the underlying objects, as well as shadow regions from the object casting the shadow. This capability enables DeS3 to better recover the structures of objects even when they are partially occluded by shadows. Different from existing methods that rely on constraints during the training phase, we incorporate the ViT similarity during the sampling stage. Our method outperforms state-of-the-art methods on the SRD, AISTD, LRSS, USR and UIUC datasets, removing hard, soft, and self shadows robustly. Specifically, our method outperforms the SOTA method by 16% of the RMSE of the whole image on the LRSS dataset.


NeurIPS Conference 2024 Conference Paper

Geometry Awakening: Cross-Geometry Learning Exhibits Superiority over Individual Structures

Yadong Sun
Xiaofeng Cao
Yu Wang
Wei Ye
Jingcai Guo
Qing Guo

Recent research has underscored the efficacy of Graph Neural Networks (GNNs) in modeling diverse geometric structures within graph data. However, real-world graphs typically exhibit geometrically heterogeneous characteristics, rendering the confinement to a single geometric paradigm insufficient for capturing their intricate structural complexities. To address this limitation, we examine the performance of GNNs across various geometries through the lens of knowledge distillation (KD) and introduce a novel cross-geometric framework. This framework encodes graphs by integrating both Euclidean and hyperbolic geometries in a space-mixing fashion. Our approach employs multiple teacher models, each generating hint embeddings that encapsulate distinct geometric properties. We then implement a structure-wise knowledge transfer module that optimally leverages these embeddings within their respective geometric contexts, thereby enhancing the training efficacy of the student model. Additionally, our framework incorporates a geometric optimization network designed to bridge the distributional disparities among these embeddings. Experimental results demonstrate that our model-agnostic framework more effectively captures topological graph knowledge, resulting in superior performance of the student models when compared to traditional KD methodologies.

PDF Details DOI

AAAI Conference 2024 Conference Paper

Labels Need Prompts Too: Mask Matching for Natural Language Understanding Tasks

Bo Li
Wei Ye
Quansen Wang
Wen Zhao
Shikun Zhang

Textual label names (descriptions) are typically semantically rich in many natural language understanding (NLU) tasks. In this paper, we incorporate the prompting methodology, which is widely used to enrich model input, into the label side for the first time. Specifically, we propose a Mask Matching method, which equips an input with a prompt and its label with another, and then makes predictions by matching their mask representations. We evaluate our method extensively on 8 NLU tasks with 14 datasets. The experimental results show that Mask Matching significantly outperforms its counterparts of fine-tuning and conventional prompt-tuning, setting up state-of-the-art performances in several datasets. Mask Matching is particularly good at handling NLU tasks with large label counts and informative label names. As pioneering efforts that investigate the label-side prompt, we also discuss open issues for future study.
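
As a rough illustration of the matching step this abstract describes, the sketch below scores an input's mask representation against each label's mask representation and returns the best match. The toy vectors, the dot-product score, and the names `dot` and `match_label` are assumptions made for illustration; the abstract does not state which encoder or similarity function the method uses.

```python
from typing import Dict, List


def dot(u: List[float], v: List[float]) -> float:
    """Dot product of two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return sum(a * b for a, b in zip(u, v))


def match_label(input_mask: List[float], label_masks: Dict[str, List[float]]) -> str:
    """Return the label whose mask representation scores highest against the input's.

    Dot-product scoring is an assumption; the abstract only says predictions
    are made by matching mask representations.
    """
    return max(label_masks, key=lambda name: dot(input_mask, label_masks[name]))


# Hypothetical toy vectors standing in for encoder outputs at the mask position.
label_masks = {
    "positive": [0.9, 0.1, 0.0],
    "negative": [0.0, 0.2, 0.9],
}
prediction = match_label([0.8, 0.3, 0.1], label_masks)
print(prediction)
```

In this toy setup the label side is encoded once and reused for every input, which is one plausible reason a label-side prompt scales to large label sets.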

PDF Details DOI

NeurIPS Conference 2024 Conference Paper

MaVEn: An Effective Multi-granularity Hybrid Visual Encoding Framework for Multimodal Large Language Model

Chaoya Jiang
Hongrui Jia
Haiyang Xu
Wei Ye
Mengfan Dong
Ming Yan
Ji Zhang
Fei Huang

This paper presents MaVEn, an innovative Multi-granularity Visual Encoding framework designed to enhance the capabilities of Multimodal Large Language Models (MLLMs) in multi-image reasoning. Current MLLMs primarily focus on single-image visual understanding, limiting their ability to interpret and integrate information across multiple images. MaVEn addresses this limitation by combining discrete visual symbol sequences, which abstract coarse-grained semantic concepts, with traditional continuous representation sequences that model fine-grained features. This dual approach bridges the semantic gap between visual and textual data, thereby improving the model's ability to process and interpret information from multiple images effectively. Additionally, we design a dynamic reduction mechanism for long-sequence continuous features to enhance multi-image processing efficiency. Experimental results demonstrate that MaVEn significantly enhances MLLMs' understanding in complex multi-image scenarios, while also improving performance in single-image contexts.

PDF Details DOI

AAAI Conference 2024 Conference Paper

NightRain: Nighttime Video Deraining via Adaptive-Rain-Removal and Adaptive-Correction

Beibei Lin
Yeying Jin
Wending Yan
Wei Ye
Yuan Yuan
Shunli Zhang
Robby T. Tan

Existing deep-learning-based methods for nighttime video deraining rely on synthetic data due to the absence of real-world paired data. However, the intricacies of the real world, particularly with the presence of light effects and low-light regions affected by noise, create significant domain gaps, hampering synthetic-trained models in removing rain streaks properly and leading to over-saturation and color shifts. Motivated by this, we introduce NightRain, a novel nighttime video deraining method with adaptive-rain-removal and adaptive-correction. Our adaptive-rain-removal uses unlabeled rain videos to enable our model to derain real-world rain videos, particularly in regions affected by complex light effects. The idea is to allow our model to obtain rain-free regions based on the confidence scores. Once rain-free regions and the corresponding regions from our input are obtained, we can have region-based paired real data. These paired data are used to train our model using a teacher-student framework, allowing the model to iteratively learn from less challenging regions to more challenging regions. Our adaptive-correction aims to rectify errors in our model's predictions, such as over-saturation and color shifts. The idea is to learn from clear night input training videos based on the differences or distance between those input videos and their corresponding predictions. Our model learns from these differences, compelling our model to correct the errors. From extensive experiments, our method demonstrates state-of-the-art performance. It achieves a PSNR of 26.73dB, surpassing existing nighttime video deraining methods by a substantial margin of 13.7%.

PDF Details DOI

AAAI Conference 2024 Conference Paper

PICNN: A Pathway towards Interpretable Convolutional Neural Networks

Wengang Guo
Jiayi Yang
Huilin Yin
Qijun Chen
Wei Ye

Convolutional Neural Networks (CNNs) have exhibited great performance in discriminative feature learning for complex visual tasks. Besides discrimination power, interpretability is another important yet under-explored property for CNNs. One difficulty in the CNN interpretability is that filters and image classes are entangled. In this paper, we introduce a novel pathway to alleviate the entanglement between filters and image classes. The proposed pathway groups the filters in a late conv-layer of CNN into class-specific clusters. Clusters and classes are in a one-to-one relationship. Specifically, we use the Bernoulli sampling to generate the filter-cluster assignment matrix from a learnable filter-class correspondence matrix. To enable end-to-end optimization, we develop a novel reparameterization trick for handling the non-differentiable Bernoulli sampling. We evaluate the effectiveness of our method on ten widely used network architectures (including nine CNNs and a ViT) and five benchmark datasets. Experimental results have demonstrated that our method PICNN (the combination of standard CNNs with our proposed pathway) exhibits greater interpretability than standard CNNs while achieving higher or comparable discrimination power.

PDF Details DOI

NeurIPS Conference 2024 Conference Paper

SG-Bench: Evaluating LLM Safety Generalization Across Diverse Tasks and Prompt Types

Yutao Mou
Shikun Zhang
Wei Ye

Ensuring the safety of large language model (LLM) applications is essential for developing trustworthy artificial intelligence. Current LLM safety benchmarks have two limitations. First, they focus solely on either discriminative or generative evaluation paradigms while ignoring their interconnection. Second, they rely on standardized inputs, overlooking the effects of widespread prompting techniques, such as system prompts, few-shot demonstrations, and chain-of-thought prompting. To overcome these issues, we developed SG-Bench, a novel benchmark to assess the generalization of LLM safety across various tasks and prompt types. This benchmark integrates both generative and discriminative evaluation tasks and includes extended data to examine the impact of prompt engineering and jailbreak on LLM safety. Our assessment of 3 advanced proprietary LLMs and 10 open-source LLMs with the benchmark reveals that most LLMs perform worse on discriminative tasks than generative ones, and are highly susceptible to prompts, indicating poor generalization in safety alignment. We also explain these findings quantitatively and qualitatively to provide insights for future research.

PDF Details DOI

EAAI Journal 2024 Journal Article

Three-partition coevolutionary differential evolution algorithm for mixed-variable optimization problems

Guojun Gan
Hengzhou Ye
Minggang Dong
Wei Ye
Yan Wang

Details DOI

AAAI Conference 2024 Conference Paper

TiMix: Text-Aware Image Mixing for Effective Vision-Language Pre-training

Chaoya Jiang
Wei Ye
Haiyang Xu
Qinghao Ye
Ming Yan
Ji Zhang
Shikun Zhang

Self-supervised Multi-modal Contrastive Learning (SMCL) remarkably advances modern Vision-Language Pre-training (VLP) models by aligning visual and linguistic modalities. Due to noises in web-harvested text-image pairs, however, scaling up training data volume in SMCL presents considerable obstacles in terms of computational cost and data inefficiency. To improve data efficiency in VLP, we propose Text-aware Image Mixing (TiMix), which integrates mix-based data augmentation techniques into SMCL, yielding significant performance improvements without significantly increasing computational overhead. We provide a theoretical analysis of TiMix from a mutual information (MI) perspective, showing that mixed data samples for cross-modal contrastive learning implicitly serve as a regularizer for the contrastive loss. The experimental results demonstrate that TiMix exhibits a comparable performance on downstream tasks, even with a reduced amount of training data and shorter training time, when benchmarked against existing methods. This work empirically and theoretically demonstrates the potential of data mixing for data-efficient and computationally viable VLP, benefiting broader VLP model adoption in practical scenarios. Our code is available on https://github.com/chaoyajiang/TiMiX/tree/main.

PDF Details DOI

AAAI Conference 2023 Conference Paper

Reviewing Labels: Label Graph Network with Top-k Prediction Set for Relation Extraction

Bo Li
Wei Ye
Jinglei Zhang
Shikun Zhang

The typical way for relation extraction is fine-tuning large pre-trained language models on task-specific datasets, then selecting the label with the highest probability of the output distribution as the final prediction. However, the usage of the Top-k prediction set for a given sample is commonly overlooked. In this paper, we first reveal that the Top-k prediction set of a given sample contains useful information for predicting the correct label. To effectively utilize the Top-k prediction set, we propose Label Graph Network with Top-k Prediction Set, termed KLG. Specifically, for a given sample, we build a label graph to review candidate labels in the Top-k prediction set and learn the connections between them. We also design a dynamic k selection mechanism to learn more powerful and discriminative relation representation. Our experiments show that KLG achieves the best performances on three relation extraction datasets. Moreover, we observe that KLG is more effective in dealing with long-tailed classes.
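
To make the notion of a Top-k prediction set concrete, here is a minimal sketch that ranks labels by predicted probability and keeps the first k. The function name `top_k_prediction_set`, the relation names, and the probabilities are hypothetical; the label graph and the dynamic k selection mentioned in the abstract are not modeled here.

```python
from typing import Dict, List


def top_k_prediction_set(probs: Dict[str, float], k: int) -> List[str]:
    """Return the k labels with the highest predicted probability, best first."""
    if k <= 0:
        raise ValueError("k must be positive")
    ranked = sorted(probs, key=lambda label: probs[label], reverse=True)
    return ranked[:k]


# Hypothetical output distribution of a relation classifier for one sample.
probs = {
    "founded_by": 0.46,
    "employee_of": 0.31,
    "born_in": 0.15,
    "no_relation": 0.08,
}
top1 = top_k_prediction_set(probs, 1)  # the usual argmax prediction
top3 = top_k_prediction_set(probs, 3)  # candidate labels a later stage could review
print(top1, top3)
```

With k = 1 this reduces to the standard highest-probability prediction, so the Top-k set can be read as a strict generalization of it.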

PDF Details DOI

AAAI Conference 2023 Conference Paper

Sequence Generation with Label Augmentation for Relation Extraction

Bo Li
Dingyao Yu
Wei Ye
Jinglei Zhang
Shikun Zhang

Sequence generation demonstrates promising performance in recent information extraction efforts, by incorporating large-scale pre-trained Seq2Seq models. This paper investigates the merits of employing sequence generation in relation extraction, finding that with relation names or synonyms as generation targets, their textual semantics and the correlation (in terms of word sequence pattern) among them affect model performance. We then propose Relation Extraction with Label Augmentation (RELA), a Seq2Seq model with automatic label augmentation for RE. By label augmentation, we mean producing semantic synonyms for each relation name as the generation target. Besides, we present an in-depth analysis of the Seq2Seq model's behavior when dealing with RE. Experimental results show that RELA achieves competitive results compared with previous methods on four RE datasets.

PDF Details DOI

AAAI Conference 2022 Conference Paper

Frequency-Aware Contrastive Learning for Neural Machine Translation

Tong Zhang
Wei Ye
Baosong Yang
Long Zhang
Xingzhang Ren
Dayiheng Liu
Jinan Sun
Shikun Zhang

Low-frequency word prediction remains a challenge in modern neural machine translation (NMT) systems. Recent adaptive training methods promote the output of infrequent words by emphasizing their weights in the overall training objectives. Despite the improved recall of low-frequency words, their prediction precision is unexpectedly hindered by the adaptive objectives. Inspired by the observation that low-frequency words form a more compact embedding space, we tackle this challenge from a representation learning perspective. Specifically, we propose a frequency-aware token-level contrastive learning method, in which the hidden state of each decoding step is pushed away from the counterparts of other target words, in a soft contrastive way based on the corresponding word frequencies. We conduct experiments on widely used NIST Chinese-English and WMT14 English-German translation tasks. Empirical results show that our proposed methods can not only significantly improve the translation quality but also enhance lexical diversity and optimize word representation space. Further investigation reveals that, compared with related adaptive training strategies, the superiority of our method on low-frequency word prediction lies in the robustness of token-level recall across different frequencies without sacrificing precision.

PDF Details

NeurIPS Conference 2022 Conference Paper

Museformer: Transformer with Fine- and Coarse-Grained Attention for Music Generation

Botao Yu
Peiling Lu
Rui Wang
Wei Hu
Xu Tan
Wei Ye
Shikun Zhang
Tao Qin

Symbolic music generation aims to generate music scores automatically. A recent trend is to use Transformer or its variants in music generation, which is, however, suboptimal, because the full attention cannot efficiently model the typically long music sequences (e.g., over 10,000 tokens), and the existing models have shortcomings in generating musical repetition structures. In this paper, we propose Museformer, a Transformer with a novel fine- and coarse-grained attention for music generation. Specifically, with the fine-grained attention, a token of a specific bar directly attends to all the tokens of the bars that are most relevant to music structures (e.g., the previous 1st, 2nd, 4th and 8th bars, selected via similarity statistics); with the coarse-grained attention, a token only attends to the summarization of the other bars rather than each token of them so as to reduce the computational cost. The advantages are two-fold. First, it can capture both music structure-related correlations via the fine-grained attention, and other contextual information via the coarse-grained attention. Second, it is efficient and can model over 3X longer music sequences compared to its full-attention counterpart. Both objective and subjective experimental results demonstrate its ability to generate long music sequences with high quality and better structures.
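
The split between fine- and coarse-grained attention described in this abstract can be pictured at the bar level with a small sketch. The offsets 1, 2, 4 and 8 come from the abstract; everything else is an assumption made for illustration: the function name `bar_attention_plan`, treating the current bar as fine-grained, and masking future bars as in causal generation. It only labels bar pairs and does not implement the attention computation or the bar summarization itself.

```python
from typing import List

# Bar offsets quoted in the abstract as structure-related (previous 1st, 2nd, 4th, 8th bars).
FINE_OFFSETS = (1, 2, 4, 8)


def bar_attention_plan(num_bars: int) -> List[List[str]]:
    """Label how tokens in bar i attend to bar j.

    plan[i][j] is "fine" (attend to every token of bar j), "coarse" (attend
    only to a summary of bar j), or "none" (bar j lies in the future).
    """
    plan = []
    for i in range(num_bars):
        row = []
        for j in range(num_bars):
            if j > i:
                row.append("none")    # assumption: causal generation, no look-ahead
            elif j == i or (i - j) in FINE_OFFSETS:
                row.append("fine")    # assumption: the current bar is also fine-grained
            else:
                row.append("coarse")  # all other earlier bars are seen via a summary
        plan.append(row)
    return plan


plan = bar_attention_plan(10)
print(plan[9])  # how the last of ten bars attends to each bar
```

Under these assumptions each bar has at most five fine-grained targets regardless of sequence length, which is the intuition behind the reduced cost relative to full attention.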

PDF Details

NeurIPS Conference 2022 Conference Paper

USB: A Unified Semi-supervised Learning Benchmark for Classification

Yidong Wang
Hao Chen
Yue Fan
Wang Sun
Ran Tao
Wenxin Hou
Renjie Wang
Linyi Yang

Semi-supervised learning (SSL) improves model generalization by leveraging massive unlabeled data to augment limited labeled samples. However, currently, popular SSL evaluation protocols are often constrained to computer vision (CV) tasks. In addition, previous work typically trains deep neural networks from scratch, which is time-consuming and environmentally unfriendly. To address the above issues, we construct a Unified SSL Benchmark (USB) for classification by selecting 15 diverse, challenging, and comprehensive tasks from CV, natural language processing (NLP), and audio processing (Audio), on which we systematically evaluate the dominant SSL methods, and also open-source a modular and extensible codebase for fair evaluation of these SSL methods. We further provide the pre-trained versions of the state-of-the-art neural models for CV tasks to make the cost affordable for further tuning. USB enables the evaluation of a single SSL algorithm on more tasks from multiple domains but with less cost. Specifically, on a single NVIDIA V100, only 39 GPU days are required to evaluate FixMatch on 15 tasks in USB while 335 GPU days (279 GPU days on 4 CV datasets except for ImageNet) are needed on 5 CV tasks with TorchSSL.
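
The cost figures quoted at the end of this abstract can be compared with a short calculation. The GPU-day values are taken from the abstract; the variable names, the helper `cost_ratio`, and the reading that the difference between 335 and 279 GPU days is the ImageNet share are assumptions for illustration.

```python
# GPU-day figures quoted in the abstract (single NVIDIA V100, FixMatch).
USB_GPU_DAYS = 39                     # 15 tasks in USB
TORCHSSL_GPU_DAYS = 335               # 5 CV tasks with TorchSSL
TORCHSSL_GPU_DAYS_NO_IMAGENET = 279   # 4 CV datasets, ImageNet excluded


def cost_ratio(baseline_days: float, new_days: float) -> float:
    """How many times more GPU days the baseline needs than the new setup."""
    return baseline_days / new_days


ratio_all = cost_ratio(TORCHSSL_GPU_DAYS, USB_GPU_DAYS)
ratio_no_imagenet = cost_ratio(TORCHSSL_GPU_DAYS_NO_IMAGENET, USB_GPU_DAYS)
# Assumed reading: the remainder is the cost attributable to ImageNet.
imagenet_days = TORCHSSL_GPU_DAYS - TORCHSSL_GPU_DAYS_NO_IMAGENET

print(round(ratio_all, 2), round(ratio_no_imagenet, 2), imagenet_days)
```

On these figures the TorchSSL evaluation costs roughly 8.6 times as many GPU days as the USB evaluation, while covering a third as many tasks.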

PDF Details

AAAI Conference 2021 Conference Paper

Multi-view Inference for Relation Extraction with Uncertain Knowledge

Bo Li
Wei Ye
Canming Huang
Shikun Zhang

Knowledge graphs (KGs) are widely used to facilitate relation extraction (RE) tasks. While most previous RE methods focus on leveraging deterministic KGs, uncertain KGs, which assign a confidence score for each relation instance, can provide prior probability distributions of relational facts as valuable external knowledge for RE models. This paper proposes to exploit uncertain knowledge to improve relation extraction. Specifically, we introduce ProBase, an uncertain KG that indicates to what extent a target entity belongs to a concept, into our RE architecture. We then design a novel multi-view inference framework to systematically integrate local context and global knowledge across three views: mention-, entity- and concept-view. The experiment results show that our model achieves competitive performances on both sentence- and document-level relation extraction, which verifies the effectiveness of introducing uncertain knowledge and the multi-view inference framework that we design.

PDF Details

AAAI Conference 2021 Conference Paper

SongMASS: Automatic Song Writing with Pre-training and Alignment Constraint

Zhonghao Sheng
Kaitao Song
Xu Tan
Yi Ren
Wei Ye
Shikun Zhang
Tao Qin

Automatic song writing aims to compose a song (lyric and/or melody) by machine, which is an interesting topic in both academia and industry. In automatic song writing, lyric-to-melody generation and melody-to-lyric generation are two important tasks, both of which usually suffer from the following challenges: 1) the paired lyric and melody data are limited, which affects the generation quality of the two tasks, considering a lot of paired training data are needed due to the weak correlation between lyric and melody; 2) strict alignments are required between lyric and melody, which rely on specific alignment modeling. In this paper, we propose SongMASS to address the above challenges, which leverages masked sequence to sequence (MASS) pre-training and attention-based alignment modeling for lyric-to-melody and melody-to-lyric generation. Specifically, 1) we extend the original sentence-level MASS pre-training to song level to better capture long contextual information in music, and use a separate encoder and decoder for each modality (lyric or melody); 2) we leverage sentence-level attention mask and token-level attention constraint during training to enhance the alignment between lyric and melody. During inference, we use a dynamic programming strategy to obtain the alignment between each word/syllable in lyric and note in melody. We pre-train SongMASS on unpaired lyric and melody datasets, and both objective and subjective evaluations demonstrate that SongMASS generates lyric and melody with significantly better quality than the baseline method.

PDF Details