Arrow Research search

Author name cluster

Wei Lin

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

36 papers
2 author rows

Possible papers

AAAI Conference 2026 Conference Paper

Promoting Efficient Reasoning with Verifiable Stepwise Reward

  • Chuhuai Yue
  • Chengqi Dong
  • Yinan Gao
  • Hang He
  • Jiajun Chai
  • Wei Lin
  • Guojun Yin

Large reasoning models (LRMs) have recently achieved significant progress in complex reasoning tasks, aided by reinforcement learning with verifiable rewards. However, LRMs often suffer from overthinking, expending excessive computation on simple problems and reducing efficiency. Existing efficient reasoning methods typically require accurate task assessment to preset token budgets or select reasoning modes, which limits their flexibility and reliability. In this work, we revisit the essence of overthinking and identify that encouraging effective steps while penalizing ineffective ones is key to its solution. To this end, we propose a novel rule-based verifiable stepwise reward mechanism (VSRM), which assigns rewards based on the performance of intermediate states in the reasoning trajectory. This approach is intuitive and naturally fits the step-by-step nature of reasoning tasks. We conduct extensive experiments on standard mathematical reasoning benchmarks, including AIME24 and AIME25, by integrating VSRM with PPO and Reinforce++. Results show that our method achieves substantial output length reduction while maintaining original reasoning performance, striking an optimal balance between efficiency and accuracy. Further analysis of overthinking frequency and pass@k score before and after training demonstrates that our approach indeed effectively suppresses ineffective steps and encourages effective reasoning, fundamentally alleviating the overthinking problem.
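
The abstract reports pass@k before and after training but does not say how it is computed. A minimal sketch, assuming the widely used unbiased estimator 1 - C(n-c, k) / C(n, k) for n samples of which c are correct; the function name `pass_at_k` is illustrative and is not taken from the paper:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the probability that at least one of k samples is correct,
    given n generated samples of which c passed verification."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k incorrect samples exist,
    # so the estimate is exactly 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical example: 10 samples per problem, 3 of them correct.
score_at_1 = pass_at_k(10, 3, 1)
score_at_5 = pass_at_k(10, 3, 5)
```

Comparing this score before and after training at several values of k is one way to check that shorter outputs have not come at the cost of solution coverage.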

JBHI Journal 2026 Journal Article

Value Decomposition-Based Multi-Agent Learning for Anesthetics Collaborative Control

  • Huijie Li
  • Yide Yu
  • Si Shi
  • Anmin Hu
  • Jian Huo
  • Wei Lin
  • Chaoran Wu
  • Wuman Luo

Automated control of personalized multiple anesthetics in clinical Total Intravenous Anesthesia (TIVA) is crucial yet challenging. Current systems, including target-controlled infusion (TCI) and closed-loop systems, either rely on relatively static pharmacokinetic/pharmacodynamic (PK/PD) models or focus on single anesthetic control, and therefore limit both personalization and collaborative control. To address these issues, we propose a novel Value Decomposition Multi-Agent Deep Reinforcement Learning (VD-MADRL) framework based on Markov Game (MG) for Personalized Multiple Anesthetics Control in a Closed-Loop system (PMAC-CL). VD-MADRL optimizes the collaboration between two anesthetics, propofol (Agent I) and remifentanil (Agent II), by leveraging an MG to identify optimal actions among heterogeneous agents. We employ various value function decomposition methods to resolve the credit allocation problem and enhance collaborative control. We also introduce a multivariate environment model based on random forest (RF) for anesthesia state simulation. To ensure data validity, we design a data resampling and alignment technique to synchronize trajectory data from different devices, avoiding gradient explosion and maintaining conformity to the Markov property. Extensive experiments on general and thoracic surgery datasets demonstrate that VD-MADRL provides more refined dose adjustments and maintains multiple anesthesia state indicators more stably at target levels compared to human experience. In particular, the best-performing algorithm, VDN in general surgery with online training, achieved a 16.4% increase in cumulative reward (CR) and a 58.0% reduction in mean MDPE compared to human experience. This demonstrates its great clinical value.
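
The abstract names VDN as the best-performing decomposition method. A minimal sketch of the additive idea behind VDN, in which the joint action value is the sum of per-agent values; the Q-values and the three-way action set below are hypothetical and do not come from the paper:

```python
from itertools import product

def vdn_joint_q(per_agent_q, joint_action):
    """VDN-style decomposition: joint value = sum of per-agent values."""
    return sum(q[a] for q, a in zip(per_agent_q, joint_action))

def greedy_joint_action(per_agent_q):
    """With an additive joint value, each agent can maximise its own Q."""
    return tuple(max(range(len(q)), key=q.__getitem__) for q in per_agent_q)

# Hypothetical Q-values over three dose adjustments: decrease, hold, increase.
q_propofol = [0.2, 0.9, 0.4]      # Agent I
q_remifentanil = [0.7, 0.1, 0.3]  # Agent II
per_agent_q = [q_propofol, q_remifentanil]

best = greedy_joint_action(per_agent_q)

# Exhaustive search over all joint actions, for comparison only.
brute_force = max(product(range(3), repeat=2),
                  key=lambda a: vdn_joint_q(per_agent_q, a))
```

Because the sum separates across agents, the per-agent greedy choice matches the exhaustive search over joint actions, which is what lets each agent act on its own value estimate at execution time.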

JBHI Journal 2026 Journal Article

ViG3D-UNet: Volumetric Vascular Connectivity-Aware Segmentation via 3D Vision Graph Representation

  • Bowen Liu
  • Chunlei Meng
  • Wei Lin
  • Hongda Zhang
  • Ziqing Zhou
  • Zhongxue Gan
  • Chun Ouyang

Accurate vascular segmentation is essential for coronary visualization and the diagnosis of coronary heart disease. This task involves the extraction of sparse tree-like vascular branches from volumetric space. However, existing methods have faced significant challenges due to discontinuous vascular segmentation and missing endpoints. To address this issue, a 3D vision graph neural network framework, named ViG3D-UNet, was introduced. This method integrates 3D graph representation and aggregation within a U-shaped architecture to facilitate continuous vascular segmentation. The ViG3D module captures volumetric vascular connectivity and topology, while the convolutional module extracts fine vascular details. These two branches are combined through channel attention to form the encoder feature. Subsequently, a paperclip-shaped offset decoder minimizes redundant computations in the sparse feature space and restores the feature map size to match the original input dimensions. To evaluate the effectiveness of the proposed approach for continuous vascular segmentation, evaluations were performed on two public datasets, ASOCA and ImageCAS. The segmentation results show that the ViG3D-UNet surpassed competing methods in maintaining vascular segmentation connectivity while achieving high segmentation accuracy.

AAAI Conference 2025 Conference Paper

Adversarial-Inspired Backdoor Defense via Bridging Backdoor and Adversarial Attacks

  • Jia-Li Yin
  • Weijian Wang
  • Lyhwa
  • Wei Lin
  • Ximeng Liu

Backdoor attacks and adversarial attacks are two major security threats to deep neural networks (DNNs). The former is a training-time data poisoning attack that aims to implant backdoor triggers into models by injecting trigger patterns into training samples, and the latter is a testing-time attack that tries to generate adversarial examples (AEs) from benign images to mislead a well-trained model. While previous works generally treat these two attacks separately, the inherent connection between them is rarely explored. In this paper, we focus on bridging backdoor and adversarial attacks and observe two intriguing phenomena when applying adversarial attacks on an infected model implanted with backdoors: 1) a sample is harder to turn into an AE when the trigger is present; 2) the AEs generated from backdoor samples are highly likely to be predicted as their true labels. Inspired by these observations, we propose a novel backdoor defense method, dubbed Adversarial-Inspired Backdoor Defense (AIBD), to isolate the backdoor samples by leveraging a progressive top-q scheme and break the correlation between backdoor samples and their target labels using adversarial labels. Through extensive experiments on various datasets against six state-of-the-art backdoor attacks, models trained with AIBD on poisoned data demonstrate superior performance over the existing defense methods.

NeurIPS Conference 2025 Conference Paper

Bi-Level Decision-Focused Causal Learning for Large-Scale Marketing Optimization: Bridging Observational and Experimental Data

  • Shuli Zhang
  • Hao Zhou
  • Jiaqi Zheng
  • Guibin Jiang
  • Cheng Bing
  • Wei Lin
  • Guihai Chen

Online Internet platforms require sophisticated marketing strategies to optimize user retention and platform revenue, a classical resource allocation problem. Traditional solutions adopt a two-stage pipeline: machine learning (ML) for predicting individual treatment effects of marketing actions, followed by operations research (OR) optimization for decision-making. This paradigm presents two fundamental technical challenges. First, the prediction-decision misalignment: Conventional ML methods focus solely on prediction accuracy without considering downstream optimization objectives, leading to improved predictive metrics that fail to translate to better decisions. Second, the bias-variance dilemma: Observational data suffers from multiple biases (e.g., selection bias, position bias), while experimental data (e.g., randomized controlled trials), though unbiased, is typically scarce and costly, resulting in high-variance estimates. We propose Bi-level Decision-Focused Causal Learning (Bi-DFCL) that systematically addresses these challenges. First, we develop an unbiased estimator of OR decision quality using experimental data, which guides ML model training through surrogate loss functions that bridge discrete optimization gradients. Second, we establish a bi-level optimization framework that jointly leverages observational and experimental data, solved via implicit differentiation. This novel formulation enables our unbiased OR estimator to correct learning directions from biased observational data, achieving optimal bias-variance tradeoff. Extensive evaluations on public benchmarks, industrial marketing datasets, and large-scale online A/B tests demonstrate the effectiveness of Bi-DFCL, showing statistically significant improvements over state-of-the-art. Currently, Bi-DFCL has been deployed across several marketing scenarios at Meituan, one of the largest online food delivery platforms in the world.

NeurIPS Conference 2025 Conference Paper

Data-Juicer 2.0: Cloud-Scale Adaptive Data Processing for and with Foundation Models

  • Daoyuan Chen
  • Yilun Huang
  • Xuchen Pan
  • Jiang Nana
  • Haibin Wang
  • Yilei Zhang
  • Ce Ge
  • Yushuo Chen

Foundation models demand advanced data processing for their vast, multimodal datasets. However, traditional frameworks struggle with the unique complexities of multimodal data. In response, we present Data-Juicer 2.0, a data processing system backed by 100+ data processing operators spanning text, image, video, and audio modalities, supporting more critical tasks including data analysis, synthesis, annotation, and foundation model post-training. With seamless compatibility and dedicated optimization for popular dataset hubs like Hugging Face and computing engines like Ray, it improves upon its predecessor in terms of usability, efficiency, and programmability. It features an easily accessible user interface layer that supports decoupled Python interactions, RESTful APIs, and conversational commands. Its new runtime layer offers adaptive execution across diverse scales and environments, abstracting away system complexities. Extensive empirical evaluations demonstrate Data-Juicer 2.0's remarkable performance and scalability, highlighting its capability to efficiently process TB-level data with 10k+ CPU cores. The system is publicly available and has been widely adopted in diverse research fields and real-world products such as Alibaba Cloud PAI. We actively maintain the system and share practical insights to foster research and applications of next-generation foundation models.

TMLR Journal 2025 Journal Article

GLOV: Guided Large Language Models as Implicit Optimizers for Vision Language Models

  • Muhammad Jehanzeb Mirza
  • Mengjie Zhao
  • Zhuoyuan Mao
  • Sivan Doveh
  • Wei Lin
  • Paul Gavrikov
  • Michael Dorkenwald
  • Shiqi Yang

In this work, we propose GLOV, which enables Large Language Models (LLMs) to act as implicit optimizers for Vision-Language Models (VLMs) to enhance downstream vision tasks. GLOV prompts an LLM with the downstream task description, querying it for suitable VLM prompts (e.g., for zero-shot classification with CLIP). These prompts are ranked according to their fitness for the downstream vision task. In each respective optimization step, the ranked prompts are fed as in-context examples (with their accuracies) to equip the LLM with the knowledge of the type of prompts preferred by the downstream VLM. Furthermore, we explicitly guide the LLM's generation at each optimization step by adding an offset vector, calculated from the embedding differences between previous positive and negative solutions, to the intermediate layer of the network for the next generation. This offset vector biases the LLM generation toward the type of language the downstream VLM prefers, resulting in enhanced performance on the downstream vision tasks. We comprehensively evaluate our GLOV on two tasks: object recognition and the critical task of enhancing VLM safety. Our GLOV shows performance improvement by up to 15.0% and 57.5% for dual-encoder (e.g., CLIP) and encoder-decoder (e.g., LLaVA) models for object recognition and reduces the attack success rate (ASR) on state-of-the-art VLMs by up to 60.7%.

NeurIPS Conference 2025 Conference Paper

Learning Provably Improves the Convergence of Gradient Descent

  • Qingyu Song
  • Wei Lin
  • Hong Xu

Learn to Optimize (L2O) trains deep neural network-based solvers for optimization, achieving success in accelerating convex problems and improving non-convex solutions. However, L2O lacks rigorous theoretical backing for its own training convergence, as existing analyses often use unrealistic assumptions, a gap this work highlights empirically. We bridge this gap by proving the training convergence of L2O models that learn Gradient Descent (GD) hyperparameters for quadratic programming, leveraging the Neural Tangent Kernel (NTK) theory. We propose a deterministic initialization strategy to support our theoretical results and promote stable training over extended optimization horizons by mitigating gradient explosion. Our L2O framework demonstrates over 50% better optimality than GD and superior robustness over state-of-the-art L2O methods on synthetic datasets. The code of our method can be found at https://github.com/NetX-lab/MathL2OProof-Official.

ICML Conference 2025 Conference Paper

Neural Event-Triggered Control with Optimal Scheduling

  • Luan Yang
  • Jingdong Zhang
  • Qunxi Zhu
  • Wei Lin

Learning-enabled controllers with stability certificate functions have demonstrated impressive empirical performance in addressing control problems in recent years. Nevertheless, directly deploying the neural controllers onto actual digital platforms requires impractically excessive communication resources due to a continuously updating demand from the closed-loop feedback controller. We introduce a framework aimed at learning the event-triggered controller (ETC) with optimal scheduling, i.e., minimal triggering times, to address this challenge in resource-constrained scenarios. Our proposed framework, denoted by Neural ETC, includes two practical algorithms: the path integral algorithm based on directly simulating the event-triggered dynamics, and the Monte Carlo algorithm derived from new theoretical results regarding a lower bound on the inter-event time. Furthermore, we propose a projection operation with an analytical expression that ensures theoretical stability and schedule optimality for Neural ETC. Compared to the conventional neural controllers, our empirical results show that the Neural ETC significantly reduces the required communication resources while enhancing the control performance in scenarios with constrained communication resources.

NeurIPS Conference 2025 Conference Paper

pLSTM: parallelizable Linear Source Transition Mark networks

  • Korbinian Pöppel
  • Richard Freinschlag
  • Thomas Schmied
  • Wei Lin
  • Sepp Hochreiter

Modern recurrent architectures, such as xLSTM and Mamba, have recently challenged the Transformer in language modeling. However, their structure constrains their applicability to sequences only or requires processing multi-dimensional data structures, such as images or molecular graphs, in a pre-defined sequential order. In contrast, Multi-Dimensional RNNs (MDRNNs) are well suited for data with a higher level structure, like 2D grids, trees, and directed acyclic graphs (DAGs). In this work, we extend the notion of multi-dimensionality to linear RNNs. We introduce parallelizable Linear Source Transition Mark networks (pLSTMs) using Source, Transition, and Mark gates that act on the linegraph of a general DAG. This enables parallelization in analogy to parallel associative scans and the chunkwise-recurrent form of sequential linear RNNs, but for DAGs. For regular grids (1D and 2D), like images, this scheme can be efficiently implemented using einsum operations, concatenations, and padding in logarithmic time. pLSTMs tackle the vanishing/exploding activation/gradient problem for long distances in DAGs via two distinct modes: a directed propagation mode (P-mode) and a diffusive distribution mode (D-mode). To showcase the long-range capabilities of pLSTM, we introduce arrow-pointing extrapolation as a synthetic computer vision task that contains long-distance directional information. We demonstrate that pLSTMs generalize well to larger image sizes, whereas Transformers struggle to extrapolate. On established molecular graph and computer vision benchmarks, pLSTMs also show strong performance. The complete code is available at https://github.com/ml-jku/plstm_experiments.

NeurIPS Conference 2025 Conference Paper

PolarQuant: Leveraging Polar Transformation for Key Cache Quantization and Decoding Acceleration

  • Songhao Wu
  • Ang Lv
  • xiao feng
  • Yufei Zhang
  • Xun Zhang
  • Guojun Yin
  • Wei Lin
  • Rui Yan

The increasing demand for long-context generation has made the KV cache in large language models a bottleneck in memory consumption. Quantizing the cache to lower bit widths is an effective way to reduce memory costs; however, previous methods struggle with key cache quantization due to outliers, resulting in suboptimal performance. We propose a novel quantization approach, PolarQuant, which provides a new perspective for key cache quantization and efficiently addresses the outlier dilemma. We observe that the distribution of the key states reveals well-structured patterns under polar transformation. Outliers generally appear in only one of the two dimensions, which are rotated together by a specific angle when rotary position embeddings are applied. When represented as two-dimensional vectors, these dimensions exhibit well-organized patterns, with radii and angles smoothly distributed in polar space. This alleviates the channel-wise outliers, making them well-suited for key cache quantization. PolarQuant divides key vectors into groups of two-dimensional sub-vectors, encoding them as the quantized radius and the polar angle, rather than quantizing original key vectors directly. PolarQuant achieves superior efficiency in KV cache quantization and accelerates the decoding process by turning the query-key inner product into a table lookup, all while maintaining the downstream performance of full-precision models. Our code is available at https://github.com/ericshwu/PolarQuant.
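
The encoding step described above (two-dimensional sub-vectors stored as a quantized radius and polar angle) can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: pairing adjacent dimensions, uniform quantization, the fixed `r_max` range, and all function names are assumptions, and the table-lookup decoding path is not covered.

```python
import math

def quantize_uniform(x, lo, hi, bits):
    """Map x in [lo, hi] to an integer code in [0, 2**bits - 1]."""
    levels = (1 << bits) - 1
    code = round((x - lo) / (hi - lo) * levels)
    return min(max(code, 0), levels)

def dequantize_uniform(code, lo, hi, bits):
    """Inverse of quantize_uniform, up to rounding error."""
    levels = (1 << bits) - 1
    return lo + (hi - lo) * code / levels

def polar_encode(key, r_max, r_bits=4, theta_bits=4):
    """Split an even-length key into 2-D sub-vectors; store (radius, angle) codes."""
    codes = []
    for i in range(0, len(key), 2):
        x, y = key[i], key[i + 1]
        r = math.hypot(x, y)
        theta = math.atan2(y, x)  # angle in [-pi, pi]
        codes.append((quantize_uniform(r, 0.0, r_max, r_bits),
                      quantize_uniform(theta, -math.pi, math.pi, theta_bits)))
    return codes

def polar_decode(codes, r_max, r_bits=4, theta_bits=4):
    """Rebuild an approximate key vector from (radius, angle) codes."""
    key = []
    for r_code, t_code in codes:
        r = dequantize_uniform(r_code, 0.0, r_max, r_bits)
        theta = dequantize_uniform(t_code, -math.pi, math.pi, theta_bits)
        key.extend([r * math.cos(theta), r * math.sin(theta)])
    return key

# Hypothetical 4-dimensional key; the second pair has one much smaller coordinate.
key = [3.0, 4.0, -1.0, 0.5]
codes = polar_encode(key, r_max=5.0, r_bits=8, theta_bits=8)
reconstructed = polar_decode(codes, r_max=5.0, r_bits=8, theta_bits=8)
```

Storing radius and angle separately means a large value in one coordinate of a pair is absorbed into the radius instead of stretching a per-channel quantization range.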

NeurIPS Conference 2025 Conference Paper

PUO-Bench: A Panel Understanding and Operation Benchmark with A Privacy-Preserving Framework

  • Wei Lin
  • Yiwei Zhou
  • Junkai Zhang
  • Rui Shao
  • Zhiyuan Zhao
  • Junyu Gao
  • Antoni Chan
  • Xuelong Li

Recent advancements in Vision-Language Models (VLMs) have enabled GUI agents to leverage visual features for interface understanding and operation in the digital world. However, limited research has addressed the interpretation of and interaction with control panels in real-world settings. To bridge this gap, we propose the Panel Understanding and Operation (PUO) benchmark, comprising annotated panel images from appliances and associated vision-language instruction pairs. Experimental results on the benchmark demonstrate significant performance disparities between zero-shot and fine-tuned VLMs, revealing the lack of PUO-specific capabilities in existing language models. Furthermore, we introduce a Privacy-Preserving Framework (PPF) to address privacy concerns in cloud-based panel parsing and reasoning. PPF employs a dual-stage architecture, performing panel understanding on edge devices while delegating complex reasoning to cloud-based LLMs. Although this design introduces a performance trade-off due to edge model limitations, it eliminates the transmission of raw visual data, thereby mitigating privacy risks. Overall, this work provides foundational resources and methodologies for advancing interactive human-machine systems and the robotics field in panel-centric applications.

JBHI Journal 2025 Journal Article

RTS-ViT: Real-Time Share Vision Transformer for Image Classification

  • Chunlei Meng
  • Wei Lin
  • Bowen Liu
  • Hongda Zhang
  • Zhongxue Gan
  • Chun Ouyang

Vision transformers have achieved remarkable success in image classification. The dual-branch vision transformer generates more features by taking advantage of feature fusion. Inspired by this, a dual-branch vision transformer with a Real-Time Share feature in the encoding process was proposed for retinal image classification tasks. The approach processes image patches of varying sizes (base and large) through two independent branches and implements multi-stage Real-Time feature fusion via the Real-Time Share feature encoder. This encoder enables the branches to complement each other's features at each encoding stage, facilitating finer feature learning and enhancing the self-attention information passed to subsequent stages. It significantly boosts feature representation and classification performance. Additionally, a straightforward and effective feature fusion method, L-Times Attention Fusion, was proposed: vector concatenation for the Real-Time Share feature in the earlier (L-1) encoding stages and element-wise addition for overall feature fusion at the L-th stage, achieving more efficient feature integration. The method was validated on a retinal image dataset. Results show that the approach outperforms the recent Cross-ViT in average top-1 accuracy by 5.61% with lower FLOPs and fewer model parameters, without relying on pre-trained weights, highlighting stronger self-learning feature capabilities and reduced reliance on extensive pre-training data.

AAAI Conference 2025 Conference Paper

Semantic Convergence: Harmonizing Recommender Systems via Two-Stage Alignment and Behavioral Semantic Tokenization

  • Guanghan Li
  • Xun Zhang
  • Yufei Zhang
  • Yifan Yin
  • Guojun Yin
  • Wei Lin

Large language models (LLMs), endowed with exceptional reasoning capabilities, are adept at discerning profound user interests from historical behaviors, thereby presenting a promising avenue for the advancement of recommendation systems. However, a notable discrepancy persists between the sparse collaborative semantics typically found in recommendation systems and the dense token representations within LLMs. In our study, we propose a novel framework that harmoniously merges traditional recommendation models with the prowess of LLMs. We initiate this integration by transforming ItemIDs into sequences that align semantically with the LLMs' space, through the proposed Alignment Tokenization module. Additionally, we design a series of specialized supervised learning tasks aimed at aligning collaborative signals with the subtleties of natural language semantics. To ensure practical applicability, we optimize online inference by pre-caching the top-K results for each user, reducing latency and improving efficiency. Extensive experimental evidence indicates that our model markedly improves recall metrics and displays remarkable scalability of recommendation systems.

NeurIPS Conference 2025 Conference Paper

STSBench: A Spatio-temporal Scenario Benchmark for Multi-modal Large Language Models in Autonomous Driving

  • Christian Fruhwirth-Reisinger
  • Dušan Malić
  • Wei Lin
  • David Schinagl
  • Samuel Schulter
  • Horst Possegger

We introduce STSBench, a scenario-based framework to benchmark the holistic understanding of vision-language models (VLMs) for autonomous driving. The framework automatically mines predefined traffic scenarios from any dataset using ground-truth annotations, provides an intuitive user interface for efficient human verification, and generates multiple-choice questions for model evaluation. Applied to the nuScenes dataset, we present STSnu, the first benchmark that evaluates the spatio-temporal reasoning capabilities of VLMs based on comprehensive 3D perception. Existing benchmarks typically target off-the-shelf or fine-tuned VLMs for images or videos from a single viewpoint, focusing on semantic tasks such as object recognition, dense captioning, risk assessment, or scene understanding. In contrast, STSnu evaluates driving expert VLMs for end-to-end driving, operating on videos from multi-view cameras or LiDAR. It specifically assesses their ability to reason about both ego-vehicle actions and complex interactions among traffic participants, a crucial capability for autonomous vehicles. The benchmark features 43 diverse scenarios spanning multiple views and frames, resulting in 971 human-verified multiple-choice questions. A thorough evaluation uncovers critical shortcomings in existing models’ ability to reason about fundamental traffic dynamics in complex environments. These findings highlight the urgent need for architectural advancements that explicitly model spatio-temporal reasoning. By addressing a core gap in spatio-temporal evaluation, STSBench enables the development of more robust and explainable VLMs for autonomous driving.

AAAI Conference 2024 Conference Paper

A Fixed-Point Approach to Unified Prompt-Based Counting

  • Wei Lin
  • Antoni B. Chan

Existing class-agnostic counting models typically rely on a single type of prompt, e.g., box annotations. This paper aims to establish a comprehensive prompt-based counting framework capable of generating density maps for concerned objects indicated by various prompt types, such as box, point, and text. To achieve this goal, we begin by converting prompts from different modalities into prompt masks without requiring training. These masks are then integrated into a class-agnostic counting methodology for predicting density maps. Furthermore, we introduce a fixed-point inference along with an associated loss function to improve counting accuracy, all without introducing new parameters. The effectiveness of this method is substantiated both theoretically and experimentally. Additionally, a contrastive training scheme is implemented to mitigate dataset bias inherent in current class-agnostic counting datasets, a strategy whose effectiveness is confirmed by our ablation study. Our model excels in prominent class-agnostic datasets and exhibits superior performance in cross-dataset adaptation tasks.

ICML Conference 2024 Conference Paper

A Statistical Theory of Regularization-Based Continual Learning

  • Xuyang Zhao
  • Huiyuan Wang
  • Weiran Huang
  • Wei Lin

We provide a statistical analysis of regularization-based continual learning on a sequence of linear regression tasks, with emphasis on how different regularization terms affect the model performance. We first derive the convergence rate for the oracle estimator obtained as if all data were available simultaneously. Next, we consider a family of generalized $\ell_2$-regularization algorithms indexed by matrix-valued hyperparameters, which includes the minimum norm estimator and continual ridge regression as special cases. As more tasks are introduced, we derive an iterative update formula for the estimation error of generalized $\ell_2$-regularized estimators, from which we determine the hyperparameters resulting in the optimal algorithm. Interestingly, the choice of hyperparameters can effectively balance the trade-off between forward and backward knowledge transfer and adjust for data heterogeneity. Moreover, the estimation error of the optimal algorithm is derived explicitly, which is of the same order as that of the oracle estimator. In contrast, our lower bounds for the minimum norm estimator and continual ridge regression show their suboptimality. A byproduct of our theoretical analysis is the equivalence between early stopping and generalized $\ell_2$-regularization in continual learning, which may be of independent interest. Finally, we conduct experiments to complement our theory.

AAAI Conference 2024 Conference Paper

Arithmetic Feature Interaction Is Necessary for Deep Tabular Learning

  • Yi Cheng
  • Renjun Hu
  • Haochao Ying
  • Xing Shi
  • Jian Wu
  • Wei Lin

Until recently, the question of the effective inductive bias of deep models on tabular data has remained unanswered. This paper investigates the hypothesis that arithmetic feature interaction is necessary for deep tabular learning. To test this point, we create a synthetic tabular dataset with a mild feature interaction assumption and examine a modified transformer architecture enabling arithmetical feature interactions, referred to as AMFormer. Results show that AMFormer outperforms strong counterparts in fine-grained tabular data modeling, data efficiency in training, and generalization. This is attributed to its parallel additive and multiplicative attention operators and prompt-based optimization, which facilitate the separation of tabular samples in an extended space with arithmetically-engineered features. Our extensive experiments on real-world data also validate the consistent effectiveness, efficiency, and rationale of AMFormer, suggesting it has established a strong inductive bias for deep learning on tabular data. Code is available at https://github.com/aigc-apps/AMFormer.

NeurIPS Conference 2024 Conference Paper

ConMe: Rethinking Evaluation of Compositional Reasoning for Modern VLMs

  • Irene Huang
  • Wei Lin
  • M. J. Mirza
  • Jacob A. Hansen
  • Sivan Doveh
  • Victor I. Butoi
  • Roei Herzig
  • Assaf Arbelle

Compositional Reasoning (CR) entails grasping the significance of attributes, relations, and word order. Recent Vision-Language Models (VLMs), comprising a visual encoder and a Large Language Model (LLM) decoder, have demonstrated remarkable proficiency in such reasoning tasks. This prompts a crucial question: have VLMs effectively tackled the CR challenge? We conjecture that existing CR benchmarks may not adequately push the boundaries of modern VLMs due to the reliance on an LLM-only negative text generation pipeline. Consequently, the negatives produced either appear as outliers from the natural language distribution learned by VLMs' LLM decoders or as improbable within the corresponding image context. To address these limitations, we introduce ConMe (an abbreviation for Confuse Me), a compositional reasoning benchmark and a novel data generation pipeline leveraging VLMs to produce 'hard CR Q&A'. Through a new concept of VLMs conversing with each other to collaboratively expose their weaknesses, our pipeline autonomously generates, evaluates, and selects challenging compositional reasoning questions, establishing a robust CR benchmark, also subsequently validated manually. Our benchmark provokes a noteworthy, up to 33%, decrease in CR performance compared to preceding benchmarks, reinstating the CR challenge even for state-of-the-art VLMs.

ICML Conference 2024 Conference Paper

FESSNC: Fast Exponentially Stable and Safe Neural Controller

  • Jingdong Zhang
  • Luan Yang
  • Qunxi Zhu
  • Wei Lin

In order to stabilize nonlinear systems modeled by stochastic differential equations, we design a Fast Exponentially Stable and Safe Neural Controller (FESSNC) for fast learning of controllers. Our framework is parameterized by neural networks and realizes both rigorous exponential stability and safety guarantees. Concretely, we design heuristic methods to learn the exponentially stable and the safe controllers, respectively, in light of the classical theory of stochastic exponential stability and our established theorem on guaranteeing the almost-sure safety for stochastic dynamics. More significantly, to rigorously ensure the stability and the safety guarantees for the learned controllers, we develop a projection operator, projecting to the space of exponentially-stable and safe controllers. To reduce the high computational cost of solving the projection operation, we propose approximate projection operators with closed forms that map the learned controllers to the target controller space. Furthermore, we employ Hutchinson’s trace estimator for a scalable unbiased estimate of the Hessian matrix that is used in the projection operator, which reduces computational cost and therefore accelerates the training and testing processes. More importantly, our approximate projection operations are applicable to the nonparametric control methods, improving their stability and safety performance. We empirically demonstrate the superiority of the FESSNC over the existing methods.
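
The abstract mentions Hutchinson's trace estimator without further detail. A minimal sketch of the generic estimator, which averages v^T A v over random sign vectors v and needs only matrix-vector products; the small explicit matrix and the names `hutchinson_trace` and `matvec` are illustrative, and how the paper applies the estimator inside its projection operator is not shown here:

```python
import random

def hutchinson_trace(matvec, dim, num_samples=1000, seed=0):
    """Estimate trace(A) as the mean of v^T A v over Rademacher vectors v."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        v = [rng.choice((-1.0, 1.0)) for _ in range(dim)]
        av = matvec(v)  # only a matrix-vector product is required
        total += sum(vi * avi for vi, avi in zip(v, av))
    return total / num_samples

# Hypothetical symmetric matrix standing in for a Hessian; its true trace is 9.
A = [[2.0, 0.5, 0.0],
     [0.5, 3.0, 1.0],
     [0.0, 1.0, 4.0]]

def matvec(v):
    """Multiply the explicit matrix A by the vector v."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]

estimate = hutchinson_trace(matvec, dim=3, num_samples=5000)
```

In practice the appeal is that `matvec` can be a Hessian-vector product computed by automatic differentiation, so the full Hessian never has to be formed.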

AAAI Conference 2024 Conference Paper

How to Trade Off the Quantity and Capacity of Teacher Ensemble: Learning Categorical Distribution to Stochastically Employ a Teacher for Distillation

  • Zixiang Ding
  • Guoqing Jiang
  • Shuai Zhang
  • Lin Guo
  • Wei Lin

We observe two phenomena with respect to quantity and capacity: 1) more teachers are not always better for multi-teacher knowledge distillation, and 2) a stronger teacher is not always better for single-teacher knowledge distillation. To trade off the quantity and capacity of the teacher ensemble, in this paper, we propose a new distillation paradigm named Dynamic Knowledge Distillation (DynaKD) that learns an adaptive categorical distribution to stochastically employ a teacher from a teacher ensemble in each step, to transfer knowledge from the teacher ensemble into the student. DynaKD has three advantages: 1) it can preserve the diversity of each teacher via a one-to-one distillation manner instead of several-for-one, 2) it can make the best of a powerful teacher via the multi-level assistant teachers in the ensemble, and 3) it can also dynamically determine the importance of each teacher for various tasks. To verify the effectiveness of the proposed approach, we conduct extensive experiments for BERT compression on the GLUE benchmark. Experimental results show that the proposed approach achieves state-of-the-art scores compared to previous compression approaches on five out of seven downstream tasks, including pushing MRPC F1 and accuracy to 92.2 (1.4 point absolute improvement) and RTE accuracy to 76.2 (2.8 point absolute improvement). Moreover, we also conduct extensive experiments for image classification on CIFAR-100. Similarly, DynaKD also achieves state-of-the-art performance.

AAAI Conference 2024 Conference Paper

Hypergraph Neural Architecture Search

  • Wei Lin
  • Xu Peng
  • Zhengtao Yu
  • Taisong Jin

In recent years, Hypergraph Neural Networks (HGNNs) have achieved considerable success by manually designing architectures, which are capable of extracting effective patterns with high-order interactions from non-Euclidean data. However, such a mechanism is extremely inefficient, demanding tremendous human effort to tune diverse model parameters. In this paper, we propose a novel Hypergraph Neural Architecture Search (HyperNAS) to automatically design the optimal HGNNs. The proposed model constructs a search space suitable for hypergraphs, and derives hypergraph architectures through differentiable search strategies. A hypergraph structure-aware distance criterion is introduced as a guideline for obtaining an optimal hypergraph architecture via the leave-one-out method. Experimental results for node classification on the benchmark Cora, Citeseer, and Pubmed citation networks and on hypergraph datasets show that HyperNAS outperforms existing HGNN models and graph NAS methods.

NeurIPS Conference 2024 Conference Paper

PertEval: Unveiling Real Knowledge Capacity of LLMs with Knowledge-Invariant Perturbations

  • Jiatong Li
  • Renjun Hu
  • Kunzhe Huang
  • Yan Zhuang
  • Qi Liu
  • Mengxiao Zhu
  • Xing Shi
  • Wei Lin

Expert-designed close-ended benchmarks are indispensable in assessing the knowledge capacity of large language models (LLMs). Despite their widespread use, concerns have mounted regarding their reliability due to limited test scenarios and an unavoidable risk of data contamination. To rectify this, we present PertEval, a toolkit devised for in-depth probing of LLMs' knowledge capacity through knowledge-invariant perturbations. These perturbations employ human-like restatement techniques to generate on-the-fly test samples from static benchmarks, meticulously retaining knowledge-critical content while altering irrelevant details. Our toolkit further includes a suite of response consistency analyses that compare performance on raw vs. perturbed test sets to precisely assess LLMs' genuine knowledge capacity. Six representative LLMs are re-evaluated using PertEval. Results reveal significantly inflated performance of the LLMs on raw benchmarks, including an absolute 25.8% overestimation for GPT-4. Additionally, through a nuanced response pattern analysis, we discover that PertEval retains LLMs' uncertainty to specious knowledge, and reveals their potential rote memorization to correct options which leads to overestimated performance. We also find that the detailed response consistency analyses by PertEval could illuminate various weaknesses in existing LLMs' knowledge mastery and guide the development of refinement. Our findings provide insights for advancing more robust and genuinely knowledgeable LLMs. Our code is available at https://github.com/aigc-apps/PertEval.

NeurIPS Conference 2024 Conference Paper

Text2NKG: Fine-Grained N-ary Relation Extraction for N-ary relational Knowledge Graph Construction

  • Haoran Luo
  • Haihong E
  • Yuhao Yang
  • Tianyu Yao
  • Yikai Guo
  • Zichen Tang
  • Wentai Zhang
  • Shiyao Peng

Beyond traditional binary relational facts, n-ary relational knowledge graphs (NKGs) are comprised of n-ary relational facts containing more than two entities, which are closer to real-world facts with broader applications. However, the construction of NKGs remains at a coarse-grained level, which is always in a single schema, ignoring the order and variable arity of entities. To address these restrictions, we propose Text2NKG, a novel fine-grained n-ary relation extraction framework for n-ary relational knowledge graph construction. We introduce a span-tuple classification approach with hetero-ordered merging and output merging to accomplish fine-grained n-ary relation extraction in different arity. Furthermore, Text2NKG supports four typical NKG schemas: hyper-relational schema, event-based schema, role-based schema, and hypergraph-based schema, with high flexibility and practicality. The experimental results demonstrate that Text2NKG achieves state-of-the-art performance in F1 scores on the fine-grained n-ary relation extraction benchmark. Our code and datasets are publicly available.

NeurIPS Conference 2023 Conference Paper

LaFTer: Label-Free Tuning of Zero-shot Classifier using Language and Unlabeled Image Collections

  • Muhammad Jehanzeb Mirza
  • Leonid Karlinsky
  • Wei Lin
  • Horst Possegger
  • Mateusz Kozinski
  • Rogerio Feris
  • Horst Bischof

Recently, large-scale pre-trained Vision and Language (VL) models have set a new state-of-the-art (SOTA) in zero-shot visual classification, enabling open-vocabulary recognition of a potentially unlimited set of categories defined as simple language prompts. However, despite these great advances, the performance of these zero-shot classifiers still falls short of the results of dedicated (closed category set) classifiers trained with supervised fine-tuning. In this paper we show, for the first time, how to reduce this gap without any labels and without any paired VL data, using an unlabeled image collection and a set of texts auto-generated using a Large Language Model (LLM) describing the categories of interest and effectively substituting labeled visual instances of those categories. Using our label-free approach, we are able to attain significant performance improvements over the zero-shot performance of the base VL model and other contemporary methods and baselines on a wide variety of datasets, demonstrating absolute improvement of up to 11.7% (3.8% on average) in the label-free setting. Moreover, despite our approach being label-free, we observe 1.3% average gains over leading few-shot prompting baselines that do use 5-shot supervision.

AAAI Conference 2023 Conference Paper

SKDBERT: Compressing BERT via Stochastic Knowledge Distillation

  • Zixiang Ding
  • Guoqing Jiang
  • Shuai Zhang
  • Lin Guo
  • Wei Lin

In this paper, we propose Stochastic Knowledge Distillation (SKD) to obtain a compact BERT-style language model dubbed SKDBERT. In each distillation iteration, SKD samples a teacher model from a pre-defined teacher team, which consists of multiple teacher models with multi-level capacities, to transfer knowledge into the student model in a one-to-one manner. The sampling distribution plays an important role in SKD. We heuristically present three types of sampling distributions to assign appropriate probabilities for multi-level teacher models. SKD has two advantages: 1) it can preserve the diversities of multi-level teacher models via stochastically sampling a single teacher model in each distillation iteration, and 2) it can also improve the efficacy of knowledge distillation via multi-level teacher models when a large capacity gap exists between the teacher model and the student model. Experimental results on the GLUE benchmark show that SKDBERT reduces the size of a BERT model by 40% while retaining 99.5% of its language understanding performance and being 100% faster.
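
The per-iteration teacher sampling described in this abstract can be sketched roughly as follows. This is only an illustration of drawing one teacher per step from a fixed categorical distribution; the teacher names, the probabilities, and the `sample_teacher` helper are hypothetical, and the actual distillation loss and the three sampling distributions from the paper are not reproduced here.

```python
import random
from collections import Counter


def sample_teacher(teachers, probs, rng):
    """Draw a single teacher according to the given categorical distribution."""
    return rng.choices(teachers, weights=probs, k=1)[0]


# Hypothetical teacher team with multi-level capacities and assumed probabilities.
teachers = ["teacher_small", "teacher_medium", "teacher_large"]
probs = [0.2, 0.3, 0.5]

rng = random.Random(0)
num_iterations = 10000

# In a real training loop, each draw would select the teacher whose outputs
# supervise the student for that iteration (one-to-one distillation).
counts = Counter(sample_teacher(teachers, probs, rng) for _ in range(num_iterations))
print(counts)
```

Because only one teacher is active per iteration, the cost of a step matches single-teacher distillation regardless of how many teachers are in the team.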

IJCAI Conference 2022 Conference Paper

CGMN: A Contrastive Graph Matching Network for Self-Supervised Graph Similarity Learning

  • Di Jin
  • Luzhi Wang
  • Yizhen Zheng
  • Xiang Li
  • Fei Jiang
  • Wei Lin
  • Shirui Pan

Graph similarity learning refers to calculating the similarity score between two graphs, which is required in many realistic applications, such as visual tracking, graph classification, and collaborative filtering. As most of the existing graph neural networks yield effective graph representations of a single graph, little effort has been made for jointly learning two graph representations and calculating their similarity score. In addition, existing unsupervised graph similarity learning methods are mainly clustering-based, which ignores the valuable information embodied in graph pairs. To this end, we propose a contrastive graph matching network (CGMN) for self-supervised graph similarity learning in order to calculate the similarity between any two input graph objects. Specifically, we generate two augmented views for each graph in a pair respectively. Then, we employ two strategies, namely cross-view interaction and cross-graph interaction, for effective node representation learning. The former is used to strengthen the consistency of node representations in two views. The latter is utilized to identify node differences between different graphs. Finally, we transform node representations into graph-level representations via pooling operations for graph similarity computation. We have evaluated CGMN on eight real-world datasets, and the experiment results show that the proposed new approach is superior to the state-of-the-art methods in graph similarity learning downstream tasks.
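
The final step in this abstract, pooling node representations into graph-level representations and comparing them, might look like the sketch below. Mean pooling and cosine similarity are assumptions made for illustration, since the abstract does not name the specific pooling operation or similarity function; the embeddings and helper names are likewise hypothetical.

```python
import math


def mean_pool(node_embeddings):
    """Average node embeddings into one graph-level vector (assumed pooling)."""
    num_nodes = len(node_embeddings)
    dim = len(node_embeddings[0])
    return [sum(node[d] for node in node_embeddings) / num_nodes for d in range(dim)]


def cosine_similarity(u, v):
    """Cosine similarity between two non-zero vectors (assumed similarity)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical node embeddings for two graphs with different node counts.
graph_a = [[1.0, 0.0], [0.0, 1.0]]
graph_b = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]

similarity = cosine_similarity(mean_pool(graph_a), mean_pool(graph_b))
print(f"graph similarity: {similarity:.3f}")
```

Pooling makes the comparison independent of the number of nodes, so graphs of different sizes map to vectors of the same dimension.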

AAAI Conference 2022 Conference Paper

Neural Piecewise-Constant Delay Differential Equations

  • Qunxi Zhu
  • Yifei Shen
  • Dongsheng Li
  • Wei Lin

Continuous-depth neural networks, such as the Neural Ordinary Differential Equations (ODEs), have aroused a great deal of interest from the communities of machine learning and data science in recent years, which bridge the connection between deep neural networks and dynamical systems. In this article, we introduce a new sort of continuous-depth neural network, called the Neural Piecewise-Constant Delay Differential Equations (PCDDEs). Here, unlike the recently proposed framework of the Neural Delay Differential Equations (DDEs), we transform the single delay into the piecewise-constant delay(s). The Neural PCDDEs with such a transformation, on one hand, inherit the strength of universal approximating capability in Neural DDEs. On the other hand, the Neural PCDDEs, leveraging the contributions of the information from the multiple previous time steps, further promote the modeling capability without augmenting the network dimension. With such a promotion, we show that the Neural PCDDEs do outperform the several existing continuous-depth neural frameworks on the one-dimensional piecewise-constant delay population dynamics and real-world datasets, including MNIST, CIFAR10, and SVHN.

NeurIPS Conference 2022 Conference Paper

Neural Stochastic Control

  • Jingdong Zhang
  • Qunxi Zhu
  • Wei Lin

Control problems are always challenging since they arise from the real-world systems where stochasticity and randomness are of ubiquitous presence. This naturally and urgently calls for developing efficient neural control policies for stabilizing not only the deterministic equations but the stochastic systems as well. Here, in order to meet this paramount call, we propose two types of controllers, viz., the exponential stabilizer (ES) based on the stochastic Lyapunov theory and the asymptotic stabilizer (AS) based on the stochastic asymptotic stability theory. The ES can render the controlled systems exponentially convergent but it requires a long computational time; conversely, the AS makes the training much faster but it can only assure the asymptotic (not the exponential) attractiveness of the control targets. These two stochastic controllers thus are complementary in applications. We also investigate rigorously the linear control in both convergence time and energy cost and numerically compare it with the proposed controllers in these terms. More significantly, we use several representative physical systems to illustrate the usefulness of the proposed controllers in stabilization of dynamical systems.

IJCAI Conference 2022 Conference Paper

RAW-GNN: RAndom Walk Aggregation based Graph Neural Network

  • Di Jin
  • Rui Wang
  • Meng Ge
  • Dongxiao He
  • Xiang Li
  • Wei Lin
  • Weixiong Zhang

Graph-Convolution-based methods have been successfully applied to representation learning on homophily graphs where nodes with the same label or similar attributes tend to connect with one another. Due to the homophily assumption of Graph Convolutional Networks (GCNs) that these methods use, they are not suitable for heterophily graphs where nodes with different labels or dissimilar attributes tend to be adjacent. Several methods have attempted to address this heterophily problem, but they do not change the fundamental aggregation mechanism of GCNs because they rely on summation operators to aggregate information from neighboring nodes, which is implicitly subject to the homophily assumption. Here, we introduce a novel aggregation mechanism and develop a RAndom Walk Aggregation-based Graph Neural Network (called RAW-GNN) method. The proposed approach integrates the random walk strategy with graph neural networks. The new method utilizes breadth-first random walk search to capture homophily information and depth-first search to collect heterophily information. It replaces the conventional neighborhoods with path-based neighborhoods and introduces a new path-based aggregator based on Recurrent Neural Networks. These designs make RAW-GNN suitable for both homophily and heterophily graphs. Extensive experimental results showed that the new method achieved state-of-the-art performance on a variety of homophily and heterophily graphs.

IJCAI Conference 2020 Conference Paper

AdaBERT: Task-Adaptive BERT Compression with Differentiable Neural Architecture Search

  • Daoyuan Chen
  • Yaliang Li
  • Minghui Qiu
  • Zhen Wang
  • Bofang Li
  • Bolin Ding
  • Hongbo Deng
  • Jun Huang

Large pre-trained language models such as BERT have shown their effectiveness in various natural language processing tasks. However, the huge parameter size makes them difficult to be deployed in real-time applications that require quick inference with limited resources. Existing methods compress BERT into small models while such compression is task-independent, i.e., the same compressed BERT for all different downstream tasks. Motivated by the necessity and benefits of task-oriented BERT compression, we propose a novel compression method, AdaBERT, that leverages differentiable Neural Architecture Search to automatically compress BERT into task-adaptive small models for specific tasks. We incorporate a task-oriented knowledge distillation loss to provide search hints and an efficiency-aware loss as search constraints, which enables a good trade-off between efficiency and effectiveness for task-adaptive BERT compression. We evaluate AdaBERT on several NLP tasks, and the results demonstrate that those task-adaptive compressed models are 12.7x to 29.3x faster than BERT in inference time and 11.5x to 17.0x smaller in terms of parameter size, while comparable performance is maintained.

AAAI Conference 2020 Short Paper

RPM-Oriented Query Rewriting Framework for E-commerce Keyword-Based Sponsored Search (Student Abstract)

  • Xiuying Chen
  • Daorui Xiao
  • Shen Gao
  • Guojun Liu
  • Wei Lin
  • Bo Zheng
  • Dongyan Zhao
  • Rui Yan

Sponsored search optimizes revenue and relevance, which is estimated by Revenue Per Mille (RPM). Existing sponsored search models are all based on traditional statistical models, which have poor RPM performance when queries follow a heavy-tailed distribution. Here, we propose an RPM-oriented Query Rewriting Framework (RQRF) which outputs related bid keywords that can yield high RPM. RQRF embeds both queries and bid keywords to vectors in the same implicit space, converting the rewriting probability between each query and keyword to the distance between the two vectors. For label construction, we propose an RPM-oriented sample construction method, labeling keywords based on whether or not they can lead to high RPM. Extensive experiments are conducted to evaluate the performance of RQRF. On one month of large-scale real-world traffic from an e-commerce sponsored search system, the proposed model significantly outperforms the traditional baseline.

IJCAI Conference 2019 Conference Paper

Tag2Gauss: Learning Tag Representations via Gaussian Distribution in Tagged Networks

  • Yun Wang
  • Lun Du
  • Guojie Song
  • Xiaojun Ma
  • Lichen Jin
  • Wei Lin
  • Fei Sun

Keyword-based tags (referred to as tags) are used to represent additional attributes of nodes beyond what is explicitly stated in their contents, like the hashtags in YouTube. Aside from being auxiliary information for node representation, tags can also be used for retrieval, recommendation, content organization, and event analysis. Therefore, tag representation learning is of great importance. However, learning satisfactory tag representations is challenging because 1) traditional representation methods generally fail when it comes to representing tags, and 2) bidirectional interactions between nodes and tags should be modeled, which is generally not dealt with in existing research works. In this paper, we propose a tag representation learning model which takes tag-related node interaction into consideration, named Tag2Gauss. Specifically, since tags represent node communities with intricate overlapping relationships, we propose that Gaussian distributions would be appropriate in modeling tags. Considering the bidirectional interactions between nodes and tags, we propose a tag representation learning model mapping tags to distributions consisting of two embedding tasks, namely Tag-view embedding and Node-view embedding. Extensive evidence demonstrates the effectiveness of representing a tag as a distribution, and the advantages of the proposed architecture in many applications, such as node classification and network visualization.

IJCAI Conference 2018 Conference Paper

IncepText: A New Inception-Text Module with Deformable PSROI Pooling for Multi-Oriented Scene Text Detection

  • Qiangpeng Yang
  • Mengli Cheng
  • Wenmeng Zhou
  • Yan Chen
  • Minghui Qiu
  • Wei Lin

Incidental scene text detection, especially for multi-oriented text regions, is one of the most challenging tasks in many computer vision applications. Different from the common object detection task, scene text often suffers from a large variance of aspect ratio, scale, and orientation. To solve this problem, we propose a novel end-to-end scene text detector IncepText from an instance-aware segmentation perspective. We design a novel Inception-Text module and introduce deformable PSROI pooling to deal with multi-oriented text detection. Extensive experiments on ICDAR2015, RCTW-17, and MSRA-TD500 datasets demonstrate our method's superiority in terms of both effectiveness and efficiency. Our proposed method achieves a 1st place result on the ICDAR2015 challenge and state-of-the-art performance on other datasets. Moreover, we have released our implementation as an OCR product which is available for public access.