Author name cluster

Zhen Qin

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

24 papers

2 author rows

AAAI Conference 2026 Conference Paper

Learning Spatial Decay for Vision Transformers

Yuxin Mao
Zhen Qin
Jinxing Zhou
Bin Fan
Jing Zhang
Yiran Zhong
Yuchao Dai

Vision Transformers (ViTs) have revolutionized computer vision, yet their self-attention mechanism lacks explicit spatial inductive biases, leading to suboptimal performance on spatially-structured tasks. Existing approaches introduce data-independent spatial decay based on fixed distance metrics, applying uniform attention weighting regardless of image content and limiting adaptability to diverse visual scenarios. Inspired by recent advances in large language models where content-aware gating mechanisms (e.g., GLA, HGRN2, FOX) significantly outperform static alternatives, we present the first successful adaptation of data-dependent spatial decay to 2D vision transformers. We introduce Spatial Decay Transformer (SDT), featuring a novel Context-Aware Gating (CAG) mechanism that generates dynamic, data-dependent decay for patch interactions. Our approach learns to modulate spatial attention based on both content relevance and spatial proximity. We address the fundamental challenge of 1D-to-2D adaptation through a unified spatial-content fusion framework that integrates Manhattan distance-based spatial priors with learned content representations. Extensive experiments on ImageNet-1K classification and generation tasks demonstrate consistent improvements over strong baselines. Our work establishes data-dependent spatial decay as a new paradigm for enhancing spatial attention in vision transformers.

TAAS Journal 2025 Journal Article

Adaptive Scheduling of High-Availability Drone Swarms for Congestion Alleviation in Connected Automated Vehicles

Shengye Pang
Yi Li
Zhen Qin
Xinkui Zhao
Jintao Chen
Fan Wang
Jianwei Yin

The Intelligent Transportation System (ITS) serves as a pivotal element within urban networks, offering decision support to users and connected automated vehicles through comprehensive information gathering, sensing, device control, and data processing. Presently, ITS predominantly relies on sensors embedded in fixed infrastructure, notably Roadside Units (RSUs). However, RSUs are confined by coverage limitations and may encounter challenges in prompt emergency responses. On-demand resources, such as drones, present a viable option to supplement these deficiencies effectively. This article introduces an approach where Software-Defined Networking and Mobile Edge Computing technologies are integrated to formulate a high-availability drone swarm control and communication infrastructure framework comprising the cloud layer, edge layer, and device layer. Drones confront limitations in flight duration attributed to battery limitations, posing a challenge in sustaining continuous monitoring of road conditions over extended periods. Effective drone scheduling stands as a promising solution to overcome these constraints. To tackle this issue, we initially utilized Graph WaveNet, a specialized graph neural network structure tailored for spatial-temporal graph modeling, for training a congestion prediction model using real-world dataset inputs. Building upon this, we further propose an algorithm for drone scheduling based on congestion prediction. Our simulation experiments using real-world data demonstrate that, compared to the baseline method, the proposed scheduling algorithm not only yielded superior scheduling gains but also mitigated drone idle rates.

AAAI Conference 2025 Conference Paper

Deep Non-Rigid Structure-from-Motion Revisited: Canonicalization and Sequence Modeling

Hui Deng
Jiawei Shi
Zhen Qin
Yiran Zhong
Yuchao Dai

Non-Rigid Structure-from-Motion (NRSfM) is a classic 3D vision problem, where a 2D sequence is taken as input to estimate the corresponding 3D sequence. Recently, deep neural networks have greatly advanced the task of NRSfM. However, existing deep NRSfM methods still have limitations in handling the inherent sequence property and motion ambiguity associated with the NRSfM problem. In this paper, we revisit deep NRSfM from two perspectives to address the limitations of current deep NRSfM methods: (1) canonicalization and (2) sequence modeling. We propose an easy-to-implement per-sequence canonicalization method as opposed to the previous per-dataset canonicalization approaches. With this in mind, we propose a sequence modeling method that combines temporal information and subspace constraint. As a result, we have achieved an improved NRSfM reconstruction pipeline compared to previous efforts. The effectiveness of our method is verified by testing the sequence-to-sequence deep NRSfM pipeline with corresponding regularization modules on several commonly used datasets.

ICLR Conference 2025 Conference Paper

DUET: Decentralized Bilevel Optimization without Lower-Level Strong Convexity

Zhen Qin
Zhuqing Liu
Songtao Lu
Yingbin Liang
Jia Liu 0002

Decentralized bilevel optimization (DBO) provides a powerful framework for multi-agent systems to solve local bilevel tasks in a decentralized fashion without the need for a central server. However, most existing DBO methods rely on lower-level strong convexity (LLSC) to guarantee unique solutions and a well-defined hypergradient for stationarity measure, hindering their applicability in many practical scenarios not satisfying LLSC. To overcome this limitation, we introduce a new single-loop DBO algorithm called diminishing quadratically-regularized bilevel decentralized optimization (DUET), which eliminates the need for LLSC by introducing a diminishing quadratic regularization to the lower-level (LL) objective. We show that DUET achieves an iteration complexity of $O(1/T^{1-5p-\frac{11}{4}\tau})$ for approximate KKT-stationary point convergence under relaxed assumptions, where $p$ and $\tau $ are control parameters for LL learning rate and averaging, respectively. In addition, our DUET algorithm incorporates gradient tracking to address data heterogeneity, a key challenge in DBO settings. To the best of our knowledge, this is the first work to tackle DBO without LLSC under decentralized settings with data heterogeneity. Numerical experiments validate the theoretical findings and demonstrate the practical effectiveness of our proposed algorithms.

TAAS Journal 2025 Journal Article

Edge-Adaptive Dynamic Scalable Convolution for Efficient Remote Mobile Pathology Analysis

Peng Xiao
Dajiang Chen
Zhen Qin
Mingsheng Cao
Ruidong Chen

With the emergence of edge computing, there is a growing need for advanced technologies capable of real-time, efficient processing of complex data on edge devices, particularly in mobile health systems handling pathological images. On edge computing devices, the lightweighting of models and reduction of computational requirements not only save resources but also increase inference speed. Although many lightweight models and methods have been proposed in recent years, they still face many common challenges. This article introduces a novel convolution operation, Dynamic Scalable Convolution (DSC), which optimizes computational resources and accelerates inference on edge computing devices. DSC is shown to outperform traditional convolution methods in terms of parameter efficiency, computational speed, and overall performance, through comparative analyses in computer vision tasks like image classification and semantic segmentation. Experimental results demonstrate the significant potential of DSC in enhancing deep neural networks, particularly for edge computing applications in smart devices and remote healthcare, where it addresses the challenge of limited resources by reducing computational demands and improving inference speed. By integrating advanced convolution technology and edge computing applications, DSC offers a promising approach to support the rapidly developing mobile health field, especially in enhancing remote healthcare delivery through mobile multimedia communication.

NeurIPS Conference 2025 Conference Paper

Hybrid Latent Reasoning via Reinforcement Learning

Zhenrui Yue
Bowen Jin
Huimin Zeng
Honglei Zhuang
Zhen Qin
Jinsung Yoon
Lanyu Shang
Jiawei Han

Recent advances in large language models (LLMs) have introduced latent reasoning as a promising alternative to autoregressive reasoning. By performing internal computation with hidden states from previous steps, latent reasoning benefits from more informative features rather than sampling a discrete chain-of-thought (CoT) path. Yet latent reasoning approaches are often incompatible with LLMs, as their continuous paradigm conflicts with the discrete nature of autoregressive generation. Moreover, these methods rely on CoT traces for training and thus fail to exploit the inherent reasoning patterns of LLMs. In this work, we explore latent reasoning by leveraging the intrinsic capabilities of LLMs via reinforcement learning (RL). To this end, we introduce hybrid reasoning policy optimization (HRPO), an RL-based hybrid latent reasoning approach that (1) integrates prior hidden states into sampled tokens with a learnable gating mechanism, and (2) initializes training with predominantly token embeddings while progressively incorporating more hidden features. This design maintains LLMs' generative capabilities and incentivizes hybrid reasoning using both discrete and continuous representations. In addition, the hybrid HRPO introduces stochasticity into latent reasoning via token sampling, thereby enabling RL-based optimization without requiring CoT trajectories. Extensive evaluations across diverse benchmarks show that HRPO outperforms prior methods in both knowledge- and reasoning-intensive tasks. Furthermore, HRPO-trained LLMs remain interpretable and exhibit intriguing behaviors like cross-lingual patterns and shorter completion lengths, highlighting the potential of our RL-based approach and offering insights for future work in latent reasoning.

TMLR Journal 2025 Journal Article

LASP: Linear Attention Sequence Parallelism

Weigao Sun
Zhen Qin
Dong Li
Xuyang Shen
Yu Qiao
Yiran Zhong

Sequence parallelism (SP) serves as a prevalent strategy to handle long sequences that exceed the memory limit of a single device. However, for linear sequence modeling methods like linear attention, existing SP approaches do not take advantage of their right-product-first feature, resulting in sub-optimal communication efficiency and usability. In this paper, we introduce Linear Attention Sequence Parallelism (LASP), an efficient SP approach designed for linear attention-based transformer models. Specifically, we design an efficient point-to-point ring-style communication mechanism to leverage the right-product kernel trick of linear attention, which sharply decreases the communication overhead compared with existing SP methods. We enhance the computation efficiency of LASP by performing kernel fusion and intermediate state caching, making the implementation of LASP hardware-friendly on GPUs. Furthermore, we meticulously ensure the compatibility of sequence-level LASP with all types of batch-level data parallel methods, which is vital for distributed training on large clusters with very long sequences. We also discuss the generalization of LASP on other linear sequence modeling methods. Extensive experiments on linear attention-based models are conducted with varying sequence lengths from 2K to 4096K. LASP scales sequence length up to 4096K on 128 GPUs, which is 8$\times$ longer than existing SP methods. Code is available at: https://github.com/OpenNLPLab/LASP.
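
The right-product idea mentioned in this abstract can be illustrated with a small sketch. This is not the LASP implementation: it assumes unnormalised causal linear attention, runs on a single process, and the function names (`causal_naive`, `causal_chunked`) are hypothetical. It only shows that a running K^T V state carried from chunk to chunk reproduces the token-by-token result, which is the quantity a ring-style scheme would pass between devices.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def causal_naive(q, k, v):
    """Left-product form: out[i] = sum over j <= i of (q_i . k_j) * v_j."""
    d_v = len(v[0])
    out = []
    for i, qi in enumerate(q):
        row = [0] * d_v
        for j in range(i + 1):
            score = dot(qi, k[j])
            for b in range(d_v):
                row[b] += score * v[j][b]
        out.append(row)
    return out


def causal_chunked(q, k, v, chunk):
    """Right-product form: carry a d_k x d_v state (sum of k_j v_j^T) across chunks."""
    d_k, d_v = len(k[0]), len(v[0])
    state = [[0] * d_v for _ in range(d_k)]  # K^T V over all earlier chunks
    out = []
    for start in range(0, len(q), chunk):
        qc = q[start:start + chunk]
        kc = k[start:start + chunk]
        vc = v[start:start + chunk]
        for i, qi in enumerate(qc):
            # Contribution of earlier chunks: q_i times the carried state.
            row = [sum(qi[a] * state[a][b] for a in range(d_k)) for b in range(d_v)]
            # Contribution of the current chunk, with the causal mask applied.
            for j in range(i + 1):
                score = dot(qi, kc[j])
                for b in range(d_v):
                    row[b] += score * vc[j][b]
            out.append(row)
        # Fold the current chunk into the state before moving on.
        for kj, vj in zip(kc, vc):
            for a in range(d_k):
                for b in range(d_v):
                    state[a][b] += kj[a] * vj[b]
    return out
```

The carried state has size d_k x d_v regardless of sequence length, which is why passing it between chunks is cheap compared with exchanging keys and values.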

ICML Conference 2025 Conference Paper

MVA: Linear Attention with High-order Query-Keys Integration and Multi-level Vocabulary Decomposition

Ning Wang
Zekun Li 0014
Tongxin Bai
Man Yao
Zhen Qin
Guoqi Li 0002

Linear attention offers the advantages of linear inference time and fixed memory usage compared to Softmax attention. However, training large-scale language models with linear attention from scratch remains prohibitively expensive and exhibits significant performance gaps compared to Softmax-based models. To address these challenges, we focus on transforming pre-trained Softmax-based language models into linear attention models. We unify mainstream linear attention methods using a high-order QK integration theory and a multi-level vocabulary decomposition. Specifically, the QK integration theory explains the efficacy of combining linear and sparse attention from the perspective of information collection across different frequency bands. The multi-level vocabulary decomposition exponentially expands memory capacity by recursively exploiting compression loss from compressed states. Through detailed error analysis, we demonstrate superior approximation of Softmax attention achieved by our approach. To further improve performance and reduce training costs, we adopt a soft integration strategy with attention scores, effectively combining a sliding window mechanism. With less than 100M tokens, our method fine-tunes models to achieve linear complexity while retaining 99% of their original performance. Compared to state-of-the-art linear attention models and methods, our approach improves MMLU scores by 1.2 percentage points with minimal fine-tuning. Furthermore, even without the sliding window mechanism, our method achieves state-of-the-art performance on all test sets with 10B tokens.

JBHI Journal 2025 Journal Article

Postoperative Recovery Assessment for Parkinson's Patients via Light-weighted Topological Pose Estimation

Zeping Ma
Zhiyao Qin
Botao Jiang
Guosong Zhu
Zhen Qin
Ji Geng
Mohammed J.F. Alenazi
Saru Kumari

The UPDRS III scale plays a critical role in diagnosing the progression of Parkinson's disease. Current methods often involve doctors guiding patients through specific actions on the scale, recording their performance, and assigning scores. However, this approach has several drawbacks, including the lengthy time required for doctor-patient communication, the high costs of patients traveling to hospitals for follow-up visits, and the reliance on subjective judgments from doctors, which lack standardized criteria. With advancements in artificial intelligence, many traditional processes have been partially automated. To help patients reduce diagnosis time, lower medical costs, and provide more accurate and objective evaluation results, this paper proposes a Transformer-based pose estimation model for assessing UPDRS III scale actions. By integrating skeleton-based evaluations from the network with a series of post-processing operations, the model enables patients to perform self-assessments of their post-treatment recovery at home, saving doctors significant time. This work introduces a cascaded graph self-attention module, SGAM (Spatial-Graphical Attention Module), to enhance the network's understanding of human topology. Additionally, it proposes a lightweight convolutional block, Chi-block, which employs a novel approach leveraging the attribute invariance of filters to interpret model performance and guide compression. This approach reduces computational costs and model parameters while preserving accuracy. The proposed method demonstrates robust performance on human pose estimation (HPE) datasets and showcases impressive lightweight performance on benchmark datasets such as ImageNet-1K and CIFAR-10. These results demonstrate the potential of artificial intelligence in enabling automated remote diagnosis and treatment for Parkinson's patients.

NeurIPS Conference 2025 Conference Paper

Tensor Product Attention Is All You Need

Yifan Zhang
Yifeng Liu
Huizhuo Yuan
Zhen Qin
Yang Yuan
Quanquan Gu
Andrew Yao

Scaling language models to handle longer input sequences typically necessitates large key-value (KV) caches, resulting in substantial memory overhead during inference. In this paper, we propose Tensor Product Attention (TPA), a novel attention mechanism that uses tensor decompositions to represent queries, keys, and values compactly, substantially shrinking the KV cache size at inference time. By factorizing these representations into contextual low-rank components and seamlessly integrating with Rotary Position Embedding (RoPE), TPA achieves improved model quality alongside memory efficiency. Based on TPA, we introduce the Tensor ProducT ATTenTion Transformer (T6), a new model architecture for sequence modeling. Through extensive empirical evaluation on language modeling tasks, we demonstrate that T6 surpasses or matches the performance of standard Transformer baselines including Multi-Head Attention (MHA), Multi-Query Attention (MQA), Grouped-Query Attention (GQA), and Multi-Head Latent Attention (MLA) across various metrics, including perplexity and a range of established evaluation benchmarks. Notably, TPA's memory efficiency and computational efficiency at the decoding stage enable processing longer sequences under fixed resource constraints, addressing a critical scalability challenge in modern language models. Project Page: https://github.com/tensorgi/TPA.

AAAI Conference 2024 Conference Paper

Exploring Transformer Extrapolation

Zhen Qin
Yiran Zhong
Hui Deng

Length extrapolation has attracted considerable attention recently since it allows transformers to be tested on longer sequences than those used in training. Previous research has shown that this property can be attained by using carefully designed Relative Positional Encodings (RPEs). While these methods perform well on a variety of corpora, the conditions for length extrapolation have yet to be investigated. This paper attempts to determine what types of RPEs allow for length extrapolation through a thorough mathematical and empirical analysis. We discover that a transformer is certain to possess this property as long as the series that corresponds to the RPE's exponential converges. Two practices are derived from the conditions and examined in language modeling tasks on a variety of corpora. As a bonus from the conditions, we derive a new Theoretical Receptive Field (TRF) to measure the receptive field of RPEs without taking any training steps. Extensive experiments are conducted on the Wikitext-103, Books, Github, and WikiBook datasets to demonstrate the viability of our discovered conditions. We also compare TRF to Empirical Receptive Field (ERF) across different models, showing consistently matched trends on these datasets. Code is released at: https://github.com/OpenNLPLab/Rpe.

JMLR Journal 2024 Journal Article

Guaranteed Nonconvex Factorization Approach for Tensor Train Recovery

Zhen Qin
Michael B. Wakin
Zhihui Zhu

Tensor train (TT) decomposition represents an order-$N$ tensor using $O(N)$ order-$3$ tensors (i.e., factors of small dimension), achieved through products among these factors. Due to its compact representation, TT decomposition has been widely used in the fields of signal processing, machine learning, and quantum physics. It offers benefits such as reduced memory requirements, enhanced computational efficiency, and decreased sampling complexity. Nevertheless, existing optimization algorithms with guaranteed performance concentrate exclusively on using the TT format for reducing the optimization space in recovery problems, while still operating on the entire tensor in each iteration. There is a lack of comprehensive theoretical analysis for optimization involving the factors directly, despite the proven efficacy of such factorization methods in practice. In this paper, we provide the first convergence guarantee for the factorization approach in a TT-based recovery problem. Specifically, to avoid the scaling ambiguity and to facilitate theoretical analysis, we optimize over the so-called left-orthogonal TT format which enforces orthonormality among most of the factors. To ensure the orthonormal structure, we utilize the Riemannian gradient descent (RGD) for optimizing those factors over the Stiefel manifold. We first delve into the TT factorization/decomposition problem and establish the local linear convergence of RGD. Notably, the rate of convergence only experiences a linear decline as the tensor order increases. We then study the sensing problem that aims to recover a TT format tensor from linear measurements. Assuming the sensing operator satisfies the restricted isometry property (RIP), we show that with a proper initialization, which could be obtained through spectral initialization, RGD also converges to the ground-truth tensor at a linear rate. Furthermore, we expand our analysis to encompass scenarios involving Gaussian noise in the measurements. 
We prove that RGD can reliably recover the ground truth at a linear rate, with the recovery error exhibiting only polynomial growth in relation to the tensor order $N$. We conduct various experiments to validate our theoretical findings.
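
As a reading aid for the TT format described in this abstract, the following minimal sketch evaluates one entry of a tensor from its order-3 factors. The nested-list layout (`cores[n][i]` is the matrix slice of factor n at index i) and the function name `tt_entry` are assumptions made for illustration; the sketch does not implement the Riemannian gradient descent analysed in the paper.

```python
def tt_entry(cores, index):
    """Return X[i_1, ..., i_N] as the product of the selected matrix slices.

    cores[n][i] is an r_{n-1} x r_n matrix (list of rows); r_0 = r_N = 1.
    """
    row = [1]  # 1 x r_0 row vector with r_0 = 1
    for core, i in zip(cores, index):
        mat = core[i]
        # Row vector times matrix: (1 x r_{n-1}) @ (r_{n-1} x r_n).
        row = [
            sum(row[a] * mat[a][b] for a in range(len(row)))
            for b in range(len(mat[0]))
        ]
    return row[0]


# A 2 x 2 x 2 example with TT ranks (1, 2, 2, 1); values are arbitrary.
core1 = [[[1, 0]], [[0, 1]]]                    # two 1 x 2 slices
core2 = [[[1, 2], [3, 4]], [[0, 1], [1, 0]]]    # two 2 x 2 slices
core3 = [[[1], [1]], [[2], [-1]]]               # two 2 x 1 slices
example_cores = [core1, core2, core3]
```

For instance, `tt_entry(example_cores, (1, 0, 1))` multiplies the slices `core1[1]`, `core2[0]`, and `core3[1]` in order.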

JBHI Journal 2024 Journal Article

LightNet: A Novel Lightweight Convolutional Network for Brain Tumor Segmentation in Healthcare

Dongyuan Wu
Junyi Tao
Zhen Qin
Rao Asad Mumtaz
Jing Qin
Linfang Yu
Jane Courtney

Diagnosis, treatment planning, surveillance, and the monitoring of clinical trials for brain diseases all benefit greatly from neuroimaging-based tumor segmentation. Recently, Convolutional Neural Networks (CNNs) have demonstrated promising results in enhancing the efficiency of image-based brain tumor segmentation. Most current work on CNNs, however, is devoted to creating increasingly complicated convolution modules to improve performance, which in turn raises the computing cost of the model. This work proposes a simple and effective feed-forward CNN, LightNet (Light Network). Based on multi-path and multi-level, it replaces traditional convolutional methods with light operations, which reduces network parameters and redundant feature maps. In the up-sampling stage, a light channel attention module is added to achieve richer multi-scale and spatial semantic feature information extraction of brain tumor. The performance of the network is evaluated in the Multimodal Brain Tumor Segmentation Challenge (BraTS 2015) dataset, and results are presented here alongside other high-performing CNNs. Results show comparable accuracy with other methods but with increased efficiency, segmentation performance, and reduced redundancy and computational complexity. The result is a high-performing network with a balance between efficiency and accuracy, allowing, for example, better energy performance on mobile devices.

AAAI Conference 2024 Conference Paper

Resisting Backdoor Attacks in Federated Learning via Bidirectional Elections and Individual Perspective

Zhen Qin
Feiyi Chen
Chen Zhi
Xueqiang Yan
Shuiguang Deng

Existing approaches defend against backdoor attacks in federated learning (FL) mainly through a) mitigating the impact of infected models, or b) excluding infected models. The former negatively impacts model accuracy, while the latter usually relies on globally clear boundaries between benign and infected model updates. However, in reality, model updates can easily become mixed and scattered throughout due to the diverse distributions of local data. This work focuses on excluding infected models in FL. Unlike previous perspectives from a global view, we propose Snowball, a novel anti-backdoor FL framework through bidirectional elections from an individual perspective inspired by one principle deduced by us and two principles in FL and deep learning. It is characterized by a) bottom-up election, where each candidate model update votes to several peer ones such that a few model updates are elected as selectees for aggregation; and b) top-down election, where selectees progressively enlarge themselves through picking up from the candidates. We compare Snowball with state-of-the-art defenses to backdoor attacks in FL on five real-world datasets, demonstrating its superior resistance to backdoor attacks and slight impact on the accuracy of the global model.

ICLR Conference 2023 Conference Paper

Encoding Recurrence into Transformers

Feiqing Huang
Kexin Lu
Yuxi Cai
Zhen Qin
Yanwen Fang
Guangjian Tian
Guodong Li

This paper breaks down an RNN layer, with negligible loss, into a sequence of simple RNNs, each of which can be further rewritten into a lightweight positional encoding matrix of a self-attention, named the Recurrence Encoding Matrix (REM). Thus, recurrent dynamics introduced by the RNN layer can be encapsulated into the positional encodings of a multihead self-attention, and this makes it possible to seamlessly incorporate these recurrent dynamics into a Transformer, leading to a new module, Self-Attention with Recurrence (RSA). The proposed module can leverage the recurrent inductive bias of REMs to achieve a better sample efficiency than its corresponding baseline Transformer, while the self-attention is used to model the remaining non-recurrent signals. The relative proportions of these two components are controlled by a data-driven gated mechanism, and the effectiveness of RSA modules is demonstrated by four sequential learning tasks.

NeurIPS Conference 2023 Conference Paper

Hierarchically Gated Recurrent Neural Network for Sequence Modeling

Zhen Qin
Songlin Yang
Yiran Zhong

Transformers have surpassed RNNs in popularity due to their superior abilities in parallel training and long-term dependency modeling. Recently, there has been a renewed interest in using linear RNNs for efficient sequence modeling. These linear RNNs often employ gating mechanisms in the output of the linear recurrence layer while ignoring the significance of using forget gates within the recurrence. In this paper, we propose a gated linear RNN model dubbed Hierarchically Gated Recurrent Neural Network (HGRN), which includes forget gates that are lower bounded by a learnable value. The lower bound increases monotonically when moving up layers. This allows the upper layers to model long-term dependencies and the lower layers to model more local, short-term dependencies. Experiments on language modeling, image classification, and long-range arena benchmarks showcase the efficiency and effectiveness of our proposed model. The source code is available at https://github.com/OpenNLPLab/HGRN.
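
The lower-bounded forget gate in this abstract can be sketched for a single scalar channel as follows. The specific parameterisation here is an assumption, not taken from the abstract: per-layer bounds are built as an exclusive cumulative sum of a softmax over hypothetical learnable logits (so they increase with depth), and the gate is `bound + (1 - bound) * sigmoid(z)`. The released code at the URL above is the authoritative reference.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def layer_lower_bounds(logits):
    """Assumed scheme: exclusive cumulative sum of softmax(logits).

    The first layer gets bound 0 and bounds increase with depth.
    """
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    bounds, acc = [], 0.0
    for e in exps:
        bounds.append(acc)
        acc += e / total
    return bounds


def run_gated_layer(inputs, forget_logits, lower_bound, h0=0.0):
    """Scalar linear recurrence h_t = f_t * h_{t-1} + (1 - f_t) * c_t.

    The forget gate f_t is squeezed into [lower_bound, 1].
    """
    h, states, gates = h0, [], []
    for c, z in zip(inputs, forget_logits):
        f = lower_bound + (1.0 - lower_bound) * sigmoid(z)
        h = f * h + (1.0 - f) * c
        gates.append(f)
        states.append(h)
    return states, gates
```

With a higher bound, the gate cannot drop below that value, so the state forgets more slowly; this is one way to read the abstract's claim that upper layers capture longer-range dependencies.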

NeurIPS Conference 2023 Conference Paper

Learning List-Level Domain-Invariant Representations for Ranking

Ruicheng Xian
Honglei Zhuang
Zhen Qin
Hamed Zamani
Jing Lu
Ji Ma
Kai Hui
Han Zhao

Domain adaptation aims to transfer the knowledge learned on (data-rich) source domains to (low-resource) target domains, and a popular method is invariant representation learning, which matches and aligns the data distributions on the feature space. Although this method is studied extensively and applied on classification and regression problems, its adoption on ranking problems is sporadic, and the few existing implementations lack theoretical justifications. This paper revisits invariant representation learning for ranking. Upon reviewing prior work, we found that they implement what we call item-level alignment, which aligns the distributions of the items being ranked from all lists in aggregate but ignores their list structure. However, the list structure should be leveraged, because it is intrinsic to ranking problems where the data and the metrics are defined and computed on lists, not the items by themselves. To close this discrepancy, we propose list-level alignment—learning domain-invariant representations at the higher level of lists. The benefits are twofold: it leads to the first domain adaptation generalization bound for ranking, in turn providing theoretical support for the proposed method, and it achieves better empirical transfer performance for unsupervised domain adaptation on ranking tasks, including passage reranking.

TMLR Journal 2023 Journal Article

Linearized Relative Positional Encoding

Zhen Qin
Weixuan Sun
Kaiyue Lu
Hui Deng
Dongxu Li
Xiaodong Han
Yuchao Dai
Lingpeng Kong

Relative positional encoding is widely used in vanilla and linear transformers to represent positional information. However, existing encoding methods of a vanilla transformer are not always directly applicable to a linear transformer, because the latter requires a decomposition of the query and key representations into separate kernel functions. Nevertheless, principles for designing encoding methods suitable for linear transformers remain understudied. In this work, we put together a variety of existing linear relative positional encoding approaches under a canonical form and further propose a family of linear relative positional encoding algorithms via unitary transformation. Our formulation leads to a principled framework that can be used to develop new relative positional encoding methods that preserve linear space-time complexity. Equipped with different models, the proposed linearized relative positional encoding (LRPE) family derives effective encoding for various applications. Experiments show that compared with existing methods, LRPE achieves state-of-the-art performance in language modeling, text classification, and image classification. Meanwhile, it emphasizes a general paradigm for designing a broader range of relative positional encoding methods that are applicable to linear transformers.

NeurIPS Conference 2023 Conference Paper

RD-Suite: A Benchmark for Ranking Distillation

Zhen Qin
Rolf Jagerman
Rama Kumar Pasumarthi
Honglei Zhuang
He Zhang
Aijun Bai
Kai Hui
Le Yan

The distillation of ranking models has become an important topic in both academia and industry. In recent years, several advanced methods have been proposed to tackle this problem, often leveraging ranking information from teacher rankers that is absent in traditional classification settings. To date, there is no well-established consensus on how to evaluate this class of models. Moreover, inconsistent benchmarking on a wide range of tasks and datasets makes it difficult to assess or invigorate advances in this field. This paper first examines representative prior art on ranking distillation, and raises three questions to be answered around methodology and reproducibility. To that end, we propose a systematic and unified benchmark, Ranking Distillation Suite (RD-Suite), which is a suite of tasks with 4 large real-world datasets, encompassing two major modalities (textual and numeric) and two applications (standard distillation and distillation transfer). RD-Suite consists of benchmark results that challenge some of the common wisdom in the field, and the release of datasets with teacher scores and evaluation scripts for future research. RD-Suite paves the way towards better understanding of ranking distillation, facilitates more research in this direction, and presents new challenges.

NeurIPS Conference 2022 Conference Paper

Error Analysis of Tensor-Train Cross Approximation

Zhen Qin
Alexander Lidiak
Zhexuan Gong
Gongguo Tang
Michael B Wakin
Zhihui Zhu

Tensor train decomposition is widely used in machine learning and quantum physics due to its concise representation of high-dimensional tensors, overcoming the curse of dimensionality. Cross approximation, originally developed for representing a matrix from a set of selected rows and columns, is an efficient method for constructing a tensor train decomposition of a tensor from few of its entries. While tensor train cross approximation has achieved remarkable performance in practical applications, its theoretical analysis, in particular regarding the error of the approximation, is so far lacking. To our knowledge, existing results only provide element-wise approximation accuracy guarantees, which lead to a very loose bound when extended to the entire tensor. In this paper, we bridge this gap by providing accuracy guarantees in terms of the entire tensor for both exact and noisy measurements. Our results illustrate how the choice of selected subtensors affects the quality of the cross approximation and that the approximation error caused by model error and/or measurement error may not grow exponentially with the order of the tensor. These results are verified by numerical experiments, and may have important implications for the usefulness of cross approximations for high-order tensors, such as those encountered in the description of quantum many-body states.
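
The matrix case that this abstract refers to can be sketched in a few lines. The example below is restricted to rank 2 so that the 2 x 2 intersection block can be inverted in closed form; the function name `cross_approx_rank2` and the use of exact fractions are illustrative choices, and the tensor train extension studied in the paper is not shown.

```python
from fractions import Fraction


def cross_approx_rank2(a, rows, cols):
    """Rebuild a matrix from two rows and two columns: A ~ C * U^{-1} * R.

    C holds the selected columns, R the selected rows, and U is their
    2 x 2 intersection. For an exactly rank-2 matrix with invertible U,
    the reconstruction is exact.
    """
    r0, r1 = rows
    c0, c1 = cols
    p, q = a[r0][c0], a[r0][c1]
    r, s = a[r1][c0], a[r1][c1]
    det = Fraction(p * s - q * r)
    if det == 0:
        raise ValueError("selected rows and columns give a singular intersection")
    # Closed-form inverse of the 2 x 2 intersection block U.
    inv = [[s / det, -q / det], [-r / det, p / det]]
    out = []
    for i in range(len(a)):
        ci = (a[i][c0], a[i][c1])  # row i of C
        # Weights w = C[i, :] @ U^{-1}.
        w0 = ci[0] * inv[0][0] + ci[1] * inv[1][0]
        w1 = ci[0] * inv[0][1] + ci[1] * inv[1][1]
        # Row i of the approximation is w @ R.
        out.append([w0 * a[r0][j] + w1 * a[r1][j] for j in range(len(a[0]))])
    return out
```

Only the entries in the chosen rows and columns are read, which is the sense in which cross approximation works "from few of its entries".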

PDF Details

JBHI Journal 2022 Journal Article

MVFusFra: A Multi-View Dynamic Fusion Framework for Multimodal Brain Tumor Segmentation

Yi Ding
Wei Zheng
Ji Geng
Zhen Qin
Kim-Kwang Raymond Choo
Zhiguang Qin
Xiaolin Hou

Medical practitioners generally rely on multimodal brain images, for example based on the information from the axial, coronal, and sagittal views, to inform brain tumor diagnosis. Hence, to further utilize the 3D information embedded in such datasets, this paper proposes a multi-view dynamic fusion framework (hereafter referred to as MVFusFra) to improve the performance of brain tumor segmentation. The proposed framework consists of three key building blocks. First, a multi-view deep neural network architecture, which comprises multiple learning networks for segmenting the brain tumor from different views, with each deep neural network corresponding to multimodal brain images from a single view. Second, the dynamic decision fusion method, which is mainly used to fuse segmentation results from the multiple views into an integrated one; two different fusion methods (i.e., voting and weighted averaging) are used to evaluate the fusing process. Third, the multi-view fusion loss (comprising segmentation loss, transition loss, and decision loss) is proposed to facilitate the training process of the multi-view learning networks, so as to ensure consistency in appearance and space, for both fusing segmentation results and the training of the learning network. We evaluate the performance of MVFusFra on the BRATS 2015 and BRATS 2018 datasets. Findings from the evaluations suggest that fusion results from multiple views achieve better performance than segmentation results from a single view, and also imply the effectiveness of the proposed multi-view fusion loss. A comparative summary also shows that MVFusFra achieves better segmentation performance, in terms of efficiency, in comparison to other competing approaches.

Details DOI

NeurIPS Conference 2022 Conference Paper

Transformer Memory as a Differentiable Search Index

Yi Tay
Vinh Tran
Mostafa Dehghani
Jianmo Ni
Dara Bahri
Harsh Mehta
Zhen Qin
Kai Hui

In this paper, we demonstrate that information retrieval can be accomplished with a single Transformer, in which all information about the corpus is encoded in the parameters of the model. To this end, we introduce the Differentiable Search Index (DSI), a new paradigm that learns a text-to-text model that maps string queries directly to relevant docids; in other words, a DSI model answers queries directly using only its parameters, dramatically simplifying the whole retrieval process. We study variations in how documents and their identifiers are represented, variations in training procedures, and the interplay between models and corpus sizes. Experiments demonstrate that given appropriate design choices, DSI significantly outperforms strong baselines such as dual encoder models. Moreover, DSI demonstrates strong generalization capabilities, outperforming a BM25 baseline in a zero-shot setup.

PDF Details

ICML Conference 2020 Conference Paper

Do RNN and LSTM have Long Memory?

Jingyu Zhao 0001
Feiqing Huang
Jia Lv
Yanjie Duan
Zhen Qin
Guodong Li
Guangjian Tian

The LSTM network was proposed to overcome the difficulty in learning long-term dependence, and has made significant advancements in applications. With its success and drawbacks in mind, this paper raises the question: do RNN and LSTM have long memory? We answer it partially by proving that RNN and LSTM do not have long memory from a statistical perspective. A new definition for long memory networks is further introduced, and it requires the model weights to decay at a polynomial rate. To verify our theory, we convert RNN and LSTM into long memory networks by making a minimal modification, and their superiority is illustrated in modeling long-term dependence of various datasets.

Details

AAAI Conference 2018 Conference Paper

Hawkes Process Inference With Missing Data

Christian Shelton
Zhen Qin
Chandini Shetty

A multivariate Hawkes process is a class of marked point processes: A sample consists of a finite set of events of unbounded random size; each event has a real-valued time and a discrete-valued label (mark). It is self-excitatory: Each event causes an increase in the rate of other events (of either the same or a different label) in the (near) future. Prior work has developed methods for parameter estimation from complete samples. However, just as unobserved variables can increase the modeling power of other probabilistic models, allowing unobserved events can increase the modeling power of point processes. In this paper we develop a method to sample over the posterior distribution of unobserved events in a multivariate Hawkes process. We demonstrate the efficacy of our approach, and its utility in improving predictive power and identifying latent structure in real-world data.
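
To make the self-excitatory behavior described in the abstract above concrete, here is a minimal sketch of the conditional intensity of a multivariate Hawkes process. It assumes an exponentially decaying excitation kernel, which the abstract does not specify; the names `hawkes_intensity`, `mu`, `alpha`, and `beta`, and all numeric values, are hypothetical. It does not implement the paper's posterior sampling over unobserved events.

```python
import math

def hawkes_intensity(t, label, events, mu, alpha, beta):
    """Rate of events with the given label at time t (hypothetical helper).

    events: list of (time, label) pairs observed so far
    mu:     baseline rate per label
    alpha:  alpha[src][dst] is the jump in the rate of label dst
            caused by an event with label src
    beta:   decay rate of the assumed exponential kernel
    """
    rate = mu[label]
    for t_i, l_i in events:
        if t_i < t:  # only strictly earlier events excite the process
            rate += alpha[l_i][label] * math.exp(-beta * (t - t_i))
    return rate

# Illustrative parameters for two labels (assumed values).
mu = [0.5, 0.2]
alpha = [[0.3, 0.1],
         [0.0, 0.4]]
events = [(1.0, 0), (2.0, 1)]

before = hawkes_intensity(0.5, 0, events, mu, alpha, beta=1.0)
after = hawkes_intensity(1.1, 0, events, mu, alpha, beta=1.0)
```

Here `before` equals the baseline rate because no event has occurred yet, while `after` is larger because the event at time 1.0 temporarily raises the rate of label 0.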

PDF Details