Author name cluster

Zhenyu Zhang

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

27 papers

2 author rows

AAAI Conference 2026 Conference Paper

Decoupling Template Bias in CLIP: Harnessing Empty Prompts for Enhanced Few-Shot Learning

Zhenyu Zhang
Guangyao Chen
Yixiong Zou
Zhimeng Huang
Yuhua Li

The Contrastive Language-Image Pre-Training (CLIP) model excels in few-shot learning by aligning visual and textual representations. Our study shows that template-sample similarity (TSS), defined as the resemblance between a text template and an image sample, introduces bias. This bias leads the model to rely on template proximity rather than true sample-to-category alignment, reducing both accuracy and robustness in classification. We present a framework that uses empty prompts, textual inputs that convey the idea of “emptiness” without category information. These prompts capture unbiased template features and offset TSS bias. The framework employs two stages. During pre-training, empty prompts reveal and reduce template-induced bias within the CLIP encoder. During few-shot fine-tuning, a bias calibration loss enforces correct alignment between images and their categories, ensuring the model focuses on relevant visual cues. Experiments across multiple benchmarks demonstrate that our template correction method significantly reduces performance fluctuations caused by TSS, yielding higher classification accuracy and stronger robustness.

PDF Details DOI

AAAI Conference 2026 Conference Paper

Dual-Kernel Graph Community Contrastive Learning

Xiang Chen
Kun Yue
Wenjie Liu
Zhenyu Zhang
Liang Duan

Graph Contrastive Learning (GCL) has emerged as a powerful paradigm for training Graph Neural Networks (GNNs) in the absence of task-specific labels. However, its scalability on large-scale graphs is hindered by the intensive message passing mechanism of GNN and the quadratic computational complexity of contrastive loss over positive and negative node pairs. To address these issues, we propose an efficient GCL framework that transforms the input graph into a compact network of interconnected node sets while preserving structural information across communities. We firstly introduce a kernelized graph community contrastive loss with linear complexity, enabling effective information transfer among node sets to capture hierarchical structural information of the graph. We then incorporate a knowledge distillation technique into the decoupled GNN architecture to accelerate inference while maintaining strong generalization performance. Extensive experiments on sixteen real-world datasets of varying scales demonstrate that our method outperforms state-of-the-art GCL baselines in both effectiveness and scalability.

PDF Details DOI

AAAI Conference 2025 Conference Paper

Anywhere: A Multi-Agent Framework for User-Guided, Reliable, and Diverse Foreground-Conditioned Image Generation

Xie Tianyidan
Rui Ma
Qian Wang
Xiaoqian Ye
Feixuan Liu
Ying Tai
Zhenyu Zhang
Lanjun Wang

Recent advancements in image-conditioned image generation have demonstrated substantial progress. However, foreground-conditioned image generation remains underexplored, encountering challenges such as compromised object integrity, foreground-background inconsistencies, limited diversity, and reduced control flexibility. These challenges arise from current end-to-end inpainting models, which suffer from inaccurate training masks, limited foreground semantic understanding, data distribution biases, and inherent interference between visual and textual prompts. To overcome these limitations, we present Anywhere, a multi-agent framework that departs from the traditional end-to-end approach. In this framework, each agent is specialized in a distinct aspect, such as foreground understanding, diversity enhancement, object integrity protection, and textual prompt consistency. Our framework is further enhanced with the ability to incorporate optional user textual inputs, perform automated quality assessments, and initiate re-generation as needed. Comprehensive experiments demonstrate that this modular design effectively overcomes the limitations of existing end-to-end models, resulting in higher fidelity, quality, diversity and controllability in foreground-conditioned image generation. Additionally, the Anywhere framework is extensible, allowing it to benefit from future advancements in each individual agent.

PDF Details DOI

EAAI Journal 2025 Journal Article

Bogie doctor: Automated fault diagnosis in high-speed trains via deep learning-based maintenance log analysis

Zhenyu Zhang
Runsheng Miao
Chao Wang
Yong Qin

Details DOI

AAAI Conference 2025 Conference Paper

Exploiting Multimodal Spatial-temporal Patterns for Video Object Tracking

Xiantao Hu
Ying Tai
Xu Zhao
Chen Zhao
Zhenyu Zhang
Jun Li
Bineng Zhong
Jian Yang

Multimodal tracking has garnered widespread attention as a result of its ability to effectively address the inherent limitations of traditional RGB tracking. However, existing multimodal trackers mainly focus on the fusion and enhancement of spatial features or merely leverage the sparse temporal relationships between video frames. These approaches do not fully exploit the temporal correlations in multimodal videos, making it difficult to capture the dynamic changes and motion information of targets in complex scenarios. To alleviate this problem, we propose a unified multimodal spatial-temporal tracking approach named STTrack. In contrast to previous paradigms that solely relied on updating reference information, we introduced a temporal state generator (TSG) that continuously generates a sequence of tokens containing multimodal temporal information. These temporal information tokens are used to guide the localization of the target in the next time state, establish long-range contextual relationships between video frames, and capture the temporal trajectory of the target. Furthermore, at the spatial level, we introduced the mamba fusion and background suppression interactive (BSI) modules. These modules establish a dual-stage mechanism for coordinating information interaction and fusion between modalities. Extensive comparisons on five benchmark datasets illustrate that STTrack achieves state-of-the-art performance across various multimodal tracking scenarios.

PDF Details DOI

IROS Conference 2025 Conference Paper

MARS-FTCP: Robust Fault-Tolerant Control and Agile Trajectory Planning for Modular Aerial Robot Systems

Rui Huang
Zhenyu Zhang
Siyu Tang
Zhiqian Cai
Lin Zhao 0009

Modular Aerial Robot Systems (MARS) consist of multiple drone units that can self-reconfigure to adapt to various mission requirements and fault conditions. However, existing fault-tolerant control methods exhibit significant oscillations during docking and separation, impacting system stability. To address this issue, we propose a novel fault-tolerant control reallocation method that adapts to an arbitrary number of modular robots and their assembly formations. The algorithm redistributes the expected collective force and torque required for MARS to individual units according to their moment arm relative to the center of MARS mass. Furthermore, we propose an agile trajectory planning method for MARS of arbitrary configurations, which is collision-avoiding and dynamically feasible. Our work represents the first comprehensive approach to enable fault-tolerant and collision avoidance flight for MARS. We validate our method through extensive simulations, demonstrating improved fault tolerance, enhanced trajectory tracking accuracy, and greater robustness in cluttered environments. The videos and source code of this work are available at https://github.com/RuiHuangNUS/MARS-FTCP/

Details

JBHI Journal 2025 Journal Article

Pre-Operative Overall Survival Prediction of Diffuse Glioma Enhanced by Longitudinal Data

Zhenyu Tang
Jiannan Li
Jingliang Cheng
Zhi-Cheng Li
Zhenyu Zhang
Jing Yan

Many pre-operative overall survival (OS) prediction methods have been proposed to assist personalized treatment of diffuse glioma for better prognosis. Most of them utilize pre-operative data, while post-operative data, which contains essential prognosis-related information (e. g. , surgical outcomes and lesion evolution) is neglected, hindering prediction accuracy. However, incorporating post-operative data could make OS prediction inapplicable at pre-operative stage, affecting clinical utility. To address this contradiction, in this paper, we propose an effective framework that leverages longitudinal data (pre- and post-operative data) to enhance pre-operative OS prediction. Specifically, two OS prediction networks are built in a knowledge distillation framework. One is the teacher network trained with longitudinal data, and the other is the student network relying solely on pre-operative data. Distillation of deep features is conducted to align the performance of the student network with that of the teacher network. Moreover, mass effect and its distillation are adopted to incorporate lesion evolution information, further enhancing prediction performance. Based on our framework, the student network can leverage essential post-operative information without compromising its applicability at pre-operative stage. Experiments on both in-house and public datasets demonstrate that the student network outperforms all state-of-the-art methods under evaluation with statistical significance. Further ablation study reveals that distillation of mass effect and deep features play positive roles in OS prediction. Moreover, new prognosis-related factors are discovered by comparing the student network with and without distillation.

Details DOI

NeurIPS Conference 2025 Conference Paper

ReCAP: Recursive Context-Aware Reasoning and Planning for Large Language Model Agents

Zhenyu Zhang
Tianyi Chen
Weiran Xu
Alex Pentland
Jiaxin Pei

Long-horizon tasks requiring multi-step reasoning and dynamic re-planning remain challenging for large language models (LLMs). Sequential prompting methods are prone to context drift, loss of goal information, and recurrent failure cycles, while hierarchical prompting methods often weaken cross-level continuity or incur substantial runtime overhead. We introduce ReCAP (Recursive Context-Aware Reasoning and Planning), a hierarchical framework with shared context for reasoning and planning in LLMs. ReCAP combines three key mechanisms: (i) plan-ahead decomposition, in which the model generates a full subtask list, executes the first item, and refines the remainder; (ii) structured re-injection of parent plans, maintaining consistent multi-level context during recursive return; and (iii) memory-efficient execution, bounding the active prompt so costs scale linearly with task depth. Together these mechanisms align high-level goals with low-level actions, reduce redundant prompting, and preserve coherent context updates across recursion. Experiments demonstrate that ReCAP substantially improves subgoal alignment and success rates on various long-horizon reasoning benchmarks, achieving a 32\% gain on synchronous Robotouille and a 29\% improvement on asynchronous Robotouille under the strict pass@1 protocol.

PDF Details

IJCAI Conference 2025 Conference Paper

Semantic-Guided Diffusion Model for Single-Step Image Super-Resolution

Zihang Liu
Zhenyu Zhang
Hao Tang

Diffusion-based image super-resolution (SR) methods have demonstrated remarkable performance. Recent advancements have introduced deterministic sampling processes that reduce inference from 15 iterative steps to a single step, thereby significantly improving the inference speed of existing diffusion models. However, their efficiency remains limited when handling complex semantic regions due to the single-step inference. To address this limitation, we propose SAMSR, a semantic-guided diffusion framework that incorporates semantic segmentation masks into the sampling process. Specifically, we introduce the SAM-Noise Module, which refines Gaussian noise using segmentation masks to preserve spatial and semantic features. Furthermore, we develop a pixel-wise sampling strategy that dynamically adjusts the residual transfer rate and noise strength based on pixel-level semantic weights, prioritizing semantically rich regions during the diffusion process. To enhance model training, we also propose a semantic consistency loss, which aligns pixel-wise semantic weights between predictions and ground truth. Extensive experiments on both real-world and synthetic datasets demonstrate that SAMSR significantly improves perceptual quality and detail recovery, particularly in semantically complex images.

PDF Details DOI

EAAI Journal 2025 Journal Article

Spatial–temporal intention representation with multi-agent reinforcement learning for unmanned surface vehicles strategies learning in asset guarding task

Yang Li
Shaorong Xie
Hang Yu
Han Zhang
Zhenyu Zhang
Xiangfeng Luo

Details DOI

EAAI Journal 2024 Journal Article

A graph neural network-based teammate recommendation model for knowledge-intensive crowdsourcing

Zhenyu Zhang
Wenxin Yao
Fangzheng Li
Jiayan Yu
Vladimir Simic
Xicheng Yin

Details DOI

AAAI Conference 2024 Conference Paper

AltNeRF: Learning Robust Neural Radiance Field via Alternating Depth-Pose Optimization

Kun Wang
Zhiqiang Yan
Huang Tian
Zhenyu Zhang
Xiang Li
Jun Li
Jian Yang

Neural Radiance Fields (NeRF) have shown promise in generating realistic novel views from sparse scene images. However, existing NeRF approaches often encounter challenges due to the lack of explicit 3D supervision and imprecise camera poses, resulting in suboptimal outcomes. To tackle these issues, we propose AltNeRF---a novel framework designed to create resilient NeRF representations using self-supervised monocular depth estimation (SMDE) from monocular videos, without relying on known camera poses. SMDE in AltNeRF masterfully learns depth and pose priors to regulate NeRF training. The depth prior enriches NeRF's capacity for precise scene geometry depiction, while the pose prior provides a robust starting point for subsequent pose refinement. Moreover, we introduce an alternating algorithm that harmoniously melds NeRF outputs into SMDE through a consistence-driven mechanism, thus enhancing the integrity of depth priors. This alternation empowers AltNeRF to progressively refine NeRF representations, yielding the synthesis of realistic novel views. Extensive experiments showcase the compelling capabilities of AltNeRF in generating high-fidelity and robust novel views that closely resemble reality.

PDF Details DOI

NeurIPS Conference 2024 Conference Paper

DHA: Learning Decoupled-Head Attention from Transformer Checkpoints via Adaptive Heads Fusion

Yilong Chen
Linhao Zhang
Junyuan Shang
Zhenyu Zhang
Tingwen Liu
Shuohuan Wang
Yu Sun

Large language models (LLMs) with billions of parameters demonstrate impressive performance. However, the widely used Multi-Head Attention (MHA) in LLMs incurs substantial computational and memory costs during inference. While some efforts have optimized attention mechanisms by pruning heads or sharing parameters among heads, these methods often lead to performance degradation or necessitate substantial continued pre-training costs to restore performance. Based on the analysis of attention redundancy, we design a Decoupled-Head Attention (DHA) mechanism. DHA adaptively configures group sharing for key heads and value heads across various layers, achieving a better balance between performance and efficiency. Inspired by the observation of clustering similar heads, we propose to progressively transform the MHA checkpoint into the DHA model through linear fusion of similar head parameters step by step, retaining the parametric knowledge of the MHA checkpoint. We construct DHA models by transforming various scales of MHA checkpoints given target head budgets. Our experiments show that DHA remarkably requires a mere 0. 25\% of the original model's pre-training budgets to achieve 96. 1\% of performance while saving 75\% of KV cache. Compared to Group-Query Attention (GQA), DHA achieves a 5$\times$ training acceleration, a maximum of 13. 93\% performance improvement under 0. 01\% pre-training budget, and 5\% relative improvement under 0. 05\% pre-training budget.

PDF Details DOI

NeurIPS Conference 2024 Conference Paper

Found in the Middle: How Language Models Use Long Contexts Better via Plug-and-Play Positional Encoding

Zhenyu Zhang
Runjin Chen
Shiwei Liu
Zhewei Yao
Olatunji Ruwase
Beidi Chen
Xiaoxia Wu
Zhangyang Wang

This paper aims to overcome the ``lost-in-the-middle'' challenge of large language models (LLMs). While recent advancements have successfully enabled LLMs to perform stable language modeling with up to 4 million tokens, the persistent difficulty faced by most LLMs in identifying relevant information situated in the middle of the context has not been adequately tackled. To address this problem, this paper introduces Multi-scale Positional Encoding (Ms-PoE) which is a simple yet effective plug-and-play approach to enhance the capacity of LLMs to handle the relevant information located in the middle of the context, without fine-tuning or introducing any additional overhead. Ms-PoE leverages the position indice rescaling to relieve the long-term decay effect introduced by RoPE, while meticulously assigning distinct scaling ratios to different attention heads to preserve essential knowledge learned during the pre-training step, forming a multi-scale context fusion from short to long distance. Extensive experiments with a wide range of LLMs demonstrate the efficacy of our approach. Notably, Ms-PoE achieves an average accuracy gain of up to 3. 8 on the Zero-SCROLLS benchmark over the original LLMs. Code will be made public upon acceptence.

PDF Details DOI

NeurIPS Conference 2024 Conference Paper

Learning to Decouple the Lights for 3D Face Texture Modeling

Tianxin Huang
Zhenyu Zhang
Ying Tai
Gim Hee Lee

Existing research has made impressive strides in reconstructing human facial shapes and textures from images with well-illuminated faces and minimal external occlusions. Nevertheless, it remains challenging to recover accurate facial textures from scenarios with complicated illumination affected by external occlusions, \eg a face that is partially obscured by items such as a hat. Existing works based on the assumption of single and uniform illumination cannot correctly process these data. In this work, we introduce a novel approach to model 3D facial textures under such unnatural illumination. Instead of assuming single illumination, our framework learns to imitate the unnatural illumination as a composition of multiple separate light conditions combined with learned neural representations, named Light Decoupling. According to experiments on both single images and video sequences, we demonstrate the effectiveness of our approach in modeling facial textures under challenging illumination affected by occlusions.

PDF Details DOI

AAAI Conference 2024 Conference Paper

Sparsity-Guided Holistic Explanation for LLMs with Interpretable Inference-Time Intervention

Zhen Tan
Tianlong Chen
Zhenyu Zhang
Huan Liu

Large Language Models (LLMs) have achieved unprecedented breakthroughs in various natural language processing domains. However, the enigmatic ``black-box'' nature of LLMs remains a significant challenge for interpretability, hampering transparent and accountable applications. While past approaches, such as attention visualization, pivotal subnetwork extraction, and concept-based analyses, offer some insight, they often focus on either local or global explanations within a single dimension, occasionally falling short in providing comprehensive clarity. In response, we propose a novel methodology anchored in sparsity-guided techniques, aiming to provide a holistic interpretation of LLMs. Our framework, termed SparseCBM, innovatively integrates sparsity to elucidate three intertwined layers of interpretation: input, subnetwork, and concept levels. In addition, the newly introduced dimension of interpretable inference-time intervention facilitates dynamic adjustments to the model during deployment. Through rigorous empirical evaluations on real-world datasets, we demonstrate that SparseCBM delivers a profound understanding of LLM behaviors, setting it apart in both interpreting and ameliorating model inaccuracies. Codes are provided in supplements.

PDF Details DOI

AAAI Conference 2023 Conference Paper

DesNet: Decomposed Scale-Consistent Network for Unsupervised Depth Completion

Zhiqiang Yan
Kun Wang
Xiang Li
Zhenyu Zhang
Jun Li
Jian Yang

Unsupervised depth completion aims to recover dense depth from the sparse one without using the ground-truth annotation. Although depth measurement obtained from LiDAR is usually sparse, it contains valid and real distance information, i.e., scale-consistent absolute depth values. Meanwhile, scale-agnostic counterparts seek to estimate relative depth and have achieved impressive performance. To leverage both the inherent characteristics, we thus suggest to model scale-consistent depth upon unsupervised scale-agnostic frameworks. Specifically, we propose the decomposed scale-consistent learning (DSCL) strategy, which disintegrates the absolute depth into relative depth prediction and global scale estimation, contributing to individual learning benefits. But unfortunately, most existing unsupervised scale-agnostic frameworks heavily suffer from depth holes due to the extremely sparse depth input and weak supervisory signal. To tackle this issue, we introduce the global depth guidance (GDG) module, which attentively propagates dense depth reference into the sparse target via novel dense-to-sparse attention. Extensive experiments show the superiority of our method on outdoor KITTI, ranking 1st and outperforming the best KBNet more than 12% in RMSE. Additionally, our approach achieves state-of-the-art performance on indoor NYUv2 benchmark as well.

PDF Details DOI

NeurIPS Conference 2023 Conference Paper

H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models

Zhenyu Zhang
Ying Sheng
Tianyi Zhou
Tianlong Chen
Lianmin Zheng
Ruisi Cai
Zhao Song
Yuandong Tian

Large Language Models (LLMs), despite their recent impressive accomplishments, are notably cost-prohibitive to deploy, particularly for applications involving long-content generation, such as dialogue systems and story writing. Often, a large amount of transient state information, referred to as the $\mathsf{KV}$ $\mathsf{cache}$, is stored in GPU memory in addition to model parameters, scaling linearly with the sequence length and batch size. In this paper, we introduce a novel approach for implementing the $\mathsf{KV}$ $\mathsf{cache}$ which significantly reduces its memory footprint. Our approach is based on the noteworthy observation that a small portion of tokens contributes most of the value when computing attention scores. We call these tokens Heavy Hitters ($\mathsf{H_2}$). Through a comprehensive investigation, we find that ($i$) the emergence of $\mathsf{H_2}$ is natural and strongly correlates with the frequent co-occurrence of tokens in the text, and ($ii$) removing them results in significant performance degradation. Based on these insights, we propose Heavy Hitter Oracle ($\mathsf{H_2O}$), a $\mathsf{KV}$ $\mathsf{cache}$ eviction policy that dynamically retains a balance of recent and $\mathsf{H_2}$ tokens. We formulate the $\mathsf{KV}$ $\mathsf{cache}$ eviction as a dynamic submodular problem and prove (under mild assumptions) a theoretical guarantee for our novel eviction algorithm which could help guide future work. We validate the accuracy of our algorithm with OPT, LLaMA, and GPT-NeoX across a wide range of tasks. Our implementation of $\mathsf{H_2O}$ with 20\% heavy hitters improves the throughput over three leading inference systems DeepSpeed Zero-Inference, Hugging Face Accelerate, and FlexGen by up to $29\times$, $29\times$, and $3\times$ on OPT-6. 7B and OPT-30B. With the same batch size, $\mathsf{H_2O}$ can reduce the latency by up to $1. 9\times$.

PDF Details

TMLR Journal 2022 Journal Article

Can You Win Everything with A Lottery Ticket?

Tianlong Chen
Zhenyu Zhang
Jun Wu
Randy Huang
Sijia Liu
Shiyu Chang
Zhangyang Wang

$\textit{Lottery ticket hypothesis}$ (LTH) has demonstrated to yield independently trainable and highly sparse neural networks (a.k.a. $\textit{winning tickets}$), whose test set accuracies can be surprisingly on par or even better than dense models. However, accuracy is far from the only evaluation metric, and perhaps not always the most important one. Hence it might be myopic to conclude that a sparse subnetwork can replace its dense counterpart, even if the accuracy is preserved. Spurred by that, we perform the first comprehensive assessment of lottery tickets from diverse aspects beyond test accuracy, including $\textit{(i)}$ generalization to distribution shifts, $\textit{(ii)}$ prediction uncertainty, $\textit{(iii)}$ interpretability, and $\textit{(iv)}$ geometry of loss landscapes. With extensive experiments across datasets {CIFAR-10, CIFAR-100, and ImageNet}, model architectures, as well as tens of sparsification methods, we thoroughly characterize the trade-off between model sparsity and the all-dimension model capabilities. We find that an appropriate sparsity (e.g., $20\%\sim99.08\%$) can yield the winning ticket to perform comparably or even better $\textbf{in all above four aspects}$, although some aspects (generalization to certain distribution shifts, and uncertainty) appear more sensitive to the sparsification than others. We term it as a $\texttt{LTH-PASS}$. Overall, our results endorse choosing a good sparse subnetwork of a larger dense model, over directly training a small dense model of similar parameter counts. We hope that our study can offer more in-depth insights on pruning, for researchers and engineers who seek to incorporate sparse neural networks for user-facing deployments. Codes are available in: https://github.com/VITA-Group/LTH-Pass.

PDF Details

NeurIPS Conference 2022 Conference Paper

Randomized Channel Shuffling: Minimal-Overhead Backdoor Attack Detection without Clean Datasets

Ruisi Cai
Zhenyu Zhang
Tianlong Chen
Xiaohan Chen
Zhangyang Wang

Deep neural networks (DNNs) typically require massive data to train on, which is a hurdle for numerous practical domains. Facing the data shortfall, one viable option is to acquire domain-specific training data from external uncensored sources, such as open webs or third-party data collectors. However, the quality of such acquired data is often not rigorously scrutinized, and one cannot easily rule out the risk of `"poisoned" examples being included in such unreliable datasets, resulting in unreliable trained models which pose potential risks to many high-stake applications. While existing options usually suffer from high computational costs or assumptions on clean data access, this paper attempts to detect backdoors for potential victim models with minimal prior knowledge. In particular, provided with a trained model, users are assumed to (1) have no prior knowledge of whether it is already poisoned, or what the target class/percentage of samples is poisoned, and (2) have no access to a clean sample set from the same training distribution, nor any trusted model trained on such clean data. To tackle this challenging scenario, we first observe the contrasting channel-level statistics between the backdoor trigger and clean image features, and consequently, how they can be differentiated by progressive channel shuffling. We then propose the randomized channel shuffling method for backdoor-targeted class detection, which requires only a few feed-forward passes. It thus incurs minimal overheads and demands no clean sample nor prior knowledge. We further explore a “full” clean data-free setting, where neither the target class detection nor the trigger recovery can access the clean data. Extensive experiments are conducted with three datasets (CIFAR-10, GTSRB, Tiny ImageNet), three architectures (AlexNet, ResNet-20, SENet-18), and three attacks (BadNets, clean label attack, and WaNet). Results consistently endorse the effectiveness of our proposed technique in backdoor model detection, with margins of 0. 291 ～ 0. 640 AUROC over the current state-of-the-arts. Codes are available at https: //github. com/VITA-Group/Random-Shuffling-BackdoorDetect.

PDF Details

NeurIPS Conference 2022 Conference Paper

Sparse Winning Tickets are Data-Efficient Image Recognizers

Mukund Varma T
Xuxi Chen
Zhenyu Zhang
Tianlong Chen
Subhashini Venugopalan
Zhangyang Wang

Improving the performance of deep networks in data-limited regimes has warranted much attention. In this work, we empirically show that “winning tickets” (small sub-networks) obtained via magnitude pruning based on the lottery ticket hypothesis, apart from being sparse are also effective recognizers in data-limited regimes. Based on extensive experiments, we find that in low data regimes (datasets of 50-100 examples per class), sparse winning tickets substantially outperform the original dense networks. This approach, when combined with augmentations or fine-tuning from a self-supervised backbone network, shows further improvements in performance by as much as 16% (absolute) on low-sample datasets and long-tailed classification. Further, sparse winning tickets are more robust to synthetic noise and distribution shifts compared to their dense counterparts. Our analysis of winning tickets on small datasets indicates that, though sparse, the networks retain density in the initial layers and their representations are more generalizable. Code is available at https: //github. com/VITA-Group/DataEfficientLTH.

PDF Details

NeurIPS Conference 2021 Conference Paper

You are caught stealing my winning lottery ticket! Making a lottery ticket claim its ownership

Xuxi Chen
Tianlong Chen
Zhenyu Zhang
Zhangyang Wang

Despite tremendous success in many application scenarios, the training and inference costs of using deep learning are also rapidly increasing over time. The lottery ticket hypothesis (LTH) emerges as a promising framework to leverage a special sparse subnetwork (i. e. , $\textit{winning ticket}$) instead of a full model for both training and inference, that can lower both costs without sacrificing the performance. The main resource bottleneck of LTH is however the extraordinary cost to find the sparse mask of the winning ticket. That makes the found winning ticket become a valuable asset to the owners, highlighting the necessity of protecting its copyright. Our setting adds a new dimension to the recently soaring interest in protecting against the intellectual property (IP) infringement of deep models and verifying their ownerships, since they take owners' massive/unique resources to develop or train. While existing methods explored encrypted weights or predictions, we investigate a unique way to leverage sparse topological information to perform $\textit{lottery verification}$, by developing several graph-based signatures that can be embedded as credentials. By further combining trigger set-based methods, our proposal can work in both white-box and black-box verification scenarios. Through extensive experiments, we demonstrate the effectiveness of lottery verification in diverse models (ResNet-20, ResNet-18, ResNet-50) on CIFAR-10 and CIFAR-100. Specifically, our verification is shown to be robust to removal attacks such as model fine-tuning and pruning, as well as several ambiguity attacks. Our codes are available at https: //github. com/VITA-Group/NO-stealing-LTH.

PDF Details

AAAI Conference 2020 Conference Paper

Cross-Modal Attention Network for Temporal Inconsistent Audio-Visual Event Localization

Hanyu Xuan
Zhenyu Zhang
Shuo Chen
Jian Yang
Yan Yan

In human multi-modality perception systems, the beneﬁts of integrating auditory and visual information are extensive as they provide plenty supplementary cues for understanding the events. Despite some recent methods proposed for such application, they cannot deal with practical conditions with temporal inconsistency. Inspired by human system which puts different focuses at speciﬁc locations, time segments and media while performing multi-modality perception, we provide an attention-based method to simulate such process. Similar to human mechanism, our network can adaptively select “where” to attend, “when” to attend and “which” to attend for audio-visual event localization. In this way, even with large temporal inconsistent between vision and audio, our network is able to adaptively trade information between different modalities and successfully achieve event localization. Our method achieves state-of-the-art performance on AVE (Audio-Visual Event) dataset collected in the real life. In addition, we also systemically investigate audio-visual event localization tasks. The visualization results also help us better understand how our model works.

PDF Details

AAAI Conference 2020 Conference Paper

Distilling Knowledge from Well-Informed Soft Labels for Neural Relation Extraction

Zhenyu Zhang
Xiaobo Shu
Bowen Yu
Tingwen Liu
Jiapeng Zhao
Quangang Li
Li Guo

Extracting relations from plain text is an important task with wide application. Most existing methods formulate it as a supervised problem and utilize one-hot hard labels as the sole target in training, neglecting the rich semantic information among relations. In this paper, we aim to explore the supervision with soft labels in relation extraction, which makes it possible to integrate prior knowledge. Speciﬁcally, a bipartite graph is ﬁrst devised to discover type constraints between entities and relations based on the entire corpus. Then, we combine such type constraints with neural networks to achieve a knowledgeable model. Furthermore, this model is regarded as teacher to generate well-informed soft labels and guide the optimization of a student network via knowledge distillation. Besides, a multi-aspect attention mechanism is introduced to help student mine latent information from text. In this way, the enhanced student inherits the dark knowledge (e. g. , type constraints and relevance among relations) from teacher, and directly serves the testing scenarios without any extra constraints. We conduct extensive experiments on the TACRED and SemEval datasets, the experimental results justify the effectiveness of our approach.

PDF Details

ICML Conference 2020 Conference Paper

Learning with Feature and Distribution Evolvable Streams

Zhenyu Zhang
Peng Zhao 0006
Yuan Jiang 0001
Zhi-Hua Zhou

In many real-world applications, data are collected in the form of a stream, whose feature space can evolve over time. For instance, in the environmental monitoring task, features can be dynamically vanished or augmented due to the existence of expired old sensors and deployed new sensors. Furthermore, besides the evolvable feature space, the data distribution is usually changing in the streaming scenario. When both feature space and data distribution are evolvable, it is quite challenging to design algorithms with guarantees, particularly theoretical understandings of generalization ability. To address this difficulty, we propose a novel discrepancy measure for data with evolving feature space and data distribution, named the \emph{evolving discrepancy}. Based on that, we present the generalization error analysis, and the theory motivates the design of a learning algorithm which is further implemented by deep neural networks. Empirical studies on synthetic data verify the rationale of our proposed discrepancy measure, and extensive experiments on real-world tasks validate the effectiveness of our algorithm.

Details

ICML Conference 2020 Conference Paper

Safe Deep Semi-Supervised Learning for Unseen-Class Unlabeled Data

Lan-Zhe Guo
Zhenyu Zhang
Yuan Jiang 0001
Yufeng Li 0008
Zhi-Hua Zhou

Deep semi-supervised learning (SSL) has been recently shown very effectively. However, its performance is seriously decreased when the class distribution is mismatched, among which a common situation is that unlabeled data contains some classes not seen in the labeled data. Efforts on this issue remain to be limited. This paper proposes a simple and effective safe deep SSL method to alleviate the harm caused by it. In theory, the result learned from the new method is never worse than learning from merely labeled data, and it is theoretically guaranteed that its generalization approaches the optimal in the order $O(\sqrt{d\ln(n)/n})$, even faster than the convergence rate in supervised learning associated with massive parameters. In the experiment of benchmark data, unlike the existing deep SSL methods which are no longer as good as supervised learning in 40% of unseen-class unlabeled data, the new method can still achieve performance gain in more than 60% of unseen-class unlabeled data. Moreover, the proposal is suitable for many deep SSL algorithms and can be easily extended to handle other cases of class distribution mismatch.

Details

IJCAI Conference 2019 Conference Paper

Beyond Word Attention: Using Segment Attention in Neural Relation Extraction

Bowen Yu
Zhenyu Zhang
Tingwen Liu
Bin Wang
Sujian Li
Quangang Li

Relation extraction studies the issue of predicting semantic relations between pairs of entities in sentences. Attention mechanisms are often used in this task to alleviate the inner-sentence noise by performing soft selections of words independently. Based on the observation that information pertinent to relations is usually contained within segments (continuous words in a sentence), it is possible to make use of this phenomenon for better extraction. In this paper, we aim to incorporate such segment information into neural relation extractor. Our approach views the attention mechanism as linear-chain conditional random fields over a set of latent variables whose edges encode the desired structure, and regards attention weight as the marginal distribution of each word being selected as a part of the relational expression. Experimental results show that our method can attend to continuous relational expressions without explicit annotations, and achieve the state-of-the-art performance on the large-scale TACRED dataset.

PDF Details