Arrow Research search

Author name cluster

Kai Han

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

54 papers
2 author rows

Possible papers (54)

AAAI Conference 2026 Conference Paper

PocketLLM: Ultimate Compression of Large Language Models via Meta Networks

  • Ye Tian
  • Chengcheng Wang
  • Jing Han
  • Yehui Tang
  • Kai Han

As Large Language Models (LLMs) continue to grow in size, storing and transmitting them on edge devices becomes increasingly challenging. Traditional methods like quantization and pruning struggle to achieve extreme compression of LLMs without sacrificing accuracy. In this paper, we introduce PocketLLM, a novel approach to compress LLMs in a latent space via meta-networks. A simple encoder network is proposed to project the weights of LLMs into discrete latent vectors, which are then represented using a compact codebook. A lightweight decoder network is employed to map the codebook's representative vectors back to the original weight space. The compressed model thus consists solely of a small decoder, a concise codebook, and an index, allowing for significant compression of the large weights in LLMs. Extensive experiments show that PocketLLM achieves superior performance even at significantly high compression ratios, e.g., compressing Llama 2-7B by 10x with a negligible drop in accuracy.
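The codebook idea in the abstract can be sketched in a few lines. This is a hypothetical illustration of quantizing weight vectors against a fixed codebook and storing only indices; it is not the authors' meta-network, which learns the encoder, decoder, and codebook. All names below are ours:

```python
# Sketch: weights are mapped to their nearest codebook entry and stored as
# indices, so only the codebook plus an index array needs to be kept.

def quantize(weights, codebook):
    """Replace each weight vector by the index of its nearest codebook entry."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return [min(range(len(codebook)), key=lambda i: dist(w, codebook[i]))
            for w in weights]

def dequantize(indices, codebook):
    """Reconstruct approximate weights (here the 'decoder' is a plain lookup)."""
    return [codebook[i] for i in indices]

codebook = [[0.0, 0.0], [1.0, 1.0]]
weights = [[0.1, -0.1], [0.9, 1.2], [1.1, 0.8]]
idx = quantize(weights, codebook)       # compact storage: one int per vector
approx = dequantize(idx, codebook)      # approximate reconstruction
```

Storage drops from one float vector per weight group to one small integer plus a shared codebook, which is the source of the compression ratio.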

NeurIPS Conference 2025 Conference Paper

3DRS: MLLMs Need 3D-Aware Representation Supervision for Scene Understanding

  • Xiaohu Huang
  • Jingjing Wu
  • Qunyi Xie
  • Kai Han

Recent advances in scene understanding have leveraged multimodal large language models (MLLMs) for 3D reasoning by capitalizing on their strong 2D pretraining. However, the lack of explicit 3D data during MLLM pretraining limits 3D representation capability. In this paper, we investigate the 3D-awareness of MLLMs by evaluating multi-view correspondence and reveal a strong positive correlation between the quality of 3D-aware representation and downstream task performance. Motivated by this, we propose 3DRS, a framework that enhances MLLM 3D representation learning by introducing supervision from pretrained 3D foundation models. Our approach aligns MLLM visual features with rich 3D knowledge distilled from 3D models, effectively improving scene understanding. Extensive experiments across multiple benchmarks and MLLMs, covering visual grounding, captioning, and question answering, demonstrate consistent performance gains. Project: https://visual-ai.github.io/3drs/

JBHI Journal 2025 Journal Article

Bidirectional Prototype-Guided Consistency Constraint for Semi-Supervised Fetal Ultrasound Image Segmentation

  • Chongwen Lyu
  • Kai Han
  • Lu Liu
  • Jun Chen
  • Lele Ma
  • Zheng Pang
  • Zhe Liu

Fetal ultrasound (US) image segmentation plays an important role in fetal development assessment, maternal pregnancy management, and intrauterine surgery planning. However, obtaining large-scale, accurately annotated fetal US imaging data is time-consuming and labor-intensive, posing challenges to the application of deep learning in this field. To address this challenge, we propose a semi-supervised fetal US image segmentation method based on bidirectional prototype-guided consistency constraint (BiPCC). BiPCC utilizes the prototype to bridge labeled and unlabeled data and establishes interaction between them. Specifically, the model generates pseudo-labels using prototypes from labeled data and then utilizes these pseudo-labels to generate pseudo-prototypes for segmenting the labeled data inversely, thereby achieving bidirectional consistency. Additionally, uncertainty-based cross-supervision is incorporated to provide additional supervision signals, thereby enhancing the quality of pseudo-labels. Extensive experiments on two fetal US datasets demonstrate that BiPCC outperforms state-of-the-art methods for semi-supervised fetal US segmentation. Furthermore, experimental results on two additional medical segmentation datasets exhibit BiPCC's outstanding generalization capability for diverse medical image segmentation tasks. Our proposed method offers a novel insight for semi-supervised fetal US image segmentation and holds promise for further advancing the development of intelligent healthcare.
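The prototype-guided pseudo-labelling that BiPCC builds on can be illustrated with a toy sketch: compute a class prototype from labelled pixel features, then pseudo-label unlabelled pixels by their nearest prototype. Everything below (feature shapes, the nearest-prototype rule) is our simplification, not the paper's model:

```python
# Toy prototype-based pseudo-labelling on a 1-D "image".

def class_prototype(features, mask):
    """Mean feature over pixels labelled 1 in the binary mask."""
    picked = [f for f, m in zip(features, mask) if m == 1]
    dim = len(picked[0])
    return [sum(f[d] for f in picked) / len(picked) for d in range(dim)]

def pseudo_label(features, prototypes):
    """Assign each pixel to the nearest prototype (squared-distance argmin)."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return [min(range(len(prototypes)), key=lambda c: dist(f, prototypes[c]))
            for f in features]

# two labelled classes: background (0) and a fetal structure (1)
feats = [[0.1], [0.2], [0.9], [1.0]]
fg = class_prototype(feats, [0, 0, 1, 1])   # foreground prototype
bg = class_prototype(feats, [1, 1, 0, 0])   # background prototype
labels = pseudo_label([[0.05], [0.85]], [bg, fg])  # pseudo-labels for unlabelled pixels
```

BiPCC's "bidirectional" step then runs the same machinery in reverse, building pseudo-prototypes from these pseudo-labels to re-segment the labelled data.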

AIJ Journal 2025 Journal Article

Efficient and effective budget-feasible mechanisms for submodular valuations

  • Kai Han
  • Haotian Zhang
  • Shuang Cui

We revisit the classical problem of designing Budget-Feasible Mechanisms (BFMs) for submodular valuation functions, which has been extensively studied since the seminal paper of Singer [FOCS'10] due to their wide applications in crowdsourcing and social marketing. We propose TripleEagle, a novel algorithmic framework for designing BFMs, based on which we present several simple yet effective BFMs that achieve better approximation ratios than the state-of-the-art work. Moreover, our BFMs are the first in the literature to achieve linear query complexity under the value oracle model while ensuring obvious strategyproofness, making them more practical than the previous BFMs. We conduct extensive experiments to evaluate the empirical performance of our BFMs, and the experimental results demonstrate the superiority of our approach in terms of efficiency and effectiveness compared to the state-of-the-art BFMs.

AAAI Conference 2025 Conference Paper

Eve: Efficient Multimodal Vision Language Models with Elastic Visual Experts

  • Miao Rang
  • Zhenni Bi
  • Chuanjian Liu
  • Yehui Tang
  • Kai Han
  • Yunhe Wang

Multimodal vision language models (VLMs) have made significant progress with the support of continuously increasing model sizes and data volumes. Running VLMs on edge devices, however, has become a challenge for their widespread application. There are several efficient VLM efforts, but they often sacrifice linguistic capabilities to enhance multimodal abilities, or require extensive training. To address this quandary, we introduce the innovative framework of Efficient Vision Language Models with Elastic Visual Experts (Eve). By strategically incorporating adaptable visual expertise at multiple stages of training, Eve strikes a balance between preserving linguistic abilities and augmenting multimodal capabilities. This balanced approach results in a versatile model with only 1.8B parameters that delivers significant improvements in both multimodal and linguistic tasks. Notably, among models below 3B parameters, Eve distinctly outperforms on language benchmarks and achieves state-of-the-art results on VLM benchmarks. Additionally, its multimodal accuracy outstrips that of the larger 7B LLaVA-1.5 model.

NeurIPS Conference 2025 Conference Paper

Fin3R: Fine-tuning Feed-forward 3D Reconstruction Models via Monocular Knowledge Distillation

  • Weining Ren
  • Hongjun Wang
  • Xiao Tan
  • Kai Han

We present Fin3R, a simple, effective, and general fine-tuning method for feed-forward 3D reconstruction models. This family of models regresses the pointmaps of all input images to a reference-frame coordinate system, along with other auxiliary outputs, in a single forward pass. However, we find that current models struggle with fine geometry and robustness due to (i) the scarcity of high-fidelity depth and pose supervision and (ii) the inherent geometric misalignment from multi-view pointmap regression. Fin3R jointly tackles both issues with an extra lightweight fine-tuning step. We freeze the decoder, which handles view matching, and fine-tune only the image encoder, the component dedicated to feature extraction. The encoder is enriched with fine geometric details distilled from a strong monocular teacher model on large, unlabeled datasets, using a custom, lightweight LoRA adapter. We validate our method on a wide range of models, including DUSt3R, MASt3R, CUT3R, and VGGT. The fine-tuned models consistently deliver sharper boundaries, recover complex structures, and achieve higher geometric accuracy in both single- and multi-view settings, while adding only the tiny LoRA weights, which leave test-time memory and latency virtually unchanged. Project page: https://visual-ai.github.io/fin3r
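The LoRA adapter mentioned in the abstract adds a trainable low-rank detour around a frozen weight matrix. A minimal sketch, assuming a rank-1 adapter on a tiny frozen 2x2 matrix (names, shapes, and values are ours, not Fin3R's implementation):

```python
# LoRA-style forward pass: y = W x + scale * B (A x).
# W stays frozen; only the small matrices A (down) and B (up) are trained.

def matvec(M, v):
    """Plain matrix-vector product over nested lists."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def lora_forward(x, W, A, B, scale=1.0):
    base = matvec(W, x)                  # frozen path
    low_rank = matvec(B, matvec(A, x))   # rank-r detour, r << dim
    return [b + scale * l for b, l in zip(base, low_rank)]

W = [[1.0, 0.0], [0.0, 1.0]]   # frozen encoder weight (identity here)
A = [[1.0, 1.0]]               # rank-1 down-projection
B = [[0.5], [0.0]]             # rank-1 up-projection
y = lora_forward([2.0, 3.0], W, A, B)
```

Because only A and B are stored per fine-tuned model, the extra weights are tiny, which is what keeps test-time memory and latency essentially unchanged.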

NeurIPS Conference 2025 Conference Paper

GSPN-2: Efficient Parallel Sequence Modeling

  • Hongjun Wang
  • Yitong Jiang
  • Collin McCarthy
  • David Wehr
  • Hanrong Ye
  • Xinhao Li
  • Ka Chun Cheung
  • Wonmin Byeon

Vision transformer efficiency remains a bottleneck for high-resolution images and long-video real-world applications. The Generalized Spatial Propagation Network (GSPN) (Wang et al., 2025) addresses this by replacing quadratic self-attention with a line-scan propagation scheme, bringing the cost close to linear in the number of rows or columns while retaining accuracy. Despite this advancement, the existing GSPN implementation still suffers from (i) heavy overhead due to repeatedly launching GPU kernels, (ii) excessive data transfers from global GPU memory, and (iii) redundant computations caused by maintaining separate propagation weights for each channel. We introduce GSPN-2, a joint algorithm-system redesign. In particular, we fuse thousands of micro-launches from the previous implementation into a single 2D kernel, explicitly pin one warp to each channel slice, and stage the previous column's activations in shared memory. On the model side, we introduce a set of channel-shared propagation weights that replace per-channel matrices, trimming parameters, and align naturally with the affinity map used in transformer attention. Experiments demonstrate GSPN-2's effectiveness across image classification and text-to-image synthesis tasks, matching transformer-level accuracy with significantly lower computational cost. GSPN-2 establishes a new efficiency frontier for modeling global spatial context in vision applications through its unique combination of structured matrix transformations and GPU-optimized implementation.
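The line-scan recurrence the abstract refers to can be shown in miniature. This toy version uses one channel-shared scalar propagation weight (the real GSPN uses learned, normalized propagation matrices and a fused GPU kernel; only the recurrence structure below is faithful):

```python
# Toy line-scan propagation: each column mixes its input with the
# propagated previous column, h_j = w * h_{j-1} + x_j.

def line_scan(columns, w):
    """columns: list of columns, each a list of per-row scalars;
    w: a single shared propagation weight (the channel-shared idea)."""
    h = [0.0] * len(columns[0])
    out = []
    for col in columns:
        h = [w * hp + c for hp, c in zip(h, col)]
        out.append(h)
    return out

# a signal in the first column decays geometrically as it propagates right
out = line_scan([[1.0, 2.0], [0.0, 0.0], [0.0, 0.0]], w=0.5)
```

The scan is sequential across columns but fully parallel within a column, which is why the cost is linear in the number of columns rather than quadratic in the number of pixels.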

AAAI Conference 2025 Conference Paper

L-Man: A Large Multi-modal Model Unifying Human-centric Tasks

  • Jialong Zuo
  • Ying Nie
  • Tianyu Guo
  • Huaxin Zhang
  • Jiahao Hong
  • Nong Sang
  • Changxin Gao
  • Kai Han

Large language models (LLMs) have recently shown notable progress in unifying various visual tasks with an open-ended form. However, when transferred to human-centric tasks, despite their remarkable multi-modal understanding ability in general domains, they lack further human-related domain knowledge and show unsatisfactory performance. Meanwhile, current human-centric unified models are mostly restricted to a pre-defined form and lack open-ended task capability. Therefore, it is necessary to propose a large multi-modal model which utilizes LLMs to unify various human-centric tasks. We forge ahead along this path from the aspects of dataset and model. Specifically, we first construct a large-scale language-image instruction-following dataset named HumanIns based on 20 existing open datasets spanning 6 diverse downstream tasks, which provides sufficient and diverse data for multi-modal training. Then, a model named L-Man including a query adapter is designed to extract the multi-grained semantics of images and align the cross-modal information between image and text. In practice, we introduce a two-stage training strategy, where the first stage extracts generic text-relevant visual information, and the second stage maps the visual features to the embedding space of the LLM. By tuning on HumanIns, our model shows significant superiority on human-centric tasks compared with existing large multi-modal models, and also achieves even better results on downstream datasets compared with respective task-specific models.

JBHI Journal 2025 Journal Article

LiMT: A Multi-Task Liver Image Benchmark Dataset

  • Zhe Liu
  • Kai Han
  • Siqi Ma
  • Yan Zhu
  • Jun Chen
  • Chongwen Lyu
  • Xinyi Qiu
  • Chengxuan Qian

Computer-aided diagnosis (CAD) technology can assist clinicians in evaluating liver lesions and intervening with treatment in time. Although CAD technology has advanced in recent years, the application scope of existing datasets remains relatively limited, typically supporting only single tasks, which has somewhat constrained the development of CAD technology. To address the above limitation, in this paper, we construct a multi-task liver dataset (LiMT) used for liver and tumor segmentation, multi-label lesion classification, and lesion detection based on arterial phase-enhanced computed tomography (CT), potentially providing an exploratory solution that is able to explore the correlation between tasks and does not need to worry about the heterogeneity between task-specific datasets during training. The dataset includes CT volumes from 150 different cases, comprising four types of liver diseases as well as normal cases. Each volume has been carefully annotated and calibrated by experienced clinicians. This public multi-task dataset may become a valuable resource for the medical imaging research community in the future. In addition, this paper not only provides relevant baseline experimental results but also reviews existing datasets and methods related to liver-related tasks. Our dataset is available at https://drive.google.com/drive/folders/1l9HRK13uaOQTNShf5pwgSz3OTanWjkag?usp=sharing.

NeurIPS Conference 2025 Conference Paper

Panoptic Captioning: An Equivalence Bridge for Image and Text

  • Kun-Yu Lin
  • Hongjun Wang
  • Weining Ren
  • Kai Han

This work introduces panoptic captioning, a novel task striving to seek the minimum text equivalent of images, which has broad potential applications. We take the first step towards panoptic captioning by formulating it as a task of generating a comprehensive textual description for an image, which encapsulates all entities, their respective locations and attributes, relationships among entities, as well as global image state. Through an extensive evaluation, our work reveals that state-of-the-art Multi-modal Large Language Models (MLLMs) have limited performance in solving panoptic captioning. To address this, we propose an effective data engine named PancapEngine to produce high-quality data and a novel method named PancapChain to improve panoptic captioning. Specifically, our PancapEngine first detects diverse categories of entities in images by an elaborate detection suite, and then generates required panoptic captions using entity-aware prompts. Additionally, our PancapChain explicitly decouples the challenging panoptic captioning task into multiple stages and generates panoptic captions step by step. More importantly, we contribute a comprehensive metric named PancapScore and a human-curated test set for reliable model evaluation. Experiments show that our PancapChain-13B model can beat state-of-the-art open-source MLLMs like InternVL-2.5-78B and even surpass proprietary models like GPT-4o and Gemini-2.0-Pro, demonstrating the effectiveness of our data engine and method. Project page: https://visual-ai.github.io/pancap/

JAIR Journal 2025 Journal Article

Practical Parallel Algorithms for Non-Monotone Submodular Maximization

  • Shuang Cui
  • Kai Han
  • Jing Tang
  • Xueying Li
  • Aakas Zhiyuli
  • Hanxiao Li

Submodular maximization has found extensive applications in various domains within the field of artificial intelligence, including but not limited to machine learning, computer vision, and natural language processing. With the increasing size of datasets in these domains, there is a pressing need to develop efficient and parallelizable algorithms for submodular maximization. One measure of the parallelizability of a submodular maximization algorithm is its adaptive complexity, which indicates the number of sequential rounds where a polynomial number of queries to the objective function can be executed in parallel. In this paper, we study the problem of non-monotone submodular maximization subject to a knapsack constraint, and propose a low-adaptivity algorithm achieving a (1/8 − ϵ)-approximation with practical Õ(n) query complexity. Moreover, we also propose the first algorithm with both provable approximation ratio and sublinear adaptive complexity for the problem of non-monotone submodular maximization subject to a k-system constraint. As a by-product, we show that our two algorithms can also be applied to the special case of submodular maximization subject to a cardinality constraint, and achieve performance bounds comparable with those of state-of-the-art algorithms. Finally, the effectiveness of our algorithms is demonstrated by extensive experiments on real-world applications.

NeurIPS Conference 2025 Conference Paper

SEAL: Semantic-Aware Hierarchical Learning for Generalized Category Discovery

  • Zhenqi He
  • Yuanpei Liu
  • Kai Han

This paper investigates the problem of Generalized Category Discovery (GCD). Given a partially labelled dataset, GCD aims to categorize all unlabelled images, regardless of whether they belong to known or unknown classes. Existing approaches typically depend on either single-level semantics or manually designed abstract hierarchies, which limit their generalizability and scalability. To address these limitations, we introduce a SEmantic-aware hierArchical Learning framework (SEAL), guided by naturally occurring and easily accessible hierarchical structures. Within SEAL, we propose a Hierarchical Semantic-Guided Soft Contrastive Learning approach that exploits hierarchical similarity to generate informative soft negatives, addressing the limitations of conventional contrastive losses that treat all negatives equally. Furthermore, a Cross-Granularity Consistency (CGC) module is designed to align the predictions from different levels of granularity. SEAL consistently achieves state-of-the-art performance on fine-grained benchmarks, including the SSB benchmark, Oxford-Pet, and the Herbarium19 dataset, and further demonstrates generalization on coarse-grained datasets. Project page: https://visual-ai.github.io/seal/

NeurIPS Conference 2025 Conference Paper

VaMP: Variational Multi-Modal Prompt Learning for Vision-Language Models

  • Silin Cheng
  • Kai Han

Vision-language models (VLMs), such as CLIP, have shown strong generalization under zero-shot settings, yet adapting them to downstream tasks with limited supervision remains a significant challenge. Existing multi-modal prompt learning methods typically rely on fixed, shared prompts and deterministic parameters, which limits their ability to capture instance-level variation or model uncertainty across diverse tasks and domains. To tackle this issue, we propose a novel Variational Multi-Modal Prompt Learning (VaMP) framework that enables sample-specific, uncertainty-aware prompt tuning in multi-modal representation learning. VaMP generates instance-conditioned prompts by sampling from a learned posterior distribution, allowing the model to personalize its behavior based on input content. To further enhance the integration of local and global semantics, we introduce a class-aware prior derived from the instance representation and class prototype. Building upon these, we formulate prompt tuning as variational inference over latent prompt representations and train the entire framework end-to-end through reparameterized sampling. Experiments on few-shot and domain generalization benchmarks show that VaMP achieves state-of-the-art performance, highlighting the benefits of modeling both uncertainty and task structure in our method. Project page: https://visual-ai.github.io/vamp
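The reparameterized sampling the abstract mentions is the standard trick for making samples from a learned Gaussian posterior differentiable. This is the generic trick, not VaMP's architecture; variable names are ours:

```python
import math
import random

def sample_prompt(mu, log_var, eps=None):
    """prompt = mu + sigma * eps, with sigma = exp(0.5 * log_var).

    Writing the sample this way keeps it differentiable w.r.t. mu and
    log_var, which is what allows end-to-end training of the posterior.
    """
    if eps is None:
        eps = [random.gauss(0.0, 1.0) for _ in mu]  # noise is sampled, params are not
    return [m + math.exp(0.5 * lv) * e for m, lv, e in zip(mu, log_var, eps)]

# with the noise fixed to zero, the sample collapses to the posterior mean
p = sample_prompt([0.2, -0.1], [0.0, 0.0], eps=[0.0, 0.0])
```

In a framework like VaMP, `mu` and `log_var` would be predicted per instance, so each input image gets its own prompt distribution.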

NeurIPS Conference 2025 Conference Paper

Wukong's 72 Transformations: High-fidelity Textured 3D Morphing via Flow Models

  • Minghao Yin
  • Yukang Cao
  • Kai Han

We present WUKONG, a novel training-free framework for high-fidelity textured 3D morphing that takes a pair of source and target prompts (text or images) as input. Unlike conventional methods -- which rely on manual correspondence matching and deformation trajectory estimation (limiting generalization and requiring costly preprocessing) -- WUKONG leverages the generative prior of flow-based transformers to produce high-fidelity 3D transitions with rich texture details. To ensure smooth shape transitions, we exploit the inherent continuity of flow-based generative processes and formulate morphing as an optimal transport barycenter problem. We further introduce a sequential initialization strategy to prevent abrupt geometric distortions and preserve identity coherence. For faithful texture preservation, we propose a similarity-guided semantic consistency mechanism that selectively retains high-frequency details and enables precise control over blending dynamics. This empowers WUKONG to support both global texture transitions and identity-preserving texture morphing, catering to diverse generation needs. Through extensive quantitative and qualitative evaluations, we demonstrate that WUKONG significantly outperforms state-of-the-art methods, achieving superior results across diverse geometry and texture variations.

IROS Conference 2025 Conference Paper

ZBOT: A Novel Modular Robot Capable of Active Transformation from Snake to Bipedal Configuration through RL

  • Nanlin Zhou
  • Sikai Zhao
  • Hang Luo
  • Kai Han
  • Zhiyuan Yang
  • Jian Qi
  • Ning Zhao
  • Jie Zhao 0003

In recent years, significant progress has been made in the prototype design and control methodologies of modular snake robots. However, there is still relatively little research on the potential enabled by the active morphological transformation of robots. This paper presents a novel modular snake robot capable of morphing into a bipedal configuration. The robot, ZBOT, is composed of independent, homogeneous unit modules (named ZBot) connected in series. Each ZBot module has a dual-motor-driven 1-DoF rotational joint, which can rotate continuously, provides large output torque, and eliminates backlash. There are four connection orientations between adjacent modules. This paper proposes an articulation configuration which enables the snake robot to achieve the active transformation from a snake form to a bipedal form. Meanwhile, through reinforcement learning (RL), movements including the stand-up gait are trained and verified in the IsaacSim/Lab simulation environment. This research advances snake robots beyond surface-dependent locomotion, unlocking greater potential for versatile applications.

TMLR Journal 2024 Journal Article

CiPR: An Efficient Framework with Cross-instance Positive Relations for Generalized Category Discovery

  • Shaozhe Hao
  • Kai Han
  • Kwan-Yee K. Wong

We tackle the issue of generalized category discovery (GCD). GCD considers the open-world problem of automatically clustering a partially labelled dataset, in which the unlabelled data may contain instances from both novel categories and labelled classes. In this paper, we address the GCD problem with an unknown category number for the unlabelled data. We propose a framework, named CiPR, to bootstrap the representation by exploiting cross-instance positive relations in the partially labelled data for contrastive learning, which have been neglected in existing methods. To obtain reliable cross-instance relations to facilitate representation learning, we introduce a semi-supervised hierarchical clustering algorithm, named selective neighbor clustering (SNC), which can produce a clustering hierarchy directly from the connected components of a graph constructed from selective neighbors. We further present a method to estimate the unknown class number using SNC with a joint reference score that considers clustering indexes of both labelled and unlabelled data, and extend SNC to allow label assignment for the unlabelled instances with a given class number. We thoroughly evaluate our framework on public generic image recognition datasets and challenging fine-grained datasets, and establish a new state-of-the-art. Code: https://github.com/haoosz/CiPR

AAAI Conference 2024 Conference Paper

Deletion-Robust Submodular Maximization with Knapsack Constraints

  • Shuang Cui
  • Kai Han
  • He Huang

Submodular maximization algorithms have found wide applications in various fields such as data summarization, recommendation systems, and active learning. In recent years, deletion-robust submodular maximization algorithms have garnered attention due to their significant implications in scenarios where some data points may be removed due to user preferences or privacy concerns, such as in recommendation systems and influence maximization. In this paper, we study the fundamental problem of submodular maximization with knapsack constraints and propose a robust streaming algorithm for it. To the best of our knowledge, our algorithm is the first to solve this problem for non-monotone submodular functions and can achieve an approximation ratio of 1/(6.82+2.63d)-ϵ under a near-optimal summary size of O(k+r), where k denotes the maximum cardinality of any feasible solution, d denotes the number of knapsack constraints, and r is the robustness parameter. For monotone submodular functions, our algorithm can achieve an approximation ratio of 1/(2+2d)-ϵ under a near-optimal summary size of O(k+r), significantly improving upon the best-known ratio of Ω((1/d-ϵ)^2). The empirical performance of our algorithm is extensively evaluated in several applications including influence maximization and recommendation systems, and the experimental results demonstrate the effectiveness of our algorithm.
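For intuition about streaming submodular maximization, here is a much-simplified threshold-based selection rule under a plain cardinality constraint. It is illustrative only; the paper's deletion-robust, knapsack-constrained algorithm is considerably more involved:

```python
# Simplified streaming rule: keep an arriving item iff its marginal gain
# with respect to the current summary clears a fixed threshold.

def stream_select(items, k, gain, threshold):
    summary = []
    for it in items:
        if len(summary) < k and gain(summary, it) >= threshold:
            summary.append(it)
    return summary

# example objective: set coverage; marginal gain = newly covered elements
def coverage_gain(summary, item):
    covered = set().union(*summary) if summary else set()
    return len(set(item) - covered)

sets = [{1, 2}, {2}, {3, 4}, {4}]
picked = stream_select(sets, k=2, gain=coverage_gain, threshold=2)
```

Each item is examined once and the memory footprint is bounded by the summary size, which is the property that makes such rules attractive for large streams; robust variants additionally keep a buffer of backup items so the summary survives up to r deletions.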

ICLR Conference 2024 Conference Paper

FROSTER: Frozen CLIP is A Strong Teacher for Open-Vocabulary Action Recognition

  • Xiaohu Huang
  • Hao Zhou
  • Kun Yao
  • Kai Han

In this paper, we introduce FROSTER, an effective framework for open-vocabulary action recognition. The CLIP model has achieved remarkable success in a range of image-based tasks, benefiting from its strong generalization capability stemming from pretraining on massive image-text pairs. However, applying CLIP directly to the open-vocabulary action recognition task is challenging due to the absence of temporal information in CLIP's pretraining. Further, fine-tuning CLIP on action recognition datasets may lead to overfitting and hinder its generalizability, resulting in unsatisfactory results when dealing with unseen actions. To address these issues, FROSTER employs a residual feature distillation approach to ensure that CLIP retains its generalization capability while effectively adapting to the action recognition task. Specifically, residual feature distillation treats the frozen CLIP model as a teacher to maintain the generalizability exhibited by the original CLIP, and supervises the feature learning for the extraction of video-specific features to bridge the gap between images and videos. Meanwhile, it uses a residual sub-network for feature distillation to reach a balance between the two distinct objectives of learning generalizable and video-specific features. We extensively evaluate FROSTER on open-vocabulary action recognition benchmarks under both base-to-novel and cross-dataset settings. FROSTER consistently achieves state-of-the-art performance on all datasets across the board. Project page: https://visual-ai.github.io/froster
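The residual-distillation objective described above can be sketched as matching the student-plus-residual feature to the frozen teacher's feature. This is our scalar-feature simplification with a squared-error distance, not the actual FROSTER loss:

```python
# Sketch: the residual sub-network absorbs video-specific adaptation so the
# student is not forced to drift away from the teacher's generalizable features.

def distill_loss(student_feat, residual, teacher_feat):
    """Squared error between (student + residual) and the frozen teacher."""
    return sum((s + r - t) ** 2
               for s, r, t in zip(student_feat, residual, teacher_feat))

# when the residual exactly accounts for the student-teacher gap, loss is zero
loss = distill_loss([0.5, 1.0], [0.5, -1.0], [1.0, 0.0])
```

The design point is that the gradient of this loss flows through a small residual branch, leaving the student free to also fit the supervised action-recognition objective.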

NeurIPS Conference 2024 Conference Paper

Kangaroo: Lossless Self-Speculative Decoding for Accelerating LLMs via Double Early Exiting

  • Fangcheng Liu
  • Yehui Tang
  • Zhenhua Liu
  • Yunsheng Ni
  • Duyu Tang
  • Kai Han
  • Yunhe Wang

Speculative decoding has demonstrated its effectiveness in accelerating the inference of large language models (LLMs) while maintaining an identical sampling distribution. However, the conventional approach of training a separate draft model to achieve a satisfactory token acceptance rate can be costly and impractical. In this paper, we propose Kangaroo, a novel self-speculative decoding framework with a double early-exiting strategy, which leverages the shallow sub-network and the LM head of the well-trained target LLM to construct a self-drafting model. The self-verification stage then only requires computing the remaining layers over the early-exited hidden states in parallel. To bridge the representation gap between the sub-network and the full model, we train a lightweight and efficient adapter module on top of the sub-network. One significant challenge with the proposed method is that the inference latency of the self-draft model may no longer be negligible compared to the big model. To boost the token acceptance rate while minimizing the latency of the self-drafting model, we introduce an additional early-exiting mechanism for both single-sequence and tree decoding scenarios. Specifically, we dynamically halt the small model's subsequent prediction during the drafting phase once the confidence level for the current step falls below a certain threshold. This approach reduces unnecessary computations and improves overall efficiency. Extensive experiments on multiple benchmarks demonstrate our effectiveness, where Kangaroo achieves walltime speedups up to 2.04×, outperforming Medusa-1 with 88.7% fewer additional parameters. The code for Kangaroo is available at https://github.com/Equationliu/Kangaroo.
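The confidence-gated drafting loop is the part of Kangaroo that is easy to sketch. The draft-step function below is a stand-in for the self-draft sub-network, not the actual model:

```python
# Sketch: draft tokens until the self-draft model's confidence drops below a
# threshold, so low-confidence (likely rejected) tokens are never drafted.

def draft_tokens(step_fn, max_draft, threshold):
    """step_fn() returns (token, confidence); halt early on low confidence."""
    drafted = []
    for _ in range(max_draft):
        token, conf = step_fn()
        if conf < threshold:
            break          # the double early exit: stop drafting here
        drafted.append(token)
    return drafted

# toy stand-in for the self-draft model's per-step outputs
stream = iter([("the", 0.9), ("cat", 0.8), ("sat", 0.4), ("on", 0.95)])
tokens = draft_tokens(lambda: next(stream), max_draft=4, threshold=0.5)
```

Tokens drafted with low confidence would mostly be rejected at verification anyway, so halting early trades almost no acceptance rate for a real latency saving.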

NeurIPS Conference 2024 Conference Paper

MemoryFormer: Minimize Transformer Computation by Removing Fully-Connected Layers

  • Ning Ding
  • Yehui Tang
  • Haochen Qin
  • Zhenli Zhou
  • Chao Xu
  • Lin Li
  • Kai Han
  • Heng Liao

In order to reduce the computational complexity of large language models, great efforts have been made to improve the efficiency of transformer models, such as linear attention and flash-attention. However, the model size and corresponding computational complexity are constantly scaled up in pursuit of higher performance. In this work, we present MemoryFormer, a novel transformer architecture which significantly reduces the computational complexity (FLOPs) from a new perspective. We eliminate nearly all the computations of the transformer model except for the necessary computation required by the multi-head attention operation. This is made possible by utilizing an alternative method for feature transformation to replace the linear projection of fully-connected layers. Specifically, we first construct a group of in-memory lookup tables that store a large amount of discrete vectors to replace the weight matrix used in linear projection. We then use a hash algorithm to retrieve a correlated subset of vectors dynamically based on the input embedding. The retrieved vectors are combined to form the output embedding, which provides an estimation of the result of matrix multiplication in a fully-connected layer. Compared to conducting matrix multiplication, retrieving data blocks from memory is a much cheaper operation which requires little computation. We train MemoryFormer from scratch and conduct extensive experiments on various benchmarks to demonstrate the effectiveness of the proposed model.
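The table-lookup replacement for a linear layer can be illustrated with a toy hash. MemoryFormer's tables and hashing are learned; the sign-pattern hash and tiny tables below are our stand-ins for the idea:

```python
# Sketch: split the input into chunks, hash each chunk to a table row, and
# sum the retrieved rows -- an estimate of a matrix multiplication with no
# multiply-accumulate over a dense weight matrix.

def bucket(x_chunk, num_buckets):
    """Stand-in hash: the chunk's sign pattern mapped to a bucket index."""
    bits = 0
    for v in x_chunk:
        bits = (bits << 1) | (1 if v >= 0 else 0)
    return bits % num_buckets

def lookup_layer(x, tables, chunk):
    """One lookup table per chunk; output = sum of the retrieved rows."""
    chunks = [x[i:i + chunk] for i in range(0, len(x), chunk)]
    out = None
    for t, c in zip(tables, chunks):
        row = t[bucket(c, len(t))]
        out = row if out is None else [a + b for a, b in zip(out, row)]
    return out

tables = [
    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0]],  # table for chunk 0
    [[0.0, 2.0], [1.0, 1.0], [0.0, 0.0], [3.0, 0.0]],  # table for chunk 1
]
y = lookup_layer([0.3, -0.7, -0.2, -0.1], tables, chunk=2)
```

The compute cost per chunk is one hash plus one memory fetch, independent of the output width, which is where the FLOP reduction comes from.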

NeurIPS Conference 2024 Conference Paper

SciFIBench: Benchmarking Large Multimodal Models for Scientific Figure Interpretation

  • Jonathan Roberts
  • Kai Han
  • Neil Houlsby
  • Samuel Albanie

Large multimodal models (LMMs) have proven flexible and generalisable across many tasks and fields. Although they have strong potential to aid scientific research, their capabilities in this domain are not well characterised. A key aspect of scientific research is the ability to understand and interpret figures, which serve as a rich, compressed source of complex information. In this work, we present SciFIBench, a scientific figure interpretation benchmark consisting of 2000 questions split between two tasks across 8 categories. The questions are curated from arXiv paper figures and captions, using adversarial filtering to find hard negatives and human verification for quality control. We evaluate 28 LMMs on SciFIBench, finding it to be a challenging benchmark. Finally, we investigate the alignment and reasoning faithfulness of the LMMs on augmented question sets from our benchmark. We release SciFIBench to encourage progress in this domain.

NeurIPS Conference 2024 Conference Paper

Star-Agents: Automatic Data Optimization with LLM Agents for Instruction Tuning

  • Hang Zhou
  • Yehui Tang
  • Haochen Qin
  • Yujie Yang
  • Renren Jin
  • Deyi Xiong
  • Kai Han
  • Yunhe Wang

The efficacy of large language models (LLMs) on downstream tasks usually hinges on instruction tuning, which relies critically on the quality of training data. Unfortunately, collecting high-quality and diverse data is both expensive and time-consuming. To mitigate this issue, we propose a novel Star-Agents framework, which automates the enhancement of data quality across datasets through multi-agent collaboration and assessment. The framework adopts a three-pronged strategy. It initially generates diverse instruction data with multiple LLM agents through a bespoke sampling method. Subsequently, the generated data undergo a rigorous evaluation using a dual-model method that assesses both difficulty and quality. Finally, the above process evolves in a dynamic refinement phase, where more effective LLMs are prioritized, enhancing the overall data quality. Our empirical studies, including instruction tuning experiments with models such as Pythia and LLaMA, demonstrate the effectiveness of the proposed framework. Optimized datasets have achieved substantial improvements, with an average increase of 12\% and notable gains in specific metrics, such as a 40\% improvement in Fermi, as evidenced by benchmarks like MT-bench, Vicuna bench, and WizardLM testset. Codes will be released soon.

TMLR Journal 2023 Journal Article

Complementary Sparsity: Accelerating Sparse CNNs with High Accuracy on General-Purpose Computing Platforms

  • Kang Zhao
  • Yijun Tan
  • Kai Han
  • Ting Hu
  • Hanting Chen
  • Tao Yuan
  • Yunhe Wang
  • Jun Yao

Model sparsity is a promising approach to reducing parameters or FLOPs of convolutional neural networks (CNNs). Compared to unstructured or coarse-grained structured sparsity, fine-grained structured sparsity, e.g., the N:M sparse pattern, can achieve a better balance between accuracy and efficiency on general computing platforms like CPUs and GPUs. In particular, 2:4 sparsity can accelerate CNN inference by 2$\times$ with negligible accuracy drop. However, N:M sparsity requires dedicated hardware circuits on the GPU and hardly achieves significant speedups on common GPUs. To accelerate CNNs with general-purpose computing resources while retaining model accuracy as much as possible, this paper proposes complementary sparsity (CS). In CS, among weights spaced at the same fixed distance, only one can be retained. On the one hand, CS features high mask flexibility, which is naturally favorable to high model accuracy. Moreover, we propose a CS-specific sparse training method to improve CS-based CNNs' accuracy under high parameter sparsities ($>$75\%). On the other hand, CS itself is memory-access balanced and robust to pattern hyperparameters, which can be utilized to speed up CS-based convolution computation on CPUs and common GPUs. We thus propose a CS convolution parallel computing algorithm that adapts to common GPUs without sparse tensor cores. Experimental results show that compared to other sparsity patterns, the proposed CS can achieve the optimal trade-off in terms of accuracy and latency for CPUs and common GPUs, respectively. Codes will be available at https://gitee.com/mindspore/models/tree/master/research/cv/CS.
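A minimal 1-D reading of the complementary-sparsity rule — among weights spaced at the same distance, only one is retained — might look as follows; the group stride and the magnitude-based selection below are illustrative assumptions, not the paper's exact procedure:

```python
def complementary_sparsity_mask(weights, stride):
    """Keep, in each group of weights spaced `stride` apart, only the
    largest-magnitude entry; zero the rest.  Toy 1-D version."""
    n = len(weights)
    mask = [0] * n
    for start in range(stride):
        group = list(range(start, n, stride))       # indices start, start+stride, ...
        best = max(group, key=lambda i: abs(weights[i]))
        mask[best] = 1                              # retain one weight per group
    return mask

w = [0.1, -2.0, 0.3, 1.5, -0.2, 0.05, 0.9, -1.1]
m = complementary_sparsity_mask(w, stride=4)
print(m)  # [0, 1, 0, 1, 1, 0, 1, 0]
```

Note that the surviving positions are balanced across the groups by construction, which is what makes the memory access pattern regular.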

YNICL Journal 2023 Journal Article

Difference of mean Hounsfield units (dHU) between follow-up and initial noncontrast CT scan predicts 90-day poor outcome in spontaneous supratentorial acute intracerebral hemorrhage with deep convolutional neural networks

  • Xiaona Xia
  • Xiaoqian Zhang
  • Jiufa Cui
  • Qingjun Jiang
  • Shuai Guan
  • Kongming Liang
  • Hao Wang
  • Chao Wang

OBJECTIVES: This study aimed to investigate the usefulness of a new non-contrast CT scan (NCCT) sign called the dHU, which represents the difference in mean Hounsfield unit values between follow-up and the initial NCCT, for predicting 90-day poor functional outcomes in acute supratentorial spontaneous intracerebral hemorrhage (sICH) using deep convolutional neural networks. METHODS: A total of 377 consecutive patients with sICH from center 1 and 91 patients from center 2 (external validation set) were included. A receiver operating characteristic (ROC) analysis was performed to determine the critical value of dHU for predicting poor outcome at 90 days. Modified Rankin score (mRS) >3 or >2 was defined as the primary and secondary poor outcome, respectively. Two multivariate models were developed to test whether dHU was an independent predictor of the two unfavorable functional outcomes. RESULTS: The ROC analysis showed that a dHU >2.5 was a critical value to predict the poor outcomes (mRS >3) in sICH. The sensitivity, specificity, and accuracy of dHU >2.5 for poor outcome prediction were 37.5%, 86.0%, and 70.6%, respectively. In multivariate models developed after adjusting for all elements of the ICH score and hematoma expansion, dHU >2.5 was an independent predictor of both primary and secondary poor outcomes (OR = 2.61, 95% CI [1.32,5.13], P = 0.006; OR = 2.63, 95% CI [1.36,5.10], P = 0.004, respectively). After adjustment for all possible significant predictors, a dHU >2.5 had a positive association with primary and secondary poor outcomes (OR = 3.25, 95% CI [1.52,6.98], P = 0.002; OR = 3.42, 95% CI [1.64,7.15], P = 0.001). CONCLUSIONS: The dHU of hematoma based on serial CT scans is independently associated with poor outcomes after acute sICH, which may help predict clinical evolution and guide therapy for sICH patients.

NeurIPS Conference 2023 Conference Paper

Gold-YOLO: Efficient Object Detector via Gather-and-Distribute Mechanism

  • Chengcheng Wang
  • Wei He
  • Ying Nie
  • Jianyuan Guo
  • Chuanjian Liu
  • Yunhe Wang
  • Kai Han

In the past years, YOLO-series models have emerged as the leading approaches in the area of real-time object detection. Many studies pushed the baseline to a higher level by modifying the architecture, augmenting data and designing new losses. However, we find previous models still suffer from an information fusion problem, although Feature Pyramid Network (FPN) and Path Aggregation Network (PANet) have alleviated this. Therefore, this study provides an advanced Gather-and-Distribute (GD) mechanism, which is realized with convolution and self-attention operations. The newly designed model, named Gold-YOLO, boosts multi-scale feature fusion capabilities and achieves an ideal balance between latency and accuracy across all model scales. Additionally, we implement MAE-style pretraining in the YOLO series for the first time, allowing YOLO-series models to benefit from unsupervised pretraining. Gold-YOLO-N attains an outstanding 39.9% AP on the COCO val2017 dataset and 1030 FPS on a T4 GPU, outperforming the previous SOTA model YOLOv6-3.0-N with similar FPS by +2.4%. The PyTorch code is available at https://github.com/huawei-noah/Efficient-Computing/tree/master/Detection/Gold-YOLO, and the MindSpore code is available at https://gitee.com/mindspore/models/tree/master/research/cv/Gold_YOLO.

NeurIPS Conference 2023 Conference Paper

HeadSculpt: Crafting 3D Head Avatars with Text

  • Xiao Han
  • Yukang Cao
  • Kai Han
  • Xiatian Zhu
  • Jiankang Deng
  • Yi-Zhe Song
  • Tao Xiang
  • Kwan-Yee K. Wong

Recently, text-guided 3D generative methods have made remarkable advancements in producing high-quality textures and geometry, capitalizing on the proliferation of large vision-language and image diffusion models. However, existing methods still struggle to create high-fidelity 3D head avatars in two aspects: (1) They rely mostly on a pre-trained text-to-image diffusion model while missing the necessary 3D awareness and head priors. This makes them prone to inconsistency and geometric distortions in the generated avatars. (2) They fall short in fine-grained editing. This is primarily due to the inherited limitations of the pre-trained 2D image diffusion models, which become more pronounced when it comes to 3D head avatars. In this work, we address these challenges by introducing a versatile coarse-to-fine pipeline dubbed HeadSculpt for crafting (i.e., generating and editing) 3D head avatars from textual prompts. Specifically, we first equip the diffusion model with 3D awareness by leveraging landmark-based control and a learned textual embedding representing the back-view appearance of heads, enabling 3D-consistent head avatar generation. We further propose a novel identity-aware editing score distillation strategy to optimize a textured mesh with a high-resolution differentiable rendering technique. This enables identity preservation while following the editing instruction. We showcase HeadSculpt's superior fidelity and editing capabilities through comprehensive experiments and comparisons with existing methods.

NeurIPS Conference 2023 Conference Paper

One-for-All: Bridge the Gap Between Heterogeneous Architectures in Knowledge Distillation

  • Zhiwei Hao
  • Jianyuan Guo
  • Kai Han
  • Yehui Tang
  • Han Hu
  • Yunhe Wang
  • Chang Xu

Knowledge distillation (KD) has proven to be a highly effective approach for enhancing model performance through a teacher-student training scheme. However, most existing distillation methods are designed under the assumption that the teacher and student models belong to the same model family, particularly the hint-based approaches. By using centered kernel alignment (CKA) to compare the learned features between heterogeneous teacher and student models, we observe significant feature divergence. This divergence illustrates the ineffectiveness of previous hint-based methods in cross-architecture distillation. To tackle the challenge of distilling heterogeneous models, we propose a simple yet effective one-for-all KD framework called OFA-KD, which significantly improves the distillation performance between heterogeneous architectures. Specifically, we project intermediate features into an aligned latent space, such as the logits space, where architecture-specific information is discarded. Additionally, we introduce an adaptive target enhancement scheme to prevent the student from being disturbed by irrelevant information. Extensive experiments with various architectures, including CNN, Transformer, and MLP, demonstrate the superiority of our OFA-KD framework in enabling distillation between heterogeneous architectures. Specifically, when equipped with our OFA-KD, the student models achieve notable performance improvements, with a maximum gain of 8.0% on the CIFAR-100 dataset and 0.7% on the ImageNet-1K dataset. PyTorch code and checkpoints can be found at https://github.com/Hao840/OFAKD.
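The core idea — projecting a student's intermediate feature into the shared logits space and distilling there — can be sketched roughly as below. The projector shape, the toy dimensions, and the plain KL loss are assumptions for illustration; the actual OFA-KD framework also adds an adaptive target enhancement scheme not shown here:

```python
import math, random

random.seed(1)

def softmax(z, t=1.0):
    m = max(z)
    e = [math.exp((v - m) / t) for v in z]
    s = sum(e)
    return [v / s for v in e]

def kl(p, q):
    """KL divergence between two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

NUM_CLASSES, FEAT = 5, 8
# Hypothetical projector: maps a student intermediate feature (from any
# architecture) into the shared logits space, discarding arch-specific shape.
proj = [[random.gauss(0, 0.1) for _ in range(FEAT)] for _ in range(NUM_CLASSES)]

def project(feat):
    return [sum(w * f for w, f in zip(row, feat)) for row in proj]

teacher_logits = [2.0, 0.5, -1.0, 0.1, -0.5]
student_feat = [random.gauss(0, 1) for _ in range(FEAT)]

# Distillation loss in logits space: teacher distribution vs projected student.
loss = kl(softmax(teacher_logits), softmax(project(student_feat)))
print(loss)
```

Because both sides live in the logits space, the same loss applies whether the student is a CNN, a transformer, or an MLP.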

AAAI Conference 2023 Conference Paper

Practical Parallel Algorithms for Submodular Maximization Subject to a Knapsack Constraint with Nearly Optimal Adaptivity

  • Shuang Cui
  • Kai Han
  • Jing Tang
  • He Huang
  • Xueying Li
  • Aakas Zhiyuli

Submodular maximization has wide applications in machine learning and data mining, where massive datasets have brought the great need for designing efficient and parallelizable algorithms. One measure of the parallelizability of a submodular maximization algorithm is its adaptivity complexity, which indicates the number of sequential rounds where a polynomial number of queries to the objective function can be executed in parallel. In this paper, we study the problem of non-monotone submodular maximization subject to a knapsack constraint, and propose the first combinatorial algorithm achieving an $(8+\epsilon)$-approximation under $O(\log n)$ adaptive complexity, which is optimal up to a factor of $O(\log\log n)$. Moreover, under slightly larger adaptivity, we also propose approximation algorithms with nearly optimal query complexity of $O(n)$, while achieving better approximation ratios. We show that our algorithms can also be applied to the special case of submodular maximization subject to a cardinality constraint, and achieve performance bounds comparable with those of state-of-the-art algorithms. Finally, the effectiveness of our approach is demonstrated by extensive experiments on real-world applications.
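For intuition about the problem setting, here is the classical sequential cost-benefit greedy for monotone coverage (a submodular objective) under a knapsack constraint — a high-adaptivity baseline, not the paper's low-adaptivity parallel algorithm:

```python
def coverage(chosen, universe_sets):
    """Monotone submodular objective: number of covered elements."""
    covered = set()
    for i in chosen:
        covered |= universe_sets[i]
    return len(covered)

def greedy_knapsack(universe_sets, costs, budget):
    """Pick the item with the best marginal-gain/cost ratio until the
    budget is exhausted.  Every iteration is one sequential round, which
    is exactly the adaptivity the paper's algorithm reduces."""
    chosen, spent = [], 0.0
    while True:
        base = coverage(chosen, universe_sets)
        best, best_ratio = None, 0.0
        for i in range(len(universe_sets)):
            if i in chosen or spent + costs[i] > budget:
                continue
            gain = coverage(chosen + [i], universe_sets) - base
            ratio = gain / costs[i]
            if ratio > best_ratio:
                best, best_ratio = i, ratio
        if best is None:
            return chosen
        chosen.append(best)
        spent += costs[best]

sets = [{1, 2, 3}, {3, 4}, {5}, {1, 5, 6}]
costs = [2.0, 1.0, 1.0, 2.5]
picked = greedy_knapsack(sets, costs, budget=4.0)
print(coverage(picked, sets))  # 5
```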

NeurIPS Conference 2023 Conference Paper

Revisit the Power of Vanilla Knowledge Distillation: from Small Scale to Large Scale

  • Zhiwei Hao
  • Jianyuan Guo
  • Kai Han
  • Han Hu
  • Chang Xu
  • Yunhe Wang

The tremendous success of large models trained on extensive datasets demonstrates that scale is a key ingredient in achieving superior results. Therefore, reflection on the rationality of designing knowledge distillation (KD) approaches for limited-capacity architectures solely based on small-scale datasets is now deemed imperative. In this paper, we identify the small data pitfall present in previous KD methods, which results in the underestimation of the power of the vanilla KD framework on large-scale datasets such as ImageNet-1K. Specifically, we show that employing stronger data augmentation techniques and using larger datasets can directly decrease the gap between vanilla KD and other meticulously designed KD variants. This highlights the necessity of designing and evaluating KD approaches in the context of practical scenarios, casting off the limitations of small-scale datasets. Our investigation of vanilla KD and its variants in more complex schemes, including stronger training strategies and different model capacities, demonstrates that vanilla KD is elegantly simple but astonishingly effective in large-scale scenarios. Without bells and whistles, we obtain state-of-the-art ResNet-50, ViT-S, and ConvNeXtV2-T models for ImageNet, which achieve 83.1%, 84.3%, and 85.0% top-1 accuracy, respectively. PyTorch code and checkpoints can be found at https://github.com/Hao840/vanillaKD.

NeurIPS Conference 2023 Conference Paper

Species196: A One-Million Semi-supervised Dataset for Fine-grained Species Recognition

  • Wei He
  • Kai Han
  • Ying Nie
  • Chengcheng Wang
  • Yunhe Wang

The development of foundation vision models has pushed general visual recognition to a high level, but these models cannot well address fine-grained recognition in specialized domains such as invasive species classification. Identifying and managing invasive species has strong social and ecological value. Currently, most invasive species datasets are limited in scale and cover a narrow range of species, which restricts the development of deep-learning-based invasion biometrics systems. To fill this gap, we introduce Species196, a large-scale semi-supervised dataset of 196-category invasive species. It collects over 19K images with expert-level accurate annotations (Species196-L), and 1.2M unlabeled images of invasive species (Species196-U). The dataset provides four experimental settings for benchmarking existing models and algorithms, namely supervised learning, semi-supervised learning and self-supervised pretraining. To facilitate future research on these four learning paradigms, we conduct an empirical study of the representative methods on the introduced dataset. The dataset will be made publicly available at https://species-dataset.github.io/.

NeurIPS Conference 2023 Conference Paper

Triple Eagle: Simple, Fast and Practical Budget-Feasible Mechanisms

  • Kai Han
  • You Wu
  • He Huang
  • Shuang Cui

We revisit the classical problem of designing Budget-Feasible Mechanisms (BFMs) for submodular valuation functions, which has been extensively studied since the seminal paper of Singer [FOCS'10] due to its wide applications in crowdsourcing and social marketing. We propose TripleEagle, a novel algorithmic framework for designing BFMs, based on which we present several simple yet effective BFMs that achieve better approximation ratios than the state-of-the-art work for both monotone and non-monotone submodular valuation functions. Moreover, our BFMs are the first in the literature to achieve linear complexities while ensuring obvious strategyproofness, making them more practical than the previous BFMs. We conduct extensive experiments to evaluate the empirical performance of our BFMs, and the experimental results strongly demonstrate the efficiency and effectiveness of our approach.

NeurIPS Conference 2022 Conference Paper

A Transformer-Based Object Detector with Coarse-Fine Crossing Representations

  • Zhishan Li
  • Ying Nie
  • Kai Han
  • Jianyuan Guo
  • Lei Xie
  • Yunhe Wang

Transformer-based object detectors have shown competitive performance recently. Compared with convolutional neural networks limited by relatively small receptive fields, the advantage of the transformer for visual tasks is its capacity to perceive long-range dependencies among all image patches, while the deficiency is that local fine-grained information is not fully excavated. In this paper, we introduce coarse-grained and fine-grained crossing representations to build an efficient Detection Transformer (CFDT). Specifically, we propose a local-global cross fusion module to establish the connection between local fine-grained features and global coarse-grained features. Besides, we propose a coarse-fine aware neck which enables detection tokens to interact with both coarse-grained and fine-grained features. Furthermore, an efficient feature integration module is presented for fusing multi-scale representations from different stages. Experimental results on the COCO dataset demonstrate the effectiveness of the proposed method. For instance, our CFDT achieves 48.1 AP with 173G FLOPs, offering higher accuracy and less computation than the state-of-the-art transformer-based detector ViDT. Code will be available at https://gitee.com/mindspore/models/tree/master/research/cv/CFDT.

NeurIPS Conference 2022 Conference Paper

Accelerating Sparse Convolution with Column Vector-Wise Sparsity

  • Yijun Tan
  • Kai Han
  • Kang Zhao
  • Xianzhi Yu
  • Zidong Du
  • Yunji Chen
  • Yunhe Wang
  • Jun Yao

Weight sparsity is a promising approach to reducing the model size and computation cost of convolutional neural networks (CNNs). Nevertheless, non-zero weights often distribute randomly in sparse CNN models, introducing enormous difficulty in obtaining actual speedup on common hardware (e.g., GPU) over their dense counterparts. Existing acceleration solutions either require hardware modifications for irregular memory access support or rely on a partially structured sparsity pattern. Neither of these methods is capable of achieving fruitful speedup on convolution layers. In this work, we propose an algorithm-software co-designed sparse convolution based on a novel out-vector-wise (OVW) sparse pattern. Building on the insight that vertical vector integrity can preserve continuous memory access in IM2COL, the OVW pattern treats a $V\times1$ vector as an entirety. To reduce the error caused by sparsity, we propose an equivalent transformation process, i.e., clustering-based channel permutation, to gather similar rows together. Experimental evaluations demonstrate that our method achieves a $1.7\times$ and $3.2\times$ speedup over the SOTA solution and the dense convolution of ResNet50 on NVIDIA V100 at 75\% sparsity, respectively, with only negligible accuracy loss. Moreover, compared to the SOTA solution that achieves speedups only on data with 60\% sparsity or more, our method begins to obtain speedups on data with only 10\% sparsity.

JBHI Journal 2022 Journal Article

An Effective Semi-Supervised Approach for Liver CT Image Segmentation

  • Kai Han
  • Lu Liu
  • Yuqing Song
  • Yi Liu
  • Chengjian Qiu
  • Yangyang Tang
  • Qiaoying Teng
  • Zhe Liu

Despite the substantial progress made by deep networks in the field of medical image segmentation, they generally require sufficient pixel-level annotated data for training. The scale of training data remains the main bottleneck to obtaining a better deep segmentation model. Semi-supervised learning is an effective approach that alleviates the dependence on labeled data. However, most existing semi-supervised image segmentation methods do not generate high-quality pseudo labels to expand the training dataset. In this paper, we propose a deep semi-supervised approach for liver CT image segmentation by extending a pseudo-labeling algorithm under a very low annotated-data paradigm. Specifically, the output features of labeled images from the pretrained network are combined with the corresponding pixel-level annotations to produce class representations via a mean operation. Pseudo labels for unlabeled images are then generated by calculating the distances between unlabeled feature vectors and each class representation. To further improve the quality of the pseudo labels, we adopt a series of operations to optimize them. A more accurate segmentation network is obtained by expanding the training dataset and adjusting the contributions between the supervised and unsupervised losses. Besides, a novel random patch based on prior locations is introduced for unlabeled images in the training procedure. Extensive experiments show our method achieves more competitive results than other semi-supervised methods when fewer labeled slices of the LiTS dataset are available.
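The prototype-based pseudo-labeling step — mean class representations from labeled features, then nearest-prototype assignment for unlabeled pixels — can be sketched as follows (toy 2-D features; the pseudo-label optimization and random-patch components are omitted):

```python
def class_prototypes(features, labels):
    """Mean feature vector per class (the 'class representation')."""
    by_class = {}
    for f, y in zip(features, labels):
        by_class.setdefault(y, []).append(f)
    return {y: [sum(col) / len(fs) for col in zip(*fs)]
            for y, fs in by_class.items()}

def pseudo_label(feature, protos):
    """Assign the class whose prototype is nearest (squared Euclidean)."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(protos, key=lambda y: dist(feature, protos[y]))

labeled_feats = [[0.0, 0.1], [0.2, 0.0],   # class 0 (e.g. background)
                 [1.0, 1.1], [0.9, 1.0]]   # class 1 (e.g. liver)
labels = [0, 0, 1, 1]
protos = class_prototypes(labeled_feats, labels)
print(pseudo_label([0.95, 1.05], protos))  # 1
```

Each unlabeled feature is thus labeled by its closest class mean, and the resulting pseudo-labeled pairs expand the training set.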

NeurIPS Conference 2022 Conference Paper

Chromatic Correlation Clustering, Revisited

  • Qing Xiu
  • Kai Han
  • Jing Tang
  • Shuang Cui
  • He Huang

Chromatic Correlation Clustering (CCC), introduced by Bonchi et al. [6], is a natural generalization of the celebrated Correlation Clustering (CC) problem. It models objects with categorical pairwise relationships as an edge-colored graph, and has many applications in data mining, social networks and bioinformatics. We show that there exists a $2.5$-approximation to the CCC problem based on a Linear Programming (LP) approach, thus improving the best-known approximation ratio of 3 achieved by Klodt et al. [21]. We also present an efficient heuristic algorithm for CCC leveraging a greedy clustering strategy, and conduct extensive experiments to demonstrate the effectiveness and efficiency of our proposed algorithm.

NeurIPS Conference 2022 Conference Paper

GhostNetV2: Enhance Cheap Operation with Long-Range Attention

  • Yehui Tang
  • Kai Han
  • Jianyuan Guo
  • Chang Xu
  • Chao Xu
  • Yunhe Wang

Light-weight convolutional neural networks (CNNs) are specially designed for applications on mobile devices with faster inference speed. The convolutional operation can only capture local information in a window region, which prevents performance from being further improved. Introducing self-attention into convolution can capture global information well, but it largely encumbers the actual speed. In this paper, we propose a hardware-friendly attention mechanism (dubbed DFC attention) and then present a new GhostNetV2 architecture for mobile applications. The proposed DFC attention is constructed from fully-connected layers, which can not only execute fast on common hardware but also capture the dependence between long-range pixels. We further revisit the expressiveness bottleneck in the previous GhostNet and propose to enhance the expanded features produced by cheap operations with DFC attention, so that a GhostNetV2 block can aggregate local and long-range information simultaneously. Extensive experiments demonstrate the superiority of GhostNetV2 over existing architectures. For example, it achieves 75.3% top-1 accuracy on ImageNet with 167M FLOPs, significantly surpassing GhostNetV1 (74.5%) with a similar computational cost. The source code will be available at https://github.com/huawei-noah/Efficient-AI-Backbones/tree/master/ghostnetv2_pytorch and https://gitee.com/mindspore/models/tree/master/research/cv/ghostnetv2.

TMLR Journal 2022 Journal Article

GhostSR: Learning Ghost Features for Efficient Image Super-Resolution

  • Ying Nie
  • Kai Han
  • Zhenhua Liu
  • Chuanjian Liu
  • Yunhe Wang

Modern single image super-resolution (SISR) systems based on convolutional neural networks (CNNs) have achieved impressive performance but require huge computational costs. The problem of feature redundancy has been well studied in visual recognition tasks, but rarely discussed in SISR. Based on the observation that many features in SISR models are also similar to each other, we propose to use the shift operation for generating the redundant features (i.e., ghost features). Compared with depth-wise convolution, which is time-consuming on GPU-like devices, the shift operation can bring real inference acceleration for CNNs on common hardware. We analyze the benefits of the shift operation in SISR and make the shift orientation learnable based on the Gumbel-Softmax trick. Besides, a clustering procedure is explored based on pre-trained models to identify the intrinsic filters for generating the corresponding intrinsic features. The ghost features are generated by moving these intrinsic features along a certain orientation. Finally, the complete output features are constructed by concatenating the intrinsic and ghost features together. Extensive experiments on several benchmark models and datasets demonstrate that both non-compact and lightweight SISR CNN models embedded with the proposed method can achieve performance comparable to the baseline models with a large reduction of parameters, FLOPs and GPU inference latency. For example, we reduce the parameters by 46%, FLOPs by 46% and GPU inference latency by 42% of the x2 EDSR model with almost lossless performance. Code will be available at https://gitee.com/mindspore/models/tree/master/research/cv/GhostSR.
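The shift operation that turns an intrinsic feature into a ghost feature can be illustrated with a toy zero-padded 2-D shift; in the paper the shift orientation is learned via Gumbel-Softmax, whereas here it is fixed by hand:

```python
def shift2d(fmap, dy, dx):
    """Generate a 'ghost' feature map by shifting an intrinsic one,
    zero-filling uncovered positions (toy stand-in for the learned shift)."""
    h, w = len(fmap), len(fmap[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            sy, sx = y - dy, x - dx          # source pixel before the shift
            if 0 <= sy < h and 0 <= sx < w:
                out[y][x] = fmap[sy][sx]
    return out

intrinsic = [[1.0, 2.0],
             [3.0, 4.0]]
ghost = shift2d(intrinsic, dy=0, dx=1)       # shift one pixel to the right
output_channels = [intrinsic, ghost]         # concat intrinsic + ghost features
print(ghost)  # [[0.0, 1.0], [0.0, 3.0]]
```

Unlike a depth-wise convolution, the shift needs no multiplications at all — only memory moves — which is why it maps well onto common hardware.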

NeurIPS Conference 2022 Conference Paper

Learning Efficient Vision Transformers via Fine-Grained Manifold Distillation

  • Zhiwei Hao
  • Jianyuan Guo
  • Ding Jia
  • Kai Han
  • Yehui Tang
  • Chao Zhang
  • Han Hu
  • Yunhe Wang

In the past few years, transformers have achieved promising performance on various computer vision tasks. Unfortunately, the immense inference overhead of most existing vision transformers prevents them from being deployed on edge devices such as cell phones and smart watches. Knowledge distillation is a widely used paradigm for compressing cumbersome architectures into compact students via transferring information. However, most distillation methods are designed for convolutional neural networks (CNNs) and do not fully investigate the characteristics of vision transformers. In this paper, we fully utilize the patch-level information and propose a fine-grained manifold distillation method for transformer-based networks. Specifically, we train a tiny student model to match a pre-trained teacher model in the patch-level manifold space. Then, we decouple the manifold matching loss into three carefully designed terms to further reduce the computational cost of the patch relationship. Equipped with the proposed method, a DeiT-Tiny model containing 5M parameters achieves 76.5\% top-1 accuracy on ImageNet-1k, which is +2.0\% higher than previous distillation approaches. Transfer learning results on other classification benchmarks and downstream vision tasks also demonstrate the superiority of our method over the state-of-the-art algorithms.
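A patch-level "manifold" can be read as the matrix of pairwise patch similarities; matching student and teacher manifolds then reduces to comparing these matrices. The cosine-similarity form below is an illustrative assumption, not the paper's exact decoupled loss:

```python
import math

def manifold(patches):
    """Patch-level manifold: cosine-similarity matrix between patches."""
    def unit(v):
        n = math.sqrt(sum(x * x for x in v))
        return [x / n for x in v]
    u = [unit(p) for p in patches]
    return [[sum(a * b for a, b in zip(ui, uj)) for uj in u] for ui in u]

def manifold_loss(student_patches, teacher_patches):
    """Squared difference between the two similarity matrices."""
    ms, mt = manifold(student_patches), manifold(teacher_patches)
    return sum((s - t) ** 2
               for rs, rt in zip(ms, mt) for s, t in zip(rs, rt))

teacher = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
aligned = [[2.0, 0.0], [0.0, 3.0], [0.5, 0.5]]   # same directions, rescaled
print(manifold_loss(aligned, teacher))  # ~0.0 (zero up to float error)
```

Note the manifold is invariant to per-patch rescaling, so the student only needs to reproduce the relative geometry of the teacher's patches, not their magnitudes.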

NeurIPS Conference 2022 Conference Paper

Redistribution of Weights and Activations for AdderNet Quantization

  • Ying Nie
  • Kai Han
  • Haikang Diao
  • Chuanjian Liu
  • Enhua Wu
  • Yunhe Wang

Adder Neural Network (AdderNet) provides a new way to develop energy-efficient neural networks by replacing the expensive multiplications in convolution with cheaper additions (i.e., the L1-norm). To achieve higher hardware efficiency, it is necessary to further study the low-bit quantization of AdderNet. Because the commutative law of multiplication does not hold for the L1-norm, the well-established quantization methods for convolutional networks cannot be applied to AdderNets. Thus, existing AdderNet quantization techniques propose to use only one shared scale to quantize both the weights and activations simultaneously. Admittedly, such an approach can keep the commutative law in the L1-norm quantization process, but the accuracy drop after low-bit quantization cannot be ignored. To this end, we first thoroughly analyze the difference in the distributions of weights and activations in AdderNet and then propose a new quantization algorithm that redistributes the weights and the activations. Specifically, the pre-trained full-precision weights in different kernels are clustered into different groups, so that intra-group shared and inter-group independent scales can be adopted. To further compensate for the accuracy drop caused by the distribution difference, we develop a lossless range clamp scheme for weights and a simple yet effective outlier clamp strategy for activations. Thus, the functionality of the full-precision weights and the representation ability of the full-precision activations can be fully preserved. The effectiveness of the proposed quantization method for AdderNet is well verified on several benchmarks; e.g., our 4-bit post-training quantized adder ResNet-18 achieves a 66.5% top-1 accuracy on ImageNet with comparable energy efficiency, which is about 8.5% higher than that of previous AdderNet quantization methods. Code will be available at https://gitee.com/mindspore/models/tree/master/research/cv/AdderQuant.
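The benefit of inter-group independent scales over one shared scale can be seen in a toy symmetric-quantization sketch; the grouping criterion (per-kernel max magnitude) and the 4-bit setting are illustrative assumptions, not the paper's clustering procedure:

```python
def quantize(vals, scale, bits=4):
    """Symmetric uniform quantization with a given scale."""
    qmax = 2 ** (bits - 1) - 1
    return [max(-qmax, min(qmax, round(v / scale))) for v in vals]

def group_scales(kernels, bits=4):
    """One scale per kernel group (toy: scaled to each group's max
    magnitude), instead of a single shared scale for the whole layer."""
    qmax = 2 ** (bits - 1) - 1
    return [max(abs(v) for v in k) / qmax for k in kernels]

kernels = [[0.01, -0.02, 0.015],      # small-magnitude kernel
           [1.0, -0.8, 0.9]]          # large-magnitude kernel
scales = group_scales(kernels)
quant = [quantize(k, s) for k, s in zip(kernels, scales)]
recon = [[q * s for q in qk] for qk, s in zip(quant, scales)]
print(recon[0])   # small weights survive instead of collapsing to zero
```

With a single shared scale set by the large kernel, every weight in the small kernel would round to zero; per-group scales keep both kernels representable.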

NeurIPS Conference 2022 Conference Paper

Vision GNN: An Image is Worth Graph of Nodes

  • Kai Han
  • Yunhe Wang
  • Jianyuan Guo
  • Yehui Tang
  • Enhua Wu

Network architecture plays a key role in deep learning-based computer vision systems. The widely used convolutional neural network and transformer treat the image as a grid or sequence structure, which is not flexible enough to capture irregular and complex objects. In this paper, we propose to represent the image as a graph structure and introduce a new Vision GNN (ViG) architecture to extract graph-level features for visual tasks. We first split the image into a number of patches which are viewed as nodes, and construct a graph by connecting the nearest neighbors. Based on the graph representation of images, we build our ViG model to transform and exchange information among all the nodes. ViG consists of two basic modules: the Grapher module with graph convolution for aggregating and updating graph information, and the FFN module with two linear layers for node feature transformation. Both isotropic and pyramid architectures of ViG are built with different model sizes. Extensive experiments on image recognition and object detection tasks demonstrate the superiority of our ViG architecture. We hope this pioneering study of GNNs on general visual tasks will provide useful inspiration and experience for future research. The PyTorch code is available at https://github.com/huawei-noah/Efficient-AI-Backbones and the MindSpore code is available at https://gitee.com/mindspore/models.
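The graph-construction step — patches as nodes, edges to the nearest neighbors in feature space — can be sketched as below; the toy 2-D patch embeddings and squared-Euclidean distance are assumptions for illustration:

```python
def knn_graph(patch_feats, k):
    """Connect each patch (node) to its k nearest neighbours in
    feature space, producing a directed edge list."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    edges = []
    for i, f in enumerate(patch_feats):
        others = sorted((j for j in range(len(patch_feats)) if j != i),
                        key=lambda j: dist(f, patch_feats[j]))
        edges.extend((i, j) for j in others[:k])
    return edges

# 4 toy patch embeddings forming two tight clusters
patches = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]]
edges = knn_graph(patches, k=1)
print(edges)  # [(0, 1), (1, 0), (2, 3), (3, 2)]
```

Because neighbors are chosen in feature space rather than on the pixel grid, similar patches connect regardless of spatial position — the flexibility that a grid or sequence structure lacks.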

NeurIPS Conference 2021 Conference Paper

Augmented Shortcuts for Vision Transformers

  • Yehui Tang
  • Kai Han
  • Chang Xu
  • An Xiao
  • Yiping Deng
  • Chao Xu
  • Yunhe Wang

Transformer models have achieved great progress on computer vision tasks recently. The rapid development of vision transformers is mainly attributable to their high representation ability for extracting informative features from input images. However, mainstream transformer models are designed with deep architectures, and the feature diversity is continuously reduced as the depth increases, i.e., feature collapse. In this paper, we theoretically analyze the feature collapse phenomenon and study the relationship between shortcuts and feature diversity in these transformer models. Then, we present an augmented shortcut scheme, which inserts additional paths with learnable parameters in parallel with the original shortcuts. To save computational costs, we further explore an efficient approach that uses block-circulant projection to implement the augmented shortcuts. Extensive experiments conducted on benchmark datasets demonstrate the effectiveness of the proposed method, which brings about a 1% accuracy increase for state-of-the-art visual transformers without obviously increasing their parameters and FLOPs.

NeurIPS Conference 2021 Conference Paper

Dynamic Resolution Network

  • Mingjian Zhu
  • Kai Han
  • Enhua Wu
  • Qiulin Zhang
  • Ying Nie
  • Zhenzhong Lan
  • Yunhe Wang

Deep convolutional neural networks (CNNs) are often of sophisticated design, with numerous learnable parameters, for the sake of accuracy. To alleviate the expensive cost of deploying them on mobile devices, recent works have made huge efforts to excavate redundancy in pre-defined architectures. Nevertheless, redundancy in the input resolution of modern CNNs has not been fully investigated, i.e., the resolution of the input image is fixed. In this paper, we observe that the smallest resolution needed to accurately predict a given image differs from image to image, even with the same neural network. To this end, we propose a novel dynamic-resolution network (DRNet) in which the input resolution is determined dynamically for each input sample. A resolution predictor with negligible computational cost is explored and optimized jointly with the desired network. Specifically, the predictor learns the smallest resolution that can retain, or even exceed, the original recognition accuracy for each image. During inference, each input image is resized to its predicted resolution to minimize the overall computation burden. We then conduct extensive experiments on several benchmark networks and datasets. The results show that our DRNet can be embedded in any off-the-shelf network architecture to obtain a considerable reduction in computational complexity. For instance, DR-ResNet-50 achieves similar performance with about 34% less computation, and gains a 1.4% accuracy increase with 10% less computation, compared to the original ResNet-50 on ImageNet. Code will be available at https://gitee.com/mindspore/models/tree/master/research/cv/DRNet.
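
The control flow — a cheap predictor picks a per-image resolution, the image is resized to it, and only then runs through the main network — might look like the toy sketch below. The variance-based `predict_resolution` is purely a hypothetical stand-in for the paper's learned, jointly optimized predictor, and the candidate resolutions are made up.

```python
import numpy as np

CANDIDATES = [8, 16, 32]  # hypothetical candidate input resolutions

def resize(img, res):
    """Nearest-neighbour resize of a square image (h, w) to (res, res)."""
    h, w = img.shape
    ys = np.arange(res) * h // res
    xs = np.arange(res) * w // res
    return img[np.ix_(ys, xs)]

def predict_resolution(img):
    """Toy stand-in for the learned predictor: route 'easy' low-variance images
    to a small resolution, harder ones to a larger one."""
    idx = min(int(img.var() * len(CANDIDATES)), len(CANDIDATES) - 1)
    return CANDIDATES[idx]

def dynamic_forward(img, network):
    return network(resize(img, predict_resolution(img)))

rng = np.random.default_rng(0)
img = rng.random((32, 32))
out = dynamic_forward(img, lambda x: x.mean(axis=1))  # dummy stand-in 'network'
print(out.shape[0] in CANDIDATES)  # True
```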

NeurIPS Conference 2021 Conference Paper

Learning Frequency Domain Approximation for Binary Neural Networks

  • Yixing Xu
  • Kai Han
  • Chang Xu
  • Yehui Tang
  • Chunjing Xu
  • Yunhe Wang

Binary neural networks (BNNs) represent the original full-precision weights and activations as 1-bit values using the sign function. Since the gradient of the conventional sign function is almost zero everywhere and thus cannot be used for back-propagation, several attempts have been made to alleviate the optimization difficulty by using approximate gradients. However, those approximations corrupt the main direction of the factual gradient. To this end, we propose to estimate the gradient of the sign function in the Fourier frequency domain using a combination of sine functions for training BNNs, namely frequency domain approximation (FDA). The proposed approach does not affect the low-frequency information of the original sign function, which occupies most of the overall energy, while high-frequency coefficients are ignored to avoid huge computational overhead. In addition, we embed a noise adaptation module into the training phase to compensate for the approximation error. Experiments on several benchmark datasets and neural architectures illustrate that binary networks learned using our method achieve state-of-the-art accuracy. Code will be available at https://gitee.com/mindspore/models/tree/master/research/cv/FDA-BNN.
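
The core trick — decompose sign into its Fourier (sine) series, keep only the low-frequency terms, and differentiate the truncated series instead of the true sign (whose gradient is zero almost everywhere) — can be sketched numerically. On (-1, 1), sign(t) coincides with a period-2 square wave, whose series is (4/π) Σ_{k odd} sin(kπt)/k. The term count and the interval here are illustrative choices, not the paper's settings.

```python
import numpy as np

def sign_fourier(t, n_terms=100):
    """Truncated Fourier series of the square wave: approximates sign(t) on (-1, 1)."""
    k = np.arange(1, 2 * n_terms, 2)                 # odd harmonics 1, 3, 5, ...
    return (4 / np.pi) * np.sum(np.sin(np.outer(t, k) * np.pi) / k, axis=1)

def sign_fourier_grad(t, n_terms=100):
    """Analytic derivative of the truncated series -- usable in back-propagation."""
    k = np.arange(1, 2 * n_terms, 2)
    return 4 * np.sum(np.cos(np.outer(t, k) * np.pi), axis=1)

t = np.array([-0.5, 0.5])
print(sign_fourier(t))  # close to [-1, 1]
```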

NeurIPS Conference 2021 Conference Paper

Novel Visual Category Discovery with Dual Ranking Statistics and Mutual Knowledge Distillation

  • Bingchen Zhao
  • Kai Han

In this paper, we tackle the problem of novel visual category discovery, i.e., grouping unlabelled images from new classes into different semantic partitions by leveraging a labelled dataset that contains images from other different but relevant categories. This is a more realistic and challenging setting than conventional semi-supervised learning. We propose a two-branch learning framework for this problem, with one branch focusing on local part-level information and the other branch focusing on overall characteristics. To transfer knowledge from the labelled data to the unlabelled, we propose using dual ranking statistics on both branches to generate pseudo labels for training on the unlabelled data. We further introduce a mutual knowledge distillation method to allow information exchange and encourage agreement between the two branches for discovering new categories, allowing our model to enjoy the benefits of global and local features. We comprehensively evaluate our method on public benchmarks for generic object classification, as well as the more challenging datasets for fine-grained visual recognition, achieving state-of-the-art performance.
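
One way ranking statistics can generate pseudo-labels, following the general idea rather than the paper's exact recipe: two unlabelled images are treated as a positive pair when the top-k ranked dimensions of their feature vectors coincide. The magnitude ranking and k=3 below are illustrative assumptions.

```python
import numpy as np

def topk_set(x, k):
    """Indices of the k largest-magnitude feature dimensions."""
    return set(np.argsort(-np.abs(x))[:k].tolist())

def pseudo_positive(f1, f2, k=3):
    """Pseudo-label: positive pair iff the top-k ranked dimensions match."""
    return topk_set(f1, k) == topk_set(f2, k)

a = np.array([9.0, 8.0, 7.0, 0.1, 0.2])
b = np.array([7.5, 9.5, 8.5, 0.3, 0.0])   # same top-3 dims {0, 1, 2} as a
c = np.array([0.1, 0.2, 0.3, 9.0, 8.0])   # different top-3 dims
print(pseudo_positive(a, b), pseudo_positive(a, c))  # True False
```

Running this on both branches ("dual" statistics) and requiring agreement would give the pair labels used to train on the unlabelled split.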

NeurIPS Conference 2021 Conference Paper

Post-Training Quantization for Vision Transformer

  • Zhenhua Liu
  • Yunhe Wang
  • Kai Han
  • Wei Zhang
  • Siwei Ma
  • Wen Gao

Recently, transformers have achieved remarkable performance on a variety of computer vision applications. Compared with mainstream convolutional neural networks, vision transformers often use sophisticated architectures for extracting powerful feature representations, which makes them more difficult to deploy on mobile devices. In this paper, we present an effective post-training quantization algorithm for reducing the memory storage and computational costs of vision transformers. Basically, the quantization task can be regarded as finding the optimal low-bit quantization intervals for weights and inputs, respectively. To preserve the functionality of the attention mechanism, we introduce a ranking loss into the conventional quantization objective, which aims to keep the relative order of the self-attention results after quantization. Moreover, we thoroughly analyze the relationship between the quantization loss of different layers and feature diversity, and explore a mixed-precision quantization scheme that exploits the nuclear norm of each attention map and output feature. The effectiveness of the proposed method is verified on several benchmark models and datasets, where it outperforms state-of-the-art post-training quantization algorithms. For instance, we obtain 81.29% top-1 accuracy using the DeiT-B model on the ImageNet dataset with about 8-bit quantization. Code will be available at https://gitee.com/mindspore/models/tree/master/research/cv/VT-PTQ.
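
The ranking loss described above — penalize pairs of self-attention scores whose relative order flips after quantization — can be sketched as a pairwise hinge. The zero-margin hinge form below is an illustrative choice, not necessarily the paper's exact formulation.

```python
import numpy as np

def ranking_loss(orig, quant, margin=0.0):
    """Hinge penalty on score pairs whose full-precision order is violated
    by the quantized scores."""
    order = np.sign(orig[:, None] - orig[None, :])           # desired pairwise order
    viol = margin - order * (quant[:, None] - quant[None, :])
    return np.maximum(0.0, viol * np.abs(order)).sum()       # mask i == j / tied pairs

orig = np.array([1.0, 2.0, 3.0])
print(ranking_loss(orig, np.array([0.1, 0.2, 0.3])))      # 0.0 (order preserved)
print(ranking_loss(orig, np.array([0.3, 0.2, 0.1])) > 0)  # True (order flipped)
```

Adding this term to the usual reconstruction objective steers the choice of quantization intervals toward ones that keep attention rankings intact.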

NeurIPS Conference 2021 Conference Paper

Transformer in Transformer

  • Kai Han
  • An Xiao
  • Enhua Wu
  • Jianyuan Guo
  • Chunjing Xu
  • Yunhe Wang

Transformer is a type of neural architecture that encodes input data as powerful features via the attention mechanism. Basically, visual transformers first divide the input images into several local patches and then calculate both their representations and their relationships. Since natural images are of high complexity, with abundant detail and color information, the granularity of patch division is not fine enough for excavating features of objects at different scales and locations. In this paper, we point out that the attention inside these local patches is also essential for building visual transformers with high performance, and we explore a new architecture, namely, Transformer iN Transformer (TNT). Specifically, we regard the local patches (e.g., 16×16) as “visual sentences” and further divide them into smaller patches (e.g., 4×4) as “visual words”. The attention of each word is calculated with the other words in the given visual sentence at negligible computational cost. Features of both words and sentences are aggregated to enhance the representation ability. Experiments on several benchmarks demonstrate the effectiveness of the proposed TNT architecture; e.g., we achieve 81.5% top-1 accuracy on ImageNet, which is about 1.7% higher than that of the state-of-the-art visual transformer with similar computational cost. The PyTorch code is available at https://github.com/huawei-noah/CV-Backbones, and the MindSpore code is available at https://gitee.com/mindspore/models/tree/master/research/cv/TNT.
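
The two-level attention can be sketched as: run self-attention among the "words" inside each "sentence", project the updated words back into the sentence embedding, then run self-attention among the sentences. Identity Q/K/V projections, a single head, and the fusion projection `proj` are illustrative simplifications of a real TNT block.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(x):
    """Single-head attention with identity Q/K/V projections (for illustration)."""
    return softmax(x @ x.T / np.sqrt(x.shape[-1])) @ x

def tnt_block(sentences, words, proj):
    """Inner attention over words, fuse into sentences, outer attention over sentences."""
    words = np.stack([self_attention(w) for w in words])      # (n, m, dw) inner
    sentences = sentences + words.reshape(len(sentences), -1) @ proj  # fuse words
    return self_attention(sentences), words                   # (n, ds) outer

rng = np.random.default_rng(0)
n, m, dw, ds = 4, 3, 5, 8                # 4 sentences, 3 words each
sent = rng.normal(size=(n, ds))
wrd = rng.normal(size=(n, m, dw))
out_s, out_w = tnt_block(sent, wrd, 0.1 * rng.normal(size=(m * dw, ds)))
print(out_s.shape, out_w.shape)          # (4, 8) (4, 3, 5)
```

The inner attention is cheap because it runs on only m words per sentence, which is why word-level modeling adds negligible cost.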

NeurIPS Conference 2020 Conference Paper

Deterministic Approximation for Submodular Maximization over a Matroid in Nearly Linear Time

  • Kai Han
  • Zongmai Cao
  • Shuang Cui
  • Benwei Wu

We study the problem of maximizing a non-monotone, non-negative submodular function subject to a matroid constraint. The prior best-known deterministic approximation ratio for this problem is $\frac{1}{4}-\epsilon$ under $\mathcal{O}(({n^4}/{\epsilon})\log n)$ time complexity. We show that this deterministic ratio can be improved to $\frac{1}{4}$ under $\mathcal{O}(nr)$ time complexity, and then present a more practical algorithm dubbed TwinGreedyFast which achieves $\frac{1}{4}-\epsilon$ deterministic ratio in nearly-linear running time of $\mathcal{O}(\frac{n}{\epsilon}\log\frac{r}{\epsilon})$. Our approach is based on a novel algorithmic framework of simultaneously constructing two candidate solution sets through greedy search, which enables us to get improved performance bounds by fully exploiting the properties of independence systems. As a byproduct of this framework, we also show that TwinGreedyFast achieves $\frac{1}{2p+2}-\epsilon$ deterministic ratio under a $p$-set system constraint with the same time complexity. To showcase the practicality of our approach, we empirically evaluated the performance of TwinGreedyFast on two network applications, and observed that it outperforms the state-of-the-art deterministic and randomized algorithms with efficient implementations for our problem.
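
The twin-set framework can be sketched with a cardinality-r constraint (a simple matroid) and a set-coverage objective: at each step the element/set pair with the largest feasible positive marginal gain is inserted, the two candidate sets stay disjoint, and the better one is returned. This is a simplified illustration of the framework, not the thresholded TwinGreedyFast variant.

```python
def twin_greedy(elements, f, r):
    """Greedily build two disjoint candidate solutions; return the better one."""
    s1, s2 = set(), set()
    while True:
        best_gain, best = 0.0, None
        for e in elements:
            if e in s1 or e in s2:                  # sets stay disjoint
                continue
            for s in (s1, s2):
                if len(s) < r:                      # feasibility (cardinality matroid)
                    gain = f(s | {e}) - f(s)
                    if gain > best_gain:
                        best_gain, best = gain, (s, e)
        if best is None:                            # no positive marginal gain left
            break
        best[0].add(best[1])
    return max((s1, s2), key=f)

# Coverage objective: f(S) = number of items covered by the chosen subsets.
cover = {'a': {1, 2, 3}, 'b': {3, 4}, 'c': {4, 5, 6}, 'd': {1}}
f = lambda S: len(set().union(*(cover[e] for e in S))) if S else 0
sol = twin_greedy(list(cover), f, r=2)
print(sol, f(sol))  # {'a', 'c'} with value 6
```

Coverage is monotone submodular, so this toy case does not exercise non-monotonicity; it only illustrates the two-candidate-set mechanics.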

NeurIPS Conference 2020 Conference Paper

Dual-Resolution Correspondence Networks

  • Xinghui Li
  • Kai Han
  • Shuda Li
  • Victor Prisacariu

We tackle the problem of establishing dense pixel-wise correspondences between a pair of images. In this work, we introduce Dual-Resolution Correspondence Networks (DualRC-Net) to obtain pixel-wise correspondences in a coarse-to-fine manner. DualRC-Net extracts both coarse- and fine-resolution feature maps. The coarse maps are used to produce a full but coarse 4D correlation tensor, which is then refined by a learnable neighbourhood consensus module. The fine-resolution feature maps are used to obtain the final dense correspondences, guided by the refined coarse 4D correlation tensor. The selected coarse-resolution matching scores allow the fine-resolution features to focus only on a limited number of possible matches with high confidence. In this way, DualRC-Net dramatically increases matching reliability and localisation accuracy, while avoiding applying the expensive 4D convolution kernels to fine-resolution feature maps. We comprehensively evaluate our method on large-scale public benchmarks including HPatches, InLoc, and Aachen Day-Night. It achieves state-of-the-art results on all of them.
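
The coarse 4D correlation tensor is simply the dot product between every coarse location in image A and every coarse location in image B. A sketch, with plain argmax matching standing in for the learnable neighbourhood-consensus refinement:

```python
import numpy as np

def correlation_4d(fa, fb):
    """Dense 4D correlation: fa (ha, wa, c) x fb (hb, wb, c) -> (ha, wa, hb, wb)."""
    return np.einsum('ijc,klc->ijkl', fa, fb)

def best_matches(corr):
    """For each coarse location in A, the (row, col) in B with the highest score."""
    ha, wa, hb, wb = corr.shape
    flat = corr.reshape(ha, wa, hb * wb).argmax(-1)
    return np.stack(np.unravel_index(flat, (hb, wb)), axis=-1)   # (ha, wa, 2)

rng = np.random.default_rng(0)
fa = rng.normal(size=(4, 4, 16))
fa /= np.linalg.norm(fa, axis=-1, keepdims=True)   # unit-norm features
corr = correlation_4d(fa, fa)                      # match an image against itself
match = best_matches(corr)
ys, xs = np.meshgrid(np.arange(4), np.arange(4), indexing='ij')
print((match[..., 0] == ys).all() and (match[..., 1] == xs).all())  # True
```

Even at a tiny 4×4 coarse grid the tensor has 4⁴ entries, which is why DualRC-Net refines only this coarse tensor and never builds it at fine resolution.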

NeurIPS Conference 2020 Conference Paper

Model Rubik’s Cube: Twisting Resolution, Depth and Width for TinyNets

  • Kai Han
  • Yunhe Wang
  • Qiulin Zhang
  • Wei Zhang
  • Chunjing Xu
  • Tong Zhang

To obtain excellent deep neural architectures, a series of techniques was carefully designed in EfficientNets. The giant formula for simultaneously enlarging resolution, depth and width provides us with a Rubik's cube for neural networks, so that we can find networks with high efficiency and excellent performance by twisting the three dimensions. This paper aims to explore the twisting rules for obtaining deep neural networks with minimal model sizes and computational costs. Different from network enlarging, we observe that resolution and depth are more important than width for tiny networks. Therefore, the original method, i.e., the compound scaling in EfficientNet, is no longer suitable. To this end, we summarize a tiny formula for downsizing neural architectures through a series of smaller models derived from EfficientNet-B0 under a FLOPs constraint. Experimental results on the ImageNet benchmark illustrate that our TinyNet performs much better than the smaller versions of EfficientNets obtained using the inversed giant formula. For instance, our TinyNet-E achieves a 59.9% top-1 accuracy with only 24M FLOPs, which is about 1.9% higher than that of the previous best MobileNetV3 with similar computational cost. Code will be available at https://github.com/huawei-noah/CV-Backbones/tree/master/tinynet, and https://gitee.com/mindspore/mindspore/tree/master/model_zoo/research/cv/tinynet.
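
The FLOPs bookkeeping behind such twisting rules: a CNN's cost scales roughly as r²·d·w² in the resolution, depth and width multipliers, so fixing a budget fraction c and choosing (r, d) determines w. The functions below sketch that generic identity only; they are not the paper's fitted tiny formula.

```python
import math

def flops_scale(r, d, w):
    """Relative FLOPs of a CNN scaled by resolution r, depth d, width w."""
    return r ** 2 * d * w ** 2

def width_for_budget(c, r, d):
    """Solve c = r^2 * d * w^2 for the width multiplier w."""
    return math.sqrt(c / (r ** 2 * d))

w = width_for_budget(0.25, r=0.8, d=0.9)        # shrink to ~25% of baseline FLOPs
print(abs(flops_scale(0.8, 0.9, w) - 0.25) < 1e-12)  # True
```

The paper's observation amounts to choosing (r, d, w) triples on this budget surface that keep r and d relatively high and let w absorb most of the shrinkage.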

NeurIPS Conference 2020 Conference Paper

Searching for Low-Bit Weights in Quantized Neural Networks

  • Zhaohui Yang
  • Yunhe Wang
  • Kai Han
  • Chunjing Xu
  • Chao Xu
  • Dacheng Tao
  • Chang Xu

Quantized neural networks with low-bit weights and activations are attractive for developing AI accelerators. However, the quantization functions used in most conventional quantization methods are non-differentiable, which increases the optimization difficulty of quantized networks. Compared with full-precision parameters (i.e., 32-bit floating-point numbers), low-bit values are selected from a much smaller set. For example, there are only 16 possibilities in a 4-bit space. Thus, we propose to regard the discrete weights in an arbitrary quantized neural network as searchable variables, and utilize a differentiable method to search for them accurately. In particular, each weight is represented as a probability distribution over the discrete value set. The probabilities are optimized during training, and the values with the highest probability are selected to establish the desired quantized network. Experimental results on benchmarks demonstrate that the proposed method is able to produce quantized neural networks with higher performance than the state of the art on both image classification and super-resolution tasks.
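
Representing each weight as a distribution over the discrete value set can be sketched as follows: a softmax over per-weight logits gives a differentiable "soft" weight for training, and argmax picks the final discrete value. Using the expectation as the training-time weight is an illustrative relaxation, not necessarily the paper's exact estimator.

```python
import numpy as np

VALUES = np.linspace(-1.0, 1.0, 16)          # 4-bit: only 16 possible weight values

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def soft_weights(logits, temperature=1.0):
    """Differentiable training-time weights: expectation under the distribution."""
    return softmax(logits / temperature) @ VALUES

def hard_weights(logits):
    """Final quantized weights: the most probable discrete value per weight."""
    return VALUES[logits.argmax(axis=-1)]

rng = np.random.default_rng(0)
logits = rng.normal(size=(5, 16))            # 5 weights, each a distribution over 16 values
print(soft_weights(logits).shape)            # (5,)
```

Lowering the temperature sharpens the softmax, shrinking the gap between the soft training weights and the hard weights used at inference.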

IJCAI Conference 2019 Conference Paper

Attribute Aware Pooling for Pedestrian Attribute Recognition

  • Kai Han
  • Yunhe Wang
  • Han Shu
  • Chuanjian Liu
  • Chunjing Xu
  • Chang Xu

This paper extends the strength of deep convolutional neural networks (CNNs) to the pedestrian attribute recognition problem by devising a novel attribute aware pooling algorithm. Existing vanilla CNNs cannot be straightforwardly applied to handle multi-attribute data because of the larger label space as well as the attribute entanglement and correlations. We tackle these challenges, which hamper the development of CNNs for multi-attribute classification, by fully exploiting the correlation between different attributes. A multi-branch architecture is adopted for focusing on attributes in different regions. Besides the prediction from each branch itself, the context information of each branch is employed in the decision as well. The attribute aware pooling is developed to integrate both kinds of information. Therefore, attributes which are indistinct or entangled with others can be accurately recognized by exploiting context information. Experiments on benchmark datasets demonstrate that the proposed pooling method appropriately explores and exploits the correlations between attributes for pedestrian attribute recognition.

IJCAI Conference 2019 Conference Paper

Learning Instance-wise Sparsity for Accelerating Deep Models

  • Chuanjian Liu
  • Yunhe Wang
  • Kai Han
  • Chunjing Xu
  • Chang Xu

Exploring deep convolutional neural networks with high efficiency and low memory usage is essential for a wide variety of machine learning tasks. Most existing approaches accelerate deep models by manipulating parameters or filters without considering the data, e.g., pruning and decomposition. In contrast, we study this problem from a different perspective by respecting the differences between inputs. An instance-wise feature pruning is developed by identifying informative features for different instances. Specifically, by investigating a feature decay regularization, we expect the intermediate feature maps of each instance in deep neural networks to be sparse while preserving the overall network performance. During online inference, subtle features of input images extracted by intermediate layers of a well-trained neural network can be eliminated to accelerate the subsequent calculations. We further take the coefficient of variation as a measure to select the layers that are appropriate for acceleration. Extensive experiments conducted on benchmark datasets and networks demonstrate the effectiveness of the proposed method.
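
A minimal sketch of the inference-time idea: score each channel of one instance's feature map by magnitude, zero out the weakest ones, and use the coefficient of variation of those scores to judge whether a layer is worth pruning at all. The mean-magnitude score and the fixed keep-ratio are illustrative choices standing in for the learned, regularized sparsity.

```python
import numpy as np

def coeff_of_variation(scores):
    """std/mean -- a high value means channel importances differ a lot,
    so the layer is a good candidate for instance-wise pruning."""
    return scores.std() / scores.mean()

def prune_instance(feats, keep_ratio=0.5):
    """Zero the weakest channels of ONE instance's (channels, h, w) feature map."""
    scores = np.abs(feats).mean(axis=(1, 2))          # per-channel magnitude
    keep = np.argsort(-scores)[: int(len(scores) * keep_ratio)]
    mask = np.zeros(len(scores), dtype=bool)
    mask[keep] = True
    return feats * mask[:, None, None]

rng = np.random.default_rng(0)
feats = rng.normal(size=(8, 4, 4))
pruned = prune_instance(feats)
print(int((np.abs(pruned).sum(axis=(1, 2)) > 0).sum()))  # 4 channels survive
```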

NeurIPS Conference 2019 Conference Paper

Positive-Unlabeled Compression on the Cloud

  • Yixing Xu
  • Yunhe Wang
  • Hanting Chen
  • Kai Han
  • Chunjing Xu
  • Dacheng Tao
  • Chang Xu

Many attempts have been made to extend the great success of convolutional neural networks (CNNs) achieved on high-end GPU servers to portable devices such as smartphones. Providing compression and acceleration services for deep learning models on the cloud is therefore significant and attractive for end users. However, existing network compression and acceleration approaches usually fine-tune the svelte model by requesting the entire original training data (e.g., ImageNet), which could be more cumbersome than the network itself and cannot be easily uploaded to the cloud. In this paper, we present a novel positive-unlabeled (PU) setting for addressing this problem. In practice, only a small portion of the original training set is required as positive examples, and more useful training examples can be obtained from the massive unlabeled data on the cloud through a PU classifier with an attention-based multi-scale feature extractor. We further introduce a robust knowledge distillation (RKD) scheme to deal with the class imbalance problem of these newly augmented training examples. The superiority of the proposed method is verified through experiments conducted on benchmark models and datasets. Using only 8% of uniformly selected data from ImageNet, we obtain an efficient model with performance comparable to the baseline ResNet-34.

NeurIPS Conference 2018 Conference Paper

Greedy Hash: Towards Fast Optimization for Accurate Hash Coding in CNN

  • Shupeng Su
  • Chao Zhang
  • Kai Han
  • Yonghong Tian

To convert inputs into binary codes, hashing algorithms have been widely used for approximate nearest neighbor search on large-scale image sets due to their computation and storage efficiency. Deep hashing further improves retrieval quality by combining hash coding with deep neural networks. However, a major difficulty in deep hashing lies in the discrete constraints imposed on the network output, which generally make the optimization NP-hard. In this work, we adopt the greedy principle to tackle this NP-hard problem by iteratively updating the network toward the probable optimal discrete solution in each iteration. A hash coding layer is designed to implement our approach: it strictly uses the sign function in forward propagation to maintain the discrete constraints, while in back-propagation the gradients are transmitted intact to the preceding layer to avoid vanishing gradients. In addition to the theoretical derivation, we provide a new perspective to visualize and understand the effectiveness and efficiency of our algorithm. Experiments on benchmark datasets show that our scheme outperforms state-of-the-art hashing methods in both supervised and unsupervised tasks.
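
The hash layer described above — strict sign in the forward pass, gradients passed through unchanged in the backward pass (a straight-through-style update) — can be sketched as a tiny manual training step. The squared loss against a target binary code is an illustrative objective, not the paper's retrieval loss.

```python
import numpy as np

def hash_forward(h):
    """Forward: strict sign keeps the code exactly binary."""
    return np.sign(h)

def hash_backward(grad_wrt_code):
    """Backward: pass the gradient through intact (sign's own gradient is 0 a.e.)."""
    return grad_wrt_code

# A few manual gradient steps toward a target binary code.
h = np.array([0.2, -0.4, 0.1])
target = np.array([-1.0, 1.0, 1.0])
for _ in range(20):
    b = hash_forward(h)                 # binary code
    grad_b = 2.0 * (b - target)         # d/db of ||b - target||^2
    h -= 0.1 * hash_backward(grad_b)    # update the pre-binarization activations
print(hash_forward(h))  # [-1.  1.  1.]
```

Because the forward pass never relaxes the code, there is no quantization gap between training and retrieval; the only approximation lives in the backward pass.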