Arrow Research search

Author name cluster

Can Qin

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

17 papers
2 author rows

Possible papers

17

TMLR Journal 2026 Journal Article

A Survey of Token Compression for Efficient Multimodal Large Language Models

  • Kele Shao
  • Keda TAO
  • Kejia Zhang
  • Sicheng Feng
  • Mu Cai
  • Yuzhang Shang
  • Haoxuan You
  • Can Qin

Multimodal large language models (MLLMs) have made remarkable strides, largely driven by their ability to process increasingly long and complex contexts, such as high-resolution images, extended video sequences, and lengthy audio inputs. While this ability significantly enhances MLLM capabilities, it introduces substantial computational challenges, primarily due to the quadratic complexity of self-attention mechanisms with numerous input tokens. To mitigate these bottlenecks, token compression has emerged as a promising and critical approach, efficiently reducing the number of tokens during both training and inference. In this paper, we present the first systematic survey and synthesis of the burgeoning field of multimodal long-context token compression. Recognizing that effective compression strategies are deeply tied to the unique characteristics and redundancies of each modality, we categorize existing approaches by their primary data focus, enabling researchers to quickly access and learn methods tailored to their specific area of interest: (1) image-centric compression, which addresses spatial redundancy in visual data; (2) video-centric compression, which tackles spatio-temporal redundancy in dynamic sequences; and (3) audio-centric compression, which handles temporal and spectral redundancy in acoustic signals. Beyond this modality-driven categorization, we further dissect methods based on their underlying mechanisms, including transformation-based, similarity-based, attention-based, and query-based approaches. By providing a comprehensive and structured overview, this survey aims to consolidate current progress, identify key challenges, and inspire future research directions in this rapidly evolving domain.
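To make the similarity-based mechanism concrete, here is a minimal sketch of pairwise token merging in the spirit of the bipartite-matching approaches the survey covers. All names, shapes, and the averaging rule are illustrative assumptions, not code from the paper.

```python
import torch

def merge_similar_tokens(tokens: torch.Tensor, r: int) -> torch.Tensor:
    """Illustrative similarity-based token compression.

    tokens: (N, D) token features; r: number of tokens to remove.
    Split tokens into two alternating sets, match each token in set A
    to its most similar token in set B, and average away the r most
    redundant pairs.
    """
    a, b = tokens[::2], tokens[1::2]              # bipartite split
    a_n = a / a.norm(dim=-1, keepdim=True)
    b_n = b / b.norm(dim=-1, keepdim=True)
    sim = a_n @ b_n.T                             # cosine similarity matrix
    best_sim, best_idx = sim.max(dim=-1)          # best partner in B per A token
    order = best_sim.argsort(descending=True)     # most redundant pairs first
    src, kept = order[:r], order[r:]
    b = b.clone()
    # average merged A tokens into their partners (duplicates overwrite here;
    # a real implementation would scatter-reduce instead)
    b[best_idx[src]] = (b[best_idx[src]] + a[src]) / 2
    return torch.cat([a[kept], b], dim=0)         # N - r tokens remain

x = torch.randn(196, 64)                      # e.g. 14x14 ViT patch tokens
print(merge_similar_tokens(x, r=49).shape)    # torch.Size([147, 64])
```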

AAAI Conference 2026 Conference Paper

DcSplat: Dual-Constraint Human Gaussian Splatting with Latent Multi-View Consistency

  • Tengfei Xiao
  • Yue Wu
  • Zhigang Gao
  • Yongzhe Yuan
  • Can Qin
  • Hao Li
  • Mingyang Zhang

Human Novel View Synthesis (HNVS) aims to synthesize photorealistic human images from novel viewpoints given observations from known views. Despite significant advances, existing methods such as NeRF, diffusion models, and 3DGS still face substantial challenges in achieving stable modeling from a single image. In this paper, we introduce Dual-Constraint Human Gaussian Splatting (DcSplat), a novel, simple, and efficient 3D Gaussian-based framework for single-view 3D human reconstruction. To address occlusion-induced texture loss and depth ambiguities, we introduce two key components: a Latent Multi-View Consistency Constraint Mechanism and a Geometric Constraint Module. The former employs a Latent-space Appearance Transformer (LatentFormer) to learn semantically coherent, view-consistent appearance priors via SMPL-guided pseudo-view fusion. The latter refines noisy SMPL-based depth through a U-Net-like structure conditioned on latent appearance features. These two modules are jointly optimized to generate high-quality Gaussian parameters in a unified latent space. Extensive experiments demonstrate that DcSplat outperforms existing SOTA methods in both geometry and texture quality, while achieving fast inference and lower computational cost.

AAAI Conference 2026 Conference Paper

Hybrid Vector-Occupancy Field for Robust Implicit 3D Surface Reconstruction

  • Yue Wu
  • Zhigang Gao
  • Tengfei Xiao
  • Can Qin
  • Yongzhe Yuan
  • Hao Li
  • Kaiyuan Feng
  • Wenping Ma

We introduce the Hybrid Vector-Occupancy Field (HVOF), a new implicit 3D representation for reconstructing both open and closed surfaces from sparse point clouds. Existing approaches face severe limitations: occupancy fields and signed distance fields struggle with open surfaces, while unsigned distance fields and neural vector fields exhibit directional instability in complex topologies and ridge regions. HVOF addresses these challenges by incorporating a smoothly decaying occupancy field around the surface, while capturing precise local geometry using truncated displacement vectors, naturally mitigating direction-field ambiguities near ridge regions. This unified design forms a robust hybrid representation that leverages both occupancy and vector fields. To realize it, we design a Hybrid Field variational autoencoder comprising a hierarchical cross-attention encoder and a dual-branch decoder that jointly learn occupancy and vector fields through continuous weighting. Extensive experiments demonstrate that HVOF consistently outperforms state-of-the-art methods across the ShapeNet, ABC, and MGN datasets, accurately reconstructing both open and closed surfaces while preserving fine geometric details in complex regions.
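The hybrid representation is easiest to picture on a toy analytic shape. Below is a hedged sketch for a unit sphere: each query point gets a truncated displacement vector to the surface plus a smoothly decaying occupancy value. The constants, function name, and the Gaussian decay are illustrative assumptions, not the paper's formulation.

```python
import numpy as np

TRUNC = 0.1    # truncation radius for displacement vectors (illustrative)
SIGMA = 0.05   # occupancy decay bandwidth (illustrative)

def hybrid_field(p: np.ndarray):
    """Toy hybrid vector-occupancy field for a unit sphere at the origin."""
    dist = np.linalg.norm(p)
    assert dist > 0, "query at the center has no unique nearest surface point"
    surface_point = p / dist                      # nearest point on the sphere
    disp = surface_point - p                      # vector pointing to the surface
    d = np.linalg.norm(disp)
    if d > TRUNC:                                 # truncate far-field vectors
        disp = disp / d * TRUNC
    occ = np.exp(-(d ** 2) / (2 * SIGMA ** 2))    # decays smoothly off-surface
    return disp, occ

disp, occ = hybrid_field(np.array([0.0, 0.0, 0.95]))
print(disp, occ)   # short vector toward the surface; occupancy ~0.61
```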

TMLR Journal 2026 Journal Article

VLM2Vec-V2: Advancing Multimodal Embedding for Videos, Images, and Visual Documents

  • Rui Meng
  • Ziyan Jiang
  • Ye Liu
  • Mingyi Su
  • Xinyi Yang
  • Yuepeng Fu
  • Can Qin
  • Raghuveer Thirukovalluru

Multimodal embedding models have been crucial in enabling various downstream tasks such as semantic similarity, information retrieval, and clustering over different modalities. However, existing multimodal embedding models such as VLM2Vec, E5-V, and GME are predominantly focused on natural images, with limited support for other visual forms such as videos and visual documents. This restricts their applicability in real-world scenarios, including AI agents, retrieval-augmented generation (RAG) systems, and recommendation systems. To close this gap, we propose VLM2Vec-V2, a unified framework for learning embeddings across diverse visual forms. First, we introduce MMEB-V2, a comprehensive benchmark that extends MMEB with five new task types: visual document retrieval, video retrieval, temporal grounding, video classification, and video question answering, spanning text, image, video, and visual document inputs. Next, we train VLM2Vec-V2, a general-purpose embedding model that supports text, image, video, and visual document inputs. Extensive experiments show that VLM2Vec-V2 not only achieves strong performance on the newly introduced video and document retrieval tasks, but also improves over prior baselines on the original image benchmarks. Through extensive evaluation, our study offers insights into the generalizability of various multimodal embedding models and highlights effective strategies for unified embedding learning, laying the groundwork for more scalable and adaptable representation learning in both research and real-world settings.
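Embedding models of this kind are typically trained with an in-batch contrastive (InfoNCE-style) objective over query-target pairs; the sketch below shows that generic recipe. The temperature and shapes are illustrative, and the actual VLM2Vec-V2 training setup may differ.

```python
import torch
import torch.nn.functional as F

def info_nce(query_emb, target_emb, temperature=0.05):
    """In-batch contrastive loss: each query's positive is the
    same-index target; all other targets act as negatives."""
    q = F.normalize(query_emb, dim=-1)
    t = F.normalize(target_emb, dim=-1)
    logits = q @ t.T / temperature                      # (B, B) similarities
    labels = torch.arange(q.size(0), device=q.device)   # positives on diagonal
    return F.cross_entropy(logits, labels)

q = torch.randn(32, 768)   # e.g. text-query embeddings
t = torch.randn(32, 768)   # e.g. video or visual-document embeddings
print(info_nce(q, t))
```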

NeurIPS Conference 2025 Conference Paper

HoliTom: Holistic Token Merging for Fast Video Large Language Models

  • Kele Shao
  • Keda TAO
  • Can Qin
  • Haoxuan You
  • Yang Sui
  • Huan Wang

Video large language models (video LLMs) excel at video comprehension but face significant computational inefficiency due to redundant video tokens. Existing token pruning methods offer solutions; however, approaches operating within the LLM (inner-LLM pruning), such as FastV, incur intrinsic computational overhead in shallow layers. In contrast, methods performing token pruning before the LLM (outer-LLM pruning) primarily address spatial redundancy within individual frames or limited temporal windows, neglecting the crucial global temporal dynamics and correlations across longer video sequences. This leads to sub-optimal spatio-temporal reduction and does not fully exploit video compressibility. Crucially, the synergistic potential and mutual influence of combining these strategies remain unexplored. To further reduce redundancy, we introduce HoliTom, a novel training-free holistic token merging framework. HoliTom performs outer-LLM pruning through global redundancy-aware temporal segmentation, followed by spatial-temporal merging to reduce visual tokens by over 90%, significantly alleviating the LLM's computational burden. Complementing this, we introduce a robust inner-LLM token similarity-based merging approach, designed for superior performance and compatibility with outer-LLM pruning. Evaluations demonstrate our method's promising efficiency-performance trade-off on LLaVA-OneVision-7B, reducing computational cost to 6.9% of the original FLOPs while maintaining 99.1% of the original performance. Furthermore, we achieve a 2.28× reduction in time-to-first-token (TTFT) and a 1.32× acceleration in decoding throughput, highlighting the practical benefits of our integrated pruning approach for efficient video LLM inference.
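The outer-LLM stage can be pictured as redundancy-aware temporal segmentation: cut the video into segments wherever consecutive frame features diverge, then merge or prune tokens within each segment. The sketch below shows only that segmentation idea, with an illustrative threshold; it is not HoliTom's exact procedure.

```python
import torch
import torch.nn.functional as F

def temporal_segments(frame_feats: torch.Tensor, tau: float = 0.9):
    """Group consecutive frames whose pooled features stay cosine-similar
    above tau; each group can then be compressed jointly.
    frame_feats: (T, D) mean-pooled per-frame visual features."""
    f = F.normalize(frame_feats, dim=-1)
    segments, current = [], [0]
    for i in range(1, f.size(0)):
        if torch.dot(f[i - 1], f[i]) >= tau:   # still redundant: extend segment
            current.append(i)
        else:                                  # content change: new segment
            segments.append(current)
            current = [i]
    segments.append(current)
    return segments

feats = torch.randn(16, 256)   # 16 frames
print(temporal_segments(feats))
```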

AAAI Conference 2024 Conference Paper

M3SOT: Multi-Frame, Multi-Field, Multi-Space 3D Single Object Tracking

  • Jiaming Liu
  • Yue Wu
  • Maoguo Gong
  • Qiguang Miao
  • Wenping Ma
  • Cai Xu
  • Can Qin

3D Single Object Tracking (SOT) stands as a forefront task in computer vision, proving essential for applications like autonomous driving. Sparse and occluded data in scene point clouds introduce variations in the appearance of tracked objects, adding complexity to the task. In this research, we unveil M3SOT, a novel 3D SOT framework which synergizes multiple input frames (template sets), multiple receptive fields (continuous contexts), and multiple solution spaces (distinct tasks) in ONE model. Remarkably, M3SOT pioneers the modeling of temporality, contexts, and tasks directly from point clouds, revisiting the key factors influencing SOT. To this end, we design a transformer-based network centered on point cloud targets in the search area, aggregating diverse contextual representations and propagating target cues by employing historical frames. As M3SOT spans varied processing perspectives, we streamline the network, trimming its depth and optimizing its structure, to ensure a lightweight and efficient deployment for SOT applications. We posit that, backed by this practical construction, M3SOT sidesteps the need for complex frameworks and auxiliary components to deliver strong results. Extensive experiments on benchmarks such as KITTI, nuScenes, and the Waymo Open Dataset demonstrate that M3SOT achieves state-of-the-art performance at 38 FPS. Our code and models are available at https://github.com/ywu0912/TeamCode.git.

ICLR Conference 2023 Conference Paper

Image as Set of Points

  • Xu Ma 0005
  • Yuqian Zhou
  • Huan Wang 0014
  • Can Qin
  • Bin Sun 0002
  • Chang Liu 0022
  • Yun Fu 0001

What is an image, and how should we extract latent features? Convolutional Networks (ConvNets) consider an image as organized pixels in a rectangular shape and extract features via convolution in a local region; Vision Transformers (ViTs) treat an image as a sequence of patches and extract features via attention over a global range. In this work, we introduce a straightforward and promising paradigm for visual representation called Context Clusters. Context clusters (CoCs) view an image as a set of unorganized points and extract features via a simplified clustering algorithm. In detail, each point includes the raw feature (e.g., color) and positional information (e.g., coordinates), and a simplified clustering algorithm is employed to group and extract deep features hierarchically. Our CoCs are convolution- and attention-free, relying only on a clustering algorithm for spatial interaction. Owing to the simple design, we show that CoCs offer gratifying interpretability via visualization of the clustering process. CoCs aim to provide a new perspective on images and visual representation, which may enjoy broad applications in different domains and offer further insights. Even though we are not targeting SOTA performance, CoCs still achieve comparable or even better performance than ConvNets or ViTs on several benchmarks.
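A single context-cluster step can be sketched in a few lines: concatenate the raw features with coordinates, assign each point to its most similar center, and aggregate per cluster. This is a toy hard-assignment version; the paper's similarity and aggregation are more elaborate.

```python
import torch
import torch.nn.functional as F

def cluster_step(feats, coords, centers):
    """One simplified context-cluster step.

    feats:   (N, C) raw point features (e.g. color)
    coords:  (N, 2) positions
    centers: (M, C + 2) cluster centers
    Returns (M, C + 2) aggregated cluster features."""
    points = torch.cat([feats, coords], dim=-1)                    # (N, C+2)
    sim = F.normalize(points, dim=-1) @ F.normalize(centers, dim=-1).T
    assign = sim.argmax(dim=-1)                                    # hard assignment
    out = torch.zeros_like(centers)
    for m in range(centers.size(0)):
        members = points[assign == m]
        if members.numel():
            out[m] = members.mean(dim=0)                           # aggregate cluster
    return out

feats, coords = torch.rand(64, 3), torch.rand(64, 2)   # RGB + (x, y) per point
centers = torch.rand(4, 5)
print(cluster_step(feats, coords, centers).shape)      # torch.Size([4, 5])
```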

NeurIPS Conference 2023 Conference Paper

UniControl: A Unified Diffusion Model for Controllable Visual Generation In the Wild

  • Can Qin
  • Shu Zhang
  • Ning Yu
  • Yihao Feng
  • Xinyi Yang
  • Yingbo Zhou
  • Huan Wang
  • Juan Carlos Niebles

Achieving machine autonomy and human control often represent divergent objectives in the design of interactive AI systems. Visual generative foundation models such as Stable Diffusion show promise in navigating these goals, especially when prompted with arbitrary languages. However, they often fall short in generating images with spatial, structural, or geometric controls. The integration of such controls, which can accommodate various visual conditions in a single unified model, remains an unaddressed challenge. In response, we introduce UniControl, a new generative foundation model that consolidates a wide array of controllable condition-to-image (C2I) tasks within a singular framework, while still allowing for arbitrary language prompts. UniControl enables pixel-level-precise image generation, where visual conditions primarily influence the generated structures and language prompts guide the style and context. To equip UniControl with the capacity to handle diverse visual conditions, we augment pretrained text-to-image diffusion models and introduce a task-aware HyperNet to modulate the diffusion models, enabling the adaptation to different C2I tasks simultaneously. Trained on nine unique C2I tasks, UniControl demonstrates impressive zero-shot generation abilities with unseen visual conditions. Experimental results show that UniControl often surpasses the performance of single-task-controlled methods of comparable model sizes. This control versatility positions UniControl as a significant advancement in the realm of controllable visual generation.
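The task-aware HyperNet idea can be sketched as a small network mapping a task embedding to per-channel scale-and-shift modulation of intermediate features (FiLM-style). This is a loose sketch under that assumption; UniControl's actual modulation of the diffusion backbone is more involved.

```python
import torch
import torch.nn as nn

class TaskHyperNet(nn.Module):
    """Maps a task embedding to per-channel (scale, shift) modulation."""
    def __init__(self, task_dim: int, channels: int):
        super().__init__()
        self.to_mod = nn.Linear(task_dim, 2 * channels)

    def forward(self, feats: torch.Tensor, task_emb: torch.Tensor):
        scale, shift = self.to_mod(task_emb).chunk(2, dim=-1)
        # broadcast (B, C) modulation over the spatial dims of (B, C, H, W)
        return feats * (1 + scale[..., None, None]) + shift[..., None, None]

hyper = TaskHyperNet(task_dim=64, channels=128)
feats = torch.randn(2, 128, 32, 32)   # condition-branch features
task = torch.randn(2, 64)             # e.g. a "depth-to-image" task embedding
print(hyper(feats, task).shape)       # torch.Size([2, 128, 32, 32])
```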

ICLR Conference 2022 Conference Paper

Learning Efficient Image Super-Resolution Networks via Structure-Regularized Pruning

  • Yulun Zhang 0001
  • Huan Wang 0014
  • Can Qin
  • Yun Fu 0001

Several image super-resolution (SR) networks have recently been proposed for efficient SR, achieving promising results. However, they are still not lightweight enough, and their designs do not extend readily to larger networks. At the same time, model compression techniques such as neural architecture search and knowledge distillation typically consume considerable computational resources. In contrast, network pruning is a cheap and effective model compression technique. However, it is hard to apply to SR networks directly, because filter pruning for residual blocks is notoriously tricky. To address these issues, we propose structure-regularized pruning (SRP), which imposes regularization on the pruned structure to ensure that the locations of pruned filters are aligned across different layers. Specifically, for the layers connected by the same residual, we select the filters of the same indices as unimportant filters. To transfer the expressive power in the unimportant filters to the rest of the network, we employ $L_2$ regularization to drive the weights towards zero so that, eventually, their absence causes minimal performance degradation. We apply SRP to train efficient image SR networks, resulting in a lightweight network, SRPN-Lite, and a very deep one, SRPN. We conduct extensive comparisons with both lightweight and larger networks. SRPN-Lite and SRPN perform favorably against other recent efficient SR approaches both quantitatively and visually.
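The core mechanics are easy to sketch: choose one set of filter indices and apply the same L2 penalty to those indices in every layer tied by a residual connection, so the selected filters decay to zero in aligned positions. Names and the penalty strength below are illustrative.

```python
import torch
import torch.nn as nn

def aligned_l2_penalty(layers, prune_idx: torch.Tensor,
                       strength: float) -> torch.Tensor:
    """L2 penalty on the SAME filter indices across residual-tied conv
    layers, so pruned locations stay aligned and can be removed cleanly."""
    penalty = torch.tensor(0.0)
    for conv in layers:
        penalty = penalty + conv.weight[prune_idx].pow(2).sum()
    return strength * penalty

convs = [nn.Conv2d(64, 64, 3, padding=1) for _ in range(3)]  # share a residual
prune_idx = torch.tensor([1, 7, 20, 33])    # filters deemed unimportant
loss_reg = aligned_l2_penalty(convs, prune_idx, strength=1e-3)
loss_reg.backward()   # gradients push exactly those filters toward zero
print(loss_reg.item())
```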

IJCAI Conference 2022 Conference Paper

MemREIN: Rein the Domain Shift for Cross-Domain Few-Shot Learning

  • Yi Xu
  • Lichen Wang
  • Yizhou Wang
  • Can Qin
  • Yulun Zhang
  • Yun Fu

Few-shot learning aims to enable models to generalize to new categories (query instances) with only limited labeled samples (support instances) from each category. Metric-based mechanisms, which compare feature embeddings via different metrics, are a promising direction. However, they often fail to generalize to unseen domains due to the considerable domain gap. In this paper, we propose a novel framework, MemREIN, which considers Memorized, Restitution, and Instance Normalization for cross-domain few-shot learning. Specifically, an instance normalization algorithm is explored to alleviate feature dissimilarity, which provides the initial model generalization ability. However, naively normalizing the features would lose fine-grained discriminative knowledge between different classes. To this end, a memorized module is further proposed to separate out the most refined knowledge and remember it. Then, a restitution module is utilized to restore the discrimination ability from the learned knowledge. A novel reverse contrastive learning strategy is proposed to stabilize the distillation process. Extensive experiments on five popular benchmark datasets demonstrate that MemREIN well addresses the domain shift challenge, and significantly improves performance by up to 16.43% compared with state-of-the-art baselines.
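The instance-normalization starting point is simple: normalize each sample's feature map with its own statistics, stripping instance-specific style that widens the domain gap. A generic sketch follows, not MemREIN's full pipeline with the memory and restitution modules.

```python
import torch

def instance_norm(x: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    """Per-sample, per-channel normalization over spatial dimensions.
    x: (B, C, H, W) feature maps."""
    mean = x.mean(dim=(2, 3), keepdim=True)
    var = x.var(dim=(2, 3), keepdim=True, unbiased=False)
    return (x - mean) / torch.sqrt(var + eps)

feats = torch.randn(8, 64, 5, 5) * 3.0 + 1.5   # instance-specific "style"
out = instance_norm(feats)
print(out.mean().item(), out.std().item())      # roughly 0 and 1
```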

IJCAI Conference 2022 Conference Paper

Recent Advances on Neural Network Pruning at Initialization

  • Huan Wang
  • Can Qin
  • Yue Bai
  • Yulun Zhang
  • Yun Fu

Neural network pruning typically removes connections or neurons from a pretrained, converged model, while a newer pruning paradigm, pruning at initialization (PaI), attempts to prune a randomly initialized network. This paper offers the first survey concentrated on this emerging pruning paradigm. We first introduce a generic formulation of neural network pruning, followed by the major classic pruning topics. Then, as the main body of this paper, a thorough and structured literature review of PaI methods is presented, consisting of two major tracks (sparse training and sparse selection). Finally, we summarize the surge of PaI relative to pruning after training (PaT) and discuss the open problems. Apart from the dedicated literature review, this paper also offers a code base for easy sanity-checking and benchmarking of different PaI methods.
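As a concrete example from the sparse-selection track, a SNIP-style saliency score can be computed at initialization from one batch and used to keep only the top-scoring connections. This is a compressed sketch of that well-known method, not code from this survey's code base.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def snip_masks(model: nn.Module, x, y, keep_ratio: float = 0.1):
    """Score each weight by |gradient * weight| on one batch at init,
    then keep the globally top-scoring fraction (sparse selection)."""
    params = list(model.parameters())
    loss = F.cross_entropy(model(x), y)
    grads = torch.autograd.grad(loss, params)
    scores = torch.cat([(g * p).abs().flatten() for g, p in zip(grads, params)])
    threshold = scores.topk(int(keep_ratio * scores.numel())).values.min()
    return [((g * p).abs() >= threshold).float() for g, p in zip(grads, params)]

model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 5))
masks = snip_masks(model, torch.randn(32, 20), torch.randint(0, 5, (32,)))
print([round(m.mean().item(), 3) for m in masks])   # kept fraction per tensor
```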

ICLR Conference 2022 Conference Paper

Rethinking Network Design and Local Geometry in Point Cloud: A Simple Residual MLP Framework

  • Xu Ma 0005
  • Can Qin
  • Haoxuan You
  • Haoxi Ran
  • Yun Fu 0001

Point cloud analysis is challenging due to the irregular and unordered structure of the data. To capture 3D geometries, prior works mainly rely on exploring sophisticated local geometric extractors using convolution, graph, or attention mechanisms. These methods, however, incur unfavorable latency during inference, and their performance has saturated over the past few years. In this paper, we present a novel perspective on this task. We find that detailed local geometrical information probably is not the key to point cloud analysis: we introduce a pure residual MLP network, called PointMLP, which integrates no local geometrical extractors but still performs very competitively. Equipped with a proposed lightweight geometric-affine module to stabilize training, PointMLP delivers a new state of the art on multiple datasets. On the real-world ScanObjectNN dataset, our method even surpasses the prior best method by 3.3% accuracy. We emphasize that PointMLP achieves this strong performance without any sophisticated operations, leading to a prominent inference speed. Compared to the recent CurveNet, PointMLP trains 2× faster, tests 7× faster, and is more accurate on the ModelNet40 benchmark. We hope PointMLP may help the community towards a better understanding of point cloud analysis. The code is available at https://github.com/ma-xu/pointMLP-pytorch.
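The geometric-affine module amounts to a learnable normalization of each grouped neighborhood; a hedged sketch of one plausible form, with illustrative shapes, follows.

```python
import torch
import torch.nn as nn

class GeometricAffine(nn.Module):
    """Normalize grouped neighbor features by group statistics, then
    rescale with learnable alpha/beta to stabilize training."""
    def __init__(self, channels: int):
        super().__init__()
        self.alpha = nn.Parameter(torch.ones(channels))
        self.beta = nn.Parameter(torch.zeros(channels))

    def forward(self, group: torch.Tensor) -> torch.Tensor:
        # group: (B, N, K, C) = K neighbor features per sampled point
        centered = group - group.mean(dim=2, keepdim=True)
        sigma = centered.std() + 1e-5        # shared scalar scale
        return self.alpha * centered / sigma + self.beta

ga = GeometricAffine(channels=64)
neighbors = torch.randn(2, 128, 24, 64)   # 128 centers, 24 neighbors each
print(ga(neighbors).shape)                # torch.Size([2, 128, 24, 64])
```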

NeurIPS Conference 2021 Conference Paper

Aligned Structured Sparsity Learning for Efficient Image Super-Resolution

  • Yulun Zhang
  • Huan Wang
  • Can Qin
  • Yun Fu

Lightweight image super-resolution (SR) networks have obtained promising results with moderate model size. Many SR methods have focused on designing lightweight architectures while neglecting to further reduce the redundancy of network parameters. On the other hand, model compression techniques such as neural architecture search and knowledge distillation typically consume considerable memory and computation resources. In contrast, network pruning is a cheap and effective model compression technique. However, it is hard to apply to SR networks directly, because filter pruning for residual blocks is notoriously tricky. To address these issues, we propose aligned structured sparsity learning (ASSL), which introduces a weight normalization layer and applies $L_2$ regularization to the scale parameters for sparsity. To align the pruned locations across different layers, we propose a sparsity structure alignment penalty term, which minimizes the norm of the soft-mask Gram matrix. We apply the aligned structured sparsity learning strategy to train an efficient image SR network, named ASSLN, with smaller model size and lower computation than state-of-the-art methods. We conduct extensive comparisons with lightweight SR networks. Our ASSLN achieves superior performance gains over recent methods both quantitatively and visually.

ICLR Conference 2021 Conference Paper

Neural Pruning via Growing Regularization

  • Huan Wang 0014
  • Can Qin
  • Yulun Zhang 0001
  • Yun Fu 0001

Regularization has long been utilized to learn sparsity in deep neural network pruning. However, its role has mainly been explored in the small-penalty-strength regime. In this work, we extend its application to a new scenario where the regularization grows gradually large to tackle two central problems of pruning: the pruning schedule and weight importance scoring. (1) The former topic is newly brought up in this work; we find it critical to pruning performance, yet it has received little research attention. Specifically, we propose an L2 regularization variant with rising penalty factors and show it can bring significant accuracy gains compared with its one-shot counterpart, even when the same weights are removed. (2) The growing penalty scheme also gives us an approach to exploit Hessian information for more accurate pruning without knowing its specific values, thus avoiding the common Hessian approximation problems. Empirically, the proposed algorithms are easy to implement and scalable to large datasets and networks in both structured and unstructured pruning. Their effectiveness is demonstrated with modern deep neural networks on the CIFAR and ImageNet datasets, achieving competitive results compared to many state-of-the-art algorithms. Our code and trained models are publicly available at https://github.com/mingsun-tse/regularization-pruning.
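The growing-penalty schedule at the heart of the method is easy to sketch: ramp the L2 coefficient on candidate weights upward during training instead of fixing it small. All constants below are illustrative, and the stand-in task loss is just for demonstration.

```python
import torch
import torch.nn as nn

def growing_l2(step: int, ramp_every: int = 100, delta: float = 1e-3,
               cap: float = 1.0) -> float:
    """Penalty factor that rises in stages rather than staying small."""
    return min(cap, (step // ramp_every) * delta)

layer = nn.Linear(128, 128)
prune_mask = torch.rand_like(layer.weight) < 0.5   # weights slated for removal
opt = torch.optim.SGD(layer.parameters(), lr=0.1)

for step in range(1000):
    x = torch.randn(32, 128)
    loss = ((layer(x) - x) ** 2).mean()            # stand-in task loss
    reg = growing_l2(step) * (layer.weight * prune_mask).pow(2).sum()
    opt.zero_grad()
    (loss + reg).backward()
    opt.step()

# masked weights have been driven toward zero and can now be removed
print(layer.weight[prune_mask].abs().mean().item())
```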

NeurIPS Conference 2021 Conference Paper

Slow Learning and Fast Inference: Efficient Graph Similarity Computation via Knowledge Distillation

  • Can Qin
  • Handong Zhao
  • Lichen Wang
  • Huan Wang
  • Yulun Zhang
  • Yun Fu

Graph Similarity Computation (GSC) is essential to wide-ranging graph applications such as retrieval and plagiarism/anomaly detection. The exact computation of graph similarity, e.g., Graph Edit Distance (GED), is an NP-hard problem that cannot be solved exactly in adequate time for large graphs. Thanks to the strong representation power of graph neural networks (GNNs), a variety of GNN-based inexact methods have emerged. To capture the subtle differences across graphs, the key to success is designing dense interaction with feature fusion at an early stage, which, however, trades speed for accuracy. On the slow-learning side, this paper proposes a novel early-fusion approach by designing a co-attention-based feature fusion network over multilevel GNN features. To further improve speed without much accuracy drop, we introduce an efficient GSC solution that distills knowledge from the slow early-fusion model into a student model for fast inference. Such a student model also enables offline collection of individual graph embeddings, speeding up inference time by orders of magnitude. To address instability in the knowledge transfer, we decompose the dynamic joint embedding into static pseudo-individual embeddings for precise teacher-student alignment. Experimental analysis on real-world datasets demonstrates the superiority of our approach over state-of-the-art methods in both accuracy and efficiency. In particular, we speed up the prior art by more than 10x on the benchmark AIDS data.
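The fast-inference side can be sketched as follows: the student encodes each graph independently (so embeddings can be precomputed offline), and distillation matches its similarity score to the slow early-fusion teacher's. This is a simplified rendering of the abstract's description; all names and the mean-pooling encoder are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FastStudent(nn.Module):
    """Encodes each graph on its own, so embeddings can be cached and
    similarity reduces to a cheap vector comparison at query time."""
    def __init__(self, in_dim: int = 32, emb_dim: int = 64):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(in_dim, emb_dim), nn.ReLU(),
                                 nn.Linear(emb_dim, emb_dim))

    def embed(self, node_feats):                 # (N_nodes, in_dim) -> (emb_dim,)
        return self.enc(node_feats).mean(dim=0)  # mean-pool node features

def distill_loss(student, g1, g2, teacher_score):
    """Match the fast student's score to the slow teacher's score."""
    pred = F.cosine_similarity(student.embed(g1), student.embed(g2), dim=0)
    return (pred - teacher_score) ** 2

student = FastStudent()
g1, g2 = torch.randn(10, 32), torch.randn(14, 32)   # two graphs' node features
teacher_score = torch.tensor(0.8)                   # from the early-fusion teacher
print(distill_loss(student, g1, g2, teacher_score).item())
```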

AAAI Conference 2020 Conference Paper

Dual Relation Semi-Supervised Multi-Label Learning

  • Lichen Wang
  • Yunyu Liu
  • Can Qin
  • Gan Sun
  • Yun Fu

Multi-label learning (MLL) addresses the problem in which a single sample corresponds to multiple labels. It is a challenging task due to the long-tail label distribution and sophisticated label relations. Semi-supervised MLL methods utilize a small set of labeled samples and a large set of unlabeled samples to enhance performance. However, these approaches mainly focus on exploring the data distribution in the feature space while neglecting the label relations within each instance. To this end, we propose a Dual Relation Semi-supervised Multi-label Learning (DRML) approach which jointly explores the feature distribution and the label relations. A dual-classifier domain adaptation strategy is proposed to align features while generating pseudo labels to improve learning performance. A relation network is proposed to explore the relation knowledge. As a result, DRML effectively explores the feature-label and label-label relations in both labeled and unlabeled samples. It is an end-to-end model that requires no extra knowledge. Extensive experiments illustrate the effectiveness and efficiency of our method.
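The dual-classifier pseudo-labeling idea can be sketched simply: run two classifier heads over unlabeled samples and keep only the label entries on which both heads confidently agree. A toy multi-label version with an illustrative threshold:

```python
import torch

def agreed_pseudo_labels(logits_a, logits_b, thresh: float = 0.8):
    """Multi-label pseudo labels kept only where two heads agree and are
    both confident; all other entries are masked out of the loss."""
    pa, pb = logits_a.sigmoid(), logits_b.sigmoid()
    pos = (pa > thresh) & (pb > thresh)            # both confidently positive
    neg = (pa < 1 - thresh) & (pb < 1 - thresh)    # both confidently negative
    return pos.float(), pos | neg                  # labels, trainable-entry mask

la = torch.randn(4, 10) * 3   # head A logits (4 samples, 10 labels)
lb = torch.randn(4, 10) * 3   # head B logits
labels, mask = agreed_pseudo_labels(la, lb)
print(labels, mask.float().mean().item())   # labels + fraction of usable entries
```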

NeurIPS Conference 2019 Conference Paper

PointDAN: A Multi-Scale 3D Domain Adaption Network for Point Cloud Representation

  • Can Qin
  • Haoxuan You
  • Lichen Wang
  • C.-C. Jay Kuo
  • Yun Fu

Domain Adaptation (DA) approaches have achieved significant improvements in a wide range of machine learning and computer vision tasks (i.e., classification, detection, and segmentation). However, as far as we are aware, few methods achieve domain adaptation directly on 3D point cloud data. The unique challenge of point cloud data lies in its abundant spatial geometric information, with the semantics of the whole object contributed by its regional geometric structures. Specifically, most general-purpose DA methods, which pursue global feature alignment and ignore local geometric information, are not suitable for 3D domain alignment. In this paper, we propose a novel 3D Domain Adaptation Network for point cloud data (PointDAN). PointDAN jointly aligns global and local features at multiple levels. For local alignment, we propose a Self-Adaptive (SA) node module with an adjusted receptive field to model the discriminative local structures for aligning domains. To represent hierarchically scaled features, a node-attention module is further introduced to weight the relationships of SA nodes across objects and domains. For global alignment, an adversarial-training strategy is employed to learn and align global features across domains. Since there is no common evaluation benchmark for the 3D point cloud DA scenario, we build a general benchmark (i.e., PointDA-10) extracted from three popular 3D object/scene datasets (i.e., ModelNet, ShapeNet, and ScanNet) for cross-domain 3D object classification. Extensive experiments on PointDA-10 illustrate the superiority of our model over state-of-the-art general-purpose DA methods.
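The adversarial global alignment is commonly implemented with a gradient reversal layer: features pass unchanged to a domain classifier in the forward pass, but the gradient flips sign on the way back, pushing the feature extractor toward domain-invariant features. A standard GRL sketch follows; PointDAN's exact adversarial setup may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; negated, scaled gradient backward."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

feature_net = nn.Linear(1024, 256)   # stand-in global feature extractor
domain_head = nn.Linear(256, 2)      # source-vs-target classifier

pooled = torch.randn(8, 1024)        # pooled point-cloud features
domain_lbl = torch.randint(0, 2, (8,))

logits = domain_head(GradReverse.apply(feature_net(pooled), 1.0))
loss = F.cross_entropy(logits, domain_lbl)
loss.backward()   # head learns to separate domains; extractor learns to fool it
print(loss.item())
```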