Arrow Research

Author name cluster

Haoji Hu

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

9 papers
2 author rows

Possible papers (9)

AAAI Conference 2026 Conference Paper

Q Cache: Visual Attention Is Valuable in Less than Half of Decode Layers for Multimodal Large Language Model

  • Jiedong Zhuang
  • Lu Lu
  • Ming Dai
  • Rui Hu
  • Jian Chen
  • Qiang Liu
  • Haoji Hu

Multimodal large language models (MLLMs) suffer from high inference costs due to the large number of visual tokens produced by the vision encoder. These redundant visual tokens create a substantial computational load and a key-value (KV) cache footprint bottleneck. Existing approaches focus on token-wise optimization, using various intricate token pruning techniques to eliminate non-crucial visual tokens. However, these methods often undermine the integrity of the KV cache, leading to failures in long-text generation tasks. To this end, we investigate the model's attention mechanism from a new perspective and find that the attention in more than half of all decode layers is semantically similar to that of neighboring layers. Based on this finding, we contend that the attention in certain layers can be streamlined by inheriting the attention from their preceding layers. Consequently, we propose Lazy Attention, an efficient attention mechanism that enables cross-layer sharing of similar attention patterns and thereby reduces redundant layer-wise attention computation. Within Lazy Attention, we develop a novel layer-shared cache, Q Cache, tailored for MLLMs, which enables the reuse of queries across adjacent layers. Q Cache is lightweight and fully compatible with existing inference frameworks, including Flash Attention and the KV cache. Our method is also highly flexible: it is orthogonal to existing token-wise techniques and can be deployed independently or combined with token pruning approaches. Empirical evaluations on multiple benchmarks demonstrate that our method reduces KV cache usage by over 35% and achieves a 1.5x throughput improvement while sacrificing only about 1% of performance on various MLLMs. Compared with state-of-the-art token-wise methods, our technique preserves accuracy better.
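
As a rough illustration of the idea sketched in the abstract (not the authors' implementation), a minimal PyTorch-style toy of reusing a cached query across adjacent decode layers might look like the following; every class, flag, and shape here is invented for illustration.

```python
# Illustrative sketch only: Lazy Attention / Q Cache details come from the paper, not this code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class LazyAttentionSketch(nn.Module):
    """Toy single-head attention layer that can reuse a query cached by an earlier layer."""

    def __init__(self, dim: int, reuse_query: bool):
        super().__init__()
        self.reuse_query = reuse_query          # "lazy" layers skip the query projection
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)

    def forward(self, x, cached_q=None):
        # x: (batch, seq_len, dim); cached_q: query tensor handed down from a preceding layer
        k, v = self.k_proj(x), self.v_proj(x)
        if self.reuse_query and cached_q is not None:
            q = cached_q                        # inherit the preceding layer's query
        else:
            q = self.q_proj(x)                  # recompute the query in "full" layers
        attn = F.softmax(q @ k.transpose(-2, -1) / x.shape[-1] ** 0.5, dim=-1)
        return attn @ v, q                      # pass the query on so the next layer may reuse it


if __name__ == "__main__":
    x = torch.randn(2, 16, 64)
    full_layer = LazyAttentionSketch(64, reuse_query=False)
    lazy_layer = LazyAttentionSketch(64, reuse_query=True)
    out1, q1 = full_layer(x)
    out2, _ = lazy_layer(out1, cached_q=q1)     # the lazy layer skips its query projection
    print(out2.shape)                           # torch.Size([2, 16, 64])
```

Because the reused layer only skips the query projection and attention recomputation, the sketch stays compatible with an ordinary per-layer KV cache, which mirrors the compatibility claim in the abstract.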

NeurIPS Conference 2025 Conference Paper

Orientation Matters: Making 3D Generative Models Orientation-Aligned

  • Yichong Lu
  • Yuzhuo Tian
  • Zijin Jiang
  • Yikun Zhao
  • Yuanbo Yang
  • Hao Ouyang
  • Haoji Hu
  • Huimin Yu

Humans intuitively perceive object shape and orientation from a single image, guided by strong priors about canonical poses. However, existing 3D generative models often produce misaligned results due to inconsistent training data, limiting their usability in downstream tasks. To address this gap, we introduce the task of orientation-aligned 3D object generation: producing 3D objects from single images with consistent orientations across categories. To facilitate this, we construct Objaverse-OA, a dataset of 14,832 orientation-aligned 3D models spanning 1,008 categories. Leveraging Objaverse-OA, we fine-tune two representative 3D generative models based on multi-view diffusion and 3D variational autoencoder frameworks to produce aligned objects that generalize well to unseen objects across various categories. Experimental results demonstrate the superiority of our method over post-hoc alignment approaches. Furthermore, we showcase downstream applications enabled by our aligned object generation, including zero-shot object orientation estimation via analysis-by-synthesis and efficient arrow-based object rotation manipulation.

AAAI Conference 2025 Conference Paper

ST3: Accelerating Multimodal Large Language Model by Spatial-Temporal Visual Token Trimming

  • Jiedong Zhuang
  • Lu Lu
  • Ming Dai
  • Rui Hu
  • Jian Chen
  • Qiang Liu
  • Haoji Hu

Multimodal large language models (MLLMs) enhance their perceptual capabilities by integrating visual and textual information. However, processing the massive number of visual tokens incurs a significant computational cost. Existing analyses of MLLM attention mechanisms remain shallow, leading to coarse-grained token pruning strategies that fail to effectively balance speed and accuracy. In this paper, we conduct a comprehensive investigation of MLLM attention mechanisms with LLaVA. We find that numerous visual tokens and partial attention computations are redundant during the decoding process. Based on this insight, we propose Spatial-Temporal Visual Token Trimming (ST3), a framework designed to accelerate MLLM inference without retraining. ST3 consists of two primary components: 1) Progressive Visual Token Pruning (PVTP), which eliminates inattentive visual tokens across layers, and 2) Visual Token Annealing (VTA), which dynamically reduces the number of visual tokens in each layer as the generated tokens grow. Together, these techniques deliver around 2x faster inference with only about 30% of the KV cache memory compared to the original LLaVA, while maintaining consistent performance across various datasets. Crucially, ST3 can be seamlessly integrated into existing pre-trained MLLMs, providing a plug-and-play solution for efficient inference.
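
To make the two components concrete, here is a minimal sketch of what attention-based token pruning and a shrinking token budget could look like; the paper's actual PVTP/VTA rules and schedules are not given in this listing, and the function names and the linear keep-ratio schedule below are invented.

```python
# Illustrative sketch only, not ST3's implementation.
import torch


def prune_visual_tokens(visual_tokens, attn_to_visual, keep_ratio):
    """Keep only the visual tokens that receive the highest attention (PVTP-style idea)."""
    # visual_tokens: (num_visual, dim); attn_to_visual: (num_visual,) averaged attention scores
    num_keep = max(1, int(keep_ratio * visual_tokens.shape[0]))
    keep_idx = attn_to_visual.topk(num_keep).indices.sort().values
    return visual_tokens[keep_idx], keep_idx


def annealed_keep_ratio(step, total_steps, start=1.0, end=0.3):
    """Shrink the visual-token budget as more text tokens are generated (VTA-style idea)."""
    t = min(step / max(total_steps, 1), 1.0)
    return start + (end - start) * t


if __name__ == "__main__":
    vis = torch.randn(576, 1024)            # e.g. a LLaVA-style grid of visual tokens
    scores = torch.rand(576)                # stand-in for attention received by each visual token
    for step in (0, 50, 100):
        ratio = annealed_keep_ratio(step, total_steps=100)
        kept, idx = prune_visual_tokens(vis, scores, ratio)
        print(step, round(ratio, 2), kept.shape[0])
```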

ICLR Conference 2025 Conference Paper

SynCamMaster: Synchronizing Multi-Camera Video Generation from Diverse Viewpoints

  • Jianhong Bai
  • Menghan Xia
  • Xintao Wang 0002
  • Ziyang Yuan
  • Zuozhu Liu
  • Haoji Hu
  • Pengfei Wan 0001
  • Di Zhang 0026

Recent advancements in video diffusion models demonstrate remarkable capabilities in simulating real-world dynamics and 3D consistency. This progress motivates us to explore the potential of these models to maintain dynamic consistency across diverse viewpoints, a feature highly sought after in applications like virtual filming. Unlike existing methods focused on multi-view generation of single objects for 4D reconstruction, our interest lies in generating open-world videos from arbitrary viewpoints, incorporating six degrees of freedom (6 DoF) camera poses. To achieve this, we propose a plug-and-play module that enhances a pre-trained text-to-video model for multi-camera video generation, ensuring consistent content across different viewpoints. Specifically, we introduce a multi-view synchronization module designed to maintain appearance and geometry consistency across these viewpoints. Given the scarcity of high-quality training data, we also propose a progressive training scheme that leverages multi-camera images and monocular videos as a supplement to Unreal Engine-rendered multi-camera videos. This comprehensive approach significantly benefits our model. Experimental results demonstrate the superiority of our proposed method over existing competitors and several baselines. Furthermore, our method enables intriguing extensions, such as re-rendering a video from multiple novel viewpoints. Project webpage: https://jianhongbai.github.io/SynCamMaster/

AAAI Conference 2024 Conference Paper

Robustness-Guided Image Synthesis for Data-Free Quantization

  • Jianhong Bai
  • Yuchen Yang
  • Huanpeng Chu
  • Hualiang Wang
  • Zuozhu Liu
  • Ruizhe Chen
  • Xiaoxuan He
  • Lianrui Mu

Quantization has emerged as a promising direction for model compression. Recently, data-free quantization, which synthesizes images as an alternative to real training data, has been widely studied as a way to avoid privacy concerns. Existing methods use a classification loss to ensure the reliability of the synthesized images. Unfortunately, even if these images are well classified by the pre-trained model, they still suffer from low semantics and homogenization. Intuitively, such low-semantic images are sensitive to perturbations, and the pre-trained model tends to produce inconsistent outputs when the generator synthesizes an image with low semantics. To this end, we propose Robustness-Guided Image Synthesis (RIS), a simple but effective method to enrich the semantics of synthetic images and improve image diversity, further boosting the performance of data-free compression tasks. Concretely, we first introduce perturbations on the input and the model weights, and then define inconsistency metrics at the feature and prediction levels before and after perturbation. Based on the inconsistency at these two levels, we design a robustness optimization objective to eliminate low-semantic images. Moreover, we make our approach diversity-aware by forcing the generator to synthesize images with small mutual correlations. With RIS, we achieve state-of-the-art performance in various data-free quantization settings, and the method can be extended to other data-free compression tasks.
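
A minimal sketch of the prediction-level inconsistency idea is shown below, assuming an input-noise perturbation and a KL-divergence measure; RIS's actual perturbations, metrics, and objective are described in the paper, and everything named here is hypothetical.

```python
# Illustrative sketch only, not RIS's implementation.
import torch
import torch.nn.functional as F


def prediction_inconsistency(model, images, noise_std=0.05):
    """KL divergence between the model's predictions on clean and slightly perturbed images."""
    clean_logp = F.log_softmax(model(images), dim=1)
    noisy_p = F.softmax(model(images + noise_std * torch.randn_like(images)), dim=1)
    # Larger values suggest perturbation-sensitive, low-semantic synthetic images.
    return F.kl_div(clean_logp, noisy_p, reduction="batchmean")


if __name__ == "__main__":
    import torchvision

    net = torchvision.models.resnet18(weights=None, num_classes=10).eval()
    fake_batch = torch.rand(8, 3, 224, 224)   # stand-in for a generator's output
    print(prediction_inconsistency(net, fake_batch).item())
```

In a data-free setup, a quantity like this could be added to the generator's objective so that images the frozen model classifies inconsistently are penalized; a feature-level analogue would compare intermediate activations instead of logits.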

ICLR Conference 2023 Conference Paper

On the Effectiveness of Out-of-Distribution Data in Self-Supervised Long-Tail Learning

  • Jianhong Bai
  • Zuozhu Liu
  • Hualiang Wang
  • Jin Hao
  • Yang Feng 0011
  • Huanpeng Chu
  • Haoji Hu

Though self-supervised learning (SSL) has been widely studied as a promising technique for representation learning, it does not generalize well on long-tailed datasets because the majority classes dominate the feature space. Recent work shows that long-tailed learning performance can be boosted by sampling extra in-domain (ID) data for self-supervised training; however, large-scale ID data that can rebalance the minority classes are expensive to collect. In this paper, we propose an alternative, easy-to-use, and effective solution, Contrastive with Out-of-distribution (OOD) data for Long-Tail learning (COLT), which can effectively exploit OOD data to dynamically re-balance the feature space. We empirically identify the counter-intuitive usefulness of OOD samples in SSL long-tailed learning and design a novel SSL method in a principled way. Concretely, we first localize the 'head' and 'tail' samples by assigning a tailness score to each OOD sample based on its neighborhood in the feature space. Then, we propose an online OOD sampling strategy to dynamically re-balance the feature space. Finally, we enforce the model to distinguish ID and OOD samples via a distribution-level supervised contrastive loss. Extensive experiments are conducted on various datasets and several state-of-the-art SSL frameworks to verify the effectiveness of the proposed method. The results show that our method significantly improves the performance of SSL on long-tailed datasets and even outperforms previous work that uses external ID data. Our code is available at https://github.com/JianhongBai/COLT.
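
As a rough picture of the neighborhood-based scoring step (not COLT's actual definition, which is in the paper and repository linked above), one could score each OOD feature by the sparsity of its ID neighborhood and keep the most tail-like samples; all names and the k-NN distance heuristic below are assumptions.

```python
# Illustrative sketch only, not the COLT implementation from the linked repository.
import torch


def tailness_scores(ood_feats, id_feats, k=10):
    """Mean distance to the k nearest ID features; larger means a sparser, more tail-like region."""
    dists = torch.cdist(ood_feats, id_feats)             # (num_ood, num_id) pairwise distances
    knn_dists, _ = dists.topk(k, dim=1, largest=False)   # k closest ID neighbours per OOD sample
    return knn_dists.mean(dim=1)


def sample_ood_for_tail(ood_feats, id_feats, budget, k=10):
    """Pick the OOD samples that fall in the sparsest ID regions to re-balance the feature space."""
    scores = tailness_scores(ood_feats, id_feats, k)
    return scores.topk(budget).indices


if __name__ == "__main__":
    id_feats = torch.randn(1000, 128)    # stand-in for ID embeddings from the SSL encoder
    ood_feats = torch.randn(300, 128)    # stand-in for OOD embeddings
    chosen = sample_ood_for_tail(ood_feats, id_feats, budget=64)
    print(chosen.shape)                  # torch.Size([64])
```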

NeurIPS Conference 2023 Conference Paper

Towards Distribution-Agnostic Generalized Category Discovery

  • Jianhong Bai
  • Zuozhu Liu
  • Hualiang Wang
  • Ruizhe Chen
  • Lianrui Mu
  • Xiaomeng Li
  • Joey Tianyi Zhou
  • Yang Feng

Data imbalance and open-ended distribution are two intrinsic characteristics of the real visual world. Though encouraging progress has been made in tackling each challenge separately, few works are dedicated to combining them for real-world scenarios. While several previous works have focused on classifying close-set samples and detecting open-set samples during testing, it is still essential to be able to classify unknown subjects, as human beings do. In this paper, we formally define a more realistic task, distribution-agnostic generalized category discovery (DA-GCD): generating fine-grained predictions for both close- and open-set classes in a long-tailed open-world setting. To tackle this challenging problem, we propose a Self-Balanced Co-Advice contrastive framework (BaCon), which consists of a contrastive-learning branch and a pseudo-labeling branch that work collaboratively to provide interactive supervision for the DA-GCD task. In particular, the contrastive-learning branch provides reliable distribution estimation to regularize the predictions of the pseudo-labeling branch, which in turn guides contrastive learning through self-balanced knowledge transfer and a proposed novel contrastive loss. We compare BaCon with state-of-the-art methods from two closely related fields: imbalanced semi-supervised learning and generalized category discovery. The effectiveness of BaCon is demonstrated by superior performance over all baselines and comprehensive analysis across various datasets. Our code is publicly available.

AAAI Conference 2022 Conference Paper

Renovate Yourself: Calibrating Feature Representation of Misclassified Pixels for Semantic Segmentation

  • Hualiang Wang
  • Huanpeng Chu
  • Siming Fu
  • Zuozhu Liu
  • Haoji Hu

Existing image semantic segmentation methods favor learning consistent representations by extracting long-range contextual features with attention, multi-scale, or graph aggregation strategies. These methods usually treat misclassified and correctly classified pixels equally, which misleads the optimization process and causes inconsistent intra-class pixel feature representations in the embedding space during learning. In this paper, we propose the auxiliary representation calibration head (RCH), which consists of image decoupling, prototype clustering, and error calibration modules together with a metric loss function, to calibrate these error-prone feature representations for better intra-class consistency and segmentation performance. RCH can be incorporated into the hidden layers, trained together with the segmentation network, and detached at inference time without additional parameters. Experimental results show that our method significantly boosts the performance of current segmentation methods on multiple datasets (e.g., we outperform the original HRNet and OCRNet by 1.1% and 0.9% mIoU on the Cityscapes test set). Code is available at https://github.com/VipaiLab/RCH.
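
For intuition only, a toy version of calibrating misclassified pixel features toward class prototypes could look like the sketch below; RCH's actual modules and metric loss live in the linked repository, and the cosine-based formulation and all names here are invented.

```python
# Illustrative sketch only, not the RCH implementation.
import torch
import torch.nn.functional as F


def calibration_loss(pixel_feats, preds, labels, num_classes):
    """Pull misclassified pixel features toward the prototype of their ground-truth class."""
    # pixel_feats: (num_pixels, dim); preds / labels: (num_pixels,) predicted / ground-truth ids
    correct = preds == labels
    loss, terms = pixel_feats.new_zeros(()), 0
    for c in range(num_classes):
        proto_mask = correct & (labels == c)        # well-classified pixels define the prototype
        wrong_mask = (~correct) & (labels == c)     # error-prone pixels get calibrated
        if proto_mask.any() and wrong_mask.any():
            prototype = pixel_feats[proto_mask].mean(dim=0)
            sim = F.cosine_similarity(pixel_feats[wrong_mask], prototype.unsqueeze(0), dim=1)
            loss = loss + (1 - sim).mean()
            terms += 1
    return loss / max(terms, 1)


if __name__ == "__main__":
    feats = torch.randn(4096, 256)                  # stand-in for hidden-layer pixel features
    labels = torch.randint(0, 19, (4096,))          # e.g. Cityscapes-style 19 classes
    preds = torch.randint(0, 19, (4096,))
    print(calibration_loss(feats, preds, labels, num_classes=19).item())
```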

AAAI Conference 2020 Conference Paper

Appearance and Motion Enhancement for Video-Based Person Re-Identification

  • Shuzhao Li
  • Huimin Yu
  • Haoji Hu

In this paper, we propose an Appearance and Motion Enhancement Model (AMEM) for video-based person re-identification that enriches the two kinds of information carried by the backbone network in a more interpretable way. Concretely, human attribute recognition under the supervision of pseudo labels is exploited in an Appearance Enhancement Module (AEM) to enrich appearance and semantic information. A Motion Enhancement Module (MEM) is designed to capture identity-discriminative walking patterns by predicting future frames. Although the full model carries several auxiliary modules during training, only the backbone plus two small branches are kept for similarity evaluation, yielding a simple but effective final model. Extensive experiments on three popular video-based person ReID benchmarks demonstrate the effectiveness of the proposed model and its state-of-the-art performance compared with existing methods.