Author name cluster

Ran He

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

42 papers
1 author row

Possible papers

AAAI Conference 2026 Conference Paper

CoGrad3D: Spatially-Coupled Timestep Optimization with Orthogonal Gradient Fusion for 3D Generation

  • Haoyang Tong
  • Hongbo Wang
  • Jin Liu
  • Qi Wang
  • Jie Cao
  • Ran He

Score Distillation Sampling has driven recent advances in text-to-3D generation. However, current approaches often fail to produce 3D assets that are both rich in detail and consistent across viewpoints. These limitations primarily arise from imbalanced guidance on fine-grained details and an overdependence on single-view optimization—issues exacerbated by the excessive randomness in selecting diffusion timesteps and camera configurations. Such deficiencies commonly lead to blurry textures and inter-view inconsistencies, which degrade visual realism and hinder practical deployment. To tackle these challenges, we introduce CoGrad3D, a unified generative refinement framework that adopts a continuously adaptive optimization strategy. By dynamically modulating the optimization focus based on real-time convergence signals, CoGrad3D ensures balanced progress toward both geometric completeness and high-fidelity detail. Concretely, we propose an adaptive region sampling strategy that emphasizes under-converged viewing areas, promoting stable and uniform optimization. To facilitate the transition from coarse geometry to fine-grained reconstruction, we develop a region-aware temporal scheduling scheme that integrates global training dynamics with local convergence feedback. Furthermore, we introduce a gradient fusion mechanism that consolidates historical gradients from adjacent viewpoints, mitigating view-specific artifacts and promoting the emergence of coherent 3D structures. Extensive experiments demonstrate that CoGrad3D substantially surpasses existing methods in both geometric consistency and texture fidelity, enabling the generation of high-quality, view-consistent 3D models from textual descriptions.

NeurIPS Conference 2025 Conference Paper

DiCo: Revitalizing ConvNets for Scalable and Efficient Diffusion Modeling

  • Yuang Ai
  • Qihang Fan
  • Xuefeng Hu
  • Zhenheng Yang
  • Ran He
  • Huaibo Huang

Diffusion Transformer (DiT), a promising diffusion model for visual generation, demonstrates impressive performance but incurs significant computational overhead. Intriguingly, analysis of pre-trained DiT models reveals that global self-attention is often redundant, predominantly capturing local patterns, which highlights the potential for more efficient alternatives. In this paper, we revisit convolution as an alternative building block for constructing efficient and expressive diffusion models. However, naively replacing self-attention with convolution typically results in degraded performance. Our investigations attribute this performance gap to the higher channel redundancy in ConvNets compared to Transformers. To resolve this, we introduce a compact channel attention mechanism that promotes the activation of more diverse channels, thereby enhancing feature diversity. This leads to Diffusion ConvNet (DiCo), a family of diffusion models built entirely from standard ConvNet modules, offering strong generative performance with significant efficiency gains. On class-conditional ImageNet generation benchmarks, DiCo-XL achieves an FID of 2.05 at 256$\times$256 resolution and 2.53 at 512$\times$512, with a **2.7$\times$** and **3.1$\times$** speedup over DiT-XL/2, respectively. Furthermore, experimental results on MS-COCO demonstrate that the purely convolutional DiCo exhibits strong potential for text-to-image generation.

AAAI Conference 2025 Conference Paper

Exploring Vacant Classes in Label-Skewed Federated Learning

  • Kuangpu Guo
  • Yuhe Ding
  • Jian Liang
  • Zilei Wang
  • Ran He
  • Tieniu Tan

Label skews, characterized by disparities in local label distribution across clients, pose a significant challenge in federated learning. As minority classes suffer from worse accuracy due to overfitting on local imbalanced data, prior methods often incorporate class-balanced learning techniques during local training. Although these methods improve the mean accuracy across all classes, we observe that vacant classes—referring to categories absent from a client's data distribution—remain poorly recognized. Besides, there is still a gap in the accuracy of local models on minority classes compared to the global model. This paper introduces FedVLS, a novel approach to label-skewed federated learning that integrates both vacant-class distillation and logit suppression simultaneously. Specifically, vacant-class distillation leverages knowledge distillation during local training on each client to retain essential information related to vacant classes from the global model. Moreover, logit suppression directly penalizes network logits for non-label classes, effectively addressing misclassifications in minority classes that may be biased toward majority classes. Extensive experiments validate the efficacy of FedVLS, demonstrating superior performance compared to previous state-of-the-art (SOTA) methods across diverse datasets with varying degrees of label skews.

NeurIPS Conference 2025 Conference Paper

MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models

  • Chaoyou Fu
  • Peixian Chen
  • Yunhang Shen
  • Yulei Qin
  • Mengdan Zhang
  • Xu Lin
  • Jinrui Yang
  • Xiawu Zheng

Multimodal Large Language Models (MLLMs) rely on a powerful LLM to perform multimodal tasks, showing amazing emergent abilities in recent studies, such as writing poems based on an image. However, these case studies can hardly reflect the full performance of MLLMs, and a comprehensive evaluation is lacking. In this paper, we fill this gap by presenting MME, the first comprehensive MLLM evaluation benchmark. It measures both perception and cognition abilities on a total of 14 subtasks. To avoid the data leakage that may arise from direct use of public datasets for evaluation, the annotations of instruction-answer pairs are all manually designed. The concise instruction design allows us to compare MLLMs fairly instead of struggling with prompt engineering. Besides, with such instructions, we can also easily carry out quantitative statistics. A total of 30 advanced MLLMs are comprehensively evaluated on MME, which not only suggests that existing MLLMs still have large room for improvement, but also reveals potential directions for subsequent model optimization. The data are released at the project page: https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models/tree/Evaluation.

AAAI Conference 2025 Conference Paper

Protecting Model Adaptation from Trojans in the Unlabeled Data

  • Lijun Sheng
  • Jian Liang
  • Ran He
  • Zilei Wang
  • Tieniu Tan

Model adaptation tackles the distribution shift problem with a pre-trained model instead of raw data, which has become a popular paradigm due to its great privacy protection. Existing methods always assume adapting to a clean target domain, overlooking the security risks of unlabeled samples. This paper for the first time explores the potential trojan attacks on model adaptation launched by well-designed poisoning target data. Concretely, we provide two trigger patterns with two poisoning strategies for different prior knowledge owned by attackers. These attacks achieve a high success rate while maintaining the normal performance on clean samples in the test stage. To defend against such backdoor injection, we propose a plug-and-play method named DiffAdapt, which can be seamlessly integrated with existing adaptation algorithms. Experiments across commonly used benchmarks and adaptation methods demonstrate the effectiveness of DiffAdapt. We hope this work will shed light on the safety of transfer learning with unlabeled data.

NeurIPS Conference 2025 Conference Paper

The Illusion of Progress? A Critical Look at Test-Time Adaptation for Vision-Language Models

  • Lijun Sheng
  • Jian Liang
  • Ran He
  • Zilei Wang
  • Tieniu Tan

Test-time adaptation (TTA) methods have gained significant attention for enhancing the performance of vision-language models (VLMs) such as CLIP during inference, without requiring additional labeled data. However, current TTA research generally suffers from major limitations such as duplication of baseline results, limited evaluation metrics, inconsistent experimental settings, and insufficient analysis. These problems hinder fair comparisons between TTA methods and make it difficult to assess their practical strengths and weaknesses. To address these challenges, we introduce TTA-VLM, a comprehensive benchmark for evaluating TTA methods on VLMs. Our benchmark implements 8 episodic TTA and 7 online TTA methods within a unified and reproducible framework, and evaluates them across 15 widely used datasets. Unlike prior studies focused solely on CLIP, we extend the evaluation to SigLIP, a model trained with a Sigmoid loss, and include training-time tuning methods such as CoOp, MaPLe, and TeCoA to assess generality. Beyond classification accuracy, TTA-VLM incorporates various evaluation metrics, including robustness, calibration, out-of-distribution detection, and stability, enabling a more holistic assessment of TTA methods. Through extensive experiments, we find that 1) existing TTA methods produce limited gains compared to the previous pioneering work; 2) current TTA methods exhibit poor collaboration with training-time fine-tuning methods; 3) accuracy gains frequently come at the cost of reduced model trustworthiness. We release TTA-VLM to provide fair comparison and comprehensive evaluation of TTA methods for VLMs, and we hope it encourages the community to develop more reliable and generalizable TTA strategies. The code is available at https://github.com/TomSheng21/tta-vlm.

NeurIPS Conference 2025 Conference Paper

Video-SafetyBench: A Benchmark for Safety Evaluation of Video LVLMs

  • Xuannan Liu
  • Zekun Li
  • Zheqi He
  • Peipei Li
  • Shuhan Xia
  • Xing Cui
  • Huaibo Huang
  • Xi Yang

The increasing deployment of Large Vision-Language Models (LVLMs) raises safety concerns under potential malicious inputs. However, existing multimodal safety evaluations primarily focus on model vulnerabilities exposed by static image inputs, ignoring the temporal dynamics of video that may induce distinct safety risks. To bridge this gap, we introduce Video-SafetyBench, the first comprehensive benchmark designed to evaluate the safety of LVLMs under video-text attacks. It comprises 2,264 video-text pairs spanning 48 fine-grained unsafe categories, each pairing a synthesized video with either a harmful query, which contains explicit malice, or a benign query, which appears harmless but triggers harmful behavior when interpreted alongside the video. To generate semantically accurate videos for safety evaluation, we design a controllable pipeline that decomposes video semantics into subject images (what is shown) and motion text (how it moves), which jointly guide the synthesis of query-relevant videos. To effectively evaluate uncertain or borderline harmful outputs, we propose RJScore, a novel LLM-based metric that incorporates the confidence of judge models and human-aligned decision threshold calibration. Extensive experiments show that benign-query video composition achieves average attack success rates of 67.2%, revealing consistent vulnerabilities to video-induced attacks. We believe Video-SafetyBench will catalyze future research into video-based safety evaluation and defense strategies.

NeurIPS Conference 2025 Conference Paper

VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction

  • Chaoyou Fu
  • Haojia Lin
  • Xiong Wang
  • Yifan Zhang
  • Yunhang Shen
  • Xiaoyu Liu
  • Haoyu Cao
  • Zuwei Long

Recent Multimodal Large Language Models (MLLMs) have typically focused on integrating visual and textual modalities, with less emphasis placed on the role of speech in enhancing interaction. However, speech plays a crucial role in multimodal dialogue systems, and achieving high performance in both vision and speech tasks remains a challenge due to the fundamental modality differences. In this paper, we propose a carefully designed multi-stage training methodology that progressively trains an LLM to understand both visual and speech information, ultimately enabling fluent vision and speech interaction. Our approach not only preserves strong vision-language capacity, but also enables efficient speech-to-speech dialogue capabilities without separate ASR and TTS modules, significantly accelerating multimodal end-to-end response speed. By comparing against state-of-the-art counterparts across benchmarks for image, video, and speech, we demonstrate that our omni model is equipped with both strong visual and speech capabilities, enabling omni-modal understanding and interaction.

NeurIPS Conference 2025 Conference Paper

ZeroPatcher: Training-free Sampler for Video Inpainting and Editing

  • Shaoshu Yang
  • Yingya Zhang
  • Ran He

Video inpainting and editing have long been challenging tasks in the video generation community, requiring extensive computational resources and large datasets to train models with satisfactory performance. Recent breakthroughs in large-scale video foundation models have greatly enhanced text-to-video generation capabilities. This naturally leads to the idea of leveraging the prior knowledge from these powerful generators to facilitate video inpainting and editing. In this work, we investigate the feasibility of employing pre-trained text-to-video foundation models for high-quality video inpainting and editing without additional training. Specifically, we introduce a model-agnostic denoising sampler that optimizes the trajectory by maximizing the log-likelihood expectation conditioned on the known video segments. To enable efficient dynamic object removal and replacement, we propose a latent mask fuser that performs accurate video masking directly in latent space, eliminating the need for explicit VAE decoding and encoding. We implement our approach in widely-used foundation generators such as CogVideoX and HunyuanVideo, demonstrating the model-agnostic nature of our sampler. Comprehensive quantitative and qualitative evaluations confirm that our method achieves outstanding video inpainting and editing performance in a plug-and-play fashion.

NeurIPS Conference 2024 Conference Paper

Hallo3D: Multi-Modal Hallucination Detection and Mitigation for Consistent 3D Content Generation

  • Hongbo Wang
  • Jie Cao
  • Jin Liu
  • Xiaoqiang Zhou
  • Huaibo Huang
  • Ran He

Recent advancements in 3D content generation have been significant, primarily due to the visual priors provided by pretrained diffusion models. However, large 2D visual models exhibit spatial perception hallucinations, leading to multi-view inconsistency in 3D content generated through Score Distillation Sampling (SDS). This phenomenon, characterized by overfitting to specific views, is referred to as the "Janus Problem". In this work, we investigate the hallucination issues of pretrained models and find that large multimodal models without geometric constraints possess the capability to infer geometric structures, which can be utilized to mitigate multi-view inconsistency. Building on this, we propose a novel tuning-free method. We formulate multimodal inconsistency queries to detect specific hallucinations in 3D content, and use the result as an enhanced prompt to restore consistency in the 2D renderings of the 3D content and jointly optimize the structure and appearance across different views. Our approach does not require 3D training data and can be implemented plug-and-play within existing frameworks. Extensive experiments demonstrate that our method significantly improves the consistency of 3D content generation and specifically mitigates hallucinations caused by pretrained large models, achieving state-of-the-art performance compared to other optimization methods.

AAAI Conference 2024 Conference Paper

Heterogeneous Test-Time Training for Multi-Modal Person Re-identification

  • Zi Wang
  • Huaibo Huang
  • Aihua Zheng
  • Ran He

Multi-modal person re-identification (ReID) seeks to mitigate challenging lighting conditions by incorporating diverse modalities. Most existing multi-modal ReID methods concentrate on leveraging complementary multi-modal information via fusion or interaction. However, the relationships among heterogeneous modalities and the domain traits of unlabeled test data are rarely explored. In this paper, we propose a Heterogeneous Test-time Training (HTT) framework for multi-modal person ReID. We first propose a Cross-identity Inter-modal Margin (CIM) loss to amplify the differentiation among distinct identity samples. Moreover, we design a Multi-modal Test-time Training (MTT) strategy to enhance the generalization of the model by leveraging the relationships in the heterogeneous modalities and the information existing in the test data. Specifically, in the training stage, we utilize the CIM loss to further enlarge the distance between anchor and negative by forcing the inter-modal distance to maintain the margin, resulting in an enhancement of the discriminative capacity of the ultimate descriptor. Subsequently, since the test data contains characteristics of the target domain, we adapt the MTT strategy to optimize the network before the inference by using self-supervised tasks designed based on relationships among modalities. Experimental results on benchmark multi-modal ReID datasets RGBNT201, Market1501-MM, RGBN300, and RGBNT100 validate the effectiveness of the proposed method. The codes can be found at https://github.com/ziwang1121/HTT.

NeurIPS Conference 2024 Conference Paper

Not Just Object, But State: Compositional Incremental Learning without Forgetting

  • Yanyi Zhang
  • Binglin Qiu
  • Qi Jia
  • Yu Liu
  • Ran He

Most incremental learners excessively prioritize object classes while neglecting various kinds of states (e.g., color and material) attached to the objects. As a result, they are limited in their ability to model state-object compositionality accurately. To remedy this limitation, we propose a novel task called Compositional Incremental Learning (composition-IL), which enables the model to recognize a variety of state-object compositions in an incremental learning fashion. Owing to the lack of suitable datasets, we re-organize two existing datasets and tailor them for composition-IL. Then, we propose a prompt-based Composition Incremental Learner (CompILer) to overcome the ambiguous composition boundary. Specifically, we exploit multi-pool prompt learning and ensure inter-pool prompt discrepancy and intra-pool prompt diversity. Besides, we devise object-injected state prompting, which injects object prompts to guide the selection of state prompts. Furthermore, we fuse the selected prompts by a generalized-mean strategy to eliminate irrelevant information learned in the prompts. Extensive experiments on two datasets exhibit state-of-the-art performance achieved by CompILer. Code and datasets are available at: https://github.com/Yanyi-Zhang/CompILer.
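The abstract above mentions fusing selected prompts with a generalized-mean strategy. As a rough illustration only, the sketch below computes the standard element-wise generalized (power) mean of several vectors. The function name `generalized_mean`, the exponent `p`, and the assumption of non-negative entries are all hypothetical choices made for this example; the abstract does not specify how CompILer applies the operation.

```python
from typing import List


def generalized_mean(vectors: List[List[float]], p: float = 3.0) -> List[float]:
    """Element-wise generalized (power) mean of equally sized vectors.

    Assumption for this sketch: all entries are non-negative and p != 0,
    so that raising to the power p and taking the p-th root is well defined.
    """
    if not vectors:
        raise ValueError("need at least one vector")
    if p == 0:
        raise ValueError("p must be non-zero in this sketch")
    n = len(vectors)
    dim = len(vectors[0])
    if any(len(v) != dim for v in vectors):
        raise ValueError("all vectors must have the same length")

    fused = []
    for j in range(dim):
        # Average the p-th powers of coordinate j, then take the p-th root.
        total = sum(v[j] ** p for v in vectors)
        fused.append((total / n) ** (1.0 / p))
    return fused


if __name__ == "__main__":
    prompts = [[1.0, 2.0], [3.0, 2.0]]
    print(generalized_mean(prompts, p=1.0))   # arithmetic mean
    print(generalized_mean(prompts, p=50.0))  # moves toward the element-wise max
```

With `p = 1` this reduces to the arithmetic mean, and as `p` grows the result approaches the element-wise maximum, so the exponent controls how strongly the largest entries dominate the fused vector.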

NeurIPS Conference 2024 Conference Paper

Visual Anchors Are Strong Information Aggregators For Multimodal Large Language Model

  • Haogeng Liu
  • Quanzeng You
  • Xiaotian Han
  • Yongfei Liu
  • Huaibo Huang
  • Ran He
  • Hongxia Yang

In the realm of Multimodal Large Language Models (MLLMs), the vision-language connector plays a crucial role in linking pre-trained vision encoders with Large Language Models (LLMs). Despite its importance, the vision-language connector has been relatively underexplored. In this study, we aim to propose a strong vision-language connector that enables an MLLM to simultaneously achieve high accuracy and low computation cost. We first reveal the existence of visual anchors in the Vision Transformer and propose a cost-effective search algorithm to progressively extract them. Building on these findings, we introduce the Anchor Former (AcFormer), a novel vision-language connector designed to leverage the rich prior knowledge obtained from these visual anchors during pretraining, guiding the aggregation of information. Through extensive experimentation, we demonstrate that the proposed method reduces computational costs by nearly two-thirds while simultaneously outperforming baseline methods. This highlights the effectiveness and efficiency of AcFormer.

EAAI Journal 2023 Journal Article

Diverse features discovery transformer for pedestrian attribute recognition

  • Aihua Zheng
  • Huimin Wang
  • Jiaxiang Wang
  • Huaibo Huang
  • Ran He
  • Amir Hussain

Recently, Swin Transformer has been widely explored as a general backbone for computer vision, which helps to improve the performance of vision tasks due to the ability to establish associations for long-range dependencies of different spatial locations. By implementing the pedestrian attribute recognition with Swin Transformer, we observe that Swin Transformer tends to focus on a relatively small number of local regions within which attributes may be correlated with other attributes, which leads Swin Transformer to predict attributes in those neglected regions based on such correlation. In fact, discriminative information may exist within these neglected regions, which is crucial for attribute identification. To address this problem, we propose a novel diverse features discovery transformer (DFDT) which can find more attribute relationship regions for robust pedestrian attribute recognition. First, Swin Transformer is used as a feature extraction network to acquire attribute features with the long-distance association, which predicts the corresponding attribute information. Second, we propose a diverse features suppression module (DFSM) to obtain semantic features directly associated with attributes by suppressing the peak locations of the most discriminative features and randomly selected feature regions to spread the feature regions that Swin Transformer is interested in. Third, we plug the diverse features suppression module into different stages of Swin Transformer to learn detailed texture features to help recognition. In addition, we have divided the attribute features into multiple vertical feature regions to improve the focus on local attribute features. Experiments on three benchmark datasets validate the effectiveness of the proposed algorithm.

NeurIPS Conference 2023 Conference Paper

Learning-to-Rank Meets Language: Boosting Language-Driven Ordering Alignment for Ordinal Classification

  • Rui Wang
  • Peipei Li
  • Huaibo Huang
  • Chunshui Cao
  • Ran He
  • Zhaofeng He

We present a novel language-driven ordering alignment method for ordinal classification. The labels in ordinal classification contain additional ordering relations, making them prone to overfitting when relying solely on training data. Recent developments in pre-trained vision-language models inspire us to leverage the rich ordinal priors in human language by converting the original task into a vision-language alignment task. Consequently, we propose L2RCLIP, which fully utilizes the language priors from two perspectives. First, we introduce a complementary prompt tuning technique called RankFormer, designed to enhance the ordering relation of original rank prompts. It employs token-level attention with residual-style prompt blending in the word embedding space. Second, to further incorporate language priors, we revisit the approximate bound optimization of vanilla cross-entropy loss and restructure it within the cross-modal embedding space. Consequently, we propose a cross-modal ordinal pairwise loss to refine the CLIP feature space, where texts and images maintain both semantic alignment and ordering alignment. Extensive experiments on three ordinal classification tasks, including facial age estimation, historical color image (HCI) classification, and aesthetic assessment demonstrate its promising performance.

NeurIPS Conference 2023 Conference Paper

Lightweight Vision Transformer with Bidirectional Interaction

  • Qihang Fan
  • Huaibo Huang
  • Xiaoqiang Zhou
  • Ran He

Recent advancements in vision backbones have significantly improved their performance by simultaneously modeling images’ local and global contexts. However, the bidirectional interaction between these two contexts has not been well explored and exploited, which is important in the human visual system. This paper proposes a Fully Adaptive Self-Attention (FASA) mechanism for vision transformer to model the local and global information as well as the bidirectional interaction between them in context-aware ways. Specifically, FASA employs self-modulated convolutions to adaptively extract local representation while utilizing self-attention in down-sampled space to extract global representation. Subsequently, it conducts a bidirectional adaptation process between local and global representation to model their interaction. In addition, we introduce a fine-grained downsampling strategy to enhance the down-sampled self-attention mechanism for finer-grained global perception capability. Based on FASA, we develop a family of lightweight vision backbones, the Fully Adaptive Transformer (FAT) family. Extensive experiments on multiple vision tasks demonstrate that FAT achieves impressive performance. Notably, FAT accomplishes a 77.6% accuracy on ImageNet-1K using only 4.5M parameters and 0.7G FLOPs, which surpasses the most advanced ConvNets and Transformers with similar model size and computational costs. Moreover, our model exhibits faster speed on modern GPU compared to other models.

NeurIPS Conference 2022 Conference Paper

Are You Stealing My Model? Sample Correlation for Fingerprinting Deep Neural Networks

  • Jiyang Guan
  • Jian Liang
  • Ran He

An off-the-shelf model as a commercial service could be stolen by model stealing attacks, posing great threats to the rights of the model owner. Model fingerprinting aims to verify whether a suspect model is stolen from the victim model, which gains more and more attention nowadays. Previous methods always leverage the transferable adversarial examples as the model fingerprint, which is sensitive to adversarial defense or transfer learning scenarios. To address this issue, we consider the pairwise relationship between samples instead and propose a novel yet simple model stealing detection method based on SAmple Correlation (SAC). Specifically, we present SAC-w that selects wrongly classified normal samples as model inputs and calculates the mean correlation among their model outputs. To reduce the training time, we further develop SAC-m that selects CutMix Augmented samples as model inputs, without the need for training the surrogate models or generating adversarial examples. Extensive results validate that SAC successfully defends against various model stealing attacks, even including adversarial training or transfer learning, and detects the stolen models with the best performance in terms of AUC across different datasets and model architectures. The codes are available at https://github.com/guanjiyang/SAC.
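The abstract above describes fingerprinting a model through the pairwise correlation among its outputs on a fixed set of inputs. The sketch below shows one plausible reading of that idea, not the authors' implementation: build a cosine-similarity matrix over each model's outputs and compare two models by the mean absolute difference between their matrices. The helper names (`cosine`, `correlation_matrix`, `correlation_distance`) and the choice of cosine similarity are assumptions made for illustration.

```python
import math
from typing import List

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two output vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def correlation_matrix(outputs: List[Vector]) -> List[List[float]]:
    """Pairwise similarity among one model's outputs on the probe samples."""
    return [[cosine(a, b) for b in outputs] for a in outputs]


def correlation_distance(outputs_a: List[Vector], outputs_b: List[Vector]) -> float:
    """Mean absolute difference between two models' correlation matrices.

    Both models must be evaluated on the same probe samples in the same order.
    A small distance suggests the models relate the samples to each other in a
    similar way; the decision threshold is left to the caller.
    """
    if len(outputs_a) != len(outputs_b) or not outputs_a:
        raise ValueError("both models need outputs for the same non-empty sample set")
    corr_a = correlation_matrix(outputs_a)
    corr_b = correlation_matrix(outputs_b)
    n = len(corr_a)
    total = sum(abs(corr_a[i][j] - corr_b[i][j]) for i in range(n) for j in range(n))
    return total / (n * n)


if __name__ == "__main__":
    victim = [[1.0, 0.0], [0.0, 1.0]]
    suspect = [[1.0, 0.0], [1.0, 0.0]]
    print(correlation_distance(victim, victim))
    print(correlation_distance(victim, suspect))
```

Comparing sample-to-sample relationships rather than individual predictions is what makes this kind of fingerprint independent of any single adversarial example, which matches the motivation stated in the abstract.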

AAAI Conference 2022 Conference Paper

Interact, Embed, and EnlargE: Boosting Modality-Specific Representations for Multi-Modal Person Re-identification

  • Zi Wang
  • Chenglong Li
  • Aihua Zheng
  • Ran He
  • Jin Tang

Multi-modal person Re-ID introduces more complementary information to assist the traditional Re-ID task. Existing multi-modal methods ignore the importance of modality-specific information in the feature fusion stage. To this end, we propose a novel method to boost modality-specific representations for multi-modal person Re-ID: Interact, Embed, and EnlargE (IEEE). First, we propose a cross-modal interacting module to exchange useful information between different modalities in the feature extraction phase. Second, we propose a relation-based embedding module to enhance the richness of feature descriptors by embedding the global feature into the fine-grained local information. Finally, we propose multi-modal margin loss to force the network to learn modality-specific information for each modality by enlarging the intra-class discrepancy. Superior performance on the multi-modal Re-ID dataset RGBNT201 and three constructed Re-ID datasets validates the effectiveness of the proposed method compared with the state-of-the-art approaches.

NeurIPS Conference 2022 Conference Paper

Orthogonal Transformer: An Efficient Vision Transformer Backbone with Token Orthogonalization

  • Huaibo Huang
  • Xiaoqiang Zhou
  • Ran He

We present a general vision transformer backbone, called Orthogonal Transformer, in pursuit of both efficiency and effectiveness. A major challenge for vision transformers is that self-attention, as the key element in capturing long-range dependency, is very computationally expensive for dense prediction tasks (e.g., object detection). Coarse global self-attention and local self-attention are then designed to reduce the cost, but they suffer from either neglecting local correlations or hurting global modeling. We present an orthogonal self-attention mechanism to alleviate these issues. Specifically, self-attention is computed in the orthogonal space that is reversible to the spatial domain but has much lower resolution. The capabilities of learning global dependency and exploring local correlations are maintained because every orthogonal token in self-attention can attend to the entire visual tokens. Remarkably, orthogonality is realized by constructing an endogenously orthogonal matrix that is friendly to neural networks and can be optimized as arbitrary orthogonal matrices. We also introduce Positional MLP to incorporate position information for arbitrary input resolutions as well as enhance the capacity of MLP. Finally, we develop a hierarchical architecture for Orthogonal Transformer. Extensive experiments demonstrate its strong performance on a broad range of vision tasks, including image classification, object detection, instance segmentation and semantic segmentation.

NeurIPS Conference 2020 Conference Paper

AOT: Appearance Optimal Transport Based Identity Swapping for Forgery Detection

  • Hao Zhu
  • Chaoyou Fu
  • Qianyi Wu
  • Wayne Wu
  • Chen Qian
  • Ran He

Recent studies have shown that the performance of forgery detection can be improved with diverse and challenging Deepfakes datasets. However, due to the lack of Deepfakes datasets with large variance in appearance, which can hardly be produced by recent identity swapping methods, the detection algorithm may fail in this situation. In this work, we provide a new identity swapping algorithm with large differences in appearance for face forgery detection. The appearance gaps mainly arise from the large discrepancies in illuminations and skin colors that widely exist in real-world scenarios. However, due to the difficulties of modeling the complex appearance mapping, it is challenging to transfer fine-grained appearances adaptively while preserving identity traits. This paper formulates appearance mapping as an optimal transport problem and proposes an Appearance Optimal Transport model (AOT) to formulate it in both latent and pixel space. Specifically, a relighting generator is designed to simulate the optimal transport plan. It is solved via minimizing the Wasserstein distance of the learned features in the latent space, enabling better performance and less computation than conventional optimization. To further refine the solution of the optimal transport plan, we develop a segmentation game to minimize the Wasserstein distance in the pixel space. A discriminator is introduced to distinguish the fake parts from a mix of real and fake image patches. Extensive experiments reveal the superiority of our method when compared with state-of-the-art methods and the ability of our generated data to improve the performance of face forgery detection.

IJCAI Conference 2020 Conference Paper

Arbitrary Talking Face Generation via Attentional Audio-Visual Coherence Learning

  • Hao Zhu
  • Huaibo Huang
  • Yi Li
  • Aihua Zheng
  • Ran He

Talking face generation aims to synthesize a face video with precise lip synchronization as well as a smooth transition of facial motion over the entire video via the given speech clip and facial image. Most existing methods mainly focus on either disentangling the information in a single image or learning temporal information between frames. However, cross-modality coherence between audio and video information has not been well addressed during synthesis. In this paper, we propose a novel arbitrary talking face generation framework by discovering the audio-visual coherence via the proposed Asymmetric Mutual Information Estimator (AMIE). In addition, we propose a Dynamic Attention (DA) block by selectively focusing the lip area of the input image during the training stage, to further enhance lip synchronization. Experimental results on benchmark LRW dataset and GRID dataset transcend the state-of-the-art methods on prevalent metrics with robust high-resolution synthesizing on gender and pose variations.

AAAI Conference 2019 Conference Paper

Disentangled Variational Representation for Heterogeneous Face Recognition

  • Xiang Wu
  • Huaibo Huang
  • Vishal M. Patel
  • Ran He
  • Zhenan Sun

Visible (VIS) to near infrared (NIR) face matching is a challenging problem due to the significant domain discrepancy between the domains and a lack of sufficient data for training cross-modal matching algorithms. Existing approaches attempt to tackle this problem by either synthesizing visible faces from NIR faces, extracting domain-invariant features from these modalities, or projecting heterogeneous data onto a common latent space for cross-modal matching. In this paper, we take a different approach in which we make use of the Disentangled Variational Representation (DVR) for cross-modal matching. First, we model a face representation with an intrinsic identity information and its within-person variations. By exploring the disentangled latent variable space, a variational lower bound is employed to optimize the approximate posterior for NIR and VIS representations. Second, aiming at obtaining more compact and discriminative disentangled latent space, we impose a minimization of the identity information for the same subject and a relaxed correlation alignment constraint between the NIR and VIS modality variations. An alternative optimization scheme is proposed for the disentangled variational representation part and the heterogeneous face recognition network part. The mutual promotion between these two parts effectively reduces the NIR and VIS domain discrepancy and alleviates over-fitting. Extensive experiments on three challenging NIR-VIS heterogeneous face recognition databases demonstrate that the proposed method achieves significant improvements over the state-of-the-art methods.

NeurIPS Conference 2019 Conference Paper

Dual Variational Generation for Low Shot Heterogeneous Face Recognition

  • Chaoyou Fu
  • Xiang Wu
  • Yibo Hu
  • Huaibo Huang
  • Ran He

Heterogeneous Face Recognition (HFR) is a challenging issue because of the large domain discrepancy and a lack of heterogeneous data. This paper considers HFR as a dual generation problem, and proposes a novel Dual Variational Generation (DVG) framework. It generates large-scale new paired heterogeneous images with the same identity from noise, for the sake of reducing the domain gap of HFR. Specifically, we first introduce a dual variational autoencoder to represent a joint distribution of paired heterogeneous images. Then, in order to ensure the identity consistency of the generated paired heterogeneous images, we impose a distribution alignment in the latent space and a pairwise identity preserving in the image space. Moreover, the HFR network reduces the domain discrepancy by constraining the pairwise feature distances between the generated paired heterogeneous images. Extensive experiments on four HFR databases show that our method can significantly improve state-of-the-art results. When using the generated paired images for training, our method gains more than 18% True Positive Rate improvements over the baseline model when False Positive Rate is at $10^{-5}$.

AAAI Conference 2019 Conference Paper

Geometry-Aware Face Completion and Editing

  • Linsen Song
  • Jie Cao
  • Lingxiao Song
  • Yibo Hu
  • Ran He

Face completion is a challenging generation task because it requires generating visually pleasing new pixels that are semantically consistent with the unmasked face region. This paper proposes a geometry-aware Face Completion and Editing NETwork (FCENet) by systematically studying facial geometry from the unmasked region. Firstly, a facial geometry estimator is learned to estimate facial landmark heatmaps and parsing maps from the unmasked face image. Then, an encoder-decoder structure generator serves to complete a face image and disentangle its mask areas conditioned on both the masked face image and the estimated facial geometry images. Besides, since a low-rank property exists in manually labeled masks, a low-rank regularization term is imposed on the disentangled masks, enforcing our completion network to manage occlusion areas of various shapes and sizes. Furthermore, our network can generate diverse results from the same masked input by modifying estimated facial geometry, which provides a flexible means to edit the completed face appearance. Extensive experimental results qualitatively and quantitatively demonstrate that our network is able to generate visually pleasing face completion results and edit face attributes as well.

IJCAI Conference 2019 Conference Paper

Neurons Merging Layer: Towards Progressive Redundancy Reduction for Deep Supervised Hashing

  • Chaoyou Fu
  • Liangchen Song
  • Xiang Wu
  • Guoli Wang
  • Ran He

Deep supervised hashing has become an active topic in information retrieval. It generates hashing bits by the output neurons of a deep hashing network. During binary discretization, there often exists much redundancy between hashing bits that degenerates retrieval performance in terms of both storage and accuracy. This paper proposes a simple yet effective Neurons Merging Layer (NMLayer) for deep supervised hashing. A graph is constructed to represent the redundancy relationship between hashing bits that is used to guide the learning of a hashing network. Specifically, it is dynamically learned by a novel mechanism defined in our active and frozen phases. According to the learned relationship, the NMLayer merges the redundant neurons together to balance the importance of each output neuron. Moreover, multiple NMLayers are progressively trained for a deep hashing network to learn a more compact hashing code from a long redundant code. Extensive experiments on four datasets demonstrate that our proposed method outperforms state-of-the-art hashing methods.

IJCAI Conference 2019 Conference Paper

Pedestrian Attribute Recognition by Joint Visual-semantic Reasoning and Knowledge Distillation

  • Qiaozhe Li
  • Xin Zhao
  • Ran He
  • Kaiqi Huang

Pedestrian attribute recognition in surveillance is a challenging task in computer vision due to significant pose variation, viewpoint change and poor image quality. To achieve effective recognition, this paper presents a graph-based global reasoning framework to jointly model potential visual-semantic relations of attributes and distill auxiliary human parsing knowledge to guide the relational learning. The reasoning framework models attribute groups on a graph and learns a projection function to adaptively assign local visual features to the nodes of the graph. After feature projection, graph convolution is utilized to perform global reasoning between the attribute groups to model their mutual dependencies. Then, the learned node features are projected back to visual space to facilitate knowledge transfer. An additional regularization term is proposed by distilling human parsing knowledge from a pre-trained teacher model to enhance feature representations. The proposed framework is verified on three large scale pedestrian attribute datasets including PETA, RAP, and PA-100k. Experiments show that our method achieves state-of-the-art results.

IJCAI Conference 2019 Conference Paper

Pose-preserving Cross Spectral Face Hallucination

  • Junchi Yu
  • Jie Cao
  • Yi Li
  • Xiaofei Jia
  • Ran He

To narrow the inherent sensing gap in heterogeneous face recognition (HFR), recent methods have resorted to generative models and explored the "recognition via generation" framework. Even so, it remains a very challenging task to synthesize photo-realistic visible faces (VIS) from near-infrared (NIR) images, especially when paired training data are unavailable. We present an approach to avert the data misalignment problem and faithfully preserve pose, expression and identity information during cross-spectral face hallucination. At the pixel level, we introduce an unsupervised attention mechanism to warping that is jointly learned with the generator to derive pixel-wise correspondence from unaligned data. At the image level, an auxiliary generator is employed to facilitate the learning of mapping from NIR to VIS domain. At the domain level, we first apply the mutual information constraint to explicitly measure the correlation between domains and thus benefit synthesis. Extensive experiments on three heterogeneous face datasets demonstrate that our approach not only outperforms current state-of-the-art HFR methods but also produces visually appealing results at a high resolution.

AAAI Conference 2019 Conference Paper

Visual-Semantic Graph Reasoning for Pedestrian Attribute Recognition

  • Qiaozhe Li
  • Xin Zhao
  • Ran He
  • Kaiqi Huang

Pedestrian attribute recognition in surveillance is a challenging task due to poor image quality, significant appearance variations and diverse spatial distribution of different attributes. This paper treats pedestrian attribute recognition as a sequential attribute prediction problem and proposes a novel visual-semantic graph reasoning framework to address this problem. Our framework contains a spatial graph and a directed semantic graph. By performing reasoning using the Graph Convolutional Network (GCN), one graph captures spatial relations between regions and the other learns potential semantic relations between attributes. An end-to-end architecture is presented to perform mutual embedding between these two graphs to guide the relational learning for each other. We verify the proposed framework on three large scale pedestrian attribute datasets including PETA, RAP, and PA-100k. Experiments show superiority of the proposed method over state-of-the-art methods and effectiveness of our joint GCN structures for sequential attribute prediction.

AAAI Conference 2018 Conference Paper

Adversarial Discriminative Heterogeneous Face Recognition

  • Lingxiao Song
  • Man Zhang
  • Xiang Wu
  • Ran He

The gap between sensing patterns of different face modalities remains a challenging problem in heterogeneous face recognition (HFR). This paper proposes an adversarial discriminative feature learning framework to close the sensing gap via adversarial learning on both raw-pixel space and compact feature space. This framework integrates cross-spectral face hallucination and discriminative feature learning into an end-to-end adversarial network. In the pixel space, we make use of generative adversarial networks to perform cross-spectral face hallucination. An elaborate two-path model is introduced to alleviate the lack of paired images, which gives consideration to both global structures and local textures. In the feature space, an adversarial loss and a high-order variance discrepancy loss are employed to measure the global and local discrepancy between two heterogeneous distributions respectively. These two losses enhance domain-invariant feature learning and modality-independent noise removal. Experimental results on three NIR-VIS databases show that our proposed approach outperforms state-of-the-art HFR methods, without requiring a complex network or a large-scale training dataset.

IJCAI Conference 2018 Conference Paper

An Appearance-and-Structure Fusion Network for Object Viewpoint Estimation

  • Yueying Kao
  • Weiming Li
  • Zairan Wang
  • Dongqing Zou
  • Ran He
  • Qiang Wang
  • Minsu Ahn
  • Sunghoon Hong

Automatic object viewpoint estimation from a single image is an important but challenging problem in the machine intelligence community. Although impressive performance has been achieved, current state-of-the-art methods still have difficulty dealing with the visual ambiguity and structure ambiguity in real-world images. To tackle these problems, this paper proposes a novel Appearance-and-Structure Fusion network, which we call ASFnet, that estimates viewpoint by fusing both appearance and structure information. The structure information is encoded by precise semantic keypoints and can help address the visual ambiguity. Meanwhile, distinguishable appearance features contribute to overcoming the structure ambiguity. Our ASFnet integrates an appearance path and a structure path into an end-to-end network and allows deep features to effectively share supervision from the two complementary aspects. A convolutional layer is learned to fuse the two path results adaptively. To balance the influence from the two supervision sources, a piecewise loss weight strategy is employed during training. Experimentally, our proposed network outperforms state-of-the-art methods on the public PASCAL 3D+ dataset, which verifies the effectiveness of our method and further corroborates the above proposition.

AAAI Conference 2018 Conference Paper

Anti-Makeup: Learning A Bi-Level Adversarial Network for Makeup-Invariant Face Verification

  • Yi Li
  • Lingxiao Song
  • Xiang Wu
  • Ran He
  • Tieniu Tan

Makeup is widely used to improve facial attractiveness and is well accepted by the public. However, different makeup styles will result in significant facial appearance changes. It remains a challenging problem to match makeup and non-makeup face images. This paper proposes a learning from generation approach for makeup-invariant face verification by introducing a bi-level adversarial network (BLAN). To alleviate the negative effects from makeup, we first generate non-makeup images from makeup ones, and then use the synthesized non-makeup images for further verification. Two adversarial networks in BLAN are integrated in an end-to-end deep network, with the one on pixel level for reconstructing appealing facial images and the other on feature level for preserving identity information. These two networks jointly reduce the sensing gap between makeup and non-makeup images. Moreover, we make the generator well constrained by incorporating multiple perceptual losses. Experimental results on three benchmark makeup face datasets demonstrate that our method achieves state-of-the-art verification accuracy across makeup status and can produce photo-realistic non-makeup face images.

AAAI Conference 2018 Conference Paper

Coupled Deep Learning for Heterogeneous Face Recognition

  • Xiang Wu
  • Lingxiao Song
  • Ran He
  • Tieniu Tan

Heterogeneous face matching is a challenging issue in face recognition due to large domain difference as well as insufficient pairwise images in different modalities during training. This paper proposes a coupled deep learning (CDL) approach for heterogeneous face matching. CDL seeks a shared feature space in which the heterogeneous face matching problem can be approximately treated as a homogeneous face matching problem. The objective function of CDL mainly includes two parts. The first part contains a trace norm and a block-diagonal prior as relevance constraints, which not only make unpaired images from multiple modalities be clustered and correlated, but also regularize the parameters to alleviate overfitting. An approximate variational formulation is introduced to deal with the difficulties of optimizing the low-rank constraint directly. The second part contains a cross modal ranking among triplet domain specific images to maximize the margin for different identities and increase data for a small amount of training samples. Besides, an alternating minimization method is employed to iteratively update the parameters of CDL. Experimental results show that CDL achieves better performance on the challenging CASIA NIR-VIS 2.0 face recognition database, the IIIT-D Sketch database, the CUHK Face Sketch (CUFS), and the CUHK Face Sketch FERET (CUFSF), which significantly outperforms state-of-the-art heterogeneous face recognition methods.

AAAI Conference 2018 Conference Paper

Information-Theoretic Domain Adaptation Under Severe Noise Conditions

  • Wei Wang
  • Hao Wang
  • Zhi-Yong Ran
  • Ran He

Cross-domain data reconstruction methods derive a shared transformation across source and target domains. These methods usually make a specific assumption on noise, which exhibits limited ability when the target data are contaminated by different kinds of complex noise in practice. To enhance the robustness of domain adaptation under severe noise conditions, this paper proposes a novel reconstruction based algorithm in an information-theoretic setting. Specifically, benefiting from the theoretical property of correntropy, the proposed algorithm is distinguished with: detecting the contaminated target samples without making any specific assumption on noise; greatly suppressing the negative influence of noise on cross-domain transformation. Moreover, a relative entropy based regularization of the transformation is incorporated to avoid trivial solutions with the reaped theoretic advantages, i.e., non-negativity and scale-invariance. For optimization, a half-quadratic technique is developed to minimize the nonconvex information-theoretic objectives with explicitly guaranteed convergence. Experiments on two real-world domain adaptation tasks demonstrate the superiority of our method.

NeurIPS Conference 2018 Conference Paper

IntroVAE: Introspective Variational Autoencoders for Photographic Image Synthesis

  • Huaibo Huang
  • Zhihang Li
  • Ran He
  • Zhenan Sun
  • Tieniu Tan

We present a novel introspective variational autoencoder (IntroVAE) model for synthesizing high-resolution photographic images. IntroVAE is capable of self-evaluating the quality of its generated samples and improving itself accordingly. Its inference and generator models are jointly trained in an introspective way. On one hand, the generator is required to reconstruct the input images from the noisy outputs of the inference model as normal VAEs. On the other hand, the inference model is encouraged to classify between the generated and real samples while the generator tries to fool it as GANs. These two famous generative frameworks are integrated in a simple yet efficient single-stream architecture that can be trained in a single stage. IntroVAE preserves the advantages of VAEs, such as stable training and nice latent manifold. Unlike most other hybrid models of VAEs and GANs, IntroVAE requires no extra discriminators, because the inference model itself serves as a discriminator to distinguish between the generated and real samples. Experiments demonstrate that our method produces high-resolution photo-realistic images (e.g., CELEBA images at $1024^{2}$), which are comparable to or better than the state-of-the-art GANs.

NeurIPS Conference 2018 Conference Paper

Learning a High Fidelity Pose Invariant Model for High-resolution Face Frontalization

  • Jie Cao
  • Yibo Hu
  • Hongwen Zhang
  • Ran He
  • Zhenan Sun

Face frontalization refers to the process of synthesizing the frontal view of a face from a given profile. Due to self-occlusion and appearance distortion in the wild, it is extremely challenging to recover faithful results and preserve texture details in a high-resolution. This paper proposes a High Fidelity Pose Invariant Model (HF-PIM) to produce photographic and identity-preserving results. HF-PIM frontalizes the profiles through a novel texture warping procedure and leverages a dense correspondence field to bind the 2D and 3D surface spaces. We decompose the prerequisite of warping into dense correspondence field estimation and facial texture map recovering, which are both well addressed by deep networks. Different from those reconstruction methods relying on 3D data, we also propose Adversarial Residual Dictionary Learning (ARDL) to supervise facial texture map recovering with only monocular images. Exhaustive experiments on both controlled and uncontrolled environments demonstrate that the proposed method not only boosts the performance of pose-invariant face recognition but also dramatically improves high-resolution frontalization appearances.

NeurIPS Conference 2017 Conference Paper

Deep Supervised Discrete Hashing

  • Qi Li
  • Zhenan Sun
  • Ran He
  • Tieniu Tan

With the rapid growth of image and video data on the web, hashing has been extensively studied for image or video search in recent years. Benefiting from recent advances in deep learning, deep hashing methods have achieved promising results for image retrieval. However, there are some limitations of previous deep hashing methods (e.g., the semantic information is not fully exploited). In this paper, we develop a deep supervised discrete hashing algorithm based on the assumption that the learned binary codes should be ideal for classification. Both the pairwise label information and the classification information are used to learn the hash codes within one stream framework. We constrain the outputs of the last layer to be binary codes directly, which is rarely investigated in deep hashing algorithm. Because of the discrete nature of hash codes, an alternating minimization method is used to optimize the objective function. Experimental results have shown that our method outperforms current state-of-the-art methods on benchmark datasets.

AAAI Conference 2017 Conference Paper

Learning Invariant Deep Representation for NIR-VIS Face Recognition

  • Ran He
  • Xiang Wu
  • Zhenan Sun
  • Tieniu Tan

Visual versus near infrared (VIS-NIR) face recognition is still a challenging heterogeneous task due to large appearance difference between VIS and NIR modalities. This paper presents a deep convolutional network approach that uses only one network to map both NIR and VIS images to a compact Euclidean space. The low-level layers of this network are trained only on large-scale VIS data. Each convolutional layer is implemented by the simplest case of maxout operator. The high-level layer is divided into two orthogonal subspaces that contain modality-invariant identity information and modality-variant spectrum information respectively. Our joint formulation leads to an alternating minimization approach for deep representation at the training time and an efficient computation for heterogeneous data at the testing time. Experimental evaluations show that our method achieves 94% verification rate at FAR=0.1% on the challenging CASIA NIR-VIS 2.0 face recognition dataset. Compared with state-of-the-art methods, it reduces the error rate by 58% only with a compact 64-D representation.

AAAI Conference 2017 Conference Paper

Self-Paced Learning: An Implicit Regularization Perspective

  • Yanbo Fan
  • Ran He
  • Jian Liang
  • Baogang Hu

Self-paced learning (SPL) mimics the cognitive mechanism of humans and animals that gradually learns from easy to hard samples. One key issue in SPL is to obtain a better weighting strategy that is determined by the minimizer function. Existing methods usually pursue this by artificially designing the explicit form of the SPL regularizer. In this paper, we study a group of new regularizers (named self-paced implicit regularizers) that are deduced from robust loss functions. Based on the convex conjugacy theory, the minimizer function for a self-paced implicit regularizer can be directly learned from the latent loss function, while the analytic form of the regularizer can even be unknown. A general framework (named SPL-IR) for SPL is developed accordingly. We demonstrate that the learning procedure of SPL-IR is associated with latent robust loss functions, and thus can provide some theoretical insights into its working mechanism. We further analyze the relation between SPL-IR and half-quadratic optimization and provide a group of self-paced implicit regularizers. Finally, we apply SPL-IR to both supervised and unsupervised tasks, and experimental results corroborate our ideas and demonstrate the correctness and effectiveness of implicit regularizers.

AAAI Conference 2016 Conference Paper

Discriminative Analysis Dictionary Learning

  • Jun Guo
  • Yanqing Guo
  • Xiangwei Kong
  • Man Zhang
  • Ran He

Dictionary learning (DL) has been successfully applied to various pattern classification tasks in recent years. However, analysis dictionary learning (ADL), as a major branch of DL, has not yet been fully exploited in classification due to its poor discriminability. This paper presents a novel DL method, namely Discriminative Analysis Dictionary Learning (DADL), to improve the classification performance of ADL. First, a code consistent term is integrated into the basic analysis model to improve discriminability. Second, a triplet-constraint-based local topology preserving loss function is introduced to capture the discriminative geometrical structures embedded in data. Third, correntropy induced metric is employed as a robust measure to better control outliers for classification. Then, half-quadratic minimization and alternate search strategy are used to speed up the optimization process so that there exist closed-form solutions in each alternating minimization stage. Experiments on several commonly used databases show that our proposed method not only significantly improves the discriminative ability of ADL, but also outperforms state-of-the-art synthesis DL methods.

IJCAI Conference 2016 Conference Paper

Group-Invariant Cross-Modal Subspace Learning

  • Jian Liang
  • Ran He
  • Zhenan Sun
  • Tieniu Tan

Cross-modal learning tries to find various types of heterogeneous data (e.g., image) from a given query (e.g., text). Most cross-modal algorithms heavily rely on semantic labels and benefit from a semantic-preserving aggregation of pairs of heterogeneous data. However, the semantic labels are not readily obtained in many real-world applications. This paper studies the aggregation of these pairs unsupervisedly. Apart from lower pairwise correspondences that force the data from one pair to be close to each other, we propose a novel concept, referred as groupwise correspondences, supposing that each paired heterogeneous data are from an identical latent group. We incorporate this groupwise correspondences into canonical correlation analysis (CCA) model, and seek a latent common subspace where data are naturally clustered into several latent groups. To simplify this nonconvex and nonsmooth problem, we introduce a non-negative orthogonal variable to represent the soft group membership, then two coupled computationally efficient subproblems (a generalized ratio-trace problem and a non-negative problem) are alternatively minimized to guarantee the proposed algorithm converges locally. Experimental results on two benchmark datasets demonstrate that the proposed unsupervised algorithm even achieves comparable performance to some state-of-the-art supervised cross-modal algorithms.

AAAI Conference 2016 Conference Paper

Simultaneous Feature and Sample Reduction for Image-Set Classification

  • Man Zhang
  • Ran He
  • Dong Cao
  • Zhenan Sun
  • Tieniu Tan

Image-set classification is the assignment of a label to a given image set. In real-life scenarios such as surveillance videos, each image set often contains much redundancy in terms of features and samples. This paper introduces a joint learning method for image-set classification that simultaneously learns compact binary codes and removes redundant samples. The joint objective function of our model mainly includes two parts. The first part seeks a hashing function to generate binary codes that have larger inter-class and smaller intra-class distances. The second one reduces redundant samples with discrete constraints in a low-rank way. A kernel method based on anchor points is further used to reduce sample variations. The proposed discrete objective function is simplified to a series of sub-problems that admit an analytical solution, resulting in a high-quality discrete solution with a low computational cost. Experiments on three commonly used image-set datasets show that the proposed method for the tasks of face recognition from image sets is efficient and effective.

AAAI Conference 2010 Conference Paper

Two-Stage Sparse Representation for Robust Recognition on Large-Scale Database

  • Ran He
  • Baogang Hu
  • Wei-Shi Zheng
  • Yanqing Guo

This paper proposes a novel robust sparse representation method, called the two-stage sparse representation (TSR), for robust recognition on a large-scale database. Based on the divide and conquer strategy, TSR divides the procedure of robust recognition into outlier detection stage and recognition stage. In the first stage, a weighted linear regression is used to learn a metric in which noise and outliers in image pixels are detected. In the second stage, based on the learnt metric, the large-scale dataset is firstly filtered into a small set according to the nearest neighbor criterion. Then a sparse representation is computed by the non-negative least squares technique. The sparse solution is unique and can be optimized efficiently. The extensive numerical experiments on several public databases demonstrate that the proposed TSR approach generally obtains better classification accuracy than the state-of-the-art Sparse Representation Classification (SRC). At the same time, by using the TSR, a significant reduction of computational cost is reached by over fifty times in comparison with the SRC, which enables the TSR to be deployed more suitably for large-scale dataset.