Author name cluster

Peng Hu

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

29 papers

2 author rows

AAAI Conference 2026 Conference Paper

Robust Semi-paired Multimodal Learning for Cross-modal Retrieval

Yang Qin
Yuan Sun
Xi Peng
Dezhong Peng
Joey Tianyi Zhou
Xiaomin Song
Peng Hu

Cross-modal retrieval is a fundamental application of multi-modal learning that has achieved remarkable success with large-scale well-paired data. However, in practice, it is costly to collect large-scale well-paired data. To alleviate the dependence on the amount of paired data, in this paper, we study a practical learning paradigm: semi-paired cross-modal learning (SPL), which utilizes both a small amount of paired data and a large amount of unpaired data to enhance cross-modal learning directly and is more accessible in practice. To achieve this, we take image-text retrieval as an example and propose a novel Robust Cross-modal Semi-paired Learning method (RCSL) by addressing two challenges. To be specific, i) to overcome the under-optimization issue caused by too little paired data, we present Semi-paired Discriminative Learning (SDL) to fully learn visual-semantic associations from a small amount of image-text pairs by preserving the alignment and uniformity of modality representations. ii) To mine visual-semantic correspondences from unpaired data, RCSL first constructs pseudo-paired correlations across different modalities by nearest neighbor association. However, this may introduce noisy correspondences (NCs) due to inaccurate pseudo signals, which could degrade the model's performance. To tackle NCs, we devise Robust Cross-correlation Mining (RCM) based on the risk minimization criterion to robustly and explicitly learn visual-semantic associations from pseudo-paired data, thus boosting cross-modal learning. Finally, we conduct extensive experiments on four datasets, i.e., three widely used benchmark datasets of Flickr30K, MS-COCO, CC152K, and a newly constructed real-world dataset Drone-SP, to demonstrate the effectiveness of RCSL under semi-paired and noisy settings.

PDF Details DOI

TMLR Journal 2025 Journal Article

Beyond Instance Consistency: Investigating View Diversity in Self-supervised Learning

Huaiyuan Qin
Muli Yang
Siyuan Hu
Peng Hu
Yu Zhang
Chen Gong
Hongyuan Zhu

Self-supervised learning (SSL) conventionally relies on the instance consistency paradigm, assuming that different views of the same image can be treated as positive pairs. However, this assumption breaks down for non-iconic data, where different views may contain distinct objects or semantic information. In this paper, we investigate the effectiveness of SSL when instance consistency is not guaranteed. Through extensive ablation studies, we demonstrate that SSL can still learn meaningful representations even when positive pairs lack strict instance consistency. Furthermore, our analysis further reveals that increasing view diversity, by enforcing zero overlapping or using smaller crop scales, can enhance downstream performance on classification and dense prediction tasks. However, excessive diversity is found to reduce effectiveness, suggesting an optimal range for view diversity. To quantify this, we adopt the Earth Mover’s Distance (EMD) as an estimator to measure mutual information between views, finding that moderate EMD values correlate with improved SSL learning, providing insights for future SSL framework design. We validate our findings across a range of settings, highlighting their robustness and applicability on diverse data sources.

PDF Details

NeurIPS Conference 2025 Conference Paper

Conditional Representation Learning for Customized Tasks

Honglin Liu
Chao Sun
Peng Hu
Yunfan Li
Xi Peng

Conventional representation learning methods learn a universal representation that primarily captures dominant semantics, which may not always align with customized downstream tasks. For instance, in animal habitat analysis, researchers prioritize scene-related features, whereas universal embeddings emphasize categorical semantics, leading to suboptimal results. As a solution, existing approaches resort to supervised fine-tuning, which however incurs high computational and annotation costs. In this paper, we propose Conditional Representation Learning (CRL), aiming to extract representations tailored to arbitrary user-specified criteria. Specifically, we reveal that the semantics of a space are determined by its basis, thereby enabling a set of descriptive words to approximate the basis for a customized feature space. Building upon this insight, given a user-specified criterion, CRL first employs a large language model (LLM) to generate descriptive texts to construct the semantic basis, then projects the image representation into this conditional feature space leveraging a vision-language model (VLM). The conditional representation better captures semantics for the specific criterion, which could be utilized for multiple customized tasks. Extensive experiments on classification and retrieval tasks demonstrate the superiority and generality of the proposed CRL. The code is available at https: //github. com/XLearning-SCU/2025-NeurIPS-CRL.

PDF Details

AAAI Conference 2025 Conference Paper

Deep Evidential Hashing for Trustworthy Cross-Modal Retrieval

Yuan Li
Liangli Zhen
Yuan Sun
Dezhong Peng
Xi Peng
Peng Hu

Cross-modal hashing provides an efficient solution for retrieval tasks across various modalities, such as images and text. However, most existing methods are deterministic models, which overlook the reliability associated with the retrieved results. This omission renders them unreliable for determining matches between data pairs based solely on Hamming distance. To bridge the gap, in this paper, we propose a novel method called Deep Evidential Cross-modal Hashing (DECH). This method equips hashing models with the ability to quantify the reliability level of the association between a query sample and each corresponding retrieved sample, bringing a new dimension of reliability to the cross-modal retrieval process. To achieve this, our method addresses two key challenges: i) To leverage evidential theory in guiding the model to learn hash codes, we design a novel evidence acquisition module to collect evidence and place the evidence captured by hash codes on a Beta distribution to derive a binomial opinion. Unlike existing evidential learning approaches that rely on classifiers, our method collects evidence directly through hash codes. ii) To tackle the task-oriented challenge, we first introduce a method to update the derived binomial opinion, allowing it to present the uncertainty caused by conflicting evidence. Following this manner, we present a strategy to precisely evaluate the reliability level of retrieved results, culminating in performance improvement. We validate the efficacy of our DECH through extensive experimentation on four benchmark datasets. The experimental results demonstrate our superior performance compared to 12 state-of-the-art methods.

PDF Details DOI

IJCAI Conference 2025 Conference Paper

Disentangling Multi-view Representations via Curriculum Learning with Learnable Prior

Kai Guo
Jiedong Wang
Xi Peng
Peng Hu
Hao Wang

Multi-view representation learning methods typically follow a consistent-and-specific pipeline that aims at extracting latent representations for an entity from its multiple observable views to facilitate downstream tasks. However, most of them overlook the complex underlying correlation between different views. To solve this issue, we delve into a well-known property of neural networks (NNs) that NNs tend to learn simple patterns first and then hard ones. In our case, view-consistent representations are simple patterns and view-specific representations are hard. To this end, we propose to disentangle view-consistency and view-specificity and learn them gradually. Specifically, we devise a novel curriculum learning approach that adjusts the whole model to learn view-consistent representations first and then progressively view-specific representations. Besides, we saddle each view with a learnable prior that allows each view-specific representation to appropriate its distribution. Moreover, we incorporate a mixture-of-experts layer and a disentangling module to further enhance the quality of the learned representations. Extensive experiments on five real-world datasets show that the proposed model outperforms its counterparts markedly. The code is available at https: //github. com/XLearning-SCU/2025-IJCAI-CL2P.

PDF Details DOI

NeurIPS Conference 2025 Conference Paper

Interactive Cross-modal Learning for Text-3D Scene Retrieval

Yanglin Feng
Yongxiang Li
Yuan Sun
Yang Qin
Dezhong Peng
Peng Hu

Text-3D Scene Retrieval (T3SR) aims to retrieve relevant scenes using linguistic queries. Although traditional T3SR methods have made significant progress in capturing fine-grained associations, they implicitly assume that query descriptions are information-complete. In practical deployments, however, limited by the capabilities of users and models, it is difficult or even impossible to directly obtain a perfect textual query suiting the entire scene and model, thereby leading to performance degradation. To address this issue, we propose a novel Interactive Text-3D Scene Retrieval Method (IDeal), which promotes the enhancement of the alignment between texts and 3D scenes through continuous interaction. To achieve this, we present an Interactive Retrieval Refinement framework (IRR), which employs a questioner to pose contextually relevant questions to an answerer in successive rounds that either promote detailed probing or encourage exploratory divergence within scenes. Upon the iterative responses received from the answerer, IRR adopts a retriever to perform both feature-level and semantic-level information fusion, facilitating scene-level interaction and understanding for more precise re-rankings. To bridge the domain gap between queries and interactive texts, we propose an Interaction Adaptation Tuning strategy (IAT). IAT mitigates the discriminability and diversity risks among augmented text features that approximate the interaction text domain, achieving contrastive domain adaptation for our retriever. Extensive experimental results on three datasets demonstrate the superiority of IDeal. Code is available at https: //github. com/Yangl1nFeng/IDeal.

PDF Details

IJCAI Conference 2025 Conference Paper

Learning Robust Multi-view Representation Using Dual-masked VAEs

Jiedong Wang
Kai Guo
Peng Hu
Xi Peng
Hao Wang

Most existing multi-view representation learning methods assume view-completeness and noise-free data. However, such assumptions are strong in real-world applications. Despite advances in methods tailored to view-missing or noise problems individually, a one-size-fits-all approach that concurrently addresses both remains unavailable. To this end, we propose a holistic method, called Dual-masked Variational Autoencoders (DualVAE), which aims at learning robust multi-view representation. The DualVAE exhibits an innovative amalgamation of dual-masked prediction, mixture-of-experts learning, representation disentangling, and a joint loss function in wrapping up all components. The key novelty lies in the dual-masked (view-mask and patch-mask) mechanism to mimic missing views and noisy data. Extensive experiments on four multi-view datasets show the effectiveness of the proposed method and its superior performance in comparison to baselines. The code is available at https: //github. com/XLearning-SCU/2025-IJCAI-DualVAE.

PDF Details DOI

NeurIPS Conference 2025 Conference Paper

Learning Source-Free Domain Adaptation for Visible-Infrared Person Re-Identification

Yongxiang Li
Yanglin Feng
Yuan Sun
Dezhong Peng
Xi Peng
Peng Hu

In this paper, we investigate source-free domain adaptation (SFDA) for visible-infrared person re-identification (VI-ReID), aiming to adapt a pre-trained source model to an unlabeled target domain without access to source data. To address this challenging setting, we propose a novel learning paradigm, termed Source-Free Visible-Infrared Person Re-Identification (SVIP), which fully exploits the prior knowledge embedded in the source model to guide target domain adaptation. The proposed framework comprises three key components specifically designed for the source-free scenario: 1) a Source-Guided Contrastive Learning (SGCL) module, which leverages the discriminative feature space of the frozen source model as a reference to perform contrastive learning on the unlabeled target data, thereby preserving discrimination without requiring source samples; 2) a Residual Transfer Learning (RTL) module, which learns residual mappings to adapt the target model’s representations while maintaining the knowledge from the source model; and 3) a Structural Consistency-Guided Cross-modal Alignment (SCCA) module, which enforces reciprocal structural constraints between visible and infrared modalities to identify reliable cross-modal pairs and achieve robust modality alignment without source supervision. Extensive experiments on benchmark datasets demonstrate that SVIP substantially enhances target domain performance and outperforms existing unsupervised VI-ReID methods under source-free settings.

PDF Details

IJCAI Conference 2025 Conference Paper

Probabilistic Multimodal Learning with von Mises-Fisher Distributions

Peng Hu
Yang Qin
Yuanbiao Gou
Yunfan Li
Mouxing Yang
Xi Peng

Multimodal learning is pivotal for the advancement of artificial intelligence, enabling machines to integrate complementary information from diverse data sources for holistic perception and understanding. Despite significant progress, existing methods struggle with challenges such as noisy inputs, noisy correspondence, and the inherent uncertainty of multimodal data, limiting their reliability and robustness. To address these issues, this paper presents a novel Probabilistic Multimodal Learning framework (PML) that models each data point as a von Mises-Fisher (vMF) distribution, effectively capturing intrinsic uncertainty and enabling robust fusion. Unlike traditional Gaussian-based models, PML learns directional representation with a concentration parameter to quantify reliability directly, enhancing stability and interpretability. To enhance discrimination, we propose a von Mises-Fisher Prototypical Contrastive Learning paradigm (vMF-PCL), which projects data onto a hypersphere by pulling within-class samples closer to their class prototype while pushing between-class prototypes apart, adaptively learning the reliability estimations. Building upon the estimated reliability, we develop a Reliable Multimodal Fusion mechanism (RMF) that dynamically adjusts the contribution and conflict of each modality, ensuring robustness against noisy data, noisy correspondence, and uncertainty. Extensive experiments on nine benchmarks demonstrate the superiority of PML, consistently outperforming 14 state-of-the-art methods. Code is available at https: //github. com/XLearning-SCU/2025-IJCAI-PML.

PDF Details DOI

NeurIPS Conference 2025 Conference Paper

Robust Cross-modal Alignment Learning for Cross-Scene Spatial Reasoning and Grounding

Yanglin Feng
Hongyuan Zhu
Dezhong Peng
Xi Peng
Xiaomin Song
Peng Hu

Grounding target objects in 3D environments via natural language is a fundamental capability for autonomous agents to successfully fulfill user requests. Almost all existing works typically assume that the target object lies within a known scene and focus solely on in-scene localization. In practice, however, agents often encounter unknown or previously visited environments and need to search across a large archive of scenes to ground the described object, thereby invalidating this assumption. To address this, we reveal a novel task called Cross-Scene Spatial Reasoning and Grounding (CSSRG), which aims to locate a described object anywhere across an entire collection of 3D scenes rather than predetermined scenes. Due to the difference from existing 3D visual grounding, CSSRG poses two challenges: the prohibitive cost of exhaustively traversing all scenes and more complex cross-modal spatial alignment. To address the challenges, we propose a Cross-Scene 3D Object Reasoning Framework (CoRe), which adopts a matching-then-grounding pipeline to reduce computational overhead. Specifically, CoRe consists of i) a Robust Text-Scene Aligning (RTSA) module that learns global scene representations for robust alignment between object descriptions and the corresponding 3D scenes, enabling efficient retrieval of candidate scenes; and ii) a Tailored Word-Object Associating (TWOA) module that establishes fine-grained alignment between words and target objects to filter out redundant context, supporting precise object-level reasoning and alignment. Additionally, to benchmark CSSRG, we construct a new CrossScene-RETR dataset and evaluation protocol tailored for cross-scene grounding. Extensive experiments across four multimodal datasets demonstrate that CoRe dramatically reduces computational overhead while showing superiority in both scene retrieval and object grounding. Code is available at https: //github. com/Yangl1nFeng/CoRe.

PDF Details

NeurIPS Conference 2024 Conference Paper

AverNet: All-in-one Video Restoration for Time-varying Unknown Degradations

Haiyu Zhao
Lei Tian
Xinyan Xiao
Peng Hu
Yuanbiao Gou
Xi Peng

Traditional video restoration approaches were designed to recover clean videos from a specific type of degradation, making them ineffective in handling multiple unknown types of degradation. To address this issue, several studies have been conducted and have shown promising results. However, these studies overlook that the degradations in video usually change over time, dubbed time-varying unknown degradations (TUD). To tackle such a less-touched challenge, we propose an innovative method, termed as All-in-one VidEo Restoration Network (AverNet), which comprises two core modules, i. e. , Prompt-Guided Alignment (PGA) module and Prompt-Conditioned Enhancement (PCE) module. Specifically, PGA addresses the issue of pixel shifts caused by time-varying degradations by learning and utilizing prompts to align video frames at the pixel level. To handle multiple unknown degradations, PCE recasts it into a conditional restoration problem by implicitly establishing a conditional map between degradations and ground truths. Thanks to the collaboration between PGA and PCE modules, AverNet empirically demonstrates its effectiveness in recovering videos from TUD. Extensive experiments are carried out on two synthesized datasets featuring seven types of degradations with random corruption levels. The code is available at https: //github. com/XLearning-SCU/2024-NeurIPS-AverNet.

PDF Details DOI

AAAI Conference 2024 Conference Paper

Decoupled Contrastive Multi-View Clustering with High-Order Random Walks

Yiding Lu
Yijie Lin
Mouxing Yang
Dezhong Peng
Peng Hu
Xi Peng

In recent, some robust contrastive multi-view clustering (MvC) methods have been proposed, which construct data pairs from neighborhoods to alleviate the false negative issue, i.e., some intra-cluster samples are wrongly treated as negative pairs. Although promising performance has been achieved by these methods, the false negative issue is still far from addressed and the false positive issue emerges because all in- and out-of-neighborhood samples are simply treated as positive and negative, respectively. To address the issues, we propose a novel robust method, dubbed decoupled contrastive multi-view clustering with high-order random walks (DIVIDE). In brief, DIVIDE leverages random walks to progressively identify data pairs in a global instead of local manner. As a result, DIVIDE could identify in-neighborhood negatives and out-of-neighborhood positives. Moreover, DIVIDE embraces a novel MvC architecture to perform inter- and intra-view contrastive learning in different embedding spaces, thus boosting clustering performance and embracing the robustness against missing views. To verify the efficacy of DIVIDE, we carry out extensive experiments on four benchmark datasets comparing with nine state-of-the-art MvC methods in both complete and incomplete MvC settings. The code is released on https://github.com/XLearning-SCU/2024-AAAI-DIVIDE.

PDF Details DOI

AAAI Conference 2024 Conference Paper

Dual Self-Paced Cross-Modal Hashing

Yuan Sun
Jian Dai
Zhenwen Ren
Yingke Chen
Dezhong Peng
Peng Hu

Cross-modal hashing~(CMH) is an efficient technique to retrieve relevant data across different modalities, such as images, texts, and videos, which has attracted more and more attention due to its low storage cost and fast query speed. Although existing CMH methods achieve remarkable processes, almost all of them treat all samples of varying difficulty levels without discrimination, thus leaving them vulnerable to noise or outliers. Based on this observation, we reveal and study dual difficulty levels implied in cross-modal hashing learning, \ie instance-level and feature-level difficulty. To address this problem, we propose a novel Dual Self-Paced Cross-Modal Hashing (DSCMH) that mimics human cognitive learning to learn hashing from ``easy'' to ``hard'' in both instance and feature levels, thereby embracing robustness against noise/outliers. Specifically, our DSCMH assigns weights to each instance and feature to measure their difficulty or reliability, and then uses these weights to automatically filter out the noisy and irrelevant data points in the original space. By gradually increasing the weights during training, our method can focus on more instances and features from ``easy'' to ``hard'' in training, thus mitigating the adverse effects of noise or outliers. Extensive experiments are conducted on three widely-used benchmark datasets to demonstrate the effectiveness and robustness of the proposed DSCMH over 12 state-of-the-art CMH methods.

PDF Details DOI

NeurIPS Conference 2024 Conference Paper

Interactive Deep Clustering via Value Mining

Honglin Liu
Peng Hu
Changqing Zhang
Yunfan Li
Xi Peng

In the absence of class priors, recent deep clustering methods resort to data augmentation and pseudo-labeling strategies to generate supervision signals. Though achieved remarkable success, existing works struggle to discriminate hard samples at cluster boundaries, mining which is particularly challenging due to their unreliable cluster assignments. To break such a performance bottleneck, we propose incorporating user interaction to facilitate clustering instead of exhaustively mining semantics from the data itself. To be exact, we present Interactive Deep Clustering (IDC), a plug-and-play method designed to boost the performance of pre-trained clustering models with minimal interaction overhead. More specifically, IDC first quantitatively evaluates sample values based on hardness, representativeness, and diversity, where the representativeness avoids selecting outliers and the diversity prevents the selected samples from collapsing into a small number of clusters. IDC then queries the cluster affiliations of high-value samples in a user-friendly manner. Finally, it utilizes the user feedback to finetune the pre-trained clustering model. Extensive experiments demonstrate that IDC could remarkably improve the performance of various pre-trained clustering models, at the expense of low user interaction costs. The code could be accessed at pengxi. me.

PDF Details DOI

AAAI Conference 2024 Conference Paper

PrefAce: Face-Centric Pretraining with Self-Structure Aware Distillation

Siyuan Hu
Zheng Wang
Peng Hu
Xi Peng
Jie Wu
Hongyuan Zhu
Yew Soon Ong

Video-based facial analysis is important for autonomous agents to understand human expressions and sentiments. However, limited labeled data is available to learn effective facial representations. This paper proposes a novel self-supervised face-centric pretraining framework, called PrefAce, which learns transferable video facial representation without labels. The self-supervised learning is performed with an effective landmark-guided global-local tube distillation. Meanwhile, a novel instance-wise update FaceFeat Cache is built to enforce more discriminative and diverse representations for downstream tasks. Extensive experiments demonstrate that the proposed framework learns universal instance-aware facial representations with fine-grained landmark details from videos. The point is that it can transfer across various facial analysis tasks, e.g., Facial Attribute Recognition (FAR), Facial Expression Recognition (FER), DeepFake Detection (DFD), and Lip Synchronization (LS). Our framework also outperforms the state-of-the-art on various downstream tasks, even in low data regimes. Code is available at https://github.com/siyuan-h/PrefAce.

PDF Details DOI

NeurIPS Conference 2024 Conference Paper

Robust Contrastive Multi-view Clustering against Dual Noisy Correspondence

Ruiming Guo
Mouxing Yang
Yijie Lin
Xi Peng
Peng Hu

Recently, contrastive multi-view clustering (MvC) has emerged as a promising avenue for analyzing data from heterogeneous sources, typically leveraging the off-the-shelf instances as positives and randomly sampled ones as negatives. In practice, however, this paradigm would unavoidably suffer from the Dual Noisy Correspondence (DNC) problem, where noise compromises the constructions of both positive and negative pairs. Specifically, the complexity of data collection and transmission might mistake some unassociated pairs as positive (namely, false positive correspondence), while the intrinsic one-to-many contrast nature of contrastive MvC would sample some intra-cluster samples as negative (namely, false negative correspondence). To handle this daunting problem, we propose a novel method, dubbed Contextually-spectral based correspondence refinery (CANDY). CANDY dexterously exploits inter-view similarities as \textit{context} to uncover false negatives. Furthermore, it employs a spectral-based module to denoise correspondence, alleviating the negative influence of false positives. Extensive experiments on five widely-used multi-view benchmarks, in comparison with eight competitive multi-view clustering methods, verify the effectiveness of our method in addressing the DNC problem. The code is available at https: //github. com/XLearning-SCU/2024-NeurIPS-CANDY.

PDF Details DOI

AAAI Conference 2023 Conference Paper

Correspondence-Free Domain Alignment for Unsupervised Cross-Domain Image Retrieval

Xu Wang
Dezhong Peng
Ming Yan
Peng Hu

Cross-domain image retrieval aims at retrieving images across different domains to excavate cross-domain classificatory or correspondence relationships. This paper studies a less-touched problem of cross-domain image retrieval, i.e., unsupervised cross-domain image retrieval, considering the following practical assumptions: (i) no correspondence relationship, and (ii) no category annotations. It is challenging to align and bridge distinct domains without cross-domain correspondence. To tackle the challenge, we present a novel Correspondence-free Domain Alignment (CoDA) method to effectively eliminate the cross-domain gap through In-domain Self-matching Supervision (ISS) and Cross-domain Classifier Alignment (CCA). To be specific, ISS is presented to encapsulate discriminative information into the latent common space by elaborating a novel self-matching supervision mechanism. To alleviate the cross-domain discrepancy, CCA is proposed to align distinct domain-specific classifiers. Thanks to the ISS and CCA, our method could encode the discrimination into the domain-invariant embedding space for unsupervised cross-domain image retrieval. To verify the effectiveness of the proposed method, extensive experiments are conducted on four benchmark datasets compared with six state-of-the-art methods.

PDF Details DOI

NeurIPS Conference 2023 Conference Paper

Cross-modal Active Complementary Learning with Self-refining Correspondence

Yang Qin
Yuan Sun
Dezhong Peng
Joey Tianyi Zhou
Xi Peng
Peng Hu

Recently, image-text matching has attracted more and more attention from academia and industry, which is fundamental to understanding the latent correspondence across visual and textual modalities. However, most existing methods implicitly assume the training pairs are well-aligned while ignoring the ubiquitous annotation noise, a. k. a noisy correspondence (NC), thereby inevitably leading to a performance drop. Although some methods attempt to address such noise, they still face two challenging problems: excessive memorizing/overfitting and unreliable correction for NC, especially under high noise. To address the two problems, we propose a generalized Cross-modal Robust Complementary Learning framework (CRCL), which benefits from a novel Active Complementary Loss (ACL) and an efficient Self-refining Correspondence Correction (SCC) to improve the robustness of existing methods. Specifically, ACL exploits active and complementary learning losses to reduce the risk of providing erroneous supervision, leading to theoretically and experimentally demonstrated robustness against NC. SCC utilizes multiple self-refining processes with momentum correction to enlarge the receptive field for correcting correspondences, thereby alleviating error accumulation and achieving accurate and stable corrections. We carry out extensive experiments on three image-text benchmarks, i. e. , Flickr30K, MS-COCO, and CC152K, to verify the superior robustness of our CRCL against synthetic and real-world noisy correspondences.

PDF Details

IJCAI Conference 2023 Conference Paper

Incomplete Multi-view Clustering via Prototype-based Imputation

Haobin Li
Yunfan Li
Mouxing Yang
Peng Hu
Dezhong Peng
Xi Peng

In this paper, we study how to achieve two characteristics highly-expected by incomplete multi-view clustering (IMvC). Namely, i) instance commonality refers to that within-cluster instances should share a common pattern, and ii) view versatility refers to that cross-view samples should own view-specific patterns. To this end, we design a novel dual-stream model which employs a dual attention layer and a dual contrastive learning loss to learn view-specific prototypes and model the sample-prototype relationship. When the view is missed, our model performs data recovery using the prototypes in the missing view and the sample-prototype relationship inherited from the observed view. Thanks to our dual-stream model, both cluster- and view-specific information could be captured, and thus the instance commonality and view versatility could be preserved to facilitate IMvC. Extensive experiments demonstrate the superiority of our method on five challenging benchmarks compared with 11 approaches. The code could be accessed from https: //pengxi. me.

PDF Details DOI

NeurIPS Conference 2022 Conference Paper

Multi-Scale Adaptive Network for Single Image Denoising

Yuanbiao Gou
Peng Hu
Jiancheng Lv
Joey Tianyi Zhou
Xi Peng

Multi-scale architectures have shown effectiveness in a variety of tasks thanks to appealing cross-scale complementarity. However, existing architectures treat different scale features equally without considering the scale-specific characteristics, \textit{i. e. }, the within-scale characteristics are ignored in the architecture design. In this paper, we reveal this missing piece for multi-scale architecture design and accordingly propose a novel Multi-Scale Adaptive Network (MSANet) for single image denoising. Specifically, MSANet simultaneously embraces the within-scale characteristics and the cross-scale complementarity thanks to three novel neural blocks, \textit{i. e. }, adaptive feature block (AFeB), adaptive multi-scale block (AMB), and adaptive fusion block (AFuB). In brief, AFeB is designed to adaptively preserve image details and filter noises, which is highly expected for the features with mixed details and noises. AMB could enlarge the receptive field and aggregate the multi-scale information, which meets the need of contextually informative features. AFuB devotes to adaptively sampling and transferring the features from one scale to another scale, which fuses the multi-scale features with varying characteristics from coarse to fine. Extensive experiments on both three real and six synthetic noisy image datasets show the superiority of MSANet compared with 12 methods. The code could be accessed from https: //github. com/XLearning-SCU/2022-NeurIPS-MSANet.

PDF Details

AAAI Conference 2021 Conference Paper

A Primal-Dual Online Algorithm for Online Matching Problem in Dynamic Environments

Yu-Hang Zhou
Peng Hu
Chen Liang
Huan Xu
Guangda Huzhang
Yinfu Feng
Qing Da
Xinshang Wang

Recently, the online matching problem has attracted much attention due to its wide application on real-world decisionmaking scenarios. In stationary environments, by adopting the stochastic user arrival model, existing methods are proposed to learn dual optimal prices and are shown to achieve a fast regret bound. However, the stochastic model is no longer a proper assumption when the environment is changing, leading to an optimistic method that may suffer poor performance. In this paper, we study the online matching problem in dynamic environments in which the dual optimal prices are allowed to vary over time. We bound the dynamic regret of online matching problem by the sum of two quantities, including a regret of online max-min problem and a dynamic regret of online convex optimization (OCO) problem. Then we propose a novel online approach named Primal-Dual Online Algorithm (PDOA) to minimize both quantities. In particular, PDOA adopts the primal-dual framework by optimizing dual prices with the online gradient descent (OGD) algorithm to eliminate the online max-min problem’s regret. Moreover, it maintains a set of OGD experts and combines them via an expert-tracking algorithm, which gives a sublinear dynamic regret bound for the OCO problem. We show that PDOA achieves an O(K p T(1 + PT )) dynamic regret where K is the number of resources, T is the number of iterations and PT is the path-length of any potential dual price sequence that reflects the dynamic environment. Finally, experiments on real applications exhibit the superiority of our approach.

PDF Details

ICLR Conference 2021 Conference Paper

BRECQ: Pushing the Limit of Post-Training Quantization by Block Reconstruction

Yuhang Li 0001
Ruihao Gong
Xu Tan
Yang Yang
Peng Hu
Qi Zhang
Fengwei Yu
Wei Wang 0059

We study the challenging task of neural network quantization without end-to-end retraining, called Post-training Quantization (PTQ). PTQ usually requires a small subset of training data but produces less powerful quantized models than Quantization-Aware Training (QAT). In this work, we propose a novel PTQ framework, dubbed BRECQ, which pushes the limits of bitwidth in PTQ down to INT2 for the first time. BRECQ leverages the basic building blocks in neural networks and reconstructs them one-by-one. In a comprehensive theoretical study of the second-order error, we show that BRECQ achieves a good balance between cross-layer dependency and generalization error. To further employ the power of quantization, the mixed precision technique is incorporated in our framework by approximating the inter-layer and intra-layer sensitivity. Extensive experiments on various handcrafted and searched neural architectures are conducted for both image classification and object detection tasks. And for the first time we prove that, without bells and whistles, PTQ can attain 4-bit ResNet and MobileNetV2 comparable with QAT and enjoy 240 times faster production of quantized models. Codes are available at https://github.com/yhhhli/BRECQ.

Details

AAAI Conference 2021 Conference Paper

Contrastive Clustering

Yunfan Li
Peng Hu
Zitao Liu
Dezhong Peng
Joey Tianyi Zhou
Xi Peng

In this paper, we propose an online clustering method called Contrastive Clustering (CC) which explicitly performs the instance- and cluster-level contrastive learning. To be specific, for a given dataset, the positive and negative instance pairs are constructed through data augmentations and then projected into a feature space. Therein, the instance- and cluster-level contrastive learning are respectively conducted in the row and column space by maximizing the similarities of positive pairs while minimizing those of negative ones. Our key observation is that the rows of the feature matrix could be regarded as soft labels of instances, and accordingly the columns could be further regarded as cluster representations. By simultaneously optimizing the instance- and cluster-level contrastive loss, the model jointly learns representations and cluster assignments in an end-to-end manner. Besides, the proposed method could timely compute the cluster assignment for each individual, even when the data is presented in streams. Extensive experimental results show that CC remarkably outperforms 17 competitive clustering methods on six challenging image benchmarks. In particular, CC achieves an NMI of 0. 705 (0. 431) on the CIFAR-10 (CIFAR-100) dataset, which is an up to 19% (39%) performance improvement compared with the best baseline. The code is available at https: //github. com/XLearning-SCU/2021-AAAI-CC.

PDF Details

AAAI Conference 2021 Conference Paper

OPQ: Compressing Deep Neural Networks with One-shot Pruning-Quantization

Peng Hu
Xi Peng
Hongyuan Zhu
Mohamed M. Sabry Aly
Jie Lin

As Deep Neural Networks (DNNs) usually are overparameterized and have millions of weight parameters, it is challenging to deploy these large DNN models on resourceconstrained hardware platforms, e. g. , smartphones. Numerous network compression methods such as pruning and quantization are proposed to reduce the model size significantly, of which the key is to find suitable compression allocation (e. g. , pruning sparsity and quantization codebook) of each layer. Existing solutions obtain the compression allocation in an iterative/manual fashion while finetuning the compressed model, thus suffering from the efficiency issue. Different from the prior art, we propose a novel One-shot Pruning- Quantization (OPQ) in this paper, which analytically solves the compression allocation with pre-trained weight parameters only. During finetuning, the compression module is fixed and only weight parameters are updated. To our knowledge, OPQ is the first work that reveals pre-trained model is sufficient for solving pruning and quantization simultaneously, without any complex iterative/manual optimization at the finetuning stage. Furthermore, we propose a unified channelwise quantization method that enforces all channels of each layer to share a common codebook, which leads to low bitrate allocation without introducing extra overhead brought by traditional channel-wise quantization. Comprehensive experiments on ImageNet with AlexNet/MobileNet-V1/ResNet-50 show that our method improves accuracy and training efficiency while obtains significantly higher compression rates compared to the state-of-the-art.

PDF Details

NeurIPS Conference 2020 Conference Paper

Partially View-aligned Clustering

Zhenyu Huang
Peng Hu
Joey Tianyi Zhou
Jiancheng Lv
Xi Peng

In this paper, we study one challenging issue in multi-view data clustering. To be specific, for two data matrices $\mathbf{X}^{(1)}$ and $\mathbf{X}^{(2)}$ corresponding to two views, we do not assume that $\mathbf{X}^{(1)}$ and $\mathbf{X}^{(2)}$ are fully aligned in row-wise. Instead, we assume that only a small portion of the matrices has established the correspondence in advance. Such a partially view-aligned problem (PVP) could lead to the intensive labor of capturing or establishing the aligned multi-view data, which has less been touched so far to the best of our knowledge. To solve this practical and challenging problem, we propose a novel multi-view clustering method termed partially view-aligned clustering (PVC). To be specific, PVC proposes to use a differentiable surrogate of the non-differentiable Hungarian algorithm and recasts it as a pluggable module. As a result, the category-level correspondence of the unaligned data could be established in a latent space learned by a neural network, while learning a common space across different views using the ``aligned'' data. Extensive experimental results show promising results of our method in clustering partially view-aligned data.

PDF Details

AAAI Conference 2020 Conference Paper

Semi-Supervised Multi-Modal Learning with Balanced Spectral Decomposition

Peng Hu
Hongyuan Zhu
Xi Peng
Jie Lin

Cross-modal retrieval aims to retrieve the relevant samples across different modalities, of which the key problem is how to model the correlations among different modalities while narrowing the large heterogeneous gap. In this paper, we propose a Semi-supervised Multimodal Learning Network method (SMLN) which correlates different modalities by capturing the intrinsic structure and discriminative correlation of the multimedia data. To be speciﬁc, the labeled and unlabeled data are used to construct a similarity matrix which integrates the cross-modal correlation, discrimination, and intra-modal graph information existing in the multimedia data. What is more important is that we propose a novel optimization approach to optimize our loss within a neural network which involves a spectral decomposition problem derived from a ratio trace criterion. Our optimization enjoys two advantages given below. On the one hand, the proposed approach is not limited to our loss, which could be applied to any case that is a neural network with the ratio trace criterion. On the other hand, the proposed optimization is different from existing ones which alternatively maximize the minor eigenvalues, thus overemphasizing the minor eigenvalues and ignore the dominant ones. In contrast, our method will exactly balance all eigenvalues, thus being more competitive to existing methods. Thanks to our loss and optimization strategy, our method could well preserve the discriminative and instinct information into the common space and embrace the scalability in handling large-scale multimedia data. To verify the effectiveness of the proposed method, extensive experiments are carried out on three widely-used multimodal datasets comparing with 13 state-of-the-art approaches.

PDF Details

IJCAI Conference 2019 Conference Paper

Hybrid Item-Item Recommendation via Semi-Parametric Embedding

Peng Hu
Rong Du
Yao Hu
Nan Li

Nowadays, item-item recommendation plays an important role in modern recommender systems. Traditionally, this is either solved by behavior-based collaborative filtering or content-based meth- ods. However, both kinds of methods often suffer from cold-start problems, or poor performance due to few behavior supervision; and hybrid methods which can leverage the strength of both kinds of methods are needed. In this paper, we propose a semi-parametric embedding framework for this problem. Specifically, the embedding of an item is composed of two parts, i. e. , the parametric part from content information and the non-parametric part designed to encode behavior information; meanwhile, a deep learning algorithm is proposed to learn two parts simultaneously. Extensive experiments on real-world datasets demonstrate the effectiveness and robustness of the proposed method.

PDF Details

AAAI Conference 2016 Conference Paper

Learning Expected Hitting Time Distance

De-Chuan Zhan
Peng Hu
Zui Chu
Zhi-Hua Zhou

Most distance metric learning (DML) approaches focus on learning a Mahalanobis metric for measuring distances between examples. However, for particular feature representations, e. g. , histogram features like BOW and SPM, Mahalanobis metric could not model the correlations between these features well. In this work, we deﬁne a non- Mahalanobis distance for histogram features, via Expected Hitting Time (EHT) of Markov Chain, which implicitly considers the high-order feature relationships between different histogram features. The EHT based distance is parameterized by transition probabilities of Markov Chain, we consequently propose a novel type of distance learning approach (LED, Learning Expected hitting time Distance) to learn appropriate transition probabilities for EHT based distance. We validate the effectiveness of LED on a series of realworld datasets. Moreover, experiments show that the learned transition probabilities are with good comprehensibility.

PDF Details

IROS Conference 2009 Conference Paper

Rapid and precise object detection based on color histograms and adaptive bandwidth mean shift

Xiaopeng Chen
Qiang Huang 0002
Peng Hu
Min Li 0015
Ye Tian 0024
Chen Li

Speed and precision are important for object detection algorithms. In this paper, a novel object detection algorithm based on color histogram and adaptive bandwidth mean shift is proposed. The algorithm is capable of detecting objects rapidly and precisely. It is composed of two stages: a rough detection stage and a precise detection stage. At the rough detection stage, histogram back projection and thresholding are applied to fast object identification and rough global localization. At the precise detection stage, the precise position, size and orientation are derived under the adaptive bandwidth mean shift framework. Experiments verify that the algorithm is able to detect the size, position and orientation of general objects rapidly and precisely.

Details