Arrow Research search

Author name cluster

Yuan Sun

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

27 papers
1 author row

Possible papers (27)

AAAI Conference 2026 Conference Paper

Ambiguity-Tolerant Cross-Modal Hashing with Partial Labels

  • Chao Su
  • Yanan Li
  • Xu Wang
  • Yingke Chen
  • Huiming Zheng
  • Dezhong Peng
  • Yuan Sun

Cross-modal hashing (CMH) has achieved remarkable success in large-scale cross-modal retrieval due to its low storage cost and high computational efficiency. However, most existing CMH methods rely on accurately annotated training data, which is often impractical in real-world applications due to the high cost and limited scalability of data annotation. In practice, annotators typically assign a candidate label set rather than a single precise label to each sample pair, resulting in partial labels with inherent ambiguity. Such ambiguous supervision poses significant challenges to conventional CMH methods that assume reliable and unambiguous labels. In this paper, we investigate a less-touched yet meaningful problem, i.e., cross-modal hashing with partial labels (PLCMH). PLCMH faces two major challenges: label ambiguity and modality-alignment barriers induced by misleading supervision. To address these issues, we propose a new approach named Ambiguity-Tolerant Cross-Modal Hashing (ATCH). Specifically, ATCH presents a Local Consensus Disambiguation (LCD) mechanism that resolves label ambiguity by effectively inferring stable and accurate label confidence based on local consensus within the Hamming space. Moreover, ATCH proposes a Confidence-Aware Contrastive Hashing (CACH) mechanism that derives both pseudo labels and trustworthiness scores from the label confidence vectors to learn discriminative hash codes, leading to effective modality alignment. Extensive experiments on three multimodal datasets demonstrate the superiority of ATCH.
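The abstract sketches LCD only at a high level; one concrete reading is that label confidence is smoothed over Hamming-space neighbours while keeping probability mass on each sample's candidate set. The NumPy toy below is our own minimal sketch of that idea, not the authors' implementation; every name and hyper-parameter (`local_consensus_confidence`, `k`, `iters`) is an assumption.

```python
import numpy as np

def local_consensus_confidence(hash_codes, candidate_masks, k=5, iters=3):
    """Toy local-consensus disambiguation: iteratively average label
    confidence over k Hamming-space neighbours, keeping probability
    mass only on each sample's candidate label set."""
    dist = (hash_codes[:, None, :] != hash_codes[None, :, :]).sum(-1)
    np.fill_diagonal(dist, dist.max() + 1)             # exclude self
    knn = np.argsort(dist, axis=1)[:, :k]              # k nearest in Hamming space

    conf = candidate_masks / candidate_masks.sum(1, keepdims=True)  # uniform init
    for _ in range(iters):
        neigh = conf[knn].mean(axis=1)                 # neighbour consensus
        conf = neigh * candidate_masks                 # zero out non-candidates
        conf /= conf.sum(1, keepdims=True) + 1e-12     # renormalise per sample
    return conf

codes = np.random.randint(0, 2, size=(100, 32))        # 100 samples, 32-bit codes
cands = (np.random.rand(100, 10) < 0.3).astype(float)  # candidate label sets
cands[np.arange(100), np.random.randint(0, 10, 100)] = 1.0  # ensure non-empty
print(local_consensus_confidence(codes, cands).shape)  # (100, 10)
```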

AAAI Conference 2026 Conference Paper

Neighbor-aware Instance Refining with Noisy Labels for Cross-Modal Retrieval

  • Yizhi Liu
  • Ruitao Pu
  • Shilin Xu
  • Yingke Chen
  • Quan-Hui Liu
  • Yuan Sun

In recent years, Cross-Modal Retrieval (CMR) has made significant progress in the field of multi-modal analysis. However, since it is time-consuming and labor-intensive to collect large-scale and well-annotated data, the annotation of multi-modal data inevitably contains some noise, which degrades the retrieval performance of the model. To tackle this problem, numerous robust CMR methods have been developed, including robust learning paradigms, label calibration strategies, and instance selection mechanisms. Unfortunately, they often fail to simultaneously achieve a high performance ceiling, reliable calibration, and full data utilization. To overcome these limitations, we propose a novel robust cross-modal learning framework, namely Neighbor-aware Instance Refining with Noisy Labels (NIRNL). Specifically, we first propose Cross-modal Margin Preserving (CMP) to adjust the relative distance between positive and negative pairs, thereby enhancing the discrimination between sample pairs. Then, we propose Neighbor-aware Instance Refining (NIR) to identify pure, hard, and noisy subsets through cross-modal neighborhood consensus. Afterward, we construct tailored optimization strategies for this fine-grained partitioning, thereby maximizing the utilization of all available data while mitigating error propagation. Extensive experiments on three benchmark datasets demonstrate that NIRNL achieves state-of-the-art performance, exhibiting remarkable robustness, especially under high noise rates.
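As a minimal illustration of the NIR idea (ours, not the paper's code), a pair can be rated by how often its cross-modal nearest neighbours carry the same, possibly noisy, label, then bucketed into pure/hard/noisy subsets; the thresholds `hi` and `lo` are assumed hyper-parameters.

```python
import numpy as np

def partition_by_neighbor_consensus(img_feat, txt_feat, labels, k=10, hi=0.8, lo=0.3):
    """Toy neighbour-consensus split: high label agreement -> pure,
    low agreement -> noisy, everything in between -> hard."""
    sim = img_feat @ txt_feat.T                       # cross-modal similarity
    knn = np.argsort(-sim, axis=1)[:, :k]             # k most similar texts per image
    agree = (labels[knn] == labels[:, None]).mean(1)  # label agreement rate
    pure = np.where(agree >= hi)[0]
    noisy = np.where(agree <= lo)[0]
    hard = np.setdiff1d(np.arange(len(labels)), np.union1d(pure, noisy))
    return pure, hard, noisy
```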

AAAI Conference 2026 Conference Paper

Neural Collapse Priors Driven Trust Semi-Supervised Multi-View Classification

  • Taotao Guo
  • Honglin Yuan
  • Xujian Zhao
  • Yuan Sun
  • Dongliang Wang
  • Zhenwen Ren
  • Xingfeng Li

In semi-supervised multi-view classification (SMVC), scarce labels and noisy unlabeled data impair feature aggregation and compromise prediction reliability, while existing methods lack principled guidance and interpretability. To overcome these limitations, we propose a novel unified SMVC framework, Neural Collapse Priors Driven Trust Semi-Supervised Multi-View Classification (NCPD-TSMVC), building upon neural collapse-derived prototype priors and evidential opinion fusion. Concretely, we rigorously prove under neural collapse theory that normalized classifier weights from the labeled-data pre-training stage coincide with class centroids in feature space, conferring maximal inter-class separation and optimal within-class compactness. These prototype priors permeate the entire learning pipeline, calibrating the representation learning of unlabeled samples to obtain highly discriminative embeddings. Simultaneously, our evidential learning module quantifies epistemic uncertainty and fuses view-level opinions at the evidence level, yielding robust and transparent decision making. Extensive evaluations across diverse benchmarks demonstrate that NCPD-TSMVC surpasses state-of-the-art SMVC approaches in performance, robustness, and interpretability.
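The claim that normalized classifier weights coincide with class centroids suggests a simple use of those weights as prototypes. The sketch below is our reconstruction of that step, not the paper's pipeline; `tau` and all names are assumptions.

```python
import numpy as np

def prototype_pseudo_labels(classifier_W, unlabeled_z, tau=0.1):
    """Treat L2-normalised classifier weights from the labelled
    pre-training stage as class prototypes and softly assign
    unlabelled embeddings to the nearest prototype."""
    protos = classifier_W / np.linalg.norm(classifier_W, axis=1, keepdims=True)
    z = unlabeled_z / np.linalg.norm(unlabeled_z, axis=1, keepdims=True)
    logits = z @ protos.T / tau                      # cosine similarity / temperature
    p = np.exp(logits - logits.max(1, keepdims=True))
    p /= p.sum(1, keepdims=True)
    return p.argmax(1), p.max(1)                     # pseudo-label and its confidence
```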

AAAI Conference 2026 Conference Paper

Revisiting Network Inertia: Dynamic Inertia Inhibition Coupled Multidimensional Periodicity for Infrared and Visible Image Fusion

  • Yufeng Chen
  • Yuan Sun
  • Hao Pan
  • Xujian Zhao
  • Jian Dai
  • Zhenwen Ren
  • Xingfeng Li

Infrared and visible image fusion (IVIF) technology has become a frontier of great interest due to its ability to integrate information from multiple sources. However, the progressive slowdown of weight updates in deep networks (i.e., the “network laziness” phenomenon) keeps existing methods far from realizing their full characterization potential. To this end, we propose a lightweight fusion method for IVIF, Anti-Inert Dynamic Fusion (AIDFusion), to fully exploit the potential of the network at all levels. Specifically, by progressively regulating the collaborative learning process of multi-level prediction in the network, a Dynamic Inertia Inhibition Learning Strategy (DIILS) is proposed to adaptively and efficiently inhibit inertia accumulation. Subsequently, to deeply explore the representation potential while breaking through the performance threshold, a lightweight Multi-dimensional Modulation Fusion Module (MMFM) is proposed to efficiently capture comprehensive multi-view and multi-scale features. Finally, considering the semantic bias between the prediction maps of DIILS and the fused features of MMFM, a Fourier Analysis Convolution (FAConv) is designed in the feature-recovery stage as a bridge between prediction and fusion to accomplish implicit periodic modeling. Extensive experiments on three public IVIF datasets demonstrate the dual advantages of AIDFusion in both fusion performance and computational overhead compared to state-of-the-art baseline methods.

AAAI Conference 2026 Conference Paper

Robust Semi-paired Multimodal Learning for Cross-modal Retrieval

  • Yang Qin
  • Yuan Sun
  • Xi Peng
  • Dezhong Peng
  • Joey Tianyi Zhou
  • Xiaomin Song
  • Peng Hu

Cross-modal retrieval is a fundamental application of multi-modal learning that has achieved remarkable success with large-scale well-paired data. However, in practice, it is costly to collect large-scale well-paired data. To alleviate the dependence on the amount of paired data, in this paper, we study a practical learning paradigm: semi-paired cross-modal learning (SPL), which utilizes both a small amount of paired data and a large amount of unpaired data to enhance cross-modal learning directly and is more accessible in practice. To achieve this, we take image-text retrieval as an example and propose a novel Robust Cross-modal Semi-paired Learning method (RCSL) by addressing two challenges. To be specific, i) to overcome the under-optimization issue caused by too little paired data, we present Semi-paired Discriminative Learning (SDL) to fully learn visual-semantic associations from a small amount of image-text pairs by preserving the alignment and uniformity of modality representations. ii) To mine visual-semantic correspondences from unpaired data, RCSL first constructs pseudo-paired correlations across different modalities by nearest neighbor association. However, this may introduce noisy correspondences (NCs) due to inaccurate pseudo signals, which could degrade the model's performance. To tackle NCs, we devise Robust Cross-correlation Mining (RCM) based on the risk minimization criterion to robustly and explicitly learn visual-semantic associations from pseudo-paired data, thus boosting cross-modal learning. Finally, we conduct extensive experiments on four datasets, i.e., three widely used benchmark datasets of Flickr30K, MS-COCO, CC152K, and a newly constructed real-world dataset Drone-SP, to demonstrate the effectiveness of RCSL under semi-paired and noisy settings.
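The abstract's "nearest neighbor association" for pseudo-pairs admits several variants; a common, stricter one keeps a pair only when image and text are mutually nearest neighbours, which suppresses some of the noisy correspondences RCM then has to handle. A hedged NumPy sketch, with all names ours:

```python
import numpy as np

def pseudo_pairs_by_mutual_nn(img_emb, txt_emb):
    """Keep (image, text) as a pseudo-pair only if each is the
    other's nearest neighbour in the shared embedding space."""
    sim = img_emb @ txt_emb.T
    i2t = sim.argmax(1)                    # best text for each image
    t2i = sim.argmax(0)                    # best image for each text
    imgs = np.arange(sim.shape[0])
    mutual = t2i[i2t] == imgs              # mutual nearest-neighbour test
    return list(zip(imgs[mutual], i2t[mutual]))
```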

AAAI Conference 2026 Conference Paper

Semantic-Consistent Bidirectional Contrastive Hashing for Noisy Multi-Label Cross-Modal Retrieval

  • Likang Peng
  • Chao Su
  • Wenyuan Wu
  • Yuan Sun
  • Dezhong Peng
  • Xi Peng
  • Xu Wang

Cross-modal hashing (CMH) facilitates efficient retrieval across different modalities (e.g., image and text) by encoding data into compact binary representations. While recent methods have achieved remarkable performance, they often rely heavily on fully annotated datasets, which are costly and labor-intensive to obtain. In real-world scenarios, particularly in multi-label datasets, label noise is prevalent and severely degrades retrieval performance. Moreover, existing CMH approaches typically overlook the partial semantic overlaps inherent in multi-label data, limiting their robustness and generalization. To tackle these challenges, we propose a novel framework named Semantic-Consistent Bidirectional Contrastive Hashing (SCBCH). The framework comprises two complementary modules: (1) Cross-modal Semantic-Consistent Classification (CSCC), which leverages cross-modal semantic consistency to estimate sample reliability and reduce the impact of noisy labels; (2) Bidirectional Soft Contrastive Hashing (BSCH), which dynamically generates soft contrastive sample pairs based on multi-label semantic overlap, enabling adaptive contrastive learning between semantically similar and dissimilar samples across modalities. Extensive experiments on four widely-used cross-modal retrieval benchmarks validate the effectiveness and robustness of our method, consistently outperforming state-of-the-art approaches under noisy multi-label conditions.
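One natural way to realize BSCH's "multi-label semantic overlap" — our guess at a concrete form, not necessarily the paper's — is a Jaccard score over label sets, used as a soft target weight for contrastive pairs:

```python
import numpy as np

def soft_label_similarity(labels_a, labels_b):
    """Soft pair weight = |A ∩ B| / |A ∪ B| for 0/1 multi-label rows."""
    inter = labels_a @ labels_b.T
    union = labels_a.sum(1)[:, None] + labels_b.sum(1)[None, :] - inter
    return inter / np.maximum(union, 1)    # in [0, 1]

A = np.array([[1, 0, 1], [0, 1, 0]])
B = np.array([[1, 0, 0], [0, 1, 1]])
print(soft_label_similarity(A, B))         # approx [[0.5, 0.33], [0.0, 0.5]]
```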

AAAI Conference 2025 Conference Paper

Deep Evidential Hashing for Trustworthy Cross-Modal Retrieval

  • Yuan Li
  • Liangli Zhen
  • Yuan Sun
  • Dezhong Peng
  • Xi Peng
  • Peng Hu

Cross-modal hashing provides an efficient solution for retrieval tasks across various modalities, such as images and text. However, most existing methods are deterministic models, which overlook the reliability associated with the retrieved results. This omission renders them unreliable for determining matches between data pairs based solely on Hamming distance. To bridge this gap, in this paper, we propose a novel method called Deep Evidential Cross-modal Hashing (DECH). This method equips hashing models with the ability to quantify the reliability level of the association between a query sample and each corresponding retrieved sample, bringing a new dimension of reliability to the cross-modal retrieval process. To achieve this, our method addresses two key challenges: i) To leverage evidential theory in guiding the model to learn hash codes, we design a novel evidence acquisition module to collect evidence and place the evidence captured by hash codes on a Beta distribution to derive a binomial opinion. Unlike existing evidential learning approaches that rely on classifiers, our method collects evidence directly through hash codes. ii) To tackle the task-oriented challenge, we first introduce a method to update the derived binomial opinion, allowing it to represent the uncertainty caused by conflicting evidence. In this manner, we present a strategy to precisely evaluate the reliability level of retrieved results, culminating in performance improvement. We validate the efficacy of our DECH through extensive experimentation on four benchmark datasets. The experimental results demonstrate our superior performance compared to 12 state-of-the-art methods.
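The Beta-distribution-to-binomial-opinion step is standard subjective logic, so it can be made concrete: positive/negative evidence masses r and s (here, collected from hash-code agreement) map to belief, disbelief, and uncertainty. A minimal sketch under that standard construction; the paper's extra update for conflicting evidence is not reproduced here.

```python
def binomial_opinion(r, s, W=2.0):
    """Map evidence masses to a subjective-logic binomial opinion:
    belief + disbelief + uncertainty = 1, with W the prior weight."""
    total = r + s + W
    belief, disbelief, uncertainty = r / total, s / total, W / total
    expected = belief + 0.5 * uncertainty   # E[p] of Beta(r + 1, s + 1) when W = 2
    return belief, disbelief, uncertainty, expected

print(binomial_opinion(r=8.0, s=1.0))   # plentiful evidence: low uncertainty
print(binomial_opinion(r=0.5, s=0.5))   # scarce evidence: high uncertainty
```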

NeurIPS Conference 2025 Conference Paper

Dynamic Focused Masking for Autoregressive Embodied Occupancy Prediction

  • Yuan Sun
  • Julio Contreras
  • Jorge Ortiz

Visual autoregressive modeling has recently demonstrated potential in image tasks by enabling coarse-to-fine, next-level prediction. Most indoor 3D occupancy prediction methods, however, continue to rely on dense voxel grids and convolution-heavy backbones, which incur high computational costs when applying such coarse-to-fine frameworks. In contrast, cost-efficient alternatives based on Gaussian representations—particularly in the context of multi-scale autoregression—remain underexplored. To bridge this gap, we propose DFGauss, a dynamic focused masking framework for multi-scale 3D Gaussian representation. Unlike conventional approaches that refine voxel volumes or 2D projections, DFGauss directly operates in the 3D Gaussian parameter space, progressively refining representations across resolutions under hierarchical supervision. Each finer-scale Gaussian is conditioned on its coarser-level counterpart, forming a scale-wise autoregressive process. To further enhance efficiency, we introduce an importance-guided refinement strategy that selectively propagates informative Gaussians across scales, enabling spatially adaptive detail modeling. Experiments on 3D occupancy benchmarks demonstrate that DFGauss achieves competitive performance, highlighting the promise of autoregressive modeling for scalable 3D occupancy prediction.

NeurIPS Conference 2025 Conference Paper

Interactive Cross-modal Learning for Text-3D Scene Retrieval

  • Yanglin Feng
  • Yongxiang Li
  • Yuan Sun
  • Yang Qin
  • Dezhong Peng
  • Peng Hu

Text-3D Scene Retrieval (T3SR) aims to retrieve relevant scenes using linguistic queries. Although traditional T3SR methods have made significant progress in capturing fine-grained associations, they implicitly assume that query descriptions are information-complete. In practical deployments, however, limited by the capabilities of users and models, it is difficult or even impossible to directly obtain a perfect textual query suiting the entire scene and model, thereby leading to performance degradation. To address this issue, we propose a novel Interactive Text-3D Scene Retrieval Method (IDeal), which enhances the alignment between texts and 3D scenes through continuous interaction. To achieve this, we present an Interactive Retrieval Refinement framework (IRR), which employs a questioner to pose contextually relevant questions to an answerer in successive rounds that either promote detailed probing or encourage exploratory divergence within scenes. Based on the iterative responses received from the answerer, IRR adopts a retriever to perform both feature-level and semantic-level information fusion, facilitating scene-level interaction and understanding for more precise re-rankings. To bridge the domain gap between queries and interactive texts, we propose an Interaction Adaptation Tuning strategy (IAT). IAT mitigates the discriminability and diversity risks among augmented text features that approximate the interaction text domain, achieving contrastive domain adaptation for our retriever. Extensive experimental results on three datasets demonstrate the superiority of IDeal. Code is available at https://github.com/Yangl1nFeng/IDeal.

NeurIPS Conference 2025 Conference Paper

Learning Source-Free Domain Adaptation for Visible-Infrared Person Re-Identification

  • Yongxiang Li
  • Yanglin Feng
  • Yuan Sun
  • Dezhong Peng
  • Xi Peng
  • Peng Hu

In this paper, we investigate source-free domain adaptation (SFDA) for visible-infrared person re-identification (VI-ReID), aiming to adapt a pre-trained source model to an unlabeled target domain without access to source data. To address this challenging setting, we propose a novel learning paradigm, termed Source-Free Visible-Infrared Person Re-Identification (SVIP), which fully exploits the prior knowledge embedded in the source model to guide target domain adaptation. The proposed framework comprises three key components specifically designed for the source-free scenario: 1) a Source-Guided Contrastive Learning (SGCL) module, which leverages the discriminative feature space of the frozen source model as a reference to perform contrastive learning on the unlabeled target data, thereby preserving discrimination without requiring source samples; 2) a Residual Transfer Learning (RTL) module, which learns residual mappings to adapt the target model’s representations while maintaining the knowledge from the source model; and 3) a Structural Consistency-Guided Cross-modal Alignment (SCCA) module, which enforces reciprocal structural constraints between visible and infrared modalities to identify reliable cross-modal pairs and achieve robust modality alignment without source supervision. Extensive experiments on benchmark datasets demonstrate that SVIP substantially enhances target domain performance and outperforms existing unsupervised VI-ReID methods under source-free settings.

NeurIPS Conference 2025 Conference Paper

Neighbor-aware Contrastive Disambiguation for Cross-Modal Hashing with Redundant Annotations

  • Chao Su
  • Likang Peng
  • Yuan Sun
  • Dezhong Peng
  • Xi Peng
  • Xu Wang

Cross-modal hashing aims to efficiently retrieve information across different modalities by mapping data into compact hash codes. However, most existing methods assume access to fully accurate supervision, which rarely holds in real-world scenarios. In fact, annotations are often redundant, i.e., each sample is associated with a set of candidate labels that includes both ground-truth labels and redundant noisy labels. Treating all annotated labels as equally valid introduces two critical issues: (1) the sparse presence of true labels within the label set is not explicitly addressed, leading to overfitting on redundant noisy annotations; (2) redundant noisy labels induce spurious similarities that distort semantic alignment across modalities and degrade the quality of the hash space. To address these challenges, we propose that effective cross-modal hashing requires explicitly identifying and leveraging the true label subset within all candidate annotations. Based on this insight, we present Neighbor-aware Contrastive Disambiguation (NACD), a novel framework designed for robust learning under redundant supervision. NACD consists of two key components. The first, Neighbor-aware Confidence Reconstruction (NACR), refines label confidence by aggregating information from cross-modal neighbors to distinguish true labels from redundant noisy ones. The second, Class-aware Robust Contrastive Hashing (CRCH), constructs reliable positive and negative pairs based on label confidence scores, thereby significantly enhancing robustness against noisy supervision. Moreover, to effectively reduce the quantization error, we incorporate a quantization loss that enforces binary constraints on the learned hash representations. Extensive experiments conducted on three large-scale multimodal benchmarks demonstrate that our method consistently outperforms state-of-the-art approaches, thereby establishing a new standard for cross-modal hashing with redundant annotations. Code is available at https://github.com/Rose-bud/NACD.

AAAI Conference 2025 Conference Paper

Noisy Label Calibration for Multi-View Classification

  • Shilin Xu
  • Yuan Sun
  • Xingfeng Li
  • Siyuan Duan
  • Zhenwen Ren
  • Zheng Liu
  • Dezhong Peng

In recent years, multi-view learning has attracted extensive research interest. Most existing multi-view learning methods rely on accurate annotations to improve decision accuracy. However, noisy labels are ubiquitous in multi-view data due to imperfect annotation. To deal with this problem, we propose a novel noisy label calibration method (NLC) for multi-view classification to resist the negative impact of noisy labels. Specifically, to capture consensus information from multiple views, we employ a max-margin rank loss to reduce the heterogeneous gap. Subsequently, we evaluate confidence scores to refine the predictions of noisy instances according to all reliable neighbors. Further, we propose Label Noise Detection (LND) to separate multi-view data into clean and noisy subsets, and propose Label Calibration Learning (LCL) to correct noisy instances. Finally, we adopt the cross-entropy loss to achieve multi-view classification. Extensive experiments on six datasets validate that our method outperforms eight state-of-the-art methods.

IJCAI Conference 2025 Conference Paper

Reliable Disentanglement Multi-view Learning Against View Adversarial Attacks

  • Xuyang Wang
  • Siyuan Duan
  • Qizhi Li
  • Guiduo Duan
  • Yuan Sun
  • Dezhong Peng

Trustworthy multi-view learning has attracted extensive attention because evidence learning can provide reliable uncertainty estimation to enhance the credibility of multi-view predictions. Existing trusted multi-view learning methods implicitly assume that multi-view data is secure. However, in safety-sensitive applications such as autonomous driving and security monitoring, multi-view data often faces threats from adversarial perturbations, thereby deceiving or disrupting multi-view models. This inevitably leads to the adversarial unreliability problem (AUP) in trusted multi-view learning. To overcome this problem, we propose a novel multi-view learning framework, namely Reliable Disentanglement Multi-view Learning (RDML). Specifically, we first propose evidential disentanglement learning to decompose each view into clean and adversarial parts under the guidance of the corresponding evidence, which is extracted by a pretrained evidence extractor. Then, we employ a feature recalibration module to mitigate the negative impact of adversarial perturbations and extract potential informative features from them. Finally, to further ignore the irreparable adversarial interference, a view-level evidential attention mechanism is designed. Extensive experiments on multi-view classification tasks with adversarial attacks show that RDML outperforms the state-of-the-art methods by a relatively large margin. Our code is available at https://github.com/Willy1005/2025-IJCAI-RDML.

IJCAI Conference 2025 Conference Paper

Robust Graph Contrastive Learning for Incomplete Multi-view Clustering

  • Deyin Zhuang
  • Jian Dai
  • Xingfeng Li
  • Xi Wu
  • Yuan Sun
  • Zhenwen Ren

In recent years, multi-view clustering (MVC) has become a promising approach for analyzing heterogeneous multi-source data. However, during the collection of multi-view data, factors such as environmental interference or sensor failure often lead to the loss of view sample data, resulting in incomplete multi-view clustering (IMVC). Graph contrastive IMVC has demonstrated promising performance as an effective solution, typically utilizing in-graph instances as positive pairs and out-of-graph instances as negative pairs. However, the construction of positive and negative pairs in this paradigm inevitably introduces graph noisy correspondence (GNC). To this end, we propose a new IMVC framework, namely robust graph contrastive learning (RGCL). Specifically, RGCL first completes the missing data by using a multi-view consistency transfer relationship graph. Then, to mitigate the impact of false negative pairs from graph contrastive learning, we propose noise-robust graph contrastive learning to accurately mine intra-view consistency. Finally, we present cross-view graph-level alignment to fully exploit the complementary information across different views. Experimental results on six multi-view datasets demonstrate that our RGCL exhibits superiority and effectiveness compared with 9 state-of-the-art IMVC methods. The source code is available at https://github.com/DYZ163/RGCL.git.

AAAI Conference 2025 Conference Paper

Robust Self-Paced Hashing for Cross-Modal Retrieval with Noisy Labels

  • Ruitao Pu
  • Yuan Sun
  • Yang Qin
  • Zhenwen Ren
  • Xiaomin Song
  • Huiming Zheng
  • Dezhong Peng

Cross-modal hashing (CMH) has emerged as a popular technique for cross-modal retrieval due to its low storage cost and high computational efficiency on large-scale data. Most existing methods implicitly assume that multi-modal data is correctly labeled, which is expensive and even unattainable due to the inevitable imperfect annotations (i.e., noisy labels) in real-world scenarios. Inspired by human cognitive learning, a few methods introduce self-paced learning (SPL) to gradually train the model from easy to hard samples, which is often used to mitigate the effects of feature noise or outliers. How to utilize SPL to alleviate the misleading effect of noisy labels on the hash model, however, remains a less-touched problem. To tackle this problem, we propose a new cognitive cross-modal retrieval method called Robust Self-paced Hashing with Noisy Labels (RSHNL), which can mimic the human cognitive process to identify noise while embracing robustness against noisy labels. Specifically, we first propose a contrastive hashing learning (CHL) scheme to improve multi-modal consistency, thereby reducing the inherent semantic gap. Afterward, we propose center aggregation learning (CAL) to mitigate intra-class variations. Finally, we propose Noise-tolerance Self-paced Hashing (NSH), which dynamically estimates the learning difficulty of each instance and distinguishes noisy labels by difficulty level. For all estimated clean pairs, we further adopt a self-paced regularizer to gradually learn hash codes from easy to hard. Extensive experiments demonstrate that the proposed RSHNL performs remarkably well over the state-of-the-art CMH methods.
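The classic self-paced scheme behind this kind of difficulty estimation is compact enough to show. The hard-threshold variant below is the generic textbook form (ours), not RSHNL's exact regularizer:

```python
import numpy as np

def self_paced_weights(losses, lam):
    """Hard self-paced weighting: samples with loss below the age
    parameter lam count as 'easy' (weight 1), others are held out.
    Samples that stay heavy-loss as lam grows are noisy-label suspects."""
    return (losses < lam).astype(float)

losses = np.array([0.1, 0.4, 2.5, 0.8])
for lam in (0.5, 1.0, 3.0):              # grow the age parameter over epochs
    print(lam, self_paced_weights(losses, lam))
```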

AAAI Conference 2025 Conference Paper

TPCH: Tensor-interacted Projection and Cooperative Hashing for Multi-view Clustering

  • Zhongwen Wang
  • Xingfeng Li
  • Yinghui Sun
  • Quansen Sun
  • Yuan Sun
  • Han Ling
  • Jian Dai
  • Zhenwen Ren

In recent years, anchor and hash-based multi-view clustering methods have gained attention for their efficiency and simplicity in handling large-scale data. However, existing methods often overlook the interactions among multi-view data and higher-order cooperative relationships during projection, negatively impacting the quality of hash representation in low-dimensional spaces, clustering performance, and sensitivity to noise. To address this issue, we propose a novel approach named Tensor-Interacted Projection and Cooperative Hashing for Multi-View Clustering (TPCH). TPCH stacks multiple projection matrices into a tensor, taking into account the synergies and communications during the projection process. By capturing higher-order multi-view information through dual projection and Hamming space, TPCH employs an enhanced tensor nuclear norm to learn more compact and distinguishable hash representations, promoting communication within and between views. Experimental results demonstrate that this refined method significantly outperforms state-of-the-art methods in clustering on five large-scale multi-view datasets. Moreover, in terms of CPU time, TPCH achieves substantial acceleration compared to the most advanced current methods.

AAAI Conference 2024 Conference Paper

Dual Self-Paced Cross-Modal Hashing

  • Yuan Sun
  • Jian Dai
  • Zhenwen Ren
  • Yingke Chen
  • Dezhong Peng
  • Peng Hu

Cross-modal hashing (CMH) is an efficient technique to retrieve relevant data across different modalities, such as images, texts, and videos, and has attracted more and more attention due to its low storage cost and fast query speed. Although existing CMH methods achieve remarkable progress, almost all of them treat samples of varying difficulty levels without discrimination, thus leaving them vulnerable to noise or outliers. Based on this observation, we reveal and study the dual difficulty levels implied in cross-modal hashing learning, i.e., instance-level and feature-level difficulty. To address this problem, we propose a novel Dual Self-Paced Cross-Modal Hashing (DSCMH) that mimics human cognitive learning to learn hashing from "easy" to "hard" at both the instance and feature levels, thereby embracing robustness against noise/outliers. Specifically, our DSCMH assigns weights to each instance and feature to measure their difficulty or reliability, and then uses these weights to automatically filter out noisy and irrelevant data points in the original space. By gradually increasing the weights during training, our method can focus on more instances and features from "easy" to "hard", thus mitigating the adverse effects of noise or outliers. Extensive experiments are conducted on three widely-used benchmark datasets to demonstrate the effectiveness and robustness of the proposed DSCMH over 12 state-of-the-art CMH methods.

IJCAI Conference 2024 Conference Paper

Dual Semantic Fusion Hashing for Multi-Label Cross-Modal Retrieval

  • Kaiming Liu
  • Yunhong Gong
  • Yu Cao
  • Zhenwen Ren
  • Dezhong Peng
  • Yuan Sun

Cross-modal hashing (CMH) has been widely used for multi-modal retrieval tasks due to its low storage cost and fast query speed. Although existing CMH methods achieve promising performance, most of them mainly rely on coarse-grained supervision information (i.e., a pairwise similarity matrix) to measure the semantic similarities between all instances, ignoring the impact of multi-label distribution. To address this issue, we construct fine-grained semantic similarity to explore the cluster-level semantic relationships between multi-label data, and propose a new dual semantic fusion hashing (DSFH) for multi-label cross-modal retrieval. Specifically, we first learn the modal-specific representation and consensus hash codes, thereby merging specificity with consistency. Then, we fuse the coarse-grained and fine-grained semantics to mine multiple-level semantic relationships, thereby enhancing hash code discrimination. Extensive experiments on three benchmarks demonstrate the superior performance of our DSFH compared with 16 state-of-the-art methods.
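The coarse/fine fusion can be pictured with one line of algebra: blend a pairwise any-shared-label similarity with a cluster-level one. The sketch below is purely our illustration of such a blend; the trade-off `alpha` and the cluster assignments are assumed inputs, not the paper's construction:

```python
import numpy as np

def fused_similarity(labels, cluster_ids, alpha=0.5):
    """Blend coarse pairwise similarity (share >= 1 label) with
    fine-grained cluster-level similarity (same semantic cluster)."""
    coarse = ((labels @ labels.T) > 0).astype(float)
    fine = (cluster_ids[:, None] == cluster_ids[None, :]).astype(float)
    return alpha * coarse + (1 - alpha) * fine
```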

NeurIPS Conference 2024 Conference Paper

Efficient Adaptation of Pre-trained Vision Transformer via Householder Transformation

  • Wei Dong
  • Yuan Sun
  • Yiting Yang
  • Xing Zhang
  • Zhijun Lin
  • Qingsen Yan
  • Haokui Zhang
  • Peng Wang

A common strategy for Parameter-Efficient Fine-Tuning (PEFT) of pre-trained Vision Transformers (ViTs) involves adapting the model to downstream tasks by learning a low-rank adaptation matrix. This matrix is decomposed into a product of down-projection and up-projection matrices, with the bottleneck dimensionality being crucial for reducing the number of learnable parameters, as exemplified by prevalent methods like LoRA and Adapter. However, these low-rank strategies typically employ a fixed bottleneck dimensionality, which limits their flexibility in handling layer-wise variations. To address this limitation, we propose a novel PEFT approach inspired by Singular Value Decomposition (SVD) for representing the adaptation matrix. SVD decomposes a matrix into the product of a left unitary matrix, a diagonal matrix of scaling values, and a right unitary matrix. We utilize Householder transformations to construct orthogonal matrices that efficiently mimic the unitary matrices, requiring only a vector. The diagonal values are learned in a layer-wise manner, allowing them to flexibly capture the unique properties of each layer. This approach enables the generation of adaptation matrices with varying ranks across different layers, providing greater flexibility in adapting pre-trained models. Experiments on standard downstream vision tasks demonstrate that our method achieves promising fine-tuning performance.
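The SVD-style construction in this abstract is unusually concrete: the adaptation matrix is a product of two orthogonal factors and a learned diagonal, with each orthogonal factor mimicked by a Householder reflection H = I - 2vv^T/||v||^2, which costs a single vector. The PyTorch sketch below is our reconstruction of that idea, not the paper's released code; initializing the diagonal at zero makes the adapter a no-op at the start of fine-tuning.

```python
import torch

def householder(v):
    """H = I - 2 v v^T / ||v||^2: an orthogonal reflection from one vector."""
    v = v / v.norm()
    return torch.eye(v.numel(), device=v.device) - 2.0 * torch.outer(v, v)

class HouseholderAdapter(torch.nn.Module):
    """Adaptation matrix = householder(u) @ diag(d) @ householder(v).
    Only 3 * dim parameters per layer; the learned diagonal d plays the
    role of layer-wise singular values, so effective rank varies by layer."""
    def __init__(self, dim):
        super().__init__()
        self.u = torch.nn.Parameter(torch.randn(dim))
        self.v = torch.nn.Parameter(torch.randn(dim))
        self.d = torch.nn.Parameter(torch.zeros(dim))   # zero init: no-op adapter

    def delta_w(self):
        return householder(self.u) @ torch.diag(self.d) @ householder(self.v)

    def forward(self, x, frozen_w):                     # frozen_w: pre-trained weight
        return x @ (frozen_w + self.delta_w()).T

adapter = HouseholderAdapter(64)
x, w0 = torch.randn(8, 64), torch.randn(64, 64)
print(adapter(x, w0).shape)                             # torch.Size([8, 64])
```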

EAAI Journal 2024 Journal Article

Multimodal contrastive learning for face anti-spoofing

  • Pengchao Deng
  • Chenyang Ge
  • Hao Wei
  • Yuan Sun
  • Xin Qiao

Multimodal face anti-spoofing systems adopt multiple sensor modalities, such as infrared, color, depth, and thermal, to distinguish between living and spoofing faces via complementary spoofing clues from each modality. One challenge is that when the multimodal face anti-spoofing system is placed in different environments, the sensor setup may not be unified, causing a certain sensor to be unavailable. To alleviate this issue, a two-stream face anti-spoofing method is proposed. The first stream focuses on extracting primary features from an available sensor by a baseline network. The second stream employs a multimodal contrastive learning strategy to acquire modality-agnostic and task-specific representations from another deployed sensor. Furthermore, a master–slave modulation fusion block is designed to effectively fuse features from the two streams. Experiments conducted on three public multimodal databases show the superior performance of the proposed method.

NeurIPS Conference 2023 Conference Paper

Cross-modal Active Complementary Learning with Self-refining Correspondence

  • Yang Qin
  • Yuan Sun
  • Dezhong Peng
  • Joey Tianyi Zhou
  • Xi Peng
  • Peng Hu

Recently, image-text matching has attracted more and more attention from academia and industry, which is fundamental to understanding the latent correspondence across visual and textual modalities. However, most existing methods implicitly assume the training pairs are well-aligned while ignoring the ubiquitous annotation noise, a.k.a. noisy correspondence (NC), thereby inevitably leading to a performance drop. Although some methods attempt to address such noise, they still face two challenging problems: excessive memorizing/overfitting and unreliable correction for NC, especially under high noise. To address the two problems, we propose a generalized Cross-modal Robust Complementary Learning framework (CRCL), which benefits from a novel Active Complementary Loss (ACL) and an efficient Self-refining Correspondence Correction (SCC) to improve the robustness of existing methods. Specifically, ACL exploits active and complementary learning losses to reduce the risk of providing erroneous supervision, leading to theoretically and experimentally demonstrated robustness against NC. SCC utilizes multiple self-refining processes with momentum correction to enlarge the receptive field for correcting correspondences, thereby alleviating error accumulation and achieving accurate and stable corrections. We carry out extensive experiments on three image-text benchmarks, i.e., Flickr30K, MS-COCO, and CC152K, to verify the superior robustness of our CRCL against synthetic and real-world noisy correspondences.

AAAI Conference 2022 Conference Paper

Enhancing Column Generation by a Machine-Learning-Based Pricing Heuristic for Graph Coloring

  • Yunzhuang Shen
  • Yuan Sun
  • Xiaodong Li
  • Andrew Eberhard
  • Andreas Ernst

Column Generation (CG) is an effective method for solving large-scale optimization problems. CG starts by solving a subproblem with a subset of columns (i.e., variables) and gradually includes new columns that can improve the solution of the current subproblem. The new columns are generated as needed by repeatedly solving a pricing problem, which is often NP-hard and is a bottleneck of the CG approach. To tackle this, we propose a Machine-Learning-based Pricing Heuristic (MLPH) that can generate many high-quality columns efficiently. In each iteration of CG, our MLPH leverages an ML model to predict the optimal solution of the pricing problem, which is then used to guide a sampling method to efficiently generate multiple high-quality columns. Using the graph coloring problem, we empirically show that MLPH significantly enhances CG as compared to six state-of-the-art methods, and the improvement in CG can lead to substantially better performance of the branch-and-price exact method.
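The CG loop the abstract describes has a standard shape; the skeleton below (ours) marks where the ML pricing heuristic slots in. `solve_rmp`, `pricing_predict`, and `sample_columns` are hypothetical placeholders for the problem-specific pieces, so this is a structural sketch rather than a complete solver:

```python
def column_generation(solve_rmp, pricing_predict, sample_columns, max_iter=50):
    """Column generation with an ML pricing heuristic: solve the
    restricted master problem (RMP), score pricing candidates with an
    ML model, and sample many negative-reduced-cost columns at once."""
    columns = []                                  # initial restricted column set
    for _ in range(max_iter):
        duals = solve_rmp(columns)                # LP duals of the current RMP
        scores = pricing_predict(duals)           # predicted pricing solution
        new_cols = sample_columns(scores, duals)  # keep negative reduced cost only
        if not new_cols:                          # no improving column: LP optimal
            break
        columns.extend(new_cols)
    return columns
```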

NeurIPS Conference 2022 Conference Paper

Learning Generalizable Models for Vehicle Routing Problems via Knowledge Distillation

  • Jieyi Bi
  • Yining Ma
  • Jiahai Wang
  • Zhiguang Cao
  • Jinbiao Chen
  • Yuan Sun
  • Yeow Meng Chee

Recent neural methods for vehicle routing problems always train and test the deep models on the same instance distribution (i.e., uniform). To tackle the consequent cross-distribution generalization concerns, we bring knowledge distillation to this field and propose an Adaptive Multi-Distribution Knowledge Distillation (AMDKD) scheme for learning more generalizable deep models. Particularly, our AMDKD leverages various knowledge from multiple teachers trained on exemplar distributions to yield a light-weight yet generalist student model. Meanwhile, we equip AMDKD with an adaptive strategy that allows the student to concentrate on difficult distributions, so as to absorb hard-to-master knowledge more effectively. Extensive experimental results show that, compared with the baseline neural methods, our AMDKD is able to achieve competitive results on both unseen in-distribution and out-of-distribution instances, which are either randomly synthesized or adopted from benchmark datasets (i.e., TSPLIB and CVRPLIB). Notably, our AMDKD is generic, and consumes less computational resources for inference.

YNICL Journal 2021 Journal Article

Hippocampal subfield and anterior-posterior segment volumes in patients with sporadic amyotrophic lateral sclerosis

  • Shuangwu Liu
  • Qingguo Ren
  • Gaolang Gong
  • Yuan Sun
  • Bing Zhao
  • Xiaotian Ma
  • Na Zhang
  • Suyu Zhong

Neuroimaging studies of hippocampal volumes in patients with amyotrophic lateral sclerosis (ALS) have reported inconsistent results. Our aims were to demonstrate that such discrepancies are largely due to atrophy of different regions of the hippocampus that emerge in different disease stages of ALS and to explore the existence of co-pathology in ALS patients. We used the well-validated King's clinical staging system for ALS to classify patients into different disease stages. We investigated in vivo hippocampal atrophy patterns across subfields and anterior-posterior segments in different King's stages using structural MRI in 76 ALS patients and 94 healthy controls (HCs). The thalamus, corticostriatal tract and perforant path were used as structural controls to compare the sequence of alterations between these structures and the hippocampal subfields. Compared with HCs, ALS patients at King's stage 1 had lower volumes in the bilateral posterior subiculum and presubiculum; ALS patients at King's stage 2 exhibited lower volumes in the bilateral posterior subiculum, left anterior presubiculum and left global hippocampus; ALS patients at King's stage 3 showed significantly lower volumes in the bilateral posterior subiculum, dentate gyrus and global hippocampus. Thalamic atrophy emerged at King's stage 3. White matter tracts remained normal in a subset of ALS patients. Our study demonstrated that the pattern of hippocampal atrophy in ALS patients varies greatly across King's stages. Future studies in ALS patients that focus on the hippocampus may help to further clarify possible co-pathologies in ALS.

AAAI Conference 2020 Conference Paper

Revisiting Probability Distribution Assumptions for Information Theoretic Feature Selection

  • Yuan Sun
  • Wei Wang
  • Michael Kirley
  • Xiaodong Li
  • Jeffrey Chan

Feature selection has been shown to be beneficial for many data mining and machine learning tasks, especially for big data analytics. Mutual Information (MI) is a well-known information-theoretic approach used to evaluate the relevance of feature subsets and class labels. However, estimating high-dimensional MI poses significant challenges. Consequently, a great deal of research has focused on using low-order MI approximations or computing a lower bound on MI called Variational Information (VI). These methods often require certain assumptions on the probability distributions of features so that these distributions are realistic yet tractable to compute. In this paper, we reveal two sets of distribution assumptions underlying many MI and VI based methods: Feature Independence Distribution and Geometric Mean Distribution. We systematically analyze their strengths and weaknesses and propose a logical extension called Arithmetic Mean Distribution, which leads to an unbiased and normalised estimation of probability densities. We conduct detailed empirical studies across a suite of 29 real-world classification problems and illustrate improved prediction accuracy of our methods based on the identification of more informative features, thus providing support for our theoretical findings.
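The paper's motivation is easy to make concrete: plain discrete MI is cheap for a single feature, but joint counts blow up exponentially for feature subsets, which is exactly what forces the distribution assumptions being analyzed. A small self-contained estimator (ours) for the pairwise case:

```python
import numpy as np
from collections import Counter

def mutual_information(x, y):
    """Discrete plug-in estimate of I(X;Y) in nats:
    sum over observed (x, y) of p(x,y) * log[p(x,y) / (p(x) p(y))]."""
    n = len(x)
    pxy, px, py = Counter(zip(x, y)), Counter(x), Counter(y)
    return sum((c / n) * np.log((c / n) / ((px[a] / n) * (py[b] / n)))
               for (a, b), c in pxy.items())

rng = np.random.default_rng(0)
x = rng.integers(0, 2, 10_000)
y = np.where(rng.random(10_000) < 0.9, x, 1 - x)   # y copies x with 10% flips
print(mutual_information(x, y))                    # near ln 2 - H(0.1), about 0.37 nats
```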

TCS Journal 2020 Journal Article

The one-round multi-player discrete Voronoi game on grids and trees

  • Xiaoming Sun
  • Yuan Sun
  • Zhiyu Xia
  • Jialin Zhang

Building on the two-player Voronoi game introduced by Ahn et al. [1] and the multi-player diffusion game introduced by Alon et al. [2], we investigate the following one-round multi-player discrete Voronoi game on grids and trees. There are n players playing this game on a graph G = (V, E). Each player chooses an initial vertex from the vertex set of the graph and tries to maximize the size of its nearest vertex set. As the main result, we give sufficient conditions for the existence/non-existence of a pure-strategy Nash equilibrium in 4-player games on grids, leaving only a constant gap unknown. We further consider this game with more than 4 players and construct a family of strategy profiles that are pure-strategy Nash equilibria on sufficiently narrow graphs. Besides, we investigate the game with 3 players on trees and design a linear time/space algorithm to decide the existence of a pure-strategy Nash equilibrium.
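The payoff in this game — the size of each player's nearest vertex set — is computable with a multi-source BFS, which makes the setup easy to experiment with. The sketch below (ours) awards tied vertices to no one; the paper's exact tie-breaking convention may differ:

```python
from collections import deque

def voronoi_payoffs(adj, choices):
    """Multi-source BFS from the chosen vertices; each vertex goes to
    the strictly closest player, ties are awarded to nobody."""
    INF = float("inf")
    dist = {v: INF for v in adj}
    owner = {v: None for v in adj}
    q = deque()
    for player, v in enumerate(choices):
        dist[v], owner[v] = 0, player
        q.append(v)
    while q:
        u = q.popleft()
        for w in adj[u]:
            if dist[u] + 1 < dist[w]:
                dist[w], owner[w] = dist[u] + 1, owner[u]
                q.append(w)
            elif dist[u] + 1 == dist[w] and owner[w] != owner[u]:
                owner[w] = None                 # equidistant from two players
    payoff = [0] * len(choices)
    for v, p in owner.items():
        if p is not None:
            payoff[p] += 1
    return payoff

# Path graph 0-1-2-3-4; players pick vertices 1 and 3.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
print(voronoi_payoffs(path, [1, 3]))            # [2, 2]: vertex 2 is tied
```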

IS Journal 2019 Journal Article

Automatic Vehicle Tracking With Roadside LiDAR Data for the Connected-Vehicles System

  • Yuepeng Cui
  • Hao Xu
  • Jianqing Wu
  • Yuan Sun
  • Junxuan Zhao

The existing connected-vehicle deployments obtain the real-time status of connected vehicles but know nothing about unconnected traffic. It is therefore urgent to find an approach to collecting the high-resolution real-time status of unconnected road users. This paper introduces a new generation of light detection and ranging (LiDAR) enhanced connected infrastructure that actively senses the high-resolution status of surrounding traffic participants with roadside LiDAR sensors and broadcasts connected-vehicle messages through DSRC roadside units. The LiDAR data processing procedure, including background filtering, object clustering, vehicle recognition, lane identification, and vehicle tracking, is presented in this paper. The performance of the proposed data processing procedure is evaluated with field-collected data.
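Two of the pipeline stages named here (background filtering, object clustering) are concrete enough to sketch. The toy below assumes a pre-built set of occupied background grid cells and uses scikit-learn's DBSCAN for the clustering stage; all names and parameter values are illustrative, not the paper's:

```python
import numpy as np
from sklearn.cluster import DBSCAN

def cluster_foreground(points, background_cells, cell=0.5, eps=1.2, min_pts=10):
    """Drop points falling in known-background grid cells, then group
    the remaining foreground points into object clusters with DBSCAN."""
    keys = np.floor(points[:, :2] / cell).astype(int)            # 2D grid index
    is_bg = np.array([tuple(k) in background_cells for k in keys])
    fg = points[~is_bg]                                          # foreground xyz points
    labels = DBSCAN(eps=eps, min_samples=min_pts).fit_predict(fg)
    return [fg[labels == c] for c in set(labels) if c != -1]     # -1 marks noise
```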