Arrow Research search

Author name cluster

Chun Yuan

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

35 papers
1 author row

Possible papers

35

AAAI Conference 2026 Conference Paper

M3Time: LLM-Enhanced Multi-Modal, Multi-Scale, and Multi-Frequency Multivariate Time Series Forecasting

  • Shuning Jia
  • Baijun Song
  • Canming Ye
  • Chun Yuan

Multivariate Time Series Forecasting (MTSF) aims to capture the dependencies among multiple variables and their temporal dynamics to predict future values. In recent years, Large Language Models (LLMs) have set a new paradigm for MTSF, incorporating external knowledge into the modeling process through textual prompts. However, we observe that current LLM-based methods fail to exploit these priors due to their coarse-grained representation of time series data, which hinders effective alignment of the two modalities. To address this, we propose M3Time, a multi-modal, multi-scale, and multi-frequency framework for multivariate time series forecasting. It enhances the quality of time series representations and facilitates the integration of LLM semantic priors with fine-grained temporal features. Additionally, M3Time further improves training stability and model robustness with an adaptive mixed loss function, which dynamically balances L1 and L2 error terms. Experimental results on seven real-world public datasets show that M3Time consistently outperforms state-of-the-art methods, underscoring its effectiveness.
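
The abstract does not give the exact form of the adaptive mixed loss, but the idea of dynamically balancing L1 and L2 error terms can be sketched as follows; the learnable mixing weight and its sigmoid parameterization are assumptions made purely for illustration.

```python
import torch

def adaptive_mixed_loss(pred, target, alpha):
    """Blend L1 and L2 (MSE) error terms with a mixing weight in (0, 1).

    The paper's exact balancing rule is not given in the abstract; this is
    only an illustrative form of a dynamically weighted L1/L2 mix.
    """
    l1 = torch.nn.functional.l1_loss(pred, target)
    l2 = torch.nn.functional.mse_loss(pred, target)
    return alpha * l1 + (1.0 - alpha) * l2

# Hypothetical usage: learn the mixing weight jointly with the forecaster.
raw_alpha = torch.zeros(1, requires_grad=True)        # unconstrained parameter
pred = torch.randn(8, 96, 7, requires_grad=True)       # (batch, horizon, variables)
target = torch.randn(8, 96, 7)
loss = adaptive_mixed_loss(pred, target, torch.sigmoid(raw_alpha))
loss.backward()
```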

AAAI Conference 2026 Conference Paper

Prune&Comp: Free Lunch for Layer-Pruned LLMs via Iterative Pruning with Magnitude Compensation

  • Xinrui Chen
  • Hongxing Zhang
  • Fanyi Zeng
  • Yongxian Wei
  • Yizhi Wang
  • Xitong Ling
  • Guanghao Li
  • Chun Yuan

Layer pruning is a viable technique for compressing large language models while achieving acceleration proportional to the pruning ratio. In this work, we identify that removing any layer induces a magnitude gap in hidden states, and demonstrate that a simple compensation operation leads to superior performance in iterative layer pruning. This key observation motivates us to propose Prune&Comp, a novel, plug-and-play iterative layer pruning scheme that leverages magnitude compensation to mitigate such gaps in a training-free manner. Specifically, we first estimate the magnitude gap of layer removal and then eliminate it by rescaling the remaining weights offline. We further demonstrate the advantages of Prune&Comp in improving the stability of iterative pruning. When integrated with an iterative prune-and-compensate loop, Prune&Comp consistently enhances existing layer pruning metrics. For instance, when 5 layers of LLaMA-3-8B are pruned with the prevalent Taylor+ metric, Prune&Comp reduces PPL from 512.78 to 16.34 and retains 90.57% of the original performance across 9 question-answering tasks, outperforming the baseline by 24.72%.
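
A minimal sketch of the magnitude-compensation step described above, assuming the gap is estimated as a ratio of average hidden-state norms on a small calibration batch and applied as a single offline rescaling of the remaining weights; the paper's exact estimator may differ.

```python
import numpy as np

def magnitude_gap(h_before, h_after):
    """Ratio between average hidden-state magnitudes at the pruning interface.

    h_before: activations entering the removed layers (calibration batch).
    h_after:  activations the following layer originally received.
    """
    return np.linalg.norm(h_after, axis=-1).mean() / np.linalg.norm(h_before, axis=-1).mean()

def compensate(next_layer_weight, gap):
    """Rescale the remaining weights offline (training-free) so the downstream
    layer sees activations of roughly the magnitude it was trained on."""
    return next_layer_weight * gap

# Toy calibration data standing in for hidden states around a pruned block.
rng = np.random.default_rng(0)
h_in = rng.normal(size=(32, 4096))            # input to the pruned layers
h_out = 1.3 * rng.normal(size=(32, 4096))     # output the next layer used to see
W_next = rng.normal(size=(4096, 4096))
W_next_compensated = compensate(W_next, magnitude_gap(h_in, h_out))
```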

NeurIPS Conference 2025 Conference Paper

A Simple Linear Patch Revives Layer-Pruned Large Language Models

  • Xinrui Chen
  • Haoli Bai
  • Tao Yuan
  • Ruikang Liu
  • Kang Zhao
  • Xianzhi Yu
  • Lu Hou
  • Tian Guan

Layer pruning has emerged as a widely used technique for compressing large language models (LLMs). However, existing layer pruning approaches often incur substantial performance degradation. We attribute the majority of this degradation to a single yet previously overlooked issue: the mismatch of activation magnitudes at the pruning interface. The pre-interface activations exhibit significantly different scales from the post-interface ones, causing a distributional shift that propagates through the remaining layers. To address this issue, we introduce LinearPatch, a lightweight and plug-and-play technique that fuses two operations into one matrix multiply at the pruning interface: (i) a Hadamard transformation that suppresses massive outliers at particular tokens and (ii) a channel-wise scaling that aligns activation statistics. On LLaMA-3-8B, LinearPatch preserves up to 94.15% of the original model's performance when pruning 5 out of 32 layers, outperforming the previous state of the art by 4%. The patch can be further refined with 5K unlabeled samples via memory-efficient offline distillation, pushing the retention to 95.16% within only 30 minutes on a single GPU. Code is available at https://github.com/chenxinrui-tsinghua/LinearPatch.
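
The fused patch is described as one matrix that combines a Hadamard transformation with a channel-wise scaling. The sketch below composes the two as rotate, scale, rotate back, which is only an assumed composition for illustration; the actual construction is defined in the paper and the released code.

```python
import numpy as np

def hadamard(n):
    """Sylvester-construction Hadamard matrix (n must be a power of two)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(H.shape[0])   # orthonormal rotation

def linear_patch(scale, dim):
    """Fuse channel-wise scaling with a Hadamard rotation into a single matrix.

    Composing as H^T diag(scale) H is an assumption made for illustration only.
    """
    H = hadamard(dim)
    return H.T @ np.diag(scale) @ H

dim = 8
scale = np.ones(dim)
scale[3] = 0.5                                  # hypothetical per-channel statistic
P = linear_patch(scale, dim)
h = np.random.default_rng(0).normal(size=(4, dim))
h_patched = h @ P                               # a single matrix multiply at the interface
```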

AAAI Conference 2025 Conference Paper

Aligning Composed Query with Image via Discriminative Perception from Negative Correspondences

  • Yifan Wang
  • Wuliang Huang
  • Chun Yuan

The task of composed image retrieval aims to match the multi-modal query composed of a reference image and a modification sentence with the target image. Most current approaches narrow the distances between the composed queries and targets by investigating matched correspondences in positive triplets. Nevertheless, they are inclined to exhibit heavy reliance on partial correlations. As the negative correspondences are underestimated, semantic clues that distinguish the target from mismatched candidates are obscured by incomplete associations. Moreover, the correlations between the modification textual features and the visual variations from the reference to candidates are imperative to further strengthen the semantic discriminations. In this paper, we propose DIscriminative Perception from NEgative Correspondences (DIPNEC) to address the aforementioned issues. To encourage awareness of the differences between matched and mismatched correspondences, DIPNEC introduces optimal transport with semantic preservation for reassignments on hard negative triplets. Besides, Difference Quantization Alignments (DQA) and Composed Word-level Alignments (CWA) jointly determine the matching scores between multi-modal queries and candidates. Specifically, DQA concentrates on the correlations of textual features with source-to-target visual differences, and CWA further emphasizes the differentiated semantics. Experimental results and ablation studies on the widely used FashionIQ and CIRR datasets demonstrate the competitive performance of DIPNEC.

AAAI Conference 2025 Conference Paper

IteRPrimE: Zero-shot Referring Image Segmentation with Iterative Grad-CAM Refinement and Primary Word Emphasis

  • Yuji Wang
  • Jingchen Ni
  • Yong Liu
  • Chun Yuan
  • Yansong Tang

Zero-shot Referring Image Segmentation (RIS) identifies the instance mask that best aligns with a specified referring expression without training and fine-tuning, significantly reducing the labor-intensive annotation process. Despite achieving commendable results, previous CLIP-based models have a critical drawback: the models exhibit a notable reduction in their capacity to discern relative spatial relationships of objects. This is because they generate all possible masks on an image and evaluate each masked region for similarity to the given expression, often resulting in decreased sensitivity to direct positional clues in text inputs. Moreover, most methods have weak abilities to manage relationships between primary words and their contexts, causing confusion and reduced accuracy in identifying the correct target region. To address these challenges, we propose IteRPrimE (Iterative Grad-CAM Refinement and Primary word Emphasis), which leverages a saliency heatmap through Grad-CAM from a Vision-Language Pre-trained (VLP) model for image-text matching. An iterative Grad-CAM refinement strategy is introduced to progressively enhance the model's focus on the target region and overcome positional insensitivity, creating a self-correcting effect. Additionally, we design the Primary Word Emphasis module to help the model handle complex semantic relations, enhancing its ability to attend to the intended object. Extensive experiments conducted on the RefCOCO/+/g, and PhraseCut benchmarks demonstrate that IteRPrimE outperforms previous SOTA zero-shot methods, particularly excelling in out-of-domain scenarios.
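
A schematic sketch of the iterative refinement loop described above; `compute_gradcam` is a hypothetical callable standing in for Grad-CAM on the VLP image-text matching model, and the soft-masking rule is an assumption made only to show the self-correcting iteration.

```python
import numpy as np

def iterative_gradcam_refine(image, expression, compute_gradcam, num_iters=3):
    """Iteratively re-focus a Grad-CAM heatmap on the referred region.

    `compute_gradcam(image, text, mask)` is a hypothetical stand-in for the
    VLP image-text matching call; each round masks the image with the previous
    heatmap so the model progressively corrects its focus toward the target.
    """
    mask = np.ones(image.shape[:2])
    for _ in range(num_iters):
        heatmap = compute_gradcam(image, expression, mask)
        mask = heatmap / (heatmap.max() + 1e-8)      # soft focus for the next round
    return mask

# Toy stand-in for the Grad-CAM call, just to make the sketch executable.
rng = np.random.default_rng(0)
toy_gradcam = lambda img, text, mask: rng.random(img.shape[:2]) * mask
focus = iterative_gradcam_refine(rng.random((32, 32, 3)), "the left dog", toy_gradcam)
```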

NeurIPS Conference 2025 Conference Paper

NaDRO: Leveraging Dual-Reward Strategies for LLMs Training on Noisy Data

  • Haolong Qian
  • Xianliang Yang
  • Ling Zhang
  • Lei Song
  • Jiang Bian
  • Chun Yuan

Group Relative Policy Optimization (GRPO) fine-tuning has demonstrated significant enhancements in reasoning tasks. However, it often relies on high-quality labeled datasets, which are typically difficult to obtain. To address this challenge, we introduce Noise-Aware Dual-Reward Optimization (NaDRO), which effectively enhances the training of Large Language Models (LLMs) under noisy or ambiguous supervision. NaDRO operates through two key components: (1) a Preference-based Outcome Reward (POR), which makes a principled bias-variance tradeoff, reducing training variance by learning from robust preference rankings instead of overfitting to single-best estimates; and (2) a Context Perception Reward (CPR) mechanism, which ensures that LLMs conduct a necessary qualitative assessment of the current problem state to foster deeper situational understanding prior to decision-making. To validate our approach in a realistic decision-making testbed, we model classic combinatorial optimization problems such as the Traveling Salesman Problem (TSP) and the Capacitated Vehicle Routing Problem (CVRP) as Markov Decision Processes, generating training data via cost-limited exploration. Our results demonstrate that the fine-tuned Qwen 7B and Llama 3.1-8B models achieve statistically robust performance, significantly outperforming leading LLM baselines and standard fine-tuning methods on these complex benchmarks. Code is released at https://github.com/microsoft/HeurAgenix/tree/NaDRO.

AAAI Conference 2025 Conference Paper

Rethinking Pseudo-Label Guided Learning for Weakly Supervised Temporal Action Localization from the Perspective of Noise Correction

  • Quan Zhang
  • Yuxin Qi
  • Xi Tang
  • Rui Yuan
  • Xi Lin
  • Ke Zhang
  • Chun Yuan

Pseudo-label learning methods have been widely applied in weakly-supervised temporal action localization. Existing works directly utilize a weakly-supervised base model to generate instance-level pseudo-labels for training the fully-supervised detection head. We argue that the noise in pseudo-labels interferes with the learning of the fully-supervised detection head, leading to significant performance degradation. Issues with noisy labels include: (1) inaccurate boundary localization; (2) undetected short action clips; (3) multiple adjacent segments incorrectly detected as one segment. To target these issues, we introduce a two-stage noisy label learning strategy to harness every potentially useful signal in noisy labels. First, we propose a frame-level pseudo-label generation model with a context-aware denoising algorithm to refine the boundaries. Second, we introduce an online-revised teacher-student framework with a missing instance compensation module and an ambiguous instance correction module to solve the short-action-missing and many-to-one problems. Besides, we apply a high-quality pseudo-label mining loss in our online-revised teacher-student framework to assign different weights to the noisy labels for more effective training. Our model greatly outperforms the previous state-of-the-art methods in both detection accuracy and inference speed on the THUMOS14 and ActivityNet v1.2 benchmarks.

NeurIPS Conference 2025 Conference Paper

Safe-Sora: Safe Text-to-Video Generation via Graphical Watermarking

  • Zihan Su
  • Xuerui Qiu
  • Hongbin Xu
  • Tangyu Jiang
  • Jun-hao Zhuang
  • Chun Yuan
  • Ming Li
  • Shengfeng He

The explosive growth of generative video models has amplified the demand for reliable copyright preservation of AI-generated content. Despite its popularity in image synthesis, invisible generative watermarking remains largely underexplored in video generation. To address this gap, we propose Safe-Sora, the first framework to embed graphical watermarks directly into the video generation process. Motivated by the observation that watermarking performance is closely tied to the visual similarity between the watermark and cover content, we introduce a hierarchical coarse-to-fine adaptive matching mechanism. Specifically, the watermark image is divided into patches, each assigned to the most visually similar video frame, and further localized to the optimal spatial region for seamless embedding. To enable spatiotemporal fusion of watermark patches across video frames, we develop a 3D wavelet transform-enhanced Mamba architecture with a novel scanning strategy, effectively modeling long-range dependencies during watermark embedding and retrieval. To the best of our knowledge, this is the first attempt to apply state space models to watermarking, opening new avenues for efficient and robust watermark protection. Extensive experiments demonstrate that Safe-Sora achieves state-of-the-art performance in terms of video quality, watermark fidelity, and robustness, which we attribute largely to the proposed designs. Code and additional supporting materials are provided in the supplementary.
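
The coarse matching step, assigning each watermark patch to its most visually similar video frame, can be sketched as a cosine-similarity argmax; the feature extractor, the finer spatial localization, and the Mamba-based embedding are omitted, and the shapes below are arbitrary placeholders.

```python
import numpy as np

def assign_patches_to_frames(patch_feats, frame_feats):
    """Assign each watermark patch to the most visually similar video frame
    via cosine similarity. Only illustrates the coarse matching step."""
    p = patch_feats / np.linalg.norm(patch_feats, axis=1, keepdims=True)
    f = frame_feats / np.linalg.norm(frame_feats, axis=1, keepdims=True)
    return (p @ f.T).argmax(axis=1)               # index of the best frame per patch

rng = np.random.default_rng(0)
best_frames = assign_patches_to_frames(rng.normal(size=(16, 256)),   # 16 watermark patches
                                       rng.normal(size=(24, 256)))   # 24 video frames
```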

NeurIPS Conference 2025 Conference Paper

SCOUT: Teaching Pre-trained Language Models to Enhance Reasoning via Flow Chain-of-Thought

  • Guanghao Li
  • Wenhao Jiang
  • Mingfeng Chen
  • Yan Li
  • Hao Yu
  • Shuting Dong
  • Tao Ren
  • Ming Tang

Chain-of-Thought (CoT) prompting improves the reasoning performance of large language models (LLMs) by encouraging step-by-step thinking. However, CoT-based methods depend on intermediate reasoning steps, which limits scalability and generalization. Recent work explores recursive reasoning, where LLMs reuse internal layers across iterations to refine latent representations without explicit CoT supervision. While promising, these approaches often require costly pretraining and lack a principled framework for how reasoning should evolve across iterations. We address this gap by introducing Flow Chain-of-Thought (Flow CoT), a reasoning paradigm that models recursive inference as a progressive trajectory of latent cognitive states. Flow CoT frames each iteration as a distinct cognitive stage, deepening reasoning across iterations without relying on manual supervision. To realize this, we propose SCOUT (Stepwise Cognitive Optimization Using Teachers), a lightweight fine-tuning framework that enables Flow CoT-style reasoning without the need for pretraining. SCOUT uses progressive distillation to align each iteration with a teacher of appropriate capacity, and a cross-attention-based retrospective module that integrates outputs from previous iterations while preserving the model's original computation flow. Experiments across eight reasoning benchmarks show that SCOUT consistently improves both accuracy and explanation quality, achieving up to 1.8% gains under fine-tuning. Qualitative analyses further reveal that SCOUT enables progressively deeper reasoning across iterations, refining both belief formation and explanation granularity. These results not only validate the effectiveness of SCOUT but also demonstrate the practical viability of Flow CoT as a scalable framework for enhancing reasoning in LLMs.

NeurIPS Conference 2025 Conference Paper

Towards Implicit Aggregation: Robust Image Representation for Place Recognition in the Transformer Era

  • Feng Lu
  • Tong Jin
  • Canming Ye
  • Xiangyuan Lan
  • Yunpeng Liu
  • Chun Yuan

Visual place recognition (VPR) is typically regarded as a specific image retrieval task, whose core lies in representing images as global descriptors. Over the past decade, dominant VPR methods (e.g., NetVLAD) have followed a paradigm that first extracts the patch features/tokens of the input image using a backbone, and then aggregates these patch features into a global descriptor via an aggregator. This backbone-plus-aggregator paradigm has achieved overwhelming dominance in the CNN era and remains widely used in transformer-based models. In this paper, however, we argue that a dedicated aggregator is not necessary in the transformer era, that is, we can obtain robust global descriptors only with the backbone. Specifically, we introduce some learnable aggregation tokens, which are prepended to the patch tokens before a particular transformer block. All these tokens will be jointly processed and interact globally via the intrinsic self-attention mechanism, implicitly aggregating useful information within the patch tokens to the aggregation tokens. Finally, we only take these aggregation tokens from the last output tokens and concatenate them as the global representation. Although implicit aggregation can provide robust global descriptors in an extremely simple manner, where and how to insert additional tokens, as well as the initialization of tokens, remains an open issue worthy of further exploration. To this end, we also propose the optimal token insertion strategy and token initialization method derived from empirical studies. Experimental results show that our method outperforms state-of-the-art methods on several VPR datasets with higher efficiency and ranks 1st on the MSLS challenge leaderboard. The code is available at https://github.com/lu-feng/image.
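
A minimal PyTorch sketch of the implicit aggregation idea: learnable tokens are prepended to the patch tokens before a few self-attention blocks and then read back out as the global descriptor. The token count, insertion depth, dimensions, and initialization below are placeholders, not the settings the paper found optimal.

```python
import torch
import torch.nn as nn

class ImplicitAggregation(nn.Module):
    """Prepend learnable aggregation tokens before transformer blocks and read
    them back out as the global descriptor. A minimal sketch of the idea."""

    def __init__(self, dim=384, num_agg_tokens=4, depth=2, heads=6):
        super().__init__()
        self.agg_tokens = nn.Parameter(torch.zeros(1, num_agg_tokens, dim))
        nn.init.trunc_normal_(self.agg_tokens, std=0.02)
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, patch_tokens):               # (B, N, dim) from the backbone
        b = patch_tokens.size(0)
        tokens = torch.cat([self.agg_tokens.expand(b, -1, -1), patch_tokens], dim=1)
        tokens = self.blocks(tokens)                # self-attention mixes patch info in
        agg = tokens[:, : self.agg_tokens.size(1)]  # keep only the aggregation tokens
        return nn.functional.normalize(agg.flatten(1), dim=-1)

desc = ImplicitAggregation()(torch.randn(2, 196, 384))   # (2, 4*384) global descriptors
```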

NeurIPS Conference 2025 Conference Paper

Towards Robust Uncertainty Calibration for Composed Image Retrieval

  • Yifan Wang
  • Wuliang Huang
  • Yufan Wen
  • Shunning Liu
  • Chun Yuan

The interactive task of composed image retrieval aims to retrieve the most relevant images with the bi-modal query, consisting of a reference image and a modification sentence. Despite significant efforts to bridge the heterogeneous gap within the bi-modal query and leverage contrastive learning to reduce the disparity between positive and negative triplets, prior methods often fail to ensure reliable matching due to aleatoric and epistemic uncertainty. Specifically, the aleatoric uncertainty stems from underlying semantic correlations within candidate instances and annotation noise, and the epistemic uncertainty is usually caused by overconfidence in dominant semantic categories. In this paper, we propose Robust UNcertainty Calibration (RUNC) to quantify the uncertainty and calibrate the imbalanced semantic distribution. To mitigate semantic ambiguity in similarity distribution between fusion queries and targets, RUNC maximizes the matching evidence by utilizing a high-order conjugate prior distribution to fit the semantic covariances in candidate samples. With the estimated uncertainty coefficient of each candidate, the target distribution is calibrated to encourage balanced semantic alignment. Additionally, we minimize the ambiguity in the fusion evidence when forming the unified query by incorporating orthogonal constraints on explicit textual embeddings and implicit queries, to reduce the representation redundancy. Extensive experiments and ablation analysis on benchmark datasets FashionIQ and CIRR verify the robustness of RUNC in predicting reliable retrieval results from a large image gallery.

AAAI Conference 2024 Conference Paper

Blind Face Restoration under Extreme Conditions: Leveraging 3D-2D Prior Fusion for Superior Structural and Texture Recovery

  • Zhengrui Chen
  • Liying Lu
  • Ziyang Yuan
  • Yiming Zhu
  • Yu Li
  • Chun Yuan
  • Weihong Deng

Blind face restoration under extreme conditions involves reconstructing high-quality face images from severely degraded inputs. These input images are often of poor quality and exhibit extreme facial poses, leading to errors in facial structure and unnatural artifacts within the restored images. In this paper, we show that utilizing 3D priors effectively compensates for structure knowledge deficiencies in 2D priors while preserving the texture details. Based on this, we introduce FREx (Face Restoration under Extreme conditions), which combines structure-accurate 3D priors and texture-rich 2D priors in pretrained generative networks for blind face restoration under extreme conditions. To fuse the different information in 3D and 2D priors, we introduce an adaptive weight module that adjusts the importance of features based on the input image's condition. With this approach, our model can restore structure-accurate and natural-looking faces even when the images have lost a lot of information due to degradation and extreme pose. Extensive experimental results on synthetic and real-world datasets validate the effectiveness of our method.

AAAI Conference 2024 Conference Paper

Deep Homography Estimation for Visual Place Recognition

  • Feng Lu
  • Shuting Dong
  • Lijun Zhang
  • Bingxi Liu
  • Xiangyuan Lan
  • Dongmei Jiang
  • Chun Yuan

Visual place recognition (VPR) is a fundamental task for many applications such as robot localization and augmented reality. Recently, hierarchical VPR methods have received considerable attention due to their trade-off between accuracy and efficiency. They usually first use global features to retrieve candidate images, then verify the spatial consistency of matched local features for re-ranking. However, the latter typically relies on the RANSAC algorithm for fitting homography, which is time-consuming and non-differentiable. This forces existing methods to compromise by training the network only for global feature extraction. Here, we propose a transformer-based deep homography estimation (DHE) network that takes the dense feature map extracted by a backbone network as input and fits homography for fast and learnable geometric verification. Moreover, we design a re-projection error of inliers loss to train the DHE network without additional homography labels; it can also be jointly trained with the backbone network to help it extract features that are more suitable for local matching. Extensive experiments on benchmark datasets show that our method outperforms several state-of-the-art methods, and it is more than one order of magnitude faster than mainstream hierarchical VPR methods using RANSAC. The code is released at https://github.com/Lu-Feng/DHE-VPR.
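
The re-projection error of inliers loss can be sketched as below: matched keypoints are warped by the predicted homography and only points whose error falls under a threshold contribute, so no homography labels are needed. The threshold and normalization are illustrative assumptions.

```python
import torch

def inlier_reprojection_loss(H, pts_src, pts_dst, thresh=3.0):
    """Re-projection error restricted to inliers, usable without homography labels.

    H: (B, 3, 3) predicted homographies; pts_src/pts_dst: (B, N, 2) matched points.
    The inlier threshold and the averaging scheme are illustrative choices.
    """
    ones = torch.ones_like(pts_src[..., :1])
    src_h = torch.cat([pts_src, ones], dim=-1)             # homogeneous coordinates
    proj = src_h @ H.transpose(1, 2)
    proj = proj[..., :2] / proj[..., 2:].clamp(min=1e-8)   # back to Euclidean coords
    err = (proj - pts_dst).norm(dim=-1)                    # (B, N) per-point error
    inliers = (err < thresh).float()
    return (err * inliers).sum() / inliers.sum().clamp(min=1.0)

loss = inlier_reprojection_loss(torch.eye(3).repeat(2, 1, 1),
                                torch.rand(2, 100, 2) * 64,
                                torch.rand(2, 100, 2) * 64)
```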

AAAI Conference 2024 Conference Paper

Efficient Conditional Diffusion Model with Probability Flow Sampling for Image Super-resolution

  • Yutao Yuan
  • Chun Yuan

Image super-resolution is a fundamentally ill-posed problem because multiple valid high-resolution images exist for one low-resolution image. Super-resolution methods based on diffusion probabilistic models can deal with the ill-posed nature by learning the distribution of high-resolution images conditioned on low-resolution images, avoiding the problem of blurry images in PSNR-oriented methods. However, existing diffusion-based super-resolution methods have high time consumption with the use of iterative sampling, while the quality and consistency of generated images are less than ideal due to problems like color shifting. In this paper, we propose Efficient Conditional Diffusion Model with Probability Flow Sampling (ECDP) for image super-resolution. To reduce the time consumption, we design a continuous-time conditional diffusion model for image super-resolution, which enables the use of probability flow sampling for efficient generation. Additionally, to improve the consistency of generated images, we propose a hybrid parametrization for the denoiser network, which interpolates between the data-predicting parametrization and the noise-predicting parametrization for different noise scales. Moreover, we design an image quality loss as a complement to the score matching loss of diffusion models, further improving the consistency and quality of super-resolution. Extensive experiments on DIV2K, ImageNet, and CelebA demonstrate that our method achieves higher super-resolution quality than existing diffusion-based image super-resolution methods while having lower time consumption. Our code is available at https://github.com/Yuan-Yutao/ECDP.

JAIR Journal 2024 Journal Article

Efficiently Adapt to New Dynamic via Meta-Model

  • Kaixin Huang
  • Chen Zhao
  • Chun Yuan

We delve into the realm of offline meta-reinforcement learning (OMRL), a practical paradigm in the field of reinforcement learning that leverages offline data to adapt to new tasks. While prior approaches have not explored the utilization of context-based dynamical models to tackle OMRL problems, our research endeavors to fill this gap. Our investigation uncovers shortcomings in existing context-based methods, primarily related to distribution shifts during offline learning and challenges in establishing stable task representations. To address these issues, we formulate the problem as Hidden-Parameter MDPs and propose a framework for effective model adaptation using meta-models plus latent variables, which are inferred by a transformer-based system recognition module trained in an unsupervised fashion. Through extensive experimentation encompassing diverse simulated robotics and control tasks, we validate the efficacy of our approach, demonstrate its superior generalization ability compared to existing schemes, and explore multiple strategies for obtaining policies with personalized models. Our method achieves a model with reduced prediction error, outperforms previous methods in policy performance, and facilitates efficient adaptation compared to prior dynamics-model generalization methods and OMRL algorithms.

AAAI Conference 2024 Conference Paper

Mean Teacher DETR with Masked Feature Alignment: A Robust Domain Adaptive Detection Transformer Framework

  • Weixi Weng
  • Chun Yuan

Unsupervised domain adaptation object detection (UDAOD) research on the Detection Transformer (DETR) mainly focuses on feature alignment, and existing methods can be divided into two kinds, each with its own unresolved issues. One-stage feature alignment methods can easily lead to performance fluctuation and training stagnation. Two-stage feature alignment methods based on mean teacher comprise a pretraining stage followed by a self-training stage, each facing problems in obtaining a reliable pretrained model and achieving consistent performance gains. The methods mentioned above have not yet explored how to utilize a third related domain, such as a target-like domain, to assist adaptation. To address these issues, we propose a two-stage framework named MTM, i.e., Mean Teacher-DETR with Masked Feature Alignment. In the pretraining stage, we utilize labeled target-like images produced by image style transfer to avoid performance fluctuation. In the self-training stage, we leverage unlabeled target images via pseudo-labels based on the mean teacher and propose a module called Object Queries Knowledge Transfer (OQKT) to ensure consistent performance gains of the student model. Most importantly, we propose masked feature alignment methods, including Masked Domain Query-based Feature Alignment (MDQFA) and Masked Token-wise Feature Alignment (MTWFA), to alleviate domain shift in a more robust way, which not only prevents training stagnation and leads to a robust pretrained model in the pretraining stage, but also enhances the model's target performance in the self-training stage. Experiments on three challenging scenarios and a theoretical analysis verify the effectiveness of MTM.

NeurIPS Conference 2024 Conference Paper

SuperVLAD: Compact and Robust Image Descriptors for Visual Place Recognition

  • Feng Lu
  • Xinyao Zhang
  • Canming Ye
  • Shuting Dong
  • Lijun Zhang
  • Xiangyuan Lan
  • Chun Yuan

Visual place recognition (VPR) is an essential task for multiple applications such as augmented reality and robot localization. Over the past decade, mainstream methods in the VPR area have been to use feature representation based on global aggregation, as exemplified by NetVLAD. These features are suitable for large-scale VPR and robust against viewpoint changes. However, the VLAD-based aggregation methods usually learn a large number of (e.g., 64) clusters and their corresponding cluster centers, which directly leads to a high dimension of the yielded global features. More importantly, when there is a domain gap between the data in training and inference, the cluster centers determined on the training set are usually improper for inference, resulting in a performance drop. To this end, we first attempt to improve NetVLAD by removing the cluster center and setting only a small number of (e.g., only 4) clusters. The proposed method not only simplifies NetVLAD but also enhances the generalizability across different domains. We name this method SuperVLAD. In addition, by introducing ghost clusters that will not be retained in the final output, we further propose a very low-dimensional 1-Cluster VLAD descriptor, which has the same dimension as the output of GeM pooling but performs notably better. Experimental results suggest that, when paired with a transformer-based backbone, our SuperVLAD shows better domain generalization performance than NetVLAD with significantly fewer parameters. The proposed method also surpasses state-of-the-art methods with lower feature dimensions on several benchmark datasets. The code is available at https://github.com/lu-feng/SuperVLAD.
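
A minimal sketch of a center-free, few-cluster VLAD-style aggregator consistent with the description above: patch features are softly assigned to a handful of clusters and summed directly, with no cluster centers subtracted. Ghost clusters, the 1-Cluster variant, and the exact assignment network are omitted, and all dimensions are placeholders.

```python
import torch
import torch.nn as nn

class CenterFreeVLAD(nn.Module):
    """VLAD-style aggregation with only a few clusters and no cluster centers:
    softly assign each patch feature to clusters and sum the features themselves."""

    def __init__(self, dim=768, clusters=4):
        super().__init__()
        self.assign = nn.Linear(dim, clusters)

    def forward(self, x):                          # x: (B, N, dim) patch features
        a = self.assign(x).softmax(dim=-1)         # (B, N, K) soft assignments
        v = torch.einsum('bnk,bnd->bkd', a, x)     # per-cluster sums, no centers subtracted
        v = nn.functional.normalize(v, dim=-1)     # intra-normalization per cluster
        return nn.functional.normalize(v.flatten(1), dim=-1)

desc = CenterFreeVLAD()(torch.randn(2, 196, 768))  # (2, 4*768) compact global descriptor
```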

AAAI Conference 2023 Conference Paper

Darwinian Model Upgrades: Model Evolving with Selective Compatibility

  • Binjie Zhang
  • Shupeng Su
  • Yixiao Ge
  • Xuyuan Xu
  • Yexin Wang
  • Chun Yuan
  • Mike Zheng Shou
  • Ying Shan

The traditional model upgrading paradigm for retrieval requires recomputing all gallery embeddings before deploying the new model (dubbed as "backfilling"), which is quite expensive and time-consuming considering billions of instances in industrial applications. BCT presents the first step towards backward-compatible model upgrades to get rid of backfilling. It is workable but leaves the new model in a dilemma between new feature discriminativeness and new-to-old compatibility due to the undifferentiated compatibility constraints. In this work, we propose Darwinian Model Upgrades (DMU), which disentangle the inheritance and variation in the model evolving with selective backward compatibility and forward adaptation, respectively. The old-to-new heritable knowledge is measured by old feature discriminativeness, and the gallery features, especially those of poor quality, are evolved in a lightweight manner to become more adaptive in the new latent space. We demonstrate the superiority of DMU through comprehensive experiments on large-scale landmark retrieval and face recognition benchmarks. DMU effectively alleviates the new-to-new degradation while at the same time improving new-to-old compatibility, rendering a more proper model upgrading paradigm in large-scale retrieval systems. Code: https://github.com/TencentARC/OpenCompatible.

IJCAI Conference 2023 Conference Paper

DFVSR: Directional Frequency Video Super-Resolution via Asymmetric and Enhancement Alignment Network

  • Shuting Dong
  • Feng Lu
  • Zhe Wu
  • Chun Yuan

Recently, techniques utilizing frequency-based methods have gained significant attention, as they exhibit exceptional restoration capabilities for detail and structure in video super-resolution tasks. However, most of these frequency-based methods have three major limitations: 1) insufficient exploration of object motion information, 2) inadequate enhancement of high-fidelity regions, and 3) loss of spatial information during convolution. In this paper, we propose a novel network, Directional Frequency Video Super-Resolution (DFVSR), to address these limitations. Specifically, we reconsider object motion from a new perspective and propose Directional Frequency Representation (DFR), which not only borrows the property of frequency representation of detail and structure information but also contains the direction information of the object motion that is extremely significant in videos. Based on this representation, we propose a Directional Frequency-Enhanced Alignment (DFEA) that uses double enhancements of task-related information to ensure the retention of high-fidelity frequency regions and generate high-quality alignment features. Furthermore, we design a novel asymmetrical U-shaped network architecture to progressively fuse these alignment features and produce the final output. This architecture enables the intercommunication of the same level of resolution in the encoder and decoder to achieve the supplement of spatial information. Powered by the above designs, our method achieves superior performance over state-of-the-art models on both quantitative and qualitative evaluations.

JAIR Journal 2023 Journal Article

Efficient Multi-Goal Reinforcement Learning via Value Consistency Prioritization

  • Jiawei Xu
  • Shuxing Li
  • Rui Yang
  • Chun Yuan
  • Lei Han

Goal-conditioned reinforcement learning (RL) with sparse rewards remains a challenging problem in deep RL. Hindsight Experience Replay (HER) has been demonstrated to be an effective solution, where HER replaces desired goals in failed experiences with practically achieved states. Existing approaches mainly focus on either exploration or exploitation to improve the performance of HER. From a joint perspective, exploiting specific past experiences can also implicitly drive exploration. Therefore, we concentrate on prioritizing both original and relabeled samples for efficient goal-conditioned RL. To achieve this, we propose a novel value consistency prioritization (VCP) method, where the priority of samples is determined by the consistency of ensemble Q-values. This distinguishes VCP from most existing prioritization approaches, which prioritize samples based on the uncertainty of ensemble Q-values. Through extensive experiments, we demonstrate that VCP achieves significantly higher sample efficiency than existing algorithms on a range of challenging goal-conditioned manipulation tasks. We also visualize how VCP prioritizes good experiences to enhance policy learning.
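
A toy sketch of prioritizing replay samples by the consistency of ensemble Q-values (low disagreement across heads means high priority), in contrast to uncertainty-based prioritization; the softmax mapping and temperature are assumptions made for illustration only.

```python
import numpy as np

def vcp_priorities(q_ensemble, temperature=1.0):
    """Sampling priorities from ensemble Q-value *consistency*: transitions on
    which the Q-heads agree get higher priority, the opposite of schemes that
    prioritize high-variance (uncertain) samples."""
    std = q_ensemble.std(axis=0)                  # (batch,) disagreement across heads
    scores = -std / temperature
    p = np.exp(scores - scores.max())
    return p / p.sum()

q = np.random.default_rng(0).normal(size=(5, 256))   # 5 Q-heads, 256 sampled transitions
weights = vcp_priorities(q)                          # sampling distribution over the replay batch
```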

NeurIPS Conference 2023 Conference Paper

MeGraph: Capturing Long-Range Interactions by Alternating Local and Hierarchical Aggregation on Multi-Scaled Graph Hierarchy

  • Honghua Dong
  • Jiawei Xu
  • Yu Yang
  • Rui Zhao
  • Shiwen Wu
  • Chun Yuan
  • Xiu Li
  • Chris J. Maddison

Graph neural networks, which typically exchange information between local neighbors, often struggle to capture long-range interactions (LRIs) within the graph. Building a graph hierarchy via graph pooling methods is a promising approach to address this challenge; however, hierarchical information propagation cannot entirely take over the role of local information aggregation. To balance locality and hierarchy, we integrate the local and hierarchical structures, represented by intra- and inter-graphs respectively, of a multi-scale graph hierarchy into a single mega graph. Our proposed MeGraph model consists of multiple layers alternating between local and hierarchical information aggregation on the mega graph. Each layer first performs local-aware message-passing on graphs of varied scales via the intra-graph edges, then fuses information across the entire hierarchy along the bidirectional pathways formed by inter-graph edges. By repeating this fusion process, local and hierarchical information could intertwine and complement each other. To evaluate our model, we establish a new Graph Theory Benchmark designed to assess LRI capture ability, in which MeGraph demonstrates dominant performance. Furthermore, MeGraph exhibits superior or equivalent performance to state-of-the-art models on the Long Range Graph Benchmark. The experimental results on commonly adopted real-world datasets further demonstrate the broad applicability of MeGraph.

AAAI Conference 2023 Conference Paper

Truncate-Split-Contrast: A Framework for Learning from Mislabeled Videos

  • Zixiao Wang
  • Junwu Weng
  • Chun Yuan
  • Jue Wang

Learning with noisy labels is a classic problem that has been extensively studied for image tasks, but much less so for videos. A straightforward migration from images to videos without considering temporal semantics and computational cost is not a sound choice. In this paper, we propose two new strategies for video analysis with noisy labels: 1) a lightweight channel selection method dubbed Channel Truncation for feature-based label noise detection. This method selects the most discriminative channels to split clean and noisy instances in each category. 2) A novel contrastive strategy dubbed Noise Contrastive Learning, which constructs the relationship between clean and noisy instances to regularize model training. Experiments on three well-known benchmark datasets for video classification show that our proposed truNcatE-split-contrAsT (NEAT) significantly outperforms the existing baselines. By reducing the feature dimension to 10% of the original, our method achieves a noise detection F1-score above 0.4 and a 5% classification accuracy improvement on the Mini-Kinetics dataset under severe noise (symmetric-80%). Thanks to Noise Contrastive Learning, the average classification accuracy improvement on Mini-Kinetics and Sth-Sth-V1 is over 1.6%.
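
A toy sketch of the channel-truncation idea: keep only a small fraction of channels judged most discriminative for a category and split instances in that reduced space. Scoring channels by within-category variance and splitting at the median distance are stand-in choices, not the criteria used in the paper.

```python
import numpy as np

def channel_truncation(feats, keep_ratio=0.1):
    """Keep the most 'discriminative' channels of one category's features.

    Channel variance is only a placeholder for the selection criterion."""
    k = max(1, int(feats.shape[1] * keep_ratio))
    idx = np.argsort(feats.var(axis=0))[::-1][:k]
    return feats[:, idx], idx

def split_clean_noisy(feats_trunc):
    """Flag instances far from the category centroid (in the truncated space) as noisy."""
    center = feats_trunc.mean(axis=0)
    dist = np.linalg.norm(feats_trunc - center, axis=1)
    return dist <= np.median(dist)                # True = treated as clean

feats = np.random.default_rng(0).normal(size=(128, 2048))   # one category's video features
trunc, kept_channels = channel_truncation(feats)
clean_mask = split_clean_noisy(trunc)
```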

IJCAI Conference 2022 Conference Paper

Cross-modal Representation Learning and Relation Reasoning for Bidirectional Adaptive Manipulation

  • Lei Li
  • Kai Fan
  • Chun Yuan

Since single-modal controllable manipulation typically requires supervision of information from other modalities or cooperation with complex software and experts, this paper addresses the problem of cross-modal adaptive manipulation (CAM). The novel task performs cross-modal semantic alignment from mutual supervision and implements bidirectional exchange of attributes, relations, or objects in parallel, benefiting both modalities while significantly reducing manual effort. We introduce a robust solution for CAM, which includes two essential modules, namely Heterogeneous Representation Learning (HRL) and Cross-modal Relation Reasoning (CRR). The former is designed to perform representation learning for cross-modal semantic alignment on heterogeneous graph nodes. The latter is adopted to identify and exchange the focused attributes, relations, or objects in both modalities. Our method produces pleasing cross-modal outputs on CUB and Visual Genome.

NeurIPS Conference 2022 Conference Paper

One Model to Edit Them All: Free-Form Text-Driven Image Manipulation with Semantic Modulations

  • Yiming Zhu
  • Hongyu Liu
  • Yibing Song
  • Ziyang Yuan
  • Xintong Han
  • Chun Yuan
  • Qifeng Chen
  • Jue Wang

Free-form text prompts allow users to describe their intentions during image manipulation conveniently. Based on the visual latent space of StyleGAN [21] and the text embedding space of CLIP [34], studies focus on how to map these two latent spaces for text-driven attribute manipulations. Currently, the latent mapping between these two spaces is empirically designed, which confines each manipulation model to handling only one fixed text prompt. In this paper, we propose a method named Free-Form CLIP (FFCLIP), aiming to establish an automatic latent mapping so that one manipulation model handles free-form text prompts. Our FFCLIP has a cross-modality semantic modulation module containing semantic alignment and injection. The semantic alignment performs the automatic latent mapping via linear transformations with a cross-attention mechanism. After alignment, we inject semantics from text prompt embeddings into the StyleGAN latent space. For one type of image (e.g., 'human portrait'), one FFCLIP model can be learned to handle free-form text prompts. Meanwhile, we observe that although each training text prompt only contains a single semantic meaning, FFCLIP can leverage text prompts with multiple semantic meanings for image manipulation. In the experiments, we evaluate FFCLIP on three types of images (i.e., 'human portraits', 'cars', and 'churches'). Both visual and numerical results show that FFCLIP effectively produces semantically accurate and visually realistic images. Project page: https://github.com/KumapowerLIU/FFCLIP.

IJCAI Conference 2022 Conference Paper

Towards Universal Backward-Compatible Representation Learning

  • Binjie Zhang
  • Yixiao Ge
  • Yantao Shen
  • Shupeng Su
  • Fanzi Wu
  • Chun Yuan
  • Xuyuan Xu
  • Yexin Wang

Conventional model upgrades for visual search systems require an offline refresh of gallery features by feeding gallery images into the new model (dubbed as “backfill”), which is time-consuming and expensive, especially in large-scale applications. The task of backward-compatible representation learning is therefore introduced to support backfill-free model upgrades, where the new query features are interoperable with the old gallery features. Despite their success, previous works only investigated a closed-set training scenario (i.e., the new training set shares the same classes as the old one) and are limited in more realistic and challenging open-set scenarios. To this end, we first introduce a new problem of universal backward-compatible representation learning, covering all possible data splits in model upgrades. We further propose a simple yet effective method, dubbed Universal Backward-Compatible Training (UniBCT), with a novel structural prototype refinement algorithm, to learn compatible representations in all kinds of model upgrading benchmarks in a unified manner. Comprehensive experiments on the large-scale face recognition datasets MS1Mv3 and IJB-C fully demonstrate the effectiveness of our method. Source code is available at https://github.com/TencentARC/OpenCompatible.

NeurIPS Conference 2021 Conference Paper

Towards Calibrated Model for Long-Tailed Visual Recognition from Prior Perspective

  • Zhengzhuo Xu
  • Zenghao Chai
  • Chun Yuan

Real-world data universally confronts a severe class-imbalance problem and exhibits a long-tailed distribution, i.e., most labels are associated with limited instances. Naïve models supervised by such datasets prefer dominant labels, encounter a serious generalization challenge, and become poorly calibrated. We propose two novel methods from the prior perspective to alleviate this dilemma. First, we deduce a balance-oriented data augmentation named Uniform Mixup (UniMix) to promote mixup in long-tailed scenarios, which adopts an advanced mixing factor and sampler in favor of the minority classes. Second, motivated by Bayesian theory, we identify the Bayes Bias (Bayias), an inherent bias caused by the inconsistency of the prior, and compensate for it with a modification of the standard cross-entropy loss. We further prove that both proposed methods ensure classification calibration theoretically and empirically. Extensive experiments verify that our strategies contribute to a better-calibrated model, and their combination achieves state-of-the-art performance on CIFAR-LT, ImageNet-LT, and iNaturalist 2018.
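
One common way to compensate for an imbalanced prior inside the cross-entropy loss is to add the log of the class prior to the logits during training, which is in the spirit of the Bayias compensation described above; the exact term derived in the paper may differ, so treat this as a sketch.

```python
import torch
import torch.nn.functional as F

def prior_compensated_ce(logits, targets, class_counts):
    """Cross-entropy with the training-set prior added back onto the logits,
    so the model is not asked to absorb the imbalanced prior itself.
    A sketch of the idea; the paper derives its own compensation term."""
    prior = class_counts / class_counts.sum()
    return F.cross_entropy(logits + prior.log(), targets)

logits = torch.randn(16, 10)
targets = torch.randint(0, 10, (16,))
counts = torch.tensor([5000., 2000., 1000., 500., 200., 100., 50., 25., 12., 6.])
loss = prior_compensated_ce(logits, targets, counts)
```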

IJCAI Conference 2020 Conference Paper

Feature Augmented Memory with Global Attention Network for VideoQA

  • Jiayin Cai
  • Chun Yuan
  • Cheng Shi
  • Lei Li
  • Yangyang Cheng
  • Ying Shan

Recently, Recurrent Neural Network (RNN) based methods and Self-Attention (SA) based methods have achieved promising performance in Video Question Answering (VideoQA). Despite the success of these works, RNN-based methods tend to forget the global semantic contents due to the inherent drawbacks of the recurrent units themselves, while SA-based methods cannot precisely capture the dependencies of the local neighborhood, leading to insufficient modeling of temporal order. To tackle these problems, we propose a novel VideoQA framework which progressively refines the representations of videos and questions from fine to coarse grain in a sequence-sensitive manner. Specifically, our model improves the feature representations via the following two steps: (1) introducing two fine-grained feature-augmented memories to strengthen the information augmentation of video and text, which can improve memory capacity by memorizing more relevant and targeted information; (2) appending the self-attention and co-attention modules to the memory output so that the model is able to capture global interactions between high-level semantic information. Experimental results show that our approach achieves state-of-the-art performance on VideoQA benchmark datasets.

IJCAI Conference 2020 Conference Paper

HAF-SVG: Hierarchical Stochastic Video Generation with Aligned Features

  • Zhihui Lin
  • Chun Yuan
  • Maomao Li

Stochastic video generation methods predict diverse videos based on observed frames, where the main challenge lies in modeling the complex future uncertainty and generating realistic frames. Numerous recurrent-VAE-based methods have achieved state-of-the-art results. However, on the one hand, the independence assumption on the variables of the approximate posterior limits the inference performance. On the other hand, although these methods adopt skip connections between encoder and decoder to utilize multi-level features, they still produce blurry generations due to the spatial misalignment between encoder and decoder features at different time steps. In this paper, we propose a hierarchical recurrent VAE with a feature aligner, which can not only relax the independence assumption in the typical VAE but also use the feature aligner to enable the decoder to obtain aligned spatial information from the last observed frames. The proposed model is named the Hierarchical Stochastic Video Generation network with Aligned Features, referred to as HAF-SVG. Experiments on the Moving-MNIST, BAIR, and KTH datasets demonstrate that the hierarchical structure is helpful for modeling more accurate future uncertainty, and the feature aligner is beneficial for generating realistic frames. Moreover, HAF-SVG exceeds SVG in both prediction accuracy and the quality of generated frames.

AAAI Conference 2020 Conference Paper

Self-Attention ConvLSTM for Spatiotemporal Prediction

  • Zhihui Lin
  • Maomao Li
  • Zhuobin Zheng
  • Yangyang Cheng
  • Chun Yuan

Spatiotemporal prediction is challenging due to complex dynamic motion and appearance changes. Existing work concentrates on embedding additional cells into the standard ConvLSTM to memorize spatial appearances during prediction. These models always rely on convolution layers to capture spatial dependence, which are local and inefficient. However, long-range spatial dependencies are significant for spatial applications. To extract spatial features with both global and local dependencies, we introduce the self-attention mechanism into ConvLSTM. Specifically, a novel self-attention memory (SAM) is proposed to memorize features with long-range dependencies in terms of spatial and temporal domains. Based on self-attention, SAM can produce features by aggregating features across all positions of both the input itself and the memory features with pair-wise similarity scores. Moreover, the additional memory is updated by a gating mechanism on the aggregated features and an established highway with the memory of the previous time step. Therefore, through SAM, we can extract features with long-range spatiotemporal dependencies. Furthermore, we embed SAM into a standard ConvLSTM to construct a self-attention ConvLSTM (SA-ConvLSTM) for spatiotemporal prediction. In experiments, we apply SA-ConvLSTM to perform frame prediction on the MovingMNIST and KTH datasets and traffic flow prediction on the TaxiBJ dataset. Our SA-ConvLSTM achieves state-of-the-art results on both tasks with fewer parameters and higher time efficiency than previous state-of-the-art methods.
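
A stripped-down sketch of the pair-wise aggregation inside a self-attention memory: queries come from the current features, while keys and values are drawn from both the current features and the memory, so each position aggregates information from every position of both. The gated memory update and the ConvLSTM integration are omitted, and the layer sizes are placeholders.

```python
import torch
import torch.nn as nn

class SelfAttentionMemory(nn.Module):
    """Aggregate features across all spatial positions of the input and the
    memory with pair-wise similarity scores (a simplified SAM-style block)."""

    def __init__(self, channels=64):
        super().__init__()
        self.q = nn.Conv2d(channels, channels, 1)
        self.k = nn.Conv2d(channels, channels, 1)
        self.v = nn.Conv2d(channels, channels, 1)

    def forward(self, h, m):                        # h: input features, m: memory, (B, C, H, W)
        b, c, H, W = h.shape
        kv = torch.cat([h, m], dim=-1)              # attend over positions of input and memory
        q = self.q(h).flatten(2).transpose(1, 2)    # (B, HW, C)
        k = self.k(kv).flatten(2)                   # (B, C, 2*HW)
        v = self.v(kv).flatten(2).transpose(1, 2)   # (B, 2*HW, C)
        attn = torch.softmax(q @ k / c ** 0.5, dim=-1)        # pair-wise similarity scores
        return (attn @ v).transpose(1, 2).reshape(b, c, H, W)

sam = SelfAttentionMemory()
z = sam(torch.randn(1, 64, 16, 16), torch.randn(1, 64, 16, 16))
```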

AAAI Conference 2019 Conference Paper

Self-Supervised Mixture-of-Experts by Uncertainty Estimation

  • Zhuobin Zheng
  • Chun Yuan
  • Xinrui Zhu
  • Zhihui Lin
  • Yangyang Cheng
  • Cheng Shi
  • Jiahui Ye

Learning related tasks in various domains and transferring exploited knowledge to new situations is a significant challenge in Reinforcement Learning (RL). However, most RL algorithms are data-inefficient and fail to generalize in complex environments, limiting their adaptability and applicability in multi-task scenarios. In this paper, we propose Self-Supervised Mixture-of-Experts (SUM), an effective algorithm driven by predictive uncertainty estimation for multi-task RL. SUM utilizes a multi-head agent with shared parameters as experts to learn a series of related tasks simultaneously by Deep Deterministic Policy Gradient (DDPG). Each expert is extended by predictive uncertainty estimation on known and unknown states to enhance the Q-value evaluation capacity against overfitting and the overall generalization ability. These enable the agent to capture and diffuse common knowledge across different tasks, improving sample efficiency in each task and the effectiveness of expert scheduling across multiple tasks. Instead of the task-specific design of common MoEs, a self-supervised gating network is adopted to determine a potential expert to handle each interaction from unseen environments, calibrated completely by the uncertainty feedback from the experts without explicit supervision. To alleviate imbalanced expert utilization, the crux of MoE, optimization is accomplished via decayed-masked experience replay, which encourages both diversification and specialization of experts during different periods. We demonstrate that our approach learns faster and achieves better performance via efficient transfer and robust generalization, outperforming several related methods on extended OpenAI Gym MuJoCo multi-task environments.

IJCAI Conference 2018 Conference Paper

Self-Adaptive Double Bootstrapped DDPG

  • Zhuobin Zheng
  • Chun Yuan
  • Zhihui Lin
  • Yangyang Cheng
  • Hanghao Wu

The Deep Deterministic Policy Gradient (DDPG) algorithm has achieved state-of-the-art performance in high-dimensional continuous control tasks. However, due to the complexity and randomness of the environment, DDPG tends to suffer from inefficient exploration and unstable training. In this work, we propose Self-Adaptive Double Bootstrapped DDPG (SOUP), an algorithm that extends DDPG to a bootstrapped actor-critic architecture. SOUP improves the efficiency of exploration with multiple actor heads capturing more potential actions and multiple critic heads evaluating more reasonable Q-values collaboratively. The crux of the double bootstrapped architecture is to tackle the fluctuations in performance caused by multiple heads of spotty capacity varying throughout training. To alleviate the instability, a self-adaptive confidence mechanism is introduced to dynamically adjust the weights of the bootstrapped heads and enhance the ensemble performance effectively and efficiently. We demonstrate that SOUP achieves at least 45% faster learning while improving cumulative reward and stability substantially in comparison to vanilla DDPG on OpenAI Gym's MuJoCo environments.
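
The self-adaptive confidence mechanism is described only at a high level; the toy sketch below dynamically re-weights bootstrapped heads using recent TD errors as the confidence signal, which is an assumption made for illustration rather than SOUP's actual rule.

```python
import numpy as np

def head_confidences(recent_td_errors, temperature=1.0):
    """Turn each head's recent TD error into a confidence weight (lower error,
    higher weight). The confidence signal here is a stand-in, chosen only to
    illustrate dynamically re-weighting bootstrapped heads."""
    scores = -np.asarray(recent_td_errors) / temperature
    w = np.exp(scores - scores.max())
    return w / w.sum()

def ensemble_q(q_heads, weights):
    """Confidence-weighted combination of the critic heads' Q-values."""
    return (weights[:, None] * q_heads).sum(axis=0)

q_heads = np.random.default_rng(0).normal(size=(4, 32))   # 4 critic heads, 32 candidate actions
w = head_confidences([0.8, 0.3, 1.2, 0.5])                 # hypothetical recent TD errors
q = ensemble_q(q_heads, w)
```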

IJCAI Conference 2018 Conference Paper

TreeNet: Learning Sentence Representations with Unconstrained Tree Structure

  • Zhou Cheng
  • Chun Yuan
  • Jiancheng Li
  • Haiqin Yang

Recursive neural networks (RvNNs) have been proved to be an effective and promising tool for learning sentence representations by explicitly exploiting the sentence structure. However, most existing work can only exploit simple tree structures, e.g., binary trees, or ignores the order of nodes, which yields suboptimal performance. In this paper, we propose a novel neural network, namely TreeNet, to capture sentences structurally over raw unconstrained constituency trees, where the number of child nodes can be arbitrary. In TreeNet, each node learns from its left sibling and right child in a bottom-up, left-to-right order, thus enabling the network to learn over any tree. Furthermore, multiple soft gates and a memory cell are employed in implementing TreeNet to determine to what extent it should learn, remember, and output, which proves to be a simple and efficient mechanism for semantic synthesis. Moreover, TreeNet significantly outperforms convolutional neural networks (CNNs) and Long Short-Term Memory (LSTM) with fewer parameters: it improves classification accuracy by 2%-5% with 42% of the best CNN's parameters or 94% of the standard LSTM's. Extensive experiments demonstrate that TreeNet achieves state-of-the-art performance on all four typical text classification tasks.