Author name cluster

Yongjian Deng

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

16 papers
2 author rows

Possible papers (16)

AAAI Conference 2026 · Conference Paper

AIR-DR: Adaptive Image Retargeting with Instance Relocation and Dual-guidance Repainting

  • Zhitong Dong
  • Chao Li
  • Yongjian Deng
  • Hao Chen

Image retargeting aims to adjust the aspect ratio of images to accommodate various display devices. While existing methods consider both foreground semantics and background inpainting, their seam-carving-based frameworks are inherently destructive, often compromising the structural integrity of foreground instances. Furthermore, conventional inpainting models struggle to achieve pixel-level accuracy with global-only guidance, leading to local inconsistencies and background distortions. To address these challenges, we reformulate image retargeting as an instance-level re-layout task. Through Adaptive Instance Relocation and Dual-guidance Repainting (AIR-DR), our method preserves the structural integrity of the foreground and recovers the background with consistent details. Additionally, we introduce an adaptive retargeting decision that maintains robustness across challenging retargeting scenarios and arbitrary aspect ratios. Extensive experiments on multiple public datasets across various aspect ratios demonstrate that our approach consistently outperforms existing methods in both objective metrics and subjective evaluations. Comprehensive ablation studies further validate the effectiveness of each component.
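
For intuition only, here is a minimal sketch of the instance-relocation idea the abstract describes: move each foreground instance rigidly into the target canvas rather than carving through it, leaving the vacated background to a repainting model. The relocation rule and all names below are illustrative assumptions, not the paper's algorithm.

```python
def relocate_instances(boxes, src_wh, dst_wh):
    """Keep each instance rigid (original pixel size) and only move its
    centre proportionally into the target canvas, then clamp so it stays
    inside; this avoids the stretching/deletion that seam carving applies
    to foreground pixels (hypothetical simplification)."""
    sw, sh = src_wh
    dw, dh = dst_wh
    out = []
    for x, y, w, h in boxes:
        cx, cy = (x + w / 2) / sw, (y + h / 2) / sh   # relative centre
        nx, ny = cx * dw - w / 2, cy * dh - h / 2     # new top-left corner
        nx = min(max(nx, 0), max(dw - w, 0))          # clamp inside canvas
        ny = min(max(ny, 0), max(dh - h, 0))
        out.append((nx, ny, w, h))
    return out

# A 640x480 image retargeted to 480x480: the instance keeps its own size.
print(relocate_instances([(100, 50, 80, 120)], (640, 480), (480, 480)))
```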

AAAI Conference 2025 · Conference Paper

CFDM: Contrastive Fusion and Disambiguation for Multi-View Partial-Label Learning

  • Qiuru Hai
  • Yongjian Deng
  • Yuena Lin
  • Zheng Li
  • Zhen Yang
  • Gengyu Lyu

When dealing with multi-view data, the heterogeneity of data attributes across different views often leads to label ambiguity. To effectively address this challenge, this paper designs a Multi-View Partial-Label Learning (MVPLL) framework, where each training instance is described by multiple view features and associated with a set of candidate labels, among which only one is correct. The key to dealing with such a problem lies in how to effectively fuse multi-view information and accurately disambiguate these ambiguous labels. In this paper, we propose a novel approach named CFDM, which explores the consistency and complementarity of multi-view data through multi-view contrastive fusion and reduces label ambiguity through multi-class contrastive prototype disambiguation. Specifically, we first extract view-specific representations using multiple view-specific autoencoders, and then integrate multi-view information through both inter-view and intra-view contrastive fusion to enhance the distinctiveness of these representations. Afterwards, we utilize these distinctive representations to establish and update prototype vectors for each class within each view. Based on these, we apply contrastive prototype disambiguation to learn global class prototypes and accordingly reduce label ambiguity. In our model, multi-view contrastive fusion and multi-class contrastive prototype disambiguation are conducted mutually to enhance each other within a coherent framework, leading to better classification performance. Experimental results on multiple datasets demonstrate that our proposed method is superior to other state-of-the-art methods.
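
The abstract gives no implementation details; as a hedged sketch of what prototype-based disambiguation can look like, the snippet below scores instance embeddings against class prototypes and renormalizes only over each instance's candidate label set. All names and the temperature are assumptions, not CFDM's formulation.

```python
import torch
import torch.nn.functional as F

def prototype_disambiguation(embeddings, candidate_mask, prototypes, tau=0.1):
    """Score each instance against per-class prototypes and renormalise
    only over its candidate labels, yielding a pseudo-label distribution
    that shrinks the ambiguity of the candidate set (illustrative)."""
    z = F.normalize(embeddings, dim=1)            # (N, D) instance embeddings
    p = F.normalize(prototypes, dim=1)            # (C, D) class prototypes
    logits = z @ p.t() / tau                      # cosine similarities
    logits = logits.masked_fill(~candidate_mask, float("-inf"))
    return logits.softmax(dim=1)                  # zero mass outside candidates

N, D, C = 4, 16, 5
cand = torch.zeros(N, C, dtype=torch.bool)
cand[:, :2] = True                                # toy candidate sets {0, 1}
print(prototype_disambiguation(torch.randn(N, D), cand, torch.randn(C, D)))
```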

ICLR Conference 2025 · Conference Paper

Enhance Multi-View Classification Through Multi-Scale Alignment and Expanded Boundary

  • Yuena Lin
  • Yiyuan Wang
  • Gengyu Lyu
  • Yongjian Deng
  • Haichun Cai
  • Huibin Lin
  • Haobo Wang 0001
  • Zhen Yang 0004

Multi-view classification aims to unify data from multiple views so that they complementarily enhance classification performance. Unfortunately, two major problems in multi-view data damage model performance. The first is feature heterogeneity, which makes it hard to fuse features from different views. Considering this, we introduce a multi-scale alignment module, including an instance-scale alignment module and a prototype-scale alignment module, to mine commonality from an inter-view perspective and an inter-class perspective respectively, jointly alleviating feature heterogeneity. The second is information redundancy, which easily introduces ambiguous data that blur class boundaries and impair model generalization. Therefore, we propose a novel expanded boundary that extends the original class boundary with fuzzy set theory, adaptively adjusting the boundary to fit ambiguous data. By integrating the expanded boundary into the prototype-scale alignment module, our model further tightens the produced representations and reduces boundary ambiguity. Additionally, compared with the original class boundary, the expanded boundary preserves larger margins for classifying unseen data, which safeguards model generalization. Extensive experimental results across various real-world datasets demonstrate the superiority of the proposed model against existing state-of-the-art methods.
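
As a toy illustration of a fuzzy-set expanded boundary: instead of a crisp member/non-member cut at a class radius, membership decays gradually over an extra band, so ambiguous points near the boundary keep graded membership. The exact membership function below is an assumption, not the paper's.

```python
import torch

def expanded_membership(dist, radius, fuzziness=0.5):
    """Crisp boundary: member iff dist <= radius. Expanded boundary:
    membership stays 1 inside the radius, then decays linearly over a
    band of width fuzziness * radius rather than dropping to 0."""
    band = fuzziness * radius
    return (1.0 - (dist - radius) / band).clamp(0.0, 1.0)

d = torch.tensor([0.50, 1.00, 1.20, 1.60])   # distances to a class prototype
print(expanded_membership(d, radius=1.0))    # tensor([1.0, 1.0, 0.6, 0.0])
```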

NeurIPS Conference 2025 · Conference Paper

EPA: Boosting Event-based Video Frame Interpolation with Perceptually Aligned Learning

  • Yuhan Liu
  • LingHui Fu
  • Zhen Yang
  • Hao Chen
  • Youfu Li
  • Yongjian Deng

Event cameras, with their capacity to provide high temporal resolution information between frames, are increasingly utilized for video frame interpolation (VFI) in challenging scenarios characterized by high-speed motion and significant occlusion. However, prevalent issues of blur and distortion within the keyframes and ground truth data used for training and inference in these demanding conditions are frequently overlooked. This oversight impedes the perceptual realism and multi-scene generalization capabilities of existing event-based VFI (E-VFI) methods when generating interpolated frames. Motivated by the observation that semantic-perceptual discrepancies between degraded and pristine images are considerably smaller than their image-level differences, we introduce EPA. This novel E-VFI framework diverges from approaches reliant on direct image-level supervision by constructing multilevel, degradation-insensitive semantic perceptual supervisory signals to enhance the perceptual realism and multi-scene generalization of the model's predictions. Specifically, EPA operates in two phases: it first employs a DINO-based perceptual extractor, a customized style adapter, and a reconstruction generator to derive multi-layered, degradation-insensitive semantic-perceptual features ($\mathcal{S}$). Second, a novel Bidirectional Event-Guided Alignment (BEGA) module utilizes deformable convolutions to align perceptual features from keyframes to ground truth with inter-frame temporal guidance extracted from event signals. By decoupling the learning process from direct image-level supervision, EPA enhances model robustness against degraded keyframes and unreliable ground truth information. Extensive experiments demonstrate that this approach yields interpolated frames more consistent with human perceptual preferences. The code will be released upon acceptance.
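
A minimal sketch of perceptually aligned supervision in this spirit: compare prediction and ground truth in a frozen feature space instead of pixel space, so degradation in the reference matters less. The toy extractor below merely stands in for a DINO backbone; the interface and layer choices are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyExtractor(nn.Module):
    """Stand-in for a frozen perceptual backbone (e.g. DINO); returns the
    feature map after every block so losses can be taken at several depths."""
    def __init__(self, depth=6, width=8):
        super().__init__()
        self.blocks = nn.ModuleList(
            [nn.Conv2d(3 if i == 0 else width, width, 3, padding=1)
             for i in range(depth)]
        )

    def forward(self, x):
        feats = []
        for blk in self.blocks:
            x = F.relu(blk(x))
            feats.append(x)
        return feats

def perceptual_loss(extractor, pred, target, layers=(1, 3, 5)):
    """Supervise the prediction in (frozen) semantic feature space rather
    than pixel space, so blur/distortion in the reference matters less."""
    with torch.no_grad():
        feats_t = extractor(target)          # no gradients through the target
    feats_p = extractor(pred)
    return sum(F.l1_loss(feats_p[i], feats_t[i]) for i in layers)

ext = ToyExtractor().eval()
pred, gt = torch.rand(1, 3, 32, 32), torch.rand(1, 3, 32, 32)
print(perceptual_loss(ext, pred, gt))
```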

AAAI Conference 2025 · Conference Paper

ESEG: Event-Based Segmentation Boosted by Explicit Edge-Semantic Guidance

  • Yucheng Zhao
  • Gengyu Lyu
  • Ke Li
  • Zihao Wang
  • Hao Chen
  • Zhen Yang
  • Yongjian Deng

Event-based semantic segmentation (ESS) has attracted researchers' attention recently, as event cameras can solve problems such as under-/over-exposure or motion blur that are difficult for RGB cameras to handle. However, event data are noisy and sparse, making it difficult for models to locate and extract reliable cues from their sparse representations, especially when performing pixel-level tasks. In this paper, we propose a novel framework, ESEG, to alleviate this dilemma. Given that event signals relate closely to moving edges, instead of proposing complex structures and expecting them to recognize reliable edge regions behind event signals on their own, we introduce explicit edge-semantic supervision as a reference that lets the ESS model globally optimize semantics, exploiting the high confidence of event data in edge regions. In addition, we propose a fusion module named Density-Aware Dynamic-Window Cross Attention Fusion (D²CAF), in which density perception, cross-attention, and dynamic window masking mechanisms are jointly imposed to optimize edge-dense feature fusion, leveraging the characteristics of event cameras. Experimental results on the DSEC and DDD17 datasets demonstrate the efficacy of the ESEG framework and its core designs.
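
For intuition, edge-semantic supervision of the kind the abstract mentions can be derived directly from semantic labels; the sketch below marks class-change pixels and uses them as an auxiliary edge target. This is an illustrative reading, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def semantic_edges(labels):
    """Derive a binary edge map from a semantic label map: mark a pixel
    wherever its class differs from the pixel to its left or above."""
    l = labels.unsqueeze(1)                               # (B, 1, H, W) class ids
    edge = torch.zeros_like(l, dtype=torch.bool)
    edge[..., :, 1:] |= l[..., :, 1:] != l[..., :, :-1]   # horizontal change
    edge[..., 1:, :] |= l[..., 1:, :] != l[..., :-1, :]   # vertical change
    return edge.float()

labels = torch.randint(0, 3, (1, 8, 8))                   # toy semantic map
edge_target = semantic_edges(labels)
edge_logits = torch.randn_like(edge_target)               # model's edge-head output
aux_loss = F.binary_cross_entropy_with_logits(edge_logits, edge_target)
print(edge_target.sum(), aux_loss)
```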

AAAI Conference 2025 · Conference Paper

Graph Consistency and Diversity Measurement for Federated Multi-View Clustering

  • Bohang Sun
  • Yongjian Deng
  • Yuena Lin
  • Qiuru Hai
  • Zhen Yang
  • Gengyu Lyu

Federated Multi-View Clustering (FMVC) aims to learn a global clustering model from heterogeneous data distributed across different devices, where each device only stores one view of all clustering samples. The key to dealing with such a problem lies in how to effectively fuse these heterogeneous samples while strictly preserving data privacy across multiple devices. In this paper, we propose a novel structural graph learning framework named MGCD, which leverages both the consistency and diversity of multi-view graph structure across the global view-fusion server and local view-specific clients to achieve the desired clustering while better preserving data privacy. Specifically, in each local client, we design a dual autoencoder to extract the latent consensuses and specificities of each view, where self-representation construction is introduced to generate the corresponding view-specific diversity graph. In the global server, the consistency implied in the uploaded diversity graphs is further distilled and then incorporated into the consistency graph for subsequent cross-view contrastive fusion. During the training process, the server generates a global consistency graph and distributes it to each client to assist in diversity graph construction, while the clients extract view-specific information and upload it to the server for more reliable consistency graph generation. This "server-client" interaction is conducted in an iterative manner, where the consistency implied in each local client is gradually aggregated into the global consistency graph, and the final clustering results are obtained by spectral clustering on the desired global consistency graph. Extensive experiments on various datasets demonstrate the effectiveness of our proposed method on clustering federated multi-view data.
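
Self-representation graph construction, which the abstract mentions for the local clients, is a standard building block; a closed-form toy version is sketched below. The ridge regularizer and symmetrization are assumptions, not the paper's construction.

```python
import torch

def self_representation_graph(X, lam=0.1):
    """Solve C = argmin ||X - C X||_F^2 + lam ||C||_F^2 in closed form:
    C = G (G + lam I)^{-1} with G = X X^T. Each row expresses a sample as
    a combination of the others, giving a view-specific affinity graph."""
    G = X @ X.t()
    n = G.shape[0]
    C = torch.linalg.solve(G + lam * torch.eye(n), G)  # commutes since G is symmetric
    C.fill_diagonal_(0)                    # drop trivial self-edges
    return (C.abs() + C.abs().t()) / 2     # symmetrise into an affinity matrix

X = torch.randn(10, 6)                     # 10 samples of one local view
print(self_representation_graph(X).shape)  # (10, 10) graph for this client
```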

ICML Conference 2025 · Conference Paper

Improving Multimodal Learning Balance and Sufficiency through Data Remixing

  • Xiaoyu Ma
  • Hao Chen 0034
  • Yongjian Deng

Different modalities exhibit considerable gaps in their optimization trajectories, including speeds and paths, which lead to modality laziness and modality clash when jointly training multimodal models, resulting in insufficient and imbalanced multimodal learning. Existing methods focus on enforcing the weak modality by adding modality-specific optimization objectives, aligning their optimization speeds, or decomposing multimodal learning to enhance unimodal learning. These methods fail to achieve both unimodal sufficiency and multimodal balance. In this paper, we, for the first time, address both concerns by proposing multimodal Data Remixing, which decouples multimodal data and filters hard samples for each modality to mitigate modality imbalance, and then performs batch-level reassembly to align gradient directions and avoid cross-modal interference, thus enhancing unimodal learning sufficiency. Experimental results demonstrate that our method can be seamlessly integrated with existing approaches, improving accuracy by approximately 6.50% on CREMA-D and 3.41% on Kinetics-Sounds, without training set expansion or additional computational overhead during inference. The source code is available at Data Remixing.
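
A rough sketch of the data-remixing idea as the abstract states it: decouple the modalities, keep each modality's hard samples, and reassemble single-modality batches so the two gradient directions never clash within one step. Selecting hard samples by per-sample loss is an assumption here.

```python
import torch

def remix(audio, video, labels, audio_loss, video_loss, keep=0.5):
    """Decouple the modalities, keep each modality's hardest samples
    (highest per-sample loss), and emit alternating single-modality
    batches (hypothetical simplification of batch-level reassembly)."""
    k = int(keep * len(labels))
    hard_a = audio_loss.topk(k).indices          # hard samples for audio
    hard_v = video_loss.topk(k).indices          # hard samples for video
    return [
        ("audio", audio[hard_a], labels[hard_a]),
        ("video", video[hard_v], labels[hard_v]),
    ]

B = 8
batches = remix(torch.randn(B, 128), torch.randn(B, 512), torch.arange(B),
                torch.rand(B), torch.rand(B))
for modality, x, y in batches:
    print(modality, x.shape, y.shape)
```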

AAAI Conference 2025 · Conference Paper

Know Where You Are From: Event-Based Segmentation via Spatio-Temporal Propagation

  • Ke Li
  • Gengyu Lyu
  • Hao Chen
  • Bochen Xie
  • Zhen Yang
  • Youfu Li
  • Yongjian Deng

Event cameras have gained attention in segmentation due to their higher temporal resolution and dynamic range compared to traditional cameras. However, they struggle with issues like lack of color perception and triggering only at motion edges, making it hard to distinguish objects with similar contours or segment spatially continuous objects. Our work aims to address these often overlooked issues. Based on the assumption that various objects exhibit different motion patterns, we believe that embedding the historical motion states of objects into segmented scenes can effectively address these challenges. Inspired by this, we propose the ESS framework "Know Where You Are From" (KWYAF), which incorporates past motion cues through spatio-temporal propagation embedding. This framework features two core components: the Sequential Motion Encoding Module (SME) and the Event-Based Reliable Region Selection Mechanism (ER²SM). SME constructs prior motion features through spatio-temporal correlation modeling to boost the final segmentation, while ER²SM adaptively identifies high-confidence regions, embedding motion more precisely through local window masks and reliable region selection. Extensive experiments demonstrate the effectiveness of our proposed framework both quantitatively and qualitatively.
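
One plausible, assumed reading of reliable-region selection, sketched for intuition only: rank pixels by event density times model confidence and embed motion cues exclusively in the top fraction.

```python
import torch

def reliable_region_mask(event_density, confidence, keep=0.25):
    """Keep the top `keep` fraction of pixels ranked by event density
    times prediction confidence; motion features from past frames would
    then be embedded only inside this mask (illustrative)."""
    score = event_density * confidence
    thresh = torch.quantile(score.flatten(), 1 - keep)
    return score >= thresh

density = torch.rand(1, 64, 64)     # how many events hit each pixel
conf = torch.rand(1, 64, 64)        # per-pixel model confidence
mask = reliable_region_mask(density, conf)
print(mask.float().mean())          # roughly `keep` of pixels selected
```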

AAAI Conference 2025 · Conference Paper

MSV-PCT: Multi-Sparse-View Enhanced Transformer Framework for Salient Object Detection in Point Clouds

  • Zihao Wang
  • Yiming Huang
  • Gengyu Lyu
  • Yucheng Zhao
  • Ziyu Zhou
  • Bochen Xie
  • Zhen Yang
  • Yongjian Deng

Salient object detection (SOD) methods for 2D images have great significance in the field of human-computer interaction (HCI). However, as a common data format in HCI, 3D point cloud data remains underexplored for SOD. Previous works commonly treat this task as point cloud segmentation, which perceives all points in the scene for prediction. However, these methods neglect that SOD is designed to simulate human visual perception, where humans can only see visible surfaces rather than occluded points, and they may therefore fail in such situations. This paper aims to solve this problem by approximately simulating the human perception paradigm for 3D scenes. Thus, we propose a framework based on a 3D visual point cloud backbone and its multi-view projection, named MSV-PCT. Specifically, instead of relying solely on general point cloud learning frameworks, we additionally introduce multi-sparse-view learning branches to supplement the SOD perception. Furthermore, we propose a novel point cloud edge detection loss function to effectively address artifacts, enabling the accurate segmentation of the edges of salient objects from the background. Finally, to evaluate the generalization of point cloud SOD methods, we introduce a new approach to generate simulated PC-SOD datasets from RGBD-SOD data. Experiments on the simulated datasets show that MSV-PCT achieves better accuracy and robustness.
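
To illustrate the multi-sparse-view idea of perceiving visible surfaces rather than all points, here is a toy orthographic projection that keeps only the nearest point per pixel. The projection details are assumptions, not the paper's pipeline.

```python
import numpy as np

def project_view(points, yaw, res=64):
    """Rotate the cloud around the z-axis, then splat (x, z) into a depth
    image keeping only the nearest point per pixel, approximating the
    visible surface a viewer would actually see from that direction."""
    c, s = np.cos(yaw), np.sin(yaw)
    R = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    p = points @ R.T
    xz = p[:, [0, 2]]
    xz = (xz - xz.min(0)) / (np.ptp(xz, axis=0) + 1e-8)    # normalise to [0, 1]
    ij = np.clip((xz * (res - 1)).astype(int), 0, res - 1)
    depth = np.full((res, res), np.inf)
    for (i, j), d in zip(ij, p[:, 1]):                     # y = viewing depth
        depth[j, i] = min(depth[j, i], d)                  # keep nearest point
    return depth

views = [project_view(np.random.rand(2048, 3), yaw)        # sparse multi-view set
         for yaw in np.linspace(0, np.pi, 4, endpoint=False)]
print(views[0].shape)
```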

AAAI Conference 2025 · Conference Paper

Multi-View Multi-Label Classification via View-Label Matching Selection

  • Hao Wei
  • Yongjian Deng
  • Qiuru Hai
  • Yuena Lin
  • Zhen Yang
  • Gengyu Lyu

In multi-view multi-label classification (MVML), each object is described by several heterogeneous views while annotated with multiple related labels. The key to learning from such complicated data lies in how to fuse cross-view features and explore multi-label correlations, and accordingly obtain correct assignments between each object and its corresponding labels. In this paper, we propose an advanced MVML method named VAMS, which treats each object as a bag of views and reformulates the task of MVML as a “view-label” matching selection problem. Specifically, we first construct an object graph and a label graph respectively. In the object graph, nodes represent the multi-view representation of an object, and each view node is connected to its K-nearest neighbors within its own view. In the label graph, nodes represent the semantic representation of a label. Then, we connect each view node with all labels to generate the unified “view-label” matching graph. Afterwards, a graph network block is introduced to aggregate and update all nodes and edges on the matching graph, further generating a structural representation that fuses multi-view heterogeneity and multi-label correlations for each view and label. Finally, we derive a prediction score for each view-label matching and select the optimal matching via optimizing a weighted cross-entropy loss. Extensive results on various datasets verify that our proposed VAMS achieves superior or comparable performance against state-of-the-art methods.
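
A structural sketch of the unified “view-label” matching graph described above, with node features omitted; the naming and interfaces are hypothetical.

```python
import itertools

def matching_graph(view_feats, label_names, knn_pairs):
    """Nodes are one object's view representations plus all label nodes;
    edges are the given intra-view K-NN pairs plus a full bipartite
    view-label connection whose edges the network later scores."""
    view_nodes = [f"view:{i}" for i in range(len(view_feats))]
    label_nodes = [f"label:{name}" for name in label_names]
    edges = list(knn_pairs)                                    # K-NN edges
    edges += list(itertools.product(view_nodes, label_nodes))  # matching edges
    return view_nodes + label_nodes, edges

nodes, edges = matching_graph(["f1", "f2"], ["cat", "dog", "car"],
                              [("view:0", "view:1")])
print(len(nodes), len(edges))   # 5 nodes, 1 + 2*3 = 7 edges
```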

AAAI Conference 2024 · Conference Paper

A Dynamic GCN with Cross-Representation Distillation for Event-Based Learning

  • Yongjian Deng
  • Hao Chen
  • Youfu Li

Recent advances in event-based research prioritize sparsity and temporal precision. Approaches that learn sparse point-based representations through graph CNNs (GCNs) are becoming more popular. Yet, these graph techniques yield lower performance than their frame-based counterparts due to two issues: (i) biased graph structures that do not properly incorporate the varied attributes of each vertex (such as semantic, spatial, and temporal signals), resulting in inaccurate graph representations; (ii) a shortage of robust pretrained models. Here we solve the first problem by proposing a new event-based GCN (EDGCN), with a dynamic aggregation module to integrate all attributes of vertices adaptively. To address the second problem, we introduce a novel learning framework called cross-representation distillation (CRD), which leverages the dense representation of events as a cross-representation auxiliary to provide additional supervision and prior knowledge for the event graph. This frame-to-graph distillation allows us to benefit from the large-scale priors provided by CNNs while still retaining the advantages of graph-based models. Extensive experiments show our model and learning framework are effective and generalize well across multiple vision tasks.
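
Frame-to-graph distillation can be sketched with a standard knowledge-distillation loss, where a frame-based CNN teacher supervises the event-graph student; the paper's CRD framework adds more than this generic form, which is shown only for orientation.

```python
import torch
import torch.nn.functional as F

def distill_loss(graph_logits, frame_logits, labels, T=4.0, alpha=0.5):
    """Standard KD: the event-graph student matches the CNN teacher's
    softened predictions in addition to the usual task loss."""
    kd = F.kl_div(
        F.log_softmax(graph_logits / T, dim=1),
        F.softmax(frame_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)                                   # rescale soft-target gradients
    ce = F.cross_entropy(graph_logits, labels)    # hard-label task loss
    return alpha * kd + (1 - alpha) * ce

g, f = torch.randn(4, 10), torch.randn(4, 10)     # student / teacher logits
print(distill_loss(g, f, torch.randint(0, 10, (4,))))
```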

NeurIPS Conference 2024 · Conference Paper

A Motion-aware Spatio-temporal Graph for Video Salient Object Ranking

  • Hao Chen
  • Yufei Zhu
  • Yongjian Deng

Video salient object ranking aims to simulate the human attention mechanism by dynamically prioritizing the visual attraction of objects in a scene over time. Despite its numerous practical applications, this area remains underexplored. In this work, we propose a graph model for video salient object ranking. This graph simultaneously explores multi-scale spatial contrasts and intra-/inter-instance temporal correlations across frames to extract diverse spatio-temporal saliency cues. It has two advantages: 1. Unlike previous methods that only perform global inter-frame contrast or compare all proposals across frames globally, we explicitly model the motion of each instance by comparing its features with those in the same spatial region in adjacent frames, thus obtaining more accurate motion saliency cues. 2. We synchronize the spatio-temporal saliency cues in a single graph for joint optimization, which exhibits better dynamics compared to the previous stage-wise methods that prioritize spatial cues followed by temporal cues. Additionally, we propose a simple yet effective video retargeting method based on video saliency ranking. Extensive experiments demonstrate the superiority of our model in video salient object ranking and the effectiveness of the video retargeting method. Our codes/models are released at https://github.com/zyf-815/VSOR/tree/main.
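
The first advantage, per-instance local motion modeling, can be sketched as follows: pool an instance's features and compare them with the same spatial region in the adjacent frame. The interface and boxes below are assumptions, not the paper's graph construction.

```python
import torch
import torch.nn.functional as F

def local_motion_cues(feat_t, feat_prev, boxes):
    """Compare each instance's pooled feature against the feature pooled
    from the *same spatial region* of the previous frame; low similarity
    means the region's content changed, i.e. a strong motion cue."""
    cues = []
    for x0, y0, x1, y1 in boxes:
        cur = feat_t[:, y0:y1, x0:x1].mean(dim=(-2, -1))
        prev = feat_prev[:, y0:y1, x0:x1].mean(dim=(-2, -1))
        cues.append(1 - F.cosine_similarity(cur, prev, dim=0))
    return torch.stack(cues)

f_t, f_prev = torch.randn(64, 32, 32), torch.randn(64, 32, 32)
print(local_motion_cues(f_t, f_prev, [(0, 0, 8, 8), (10, 10, 20, 20)]))
```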

NeurIPS Conference 2024 · Conference Paper

Prune and Repaint: Content-Aware Image Retargeting for any Ratio

  • Feihong Shen
  • Chao Li
  • Yifeng Geng
  • Yongjian Deng
  • Hao Chen

Image retargeting is the task of adjusting the aspect ratio of images to suit different display devices or presentation environments. However, existing retargeting methods often struggle to balance the preservation of key semantics and image quality, resulting in either deformation or loss of important objects, or the introduction of local artifacts such as discontinuous pixels and inconsistent regenerated content. To address these issues, we propose a content-aware retargeting method called PruneRepaint. It incorporates semantic importance for each pixel to guide the identification of regions that need to be pruned or preserved in order to maintain key semantics. Additionally, we introduce an adaptive repainting module that selects image regions for repainting based on the distribution of pruned pixels and the proportion between foreground size and target aspect ratio, thus achieving local smoothness after pruning. By focusing on the content and structure of the foreground, our PruneRepaint approach adaptively avoids key content loss and deformation, while effectively mitigating artifacts with local repainting. We conduct experiments on the public RetargetMe benchmark and demonstrate through objective experimental results and subjective user studies that our method outperforms previous approaches in terms of preserving semantics and aesthetics, as well as better generalization across diverse aspect ratios. Codes will be available at https://github.com/fhshen2022/PruneRepaint.
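
For intuition, a deliberately coarse version of importance-guided pruning: drop the least important whole columns until the target width is reached. The actual method operates pixel-wise and pairs pruning with the adaptive repainting module described above.

```python
import numpy as np

def prune_columns(img, importance, target_w):
    """Rank columns by summed semantic importance, keep the top
    `target_w` columns, and restore their spatial order (toy version)."""
    w = img.shape[1]
    order = np.argsort(importance.sum(axis=0))   # columns, least important first
    keep = np.sort(order[w - target_w:])         # indices of kept columns
    return img[:, keep]

img = np.random.rand(480, 640, 3)
importance = np.random.rand(480, 640)            # per-pixel semantic importance
print(prune_columns(img, importance, 480).shape) # (480, 480, 3)
```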

ICRA Conference 2024 · Conference Paper

SAM-Event-Adapter: Adapting Segment Anything Model for Event-RGB Semantic Segmentation

  • Bowen Yao
  • Yongjian Deng
  • Yuhan Liu 0021
  • Hao Chen 0034
  • You-Fu Li 0001
  • Zhen Yang 0004

Semantic segmentation, a fundamental visual task ubiquitously employed in sectors ranging from transportation and robotics to healthcare, has always captivated the research community. In the wake of rapid advancements in large model research, the foundation model for semantic segmentation tasks, termed the Segment Anything Model (SAM), has been introduced. This model substantially addresses the dilemma of poor generalizability of previous segmentation models and the disadvantage in requiring to retrain the whole model on variant datasets. Nonetheless, segmentation models developed on SAM remain constrained by the inherent limitations of RGB sensors, particularly in scenarios characterized by complex lighting conditions and high-speed motion. Motivated by these observations, a natural recourse is to adapt SAM to additional visual modalities without compromising its robust generalizability. To achieve this, we introduce a lightweight SAM-Event-Adapter (SE-Adapter) module, which incorporates event camera data into a cross-modal learning architecture based on SAM, with only limited tunable parameters incremental. Capitalizing on the high dynamic range and temporal resolution afforded by event cameras, our proposed multi-modal Event-RGB learning architecture effectively augments the performance of semantic segmentation tasks. In addition, we propose a novel paradigm for representing event data in a patch format compatible with transformer-based models, employing multi-spatiotemporal scale encoding to efficiently extract motion and semantic correlations from event representations. Exhaustive empirical evaluations conducted on the DSEC-Semantic and DDD17 datasets provide validation of the effectiveness and rationality of our proposed approach.