Author name cluster

Yang Jin

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

16 papers

2 author rows

AAAI Conference 2026 Conference Paper

A Unified Self-Regulating Training Framework for Federated Deep Reinforcement Learning

Meng Xu
Xinhong Chen
Zhongying Chen
Guanyi Zhao
Yang Jin
Jianping Wang

Federated Deep Reinforcement Learning (FDRL) aims to enable distributed collaborative training of multiple DRL models while preserving privacy. Existing FDRL methods function in static client environments, but real-world scenarios often involve dynamic state transitions, such as noise, which render static model topologies inadequate and result in biased policy loss. This degrades client performance and leads to suboptimal global policies. To address this challenge, we develop a generic solution, referred to as the self-regulating training framework, which can be seamlessly integrated into existing FDRL approaches to address dynamic state transitions. Specifically, we propose a Sparse Training (ST) method that dynamically sparsifies and adjusts the topology of each model during training to maximize model performance and reduce model complexity. Additionally, we introduce an auxiliary model to adaptively regulate the policy loss of client models, mitigating loss bias and facilitating updates that yield improved returns. Experimental results demonstrate that our method enhances six state-of-the-art (SOTA) FDRL approaches across nine tasks in terms of return.


IROS Conference 2025 Conference Paper

DiffGen: Robot Demonstration Generation via Differentiable Physics Simulation, Differentiable Rendering, and Vision-Language Model

Yang Jin
Jun Lv
Shuqiang Jiang
Cewu Lu

Generating robot demonstrations through simulation is widely recognized as an effective way to scale up robot data. Previous work often trained reinforcement learning agents to generate expert policies, but this approach lacks sample efficiency. Recently, a line of work has attempted to generate robot demonstrations via differentiable simulation, which is promising but heavily relies on reward design, a labor-intensive process. In this paper, we propose DiffGen, a novel framework that integrates differentiable physics simulation, differentiable rendering, and a vision-language model to enable automatic and efficient generation of robot demonstrations. Given a simulated robot manipulation scenario and a natural language instruction, DiffGen can generate realistic robot demonstrations by minimizing the distance between the embedding of the language instruction and the embedding of the simulated observation after manipulation in representation space. The embeddings are obtained from the vision-language model, and the optimization is achieved by calculating and descending gradients through the differentiable simulation, differentiable rendering, and vision-language model components. Experiments demonstrate that with DiffGen, we could efficiently and effectively generate robot data with minimal human effort or training time. The videos of the results can be accessed at https://sites.google.com/view/diffgen.


NeurIPS Conference 2025 Conference Paper

Enhancing Consistency of Flow-Based Image Editing through Kalman Control

Haozhe Chi
Zhicheng Sun
Yang Jin
Yi Ma
Jing Wang
Yadong Mu

Flow-based generative models have gained popularity for image generation and editing. For instruction-based image editing, it is critical to ensure that modifications are confined to the targeted regions. Yet existing methods often fail to maintain consistency in non-targeted regions between the original and edited images. Our primary contribution is to identify the cause of this limitation as the error accumulation across individual editing steps and to address it by incorporating the historical editing trajectory. Specifically, we formulate image editing as a control problem and leverage the Kalman filter to integrate the historical editing trajectory. Our proposed algorithm, dubbed Kalman-Edit, reuses early-stage details from the historical trajectory to enhance the structural consistency of the editing results. To speed up editing, we introduce a shortcut technique based on approximate vector field velocity estimation. Extensive experiments on several datasets demonstrate its superior performance compared to previous state-of-the-art methods.
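The abstract names the Kalman filter as the tool for integrating the historical editing trajectory. For readers unfamiliar with it, the following is a minimal scalar sketch of the textbook predict/update recursion. It is not the Kalman-Edit algorithm; the function name `kalman_1d`, the constant-state model, and the noise values are illustrative assumptions.

```python
def kalman_1d(measurements, process_var=1e-3, measurement_var=0.1,
              initial_estimate=0.0, initial_var=1.0):
    """Textbook scalar Kalman filter for a constant-state model (illustrative only)."""
    estimate, var = initial_estimate, initial_var
    history = []
    for z in measurements:
        # Predict: the state is assumed constant, so only the uncertainty grows.
        var = var + process_var
        # Update: the Kalman gain weighs the new measurement against the prediction.
        gain = var / (var + measurement_var)
        estimate = estimate + gain * (z - estimate)
        var = (1.0 - gain) * var
        history.append(estimate)
    return history


if __name__ == "__main__":
    # Each estimate blends all earlier measurements, not just the latest one.
    print(kalman_1d([1.0, 1.2, 0.9, 1.1]))
```

The point of the sketch is only that each estimate depends on the whole measurement history, which is the property the abstract appeals to.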


AAAI Conference 2025 Conference Paper

Granularity-Adaptive Spatial Evidence Tokenization for Video Question Answering

Hao Jiang
Yang Jin
Zhicheng Sun
Kun Xu
Liwei Chen
Yang Song
Kun Gai
Yadong Mu

Video question answering plays a vital role in computer vision, and recent advances in large language models have further propelled the development of this field. However, existing video question answering techniques often face limitations in grasping fine-grained video content in spatial dimensions. This mainly stems from the fixed and low-resolution input of video frames. While some approaches using high-resolution inputs partially alleviate this problem, they introduce excessive computational burdens by encoding the entire high-resolution image. In this work, we propose a granularity-adaptive spatial evidence tokenization model for video question answering. Our method introduces multi-granular visual tokenization in the spatial dimension to produce video tokens at various granularities based on the question. It highlights spatially activated patches at low resolutions through a granularity weighting module and then adaptively encodes these activated patches at high resolution for detail supplementation. To mitigate the computational overhead associated with high-resolution frame encoding, a masking and acceleration module is developed for efficient visual tokenization. Moreover, a granularity compression module is designed to dynamically select and compress visual tokens of varying granularities based on questions. We conduct extensive experiments on 11 mainstream video question answering datasets and the experimental results demonstrate the effectiveness of our proposed method.


IROS Conference 2025 Conference Paper

Knowledge-Driven Imitation Learning: Enabling Generalization Across Diverse Conditions

Zhuochen Miao
Jun Lv
Hongjie Fang
Yang Jin
Cewu Lu

Imitation learning has emerged as a powerful paradigm in robot manipulation, yet its generalization capability remains constrained by object-specific dependencies in limited expert demonstrations. To address this challenge, we propose knowledge-driven imitation learning, a framework that leverages external structural semantic knowledge to abstract object representations within the same category. We introduce a novel semantic keypoint graph as a knowledge template and develop a coarse-to-fine template-matching algorithm that optimizes both structural consistency and semantic similarity. Evaluated on three real-world robotic manipulation tasks, our method achieves superior performance, surpassing image-based diffusion policies with only one-quarter of the expert demonstrations. Extensive experiments further demonstrate its robustness across novel objects, backgrounds, and lighting conditions. This work pioneers a knowledge-driven approach to data-efficient robotic learning in real-world settings. Code and more materials are available on knowledge-driven.github.io.


ICLR Conference 2025 Conference Paper

Pyramidal Flow Matching for Efficient Video Generative Modeling

Yang Jin
Zhicheng Sun 0001
Ningyuan Li 0002
Kun Xu 0005
Hao Jiang
Nan Zhuang
Quzhe Huang
Yang Song 0008

Video generation requires modeling a vast spatiotemporal space, which demands significant computational resources and data usage. To reduce the complexity, the prevailing approaches employ a cascaded architecture to avoid direct training with full resolution latent. Despite reducing computational demands, the separate optimization of each sub-stage hinders knowledge sharing and sacrifices flexibility. This work introduces a unified pyramidal flow matching algorithm. It reinterprets the original denoising trajectory as a series of pyramid stages, where only the final stage operates at the full resolution, thereby enabling more efficient video generative modeling. Through our sophisticated design, the flows of different pyramid stages can be interlinked to maintain continuity. Moreover, we craft autoregressive video generation with a temporal pyramid to compress the full-resolution history. The entire framework can be optimized in an end-to-end manner and with a single unified Diffusion Transformer (DiT). Extensive experiments demonstrate that our method supports generating high-quality 5-second (up to 10-second) videos at 768p resolution and 24 FPS within 20.7k A100 GPU training hours. All code and models are open-sourced at https://pyramid-flow.github.io.


IROS Conference 2025 Conference Paper

SIME: Enhancing Policy Self-Improvement with Modal-level Exploration

Yang Jin
Jun Lv
Wenye Yu
Hongjie Fang
Yonglu Li 0001
Cewu Lu

Self-improvement requires robotic systems to initially learn from human-provided data and then gradually enhance their capabilities through interaction with the environment. This is similar to how humans improve their skills through continuous practice. However, achieving effective self-improvement is challenging, primarily because robots tend to repeat their existing abilities during interactions, often failing to generate new, valuable data for learning. In this paper, we identify the key to successful self-improvement: modal-level exploration and data selection. By incorporating a modal-level exploration mechanism during policy execution, the robot can produce more diverse and multi-modal interactions. At the same time, we select the most valuable trials and high-quality segments from these interactions for learning. We successfully demonstrate effective robot self-improvement on both simulation benchmarks and real-world experiments. The capability for self-improvement will enable us to develop more robust and high-success-rate robotic control strategies at a lower cost. Our code and experiment scripts are available at ericjin2002.github.io/SIME.


ICRA Conference 2024 Conference Paper

CTA-LO: Accurate and Robust LiDAR Odometry Using Continuous-Time Adaptive Estimation

Yuezhang Lv
Yunzhou Zhang
Xiaoyu Zhao
Wu Li
Jian Ning
Yang Jin

Accurate and robust LiDAR odometry is a crucial technology for robot localization. However, motion distortion and ranging error make it a bottleneck. Most existing methods are limited in accuracy and robustness because they simply compensate for motion distortion with a constant-velocity motion assumption and lack an accurate model of ranging error. In this paper, we propose a high-precision and robust LiDAR odometry (LO), which utilizes continuous-time estimation to remove LiDAR distortion and builds a spot uncertainty model to quantify the ranging error. Generally, the number of variables in continuous-time estimation is several times higher than that in discrete-time estimation, leading to insufficient constraints on the LiDAR odometry. To solve this problem, we propose a marginalization method to retain prior scans’ constraints by exploiting the local support property of the B-spline. To further improve the odometry accuracy, we propose a residual adaptive weighting method and a probabilistic point cloud map based on the spot uncertainty model of LiDAR points. The experimental results show that our method outperforms state-of-the-art LiDAR odometry in accuracy and robustness.
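The abstract relies on the local support property of the B-spline. As a small illustration of that property alone (not of the CTA-LO marginalization method), the sketch below evaluates a uniform cubic B-spline over scalar control points: any query time is influenced by exactly four neighbouring control points. The function names and the use of scalar control points are assumptions made for illustration.

```python
def cubic_bspline_weights(u):
    """Uniform cubic B-spline basis weights for a local parameter u in [0, 1)."""
    return (
        (1 - u) ** 3 / 6.0,
        (3 * u**3 - 6 * u**2 + 4) / 6.0,
        (-3 * u**3 + 3 * u**2 + 3 * u + 1) / 6.0,
        u**3 / 6.0,
    )


def evaluate(control_points, t):
    """Evaluate the spline at t in [0, len(control_points) - 3)."""
    # Only control points i .. i+3 contribute: this is the local support.
    i = min(int(t), len(control_points) - 4)
    u = t - i
    weights = cubic_bspline_weights(u)
    return sum(w * control_points[i + k] for k, w in enumerate(weights))


if __name__ == "__main__":
    # Changing a distant control point leaves the value at t = 0.5 unchanged.
    print(evaluate([0, 1, 2, 3, 4, 5, 6], 0.5))
    print(evaluate([0, 1, 2, 3, 4, 5, 60], 0.5))
```

Because each time instant touches only a few control points, older control points can be summarized without affecting newer segments, which is the kind of structure a marginalization scheme can exploit.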


NeurIPS Conference 2024 Conference Paper

RectifID: Personalizing Rectified Flow with Anchored Classifier Guidance

Zhicheng Sun
Zhenhao Yang
Yang Jin
Haozhe Chi
Kun Xu
Liwei Chen
Hao Jiang
Yang Song

Customizing diffusion models to generate identity-preserving images from user-provided reference images is an intriguing new problem. The prevalent approaches typically require training on extensive domain-specific images to achieve identity preservation, which lacks flexibility across different use cases. To address this issue, we exploit classifier guidance, a training-free technique that steers diffusion models using an existing classifier, for personalized image generation. Our study shows that based on a recent rectified flow framework, the major limitation of vanilla classifier guidance in requiring a special classifier can be resolved with a simple fixed-point solution, allowing flexible personalization with off-the-shelf image discriminators. Moreover, its solving procedure proves to be stable when anchored to a reference flow trajectory, with a convergence guarantee. The derived method is implemented on rectified flow with different off-the-shelf image discriminators, delivering advantageous personalization results for human faces, live subjects, and certain objects. Code is available at https://github.com/feifeiobama/RectifID.


TIST Journal 2024 Journal Article

Strengthening Cooperative Consensus in Multi-Robot Confrontation

Meng Xu
Xinhong Chen
Yechao She
Yang Jin
Guanyi Zhao
Jianping Wang

Multi-agent reinforcement learning (MARL) has proven effective in training multi-robot confrontation, such as StarCraft and robot soccer games. However, the current joint action policies utilized in MARL have been unsuccessful in recognizing and preventing actions that often lead to failures on our side. This exacerbates the cooperation dilemma, ultimately resulting in our agents acting independently and being defeated individually by their opponents. To tackle this challenge, we propose a novel joint action policy, referred to as the consensus action policy (CAP). Specifically, CAP records the number of times each joint action has caused our side to fail in the past and computes a cooperation tendency, which is integrated with each agent’s Q-value and Nash bargaining solution to determine a joint action. The cooperation tendency promotes team cooperation by selecting joint actions that have a high tendency of cooperation and avoiding actions that may lead to team failure. Moreover, the proposed CAP policy can be extended to partially observable scenarios by combining it with Deep Q network or actor-critic–based methods. We conducted extensive experiments to compare the proposed method with seven existing joint action policies, including four commonly used methods and three state-of-the-art methods, in terms of episode rewards, winning rates, and other metrics. Our results demonstrate that this approach holds great promise for multi-robot confrontation scenarios.


AAAI Conference 2024 Conference Paper

TransGOP: Transformer-Based Gaze Object Prediction

Binglu Wang
Chenxi Guo
Yang Jin
Haisheng Xia
Nian Liu

Gaze object prediction aims to predict the location and category of the object that is watched by a human. Previous gaze object prediction works use CNN-based object detectors to predict the object's location. However, we find that Transformer-based object detectors can predict more accurate object locations for dense objects in retail scenarios. Moreover, the long-distance modeling capability of the Transformer can help to build relationships between the human head and the gaze object, which is important for the GOP task. To this end, this paper introduces the Transformer into the field of gaze object prediction and proposes an end-to-end Transformer-based gaze object prediction method named TransGOP. Specifically, TransGOP uses an off-the-shelf Transformer-based object detector to detect the location of objects and designs a Transformer-based gaze autoencoder in the gaze regressor to establish long-distance gaze relationships. Moreover, to improve gaze heatmap regression, we propose an object-to-gaze cross-attention mechanism to let the queries of the gaze autoencoder learn the global-memory position knowledge from the object detector. Finally, to make the whole framework end-to-end trained, we propose a Gaze Box loss to jointly optimize the object detector and gaze regressor by enhancing the gaze heatmap energy in the box of the gaze object. Extensive experiments on the GOO-Synth and GOO-Real datasets demonstrate that our TransGOP achieves state-of-the-art performance on all tracks, i.e., object detection, gaze estimation, and gaze object prediction. Our code will be available at https://github.com/chenxi-Guo/TransGOP.git.


ICLR Conference 2024 Conference Paper

Unified Language-Vision Pretraining in LLM with Dynamic Discrete Visual Tokenization

Yang Jin
Kun Xu 0005
Li-Wei Chen
Chao Liao
Jianchao Tan
Quzhe Huang
Bin Chen
Chengru Song

Recently, the remarkable advance of the Large Language Model (LLM) has inspired researchers to transfer its extraordinary reasoning capability to both vision and language data. However, the prevailing approaches primarily regard the visual input as a prompt and focus exclusively on optimizing the text generation process conditioned upon vision content by a frozen LLM. Such an inequitable treatment of vision and language heavily constrains the model's potential. In this paper, we break through this limitation by representing both vision and language in a unified form. Specifically, we introduce a well-designed visual tokenizer to translate the non-linguistic image into a sequence of discrete tokens like a foreign language that an LLM can read. The resulting visual tokens encompass high-level semantics worthy of a word and also support dynamic sequence length varying from the image. Coupled with this tokenizer, the presented foundation model called LaVIT can handle both image and text indiscriminately under the same generative learning paradigm. This unification empowers LaVIT to serve as an impressive generalist interface to understand and generate multi-modal content simultaneously. Extensive experiments further showcase that it outperforms the existing models by a large margin on massive vision-language tasks. Our code and models are available at https://github.com/jy0205/LaVIT.


ICML Conference 2024 Conference Paper

Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization

Yang Jin
Zhicheng Sun 0001
Kun Xu 0005
Li-Wei Chen
Hao Jiang
Quzhe Huang
Chengru Song
Yuliang Liu

In light of recent advances in multimodal Large Language Models (LLMs), there is increasing attention to scaling them from image-text data to more informative real-world videos. Compared to static images, video poses unique challenges for effective large-scale pre-training due to the modeling of its spatiotemporal dynamics. In this paper, we address such limitations in video-language pre-training with an efficient video decomposition that represents each video as keyframes and temporal motions. These are then adapted to an LLM using well-designed tokenizers that discretize visual and temporal information as a few tokens, thus enabling unified generative pre-training of videos, images, and text. At inference, the generated tokens from the LLM are carefully recovered to the original continuous pixel space to create various video content. Our proposed framework is capable of both comprehending and generating image and video content, as demonstrated by its competitive performance across 13 multimodal benchmarks in image and video understanding and generation. Our code and models are available at https://video-lavit.github.io.


TIST Journal 2023 Journal Article

Dynamic Weights and Prior Reward in Policy Fusion for Compound Agent Learning

Meng Xu
Yechao She
Yang Jin
Jianping Wang

In the Deep Reinforcement Learning (DRL) domain, a compound learning task is often decomposed into several sub-tasks in a divide-and-conquer manner, each trained separately and then fused concurrently to achieve the original task, referred to as policy fusion. However, the state-of-the-art (SOTA) policy fusion methods treat the importance of sub-tasks equally throughout the task process, eliminating the possibility of the agent relying on different sub-tasks at various stages. To address this limitation, we propose a generic policy fusion approach, referred to as Policy Fusion Learning with Dynamic Weights and Prior Reward (PFLDWPR), to automate the time-varying selection of sub-tasks. Specifically, PFLDWPR produces a time-varying one-hot vector for sub-tasks to dynamically select a suitable sub-task and mask the rest throughout the entire task process, enabling the fused strategy to optimally guide the agent in executing the compound task. The sub-tasks with the dynamic one-hot vector are then aggregated to obtain the action policy for the original task. Moreover, we collect the sub-tasks’ rewards at the pre-training stage as a prior reward, which, along with the current reward, is used to train the policy fusion network. Thus, this approach reduces fusion bias by leveraging prior experience. Experimental results under three popular learning tasks demonstrate that the proposed method significantly improves three SOTA policy fusion methods in terms of task duration, episode reward, and score difference.
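The selection step described in this abstract (a one-hot vector that picks one sub-task and masks the rest before aggregation) can be sketched as follows. This is a minimal illustration under stated assumptions: in the paper the one-hot vector comes from a trained policy fusion network, whereas here the selected index is simply passed in, and the names `one_hot` and `fuse_policies` are hypothetical.

```python
def one_hot(index, size):
    """Return a one-hot weight vector with a 1.0 at `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def fuse_policies(sub_policy_probs, selected):
    """Aggregate sub-task action distributions under a one-hot mask.

    sub_policy_probs: one action-probability list per sub-task.
    selected: index of the sub-task chosen at this time step (assumed given).
    """
    weights = one_hot(selected, len(sub_policy_probs))
    n_actions = len(sub_policy_probs[0])
    # Weighted sum over sub-tasks; the mask zeroes out every unselected sub-task.
    return [
        sum(w * probs[a] for w, probs in zip(weights, sub_policy_probs))
        for a in range(n_actions)
    ]


if __name__ == "__main__":
    sub_policies = [[0.7, 0.3], [0.2, 0.8]]
    # A time-varying selection: sub-task 0 early in the episode, sub-task 1 later.
    for step, chosen in enumerate([0, 0, 1]):
        print(step, fuse_policies(sub_policies, chosen))
```

Writing the aggregation as a weighted sum keeps the same code path usable if the one-hot vector were replaced by soft weights.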


NeurIPS Conference 2022 Conference Paper

Embracing Consistency: A One-Stage Approach for Spatio-Temporal Video Grounding

Yang Jin
Yongzhi Li
Zehuan Yuan
Yadong Mu

Spatio-Temporal video grounding (STVG) focuses on retrieving the spatio-temporal tube of a specific object depicted by a free-form textual expression. Existing approaches mainly treat this complicated task as a parallel frame-grounding problem and thus suffer from two types of inconsistency drawbacks: feature alignment inconsistency and prediction inconsistency. In this paper, we present an end-to-end one-stage framework, termed Spatio-Temporal Consistency-Aware Transformer (STCAT), to alleviate these issues. Specifically, we introduce a novel multi-modal template as the global objective to address this task, which explicitly constricts the grounding region and associates the predictions among all video frames. Moreover, to generate the above template under sufficient video-textual perception, an encoder-decoder architecture is proposed for effective global context modeling. Thanks to these critical designs, STCAT enjoys more consistent cross-modal feature alignment and tube prediction without reliance on any pre-trained object detectors. Extensive experiments show that our method outperforms previous state-of-the-art methods with clear margins on two challenging video benchmarks (VidSTG and HC-STVG), illustrating the superiority of the proposed framework in better understanding the association between vision and natural language. Code is publicly available at https://github.com/jy0205/STCAT.


JBHI Journal 2022 Journal Article

Full-Resolution Network and Dual-Threshold Iteration for Retinal Vessel and Coronary Angiograph Segmentation

Wentao Liu
Huihua Yang
Tong Tian
Zhiwei Cao
Xipeng Pan
Weijin Xu
Yang Jin
Feng Gao

Vessel segmentation is critical for disease diagnosis and surgical planning. Recently, the vessel segmentation method based on deep learning has achieved outstanding performance. However, vessel segmentation remains challenging due to thin vessels with low contrast that easily lose spatial information in the traditional U-shaped segmentation network. To alleviate this problem, we propose a novel and straightforward full-resolution network (FR-UNet) that expands horizontally and vertically through a multiresolution convolution interactive mechanism while retaining full image resolution. In FR-UNet, the feature aggregation module integrates multiscale feature maps from adjacent stages to supplement high-level contextual information. The modified residual blocks continuously learn multiresolution representations to obtain a pixel-level accuracy prediction map. Moreover, we propose the dual-threshold iterative algorithm (DTI) to extract weak vessel pixels for improving vessel connectivity. The proposed method was evaluated on retinal vessel datasets (DRIVE, CHASE_DB1, and STARE) and coronary angiography datasets (DCA1 and CHUAC). The results demonstrate that FR-UNet outperforms state-of-the-art methods by achieving the highest Sen, AUC, F1, and IOU on most of the above-mentioned datasets with fewer parameters, and that DTI enhances vessel connectivity while greatly improving sensitivity. The code is available at: https://github.com/lseventeen/FR-UNet.
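The abstract does not spell out how the dual-threshold iterative algorithm works, so the following is only a hysteresis-style sketch of the general idea it describes: confident vessel pixels are kept, and weaker pixels are added iteratively when they touch an already accepted pixel. The thresholds, the 8-neighbourhood, and the function name are assumptions, not the authors' DTI implementation.

```python
def dual_threshold_iteration(prob, high=0.5, low=0.3):
    """Grow a vessel mask from confident pixels into connected weak pixels.

    prob: 2D list of per-pixel vessel probabilities in [0, 1].
    high: pixels at or above this value seed the mask.
    low: weak pixels at or above this value may join if adjacent to the mask.
    """
    rows, cols = len(prob), len(prob[0])
    mask = [[prob[r][c] >= high for c in range(cols)] for r in range(rows)]
    changed = True
    while changed:  # repeat until no weak pixel can be added
        changed = False
        for r in range(rows):
            for c in range(cols):
                if mask[r][c] or prob[r][c] < low:
                    continue
                neighbours = (
                    (r + dr, c + dc)
                    for dr in (-1, 0, 1)
                    for dc in (-1, 0, 1)
                    if dr or dc
                )
                if any(
                    0 <= rr < rows and 0 <= cc < cols and mask[rr][cc]
                    for rr, cc in neighbours
                ):
                    mask[r][c] = True
                    changed = True
    return mask


if __name__ == "__main__":
    # The weak pixels next to the strong one join; the isolated weak pixel does not.
    print(dual_threshold_iteration([[0.9, 0.4, 0.4, 0.1, 0.4]]))
```

A single low threshold would also accept the isolated weak pixel at the end of the row; requiring connectivity to a confident pixel is what lets weak vessel pixels be recovered without admitting scattered noise.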
