Arrow Research · Search

Author name cluster

Jiayi Chen

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

15 papers
2 author rows

Possible papers (15)

AAAI Conference 2026 · Conference Paper

Compositional Attribute Imbalance in Vision Datasets

  • Yanbiao Ma
  • Jiayi Chen
  • Wei Dai
  • Dong Zhao
  • Zeyu Zhang
  • Yuting Yang
  • Bowei Liu
  • Jiaxuan Zhao

Visual attribute imbalance is a common yet underexplored issue in image classification, significantly impacting model performance and generalization. In this work, we first define the first-level and second-level attributes of images and then introduce a CLIP-based framework to construct a visual attribute dictionary, enabling automatic evaluation of image attributes. By systematically analyzing both single-attribute imbalance and compositional attribute imbalance, we reveal how the rarity of attributes affects model performance. To tackle these challenges, we propose adjusting the sampling probability of samples based on the rarity of their compositional attributes. This strategy is further integrated with various data augmentation techniques (such as CutMix, FMix, and SaliencyMix) to enhance the model's ability to represent rare attributes. Extensive experiments on benchmark datasets demonstrate that our method effectively mitigates attribute imbalance, thereby improving the robustness and fairness of deep neural networks. Our research highlights the importance of modeling visual attribute distributions and provides a scalable solution for long-tail image classification tasks.
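
As a concrete illustration of the rarity-based sampling idea in this abstract, the sketch below assigns each image a sampling probability inversely proportional to the frequency of its compositional attribute tuple. All names and the smoothing constant are illustrative assumptions, not the paper's actual pipeline.

```python
# Hedged sketch of rarity-weighted sampling, assuming each image is tagged
# with a tuple of discrete attributes; names and the smoothing constant are
# illustrative, not the paper's implementation.
from collections import Counter
import numpy as np

def rarity_sampling_probs(attribute_tuples, smoothing=1.0):
    """Give samples with rare compositional attributes (joint attribute
    tuples) a higher probability of being drawn during training."""
    counts = Counter(attribute_tuples)
    # Inverse-frequency weight per sample; smoothing avoids extreme ratios.
    weights = np.array([1.0 / (counts[t] + smoothing) for t in attribute_tuples])
    return weights / weights.sum()

# Toy usage: six images described by (texture, lighting) attribute pairs.
attrs = [("smooth", "bright")] * 4 + [("rough", "dark"), ("rough", "bright")]
probs = rarity_sampling_probs(attrs)
batch = np.random.choice(len(attrs), size=4, p=probs, replace=True)
print(probs.round(3), batch)
```

Rare compositions such as ("rough", "dark") end up several times more likely to be drawn than the dominant one, which is the effect the abstract describes before augmentation is layered on top.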

AAAI Conference 2026 · Conference Paper

MARS: A Meta-Adaptive Reinforcement Learning Framework for Risk-Aware Multi-Agent Portfolio Management

  • Jiayi Chen
  • Jing Li
  • Guiling Wang

Reinforcement Learning (RL) has shown significant promise in automated portfolio management; however, effectively balancing risk and return remains a central challenge, as many models fail to adapt to dynamically changing market conditions. We propose Meta-controlled Agents for a Risk-aware System (MARS), a novel framework addressing this through a multi-agent, risk-aware approach. MARS replaces monolithic models with a Heterogeneous Agent Ensemble, where each agent’s unique risk profile is enforced by a Safety-Critic network to span behaviors from capital preservation to aggressive growth. A high-level Meta-Adaptive Controller (MAC) dynamically orchestrates this ensemble, shifting reliance between conservative and aggressive agents to minimize drawdown during downturns while seizing opportunities in bull markets. This two-tiered structure leverages behavioral diversity rather than explicit feature engineering to ensure a disciplined portfolio robust across market regimes. Experiments on major international indexes confirm that our framework significantly reduces maximum drawdown and volatility while maintaining competitive returns.
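
The following toy sketch illustrates only the two-tier structure the abstract describes: a meta-controller reweights a heterogeneous agent ensemble according to current market conditions. The drawdown-matching heuristic, the risk-profile encoding, and all names are assumptions for illustration, not the MARS method.

```python
# Illustrative sketch of the two-tier idea: a meta-controller reweights a
# heterogeneous agent ensemble by market regime. The drawdown heuristic and
# all names are assumptions, not MARS itself.
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def meta_weights(drawdown, risk_profiles, temperature=0.1):
    """Shift weight toward conservative agents (low risk profile) as the
    current drawdown deepens, and toward aggressive ones when it is small."""
    scores = -np.abs(risk_profiles - (1.0 - drawdown))
    return softmax(scores / temperature)

risk_profiles = np.array([0.1, 0.5, 0.9])   # conservative -> aggressive
agent_allocations = np.array([[0.9, 0.1],   # each row: an agent's weights
                              [0.5, 0.5],   # over (bonds, equities)
                              [0.2, 0.8]])
w = meta_weights(drawdown=0.25, risk_profiles=risk_profiles)
portfolio = w @ agent_allocations           # blended final allocation
print(w.round(3), portfolio.round(3))
```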

AAAI Conference 2026 · Conference Paper

ReconVLA: Reconstructive Vision-Language-Action Model as Effective Robot Perceiver

  • Wenxuan Song
  • Ziyang Zhou
  • Han Zhao
  • Jiayi Chen
  • Pengxiang Ding
  • Haodong Yan
  • Yuxin Huang
  • Feilong Tang

Recent advances in Vision-Language-Action (VLA) models have enabled robotic agents to integrate multimodal understanding with action execution. However, our empirical analysis reveals that current VLAs struggle to allocate visual attention to target regions; instead, attention remains dispersed across the image. To ground visual attention on the correct target, we propose ReconVLA, a reconstructive VLA model with an implicit grounding paradigm. Conditioned on the model's visual outputs, a diffusion transformer aims to reconstruct the gaze region of the image, which corresponds to the target manipulated objects. This process prompts the VLA model to learn fine-grained representations and accurately allocate visual attention, thus effectively leveraging task-specific visual information and conducting precise manipulation. Moreover, we curate a large-scale pretraining dataset comprising over 100k trajectories and 2 million data samples from open-source robotic datasets, further boosting the model's generalization in visual reconstruction. Extensive experiments in simulation and the real world demonstrate the superiority of our implicit grounding method, showcasing its capabilities of precise manipulation and generalization.
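
A minimal sketch of how a reconstructive auxiliary objective of this kind could be paired with the action loss is shown below; the linear decoder stands in for the diffusion transformer, and the loss weight, shapes, and names are all assumptions.

```python
# Sketch of an action loss plus a reconstruction auxiliary loss, in the
# spirit of the implicit grounding paradigm above. The linear decoder is a
# stand-in for the diffusion transformer; shapes and weights are assumed.
import numpy as np

rng = np.random.default_rng(0)
visual_tokens = rng.normal(size=(16, 32))        # model's visual outputs
gaze_crop = rng.normal(size=(8 * 8,))            # pixels of the target region
decoder = rng.normal(scale=0.05, size=(16 * 32, 8 * 8))

recon = visual_tokens.reshape(-1) @ decoder      # reconstruct the gaze region
recon_loss = np.mean((recon - gaze_crop) ** 2)   # grounding signal
action_loss = 1.23                               # placeholder from policy head
total_loss = action_loss + 0.5 * recon_loss      # lambda = 0.5 is assumed
print(round(float(recon_loss), 3), round(float(total_loss), 3))
```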

JBHI Journal 2025 · Journal Article

Active Learning Based on Temporal Difference of Gradient Flow in Thoracic Disease Diagnosis

  • Jiayi Chen
  • Benteng Ma
  • Hengfei Cui
  • Jingfeng Zhang
  • Yong Xia

Deep learning has driven significant advances in thoracic disease diagnosis, but these advances rely on large numbers of annotated samples, which can hardly be guaranteed given the resource-intensive nature of medical image annotation. Active learning has been introduced to mitigate annotation costs by selecting a subset of uncertain samples for annotation and training. Existing active learning methods encounter two primary challenges: 1) overlooking the impact of samples on the dynamics of model training during data selection, and 2) suffering from high costs of data evaluation and selection. To tackle both issues, we propose a novel metric called Temporal Difference of Gradient Flow (TDGF) for data selection in active learning. Each round of active learning involves three steps: model training, data selection, and data annotation. First, we train a target model, a proxy model, and a historical proxy model on the labeled set. Second, the TDGF scores of unlabeled samples are evaluated based on the surrogate gradient flow, i.e., the TDGF w.r.t. the final fully-connected layer between the proxy and historical proxy models, and the top-K samples with the highest TDGF scores are selected. Third, the selected samples are annotated, and the labeled and unlabeled pools are updated. Comparative experiments have been conducted on two public chest radiograph datasets, i.e., ChestX-ray14 and CheXpert. Our results suggest that the proposed TDGF metric tends to select hard and uncertain samples, and that the use of proxy models and surrogate gradient flow substantially reduces the complexity of TDGF calculation. More importantly, the results also indicate that our TDGF-based method outperforms classical and state-of-the-art active learning methods in thoracic disease diagnosis.
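
The sketch below gives one plausible reading of TDGF-style scoring: per-sample surrogate gradients w.r.t. a final FC layer are approximated from penultimate features and pseudo-labels (a BADGE-like trick), and unlabeled samples are ranked by the norm of the difference between the proxy and historical proxy. The surrogate and all names are assumptions, not the paper's code.

```python
# Plausible sketch of TDGF-style scoring; the BADGE-like surrogate gradient
# and all names are assumptions, not the paper's implementation.
import numpy as np

def fc_gradient_embedding(features, logits):
    """Surrogate cross-entropy gradient w.r.t. final FC weights, using the
    model's own prediction as a pseudo-label for unlabeled data."""
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    onehot = np.eye(p.shape[1])[p.argmax(axis=1)]
    # (N, C, D): outer product of (p - pseudo-label) and features.
    return (p - onehot)[:, :, None] * features[:, None, :]

def tdgf_scores(feat_now, logit_now, feat_prev, logit_prev):
    g_now = fc_gradient_embedding(feat_now, logit_now)
    g_prev = fc_gradient_embedding(feat_prev, logit_prev)
    # Temporal difference of the surrogate gradient flow, per sample.
    return np.linalg.norm((g_now - g_prev).reshape(len(g_now), -1), axis=1)

rng = np.random.default_rng(0)
feat_now = rng.normal(size=(100, 16))            # proxy model features
logit_now = rng.normal(size=(100, 5))
feat_prev = feat_now + rng.normal(scale=0.1, size=feat_now.shape)
logit_prev = logit_now + rng.normal(scale=0.1, size=logit_now.shape)
scores = tdgf_scores(feat_now, logit_now, feat_prev, logit_prev)
top_k = np.argsort(-scores)[:10]                 # send these for annotation
print(top_k)
```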

ICRA Conference 2025 · Conference Paper

BODex: Scalable and Efficient Robotic Dexterous Grasp Synthesis Using Bilevel Optimization

  • Jiayi Chen
  • Yubin Ke
  • He Wang

Robotic dexterous grasping is important for interacting with the environment. To unleash the potential of data-driven models for dexterous grasping, a large-scale, high-quality dataset is essential. While gradient-based optimization offers a promising way to construct such datasets, previous works suffer from limitations such as inefficiency, strong assumptions in the grasp quality energy, or limited object sets for experiments. Moreover, the lack of a standard benchmark for comparing different methods and datasets hinders progress in this field. To address these challenges, we develop a highly efficient synthesis system and a comprehensive benchmark with MuJoCo for dexterous grasping. We formulate grasp synthesis as a bilevel optimization problem, combining a novel lower-level quadratic programming (QP) with an upper-level gradient descent process. By leveraging recent advances in CUDA-accelerated robotic libraries and GPU-based QP solvers, our system can parallelize thousands of grasps and synthesize over 49 grasps per second on a single 3090 GPU. Our synthesized grasps for the Shadow, Allegro, and Leap hands all achieve a success rate above 75% in simulation, with a penetration depth under 1 mm, outperforming existing baselines on nearly all metrics. Compared to the previous large-scale dataset, DexGraspNet, our dataset significantly improves the performance of learning models, raising the simulated success rate from around 40% to 80%. Real-world testing of the trained model on the Shadow Hand achieves an 81% success rate across 20 diverse objects. The code and datasets are released on our project page: https://pku-epic.github.io/BODex.
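
A heavily simplified sketch of the bilevel loop is given below: a small box-constrained QP is solved at the lower level (here by projected gradient) inside an upper-level gradient descent over a pose parameter. The energies, the QP data, and the choice to ignore the implicit gradient through the QP are illustrative assumptions; BODex's actual formulation and GPU batching differ.

```python
# Schematic bilevel loop under heavy simplification; purely illustrative,
# not BODex's QP, energies, or batching.
import numpy as np

def solve_box_qp(Q, c, lo, hi, steps=200, lr=0.05):
    """min_x 0.5 x^T Q x + c^T x  s.t.  lo <= x <= hi, by projected gradient."""
    x = np.clip(np.zeros_like(c), lo, hi)
    for _ in range(steps):
        x = np.clip(x - lr * (Q @ x + c), lo, hi)
    return x

pose = np.zeros(3)                    # upper-level variable (e.g., hand pose)
Q = np.eye(4)                         # toy QP data (e.g., a contact model)
for _ in range(50):                   # upper level: plain gradient descent
    c = np.full(4, -pose.mean())      # lower-level QP depends on the pose
    forces = solve_box_qp(Q, c, lo=np.zeros(4), hi=np.ones(4))
    # Simplified upper-level gradient: pull the pose toward a target while
    # ignoring the implicit dependence of `forces` on `pose`.
    pose -= 0.1 * (pose - 1.0)
print(pose.round(3), forces.round(3))
```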

NeurIPS Conference 2025 · Conference Paper

EvaLearn: Quantifying the Learning Capability and Efficiency of LLMs via Sequential Problem Solving

  • Shihan Dou
  • Ming Zhang
  • Chenhao Huang
  • Jiayi Chen
  • Feng Chen
  • Shichun Liu
  • Yan Liu
  • Chenxiao Liu

We introduce EvaLearn, a pioneering benchmark designed to evaluate large language models (LLMs) on their learning capability and efficiency in challenging tasks, a critical yet underexplored aspect of model potential. EvaLearn contains 648 challenging problems across six task types, grouped into 182 sequences, each sequence dedicated to one task type. Diverging from most existing benchmarks that evaluate models in parallel, EvaLearn requires models to solve problems sequentially, allowing them to leverage the experience gained from previous solutions. EvaLearn provides five comprehensive automated metrics to evaluate models and quantify their learning capability and efficiency. We extensively benchmark nine frontier models and observe varied performance profiles: some models, such as Claude-3.7-sonnet, start with moderate initial performance but exhibit strong learning ability, while others struggle to benefit from experience and may even show negative transfer. Moreover, we investigate model performance under two learning settings and find that instance-level rubrics and teacher-model feedback further facilitate model learning. Importantly, we observe that current LLMs with stronger static abilities do not show a clear advantage in learning capability across all tasks, highlighting that EvaLearn evaluates a new dimension of model performance. We hope EvaLearn provides a novel evaluation perspective for assessing LLM potential and understanding the gap between models and human capabilities, promoting the development of deeper and more dynamic evaluation approaches. All datasets, the automatic evaluation framework, and the results studied in this paper are available in the supplementary materials.
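
To make the sequential protocol concrete, the sketch below runs a stub solver through one sequence while carrying feedback forward, then fits a slope to per-position accuracy as a crude learning signal. The solve() stub and the slope metric are assumptions, not EvaLearn's five metrics.

```python
# Hedged sketch of sequential (vs. parallel) evaluation: the model sees
# feedback from earlier problems in the same sequence. The solve() stub
# stands in for an LLM call; the slope metric is an assumption.
import numpy as np

def solve(problem, history):
    # Stand-in for an LLM: pretend accumulated experience helps a little.
    return problem["answer"] if np.random.rand() < 0.4 + 0.05 * len(history) else None

def run_sequence(problems, seed=0):
    np.random.seed(seed)
    history, correct = [], []
    for p in problems:
        ans = solve(p, history)
        ok = ans == p["answer"]
        correct.append(ok)
        history.append({"problem": p, "ok": ok})   # feedback carried forward
    return np.array(correct, dtype=float)

seq = [{"answer": i} for i in range(7)]
acc = run_sequence(seq)
slope = np.polyfit(np.arange(len(acc)), acc, 1)[0]  # >0 suggests learning
print(acc, round(slope, 3))
```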

IROS Conference 2025 · Conference Paper

Opportunistic Collaborative Planning with Large Vision Model Guided Control and Joint Query-Service Optimization

  • Jiayi Chen
  • Shuai Wang 0004
  • Guoliang Li
  • Wei Xu 0001
  • Guangxu Zhu
  • Derrick Wing Kwan Ng
  • ChengZhong Xu 0001

Navigating autonomous vehicles in open scenarios is a challenge due to the difficulties in handling unseen objects. Existing solutions either rely on small models that struggle with generalization or large models that are resource-intensive. While collaboration between the two offers a promising solution, the key challenge is deciding when and how to engage the large model. To address this issue, this paper proposes opportunistic collaborative planning (OCP), which seamlessly integrates efficient local models with powerful cloud models through two key innovations. First, we propose large vision model guided model predictive control (LVM-MPC), which leverages the cloud for LVM perception and decision making. The cloud output serves as a global guidance for a local MPC, thereby forming a closed-loop perception-to-control system. Second, to determine the best timing for large model query and service, we propose collaboration timing optimization (CTO), including object detection confidence thresholding (ODCT) and cloud forward simulation (CFS), to decide when to seek cloud assistance and when to offer cloud service. Extensive experiments show that the proposed OCP outperforms existing methods in terms of both navigation time and success rate.
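
A minimal sketch of the confidence-thresholding side of the collaboration-timing idea might look as follows; the threshold value and function names are assumptions, not the paper's ODCT procedure.

```python
# Minimal sketch of query timing: fall back to the cloud model when local
# detection confidence drops below a threshold. Names and the threshold
# value are assumptions, not the OCP implementation.
def should_query_cloud(detections, conf_threshold=0.6):
    """detections: list of (label, confidence) from the local detector."""
    if not detections:
        return True                      # nothing recognized locally
    return min(conf for _, conf in detections) < conf_threshold

local_dets = [("car", 0.92), ("unknown_obstacle", 0.41)]
if should_query_cloud(local_dets):
    print("querying cloud LVM for guidance")  # LVM output then steers the MPC
```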

IROS Conference 2025 · Conference Paper

PD-VLA: Accelerating Vision-Language-Action Model Integrated with Action Chunking via Parallel Decoding

  • Wenxuan Song
  • Jiayi Chen
  • Pengxiang Ding
  • Han Zhao 0008
  • Wei Zhao
  • Zhide Zhong
  • Zongyuan Ge
  • Zhijun Li

Vision-Language-Action (VLA) models demonstrate remarkable potential for generalizable robotic manipulation. Their performance can be improved by integrating action chunking, a critical technique for effective control. However, action chunking linearly scales up the action dimensions in VLA models as the chunk size increases, which reduces inference efficiency and makes accelerating chunking-integrated VLA models an urgent need. To tackle this problem, we propose PD-VLA, the first parallel decoding framework for VLA models integrated with action chunking. Our framework reformulates autoregressive decoding as a nonlinear system solved by parallel fixed-point iterations. This approach preserves model performance with mathematical guarantees while significantly improving decoding speed. In addition, it enables training-free acceleration without architectural changes, as well as seamless synergy with existing acceleration techniques. Extensive simulations validate that PD-VLA maintains competitive success rates while achieving a 2.52× higher execution frequency on manipulators (with 7 degrees of freedom) compared with the fundamental VLA model. Furthermore, we experimentally identify the most effective settings for acceleration. Finally, real-world experiments validate its high applicability across different tasks.
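
The fixed-point reformulation can be illustrated with a Jacobi-style toy decoder: all chunk positions are updated in parallel from the previous iterate until the sequence stops changing, and greedy autoregressive decoding is a fixed point of that map. The deterministic next_token stub stands in for the VLA head; everything here is an assumption for illustration.

```python
# Sketch of Jacobi-style parallel decoding: treat the token chunk as the
# unknown of a fixed-point system. The toy next_token model is a stand-in
# for the VLA policy head, not PD-VLA's implementation.
VOCAB = 10

def next_token(prefix):
    # Deterministic stand-in for greedy decoding from a model.
    return (sum(prefix) * 7 + len(prefix)) % VOCAB

def jacobi_decode(chunk_len, max_iters=20):
    tokens = [0] * chunk_len                  # arbitrary initialization
    for it in range(max_iters):
        # All positions updated at once from the previous iterate.
        new = [next_token(tokens[:i]) for i in range(chunk_len)]
        if new == tokens:                     # fixed point reached
            return new, it
        tokens = new
    return tokens, max_iters

ar = []
for _ in range(6):                            # reference: sequential greedy
    ar.append(next_token(ar))
par, iters = jacobi_decode(6)
assert par == ar                              # same output, fewer serial steps
print(par, iters)
```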

ICLR Conference 2025 · Conference Paper

Pursuing Better Decision Boundaries for Long-Tailed Object Detection via Category Information Amount

  • Yanbiao Ma
  • Wei Dai 0015
  • Jiayi Chen

In object detection, the number of instances is commonly used to determine whether a dataset follows a long-tailed distribution, implicitly assuming that the model will perform poorly on categories with fewer instances. This assumption has led to extensive research on category bias in datasets with imbalanced instance distributions. However, even in datasets with relatively balanced instance counts, models still exhibit bias toward certain categories, indicating that instance count alone cannot explain this phenomenon. In this work, we first introduce the concept and measurement of category informativeness. We observe a significant negative correlation between a category’s informativeness and its accuracy, suggesting that informativeness more accurately reflects the learning difficulty of a category. Based on this observation, we propose the Informativeness-Guided Angular Margin Loss (IGAM Loss), which dynamically adjusts the decision space of categories according to their informativeness, thereby mitigating category bias in long-tailed datasets. IGAM Loss not only achieves superior performance on long-tailed benchmark datasets such as LVIS v1.0 and COCO-LT but also demonstrates significant improvements for underrepresented categories in non-long-tailed datasets like Pascal VOC. Extensive experiments confirm the potential of category informativeness as an analytical tool for understanding category bias, as well as the generalizability of our proposed method.
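
As a sketch of how an informativeness-guided angular margin could be wired up, the snippet below adds a per-class additive margin, scaled by assumed informativeness values, to the ground-truth cosine logit before the usual softmax cross-entropy. The informativeness measurement and all constants are assumptions, not the IGAM Loss definition.

```python
# Hedged sketch of an informativeness-guided angular margin; informativeness
# values, scale, and max margin are illustrative assumptions.
import numpy as np

def igam_logits(features, weights, labels, info, scale=16.0, max_margin=0.3):
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    w = weights / np.linalg.norm(weights, axis=1, keepdims=True)
    cos = f @ w.T                                   # (N, C) cosine similarities
    theta = np.arccos(np.clip(cos, -1 + 1e-7, 1 - 1e-7))
    margin = max_margin * info / info.max()         # harder class, larger margin
    theta[np.arange(len(labels)), labels] += margin[labels]
    return scale * np.cos(theta)                    # feed to softmax CE as usual

rng = np.random.default_rng(0)
feats, W = rng.normal(size=(8, 32)), rng.normal(size=(5, 32))
labels = rng.integers(0, 5, size=8)
info = np.array([0.2, 1.5, 0.7, 2.0, 0.4])          # assumed informativeness
logits = igam_logits(feats, W, labels, info)
print(logits.shape)
```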

IROS Conference 2025 · Conference Paper

TagGuideBot: Enhancing Robot Intelligence with Object Tags and VLMs

  • Jiayi Chen
  • Ying He 0006
  • F. Richard Yu

This research aims to enhance the interaction between humans and robots, especially in environments with multiple similar objects or semantic ambiguities. Traditional command-based interactions typically require users to provide precise descriptions, which often poses a significant challenge. To address this issue, we propose a framework named TagGuideBot, which leverages Visual Language Models (VLMs) and utilizes object markers to help locate and identify objects in the environment. By integrating positional point prompts of the target objects with robot motion planning models, we aim to achieve a more accurate understanding and execution of complex commands, thus improving the efficiency and naturalness of interactions. Experimental results demonstrate that TagGuideBot effectively addresses the challenges posed by complex commands and environmental complexities, achieving an accuracy of 66.3% on user instructions extended beyond the training set, providing solid support for further optimization of human-robot interaction.
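
A hypothetical sketch of tag-guided grounding in the spirit of this framework: detected objects receive numeric tags, a VLM resolves the instruction to a tag, and the tag's 3D point seeds motion planning. The query_vlm stub and all names are invented for illustration; no real VLM is called.

```python
# Hypothetical sketch of tag-guided grounding; query_vlm is a stub standing
# in for a real VLM call, and all names and coordinates are invented.
def query_vlm(instruction, tags):
    # Stand-in for a VLM: pick the tag whose label appears in the command.
    for tag_id, (label, _) in tags.items():
        if label in instruction:
            return tag_id
    return None

tags = {1: ("red cup", (0.42, 0.17, 0.05)),    # tag id -> (label, 3D point)
        2: ("blue cup", (0.38, -0.09, 0.05))}
chosen = query_vlm("pick up the red cup", tags)
if chosen is not None:
    label, point = tags[chosen]
    print(f"planning grasp at {point} for tag {chosen} ({label})")
```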

IJCAI Conference 2024 · Conference Paper

Cross-Domain Few-Shot Semantic Segmentation via Doubly Matching Transformation

  • Jiayi Chen
  • Rong Quan
  • Jie Qin

Cross-Domain Few-shot Semantic Segmentation (CD-FSS) aims to train generalized models that can segment classes from different domains with a few labeled images. Previous works have proven the effectiveness of feature transformation in addressing CD-FSS. However, they rely entirely on support images for feature transformation, and repeatedly utilizing the few support images for each class can easily lead to overfitting and overlook intra-class appearance differences. In this paper, we propose a Doubly Matching Transformation-based Network (DMTNet) to solve the above issue. Instead of relying entirely on support images, we propose Self-Matching Transformation (SMT) to construct query-specific transformation matrices based on query images themselves, transforming domain-specific query features into domain-agnostic ones. Calculating query-specific transformation matrices prevents overfitting, especially in the meta-testing stage, where only one or several images are used as support images to segment hundreds or thousands of images. After obtaining domain-agnostic features, we exploit a Dual Hypercorrelation Construction (DHC) module to explore the hypercorrelations between the query image and the foreground and background of the support image, based on which foreground and background prediction maps are generated and supervised, respectively, to enhance the segmentation result. In addition, we propose a Test-time Self-Finetuning (TSF) strategy to more accurately self-tune the query prediction in unseen domains. Extensive experiments on four popular datasets show that DMTNet achieves superior performance over state-of-the-art approaches. Code is available at https://github.com/ChenJiayi68/DMTNet.
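
One simple way to picture a query-specific transformation built from the query alone is per-channel whitening of the query feature map, as sketched below; this whitening choice is an assumption and not DMTNet's actual SMT matrix construction.

```python
# Rough sketch of the "self-matching" idea: derive the transformation from
# the query's own feature statistics so it does not depend on the few
# support images. The whitening choice here is an assumption, not SMT.
import numpy as np

def self_matching_transform(query_feat, eps=1e-5):
    """query_feat: (C, H, W) feature map for one query image."""
    c, h, w = query_feat.shape
    x = query_feat.reshape(c, -1)
    mu = x.mean(axis=1, keepdims=True)
    sigma = x.std(axis=1, keepdims=True)
    # Per-channel whitening: a crude "domain-agnostic" normalization.
    return ((x - mu) / (sigma + eps)).reshape(c, h, w)

feat = np.random.default_rng(0).normal(loc=3.0, scale=2.0, size=(64, 16, 16))
out = self_matching_transform(feat)
print(round(float(out.mean()), 4), round(float(out.std()), 4))  # ~0, ~1
```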

ICML Conference 2024 · Conference Paper

FedMBridge: Bridgeable Multimodal Federated Learning

  • Jiayi Chen
  • Aidong Zhang 0001

Multimodal Federated Learning (MFL) addresses the setup of multiple clients with diversified modality types (e.g., image, text, video, and audio) working together to improve their local personal models in a privacy-preserving manner. Prior MFL works rely on restrictive compositional neural architecture designs to ensure inter-client information sharing via blockwise model aggregation, limiting their applicability in real-world Architecture-personalized MFL (AMFL) scenarios, where clients may have distinct multimodal interaction strategies and there is no restriction on local architecture design. The key challenge in AMFL is how to automatically and efficiently tackle the two heterogeneity patterns (statistical and architecture heterogeneity) while maximizing the beneficial information sharing among clients. To solve this challenge, we propose FedMBridge, which leverages a topology-aware hypernetwork to act as a bridge that can automatically balance and digest the two heterogeneity patterns in a communication-efficient manner. Our experiments on four AMFL simulations demonstrate the efficiency and effectiveness of our proposed approach.
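
The bridging role of the hypernetwork can be sketched as follows: a shared hypernetwork maps each client's descriptor embedding to parameters shaped for that client's own architecture, so heterogeneous clients share knowledge through the hypernetwork weights. The linear hypernetwork, shapes, and names are assumptions, not FedMBridge's design.

```python
# Toy sketch of the hypernetwork-as-bridge idea; the linear hypernetwork
# and all shapes/names are assumptions for illustration only.
import numpy as np

rng = np.random.default_rng(0)
EMB = 8
hyper_W = rng.normal(scale=0.1, size=(EMB, 128))     # shared across clients

def generate_params(client_embedding, shapes):
    """Map a client descriptor to parameters cut to that client's shapes."""
    flat = client_embedding @ hyper_W                # client-specific vector
    params, i = [], 0
    for shp in shapes:
        n = int(np.prod(shp))
        params.append(flat[i:i + n].reshape(shp))
        i += n
    return params

# Two clients with different local architectures (different layer shapes).
p_a = generate_params(rng.normal(size=EMB), [(4, 8), (8,)])
p_b = generate_params(rng.normal(size=EMB), [(10, 10)])
print(p_a[0].shape, p_b[0].shape)
```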

NeurIPS Conference 2024 · Conference Paper

Humor in AI: Massive Scale Crowd-Sourced Preferences and Benchmarks for Cartoon Captioning

  • Jifan Zhang
  • Lalit Jain
  • Yang Guo
  • Jiayi Chen
  • Kuan L. Zhou
  • Siddharth Suresh
  • Andrew Wagenmaker
  • Scott Sievert

We present a novel multimodal preference dataset for creative tasks, consisting of over 250 million human votes on more than 2.2 million captions, collected through crowdsourcing rating data for The New Yorker's weekly cartoon caption contest over the past eight years. This unique dataset supports the development and evaluation of multimodal large language models and preference-based fine-tuning algorithms for humorous caption generation. We propose novel benchmarks for judging the quality of model-generated captions, utilizing both GPT4 and human judgments to establish ranking-based evaluation strategies. Our experimental results highlight the limitations of current fine-tuning methods, such as RLHF and DPO, when applied to creative tasks. Furthermore, we demonstrate that even state-of-the-art models like GPT4 and Claude currently underperform top human contestants in generating humorous captions. As we conclude this extensive data collection effort, we release the entire preference dataset to the research community, fostering further advancements in AI humor generation and evaluation.

AAAI Conference 2024 · Conference Paper

On Disentanglement of Asymmetrical Knowledge Transfer for Modality-Task Agnostic Federated Learning

  • Jiayi Chen
  • Aidong Zhang

There has been growing concern regarding data privacy during the development and deployment of Multimodal Foundation Models for Artificial General Intelligence (AGI), while Federated Learning (FL) allows multiple clients to collaboratively train models in a privacy-preserving manner. This paper formulates and studies Modality-task Agnostic Federated Learning (AFL) to pave the way toward privacy-preserving AGI. A unique property of AFL is the asymmetrical knowledge relationships among clients due to modality gaps, task gaps, and domain shifts between clients. This raises the challenge of learning an optimal inter-client information-sharing scheme that maximizes positive transfer and minimizes negative transfer for AFL. However, prior FL methods, mostly focusing on symmetrical knowledge transfer, tend to exhibit insufficient positive transfer and fail to fully avoid negative transfer during inter-client collaboration. To address this issue, we propose DisentAFL, which leverages a two-stage Knowledge Disentanglement and Gating mechanism to explicitly decompose the original asymmetrical inter-client information-sharing scheme into several independent symmetrical inter-client information-sharing schemes, each of which corresponds to a certain semantic knowledge type learned from the local tasks. Experimental results demonstrate the superiority of our method on AFL over baseline approaches.
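
A deliberately abstract sketch of the gating idea: each client opens gates only for the knowledge types it can contribute to, and sharing then happens symmetrically within each type. The knowledge types and gate values below are invented for illustration, not DisentAFL's learned decomposition.

```python
# Abstract sketch of gated, type-wise symmetric sharing; the knowledge
# types and gate values are invented for illustration only.
knowledge_types = ["vision", "text", "temporal"]
gates = {                       # client -> which knowledge types it exposes
    "client_A": {"vision": 1, "text": 1, "temporal": 0},
    "client_B": {"vision": 1, "text": 0, "temporal": 1},
    "client_C": {"vision": 0, "text": 1, "temporal": 1},
}
for k in knowledge_types:       # one symmetric sharing group per type
    group = [c for c, g in gates.items() if g[k]]
    print(f"{k}: symmetric sharing among {group}")
```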

AAAI Conference 2023 · Conference Paper

Tracking and Reconstructing Hand Object Interactions from Point Cloud Sequences in the Wild

  • Jiayi Chen
  • Mi Yan
  • Jiazhao Zhang
  • Yinzhen Xu
  • Xiaolong Li
  • Yijia Weng
  • Li Yi
  • Shuran Song

In this work, we tackle the challenging task of jointly tracking hand object poses and reconstructing their shapes from depth point cloud sequences in the wild, given the initial poses at frame 0. We propose, for the first time, a point-cloud-based hand joint tracking network, HandTrackNet, to estimate the inter-frame hand joint motion. HandTrackNet introduces a novel hand pose canonicalization module to ease the tracking task, yielding accurate and robust hand joint tracking. Our pipeline then reconstructs the full hand by converting the predicted hand joints into a MANO hand. For object tracking, we devise a simple yet effective module that estimates the object SDF from the first frame and performs optimization-based tracking. Finally, a joint optimization step performs joint hand and object reasoning, which alleviates occlusion-induced ambiguity and further refines the hand pose. During training, the whole pipeline sees only purely synthetic data, synthesized with sufficient variation and with depth simulation to ease generalization. The pipeline is robust to the resulting generalization gaps and thus transfers directly to real in-the-wild data. We evaluate our method on two real hand-object interaction datasets, i.e., HO3D and DexYCB, without any fine-tuning. Our experiments demonstrate that the proposed method significantly outperforms previous state-of-the-art depth-based hand and object pose estimation and tracking methods, running at a frame rate of 9 FPS. We have released our code at https://github.com/PKU-EPIC/HOTrack.
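
A loose sketch of the canonicalization step described above: the current point cloud is normalized by the previous frame's estimated hand position and orientation, so the tracking network always sees a roughly canonical input. The wrist reference and rotation convention are assumptions, not HandTrackNet's exact module.

```python
# Loose sketch of pose canonicalization for tracking; the wrist reference
# point and the rotation convention are assumptions.
import numpy as np

def canonicalize(points, prev_wrist_pos, prev_rot):
    """points: (N, 3); prev_rot: (3, 3) previous global hand orientation
    (canonical-to-world). Undo translation, then rotation."""
    return (points - prev_wrist_pos) @ prev_rot

rng = np.random.default_rng(0)
cloud = rng.normal(size=(512, 3)) + np.array([0.4, 0.1, 0.9])
R = np.eye(3)                                     # identity for the toy case
canon = canonicalize(cloud, prev_wrist_pos=np.array([0.4, 0.1, 0.9]), prev_rot=R)
print(canon.mean(axis=0).round(2))                # roughly origin-centered
```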