Author name cluster

Jiachen Li

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.
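The grouping rule described above (case-insensitive exact name matches, no identity disambiguation) can be sketched as follows. This is a minimal illustration, not Arrow's actual implementation: the row layout, the `group_by_exact_name` helper, and the whitespace trimming are all assumptions made for the example.

```python
from collections import defaultdict


def group_by_exact_name(author_rows):
    """Group author rows whose names match exactly once case is ignored.

    Hypothetical helper: no fuzzy matching and no disambiguation is done,
    so distinct people who share a name land in the same cluster.
    """
    clusters = defaultdict(list)
    for row in author_rows:
        # casefold() gives a case-insensitive key; strip() is an assumption.
        key = row["name"].strip().casefold()
        clusters[key].append(row)
    return dict(clusters)


# Illustrative rows only; the ids are made up.
rows = [
    {"id": 1, "name": "Jiachen Li"},
    {"id": 2, "name": "jiachen li"},
    {"id": 3, "name": "Jiachen Lee"},
]
clusters = group_by_exact_name(rows)
```

Because the key is the full name, "Jiachen Lee" stays in its own cluster, while the two spellings of "Jiachen Li" are merged even if they belong to different people.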

26 papers

2 author rows

TMLR Journal 2026 Journal Article

From Preferences to Prejudice: The Role of Alignment Tuning in Shaping Social Bias in Video Diffusion Models

Zefan Cai
Haoyi Qiu
Haozhe Zhao
Ke Wan
Jiachen Li
Jiuxiang Gu
Wen Xiao
Nanyun Peng

Recent advances in video diffusion models have significantly enhanced text-to-video generation, particularly through alignment tuning using reward models trained on human preferences. While these methods improve visual quality, they can unintentionally encode and amplify social biases. To systematically trace how such biases evolve throughout the alignment pipeline, we introduce VideoBiasEval, a comprehensive diagnostic framework for evaluating social representation in video generation. Grounded in established social bias taxonomies, VideoBiasEval employs an event-based prompting strategy to disentangle semantic content (verbs and contexts) from actor attributes (gender and ethnicity). It further introduces multi-granular metrics to evaluate (1) overall ethnicity bias, (2) gender bias conditioned on ethnicity, (3) distributional shifts in social attributes across model variants, and (4) the temporal persistence of bias within videos. Using this framework, we conduct the first end-to-end analysis connecting biases in human preference datasets, their amplification in reward models, and their propagation through alignment-tuned video diffusion models. Our results reveal that alignment tuning not only strengthens representational biases but also makes them temporally stable, producing smoother yet more stereotyped portrayals. These findings highlight the need for bias-aware evaluation and mitigation throughout the alignment process to ensure fair and socially responsible video generation.

AAAI Conference 2026 Conference Paper

Modality-Aware Bias Mitigation and Invariance Learning for Unsupervised Visible-Infrared Person Re-Identification

Menglin Wang
Xiaojin Gong
Jiachen Li
Genlin Ji

Unsupervised visible-infrared person re-identification (USVI-ReID) aims to match individuals across visible and infrared cameras without relying on any annotation. Given the significant gap between the visible and infrared modalities, estimating reliable cross-modality association becomes a major challenge in USVI-ReID. Existing methods usually adopt optimal transport to associate the intra-modality clusters, which is prone to propagating local cluster errors and also overlooks global instance-level relations. By mining and attending to the visible-infrared modality bias, this paper focuses on addressing cross-modality learning from two aspects: bias-mitigated global association and modality-invariant representation learning. Motivated by the camera-aware distance rectification in single-modality re-ID, we propose a modality-aware Jaccard distance to mitigate the distance bias caused by modality discrepancy, so that more reliable cross-modality associations can be estimated through global clustering. To further improve cross-modality representation learning, a 'split-and-contrast' strategy is designed to obtain modality-specific global prototypes. By explicitly aligning these prototypes under global association guidance, modality-invariant yet ID-discriminative representation learning can be achieved. While conceptually simple, our method obtains state-of-the-art performance on benchmark VI-ReID datasets and outperforms existing methods by a significant margin, validating its effectiveness.
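For readers unfamiliar with the base quantity, the plain Jaccard distance between two neighbor sets can be sketched as below. This shows only the standard definition; the modality-aware rectification proposed in the paper is not given in the abstract and is not reproduced here. The function name and the example neighbor sets are hypothetical.

```python
def jaccard_distance(neighbors_a, neighbors_b):
    """Standard Jaccard distance: 1 - |A ∩ B| / |A ∪ B|."""
    a, b = set(neighbors_a), set(neighbors_b)
    union = a | b
    if not union:
        # Convention chosen here: two empty sets are treated as identical.
        return 0.0
    return 1.0 - len(a & b) / len(union)


# Made-up neighbor lists for two samples (indices of nearby instances).
visible_neighbors = [1, 2, 3]
infrared_neighbors = [2, 3, 4]
distance = jaccard_distance(visible_neighbors, infrared_neighbors)
```

Two samples that share most of their neighbors get a distance near 0, which is why a systematic modality gap in the neighbor lists would bias this distance.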

AAAI Conference 2026 Conference Paper

TrajEvo: Trajectory Prediction Heuristics Design via LLM-driven Evolution

Zhikai Zhao
Chuanbo Hua
Federico Berto
Kanghoon Lee
Zihan Ma
Jiachen Li
Jinkyoo Park

Trajectory prediction is a crucial task in modeling human behavior, especially in safety-critical fields such as social robotics and autonomous vehicle navigation. Traditional heuristics based on handcrafted rules often lack accuracy, while recently proposed deep learning approaches suffer from computational cost, slow inference speed, lack of explainability, and generalization issues that limit their practical adoption in such environments. In this paper, we introduce TrajEvo, a framework that leverages Large Language Models (LLMs) to automatically design trajectory prediction heuristics. TrajEvo employs an evolutionary algorithm to generate and refine prediction heuristics from past trajectory data. We introduce a Cross-Generation Elite Sampling to promote population diversity and a Statistics Feedback Loop allowing the LLM to analyze alternative predictions. Our evaluations show TrajEvo outperforms previous heuristic methods on various real-world datasets, and remarkably outperforms both heuristics and deep learning methods when generalizing to an unseen real-world dataset. TrajEvo represents a first step toward automated design of fast, explainable, and generalizable trajectory prediction heuristics. We make our source code publicly available to foster future research.

ICRA Conference 2025 Conference Paper

A Visual Servo System for Robotic on-Orbit Servicing Based on 3D Perception of Non-Cooperative Satellite

Panpan Zhao
Li Jin
Yeheng Chen
Jiachen Li
Xiuqiang Song
Wenxuan Chen
Nan Li
Wenjuan Du

The 3D perception of satellites, including both their shape and pose, is a key foundation for robotic on-orbit servicing. However, the demanding space environment, such as intense and dim illumination, presents significant challenges. Previous non-cooperative methods focus on specific geometric features like solar panel brackets or docking rings, overlooking the satellite's overall shape and increasing the risk of collisions during grasping. Additionally, satellites are often weakly textured, limiting the accuracy of 3D perception. To address these issues, we propose, for the first time, a 3D perception-based visual servo system for non-cooperative satellites. This system combines reconstruction and tracking to enhance shape perception and pose estimation accuracy in orbital conditions. Specifically, we employ an alternating iterative strategy to simultaneously reconstruct and track the satellite and introduce a novel constraint to fuse different cues under extreme conditions. Further, we develop a simulation environment platform, a dual-arm microgravity grasping system, and an online monitoring module to enhance system capabilities for on-orbit servicing. Synthetic and real-world datasets from the simulation environment are also created for experimental validation. Results show that each module of our system achieves state-of-the-art performance.

IROS Conference 2025 Conference Paper

HeightAware-BEV: Height-Aware Feature Mapping for Efficient Bird's-Eye-View Perception

Renjie Zhou
Jiachen Li
Zhen Su
Chao Lu
Zhengjun Wang

Bird’s-Eye View (BEV) perception has gained significant attention in autonomous driving and robotics due to its advantages in simplifying modality alignment and feature fusion. Addressing the challenge of jointly optimizing performance and efficiency in 2D-3D view transformation, we identify that, compared to depth information, which is viewpoint-dependent and requires camera intrinsics for estimation, height information maintains prediction consistency across different camera perspectives. Based on this insight, we propose the HeightAware-BEV framework, which achieves efficient and accurate view transformation through height-aware feature mapping: (1) building on an efficient projection-based view transformation approach, 3D voxels directly query the height probability distribution predicted from images according to grid height, weighting the corresponding features to enable precise and efficient feature projection; (2) a dynamic feature filtering mechanism is designed to filter out task-irrelevant features during the view transformation process. Additionally, a weakly-supervised training strategy is designed to improve model performance in scenarios with limited samples. HeightAware-BEV (R50@448×800) achieves an IoU of 47.8% on the nuScenes validation set and 60 FPS on a 2080Ti, outperforming advanced methods such as SimpleBEV and PointBEV. The code is available at https://github.com/Zhou-Renjie/HeightAware-BEV.

ICRA Conference 2025 Conference Paper

LaMMA-P: Generalizable Multi-Agent Long-Horizon Task Allocation and Planning with LM-Driven PDDL Planner

Xiaopan Zhang
Hao Qin
Fuquan Wang
Yue Dong 0002
Jiachen Li

Language models (LMs) possess a strong capability to comprehend natural language, making them effective in translating human instructions into detailed plans for simple robot tasks. Nevertheless, it remains a significant challenge to handle long-horizon tasks, especially in subtask identification and allocation for cooperative heterogeneous robot teams. To address this issue, we propose a Language Model-Driven Multi-Agent PDDL Planner (LaMMA-P), a novel multi-agent task planning framework that achieves state-of-the-art performance on long-horizon tasks. LaMMA-P integrates the strengths of the LMs' reasoning capability and the traditional heuristic search planner to achieve a high success rate and efficiency while demonstrating strong generalization across tasks. Additionally, we create MAT-THOR, a comprehensive benchmark that features household tasks with two different levels of complexity based on the AI2-THOR environment. The experimental results demonstrate that LaMMA-P achieves a 105% higher success rate and 36% higher efficiency than existing LM-based multi-agent planners. The experimental videos, code, datasets, and detailed prompts used in each module can be found on the project website: https://lamma-p.github.io.

ICLR Conference 2025 Conference Paper

MMWorld: Towards Multi-discipline Multi-faceted World Model Evaluation in Videos

Xuehai He
Weixi Feng
Kaizhi Zheng
Yujie Lu
Wanrong Zhu
Jiachen Li
Yue Fan
Jianfeng Wang

Multimodal Large Language Models (MLLMs) demonstrate the emerging abilities of "world models": interpreting and reasoning about complex real-world dynamics. To assess these abilities, we posit videos are the ideal medium, as they encapsulate rich representations of real-world dynamics and causalities. To this end, we introduce MMWorld, a new benchmark for multi-discipline, multi-faceted multimodal video understanding. MMWorld distinguishes itself from previous video understanding benchmarks with two unique advantages: (1) multi-discipline, covering various disciplines that often require domain expertise for comprehensive understanding; (2) multi-faceted reasoning, including explanation, counterfactual thinking, future prediction, etc. MMWorld consists of a human-annotated dataset to evaluate MLLMs with questions about the whole videos and a synthetic dataset to analyze MLLMs within a single modality of perception. Together, MMWorld encompasses 1,910 videos across seven broad disciplines and 69 subdisciplines, complete with 6,627 question-answer pairs and associated captions. The evaluation includes 4 proprietary and 11 open-source MLLMs, which struggle on MMWorld (e.g., GPT-4o performs the best with only 62.5% accuracy), showing large room for improvement. Further ablation studies reveal other interesting findings such as models' different skill sets from humans. We hope MMWorld can serve as an essential step towards world model evaluation in videos.

NeurIPS Conference 2025 Conference Paper

RDD: Retrieval-Based Demonstration Decomposer for Planner Alignment in Long-Horizon Tasks

Mingxuan Yan
Yuping Wang
Zechun Liu
Jiachen Li

To tackle long-horizon tasks, recent hierarchical vision-language-action (VLA) frameworks employ vision-language model (VLM)-based planners to decompose complex manipulation tasks into simpler sub-tasks that low-level visuomotor policies can handle. Typically, the VLM planner needs finetuning to learn to decompose a new task, which requires target task demonstrations segmented into sub-tasks by either human annotation or heuristic rules. However, without prior knowledge, the heuristic sub-tasks can deviate significantly from the visuomotor policy's training data, thereby degrading task performance. To address these issues, we propose a Retrieval-based Demonstration Decomposer (RDD) that automatically decomposes video demonstrations into sub-tasks with prior by aligning the visual features of the decomposed sub-task intervals with those from the training data of the low-level visuomotor policies. RDD outperforms the state-of-the-art sub-task decomposer on both simulation and real-world tasks, demonstrating robustness across diverse settings. Code and more results are available at https://rdd-neurips.github.io

TMLR Journal 2025 Journal Article

Robust Offline Imitation Learning from Diverse Auxiliary Data

Udita Ghosh
Dripta S. Raychaudhuri
Jiachen Li
Konstantinos Karydis
Amit Roy-Chowdhury

Offline imitation learning enables learning a policy solely from a set of expert demonstrations, without any environment interaction. To alleviate the issue of distribution shift arising due to the small amount of expert data, recent works incorporate large numbers of auxiliary demonstrations alongside the expert data. However, the performance of these approaches relies on assumptions about the quality and composition of the auxiliary data, and they are rarely successful when those assumptions do not hold. To address this limitation, we propose Robust Offline Imitation from Diverse Auxiliary Data (ROIDA). ROIDA first identifies high-quality transitions from the entire auxiliary dataset using a learned reward function. These high-reward samples are combined with the expert demonstrations for weighted behavioral cloning. For lower-quality samples, ROIDA applies temporal difference learning to steer the policy towards high-reward states, improving long-term returns. This two-pronged approach enables our framework to effectively leverage both high and low-quality data without any assumptions. Extensive experiments validate that ROIDA achieves robust and consistent performance across multiple auxiliary datasets with diverse ratios of expert and non-expert demonstrations. ROIDA effectively leverages unlabeled auxiliary data, outperforming prior methods reliant on specific data assumptions.

ICLR Conference 2025 Conference Paper

T2V-Turbo-v2: Enhancing Video Model Post-Training through Data, Reward, and Conditional Guidance Design

Jiachen Li
Qian Long
Jian Zheng
Xiaofeng Gao 0002
Robinson Piramuthu
Wenhu Chen
William Yang Wang

In this paper, we focus on enhancing a diffusion-based text-to-video (T2V) model during the post-training phase by distilling a highly capable consistency model from a pretrained T2V model. Our proposed method, T2V-Turbo-v2, introduces a significant advancement by integrating various supervision signals, including high-quality training data, reward model feedback, and conditional guidance, into the consistency distillation process. Through comprehensive ablation studies, we highlight the crucial importance of tailoring datasets to specific learning objectives and the effectiveness of learning from diverse reward models for enhancing both the visual quality and text-video alignment. Additionally, we highlight the vast design space of conditional guidance strategies, which centers on designing an effective energy function to augment the teacher ODE solver. We demonstrate the potential of this approach by extracting motion guidance from the training datasets and incorporating it into the ODE solver, showcasing its effectiveness in improving the motion quality of the generated videos with the improved motion-related metrics from VBench and T2V-CompBench. Empirically, our T2V-Turbo-v2 establishes a new state-of-the-art result on VBench, with a Total score of 85.13, surpassing proprietary systems such as Gen-3 and Kling.

EAAI Journal 2025 Journal Article

Three-dimensional reconstruction and fracture segmentation based on X-ray and computed tomography paired dataset

Yuan Gao
Yuan Zhou
Da Chen
Jiachen Li
Mingle Zhou
Gang Li
Yunbo Gu
Jean-Louis Coatrieux

In some orthopedic surgeries, the use of three-dimensional (3D) computed tomography (CT) scanning technology is not feasible due to scene limitations, leaving doctors to rely on two-dimensional (2D) X-ray images for real-time diagnosis. However, X-ray images lack 3D information, making accurate diagnosis challenging. Developing an algorithm to convert 2D X-ray images into 3D CT images, while simultaneously combining high-quality 3D reconstruction with precise fracture segmentation, offers a promising solution to the problem. In this study, we propose a novel artificial intelligence (AI)-driven framework named 3D reconstruction and segment anything model (3DRecSAM). The reconstruction image enhancer (RIE) is designed to achieve high-precision 3D reconstruction and provide high-quality feature initialization for fracture segmentation. Meanwhile, the mamba segment anything model (MSAM), based on the segment anything model (SAM) architecture, is developed for accurate fracture segmentation. We introduce a Kolmogorov–Arnold network (KAN)-based attention fusion module (KAF), which facilitates the joint optimization of the RIE reconstruction network and the MSAM segmentation network. Furthermore, the selective scanning mamba with KAN (SKM) is incorporated to enhance feature extraction for both RIE and MSAM. Mamba efficiently captures long-range dependencies and sequential patterns, while KAN’s learnable activation functions facilitate adaptive feature fusion and non-linear representation. To train and evaluate 3DRecSAM, we introduce the real X-ray and CT paired dataset (XCPData), which is publicly available on GitHub: https://github.com/YuanGao1201/XCPData.

NeurIPS Conference 2024 Conference Paper

CuMo: Scaling Multimodal LLM with Co-Upcycled Mixture-of-Experts

Jiachen Li
Xinyao Wang
Sijie Zhu
Chia-Wen Kuo
Lu Xu
Fan Chen
Jitesh Jain
Humphrey Shi

Recent advancements in Multimodal Large Language Models (LLMs) have focused primarily on scaling by increasing text-image pair data and enhancing LLMs to improve performance on multimodal tasks. However, these scaling approaches are computationally expensive and overlook the significance of efficiently improving model capabilities from the vision side. Inspired by the successful applications of Mixture-of-Experts (MoE) in LLMs, which improves model scalability during training while keeping inference costs similar to those of smaller models, we propose CuMo, which incorporates Co-upcycled Top-K sparsely-gated Mixture-of-experts blocks into both the vision encoder and the MLP connector, thereby enhancing the multimodal LLMs with negligible additional activated parameters during inference. CuMo first pre-trains the MLP blocks and then initializes each expert in the MoE block from the pre-trained MLP block during the visual instruction tuning stage, with auxiliary losses to ensure a balanced loading of experts. CuMo outperforms state-of-the-art multimodal LLMs across various VQA and visual-instruction-following benchmarks within each model size group, all while training exclusively on open-sourced datasets.
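The two ideas named in the abstract, Top-K sparse gating and initializing every expert from one pre-trained MLP, can be illustrated with a toy sketch. This is not CuMo's code: the tiny `mlp`, the router weights, and the helper names `top_k_gate` and `moe_forward` are all hypothetical, and the auxiliary load-balancing losses are omitted.

```python
import math


def top_k_gate(logits, k):
    """Softmax over the k largest router logits; all other gates are zero."""
    top_idx = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    peak = max(logits[i] for i in top_idx)  # subtract max for numerical stability
    exps = {i: math.exp(logits[i] - peak) for i in top_idx}
    total = sum(exps.values())
    return [exps[i] / total if i in exps else 0.0 for i in range(len(logits))]


def moe_forward(x, experts, router, k=2):
    """Run only the k selected experts and mix their outputs by gate weight."""
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in router]
    gates = top_k_gate(logits, k)
    out = None
    for gate, expert in zip(gates, experts):
        if gate == 0.0:
            continue  # unselected experts are never evaluated
        scaled = [gate * v for v in expert(x)]
        out = scaled if out is None else [a + b for a, b in zip(out, scaled)]
    return out


def mlp(x):
    """Stand-in for a pre-trained MLP block (made-up weights)."""
    return [math.tanh(2.0 * x[0] - x[1]), math.tanh(x[0] + 0.5 * x[1])]


# Upcycling-style initialization: every expert starts as the same MLP.
experts = [mlp, mlp, mlp]
router = [[0.3, -0.1], [0.0, 0.8], [-0.5, 0.2]]  # one row of weights per expert
x = [1.0, 2.0]
y = moe_forward(x, experts, router, k=2)
```

Because the gates sum to 1 and all experts start identical, the MoE block initially reproduces the original MLP output, and only `k` experts are evaluated per input regardless of how many exist.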

AAMAS Conference 2024 Conference Paper

ELA: Exploited Level Augmentation for Offline Learning in Zero-Sum Games

Shiqi Lei
Kanghoon Lee
Linjing Li
Jinkyoo Park
Jiachen Li

Offline learning derives effective policies from expert demonstrators’ datasets without direct interaction. While recent research considers dataset characteristics like expertise level or multiple demonstrators, a distinct approach is necessary in zero-sum games, where outcomes significantly depend on the opponent’s strategy. In this study, we introduce a novel approach using unsupervised learning techniques to estimate the exploited level (EL) of each trajectory from the offline dataset of zero-sum games made by diverse demonstrators. The estimated EL is then integrated into offline learning to maximize the influence of the dominant strategy. Our method enables interpretable EL estimation in multiple zero-sum games, effectively identifying dominant strategies. Also, EL-augmented offline learning significantly enhances the imitation and offline reinforcement learning algorithms in zero-sum games.

ICML Conference 2024 Conference Paper

Mastering Robot Manipulation with Multimodal Prompts through Pretraining and Multi-task Fine-tuning

Jiachen Li
Qiaozi Gao
Michael Johnston
Xiaofeng Gao 0002
Xuehai He
Hangjie Shi
Suhaila Shakiah
Reza Ghanadan

Prompt-based learning has been demonstrated as a compelling paradigm contributing to the tremendous success of large language models (LLMs). Inspired by their success in language tasks, existing research has leveraged LLMs in embodied instruction following and task planning. In this work, we tackle the problem of training a robot to understand multimodal prompts, interleaving vision signals with text descriptions. This type of task poses a major challenge to robots’ capability to understand the interconnection and complementarity between vision and language signals. In this work, we introduce an effective framework that learns a policy to perform robot manipulation with multimodal prompts from multi-task expert trajectories. Our method consists of a two-stage training pipeline that performs inverse dynamics pretraining and multi-task finetuning. To facilitate multimodal understanding, we design our multimodal prompt encoder by augmenting a pretrained LM with a residual connection to the visual input and model the dependencies among action dimensions. Empirically, we evaluate the efficacy of our method on the VIMA-BENCH and establish a new state-of-the-art (10% improvement in success rate). Moreover, we demonstrate that our model exhibits remarkable in-context learning ability.

TMLR Journal 2024 Journal Article

Reward Guided Latent Consistency Distillation

Jiachen Li
Weixi Feng
Wenhu Chen
William Yang Wang

Latent Consistency Distillation (LCD) has emerged as a promising paradigm for efficient text-to-image synthesis. By distilling a latent consistency model (LCM) from a pre-trained teacher latent diffusion model (LDM), LCD facilitates the generation of high-fidelity images within merely 2 to 4 inference steps. However, the LCM's efficient inference is obtained at the cost of the sample quality. In this paper, we propose compensating the quality loss by aligning LCM's output with human preference during training. Specifically, we introduce Reward Guided LCD (RG-LCD), which integrates feedback from a reward model (RM) into the LCD process by augmenting the original LCD loss with the objective of maximizing the reward associated with LCM's single-step generation. As validated through human evaluation, when trained with the feedback of a good RM, the 2-step generations from our RG-LCM are favored by humans over the 50-step DDIM samples from the teacher LDM, representing a 25-time inference acceleration without quality loss. As directly optimizing towards differentiable RMs can suffer from over-optimization, we take the initial step to overcome this difficulty by proposing the use of a latent proxy RM (LRM). This novel component serves as an intermediary, connecting our LCM with the RM. Empirically, we demonstrate that incorporating the LRM into our RG-LCD successfully avoids high-frequency noise in the generated images, contributing to both improved Fréchet Inception Distance (FID) on MS-COCO and a higher HPSv2.1 score on HPSv2's test set, surpassing those achieved by the baseline LCM. Project Page: https://rg-lcd.github.io/

NeurIPS Conference 2024 Conference Paper

T2V-Turbo: Breaking the Quality Bottleneck of Video Consistency Model with Mixed Reward Feedback

Jiachen Li
Weixi Feng
Tsu-Jui Fu
Xinyi Wang
Sugato Basu
Wenhu Chen
William Y. Wang

Diffusion-based text-to-video (T2V) models have achieved significant success but continue to be hampered by the slow sampling speed of their iterative sampling processes. To address the challenge, consistency models have been proposed to facilitate fast inference, albeit at the cost of sample quality. In this work, we aim to break the quality bottleneck of a video consistency model (VCM) to achieve both fast and high-quality video generation. We introduce T2V-Turbo, which integrates feedback from a mixture of differentiable reward models into the consistency distillation (CD) process of a pre-trained T2V model. Notably, we directly optimize rewards associated with single-step generations that arise naturally from computing the CD loss, effectively bypassing the memory constraints imposed by backpropagating gradients through an iterative sampling process. Remarkably, the 4-step generations from our T2V-Turbo achieve the highest total score on VBench, even surpassing Gen-2 and Pika. We further conduct human evaluations to corroborate the results, validating that the 4-step generations from our T2V-Turbo are preferred over the 50-step DDIM samples from their teacher models, representing more than a tenfold acceleration while improving video generation quality.

ICLR Conference 2023 Conference Paper

Causal Balancing for Domain Generalization

Xinyi Wang 0003
Michael Saxon
Jiachen Li
Hongyang Zhang 0001
Kun Zhang 0001
William Yang Wang

While machine learning models rapidly advance the state-of-the-art on various real-world tasks, out-of-domain (OOD) generalization remains a challenging problem given the vulnerability of these models to spurious correlations. We propose a balanced mini-batch sampling strategy to transform a biased data distribution into a spurious-free balanced distribution, based on the invariance of the underlying causal mechanisms for the data generation process. We argue that the Bayes optimal classifiers trained on such balanced distribution are minimax optimal across a diverse enough environment space. We also provide an identifiability guarantee of the latent variable model of the proposed data generation process, when utilizing enough train environments. Experiments are conducted on DomainBed, demonstrating empirically that our method obtains the best performance across 20 baselines reported on the benchmark.

ICML Conference 2023 Conference Paper

Offline Reinforcement Learning with Closed-Form Policy Improvement Operators

Jiachen Li
Edwin Zhang
Ming Yin 0003
Qinxun Bai
Yu-Xiang Wang 0003
William Yang Wang

Behavior constrained policy optimization has been demonstrated to be a successful paradigm for tackling Offline Reinforcement Learning. By exploiting historical transitions, a policy is trained to maximize a learned value function while constrained by the behavior policy to avoid a significant distributional shift. In this paper, we propose our closed-form policy improvement operators. We make a novel observation that the behavior constraint naturally motivates the use of first-order Taylor approximation, leading to a linear approximation of the policy objective. Additionally, as practical datasets are usually collected by heterogeneous policies, we model the behavior policies as a Gaussian Mixture and overcome the induced optimization difficulties by leveraging the LogSumExp’s lower bound and Jensen’s Inequality, giving rise to a closed-form policy improvement operator. We instantiate both one-step and iterative offline RL algorithms with our novel policy improvement operators and empirically demonstrate their effectiveness over state-of-the-art algorithms on the standard D4RL benchmark. Our code is available at https://cfpi-icml23.github.io/.
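The observation that a behavior constraint makes a first-order Taylor approximation reasonable can be illustrated with a toy sketch. It assumes a single deterministic behavior action and an L2-ball constraint of radius `radius`, a simplification chosen here and not the paper's Gaussian Mixture formulation. Under that assumption, maximizing the linearized objective has the closed-form solution of stepping `radius` along the normalized gradient. The quadratic `q_value` and all function names are hypothetical.

```python
import math


def q_value(action, target):
    """Toy value function (assumption): higher when the action is near target."""
    return -sum((a - t) ** 2 for a, t in zip(action, target))


def q_grad(action, target):
    """Gradient of the toy value function with respect to the action."""
    return [-2.0 * (a - t) for a, t in zip(action, target)]


def closed_form_improvement(behavior_action, grad, radius):
    """Maximize the linearized objective g . (a - a_b) subject to ||a - a_b|| <= radius.

    A linear objective over an L2 ball is maximized on the boundary, in the
    direction of the gradient: a = a_b + radius * g / ||g||.
    """
    norm = math.sqrt(sum(g * g for g in grad))
    if norm == 0.0:
        return list(behavior_action)  # flat objective: keep the behavior action
    return [a + radius * g / norm for a, g in zip(behavior_action, grad)]


behavior_action = [0.0, 0.0]
target = [1.0, 2.0]
grad = q_grad(behavior_action, target)
improved = closed_form_improvement(behavior_action, grad, radius=0.5)
```

The improved action stays exactly `radius` away from the behavior action, so no iterative optimizer is needed and the shift away from the data is bounded by construction.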

ICRA Conference 2023 Conference Paper

Online Hand-Eye Calibration with Decoupling by 3D Textureless Object Tracking

Li Jin
Kang Xie
Wenxuan Chen
Xin Cao
Yuehua Li
Jiachen Li
Jiankai Qian
Xueying Qin

Hand-eye calibration estimates the pose of a camera relative to a robot, which is a fundamental problem for visually guided robots, especially for dynamic object grasping. Most methods use 2D fiducial markers with distinctive visual features and require pre-calibration to be accurate, so they cannot work online. In this paper, we propose a novel hand-eye calibration method based on a natural 3D object, which can work online and automatically even if the object is textureless or weakly textured. We first propose a Pose Refinement Network (PR-Net) to improve the accuracy of 3D object tracking. Then we build a 3D convergence point constraint based on the multi-view information with the accurate object pose to adjust the object position. Finally, we optimize the hand-eye pose by the closed-loop constraint with the optimized object position, addressing the tendency to fall into a local minimum. The experiments show that the average error of our hand-eye calibration method is 1.20 degrees and 23.18 mm. The results achieve state-of-the-art performance by using the working object to realize online hand-eye calibration.

ICRA Conference 2023 Conference Paper

Pedestrian Crossing Action Recognition and Trajectory Prediction with 3D Human Keypoints

Jiachen Li
Xinwei Shi
Feiyu Chen
Jonathan Stroud
Zhishuai Zhang
Tian Lan
Junhua Mao
Jeonhyung Kang

Accurate understanding and prediction of human behaviors are critical prerequisites for autonomous vehicles, especially in highly dynamic and interactive scenarios such as intersections in dense urban areas. In this work, we aim at identifying crossing pedestrians and predicting their future trajectories. To achieve these goals, we not only need the context information of road geometry and other traffic participants but also need fine-grained information of the human pose, motion and activity, which can be inferred from human keypoints. In this paper, we propose a novel multi-task learning framework for pedestrian crossing action recognition and trajectory prediction, which utilizes 3D human keypoints extracted from raw sensor data to capture rich information on human pose and activity. Moreover, we propose to apply two auxiliary tasks and contrastive learning to enable auxiliary supervisions to improve the learned keypoints representation, which further enhances the performance of major tasks. We validate our approach on a large-scale in-house dataset, as well as a public benchmark dataset, and show that our approach achieves state-of-the-art performance on a wide range of evaluation metrics. The effectiveness of each model component is validated in a detailed ablation study.

NeurIPS Conference 2022 Conference Paper

Interaction Modeling with Multiplex Attention

Fan-Yun Sun
Isaac Kauvar
Ruohan Zhang
Jiachen Li
Mykel J Kochenderfer
Jiajun Wu
Nick Haber

Modeling multi-agent systems requires understanding how agents interact. Such systems are often difficult to model because they can involve a variety of types of interactions that layer together to drive rich social behavioral dynamics. Here we introduce a method for accurately modeling multi-agent systems. We present Interaction Modeling with Multiplex Attention (IMMA), a forward prediction model that uses a multiplex latent graph to represent multiple independent types of interactions and attention to account for relations of different strengths. We also introduce Progressive Layer Training, a training strategy for this architecture. We show that our approach outperforms state-of-the-art models in trajectory forecasting and relation inference, spanning three multi-agent scenarios: social navigation, cooperative task achievement, and team sports. We further demonstrate that our approach can improve zero-shot generalization and allows us to probe how different interactions impact agent behavior.

PDF Details

NeurIPS Conference 2022 Conference Paper

Learning Physical Dynamics with Subequivariant Graph Neural Networks

Jiaqi Han
Wenbing Huang
Hengbo Ma
Jiachen Li
Josh Tenenbaum
Chuang Gan

Graph Neural Networks (GNNs) have become a prevailing tool for learning physical dynamics. However, they still encounter several challenges: 1) Physical laws abide by symmetry, which is a vital inductive bias accounting for model generalization and should be incorporated into the model design. Existing simulators either consider insufficient symmetry, or enforce excessive equivariance in practice when symmetry is partially broken by gravity. 2) Objects in the physical world possess diverse shapes, sizes, and properties, which should be appropriately processed by the model. To tackle these difficulties, we propose a novel backbone, called Subequivariant Graph Neural Network, which 1) relaxes equivariance to subequivariance by considering external fields like gravity, where the universal approximation ability holds theoretically; 2) introduces a new subequivariant object-aware message passing for learning physical interactions between multiple objects of various shapes in particle-based representation; 3) operates in a hierarchical fashion, allowing for modeling long-range and complex interactions. Our model achieves on average over 3% enhancement in contact prediction accuracy across 8 scenarios on Physion and 2× lower rollout MSE on RigidFall compared with state-of-the-art GNN simulators, while exhibiting strong generalization and data efficiency.

PDF Details

IJCAI Conference 2020 Conference Paper

A Speech-to-Knowledge-Graph Construction System

Xiaoyi Fu
Jie Zhang
Hao Yu
Jiachen Li
Dong Chen
Jie Yuan
Xindong Wu

This paper presents a HAO-Graph system that generates and visualizes knowledge graphs from a speech in real-time. When a user speaks to the system, HAO-Graph transforms the voice into knowledge graphs with key phrases from the original speech as nodes and edges. Different from language-to-language systems, such as Chinese-to-English and English-to-English, HAO-Graph converts a speech into graphs, and is the first of its kind. The effectiveness of our HAO-Graph system is verified by a two-hour chairman's talk in front of two thousand participants at an annual meeting in the form of a satisfaction survey.

PDF Details DOI

NeurIPS Conference 2020 Conference Paper

EvolveGraph: Multi-Agent Trajectory Prediction with Dynamic Relational Reasoning

Jiachen Li
Fan Yang
Masayoshi Tomizuka
Chiho Choi

Multi-agent interacting systems are prevalent in the world, from purely physical systems to complicated social dynamic systems. In many applications, effective understanding of the situation and accurate trajectory prediction of interactive agents play a significant role in downstream tasks, such as decision making and planning. In this paper, we propose a generic trajectory forecasting framework (named EvolveGraph) with explicit relational structure recognition and prediction via latent interaction graphs among multiple heterogeneous, interactive agents. Considering the uncertainty of future behaviors, the model is designed to provide multi-modal prediction hypotheses. Since the underlying interactions may evolve even with abrupt changes, and different modalities of evolution may lead to different outcomes, we address the necessity of dynamic relational reasoning and adaptively evolving the interaction graphs. We also introduce a double-stage training pipeline which not only improves training efficiency and accelerates convergence, but also enhances model performance. The proposed framework is evaluated on both synthetic physics simulations and multiple real-world benchmark datasets in various areas. The experimental results illustrate that our approach achieves state-of-the-art performance in terms of prediction accuracy.

PDF Details

NeurIPS Conference 2020 Conference Paper

Multi-task Batch Reinforcement Learning with Metric Learning

Jiachen Li
Quan Vuong
Shuang Liu
Minghua Liu
Kamil Ciosek
Henrik Christensen
Hao Su

We tackle the Multi-task Batch Reinforcement Learning problem. Given multiple datasets collected from different tasks, we train a multi-task policy to perform well in unseen tasks sampled from the same distribution. The task identities of the unseen tasks are not provided. To perform well, the policy must infer the task identity from collected transitions by modelling its dependency on states, actions and rewards. Because the different datasets may have state-action distributions with large divergence, the task inference module can learn to ignore the rewards and spuriously correlate only state-action pairs to the task identity, leading to poor test time performance. To robustify task inference, we propose a novel application of the triplet loss. To mine hard negative examples, we relabel the transitions from the training tasks by approximating their reward functions. When we allow further training on the unseen tasks, using the trained policy as an initialization leads to significantly faster convergence compared to randomly initialized policies (up to 80% improvement and across 5 different Mujoco task distributions). We name our method MBML (Multi-task Batch RL with Metric Learning).
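For readers unfamiliar with the triplet loss mentioned in this abstract, the following is a minimal sketch of the generic formulation, max(0, d(anchor, positive) - d(anchor, negative) + margin). It is not the authors' implementation: the function names, the Euclidean distance, and the margin value are assumptions chosen for illustration, and the reward-relabelling step used to mine hard negatives is not shown.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Generic triplet loss (illustrative, hypothetical names).

    The loss is zero once the negative is at least `margin` farther
    from the anchor than the positive is; otherwise it is positive.
    """
    d_pos = euclidean(anchor, positive)
    d_neg = euclidean(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)


# An easy negative (far from the anchor) contributes no loss,
# while a hard negative (close to the anchor) is penalized.
easy = triplet_loss([0.0, 0.0], [0.0, 1.0], [3.0, 4.0])
hard = triplet_loss([0.0, 0.0], [0.0, 1.0], [0.0, 1.5])
```

In this toy setting the embeddings would stand for task-inference outputs, with the positive drawn from the same task as the anchor and the negative from a different task; that mapping is an assumption made here for illustration.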

PDF Details

AAAI Conference 2019 Conference Paper

Weakly Supervised Scene Parsing with Point-Based Distance Metric Learning

Rui Qian
Yunchao Wei
Honghui Shi
Jiachen Li
Jiaying Liu
Thomas Huang

Semantic scene parsing is suffering from the fact that pixel-level annotations are hard to collect. To tackle this issue, we propose a Point-based Distance Metric Learning (PDML) in this paper. PDML does not require dense annotated masks and only leverages several labeled points that are much easier to obtain to guide the training process. Concretely, we leverage semantic relationship among the annotated points by encouraging the feature representations of the intra- and inter-category points to keep consistent, i.e., points within the same category should have more similar feature representations compared to those from different categories. We formulate such a characteristic into a simple distance metric loss, which collaborates with the point-wise cross-entropy loss to optimize the deep neural networks. Furthermore, to fully exploit the limited annotations, distance metric learning is conducted across different training images instead of simply adopting an image-dependent manner. We conduct extensive experiments on two challenging scene parsing benchmarks of PASCAL-Context and ADE20K to validate the effectiveness of our PDML, and competitive mIoU scores are achieved.
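The idea of pulling same-category point features together and pushing different-category ones apart can be sketched with a generic pairwise contrastive loss. This is an illustrative assumption, not the loss defined in the paper: the function name, the squared-distance form, and the margin are hypothetical, and the point features are assumed to be pooled from any number of training images.

```python
import math
from itertools import combinations


def point_metric_loss(features, labels, margin=1.0):
    """Average pairwise contrastive loss over labeled points (illustrative).

    Same-label pairs are penalized by their squared distance;
    different-label pairs are penalized only when closer than `margin`.
    """
    total, count = 0.0, 0
    for i, j in combinations(range(len(features)), 2):
        dist = math.sqrt(
            sum((a - b) ** 2 for a, b in zip(features[i], features[j]))
        )
        if labels[i] == labels[j]:
            total += dist ** 2
        else:
            total += max(0.0, margin - dist) ** 2
        count += 1
    return total / count if count else 0.0


# Two points of the same category coincide and a third category is far away,
# so every pair already satisfies the constraint.
separated = point_metric_loss(
    [[0.0, 0.0], [0.0, 0.0], [3.0, 4.0]], [0, 0, 1], margin=1.0
)
```

In practice such a term would be added to the point-wise cross-entropy loss, as the abstract describes; the weighting between the two is not specified here.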

PDF Details