Arrow Research search

Author name cluster

Jianbing Shen

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

14 papers
2 author rows

Possible papers (14)

AAAI Conference 2026 Conference Paper

Less Is More: Vision Representation Compression for Efficient Video Generation with Large Language Models

  • Yucheng Zhou
  • Jihai Zhang
  • Guanjie Chen
  • Jianbing Shen
  • Yu Cheng

Video generation using Large Language Models (LLMs) has shown promising potential, effectively leveraging the extensive LLM infrastructure to provide a unified framework for multimodal understanding and content generation. However, these methods face critical challenges, i.e., token redundancy and inefficiencies arising from long sequences, which constrain their performance and efficiency compared to diffusion-based approaches. In this study, we investigate the impact of token redundancy in LLM-based video generation through an information-theoretic analysis and propose Vision Representation Compression (VRC), a novel framework designed to achieve more in both performance and efficiency with fewer video token representations. VRC introduces a learnable representation compressor and decompressor to compress video token representations, enabling autoregressive next-sequence prediction in a compact latent space. Our approach reduces redundancy, shortens token sequences, and improves the model's ability to capture underlying video structures. Our experiments demonstrate that VRC reduces token sequence lengths by a factor of 4, achieving a 9-14x acceleration in inference while maintaining performance comparable to state-of-the-art video generation models. VRC not only accelerates inference but also significantly reduces memory requirements during both model training and inference.
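
The core mechanism this abstract describes, compressing the video token sequence by a factor of 4 and decompressing it after autoregressive prediction in the compact latent space, can be pictured with a minimal sketch. This is not the authors' VRC implementation; the module name, layer choices, and dimensions below are illustrative assumptions.

```python
# Minimal sketch (not the authors' code): a learnable compressor/decompressor that
# shortens a video token sequence by a factor of 4, so an autoregressive model can
# predict in the compact latent space. Names and sizes are illustrative assumptions.
import torch
import torch.nn as nn

class TokenCompressor(nn.Module):
    def __init__(self, dim=1024, factor=4):
        super().__init__()
        self.factor = factor
        # merge every `factor` consecutive tokens into one latent token, and back
        self.down = nn.Linear(dim * factor, dim)
        self.up = nn.Linear(dim, dim * factor)

    def compress(self, tokens):           # tokens: (B, L, D), L divisible by factor
        b, l, d = tokens.shape
        grouped = tokens.reshape(b, l // self.factor, d * self.factor)
        return self.down(grouped)         # (B, L/4, D)

    def decompress(self, latents):        # latents: (B, L/4, D)
        b, l, d = latents.shape
        expanded = self.up(latents)       # (B, L/4, 4*D)
        return expanded.reshape(b, l * self.factor, d)

tokens = torch.randn(2, 256, 1024)        # dummy video tokens
vrc = TokenCompressor()
latents = vrc.compress(tokens)            # (2, 64, 1024): 4x shorter sequence
recon = vrc.decompress(latents)           # (2, 256, 1024)
print(latents.shape, recon.shape)
```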

AAAI Conference 2026 Conference Paper

Sim4Seg: Boosting Multimodal Multi-disease Medical Diagnosis Segmentation with Region-Aware Vision-Language Similarity Masks

  • Lingran Song
  • Yucheng Zhou
  • Jianbing Shen

Despite significant progress in pixel-level medical image analysis, existing medical image segmentation models rarely explore medical segmentation and diagnosis tasks jointly. However, it is crucial for patients that models can provide explainable diagnoses along with medical segmentation results. In this paper, we introduce a medical vision-language task named Medical Diagnosis Segmentation (MDS), which aims to understand clinical queries for medical images and generate the corresponding segmentation masks as well as diagnostic results. To facilitate this task, we first present the Multimodal Multi-disease Medical Diagnosis Segmentation (M3DS) dataset, containing diverse multimodal multi-disease medical images paired with their corresponding segmentation masks and diagnosis chain-of-thought, created via an automated diagnosis chain-of-thought generation pipeline. Moreover, we propose Sim4Seg, a novel framework that improves the performance of diagnosis segmentation by taking advantage of the Region-Aware Vision-Language Similarity to Mask (RVLS2M) module. To improve overall performance, we investigate a test-time scaling strategy for MDS tasks. Experimental results demonstrate that our method outperforms baselines in both segmentation and diagnosis.
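
As a rough illustration of the general "vision-language similarity to mask" idea mentioned in this abstract (not the RVLS2M module itself), patch-level image-text similarities can be turned into a coarse mask; the shapes, normalization, and threshold below are assumptions.

```python
# Hedged sketch of the generic similarity-to-mask idea: cosine similarity between a
# query text embedding and per-patch image embeddings, reshaped into a coarse mask.
# All tensor shapes and the threshold are illustrative assumptions.
import torch
import torch.nn.functional as F

def similarity_mask(patch_feats, text_feat, grid=(16, 16), threshold=0.5):
    # patch_feats: (N, D) embeddings of N = grid[0]*grid[1] image patches
    # text_feat:   (D,)  embedding of the clinical query
    sim = F.cosine_similarity(patch_feats, text_feat.unsqueeze(0), dim=-1)  # (N,)
    sim = (sim - sim.min()) / (sim.max() - sim.min() + 1e-6)                # scale to [0, 1]
    return (sim > threshold).float().reshape(grid)                          # coarse binary mask

patch_feats = torch.randn(256, 512)
text_feat = torch.randn(512)
print(similarity_mask(patch_feats, text_feat).shape)  # torch.Size([16, 16])
```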

AAAI Conference 2026 Conference Paper

Towards High-Fidelity 3D Portrait Generation with Rich Details by Cross-View Prior-Aware Diffusion

  • Haoran Wei
  • Wencheng Han
  • Xingping Dong
  • Jianbing Shen

Recent diffusion-based Single-image 3D portrait generation methods typically employ 2D diffusion models to provide multi-view knowledge, which is then distilled into 3D representations. However, these methods usually struggle to produce high-fidelity 3D models, frequently yielding excessively blurred textures. We attribute this issue to the insufficient consideration of cross-view consistency during the diffusion process, resulting in significant disparities between different views and ultimately leading to blurred 3D representations. In this paper, we address this issue by comprehensively exploiting multi-view priors in both the conditioning and diffusion procedures to produce consistent, detail-rich portraits. From the conditioning standpoint, we propose a Hybrid Priors Diffusion model, which explicitly and implicitly incorporates multi-view priors as conditions to enhance the status consistency of the generated multi-view portraits. From the diffusion perspective, considering the significant impact of the diffusion noise distribution on detailed texture generation, we propose a Multi-View Noise Resampling Strategy integrated within the optimization process leveraging cross-view priors to enhance representation consistency. Extensive experiments show that our method produces 3D portraits with accurate geometry and rich details from a single image.

AAAI Conference 2025 Conference Paper

DME-Driver: Integrating Human Decision Logic and 3D Scene Perception in Autonomous Driving

  • Wencheng Han
  • Dongqian Guo
  • Cheng-Zhong Xu
  • Jianbing Shen

There are two crucial aspects of reliable autonomous driving systems: the reasoning behind decision-making and the precision of environmental perception. This paper introduces DME-Driver, a new autonomous driving system that enhances performance and robustness by fully leveraging these two crucial aspects. This system comprises two main models. The first, the Decision Maker, is responsible for providing logical driving instructions. The second, the Executor, receives these instructions and generates precise control signals for the vehicles. To ensure explainable and reliable driving decisions, we build the Decision Maker on top of a large vision-language model. This model follows the logic employed by experienced human drivers and simulates making decisions in a safe and reasonable manner. On the other hand, the generation of accurate control signals relies on precise and detailed environmental perception, where 3D scene perception models excel. Therefore, a planning-oriented perception model is employed as the Executor. It translates the logical decisions made by the Decision Maker into accurate control signals for the self-driving cars. To effectively train the proposed system, a new dataset named Human-driver Behavior and Decision-making (HBD) has been collected. This dataset encompasses a diverse range of human driver behaviors and their underlying motivations. By leveraging this dataset, our system achieves high-precision planning accuracy through a logical thinking process.

AAAI Conference 2025 Conference Paper

Language Prompt for Autonomous Driving

  • Dongming Wu
  • Wencheng Han
  • Yingfei Liu
  • Tiancai Wang
  • Cheng-Zhong Xu
  • Xiangyu Zhang
  • Jianbing Shen

A new trend in the computer vision community is to capture objects of interest following flexible human commands expressed as natural language prompts. However, progress on using language prompts in driving scenarios is stuck in a bottleneck due to the scarcity of paired prompt-instance data. To address this challenge, we propose the first object-centric language prompt set for driving scenes within 3D, multi-view, and multi-frame space, named NuPrompt. It expands the nuScenes dataset by constructing a total of 40,147 language descriptions, each referring to an average of 7.4 object tracklets. Based on the object-text pairs from the new benchmark, we formulate a novel prompt-based driving task, i.e., employing a language prompt to predict the described object trajectory across views and frames. Furthermore, we provide a simple end-to-end Transformer-based baseline model, named PromptTrack. Experiments show that our PromptTrack achieves impressive performance on NuPrompt. We hope this work can provide some new insights for the self-driving community.

AAAI Conference 2025 Conference Paper

OLiDM: Object-aware LiDAR Diffusion Models for Autonomous Driving

  • Tianyi Yan
  • Junbo Yin
  • Xianpeng Lang
  • Ruigang Yang
  • Cheng-Zhong Xu
  • Jianbing Shen

To enhance autonomous driving, innovative approaches have been proposed to generate simulated LiDAR data. However, these methods often face challenges in producing high-quality and controllable foreground objects. To cater to the needs of object-aware tasks in 3D perception, we introduce OLiDM, a novel framework capable of generating controllable and high-fidelity LiDAR data at both the object and scene levels. OLiDM consists of two pivotal components: the Object-Scene Progressive Generation (OPG) module and the Object Semantic Alignment (OSA) module. OPG adapts to user-specific prompts to generate desired foreground objects, which are subsequently employed as conditions in scene generation, ensuring controllable and diverse output at both the object and scene levels. This also facilitates the association of user-defined object-level annotations with the generated LiDAR scenes. Moreover, OSA aims to rectify the misalignment between foreground objects and background scenes, enhancing the overall quality of the generated objects. The broad efficacy of OLiDM is demonstrated across both unconditional and conditional LiDAR generation tasks, as well as 3D perception tasks. Specifically, on the KITTI-360 dataset, OLiDM surpasses prior state-of-the-art methods such as UltraLiDAR by 11.8 in FPD, producing data that closely mirrors real-world distributions. Additionally, in sparse-to-dense LiDAR completion, OLiDM achieves a significant improvement over LiDARGen, with a 57.47% increase in semantic IoU. Moreover, in 3D object detection, OLiDM enhances the performance of mainstream detectors by 2.4% in mAP and 1.9% in NDS, underscoring its potential in advancing 3D perception models.

NeurIPS Conference 2025 Conference Paper

RLGF: Reinforcement Learning with Geometric Feedback for Autonomous Driving Video Generation

  • Tianyi Yan
  • Wencheng Han
  • Xia Zhou
  • Xueyang Zhang
  • Kun Zhan
  • Cheng-Zhong Xu
  • Jianbing Shen

Synthetic data is crucial for advancing autonomous driving (AD) systems, yet current state-of-the-art video generation models, despite their visual realism, suffer from subtle geometric distortions that limit their utility for downstream perception tasks. We identify and quantify this critical issue, demonstrating a significant performance gap in 3D object detection when using synthetic versus real data. To address this, we introduce Reinforcement Learning with Geometric Feedback (RLGF), which uniquely refines video diffusion models by incorporating rewards from specialized latent-space AD perception models. Its core components include an efficient Latent-Space Windowing Optimization technique for targeted feedback during diffusion, and a Hierarchical Geometric Reward (HGR) system providing multi-level rewards for point-line-plane alignment and scene occupancy coherence. To quantify these distortions, we propose GeoScores. Applied to models like DiVE on nuScenes, RLGF substantially reduces geometric errors (e.g., VP error by 21%, depth error by 57%) and dramatically improves 3D object detection mAP by 12.7%, narrowing the gap to real-data performance. RLGF offers a plug-and-play solution for generating geometrically sound and reliable synthetic videos for AD development.
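
One plausible, simplified reading of a hierarchical geometric reward is a weighted combination of multi-level geometric terms. The sketch below is illustrative only; the term names, error scaling, and weights are assumptions rather than the paper's HGR formulation.

```python
# Illustrative sketch: combine multi-level geometric terms into one scalar reward.
# The choice of terms and weights is an assumption, not the paper's formula.
def hierarchical_geometric_reward(point_err, line_err, plane_err, occ_agreement,
                                  w=(0.3, 0.3, 0.2, 0.2)):
    # errors assumed normalized to [0, 1]; occupancy agreement is a score in [0, 1]
    alignment = (w[0] * (1.0 - point_err)
                 + w[1] * (1.0 - line_err)
                 + w[2] * (1.0 - plane_err))
    return alignment + w[3] * occ_agreement

print(hierarchical_geometric_reward(0.1, 0.2, 0.15, 0.9))
```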

ICLR Conference 2025 Conference Paper

Weak to Strong Generalization for Large Language Models with Multi-capabilities

  • Yucheng Zhou 0001
  • Jianbing Shen
  • Yu Cheng 0001

As large language models (LLMs) grow in sophistication, some of their capabilities surpass human abilities, making it essential to ensure their alignment with human values and intentions, i.e., Superalignment. This superalignment challenge is particularly critical for complex tasks, as annotations provided by humans, as weak supervisors, may be overly simplistic, incomplete, or incorrect. Previous work has demonstrated the potential of training a strong model using the weak dataset generated by a weak model as weak supervision. However, these studies have been limited to a single capability. In this work, we conduct extensive experiments to investigate weak to strong generalization for LLMs with multi-capabilities. The experiments reveal that different capabilities tend to remain relatively independent in this generalization, and the effectiveness of weak supervision is significantly impacted by the quality and diversity of the weak datasets. Moreover, self-bootstrapping of the strong model leads to performance degradation due to its overconfidence and the limited diversity of its generated dataset. To address these issues, we propose a novel training framework that uses reward models to select valuable data, thereby providing weak supervision for strong model training. In addition, we propose a two-stage training method on both weak and selected datasets to train the strong model. Experimental results demonstrate that our method significantly improves weak to strong generalization with multi-capabilities.
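
The data-selection step this abstract outlines, scoring weak-model outputs with a reward model and keeping the most valuable ones as weak supervision, can be sketched as follows. This is a minimal stand-in, not the paper's framework; `reward_model` is a hypothetical callable.

```python
# Minimal sketch: keep only the highest-reward weak examples as supervision for the
# strong model. `reward_model` is a placeholder callable, not an API from the paper.
def select_weak_data(weak_examples, reward_model, keep_ratio=0.5):
    scored = [(reward_model(ex), ex) for ex in weak_examples]
    scored.sort(key=lambda pair: pair[0], reverse=True)   # best-scored first
    keep = max(1, int(len(scored) * keep_ratio))
    return [ex for _, ex in scored[:keep]]

# toy usage with a dummy reward model that simply prefers longer answers
weak_examples = ["short", "a medium-length weak answer", "a considerably longer weak-model answer"]
selected = select_weak_data(weak_examples, reward_model=len, keep_ratio=0.34)
print(selected)
```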

AAAI Conference 2024 Conference Paper

DI-V2X: Learning Domain-Invariant Representation for Vehicle-Infrastructure Collaborative 3D Object Detection

  • Xiang Li
  • Junbo Yin
  • Wei Li
  • Chengzhong Xu
  • Ruigang Yang
  • Jianbing Shen

Vehicle-to-Everything (V2X) collaborative perception has recently gained significant attention due to its capability to enhance scene understanding by integrating information from various agents, e.g., vehicles and infrastructure. However, current works often treat the information from each agent equally, ignoring the inherent domain gap caused by the different LiDAR sensors used by each agent, thus leading to suboptimal performance. In this paper, we propose DI-V2X, which aims to learn Domain-Invariant representations through a new distillation framework to mitigate the domain discrepancy in the context of V2X 3D object detection. DI-V2X comprises three essential components: a domain-mixing instance augmentation (DMA) module, a progressive domain-invariant distillation (PDD) module, and a domain-adaptive fusion (DAF) module. Specifically, DMA builds a domain-mixing 3D instance bank for the teacher and student models during training, resulting in aligned data representation. Next, PDD encourages the student models from different domains to gradually learn a domain-invariant feature representation towards the teacher, where the overlapping regions between agents are employed as guidance to facilitate the distillation process. Furthermore, DAF closes the domain gap between the students by incorporating calibration-aware domain-adaptive attention. Extensive experiments on the challenging DAIR-V2X and V2XSet benchmark datasets demonstrate that DI-V2X achieves remarkable performance, outperforming all previous V2X models. Code is available at https://github.com/Serenos/DI-V2X.
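
A generic, illustrative version of distilling student features toward a teacher only on agent-overlap regions (the guidance idea attributed to the PDD module above) might look like the following; the BEV feature shapes and the squared-error objective are assumptions, not the paper's loss.

```python
# Illustrative-only sketch: masked feature distillation on overlapping regions.
import torch

def overlap_distill_loss(student_bev, teacher_bev, overlap_mask):
    # student_bev, teacher_bev: (B, C, H, W) bird's-eye-view feature maps
    # overlap_mask: (B, 1, H, W), 1 where vehicle and infrastructure views overlap
    diff = (student_bev - teacher_bev.detach()) ** 2
    masked = diff * overlap_mask
    return masked.sum() / overlap_mask.sum().clamp(min=1.0)

s = torch.randn(2, 64, 128, 128)
t = torch.randn(2, 64, 128, 128)
m = (torch.rand(2, 1, 128, 128) > 0.5).float()
print(overlap_distill_loss(s, t, m))
```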

AAAI Conference 2024 Conference Paper

Fine-Grained Distillation for Long Document Retrieval

  • Yucheng Zhou
  • Tao Shen
  • Xiubo Geng
  • Chongyang Tao
  • Jianbing Shen
  • Guodong Long
  • Can Xu
  • Daxin Jiang

Long document retrieval aims to fetch query-relevant documents from a large-scale collection, where knowledge distillation has become the de facto approach to improving a retriever by mimicking a heterogeneous yet powerful cross-encoder. However, in contrast to passages or sentences, retrieval on long documents suffers from the scope hypothesis that a long document may cover multiple topics. This maximizes their structural heterogeneity and poses a granular-mismatch issue, leading to inferior distillation efficacy. In this work, we propose a new learning framework, fine-grained distillation (FGD), for long-document retrievers. While preserving the conventional dense retrieval paradigm, it first produces globally consistent representations across different levels of granularity and then applies multi-granular aligned distillation merely during training. In experiments, we evaluate our framework on two long-document retrieval benchmarks, which show state-of-the-art performance.
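
A generic sketch of multi-granularity score distillation, the broad mechanism behind the framework described above, is shown below; the granularity names, tensor shapes, and KL-based objective are illustrative assumptions rather than the FGD objective.

```python
# Generic sketch: distill teacher relevance scores into a student at several
# granularities. Everything here is an illustrative stand-in for the real objective.
import torch
import torch.nn.functional as F

def multi_granular_distill_loss(student_scores, teacher_scores):
    # both arguments: dicts mapping a granularity name (e.g. "document", "passage")
    # to (B, K) relevance scores over K candidate documents for each of B queries
    loss = 0.0
    for level in student_scores:
        s = F.log_softmax(student_scores[level], dim=-1)
        t = F.softmax(teacher_scores[level], dim=-1)
        loss = loss + F.kl_div(s, t, reduction="batchmean")
    return loss / len(student_scores)

student = {"document": torch.randn(4, 8), "passage": torch.randn(4, 8)}
teacher = {k: v + 0.1 * torch.randn_like(v) for k, v in student.items()}
print(multi_granular_distill_loss(student, teacher))
```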

ICLR Conference 2024 Conference Paper

TopoMLP: A Simple yet Strong Pipeline for Driving Topology Reasoning

  • Dongming Wu
  • Jiahao Chang
  • Fan Jia 0006
  • Yingfei Liu
  • Tiancai Wang
  • Jianbing Shen

Topology reasoning aims to comprehensively understand road scenes and present drivable routes in autonomous driving. It requires detecting road centerlines (lanes) and traffic elements, and further reasoning about their topology relationships, i.e., lane-lane topology and lane-traffic topology. In this work, we first show that the topology score relies heavily on detection performance for lanes and traffic elements. Therefore, we introduce a powerful 3D lane detector and an improved 2D traffic element detector to extend the upper limit of topology performance. Further, we propose TopoMLP, a simple yet high-performance pipeline for driving topology reasoning. Based on the impressive detection performance, we develop two simple MLP-based heads for topology generation. TopoMLP achieves state-of-the-art performance on the OpenLane-V2 dataset, i.e., 41.2% OLS with a ResNet-50 backbone. It is also the first-place solution for the 1st OpenLane Topology in Autonomous Driving Challenge. We hope such a simple and strong pipeline can provide some new insights to the community. Code is at https://github.com/wudongming97/TopoMLP.
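
The MLP-based topology heads mentioned in this abstract can be pictured with a small sketch that scores every pair of lane query embeddings to produce a lane-lane connectivity matrix. Dimensions and layer structure are illustrative assumptions, not the TopoMLP architecture.

```python
# Sketch of a pairwise MLP topology head: not the TopoMLP implementation.
import torch
import torch.nn as nn

class LaneTopologyHead(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(dim * 2, dim), nn.ReLU(),
            nn.Linear(dim, 1),
        )

    def forward(self, lane_queries):                     # (N, D), one embedding per lane
        n, d = lane_queries.shape
        a = lane_queries.unsqueeze(1).expand(n, n, d)    # source lane
        b = lane_queries.unsqueeze(0).expand(n, n, d)    # target lane
        pair = torch.cat([a, b], dim=-1)                 # (N, N, 2D)
        return torch.sigmoid(self.mlp(pair)).squeeze(-1) # (N, N) connectivity scores

head = LaneTopologyHead()
print(head(torch.randn(8, 256)).shape)  # torch.Size([8, 8])
```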

AAAI Conference 2023 Conference Paper

Exposing the Self-Supervised Space-Time Correspondence Learning via Graph Kernels

  • Zheyun Qin
  • Xiankai Lu
  • Xiushan Nie
  • Yilong Yin
  • Jianbing Shen

Self-supervised space-time correspondence learning is emerging as a promising way of leveraging unlabeled video. Currently, most methods adapt contrastive learning with negative-sample mining or reconstruction objectives from the image domain, which require dense affinity across multiple frames or optical flow constraints. Moreover, video correspondence predictive models need to mine more inherent properties of videos, such as structural information. In this work, we propose VideoHiGraph, a space-time correspondence framework based on a learnable graph kernel. Treating the video as a spatial-temporal graph, the learning objectives of VideoHiGraph are formulated in a self-supervised manner for predicting unobserved hidden graphs via graph kernels. We learn a representation of the temporal coherence across frames in which pairwise similarity defines the structured hidden graph, such that a biased random walk graph kernel along the sub-graph can predict long-range correspondence. Then, we learn a refined representation across frames at the node level via a dense graph kernel. The self-supervision for model training is formed by the structural and temporal consistency of the graph. VideoHiGraph achieves superior performance and demonstrates its robustness across label propagation benchmarks involving objects, semantic parts, keypoints, and instances. Our implementation is publicly available at https://github.com/zyqin19/VideoHiGraph.
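
As background for this entry, label propagation via a random walk on a frame-to-frame affinity graph is the generic mechanism underlying space-time correspondence methods; the sketch below illustrates that mechanism only and is not the VideoHiGraph graph-kernel formulation.

```python
# Generic label propagation with a row-stochastic affinity (random-walk) matrix.
import numpy as np

def propagate_labels(feat_prev, feat_next, labels_prev, temperature=0.07):
    # feat_prev, feat_next: (N, D) L2-normalized per-node features of two frames
    # labels_prev:          (N, C) one-hot (or soft) labels on the previous frame
    affinity = feat_next @ feat_prev.T                   # (N, N) pairwise similarity
    walk = np.exp(affinity / temperature)
    walk /= walk.sum(axis=1, keepdims=True)              # row-stochastic transition matrix
    return walk @ labels_prev                            # soft labels on the next frame

feat_prev = np.random.randn(100, 64); feat_prev /= np.linalg.norm(feat_prev, axis=1, keepdims=True)
feat_next = np.random.randn(100, 64); feat_next /= np.linalg.norm(feat_next, axis=1, keepdims=True)
labels_prev = np.eye(5)[np.random.randint(0, 5, 100)]
print(propagate_labels(feat_prev, feat_next, labels_prev).shape)  # (100, 5)
```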

AAAI Conference 2023 Conference Paper

LWSIS: LiDAR-Guided Weakly Supervised Instance Segmentation for Autonomous Driving

  • Xiang Li
  • Junbo Yin
  • Botian Shi
  • Yikang Li
  • Ruigang Yang
  • Jianbing Shen

Image instance segmentation is a fundamental research topic in autonomous driving, which is crucial for scene understanding and road safety. Advanced learning-based approaches often rely on costly 2D mask annotations for training. In this paper, we present a more artful framework, LiDAR-guided Weakly Supervised Instance Segmentation (LWSIS), which leverages off-the-shelf 3D data, i.e., point clouds together with 3D boxes, as natural weak supervision for training 2D image instance segmentation models. LWSIS not only exploits the complementary information in multimodal data during training but also significantly reduces the annotation cost of dense 2D masks. In detail, LWSIS consists of two crucial modules, Point Label Assignment (PLA) and Graph-based Consistency Regularization (GCR). The former module automatically assigns 3D point cloud points as 2D point-wise labels, while the latter further refines the predictions by enforcing geometry and appearance consistency of the multimodal data. Moreover, we provide a secondary instance segmentation annotation on nuScenes, named nuInsSeg, to encourage further research on multimodal perception tasks. Extensive experiments on nuInsSeg, as well as the large-scale Waymo dataset, show that LWSIS can substantially improve existing weakly supervised segmentation models by only involving 3D data during training. Additionally, LWSIS can also be incorporated into 3D object detectors like PointPainting to boost 3D detection performance for free. The code and dataset are available at https://github.com/Serenos/LWSIS.
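
A simplified sketch of the point-label-assignment idea (not the PLA module itself): project LiDAR points into the image with known intrinsics and label each projected point with the ID of the 3D box containing it. The axis-aligned box test and all shapes below are simplifying assumptions.

```python
# Simplified sketch: 3D boxes + projected LiDAR points as weak 2D point-wise labels.
import numpy as np

def assign_point_labels(points, boxes, cam_K):
    # points: (N, 3) LiDAR points already in the camera frame
    # boxes:  list of (xmin, ymin, zmin, xmax, ymax, zmax) axis-aligned 3D boxes
    # cam_K:  (3, 3) camera intrinsics
    labels = np.full(len(points), -1)                     # -1 = background
    for box_id, (x0, y0, z0, x1, y1, z1) in enumerate(boxes):
        inside = ((points[:, 0] >= x0) & (points[:, 0] <= x1) &
                  (points[:, 1] >= y0) & (points[:, 1] <= y1) &
                  (points[:, 2] >= z0) & (points[:, 2] <= z1))
        labels[inside] = box_id
    uv = (cam_K @ points.T).T
    uv = uv[:, :2] / uv[:, 2:3]                           # (N, 2) pixel coordinates
    return uv, labels                                      # weak 2D point-wise supervision

pts = np.random.rand(1000, 3) * [10, 4, 30] + [0, -2, 1]
cam_K = np.array([[800., 0., 640.], [0., 800., 360.], [0., 0., 1.]])
uv, labels = assign_point_labels(pts, [(2, -1, 5, 4, 1, 8)], cam_K)
print(uv.shape, (labels >= 0).sum())
```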

AAAI Conference 2023 Conference Paper

SSDA3D: Semi-supervised Domain Adaptation for 3D Object Detection from Point Cloud

  • Yan Wang
  • Junbo Yin
  • Wei Li
  • Pascal Frossard
  • Ruigang Yang
  • Jianbing Shen

LiDAR-based 3D object detection is an indispensable task in advanced autonomous driving systems. Though impressive detection results have been achieved by superior 3D detectors, they suffer from significant performance degeneration when facing unseen domains, such as different LiDAR configurations, different cities, and different weather conditions. The mainstream approaches tend to solve these challenges by leveraging unsupervised domain adaptation (UDA) techniques. However, these UDA solutions yield unsatisfactory 3D detection results when there is a severe domain shift, e.g., from Waymo (64-beam) to nuScenes (32-beam). To address this, we present a novel Semi-Supervised Domain Adaptation method for 3D object detection (SSDA3D), where only a small amount of labeled target data is available, yet the adaptation performance can be significantly improved. In particular, our SSDA3D includes an Inter-domain Adaptation stage and an Intra-domain Generalization stage. In the first stage, an Inter-domain Point-CutMix module is presented to efficiently align the point cloud distribution across domains. The Point-CutMix generates mixed samples of an intermediate domain, thus encouraging the model to learn domain-invariant knowledge. Then, in the second stage, we further enhance the model for better generalization on the unlabeled target set. This is achieved by exploring Intra-domain Point-MixUp in semi-supervised learning, which essentially regularizes the pseudo label distribution. Experiments from Waymo to nuScenes show that, with only 10% labeled target data, our SSDA3D can surpass the fully-supervised oracle model trained with 100% of the target labels. Our code is available at https://github.com/yinjunbo/SSDA3D.
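
A hedged sketch of a point-cloud CutMix-style augmentation in the spirit of the Inter-domain Point-CutMix mentioned above; the spherical cut region, the absence of label handling, and all parameters are simplifications, not the paper's procedure.

```python
# Illustrative point-cloud CutMix: paste a region of a target-domain scene into a
# source-domain scene to form an intermediate-domain sample.
import numpy as np

def point_cutmix(source_pts, target_pts, center, radius):
    # source_pts, target_pts: (N, 3) and (M, 3) point clouds from the two domains
    dist_src = np.linalg.norm(source_pts - center, axis=1)
    dist_tgt = np.linalg.norm(target_pts - center, axis=1)
    kept_source = source_pts[dist_src > radius]       # remove source points in the region
    pasted_target = target_pts[dist_tgt <= radius]    # paste target points from the same region
    return np.concatenate([kept_source, pasted_target], axis=0)

src = np.random.rand(5000, 3) * 60 - 30
tgt = np.random.rand(3000, 3) * 60 - 30
mixed = point_cutmix(src, tgt, center=np.array([5.0, 5.0, 0.0]), radius=10.0)
print(mixed.shape)
```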