Author name cluster

Bo Zhao

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

17 papers

2 author rows

AAAI Conference 2026 Conference Paper

Compensating Distribution Drifts in Continual Learning with Pre-trained Vision Transformers

Xuan Rao
Simian Xu
Zheng Li
Bo Zhao
Derong Liu
Mingming Ha
Cesare Alippi

Recent advances have shown that sequential fine-tuning (SeqFT) of pre-trained vision transformers (ViTs), followed by classifier refinement using approximate distributions of class features, can be an effective strategy for class-incremental learning (CIL). However, this approach is susceptible to distribution drift, caused by the sequential optimization of shared backbone parameters. This results in a mismatch between the distributions of the previously learned classes and that of the updated model, ultimately degrading classifier performance over time. To address this issue, we introduce a latent space transition operator and propose Sequential Learning with Drift Compensation (SLDC). SLDC aims to align feature distributions across tasks to mitigate the impact of drift. First, we present a linear variant of SLDC, which learns a linear operator by solving a regularized least-squares problem that maps features before and after fine-tuning. Next, we extend this with a weakly nonlinear SLDC variant, which assumes that the ideal transition operator lies between purely linear and fully nonlinear transformations. This is implemented using learnable, weakly nonlinear mappings that balance flexibility and generalization. To further reduce representation drift, we apply knowledge distillation (KD) in both algorithmic variants. Extensive experiments on standard CIL benchmarks demonstrate that SLDC significantly improves the performance of SeqFT. Notably, by combining KD to address representation drift with SLDC to compensate for distribution drift, SeqFT achieves performance comparable to joint training across all evaluated datasets.
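
The abstract says only that the linear variant solves a regularized least-squares problem mapping features before and after fine-tuning. One common form such a problem can take is ridge regression, sketched below. The symbols are assumptions made here for illustration and are not taken from the paper: F_old and F_new are n-by-d matrices holding features of the same n samples before and after fine-tuning, W is the d-by-d linear operator, and lambda > 0 is the regularization weight.

```latex
W^{\star}
  = \arg\min_{W} \; \lVert F_{\mathrm{old}} W - F_{\mathrm{new}} \rVert_F^2
    + \lambda \lVert W \rVert_F^2
  = \left( F_{\mathrm{old}}^{\top} F_{\mathrm{old}} + \lambda I \right)^{-1}
    F_{\mathrm{old}}^{\top} F_{\mathrm{new}}
```

Under this assumed form, the closed-form solution exists for any lambda > 0 because the matrix being inverted is positive definite. The regularizer actually used in the paper may differ.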


AAAI Conference 2026 Conference Paper

Encode Geometric Diagram as Geo-Graph in Geometry Problem Solving

Wenjun Wu
Lingling Zhang
Bo Zhao
Bo Li
Xinyu Zhang
Yaqiang Wu

Geometry Problem Solving has become a hot topic in recent years because it requires machines to combine geometric abstraction, multi-modal reasoning, and mathematical capabilities. Most research focuses on fusing multi-modal data or on synergistically combining neural and symbolic systems to improve performance. However, this work neglects the unique characteristics of geometric diagrams, which distinguish them from natural images, and this impedes further exploration of the critical information in such diagrams. In this work, we introduce the novel concept of the geo-graph and propose the Geo-Graph Geometry Problem Solving model, which encodes the geometric diagram from a new perspective. The geo-graph is designed to include the semantic, structural, and spatial information in the diagram, which is crucial to the subsequent problem reasoning stage. To facilitate the model's comprehension of the actual layout of the geometric diagram, spatial and connecting attentions are devised to serve as intrinsic knowledge guidance for feature propagation. An extra cross-modal attention is used as external guidance so that the geo-graph encoding relates to the specific problem target. The fused multi-modal features are then sent into a commonly used encoder-decoder framework for final solution generation. The model is first trained with three carefully designed pre-training tasks to establish its fundamental knowledge of the geo-graph, leveraging numerous varied samples generated through a geo-graph-based augmentation method. Experiments on popular geometry problem solving datasets demonstrate the effectiveness and superiority of our model for geometric diagram encoding.


AAAI Conference 2026 Conference Paper

Mask2IV: Interaction-Centric Video Generation via Mask Trajectories

Gen Li
Bo Zhao
Jianfei Yang
Laura Sevilla-Lara

Generating interaction-centric videos, such as those depicting humans or robots interacting with objects, is crucial for embodied intelligence, as they provide rich and diverse visual priors for robot learning, manipulation policy training, and affordance reasoning. However, existing methods often struggle to model such complex and dynamic interactions. While recent studies show that masks can serve as effective control signals and enhance generation quality, obtaining dense and precise mask annotations remains a major challenge for real-world use. To overcome this limitation, we introduce Mask2IV, a novel framework specifically designed for interaction-centric video generation. It adopts a decoupled two-stage pipeline that first predicts plausible motion trajectories for both actor and object, then generates a video conditioned on these trajectories. This design eliminates the need for dense mask inputs from users while preserving the flexibility to manipulate the interaction process. Furthermore, Mask2IV supports versatile and intuitive control, allowing users to specify the target object of interaction and guide the motion trajectory through action descriptions or spatial position cues. To support systematic training and evaluation, we curate two benchmarks covering diverse action and object categories across both human-object interaction and robotic manipulation scenarios. Extensive experiments demonstrate that our method achieves superior visual realism and controllability compared to existing baselines.


TMLR Journal 2026 Journal Article

Symmetry in Neural Network Parameter Spaces

Bo Zhao
Robin Walters
Rose Yu

Modern deep learning models are highly overparameterized, resulting in large sets of parameter configurations that yield the same outputs. A significant portion of this redundancy is explained by symmetries in the parameter space—transformations that leave the network function unchanged. These symmetries shape the loss landscape and constrain learning dynamics, offering a new lens for understanding optimization, generalization, and model complexity that complements existing theory of deep learning. This survey provides an overview of parameter space symmetry. We summarize existing literature, uncover connections between symmetry and learning theory, and identify gaps and opportunities in this emerging field.
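
As a minimal illustration of a parameter space symmetry, the sketch below uses the well-known positive rescaling symmetry of ReLU networks: scaling the incoming weight of a hidden unit by a > 0 and the outgoing weight by 1/a leaves the network function unchanged. The one-unit network and the names `tiny_net` and `rescale` are made up for this example and are not taken from the survey.

```python
import math


def relu(z):
    """Rectified linear unit."""
    return z if z > 0.0 else 0.0


def tiny_net(x, w1, w2):
    """Hypothetical one-hidden-unit network: f(x) = w2 * relu(w1 * x)."""
    return w2 * relu(w1 * x)


def rescale(w1, w2, a):
    """Apply the rescaling symmetry (w1, w2) -> (a * w1, w2 / a) for a > 0."""
    if a <= 0.0:
        raise ValueError("a must be positive")
    return a * w1, w2 / a


inputs = [-2.0, -0.5, 0.0, 1.0, 3.0]
w1, w2 = 0.75, -1.5
w1_new, w2_new = rescale(w1, w2, a=4.0)

# The parameters differ, yet both settings compute the same function.
original = [tiny_net(x, w1, w2) for x in inputs]
transformed = [tiny_net(x, w1_new, w2_new) for x in inputs]
```

The two parameter settings are distinct points in parameter space that produce identical outputs, which is the kind of redundancy the abstract describes.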


NeurIPS Conference 2025 Conference Paper

Causal-R: A Causal-Reasoning Geometry Problem Solver for Optimized Solution Exploration

Wenjun Wu
Lingling Zhang
Bo Zhao
Muye Huang
Qianying Wang
Jun Liu

The task of geometry problem solving has been a long-standing focus in the automated mathematics community and draws growing attention due to its complexity for both symbolic and neural models. Although prior studies have explored various effective approaches for enhancing problem solving performance, two fundamental challenges remain unaddressed, both of which are essential to application in practical scenarios. First, the multi-step reasoning gap between the initial geometric conditions and the ultimate problem goal leads to a large search space for solution exploration. Second, obtaining multiple interpretable and shorter solutions remains an open problem. In this work, we introduce the Causal-Reasoning Geometry Problem Solver to overcome these challenges. Specifically, the Causal Graph Reasoning theory is proposed to perform symbolic reasoning before problem solving. Several causal graphs are constructed according to a predefined rule base, where each graph is composed of primitive nodes, causal edges, and prerequisite edges. By applying causal graph deduction from the initial conditions, the reachability status of nodes is iteratively conveyed by causal edges until the target nodes are reached, representing feasible causal deduction paths. In this way, the search space of solutions is compressed from the beginning, the end, and the intermediate reasoning paths, while ensuring the interpretability and variety of solutions. To achieve this, we further propose Forward Matrix Deduction, which transforms the causal graphs into matrices and vectors and applies matrix operations to update the status value of reachable nodes in iterations. Finally, multiple solutions can be generated by tracing back from the target nodes after validation. Experiments demonstrate the effectiveness of our method in obtaining multiple shorter and interpretable solutions. Code is available after acceptance.
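
The idea of turning a graph into a matrix and updating node reachability by repeated matrix operations can be sketched in a much simplified form as follows. This is not the authors' Forward Matrix Deduction: it treats every edge as a plain causal edge, ignores prerequisite edges and the trace-back step, and the function name `forward_deduction` is hypothetical.

```python
def forward_deduction(adjacency, initial):
    """Propagate reachability along directed edges until nothing changes.

    adjacency[i][j] == 1 means there is an edge from node i to node j.
    initial is a 0/1 list marking the nodes that are known at the start.
    """
    n = len(adjacency)
    reached = list(initial)
    for _ in range(n):
        # Matrix-vector product A^T r: number of reached predecessors per node.
        incoming = [
            sum(adjacency[i][j] * reached[i] for i in range(n)) for j in range(n)
        ]
        updated = [1 if reached[j] or incoming[j] > 0 else 0 for j in range(n)]
        if updated == reached:
            break  # fixed point: no new node became reachable
        reached = updated
    return reached


# Toy graph: 0 -> 1 -> 2, and a separate edge 3 -> 4.
toy_adjacency = [
    [0, 1, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 1],
    [0, 0, 0, 0, 0],
]
toy_result = forward_deduction(toy_adjacency, [1, 0, 0, 0, 0])
```

In this toy graph only nodes 0, 1, and 2 become reachable from node 0; nodes 3 and 4 stay unreached because no path leads to them.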


EAAI Journal 2025 Journal Article

Development and implementation of a robotic moving target system for object recognition testing

Haojie Zhang
Rongmin Liang
Bo Zhao
Chuankai Liu

While artificial intelligence (AI) has made great progress in recent years, the scarcity of image datasets still limits its practical use in the military domain. This paper proposes a robotic moving target system to build image training datasets and test object recognition algorithms. The robotic moving target system consists of a human–machine interface and an autonomous robotic moving platform. Once the working mode is set through the human–machine interface, the robotic moving platform moves autonomously to simulate object images in arbitrary poses. A path tracking controller is proposed to track the desired routes accurately. The effectiveness of the robotic moving target system was verified by following three types of routes in simulation and real experiments. The results show that the robotic moving platform can move in arbitrary poses while it follows preset routes autonomously. The linear and angular velocities are adjusted smoothly. The average trajectory tracking error is less than 0.1 m. The system fully meets the requirements of building image training datasets and testing object recognition algorithms.


EAAI Journal 2025 Journal Article

Event-triggered neuro-optimal fault tolerant control for uncertain macro–micro composite stage system with actuator faults

Shunchao Zhang
Bo Zhao
Yongwei Zhang

This paper develops an event-triggered neuro-optimal fault tolerant control (ETNOFTC) method for an uncertain macro–micro composite stage (MMCS) system with actuator faults. By regarding the MMCS system as a two-player system, its control problem is formulated as a nonzero-sum game which aims to cooperatively stabilize the system with the individual cost functions of all players. To stabilize the faulty MMCS system, we develop an ETNOFTC scheme which consists of two parts, i.e., integral sliding mode control and adaptive dynamic programming-based event-triggered optimal control. An integral sliding mode control method is developed to eliminate the effects of actuator faults and uncertainties, and thus a system equivalent to the nominal system is obtained. Then, by employing the critic-only architecture, the event-triggered Hamilton–Jacobi equation is solved by approximating the coupled cost function to obtain the approximate event-triggered optimal control policy of each player. Furthermore, we prove that the closed-loop MMCS system is asymptotically stable by using Lyapunov’s direct method. Finally, the effectiveness of the developed ETNOFTC scheme is verified.


NeurIPS Conference 2025 Conference Paper

MomentSeeker: A Task-Oriented Benchmark For Long-Video Moment Retrieval

Huaying Yuan
Jian Ni
Zheng Liu
Yueze Wang
Junjie Zhou
Zhengyang Liang
Bo Zhao
Zhao Cao

Accurately locating key moments within long videos is crucial for solving long video understanding (LVU) tasks. However, existing benchmarks are either severely limited in terms of video length and task diversity, or they focus solely on the end-to-end LVU performance, making them inappropriate for evaluating whether key moments can be accurately accessed. To address this challenge, we propose MomentSeeker, a novel benchmark for long-video moment retrieval (LVMR), distinguished by the following features. First, it is created based on long and diverse videos, averaging over 1,200 seconds in duration, and collected from various domains, e.g., movie, anomaly, egocentric, and sports. Second, it covers a variety of real-world scenarios in three levels: global-level, event-level, and object-level, covering common tasks like action recognition, object localization, causal reasoning, etc. Third, it incorporates rich forms of queries, including text-only queries, image-conditioned queries, and video-conditioned queries. On top of MomentSeeker, we conduct comprehensive experiments for both generation-based approaches (directly using MLLMs) and retrieval-based approaches (leveraging video retrievers). Our results reveal the significant challenges in long-video moment retrieval in terms of accuracy and efficiency, despite improvements from the latest long-video MLLMs and task-specific fine-tuning. We have publicly released MomentSeeker to facilitate future research in this area.


NeurIPS Conference 2025 Conference Paper

UtilGen: Utility-Centric Generative Data Augmentation with Dual-Level Task Adaptation

Jiyu Guo
Shuo Yang
Yiming Huang
Yancheng Long
Xiaobo Xia
Xiu Su
Bo Zhao
Zeke Xie

Data augmentation using generative models has emerged as a powerful paradigm for enhancing performance in computer vision tasks. However, most existing augmentation approaches primarily focus on optimizing intrinsic data attributes -- such as fidelity and diversity -- to generate visually high-quality synthetic data, while often neglecting task-specific requirements. Yet, it is essential for data generators to account for the needs of downstream tasks, as training data requirements can vary significantly across different tasks and network architectures. To address these limitations, we propose UtilGen, a novel utility-centric data augmentation framework that adaptively optimizes the data generation process to produce task-specific, high-utility training data via downstream task feedback. Specifically, we first introduce a weight allocation network to evaluate the task-specific utility of each synthetic sample. Guided by these evaluations, UtilGen iteratively refines the data generation process using a dual-level optimization strategy to maximize the synthetic data utility: (1) model-level optimization tailors the generative model to the downstream task, and (2) instance-level optimization adjusts generation policies -- such as prompt embeddings and initial noise -- at each generation round. Extensive experiments on eight benchmark datasets of varying complexity and granularity demonstrate that UtilGen consistently achieves superior performance, with an average accuracy improvement of 3.87% over previous SOTA. Further analysis of data influence and distribution reveals that UtilGen produces more impactful and task-relevant synthetic data, validating the effectiveness of the paradigm shift from visual characteristics-centric to task utility-centric data augmentation.


NeurIPS Conference 2024 Conference Paper

Fetch and Forge: Efficient Dataset Condensation for Object Detection

Ding Qi
Jian Li
Jinlong Peng
Bo Zhao
Shuguang Dou
Jialin Li
Jiangning Zhang
Yabiao Wang

Dataset condensation (DC) is an emerging technique capable of creating compact synthetic datasets from large originals while maintaining considerable performance. It is crucial for accelerating network training and reducing data storage requirements. However, current research on DC mainly focuses on image classification, with less exploration of object detection. This is primarily due to two challenges: (i) the multitasking nature of object detection complicates the condensation process, and (ii) object detection datasets are characterized by large-scale and high-resolution data, which are difficult for existing DC methods to handle. As a remedy, we propose DCOD, the first dataset condensation framework for object detection. It operates in two stages, Fetch and Forge: it first stores key localization and classification information in model parameters, and then reconstructs synthetic images via model inversion. To handle the complexity of multiple objects in an image, we propose Foreground Background Decoupling to centrally update the foreground of multiple instances and Incremental PatchExpand to further enhance the diversity of foregrounds. Extensive experiments on various detection datasets demonstrate the superiority of DCOD. Even at an extremely low compression rate of 1%, we achieve 46.4% and 24.7% AP50 on VOC and COCO, respectively, significantly reducing detector training duration.


NeurIPS Conference 2024 Conference Paper

SegVol: Universal and Interactive Volumetric Medical Image Segmentation

Yuxin Du
Fan Bai
Tiejun Huang
Bo Zhao

Precise image segmentation provides clinical study with instructive information. Despite the remarkable progress achieved in medical image segmentation, there is still an absence of a 3D foundation segmentation model that can segment a wide range of anatomical categories with easy user interaction. In this paper, we propose a 3D foundation segmentation model, named SegVol, supporting universal and interactive volumetric medical image segmentation. By scaling up training data to 90K unlabeled Computed Tomography (CT) volumes and 6K labeled CT volumes, this foundation model supports the segmentation of over 200 anatomical categories using semantic and spatial prompts. To facilitate efficient and precise inference on volumetric images, we design a zoom-out-zoom-in mechanism. Extensive experiments on 22 anatomical segmentation tasks verify that SegVol outperforms the competitors in 19 tasks, with improvements up to 37.24% compared to the runner-up methods. We demonstrate the effectiveness and importance of specific designs by ablation study. We expect this foundation model can promote the development of volumetric medical image analysis. The model and code are publicly available at https://github.com/BAAI-DCAI/SegVol.


NeurIPS Conference 2024 Conference Paper

Touchstone Benchmark: Are We on the Right Way for Evaluating AI Algorithms for Medical Segmentation?

Pedro R. Bassi
Wenxuan Li
Yucheng Tang
Fabian Isensee
Zifu Wang
Jieneng Chen
Yu-Cheng Chou
Saikat Roy

How can we test AI performance? This question seems trivial, but it isn't. Standard benchmarks often have problems such as in-distribution and small-size test sets, oversimplified metrics, unfair comparisons, and short-term outcome pressure. As a consequence, good performance on standard benchmarks does not guarantee success in real-world scenarios. To address these problems, we present Touchstone, a large-scale collaborative segmentation benchmark of 9 types of abdominal organs. This benchmark is based on 5,195 training CT scans from 76 hospitals around the world and 5,903 testing CT scans from 11 additional hospitals. This diverse test set enhances the statistical significance of benchmark results and rigorously evaluates AI algorithms across various out-of-distribution scenarios. We invited 14 inventors of 19 AI algorithms to train their algorithms, while our team, as a third party, independently evaluated these algorithms on three test sets. In addition, we also evaluated pre-existing AI frameworks (which, differing from algorithms, are more flexible and can support different algorithms), including MONAI from NVIDIA, nnU-Net from DKFZ, and numerous other open-source frameworks. We are committed to expanding this benchmark to encourage more innovation of AI algorithms for the medical domain.


NeurIPS Conference 2023 Conference Paper

DYffusion: A Dynamics-informed Diffusion Model for Spatiotemporal Forecasting

Salva Rühling Cachay
Bo Zhao
Hailey Joren
Rose Yu

While diffusion models can successfully generate data and make predictions, they are predominantly designed for static images. We propose an approach for training diffusion models for dynamics forecasting that leverages the temporal dynamics encoded in the data, directly coupling it with the diffusion steps in the network. We train a stochastic, time-conditioned interpolator and a backbone forecaster network that mimic the forward and reverse processes of conventional diffusion models, respectively. This design choice naturally encodes multi-step and long-range forecasting capabilities, allowing for highly flexible, continuous-time sampling trajectories and the ability to trade off performance with accelerated sampling at inference time. In addition, the dynamics-informed diffusion process imposes a strong inductive bias, allowing for improved computational efficiency compared to traditional Gaussian noise-based diffusion models. Our approach performs competitively on probabilistic skill score metrics in complex dynamics forecasting of sea surface temperatures, Navier-Stokes flows, and spring mesh systems.


AAAI Conference 2022 Conference Paper

FedInv: Byzantine-Robust Federated Learning by Inversing Local Model Updates

Bo Zhao
Peng Sun
Tao Wang
Keyu Jiang

Federated learning (FL) is a privacy-preserving distributed machine learning paradigm that enables multiple clients to collaboratively train statistical models without disclosing raw training data. However, the inaccessible local training data and uninspectable local training process make FL susceptible to various Byzantine attacks (e.g., data poisoning and model poisoning attacks), aiming to manipulate the FL model training process and degrade the model performance. Most of the existing Byzantine-robust FL schemes cannot effectively defend against stealthy poisoning attacks that craft poisoned models statistically similar to benign models. Things worsen when many clients are compromised or data among clients are highly non-independent and identically distributed (non-IID). In this work, to address these issues, we propose FedInv, a novel Byzantine-robust FL framework by inversing local model updates. Specifically, in each round of local model aggregation in FedInv, the parameter server first inverses the local model updates submitted by each client to generate a corresponding dummy dataset. Then, the server identifies those dummy datasets with exceptional Wasserstein distances from others and excludes the related local model updates from model aggregation. We conduct an exhaustive experimental evaluation of FedInv. The results demonstrate that FedInv significantly outperforms the existing robust FL schemes in defending against stealthy poisoning attacks under highly non-IID data partitions.
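
The filtering step, which flags dummy datasets whose Wasserstein distance from the others is exceptional, can be illustrated with a toy sketch. Everything specific here is an assumption for illustration: the datasets are one-dimensional samples of equal size (so the 1-Wasserstein distance reduces to the mean absolute difference of sorted values), and the cutoff rule (a multiple of the median score) is invented for this example. The abstract does not state the rule FedInv uses, and the model inversion step is not shown.

```python
import statistics


def wasserstein_1d(a, b):
    """1-Wasserstein distance between two equal-size 1-D empirical samples."""
    if len(a) != len(b):
        raise ValueError("samples must have the same size")
    return sum(abs(x - y) for x, y in zip(sorted(a), sorted(b))) / len(a)


def flag_outliers(datasets, factor=2.0):
    """Return indices whose mean distance to the others exceeds factor * median.

    The median-based cutoff is a hypothetical rule chosen for this sketch.
    """
    n = len(datasets)
    scores = []
    for i in range(n):
        others = [
            wasserstein_1d(datasets[i], datasets[j]) for j in range(n) if j != i
        ]
        scores.append(sum(others) / len(others))
    cutoff = factor * statistics.median(scores)
    return [i for i, score in enumerate(scores) if score > cutoff]


# Three similar "dummy datasets" and one that sits far away from the rest.
dummy_datasets = [
    [0.0, 1.0, 2.0],
    [0.1, 1.1, 2.1],
    [0.0, 0.9, 2.2],
    [10.0, 11.0, 12.0],
]
suspicious = flag_outliers(dummy_datasets)
```

In an aggregation round, the updates belonging to the flagged indices would be left out before averaging. Real dummy datasets are high-dimensional, so a practical version would need a multivariate distance.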


NeurIPS Conference 2022 Conference Paper

Symmetry Teleportation for Accelerated Optimization

Bo Zhao
Nima Dehmamy
Robin Walters
Rose Yu

Existing gradient-based optimization methods update parameters locally, in a direction that minimizes the loss function. We study a different approach, symmetry teleportation, that allows parameters to travel a large distance on the loss level set, in order to improve the convergence speed in subsequent steps. Teleportation exploits symmetries in the loss landscape of optimization problems. We derive loss-invariant group actions for test functions in optimization and multi-layer neural networks, and prove a necessary condition for teleportation to improve convergence rate. We also show that our algorithm is closely related to second order methods. Experimentally, we show that teleportation improves the convergence speed of gradient descent and AdaGrad for several optimization problems including test functions, multi-layer regressions, and MNIST classification.
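
A toy sketch of the teleportation idea follows, under assumptions chosen for illustration. The loss L(u, v) = (uv - 1)^2 is invariant under the group action (u, v) -> (g u, v / g) for g > 0, so moving along this orbit stays on the same loss level set. The sketch searches a small grid of g values for the point with the largest gradient norm, a selection criterion assumed here because the abstract states only that teleportation is meant to improve convergence in subsequent steps. The toy loss and the names `teleport` and `grid` are not from the paper.

```python
import math


def loss(u, v):
    """Toy loss with a scaling symmetry: L(u, v) = (u * v - 1)^2."""
    return (u * v - 1.0) ** 2


def grad_norm_sq(u, v):
    """Squared norm of the gradient of the toy loss."""
    r = 2.0 * (u * v - 1.0)
    return (r * v) ** 2 + (r * u) ** 2


def teleport(u, v, candidates):
    """Move along the orbit (g * u, v / g) to the candidate with the largest gradient."""
    best = max(candidates, key=lambda g: grad_norm_sq(g * u, v / g))
    return best * u, v / best


u0, v0 = 0.5, 0.5
grid = [2.0 ** k for k in range(-3, 4)]  # includes g = 1, i.e. "stay put"
u1, v1 = teleport(u0, v0, grid)
```

After teleporting, the loss value is unchanged while the gradient is at least as large as before, so the next gradient step starts from a steeper point on the same level set.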


ICAPS Conference 2021 Conference Paper

Multiple Plans are Better than One: Diverse Stochastic Planning

Mahsa Ghasemi
Evan Scope Crafts
Bo Zhao
Ufuk Topcu

In planning problems, it is often challenging to fully model the desired specifications. In particular, in human-robot interaction, such difficulty may arise due to human preferences that are either private or complex to model. Consequently, the resulting objective function can only partially capture the specifications, and optimizing it may lead to poor performance with respect to the true specifications. Motivated by this challenge, we formulate a problem, called diverse stochastic planning, that aims to generate a set of representative (small and diverse) behaviors that are near-optimal with respect to the known objective. In particular, the problem aims to compute a set of diverse and near-optimal policies for systems modeled by a Markov decision process. We cast the problem as a constrained nonlinear optimization for which we propose a solution relying on the Frank-Wolfe method. We then prove that the proposed solution converges to a stationary point and demonstrate its efficacy in several planning problems.
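
For readers unfamiliar with the Frank-Wolfe method mentioned above, the sketch below shows the generic algorithm on a toy problem: minimizing a convex quadratic over the probability simplex with the standard 2 / (k + 2) step size. It is not the paper's formulation, which concerns policies of a Markov decision process and a nonlinear objective. The objective, the `target` vector, and the function names are assumptions made for this example.

```python
def frank_wolfe_simplex(grad, dim, iterations=2000):
    """Minimize a smooth convex function over the probability simplex."""
    x = [1.0 / dim] * dim  # start at the uniform distribution
    for k in range(iterations):
        g = grad(x)
        # Linear minimization oracle: over the simplex, the minimizer of a
        # linear function is the vertex with the smallest gradient coordinate.
        s = min(range(dim), key=lambda i: g[i])
        step = 2.0 / (k + 2.0)
        # Convex combination of the current point and the chosen vertex,
        # so the iterate never leaves the simplex (no projection needed).
        x = [(1.0 - step) * xi for xi in x]
        x[s] += step
    return x


target = [0.2, 0.5, 0.3]  # hypothetical point inside the simplex


def objective(x):
    """Squared distance to the target point."""
    return sum((xi - ti) ** 2 for xi, ti in zip(x, target))


def gradient(x):
    """Gradient of the squared distance."""
    return [2.0 * (xi - ti) for xi, ti in zip(x, target)]


solution = frank_wolfe_simplex(gradient, dim=3)
```

Each iterate is a convex combination of simplex vertices, which is why the method suits constrained problems where projection is expensive but linear minimization is cheap.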


YNIMG Journal 2017 Journal Article

3D MR fingerprinting with accelerated stack-of-spirals and hybrid sliding-window and GRAPPA reconstruction

Congyu Liao
Berkin Bilgic
Mary Kate Manhard
Bo Zhao
Xiaozhi Cao
Jianhui Zhong
Lawrence L. Wald
Kawin Setsompop

Purpose: Whole-brain high-resolution quantitative imaging is extremely encoding intensive, and its rapid and robust acquisition remains a challenge. Here we present a 3D MR fingerprinting (MRF) acquisition with a hybrid sliding-window (SW) and GRAPPA reconstruction strategy to obtain high-resolution T1, T2 and proton density (PD) maps with whole brain coverage in a clinically feasible timeframe. Methods: 3D MRF data were acquired using a highly under-sampled stack-of-spirals trajectory with a steady-state precession (FISP) sequence. For data reconstruction, kx-ky under-sampling was mitigated using SW combination along the temporal axis. Non-uniform fast Fourier transform (NUFFT) was then applied to create Cartesian k-space data that are fully-sampled in the in-plane direction, and Cartesian GRAPPA was performed to resolve kz under-sampling to create an alias-free SW dataset. T1, T2 and PD maps were then obtained using dictionary matching. Results: Phantom study demonstrated that the proposed 3D-MRF acquisition/reconstruction method is able to produce quantitative maps that are consistent with conventional quantification techniques. Retrospectively under-sampled in vivo acquisition revealed that SW + GRAPPA substantially improves quantification accuracy over the current state-of-the-art accelerated 3D MRF. Prospectively under-sampled in vivo study showed that whole brain T1, T2 and PD maps with 1 mm³ resolution could be obtained in 7.5 min. Conclusions: 3D MRF stack-of-spirals acquisition with hybrid SW + GRAPPA reconstruction may provide a feasible approach for rapid, high-resolution quantitative whole-brain imaging.
