
Author name cluster

Jiaming Zhang

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

14 papers
2 author rows

Possible papers (14)

AAAI Conference 2026 Conference Paper

GAHMN: A Generative Approach for High-Dimensional Mediation Analysis

  • Jiaming Zhang
  • Yiqi Lin
  • Rou Zhang
  • Xinyuan Song
  • Hanwen Ning

High-dimensional mediation analysis (HMA) seeks to uncover complex causal mechanisms involving numerous mediators and plays a crucial role in the natural and social sciences. In this work, we introduce the Generative Adversarial High-dimensional Mediation Network (GAHMN), a novel, scalable structured generative framework designed for causal analysis in high-dimensional settings. GAHMN formulates mediation analysis as dual conditional generative blocks, explicitly capturing mediators' dual roles as outcomes influenced by treatments and as predictors affecting outcomes. Each block integrates a high-dimensional partially linear structure with multi-channel convolutional layers, promoting effective parameter sharing and enhanced representation learning. To induce sparsity and accurate mediator selection, GAHMN employs customized min-max optimization problems with L1 penalties on generator parameters, alongside specially designed optimization algorithms for efficient computation. Unlike existing benchmark methods that rely on restrictive parametric assumptions or random-effect specifications, GAHMN flexibly captures heterogeneity, complex distributions, and inter-mediator correlations. With careful design, the computational complexity of GAHMN scales linearly with the number of mediators p, rather than quadratically as in conventional approaches. Theoretical results rigorously establish estimation consistency, convergence rates, and accurate sparse recovery. GAHMN also serves as a structured generative causal modeling framework, extending to causal decomposition, structural equation modeling, and counterfactual policy evaluation. Extensive experiments confirm GAHMN's superior performance and robustness in synthetic and real-world scenarios.
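
As a rough illustration of the dual-block design described above, the sketch below wires two conditional generators, one producing mediators from the treatment and one producing the outcome from the treatment and mediators, each with an L1-penalized linear part. All dimensions, layer sizes, and the CondGenerator class are illustrative assumptions; the adversarial discriminators and the paper's specialized optimizers are omitted.

```python
# Hedged sketch of the dual conditional generative blocks; not the authors' code.
import torch
import torch.nn as nn

p = 200   # number of candidate mediators (assumed)
d_z = 16  # latent noise dimension (assumed)

class CondGenerator(nn.Module):
    """Conditional generator: maps (condition, noise) to a sample."""
    def __init__(self, d_cond, d_out):
        super().__init__()
        # Structured linear part; its weights carry the L1 sparsity penalty.
        self.linear = nn.Linear(d_cond, d_out, bias=False)
        self.net = nn.Sequential(
            nn.Linear(d_cond + d_z, 64), nn.ReLU(), nn.Linear(64, d_out)
        )
    def forward(self, cond):
        z = torch.randn(cond.size(0), d_z)
        return self.linear(cond) + self.net(torch.cat([cond, z], dim=1))

# Block 1: mediators as outcomes of treatment. Block 2: outcome from (treatment, mediators).
gen_m = CondGenerator(d_cond=1, d_out=p)
gen_y = CondGenerator(d_cond=1 + p, d_out=1)

def l1_penalty(gen, lam=1e-3):
    # Sparsity-inducing penalty on the structured linear coefficients only.
    return lam * gen.linear.weight.abs().sum()

t = torch.randint(0, 2, (32, 1)).float()          # toy binary treatment
m_fake = gen_m(t)                                  # generated mediators M | T
y_fake = gen_y(torch.cat([t, m_fake], dim=1))      # generated outcome Y | T, M
sparsity_loss = l1_penalty(gen_m) + l1_penalty(gen_y)  # adversarial terms omitted
```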

AAAI Conference 2026 Conference Paper

HybriDLA: Hybrid Generation for Document Layout Analysis

  • Yufan Chen
  • Omar Moured
  • Ruiping Liu
  • Junwei Zheng
  • Kunyu Peng
  • Jiaming Zhang
  • Rainer Stiefelhagen

Conventional document layout analysis (DLA) depends on empirical priors or a fixed set of learnable queries executed in a single forward pass. While sufficient for early-generation documents with a small, predetermined number of regions, this paradigm struggles with contemporary documents, which exhibit diverse element counts and increasingly complex layouts. To address the challenges posed by modern documents, we present HybriDLA, a novel generative framework that unifies diffusion and autoregressive decoding within a single layer. The diffusion component iteratively refines bounding-box hypotheses, whereas the autoregressive component injects semantic and contextual awareness, enabling precise region prediction even in highly varied layouts. To further enhance detection quality, we design a multi-scale feature-fusion encoder that captures both fine-grained and high-level visual cues. This architecture elevates performance to 83.5% mean Average Precision (mAP). Extensive experiments on the DocLayNet and M6Doc benchmarks demonstrate that HybriDLA sets a new state of the art, outperforming previous approaches.
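
The following toy sketch gestures at how diffusion-style iterative box refinement can be interleaved with autoregressively accumulated context; the refine network, the context update, and all shapes are invented placeholders, not the paper's architecture.

```python
# Hedged sketch of hybrid decoding: iterative box refinement + running context.
import torch
import torch.nn as nn

d_feat = 128
# Small correction network: takes a box and conditioning features, outputs a box delta.
refine = nn.Sequential(nn.Linear(4 + d_feat, 64), nn.ReLU(), nn.Linear(64, 4))

def decode_layout(img_feat, n_boxes=10, steps=4):
    boxes = torch.rand(n_boxes, 4)            # noisy box hypotheses (x, y, w, h)
    context = torch.zeros_like(img_feat)      # autoregressively accumulated context
    decoded = []
    for _ in range(n_boxes):
        for _ in range(steps):                # diffusion-style iterative refinement
            cond = (img_feat + context).expand(boxes.size(0), -1)
            boxes = boxes + refine(torch.cat([boxes, cond], dim=1))
        decoded.append(boxes[0])              # commit one region autoregressively...
        context = context + 0.1 * img_feat    # ...and update the context (placeholder rule)
        boxes = boxes[1:]                     # remaining hypotheses get re-refined
    return torch.stack(decoded)

layout = decode_layout(torch.randn(d_feat))   # (10, 4) predicted regions
```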

AAAI Conference 2026 Conference Paper

TOFA: Training-Free One-Shot Federated Adaptation for Vision-Language Models

  • Li Zhang
  • Zhongxuan Han
  • XiaoHua Feng
  • Jiaming Zhang
  • Yuyuan Li
  • Linbo Jiang
  • Jianan Lin
  • Chaochao Chen

Efficient and lightweight adaptation of pre-trained Vision-Language Models (VLMs) to downstream tasks through collaborative interactions between local clients and a central server is a rapidly emerging research topic in federated learning. Existing adaptation algorithms are typically trained iteratively, which incurs significant communication costs and increases susceptibility to potential attacks. Motivated by one-shot federated training techniques that reduce client-server exchanges to a single round, developing a lightweight one-shot federated VLM adaptation method to alleviate these issues is particularly attractive. However, current one-shot approaches face certain challenges in adapting VLMs within federated settings: (1) insufficient exploitation of the rich multimodal information inherent in VLMs; (2) lack of specialized adaptation strategies to systematically handle severe data heterogeneity; and (3) requiring additional training resources on clients or the server. To bridge these gaps, we propose a novel Training-free One-shot Federated Adaptation framework for VLMs, named TOFA. To fully leverage the generalizable multimodal features in pre-trained VLMs, TOFA employs both visual and textual pipelines to extract task-relevant representations. In the visual pipeline, a hierarchical Bayesian model learns personalized, class-specific prototype distributions. For the textual pipeline, TOFA evaluates and globally aligns the generated local text prompts for robustness. An adaptive weight calibration mechanism is also introduced to combine predictions from both modalities, balancing personalization and robustness to handle data heterogeneity. Our method is training-free, relying on no additional training resources on either the client or server side. Extensive experiments across 9 datasets in various federated settings demonstrate the effectiveness of the proposed TOFA method.
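
A minimal sketch of the fusion step as the abstract describes it: prototype-distance scores from a visual pipeline and similarity scores from a textual pipeline, combined by an adaptive weight. Reducing the hierarchical Bayesian prototypes to plain class means and using an entropy-based weight are simplifying assumptions.

```python
# Hedged sketch of adaptive two-modality fusion; names and rules are assumptions.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def fuse_predictions(img_feat, class_protos, text_feats):
    # Visual pipeline: negative distance to per-class prototype means.
    vis_logits = -np.linalg.norm(class_protos - img_feat, axis=1)
    # Textual pipeline: cosine similarity to class text embeddings.
    txt_logits = text_feats @ img_feat / (
        np.linalg.norm(text_feats, axis=1) * np.linalg.norm(img_feat))
    pv, pt = softmax(vis_logits), softmax(txt_logits)
    # Adaptive calibration: weight the lower-entropy (more confident) modality higher.
    ev = -(pv * np.log(pv + 1e-9)).sum()
    et = -(pt * np.log(pt + 1e-9)).sum()
    w = et / (ev + et + 1e-9)
    return w * pv + (1 - w) * pt

rng = np.random.default_rng(0)
probs = fuse_predictions(rng.normal(size=64),          # image embedding
                         rng.normal(size=(10, 64)),    # 10 class prototype means
                         rng.normal(size=(10, 64)))    # 10 class text embeddings
```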

IROS Conference 2025 Conference Paper

dARt Vinci: Egocentric Data Collection for Surgical Robot Learning at Scale

  • Yihao Liu
  • Yu-Chun Ku
  • Jiaming Zhang
  • Hao Ding 0021
  • Peter Kazanzides
  • Mehran Armand

Data scarcity has long been an issue in the robot learning community. Particularly in safety-critical domains like surgical applications, obtaining high-quality data can be especially difficult. This poses challenges to researchers seeking to exploit recent advancements in reinforcement learning and imitation learning, which have greatly improved generalizability and enabled robots to conduct tasks autonomously. We introduce dARt Vinci, a scalable data collection platform for robot learning in surgical settings. The system uses Augmented Reality (AR) hand tracking and a high-fidelity physics engine to capture subtle maneuvers in primitive surgical tasks: by eliminating the need for a physical robot setup and providing flexibility in terms of time, space, and hardware resources (such as multi-view sensors and actuators), specialized simulation is a viable alternative. At the same time, AR allows robot data collection to be more egocentric, supported by its body tracking and content overlaying capabilities. Our user study, in which widely used primitive tasks are used to train teleoperation with da Vinci surgical robots, confirms the proposed system’s efficiency and usability. Data throughput improves across all tasks compared to real robot settings by 41% on average, and total experiment time is reduced by an average of 10%. The temporal demand in the task load survey is also improved. These gains are statistically significant. Additionally, the collected data is over 400 times smaller in size, requiring far less storage while achieving double the frequency. The source code for this project can be accessed at https://dartvinci.finite-state.com/.

NeurIPS Conference 2025 Conference Paper

FedFACT: A Provable Framework for Controllable Group-Fairness Calibration in Federated Learning

  • Li Zhang
  • Zhongxuan Han
  • XiaoHua Feng
  • Jiaming Zhang
  • Yuyuan Li
  • Chaochao Chen

With the emerging application of Federated Learning (FL) in decision-making scenarios, it is imperative to regulate model fairness to prevent disparities across sensitive groups (e.g., female, male). Current research predominantly focuses on two concepts of group fairness within FL: Global Fairness (overall model disparity across all clients) and Local Fairness (the disparity within each client). However, the non-decomposable, non-differentiable nature of fairness criteria poses two fundamental, unresolved challenges for fair FL: (i) harmonizing global and local fairness, especially in multi-class classification; and (ii) enabling a controllable, optimal accuracy-fairness trade-off. To tackle these challenges, we propose a novel controllable federated group-fairness calibration framework, named FedFACT. FedFACT identifies the Bayes-optimal classifiers under both global and local fairness constraints in the multi-class case, yielding models with minimal performance decline while guaranteeing fairness. To effectively realize an adjustable, optimal accuracy-fairness balance, we derive specific characterizations of the Bayes-optimal fair classifiers that reformulate fair FL as a personalized cost-sensitive learning problem for in-processing, and as bi-level optimization for post-processing. Theoretically, we provide convergence and generalization guarantees for FedFACT to approach near-optimal accuracy under given fairness levels. Extensive experiments on multiple datasets under various degrees of data heterogeneity demonstrate that FedFACT consistently outperforms baselines in balancing accuracy and global-local fairness.
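
To make the in-processing reformulation concrete, the hedged sketch below shows fairness constraints folded into a per-group, per-class cost-sensitive loss; the weight table is illustrative and not the Bayes-optimal characterization derived in the paper.

```python
# Hedged sketch: fairness as cost-sensitive learning via example reweighting.
import torch
import torch.nn.functional as F

def cost_sensitive_loss(logits, labels, groups, cost):
    """cost[g, y]: penalty multiplier for misclassifying class y in group g."""
    per_example = F.cross_entropy(logits, labels, reduction="none")
    return (cost[groups, labels] * per_example).mean()

n_classes, n_groups = 3, 2
cost = torch.ones(n_groups, n_classes)
cost[1, :] = 1.5                     # up-weight the disadvantaged group (assumed value)
logits = torch.randn(8, n_classes, requires_grad=True)
labels = torch.randint(0, n_classes, (8,))
groups = torch.randint(0, n_groups, (8,))
cost_sensitive_loss(logits, labels, groups, cost).backward()  # trains as usual
```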

NeurIPS Conference 2025 Conference Paper

mmWalk: Towards Multi-modal Multi-view Walking Assistance

  • Kedi Ying
  • Ruiping Liu
  • Chongyan Chen
  • Mingzhe Tao
  • Hao Shi
  • Kailun Yang
  • Jiaming Zhang
  • Rainer Stiefelhagen

Walking assistance in extreme or complex environments remains a significant challenge for people with blindness or low vision (BLV), largely due to the lack of a holistic scene understanding. Motivated by the real-world needs of the BLV community, we build mmWalk, a simulated multi-modal dataset that integrates multi-view sensor and accessibility-oriented features for safe outdoor navigation. Our dataset comprises 120 manually controlled, scenario-categorized walking trajectories with 62k synchronized frames. It contains over 559k panoramic images across RGB, depth, and semantic modalities. Furthermore, to emphasize real-world relevance, each trajectory involves outdoor corner cases and accessibility-specific landmarks for BLV users. Additionally, we generate mmWalkVQA, a VQA benchmark with over 69k visual question-answer triplets across 9 categories tailored for safe and informed walking assistance. We evaluate state-of-the-art Vision-Language Models (VLMs) in zero- and few-shot settings and find that they struggle with our risk assessment and navigational tasks. We validate our mmWalk-finetuned model on real-world datasets and show the effectiveness of our dataset for advancing multi-modal walking assistance.

NeurIPS Conference 2025 Conference Paper

Situat3DChange: Situated 3D Change Understanding Dataset for Multimodal Large Language Model

  • Ruiping Liu
  • Junwei Zheng
  • Yufan Chen
  • Zirui Wang
  • Kunyu Peng
  • Kailun Yang
  • Jiaming Zhang
  • Marc Pollefeys

Physical environments and circumstances are fundamentally dynamic, yet current 3D datasets and evaluation benchmarks tend to concentrate on either dynamic scenarios or dynamic situations in isolation, resulting in incomplete comprehension. To overcome these constraints, we introduce Situat3DChange, an extensive dataset supporting three situation-aware change understanding tasks following the perception-action model: 121K question-answer pairs, 36K change descriptions for perception tasks, and 17K rearrangement instructions for the action task. To construct this large-scale dataset, Situat3DChange leverages 11K human observations of environmental changes to establish shared mental models and shared situational awareness for human-AI collaboration. These observations, enriched with egocentric and allocentric perspectives as well as categorical and coordinate spatial relations, are integrated using an LLM to support understanding of situated changes. To address the challenge of comparing pairs of point clouds from the same scene with minor changes, we propose SCReasoner, an efficient 3D MLLM approach that enables effective point cloud comparison with minimal parameter overhead and no additional tokens required for the language decoder. Comprehensive evaluation on Situat3DChange tasks highlights both the progress and limitations of MLLMs in dynamic scene and situation understanding. Additional experiments on data scaling and cross-domain transfer demonstrate the task-agnostic effectiveness of using Situat3DChange as a training dataset for MLLMs. The established dataset and source code are publicly available at https://github.com/RuipingL/Situat3DChange.

NeurIPS Conference 2024 Conference Paper

CURE4Rec: A Benchmark for Recommendation Unlearning with Deeper Influence

  • Chaochao Chen
  • Jiaming Zhang
  • Yizhao Zhang
  • Li Zhang
  • Lingjuan Lyu
  • Yuyuan Li
  • Biao Gong
  • Chenggang Yan

With increasing privacy concerns in artificial intelligence, regulations have mandated the right to be forgotten, granting individuals the right to withdraw their data from models. Machine unlearning has emerged as a potential solution to enable selective forgetting in models, particularly in recommender systems where historical data contains sensitive user information. Despite recent advances in recommendation unlearning, evaluating unlearning methods comprehensively remains challenging due to the absence of a unified evaluation framework and overlooked aspects of deeper influence, e.g., fairness. To address these gaps, we propose CURE4Rec, the first comprehensive benchmark for recommendation unlearning evaluation. CURE4Rec covers four aspects, i.e., unlearning Completeness, recommendation Utility, unleaRning efficiency, and recommendation fairnEss, under three data selection strategies, i.e., core data, edge data, and random data. Specifically, we consider the deeper influence of unlearning on recommendation fairness and robustness towards data with varying impact levels. We construct multiple datasets with CURE4Rec evaluation and conduct extensive experiments on existing recommendation unlearning methods. Our code is released at https://github.com/xiye7lai/CURE4Rec.

ICRA Conference 2024 Conference Paper

GBEC: Geometry-Based Hand-Eye Calibration

  • Yihao Liu 0004
  • Jiaming Zhang
  • Zhangcong She
  • Amir Kheradmand
  • Mehran Armand

Hand-eye calibration is the problem of solving the transformation from the end-effector of a robot to the sensor attached to it. Commonly employed techniques, such as AX=XB or AX=ZB formulations, rely on regression methods that require collecting pose data from different robot configurations, which can produce low accuracy and repeatability. However, the derived transformation should solely depend on the geometry of the end-effector and the sensor attachment. We propose Geometry-Based End-Effector Calibration (GBEC) that enhances the repeatability and accuracy of the derived transformation compared to traditional hand-eye calibrations. To demonstrate improvements, we apply the approach to two different robot-assisted procedures: Transcranial Magnetic Stimulation (TMS) and femoroplasty. We also discuss the generalizability of GBEC for camera-in-hand and marker-in-hand sensor mounting methods. In the experiments, we perform GBEC between the robot end-effector and an optical tracker’s rigid body marker attached to the TMS coil or femoroplasty drill guide. Previous research documents low repeatability and accuracy of the conventional methods for robot-assisted TMS hand-eye calibration. Applying GBEC to repeated calibrations, we obtain transformations with standard deviations of 0.37 mm, 0.65 mm, and 0.40 mm (translation) along the x, y, and z axes of the end-effector, respectively. The tool alignment experiments after using GBEC achieve a mean accuracy of around 0.2 mm in Euclidean distance. When compared to some existing methods, the proposed method relies solely on the geometry of the flange and the pose of the rigid-body marker, making it independent of workspace constraints or robot accuracy, without sacrificing the orthogonality of the rotation matrix. Our results validate the accuracy and applicability of the approach, providing a new and generalizable methodology for obtaining the transformation from the end-effector to a sensor.
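
A minimal sketch of the geometry-based idea, under the assumption that the mount geometry is known (e.g., from CAD): the flange-to-marker transform is obtained by composing fixed rigid transforms rather than by regressing over robot pose pairs. The specific matrices below are placeholders.

```python
# Hedged sketch: hand-eye transform by pure geometric composition.
import numpy as np

def rigid(R, t):
    """Build a 4x4 homogeneous transform from rotation R and translation t."""
    T = np.eye(4)
    T[:3, :3], T[:3, 3] = R, t
    return T

# Known geometry (assumed): flange -> mount adapter, adapter -> marker body.
T_flange_adapter = rigid(np.eye(3), np.array([0.0, 0.0, 0.05]))   # 50 mm standoff
theta = np.deg2rad(30.0)
R_adapter_marker = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                             [np.sin(theta),  np.cos(theta), 0.0],
                             [0.0, 0.0, 1.0]])
T_adapter_marker = rigid(R_adapter_marker, np.array([0.01, 0.0, 0.02]))

# Composition: no pose-pair regression, so the result does not inherit robot
# kinematic error or workspace constraints, and the rotation stays orthogonal.
T_flange_marker = T_flange_adapter @ T_adapter_marker
```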

AAAI Conference 2024 Conference Paper

Navigating Open Set Scenarios for Skeleton-Based Action Recognition

  • Kunyu Peng
  • Cheng Yin
  • Junwei Zheng
  • Ruiping Liu
  • David Schneider
  • Jiaming Zhang
  • Kailun Yang
  • M. Saquib Sarfraz

In real-world scenarios, human actions often fall outside the distribution of training data, making it crucial for models to recognize known actions and reject unknown ones. However, using pure skeleton data in such open-set conditions poses challenges due to the lack of visual background cues and the distinct sparse structure of body pose sequences. In this paper, we tackle the unexplored Open-Set Skeleton-based Action Recognition (OS-SAR) task and formalize the benchmark on three skeleton-based datasets. We assess the performance of seven established open-set approaches on our task and identify their limits and critical generalization issues when dealing with skeleton information. To address these challenges, we propose a distance-based cross-modality ensemble method that leverages the cross-modal alignment of skeleton joints, bones, and velocities to achieve superior open-set recognition performance. We refer to the key idea as CrossMax - an approach that utilizes a novel cross-modality mean max discrepancy suppression mechanism to align latent spaces during training and a cross-modality distance-based logits refinement method during testing. CrossMax outperforms existing approaches and consistently yields state-of-the-art results across all datasets and backbones. We will release the benchmark, code, and models to the community.
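
The sketch below illustrates the general shape of a distance-based cross-modality ensemble with open-set rejection: per-modality prototype distances are fused, and samples far from every class are labeled unknown. The mean fusion rule and the threshold are assumptions, not the paper's CrossMax mechanism.

```python
# Hedged sketch of distance-based cross-modality open-set prediction.
import numpy as np

def modality_logits(feat, protos):
    return -np.linalg.norm(protos - feat, axis=1)     # nearer prototype => higher score

def crossmodal_predict(feats, protos, reject_thresh=-4.0):
    # feats / protos are dicts keyed by modality; threshold is tuned in practice.
    fused = np.mean([modality_logits(feats[m], protos[m]) for m in feats], axis=0)
    if fused.max() < reject_thresh:                   # far from every class prototype
        return -1                                     # reject as unknown
    return int(fused.argmax())

rng = np.random.default_rng(0)
mods = ["joint", "bone", "velocity"]
protos = {m: rng.normal(size=(5, 32)) for m in mods}  # 5 known classes per modality
known = {m: protos[m][2] + 0.1 * rng.normal(size=32) for m in mods}
novel = {m: rng.normal(size=32) for m in mods}
print(crossmodal_predict(known, protos))   # 2: close to a known class
print(crossmodal_predict(novel, protos))   # -1: rejected as unknown
```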

ICRA Conference 2024 Conference Paper

Realtime Robust Shape Estimation of Deformable Linear Object

  • Jiaming Zhang
  • Zhaomeng Zhang
  • Yihao Liu 0004
  • Yaqian Chen
  • Amir Kheradmand
  • Mehran Armand

Realtime shape estimation of continuum objects and manipulators is essential for developing accurate planning and control paradigms. Existing methods that create dense point clouds from camera images and/or use distinguishable markers on a deformable body have limitations in realtime tracking of large continuum objects/manipulators. The physical occlusion of markers can often compromise accurate shape estimation. We propose a robust method to estimate the shape of linear deformable objects in realtime using scattered and unordered key points. By utilizing a robust probability-based labeling algorithm, our approach identifies the true order of the detected key points and then reconstructs the shape using piecewise spline interpolation. The approach only relies on knowing the number of the key points and the interval between two neighboring points. We demonstrate the robustness of the method when key points are partially occluded. The proposed method is also integrated into a simulation in Unity for tracking the shape of a cable with a length of 1 m and a radius of 5 mm. The simulation results show that our proposed approach achieves an average length error of 1.07% over the continuum’s centerline and an average cross-section error of 2.11 mm. The real-world experiments of tracking and estimating a heavy-load cable prove that the proposed approach is robust under occlusion and complex entanglement scenarios.
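
A minimal sketch of the reconstruction step: a greedy nearest-neighbour chain stands in for the paper's probability-based labeling algorithm, after which SciPy fits a parametric spline through the ordered keypoints to recover the centerline.

```python
# Hedged sketch: order unordered keypoints, then spline-interpolate the centerline.
import numpy as np
from scipy.interpolate import splprep, splev

def order_points(pts):
    # Greedy chaining from a likely endpoint (farthest from centroid); a
    # simplification of the paper's robust probability-based labeling.
    start = int(np.argmax(np.linalg.norm(pts - pts.mean(axis=0), axis=1)))
    remaining = list(range(len(pts)))
    order = [remaining.pop(remaining.index(start))]
    while remaining:
        last = pts[order[-1]]
        nxt = min(remaining, key=lambda i: np.linalg.norm(pts[i] - last))
        order.append(nxt)
        remaining.remove(nxt)
    return pts[order]

rng = np.random.default_rng(0)
curve = np.stack([np.linspace(0, 1, 12),
                  np.sin(np.linspace(0, np.pi, 12)),
                  np.zeros(12)], axis=1)
keypoints = rng.permutation(curve)                   # scattered, unordered key points
ordered = order_points(keypoints)
tck, _ = splprep(ordered.T, s=0)                     # piecewise spline through ordered points
centerline = np.array(splev(np.linspace(0, 1, 200), tck)).T  # dense centerline samples
```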

TMLR Journal 2023 Journal Article

A Modulation Layer to Increase Neural Network Robustness Against Data Quality Issues

  • Mohamed Abdelhack
  • Jiaming Zhang
  • Sandhya Tripathi
  • Bradley A Fritz
  • Daniel Felsky
  • Michael Avidan
  • Yixin Chen
  • Christopher Ryan King

Data missingness and quality are common problems in machine learning, especially for high-stakes applications such as healthcare. Developers often train machine learning models on carefully curated datasets using only high-quality data; however, this reduces the utility of such models in production environments. We propose a novel neural network modification that mitigates the impacts of low-quality and missing data by replacing the fixed weights of a fully connected layer with a function of an additional input. This is inspired by neuromodulation in biological neural networks, where the cortex can up- and down-regulate inputs based on their reliability and the presence of other data. In testing, with reliability scores as a modulating signal, models with modulating layers were found to be more robust against data quality degradation, including additional missingness. These models are superior to imputation as they save on training time by entirely skipping the imputation process and further allow the introduction of other data quality measures that imputation cannot handle. Our results suggest that explicitly accounting for reduced information quality with a modulating fully connected layer can enable the deployment of artificial intelligence systems in real-time applications.
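
The core idea lends itself to a short sketch: a layer whose weight matrix is generated from a reliability signal instead of being fixed, so the layer can down-weight inputs it is told are unreliable. Layer sizes and the weight-generating network are illustrative assumptions.

```python
# Hedged sketch of a modulated fully connected layer; not the authors' code.
import torch
import torch.nn as nn

class ModulatedLinear(nn.Module):
    def __init__(self, d_in, d_out):
        super().__init__()
        self.d_in, self.d_out = d_in, d_out
        # Maps per-feature reliability scores to a full weight matrix.
        self.weight_net = nn.Linear(d_in, d_in * d_out)
        self.bias = nn.Parameter(torch.zeros(d_out))

    def forward(self, x, reliability):
        # Per-sample weights: each example gets its own effective layer.
        W = self.weight_net(reliability).view(-1, self.d_out, self.d_in)
        return torch.einsum("boi,bi->bo", W, x) + self.bias

layer = ModulatedLinear(8, 4)
x = torch.randn(32, 8)
reliability = torch.rand(32, 8)        # 0 = missing/unreliable, 1 = trusted
out = layer(x, reliability)            # (32, 4); no imputation step required
```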

AAAI Conference 2023 Conference Paper

ImageNet Pre-training Also Transfers Non-robustness

  • Jiaming Zhang
  • Jitao Sang
  • Qi Yi
  • Yunfan Yang
  • Huiwen Dong
  • Jian Yu

ImageNet pre-training has enabled state-of-the-art results on many tasks. In spite of its recognized contribution to generalization, we observed in this study that ImageNet pre-training also transfers adversarial non-robustness from the pre-trained model into the fine-tuned model in downstream classification tasks. We first conducted experiments on various datasets and network backbones to uncover the adversarial non-robustness of fine-tuned models. Further analysis examined the learned knowledge of the fine-tuned model and a standard model, and revealed that the cause of the non-robustness is the non-robust features transferred from the ImageNet pre-trained model. Finally, we analyzed the feature-learning preferences of the pre-trained model, explored the factors influencing robustness, and introduced a simple robust ImageNet pre-training solution. Our code is available at https://github.com/jiamingzhang94/ImageNet-Pretraining-transfers-non-robustness.
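
A minimal FGSM probe of the kind commonly used to expose adversarial non-robustness in a fine-tuned classifier; the stand-in model, epsilon, and data below are placeholders, not the paper's experimental setup.

```python
# Hedged sketch: measure clean vs. FGSM-perturbed accuracy of a classifier.
import torch
import torch.nn as nn

def fgsm_attack(model, x, y, eps=8 / 255):
    x = x.clone().detach().requires_grad_(True)
    loss = nn.functional.cross_entropy(model(x), y)
    loss.backward()
    # One signed-gradient step, clipped back to the valid image range.
    return (x + eps * x.grad.sign()).clamp(0, 1).detach()

model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))  # stand-in classifier
x = torch.rand(4, 3, 32, 32)
y = torch.randint(0, 10, (4,))
x_adv = fgsm_attack(model, x, y)
clean_acc = (model(x).argmax(1) == y).float().mean()
adv_acc = (model(x_adv).argmax(1) == y).float().mean()  # a large drop signals non-robustness
```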

TIST Journal 2017 Journal Article

Location-Based Parallel Tag Completion for Geo-Tagged Social Image Retrieval

  • Jiaming Zhang
  • Shuhui Wang
  • Qingming Huang

Benefiting from the tremendous growth of user-generated content, socially annotated tags have gained importance in the organization and retrieval of large-scale image databases on Online Sharing Websites (OSW). To obtain high-quality tags from existing community-contributed tags with missing information and noise, tag-based annotation and recommendation methods have been proposed to improve tag prediction. While images from OSW contain rich social attributes, existing methods have not taken full advantage of these attributes and the auxiliary information associated with social images to construct global information-completion models. In this article, beyond the image-tag relation, we take full advantage of ubiquitous GPS locations and the image-user relationship to enhance the accuracy of tag prediction and improve computational efficiency. For GPS locations, we define the popular geo-locations where people tend to take more images as Points of Interest (POI), which are discovered by a mean-shift approach. For the image-user relationship, we integrate a localized prior constraint, expecting the completed tag sub-matrix in each POI to maintain consistency with users’ tagging behaviors. Based on these two key issues, we propose a unified tag matrix completion framework, which learns the image-tag relation within each POI. To solve the optimization problem, an efficient proximal sub-gradient descent algorithm is designed. The model optimization can be easily parallelized and distributed to learn the tag sub-matrix for each POI. Extensive experimental results reveal that the learned tag sub-matrix of each POI reflects the major trend of users’ tagging results with respect to different POIs and users, and the parallel learning process provides strong support for processing large-scale online image databases. To meet the response-time requirements and storage limitations of Tag-based Image Retrieval (TBIR) on mobile devices, we introduce Asymmetric Locality Sensitive Hashing (ALSH) to reduce the time cost while improving retrieval efficiency.
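
A hedged sketch of the per-POI completion step: proximal gradient descent on one tag sub-matrix, with an L1 soft-thresholding prox standing in for the paper's full objective with its localized user-behavior prior. Since each POI sub-matrix is handled independently, the loop parallelizes naturally, matching the abstract's claim.

```python
# Hedged sketch of per-POI tag sub-matrix completion; hyper-parameters assumed.
import numpy as np

def complete_tags(T_obs, mask, lam=0.05, lr=0.1, iters=200):
    """T_obs: observed image-tag matrix for one POI; mask: 1 where observed."""
    X = T_obs.copy()
    for _ in range(iters):
        grad = mask * (X - T_obs)                 # gradient of the data-fit term
        X = X - lr * grad
        # Proximal step for the L1 regularizer (entrywise soft-thresholding).
        X = np.sign(X) * np.maximum(np.abs(X) - lr * lam, 0.0)
    return X

rng = np.random.default_rng(0)
truth = (rng.random((40, 15)) < 0.2).astype(float)    # toy image-tag matrix for one POI
mask = (rng.random(truth.shape) < 0.6).astype(float)  # 60% of entries observed
completed = complete_tags(truth * mask, mask)         # one sub-matrix; POIs run in parallel
```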