Arrow Research

Author name cluster

David Doermann

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

14 papers
1 matching author record

Possible papers (14)

AAAI 2026 · Conference Paper

Textured Geometry Evaluation: Perceptual 3D Textured Shape Metric via 3D Latent-Geometry Network

  • Tianyu Luan
  • Xuelu Feng
  • Zixin Zhu
  • Phani Nuney
  • Sheng Liu
  • Xuan Gong
  • David Doermann
  • Chunming Qiao

Textured high-fidelity 3D models are crucial for games, AR/VR, and film, but human-aligned evaluation methods still lag behind despite recent advances in 3D reconstruction and generation. Existing metrics, such as Chamfer Distance, often fail to align with how humans evaluate the fidelity of 3D shapes. Recent learning-based metrics attempt to improve this by relying on rendered images and 2D image quality metrics. However, these approaches face limitations due to incomplete structural coverage and sensitivity to viewpoint choices. Moreover, most methods are trained on synthetic distortions, which differ significantly from real-world distortions, resulting in a domain gap. To address these challenges, we propose a new fidelity evaluation method that operates directly on textured 3D meshes, without relying on rendering. Our method, named Textured Geometry Evaluation (TGE), jointly uses geometry and color information to compute the fidelity of the input textured mesh in comparison to a reference colored shape. To train and evaluate our metric, we design a human-annotated dataset with real-world distortions. Experiments show that TGE outperforms rendering-based and geometry-only methods on the real-world distortion dataset.
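
For orientation, here is a minimal sketch of the kind of baseline the abstract argues is insufficient: a Chamfer-style score extended naively with color, computed on XYZ+RGB point samples from two textured meshes. This is illustrative only; TGE itself is a learned 3D latent-geometry metric, and the function name and color weighting below are our own stand-ins.

    import numpy as np

    def colored_chamfer(p, q, color_weight=0.5):
        # p, q: (N, 6) arrays of XYZ + RGB points sampled from textured meshes.
        def one_way(a, b):
            d_geo = np.linalg.norm(a[:, None, :3] - b[None, :, :3], axis=-1)  # pairwise geometry distances
            d_col = np.linalg.norm(a[:, None, 3:] - b[None, :, 3:], axis=-1)  # pairwise color distances
            return (d_geo + color_weight * d_col).min(axis=1).mean()          # average nearest-neighbor cost
        return one_way(p, q) + one_way(q, p)                                  # symmetrize the score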

NeurIPS 2025 · Conference Paper

AutoEdit: Automatic Hyperparameter Tuning for Image Editing

  • Chau Pham
  • Quan Dao
  • Mahesh Bhosale
  • Yunjie Tian
  • Dimitris Metaxas
  • David Doermann

Recent advances in diffusion models have revolutionized text-guided image editing, yet existing editing methods face critical challenges in hyperparameter identification. To achieve reasonable editing performance, these methods often require the user to brute-force tune multiple interdependent hyperparameters, such as inversion timesteps and attention modifications. This process incurs high computational costs due to the huge hyperparameter search space. We cast the search for optimal editing hyperparameters as a sequential decision-making task within the diffusion denoising process. Specifically, we propose a reinforcement learning framework that establishes a Markov Decision Process to dynamically adjust hyperparameters across denoising steps, integrating editing objectives into a reward function. The method achieves time efficiency through proximal policy optimization while maintaining optimal hyperparameter configurations. Experiments demonstrate a significant reduction in search time and computational overhead compared to existing brute-force approaches, advancing the practical deployment of diffusion-based image editing frameworks in the real world.
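
A hedged sketch of the decision loop the abstract describes: a policy picks hyperparameters at each denoising step, and a reward scores the final edit, which is what a PPO-style trainer would then optimize. All names here (policy, editor, init_state, denoise_step, edit_reward) are hypothetical stand-ins, not the paper's API.

    def edit_with_policy(policy, editor, image, prompt, num_steps=50):
        state = editor.init_state(image, prompt)            # e.g. inversion latents
        trajectory = []
        for t in reversed(range(num_steps)):                # denoising timesteps T..1
            hparams = policy.act(state, t)                  # per-step hyperparameters
            state = editor.denoise_step(state, t, hparams)  # one denoising step
            trajectory.append((t, hparams))
        reward = editor.edit_reward(state, prompt)          # editing objective as reward
        return state, trajectory, reward                    # trajectory + reward feed PPO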

NeurIPS 2025 · Conference Paper

YOLOv12: Attention-Centric Real-Time Object Detectors

  • Yunjie Tian
  • Qixiang Ye
  • David Doermann

Enhancing the network architecture of the YOLO framework has long been crucial, yet improvements have focused on CNN-based designs despite the proven superiority of attention mechanisms in modeling capability. This is because attention-based models cannot match the speed of CNN-based models. This paper proposes an attention-centric YOLO framework, namely YOLOv12, that matches the speed of previous CNN-based ones while harnessing the performance benefits of attention mechanisms. YOLOv12 surpasses popular real-time object detectors in accuracy at competitive speed. For example, YOLOv12-N achieves 40.5% mAP with an inference latency of 1.62 ms on a T4 GPU, outperforming the advanced YOLOv10-N / YOLO11-N by 2.0%/1.1% mAP at comparable speed. This advantage extends to other model scales. YOLOv12 also surpasses end-to-end real-time detectors that improve DETR, such as RT-DETRv2 / RT-DETRv3: YOLOv12-X beats RT-DETRv2-R101 / RT-DETRv3-R101 while running faster with fewer computations and parameters. See more comparisons in Figure 1. Source code is available at https://github.com/sunsmarterjie/yolov12.
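
A usage sketch, assuming the linked repository keeps the Ultralytics-style interface its codebase builds on; the weight filename below is a guess, so check the repo's README for the exact entry points.

    from ultralytics import YOLO  # assumption: the repo exposes this interface

    model = YOLO("yolov12n.pt")                        # nano variant (40.5% mAP in the paper)
    results = model.predict("image.jpg", imgsz=640, conf=0.25)
    for box in results[0].boxes:                       # one Results object per input image
        print(box.xyxy, box.conf, box.cls)             # coordinates, score, class id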

NeurIPS 2024 · Conference Paper

Artemis: Towards Referential Understanding in Complex Videos

  • Jihao Qiu
  • Yuan Zhang
  • Xi Tang
  • Lingxi Xie
  • Tianren Ma
  • Pengyu Yan
  • David Doermann
  • Qixiang Ye

Videos carry rich visual information including object descriptions, actions, interactions, etc., but existing multimodal large language models (MLLMs) fall short in referential understanding scenarios such as video-based referring. In this paper, we present Artemis, an MLLM that pushes video-based referential understanding to a finer level. Given a video, Artemis receives a natural-language question with a bounding box in any video frame and describes the referred target in the entire video. The key to achieving this goal lies in extracting compact, target-specific video features, where we set a solid baseline by tracking and selecting spatiotemporal features from the video. We train Artemis on the newly established VideoRef45K dataset with 45K video-QA pairs and design a computationally efficient, three-stage training procedure. Results are promising both quantitatively and qualitatively. Additionally, we show that Artemis can be integrated with video grounding and text summarization tools to understand more complex scenarios. Code and data are available at https://github.com/NeurIPS24Artemis/Artemis.
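
A hypothetical outline of the pipeline the abstract sketches: track the referred box across frames, pool compact target-specific features, and condition the MLLM's answer on them. Every helper below is a stand-in for illustration, not the authors' code.

    def describe_target(mllm, tracker, frames, question, box, frame_idx):
        boxes = tracker.track(frames, box, start=frame_idx)             # follow the target per frame
        feats = [mllm.encode_roi(f, b) for f, b in zip(frames, boxes)]  # per-frame target features
        target_tokens = select_spatiotemporal(feats)                    # compact, target-specific tokens
        return mllm.generate(question, visual_tokens=target_tokens)    # answer about the whole video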

AAAI 2024 · Conference Paper

Federated Learning via Input-Output Collaborative Distillation

  • Xuan Gong
  • Shanglin Li
  • Yuxiang Bao
  • Barry Yao
  • Yawen Huang
  • Ziyan Wu
  • Baochang Zhang
  • Yefeng Zheng
  • David Doermann

Federated learning (FL) is a machine learning paradigm in which distributed local nodes collaboratively train a central model without sharing individually held private data. Existing FL methods either iteratively share local model parameters or deploy co-distillation. However, the former is highly susceptible to private data leakage, and the latter relies on the prerequisite of task-relevant real data. Instead, we propose a data-free FL framework based on local-to-central collaborative distillation with direct input- and output-space exploitation. Our design eliminates any requirement of recursive local parameter exchange or auxiliary task-relevant data to transfer knowledge, thereby giving direct privacy control to local users. In particular, to cope with the inherent data heterogeneity across local nodes, our technique learns to distill inputs on which each local model produces consensual yet distinctive results that represent its expertise. Through extensive experiments on image classification and segmentation tasks, under various real-world heterogeneous federated learning settings on both natural and medical images, our proposed FL framework achieves notable privacy-utility trade-offs. Code is available at https://github.com/lsl001006/FedIOD.
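
A minimal sketch, under stated assumptions, of the output-space half of input-output collaborative distillation: synthesized inputs are labeled by an ensemble of the local models and distilled into the central model. The generator, its latent_dim attribute, and the constants are illustrative, not the released FedIOD code.

    import torch
    import torch.nn.functional as F

    def distill_round(central, local_models, generator, opt_c, steps=100):
        for _ in range(steps):
            with torch.no_grad():
                z = torch.randn(64, generator.latent_dim)
                x = generator(z)                                   # data-free distilled inputs
                outs = [m(x).softmax(-1) for m in local_models]    # local output-space knowledge
                consensus = torch.stack(outs).mean(0)              # consensual soft labels
            loss = F.kl_div(central(x).log_softmax(-1), consensus,
                            reduction="batchmean")                 # transfer to the central model
            opt_c.zero_grad(); loss.backward(); opt_c.step()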

NeurIPS 2024 · Conference Paper

Motion Consistency Model: Accelerating Video Diffusion with Disentangled Motion-Appearance Distillation

  • Yuanhao Zhai
  • Kevin Lin
  • Zhengyuan Yang
  • Linjie Li
  • Jianfeng Wang
  • Chung-Ching Lin
  • David Doermann
  • Junsong Yuan

Image diffusion distillation achieves high-fidelity generation with very few sampling steps. However, directly applying these techniques to video models results in unsatisfactory frame quality. This issue arises from the limited frame appearance quality in public video datasets, which affects the performance of both teacher and student video diffusion models. Our study aims to improve video diffusion distillation while enabling the student model to improve frame appearance using abundant high-quality image data. To this end, we propose the motion consistency model (MCM), a single-stage video diffusion distillation method that disentangles motion and appearance learning. Specifically, MCM comprises a video consistency model that distills motion from the video teacher model and an image discriminator that boosts frame appearance to match high-quality image data. However, directly combining these components leads to two significant challenges: a conflict in frame learning objectives, where video distillation learns from low-quality video frames while the image discriminator targets high-quality images; and training-inference discrepancies due to the differing quality of video samples used during training and inference. To address these challenges, we introduce disentangled motion distillation and mixed trajectory distillation. The former applies the distillation objective solely to the motion representation, while the latter mitigates training-inference discrepancies by mixing distillation trajectories from both the low- and high-quality video domains. Extensive experiments show that MCM achieves state-of-the-art video diffusion distillation performance. Additionally, our method can enhance frame quality in video diffusion models, producing frames with high aesthetic value or specific styles.
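
A hypothetical sketch of the two disentangled objectives named above: a distillation loss applied only to a motion representation, plus a generator-side adversarial term that pushes individual frames toward high-quality images. motion_repr, denoise, and sample_frames are illustrative stand-ins, not the paper's formulation.

    import torch
    import torch.nn.functional as F

    def mcm_losses(student, teacher, disc, noisy_video, t):
        pred = student.denoise(noisy_video, t)
        with torch.no_grad():
            target = teacher.denoise(noisy_video, t)
        loss_motion = F.mse_loss(motion_repr(pred), motion_repr(target))  # motion-only distillation
        loss_frame = -disc(sample_frames(pred)).mean()                    # frames judged against HQ images
        return loss_motion, loss_frame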

NeurIPS 2023 · Conference Paper

Defending against Data-Free Model Extraction by Distributionally Robust Defensive Training

  • Zhenyi Wang
  • Li Shen
  • Tongliang Liu
  • Tiehang Duan
  • Yanjun Zhu
  • Donglin Zhan
  • David Doermann
  • Mingchen Gao

Data-Free Model Extraction (DFME) aims to clone a black-box model without knowing its original training data distribution, making it much easier for attackers to steal commercial models. Defending against DFME faces several challenges: (i) effectiveness; (ii) efficiency; and (iii) the absence of priors on the attacker's query data distribution and strategy. However, existing defense methods (1) are highly computation- and memory-inefficient, (2) need strong assumptions about the attack data distribution, or (3) can only delay the attack or prove model theft after the stealing has happened. In this work, we propose a memory- and computation-efficient defense approach, named MeCo, that prevents DFME from happening while simultaneously maintaining model utility, via distributionally robust defensive training on the target victim model. Specifically, we randomize the input so that it (1) causes a mismatch in the knowledge distillation loss for attackers, (2) disturbs zeroth-order gradient estimation, and (3) changes the label predictions for attack query data. As a result, the attacker can only extract misleading information from the black-box model. Extensive experiments on defending against both decision-based and score-based DFME demonstrate that MeCo significantly reduces the effectiveness of existing DFME methods and substantially improves running efficiency.
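
A minimal sketch of the input-randomization idea (our reading, not the MeCo implementation): serving predictions through a random perturbation makes an extractor's distillation loss and zeroth-order gradient estimates inconsistent across queries.

    import torch

    class RandomizedDefense(torch.nn.Module):
        def __init__(self, victim, noise_std=0.05):
            super().__init__()
            self.victim, self.noise_std = victim, noise_std

        @torch.no_grad()
        def forward(self, x):
            x_r = x + self.noise_std * torch.randn_like(x)  # randomize each query
            return self.victim(x_r)                         # misleading yet near-utility output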

AAAI 2023 · Conference Paper

Progressive Multi-View Human Mesh Recovery with Self-Supervision

  • Xuan Gong
  • Liangchen Song
  • Meng Zheng
  • Benjamin Planche
  • Terrence Chen
  • Junsong Yuan
  • David Doermann
  • Ziyan Wu

To date, little attention has been given to multi-view 3D human mesh estimation, despite its real-life applicability (e.g., motion capture, sport analysis) and robustness to single-view ambiguities. Existing solutions typically suffer from poor generalization to new settings, largely due to the limited diversity of image/3D-mesh pairs in multi-view training data. To address this shortcoming, prior work has explored the use of synthetic images. However, besides the usual visual gap between rendered and target data, synthetic-data-driven multi-view estimators also suffer from overfitting to the camera viewpoint distribution sampled during training, which usually differs from real-world distributions. Tackling both challenges, we propose a novel simulation-based training pipeline for multi-view human mesh recovery, which (a) relies on intermediate 2D representations that are more robust to the synthetic-to-real domain gap; (b) leverages learnable calibration and triangulation to adapt to more diversified camera setups; and (c) progressively aggregates multi-view information in a canonical 3D space to remove ambiguities in 2D representations. Through extensive benchmarking, we demonstrate the superiority of the proposed solution, especially for unseen in-the-wild scenarios.
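
A hypothetical outline of the three stages labeled (a)-(c) in the abstract; every helper below is an illustrative stand-in for the corresponding component.

    def recover_mesh(views, pose2d_net, calib_net, triangulate, mesh_head):
        kp2d = [pose2d_net(v) for v in views]   # (a) robust intermediate 2D representations
        cams = calib_net(kp2d)                  # (b) learnable camera calibration
        kp3d = triangulate(kp2d, cams)          # (b) lift keypoints into a canonical 3D space
        return mesh_head(kp3d)                  # (c) progressive multi-view aggregation to a mesh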

AAAI 2022 · Conference Paper

Preserving Privacy in Federated Learning with Ensemble Cross-Domain Knowledge Distillation

  • Xuan Gong
  • Abhishek Sharma
  • Srikrishna Karanam
  • Ziyan Wu
  • Terrence Chen
  • David Doermann
  • Arun Innanje

Federated Learning (FL) is a machine learning paradigm where local nodes collaboratively train a central model while the training data remains decentralized. Existing FL methods typically share model parameters or employ co-distillation to address the issue of unbalanced data distribution. However, they suffer from communication bottlenecks. More importantly, they risk privacy leakage. In this work, we develop a privacy-preserving and communication-efficient method within an FL framework, using one-shot offline knowledge distillation with unlabeled, cross-domain public data. We propose a quantized and noisy ensemble of local predictions from fully trained local models for stronger privacy guarantees without sacrificing accuracy. Based on extensive experiments on image classification and text classification tasks, we show that our privacy-preserving method outperforms baseline FL algorithms in both accuracy and communication efficiency.
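
A sketch under stated assumptions of the quantized, noisy ensemble described above: local soft predictions on public data are quantized and perturbed before averaging, so the central model distills from privacy-hardened labels. The quantization levels and noise scale below are illustrative, not the paper's settings.

    import torch

    def private_ensemble_labels(local_models, public_x, levels=8, noise_std=0.1):
        with torch.no_grad():
            preds = torch.stack([m(public_x).softmax(-1) for m in local_models])
            preds = torch.round(preds * levels) / levels           # quantize each local prediction
            preds = preds + noise_std * torch.randn_like(preds)    # add noise for privacy
            return preds.mean(0).clamp_min(0)                      # ensemble soft labels for distillation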

IJCAI 2021 · Conference Paper

Uncertainty-aware Binary Neural Networks

  • Junhe Zhao
  • Linlin Yang
  • Baochang Zhang
  • Guodong Guo
  • David Doermann

Binary Neural Networks (BNNs) are promising machine learning solutions for deployment on resource-limited devices. Recent approaches to training BNNs have produced impressive results, but minimizing the drop in accuracy relative to full-precision networks remains challenging. One reason is that conventional BNNs ignore the uncertainty caused by weights that are near zero, resulting in instability or frequent sign flips during learning. In this work, we investigate the intrinsic uncertainty of near-zero weights, which makes training vulnerable to instability. We introduce an uncertainty-aware BNN (UaBNN) that leverages a new mapping function, called certainty-sign (c-sign), to reduce these weights' uncertainties. Our c-sign function is the first to train BNNs with decreasing uncertainty during binarization, leading to a controlled learning process. We also introduce a simple but effective method to measure the uncertainty, based on a Gaussian function. Extensive experiments demonstrate that our method improves multiple BNN methods by maintaining training stability and achieves higher performance than prior art.
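
One plausible reading of the Gaussian uncertainty measure, sketched below (not the paper's exact c-sign): uncertainty peaks for near-zero weights, and penalizing it drives weights away from zero so sign flips become rarer.

    import torch

    def gaussian_uncertainty(w, sigma=0.1):
        # Highest for weights near zero, which flip sign most easily.
        return torch.exp(-w.pow(2) / (2 * sigma ** 2))

    def binarize_ste(w):
        # Standard straight-through sign: forward is sign(w), backward is identity.
        return torch.sign(w).detach() + w - w.detach()

    def uncertainty_penalty(w, sigma=0.1):
        # Auxiliary loss term: reducing uncertainty stabilizes binarization.
        return gaussian_uncertainty(w, sigma).mean()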

AAAI 2020 · Conference Paper

Binarized Neural Architecture Search

  • Hanlin Chen
  • Li'an Zhuo
  • Baochang Zhang
  • Xiawu Zheng
  • Jianzhuang Liu
  • David Doermann
  • Rongrong Ji

Neural architecture search (NAS) can have a significant impact in computer vision by automatically designing optimal neural network architectures for various tasks. A variant, binarized neural architecture search (BNAS), with a search space of binarized convolutions, can produce extremely compressed models. Unfortunately, this area remains largely unexplored. BNAS is more challenging than NAS due to the learning inefficiency caused by the optimization requirements and the huge architecture space. To address these issues, we introduce channel sampling and operation space reduction into a differentiable NAS to significantly reduce the cost of searching. This is accomplished through a performance-based strategy that abandons less promising operations. Two optimization methods for binarized neural networks are used to validate the effectiveness of our BNAS. Extensive experiments demonstrate that the proposed BNAS achieves performance comparable to NAS on both the CIFAR and ImageNet databases. An accuracy of 96.53% vs. 97.22% is achieved on the CIFAR-10 dataset, but with a significantly compressed model, and the search is 40% faster than the state-of-the-art PC-DARTS.
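
A minimal sketch of the performance-based operation-space reduction (names illustrative): periodically rank each edge's candidate operations by their architecture weights and abandon the weakest, shrinking the search cost.

    def reduce_operation_space(edges, keep_k):
        # edges: differentiable-NAS edges, each holding candidate ops and
        # architecture weights alpha (both names are our own stand-ins).
        for edge in edges:
            ranked = sorted(edge.ops, key=lambda op: edge.alpha[op], reverse=True)
            edge.ops = ranked[:keep_k]  # abandon less promising operations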

IJCAI 2020 · Conference Paper

CP-NAS: Child-Parent Neural Architecture Search for 1-bit CNNs

  • Li'an Zhuo
  • Baochang Zhang
  • Hanlin Chen
  • Linlin Yang
  • Chen Chen
  • Yanjun Zhu
  • David Doermann

Neural architecture search (NAS) proves to be among the best approaches for many tasks by generating application-adaptive neural architectures, but it is still challenged by high computational cost and memory consumption. At the same time, 1-bit convolutional neural networks (CNNs) with binarized weights and activations show their potential for resource-limited embedded devices. One natural approach is to use 1-bit CNNs to reduce the computation and memory cost of NAS, taking advantage of the strengths of each in a unified framework. To this end, a Child-Parent model is introduced into a differentiable NAS to search for the binarized architecture (Child) under the supervision of a full-precision model (Parent). In the search stage, the Child-Parent model uses an indicator generated from the Parent and Child model accuracies to evaluate performance and abandon operations with less potential. In the training stage, a kernel-level CP loss is introduced to optimize the binarized network. Extensive experiments demonstrate that the proposed CP-NAS achieves accuracy comparable to traditional NAS on both the CIFAR and ImageNet databases. It achieves an accuracy of 95.27% on CIFAR-10 and 64.3% on ImageNet with binarized weights and activations, with a 30% faster search than prior art.
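
A hypothetical form of the Child-Parent indicator described above: score each operation by how much of the full-precision Parent's accuracy the binarized Child retains, then keep only the top scorers. Both functions are illustrative guesses, not the paper's formula.

    def cp_indicator(acc_child, acc_parent, eps=1e-8):
        # High when binarization loses little accuracy relative to the Parent.
        return acc_child / (acc_parent + eps)

    def abandon_ops(ops, scores, keep_k):
        # Keep the keep_k operations with the highest indicator scores.
        return sorted(ops, key=lambda o: scores[o], reverse=True)[:keep_k]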

AAAI 2019 · Conference Paper

Calibrated Stochastic Gradient Descent for Convolutional Neural Networks

  • Li’an Zhuo
  • Baochang Zhang
  • Chen Chen
  • Qixiang Ye
  • Jianzhuang Liu
  • David Doermann

In stochastic gradient descent (SGD) and its variants, the optimized gradient estimators may be as expensive to compute as the true gradient in many scenarios. This paper introduces a calibrated stochastic gradient descent (CSGD) algorithm for deep neural network optimization. A theorem is developed to prove that an unbiased estimator for the network variables can be obtained in a probabilistic way based on the Lipschitz hypothesis. Our work is significantly distinct from existing gradient optimization methods by providing a theoretical framework for unbiased variable estimation in the deep learning paradigm to optimize the model parameter calculation. In particular, we develop a generic gradient calibration layer which can be easily used to build convolutional neural networks (CNNs). Experimental results demonstrate that CNNs with our CSGD optimization scheme can improve the state-of-the-art performance for natural image classification, digit recognition, ImageNet object classification, and object detection tasks. This work opens new research directions for developing more efficient SGD updates and analyzing the backpropagation algorithm.
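
An illustrative stand-in for a "gradient calibration layer": an identity in the forward pass whose backward pass rescales gradients. The real CSGD derives its calibration from the Lipschitz-based unbiased-estimation argument; the fixed scale here is only a placeholder.

    import torch

    class GradCalibrate(torch.autograd.Function):
        @staticmethod
        def forward(ctx, x, scale):
            ctx.scale = scale
            return x                           # identity in the forward pass

        @staticmethod
        def backward(ctx, grad_out):
            return ctx.scale * grad_out, None  # calibrated gradient; none for scale

    # usage: y = GradCalibrate.apply(features, 0.5)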

AAAI 2019 · Conference Paper

Projection Convolutional Neural Networks for 1-bit CNNs via Discrete Back Propagation

  • Jiaxin Gu
  • Ce Li
  • Baochang Zhang
  • Jungong Han
  • Xianbin Cao
  • Jianzhuang Liu
  • David Doermann

The advancement of deep convolutional neural networks (DCNNs) has driven significant improvement in the accuracy of recognition systems for many computer vision tasks. However, their practical applications are often restricted in resource-constrained environments. In this paper, we introduce projection convolutional neural networks (PCNNs) with discrete back propagation via projection (DBPP) to improve the performance of binarized neural networks (BNNs). The contributions of our paper include: 1) for the first time, the projection function is exploited to efficiently solve the discrete back propagation problem, which leads to new highly compressed CNNs (termed PCNNs); 2) by exploiting multiple projections, we learn a set of diverse quantized kernels that compress the full-precision kernels more efficiently than previously proposed methods; 3) PCNNs achieve the best classification performance compared to other state-of-the-art BNNs on the ImageNet and CIFAR datasets.
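
A minimal sketch of the projection primitive that discrete back propagation builds on (illustrative, not the paper's multi-projection formulation): project full-precision kernels onto the nearest discrete level in the forward pass and pass gradients straight through.

    import torch

    def project_kernel(w, levels=(-1.0, 1.0)):
        q = torch.tensor(levels, device=w.device)
        idx = (w.unsqueeze(-1) - q).abs().argmin(-1)  # nearest discrete level per weight
        w_q = q[idx]
        # Straight-through estimator: forward uses projected kernels,
        # backward passes gradients to the full-precision weights.
        return w_q.detach() + w - w.detach()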