Arrow Research

Author name cluster

David Doermann

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

14 papers
1 matching author record

Possible papers (14)

AAAI 2026 · Conference Paper

Textured Geometry Evaluation: Perceptual 3D Textured Shape Metric via 3D Latent-Geometry Network

  • Tianyu Luan
  • Xuelu Feng
  • Zixin Zhu
  • Phani Nuney
  • Sheng Liu
  • Xuan Gong
  • David Doermann
  • Chunming Qiao

Textured high-fidelity 3D models are crucial for games, AR/VR, and film, but human-aligned evaluation methods still lag behind despite recent advances in 3D reconstruction and generation. Existing metrics, such as Chamfer Distance, often fail to align with how humans evaluate the fidelity of 3D shapes. Recent learning-based metrics attempt to improve this by relying on rendered images and 2D image quality metrics. However, these approaches face limitations due to incomplete structural coverage and sensitivity to viewpoint choices. Moreover, most methods are trained on synthetic distortions, which differ significantly from real-world distortions, resulting in a domain gap. To address these challenges, we propose a new fidelity evaluation method that operates directly on textured 3D meshes, without relying on rendering. Our method, named Textured Geometry Evaluation (TGE), jointly uses geometry and color information to compute the fidelity of the input textured mesh in comparison to a reference colored shape. To train and evaluate our metric, we design a human-annotated dataset with real-world distortions. Experiments show that TGE outperforms rendering-based and geometry-only methods on the real-world distortion dataset.
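
For orientation, here is a minimal sketch of the kind of baseline the abstract argues is insufficient: a Chamfer-style score extended naively with color, computed on XYZ+RGB point samples from two textured meshes. This is illustrative only; TGE itself is a learned 3D latent-geometry metric, and the function name and color weighting below are our own stand-ins.

    import numpy as np

    def colored_chamfer(p, q, color_weight=0.5):
        # p, q: (N, 6) arrays of XYZ + RGB points sampled from textured meshes.
        def one_way(a, b):
            d_geo = np.linalg.norm(a[:, None, :3] - b[None, :, :3], axis=-1)  # pairwise geometry distances
            d_col = np.linalg.norm(a[:, None, 3:] - b[None, :, 3:], axis=-1)  # pairwise color distances
            return (d_geo + color_weight * d_col).min(axis=1).mean()          # average nearest-neighbor cost
        return one_way(p, q) + one_way(q, p)                                  # symmetrize the score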

NeurIPS 2025 · Conference Paper

AutoEdit: Automatic Hyperparameter Tuning for Image Editing

  • Chau Pham
  • Quan Dao
  • Mahesh Bhosale
  • Yunjie Tian
  • Dimitris Metaxas
  • David Doermann

Recent advances in diffusion models have revolutionized text-guided image editing, yet existing editing methods face critical challenges in hyperparameter identification. To achieve reasonable editing performance, these methods often require the user to brute-force tune multiple interdependent hyperparameters, such as inversion timesteps and attention modifications. This process incurs high computational costs due to the huge hyperparameter search space. We cast the search for optimal editing hyperparameters as a sequential decision-making task within the diffusion denoising process. Specifically, we propose a reinforcement learning framework that establishes a Markov Decision Process to dynamically adjust hyperparameters across denoising steps, integrating editing objectives into a reward function. The method achieves time efficiency through proximal policy optimization while maintaining optimal hyperparameter configurations. Experiments demonstrate a significant reduction in search time and computational overhead compared to existing brute-force approaches, advancing the practical deployment of diffusion-based image editing frameworks in the real world.
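
A hedged sketch of the decision loop the abstract describes: a policy picks hyperparameters at each denoising step, and a reward scores the final edit, which is what a PPO-style trainer would then optimize. All names here (policy, editor, init_state, denoise_step, edit_reward) are hypothetical stand-ins, not the paper's API.

    def edit_with_policy(policy, editor, image, prompt, num_steps=50):
        state = editor.init_state(image, prompt)            # e.g. inversion latents
        trajectory = []
        for t in reversed(range(num_steps)):                # denoising timesteps T..1
            hparams = policy.act(state, t)                  # per-step hyperparameters
            state = editor.denoise_step(state, t, hparams)  # one denoising step
            trajectory.append((t, hparams))
        reward = editor.edit_reward(state, prompt)          # editing objective as reward
        return state, trajectory, reward                    # trajectory + reward feed PPO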

NeurIPS 2025 · Conference Paper

YOLOv12: Attention-Centric Real-Time Object Detectors

  • Yunjie Tian
  • Qixiang Ye
  • David Doermann

Enhancing the network architecture of the YOLO framework has long been crucial, yet improvements have focused on CNN-based designs despite the proven superiority of attention mechanisms in modeling capability. This is because attention-based models cannot match the speed of CNN-based models. This paper proposes an attention-centric YOLO framework, namely YOLOv12, that matches the speed of previous CNN-based ones while harnessing the performance benefits of attention mechanisms. YOLOv12 surpasses popular real-time object detectors in accuracy at competitive speed. For example, YOLOv12-N achieves 40.5% mAP with an inference latency of 1.62 ms on a T4 GPU, outperforming the advanced YOLOv10-N / YOLO11-N by 2.0%/1.1% mAP at comparable speed. This advantage extends to other model scales. YOLOv12 also surpasses end-to-end real-time detectors that improve DETR, such as RT-DETRv2 / RT-DETRv3: YOLOv12-X beats RT-DETRv2-R101 / RT-DETRv3-R101 while running faster with fewer computations and parameters. See more comparisons in Figure 1. Source code is available at https://github.com/sunsmarterjie/yolov12.
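
A usage sketch, assuming the linked repository keeps the Ultralytics-style interface its codebase builds on; the weight filename below is a guess, so check the repo's README for the exact entry points.

    from ultralytics import YOLO  # assumption: the repo exposes this interface

    model = YOLO("yolov12n.pt")                        # nano variant (40.5% mAP in the paper)
    results = model.predict("image.jpg", imgsz=640, conf=0.25)
    for box in results[0].boxes:                       # one Results object per input image
        print(box.xyxy, box.conf, box.cls)             # coordinates, score, class id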

NeurIPS 2024 · Conference Paper

Artemis: Towards Referential Understanding in Complex Videos

  • Jihao Qiu
  • Yuan Zhang
  • Xi Tang
  • Lingxi Xie
  • Tianren Ma
  • Pengyu Yan
  • David Doermann
  • Qixiang Ye

Videos carry rich visual information including object descriptions, actions, interactions, etc., but existing multimodal large language models (MLLMs) fall short in referential understanding scenarios such as video-based referring. In this paper, we present Artemis, an MLLM that pushes video-based referential understanding to a finer level. Given a video, Artemis receives a natural-language question with a bounding box in any video frame and describes the referred target in the entire video. The key to achieving this goal lies in extracting compact, target-specific video features, where we set a solid baseline by tracking and selecting spatiotemporal features from the video. We train Artemis on the newly established VideoRef45K dataset with 45K video-QA pairs and design a computationally efficient, three-stage training procedure. Results are promising both quantitatively and qualitatively. Additionally, we show that Artemis can be integrated with video grounding and text summarization tools to understand more complex scenarios. Code and data are available at https://github.com/NeurIPS24Artemis/Artemis.
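
A hypothetical outline of the pipeline the abstract sketches: track the referred box across frames, pool compact target-specific features, and condition the MLLM's answer on them. Every helper below is a stand-in for illustration, not the authors' code.

    def describe_target(mllm, tracker, frames, question, box, frame_idx):
        boxes = tracker.track(frames, box, start=frame_idx)             # follow the target per frame
        feats = [mllm.encode_roi(f, b) for f, b in zip(frames, boxes)]  # per-frame target features
        target_tokens = select_spatiotemporal(feats)                    # compact, target-specific tokens
        return mllm.generate(question, visual_tokens=target_tokens)    # answer about the whole video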

AAAI 2024 · Conference Paper

Federated Learning via Input-Output Collaborative Distillation

  • Xuan Gong
  • Shanglin Li
  • Yuxiang Bao
  • Barry Yao
  • Yawen Huang
  • Ziyan Wu
  • Baochang Zhang
  • Yefeng Zheng
  • David Doermann

Federated learning (FL) is a machine learning paradigm in which distributed local nodes collaboratively train a central model without sharing individually held private data. Existing FL methods either iteratively share local model parameters or deploy co-distillation. However, the former is highly susceptible to private data leakage, and the latter relies on the prerequisite of task-relevant real data. Instead, we propose a data-free FL framework based on local-to-central collaborative distillation with direct input- and output-space exploitation. Our design eliminates any requirement of recursive local parameter exchange or auxiliary task-relevant data to transfer knowledge, thereby giving direct privacy control to local users. In particular, to cope with the inherent data heterogeneity across local nodes, our technique learns to distill inputs on which each local model produces consensual yet distinctive results that represent its expertise. Through extensive experiments on image classification and segmentation tasks, under various real-world heterogeneous federated learning settings on both natural and medical images, our proposed FL framework achieves notable privacy-utility trade-offs. Code is available at https://github.com/lsl001006/FedIOD.
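
A minimal sketch, under stated assumptions, of the output-space half of input-output collaborative distillation: synthesized inputs are labeled by an ensemble of the local models and distilled into the central model. The generator, its latent_dim attribute, and the constants are illustrative, not the released FedIOD code.

    import torch
    import torch.nn.functional as F

    def distill_round(central, local_models, generator, opt_c, steps=100):
        for _ in range(steps):
            with torch.no_grad():
                z = torch.randn(64, generator.latent_dim)
                x = generator(z)                                   # data-free distilled inputs
                outs = [m(x).softmax(-1) for m in local_models]    # local output-space knowledge
                consensus = torch.stack(outs).mean(0)              # consensual soft labels
            loss = F.kl_div(central(x).log_softmax(-1), consensus,
                            reduction="batchmean")                 # transfer to the central model
            opt_c.zero_grad(); loss.backward(); opt_c.step()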

NeurIPS 2024 · Conference Paper

Motion Consistency Model: Accelerating Video Diffusion with Disentangled Motion-Appearance Distillation

  • Yuanhao Zhai
  • Kevin Lin
  • Zhengyuan Yang
  • Linjie Li
  • Jianfeng Wang
  • Chung-Ching Lin
  • David Doermann
  • Junsong Yuan

Image diffusion distillation achieves high-fidelity generation with very few sampling steps. However, directly applying these techniques to video models results in unsatisfactory frame quality. This issue arises from the limited frame appearance quality in public video datasets, which affects the performance of both teacher and student video diffusion models. Our study aims to improve video diffusion distillation while enabling the student model to improve frame appearance using abundant high-quality image data. To this end, we propose the motion consistency model (MCM), a single-stage video diffusion distillation method that disentangles motion and appearance learning. Specifically, MCM comprises a video consistency model that distills motion from the video teacher model and an image discriminator that boosts frame appearance to match high-quality image data. However, directly combining these components leads to two significant challenges: a conflict in frame learning objectives, where video distillation learns from low-quality video frames while the image discriminator targets high-quality images; and training-inference discrepancies due to the differing quality of video samples used during training and inference. To address these challenges, we introduce disentangled motion distillation and mixed trajectory distillation. The former applies the distillation objective solely to the motion representation, while the latter mitigates training-inference discrepancies by mixing distillation trajectories from both the low- and high-quality video domains. Extensive experiments show that MCM achieves state-of-the-art video diffusion distillation performance. Additionally, our method can enhance frame quality in video diffusion models, producing frames with high aesthetic value or specific styles.
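
A hypothetical sketch of the two disentangled objectives named above: a distillation loss applied only to a motion representation, plus a generator-side adversarial term that pushes individual frames toward high-quality images. motion_repr, denoise, and sample_frames are illustrative stand-ins, not the paper's formulation.

    import torch
    import torch.nn.functional as F

    def mcm_losses(student, teacher, disc, noisy_video, t):
        pred = student.denoise(noisy_video, t)
        with torch.no_grad():
            target = teacher.denoise(noisy_video, t)
        loss_motion = F.mse_loss(motion_repr(pred), motion_repr(target))  # motion-only distillation
        loss_frame = -disc(sample_frames(pred)).mean()                    # frames judged against HQ images
        return loss_motion, loss_frame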

NeurIPS 2023 · Conference Paper

Defending against Data-Free Model Extraction by Distributionally Robust Defensive Training

  • Zhenyi Wang
  • Li Shen
  • Tongliang Liu
  • Tiehang Duan
  • Yanjun Zhu
  • Donglin Zhan
  • David Doermann
  • Mingchen Gao

Data-Free Model Extraction (DFME) aims to clone a black-box model without knowing its original training data distribution, making it much easier for attackers to steal commercial models. Defending against DFME faces several challenges: (i) effectiveness; (ii) efficiency; and (iii) the absence of priors on the attacker's query data distribution and strategy. However, existing defense methods (1) are highly computation- and memory-inefficient, (2) need strong assumptions about the attack data distribution, or (3) can only delay the attack or prove model theft after the stealing has happened. In this work, we propose a memory- and computation-efficient defense approach, named MeCo, that prevents DFME from happening while simultaneously maintaining model utility, via distributionally robust defensive training on the target victim model. Specifically, we randomize the input so that it (1) causes a mismatch in the knowledge distillation loss for attackers, (2) disturbs zeroth-order gradient estimation, and (3) changes the label predictions for attack query data. As a result, the attacker can only extract misleading information from the black-box model. Extensive experiments on defending against both decision-based and score-based DFME demonstrate that MeCo significantly reduces the effectiveness of existing DFME methods and substantially improves running efficiency.
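
A minimal sketch of the input-randomization idea (our reading, not the MeCo implementation): serving predictions through a random perturbation makes an extractor's distillation loss and zeroth-order gradient estimates inconsistent across queries.

    import torch

    class RandomizedDefense(torch.nn.Module):
        def __init__(self, victim, noise_std=0.05):
            super().__init__()
            self.victim, self.noise_std = victim, noise_std

        @torch.no_grad()
        def forward(self, x):
            x_r = x + self.noise_std * torch.randn_like(x)  # randomize each query
            return self.victim(x_r)                         # misleading yet near-utility output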

AAAI 2023 · Conference Paper

Progressive Multi-View Human Mesh Recovery with Self-Supervision

  • Xuan Gong
  • Liangchen Song
  • Meng Zheng
  • Benjamin Planche
  • Terrence Chen
  • Junsong Yuan
  • David Doermann
  • Ziyan Wu

To date, little attention has been given to multi-view 3D human mesh estimation, despite its real-life applicability (e.g., motion capture, sport analysis) and robustness to single-view ambiguities. Existing solutions typically suffer from poor generalization to new settings, largely due to the limited diversity of image/3D-mesh pairs in multi-view training data. To address this shortcoming, prior work has explored the use of synthetic images. However, besides the usual visual gap between rendered and target data, synthetic-data-driven multi-view estimators also suffer from overfitting to the camera viewpoint distribution sampled during training, which usually differs from real-world distributions. Tackling both challenges, we propose a novel simulation-based training pipeline for multi-view human mesh recovery, which (a) relies on intermediate 2D representations that are more robust to the synthetic-to-real domain gap; (b) leverages learnable calibration and triangulation to adapt to more diversified camera setups; and (c) progressively aggregates multi-view information in a canonical 3D space to remove ambiguities in 2D representations. Through extensive benchmarking, we demonstrate the superiority of the proposed solution, especially for unseen in-the-wild scenarios.
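
A hypothetical outline of the three stages labeled (a)-(c) in the abstract; every helper below is an illustrative stand-in for the corresponding component.

    def recover_mesh(views, pose2d_net, calib_net, triangulate, mesh_head):
        kp2d = [pose2d_net(v) for v in views]   # (a) robust intermediate 2D representations
        cams = calib_net(kp2d)                  # (b) learnable camera calibration
        kp3d = triangulate(kp2d, cams)          # (b) lift keypoints into a canonical 3D space
        return mesh_head(kp3d)                  # (c) progressive multi-view aggregation to a mesh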

AAAI 2022 · Conference Paper

Preserving Privacy in Federated Learning with Ensemble Cross-Domain Knowledge Distillation

  • Xuan Gong
  • Abhishek Sharma
  • Srikrishna Karanam
  • Ziyan Wu
  • Terrence Chen
  • David Doermann
  • Arun Innanje

Federated Learning (FL) is a machine learning paradigm where local nodes collaboratively train a central model while the training data remains decentralized. Existing FL methods typically share model parameters or employ co-distillation to address the issue of unbalanced data distribution. However, they suffer from communication bottlenecks. More importantly, they risk privacy leakage. In this work, we develop a privacy-preserving and communication-efficient method within an FL framework, using one-shot offline knowledge distillation with unlabeled, cross-domain public data. We propose a quantized and noisy ensemble of local predictions from fully trained local models for stronger privacy guarantees without sacrificing accuracy. Based on extensive experiments on image classification and text classification tasks, we show that our privacy-preserving method outperforms baseline FL algorithms in both accuracy and communication efficiency.
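
A sketch under stated assumptions of the quantized, noisy ensemble described above: local soft predictions on public data are quantized and perturbed before averaging, so the central model distills from privacy-hardened labels. The quantization levels and noise scale below are illustrative, not the paper's settings.

    import torch

    def private_ensemble_labels(local_models, public_x, levels=8, noise_std=0.1):
        with torch.no_grad():
            preds = torch.stack([m(public_x).softmax(-1) for m in local_models])
            preds = torch.round(preds * levels) / levels           # quantize each local prediction
            preds = preds + noise_std * torch.randn_like(preds)    # add noise for privacy
            return preds.mean(0).clamp_min(0)                      # ensemble soft labels for distillation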

IJCAI 2021 · Conference Paper

Uncertainty-aware Binary Neural Networks

  • Junhe Zhao
  • Linlin Yang
  • Baochang Zhang
  • Guodong Guo
  • David Doermann

Binary Neural Networks (BNNs) are promising machine learning solutions for deployment on resource-limited devices. Recent approaches to training BNNs have produced impressive results, but minimizing the drop in accuracy relative to full-precision networks remains challenging. One reason is that conventional BNNs ignore the uncertainty caused by weights that are near zero, resulting in instability or frequent sign flips during learning. In this work, we investigate the intrinsic uncertainty of near-zero weights, which makes training vulnerable to instability. We introduce an uncertainty-aware BNN (UaBNN) that leverages a new mapping function, called certainty-sign (c-sign), to reduce these weights' uncertainties. Our c-sign function is the first to train BNNs with decreasing uncertainty during binarization, leading to a controlled learning process. We also introduce a simple but effective method to measure the uncertainty, based on a Gaussian function. Extensive experiments demonstrate that our method improves multiple BNN methods by maintaining training stability and achieves higher performance than prior art.
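
One plausible reading of the Gaussian uncertainty measure, sketched below (not the paper's exact c-sign): uncertainty peaks for near-zero weights, and penalizing it drives weights away from zero so sign flips become rarer.

    import torch

    def gaussian_uncertainty(w, sigma=0.1):
        # Highest for weights near zero, which flip sign most easily.
        return torch.exp(-w.pow(2) / (2 * sigma ** 2))

    def binarize_ste(w):
        # Standard straight-through sign: forward is sign(w), backward is identity.
        return torch.sign(w).detach() + w - w.detach()

    def uncertainty_penalty(w, sigma=0.1):
        # Auxiliary loss term: reducing uncertainty stabilizes binarization.
        return gaussian_uncertainty(w, sigma).mean()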

AAAI 2020 · Conference Paper

Binarized Neural Architecture Search

  • Hanlin Chen
  • Li'an Zhuo
  • Baochang Zhang
  • Xiawu Zheng
  • Jianzhuang Liu
  • David Doermann
  • Rongrong Ji

Neural architecture search (NAS) can have a significant impact in computer vision by automatically designing optimal neural network architectures for various tasks. A variant, binarized neural architecture search (BNAS), with a search space of binarized convolutions, can produce extremely compressed models. Unfortunately, this area remains largely unexplored. BNAS is more challenging than NAS due to the learning inefficiency caused by the optimization requirements and the huge architecture space. To address these issues, we introduce channel sampling and operation space reduction into a differentiable NAS to significantly reduce the cost of searching. This is accomplished through a performance-based strategy that abandons less promising operations. Two optimization methods for binarized neural networks are used to validate the effectiveness of our BNAS. Extensive experiments demonstrate that the proposed BNAS achieves performance comparable to NAS on both the CIFAR and ImageNet databases. An accuracy of 96.53% vs. 97.22% is achieved on the CIFAR-10 dataset, but with a significantly compressed model, and the search is 40% faster than the state-of-the-art PC-DARTS.
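
A minimal sketch of the performance-based operation-space reduction (names illustrative): periodically rank each edge's candidate operations by their architecture weights and abandon the weakest, shrinking the search cost.

    def reduce_operation_space(edges, keep_k):
        # edges: differentiable-NAS edges, each holding candidate ops and
        # architecture weights alpha (both names are our own stand-ins).
        for edge in edges:
            ranked = sorted(edge.ops, key=lambda op: edge.alpha[op], reverse=True)
            edge.ops = ranked[:keep_k]  # abandon less promising operations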

IJCAI 2020 · Conference Paper

CP-NAS: Child-Parent Neural Architecture Search for 1-bit CNNs

  • Li'an Zhuo
  • Baochang Zhang
  • Hanlin Chen
  • Linlin Yang
  • Chen Chen
  • Yanjun Zhu
  • David Doermann

Neural architecture search (NAS) proves to be among the best approaches for many tasks by generating application-adaptive neural architectures, but it is still challenged by high computational cost and memory consumption. At the same time, 1-bit convolutional neural networks (CNNs) with binarized weights and activations show their potential for resource-limited embedded devices. One natural approach is to use 1-bit CNNs to reduce the computation and memory cost of NAS, taking advantage of the strengths of each in a unified framework. To this end, a Child-Parent model is introduced into a differentiable NAS to search for the binarized architecture (Child) under the supervision of a full-precision model (Parent). In the search stage, the Child-Parent model uses an indicator generated from the Parent and Child model accuracies to evaluate performance and abandon operations with less potential. In the training stage, a kernel-level CP loss is introduced to optimize the binarized network. Extensive experiments demonstrate that the proposed CP-NAS achieves accuracy comparable to traditional NAS on both the CIFAR and ImageNet databases. It achieves an accuracy of 95.27% on CIFAR-10 and 64.3% on ImageNet with binarized weights and activations, with a 30% faster search than prior art.
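
A hypothetical form of the Child-Parent indicator described above: score each operation by how much of the full-precision Parent's accuracy the binarized Child retains, then keep only the top scorers. Both functions are illustrative guesses, not the paper's formula.

    def cp_indicator(acc_child, acc_parent, eps=1e-8):
        # High when binarization loses little accuracy relative to the Parent.
        return acc_child / (acc_parent + eps)

    def abandon_ops(ops, scores, keep_k):
        # Keep the keep_k operations with the highest indicator scores.
        return sorted(ops, key=lambda o: scores[o], reverse=True)[:keep_k]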

AAAI 2019 · Conference Paper

Calibrated Stochastic Gradient Descent for Convolutional Neural Networks

  • Li’an Zhuo
  • Baochang Zhang
  • Chen Chen
  • Qixiang Ye
  • Jianzhuang Liu
  • David Doermann

In stochastic gradient descent (SGD) and its variants, the optimized gradient estimators may be as expensive to compute as the true gradient in many scenarios. This paper introduces a calibrated stochastic gradient descent (CSGD) algorithm for deep neural network optimization. A theorem is developed to prove that an unbiased estimator for the network variables can be obtained in a probabilistic way based on the Lipschitz hypothesis. Our work is significantly distinct from existing gradient optimization methods by providing a theoretical framework for unbiased variable estimation in the deep learning paradigm to optimize the model parameter calculation. In particular, we develop a generic gradient calibration layer which can be easily used to build convolutional neural networks (CNNs). Experimental results demonstrate that CNNs with our CSGD optimization scheme can improve the state-of-the-art performance for natural image classification, digit recognition, ImageNet object classification, and object detection tasks. This work opens new research directions for developing more efficient SGD updates and analyzing the backpropagation algorithm.
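
An illustrative stand-in for a "gradient calibration layer": an identity in the forward pass whose backward pass rescales gradients. The real CSGD derives its calibration from the Lipschitz-based unbiased-estimation argument; the fixed scale here is only a placeholder.

    import torch

    class GradCalibrate(torch.autograd.Function):
        @staticmethod
        def forward(ctx, x, scale):
            ctx.scale = scale
            return x                           # identity in the forward pass

        @staticmethod
        def backward(ctx, grad_out):
            return ctx.scale * grad_out, None  # calibrated gradient; none for scale

    # usage: y = GradCalibrate.apply(features, 0.5)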

AAAI 2019 · Conference Paper

Projection Convolutional Neural Networks for 1-bit CNNs via Discrete Back Propagation

  • Jiaxin Gu
  • Ce Li
  • Baochang Zhang
  • Jungong Han
  • Xianbin Cao
  • Jianzhuang Liu
  • David Doermann

The advancement of deep convolutional neural networks (DCNNs) has driven significant improvement in the accuracy of recognition systems for many computer vision tasks. However, their practical applications are often restricted in resource-constrained environments. In this paper, we introduce projection convolutional neural networks (PCNNs) with discrete back propagation via projection (DBPP) to improve the performance of binarized neural networks (BNNs). The contributions of our paper include: 1) for the first time, the projection function is exploited to efficiently solve the discrete back propagation problem, which leads to new highly compressed CNNs (termed PCNNs); 2) by exploiting multiple projections, we learn a set of diverse quantized kernels that compress the full-precision kernels more efficiently than previously proposed methods; 3) PCNNs achieve the best classification performance compared to other state-of-the-art BNNs on the ImageNet and CIFAR datasets.
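
A minimal sketch of the projection primitive that discrete back propagation builds on (illustrative, not the paper's multi-projection formulation): project full-precision kernels onto the nearest discrete level in the forward pass and pass gradients straight through.

    import torch

    def project_kernel(w, levels=(-1.0, 1.0)):
        q = torch.tensor(levels, device=w.device)
        idx = (w.unsqueeze(-1) - q).abs().argmin(-1)  # nearest discrete level per weight
        w_q = q[idx]
        # Straight-through estimator: forward uses projected kernels,
        # backward passes gradients to the full-precision weights.
        return w_q.detach() + w - w.detach()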