Arrow Research search

Author name cluster

Dan Zhang

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

47 papers
2 author rows

Possible papers

47

AAAI Conference 2026 Conference Paper

SOAR: Semi-Supervised Open-Vocabulary Aerial Object Detection via Dual-Aware Enhanced Prior Denoising

  • Xu Liu
  • Yihong Huang
  • Dan Zhang
  • Lingling Li
  • Long Sun
  • Licheng Jiao

Open-Vocabulary Object Detection (OVOD) shows promise in remote sensing (RS), but the unique characteristics of RS imagery introduce challenges such as the predominance of background regions, sparse labels, limited semantic information, and difficulties in semi-supervised training. To tackle these challenges, we propose the Semi-Supervised Open-Vocabulary Aerial Object Detection with Dual-Perception Prior Denoising (SOAR), which explicitly models the background embeddings of each scene to indirectly construct foreground priors, thereby capitalizing on the abundant background information present in RS imagery. We further introduce a query enhancement module that integrates language and foreground prior information to enhance the effectiveness of query selection and feature augmentation. During the decoding stage of semi-supervised training, we perform denoising and reconstruction of the foreground priors to generate pseudo-labels that support the training process. Additionally, we address the sparsity of label information through expansion and aggregation techniques, further improving model performance. Experimental evaluations reveal that, in the open-vocabulary object detection task on the DIOR dataset, our method achieves a mean Average Precision (mAP) of 68.5% and Harmonic Mean (HM) of 55.9%, outperforming the previous state-of-the-art model's mAP of 61.6% and HM of 53.6%. Our approach offers a novel solution to the open-vocabulary challenge in aerial object detection.

AAAI Conference 2026 Conference Paper

VisionReward: Fine-Grained Multi-Dimensional Human Preference Learning for Image and Video Generation

  • Jiazheng Xu
  • Yu Huang
  • Jiale Cheng
  • Yuanming Yang
  • Jiajun Xu
  • Yuan Wang
  • Wenbo Duan
  • Shen Yang

Visual generative models have achieved remarkable progress in synthesizing photorealistic images and videos, yet aligning their outputs with human preferences across critical dimensions remains a persistent challenge. Though reinforcement learning from human feedback offers promise for preference alignment, existing reward models for visual generation face limitations, including black-box scoring without interpretability and potentially resultant unexpected biases. We present VisionReward, a general framework for learning human visual preferences in both image and video generation. Specifically, we employ a hierarchical visual assessment framework to capture fine-grained human preferences, and leverage linear weighting to enable interpretable preference learning. Furthermore, we propose a multi-dimensional consistent strategy when using VisionReward as a reward model during preference optimization for visual generation. Experiments show that VisionReward can significantly outperform existing image and video reward models on both machine metrics and human evaluation. Notably, VisionReward surpasses VideoScore by 17.2% in preference prediction accuracy, and text-to-video models with VisionReward achieve a 31.6% higher pairwise win rate compared to the same models using VideoScore.

JBHI Journal 2025 Journal Article

$\text{MR}^{2}$-Net: Retinal OCTA Image Stitching via Multi-Scale Representation Learning and Dynamic Location Guidance

  • Haiting Mao
  • Yuhui Ma
  • Dan Zhang
  • Yanda Meng
  • Shaodong Ma
  • Yuchuan Qiao
  • Huazhu Fu
  • Caifeng Shan

Optical coherence tomography angiography (OCTA) plays a crucial role in quantifying and analyzing retinal vascular diseases. However, the limited field of view (FOV) inherent in most commercial OCTA imaging systems poses a significant challenge for clinicians, restricting the ability to analyze larger retinal regions at high resolution. Automatic stitching of OCTA scans in adjacent regions may provide a promising solution to extend the region of interest. However, commonly-used stitching algorithms face difficulties in achieving effective alignment due to noise, artifacts and dense vasculature present in OCTA images. To address these challenges, we propose a novel retinal OCTA image stitching network, named $\text{MR}^{2}$-Net, which integrates multi-scale representation learning and dynamic location guidance. In the first stage, an image registration network with a progressive multi-resolution feature fusion is proposed to derive deep semantic information effectively. Additionally, we introduce a dynamic guidance strategy to locate the foveal avascular zone (FAZ) and constrain registration errors in overlapping vascular regions. In the second stage, an image fusion network based on multiple mask constraints and adjacent image aggregation (AIA) strategies is developed to further eliminate the artifacts in the overlapping areas of stitched images, thereby achieving precise vessel alignment. To validate the effectiveness of our method, we conduct a series of experiments on two delicately constructed datasets, i.e., OPTOVUE-OCTA and SVision-OCTA. Experimental results demonstrate that our method outperforms other image stitching methods and effectively generates high-quality wide-field OCTA images, achieving structural similarity index (SSIM) scores of 0.8264 and 0.8014 on the two datasets, respectively.

ICML Conference 2025 Conference Paper

A Stronger Mixture of Low-Rank Experts for Fine-Tuning Foundation Models

  • Mengyang Sun
  • Yihao Wang
  • Tao Feng 0014
  • Dan Zhang
  • Yifan Zhu 0001
  • Jie Tang 0001

In order to streamline the fine-tuning of foundation models, Low-Rank Adapters (LoRAs) have been substantially adopted across various fields, including instruction tuning and domain adaptation. The underlying concept of LoRA involves decomposing a full-rank matrix into the product of two lower-rank matrices, which reduces storage consumption and accelerates the training process. Furthermore, to address the limited expressive capacity of LoRA, the Mixture-of-Expert (MoE) has been introduced for incorporating multiple LoRA adapters. The integration of LoRA experts leads to a visible improvement across several downstream scenarios. However, the mixture of LoRAs (MoE-LoRA) still exhibits low robustness during tuning and inference. Inspired by the Riemannian Preconditioners which train LoRA as a sub-space projector, we propose a new training strategy for MoE-LoRA, to stabilize and boost its feature learning by gate-rescaled multi-space projections. We provide both a theoretical solution as well as an alternative engineering strategy. Experiments with SGD and AdamW optimizers demonstrate the effectiveness of our methodology. Source code is available at https://github.com/THUDM/MoELoRA_Riemannian.
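The low-rank decomposition the abstract builds on can be illustrated with a small NumPy sketch (the dimensions, rank, scaling, and zero-initialized up-projection are illustrative assumptions, not details taken from this paper):

```python
import numpy as np

rng = np.random.default_rng(0)

d, r = 64, 4                            # hidden size and LoRA rank (toy values)
W = rng.normal(size=(d, d))             # frozen pretrained weight
A = rng.normal(size=(r, d)) * 0.01      # trainable down-projection
B = np.zeros((d, r))                    # trainable up-projection, zero-initialized
alpha = 8.0                             # scaling hyperparameter

def lora_forward(x):
    # Frozen full-rank path plus low-rank update: h = Wx + (alpha / r) * B A x.
    # Only A and B (2*d*r parameters) are trained, not the d*d matrix W.
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.normal(size=d)
# With B initialized to zero, the adapter starts as an exact no-op,
# so fine-tuning begins from the pretrained model's behavior.
assert np.allclose(lora_forward(x), W @ x)
```

For d = 64 and r = 4 this stores 2 * 64 * 4 = 512 trainable parameters instead of 4096, which is the storage saving the abstract refers to.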

NeurIPS Conference 2025 Conference Paper

Can Large Language Models Master Complex Card Games?

  • Wei Wang
  • Fuqing Bie
  • Junzhe Chen
  • Dan Zhang
  • Shiyu Huang
  • Evgeny Kharlamov
  • Jie Tang

Complex games have long been an important benchmark for testing the progress of artificial intelligence algorithms. AlphaGo, AlphaZero, and MuZero have defeated top human players in Go and Chess, garnering widespread societal attention towards artificial intelligence. Concurrently, large language models (LLMs) have exhibited remarkable capabilities across various tasks, raising the question of whether LLMs can achieve similar success in complex games. In this paper, we explore the potential of LLMs in mastering complex card games. We systematically assess the learning capabilities of LLMs across eight diverse card games, evaluating the impact of fine-tuning on high-quality gameplay data, and examining the models' ability to retain general capabilities while mastering these games. Our findings indicate that: (1) LLMs can approach the performance of strong game AIs through supervised fine-tuning on high-quality data, (2) LLMs can achieve a certain level of proficiency in multiple complex card games simultaneously, with performance augmentation for games with similar rules and conflicts for dissimilar ones, and (3) LLMs experience a decline in general capabilities when mastering complex games, but this decline can be mitigated by integrating a certain amount of general instruction data. The evaluation results demonstrate strong learning ability and versatility of LLMs. The code is available at https://github.com/THUDM/LLM4CardGame.

AAAI Conference 2025 Conference Paper

DivGCL: A Graph Contrastive Learning Model for Diverse Recommendation

  • Wenwen Gong
  • Yangliao Geng
  • Dan Zhang
  • Yifan Zhu
  • Xiaolong Xu
  • Haolong Xiang
  • Amin Beheshti
  • Xuyun Zhang

Graph Contrastive Learning (GCL), as a primary paradigm of graph self-supervised learning, spurs a fruitful line of research in tackling the data sparsity issue by maximizing the consistency of user/item embeddings between different augmented views with random perturbations. However, diversity, as a crucial metric for recommendation performance and user satisfaction, has received rather little attention. In fact, there exists a challenging dilemma in balancing accuracy and diversity. To address these issues, we propose DivGCL, a new graph contrastive learning model for diverse recommendation. Inspired by the excellence of the determinantal point process (DPP), DivGCL adopts a DPP likelihood-based loss function to achieve an ideal trade-off between diversity and accuracy, optimizing it jointly with the advanced Gaussian noise-augmented GCL objective. Extensive experiments on four popular datasets demonstrate that DivGCL surpasses existing approaches in balancing accuracy and diversity, with an improvement of 23.47% at T@20 (abbreviation for trade-off metric) on ML-1M.

EAAI Journal 2025 Journal Article

Enhanced underwater acoustic target recognition using parallel dual-branch network with attention mechanism

  • Jingpu Xu
  • Xiaowei Li
  • Dan Zhang
  • Yaoran Chen
  • Yan Peng
  • Wenhu Liu

Ship-radiated noise serves as a crucial source of underwater acoustic signals for vessel classification, but its identification is often hindered by environmental variability and internal noise interference. To address these challenges, we propose a dual-branch Residual Attention-Long Short-Term Memory (ResA-LSTM) network for underwater acoustic target recognition. The proposed model integrates a Residual Attention (ResA) branch to extract spatial features and a Bidirectional Long Short-Term Memory (Bi-LSTM) branch to capture long-term temporal dependencies from Mel spectrograms. The ResA module incorporates attention mechanisms and residual connections to enhance feature selection and improve robustness in noisy environments. Evaluations conducted on two public datasets, ShipsEar and DeepShip, demonstrate the effectiveness of our approach, achieving classification accuracies of 98.55% and 99.31%, respectively. Sensitivity analysis further confirms the model's ability to handle long-duration acoustic sequences, highlighting its potential for practical deployment in real-world underwater recognition tasks.

JBHI Journal 2025 Journal Article

Fine-Grained Hierarchical Progressive Modal-Aware Network for Brain Tumor Segmentation

  • Chenggang Lu
  • Jianwei Zhang
  • Dan Zhang
  • Lei Mou
  • Jinli Yuan
  • Kewen Xia
  • Zhitao Guo
  • Jiong Zhang

Brain tumors are highly lethal and debilitating pathological changes that require timely diagnosis and treatment. Magnetic resonance imaging (MRI), a non-invasive diagnostic tool, provides complementary multi-modal information crucial for accurate tumor detection and delineation. However, existing methods struggle to effectively fuse multi-modal information from MRI sequences and often fail to perform modality-specific feature extraction, which hinders accurate tumor segmentation. Furthermore, the inherent challenges posed by the blurred boundaries and complex morphological characteristics of tumor structures present additional substantial obstacles to achieving precise segmentation. To address these issues, we propose FiHam, a fine-grained hierarchical progressive modal-aware network that introduces a novel multi-modal fusion strategy and an advanced feature extraction mechanism. Specifically, FiHam employs a progressive fusion strategy that extracts modality-specific features at lower levels and integrates multi-modal features at higher levels to effectively leverage complementary information from tumor images. Additionally, we design a gated cross-attention modal-fusion module that adaptively selects and integrates dual-modal features using cross-attention mechanisms to enhance modality fusion. To further refine segmentation accuracy, we incorporate a tiny U-Net into the encoder to capture boundary features and complex tumor morphology. Extensive experiments on three large-scale, multi-modal brain tumor datasets demonstrate that FiHam achieves state-of-the-art performance, delivering significant improvements in segmentation accuracy and generalizability across diverse MRI modalities.

UAI Conference 2025 Conference Paper

Generative Uncertainty in Diffusion Models

  • Metod Jazbec
  • Eliot Wong-Toi
  • Guoxuan Xia
  • Dan Zhang
  • Eric T. Nalisnick
  • Stephan Mandt

Diffusion models have recently driven significant breakthroughs in generative modeling. While state-of-the-art models produce high-quality samples on average, individual samples can still be low quality. Detecting such samples without human inspection remains a challenging task. To address this, we propose a Bayesian framework for estimating generative uncertainty of synthetic samples. We outline how to make Bayesian inference practical for large, modern generative models and introduce a new semantic likelihood (evaluated in the latent space of a feature extractor) to address the challenges posed by high-dimensional sample spaces. Through our experiments, we demonstrate that the proposed generative uncertainty effectively identifies poor-quality samples and significantly outperforms existing uncertainty-based methods. Notably, our Bayesian framework can be applied post-hoc to any pretrained diffusion or flow matching model (via the Laplace approximation), and we propose simple yet effective techniques to minimize its computational overhead during sampling.

EAAI Journal 2025 Journal Article

Global–local adaptive resampling strategy for enhancing the performance of Physics-informed neural networks

  • Lei Gao
  • Dan Zhang
  • Yaoran Chen
  • Xiaowei Li
  • Chunxin Li
  • Yan Peng

In recent years, the rapid development of artificial intelligence (AI) has brought innovative solutions to the field of engineering. Physics-informed neural networks (PINNs) have provided a new paradigm for solving partial differential equations (PDEs). However, PINNs are extremely sensitive to the number and distribution of collocation points; insufficient or unevenly distributed collocation points can lead to solution failure. To address this issue, this study proposes a global–local adaptive resampling strategy (GLAR) that combines Monte Carlo integration with PINNs. During the training process, the collocation points are resampled in both global and local regions according to the Monte Carlo integral values to improve the accuracy of the model. Numerical experiments show that GLAR-PINNs are comparable to existing resampling methods when dealing with linear PDEs (such as the Diffusion and Wave equations). When solving nonlinear PDEs (such as the Burgers, Korteweg–de Vries, and Allen–Cahn equations), the accuracy is improved by 14.65, 38.04, and 7.14 times, respectively. In addition, we applied this strategy to reconstruct the flow field around a two-dimensional triangular cylinder, and the relative error of the reconstructed velocity field was less than 0.001. This significantly promotes the application and development of artificial intelligence in the engineering field.
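The general idea of concentrating collocation points where the PDE residual is large can be sketched generically; the following is a minimal residual-proportional resampling loop with a made-up 1D residual function, not the paper's GLAR algorithm or its Monte Carlo integral criterion:

```python
import numpy as np

rng = np.random.default_rng(1)

def residual(x):
    # Stand-in for the PDE residual magnitude |N[u](x)|; in a real PINN this
    # would come from automatic differentiation of the network output.
    return np.abs(np.sin(8 * x))

# Start from uniform collocation points on [0, 1].
pts = rng.uniform(0.0, 1.0, size=1000)

for _ in range(5):
    r = residual(pts)
    p = r / r.sum()                          # residual-proportional density
    # "Local" step: redraw half the points near current high-residual points.
    idx = rng.choice(len(pts), size=500, p=p, replace=True)
    local_new = pts[idx] + rng.normal(scale=0.01, size=500)
    # "Global" step: refresh the other half uniformly to keep domain coverage.
    fresh = rng.uniform(0.0, 1.0, size=500)
    pts = np.clip(np.concatenate([local_new, fresh]), 0.0, 1.0)

# The resampled set should carry a higher mean residual than a uniform draw,
# i.e. training effort is concentrated where the PDE is solved worst.
baseline = rng.uniform(0.0, 1.0, size=2000)
assert residual(pts).mean() > residual(baseline).mean()
```

In an actual PINN training loop, this resampling would be interleaved with gradient steps on the physics loss evaluated at `pts`.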

ICLR Conference 2025 Conference Paper

SPaR: Self-Play with Tree-Search Refinement to Improve Instruction-Following in Large Language Models

  • Jiale Cheng
  • Xiao Liu 0036
  • Cunxiang Wang
  • Xiaotao Gu
  • Yida Lu
  • Dan Zhang
  • Yuxiao Dong
  • Jie Tang 0001

Instruction-following is a fundamental capability of language models, requiring the model to recognize even the most subtle requirements in the instructions and accurately reflect them in its output. Such an ability is well-suited for and often optimized by preference learning. However, existing methods often directly sample multiple independent responses from the model when creating preference pairs. Such practice can introduce content variations irrelevant to whether the instruction is precisely followed (e.g., different expressions about the same semantic), interfering with the goal of teaching models to recognize the key differences that lead to improved instruction following. In light of this, we introduce SPaR, a self-play framework integrating tree-search self-refinement to yield valid and comparable preference pairs free from distractions. By playing against itself, an LLM employs a tree-search strategy to refine its previous responses with respect to the instruction while minimizing unnecessary variations. Our experiments show that a LLaMA3-8B model, trained over three iterations guided by SPaR, surpasses GPT-4-Turbo on the IFEval benchmark without losing general capabilities. Furthermore, SPaR demonstrates promising scalability, greatly enhancing models like GLM-4-9B and LLaMA3-70B. We also identify how inference scaling in tree search would impact model performance. Our code and data are publicly available at https://github.com/thu-coai/SPaR.

TMLR Journal 2025 Journal Article

Temporal Test-Time Adaptation with State-Space Models

  • Mona Schirmer
  • Dan Zhang
  • Eric Nalisnick

Distribution shifts between training and test data are inevitable over the lifecycle of a deployed model, leading to performance decay. Adapting a model on test samples can help mitigate this drop in performance. However, most test-time adaptation methods have focused on synthetic corruption shifts, leaving a variety of distribution shifts underexplored. In this paper, we focus on distribution shifts that evolve gradually over time, which are common in the wild but challenging for existing methods, as we show. To address this, we propose STAD, a Bayesian filtering method that adapts a deployed model to temporal distribution shifts by learning the time-varying dynamics in the last set of hidden features. Without requiring labels, our model infers time-evolving class prototypes that act as a dynamic classification head. Through experiments on real-world temporal distribution shifts, we show that our method excels in handling small batch sizes and label shift.

ICML Conference 2025 Conference Paper

ZeroFlow: Overcoming Catastrophic Forgetting is Easier than You Think

  • Tao Feng 0014
  • Wei Li
  • Didi Zhu
  • Hangjie Yuan
  • Wendi Zheng
  • Dan Zhang
  • Jie Tang 0001

Backpropagation provides a generalized configuration for overcoming catastrophic forgetting. Optimizers such as SGD and Adam are commonly used for weight updates in continual learning and continual pre-training. However, access to gradient information is not always feasible in practice due to black-box APIs, hardware constraints, or non-differentiable systems, a challenge we refer to as the gradient bans. To bridge this gap, we introduce ZeroFlow, the first benchmark designed to evaluate gradient-free optimization algorithms for overcoming forgetting. ZeroFlow examines a suite of forward pass-based methods across various algorithms, forgetting scenarios, and datasets. Our results show that forward passes alone can be sufficient to mitigate forgetting. We uncover novel optimization principles that highlight the potential of forward pass-based methods in mitigating forgetting, managing task conflicts, and reducing memory demands. Additionally, we propose new enhancements that further improve forgetting resistance using only forward passes. This work provides essential tools and insights to advance the development of forward-pass-based methods for continual learning.

YNIMG Journal 2024 Journal Article

Contrastive learning of shared spatiotemporal EEG representations across individuals for naturalistic neuroscience

  • Xinke Shen
  • Lingyi Tao
  • Xuyang Chen
  • Sen Song
  • Quanying Liu
  • Dan Zhang

Neural representations induced by naturalistic stimuli offer insights into how humans respond to stimuli in daily life. Understanding neural mechanisms underlying naturalistic stimuli processing hinges on the precise identification and extraction of the shared neural patterns that are consistently present across individuals. Targeting the Electroencephalogram (EEG) technique, known for its rich spatial and temporal information, this study presents a framework for Contrastive Learning of Shared SpatioTemporal EEG Representations across individuals (CL-SSTER). CL-SSTER utilizes contrastive learning to maximize the similarity of EEG representations across individuals for identical stimuli, contrasting with those for varied stimuli. The network employs spatial and temporal convolutions to simultaneously learn the spatial and temporal patterns inherent in EEG. The versatility of CL-SSTER was demonstrated on three EEG datasets, including a synthetic dataset, a natural speech comprehension EEG dataset, and an emotional video watching EEG dataset. CL-SSTER attained the highest inter-subject correlation (ISC) values compared to the state-of-the-art ISC methods. The latent representations generated by CL-SSTER exhibited reliable spatiotemporal EEG patterns, which can be explained by properties of the naturalistic stimuli. CL-SSTER serves as an interpretable and scalable framework for the identification of inter-subject shared neural representations in naturalistic neuroscience.

ICML Conference 2024 Conference Paper

Directly Denoising Diffusion Models

  • Dan Zhang
  • Jingjing Wang
  • Feng Luo

In this paper, we present Directly Denoising Diffusion Models (DDDMs): a simple and generic approach for generating realistic images with few-step sampling, while multistep sampling is still preserved for better performance. DDDMs require neither delicately designed samplers nor distillation from pre-trained models. DDDMs train the diffusion model conditioned on an estimated target that was generated from previous training iterations of its own. To generate images, samples generated from the previous timestep are also taken into consideration, guiding the generation process iteratively. We further propose Pseudo-LPIPS, a novel metric loss that is more robust to various hyperparameter values. Despite its simplicity, the proposed approach can achieve strong performance on benchmark datasets. Our model achieves FID scores of 2.57 and 2.33 on CIFAR-10 in one-step and two-step sampling respectively, surpassing those obtained from GANs and distillation-based models. By extending the sampling to 1000 steps, we further reduce the FID score to 1.79, aligning with state-of-the-art methods in the literature. For ImageNet 64x64, our approach stands as a competitive contender against leading models.

UAI Conference 2024 Conference Paper

Early-Exit Neural Networks with Nested Prediction Sets

  • Metod Jazbec
  • Patrick Forré
  • Stephan Mandt
  • Dan Zhang
  • Eric T. Nalisnick

Early-exit neural networks (EENNs) facilitate adaptive inference by producing predictions at multiple stages of the forward pass. In safety-critical applications, these predictions are only meaningful when complemented with reliable uncertainty estimates. Yet, due to their sequential structure, an EENN’s uncertainty estimates should also be *consistent*: labels that are deemed improbable at one exit should not reappear within the confidence interval / set of later exits. We show that standard uncertainty quantification techniques, like Bayesian methods or conformal prediction, can lead to inconsistency across exits. We address this problem by applying anytime-valid confidence sequences (AVCSs) to the exits of EENNs. By design, AVCSs maintain consistency across exits. We examine the theoretical and practical challenges of applying AVCSs to EENNs and empirically validate our approach on both regression and classification tasks.
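The consistency requirement described above can be illustrated with a toy running-intersection construction (a generic sketch of nestedness, not the anytime-valid confidence sequence machinery the paper actually applies; the probabilities and threshold below are invented):

```python
import numpy as np

def prediction_set(probs, threshold=0.1):
    # Labels whose predicted probability exceeds the threshold at this exit.
    return {i for i, p in enumerate(probs) if p >= threshold}

# Softmax outputs of three early exits for one input (toy numbers).
exits = [
    np.array([0.40, 0.30, 0.20, 0.10]),
    np.array([0.55, 0.05, 0.30, 0.10]),   # label 1 dips below the threshold
    np.array([0.60, 0.25, 0.10, 0.05]),   # ...then rises above it again
]

raw_sets = [prediction_set(p) for p in exits]
# Raw per-exit sets need not be nested: label 1 is excluded at exit 2
# but reappears at exit 3 -- exactly the inconsistency the paper targets.
assert 1 not in raw_sets[1] and 1 in raw_sets[2]

# Enforce consistency with a running intersection: once a label has been
# ruled out at some exit, it can never re-enter at a later exit.
consistent, current = [], set(range(4))
for s in raw_sets:
    current &= s
    consistent.append(set(current))

assert all(consistent[i] >= consistent[i + 1] for i in range(len(consistent) - 1))
```

Naively intersecting arbitrary per-exit sets can destroy coverage guarantees; AVCSs are constructed so that validity survives exactly this kind of sequential intersection, which is what motivates their use here.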

IJCAI Conference 2024 Conference Paper

FactCHD: Benchmarking Fact-Conflicting Hallucination Detection

  • Xiang Chen
  • Duanzheng Song
  • Honghao Gui
  • Chenxi Wang
  • Ningyu Zhang
  • Yong Jiang
  • Fei Huang
  • Chengfei Lyu

Despite their impressive generative capabilities, LLMs are hindered by fact-conflicting hallucinations in real-world applications. The accurate identification of hallucinations in texts generated by LLMs, especially in complex inferential scenarios, is a relatively unexplored area. To address this gap, we present FactCHD, a dedicated benchmark designed for the detection of fact-conflicting hallucinations from LLMs. FactCHD features a diverse dataset that spans various factuality patterns, including vanilla, multi-hop, comparison, and set operation. A distinctive element of FactCHD is its integration of fact-based evidence chains, significantly enhancing the depth of evaluating the detectors' explanations. Experiments on different LLMs expose the shortcomings of current approaches in detecting factual errors accurately. Furthermore, we introduce TRUTH-TRIANGULATOR which synthesizes reflective considerations by tool-enhanced ChatGPT and LoRA-tuning based on Llama2, aiming to yield more credible detection through the amalgamation of predictive results and evidence.

NeurIPS Conference 2024 Conference Paper

Fast yet Safe: Early-Exiting with Risk Control

  • Metod Jazbec
  • Alexander Timans
  • Tin H. Veljković
  • Kaspar Sakmann
  • Dan Zhang
  • Christian A. Naesseth
  • Eric Nalisnick

Scaling machine learning models significantly improves their performance. However, such gains come at the cost of inference being slow and resource-intensive. Early-exit neural networks (EENNs) offer a promising solution: they accelerate inference by allowing intermediate layers to exit and produce a prediction early. Yet a fundamental issue with EENNs is how to determine when to exit without severely degrading performance. In other words, when is it 'safe' for an EENN to go 'fast'? To address this issue, we investigate how to adapt frameworks of risk control to EENNs. Risk control offers a distribution-free, post-hoc solution that tunes the EENN's exiting mechanism so that exits only occur when the output is of sufficient quality. We empirically validate our insights on a range of vision and language tasks, demonstrating that risk control can produce substantial computational savings, all the while preserving user-specified performance goals.

ECAI Conference 2024 Conference Paper

GLIMMER: Incorporating Graph and Lexical Features in Unsupervised Multi-Document Summarization

  • Ran Liu 0011
  • Ming Liu 0003
  • Min Yu 0001
  • Jianguo Jiang
  • Gang Li 0009
  • Dan Zhang
  • Jingyuan Li 0002
  • Xiang Meng

Pre-trained language models are increasingly being used in multi-document summarization tasks. However, these models need large-scale corpora for pre-training and are domain-dependent. Other non-neural unsupervised summarization approaches mostly rely on key sentence extraction, which can lead to information loss. To address these challenges, we propose a lightweight yet effective unsupervised approach called GLIMMER: a Graph and LexIcal features based unsupervised Multi-docuMEnt summaRization approach. It first constructs a sentence graph from the source documents, then automatically identifies semantic clusters by mining low-level features from raw texts, thereby improving intra-cluster correlation and the fluency of generated sentences. Finally, it summarizes clusters into natural sentences. Experiments conducted on Multi-News, Multi-XScience and DUC-2004 demonstrate that our approach outperforms existing unsupervised approaches. Furthermore, it surpasses state-of-the-art pre-trained multi-document summarization models (e.g., PEGASUS and PRIMERA) under zero-shot settings in terms of ROUGE scores. Additionally, human evaluations indicate that summaries generated by GLIMMER achieve high readability and informativeness scores. Our code is available at https://github.com/Oswald1997/GLIMMER.

YNIMG Journal 2024 Journal Article

Harmonizing three-dimensional MRI using pseudo-warping field guided GAN

  • Jiaying Lin
  • Zhuoshuo Li
  • Youbing Zeng
  • Xiaobo Liu
  • Liang Li
  • Neda Jahanshad
  • Xinting Ge
  • Dan Zhang

In pursuit of cultivating automated models for magnetic resonance imaging (MRI) to aid in diagnostics, an escalating demand for extensive, multisite, and heterogeneous brain imaging datasets has emerged. This potentially introduces biased outcomes when directly applied for subsequent analysis. Researchers have endeavored to address this issue by pursuing the harmonization of MRIs. However, most existing image-based harmonization methods for MRI are tailored for 2D slices, which may introduce inter-slice variations when they are combined into a 3D volume. In this study, we aim to resolve inconsistencies between slices by introducing a pseudo-warping field. This field is created randomly and utilized to transform a slice into an artificially warped subsequent slice. The objective of this pseudo-warping field is to ensure that generators can consistently harmonize adjacent slices to another domain, without being affected by the varying content present in different slices. Furthermore, we construct unsupervised spatial and recycle loss to enhance the spatial accuracy and slice-wise consistency across the 3D images. The results demonstrate that our model effectively mitigates inter-slice variations and successfully preserves the anatomical details of the images during the harmonization process. Compared to generative harmonization models that employ 3D operators, our model exhibits greater computational efficiency and flexibility.

IROS Conference 2024 Conference Paper

LiDAR-based HD Map Localization using Semantic Generalized ICP with Road Marking Detection

  • Yansong Gong
  • Xinglian Zhang
  • Jingyi Feng
  • Xiao He
  • Dan Zhang

In GPS-denied scenarios, a robust environmental perception and localization system becomes crucial for autonomous driving. In this paper, a LiDAR-based online localization system is developed, incorporating road marking detection and registration on a high-definition (HD) map. Within our system, a road marking detection approach is proposed with real-time performance, in which an adaptive segmentation technique is first introduced to isolate high-reflectance points correlated with road markings, enhancing real-time efficiency. Then, a spatio-temporal probabilistic local map is formed by aggregating historical LiDAR scans, providing a dense point cloud. Finally, a LiDAR bird's-eye view (LiBEV) image is generated, and an instance segmentation network is applied to accurately label the road markings. For road marking registration, a semantic generalized iterative closest point (SG-ICP) algorithm is designed. Linear road markings are modeled as 1-manifolds embedded in 2D space, mitigating the influence of constraints along the linear direction, addressing the under-constrained problem and achieving lower localization errors on HD maps than ICP. Extensive experiments are conducted in real-world scenarios, demonstrating the effectiveness and robustness of our system.

NeurIPS Conference 2024 Conference Paper

Renovating Names in Open-Vocabulary Segmentation Benchmarks

  • Haiwen Huang
  • Songyou Peng
  • Dan Zhang
  • Andreas Geiger

Names are essential to both human cognition and vision-language models. Open-vocabulary models utilize class names as text prompts to generalize to categories unseen during training. However, the precision of these names is often overlooked in existing datasets. In this paper, we address this underexplored problem by presenting a framework for "renovating" names in open-vocabulary segmentation benchmarks (RENOVATE). Our framework features a renaming model that enhances the quality of names for each visual segment. Through experiments, we demonstrate that our renovated names help train stronger open-vocabulary models with up to 15% relative improvement and significantly enhance training efficiency with improved data quality. We also show that our renovated names improve evaluation by better measuring misclassification and enabling fine-grained model analysis. We provide our code and relabelings for several popular segmentation datasets to the research community on our project page: https://andrehuang.github.io/renovate.

NeurIPS Conference 2024 Conference Paper

ReST-MCTS*: LLM Self-Training via Process Reward Guided Tree Search

  • Dan Zhang
  • Sining Zhoubian
  • Ziniu Hu
  • Yisong Yue
  • Yuxiao Dong
  • Jie Tang

Recent methodologies in LLM self-training mostly rely on the LLM generating responses and filtering those with correct output answers as training data. This approach often yields a low-quality fine-tuning training set (e.g., incorrect plans or intermediate reasoning). In this paper, we develop a reinforced self-training approach, called ReST-MCTS*, based on integrating process reward guidance with tree search MCTS* for collecting higher-quality reasoning traces as well as per-step values to train policy and reward models. ReST-MCTS* circumvents the per-step manual annotation typically used to train process rewards by tree-search-based reinforcement learning: given oracle final correct answers, ReST-MCTS* is able to infer the correct process rewards by estimating the probability that a step helps lead to the correct answer. These inferred rewards serve dual purposes: they act as value targets for further refining the process reward model and also facilitate the selection of high-quality traces for policy model self-training. We first show that the tree-search policy in ReST-MCTS* achieves higher accuracy compared with prior LLM reasoning baselines such as Best-of-N and Tree-of-Thought, within the same search budget. We then show that by using traces searched by this tree-search policy as training data, we can continuously enhance three language models for multiple iterations, and outperform other self-training algorithms such as ReST$^\text{EM}$ and Self-Rewarding LM.
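
The reward-inference idea above can be reduced to a minimal Monte Carlo sketch (our simplification; `rollout` is a hypothetical completion sampler, not the paper's tree search): the value of a partial reasoning step is the estimated probability that continuations from it reach the oracle-verified answer.

```python
import random

def estimate_process_reward(state, rollout, oracle_answer, n=64, seed=0):
    """Monte Carlo stand-in for ReST-MCTS*-style process reward inference:
    sample n completions from the partial trace `state` and score the step
    by the fraction that end at the oracle-verified final answer."""
    rng = random.Random(seed)
    hits = sum(rollout(state, rng) == oracle_answer for _ in range(n))
    return hits / n
```

In the actual method these value estimates come from the MCTS* search itself and serve both as targets for the process reward model and as a filter for self-training traces.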

NeurIPS Conference 2024 Conference Paper

SciInstruct: a Self-Reflective Instruction Annotated Dataset for Training Scientific Language Models

  • Dan Zhang
  • Ziniu Hu
  • Sining Zhoubian
  • Zhengxiao Du
  • Kaiyu Yang
  • Zihan Wang
  • Yisong Yue
  • Yuxiao Dong

Large Language Models (LLMs) have shown promise in assisting scientific discovery. However, such applications are currently limited by LLMs' deficiencies in understanding intricate scientific concepts, deriving symbolic equations, and solving advanced numerical calculations. To bridge these gaps, we introduce SciInstruct, a suite of scientific instructions for training scientific language models capable of college-level scientific reasoning. Central to our approach is a novel self-reflective instruction annotation framework to address the data scarcity challenge in the science domain. This framework leverages existing LLMs to generate step-by-step reasoning for unlabelled scientific questions, followed by a process of self-reflective critic-and-revise. Applying this framework, we curated a diverse and high-quality dataset encompassing physics, chemistry, math, and formal proofs. We analyze the curated SciInstruct from multiple interesting perspectives (e. g. , domain, scale, source, question type, answer length, etc. ). To verify the effectiveness of SciInstruct, we fine-tuned different language models with SciInstruct, i. e. , ChatGLM3 (6B and 32B), Llama3-8B-Instruct, and Mistral-7B: MetaMath, enhancing their scientific and mathematical reasoning capabilities, without sacrificing the language understanding capabilities of the base model. We release all codes and SciInstruct at https: //github. com/THUDM/SciGLM.

NeurIPS Conference 2023 Conference Paper

Controlling Text-to-Image Diffusion by Orthogonal Finetuning

  • Zeju Qiu
  • Weiyang Liu
  • Haiwen Feng
  • Yuxuan Xue
  • Yao Feng
  • Zhen Liu
  • Dan Zhang
  • Adrian Weller

Large text-to-image diffusion models have impressive capabilities in generating photorealistic images from text prompts. How to effectively guide or control these powerful models to perform different downstream tasks becomes an important open problem. To tackle this challenge, we introduce a principled finetuning method -- Orthogonal Finetuning (OFT), for adapting text-to-image diffusion models to downstream tasks. Unlike existing methods, OFT can provably preserve hyperspherical energy which characterizes the pairwise neuron relationship on the unit hypersphere. We find that this property is crucial for preserving the semantic generation ability of text-to-image diffusion models. To improve finetuning stability, we further propose Constrained Orthogonal Finetuning (COFT) which imposes an additional radius constraint to the hypersphere. Specifically, we consider two important finetuning text-to-image tasks: subject-driven generation where the goal is to generate subject-specific images given a few images of a subject and a text prompt, and controllable generation where the goal is to enable the model to take in additional control signals. We empirically show that our OFT framework outperforms existing methods in generation quality and convergence speed.
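
One standard way to realize an orthogonal finetuning transform (a sketch of the general idea, not necessarily the paper's exact parameterization) is the Cayley transform of a skew-symmetric matrix. Because the resulting transform is orthogonal, the Gram matrix of the weight's columns, and hence the pairwise neuron angles underlying hyperspherical energy, is preserved:

```python
import numpy as np

def cayley(S):
    """Map any square matrix to an orthogonal one via the Cayley transform
    of its skew-symmetric part: R = (I + S)^-1 (I - S)."""
    S = (S - S.T) / 2.0                  # enforce skew-symmetry
    I = np.eye(S.shape[0])
    return np.linalg.solve(I + S, I - S)

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 6))          # stand-in for a pretrained weight
R = cayley(0.1 * rng.standard_normal((4, 4)))
W_ft = R @ W                             # orthogonally "finetuned" weight
# R is orthogonal, so W_ft.T @ W_ft equals W.T @ W: pairwise inner
# products (and angles) between the weight's columns survive finetuning.
```

This is why OFT can adapt the model while provably keeping the pairwise neuron relationships, and COFT additionally bounds how far R may drift from the identity.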

ICLR Conference 2023 Conference Paper

GOOD: Exploring geometric cues for detecting objects in an open world

  • Haiwen Huang
  • Andreas Geiger 0001
  • Dan Zhang

We address the task of open-world class-agnostic object detection, i.e., detecting every object in an image by learning from a limited number of base object classes. State-of-the-art RGB-based models suffer from overfitting the training classes and often fail at detecting novel-looking objects. This is because RGB-based models primarily rely on appearance similarity to detect novel objects and are also prone to overfitting short-cut cues such as textures and discriminative parts. To address these shortcomings of RGB-based object detectors, we propose incorporating geometric cues such as depth and normals, predicted by general-purpose monocular estimators. Specifically, we use the geometric cues to train an object proposal network for pseudo-labeling unannotated novel objects in the training set. Our resulting Geometry-guided Open-world Object Detector (GOOD) significantly improves detection recall for novel object categories and already performs well with only a few training classes. Using a single ``person'' class for training on the COCO dataset, GOOD surpasses SOTA methods by 5.0% AR@100, a relative improvement of 24%. The code has been made available at https://github.com/autonomousvision/good.

YNIMG Journal 2023 Journal Article

Leading and following: Noise differently affects semantic and acoustic processing during naturalistic speech comprehension

  • Xinmiao Zhang
  • Jiawei Li
  • Zhuoran Li
  • Bo Hong
  • Tongxiang Diao
  • Xin Ma
  • Guido Nolte
  • Andreas K. Engel

Despite the distortion of speech signals caused by unavoidable noise in daily life, our ability to comprehend speech in noisy environments is relatively stable. However, the neural mechanisms underlying reliable speech-in-noise comprehension remain to be elucidated. The present study investigated the neural tracking of acoustic and semantic speech information during noisy naturalistic speech comprehension. Participants listened to narrative audio recordings mixed with spectrally matched stationary noise at three signal-to-noise ratio (SNR) levels (no noise, 3 dB, -3 dB), and 60-channel electroencephalography (EEG) signals were recorded. A temporal response function (TRF) method was employed to derive event-related-like responses to the continuous speech stream at both the acoustic and the semantic levels. Whereas the amplitude envelope of the naturalistic speech was taken as the acoustic feature, word entropy and word surprisal were extracted via natural language processing methods as two semantic features. Theta-band frontocentral TRF responses to the acoustic feature were observed at around 400 ms following speech fluctuation onset over all three SNR levels, and the response latencies were more delayed with increasing noise. Delta-band frontal TRF responses to the semantic feature of word entropy were observed at around 200 to 600 ms preceding speech fluctuation onset over all three SNR levels. The response latencies became more leading with increasing noise and decreasing speech comprehension and intelligibility. While the following responses to speech acoustics were consistent with previous studies, our study revealed the robustness of leading responses to speech semantics, which suggests a possible predictive mechanism at the semantic level for maintaining reliable speech comprehension in noisy environments.

NeurIPS Conference 2023 Conference Paper

Learning Sample Difficulty from Pre-trained Models for Reliable Prediction

  • Peng Cui
  • Dan Zhang
  • Zhijie Deng
  • Yinpeng Dong
  • Jun Zhu

Large-scale pre-trained models have achieved remarkable success in many applications, but how to leverage them to improve the prediction reliability of downstream models is undesirably under-explored. Moreover, modern neural networks have been found to be poorly calibrated and make overconfident predictions regardless of inherent sample difficulty and data uncertainty. To address this issue, we propose to utilize large-scale pre-trained models to guide downstream model training with sample difficulty-aware entropy regularization. Pre-trained models that have been exposed to large-scale datasets and do not overfit the downstream training classes enable us to measure each training sample's difficulty via feature-space Gaussian modeling and relative Mahalanobis distance computation. Importantly, by adaptively penalizing overconfident prediction based on the sample difficulty, we simultaneously improve accuracy and uncertainty calibration across challenging benchmarks (e.g., +0.55% ACC and −3.7% ECE on ImageNet1k using ResNet34), consistently surpassing competitive baselines for reliable prediction. The improved uncertainty estimate further improves selective classification (abstaining from erroneous predictions) and out-of-distribution detection.
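
The difficulty measure described above can be sketched in a few lines (our toy version; the paper computes it in the pre-trained model's feature space with fitted covariances): fit a class-conditional Gaussian and a class-agnostic background Gaussian, then score a sample by the difference of the two Mahalanobis distances.

```python
import numpy as np

def mahalanobis_sq(x, mean, cov_inv):
    """Squared Mahalanobis distance of x from a Gaussian (mean, cov_inv)."""
    d = np.asarray(x, dtype=float) - mean
    return float(d @ cov_inv @ d)

def relative_mahalanobis(x, cls_mean, cls_cov_inv, bg_mean, bg_cov_inv):
    """Relative Mahalanobis distance: class-conditional distance minus the
    class-agnostic (background) distance. Larger values mark samples the
    class Gaussian explains poorly relative to the overall feature
    distribution, i.e. harder samples that warrant a softer target."""
    return mahalanobis_sq(x, cls_mean, cls_cov_inv) - mahalanobis_sq(x, bg_mean, bg_cov_inv)
```

In the paper this per-sample score then modulates an entropy regularizer, penalizing overconfidence more strongly on hard samples.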

EAAI Journal 2023 Journal Article

Multi-scale split dual calibration network with periodic information for interpretable fault diagnosis of rotating machinery

  • Yongyi Chen
  • Dan Zhang
  • Hongjie Ni
  • Jun Cheng
  • Hamid Reza Karimi

Conventional intelligent fault diagnosis algorithms based on signal processing and pattern recognition have high demands on expert experience and poor generalization performance, and may not perform well in complex industrial fields. Meanwhile, the data acquisition system may suffer from cyber attacks when collecting vibration signals. The vibration signal has a very low signal-to-noise ratio (SNR), which seriously affects the accuracy of fault diagnosis. Aiming at the problem of fault diagnosis under low SNR, a new fault diagnosis framework based on a Multi-scale Split Dual Calibration Network with Periodic Information (PI-MSDCN) is proposed in this paper. In the fault diagnosis framework, a periodic block is constructed to automatically learn the periodic information of vibration signals through the neural network. The learned periodic information and the raw vibration signal are used as the input data of the MSDCN. Specifically, the MSDCN uses convolution kernels of different sizes for different channels of input features to generate multi-scale features, and obtains mixed-domain attention features for the features at different scales respectively. Then, the attention feature is used as a threshold to adaptively remove the redundant information in the multi-scale features. Finally, in order to calibrate the contribution of different scale features to fault diagnosis, the mixed-domain attention coefficients are applied to the corresponding features to obtain richer multi-scale attention features. Experimental studies under different levels of interference demonstrate that the average accuracy of the proposed method is 92.91% (±5.08%), which is superior to other existing results in the literature.

JBHI Journal 2023 Journal Article

Personality in Daily Life: Multi-Situational Physiological Signals Reflect Big-Five Personality Traits

  • Xinyu Shui
  • Yiling Chen
  • Xin Hu
  • Fei Wang
  • Dan Zhang

The popularity of wearable physiological recording devices has opened up new possibilities for the assessment of personality traits in everyday life. Compared with traditional questionnaires or laboratory assessments, wearable device-based measurements can collect rich data about individual physiological activities in real-life situations without interfering with normal life, enabling a more comprehensive description of individual differences. The present study aimed to explore the assessment of individuals' Big-Five personality traits by physiological signals in daily life situations. A commercial bracelet was used to track the heart rate (HR) data from eighty college students (all male) enrolled in a special training program with a strictly-controlled daily schedule for ten consecutive working days. Their HR activities were divided into five daily situations (morning exercise, morning classes, afternoon classes, free time in the evening, and self-study situations) according to their daily schedule. Regression analyses with HR-based features in these five situations averaged across the ten days revealed significant cross-validated quantitative prediction correlations of 0.32 and 0.26 for the dimensions of Openness and Extraversion, with the prediction correlations trending toward significance for Conscientiousness and Neuroticism. Moreover, the multi-situation HR-based results were in general superior to those based on single-situation HR-based features, as well as those based on the multi-situation self-reported emotion ratings. Together, our findings demonstrate the link between personality and daily HR measures using state-of-the-art commercial devices and could shed light on the development of Big-Five personality assessment based on daily multi-situation physiological measures.

NeurIPS Conference 2023 Conference Paper

Towards Anytime Classification in Early-Exit Architectures by Enforcing Conditional Monotonicity

  • Metod Jazbec
  • James Allingham
  • Dan Zhang
  • Eric Nalisnick

Modern predictive models are often deployed to environments in which computational budgets are dynamic. Anytime algorithms are well-suited to such environments as, at any point during computation, they can output a prediction whose quality is a function of computation time. Early-exit neural networks have garnered attention in the context of anytime computation due to their capability to provide intermediate predictions at various stages throughout the network. However, we demonstrate that current early-exit networks are not directly applicable to anytime settings, as the quality of predictions for individual data points is not guaranteed to improve with longer computation. To address this shortcoming, we propose an elegant post-hoc modification, based on the Product-of-Experts, that encourages an early-exit network to become gradually confident. This gives our deep models the property of conditional monotonicity in the prediction quality---an essential building block towards truly anytime predictive modeling using early-exit architectures. Our empirical results on standard image-classification tasks demonstrate that such behaviors can be achieved while preserving competitive accuracy on average.
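
The Product-of-Experts modification can be sketched compactly (our simplification, assuming the per-exit softmax outputs are given): the anytime prediction after exit k is the re-normalized elementwise product of the first k exit distributions, so later exits refine, rather than arbitrarily overturn, the evidence accumulated so far.

```python
import numpy as np

def poe_anytime(exit_probs):
    """Post-hoc Product-of-Experts over early-exit softmax outputs: the
    anytime distribution at budget k is the normalized elementwise product
    of exits 1..k. Returns one probability vector per budget level."""
    combined = np.ones_like(exit_probs[0])
    anytime = []
    for p in exit_probs:
        combined = combined * p
        anytime.append(combined / combined.sum())
    return anytime
```

When the exits agree on a class, the combined confidence in that class can only sharpen with each additional exit, which is the gradually-confident behavior the paper enforces.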

EAAI Journal 2023 Journal Article

Underwater image enhancement via multi-scale fusion and adaptive color-gamma correction in low-light conditions

  • Dan Zhang
  • Zongxin He
  • Xiaohuan Zhang
  • Zhen Wang
  • Wenyi Ge
  • Taian Shi
  • Yi Lin

In dark underwater areas, existing single-model underwater image enhancement methods have poor enhancement effects. We propose an underwater image enhancement method based on color correction and multi-scale fusion (CCMF). Specifically, we first design a color correction method with red channel compensation, which compensates for the red channel according to light attenuation and removes color bias. We propose a contrast enhancement method based on guided filtering to enhance edge texture details. The image is decomposed into a base layer and a detail layer in the logarithmic domain, with layered enhancement. Secondly, we propose an adaptive gamma correction method that dynamically adjusts correction parameters based on the gray image values. This approach prevents over-enhancement and effectively enhances the exposure in dark areas. We extract weight maps that represent different features from the input images and employ a multi-scale pyramid fusion technique to integrate the aforementioned feature information. This approach enables the mutual complementarity of various features and enhances the overall visual effect. Experimental results show that our method can effectively integrate the advantages of different enhancement methods, and the objective indicators of UCIQE, UIQM, and EG are better than other related state-of-the-art methods.
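
The adaptive-gamma idea can be illustrated with a minimal sketch (the mapping below is a common brightness-driven heuristic, not the paper's exact formula): the exponent is derived from the image's mean gray level, so dark frames are brightened (gamma < 1) while bright frames are darkened (gamma > 1).

```python
import numpy as np

def adaptive_gamma(img):
    """Brightness-adaptive gamma correction for an image in [0, 1]:
    choose gamma so that the mean gray value maps to 0.5, i.e.
    gamma = log(0.5) / log(mean). Dark images get gamma < 1
    (brightening); bright images get gamma > 1 (darkening)."""
    img = np.clip(np.asarray(img, dtype=float), 0.0, 1.0)
    mean = float(np.clip(img.mean(), 1e-3, 1.0 - 1e-3))  # guard log(0), log(1)
    gamma = np.log(0.5) / np.log(mean)
    return img ** gamma
```

Tying the exponent to the observed gray level is what prevents over-enhancement: an already well-exposed image yields gamma near 1 and is left almost unchanged.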

IROS Conference 2023 Conference Paper

VIW-Fusion: Extrinsic Calibration and Pose Estimation for Visual-IMU-Wheel Encoder System

  • Chunxiao Qiao
  • Shuying Zhao
  • Yunzhou Zhang
  • Yahui Wang
  • Dan Zhang

The data fusion of camera, IMU, and wheel encoder measurements has proved its effectiveness in localizing ground robots, and obtaining accurate sensor extrinsic parameters is its premise. We propose an extrinsic parameter calibration algorithm and a multi-sensor-based pose estimation algorithm for the camera-IMU-wheel encoder system. First, we propose a joint calibration algorithm for the extrinsic parameters of the camera-IMU-wheel encoder system, which improves the accuracy and robustness of the camera-wheel encoder calibration. We then extend the visual-inertial odometry (VIO) to incorporate the measurements from the wheel encoder and weight the wheel encoder measurements according to angular velocity in global optimization to improve the performance. We further propose a novel method for VIO initialization by integrating wheel encoder information, which significantly reduces the scale error in initialization. We conduct extrinsic parameter calibration experiments on a real self-driving car and validate the performance of our multi-sensor-based localization system on the KAIST dataset and a dataset collected by our self-driving vehicles by performing an exhaustive comparison with state-of-the-art algorithms. Our implementations are open source: https://github.com/chunxiaoqiao/VIW-Fusion.git.

YNIMG Journal 2022 Journal Article

Similar brains blend emotion in similar ways: Neural representations of individual difference in emotion profiles

  • Xin Hu
  • Fei Wang
  • Dan Zhang

Our daily emotional experience is a complex construct that usually involves multiple emotions blended in a context-dependent manner. However, the co-occurring and context-dependent nature of human emotions was understated in previous studies when addressing the individual difference in emotional experiences. The present study proposed a situated and blended 'profile' perspective to characterize individualized emotional experiences. Eighty participants watched a series of emotional videos with their EEG recorded, and the individual differences in their emotion profiles were measured as the vector distances between their multidimensional emotion ratings for these video stimuli. This measure was found to be a reliable descriptor of individualized emotional experiences and could efficiently predict classical emotional complexity indices. More importantly, inter-subject representational analyses revealed that similar emotion profiles were associated with similar delta-band activities over the prefrontal and temporo-parietal regions and similar theta-band activities over the frontal regions. Furthermore, left- and right-lateralized temporo-parietal representations were observed for positive and negative emotion profiles, respectively. Our findings demonstrate the potential of taking a 'profile' perspective for understanding individual differences in human emotions.

JBHI Journal 2022 Journal Article

Sparse-Based Domain Adaptation Network for OCTA Image Super-Resolution Reconstruction

  • Huaying Hao
  • Cong Xu
  • Dan Zhang
  • Qifeng Yan
  • Jiong Zhang
  • Yue Liu
  • Yitian Zhao

Retinal Optical Coherence Tomography Angiography (OCTA) with high resolution is important for the quantification and analysis of retinal vasculature. However, the resolution of OCTA images is inversely proportional to the field of view at the same sampling frequency, which is not conducive to clinicians for analyzing larger vascular areas. In this paper, we propose a novel Sparse-based domain Adaptation Super-Resolution network (SASR) for the reconstruction of realistic 6×6 mm²/low-resolution (LR) OCTA images to high-resolution (HR) representations. To be more specific, we first perform a simple degradation of the 3×3 mm²/high-resolution (HR) image to obtain the synthetic LR image. An efficient registration method is then employed to register the synthetic LR with its corresponding 3×3 mm² image region within the 6×6 mm² image to obtain the cropped realistic LR image. We then propose a multi-level super-resolution model for the fully-supervised reconstruction of the synthetic data, guiding the reconstruction of the realistic LR images through a generative-adversarial strategy that allows the synthetic and realistic LR images to be unified in the feature domain. Finally, a novel sparse edge-aware loss is designed to dynamically optimize the vessel edge structure. Extensive experiments on two OCTA datasets have shown that our method performs better than state-of-the-art super-resolution reconstruction methods. In addition, we have investigated the performance of the reconstruction results on retinal structure segmentations, which further validates the effectiveness of our approach.

EAAI Journal 2021 Journal Article

Local–Global Attentive Adaptation for Object Detection

  • Dan Zhang
  • Jingjing Li
  • Xingpeng Li
  • Zhekai Du
  • Lin Xiong
  • Mao Ye

Adversarial adaptive methods have been proven to be useful for domain transfer in many fields such as image recognition and semantic segmentation, etc. However, for object detection, since each image could have different combinations of objects, brutally aligning all the images without considering their transferability may cause the notorious phenomenon named 'negative transfer'. On the other hand, strongly matching the local-level features makes sense, as it not only reduces the discrepancy between different domain distributions, but also preserves the category-level semantic information. However, it is hard to markedly achieve domain invariance using a simple adversarial adaptive method. In this work, we propose an effective method termed Local–Global Attentive Adaptation for object Detection (LGAAD). Our method can alleviate the negative transfer caused by improper global alignments by leveraging an adaptively and dynamically weighted transferability to highlight the more transferable images. Furthermore, the proposed method also achieves strong matching between the two domains at local-level features to alleviate the cross-domain discrepancy by using the attention mechanism after multiple local discriminators. Additionally, we also consider the domain impacts of instance-wise features and backgrounds in images with large domain divergence, a non-negligible factor for improving the performance of the domain adaptive detection model. Extensive experiments on various domain shift scenarios show that our method exceeds the state-of-the-art results on several public datasets. Furthermore, qualitative visualization and ablation analyses demonstrate the validity of our approach in attending to the regions and instances of interest during domain adaptation.

YNIMG Journal 2021 Journal Article

Non-rhythmic temporal prediction involves phase resets of low-frequency delta oscillations

  • Jonathan Daume
  • Peng Wang
  • Alexander Maye
  • Dan Zhang
  • Andreas K. Engel

The phase of neural oscillatory signals aligns to the predicted onset of upcoming stimulation. Whether such phase alignments represent phase resets of underlying neural oscillations or just rhythmically evoked activity, and whether they can be observed in a rhythm-free visual context, however, remains unclear. Here, we recorded the magnetoencephalogram while participants were engaged in a temporal prediction task, judging the visual or tactile reappearance of a uniformly moving stimulus. The prediction conditions were contrasted with a control condition to dissociate phase adjustments of neural oscillations from stimulus-driven activity. We observed stronger delta band inter-trial phase consistency (ITPC) in a network of sensory, parietal and frontal brain areas, but no power increase reflecting stimulus-driven or prediction-related evoked activity. Delta ITPC further correlated with prediction performance in the cerebellum and visual cortex. Our results provide evidence that phase alignments of low-frequency neural oscillations underlie temporal predictions in a non-rhythmic visual and crossmodal context.

YNIMG Journal 2021 Journal Article

Speech frequency-following response in human auditory cortex is more than a simple tracking

  • Ning Guo
  • Xiaopeng Si
  • Yang Zhang
  • Yue Ding
  • Wenjing Zhou
  • Dan Zhang
  • Bo Hong

The human auditory cortex has recently been found to contribute to the frequency-following response (FFR), and the cortical component has been shown to be more relevant to speech perception. However, it is not clear how the cortical FFR may contribute to the processing of the speech fundamental frequency (F0) and dynamic pitch. Using intracranial EEG recordings, we observed a significant FFR at the fundamental frequency (F0) for both speech and speech-like harmonic complex stimuli in the human auditory cortex, even in the missing-fundamental condition. Both the spectral amplitude and phase coherence of the cortical FFR showed a significant harmonic preference, and attenuated from the primary auditory cortex to the surrounding associative auditory cortex. The phase coherence of the speech FFR was found to be significantly higher than that of the harmonic complex stimuli, especially in the left hemisphere, showing a high timing fidelity of the cortical FFR in tracking the dynamic F0 in speech. Spectrally, the frequency band of the cortical FFR largely overlapped with the range of the human vocal pitch. Taken together, our study parses the intrinsic properties of the cortical FFR and reveals a preference for speech-like sounds, supporting its potential role in processing speech intonation and lexical tones.

JBHI Journal 2020 Journal Article

GP-CNN-DTEL: Global-Part CNN Model With Data-Transformed Ensemble Learning for Skin Lesion Classification

  • Peng Tang
  • Qiaokang Liang
  • Xintong Yan
  • Shao Xiang
  • Dan Zhang

Precise skin lesion classification is still challenging due to two problems, i.e., (1) inter-class similarity and intra-class variation of skin lesion images, and (2) the weak generalization ability of a single Deep Convolutional Neural Network trained with limited data. Therefore, we propose a Global-Part Convolutional Neural Network (GP-CNN) model, which treats the fine-grained local information and global context information with equal importance. The Global-Part model consists of a Global Convolutional Neural Network (G-CNN) and a Part Convolutional Neural Network (P-CNN). Specifically, the G-CNN is trained with downscaled dermoscopy images, and is used to extract the global-scale information of dermoscopy images and produce the Classification Activation Map (CAM). The P-CNN is trained with the CAM-guided cropped image patches and is used to capture local-scale information of skin lesion regions. Additionally, we present a data-transformed ensemble learning strategy, which can further boost the classification performance by integrating the different discriminant information from GP-CNNs that are trained with original images, color constancy transformed images, and feature saliency transformed images, respectively. The proposed method is evaluated on the ISIC 2016 and ISIC 2017 Skin Lesion Challenge (SLC) classification datasets. Experimental results indicate that the proposed method can achieve state-of-the-art skin lesion classification performance (i.e., an AP value of 0.718 on the ISIC 2016 SLC dataset and an average AUC value of 0.926 on the ISIC 2017 SLC dataset) without any external data, compared with other current methods which need to use external data.

NeurIPS Conference 2020 Conference Paper

Understanding Anomaly Detection with Deep Invertible Networks through Hierarchies of Distributions and Features

  • Robin Schirrmeister
  • Yuxuan Zhou
  • Tonio Ball
  • Dan Zhang

Deep generative networks trained via maximum likelihood on a natural image dataset like CIFAR10 often assign high likelihoods to images from datasets with different objects (e.g., SVHN). We refine previous investigations of this failure at anomaly detection for invertible generative networks and provide a clear explanation of it as a combination of model bias and domain prior: convolutional networks learn similar low-level feature distributions when trained on any natural image dataset, and these low-level features dominate the likelihood. Hence, when the discriminative features between inliers and outliers are on a high level, e.g., object shapes, anomaly detection becomes particularly challenging. To remove the negative impact of model bias and domain prior on detecting high-level differences, we propose two methods. First, we use the log-likelihood ratios of two identical models, one trained on the in-distribution data (e.g., CIFAR10) and the other on a more general distribution of images (e.g., 80 Million Tiny Images). We also derive a novel outlier loss for the in-distribution network on samples from the more general distribution to further improve the performance. Second, using a multi-scale model like Glow, we show that low-level features are mainly captured at early scales. Therefore, using only the likelihood contribution of the final scale performs remarkably well for detecting high-level feature differences between the out-of-distribution and the in-distribution data. This method is especially useful if one does not have access to a suitable general distribution. Overall, our methods achieve strong anomaly detection performance in the unsupervised setting, and only slightly underperform state-of-the-art classifier-based methods in the supervised setting. Code can be found at https://github.com/boschresearch/hierarchical_anomaly_detection.
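
The first method reduces to a one-line score once both networks' log-likelihoods are available. A toy sketch, with 1-D Gaussians standing in for the in-distribution and general-distribution generative models (names and parameters are ours):

```python
import math

def log_gauss(x, mu, sigma):
    """Log-density of a 1-D Gaussian, a stand-in for a generative model's
    log-likelihood."""
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)

def anomaly_score(x):
    """Log-likelihood-ratio anomaly score: log p_general(x) - log p_in(x).
    Shared low-level structure cancels in the ratio, so high scores flag
    inputs whose high-level content the in-distribution model explains
    worse than the broader model. Here N(0, 1) plays the in-distribution
    model and N(0, 3) the general-distribution model."""
    return log_gauss(x, 0.0, 3.0) - log_gauss(x, 0.0, 1.0)
```

An input far from the in-distribution mode receives a much higher score than a typical inlier, which is exactly the separation the ratio is designed to expose.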

NeurIPS Conference 2019 Conference Paper

Progressive Augmentation of GANs

  • Dan Zhang
  • Anna Khoreva

Training of Generative Adversarial Networks (GANs) is notoriously fragile, requiring a careful balance to be maintained between the generator and the discriminator in order to perform well. To mitigate this issue we introduce a new regularization technique: progressive augmentation of GANs (PA-GAN). The key idea is to gradually increase the task difficulty of the discriminator by progressively augmenting its input or feature space, thus enabling continuous learning of the generator. We show that the proposed progressive augmentation preserves the original GAN objective, does not compromise the discriminator's optimality, and encourages a healthy competition between the generator and discriminator, leading to a better-performing generator. We experimentally demonstrate the effectiveness of PA-GAN across different architectures and on multiple benchmarks for the image synthesis task, achieving on average a 3-point improvement of the FID score.

JBHI Journal 2019 Journal Article

Weakly Supervised Biomedical Image Segmentation by Reiterative Learning

  • Qiaokang Liang
  • Yang Nan
  • Gianmarc Coppola
  • Kunglin Zou
  • Wei Sun
  • Dan Zhang
  • Yaonan Wang
  • Guanzhen Yu

Recent advances in deep learning have produced encouraging results for biomedical image segmentation; however, outcomes rely heavily on comprehensive annotation. In this paper, we propose a neural network architecture and a new algorithm, known as overlapped region forecast, for the automatic segmentation of gastric cancer images. To the best of our knowledge, this is the first report of deep learning applied to the segmentation of gastric cancer images. Moreover, a reiterative learning framework that achieves superior performance without pretraining or further manual annotation is presented to train a simple network on weakly annotated biomedical images. We customize the loss function to make the model converge faster while avoiding becoming trapped in local minima. Patch boundary errors are eliminated by our overlapped region forecast algorithm. By studying the characteristics of the model trained using two different patch extraction methods, we train iteratively and integrate predictions and weak annotations to improve the quality of the training data. Using these methods, a mean Intersection over Union coefficient of 0.883 and a mean accuracy of 91.09% were achieved on the partially labeled dataset, thereby securing a win in the 2017 China Big Data and Artificial Intelligence Innovation and Entrepreneurship Competition.

YNIMG Journal 2013 Journal Article

Toward a minimally invasive brain–computer interface using a single subdural channel: A visual speller study

  • Dan Zhang
  • Huaying Song
  • Rui Xu
  • Wenjing Zhou
  • Zhipei Ling
  • Bo Hong

Electrocorticography (ECoG) has attracted increasing interest for implementing advanced brain–computer interfaces (BCIs) in the past decade. However, real-life application of ECoG BCI demands mitigation of its invasive nature by minimizing both the size of the involved brain regions and the number of implanted electrodes. In this study, we employed a recently proposed BCI paradigm that utilizes the attentional modulation of visual motion response. With ECoG data collected from five epilepsy patients, power increase in the high gamma (60–140 Hz) frequency range was found to be associated with the overtly attended moving visual stimuli in the parietal-temporal-occipital junction and the occipital cortex. Event-related potentials (ERPs) were elicited as well but with broader cortical distribution. We achieved significantly higher BCI classification accuracy by employing both high gamma and ERP responses from a single ECoG electrode than by using ERP responses only (84.22±5.54% vs. 75.48±4.18%, p < 0.005, paired t-test, 3-trial averaging, binary results of attended vs. unattended). More importantly, the high gamma responses were located within brain regions specialized in visual motion processing as mapped by fMRI, suggesting the spatial location for electrode implantation can be determined prior to surgery using non-invasive imaging. Our findings demonstrate the feasibility of implementing a minimally invasive ECoG BCI.

NeurIPS Conference 2011 Conference Paper

Multiple Instance Learning on Structured Data

  • Dan Zhang
  • Yan Liu
  • Luo Si
  • Jian Zhang
  • Richard Lawrence

Most existing Multiple-Instance Learning (MIL) algorithms assume data instances and/or data bags are independently and identically distributed. But there often exists rich additional dependency/structure information between instances/bags within many applications of MIL, and ignoring this structure information limits the performance of existing MIL algorithms. This paper explores this research problem as multiple instance learning on structured data (MILSD) and formulates a novel framework that considers additional structure information. In particular, an effective and efficient optimization algorithm is proposed to solve the original non-convex optimization problem using a combination of the Concave-Convex Constraint Programming (CCCP) method and an adapted Cutting Plane method, which deals with two sets of constraints caused by learning on instances within individual bags and learning on structured data. Our method has a nice convergence property, with specified precision on each set of constraints. Experimental results on three different applications, i.e., webpage classification, market targeting, and protein fold identification, clearly demonstrate the advantages of the proposed method over state-of-the-art methods.

AAAI Conference 2011 Conference Paper

Transfer Latent Semantic Learning: Microblog Mining with Less Supervision

  • Dan Zhang
  • Yan Liu
  • Richard Lawrence
  • Vijil Chenthamarakshan

The increasing volume of information generated on microblogging sites such as Twitter raises several challenges to traditional text mining techniques. First, most texts from those sites are abbreviated due to the constraint of limited characters per post; second, the input usually comes in large-volume streams. Therefore, it is of significant importance to develop effective and efficient representations of abbreviated texts for better filtering and mining. In this paper, we introduce a novel transfer learning approach, namely transfer latent semantic learning, that utilizes a large number of related tagged documents with rich information from other sources (source domain) to help build a robust latent semantic space for the abbreviated texts (target domain). This is achieved by simultaneously minimizing the document reconstruction error and the classification error of the labeled examples from the source domain by building a classifier with hinge loss in the latent semantic space. We demonstrate the effectiveness of our method by applying it to the task of classifying and tagging abbreviated texts. Experimental results on both synthetic datasets and real application datasets, including Reuters-21578 and Twitter data, suggest substantial improvements using our approach over existing ones.

IJCAI Conference 2009 Conference Paper

M3IC: Maximum Margin Multiple Instance Clustering

  • Dan Zhang
  • Fei Wang
  • Luo Si
  • Tao Li

Clustering, classification, and regression are three major research topics in machine learning. So far, much work has been conducted on solving multiple instance classification and multiple instance regression problems, where supervised training patterns are given as bags and each bag consists of some instances, but research on unsupervised multiple instance clustering is still limited. This paper formulates a novel Maximum Margin Multiple Instance Clustering (M3IC) problem for the multiple instance clustering task. To avoid solving a non-convex optimization problem directly, M3IC is further relaxed, which enables an efficient optimization solution with a combination of the Constrained Concave-Convex Procedure (CCCP) and the Cutting Plane method. Furthermore, this paper analyzes some important properties of the proposed method and its relationship to other related methods. An extensive set of empirical results demonstrates the advantages of the proposed method over existing research in both effectiveness and efficiency.

AAAI Conference 2008 Conference Paper

Multi-View Local Learning

  • Dan Zhang
  • Changshui Zhang

The idea of local learning, i.e., classifying a particular example based on its neighbors, has recently been successfully applied to many semi-supervised and clustering problems. However, the local learning methods developed so far are all devised for single-view problems. In fact, in many real-world applications, examples are represented by multiple sets of features. In this paper, we extend the idea of local learning to multi-view problems, design a multi-view local model for each example, and propose a Multi-View Local Learning Regularization (MVLL-Reg) matrix. Both linear and kernel versions are given. Experiments are conducted to demonstrate the superiority of the proposed method over several state-of-the-art ones.