AAAI 2026 Conference Paper
Beyond Counting: Evaluating Abstract and Emotional Reasoning in Vision-Language Models
- Yuan Zhou
- Yan Zhang
- Jianlong Chang
- Xin Gu
- Ying Wang
- Kun Ding
- Guangwen Yang
- Shiming Xiang
Despite the rapid progress of Vision-Language Models (VLMs), existing benchmarks still concentrate on coarse-grained object recognition or simple relational reasoning, leaving the fine-grained and higher-order reasoning abilities of these systems largely unexamined. To bridge this evaluation gap, we introduce EmojiGrid, a diagnostic benchmark specifically designed to probe these fine-grained and higher-order skills. Leveraging the universal and semantically rich nature of emojis, we synthesize a grid-based visual dataset paired with more than 29,000 question-answer (QA) pairs. Each pair is explicitly anchored in a three-level cognitive taxonomy comprising (i) Perception and Information Extraction, (ii) Relational and Structural Reasoning, and (iii) Abstraction and Advanced Cognition. These levels further decompose into nine categories spanning a broad range of cognitive skills, including counting, spatial relations, compositional logic, semantic sentiment, and related higher-order reasoning tasks. Our extensive evaluation of 25 state-of-the-art open-source and proprietary VLMs reveals a significant performance gap between foundational perceptual tasks and higher-level cognitive abilities, particularly abstraction and advanced emotional reasoning. Notably, all models struggle with compositional logic, spatial consistency, and, above all, emotional and semantic understanding. EmojiGrid thus provides a quantifiable, fine-grained benchmark for diagnosing VLM limitations and guiding future progress toward models that can truly perceive, reason about, and interpret complex, symbol-rich visual scenes.
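To make the dataset description concrete, the sketch below models one EmojiGrid QA record and the three-level taxonomy as plain Python structures. This is a minimal illustration, not the paper's actual schema: all field names, file paths, and the example record are hypothetical, and only the four fine-grained categories explicitly named in the abstract are listed (the remaining categories, nine in total, are not enumerated there).

```python
from dataclasses import dataclass

# Three-level cognitive taxonomy from the abstract. Only the categories the
# abstract names explicitly are listed; the other fine-grained categories
# (nine in total) are left unenumerated here.
TAXONOMY = {
    "Perception and Information Extraction": ["counting"],
    "Relational and Structural Reasoning": ["spatial relations", "compositional logic"],
    "Abstraction and Advanced Cognition": ["semantic sentiment"],
}

@dataclass
class QARecord:
    """One QA pair anchored in the taxonomy (hypothetical field names)."""
    image_path: str  # rendered emoji-grid image
    question: str    # natural-language question about the grid
    answer: str      # ground-truth answer
    level: str       # one of the three taxonomy levels
    category: str    # one of the nine fine-grained categories

# Hypothetical example record; the grid contents and path are invented.
sample = QARecord(
    image_path="grids/000123.png",
    question="How many 😀 faces appear in the top row of the grid?",
    answer="3",
    level="Perception and Information Extraction",
    category="counting",
)

# Each record's category must belong to its declared taxonomy level.
assert sample.category in TAXONOMY[sample.level]
```

Under a schema like this, grouping per-model accuracy by `level` and `category` is what yields the fine-grained, per-skill diagnosis the abstract describes.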