Arrow Research search

Author name cluster

Dongmei Jiang

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

11 papers
2 author rows

Possible papers (11)

AAAI Conference 2026 · Conference Paper

Bolster Hallucination Detection via Prompt-Guided Data Augmentation

  • Wenyun Li
  • Zheng Zhang
  • Dongmei Jiang
  • Xiangyuan Lan

Large language models (LLMs) have garnered significant interest in the AI community. Despite their impressive generation capabilities, they have been found to produce misleading or fabricated information, a phenomenon known as hallucination. Consequently, hallucination detection has become critical for ensuring the reliability of LLM-generated content. One primary challenge in hallucination detection is the scarcity of well-labeled datasets containing both truthful and hallucinated outputs. To address this issue, we introduce Prompt-guided data Augmented haLlucination dEtection (PALE), a novel framework that leverages prompt-guided responses from LLMs as data augmentation for hallucination detection. This strategy can generate both truthful and hallucinated data under prompt guidance at a relatively low cost. To more effectively evaluate the truthfulness of the sparse intermediate embeddings produced by LLMs, we introduce an estimation metric called the Contrastive Mahalanobis Score (CM Score). This score is based on modeling the distributions of truthful and hallucinated data in the activation space, and it employs a matrix decomposition approach to more accurately capture the underlying structure of these distributions. Importantly, our framework does not require additional human annotations, offering strong generalizability and practicality for real-world applications. Extensive experiments demonstrate that PALE achieves superior hallucination detection performance, outperforming the competitive baseline by a significant margin of 6.55%.
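
The abstract describes the CM Score only at a high level. Below is a minimal numpy sketch of a contrastive Mahalanobis-style score over labeled activation sets, with an SVD-truncated precision matrix standing in for the paper's matrix decomposition; the names and the truncation rule are assumptions, not PALE's actual implementation.

```python
import numpy as np

def fit_class_stats(embs, rank=32):
    """Mean and rank-truncated pseudo-inverse covariance for one class.
    The SVD truncation is a stand-in for the paper's matrix decomposition."""
    mean = embs.mean(axis=0)
    centered = embs - mean
    cov = centered.T @ centered / max(len(embs) - 1, 1)
    u, s, _ = np.linalg.svd(cov, hermitian=True)
    k = min(rank, int((s > 1e-8).sum()))  # keep only well-conditioned directions
    cov_pinv = (u[:, :k] / s[:k]) @ u[:, :k].T
    return mean, cov_pinv

def mahalanobis_sq(x, mean, cov_pinv):
    """Squared Mahalanobis distance of one embedding to a class distribution."""
    d = x - mean
    return float(d @ cov_pinv @ d)

def cm_score(x, truthful_stats, hallucinated_stats):
    """Contrastive score: positive when x sits closer to the hallucinated
    cluster than to the truthful one in activation space."""
    return mahalanobis_sq(x, *truthful_stats) - mahalanobis_sq(x, *hallucinated_stats)
```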

ICLR Conference 2025 · Conference Paper

CatVTON: Concatenation Is All You Need for Virtual Try-On with Diffusion Models

  • Zheng Chong
  • Xiao Dong
  • Haoxiang Li
  • Shiyue Zhang
  • Wenqing Zhang
  • Hanqing Zhao
  • Xujie Zhang
  • Dongmei Jiang

Virtual try-on methods based on diffusion models achieve realistic effects but often require additional encoding modules, a large number of training parameters, and complex preprocessing, which increases the burden of training and inference. In this work, we re-evaluate the necessity of additional modules and analyze how to improve training efficiency and reduce redundant steps in the inference process. Based on these insights, we propose CatVTON, a simple and efficient virtual try-on diffusion model that transfers in-shop or worn garments of arbitrary categories to target individuals by concatenating them along spatial dimensions as inputs to the diffusion model. The efficiency of CatVTON is reflected in three aspects: (1) Lightweight network. CatVTON consists only of a VAE and a simplified denoising UNet, removing redundant image and text encoders as well as cross-attentions, and includes just 899.06M parameters. (2) Parameter-efficient training. Through experimental analysis, we identify self-attention modules as crucial for adapting pre-trained diffusion models to the virtual try-on task, enabling high-quality results with only 49.57M training parameters. (3) Simplified inference. CatVTON eliminates unnecessary preprocessing, such as pose estimation, human parsing, and captioning, requiring only a person image and a garment reference to guide the virtual try-on process, and reduces memory usage by over 49% compared to other diffusion-based methods. Extensive experiments demonstrate that CatVTON achieves superior qualitative and quantitative results compared to baseline methods and shows strong generalization in in-the-wild scenarios, despite being trained solely on public datasets with 73K samples.
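
The core design is unusually easy to state in code: instead of a garment encoder and cross-attention, the two images are joined into one latent canvas. A minimal PyTorch sketch, with illustrative shapes; the choice of the width axis is an assumption:

```python
import torch

# Illustrative VAE-latent shapes: (batch, channels, height, width).
person_latent  = torch.randn(1, 4, 64, 48)   # encoded person image
garment_latent = torch.randn(1, 4, 64, 48)   # encoded garment reference

# The concatenation idea: join the two latents along a spatial dimension
# (width here; which axis CatVTON uses is an assumption) and denoise the
# combined canvas, letting the UNet's existing self-attention exchange
# information between person and garment without extra encoder modules.
unet_input = torch.cat([person_latent, garment_latent], dim=3)
print(unet_input.shape)  # torch.Size([1, 4, 64, 96])
```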

ICLR Conference 2025 · Conference Paper

EMMA: Empowering Multi-modal Mamba with Structural and Hierarchical Alignment

  • Yifei Xing 0001
  • Xiangyuan Lan
  • Ruiping Wang 0001
  • Dongmei Jiang
  • Wenjun Huang
  • Qingfang Zheng
  • Yaowei Wang 0001

Mamba-based architectures have been shown to be a promising new direction for deep learning models, owing to their competitive performance and sub-quadratic deployment speed. However, current Mamba multi-modal large language models (MLLMs) are insufficient at extracting visual features, leading to imbalanced cross-modal alignment between visual and textual latents that negatively impacts performance on multi-modal tasks. In this work, we propose Empowering Multi-modal Mamba with Structural and Hierarchical Alignment (EMMA), which enables the MLLM to extract fine-grained visual information. Specifically, we propose a pixel-wise alignment module to autoregressively optimize the learning and processing of spatial image-level features along with textual tokens, enabling structural alignment at the image level. In addition, to prevent the degradation of visual information during the cross-modal alignment process, we propose a multi-scale feature fusion (MFF) module to combine multi-scale visual features from intermediate layers, enabling hierarchical alignment at the feature level. Extensive experiments are conducted across a variety of multi-modal benchmarks. Our model shows lower latency than other Mamba-based MLLMs and is nearly four times faster than transformer-based MLLMs of similar scale during inference. Thanks to better cross-modal alignment, our model exhibits lower degrees of hallucination and enhanced sensitivity to visual details, which manifests in superior performance across diverse multi-modal benchmarks. Code is available at https://github.com/xingyifei2016/EMMA.
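
As a rough illustration of the hierarchical-alignment idea, the sketch below fuses visual features drawn from several intermediate encoder layers into one representation. The module structure, dimensions, and the simple summation rule are assumptions for illustration, not EMMA's actual MFF design.

```python
import torch
import torch.nn as nn

class MultiScaleFeatureFusion(nn.Module):
    """Hypothetical MFF-style block: fuse visual features taken from several
    intermediate encoder layers so fine-grained detail survives alignment."""
    def __init__(self, dims, out_dim):
        super().__init__()
        # One linear projection per source layer, mapping into a shared space.
        self.projs = nn.ModuleList(nn.Linear(d, out_dim) for d in dims)
        self.norm = nn.LayerNorm(out_dim)

    def forward(self, feats):
        # feats: list of (batch, tokens, dim_i) tensors from intermediate layers.
        fused = sum(p(f) for p, f in zip(self.projs, feats))
        return self.norm(fused)

# Toy usage: three intermediate layers of different widths, 196 visual tokens.
mff = MultiScaleFeatureFusion(dims=[384, 768, 1024], out_dim=768)
feats = [torch.randn(2, 196, d) for d in (384, 768, 1024)]
print(mff(feats).shape)  # torch.Size([2, 196, 768])
```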

ICML Conference 2025 · Conference Paper

Open-Det: An Efficient Learning Framework for Open-Ended Detection

  • Guiping Cao
  • Tao Wang
  • Wenjian Huang 0001
  • Xiangyuan Lan
  • Jianguo Zhang 0001
  • Dongmei Jiang

Open-Ended object Detection (OED) is a novel and challenging task that detects objects and generates their category names in a free-form manner, without requiring additional vocabularies during inference. However, existing OED models, such as GenerateU, require large-scale datasets for training, suffer from slow convergence, and exhibit limited performance. To address these issues, we present a novel and efficient Open-Det framework, consisting of four collaborative parts. Specifically, Open-Det accelerates model training in both the bounding box and object name generation processes by reconstructing the Object Detector and the Object Name Generator. To bridge the semantic gap between the vision and language modalities, we propose a Vision-Language Aligner with V-to-L and L-to-V alignment mechanisms, incorporating the Prompts Distiller to transfer knowledge from the VLM into VL-prompts, enabling accurate object name generation for the LLM. In addition, we design a Masked Alignment Loss to eliminate contradictory supervision and introduce a Joint Loss to enhance classification, resulting in more efficient training. Compared to GenerateU, Open-Det achieves even higher performance (+1.0% in APr) using only 1.5% of the training data (0.077M vs. 5.077M), 20.8% of the training epochs (31 vs. 149), and fewer GPU resources (4 V100 vs. 16 A100). The source code is available at https://github.com/Med-Process/Open-Det.
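
One way to picture a masked alignment loss is a standard vision-to-language contrastive objective in which pairs flagged as contradictory simply drop out of the average. The sketch below illustrates that idea only; the masking rule and loss form used in Open-Det may differ.

```python
import torch
import torch.nn.functional as F

def masked_alignment_loss(vis, txt, valid_mask, temperature=0.07):
    """Hypothetical masked V-to-L alignment: a contrastive loss over paired
    vision/text embeddings (N, D) where rows flagged as contradictory
    (valid_mask == False) contribute nothing to the objective."""
    vis = F.normalize(vis, dim=-1)
    txt = F.normalize(txt, dim=-1)
    logits = vis @ txt.t() / temperature          # (N, N) similarity matrix
    targets = torch.arange(len(vis))              # diagonal pairs are positives
    per_row = F.cross_entropy(logits, targets, reduction="none")
    per_row = per_row * valid_mask.float()        # zero out contradictory rows
    return per_row.sum() / valid_mask.float().sum().clamp(min=1)
```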

ICLR Conference 2025 · Conference Paper

PolaFormer: Polarity-aware Linear Attention for Vision Transformers

  • Weikang Meng
  • Yadan Luo
  • Xin Li 0003
  • Dongmei Jiang
  • Zheng Zhang 0006

Linear attention has emerged as a promising alternative to softmax-based attention, leveraging kernelized feature maps to reduce complexity from quadratic to linear in sequence length. However, the non-negative constraint on feature maps and the relaxed exponential function used in approximation lead to significant information loss compared to the original query-key dot products, resulting in less discriminative attention maps with higher entropy. To address the missing interactions driven by negative values in query-key pairs, we propose a polarity-aware linear attention mechanism that explicitly models both same-signed and opposite-signed query-key interactions, ensuring comprehensive coverage of relational information. Furthermore, to restore the spiky properties of attention maps, we provide a theoretical analysis proving the existence of a class of element-wise functions (with positive first and second derivatives) that can reduce entropy in the attention distribution. For simplicity, and recognizing the distinct contributions of each dimension, we employ a learnable power function for rescaling, allowing strong and weak attention signals to be effectively separated. Extensive experiments demonstrate that the proposed PolaFormer improves performance on various vision tasks, enhancing both expressiveness and efficiency by up to 4.6%.
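
The mechanism can be sketched compactly: split queries and keys by sign so that same-signed and opposite-signed products each get their own similarity channel, then re-sharpen the feature maps with a power function. This is a single-head simplification with a fixed scalar power in place of the paper's learnable per-dimension one.

```python
import torch
import torch.nn.functional as F

def pola_linear_attention(q, k, v, p=3.0):
    """Single-head sketch of polarity-aware linear attention; q, k, v are
    (batch, seq, dim) with an even dim. The paper's learnable per-dimension
    power is simplified here to the fixed scalar p."""
    # Split queries/keys by sign so negative components are modeled rather
    # than clipped away by the usual non-negative feature map.
    qp, qn = F.relu(q) ** p, F.relu(-q) ** p
    kp, kn = F.relu(k) ** p, F.relu(-k) ** p
    # Same-signed channel covers q+k+ and q-k- products; the opposite-signed
    # channel covers q+k- and q-k+. Each channel attends over half of v.
    q_pol = torch.cat([qp, qn], dim=-1)
    k_same = torch.cat([kp, kn], dim=-1)
    k_opp = torch.cat([kn, kp], dim=-1)
    v_same, v_opp = v.chunk(2, dim=-1)

    def linear_attn(qf, kf, vf):
        # Linear-attention trick: associate (K^T V) first, so the cost is
        # O(seq * dim^2) rather than O(seq^2 * dim).
        kv = torch.einsum("bnd,bne->bde", kf, vf)
        z = torch.einsum("bnd,bd->bn", qf, kf.sum(dim=1)).clamp(min=1e-6)
        return torch.einsum("bnd,bde->bne", qf, kv) / z.unsqueeze(-1)

    return torch.cat([linear_attn(q_pol, k_same, v_same),
                      linear_attn(q_pol, k_opp, v_opp)], dim=-1)
```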

AAAI Conference 2025 · Conference Paper

Transferable Adversarial Face Attack with Text Controlled Attribute

  • Wenyun Li
  • Zheng Zhang
  • Xiangyuan Lan
  • Dongmei Jiang

Traditional adversarial attacks typically produce adversarial examples under norm-constrained conditions, whereas unrestricted adversarial examples are free-form with semantically meaningful perturbations. Current unrestricted adversarial impersonation attacks exhibit limited control over adversarial face attributes and often suffer from low transferability. In this paper, we propose a novel Text Controlled Attribute Attack (TCA2) to generate photorealistic adversarial impersonation faces guided by natural language. Specifically, the category-level personal softmax vector is employed to precisely guide the impersonation attacks. Additionally, we propose both data and model augmentation strategies to achieve transferable attacks on unknown target models. Finally, a generative model, i.e., StyleGAN, is utilized to synthesize impersonated faces with the desired attributes. Extensive experiments on two high-resolution face recognition datasets validate that our TCA2 method can generate natural, text-guided adversarial impersonation faces with high transferability. We also evaluate our method on real-world face recognition systems, i.e., Face++ and Aliyun, further demonstrating the practical potential of our approach.

AAAI Conference 2025 · Conference Paper

Unsupervised Degradation Representation Aware Transform for Real-World Blind Image Super-Resolution

  • Sen Chen
  • Hongying Liu
  • Chaowei Fang
  • Fanhua Shang
  • Yuanyuan Liu
  • Liang Wan
  • Dongmei Jiang
  • Yaowei Wang

Blind image super-resolution (blind SR) aims to restore a high-resolution (HR) image from a low-resolution (LR) image with unknown degradation. Many existing methods explicitly estimate degradation information from various LR images. However, in most cases image degradations are independent of image content, so these estimates may be influenced by the image content, resulting in inaccuracy. Unlike existing works, we design a dual-encoder for degradation representation (DEDR) to preclude the influence of image content in LR images, which helps extract the intrinsic degradation representation more accurately. To the best of our knowledge, this is the first work to estimate the degradation representation by filtering out image content. Based on the degradation representation extracted by DEDR, we present a novel framework, named degradation representation aware transform network (DRAT), for blind SR. We propose global degradation aware (GDA) blocks to propagate degradation information across spatial and channel dimensions, in which a degradation representation transform (DRT) module is introduced to render features degradation-aware, thereby enhancing the restoration of LR images. Extensive experiments are conducted on three benchmark settings (Gaussian8, DIV2KRK, and real-world datasets) under large scaling factors with complex degradations. The experimental results demonstrate that DRAT surpasses state-of-the-art supervised kernel estimation and unsupervised degradation representation methods.
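
A degradation-aware transform can be pictured as feature modulation: the (content-free) degradation embedding predicts a scale and shift applied to the SR features. The FiLM-style form below is an assumption for illustration, not DRAT's exact DRT module; layer names and sizes are made up.

```python
import torch
import torch.nn as nn

class DegradationTransform(nn.Module):
    """Hypothetical DRT-style module: modulate SR features with a degradation
    embedding so restoration adapts to the degradation, not image content."""
    def __init__(self, feat_ch, deg_dim):
        super().__init__()
        self.to_scale = nn.Linear(deg_dim, feat_ch)
        self.to_shift = nn.Linear(deg_dim, feat_ch)

    def forward(self, feat, deg_emb):
        # feat: (B, C, H, W) image features; deg_emb: (B, deg_dim) from a
        # degradation encoder (content-free by construction in the paper).
        scale = self.to_scale(deg_emb)[:, :, None, None]
        shift = self.to_shift(deg_emb)[:, :, None, None]
        return feat * (1 + scale) + shift

drt = DegradationTransform(feat_ch=64, deg_dim=128)
out = drt(torch.randn(2, 64, 32, 32), torch.randn(2, 128))
print(out.shape)  # torch.Size([2, 64, 32, 32])
```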

EAAI Journal 2024 · Journal Article

A single frame and multi-frame joint network for 360-degree panorama video super-resolution

  • Hongying Liu
  • Wanhao Ma
  • Zhubo Ruan
  • Chaowei Fang
  • Fanhua Shang
  • Yuanyuan Liu
  • Lijun Wang
  • Chaoli Wang

Spherical videos, also known as 360-degree (panorama) videos, can be viewed with various virtual reality devices such as computers and head-mounted displays. They attract considerable interest because of the strong sense of immersion they provide. However, capturing, storing, and transmitting high-resolution spherical videos is extremely expensive, so low-resolution versions are widely available. In this paper, we propose a novel single-frame and multi-frame joint network (SMFN) for recovering high-resolution spherical videos from low-resolution observations. To take advantage of pixel-level inter-frame consistency, we use deformable convolutions to eliminate the motion difference between feature maps of the target frame and its neighboring frames. A mixed attention mechanism is devised to enhance the feature representation capability. A dual learning strategy is presented to constrain the space of solutions so that a better solution can be found. A new loss function based on the weighted mean squared error is proposed to emphasize the super-resolution of the equatorial regions. This is the first attempt to address the super-resolution of spherical videos, and we collect a new dataset from the Internet, MiG Panorama Video, which includes 208 videos. Experimental results on representative video clips demonstrate the efficacy of the proposed method. The dataset and our source code are available at https://github.com/lovepiano/SMFN_For_360VSR.
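
The weighted-MSE idea follows from equirectangular geometry: rows near the poles are heavily oversampled, so equatorial rows should dominate the loss. A cosine-of-latitude weighting is the usual choice and is assumed in this sketch; SMFN's exact weights may differ.

```python
import math
import torch

def latitude_weighted_mse(sr, hr):
    """Latitude-weighted MSE for equirectangular panoramas, shapes (B, C, H, W):
    rows near the equator get weight ~1, rows near the poles ~0, reflecting
    how much of the sphere each image row actually covers."""
    b, c, h, w = sr.shape
    # Latitude of each row center, from +pi/2 (top) down to -pi/2 (bottom).
    lat = (0.5 - (torch.arange(h, dtype=sr.dtype) + 0.5) / h) * math.pi
    weight = torch.cos(lat).view(1, 1, h, 1)
    # Weighted mean of squared errors over all pixels.
    return ((sr - hr) ** 2 * weight).sum() / (weight.sum() * b * c * w)
```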

AAAI Conference 2024 · Conference Paper

Deep Homography Estimation for Visual Place Recognition

  • Feng Lu
  • Shuting Dong
  • Lijun Zhang
  • Bingxi Liu
  • Xiangyuan Lan
  • Dongmei Jiang
  • Chun Yuan

Visual place recognition (VPR) is a fundamental task for many applications such as robot localization and augmented reality. Recently, hierarchical VPR methods have received considerable attention due to their trade-off between accuracy and efficiency. They usually first use global features to retrieve candidate images, then verify the spatial consistency of matched local features for re-ranking. However, the latter typically relies on the RANSAC algorithm for fitting homography, which is time-consuming and non-differentiable. This forces existing methods to compromise by training the network only for global feature extraction. Here, we propose a transformer-based deep homography estimation (DHE) network that takes the dense feature map extracted by a backbone network as input and fits homography for fast and learnable geometric verification. Moreover, we design a re-projection error of inliers loss to train the DHE network without additional homography labels; it can also be jointly trained with the backbone network to help the backbone extract features that are more suitable for local matching. Extensive experiments on benchmark datasets show that our method outperforms several state-of-the-art methods and is more than an order of magnitude faster than mainstream hierarchical VPR methods using RANSAC. The code is released at https://github.com/Lu-Feng/DHE-VPR.
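
The training signal is easy to state: warp matched keypoints with the predicted homography and penalize their distance to the matches, which stays differentiable end to end. The sketch below omits the paper's inlier selection and weighting.

```python
import torch

def reprojection_error(H, pts_src, pts_dst):
    """Warp source keypoints with homography H (3x3) and return the mean
    pixel distance to their matches; pts_src and pts_dst are (N, 2)."""
    ones = torch.ones(len(pts_src), 1, dtype=pts_src.dtype)
    src_h = torch.cat([pts_src, ones], dim=1)     # homogeneous coordinates
    warped = src_h @ H.t()                        # apply the homography
    w = warped[:, 2:3].clamp(min=1e-8)            # guard the dehomogenization
    warped = warped[:, :2] / w
    return (warped - pts_dst).norm(dim=1).mean()  # differentiable in H

H = torch.eye(3, requires_grad=True)
loss = reprojection_error(H, torch.rand(8, 2) * 100, torch.rand(8, 2) * 100)
loss.backward()  # gradients flow back into the predicted homography
```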

IJCAI Conference 2024 · Conference Paper

MLP-DINO: Category Modeling and Query Graphing with Deep MLP for Object Detection

  • Guiping Cao
  • Wenjian Huang
  • Xiangyuan Lan
  • Jianguo Zhang
  • Dongmei Jiang
  • Yaowei Wang

Popular transformer-based detectors detect objects in a one-to-one manner, where both the bounding box and the category of each object are predicted by a single query, leading to box-sensitive category predictions. Additionally, initializing positional queries solely from predicted confidence scores or learnable embeddings neglects the significant spatial interrelation between different queries, which leads to an imbalanced spatial distribution of queries (SDQ). In this paper, we propose a new MLP-DINO model to address these issues. First, we present a new Query-Independent Category Supervision (QICS) approach for modeling category information, decoupling it from the sensitive bounding box prediction process to improve detection performance. Second, to further improve category predictions, we introduce a deep MLP model into the transformer-based detection framework to capture long-range and short-range information simultaneously. Third, to balance the SDQ, we design a novel Graph-based Query Selection (GQS) method that distributes query points in a discrete manner by graphing the spatial information of queries to cover a broader range of potential objects, significantly enhancing the hit rate of queries. Experimental results on COCO indicate that our MLP-DINO achieves 54.6% AP with only 44M parameters under the 36-epoch setting, greatly outperforming the original DINO by +3.7% AP with fewer parameters and FLOPs. The source code will be available at https://github.com/Med-Process/MLP-DINO.
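
The goal of spatially balanced query selection can be illustrated with a greedy rule that trades confidence against distance to the queries already chosen. The paper's GQS instead builds a graph over queries, so this is only a rough analogue of the intent, not the method.

```python
import torch

def spread_aware_query_selection(points, scores, k, alpha=0.5):
    """Hypothetical sketch: greedily pick k query points, balancing each
    point's confidence score against its distance to the already-selected
    queries, so queries do not pile up on a single region."""
    chosen = [int(scores.argmax())]
    for _ in range(k - 1):
        # Distance from every point to its nearest already-chosen query.
        d = torch.cdist(points, points[chosen]).min(dim=1).values
        util = alpha * scores + (1 - alpha) * d
        util[chosen] = float("-inf")  # never re-pick a selected query
        chosen.append(int(util.argmax()))
    return torch.tensor(chosen)

pts, conf = torch.rand(100, 2), torch.rand(100)
print(spread_aware_query_selection(pts, conf, k=10))
```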

NeurIPS Conference 2024 · Conference Paper

Optimus-1: Hybrid Multimodal Memory Empowered Agents Excel in Long-Horizon Tasks

  • Zaijing Li
  • Yuquan Xie
  • Rui Shao
  • Gongwei Chen
  • Dongmei Jiang
  • Liqiang Nie

Building a general-purpose agent is a long-standing vision in the field of artificial intelligence. Existing agents have made remarkable progress in many domains, yet they still struggle to complete long-horizon tasks in an open world. We attribute this to the lack of necessary world knowledge and multimodal experience that could guide agents through a variety of long-horizon tasks. In this paper, we propose a Hybrid Multimodal Memory module to address the above challenges. It 1) transforms knowledge into a Hierarchical Directed Knowledge Graph that allows agents to explicitly represent and learn world knowledge, and 2) summarises historical information into an Abstracted Multimodal Experience Pool that provides agents with rich references for in-context learning. On top of the Hybrid Multimodal Memory module, a multimodal agent, Optimus-1, is constructed with a dedicated Knowledge-guided Planner and an Experience-Driven Reflector, contributing to better planning and reflection in the face of long-horizon tasks in Minecraft. Extensive experimental results show that Optimus-1 significantly outperforms all existing agents on challenging long-horizon task benchmarks and exhibits near human-level performance on many tasks. In addition, we introduce various Multimodal Large Language Models (MLLMs) as the backbone of Optimus-1. Experimental results show that Optimus-1 exhibits strong generalization with the help of the Hybrid Multimodal Memory module, outperforming the GPT-4V baseline on many tasks.
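
The two memory stores map naturally onto simple data structures: a directed graph of facts plus a pool of summarized episodes retrieved as in-context references. A toy Python sketch with illustrative field names and a naive retrieval rule, not Optimus-1's actual implementation:

```python
from dataclasses import dataclass, field

@dataclass
class HybridMultimodalMemory:
    """Toy sketch of the two stores named in the abstract: a directed graph
    of world knowledge plus a pool of abstracted experiences. Field names
    and the retrieval rule are illustrative assumptions."""
    knowledge: dict = field(default_factory=dict)    # node -> [(relation, node)]
    experiences: list = field(default_factory=list)  # (task, summary, outcome)

    def add_fact(self, head, relation, tail):
        self.knowledge.setdefault(head, []).append((relation, tail))

    def add_experience(self, task, summary, outcome):
        self.experiences.append((task, summary, outcome))

    def recall(self, task):
        # Naive keyword retrieval: surface past episodes whose task overlaps
        # the query, to serve as in-context references for the planner.
        words = set(task.lower().split())
        return [e for e in self.experiences if words & set(e[0].lower().split())]

mem = HybridMultimodalMemory()
mem.add_fact("iron_pickaxe", "requires", "iron_ingot")
mem.add_experience("mine diamond", "dig to y=11 with an iron pickaxe", "success")
print(mem.recall("craft diamond pickaxe"))  # matches the 'mine diamond' episode
```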