Arrow Research search

Author name cluster

Wangmeng Zuo

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

54 papers
2 author rows

Possible papers

54

AAAI Conference 2026 Conference Paper

Deblur4DGS: 4D Gaussian Splatting from Blurry Monocular Video

  • Renlong Wu
  • Zhilu Zhang
  • Mingyang Chen
  • Zifei Yan
  • Wangmeng Zuo

Recent 4D reconstruction methods have yielded impressive results but rely on sharp videos as supervision. However, motion blur often occurs in videos due to camera shake and object movement, and existing methods render blurry results when using such videos for reconstructing 4D models. Although a few approaches have attempted to address the problem, they struggle to produce high-quality results due to the inaccuracy of estimating continuous dynamic representations within the exposure time. Encouraged by recent work on 3D motion trajectory modeling using 3D Gaussian Splatting (3DGS), we take 3DGS as the scene representation and propose Deblur4DGS to obtain a high-quality 4D model from blurry monocular video. Specifically, we transform the estimation of continuous dynamic representations within an exposure time into exposure time estimation. Moreover, we introduce an exposure regularization term as well as multi-frame and multi-resolution consistency regularization terms to avoid trivial solutions. Furthermore, to better represent objects with large motion, we suggest blur-aware variable canonical Gaussians. Beyond novel-view synthesis, Deblur4DGS can be applied to improve blurry video from multiple perspectives, including deblurring, frame interpolation, and video stabilization. Extensive experiments on both synthetic and real-world data for the above four tasks show that Deblur4DGS outperforms state-of-the-art 4D reconstruction methods.

AAAI Conference 2026 Conference Paper

Perceptual Quality Assessment of 3D Gaussian Splatting: A Subjective Dataset and Prediction Metric

  • Zhaolin Wan
  • Yining Diao
  • Jingqi Xu
  • Hao Wang
  • Zhiyang Li
  • Xiaopeng Fan
  • Wangmeng Zuo
  • Debin Zhao

With the rapid advancement of 3D visualization, 3D Gaussian Splatting (3DGS) has emerged as a leading technique for real-time, high-fidelity rendering. While prior research has emphasized algorithmic performance and visual fidelity, the perceptual quality of 3DGS-rendered content, especially under varying reconstruction conditions, remains largely underexplored. In practice, factors such as viewpoint sparsity, limited training iterations, point downsampling, noise, and color distortions can significantly degrade visual quality, yet their perceptual impact has not been systematically studied. To bridge this gap, we present 3DGS-QA, the first subjective quality assessment dataset for 3DGS. It comprises 225 degraded reconstructions across 15 object types, enabling a controlled investigation of common distortion factors. Based on this dataset, we introduce a no-reference quality prediction model that directly operates on native 3D Gaussian primitives, without requiring rendered images or ground-truth references. Our model extracts spatial and photometric cues from the Gaussian representation to estimate perceived quality in a structure-aware manner. We further benchmark existing quality assessment methods, spanning both traditional and learning-based approaches. Experimental results show that our method consistently achieves superior performance, highlighting its robustness and effectiveness for 3DGS content evaluation. The dataset and code are made publicly available to facilitate future research in 3DGS quality assessment.

AAAI Conference 2026 Conference Paper

RefSTAR: Blind Face Image Restoration with Reference Selection, Transfer, and Reconstruction

  • Zhicun Yin
  • Junjie Chen
  • Ming Liu
  • Zhixin Wang
  • Fan Li
  • Renjing Pei
  • Xiaoming Li
  • Rynson W. H. Lau

Introducing high-quality references can largely alleviate the uncertainty in blind face image restoration tasks, yet the equivocal utilization of reference priors still makes it difficult to preserve the human identity well. We attribute the identity inconsistency to two deficiencies of existing reference-based face restoration methods, namely the inability to effectively determine which features need to be transferred, and the failure to preserve the structure and details of the selected features. This work mainly focuses on these two issues, and we present a novel blind face image restoration method that considers reference selection, transfer, and reconstruction (RefSTAR) to introduce proper features from reference images. Specifically, we construct a reference selection (RefSel) module, which can generate accurate masks to select reference features. For training the RefSel module, we construct a RefSel-HQ dataset through a mask generation pipeline, which contains annotated masks for 10,000 ground truth-reference pairs. To guarantee the exact introduction of selected reference features, a feature fusion paradigm is designed for reference feature transferring, and a Mask-Compatible Cycle-Consistency Loss is redesigned based on reference reconstruction to further ensure the presence of selected reference image features in the output image. Experiments on various backbone models demonstrate superior performance, showing better identity preservation ability and reference feature transfer quality.

NeurIPS Conference 2025 Conference Paper

BMMR: A Large-Scale Bilingual Multimodal Multi-Discipline Reasoning Dataset

  • Zhiheng Xi
  • Guanyu Li
  • Yutao Fan
  • Honglin Guo
  • Yufang Liu
  • Xiaoran Fan
  • Jiaqi Liu
  • Wangmeng Zuo

In this paper, we introduce BMMR, a large-scale bilingual, multimodal, multi-disciplinary reasoning dataset for the community to develop and evaluate large multimodal models (LMMs). BMMR comprises 100k university-level questions drawn from 300 UNESCO-defined subjects, spanning diverse formats—multiple-choice, fill-in-the-blank, and open-ended QA—and sourced from both print and digital media such as books, exams, and quizzes. All data are curated and filtered via a human-in-the-loop, automated, and scalable framework, and each instance is paired with a high-quality reasoning path. The dataset is organized into two parts: BMMR-Eval, which comprises 20k high-quality instances to comprehensively assess LMMs’ knowledge and reasoning across multiple disciplines in both Chinese and English; and BMMR-Train, which contains 80k instances to support further research and development, extending the current focus on mathematical reasoning to diverse disciplines and domains. In addition, we propose the process-based multi-discipline BMMR-Verifier for accurate and fine-grained evaluation of LMMs’ reasoning. Extensive experiments reveal that (i) even SOTA models leave substantial headroom on BMMR-Eval; (ii) reasoning models exhibit discipline bias and outperform LMMs only on specific subjects; (iii) open-source models still trail their proprietary counterparts; and (iv) fine-tuning on BMMR-Train narrows this gap. Additionally, we conduct reasoning-chain analyses using BMMR-Verifier and other in-depth studies, uncovering the challenges LMMs currently face in multidisciplinary reasoning. We will release the data and models, and we believe our work can offer valuable insights and contributions to the community.

AAAI Conference 2025 Conference Paper

DreamPhysics: Learning Physics-Based 3D Dynamics with Video Diffusion Priors

  • Tianyu Huang
  • Haoze Zhang
  • Yihan Zeng
  • Zhilu Zhang
  • Hui Li
  • Wangmeng Zuo
  • Rynson W. H. Lau

Dynamic 3D interaction has been attracting a lot of attention recently. However, creating such 4D content remains challenging. One solution is to animate 3D scenes with physics-based simulation, which requires manually assigning precise physical properties to the object; otherwise, the simulated results become unnatural. Another solution is to learn the deformation of 3D objects with the distillation of video generative models, which, however, tends to produce 3D videos with small and discontinuous motions due to the inappropriate extraction and application of physics priors. In this work, to combine the strengths of the above two solutions and complement their shortcomings, we propose to learn the physical properties of a material field with video diffusion priors, and then utilize a physics-based Material-Point-Method (MPM) simulator to generate 4D content with realistic motions. In particular, we propose motion distillation sampling to emphasize video motion information during distillation. In addition, to facilitate the optimization, we further propose a KAN-based material field with frame boosting. Experimental results demonstrate that our method produces more realistic motions than state-of-the-art methods.

ICLR Conference 2025 Conference Paper

Exposure Bracketing Is All You Need For A High-Quality Image

  • Zhilu Zhang 0001
  • Shuohao Zhang
  • Renlong Wu
  • Zifei Yan
  • Wangmeng Zuo

It is highly desired but challenging to acquire high-quality photos with clear content in low-light environments. Although multi-image processing methods (using burst, dual-exposure, or multi-exposure images) have made significant progress in addressing this issue, they typically focus on specific restoration or enhancement problems, and do not fully explore the potential of utilizing multiple images. Motivated by the fact that multi-exposure images are complementary in denoising, deblurring, high dynamic range imaging, and super-resolution, we propose to utilize exposure bracketing photography to get a high-quality image by combining these tasks in this work. Due to the difficulty in collecting real-world pairs, we suggest a solution that first pre-trains the model with synthetic paired data and then adapts it to real-world unlabeled images. In particular, a temporally modulated recurrent network (TMRNet) and self-supervised adaptation method are proposed. Moreover, we construct a data simulation pipeline to synthesize pairs and collect real-world images from 200 nighttime scenarios. Experiments on both datasets show that our method performs favorably against the state-of-the-art multi-image processing ones. Code and datasets are available at https://github.com/cszhilu1998/BracketIRE.

ICLR Conference 2025 Conference Paper

Federated Residual Low-Rank Adaptation of Large Language Models

  • Yunlu Yan
  • Chun-Mei Feng 0001
  • Wangmeng Zuo
  • Rick Siow Mong Goh
  • Yong Liu 0026
  • Lei Zhu 0003

Low-Rank Adaptation (LoRA) presents an effective solution for federated fine-tuning of Large Language Models (LLMs), as it substantially reduces communication overhead. However, a straightforward combination of FedAvg and LoRA results in suboptimal performance, especially under data heterogeneity. We note that this stems from both intrinsic (i.e., constrained parameter space) and extrinsic (i.e., client drift) limitations, which hinder it from effectively learning global knowledge. In this work, we propose a novel Federated Residual Low-Rank Adaptation method, namely FRLoRA, to tackle the above two limitations. It directly sums the weights of the global model parameters with a residual low-rank matrix product (i.e., the weight change) during the global update step, and synchronizes this update for all local models. In this way, FRLoRA performs global updates in a higher-rank parameter space, enabling a better representation of complex knowledge structures. Furthermore, FRLoRA reinitializes the local low-rank matrices with the principal singular values and vectors of the pre-trained weights in each round to calibrate their inconsistent convergence, thereby mitigating client drift. Our extensive experiments demonstrate that FRLoRA consistently outperforms various state-of-the-art FL methods across nine different benchmarks in natural language understanding and generation under different FL scenarios.
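The two mechanisms named in the abstract (folding the averaged residual low-rank product into the global weights, and reinitializing local factors from the principal singular components of the pretrained weights) can be illustrated with a minimal single-layer sketch. The function names and shapes below are hypothetical simplifications, not the authors' implementation.

```python
import torch

def global_residual_update(w_global, lora_pairs):
    """Hypothetical sketch: fold the averaged residual low-rank product
    (B @ A, i.e. the weight change) directly into the global weights,
    so the global update happens in the full-rank parameter space."""
    delta = torch.stack([B @ A for (A, B) in lora_pairs]).mean(dim=0)
    return w_global + delta

def reinit_local_factors(w_pretrained, rank):
    """Hypothetical sketch: reinitialize local low-rank matrices from the
    principal singular values/vectors of the pretrained weight, so that all
    clients restart each round from a consistent subspace."""
    U, S, Vh = torch.linalg.svd(w_pretrained, full_matrices=False)
    B = U[:, :rank] * S[:rank].sqrt()              # (out, r)
    A = S[:rank].sqrt().unsqueeze(1) * Vh[:rank]   # (r, in)
    return A, B

# toy usage with a single 64x32 layer and two simulated clients
w0 = torch.randn(64, 32)
clients = [(torch.randn(4, 32) * 0.01, torch.randn(64, 4) * 0.01) for _ in range(2)]
w1 = global_residual_update(w0, clients)
A, B = reinit_local_factors(w0, rank=4)
print(w1.shape, A.shape, B.shape)
```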

NeurIPS Conference 2025 Conference Paper

LabUtopia: High-Fidelity Simulation and Hierarchical Benchmark for Scientific Embodied Agents

  • Rui Li
  • Zixuan Hu
  • Wenxi Qu
  • Jinouwen Zhang
  • Zhenfei Yin
  • Sha Zhang
  • Xuantuo Huang
  • Hanqing Wang

Scientific embodied agents play a crucial role in modern laboratories by automating complex experimental workflows. Compared to typical household environments, laboratory settings impose significantly higher demands on perception of physical-chemical transformations and long-horizon planning, making them an ideal testbed for advancing embodied intelligence. However, its development has long been hampered by the lack of suitable simulators and benchmarks. In this paper, we address this gap by introducing LabUtopia, a comprehensive simulation and benchmarking suite designed to facilitate the development of generalizable, reasoning-capable embodied agents in laboratory settings. Specifically, it integrates i) LabSim, a high-fidelity simulator supporting multi-physics and chemically meaningful interactions; ii) LabScene, a scalable procedural generator for diverse scientific scenes; and iii) LabBench, a hierarchical benchmark spanning five levels of complexity from atomic actions to long-horizon mobile manipulation. LabUtopia supports 30 distinct tasks and includes more than 200 scene and instrument assets, enabling large-scale training and principled evaluation in high-complexity environments. We demonstrate that LabUtopia offers a powerful platform for advancing the integration of perception, planning, and control in scientific-purpose agents and provides a rigorous testbed for exploring the practical capabilities and generalization limits of embodied intelligence in future research. Project web page: https://rui-li023.github.io/labutopia-site/

NeurIPS Conference 2025 Conference Paper

MIRAGE: Assessing Hallucination in Multimodal Reasoning Chains of MLLM

  • Bowen Dong
  • Minheng Ni
  • Zitong Huang
  • Guanglei Yang
  • Wangmeng Zuo
  • Lei Zhang

Multimodal hallucination in multimodal large language models (MLLMs) restricts the correctness of MLLMs. However, multimodal hallucinations are multi-sourced and arise from diverse causes. Existing benchmarks fail to adequately distinguish between perception-induced hallucinations and reasoning-induced hallucinations. This failure constitutes a significant issue and hinders the diagnosis of multimodal reasoning failures within MLLMs. To address this, we propose the MIRAGE benchmark, which isolates reasoning hallucinations by constructing questions where input images are correctly perceived by MLLMs yet reasoning errors persist. MIRAGE introduces multi-granular evaluation metrics: accuracy, factuality, and LLM hallucination score for hallucination quantification. Our analysis reveals strong correlations between question types and specific hallucination patterns, particularly systematic failures of MLLMs in spatial reasoning involving complex relationships (e.g., complex geometric patterns across images). This highlights a critical limitation in the reasoning capabilities of current MLLMs and provides targeted insights for hallucination mitigation on specific types. To address these challenges, we propose Logos, a method that combines curriculum reinforcement fine-tuning to encourage models to generate logic-consistent reasoning chains by stepwise reducing learning difficulty, and collaborative hint inference to reduce reasoning complexity. Logos establishes a baseline on MIRAGE, and reduces the logical hallucinations in original base models. Link: https://bit.ly/25mirage

AAAI Conference 2025 Conference Paper

MV-VTON: Multi-View Virtual Try-On with Diffusion Models

  • Haoyu Wang
  • Zhilu Zhang
  • Donglin Di
  • Shiliang Zhang
  • Wangmeng Zuo

The goal of image-based virtual try-on is to generate an image of the target person naturally wearing the given clothing. However, existing methods solely focus on the frontal try-on using the frontal clothing. When the views of the clothing and person are significantly inconsistent, particularly when the person's view is non-frontal, the results are unsatisfactory. To address this challenge, we introduce Multi-View Virtual Try-On (MV-VTON), which aims to reconstruct the dressing results from multiple views using the given clothes. Given that single-view clothes provide insufficient information for MV-VTON, we instead employ two images, i.e., the frontal and back views of the clothing, to encompass the complete view as much as possible. Moreover, we adopt diffusion models that have demonstrated superior abilities to perform our MV-VTON. In particular, we propose a view-adaptive selection method where hard-selection and soft-selection are applied to the global and local clothing feature extraction, respectively. This ensures that the clothing features are roughly fit to the person's view. Subsequently, we suggest joint attention blocks to align and fuse clothing features with person features. Additionally, we collect an MV-VTON dataset MVG, in which each person has multiple photos with diverse views and poses. Experiments show that the proposed method not only achieves state-of-the-art results on the MV-VTON task using our MVG dataset, but also has superiority on the frontal-view virtual try-on task using the VITON-HD and DressCode datasets.

ICLR Conference 2025 Conference Paper

On the Importance of Language-driven Representation Learning for Heterogeneous Federated Learning

  • Yunlu Yan
  • Chun-Mei Feng 0001
  • Wangmeng Zuo
  • Salman H. Khan 0001
  • Yong Liu 0026
  • Lei Zhu 0003

Non-Independent and Identically Distributed (Non-IID) training data significantly challenge federated learning (FL), impairing the performance of the global model in distributed frameworks. Inspired by the superior performance and generalizability of language-driven representation learning in centralized settings, we explore its potential to enhance FL for handling non-IID data. Specifically, this paper introduces FedGLCL, a novel language-driven FL framework for image-text learning that uniquely integrates global language and local image features through contrastive learning, offering a new approach to tackle non-IID data in FL. FedGLCL redefines FL by avoiding separate local training models for each client. Instead, it uses contrastive learning to harmonize local image features with global textual data, enabling uniform feature learning across different local models. The utilization of a pre-trained text encoder in FedGLCL serves a dual purpose: it not only reduces the variance in local feature representations within FL by providing a stable and rich language context but also aids in mitigating overfitting, particularly to majority classes, by leveraging broad linguistic knowledge. Extensive experiments show that FedGLCL significantly outperforms state-of-the-art FL algorithms across different non-IID scenarios.
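The core idea described above, aligning each client's image features with frozen, globally shared text embeddings via a contrastive objective, can be sketched as follows. The encoders, class prompts, and temperature are placeholders, and this is a generic CLIP-style loss rather than the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def global_language_contrastive_loss(image_feats, text_feats, labels, tau=0.07):
    """Hypothetical sketch of the core idea: pull each local image feature
    toward the (frozen) text embedding of its class and away from the others.
    image_feats: (B, D) from a client's image encoder; text_feats: (C, D)
    from a frozen pre-trained text encoder; labels: (B,) class indices."""
    image_feats = F.normalize(image_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)
    logits = image_feats @ text_feats.t() / tau   # (B, C) cosine similarities
    return F.cross_entropy(logits, labels)

# toy usage: 8 images, 10 classes, 512-d features
img = torch.randn(8, 512, requires_grad=True)
txt = torch.randn(10, 512)            # stands in for frozen text-encoder outputs
y = torch.randint(0, 10, (8,))
loss = global_language_contrastive_loss(img, txt, y)
loss.backward()
print(float(loss))
```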

NeurIPS Conference 2025 Conference Paper

Point-RFT: Improving Multimodal Reasoning with Visually Grounded Reinforcement Finetuning

  • Minheng Ni
  • Zhengyuan Yang
  • Linjie Li
  • Chung-Ching Lin
  • Kevin Lin
  • Wangmeng Zuo
  • Lijuan Wang

Recent advances in large language models have significantly improved textual reasoning through the effective use of Chain-of-Thought (CoT) and reinforcement learning. However, extending these successes to vision-language tasks remains challenging due to inherent limitations in text-only CoT, such as visual hallucinations and insufficient multimodal integration. In this paper, we introduce Point-RFT, a multimodal reasoning framework explicitly designed to leverage visually grounded CoT reasoning for visual document understanding. Our approach consists of two stages: First, we conduct format finetuning using a curated dataset of 71K diverse visual reasoning problems, each annotated with detailed, step-by-step rationales explicitly grounded to corresponding visual elements. Second, we employ reinforcement finetuning targeting visual document understanding. On ChartQA, our approach improves accuracy from 70.88% (format-finetuned baseline) to 90.04%, surpassing the 83.92% accuracy achieved by reinforcement finetuning relying solely on text-based CoT. The result shows that our grounded CoT is more effective for multimodal reasoning compared with the text-only CoT. Moreover, Point-RFT exhibits superior generalization capability across several out-of-domain visual document reasoning benchmarks, including CharXiv, PlotQA, IconQA, TabMWP, etc., and highlights its potential in complex real-world scenarios.

AAAI Conference 2025 Conference Paper

Rethinking Transformer-Based Blind-Spot Network for Self-Supervised Image Denoising

  • Junyi Li
  • Zhilu Zhang
  • Wangmeng Zuo

Blind-spot networks (BSN) have been prevalent neural architectures in self-supervised image denoising (SSID). However, most existing BSNs are built with convolution layers. Although transformers have shown the potential to overcome the limitations of convolutions in many image restoration tasks, the attention mechanisms may violate the blind-spot requirement, thereby restricting their applicability in BSN. To this end, we propose to analyze and redesign the channel and spatial attentions to meet the blind-spot requirement. Specifically, channel self-attention may leak the blind-spot information in multi-scale architectures, since the downsampling shuffles the spatial feature into channel dimensions. To alleviate this problem, we divide the channel into several groups and perform channel attention separately. For spatial self-attention, we apply an elaborate mask to the attention matrix to restrict and mimic the receptive field of dilated convolution. Based on the redesigned channel and window attentions, we build a Transformer-based Blind-Spot Network (TBSN), which shows strong local fitting and global perspective abilities. Furthermore, we introduce a knowledge distillation strategy that distills TBSN into smaller denoisers to improve computational efficiency while maintaining performance. Extensive experiments on real-world image denoising datasets show that TBSN largely extends the receptive field and exhibits favorable performance against state-of-the-art SSID methods.
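The masked spatial attention mentioned above can be illustrated generically: a boolean mask forbids each query position from attending to locations that would leak blind-spot information. The sketch below only blocks each pixel from attending to itself; the actual mask in TBSN mimics the receptive field of dilated convolutions and is more elaborate, and the identity projections are a simplification.

```python
import torch
import torch.nn.functional as F

def masked_spatial_attention(x, mask):
    """Generic sketch: single-head self-attention over flattened spatial
    positions, with a boolean mask that forbids attending to positions that
    would violate the blind-spot requirement.
    x: (N, D) tokens; mask: (N, N) True where attention is *disallowed*."""
    q, k, v = x, x, x                           # identity projections for brevity
    attn = (q @ k.t()) / (x.shape[-1] ** 0.5)   # scaled dot-product scores
    attn = attn.masked_fill(mask, float("-inf"))
    return F.softmax(attn, dim=-1) @ v

# toy usage on an 8x8 feature map with 16 channels
h = w = 8
tokens = torch.randn(h * w, 16)
self_mask = torch.eye(h * w, dtype=torch.bool)  # each pixel ignores itself
out = masked_spatial_attention(tokens, self_mask)
print(out.shape)   # torch.Size([64, 16])
```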

NeurIPS Conference 2025 Conference Paper

RoomEditor: High-Fidelity Furniture Synthesis with Parameter-Sharing U-Net

  • Zhenyi Lin
  • Xiaofan Ming
  • Qilong Wang
  • Dongwei Ren
  • Wangmeng Zuo
  • Qinghua Hu

Virtual furniture synthesis, a critical task in image composition, aims to seamlessly integrate reference objects into indoor scenes while preserving geometric coherence and visual realism. Despite its significant potential in home design applications, this field remains underexplored due to two major challenges: the absence of publicly available and ready-to-use benchmarks hinders reproducible research, and existing image composition methods fail to meet the stringent fidelity requirements for realistic furniture placement. To address these issues, we introduce RoomBench, a ready-to-use benchmark dataset for virtual furniture synthesis, comprising 7,298 training pairs and 895 testing samples across 27 furniture categories. Then, we propose RoomEditor, a simple yet effective image composition method that employs a parameter-sharing dual U-Net architecture, ensuring better feature consistency by sharing weights between dual branches. Technical analysis reveals that conventional dual-branch architectures generally suffer from inconsistent intermediate features due to independent processing of reference and background images. In contrast, RoomEditor enforces unified feature learning through shared parameters, thereby facilitating model optimization for robust geometric alignment and maintaining visual consistency. Experiments show our RoomEditor is superior to state-of-the-arts, while generalizing directly to diverse object synthesis in unseen scenes without task-specific fine-tuning. Our dataset and code are available at https://github.com/stonecutter-21/roomeditor.

AAAI Conference 2025 Conference Paper

VideoElevator: Elevating Video Generation Quality with Versatile Text-to-Image Diffusion Models

  • Yabo Zhang
  • Yuxiang Wei
  • Xianhui Lin
  • Zheng Hui
  • Peiran Ren
  • Xuansong Xie
  • Wangmeng Zuo

Text-to-image diffusion models (T2I) have demonstrated unprecedented capabilities in creating realistic and aesthetic images. In contrast, text-to-video diffusion models (T2V) still lag far behind in frame quality and text alignment, owing to insufficient quality and quantity of training videos. In this paper, we introduce VideoElevator, a training-free and plug-and-play method, which elevates the performance of T2V using superior capabilities of T2I. Different from conventional T2V sampling (i.e., temporal and spatial modeling), VideoElevator explicitly decomposes each sampling step into temporal motion refining and spatial quality elevating. Specifically, temporal motion refining uses encapsulated T2V to enhance temporal consistency, followed by inverting to the noise distribution required by T2I. Then, spatial quality elevating harnesses inflated T2I to directly predict less noisy latent, adding more photo-realistic details. We have conducted experiments on extensive prompts under combinations of various T2V and T2I models. The results show that VideoElevator not only improves the performance of T2V baselines with foundational T2I, but also facilitates stylistic video synthesis with personalized T2I. Please watch all videos in the supplementary materials for a better view.

ICLR Conference 2025 Conference Paper

Visual-O1: Understanding Ambiguous Instructions via Multi-modal Multi-turn Chain-of-thoughts Reasoning

  • Minheng Ni
  • Yutao Fan
  • Lei Zhang 0006
  • Wangmeng Zuo

As large-scale models evolve, language instructions are increasingly utilized in multi-modal tasks. Due to human language habits, these instructions often contain ambiguities in real-world scenarios, necessitating the integration of visual context or common sense for accurate interpretation. However, even highly intelligent large models exhibit observable performance limitations on ambiguous instructions, where weak reasoning abilities of disambiguation can lead to catastrophic errors. To address this issue, this paper proposes Visual-O1, a multi-modal multi-turn chain-of-thought reasoning framework. It simulates human multi-modal multi-turn reasoning, providing instantial experience for highly intelligent models or empirical experience for generally intelligent models to understand ambiguous instructions. Unlike traditional methods that require models to possess high intelligence to understand long texts or perform lengthy complex reasoning, our framework does not notably increase computational overhead and is more general and effective, even for generally intelligent models. Experiments show that our method not only enhances the performance of models of different intelligence levels on ambiguous instructions but also improves their performance on general datasets. Our work highlights the potential of artificial intelligence to work like humans in real-world scenarios with uncertainty and ambiguity. We release our data and code at https://github.com/kodenii/Visual-O1.

AAAI Conference 2025 Conference Paper

VQA4CIR: Boosting Composed Image Retrieval with Visual Question Answering

  • Chun-Mei Feng
  • Yang Bai
  • Tao Luo
  • Zhen Li
  • Salman Khan
  • Wangmeng Zuo
  • Rick Siow Mong Goh
  • Yong Liu

Although progress has been made in Composed Image Retrieval (CIR), we empirically find that a certain percentage of failure retrieval results are not consistent with their relative captions. To address this issue, this work provides a Visual Question Answering (VQA) perspective to boost the performance of CIR. The resulting VQA4CIR is a post-processing approach and can be directly plugged into existing CIR methods. Given the top-C retrieved images by a CIR method, VQA4CIR aims to decrease the adverse effect of the failure retrieval results being inconsistent with the relative caption. To find the retrieved images inconsistent with the relative caption, we resort to the "QA generation → VQA" self-verification pipeline. For QA generation, we suggest fine-tuning an LLM (e.g., LLaMA) to generate several pairs of questions and answers from each relative caption. We then fine-tune an LVLM (e.g., LLaVA) to obtain the VQA model. By feeding the retrieved image and question to the VQA model, one can find the images inconsistent with the relative caption when the answer by VQA is inconsistent with the answer in the QA pair. Consequently, the CIR performance can be boosted by modifying the ranks of inconsistently retrieved images. Experimental results show that our proposed method outperforms state-of-the-art CIR methods on the CIRR and Fashion-IQ datasets.
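The re-ranking step of the "QA generation → VQA" pipeline can be sketched abstractly. The callables below (`vqa_model`, the stub QA pairs, the exact-match check, and the demote-to-end policy) are placeholders for the fine-tuned LVLM and LLM-generated QA pairs, not the paper's implementation.

```python
def rerank_with_vqa(retrieved_images, qa_pairs, vqa_model, demote_to_end=True):
    """Hypothetical sketch of the post-processing step: keep the original
    ranking for images whose VQA answers agree with the caption-derived QA
    pairs, and push inconsistent images toward the end of the list.
    vqa_model(image, question) -> answer string."""
    consistent, inconsistent = [], []
    for image in retrieved_images:
        ok = all(vqa_model(image, q).strip().lower() == a.strip().lower()
                 for q, a in qa_pairs)
        (consistent if ok else inconsistent).append(image)
    return consistent + inconsistent if demote_to_end else consistent

# toy usage with a dummy VQA model that "answers" from the image's metadata
images = [{"id": 1, "color": "red"}, {"id": 2, "color": "blue"}]
qa = [("what color is the dress?", "blue")]
dummy_vqa = lambda img, q: img["color"]
print([img["id"] for img in rerank_with_vqa(images, qa, dummy_vqa)])  # [2, 1]
```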

ICLR Conference 2024 Conference Paper

ControlVideo: Training-free Controllable Text-to-video Generation

  • Yabo Zhang
  • Yuxiang Wei 0001
  • Dongsheng Jiang
  • Xiaopeng Zhang 0008
  • Wangmeng Zuo
  • Qi Tian 0001

Text-driven diffusion models have unlocked unprecedented abilities in image generation, whereas their video counterpart lags behind due to the excessive training cost. To avert the training burden, we propose a training-free ControlVideo to produce high-quality videos based on the provided text prompts and motion sequences. Specifically, ControlVideo adapts a pre-trained text-to-image model (i.e., ControlNet) for controllable text-to-video generation. To generate continuous videos without flickering, we propose an interleaved-frame smoother to smooth the intermediate frames. In particular, the interleaved-frame smoother splits the whole video into successive three-frame clips, and stabilizes each clip by updating the middle frame with the interpolation of the other two frames in latent space. Furthermore, a fully cross-frame interaction mechanism has been exploited to further enhance the frame consistency, while a hierarchical sampler is employed to produce long videos efficiently. Extensive experiments demonstrate that our ControlVideo outperforms the state-of-the-arts both quantitatively and qualitatively. It is worth noting that, thanks to the efficient designs, ControlVideo can generate both short and long videos within several minutes using one NVIDIA 2080Ti. Code and videos are available at https://github.com/YBYBZhang/ControlVideo.
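The interleaved-frame smoother described above can be illustrated with a minimal sketch: walk over successive three-frame clips and replace each middle frame with an interpolation of its neighbours in latent space. The simple averaging and the fixed clip offsets below are simplifications; the actual method alternates the interleaving across denoising steps.

```python
import torch

def interleaved_frame_smoother(latents):
    """Minimal sketch of the interleaved-frame smoothing idea: for successive
    three-frame clips, replace each middle frame with the interpolation
    (here: a plain average) of its two neighbours in latent space.
    latents: (T, C, H, W) video latents."""
    smoothed = latents.clone()
    for t in range(1, latents.shape[0] - 1, 2):      # middle frames of the clips
        smoothed[t] = 0.5 * (latents[t - 1] + latents[t + 1])
    return smoothed

# toy usage: 9 frames of 4x8x8 latents
video = torch.randn(9, 4, 8, 8)
print(interleaved_frame_smoother(video).shape)   # torch.Size([9, 4, 8, 8])
```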

AAAI Conference 2024 Conference Paper

Decoupled Textual Embeddings for Customized Image Generation

  • Yufei Cai
  • Yuxiang Wei
  • Zhilong Ji
  • Jinfeng Bai
  • Hu Han
  • Wangmeng Zuo

Customized text-to-image generation, which aims to learn user-specified concepts with a few images, has drawn significant attention recently. However, existing methods usually suffer from overfitting issues and entangle the subject-unrelated information (e.g., background and pose) with the learned concept, limiting the potential to compose the concept into new scenes. To address these issues, we propose DETEX, a novel approach that learns disentangled concept embeddings for flexible customized text-to-image generation. Unlike conventional methods that learn a single concept embedding from the given images, our DETEX represents each image using multiple word embeddings during training, i.e., a learnable image-shared subject embedding and several image-specific subject-unrelated embeddings. To decouple irrelevant attributes (i.e., background and pose) from the subject embedding, we further present several attribute mappers that encode each image as several image-specific subject-unrelated embeddings. To encourage these unrelated embeddings to capture the irrelevant information, we incorporate them with corresponding attribute words and propose a joint training strategy to facilitate the disentanglement. During inference, we only use the subject embedding for image generation, while selectively using image-specific embeddings to retain image-specified attributes. Extensive experiments demonstrate that the subject embedding obtained by our method can faithfully represent the target concept, while showing superior editability compared to the state-of-the-art methods. Our code will be available at https://github.com/PrototypeNx/DETEX.

NeurIPS Conference 2024 Conference Paper

Evaluation of Text-to-Video Generation Models: A Dynamics Perspective

  • Mingxiang Liao
  • Hannan Lu
  • Xinyu Zhang
  • Fang Wan
  • Tianyu Wang
  • Yuzhong Zhao
  • Wangmeng Zuo
  • Qixiang Ye

Comprehensive and constructive evaluation protocols play an important role when developing sophisticated text-to-video (T2V) generation models. Existing evaluation protocols primarily focus on temporal consistency and content continuity, yet largely ignore dynamics of video content. Such dynamics is an essential dimension measuring the visual vividness and the honesty of video content to text prompts. In this study, we propose an effective evaluation protocol, termed DEVIL, which centers on the dynamics dimension to evaluate T2V generation models, as well as improving existing evaluation metrics. In practice, we define a set of dynamics scores corresponding to multiple temporal granularities, and a new benchmark of text prompts under multiple dynamics grades. Upon the text prompt benchmark, we assess the generation capacity of T2V models, characterized by metrics of dynamics ranges and T2V alignment. Moreover, we analyze the relevance of existing metrics to dynamics metrics, improving them from the perspective of dynamics. Experiments show that DEVIL evaluation metrics enjoy up to about 90% consistency with human ratings, demonstrating the potential to advance T2V generation models.

NeurIPS Conference 2024 Conference Paper

Improved Generation of Adversarial Examples Against Safety-aligned LLMs

  • Qizhang Li
  • Yiwen Guo
  • Wangmeng Zuo
  • Hao Chen

Adversarial prompts (or, say, adversarial examples) generated using gradient-based methods exhibit outstanding performance in performing automatic jailbreak attacks against safety-aligned LLMs. Nevertheless, due to the discrete nature of texts, the input gradient of LLMs struggles to precisely reflect the magnitude of loss change that results from token replacements in the prompt, leading to limited attack success rates against safety-aligned LLMs, even in the white-box setting. In this paper, we explore a new perspective on this problem, suggesting that it can be alleviated by leveraging innovations inspired by transfer-based attacks that were originally proposed for attacking black-box image classification models. For the first time, we adapt the ideas of effective methods among these transfer-based attacks, i.e., the Skip Gradient Method and Intermediate Level Attack, to gradient-based adversarial prompt generation and achieve significant performance gains without introducing obvious computational cost. Meanwhile, by discussing the mechanisms behind the gains, new insights are drawn, and proper combinations of these methods are also developed. Our empirical results show that 87% of the query-specific adversarial suffixes generated by the developed combination can induce Llama-2-7B-Chat to produce output that exactly matches the target string on AdvBench. This match rate is 33% higher than that of a very strong baseline known as GCG, demonstrating advanced discrete optimization for adversarial prompt generation against LLMs. In addition, without introducing obvious cost, the combination achieves a >30% absolute increase in attack success rates compared with GCG when generating both query-specific (38% → 68%) and universal adversarial prompts (26.68% → 60.32%) for attacking the Llama-2-7B-Chat model on AdvBench. Code at: https://github.com/qizhangli/Gradient-based-Jailbreak-Attacks.

AAAI Conference 2024 Conference Paper

Learning Real-World Image De-weathering with Imperfect Supervision

  • Xiaohui Liu
  • Zhilu Zhang
  • Xiaohe Wu
  • Chaoyu Feng
  • Xiaotao Wang
  • Lei Lei
  • Wangmeng Zuo

Real-world image de-weathering aims at removing various undesirable weather-related artifacts. Owing to the impossibility of capturing image pairs concurrently, existing real-world de-weathering datasets often exhibit inconsistent illumination, position, and textures between the ground-truth images and the input degraded images, resulting in imperfect supervision. Such non-ideal supervision negatively affects the training process of learning-based de-weathering methods. In this work, we attempt to address the problem with a unified solution for various inconsistencies. Specifically, inspired by information bottleneck theory, we first develop a Consistent Label Constructor (CLC) to generate a pseudo-label as consistent as possible with the input degraded image while removing most weather-related degradation. In particular, multiple adjacent frames of the current input are also fed into CLC to enhance the pseudo-label. Then we combine the original imperfect labels and pseudo-labels to jointly supervise the de-weathering model by the proposed Information Allocation Strategy (IAS). During testing, only the de-weathering model is used for inference. Experiments on two real-world de-weathering datasets show that our method helps existing de-weathering models achieve better performance. Code is available at https://github.com/1180300419/imperfect-deweathering.

ICLR Conference 2024 Conference Paper

Self-Supervised High Dynamic Range Imaging with Multi-Exposure Images in Dynamic Scenes

  • Zhilu Zhang 0001
  • Haoyu Wang
  • Shuai Liu 0009
  • Xiaotao Wang
  • Lei Lei
  • Wangmeng Zuo

Merging multi-exposure images is a common approach for obtaining high dynamic range (HDR) images, with the primary challenge being the avoidance of ghosting artifacts in dynamic scenes. Recent methods have proposed using deep neural networks for deghosting. However, the methods typically rely on sufficient data with HDR ground-truths, which are difficult and costly to collect. In this work, to eliminate the need for labeled data, we propose SelfHDR, a self-supervised HDR reconstruction method that only requires dynamic multi-exposure images during training. Specifically, SelfHDR learns a reconstruction network under the supervision of two complementary components, which can be constructed from multi-exposure images and focus on HDR color as well as structure, respectively. The color component is estimated from aligned multi-exposure images, while the structure one is generated through a structure-focused network that is supervised by the color component and an input reference (e.g., medium-exposure) image. During testing, the learned reconstruction network is directly deployed to predict an HDR image. Experiments on real-world images demonstrate our SelfHDR achieves superior results against the state-of-the-art self-supervised methods, and comparable performance to supervised ones. Codes are available at https://github.com/cszhilu1998/SelfHDR.
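The "color component estimated from aligned multi-exposure images" can be illustrated with a classic weighted merge in the linear domain. The mid-tone weighting scheme and the toy exposures below are assumptions for illustration only; SelfHDR's actual construction of the color and structure supervision is more involved.

```python
import numpy as np

def merge_color_supervision(aligned_ldr, exposure_times):
    """Minimal sketch of building a color-supervision target from aligned
    multi-exposure images via a weighted merge of per-exposure radiance
    estimates.  aligned_ldr: list of (H, W, 3) arrays in [0, 1];
    exposure_times: list of floats."""
    num = np.zeros_like(aligned_ldr[0])
    den = np.zeros_like(aligned_ldr[0])
    for img, t in zip(aligned_ldr, exposure_times):
        w = 1.0 - np.abs(img - 0.5) * 2.0   # trust mid-tones, distrust clipped pixels
        num += w * (img / t)                # radiance estimate from this exposure
        den += w
    return num / np.maximum(den, 1e-6)

# toy usage with three synthetic exposures of the same scene
rng = np.random.default_rng(0)
scene = rng.random((16, 16, 3))
ldr = [np.clip(scene * t, 0, 1) for t in (0.25, 1.0, 4.0)]
hdr_target = merge_color_supervision(ldr, [0.25, 1.0, 4.0])
print(hdr_target.shape)
```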

ICLR Conference 2024 Conference Paper

Sentence-level Prompts Benefit Composed Image Retrieval

  • Yang Bai 0011
  • Xinxing Xu
  • Yong Liu 0026
  • Salman H. Khan 0001
  • Fahad Shahbaz Khan
  • Wangmeng Zuo
  • Rick Siow Mong Goh
  • Chun-Mei Feng 0001

Composed image retrieval (CIR) is the task of retrieving specific images by using a query that involves both a reference image and a relative caption. Most existing CIR models adopt the late-fusion strategy to combine visual and language features. Besides, several approaches have also been suggested to generate a pseudo-word token from the reference image, which is further integrated into the relative caption for CIR. However, these pseudo-word-based prompting methods have limitations when the target image encompasses complex changes to the reference image, e.g., object removal and attribute modification. In this work, we demonstrate that learning an appropriate sentence-level prompt for the relative caption (SPRC) is sufficient for achieving effective composed image retrieval. Instead of relying on pseudo-word-based prompts, we propose to leverage pretrained V-L models, e.g., BLIP-2, to generate sentence-level prompts. By concatenating the learned sentence-level prompt with the relative caption, one can readily use existing text-based image retrieval models to enhance CIR performance. Furthermore, we introduce both an image-text contrastive loss and a text prompt alignment loss to enforce the learning of suitable sentence-level prompts. Experiments show that our proposed method performs favorably against the state-of-the-art CIR methods on the Fashion-IQ and CIRR datasets.

ICLR Conference 2024 Conference Paper

TextField3D: Towards Enhancing Open-Vocabulary 3D Generation with Noisy Text Fields

  • Tianyu Huang
  • Yihan Zeng
  • Bowen Dong 0001
  • Hang Xu 0004
  • Songcen Xu
  • Rynson W. H. Lau
  • Wangmeng Zuo

Recent works learn 3D representation explicitly under text-3D guidance. However, limited text-3D data restricts the vocabulary scale and text control of generations. Generators may easily fall into a stereotype concept for certain text prompts, thus losing open-vocabulary generation ability. To tackle this issue, we introduce a conditional 3D generative model, namely TextField3D. Specifically, rather than using the text prompts as input directly, we suggest injecting dynamic noise into the latent space of given text prompts, i.e., Noisy Text Fields (NTFs). In this way, limited 3D data can be mapped to the appropriate range of textual latent space that is expanded by NTFs. To this end, an NTFGen module is proposed to model general text latent code in noisy fields. Meanwhile, an NTFBind module is proposed to align view-invariant image latent code to noisy fields, further supporting image-conditional 3D generation. To guide the conditional generation in both geometry and texture, multi-modal discrimination is constructed with a text-3D discriminator and a text-2.5D discriminator. Compared to previous methods, TextField3D includes three merits: 1) large vocabulary, 2) text consistency, and 3) low latency. Extensive experiments demonstrate that our method achieves a potential open-vocabulary 3D generation capability.

AAAI Conference 2024 Conference Paper

VQ-FONT: Few-Shot Font Generation with Structure-Aware Enhancement and Quantization

  • Mingshuai Yao
  • Yabo Zhang
  • Xianhui Lin
  • Xiaoming Li
  • Wangmeng Zuo

Few-shot font generation is challenging, as it needs to capture the fine-grained stroke styles from a limited set of reference glyphs, and then transfer to other characters, which are expected to have similar styles. However, due to the diversity and complexity of Chinese font styles, the synthesized glyphs of existing methods usually exhibit visible artifacts, such as missing details and distorted strokes. In this paper, we propose a VQGAN-based framework (i.e., VQ-Font) to enhance glyph fidelity through token prior refinement and structure-aware enhancement. Specifically, we pre-train a VQGAN to encapsulate the font token prior within a codebook. Subsequently, VQ-Font refines the synthesized glyphs with the codebook to eliminate the domain gap between synthesized and real-world strokes. Furthermore, our VQ-Font leverages the inherent design of Chinese characters, where structure components such as radicals and character components are combined in specific arrangements, to recalibrate fine-grained styles based on references. This process improves the matching and fusion of styles at the structure level. Both modules collaborate to enhance the fidelity of the generated fonts. Experiments on a collected font dataset show that our VQ-Font outperforms the competing methods both quantitatively and qualitatively, especially in generating challenging styles. Our code is available at https://github.com/Yaomingshuai/VQ-Font.

ICLR Conference 2023 Conference Paper

ImaginaryNet: Learning Object Detectors without Real Images and Annotations

  • Minheng Ni
  • Zitong Huang
  • Kailai Feng
  • Wangmeng Zuo

Without the demand of training in reality, humans are capable of detecting a new category of object simply based on a language description of its visual characteristics. Empowering deep learning with this ability undoubtedly enables neural networks to handle complex vision tasks, e.g., object detection, without collecting and annotating real images. To this end, this paper introduces a novel and challenging learning paradigm, Imaginary-Supervised Object Detection (ISOD), where neither real images nor manual annotations are allowed for training object detectors. To resolve this challenge, we propose ImaginaryNet, a framework to synthesize images by combining a pretrained language model and a text-to-image synthesis model. Given a class label, the language model is used to generate a full description of a scene with a target object, and the text-to-image model is deployed to generate a photo-realistic image. With the synthesized images and class labels, weakly supervised object detection can then be leveraged to accomplish ISOD. By gradually introducing real images and manual annotations, ImaginaryNet can collaborate with other supervision settings to further boost detection performance. Experiments show that ImaginaryNet can (i) obtain about 75% performance in ISOD compared with the weakly supervised counterpart of the same backbone trained on real data, and (ii) significantly improve the baseline while achieving state-of-the-art or comparable performance by incorporating ImaginaryNet with other supervision settings. Our code will be publicly available at https://github.com/kodenii/ImaginaryNet.
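The data-synthesis pipeline described above (class label → language-model scene description → text-to-image rendering → image-level label for weakly supervised detection) can be sketched abstractly. The callables `language_model` and `text_to_image`, the prompt template, and the dictionary format are placeholders, not the released code.

```python
def synthesize_training_set(class_names, language_model, text_to_image, per_class=2):
    """Hypothetical sketch of the ISOD data pipeline: for each class, ask a
    language model for a full scene description containing the object, render
    it with a text-to-image model, and keep only the image-level class label
    (no boxes), as expected by weakly supervised detectors."""
    dataset = []
    for name in class_names:
        for _ in range(per_class):
            caption = language_model(f"Describe a realistic photo containing a {name}.")
            image = text_to_image(caption)
            dataset.append({"image": image, "image_level_label": name})
    return dataset

# toy usage with stub models standing in for the pretrained LM and T2I model
stub_lm = lambda prompt: prompt.replace("Describe", "A photo of")
stub_t2i = lambda caption: f"<rendered: {caption}>"
pairs = synthesize_training_set(["bicycle", "dog"], stub_lm, stub_t2i, per_class=1)
print(pairs[0])
```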

NeurIPS Conference 2023 Conference Paper

Improving Adversarial Transferability via Intermediate-level Perturbation Decay

  • Qizhang Li
  • Yiwen Guo
  • Wangmeng Zuo
  • Hao Chen

Intermediate-level attacks that attempt to perturb feature representations following an adversarial direction drastically have shown favorable performance in crafting transferable adversarial examples. Existing methods in this category are normally formulated with two separate stages, where a directional guide is required to be determined at first and the scalar projection of the intermediate-level perturbation onto the directional guide is enlarged thereafter. The obtained perturbation deviates from the guide inevitably in the feature space, and it is revealed in this paper that such a deviation may lead to sub-optimal attack. To address this issue, we develop a novel intermediate-level method that crafts adversarial examples within a single stage of optimization. In particular, the proposed method, named intermediate-level perturbation decay (ILPD), encourages the intermediate-level perturbation to be in an effective adversarial direction and to possess a great magnitude simultaneously. In-depth discussion verifies the effectiveness of our method. Experimental results show that it outperforms state-of-the-arts by large margins in attacking various victim models on ImageNet (+10.07% on average) and CIFAR-10 (+3.88% on average). Our code is at https://github.com/qizhangli/ILPD-attack.

AAAI Conference 2023 Conference Paper

Learning Single Image Defocus Deblurring with Misaligned Training Pairs

  • Yu Li
  • Dongwei Ren
  • Xinya Shu
  • Wangmeng Zuo

By adopting the popular pixel-wise loss, existing methods for defocus deblurring heavily rely on well-aligned training image pairs. Although training pairs of ground-truth and blurry images are carefully collected, e.g., in the DPDD dataset, misalignment is inevitable between training pairs, making existing methods possibly suffer from deformation artifacts. In this paper, we propose a joint deblurring and reblurring learning (JDRL) framework for single image defocus deblurring with misaligned training pairs. Generally, JDRL consists of a deblurring module and a spatially invariant reblurring module, by which the deblurred result can be adaptively supervised by the ground-truth image to recover sharp textures while maintaining spatial consistency with the blurry image. First, in the deblurring module, a bi-directional optical flow-based deformation is introduced to tolerate spatial misalignment between deblurred and ground-truth images. Second, in the reblurring module, the deblurred result is reblurred to be spatially aligned with the blurry image by predicting a set of isotropic blur kernels and weighting maps. Moreover, we establish a new single image defocus deblurring (SDD) dataset, further validating our JDRL and also benefiting future research. Our JDRL can be applied to boost defocus deblurring networks in terms of both quantitative metrics and visual quality on the DPDD, RealDOF and our SDD datasets.

ICLR Conference 2023 Conference Paper

LPT: Long-tailed Prompt Tuning for Image Classification

  • Bowen Dong 0001
  • Pan Zhou 0002
  • Shuicheng Yan
  • Wangmeng Zuo

For long-tailed classification tasks, most works often pretrain a big model on a large-scale (unlabeled) dataset, and then fine-tune the whole pretrained model for adapting to long-tailed data. Though promising, fine-tuning the whole pretrained model tends to suffer from high cost in computation and deployment of different models for different tasks, as well as weakened generalization capability for overfitting to certain features of long-tailed data. To alleviate these issues, we propose an effective Long-tailed Prompt Tuning (LPT) method for long-tailed classification tasks. LPT introduces several trainable prompts into a frozen pretrained model to adapt it to long-tailed data. For better effectiveness, we divide prompts into two groups: 1) a shared prompt for the whole long-tailed dataset to learn general features and to adapt a pretrained model into the target long-tailed domain; and 2) group-specific prompts to gather group-specific features for the samples which have similar features and also to empower the pretrained model with fine-grained discrimination ability. Then we design a two-phase training paradigm to learn these prompts. In the first phase, we train the shared prompt via conventional supervised prompt tuning to adapt a pretrained model to the desired long-tailed domain. In the second phase, we use the learnt shared prompt as query to select a small best matched set for a group of similar samples from the group-specific prompt set to dig the common features of these similar samples, and then optimize these prompts with a dual sampling strategy and the asymmetric Gaussian Clouded Logit loss. By only fine-tuning a few prompts while fixing the pretrained model, LPT can reduce training cost and deployment cost by storing a few prompts, and enjoys a strong generalization ability of the pretrained model. Experiments show that on various long-tailed benchmarks, with only about 1.1% extra trainable parameters, LPT achieves comparable or higher performance than previous whole model fine-tuning methods, and is more robust to domain-shift.

ICLR Conference 2023 Conference Paper

Making Substitute Models More Bayesian Can Enhance Transferability of Adversarial Examples

  • Qizhang Li
  • Yiwen Guo
  • Wangmeng Zuo
  • Hao Chen 0003

The transferability of adversarial examples across deep neural networks (DNNs) is the crux of many black-box attacks. Many prior efforts have been devoted to improving the transferability via increasing the diversity in inputs of some substitute models. In this paper, by contrast, we opt for the diversity in substitute models and advocate to attack a Bayesian model for achieving desirable transferability. Deriving from the Bayesian formulation, we develop a principled strategy for possible finetuning, which can be combined with many off-the-shelf Gaussian posterior approximations over DNN parameters. Extensive experiments have been conducted to verify the effectiveness of our method, on common benchmark datasets, and the results demonstrate that our method outperforms recent state-of-the-arts by large margins (roughly 19% absolute increase in average attack success rate on ImageNet), and, by combining with these recent methods, further performance gain can be obtained. Our code: https://github.com/qizhangli/MoreBayesian-attack.
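The idea of attacking a Bayesian substitute, drawing several weight samples around a posterior mean and aggregating their input gradients, can be sketched minimally. The isotropic Gaussian, the fixed `std` hyper-parameter, and the tiny classifier below are assumptions for illustration; the paper's posterior approximation and fine-tuning strategy are more principled.

```python
import torch
import torch.nn as nn

def bayesian_ensemble_gradient(model, mean_state, std, x, y, n_samples=4):
    """Minimal sketch: sample substitute weights from an isotropic Gaussian
    around the posterior mean and average the resulting input gradients,
    which can then drive any gradient-based attack step."""
    grads = torch.zeros_like(x)
    for _ in range(n_samples):
        sampled = {k: v + std * torch.randn_like(v) for k, v in mean_state.items()}
        model.load_state_dict(sampled)
        x_adv = x.clone().requires_grad_(True)
        loss = nn.functional.cross_entropy(model(x_adv), y)
        grads += torch.autograd.grad(loss, x_adv)[0]
    return grads / n_samples

# toy usage: a tiny linear classifier on 3x8x8 inputs
net = nn.Sequential(nn.Flatten(), nn.Linear(3 * 8 * 8, 10))
x = torch.randn(2, 3, 8, 8)
y = torch.randint(0, 10, (2,))
g = bayesian_ensemble_gradient(net, {k: v.clone() for k, v in net.state_dict().items()},
                               std=0.01, x=x, y=y)
print(g.shape)
```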

EAAI Journal 2023 Journal Article

Progressive convolutional transformer for image restoration

  • Yecong Wan
  • Mingwen Shao
  • Yuanshuo Cheng
  • Deyu Meng
  • Wangmeng Zuo

In the past few years, convolutional neural networks (CNNs) have become the primary workhorse for image restoration tasks. However, the deficiency of modeling long-range dependencies due to the local computational property of convolution greatly limits the restoration performance. To overcome this limitation, we propose a novel multi-stage progressive convolutional Transformer to recursively restore the degraded images, termed PCformer, which enjoys a high capability for capturing local context and global dependencies with friendly computational cost. Specifically, each stage of PCformer is an asymmetric encoder-decoder network whose bottleneck is built upon a tailored Transformer block with convolution operations deployed to avoid any loss of local context. Both encoder and decoder are convolution-based modules, thus allowing the exploration of rich contextualized information for image recovery. Taking the low-resolution features encoded by the encoder as tokens input into the Transformer bottleneck guarantees that long-range pixel interactions are captured while reducing the computational burden. Meanwhile, we apply a gated module for filtering redundant information propagation between every two phases. In addition, long-range enhanced inpainting is further introduced to mine the ability of PCformer to exploit distant complementary features. Extensive experiments yield superior results and in particular establish new state-of-the-art results on several image restoration tasks, including deraining (+0.37 dB on Rain13K), denoising (+0.11 dB on DND), dehazing (+0.56 dB on I-HAZE), enhancement (+0.72 dB on SICE), and shadow removal (+0.65 RMSE on ISTD). The implementation code is available at https://github.com/Jeasco/PCformer.

ICLR Conference 2023 Conference Paper

Squeeze Training for Adversarial Robustness

  • Qizhang Li
  • Yiwen Guo
  • Wangmeng Zuo
  • Hao Chen 0003

The vulnerability of deep neural networks (DNNs) to adversarial examples has attracted great attention in the machine learning community. The problem is related to non-flatness and non-smoothness of normally obtained loss landscapes. Training augmented with adversarial examples (a.k.a., adversarial training) is considered as an effective remedy. In this paper, we highlight that some collaborative examples, nearly perceptually indistinguishable from both adversarial and benign examples yet show extremely lower prediction loss, can be utilized to enhance adversarial training. A novel method is therefore proposed to achieve new state-of-the-arts in adversarial robustness. Code: https://github.com/qizhangli/ST-AT.

AAAI Conference 2023 Conference Paper

Symmetry-Aware Transformer-Based Mirror Detection

  • Tianyu Huang
  • Bowen Dong
  • Jiaying Lin
  • Xiaohui Liu
  • Rynson W.H. Lau
  • Wangmeng Zuo

Mirror detection aims to identify the mirror regions in the given input image. Existing works mainly focus on integrating the semantic features and structural features to mine specific relations between mirror and non-mirror regions, or introducing mirror properties like depth or chirality to help analyze the existence of mirrors. In this work, we observe that a real object typically forms a loose symmetry relationship with its corresponding reflection in the mirror, which is beneficial in distinguishing mirrors from real objects. Based on this observation, we propose a dual-path Symmetry-Aware Transformer-based mirror detection Network (SATNet), which includes two novel modules: Symmetry-Aware Attention Module (SAAM) and Contrast and Fusion Decoder Module (CFDM). Specifically, we first adopt a transformer backbone to model global information aggregation in images, extracting multi-scale features in two paths. We then feed the high-level dual-path features to SAAMs to capture the symmetry relations. Finally, we fuse the dual-path features and refine our prediction maps progressively with CFDMs to obtain the final mirror mask. Experimental results show that SATNet outperforms both RGB and RGB-D mirror detection methods on all available mirror detection datasets.

NeurIPS Conference 2023 Conference Paper

Towards Evaluating Transfer-based Attacks Systematically, Practically, and Fairly

  • Qizhang Li
  • Yiwen Guo
  • Wangmeng Zuo
  • Hao Chen

The adversarial vulnerability of deep neural networks (DNNs) has drawn great attention due to the security risk of applying these models in real-world applications. Based on the transferability of adversarial examples, an increasing number of transfer-based methods have been developed to fool black-box DNN models whose architecture and parameters are inaccessible. Although tremendous effort has been exerted, a standardized benchmark for comparing these methods systematically, fairly, and practically is still lacking. Our investigation shows that the evaluation of some methods needs to be more reasonable and more thorough to verify their effectiveness, avoiding, for example, unfair comparisons and insufficient consideration of possible substitute/victim models. Therefore, we establish a transfer-based attack benchmark (TA-Bench) which implements 30+ methods. In this paper, we evaluate and compare them comprehensively on 10 popular substitute/victim models on ImageNet. New insights about the effectiveness of these methods are gained and guidelines for future evaluations are provided.

NeurIPS Conference 2022 Conference Paper

Self-Supervised Image Restoration with Blurry and Noisy Pairs

  • Zhilu Zhang
  • RongJian Xu
  • Ming Liu
  • Zifei Yan
  • Wangmeng Zuo

When taking photos in an environment with insufficient light, the exposure time and the sensor gain usually need to be chosen carefully to obtain images with satisfying visual quality. For example, images with high ISO usually have inescapable noise, while long-exposure ones may be blurry due to camera shake or object motion. Existing solutions generally seek a balance between noise and blur, and learn denoising or deblurring models under either full or self-supervision. However, real-world training pairs are difficult to collect, and self-supervised methods that rely merely on blurry or noisy images are limited in performance. In this work, we tackle this problem by jointly leveraging the short-exposure noisy image and the long-exposure blurry image for better image restoration. Such a setting is practically feasible because short-exposure and long-exposure images can be either acquired by two individual cameras or synthesized from a long burst of images. Moreover, the short-exposure images are hardly blurry, and the long-exposure ones have negligible noise. Their complementarity makes it feasible to learn a restoration model in a self-supervised manner. Specifically, the noisy images can be used as supervision for deblurring, while the sharp areas in the blurry images can be utilized as auxiliary supervision for self-supervised denoising. By learning in a collaborative manner, the deblurring and denoising tasks in our method can benefit each other. Experiments on synthetic and real-world images show the effectiveness and practicality of the proposed method. Codes are available at https://github.com/cszhilu1998/SelfIR.
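
The collaborative supervision described above can be summarized in two loss terms: the noisy short-exposure image serves as a (noisy) target for the deblurring output, while a sharpness mask restricts the blurry long-exposure image to the regions where it can supervise the denoising output. A minimal sketch, with the mask and the plain L1 losses as assumptions rather than the paper's exact objectives:

```python
# Hedged sketch of cross-supervised losses for a blurry/noisy image pair.
import torch
import torch.nn.functional as F

def selfir_style_losses(deblurred, denoised, noisy, blurry, sharp_mask):
    # Deblurring branch: the (nearly sharp) noisy image acts as a noisy target.
    loss_deblur = F.l1_loss(deblurred, noisy)
    # Denoising branch: only sharp regions of the blurry image are trusted.
    loss_denoise = (sharp_mask * (denoised - blurry).abs()).mean()
    return loss_deblur + loss_denoise
```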

IJCAI Conference 2022 Conference Paper

Self-supervised Learning and Adaptation for Single Image Dehazing

  • Yudong Liang
  • Bin Wang
  • Wangmeng Zuo
  • Jiaying Liu
  • Wenqi Ren

Existing deep image dehazing methods usually depend on supervised learning with a large number of hazy–clean image pairs, which are expensive or difficult to collect. Moreover, the dehazing performance of the learned model may deteriorate significantly when the training hazy–clean image pairs are insufficient and differ from the real hazy images encountered in applications. In this paper, we show that exploiting a large-scale training set and adapting to real hazy images are two critical issues in learning effective deep dehazing models. Under the depth guidance estimated by a well-trained depth estimation network, we leverage the conventional atmospheric scattering model to generate massive hazy–clean image pairs for the self-supervised pre-training of the dehazing network. Furthermore, self-supervised adaptation is presented to adapt the pre-trained network to real hazy images. A learning-without-forgetting strategy is also deployed in self-supervised adaptation by combining self-supervision and model adaptation via contrastive learning. Experiments show that our proposed method performs favorably against the state-of-the-art methods and is quite efficient, i.e., handling a 4K image in 23 ms. The codes are available at https://github.com/DongLiangSXU/SLAdehazing.
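
The synthetic pre-training pairs mentioned above come from the standard atmospheric scattering model, I = J·t + A·(1 − t) with transmission t = exp(−β·d), where d is the depth predicted by a pre-trained depth network. A minimal NumPy sketch (the β and airlight values here are arbitrary choices):

```python
# Hedged sketch of hazy-image synthesis via the atmospheric scattering model.
import numpy as np

def synthesize_hazy(clean, depth, beta=1.2, airlight=0.9):
    """clean: HxWx3 float image in [0,1]; depth: HxW depth map (larger = farther)."""
    t = np.exp(-beta * depth)[..., None]        # transmission map, HxWx1
    hazy = clean * t + airlight * (1.0 - t)     # I = J*t + A*(1 - t)
    return np.clip(hazy, 0.0, 1.0)
```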

NeurIPS Conference 2022 Conference Paper

Towards Diverse and Faithful One-shot Adaption of Generative Adversarial Networks

  • Yabo Zhang
  • Mingshuai Yao
  • Yuxiang Wei
  • Zhilong Ji
  • Jinfeng Bai
  • Wangmeng Zuo

One-shot generative domain adaptation aims to transfer a pre-trained generator on one domain to a new domain using only one reference image. However, it remains very challenging for the adapted generator (i) to generate diverse images inherited from the pre-trained generator while (ii) faithfully acquiring the domain-specific attributes and styles of the reference image. In this paper, we present a novel one-shot generative domain adaptation method, i.e., DiFa, for diverse generation and faithful adaptation. For global-level adaptation, we leverage the difference between the CLIP embedding of the reference image and the mean embedding of source images to constrain the target generator. For local-level adaptation, we introduce an attentive style loss which aligns each intermediate token of an adapted image with its corresponding token of the reference image. To facilitate diverse generation, selective cross-domain consistency is introduced to select and retain domain-sharing attributes in the editing latent $\mathcal{W}+$ space to inherit the diversity of the pre-trained generator. Extensive experiments show that our method outperforms the state of the art both quantitatively and qualitatively, especially for cases with a large domain gap. Moreover, DiFa can easily be extended to zero-shot generative domain adaptation with appealing results.
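
The global-level constraint described above can be read as a directional loss in CLIP space: move adapted samples along the direction from the mean source embedding toward the reference embedding. The sketch below follows that reading; `clip_image_encoder` is an assumed stand-in for any CLIP image encoder, and the cosine-alignment form is an assumption rather than the authors' exact loss.

```python
# Hedged sketch of a CLIP-space directional constraint for one-shot adaptation.
import torch
import torch.nn.functional as F

def global_clip_direction_loss(adapted_imgs, source_imgs, reference_img,
                               clip_image_encoder):
    e_adapt = F.normalize(clip_image_encoder(adapted_imgs), dim=-1)
    e_src = F.normalize(clip_image_encoder(source_imgs), dim=-1)
    e_ref = F.normalize(clip_image_encoder(reference_img), dim=-1)
    # Target direction: reference embedding minus mean source embedding.
    target_dir = F.normalize(e_ref.mean(0) - e_src.mean(0), dim=-1)
    # Per-sample direction of the adapted images relative to the source mean.
    sample_dir = F.normalize(e_adapt - e_src.mean(0, keepdim=True), dim=-1)
    # Encourage cosine alignment with the target direction.
    return (1.0 - sample_dir @ target_dir).mean()
```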

NeurIPS Conference 2020 Conference Paper

Cross-Scale Internal Graph Neural Network for Image Super-Resolution

  • Shangchen Zhou
  • Jiawei Zhang
  • Wangmeng Zuo
  • Chen Change Loy

Non-local self-similarity in natural images has been well studied as an effective prior in image restoration. However, for single image super-resolution (SISR), most existing deep non-local methods (e.g., non-local neural networks) only exploit similar patches within the same scale of the low-resolution (LR) input image. Consequently, the restoration is limited to using same-scale information while neglecting potential high-resolution (HR) cues from other scales. In this paper, we explore the cross-scale patch recurrence property of natural images, i.e., similar patches tend to recur many times across different scales. This is achieved using a novel cross-scale internal graph neural network (IGNN). Specifically, we dynamically construct a cross-scale graph by searching k-nearest neighboring patches in the downsampled LR image for each query patch in the LR image. We then obtain the corresponding k HR neighboring patches in the LR image and aggregate them adaptively in accordance with the edge labels of the constructed graph. In this way, the HR information can be passed from the k HR neighboring patches to the LR query patch to help it recover more detailed textures. Besides, these internal image-specific LR/HR exemplars are also significant complements to the external information learned from the training dataset. Extensive experiments demonstrate the effectiveness of IGNN against state-of-the-art SISR methods, including existing non-local networks, on standard benchmarks.
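
The cross-scale search described above can be sketched as a patch-level k-nearest-neighbor lookup: query patches from the LR image are matched against patches of its 2x-downsampled version, and the matched positions index 2x-larger patches of the LR image itself, which act as internal HR exemplars. Patch size, stride, and the naive averaging below are assumptions; the actual IGNN aggregates adaptively along graph edges.

```python
# Hedged sketch of a cross-scale k-NN patch search and naive aggregation.
import torch
import torch.nn.functional as F

def cross_scale_knn(lr, k=3, p=4):
    # lr: (1, C, H, W); build the search space from its 2x-downsampled version.
    lr_down = F.interpolate(lr, scale_factor=0.5, mode='bicubic', align_corners=False)
    q = F.unfold(lr, kernel_size=p, stride=p)              # query patches from LR
    cand = F.unfold(lr_down, kernel_size=p, stride=1)      # candidate patches
    d = torch.cdist(q.transpose(1, 2), cand.transpose(1, 2))      # (1, Nq, Nc) distances
    idx = d.topk(k, dim=-1, largest=False).indices                # k nearest per query
    # "HR" exemplars: 2p x 2p patches of the original LR image, sampled on the
    # same grid as the downsampled candidates (stride 2 keeps positions aligned).
    hr_cand = F.unfold(lr, kernel_size=2 * p, stride=2)
    hr_sel = hr_cand[..., idx[0]]              # (1, C*(2p)^2, Nq, k)
    return hr_sel.mean(dim=-1)                 # naive average of the k HR exemplars

lr = torch.randn(1, 3, 32, 32)
print(cross_scale_knn(lr).shape)               # (1, C*(2p)^2, num_queries)
```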

NeurIPS Conference 2018 Conference Paper

Deep Non-Blind Deconvolution via Generalized Low-Rank Approximation

  • Wenqi Ren
  • Jiawei Zhang
  • Lin Ma
  • Jinshan Pan
  • Xiaochun Cao
  • Wangmeng Zuo
  • Wei Liu
  • Ming-Hsuan Yang

In this paper, we present a deep convolutional neural network to capture the inherent properties of image degradation, which can handle different kernels and saturated pixels in a unified framework. The proposed neural network is motivated by the low-rank property of pseudo-inverse kernels. We first compute a generalized low-rank approximation for a large number of blur kernels, and then use separable filters to initialize the convolutional parameters in the network. Our analysis shows that the estimated decomposed matrices contain the most essential information of the input kernel, which enables the proposed network to handle various blurs in a unified framework and generate high-quality deblurring results. Experimental results on benchmark datasets with noise and saturated pixels demonstrate that the proposed algorithm performs favorably against state-of-the-art methods.
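
The low-rank initialization described above can be illustrated by forming a regularized pseudo-inverse of a blur kernel in the frequency domain and taking its truncated SVD, so the deconvolution filter becomes a sum of a few separable (outer-product) filters. Kernel size, regularization weight, and rank below are assumptions for the sketch:

```python
# Hedged sketch: low-rank separable approximation of a pseudo-inverse blur kernel.
import numpy as np

def separable_pseudo_inverse(kernel, size=31, reg=1e-2, rank=3):
    K = np.fft.fft2(kernel, s=(size, size))
    inv = np.conj(K) / (np.abs(K) ** 2 + reg)       # Wiener-style pseudo-inverse
    pinv_kernel = np.fft.fftshift(np.real(np.fft.ifft2(inv)))   # centered spatial filter
    U, S, Vt = np.linalg.svd(pinv_kernel)
    # Rank-r separable approximation: sum_i s_i * u_i v_i^T, usable to
    # initialize pairs of 1-D (separable) convolution filters.
    filters = [(S[i], U[:, i], Vt[i, :]) for i in range(rank)]
    approx = sum(s * np.outer(u, v) for s, u, v in filters)
    return filters, approx

kern = np.ones((5, 5)) / 25.0                        # simple box blur kernel
filters, approx = separable_pseudo_inverse(kern)
```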

NeurIPS Conference 2018 Conference Paper

Global Gated Mixture of Second-order Pooling for Improving Deep Convolutional Neural Networks

  • Qilong Wang
  • Zilin Gao
  • Jiangtao Xie
  • Wangmeng Zuo
  • Peihua Li

In most existing deep convolutional neural networks (CNNs) for classification, global average (first-order) pooling (GAP) has become a standard module to summarize the activations of the last convolution layer as the final representation for prediction. Recent research shows that integrating higher-order pooling (HOP) methods clearly improves the performance of deep CNNs. However, both GAP and existing HOP methods assume unimodal distributions, which cannot fully capture the statistics of convolutional activations, limiting the representation ability of deep CNNs, especially for samples with complex contents. To overcome this limitation, this paper proposes a global Gated Mixture of Second-order Pooling (GM-SOP) method to further improve the representation ability of deep CNNs. To this end, we introduce a sparsity-constrained gating mechanism and propose a novel parametric SOP as the component of the mixture model. Given a bank of SOP candidates, our method can adaptively choose the top-K (K > 1) candidates for each input sample through the sparsity-constrained gating module, and performs a weighted sum of the outputs of the K selected candidates as the representation of the sample. The proposed GM-SOP can flexibly accommodate a large number of personalized SOP candidates in an efficient way, leading to richer representations. Deep networks with our GM-SOP can be trained end-to-end and have the potential to characterize complex, multi-modal distributions. The proposed method is evaluated on two large-scale image benchmarks (i.e., downsampled ImageNet-1K and Places365), and experimental results show that our GM-SOP is superior to its counterparts and achieves very competitive performance. The source code will be available at http://www.peihuali.org/GM-SOP.
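
The gating-and-mixing step described above amounts to scoring a bank of second-order pooling (SOP) candidates per sample, keeping only the top-K scores, and summing the selected candidates' outputs with softmax weights. In the sketch below, the SOP candidates are plain covariance poolings over 1x1-projected features, standing in for the paper's parametric SOP components; all sizes are assumptions.

```python
# Hedged sketch of a gated top-K mixture of second-order (covariance) poolings.
import torch
import torch.nn as nn

class GatedMixtureSOP(nn.Module):
    def __init__(self, channels=64, num_candidates=8, proj_dim=16, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Linear(channels, num_candidates)
        # Each candidate projects channels to proj_dim before the outer product.
        self.projs = nn.ModuleList(
            nn.Conv2d(channels, proj_dim, 1) for _ in range(num_candidates))

    def forward(self, x):                      # x: (B, C, H, W)
        gap = x.mean(dim=(2, 3))               # first-order summary used for gating
        scores = self.gate(gap)                # (B, N) candidate scores
        topv, topi = scores.topk(self.top_k, dim=1)
        weights = torch.softmax(topv, dim=1)   # weights over the selected K only
        outs = []
        for b in range(x.size(0)):
            pooled = []
            for j, w in zip(topi[b].tolist(), weights[b]):
                f = self.projs[j](x[b:b+1]).flatten(2)        # (1, D, HW)
                cov = f @ f.transpose(1, 2) / f.size(-1)      # second-order pooling
                pooled.append(w * cov.flatten(1))
            outs.append(torch.stack(pooled).sum(0))
        return torch.cat(outs, dim=0)          # (B, D*D) representation

feat = torch.randn(2, 64, 7, 7)
print(GatedMixtureSOP()(feat).shape)           # torch.Size([2, 256])
```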

AAAI Conference 2018 Conference Paper

Learning a Wavelet-Like Auto-Encoder to Accelerate Deep Neural Networks

  • Tianshui Chen
  • Liang Lin
  • Wangmeng Zuo
  • Xiaonan Luo
  • Lei Zhang

Accelerating deep neural networks (DNNs) has been attracting increasing attention as it can benefit a wide range of applications, e.g., enabling mobile systems with limited computing resources to own powerful visual recognition ability. A practical strategy toward this goal usually relies on a two-stage process: operating on the trained DNNs (e.g., approximating the convolutional filters with tensor decomposition) and fine-tuning the amended network, leading to difficulty in balancing the trade-off between acceleration and maintaining recognition performance. In this work, aiming at a general and comprehensive way to accelerate neural networks, we develop a Wavelet-like Auto-Encoder (WAE) that decomposes the original input image into two low-resolution channels (sub-images), and we incorporate the WAE into the classification neural networks for joint training. The two decomposed channels, in particular, are encoded to carry the low-frequency information (e.g., image profiles) and the high-frequency information (e.g., image details or noise), respectively, and enable reconstructing the original input image through the decoding process. Then, we feed the low-frequency channel into a standard classification network such as VGG or ResNet and employ a very lightweight network to fuse it with the high-frequency channel to obtain the classification result. Compared to existing DNN acceleration solutions, our framework has the following advantages: i) it is compatible with any existing convolutional neural network for classification without amending its structure; ii) the WAE provides an interpretable way to preserve the main components of the input image for classification.
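
The decomposition described above can be sketched as an encoder that splits the input into two half-resolution sub-images plus a decoder that reconstructs the input from both, so only one sub-image has to pass through the heavy classification network. Layer choices below are assumptions, not the authors' WAE.

```python
# Hedged sketch of a wavelet-like auto-encoder producing two half-resolution channels.
import torch
import torch.nn as nn

class WaveletLikeAE(nn.Module):
    def __init__(self, c=16):
        super().__init__()
        self.enc_low = nn.Conv2d(3, 3, 3, stride=2, padding=1)   # low-frequency sub-image
        self.enc_high = nn.Conv2d(3, 3, 3, stride=2, padding=1)  # high-frequency sub-image
        self.dec = nn.Sequential(
            nn.Conv2d(6, c, 3, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(c, 3, 4, stride=2, padding=1))

    def forward(self, x):
        low, high = self.enc_low(x), self.enc_high(x)
        recon = self.dec(torch.cat([low, high], dim=1))
        return low, high, recon     # classify on `low`; train with a ||recon - x|| term

x = torch.randn(1, 3, 224, 224)
low, high, recon = WaveletLikeAE()(x)
print(low.shape, recon.shape)       # (1, 3, 112, 112) (1, 3, 224, 224)
```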

IJCAI Conference 2017 Conference Paper

Discriminant Tensor Dictionary Learning with Neighbor Uncorrelation for Image Set Based Classification

  • Fei Wu
  • Xiao-Yuan Jing
  • Wangmeng Zuo
  • Ruiping Wang
  • Xiaoke Zhu

Image set based classification (ISC) has attracted much research interest in recent years. Several ISC methods have been developed, and dictionary learning based methods obtain state-of-the-art performance. However, existing ISC methods usually transform the image samples of a set into vectors for subsequent processing, which breaks the inherent spatial structure of the image samples and the set. In this paper, we utilize tensors to model an image set with two spatial modes and one set mode, which can fully explore the intrinsic structure of the image set. We propose a novel ISC approach, named discriminant tensor dictionary learning with neighbor uncorrelation (DTDLNU), which jointly learns two spatial dictionaries and one set dictionary. The spatial and set dictionaries are composed of set-specific sub-dictionaries corresponding to the class labels, such that the reconstruction error is discriminative. To obtain dictionaries with favorable discriminative power, DTDLNU designs a neighbor-uncorrelated discriminant tensor dictionary term, which minimizes the within-class scatter of the training sets in the projected tensor space and reduces tensor dictionary correlation among set-specific sub-dictionaries corresponding to neighboring sets from different classes. Experiments on three challenging datasets demonstrate the effectiveness of DTDLNU.

AAAI Conference 2017 Conference Paper

Learning Heterogeneous Dictionary Pair with Feature Projection Matrix for Pedestrian Video Retrieval via Single Query Image

  • Xiaoke Zhu
  • Xiao-Yuan Jing
  • Fei Wu
  • Yunhong Wang
  • Wangmeng Zuo
  • Wei-Shi Zheng

Person re-identification (re-id) plays an important role in video surveillance and forensics applications. In many cases, person re-id needs to be conducted between an image and a video clip, e.g., re-identifying a suspect from large quantities of pedestrian videos given a single image of him. We refer to re-id in this scenario as image-to-video person re-id (IVPR). In practice, images and videos are usually represented with different features, and large variations usually exist between frames within each video. These factors make matching between image and video a very challenging task. In this paper, we propose a joint feature projection matrix and heterogeneous dictionary pair learning (PHDL) approach for IVPR. Specifically, PHDL jointly learns an intra-video projection matrix and a pair of heterogeneous image and video dictionaries. With the learned projection matrix, the influence of variations within each video on the matching can be reduced. With the learned dictionary pair, the heterogeneous image and video features can be transformed into coding coefficients of the same dimension, such that matching can be conducted using the coding coefficients. Furthermore, to ensure that the obtained coding coefficients have favorable discriminability, PHDL designs a point-to-set coefficient discriminant term. Experiments on the public iLIDS-VID and PRID 2011 datasets demonstrate the effectiveness of the proposed approach.

AAAI Conference 2017 Conference Paper

Learning Patch-Based Dynamic Graph for Visual Tracking

  • Chenglong Li
  • Liang Lin
  • Wangmeng Zuo
  • Jin Tang

Existing visual tracking methods usually localize the object with a bounding box, in which case foreground object trackers/detectors are often disturbed by the introduced background information. To handle this problem, we aim to learn a more robust object representation for visual tracking. In particular, the tracked object is represented with a graph structure (i.e., a set of non-overlapping image patches), in which the weight of each node (patch) indicates how likely it belongs to the foreground, and edges are also weighted to indicate the appearance compatibility of two neighboring nodes. This graph is dynamically learned (i.e., the nodes and edges receive weights) and applied in object tracking and model updating. We constrain the graph learning from two aspects: i) the global low-rank structure over all nodes and ii) the local sparseness of node neighbors. During the tracking process, our method performs the following steps at each frame. First, the graph is initialized by assigning either 1 or 0 to the weights of some image patches according to the predicted bounding box. Second, the graph is optimized through a newly designed ALM (Augmented Lagrange Multiplier) based algorithm. Third, the object feature representation is updated by imposing the weights of the patches on the extracted image features. The object location is finally predicted by adopting the Struck tracker (Hare, Saffari, and Torr 2011). Extensive experiments show that our approach outperforms state-of-the-art tracking methods on two standard benchmarks, i.e., OTB100 and NUS-PRO.

AAAI Conference 2017 Conference Paper

Multiset Feature Learning for Highly Imbalanced Data Classification

  • Fei Wu
  • Xiao-Yuan Jing
  • Shiguang Shan
  • Wangmeng Zuo
  • Jing-Yu Yang

With the expansion of data, increasingly imbalanced data has emerged. When the imbalance ratio of the data is high, most existing imbalanced learning methods decline in classification performance. To address this problem, a few highly imbalanced learning methods have been presented. However, most of them are still sensitive to a high imbalance ratio. This work aims to provide an effective solution for the highly imbalanced data classification problem. We conduct highly imbalanced learning from the perspective of feature learning. We partition the majority class into multiple blocks, each balanced with respect to the minority class, and combine each block with the minority class to construct a balanced sample set. Multiset feature learning (MFL) is performed on these sets to learn discriminant features. We thus propose an uncorrelated cost-sensitive multiset learning (UCML) approach. UCML provides a multiple-set construction strategy, incorporates the cost-sensitive factor into MFL, and designs a weighted uncorrelated constraint to remove the correlation among multiset features. Experiments on five highly imbalanced datasets indicate that UCML outperforms state-of-the-art imbalanced learning methods.
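
The balanced-set construction described above is straightforward to sketch: shuffle the majority class, cut it into blocks of the minority-class size, and pair each block with the full minority class. Random shuffling is the assumed partitioning strategy here; the learning step performed on each set is omitted.

```python
# Hedged sketch of building balanced sample sets from a highly imbalanced dataset.
import numpy as np

def build_balanced_sets(X_major, X_minor, seed=0):
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X_major))
    block = len(X_minor)
    sets = []
    for start in range(0, len(X_major) - block + 1, block):
        blk = X_major[idx[start:start + block]]               # one majority block
        X = np.vstack([blk, X_minor])                         # balanced subset
        y = np.concatenate([np.zeros(len(blk)), np.ones(len(X_minor))])
        sets.append((X, y))
    return sets
```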

IJCAI Conference 2016 Conference Paper

A Self-Representation Induced Classifier

  • Pengfei Zhu
  • Lei Zhang
  • Wangmeng Zuo
  • Xiangchu Feng
  • Qinghua Hu

Almost all existing representation based classifiers represent a query sample as a linear combination of training samples, and their time and memory cost increase rapidly with the number of training samples. We investigate the representation based classification problem from a rather different perspective in this paper: we learn how each feature (i.e., each element) of a sample can be represented by the features of the sample itself. Such a self-representation property of sample features can be readily employed for pattern classification, and a novel self-representation induced classifier (SRIC) is proposed. SRIC learns a self-representation matrix for each class. Given a query sample, its self-representation residual can be computed with each of the learned self-representation matrices, and classification can then be performed by comparing these residuals. In light of the principle of SRIC, a discriminative SRIC (DSRIC) method is developed. For each class, a discriminative self-representation matrix is trained to minimize the self-representation residual of this class while representing the features of other classes poorly. Experimental results on different pattern recognition tasks show that DSRIC achieves comparable or superior recognition rates to state-of-the-art representation based classifiers, yet it is much more efficient and needs much less storage space.
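
The self-representation idea described above can be sketched with a ridge-regularized least-squares fit per class: a matrix W_c reconstructs each sample's features from the sample's own features, and a query goes to the class with the smallest self-representation residual. The regularized objective below is an assumed simplification of the paper's formulation.

```python
# Hedged sketch of per-class self-representation matrices and residual-based classification.
import numpy as np

def fit_self_representation(X, lam=1e-1):
    # X: (n_samples, n_features); W solves min ||X - X W||_F^2 + lam ||W||_F^2.
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ X)

def classify(x, class_mats):
    # Assign the query to the class whose matrix reconstructs it best.
    residuals = [np.linalg.norm(x - x @ W) for W in class_mats]
    return int(np.argmin(residuals))

rng = np.random.default_rng(0)
class_mats = [fit_self_representation(rng.normal(size=(50, 8)) + c) for c in range(3)]
print(classify(rng.normal(size=8) + 1.0, class_mats))
```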

JBHI Journal 2016 Journal Article

Comparison of Three Different Types of Wrist Pulse Signals by Their Physical Meanings and Diagnosis Performance

  • Wangmeng Zuo
  • Peng Wang
  • David Zhang

Increasing interest has been focused on computational pulse diagnosis, where sensors are developed to acquire pulse signals and machine learning techniques are exploited to analyze health conditions based on the acquired signals. By far, a number of sensors have been employed for pulse signal acquisition, which can be grouped into three major categories, i.e., pressure, photoelectric, and ultrasonic sensors. To guide sensor selection for computational pulse diagnosis, in this paper we analyze the physical meanings and sensitivities of the signals acquired by these three types of sensors. The dependence and complementarity of the different sensors are discussed from both the perspective of cardiovascular fluid dynamics and comparative experiments evaluating disease classification performance. Experimental results indicate that each sensor is more appropriate for diagnosing the specific diseases whose physiological changes it can effectively reflect, e.g., the ultrasonic sensor for diabetes and the pressure sensor for arteriosclerosis, and that improved diagnosis performance can be obtained by combining the three types of signals.

AAAI Conference 2016 Conference Paper

Coupled Dictionary Learning for Unsupervised Feature Selection

  • Pengfei Zhu
  • Qinghua Hu
  • Changqing Zhang
  • Wangmeng Zuo

Unsupervised feature selection (UFS) aims to reduce the time complexity and storage burden as well as improve generalization performance. Most existing methods convert UFS into a supervised learning problem by generating labels with specific techniques (e.g., spectral analysis, matrix factorization, and linear predictors). Instead, we propose a novel coupled analysis–synthesis dictionary learning method, which is free of label generation. The representation coefficients are used to model the cluster structure and data distribution. Specifically, the synthesis dictionary is used to reconstruct samples, while the analysis dictionary analytically codes the samples and assigns probabilities to them. Afterwards, the analysis dictionary is used to select features that can well preserve the data distribution. An effective L2,p-norm (0 < p ≤ 1) regularization is imposed on the analysis dictionary to obtain a sparser solution that is more effective for feature selection. We propose an iterative reweighted least squares algorithm to solve the L2,p-norm optimization problem and prove that it converges to a fixed point. Experiments on benchmark datasets validate the effectiveness of the proposed method.
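
The solver mentioned above is in the iteratively reweighted least squares (IRLS) family; a generic sketch for an L2,p-regularized least-squares problem is given below. The coupled-dictionary terms of the actual objective are omitted, so this only illustrates the reweighting scheme.

```python
# Hedged sketch of IRLS for min_W ||Y - X W||_F^2 + lam * sum_i ||w_i||_2^p.
import numpy as np

def irls_l2p(X, Y, lam=0.1, p=0.5, iters=30, eps=1e-6):
    # X: (n, d) data, Y: (n, c) targets; W: (d, c) row-sparse solution.
    d = X.shape[1]
    W = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)   # ridge initialization
    for _ in range(iters):
        row_norms = np.linalg.norm(W, axis=1) + eps
        D = np.diag((p / 2.0) * row_norms ** (p - 2.0))        # IRLS reweighting matrix
        W = np.linalg.solve(X.T @ X + lam * D, X.T @ Y)
        # Rows with small norm get heavily penalized in the next round,
        # driving the row-sparsity used for feature selection/ranking.
    return W   # rank features by np.linalg.norm(W, axis=1)
```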

AAAI Conference 2016 Conference Paper

Two-Stream Contextualized CNN for Fine-Grained Image Classification

  • Jiang Liu
  • Chenqiang Gao
  • Deyu Meng
  • Wangmeng Zuo

The human cognition system suggests that context information provides a potentially powerful clue when recognizing objects. However, for fine-grained image classification, the contribution of context may vary over different images, and sometimes the context even confuses the classification result. To alleviate this problem, we develop a novel approach, the two-stream contextualized Convolutional Neural Network, which provides a simple but efficient context–content joint classification model under the deep learning framework. The network merely requires the raw image and a coarse segmentation as input to extract both content and context features, without the need for human interaction. Moreover, our network adopts a weighted fusion scheme to combine the content and context classifiers, while a subnetwork is introduced to adaptively determine the weight for each image. According to our experiments on public datasets, our approach achieves considerably high recognition accuracy without any tedious human involvement, as compared with state-of-the-art approaches.

NeurIPS Conference 2014 Conference Paper

Deep Joint Task Learning for Generic Object Extraction

  • Xiaolong Wang
  • Liliang Zhang
  • Liang Lin
  • Zhujin Liang
  • Wangmeng Zuo

This paper investigates how to extract objects of interest without relying on hand-crafted features and sliding-window approaches, aiming to jointly solve two sub-tasks: (i) rapidly localizing salient objects from images, and (ii) accurately segmenting the objects based on the localizations. We present a general joint task learning framework, in which each task (either object localization or object segmentation) is tackled via a multi-layer convolutional neural network, and the two networks work collaboratively to boost performance. In particular, we propose to incorporate latent variables bridging the two networks in a joint optimization manner. The first network directly predicts the positions and scales of salient objects from raw images, and the latent variables adjust the object localizations to feed the second network, which produces pixelwise object masks. An EM-type method is then studied for the joint optimization, iterating over two steps: (i) using the two networks, it estimates the latent variables with an MCMC-based sampling method; (ii) it optimizes the parameters of the two networks jointly via back-propagation, with the latent variables fixed. Extensive experiments demonstrate that our joint learning framework significantly outperforms other state-of-the-art approaches in both accuracy and efficiency (e.g., 1000 times faster than competing approaches).

NeurIPS Conference 2014 Conference Paper

Projective dictionary pair learning for pattern classification

  • Shuhang Gu
  • Lei Zhang
  • Wangmeng Zuo
  • Xiangchu Feng

Discriminative dictionary learning (DL) has been widely studied in various pattern classification problems. Most existing DL methods aim to learn a synthesis dictionary to represent the input signal while enforcing the representation coefficients and/or representation residual to be discriminative. However, the $\ell_0$- or $\ell_1$-norm sparsity constraint on the representation coefficients adopted in many DL methods makes the training and testing phases time-consuming. We propose a new discriminative DL framework, namely projective dictionary pair learning (DPL), which learns a synthesis dictionary and an analysis dictionary jointly to achieve the goal of signal representation and discrimination. Compared with conventional DL methods, the proposed DPL method can not only greatly reduce the time complexity of the training and testing phases, but also lead to very competitive accuracies in a variety of visual classification tasks.
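
The classification rule of a dictionary pair can be sketched as follows: per class, a synthesis dictionary D_k and an analysis dictionary P_k are chosen so that D_k P_k x reconstructs same-class samples well, and a query is assigned to the class with the smallest reconstruction residual. Below, a PCA-based pair is an assumed stand-in for the learned DPL dictionaries, and only the classification step is mimicked, not DPL training.

```python
# Hedged sketch of residual-based classification with per-class dictionary pairs.
import numpy as np

def fit_pair(X, n_atoms=5):
    # X: (n_samples, n_features) for one class; D: (d, r), P: (r, d).
    mu = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mu, full_matrices=False)
    D, P = Vt[:n_atoms].T, Vt[:n_atoms]      # PCA-based stand-in for a learned pair
    return D, P, mu

def classify(x, pairs):
    # Smallest residual ||x - D_k P_k x|| (after centering) wins.
    residuals = [np.linalg.norm((x - mu) - D @ (P @ (x - mu)))
                 for D, P, mu in pairs]
    return int(np.argmin(residuals))
```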

ICML Conference 2014 Conference Paper

Robust Principal Component Analysis with Complex Noise

  • Qian Zhao 0002
  • Deyu Meng
  • Zongben Xu
  • Wangmeng Zuo
  • Lei Zhang 0006

Research on robust principal component analysis (RPCA) has been attracting much attention recently. The original RPCA model assumes sparse noise and uses the L_1-norm to characterize the error term. In practice, however, the noise is much more complex, and it is not appropriate to simply use a certain L_p-norm for noise modeling. We propose a generative RPCA model under the Bayesian framework by modeling the data noise as a mixture of Gaussians (MoG). The MoG is a universal approximator of continuous distributions, and thus our model is able to fit a wide range of noises such as Laplacian, Gaussian, sparse noise, and any combinations of them. A variational Bayes algorithm is presented to infer the posterior of the proposed model. All involved parameters can be recursively updated in closed form. The advantage of our method is demonstrated by extensive experiments on synthetic data, face modeling, and background subtraction.