Author name cluster

Yifan Jiang

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

12 papers

2 author rows

AAAI Conference 2025 Conference Paper

COLUMBUS: Evaluating COgnitive Lateral Understanding Through Multiple-Choice reBUSes

Koen Kraaijveld
Yifan Jiang
Kaixin Ma
Filip Ilievski

While visual question-answering (VQA) benchmarks have catalyzed the development of reasoning techniques, they have focused on vertical thinking. Effective problem-solving also necessitates lateral thinking, which remains understudied in AI and has not been used to test visual perception systems. To bridge this gap, we formulate visual lateral thinking as a multiple-choice question-answering task and describe a three-step taxonomy-driven methodology for instantiating task examples. Then, we develop COLUMBUS, a synthetic benchmark that applies the task pipeline to create QA sets with text and icon rebus puzzles based on publicly available collections of compounds and common phrases. COLUMBUS comprises over 1,000 puzzles, each with four answer candidates. While the SotA vision language models (VLMs) achieve decent performance, our evaluation demonstrates a substantial gap between humans and models. VLMs benefit from human-curated descriptions but struggle to self-generate such representations at the right level of abstraction.

PDF Details DOI

NeurIPS Conference 2025 Conference Paper

SRPO: Enhancing Multimodal LLM Reasoning via Reflection-Aware Reinforcement Learning

Zhongwei Wan
Zhihao Dou
Che Liu
Yu Zhang
Dongfei Cui
Qinjian Zhao
Hui Shen
Jing Xiong

Multimodal large language models (MLLMs) have shown promising capabilities in reasoning tasks, yet still struggle significantly with complex problems requiring explicit self-reflection and self-correction, especially compared to their unimodal text-based counterparts. Existing reflection methods are simplistic and struggle to generate meaningful, instructive feedback, as the reasoning ability and knowledge limits of pre-trained models are largely fixed during initial training. To overcome these challenges, we propose multimodal Self-Reflection enhanced reasoning with Group Relative Policy Optimization (SRPO), a two-stage reflection-aware reinforcement learning (RL) framework explicitly designed to enhance multimodal LLM reasoning. In the first stage, we construct a high-quality, reflection-focused dataset under the guidance of an advanced MLLM, which generates reflections based on initial responses to help the policy model to learn both reasoning and self-reflection. In the second stage, we introduce a novel reward mechanism within the GRPO framework that encourages concise and cognitively meaningful reflection while avoiding redundancy. Extensive experiments across multiple multimodal reasoning benchmarks (including MathVista, MathVision, Mathverse, and MMMU-Pro) using Qwen-2.5-VL-7B and Qwen-2.5-VL-32B demonstrate that SRPO significantly outperforms state-of-the-art models, achieving notable improvements in both reasoning accuracy and reflection quality.
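
Illustrative note, not taken from the paper: the abstract above builds its reward inside the GRPO framework. The sketch below shows only the generic group-relative step of GRPO, where each sampled response's reward is normalized against the other responses drawn for the same prompt. The function name, the `eps` constant, and the use of the population standard deviation are assumptions for illustration; the paper's reflection-aware reward is not reproduced here.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own sampling group.

    Assumption: population standard deviation; implementations may
    use the sample standard deviation instead.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps guards against division by zero when all rewards are equal.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four responses sampled for one prompt.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Responses that score above their group's mean receive a positive advantage and those below receive a negative one, so no separate value network is needed to form the baseline.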

PDF Details

IROS Conference 2025 Conference Paper

ZS-Puffin: Design, Modeling and Implementation of an Unmanned Aerial-Aquatic Vehicle with Amphibious Wings

Zhenjiang Wang
Yunhua Jiang
Zikun Zhen
Yifan Jiang
Yubin Tan
Wubin Wang

Unmanned aerial-aquatic vehicles (UAAVs) can operate both in the air and underwater, giving them broad application prospects. Inspired by the dual-function wings of puffins, we propose a UAAV with amphibious wings to address the challenge posed by medium differences on the vehicle’s propulsion system. The amphibious wing, redesigned based on a fixed-wing structure, features a single degree of freedom in pitch and requires no additional components. It can generate lift in the air and function as a flapping wing for propulsion underwater, reducing disturbance to marine life and making it environmentally friendly. Additionally, an artificial central pattern generator (CPG) is introduced to enhance the smoothness of the flapping motion. This paper presents the prototype, design details, and practical implementation of this concept.

Details

NeurIPS Conference 2024 Conference Paper

MARVEL: Multidimensional Abstraction and Reasoning through Visual Evaluation and Learning

Yifan Jiang
Jiarui Zhang
Kexuan Sun
Zhivar Sourati
Kian Ahrabian
Kaixin Ma
Filip Ilievski
Jay Pujara

While multi-modal large language models (MLLMs) have shown significant progress across popular visual reasoning benchmarks, whether they possess abstract visual reasoning abilities remains an open question. Similar to Sudoku puzzles, abstract visual reasoning (AVR) problems require finding high-level patterns (e.g., repetition constraints on numbers) that control the input shapes (e.g., digits) in a specific task configuration (e.g., matrix). However, existing AVR benchmarks only consider a limited set of patterns (addition, conjunction), input shapes (rectangle, square), and task configurations (3 × 3 matrices). They also fail to capture all abstract reasoning patterns in human cognition necessary for addressing real-world tasks, such as geometric properties and object boundary understanding in real-world navigation. To evaluate MLLMs’ AVR abilities systematically, we introduce MARVEL founded on the core knowledge system in human cognition, a multi-dimensional AVR benchmark with 770 puzzles composed of six core knowledge patterns, geometric and abstract shapes, and five different task configurations. To inspect whether the model performance is grounded in perception or reasoning, MARVEL complements the standard AVR question with perception questions in a hierarchical evaluation framework. We conduct comprehensive experiments on MARVEL with ten representative MLLMs in zero-shot and few-shot settings. Our experiments reveal that all MLLMs show near-random performance on MARVEL, with significant performance gaps (40%) compared to humans across all patterns and task configurations. Further analysis of perception questions reveals that MLLMs struggle to comprehend the visual features (near-random performance). Although closed-source MLLMs, such as GPT-4V, show a promising understanding of reasoning patterns (on par with humans) after adding textual descriptions, this advantage is hindered by their weak perception abilities. We release our entire code and dataset at https://github.com/1171-jpg/MARVEL_AVR.

PDF Details DOI

TMLR Journal 2023 Journal Article

Chasing Better Deep Image Priors between Over- and Under-parameterization

Qiming Wu
Xiaohan Chen
Yifan Jiang
Zhangyang Wang

Deep Neural Networks (DNNs) are well-known to act as over-parameterized deep image priors (DIP) that regularize various image inverse problems. Meanwhile, researchers also proposed extremely compact, under-parameterized image priors (e.g., deep decoder) that are strikingly competent for image restoration too, despite a loss of accuracy. These two extremes push us to think whether there exists a better solution in the middle: between over- and under-parameterized image priors, can one identify "intermediate" parameterized image priors that achieve better trade-offs between performance, efficiency, and even preserving strong transferability? Drawing inspiration from the lottery ticket hypothesis (LTH), we conjecture and study a novel "lottery image prior" (LIP) by exploiting DNN inherent sparsity, stated as: given an over-parameterized DNN-based image prior, it will contain a sparse subnetwork that can be trained in isolation, to match the original DNN's performance when being applied as a prior to various image inverse problems. Our results validate the superiority of LIPs: we can successfully locate the LIP subnetworks from over-parameterized DIPs at substantial sparsity ranges. Those LIP subnetworks significantly outperform deep decoders under comparably compact model sizes (by often fully preserving the effectiveness of their over-parameterized counterparts), and they also possess high transferability across different images as well as restoration task types. Besides, we also extend LIP to compressive sensing image reconstruction, where a pre-trained GAN generator is used as the prior (in contrast to untrained DIP or deep decoder), and confirm its validity in this setting too. To the best of our knowledge, this is the first time that LTH is demonstrated to be relevant in the context of inverse problems or image priors. Codes are available at https://github.com/VITA-Group/Chasing-Better-DIPs.

PDF Details

NeurIPS Conference 2023 Conference Paper

In-Context Learning Unlocked for Diffusion Models

Zhendong Wang
Yifan Jiang
Yadong Lu
Yelong Shen
Pengcheng He
Weizhu Chen
Zhangyang "Atlas" Wang
Mingyuan Zhou

We present Prompt Diffusion, a framework for enabling in-context learning in diffusion-based generative models. Given a pair of task-specific example images, such as depth from/to image and scribble from/to image, and a text guidance, our model automatically understands the underlying task and performs the same task on a new query image following the text guidance. To achieve this, we propose a vision-language prompt that can model a wide range of vision-language tasks and a diffusion model that takes it as input. The diffusion model is trained jointly on six different tasks using these prompts. The resulting Prompt Diffusion model becomes the first diffusion-based vision-language foundation model capable of in-context learning. It demonstrates high-quality in-context generation for the trained tasks and effectively generalizes to new, unseen vision tasks using their respective prompts. Our model also shows compelling text-guided image editing results. Our framework aims to facilitate research into in-context learning for computer vision. We share our code and pre-trained models at https://github.com/Zhendong-Wang/Prompt-Diffusion.

PDF Details

NeurIPS Conference 2023 Conference Paper

Patch Diffusion: Faster and More Data-Efficient Training of Diffusion Models

Zhendong Wang
Yifan Jiang
Huangjie Zheng
Peihao Wang
Pengcheng He
Zhangyang "Atlas" Wang
Weizhu Chen
Mingyuan Zhou

Diffusion models are powerful, but they require a lot of time and data to train. We propose Patch Diffusion, a generic patch-wise training framework, to significantly reduce the training time costs while improving data efficiency, which thus helps democratize diffusion model training to broader users. At the core of our innovations is a new conditional score function at the patch level, where the patch location in the original image is included as additional coordinate channels, while the patch size is randomized and diversified throughout training to encode the cross-region dependency at multiple scales. Sampling with our method is as easy as in the original diffusion model. Through Patch Diffusion, we could achieve $\ge 2\times$ faster training, while maintaining comparable or better generation quality. Patch Diffusion meanwhile improves the performance of diffusion models trained on relatively small datasets, e.g., as few as 5,000 images to train from scratch. We achieve outstanding FID scores in line with state-of-the-art benchmarks: 1.77 on CelebA-64$\times$64, 1.93 on AFHQv2-Wild-64$\times$64, and 2.72 on ImageNet-256$\times$256. We share our code and pre-trained models at https://github.com/Zhendong-Wang/Patch-Diffusion.
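
Illustrative note, not taken from the paper's code: the abstract above conditions on patch location by appending coordinate channels and randomizes the patch size. A minimal sketch of that data-preparation idea follows. The function name, the nested-list image layout, the choice of normalizing coordinates to [-1, 1], and the candidate sizes `[2, 4, 8]` are all assumptions made for illustration.

```python
import random


def random_patch_with_coords(image, patch_size, rng=random):
    """Crop a square patch and append two coordinate channels.

    image: list of rows; each row is a list of pixels; each pixel is a
    list of channel values. Coordinates are normalized to [-1, 1]
    relative to the full image (an assumed convention).
    """
    height, width = len(image), len(image[0])
    # randint is inclusive at both ends, so the patch always fits.
    top = rng.randint(0, height - patch_size)
    left = rng.randint(0, width - patch_size)
    patch = []
    for i in range(top, top + patch_size):
        row = []
        for j in range(left, left + patch_size):
            y = 2.0 * i / (height - 1) - 1.0
            x = 2.0 * j / (width - 1) - 1.0
            # Original channels followed by the (x, y) location channels.
            row.append(list(image[i][j]) + [x, y])
        patch.append(row)
    return patch


rng = random.Random(0)
# Hypothetical 8x8 RGB image filled with zeros.
image = [[[0.0, 0.0, 0.0] for _ in range(8)] for _ in range(8)]
patch_size = rng.choice([2, 4, 8])  # hypothetical size schedule
patch = random_patch_with_coords(image, patch_size, rng)
```

Because the location channels are expressed in full-image coordinates, a network trained on such patches can in principle tell where in the image a patch came from, whatever its size.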

PDF Details

NeurIPS Conference 2023 Conference Paper

Wasserstein distributional robustness of neural networks

Xingjian Bai
Guangyi He
Yifan Jiang
Jan Obloj

Deep neural networks are known to be vulnerable to adversarial attacks (AA). For an image recognition task, this means that a small perturbation of the original can result in the image being misclassified. Design of such attacks as well as methods of adversarial training against them are the subject of intense research. We re-cast the problem using techniques of Wasserstein distributionally robust optimization (DRO) and obtain novel contributions leveraging recent insights from DRO sensitivity analysis. We consider a set of distributional threat models. Unlike the traditional pointwise attacks, which assume a uniform bound on perturbation of each input data point, distributional threat models allow attackers to perturb inputs in a non-uniform way. We link these more general attacks with questions of out-of-sample performance and Knightian uncertainty. To evaluate the distributional robustness of neural networks, we propose a first-order AA algorithm and its multistep version. Our attack algorithms include Fast Gradient Sign Method (FGSM) and Projected Gradient Descent (PGD) as special cases. Furthermore, we provide a new asymptotic estimate of the adversarial accuracy against distributional threat models. The bound is fast to compute and first-order accurate, offering new insights even for the pointwise AA. It also naturally yields out-of-sample performance guarantees. We conduct numerical experiments on the CIFAR-10, CIFAR-100, and ImageNet datasets using DNNs on RobustBench to illustrate our theoretical results. Our code is available at https://github.com/JanObloj/W-DRO-Adversarial-Methods.
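
Illustrative note, not taken from the paper's code: the abstract above names the Fast Gradient Sign Method (FGSM) as a special case of its attacks. The sketch below shows only the standard pointwise FGSM step, x_adv = x + eps * sign(grad_x loss), on a toy logistic-regression classifier whose input gradient can be written by hand. The model, weights, and input are hypothetical, and the paper's distributional attack is not reproduced here.

```python
import math


def sign(v):
    """Return -1, 0, or 1 according to the sign of v."""
    return (v > 0) - (v < 0)


def logistic_loss_grad_x(w, b, x, y):
    """Gradient of binary cross-entropy w.r.t. the input x, for y in {0, 1}."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    p = 1.0 / (1.0 + math.exp(-z))
    # dL/dz = p - y and dz/dx = w, so dL/dx = (p - y) * w.
    return [(p - y) * wi for wi in w]


def fgsm(w, b, x, y, eps):
    """One FGSM step: move each coordinate by eps in the loss-increasing direction."""
    grad = logistic_loss_grad_x(w, b, x, y)
    return [xi + eps * sign(gi) for xi, gi in zip(x, grad)]


# Hypothetical toy classifier and input.
w, b = [2.0, -1.0], 0.0
x, y = [0.5, 0.5], 1
x_adv = fgsm(w, b, x, y, eps=0.1)
```

Every coordinate moves by exactly `eps`, which is the uniform per-input bound that the abstract contrasts with non-uniform distributional perturbations.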

PDF Details

NeurIPS Conference 2022 Conference Paper

Signal Processing for Implicit Neural Representations

Dejia Xu
Peihao Wang
Yifan Jiang
Zhiwen Fan
Zhangyang Wang

Implicit Neural Representations (INRs) encoding continuous multi-media data via multi-layer perceptrons have shown undebatable promise in various computer vision tasks. Despite many successful applications, editing and processing an INR remains intractable as signals are represented by latent parameters of a neural network. Existing works manipulate such continuous representations via processing on their discretized instance, which breaks down the compactness and continuous nature of INR. In this work, we present a pilot study on the question: how to directly modify an INR without explicit decoding? We answer this question by proposing an implicit neural signal processing network, dubbed INSP-Net, via differential operators on INR. Our key insight is that spatial gradients of neural networks can be computed analytically and are invariant to translation, while mathematically we show that any continuous convolution filter can be uniformly approximated by a linear combination of high-order differential operators. With these two knobs, INSP-Net instantiates the signal processing operator as a weighted composition of computational graphs corresponding to the high-order derivatives of INRs, where the weighting parameters can be data-driven learned. Based on our proposed INSP-Net, we further build the first Convolutional Neural Network (CNN) that implicitly runs on INRs, named INSP-ConvNet. Our experiments validate the expressiveness of INSP-Net and INSP-ConvNet in fitting low-level image and geometry processing kernels (e.g., blurring, deblurring, denoising, inpainting, and smoothening) as well as for high-level tasks on implicit fields such as image classification.

PDF Details

JBHI Journal 2021 Journal Article

COVID-19 CT Image Synthesis With a Conditional Generative Adversarial Network

Yifan Jiang
Han Chen
Murray Loew
Hanseok Ko

Coronavirus disease 2019 (COVID-19) is an ongoing global pandemic that has spread rapidly since December 2019. Real-time reverse transcription polymerase chain reaction (rRT-PCR) and chest computed tomography (CT) imaging both play an important role in COVID-19 diagnosis. Chest CT imaging offers the benefits of quick reporting, a low cost, and high sensitivity for the detection of pulmonary infection. Recently, deep-learning-based computer vision methods have demonstrated great promise for use in medical imaging applications, including X-rays, magnetic resonance imaging, and CT imaging. However, training a deep-learning model requires large volumes of data, and medical staff face a high risk when collecting COVID-19 CT data due to the high infectivity of the disease. Another issue is the lack of experts available for data labeling. In order to meet the data requirements for COVID-19 CT imaging, we propose a CT image synthesis approach based on a conditional generative adversarial network that can effectively generate high-quality and realistic COVID-19 CT images for use in deep-learning-based medical imaging tasks. Experimental results show that the proposed method outperforms other state-of-the-art image synthesis methods with the generated COVID-19 CT images and shows promise for various machine learning applications, including semantic segmentation and classification.

Details DOI

NeurIPS Conference 2021 Conference Paper

IA-RED$^2$: Interpretability-Aware Redundancy Reduction for Vision Transformers

Bowen Pan
Rameswar Panda
Yifan Jiang
Zhangyang Wang
Rogerio Feris
Aude Oliva

The self-attention-based model, transformer, is recently becoming the leading backbone in the field of computer vision. In spite of the impressive success made by transformers in a variety of vision tasks, it still suffers from heavy computation and intensive memory costs. To address this limitation, this paper presents an Interpretability-Aware REDundancy REDuction framework (IA-RED$^2$). We start by observing a large amount of redundant computation, mainly spent on uncorrelated input patches, and then introduce an interpretable module to dynamically and gracefully drop these redundant patches. This novel framework is then extended to a hierarchical structure, where uncorrelated tokens at different stages are gradually removed, resulting in a considerable shrinkage of computational cost. We include extensive experiments on both image and video tasks, where our method could deliver up to 1.4x speed-up for state-of-the-art models like DeiT and TimeSformer, by only sacrificing less than 0.7% accuracy. More importantly, contrary to other acceleration approaches, our method is inherently interpretable with substantial visual evidence, making vision transformer closer to a more human-understandable architecture while being lighter. We demonstrate that the interpretability that naturally emerged in our framework can outperform the raw attention learned by the original visual transformer, as well as those generated by off-the-shelf interpretation methods, with both qualitative and quantitative results. Project Page: http://people.csail.mit.edu/bpan/ia-red/.

PDF Details

NeurIPS Conference 2021 Conference Paper

TransGAN: Two Pure Transformers Can Make One Strong GAN, and That Can Scale Up

Yifan Jiang
Shiyu Chang
Zhangyang Wang

The recent explosive interest in transformers has suggested their potential to become powerful "universal" models for computer vision tasks, such as classification, detection, and segmentation. While those attempts mainly study the discriminative models, we explore transformers on some more notoriously difficult vision tasks, e.g., generative adversarial networks (GANs). Our goal is to conduct the first pilot study in building a GAN completely free of convolutions, using only pure transformer-based architectures. Our vanilla GAN architecture, dubbed TransGAN, consists of a memory-friendly transformer-based generator that progressively increases feature resolution, and correspondingly a multi-scale discriminator to capture simultaneously semantic contexts and low-level textures. On top of them, we introduce the new module of grid self-attention for alleviating the memory bottleneck further, in order to scale up TransGAN to high-resolution generation. We also develop a unique training recipe including a series of techniques that can mitigate the training instability issues of TransGAN, such as data augmentation, modified normalization, and relative position encoding. Our best architecture achieves highly competitive performance compared to current state-of-the-art GANs using convolutional backbones. Specifically, TransGAN sets new state-of-the-art inception score of 10.43 and FID of 18.28 on STL-10. It also reaches the inception score of 9.02 and FID of 9.26 on CIFAR-10, and 5.28 FID on CelebA $128 \times 128$, respectively: both on par with the current best results and outperforming StyleGAN-V2. When it comes to higher-resolution (e.g., $256 \times 256$) generation tasks, such as on CelebA-HQ and LSUN-Church, TransGAN continues to produce diverse visual examples with high fidelity and impressive texture details. In addition, we dive deep into the transformer-based generation models to understand how their behaviors differ from convolutional ones, by visualizing training dynamics. The code is available at: https://github.com/VITA-Group/TransGAN.

PDF Details