Arrow Research search

Author name cluster

Lijuan Wang

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

44 papers
2 author rows

Possible papers (44)

AAAI Conference 2026 Conference Paper

Towards Zero-Shot Diabetic Retinopathy Grading: Learning Generalized Knowledge via Prompt-Driven Matching and Emulating

  • Huan Wang
  • Haoran Li
  • Yuxin Lin
  • Huaming Chen
  • Jun Yan
  • Lijuan Wang
  • Jiahua Shi
  • Qihao Xu

As one of the primary causes of visual impairment, Diabetic Retinopathy (DR) requires accurate and robust grading to facilitate timely diagnosis and intervention. Different from conventional DR grading methods that utilize single-view images, recent clinical studies have revealed that multi-view fundus images can significantly enhance DR grading performance by expanding the field of view (FOV). However, there is a long-tailed distribution problem in fundus image analysis, i.e., a high prevalence of mild DR grades and a low prevalence of rare ones (e.g., cases of high severity), which presents a significant challenge to developing a unified model capable of detecting rare or unseen DR grades not encountered during training. In this paper, we propose ProME-DR, a Prompt-driven zero-shot DR grading framework, which leverages prompt Matching and Emulating to recognize unseen DR categories and views beyond the training set. ProME-DR disentangles the training process into two stages to learn generalized knowledge for novel DR disease grading. Initially, ProME-DR leverages two sets of prompt units to capture semantic and inter-view consistency knowledge in a split-and-mask manner, gathering instance-level DR visual clues. Subsequently, it constructs a concept-aware emulator to generate context prompt units, linking extensible knowledge learned from previously seen DR attributes for zero-shot DR grading. Extensive experiments conducted on eight datasets and various scenarios confirm the superiority of ProME-DR.

EAAI Journal 2025 Journal Article

An adaptive traffic signal control scheme with Proximal Policy Optimization based on deep reinforcement learning for a single intersection

  • Lijuan Wang
  • Guoshan Zhang
  • Qiaoli Yang
  • Tianyang Han

Adaptive traffic signal control (ATSC) is an important means to alleviate traffic congestion and improve the quality of road traffic. Although deep reinforcement learning (DRL) technology has shown great potential in solving traffic signal control (TSC) problems, the state representation, reward design, and action interval time still need further study, and the advantages of policy learning have not been fully applied to TSC. To address these issues, we propose a DRL-based traffic signal control scheme with Proximal Policy Optimization (PPO-TSC). We use the waiting time of vehicles and the queue length of lanes, which represent the spatiotemporal characteristics of traffic flow, to design simplified traffic state feature vectors, and define a reward function consistent with the state. Additionally, we compare and analyze the performance indexes obtained by various methods using action intervals of 5s, 10s, and 15s. The algorithm is implemented on the Actor-Critic architecture, using advantage estimation and the clip mechanism to constrain the range of gradient updates. We validate the proposed scheme at a single intersection in Simulation of Urban MObility (SUMO) under two different traffic demand patterns: flat traffic and peak traffic. The experimental results show that the proposed method is significantly better than the compared methods. Specifically, PPO-TSC demonstrates a reduction of 24% in average travel time (ATT), a decrease of 45% in average time loss (ATL), and an increase of 16% in average speed (AS) compared with existing methods under peak traffic conditions.
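
The clip mechanism mentioned in this abstract is the standard PPO clipped surrogate objective. A minimal sketch of that objective (generic PPO, not the authors' released code; all names are illustrative):

```python
import torch

def ppo_clip_loss(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    """Clipped PPO surrogate, as used in PPO-style TSC schemes.

    new_log_probs / old_log_probs: log pi(a|s) under the current and
    behavior policies; advantages: advantage estimates A(s, a).
    """
    ratio = torch.exp(new_log_probs - old_log_probs)  # pi_new / pi_old
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    # Maximize the clipped surrogate, i.e., minimize its negation.
    return -torch.min(unclipped, clipped).mean()
```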

ICML Conference 2025 Conference Paper

Can MLLMs Reason in Multimodality? EMMA: An Enhanced MultiModal ReAsoning Benchmark

  • Yunzhuo Hao
  • Jiawei Gu
  • Huichen Will Wang
  • Linjie Li
  • Zhengyuan Yang
  • Lijuan Wang
  • Yu Cheng 0001

The ability to organically reason over and with both text and images is a pillar of human intelligence, yet the ability of Multimodal Large Language Models (MLLMs) to perform such multimodal reasoning remains under-explored. Existing benchmarks often emphasize text-dominant reasoning or rely on shallow visual cues, failing to adequately assess integrated visual and textual reasoning. We introduce EMMA (Enhanced MultiModal reAsoning), a benchmark targeting organic multimodal reasoning across mathematics, physics, chemistry, and coding. EMMA tasks demand advanced cross-modal reasoning that cannot be addressed by reasoning independently in each modality, offering an enhanced test suite for MLLMs’ reasoning capabilities. Our evaluation of state-of-the-art MLLMs on EMMA reveals significant limitations in handling complex multimodal and multi-step reasoning tasks; even advanced techniques such as Chain-of-Thought prompting and test-time compute scaling underperform. These findings underscore the need for improved multimodal architectures and training paradigms to close the gap between human and model reasoning in multimodality.

ICLR Conference 2025 Conference Paper

CertainlyUncertain: A Benchmark and Metric for Multimodal Epistemic and Aleatoric Awareness

  • Khyathi Raghavi Chandu
  • Linjie Li
  • Anas Awadalla
  • Ximing Lu
  • Jae Sung Park
  • Jack Hessel
  • Lijuan Wang
  • Yejin Choi 0001

The ability to acknowledge the inevitable uncertainty in their knowledge and reasoning is a prerequisite for AI systems to be truly truthful and reliable. In this paper, we present a taxonomy of uncertainty specific to vision-language AI systems, distinguishing between epistemic uncertainty (arising from a lack of information) and aleatoric uncertainty (due to inherent unpredictability), and further explore finer categories within. Based on this taxonomy, we synthesize a benchmark dataset, CertainlyUncertain, featuring 178K visual question answering (VQA) samples as contrastive pairs. This is achieved by 1) inpainting images to make previously answerable questions into unanswerable ones; and 2) using image captions to prompt large language models for both answerable and unanswerable questions. Additionally, we introduce a new metric, confidence-weighted accuracy, which is well correlated with both accuracy and calibration error, to address the shortcomings of existing metrics. Despite the recent rapid progress in vision-language models (VLMs), evaluations on our benchmark show that they perform poorly in uncertain scenarios. Further experiments demonstrate that supervised fine-tuning with CertainlyUncertain enhances the performance of VLMs, and reduces the calibration error. These improvements extend beyond our benchmark to existing refusal-oriented datasets and show positive results on reducing hallucinations, while maintaining performance on standard VQA benchmarks. Our work underscores the importance of addressing uncertainty in vision-language AI systems to improve their reliability and trustworthiness in real-world applications.
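
The abstract does not spell out the formula for confidence-weighted accuracy; one plausible instantiation that rewards calibrated confidence (a schematic assumption, not the paper's definition) is:

```python
def confidence_weighted_accuracy(preds, labels, confidences):
    """Illustrative confidence-weighted accuracy: correct answers earn
    credit proportional to the model's confidence, wrong answers earn
    credit proportional to (1 - confidence), so well-calibrated
    uncertainty is rewarded.  The paper's exact formulation may differ.
    """
    assert len(preds) == len(labels) == len(confidences)
    total = sum(c if p == y else 1.0 - c
                for p, y, c in zip(preds, labels, confidences))
    return total / len(preds)
```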

ICLR Conference 2025 Conference Paper

EditRoom: LLM-parameterized Graph Diffusion for Composable 3D Room Layout Editing

  • Kaizhi Zheng
  • Xiaotong Chen
  • Xuehai He
  • Jing Gu
  • Linjie Li
  • Zhengyuan Yang
  • Kevin Lin
  • Jianfeng Wang

Given the steep learning curve of professional 3D software and the time-consuming process of managing large 3D assets, language-guided 3D scene editing has significant potential in fields such as virtual reality, augmented reality, and gaming. However, recent approaches to language-guided 3D scene editing either require manual interventions or focus only on appearance modifications without supporting comprehensive scene layout changes. In response, we propose EditRoom, a unified framework capable of executing a variety of layout edits through natural language commands, without requiring manual intervention. Specifically, EditRoom leverages Large Language Models (LLMs) for command planning and generates target scenes using a diffusion-based method, enabling six types of edits: rotate, translate, scale, replace, add, and remove. To address the lack of data for language-guided 3D scene editing, we have developed an automatic pipeline to augment existing 3D scene synthesis datasets and introduced EditRoom-DB, a large-scale dataset with 83k editing pairs, for training and evaluation. Our experiments demonstrate that our approach consistently outperforms other baselines across all metrics, indicating higher accuracy and coherence in language-guided scene layout editing.

ICLR Conference 2025 Conference Paper

GenXD: Generating Any 3D and 4D Scenes

  • Yuyang Zhao
  • Chung-Ching Lin
  • Kevin Lin
  • Zhiwen Yan
  • Linjie Li
  • Zhengyuan Yang
  • Jianfeng Wang
  • Gim Hee Lee

Recent developments in 2D visual generation have been remarkably successful. However, 3D and 4D generation remain challenging in real-world applications due to the lack of large-scale 4D data and effective model design. In this paper, we propose to jointly investigate general 3D and 4D generation by leveraging camera and object movements commonly observed in daily life. Due to the lack of real-world 4D data in the community, we first propose a data curation pipeline to obtain camera poses and object motion strength from videos. Based on this pipeline, we introduce a large-scale real-world 4D scene dataset: CamVid-30K. By leveraging all the 3D and 4D data, we develop our framework, GenXD, which allows us to produce any 3D or 4D scene. We propose multiview-temporal modules, which disentangle camera and object movements, to seamlessly learn from both 3D and 4D data. Additionally, GenXD employs masked latent conditions to support a variety of conditioning views. GenXD can generate videos that follow the camera trajectory as well as consistent 3D views that can be lifted into 3D representations. We perform extensive evaluations across various real-world and synthetic datasets, demonstrating GenXD's effectiveness and versatility compared to previous methods in 3D and 4D generation.

ICLR Conference 2025 Conference Paper

MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models

  • Peng Xia 0005
  • Siwei Han
  • Shi Qiu 0016
  • Yiyang Zhou
  • Zhaoyang Wang 0004
  • Wenhao Zheng
  • Zhaorun Chen
  • Chenhang Cui

Interleaved multimodal comprehension and generation, enabling models to produce and interpret both images and text in arbitrary sequences, have become a pivotal area in multimodal learning. Despite significant advancements, the evaluation of this capability remains insufficient. Existing benchmarks suffer from limitations in data scale, scope, and evaluation depth, while current evaluation metrics are often costly or biased, lacking in reliability for practical applications. To address these challenges, we introduce MMIE, a large-scale knowledge-intensive benchmark for evaluating interleaved multimodal comprehension and generation in Large Vision-Language Models (LVLMs). MMIE comprises 20K meticulously curated multimodal queries, spanning 3 categories, 12 fields, and 102 subfields, including mathematics, coding, physics, literature, health, and arts. It supports both interleaved inputs and outputs, offering a mix of multiple-choice and open-ended question formats to evaluate diverse competencies. Moreover, we propose a reliable automated evaluation metric, leveraging a scoring model fine-tuned with human-annotated data and systematic evaluation criteria, aimed at reducing bias and improving evaluation accuracy. Extensive experiments demonstrate the effectiveness of our benchmark and metrics in providing a comprehensive evaluation of interleaved LVLMs. Specifically, we evaluate eight LVLMs, revealing that even the best models show significant room for improvement, with most achieving only moderate results. We believe MMIE will drive further advancements in the development of interleaved LVLMs.

ICLR Conference 2025 Conference Paper

MMWorld: Towards Multi-discipline Multi-faceted World Model Evaluation in Videos

  • Xuehai He
  • Weixi Feng
  • Kaizhi Zheng
  • Yujie Lu
  • Wanrong Zhu
  • Jiachen Li
  • Yue Fan
  • Jianfeng Wang

Multimodal Large Language Models (MLLMs) demonstrate the emerging abilities of "world models"---interpreting and reasoning about complex real-world dynamics. To assess these abilities, we posit videos are the ideal medium, as they encapsulate rich representations of real-world dynamics and causalities. To this end, we introduce MMWorld, a new benchmark for multi-discipline, multi-faceted multimodal video understanding. MMWorld distinguishes itself from previous video understanding benchmarks with two unique advantages: (1) multi-discipline, covering various disciplines that often require domain expertise for comprehensive understanding; (2) multi-faceted reasoning, including explanation, counterfactual thinking, future prediction, etc. MMWorld consists of a human-annotated dataset to evaluate MLLMs with questions about the whole videos and a synthetic dataset to analyze MLLMs within a single modality of perception. Together, MMWorld encompasses 1,910 videos across seven broad disciplines and 69 subdisciplines, complete with 6,627 question-answer pairs and associated captions. The evaluation includes 4 proprietary and 11 open-source MLLMs, which struggle on MMWorld (e.g., GPT-4o performs the best with only 62.5% accuracy), showing large room for improvement. Further ablation studies reveal other interesting findings such as models' different skill sets from humans. We hope MMWorld can serve as an essential step towards world model evaluation in videos.

NeurIPS Conference 2025 Conference Paper

Point-RFT: Improving Multimodal Reasoning with Visually Grounded Reinforcement Finetuning

  • Minheng Ni
  • Zhengyuan Yang
  • Linjie Li
  • Chung-Ching Lin
  • Kevin Lin
  • Wangmeng Zuo
  • Lijuan Wang

Recent advances in large language models have significantly improved textual reasoning through the effective use of Chain-of-Thought (CoT) and reinforcement learning. However, extending these successes to vision-language tasks remains challenging due to inherent limitations in text-only CoT, such as visual hallucinations and insufficient multimodal integration. In this paper, we introduce Point-RFT, a multimodal reasoning framework explicitly designed to leverage visually grounded CoT reasoning for visual document understanding. Our approach consists of two stages: First, we conduct format finetuning using a curated dataset of 71K diverse visual reasoning problems, each annotated with detailed, step-by-step rationales explicitly grounded to corresponding visual elements. Second, we employ reinforcement finetuning targeting visual document understanding. On ChartQA, our approach improves accuracy from 70.88% (format-finetuned baseline) to 90.04%, surpassing the 83.92% accuracy achieved by reinforcement finetuning relying solely on text-based CoT. The result shows that our grounded CoT is more effective for multimodal reasoning compared with the text-only CoT. Moreover, Point-RFT exhibits superior generalization capability across several out-of-domain visual document reasoning benchmarks, including CharXiv, PlotQA, IconQA, TabMWP, etc., and highlights its potential in complex real-world scenarios.

ICLR Conference 2025 Conference Paper

SlowFast-VGen: Slow-Fast Learning for Action-Driven Long Video Generation

  • Yining Hong
  • Beide Liu
  • Maxine Wu
  • Yuanhao Zhai 0001
  • Kai-Wei Chang 0001
  • Linjie Li
  • Kevin Lin
  • Chung-Ching Lin

Human beings are endowed with a complementary learning system, which bridges the slow learning of general world dynamics with fast storage of episodic memory from a new experience. Previous video generation models, however, primarily focus on slow learning by pre-training on vast amounts of data, overlooking the fast learning phase crucial for episodic memory storage. This oversight leads to inconsistencies across temporally distant frames when generating longer videos, as these frames fall beyond the model's context window. To this end, we introduce SlowFast-VGen, a novel dual-speed learning system for action-driven long video generation. Our approach incorporates a masked conditional video diffusion model for the slow learning of world dynamics, alongside an inference-time fast learning strategy based on a temporal LoRA module. Specifically, the fast learning process updates its temporal LoRA parameters based on local inputs and outputs, thereby efficiently storing episodic memory in its parameters. We further propose a slow-fast learning loop algorithm that seamlessly integrates the inner fast learning loop into the outer slow learning loop, enabling the recall of prior multi-episode experiences for context-aware skill learning. To facilitate the slow learning of an approximate world model, we collect a large-scale dataset of 200k videos with language action annotations, covering a wide range of scenarios. Extensive experiments show that SlowFast-VGen outperforms baselines across various metrics for action-driven video generation, achieving an FVD score of 514 compared to 782, and maintaining consistency in longer videos, with an average of 0.37 scene cuts versus 0.89. The slow-fast learning loop algorithm significantly enhances performance on long-horizon planning tasks as well.

NeurIPS Conference 2025 Conference Paper

SoTA with Less: MCTS-Guided Sample Selection for Data-Efficient Visual Reasoning Self-Improvement

  • Xiyao Wang
  • Zhengyuan Yang
  • Chao Feng
  • Hongjin Lu
  • Linjie Li
  • Chung-Ching Lin
  • Kevin Lin
  • Furong Huang

We introduce ThinkLite-VL, a family of visual reasoning models that achieve state-of-the-art (SoTA) performance using an order of magnitude fewer training samples, relying purely on reinforcement fine-tuning (RFT) self-improvement without any knowledge distillation. Our central insight is that sample difficulty critically influences RFT effectiveness: appropriately challenging examples can drive substantial reasoning improvements, even in low-data regimes. However, quantifying sample difficulty in a reliable and scalable manner remains non-trivial. To address this, we repurpose Monte Carlo Tree Search (MCTS) to measure sample difficulty via the number of reasoning iterations a vision-language model (VLM) requires to solve each instance. This MCTS-based selection procedure identifies samples that induce deeper reasoning while remaining solvable, allowing us to filter a high-quality subset from 70k open-source examples spanning math, natural image understanding, and chart comprehension. Using this approach, we select just 11k challenging samples for RFT on Qwen2.5-VL-7B-Instruct and 7.5k samples for Qwen2.5-VL-72B-Instruct. The resulting models, ThinkLite-VL-7B and ThinkLite-VL-72B, significantly outperform their respective base models across eight visual reasoning benchmarks. In particular, ThinkLite-VL-7B improves the average performance of Qwen2.5-VL-7B-Instruct by 7% and surpasses all existing 7B-level models, as well as much larger models such as GPT-4o, O1, and Qwen2.5-VL-72B, achieving a new SoTA score of 75.1 on MathVista. ThinkLite-VL-72B further advances the SoTA frontier, achieving an accuracy of 79.7 on MathVista and an average benchmark improvement of 4.42 over the open-source SoTA. These results demonstrate that MCTS-guided difficulty filtering provides a scalable and effective path toward data-efficient self-improvement in multimodal reasoning.
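
A minimal sketch of the MCTS-guided selection idea, where difficulty is the number of reasoning iterations a VLM needs before solving a sample (the thresholds and the difficulty_fn interface are assumptions, not the paper's code):

```python
def select_hard_but_solvable(samples, difficulty_fn, min_iters=4, max_keep=11_000):
    """Keep samples the VLM can solve, but only after many MCTS iterations.

    difficulty_fn(sample) -> number of MCTS reasoning iterations needed
    to solve the sample, or None if it was never solved.
    """
    scored = []
    for s in samples:
        iters = difficulty_fn(s)
        if iters is not None and iters >= min_iters:  # solvable, yet hard
            scored.append((iters, s))
    scored.sort(key=lambda pair: pair[0], reverse=True)  # hardest first
    return [s for _, s in scored[:max_keep]]
```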

ICLR Conference 2025 Conference Paper

Tuning Timestep-Distilled Diffusion Model Using Pairwise Sample Optimization

  • Zichen Miao
  • Zhengyuan Yang
  • Kevin Lin
  • Ze Wang 0008
  • Zicheng Liu 0001
  • Lijuan Wang
  • Qiang Qiu 0001

Recent advancements in timestep-distilled diffusion models have enabled high-quality image generation that rivals non-distilled multi-step models, but with significantly fewer inference steps. While such models are attractive for applications due to the low inference cost and latency, fine-tuning them with a naive diffusion objective would result in degraded and blurry outputs. An intuitive alternative is to repeat the diffusion distillation process with a fine-tuned teacher model, which produces good results but is cumbersome and computationally intensive: the distillation training usually requires orders of magnitude more compute than fine-tuning for specific image styles. In this paper, we present an algorithm named pairwise sample optimization (PSO), which enables the direct fine-tuning of an arbitrary timestep-distilled diffusion model. PSO introduces additional reference images sampled from the current time-step distilled model, and increases the relative likelihood margin between the training images and reference images. This enables the model to retain its few-step generation ability, while allowing for fine-tuning of its output distribution. We also demonstrate that PSO is a generalized formulation that can be flexibly extended to both offline-sampled and online-sampled pairwise data, covering various popular objectives for diffusion model preference optimization. We evaluate PSO in both preference optimization and other fine-tuning tasks, including style transfer and concept customization. We show that PSO can directly adapt distilled models to human-preferred generation with both offline and online-generated pairwise preference image data. PSO also demonstrates effectiveness in style transfer and concept customization by directly tuning timestep-distilled diffusion models.
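
A schematic of the pairwise margin idea: push up the likelihood of training images relative to reference images sampled from the current distilled model. This is a DPO-style surrogate written from the abstract's description, not the paper's exact objective:

```python
import torch.nn.functional as F

def pairwise_sample_loss(logp_train, logp_ref, beta=0.1):
    """Increase the relative log-likelihood margin between training
    images and reference images from the current timestep-distilled
    model (illustrative; beta and the sigmoid link are assumptions)."""
    margin = logp_train - logp_ref
    return -F.logsigmoid(beta * margin).mean()
```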

NeurIPS Conference 2025 Conference Paper

VAGEN: Reinforcing World Model Reasoning for Multi-Turn VLM Agents

  • Kangrui Wang
  • Pingyue Zhang
  • Zihan Wang
  • Yaning Gao
  • Linjie Li
  • Qineng Wang
  • Hanyang Chen
  • Yiping Lu

A major challenge in training VLM agents, compared to LLM agents, is that states shift from simple texts to complex visual observations, which introduces partial observability and demands robust world modeling. We ask: can VLM agents build internal world models through explicit visual state reasoning? In this work, we architecturally enforce and reward the VLM agent’s reasoning process via reinforcement learning (RL), formulating the problem as a Partially Observable Markov Decision Process (POMDP). By studying five reasoning strategies, we demonstrate that structuring the agent’s reasoning into StateEstimation (“what is the current state?”) and TransitionModeling (“what is next?”) is critical. Investigating how agents should ground visual states and represent these internal beliefs, we reveal that the optimal representations are task-dependent: Natural Language excels at capturing semantic relationships for general tasks, while Structured formats are essential for high-precision manipulation. These insights motivate our approach to reward shaping and credit assignment. We leverage a WorldModeling Reward to densely reward the agent’s turn-by-turn state predictions, while our Bi-Level General Advantage Estimation (Bi-Level GAE) enables turn-aware credit assignment. Through such world model reasoning, we enable a 3B model to achieve a score of 0.82 on a set of five diverse agent tasks, nearly a 3× improvement over its untrained counterpart (0.21), surpassing proprietary reasoning models like GPT-5 (0.75), Gemini 2.5 Pro (0.67), and Claude 4.5 (0.62). All experiments are supported by our VAGEN framework, a scalable system for training and analyzing multi-turn VLM agents across diverse visual environments.

NeurIPS Conference 2025 Conference Paper

ViCrit: A Verifiable Reinforcement Learning Proxy Task for Visual Perception in VLMs

  • Xiyao Wang
  • Zhengyuan Yang
  • Chao Feng
  • Yuhang Zhou
  • Xiaoyu Liu
  • Yongyuan Liang
  • Ming Li
  • Ziyi Zang

Reinforcement learning (RL) has shown great effectiveness for fine-tuning large language models (LLMs) using tasks that are challenging yet easily verifiable, such as math reasoning or code generation. However, extending this success to visual perception in vision–language models (VLMs) has been impeded by the scarcity of vision-centric tasks that are simultaneously challenging and unambiguously verifiable. To this end, we introduce ViCrit (Visual Caption Hallucination Critic), an RL proxy task that trains VLMs to localize a subtle, synthetic visual hallucination injected into paragraphs of human-written image captions. Starting from a 200-word caption, we inject a single, subtle visual description error—altering a few words on objects, attributes, counts, or spatial relations—and task the model to pinpoint the corrupted span given the image and the modified caption. This formulation preserves the full perceptual difficulty while providing a binary, exact-match reward that is easy to compute and unambiguous. Models trained with the ViCrit task exhibit substantial gains across a variety of VL benchmarks. Crucially, the improvements transfer beyond natural-image training data to abstract image reasoning and visual math, showing promise of learning to perceive rather than merely memorizing seen objects. To facilitate evaluation, we further introduce ViCrit-Bench, a category-balanced diagnostic benchmark that systematically probes perception errors across diverse image domains and error types. Together, our results demonstrate that fine-grained hallucination criticism is an effective and generalizable objective for enhancing visual perception in VLMs.
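
The binary, exact-match reward described here is simple to express; a minimal sketch (the span normalization is an assumption):

```python
def vicrit_reward(predicted_span: str, corrupted_span: str) -> float:
    """Return 1.0 iff the model pinpoints exactly the injected
    hallucinated span in the modified caption, else 0.0."""
    def norm(s: str) -> str:
        return " ".join(s.lower().split())  # case/whitespace-insensitive match
    return 1.0 if norm(predicted_span) == norm(corrupted_span) else 0.0
```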

IJCAI Conference 2024 Conference Paper

Bring Metric Functions into Diffusion Models

  • Jie An
  • Zhengyuan Yang
  • Jianfeng Wang
  • Linjie Li
  • Zicheng Liu
  • Lijuan Wang
  • Jiebo Luo

We introduce a Cascaded Diffusion Model (Cas-DM) that improves a Denoising Diffusion Probabilistic Model (DDPM) by effectively incorporating additional metric functions in training. Metric functions such as the LPIPS loss have been proven highly effective in consistency models derived from score matching. However, for the diffusion counterparts, the methodology and efficacy of adding extra metric functions remain unclear. One major challenge is the mismatch between the noise predicted by a DDPM at each step and the desired clean image that the metric function works well on. To address this problem, we propose Cas-DM, a network architecture that cascades two network modules to effectively apply metric functions to the diffusion model training. The first module, similar to a standard DDPM, learns to predict the added noise and is unaffected by the metric function. The second cascaded module learns to predict the clean image, thereby facilitating the metric function computation. Experimental results show that the proposed diffusion model backbone enables the effective use of the LPIPS loss, improving the image quality (FID, sFID) of diffusion models on various established benchmarks.
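
A schematic of the two-branch objective the abstract describes: a standard noise-prediction MSE on the first module, plus a metric loss (e.g., LPIPS) on the clean image predicted by the cascaded module. The weighting and interfaces are assumptions:

```python
import torch.nn.functional as F

def cas_dm_losses(noise_pred, noise_true, x0_pred, x0_true, lpips_fn, w=1.0):
    """Illustrative Cas-DM-style objective (not the authors' code)."""
    ddpm_loss = F.mse_loss(noise_pred, noise_true)   # first module: unaffected by the metric
    metric_loss = lpips_fn(x0_pred, x0_true).mean()  # second module: clean-image branch
    return ddpm_loss + w * metric_loss
```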

ICML Conference 2024 Conference Paper

Completing Visual Objects via Bridging Generation and Segmentation

  • Xiang Li 0106
  • Yinpeng Chen
  • Chung-Ching Lin
  • Hao Chen 0102
  • Kai Hu 0010
  • Rita Singh
  • Bhiksha Raj
  • Lijuan Wang

This paper presents a novel approach to object completion, with the primary goal of reconstructing a complete object from its partially visible components. Our method, named MaskComp, delineates the completion process through iterative stages of generation and segmentation. In each iteration, the object mask is provided as an additional condition to boost image generation, and, in return, the generated images can lead to a more accurate mask by fusing the segmentation of images. We demonstrate that the combination of one generation and one segmentation stage effectively functions as a mask denoiser. Through alternation between the generation and segmentation stages, the partial object mask is progressively refined, providing precise shape guidance and yielding superior object completion results. Our experiments demonstrate the superiority of MaskComp over existing approaches, e.g., ControlNet and Stable Diffusion, establishing it as an effective solution for object completion.

NeurIPS Conference 2024 Conference Paper

Interfacing Foundation Models' Embeddings

  • Xueyan Zou
  • Linjie Li
  • Jianfeng Wang
  • Jianwei Yang
  • Mingyu Ding
  • Junyi Wei
  • Zhengyuan Yang
  • Feng Li

Foundation models possess strong capabilities in reasoning and memorizing across modalities. To further unleash the power of foundation models, we present FIND, a generalized interface for aligning foundation models' embeddings with unified image and dataset-level understanding spanning modality and granularity. As shown in Fig. 1, a lightweight transformer interface without tuning any foundation model weights is enough for segmentation, grounding, and retrieval in an interleaved manner. The proposed interface has the following favorable attributes: (1) Generalizable. It applies to various tasks spanning retrieval, segmentation, etc., under the same architecture and weights. (2) Interleavable. With the benefit of multi-task multi-modal training, the proposed interface creates an interleaved shared embedding space. (3) Extendable. The proposed interface is adaptive to new tasks and new models. In light of the interleaved embedding space, we introduce FIND-Bench, which adds new training and evaluation annotations to the COCO dataset for interleaved segmentation and retrieval. Ours is the first work to align foundation models' embeddings for interleaved understanding. Meanwhile, our approach achieves state-of-the-art performance on FIND-Bench and competitive performance on standard retrieval and segmentation settings.

NeurIPS Conference 2024 Conference Paper

Leveraging Visual Tokens for Extended Text Contexts in Multi-Modal Learning

  • Alex Jinpeng Wang
  • Linjie Li
  • Yiqi Lin
  • Min Li
  • Lijuan Wang
  • Mike Zheng Shou

Training models with longer in-context lengths is a significant challenge for multimodal machine learning due to substantial GPU memory and computational costs. This exploratory study does not present state-of-the-art models; rather, it introduces an innovative method designed to increase in-context text length in multi-modality large language models (MLLMs) efficiently. We present a method that processes long in-context text using visual tokens. This technique significantly reduces GPU memory usage and floating point operations (FLOPs). For instance, our method expands the pre-training in-context length from 256 to 2048 tokens with fewer FLOPs for a 56 billion parameter MoE model. Experimental results demonstrate that the method enhances OCR capabilities and delivers superior performance on common downstream benchmarks for in-context few-shot evaluation. Additionally, it proves effective for long context inference, achieving results comparable to full text input while maintaining computational efficiency.

ICLR Conference 2024 Conference Paper

Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning

  • Fuxiao Liu
  • Kevin Lin
  • Linjie Li
  • Jianfeng Wang
  • Yaser Yacoob
  • Lijuan Wang

Despite the promising progress in multi-modal tasks, current large multi-modal models (LMMs) are prone to hallucinating inconsistent descriptions with respect to the associated image and human instructions. This paper addresses this issue by introducing the first large and diverse visual instruction tuning dataset, named Large-scale Robust Visual (LRV)-Instruction. Our dataset comprises 400k visual instructions generated by GPT4, covering 16 vision-and-language tasks with open-ended instructions and answers. Unlike existing studies that primarily focus on positive instruction samples, we design LRV-Instruction to include both positive and negative instructions for more robust visual instruction tuning. Our negative instructions are designed at three semantic levels: (i) Nonexistent Object Manipulation, (ii) Existent Object Manipulation and (iii) Knowledge Manipulation. To efficiently measure the hallucination generated by LMMs, we propose GPT4-Assisted Visual Instruction Evaluation (GAVIE), a stable approach to evaluate visual instruction tuning like human experts. GAVIE does not require human-annotated groundtruth answers and can adapt to diverse instruction formats. We conduct comprehensive experiments to investigate the hallucination of LMMs. Our results demonstrate existing LMMs exhibit significant hallucinations when presented with our negative instructions, particularly Existent Object and Knowledge Manipulation instructions. Moreover, we successfully mitigate hallucination by finetuning MiniGPT4 and mPLUG-Owl on LRV-Instruction while improving performance on several public datasets compared to state-of-the-art methods. Additionally, we observed that a balanced ratio of positive and negative instances in the training data leads to a more robust model. Code and data will be released upon publication.

ICML Conference 2024 Conference Paper

MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities

  • Weihao Yu 0001
  • Zhengyuan Yang
  • Linjie Li
  • Jianfeng Wang
  • Kevin Lin
  • Zicheng Liu 0001
  • Xinchao Wang
  • Lijuan Wang

We propose MM-Vet, an evaluation benchmark that examines large multimodal models (LMMs) on complicated multimodal tasks. Recent LMMs have shown various intriguing abilities, such as solving math problems written on the blackboard, reasoning about events and celebrities in news images, and explaining visual jokes. Rapid model advancements pose challenges to evaluation benchmark development. Problems include: (1) How to systematically structure and evaluate the complicated multimodal tasks; (2) How to design evaluation metrics that work well across question and answer types; and (3) How to give model insights beyond a simple performance ranking. To this end, we present MM-Vet, designed based on the insight that the intriguing ability to solve complicated tasks is often achieved by a generalist model being able to integrate different core vision-language (VL) capabilities. MM-Vet defines 6 core VL capabilities and examines the 16 integrations of interest derived from the capability combination. For evaluation metrics, we propose an LLM-based evaluator for open-ended outputs. The evaluator enables the evaluation across different question types and answer styles, resulting in a unified scoring metric. We evaluate representative LMMs on MM-Vet, providing insights into the capabilities of different LMM system paradigms and models.
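
A minimal sketch of the LLM-based evaluator idea for open-ended outputs (the prompt template, scale, and evaluator interface are illustrative assumptions, not MM-Vet's released grader):

```python
def llm_score(evaluator, question, ground_truth, model_output):
    """Ask an LLM to grade an open-ended answer on [0, 1], yielding a
    unified scoring metric across question types and answer styles.

    evaluator: an assumed callable that maps a prompt string to the
    LLM's text completion.
    """
    prompt = (
        "Compare the model answer with the ground truth and output only "
        "a correctness score between 0.0 and 1.0.\n"
        f"Question: {question}\nGround truth: {ground_truth}\n"
        f"Model answer: {model_output}\nScore:"
    )
    return float(evaluator(prompt).strip())
```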

NeurIPS Conference 2024 Conference Paper

Motion Consistency Model: Accelerating Video Diffusion with Disentangled Motion-Appearance Distillation

  • Yuanhao Zhai
  • Kevin Lin
  • Zhengyuan Yang
  • Linjie Li
  • Jianfeng Wang
  • Chung-Ching Lin
  • David Doermann
  • Junsong Yuan

Image diffusion distillation achieves high-fidelity generation with very few sampling steps. However, directly applying these techniques to video models results in unsatisfactory frame quality. This issue arises from the limited frame appearance quality in public video datasets, affecting the performance of both teacher and student video diffusion models. Our study aims to improve video diffusion distillation while enabling the student model to improve frame appearance using abundant high-quality image data. To this end, we propose motion consistency models (MCM), a single-stage video diffusion distillation method that disentangles motion and appearance learning. Specifically, MCM involves a video consistency model that distills motion from the video teacher model, and an image discriminator that boosts frame appearance to match high-quality image data. However, directly combining these components leads to two significant challenges: a conflict in frame learning objectives, where video distillation learns from low-quality video frames while the image discriminator targets high-quality images, and training-inference discrepancies due to the differing quality of video samples used during training and inference. To address these challenges, we introduce disentangled motion distillation and mixed trajectory distillation. The former applies the distillation objective solely to the motion representation, while the latter mitigates training-inference discrepancies by mixing distillation trajectories from both the low- and high-quality video domains. Extensive experiments show that our MCM achieves state-of-the-art video diffusion distillation performance. Additionally, our method can enhance frame quality in video diffusion models, producing frames with high aesthetic value or specific styles.

AAAI Conference 2024 Conference Paper

ORES: Open-Vocabulary Responsible Visual Synthesis

  • Minheng Ni
  • Chenfei Wu
  • Xiaodong Wang
  • Shengming Yin
  • Lijuan Wang
  • Zicheng Liu
  • Nan Duan

Avoiding synthesizing specific visual concepts is an essential challenge in responsible visual synthesis. However, the visual concept that needs to be avoided for responsible visual synthesis tends to be diverse, depending on the region, context, and usage scenarios. In this work, we formalize a new task, Open-vocabulary Responsible Visual Synthesis (ORES), where the synthesis model is able to avoid forbidden visual concepts while allowing users to input any desired content. To address this problem, we present a Two-stage Intervention (TIN) framework. By introducing 1) rewriting with learnable instruction through a large-scale language model (LLM) and 2) synthesizing with prompt intervention on a diffusion synthesis model, it can effectively synthesize images that avoid forbidden concepts while following the user's query as much as possible. To enable evaluation on ORES, we provide a publicly available dataset, baseline models, and benchmark. Experimental results demonstrate the effectiveness of our method in reducing risks of image generation. Our work highlights the potential of LLMs in responsible visual synthesis. Our code and dataset are publicly available at https://github.com/kodenii/ORES.

ICML Conference 2024 Conference Paper

StrokeNUWA - Tokenizing Strokes for Vector Graphic Synthesis

  • Zecheng Tang
  • Chenfei Wu
  • Zekai Zhang
  • Minheng Ni
  • Shengming Yin
  • Yu Liu
  • Zhengyuan Yang
  • Lijuan Wang

To leverage LLMs for visual synthesis, traditional methods convert raster image information into discrete grid tokens through specialized visual modules, which disrupts the model’s ability to capture the true semantic representation of visual scenes. This paper posits that an alternative representation of images, vector graphics, can effectively surmount this limitation by enabling a more natural and semantically coherent segmentation of the image information. Thus, we introduce StrokeNUWA, a pioneering work exploring "stroke" tokens as a better visual representation on vector graphics, one that is inherently rich in visual semantics, naturally compatible with LLMs, and highly compressed. Equipped with stroke tokens, StrokeNUWA can significantly surpass traditional LLM-based and optimization-based methods across various metrics in the vector graphic generation task. Besides, StrokeNUWA achieves up to a 94× speedup in inference over prior methods, with an exceptional SVG code compression ratio of 6.9%.

YNICL Journal 2024 Journal Article

Temporal evolution of microstructural integrity in cerebellar peduncles in Parkinson’s disease: Stage-specific patterns and dopaminergic correlates

  • Chentao He
  • Rui Yang
  • Siming Rong
  • Piao Zhang
  • Xi Chen
  • Qi Qi
  • Ziqi Gao
  • Yan Li

BACKGROUND: Previous research revealed differences in cerebellar white matter integrity by disease stages, indicating a compensatory role in Parkinson's disease (PD). However, the temporal evolution of cerebellar white matter microstructure in patients with PD (PwPD) remains unclear. OBJECTIVE: To unravel temporal evolution of cerebellar white matter and its dopaminergic correlates in PD. METHODS: We recruited 124 PwPD from the PPMI study. The participants were divided into two subsets: Subset 1 (n = 41) had three MRI scans (baseline, 2 years, and 4 years), and Subset 2 (n = 106) had at least two MRI scans at baseline, 1 year, and/or 2 years. Free water-corrected diffusion metrics were used to measure the microstructural integrity in cerebellar peduncles (CP), the main white matter tracts connecting to and from the cerebellum. The ACAPULCO processing pipeline was used to assess cerebellar lobules volumes. Linear mixed-effect models were used to study longitudinal changes. We also examined the relationships between microstructural integrity in CP, striatal dopamine transporter specific binding ratio (SBR), and clinical symptoms. RESULTS: Microstructural changes in CP showed a non-linear pattern in PwPD. Free water-corrected fractional anisotropy (FAt) increased in the first two years but declined from 2 to 4 years, while free water-corrected mean diffusivity exhibited the opposite trend. The initial increased FAt in CP correlated with cerebellar regional volume atrophy, striatal dopaminergic SBR decline, and worsening clinical symptoms, but this correlation varied across disease stages. CONCLUSIONS: Our findings suggest a non-linear evolution of microstructural integrity in CP throughout the course of PD, indicating the adaptive structural reorganization of the cerebellum simultaneously with progressive striatal dopaminergic degeneration in PD.

NeurIPS Conference 2024 Conference Paper

VideoGUI: A Benchmark for GUI Automation from Instructional Videos

  • Kevin Q. Lin
  • Linjie Li
  • Difei Gao
  • Qinchen Wu
  • Mingyi Yan
  • Zhengyuan Yang
  • Lijuan Wang
  • Mike Z. Shou

Graphical User Interface (GUI) automation holds significant promise for enhancing human productivity by assisting with computer tasks. Existing task formulations primarily focus on simple tasks that can be specified by a single, language-only instruction, such as “Insert a new slide.” In this work, we introduce VideoGUI, a novel multi-modal benchmark designed to evaluate GUI assistants on visual-centric GUI tasks. Sourced from high-quality web instructional videos, our benchmark focuses on tasks involving professional and novel software (e.g., Adobe Photoshop or Stable Diffusion WebUI) and complex activities (e.g., video editing). VideoGUI evaluates GUI assistants through a hierarchical process, allowing for identification of the specific levels at which they may fail: (i) high-level planning: reconstruct procedural subtasks from visual conditions without language descriptions; (ii) middle-level planning: generate sequences of precise action narrations based on visual state (i.e., screenshot) and goals; (iii) atomic action execution: perform specific actions such as accurately clicking designated elements. For each level, we design evaluation metrics across individual dimensions to provide clear signals, such as individual performance in clicking, dragging, typing, and scrolling for atomic action execution. Our evaluation on VideoGUI reveals that even the SoTA large multimodal model GPT4o performs poorly on visual-centric GUI tasks, especially for high-level planning. The data and code are available at https://github.com/showlab/videogui.

IJCAI Conference 2023 Conference Paper

Learning 3D Photography Videos via Self-supervised Diffusion on Single Images

  • Xiaodong Wang
  • Chenfei Wu
  • Shengming Yin
  • Minheng Ni
  • Jianfeng Wang
  • Linjie Li
  • Zhengyuan Yang
  • Fan Yang

3D photography renders a static image into a video with appealing 3D visual effects. Existing approaches typically first conduct monocular depth estimation, then render the input frame to subsequent frames with various viewpoints, and finally use an inpainting model to fill those missing/occluded regions. The inpainting model plays a crucial role in rendering quality, but it is normally trained on out-of-domain data. To reduce the training and inference gap, we propose a novel self-supervised diffusion model as the inpainting module. Given a single input image, we automatically construct a training pair of the masked occluded image and the ground-truth image with random cycle rendering. The constructed training samples are closely aligned to the testing instances, without the need for data annotation. To make full use of the masked images, we designed a Masked Enhanced Block (MEB), which can be easily plugged into the UNet and enhance the semantic conditions. Towards real-world animation, we present a novel task: out-animation, which extends the space and time of input objects. Extensive experiments on real datasets show that our method achieves competitive results with existing SOTA methods.

YNIMG Journal 2023 Journal Article

Optimization of structural connectomes and scaled patterns of structural-functional decoupling in Parkinson's disease

  • Song'an Shang
  • Lijuan Wang
  • Yao Xu
  • Hongying Zhang
  • Lanlan Chen
  • Weiqiang Dou
  • Xindao Yin
  • Jing Ye

Parkinson's disease (PD) is manifested with disrupted topology of the structural connection network (SCN) and the functional connection network (FCN). However, the SCN and its interactions with the FCN remain to be further investigated. This multimodality study attempted to precisely characterize the SCN using diffusion kurtosis imaging (DKI) and further identify the neuropathological pattern of SCN-FCN decoupling, underscoring the neurodegeneration of PD. Diffusion-weighted imaging and resting-state functional imaging were available for network constructions among sixty-nine patients with PD and seventy demographically matched healthy control (HC) participants. The classification performance and topological prosperities of both the SCN and the FCN were analyzed, followed by quantification of the SCN-FCN couplings across scales. The SCN constructed by kurtosis metrics achieved optimal classification performance (area under the curve 0.89, accuracy 80.55 %, sensitivity 78.40 %, and specificity 80.65 %). Along with diverse alterations of structural and functional network topology, the PD group exhibited decoupling across scales including: reduced global coupling; increased nodal coupling within the sensorimotor network (SMN) and subcortical network (SN); higher intramodular coupling within the SMN and SN and lower intramodular coupling of the default mode network (DMN); decreased coupling between the modules of DMN-fronto-parietal network and DMN-visual network, but increased coupling between the SMN-SN module. Several associations between the coupling coefficient and topological properties of the SCN, as well as between network values and clinical scores, were observed. These findings validated the clinical implementation of DKI for structural network construction with better differentiation ability and characterized the SCN-FCN decoupling as supplementary insight into the pathological process underlying PD.

ICLR Conference 2023 Conference Paper

Prompting GPT-3 To Be Reliable

  • Chenglei Si
  • Zhe Gan
  • Zhengyuan Yang
  • Shuohang Wang
  • Jianfeng Wang
  • Jordan L. Boyd-Graber
  • Lijuan Wang

Large language models (LLMs) show impressive abilities via few-shot prompting. Commercialized APIs such as OpenAI GPT-3 further increase their use in real-world language applications. However, the crucial problem of how to improve the reliability of GPT-3 is still under-explored. While reliability is a broad and vaguely defined term, we decompose reliability into four main facets that correspond to the existing framework of ML safety and are well-recognized to be important: generalizability, social biases, calibration, and factuality. Our core contribution is to establish simple and effective prompts that improve GPT-3’s reliability as it: 1) generalizes out-of-distribution, 2) balances demographic distribution and uses natural language instructions to reduce social biases, 3) calibrates output probabilities, and 4) updates the LLM’s factual knowledge and reasoning chains. With appropriate prompts, GPT-3 is more reliable than smaller-scale supervised models on all these facets. We release all processed datasets, evaluation scripts, and model predictions. Our systematic empirical study not only sheds new light on the reliability of prompting LLMs, but more importantly, our prompting strategies can help practitioners more reliably use LLMs like GPT-3.

NeurIPS Conference 2023 Conference Paper

Segment Everything Everywhere All at Once

  • Xueyan Zou
  • Jianwei Yang
  • Hao Zhang
  • Feng Li
  • Linjie Li
  • Jianfeng Wang
  • Lijuan Wang
  • Jianfeng Gao

In this work, we present SEEM, a promptable and interactive model for segmenting everything everywhere all at once in an image. In SEEM, we propose a novel and versatile decoding mechanism that enables diverse prompting for all types of segmentation tasks, aiming at a universal interface that behaves like large language models (LLMs). More specifically, SEEM is designed with four desiderata: i) Versatility. We introduce a new visual prompt to unify different spatial queries including points, boxes, scribbles, and masks, which can further generalize to a different referring image; ii) Compositionality. We learn a joint visual-semantic space between text and visual prompts, which facilitates the dynamic composition of two prompt types required for various segmentation tasks, as shown in Fig. 1; iii) Interactivity. We further incorporate learnable memory prompts into the decoder to retain segmentation history through mask-guided cross-attention from the decoder to image features; iv) Semantic awareness. We use a text encoder to encode text queries and mask labels into the same semantic space for open-vocabulary segmentation. We conduct a comprehensive empirical study to validate the effectiveness of SEEM across diverse segmentation tasks. The results demonstrate that SEEM exhibits robust generalization to unseen user intents as it learns to compose prompts of different types in a unified representation space. Our approach achieves competitive performance on interactive segmentation, generic segmentation, referring segmentation, and video object segmentation on 9 datasets with a minimum of 1/100 supervision in a single set of weights.

TMLR Journal 2022 Journal Article

Adversarial Feature Augmentation and Normalization for Visual Recognition

  • Tianlong Chen
  • Yu Cheng
  • Zhe Gan
  • Jianfeng Wang
  • Lijuan Wang
  • Jingjing Liu
  • Zhangyang Wang

Recent advances in computer vision take advantage of adversarial data augmentation to improve the generalization of classification models. Here, we present an effective and efficient alternative that advocates adversarial augmentation on intermediate feature embeddings, instead of relying on computationally-expensive pixel-level perturbations. We propose Adversarial Feature Augmentation and Normalization (A-FAN), which (i) first augments visual recognition models with adversarial features that integrate flexible scales of perturbation strengths, and (ii) then extracts adversarial feature statistics from batch normalization and re-injects them into clean features through feature normalization. We validate the proposed approach across diverse visual recognition tasks with representative backbone networks, including ResNets and EfficientNets for classification, Faster-RCNN for detection, and Deeplab V3+ for segmentation. Extensive experiments show that A-FAN yields consistent generalization improvement over strong baselines across various datasets for classification, detection, and segmentation tasks, such as CIFAR-10, CIFAR-100, ImageNet, Pascal VOC2007, Pascal VOC2012, COCO2017, and Cityscapes. Comprehensive ablation studies and detailed analyses also demonstrate that adding perturbations to specific modules and layers of classification/detection/segmentation backbones yields optimal performance. Code and pre-trained models are available at https://github.com/VITA-Group/CV_A-FAN.
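
A one-step sketch of adversarial augmentation on intermediate features (an FGSM-style perturbation written from the abstract's description; A-FAN itself uses flexible perturbation strengths and re-injects batch-normalization statistics, which this sketch omits):

```python
import torch

def adversarial_feature(features, loss_fn, epsilon=0.1):
    """Perturb intermediate feature embeddings along the loss gradient
    (illustrative; not the released A-FAN implementation)."""
    features = features.detach().requires_grad_(True)
    loss = loss_fn(features)                      # task loss on features
    grad, = torch.autograd.grad(loss, features)   # gradient w.r.t. features
    return (features + epsilon * grad.sign()).detach()
```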

YNIMG Journal 2022 Journal Article

An awareness-dependent mapping of saliency in the human visual system

  • Lijuan Wang
  • Ling Huang
  • Mengsha Li
  • Xiaotong Wang
  • Shiyu Wang
  • Yuefa Lin
  • Xilin Zhang

The allocation of exogenously cued spatial attention is governed by a saliency map. Yet, how salience is mapped when multiple salient stimuli are present simultaneously, and how this mapping interacts with awareness remains unclear. These questions were addressed here using either visible or invisible displays presenting two foreground stimuli (whose bars were oriented differently from the bars in the otherwise uniform background): a high salience target and a distractor of varied, lesser salience. Interference, or not, by the distractor with the effective salience of the target served to index a graded or non-graded nature of salience mapping, respectively. The invisible and visible displays were empirically validated by a two-alternative forced choice test (detecting the quadrant of the target) demonstrating subjects' performance at or above chance level, respectively. By combining psychophysics, fMRI, and effective connectivity analysis, we found a graded distribution of salience with awareness, changing to a non-graded distribution without awareness. Crucially, we further revealed that the graded distribution was contingent upon feedback from the posterior intraparietal sulcus (pIPS, especially from the right pIPS), whereas the non-graded distribution was innate to V1. Together, this awareness-dependent mapping of saliency reconciles several previous, seemingly contradictory findings regarding the nature of the saliency map.

AAAI Conference 2022 Conference Paper

An Empirical Study of GPT-3 for Few-Shot Knowledge-Based VQA

  • Zhengyuan Yang
  • Zhe Gan
  • Jianfeng Wang
  • Xiaowei Hu
  • Yumao Lu
  • Zicheng Liu
  • Lijuan Wang

Knowledge-based visual question answering (VQA) involves answering questions that require external knowledge not present in the image. Existing methods first retrieve knowledge from external resources, then reason over the selected knowledge, the input image, and question for answer prediction. However, this two-step approach could lead to mismatches that potentially limit the VQA performance. For example, the retrieved knowledge might be noisy and irrelevant to the question, and the re-embedded knowledge features during reasoning might deviate from their original meanings in the knowledge base (KB). To address this challenge, we propose PICa, a simple yet effective method that Prompts GPT-3 via the use of Image Captions, for knowledge-based VQA. Inspired by GPT-3’s power in knowledge retrieval and question answering, instead of using structured KBs as in previous work, we treat GPT-3 as an implicit and unstructured KB that can jointly acquire and process relevant knowledge. Specifically, we first convert the image into captions (or tags) that GPT-3 can understand, then adapt GPT-3 to solve the VQA task in a few-shot manner by just providing a few in-context VQA examples. We further boost performance by carefully investigating: (i) what text formats best describe the image content, and (ii) how in-context examples can be better selected and used. PICa unlocks the first use of GPT-3 for multimodal tasks. By using only 16 examples, PICa surpasses the supervised state of the art by an absolute +8.6 points on the OK-VQA dataset. We also benchmark PICa on VQAv2, where PICa also shows a decent few-shot performance.
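
A minimal sketch of PICa-style prompt assembly, representing each image only by its caption (the template wording is illustrative, not the paper's exact prompt):

```python
def build_pica_prompt(context_examples, caption, question):
    """context_examples: list of (caption, question, answer) triples
    used as in-context shots; the test image appears as its caption."""
    header = "Please answer the question according to the context.\n\n"
    shots = "".join(
        f"Context: {c}\nQuestion: {q}\nAnswer: {a}\n\n"
        for c, q, a in context_examples
    )
    return header + shots + f"Context: {caption}\nQuestion: {question}\nAnswer:"
```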

NeurIPS Conference 2022 Conference Paper

Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone

  • Zi-Yi Dou
  • Aishwarya Kamath
  • Zhe Gan
  • Pengchuan Zhang
  • Jianfeng Wang
  • Linjie Li
  • Zicheng Liu
  • Ce Liu

Vision-language (VL) pre-training has recently received considerable attention. However, most existing end-to-end pre-training approaches either only aim to tackle VL tasks such as image-text retrieval, visual question answering (VQA) and image captioning that test high-level understanding of images, or only target region-level understanding for tasks such as phrase grounding and object detection. We present FIBER (Fusion-In-the-Backbone-based transformER), a new VL model architecture that can seamlessly handle both these types of tasks. Instead of having dedicated transformer layers for fusion after the uni-modal backbones, FIBER pushes multimodal fusion deep into the model by inserting cross-attention into the image and text backbones to better capture multimodal interactions. In addition, unlike previous work that is either only pre-trained on image-text data or on fine-grained data with box-level annotations, we present a two-stage pre-training strategy that uses both these kinds of data efficiently: (i) coarse-grained pre-training based on image-text data; followed by (ii) fine-grained pre-training based on image-text-box data. We conduct comprehensive experiments on a wide range of VL tasks, ranging from VQA, image captioning, and retrieval, to phrase grounding, referring expression comprehension, and object detection. Using deep multimodal fusion coupled with the two-stage pre-training, FIBER provides consistent performance improvements over strong baselines across all tasks, often outperforming methods using orders of magnitude more data. Code is released at https://github.com/microsoft/FIBER.

TMLR Journal 2022 Journal Article

GIT: A Generative Image-to-text Transformer for Vision and Language

  • Jianfeng Wang
  • Zhengyuan Yang
  • Xiaowei Hu
  • Linjie Li
  • Kevin Lin
  • Zhe Gan
  • Zicheng Liu
  • Ce Liu

In this paper, we design and train a Generative Image-to-text Transformer, GIT, to unify vision-language tasks such as image/video captioning and question answering. While generative models provide a consistent network architecture between pre-training and fine-tuning, existing work typically contains complex structures (uni/multi-modal encoder/decoder) and depends on external modules such as object detectors/taggers and optical character recognition (OCR). In GIT, we simplify the architecture to one image encoder and one text decoder under a single language modeling task. We also scale up the pre-training data and the model size to boost the model performance. Without bells and whistles, our GIT establishes new state-of-the-art results on numerous challenging benchmarks by a large margin. For instance, our model surpasses human performance for the first time on TextCaps (138.2 vs. 125.5 in CIDEr). Furthermore, we present a new scheme of generation-based image classification and scene text recognition, achieving decent performance on standard benchmarks.
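
The sketch below mirrors the abstract’s minimal recipe, one image encoder plus one causal text decoder trained with a plain language-modeling loss, but it is a simplified stand-in: it feeds image tokens to the decoder through cross-attention, whereas GIT itself concatenates image and text tokens in a single transformer. All sizes and modules here are illustrative.

```python
import torch
import torch.nn as nn

# Toy "encoder + causal decoder + LM loss" setup. Not the released GIT model.

class TinyGIT(nn.Module):
    def __init__(self, vocab_size=1000, dim=256):
        super().__init__()
        self.patch_proj = nn.Linear(768, dim)      # stand-in for the image encoder output
        self.token_embed = nn.Embedding(vocab_size, dim)
        layer = nn.TransformerDecoderLayer(dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.lm_head = nn.Linear(dim, vocab_size)

    def forward(self, image_feats, caption_ids):
        memory = self.patch_proj(image_feats)      # image tokens condition the decoder
        tgt = self.token_embed(caption_ids)
        L = tgt.size(1)
        causal = torch.triu(torch.full((L, L), float("-inf")), diagonal=1)
        hidden = self.decoder(tgt, memory, tgt_mask=causal)
        return self.lm_head(hidden)                # next-token logits

imgs = torch.randn(2, 49, 768)
caps = torch.randint(0, 1000, (2, 12))
logits = TinyGIT()(imgs, caps[:, :-1])            # teacher forcing
loss = nn.functional.cross_entropy(logits.reshape(-1, 1000), caps[:, 1:].reshape(-1))
```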

NeurIPS Conference 2022 Conference Paper

GLIPv2: Unifying Localization and Vision-Language Understanding

  • Haotian Zhang
  • Pengchuan Zhang
  • Xiaowei Hu
  • Yen-Chun Chen
  • Liunian Li
  • Xiyang Dai
  • Lijuan Wang
  • Lu Yuan

We present GLIPv2, a grounded VL understanding model that serves both localization tasks (e.g., object detection, instance segmentation) and Vision-Language (VL) understanding tasks (e.g., VQA, image captioning). GLIPv2 elegantly unifies localization pre-training and Vision-Language Pre-training (VLP) with three pre-training tasks: phrase grounding as a VL reformulation of the detection task, region-word contrastive learning as a novel region-word-level contrastive task, and masked language modeling. This unification not only simplifies the previous multi-stage VLP procedure but also achieves mutual benefits between localization and understanding tasks. Experimental results show that a single GLIPv2 model (all model weights are shared) achieves near-SoTA performance on various localization and understanding tasks. The model also shows (1) strong zero-shot and few-shot adaptation performance on open-vocabulary object detection tasks and (2) superior grounding capability on VL understanding tasks.
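
A minimal sketch of the region-word matching that underlies grounding-as-detection: region features are scored against word features of a text prompt, so classification reduces to picking the best-matching word. The shapes and temperature below are illustrative assumptions, not the GLIPv2 code.

```python
import torch
import torch.nn.functional as F

# Detection reformulated as region-word alignment: instead of a fixed label
# set, each region proposal is scored against the words of a text prompt.

regions = F.normalize(torch.randn(100, 256), dim=-1)   # proposal features from an image
words = F.normalize(torch.randn(12, 256), dim=-1)      # token features of "person. bicycle. dog."
logits = regions @ words.t() / 0.07                    # region-word similarity (temperature assumed)
best_word = logits.argmax(dim=-1)                      # each region's best-matching word
```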

NeurIPS Conference 2022 Conference Paper

K-LITE: Learning Transferable Visual Models with External Knowledge

  • Sheng Shen
  • Chunyuan Li
  • Xiaowei Hu
  • Yujia Xie
  • Jianwei Yang
  • Pengchuan Zhang
  • Zhe Gan
  • Lijuan Wang

The new generation of state-of-the-art computer vision systems is trained from natural language supervision, ranging from simple object category names to descriptive captions. This form of supervision ensures high generality and usability of the learned visual models, based on the broad concept coverage achieved through the large-scale data collection process. Alternatively, we argue that learning with external knowledge about images is a promising approach that leverages a much more structured source of supervision and offers sample efficiency. In this paper, we propose K-LITE (Knowledge-augmented Language-Image Training and Evaluation), a simple strategy to leverage external knowledge for building transferable visual systems: in training, it enriches entities in natural language with WordNet and Wiktionary knowledge, leading to an efficient and scalable approach to learning image representations that uses knowledge about the visual concepts; in evaluation, the natural language is also augmented with external knowledge and then used to reference learned visual concepts (or describe new ones) to enable zero-shot and few-shot transfer of the pre-trained models. We study the performance of K-LITE on two important computer vision problems, image classification and object detection, benchmarking on 20 and 13 different existing datasets, respectively. The proposed knowledge-augmented models show significant improvement in transfer learning performance over existing methods. Our code is released at https://github.com/microsoft/klite.
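
A minimal sketch of the knowledge-augmentation step, assuming WordNet glosses via nltk (the 'wordnet' corpus must be downloaded first) and a hypothetical prompt template; the actual K-LITE templates and knowledge sources are richer than this.

```python
# Enrich a concept name with its WordNet gloss before text encoding.
# Requires: pip install nltk; nltk.download('wordnet')
from nltk.corpus import wordnet

def enrich(concept):
    """Append the first WordNet definition of `concept` to a simple prompt."""
    synsets = wordnet.synsets(concept.replace(" ", "_"))
    gloss = synsets[0].definition() if synsets else ""
    base = f"a photo of a {concept}"
    return f"{base}, which is {gloss}" if gloss else base

print(enrich("mastiff"))
# e.g. "a photo of a mastiff, which is an old breed of powerful deep-chested ..."
```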

NeurIPS Conference 2022 Conference Paper

NUWA-Infinity: Autoregressive over Autoregressive Generation for Infinite Visual Synthesis

  • Jian Liang
  • Chenfei Wu
  • Xiaowei Hu
  • Zhe Gan
  • Jianfeng Wang
  • Lijuan Wang
  • Zicheng Liu
  • Yuejian Fang

Infinite visual synthesis aims to generate high-resolution images, long-duration videos, and even visual content of unlimited size. Some recent work tried to solve this task by first dividing data into processable patches and then training models on them without considering the dependencies between patches. However, since these methods fail to model global dependencies between patches, the quality and consistency of the generation can be limited. To address this issue, we propose NUWA-Infinity, a patch-level “render-and-optimize” strategy for infinite visual synthesis. Given a large image or a long video, NUWA-Infinity first splits it into non-overlapping patches and uses the ordered patch chain as a complete training instance; a rendering model then autoregressively predicts each patch based on its contexts. Once a patch is predicted, it is optimized immediately and its hidden states are saved as contexts for the next “render-and-optimize” step. This brings two advantages: (i) the autoregressive rendering process with information transfer between contexts provides implicit modeling of the global probability distribution; (ii) the timely optimization process alleviates the optimization stress of the model and helps convergence. Based on the above designs, NUWA-Infinity shows strong synthesis ability on high-resolution images and long-duration videos. The homepage link is https://nuwa-infinity.microsoft.com.
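
A minimal sketch of a render-and-optimize loop under stated assumptions: a tiny GRU stands in for the rendering model, each patch in the ordered chain is predicted from cached context, the loss is applied immediately, and detached states carry over as context for the next step. None of this is the NUWA-Infinity architecture.

```python
import torch
import torch.nn as nn

# Toy "render-and-optimize": predict the next patch from cached context,
# optimize right away, then detach the hidden state so gradients stay local.

class PatchRenderer(nn.Module):
    def __init__(self, patch_dim=64, hidden=128):
        super().__init__()
        self.rnn = nn.GRUCell(patch_dim, hidden)
        self.out = nn.Linear(hidden, patch_dim)

    def forward(self, prev_patch, h):
        h = self.rnn(prev_patch, h)   # fold the previous patch into the context
        return self.out(h), h         # render the next patch

model = PatchRenderer()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
patches = torch.randn(10, 1, 64)      # ordered, non-overlapping patch chain
h, prev = torch.zeros(1, 128), torch.zeros(1, 64)
for patch in patches:
    pred, h = model(prev, h)
    loss = nn.functional.mse_loss(pred, patch)
    loss.backward(); opt.step(); opt.zero_grad()   # optimize this patch immediately
    h, prev = h.detach(), patch                    # saved states become next context
```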

AAAI Conference 2022 Conference Paper

OVIS: Open-Vocabulary Visual Instance Search via Visual-Semantic Aligned Representation Learning

  • Sheng Liu
  • Kevin Lin
  • Lijuan Wang
  • Junsong Yuan
  • Zicheng Liu

We introduce the task of open-vocabulary visual instance search (OVIS). Given an arbitrary textual search query, OVIS aims to return a ranked list of visual instances, i.e., image patches, that satisfy the search intent from an image database. The term “open vocabulary” means that there are neither restrictions on the visual instances to be searched nor restrictions on the words that can be used to compose the textual search query. We propose to address this search challenge via visual-semantic aligned representation learning (ViSA). ViSA leverages a massive amount of image-caption pairs as weak image-level (not instance-level) supervision to learn a rich cross-modal semantic space where the representations of visual instances (not images) and those of textual queries are aligned, thus allowing us to measure the similarity between any visual instance and an arbitrary textual query. To evaluate the performance of ViSA, we build two datasets named OVIS40 and OVIS1400 and also introduce a pipeline for error analysis. Through extensive experiments on the two datasets, we demonstrate ViSA’s ability to search for visual instances in images not available during training, given a wide range of textual queries including those composed of uncommon words. Experimental results show that ViSA achieves an mAP@50 of 27.8% on OVIS40 and a recall@30 of 21.3% on OVIS1400 under the most challenging settings.
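
At query time, the kind of search described here reduces to ranking pre-extracted instance embeddings by similarity to an embedded text query; a minimal sketch with stand-in random embeddings follows. The encoders and database sizes are assumptions, not part of the paper.

```python
import torch
import torch.nn.functional as F

# Open-vocabulary instance search as nearest-neighbor ranking in a shared
# visual-semantic space. Embeddings are random stand-ins for encoder outputs.

instance_emb = F.normalize(torch.randn(10000, 512), dim=-1)  # database of patch embeddings
query_emb = F.normalize(torch.randn(512), dim=0)             # encoded textual query
scores = instance_emb @ query_emb                            # cosine similarity to the query
top_scores, top_idx = scores.topk(30)                        # ranked list, e.g. for recall@30
```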

AAAI Conference 2022 Conference Paper

Playing Lottery Tickets with Vision and Language

  • Zhe Gan
  • Yen-Chun Chen
  • Linjie Li
  • Tianlong Chen
  • Yu Cheng
  • Shuohang Wang
  • Jingjing Liu
  • Lijuan Wang

Large-scale pre-training has recently revolutionized vision-and-language (VL) research. Models such as LXMERT and UNITER have significantly lifted the state of the art over a wide range of VL tasks. However, the large number of parameters in such models hinders their application in practice. In parallel, work on the lottery ticket hypothesis (LTH) has shown that deep neural networks contain small matching subnetworks that can achieve performance on par with, or even better than, the dense networks when trained in isolation. In this work, we perform the first empirical study to assess whether such trainable subnetworks also exist in pre-trained VL models. We use UNITER as the main testbed (and also test LXMERT and ViLT), and consolidate 7 representative VL tasks for experiments, including visual question answering, visual commonsense reasoning, visual entailment, referring expression comprehension, image-text retrieval, GQA, and NLVR2. Through comprehensive analysis, we summarize our main findings as follows. (i) It is difficult to find subnetworks that strictly match the performance of the full model. However, we can find “relaxed” winning tickets at 50%-70% sparsity that maintain 99% of the full accuracy. (ii) Subnetworks found by task-specific pruning transfer reasonably well to the other tasks, while those found on the pre-training tasks at 60%/70% sparsity transfer universally, matching 98%/96% of the full accuracy on average over all the tasks. (iii) Besides UNITER, other models such as LXMERT and ViLT can also play lottery tickets. However, the highest sparsity we can achieve for ViLT is far lower than for LXMERT and UNITER (30% vs. 70%). (iv) LTH also remains relevant when using other training methods (e.g., adversarial training).
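
For readers unfamiliar with the lottery ticket procedure, a minimal sketch of one-shot magnitude pruning follows: keep the largest-magnitude weights, zero the rest, and retrain the masked network. The 60% sparsity mirrors a level mentioned in the abstract; everything else is illustrative.

```python
import torch

# One-shot magnitude pruning: build a binary mask that keeps the largest
# weights. During retraining the mask is re-applied after every update.

def magnitude_mask(weight, sparsity=0.6):
    k = int(weight.numel() * sparsity)                    # number of weights to prune
    threshold = weight.abs().flatten().kthvalue(k).values
    return (weight.abs() > threshold).float()

w = torch.randn(512, 512)
mask = magnitude_mask(w, 0.6)
w_pruned = w * mask   # the masked network is then retrained from its original init
```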

ICLR Conference 2021 Conference Paper

SEED: Self-supervised Distillation For Visual Representation

  • Zhiyuan Fang
  • Jianfeng Wang
  • Lijuan Wang
  • Lei Zhang 0001
  • Yezhou Yang
  • Zicheng Liu 0001

This paper is concerned with self-supervised learning for small models. The problem is motivated by our empirical studies that while the widely used contrastive self-supervised learning method has shown great progress on large model training, it does not work well for small models. To address this problem, we propose a new learning paradigm, named SElf-SupErvised Distillation (SEED), where we leverage a larger network (as Teacher) to transfer its representational knowledge into a smaller architecture (as Student) in a self-supervised fashion. Instead of directly learning from unlabeled data, we train a student encoder to mimic the similarity score distribution inferred by a teacher over a set of instances. We show that SEED dramatically boosts the performance of small networks on downstream tasks. Compared with self-supervised baselines, SEED improves the top-1 accuracy from 42.2% to 67.6% on EfficientNet-B0 and from 36.3% to 68.2% on MobileNet-v3-Large on the ImageNet-1k dataset.
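
A minimal sketch of the distillation objective the abstract describes: the student is trained to match the teacher’s similarity-score distribution over a queue of instance embeddings. The temperatures, queue size, and the KL formulation below are illustrative assumptions, not the released SEED code.

```python
import torch
import torch.nn.functional as F

# Student mimics the teacher's similarity distribution over a maintained
# queue of instance embeddings (temperatures are assumed values).

def seed_loss(student_emb, teacher_emb, queue, t_s=0.2, t_t=0.07):
    s = F.normalize(student_emb, dim=-1) @ queue.t() / t_s   # student similarity scores
    t = F.normalize(teacher_emb, dim=-1) @ queue.t() / t_t   # teacher similarity scores
    return F.kl_div(F.log_softmax(s, dim=-1), F.softmax(t, dim=-1),
                    reduction="batchmean")

queue = F.normalize(torch.randn(4096, 128), dim=-1)          # instance queue
loss = seed_loss(torch.randn(32, 128), torch.randn(32, 128), queue)
```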

NeurIPS Conference 2021 Conference Paper

VALUE: A Multi-Task Benchmark for Video-and-Language Understanding Evaluation

  • Linjie Li
  • Jie Lei
  • Zhe Gan
  • Licheng Yu
  • Yen-Chun Chen
  • Rohit Pillai
  • Yu Cheng
  • Luowei Zhou

Most existing video-and-language (VidL) research focuses on a single dataset, or on multiple datasets of a single task. In reality, a truly useful VidL system is expected to generalize easily to diverse tasks, domains, and datasets. To facilitate the evaluation of such systems, we introduce the Video-And-Language Understanding Evaluation (VALUE) benchmark, an assemblage of 11 VidL datasets over 3 popular tasks: (i) text-to-video retrieval; (ii) video question answering; and (iii) video captioning. The VALUE benchmark aims to cover a broad range of video genres, video lengths, data volumes, and task difficulty levels. Rather than focusing on single-channel videos with visual information only, VALUE promotes models that leverage information from both video frames and their associated subtitles, as well as models that share knowledge across multiple tasks. We evaluate various baseline methods with and without large-scale VidL pre-training, and systematically investigate the impact of video input channels, fusion methods, and different video representations. We also study the transferability between tasks, and conduct multi-task learning under different settings. The significant gap between our best model and human performance calls for future study of advanced VidL models. VALUE is available at https://value-benchmark.github.io/.

AAAI Conference 2021 Conference Paper

VIVO: Visual Vocabulary Pre-Training for Novel Object Captioning

  • Xiaowei Hu
  • Xi Yin
  • Kevin Lin
  • Lei Zhang
  • Jianfeng Gao
  • Lijuan Wang
  • Zicheng Liu

It is highly desirable yet challenging to generate image captions that can describe novel objects which are unseen in caption-labeled training data, a capability that is evaluated in the novel object captioning challenge (nocaps). In this challenge, no additional image-caption training data, other than COCO Captions, is allowed for model training. Thus, conventional Vision-Language Pre-training (VLP) methods cannot be applied. This paper presents VIsual VOcabulary pre-training (VIVO) that performs pre-training in the absence of caption annotations. By breaking the dependency on paired image-caption training data in VLP, VIVO can leverage large amounts of paired image-tag data to learn a visual vocabulary. This is done by pre-training a multi-layer Transformer model that learns to align image-level tags with their corresponding image region features. To address the unordered nature of image tags, VIVO uses a Hungarian matching loss with masked tag prediction to conduct pre-training. We validate the effectiveness of VIVO by fine-tuning the pre-trained model for image captioning. In addition, we perform an analysis of the visual-text alignment inferred by our model. The results show that our model can not only generate fluent image captions that describe novel objects, but also identify the locations of these objects. Our single model has achieved new state-of-the-art results on nocaps and surpassed the human CIDEr score.
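
A minimal sketch of a Hungarian matching loss for unordered tag prediction, in the spirit of the abstract: predictions are assigned to ground-truth tags by the minimum-cost permutation before the loss is computed. The shapes and cost definition are illustrative assumptions, not the VIVO implementation.

```python
import torch
from scipy.optimize import linear_sum_assignment

# Unordered tags: find the assignment of predictions to ground-truth tags that
# minimizes total negative log-likelihood, then apply the loss on that matching.

def hungarian_tag_loss(logits, tag_ids):
    # logits: (num_preds, vocab); tag_ids: (num_tags,) unordered ground truth
    log_probs = logits.log_softmax(dim=-1)
    cost = -log_probs[:, tag_ids]                         # (num_preds, num_tags)
    rows, cols = linear_sum_assignment(cost.detach().numpy())
    rows, cols = torch.as_tensor(rows), torch.as_tensor(cols)
    return -log_probs[rows, tag_ids[cols]].mean()

loss = hungarian_tag_loss(torch.randn(5, 1000), torch.tensor([3, 87, 421]))
```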

AAAI Conference 2020 Conference Paper

Pyramid Constrained Self-Attention Network for Fast Video Salient Object Detection

  • Yuchao Gu
  • Lijuan Wang
  • Ziqin Wang
  • Yun Liu
  • Ming-Ming Cheng
  • Shao-Ping Lu

Spatiotemporal information is essential for video salient object detection (VSOD) because object motion is highly attractive to human attention. Previous VSOD methods usually use Long Short-Term Memory (LSTM) or 3D ConvNet (C3D), which can only encode motion information through step-by-step propagation in the temporal domain. Recently, the non-local mechanism was proposed to capture long-range dependencies directly. However, it is not straightforward to apply the non-local mechanism to VSOD, because (i) it fails to capture motion cues and tends to learn motion-independent global contexts; and (ii) its computation and memory costs are prohibitive for video dense prediction tasks such as VSOD. To address the above problems, we design a Constrained Self-Attention (CSA) operation to capture motion cues, based on the prior that objects always move in a continuous trajectory. We group a set of CSA operations in Pyramid structures (PCSA) to capture objects at various scales and speeds. Extensive experimental results demonstrate that our method outperforms previous state-of-the-art methods in both accuracy and speed (110 FPS on a single Titan Xp) on five challenging datasets. Our code is available at https://github.com/guyuchao/PyramidCSA.
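
A minimal (and deliberately naive) sketch of attention constrained to a local spatio-temporal window, reflecting the prior that objects move along continuous trajectories. The loops, window size, and shapes are illustrative; the released PCSA is implemented far more efficiently.

```python
import torch
import torch.nn.functional as F

# Each query position attends only to a small spatial window in neighboring
# frames, rather than to all positions as in full non-local attention.

def constrained_attention(query, keys, values, window=3):
    # query: (H, W, C); keys/values: (T, H, W, C) from neighboring frames
    H, W, C = query.shape
    out = torch.zeros_like(query)
    r = window // 2
    for i in range(H):
        for j in range(W):
            ks = keys[:, max(0, i - r):i + r + 1, max(0, j - r):j + r + 1].reshape(-1, C)
            vs = values[:, max(0, i - r):i + r + 1, max(0, j - r):j + r + 1].reshape(-1, C)
            attn = F.softmax(ks @ query[i, j] / C ** 0.5, dim=0)
            out[i, j] = attn @ vs
    return out

out = constrained_attention(torch.randn(8, 8, 16),
                            torch.randn(4, 8, 8, 16),
                            torch.randn(4, 8, 8, 16))
```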

IROS Conference 2012 Conference Paper

Modeling and simulation of friction forces during needle insertion using Local Constraint Method

  • Lijuan Wang
  • Zhongkui Wang
  • Shinichi Hirai

In modern clinical practice, accurately steering needle-like tools during insertion into soft tissue is difficult, mainly due to the tissue's non-linear deformation and the complicated combination of forces between the tissue and the tool. In this paper, the interaction between tissue deformation and friction forces is discussed. We consider the relative velocity and the contact length as the main factors of friction force during tissue deformation. A friction model suitable for dynamic needle insertion simulation is built within a Finite Element (FE) framework. A Local Constraint Method (LCM) is proposed to calculate the tissue deformation and apply the friction forces to the tissue frame while avoiding remeshing. In our approach, a series of equivalent constraints and forces is generated by decomposing them within Local Regions (LRs) onto nodal points. Simulations of dynamic needle insertion based on this method have been conducted for validation.
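
As a loose illustration of a friction law driven by the two factors the abstract names, contact length and relative velocity, the sketch below combines a Coulomb-like term with a viscous term. The coefficients, units, and functional form are assumptions for illustration, not the paper's model.

```python
import numpy as np

# Toy needle-shaft friction: magnitude grows with contact length and with
# relative velocity; the force opposes the direction of relative motion.
# mu_c and b are made-up coefficients, not values from the paper.

def shaft_friction(contact_length_mm, rel_velocity_mm_s, mu_c=0.02, b=0.005):
    direction = -np.sign(rel_velocity_mm_s)
    magnitude = mu_c * contact_length_mm + b * contact_length_mm * abs(rel_velocity_mm_s)
    return direction * magnitude

print(shaft_friction(30.0, 5.0))   # friction force opposing insertion, per these assumptions
```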