Arrow Research search

Author name cluster

Zirui Wang

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

29 papers
2 author rows

Possible papers

29

AAAI Conference 2026 Conference Paper

Disentangling for Transfer: Boosting Limited Modalities via Information-Theoretic Regularization and Cross-Modal Reconstruction

  • Zhiyun Zhang
  • Yan-Jie Zhou
  • Yujian Hu
  • Xiyao Ma
  • Zhouhang Yuan
  • Zirui Wang
  • Hongkun Zhang
  • Minfeng Xu

The absence of critical modalities in medical imaging poses significant challenges for AI-driven diagnostic systems, particularly in scenarios where limited modalities must suffice for downstream tasks. Existing approaches often fail to fully leverage privileged features available only at training time, or to address the information gap between privileged and limited modalities, resulting in suboptimal performance. To address this, we propose a unified, dual-stage Disentanglement-AligNmenT framEwork (DANTE), which uses Information-Theoretic Regularization and Cross-Modal Reconstruction to decompose full-modality information into alignable and privileged-exclusive components. In the first stage, a self-supervised pre-training strategy based on cross-modal reconstruction acts as a proxy task to implicitly incentivize disentangled representations. In the second stage, we present an information-theoretic regularization to explicitly maximize the transfer of privileged knowledge through two novel modules: (1) a Mutual Alignment Module that employs multilevel bidirectional alignment between limited-modality features and alignable features, enhancing cross-modal representation consistency; and (2) a Privileged Compaction Module that restricts the privileged-exclusive information flow, promoting the integration of task-relevant content into alignable representations. Experimental results on three challenging medical datasets show that DANTE achieves state-of-the-art performance, demonstrating its effectiveness in leveraging privileged guidance under modality scarcity and its broad applicability across diverse medical imaging scenarios.
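
The abstract specifies the two regularizers only at a high level; the sketch below illustrates one plausible form for the Mutual Alignment and Privileged Compaction objectives. The loss forms are assumptions, not the paper's actual objectives.

```python
import torch
import torch.nn.functional as F

def mutual_alignment_loss(limited_feats, alignable_feats):
    """Illustrative bidirectional alignment (assumed form): pull limited-modality
    features and alignable full-modality features together in both directions."""
    l2a = 1 - F.cosine_similarity(limited_feats, alignable_feats.detach(), dim=-1)
    a2l = 1 - F.cosine_similarity(alignable_feats, limited_feats.detach(), dim=-1)
    return (l2a + a2l).mean()

def privileged_compaction_loss(priv_mu, priv_logvar):
    """Illustrative bottleneck (assumed form): a KL term to a standard normal
    prior restricts information flow through the privileged-exclusive branch."""
    return -0.5 * torch.mean(1 + priv_logvar - priv_mu.pow(2) - priv_logvar.exp())
```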

ECAI Conference 2025 Conference Paper

CRED-SQL: Enhancing Real-World Large Scale Database Text-to-SQL Parsing Through Cluster Retrieval and Execution Description

  • Shaoming Duan
  • Zirui Wang
  • Chuanyi Liu
  • Zhibin Zhu
  • Yuhao Zhang
  • Peiyi Han
  • Liang Yan
  • Zewu Peng

Recent advances in large language models (LLMs) have significantly improved the accuracy of Text-to-SQL systems. However, a critical challenge remains: the semantic mismatch between natural language questions (NLQs) and their corresponding SQL queries. This issue is exacerbated in large-scale databases, where semantically similar attributes hinder schema linking and induce semantic drift during SQL generation, ultimately reducing model accuracy. To address these challenges, we introduce CRED-SQL, a framework designed for large-scale databases that integrates Cluster Retrieval and Execution Description. CRED-SQL first performs cluster-based large-scale schema retrieval to pinpoint the tables and columns most relevant to a given NLQ, alleviating schema mismatch. It then introduces an intermediate natural language representation, the Execution Description Language (EDL), to bridge the gap between NLQs and SQL. This reformulation decomposes the task into two stages, Text-to-EDL and EDL-to-SQL, leveraging LLMs' strong general reasoning capabilities while reducing semantic deviation. Extensive experiments on two large-scale, cross-domain benchmarks, SpiderUnion and BirdUnion, demonstrate that CRED-SQL achieves new state-of-the-art (SOTA) performance, validating its effectiveness and scalability. Our code is available at https://github.com/smduan/CRED-SQL.git
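
As a rough illustration of the retrieve-then-describe flow, the sketch below clusters schema-element embeddings, keeps the clusters nearest the question, and then makes two generation calls. The `llm` callable, embedding model, prompts, and hyperparameters are placeholders, not the paper's implementation.

```python
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

def cred_sql_sketch(nlq, schema_elements, llm, n_clusters=50, top_k=3):
    """Hypothetical end-to-end flow (n_clusters must not exceed the number
    of schema elements): cluster schema embeddings, retrieve the clusters
    closest to the question, then generate SQL via an intermediate EDL step."""
    enc = SentenceTransformer("all-MiniLM-L6-v2")
    schema_vecs = enc.encode(schema_elements)
    km = KMeans(n_clusters=n_clusters, n_init="auto").fit(schema_vecs)
    q = enc.encode([nlq])[0]
    # Rank clusters by centroid similarity; keep schema in the top clusters.
    sims = km.cluster_centers_ @ q / (
        np.linalg.norm(km.cluster_centers_, axis=1) * np.linalg.norm(q))
    keep = set(np.argsort(-sims)[:top_k])
    schema = [s for s, c in zip(schema_elements, km.labels_) if c in keep]
    # Stage 1: Text-to-EDL; Stage 2: EDL-to-SQL (prompts are illustrative).
    edl = llm(f"Schema: {schema}\nQuestion: {nlq}\nDescribe the execution steps:")
    return llm(f"Schema: {schema}\nExecution description: {edl}\nWrite the SQL:")
```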

NeurIPS Conference 2025 Conference Paper

Cue3D: Quantifying the Role of Image Cues in Single-Image 3D Generation

  • Xiang Li
  • Zirui Wang
  • Zixuan Huang
  • James Rehg

Humans and traditional computer vision methods rely on a diverse set of monocular cues to infer 3D structure from a single image, such as shading, texture, and silhouettes. While recent deep generative models have dramatically advanced single-image 3D generation, it remains unclear which image cues these methods actually exploit. We introduce Cue3D, the first comprehensive, model-agnostic framework for quantifying the influence of individual image cues in single-image 3D generation. Our unified benchmark evaluates seven state-of-the-art methods, spanning regression-based, multi-view, and native 3D generative paradigms. By systematically perturbing cues such as shading, texture, silhouette, perspective, edges, and local continuity, we measure their impact on 3D output quality. Our analysis reveals that shape meaningfulness, not texture, dictates generalization. Geometric cues, particularly shading, are crucial for 3D generation. We further identify over-reliance on provided silhouettes and diverse sensitivities to cues such as perspective and local continuity across model families. By dissecting these dependencies, Cue3D advances our understanding of how modern 3D networks leverage classical vision cues, and offers directions for developing more transparent, robust, and controllable single-image 3D generation models.
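
Because the framework is model-agnostic, its core loop can be pictured as a perturb-and-measure harness. In this sketch, `model`, the perturbation functions, and `quality_metric` are hypothetical stand-ins for the paper's actual components.

```python
def cue_influence(model, images, meshes_gt, perturbations, quality_metric):
    """Model-agnostic sketch: measure how much each cue perturbation degrades
    3D output quality relative to unperturbed inputs. All callables are
    hypothetical stand-ins; higher score = stronger reliance on that cue."""
    base = sum(quality_metric(model(img), gt) for img, gt in zip(images, meshes_gt))
    scores = {}
    for name, perturb in perturbations.items():  # e.g. remove shading, scramble texture
        q = sum(quality_metric(model(perturb(img)), gt)
                for img, gt in zip(images, meshes_gt))
        scores[name] = (base - q) / len(images)  # mean quality drop = cue influence
    return scores
```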

NeurIPS Conference 2025 Conference Paper

Efficient Rectified Flow for Image Fusion

  • Zirui Wang
  • Jiayi Zhang
  • Tianwei Guan
  • Yuhan Zhou
  • Xingyuan Li
  • Minjing Dong
  • Jinyuan Liu

Image fusion is a fundamental and important task in computer vision, aiming to combine complementary information from different modalities into a single fused image. In recent years, diffusion models have driven significant progress in image fusion. However, diffusion models often require heavy computation and long inference times, which limits the applicability of these methods. To address this issue, we propose RFfusion, an efficient one-step diffusion model for image fusion based on Rectified Flow. We incorporate Rectified Flow into the image fusion task to straighten the sampling path of the diffusion model, achieving one-step sampling without the need for additional training, while still maintaining high-quality fusion results. Furthermore, we propose a task-specific variational autoencoder (VAE) architecture tailored for image fusion, where the fusion operation is embedded within the latent space to further reduce computational complexity. To address the inherent discrepancy between conventional reconstruction-oriented VAE objectives and the requirements of image fusion, we introduce a two-stage training strategy. This approach facilitates the effective learning and integration of complementary information from multi-modal source images, thereby enabling the model to retain fine-grained structural details while significantly enhancing inference efficiency. Extensive experiments demonstrate that our method outperforms other state-of-the-art methods in terms of both inference speed and fusion quality.
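
Rectified Flow itself has a compact form: training regresses the constant velocity of a straight path between a source sample and a data sample, and a (nearly) straight flow can be integrated in a single Euler step. A minimal sketch, independent of the paper's specific networks (the `velocity_net` signature is an assumption):

```python
import torch

def rf_train_step(velocity_net, x0, x1, opt):
    """Rectified-flow training: on the straight path x_t = (1-t)*x0 + t*x1,
    the ground-truth velocity is constant, x1 - x0; regress it at random t."""
    t = torch.rand(x0.shape[0], 1, 1, 1, device=x0.device)
    xt = (1 - t) * x0 + t * x1
    loss = ((velocity_net(xt, t.flatten()) - (x1 - x0)) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

@torch.no_grad()
def one_step_sample(velocity_net, x0):
    """One Euler step across [0, 1]; accurate when the learned flow is
    (nearly) straight, which is what rectification encourages."""
    t0 = torch.zeros(x0.shape[0], device=x0.device)
    return x0 + velocity_net(x0, t0)
```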

ICLR Conference 2025 Conference Paper

GS-CPR: Efficient Camera Pose Refinement via 3D Gaussian Splatting

  • Changkun Liu 0001
  • Shuai Chen
  • Yash Bhalgat
  • Siyan Hu
  • Ming Cheng
  • Zirui Wang
  • Victor Adrian Prisacariu
  • Tristan Braud

We leverage 3D Gaussian Splatting (3DGS) as a scene representation and propose a novel test-time camera pose refinement (CPR) framework, GS-CPR. This framework enhances the localization accuracy of state-of-the-art absolute pose regression and scene coordinate regression methods. The 3DGS model renders high-quality synthetic images and depth maps to facilitate the establishment of 2D-3D correspondences. GS-CPR obviates the need for training feature extractors or descriptors by operating directly on RGB images, utilizing the 3D foundation model, MASt3R, for precise 2D matching. To improve the robustness of our model in challenging outdoor environments, we incorporate an exposure-adaptive module within the 3DGS framework. Consequently, GS-CPR enables efficient one-shot pose refinement given a single RGB query and a coarse initial pose estimation. Our proposed approach surpasses leading NeRF-based optimization methods in both accuracy and runtime across indoor and outdoor visual localization benchmarks, achieving new state-of-the-art accuracy on two indoor datasets.
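
One plausible reading of the refinement loop is sketched below, with hypothetical `render_rgbd` and `match_2d2d` stand-ins (the paper renders from the 3DGS model and matches with MASt3R) and OpenCV's PnP solver; the camera-to-world pose convention is an assumption.

```python
import cv2
import numpy as np

def refine_pose(query_rgb, coarse_pose, render_rgbd, match_2d2d, K):
    """Sketch of render-and-match pose refinement: render a synthetic view and
    depth at the coarse pose, match query<->render pixels, lift matches to 3D
    with the rendered depth, then solve PnP. coarse_pose: 4x4 camera-to-world."""
    synth_rgb, synth_depth = render_rgbd(coarse_pose)          # from the 3DGS model
    pts_q, pts_s = match_2d2d(query_rgb, synth_rgb)            # Nx2 pixel matches
    z = synth_depth[pts_s[:, 1].astype(int), pts_s[:, 0].astype(int)]
    rays = np.linalg.inv(K) @ np.c_[pts_s, np.ones(len(pts_s))].T
    pts3d_cam = (rays * z).T                                   # back-project to 3D
    R, t = coarse_pose[:3, :3], coarse_pose[:3, 3]
    pts3d_world = pts3d_cam @ R.T + t                          # camera -> world
    ok, rvec, tvec, _ = cv2.solvePnPRansac(pts3d_world.astype(np.float64),
                                           pts_q.astype(np.float64), K, None)
    return ok, rvec, tvec                                      # refined pose
```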

IROS Conference 2025 Conference Paper

Hierarchical Trajectory Planning Method for Piano-Playing Robot

  • Zirui Wang
  • Jiayu Zhang
  • Wei Jiang
  • Tao Jiang
  • Jingdong Zhao
  • Liangliang Zhao
  • Baoshi Cao
  • Le Qi

Piano-playing tasks, which effectively demonstrate bimanual coordination capabilities in humanoid robots, are increasingly becoming a research focus. However, prior research has predominantly focused on Cartesian space trajectory planning without adequately addressing real-world obstacle avoidance constraints and manipulator acceleration limits. This paper proposes a hierarchical trajectory planning framework that systematically incorporates both obstacle avoidance and acceleration constraints. Firstly, discrete Cartesian path points are generated using a dynamic programming approach; secondly, joint space path points are derived considering obstacle avoidance and joint limit constraints through dynamic programming; thirdly, the joint space trajectory is interpolated using a Jacobian inverse-based method; finally, the trajectory is refined using Model Predictive Control (MPC). Experimental results demonstrate that the proposed method produces trajectories satisfying both obstacle avoidance and acceleration constraints, enabling fluent piano piece execution in real-world environments.
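
The first two stages are dynamic programming over discrete path points. The sketch below shows the generic Viterbi-style recursion such a stage could use, with `transition_cost` and `feasible` as hypothetical stand-ins for the paper's cost function and obstacle/joint-limit checks.

```python
import numpy as np

def dp_select_configs(candidates, transition_cost, feasible):
    """Viterbi-style DP sketch: candidates[i] is a list of joint configurations
    realizing path point i; pick one per point minimizing accumulated transition
    cost, skipping configurations that violate the feasibility checks."""
    n = len(candidates)
    cost = [np.where([feasible(c) for c in candidates[0]], 0.0, np.inf)]
    back = []
    for i in range(1, n):
        ci = np.full(len(candidates[i]), np.inf)
        bi = np.zeros(len(candidates[i]), dtype=int)
        for j, cfg in enumerate(candidates[i]):
            if not feasible(cfg):
                continue
            steps = [cost[-1][k] + transition_cost(prev, cfg)
                     for k, prev in enumerate(candidates[i - 1])]
            bi[j] = int(np.argmin(steps)); ci[j] = steps[bi[j]]
        cost.append(ci); back.append(bi)
    # Trace back the optimal sequence of configurations.
    idx = int(np.argmin(cost[-1])); path = [idx]
    for bi in reversed(back):
        idx = int(bi[idx]); path.append(idx)
    return [candidates[i][j] for i, j in enumerate(reversed(path))]
```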

ICRA Conference 2025 Conference Paper

Learning Humanoid Locomotion with Perceptive Internal Model

  • Junfeng Long
  • Junli Ren
  • Moji Shi
  • Zirui Wang
  • Tao Huang
  • Ping Luo 0002
  • Jiangmiao Pang

In contrast to quadruped robots that can navigate diverse terrains using a “blind” policy, humanoid robots require accurate perception for stable locomotion due to their high degrees of freedom and inherently unstable morphology. However, incorporating perceptual signals often introduces additional disturbances to the system, potentially reducing its robustness, generalizability, and efficiency. This paper presents the Perceptive Internal Model (PIM), which relies on onboard, continuously updated elevation maps centered around the robot to perceive its surroundings. We train the policy using ground-truth obstacle heights surrounding the robot in simulation, optimizing it based on the Hybrid Internal Model (HIM), and perform inference with heights sampled from the constructed elevation map. Unlike previous methods that directly encode depth maps or raw point clouds, our approach allows the robot to perceive the terrain beneath its feet clearly and is less affected by camera movement or noise. Furthermore, since depth map rendering is not required in simulation, our method introduces minimal additional computational cost and can train the policy in 3 hours on an RTX 4090 GPU. We verify the effectiveness of our method across multiple humanoid robots, diverse indoor and outdoor terrains, stairs, and different sensor configurations. Our method enables a humanoid robot to continuously climb stairs and has the potential to serve as a foundational algorithm for the development of future humanoid control methods.

ICLR Conference 2025 Conference Paper

MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-tuning

  • Haotian Zhang 0005
  • Mingfei Gao
  • Zhe Gan
  • Philipp Dufter
  • Nina Wenzel
  • Forrest Huang
  • Dhruti Shah
  • Xianzhi Du

We present MM1.5, a new family of multimodal large language models (MLLMs) designed to enhance capabilities in text-rich image understanding, visual referring and grounding, and multi-image reasoning. Building upon the MM1 architecture, MM1.5 adopts a data-centric approach to model training, systematically exploring the impact of diverse data mixtures across the entire model training lifecycle. This includes high-quality OCR data and synthetic captions for continual pre-training, as well as an optimized visual instruction-tuning data mixture for supervised fine-tuning. Our models range from 1B to 30B parameters, encompassing both dense and mixture-of-experts (MoE) variants, and demonstrate that careful data curation and training strategies can yield strong performance even at small scales (1B and 3B). Additionally, we introduce two specialized variants: MM1.5-Video, designed for video understanding, and MM1.5-UI, tailored for mobile UI understanding. Through extensive empirical studies and ablations, we provide detailed insights into the training processes and decisions that inform our final designs, offering valuable guidance for future research in MLLM development.

NeurIPS Conference 2025 Conference Paper

Seeing in the Dark: Benchmarking Egocentric 3D Vision with the Oxford Day-and-Night Dataset

  • Zirui Wang
  • Wenjing Bian
  • Xinghui Li
  • Yifu Tao
  • Jianeng Wang
  • Maurice Fallon
  • Victor Prisacariu

We introduce Oxford Day-and-Night, a large-scale, egocentric dataset for novel view synthesis (NVS) and visual relocalisation under challenging lighting conditions. Existing datasets often lack crucial combinations of features such as ground-truth 3D geometry, wide-ranging lighting variation, and full 6DoF motion. Oxford Day-and-Night addresses these gaps by leveraging Meta ARIA glasses to capture egocentric video and applying multi-session SLAM to estimate camera poses, reconstruct 3D point clouds, and align sequences captured under varying lighting conditions, including both day and night. The dataset spans over 30 km of recorded trajectories and covers an area of $40{,}000\,\mathrm{m}^2$, offering a rich foundation for egocentric 3D vision research. It supports two core benchmarks, NVS and relocalisation, providing a unique platform for evaluating models in realistic and diverse environments. Project page: https://oxdan.active.vision/

NeurIPS Conference 2025 Conference Paper

Situat3DChange: Situated 3D Change Understanding Dataset for Multimodal Large Language Model

  • Ruiping Liu
  • Junwei Zheng
  • Yufan Chen
  • Zirui Wang
  • Kunyu Peng
  • Kailun Yang
  • Jiaming Zhang
  • Marc Pollefeys

Physical environments and circumstances are fundamentally dynamic, yet current 3D datasets and evaluation benchmarks tend to concentrate on either dynamic scenarios or dynamic situations in isolation, resulting in incomplete comprehension. To overcome these constraints, we introduce Situat3DChange, an extensive dataset supporting three situation-aware change understanding tasks following the perception-action model: 121K question-answer pairs, 36K change descriptions for perception tasks, and 17K rearrangement instructions for the action task. To construct this large-scale dataset, Situat3DChange leverages 11K human observations of environmental changes to establish shared mental models and shared situational awareness for human-AI collaboration. These observations, enriched with egocentric and allocentric perspectives as well as categorical and coordinate spatial relations, are integrated using an LLM to support understanding of situated changes. To address the challenge of comparing pairs of point clouds from the same scene with minor changes, we propose SCReasoner, an efficient 3D MLLM approach that enables effective point cloud comparison with minimal parameter overhead and no additional tokens required for the language decoder. Comprehensive evaluation on Situat3DChange tasks highlights both the progress and limitations of MLLMs in dynamic scene and situation understanding. Additional experiments on data scaling and cross-domain transfer demonstrate the task-agnostic effectiveness of using Situat3DChange as a training dataset for MLLMs. The established dataset and source code are publicly available at: https://github.com/RuipingL/Situat3DChange.

NeurIPS Conference 2024 Conference Paper

CharXiv: Charting Gaps in Realistic Chart Understanding in Multimodal LLMs

  • Zirui Wang
  • Mengzhou Xia
  • Luxi He
  • Howard Chen
  • Yitao Liu
  • Richard Zhu
  • Kaiqu Liang
  • Xindi Wu

Chart understanding plays a pivotal role when applying Multimodal Large Language Models (MLLMs) to real-world tasks such as analyzing scientific papers or financial reports. However, existing datasets often focus on oversimplified and homogeneous charts with template-based questions, leading to an overly optimistic measure of progress. We demonstrate that although open-source models can appear to outperform strong proprietary models on these benchmarks, a simple stress test with slightly different charts or questions degrades performance by up to 34.5%. In this work, we propose CharXiv, a comprehensive evaluation suite involving 2,323 natural, challenging, and diverse charts from scientific papers. CharXiv includes two types of questions: 1) descriptive questions about examining basic chart elements and 2) reasoning questions that require synthesizing information across complex visual elements in the chart. To ensure quality, all charts and questions are handpicked, curated, and verified by human experts. Our results reveal a substantial, previously underestimated gap between the reasoning skills of the strongest proprietary model (i.e., GPT-4o), which achieves 47.1% accuracy, and the strongest open-source model (i.e., InternVL Chat V1.5), which achieves 29.2%. All models lag far behind human performance of 80.5%, underscoring weaknesses in the chart understanding capabilities of existing MLLMs. We hope that CharXiv facilitates future research on MLLM chart understanding by providing a more realistic and faithful measure of progress. Project website: https://charxiv.github.io/

ICLR Conference 2024 Conference Paper

Ferret: Refer and Ground Anything Anywhere at Any Granularity

  • Haoxuan You
  • Haotian Zhang 0005
  • Zhe Gan
  • Xianzhi Du
  • Bowen Zhang 0002
  • Zirui Wang
  • Liangliang Cao
  • Shih-Fu Chang

We introduce Ferret, a new Multimodal Large Language Model (MLLM) capable of understanding spatial referring of any shape or granularity within an image and accurately grounding open-vocabulary descriptions. To unify referring and grounding in the LLM paradigm, Ferret employs a novel and powerful hybrid region representation that integrates discrete coordinates and continuous features jointly to represent a region in the image. To extract the continuous features of versatile regions, we propose a spatial-aware visual sampler, adept at handling varying sparsity across different shapes. Consequently, Ferret can accept diverse region inputs, such as points, bounding boxes, and free-form shapes. To bolster the desired capability of Ferret, we curate GRIT, a comprehensive refer-and-ground instruction tuning dataset including 1.1M samples that contain rich hierarchical spatial knowledge, with an additional 130K hard negative data to promote model robustness. The resulting model not only achieves superior performance in classical referring and grounding tasks, but also greatly outperforms existing MLLMs in region-based and localization-demanded multimodal chatting. Our evaluations also reveal a significantly improved capability of describing image details and a remarkable alleviation in object hallucination.

ICLR Conference 2024 Conference Paper

Hybrid Internal Model: Learning Agile Legged Locomotion with Simulated Robot Response

  • Junfeng Long
  • Zirui Wang
  • Quanyi Li
  • Liu Cao
  • Jiawei Gao 0004
  • Jiangmiao Pang

Robust locomotion control depends on accurate state estimations. However, the sensors of most legged robots can only provide partial and noisy observations, making the estimation particularly challenging, especially for external states like terrain frictions and elevation maps. Inspired by the classical Internal Model Control principle, we consider these external states as disturbances and introduce Hybrid Internal Model (HIM) to estimate them according to the response of the robot. The response, which we refer to as the hybrid internal embedding, contains the robot’s explicit velocity and implicit stability representation, corresponding to two primary goals for locomotion tasks: explicitly tracking velocity and implicitly maintaining stability. We use contrastive learning to optimize the embedding to be close to the robot’s successor state, in which the response is naturally embedded. HIM has several appealing benefits: It only needs the robot’s proprioception, i.e., readings from joint encoders and the IMU, as observations. It maintains consistent observations between the simulation reference and reality, avoiding the information loss incurred in mimic learning. It exploits batch-level information, which is more robust to noise and yields better sample efficiency. It requires only 1 hour of training on an RTX 4090 to enable a quadruped robot to traverse any terrain under any disturbances. A wealth of real-world experiments demonstrates its agility, even in high-difficulty tasks and cases that never occurred during training, revealing remarkable open-world generalizability.
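
The contrastive objective described, pulling the hybrid internal embedding toward an encoding of the robot's successor state with batch-level negatives, can be sketched as a standard InfoNCE loss; the exact loss and temperature used in the paper are not given here, so treat this as an assumed form.

```python
import torch
import torch.nn.functional as F

def him_contrastive_loss(hybrid_emb, successor_emb, temperature=0.1):
    """InfoNCE-style sketch: each hybrid internal embedding should be closest
    to the embedding of its own successor state; other samples in the batch
    serve as negatives (the batch-level information the abstract mentions)."""
    z1 = F.normalize(hybrid_emb, dim=-1)
    z2 = F.normalize(successor_emb, dim=-1)
    logits = z1 @ z2.t() / temperature                  # [B, B] similarities
    labels = torch.arange(z1.shape[0], device=z1.device)
    return F.cross_entropy(logits, labels)
```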

IJCAI Conference 2024 Conference Paper

Improving Multi-agent Reinforcement Learning with Stable Prefix Policy

  • Yue Deng
  • Zirui Wang
  • Yin Zhang

In multi-agent reinforcement learning (MARL), the epsilon-greedy method plays an important role in balancing exploration and exploitation during decision-making in value-based algorithms. However, epsilon-greedy exploration introduces conservativeness into the expected state-value estimate precisely when agents most need exploitation, namely during approximate policy convergence, which may result in suboptimal policy convergence. Conversely, eliminating epsilon-greedy exploration leaves no exploration at all and may lead to unacceptable local optima. To address this dilemma, we use previously collected trajectories to construct a Monte-Carlo Trajectory Tree, from which an existing optimal template, a sequence of state prototypes, can be planned. The agents start by following the planned template and acting according to the policy without exploration, which we call the Stable Prefix Policy. When the policy still needs exploration, the agents adaptively drop out of the template and resume epsilon-greedy exploration. We apply our approach to various value-based MARL methods and empirically verify it on a cooperative MARL benchmark, SMAC. Experimental results demonstrate that our method achieves not only better performance but also faster convergence than baseline algorithms within early time steps.
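
One way to picture the decision rule, follow the template without exploration until the state stops matching its prototype, then fall back to epsilon-greedy, is sketched below; the cosine matching test and threshold are assumptions, not the paper's criterion.

```python
import numpy as np

def act(state, step, template, q_values, epsilon, match_threshold=0.9):
    """Sketch: follow the planned template (a sequence of state prototypes)
    greedily while the current state still matches it, then drop out to the
    usual epsilon-greedy exploration. The matching metric is an assumption."""
    proto = template[step] if step < len(template) else None
    on_template = proto is not None and np.dot(state, proto) / (
        np.linalg.norm(state) * np.linalg.norm(proto) + 1e-8) > match_threshold
    if on_template or np.random.rand() > epsilon:
        return int(np.argmax(q_values(state)))        # exploit
    return np.random.randint(len(q_values(state)))    # explore
```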

ICML Conference 2024 Conference Paper

Language Models as Science Tutors

  • Alexis Chevalier
  • Jiayi Geng
  • Alexander Wettig
  • Howard Chen 0003
  • Sebastian Mizera
  • Toni Annala
  • Max Jameson Aragon
  • Arturo Rodríguez Fanlo

NLP has recently made exciting progress toward training language models (LMs) with strong scientific problem-solving skills. However, model development has not focused on real-life use-cases of LMs for science, including applications in education that require processing long scientific documents. To address this, we introduce TutorEval and TutorChat. TutorEval is a diverse question-answering benchmark consisting of questions about long chapters from STEM textbooks, written by experts. TutorEval helps measure real-life usability of LMs as scientific assistants, and it is the first benchmark combining long contexts, free-form generation, and multi-disciplinary scientific knowledge. Moreover, we show that fine-tuning base models with existing dialogue datasets leads to poor performance on TutorEval. Therefore, we create TutorChat, a dataset of 80,000 long synthetic dialogues about textbooks. We use TutorChat to fine-tune Llemma models with 7B and 34B parameters. These LM tutors specialized in math have a 32K-token context window, and they excel at TutorEval while performing strongly on GSM8K and MATH. Our datasets build on open-source materials, and we release our models, data, and evaluations publicly.

ICRA Conference 2024 Conference Paper

Model Design and Concept of Operations of Standard Interface for On-orbit Construction

  • Jingdong Zhao
  • Zirui Wang
  • Ziyi Liu
  • Liangliang Zhao
  • Qifan Duan
  • Hong Liu 0002

The construction of large-scale space facilities requires the use of on-orbit construction technology. However, several of its key components, such as standard interface design, compliant control methods, and path planning for multi-branch robots, still need improvement before practical application. This paper presents a comprehensive solution for on-orbit construction tasks, encompassing a novel standard interface, docking control method, and path planning method for space multi-branch robots. Firstly, a novel standard interface is introduced, which features multiple mating modes and a lightweight design. Additionally, a compliant docking method is provided to generate lower contact forces along the Z-direction. Furthermore, for four-armed space robots, a hierarchical planning method is proposed, which innovates in environment map construction and locomotion planning. Specifically, the closed-form Minkowski sum method is employed to solve the robot’s free space, and a concise locomotion method is elucidated based on transition support points. Finally, simulations and experiments are conducted.

NeurIPS Conference 2024 Conference Paper

Parallelizing Model-based Reinforcement Learning Over the Sequence Length

  • Zirui Wang
  • Yue Deng
  • Junfeng Long
  • Yin Zhang

Recently, Model-based Reinforcement Learning (MBRL) methods have demonstrated stunning sample efficiency in various RL domains. However, achieving this extraordinary sample efficiency comes with additional training costs in terms of computations, memory, and training time. To address these challenges, we propose the Parallelized Model-based Reinforcement Learning (PaMoRL) framework. PaMoRL introduces two novel techniques: the Parallel World Model (PWM) and Parallelized Eligibility Trace Estimation (PETE), to parallelize both the model learning and policy learning stages of current MBRL methods over the sequence length. Our PaMoRL framework is hardware-efficient and stable, and it can be applied to various tasks with discrete or continuous action spaces using a single set of hyperparameters. The empirical results demonstrate that the PWM and PETE within PaMoRL significantly increase training speed without sacrificing inference efficiency. In terms of sample efficiency, PaMoRL maintains an MBRL-level sample efficiency that outperforms other no-look-ahead MBRL methods and model-free RL methods, and it even exceeds the performance of planning-based MBRL methods and methods with larger networks in certain tasks.
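
PETE's exact formulation is not given in the abstract, but the key enabling fact is that eligibility-trace (lambda-return) targets satisfy a first-order linear recurrence, which can be evaluated in parallel over the sequence length rather than with a Python loop. A minimal sketch of that idea:

```python
import torch

def lambda_returns_parallel(rewards, values_next, gamma=0.997, lam=0.95):
    """Lambda-returns obey G_t = a_t + b * G_{t+1} with
    a_t = r_t + gamma*(1-lam)*V(s_{t+1}) and constant b = gamma*lam, so with
    G_T = 0 they have the closed form G_t = sum_{m>=t} b^(m-t) * a_m, computable
    via one reversed cumulative sum. (The b**k scaling underflows for long
    sequences; a log-time associative scan is the numerically robust variant.)"""
    a = rewards + gamma * (1 - lam) * values_next   # [T]
    b = gamma * lam
    T = rewards.shape[0]
    k = torch.arange(T, dtype=rewards.dtype, device=rewards.device)
    scaled = a * b ** k                              # b^m * a_m
    revcum = torch.flip(torch.cumsum(torch.flip(scaled, [0]), 0), [0])
    return revcum * b ** (-k)                        # divide out b^t
```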

EAAI Journal 2024 Journal Article

Research on data-driven model for power grid fault diagnosis fusing topological quantification information

  • Xu Zhang
  • Zirui Wang
  • Mingxuan Du
  • Xuekui Mao
  • Ruiting Ding
  • Haoran Yu
  • Ziqi Zhang

In increasingly complex power grid operation scenarios and fault modes, rapid and accurate fault identification is vital for improving the reliability of power systems. Faced with the massive amount of available power grid data, the rapid development of artificial intelligence technology provides a powerful tool for power grid fault diagnosis. However, existing data-driven diagnosis methods lack quantitative representations of power grid topology changes, cannot integrate the fault topology with alarm information, and are of limited effectiveness in diagnosing complex faults. To address these issues, a data-driven power grid fault diagnosis model that integrates topological quantitative features is proposed. By studying how the topological connection relationships of equipment change before and after a fault, together with the topological connectivity within the power failure zone, we derive quantitative features of the fault topology from basic topological network indicators in graph theory, and integrate these features with the alarm information to construct the data-driven diagnosis model. The model uses the light gradient boosting machine algorithm to determine the types of complex faults and accurately identify faulty equipment, addressing the limited diagnostic effectiveness of previous studies. Finally, the accuracy and effectiveness of the model are verified using simulated fault cases.
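
A hedged sketch of the pipeline shape: fuse illustrative graph-theoretic indicators (computed with networkx; the indicator choice is an assumption, not the paper's feature set) with alarm features, then classify with LightGBM.

```python
import networkx as nx
import numpy as np
from lightgbm import LGBMClassifier

def topo_features(g_before: nx.Graph, g_after: nx.Graph):
    """Illustrative quantification of topology change around a fault:
    simple before-vs-after indicators plus post-fault connectivity."""
    return np.array([
        g_before.number_of_edges() - g_after.number_of_edges(),
        nx.number_connected_components(g_after)
        - nx.number_connected_components(g_before),
        np.mean([d for _, d in g_after.degree()])
        if g_after.number_of_nodes() else 0.0,
    ])

def train(graphs, X_alarm, y):
    """graphs: (before, after) topology pairs; X_alarm: encoded alarm
    messages; y: fault-type labels. Fuse both feature blocks and fit."""
    X_topo = np.stack([topo_features(b, a) for b, a in graphs])
    X = np.hstack([X_topo, X_alarm])
    return LGBMClassifier().fit(X, y)
```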

TMLR Journal 2023 Journal Article

Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models

  • Aarohi Srivastava
  • Abhinav Rastogi
  • Abhishek Rao
  • Abu Awal Md Shoeb
  • Abubakar Abid
  • Adam Fisch
  • Adam R. Brown
  • Adam Santoro

Language models demonstrate both quantitative improvement and new qualitative capabilities with increasing scale. Despite their potentially transformative impact, these new capabilities are as yet poorly characterized. In order to inform future research, prepare for disruptive new model capabilities, and ameliorate socially harmful effects, it is vital that we understand the present and near-future capabilities and limitations of language models. To address this challenge, we introduce the Beyond the Imitation Game benchmark (BIG-bench). BIG-bench currently consists of 204 tasks, contributed by 450 authors across 132 institutions. Task topics are diverse, drawing problems from linguistics, childhood development, math, common-sense reasoning, biology, physics, social bias, software development, and beyond. BIG-bench focuses on tasks that are believed to be beyond the capabilities of current language models. We evaluate the behavior of OpenAI's GPT models, Google-internal dense transformer architectures, and Switch-style sparse transformers on BIG-bench, across model sizes spanning millions to hundreds of billions of parameters. In addition, a team of human expert raters performed all tasks in order to provide a strong baseline. Findings include: model performance and calibration both improve with scale, but are poor in absolute terms (and when compared with rater performance); performance is remarkably similar across model classes, though with benefits from sparsity; tasks that improve gradually and predictably commonly involve a large knowledge or memorization component, whereas tasks that exhibit "breakthrough" behavior at a critical scale often involve multiple steps or components, or brittle metrics; social bias typically increases with scale in settings with ambiguous context, but this can be improved with prompting.

JMLR Journal 2023 Journal Article

Boosting Multi-agent Reinforcement Learning via Contextual Prompting

  • Yue Deng
  • Zirui Wang
  • Xi Chen
  • Yin Zhang

Multi-agent reinforcement learning (MARL) has gained increasing attention due to its ability to enable multiple agents to learn policies simultaneously. However, the bootstrapping error arises from the difference between the estimated Q value and the real discounted return and accumulates backward through dynamic programming iterations. This error can become even larger as the number of agents increases, due to the exponential growth of agent interactions, resulting in infeasible learning time and incorrect actions during early training steps. To address this challenge, we observe that previously collected trajectories are useful contexts, model them using a contextual predictor to yield the next action and observation, and use the contextual predictor to replace the Q value function or utility function during the early training phase. Furthermore, we employ a joint-action sampling mechanism to restrict the action space and dynamically select policies from the vanilla utility network and those from the contextual trajectory predictor to perform rollout processes. By reasonably constraining the action space and rollout process, we can significantly accelerate the algorithm training process. Our framework applies to various value-based MARL methods in both centralized training decentralized execution (CTDE) and non-CTDE scenarios, where agents do or do not, respectively, have access to global states during training. Experimental results on three tasks, Spread, Tag, and Reference, from the Particle World Environment (PWE) show that our framework significantly accelerates the training process of existing state-of-the-art CTDE and non-CTDE MARL methods, while also competing with or outperforming their original versions.

NeurIPS Conference 2023 Conference Paper

Language Models Meet World Models: Embodied Experiences Enhance Language Models

  • Jiannan Xiang
  • Tianhua Tao
  • Yi Gu
  • Tianmin Shu
  • Zirui Wang
  • Zichao Yang
  • Zhiting Hu

While large language models (LMs) have shown remarkable capabilities across numerous tasks, they often struggle with simple reasoning and planning in physical environments, such as understanding object permanence or planning household activities. The limitation arises from the fact that LMs are trained only on written text and miss essential embodied knowledge and skills. In this paper, we propose a new paradigm of enhancing LMs by finetuning them with world models, to gain diverse embodied knowledge while retaining their general language capabilities. Our approach deploys an embodied agent in a world model, particularly a simulator of the physical world (VirtualHome), and acquires a diverse set of embodied experiences through both goal-oriented planning and random exploration. These experiences are then used to finetune LMs to teach diverse abilities of reasoning and acting in the physical world, e.g., planning and completing goals, object permanence and tracking, etc. Moreover, it is desirable to preserve the generality of LMs during finetuning, which facilitates generalizing the embodied knowledge across tasks rather than being tied to specific simulations. We thus further introduce the classical elastic weight consolidation (EWC) for selective weight updates, combined with low-rank adapters (LoRA) for training efficiency. Extensive experiments show our approach substantially improves base LMs on 18 downstream tasks by 64.28% on average. In particular, the small LMs (1.3B, 6B, and 13B) enhanced by our approach match or even outperform much larger LMs (e.g., ChatGPT).
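
The EWC component has a standard closed form: quadratic penalties on parameter drift away from the pre-finetuning weights, scaled by diagonal Fisher information. A minimal sketch (the paper combines this with LoRA adapters; the weighting constant is illustrative):

```python
import torch

def ewc_penalty(model, fisher, theta_star, lam=1.0):
    """Elastic weight consolidation: penalize movement of each parameter away
    from its pre-finetuning value theta*, weighted by its (diagonal) Fisher
    information, so general language ability is preserved while finetuning
    absorbs the embodied experiences."""
    loss = 0.0
    for name, p in model.named_parameters():
        if p.requires_grad and name in fisher:
            loss = loss + (fisher[name] * (p - theta_star[name]) ** 2).sum()
    return 0.5 * lam * loss

# Training objective: total_loss = task_loss + ewc_penalty(model, fisher, theta_star)
```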

ICLR Conference 2023 Conference Paper

On the Feasibility of Cross-Task Transfer with Model-Based Reinforcement Learning

  • Yifan Xu 0009
  • Nicklas Hansen 0001
  • Zirui Wang
  • Yung-Chieh Chan
  • Hao Su 0001
  • Zhuowen Tu

Reinforcement Learning (RL) algorithms can solve challenging control problems directly from image observations, but they often require millions of environment interactions to do so. Recently, model-based RL algorithms have greatly improved sample-efficiency by concurrently learning an internal model of the world, and supplementing real environment interactions with imagined rollouts for policy improvement. However, learning an effective model of the world from scratch is challenging, and in stark contrast to humans that rely heavily on world understanding and visual cues for learning new skills. In this work, we investigate whether internal models learned by modern model-based RL algorithms can be leveraged to solve new, distinctly different tasks faster. We propose Model-Based Cross-Task Transfer (XTRA), a framework for sample-efficient online RL with scalable pretraining and finetuning of learned world models. By offline multi-task pretraining and online cross-task finetuning, we achieve substantial improvements over a baseline trained from scratch; we improve mean performance of model-based algorithm EfficientZero by 23%, and by as much as 71% in some instances. Project page: https://nicklashansen.github.io/xtra

TMLR Journal 2022 Journal Article

CoCa: Contrastive Captioners are Image-Text Foundation Models

  • Jiahui Yu
  • Zirui Wang
  • Vijay Vasudevan
  • Legg Yeung
  • Mojtaba Seyedhosseini
  • Yonghui Wu

Exploring large-scale pretrained foundation models is of significant interest in computer vision because these models can be quickly transferred to many downstream tasks. This paper presents Contrastive Captioner (CoCa), a minimalist design to pretrain an image-text encoder-decoder foundation model jointly with contrastive loss and captioning loss, thereby subsuming model capabilities from contrastive approaches like CLIP and generative methods like SimVLM. In contrast to standard encoder-decoder transformers where all decoder layers attend to encoder outputs, CoCa omits cross-attention in the first half of decoder layers to encode unimodal text representations, and cascades the remaining decoder layers which cross-attend to the image encoder for multimodal image-text representations. We apply a contrastive loss between unimodal image and text embeddings, in addition to a captioning loss on the multimodal decoder outputs which predicts text tokens autoregressively. By sharing the same computational graph, the two training objectives are computed efficiently with minimal overhead. CoCa is pretrained end-to-end and from scratch on both web-scale alt-text data and annotated images by treating all labels simply as text, seamlessly unifying natural language supervision for representation learning. Empirically, CoCa achieves state-of-the-art performance with zero-shot transfer or minimal task-specific adaptation on a broad range of downstream tasks, spanning visual recognition (ImageNet, Kinetics-400/600/700, Moments-in-Time), crossmodal retrieval (MSCOCO, Flickr30K, MSR-VTT), multimodal understanding (VQA, SNLI-VE, NLVR2), and image captioning (MSCOCO, NoCaps). Notably on ImageNet classification, CoCa obtains 86.3% zero-shot top-1 accuracy, 90.6% with a frozen encoder and learned classification head, and 91.0% with a finetuned encoder.
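
The two training objectives described can be sketched jointly: a CLIP-style contrastive loss on the unimodal embeddings plus autoregressive captioning cross-entropy on the multimodal decoder outputs. The loss weights and temperature below are illustrative, not the paper's exact values.

```python
import torch
import torch.nn.functional as F

def coca_loss(img_emb, txt_emb, caption_logits, caption_tokens,
              contrastive_weight=1.0, caption_weight=2.0, temperature=0.07):
    """Sketch of CoCa's combined objective on one batch.
    img_emb, txt_emb: [B, D] unimodal embeddings;
    caption_logits: [B, L, V] multimodal decoder outputs; caption_tokens: [B, L]."""
    z_i = F.normalize(img_emb, dim=-1)
    z_t = F.normalize(txt_emb, dim=-1)
    logits = z_i @ z_t.t() / temperature
    labels = torch.arange(logits.shape[0], device=logits.device)
    # Symmetric image-to-text and text-to-image contrastive loss.
    con = (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels)) / 2
    # Next-token captioning loss on the multimodal decoder.
    cap = F.cross_entropy(
        caption_logits[:, :-1].reshape(-1, caption_logits.shape[-1]),
        caption_tokens[:, 1:].reshape(-1))
    return contrastive_weight * con + caption_weight * cap
```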

AAAI Conference 2022 Conference Paper

HarmoFL: Harmonizing Local and Global Drifts in Federated Learning on Heterogeneous Medical Images

  • Meirui Jiang
  • Zirui Wang
  • Qi Dou

Multiple medical institutions collaboratively training a model using federated learning (FL) has become a promising solution for maximizing the potential of data-driven models, yet the non-independent and identically distributed (non-iid) data in medical images is still an outstanding challenge in real-world practice. The feature heterogeneity caused by diverse scanners or protocols introduces a drift in the learning process, in both local (client) and global (server) optimizations, which harms the convergence as well as model performance. Many previous works have attempted to address the non-iid issue by tackling the drift locally or globally, but how to jointly solve the two essentially coupled drifts is still unclear. In this work, we concentrate on handling both local and global drifts and introduce a new harmonizing framework called HarmoFL. First, we propose to mitigate the local update drift by normalizing amplitudes of images transformed into the frequency domain to mimic a unified imaging setting, in order to generate a harmonized feature space across local clients. Second, based on harmonized features, we design a client weight perturbation guiding each local model to reach a flat optimum, where a neighborhood area of the local optimal solution has a uniformly low loss. Without any extra communication cost, the perturbation assists the global model to optimize towards a converged optimal solution by aggregating several local flat optima. We have theoretically analyzed the proposed method and empirically conducted extensive experiments on three medical image classification and segmentation tasks, showing that HarmoFL outperforms a set of recent state-of-the-art methods with promising convergence behavior. Code is available at: https://github.com/med-air/HarmoFL
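
The amplitude-normalization step can be sketched directly with FFTs: keep each image's phase and replace its amplitude with shared statistics. Averaging amplitudes over the local batch is one simple instantiation of the harmonization described; the paper's exact statistics may differ.

```python
import torch

def harmonize_amplitude(images):
    """Sketch of frequency-space amplitude normalization: replace each image's
    FFT amplitude with the batch-average amplitude while keeping its own phase,
    mimicking a unified imaging setting across heterogeneous clients.
    images: [B, C, H, W] real tensor."""
    spec = torch.fft.fft2(images)                   # complex spectrum
    amp, phase = spec.abs(), spec.angle()
    amp_mean = amp.mean(dim=0, keepdim=True)        # shared amplitude statistics
    harmonized = torch.polar(amp_mean.expand_as(phase), phase)
    return torch.fft.ifft2(harmonized).real
```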

TMLR Journal 2022 Journal Article

Scaling Autoregressive Models for Content-Rich Text-to-Image Generation

  • Jiahui Yu
  • Yuanzhong Xu
  • Jing Yu Koh
  • Thang Luong
  • Gunjan Baid
  • Zirui Wang
  • Vijay Vasudevan
  • Alexander Ku

We present the Pathways Autoregressive Text-to-Image (Parti) model, which generates high-fidelity photorealistic images and supports content-rich synthesis involving complex compositions and world knowledge. Parti treats text-to-image generation as a sequence-to-sequence modeling problem, akin to machine translation, with sequences of image tokens as the target outputs rather than text tokens in another language. This strategy can naturally tap into the rich body of prior work on large language models, which have seen continued advances in capabilities and performance through scaling data and model sizes. Our approach is simple: First, Parti uses a Transformer-based image tokenizer, ViT-VQGAN, to encode images as sequences of discrete tokens. Second, we achieve consistent quality improvements by scaling the encoder-decoder Transformer model up to 20B parameters, with a new state-of-the-art zero-shot FID score of 7.23 and finetuned FID score of 3.22 on MS-COCO. Our detailed analysis on Localized Narratives as well as PartiPrompts (P2), a new holistic benchmark of over 1600 English prompts, demonstrate the effectiveness of Parti across a wide variety of categories and difficulty aspects. We also explore and highlight limitations of our models in order to define and exemplify key areas of focus for further improvements.

ICLR Conference 2022 Conference Paper

SimVLM: Simple Visual Language Model Pretraining with Weak Supervision

  • Zirui Wang
  • Jiahui Yu
  • Adams Wei Yu
  • Zihang Dai
  • Yulia Tsvetkov
  • Yuan Cao 0007

With recent progress in joint modeling of visual and textual representations, Vision-Language Pretraining (VLP) has achieved impressive performance on many multimodal downstream tasks. However, the requirement for expensive annotations including clean image captions and regional labels limits the scalability of existing approaches, and complicates the pretraining procedure with the introduction of multiple dataset-specific objectives. In this work, we relax these constraints and present a minimalist pretraining framework, named Simple Visual Language Model (SimVLM). Unlike prior work, SimVLM reduces the training complexity by exploiting large-scale weak supervision, and is trained end-to-end with a single prefix language modeling objective. Without utilizing extra data or task-specific customization, the resulting model significantly outperforms previous pretraining methods and achieves new state-of-the-art results on a wide range of discriminative and generative vision-language benchmarks, including VQA (+3.74% vqa-score), NLVR2 (+1.17% accuracy), SNLI-VE (+1.37% accuracy) and image captioning tasks (+10.1% average CIDEr score). Furthermore, we demonstrate that SimVLM acquires strong generalization and transfer ability, enabling zero-shot behavior including open-ended visual question answering and cross-modality transfer.
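
The single prefix-LM objective is compact: the model conditions bidirectionally on a prefix (image patches plus leading text in SimVLM) and is trained to predict only the remaining tokens autoregressively. A minimal sketch of the loss, assuming the model has already produced per-position logits:

```python
import torch
import torch.nn.functional as F

def prefix_lm_loss(logits, tokens, prefix_len):
    """Prefix language modeling sketch: cross-entropy only on the suffix.
    logits: [B, L, V] where position t predicts token t+1; tokens: [B, L].
    Positions within the prefix receive no loss (they are context only)."""
    pred = logits[:, prefix_len:-1]                 # predictions for the suffix
    target = tokens[:, prefix_len + 1:]             # next-token targets
    return F.cross_entropy(pred.reshape(-1, pred.shape[-1]), target.reshape(-1))
```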

YNIMG Journal 2021 Journal Article

Genes associated with gray matter volume alterations in schizophrenia

  • Yuan Ji
  • Xue Zhang
  • Zirui Wang
  • Wen Qin
  • Huaigui Liu
  • Kaizhong Xue
  • Jie Tang
  • Qiang Xu

Although both schizophrenia and gray matter volume (GMV) show high heritability, the genes accounting for GMV alterations in schizophrenia remain largely unknown. Based on risk genes identified in schizophrenia by the genome-wide association study of the Schizophrenia Working Group of the Psychiatric Genomics Consortium, we used transcription-neuroimaging association analysis to test which of these genes are associated with GMV changes in schizophrenia. For each brain tissue sample, the expression profiles of 196 schizophrenia risk genes were extracted from six donated normal brains of the Allen Human Brain Atlas, and GMV differences between patients with schizophrenia and healthy controls were calculated from five independent case-control structural MRI datasets (276 patients and 284 controls). Genes associated with GMV changes in schizophrenia were identified by performing cross-sample spatial correlations between the expression levels of each gene and the case-control GMV differences derived from the five MRI datasets, integrated by harmonization and meta-analysis. We found that the expression levels of 98 genes consistently showed significant cross-sample spatial correlations with GMV changes in schizophrenia. These genes were functionally enriched for chemical synaptic transmission, central nervous system development, and cell projection. Overall, this study provides a set of genes possibly associated with GMV changes in schizophrenia, which could be used as candidate genes to explore the biological mechanisms underlying the structural impairments in schizophrenia.
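
The core statistic is a cross-sample spatial correlation. A minimal sketch for one gene and one dataset follows; the harmonization, meta-analysis across datasets, and multiple-comparison correction described in the abstract are omitted.

```python
from scipy.stats import pearsonr

def gene_gmv_correlation(expression, gmv_diff):
    """Correlate, across brain tissue samples, one gene's expression with the
    case-control GMV difference at the corresponding sample locations.
    expression: [n_samples] values for one gene; gmv_diff: [n_samples]."""
    r, p = pearsonr(expression, gmv_diff)
    return r, p

# Repeat per gene and per MRI dataset; genes whose correlations are consistently
# significant across datasets are the reported candidates.
```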

ICLR Conference 2021 Conference Paper

Gradient Vaccine: Investigating and Improving Multi-task Optimization in Massively Multilingual Models

  • Zirui Wang
  • Yulia Tsvetkov
  • Orhan Firat
  • Yuan Cao 0007

Massively multilingual models subsuming tens or even hundreds of languages pose great challenges to multi-task optimization. While it is a common practice to apply a language-agnostic procedure optimizing a joint multilingual task objective, how to properly characterize and take advantage of its underlying problem structure for improving optimization efficiency remains under-explored. In this paper, we attempt to peek into the black-box of multilingual optimization through the lens of loss function geometry. We find that gradient similarity measured along the optimization trajectory is an important signal, which correlates well with not only language proximity but also the overall model performance. Such observation helps us to identify a critical limitation of existing gradient-based multi-task learning methods, and thus we derive a simple and scalable optimization procedure, named Gradient Vaccine, which encourages more geometrically aligned parameter updates for close tasks. Empirically, our method obtains significant model performance gains on multilingual machine translation and XTREME benchmark tasks for multilingual language models. Our work reveals the importance of properly measuring and utilizing language proximity in multilingual optimization, and has broader implications for multi-task learning beyond multilingual modeling.
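
The alignment adjustment can be sketched for a single task pair: when the gradients' cosine similarity falls below a target (in the paper, derived from an exponential moving average of observed similarities), one gradient is nudged toward the other until the target similarity is met. The closed-form coefficient below is one adjustment consistent with that description and should be treated as a sketch.

```python
import math
import torch
import torch.nn.functional as F

def gradient_vaccine_pair(g_i, g_j, phi_target, eps=1e-12):
    """Sketch of a Gradient Vaccine-style update for one task pair.
    g_i, g_j: flattened task gradients; phi_target: desired cosine similarity.
    If similarity is already above target, g_i is left unchanged."""
    phi = F.cosine_similarity(g_i, g_j, dim=0)
    if phi.item() < phi_target:
        s_t = math.sqrt(max(1 - phi_target ** 2, eps))      # sin of target angle
        s = torch.sqrt((1 - phi ** 2).clamp_min(eps))       # sin of current angle
        # Coefficient chosen so the adjusted g_i reaches the target similarity.
        coef = g_i.norm() * (phi_target * s - phi * s_t) / (g_j.norm() * s_t + eps)
        g_i = g_i + coef * g_j
    return g_i
```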

ICLR Conference 2020 Conference Paper

Cross-lingual Alignment vs Joint Training: A Comparative Study and A Simple Unified Framework

  • Zirui Wang
  • Jiateng Xie
  • Ruochen Xu
  • Yiming Yang 0002
  • Graham Neubig
  • Jaime G. Carbonell

Learning multilingual representations of text has proven a successful method for many cross-lingual transfer learning tasks. There are two main paradigms for learning such representations: (1) alignment, which maps different independently trained monolingual representations into a shared space, and (2) joint training, which directly learns unified multilingual representations using monolingual and cross-lingual objectives jointly. In this paper, we first conduct direct comparisons of representations learned using both of these methods across diverse cross-lingual tasks. Our empirical results reveal a set of pros and cons for both methods, and show that the relative performance of alignment versus joint training is task-dependent. Stemming from this analysis, we propose a simple and novel framework that combines these two previously mutually-exclusive approaches. Extensive experiments demonstrate that our proposed framework alleviates limitations of both approaches, and outperforms existing methods on the MUSE bilingual lexicon induction (BLI) benchmark. We further show that this framework can generalize to contextualized representations such as Multilingual BERT, and produces state-of-the-art results on the CoNLL cross-lingual NER benchmark.