Arrow Research

Author name cluster

Junjie Wang

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

25 papers
2 author rows

Possible papers (25)

AAAI Conference 2026 Conference Paper

Adversarial Attack on Black-Box Multi-Agent by Adaptive Perturbation

  • Jianming Chen
  • Yawen Wang
  • Junjie Wang
  • Xiaofei Xie
  • Yuanzhe Hu
  • Qing Wang
  • Fanjiang Xu

Evaluating the security and reliability of multi-agent systems (MAS) is urgent as they become increasingly prevalent in various applications. As an evaluation technique, existing adversarial attack frameworks face certain limitations, e.g., impracticality due to the requirement of white-box information or high control authority, and a lack of stealthiness or effectiveness because they often target all agents or specific fixed agents. To address these issues, we propose AdapAM, a novel framework for adversarial attacks on black-box MAS. AdapAM incorporates two key components: (1) an Adaptive Selection Policy that simultaneously selects the victim and determines the anticipated malicious action (the action that would lead to the worst impact on the MAS), balancing effectiveness and stealthiness; (2) Proxy-based Perturbation to Induce Malicious Action, which utilizes generative adversarial imitation learning to approximate the target MAS, allowing AdapAM to generate perturbed observations using white-box information and thus induce victims to execute malicious actions in black-box settings. We evaluate AdapAM across eight multi-agent environments and compare it with four state-of-the-art and commonly used baselines. Results demonstrate that AdapAM achieves the best attack performance across different perturbation rates. Moreover, AdapAM-generated perturbations are the least noisy and hardest to detect, underscoring the framework's stealthiness.
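
As a rough illustration of the Adaptive Selection Policy, the sketch below scores every (victim, action) pair by a weighted mix of estimated impact and stealthiness. The proxy estimates, the weighting scheme, and all names are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

# Hypothetical sketch of AdapAM's adaptive selection step: score each
# (victim, malicious_action) pair by a weighted mix of effectiveness
# (reward drop predicted by a learned proxy of the target MAS) and
# stealthiness (inverse perturbation cost). Names are illustrative.

def select_victim_and_action(agents, actions, proxy_reward_drop, perturb_cost,
                             alpha=0.7):
    """Return the (agent, action) pair maximizing effectiveness vs. stealth."""
    best, best_score = None, -np.inf
    for a in agents:
        for act in actions:
            effectiveness = proxy_reward_drop(a, act)  # predicted MAS reward drop
            stealth = -perturb_cost(a, act)            # smaller perturbation = stealthier
            score = alpha * effectiveness + (1 - alpha) * stealth
            if score > best_score:
                best, best_score = (a, act), score
    return best

# Toy usage with random stand-ins for the learned proxy estimates.
rng = np.random.default_rng(0)
drop = lambda a, act: rng.uniform(0, 1)
cost = lambda a, act: rng.uniform(0, 1)
print(select_victim_and_action(range(4), range(5), drop, cost))
```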

TMLR Journal 2026 Journal Article

A Survey of Reasoning in Autonomous Driving Systems: Open Challenges and Emerging Paradigms

  • Kejin Yu
  • Yuhan Sun
  • Taiqiang Wu
  • Ruixu Zhang
  • Zhiqiang Lin
  • Yuxin Meng
  • Junjie Wang
  • Yujiu Yang

The development of high-level autonomous driving (AD) is shifting from perception-centric limitations to a more fundamental bottleneck, namely, a deficit in robust and generalizable reasoning. Although current AD systems manage structured environments, they consistently falter in long-tail scenarios and complex social interactions that require human-like judgment. Meanwhile, the advent of large language and multimodal models (LLMs and MLLMs) presents a transformative opportunity to integrate a powerful cognitive engine into AD systems, moving beyond pattern matching toward genuine comprehension. However, a systematic framework to guide this integration is critically lacking. To bridge this gap, we provide a comprehensive review of this emerging field and argue that reasoning should be elevated from a modular component to the system's cognitive core. Specifically, we first propose a novel Cognitive Hierarchy to decompose the monolithic driving task according to its cognitive and interactive complexity. Building on this, we further derive and systematize seven core reasoning challenges, such as the responsiveness-reasoning trade-off and social-game reasoning. Furthermore, we conduct a dual-perspective review of the state-of-the-art, analyzing both system-centric approaches to architecting intelligent agents and evaluation-centric practices for their validation. Our analysis reveals a clear trend toward holistic and interpretable "glass-box" agents. In conclusion, we identify a fundamental and unresolved tension between the high-latency, deliberative nature of LLM-based reasoning and the millisecond-scale, safety-critical demands of vehicle control. For future work, a primary objective is to bridge the symbolic-to-physical gap by developing verifiable neuro-symbolic architectures, robust reasoning under uncertainty, and scalable models for implicit social negotiation.

AAAI Conference 2026 Conference Paper

Joint-GCG: Unified Gradient-Based Poisoning Attacks on Retrieval-Augmented Generation Systems

  • Haowei Wang
  • Rupeng Zhang
  • Junjie Wang
  • Mingyang Li
  • Yuekai Huang
  • Dandan Wang
  • Qing Wang

Retrieval-Augmented Generation (RAG) systems enhance Large Language Models (LLMs) by retrieving relevant documents from external corpora before generating responses. This approach significantly expands LLM capabilities by leveraging vast, up-to-date external knowledge. However, this reliance on external knowledge makes RAG systems vulnerable to corpus poisoning attacks that manipulate generated outputs via poisoned document injection. Existing poisoning attack strategies typically treat the retrieval and generation stages as disjointed, limiting their effectiveness. We propose Joint-GCG, the first framework to unify gradient-based attacks across both retriever and generator models through three innovations: (1) Cross-Vocabulary Projection for aligning embedding spaces, (2) Gradient Tokenization Alignment for synchronizing token-level gradient signals, and (3) Adaptive Weighted Fusion for dynamically balancing attacking objectives. Evaluations demonstrate that Joint-GCG achieves at most 25% and an average of 5% higher attack success rate than previous methods across multiple retrievers and generators. While optimized under a white-box assumption, the generated poisons show unprecedented transferability to unseen models. Joint-GCG's innovative unification of gradient-based attacks across retrieval and generation stages fundamentally reshapes our understanding of vulnerabilities within RAG systems.
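
A minimal sketch of the Adaptive Weighted Fusion idea: token-level gradients from the retriever and generator objectives are normalized and mixed with a dynamic weight. The normalization and weighting rule here are assumptions; the paper's actual fusion may differ.

```python
import torch

# Sketch of fusing retriever and generator gradient signals into one
# update direction for the poisoned document, normalizing each so that
# neither objective dominates. Names are illustrative.

def fuse_gradients(g_retrieval, g_generation, eps=1e-8):
    # Normalize each gradient field to unit scale before mixing.
    g_r = g_retrieval / (g_retrieval.norm() + eps)
    g_g = g_generation / (g_generation.norm() + eps)
    # Dynamic weight: favor the objective whose raw gradient is currently larger.
    w = g_retrieval.norm() / (g_retrieval.norm() + g_generation.norm() + eps)
    return w * g_r + (1 - w) * g_g

g_r = torch.randn(32, 768)   # per-token gradients w.r.t. the poisoned doc
g_g = torch.randn(32, 768)
fused = fuse_gradients(g_r, g_g)
print(fused.shape)
```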

AAAI Conference 2026 Conference Paper

Many Minds, One Path: LLM-Augmented Consensus Decision for Distributed Control in Multi-Agent Collaborative Stable Scenarios

  • Zhuohao Yu
  • Zhe Liu
  • Tao Ren
  • Chenxue Wang
  • Junjie Wang
  • Qing Wang

Distributed multi-agent systems are increasingly deployed in dynamic and high-stakes environments such as power grids, intelligent traffic systems, and collaborative robotics. In these systems, long-term stability, the ability to maintain coherent and safe system behavior over time, is critical but underexplored in existing research. This paper presents LLMASC, a framework designed to enhance long-term stability in multi-agent collaboration by combining semantic reasoning with decentralized control. LLMASC comprises three key components: a Semantic Perception Encoder that transforms heterogeneous agent observations into structured natural language; an LLM-Guided Consensus Decision module that enables strategic alignment through proposal exchange and voting; and a Policy Execution Controller that maps high-level plans to executable actions via reinforcement learning. We evaluate LLMASC across three representative simulation domains (Multi-Walker, Simulation of Urban Mobility and Power Grid Stabilization), spanning both physical and cyber-physical systems. Experiments show that LLMASC consistently outperforms the best baselines, improving stability rates by up to 44% and long-term success by 31%. Further analysis confirms its decision-making efficiency and robustness under varying agent populations and model choices.
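
The consensus step can be pictured as a proposal exchange followed by a vote. In the sketch below, propose() and vote() are hypothetical stand-ins for the framework's LLM calls; the majority rule is an assumption.

```python
from collections import Counter

# Illustrative sketch of a proposal-exchange-and-voting consensus round.
# In LLMASC each agent would query an LLM with its encoded observation.

def consensus_round(agents, propose, vote):
    proposals = {a: propose(a) for a in agents}      # each agent drafts a plan
    ballots = [vote(a, proposals) for a in agents]   # each agent picks one plan
    winner, _ = Counter(ballots).most_common(1)[0]   # majority consensus
    return winner

agents = ["walker_0", "walker_1", "walker_2"]
propose = lambda a: f"plan_by_{a}"
vote = lambda a, ps: sorted(ps.values())[0]          # toy deterministic vote
print(consensus_round(agents, propose, vote))
```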

AAAI Conference 2026 Conference Paper

O-DisCo-Edit: Object Distortion Control for Unified Realistic Video Editing

  • Yuqing Chen
  • Junjie Wang
  • Lin Liu
  • Ruihang Chu
  • Xiaopeng Zhang
  • Qi Tian
  • Yujiu Yang

Diffusion models have recently advanced video editing, yet controllable editing remains challenging due to the need for precise manipulation of diverse object properties. Current methods require different control signals for diverse editing tasks, which complicates model design and demands significant training resources. To address this, we propose O-DisCo-Edit, a unified framework that incorporates a novel object distortion control (O-DisCo) signal. This signal, based on random and adaptive noise, flexibly encapsulates a wide range of editing cues within a single representation. Paired with a “copy-form” module that preserves non-edited regions, O-DisCo-Edit enables efficient, high-fidelity editing through an effective training paradigm. Extensive experiments and comprehensive human evaluations consistently demonstrate that O-DisCo-Edit surpasses both specialized and multitask state-of-the-art methods across various video editing tasks.
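
One plausible reading of the distortion signal, sketched under the assumption that "adaptive" means content-dependent noise strength applied only inside the object mask; this is an interpretation, not the paper's formulation.

```python
import torch

# Rough sketch of an object-distortion control signal: perturb only the
# masked object region with noise whose strength adapts to local contrast.

def o_disco_signal(frame, obj_mask, base_sigma=0.3):
    local_std = frame.std(dim=0, keepdim=True)       # per-pixel contrast proxy
    sigma = base_sigma * (1.0 + local_std)           # adaptive noise scale
    noise = torch.randn_like(frame) * sigma
    return frame * (1 - obj_mask) + (frame + noise) * obj_mask

frame = torch.rand(3, 64, 64)          # one RGB video frame
mask = torch.zeros(1, 64, 64)
mask[:, 16:48, 16:48] = 1.0            # toy object region
print(o_disco_signal(frame, mask).shape)
```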

AAAI Conference 2026 Conference Paper

SpecQuant: Spectral Decomposition and Adaptive Truncation for Ultra-Low-Bit LLMs Quantization

  • Zhixiong Zhao
  • Fangxin Liu
  • Junjie Wang
  • Chenyang Guan
  • Zongwu Wang
  • Li Jiang
  • Haibing Guan

The emergence of accurate open large language models (LLMs) has sparked a push for advanced quantization techniques to enable efficient deployment on end-user devices. In this paper, we revisit the challenge of extreme LLM compression, targeting ultra-low-bit quantization for both activations and weights, from a Fourier frequency domain perspective. We propose SpecQuant, a two-stage framework that tackles activation outliers and cross-channel variance. In the first stage, activation outliers are smoothed and transferred into the weight matrix to simplify downstream quantization. In the second stage, we apply channel-wise low-frequency Fourier truncation to suppress high-frequency components while preserving essential signal energy, improving quantization robustness. Our method builds on the principle that most of the weight energy is concentrated in low-frequency components, which can be retained with minimal impact on model accuracy. To enable runtime adaptability, we introduce a lightweight truncation module during inference that adjusts truncation thresholds based on channel characteristics. On LLaMA-3 8B, SpecQuant achieves 4-bit quantization for both weights and activations, narrowing the zero-shot accuracy gap to only 1.5% compared to full precision, while delivering 2× faster inference and 3× lower memory usage.
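
The second-stage truncation can be sketched directly with PyTorch's FFT routines; keep_ratio below is a fixed stand-in for the adaptive, per-channel threshold the paper adjusts at runtime.

```python
import torch

# Sketch of channel-wise low-frequency truncation: keep only the lowest
# Fourier coefficients of each weight row, zeroing the rest, so that most
# signal energy survives quantization.

def low_freq_truncate(weight, keep_ratio=0.25):
    spec = torch.fft.rfft(weight, dim=-1)            # per-channel spectrum
    k = max(1, int(spec.shape[-1] * keep_ratio))     # low-frequency bins to keep
    spec[..., k:] = 0                                # drop high-frequency components
    return torch.fft.irfft(spec, n=weight.shape[-1], dim=-1)

w = torch.randn(4096, 4096)
w_trunc = low_freq_truncate(w)
energy_kept = w_trunc.norm() ** 2 / w.norm() ** 2
print(f"energy retained: {energy_kept:.3f}")
```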

JBHI Journal 2026 Journal Article

Wavelet-Driven Spatial Frequency Mamba Network for Spine Image Segmentation

  • Yuefeng Zhao
  • Qifei Wang
  • Nai Zhou
  • Pengfei Sun
  • Junjie Wang
  • Nannan Hu

Accurate spine segmentation is crucial for diagnosing and treating various spine diseases. Recently, Mamba-based methods have been widely applied in medical image segmentation. However, the spatial domain scanning strategy of Mamba fails to fully capture the fine anatomical structures and global dependencies of the spine. Moreover, existing frequency-enhanced methods often suffer from the loss of spatial localization. To address these challenges, we propose the Wavelet-Driven Spatial Frequency Mamba Network (WDSFM-Net). Specifically, we integrate the Discrete Wavelet Transform (DWT) into Mamba to construct the Spatial-Frequency Mamba Block (SFMB). By decomposing features into distinct frequency subbands, SFMB explicitly captures global structural context from low-frequency components while enhancing local anatomical details through high-frequency components. To accommodate specific spinal morphology, the Global Strip Pooling Attention (GSPA) module aggregates directional contexts to model the elongated and anisotropic spinal anatomy, while the Multi-Scale Attention Enhancement (MSAE) module employs multi-scale convolutions to adapt to significant vertebral scale variations. Additionally, we introduce a Dual-Domain Loss (DDL) function, which optimizes both spatial and frequency domain representations for robust training. We evaluated our WDSFM-Net on two public spine MRI datasets. The results show that the WDSFM-Net outperforms other state-of-the-art methods, achieving average Dice similarity coefficients of 0.8885 and 0.8669 in the Spider and MRSpine datasets, respectively.
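
The wavelet split at the heart of SFMB looks roughly like the following, using pywt (an assumed dependency; any 2D DWT would do) to separate one feature channel into low- and high-frequency subbands.

```python
import numpy as np
import pywt  # assumed dependency for the 2D discrete wavelet transform

# Sketch of the spatial-frequency split behind SFMB: a DWT separates a
# feature map into a low-frequency subband (global structure) and
# high-frequency subbands (edges / fine anatomy), which the block can
# then process with different operators.

feat = np.random.rand(128, 128).astype(np.float32)   # one feature channel
LL, (LH, HL, HH) = pywt.dwt2(feat, "haar")

low_freq_branch = LL                        # global structural context
high_freq_branch = np.stack([LH, HL, HH])   # local anatomical detail
print(low_freq_branch.shape, high_freq_branch.shape)  # (64, 64) (3, 64, 64)
```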

NeurIPS Conference 2025 Conference Paper

Less Is More, but Where? Dynamic Token Compression via LLM-Guided Keyframe Prior

  • Yulin Li
  • Haokun GUI
  • Ziyang Fan
  • Junjie Wang
  • Bin Kang
  • Bin Chen
  • Zhuotao Tian

Recent advances in Video Large Language Models (VLLMs) have achieved remarkable video understanding capabilities, yet face critical efficiency bottlenecks due to quadratic computational growth with the lengthy visual token sequences of long videos. While existing keyframe sampling methods can improve temporal modeling efficiency, they introduce additional computational cost before feature encoding, and their binary frame selection paradigm is suboptimal. Therefore, in this work, we propose Dynamic Token compression via LLM-guided Keyframe prior (DyToK), a training-free paradigm that enables dynamic token compression by harnessing VLLMs' inherent attention mechanisms. Our analysis reveals that VLLM attention layers naturally encode query-conditioned keyframe priors, by which DyToK dynamically adjusts per-frame token retention ratios, prioritizing semantically rich frames while suppressing redundancies. Extensive experiments demonstrate that DyToK achieves state-of-the-art efficiency-accuracy tradeoffs. DyToK shows plug-and-play compatibility with existing compression methods, such as VisionZip and FastV, attaining 2.5x faster inference while preserving accuracy across multiple VLLMs, such as LLaVA-OneVision and Qwen2.5-VL. Code and models will be made publicly available.
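
A toy version of the allocation step: pooled attention mass per frame sets how many visual tokens each frame retains under a fixed budget. Shapes, names, and the rounding rule are assumptions for the sketch.

```python
import torch

# Illustrative DyToK-style allocation: use the attention mass the language
# model assigns to each frame (a query-conditioned keyframe prior) to set
# per-frame token retention under a global budget.

def allocate_tokens(frame_attn, total_budget, min_tokens=1):
    """frame_attn: (num_frames,) attention mass per frame, summing to 1."""
    raw = frame_attn * total_budget
    # Flooring makes the budget approximate; rich frames keep more tokens.
    return raw.floor().long().clamp(min=min_tokens)

attn = torch.softmax(torch.randn(16), dim=0)   # attention pooled for one query
budget = 256                                   # total visual tokens to retain
print(allocate_tokens(attn, budget))
```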

NeurIPS Conference 2025 Conference Paper

LVLM-Driven Attribute-Aware Modeling for Visible-Infrared Person Re-Identification

  • Zhiqi Pang
  • Lingling Zhao
  • Junjie Wang
  • Chunyu Wang

Visible-infrared person re-identification (VI-ReID) aims to match visible and infrared images of the same individual. Supervised VI-ReID (SVI-ReID) methods have achieved promising performance under the guidance of manually annotated identity labels. However, the substantial annotation cost severely limits their scalability in real-world applications. As a result, unsupervised VI-ReID (UVI-ReID) methods have attracted increasing attention. These methods typically rely on pseudo-labels generated by clustering and matching algorithms to replace manual annotations. Nevertheless, the quality of pseudo-labels is often difficult to guarantee, and low-quality pseudo-labels can significantly hinder model performance improvements. To address these challenges, we explore the use of attribute arrays extracted by a large vision-language model (LVLM) to enhance VI-ReID, and propose a novel LVLM-driven attribute-aware modeling (LVLM-AAM) approach. Specifically, we first design an attribute-aware reliable labeling strategy, which refines intra-modality clustering results based on image-level attributes and improves inter-modality matching by grouping clusters according to cluster-level attributes. Next, we develop an explicit-implicit attribute fusion module, which integrates explicit and implicit attributes to obtain more fine-grained identity-related text features. Finally, we introduce an attribute-aware contrastive learning module, which jointly leverages static and dynamic text features to promote modality-invariant feature learning. Extensive experiments conducted on VI-ReID datasets validate the effectiveness of the proposed LVLM-AAM and its individual components. LVLM-AAM not only significantly outperforms existing unsupervised methods but also surpasses several supervised methods.
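
The attribute-aware reliable labeling strategy can be caricatured as splitting any pseudo-label cluster whose members disagree on LVLM-extracted attributes; the attribute tuples below are hypothetical.

```python
from collections import defaultdict

# Sketch of attribute-based cluster refinement: images in one pseudo-label
# cluster are regrouped by their (hypothetical) attribute arrays, so that
# conflicting identities are separated before contrastive training.

def refine_cluster(cluster_images, attributes):
    groups = defaultdict(list)
    for img in cluster_images:
        groups[tuple(attributes[img])].append(img)   # same attributes -> same group
    return list(groups.values())                     # sub-clusters after refinement

imgs = ["a.jpg", "b.jpg", "c.jpg"]
attrs = {"a.jpg": ("male", "backpack"), "b.jpg": ("male", "backpack"),
         "c.jpg": ("female", "no_backpack")}
print(refine_cluster(imgs, attrs))   # [['a.jpg', 'b.jpg'], ['c.jpg']]
```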

AAAI Conference 2025 Conference Paper

OV-DQUO: Open-Vocabulary DETR with Denoising Text Query Training and Open-World Unknown Objects Supervision

  • Junjie Wang
  • Bin Chen
  • Bin Kang
  • Yulin Li
  • Weizhi Xian
  • Yichi Chen
  • Yong Xu

Open-vocabulary detection aims to detect objects from novel categories beyond the base categories on which the detector is trained. However, existing open-vocabulary detectors trained on base category data tend to assign higher confidence to trained categories and confuse novel categories with the background. To resolve this, we propose OV-DQUO, an Open-Vocabulary DETR with Denoising text Query training and open-world Unknown Objects supervision. Specifically, we introduce a wildcard matching method. This method enables the detector to learn from pairs of unknown objects recognized by the open-world detector and text embeddings with general semantics, mitigating the confidence bias between base and novel categories. Additionally, we propose a denoising text query training strategy. It synthesizes foreground and background query-box pairs from open-world unknown objects to train the detector through contrastive learning, enhancing its ability to distinguish novel objects from the background. We conducted extensive experiments on the OV-COCO and OV-LVIS benchmarks, achieving new state-of-the-art results of 45.6 AP50 and 39.3 mAP on novel categories, respectively.
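
A hedged sketch of wildcard matching: unknown-object region features are pulled toward a text embedding with general semantics (e.g., a generic "object" prompt) rather than any base-category name. The embeddings and loss form are placeholders.

```python
import torch
import torch.nn.functional as F

# Sketch: maximize similarity between open-world unknown-object features
# and one generic-semantics ("wildcard") text embedding.

def wildcard_loss(unknown_region_feats, wildcard_embed, temp=0.07):
    sim = F.cosine_similarity(unknown_region_feats, wildcard_embed[None], dim=-1)
    return -(sim / temp).mean()   # pull unknowns toward the generic embedding

regions = F.normalize(torch.randn(10, 512), dim=-1)  # unknown-object features
wildcard = F.normalize(torch.randn(512), dim=-1)     # embedding of "object"
print(wildcard_loss(regions, wildcard).item())
```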

IROS Conference 2025 Conference Paper

Shape-Adaptive Planning and Control for a Deformable Quadrotor

  • Yuze Wu
  • Zhichao Han 0002
  • Xuankang Wu
  • Yuan Zhou
  • Junjie Wang
  • Zheng Fang
  • Fei Gao 0011

Drones have become essential in various applications, but conventional quadrotors face limitations in confined spaces and complex tasks. Deformable drones, which can adapt their shape in real-time, offer a promising solution to overcome these challenges, while also enhancing maneuverability and enabling novel tasks like object grasping. This paper presents a novel approach to autonomous motion planning and control for deformable quadrotors. We introduce a shape-adaptive trajectory planner that incorporates deformation dynamics into path generation, using a scalable kinodynamic A* search to handle deformation parameters in complex environments. The backend spatio-temporal optimization is capable of generating optimally smooth trajectories that incorporate shape deformation. Additionally, we propose an enhanced control strategy that compensates for external forces and torque disturbances, achieving a 37.3% reduction in trajectory tracking error compared to our previous work. Our approach is validated through simulations and real-world experiments, demonstrating its effectiveness in narrow-gap traversal and multi-modal deformable tasks.
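
To make the shape-aware search concrete, here is a toy A*-style search whose state carries a discrete deformation level, so the planner can fold the vehicle to pass a narrow gap at extra cost. The grid, costs, and gap model are all illustrative, not the paper's planner.

```python
import heapq

# Toy shape-adaptive search: state = (x, y, s) with deformation level s.

def neighbors(state):
    x, y, s = state
    for dx, dy in [(1, 0), (-1, 0), (0, 1), (0, -1)]:
        yield (x + dx, y + dy, s), 1.0        # translate at unit cost
    for ds in (-1, 1):                        # change deformation level
        if 0 <= s + ds <= 2:
            yield (x, y, s + ds), 0.5         # deformation effort penalty

def feasible(state, gap_width):
    x, y, s = state
    if y not in gap_width or not (0 <= x <= 6):
        return False
    if x == 3:                                # wall at x=3 with a gap per row
        return gap_width[y] >= 3 - s          # folded body (larger s) fits smaller gaps
    return True

def astar(start, goal, gap_width):
    h = lambda st: abs(st[0] - goal[0]) + abs(st[1] - goal[1])
    frontier, seen = [(h(start), 0.0, start)], {start}
    while frontier:
        _, g, st = heapq.heappop(frontier)
        if (st[0], st[1]) == goal:
            return g
        for nxt, c in neighbors(st):
            if nxt not in seen and feasible(nxt, gap_width):
                seen.add(nxt)
                heapq.heappush(frontier, (g + c + h(nxt), g + c, nxt))

print(astar((0, 0, 0), (6, 1), gap_width={0: 3, 1: 1}))
```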

ICLR Conference 2025 Conference Paper

TorchTitan: One-stop PyTorch native solution for production ready LLM pretraining

  • Wanchao Liang
  • Tianyu Liu
  • Less Wright
  • Will Constable
  • Andrew Gu
  • Chien-Chin Huang
  • Iris Zhang
  • Wei Feng

The development of large language models (LLMs) has been instrumental in advancing state-of-the-art natural language processing applications. Training LLMs with billions of parameters and trillions of tokens requires sophisticated distributed systems that enable composing and comparing several state-of-the-art techniques in order to efficiently scale across thousands of accelerators. However, existing solutions are complex, scattered across multiple libraries/repositories, lack interoperability, and are cumbersome to maintain. Thus, curating and empirically comparing training recipes requires non-trivial engineering effort. This paper introduces TorchTitan, a PyTorch-native distributed training system that unifies and advances state-of-the-art techniques, streamlining integration and reducing engineering overhead. TorchTitan enables seamless application of 4D parallelism in a modular and composable manner, while featuring elastic scaling to adapt to changing computational requirements. The system provides comprehensive logging, efficient checkpointing, and debugging tools, ensuring production-ready training. Moreover, TorchTitan incorporates innovative hardware-software co-designed solutions, leveraging cutting-edge features like Float8 training and SymmetricMemory to maximize hardware utilization. As a flexible experimental test bed, TorchTitan facilitates the curation and comparison of custom recipes for diverse training contexts. By leveraging TorchTitan, we developed optimized training recipes for the Llama 3.1 family and provide actionable guidance on selecting and combining distributed training techniques to maximize training efficiency, based on our hands-on experiences. We thoroughly assess TorchTitan on the Llama 3.1 family of LLMs, spanning 8 billion to 405 billion parameters, and showcase its exceptional performance, modular composability, and elastic scalability. By stacking training optimizations, we demonstrate accelerations of 65.08% on Llama 3.1 8B at 128-GPU scale (1D), 12.59% on Llama 3.1 70B at 256-GPU scale (2D), and 30% on Llama 3.1 405B at 512-GPU scale (3D) on NVIDIA H100 GPUs over optimized baselines. We also demonstrate the effectiveness of 4D parallelism in enabling long-context training. GitHub: https://github.com/pytorch/torchtitan
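
TorchTitan builds on PyTorch's native DeviceMesh abstraction; the sketch below (not TorchTitan's own code) shows how a 4D mesh could be declared and sliced per parallelism dimension. Dimension names and sizes are illustrative, and the script must run under torchrun with a matching world size.

```python
# Minimal DeviceMesh sketch for 4D parallelism. Run under torchrun with
# world_size = 2*2*2*2 = 16; dimension names/sizes are illustrative.
from torch.distributed.device_mesh import init_device_mesh

mesh_4d = init_device_mesh(
    "cuda",
    mesh_shape=(2, 2, 2, 2),
    mesh_dim_names=("pp", "dp", "cp", "tp"),  # pipeline, data, context, tensor
)

# Sub-meshes are sliced by name and handed to each parallelism wrapper,
# e.g. FSDP would get mesh_4d["dp"] and tensor parallel mesh_4d["tp"].
dp_mesh = mesh_4d["dp"]
tp_mesh = mesh_4d["tp"]
print(dp_mesh.size(), tp_mesh.size())
```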

NeurIPS Conference 2025 Conference Paper

Towards Resilient Safety-driven Unlearning for Diffusion Models against Downstream Fine-tuning

  • Boheng Li
  • Renjie Gu
  • Junjie Wang
  • Leyi Qi
  • Yiming Li
  • Run Wang
  • Zhan Qin
  • Tianwei Zhang

Text-to-image (T2I) diffusion models have achieved impressive image generation quality and are increasingly fine-tuned for personalized applications. However, these models often inherit unsafe behaviors from toxic pretraining data, raising growing safety concerns. While recent safety-driven unlearning methods have made promising progress in suppressing model toxicity, they are found to be fragile to downstream fine-tuning, as we reveal that state-of-the-art methods largely fail to retain their effectiveness even when fine-tuned on entirely benign datasets. To mitigate this problem, in this paper, we propose ResAlign, a safety-driven unlearning framework with enhanced resilience against downstream fine-tuning. By modeling downstream fine-tuning as an implicit optimization problem with a Moreau envelope-based reformulation, ResAlign enables efficient gradient estimation to minimize the recovery of harmful behaviors. Additionally, a meta-learning strategy is proposed to simulate a diverse distribution of fine-tuning scenarios to improve generalization. Extensive experiments across a wide range of datasets, fine-tuning methods, and configurations demonstrate that ResAlign consistently outperforms prior unlearning approaches in retaining safety, while effectively preserving benign generation capability. Our code and pretrained models are publicly available at https://github.com/AntigoneRandy/ResAlign.
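
The resilience objective can be approximated by unrolling a single simulated fine-tuning step and differentiating through it; the quadratic toy losses below stand in for the benign and harmful objectives, and the Moreau-envelope machinery that makes this efficient is omitted.

```python
import torch

# Conceptual sketch: simulate one benign fine-tuning step on the unlearned
# model, then penalize how much harmful behavior the step recovers.

def resilience_loss(params, benign_loss_fn, harm_loss_fn, ft_lr=1e-2):
    g = torch.autograd.grad(benign_loss_fn(params), params, create_graph=True)
    simulated = [p - ft_lr * gi for p, gi in zip(params, g)]  # one fine-tune step
    return harm_loss_fn(simulated)        # harm recovered after fine-tuning

# Toy quadratic stand-ins for the benign and harmful objectives.
w = [torch.randn(8, requires_grad=True)]
benign = lambda ps: (ps[0] ** 2).sum()
harm = lambda ps: ((ps[0] - 1.0) ** 2).sum()
loss = resilience_loss(w, benign, harm)
loss.backward()                           # gradients flow through the unroll
print(w[0].grad.norm().item())
```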

AAAI Conference 2025 Conference Paper

TrustUQA: A Trustful Framework for Unified Structured Data Question Answering

  • Wen Zhang
  • Long Jin
  • Yushan Zhu
  • Jiaoyan Chen
  • Zhiwei Huang
  • Junjie Wang
  • Yin Hua
  • Lei Liang

Natural language question answering (QA) over structured data sources such as tables and knowledge graphs has been widely investigated, especially with Large Language Models (LLMs) in recent years. The main solutions include parsing questions into formal queries and retrieval-based answer generation. However, current methods of the former kind often suffer from weak generalization, failing to deal with multiple types of sources, while the latter is limited in trustfulness. In this paper, we propose TrustUQA, a trustful QA framework that can simultaneously support multiple types of structured data in a unified way. To this end, it adopts an LLM-friendly and unified knowledge representation method called Condition Graph (CG), and uses an LLM and demonstration-based two-level method for CG querying. For enhancement, it is also equipped with dynamic demonstration retrieval. We have evaluated TrustUQA on 5 benchmarks covering 3 types of structured data. It outperforms 2 existing unified structured data QA methods. In comparison with baselines that are specific to one data type, it achieves state-of-the-art results on 2 of the datasets. Furthermore, we have demonstrated the potential of our method for more general QA tasks, QA over mixed structured data, and QA across structured data.
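
A miniature of what a Condition Graph might look like: facts as triples with attached condition key-values, plus the kind of query primitive an LLM could be prompted to emit. The schema is hypothetical.

```python
# Hypothetical Condition Graph miniature: (head, relation, tail) triples
# carry optional conditions, so tables and KGs can share one representation.

facts = [
    ("Beijing", "population", "21.5M", {"year": "2020"}),
    ("Beijing", "population", "19.6M", {"year": "2010"}),
]

def query(head, relation, **conditions):
    """Return tails matching the head/relation and all given conditions."""
    return [t for h, r, t, c in facts
            if h == head and r == relation
            and all(c.get(k) == v for k, v in conditions.items())]

print(query("Beijing", "population", year="2020"))   # ['21.5M']
```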

AAAI Conference 2025 Conference Paper

Understanding Individual Agent Importance in Multi-Agent System via Counterfactual Reasoning

  • Jianming Chen
  • Yawen Wang
  • Junjie Wang
  • Xiaofei Xie
  • Jun Hu
  • Qing Wang
  • Fanjiang Xu

Explaining multi-agent systems (MAS) is urgent as these systems become increasingly prevalent in various applications. Previous work has provided explanations for the actions or states of agents, yet falls short in understanding a black-box agent's importance within a MAS and the overall team strategy. To bridge this gap, we propose EMAI, a novel agent-level explanation approach that evaluates the individual agent's importance. Inspired by counterfactual reasoning, a larger change in reward caused by randomizing an agent's action indicates higher importance for that agent. We model this as a MARL problem to capture interactions across agents. Utilizing counterfactual reasoning, EMAI learns masking agents to identify the important agents. Specifically, we define the optimization function to minimize the reward difference before and after action randomization, and introduce sparsity constraints to encourage exploring more action randomizations of agents during training. Experimental results in seven multi-agent tasks demonstrate that EMAI achieves higher fidelity in explanations compared to baselines and provides more effective guidance in practical applications concerning understanding policies, launching attacks, and patching policies.
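
The counterfactual probe underlying EMAI can be brute-forced in a toy setting: an agent's importance is the expected reward change when only its action is randomized. EMAI instead learns masking agents to do this selectively; the environment below is a contrived stand-in.

```python
import numpy as np

# Brute-force counterfactual importance: randomize one agent's action and
# measure the expected team-reward change.

def agent_importance(env_rollout, n_agents, trials=100):
    base = env_rollout(randomize=None)
    return [float(np.mean([abs(base - env_rollout(randomize=i))
                           for _ in range(trials)]))
            for i in range(n_agents)]

# Toy team task: reward depends twice as much on agent 0 as on agent 1.
def toy_rollout(randomize, rng=np.random.default_rng(1)):
    actions = [1.0, 1.0]
    if randomize is not None:
        actions[randomize] = rng.uniform(-1, 1)
    return 2.0 * actions[0] + 1.0 * actions[1]

print(agent_importance(toy_rollout, n_agents=2))   # agent 0 scores higher
```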

ICML Conference 2025 Conference Paper

Weakly-Supervised Contrastive Learning for Imprecise Class Labels

  • Zi-Hao Zhou
  • Junjie Wang
  • Tong Wei 0001
  • Min-Ling Zhang

Contrastive learning has achieved remarkable success in learning effective representations, with supervised contrastive learning often outperforming self-supervised approaches. However, in real-world scenarios, data annotations are often ambiguous or inaccurate, meaning that class labels may not reliably indicate whether two examples belong to the same class. This limitation restricts the applicability of supervised contrastive learning. To address this challenge, we introduce the concept of “continuous semantic similarity” to define positive and negative pairs. Instead of directly relying on imprecise class labels, we measure the semantic similarity between example pairs, which quantifies how closely they belong to the same category by iteratively refining weak supervisory signals. Based on this concept, we propose a graph-theoretic framework for weakly-supervised contrastive learning, where semantic similarity serves as the graph weights. Our framework is highly versatile and can be applied to many weakly-supervised learning scenarios. We demonstrate its effectiveness through experiments in two common settings, i.e., noisy label and partial label learning, where existing methods can be easily integrated to significantly improve performance. Theoretically, we establish an error bound for our approach, showing that it can approximate supervised contrastive learning under mild conditions. The implementation code is available at https://github.com/Speechless-10308/WSC.
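
A minimal sketch of contrastive learning with continuous semantic similarity as soft targets; the similarity graph w below is a random placeholder for the iteratively refined one.

```python
import torch
import torch.nn.functional as F

# Soft-target contrastive loss: each pair (i, j) contributes with weight
# w_ij in [0, 1] instead of a hard positive/negative label.

def weighted_contrastive_loss(z, w, temp=0.1):
    z = F.normalize(z, dim=-1)
    logits = z @ z.T / temp
    logits.fill_diagonal_(-1e9)                   # exclude self-pairs
    log_p = F.log_softmax(logits, dim=-1)
    w = w / w.sum(dim=-1, keepdim=True).clamp(min=1e-8)
    return -(w * log_p).sum(dim=-1).mean()        # soft targets = graph weights

z = torch.randn(16, 128)                          # a batch of embeddings
w = torch.rand(16, 16)                            # placeholder similarity graph
w.fill_diagonal_(0.0)                             # no self-similarity
print(weighted_contrastive_loss(z, w).item())
```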

JBHI Journal 2024 Journal Article

A Hierarchical Graph Neural Network Framework for Predicting Protein-Protein Interaction Modulators With Functional Group Information and Hypergraph Structure

  • Zitong Zhang
  • Lingling Zhao
  • Junjie Wang
  • Chunyu Wang

Accurate prediction of small molecule modulators targeting protein-protein interactions (PPIMs) remains a significant challenge in drug discovery. Existing machine learning-based models rely on manual feature engineering, which is tedious and task-specific. Recently, deep learning models based on graph neural networks have made remarkable progress in molecular representation learning. However, many graph-based approaches ignore molecular hierarchical structure modeling guided by domain knowledge. In chemistry, the functional groups of a molecule determine its interaction with specific targets. Therefore, we propose a hierarchical graph neural network framework (called HiGPPIM) for predicting PPIMs by integrating atom-level and functional group-level features of molecules. HiGPPIM constructs atom-level and functional group-level graphs based on chemical knowledge and learns graph representations using graph attention networks. Furthermore, a hypergraph attention network is designed in HiGPPIM to aggregate and transform two-level graph information. We evaluate the performance of HiGPPIM on eight PPI families and two prediction tasks, namely PPIM identification and potency prediction. Experimental results demonstrate that HiGPPIM achieves state-of-the-art performance on both tasks and that using functional group information to guide PPIM prediction is effective.
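
The two-level hierarchy can be skeletonized as membership-based pooling from atoms to functional groups followed by attention over the group nodes; real HiGPPIM uses graph attention networks and a hypergraph attention layer, which this plain-tensor sketch omits.

```python
import torch

# Skeleton of hierarchical aggregation: atom features -> functional-group
# nodes via a membership matrix -> molecule embedding via toy attention.

atoms = torch.randn(12, 64)            # 12 atoms, 64-d features after a GNN
membership = torch.zeros(3, 12)        # 3 functional groups over the atoms
membership[0, :4] = 1                  # e.g. one group covering atoms 0-3
membership[1, 4:8] = 1
membership[2, 8:] = 1

groups = membership @ atoms / membership.sum(dim=1, keepdim=True)  # mean-pool
attn = torch.softmax(groups @ groups.mean(dim=0), dim=0)           # toy attention
molecule = (attn[:, None] * groups).sum(dim=0)                     # molecule embedding
print(molecule.shape)   # torch.Size([64])
```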

NeurIPS Conference 2024 Conference Paper

Vision Model Pre-training on Interleaved Image-Text Data via Latent Compression Learning

  • Chenyu Yang
  • Xizhou Zhu
  • Jinguo Zhu
  • Weijie Su
  • Junjie Wang
  • Xuan Dong
  • Wenhai Wang
  • Lewei Lu

Recently, vision model pre-training has evolved from relying on manually annotated datasets to leveraging large-scale, web-crawled image-text data. Despite these advances, there is no pre-training method that effectively exploits the interleaved image-text data, which is very prevalent on the Internet. Inspired by the recent success of compression learning in natural language processing, we propose a novel vision model pre-training method called Latent Compression Learning (LCL) for interleaved image-text data. This method performs latent compression learning by maximizing the mutual information between the inputs and outputs of a causal attention model. The training objective can be decomposed into two basic tasks: 1) contrastive learning between visual representation and preceding context, and 2) generating subsequent text based on visual representation. Our experiments demonstrate that our method not only matches the performance of CLIP on paired pre-training datasets (e.g., LAION), but can also leverage interleaved pre-training data (e.g., MMC4) to learn robust visual representations from scratch, showcasing the potential of vision model pre-training with interleaved image-text data.
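
The two-part objective reads as a contrastive term plus a next-text generation term; a schematic combination follows, with every tensor a placeholder for representations from interleaved image-text documents.

```python
import torch
import torch.nn.functional as F

# Schematic LCL-style objective: (1) contrast each image's visual
# representation with the representation of its preceding text context,
# and (2) predict the subsequent text from the visual representation.

def lcl_loss(vis, ctx, text_logits, text_targets, temp=0.07):
    vis, ctx = F.normalize(vis, dim=-1), F.normalize(ctx, dim=-1)
    logits = vis @ ctx.T / temp                       # image vs. preceding context
    labels = torch.arange(vis.size(0))
    contrastive = F.cross_entropy(logits, labels)
    generative = F.cross_entropy(                     # next-text prediction
        text_logits.flatten(0, 1), text_targets.flatten())
    return contrastive + generative

vis = torch.randn(8, 512)                 # visual reps of 8 images
ctx = torch.randn(8, 512)                 # reps of each image's preceding text
text_logits = torch.randn(8, 20, 1000)    # subsequent-text token logits
text_targets = torch.randint(0, 1000, (8, 20))
print(lcl_loss(vis, ctx, text_logits, text_targets).item())
```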