
Author name cluster

Yijiang Li

Papers possibly associated with this exact author name in Arrow. This page groups case-insensitive exact-name matches and is not a full identity-disambiguation profile.

10 papers
2 author rows

Possible papers (10)

TMLR Journal 2026 Journal Article

The Landscape of Agentic Reinforcement Learning for LLMs: A Survey

  • Guibin Zhang
  • Hejia Geng
  • Xiaohang Yu
  • Zhenfei Yin
  • Zaibin Zhang
  • Zelin Tan
  • Heng Zhou
  • Zhong-Zhi Li

The emergence of agentic reinforcement learning (Agentic RL) marks a paradigm shift from conventional reinforcement learning applied to large language models (LLM RL), reframing LLMs from passive sequence generators into autonomous, decision-making agents embedded in complex, dynamic worlds. This survey formalizes this conceptual shift by contrasting the degenerate single-step Markov Decision Processes (MDPs) of LLM RL with the temporally extended Partially Observable Markov Decision Processes (POMDPs) that define Agentic RL. Building on this foundation, we propose a comprehensive twofold taxonomy: one organized around core agentic capabilities, including planning, tool use, memory, reasoning, self-improvement, and perception, and the other around their applications across diverse task domains. Central to our thesis is that reinforcement learning serves as the critical mechanism for transforming these capabilities from static, heuristic modules into adaptive, robust agentic behavior. To support and accelerate future research, we consolidate the landscape of open-source environments, benchmarks, and frameworks into a practical compendium. By synthesizing over five hundred recent works, this survey charts the contours of this rapidly evolving field and highlights the opportunities and challenges that will shape the development of scalable, general-purpose AI agents.
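
To make the contrast the abstract draws concrete, the standard formalization runs roughly as follows (conventional RL notation; the survey's exact symbols may differ):

```latex
% LLM RL as a degenerate single-step MDP: the "episode" is one exchange,
% with state = prompt x, action = full response y, and horizon T = 1.
\mathcal{M}_{\mathrm{LLM}} = (\mathcal{S}, \mathcal{A}, R),
\qquad R(x, y) \in \mathbb{R}, \quad T = 1.

% Agentic RL as a temporally extended POMDP: the agent acts over many
% steps from partial observations, maximizing a discounted return.
\mathcal{M}_{\mathrm{agentic}}
  = (\mathcal{S}, \mathcal{A}, \mathcal{O}, P, \Omega, R, \gamma),
\qquad
s_{t+1} \sim P(\cdot \mid s_t, a_t), \quad
o_t \sim \Omega(\cdot \mid s_t), \quad
J(\pi) = \mathbb{E}_{\pi}\!\left[ \sum_{t \ge 0} \gamma^{t} R(s_t, a_t) \right].
```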

TMLR Journal 2025 Journal Article

ARVideo: Autoregressive Pretraining for Self-Supervised Video Representation Learning

  • Sucheng Ren
  • Hongru Zhu
  • Chen Wei
  • Yijiang Li
  • Alan Yuille
  • Cihang Xie

This paper presents ARVideo, a new self-supervised video representation learning framework that autoregressively predicts the next video token in a tailored sequence order. Two key designs are included. First, we organize autoregressive video tokens into clusters that span both spatially and temporally, enabling richer aggregation of contextual information than standard spatial-only or temporal-only clusters. Second, we adopt a randomized spatiotemporal prediction order to facilitate learning from multi-dimensional data, addressing the limitations of a handcrafted spatial-first or temporal-first sequence order. Extensive experiments establish ARVideo as an effective paradigm for self-supervised video representation learning. For example, when trained with the ViT-B backbone, ARVideo attains 81.2% on Kinetics-400 and 70.9% on Something-Something V2, on par with the strong benchmark set by VideoMAE. Importantly, ARVideo also demonstrates higher training efficiency, i.e., it trains 14% faster and requires 58% less GPU memory than VideoMAE.
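
As a rough illustration of the two designs named above, here is a minimal sketch of spatiotemporal token clustering and a randomized prediction order; the block shapes and function names are illustrative, not ARVideo's actual implementation:

```python
import numpy as np

def spatiotemporal_clusters(T, H, W, ct, ch, cw):
    """Group a (T, H, W) grid of video-token indices into clusters that
    span both space and time (block shape ct x ch x cw)."""
    ids = np.arange(T * H * W).reshape(T, H, W)
    clusters = []
    for t in range(0, T, ct):
        for h in range(0, H, ch):
            for w in range(0, W, cw):
                block = ids[t:t + ct, h:h + ch, w:w + cw]
                clusters.append(block.ravel().tolist())
    return clusters

def randomized_prediction_order(clusters, rng):
    """Randomize the order in which clusters are predicted, instead of a
    fixed spatial-first or temporal-first raster scan."""
    order = rng.permutation(len(clusters))
    return [clusters[i] for i in order]

# Example: 8 frames of 14x14 tokens, clusters spanning 2 frames x 2x2 patches.
rng = np.random.default_rng(0)
clusters = spatiotemporal_clusters(T=8, H=14, W=14, ct=2, ch=2, cw=2)
sequence = randomized_prediction_order(clusters, rng)
# An autoregressive model would then predict the tokens of sequence[k]
# conditioned on all tokens in sequence[:k].
```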

ICML Conference 2025 Conference Paper

Core Knowledge Deficits in Multi-Modal Language Models

  • Yijiang Li
  • Qingying Gao
  • Tianwei Zhao
  • Bingyang Wang
  • Haoran Sun
  • Haiyun Lyu
  • Robert D. Hawkins
  • Nuno Vasconcelos

While Multi-modal Large Language Models (MLLMs) demonstrate impressive abilities in high-level perception and reasoning, their robustness in the wild remains limited, often falling short on tasks that are intuitive and effortless for humans. We examine the hypothesis that these deficiencies stem from the absence of core knowledge: rudimentary cognitive abilities innate to humans from early childhood. To explore the core knowledge representation in MLLMs, we introduce CoreCognition, a large-scale benchmark encompassing 12 core knowledge concepts grounded in developmental cognitive science. We evaluate 230 models with 11 different prompts, leading to a total of 2,530 data points for analysis. Our experiments uncover four key findings, collectively demonstrating core knowledge deficits in MLLMs: they consistently underperform and show reduced, or even absent, scalability on low-level abilities relative to high-level ones. Finally, we propose Concept Hacking, a novel controlled evaluation method that reveals MLLMs fail to progress toward genuine core knowledge understanding but instead rely on shortcut learning as they scale. Project page: https://williamium3000.github.io/core-knowledge/.
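
A minimal sketch of what a Concept-Hacking-style controlled evaluation could look like, assuming a paired control/manipulated probe format; the field names and model interface are hypothetical, not the benchmark's released API:

```python
def concept_hacking_score(model, pairs):
    """Hedged sketch of a controlled evaluation: each item pairs a control
    probe with a manipulated probe in which the task-relevant concept (and
    hence the correct answer) is flipped while superficial cues are kept.
    `model(image, prompt)` returns an answer string; pair fields are assumed."""
    control_acc = manipulated_acc = 0
    for p in pairs:
        control_acc += model(p["control_img"], p["prompt"]) == p["control_ans"]
        manipulated_acc += model(p["manip_img"], p["prompt"]) == p["manip_ans"]
    n = len(pairs)
    # A large gap (high control, low manipulated accuracy) signals shortcut
    # learning rather than genuine core-knowledge understanding.
    return control_acc / n, manipulated_acc / n
```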

ICML Conference 2025 Conference Paper

EgoPrivacy: What Your First-Person Camera Says About You?

  • Yijiang Li
  • Genpei Zhang
  • Jiacheng Cheng
  • Yi Li 0051
  • Xiaojun Shan
  • Dashan Gao 0001
  • Jiancheng Lyu
  • Yuan Li

While the rapid proliferation of wearable cameras has raised significant concerns about egocentric video privacy, prior work has largely overlooked the unique privacy threats posed to the camera wearer. This work investigates the core question: how much private information about the camera wearer can be inferred from their first-person view videos? We introduce EgoPrivacy, the first large-scale benchmark for the comprehensive evaluation of privacy risks in egocentric vision. EgoPrivacy covers three types of privacy (demographic, individual, and situational), defining seven tasks that aim to recover private information ranging from fine-grained (e.g., wearer's identity) to coarse-grained (e.g., age group). To further emphasize the privacy threats inherent to egocentric vision, we propose Retrieval-Augmented Attack, a novel attack strategy that leverages ego-to-exo retrieval from an external pool of exocentric videos to boost the effectiveness of demographic privacy attacks. An extensive comparison of the attacks possible under all threat models shows that private information about the wearer is highly susceptible to leakage. For instance, our findings indicate that foundation models can effectively compromise wearer privacy even in zero-shot settings by recovering attributes such as identity, scene, gender, and race with 70–80% accuracy. Our code and data are available at https://github.com/williamium3000/ego-privacy.
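
A minimal sketch of the ego-to-exo retrieval idea behind Retrieval-Augmented Attack, assuming precomputed video embeddings and attribute labels for the exocentric pool; the names and majority-vote aggregation are illustrative, not the paper's code:

```python
import numpy as np

def retrieval_augmented_attack(ego_emb, exo_embs, exo_attrs, k=10):
    """Embed the egocentric video, retrieve the k most similar exocentric
    videos from an external pool by cosine similarity, and majority-vote
    over their (known) demographic labels."""
    sims = exo_embs @ ego_emb / (
        np.linalg.norm(exo_embs, axis=1) * np.linalg.norm(ego_emb) + 1e-8)
    top = np.argsort(-sims)[:k]
    votes = [exo_attrs[i] for i in top]
    return max(set(votes), key=votes.count)  # predicted attribute

# Example with random data: a pool of 1000 exo clips with binary labels.
rng = np.random.default_rng(0)
exo_embs = rng.normal(size=(1000, 512))
exo_attrs = rng.integers(0, 2, size=1000).tolist()
ego_emb = rng.normal(size=512)
print(retrieval_augmented_attack(ego_emb, exo_embs, exo_attrs))
```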

JBHI Journal 2025 Journal Article

Enhancing Weakly Supervised Semantic Segmentation With Multi-Label Contrastive Learning and LLM Features Guidance

  • Wentian Cai
  • Yijiang Li
  • Yandan Chen
  • Jing Lin
  • Zihao Huang
  • Ping Gao
  • Thippa Reddy Gadekallu
  • Wei Wang

Histopathological whole-slide image (WSI) segmentation is essential for precise tissue characterization in medical diagnostics, but traditional approaches require labor-intensive pixel-level annotations. To this end, we study weakly supervised semantic segmentation (WSSS), which uses patch-level classification labels and thus reduces annotation effort significantly. However, the complexity of WSIs and the sparsity of classification labels hinder effective dense pixel prediction. Moreover, because WSIs are multi-label by nature, existing single-label contrastive learning approaches, designed to represent a single category, neglect the presence of other relevant categories and thus fail to adapt to WSI tasks. This paper presents a novel multi-label contrastive learning method for WSSS that incorporates class-specific embedding extraction with LLM feature guidance. Specifically, we obtain class-specific embeddings by utilizing classifier weights, followed by a dot-product-based attention fusion that leverages LLM features to enrich their semantics, facilitating contrastive learning between different classes from a single image. In addition, we propose a robust learning approach that leverages multi-layer features to estimate the uncertainty of pseudo-labels, thereby mitigating the impact of noisy pseudo-labels on the segmentation learning process. Extensive experiments on two histopathological image segmentation datasets, the LUAD and BCSS datasets, demonstrate the effectiveness of our method with leading performance.
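
A minimal sketch of the two components as the abstract describes them: class-specific embeddings pooled with classifier weights, then enriched with LLM text features via dot-product attention. The shapes, names, and residual fusion are assumptions, not the paper's implementation:

```python
import torch
import torch.nn.functional as F

def class_specific_embeddings(features, classifier_weight):
    """Use each classifier weight vector as a query over patch features
    to pool a per-class embedding from one image.
    features: (N, D) patch features; classifier_weight: (C, D)."""
    attn = F.softmax(classifier_weight @ features.T, dim=-1)   # (C, N)
    return attn @ features                                     # (C, D)

def fuse_with_llm(class_emb, llm_feats):
    """Dot-product attention that enriches class embeddings with LLM text
    features (llm_feats: (C, D), assumed pre-projected to dimension D)."""
    attn = F.softmax(class_emb @ llm_feats.T, dim=-1)          # (C, C)
    return class_emb + attn @ llm_feats                        # (C, D)

# The fused per-class embeddings can then serve as anchors for multi-label
# contrastive learning across the classes present in a single WSI patch.
```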

ICLR Conference 2025 Conference Paper

From Pixels to Tokens: Byte-Pair Encoding on Quantized Visual Modalities

  • Wanpeng Zhang 0002
  • Zilong Xie
  • Yicheng Feng
  • Yijiang Li
  • Xingrun Xing
  • Sipeng Zheng
  • Zongqing Lu 0002

Multimodal Large Language Models (MLLMs) have made significant strides in integrating visual and textual information, yet they often struggle to align these modalities effectively. We introduce a novel image tokenizer that bridges this gap by applying the principle of Byte-Pair Encoding (BPE) to visual data. Unlike conventional approaches that rely on separate visual encoders, our method directly incorporates structural prior information into image tokens, mirroring the successful tokenization strategies used in text-only Large Language Models. This approach enables Transformer models to learn and reason across modalities more effectively. Through theoretical analysis and extensive experiments, we demonstrate that our BPE Image Tokenizer significantly enhances MLLMs' multimodal understanding capabilities, even with limited training data. Leveraging this method, we develop Being-VL-0, a model that demonstrates superior performance across various benchmarks and shows promising scalability, potentially paving the way for more efficient and capable multimodal foundation models. For further details, visit our website: https://github.com/BeingBeyond/Being-VL-0.
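
For intuition, a text-style BPE learner applied to sequences of quantized visual token ids might look like the sketch below; this mirrors classic BPE over raster-scanned VQ codebook indices and is not the paper's exact tokenizer:

```python
from collections import Counter

def merge_pair(seq, pair, new_id):
    """Replace every occurrence of an adjacent id pair with a new id."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out

def learn_bpe_merges(sequences, num_merges):
    """Repeatedly merge the most frequent adjacent pair of visual token
    ids into a new composite token, exactly as text BPE does for bytes."""
    seqs = [list(s) for s in sequences]
    next_id = max(t for s in seqs for t in s) + 1
    merges = {}
    for _ in range(num_merges):
        pairs = Counter(
            (s[i], s[i + 1]) for s in seqs for i in range(len(s) - 1))
        if not pairs:
            break
        best = pairs.most_common(1)[0][0]
        merges[best] = next_id
        seqs = [merge_pair(s, best, next_id) for s in seqs]
        next_id += 1
    return merges
```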

AAAI Conference 2025 Conference Paper

Towards Adversarially Robust Dataset Distillation by Curvature Regularization

  • Eric Xue
  • Yijiang Li
  • Haoyang Liu
  • Peiran Wang
  • Yifan Shen
  • Haohan Wang

Dataset distillation (DD) condenses datasets to a fraction of their original size while preserving rich distributional information, so that models trained on the distilled datasets achieve comparable accuracy at a significantly reduced computational cost. Recent research in this area has focused on improving the accuracy of models trained on distilled datasets. In this paper, we explore a new perspective on DD: how to embed adversarial robustness in distilled datasets, so that models trained on them maintain high accuracy while acquiring better adversarial robustness. We propose a new method that achieves this goal by incorporating curvature regularization into the distillation process, with far less computational overhead than standard adversarial training. Extensive empirical experiments suggest that our method not only outperforms standard adversarial training in both accuracy and robustness at lower computational cost, but also generates robust distilled datasets that can withstand various adversarial attacks.
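
A hedged sketch of what curvature regularization during distillation could look like; a CURE-style finite-difference penalty is used here as a common stand-in, and the paper's exact regularizer may differ:

```python
import torch

def curvature_penalty(model, x, y, loss_fn, h=1e-2):
    """Penalize how fast the input gradient changes along its own
    direction: a finite-difference proxy for the curvature of the loss
    surface around the (distilled) samples x, shape (B, C, H, W)."""
    x = x.detach().requires_grad_(True)
    g = torch.autograd.grad(loss_fn(model(x), y), x, create_graph=True)[0]
    d = (g / (g.flatten(1).norm(dim=1).view(-1, 1, 1, 1) + 1e-8)).detach()
    g2 = torch.autograd.grad(loss_fn(model(x + h * d), y), x,
                             create_graph=True)[0]
    return ((g2 - g).flatten(1).norm(dim=1) ** 2).mean()

# Assumed usage inside the distillation loop (lam is a hypothetical weight):
# total = distill_loss + lam * curvature_penalty(net, x_syn, y_syn, F.cross_entropy)
```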

TMLR Journal 2024 Journal Article

Choosing Wisely and Learning Deeply: Selective Cross-Modality Distillation via CLIP for Domain Generalization

  • Jixuan Leng
  • Yijiang Li
  • Haohan Wang

Domain Generalization (DG), a crucial research area, seeks to train models across multiple domains and test them on unseen ones. In this paper, we introduce a novel approach, namely, Selective Cross-Modality Distillation for Domain Generalization (SCMD). SCMD leverages the capabilities of large vision-language models, specifically CLIP, to train a more efficient model, ensuring it acquires robust generalization capabilities across unseen domains. Our primary contribution is a unique selection framework strategically designed to identify hard-to-learn samples for distillation. In parallel, we introduce a novel cross-modality module that seamlessly combines the projected features of the student model with the text embeddings from CLIP, ensuring the alignment of similarity distributions. We assess SCMD's performance on various benchmarks, where it empowers a ResNet50 to deliver state-of-the-art performance, surpassing existing domain generalization methods. Furthermore, we provide a theoretical analysis of our selection strategy, offering deeper insight into its effectiveness and potential in the field of DG.
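
A minimal sketch of the two ingredients the abstract describes: selecting hard-to-learn samples and aligning the student's similarity distribution over CLIP text embeddings with the teacher's. The loss form and names are illustrative, not SCMD's code:

```python
import torch
import torch.nn.functional as F

def cross_modality_kd_loss(student_feats, clip_text_emb, teacher_logits, tau=2.0):
    """Score projected student features against CLIP class text embeddings
    and align the resulting similarity distribution with the teacher's.
    student_feats: (B, D) already projected; clip_text_emb: (C, D)."""
    s = F.normalize(student_feats, dim=-1) @ F.normalize(clip_text_emb, dim=-1).T
    return F.kl_div(F.log_softmax(s / tau, dim=-1),
                    F.softmax(teacher_logits / tau, dim=-1),
                    reduction="batchmean") * tau * tau

def select_hard_samples(losses, frac=0.3):
    """Selection sketch: keep the hardest-to-learn fraction of the batch
    (highest per-sample loss) for distillation."""
    k = max(1, int(frac * losses.numel()))
    return losses.topk(k).indices
```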

NeurIPS Conference 2024 Conference Paper

SparseLLM: Towards Global Pruning of Pre-trained Language Models

  • Guangji Bai
  • Yijiang Li
  • Chen Ling
  • Kibaek Kim
  • Liang Zhao

The transformative impact of large language models (LLMs) like LLaMA and GPT on natural language processing is countered by their prohibitive computational demands. Pruning has emerged as a pivotal compression strategy, introducing sparsity to improve both memory and computational efficiency. Yet traditional global pruning is impractical for LLMs due to scalability issues, while local pruning, despite its efficiency, leads to suboptimal solutions. Addressing these challenges, we propose SparseLLM, a novel framework that decomposes the global pruning process into manageable, coordinated subproblems, allowing for resource-efficient optimization with global optimality. SparseLLM's approach, which conceptualizes LLMs as a chain of modular functions and leverages auxiliary variables for problem decomposition, not only facilitates a pragmatic application on LLMs but also demonstrates significant performance improvements, particularly in high-sparsity regimes, where it surpasses current state-of-the-art methods. Our source code is publicly available at https://github.com/BaiTheBest/SparseLLM.
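
A simplified sketch of one per-layer subproblem in an auxiliary-variable decomposition of this kind; magnitude pruning plus a least-squares refit is a stand-in here, not SparseLLM's actual solver:

```python
import numpy as np

def prune_layer_subproblem(W, X, Z, sparsity):
    """Find a sparse W' whose outputs on calibration inputs X stay close
    to auxiliary target activations Z: magnitude-prune W, then refit the
    surviving weights of each output row by least squares.
    W: (out, in); X: (n, in); Z: (n, out)."""
    mask = np.abs(W) >= np.quantile(np.abs(W), sparsity)
    W_p = np.zeros_like(W)
    for j in range(W.shape[0]):          # refit each output row separately
        idx = np.nonzero(mask[j])[0]
        if idx.size:
            W_p[j, idx] = np.linalg.lstsq(X[:, idx], Z[:, j], rcond=None)[0]
    return W_p

# Globally, the model is viewed as a chain of such layers; the auxiliary
# activations Z are updated alternately with the pruned weights so the
# per-layer solutions stay coordinated toward the global objective.
```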

TMLR Journal 2024 Journal Article

Towards Understanding Adversarial Transferability in Federated Learning

  • Yijiang Li
  • Ying Gao
  • Haohan Wang

We investigate a specific security risk in federated learning (FL): a group of malicious clients disguises itself as benign during training but later switches to an adversarial role. These clients use their data, which is part of the training set, to train a substitute model and conduct transferable adversarial attacks against the federated model. This type of attack is subtle and hard to detect because the clients initially appear benign. The key question we address is: how robust is the FL system to such covert attacks, especially compared to traditional centralized learning systems? We empirically show that the proposed attack poses a high security risk to current FL systems. Using only 3% of the clients' data, we achieve an attack success rate of over 80%. To offer a fuller understanding of the challenges FL systems face under transferable attacks, we provide a comprehensive analysis of the transfer robustness of FL across a spectrum of configurations. Surprisingly, FL systems show a higher level of robustness than their centralized counterparts, especially when both systems are equally good at handling regular, non-malicious data. We attribute this increased robustness to two main factors: 1) decentralized data training: each client trains the model on its own data, reducing the overall impact of any single malicious client; and 2) model update averaging: the updates from each client are averaged together, further diluting any malicious alterations. Both practical experiments and theoretical analyses support our conclusions. This research not only sheds light on the resilience of FL systems against hidden attacks but also raises important considerations for their future application and development.
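
A minimal sketch of the threat model, assuming FedAvg aggregation and using PGD as a stand-in transferable attack crafted on the colluding clients' substitute model (the paper evaluates a broader range of attacks):

```python
import torch

def fedavg(client_states):
    """FedAvg: average client model parameters (uniform weights)."""
    keys = client_states[0].keys()
    return {k: torch.stack([s[k].float() for s in client_states]).mean(0)
            for k in keys}

def pgd_transfer_attack(substitute, x, y, loss_fn, eps=8/255, alpha=2/255, steps=10):
    """Craft PGD adversarial examples on the substitute model trained on
    the malicious clients' own shard; these are then transferred to the
    aggregated global model. Inputs x are assumed to lie in [0, 1]."""
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        g = torch.autograd.grad(loss_fn(substitute(x_adv), y), x_adv)[0]
        x_adv = (x_adv + alpha * g.sign()).detach()
        x_adv = x + (x_adv - x).clamp(-eps, eps)   # project into eps-ball
        x_adv = x_adv.clamp(0, 1)
    return x_adv

# Transfer success = error rate of the federated global model on x_adv.
```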