Arrow Research search

Author name cluster

Shuai Shao

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

14 papers
2 author rows

Possible papers (14)

AAAI 2026 · Conference Paper

Style4D-Bench: A Benchmark Suite for 4D Stylization

  • Beiqi Chen
  • Shuai Shao
  • Haitang Feng
  • Jianhuang Lai
  • Jianlou Si
  • Guangcong Wang

We introduce Style4D-Bench, the first benchmark suite specifically designed for 4D stylization, with the goal of standardizing evaluation and facilitating progress in this emerging area. Style4D-Bench comprises: 1) a strong baseline that makes an initial attempt at 4D stylization, 2) a comprehensive evaluation protocol measuring spatial fidelity, temporal coherence, and multi-view consistency through both perceptual and quantitative metrics, and 3) a curated collection of high-resolution dynamic 4D scenes with diverse motions and complex backgrounds. To establish a strong baseline, we present Style4D, a novel framework built upon 4D Gaussian Splatting. It consists of three key components: a basic 4DGS scene representation to capture reliable geometry, a Style Gaussian Representation that leverages lightweight per-Gaussian MLPs for temporally and spatially aware appearance control, and a Holistic Geometry-Preserved Style Transfer module designed to enhance spatio-temporal consistency via contrastive coherence learning and structural content preservation. Extensive experiments on Style4D-Bench demonstrate that Style4D achieves state-of-the-art performance in 4D stylization, producing fine-grained stylistic details with stable temporal dynamics and consistent multi-view rendering. We expect Style4D-Bench to become a valuable resource for benchmarking and advancing research in stylized rendering of dynamic 3D scenes.
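
As a rough illustration of the per-Gaussian appearance idea the abstract mentions, the sketch below (all class and variable names are hypothetical, not from the Style4D code) shows a lightweight shared-weight MLP that conditions each Gaussian's feature vector on a time input and predicts an RGB offset:

```python
# Hypothetical sketch of a per-Gaussian appearance MLP, assuming each Gaussian
# carries a small learnable style feature. A shared lightweight MLP takes that
# feature plus a scalar time encoding and outputs an RGB offset, giving
# temporally aware per-Gaussian appearance control.
import torch
import torch.nn as nn

class PerGaussianStyleMLP(nn.Module):
    def __init__(self, feat_dim: int = 16, hidden: int = 32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim + 1, hidden),  # per-Gaussian feature + time
            nn.ReLU(),
            nn.Linear(hidden, 3),             # RGB offset per Gaussian
        )

    def forward(self, feats: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        # feats: (N, feat_dim) per-Gaussian style features; t: (N, 1) timestamps
        return torch.tanh(self.net(torch.cat([feats, t], dim=-1)))

# Usage: stylized color offsets for 1000 Gaussians at time t = 0.5.
mlp = PerGaussianStyleMLP()
feats = torch.randn(1000, 16)
t = torch.full((1000, 1), 0.5)
rgb_offset = mlp(feats, t)  # (1000, 3), added to the base Gaussian colors
```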

NeurIPS 2025 · Conference Paper

AgentNet: Decentralized Evolutionary Coordination for LLM-based Multi-Agent Systems

  • Yingxuan Yang
  • Huacan Chai
  • Shuai Shao
  • Yuanyi Song
  • Siyuan Qi
  • Renting Rui
  • Weinan Zhang

The rapid advancement of Large Language Models (LLMs) has catalyzed the development of multi-agent systems, where multiple LLM-based agents collaborate to solve complex tasks. However, existing systems predominantly rely on centralized coordination, which introduces scalability bottlenecks, limits adaptability, and creates single points of failure. Additionally, concerns over privacy and proprietary knowledge sharing hinder cross-organizational collaboration, leading to siloed expertise. To address these challenges, we propose AgentNet, a decentralized, Retrieval-Augmented Generation (RAG)-based framework that enables LLM-based agents to autonomously evolve their capabilities and collaborate efficiently in a Directed Acyclic Graph (DAG)-structured network. Unlike traditional multi-agent systems that depend on static role assignments or centralized control, AgentNet allows agents to specialize dynamically, adjust their connectivity, and route tasks without relying on predefined workflows. AgentNet’s core design is built upon several key innovations: (1) Fully Decentralized Paradigm: Removing the central orchestrator, allowing agents to coordinate and specialize autonomously, fostering fault tolerance and emergent collective intelligence. (2) Dynamically Evolving Graph Topology: Real-time adaptation of agent connections based on task demands, ensuring scalability and resilience. (3) Adaptive Learning for Expertise Refinement: A retrieval-based memory system that enables agents to continuously update and refine their specialized skills. By eliminating centralized control, AgentNet enhances fault tolerance, promotes scalable specialization, and enables privacy-preserving collaboration across organizations. Through decentralized coordination and minimal data exchange, agents can leverage diverse knowledge sources while safeguarding sensitive information. Experimental results demonstrate that AgentNet outperforms traditional centralized multi-agent systems, significantly improving efficiency, adaptability, and scalability in dynamic environments, making it a promising foundation for next-generation autonomous, privacy-respecting multi-agent ecosystems.
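
The following toy sketch (interfaces and names are invented for illustration, not AgentNet's API) captures the decentralized flavor described above: each agent keeps only its own view of its successors, forwards a task to the neighbor whose capability vector best matches it, and locally reweights that edge from the observed outcome, with no central orchestrator:

```python
# Illustrative, hypothetical sketch of decentralized DAG routing: purely local
# neighbor scoring and edge updates, no global coordinator.
import numpy as np

class Agent:
    def __init__(self, name, capability):
        self.name = name
        self.capability = np.asarray(capability, dtype=float)
        self.successors = {}  # neighbor agent -> learned edge weight

    def add_successor(self, agent, weight=1.0):
        self.successors[agent] = weight

    def route(self, task_vec):
        # Score neighbors by capability match scaled by the local edge weight.
        scores = {a: w * float(a.capability @ task_vec)
                  for a, w in self.successors.items()}
        return max(scores, key=scores.get) if scores else None

    def feedback(self, agent, success, lr=0.1):
        # Local, decentralized edge update from the task outcome.
        self.successors[agent] += lr * (1.0 if success else -1.0)

coder = Agent("coder", [1.0, 0.1])
tester = Agent("tester", [0.1, 1.0])
router = Agent("router", [0.5, 0.5])
router.add_successor(coder)
router.add_successor(tester)
chosen = router.route(np.array([0.9, 0.2]))  # a coding-heavy task -> coder
router.feedback(chosen, success=True)        # strengthens the used edge
```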

AAAI 2025 · Conference Paper

Excluding the Impossible for Open Vocabulary Semantic Segmentation

  • Shiyuan Zhao
  • Baodi Liu
  • Yu Bai
  • Weifeng Liu
  • Shuai Shao

Open vocabulary semantic segmentation is a hot topic in research, focusing on segmenting and recognizing a diverse array of categories in varied environments, including those previously unknown, thereby holding significant practical value. Mainstream studies utilize the CLIP model for direct semantic segmentation (denoted as “forward methods”), which often struggles to represent underrepresented categories effectively. To address this issue, this paper introduces a novel approach, the Excluding the ImpossibLe Semantic Segmentation Network (ELSE-Net), based on reverse thinking. By excluding improbable categories, ELSE-Net narrows the selection range for forward methods, significantly reducing the risk of misclassification. In implementation, we initially draw on leading research to design the General Processing Block (GP-Block), which generates inclusion probabilities (the likelihood of belonging to a category) by using the CLIP model in cooperation with a Mask Proposal Network (MPN). We then present the EXcluding the ImPossible Block (EXP-Block), which computes exclusion probabilities (the likelihood of not belonging to a category) through the CLIPN model and a custom-designed Reverse Retrieval Adapter (R2-Adapter). These exclusion probabilities are subsequently used to refine the inclusion probabilities, which are ultimately employed to annotate class-agnostic masks. Moreover, the core component of our EXP-Block is model-agnostic, enabling it to enhance the capabilities of existing frameworks. Experimental results from four benchmark datasets validate the effectiveness of ELSE-Net and underscore the seamless model-agnostic functionality of the EXP-Block.
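
A minimal numeric sketch of the "exclude the impossible" idea (the exact fusion rule in ELSE-Net may differ): down-weight each class's inclusion probability by its exclusion probability, renormalize, and then pick a label:

```python
# Toy refinement of inclusion probabilities by exclusion probabilities.
# The numbers and the multiplicative fusion rule are illustrative only.
import numpy as np

inclusion = np.array([0.40, 0.35, 0.25])  # P(region belongs to class c), e.g. from CLIP + MPN
exclusion = np.array([0.05, 0.80, 0.10])  # P(region does NOT belong to c), e.g. from CLIPN

refined = inclusion * (1.0 - exclusion)   # suppress classes judged "impossible"
refined /= refined.sum()                  # renormalize to a distribution

print(refined.round(3))   # class 1 is heavily suppressed despite a high prior
print(refined.argmax())   # 0: the surviving most-likely class
```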

ICML 2025 · Conference Paper

Extreme Value Policy Optimization for Safe Reinforcement Learning

  • Shiqing Gao
  • Yihang Zhou
  • Shuai Shao
  • Haoyu Luo
  • Yiheng Bing
  • Jiaxin Ding 0001
  • Luoyi Fu
  • Xinbing Wang

Ensuring safety is a critical challenge in applying Reinforcement Learning (RL) to real-world scenarios. Constrained Reinforcement Learning (CRL) addresses this by maximizing returns under predefined constraints, typically formulated as the expected cumulative cost. However, expectation-based constraints overlook rare but high-impact extreme value events in the tail distribution, such as black swan incidents, which can lead to severe constraint violations. To address this issue, we propose the Extreme Value policy Optimization (EVO) algorithm, leveraging Extreme Value Theory (EVT) to model and exploit extreme reward and cost samples, reducing constraint violations. EVO introduces an extreme quantile optimization objective to explicitly capture extreme samples in the cost tail distribution. Additionally, we propose an extreme prioritization mechanism during replay, amplifying the learning signal from rare but high-impact extreme samples. Theoretically, we establish upper bounds on expected constraint violations during policy updates, guaranteeing strict constraint satisfaction at a zero-violation quantile level. Further, we demonstrate that EVO achieves a lower probability of constraint violations than expectation-based methods and exhibits lower variance than quantile regression methods. Extensive experiments show that EVO significantly reduces constraint violations during training while maintaining competitive policy performance compared to baselines.
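
To make the EVT ingredient concrete, here is a hedged sketch of the standard peaks-over-threshold estimator for an extreme cost quantile using a Generalized Pareto fit; this shows the textbook EVT building block, not EVO's full training objective:

```python
# Peaks-over-threshold (POT) estimate of an extreme cost quantile.
# The cost data here is synthetic; thresholds and levels are illustrative.
import numpy as np
from scipy.stats import genpareto

rng = np.random.default_rng(0)
costs = rng.exponential(scale=1.0, size=5000)   # stand-in episode costs

u = np.quantile(costs, 0.95)                    # tail threshold
exceed = costs[costs > u] - u                   # peaks over the threshold
xi, _, sigma = genpareto.fit(exceed, floc=0)    # fit GPD shape (xi) and scale

q = 0.999                                       # extreme quantile level
n, n_u = len(costs), len(exceed)
# Standard POT quantile estimator (xi near 0 corresponds to an exponential tail):
x_q = u + (sigma / xi) * (((n / n_u) * (1 - q)) ** (-xi) - 1.0)
print(f"estimated 99.9% cost quantile: {x_q:.2f}")
```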

TMLR 2025 · Journal Article

Preference Discerning with LLM-Enhanced Generative Retrieval

  • Fabian Paischer
  • Liu Yang
  • Linfeng Liu
  • Shuai Shao
  • Kaveh Hassani
  • Jiacheng Li
  • Ricky T. Q. Chen
  • Zhang Gabriel Li

In sequential recommendation, models recommend items based on a user's interaction history. To this end, current models usually incorporate information such as item descriptions and user intent or preferences. User preferences are usually not explicitly given in open-source datasets, and thus need to be approximated, for example via large language models (LLMs). Current approaches leverage approximated user preferences only during training and rely solely on the past interaction history for recommendations, limiting their ability to dynamically adapt to changing preferences, potentially reinforcing echo chambers. To address this issue, we propose a new paradigm, namely *preference discerning*, which explicitly conditions a generative recommendation model on user preferences in natural language within its context. To evaluate *preference discerning*, we introduce a novel benchmark that provides a holistic evaluation across various scenarios, including preference steering and sentiment following. Upon evaluating current state-of-the-art methods on our benchmark, we discover that their ability to dynamically adapt to evolving user preferences is limited. To address this, we propose a new method named Mender (**M**ultimodal Prefer**en**ce **D**iscern**er**), which achieves state-of-the-art performance in our benchmark. Our results show that Mender effectively adapts its recommendations guided by human preferences, even if not observed during training, paving the way toward more flexible recommendation models.
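
Purely as illustration of the paradigm, the sketch below shows one simple way to put a natural-language preference in-context ahead of the interaction history; the prompt format and helper function are hypothetical, not Mender's actual input pipeline:

```python
# Hypothetical in-context conditioning for a generative recommender.
def build_input(preference: str, history: list[str]) -> str:
    # The stated preference precedes the interaction history, so the decoder
    # can steer its next-item prediction with it.
    lines = [f"User preference: {preference}", "Interaction history:"]
    lines += [f"- {item}" for item in history]
    lines.append("Next item:")
    return "\n".join(lines)

prompt = build_input(
    "prefers lightweight trail-running shoes under $100",
    ["road running shoes", "moisture-wicking socks", "GPS watch"],
)
print(prompt)  # fed to the generative retrieval model as its context
```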

NeurIPS 2025 · Conference Paper

RiOSWorld: Benchmarking the Risk of Multimodal Computer-Use Agents

  • Jingyi Yang
  • Shuai Shao
  • Dongrui Liu
  • Jing Shao

With the rapid development of multimodal large language models (MLLMs), they are increasingly deployed as autonomous computer-use agents capable of accomplishing complex computer tasks. However, a pressing issue arises: Can the safety risk principles designed and aligned for general MLLMs in dialogue scenarios be effectively transferred to real-world computer-use scenarios? Existing research on evaluating the safety risks of MLLM-based computer-use agents suffers from several limitations: it either lacks realistic interactive environments, or narrowly focuses on one or a few specific risk types. These limitations ignore the complexity, variability, and diversity of real-world environments, thereby restricting comprehensive risk evaluation for computer-use agents. To this end, we introduce RiOSWorld, a benchmark designed to evaluate the potential risks of MLLM-based agents during real-world computer manipulations. Our benchmark includes 492 risky tasks spanning various computer applications, involving web, social media, multimedia, OS, email, and office software. We categorize these risks into two major classes based on their risk source: (i) User-originated risks and (ii) Environmental risks. We evaluate safety risks from two perspectives: (i) Risk goal intention and (ii) Risk goal completion. Extensive experiments with multimodal agents on RiOSWorld demonstrate that current computer-use agents confront significant safety risks in real-world scenarios. Our findings highlight the necessity and urgency of safety alignment for computer-use agents in real-world computer manipulation, providing valuable insights for developing trustworthy computer-use agents.
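
A hypothetical scoring sketch for the two evaluation perspectives the abstract names, whether an agent *intends* a risky goal and whether it *completes* it; the field names are illustrative, not RiOSWorld's actual schema:

```python
# Toy per-task risk record and aggregate rates; all values are made up.
from dataclasses import dataclass

@dataclass
class RiskResult:
    task_id: str
    risk_source: str        # "user" or "environment"
    intended_risk: bool     # agent exhibited the risky goal intention
    completed_risk: bool    # agent actually carried the risky goal out

results = [
    RiskResult("web-013", "environment", True, False),
    RiskResult("email-002", "user", True, True),
    RiskResult("os-007", "user", False, False),
]

n = len(results)
intention_rate = sum(r.intended_risk for r in results) / n
completion_rate = sum(r.completed_risk for r in results) / n
print(f"intention rate {intention_rate:.2f}, completion rate {completion_rate:.2f}")
```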

TMLR 2025 · Journal Article

Unifying Generative and Dense Retrieval for Sequential Recommendation

  • Liu Yang
  • Fabian Paischer
  • Kaveh Hassani
  • Jiacheng Li
  • Shuai Shao
  • Zhang Gabriel Li
  • Yun He
  • Xue Feng

Sequential dense retrieval models utilize advanced sequence learning techniques to compute item and user representations, which are then used to rank relevant items for a user through inner product computation between the user and all item representations. While effective, these approaches incur high memory and computational costs due to the need to store and compare a unique embedding for each item, leading to lower resource efficiency. In contrast, the recently proposed generative retrieval paradigm offers a promising alternative by directly predicting item indices using a generative model trained on semantic IDs that encapsulate items’ semantic information. Despite its potential for large-scale applications, a comprehensive comparison between generative retrieval and sequential dense retrieval under fair conditions is still lacking, leaving open questions regarding performance and resource efficiency trade-offs. To address this, we compare these two approaches under controlled conditions on academic benchmarks and observe performance gaps, with dense retrieval showing stronger ranking performance, while generative retrieval provides greater resource efficiency. Motivated by these observations, we propose LIGER (LeveragIng dense retrieval for GEnerative Retrieval), a hybrid model that combines the strengths of these two widely used approaches. LIGER integrates sequential dense retrieval into generative retrieval, mitigating performance differences between the two methods, and enhancing cold-start item recommendation in the evaluated datasets. This hybrid approach provides insight into the trade-offs between these approaches and demonstrates improvements in efficiency and effectiveness for recommendation systems in small-scale benchmarks.
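
A rough sketch of the hybrid idea (the interfaces are invented for illustration, not LIGER's implementation): a generative model proposes a small candidate set, and a dense retriever scores that set with embedding inner products, so cold-start items that have embeddings but weak generation likelihoods can still surface:

```python
# Hypothetical two-stage ranking: generative proposals re-ranked densely.
import numpy as np

def hybrid_rank(user_vec, item_vecs, generated_ids, k=3):
    # generated_ids: candidate item indices proposed by generative retrieval.
    scores = {i: float(user_vec @ item_vecs[i]) for i in generated_ids}
    return sorted(scores, key=scores.get, reverse=True)[:k]

rng = np.random.default_rng(1)
item_vecs = rng.normal(size=(100, 8))       # stand-in dense item embeddings
user_vec = rng.normal(size=8)               # stand-in dense user embedding
candidates = [3, 17, 42, 56, 91]            # stand-in generative proposals
print(hybrid_rank(user_vec, item_vecs, candidates))
```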

AAAI 2024 · Conference Paper

Collaborative Consortium of Foundation Models for Open-World Few-Shot Learning

  • Shuai Shao
  • Yu Bai
  • Yan Wang
  • Baodi Liu
  • Bin Liu

Open-World Few-Shot Learning (OFSL) is a crucial research field dedicated to accurately identifying target samples in scenarios where data is limited and labels are unreliable. This research holds significant practical implications and is highly relevant to real-world applications. Recently, the advancements in foundation models like CLIP and DINO have showcased their robust representation capabilities even in resource-constrained settings with scarce data. This realization has brought about a transformative shift in focus, moving away from “building models from scratch” towards “effectively harnessing the potential of foundation models to extract pertinent prior knowledge suitable for OFSL and utilizing it sensibly”. Motivated by this perspective, we introduce the Collaborative Consortium of Foundation Models (CO3), which leverages CLIP, DINO, GPT-3, and DALL-E to collectively address the OFSL problem. CO3 comprises four key blocks: (1) the Label Correction Block (LC-Block) corrects unreliable labels, (2) the Data Augmentation Block (DA-Block) enhances available data, (3) the Feature Extraction Block (FE-Block) extracts multi-modal features, and (4) the Text-guided Fusion Adapter (TeFu-Adapter) integrates multiple features while mitigating the impact of noisy labels through semantic constraints. Only the adapter's parameters are adjustable, while the others remain frozen. Through collaboration among these foundation models, CO3 effectively unlocks their potential and unifies their capabilities to achieve state-of-the-art performance on multiple benchmark datasets. Code is available at https://github.com/The-Shuai/CO3.
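
A minimal PyTorch sketch of the freezing scheme described above, where all foundation-model weights stay fixed and only the adapter trains; the encoder stand-ins and adapter architecture are placeholders, not the actual CO3 backbones or TeFu-Adapter:

```python
# Freeze stand-in backbones, keep only the adapter trainable.
import torch.nn as nn

clip_encoder = nn.Linear(512, 512)   # stand-in for a frozen CLIP backbone
dino_encoder = nn.Linear(384, 512)   # stand-in for a frozen DINO backbone
adapter = nn.Sequential(nn.Linear(1024, 256), nn.ReLU(), nn.Linear(256, 512))

for module in (clip_encoder, dino_encoder):
    for p in module.parameters():
        p.requires_grad = False      # backbones stay frozen

trainable = [p for p in adapter.parameters() if p.requires_grad]
print(sum(p.numel() for p in trainable), "trainable adapter parameters")
```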

AAAI 2024 · Conference Paper

EVE: Efficient Vision-Language Pre-training with Masked Prediction and Modality-Aware MoE

  • Junyi Chen
  • Longteng Guo
  • Jia Sun
  • Shuai Shao
  • Zehuan Yuan
  • Liang Lin
  • Dongyu Zhang

Building scalable vision-language models to learn from diverse, multimodal data remains an open challenge. In this paper, we introduce an Efficient Vision-languagE foundation model, namely EVE, which is one unified multimodal Transformer pre-trained solely by one unified pre-training task. Specifically, EVE encodes both vision and language within a shared Transformer network integrated with modality-aware sparse Mixture-of-Experts (MoE) modules, which capture modality-specific information by selectively switching to different experts. To unify pre-training tasks of vision and language, EVE performs masked signal modeling on image-text pairs to reconstruct masked signals, i.e., image pixels and text tokens, given visible signals. This simple yet effective pre-training objective accelerates training by 4x compared to the model pre-trained with Image-Text Contrastive and Image-Text Matching losses. Owing to the combination of the unified architecture and pre-training task, EVE is easy to scale up, enabling better downstream performance with fewer resources and faster training speed. Despite its simplicity, EVE achieves state-of-the-art performance on various vision-language downstream tasks, including visual question answering, visual reasoning, and image-text retrieval.
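
The toy layer below is a deliberate simplification of the modality-aware MoE described above (module names and routing are illustrative, not EVE's code): each token's feed-forward pass is handled by the expert matching its modality tag:

```python
# Hypothetical modality-aware MoE feed-forward layer.
import torch
import torch.nn as nn

class ModalityMoE(nn.Module):
    def __init__(self, dim: int = 64):
        super().__init__()
        self.experts = nn.ModuleDict({
            "vision": nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)),
            "text":   nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)),
        })

    def forward(self, tokens: torch.Tensor, modality: list[str]) -> torch.Tensor:
        # tokens: (seq, dim); modality[i] names the expert for token i.
        return torch.stack([self.experts[m](tok) for m, tok in zip(modality, tokens)])

layer = ModalityMoE()
tokens = torch.randn(4, 64)
print(layer(tokens, ["vision", "vision", "text", "text"]).shape)  # (4, 64)
```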

NeurIPS 2023 · Conference Paper

Algorithm Selection for Deep Active Learning with Imbalanced Datasets

  • Jifan Zhang
  • Shuai Shao
  • Saurabh Verma
  • Robert Nowak

Label efficiency has become an increasingly important objective in deep learning applications. Active learning aims to reduce the number of labeled examples needed to train deep networks, but the empirical performance of active learning algorithms can vary dramatically across datasets and applications. It is difficult to know in advance which active learning strategy will perform well or best in a given application. To address this, we propose the first adaptive algorithm selection strategy for deep active learning. For any unlabeled dataset, our (meta) algorithm TAILOR (Thompson ActIve Learning algORithm selection) iteratively and adaptively chooses among a set of candidate active learning algorithms. TAILOR uses novel reward functions aimed at gathering class-balanced examples. Extensive experiments in multi-class and multi-label applications demonstrate TAILOR's effectiveness in achieving accuracy comparable to or better than that of the best of the candidate algorithms. Our implementation of TAILOR is open-sourced at https://github.com/jifanz/TAILOR.
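
A hedged sketch of the meta-selection loop: Thompson sampling over candidate active learning strategies with a Beta-Bernoulli posterior. TAILOR's actual reward functions target class balance; this toy uses a generic 0/1 reward with made-up success rates:

```python
# Beta-Bernoulli Thompson sampling over candidate AL strategies.
import numpy as np

rng = np.random.default_rng(0)
strategies = ["margin", "entropy", "coreset"]
alpha = np.ones(len(strategies))  # Beta posterior: successes + 1
beta = np.ones(len(strategies))   # Beta posterior: failures + 1

for _ in range(100):
    draws = rng.beta(alpha, beta)          # sample a win-rate per strategy
    k = int(np.argmax(draws))              # pick the most promising one
    # Stand-in reward: 1 if the chosen batch was useful (e.g. class-balanced).
    reward = rng.random() < [0.3, 0.5, 0.7][k]
    alpha[k] += reward
    beta[k] += 1 - reward

print(dict(zip(strategies, (alpha / (alpha + beta)).round(2))))
```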

AAAI 2019 · Conference Paper

Scene Text Detection with Supervised Pyramid Context Network

  • Enze Xie
  • Yuhang Zang
  • Shuai Shao
  • Gang Yu
  • Cong Yao
  • Guangyao Li

Scene text detection methods based on deep learning have achieved remarkable results over the past years. However, due to the high diversity and complexity of natural scenes, previous state-of-the-art text detection methods may still produce a considerable amount of false positives, when applied to images captured in real-world environments. To tackle this issue, mainly inspired by Mask R-CNN, we propose in this paper an effective model for scene text detection, which is based on Feature Pyramid Network (FPN) and instance segmentation. We propose a supervised pyramid context network (SPCNET) to precisely locate text regions while suppressing false positives. Benefiting from the guidance of semantic information and sharing FPN, SPCNET obtains significantly enhanced performance while introducing marginal extra computation. Experiments on standard datasets demonstrate that our SPCNET clearly outperforms state-of-the-art methods. Specifically, it achieves an F-measure of 92.1% on ICDAR2013, 87.2% on ICDAR2015, 74.1% on ICDAR2017 MLT and 82.9% on Total-Text.
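
One plausible way semantic guidance can suppress false positives, sketched here purely as a hypothetical illustration (SPCNET's actual re-scoring mechanism may differ in detail), is to rescale each detection's confidence by the mean text-saliency inside its box:

```python
# Toy semantic re-scoring of text detections; all data is synthetic.
import numpy as np

def rescore(boxes, scores, saliency):
    # boxes: (N, 4) as (x1, y1, x2, y2); saliency: (H, W) map in [0, 1].
    out = []
    for (x1, y1, x2, y2), s in zip(boxes, scores):
        region = saliency[y1:y2, x1:x2]
        out.append(s * float(region.mean()))  # weak semantic support -> lower score
    return np.array(out)

saliency = np.zeros((100, 100))
saliency[20:40, 10:60] = 0.9                           # one salient text region
boxes = np.array([[10, 20, 60, 40], [70, 70, 95, 90]])  # true hit, false positive
print(rescore(boxes, np.array([0.9, 0.85]), saliency).round(3))  # [0.81, 0.0]
```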