Author name cluster

Xiu Li

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

44 papers
1 author row

Possible papers

TMLR Journal 2026 Journal Article

EgoPlan: Towards Effective Embodied Agents via Egocentric Planning

  • Zhirui Fang
  • Ming Yang
  • Weishuai Zeng
  • Junpeng Yue
  • Boyu Li
  • Jiafei Lyu
  • Xiu Li
  • Zongqing Lu

We explore leveraging large multi-modal models (LMMs) and text-to-image models to build a more general embodied agent. LMMs excel in planning long-horizon tasks over symbolic abstractions but struggle with grounding in the physical world, often failing to accurately identify object positions in images. A bridge is needed to connect LMMs to the physical world. The paper proposes a novel approach, egocentric vision language planning (EgoPlan), to handle long-horizon tasks from an egocentric perspective in varying household scenarios. This pipeline leverages a diffusion model to simulate the fundamental dynamics between states and actions, and integrates computer vision techniques such as style transfer and optical flow to enhance the modeling of spatial states and generalization across different environmental dynamics. The LMM serves as a planner, breaking down instructions into sub-goals and selecting actions based on their alignment with these sub-goals, thus enabling more generalized and effective decision-making. Because the LMM outputs text actions, mechanisms such as reflection allow high-level task decomposition and low-level action output to be performed end-to-end. Experiments show that EgoPlan improves long-horizon task success rates from the egocentric view compared to baselines across household scenarios.

AAAI Conference 2026 Conference Paper

GenPRM: Scaling Test-Time Compute of Process Reward Models via Generative Reasoning

  • Jian Zhao
  • Runze Liu
  • Kaiyan Zhang
  • Zhimu Zhou
  • Junqi Gao
  • Dong Li
  • Jiafei Lyu
  • Zhouyi Qian

Recent advancements in Large Language Models (LLMs) have shown that it is promising to utilize Process Reward Models (PRMs) as verifiers to enhance the performance of LLMs. However, current PRMs face three key challenges: (1) limited process supervision and generalization capabilities, (2) dependence on scalar value prediction without leveraging the generative abilities of LLMs, and (3) inability to scale the test-time compute of PRMs. In this work, we introduce GenPRM, a generative process reward model that performs explicit Chain-of-Thought (CoT) reasoning with code verification before providing judgment for each reasoning step. To obtain high-quality process supervision labels and rationale data, we propose Relative Progress Estimation (RPE) and a rationale synthesis framework that incorporates code verification. Experimental results on ProcessBench and several mathematical reasoning tasks show that GenPRM significantly outperforms prior PRMs with only 23K training examples from the MATH dataset. Through test-time scaling, a 1.5B GenPRM outperforms GPT-4o, and a 7B GenPRM surpasses Qwen2.5-Math-PRM-72B on ProcessBench. Additionally, GenPRM demonstrates strong abilities to serve as a critic model for policy model refinement. This work establishes a new paradigm for process supervision that bridges the gap between PRMs and critic models in LLMs.

AAAI Conference 2026 Conference Paper

Zo3T: Zero-Shot 3D-Aware Trajectory-Guided Image-to-Video Generation via Test-Time Training

  • Ruicheng Zhang
  • Jun Zhou
  • Zunnan Xu
  • Zihao Liu
  • Jiehui Huang
  • Mingyang Zhang
  • Yu Sun
  • Xiu Li

Trajectory-guided image-to-video (I2V) generation aims to synthesize videos that adhere to user-specified motion instructions. Existing methods typically rely on computationally expensive fine-tuning on scarce annotated datasets. Although some zero-shot methods attempt trajectory control in the latent space, they may yield unrealistic motion by neglecting 3D perspective and creating a misalignment between the manipulated latents and the network's noise predictions. To address these challenges, we introduce Zo3T, a novel zero-shot test-time-training framework for trajectory-guided generation with three core innovations. First, we incorporate a 3D-Aware Kinematic Projection, leveraging inferred scene depth to derive perspective-correct affine transformations for target regions. Second, we introduce Trajectory-Guided Test-Time LoRA, a mechanism that dynamically injects ephemeral LoRA adapters into the denoising network and optimizes them alongside the latent state. Driven by a regional feature consistency loss, this co-adaptation effectively enforces motion constraints while allowing the pre-trained model to locally adapt its internal representations to the manipulated latent, thereby ensuring generative fidelity and on-manifold adherence. Finally, we develop Guidance Field Rectification, which refines the denoising evolutionary path by optimizing the conditional guidance field through a one-step lookahead strategy, ensuring efficient generative progression towards the target trajectory. Zo3T significantly enhances 3D realism and motion accuracy in trajectory-controlled I2V generation, demonstrating superior performance over existing training-based and zero-shot approaches.

NeurIPS Conference 2025 Conference Paper

ADG: Ambient Diffusion-Guided Dataset Recovery for Corruption-Robust Offline Reinforcement Learning

  • Zeyuan Liu
  • Zhihe Yang
  • Jiawei Xu
  • Rui Yang
  • Jiafei Lyu
  • Baoxiang Wang
  • Yunjian Xu
  • Xiu Li

Real-world datasets collected from sensors or human inputs are prone to noise and errors, posing significant challenges for applying offline reinforcement learning (RL). While existing methods have made progress in addressing corrupted actions and rewards, they remain insufficient for handling corruption in high-dimensional state spaces and for cases where multiple elements in the dataset are corrupted simultaneously. Diffusion models, known for their strong denoising capabilities, offer a promising direction for this problem, but their tendency to overfit noisy samples limits their direct applicability. To overcome this, we propose Ambient Diffusion-Guided Dataset Recovery (ADG), a novel approach that pioneers the use of diffusion models to tackle data corruption in offline RL. First, we introduce Ambient Denoising Diffusion Probabilistic Models (DDPM) from approximated distributions, which enable learning on partially corrupted datasets with theoretical guarantees. Second, we use the noise-prediction property of Ambient DDPM to distinguish between clean and corrupted data, and then use the clean subset to train a standard DDPM. Third, we employ the trained standard DDPM to refine the previously identified corrupted data, enhancing data quality for subsequent offline RL training. A notable strength of ADG is its versatility: it can be seamlessly integrated with any offline RL algorithm. Experiments on a range of benchmarks, including MuJoCo, Kitchen, and Adroit, demonstrate that ADG effectively mitigates the impact of corrupted data and improves the robustness of offline RL under various noise settings, achieving state-of-the-art results.

AAMAS Conference 2025 Conference Paper

CDSA: Conservative Denoising Score-based Algorithm for Offline Reinforcement Learning

  • Zeyuan Liu
  • Kai Yang
  • Jiafei Lyu
  • Xiu Li

Distribution shift is a major obstacle in offline reinforcement learning (RL). While existing conservative offline RL algorithms perform well in learning in-distribution policies, they often fail to generalize to unseen actions. To address this issue, we propose leveraging knowledge derived from the gradient fields of the dataset’s density to refine and adjust the original actions. Building on this, we introduce the Conservative Denoising Score-based Algorithm (CDSA), which utilizes score-based diffusion models to estimate the gradients of the dataset density and generates action correction subcomponents to refine the actions. This approach enables more accurate and efficient decision-making during the testing phase in Markov Decision Process (MDP) environments. By decoupling conservatism constraints from the policy, our method is broadly applicable to various offline RL algorithms. Experiments demonstrate that our approach significantly enhances baseline performance on D4RL datasets and exhibits plug-and-play compatibility with different pre-trained offline RL policies.
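The idea of nudging an action along the gradient of the dataset density can be illustrated with a toy sketch. This is not the CDSA implementation: the abstract describes a learned score-based diffusion model, whereas the sketch below swaps in the analytic score of a 1-D Gaussian as a stand-in. The function names, the dataset statistics, the step size, and the number of steps are all hypothetical.

```python
def gaussian_score(action, mean, std):
    """Score (gradient of the log-density) of a 1-D Gaussian N(mean, std^2).

    Stand-in for a learned score model; assumed for illustration only.
    """
    return (mean - action) / (std ** 2)


def correct_action(action, mean, std, step_size=0.01, n_steps=5):
    """Move an action toward higher dataset density by gradient ascent on log-density."""
    for _ in range(n_steps):
        action = action + step_size * gaussian_score(action, mean, std)
    return action


# Hypothetical dataset action statistics and an out-of-distribution action.
dataset_mean, dataset_std = 0.5, 0.2
raw_action = 1.4

corrected = correct_action(raw_action, dataset_mean, dataset_std)
print(f"raw={raw_action:.3f} corrected={corrected:.3f}")
```

Because the correction only needs the score of the data density, a step like this can in principle be applied on top of any policy's output, which is one way to read the plug-and-play property the abstract mentions.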

TMLR Journal 2025 Journal Article

Decentralized Transformers with Centralized Aggregation are Sample-Efficient Multi-Agent World Models

  • Yang Zhang
  • Chenjia Bai
  • Bin Zhao
  • Junchi Yan
  • Xiu Li
  • Xuelong Li

Learning a world model for model-free Reinforcement Learning (RL) agents can significantly improve sample efficiency by learning policies in imagination. However, building a world model for Multi-Agent RL (MARL) can be particularly challenging due to the scalability issue across different numbers of agents in a centralized architecture, and also the non-stationarity issue in a decentralized architecture stemming from the inter-dependency among agents. To address both challenges, we propose a novel world model for MARL that learns decentralized local dynamics for scalability, combined with a centralized representation aggregation from all agents. We cast the dynamics learning as an auto-regressive sequence modeling problem over discrete tokens by leveraging the expressive Transformer architecture, in order to model complex local dynamics across different agents and provide accurate and consistent long-term imaginations. In this first Transformer-based world model for multi-agent systems, we introduce a Perceiver Transformer as an effective solution to enable centralized representation aggregation. Extensive results on the StarCraft Multi-Agent Challenge (SMAC) and MAMuJoCo demonstrate superior sample efficiency and overall performance compared to strong model-free approaches and existing model-based methods.

AAAI Conference 2025 Conference Paper

Densely Connected Parameter-Efficient Tuning for Referring Image Segmentation

  • Jiaqi Huang
  • Zunnan Xu
  • Ting Liu
  • Yong Liu
  • Haonan Han
  • Kehong Yuan
  • Xiu Li

In the domain of computer vision, Parameter-Efficient Tuning (PET) is increasingly replacing the traditional paradigm of pre-training followed by full fine-tuning. PET is particularly favored for its effectiveness in large foundation models, as it streamlines transfer learning costs and optimizes hardware utilization. However, the current PET methods are mainly designed for single-modal optimization. While some pioneering studies have undertaken preliminary explorations, they still remain at the level of aligned encoders (e.g., CLIP) and lack exploration of misaligned encoders. These methods show sub-optimal performance with misaligned encoders, as they fail to effectively align the multimodal features during fine-tuning. In this paper, we introduce DETRIS, a parameter-efficient tuning framework designed to enhance low-rank visual feature propagation by establishing dense interconnections between each layer and all preceding layers, which enables effective cross-modal feature interaction and adaptation to misaligned encoders. We also suggest using text adapters to improve textual features. Our simple yet efficient approach greatly surpasses state-of-the-art methods with 0.9% to 1.8% backbone parameter updates, evaluated on challenging benchmarks.

AAMAS Conference 2025 Conference Paper

Leveraging Score-based Models for Generating Penalization in Model-based Offline Reinforcement Learning

  • Zeyuan Liu
  • Zhirui Fang
  • Jiafei Lyu
  • Xiu Li

A core challenge in model-based offline reinforcement learning is constructing penalties over the state-action space of the offline dataset, which is typically high-dimensional. We define “cliffs” as regions in the state-action space where data density changes sharply, and our investigation shows that existing approaches struggle with accuracy near these cliffs. The formation of cliffs could be influenced by human-defined parameters and objective physical laws, often beyond the understanding of RL agents. This results in a lack of established methods to address this issue. To overcome these limitations, we propose Score as a Penalty for Model-based Offline Reinforcement Learning (ScorePen-MORL). This approach generates penalties based on the gradient field of dataset density in the state-action space. ScorePen-MORL is a plug-and-play solution that can achieve impressive results independently while also enhancing the performance of baseline algorithms when combined with them. Our empirical findings demonstrate that cliff regions in the dataset are a significant bottleneck in offline model-based RL, and ScorePen-MORL effectively addresses this issue by generating highly sensitive penalties for these cliff regions. On the D4RL and NeoRL benchmarks, our method outperforms recent strong model-based offline RL baseline algorithms.

NeurIPS Conference 2025 Conference Paper

Linear Differential Vision Transformer: Learning Visual Contrasts via Pairwise Differentials

  • Yifan Pu
  • Jixuan Ying
  • Qixiu Li
  • Tianzhu Ye
  • Dongchen Han
  • Xiaochen Wang
  • Ziyi Wang
  • shao xinyu

Vision Transformers (ViTs) have become a universal backbone for both image recognition and image generation. Yet their Multi-Head Self-Attention (MHSA) layer still performs a quadratic query-key interaction for every token pair, spending the bulk of computation on visually weak or redundant correlations. We introduce Visual-Contrast Attention (VCA), a drop-in replacement for MHSA that injects an explicit notion of discrimination while reducing the theoretical complexity from $\mathcal{O}(N^{2}C)$ to $\mathcal{O}(NnC)$ with $n \ll N$. VCA first distils each head’s dense query field into a handful of spatially pooled visual-contrast tokens, then splits them into a learnable positive and negative stream whose differential interaction highlights what truly separates one region from another. The module adds fewer than 0.3M parameters to a DeiT-Tiny backbone, requires no extra FLOPs, and is wholly architecture-agnostic. Empirically, VCA lifts DeiT-Tiny top-1 accuracy on ImageNet-1K from 72.2% to 75.6% (+3.4) and improves three strong hierarchical ViTs by up to 3.1%, while in class-conditional ImageNet generation it lowers FID-50K by 2.1 to 5.2 points across both diffusion (DiT) and flow (SiT) models. Extensive ablations confirm that (i) spatial pooling supplies low-variance global cues, (ii) dual positional embeddings are indispensable for contrastive reasoning, and (iii) combining the two in both stages yields the strongest synergy. VCA therefore offers a simple path towards faster and sharper Vision Transformers. The source code is available at https://github.com/LeapLabTHU/LinearDiff.

NeurIPS Conference 2025 Conference Paper

MindOmni: Unleashing Reasoning Generation in Vision Language Models with RGPO

  • Yicheng Xiao
  • Lin Song
  • Yukang Chen
  • Yingmin Luo
  • Yuxin Chen
  • Yukang Gan
  • Wei Huang
  • Xiu Li

Recent text-to-image systems face limitations in handling multimodal inputs and complex reasoning tasks. We introduce MindOmni, a unified multimodal large language model that addresses these challenges by incorporating reasoning generation through reinforcement learning. MindOmni leverages a three-phase training strategy: i) design of a unified vision language model with a decoder-only diffusion module, ii) supervised fine-tuning with Chain-of-Thought (CoT) instruction data, and iii) our proposed Reasoning Generation Policy Optimization (RGPO) algorithm, utilizing multimodal feedback to effectively guide policy updates. Experimental results demonstrate that MindOmni outperforms existing models, achieving impressive performance on both understanding and generation benchmarks while also showcasing advanced fine-grained reasoning generation capabilities, especially with mathematical reasoning instructions. All code will be made public.

AAAI Conference 2025 Conference Paper

MultiBooth: Towards Generating All Your Concepts in an Image from Text

  • Chenyang Zhu
  • Kai Li
  • Yue Ma
  • Chunming He
  • Xiu Li

This paper introduces MultiBooth, a method that generates images from texts containing various concepts from users. Despite diffusion models bringing significant advancements for customized text-to-image generation, existing methods often struggle with multi-concept scenarios due to low concept fidelity and high inference cost. MultiBooth addresses these issues by dividing the multi-concept generation process into two phases: a single-concept learning phase and a multi-concept integration phase. During the single-concept learning phase, we employ a multi-modal image encoder and an efficient concept encoding technique to learn a concise and discriminative representation for each concept. In the multi-concept integration phase, we use bounding boxes to define the generation area for each concept within the cross-attention map. This method enables the creation of individual concepts within their specified regions, thereby facilitating the formation of multi-concept images. This strategy not only improves concept fidelity but also reduces additional inference cost. MultiBooth surpasses various baselines in both qualitative and quantitative evaluations, showcasing its superior performance and computational efficiency.

NeurIPS Conference 2025 Conference Paper

Puppeteer: Rig and Animate Your 3D Models

  • Chaoyue Song
  • Xiu Li
  • Fan Yang
  • Zhongcong Xu
  • Jiacheng Wei
  • Fayao Liu
  • Jiashi Feng
  • Guosheng Lin

Modern interactive applications increasingly demand dynamic 3D content, yet the transformation of static 3D models into animated assets constitutes a significant bottleneck in content creation pipelines. While recent advances in generative AI have revolutionized static 3D model creation, rigging and animation continue to depend heavily on expert intervention. We present Puppeteer, a comprehensive framework that addresses both automatic rigging and animation for diverse 3D objects. Our system first predicts plausible skeletal structures via an auto-regressive transformer that introduces a joint-based tokenization strategy for compact representation and a hierarchical ordering methodology with stochastic perturbation that enhances bidirectional learning capabilities. It then infers skinning weights via an attention-based architecture incorporating topology-aware joint attention that explicitly encodes inter-joint relationships based on skeletal graph distances. Finally, we complement these rigging advances with a differentiable optimization-based animation pipeline that generates stable, high-fidelity animations while being computationally more efficient than existing approaches. Extensive evaluations across multiple benchmarks demonstrate that our method significantly outperforms state-of-the-art techniques in both skeletal prediction accuracy and skinning quality. The system robustly processes diverse 3D content, ranging from professionally designed game assets to AI-generated shapes, producing temporally coherent animations that eliminate the jittering issues common in existing methods.

NeurIPS Conference 2025 Conference Paper

Revisiting Multi-Agent World Modeling from a Diffusion-Inspired Perspective

  • Yang Zhang
  • Xinran Li
  • Jianing Ye
  • Shuang Qiu
  • Delin Qu
  • Xiu Li
  • Chongjie Zhang
  • Chenjia Bai

World models have recently attracted growing interest in Multi-Agent Reinforcement Learning (MARL) due to their ability to improve sample efficiency for policy learning. However, accurately modeling environments in MARL is challenging due to the exponentially large joint action space and highly uncertain dynamics inherent in multi-agent systems. To address this, we reduce modeling complexity by shifting from jointly modeling the entire state-action transition dynamics to focusing on the state space alone at each timestep through sequential agent modeling. Specifically, our approach enables the model to progressively resolve uncertainty while capturing the structured dependencies among agents, providing a more accurate representation of how agents influence the state. Interestingly, this sequential revelation of agents' actions in a multi-agent system aligns with the reverse process in diffusion models, a class of powerful generative models known for their expressiveness and training stability compared to autoregressive or latent variable models. Leveraging this insight, we develop a flexible and robust world model for MARL using diffusion models. Our method, Diffusion-Inspired Multi-Agent world model (DIMA), achieves state-of-the-art performance across multiple multi-agent control benchmarks, including MAMuJoCo and Bi-DexHands, significantly outperforming prior world models in terms of final return and sample efficiency. DIMA establishes a new paradigm for constructing multi-agent world models, advancing the frontier of MARL research. Code is open-sourced at https://github.com/breez3young/DIMA.

NeurIPS Conference 2025 Conference Paper

SAM-R1: Leveraging SAM for Reward Feedback in Multimodal Segmentation via Reinforcement Learning

  • Jiaqi Huang
  • Zunnan Xu
  • Jun Zhou
  • Ting Liu
  • Yicheng Xiao
  • Mingwen Ou
  • Bowen Ji
  • Xiu Li

Leveraging multimodal large models for image segmentation has become a prominent research direction. However, existing approaches typically rely heavily on manually annotated datasets that include explicit reasoning processes, which are costly and time-consuming to produce. Recent advances suggest that reinforcement learning (RL) can endow large models with reasoning capabilities without requiring such reasoning-annotated data. In this paper, we propose SAM-R1, a novel framework that enables multimodal large models to perform fine-grained reasoning in image understanding tasks. Our approach is the first to incorporate fine-grained segmentation settings during the training of multimodal reasoning models. By integrating task-specific, fine-grained rewards with a tailored optimization objective, we further enhance the model's reasoning and segmentation alignment. We also leverage the Segment Anything Model (SAM) as a strong and flexible reward provider to guide the learning process. With only 3k training samples, SAM-R1 achieves strong performance across multiple benchmarks, demonstrating the effectiveness of reinforcement learning in equipping multimodal models with segmentation-oriented reasoning capabilities.

JBHI Journal 2025 Journal Article

Seeking Common Ground While Reserving Differences: Multiple Anatomy Collaborative Framework for Undersampled MRI Reconstruction

  • Jiangpeng Yan
  • Chenghui Yu
  • Hanbo Chen
  • Zhe Xu
  • Junzhou Huang
  • Xiu Li
  • Jianhua Yao

Recently, deep neural networks have greatly advanced undersampled Magnetic Resonance Image (MRI) reconstruction, wherein most studies follow the one-anatomy-one-network fashion, i.e., each expert network is trained and evaluated for a specific anatomy. Apart from the inefficiency of training multiple independent models, this convention ignores the shared de-aliasing knowledge across various anatomies, which can benefit each of them. To explore the shared knowledge, one naive way is to combine all the data from various anatomies to train an all-round network. Unfortunately, despite the existence of the shared de-aliasing knowledge, we reveal that the exclusive knowledge across different anatomies can deteriorate specific reconstruction targets, yielding overall performance degradation. Observing this, in this study, we present a novel deep MRI reconstruction framework with both anatomy-shared and anatomy-specific parameterized learners, aiming to “seek common ground while reserving differences” across different anatomies. Particularly, the primary anatomy-shared learners are exposed to different anatomies to model rich shared de-aliasing knowledge, while the efficient anatomy-specific learners are trained with their target anatomy for exclusive knowledge. Four different implementations of anatomy-specific learners are presented and explored on top of our framework in two MRI reconstruction networks. Comprehensive experiments on brain, knee and cardiac MRI datasets demonstrate that three of these learners are able to enhance reconstruction performance via multiple anatomy collaborative learning. Extensive studies show that our strategy can also benefit multiple pulse sequence MRI reconstruction by integrating sequence-specific learners.

AAAI Conference 2025 Conference Paper

SUMO: Search-Based Uncertainty Estimation for Model-Based Offline Reinforcement Learning

  • Zhongjian Qiao
  • Jiafei Lyu
  • Kechen Jiao
  • Qi Liu
  • Xiu Li

The performance of offline reinforcement learning (RL) suffers from the limited size and quality of static datasets. Model-based offline RL addresses this issue by generating synthetic samples through a dynamics model to enhance overall performance. To evaluate the reliability of the generated samples, uncertainty estimation methods are often employed. However, model ensemble, the most commonly used uncertainty estimation method, is not always the best choice. In this paper, we propose a Search-based Uncertainty estimation method for Model-based Offline RL (SUMO) as an alternative. SUMO characterizes the uncertainty of synthetic samples by measuring their cross entropy against the in-distribution dataset samples, and uses an efficient search-based method for implementation. In this way, SUMO can achieve trustworthy uncertainty estimation. We integrate SUMO into several model-based offline RL algorithms including MOPO and Adapted MOReL (AMOReL), and provide theoretical analysis for them. Extensive experimental results on D4RL datasets demonstrate that SUMO can provide accurate uncertainty estimation and boost the performance of base algorithms. These results indicate that SUMO could be a better uncertainty estimator for model-based offline RL when used in either reward penalty or trajectory truncation.

EAAI Journal 2024 Journal Article

A two-stage reinforcement learning-based approach for multi-entity task allocation

  • Aicheng Gong
  • Kai Yang
  • Jiafei Lyu
  • Xiu Li

Task allocation is a key combinatorial optimization problem, crucial for modern applications such as multi-robot cooperation and resource scheduling. Decision makers must allocate entities to tasks reasonably across different scenarios. However, traditional methods assume static attributes and numbers of tasks and entities, often relying on dynamic programming and heuristic algorithms for solutions. In reality, task allocation resembles Markov decision processes, with dynamically changing task and entity attributes. Thus, algorithms must dynamically allocate tasks based on their states. To address this issue, we propose a two-stage task allocation algorithm based on similarity, utilizing reinforcement learning to learn allocation strategies. The proposed pre-assign strategy allows entities to preselect appropriate tasks, effectively avoiding local optima and thereby better finding the optimal allocation. We also introduce an attention mechanism and a hyperparameter network structure to adapt to the changing number and attributes of entities and tasks, enabling our network structure to generalize to new tasks. Experimental results across multiple environments demonstrate that our algorithm effectively addresses the challenges of dynamic task allocation in practical applications. Compared to heuristic algorithms like genetic algorithms, our reinforcement learning approach better solves dynamic allocation problems and achieves zero-shot generalization to new tasks with good performance. The code is available at https://github.com/yk7333/TaskAllocation.

IJCAI Conference 2024 Conference Paper

BATON: Aligning Text-to-Audio Model Using Human Preference Feedback

  • Huan Liao
  • Haonan Han
  • Kai Yang
  • Tianjiao Du
  • Rui Yang
  • Qinmei Xu
  • Zunnan Xu
  • Jingquan Liu

With the development of AI-Generated Content (AIGC), text-to-audio models are gaining widespread attention. However, it is challenging for these models to generate audio aligned with human preference due to the inherent information density of natural language and limited model understanding ability. To alleviate this issue, we formulate BATON, the first framework specifically designed to enhance the alignment between generated audio and text prompt using human preference feedback. BATON comprises three key stages. First, we curate a dataset containing both prompts and the corresponding generated audio, which is then annotated based on human feedback. Second, we introduce a reward model trained on the constructed dataset, which can mimic human preference by assigning rewards to input text-audio pairs. Finally, we employ the reward model to fine-tune an off-the-shelf text-to-audio model. The experimental results demonstrate that BATON can significantly improve the generation quality of the original text-to-audio models with respect to audio integrity, temporal relationship, and alignment with human preference. The project page is available at https://baton2024.github.io.

NeurIPS Conference 2024 Conference Paper

Bridging the Divide: Reconsidering Softmax and Linear Attention

  • Dongchen Han
  • Yifan Pu
  • Zhuofan Xia
  • Yizeng Han
  • Xuran Pan
  • Xiu Li
  • Jiwen Lu
  • Shiji Song

Widely adopted in modern Vision Transformer designs, Softmax attention can effectively capture long-range visual information; however, it incurs excessive computational cost when dealing with high-resolution inputs. In contrast, linear attention naturally enjoys linear complexity and has great potential to scale up to higher-resolution images. Nonetheless, the unsatisfactory performance of linear attention greatly limits its practical application in various scenarios. In this paper, we take a step forward to close the gap between linear and Softmax attention with novel theoretical analyses, which demystify the core factors behind the performance deviations. Specifically, we present two key perspectives to understand and alleviate the limitations of linear attention: the injective property and the local modeling ability. First, we prove that linear attention is not injective and is prone to assigning identical attention weights to different query vectors, thus causing severe semantic confusion since different queries correspond to the same outputs. Second, we confirm that effective local modeling is essential for the success of Softmax attention, in which linear attention falls short. These two fundamental differences significantly contribute to the disparities between the two attention paradigms, as demonstrated by our substantial empirical validation in the paper. In addition, further experimental results indicate that linear attention, as long as it is endowed with these two properties, can outperform Softmax attention across various tasks while maintaining lower computational complexity. Code is available at https://github.com/LeapLabTHU/InLine.
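The non-injectivity claim can be made concrete with a small sketch. It shows one simple way identical weights can arise and is not necessarily the paper's construction. It assumes a ReLU feature map for linear attention, which is a common choice but an assumption here, and the keys and queries are made-up toy values. With such a map, two queries that point in the same direction but differ in norm receive the same normalized weights, whereas Softmax attention separates them.

```python
import math


def relu(vec):
    return [max(0.0, x) for x in vec]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def linear_attention_weights(query, keys):
    """Normalized linear-attention weights using a ReLU feature map (assumed kernel)."""
    phi_q = relu(query)
    scores = [dot(phi_q, relu(k)) for k in keys]
    total = sum(scores)
    return [s / total for s in scores]


def softmax_attention_weights(query, keys):
    """Standard Softmax attention weights (scaling factor omitted for brevity)."""
    logits = [dot(query, k) for k in keys]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - peak) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy keys and two queries with the same direction but different norms.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
q_small = [1.0, 2.0]
q_large = [2.0, 4.0]

lin_small = linear_attention_weights(q_small, keys)
lin_large = linear_attention_weights(q_large, keys)
soft_small = softmax_attention_weights(q_small, keys)
soft_large = softmax_attention_weights(q_large, keys)

print("linear :", lin_small, lin_large)
print("softmax:", soft_small, soft_large)
```

The scale factor cancels in the linear-attention normalization, so the two distinct queries map to the same weights; in the Softmax case the larger query produces a sharper distribution.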

AAAI Conference 2024 Conference Paper

Chain of Generation: Multi-Modal Gesture Synthesis via Cascaded Conditional Control

  • Zunnan Xu
  • Yachao Zhang
  • Sicheng Yang
  • Ronghui Li
  • Xiu Li

This study aims to improve the generation of 3D gestures by utilizing multimodal information from human speech. Previous studies have focused on incorporating additional modalities to enhance the quality of generated gestures. However, these methods perform poorly when certain modalities are missing during inference. To address this problem, we suggest using speech-derived multimodal priors to improve gesture generation. We introduce a novel method that separates priors from speech and employs multimodal priors as constraints for generating gestures. Our approach utilizes a chain-like modeling method to generate facial blendshapes, body movements, and hand gestures sequentially. Specifically, we incorporate rhythm cues derived from facial deformation and a stylization prior based on speech emotions into the process of generating gestures. By incorporating multimodal priors, our method improves the quality of generated gestures and eliminates the need for expensive setup preparation during inference. Extensive experiments and user studies confirm that our proposed approach achieves state-of-the-art performance.

NeurIPS Conference 2024 Conference Paper

COVE: Unleashing the Diffusion Feature Correspondence for Consistent Video Editing

  • Jiangshan Wang
  • Yue Ma
  • Jiayi Guo
  • Yicheng Xiao
  • Gao Huang
  • Xiu Li

Video editing is an emerging task, in which most current methods adopt the pre-trained text-to-image (T2I) diffusion model to edit the source video in a zero-shot manner. Despite extensive efforts, maintaining the temporal consistency of edited videos remains challenging due to the lack of temporal constraints in the regular T2I diffusion model. To address this issue, we propose COrrespondence-guided Video Editing (COVE), leveraging the inherent diffusion feature correspondence to achieve high-quality and consistent video editing. Specifically, we propose an efficient sliding-window-based strategy to calculate the similarity among tokens in the diffusion features of source videos, identifying the tokens with high correspondence across frames. During the inversion and denoising process, we sample the tokens in noisy latent based on the correspondence and then perform self-attention within them. To save the usage of GPU memory and accelerate the editing process, we further introduce the temporal-dimensional token merging strategy, which can effectively reduce the redundancy. COVE can be seamlessly integrated into the pre-trained T2I diffusion model without the need for extra training or optimization. Extensive experiment results demonstrate that COVE achieves the state-of-the-art performance in various video editing scenarios, outperforming existing methods both quantitatively and qualitatively. The source code will be released.

AAAI Conference 2024 Conference Paper

Cross-Modal Match for Language Conditioned 3D Object Grounding

  • Yachao Zhang
  • Runze Hu
  • Ronghui Li
  • Yanyun Qu
  • Yuan Xie
  • Xiu Li

Language conditioned 3D object grounding aims to find the object within the 3D scene mentioned by natural language descriptions, which mainly depends on the matching between visual and natural language. Considerable improvement in grounding performance is achieved by improving the multimodal fusion mechanism or bridging the gap between detection and matching. However, several mismatches are ignored, i.e., mismatch in local visual representation and global sentence representation, and mismatch in visual space and corresponding label word space. In this paper, we propose cross-modal match for 3D grounding from the perspective of mitigating these mismatches. Specifically, to match local visual features with the global description sentence, we propose a BEV (Bird’s-eye-view) based global information embedding module. It projects multiple object proposal features into the BEV, and the relations of different objects are accessed by the visual transformer, which can model both positions and features with long-range dependencies. To circumvent the mismatch in feature spaces of different modalities, we propose cross-modal consistency learning. It applies cross-modal consistency constraints to convert the visual feature space into the label word feature space, resulting in easier matching. Besides, we introduce a label distillation loss and a global distillation loss to drive the learning of these matches in a distillation way. We evaluate our method in mainstream evaluation settings on three datasets, and the results demonstrate the effectiveness of the proposed method.

AAAI Conference 2024 Short Paper

Dual Mapping of 2D StyleGAN for 3D-Aware Image Generation and Manipulation (Student Abstract)

  • Zhuo Chen
  • Haimei Zhao
  • Chaoyue Wang
  • Bo Yuan
  • Xiu Li

3D-aware GANs successfully solve the problem of 3D-consistency generation and furthermore provide a 3D shape of the generated object. However, the application of the volume renderer disturbs the disentanglement of the latent space, which makes it difficult to manipulate 3D-aware GANs and lowers the image quality of style-based generators. In this work, we devise a dual-mapping framework to make the generated images of pretrained 2D StyleGAN consistent in 3D space. We utilize a tri-plane representation to estimate the 3D shape of the generated object and two mapping networks to bridge the latent space of StyleGAN and the 3D tri-plane space. Our method does not alter the parameters of the pretrained generator, which means the interpretability of latent space is preserved for various image manipulations. Experiments show that our method lifts the 3D awareness of pretrained 2D StyleGAN to 3D-aware GANs and outperforms the 3D-aware GANs in controllability and image quality.

AAAI Conference 2024 Conference Paper

Follow Your Pose: Pose-Guided Text-to-Video Generation Using Pose-Free Videos

  • Yue Ma
  • Yingqing He
  • Xiaodong Cun
  • Xintao Wang
  • Siran Chen
  • Xiu Li
  • Qifeng Chen

Generating text-editable and pose-controllable character videos is in high demand for creating various digital humans. Nevertheless, this task has been restricted by the absence of a comprehensive dataset featuring paired video-pose captions and the generative prior models for videos. In this work, we design a novel two-stage training scheme that can utilize easily obtained datasets (i.e., image pose pair and pose-free video) and the pre-trained text-to-image (T2I) model to obtain the pose-controllable character videos. Specifically, in the first stage, the keypoint image pairs are used only for controllable text-to-image generation. We learn a zero-initialized convolutional encoder to encode the pose information. In the second stage, we finetune the motion of the above network via a pose-free video dataset by adding the learnable temporal self-attention and reformed cross-frame self-attention blocks. Powered by our new designs, our method successfully generates continuously pose-controllable character videos while keeping the editing and concept composition ability of the pre-trained T2I model. The code and models are available on https://follow-your-pose.github.io/.

NeurIPS Conference 2024 Conference Paper

MambaTalk: Efficient Holistic Gesture Synthesis with Selective State Space Models

  • Zunnan Xu
  • Yukang Lin
  • Haonan Han
  • Sicheng Yang
  • Ronghui Li
  • Yachao Zhang
  • Xiu Li

Gesture synthesis is a vital realm of human-computer interaction, with wide-ranging applications across various fields like film, robotics, and virtual reality. Recent advancements have utilized the diffusion model to improve gesture synthesis. However, the high computational complexity of these techniques limits the application in reality. In this study, we explore the potential of state space models (SSMs). Direct application of SSMs in gesture synthesis encounters difficulties, which stem primarily from the diverse movement dynamics of various body parts. The generated gestures may also exhibit unnatural jittering issues. To address these, we implement a two-stage modeling strategy with discrete motion priors to enhance the quality of gestures. Built upon the selective scan mechanism, we introduce MambaTalk, which integrates hybrid fusion modules, local and global scans to refine latent space representations. Subjective and objective experiments demonstrate that our method surpasses the performance of state-of-the-art models. Our project is publicly available at https://kkakkkka.github.io/MambaTalk/.

NeurIPS Conference 2024 Conference Paper

MambaTree: Tree Topology is All You Need in State Space Model

  • Yicheng Xiao
  • Lin Song
  • Shaoli Huang
  • Jiangshan Wang
  • Siyu Song
  • Yixiao Ge
  • Xiu Li
  • Ying Shan

The state space models, employing recursively propagated features, demonstrate strong representation capabilities comparable to Transformer models and superior efficiency. However, constrained by the inherent geometric constraints of sequences, they still fall short in modeling long-range dependencies. To address this issue, we propose the MambaTree network, which first dynamically generates a tree topology based on spatial relationships and input features. Then, feature propagation is performed based on this graph, thereby breaking the original sequence constraints to achieve stronger representation capabilities. Additionally, we introduce a linear complexity dynamic programming algorithm to enhance long-range interactions without increasing computational cost. MambaTree is a versatile multimodal framework that can be applied to both visual and textual tasks. Extensive experiments demonstrate that our method significantly outperforms existing structured state space models on image classification, object detection and segmentation. Besides, by fine-tuning large language models, our approach achieves consistent improvements in multiple textual tasks at minor training cost.

AAMAS Conference 2024 Conference Paper

Normalization Enhances Generalization in Visual Reinforcement Learning

  • Lu Li
  • Jiafei Lyu
  • Guozheng Ma
  • Zilin Wang
  • Zhenjie Yang
  • Xiu Li
  • Zhiheng Li

Recent advances in visual reinforcement learning (RL) have led to impressive success in handling complex tasks. However, these methods have demonstrated limited generalization capability to visual disturbances, which poses a significant challenge to their real-world application and adaptability. Though normalization techniques have demonstrated huge success in supervised and unsupervised learning, their applications in visual RL are still scarce. In this paper, we explore the potential benefits of integrating normalization into visual RL methods with respect to generalization performance. We find that, perhaps surprisingly, incorporating suitable normalization techniques is sufficient to enhance the generalization capabilities, without any additional special design. We utilize the combination of two normalization techniques, CrossNorm and SelfNorm, for generalizable visual RL. Extensive experiments are conducted on DMControl Generalization Benchmark, CARLA, and ProcGen Benchmark to validate the effectiveness of our method. We show that our method significantly improves generalization capability while only marginally affecting sample efficiency. In particular, when integrated with DrQ-v2, our method enhances the test performance of DrQ-v2 on CARLA across various scenarios, from 14% of the training performance to 97%. Our project page: https://sites.google.com/view/norm-generalization-vrl/home
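As a rough illustration of the CrossNorm idea mentioned in this abstract, the sketch below swaps the mean and standard deviation between two feature channels. This follows the general description of CrossNorm as exchanging channel-wise statistics; how the operation is placed inside a visual RL encoder is not specified in the abstract, and the function name, the `eps` value, and the flat-list representation of a channel are all assumptions made for illustration.

```python
import statistics

def cross_norm(x, y, eps=1e-5):
    # x and y are flat lists of activations from one channel of two feature maps.
    # Each channel is standardized with its own statistics, then re-scaled and
    # re-shifted with the statistics of the other channel.
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sx = statistics.pstdev(x) + eps  # eps guards against division by zero
    sy = statistics.pstdev(y) + eps
    x_new = [(v - mx) / sx * sy + my for v in x]
    y_new = [(v - my) / sy * sx + mx for v in y]
    return x_new, y_new

x_new, y_new = cross_norm([1.0, 2.0, 3.0], [10.0, 20.0, 30.0])
print(x_new)
print(y_new)
```

After the exchange, each channel keeps its own spatial pattern but carries the other channel's mean and spread, which acts as a style-like perturbation of the features.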

NeurIPS Conference 2024 Conference Paper

ODRL: A Benchmark for Off-Dynamics Reinforcement Learning

  • Jiafei Lyu
  • Kang Xu
  • Jiacheng Xu
  • Mengbei Yan
  • Jingwen Yang
  • Zongzhang Zhang
  • Chenjia Bai
  • Zongqing Lu

We consider off-dynamics reinforcement learning (RL) where one needs to transfer policies across different domains with dynamics mismatch. Despite the focus on developing dynamics-aware algorithms, this field is hindered due to the lack of a standard benchmark. To bridge this gap, we introduce ODRL, the first benchmark tailored for evaluating off-dynamics RL methods. ODRL contains four experimental settings where the source and target domains can be either online or offline, and provides diverse tasks and a broad spectrum of dynamics shifts, making it a reliable platform to comprehensively evaluate the agent's adaptation ability to the target domain. Furthermore, ODRL includes recent off-dynamics RL algorithms in a unified framework and introduces some extra baselines for different settings, all implemented in a single-file manner. To unpack the true adaptation capability of existing methods, we conduct extensive benchmarking experiments, which show that no method has universal advantages across varied dynamics shifts. We hope this benchmark can serve as a cornerstone for future research endeavors. Our code is publicly available at https://github.com/OffDynamicsRL/off-dynamics-rl.

NeurIPS Conference 2024 Conference Paper

Real-world Image Dehazing with Coherence-based Pseudo Labeling and Cooperative Unfolding Network

  • Chengyu Fang
  • Chunming He
  • Fengyang Xiao
  • Yulun Zhang
  • Longxiang Tang
  • Yuelin Zhang
  • Kai Li
  • Xiu Li

Real-world Image Dehazing (RID) aims to alleviate haze-induced degradation in real-world settings. This task remains challenging due to the complexities in accurately modeling real haze distributions and the scarcity of paired real-world data. To address these challenges, we first introduce a cooperative unfolding network that jointly models atmospheric scattering and image scenes, effectively integrating physical knowledge into deep networks to restore haze-contaminated details. Additionally, we propose the first RID-oriented iterative mean-teacher framework, termed the Coherence-based Label Generator, to generate high-quality pseudo labels for network training. Specifically, we provide an optimal label pool to store the best pseudo-labels during network training, leveraging both global and local coherence to select high-quality candidates and assign weights to prioritize haze-free regions. We verify the effectiveness of our method, with experiments demonstrating that it achieves state-of-the-art performance on RID tasks. Code will be available at https://github.com/cnyvfang/CORUN-Colabator.

AAAI Conference 2024 Short Paper

STViT: Improving Self-Supervised Multi-Camera Depth Estimation with Spatial-Temporal Context and Adversarial Geometry Regularization (Student Abstract)

  • Zhuo Chen
  • Haimei Zhao
  • Bo Yuan
  • Xiu Li

Multi-camera depth estimation has recently garnered significant attention due to its substantial practical implications in the realm of autonomous driving. In this paper, we delve into the task of self-supervised multi-camera depth estimation and propose an innovative framework, STViT, featuring several noteworthy enhancements: 1) we propose a Spatial-Temporal Transformer to comprehensively exploit both local connectivity and the global context of image features, meanwhile learning enriched spatial-temporal cross-view correlations to recover 3D geometry. 2) to alleviate the severe effect of adverse conditions, e.g., rainy weather and nighttime driving, we introduce a GAN-based Adversarial Geometry Regularization Module (AGR) to further constrain the depth estimation with unpaired normal-condition depth maps and prevent the model from being incorrectly trained. Experiments on the challenging autonomous driving datasets nuScenes and DDAD show that our method achieves state-of-the-art performance.

AAMAS Conference 2024 Conference Paper

Towards Understanding How to Reduce Generalization Gap in Visual Reinforcement Learning

  • Jiafei Lyu
  • Le Wan
  • Xiu Li
  • Zongqing Lu

It is vital to learn a generalizable policy in visual reinforcement learning (RL). Many algorithms are proposed to handle this problem while none of them theoretically show what affects the generalization gap and why their methods work. In this paper, we bridge this issue by theoretically answering the key factors that contribute to the generalization gap when the testing environment has distractors. Our theories indicate that minimizing the representation distance between training and testing environments is the most critical. Our theoretical results are supported by the empirical evidence in the DMControl Generalization Benchmark.

JAIR Journal 2024 Journal Article

Understanding What Affects the Generalization Gap in Visual Reinforcement Learning: Theory and Empirical Evidence

  • Jiafei Lyu
  • Le Wan
  • Xiu Li
  • Zongqing Lu

Recently, there are many efforts attempting to learn useful policies for continuous control in visual reinforcement learning (RL). In this scenario, it is important to learn a generalizable policy, as the testing environment may differ from the training environment, e.g., there exist distractors during deployment. Many practical algorithms are proposed to handle this problem. However, to the best of our knowledge, none of them provide a theoretical understanding of what affects the generalization gap and why their proposed methods work. In this paper, we bridge this issue by theoretically answering the key factors that contribute to the generalization gap when the testing environment has distractors. Our theories indicate that minimizing the representation distance between training and testing environments, which aligns with human intuition, is the most critical for the benefit of reducing the generalization gap. Our theoretical results are supported by the empirical evidence in the DMControl Generalization Benchmark (DMC-GB).

AAAI Conference 2023 Conference Paper

Adversarial Alignment for Source Free Object Detection

  • Qiaosong Chu
  • Shuyan Li
  • Guangyi Chen
  • Kai Li
  • Xiu Li

Source-free object detection (SFOD) aims to transfer a detector pre-trained on a label-rich source domain to an unlabeled target domain without seeing source data. While most existing SFOD methods generate pseudo labels via a source-pretrained model to guide training, these pseudo labels usually contain considerable noise due to heavy domain discrepancy. In order to obtain better pseudo supervision, we divide the target domain into source-similar and source-dissimilar parts and align them in the feature space by adversarial learning. Specifically, we design a detection variance-based criterion to divide the target domain. This criterion is motivated by a finding that larger detection variances denote higher recall and larger similarity to the source domain. Then we incorporate an adversarial module into a mean teacher framework to make the feature spaces of these two subsets indistinguishable. Extensive experiments on multiple cross-domain object detection datasets demonstrate that our proposed method consistently outperforms the compared SFOD methods. Our implementation is available at https://github.com/ChuQiaosong.

AAMAS Conference 2023 Conference Paper

Benchmarking Robustness and Generalization in Multi-Agent Systems: A Case Study on Neural MMO

  • Yangkun Chen
  • Joseph Suarez
  • Junjie Zhang
  • Chenghui Yu
  • Bo Wu
  • Hanmo Chen
  • Hengman Zhu
  • Rui Du

We present the results of the second Neural MMO challenge, hosted at IJCAI 2022, which received 1600+ submissions. This competition targets robustness and generalization in multi-agent systems: participants train teams of agents to complete a multi-task objective against opponents not seen during training. We summarize the competition design and results and suggest that, considering our work as a case study, competitions are an effective approach to solving hard problems and establishing a solid benchmark for algorithms. We will open-source our benchmark including the environment wrapper, baselines, a visualization tool, and selected policies for further research.

AAAI Conference 2023 Conference Paper

Data-Efficient Image Quality Assessment with Attention-Panel Decoder

  • Guanyi Qin
  • Runze Hu
  • Yutao Liu
  • Xiawu Zheng
  • Haotian Liu
  • Xiu Li
  • Yan Zhang

Blind Image Quality Assessment (BIQA) is a fundamental task in computer vision, which however remains unresolved due to the complex distortion conditions and diversified image contents. To confront this challenge, we propose in this paper a novel BIQA pipeline based on the Transformer architecture, which achieves an efficient quality-aware feature representation with much less data. More specifically, we consider the traditional fine-tuning in BIQA as an interpretation of the pre-trained model. In this way, we further introduce a Transformer decoder to refine the perceptual information of the CLS token from different perspectives. This enables our model to establish the quality-aware feature manifold efficiently while attaining a strong generalization capability. Meanwhile, inspired by the subjective evaluation behaviors of humans, we introduce a novel attention panel mechanism, which improves the model performance and reduces the prediction uncertainty simultaneously. The proposed BIQA method maintains a light-weight design with only one layer of the decoder, yet extensive experiments on eight standard BIQA datasets (both synthetic and authentic) demonstrate its superior performance to the state-of-the-art BIQA methods, i.e., achieving the SRCC values of 0.875 (vs. 0.859 in LIVEC) and 0.980 (vs. 0.969 in LIVE). Checkpoints, logs and code will be available at https://github.com/narthchin/DEIQT.

NeurIPS Conference 2023 Conference Paper

GPT4Tools: Teaching Large Language Model to Use Tools via Self-instruction

  • Rui Yang
  • Lin Song
  • Yanwei Li
  • Sijie Zhao
  • Yixiao Ge
  • Xiu Li
  • Ying Shan

This paper aims to efficiently enable Large Language Models (LLMs) to use multi-modal tools. The advanced proprietary LLMs, such as ChatGPT and GPT-4, have shown great potential for tool usage through sophisticated prompt engineering. Nevertheless, these models typically rely on prohibitive computational costs and publicly inaccessible data. To address these challenges, we propose the GPT4Tools based on self-instruct to enable open-source LLMs, such as LLaMA and OPT, to use tools. It generates an instruction-following dataset by prompting an advanced teacher with various multi-modal contexts. By using the Low-Rank Adaptation (LoRA) optimization, our approach facilitates the open-source LLMs to solve a range of visual problems, including visual comprehension and image generation. Moreover, we provide a benchmark to evaluate the ability of LLMs to use tools, which is performed in both zero-shot and fine-tuning ways. Extensive experiments demonstrate the effectiveness of our method on various language models, which not only significantly improves the accuracy of invoking seen tools, but also enables the zero-shot capacity for unseen tools.

NeurIPS Conference 2023 Conference Paper

MeGraph: Capturing Long-Range Interactions by Alternating Local and Hierarchical Aggregation on Multi-Scaled Graph Hierarchy

  • Honghua Dong
  • Jiawei Xu
  • Yu Yang
  • Rui Zhao
  • Shiwen Wu
  • Chun Yuan
  • Xiu Li
  • Chris J. Maddison

Graph neural networks, which typically exchange information between local neighbors, often struggle to capture long-range interactions (LRIs) within the graph. Building a graph hierarchy via graph pooling methods is a promising approach to address this challenge; however, hierarchical information propagation cannot entirely take over the role of local information aggregation. To balance locality and hierarchy, we integrate the local and hierarchical structures, represented by intra- and inter-graphs respectively, of a multi-scale graph hierarchy into a single mega graph. Our proposed MeGraph model consists of multiple layers alternating between local and hierarchical information aggregation on the mega graph. Each layer first performs local-aware message-passing on graphs of varied scales via the intra-graph edges, then fuses information across the entire hierarchy along the bidirectional pathways formed by inter-graph edges. By repeating this fusion process, local and hierarchical information could intertwine and complement each other. To evaluate our model, we establish a new Graph Theory Benchmark designed to assess LRI capture ability, in which MeGraph demonstrates dominant performance. Furthermore, MeGraph exhibits superior or equivalent performance to state-of-the-art models on the Long Range Graph Benchmark. The experimental results on commonly adopted real-world datasets further demonstrate the broad applicability of MeGraph.

TMLR Journal 2023 Journal Article

Replay-enhanced Continual Reinforcement Learning

  • Tiantian Zhang
  • Kevin Zehua Shen
  • Zichuan Lin
  • Bo Yuan
  • Xueqian Wang
  • Xiu Li
  • Deheng Ye

Replaying past experiences has proven to be a highly effective approach for averting catastrophic forgetting in supervised continual learning. However, some crucial factors are still largely ignored, making it vulnerable to serious failure, when used as a solution to forgetting in continual reinforcement learning, even in the context of perfect memory where all data of previous tasks are accessible in the current task. On the one hand, since most reinforcement learning algorithms are not invariant to the reward scale, the previously well-learned tasks (with high rewards) may appear to be more salient to the current learning process than the current task (with small initial rewards). This causes the agent to concentrate on those salient tasks at the expense of generality on the current task. On the other hand, offline learning on replayed tasks while learning a new task may induce a distributional shift between the dataset and the learned policy on old tasks, resulting in forgetting. In this paper, we introduce RECALL, a replay-enhanced method that greatly improves the plasticity of existing replay-based methods on new tasks while effectively avoiding the recurrence of catastrophic forgetting in continual reinforcement learning. RECALL leverages adaptive normalization on approximate targets and policy distillation on old tasks to enhance generality and stability, respectively. Extensive experiments on the Continual World benchmark show that RECALL performs significantly better than purely perfect memory replay, and achieves comparable or better overall performance against state-of-the-art continual learning methods.

NeurIPS Conference 2023 Conference Paper

SOC: Semantic-Assisted Object Cluster for Referring Video Object Segmentation

  • Zhuoyan Luo
  • Yicheng Xiao
  • Yong Liu
  • Shuyan Li
  • Yitong Wang
  • Yansong Tang
  • Xiu Li
  • Yujiu Yang

This paper studies referring video object segmentation (RVOS) by boosting video-level visual-linguistic alignment. Recent approaches model the RVOS task as a sequence prediction problem and perform multi-modal interaction as well as segmentation for each frame separately. However, the lack of a global view of video content leads to difficulties in effectively utilizing inter-frame relationships and understanding textual descriptions of object temporal variations. To address this issue, we propose Semantic-assisted Object Cluster (SOC), which aggregates video content and textual guidance for unified temporal modeling and cross-modal alignment. By associating a group of frame-level object embeddings with language tokens, SOC facilitates joint space learning across modalities and time steps. Moreover, we present multi-modal contrastive supervision to help construct well-aligned joint space at the video level. We conduct extensive experiments on popular RVOS benchmarks, and our method outperforms state-of-the-art competitors on all benchmarks by a remarkable margin. Besides, the emphasis on temporal coherence enhances the segmentation stability and adaptability of our method in processing text expressions with temporal variations. Code is available at https://github.com/RobertLuo1/NeurIPS2023_SOC.

NeurIPS Conference 2023 Conference Paper

Weakly-Supervised Concealed Object Segmentation with SAM-based Pseudo Labeling and Multi-scale Feature Grouping

  • Chunming He
  • Kai Li
  • Yachao Zhang
  • Guoxia Xu
  • Longxiang Tang
  • Yulun Zhang
  • Zhenhua Guo
  • Xiu Li

Weakly-Supervised Concealed Object Segmentation (WSCOS) aims to segment objects well blended with surrounding environments using sparsely-annotated data for model training. It remains a challenging task since (1) it is hard to distinguish concealed objects from the background due to the intrinsic similarity and (2) the sparsely-annotated training data only provide weak supervision for model learning. In this paper, we propose a new WSCOS method to address these two challenges. To tackle the intrinsic similarity challenge, we design a multi-scale feature grouping module that first groups features at different granularities and then aggregates these grouping results. By grouping similar features together, it encourages segmentation coherence, helping obtain complete segmentation results for both single and multiple-object images. For the weak supervision challenge, we utilize the recently-proposed vision foundation model, "Segment Anything Model (SAM)", and use the provided sparse annotations as prompts to generate segmentation masks, which are used to train the model. To alleviate the impact of low-quality segmentation masks, we further propose a series of strategies, including multi-augmentation result ensemble, entropy-based pixel-level weighting, and entropy-based image-level selection. These strategies help provide more reliable supervision to train the segmentation model. We verify the effectiveness of our method on various WSCOS tasks, and experiments demonstrate that our method achieves state-of-the-art performance on these tasks.

NeurIPS Conference 2022 Conference Paper

Double Check Your State Before Trusting It: Confidence-Aware Bidirectional Offline Model-Based Imagination

  • Jiafei Lyu
  • Xiu Li
  • Zongqing Lu

The learned policy of model-free offline reinforcement learning (RL) methods is often constrained to stay within the support of datasets to avoid possible dangerous out-of-distribution actions or states, making it challenging to handle out-of-support region. Model-based RL methods offer a richer dataset and benefit generalization by generating imaginary trajectories with either trained forward or reverse dynamics model. However, the imagined transitions may be inaccurate, thus downgrading the performance of the underlying offline RL method. In this paper, we propose to augment the offline dataset by using trained bidirectional dynamics models and rollout policies with double check. We introduce conservatism by trusting samples that the forward model and backward model agree on. Our method, confidence-aware bidirectional offline model-based imagination, generates reliable samples and can be combined with any model-free offline RL method. Experimental results on the D4RL benchmarks demonstrate that our method significantly boosts the performance of existing model-free offline RL algorithms and achieves competitive or better scores against baseline methods.
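The "double check" idea in this abstract, keeping only imagined samples on which the forward and backward models agree, can be sketched on a toy 1-D problem. Everything concrete here is an assumption for illustration: the two hand-written models stand in for learned dynamics models, and the round-trip reconstruction error with threshold `tol` is one possible agreement criterion, not necessarily the one used in the paper.

```python
def forward_model(s, a):
    # Hypothetical learned forward dynamics: accurate for s <= 5, biased beyond,
    # mimicking a model that is unreliable outside the data support.
    return s + a + (1.0 if s > 5 else 0.0)

def backward_model(s_next, a):
    # Hypothetical learned reverse dynamics: predicts the previous state.
    return s_next - a

def imagine_with_double_check(starts, action, tol=0.5):
    # Roll the forward model one step, map the result back with the backward
    # model, and keep the imagined transition only if the two models agree.
    kept = []
    for s in starts:
        s_next = forward_model(s, action)
        s_back = backward_model(s_next, action)
        if abs(s_back - s) <= tol:
            kept.append((s, action, s_next))
    return kept

samples = imagine_with_double_check([0.0, 2.0, 6.0, 8.0], action=1.0)
print(samples)
```

Transitions starting in the region where the toy forward model is biased fail the check and are discarded, so only the agreed-upon samples would be added to the offline dataset.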

AAAI Conference 2022 Conference Paper

Efficient Continuous Control with Double Actors and Regularized Critics

  • Jiafei Lyu
  • Xiaoteng Ma
  • Jiangpeng Yan
  • Xiu Li

How to obtain good value estimation is a critical problem in Reinforcement Learning (RL). Current value estimation methods in continuous control, such as DDPG and TD3, suffer from unnecessary over- or under- estimation. In this paper, we explore the potential of double actors, which has been neglected for a long time, for better value estimation in the continuous setting. First, we interestingly find that double actors improve the exploration ability of the agent. Next, we uncover the bias alleviation property of double actors in handling overestimation with single critic, and underestimation with double critics respectively. Finally, to mitigate the potentially pessimistic value estimate in double critics, we propose to regularize the critics under double actors architecture. Together, we present Double Actors Regularized Critics (DARC) algorithm. Extensive experiments on challenging continuous control benchmarks, MuJoCo and PyBullet, show that DARC significantly outperforms current baselines with higher average return and better sample efficiency.

NeurIPS Conference 2022 Conference Paper

Mildly Conservative Q-Learning for Offline Reinforcement Learning

  • Jiafei Lyu
  • Xiaoteng Ma
  • Xiu Li
  • Zongqing Lu

Offline reinforcement learning (RL) defines the task of learning from a static logged dataset without continually interacting with the environment. The distribution shift between the learned policy and the behavior policy makes it necessary for the value function to stay conservative such that out-of-distribution (OOD) actions will not be severely overestimated. However, existing approaches, penalizing the unseen actions or regularizing with the behavior policy, are too pessimistic, which suppresses the generalization of the value function and hinders the performance improvement. This paper explores mild but enough conservatism for offline learning while not harming generalization. We propose Mildly Conservative Q-learning (MCQ), where OOD actions are actively trained by assigning them proper pseudo Q values. We theoretically show that MCQ induces a policy that behaves at least as well as the behavior policy and no erroneous overestimation will occur for OOD actions. Experimental results on the D4RL benchmarks demonstrate that MCQ achieves remarkable performance compared with prior work. Furthermore, MCQ shows superior generalization ability when transferring from offline to online, and significantly outperforms baselines. Our code is publicly available at https://github.com/dmksjfl/MCQ.

NeurIPS Conference 2022 Conference Paper

OrdinalCLIP: Learning Rank Prompts for Language-Guided Ordinal Regression

  • Wanhua Li
  • Xiaoke Huang
  • Zheng Zhu
  • Yansong Tang
  • Xiu Li
  • Jie Zhou
  • Jiwen Lu

This paper presents a language-powered paradigm for ordinal regression. Existing methods usually treat each rank as a category and employ a set of weights to learn these concepts. These methods are easy to overfit and usually attain unsatisfactory performance as the learned concepts are mainly derived from the training set. Recent large pre-trained vision-language models like CLIP have shown impressive performance on various visual tasks. In this paper, we propose to learn the rank concepts from the rich semantic CLIP latent space. Specifically, we reformulate this task as an image-language matching problem with a contrastive objective, which regards labels as text and obtains a language prototype from a text encoder for each rank. While prompt engineering for CLIP is extremely time-consuming, we propose OrdinalCLIP, a differentiable prompting method for adapting CLIP for ordinal regression. OrdinalCLIP consists of learnable context tokens and learnable rank embeddings. The learnable rank embeddings are constructed by explicitly modeling numerical continuity, resulting in well-ordered, compact language prototypes in the CLIP space. Once learned, we can only save the language prototypes and discard the huge language model, resulting in zero additional computational overhead compared with the linear head counterpart. Experimental results show that our paradigm achieves competitive performance in general ordinal regression tasks, and gains improvements in few-shot and distribution shift settings for age estimation. The code is available at https://github.com/xk-huang/OrdinalCLIP.