Arrow Research search

Author name cluster

Shuai Yang

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

17 papers
2 author rows

Possible papers (17)

TIST Journal 2026 Journal Article

Mutual Information-Guided Style Augmentation for Single Domain Generalization

  • Shuai Yang
  • Zhen Zhang
  • Kui Yu
  • Lichuan Gu
  • Xindong Wu

Single domain generalization aims to develop a robust model trained on a source domain to generalize well on unseen target domains. Recent progress in single domain generalization has focused on expanding the scope of training data through style (e.g., backgrounds) augmentation. However, existing methods find it difficult to generate data with large style shifts due to the lack of precise correlation measures between the generated and original data, and they struggle to effectively capture the consistency between the generated and original data when learning feature representations. In this article, we propose a novel Mutual Information-guided Style Augmentation (MISA) based single domain generalization method. Specifically, MISA incorporates a style diversity module, which uses the matrix-based Rényi's \(\alpha\)-order entropy functionals to compute an approximate mutual information value between the augmented and original data, minimizing it to guide style generator learning. Moreover, MISA combines the merits of random convolution and affine transformation to further improve the texture diversity of the augmented data. Additionally, MISA introduces a representation learning module, which minimizes the approximate mutual information value between the prediction logits of the original sample and its corresponding residual component to capture the consistency between the generated and original data for feature representation optimization. Extensive experiments on five real-world datasets demonstrate the effectiveness of MISA in comparison with state-of-the-art methods.
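
For reference, the matrix-based Rényi's \(\alpha\)-order entropy mentioned above is computed from the eigenvalue spectrum of a normalized kernel Gram matrix, and mutual information follows from the entropies of two Gram matrices and their Hadamard product. The Python sketch below is a minimal, generic illustration of that estimator (the Gaussian kernel, bandwidth `sigma`, and batch shapes are assumptions; it is not the authors' MISA code).

```python
import numpy as np

def gram(x, sigma=1.0):
    """Normalized Gaussian Gram matrix A = K / tr(K) for a batch x of shape (n, d)."""
    sq = np.sum(x**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * x @ x.T
    K = np.exp(-d2 / (2.0 * sigma**2))
    return K / np.trace(K)

def renyi_entropy(A, alpha=2.0):
    """Matrix-based Renyi alpha-order entropy: S_alpha(A) = 1/(1-alpha) * log2(sum_i lambda_i^alpha)."""
    lam = np.clip(np.linalg.eigvalsh(A), 0.0, None)
    return np.log2(np.sum(lam**alpha)) / (1.0 - alpha)

def mutual_information(x, y, alpha=2.0, sigma=1.0):
    """I_alpha(x; y) = S(A) + S(B) - S(A∘B / tr(A∘B)), with ∘ the Hadamard product."""
    A, B = gram(x, sigma), gram(y, sigma)
    AB = A * B
    AB = AB / np.trace(AB)
    return renyi_entropy(A, alpha) + renyi_entropy(B, alpha) - renyi_entropy(AB, alpha)

# Example: approximate MI between original and style-augmented features (random stand-ins here).
rng = np.random.default_rng(0)
orig = rng.normal(size=(64, 128))
aug = orig + 0.5 * rng.normal(size=(64, 128))
print(mutual_information(orig, aug))
```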

AAAI Conference 2026 Conference Paper

Towards Efficient and Robust Manipulation via Multi-Frame Vision-Language-Action Modeling

  • Hao Li
  • Shuai Yang
  • Yilun Chen
  • Xinyi Chen
  • Xiaoda Yang
  • Yang Tian
  • Hanqing Wang
  • Tai Wang

Recent vision-language-action (VLA) models built on pretrained vision-language models (VLMs) have demonstrated strong performance in robotic manipulation. However, these models remain constrained by the single-frame image paradigm and fail to fully leverage the temporal information offered by multi-frame histories, as directly feeding multiple frames into VLM backbones incurs substantial computational overhead and inference latency. We propose CronusVLA, a unified framework that extends single-frame VLA models to the multi-frame paradigm. CronusVLA follows a two-stage process: (1) Single-frame pretraining on large-scale embodied datasets with autoregressive prediction of action tokens, establishing an effective embodied vision-language foundation; (2) Multi-frame post-training, which adapts the prediction of the vision-language backbone from discrete tokens to learnable features, and aggregates historical information via feature chunking. CronusVLA effectively addresses the existing challenges of multi-frame modeling while enhancing performance. To evaluate the robustness under temporal and spatial disturbances, we introduce SimplerEnv-OR, a novel benchmark featuring 24 types of observational disturbances and 120 severity levels. Experiments across three embodiments in simulated and real-world environments demonstrate that CronusVLA achieves leading performance and superior robustness, with a 70.9% success rate on SimplerEnv, a 26.8% improvement over OpenVLA on LIBERO, and the highest robustness score on SimplerEnv-OR, showing the promise of efficient multi-frame adaptation for real-world VLA deployment.
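
As a rough picture of the multi-frame post-training idea, "feature chunking" can be read as keeping a short history of per-frame features and fusing them before action decoding. The sketch below is only an illustrative reading under that assumption (the `FrameFeatureBuffer` name and the mean-pooling fusion are hypothetical, not CronusVLA's actual design).

```python
from collections import deque
import torch

class FrameFeatureBuffer:
    """Keeps the last `history` per-frame vision-language features and fuses them into one chunk."""
    def __init__(self, history=4):
        self.buf = deque(maxlen=history)

    def push(self, feat):
        # feat: (n_tokens, dim) features extracted from the newest observation frame
        self.buf.append(feat)

    def chunk(self):
        stacked = torch.stack(list(self.buf))   # (history, n_tokens, dim)
        return stacked.mean(dim=0)              # simple fusion; a learned aggregator could be used instead

# Usage: push one feature map per observed frame, then decode actions from the fused chunk.
buf = FrameFeatureBuffer(history=4)
for _ in range(4):
    buf.push(torch.randn(256, 1024))
fused = buf.chunk()                             # (256, 1024)
```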

EAAI Journal 2025 Journal Article

Constrained multi-objective optimization assisted by convergence and diversity auxiliary tasks

  • Qianlong Dang
  • Wutao Shang
  • Zhengxin Huang
  • Shuai Yang

In the field of constrained multi-objective optimization, constructing auxiliary tasks can guide the algorithm to achieve efficient search. Different forms of auxiliary tasks have their own advantages, and a reasonable combination can effectively improve the performance of the algorithm. Inspired by this, a Constrained Multi-objective Optimization Evolutionary Algorithm based on Convergence and Diversity auxiliary Tasks (CMOEA-CDT) is proposed. This algorithm achieves efficient search through simultaneous optimization and knowledge transfer of the main task, convergence auxiliary task, and diversity auxiliary task. Specifically, the main task is to find the feasible Pareto front, which improves the global exploration and local exploitation of the algorithm through knowledge transfer from the convergence and diversity auxiliary tasks. In addition, the convergence auxiliary task helps the main task population traverse infeasible obstacles by ignoring constraints to achieve global search. The diversity auxiliary task aims to provide local diversity to the regions around the main task population to exploit promising search regions. The convergence and diversity of the algorithm are significantly improved by knowledge transfer between the convergence auxiliary task, diversity auxiliary task, and main task. CMOEA-CDT is compared with five state-of-the-art constrained multi-objective evolutionary optimization algorithms on 37 benchmark problems and a disc brake engineering design problem. The experimental results indicate that the proposed CMOEA-CDT obtains 19 and 20 best results on the two indicators, respectively, and achieves the best performance on the disc brake engineering design problem.

NeurIPS Conference 2025 Conference Paper

Imagine360: Immersive 360 Video Generation from Perspective Anchor

  • Jing Tan
  • Shuai Yang
  • Tong Wu
  • Jingwen He
  • Yuwei Guo
  • Ziwei Liu
  • Dahua Lin

$360^\circ$ videos offer a hyper-immersive experience that allows the viewers to explore a dynamic scene from full 360 degrees. To achieve more accessible and personalized content creation in $360^\circ$ video format, we seek to lift standard perspective videos into $360^\circ$ equirectangular videos. To this end, we introduce **Imagine360**, the first perspective-to-$360^\circ$ video generation framework that creates high-quality $360^\circ$ videos with rich and diverse motion patterns from video anchors. Imagine360 learns fine-grained spherical visual and motion patterns from limited $360^\circ$ video data with several key designs. **1)** Firstly we adopt the dual-branch design, including a perspective and a panorama video denoising branch to provide local and global constraints for $360^\circ$ video generation, with motion module and spatial LoRA layers fine-tuned on $360^\circ$ videos. **2)** Additionally, an antipodal mask is devised to capture long-range motion dependencies, enhancing the reversed camera motion between antipodal pixels across hemispheres. **3)** To handle diverse perspective video inputs, we propose rotation-aware designs that adapt to varying video masking due to changing camera poses across frames. **4)** Lastly, we introduce a new 360 video dataset featuring 10K high-quality, trimmed 360 video clips with structured motion to facilitate training. Extensive experiments show Imagine360 achieves superior graphics quality and motion coherence with our curated dataset among state-of-the-art $360^\circ$ video generation methods. We believe Imagine360 holds promise for advancing personalized, immersive $360^\circ$ video creation.
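
For context on the antipodal mask, the antipode of a viewing direction with longitude \(\lambda\) and latitude \(\varphi\) is \((\lambda + 180^\circ, -\varphi)\); on an equirectangular grid this is a half-width column shift plus a row flip. The snippet below illustrates only that geometric correspondence (it is not Imagine360's mask implementation).

```python
import numpy as np

def antipodal_indices(h, w):
    """For an equirectangular image of shape (h, w), return the (row, col) of the
    antipodal pixel of every pixel: latitude flips, longitude shifts by half the width."""
    rows = np.arange(h)[:, None].repeat(w, axis=1)   # latitude index per pixel
    cols = np.arange(w)[None, :].repeat(h, axis=0)   # longitude index per pixel
    anti_rows = (h - 1) - rows                       # phi -> -phi
    anti_cols = (cols + w // 2) % w                  # lambda -> lambda + 180 degrees
    return anti_rows, anti_cols

# Example: gather the antipodal counterpart of every pixel in a panorama frame.
frame = np.random.rand(256, 512, 3)
ar, ac = antipodal_indices(256, 512)
antipodal_frame = frame[ar, ac]   # same shape; position (i, j) holds the antipode of pixel (i, j)
```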

NeurIPS Conference 2025 Conference Paper

SGAR: Structural Generative Augmentation for 3D Human Motion Retrieval

  • Jiahang Zhang
  • Lilang Lin
  • Shuai Yang
  • Jiaying Liu

3D human motion-text retrieval is essential for accurate motion understanding, targeted at cross-modal alignment learning. Existing methods typically align the global motion-text concepts directly, suffering from sub-optimal generalization due to the uncertainty of correspondence learning between multiple motion concepts coupled in a single motion/text sequence. Therefore, we study the explicit fine-grained concept decomposition for alignment learning and present a novel framework, Structural Generative Augmentation for 3D Human Motion Retrieval (SGAR), to enable generation-augmented retrieval. Specifically, relying on the strong priors of existing large language model (LLM) assets, we effectively decompose human motions structurally into subtler semantic units, i.e., body parts, for fine-grained motion modeling. Based on this, we develop part-mixture learning to better decouple the local motion concept learning, boosting part-level alignment. Moreover, a directional relation alignment strategy exploiting the correspondence between full-body and part motions is incorporated to regularize the feature manifold for better consistency. Extensive experiments on three benchmarks, including motion-text retrieval as well as recognition and generation applications, demonstrate the superior performance and promising transferability of our method.

IJCAI Conference 2025 Conference Paper

State Revisit and Re-explore: Bridging Sim-to-Real Gaps in Offline-and-Online Reinforcement Learning with An Imperfect Simulator

  • Xingyu Chen
  • Jiayi Xie
  • Zhijian Xu
  • Ruixun Liu
  • Shuai Yang
  • Zeyang Liu
  • Lipeng Wan
  • Xuguang Lan

In reinforcement learning (RL) based robot skill acquisition, a high-fidelity simulator is usually indispensable but unattainable since the real environment dynamics are difficult to model, which leads to severe sim-to-real gaps. Existing methods solve this problem by combining offline and online RL to jointly learn transferable policies from limited offline data and imperfect simulators. However, due to the unrestricted exploration in the imperfect simulator, hybrid offline-and-online RL methods inevitably suffer from low sample efficiency and insufficient state-action space coverage during training. To solve this problem, we propose a State Revisit and Re-exploration (SR2) hybrid offline-and-online RL framework. In particular, the proposed algorithm employs a meta-policy and a sub-policy, where the meta-policy aims to find high-quality states in the offline trajectories for online exploration, and the sub-policy learns the robot skill using mixed offline and online data. By introducing the state revisit and re-exploration mechanism, our approach efficiently improves performance on a set of sim-to-real robotic tasks. Through extensive simulation and real-world tasks, we demonstrate the superior performance of our approach against other state-of-the-art methods.

NeurIPS Conference 2025 Conference Paper

Video World Models with Long-term Spatial Memory

  • Tong Wu
  • Shuai Yang
  • Ryan Po
  • Yinghao Xu
  • Ziwei Liu
  • Dahua Lin
  • Gordon Wetzstein

Emerging world models autoregressively generate video frames in response to actions, such as camera movements and text prompts, among other control signals. Due to limited temporal context window sizes, these models often struggle to maintain scene consistency during revisits, leading to severe forgetting of previously generated environments. Inspired by the mechanisms of human memory, we introduce a novel framework for enhancing the long-term consistency of video world models through a geometry-grounded long-term spatial memory. Our framework includes mechanisms to store and retrieve information from the long-term spatial memory, and we curate custom datasets to train and evaluate world models with explicitly stored 3D memory mechanisms. Our evaluations show improved quality, consistency, and context length compared to relevant baselines, paving the way towards long-term consistent world generation.

NeurIPS Conference 2025 Conference Paper

WorldMem: Long-term Consistent World Simulation with Memory

  • Zeqi Xiao
  • Yushi Lan
  • Yifan Zhou
  • Wenqi Ouyang
  • Shuai Yang
  • Yanhong Zeng
  • Xingang Pan

World simulation has gained increasing popularity due to its ability to model virtual environments and predict the consequences of actions. However, the limited temporal context window often leads to failures in maintaining long-term consistency, particularly in preserving 3D spatial consistency. In this work, we present WorldMem, a framework that enhances scene generation with a memory bank consisting of memory units that store memory frames and states (e.g., poses and timestamps). By employing state-aware memory attention that effectively extracts relevant information from these memory frames based on their states, our method is capable of accurately reconstructing previously observed scenes, even under significant viewpoint or temporal gaps. Furthermore, by incorporating timestamps into the states, our framework not only models a static world but also captures its dynamic evolution over time, enabling both perception and interaction within the simulated world. Extensive experiments in both virtual and real scenarios validate the effectiveness of our approach.
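
A minimal sketch of the memory-bank pattern described above: store per-frame features together with poses and timestamps, and retrieve the nearest units by pose before attending to them. The retrieval metric, top-k selection, and class layout here are illustrative assumptions, not WorldMem's implementation.

```python
import numpy as np

class MemoryBank:
    """Stores (frame_feature, pose, timestamp) memory units and retrieves the k units
    whose stored pose is closest to the current query pose."""
    def __init__(self):
        self.feats, self.poses, self.times = [], [], []

    def add(self, feat, pose, t):
        self.feats.append(feat); self.poses.append(pose); self.times.append(t)

    def retrieve(self, query_pose, k=4):
        poses = np.stack(self.poses)                       # (n, pose_dim)
        dist = np.linalg.norm(poses - query_pose, axis=1)  # simple Euclidean pose distance
        idx = np.argsort(dist)[:k]
        return [self.feats[i] for i in idx], [self.times[i] for i in idx]

# Usage: retrieved features and timestamps would then condition generation, e.g. via cross-attention.
bank = MemoryBank()
for t in range(10):
    bank.add(np.random.rand(77, 768), np.random.rand(6), t)
mem_feats, mem_times = bank.retrieve(query_pose=np.random.rand(6), k=4)
```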

ICLR Conference 2024 Conference Paper

Denoising Diffusion Step-aware Models

  • Shuai Yang
  • Yukang Chen
  • Luozhou Wang
  • Shu Liu 0005
  • Ying-Cong Chen

Denoising Diffusion Probabilistic Models (DDPMs) have garnered popularity for data generation across various domains. However, a significant bottleneck is the necessity for whole-network computation during every step of the generative process, leading to high computational overheads. This paper presents a novel framework, Denoising Diffusion Step-aware Models (DDSM), to address this challenge. Unlike conventional approaches, DDSM employs a spectrum of neural networks whose sizes are adapted according to the importance of each generative step, as determined through evolutionary search. This step-wise network variation effectively circumvents redundant computational efforts, particularly in less critical steps, thereby enhancing the efficiency of the diffusion model. Furthermore, the step-aware design can be seamlessly integrated with other efficiency-geared diffusion models such as DDIMs and latent diffusion, thus broadening the scope of computational savings. Empirical evaluations demonstrate that DDSM achieves computational savings of 49% for CIFAR-10, 61% for CelebA-HQ, 59% for LSUN-bedroom, 71% for AFHQ, and 76% for ImageNet, all without compromising the generation quality. Our code and models are available at https://github.com/EnVision-Research/DDSM.
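
The step-aware idea amounts to choosing a different denoiser size at each reverse-diffusion step. The sketch below shows such a sampling loop under simplifying assumptions (the `nets` dictionary and `size_schedule` are hypothetical placeholders, and the evolutionary search that would produce the schedule is omitted); it is not the released DDSM code.

```python
import torch

def step_aware_sampling(nets, size_schedule, x_T, betas):
    """Reverse diffusion where each step t uses the denoiser size assigned by the schedule.

    nets          -- dict mapping a size tag (e.g. 'small', 'large') to a noise-prediction network
    size_schedule -- list of length T giving the size tag for each timestep
    betas         -- noise schedule, tensor of shape (T,)
    """
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)
    x = x_T
    for t in reversed(range(len(betas))):
        net = nets[size_schedule[t]]                      # pick the denoiser for this step
        eps = net(x, torch.tensor([t]))                   # predicted noise
        coef = betas[t] / torch.sqrt(1.0 - alpha_bars[t]) # standard DDPM posterior mean
        x = (x - coef * eps) / torch.sqrt(alphas[t])
        if t > 0:                                         # no noise is added at the final step
            x = x + torch.sqrt(betas[t]) * torch.randn_like(x)
    return x

# Usage sketch: a small network for less critical steps, a large one for the rest, e.g.
# nets = {"small": small_unet, "large": large_unet}
# size_schedule = ["large"] * 500 + ["small"] * 500
```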

EAAI Journal 2024 Journal Article

Dynamic region-aware transformer backbone network for visual tracking

  • Jun Wang
  • Shuai Yang
  • Yuanyun Wang

In visual tracking, the Transformer architecture is widely used because it can capture the global dependencies of sequence data without inductive bias. However, the attention mechanism of the Transformer brings very high computational complexity and memory occupancy, so the tracking task cannot meet real-time requirements. In this paper, we explore a sparse region-aware attention mechanism. The sparse attention mechanism retains the regions with semantic relevance and performs fine-grained attention calculation in these regions. In the region-aware attention mechanism, a DropKey technique is introduced to reduce model over-fitting and improve the generalization ability of the model. Using region-aware attention as the basic building block, we design a dynamic region-aware Transformer backbone for visual tracking. This backbone network can effectively reduce the computational complexity while exploring global context dependencies. Based on the region-aware Transformer backbone network, this paper proposes a dynamic region-aware Transformer backbone visual tracking algorithm, which uses an optimization-based model predictor to fully fuse object appearance and background information, so as to achieve more robust object tracking. The proposed tracker is trained in an end-to-end manner and experimentally evaluated on eight tracking benchmarks. Experimental results show that the algorithm has good tracking performance, especially in unmanned aerial vehicle (UAV) tracking, where our proposed tracker achieves an area under curve (AUC) score of 66.5% on the UAV123 dataset. Code is available at https://github.com/YSGFF/RTDiMP.
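
For reference, DropKey regularizes attention by randomly masking keys in the score matrix before the softmax during training, instead of dropping attention weights afterwards. The snippet below is a generic illustration of that step (not the tracker's region-aware module; shapes and the drop ratio are assumptions).

```python
import torch
import torch.nn.functional as F

def attention_with_dropkey(q, k, v, drop_ratio=0.1, training=True):
    """Scaled dot-product attention with DropKey: random keys are masked (set to -inf)
    in the score matrix before softmax, which regularizes the attention distribution."""
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d**0.5           # (..., n_q, n_k)
    if training and drop_ratio > 0:
        mask = torch.rand_like(scores) < drop_ratio     # independently drop keys per query
        scores = scores.masked_fill(mask, float("-inf"))
    attn = F.softmax(scores, dim=-1)
    return attn @ v

# Example with batch of 2, 8 heads, 64 queries, 100 keys, head dim 32.
q = torch.randn(2, 8, 64, 32); k = torch.randn(2, 8, 100, 32); v = torch.randn(2, 8, 100, 32)
out = attention_with_dropkey(q, k, v)   # (2, 8, 64, 32)
```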

ICLR Conference 2024 Conference Paper

Forward Learning of Graph Neural Networks

  • Namyong Park 0001
  • Xing Wang
  • Antoine Simoulin
  • Shuai Yang
  • Grey Yang
  • Ryan A. Rossi
  • Puja Trivedi
  • Nesreen K. Ahmed

Graph neural networks (GNNs) have achieved remarkable success across a wide range of applications, such as recommendation, drug discovery, and question answering. Behind the success of GNNs lies the backpropagation (BP) algorithm, which is the de facto standard for training deep neural networks (NNs). However, despite its effectiveness, BP imposes several constraints, which are not only biologically implausible, but also limit the scalability, parallelism, and flexibility in learning NNs. Examples of such constraints include storage of neural activities computed in the forward pass for use in the subsequent backward pass, and the dependence of parameter updates on non-local signals. To address these limitations, the forward-forward algorithm (FF) was recently proposed as an alternative to BP in the image classification domain, which trains NNs by performing two forward passes over positive and negative data. Inspired by this advance, we propose ForwardGNN in this work, a new forward learning procedure for GNNs, which avoids the constraints imposed by BP via an effective layer-wise local forward training. ForwardGNN extends the original FF to deal with graph data and GNNs, and makes it possible to operate without generating negative inputs (hence no longer forward-forward). Further, ForwardGNN enables each layer to learn from both the bottom-up and top-down signals without relying on the backpropagation of errors. Extensive experiments on real-world datasets show the effectiveness and generality of the proposed forward graph learning framework. We release our code at https://github.com/facebookresearch/forwardgnn.
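
The layer-wise local training pattern can be sketched as follows: each layer owns a local readout and optimizer, and its output is detached before feeding the next layer, so no error signal is backpropagated across layers. This is a generic sketch under those assumptions (a plain A_hat @ X @ W propagation layer stands in for the GNN), not the released ForwardGNN code.

```python
import torch
import torch.nn as nn

class LocalGraphLayer(nn.Module):
    """One message-passing layer (A_hat @ X @ W) paired with its own local classifier head."""
    def __init__(self, in_dim, out_dim, n_classes):
        super().__init__()
        self.lin = nn.Linear(in_dim, out_dim)
        self.head = nn.Linear(out_dim, n_classes)   # readout used only for this layer's local loss

    def forward(self, a_hat, x):
        return torch.relu(a_hat @ self.lin(x))

def train_layerwise(layers, a_hat, x, y, train_mask, epochs=100, lr=1e-2):
    """Train each layer greedily with a local loss; no gradients cross layer boundaries."""
    h = x
    for layer in layers:
        opt = torch.optim.Adam(layer.parameters(), lr=lr)
        for _ in range(epochs):
            opt.zero_grad()
            out = layer(a_hat, h)
            loss = nn.functional.cross_entropy(layer.head(out)[train_mask], y[train_mask])
            loss.backward()                      # gradients stay inside this layer
            opt.step()
        h = layer(a_hat, h).detach()             # detached features feed the next layer
    return h
```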

NeurIPS Conference 2024 Conference Paper

MMScan: A Multi-Modal 3D Scene Dataset with Hierarchical Grounded Language Annotations

  • Ruiyuan Lyu
  • Jingli Lin
  • Tai Wang
  • Shuai Yang
  • Xiaohan Mao
  • Yilun Chen
  • Runsen Xu
  • Haifeng Huang

With the emergence of LLMs and their integration with other data modalities, multi-modal 3D perception attracts more attention due to its connectivity to the physical world and is making rapid progress. However, limited by existing datasets, previous works mainly focus on understanding object properties or inter-object spatial relationships in a 3D scene. To tackle this problem, this paper builds the largest-ever multi-modal 3D scene dataset and benchmark with hierarchical grounded language annotations, MMScan. It is constructed based on a top-down logic, from region to object level, from a single target to inter-target relationships, covering holistic aspects of spatial and attribute understanding. The overall pipeline incorporates powerful VLMs via carefully designed prompts to initialize the annotations efficiently and further involves human correction in the loop to ensure the annotations are natural, correct, and comprehensive. Built upon existing 3D scanning data, the resulting multi-modal 3D dataset encompasses 1.4M meta-annotated captions on 109k objects and 7.7k regions as well as over 3.04M diverse samples for 3D visual grounding and question-answering benchmarks. We evaluate representative baselines on our benchmarks, analyze their capabilities in different aspects, and showcase the key problems to be addressed in the future. Furthermore, we use this high-quality dataset to train state-of-the-art 3D visual grounding models and LLMs and obtain remarkable performance improvement both on existing benchmarks and in-the-wild evaluation.

NeurIPS Conference 2024 Conference Paper

Unified Generative and Discriminative Training for Multi-modal Large Language Models

  • Wei Chow
  • Juncheng Li
  • Qifan Yu
  • Kaihang Pan
  • Hao Fei
  • Zhiqi Ge
  • Shuai Yang
  • Siliang Tang

In recent times, Vision-Language Models (VLMs) have been trained under two predominant paradigms. Generative training has enabled Multimodal Large Language Models (MLLMs) to tackle various complex tasks, yet issues such as hallucinations and weak object discrimination persist. Discriminative training, exemplified by models like CLIP, excels in zero-shot image-text classification and retrieval, yet struggles with complex scenarios requiring fine-grained semantic differentiation. This paper addresses these challenges by proposing a unified approach that integrates the strengths of both paradigms. Considering interleaved image-text sequences as the general format of input samples, we introduce a structure-induced training strategy that imposes semantic relationships between input samples and the MLLM’s hidden state. This approach enhances the MLLM’s ability to capture global semantics and distinguish fine-grained semantics. By leveraging dynamic sequence alignment within the Dynamic Time Warping framework and integrating a novel kernel for fine-grained semantic differentiation, our method effectively balances generative and discriminative tasks. Extensive experiments demonstrate the effectiveness of our approach, achieving state-of-the-art results in multiple generative tasks, especially those requiring cognitive and discrimination abilities. Additionally, our method surpasses discriminative benchmarks in interleaved and fine-grained retrieval tasks. By employing a retrieval-augmented generation strategy, our approach further enhances performance in some generative tasks within one model, offering a promising direction for future research in vision-language modeling.

NeurIPS Conference 2024 Conference Paper

Video Diffusion Models are Training-free Motion Interpreter and Controller

  • Zeqi Xiao
  • Yifan Zhou
  • Shuai Yang
  • Xingang Pan

Video generation primarily aims to model authentic and customized motion across frames, making understanding and controlling the motion a crucial topic. Most diffusion-based studies on video motion focus on motion customization with training-based paradigms, which, however, demand substantial training resources and necessitate retraining for diverse models. Crucially, these approaches do not explore how video diffusion models encode cross-frame motion information in their features, lacking interpretability and transparency in their effectiveness. To answer this question, this paper introduces a novel perspective to understand, localize, and manipulate motion-aware features in video diffusion models. Through analysis using Principal Component Analysis (PCA), our work discloses that robust motion-aware features already exist in video diffusion models. We present a new MOtion FeaTure (MOFT) by eliminating content correlation information and filtering motion channels. MOFT provides a distinct set of benefits, including the ability to encode comprehensive motion information with clear interpretability, extraction without the need for training, and generalizability across diverse architectures. Leveraging MOFT, we propose a novel training-free video motion control framework. Our method demonstrates competitive performance in generating natural and faithful motion, providing architecture-agnostic insights and applicability in a variety of downstream tasks.
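
One way to picture the MOFT construction is to treat the temporal mean of the diffusion features as the content component, analyze the residual with PCA, and keep the dominant channel directions. The sketch below follows that reading for illustration only; the content-removal choice and tensor shapes are assumptions, not the authors' extraction procedure.

```python
import numpy as np

def motion_feature(feats):
    """feats: diffusion features for one video clip, shape (T, C, H, W).

    Subtract the content component shared across frames (the temporal mean), then
    inspect the principal channel directions of the motion-carrying residual."""
    content = feats.mean(axis=0, keepdims=True)          # (1, C, H, W): frame-invariant content
    residual = feats - content                            # residual varying across frames
    t, c, h, w = residual.shape
    x = residual.transpose(0, 2, 3, 1).reshape(-1, c)     # (T*H*W, C) samples over channels
    x = x - x.mean(axis=0, keepdims=True)
    _, s, vt = np.linalg.svd(x, full_matrices=False)      # principal channel directions
    return residual, vt[:8], s[:8]                        # residual plus the top-8 components

# Example on random stand-in features.
feats = np.random.rand(16, 320, 32, 32).astype(np.float32)
residual, components, strengths = motion_feature(feats)
```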

TIST Journal 2023 Journal Article

Causal Feature Selection in the Presence of Sample Selection Bias

  • Shuai Yang
  • Xianjie Guo
  • Kui Yu
  • Xiaoling Huang
  • Tingting Jiang
  • Jin He
  • Lichuan Gu

Almost all existing causal feature selection methods are proposed without considering the problem of sample selection bias. However, in practice, as the data-gathering process cannot be fully controlled, sample selection bias often occurs, leading to spurious correlations between features and the class variable, which seriously deteriorates the performance of those existing methods. In this article, we study the problem of causal feature selection under sample selection bias and propose a novel Progressive Causal Feature Selection (PCFS) algorithm which has three phases. First, PCFS learns sample weights to balance the treated-group and control-group distributions corresponding to each feature for removing spurious correlations. Second, based on the sample weights, PCFS uses a weighted cross-entropy model to estimate the causal effect of each feature and removes some irrelevant features from the confounder set. Third, PCFS progressively repeats the first two phases to remove more irrelevant features and finally obtains a causal feature set. Experiments on synthetic and real-world datasets have validated the effectiveness of PCFS in comparison with several state-of-the-art classical and causal feature selection methods.
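
A rough sketch of the weighting-then-weighted-model step: balance the "treated" and "control" groups defined by a feature, then fit a weighted logistic model and read off an effect score. Inverse-propensity weights are used here as a stand-in for the learned balancing weights, and feature j is assumed binary; this is an illustration, not the PCFS algorithm.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def feature_causal_effect(X, y, j):
    """Estimate a rough effect score of binary feature j on class y after reweighting.

    Balancing weights are approximated by inverse propensity scores, which merely
    stands in for the learned balancing weights described in the abstract."""
    t = X[:, j]                                        # 'treatment': the (binary) feature under test
    others = np.delete(X, j, axis=1)                   # remaining features act as potential confounders
    prop = LogisticRegression(max_iter=1000).fit(others, t).predict_proba(others)[:, 1]
    w = t / np.clip(prop, 1e-3, 1.0) + (1 - t) / np.clip(1.0 - prop, 1e-3, 1.0)
    # Weighted logistic regression of y on feature j; the coefficient serves as an effect score.
    clf = LogisticRegression(max_iter=1000).fit(t.reshape(-1, 1), y, sample_weight=w)
    return clf.coef_[0, 0]
```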

IJCAI Conference 2021 Conference Paper

Instance-Aware Coherent Video Style Transfer for Chinese Ink Wash Painting

  • Hao Liang
  • Shuai Yang
  • Wenjing Wang
  • Jiaying Liu

Recent research has made remarkable achievements in fast video style transfer based on Western paintings. However, due to the inherently different drawing techniques and aesthetic expressions of Chinese ink wash painting, existing methods either achieve poor temporal consistency or fail to transfer the key freehand brushstroke characteristics of Chinese ink wash painting. In this paper, we present a novel video style transfer framework for Chinese ink wash paintings. The two key ideas are a multi-frame fusion for temporal coherence and an instance-aware style transfer. Frame reordering and stylization based on reference frame fusion are proposed to improve temporal consistency. Meanwhile, the proposed method is able to adaptively leave white spaces in the background and to select proper scales to extract features and depict the foreground subject by leveraging instance segmentation. Experimental results demonstrate the superiority of the proposed method over state-of-the-art style transfer methods in terms of both temporal coherence and visual quality. Our project website is available at https://oblivioussy.github.io/InkVideo/.

AAAI Conference 2019 Conference Paper

TET-GAN: Text Effects Transfer via Stylization and Destylization

  • Shuai Yang
  • Jiaying Liu
  • Wenjing Wang
  • Zongming Guo

Text effects transfer technology automatically makes text dramatically more impressive. However, previous style transfer methods either study the model for general style, which cannot handle the highly structured text effects along the glyph, or require manual design of subtle matching criteria for text effects. In this paper, we focus on the use of the powerful representation abilities of deep neural features for text effects transfer. For this purpose, we propose a novel Texture Effects Transfer GAN (TET-GAN), which consists of a stylization subnetwork and a destylization subnetwork. The key idea is to train our network to accomplish both the objective of style transfer and style removal, so that it can learn to disentangle and recombine the content and style features of text effects images. To support the training of our network, we propose a new text effects dataset with as many as 64 professionally designed styles on 837 characters. We show that the disentangled feature representations enable us to transfer or remove all these styles on arbitrary glyphs using one network. Furthermore, the flexible network design empowers TET-GAN to efficiently extend to a new text style via one-shot learning, where only one example is required. We demonstrate the superiority of the proposed method in generating high-quality stylized text over state-of-the-art methods.