Arrow Research search

Author name cluster

Shaobo Min

Papers that may be associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity-disambiguation profile.

8 papers
2 author rows

Possible papers (8)

AAAI Conference 2026 Conference Paper

S²Flow: Towards Fast and Authentic Training-Free High-Resolution Video Generation

  • Chaoqun Wang
  • Shaobo Min
  • Xu Yang

Rectified flow models have shown strong potential in high-fidelity video generation, yet extending them to high resolution remains challenging due to the high cost of full attention and error accumulation in the ODE-solving process. In this paper, we propose S^2Flow, a training-free framework that enables efficient and authentic high-resolution video generation by jointly exploring Flow-guided Sparse attention and Second-order ODE solution. Specifically, S^2Flow exploits and transfers the semantic and structural information from the low-resolution flow trajectory to guide the high-resolution flow in two aspects. First, S^2Flow dynamically captures the sparse patterns of the spatio-temporal attention maps from low-resolution videos to construct localized 3D windows, enabling efficient window attention in high-resolution inference. This significantly reduces redundant computation while preserving contextual dependencies. Second, S^2Flow adopts a second-order ODE solver based on Taylor expansion, where the high-order derivative is approximated via central difference from the low-resolution flow, facilitating accurate high-resolution denoising. Extensive experiments on the VBench benchmark demonstrate that S^2Flow outperforms prior methods in both visual quality and inference speed, enabling 4x acceleration on 2560x1536 video generation.
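
The second-order update described in the abstract can be pictured with a short sketch. Below is a generic second-order Taylor step for a rectified-flow ODE in which the velocity's time derivative is estimated by a central difference over a precomputed low-resolution trajectory and upsampled to the high-resolution grid. The function names, tensor shapes, and interpolation choice are assumptions for illustration, not the authors' implementation.

```python
# Hypothetical sketch of a second-order (Taylor) ODE step for rectified flow.
# The second-order term is not evaluated at high resolution; it is approximated
# by a central difference over two neighbouring low-resolution velocities.
import torch
import torch.nn.functional as F

def second_order_step(x, t, dt, velocity_fn, v_lowres_prev, v_lowres_next, dt_lowres):
    """x: [B, C, H, W] high-res frame latents; v_lowres_*: [B, C, h, w] velocities
    from the low-resolution trajectory at t - dt_lowres and t + dt_lowres."""
    v = velocity_fn(x, t)                                              # first-order term at high res
    dv_dt_lowres = (v_lowres_next - v_lowres_prev) / (2 * dt_lowres)   # central difference
    dv_dt = F.interpolate(                                             # transfer to the high-res grid
        dv_dt_lowres, size=x.shape[-2:], mode="bilinear", align_corners=False
    )
    # x_{t+dt} ≈ x_t + dt * v(x_t, t) + 0.5 * dt^2 * dv/dt
    return x + dt * v + 0.5 * (dt ** 2) * dv_dt
```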

AAAI Conference 2025 Conference Paper

Infinite-Canvas: Higher-Resolution Video Outpainting with Extensive Content Generation

  • Qihua Chen
  • Yue Ma
  • Hongfa Wang
  • Junkun Yuan
  • Wenzhe Zhao
  • Qi Tian
  • Hongmei Wang
  • Shaobo Min

This paper explores higher-resolution video outpainting with extensive content generation. We point out common issues faced by existing methods when attempting to largely outpaint videos: the generation of low-quality content and limitations imposed by GPU memory. To address these challenges, we propose a diffusion-based method called Infinite-Canvas. It builds upon two core designs. First, instead of employing the common practice of "single-shot" outpainting, we distribute the task across spatial windows and seamlessly merge them. This allows us to outpaint videos of any size and resolution without being constrained by GPU memory. Second, the source video and its relative positional relation are injected into the generation process of each window, which makes the generated spatial layout within each window harmonize with the source video. Coupling these two designs enables us to generate higher-resolution outpainting videos with rich content while keeping spatial and temporal consistency. Infinite-Canvas excels in large-scale video outpainting, e.g., from 512 × 512 to 1152 × 2048 (9×), while producing high-quality and aesthetically pleasing results. It achieves the best quantitative results across various resolution and scale setups. The code is available at https://github.com/mayuelala/FollowYourCanvas.
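
A rough sketch of the window-merge step follows. It only illustrates how independently generated, overlapping windows might be blended back onto a larger canvas with feathered weights so seams are suppressed; the array shapes, weighting scheme, and function name are illustrative assumptions, not the released FollowYourCanvas code.

```python
# Blend overlapping generated windows onto one canvas with feathered weights.
import numpy as np

def merge_windows(windows, coords, out_h, out_w):
    """windows: list of HxWxC arrays; coords: list of (top, left) offsets on the canvas."""
    canvas = np.zeros((out_h, out_w, windows[0].shape[2]), dtype=np.float32)
    weight = np.zeros((out_h, out_w, 1), dtype=np.float32)
    for win, (top, left) in zip(windows, coords):
        h, w = win.shape[:2]
        # Feathered weight: highest in the window centre, tapering toward the edges.
        wy = np.minimum(np.arange(h) + 1, h - np.arange(h))[:, None]
        wx = np.minimum(np.arange(w) + 1, w - np.arange(w))[None, :]
        mask = (wy * wx).astype(np.float32)[..., None]
        canvas[top:top + h, left:left + w] += win * mask
        weight[top:top + h, left:left + w] += mask
    return canvas / np.maximum(weight, 1e-6)   # normalize overlaps by accumulated weight
```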

ICLR Conference 2025 Conference Paper

Towards Multiple Character Image Animation Through Enhancing Implicit Decoupling

  • Jingyun Xue
  • Hongfa Wang
  • Qi Tian 0003
  • Yue Ma 0016
  • Andong Wang
  • Zhiyuan Zhao 0002
  • Shaobo Min
  • Wenzhe Zhao

Controllable character image animation has a wide range of applications. Although existing studies have consistently improved performance, challenges persist in the field of character image animation, particularly concerning stability in complex backgrounds and tasks involving multiple characters. To address these challenges, we propose a novel multi-condition guided framework for character image animation, employing several well-designed input modules to enhance the implicit decoupling capability of the model. First, the optical flow guider calculates the background optical flow map as guidance information, which enables the model to implicitly learn to decouple the background motion into background constants and background momentum during training, and to generate a stable background by setting zero background momentum during inference. Second, the depth order guider calculates the order map of the characters, which transforms the depth information into the positional information of multiple characters. This facilitates the implicit learning of decoupling different characters, especially in accurately separating the occluded body parts of multiple characters. Third, the reference pose map is provided as input to enhance the ability to decouple character texture and pose information in the reference image. Furthermore, to fill the gap in fair evaluation of multi-character image animation, we propose a new benchmark comprising about 4,000 frames. Extensive qualitative and quantitative evaluations demonstrate that our method excels in generating high-quality character animations, especially in scenarios of complex backgrounds and multiple characters.
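
One plausible reading of the depth order guider is sketched below: per-character depths are ranked and painted into an integer order map, with nearer characters overwriting farther ones at overlapping pixels. The inputs, ranking rule, and painting order are assumptions made for illustration; the paper's guider may differ.

```python
# Illustrative depth-to-order conversion for multiple characters (not the paper's code).
import numpy as np

def depth_order_map(depth, char_masks):
    """depth: HxW depth array; char_masks: list of boolean HxW masks, one per character."""
    med = [np.median(depth[m]) if m.any() else np.inf for m in char_masks]
    order = np.argsort(med)                      # character indices sorted near -> far
    out = np.zeros_like(depth, dtype=np.int32)   # 0 = background
    # Paint from farthest to nearest so the nearer character wins overlapping pixels.
    for rank, idx in reversed(list(enumerate(order, start=1))):
        out[char_masks[idx]] = rank              # 1 = closest character
    return out
```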

NeurIPS Conference 2021 Conference Paper

Dual Progressive Prototype Network for Generalized Zero-Shot Learning

  • Chaoqun Wang
  • Shaobo Min
  • Xuejin Chen
  • Xiaoyan Sun
  • Houqiang Li

Generalized Zero-Shot Learning (GZSL) aims to recognize new categories with auxiliary semantic information, e.g., category attributes. In this paper, we handle the critical domain shift problem, i.e., confusion between seen and unseen categories, by progressively improving the cross-domain transferability and category discriminability of visual representations. Our approach, named Dual Progressive Prototype Network (DPPN), constructs two types of prototypes that record prototypical visual patterns for attributes and categories, respectively. With attribute prototypes, DPPN alternately searches attribute-related local regions and updates corresponding attribute prototypes to progressively explore accurate attribute-region correspondence. This enables DPPN to produce visual representations with accurate attribute localization ability, which benefits semantic-visual alignment and representation transferability. Besides, along with progressive attribute localization, DPPN further projects category prototypes into multiple spaces to progressively repel visual representations from different categories, which boosts category discriminability. Both attribute and category prototypes are collaboratively learned in a unified framework, which makes the visual representations of DPPN transferable and distinctive. Experiments on four benchmarks prove that DPPN effectively alleviates the domain shift problem in GZSL.
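
The alternating localize-and-update loop can be pictured with a minimal sketch, assuming learnable attribute prototypes of shape [A, C] and convolutional features of shape [B, C, H, W]. The attention form and the moving-average refresh are stand-ins, not DPPN's exact rules.

```python
# Minimal attribute-prototype attention sketch: local features attend to attribute
# prototypes, attribute-related regions are pooled, and prototypes are refreshed.
import torch
import torch.nn.functional as F

def attribute_localize(feat, prototypes, momentum=0.9):
    """feat: [B, C, H, W] local features; prototypes: [A, C] attribute prototypes."""
    flat = feat.flatten(2)                                     # [B, C, H*W]
    sim = torch.einsum("ac,bcn->ban", prototypes, flat)        # attribute-region similarity
    attn = F.softmax(sim, dim=-1)                              # where each attribute lives
    attr_feat = torch.einsum("ban,bcn->bac", attn, flat)       # pooled per-attribute features
    # Refresh prototypes with a simple moving average (illustrative update rule).
    new_protos = momentum * prototypes + (1 - momentum) * attr_feat.mean(0)
    return attr_feat, new_protos
```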

AAAI Conference 2021 Conference Paper

Semantic-guided Reinforced Region Embedding for Generalized Zero-Shot Learning

  • Jiannan Ge
  • Hongtao Xie
  • Shaobo Min
  • Yongdong Zhang

Generalized Zero-Shot Learning (GZSL) aims to recognize images from either the seen or unseen domain, mainly by learning a joint embedding space to associate image features with the corresponding category descriptions. Recent methods have proved that localizing important object regions can effectively bridge the semantic-visual gap. However, these are all based on one-off visual localizers, lacking interpretability and flexibility. In this paper, we propose a novel Semantic-guided Reinforced Region Embedding (SR2E) network that can localize important objects in the long-term interest to construct the semantic-visual embedding space. SR2E consists of a Reinforced Region Module (R2M) and a Semantic Alignment Module (SAM). First, without the annotated bounding box as supervision, R2M encodes the semantic category guidance into reward and punishment criteria to teach the localizer serialized region searching. Besides, R2M explores different action spaces during the serialized searching path to avoid locally optimal localization, thereby generating discriminative visual features with less redundancy. Second, SAM preserves the semantic relationship in visual features via semantic-visual alignment and designs a domain detector to alleviate domain confusion. Experiments on four public benchmarks demonstrate that the proposed SR2E is an effective GZSL method with a reinforced embedding space, obtaining an average improvement of 6.1%.
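
A minimal, assumption-laden sketch of what "encoding semantic category guidance into reward and punishment criteria" could look like is given below: after each move of the localizer, the cropped region is scored against the class semantics and the policy receives +1 or -1 depending on whether the correct category wins. This is only an illustration, not the paper's reward design.

```python
# Hypothetical reward for serialized region search, driven by semantic guidance.
import torch
import torch.nn.functional as F

def region_reward(region_feat, sem, label, tau=0.1):
    """region_feat: [B, D] features of the current crop; sem: [K, D] class semantics; label: [B]."""
    logits = F.normalize(region_feat, dim=-1) @ F.normalize(sem, dim=-1).t() / tau
    pred = logits.argmax(dim=-1)
    # +1 if the crop is recognized as the correct category, -1 otherwise.
    return (pred == label).float() * 2 - 1
```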

AAAI Conference 2021 Conference Paper

Task-Independent Knowledge Makes for Transferable Representations for Generalized Zero-Shot Learning

  • Chaoqun Wang
  • Xuejin Chen
  • Shaobo Min
  • Xiaoyan Sun
  • Houqiang Li

Generalized Zero-Shot Learning (GZSL) targets recognizing new categories by learning transferable image representations. Existing methods find that, by aligning image representations with corresponding semantic labels, the semantic-aligned representations can be transferred to unseen categories. However, supervised by only seen category labels, the learned semantic knowledge is highly task-specific, which makes image representations biased towards seen categories. In this paper, we propose a novel Dual-Contrastive Embedding Network (DCEN) that simultaneously learns task-specific and task-independent knowledge via semantic alignment and instance discrimination. First, DCEN leverages task labels to cluster representations of the same semantic category by cross-modal contrastive learning and exploring semantic-visual complementarity. Besides task-specific knowledge, DCEN then introduces task-independent knowledge by attracting representations of different views of the same image and repelling representations of different images. Compared to high-level seen category supervision, this instance discrimination supervision encourages DCEN to capture low-level visual knowledge, which is less biased toward seen categories and alleviates the representation bias. Consequently, the task-specific and task-independent knowledge jointly make for transferable representations, with which DCEN obtains an average improvement of 4.1% on four public benchmarks.
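
The two kinds of supervision the abstract describes can be sketched as a pair of contrastive terms: a cross-modal term that pulls an image embedding toward its class semantic vector, and an instance-discrimination (NT-Xent style) term between two augmented views. The temperatures, weighting, and function signature below are illustrative assumptions, not DCEN's exact losses.

```python
# Sketch of combining semantic alignment with instance discrimination.
import torch
import torch.nn.functional as F

def dual_contrastive_loss(z1, z2, sem, labels, tau=0.1, lam=1.0):
    """z1, z2: [B, D] embeddings of two views; sem: [K, D] class semantics; labels: [B]."""
    z1, z2, sem = (F.normalize(t, dim=-1) for t in (z1, z2, sem))
    # Task-specific term: classify each image against all class semantic vectors.
    loss_sem = F.cross_entropy(z1 @ sem.t() / tau, labels)
    # Task-independent term: each view's positive is the other view of the same image.
    targets = torch.arange(z1.size(0), device=z1.device)
    loss_inst = F.cross_entropy(z1 @ z2.t() / tau, targets)
    return loss_sem + lam * loss_inst
```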

NeurIPS Conference 2020 Conference Paper

Hierarchical Granularity Transfer Learning

  • Shaobo Min
  • Hongtao Xie
  • Hantao Yao
  • Xuran Deng
  • Zheng-Jun Zha
  • Yongdong Zhang

In the real world, object categories usually form a hierarchical granularity tree. Nowadays, most researchers focus on recognizing categories at a specific granularity, e.g., basic-level or sub(ordinate)-level. Compared with basic-level categories, the sub-level categories provide more valuable information, but their training annotations are harder to acquire. Therefore, an attractive problem is how to transfer the knowledge learned from basic-level annotations to sub-level recognition. In this paper, we introduce a new task, named Hierarchical Granularity Transfer Learning (HGTL), to recognize sub-level categories with basic-level annotations and semantic descriptions for hierarchical categories. Different from other recognition tasks, HGTL has a serious granularity gap, i.e., the two granularities share an image space but have different category domains, which impedes knowledge transfer. To this end, we propose a novel Bi-granularity Semantic Preserving Network (BigSPN) to bridge the granularity gap for robust knowledge transfer. Specifically, BigSPN constructs specific visual encoders for different granularities, which are aligned with a shared semantic interpreter via a novel subordinate entropy loss. Experiments on three benchmarks with hierarchical granularities show that BigSPN is an effective framework for Hierarchical Granularity Transfer Learning.
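
The architecture the abstract outlines can be sketched as two granularity-specific visual encoders aligned to a shared semantic interpreter. The layer types, dimensions, and the plain cross-entropy alignment term below are stand-ins for illustration; in particular, the cross-entropy is not the paper's subordinate entropy loss.

```python
# Minimal sketch: per-granularity visual encoders + a shared semantic interpreter.
import torch
import torch.nn as nn
import torch.nn.functional as F

class BiGranularityModel(nn.Module):
    def __init__(self, feat_dim, sem_dim, embed_dim):
        super().__init__()
        self.enc_basic = nn.Linear(feat_dim, embed_dim)    # basic-level visual encoder
        self.enc_sub = nn.Linear(feat_dim, embed_dim)      # sub-level visual encoder
        self.interpreter = nn.Linear(sem_dim, embed_dim)   # shared semantic interpreter

    def forward(self, x, sem_basic, sem_sub):
        """x: [B, feat_dim]; sem_basic: [Kb, sem_dim]; sem_sub: [Ks, sem_dim]."""
        logit_basic = self.enc_basic(x) @ self.interpreter(sem_basic).t()
        logit_sub = self.enc_sub(x) @ self.interpreter(sem_sub).t()
        return logit_basic, logit_sub

def alignment_loss(logit_basic, label_basic):
    # Only basic-level labels are available during training; sub-level logits
    # are used at inference. Plain cross-entropy as an illustrative alignment term.
    return F.cross_entropy(logit_basic, label_basic)
```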

AAAI Conference 2019 Conference Paper

A Two-Stream Mutual Attention Network for Semi-Supervised Biomedical Segmentation with Noisy Labels

  • Shaobo Min
  • Xuejin Chen
  • Zheng-Jun Zha
  • Feng Wu
  • Yongdong Zhang

Learning-based methods suffer from a deficiency of clean annotations, especially in biomedical segmentation. Although many semi-supervised methods have been proposed to provide extra training data, automatically generated labels are usually too noisy to retrain models effectively. In this paper, we propose a Two-Stream Mutual Attention Network (TSMAN) that weakens the influence of back-propagated gradients caused by incorrect labels, thereby rendering the network robust to unclean data. The proposed TSMAN consists of two sub-networks that are connected by three types of attention models in different layers. The target of each attention model is to indicate potentially incorrect gradients in a certain layer for both sub-networks by analyzing their inferred features for the same input. To achieve this, the attention models are designed based on a propagation analysis of noisy gradients at different layers. This allows the attention models to effectively discover incorrect labels and weaken their influence during the parameter update process. By exchanging multi-level features within the two-stream architecture, the effect of noisy labels in each sub-network is reduced by decreasing the noisy gradients. Furthermore, a hierarchical distillation is developed to provide reliable pseudo labels for unlabeled data, which further boosts the performance of TSMAN. Experiments on both the HVSMR 2016 and BRATS 2015 benchmarks demonstrate that our semi-supervised learning framework surpasses the state-of-the-art fully-supervised results.
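
A hedged illustration of the mutual-attention idea follows: feature maps from the two streams are compared for the same input, and activations where the streams disagree are down-weighted so that gradients flowing through them (possibly driven by noisy labels) are attenuated. The module name and the gating form are assumptions, not the paper's attention models.

```python
# Illustrative mutual-attention gate between two sub-network streams.
import torch
import torch.nn as nn

class MutualAttentionGate(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.proj = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, feat_a, feat_b):
        """feat_a, feat_b: [B, C, H, W] features of the same input from the two streams."""
        # Agreement map in (0, 1): high where the two streams produce similar features.
        agree = torch.sigmoid(self.proj(torch.cat([feat_a, feat_b], dim=1)))
        # Gate each stream so disagreed (likely noise-driven) locations contribute less.
        return feat_a * agree, feat_b * agree
```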