Arrow Research

Author name cluster

Shuicheng Yan

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

115 papers
2 author rows

Possible papers

115

TMLR Journal 2026 Journal Article

MEMO: Memory-Guided Diffusion for Expressive Talking Video Generation

  • Longtao Zheng
  • Yifan Zhang
  • Hanzhong Guo
  • Jiachun Pan
  • Zhenxiong Tan
  • Jiahao Lu
  • Chuanxin Tang
  • Bo An

Recent advances in video diffusion models have unlocked new potential for realistic audio-driven talking video generation. However, maintaining long-term identity consistency, achieving seamless lip-audio synchronization, and producing natural, audio-aligned expressions in generated talking videos remain significant challenges. To address these challenges, we propose Memory-guided EMOtion-aware diffusion (MEMO), an end-to-end audio-driven portrait animation approach to generate identity-consistent and expressive talking videos. Our approach is built around two key modules: (1) a memory-guided temporal module, which enhances long-term identity consistency and motion smoothness by developing causal motion memory to store information from an extended past context to guide temporal modeling; and (2) an emotion-aware audio module, which replaces traditional cross attention with multi-modal attention to enhance audio-video interaction, while detecting emotions from audio to refine facial expressions via emotion-adaptive layer norm. Extensive quantitative and qualitative results demonstrate that MEMO generates more realistic talking videos across diverse image and audio types, outperforming state-of-the-art methods in overall quality, lip-audio synchronization, identity consistency, and expression-audio alignment. Our model and video demos are available at https://memoavatar.github.io.

AAAI Conference 2026 Conference Paper

PointDGRWKV: Generalizing RWKV-like Architecture to Unseen Domains for Point Cloud Classification

  • Hao Yang
  • Qianyu Zhou
  • Haijia Sun
  • Xiangtai Li
  • Xuequan Lu
  • Lizhuang Ma
  • Shuicheng Yan

Domain Generalization (DG) has been recently explored to enhance the generalizability of Point Cloud Classification (PCC) models toward unseen domains. Prior works are based on convolutional, Transformer, or Mamba architectures, which suffer from limited receptive fields, high computational cost, or insufficient long-range dependency modeling. RWKV, as an emerging architecture, possesses linear complexity, global receptive fields, and long-range dependency modeling. In this paper, we present the first work that studies the generalizability of RWKV models in DG PCC. We find that directly applying RWKV to DG PCC encounters two significant challenges. First, RWKV's fixed-direction token-shift methods, such as Q-Shift, introduce spatial distortions when applied to unstructured point clouds, weakening local geometric modeling and reducing robustness. Second, the Bi-WKV attention in RWKV amplifies slight cross-domain differences in key distributions through exponential weighting, leading to attention shifts and degraded generalization. To this end, we propose PointDGRWKV, the first RWKV-based framework tailored for DG PCC. It introduces two core modules to enhance spatial modeling and cross-domain robustness while maintaining RWKV's linear efficiency. In particular, we present Adaptive Geometric Token Shift, which models local neighborhood structures to improve geometric context awareness. In addition, Cross-Domain Key Feature Distribution Alignment is designed to mitigate attention drift by aligning key feature distributions across domains. Extensive experiments on multiple benchmarks demonstrate that PointDGRWKV achieves state-of-the-art performance on DG PCC.

TMLR Journal 2026 Journal Article

The Landscape of Agentic Reinforcement Learning for LLMs: A Survey

  • Guibin Zhang
  • Hejia Geng
  • Xiaohang Yu
  • Zhenfei Yin
  • Zaibin Zhang
  • Zelin Tan
  • Heng Zhou
  • Zhong-Zhi Li

The emergence of agentic reinforcement learning (Agentic RL) marks a paradigm shift from conventional reinforcement learning applied to large language models (LLM RL), reframing LLMs from passive sequence generators into autonomous, decision-making agents embedded in complex, dynamic worlds. This survey formalizes this conceptual shift by contrasting the degenerate single-step Markov Decision Processes (MDPs) of LLM RL with the temporally extended Partially Observable Markov Decision Processes (POMDPs) that define Agentic RL. Building on this foundation, we propose a comprehensive twofold taxonomy: one organized around core agentic capabilities, including planning, tool use, memory, reasoning, self-improvement, and perception, and the other around their applications across diverse task domains. Central to our thesis is that reinforcement learning serves as the critical mechanism for transforming these capabilities from static, heuristic modules into adaptive, robust agentic behavior. To support and accelerate future research, we consolidate the landscape of open-source environments, benchmarks, and frameworks into a practical compendium. By synthesizing over five hundred recent works, this survey charts the contours of this rapidly evolving field and highlights the opportunities and challenges that will shape the development of scalable, general-purpose AI agents.

ICLR Conference 2025 Conference Paper

AgentStudio: A Toolkit for Building General Virtual Agents

  • Longtao Zheng
  • Zhiyuan Huang
  • Zhenghai Xue
  • Xinrun Wang
  • Bo An 0001
  • Shuicheng Yan

General virtual agents need to handle multimodal observations, master complex action spaces, and self-improve in dynamic, open-domain environments. However, existing environments are often domain-specific and require complex setups, which limits agent development and evaluation in real-world settings. As a result, current evaluations lack in-depth analyses that decompose fundamental agent capabilities. We introduce AgentStudio, a trinity of environments, tools, and benchmarks to address these issues. AgentStudio provides a lightweight, interactive environment with highly generic observation and action spaces, e.g., video observations and GUI/API actions. It integrates tools for creating online benchmark tasks, annotating GUI elements, and labeling actions in videos. Based on our environment and tools, we curate an online task suite that benchmarks both GUI interactions and function calling with efficient auto-evaluation. We also reorganize existing datasets and collect new ones using our tools to establish three datasets: GroundUI, IDMBench, and CriticBench. These datasets evaluate fundamental agent abilities, including GUI grounding, learning from videos, and success detection, pointing to the desiderata for robust, general, and open-ended virtual agents.

AAAI Conference 2025 Conference Paper

Combating Multimodal LLM Hallucination via Bottom-Up Holistic Reasoning

  • Shengqiong Wu
  • Hao Fei
  • Liangming Pan
  • William Yang Wang
  • Shuicheng Yan
  • Tat-Seng Chua

Recent advancements in multimodal large language models (MLLMs) have shown unprecedented capabilities in advancing various vision-language tasks. However, MLLMs face significant challenges with hallucinations: misleading outputs that do not align with the input data. While existing efforts have been devoted to combating MLLM hallucinations, several pivotal challenges remain unsolved. First, while current approaches aggressively focus on addressing errors at the perception level, another important type at the cognition level, requiring factual commonsense, tends to be overlooked. In addition, existing methods may fall short in finding a more effective way to represent visual input, which remains a key bottleneck that triggers visual hallucinations. Moreover, MLLMs can frequently be misled by faulty textual inputs and produce hallucinations, yet this type of issue has long been overlooked by existing studies. Inspired by human intuition in handling hallucinations, this paper introduces a novel bottom-up reasoning framework. Our framework systematically addresses potential issues in both visual and textual inputs by verifying and integrating perception-level information with cognition-level commonsense knowledge, ensuring more reliable outputs. Extensive experiments demonstrate significant improvements on multiple hallucination benchmarks after integrating MLLMs with the proposed framework. In-depth analyses reveal the great potential of our method in addressing perception- and cognition-level hallucinations.

ICML Conference 2025 Conference Paper

Cradle: Empowering Foundation Agents towards General Computer Control

  • Weihao Tan
  • Wentao Zhang 0007
  • Xinrun Xu
  • Haochong Xia
  • Ziluo Ding
  • Boyu Li 0003
  • Bohan Zhou
  • Junpeng Yue

Despite their success in specific scenarios, existing foundation agents still struggle to generalize across various virtual scenarios, mainly due to the dramatically different encapsulations of environments with manually designed observation and action spaces. To handle this issue, we propose the General Computer Control (GCC) setting to restrict foundation agents to interact with software through the most unified and standardized interface, i.e., using screenshots as input and keyboard and mouse actions as output. We introduce Cradle, a modular and flexible LMM-powered framework, as a preliminary attempt towards GCC. Enhanced by six key modules, Information Gathering, Self-Reflection, Task Inference, Skill Curation, Action Planning, and Memory, Cradle is able to understand input screenshots and output executable code for low-level keyboard and mouse control after high-level planning and information retrieval, so that Cradle can interact with any software and complete long-horizon complex tasks without relying on any built-in APIs. Experimental results show that Cradle exhibits remarkable generalizability and impressive performance across four previously unexplored commercial video games (Red Dead Redemption 2, Cities: Skylines, Stardew Valley and Dealer’s Life 2), five software applications (Chrome, Outlook, Feishu, Meitu and CapCut), and a comprehensive benchmark, OSWorld. With a unified interface to interact with any software, Cradle greatly extends the reach of foundation agents thus paving the way for generalist agents.

ICML Conference 2025 Conference Paper

EasyInv: Toward Fast and Better DDIM Inversion

  • Ziyue Zhang
  • Mingbao Lin
  • Shuicheng Yan
  • Rongrong Ji

This paper introduces EasyInv, an easy yet novel approach that significantly advances the field of DDIM Inversion by addressing the inherent inefficiencies and performance limitations of traditional iterative optimization methods. At the core of our EasyInv is a refined strategy for approximating inversion noise, which is pivotal for enhancing the accuracy and reliability of the inversion process. By prioritizing the initial latent state, which encapsulates rich information about the original images, EasyInv steers clear of the iterative refinement of noise items. Instead, we introduce a methodical aggregation of the latent state from the preceding time step with the current state, effectively increasing the influence of the initial latent state and mitigating the impact of noise. We illustrate that EasyInv is capable of delivering results that are either on par with or exceed those of the conventional DDIM Inversion approach, especially under conditions where the model’s precision is limited or computational resources are scarce. Concurrently, our EasyInv offers an approximately threefold improvement in inference efficiency over off-the-shelf iterative optimization techniques. It can be easily combined with most existing inversion methods by adding only four lines of code. See code at https://github.com/potato-kitty/EasyInv.
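To make the latent-aggregation idea concrete, here is a minimal sketch assuming a standard DDIM inversion loop; the function names and the blending weight `eta` are illustrative assumptions, not taken from the EasyInv release:

```python
# Hedged sketch of EasyInv-style latent blending inside DDIM inversion.
# `eta` and all names are illustrative; see the official repo for the
# actual implementation.
import torch

def ddim_inversion_step(z, t, eps_model, alpha_bar):
    """One schematic DDIM inversion step from timestep t to t + 1."""
    eps = eps_model(z, t)                               # predicted noise
    a_t, a_next = alpha_bar[t], alpha_bar[t + 1]
    z0 = (z - (1 - a_t).sqrt() * eps) / a_t.sqrt()      # predicted clean latent
    return a_next.sqrt() * z0 + (1 - a_next).sqrt() * eps

def easyinv_like_inversion(z, eps_model, alpha_bar, steps, eta=0.85):
    for t in range(steps - 1):
        z_new = ddim_inversion_step(z, t, eps_model, alpha_bar)
        # Aggregate the preceding latent state with the current one,
        # boosting the influence of the earlier, initial-latent-rich state.
        z = eta * z_new + (1 - eta) * z
    return z
```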

AAAI Conference 2025 Conference Paper

Explore In-Context Segmentation via Latent Diffusion Models

  • Chaoyang Wang
  • Xiangtai Li
  • Henghui Ding
  • Lu Qi
  • Jiangning Zhang
  • Yunhai Tong
  • Chen Change Loy
  • Shuicheng Yan

In-context segmentation has drawn increasing attention with the advent of vision foundation models. Its goal is to segment objects using given reference images. Most existing approaches adopt metric learning or masked image modeling to build the correlation between visual prompts and input image queries. This work approaches the problem from a fresh perspective - unlocking the capability of the latent diffusion model (LDM) for in-context segmentation and investigating different design choices. Specifically, we examine the problem from three angles: instruction extraction, output alignment, and meta-architectures. We design a two-stage masking strategy to prevent interfering information from leaking into the instructions. In addition, we propose an augmented pseudo-masking target to ensure the model predicts without forgetting the original images. Moreover, we build a new and fair in-context segmentation benchmark that covers both image and video datasets. Experiments validate the effectiveness of our approach, demonstrating comparable or even stronger results than previous specialist or visual foundation models. We hope our work inspires others to rethink the unification of segmentation and generation.

NeurIPS Conference 2025 Conference Paper

G-Memory: Tracing Hierarchical Memory for Multi-Agent Systems

  • Guibin Zhang
  • Muxin Fu
  • Kun Wang
  • Frank Wan
  • Miao Yu
  • Shuicheng Yan

Large language model (LLM)-powered multi-agent systems (MAS) have demonstrated cognitive and execution capabilities that far exceed those of single LLM agents, yet their capacity for self-evolution remains hampered by underdeveloped memory architectures. Upon close inspection, we are alarmed to discover that prevailing MAS memory mechanisms (1) are overly simplistic, completely disregarding the nuanced inter-agent collaboration trajectories, and (2) lack cross-trial and agent-specific customization, in stark contrast to the expressive memory developed for single agents. To bridge this gap, we introduce G-Memory, a hierarchical, agentic memory system for MAS inspired by organizational memory theory, which manages the lengthy MAS interaction via a three-tier graph hierarchy: insight, query, and interaction graphs. Upon receiving a new user query, G-Memory performs bi-directional memory traversal to retrieve both high-level, generalizable insights that enable the system to leverage cross-trial knowledge, and fine-grained, condensed interaction trajectories that compactly encode prior collaboration experiences. Upon task execution, the entire hierarchy evolves by assimilating new collaborative trajectories, nurturing the progressive evolution of agent teams. Extensive experiments across five benchmarks, three LLM backbones, and three popular MAS frameworks demonstrate that G-Memory improves success rates in embodied action and accuracy in knowledge QA by up to 20.89% and 10.12%, respectively, without any modifications to the original frameworks.
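As a rough illustration of the three-tier hierarchy, the following toy sketch models the insight, query, and interaction tiers as linked dictionaries; all class and field names are hypothetical, and the real system operates on graphs with learned retrieval:

```python
# Toy data-structure sketch of a three-tier memory in the spirit of
# G-Memory; names and fields are illustrative only.
from dataclasses import dataclass, field

@dataclass
class HierarchicalMemory:
    insights: dict = field(default_factory=dict)      # insight_id -> generalizable insight
    queries: dict = field(default_factory=dict)       # query_id -> {"insight_ids", "traj_ids"}
    interactions: dict = field(default_factory=dict)  # traj_id -> condensed trajectory

    def retrieve(self, query_id):
        """Bi-directional traversal: up to insights, down to trajectories."""
        node = self.queries[query_id]
        ups = [self.insights[i] for i in node["insight_ids"]]
        downs = [self.interactions[t] for t in node["traj_ids"]]
        return ups, downs

    def assimilate(self, query_id, traj_id, trajectory):
        """After task execution, fold the new collaboration trajectory back in."""
        self.interactions[traj_id] = trajectory
        self.queries[query_id]["traj_ids"].append(traj_id)
```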

ICLR Conference 2025 Conference Paper

IPDreamer: Appearance-Controllable 3D Object Generation with Complex Image Prompts

  • Bohan Zeng
  • Shanglin Li
  • Yutang Feng
  • Ling Yang 0006
  • Juan Zhang
  • Hong Li 0016
  • Jiaming Liu
  • Conghui He

Recent advances in 3D generation have been remarkable, with methods such as DreamFusion leveraging large-scale text-to-image diffusion-based models to guide 3D object generation. These methods enable the synthesis of detailed and photorealistic textured objects. However, the appearance of 3D objects produced by such text-to-3D models is often unpredictable, and it is hard for single-image-to-3D methods to deal with images lacking a clear subject, complicating the generation of appearance-controllable 3D objects from complex images. To address these challenges, we present IPDreamer, a novel method that captures intricate appearance features from complex Image Prompts and aligns the synthesized 3D object with these extracted features, enabling high-fidelity, appearance-controllable 3D object generation. Our experiments demonstrate that IPDreamer consistently generates high-quality 3D objects that align with both the textual and complex image prompts, highlighting its promising capability in appearance-controlled, complex 3D object generation.

NeurIPS Conference 2025 Conference Paper

JarvisArt: Liberating Human Artistic Creativity via an Intelligent Photo Retouching Agent

  • Yunlong Lin
  • Zixu Lin
  • Kunjie Lin
  • Jinbin Bai
  • Panwang Pan
  • Chenxin Li
  • Haoyu Chen
  • Zhongdao Wang

Photo retouching has become integral to contemporary visual storytelling, enabling users to capture aesthetics and express creativity. While professional tools such as Adobe Lightroom offer powerful capabilities, they demand substantial expertise and manual effort. In contrast, existing AI-based solutions provide automation but often suffer from limited adjustability and poor generalization, failing to meet diverse and personalized editing needs. To bridge this gap, we introduce JarvisArt, a multi-modal large language model (MLLM)-driven agent that understands user intent, mimics the reasoning process of professional artists, and intelligently coordinates over 200 retouching tools within Lightroom. JarvisArt undergoes a two-stage training process: an initial Chain-of-Thought supervised fine-tuning to establish basic reasoning and tool-use skills, followed by Group Relative Policy Optimization for Retouching (GRPO-R) to further enhance its decision-making and tool proficiency. We also propose the Agent-to-Lightroom Protocol to facilitate seamless integration with Lightroom. To evaluate performance, we develop MMArt-Bench, a novel benchmark constructed from real-world user edits. JarvisArt demonstrates user-friendly interaction, superior generalization, and fine-grained control over both global and local adjustments, paving a new avenue for intelligent photo retouching. Notably, it outperforms GPT-4o with a 60% improvement in average pixel-level metrics on MMArt-Bench for content fidelity, while maintaining comparable instruction-following capabilities.

NeurIPS Conference 2025 Conference Paper

JavisGPT: A Unified Multi-modal LLM for Sounding-Video Comprehension and Generation

  • Kai Liu
  • Jungang Li
  • Yuchong Sun
  • Shengqiong Wu
  • Jianzhang Gao
  • Daoan Zhang
  • Wei Zhang
  • Sheng Jin

This paper presents JavisGPT, the first unified multimodal large language model (MLLM) for joint audio-video (JAV) comprehension and generation. JavisGPT has a concise encoder-LLM-decoder architecture, which has a SyncFusion module for spatio-temporal audio-video fusion and synchrony-aware learnable queries to bridge a pretrained JAV-DiT generator. This design enables temporally coherent video-audio understanding and generation from multimodal instructions. We design an effective three-stage training pipeline consisting of multimodal pretraining, audio-video fine-tuning, and large-scale instruction-tuning, to progressively build multimodal comprehension and generation from existing vision-language models. For instruction tuning, we construct JavisInst-Omni, a high-quality instruction dataset with over 200K GPT-4o-curated audio-video-text dialogues that cover diverse and multi-level comprehension and generation scenarios. On JAV comprehension and generation benchmarks, our experiments show that JavisGPT outperforms existing MLLMs, particularly in complex and temporally synchronized settings.

ICLR Conference 2025 Conference Paper

Meissonic: Revitalizing Masked Generative Transformers for Efficient High-Resolution Text-to-Image Synthesis

  • Jinbin Bai
  • Tian Ye 0001
  • Wei Chow
  • Enxin Song
  • Qing-Guo Chen
  • Xiangtai Li
  • Zhen Dong 0003
  • Lei Zhu 0003

We present Meissonic, which elevates non-autoregressive text-to-image Masked Image Modeling (MIM) to a level comparable with state-of-the-art diffusion models like SDXL. By incorporating a comprehensive suite of architectural innovations, advanced positional encoding strategies, and optimized sampling conditions, Meissonic substantially improves MIM's performance and efficiency. Additionally, we leverage high-quality training data, integrate micro-conditions informed by human preference scores, and employ feature compression layers to further enhance image fidelity and resolution. Our model not only matches but often exceeds the performance of existing methods in generating high-quality, high-resolution images. Extensive experiments validate Meissonic’s capabilities, demonstrating its potential as a new standard in text-to-image synthesis.

ICLR Conference 2025 Conference Paper

MoE++: Accelerating Mixture-of-Experts Methods with Zero-Computation Experts

  • Peng Jin 0001
  • Bo Zhu
  • Li Yuan 0007
  • Shuicheng Yan

In this work, we aim to simultaneously enhance the effectiveness and efficiency of Mixture-of-Experts (MoE) methods. To achieve this, we propose MoE++, a general and heterogeneous MoE framework that integrates both Feed-Forward Network (FFN) and zero-computation experts. Specifically, we introduce three types of zero-computation experts: the zero expert, copy expert, and constant expert, which correspond to discard, skip, and replace operations, respectively. This design offers three key advantages: (i) Low Computing Overhead: unlike the uniform mixing mechanism for all tokens within vanilla MoE, MoE++ allows each token to engage with a dynamic number of FFNs, be adjusted by constant vectors, or even skip the MoE layer entirely. (ii) High Performance: by enabling simple tokens to utilize fewer FFN experts, MoE++ allows more experts to focus on challenging tokens, thereby unlocking greater performance potential than vanilla MoE. (iii) Deployment Friendly: given that zero-computation experts have negligible parameters, we can deploy all zero-computation experts on each GPU, eliminating the significant communication overhead and expert load imbalance associated with FFN experts distributed across different GPUs. Moreover, we leverage gating residuals, enabling each token to consider the pathway taken in the previous layer when selecting the appropriate experts. Extensive experimental results demonstrate that MoE++ achieves better performance while delivering 1.1–2.1× expert forward throughput compared to a vanilla MoE model of the same size, which lays a solid foundation for developing advanced and efficient MoE-related models.
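The three zero-computation experts map naturally onto trivial modules. Below is a schematic PyTorch sketch under simplified routing; the class names are illustrative and the paper's gating residuals are omitted:

```python
# Schematic sketch of MoE++'s zero-computation experts (zero = discard,
# copy = skip, constant = replace) alongside FFN experts; routing is a
# plain top-k softmax, simplified relative to the paper.
import torch
import torch.nn as nn

class ZeroExpert(nn.Module):
    def forward(self, x):                 # "discard": contributes nothing
        return torch.zeros_like(x)

class CopyExpert(nn.Module):
    def forward(self, x):                 # "skip": identity pass-through
        return x

class ConstantExpert(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.const = nn.Parameter(torch.zeros(dim))
    def forward(self, x):                 # "replace": learned constant vector
        return self.const.expand_as(x)

class MoEPlusPlusLikeLayer(nn.Module):
    def __init__(self, dim, hidden, n_ffn=4, k=2):
        super().__init__()
        ffns = [nn.Sequential(nn.Linear(dim, hidden), nn.GELU(),
                              nn.Linear(hidden, dim)) for _ in range(n_ffn)]
        self.experts = nn.ModuleList(ffns + [ZeroExpert(), CopyExpert(),
                                             ConstantExpert(dim)])
        self.router = nn.Linear(dim, len(self.experts))
        self.k = k

    def forward(self, x):                 # x: (tokens, dim)
        weights = self.router(x).softmax(dim=-1)
        topw, topi = weights.topk(self.k, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = topi[:, slot] == e
                if mask.any():             # route masked tokens to expert e
                    out[mask] += topw[mask, slot, None] * expert(x[mask])
        return out
```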

ICML Conference 2025 Conference Paper

MoH: Multi-Head Attention as Mixture-of-Head Attention

  • Peng Jin 0001
  • Bo Zhu
  • Li Yuan 0007
  • Shuicheng Yan

In this work, we upgrade the multi-head attention mechanism, the core of the Transformer model, to reduce computational costs while maintaining or surpassing the previous accuracy level. We show that multi-head attention can be expressed in the summation form. Drawing on the insight that not all attention heads hold equal significance, we propose Mixture-of-Head attention (MoH), a new architecture that treats attention heads as experts in the Mixture-of-Experts (MoE) mechanism. MoH has two significant advantages: First, MoH enables each token to select the appropriate attention heads, enhancing inference efficiency without compromising accuracy or increasing the number of parameters. Second, MoH replaces the standard summation in multi-head attention with a weighted summation, introducing flexibility to the attention mechanism and unlocking extra performance potential. Extensive experiments on ViT, DiT, and LLMs demonstrate that MoH outperforms multi-head attention by using only 50%–90% of the attention heads. Moreover, we demonstrate that pre-trained multi-head attention models, such as LLaMA3-8B, can be further continue-tuned into our MoH models. Notably, MoH-LLaMA3-8B achieves an average accuracy of 64.0% across 14 benchmarks, outperforming LLaMA3-8B by 2.4% by utilizing only 75% of the attention heads. We believe the proposed MoH is a promising alternative to multi-head attention and provides a strong foundation for developing advanced and efficient attention-based models.
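A hedged sketch of the head-routing idea: score attention heads per token, zero out all but the top-k, and combine head outputs with a weighted rather than uniform sum. Shared heads and the paper's exact routing are omitted, and shapes assume PyTorch 2.0 or newer:

```python
# Illustrative Mixture-of-Head attention; not the paper's reference code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoHAttention(nn.Module):
    def __init__(self, dim, n_heads=8, k=6):
        super().__init__()
        assert dim % n_heads == 0
        self.h, self.d, self.k = n_heads, dim // n_heads, k
        self.qkv = nn.Linear(dim, 3 * dim)
        self.router = nn.Linear(dim, n_heads)   # per-token head scores
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                        # x: (B, T, dim)
        B, T, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        shape = (B, T, self.h, self.d)
        q, k, v = (t.view(shape).transpose(1, 2) for t in (q, k, v))
        heads = F.scaled_dot_product_attention(q, k, v)      # (B, h, T, d)
        scores = self.router(x).softmax(dim=-1)              # (B, T, h)
        topv, topi = scores.topk(self.k, dim=-1)
        # Keep only the top-k heads per token; unselected heads get weight 0.
        gate = torch.zeros_like(scores).scatter(-1, topi, topv)
        heads = heads.transpose(1, 2) * gate.unsqueeze(-1)   # weighted heads
        return self.proj(heads.reshape(B, T, -1))
```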

ICML Conference 2025 Conference Paper

On Path to Multimodal Generalist: General-Level and General-Bench

  • Hao Fei 0001
  • Yuan Zhou 0016
  • Juncheng Li 0006
  • Xiangtai Li
  • Qingshan Xu 0001
  • Bobo Li 0001
  • Shengqiong Wu
  • Yaoting Wang

The Multimodal Large Language Model (MLLM) is currently experiencing rapid growth, driven by the advanced capabilities of language-based LLMs. Unlike their specialist predecessors, existing MLLMs are evolving towards a Multimodal Generalist paradigm. Initially limited to understanding multiple modalities, these models have advanced to not only comprehend but also generate across modalities. Their capabilities have expanded from coarse-grained to fine-grained multimodal understanding and from supporting singular modalities to accommodating a wide array of or even arbitrary modalities. To assess the capabilities of various MLLMs, a diverse array of benchmark test sets has been proposed. This leads to a critical question: Can we simply assume that higher performance across tasks indicates a stronger MLLM capability, bringing us closer to human-level AI? We argue that the answer is not as straightforward as it seems. In this project, we introduce an evaluation framework to delineate the capabilities and behaviors of current multimodal generalists. This framework, named General-Level, establishes 5-scale levels of MLLM performance and generality, offering a methodology to compare MLLMs and gauge the progress of existing systems towards more robust multimodal generalists and, ultimately, towards AGI (Artificial General Intelligence). Central to our framework is the use of Synergy as the evaluative criterion, categorizing capabilities based on whether MLLMs preserve synergy across comprehension and generation, as well as across multimodal interactions. To evaluate the comprehensive abilities of various generalists, we present a massive multimodal benchmark, General-Bench, which encompasses a broader spectrum of skills, modalities, formats, and capabilities, including over 700 tasks and 325,800 instances. The evaluation results that involve over 100 existing state-of-the-art MLLMs uncover the capability rankings of generalists, highlighting the challenges in reaching genuine AI. We expect this project to pave the way for future research on next-generation multimodal foundation models, providing a robust infrastructure to accelerate the realization of AGI. Project Page: https://generalist.top/, Leaderboard: https://generalist.top/leaderboard/, Benchmark: https://huggingface.co/General-Level/.

AAAI Conference 2025 Conference Paper

Point Cloud Mamba: Point Cloud Learning via State Space Model

  • Tao Zhang
  • Haobo Yuan
  • Lu Qi
  • Jiangning Zhang
  • Qianyu Zhou
  • Shunping Ji
  • Shuicheng Yan
  • Xiangtai Li

Recently, state space models have exhibited strong global modeling capabilities and linear computational complexity in contrast to transformers. This research focuses on applying such architecture to more efficiently and effectively model point cloud data globally with linear computational complexity. In particular, for the first time, we demonstrate that Mamba-based point cloud methods can outperform previous methods based on transformer or multi-layer perceptrons (MLPs). To enable Mamba to process 3-D point cloud data more effectively, we propose a novel Consistent Traverse Serialization method to convert point clouds into 1-D point sequences while ensuring that neighboring points in the sequence are also spatially adjacent. Consistent Traverse Serialization yields six variants by permuting the order of x, y, and z coordinates, and the synergistic use of these variants aids Mamba in comprehensively observing point cloud data. Furthermore, to assist Mamba in handling point sequences with different orders more effectively, we introduce point prompts to inform Mamba of the sequence’s arrangement rules. Finally, we propose positional encoding based on spatial coordinate mapping to inject positional information into point cloud sequences more effectively. Point Cloud Mamba surpasses the state-of-the-art (SOTA) point-based method PointNeXt and achieves new SOTA performance on the ScanObjectNN, ModelNet40, ShapeNetPart, and S3DIS datasets. It is worth mentioning that when using a more powerful local feature extraction module, our PCM achieves 79.6 mIoU on S3DIS, significantly surpassing the previous SOTA models, DeLA and PTv3, by 5.5 mIoU and 4.9 mIoU, respectively.
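As a simplified stand-in for Consistent Traverse Serialization, the sketch below voxelizes the cloud, then orders points lexicographically under one of the six (x, y, z) axis permutations, reversing alternate rows (serpentine) so consecutive points stay spatially adjacent. The grid resolution and tie-breaking are assumptions, not the paper's exact scheme:

```python
# Hedged sketch of a consistent-traverse-style point serialization.
import itertools
import numpy as np

def serialize(points, order=(0, 1, 2), grid=64):
    """points: (N, 3) array -> index order for a 1-D traversal."""
    span = np.ptp(points, axis=0) + 1e-9
    g = np.floor((points - points.min(axis=0)) / span * (grid - 1)).astype(int)
    a, b, c = g[:, order[0]], g[:, order[1]], g[:, order[2]]
    b = np.where(a % 2 == 1, grid - 1 - b, b)   # serpentine on 2nd axis
    c = np.where(b % 2 == 1, grid - 1 - c, c)   # serpentine on 3rd axis
    return np.lexsort((c, b, a))                # sort by a, then b, then c

# The six ordering variants used jointly by the model:
variants = list(itertools.permutations((0, 1, 2)))
```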

AAAI Conference 2025 Conference Paper

PointDGMamba: Domain Generalization of Point Cloud Classification via Generalized State Space Model

  • Hao Yang
  • Qianyu Zhou
  • Haijia Sun
  • Xiangtai Li
  • Fengqi Liu
  • Xuequan Lu
  • Lizhuang Ma
  • Shuicheng Yan

Domain Generalization (DG) has been recently explored to improve the generalizability of point cloud classification (PCC) models toward unseen domains. However, existing methods often suffer from limited receptive fields or quadratic complexity due to their use of convolutional neural networks or vision Transformers. In this paper, we present the first work that studies the generalizability of state space models (SSMs) in DG PCC and find that directly applying SSMs to DG PCC encounters several challenges: the inherent topology of the point cloud tends to be disrupted, leading to noise accumulation during the serialization stage. In addition, the lack of designs for domain-agnostic feature learning and data scanning introduces unanticipated domain-specific information into the 3D sequence data. To this end, we propose a novel framework, PointDGMamba, that excels in strong generalizability toward unseen domains and has the advantages of global receptive fields and efficient linear complexity. PointDGMamba consists of three innovative components: Masked Sequence Denoising (MSD), Sequence-wise Cross-domain Feature Aggregation (SCFA), and Dual-level Domain Scanning (DDS). In particular, MSD selectively masks out the noised point tokens of the point cloud sequences, while SCFA introduces cross-domain but same-class point cloud features to encourage the model to learn how to extract more generalized features. DDS includes intra-domain scanning and cross-domain scanning to facilitate information exchange between features. In addition, we propose a new and more challenging benchmark, PointDG-3to1, for multi-domain generalization. Extensive experiments demonstrate the effectiveness and state-of-the-art performance of PointDGMamba.

ICLR Conference 2025 Conference Paper

Poison-splat: Computation Cost Attack on 3D Gaussian Splatting

  • Jiahao Lu
  • Yifan Zhang 0004
  • Qiuhong Shen
  • Xinchao Wang
  • Shuicheng Yan

3D Gaussian splatting (3DGS), known for its groundbreaking performance and efficiency, has become a dominant 3D representation and brought progress to many 3D vision tasks. However, in this work, we reveal a significant security vulnerability that has been largely overlooked in 3DGS: the computation cost of training 3DGS can be maliciously tampered with by poisoning the input data. By developing an attack named Poison-splat, we reveal a novel attack surface where the adversary can poison the input images to drastically increase the computational memory and time needed for 3DGS training, pushing the algorithm towards its worst computation complexity. In extreme cases, the attack can even consume all allocable memory, leading to a Denial-of-Service (DoS) that disrupts servers, resulting in practical damages to real-world 3DGS service vendors. Such a computation cost attack is achieved by addressing a bi-level optimization problem through three tailored strategies: attack objective approximation, proxy model rendering, and optional constrained optimization. These strategies not only ensure the effectiveness of our attack but also make it difficult to defend against with simple defensive measures. We hope the revelation of this novel attack surface can spark attention to this crucial yet overlooked vulnerability of 3DGS systems. Our code is available at https://github.com/jiahaolu97/poison-splat.

ICLR Conference 2025 Conference Paper

Policy Optimization under Imperfect Human Interactions with Agent-Gated Shared Autonomy

  • Zhenghai Xue
  • Bo An 0001
  • Shuicheng Yan

We introduce AGSA, an Agent-Gated Shared Autonomy framework that learns from high-level human feedback to tackle the challenges of reward-free training, safe exploration, and imperfect low-level human control. Recent human-in-the-loop learning methods enable human participants to intervene in a learning agent’s control and provide online demonstrations. Nonetheless, these methods rely heavily on perfect human interactions, including accurate human-monitored intervention decisions and near-optimal human demonstrations. AGSA employs a dedicated gating agent to determine when to switch control, thereby reducing the need for constant human monitoring. To obtain a precise and foreseeable gating agent, AGSA trains a long-term gating value function from human evaluative feedback on the gating agent’s intervention requests and preference feedback on pairs of human intervention trajectories. Instead of relying on potentially suboptimal human demonstrations, the learning agent is trained using control-switching signals from the gating agent. We provide theoretical insights on performance bounds that respectively characterize the abilities of the two agents. Experiments are conducted with both simulated and real human participants at different skill levels in challenging continuous control environments. Comparative results highlight that AGSA achieves significant improvements over previous human-in-the-loop learning methods in terms of training safety, policy performance, and user-friendliness.

ICML Conference 2025 Conference Paper

Policy Regularization on Globally Accessible States in Cross-Dynamics Reinforcement Learning

  • Zhenghai Xue
  • Lang Feng 0002
  • Jiacheng Xu
  • Kang Kang
  • Xiang Wen
  • Bo An 0001
  • Shuicheng Yan

To learn from data collected in diverse dynamics, Imitation from Observation (IfO) methods leverage expert state trajectories based on the premise that recovering expert state distributions in other dynamics facilitates policy learning in the current one. However, Imitation Learning inherently imposes a performance upper bound on learned policies. Additionally, as the environment dynamics change, certain expert states may become inaccessible, rendering their distributions less valuable for imitation. To address this, we propose a novel framework that integrates reward maximization with IfO, employing F-distance regularized policy optimization. This framework enforces constraints on globally accessible states—those with nonzero visitation frequency across all considered dynamics—mitigating the challenge posed by inaccessible states. By instantiating F-distance in different ways, we derive two theoretical analyses and develop a practical algorithm called Accessible State Oriented Policy Regularization (ASOR). ASOR serves as a general-purpose module that can be incorporated into various RL approaches, including offline RL and off-policy RL. Extensive experiments across multiple benchmarks demonstrate ASOR’s effectiveness in enhancing state-of-the-art cross-domain policy transfer algorithms, significantly improving their performance.

NeurIPS Conference 2025 Conference Paper

RoboCerebra: A Large-scale Benchmark for Long-horizon Robotic Manipulation Evaluation

  • Songhao Han
  • Boxiang Qiu
  • Yue Liao
  • Siyuan Huang
  • Chen Gao
  • Shuicheng Yan
  • Si Liu

Recent advances in vision-language models (VLMs) have enabled instruction-conditioned robotic systems with improved generalization. However, most existing work focuses on reactive System 1 policies, underutilizing VLMs’ strengths in semantic reasoning and long-horizon planning. These System 2 capabilities—characterized by deliberative, goal-directed thinking—remain underexplored due to the limited temporal scale and structural complexity of current benchmarks. To address this gap, we introduce RoboCerebra, a benchmark for evaluating high-level reasoning in long-horizon robotic manipulation. RoboCerebra includes: (1) a large-scale simulation dataset with extended task horizons and diverse subtask sequences in household environments; (2) a hierarchical framework combining a high-level VLM planner with a low-level vision-language-action (VLA) controller; and (3) an evaluation protocol targeting planning, reflection, and memory through structured System 1–System 2 interaction. The dataset is constructed via a top-down pipeline, where GPT generates task instructions and decomposes them into subtask sequences. Human operators execute the subtasks in simulation, yielding high-quality trajectories with dynamic object variations. Compared to prior benchmarks, RoboCerebra features significantly longer action sequences and denser annotations. We further benchmark state-of-the-art VLMs as System 2 modules and analyze their performance across key cognitive dimensions, advancing the development of more capable and generalizable robotic planners.

ICLR Conference 2025 Conference Paper

SuperCorrect: Advancing Small LLM Reasoning with Thought Template Distillation and Self-Correction

  • Ling Yang 0006
  • Zhaochen Yu
  • Tianjun Zhang
  • Minkai Xu
  • Joseph E. Gonzalez
  • Bin Cui 0001
  • Shuicheng Yan

Large language models (LLMs) like GPT-4, DeepSeek-R1, and ReasonFlux have shown significant improvements in various reasoning tasks. However, smaller LLMs still struggle with complex mathematical reasoning because they fail to effectively identify and correct reasoning errors. Recent reflection-based methods aim to address these issues by enabling self-reflection and self-correction, but they still face challenges in independently detecting errors in their reasoning steps. To overcome these limitations, we propose SuperCorrect, a novel two-stage framework that uses a large teacher model to supervise and correct both the reasoning and reflection processes of a smaller student model. In the first stage, we extract hierarchical high-level and detailed thought templates from the teacher model to guide the student model in eliciting more fine-grained reasoning thoughts. In the second stage, we introduce cross-model collaborative direct preference optimization (DPO) to enhance the self-correction abilities of the student model by following the teacher's correction traces during training. This cross-model DPO approach teaches the student model to effectively locate and resolve erroneous thoughts with error-driven insights from the teacher model, breaking the bottleneck of its thoughts and acquiring new skills and knowledge to tackle challenging problems. Extensive experiments consistently demonstrate our superiority over previous methods. Notably, our SuperCorrect-7B model significantly surpasses powerful DeepSeekMath-7B by 7.8%/5.3% and Qwen2.5-Math-7B by 15.1%/6.3% on MATH/GSM8K benchmarks, achieving new SOTA performance among all 7B models. Code is available at: https://github.com/YangLing0818/SuperCorrect-llm

ICLR Conference 2025 Conference Paper

Towards Semantic Equivalence of Tokenization in Multimodal LLM

  • Shengqiong Wu
  • Hao Fei 0001
  • Xiangtai Li
  • Jiayi Ji
  • Hanwang Zhang
  • Tat-Seng Chua
  • Shuicheng Yan

Multimodal Large Language Models (MLLMs) have demonstrated exceptional capabilities in processing vision-language tasks. One crux of MLLMs lies in vision tokenization, which involves efficiently transforming input visual signals into feature representations that are most beneficial for LLMs. However, existing vision tokenizers, essential for semantic alignment between vision and language, remain problematic: existing methods aggressively fragment visual input, corrupting the visual semantic integrity. To address this, this paper proposes a novel dynamic Semantic-Equivalent Vision Tokenizer (SeTok), which groups visual features into semantic units via a dynamic clustering algorithm, flexibly determining the number of tokens based on image complexity. The resulting vision tokens effectively preserve semantic integrity and capture both low-frequency and high-frequency visual features. The proposed MLLM (Setokim), equipped with SeTok, demonstrates significantly superior performance across various tasks, as evidenced by our experimental results.
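To illustrate the dynamic-clustering step, here is a greedy sketch that merges patch features whose cosine similarity to an existing cluster mean exceeds a threshold, so the token count adapts to image complexity. The threshold and the greedy single-pass scheme are assumptions, not SeTok's actual algorithm:

```python
# Hedged sketch of a dynamic-clustering vision tokenizer.
import torch
import torch.nn.functional as F

def dynamic_cluster_tokens(feats, sim_thresh=0.8):
    """feats: (N, D) patch features -> (K, D) cluster-mean tokens, K <= N."""
    feats = F.normalize(feats, dim=-1)
    sums, counts = [], []
    for f in feats:
        if sums:
            means = F.normalize(torch.stack([s / n for s, n in zip(sums, counts)]),
                                dim=-1)
            sims = means @ f                    # cosine similarity to each cluster
            j = int(sims.argmax())
            if sims[j] > sim_thresh:            # assign to the closest semantic unit
                sums[j] = sums[j] + f
                counts[j] += 1
                continue
        sums.append(f.clone())                  # open a new semantic unit
        counts.append(1)
    return torch.stack([s / n for s, n in zip(sums, counts)])
```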

NeurIPS Conference 2024 Conference Paper

Action Imitation in Common Action Space for Customized Action Image Synthesis

  • Wang Lin
  • Jingyuan Chen
  • Jiaxin Shi
  • Zirun Guo
  • Yichen Zhu
  • Zehan Wang
  • Tao Jin
  • Zhou Zhao

We propose a novel method, \textbf{TwinAct}, to tackle the challenge of decoupling actions and actors in order to customize the text-guided diffusion models (TGDMs) for few-shot action image generation. TwinAct addresses the limitations of existing methods that struggle to decouple actions from other semantics (e.g., the actor's appearance) due to the lack of an effective inductive bias with few exemplar images. Our approach introduces a common action space, which is a textual embedding space focused solely on actions, enabling precise customization without actor-related details. Specifically, TwinAct involves three key steps: 1) Building common action space based on a set of representative action phrases; 2) Imitating the customized action within the action space; and 3) Generating highly adaptable customized action images in diverse contexts with action similarity loss. To comprehensively evaluate TwinAct, we construct a novel benchmark, which provides sample images with various forms of actions. Extensive experiments demonstrate TwinAct's superiority in generating accurate, context-independent customized actions while maintaining the identity consistency of different subjects, including animals, humans, and even customized actors.

ICML Conference 2024 Conference Paper

Auto-Encoding Morph-Tokens for Multimodal LLM

  • Kaihang Pan
  • Siliang Tang
  • Juncheng Li 0006
  • Zhaoyu Fan
  • Wei Chow
  • Shuicheng Yan
  • Tat-Seng Chua
  • Yueting Zhuang

For multimodal LLMs, the synergy of visual comprehension (textual output) and generation (visual output) presents an ongoing challenge. This is due to a conflicting objective: for comprehension, an MLLM needs to abstract the visuals; for generation, it needs to preserve the visuals as much as possible. These conflicting objectives thus place visual tokens in a dilemma. To resolve the conflict, we propose encoding images into morph-tokens to serve a dual purpose: for comprehension, they act as visual prompts instructing the MLLM to generate texts; for generation, they take on a different, non-conflicting role as complete visual tokens for image reconstruction, where the missing visual cues are recovered by the MLLM. Extensive experiments show that morph-tokens can achieve a new SOTA for multimodal comprehension and generation simultaneously. Our project is available at https://github.com/DCDmllm/MorphTokens.

NeurIPS Conference 2024 Conference Paper

Automating Dataset Updates Towards Reliable and Timely Evaluation of Large Language Models

  • Jiahao Ying
  • Yixin Cao
  • Yushi Bai
  • Qianru Sun
  • Bo Wang
  • Wei Tang
  • Zhaojun Ding
  • Yizhe Yang

Large language models (LLMs) have achieved impressive performance across various natural language benchmarks, prompting a continual need to curate more difficult datasets for larger LLMs, which is costly and time-consuming. In this paper, we propose to automate dataset updating and provide a systematic analysis of its effectiveness in dealing with the benchmark leakage issue, difficulty control, and stability. Thus, once the current benchmark has been mastered or leaked, we can update it for timely and reliable evaluation. There are two updating strategies: 1) a mimicking strategy that generates similar samples based on original data, preserving stylistic and contextual essence, and 2) an extending strategy that further expands existing samples at varying cognitive levels by adapting Bloom’s taxonomy of educational objectives. Extensive experiments on updated MMLU and BIG-Bench demonstrate the stability of the proposed strategies and find that the mimicking strategy can effectively alleviate issues of overestimation from benchmark leakage. In cases where the efficient mimicking strategy fails, our extending strategy still shows promising results. Additionally, by controlling the difficulty, we can better discern the models’ performance and enable fine-grained analysis — neither a too-difficult nor a too-easy exam can fairly judge students’ learning status. To the best of our knowledge, we are the first to automate updating benchmarks for reliable and timely evaluation. Our demo leaderboard can be found at https://yingjiahao14.github.io/Automating-DatasetUpdates/.

RLC Conference 2024 Conference Paper

Learning to Optimize for Reinforcement Learning

  • Qingfeng Lan
  • A. Rupam Mahmood
  • Shuicheng Yan
  • Zhongwen Xu

In recent years, by leveraging more data, computation, and diverse tasks, learned optimizers have achieved remarkable success in supervised learning, outperforming classical hand-designed optimizers. Reinforcement learning (RL) is essentially different from supervised learning, and in practice, these learned optimizers do not work well even in simple RL tasks. We investigate this phenomenon and identify two issues. First, the agent-gradient distribution is non-independent and identically distributed, leading to inefficient meta-training. Moreover, due to highly stochastic agent-environment interactions, the agent-gradients have high bias and variance, which increases the difficulty of learning an optimizer for RL. We propose pipeline training and a novel optimizer structure with a good inductive bias to address these issues, making it possible to learn an optimizer for reinforcement learning from scratch. We show that, although only trained in toy tasks, our learned optimizer can generalize to unseen complex tasks in Brax.

RLJ Journal 2024 Journal Article

Learning to Optimize for Reinforcement Learning

  • Qingfeng Lan
  • A. Rupam Mahmood
  • Shuicheng Yan
  • Zhongwen Xu

In recent years, by leveraging more data, computation, and diverse tasks, learned optimizers have achieved remarkable success in supervised learning, outperforming classical hand-designed optimizers. Reinforcement learning (RL) is essentially different from supervised learning, and in practice, these learned optimizers do not work well even in simple RL tasks. We investigate this phenomenon and identify two issues. First, the agent-gradient distribution is non-independent and identically distributed, leading to inefficient meta-training. Moreover, due to highly stochastic agent-environment interactions, the agent-gradients have high bias and variance, which increases the difficulty of learning an optimizer for RL. We propose pipeline training and a novel optimizer structure with a good inductive bias to address these issues, making it possible to learn an optimizer for reinforcement learning from scratch. We show that, although only trained in toy tasks, our learned optimizer can generalize to unseen complex tasks in Brax.

NeurIPS Conference 2024 Conference Paper

MVGamba: Unify 3D Content Generation as State Space Sequence Modeling

  • Xuanyu Yi
  • Zike Wu
  • Qiuhong Shen
  • Qingshan Xu
  • Pan Zhou
  • Joo-Hwee Lim
  • Shuicheng Yan
  • Xinchao Wang

Recent 3D large reconstruction models (LRMs) can generate high-quality 3D content in sub-seconds by integrating multi-view diffusion models with scalable multi-view reconstructors. Current works further leverage 3D Gaussian Splatting as 3D representation for improved visual quality and rendering efficiency. However, we observe that existing Gaussian reconstruction models often suffer from multi-view inconsistency and blurred textures. We attribute this to the compromise of multi-view information propagation in favor of adopting powerful yet computationally intensive architectures (e.g., Transformers). To address this issue, we introduce MVGamba, a general and lightweight Gaussian reconstruction model featuring a multi-view Gaussian reconstructor based on the RNN-like State Space Model (SSM). Our Gaussian reconstructor propagates causal context containing multi-view information for cross-view self-refinement while generating a long sequence of Gaussians for fine-detail modeling with linear complexity. With off-the-shelf multi-view diffusion models integrated, MVGamba unifies 3D generation tasks from a single image, sparse images, or text prompts. Extensive experiments demonstrate that MVGamba outperforms state-of-the-art baselines in all 3D content generation scenarios with approximately only 0.1× of the model size. The codes are available at https://github.com/SkyworkAI/MVGamba.

ICML Conference 2024 Conference Paper

Non-confusing Generation of Customized Concepts in Diffusion Models

  • Wang Lin
  • Jingyuan Chen
  • Jiaxin Shi
  • Yichen Zhu
  • Chen Liang
  • Junzhong Miao
  • Tao Jin 0004
  • Zhou Zhao 0001

We tackle the common challenge of inter-concept visual confusion in compositional concept generation using text-guided diffusion models (TGDMs). It becomes even more pronounced in the generation of customized concepts, due to the scarcity of user-provided concept visual examples. By revisiting the two major stages leading to the success of TGDMs—1) contrastive image-language pre-training (CLIP) for the text encoder that encodes visual semantics, and 2) training the TGDM that decodes the textual embeddings into pixels—we point out that existing customized generation methods only focus on fine-tuning the second stage while overlooking the first one. To this end, we propose a simple yet effective solution called CLIF: contrastive image-language fine-tuning. Specifically, given a few samples of customized concepts, we obtain non-confusing textual embeddings of a concept by fine-tuning CLIP via contrasting a concept and the over-segmented visual regions of other concepts. Experimental results demonstrate the effectiveness of CLIF in preventing the confusion of multi-customized concept generation. Project page: https://clif-official.github.io/clif.

NeurIPS Conference 2024 Conference Paper

OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding

  • Tao Zhang
  • Xiangtai Li
  • Hao Fei
  • Haobo Yuan
  • Shengqiong Wu
  • Shunping Ji
  • Chen Change Loy
  • Shuicheng Yan

Current universal segmentation methods demonstrate strong capabilities in pixel-level image and video understanding. However, they lack reasoning abilities and cannot be controlled via text instructions. In contrast, large vision-language multimodal models exhibit powerful vision-based conversation and reasoning capabilities but lack pixel-level understanding and have difficulty accepting visual prompts for flexible user interaction. This paper proposes OMG-LLaVA, a new and elegant framework combining powerful pixel-level vision understanding with reasoning abilities. It can accept various visual and text prompts for flexible user interaction. Specifically, we use a universal segmentation method as the visual encoder, integrating image information, perception priors, and visual prompts into visual tokens provided to the LLM. The LLM is responsible for understanding the user's text instructions and providing text responses and pixel-level segmentation results based on the visual information. We propose perception prior embedding to better integrate perception priors with image features. OMG-LLaVA achieves image-level, object-level, and pixel-level reasoning and understanding in a single model, matching or surpassing the performance of specialized methods on multiple benchmarks. Rather than using LLM to connect each specialist, our work aims at end-to-end training on one encoder, one decoder, and one LLM. The code and model have been released for further research.

IJCAI Conference 2024 Conference Paper

Reinforcement Learning from Diverse Human Preferences

  • Wanqi Xue
  • Bo An
  • Shuicheng Yan
  • Zhongwen Xu

The complexity of designing reward functions has been a major obstacle to the wide application of deep reinforcement learning (RL) techniques. Describing an agent's desired behaviors and properties can be difficult, even for experts. A new paradigm called reinforcement learning from human preferences (or preference-based RL) has emerged as a promising solution, in which reward functions are learned from human preference labels among behavior trajectories. However, existing methods for preference-based RL are limited by the need for accurate oracle preference labels. This paper addresses this limitation by developing a method for learning from diverse human preferences. The key idea is to stabilize reward learning through regularization and correction in a latent space. To ensure temporal consistency, a strong constraint is imposed on the reward model that forces its latent space to be close to a non-parameterized distribution. Additionally, a confidence-based reward model ensembling method is designed to generate more stable and reliable predictions. The proposed method is tested on a variety of tasks in DMControl and Meta-world and has shown consistent and significant improvements over existing preference-based RL algorithms when learning from diverse feedback, paving the way for real-world applications of RL methods.
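One plausible reading of the confidence-based ensembling, sketched under the assumption that confidence is taken as inverse predictive variance across ensemble members; the paper's exact design may differ:

```python
# Assumed sketch: ensemble reward prediction with per-sample confidence
# derived from member disagreement; not the paper's exact formulation.
import torch

def ensemble_reward(models, obs, eps=1e-6):
    preds = torch.stack([m(obs) for m in models])           # (M, batch)
    mean = preds.mean(dim=0)                                 # ensemble reward
    conf = 1.0 / (preds.var(dim=0, unbiased=False) + eps)    # high when members agree
    return mean, conf
```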

NeurIPS Conference 2024 Conference Paper

Vitron: A Unified Pixel-level Vision LLM for Understanding, Generating, Segmenting, Editing

  • Hao Fei
  • Shengqiong Wu
  • Hanwang Zhang
  • Tat-Seng Chua
  • Shuicheng Yan

Recent developments in vision large language models (LLMs) have shown remarkable progress, yet these models still encounter challenges towards multimodal generalists, such as coarse-grained instance-level understanding, lack of unified support for both images and videos, and insufficient coverage across various vision tasks. In this paper we present Vitron, a universal pixel-level vision LLM designed for comprehensive understanding, generating, segmenting, and editing of both static images and dynamic videos. Building on top of an LLM backbone, Vitron incorporates encoders for images, videos, and pixel-level regional visuals within its frontend modules, while employing state-of-the-art visual specialists as its backend, via which Vitron supports a spectrum of vision end tasks, spanning visual comprehension to visual generation, from low level to high level. To ensure an effective and precise message passing from LLM to backend modules for function invocation, we propose a novel hybrid method by simultaneously integrating discrete textual instructions and continuous signal embeddings. Further, we design various pixel-level spatiotemporal vision-language alignment learning for Vitron to reach the best fine-grained visual capability. Finally, a cross-task synergy module is devised to learn to maximize the task-invariant fine-grained visual features, enhancing the synergy between different visual tasks. Demonstrated over 12 visual tasks and evaluated across 22 datasets, Vitron showcases its extensive capabilities in the four main vision task clusters. Overall, this work illuminates the great potential of developing a more unified multimodal generalist.

JMLR Journal 2024 Journal Article

Win: Weight-Decay-Integrated Nesterov Acceleration for Faster Network Training

  • Pan Zhou
  • Xingyu Xie
  • Zhouchen Lin
  • Kim-Chuan Toh
  • Shuicheng Yan

Training deep networks on large-scale datasets is computationally challenging. This work explores the problem of “how to accelerate adaptive gradient algorithms in a general manner", and proposes an effective Weight-decay-Integrated Nesterov acceleration (Win) to accelerate adaptive algorithms. Taking AdamW and Adam as examples, per iteration, we construct a dynamical loss that combines the vanilla training loss and a dynamic regularizer inspired by the proximal point method, and respectively minimize the first- and second-order Taylor approximations of the dynamical loss to update the variable. This yields our Win acceleration, which uses a conservative step and an aggressive step to update, and linearly combines these two updates for acceleration. Next, we extend Win into Win2, which uses multiple aggressive update steps for faster convergence. Then we apply Win and Win2 to the popular LAMB and SGD optimizers. Our transparent derivation could provide insights for other accelerated methods and their integration into adaptive algorithms. Besides, we theoretically justify the faster convergence of Win- and Win2-accelerated AdamW, Adam and LAMB over their non-accelerated counterparts. Experimental results demonstrate the faster convergence speed and superior performance of our Win- and Win2-accelerated AdamW, Adam, LAMB and SGD over their vanilla counterparts on vision classification and language modeling tasks.
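Schematically, the conservative/aggressive combination could look like the following AdamW-style sketch; both step sizes, the mixing weight gamma, and the proximal weight-decay placement are illustrative assumptions, not the paper's exact update rule:

```python
# Illustrative Win-style step: one conservative and one aggressive
# AdamW-like update, linearly combined. All constants are assumptions.
import torch

@torch.no_grad()
def win_like_step(p, grad, exp_avg, exp_avg_sq, lr=1e-3, lr_aggr=3e-3,
                  betas=(0.9, 0.999), wd=0.01, gamma=0.5, eps=1e-8):
    exp_avg.mul_(betas[0]).add_(grad, alpha=1 - betas[0])          # first moment
    exp_avg_sq.mul_(betas[1]).addcmul_(grad, grad, value=1 - betas[1])
    update = exp_avg / (exp_avg_sq.sqrt() + eps)                   # Adam direction
    p_cons = (p - lr * update) / (1 + lr * wd)                     # conservative step
    p_aggr = (p - lr_aggr * update) / (1 + lr_aggr * wd)           # aggressive step
    p.copy_(gamma * p_cons + (1 - gamma) * p_aggr)                 # linear combination
```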

ICLR Conference 2023 Conference Paper

Arbitrary Virtual Try-on Network: Characteristics Representation and Trade-off between Body and Clothing

  • Yu Liu 0114
  • Mingbo Zhao
  • Zhao Zhang 0001
  • Jicong Fan 0001
  • Yang Lou
  • Shuicheng Yan

Deep-learning-based virtual try-on systems have achieved some encouraging progress recently, but several big challenges remain to be solved, such as trying on arbitrary clothes of all types, trying on clothes from one category to another, and generating image-realistic results with few artifacts. To handle these issues, we propose the Arbitrary Virtual Try-On Network (AVTON), which handles clothes of all types and can synthesize realistic try-on images by preserving and trading off characteristics of the target clothes and the reference person. Our approach includes three modules: 1) a Limbs Prediction Module, which predicts the human body parts while preserving the characteristics of the reference person. This is especially good for handling cross-category try-on tasks (e.g., long sleeves \(\leftrightarrow\) short sleeves or long pants \(\leftrightarrow\) skirts, etc.), where the exposed arms or legs with the skin colors and details can be reasonably predicted; 2) an Improved Geometric Matching Module, which warps clothes according to the geometry of the target person. We improve the TPS-based warping method with a compactly supported radial function (Wendland's \(\Psi\)-function); 3) a Trade-Off Fusion Module, which trades off the characteristics of the warped clothes and the reference person. This module makes the generated try-on images look more natural and realistic based on a fine-tuning symmetry of the network structure. Extensive simulations are conducted, and our approach achieves better performance compared with the state-of-the-art virtual try-on methods.

ICML Conference 2023 Conference Paper

Bag of Tricks for Training Data Extraction from Language Models

  • Weichen Yu
  • Tianyu Pang
  • Qian Liu 0033
  • Chao Du
  • Bingyi Kang
  • Yan Huang
  • Min Lin
  • Shuicheng Yan

With the advance of language models, privacy protection is receiving more attention. Training data extraction is therefore of great importance, as it can serve as a potential tool to assess privacy leakage. However, due to the difficulty of this task, most of the existing methods are proof-of-concept and still not effective enough. In this paper, we investigate and benchmark tricks for improving training data extraction using a publicly available dataset. Because most existing extraction methods use a pipeline of generating-then-ranking, i.e., generating text candidates as potential training data and then ranking them based on specific criteria, our research focuses on the tricks for both text generation (e.g., sampling strategy) and text ranking (e.g., token-level criteria). The experimental results show that several previously overlooked tricks can be crucial to the success of training data extraction. Based on the GPT-Neo 1.3B evaluation results, our proposed tricks outperform the baseline by a large margin in most cases, providing a much stronger baseline for future research. The code is available at https://github.com/weichen-yu/LM-Extraction.
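
The generate-then-rank pipeline is easy to prototype with Hugging Face transformers. A minimal sketch using GPT-Neo 1.3B (the model evaluated in the paper); the specific tricks shown, top-k sampling and perplexity ranking, are illustrative stand-ins for the ones benchmarked:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("EleutherAI/gpt-neo-1.3B")
model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-neo-1.3B")

prompt = tok("The patient record shows", return_tensors="pt")
# Generation stage: sample many candidate continuations.
cands = model.generate(**prompt, do_sample=True, top_k=40, max_new_tokens=32,
                       num_return_sequences=8, pad_token_id=tok.eos_token_id)

def perplexity(ids):
    # Ranking stage: lower perplexity suggests likely memorized training text.
    with torch.no_grad():
        loss = model(ids.unsqueeze(0), labels=ids.unsqueeze(0)).loss
    return loss.exp().item()

ranked = sorted(cands, key=perplexity)
print(tok.decode(ranked[0], skip_special_tokens=True))
```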

ICLR Conference 2023 Conference Paper

Bag of Tricks for Unsupervised Text-to-Speech

  • Yi Ren 0006
  • Chen Zhang 0020
  • Shuicheng Yan

Unsupervised text-to-speech (TTS) aims to train TTS models for a specific language without any paired speech-text training data in that language. Existing methods either use speech and corresponding pseudo text generated by an unsupervised automatic speech recognition (ASR) model as training data, or employ the back-translation technique. Though effective, they suffer from low robustness to low-quality data and heavy dependence on the lexicon of a language, which is sometimes unavailable, leading to difficulty in convergence, especially in low-resource language scenarios. In this work, we introduce a bag of tricks to enable effective unsupervised TTS. Specifically, 1) we carefully design a voice conversion model to normalize the variable and noisy information in the low-quality speech data while preserving the pronunciation information; 2) we employ a non-autoregressive TTS model to overcome the robustness issue; and 3) we explore several tricks applied in back-translation, including curriculum learning, length augmentation and an auxiliary supervised loss, to stabilize the back-translation and improve its effectiveness. Experiments demonstrate that our method achieves better intelligibility and audio quality than all previous methods, and that these tricks are essential to the performance gain.

ICML Conference 2023 Conference Paper

Better Diffusion Models Further Improve Adversarial Training

  • Zekai Wang
  • Tianyu Pang
  • Chao Du
  • Min Lin
  • Weiwei Liu 0003
  • Shuicheng Yan

It has been recognized that the data generated by the denoising diffusion probabilistic model (DDPM) improves adversarial training. After two years of rapid development in diffusion models, a question naturally arises: can better diffusion models further improve adversarial training? This paper gives an affirmative answer by employing the most recent diffusion model which has higher efficiency ($\sim 20$ sampling steps) and image quality (lower FID score) compared with DDPM. Our adversarially trained models achieve state-of-the-art performance on RobustBench using only generated data (no external datasets). Under the $\ell_\infty$-norm threat model with $\epsilon=8/255$, our models achieve $70.69\%$ and $42.67\%$ robust accuracy on CIFAR-10 and CIFAR-100, respectively, i.e., improving upon previous state-of-the-art models by $+4.58\%$ and $+8.03\%$. Under the $\ell_2$-norm threat model with $\epsilon=128/255$, our models achieve $84.86\%$ on CIFAR-10 ($+4.44\%$). These results also beat previous works that use external data. We also provide compelling results on the SVHN and TinyImageNet datasets. Our code is at https://github.com/wzekai99/DM-Improves-AT.
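
A minimal sketch of the training recipe this line of work builds on: standard L-inf PGD adversarial training, with batches assumed to come from a diffusion-generated dataset. Hyperparameters mirror the stated threat model; the diffusion sampler itself is external to the sketch:

```python
import torch
import torch.nn.functional as F

def pgd_linf(model, x, y, eps=8/255, alpha=2/255, steps=10):
    """Standard L-inf PGD attack used inside the adversarial training loop."""
    x_adv = (x + torch.empty_like(x).uniform_(-eps, eps)).clamp(0, 1)
    for _ in range(steps):
        x_adv = x_adv.detach().requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad, = torch.autograd.grad(loss, x_adv)
        # Ascent step, then project back into the eps-ball and valid range.
        x_adv = x_adv + alpha * grad.sign()
        x_adv = torch.min(torch.max(x_adv, x - eps), x + eps).clamp(0, 1)
    return x_adv.detach()

def at_step(model, opt, x_gen, y_gen):
    """One adversarial training step on a batch of diffusion-generated images."""
    model.train()
    x_adv = pgd_linf(model, x_gen, y_gen)
    loss = F.cross_entropy(model(x_adv), y_gen)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```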

ICLR Conference 2023 Conference Paper

D4FT: A Deep Learning Approach to Kohn-Sham Density Functional Theory

  • Tianbo Li
  • Min Lin
  • Zheyuan Hu 0002
  • Kunhao Zheng
  • Giovanni Vignale
  • Kenji Kawaguchi
  • A. H. Castro Neto
  • Kostya S. Novoselov

Kohn-Sham Density Functional Theory (KS-DFT) has traditionally been solved by the Self-Consistent Field (SCF) method. Behind the SCF loop is the physical intuition of solving a system of non-interacting single-electron wave functions under an effective potential. In this work, we propose a deep learning approach to KS-DFT. First, in contrast to the conventional SCF loop, we propose to directly minimize the total energy by reparameterizing the orthogonal constraint as a feed-forward computation. We prove that such an approach has the same expressivity as the SCF method, yet reduces the computational complexity from O(N^4) to O(N^3). Second, the numerical integration, which involves a summation over the quadrature grids, can be amortized over the optimization steps. At each step, stochastic gradient descent (SGD) is performed with a sampled minibatch of the grids. Extensive experiments are carried out to demonstrate the advantage of our approach in terms of efficiency and stability. In addition, we show that our approach enables us to explore more complex neural-based wave functions.
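
The orthogonality reparameterization can be illustrated in a few lines: map an unconstrained weight matrix to orthonormal orbital coefficients via QR, so plain gradient descent never leaves the constraint set. A sketch of the stated idea, not the paper's code; the minibatch-quadrature helper is likewise illustrative:

```python
import torch

def orthogonal_coeffs(W):
    # Reparameterize the orthogonality constraint as a feed-forward map:
    # any unconstrained matrix W is sent to an orthonormal Q via QR, so
    # SGD on W implicitly keeps the orbital coefficients orthogonal.
    Q, R = torch.linalg.qr(W)
    # Fix the sign ambiguity of QR so the map is deterministic.
    return Q * torch.sign(torch.diagonal(R)).unsqueeze(0)

def sample_grid_minibatch(grid_points, batch_size=4096):
    # Amortize the quadrature: sample a minibatch of grid points per step,
    # so the energy integral is estimated stochastically during SGD.
    idx = torch.randint(grid_points.shape[0], (batch_size,))
    return grid_points[idx]
```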

ICLR Conference 2023 Conference Paper

Distributional Meta-Gradient Reinforcement Learning

  • Haiyan Yin
  • Shuicheng Yan
  • Zhongwen Xu

Meta-gradient reinforcement learning (RL) algorithms have substantially boosted the performance of RL agents by learning an adaptive return. All the existing algorithms adhere to the same reward learning principle, where the adaptive return is simply formulated in the form of expected cumulative rewards, upon which the policy and critic update rules are specified under well-adopted distance metrics. In this paper, we present a novel algorithm that builds on the success of meta-gradient RL algorithms and effectively improves such algorithms by following a simple recipe, i.e., going beyond the expected return to formulate and learn the return in a more expressive form, value distributions. To this end, we first formulate a distributional return that could effectively capture bootstrapping and discounting behaviors over distributions, to form an informative distributional return target in value update. Then we derive an efficient meta update rule to learn the adaptive distributional return with meta-gradients. For empirical evaluation, we first present an illustrative example on a toy two-color grid-world domain, which validates the benefit of learning distributional return over expectation; then we conduct extensive comparisons on a large-scale RL benchmark Atari 2600, where we confirm that our proposed method with distributional return works seamlessly well with the actor-critic framework and leads to state-of-the-art median human normalized score among meta-gradient RL literature.

NeurIPS Conference 2023 Conference Paper

Efficient Diffusion Policies For Offline Reinforcement Learning

  • Bingyi Kang
  • Xiao Ma
  • Chao Du
  • Tianyu Pang
  • Shuicheng Yan

Offline reinforcement learning (RL) aims to learn optimal policies from offline datasets, where the parameterization of policies is crucial but often overlooked. Recently, Diffusion-QL significantly boosted the performance of offline RL by representing a policy with a diffusion model, whose success relies on a parametrized Markov chain with hundreds of steps for sampling. However, Diffusion-QL suffers from two critical limitations. 1) It is computationally inefficient to forward and backward through the whole Markov chain during training. 2) It is incompatible with maximum-likelihood-based RL algorithms (e.g., policy gradient methods), as the likelihood of diffusion models is intractable. Therefore, we propose efficient diffusion policy (EDP) to overcome these two challenges. EDP approximately constructs actions from corrupted ones at training time to avoid running the sampling chain. We conduct extensive experiments on the D4RL benchmark. The results show that EDP can reduce the diffusion policy training time from 5 days to 5 hours on gym-locomotion tasks. Moreover, we show that EDP is compatible with various offline RL algorithms (TD3, CRR, and IQL) and achieves new state-of-the-art results on D4RL by large margins over previous methods.
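
A sketch of the one-step action construction idea, assuming a noise-prediction network eps_net(state, noisy_action, t) and a standard DDPM cumulative schedule; this is the generic closed-form inversion of the forward diffusion, offered only as an illustration of how the multi-step sampling chain is avoided during training:

```python
import torch

def edp_action_approx(eps_net, state, action_noisy, t, alphas_cumprod):
    """Recover an approximate clean action from a corrupted one in one step,
    instead of running the full reverse sampling chain (illustrative)."""
    eps = eps_net(state, action_noisy, t)   # assumed noise-prediction network
    a_bar = alphas_cumprod[t]               # cumulative noise schedule at t
    return (action_noisy - (1 - a_bar) ** 0.5 * eps) / a_bar ** 0.5
```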

ICLR Conference 2023 Conference Paper

Efficient Offline Policy Optimization with a Learned Model

  • Zichen Liu
  • Siyi Li
  • Wee Sun Lee
  • Shuicheng Yan
  • Zhongwen Xu

MuZero Unplugged presents a promising approach for offline policy learning from logged data. It conducts Monte-Carlo Tree Search (MCTS) with a learned model and leverages the Reanalyze algorithm to learn purely from offline data. For good performance, MCTS requires accurate learned models and a large number of simulations, and thus incurs huge computing time. This paper investigates a few hypotheses about where MuZero Unplugged may not work well under offline RL settings, including 1) learning with limited data coverage; 2) learning from offline data of stochastic environments; 3) improperly parameterized models given the offline data; and 4) learning with a low compute budget. We propose to use a regularized one-step look-ahead approach to tackle the above issues. Instead of planning with the expensive MCTS, we use the learned model to construct an advantage estimation based on a one-step rollout. Policy improvements are towards the direction that maximizes the estimated advantage, with regularization towards the dataset. We conduct extensive empirical studies with BSuite environments to verify the hypotheses and then run our algorithm on the RL Unplugged Atari benchmark. Experimental results show that our proposed approach achieves stable performance even with an inaccurate learned model. On the large-scale Atari benchmark, the proposed method outperforms MuZero Unplugged by 43%. Most significantly, it uses only 5.6% wall-clock time (i.e., 1 hour) compared to MuZero Unplugged (i.e., 17.8 hours) to achieve a 150% IQM normalized score with the same hardware and software stacks. Our implementation is open-sourced at https://github.com/sail-sg/rosmo.
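
A sketch of the one-step look-ahead at the core of the method; `model.value` and `model.dynamics` are assumed interfaces for the learned model, and the discount factor is illustrative:

```python
import torch

def onestep_advantage(model, obs, actions, gamma=0.99):
    """Illustrative one-step look-ahead advantage (in the spirit of the paper):
    instead of MCTS, roll the learned model one step per candidate action and
    score it against the value baseline; the policy is then improved towards
    high-advantage actions, with regularization towards the dataset."""
    v = model.value(obs)                        # baseline V(s)
    next_obs, r = model.dynamics(obs, actions)  # learned one-step rollout
    q = r + gamma * model.value(next_obs)       # bootstrapped Q estimate
    return q - v                                # advantage for the update
```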

NeurIPS Conference 2023 Conference Paper

Gaussian Mixture Solvers for Diffusion Models

  • Hanzhong Guo
  • Cheng Lu
  • Fan Bao
  • Tianyu Pang
  • Shuicheng Yan
  • Chao Du
  • Chongxuan Li

Recently, diffusion models have achieved great success in generative tasks. Sampling from diffusion models is equivalent to solving the reverse diffusion stochastic differential equations (SDEs) or the corresponding probability flow ordinary differential equations (ODEs). In comparison, SDE-based solvers can generate samples of higher quality and are suited for image translation tasks like stroke-based synthesis. During inference, however, existing SDE-based solvers are severely constrained by the efficiency-effectiveness dilemma. Our investigation suggests that this is because the Gaussian assumption in the reverse transition kernel is frequently violated (even in the case of simple mixture data) given a limited number of discretization steps. To overcome this limitation, we introduce a novel class of SDE-based solvers called Gaussian Mixture Solvers (GMS) for diffusion models. Our solver estimates the first three orders of moments and optimizes the parameters of a Gaussian mixture transition kernel using the generalized method of moments in each step during sampling. Empirically, our solver outperforms numerous SDE-based solvers in terms of sample quality in image generation and stroke-based synthesis across various diffusion models, which validates the motivation and effectiveness of GMS. Our code is available at https://github.com/Guohanzhong/GMS.

ICLR Conference 2023 Conference Paper

LPT: Long-tailed Prompt Tuning for Image Classification

  • Bowen Dong 0001
  • Pan Zhou 0002
  • Shuicheng Yan
  • Wangmeng Zuo

For long-tailed classification tasks, most works pretrain a big model on a large-scale (unlabeled) dataset, and then fine-tune the whole pretrained model to adapt to long-tailed data. Though promising, fine-tuning the whole pretrained model tends to suffer from high cost in computation and in deployment of different models for different tasks, as well as weakened generalization capability due to overfitting to certain features of long-tailed data. To alleviate these issues, we propose an effective Long-tailed Prompt Tuning (LPT) method for long-tailed classification tasks. LPT introduces several trainable prompts into a frozen pretrained model to adapt it to long-tailed data. For better effectiveness, we divide the prompts into two groups: 1) a shared prompt for the whole long-tailed dataset to learn general features and to adapt the pretrained model into the target long-tailed domain; and 2) group-specific prompts to gather group-specific features for samples which have similar features, and also to empower the pretrained model with fine-grained discrimination ability. Then we design a two-phase training paradigm to learn these prompts. In the first phase, we train the shared prompt via conventional supervised prompt tuning to adapt the pretrained model to the desired long-tailed domain. In the second phase, we use the learnt shared prompt as a query to select a small, best-matched set for a group of similar samples from the group-specific prompt set to mine the common features of these similar samples, and then optimize these prompts with a dual sampling strategy and the asymmetric Gaussian Clouded Logit loss. By fine-tuning only a few prompts while fixing the pretrained model, LPT can reduce training cost and deployment cost by storing a few prompts, and enjoys the strong generalization ability of the pretrained model. Experiments show that on various long-tailed benchmarks, with only $\sim$1.1\% extra trainable parameters, LPT achieves comparable or higher performance than previous whole-model fine-tuning methods, and is more robust to domain shift.
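
The first training phase is easy to picture: only a shared prompt and a classifier head train while the backbone stays frozen. A minimal sketch, assuming the encoder accepts a pre-embedded token sequence; group-specific prompts and the dual sampling strategy are omitted:

```python
import torch
import torch.nn as nn

class SharedPromptTuner(nn.Module):
    """Phase-1 LPT sketch: a shared prompt and head train, backbone frozen."""
    def __init__(self, encoder, n_prompts=10, dim=768, n_classes=1000):
        super().__init__()
        self.encoder = encoder.eval()
        for p in self.encoder.parameters():
            p.requires_grad_(False)              # freeze the pretrained model
        self.prompt = nn.Parameter(torch.zeros(1, n_prompts, dim))
        self.head = nn.Linear(dim, n_classes)

    def forward(self, tokens):                    # tokens: (B, L, dim) embeds
        # Prepend the trainable shared prompt to the frozen patch embeddings.
        x = torch.cat([self.prompt.expand(tokens.size(0), -1, -1), tokens], 1)
        feats = self.encoder(x)                   # assumes embedding input
        return self.head(feats.mean(dim=1))
```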

NeurIPS Conference 2023 Conference Paper

Mutual Information Regularized Offline Reinforcement Learning

  • Xiao Ma
  • Bingyi Kang
  • Zhongwen Xu
  • Min Lin
  • Shuicheng Yan

The major challenge of offline RL is the distribution shift that appears when out-of-distribution actions are queried, which makes the policy improvement direction biased by extrapolation errors. Most existing methods address this problem by penalizing the policy or value for deviating from the behavior policy during policy improvement or evaluation. In this work, we propose a novel MISA framework to approach offline RL from the perspective of Mutual Information between States and Actions in the dataset by directly constraining the policy improvement direction. MISA constructs lower bounds of mutual information parameterized by the policy and Q-values. We show that optimizing this lower bound is equivalent to maximizing the likelihood of a one-step improved policy on the offline dataset. Hence, we constrain the policy improvement direction to lie in the data manifold. The resulting algorithm simultaneously augments policy evaluation and improvement by adding mutual information regularizations. MISA is a general framework that unifies conservative Q-learning (CQL) and behavior regularization methods (e.g., TD3+BC) as special cases. We introduce 3 different variants of MISA, and empirically demonstrate that a tighter mutual information lower bound gives better offline RL performance. In addition, our extensive experiments show MISA significantly outperforms a wide range of baselines on various tasks of the D4RL benchmark, e.g., achieving 742.9 total points on gym-locomotion tasks. Our code is attached and will be released upon publication.

ICML Conference 2023 Conference Paper

Nonparametric Generative Modeling with Conditional Sliced-Wasserstein Flows

  • Chao Du
  • Tianbo Li
  • Tianyu Pang
  • Shuicheng Yan
  • Min Lin

Sliced-Wasserstein Flow (SWF) is a promising approach to nonparametric generative modeling but has not been widely adopted due to its suboptimal generative quality and lack of conditional modeling capabilities. In this work, we make two major contributions to bridging this gap. First, based on a pleasant observation that (under certain conditions) the SWF of joint distributions coincides with those of conditional distributions, we propose Conditional Sliced-Wasserstein Flow (CSWF), a simple yet effective extension of SWF that enables nonparametric conditional modeling. Second, we introduce appropriate inductive biases of images into SWF with two techniques inspired by local connectivity and multiscale representation in vision research, which greatly improve the efficiency and quality of modeling images. With all the improvements, we achieve generative performance comparable with many deep parametric generative models on both conditional and unconditional tasks in a purely nonparametric fashion, demonstrating its great potential.
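
For intuition, a generic (unconditional) sliced-Wasserstein flow particle update looks roughly like the sketch below; it omits the paper's conditional extension and image-specific inductive biases, and assumes the particle and target sets are the same size:

```python
import torch

def sw_flow_step(particles, target, n_proj=64, lr=0.1):
    """One illustrative SWF step: push particles along random 1-D projections
    toward the target sample's sorted quantiles (same set sizes assumed)."""
    d = particles.shape[1]
    theta = torch.randn(n_proj, d)
    theta = theta / theta.norm(dim=1, keepdim=True)   # random unit directions
    update = torch.zeros_like(particles)
    for t in theta:
        p_proj = particles @ t
        q_proj = target @ t
        # Match sorted order statistics along this projection.
        p_sorted, p_idx = p_proj.sort()
        q_sorted, _ = q_proj.sort()
        displacement = q_sorted - p_sorted
        update[p_idx] += displacement.unsqueeze(1) * t / n_proj
    return particles + lr * update
```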

NeurIPS Conference 2023 Conference Paper

On Calibrating Diffusion Probabilistic Models

  • Tianyu Pang
  • Cheng Lu
  • Chao Du
  • Min Lin
  • Shuicheng Yan
  • Zhijie Deng

Recently, diffusion probabilistic models (DPMs) have achieved promising results in diverse generative tasks. A typical DPM framework includes a forward process that gradually diffuses the data distribution and a reverse process that recovers the data distribution from time-dependent data scores. In this work, we observe that the stochastic reverse process of data scores is a martingale, from which concentration bounds and the optional stopping theorem for data scores can be derived. Then, we discover a simple way for calibrating an arbitrary pretrained DPM, with which the score matching loss can be reduced and the lower bounds of model likelihood can consequently be increased. We provide general calibration guidelines under various model parametrizations. Our calibration method is performed only once and the resulting models can be used repeatedly for sampling. We conduct experiments on multiple datasets to empirically validate our proposal. Our code is available at https://github.com/thudzj/Calibrated-DPMs.

ICLR Conference 2023 Conference Paper

Revisiting Intrinsic Reward for Exploration in Procedurally Generated Environments

  • Kaixin Wang
  • Kuangqi Zhou
  • Bingyi Kang
  • Jiashi Feng
  • Shuicheng Yan

Exploration under sparse rewards remains a key challenge in deep reinforcement learning. Recently, studying exploration in procedurally-generated environments has drawn increasing attention. Existing works generally combine lifelong intrinsic rewards and episodic intrinsic rewards to encourage exploration. Though various lifelong and episodic intrinsic rewards have been proposed, the individual contributions of the two kinds of intrinsic rewards to improving exploration are barely investigated. To bridge this gap, we disentangle these two parts and conduct ablative experiments. We consider lifelong and episodic intrinsic rewards used in prior works, and compare the performance of all lifelong-episodic combinations on the commonly used MiniGrid benchmark. Experimental results show that using only episodic intrinsic rewards can match or surpass prior state-of-the-art methods. On the other hand, using only lifelong intrinsic rewards hardly makes progress in exploration. This demonstrates that the episodic intrinsic reward is more crucial than the lifelong one in boosting exploration. Moreover, we find through experimental analysis that the lifelong intrinsic reward does not accurately reflect the novelty of states, which explains why it does not help much in improving exploration.
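
An episodic intrinsic reward of the kind compared here can be as simple as a within-episode visit-count bonus. A minimal sketch, assuming states can be reduced to hashable keys (as in MiniGrid):

```python
from collections import Counter

class EpisodicCountBonus:
    """Minimal episodic intrinsic reward: 1/sqrt(N(s)) within an episode.
    A simplified stand-in for the episodic bonuses compared in the paper."""
    def __init__(self):
        self.counts = Counter()

    def reset(self):
        # Counts are wiped at every episode start: the bonus is episodic.
        self.counts.clear()

    def bonus(self, state_key):
        self.counts[state_key] += 1
        return self.counts[state_key] ** -0.5
```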

ICLR Conference 2023 Conference Paper

RPM: Generalizable Multi-Agent Policies for Multi-Agent Reinforcement Learning

  • Wei Qiu 0001
  • Xiao Ma 0006
  • Bo An 0001
  • Svetlana Obraztsova
  • Shuicheng Yan
  • Zhongwen Xu

Despite the recent advancement in multi-agent reinforcement learning (MARL), MARL agents easily overfit the training environment and perform poorly in evaluation scenarios where other agents behave differently. Obtaining generalizable policies for MARL agents is thus necessary but challenging, mainly due to complex multi-agent interactions. In this work, we model the MARL problem with Markov Games and propose a simple yet effective method, called ranked policy memory (RPM), i.e., maintaining a look-up memory of policies to achieve good generalizability. The main idea of RPM is to train MARL policies by gathering massive multi-agent interaction data. In particular, we first rank each agent’s policies by its training episode return, i.e., the episode return of each agent in the training environment; we then save the ranked policies in the memory; when an episode starts, each agent can randomly select a policy from the RPM as the behavior policy. Each agent uses the behavior policy to gather multi-agent interaction data for MARL training. This innovative self-play framework guarantees the diversity of multi-agent interaction in the training data. Experimental results on Melting Pot demonstrate that RPM enables MARL agents to interact with unseen agents in multi-agent generalization evaluation scenarios and complete given tasks. It significantly boosts performance by up to 818% on average.
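
The core loop of a ranked policy memory is small enough to sketch directly; the checkpoint format and the uniform sampling rule below are illustrative simplifications:

```python
import random

class RankedPolicyMemory:
    """Sketch of RPM's core loop: save policy checkpoints keyed by training
    episode return, then sample past checkpoints as behavior policies to
    diversify the multi-agent interactions collected for training."""
    def __init__(self):
        self.memory = []  # list of (episode_return, policy_state) pairs

    def save(self, ep_return, policy_state):
        self.memory.append((ep_return, policy_state))
        self.memory.sort(key=lambda x: x[0])   # keep policies ranked by return

    def sample_behavior(self):
        # Uniformly pick any ranked checkpoint as this episode's behavior policy.
        return random.choice(self.memory)[1]
```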

NeurIPS Conference 2023 Conference Paper

ScaleLong: Towards More Stable Training of Diffusion Model via Scaling Network Long Skip Connection

  • Zhongzhan Huang
  • Pan Zhou
  • Shuicheng Yan
  • Liang Lin

In diffusion models, UNet is the most popular network backbone, since its long skip connections (LSCs) between distant network blocks can aggregate long-distance information and alleviate vanishing gradients. Unfortunately, UNet often suffers from unstable training in diffusion models, which can be alleviated by scaling down its LSC coefficients. However, theoretical understandings of the instability of UNet in diffusion models and of the performance improvement from LSC scaling remain absent. To solve this issue, we theoretically show that the coefficients of LSCs in UNet have a big effect on the stability of forward and backward propagation and on the robustness of UNet. Specifically, the hidden features and gradients of UNet at any layer can oscillate, and their oscillation ranges are actually large, which explains the instability of UNet training. Moreover, UNet is provably sensitive to perturbed input and predicts an output distant from the desired output, yielding an oscillatory loss and thus oscillatory gradients. Besides, we also show the theoretical benefits of LSC coefficient scaling for the stability of hidden features and gradients and for robustness. Finally, inspired by our theory, we propose an effective coefficient scaling framework, ScaleLong, that scales the coefficients of LSCs in UNet and better improves the training stability of UNet. Experimental results on CIFAR10, CelebA, ImageNet and COCO show that our methods are superior in stabilizing training, and yield about 1.5x training acceleration on different diffusion models with UNet or UViT backbones.
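
The intervention itself is a one-liner on a UNet long skip connection: damp the skip branch by a constant coefficient before merging. A minimal sketch, with the coefficient value chosen for illustration:

```python
import torch
import torch.nn as nn

class ScaledSkip(nn.Module):
    """Constant coefficient scaling on a UNet long skip connection (sketch).
    kappa < 1 damps the skip branch, which narrows the oscillation range of
    features/gradients and stabilizes training, per the paper's analysis."""
    def __init__(self, kappa=0.7):
        super().__init__()
        self.kappa = kappa

    def forward(self, decoder_feat, skip_feat):
        # Merge the damped long-skip feature with the decoder feature.
        return torch.cat([decoder_feat, self.kappa * skip_feat], dim=1)
```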

ICLR Conference 2023 Conference Paper

Spikformer: When Spiking Neural Network Meets Transformer

  • Zhaokun Zhou
  • Yuesheng Zhu
  • Chao He
  • Yaowei Wang 0001
  • Shuicheng Yan
  • Yonghong Tian 0001
  • Li Yuan 0007

We consider two biologically plausible structures, the Spiking Neural Network (SNN) and the self-attention mechanism. The former offers an energy-efficient and event-driven paradigm for deep learning, while the latter has the ability to capture feature dependencies, enabling Transformer to achieve good performance. It is intuitively promising to explore the marriage between them. In this paper, we consider leveraging both the self-attention capability and the biological properties of SNNs, and propose a novel Spiking Self Attention (SSA) as well as a powerful framework, named Spiking Transformer (Spikformer). The SSA mechanism in Spikformer models sparse visual features using spike-form Query, Key, and Value without softmax. Since its computation is sparse and avoids multiplication, SSA is efficient and has low computational energy consumption. It is shown that Spikformer with SSA can outperform state-of-the-art SNN-like frameworks in image classification on both neuromorphic and static datasets. Spikformer (66.3M parameters), with a size comparable to SEW-ResNet-152 (60.2M, 69.26%), can achieve 74.81% top-1 accuracy on ImageNet using 4 time steps, which is the state-of-the-art among directly trained SNN models. Code is available at https://github.com/ZK-Zhou/spikformer.
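
The softmax-free attention is simple to sketch on spike-form (0/1) tensors; the spike-generating neuron (e.g., a LIF layer with surrogate gradients) is omitted, and the scale constant is illustrative:

```python
import torch

def spiking_self_attention(Q, K, V, scale=0.125):
    """Sketch of softmax-free SSA on spike-form (0/1) Q, K, V tensors.
    With binary inputs, Q K^T V reduces to sparse additions, which is where
    the claimed energy savings come from; no softmax is needed because the
    attention scores are already non-negative."""
    attn = Q @ K.transpose(-2, -1)   # non-negative scores, no softmax
    return (attn @ V) * scale        # scaling keeps magnitudes in range

# Example with random binary spikes: (batch, tokens, dim)
Q = torch.randint(0, 2, (2, 16, 32)).float()
K = torch.randint(0, 2, (2, 16, 32)).float()
V = torch.randint(0, 2, (2, 16, 32)).float()
out = spiking_self_attention(Q, K, V)
```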

ICLR Conference 2023 Conference Paper

Towards Understanding Why Mask Reconstruction Pretraining Helps in Downstream Tasks

  • Jiachun Pan
  • Pan Zhou 0002
  • Shuicheng Yan

For unsupervised pretraining, mask-reconstruction pretraining (MRP) approaches, e.g., MAE and data2vec, randomly mask input patches and then reconstruct the pixels or semantic features of these masked patches via an auto-encoder. Then, for a downstream task, supervised fine-tuning of the pretrained encoder remarkably surpasses conventional "supervised learning" (SL) trained from scratch. However, it is still unclear 1) how MRP performs semantic (feature) learning in the pretraining phase, and 2) why it helps in downstream tasks. To answer these questions, we first theoretically show that, on an auto-encoder with a two/one-layered convolutional encoder/decoder, MRP can capture all discriminative semantics of each potential semantic class in the pretraining dataset. Then, considering that the pretraining dataset is of huge size and high diversity and thus covers most semantics in the downstream dataset, we show that in the fine-tuning phase the pretrained encoder can capture as many semantics as it can in downstream datasets, and would not lose these semantics, with theoretical guarantees. In contrast, SL only randomly captures some semantics, due to the lottery ticket hypothesis. So MRP provably achieves better performance than SL on classification tasks. Experimental results testify to our data assumptions and also to our theoretical implications.

AAAI Conference 2023 Conference Paper

Value-Consistent Representation Learning for Data-Efficient Reinforcement Learning

  • Yang Yue
  • Bingyi Kang
  • Zhongwen Xu
  • Gao Huang
  • Shuicheng Yan

Deep reinforcement learning (RL) algorithms suffer severe performance degradation when the interaction data is scarce, which limits their real-world application. Recently, visual representation learning has been shown to be effective and promising for boosting sample efficiency in RL. These methods usually rely on contrastive learning and data augmentation to train a transition model, which is different from how the model is used in RL---performing value-based planning. Accordingly, the learned representation by these visual methods may be good for recognition but not optimal for estimating state value and solving the decision problem. To address this issue, we propose a novel method, called value-consistent representation learning (VCR), to learn representations that are directly related to decision-making. More specifically, VCR trains a model to predict the future state (also referred to as the "imagined state") based on the current one and a sequence of actions. Instead of aligning this imagined state with a real state returned by the environment, VCR applies a Q value head on both of the states and obtains two distributions of action values. Then a distance is computed and minimized to force the imagined state to produce a similar action value prediction as that by the real state. We develop two implementations of the above idea for the discrete and continuous action spaces respectively. We conduct experiments on Atari 100k and DeepMind Control Suite benchmarks to validate their effectiveness for improving sample efficiency. It has been demonstrated that our methods achieve new state-of-the-art performance for search-free RL algorithms.

ICLR Conference 2023 Conference Paper

Visual Imitation Learning with Patch Rewards

  • Minghuan Liu
  • Tairan He
  • Weinan Zhang 0001
  • Shuicheng Yan
  • Zhongwen Xu

Visual imitation learning enables reinforcement learning agents to learn how to behave from expert visual demonstrations such as videos or image sequences, without explicit, well-defined rewards. Previous research either adopts supervised learning techniques or induces simple and coarse scalar rewards from pixels, neglecting the dense information contained in the image demonstrations. In this work, we propose to measure the expertise of various local regions of image samples, called patches, and recover multi-dimensional patch rewards accordingly. Patch rewards are a more precise rewarding characterization that serves as a fine-grained expertise measurement and visual explainability tool. Specifically, we present Adversarial Imitation Learning with Patch Rewards (PatchAIL), which employs a patch-based discriminator to measure the expertise of different local parts of given images and provide patch rewards. The patch-based knowledge is also used to regularize the aggregated reward and stabilize training. We evaluate our method on the standard pixel-based benchmark DeepMind Control Suite. The experimental results demonstrate that PatchAIL outperforms baseline methods and provides valuable interpretations for visual demonstrations.

ICLR Conference 2023 Conference Paper

Win: Weight-Decay-Integrated Nesterov Acceleration for Adaptive Gradient Algorithms

  • Pan Zhou 0002
  • Xingyu Xie
  • Shuicheng Yan

Training deep networks on large-scale datasets is computationally challenging. In this work, we explore the problem of "how to accelerate adaptive gradient algorithms in a general manner", and aim to provide practical efficiency-boosting insights. To this end, we propose an effective and general Weight-decay-Integrated Nesterov acceleration (Win) to accelerate adaptive algorithms. Taking AdamW and Adam as examples, we minimize a dynamical loss per iteration which combines the vanilla training loss and a dynamic regularizer inspired by the proximal point method (PPM) to improve the convexity of the problem. To introduce Nesterov-like acceleration into AdamW and Adam, we respectively use the first- and second-order Taylor approximations of the vanilla loss to update the variable twice. In this way, we arrive at our Win acceleration for AdamW and Adam that uses a conservative step and a reckless step to update twice and then linearly combines these two updates for acceleration. Next, we extend Win acceleration to LAMB and SGD. Our transparent acceleration derivation could provide insights for other accelerated methods and their integration into adaptive algorithms. Besides, we prove the convergence of Win-accelerated adaptive algorithms and justify their convergence superiority over their non-accelerated counterparts by taking AdamW and Adam as examples. Experimental results testify to the faster convergence speed and superior performance of our Win-accelerated AdamW, Adam, LAMB and SGD over their non-accelerated counterparts on vision classification tasks and language modeling tasks with both CNN and Transformer backbones. We hope Win shall be a default acceleration option for popular optimizers in the deep learning community to improve training efficiency. Code will be released at https://github.com/sail-sg/win.

NeurIPS Conference 2022 Conference Paper

EnvPool: A Highly Parallel Reinforcement Learning Environment Execution Engine

  • Jiayi Weng
  • Min Lin
  • Shengyi Huang
  • Bo Liu
  • Denys Makoviichuk
  • Viktor Makoviychuk
  • Zichen Liu
  • Yufan Song

There has been significant progress in developing reinforcement learning (RL) training systems. Past works such as IMPALA, Apex, Seed RL, Sample Factory, and others, aim to improve the system's overall throughput. In this paper, we aim to address a common bottleneck in the RL training system, i. e. , parallel environment execution, which is often the slowest part of the whole system but receives little attention. With a curated design for paralleling RL environments, we have improved the RL environment simulation speed across different hardware setups, ranging from a laptop and a modest workstation, to a high-end machine such as NVIDIA DGX-A100. On a high-end machine, EnvPool achieves one million frames per second for the environment execution on Atari environments and three million frames per second on MuJoCo environments. When running EnvPool on a laptop, the speed is 2. 8x that of the Python subprocess. Moreover, great compatibility with existing RL training libraries has been demonstrated in the open-sourced community, including CleanRL, rl_games, DeepMind Acme, etc. Finally, EnvPool allows researchers to iterate their ideas at a much faster pace and has great potential to become the de facto RL environment execution engine. Example runs show that it only takes five minutes to train agents to play Atari Pong and MuJoCo Ant on a laptop. EnvPool is open-sourced at https: //github. com/sail-sg/envpool.
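
Usage follows the gym-style batched API from the project README of that period (assuming `pip install envpool` with Atari support):

```python
import numpy as np
import envpool

# One vectorized handle drives 8 Atari environments in C++ threads.
env = envpool.make("Pong-v5", env_type="gym", num_envs=8)
obs = env.reset()                              # batched: leading axis of 8
for _ in range(100):
    act = np.random.randint(env.action_space.n, size=8)
    obs, rew, done, info = env.step(act)       # all outputs are batched too
```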

NeurIPS Conference 2022 Conference Paper

Inception Transformer

  • Chenyang Si
  • Weihao Yu
  • Pan Zhou
  • Yichen Zhou
  • Xinchao Wang
  • Shuicheng Yan

Recent studies show that the transformer has a strong capability of building long-range dependencies, yet is incompetent in capturing high frequencies that predominantly convey local information. To tackle this issue, we present a novel and general-purpose Inception Transformer, or iFormer for short, that effectively learns comprehensive features with both high- and low-frequency information in visual data. Specifically, we design an Inception mixer to explicitly graft the advantages of convolution and max-pooling for capturing high-frequency information onto transformers. Different from recent hybrid frameworks, the Inception mixer brings greater efficiency through a channel splitting mechanism that adopts parallel convolution/max-pooling paths and a self-attention path as high- and low-frequency mixers, while having the flexibility to model discriminative information scattered within a wide frequency range. Considering that bottom layers play a larger role in capturing high-frequency details while top layers play a larger role in modeling low-frequency global information, we further introduce a frequency ramp structure, i.e., gradually decreasing the dimensions fed to the high-frequency mixer and increasing those fed to the low-frequency mixer, which can effectively trade off high- and low-frequency components across different layers. We benchmark the iFormer on a series of vision tasks, and showcase that it achieves impressive performance on image classification, COCO detection and ADE20K segmentation. For example, our iFormer-S hits a top-1 accuracy of 83.4% on ImageNet-1K, higher than DeiT-S by 3.6%, and even slightly better than the much bigger model Swin-B (83.3%) with only 1/4 of the parameters and 1/3 of the FLOPs. Code and models are released at https://github.com/sail-sg/iFormer.

ICML Conference 2022 Conference Paper

Robustness and Accuracy Could Be Reconcilable by (Proper) Definition

  • Tianyu Pang
  • Min Lin
  • Xiao Yang 0028
  • Jun Zhu 0001
  • Shuicheng Yan

The trade-off between robustness and accuracy has been widely studied in the adversarial literature. Although still controversial, the prevailing view is that this trade-off is inherent, either empirically or theoretically. Thus, we dig for the origin of this trade-off in adversarial training and find that it may stem from the improperly defined robust error, which imposes an inductive bias of local invariance — an overcorrection towards smoothness. Given this, we advocate employing local equivariance to describe the ideal behavior of a robust model, leading to a self-consistent robust error named SCORE. By definition, SCORE facilitates the reconciliation between robustness and accuracy, while still handling the worst-case uncertainty via robust optimization. By simply substituting KL divergence with variants of distance metrics, SCORE can be efficiently minimized. Empirically, our models achieve top-rank performance on RobustBench under AutoAttack. Besides, SCORE provides instructive insights for explaining the overfitting phenomenon and semantic input gradients observed on robust models.
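
One way to read the abstract's "substituting KL divergence with variants of distance metrics" is a TRADES-like objective whose robust term penalizes a distance between clean and adversarial predictions (local equivariance) rather than pulling the adversarial output toward a fixed clean target (local invariance). The sketch below uses a squared-error distance as one such variant; the adversarial example generation is external, and lambda is illustrative:

```python
import torch
import torch.nn.functional as F

def score_style_loss(model, x, x_adv, y, lam=6.0):
    """Illustrative SCORE-flavored objective: clean cross-entropy plus a
    distance between clean and adversarial prediction distributions."""
    clean_logits = model(x)
    adv_logits = model(x_adv)
    ce = F.cross_entropy(clean_logits, y)
    # Squared-error distance between predictions, replacing KL divergence.
    robust = F.mse_loss(F.softmax(adv_logits, dim=-1),
                        F.softmax(clean_logits, dim=-1))
    return ce + lam * robust
```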

NeurIPS Conference 2021 Conference Paper

Direct Multi-view Multi-person 3D Pose Estimation

  • Tao Wang
  • Jianfeng Zhang
  • Yujun Cai
  • Shuicheng Yan
  • Jiashi Feng

We present the Multi-view Pose transformer (MvP) for estimating multi-person 3D poses from multi-view images. Instead of estimating 3D joint locations from costly volumetric representations or reconstructing the per-person 3D pose from multiple detected 2D poses as in previous methods, MvP directly regresses the multi-person 3D poses in a clean and efficient way, without relying on intermediate tasks. Specifically, MvP represents skeleton joints as learnable query embeddings and lets them progressively attend to and reason over the multi-view information from the input images to directly regress the actual 3D joint locations. To improve the accuracy of such a simple pipeline, MvP presents a hierarchical scheme to concisely represent query embeddings of multi-person skeleton joints and introduces an input-dependent query adaptation approach. Further, MvP designs a novel geometrically guided attention mechanism, called projective attention, to more precisely fuse the cross-view information for each joint. MvP also introduces a RayConv operation to integrate the view-dependent camera geometry into the feature representations for augmenting the projective attention. We show experimentally that our MvP model outperforms the state-of-the-art methods on several benchmarks while being much more efficient. Notably, it achieves 92.3% AP25 on the challenging Panoptic dataset, improving upon the previous best approach [35] by 9.8%. MvP is general and also extendable to recovering human meshes represented by the SMPL model, and is thus useful for modeling multi-person body shapes. Code and models are available at https://github.com/sail-sg/mvp.

NeurIPS Conference 2021 Conference Paper

How Should Pre-Trained Language Models Be Fine-Tuned Towards Adversarial Robustness?

  • Xinshuai Dong
  • Anh Tuan Luu
  • Min Lin
  • Shuicheng Yan
  • Hanwang Zhang

The fine-tuning of pre-trained language models has achieved great success in many NLP fields. Yet, it is strikingly vulnerable to adversarial examples, e.g., word substitution attacks using only synonyms can easily fool a BERT-based sentiment analysis model. In this paper, we demonstrate that adversarial training, the prevalent defense technique, does not directly fit a conventional fine-tuning scenario, because it suffers severely from catastrophic forgetting: failing to retain the generic and robust linguistic features that have already been captured by the pre-trained model. In this light, we propose Robust Informative Fine-Tuning (RIFT), a novel adversarial fine-tuning method from an information-theoretical perspective. In particular, RIFT encourages an objective model to retain the features learned from the pre-trained model throughout the entire fine-tuning process, whereas a conventional method only uses the pre-trained weights for initialization. Experimental results show that RIFT consistently outperforms the state-of-the-art methods on two popular NLP tasks: sentiment analysis and natural language inference, under different attacks across various pre-trained language models.

AAAI Conference 2021 Conference Paper

Partial-Label and Structure-constrained Deep Coupled Factorization Network

  • Yan Zhang
  • Zhao Zhang
  • Yang Wang
  • Zheng Zhang
  • Li Zhang
  • Shuicheng Yan
  • Meng Wang

In this paper, we propose an enriched-prior-guided framework, called Dual-constrained Deep Semi-Supervised Coupled Factorization Network (DS2CF-Net), for discovering hierarchical coupled data representations. To extract hidden deep features, DS2CF-Net is formulated as a partial-label and geometrical-structure-constrained framework. Specifically, DS2CF-Net designs a deep factorization architecture using multiple layers of linear transformations, which can jointly update both the basis vectors and the new representations in each layer. To make the learned deep representations and coefficients discriminative, we also enrich the supervised prior via joint deep-coefficient-based label prediction, and then incorporate the enriched prior information as additional label and structure constraints. The label constraint enables intra-class samples to have the same coordinates in feature space, while the structure constraint forces the coefficients in each layer to be block-diagonal, so that the enriched prior obtained via self-expressive label propagation is more accurate. Our network also integrates adaptive dual-graph learning to retain the local structures of both the data and feature manifolds in each layer. Extensive experiments on image datasets demonstrate the effectiveness of DS2CF-Net for representation learning and clustering.

NeurIPS Conference 2021 Conference Paper

Towards Understanding Why Lookahead Generalizes Better Than SGD and Beyond

  • Pan Zhou
  • Hanshu Yan
  • Xiaotong Yuan
  • Jiashi Feng
  • Shuicheng Yan

To train networks, the lookahead algorithm (Zhang et al., 2019) updates its fast weights $k$ times via an inner-loop optimizer before updating its slow weights once by using the latest fast weights. Any optimizer, e.g., SGD, can serve as the inner-loop optimizer, and the derived lookahead generally enjoys remarkable test performance improvement over the vanilla optimizer. But theoretical understandings of the test performance improvement of lookahead remain absent. To solve this issue, we theoretically justify the advantages of lookahead in terms of the excess risk error, which measures the test performance. Specifically, we prove that lookahead using SGD as its inner-loop optimizer can better balance the optimization error and generalization error to achieve a smaller excess risk error than vanilla SGD on (strongly) convex problems and on nonconvex problems satisfying the Polyak-Łojasiewicz condition, which has been observed/proved in neural networks. Moreover, we show that the stagewise optimization strategy (Barshan et al., 2015), which decays the learning rate several times during training, can also benefit lookahead by improving its optimization and generalization errors on strongly convex problems. Finally, we propose a stagewise locally-regularized lookahead (SLRLA) algorithm, which sums up the vanilla objective and a local regularizer to minimize at each stage, and provably enjoys optimization and generalization improvements over the conventional (stagewise) lookahead. Experimental results on CIFAR10/100 and ImageNet testify to its advantages. Code is available at https://github.com/sail-sg/SLRLA-optimizer.
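
For reference, the lookahead update being analyzed is only a few lines; `inner_opt_step` stands for any fast-weight optimizer step (e.g., one SGD update) and is an assumed callable:

```python
import torch

def lookahead_train(params, inner_opt_step, loss_fn, k=5, alpha=0.5, steps=100):
    """Minimal lookahead loop: k fast (inner) steps, then one slow update that
    interpolates the slow weights toward the latest fast weights."""
    slow = [p.detach().clone() for p in params]
    for t in range(steps):
        inner_opt_step(loss_fn, params)           # fast weight update
        if (t + 1) % k == 0:
            for s, p in zip(slow, params):
                s.add_(alpha * (p.detach() - s))  # slow weights catch up
                p.data.copy_(s)                   # reset fast weights to slow
```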

NeurIPS Conference 2020 Conference Paper

ConvBERT: Improving BERT with Span-based Dynamic Convolution

  • Zi-Hang Jiang
  • Weihao Yu
  • Daquan Zhou
  • Yunpeng Chen
  • Jiashi Feng
  • Shuicheng Yan

Pre-trained language models like BERT and its variants have recently achieved impressive performance in various natural language understanding tasks. However, BERT heavily relies on the global self-attention block and thus suffers from a large memory footprint and high computation cost. Although all its attention heads query the whole input sequence to generate the attention map from a global perspective, we observe that some heads only need to learn local dependencies, which implies computational redundancy. We therefore propose a novel span-based dynamic convolution to replace these self-attention heads and directly model local dependencies. The novel convolution heads, together with the remaining self-attention heads, form a new mixed attention block that is more efficient at both global and local context learning. We equip BERT with this mixed attention design and build a ConvBERT model. Experiments show that ConvBERT significantly outperforms BERT and its variants in various downstream tasks, with lower training cost and fewer model parameters. Remarkably, the ConvBERT-base model achieves an 86.4 GLUE score, 0.7 higher than ELECTRA-base, while using less than 1/4 of the training cost. Code and pre-trained models will be released.

ECAI Conference 2020 Conference Paper

Convolutional Dictionary Pair Learning Network for Image Representation Learning

  • Zhao Zhang 0001
  • Yulin Sun
  • Yang Wang 0023
  • Zheng-Jun Zha
  • Shuicheng Yan
  • Meng Wang 0001

Both Dictionary Learning (DL) and Convolutional Neural Networks (CNNs) are powerful image representation learning systems based on different mechanisms and principles, so whether they can be seamlessly integrated to improve performance is worth exploring. To address this issue, we propose a novel generalized end-to-end representation learning architecture, dubbed Convolutional Dictionary Pair Learning Network (CDPL-Net), which integrates the learning schemes of CNNs and dictionary pair learning into a unified framework. Generally, the architecture of CDPL-Net includes two convolutional/pooling layers and two dictionary pair learning (DPL) layers in the representation learning module. Besides, it uses two fully-connected layers as the multi-layer perceptron in the nonlinear classification module. In particular, the DPL layer can jointly formulate the discriminative synthesis and analysis representations by minimizing the batch-based reconstruction error over the flattened feature maps from the convolution/pooling layer. Moreover, the DPL layer uses the l1-norm on the analysis dictionary so that sparse representations can be delivered and the embedding process is robust to noise. To speed up the training of the DPL layer, efficient stochastic gradient descent is used. Extensive simulations on real databases show that our CDPL-Net delivers enhanced performance over other state-of-the-art methods.

ECAI Conference 2020 Conference Paper

Multilayer Collaborative Low-Rank Coding Network for Robust Deep Subspace Discovery

  • Xianzhen Li
  • Zhao Zhang 0001
  • Yang Wang 0023
  • Guangcan Liu
  • Shuicheng Yan
  • Meng Wang 0001

For subspace recovery, most existing low-rank representation (LRR) models perform in the original space in single-layer mode. As such, deep hierarchical information cannot be learned, which may result in inaccurate recoveries for complex real data. In this paper, we explore the deep multi-subspace recovery problem by designing a multilayer architecture for latent LRR. Technically, we propose a new Multilayer Collaborative Low-Rank Representation Network model, termed DeepLRR, to discover deep features and deep subspaces. In each layer (>2), DeepLRR bilinearly reconstructs the data matrix by the collaborative representation with low-rank coefficients and projection matrices from the previous layer. The bilinear low-rank reconstruction of the previous layer is directly fed into the next layer as the input and low-rank dictionary for representation learning, and is further decomposed into a deep principal feature part, a deep salient feature part and a deep sparse error. As such, the coherence issue can also be resolved thanks to the low-rank dictionary, and the robustness against noise can be enhanced in the feature subspace. To recover the sparse errors in each layer accurately, a dynamic growing strategy is used, as the noise level becomes smaller as the number of layers increases. Besides, a neighborhood reconstruction error is also included to adaptively encode the locality of deep salient features via deep coefficients in each layer. Extensive results on public databases show that our DeepLRR outperforms other related models for subspace discovery and clustering.

NeurIPS Conference 2019 Conference Paper

Efficient Meta Learning via Minibatch Proximal Update

  • Pan Zhou
  • Xiaotong Yuan
  • Huan Xu
  • Shuicheng Yan
  • Jiashi Feng

We address the problem of meta-learning, which learns a prior over hypotheses from a sample of meta-training tasks for fast adaptation on meta-testing tasks. A particularly simple yet successful paradigm for this research is model-agnostic meta-learning (MAML). Implementation and analysis of MAML, however, can be tricky; a first-order approximation is usually adopted to avoid directly computing the Hessian matrix, but as a result the convergence and generalization guarantees remain largely mysterious for MAML. To remedy this deficiency, in this paper we propose a minibatch proximal update based meta-learning approach for efficient hypothesis transfer. The principle is to learn a prior hypothesis shared across tasks such that minibatch risk minimization with biased regularization by this prior can quickly converge to the optimal hypothesis in each training task. The prior hypothesis training model can be efficiently optimized via SGD with provable convergence guarantees for both convex and non-convex problems. Moreover, we theoretically justify the benefit of the learnt prior hypothesis for fast adaptation to new few-shot learning tasks via minibatch proximal update. Experimental results on several few-shot regression and classification tasks demonstrate the advantages of our method over the state-of-the-art methods.

AAAI Conference 2019 Conference Paper

Look across Elapse: Disentangled Representation Learning and Photorealistic Cross-Age Face Synthesis for Age-Invariant Face Recognition

  • Jian Zhao
  • Yu Cheng
  • Yi Cheng
  • Yang Yang
  • Fang Zhao
  • Jianshu Li
  • Hengzhu Liu
  • Shuicheng Yan

Despite the remarkable progress in face recognition related technologies, reliably recognizing faces across ages still remains a big challenge. The appearance of a human face changes substantially over time, resulting in significant intra-class variations. As opposed to current techniques for age-invariant face recognition, which either directly extract age-invariant features for recognition, or first synthesize a face that matches the target age before feature extraction, we argue that it is more desirable to perform both tasks jointly so that they can leverage each other. To this end, we propose a deep Age-Invariant Model (AIM) for face recognition in the wild with three distinct novelties. First, AIM presents a novel unified deep architecture jointly performing cross-age face synthesis and recognition in a mutually boosting way. Second, AIM achieves continuous face rejuvenation/aging with remarkably photorealistic and identity-preserving properties, avoiding the requirement of paired data and the true age of testing samples. Third, we develop effective and novel training strategies for end-to-end learning of the whole deep architecture, which generates powerful age-invariant face representations explicitly disentangled from the age variation. Extensive experiments on several cross-age datasets (MORPH, CACD and FG-NET) demonstrate the superiority of the proposed AIM model over the state-of-the-art methods. Benchmarking our model on one of the most popular unconstrained face recognition datasets, IJB-C, additionally verifies the promising generalizability of AIM in recognizing faces in the wild.

IJCAI Conference 2019 Conference Paper

Multi-Prototype Networks for Unconstrained Set-based Face Recognition

  • Jian Zhao
  • Jianshu Li
  • Xiaoguang Tu
  • Fang Zhao
  • Yuan Xin
  • Junliang Xing
  • Hengzhu Liu
  • Shuicheng Yan

In this paper, we address the challenging unconstrained set-based face recognition problem, where each subject face is instantiated by a set of media (images and videos) instead of a single image. Naively aggregating information from all the media within a set would suffer from the large intra-set variance caused by heterogeneous factors (e.g., varying media modalities, poses and illumination) and fail to learn discriminative face representations. A novel Multi-Prototype Network (MPNet) model is thus proposed to learn multiple prototype face representations adaptively from the media sets. Each learned prototype is representative of the subject face under certain conditions in terms of pose, illumination and media modality. Instead of handcrafting the set partition for prototype learning, MPNet introduces a Dense SubGraph (DSG) learning sub-net that implicitly untangles inconsistent media and learns a number of representative prototypes. Qualitative and quantitative experiments clearly demonstrate the superiority of the proposed model over the state-of-the-art methods.

IJCAI Conference 2018 Conference Paper

3D-Aided Deep Pose-Invariant Face Recognition

  • Jian Zhao
  • Lin Xiong
  • Yu Cheng
  • Yi Cheng
  • Jianshu Li
  • Li Zhou
  • Yan Xu
  • Jayashree Karlekar

Learning from synthetic faces, though perhaps appealing for high data efficiency, may not bring satisfactory performance due to the distribution discrepancy of the synthetic and real face images. To mitigate this gap, we propose a 3D-Aided Deep Pose-Invariant Face Recognition Model (3D-PIM), which automatically recovers realistic frontal faces from arbitrary poses through a 3D face model in a novel way. Specifically, 3D-PIM incorporates a simulator with the aid of a 3D Morphable Model (3D MM) to obtain shape and appearance prior for accelerating face normalization learning, requiring less training data. It further leverages a global-local Generative Adversarial Network (GAN) with multiple critical improvements as a refiner to enhance the realism of both global structures and local details of the face simulator’s output using unlabelled real data only, while preserving the identity information. Qualitative and quantitative experiments on both controlled and in-the-wild benchmarks clearly demonstrate superiority of the proposed model over state-of-the-arts.

NeurIPS Conference 2018 Conference Paper

A^2-Nets: Double Attention Networks

  • Yunpeng Chen
  • Yannis Kalantidis
  • Jianshu Li
  • Shuicheng Yan
  • Jiashi Feng

Learning to capture long-range relations is fundamental to image/video recognition. Existing CNN models generally rely on increasing depth to model such relations, which is highly inefficient. In this work, we propose the "double attention block", a novel component that aggregates and propagates informative global features from the entire spatio-temporal space of input images/videos, enabling subsequent convolution layers to access features from the entire space efficiently. The component is designed with a double attention mechanism in two steps, where the first step gathers features from the entire space into a compact set through second-order attention pooling and the second step adaptively selects and distributes features to each location via another attention. The proposed double attention block is easy to adopt and can be plugged into existing deep neural networks conveniently. We conduct extensive ablation studies and experiments on both image and video recognition tasks to evaluate its performance. On the image recognition task, a ResNet-50 equipped with our double attention blocks outperforms a much larger ResNet-152 architecture on the ImageNet-1k dataset with over 40% fewer parameters and fewer FLOPs. On the action recognition task, our proposed model achieves state-of-the-art results on the Kinetics and UCF-101 datasets with significantly higher efficiency than recent works.
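
The gather-then-distribute computation can be sketched with plain matrix products on a flattened (C, N) feature map; the paper implements the three projections as 1x1 convolutions, written here as weight matrices for clarity:

```python
import torch

def double_attention(x, Wq, Wk, Wv):
    """Sketch of the double attention block on a flattened feature map.
    x: (C, N) with N = H*W positions; Wq, Wk, Wv: (m, C) projections."""
    A = torch.softmax(Wk @ x, dim=-1)   # (m, N): attention over positions
    G = (Wv @ x) @ A.T                  # (m, m): gather global descriptors
    B = torch.softmax(Wq @ x, dim=0)    # (m, N): soft assignment per position
    return G @ B                        # (m, N): distribute back to locations

# Example shapes: 64 channels, 196 positions, 32 latent descriptors.
x = torch.randn(64, 196)
Wq, Wk, Wv = (torch.randn(32, 64) for _ in range(3))
out = double_attention(x, Wq, Wk, Wv)   # (32, 196)
```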

IJCAI Conference 2018 Conference Paper

Exact Low Tubal Rank Tensor Recovery from Gaussian Measurements

  • Canyi Lu
  • Jiashi Feng
  • Zhouchen Lin
  • Shuicheng Yan

The recently proposed Tensor Nuclear Norm (TNN) [Lu et al., 2016; 2018a] is an interesting convex penalty induced by the tensor SVD [Kilmer and Martin, 2011]. It plays a role similar to the matrix nuclear norm, which is the convex surrogate of the matrix rank. Considering that the TNN-based Tensor Robust PCA [Lu et al., 2018a] is an elegant extension of Robust PCA with a similarly tight recovery bound, it is natural to solve other low-rank tensor recovery problems extended from the matrix cases. However, the extensions and proofs are generally tedious. The general atomic norm provides a unified view of norms induced by low-complexity structures, e.g., the l1-norm and the nuclear norm. Sharp estimates of the number of generic measurements required for exact recovery based on the atomic norm are known in the literature. In this work, with a careful choice of the atomic set, we prove that TNN is a special atomic norm. Then, by computing the Gaussian width of a certain cone, which is necessary for the sharp estimate, we obtain a simple bound for guaranteed low tubal rank tensor recovery from Gaussian measurements. Specifically, we show that by solving a TNN minimization problem, an underlying tensor of size n1×n2×n3 with tubal rank r can be exactly recovered when the given number of Gaussian measurements is O(r(n1+n2−r)n3). This is order optimal when compared with the degrees of freedom r(n1+n2−r)n3. Beyond the Gaussian mapping, we also give a recovery guarantee for tensor completion based on the uniform random mapping by TNN minimization. Numerical experiments verify our theoretical results.
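
For readers who want to compute the TNN itself, it reduces to slice-wise matrix nuclear norms in the Fourier domain along the third mode. The numpy sketch below follows one common convention that divides by n3; scaling conventions differ across papers, so the 1/n3 factor is an assumption here.

    import numpy as np

    def tensor_nuclear_norm(X):
        # X: real tensor of shape (n1, n2, n3); take frontal slices
        # in the Fourier domain along the third mode.
        Xf = np.fft.fft(X, axis=2)
        n3 = X.shape[2]
        sv_sum = sum(np.linalg.svd(Xf[:, :, k], compute_uv=False).sum()
                     for k in range(n3))
        return sv_sum / n3  # scaling convention varies across papers

    rng = np.random.default_rng(0)
    X = rng.standard_normal((8, 6, 4))
    print(tensor_nuclear_norm(X))  # a nonnegative scalar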

IJCAI Conference 2018 Conference Paper

High Resolution Feature Recovering for Accelerating Urban Scene Parsing

  • Rui Zhang
  • Sheng Tang
  • Luoqi Liu
  • Yongdong Zhang
  • Jintao Li
  • Shuicheng Yan

Both accuracy and speed are equally important in urban scene parsing. Most existing methods mainly focus on improving parsing accuracy, ignoring the problem of low inference speed caused by large-sized input and high-resolution feature maps. To tackle this issue, we propose a High Resolution Feature Recovering (HRFR) framework to accelerate a given parsing network. A Super-Resolution Recovering module is employed to recover features of the large original-sized images from features of down-sampled input. Therefore, our framework can combine the advantages of (1) the fast speed of networks with down-sampled input and (2) the high accuracy of networks with large original-sized input. Additionally, we employ auxiliary intermediate supervision and boundary region re-weighting to facilitate the optimization of the network. Extensive experiments on the challenging Cityscapes and CamVid datasets demonstrate the effectiveness of the proposed HRFR framework, which accelerates scene parsing inference by about a 3.0x speedup from 1/2 down-sampled input with negligible accuracy reduction.

TIST Journal 2018 Journal Article

High-Precision Camera Localization in Scenes with Repetitive Patterns

  • Xiaobai Liu
  • Qian Xu
  • Yadong Mu
  • Jiadi Yang
  • Liang Lin
  • Shuicheng Yan

This article presents a high-precision multi-modal approach for localizing moving cameras with monocular videos, which has wide potentials in many intelligent applications, including robotics, autonomous vehicles, and so on. Existing visual odometry methods often suffer from symmetric or repetitive scene patterns, e.g., windows on buildings or parking stalls. To address this issue, we introduce a robust camera localization method that contributes in two aspects. First, we formulate feature tracking, the critical step of visual odometry, as a hierarchical min-cost network flow optimization task, and we regularize the formula with flow constraints, cross-scale consistencies, and motion heuristics. The proposed regularized formula is capable of adaptively selecting distinctive features or feature combinations, which is more effective than traditional methods that detect and group repetitive patterns in a separate step. Second, we develop a joint formula for integrating dense visual odometry and sparse GPS readings in a common reference coordinate. The fusion process is guided with high-order statistics knowledge to suppress the impacts of noises, clusters, and model drifting. We evaluate the proposed camera localization method on both public video datasets and a newly created dataset that includes scenes full of repetitive patterns. Results with comparisons show that our method can achieve comparable performance to state-of-the-art methods and is particularly effective for addressing repetitive pattern issues.

AAAI Conference 2018 Conference Paper

Nonconvex Sparse Spectral Clustering by Alternating Direction Method of Multipliers and Its Convergence Analysis

  • Canyi Lu
  • Jiashi Feng
  • Zhouchen Lin
  • Shuicheng Yan

Spectral Clustering (SC) is a widely used data clustering method which first learns a low-dimensional embedding U of data by computing the eigenvectors of the normalized Laplacian matrix, and then performs k-means on U to get the final clustering result. The Sparse Spectral Clustering (SSC) method extends SC with a sparse regularization on $UU^\top$, motivated by the block diagonal structure of $UU^\top$ in the ideal case. However, encouraging $UU^\top$ to be sparse leads to a heavily nonconvex problem which is challenging to solve, and the work (Lu, Yan, and Lin 2016) pursues this aim indirectly through a convex relaxation. However, the convex relaxation generally leads to a loose approximation, and the quality of the solution is not clear. This work instead considers solving the nonconvex formulation of SSC which directly encourages $UU^\top$ to be sparse. We propose an efficient Alternating Direction Method of Multipliers (ADMM) to solve the nonconvex SSC and provide a convergence guarantee. In particular, we prove that the sequence generated by ADMM always has a limit point and that any limit point is a stationary point. Our analysis does not impose any assumptions on the iterates and thus is practical. Our proposed ADMM for nonconvex problems allows the stepsize to be increasing but upper bounded, which makes it very efficient in practice. Experimental analysis on several real datasets verifies the effectiveness of our method.

IJCAI Conference 2018 Conference Paper

Sharing Residual Units Through Collective Tensor Factorization To Improve Deep Neural Networks

  • Yunpeng Chen
  • Xiaojie Jin
  • Bingyi Kang
  • Jiashi Feng
  • Shuicheng Yan

The residual unit and its variations are widely used in building very deep neural networks to alleviate optimization difficulty. In this work, we revisit the standard residual function as well as several of its successful variants and propose a unified framework based on tensor Block Term Decomposition (BTD) to explain these apparently different residual functions from the tensor decomposition view. With the BTD framework, we further propose a novel basic network architecture, named the Collective Residual Unit (CRU). CRU enhances the parameter efficiency of deep residual neural networks by sharing core factors derived from collective tensor factorization over the involved residual units. It enables efficient knowledge sharing across multiple residual units, reduces the number of model parameters, lowers the risk of over-fitting, and provides better generalization ability. Extensive experimental results show that our proposed CRU network brings outstanding parameter efficiency -- it achieves classification performance comparable to ResNet-200 while using a model size as small as ResNet-50 on the ImageNet-1k and Places365-Standard benchmark datasets.

ICML Conference 2018 Conference Paper

WSNet: Compact and Efficient Networks Through Weight Sampling

  • Xiaojie Jin
  • Yingzhen Yang
  • Ning Xu 0001
  • Jianchao Yang
  • Nebojsa Jojic
  • Jiashi Feng
  • Shuicheng Yan

We present a new approach and a novel architecture, termed WSNet, for learning compact and efficient deep neural networks. Existing approaches conventionally learn full model parameters independently and then compress them via ad hoc processing such as model pruning or filter factorization. Alternatively, WSNet learns model parameters by sampling from a compact set of learnable parameters, which naturally enforces parameter sharing throughout the learning process. We demonstrate that such a weight sampling approach (and the induced WSNet) promotes both weight and computation sharing favorably. By employing this method, we can more efficiently learn much smaller networks with competitive performance compared to baseline networks with equal numbers of convolution filters. Specifically, we consider learning compact and efficient 1D convolutional neural networks for audio classification. Extensive experiments on multiple audio classification datasets verify the effectiveness of WSNet. Combined with weight quantization, the resulting models are up to 180x smaller and theoretically up to 16x faster than the well-established baselines, without noticeable performance drop.
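
The core sampling idea can be shown in a few lines. The toy numpy sketch below (not the released WSNet) reads every 1-D filter out of one compact shared vector as an overlapping window, so weight sharing holds by construction; the window stride and the sizes used are illustrative assumptions.

    import numpy as np

    def sample_filters(shared, num_filters, filter_len, stride=2):
        # Read num_filters overlapping windows from the shared vector.
        needed = (num_filters - 1) * stride + filter_len
        assert shared.size >= needed, "compact parameter vector too short"
        return np.stack([shared[i * stride : i * stride + filter_len]
                         for i in range(num_filters)])

    shared = np.random.randn(70)                 # compact learnable parameters
    filters = sample_filters(shared, num_filters=32, filter_len=8)
    print(filters.shape)                         # (32, 8): 256 weights read from 70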

NeurIPS Conference 2017 Conference Paper

Dual Path Networks

  • Yunpeng Chen
  • Jianan Li
  • Huaxin Xiao
  • Xiaojie Jin
  • Shuicheng Yan
  • Jiashi Feng

In this work, we present a simple, highly efficient and modularized Dual Path Network (DPN) for image classification, which presents a new topology of internal connection paths. By revealing the equivalence of the state-of-the-art Residual Network (ResNet) and Densely Connected Convolutional Network (DenseNet) within the HORNN framework, we find that ResNet enables feature re-usage while DenseNet enables new feature exploration, both of which are important for learning good representations. To enjoy the benefits of both path topologies, our proposed Dual Path Network shares common features while maintaining the flexibility to explore new features through dual path architectures. Extensive experiments on three benchmark datasets, ImageNet-1k, Places365 and PASCAL VOC, clearly demonstrate the superior performance of the proposed DPN over state-of-the-art methods. In particular, on the ImageNet-1k dataset, a shallow DPN surpasses the best ResNeXt-101(64x4d) with 26% smaller model size, 25% less computational cost and 8% lower memory consumption, and a deeper DPN (DPN-131) further pushes the state-of-the-art single-model performance with about 2 times faster training speed. Experiments on the Places365 large-scale scene dataset, PASCAL VOC detection dataset, and PASCAL VOC segmentation dataset also demonstrate its consistently better performance than DenseNet, ResNet and the latest ResNeXt model across various applications.

NeurIPS Conference 2017 Conference Paper

Dual-Agent GANs for Photorealistic and Identity Preserving Profile Face Synthesis

  • Jian Zhao
  • Lin Xiong
  • Panasonic Karlekar Jayashree
  • Jianshu Li
  • Fang Zhao
  • Zhecan Wang
  • Panasonic Sugiri Pranata
  • Panasonic Shengmei Shen

Synthesizing realistic profile faces is promising for more efficiently training deep pose-invariant models for large-scale unconstrained face recognition, by populating samples with extreme poses and avoiding tedious annotations. However, learning from synthetic faces may not achieve the desired performance due to the discrepancy between the distributions of synthetic and real face images. To narrow this gap, we propose a Dual-Agent Generative Adversarial Network (DA-GAN) model, which can improve the realism of a face simulator's output using unlabeled real faces, while preserving the identity information during the realism refinement. The dual agents are specifically designed for distinguishing real vs. fake images and identities simultaneously. In particular, we employ an off-the-shelf 3D face model as a simulator to generate profile face images with varying poses. DA-GAN leverages a fully convolutional network as the generator to generate high-resolution images and an auto-encoder as the discriminator with the dual agents. Besides the novel architecture, we make several key modifications to the standard GAN to preserve pose and texture, preserve identity and stabilize the training process: (i) a pose perception loss; (ii) an identity perception loss; (iii) an adversarial loss with a boundary equilibrium regularization term. Experimental results show that DA-GAN not only presents compelling perceptual results but also significantly outperforms state-of-the-art methods on the large-scale and challenging NIST IJB-A unconstrained face recognition benchmark. In addition, the proposed DA-GAN is also promising as a new approach for solving generic transfer learning problems more effectively.

TIST Journal 2017 Journal Article

Event Classification in Microblogs via Social Tracking

  • Yue Gao
  • Hanwang Zhang
  • Xibin Zhao
  • Shuicheng Yan

Social media websites have become important information sharing platforms. The rapid development of social media platforms has led to increasingly large-scale social media data, which has shown remarkable societal and marketing value. There is a need to extract important events from live social media streams. However, microblog event classification is challenging due to two factors: the short/conversational nature of microblogs together with the incompatible meanings between the text and the corresponding image in social posts, and the rapidly evolving content. In this article, we propose to conduct event classification via deep learning and social tracking. First, we introduce a Multi-modal Multi-instance Deep Network (M2DN) for microblog classification, which is able to handle the weakly labeled microblog data arising from the incompatible meanings inside microblogs. Besides predicting each microblog as one of the predefined events, we propose to employ social tracking to extract social-related auxiliary information to enrich the testing samples. We extract a set of candidate-relevant microblogs in a short time window by using social connections, such as related users and geographical locations. All these selected microblogs and the testing data are formulated in a Markov Random Field model. Inference on the Markov Random Field is conducted to update the classification results of the testing microblogs. The method is evaluated on the Brand-Social-Net dataset for classification of 20 events. Experimental results and comparison with state-of-the-art methods show that the proposed method achieves better performance on the event classification task.

IJCAI Conference 2017 Conference Paper

Global-residual and Local-boundary Refinement Networks for Rectifying Scene Parsing Predictions

  • Rui Zhang
  • Sheng Tang
  • Min Lin
  • Jintao Li
  • Shuicheng Yan

Most existing scene parsing methods suffer from the serious problems of inconsistent parsing results and object boundary shift. To tackle these problems, we first propose an iterative Global-residual Refinement Network (GRN) that exploits global contextual information to predict parsing residuals and iteratively smoothen inconsistent parsing labels. Furthermore, we propose a Local-boundary Refinement Network (LRN) to learn position-adaptive propagation coefficients so that local contextual information from neighbors can be optimally captured for refining object boundaries. Finally, we cascade the two proposed refinement networks after a fully residual convolutional neural network within a uniform framework. Extensive experiments on the ADE20K and Cityscapes datasets demonstrate the effectiveness of the two refinement methods for refining scene parsing predictions.

AAAI Conference 2017 Conference Paper

Multi-Path Feedback Recurrent Neural Networks for Scene Parsing

  • Xiaojie Jin
  • Yunpeng Chen
  • Zequn Jie
  • Jiashi Feng
  • Shuicheng Yan

In this paper, we consider the scene parsing problem and propose a novel Multi-Path Feedback recurrent neural network (MPF-RNN) for parsing scene images. MPF-RNN enhances the capability of RNNs to model long-range context information at multiple levels and to better distinguish pixels that are easy to confuse. Different from feedforward CNNs and RNNs with only a single feedback, MPF-RNN propagates the contextual features learned at the top layer through multiple weighted recurrent connections to learn bottom features. To better train MPF-RNN, we propose a new strategy that considers accumulated losses at multiple recurrent steps to improve the performance of MPF-RNN on parsing small objects. With these two novel components, MPF-RNN achieves significant improvement over strong baselines (VGG16 and Res101) on five challenging scene parsing benchmarks, including the traditional SiftFlow, Barcelona, CamVid and Stanford Background datasets as well as the recently released large-scale ADE20K.

IJCAI Conference 2017 Conference Paper

Online Robust Low-Rank Tensor Learning

  • Ping Li
  • Jiashi Feng
  • Xiaojie Jin
  • Luming Zhang
  • Xianghua Xu
  • Shuicheng Yan

The rapid increase of multidimensional data (a.k.a. tensors) such as videos brings new challenges for low-rank data modeling approaches, including dynamic data size, complex high-order relations, and multiplicity of low-rank structures. Resolving these challenges requires a new tensor analysis method that can perform tensor data analysis online, which however is still absent. In this paper, we propose an Online Robust Low-rank Tensor Modeling (ORLTM) approach to address these challenges. ORLTM dynamically explores the high-order correlations across all tensor modes for low-rank structure modeling. To analyze mixture data from multiple subspaces, ORLTM introduces a new dictionary learning component. ORLTM processes data in a streaming fashion and thus has a low memory cost that is independent of the data size. This makes ORLTM well suited to processing large-scale tensor data. Empirical studies have validated the effectiveness of the proposed method on both synthetic data and one practical task, i.e., video background subtraction. In addition, we provide theoretical analysis of the computational complexity and memory cost, rigorously demonstrating the efficiency of ORLTM.

NeurIPS Conference 2017 Conference Paper

Predicting Scene Parsing and Motion Dynamics in the Future

  • Xiaojie Jin
  • Huaxin Xiao
  • Xiaohui Shen
  • Jimei Yang
  • Zhe Lin
  • Yunpeng Chen
  • Zequn Jie
  • Jiashi Feng

It is important for intelligent systems, e.g., autonomous vehicles and robots, to anticipate the future in order to plan early and make decisions accordingly. Predicting future scene parsing and motion dynamics helps agents better understand the visual environment, as the former provides dense semantic segmentations, i.e., what objects will be present and where they will appear, while the latter provides dense motion information, i.e., how the objects will move. In this paper, we propose a novel model to predict scene parsing and motion dynamics in unobserved future video frames simultaneously. Using history information (preceding frames and corresponding scene parsing results) as input, our model is able to predict scene parsing and motion for arbitrary time steps ahead. More importantly, our model is superior to methods that predict parsing and motion separately, as the complementary relationship between the two tasks is fully utilized in our model through joint learning. To the best of our knowledge, this is the first attempt to jointly predict scene parsing and motion dynamics in future frames. On the large-scale Cityscapes dataset, we demonstrate that our model produces significantly better parsing and motion prediction results compared to well-established baselines. In addition, we also show that our model can be used to predict the steering angle of vehicles, which further verifies its ability to learn the underlying latent parameters.

IJCAI Conference 2017 Conference Paper

Training Group Orthogonal Neural Networks with Privileged Information

  • Yunpeng Chen
  • Xiaojie Jin
  • Jiashi Feng
  • Shuicheng Yan

Learning rich and diverse representations is critical for the performance of deep convolutional neural networks (CNNs). In this paper, we consider how to use privileged information to promote the inherent diversity of a single CNN model such that the model can learn better representations and offer stronger generalization ability. To this end, we propose a novel group orthogonal convolutional neural network (GoCNN) that learns untangled representations within each layer by exploiting provided privileged information and enhances representation diversity effectively. We take image classification as an example, where image segmentation annotations are used as privileged information during the training process. Experiments on two benchmark datasets – ImageNet and PASCAL VOC – clearly demonstrate the strong generalization ability of our proposed GoCNN model. On the ImageNet dataset, GoCNN improves the performance of the state-of-the-art ResNet-152 model by an absolute 1.2% while using privileged information for only 10% of the training images, confirming the effectiveness of GoCNN in utilizing available privileged knowledge to train better CNNs.

TIST Journal 2017 Journal Article

Visual Classification of Furniture Styles

  • Zhenhen Hu
  • Yonggang Wen
  • Luoqi Liu
  • Jianguo Jiang
  • Richang Hong
  • Meng Wang
  • Shuicheng Yan

Furniture style describes the discriminative appearance characteristics of furniture. It plays an important role in real-world indoor decoration. In this article, we explore the furniture style features and study the problem of furniture style classification. Differing from traditional object classification, furniture style classification aims at classifying different furniture in terms of the “style” that describes its appearance (e.g., American style, Gothic style, Rococo style, etc.) rather than the “kind” that is more related to its functional structure (e.g., bed, desk, etc.). To pursue efficient furniture style features, we construct a novel dataset of furniture styles that contains 16 common style categories and implement three strategies with respect to two categories of classification, that is, handcrafted classification and learning-based classification. First, we follow the typical image classification pipeline to extract the handcrafted features and train the classifier by support vector machine. Then we use the convolutional neural network to extract learning-based features from training images. To obtain comprehensive furniture style features, we finally combine the handcrafted image classification pipeline and the learning-based network. We experimentally evaluate the performances of handcrafted features and learning-based features of each strategy, and the results show the superiority of learning-based features and also the comprehensiveness of handcrafted features.

AAAI Conference 2016 Conference Paper

Deep Learning with S-Shaped Rectified Linear Activation Units

  • Xiaojie Jin
  • Chunyan Xu
  • Jiashi Feng
  • Yunchao Wei
  • Junjun Xiong
  • Shuicheng Yan

Rectified linear activation units are important components of state-of-the-art deep convolutional networks. In this paper, we propose a novel S-shaped rectified linear activation unit (SReLU) to learn both convex and non-convex functions, imitating the multiple function forms given by two fundamental laws of psychophysics and neural science, namely the Weber-Fechner law and the Stevens law. Specifically, SReLU consists of three piecewise linear functions formulated by four learnable parameters. SReLU is learned jointly with the training of the whole deep network through back propagation. To initialize SReLU in different layers during the training phase, we propose a “freezing” method that degenerates SReLU into a predefined leaky rectified linear unit during the first several training epochs and then adaptively learns good initial values. SReLU can be used universally in existing deep networks with negligible additional parameters and computation cost. Experiments with two popular CNN architectures, Network in Network and GoogLeNet, on benchmarks of various scales, including CIFAR10, CIFAR100, MNIST and ImageNet, demonstrate that SReLU achieves remarkable improvement compared to other activation functions.
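
Since SReLU is just three linear pieces controlled by four parameters, it is easy to sketch. In the numpy version below the four learnable parameters (t_l, a_l, t_r, a_r) are frozen at illustrative values; in the paper they are learned per channel jointly with the network.

    import numpy as np

    def srelu(x, t_l=-1.0, a_l=0.1, t_r=1.0, a_r=2.0):
        # Three linear pieces: slope a_l below t_l, identity between
        # t_l and t_r, slope a_r above t_r (parameter values are
        # illustrative, not learned here).
        y = np.where(x < t_l, t_l + a_l * (x - t_l), x)   # left piece
        y = np.where(x > t_r, t_r + a_r * (x - t_r), y)   # right piece
        return y

    x = np.linspace(-3.0, 3.0, 7)
    print(srelu(x))  # S-shaped response over the input range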

AAAI Conference 2016 Conference Paper

Fast Proximal Linearized Alternating Direction Method of Multiplier with Parallel Splitting

  • Canyi Lu
  • Huan Li
  • Zhouchen Lin
  • Shuicheng Yan

The Augmented Lagrangian Method (ALM) and the Alternating Direction Method of Multipliers (ADMM) have been powerful optimization methods for general convex programming subject to linear constraints. We consider the convex problem whose objective consists of a smooth part and a nonsmooth but simple part. We propose the Fast Proximal Augmented Lagrangian Method (Fast PALM), which achieves the convergence rate $O(1/K^2)$, compared with $O(1/K)$ for the traditional PALM. To further reduce the per-iteration complexity and handle multi-block problems, we propose the Fast Proximal ADMM with Parallel Splitting (Fast PL-ADMM-PS) method. It also partially improves the rate related to the smooth part of the objective function. Experimental results on both synthetic and real-world data demonstrate that our fast methods significantly improve upon the previous PALM and ADMM.

TIST Journal 2016 Journal Article

Modality-Dependent Cross-Media Retrieval

  • Yunchao Wei
  • Yao Zhao
  • Zhenfeng Zhu
  • Shikui Wei
  • Yanhui Xiao
  • Jiashi Feng
  • Shuicheng Yan

In this article, we investigate cross-media retrieval between images and text, that is, using an image to search for text (I2T) and using text to search for images (T2I). Existing cross-media retrieval methods usually learn one couple of projections, by which the original features of images and text can be projected into a common latent space to measure content similarity. However, using the same projections for the two different retrieval tasks (I2T and T2I) may lead to a tradeoff between their respective performances, rather than their best performances. Different from previous works, we propose a modality-dependent cross-media retrieval (MDCR) model, where two couples of projections are learned for the different cross-media retrieval tasks instead of one. Specifically, by jointly optimizing the correlation between images and text and the linear regression from one modal space (image or text) to the semantic space, two couples of mappings are learned to project images and text from their original feature spaces into two common latent subspaces (one for I2T and the other for T2I). Extensive experiments show the superiority of the proposed MDCR over other methods. In particular, based on the 4,096-dimensional convolutional neural network (CNN) visual feature and the 100-dimensional Latent Dirichlet Allocation (LDA) textual feature, the proposed method achieves an mAP score of 41.5%, a new state-of-the-art performance on the Wikipedia dataset.

TIST Journal 2016 Journal Article

Multitask Low-Rank Affinity Graph for Image Segmentation and Image Annotation

  • Teng Li
  • Bin Cheng
  • Bingbing Ni
  • Guangchan Liu
  • Shuicheng Yan

This article investigates a low-rank representation-based graph, which can be used in graph-based vision tasks including image segmentation and image annotation. It naturally fuses multiple types of image features in a framework named multitask low-rank affinity pursuit. Given image patches described by multiple types of features, we aim to infer a unified affinity matrix that implicitly encodes the relations among these patches. This is achieved by seeking sparsity-consistent low-rank affinities from the joint decompositions of multiple feature matrices into pairs of sparse and low-rank matrices, the latter of which is expressed as the product of the image feature matrix and its corresponding image affinity matrix. The inference process is formulated as a minimization problem and solved efficiently with the augmented Lagrange multiplier method. Considering image patches as vertices, a graph can be built based on the resulting affinity matrix. Compared to previous methods, which are usually based on a single type of feature, the proposed method seamlessly integrates multiple types of features to jointly produce the affinity matrix in a single inference step. The proposed method is applied to graph-based image segmentation and graph-based image annotation. Experiments on benchmark datasets validate the superiority of using multiple features over a single feature, as well as the superiority of our method over conventional feature fusion methods.

NeurIPS Conference 2016 Conference Paper

Tree-Structured Reinforcement Learning for Sequential Object Localization

  • Zequn Jie
  • Xiaodan Liang
  • Jiashi Feng
  • Xiaojie Jin
  • Wen Lu
  • Shuicheng Yan

Existing object proposal algorithms usually search for possible object regions over multiple locations and scales separately, ignoring the interdependency among different objects and deviating from the human perception procedure. To incorporate global interdependency between objects into object localization, we propose an effective Tree-structured Reinforcement Learning (Tree-RL) approach to sequentially search for objects by fully exploiting both the current observation and historical search paths. The Tree-RL approach learns multiple search policies by maximizing a long-term reward that reflects localization accuracy over all the objects. Starting by taking the entire image as a proposal, the Tree-RL approach allows the agent to sequentially discover multiple objects via a tree-structured traversal scheme. By allowing multiple near-optimal policies, Tree-RL offers more diversity in search paths and is able to find multiple objects with a single feed-forward pass. Therefore, Tree-RL can better cover objects at various scales, which is quite appealing in the context of object proposal. Experiments on PASCAL VOC 2007 and 2012 validate the effectiveness of Tree-RL, which achieves recalls comparable to current object proposal algorithms with far fewer candidate windows.

AAAI Conference 2015 Conference Paper

Generalized Singular Value Thresholding

  • Canyi Lu
  • Changbo Zhu
  • Chunyan Xu
  • Shuicheng Yan
  • Zhouchen Lin

This work studies the Generalized Singular Value Thresholding (GSVT) operator $\mathrm{Prox}_g^{\sigma}(\cdot)$, defined as $\mathrm{Prox}_g^{\sigma}(B) = \arg\min_{X} \sum_{i=1}^{m} g(\sigma_i(X)) + \frac{1}{2}\|X - B\|_F^2$, associated with a nonconvex function g defined on the singular values of X. We prove that GSVT can be obtained by performing the proximal operator of g (denoted $\mathrm{Prox}_g(\cdot)$) on the singular values, since $\mathrm{Prox}_g(\cdot)$ is monotone when g is lower bounded. If the nonconvex g satisfies some conditions (many popular nonconvex surrogate functions of the $\ell_0$-norm, e.g., the $\ell_p$-norm with $0 < p < 1$, are special cases), a general solver for finding $\mathrm{Prox}_g(b)$ is proposed for any $b \geq 0$. GSVT greatly generalizes the known Singular Value Thresholding (SVT), which is a basic subroutine in many convex low-rank minimization methods. We are able to solve nonconvex low-rank minimization problems by using GSVT in place of SVT.
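
Under the result above, GSVT reduces to an SVD plus an elementwise proximal step on the singular values. The numpy sketch below uses the convex choice g(s) = lam * s, whose prox is soft-thresholding, which recovers classical SVT; a nonconvex g would only change the elementwise prox function passed in.

    import numpy as np

    def gsvt(B, prox_g):
        # SVD, apply the scalar prox to each singular value, rebuild.
        U, s, Vt = np.linalg.svd(B, full_matrices=False)
        return U @ np.diag(prox_g(s)) @ Vt

    lam = 0.5
    soft = lambda s: np.maximum(s - lam, 0.0)    # prox of g(s) = lam * s
    B = np.random.randn(6, 4)
    print(np.linalg.matrix_rank(gsvt(B, soft)))  # typically a reduced rank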

NeurIPS Conference 2014 Conference Paper

Convex Optimization Procedure for Clustering: Theoretical Revisit

  • Changbo Zhu
  • Huan Xu
  • Chenlei Leng
  • Shuicheng Yan

In this paper, we present a theoretical analysis of SON, a convex optimization procedure for clustering using a sum-of-norms (SON) regularization recently proposed in prior work (Hocking et al., 2011; Lindsten et al., 2011; Pelckmans et al., 2005). In particular, we show that if the samples are drawn from two cubes, each being one cluster, then SON can provably identify the cluster membership provided that the distance between the two cubes is larger than a threshold which depends (linearly) on the size of the cubes and the ratio of the numbers of samples in each cluster. To the best of our knowledge, this paper is the first to provide a rigorous analysis of why and when SON works. We believe this may provide important insights for developing novel convex optimization based clustering algorithms.
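
To make the SON objective concrete, here is a tiny sketch written with cvxpy; this implementation is an assumption of ours (the paper analyzes the procedure theoretically and does not prescribe code), and the data, weight lam, and sizes are illustrative. Rows of the solution that the sum-of-norms penalty fuses together belong to the same cluster.

    import numpy as np
    import cvxpy as cp

    A = np.vstack([np.random.randn(5, 2) - 3,   # samples from cluster 1
                   np.random.randn(5, 2) + 3])  # samples from cluster 2
    n = A.shape[0]
    X = cp.Variable(A.shape)
    lam = 0.5
    fit = 0.5 * cp.sum_squares(X - A)           # stay close to the data
    fuse = sum(cp.norm(X[i] - X[j], 2)          # sum-of-norms penalty
               for i in range(n) for j in range(i + 1, n))
    cp.Problem(cp.Minimize(fit + lam * fuse)).solve()
    # rows of X.value that (nearly) coincide share a cluster
    print(np.round(X.value, 2))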

NeurIPS Conference 2014 Conference Paper

On a Theory of Nonparametric Pairwise Similarity for Clustering: Connecting Clustering to Classification

  • Yingzhen Yang
  • Feng Liang
  • Shuicheng Yan
  • Zhangyang Wang
  • Thomas Huang

Pairwise clustering methods partition the data space into clusters using the pairwise similarity between data points. The success of pairwise clustering largely depends on the pairwise similarity function defined over the data points, where kernel similarity is broadly used. In this paper, we present a novel pairwise clustering framework that bridges the gap between clustering and multi-class classification. The framework learns an unsupervised nonparametric classifier from each data partition and searches for the optimal partition of the data by minimizing the generalization error of the learned classifiers associated with the data partitions. We consider two nonparametric classifiers in this framework, i.e., the nearest neighbor classifier and the plug-in classifier. Modeling the underlying data distribution by nonparametric kernel density estimation, the generalization error bounds for both unsupervised nonparametric classifiers are sums of nonparametric pairwise similarity terms between the data points for the purpose of clustering. Under the uniform distribution, the nonparametric similarity terms induced by both unsupervised classifiers exhibit a well-known form of kernel similarity. We also prove that the generalization error bound for the unsupervised plug-in classifier is asymptotically equal to the weighted volume of the cluster boundary for Low Density Separation, a widely used criterion for semi-supervised learning and clustering. Based on the derived nonparametric pairwise similarity using the plug-in classifier, we propose a new nonparametric exemplar-based clustering method with enhanced discriminative capability, whose superiority is evidenced by the experimental results.

AAAI Conference 2014 Conference Paper

Proximal Iteratively Reweighted Algorithm with Multiple Splitting for Nonconvex Sparsity Optimization

  • Canyi Lu
  • Yunchao Wei
  • Zhouchen Lin
  • Shuicheng Yan

This paper proposes the Proximal Iteratively REweighted (PIRE) algorithm for solving a general problem that covers a large body of nonconvex sparse and structured sparse problems. Compared with previous iterative solvers for nonconvex sparse problems, PIRE is much more general and efficient: its computational cost in each iteration is usually as low as that of state-of-the-art convex solvers. We further propose the PIRE algorithm with Parallel Splitting (PIRE-PS) and the PIRE algorithm with Alternative Updating (PIRE-AU) to handle multi-variable problems. In theory, we prove that our proposed methods converge and that any limit solution is a stationary point. Extensive experiments on both synthetic and real datasets demonstrate that our methods achieve comparable learning performance while being much more efficient than previous nonconvex solvers.
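
A minimal numpy sketch of the reweighting step may help: at each iteration the smooth loss is linearized, the l1 penalty is reweighted by the derivative of a concave sparsity surrogate, and a soft-threshold is applied. The log surrogate and the sparse-regression instance below are illustrative assumptions, not the paper's general formulation.

    import numpy as np

    def pire_lasso(A, b, lam=0.1, eps=0.1, iters=200):
        L = np.linalg.norm(A, 2) ** 2            # Lipschitz constant of the gradient
        x = np.zeros(A.shape[1])
        for _ in range(iters):
            grad = A.T @ (A @ x - b)             # gradient of 0.5*||Ax - b||^2
            w = 1.0 / (eps + np.abs(x))          # g'(|x_i|) for g(t) = log(eps + t)
            z = x - grad / L                     # linearized (gradient) step
            x = np.sign(z) * np.maximum(np.abs(z) - lam * w / L, 0.0)  # weighted soft-threshold
        return x

    A = np.random.randn(30, 60)
    x_true = np.zeros(60); x_true[:4] = 3.0
    print(np.round(pire_lasso(A, A @ x_true)[:6], 2))  # first four entries near 3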

NeurIPS Conference 2014 Conference Paper

Robust Logistic Regression and Classification

  • Jiashi Feng
  • Huan Xu
  • Shie Mannor
  • Shuicheng Yan

We consider logistic regression with arbitrary outliers in the covariate matrix. We propose a new robust logistic regression algorithm, called RoLR, that estimates the parameter through a simple linear programming procedure. We prove that RoLR is robust to a constant fraction of adversarial outliers. To the best of our knowledge, this is the first result with performance guarantees for estimating the logistic regression model when the covariate matrix is corrupted. Beyond regression, we apply RoLR to solving binary classification problems where a fraction of the training samples are corrupted.

AAAI Conference 2014 Conference Paper

Similarity-Preserving Binary Signature for Linear Subspaces

  • Jianqiu Ji
  • Jianmin Li
  • Shuicheng Yan
  • Qi Tian
  • Bo Zhang

Linear subspaces are an important representation for many kinds of real-world data in computer vision and pattern recognition, e.g., faces, motion videos, and speech. In this paper, we first define a pairwise angular similarity and angular distance for linear subspaces. The angular distance satisfies non-negativity, identity of indiscernibles, symmetry and the triangle inequality, and thus it is a metric. We then propose a method to compress linear subspaces into compact similarity-preserving binary signatures, between which the normalized Hamming distance is an unbiased estimator of the angular distance. We provide a lower bound on the length of the binary signatures that suffices to guarantee uniform distance preservation within a set of subspaces. Experiments on face recognition demonstrate the effectiveness of the binary signature in terms of recognition accuracy, speed and storage requirements. The results show that, compared with the exact method, the approximation with binary signatures achieves an order of magnitude speedup while requiring a significantly smaller amount of storage, yet it still accurately preserves the similarity and achieves recognition accuracy comparable to the exact method.
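
The exact quantity being compressed can be computed via principal angles, as in the short numpy sketch below; the paper's precise angular-similarity definition may differ, so this is shown only as the standard principal-angle computation between two column spans.

    import numpy as np

    def principal_angles(A, B):
        # A, B: matrices whose columns span each subspace.
        Qa, _ = np.linalg.qr(A)                  # orthonormal bases
        Qb, _ = np.linalg.qr(B)
        s = np.linalg.svd(Qa.T @ Qb, compute_uv=False)
        return np.arccos(np.clip(s, -1.0, 1.0))  # angles in [0, pi/2]

    A = np.random.randn(20, 3)
    B = np.random.randn(20, 3)
    print(principal_angles(A, B))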

TIST Journal 2014 Journal Article

Snap & Play

  • Si Liu
  • Qiang Chen
  • Shuicheng Yan
  • Changsheng Xu
  • Hanqing Lu

In this article, taking a popular game, the Find-the-Difference (FiDi) game, as a concrete example, we explore how state-of-the-art image processing techniques can assist in developing a personalized, automatic, and dynamic game. Unlike the traditional FiDi game, where image pairs (source image and target image) with five different patches are manually produced by professional game developers, the proposed Personalized FiDi (P-FiDi) electronic game can be played in a fully automatic Snap & Play mode. Snap means that players first take photos with their digital cameras. The newly captured photos are used as source images and fed into the P-FiDi system to auto-generate the counterpart target images for users to play. Four steps are adopted to auto-generate target images: enhancing the visual quality of the source images, extracting changeable patches from the source image, selecting the most suitable combination of changeable patches and difference styles for the image, and generating the differences on the target image with state-of-the-art image processing techniques. In addition, the P-FiDi game can easily be redesigned for in-game advertising. Extensive experiments show that the P-FiDi electronic game is satisfying in terms of player experience, seamless advertisement, and technical feasibility.

AAAI Conference 2014 Conference Paper

Supervised Hashing for Image Retrieval via Image Representation Learning

  • Rongkai Xia
  • Yan Pan
  • Hanjiang Lai
  • Cong Liu
  • Shuicheng Yan

Hashing is a popular approximate nearest neighbor search approach for large-scale image retrieval. Supervised hashing, which incorporates similarity/dissimilarity information on entity pairs to improve the quality of hashing function learning, has recently received increasing attention. However, in existing supervised hashing methods for images, an input image is usually encoded by a vector of hand-crafted visual features. Such hand-crafted feature vectors do not necessarily preserve the accurate semantic similarities of image pairs, which may often degrade the performance of hashing function learning. In this paper, we propose a supervised hashing method for image retrieval in which we automatically learn a good image representation tailored to hashing as well as a set of hash functions. The proposed method has two stages. In the first stage, given the pairwise similarity matrix S over training images, we propose a scalable coordinate descent method to decompose S into a product $HH^\top$, where H is a matrix with each of its rows being the approximate hash code associated with a training image. In the second stage, we propose to simultaneously learn a good feature representation for the input images as well as a set of hash functions, via a deep convolutional network tailored to the learned hash codes in H and, optionally, the discrete class labels of the images. Extensive empirical evaluations on three benchmark datasets with different kinds of images show that the proposed method achieves superior performance gains over several state-of-the-art supervised and unsupervised hashing methods.
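
To make the first stage concrete, the numpy sketch below factors a toy similarity matrix S into approximate q-bit codes H with S roughly equal to (1/q) H H^T. Note the assumption: the paper uses a scalable coordinate descent for this decomposition, whereas the sketch takes a simple spectral shortcut (top eigenvectors, then signs) as a stand-in.

    import numpy as np

    def spectral_codes(S, q):
        # Naive stand-in for the paper's coordinate descent: take the
        # top-q eigenvectors of S, scale, and binarize to {-1, +1}.
        vals, vecs = np.linalg.eigh(S)               # ascending eigenvalues
        top = vecs[:, -q:] * np.sqrt(np.maximum(vals[-q:], 0.0))
        return np.sign(top + 1e-12)

    n, q = 8, 4
    labels = np.random.randint(0, 2, n)
    S = np.where(labels[:, None] == labels[None, :], 1.0, -1.0)
    H = spectral_codes(S, q)
    print(np.round(H @ H.T / q, 1))                  # should roughly match S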

NeurIPS Conference 2013 Conference Paper

Online PCA for Contaminated Data

  • Jiashi Feng
  • Huan Xu
  • Shie Mannor
  • Shuicheng Yan

We consider online Principal Component Analysis (PCA) for contaminated samples (containing outliers) which are revealed sequentially to the Principal Components (PCs) estimator. Due to their sensitivity to outliers, previous online PCA algorithms fail in this case and their results can be arbitrarily bad. Here we propose an online robust PCA algorithm, which is able to steadily improve the PC estimates from an initial one, even when faced with a constant fraction of outliers. We show that the final result of the proposed online RPCA has an acceptable degradation from the optimum. In fact, under mild conditions, online RPCA achieves maximal robustness with a $50\%$ breakdown point. Moreover, online RPCA is shown to be efficient in both storage and computation, since it need not re-explore the previous samples as traditional robust PCA algorithms do. This endows online RPCA with scalability for large-scale data.

NeurIPS Conference 2013 Conference Paper

Online Robust PCA via Stochastic Optimization

  • Jiashi Feng
  • Huan Xu
  • Shuicheng Yan

Robust PCA methods are typically based on batch optimization and have to load all the samples into memory. This prevents them from efficiently processing big data. In this paper, we develop an Online Robust Principal Component Analysis (OR-PCA) method that processes one sample per time instance, so its memory cost is independent of the data size, significantly enhancing computation and storage efficiency. The proposed method is based on stochastic optimization of an equivalent reformulation of the batch RPCA method. Indeed, we show that OR-PCA provides a sequence of subspace estimates converging to the optimum of its batch counterpart and hence is provably robust to sparse corruption. Moreover, OR-PCA can naturally be applied to tracking dynamic subspaces. Comprehensive simulations on subspace recovery and tracking demonstrate the robustness and efficiency advantages of OR-PCA over online PCA and batch RPCA methods.

AAAI Conference 2013 Conference Paper

Rank Aggregation via Low-Rank and Structured-Sparse Decomposition

  • Yan Pan
  • Hanjiang Lai
  • Cong Liu
  • Yong Tang
  • Shuicheng Yan

Rank aggregation, which combines multiple individual rank lists to obtain a better one, is a fundamental technique in various applications such as meta-search and recommendation systems. Most existing rank aggregation methods blindly combine multiple rank lists with possibly considerable noise, which often degrades their performance. In this paper, we propose a new model for robust rank aggregation (RRA) via matrix learning, which recovers a latent rank list from possibly incomplete and noisy input rank lists. In our model, we construct a pairwise comparison matrix to encode the order information in each input rank list. Based on our observations, each comparison matrix can be naturally decomposed into a shared low-rank matrix combined with a deviation error matrix which is the sum of a column-sparse matrix and a row-sparse one. The latent rank list can be easily extracted from the learned low-rank matrix. The optimization formulation of RRA has an element-wise multiplication operator to handle missing values, a symmetric constraint on the noise structure, and a factorization trick to restrict the maximum rank of the low-rank matrix. To solve this challenging optimization problem, we propose a novel procedure based on the Augmented Lagrangian Multiplier scheme. We conduct extensive experiments on meta-search and collaborative filtering benchmark datasets. The results show that the proposed RRA has superior performance gains over several state-of-the-art rank aggregation algorithms.

NeurIPS Conference 2012 Conference Paper

Super-Bit Locality-Sensitive Hashing

  • Jianqiu Ji
  • Jianmin Li
  • Shuicheng Yan
  • Bo Zhang
  • Qi Tian

Sign-random-projection locality-sensitive hashing (SRP-LSH) is a probabilistic dimension reduction method which provides an unbiased estimate of angular similarity, yet suffers from the large variance of its estimate. In this work, we propose Super-Bit locality-sensitive hashing (SBLSH), which orthogonalizes the random projection vectors in batches. It is easy to implement, and it is theoretically guaranteed that SBLSH also provides an unbiased estimate of angular similarity, yet with a smaller variance when the angle to estimate is within $(0, \pi/2]$. Extensive experiments on real data validate that, given the same length of binary code, SBLSH can achieve significant mean squared error reduction in estimating pairwise angular similarity. Moreover, SBLSH shows superiority over SRP-LSH in approximate nearest neighbor (ANN) retrieval experiments.
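
The batching idea is essentially a QR factorization per batch. The numpy sketch below orthogonalizes each batch of random projection vectors before taking sign bits; the batch size, code length, and fixed seed are illustrative assumptions.

    import numpy as np

    def sblsh_hash(x, dim, code_len, batch):
        # Orthogonalize projection vectors batch by batch, then sign.
        assert code_len % batch == 0 and batch <= dim
        rng = np.random.default_rng(0)
        blocks = []
        for _ in range(code_len // batch):
            W = rng.standard_normal((dim, batch))
            Q, _ = np.linalg.qr(W)               # one orthogonalized batch
            blocks.append(Q)
        P = np.hstack(blocks)                    # (dim, code_len) projections
        return (x @ P >= 0).astype(np.uint8)     # sign bits

    x = np.random.randn(3, 16)
    print(sblsh_hash(x, dim=16, code_len=8, batch=4))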

AAAI Conference 2011 Conference Paper

Efficient Subspace Segmentation via Quadratic Programming

  • Shusen Wang
  • Xiaotong Yuan
  • Tiansheng Yao
  • Shuicheng Yan
  • Jialie Shen

In this paper, we explore efficient algorithmic solutions to robust subspace segmentation. We propose SSQP, namely Subspace Segmentation via Quadratic Programming, to partition data drawn from multiple subspaces into multiple clusters. The basic idea of SSQP is to express each datum as a linear combination of other data, regularized by an overall term targeting zero reconstruction coefficients over vectors from different subspaces. The coefficient matrix derived by solving a quadratic programming problem is taken as an affinity matrix, upon which spectral clustering is applied to obtain the ultimate segmentation result. Like sparse subspace clustering (SSC) and low-rank representation (LRR), SSQP is robust to data noise, as validated by experiments on toy data. Experiments on the Hopkins 155 database show that SSQP achieves accuracy competitive with SSC and LRR in segmenting affine subspaces, while experimental results on the Extended Yale Face Database B demonstrate SSQP's superiority over SSC and LRR. Beyond segmentation accuracy, all experiments show that SSQP is much faster than both SSC and LRR in the practice of subspace segmentation.

TIST Journal 2011 Journal Article

Image annotation by kNN-sparse graph-based label propagation over noisily tagged web images

  • Jinhui Tang
  • Richang Hong
  • Shuicheng Yan
  • Tat-Seng Chua
  • Guo-Jun Qi
  • Ramesh Jain

In this article, we exploit the problem of annotating a large-scale image corpus by label propagation over noisily tagged web images. To annotate the images more accurately, we propose a novel kNN-sparse graph-based semi-supervised learning approach for harnessing the labeled and unlabeled data simultaneously. The sparse graph, constructed by datum-wise one-vs-kNN sparse reconstructions of all samples, can remove most of the semantically unrelated links among the data, and thus it is more robust and discriminative than conventional graphs. Meanwhile, we apply approximate k-nearest-neighbor search to accelerate the sparse graph construction without losing its effectiveness. More importantly, we propose an effective training label refinement strategy within this graph-based learning framework to handle noise in the training labels, by introducing a dual regularization for both the quantity and sparsity of the noise. We conduct extensive experiments on a real-world image database consisting of 55,615 Flickr images with noisily tagged training labels. The results demonstrate both the effectiveness and efficiency of the proposed approach and its capability to deal with noise in the training labels.

TIST Journal 2011 Journal Article

Recognizing pair-activities by causality analysis

  • Yue Zhou
  • Bingbing Ni
  • Shuicheng Yan
  • Thomas S. Huang

In this article, going beyond solo-activity analysis for a single object, we study the more complicated pair-activity recognition problem by exploring the relationship between two active objects based on their trajectory cues obtained from video sensors. Our contributions are three-fold. First, we design two sets of features for representing pair-activities encoded as length-variable trajectory pairs. One set characterizes the strength of causality between two trajectories, e.g., the causality ratio and feedback ratio based on the Granger Causality Test (GCT), and the other set describes the style of causality between two trajectories, e.g., the sampled frequency responses of the digital filter with the two trajectories as the input and output discrete signals respectively. These features, along with conventional velocity and position features of a trajectory pair, are essentially multi-modal and may differ greatly in scale and importance. To make full use of them, we then develop a novel feature fusion procedure that learns the coefficients for weighting these features by maximizing the discriminating power measured by weighted correlation. Finally, we collected a pair-activity database of five popular categories, each of which consists of about 170 instances. Extensive experiments on this database validate the effectiveness of the designed features for pair-activity representation, and also demonstrate that the proposed feature fusion procedure significantly boosts pair-activity classification accuracy.
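
The causality-strength features build on the Granger Causality Test, which is available off the shelf. The sketch below runs the raw test with statsmodels on a synthetic leader-follower trajectory pair; the paper's causality and feedback ratios are derived from such tests but are not reproduced here, and the synthetic data and lag choices are illustrative.

    import numpy as np
    from statsmodels.tsa.stattools import grangercausalitytests

    rng = np.random.default_rng(1)
    lead = rng.standard_normal(200).cumsum()                 # trajectory of object 1
    follow = np.roll(lead, 3) + 0.1 * rng.standard_normal(200)  # follows with lag 3

    # tests whether the second column Granger-causes the first
    res = grangercausalitytests(np.column_stack([follow, lead]), maxlag=5)
    print(res[3][0]["ssr_ftest"][1])  # p-value at lag 3 (small => causal link)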

AAAI Conference 2011 Conference Paper

Size Adaptive Selection of Most Informative Features

  • Si Liu
  • Hairong Liu
  • Longin Jan Latecki
  • Shuicheng Yan
  • Changsheng Xu
  • Hanqing Lu

In this paper, we propose a novel method to select the most informative subset of features, which has little redundancy and very strong discriminating power. Our proposed approach automatically determines the optimal number of features and selects the best subset accordingly by maximizing the average pairwise informativeness, and thus has an obvious advantage over traditional filter methods. By relaxing the essential combinatorial optimization problem into a standard quadratic programming problem, the most informative feature subset can be obtained efficiently, and a strategy to dynamically compute the redundancy between feature pairs further accelerates our method by avoiding unnecessary computations of mutual information. As shown by extensive experiments, the proposed method can successfully select the most informative subset of features, and the obtained classification results significantly outperform state-of-the-art results on most test datasets.

AAAI Conference 2010 Conference Paper

Non-Metric Locality-Sensitive Hashing

  • Yadong Mu
  • Shuicheng Yan

Non-metric distances are often more consistent with human perception than metric ones. However, existing locality-sensitive hashing (LSH) algorithms can only support data measured with metrics. In this paper, we propose a novel locality-sensitive hashing algorithm targeting such non-metric data. Data in the original feature space are embedded into an implicit reproducing kernel Kreĭn space and then hashed to obtain binary bits. Here we utilize the norm-keeping property of p-stable functions to ensure that the collision probability of two data points reflects their non-metric distance in the original feature space. We investigate various concrete examples to validate the proposed algorithm. Extensive empirical evaluations illustrate its effectiveness in terms of accuracy and retrieval speedup.

NeurIPS Conference 2010 Conference Paper

Robust Clustering as Ensembles of Affinity Relations

  • Hairong Liu
  • Longin Latecki
  • Shuicheng Yan

In this paper, we regard clustering as ensembles of k-ary affinity relations, where clusters correspond to subsets of objects with maximal average affinity. The average affinity relation of a cluster is relaxed and well approximated by a constrained homogeneous function. We present an efficient procedure to solve this optimization problem, and show that the underlying clusters can be robustly revealed by using priors systematically constructed from the data. Our method can automatically select some points to form clusters, leaving other points ungrouped; thus it is inherently robust to large numbers of outliers, which have seriously limited the applicability of classical methods. Our method also provides a unified solution to clustering from k-ary affinity relations with k ≥ 2, that is, it applies to both graph-based and hypergraph-based clustering problems. Both theoretical analysis and experimental results show the superiority of our method over classical solutions to the clustering problem, especially when there is a large number of outliers.

IJCAI Conference 2009 Conference Paper

Semi-Supervised Classification on Evolutionary Data

  • Yangqing Jia
  • Shuicheng Yan
  • Changshui Zhang

In this paper, we consider semi-supervised classification on evolutionary data, where the distribution of the data and the underlying concept that we aim to learn change over time due to short-term noise and long-term drift, making a single aggregated classifier inapplicable for long-term classification. The drift is smooth if we take a localized view over the time dimension, which enables us to impose a temporal smoothness assumption on the learning algorithm. We first discuss how to realize this assumption using temporal regularizers defined in a structural way with respect to the Hilbert space, and then derive an online algorithm that efficiently finds the closed-form solution to the classification functions. Experimental results on real-world evolutionary mailing list data demonstrate that our algorithm outperforms classical semi-supervised learning algorithms in both algorithmic stability and classification accuracy.

IJCAI Conference 2007 Conference Paper

A Convergent Solution to Tensor Subspace Learning

  • Huan Wang
  • Shuicheng Yan
  • Thomas Huang
  • Xiaoou Tang

Recently, substantial efforts have been devoted to subspace learning techniques based on tensor representation, such as 2DLDA, DATER and Tensor Subspace Analysis (TSA). In this context, a vital yet unsolved problem is that the computational convergence of these iterative algorithms is not guaranteed. In this work, we present a novel solution procedure for general tensor-based subspace learning, followed by a detailed convergence proof for the solution projection matrices and the objective function value. Extensive experiments on real-world databases verify the high convergence speed of the proposed procedure, as well as its superiority in classification capability over traditional solution procedures.

ICML Conference 2007 Conference Paper

Transductive regression piloted by inter-manifold relations

  • Huan Wang 0001
  • Shuicheng Yan
  • Thomas S. Huang
  • Jianzhuang Liu
  • Xiaoou Tang

In this paper, we present a novel semi-supervised regression algorithm for multi-class data that may lie on multiple manifolds. Unlike conventional manifold regression algorithms that do not consider the class distinction of samples, our method introduces class information into the regression process and tries to exploit the similar configurations shared by the label distributions of multi-class data. To utilize the correlations among data from different classes, we develop a cross-manifold label propagation process and employ labels from different classes to enhance the regression performance. The inter-class relations are encoded by a set of inter-manifold graphs, and a regularization term is introduced to impose inter-class smoothness on the possible solutions. In addition, the algorithm is further extended with the kernel trick to predict labels of out-of-sample data even without class information. Experiments on both synthetic data and real-world problems validate the effectiveness of the proposed framework for semi-supervised regression.