Arrow Research search

Author name cluster

Ming-Hsuan Yang

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

56 papers
1 author row

Possible papers (56)

TMLR Journal 2026 Journal Article

Any Image Restoration via Efficient Spatial-Frequency Degradation Adaptation

  • Bin Ren
  • Eduard Zamfir
  • Zongwei Wu
  • Yawei Li
  • Yidi Li
  • Danda Pani Paudel
  • Radu Timofte
  • Ming-Hsuan Yang

Restoring multiple degradations efficiently via just one model has become increasingly significant and impactful, especially with the proliferation of mobile devices. Traditional solutions typically involve training dedicated models per degradation, resulting in inefficiency and redundancy. More recent approaches either introduce additional modules to learn visual prompts, significantly increasing the size of the model, or incorporate cross-modal transfer from large language models trained on vast datasets, adding complexity to the system architecture. In contrast, our approach, termed AnyIR, takes a unified path that leverages inherent similarity across various degradations to enable both efficient and comprehensive restoration through a joint embedding mechanism, without scaling up the model or relying on large language models. Specifically, we examine the sub-latent space of each input, identifying key components and reweighting them first in a gated manner. To unify intrinsic degradation awareness with contextualized attention, we propose a spatial–frequency parallel fusion strategy that strengthens spatially informed local–global interactions and enriches restoration fidelity from the frequency domain. Comprehensive evaluations across four all-in-one restoration benchmarks demonstrate that AnyIR attains state-of-the-art performance while reducing model parameters by 84% and FLOPs by 80% relative to the baseline. These results highlight the potential of AnyIR as an effective and lightweight solution for further all-in-one image restoration. Our code is available at: https://github.com/Amazingren/AnyIR.
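As a rough illustration of the gated spatial-frequency idea sketched in this abstract, the minimal PyTorch module below reweights input features with a learned gate, then fuses a spatial branch (depthwise convolution) with a frequency branch (FFT mixing) in parallel. All module and parameter names are assumptions for illustration only; this is not the released AnyIR code.

```python
import torch
import torch.nn as nn


class SpatialFrequencyFusion(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        # Gated reweighting of the input features ("key component" gating).
        self.gate = nn.Sequential(nn.Conv2d(channels, channels, 1), nn.Sigmoid())
        # Spatial branch: local context via a depthwise convolution.
        self.spatial = nn.Conv2d(channels, channels, 3, padding=1, groups=channels)
        # Frequency branch: pointwise mixing of the real/imaginary FFT parts.
        self.freq_mix = nn.Conv2d(channels, channels, 1)
        self.out = nn.Conv2d(channels, channels, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x * self.gate(x)                              # reweight key components
        spatial = self.spatial(x)                         # spatially informed path
        spec = torch.fft.rfft2(x, norm="ortho")           # frequency-domain path
        spec = torch.complex(self.freq_mix(spec.real), self.freq_mix(spec.imag))
        freq = torch.fft.irfft2(spec, s=x.shape[-2:], norm="ortho")
        return self.out(spatial + freq) + x               # parallel fusion + residual


features = torch.randn(1, 32, 64, 64)
print(SpatialFrequencyFusion(32)(features).shape)         # torch.Size([1, 32, 64, 64])
```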

TMLR Journal 2026 Journal Article

KITTEN: A Knowledge-Integrated Evaluation of Image Generation on Visual Entities

  • Hsin-Ping Huang
  • Xinyi Wang
  • Yonatan Bitton
  • Hagai Taitelbaum
  • Gaurav Singh Tomar
  • Ming-Wei Chang
  • Xuhui Jia
  • Kelvin C.K. Chan

Recent advances in text-to-image generation have improved the quality of synthesized images, but evaluations mainly focus on aesthetics or alignment with text prompts. Thus, it remains unclear whether these models can accurately represent a wide variety of realistic visual entities. To bridge this gap, we propose KITTEN, a benchmark for Knowledge-InTegrated image generaTion on real-world ENtities. Using KITTEN, we conduct a systematic study of recent text-to-image models, retrieval-augmented models, and unified understanding and generation models, focusing on their ability to generate real-world visual entities such as landmarks and animals. Analyses using carefully designed human evaluations, automatic metrics, and MLLMs as judges show that even advanced text-to-image and unified models fail to generate accurate visual details of entities. While retrieval-augmented models improve entity fidelity by incorporating reference images, they tend to over-rely on them and struggle to create novel configurations of the entities in creative text prompts. The dataset and evaluation code are publicly available at https://kitten-project.github.io.

AAAI Conference 2026 Conference Paper

Tracking the Unstable: Appearance-Guided Motion Modeling for Robust Multi-Object Tracking in UAV-Captured Videos

  • Jianbo Ma
  • Hui Luo
  • Qi Chen
  • Yuankai Qi
  • Yumei Sun
  • Amin Beheshti
  • Jianlin Zhang
  • Ming-Hsuan Yang

Multi-object tracking (MOT) aims to track multiple objects while maintaining consistent identities across frames of a given video. In unmanned aerial vehicle (UAV) recorded videos, frequent viewpoint changes and complex UAV-ground relative motion dynamics pose significant challenges, which often lead to unstable affinity measurement and ambiguous association. Existing methods typically model motion and appearance cues separately, overlooking their spatio-temporal interplay and resulting in suboptimal tracking performance. In this work, we propose AMOT, which jointly exploits appearance and motion cues through two key components: an Appearance-Motion Consistency (AMC) matrix and a Motion-aware Track Continuation (MTC) module. Specifically, the AMC matrix computes bi-directional spatial consistency under the guidance of appearance features, enabling more reliable and context-aware identity association. The MTC module complements AMC by reactivating unmatched tracks through appearance-guided predictions that align with Kalman-based predictions, thereby reducing broken trajectories caused by missed detections. Extensive experiments on three UAV benchmarks, including VisDrone2019, UAVDT, and VT-MOT-UAV, demonstrate that our AMOT outperforms current state-of-the-art methods and generalizes well in a plug-and-play and training-free manner.
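A hedged sketch of how appearance and motion cues could be folded into a single association cost, in the spirit of the appearance-guided consistency described above; the box format, embedding source, and blending weight are assumptions, not the AMOT implementation.

```python
import numpy as np


def pairwise_iou(boxes_a: np.ndarray, boxes_b: np.ndarray) -> np.ndarray:
    """Pairwise IoU for boxes in (x1, y1, x2, y2) format."""
    x1 = np.maximum(boxes_a[:, None, 0], boxes_b[None, :, 0])
    y1 = np.maximum(boxes_a[:, None, 1], boxes_b[None, :, 1])
    x2 = np.minimum(boxes_a[:, None, 2], boxes_b[None, :, 2])
    y2 = np.minimum(boxes_a[:, None, 3], boxes_b[None, :, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (boxes_a[:, 2] - boxes_a[:, 0]) * (boxes_a[:, 3] - boxes_a[:, 1])
    area_b = (boxes_b[:, 2] - boxes_b[:, 0]) * (boxes_b[:, 3] - boxes_b[:, 1])
    return inter / (area_a[:, None] + area_b[None, :] - inter + 1e-9)


def association_cost(pred_boxes, det_boxes, track_feats, det_feats, alpha=0.5):
    """Lower is better: motion term (IoU between Kalman-predicted and detected
    boxes) blended with an appearance term (cosine similarity of embeddings)."""
    motion = pairwise_iou(pred_boxes, det_boxes)
    tf = track_feats / np.linalg.norm(track_feats, axis=1, keepdims=True)
    df = det_feats / np.linalg.norm(det_feats, axis=1, keepdims=True)
    appearance = tf @ df.T
    return 1.0 - (alpha * motion + (1.0 - alpha) * appearance)
```

A Hungarian assignment over this cost matrix would then yield the frame-to-frame identity matches; reactivating unmatched tracks, as in the MTC module, would sit on top of such an association step.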

TMLR Journal 2026 Journal Article

Video Prediction Transformers without Recurrence or Convolution

  • Yujin Tang
  • Lu Qi
  • Xiangtai Li
  • Chao Ma
  • Ming-Hsuan Yang

Video prediction has witnessed the emergence of RNN-based models led by ConvLSTM, and CNN-based models led by SimVP. Following the significant success of ViT, recent works have integrated ViT into both RNN and CNN frameworks, achieving improved performance. While we appreciate these prior approaches, we raise a fundamental question: Is there a simpler yet more effective solution that can eliminate the high computational cost of RNNs while addressing the limited receptive fields and poor generalization of CNNs? How far can it go with a simple pure transformer model for video prediction? In this paper, we propose PredFormer, a framework entirely based on Gated Transformers. We provide a comprehensive analysis of 3D Attention in the context of video prediction. Extensive experiments demonstrate that PredFormer delivers state-of-the-art performance across four standard benchmarks. The significant improvements in both accuracy and efficiency highlight the potential of PredFormer as a strong baseline for real-world video prediction applications. The source code and trained models will be released to the public.
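For readers unfamiliar with gated transformers, the block below is a minimal sketch of one: full attention over the flattened spatio-temporal ("3D") token sequence followed by a gated (GLU-style) feed-forward path. Names and sizes are illustrative assumptions, not the released PredFormer architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GatedTransformerBlock(nn.Module):
    def __init__(self, dim: int, heads: int = 4, hidden: int = 256):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        # Gated (GLU-style) feed-forward: a value path modulated by a gate path.
        self.value = nn.Linear(dim, hidden)
        self.gate = nn.Linear(dim, hidden)
        self.proj = nn.Linear(hidden, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]   # attention over all T*H*W tokens
        h = self.norm2(x)
        return x + self.proj(F.silu(self.gate(h)) * self.value(h))


# Tokens from a T x H x W patch grid, flattened into one sequence per clip.
tokens = torch.randn(2, 4 * 8 * 8, 64)                      # (batch, T*H*W, dim)
print(GatedTransformerBlock(64)(tokens).shape)
```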

NeurIPS Conference 2025 Conference Paper

4KAgent: Agentic Any Image to 4K Super-Resolution

  • Yushen Zuo
  • Qi Zheng
  • Mingyang Wu
  • Xinrui Jiang
  • Renjie Li
  • Jian Wang
  • Yide Zhang
  • Gengchen Mai

We present 4KAgent, a unified agentic super-resolution generalist system designed to universally upscale any image to 4K resolution (and even higher, if applied iteratively). Our system can transform images from extremely low resolutions with severe degradations, for example, highly distorted inputs at $256\times 256$, into crystal-clear, photorealistic 4K outputs. 4KAgent comprises three core components: (1) Profiling, a module that customizes the 4KAgent pipeline based on bespoke use cases; (2) A Perception Agent, which leverages vision-language models alongside image quality assessment experts to analyze the input image and make a tailored restoration plan; and (3) A Restoration Agent, which executes the plan, following a recursive execution-reflection paradigm, guided by a quality-driven mixture-of-experts policy to select the optimal output for each step. Additionally, 4KAgent embeds a specialized face restoration pipeline, significantly enhancing facial details in portrait and selfie photos. We rigorously evaluate our 4KAgent across 11 distinct task categories encompassing a total of 26 diverse benchmarks, setting new state-of-the-art on a broad spectrum of imaging domains. Our evaluations cover natural images, portrait photos, AI-generated content, satellite imagery, fluorescence microscopy, and medical imaging like fundoscopy, ultrasound, and X-ray, demonstrating superior performance in terms of both perceptual (e.g., NIQE, MUSIQ) and fidelity (e.g., PSNR) metrics. By establishing a novel agentic paradigm for low-level vision tasks, we aim to catalyze broader interest and innovation within vision-centric autonomous agents across diverse research communities. We release all the code, models, and results at: https://4kagent.github.io.
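The execution-reflection loop with quality-driven expert selection can be pictured roughly as below; the expert pool and scoring function are placeholders, so this is only an assumed schematic of the control flow, not 4KAgent's released tooling.

```python
from typing import Callable, List


def restore_iteratively(image, experts: List[Callable], score: Callable,
                        max_steps: int = 3):
    """Greedy, quality-driven selection over a pool of restoration experts."""
    current = image
    for _ in range(max_steps):
        candidates = [expert(current) for expert in experts]   # execution
        best = max(candidates, key=score)                      # reflection
        if score(best) <= score(current):                      # stop when no gain
            break
        current = best
    return current


# Toy usage with stand-in "experts" acting on a number instead of an image.
experts = [lambda x: x * 0.9, lambda x: x - 1.0, lambda x: x / 2.0]
print(restore_iteratively(10.0, experts, score=lambda x: -abs(x)))
```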

TMLR Journal 2025 Journal Article

CoCoIns: Consistent Subject Generation via Contrastive Instantiated Concepts

  • Lee Hsin-Ying
  • Kelvin C.K. Chan
  • Ming-Hsuan Yang

While text-to-image generative models can synthesize diverse and faithful content, subject variation across multiple generations limits their application to long-form content generation. Existing approaches require time-consuming fine-tuning, reference images for all subjects, or access to previously generated content. We introduce Contrastive Concept Instantiation (CoCoIns), a framework that effectively synthesizes consistent subjects across multiple independent generations. The framework consists of a generative model and a mapping network that transforms input latent codes into pseudo-words associated with specific concept instances. Users can generate consistent subjects by reusing the same latent codes. To construct such associations, we propose a contrastive learning approach that trains the network to distinguish between different combinations of prompts and latent codes. Extensive evaluations on human faces with a single subject show that CoCoIns performs comparably to existing methods while maintaining greater flexibility. We also demonstrate the potential for extending CoCoIns to multiple subjects and other object categories.
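A minimal sketch of the contrastive association described above, assuming that embeddings produced from the same latent code (under different prompts) act as positives and all other pairs as negatives; the function and shapes are illustrative, not the CoCoIns training code.

```python
import torch
import torch.nn.functional as F


def latent_contrastive_loss(feats: torch.Tensor, latent_ids: torch.Tensor,
                            temperature: float = 0.1) -> torch.Tensor:
    """feats: (N, D) embeddings; latent_ids: (N,) identity of the latent code
    each embedding was generated from. Same code = positive pair."""
    feats = F.normalize(feats, dim=1)
    logits = feats @ feats.T / temperature
    self_mask = torch.eye(len(feats), dtype=torch.bool)
    positives = (latent_ids[:, None] == latent_ids[None, :]) & ~self_mask
    logits = logits.masked_fill(self_mask, float("-inf"))    # drop self-pairs
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    per_anchor = -(log_prob.masked_fill(~positives, 0.0)).sum(dim=1)
    per_anchor = per_anchor / positives.sum(dim=1).clamp(min=1)
    return per_anchor[positives.any(dim=1)].mean()


feats = torch.randn(8, 32)
ids = torch.tensor([0, 0, 1, 1, 2, 2, 3, 3])                 # 4 codes, 2 prompts each
print(latent_contrastive_loss(feats, ids))
```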

NeurIPS Conference 2025 Conference Paper

DGS-LRM: Real-Time Deformable 3D Gaussian Reconstruction From Monocular Videos

  • Chieh Lin
  • Zhaoyang Lv
  • Songyin Wu
  • Zhen Xu
  • Thu Nguyen-Phuoc
  • Hung-Yu Tseng
  • Julian Straub
  • Numair Khan

We introduce the Deformable Gaussian Splats Large Reconstruction Model (DGS-LRM), the first feed-forward method predicting deformable 3D Gaussian splats from a monocular posed video of any dynamic scene. Feed-forward scene reconstruction has gained significant attention for its ability to rapidly create digital replicas of real-world environments. However, most existing models are limited to static scenes and fail to reconstruct the motion of moving objects. Developing a feed-forward model for dynamic scene reconstruction poses significant challenges, including the scarcity of training data and the need for appropriate 3D representations and training paradigms. To address these challenges, we introduce several key technical contributions: an enhanced large-scale synthetic dataset with ground-truth multi-view videos and dense 3D scene flow supervision; a per-pixel deformable 3D Gaussian representation that is easy to learn, supports high-quality dynamic view synthesis, and enables long-range 3D tracking; and a large transformer network that achieves real-time, generalizable dynamic scene reconstruction. Extensive qualitative and quantitative experiments demonstrate that DGS-LRM achieves dynamic scene reconstruction quality comparable to optimization-based methods, while significantly outperforming the state-of-the-art predictive dynamic reconstruction method on real-world examples. Its predicted physically grounded 3D deformation is accurate and can be readily adapted for long-range 3D tracking tasks, achieving performance on par with state-of-the-art monocular video 3D tracking methods.

NeurIPS Conference 2025 Conference Paper

EA3D: Online Open-World 3D Object Extraction from Streaming Videos

  • Xiaoyu Zhou
  • Jingqi Wang
  • Yuang Jia
  • Yongtao Wang
  • Deqing Sun
  • Ming-Hsuan Yang

Current 3D scene understanding methods are limited by offline-collected multi-view data or pre-constructed 3D geometry. In this paper, we present ExtractAnything3D (EA3D), a unified online framework for open-world 3D object extraction that enables simultaneous geometric reconstruction and holistic scene understanding. Given a streaming video, EA3D dynamically interprets each frame using vision-language and 2D vision foundation encoders to extract object-level knowledge. This knowledge is integrated and embedded into a Gaussian feature map via a feed-forward online update strategy. We then iteratively estimate visual odometry from historical frames and incrementally update online Gaussian features with new observations. A recurrent joint optimization module directs the model's attention to regions of interest, simultaneously enhancing both geometric reconstruction and semantic understanding. Extensive experiments across diverse benchmarks and tasks, including photo-realistic rendering, semantic and instance segmentation, 3D bounding box and semantic occupancy estimation, and 3D mesh generation, demonstrate the effectiveness of EA3D. Our method establishes a unified and efficient framework for joint online 3D reconstruction and holistic scene understanding, enabling a broad range of downstream tasks. The project webpage is available at https://github.com/VDIGPKU/EA3D.

AAAI Conference 2025 Conference Paper

Generating Synthetic Data for Unsupervised Federated Learning of Cross-Modal Retrieval

  • Tianlong Zhang
  • Zhe Xue
  • Adnan Mahmood
  • Junping Du
  • Yuchen Dong
  • Shilong Ou
  • Lang Feng
  • Ming-Hsuan Yang

Unsupervised federated learning for cross-modal retrieval has received increasing attention in recent years as it removes the need for annotations and avoids uploading original clients’ data to servers. Most existing methods focus on how to learn better local models and their aggregation to overcome data distribution drift across clients. Unlike prior works, we propose to address the data distribution problem by generating synthetic data, which can benefit existing federated learning methods. Specifically, we train a WGAN generator with three newly designed loss constraints on each client to improve the quality of the generated data. We first compute cluster prototypes to address the problem of lack of labels. Then, a direct contrastive loss between generated image and text features, an indirect contrastive loss with reference to cluster prototypes, and a Jensen-Shannon Divergence (JSD) loss also with reference to cluster prototypes work together to constrain the WGAN. The locally trained generators and local prototypes are sent to the server to generate and filter synthetic data with consideration of data distribution across all clients. The filtered data are used to train the aggregated global retrieval model, which is later sent to clients. The final global model becomes robust to all clients after several rounds of client-server iteration. Extensive experiments using four baselines across three datasets demonstrate that our method performs favourably against state-of-the-art methods.
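The three generator-side constraints can be pictured with the loose sketch below, assuming per-sample image/text features and softmax prototype-assignment distributions; the loss forms and weights are illustrative stand-ins, not the authors' implementation.

```python
import torch
import torch.nn.functional as F


def js_divergence(p: torch.Tensor, q: torch.Tensor) -> torch.Tensor:
    """Jensen-Shannon divergence between two batches of probability rows."""
    m = 0.5 * (p + q)
    return 0.5 * (F.kl_div(m.log(), p, reduction="batchmean")
                  + F.kl_div(m.log(), q, reduction="batchmean"))


def info_nce(a: torch.Tensor, b: torch.Tensor, t: float = 0.1) -> torch.Tensor:
    """Row i of `a` should match row i of `b` and no other row."""
    logits = F.normalize(a, dim=1) @ F.normalize(b, dim=1).T / t
    return F.cross_entropy(logits, torch.arange(len(a)))


def prototype_pull(feats: torch.Tensor, assign: torch.Tensor,
                   prototypes: torch.Tensor) -> torch.Tensor:
    """Pull each feature toward its most likely cluster prototype."""
    target = prototypes[assign.argmax(dim=1)]
    return 1.0 - F.cosine_similarity(feats, target, dim=1).mean()


def generator_loss(img_f, txt_f, assign_img, assign_txt, prototypes,
                   w=(1.0, 1.0, 1.0)):
    direct = info_nce(img_f, txt_f)                              # image <-> text
    indirect = (prototype_pull(img_f, assign_img, prototypes)    # prototype reference
                + prototype_pull(txt_f, assign_txt, prototypes))
    jsd = js_divergence(assign_img, assign_txt)                  # align soft assignments
    return w[0] * direct + w[1] * indirect + w[2] * jsd


B, D, K = 16, 64, 8
protos = F.normalize(torch.randn(K, D), dim=1)
a_img = torch.softmax(torch.randn(B, K), dim=1)
a_txt = torch.softmax(torch.randn(B, K), dim=1)
print(generator_loss(torch.randn(B, D), torch.randn(B, D), a_img, a_txt, protos))
```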

NeurIPS Conference 2025 Conference Paper

HoliGS: Holistic Gaussian Splatting for Embodied View Synthesis

  • Xiaoyuan Wang
  • Yizhou Zhao
  • Botao Ye
  • Shan Xiaojun
  • Weijie Lyu
  • Lu Qi
  • Kelvin Chan
  • Yinxiao Li

We propose HoliGS, a novel deformable Gaussian splatting framework that addresses embodied view synthesis from long monocular RGB videos. Unlike prior 4D Gaussian splatting and dynamic NeRF pipelines, which struggle with training overhead in minute-long captures, our method leverages invertible Gaussian Splatting deformation networks to reconstruct large-scale, dynamic environments accurately. Specifically, we decompose each scene into a static background plus time-varying objects, each represented by learned Gaussian primitives undergoing global rigid transformations, skeleton-driven articulation, and subtle non-rigid deformations via an invertible neural flow. This hierarchical warping strategy enables robust free-viewpoint novel-view rendering from various embodied camera trajectories by attaching Gaussians to a complete canonical foreground shape (e.g., egocentric or third-person follow), which may involve substantial viewpoint changes and interactions between multiple actors. Our experiments demonstrate that HoliGS achieves superior reconstruction quality on challenging datasets while significantly reducing both training and rendering time compared to state-of-the-art monocular deformable NeRFs. These results highlight a practical and scalable solution for EVS in real-world scenarios. The source code will be released.

NeurIPS Conference 2025 Conference Paper

IllumiCraft: Unified Geometry and Illumination Diffusion for Controllable Video Generation

  • Yuanze Lin
  • Yi-Wen Chen
  • Yi-Hsuan Tsai
  • Ronald Clark
  • Ming-Hsuan Yang

Although diffusion-based models can generate high-quality and high-resolution video sequences from textual or image inputs, they lack explicit integration of geometric cues when controlling scene lighting and visual appearance across frames. To address this limitation, we propose IllumiCraft, an end-to-end diffusion framework accepting three complementary inputs: (1) high-dynamic-range (HDR) video maps for detailed lighting control; (2) synthetically relit frames with randomized illumination changes (optionally paired with a static background reference image) to provide appearance cues; and (3) 3D point tracks that capture precise 3D geometry information. By integrating the lighting, appearance, and geometry cues within a unified diffusion architecture, IllumiCraft generates temporally coherent videos aligned with user-defined prompts. It supports the background-conditioned and text-conditioned video relighting and provides better fidelity than existing controllable video generation methods.

NeurIPS Conference 2025 Conference Paper

InstaInpaint: Instant 3D-Scene Inpainting with Masked Large Reconstruction Model

  • Junqi You
  • Chieh Lin
  • Weijie Lyu
  • Zhengbo Zhang
  • Ming-Hsuan Yang

Recent advances in 3D scene reconstruction enable real-time viewing in virtual and augmented reality. To support interactive operations for better immersiveness, such as moving or editing objects, 3D scene inpainting methods are proposed to repair or complete the altered geometry. However, current approaches rely on lengthy and computationally intensive optimization, making them impractical for real-time or online applications. We propose InstaInpaint, a reference-based feed-forward framework that produces 3D-scene inpainting from a 2D inpainting proposal within 0.4 seconds. We develop a self-supervised masked-finetuning strategy to enable training of our custom large reconstruction model (LRM) on the large-scale dataset. Through extensive experiments, we analyze and identify several key designs that improve generalization, textural consistency, and geometric correctness. InstaInpaint achieves a 1000$\times$ speed-up over prior methods while maintaining state-of-the-art performance across two standard benchmarks. Moreover, we show that InstaInpaint generalizes well to flexible downstream applications such as object insertion and multi-region inpainting.

NeurIPS Conference 2025 Conference Paper

KRIS-Bench: Benchmarking Next-Level Intelligent Image Editing Models

  • Yongliang Wu
  • Zonghui Li
  • Xinting Hu
  • Xinyu Ye
  • Xianfang Zeng
  • Gang Yu
  • Wenbo Zhu
  • Bernt Schiele

Recent advances in multi-modal generative models have enabled significant progress in instruction-based image editing. However, while these models produce visually plausible outputs, their capacity for knowledge-based reasoning editing tasks remains under-explored. In this paper, we introduce KRIS-Bench (Knowledge-based Reasoning in Image-editing Systems Benchmark), a diagnostic benchmark designed to assess models through a cognitively informed lens. Drawing from educational theory, KRIS-Bench categorizes editing tasks across three foundational knowledge types: Factual, Conceptual, and Procedural. Based on this taxonomy, we design 22 representative tasks spanning 7 reasoning dimensions and release 1,267 high-quality annotated editing instances. To support fine-grained evaluation, we propose a comprehensive protocol that incorporates a novel Knowledge Plausibility metric, enhanced by knowledge hints and calibrated through human studies. Empirical results on nine state-of-the-art models reveal significant gaps in reasoning performance, highlighting the need for knowledge-centric benchmarks to advance the development of intelligent image editing systems.

NeurIPS Conference 2025 Conference Paper

OpenAD: Open-World Autonomous Driving Benchmark for 3D Object Detection

  • Zhongyu Xia
  • Jishuo Li
  • Zhiwei Lin
  • Xinhao Wang
  • Yongtao Wang
  • Ming-Hsuan Yang

Open-world perception aims to develop a model that adapts to novel domains and various sensor configurations and can understand uncommon objects and corner cases. However, current research lacks sufficiently comprehensive open-world 3D perception benchmarks and robust generalizable methodologies. This paper introduces OpenAD, the first real open-world autonomous driving benchmark for 3D object detection. OpenAD is built upon a corner case discovery and annotation pipeline that integrates with a multimodal large language model (MLLM). The proposed pipeline annotates corner case objects in a unified format for five autonomous driving perception datasets with 2000 scenarios. In addition, we devise evaluation methodologies and evaluate various open-world and specialized 2D and 3D models. Moreover, we propose a vision-centric 3D open-world object detection baseline and further introduce an ensemble method by fusing general and specialized models to address the issue of lower precision in existing open-world methods for the OpenAD benchmark. We host an online challenge on EvalAI. Data, toolkit codes, and evaluation codes are available at https://github.com/VDIGPKU/OpenAD.

NeurIPS Conference 2025 Conference Paper

RAPID Hand: Robust, Affordable, Perception-Integrated, Dexterous Manipulation Platform for Embodied Intelligence

  • Zhaoliang Wan
  • Zetong Bi
  • Zida Zhou
  • Hao Ren
  • Yiming Zeng
  • Yihan Li
  • Lu Qi
  • Xu Yang

This paper addresses the scarcity of low-cost but high-dexterity platforms for collecting real-world multi-fingered robot manipulation data towards generalist robot autonomy. To achieve it, we propose the RAPID Hand, a co-optimized hardware and software platform where the compact 20-DoF hand, robust whole-hand perception, and high-DoF teleoperation interface are jointly designed. Specifically, RAPID Hand adopts a compact and practical hand ontology and a hardware-level perception framework that stably integrates wrist-mounted vision, fingertip tactile sensing, and proprioception with sub-7 ms latency and spatial alignment. Collecting high-quality demonstrations on high-DoF hands is challenging, as existing teleoperation methods struggle with precision and stability on complex multi-fingered systems. We address this by co-optimizing hand design, perception integration, and teleoperation interface through a universal actuation scheme, custom perception electronics, and two retargeting constraints. We evaluate the platform’s hardware, perception, and teleoperation interface. Training a diffusion policy on collected data shows superior performance over prior works, validating the system’s capability for reliable, high-quality data collection. The platform is constructed from low-cost and off-the-shelf components and will be made public to ensure reproducibility and ease of adoption.

NeurIPS Conference 2025 Conference Paper

Restage4D: Reanimating Deformable 3D Reconstruction from a Single Video

  • Jixuan He
  • Chieh Lin
  • Lu Qi
  • Ming-Hsuan Yang

Motion is one of the key components in deformable 3D scenes. Generative video models allow users to animate static scenes with text prompts for novel motion, but when it comes to 4D reconstruction, such reanimations often fall apart. The generated videos often suffer from geometric artifacts, implausible motion, and occlusions, which hinder physically consistent 4D reanimation. In this work, we introduce Restage4D, a geometry-preserving pipeline for deformable scene reconstruction from a single edited video. Our key insight is to leverage the unedited original video as an additional source of supervision, allowing the model to propagate accurate structure into occluded and disoccluded regions. To achieve this, we propose a video-rewinding training scheme that temporally bridges the edited and original sequences via a shared motion representation. We further introduce an occlusion-aware ARAP regularization to preserve local rigidity, and a disocclusion backtracing mechanism that supplements missing geometry in the canonical space. Together, these components enable robust reconstruction even when the edited input contains hallucinated content or inconsistent motion. We validate Restage4D on DAVIS and PointOdyssey, demonstrating improved geometry consistency, motion quality, and 3D tracking performance. Our method not only preserves deformable structure under novel motion, but also automatically corrects errors introduced by generative models, bridging the gap between flexible video synthesis and physically grounded 4D reconstruction.

AAAI Conference 2024 Conference Paper

BEV-MAE: Bird’s Eye View Masked Autoencoders for Point Cloud Pre-training in Autonomous Driving Scenarios

  • Zhiwei Lin
  • Yongtao Wang
  • Shengxiang Qi
  • Nan Dong
  • Ming-Hsuan Yang

Existing LiDAR-based 3D object detection methods for autonomous driving scenarios mainly adopt the training-from-scratch paradigm. Unfortunately, this paradigm heavily relies on large-scale labeled data, whose collection can be expensive and time-consuming. Self-supervised pre-training is an effective and desirable way to alleviate this dependence on extensive annotated data. In this work, we present BEV-MAE, an efficient masked autoencoder pre-training framework for LiDAR-based 3D object detection in autonomous driving. Specifically, we propose a bird's eye view (BEV) guided masking strategy to guide the 3D encoder learning feature representation in a BEV perspective and avoid complex decoder design during pre-training. Furthermore, we introduce a learnable point token to maintain a consistent receptive field size of the 3D encoder with fine-tuning for masked point cloud inputs. Based on the property of outdoor point clouds in autonomous driving scenarios, i.e., the point clouds of distant objects are more sparse, we propose point density prediction to enable the 3D encoder to learn location information, which is essential for object detection. Experimental results show that BEV-MAE surpasses prior state-of-the-art self-supervised methods and achieves favorable pre-training efficiency. Furthermore, based on TransFusion-L, BEV-MAE achieves new state-of-the-art LiDAR-based 3D object detection results, with 73.6 NDS and 69.6 mAP on the nuScenes benchmark. The source code will be released at https://github.com/VDIGPKU/BEV-MAE.
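A small assumed illustration of BEV-guided masking: points are grouped by occupied bird's-eye-view cells, and whole cells, rather than individual points, are hidden from the encoder. Cell size and mask ratio here are arbitrary choices for the sketch, not the paper's settings.

```python
import numpy as np


def bev_guided_mask(points: np.ndarray, cell_size: float = 1.0,
                    mask_ratio: float = 0.7, seed: int = 0):
    """points: (N, 3+) array; returns (visible_points, masked_points)."""
    rng = np.random.default_rng(seed)
    cells = np.floor(points[:, :2] / cell_size).astype(np.int64)   # BEV cell index
    occupied, inverse = np.unique(cells, axis=0, return_inverse=True)
    n_masked = int(len(occupied) * mask_ratio)
    masked_cells = rng.choice(len(occupied), size=n_masked, replace=False)
    is_masked = np.isin(inverse, masked_cells)
    return points[~is_masked], points[is_masked]


points = np.random.rand(1000, 3) * 50.0
visible, masked = bev_guided_mask(points)
print(visible.shape, masked.shape)
```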

AAAI Conference 2024 Conference Paper

CSL: Class-Agnostic Structure-Constrained Learning for Segmentation Including the Unseen

  • Hao Zhang
  • Fang Li
  • Lu Qi
  • Ming-Hsuan Yang
  • Narendra Ahuja

Addressing Out-Of-Distribution (OOD) Segmentation and Zero-Shot Semantic Segmentation (ZS3) is challenging, necessitating segmenting unseen classes. Existing strategies adapt the class-agnostic Mask2Former (CA-M2F) tailored to specific tasks. However, these methods cater to singular tasks, demand training from scratch, and we demonstrate certain deficiencies in CA-M2F, which affect performance. We propose Class-Agnostic Structure-Constrained Learning (CSL), a plug-in framework that can integrate with existing methods, thereby embedding structural constraints and achieving performance gains on tasks involving unseen classes, specifically OOD, ZS3, and domain adaptation (DA). There are two schemes for CSL to integrate with existing methods: (1) by distilling knowledge from a base teacher network, enforcing constraints across the training and inference phases, or (2) by leveraging established models to obtain per-pixel distributions without retraining, appending constraints during the inference phase. Our soft assignment and mask split methodologies enhance OOD object segmentation. Empirical evaluations demonstrate CSL's prowess in boosting the performance of existing algorithms spanning OOD segmentation, ZS3, and DA segmentation, consistently surpassing the state of the art across all three tasks.

NeurIPS Conference 2024 Conference Paper

Extending Video Masked Autoencoders to 128 frames

  • Nitesh B. Gundavarapu
  • Luke Friedman
  • Raghav Goyal
  • Chaitra Hegde
  • Eirikur Agustsson
  • Sagar Waghmare
  • Mikhail Sirotenko
  • Ming-Hsuan Yang

Video understanding has witnessed significant progress with recent video foundation models demonstrating strong performance owing to self-supervised pre-training objectives; Masked Autoencoders (MAE) being the design of choice. Nevertheless, the majority of prior works that leverage MAE pre-training have focused on relatively short video representations (16 / 32 frames in length), largely due to hardware memory and compute limitations that scale poorly with video length owing to the dense, memory-intensive self-attention decoding. One natural strategy to address these challenges is to subsample tokens to reconstruct during decoding (or decoder masking). In this work, we propose an effective strategy for prioritizing tokens that allows training on longer video sequences (128 frames) and achieves better performance than the more typical random and uniform masking strategies. The core of our approach is an adaptive decoder masking strategy that prioritizes the most important tokens and uses quantized tokens as reconstruction objectives. Our adaptive strategy leverages a powerful MAGVIT-based tokenizer that jointly learns the tokens and their priority. We validate our design choices through exhaustive ablations and observe improved performance of the resulting long-video (128 frames) encoders over short-video (32 frames) counterparts. With our long-video masked autoencoder (LVMAE) strategy, we surpass state-of-the-art on Diving48 by 3.9 points and EPIC-Kitchens-100 verb classification by 2.5 points while relying on a simple core architecture and video-only pre-training (unlike some of the prior works that require millions of labeled video-text pairs or specialized encoders).
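The decoder-masking idea can be sketched as a simple top-k selection over per-token priority scores, which in the paper's description come from a learned MAGVIT-style tokenizer; the snippet below assumes the scores are already available and is illustrative only.

```python
import torch


def select_decoder_targets(tokens: torch.Tensor, priority: torch.Tensor,
                           keep_ratio: float = 0.25):
    """tokens: (B, N, D); priority: (B, N) importance scores (higher = keep).
    Returns the kept reconstruction targets and their indices."""
    k = max(1, int(tokens.shape[1] * keep_ratio))
    top = priority.topk(k, dim=1).indices                          # (B, k)
    idx = top.unsqueeze(-1).expand(-1, -1, tokens.shape[-1])
    return tokens.gather(1, idx), top


tokens = torch.randn(2, 1568, 256)        # a long-video token sequence
priority = torch.rand(tokens.shape[:2])   # would come from the learned tokenizer
targets, kept_idx = select_decoder_targets(tokens, priority)
print(targets.shape)                      # torch.Size([2, 392, 256])
```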

NeurIPS Conference 2024 Conference Paper

SemFlow: Binding Semantic Segmentation and Image Synthesis via Rectified Flow

  • Chaoyang Wang
  • Xiangtai Li
  • Lu Qi
  • Henghui Ding
  • Yunhai Tong
  • Ming-Hsuan Yang

Semantic segmentation and semantic image synthesis are two representative tasks in visual perception and generation. While existing methods consider them as two distinct tasks, we propose a unified framework (SemFlow) and model them as a pair of reverse problems. Specifically, motivated by rectified flow theory, we train an ordinary differential equation (ODE) model to transport between the distributions of real images and semantic masks. As the training objective is symmetric, samples belonging to the two distributions, images and semantic masks, can be effortlessly transferred reversibly. For semantic segmentation, our approach solves the contradiction between the randomness of diffusion outputs and the uniqueness of segmentation results. For image synthesis, we propose a finite perturbation approach to enhance the diversity of generated results without changing the semantic categories. Experiments show that our SemFlow achieves competitive results on semantic segmentation and semantic image synthesis tasks. We hope this simple framework will motivate people to rethink the unification of low-level and high-level vision.
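For concreteness, a minimal rectified-flow training step looks roughly like the sketch below: the network regresses the constant velocity of a straight path between a paired mask encoding and image. The toy network and 64-dimensional vectors are assumptions standing in for the actual latent representations used in the paper.

```python
import torch
import torch.nn as nn

# Toy velocity network over 64-dimensional "latents" (an assumption).
velocity_net = nn.Sequential(nn.Linear(64 + 1, 128), nn.SiLU(), nn.Linear(128, 64))


def rectified_flow_loss(x_image: torch.Tensor, x_mask: torch.Tensor) -> torch.Tensor:
    """x_image, x_mask: (B, 64) paired samples from the two distributions."""
    t = torch.rand(x_image.shape[0], 1)
    x_t = (1 - t) * x_mask + t * x_image          # point on the straight path
    target_v = x_image - x_mask                   # constant displacement to regress
    pred_v = velocity_net(torch.cat([x_t, t], dim=1))
    return ((pred_v - target_v) ** 2).mean()


loss = rectified_flow_loss(torch.randn(8, 64), torch.randn(8, 64))
loss.backward()
print(float(loss))
```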

NeurIPS Conference 2024 Conference Paper

Sharing Key Semantics in Transformer Makes Efficient Image Restoration

  • Bin Ren
  • Yawei Li
  • Jingyun Liang
  • Rakesh Ranjan
  • Mengyuan Liu
  • Rita Cucchiara
  • Luc Van Gool
  • Ming-Hsuan Yang

Image Restoration (IR), a classic low-level vision task, has witnessed significant advancements through deep models that effectively model global information. Notably, the emergence of Vision Transformers (ViTs) has further propelled these advancements. When computing, the self-attention mechanism, a cornerstone of ViTs, tends to encompass all global cues, even those from semantically unrelated objects or regions. This inclusivity introduces computational inefficiencies, particularly noticeable with high input resolution, as it requires processing irrelevant information, thereby impeding efficiency. Additionally, for IR, it is commonly noted that small segments of a degraded image, particularly those closely aligned semantically, provide particularly relevant information to aid in the restoration process, as they contribute essential contextual cues crucial for accurate reconstruction. To address these challenges, we propose boosting IR's performance by sharing the key semantics via Transformer for IR (i.e., SemanIR) in this paper. Specifically, SemanIR initially constructs a sparse yet comprehensive key-semantic dictionary within each transformer stage by establishing essential semantic connections for every degraded patch. Subsequently, this dictionary is shared across all subsequent transformer blocks within the same stage. This strategy optimizes attention calculation within each block by focusing exclusively on semantically related components stored in the key-semantic dictionary. As a result, attention calculation achieves linear computational complexity within each window. Extensive experiments across 6 IR tasks confirm the proposed SemanIR's state-of-the-art performance, quantitatively and qualitatively showcasing advancements. The visual results, code, and trained models are available at: https://github.com/Amazingren/SemanIR.

TMLR Journal 2024 Journal Article

VideoGLUE: Video General Understanding Evaluation of Foundation Models

  • Liangzhe Yuan
  • Nitesh Bharadwaj Gundavarapu
  • Long Zhao
  • Hao Zhou
  • Yin Cui
  • Lu Jiang
  • Xuan Yang
  • Menglin Jia

We evaluate the video understanding capabilities of existing foundation models (FMs) using a carefully designed experiment protocol consisting of three hallmark tasks (action recognition, temporal localization, and spatiotemporal localization), eight datasets well received by the community, and four adaptation methods tailoring an FM for downstream tasks. Furthermore, we jointly profile FMs’ efficacy and efficiency when adapting to general video understanding tasks using cost measurements during both training and inference. Our main findings are as follows. First, task-specialized models significantly outperform the seven FMs studied in this work, in sharp contrast to what FMs have achieved in natural language and image understanding. Second, video-native FMs, whose pretraining data mainly contains the video modality, are generally better than image-native FMs in classifying motion-rich videos, localizing actions in time, and understanding a video of more than one action. Third, the video-native FMs can perform well on video tasks under light adaptations to downstream tasks (e.g., freezing the FM backbones), while image-native FMs win in full end-to-end finetuning. The first two observations reveal the need and tremendous opportunities to conduct research on video-focused FMs, and the last confirms that both tasks and adaptation methods matter when it comes to the evaluation of FMs. Our code is released at: https://github.com/tensorflow/models/tree/master/official/projects/videoglue

NeurIPS Conference 2023 Conference Paper

A Tale of Two Features: Stable Diffusion Complements DINO for Zero-Shot Semantic Correspondence

  • Junyi Zhang
  • Charles Herrmann
  • Junhwa Hur
  • Luisa Polania Cabrera
  • Varun Jampani
  • Deqing Sun
  • Ming-Hsuan Yang

Text-to-image diffusion models have made significant advances in generating and editing high-quality images. As a result, numerous approaches have explored the ability of diffusion model features to understand and process single images for downstream tasks, e.g., classification, semantic segmentation, and stylization. However, significantly less is known about what these features reveal across multiple, different images and objects. In this work, we exploit Stable Diffusion (SD) features for semantic and dense correspondence and discover that with simple post-processing, SD features can perform quantitatively similar to SOTA representations. Interestingly, the qualitative analysis reveals that SD features have very different properties compared to existing representation learning features, such as the recently released DINOv2: while DINOv2 provides sparse but accurate matches, SD features provide high-quality spatial information but sometimes inaccurate semantic matches. We demonstrate that a simple fusion of these two features works surprisingly well, and a zero-shot evaluation using nearest neighbors on these fused features provides a significant performance gain over state-of-the-art methods on benchmark datasets, e.g., SPair-71k, PF-Pascal, and TSS. We also show that these correspondences can enable interesting applications such as instance swapping in two images. Project page: https://sd-complements-dino.github.io/.
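The fusion-and-match step described above can be sketched as follows, assuming per-patch SD and DINO descriptors have already been extracted: normalize each feature set, concatenate them, and take nearest neighbours between the fused descriptors of two images. Shapes and names are illustrative, not the project's released code.

```python
import torch
import torch.nn.functional as F


def fuse_features(sd_feat: torch.Tensor, dino_feat: torch.Tensor) -> torch.Tensor:
    """sd_feat: (N, D1) and dino_feat: (N, D2) per-patch descriptors."""
    return torch.cat([F.normalize(sd_feat, dim=1), F.normalize(dino_feat, dim=1)], dim=1)


def nearest_neighbour_matches(src: torch.Tensor, tgt: torch.Tensor) -> torch.Tensor:
    """For each source patch, the index of the most similar target patch."""
    sim = F.normalize(src, dim=1) @ F.normalize(tgt, dim=1).T
    return sim.argmax(dim=1)


src = fuse_features(torch.randn(196, 1280), torch.randn(196, 768))
tgt = fuse_features(torch.randn(196, 1280), torch.randn(196, 768))
print(nearest_neighbour_matches(src, tgt).shape)   # torch.Size([196])
```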

NeurIPS Conference 2023 Conference Paper

AIMS: All-Inclusive Multi-Level Segmentation for Anything

  • Lu Qi
  • Jason Kuen
  • Weidong Guo
  • Jiuxiang Gu
  • Zhe Lin
  • Bo Du
  • Yu Xu
  • Ming-Hsuan Yang

Despite the progress of image segmentation for accurate visual entity segmentation, completing the diverse requirements of image editing applications for different-level region-of-interest selections remains unsolved. In this paper, we propose a new task, All-Inclusive Multi-Level Segmentation (AIMS), which segments visual regions into three levels: part, entity, and relation (two entities with some semantic relationships). We also build a unified AIMS model through multi-dataset multi-task training to address the two major challenges of annotation inconsistency and task correlation. Specifically, we propose task complementarity, association, and prompt mask encoder for three-level predictions. Extensive experiments demonstrate the effectiveness and generalization capacity of our method compared to other state-of-the-art methods on a single dataset or the concurrent work on segment anything. We will make our code and training model publicly available.

NeurIPS Conference 2023 Conference Paper

ARTIC3D: Learning Robust Articulated 3D Shapes from Noisy Web Image Collections

  • Chun-Han Yao
  • Amit Raj
  • Wei-Chih Hung
  • Michael Rubinstein
  • Yuanzhen Li
  • Ming-Hsuan Yang
  • Varun Jampani

Estimating 3D articulated shapes like animal bodies from monocular images is inherently challenging due to the ambiguities of camera viewpoint, pose, texture, lighting, etc. We propose ARTIC3D, a self-supervised framework to reconstruct per-instance 3D shapes from a sparse image collection in-the-wild. Specifically, ARTIC3D is built upon a skeleton-based surface representation and is further guided by 2D diffusion priors from Stable Diffusion. First, we enhance the input images with occlusions/truncation via 2D diffusion to obtain cleaner mask estimates and semantic features. Second, we perform diffusion-guided 3D optimization to estimate shape and texture that are of high-fidelity and faithful to input images. We also propose a novel technique to calculate more stable image-level gradients via diffusion models compared to existing alternatives. Finally, we produce realistic animations by fine-tuning the rendered shape and texture under rigid part transformations. Extensive evaluations on multiple existing datasets as well as newly introduced noisy web image collections with occlusions and truncation demonstrate that ARTIC3D outputs are more robust to noisy images, higher quality in terms of shape and texture details, and more realistic when animated.

NeurIPS Conference 2023 Conference Paper

Diffusion-SS3D: Diffusion Model for Semi-supervised 3D Object Detection

  • Cheng-Ju Ho
  • Chen-Hsuan Tai
  • Yen-Yu Lin
  • Ming-Hsuan Yang
  • Yi-Hsuan Tsai

Semi-supervised object detection is crucial for 3D scene understanding, efficiently addressing the limitation of acquiring large-scale 3D bounding box annotations. Existing methods typically employ a teacher-student framework with pseudo-labeling to leverage unlabeled point clouds. However, producing reliable pseudo-labels in a diverse 3D space still remains challenging. In this work, we propose Diffusion-SS3D, a new perspective of enhancing the quality of pseudo-labels via the diffusion model for semi-supervised 3D object detection. Specifically, we include noises to produce corrupted 3D object size and class label distributions, and then utilize the diffusion model as a denoising process to obtain bounding box outputs. Moreover, we integrate the diffusion model into the teacher-student framework, so that the denoised bounding boxes can be used to improve pseudo-label generation, as well as the entire semi-supervised learning process. We conduct experiments on the ScanNet and SUN RGB-D benchmark datasets to demonstrate that our approach achieves state-of-the-art performance against existing methods. We also present extensive analysis to understand how our diffusion model design affects performance in semi-supervised learning. The source code will be available at https://github.com/luluho1208/Diffusion-SS3D.

NeurIPS Conference 2023 Conference Paper

Module-wise Adaptive Distillation for Multimodality Foundation Models

  • Chen Liang
  • Jiahui Yu
  • Ming-Hsuan Yang
  • Matthew Brown
  • Yin Cui
  • Tuo Zhao
  • Boqing Gong
  • Tianyi Zhou

Pre-trained multimodal foundation models have demonstrated remarkable generalizability but pose challenges for deployment due to their large sizes. One effective approach to reducing their sizes is layerwise distillation, wherein small student models are trained to match the hidden representations of large teacher models at each layer. Motivated by our observation that certain architecture components, referred to as modules, contribute more significantly to the student's performance than others, we propose to track the contributions of individual modules by recording the loss decrement after distilling each module and choose the module with a greater contribution to distill more frequently. Such an approach can be naturally formulated as a multi-armed bandit (MAB) problem, where modules and loss decrements are considered as arms and rewards, respectively. We then develop a modified-Thompson sampling algorithm named OPTIMA to address the nonstationarity of module contributions resulting from model updating. Specifically, we leverage the observed contributions in recent history to estimate the changing contribution of each module and select modules based on these estimations to maximize the cumulative contribution. We evaluate the effectiveness of OPTIMA through distillation experiments on various multimodal understanding and image captioning tasks, using the CoCa-Large model (Yu et al., 2022) as the teacher model.
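A schematic bandit loop matching this description, with modules as arms and observed loss decrements as rewards; the sampling rule below (draw one recent reward per module and pick the best draw) is a simple nonstationary, Thompson-style stand-in, not the OPTIMA algorithm itself.

```python
import random
from collections import deque


class ModuleBandit:
    def __init__(self, modules, history: int = 20):
        # Keep only a recent window of rewards per module (nonstationarity).
        self.rewards = {m: deque(maxlen=history) for m in modules}

    def select(self) -> str:
        """Draw one recent reward per module and pick the module with the best
        draw; modules with no history yet are explored first."""
        draws = {m: (random.choice(h) if h else float("inf"))
                 for m, h in self.rewards.items()}
        return max(draws, key=draws.get)

    def update(self, module: str, loss_decrement: float) -> None:
        self.rewards[module].append(loss_decrement)


bandit = ModuleBandit(["vision_encoder", "text_encoder", "multimodal_decoder"])
for _ in range(10):
    module = bandit.select()                  # module to distill at this step
    bandit.update(module, random.random())    # stand-in for the measured gain
```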

NeurIPS Conference 2023 Conference Paper

SPAE: Semantic Pyramid AutoEncoder for Multimodal Generation with Frozen LLMs

  • Lijun Yu
  • Yong Cheng
  • Zhiruo Wang
  • Vivek Kumar
  • Wolfgang Macherey
  • Yanping Huang
  • David Ross
  • Irfan Essa

In this work, we introduce Semantic Pyramid AutoEncoder (SPAE) for enabling frozen LLMs to perform both understanding and generation tasks involving non-linguistic modalities such as images or videos. SPAE converts between raw pixels and interpretable lexical tokens (or words) extracted from the LLM's vocabulary. The resulting tokens capture both the rich semantic meaning and the fine-grained details needed for visual reconstruction, effectively translating the visual content into a language comprehensible to the LLM, and empowering it to perform a wide array of multimodal tasks. Our approach is validated through in-context learning experiments with frozen PaLM 2 and GPT 3.5 on a diverse set of image understanding and generation tasks. Our method marks the first successful attempt to enable a frozen LLM to generate image content while surpassing state-of-the-art performance in image understanding tasks, under the same setting, by over 25%.

NeurIPS Conference 2023 Conference Paper

Video Timeline Modeling For News Story Understanding

  • Meng Liu
  • Mingda Zhang
  • Jialu Liu
  • Hanjun Dai
  • Ming-Hsuan Yang
  • Shuiwang Ji
  • Zheyun Feng
  • Boqing Gong

In this paper, we present a novel problem, namely video timeline modeling. Our objective is to create a video-associated timeline from a set of videos related to a specific topic, thereby facilitating the content and structure understanding of the story being told. This problem has significant potential in various real-world applications, for instance, news story summarization. To bootstrap research in this area, we curate a realistic benchmark dataset, YouTube-News-Timeline, consisting of over 12k timelines and 300k YouTube news videos. Additionally, we propose a set of quantitative metrics to comprehensively evaluate and compare methodologies. With such a testbed, we further develop and benchmark several deep learning approaches to tackling this problem. We anticipate that this exploratory work will pave the way for further research in video timeline modeling. The assets are available via https://github.com/google-research/google-research/tree/master/video_timeline_modeling.

NeurIPS Conference 2022 Conference Paper

LASSIE: Learning Articulated Shapes from Sparse Image Ensemble via 3D Part Discovery

  • Chun-Han Yao
  • Wei-Chih Hung
  • Yuanzhen Li
  • Michael Rubinstein
  • Ming-Hsuan Yang
  • Varun Jampani

Creating high-quality articulated 3D models of animals is challenging either via manual creation or using 3D scanning tools. Therefore, techniques to reconstruct articulated 3D objects from 2D images are crucial and highly useful. In this work, we propose a practical problem setting to estimate 3D pose and shape of animals given only a few (10-30) in-the-wild images of a particular animal species (say, horse). Contrary to existing works that rely on pre-defined template shapes, we do not assume any form of 2D or 3D ground-truth annotations, nor do we leverage any multi-view or temporal information. Moreover, each input image ensemble can contain animal instances with varying poses, backgrounds, illuminations, and textures. Our key insight is that 3D parts have much simpler shape compared to the overall animal and that they are robust w.r.t. animal pose articulations. Following these insights, we propose LASSIE, a novel optimization framework which discovers 3D parts in a self-supervised manner with minimal user intervention. A key driving force behind LASSIE is the enforcing of 2D-3D part consistency using self-supervisory deep features. Experiments on Pascal-Part and self-collected in-the-wild animal datasets demonstrate considerably better 3D reconstructions as well as both 2D and 3D part discovery compared to prior arts. Project page: https://chhankyao.github.io/lassie/

NeurIPS Conference 2021 Conference Paper

End-to-end Multi-modal Video Temporal Grounding

  • Yi-Wen Chen
  • Yi-Hsuan Tsai
  • Ming-Hsuan Yang

We address the problem of text-guided video temporal grounding, which aims to identify the time interval of a certain event based on a natural language description. Different from most existing methods that only consider RGB images as visual features, we propose a multi-modal framework to extract complementary information from videos. Specifically, we adopt RGB images for appearance, optical flow for motion, and depth maps for image structure. While RGB images provide abundant visual cues of certain events, the performance may be affected by background clutters. Therefore, we use optical flow to focus on large motion and depth maps to infer the scene configuration when the action is related to objects recognizable with their shapes. To integrate the three modalities more effectively and enable inter-modal learning, we design a dynamic fusion scheme with transformers to model the interactions between modalities. Furthermore, we apply intra-modal self-supervised learning to enhance feature representations across videos for each modality, which also facilitates multi-modal learning. We conduct extensive experiments on the Charades-STA and ActivityNet Captions datasets, and show that the proposed method performs favorably against state-of-the-art approaches.

NeurIPS Conference 2021 Conference Paper

Exploring Cross-Video and Cross-Modality Signals for Weakly-Supervised Audio-Visual Video Parsing

  • Yan-Bo Lin
  • Hung-Yu Tseng
  • Hsin-Ying Lee
  • Yen-Yu Lin
  • Ming-Hsuan Yang

The audio-visual video parsing task aims to temporally parse a video into audio or visual event categories. However, it is labor intensive to temporally annotate audio and visual events and thus hampers the learning of a parsing model. To this end, we propose to explore additional cross-video and cross-modality supervisory signals to facilitate weakly-supervised audio-visual video parsing. The proposed method exploits both the common and diverse event semantics across videos to identify audio or visual events. In addition, our method explores event co-occurrence across audio, visual, and audio-visual streams. We leverage the explored cross-modality co-occurrence to localize segments of target events while excluding irrelevant ones. The discovered supervisory signals across different videos and modalities can greatly facilitate the training with only video-level annotations. Quantitative and qualitative results demonstrate that the proposed method performs favorably against existing methods on weakly-supervised audio-visual video parsing.

NeurIPS Conference 2021 Conference Paper

Intriguing Properties of Vision Transformers

  • Muhammad Muzammal Naseer
  • Kanchana Ranasinghe
  • Salman H Khan
  • Munawar Hayat
  • Fahad Shahbaz Khan
  • Ming-Hsuan Yang

Vision transformers (ViT) have demonstrated impressive performance across numerous machine vision tasks. These models are based on multi-head self-attention mechanisms that can flexibly attend to a sequence of image patches to encode contextual cues. An important question is how such flexibility (in attending image-wide context conditioned on a given patch) can facilitate handling nuisances in natural images, e.g., severe occlusions, domain shifts, spatial permutations, adversarial and natural perturbations. We systematically study this question via an extensive set of experiments encompassing three ViT families and provide comparisons with a high-performing convolutional neural network (CNN). We show and analyze the following intriguing properties of ViT: (a) Transformers are highly robust to severe occlusions, perturbations and domain shifts, e.g., retain as high as 60% top-1 accuracy on ImageNet even after randomly occluding 80% of the image content. (b) The robustness towards occlusions is not due to texture bias; instead, we show that ViTs are significantly less biased towards local textures, compared to CNNs. When properly trained to encode shape-based features, ViTs demonstrate shape recognition capability comparable to that of the human visual system, previously unmatched in the literature. (c) Using ViTs to encode shape representation leads to an interesting consequence of accurate semantic segmentation without pixel-level supervision. (d) Off-the-shelf features from a single ViT model can be combined to create a feature ensemble, leading to high accuracy rates across a range of classification datasets in both traditional and few-shot learning paradigms. We show effective features of ViTs are due to flexible and dynamic receptive fields possible via self-attention mechanisms. Our code will be publicly released.

NeurIPS Conference 2021 Conference Paper

Learning 3D Dense Correspondence via Canonical Point Autoencoder

  • An-Chieh Cheng
  • Xueting Li
  • Min Sun
  • Ming-Hsuan Yang
  • Sifei Liu

We propose a canonical point autoencoder (CPAE) that predicts dense correspondences between 3D shapes of the same category. The autoencoder performs two key functions: (a) encoding an arbitrarily ordered point cloud to a canonical primitive, e.g., a sphere, and (b) decoding the primitive back to the original input instance shape. As being placed in the bottleneck, this primitive plays a key role to map all the unordered point clouds on the canonical surface, and to be reconstructed in an ordered fashion. Once trained, points from different shape instances that are mapped to the same locations on the primitive surface are determined to be a pair of correspondence. Our method does not require any form of annotation or self-supervised part segmentation network and can handle unaligned input point clouds within a certain rotation range. Experimental results on 3D semantic keypoint transfer and part segmentation transfer show that our model performs favorably against state-of-the-art correspondence learning methods.

AAAI Conference 2020 Conference Paper

Adversarial Learning of Privacy-Preserving and Task-Oriented Representations

  • Taihong Xiao
  • Yi-Hsuan Tsai
  • Kihyuk Sohn
  • Manmohan Chandraker
  • Ming-Hsuan Yang

Data privacy has emerged as an important issue as data-driven deep learning has been an essential component of modern machine learning systems. For instance, there could be a potential privacy risk of machine learning systems via the model inversion attack, whose goal is to reconstruct the input data from the latent representation of deep networks. Our work aims at learning a privacy-preserving and task-oriented representation to defend against such model inversion attacks. Specifically, we propose an adversarial reconstruction learning framework that prevents the latent representations from being decoded into the original input data. By simulating the expected behavior of an adversary, our framework is realized by minimizing the negative pixel reconstruction loss or the negative feature reconstruction (i.e., perceptual distance) loss. We validate the proposed method on face attribute prediction, showing that our method allows protecting visual privacy with a small decrease in utility performance. In addition, we show the utility-privacy trade-off with different choices of hyperparameter for the negative perceptual distance loss at training, allowing service providers to determine the right level of privacy protection with a certain utility performance. Moreover, we provide an extensive study with different selections of features, tasks, and the data to further analyze their influence on privacy protection.
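The adversarial reconstruction objective can be sketched in a few lines: a decoder is trained to invert the encoder's features, while the encoder is trained for the task and to make that inversion fail (the negative reconstruction loss). Network sizes, the 784-dimensional toy input, and the trade-off weight are assumptions for illustration, not the paper's configuration.

```python
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 64))
task_head = nn.Linear(64, 10)
decoder = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 784))

opt_main = torch.optim.Adam(list(encoder.parameters()) + list(task_head.parameters()), lr=1e-3)
opt_adv = torch.optim.Adam(decoder.parameters(), lr=1e-3)
ce, mse, lam = nn.CrossEntropyLoss(), nn.MSELoss(), 0.5

x, y = torch.randn(32, 784), torch.randint(0, 10, (32,))

# Adversary step: train the decoder to invert features back into inputs.
opt_adv.zero_grad()
mse(decoder(encoder(x).detach()), x).backward()
opt_adv.step()

# Main step: good task performance, poor reconstructability (negative recon loss).
opt_main.zero_grad()
z = encoder(x)
loss = ce(task_head(z), y) - lam * mse(decoder(z), x)
loss.backward()
opt_main.step()
```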

NeurIPS Conference 2020 Conference Paper

Online Adaptation for Consistent Mesh Reconstruction in the Wild

  • Xueting Li
  • Sifei Liu
  • Shalini De Mello
  • Kihwan Kim
  • Xiaolong Wang
  • Ming-Hsuan Yang
  • Jan Kautz

This paper presents an algorithm to reconstruct temporally consistent 3D meshes of deformable object instances from videos in the wild. Without requiring annotations of 3D mesh, 2D keypoints, or camera pose for each video frame, we pose video-based reconstruction as a self-supervised online adaptation problem applied to any incoming test video. We first learn a category-specific 3D reconstruction model from a collection of single-view images of the same category that jointly predicts the shape, texture, and camera pose of an image. Then, at inference time, we adapt the model to a test video over time using self-supervised regularization terms that exploit temporal consistency of an object instance to enforce that all reconstructed meshes share a common texture map, a base shape, as well as parts. We demonstrate that our algorithm recovers temporally consistent and reliable 3D structures from videos of non-rigid objects including those of animals captured in the wild -- an extremely challenging task rarely addressed before.

NeurIPS Conference 2019 Conference Paper

Dancing to Music

  • Hsin-Ying Lee
  • Xiaodong Yang
  • Ming-Yu Liu
  • Ting-Chun Wang
  • Yu-Ding Lu
  • Ming-Hsuan Yang
  • Jan Kautz

Dancing to music is an instinctive move by humans. Learning to model the music-to-dance generation process is, however, a challenging problem. It requires significant effort to measure the correlation between music and dance, as one needs to simultaneously consider multiple aspects, such as the style and beat of both music and dance. Additionally, dance is inherently multimodal: at any moment, various follow-up movements of a pose are equally plausible. In this paper, we propose a synthesis-by-analysis learning framework to generate dance from music. In the top-down analysis phase, we decompose a dance into a series of basic dance units, through which the model learns how to move. In the bottom-up synthesis phase, the model learns how to compose a dance by combining multiple basic dancing movements seamlessly according to the input music. Experimental qualitative and quantitative results demonstrate that the proposed method can synthesize realistic, diverse, style-consistent, and beat-matching dances from music.

NeurIPS Conference 2019 Conference Paper

Joint-task Self-supervised Learning for Temporal Correspondence

  • Xueting Li
  • Sifei Liu
  • Shalini De Mello
  • Xiaolong Wang
  • Jan Kautz
  • Ming-Hsuan Yang

This paper proposes to learn reliable dense correspondence from videos in a self-supervised manner. Our learning process integrates two highly related tasks: tracking large image regions and establishing fine-grained pixel-level associations between consecutive video frames. We exploit the synergy between both tasks through a shared inter-frame affinity matrix, which simultaneously models transitions between video frames at both the region and pixel levels. While region-level localization helps reduce ambiguities in fine-grained matching by narrowing down search regions, fine-grained matching provides bottom-up features to facilitate region-level localization. Our method outperforms state-of-the-art self-supervised methods on a variety of visual correspondence tasks, including video object and part segmentation propagation, keypoint tracking, and object tracking. Our self-supervised method even surpasses the fully supervised affinity feature representation obtained from a ResNet-18 pre-trained on ImageNet.
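
A tiny sketch of how such a shared inter-frame affinity can be formed from per-pixel features. The function name and temperature value are illustrative assumptions; the joint-task losses and region-level pathway of the paper are not shown.

import torch
import torch.nn.functional as F

def interframe_affinity(feat1, feat2, temperature=0.07):
    """feat1, feat2: [C, H*W] per-pixel features of two frames. Each row of the
    returned matrix is a transition distribution from a pixel in frame 1 over
    the pixels of frame 2, shared by the region- and pixel-level tasks."""
    feat1 = F.normalize(feat1, dim=0)
    feat2 = F.normalize(feat2, dim=0)
    return torch.softmax(feat1.t() @ feat2 / temperature, dim=1)

A = interframe_affinity(torch.randn(64, 256), torch.randn(64, 256))
print(A.shape, A.sum(dim=1)[:3])   # each row sums to 1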

AAAI Conference 2019 Conference Paper

Learning Attribute-Specific Representations for Visual Tracking

  • Yuankai Qi
  • Shengping Zhang
  • Weigang Zhang
  • Li Su
  • Qingming Huang
  • Ming-Hsuan Yang

In recent years, convolutional neural networks (CNNs) have achieved great success in visual tracking. Most existing methods train or fine-tune a binary classifier to distinguish the target from its background. However, they may suffer from performance degradation due to insufficient training data. In this paper, we show that attribute information (e.g., illumination changes, occlusion, and motion) in the context facilitates training an effective classifier for visual tracking. In particular, we design an attribute-based CNN with multiple branches, where each branch is responsible for classifying the target under a specific attribute. Such a design reduces the appearance diversity of the target under each attribute and thus requires less data to train the model. We combine all attribute-specific features via ensemble layers to obtain more discriminative representations for the final target/background classification. The proposed method achieves favorable performance on the OTB100 dataset compared to state-of-the-art tracking methods. After being trained on the VOT datasets, the proposed network also shows good generalization ability on the UAV-Traffic dataset, which has significantly different attributes and target appearances from the VOT datasets.
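
A minimal PyTorch sketch of the branch-then-ensemble layout described here, assuming a small stand-in backbone and arbitrary feature sizes; it illustrates the structure, not the trained tracker.

import torch
import torch.nn as nn

class AttributeBranchNet(nn.Module):
    def __init__(self, feat_dim=256, num_attributes=5):
        super().__init__()
        self.backbone = nn.Sequential(                      # small shared trunk (stand-in)
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        # one branch per attribute (e.g., illumination, occlusion, motion, ...)
        self.branches = nn.ModuleList(
            [nn.Linear(64, feat_dim) for _ in range(num_attributes)])
        # ensemble layer fuses attribute-specific features for target/background classification
        self.ensemble = nn.Linear(feat_dim * num_attributes, 2)

    def forward(self, x):
        shared = self.backbone(x)
        feats = [torch.relu(branch(shared)) for branch in self.branches]
        return self.ensemble(torch.cat(feats, dim=1))

logits = AttributeBranchNet()(torch.randn(4, 3, 64, 64))   # [4, 2] target/background scores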

NeurIPS Conference 2019 Conference Paper

Quadratic Video Interpolation

  • Xiangyu Xu
  • Li Siyao
  • Wenxiu Sun
  • Qian Yin
  • Ming-Hsuan Yang

Video interpolation is an important problem in computer vision, which helps overcome the temporal limitation of camera sensors. Existing video interpolation methods usually assume uniform motion between consecutive frames and use linear models for interpolation, which cannot well approximate the complex motion in the real world. To address these issues, we propose a quadratic video interpolation method which exploits the acceleration information in videos. This method allows prediction with curvilinear trajectory and variable velocity, and generates more accurate interpolation results. For high-quality frame synthesis, we develop a flow reversal layer to estimate flow fields starting from the unknown target frame to the source frame. In addition, we present techniques for flow refinement. Extensive experiments demonstrate that our approach performs favorably against the existing linear models on a wide variety of video datasets.
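
One way to make the quadratic model concrete: if f01 and f0m1 denote the optical flows from frame 0 to frames 1 and -1, a constant-acceleration reading gives f_{0->t} = 0.5*(f01 - f0m1)*t + 0.5*(f01 + f0m1)*t^2. The NumPy sketch below evaluates this formula on toy flow arrays; the paper's flow reversal layer and refinement steps are not modeled, and the function name is illustrative.

import numpy as np

def quadratic_flow(flow_0_to_1, flow_0_to_minus1, t):
    """Flow from frame 0 to an intermediate time t in (0, 1), assuming
    constant acceleration across frames -1, 0 and 1."""
    velocity = (flow_0_to_1 - flow_0_to_minus1) / 2.0        # v0
    acceleration = flow_0_to_1 + flow_0_to_minus1            # a
    return velocity * t + 0.5 * acceleration * t ** 2        # f_{0->t}

h, w = 4, 4
f01 = np.random.randn(h, w, 2)       # flow frame 0 -> frame 1
f0m1 = np.random.randn(h, w, 2)      # flow frame 0 -> frame -1
print(quadratic_flow(f01, f0m1, 0.5).shape)   # (4, 4, 2)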

NeurIPS Conference 2018 Conference Paper

Context-aware Synthesis and Placement of Object Instances

  • Donghoon Lee
  • Sifei Liu
  • Jinwei Gu
  • Ming-Yu Liu
  • Ming-Hsuan Yang
  • Jan Kautz

Learning to insert an object instance into an image in a semantically coherent manner is a challenging and interesting problem. Solving it requires (a) determining a location to place the object in the scene and (b) determining its appearance at that location. Such an object insertion model can potentially facilitate numerous image editing and scene parsing applications. In this paper, we propose an end-to-end trainable neural network for the task of inserting an object instance mask of a specified class into the semantic label map of an image. Our network consists of two generative modules, where one determines where the inserted object mask should be (i.e., location and scale) and the other determines what the object mask shape (and pose) should look like. The two modules are connected via a spatial transformation network and jointly trained. We devise a learning procedure that leverages both supervised and unsupervised data and show that our model can insert an object at diverse locations with various appearances. We conduct extensive experimental validations with comparisons to strong baselines to verify the effectiveness of the proposed network.

NeurIPS Conference 2018 Conference Paper

Deep Attentive Tracking via Reciprocative Learning

  • Shi Pu
  • Yibing Song
  • Chao Ma
  • Honggang Zhang
  • Ming-Hsuan Yang

Visual attention, derived from cognitive neuroscience, facilitates human perception of the most pertinent subset of the sensory data. Recently, significant efforts have been made to exploit attention schemes to advance computer vision systems. For visual tracking, it is often challenging to track target objects undergoing large appearance changes. Attention maps facilitate visual tracking by selectively paying attention to temporally robust features. Existing tracking-by-detection approaches mainly use additional attention modules to generate feature weights, as the classifiers are not equipped with such mechanisms. In this paper, we propose a reciprocative learning algorithm to exploit visual attention for training deep classifiers. The proposed algorithm consists of feed-forward and backward operations to generate attention maps, which serve as regularization terms coupled with the original classification loss function for training. The deep classifier learns to attend to the regions of target objects that are robust to appearance changes. Extensive experiments on large-scale benchmark datasets show that the proposed attentive tracking method performs favorably against state-of-the-art approaches.
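
One possible reading of the feed-forward/backward attention idea as a loss term, with a hypothetical classifier module and a deliberately simplified regularizer; the paper's exact formulation differs.

import torch
import torch.nn.functional as F

def reciprocative_loss(classifier, feat, label, reg_weight=1.0):
    """feat: [N, C, H, W] candidate features; label: [N] with 1 = target, 0 = background."""
    feat = feat.clone().requires_grad_(True)
    score = classifier(feat)                                  # [N, 2] logits
    cls_loss = F.cross_entropy(score, label)
    # a backward pass through the classifier yields a saliency (attention) map
    attn = torch.autograd.grad(score[:, 1].sum(), feat, create_graph=True)[0]
    attn = attn.clamp(min=0).mean(dim=1)                      # [N, H, W] positive saliency
    # simplified regularizer: reward attention on target samples, penalize it on background
    sign = 1.0 - 2.0 * label.float()                          # target -> -1, background -> +1
    reg = (sign.view(-1, 1, 1) * attn).mean()
    return cls_loss + reg_weight * reg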

NeurIPS Conference 2018 Conference Paper

Deep Non-Blind Deconvolution via Generalized Low-Rank Approximation

  • Wenqi Ren
  • Jiawei Zhang
  • Lin Ma
  • Jinshan Pan
  • Xiaochun Cao
  • Wangmeng Zuo
  • Wei Liu
  • Ming-Hsuan Yang

In this paper, we present a deep convolutional neural network to capture the inherent properties of image degradation, which can handle different kernels and saturated pixels in a unified framework. The proposed neural network is motivated by the low-rank property of pseudo-inverse kernels. We first compute a generalized low-rank approximation for a large number of blur kernels, and then use separable filters to initialize the convolutional parameters in the network. Our analysis shows that the estimated decomposed matrices contain the most essential information of the input kernel, which enables the proposed network to handle various blurs in a unified framework and generate high-quality deblurring results. Experimental results on benchmark datasets with noise and saturated pixels demonstrate that the proposed algorithm performs favorably against state-of-the-art methods.
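
To make the low-rank intuition concrete, here is a small NumPy illustration of approximating a single 2D kernel with a few separable row/column filters via SVD; the kernel is a stand-in, and the paper's generalized low-rank approximation is computed jointly over many kernels rather than one.

import numpy as np

def separable_filters(kernel_2d, rank=3):
    u, s, vt = np.linalg.svd(kernel_2d, full_matrices=False)
    # each rank-1 term is a vertical filter (column of u) times a horizontal
    # filter (row of vt), scaled by its singular value
    vertical = u[:, :rank] * np.sqrt(s[:rank])
    horizontal = (vt[:rank, :].T * np.sqrt(s[:rank])).T
    approx = vertical @ horizontal
    return vertical, horizontal, approx

k = np.outer(np.hanning(15), np.hanning(15))       # stand-in for a pseudo-inverse kernel
v, h, k_hat = separable_filters(k, rank=2)
print(np.abs(k - k_hat).max())                      # small reconstruction error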

AAAI Conference 2018 Conference Paper

Learning Binary Residual Representations for Domain-Specific Video Streaming

  • Yi-Hsuan Tsai
  • Ming-Yu Liu
  • Deqing Sun
  • Ming-Hsuan Yang
  • Jan Kautz

We study domain-specific video streaming. Specifically, we target a streaming setting where the videos to be streamed from a server to a client are all in the same domain and they have to be compressed to a small size for low-latency transmission. Several popular video streaming services, such as the video game streaming services of GeForce Now and Twitch, fall in this category. While conventional video compression standards such as H.264 are commonly used for this task, we hypothesize that one can leverage the property that the videos are all in the same domain to achieve better video quality. Based on this hypothesis, we propose a novel video compression pipeline. Specifically, we first apply H.264 to compress domain-specific videos. We then train a novel binary autoencoder to encode the leftover domain-specific residual information frame-by-frame into binary representations. These binary representations are then compressed and sent to the client together with the H.264 stream. In our experiments, we show that our pipeline yields consistent gains over standard H.264 compression across several benchmark datasets while using the same channel bandwidth.
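
A compact sketch of a binary residual autoencoder with a straight-through binarization, assuming toy layer sizes; it illustrates the idea of encoding the per-frame residual left after H.264, not the paper's architecture or entropy coding.

import torch
import torch.nn as nn

class BinaryResidualAE(nn.Module):
    def __init__(self, channels=32):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, channels, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, 4, stride=2, padding=1))
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(channels, channels, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(channels, 3, 4, stride=2, padding=1))

    def binarize(self, z):
        # sign() in the forward pass, identity gradient in the backward pass
        return z + (torch.sign(z) - z).detach()

    def forward(self, residual):
        code = self.binarize(self.encoder(residual))   # {-1, +1} representation to transmit
        return self.decoder(code), code

frame_residual = torch.randn(1, 3, 64, 64)             # original frame minus H.264 decode
recon, bits = BinaryResidualAE()(frame_residual)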

AAAI Conference 2018 Conference Paper

Retrieving and Classifying Affective Images via Deep Metric Learning

  • Jufeng Yang
  • Dongyu She
  • Yu-Kun Lai
  • Ming-Hsuan Yang

Affective image understanding has been extensively studied in the last decade as more and more users express emotion via visual content. While current algorithms based on convolutional neural networks aim to distinguish emotional categories in a discrete label space, the task is inherently ambiguous. This is mainly because emotional labels with the same polarity (i.e., positive or negative) are highly related, which is different from concrete object concepts such as cat, dog, and bird. To the best of our knowledge, few methods focus on leveraging this characteristic of emotions for affective image understanding. In this work, we address the problem of understanding affective images via deep metric learning and propose a multi-task deep framework to optimize both retrieval and classification goals. We propose sentiment constraints adapted from triplet constraints, which are able to explore the hierarchical relation of emotion labels. We further exploit the sentiment vector as an effective representation to distinguish affective images, utilizing the texture representation derived from convolutional layers. Extensive evaluations on four widely-used affective datasets, i.e., Flickr and Instagram, IAPSa, Art Photo, and Abstract Paintings, demonstrate that the proposed algorithm performs favorably against state-of-the-art methods on both affective image retrieval and classification tasks.
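
For reference, the vanilla triplet constraint that the sentiment constraints build on can be sketched as below; this is a generic formulation with illustrative names, and the paper's hierarchical, polarity-aware weighting is not shown.

import torch
import torch.nn.functional as F

def triplet_sentiment_loss(anchor, positive, negative, margin=0.2):
    """Pull together embeddings of images whose emotion labels share polarity and
    push apart those of opposite polarity by at least `margin`."""
    d_pos = F.pairwise_distance(anchor, positive)
    d_neg = F.pairwise_distance(anchor, negative)
    return F.relu(d_pos - d_neg + margin).mean()

a, p, n = torch.randn(8, 128), torch.randn(8, 128), torch.randn(8, 128)
print(triplet_sentiment_loss(a, p, n))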

NeurIPS Conference 2017 Conference Paper

Learning Affinity via Spatial Propagation Networks

  • Sifei Liu
  • Shalini De Mello
  • Jinwei Gu
  • Guangyu Zhong
  • Ming-Hsuan Yang
  • Jan Kautz

In this paper, we propose spatial propagation networks for learning the affinity matrix. We show that by constructing a row/column linear propagation model, the spatially variant transformation matrix constitutes an affinity matrix that models dense, global pairwise similarities of an image. Specifically, we develop a three-way connection for the linear propagation model, which (a) formulates a sparse transformation matrix, where all elements can be output by a deep CNN, but (b) results in a dense affinity matrix that is effective for modeling any task-specific pairwise similarity. Instead of designing similarity kernels according to the image features of two points, we can directly output all similarities in a purely data-driven manner. The spatial propagation network is a generic framework that can be applied to numerous tasks, which traditionally benefit from manually designed affinity, e.g., image matting, colorization, and guided filtering, to name a few. Furthermore, the model can also learn semantic-aware affinity for high-level vision tasks owing to the learning capability of the deep model. We validate the proposed framework on refinement of object segmentation. Experiments on the HELEN face parsing and PASCAL VOC-2012 semantic segmentation tasks show that the spatial propagation network provides general, effective, and efficient solutions for generating high-quality segmentation results.
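
To illustrate the row/column propagation, here is a deliberately simplified single-direction, single-connection scan in PyTorch; in the actual model the gate values come from a CNN, and the three-way connection plus four scan directions are omitted here.

import torch

def propagate_left_to_right(x, p):
    """x, p: [N, C, H, W]; p in (0, 1) gates how much of the previous column is
    carried over (a one-way, one-connection simplification of the propagation)."""
    h = torch.zeros_like(x)
    h[..., 0] = x[..., 0]
    for j in range(1, x.shape[-1]):
        h[..., j] = (1 - p[..., j]) * x[..., j] + p[..., j] * h[..., j - 1]
    return h

x = torch.rand(1, 8, 16, 16)
p = torch.sigmoid(torch.randn(1, 8, 16, 16))   # stand-in for CNN-predicted gates
out = propagate_left_to_right(x, p)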

NeurIPS Conference 2017 Conference Paper

Semi-Supervised Learning for Optical Flow with Generative Adversarial Networks

  • Wei-Sheng Lai
  • Jia-Bin Huang
  • Ming-Hsuan Yang

Convolutional neural networks (CNNs) have recently been applied to the optical flow estimation problem. As training the CNNs requires sufficiently large ground truth training data, existing approaches resort to synthetic, unrealistic datasets. On the other hand, unsupervised methods are capable of leveraging real-world videos for training where the ground truth flow fields are not available. These methods, however, rely on the fundamental assumptions of brightness constancy and spatial smoothness priors which do not hold near motion boundaries. In this paper, we propose to exploit unlabeled videos for semi-supervised learning of optical flow with a Generative Adversarial Network. Our key insight is that the adversarial loss can capture the structural patterns of flow warp errors without making explicit assumptions. Extensive experiments on benchmark datasets demonstrate that the proposed semi-supervised algorithm performs favorably against purely supervised and semi-supervised learning schemes.

NeurIPS Conference 2017 Conference Paper

Universal Style Transfer via Feature Transforms

  • Yijun Li
  • Chen Fang
  • Jimei Yang
  • Zhaowen Wang
  • Xin Lu
  • Ming-Hsuan Yang

Universal style transfer aims to transfer arbitrary visual styles to content images. Existing feed-forward methods, while enjoying inference efficiency, are mainly limited by an inability to generalize to unseen styles or by compromised visual quality. In this paper, we present a simple yet effective method that tackles these limitations without training on any pre-defined styles. The key ingredient of our method is a pair of feature transforms, whitening and coloring, that are embedded into an image reconstruction network. The whitening and coloring transforms reflect direct matching of the feature covariance of the content image to that of a given style image, which shares a similar spirit with the optimization of the Gram-matrix-based cost in neural style transfer. We demonstrate the effectiveness of our algorithm by generating high-quality stylized images with comparisons to a number of recent methods. We also analyze our method by visualizing the whitened features and synthesizing textures by simple feature coloring.
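
The whitening and coloring operations have a compact closed form; the NumPy sketch below follows the standard covariance-matching formulation (with an epsilon added for numerical stability) and should be read as an illustration rather than the released implementation.

import numpy as np

def wct(content_feat, style_feat, eps=1e-5):
    """content_feat, style_feat: [C, H*W] feature maps taken from the same layer."""
    fc = content_feat - content_feat.mean(axis=1, keepdims=True)
    fs = style_feat - style_feat.mean(axis=1, keepdims=True)

    # whitening: remove the content feature covariance
    dc, ec = np.linalg.eigh(fc @ fc.T / (fc.shape[1] - 1) + eps * np.eye(fc.shape[0]))
    whitened = ec @ np.diag(dc ** -0.5) @ ec.T @ fc

    # coloring: impose the style feature covariance, then add back the style mean
    ds, es = np.linalg.eigh(fs @ fs.T / (fs.shape[1] - 1) + eps * np.eye(fs.shape[0]))
    colored = es @ np.diag(ds ** 0.5) @ es.T @ whitened
    return colored + style_feat.mean(axis=1, keepdims=True)

out = wct(np.random.randn(64, 1024), np.random.randn(64, 900))
print(out.shape)   # (64, 1024)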

NeurIPS Conference 2015 Conference Paper

Weakly-supervised Disentangling with Recurrent Transformations for 3D View Synthesis

  • Jimei Yang
  • Scott Reed
  • Ming-Hsuan Yang
  • Honglak Lee

An important problem for both graphics and vision is to synthesize novel views of a 3D object from a single image. This is particularly challenging due to the partial observability inherent in projecting a 3D object onto the image space, and the ill-posedness of inferring object shape and pose. However, we can train a neural network to address the problem if we restrict our attention to specific object classes (in our case, faces and chairs) for which we can gather ample training data. In this paper, we propose a novel recurrent convolutional encoder-decoder network that is trained end-to-end on the task of rendering rotated objects starting from a single image. The recurrent structure allows our model to capture long-term dependencies along a sequence of transformations, and we demonstrate the quality of its predictions for human faces on the Multi-PIE dataset and for a dataset of 3D chair models, and also show its ability to disentangle latent data factors without using object class labels.

NeurIPS Conference 2006 Conference Paper

Detecting Humans via Their Pose

  • Alessandro Bissacco
  • Ming-Hsuan Yang
  • Stefano Soatto

We consider the problem of detecting humans and classifying their pose from a single image. Specifically, our goal is to devise a statistical model that simultaneously answers two questions: 1) is there a human in the image? and, if so, 2) what is a low-dimensional representation of her pose? We investigate models that can be learned in an unsupervised manner on unlabeled images of human poses, and provide information that can be used to match the pose of a new image to the ones present in the training set. Starting from a set of descriptors recently proposed for human detection, we apply the Latent Dirichlet Allocation framework to model the statistics of these features, and use the resulting model to answer the above questions. We show how our model can efficiently describe the space of images of humans with their pose, by providing an effective representation of poses for tasks such as classification and matching, while performing remarkably well in human/non human decision problems, thus enabling its use for human detection. We validate the model with extensive quantitative experiments and comparisons with other approaches on human detection and pose matching.

NeurIPS Conference 2004 Conference Paper

Adaptive Discriminative Generative Model and Its Applications

  • Ruei-sung Lin
  • David Ross
  • Jongwoo Lim
  • Ming-Hsuan Yang

This paper presents an adaptive discriminative generative model that generalizes the conventional Fisher Linear Discriminant algorithm and renders a proper probabilistic interpretation. Within the context of object tracking, we aim to find a discriminative generative model that best separates the target from the background. We present a computationally efficient algorithm to constantly update this discriminative model as time progresses. While most tracking algorithms operate on the premise that the object appearance or ambient lighting condition does not significantly change as time progresses, our method adapts a discriminative generative model to reflect appearance variation of the target and background, thereby facilitating the tracking task in ever-changing environments. Numerous experiments show that our method is able to learn a discriminative generative model for tracking target objects undergoing large pose and lighting changes.

NeurIPS Conference 2004 Conference Paper

Incremental Learning for Visual Tracking

  • Jongwoo Lim
  • David Ross
  • Ruei-sung Lin
  • Ming-Hsuan Yang

Most existing tracking algorithms construct a representation of a target object before the tracking task starts, and utilize invariant features to handle appearance variation of the target caused by lighting, pose, and view angle changes. In this paper, we present an efficient and effective online algorithm that incrementally learns and adapts a low-dimensional eigenspace representation to reflect appearance changes of the target, thereby facilitating the tracking task. Furthermore, our incremental method correctly updates the sample mean and the eigenbasis, whereas existing incremental subspace update methods ignore the fact that the sample mean varies over time. The tracking problem is formulated as a state inference problem within a Markov Chain Monte Carlo framework, and a particle filter is incorporated for propagating sample distributions over time. Numerous experiments demonstrate the effectiveness of the proposed tracking algorithm in indoor and outdoor environments where the target objects undergo large pose and lighting changes.
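
A condensed NumPy sketch of a mean-aware incremental eigenbasis update, based on my simplified reading that the combined scatter equals the two partial scatters plus a rank-one mean-shift correction; the function name and the use of a plain SVD (instead of the more efficient R-SVD formulation in the paper) are illustrative choices.

import numpy as np

def incremental_eigenbasis_update(mean_p, basis_p, sing_p, n, new_cols):
    """mean_p: [d] old mean; basis_p: [d, k], sing_p: [k] old eigenbasis and
    singular values; n: number of old samples; new_cols: [d, m] new observations."""
    m = new_cols.shape[1]
    mean_q = new_cols.mean(axis=1)
    mean_r = (n * mean_p + m * mean_q) / (n + m)
    # extra column accounts for the shift of the sample mean (rank-one correction)
    correction = np.sqrt(n * m / (n + m)) * (mean_q - mean_p)
    augmented = np.hstack([basis_p * sing_p,              # compressed old data
                           new_cols - mean_q[:, None],    # centered new data
                           correction[:, None]])
    basis_r, sing_r, _ = np.linalg.svd(augmented, full_matrices=False)
    return mean_r, basis_r, sing_r, n + m

d = 32 * 32                                               # flattened target patch
old = np.random.rand(d, 20)
mean = old.mean(axis=1)
basis, sing, _ = np.linalg.svd(old - mean[:, None], full_matrices=False)
mean, basis, sing, count = incremental_eigenbasis_update(mean, basis, sing, 20,
                                                         np.random.rand(d, 5))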

AAAI Conference 2002 Conference Paper

Extended Isomap for Pattern Classification

  • Ming-Hsuan Yang

The Isomap method has demonstrated promising results in finding low-dimensional manifolds from data points in a high-dimensional input space. While classical subspace methods use Euclidean or Manhattan metrics to represent distances between data points and apply Principal Component Analysis to induce linear manifolds, the Isomap method estimates geodesic distances between data points and then uses Multi-Dimensional Scaling to induce low-dimensional manifolds. Since the Isomap method is developed based on the reconstruction principle, it may not be optimal from the classification viewpoint. In this paper, we present an extended Isomap method that utilizes Fisher Linear Discriminant for pattern classification. Numerous experiments on image data sets show that our extension is more effective than the original Isomap method for pattern classification. Furthermore, the extended Isomap method shows promising results compared with the best methods in the face recognition literature.

NeurIPS Conference 2001 Conference Paper

Face Recognition Using Kernel Methods

  • Ming-Hsuan Yang

Principal Component Analysis and Fisher Linear Discriminant methods have demonstrated their success in face detection, recognition, and tracking. The representation in these subspace methods is based on second-order statistics of the image set, and does not address higher-order statistical dependencies such as the relationships among three or more pixels. Recently, Higher Order Statistics and Independent Component Analysis (ICA) have been used as informative low-dimensional representations for visual recognition. In this paper, we investigate the use of Kernel Principal Component Analysis and Kernel Fisher Linear Discriminant for learning low-dimensional representations for face recognition, which we call Kernel Eigenface and Kernel Fisherface methods. While Eigenface and Fisherface methods aim to find projection directions based on the second-order correlation of samples, Kernel Eigenface and Kernel Fisherface methods provide generalizations which take higher-order correlations into account. We compare the performance of kernel methods with Eigenface, Fisherface, and ICA-based methods for face recognition with variation in pose, scale, lighting, and expression. Experimental results show that kernel methods provide better representations and achieve lower error rates for face recognition.
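
A generic illustration of the Kernel Eigenface idea using scikit-learn on toy data; the face vectors, classifier, and kernel parameters below are all stand-ins, and the paper's protocol and datasets differ.

import numpy as np
from sklearn.decomposition import KernelPCA
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X_train = rng.random((100, 32 * 32))            # flattened face images (toy data)
y_train = rng.integers(0, 5, 100)               # identity labels
X_test = rng.random((10, 32 * 32))

# nonlinear features via kernel PCA, then a simple nearest-neighbor classifier
kpca = KernelPCA(n_components=20, kernel="rbf", gamma=1e-2)
clf = KNeighborsClassifier(n_neighbors=1).fit(kpca.fit_transform(X_train), y_train)
pred = clf.predict(kpca.transform(X_test))
print(pred.shape)                               # (10,)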

NeurIPS Conference 2000 Conference Paper

Sex with Support Vector Machines

  • Baback Moghaddam
  • Ming-Hsuan Yang

Nonlinear Support Vector Machines (SVMs) are investigated for visual sex classification with low resolution "thumbnail" faces (21-by-12 pixels) processed from 1,755 images from the FERET face database. The performance of SVMs is shown to be superior to traditional pattern classifiers (Linear, Quadratic, Fisher Linear Discriminant, Nearest-Neighbor) as well as more modern techniques such as Radial Basis Function (RBF) classifiers and large ensemble-RBF networks. Furthermore, the SVM performance (3.4% error) is currently the best result reported in the open literature.
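
A hedged sketch of such an evaluation with scikit-learn on synthetic stand-ins for the 21-by-12 thumbnails; the real protocol uses FERET images and tuned kernel parameters.

import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
X = rng.random((400, 21 * 12))     # flattened 21x12 thumbnail faces (synthetic)
y = rng.integers(0, 2, 400)        # binary sex labels (synthetic)

svm = SVC(kernel="rbf", C=10.0, gamma="scale")
print(cross_val_score(svm, X, y, cv=5).mean())   # chance-level on random data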

NeurIPS Conference 1999 Conference Paper

A SNoW-Based Face Detector

  • Ming-Hsuan Yang
  • Dan Roth
  • Narendra Ahuja

A novel learning approach for human face detection using a network of linear units is presented. The SNoW learning architecture is a sparse network of linear functions over a pre-defined or incrementally learned feature space and is specifically tailored for learning in the presence of a very large number of features. A wide range of face images in different poses, with different expressions and under different lighting conditions, are used as a training set to capture the variations of human faces. Experimental results on commonly used benchmark data sets of a wide range of face images show that the SNoW-based approach outperforms methods that use neural networks, Bayesian methods, support vector machines, and others. Furthermore, learning and evaluation using the SNoW-based method are significantly more efficient than with other methods.