Author name cluster

Minsu Cho

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

27 papers

2 author rows

TMLR Journal 2025 Journal Article

Memory-Modular Classification: Learning to Generalize with Memory Replacement

Dahyun Kang
Ahmet Iscen
Eunchan Jo
Sua Choi
Minsu Cho
Cordelia Schmid

We propose a novel memory-modular learner for image classification that separates knowledge memorization from reasoning. Our model enables effective generalization to new classes by simply replacing the memory contents, without the need for model retraining. Unlike traditional models that encode both world knowledge and task-specific skills into their weights during training, our model stores knowledge in the external memory of web-crawled image and text data. At inference time, the model dynamically selects relevant content from the memory based on the input image, allowing it to adapt to arbitrary classes by simply replacing the memory contents. The key differentiator that our learner meta-learns to perform classification tasks with noisy web data from unseen classes, resulting in robust performance across various classification scenarios. Experimental results demonstrate the promising performance and versatility of our approach in handling diverse classification tasks, including zero-shot/few-shot classification of unseen classes, fine-grained classification, and class-incremental classification.

NeurIPS Conference 2025 Conference Paper

Part-Aware Bottom-Up Group Reasoning for Fine-Grained Social Interaction Detection

Dongkeun Kim
Minsu Cho
Suha Kwak

Social interactions often emerge from subtle, fine-grained cues such as facial expressions, gaze, and gestures. However, existing methods for social interaction detection overlook such nuanced cues and primarily rely on holistic representations of individuals. Moreover, they directly detect social groups without explicitly modeling the underlying interactions between individuals. These drawbacks limit their ability to capture localized social signals and introduce ambiguity when group configurations should be inferred from social interactions grounded in nuanced cues. In this work, we propose a part-aware bottom-up group reasoning framework for fine-grained social interaction detection. The proposed method infers social groups and their interactions using body part cues and their interpersonal relations. Our model first detects individuals and enhances their features using part-aware cues, and then infers group configuration by associating individuals via similarity-based reasoning, which considers not only spatial relations but also subtle social cues that signal interactions, leading to more accurate group inference. Experiments on the NVI dataset demonstrate that our method outperforms prior methods, achieving the new state of the art, while additional results on the Café dataset further validate its generalizability to group activity understanding.

NeurIPS Conference 2024 Conference Paper

3D Equivariant Pose Regression via Direct Wigner-D Harmonics Prediction

JongMin Lee
Minsu Cho

Determining the 3D orientations of an object in an image, known as single-image pose estimation, is a crucial task in 3D vision applications. Existing methods typically learn 3D rotations parametrized in the spatial domain using Euler angles or quaternions, but these representations often introduce discontinuities and singularities. SO(3)-equivariant networks enable the structured capture of pose patterns with data-efficient learning, but the parametrizations in spatial domain are incompatible with their architecture, particularly spherical CNNs, which operate in the frequency domain to enhance computational efficiency. To overcome these issues, we propose a frequency-domain approach that directly predicts Wigner-D coefficients for 3D rotation regression, aligning with the operations of spherical CNNs. Our SO(3)-equivariant pose harmonics predictor overcomes the limitations of spatial parameterizations, ensuring consistent pose estimation under arbitrary rotations. Trained with a frequency-domain regression loss, our method achieves state-of-the-art results on benchmarks such as ModelNet10-SO(3) and PASCAL3D+, with significant improvements in accuracy, robustness, and data efficiency.

PDF Details DOI

ICML Conference 2024 Conference Paper

3D Geometric Shape Assembly via Efficient Point Cloud Matching

Nahyuk Lee
Juhong Min
Junha Lee
Seungwook Kim
Kang-Hee Lee
Jaesik Park
Minsu Cho

Learning to assemble geometric shapes into a larger target structure is a pivotal task in various practical applications. In this work, we tackle this problem by establishing local correspondences between point clouds of part shapes in both coarse- and fine-levels. To this end, we introduce Proxy Match Transform (PMT), an approximate high-order feature transform layer that enables reliable matching between mating surfaces of parts while incurring low costs in memory and compute. Building upon PMT, we introduce a new framework, dubbed Proxy Match TransformeR (PMTR), for the geometric assembly task. We evaluate the proposed PMTR on the large-scale 3D geometric shape assembly benchmark dataset of Breaking Bad and demonstrate its superior performance and efficiency compared to state-of-the-art methods. Project page: https: //nahyuklee. github. io/pmtr

NeurIPS Conference 2024 Conference Paper

ActFusion: a Unified Diffusion Model for Action Segmentation and Anticipation

Dayoung Gong
Suha Kwak
Minsu Cho

Temporal action segmentation and long-term action anticipation are two popular vision tasks for the temporal analysis of actions in videos. Despite apparent relevance and potential complementarity, these two problems have been investigated as separate and distinct tasks. In this work, we tackle these two problems, action segmentation, and action anticipation, jointly using a unified diffusion model dubbed ActFusion. The key idea to unification is to train the model to effectively handle both visible and invisible parts of the sequence in an integrated manner; the visible part is for temporal segmentation, and the invisible part is for future anticipation. To this end, we introduce a new anticipative masking strategy during training in which a late part of the video frames is masked as invisible, and learnable tokens replace these frames to learn to predict the invisible future. Experimental results demonstrate the bi-directional benefits between action segmentation and anticipation. ActFusion achieves the state-of-the-art performance across the standard benchmarks of 50 Salads, Breakfast, and GTEA, outperforming task-specific models in both of the two tasks with a single unified model through joint learning.

PDF Details DOI

TMLR Journal 2024 Journal Article

Budget-Aware Sequential Brick Assembly with Efficient Constraint Satisfaction

Seokjun Ahn
Jungtaek Kim
Minsu Cho
Jaesik Park

We tackle the problem of sequential brick assembly with LEGO bricks to create combinatorial 3D structures. This problem is challenging since this brick assembly task encompasses the characteristics of combinatorial optimization problems. In particular, the number of assemblable structures increases exponentially as the number of bricks used increases. To solve this problem, we propose a new method to predict the scores of the next brick position by employing a U-shaped sparse 3D convolutional neural network. Along with the 3D convolutional network, a one-initialized brick-sized convolution filter is used to efficiently validate assembly constraints between bricks without training itself. By the nature of this one-initialized convolution filter, we can readily consider several different brick types by benefiting from modern implementation of convolution operations. To generate a novel structure, we devise a sampling strategy to determine the next brick position considering the satisfaction of assembly constraints. Moreover, our method is designed for either budget-free or budget-aware scenario where a budget may confine the number of bricks and their types. We demonstrate that our method successfully generates a variety of brick structures and outperforms existing methods with Bayesian optimization, deep graph generative model, and reinforcement learning.

ICLR Conference 2024 Conference Paper

Generalized Neural Sorting Networks with Error-Free Differentiable Swap Functions

Jungtaek Kim 0001
Jeongbeen Yoon
Minsu Cho

Sorting is a fundamental operation of all computer systems, having been a long-standing significant research topic. Beyond the problem formulation of traditional sorting algorithms, we consider sorting problems for more abstract yet expressive inputs, e.g., multi-digit images and image fragments, through a neural sorting network. To learn a mapping from a high-dimensional input to an ordinal variable, the differentiability of sorting networks needs to be guaranteed. In this paper we define a softening error by a differentiable swap function, and develop an error-free swap function that holds a non-decreasing condition and differentiability. Furthermore, a permutation-equivariant Transformer network with multi-head attention is adopted to capture dependency between given inputs and also leverage its model capacity with self-attention. Experiments on diverse sorting benchmarks show that our methods perform better than or comparable to baseline methods.

TMLR Journal 2024 Journal Article

PriViT: Vision Transformers for Private Inference

Naren Dhyani
Jianqiao Cambridge Mo
Patrick Yubeaton
Minsu Cho
Ameya Joshi
Siddharth Garg
Brandon Reagen
Chinmay Hegde

The Vision Transformer (ViT) architecture has emerged as the backbone of choice for state-of-the-art deep models for computer vision applications. However, ViTs are ill-suited for private inference using secure multi-party computation (MPC) protocols, due to the large number of non-polynomial operations (self-attention, feed-forward rectifiers, layer normalization). We develop PriViT, a gradient-based algorithm to selectively Taylorize nonlinearities in ViTs while maintaining their prediction accuracy. Our algorithm is conceptually very simple, easy to implement, and achieves improved performance over existing MPC-friendly transformer architectures in terms of the latency-accuracy Pareto frontier.

NeurIPS Conference 2023 Conference Paper

Activity Grammars for Temporal Action Segmentation

Dayoung Gong
Joonseok Lee
Deunsol Jung
Suha Kwak
Minsu Cho

Sequence prediction on temporal data requires the ability to understand compositional structures of multi-level semantics beyond individual and contextual properties of parts. The task of temporal action segmentation remains challenging for the reason, aiming at translating an untrimmed activity video into a sequence of action segments. This paper addresses the problem by introducing an effective activity grammar to guide neural predictions for temporal action segmentation. We propose a novel grammar induction algorithm, dubbed KARI, that extracts a powerful context-free grammar from action sequence data. We also develop an efficient generalized parser, dubbed BEP, that transforms frame-level probability distributions into a reliable sequence of actions according to the induced grammar with recursive rules. Our approach can be combined with any neural network for temporal action segmentation to enhance the sequence prediction and discover its compositional structure. Experimental results demonstrate that our method significantly improves temporal action segmentation in terms of both performance and interpretability on two standard benchmarks, Breakfast and 50 Salads.

NeurIPS Conference 2023 Conference Paper

Locality-Aware Generalizable Implicit Neural Representation

Doyup Lee
Chiheon Kim
Minsu Cho
WOOK SHIN HAN

Generalizable implicit neural representation (INR) enables a single continuous function, i. e. , a coordinate-based neural network, to represent multiple data instances by modulating its weights or intermediate features using latent codes. However, the expressive power of the state-of-the-art modulation is limited due to its inability to localize and capture fine-grained details of data entities such as specific pixels and rays. To address this issue, we propose a novel framework for generalizable INR that combines a transformer encoder with a locality-aware INR decoder. The transformer encoder predicts a set of latent tokens from a data instance to encode local information into each latent token. The locality-aware INR decoder extracts a modulation vector by selectively aggregating the latent tokens via cross-attention for a coordinate input and then predicts the output by progressively decoding with coarse-to-fine modulation through multiple frequency bandwidths. The selective token aggregation and the multi-band feature modulation enable us to learn locality-aware representation in spatial and spectral aspects, respectively. Our framework significantly outperforms previous generalizable INRs and validates the usefulness of the locality-aware latents for downstream tasks such as image generation.

ICML Conference 2023 Conference Paper

Stable and Consistent Prediction of 3D Characteristic Orientation via Invariant Residual Learning

Seungwook Kim 0005
Chunghyun Park
Yoonwoo Jeong
Jaesik Park
Minsu Cho

Learning to predict reliable characteristic orientations of 3D point clouds is an important yet challenging problem, as different point clouds of the same class may have largely varying appearances. In this work, we introduce a novel method to decouple the shape geometry and semantics of the input point cloud to achieve both stability and consistency. The proposed method integrates shape-geometry-based SO(3)-equivariant learning and shape-semantics-based SO(3)-invariant residual learning, where a final characteristic orientation is obtained by calibrating an SO(3)-equivariant orientation hypothesis using an SO(3)-invariant residual rotation. In experiments, the proposed method not only demonstrates superior stability and consistency but also exhibits state-of-the-art performances when applied to point cloud part segmentation, given randomly rotated inputs.

UAI Conference 2022 Conference Paper

Combinatorial Bayesian optimization with random mapping functions to convex polytopes

Jungtaek Kim 0001
Seungjin Choi
Minsu Cho

Bayesian optimization is a popular method for solving the problem of global optimization of an expensive-to-evaluate black-box function. It relies on a probabilistic surrogate model of the objective function, upon which an acquisition function is built to determine where next to evaluate the objective function. In general, Bayesian optimization with Gaussian process regression operates on a continuous space. When input variables are categorical or discrete, an extra care is needed. A common approach is to use one-hot encoded or Boolean representation for categorical variables which might yield a combinatorial explosion problem. In this paper we present a method for Bayesian optimization in a combinatorial space, which can operate well in a large combinatorial space. The main idea is to use a random mapping which embeds the combinatorial space into a convex polytope in a continuous space, on which all essential process is performed to determine a solution to the black-box optimization in the combinatorial space. We describe our combinatorial Bayesian optimization algorithm and present its regret analysis. Numerical experiments demonstrate that our method shows satisfactory performance compared to existing methods.

NeurIPS Conference 2022 Conference Paper

Draft-and-Revise: Effective Image Generation with Contextual RQ-Transformer

Doyup Lee
Chiheon Kim
Saehoon Kim
Minsu Cho
WOOK SHIN HAN

Although autoregressive models have achieved promising results on image generation, their unidirectional generation process prevents the resultant images from fully reflecting global contexts. To address the issue, we propose an effective image generation framework of \emph{Draft-and-Revise} with \emph{Contextual RQ-transformer} to consider global contexts during the generation process. As a generalized VQ-VAE, RQ-VAE first represents a high-resolution image as a sequence of discrete code stacks. After code stacks in the sequence are randomly masked, Contextual RQ-Transformer is trained to infill the masked code stacks based on the unmasked contexts of the image. Then, we propose the two-phase decoding, Draft-and-Revise, for Contextual RQ-Transformer to generates an image, while fully exploiting the global contexts of the image during the generation process. Specifically. in the \emph{draft} phase, our model first focuses on generating diverse images despite rather low quality. Then, in the \emph{revise} phase, the model iteratively improves the quality of images, while preserving the global contexts of generated images. In experiments, our method achieves state-of-the-art results on conditional image generation. We also validate that the Draft-and-Revise decoding can achieve high performance by effectively controlling the quality-diversity trade-off in image generation.

IROS Conference 2022 Conference Paper

GPU-Parallelized Iterative LQR with Input Constraints for Fast Collision Avoidance of Autonomous Vehicles

YeongSeok Lee
Minsu Cho
Kyung-Soo Kim 0001

Collision avoidance in emergency situations is a crucial and challenging task in motion planning for autonomous vehicles. Especially in the field of optimization-based planning using nonlinear model predictive control, many efforts to achieve real-time performance are still ongoing. Among various approaches, the iterative linear quadratic regulator (iLQR) is known as an efficient means of nonlinear optimization. Additionally, parallel computing architectures, such as GPUs, are more widely applied in autonomous vehicles. In this paper, we propose 1) a parallel computing framework for iLQR with input constraints considering the characteristics of the problem and 2) a proper environmental formulation that can be covered with single-precision floating-point computation of the GPU. The GPU-accelerated framework was tested on a real-time simulation-in-the-loop system using CarMaker and ROS at a 20 Hz sampling rate on a low-performance mobile computer and was compared against the same framework realized with a CPU.

IJCAI Conference 2022 Conference Paper

Learning to Assemble Geometric Shapes

Jinhwi Lee
Jungtaek Kim
Hyunsoo Chung
Jaesik Park
Minsu Cho

Assembling parts into an object is a combinatorial problem that arises in a variety of contexts in the real world and involves numerous applications in science and engineering. Previous related work tackles limited cases with identical unit parts or jigsaw-style parts of textured shapes, which greatly mitigate combinatorial challenges of the problem. In this work, we introduce the more challenging problem of shape assembly, which involves textureless fragments of arbitrary shapes with indistinctive junctions, and then propose a learning-based approach to solving it. We demonstrate the effectiveness on shape assembly tasks with various scenarios, including the ones with abnormal fragments (e. g. , missing and distorted), the different number of fragments, and different rotation discretization.

PDF Details DOI

NeurIPS Conference 2022 Conference Paper

PeRFception: Perception using Radiance Fields

Yoonwoo Jeong
Seungjoo Shin
Junha Lee
Chris Choy
Anima Anandkumar
Minsu Cho
Jaesik Park

The recent progress in implicit 3D representation, i. e. , Neural Radiance Fields (NeRFs), has made accurate and photorealistic 3D reconstruction possible in a differentiable manner. This new representation can effectively convey the information of hundreds of high-resolution images in one compact format and allows photorealistic synthesis of novel views. In this work, using the variant of NeRF called Plenoxels, we create the first large-scale radiance fields datasets for perception tasks, called the PeRFception, which consists of two parts that incorporate both object-centric and scene-centric scans for classification and segmentation. It shows a significant memory compression rate (96. 4\%) from the original dataset, while containing both 2D and 3D information in a unified form. We construct the classification and segmentation models that directly take this radiance fields format as input and also propose a novel augmentation technique to avoid overfitting on backgrounds of images. The code and data are publicly available in "https: //postech-cvlab. github. io/PeRFception/".

NeurIPS Conference 2022 Conference Paper

Peripheral Vision Transformer

Juhong Min
Yucheng Zhao
Chong Luo
Minsu Cho

Human vision possesses a special type of visual processing systems called peripheral vision. Partitioning the entire visual field into multiple contour regions based on the distance to the center of our gaze, the peripheral vision provides us the ability to perceive various visual features at different regions. In this work, we take a biologically inspired approach and explore to model peripheral vision in deep neural networks for visual recognition. We propose to incorporate peripheral position encoding to the multi-head self-attention layers to let the network learn to partition the visual field into diverse peripheral regions given training data. We evaluate the proposed network, dubbed PerViT, on ImageNet-1K and systematically investigate the inner workings of the model for machine perception, showing that the network learns to perceive visual data similarly to the way that human vision does. The performance improvements in image classification over the baselines across different model sizes demonstrate the efficacy of the proposed method.

ICML Conference 2022 Conference Paper

Selective Network Linearization for Efficient Private Inference

Minsu Cho
Ameya Joshi
Brandon Reagen
Siddharth Garg
Chinmay Hegde

Private inference (PI) enables inferences directly on cryptographically secure data. While promising to address many privacy issues, it has seen limited use due to extreme runtimes. Unlike plaintext inference, where latency is dominated by FLOPs, in PI non-linear functions (namely ReLU) are the bottleneck. Thus, practical PI demands novel ReLU-aware optimizations. To reduce PI latency we propose a gradient-based algorithm that selectively linearizes ReLUs while maintaining prediction accuracy. We evaluate our algorithm on several standard PI benchmarks. The results demonstrate up to $4. 25%$ more accuracy (iso-ReLU count at 50K) or $2. 2\times$ less latency (iso-accuracy at 70%) than the current state of the art and advance the Pareto frontier across the latency-accuracy space. To complement empirical results, we present a “no free lunch" theorem that sheds light on how and when network linearization is possible while maintaining prediction accuracy.

NeurIPS Conference 2021 Conference Paper

Brick-by-Brick: Combinatorial Construction with Deep Reinforcement Learning

Hyunsoo Chung
Jungtaek Kim
Boris Knyazev
Jinhwi Lee
Graham W. Taylor
Jaesik Park
Minsu Cho

Discovering a solution in a combinatorial space is prevalent in many real-world problems but it is also challenging due to diverse complex constraints and the vast number of possible combinations. To address such a problem, we introduce a novel formulation, combinatorial construction, which requires a building agent to assemble unit primitives (i. e. , LEGO bricks) sequentially -- every connection between two bricks must follow a fixed rule, while no bricks mutually overlap. To construct a target object, we provide incomplete knowledge about the desired target (i. e. , 2D images) instead of exact and explicit volumetric information to the agent. This problem requires a comprehensive understanding of partial information and long-term planning to append a brick sequentially, which leads us to employ reinforcement learning. The approach has to consider a variable-sized action space where a large number of invalid actions, which would cause overlap between bricks, exist. To resolve these issues, our model, dubbed Brick-by-Brick, adopts an action validity prediction network that efficiently filters invalid actions for an actor-critic network. We demonstrate that the proposed method successfully learns to construct an unseen object conditioned on a single image or multiple views of a target object.

NeurIPS Conference 2021 Conference Paper

Differentiable Spline Approximations

Minsu Cho
Aditya Balu
Ameya Joshi
Anjana Deva Prasad
Biswajit Khara
Soumik Sarkar
Baskar Ganapathysubramanian
Adarsh Krishnamurthy

The paradigm of differentiable programming has significantly enhanced the scope of machine learning via the judicious use of gradient-based optimization. However, standard differentiable programming methods (such as autodiff) typically require that the machine learning models be differentiable, limiting their applicability. Our goal in this paper is to use a new, principled approach to extend gradient-based optimization to functions well modeled by splines, which encompass a large family of piecewise polynomial models. We derive the form of the (weak) Jacobian of such functions and show that it exhibits a block-sparse structure that can be computed implicitly and efficiently. Overall, we show that leveraging this redesigned Jacobian in the form of a differentiable "layer'' in predictive models leads to improved performance in diverse applications such as image segmentation, 3D point cloud reconstruction, and finite element analysis. We also open-source the code at \url{https: //github. com/idealab-isu/DSA}.

NeurIPS Conference 2021 Conference Paper

Rebooting ACGAN: Auxiliary Classifier GANs with Stable Training

Minguk Kang
Woohyeon Shim
Minsu Cho
Jaesik Park

Conditional Generative Adversarial Networks (cGAN) generate realistic images by incorporating class information into GAN. While one of the most popular cGANs is an auxiliary classifier GAN with softmax cross-entropy loss (ACGAN), it is widely known that training ACGAN is challenging as the number of classes in the dataset increases. ACGAN also tends to generate easily classifiable samples with a lack of diversity. In this paper, we introduce two cures for ACGAN. First, we identify that gradient exploding in the classifier can cause an undesirable collapse in early training, and projecting input vectors onto a unit hypersphere can resolve the problem. Second, we propose the Data-to-Data Cross-Entropy loss (D2D-CE) to exploit relational information in the class-labeled dataset. On this foundation, we propose the Rebooted Auxiliary Classifier Generative Adversarial Network (ReACGAN). The experimental results show that ReACGAN achieves state-of-the-art generation results on CIFAR10, Tiny-ImageNet, CUB200, and ImageNet datasets. We also verify that ReACGAN benefits from differentiable augmentations and that D2D-CE harmonizes with StyleGAN2 architecture. Model weights and a software package that provides implementations of representative cGANs and all experiments in our paper are available at https: //github. com/POSTECH-CVLab/PyTorch-StudioGAN.

NeurIPS Conference 2021 Conference Paper

Relational Self-Attention: What's Missing in Attention for Video Understanding

Manjin Kim
Heeseung Kwon
Chunyu Wang
Suha Kwak
Minsu Cho

Convolution has been arguably the most important feature transform for modern neural networks, leading to the advance of deep learning. Recent emergence of Transformer networks, which replace convolution layers with self-attention blocks, has revealed the limitation of stationary convolution kernels and opened the door to the era of dynamic feature transforms. The existing dynamic transforms, including self-attention, however, are all limited for video understanding where correspondence relations in space and time, i. e. , motion information, are crucial for effective representation. In this work, we introduce a relational feature transform, dubbed the relational self-attention (RSA), that leverages rich structures of spatio-temporal relations in videos by dynamically generating relational kernels and aggregating relational contexts. Our experiments and ablation studies show that the RSA network substantially outperforms convolution and self-attention counterparts, achieving the state of the art on the standard motion-centric benchmarks for video action recognition, such as Something-Something-V1&V2, Diving48, and FineGym.

NeurIPS Conference 2020 Conference Paper

CircleGAN: Generative Adversarial Learning across Spherical Circles

Woohyeon Shim
Minsu Cho

We present a novel discriminator for GANs that improves realness and diversity of generated samples by learning a structured hypersphere embedding space using spherical circles. The proposed discriminator learns to populate realistic samples around the longest spherical circle, i. e. , a great circle, while pushing unrealistic samples toward the poles perpendicular to the great circle. Since longer circles occupy larger area on the hypersphere, they encourage more diversity in representation learning, and vice versa. Discriminating samples based on their corresponding spherical circles can thus naturally induce diversity to generated samples. We also extend the proposed method for conditional settings with class labels by creating a hypersphere for each category and performing class-wise discrimination and update. In experiments, we validate the effectiveness for both unconditional and conditional generation on standard benchmarks, achieving the state of the art.

AAAI Conference 2020 Conference Paper

InvNet: Encoding Geometric and Statistical Invariances in Deep Generative Models

Ameya Joshi
Minsu Cho
Viraj Shah
Balaji Pokuri
Soumik Sarkar
Baskar Ganapathysubramanian
Chinmay Hegde

Generative Adversarial Networks (GANs), while widely successful in modeling complex data distributions, have not yet been sufﬁciently leveraged in scientiﬁc computing and design. Reasons for this include the lack of ﬂexibility of GANs to represent discrete-valued image data, as well as the lack of control over physical properties of generated samples. We propose a new conditional generative modeling approach (InvNet) that efﬁciently enables modeling discrete-valued images, while allowing control over their parameterized geometric and statistical properties. We evaluate our approach on several synthetic and real world problems: navigating manifolds of geometric shapes with desired sizes; generation of binary two-phase materials; and the (challenging) problem of generating multi-orientation polycrystalline microstructures.

AAAI Conference 2020 Conference Paper

Real-Time Object Tracking via Meta-Learning: Efficient Model Adaptation and One-Shot Channel Pruning

Ilchae Jung
Kihyun You
Hyeonwoo Noh
Minsu Cho
Bohyung Han

We propose a novel meta-learning framework for real-time object tracking with efﬁcient model adaptation and channel pruning. Given an object tracker, our framework learns to ﬁne-tune its model parameters in only a few gradient-descent iterations during tracking while pruning its network channels using the target ground-truth at the ﬁrst frame. Such a learning problem is formulated as a meta-learning task, where a meta-tracker is trained by updating its meta-parameters for initial weights, learning rates, and pruning masks through carefully designed tracking simulations. The integrated metatracker greatly improves tracking performance by accelerating the convergence of online learning and reducing the cost of feature computation. Experimental evaluation on the standard datasets demonstrates its outstanding accuracy and speed compared to the state-of-the-art methods.

NeurIPS Conference 2019 Conference Paper

Mining GOLD Samples for Conditional GANs

Sangwoo Mo
Chiheon Kim
Sungwoong Kim
Minsu Cho
Jinwoo Shin

Conditional generative adversarial networks (cGANs) have gained a considerable attention in recent years due to its class-wise controllability and superior quality for complex generation tasks. We introduce a simple yet effective approach to improving cGANs by measuring the discrepancy between the data distribution and the model distribution on given samples. The proposed measure, coined the gap of log-densities (GOLD), provides an effective self-diagnosis for cGANs while being efficiently, computed from the discriminator. We propose three applications of the GOLD: example re-weighting, rejection sampling, and active learning, which improve the training, inference, and data selection of cGANs, respectively. Our experimental results demonstrate that the proposed methods outperform corresponding baselines for all three applications on different image datasets.

AAAI Conference 2017 Conference Paper

Text-Guided Attention Model for Image Captioning

Jonghwan Mun
Minsu Cho
Bohyung Han

Visual attention plays an important role to understand images and demonstrates its effectiveness in generating natural language descriptions of images. On the other hand, recent studies show that language associated with an image can steer visual attention in the scene during our cognitive process. Inspired by this, we introduce a text-guided attention model for image captioning, which learns to drive visual attention using associated captions. For this model, we propose an exemplarbased learning approach that retrieves from training data associated captions with each image, and use them to learn attention on visual features. Our attention model enables to describe a detailed state of scenes by distinguishing small or confusable objects effectively. We validate our model on MS- COCO Captioning benchmark and achieve the state-of-theart performance in standard metrics.