Author name cluster

Junmo Kim

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

21 papers

1 author row

AAAI Conference 2026 Short Paper

Tailored ViT Slimming: Budget-Aware Multi-Dimensional Sparsity Regularization for Vision Transformers Pruning (Student Abstract)

Suwoong Lee
Seungjae Lee
Yunho Jeon
Junmo Kim

We propose Tailored ViT Slimming (TVS), a budget-aware multi-dimensional pruning framework for Vision Transformers. TVS injects learnable masks into MHSA and MLP modules and applies adaptive non-convex sparsity regularization to achieve maximal utilization of parameters under strict module-wise budgets. In addition, by retaining scaled masks after pruning, TVS avoids abrupt accuracy drops and provides stable initialization for fine-tuning. On ImageNet-1k with DeiT-S and DeiT-B, TVS consistently outperforms prior ViT compression methods. This result empirically shows that the non-convex sparsity regularizer is effective not only in CNNs but also in ViTs.

PDF Details DOI

NeurIPS Conference 2025 Conference Paper

Frequency-Aware Token Reduction for Efficient Vision Transformer

DongJae Lee
Jiwan Hur
Jaehyun Choi
Jaemyung Yu
Junmo Kim

Vision Transformers have demonstrated exceptional performance across various computer vision tasks, yet their quadratic computational complexity concerning token length remains a significant challenge. To address this, token reduction methods have been widely explored. However, existing approaches often overlook the frequency characteristics of self-attention, such as rank collapsing and over-smoothing phenomenon. In this paper, we propose a frequency-aware token reduction strategy that improves computational efficiency while preserving performance by mitigating rank collapsing. Our method partitions tokens into high-frequency tokens and low-frequency tokens. high-frequency tokens are selectively preserved, while low-frequency tokens are aggregated into a compact direct current token to retain essential low-frequency components. Through extensive experiments and analysis, we demonstrate that our approach significantly improves accuracy while reducing computational overhead and mitigating rank collapsing and over smoothing. Furthermore, we analyze the previous methods, shedding light on their implicit frequency characteristics and limitations. The code is available in https: //github. com/jhtwosun/frequency-aware-token-pruning.

PDF Details

NeurIPS Conference 2025 Conference Paper

Preference Distillation via Value based Reinforcement Learning

Minchan Kwon
Junwon Ko
Kangil kim
Junmo Kim

Direct Preference Optimization (DPO) is a powerful paradigm to align language models with human preferences using pairwise comparisons. However, its binary win-or-loss supervision often proves insufficient for training small models with limited capacity. Prior works attempt to distill information from large teacher models using behavior cloning or KL divergence. These methods often focus on mimicking current behavior and overlook distilling reward modeling. To address this issue, we propose \textit{Teacher Value-based Knowledge Distillation} (TVKD), which introduces an auxiliary reward from the value function of the teacher model to provide a soft guide. This auxiliary reward is formulated to satisfy potential-based reward shaping, ensuring that the global reward structure and optimal policy of DPO are preserved. TVKD can be integrated into the standard DPO training framework and does not require additional rollouts. Our experimental results show that TVKD consistently improves performance across various benchmarks and model sizes.

PDF Details

NeurIPS Conference 2025 Conference Paper

Video Diffusion Models Excel at Tracking Similar-Looking Objects Without Supervision

Chenshuang Zhang
Kang Zhang
Joon Son Chung
In So Kweon
Junmo Kim
Chengzhi Mao

Distinguishing visually similar objects by their motion remains a critical challenge in computer vision. Although supervised trackers show promise, contemporary self-supervised trackers struggle when visual cues become ambiguous, limiting their scalability and generalization without extensive labeled data. We find that pre-trained video diffusion models inherently learn motion representations suitable for tracking without task-specific training. This ability arises because their denoising process isolates motion in early, high-noise stages, distinct from later appearance refinement. Capitalizing on this discovery, our self-supervised tracker significantly improves performance in distinguishing visually similar objects, an underexplored failure point for existing methods. Our method achieves up to a 6-point improvement over recent self-supervised approaches on established benchmarks and our newly introduced tests focused on tracking visually similar items. Visualizations confirm that these diffusion-derived motion representations enable robust tracking of even identical objects across challenging viewpoint changes and deformations. Project page: \small{\url{https: //chenshuang-zhang. github. io/projects/ted}}.

PDF Details

EAAI Journal 2024 Journal Article

ContextMix: A context-aware data augmentation method for industrial visual inspection systems

Hyungmin Kim
Donghun Kim
Pyunghwan Ahn
Sungho Suh
Hansang Cho
Junmo Kim

While deep neural networks have achieved remarkable performance, data augmentation has emerged as a crucial strategy to mitigate overfitting and enhance network performance. These techniques hold particular significance in industrial manufacturing contexts. Recently, image mixing-based methods have been introduced, exhibiting improved performance on public benchmark datasets. However, their application to industrial tasks remains challenging. The manufacturing environment generates massive amounts of unlabeled data on a daily basis, with only a few instances of abnormal data occurrences. This leads to severe data imbalance. Thus, creating well-balanced datasets is not straightforward due to the high costs associated with labeling. Nonetheless, this is a crucial step for enhancing productivity. For this reason, we introduce ContextMix, a method tailored for industrial applications and benchmark datasets. ContextMix generates novel data by resizing entire images and integrating them into other images within the batch. This approach enables our method to learn discriminative features based on varying sizes from resized images and train informative secondary features for object recognition using occluded images. With the minimal additional computation cost of image resizing, ContextMix enhances performance compared to existing augmentation techniques. We evaluate its effectiveness across classification, detection, and segmentation tasks using various network architectures on public benchmark datasets. Our proposed method demonstrates improved results across a range of robustness tasks. Its efficacy in real industrial environments is particularly noteworthy, as demonstrated using the passive component dataset.

Details DOI

AAAI Conference 2024 Conference Paper

Foreseeing Reconstruction Quality of Gradient Inversion: An Optimization Perspective

Hyeong Gwon Hong
Yooshin Cho
Hanbyel Cho
Jaesung Ahn
Junmo Kim

Gradient inversion attacks can leak data privacy when clients share weight updates with the server in federated learning (FL). Existing studies mainly use L2 or cosine distance as the loss function for gradient matching in the attack. Our empirical investigation shows that the vulnerability ranking varies with the loss function used. Gradient norm, which is commonly used as a vulnerability proxy for gradient inversion attack, cannot explain this as it remains constant regardless of the loss function for gradient matching. In this paper, we propose a loss-aware vulnerability proxy (LAVP) for the first time. LAVP refers to either the maximum or minimum eigenvalue of the Hessian with respect to gradient matching loss at ground truth. This suggestion is based on our theoretical findings regarding the local optimization of the gradient inversion in proximity to the ground truth, which corresponds to the worst case attack scenario. We demonstrate the effectiveness of LAVP on various architectures and datasets, showing its consistent superiority over the gradient norm in capturing sample vulnerabilities. The performance of each proxy is measured in terms of Spearman's rank correlation with respect to several similarity scores. This work will contribute to enhancing FL security against any potential loss functions beyond L2 or cosine distance in the future.

PDF Details DOI

AAAI Conference 2024 Conference Paper

FRED: Towards a Full Rotation-Equivariance in Aerial Image Object Detection

Chanho Lee
Jinsu Son
Hyounguk Shon
Yunho Jeon
Junmo Kim

Rotation-equivariance is an essential yet challenging property in oriented object detection. While general object detectors naturally leverage robustness to spatial shifts due to the translation-equivariance of the conventional CNNs, achieving rotation-equivariance remains an elusive goal. Current detectors deploy various alignment techniques to derive rotation-invariant features, but still rely on high capacity models and heavy data augmentation with all possible rotations. In this paper, we introduce a Fully Rotation-Equivariant Oriented Object Detector (FRED), whose entire process from the image to the bounding box prediction is strictly equivariant. Specifically, we decouple the invariant task (object classification) and the equivariant task (object localization) to achieve end-to-end equivariance. We represent the bounding box as a set of rotation-equivariant vectors to implement rotation-equivariant localization. Moreover, we utilized these rotation-equivariant vectors as offsets in the deformable convolution, thereby enhancing the existing advantages of spatial adaptation. Leveraging full rotation-equivariance, our FRED demonstrates higher robustness to image-level rotation compared to existing methods. Furthermore, we show that FRED is one step closer to non-axis aligned learning through our experiments. Compared to state-of-the-art methods, our proposed method delivers comparable performance on DOTA-v1.0 and outperforms by 1.5 mAP on DOTA-v1.5, all while significantly reducing the model parameters to 16%.

PDF Details DOI

AAAI Conference 2024 Conference Paper

Modeling Stereo-Confidence out of the End-to-End Stereo-Matching Network via Disparity Plane Sweep

Jae Young Lee
Woonghyun Ka
Jaehyun Choi
Junmo Kim

We propose a novel stereo-confidence that can be measured externally to various stereo-matching networks, offering an alternative input modality choice of the cost volume for learning-based approaches, especially in safety-critical systems. Grounded in the foundational concepts of disparity definition and the disparity plane sweep, the proposed stereo-confidence method is built upon the idea that any shift in a stereo-image pair should be updated in a corresponding amount shift in the disparity map. Based on this idea, the proposed stereo-confidence method can be summarized in three folds. 1) Using the disparity plane sweep, multiple disparity maps can be obtained and treated as a 3-D volume (predicted disparity volume), like the cost volume is constructed. 2) One of these disparity maps serves as an anchor, allowing us to define a desirable (or ideal) disparity profile at every spatial point. 3) By comparing the desirable and predicted disparity profiles, we can quantify the level of matching ambiguity between left and right images for confidence measurement. Extensive experimental results using various stereo-matching networks and datasets demonstrate that the proposed stereo-confidence method not only shows competitive performance on its own but also consistent performance improvements when it is used as an input modality for learning-based stereo-confidence methods.

PDF Details DOI

NeurIPS Conference 2024 Conference Paper

Self-supervised Transformation Learning for Equivariant Representations

Jaemyung Yu
Jaehyun Choi
Dong-Jae Lee
HyeongGwon Hong
Junmo Kim

Unsupervised representation learning has significantly advanced various machine learning tasks. In the computer vision domain, state-of-the-art approaches utilize transformations like random crop and color jitter to achieve invariant representations, embedding semantically the same inputs despite transformations. However, this can degrade performance in tasks requiring precise features, such as localization or flower classification. To address this, recent research incorporates equivariant representation learning, which captures transformation-sensitive information. However, current methods depend on transformation labels and thus struggle with interdependency and complex transformations. We propose Self-supervised Transformation Learning (STL), replacing transformation labels with transformation representations derived from image pairs. The proposed method ensures transformation representation is image-invariant and learns corresponding equivariant transformations, enhancing performance without increased batch complexity. We demonstrate the approach’s effectiveness across diverse classification and detection tasks, outperforming existing methods in 7 out of 11 benchmarks and excelling in detection. By integrating complex transformations like AugMix, unusable by prior equivariant methods, this approach enhances performance across tasks, underscoring its adaptability and resilience. Additionally, its compatibility with various base models highlights its flexibility and broad applicability. The code is available at https: //github. com/jaemyung-u/stl.

PDF Details DOI

TMLR Journal 2024 Journal Article

Towards Understanding Dual BN In Hybrid Adversarial Training

Chenshuang Zhang
Chaoning Zhang
Kang Zhang
Axi Niu
Junmo Kim
In So Kweon

There is a growing concern about applying batch normalization (BN) in adversarial training (AT), especially when the model is trained on both adversarial samples and clean samples (termed Hybrid-AT). With the assumption that adversarial and clean samples are from two different domains, a common practice in prior works is to adopt Dual BN, where BN$_{adv}$ and BN$_{clean}$ are used for adversarial and clean branches, respectively. A popular belief for motivating Dual BN is that estimating normalization statistics of this mixture distribution is challenging and thus disentangling it for normalization achieves stronger robustness. In contrast to this belief, we reveal that disentangling statistics plays a less role than disentangling affine parameters in model training. This finding aligns with prior work (Rebuffi et al., 2023), and we build upon their research for further investigations. We demonstrate that the domain gap between adversarial and clean samples is not very large, which is counter-intuitive considering the significant influence of adversarial perturbation on the model accuracy. We further propose a two-task hypothesis which serves as the empirical foundation and a unified framework for Hybrid-AT improvement. We also investigate Dual BN in test-time and reveal that affine parameters characterize the robustness during inference. Overall, our work sheds new light on understanding the mechanism of Dual BN in Hybrid-AT and its underlying justification.

PDF Details

NeurIPS Conference 2024 Conference Paper

Unlocking the Capabilities of Masked Generative Models for Image Synthesis via Self-Guidance

Jiwan Hur
Dong-Jae Lee
Gyojin Han
Jaehyun Choi
Yunho Jeon
Junmo Kim

Masked generative models (MGMs) have shown impressive generative ability while providing an order of magnitude efficient sampling steps compared to continuous diffusion models. However, MGMs still underperform in image synthesis compared to recent well-developed continuous diffusion models with similar size in terms of quality and diversity of generated samples. A key factor in the performance of continuous diffusion models stems from the guidance methods, which enhance the sample quality at the expense of diversity. In this paper, we extend these guidance methods to generalized guidance formulation for MGMs and propose a self-guidance sampling method, which leads to better generation quality. The proposed approach leverages an auxiliary task for semantic smoothing in vector-quantized token space, analogous to the Gaussian blur in continuous pixel space. Equipped with the parameter-efficient fine-tuning method and high-temperature sampling, MGMs with the proposed self-guidance achieve a superior quality-diversity trade-off, outperforming existing sampling methods in MGMs with more efficient training and sampling costs. Extensive experiments with the various sampling hyperparameters confirm the effectiveness of the proposed self-guidance.

PDF Details DOI

AAAI Conference 2023 Conference Paper

Frequency Selective Augmentation for Video Representation Learning

Jinhyung Kim
Taeoh Kim
Minho Shim
Dongyoon Han
Dongyoon Wee
Junmo Kim

Recent self-supervised video representation learning methods focus on maximizing the similarity between multiple augmented views from the same video and largely rely on the quality of generated views. However, most existing methods lack a mechanism to prevent representation learning from bias towards static information in the video. In this paper, we propose frequency augmentation (FreqAug), a spatio-temporal data augmentation method in the frequency domain for video representation learning. FreqAug stochastically removes specific frequency components from the video so that learned representation captures essential features more from the remaining information for various downstream tasks. Specifically, FreqAug pushes the model to focus more on dynamic features rather than static features in the video via dropping spatial or temporal low-frequency components. To verify the generality of the proposed method, we experiment with FreqAug on multiple self-supervised learning frameworks along with standard augmentations. Transferring the improved representation to five video action recognition and two temporal action localization downstream tasks shows consistent improvements over baselines.

PDF Details DOI

NeurIPS Conference 2022 Conference Paper

UniCLIP: Unified Framework for Contrastive Language-Image Pre-training

Janghyeon Lee
Jongsuk Kim
Hyounguk Shon
Bumsoo Kim
Seung Hwan Kim
Honglak Lee
Junmo Kim

Pre-training vision-language models with contrastive objectives has shown promising results that are both scalable to large uncurated datasets and transferable to many downstream applications. Some following works have targeted to improve data efficiency by adding self-supervision terms, but inter-domain (image-text) contrastive loss and intra-domain (image-image) contrastive loss are defined on individual spaces in those works, so many feasible combinations of supervision are overlooked. To overcome this issue, we propose UniCLIP, a Unified framework for Contrastive Language-Image Pre-training. UniCLIP integrates the contrastive loss of both inter-domain pairs and intra-domain pairs into a single universal space. The discrepancies that occur when integrating contrastive loss between different domains are resolved by the three key components of UniCLIP: (1) augmentation-aware feature embedding, (2) MP-NCE loss, and (3) domain dependent similarity measure. UniCLIP outperforms previous vision-language pre-training methods on various single- and multi-modality downstream tasks. In our experiments, we show that each component that comprises UniCLIP contributes well to the final performance.

PDF Details

AAAI Conference 2021 Conference Paper

Discriminative Region Suppression for Weakly-Supervised Semantic Segmentation

Beomyoung Kim
Sangeun Han
Junmo Kim

Weakly-supervised semantic segmentation (WSSS) using image-level labels has recently attracted much attention for reducing annotation costs. Existing WSSS methods utilize localization maps from the classification network to generate pseudo segmentation labels. However, since localization maps obtained from the classifier focus only on sparse discriminative object regions, it is difficult to generate highquality segmentation labels. To address this issue, we introduce discriminative region suppression (DRS) module that is a simple yet effective method to expand object activation regions. DRS suppresses the attention on discriminative regions and spreads it to adjacent non-discriminative regions, generating dense localization maps. DRS requires few or no additional parameters and can be plugged into any network. Furthermore, we introduce an additional learning strategy to give a self-enhancement of localization maps, named localization map refinement learning. Benefiting from this refinement learning, localization maps are refined and enhanced by recovering some missing parts or removing noise itself. Due to its simplicity and effectiveness, our approach achieves mIoU 71. 4% on the PASCAL VOC 2012 segmentation benchmark using only image-level labels. Extensive experiments demonstrate the effectiveness of our approach.

PDF Details

AAAI Conference 2021 Conference Paper

Linearly Replaceable Filters for Deep Network Channel Pruning

Donggyu Joo
Eojindl Yi
Sunghyun Baek
Junmo Kim

Convolutional neural networks (CNNs) have achieved remarkable results; however, despite the development of deep learning, practical user applications are fairly limited because heavy networks can be used solely with the latest hardware and software supports. Therefore, network pruning is gaining attention for general applications in various fields. This paper proposes a novel channel pruning method, Linearly Replaceable Filter (LRF), which suggests that a filter that can be approximated by the linear combination of other filters is replaceable. Moreover, an additional method called Weights Compensation is proposed to support the LRF method. This is a technique that effectively reduces the output difference caused by removing filters via direct weight modification. Through various experiments, we have confirmed that our method achieves state-of-the-art performance in several benchmarks. In particular, on ImageNet, LRF-60 reduces approximately 56% of FLOPs on ResNet-50 without top-5 accuracy drop. Further, through extensive analyses, we proved the effectiveness of our approaches.

PDF Details

AAAI Conference 2021 Conference Paper

Patch-Wise Attention Network for Monocular Depth Estimation

Sihaeng Lee
Janghyeon Lee
Byungju Kim
Eojindl Yi
Junmo Kim

In computer vision, monocular depth estimation is the problem of obtaining a high-quality depth map from a twodimensional image. This map provides information on threedimensional scene geometry, which is necessary for various applications in academia and industry, such as robotics and autonomous driving. Recent studies based on convolutional neural networks achieved impressive results for this task. However, most previous studies did not consider the relationships between the neighboring pixels in a local area of the scene. To overcome the drawbacks of existing methods, we propose a patch-wise attention method for focusing on each local area. After extracting patches from an input feature map, our module generates attention maps for each local patch, using two attention modules for each patch along the channel and spatial dimensions. Subsequently, the attention maps return to their initial positions and merge into one attention feature. Our method is straightforward but effective. The experimental results on two challenging datasets, KITTI and NYU Depth V2, demonstrate that the proposed method achieves significant performance. Furthermore, our method outperforms other state-of-the-art methods on the KITTI depth estimation benchmark.

PDF Details

AAAI Conference 2020 Conference Paper

Residual Continual Learning

Janghyeon Lee
Donggyu Joo
Hyeong Gwon Hong
Junmo Kim

We propose a novel continual learning method called Residual Continual Learning (ResCL). Our method can prevent the catastrophic forgetting phenomenon in sequential learning of multiple tasks, without any source task information except the original network. ResCL reparameterizes network parameters by linearly combining each layer of the original network and a ﬁne-tuned network; therefore, the size of the network does not increase at all. To apply the proposed method to general convolutional neural networks, the effects of batch normalization layers are also considered. By utilizing residuallearning-like reparameterization and a special weight decay loss, the trade-off between source and target performance is effectively controlled. The proposed method exhibits state-ofthe-art performance in various continual learning scenarios.

PDF Details

NeurIPS Conference 2018 Conference Paper

Constructing Fast Network through Deconstruction of Convolution

Yunho Jeon
Junmo Kim

Convolutional neural networks have achieved great success in various vision tasks; however, they incur heavy resource costs. By using deeper and wider networks, network accuracy can be improved rapidly. However, in an environment with limited resources (e. g. , mobile applications), heavy networks may not be usable. This study shows that naive convolution can be deconstructed into a shift operation and pointwise convolution. To cope with various convolutions, we propose a new shift operation called active shift layer (ASL) that formulates the amount of shift as a learnable function with shift parameters. This new layer can be optimized end-to-end through backpropagation and it can provide optimal shift values. Finally, we apply this layer to a light and fast network that surpasses existing state-of-the-art networks.

PDF Details

AAAI Conference 2018 Conference Paper

Less-Forgetful Learning for Domain Expansion in Deep Neural Networks

Heechul Jung
Jeongwoo Ju
Minju Jung
Junmo Kim

Expanding the domain that deep neural network has already learned without accessing old domain data is a challenging task because deep neural networks forget previously learned information when learning new data from a new domain. In this paper, we propose a less-forgetful learning method for the domain expansion scenario. While existing domain adaptation techniques solely focused on adapting to new domains, the proposed technique focuses on working well with both old and new domains without needing to know whether the input is from the old or new domain. First, we present two naive approaches which will be problematic, then we provide a new method using two proposed properties for lessforgetful learning. Finally, we prove the effectiveness of our method through experiments on image classiﬁcation tasks. All datasets used in the paper, will be released on our website for someone’s follow-up study.

PDF Details

AAAI Conference 2016 Conference Paper

Supervised Hashing via Uncorrelated Component Analysis

Sungryull Sohn
Hyunwoo Kim
Junmo Kim

The Approximate Nearest Neighbor (ANN) search problem is important in applications such as information retrieval. Several hashing-based search methods that provide effective solutions to the ANN search problem have been proposed. However, most of these focus on similarity preservation and coding error minimization, and pay little attention to optimizing the precisionrecall curve or receiver operating characteristic curve. In this paper, we propose a novel projection-based hashing method that attempts to maximize the precision and recall. We ﬁrst introduce an uncorrelated component analysis (UCA) by examining the precision and recall, and then propose a UCA-based hashing method. The proposed method is evaluated with a variety of datasets. The results show that UCA-based hashing outperforms state-of-the-art methods, and has computationally efﬁcient training and encoding processes.

PDF Details

AAAI Conference 2015 Conference Paper

On the Equivalence of Linear Discriminant Analysis and Least Squares

Kibok Lee
Junmo Kim

Linear discriminant analysis (LDA) is a popular dimensionality reduction and classification method that simultaneously maximizes between-class scatter and minimizes within-class scatter. In this paper, we verify the equivalence of LDA and least squares (LS) with a set of dependent variable matrices. The equivalence is in the sense that the LDA solution matrix and the LS solution matrix have the same range. The resulting LS provides an intuitive interpretation in which its solution performs data clustering according to class labels. Further, the fact that LDA and LS have the same range allows us to design a two-stage algorithm that computes the LDA solution given by generalized eigenvalue decomposition (GEVD), much faster than computing the original GEVD. Experimental results demonstrate the equivalence of the LDA solution and the proposed LS solution.

PDF Details