Arrow Research

Author name cluster

Jongwoo Ko

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

10 papers
2 author rows

Possible papers (10)

ICLR Conference 2025 Conference Paper

Beyond correlation: The impact of human uncertainty in measuring the effectiveness of automatic evaluation and LLM-as-a-judge

  • Aparna Elangovan
  • Lei Xu 0040
  • Jongwoo Ko
  • Mahsa Elyasi
  • Ling Liu
  • Sravan Babu Bodapati
  • Dan Roth 0001

The effectiveness of automatic evaluation of generative models is typically measured by comparing the labels generated via automation with human labels using correlation metrics. However, metrics like Krippendorff's $\alpha$ and Randolph's $\kappa$ were originally designed to measure the reliability of human labeling; they therefore make assumptions about typical human labeling behavior that may not hold for machine-generated labels. In this paper, we show how *relying on a single aggregate correlation score* can obscure fundamental differences between human labels and those from automatic evaluation, including LLM-as-a-Judge. Specifically, we demonstrate that when the proportion of samples with variation or uncertainty in human-assigned labels is relatively high, machine labels (generated by automatic evaluation methods) may superficially appear to have similar or better correlation with the human majority label than the human-to-human (HH) correlation. This can create the illusion that labels from automatic evaluation approximate the human majority label. However, as the proportion of samples with consistent human labels increases, the correlation between machine and human labels falls well below the HH correlation. Based on these findings, we first propose *stratifying data by human label uncertainty* to provide a more robust analysis of automatic evaluation performance. Second, recognizing that uncertainty and variation are inherent in perception-based human evaluations, such as those involving attitudes or preferences, we introduce a new metric, *binned Jensen-Shannon Divergence for perception*, to better measure the effectiveness of automatic evaluation in such scenarios. Third, we present visualization techniques, *perception charts*, to contextualize correlation measures appropriately and to show the strengths and limitations of automatic evaluation. We have open-sourced our analysis and visualization tools at https://github.com/amazon-science/BeyondCorrelation.
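
As a rough illustration of the kind of metric described above, the sketch below bins per-item human and machine (LLM-as-a-Judge) ratings into histograms and averages a per-item Jensen-Shannon divergence. It is not the paper's exact formulation; the bin edges, function names, and example ratings are all hypothetical.

```python
# Illustrative sketch only: bins per-item human and machine ratings into
# histograms and averages a Jensen-Shannon divergence across items.
# Bin edges, names, and example ratings are hypothetical.
import numpy as np

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence (base 2) between two discrete distributions."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * np.log2(a / b))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def binned_jsd(human_ratings, machine_ratings, bins):
    """Average per-item JSD between binned human and machine rating histograms."""
    divs = []
    for h, m in zip(human_ratings, machine_ratings):
        h_hist, _ = np.histogram(h, bins=bins)
        m_hist, _ = np.histogram(m, bins=bins)
        divs.append(js_divergence(h_hist, m_hist))
    return float(np.mean(divs))

# Hypothetical 1-5 ratings: several human annotators vs. several judge runs per item.
human = [[4, 5, 4, 3], [1, 2, 1, 1]]
machine = [[5, 5, 4, 4], [2, 2, 3, 2]]
print(binned_jsd(human, machine, bins=np.arange(0.5, 6.0)))
```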

ICML Conference 2025 Conference Paper

DistiLLM-2: A Contrastive Approach Boosts the Distillation of LLMs

  • Jongwoo Ko
  • Tianyi Chen
  • Sungnyun Kim
  • Tianyu Ding
  • Luming Liang
  • Ilya Zharkov
  • Se-Young Yun

Despite the success of distillation in large language models (LLMs), most prior work applies identical loss functions to both teacher- and student-generated data. These strategies overlook the synergy between loss formulations and data types, leading to a suboptimal performance boost in student models. To address this, we propose DistiLLM-2, a contrastive approach that simultaneously increases the likelihood of teacher responses and decreases that of student responses by harnessing this synergy. Our extensive experiments show that DistiLLM-2 not only builds high-performing student models across a wide range of tasks, including instruction-following and code generation, but also supports diverse applications, such as preference alignment and vision-language extensions. These findings highlight the potential of a contrastive approach to enhance the efficacy of LLM distillation by effectively aligning teacher and student models across varied data types.
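
To make the idea concrete, here is a minimal, hypothetical sketch of a contrastive distillation objective in the spirit described above: the student's likelihood of teacher-generated responses is pushed up while its likelihood of its own (student-generated) responses is pushed down. This is not the exact DistiLLM-2 loss; the function name, `beta`, and the pairing scheme are illustrative.

```python
# Minimal, hypothetical sketch of a contrastive distillation objective;
# NOT the exact DistiLLM-2 loss. `beta` and the response pairing are illustrative.
import torch
import torch.nn.functional as F

def contrastive_distill_loss(logp_teacher_resp, logp_student_resp, beta=1.0):
    """Student log-likelihoods (summed over tokens) of teacher- vs. student-generated responses."""
    # Push the student toward teacher-generated responses and away from its own.
    margin = beta * (logp_teacher_resp - logp_student_resp)
    return -F.logsigmoid(margin).mean()

# Hypothetical per-example sequence log-likelihoods under the student model.
logp_teacher = torch.tensor([-12.3, -8.7])   # teacher-generated responses
logp_student = torch.tensor([-10.1, -9.5])   # student-generated responses
print(contrastive_distill_loss(logp_teacher, logp_student))
```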

NeurIPS Conference 2025 Conference Paper

Flex-Judge: Text-Only Reasoning Unleashes Zero-Shot Multimodal Evaluators

  • Jongwoo Ko
  • Sungnyun Kim
  • Sungwoo Cho
  • Se-Young Yun

Human-generated reward signals are critical for aligning generative models with human preferences, guiding both training and inference-time evaluations. While large language models (LLMs) employed as proxy evaluators, i.e., LLM-as-a-Judge, significantly reduce the costs associated with manual annotation, they typically require extensive modality-specific training data and fail to generalize well across diverse multimodal tasks. In this paper, we propose Flex-Judge, a reasoning-guided multimodal judge model that leverages minimal textual reasoning data to robustly generalize across multiple modalities and evaluation formats. Our core intuition is that structured textual reasoning explanations inherently encode generalizable decision-making patterns, enabling an effective transfer to multimodal judgments, e.g., with images or videos. Empirical results demonstrate that Flex-Judge, despite being trained on significantly less textual data, achieves competitive or superior performance compared to state-of-the-art commercial APIs and extensively trained multimodal evaluators. Notably, Flex-Judge has broad impact in modalities such as molecules, where comprehensive evaluation benchmarks are scarce, underscoring its practical value in resource-constrained domains. Our framework highlights reasoning-based text supervision as a powerful, cost-effective alternative to traditional annotation-intensive approaches, substantially advancing scalable multimodal model-as-a-judge evaluation.

ICLR Conference 2025 Conference Paper

SeRA: Self-Reviewing and Alignment of LLMs using Implicit Reward Margins

  • Jongwoo Ko
  • Saket Dingliwal
  • Bhavana Ganesh
  • Sailik Sengupta
  • Sravan Babu Bodapati
  • Aram Galstyan

Direct alignment algorithms (DAAs), such as direct preference optimization (DPO), have become popular alternatives to Reinforcement Learning from Human Feedback (RLHF) due to their simplicity, efficiency, and stability. However, the preferences used by DAAs are usually collected before alignment training begins and remain unchanged (off-policy). This design leads to two problems: the policy model (1) picks up on spurious correlations in the dataset (as opposed to only learning alignment to human preferences), and (2) overfits to feedback on off-policy trajectories that are less likely to be generated by the updated policy model. To address these issues, we introduce Self-Reviewing and Alignment (SeRA), a cost-efficient and effective method that can be readily combined with existing DAAs. SeRA comprises two components: (1) sample selection using implicit reward margins to alleviate over-optimization on such undesired features, and (2) preference bootstrapping using implicit rewards to augment preference data with updated policy models in a cost-efficient manner. Extensive experiments, including on instruction-following tasks, demonstrate the effectiveness and generality of SeRA in training LLMs with diverse offline preference datasets and DAAs.
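
For context, the implicit reward margin mentioned above comes from the DPO-style implicit reward $r(x, y) = \beta \log \frac{\pi_\theta(y|x)}{\pi_{\mathrm{ref}}(y|x)}$. The sketch below computes that margin for a single preference pair; the actual SeRA selection rule (threshold and direction) is not reproduced here, and all names and values are illustrative.

```python
# Hedged sketch: DPO-style implicit rewards and the implicit reward margin for
# one preference pair. SeRA's actual selection rule is not reproduced;
# names and values below are illustrative.
import torch

def implicit_reward_margin(logp_policy_w, logp_ref_w,
                           logp_policy_l, logp_ref_l, beta=0.1):
    """Margin between implicit rewards of the chosen (w) and rejected (l) responses."""
    reward_w = beta * (logp_policy_w - logp_ref_w)   # beta * log[pi_theta(y_w|x) / pi_ref(y_w|x)]
    reward_l = beta * (logp_policy_l - logp_ref_l)   # beta * log[pi_theta(y_l|x) / pi_ref(y_l|x)]
    return reward_w - reward_l

# Hypothetical sequence log-likelihoods for one preference pair.
margin = implicit_reward_margin(
    logp_policy_w=torch.tensor(-20.0), logp_ref_w=torch.tensor(-24.0),
    logp_policy_l=torch.tensor(-18.0), logp_ref_l=torch.tensor(-17.0),
)
print(margin)  # pairs can then be filtered or weighted by this margin
```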

ICML Conference 2024 Conference Paper

DistiLLM: Towards Streamlined Distillation for Large Language Models

  • Jongwoo Ko
  • Sungnyun Kim
  • Tianyi Chen
  • Se-Young Yun

Knowledge distillation (KD) is widely used for compressing a teacher model to a smaller student model, reducing its inference cost and memory footprint while preserving model capabilities. However, current KD methods for auto-regressive sequence models (e.g., large language models) lack a standardized objective function. Moreover, the recent use of student-generated outputs to address training-inference mismatches has significantly escalated computational costs. To tackle these issues, we introduce DistiLLM, a more effective and efficient KD framework for auto-regressive language models. DistiLLM comprises two components: (1) a novel skew Kullback-Leibler divergence loss, where we unveil and leverage its theoretical properties, and (2) an adaptive off-policy approach designed to enhance the efficiency of utilizing student-generated outputs. Extensive experiments, including on instruction-following tasks, demonstrate the effectiveness of DistiLLM in building high-performing student models while achieving up to 4.3$\times$ speedup compared to recent KD methods.
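
As a rough sketch of the skew Kullback-Leibler divergence idea mentioned above: the student distribution is mixed with the teacher distribution before the KL term is computed, which keeps the divergence bounded and better behaved. The mixing coefficient `alpha` and the exact formulation used by DistiLLM may differ from this illustration.

```python
# Rough sketch of a skew KL divergence: mix student into teacher before the KL term.
# `alpha` and the exact DistiLLM formulation may differ from this illustration.
import torch
import torch.nn.functional as F

def skew_kl(teacher_logits, student_logits, alpha=0.1):
    p = F.softmax(teacher_logits, dim=-1)        # teacher next-token distribution
    q = F.softmax(student_logits, dim=-1)        # student next-token distribution
    mix = alpha * p + (1.0 - alpha) * q          # skewed mixture
    return (p * (torch.log(p + 1e-12) - torch.log(mix + 1e-12))).sum(dim=-1).mean()

# Hypothetical vocabulary-sized logits for a batch of 4 positions.
teacher_logits = torch.randn(4, 32000)
student_logits = torch.randn(4, 32000)
print(skew_kl(teacher_logits, student_logits))
```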

IJCAI Conference 2024 Conference Paper

Fine-tuning Pre-trained Models for Robustness under Noisy Labels

  • Sumyeong Ahn
  • Sihyeon Kim
  • Jongwoo Ko
  • Se-Young Yun

The presence of noisy labels in a training dataset can significantly impact the performance of machine learning models. In response to this issue, researchers have focused on identifying clean samples and reducing the influence of noisy labels. Recent works in this field have achieved notable success in terms of generalizability, albeit at the expense of extensive computing resources, so reducing computational costs remains a crucial challenge. Concurrently, other research areas have focused on developing fine-tuning techniques that efficiently achieve high generalization performance. Although these techniques have proven able to generalize well at low cost, they have seen limited exploration from a label-noise point of view. In this research, we aim to find an effective approach to fine-tuning pre-trained models on noisily labeled datasets. To achieve this goal, we empirically investigate the characteristics of pre-trained models under label noise and propose an algorithm, named TURN. We present the results of extensive testing and demonstrate both efficient and improved denoising performance on various benchmarks, surpassing previous methods.

AAAI Conference 2023 Conference Paper

A Gift from Label Smoothing: Robust Training with Adaptive Label Smoothing via Auxiliary Classifier under Label Noise

  • Jongwoo Ko
  • Bongsoo Yi
  • Se-Young Yun

As deep neural networks can easily overfit noisy labels, robust training in the presence of noisy labels is becoming an important challenge in modern deep learning. While existing methods address this problem in various directions, they still produce unpredictable sub-optimal results since they rely on the posterior information estimated by the feature extractor corrupted by noisy labels. Lipschitz regularization successfully alleviates this problem by training a robust feature extractor, but it requires longer training time and expensive computations. Motivated by this, we propose a simple yet effective method, called ALASCA, which efficiently provides a robust feature extractor under label noise. ALASCA integrates two key ingredients: (1) adaptive label smoothing based on our theoretical analysis that label smoothing implicitly induces Lipschitz regularization, and (2) auxiliary classifiers that enable practical application of intermediate Lipschitz regularization with negligible computations. We conduct wide-ranging experiments for ALASCA and combine our proposed method with previous noise-robust methods on several synthetic and real-world datasets. Experimental results show that our framework consistently improves the robustness of feature extractors and the performance of existing baselines with efficiency.
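
For reference, the basic ingredient behind the adaptive label smoothing described above is label-smoothed cross-entropy. The sketch below implements the standard (non-adaptive) version; ALASCA's adaptive smoothing schedule and auxiliary-classifier placement are not reproduced, and the smoothing value is illustrative.

```python
# Standard (non-adaptive) label-smoothed cross-entropy; ALASCA's adaptive
# schedule and auxiliary classifiers are not reproduced here.
import torch
import torch.nn.functional as F

def smoothed_cross_entropy(logits, targets, smoothing=0.1):
    num_classes = logits.size(-1)
    log_probs = F.log_softmax(logits, dim=-1)
    with torch.no_grad():
        # Spread `smoothing` mass uniformly, keep the rest on the true class.
        true_dist = torch.full_like(log_probs, smoothing / num_classes)
        true_dist.scatter_(1, targets.unsqueeze(1),
                           1.0 - smoothing + smoothing / num_classes)
    return torch.mean(torch.sum(-true_dist * log_probs, dim=-1))

logits = torch.randn(8, 10)              # hypothetical batch of 8 samples, 10 classes
targets = torch.randint(0, 10, (8,))
print(smoothed_cross_entropy(logits, targets, smoothing=0.2))
```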

ICLR Conference 2023 Conference Paper

CUDA: Curriculum of Data Augmentation for Long-tailed Recognition

  • Sumyeong Ahn
  • Jongwoo Ko
  • Se-Young Yun

Class imbalance problems frequently occur in real-world tasks, and conventional deep learning algorithms are well known to degrade on imbalanced training datasets. To mitigate this problem, many approaches have aimed to balance among the given classes by re-weighting or re-sampling training samples. These re-balancing methods increase the impact of minority classes and reduce the influence of majority classes on the output of models. However, the extracted representations may be of poor quality owing to the limited number of minority samples. To handle this restriction, several methods have been developed that enrich the representations of minority samples by leveraging the features of the majority samples. Despite extensive recent studies, no deep analysis has been conducted on which classes should be augmented or how strong the augmentation should be. In this study, we first investigate the correlation between the degree of augmentation and class-wise performance, and find that a proper degree of augmentation must be allocated to each class to mitigate class imbalance problems. Motivated by this finding, we propose CUDA: CUrriculum of Data Augmentation for long-tailed recognition, a simple and efficient curriculum designed to find the appropriate per-class strength of data augmentation. CUDA can simply be integrated into existing long-tailed recognition methods. We present experiments showing that CUDA achieves better generalization performance than the state-of-the-art method on various imbalanced datasets such as CIFAR-100-LT, ImageNet-LT, and iNaturalist 2018.
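
As an illustration of a per-class augmentation-strength curriculum in the spirit described above: each class keeps its own augmentation level, which is raised when the class is learned well and lowered otherwise. The update rule, threshold, and bounds below are hypothetical and may differ from CUDA's actual criterion.

```python
# Illustrative per-class augmentation-strength curriculum; the update rule,
# threshold, and bounds are hypothetical, not CUDA's exact criterion.
import numpy as np

def update_strengths(strengths, class_accuracies, threshold=0.6, max_level=10):
    new = strengths.copy()
    for c, acc in enumerate(class_accuracies):
        if acc >= threshold:
            new[c] = min(strengths[c] + 1, max_level)   # class handled well: augment harder
        else:
            new[c] = max(strengths[c] - 1, 0)           # class struggling: ease off
    return new

strengths = np.zeros(5, dtype=int)                 # 5 hypothetical classes, start at level 0
accuracies = np.array([0.9, 0.7, 0.5, 0.3, 0.8])   # per-class accuracy this epoch
print(update_strengths(strengths, accuracies))     # -> [1 1 0 0 1]
```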

AAAI Conference 2023 Conference Paper

Self-Contrastive Learning: Single-Viewed Supervised Contrastive Framework Using Sub-network

  • Sangmin Bae
  • Sungnyun Kim
  • Jongwoo Ko
  • Gihun Lee
  • Seungjong Noh
  • Se-Young Yun

Contrastive loss has significantly improved performance in supervised classification tasks by using a multi-viewed framework that leverages augmentation and label information. Augmentation enables contrast with another view of a single image but increases training time and memory usage. To exploit the strength of multiple views while avoiding the high computation cost, we introduce a multi-exit architecture that outputs multiple features of a single image in a single-viewed framework. To this end, we propose Self-Contrastive (SelfCon) learning, which self-contrasts within multiple outputs from different levels of a single network. The multi-exit architecture efficiently replaces multi-augmented images and leverages various information from different layers of a network. We demonstrate that SelfCon learning improves the classification performance of the encoder network, and we empirically analyze its advantages in terms of the single view and the sub-network. Furthermore, we provide theoretical evidence for the performance increase based on a mutual information bound. For ImageNet classification on ResNet-50, SelfCon improves accuracy by +0.6% while using 59% of the memory and 48% of the training time of Supervised Contrastive learning, and a simple ensemble of multi-exit outputs boosts performance by up to +1.5%. Our code is available at https://github.com/raymin0223/self-contrastive-learning.
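
A simplified, hypothetical sketch of self-contrast between two exits of one network is shown below: features of the same image from an intermediate exit and the final exit form a positive pair, with other images in the batch as negatives. SelfCon's actual objective also uses label information (as in supervised contrastive learning), which is omitted here for brevity.

```python
# Simplified self-contrast between two exits of one network; SelfCon's actual
# objective also uses label information, omitted here for brevity.
import torch
import torch.nn.functional as F

def selfcon_infonce(feat_exit, feat_final, temperature=0.1):
    """feat_exit, feat_final: (batch, dim) embeddings from two exits of one network."""
    z1 = F.normalize(feat_exit, dim=-1)
    z2 = F.normalize(feat_final, dim=-1)
    logits = z1 @ z2.t() / temperature          # (batch, batch) similarity matrix
    targets = torch.arange(z1.size(0))          # the positive pair is the same image index
    return F.cross_entropy(logits, targets)

feat_exit = torch.randn(16, 128)    # hypothetical intermediate-exit features
feat_final = torch.randn(16, 128)   # final-exit features
print(selfcon_infonce(feat_exit, feat_final))
```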

NeurIPS Conference 2021 Conference Paper

FINE Samples for Learning with Noisy Labels

  • Taehyeon Kim
  • Jongwoo Ko
  • Sangwook Cho
  • JinHwan Choi
  • Se-Young Yun

Modern deep neural networks (DNNs) become frail when datasets contain noisy (incorrect) class labels. Robust techniques for learning in the presence of noisy labels fall into two categories: developing noise-robust loss functions or using noise-cleansing methods that detect the noisy data. Recently, noise-cleansing methods have been considered the most competitive noisy-label learning algorithms. Despite their success, their noisy-label detectors are often based more on heuristics than on theory, requiring a robust classifier to predict the noisy data via loss values. In this paper, we propose a novel detector for filtering label noise. Unlike most existing methods, we focus on the latent representation dynamics of each sample and measure the alignment between the latent distribution and each representation using the eigendecomposition of the data gram matrix. Our framework, coined FINE (filtering noisy instances via their eigenvectors), provides a robust detector built from derivative-free, simple methods with theoretical guarantees. Under our framework, we propose three applications of FINE: a sample-selection approach, a semi-supervised learning approach, and collaboration with noise-robust loss functions. Experimental results show that the proposed methods consistently outperform the corresponding baselines for all three applications on various benchmark datasets.
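
As a rough sketch of the eigenvector-based idea described above: for each class, take the principal eigenvector of the gram matrix of its feature representations and score each sample by its squared alignment with that eigenvector. FINE's thresholding and downstream selection steps are not reproduced, and the function name and data are illustrative.

```python
# Rough sketch of an eigenvector-based alignment score; FINE's thresholding
# and selection steps are not reproduced. Names and data are illustrative.
import numpy as np

def alignment_scores(features):
    """features: (n_samples, dim) latent representations from one class."""
    gram = features.T @ features                  # (dim, dim) gram matrix
    eigvals, eigvecs = np.linalg.eigh(gram)       # eigenvalues in ascending order
    principal = eigvecs[:, -1]                    # eigenvector of the largest eigenvalue
    # Larger squared projection onto the principal direction = better aligned (likely clean).
    return (features @ principal) ** 2

rng = np.random.default_rng(0)
feats = rng.normal(size=(100, 16))                # hypothetical per-class features
print(alignment_scores(feats)[:5])
```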