Arrow Research search

Author name cluster

Minsu Kim

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

31 papers
2 author rows

Possible papers

31

TMLR Journal 2026 Journal Article

Offline Model-Based Optimization: Comprehensive Review

  • Minsu Kim
  • Jiayao Gu
  • Ye Yuan
  • Taeyoung Yun
  • Zixuan Liu
  • Yoshua Bengio
  • Can Chen

Offline black-box optimization is a fundamental challenge in science and engineering, where the goal is to optimize black-box functions using only offline datasets. This setting is particularly relevant when querying the objective function is prohibitively expensive or infeasible, with applications spanning protein engineering, material discovery, neural architecture search, and beyond. The main difficulty lies in accurately estimating the objective landscape beyond the available data, where extrapolations are fraught with significant epistemic uncertainty. This uncertainty can lead to objective hacking (reward hacking)—exploiting model inaccuracies in unseen regions—or other spurious optimizations that yield misleadingly high performance estimates outside the offline distribution. Recent advances in model-based optimization (MBO) have harnessed the generalization capabilities of deep neural networks to develop offline-specific surrogate and generative models. Trained with carefully designed strategies, these models are more robust against out-of-distribution issues, facilitating the discovery of improved designs. Despite its growing impact in accelerating scientific discovery, the field lacks a comprehensive review. To bridge this gap, we present the first thorough review of offline MBO. We begin by formalizing the problem for both single-objective and multi-objective settings and by reviewing recent benchmarks and evaluation metrics. We then categorize existing approaches into two key areas: surrogate modeling, which emphasizes accurate function approximation in out-of-distribution regions, and generative modeling, which explores high-dimensional design spaces to identify high-performing designs. Finally, we examine the key challenges and propose promising directions for advancement in this rapidly evolving field.

IROS Conference 2025 Conference Paper

A Deep Reinforcement Learning based End-to-End Control Framework for Lower Limb Exoskeletons with Smooth Movement Transitions

  • Minsu Kim
  • Woo-Jeong Baek
  • Jaeheung Park

This paper presents an active control strategy for lower limb exoskeletons by proposing an end-to-end framework employing deep reinforcement learning (DRL) to enable smooth transitions between different movement patterns. The majority of existing methods in the exoskeleton literature employ finite state machines (FSMs), which have proven successful in predicting the control strategy for the next state on the basis of sensor data such as IMU readings, force measurements, etc. However, one drawback of FSMs is their inflexibility regarding sudden changes. Specifically, FSMs rely on discrete state transitions, which makes it hard to manage smooth continuous movements and increases the chance of abrupt changes in control during transitions. These, in turn, raise safety concerns for the user. While learning-based control approaches have been suggested in recent years, their validation has been performed only in simulation environments; real-world applicability therefore remains an open research question to date. To address this issue, we provide the first contribution in this field that proposes an end-to-end learning framework with a Deep Deterministic Policy Gradient (DDPG) module to enable smooth transitions between movement patterns under real-world conditions. By introducing several evaluation metrics, we demonstrate that our framework outperforms existing methods in terms of adaptability and smoothness of movement transitions.

NeurIPS Conference 2025 Conference Paper

Adaptive Inference-Time Scaling via Cyclic Diffusion Search

  • Gyubin Lee
  • Bao Truong
  • Jaesik Yoon
  • Dongwoo Lee
  • Minsu Kim
  • Yoshua Bengio
  • Sungjin Ahn

Diffusion models have demonstrated strong generative capabilities across domains ranging from image synthesis to complex reasoning tasks. However, most inference-time scaling methods rely on fixed denoising schedules, limiting their ability to adaptively allocate computation based on instance difficulty or task-specific demands. We introduce the challenge of adaptive inference-time scaling—dynamically adjusting computational effort during inference—and propose Adaptive Bi-directional Cyclic Diffusion (ABCD), a flexible, search-based inference framework. ABCD refines outputs through bi-directional diffusion cycles while adaptively controlling exploration depth and termination. It comprises three components: Cyclic Diffusion Search, Automatic Exploration-Exploitation Balancing, and Adaptive Thinking Time. Experiments show that ABCD improves performance across diverse tasks while maintaining computational efficiency.

TMLR Journal 2025 Journal Article

Decoupled Sequence and Structure Generation for Realistic Antibody Design

  • Nayoung Kim
  • Minsu Kim
  • Sungsoo Ahn
  • Jinkyoo Park

Recently, deep learning has made rapid progress in antibody design, which plays a key role in the advancement of therapeutics. A dominant paradigm is to train a model to jointly generate the antibody sequence and the structure as a candidate. However, joint generation requires the model to produce both the discrete amino acid categories and the continuous 3D coordinates; this limits the space of possible architectures and may lead to suboptimal performance. In response, we propose an antibody sequence-structure decoupling (ASSD) framework, which separates sequence generation and structure prediction. Although simple, our approach allows the use of powerful neural architectures and demonstrates notable performance improvements. We also find that the widely used non-autoregressive generators promote sequences with overly repeating tokens. Such sequences are both out-of-distribution and prone to undesirable developability properties that can trigger harmful immune responses in patients. To resolve this, we introduce a composition-based objective that allows an efficient trade-off between high performance and low token repetition. ASSD shows improved performance in various antibody design experiments, while the composition-based objective successfully mitigates token repetition of non-autoregressive models.

NeurIPS Conference 2025 Conference Paper

Energy-based generator matching: A neural sampler for general state space

  • Dongyeop Woo
  • Minsu Kim
  • Minkyu Kim
  • Kiyoung Seong
  • Sungsoo Ahn

We propose Energy-based generator matching (EGM), a modality-agnostic approach to train generative models from energy functions in the absence of data. Extending the recently proposed generator matching, EGM enables training of arbitrary continuous-time Markov processes, e.g., diffusion, flow, and jump, and can generate data from continuous, discrete, and a mixture of the two modalities. To this end, we propose estimating the generator matching loss using self-normalized importance sampling with an additional bootstrapping trick to reduce variance in the importance weight. We validate EGM on both discrete and multimodal tasks up to 100 and 20 dimensions, respectively.

NeurIPS Conference 2025 Conference Paper

MoME: Mixture of Matryoshka Experts for Audio-Visual Speech Recognition

  • Umberto Cappellazzo
  • Minsu Kim
  • Pingchuan Ma
  • Honglie Chen
  • Xubo Liu
  • Stavros Petridis
  • Maja Pantic

Large language models (LLMs) have recently shown strong potential in audio-visual speech recognition (AVSR), but their high computational demands and sensitivity to token granularity limit their practicality in resource-constrained settings. Token compression methods can reduce inference cost, but they require fixing a compression rate in advance and produce a single fixed-length output, offering no flexibility to balance information density and efficiency at inference time. Matryoshka representation learning (MRL) addresses this by enabling a single model to operate across multiple token granularities, allowing compression rates to be adjusted dynamically. However, current MRL-based methods treat each scale independently during training, limiting cross-scale generalization, robustness at high compression, and interpretability. To overcome these limitations, we propose MoME (Mixture of Matryoshka Experts), a novel framework that integrates sparse Mixture-of-Experts (MoE) into MRL-based LLMs for AVSR. MoME augments a frozen LLM with top-k routed and shared experts, allowing dynamic capacity allocation across scales and modalities. A shared router promotes consistent expert activation across granularities, enabling compressed sequences to benefit from representations learned at lower compression. Experiments on LRS2 and LRS3 demonstrate that MoME achieves state-of-the-art performance across AVSR, ASR, and VSR tasks, while requiring significantly fewer parameters and maintaining robustness under noise. MoME unifies the adaptability of MRL with the efficiency of MoE, offering a scalable and interpretable solution for resource-aware speech recognition.

NeurIPS Conference 2025 Conference Paper

Non-Adaptive Adversarial Face Generation

  • Sunpill Kim
  • Seunghun Paik
  • Chanwoo Hwang
  • Minsu Kim
  • Jae Hong Seo

Adversarial attacks on face recognition systems (FRSs) pose serious security and privacy threats, especially when these systems are used for identity verification. In this paper, we propose a novel method for generating adversarial faces—synthetic facial images that are visually distinct yet recognized as a target identity by the FRS. Unlike iterative optimization-based approaches (e.g., gradient descent or other iterative solvers), our method leverages the structural characteristics of the FRS feature space. We find that individuals sharing the same attribute (e.g., gender or race) form an attributed subsphere. By utilizing such subspheres, our method achieves both non-adaptiveness and a remarkably small number of queries. This eliminates the need to rely on transferability and open-source surrogate models, which have been a typical strategy when repeated adaptive queries to commercial FRSs are impossible. Despite requiring only a single non-adaptive query consisting of 100 face images, our method achieves a high success rate of over 93% against AWS's CompareFaces API at its default threshold. Furthermore, unlike many existing attacks that perturb a given image, our method can deliberately produce adversarial faces that impersonate the target identity while exhibiting high-level attributes chosen by the adversary.

NeurIPS Conference 2025 Conference Paper

On scalable and efficient training of diffusion samplers

  • Minkyu Kim
  • Kiyoung Seong
  • Dongyeop Woo
  • Sungsoo Ahn
  • Minsu Kim

We address the challenge of training diffusion models to sample from unnormalized energy distributions in the absence of data, the so-called diffusion samplers. Although these approaches have shown promise, they struggle to scale in more demanding scenarios where energy evaluations are expensive and the sampling space is high-dimensional. To address this limitation, we propose a scalable and sample-efficient framework that properly harmonizes powerful classical sampling methods and the diffusion sampler. Specifically, we utilize Markov chain Monte Carlo (MCMC) samplers as a Searcher to collect off-policy samples, using a novelty-based auxiliary energy function to compensate for modes the diffusion sampler rarely visits. These off-policy samples are then combined with on-policy data to train the diffusion sampler, thereby expanding its coverage of the energy landscape. Furthermore, we identify primacy bias, i.e., the preference of samplers for early experience during training, as the main cause of mode collapse, and introduce a periodic re-initialization trick to resolve this issue. Our method significantly improves sample efficiency on standard benchmarks for diffusion samplers and also excels at higher-dimensional problems and real-world molecular conformer generation.

NeurIPS Conference 2025 Conference Paper

Trajectory Balance with Asynchrony: Decoupling Exploration and Learning for Fast, Scalable LLM Post-Training

  • Brian Bartoldson
  • Siddarth Venkatraman
  • James Diffenderfer
  • Moksh Jain
  • Tal Ben-Nun
  • Seanie Lee
  • Minsu Kim
  • Johan Obando Ceron

Reinforcement learning (RL) is a critical component of large language model (LLM) post-training. However, on-policy algorithms used for post-training are not naturally robust to the diversified content of experience replay buffers, which asynchronous off-policy actors can efficiently populate in parallel to training. We propose efficiently learning on such off-policy data via Trajectory Balance with Asynchrony (TBA), an approach to asynchronous RL for LLMs that leverages the principled off-policy TB objective. On math, preference-tuning, and automated red-teaming tasks, we post-train models ranging from Pythia 410M to Qwen 2.5 7B, finding TBA offers speed and performance boosts over strong baselines like Online DPO and Dr. GRPO. Beyond TBA's performance benefits (high accuracy even as asynchrony grows) and speedups ($4\times$ or more), we show its reward- and recency-prioritizing sampling enables further gains as data generation is scaled. Our code is available at https://github.com/bbartoldson/TBA.

NeurIPS Conference 2024 Conference Paper

Amortizing intractable inference in diffusion models for vision, language, and control

  • Siddarth Venkatraman
  • Moksh Jain
  • Luca Scimeca
  • Minsu Kim
  • Marcin Sendera
  • Mohsin Hasan
  • Luke Rowe
  • Sarthak Mittal

Diffusion models have emerged as effective distribution estimators in vision, language, and reinforcement learning, but their use as priors in downstream tasks poses an intractable posterior inference problem. This paper studies *amortized* sampling of the posterior over data, $\mathbf{x}\sim p^{\rm post}(\mathbf{x})\propto p(\mathbf{x})r(\mathbf{x})$, in a model that consists of a diffusion generative model prior $p(\mathbf{x})$ and a black-box constraint or likelihood function $r(\mathbf{x})$. We state and prove the asymptotic correctness of a data-free learning objective, *relative trajectory balance*, for training a diffusion model that samples from this posterior, a problem that existing methods solve only approximately or in restricted cases. Relative trajectory balance arises from the generative flow network perspective on diffusion models, which allows the use of deep reinforcement learning techniques to improve mode coverage. Experiments illustrate the broad potential of unbiased inference of arbitrary posteriors under diffusion priors: in vision (classifier guidance), language (infilling under a discrete diffusion LLM), and multimodal data (text-to-image generation). Beyond generative modeling, we apply relative trajectory balance to the problem of continuous control with a score-based behavior prior, achieving state-of-the-art results on benchmarks in offline reinforcement learning. Code is available at [this link](https://github.com/GFNOrg/diffusion-finetuning).

ICRA Conference 2024 Conference Paper

BroadBEV: Collaborative LiDAR-camera Fusion for Broad-sighted Bird's Eye View Map Construction

  • Minsu Kim
  • Giseop Kim
  • Kyong Hwan Jin
  • Sunwook Choi

Recent sensor fusion in a Bird's Eye View (BEV) space has shown its utility in various tasks such as 3D detection, map segmentation, etc. However, the approach struggles with inaccurate camera BEV estimation and with perceiving distant areas due to the sparsity of LiDAR points. In this paper, we propose a BEV fusion method (BroadBEV) that aims to enhance camera BEV estimation for broad perception in the pre-defined BEV range, while simultaneously compensating for LiDAR's sparsity in the entire BEV space. Toward that end, we devise Point-scattering, which scatters the LiDAR BEV distribution to the camera depth distribution. The method boosts the learning of depth estimation in the camera branch and induces accurate placement of dense camera features in BEV space. For effective BEV fusion between the spatially synchronized features, we suggest ColFusion, which applies self-attention weights of LiDAR and camera BEV features to each other. Our extensive experiments demonstrate that the suggested methods enable broad BEV perception with remarkable performance gains.

AAAI Conference 2024 Conference Paper

Equity-Transformer: Solving NP-Hard Min-Max Routing Problems as Sequential Generation with Equity Context

  • Jiwoo Son
  • Minsu Kim
  • Sanghyeok Choi
  • Hyeonah Kim
  • Jinkyoo Park

Min-max routing problems aim to minimize the maximum tour length among multiple agents as they collaboratively visit all cities, i.e., the completion time. These problems include impactful real-world applications but are known to be NP-hard. Existing methods face challenges, particularly in large-scale problems that require the coordination of numerous agents to cover thousands of cities. This paper proposes Equity-Transformer to solve large-scale min-max routing problems. First, we model min-max routing problems as sequential planning, reducing the complexity and enabling the use of a powerful Transformer architecture. Second, we propose key inductive biases that ensure equitable workload distribution among agents. The effectiveness of Equity-Transformer is demonstrated through its superior performance in two representative min-max routing tasks: the min-max multi-agent traveling salesman problem (min-max mTSP) and the min-max multi-agent pick-up and delivery problem (min-max mPDP). Notably, our method achieves a runtime reduction of approximately 335 times and cost values about 53% lower than a competitive heuristic (LKH3) in the case of 100 vehicles with 1,000 cities for mTSP. We provide reproducible source code: https://github.com/kaist-silab/equity-transformer.

NeurIPS Conference 2024 Conference Paper

Genetic-guided GFlowNets for Sample Efficient Molecular Optimization

  • Hyeonah Kim
  • Minsu Kim
  • Sanghyeok Choi
  • Jinkyoo Park

The challenge of discovering new molecules with desired properties is crucial in domains like drug discovery and material design. Recent advances in deep learning-based generative methods have shown promise but face the issue of sample efficiency due to the computational expense of evaluating the reward function. This paper proposes a novel algorithm for sample-efficient molecular optimization by distilling a powerful genetic algorithm into a deep generative policy using GFlowNets training, an off-policy method for amortized inference. This approach enables the deep generative policy to learn from domain knowledge, which has been explicitly integrated into the genetic algorithm. Our method achieves state-of-the-art performance on the official molecular optimization benchmark, significantly outperforming previous methods. It also demonstrates effectiveness in designing inhibitors against SARS-CoV-2 with substantially fewer reward calls.

NeurIPS Conference 2024 Conference Paper

Improved off-policy training of diffusion samplers

  • Marcin Sendera
  • Minsu Kim
  • Sarthak Mittal
  • Pablo Lemos
  • Luca Scimeca
  • Jarrid Rector-Brooks
  • Alexandre Adam
  • Yoshua Bengio

We study the problem of training diffusion models to sample from a distribution with a given unnormalized density or energy function. We benchmark several diffusion-structured inference methods, including simulation-based variational approaches and off-policy methods (continuous generative flow networks). Our results shed light on the relative advantages of existing algorithms while bringing into question some claims from past work. We also propose a novel exploration strategy for off-policy methods, based on local search in the target space with the use of a replay buffer, and show that it improves the quality of samples on a variety of target distributions. Our code for the sampling methods and benchmarks studied is made public at this link as a base for future work on diffusion models for amortized inference.

ICLR Conference 2024 Conference Paper

Learning Energy Decompositions for Partial Inference in GFlowNets

  • Hyosoon Jang
  • Minsu Kim
  • Sungsoo Ahn

This paper studies generative flow networks (GFlowNets) to sample objects from the Boltzmann energy distribution via a sequence of actions. In particular, we focus on improving GFlowNet with partial inference: training flow functions with the evaluation of the intermediate states or transitions. To this end, the recently developed forward-looking GFlowNet reparameterizes the flow functions based on evaluating the energy of intermediate states. However, such an evaluation of intermediate energies may (i) be too expensive or impossible to evaluate and (ii) even provide misleading training signals under large energy fluctuations along the sequence of actions. To resolve this issue, we propose learning energy decompositions for GFlowNets (LED-GFN). Our main idea is to (i) decompose the energy of an object into learnable potential functions defined on state transitions and (ii) reparameterize the flow functions using the potential functions. In particular, to produce informative local credits, we propose to regularize the potential to change smoothly over the sequence of actions. It is also noteworthy that training GFlowNet with our learned potential can preserve the optimal policy. We empirically verify the superiority of LED-GFN in five problems including the generation of unstructured and maximum independent sets, molecular graphs, and RNA sequences.

NeurIPS Conference 2024 Conference Paper

Pessimistic Backward Policy for GFlowNets

  • Hyosoon Jang
  • Yunhui Jang
  • Minsu Kim
  • Jinkyoo Park
  • Sungsoo Ahn

This paper studies Generative Flow Networks (GFlowNets), which learn to sample objects proportionally to a given reward function through the trajectory of state transitions. In this work, we observe that GFlowNets tend to under-exploit high-reward objects due to training on an insufficient number of trajectories, which may lead to a large gap between the estimated flow and the (known) reward value. In response to this challenge, we propose a pessimistic backward policy for GFlowNets (PBP-GFN), which maximizes the observed flow to align closely with the true reward for the object. We extensively evaluate PBP-GFN across eight benchmarks, including a hyper-grid environment, bag generation, structured set generation, molecular generation, and four RNA sequence generation tasks. In particular, PBP-GFN enhances the discovery of high-reward objects, maintains the diversity of the objects, and consistently outperforms existing methods.

AAAI Conference 2024 Conference Paper

Quilt: Robust Data Segment Selection against Concept Drifts

  • Minsu Kim
  • Seong-Hyeon Hwang
  • Steven Euijong Whang

Continuous machine learning pipelines are common in industrial settings where models are periodically trained on data streams. Unfortunately, concept drifts may occur in data streams, where the joint distribution of the data X and label y, P(X, y), changes over time, possibly degrading model accuracy. Existing concept drift adaptation approaches mostly focus on updating the model to the new data, possibly using ensemble techniques of previous models, and tend to discard the drifted historical data. However, we contend that explicitly utilizing the drifted data leads to much better model accuracy and propose Quilt, a data-centric framework for identifying and selecting data segments that maximize model accuracy. To address the potential downside of efficiency, Quilt extends existing data subset selection techniques, which can be used to reduce the training data without compromising model accuracy. These techniques cannot be used as is because they only assume virtual drifts, where the posterior probabilities P(y|X) are assumed not to change. In contrast, a key challenge in our setup is to also discard undesirable data segments with concept drifts. Quilt thus discards drifted data segments and selects data segment subsets holistically for accurate and efficient model training. The two operations use gradient-based scores, which have little computation overhead. In our experiments, we show that Quilt outperforms state-of-the-art drift adaptation and data selection baselines on synthetic and real datasets.

NeurIPS Conference 2024 Conference Paper

SpaFL: Communication-Efficient Federated Learning With Sparse Models And Low Computational Overhead

  • Minsu Kim
  • Walid Saad
  • Merouane DEBBAH
  • Choong S. Hong

The large communication and computation overhead of federated learning (FL) is one of the main challenges facing its practical deployment over resource-constrained clients and systems. In this work, SpaFL, a communication-efficient FL framework, is proposed to optimize sparse model structures with low computational overhead. In SpaFL, a trainable threshold is defined for each filter/neuron to prune all of its connected parameters, thereby leading to structured sparsity. To optimize the pruning process itself, only thresholds are communicated between a server and clients instead of parameters, thereby learning how to prune. Further, global thresholds are used to update model parameters by extracting aggregated parameter importance. The generalization bound of SpaFL is also derived, thereby proving key insights on the relation between sparsity and performance. Experimental results show that SpaFL improves accuracy while requiring much less communication and computing resources compared to sparse baselines. The code is available at https://github.com/news-vt/SpaFL

NeurIPS Conference 2023 Conference Paper

Bootstrapped Training of Score-Conditioned Generator for Offline Design of Biological Sequences

  • Minsu Kim
  • Federico Berto
  • Sungsoo Ahn
  • Jinkyoo Park

We study the problem of optimizing biological sequences, e.g., proteins, DNA, and RNA, to maximize a black-box score function that is only evaluated in an offline dataset. We propose a novel solution, the bootstrapped training of score-conditioned generator (BootGen) algorithm. Our algorithm repeats a two-stage process. In the first stage, our algorithm trains the biological sequence generator with rank-based weights to enhance the accuracy of sequence generation based on high scores. The subsequent stage involves bootstrapping, which augments the training dataset with self-generated data labeled by a proxy score function. Our key idea is to align the score-based generation with a proxy score function, which distills the knowledge of the proxy score function to the generator. After training, we aggregate samples from multiple bootstrapped generators and proxies to produce a diverse design. Extensive experiments show that our method outperforms competitive baselines on biological sequence design tasks. We provide reproducible source code: https://github.com/kaist-silab/bootgen.

AAAI Conference 2023 Conference Paper

Deep Visual Forced Alignment: Learning to Align Transcription with Talking Face Video

  • Minsu Kim
  • Chae Won Kim
  • Yong Man Ro

Forced alignment refers to a technology that time-aligns a given transcription with the corresponding speech. However, as forced alignment technologies have been developed using speech audio, they may fail when the input speech audio is noise-corrupted or inaccessible. We focus on the fact that there is another component from which the speech can be inferred: the speech video (i.e., talking face video). Since the drawbacks of audio-based forced alignment can be complemented using visual information when the audio signal is in poor condition, we aim to develop a novel video-based forced alignment method. However, different from audio forced alignment, it is challenging to develop a reliable visual forced alignment technology for the following two reasons: 1) Visual Speech Recognition (VSR) has much lower performance compared to audio-based Automatic Speech Recognition (ASR), and 2) the translation from text to video is not reliable, so the method typically used for building audio forced alignment cannot be utilized in developing visual forced alignment. In order to alleviate these challenges, in this paper, we propose a new method that is appropriate for visual forced alignment, namely Deep Visual Forced Alignment (DVFA). The proposed DVFA can align the input transcription (i.e., sentence) with the talking face video without accessing the speech audio. Moreover, by augmenting the alignment task with anomaly case detection, DVFA can detect mismatches between the input transcription and the input video while performing the alignment. Therefore, we can robustly align the text with the talking face video even if there exist erroneous words in the text. Through extensive experiments, we show the effectiveness of the proposed DVFA not only in the alignment task but also in interpreting the outputs of VSR models.

IJCAI Conference 2022 Conference Paper

CERT: Continual Pre-training on Sketches for Library-oriented Code Generation

  • Daoguang Zan
  • Bei Chen
  • Dejian Yang
  • Zeqi Lin
  • Minsu Kim
  • Bei Guan
  • Yongji Wang
  • Weizhu Chen

Code generation is a longstanding challenge, aiming to generate a code snippet based on a natural language description. Usually, expensive text-code paired data is essential for training a code generation model. Recently, thanks to the success of pre-training techniques, large language models are trained on large unlabelled code corpora and perform well in generating code. In this paper, we investigate how to leverage an unlabelled code corpus to train a model for library-oriented code generation. Since it is common practice for programmers to reuse third-party libraries, and text-code paired data for them are harder to obtain due to the huge number of libraries, this setting is both important and underexplored. We observe that library-oriented code snippets are more likely to share similar code sketches. Hence, we present CERT with two steps: a sketcher generates the sketch, then a generator fills in the details of the sketch. Both the sketcher and generator are continually pre-trained upon a base model using unlabelled data. Also, we carefully craft two benchmarks to evaluate library-oriented code generation, named PandasEval and NumpyEval. Experimental results have shown the impressive performance of CERT. For example, it surpasses the base model by an absolute 15.67% improvement in terms of pass@1 on PandasEval. Our work is available at https://github.com/microsoft/PyCodeGPT.

AAAI Conference 2022 Conference Paper

Distinguishing Homophenes Using Multi-Head Visual-Audio Memory for Lip Reading

  • Minsu Kim
  • Jeong Hun Yeo
  • Yong Man Ro

Recognizing speech from silent lip movement, which is called lip reading, is a challenging task due to 1) the inherent information insufficiency of lip movement to fully represent the speech, and 2) the existence of homophenes that have similar lip movement with different pronunciations. In this paper, we try to alleviate the aforementioned two challenges in lip reading by proposing a Multi-head Visual-audio Memory (MVM). Firstly, MVM is trained with audio-visual datasets and remembers audio representations by modelling the interrelationships of paired audio-visual representations. At the inference stage, visual input alone can extract the saved audio representation from the memory by examining the learned interrelationships. Therefore, the lip reading model can complement the insufficient visual information with the extracted audio representations. Secondly, MVM is composed of multi-head key memories for saving visual features and one value memory for saving audio knowledge, which is designed to distinguish the homophenes. With the multi-head key memories, MVM extracts possible candidate audio features from the memory, which allows the lip reading model to consider the possibility of which pronunciations can be represented by the input lip movement. This can also be viewed as an explicit implementation of the one-to-many mapping of viseme-to-phoneme. Moreover, MVM is employed at multiple temporal levels to consider the context when retrieving the memory and distinguishing the homophenes. Extensive experimental results verify the effectiveness of the proposed method in lip reading and in distinguishing the homophenes.

NeurIPS Conference 2022 Conference Paper

Sym-NCO: Leveraging Symmetricity for Neural Combinatorial Optimization

  • Minsu Kim
  • Junyoung Park
  • Jinkyoo Park

Deep reinforcement learning (DRL)-based combinatorial optimization (CO) methods (i.e., DRL-NCO) have shown significant merit over conventional CO solvers, as DRL-NCO is capable of learning CO solvers while relying less on problem-specific expert domain knowledge (heuristic methods) and supervised labeled data (supervised learning methods). This paper presents Sym-NCO, a novel regularizer-based training scheme that leverages universal symmetricities in various CO problems and solutions. Leveraging symmetricities such as rotational and reflectional invariance can greatly improve the generalization capability of DRL-NCO because it allows the learned solver to exploit the symmetricities commonly shared within the same CO problem class. Our experimental results verify that Sym-NCO greatly improves the performance of DRL-NCO methods on four CO tasks, including the traveling salesman problem (TSP), capacitated vehicle routing problem (CVRP), prize collecting TSP (PCTSP), and orienteering problem (OP), without utilizing problem-specific expert domain knowledge. Remarkably, Sym-NCO outperformed not only the existing DRL-NCO methods but also a competitive conventional solver, iterative local search (ILS), on PCTSP at 240× faster speed. Our source code is available at https://github.com/alstn12088/Sym-NCO.
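A minimal illustration of the rotational invariance the regularizer exploits (a toy check, not the paper's training code): the cost of a TSP tour is unchanged when the instance's coordinates are rotated, so a solver can be regularized to behave consistently across such transformed copies of the same problem.

```python
import math
import random

def tour_length(coords, tour):
    # total Euclidean length of a closed tour
    n = len(tour)
    return sum(math.dist(coords[tour[i]], coords[tour[(i + 1) % n]])
               for i in range(n))

def rotate(coords, theta):
    # rotate every city around the origin by angle theta
    c, s = math.cos(theta), math.sin(theta)
    return [(c * x - s * y, s * x + c * y) for x, y in coords]

random.seed(0)
coords = [(random.random(), random.random()) for _ in range(10)]
tour = list(range(10))
base = tour_length(coords, tour)
rotated = tour_length(rotate(coords, 1.23), tour)
assert abs(base - rotated) < 1e-9  # tour cost is rotation-invariant
```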

AAAI Conference 2022 Conference Paper

SyncTalkFace: Talking Face Generation with Precise Lip-Syncing via Audio-Lip Memory

  • Se Jin Park
  • Minsu Kim
  • Joanna Hong
  • Jeongsoo Choi
  • Yong Man Ro

The challenge of talking face generation from speech lies in aligning two different modalities of information, audio and video, such that the mouth region corresponds to the input audio. Previous methods either exploit audio-visual representation learning or leverage intermediate structural information such as landmarks and 3D models. However, they struggle to synthesize fine details of the lips varying at the phoneme level, as they do not sufficiently provide visual information about the lips at the video synthesis step. To overcome this limitation, our work proposes an Audio-Lip Memory that brings in visual information of the mouth region corresponding to the input audio and enforces fine-grained audio-visual coherence. It stores lip motion features from sequential ground-truth images in the value memory and aligns them with corresponding audio features so that they can be retrieved using audio input at inference time. Therefore, using the retrieved lip motion features as visual hints, it can easily correlate audio with visual dynamics in the synthesis step. By analyzing the memory, we demonstrate that unique lip features are stored in each memory slot at the phoneme level, capturing subtle lip motion based on memory addressing. In addition, we introduce a visual-visual synchronization loss that enhances lip-syncing performance when used along with the audio-visual synchronization loss in our model. Extensive experiments verify that our method generates high-quality video with mouth shapes that best align with the input audio, outperforming previous state-of-the-art methods.

AAAI Conference 2021 Conference Paper

Cross-Domain Grouping and Alignment for Domain Adaptive Semantic Segmentation

  • Minsu Kim
  • Sunghun Joung
  • Seungryong Kim
  • JungIn Park
  • Ig-Jae Kim
  • Kwanghoon Sohn

Existing techniques to adapt semantic segmentation networks across source and target domains within deep convolutional neural networks (CNNs) deal with all the samples from the two domains in a global or category-aware manner. They do not consider inter-class variation within the target domain itself or the estimated category, which limits their ability to encode domains having a multi-modal data distribution. To overcome this limitation, we introduce a learnable clustering module and a novel domain adaptation framework called cross-domain grouping and alignment. To cluster the samples across domains so as to maximize domain alignment without forgetting precise segmentation ability on the source domain, we present two loss functions that encourage semantic consistency and orthogonality among the clusters. We also present a loss to address the class imbalance problem, another limitation of previous methods. Our experiments show that our method consistently boosts adaptation performance in semantic segmentation, outperforming the state of the art in various domain adaptation settings.

NeurIPS Conference 2021 Conference Paper

Learning Collaborative Policies to Solve NP-hard Routing Problems

  • Minsu Kim
  • Jinkyoo Park
  • joungho kim

Recently, deep reinforcement learning (DRL) frameworks have shown potential for solving NP-hard routing problems such as the traveling salesman problem (TSP) without problem-specific expert knowledge. Although DRL can be used to solve complex problems, DRL frameworks still struggle to compete with state-of-the-art heuristics, showing a substantial performance gap. This paper proposes a novel hierarchical problem-solving strategy, termed learning collaborative policies (LCP), which can effectively find near-optimum solutions using two iterative DRL policies: the seeder and the reviser. The seeder generates candidate solutions (seeds) that are as diverse as possible while exploring the full combinatorial action space (i.e., the sequence of assignment actions). To this end, we train the seeder's policy using a simple yet effective entropy regularization reward to encourage the seeder to find diverse solutions. On the other hand, the reviser modifies each candidate solution generated by the seeder; it partitions the full trajectory into sub-tours and simultaneously revises each sub-tour to minimize its traveling distance. Thus, the reviser is trained to improve the candidate solution's quality, focusing on the reduced solution space (which is beneficial for exploitation). Extensive experiments demonstrate that the proposed two-policy collaboration scheme improves over single-policy DRL frameworks on various NP-hard routing problems, including TSP, prize collecting TSP (PCTSP), and capacitated vehicle routing problem (CVRP).
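The seeder/reviser division of labor can be caricatured with random seed tours and a 2-opt-style pass over fixed-length sub-tours; the learned policies, entropy reward, and partitioning scheme of LCP are all simplified away in this sketch.

```python
import math
import random

def tour_length(coords, tour):
    n = len(tour)
    return sum(math.dist(coords[tour[i]], coords[tour[(i + 1) % n]])
               for i in range(n))

def seeder(n, num_seeds, rng):
    # stand-in for the learned seeder: diverse random candidate tours
    return [rng.sample(range(n), n) for _ in range(num_seeds)]

def reviser(coords, tour, seg=5):
    # revise each sub-tour independently with one greedy 2-opt pass
    tour = tour[:]
    for start in range(0, len(tour), seg):
        end = min(start + seg, len(tour))
        for i in range(start, end - 2):
            for j in range(i + 2, end):
                # accept a segment reversal only if it shortens the tour
                if (math.dist(coords[tour[i]], coords[tour[j - 1]]) +
                        math.dist(coords[tour[i + 1]], coords[tour[j]]) <
                        math.dist(coords[tour[i]], coords[tour[i + 1]]) +
                        math.dist(coords[tour[j - 1]], coords[tour[j]])):
                    tour[i + 1:j] = reversed(tour[i + 1:j])
    return tour

random.seed(1)
coords = [(random.random(), random.random()) for _ in range(20)]
seeds = seeder(20, 8, random.Random(2))
best_seed = min(seeds, key=lambda t: tour_length(coords, t))
revised = reviser(coords, best_seed)
# the reviser never worsens the best seed it is given
assert tour_length(coords, revised) <= tour_length(coords, best_seed) + 1e-9
```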

NeurIPS Conference 2021 Conference Paper

Lip to Speech Synthesis with Visual Context Attentional GAN

  • Minsu Kim
  • Joanna Hong
  • Yong Man Ro

In this paper, we propose a novel lip-to-speech generative adversarial network, Visual Context Attentional GAN (VCA-GAN), which can jointly model local and global lip movements during speech synthesis. Specifically, the proposed VCA-GAN synthesizes speech from local lip visual features by finding a viseme-to-phoneme mapping function, while global visual context is embedded into the intermediate layers of the generator to clarify the ambiguity in the mapping induced by homophenes. To achieve this, a visual context attention module is proposed that encodes global representations from the local visual features and, through audio-visual attention, provides the generator with the global visual context corresponding to the given coarse speech representation. In addition to the explicit modelling of local and global visual representations, synchronization learning is introduced as a form of contrastive learning that guides the generator to synthesize speech in sync with the given input lip movements. Extensive experiments demonstrate that the proposed VCA-GAN outperforms the existing state of the art and can effectively synthesize speech from multiple speakers, which has barely been handled in previous works.

ICRA Conference 2020 Conference Paper

Fast Adaptation of Deep Reinforcement Learning-Based Navigation Skills to Human Preference

  • Jinyoung Choi
  • Christopher R. Dance
  • Jung-Eun Kim
  • Kyungsik Park
  • Jaehun Han
  • Joonho Seo
  • Minsu Kim

Deep reinforcement learning (RL) is being actively studied for robot navigation due to its promise of superior performance and robustness. However, most existing deep RL navigation agents are trained using fixed parameters, such as maximum velocities and weightings of reward components. Since the optimal choice of parameters depends on the use-case, it can be difficult to deploy such existing methods in a variety of real-world service scenarios. In this paper, we propose a novel deep RL navigation method that can adapt its policy to a wide range of parameters and reward functions without expensive retraining. Additionally, we explore a Bayesian deep learning method to optimize these parameters that requires only a small amount of preference data. We empirically show that our method can learn diverse navigation skills and quickly adapt its policy to a given performance metric or to human preference. We also demonstrate our method in real-world scenarios.

ICRA Conference 2019 Conference Paper

Deep Reinforcement Learning of Navigation in a Complex and Crowded Environment with a Limited Field of View

  • Jinyoung Choi
  • Kyungsik Park
  • Minsu Kim
  • Sangok Seok

Mobile robots are required to navigate freely in complex and crowded environments in order to provide services to humans. For this navigation ability, deep reinforcement learning (DRL)-based methods are gaining increasing attention. However, existing DRL methods require a wide field of view (FOV), which imposes the use of high-cost lidar devices. In this paper, we explore the possibility of replacing expensive lidar devices with affordable depth cameras, which have a limited FOV. First, we analyze the effect of a limited field of view on DRL agents. Second, we propose an LSTM agent with a Local-Map Critic (LSTM-LMC), a novel DRL method for learning efficient navigation in a complex environment with a limited FOV. Lastly, we introduce a dynamics randomization technique to improve the robustness of DRL agents in the real world. We found that our method with a limited FOV can outperform methods having a wide FOV but limited memory. We provide empirical evidence that our method learns to implicitly model the surrounding environment and the dynamics of other agents. We also show that a robot with a single depth camera can navigate through a complex real-world environment using our method.

ICRA Conference 2005 Conference Paper

System Design and Dynamic Walking of Humanoid Robot KHR-2

  • Jung-Yup Kim
  • Ill-Woo Park
  • Jungho Lee
  • Minsu Kim
  • Baek-Kyu Cho
  • Jun-Ho Oh

In this paper, we describe the mechanical design, system integration and dynamic walking of the humanoid KHR-2 (KAIST Humanoid Robot 2). KHR-2 has 41 DOFs in total, which allow it to imitate various human-like motions. To control all joint axes effectively, a distributed control architecture is used, which reduces the computation burden on the main controller and allows for a convenient system configuration. Servo motor controllers were used as the sub-controllers, while a 3-axis force/torque sensor and an inertial sensor were used in the sensory system. The main controller, attached to the back of KHR-2, communicates with the sub-controllers in real time through the CAN (Controller Area Network) protocol. Windows XP was used as the operating system, and the commercial RTX HAL extension software was used to realize real-time control capability in the Windows environment. We define the walking pattern and describe several online controllers for each stage. Some experimental results for KHR-2 are also presented.