Arrow Research

Author name cluster

Jaeyoung Do

Papers that may be associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity-disambiguation profile.

5 papers
2 author rows

Possible papers (5)

NeurIPS 2025 Conference Paper

Don’t Let It Fade: Preserving Edits in Diffusion Language Models via Token Timestep Allocation

  • Woojin Kim
  • Jaeyoung Do

Classifier guidance is a widely adopted technique in diffusion language models, used to steer generation toward desired attributes. However, such guidance often introduces instability during the generation process, where token-level updates fluctuate across timesteps. We identify and formally characterize this phenomenon as update-forgetting. This instability disrupts the refinement process by overwriting semantic edits, ultimately degrading fluency and coherence, which is particularly problematic in tasks like controllable text generation. To address this, we propose TTA-Diffusion, a novel inference-time approach that dynamically allocates timesteps per token based on refinement needs. Unlike conventional diffusion models that apply uniform updates, TTA-Diffusion employs structured timestep allocation, preserving stable tokens while allowing uncertain tokens to undergo progressive adjustment. Experimental results across diverse tasks demonstrate that TTA-Diffusion significantly outperforms both diffusion-based and auto-regressive baselines in fluency and control accuracy while improving computational efficiency by reducing the number of required timesteps. On the sentiment control task, TTA-Diffusion achieves over 20% higher accuracy and nearly half the perplexity of prior diffusion models, using less than one-fifth of the denoising steps. This work highlights the importance of mitigating fluctuations in token updates and promoting a balanced refinement process, thereby enhancing stability and controllability in controllable language modeling.
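The abstract gives no implementation details, but the core mechanism it describes (spending more denoising steps on uncertain tokens while freezing stable ones early) can be sketched in a few lines. This is a minimal illustration, not the authors' code; the allocation rule, the uncertainty scores, and the `denoise_step` callback are all assumptions:

```python
import numpy as np

def allocate_timesteps(uncertainty, total_steps, min_steps=1):
    """Split a denoising budget across tokens in proportion to
    per-token uncertainty (hypothetical allocation rule)."""
    weights = uncertainty / uncertainty.sum()
    return np.maximum(min_steps, np.round(weights * total_steps)).astype(int)

def refine(tokens, uncertainty, total_steps, denoise_step):
    """Progressive per-token refinement: stable tokens freeze early,
    uncertain tokens keep receiving denoising updates."""
    steps = allocate_timesteps(uncertainty, total_steps)
    for t in range(int(steps.max())):
        active = steps > t                    # tokens still being refined
        tokens[active] = denoise_step(tokens, active, t)
    return tokens

# Toy usage with a dummy denoiser that just shrinks the active tokens.
rng = np.random.default_rng(0)
toks = rng.normal(size=8)
unc = rng.uniform(size=8)
refined = refine(toks, unc, total_steps=32,
                 denoise_step=lambda x, mask, t: x[mask] * 0.9)
```

The point of the sketch is the uneven budget: tokens the model is already confident about exit the loop early, which is where the reported reduction in denoising steps would come from.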

NeurIPS 2025 Conference Paper

Exploring and Leveraging Class Vectors for Classifier Editing

  • Jaeik Kim
  • Jaeyoung Do

Image classifiers play a critical role in detecting diseases in medical imaging and identifying anomalies in manufacturing processes. However, their predefined behaviors after extensive training make post hoc model editing difficult, especially when it comes to forgetting specific classes or adapting to distribution shifts. Existing classifier editing methods either focus narrowly on correcting errors or incur extensive retraining costs, creating a bottleneck for flexible editing. Moreover, such editing has seen limited investigation in image classification. To overcome these challenges, we introduce class vectors, which capture class-specific representation adjustments during fine-tuning. Whereas task vectors encode task-level changes in weight space, class vectors disentangle each class’s adaptation in the latent space. We show that class vectors capture each class’s semantic shift and that classifier editing can be achieved either by steering latent features along these vectors or by mapping them into weight space to update the decision boundaries. We also demonstrate that the inherent linearity and orthogonality of class vectors support efficient, flexible, and high-level concept editing via simple class arithmetic. Finally, we validate their utility in applications such as unlearning, environmental adaptation, adversarial defense, and adversarial trigger optimization.
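For concreteness, one plausible reading of the construction: a class vector is the per-class mean shift of latent features between the original and fine-tuned encoders, and editing moves features along (or against) that vector. A minimal numpy sketch under that assumption; the definition, the `alpha` scaling, and the toy data are illustrative, not the paper's:

```python
import numpy as np

def class_vectors(feats_pre, feats_post, labels, num_classes):
    """Per-class mean shift of latent features between the pre-trained
    and fine-tuned encoders (one plausible reading of a 'class vector')."""
    return np.stack([
        feats_post[labels == c].mean(axis=0) - feats_pre[labels == c].mean(axis=0)
        for c in range(num_classes)
    ])

def steer(features, class_vec, alpha=1.0):
    """Edit latent features by moving along a class vector;
    a negative alpha undoes (unlearns) that class's adaptation."""
    return features + alpha * class_vec

# Toy usage: 100 samples, 16-dim latents, 4 classes.
rng = np.random.default_rng(0)
labels = rng.integers(0, 4, size=100)
pre = rng.normal(size=(100, 16))
post = pre + labels[:, None] * 0.1           # fake class-dependent shift
vecs = class_vectors(pre, post, labels, num_classes=4)
unlearned = steer(post[labels == 3], vecs[3], alpha=-1.0)
```

The claimed linearity and orthogonality would make the last line meaningful: subtracting class 3's vector should undo its adaptation without disturbing the other classes.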

AAAI 2025 Conference Paper

MathSpeech: Leveraging Small LMs for Accurate Conversion in Mathematical Speech-to-Formula

  • Sieun Hyeon
  • Kyudan Jung
  • Jaehee Won
  • Nam-Joon Kim
  • Hyun Gon Ryu
  • Hyuk-Jae Lee
  • Jaeyoung Do

In various academic and professional settings, such as mathematics lectures or research presentations, it is often necessary to convey mathematical expressions orally. However, reading mathematical expressions aloud without accompanying visuals can significantly hinder comprehension, especially for those who are hearing-impaired or rely on subtitles due to language barriers. For instance, when a presenter reads Euler's Formula, current Automatic Speech Recognition (ASR) models often produce a verbose and error-prone textual description (e.g., e to the power of i x equals cosine of x plus i 'side' of x), instead of the concise LaTeX format, which hampers clear understanding and communication. To address this issue, we introduce MathSpeech, a novel pipeline that integrates ASR models with small Language Models (sLMs) to correct errors in mathematical expressions and accurately convert spoken expressions into structured LaTeX representations. Evaluated on a new dataset derived from lecture recordings, MathSpeech demonstrates LaTeX generation capabilities comparable to leading commercial Large Language Models (LLMs), while leveraging fine-tuned small language models of only 120M parameters. Specifically, on CER, BLEU, and ROUGE scores for LaTeX translation, MathSpeech significantly outperforms GPT-4o: CER decreases from 0.390 to 0.298, alongside higher ROUGE and BLEU scores.
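The two-stage pipeline (off-the-shelf ASR, followed by a small LM that rewrites the noisy transcript as LaTeX) can be sketched with Hugging Face pipelines. The model names and the prompt below are placeholders, not the paper's fine-tuned 120M-parameter checkpoints:

```python
from transformers import pipeline

# Stage 1: off-the-shelf ASR transcribes the spoken math (placeholder model).
asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")

# Stage 2: a small seq2seq LM rewrites the noisy transcript as LaTeX.
# "t5-small" is a stand-in; MathSpeech fine-tunes sLMs of ~120M parameters.
to_latex = pipeline("text2text-generation", model="t5-small")

def speech_to_latex(audio_path: str) -> str:
    transcript = asr(audio_path)["text"]
    # The prompt format here is illustrative, not the paper's.
    out = to_latex(f"convert spoken math to LaTeX: {transcript}")
    return out[0]["generated_text"]
```

The design intuition from the abstract is that the sLM does double duty: it corrects ASR errors (like 'side' for 'sine') and performs the transcript-to-LaTeX translation in the same rewriting step.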

NeurIPS 2025 Conference Paper

MMPB: It’s Time for Multi-Modal Personalization

  • Jaeik Kim
  • Woojin Kim
  • Woohyeon Park
  • Jaeyoung Do

Visual personalization is essential in user-facing AI systems such as smart homes and healthcare, where aligning model behavior with user-centric concepts is critical. However, recent large Vision-Language Models (VLMs), despite their broad applicability, remain underexplored in their ability to adapt to individual users. In this paper, we introduce MMPB, the first extensive benchmark for evaluating VLMs on personalization. MMPB comprises 10k image-query pairs and includes 111 personalizable concepts across four categories: humans, animals, objects, and characters, with the human category enriched with preference-grounded queries. We structure personalization into three main task types, each highlighting a different key property of VLMs. Using 23 widely used VLMs including both open- and closed-source models, we evaluate personalization performance via a three-stage protocol: concept injection, multi-turn dialogue, and personalized querying. Our findings indicate that most VLMs (including some closed-source models) struggle with personalization, particularly in maintaining consistency over dialogue, handling user preferences, and adapting to visual cues. Our analysis reveals that the challenges in VLM personalization (such as refusal behaviors and long-context forgetting) highlight substantial room for improvement. By identifying these limitations and providing a scalable benchmark, MMPB offers valuable insights and a solid foundation for future research toward truly personalized multi-modal AI.
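The three-stage protocol can be sketched as a simple evaluation harness. The message schema, the acknowledgment turn, and the substring-match scoring below are assumptions for illustration, not MMPB's actual implementation:

```python
from typing import Callable

# A chat-style VLM wrapper: takes a message history, returns the reply text.
Model = Callable[[list], str]

def evaluate_concept(model: Model, concept: dict, probes: list) -> float:
    # Stage 1: concept injection -- introduce the personal concept.
    history = [
        {"role": "user", "images": concept["images"],
         "text": f"This is {concept['name']}. {concept['description']}"},
        {"role": "assistant", "text": "Got it."},
    ]
    correct = 0
    for probe in probes:
        # Stage 2: multi-turn dialogue -- the concept must persist in context.
        history.append({"role": "user", "images": probe.get("images", []),
                        "text": probe["question"]})
        answer = model(history)
        history.append({"role": "assistant", "text": answer})
        # Stage 3: personalized querying -- check for the expected answer.
        correct += probe["expected"].lower() in answer.lower()
    return correct / len(probes)
```

Keeping the whole dialogue in `history` is what makes the benchmark probe long-context forgetting: the injected concept is never restated, so the model must retain it across turns.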

ICML 2025 Conference Paper

SECOND: Mitigating Perceptual Hallucination in Vision-Language Models via Selective and Contrastive Decoding

  • Woohyeon Park
  • Woojin Kim
  • Jaeik Kim
  • Jaeyoung Do

Despite significant advancements in Vision-Language Models (VLMs), the performance of existing VLMs remains hindered by object hallucination, a critical challenge to achieving accurate visual understanding. To address this issue, we propose SECOND: Selective and Contrastive Decoding, a novel approach that enables VLMs to effectively leverage multi-scale visual information in an object-centric manner, closely aligning with human visual perception. SECOND progressively selects and integrates multi-scale visual information, facilitating a more precise interpretation of images. By iteratively contrasting this visual information, SECOND significantly reduces perceptual hallucinations and outperforms prior approaches across a wide range of benchmarks. Our theoretical analysis and experiments highlight the largely unexplored potential of multi-scale processing in VLMs, showing that prioritizing and contrasting across scales outperforms existing methods.
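The abstract does not spell out the decoding rule, but a generic multi-scale contrastive decoding step of the kind it describes looks like the sketch below. The two-scale setup, the linear contrast, and the `alpha` weight are assumptions, not necessarily SECOND's exact formulation:

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max(-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(-1, keepdims=True)

def contrastive_decode_step(logits_fine, logits_coarse, alpha=0.5):
    """One decoding step that amplifies what the fine-scale, object-centric
    view supports relative to the coarse full-image view. This linear
    contrast is a common contrastive-decoding form."""
    contrasted = (1 + alpha) * logits_fine - alpha * logits_coarse
    return int(np.argmax(softmax(contrasted)))

# Toy usage: next-token logits from two visual scales of the same image.
rng = np.random.default_rng(0)
fine, coarse = rng.normal(size=32), rng.normal(size=32)
next_token = contrastive_decode_step(fine, coarse)
```

The intuition is that tokens a model would emit regardless of visual evidence score similarly at both scales and get suppressed, while tokens grounded in the object-centric view are boosted.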