Arrow Research

Author name cluster

Minhyeok Lee

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

5 papers
1 author row

Possible papers (5)

AAAI 2026 · Conference Paper

MonoCLUE: Object-Aware Clustering Enhances Monocular 3D Object Detection

  • Sunghun Yang
  • Minhyeok Lee
  • JungHo Lee
  • Sangyoun Lee

Monocular 3D object detection offers a cost-effective solution for autonomous driving, but it suffers from ill-posed depth estimation and a limited field of view. These constraints lead to a lack of geometric cues and reduced accuracy in occluded or truncated scenes. While recent approaches incorporate additional depth information to address geometric ambiguity, they overlook the visual cues essential for robust object recognition. In this paper, we propose MonoCLUE, which enhances monocular 3D detection by leveraging both local clustering and a generalized scene memory of visual features. First, we perform K-means clustering on visual features to capture distinct object-level visual parts (e.g., bonnet, car roof), which improves the detection of partially visible objects. The clustered features are then propagated across the entire region to capture objects with similar appearances. Second, we construct a generalized scene memory by aggregating clustered features across images, providing consistent appearance representations that generalize across scenes. This improves the consistency of object-level features, enabling stable detection across varying environments. Lastly, we integrate both local cluster features and the generalized scene memory into object queries, guiding attention toward informative regions in the feature map. Exploiting a unified local-clustering and generalized-scene-memory strategy, MonoCLUE enables robust monocular 3D detection under occlusion and limited visibility. Our proposed model achieves state-of-the-art performance on the KITTI benchmark.
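
Where the abstract describes K-means clustering over visual features and propagating cluster features across the map, a minimal PyTorch sketch of that idea follows. Everything here (the `cluster_visual_features` helper, `k`, the tensor shapes) is an illustrative assumption, not the paper's actual implementation.

```python
import torch

def cluster_visual_features(feat, k=8, iters=10):
    """Toy K-means over a (C, H, W) feature map.

    Returns per-pixel cluster assignments and the k centroids; a sketch of
    the local-clustering step, with k and iters chosen arbitrarily.
    """
    c, h, w = feat.shape
    x = feat.permute(1, 2, 0).reshape(-1, c)       # (H*W, C) pixel features
    centroids = x[torch.randperm(x.shape[0])[:k]]  # init from random pixels
    for _ in range(iters):
        assign = torch.cdist(x, centroids).argmin(dim=1)  # nearest centroid
        for j in range(k):
            mask = assign == j
            if mask.any():
                centroids[j] = x[mask].mean(dim=0)        # recompute centroid
    return assign.reshape(h, w), centroids

# Propagating cluster features across the whole map: replace each pixel with
# its centroid, yielding appearance-consistent part-level regions.
feat = torch.randn(256, 24, 80)                 # stand-in backbone feature map
assign, centroids = cluster_visual_features(feat)
clustered = centroids[assign].permute(2, 0, 1)  # (C, H, W) cluster-feature map
```

A generalized scene memory in this spirit could be as simple as averaging the centroids across a batch of images, which is roughly the aggregation the abstract alludes to.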

NeurIPS 2025 · Conference Paper

Empower Words: DualGround for Structured Phrase and Sentence-Level Temporal Grounding

  • Minseok Kang
  • Minhyeok Lee
  • Minjung Kim
  • Donghyeong Kim
  • Sangyoun Lee

Video Temporal Grounding (VTG) aims to localize temporal segments in long, untrimmed videos that align with a given natural language query. This task typically comprises two subtasks: Moment Retrieval (MR) and Highlight Detection (HD). While recent advances have been driven by powerful pretrained vision-language models such as CLIP and InternVideo2, existing approaches commonly treat all text tokens uniformly during cross-modal attention, disregarding their distinct semantic roles. To validate the limitations of this approach, we conduct controlled experiments demonstrating that VTG models rely heavily on [EOS]-driven global semantics while failing to effectively utilize word-level signals, which limits their ability to achieve fine-grained temporal alignment. Motivated by this limitation, we propose DualGround, a dual-branch architecture that explicitly separates global and local semantics by routing the [EOS] token through a sentence-level path and clustering word tokens into phrase-level units for localized grounding. Our method introduces (1) token-role-aware cross-modal interaction strategies that align video features with sentence-level and phrase-level semantics in a structurally disentangled manner, and (2) a joint modeling framework that not only improves global sentence-level alignment but also enhances fine-grained temporal grounding by leveraging structured phrase-aware context. This design allows the model to capture both coarse and localized semantics, enabling more expressive and context-aware video grounding. DualGround achieves state-of-the-art performance on both Moment Retrieval and Highlight Detection across the QVHighlights and Charades-STA benchmarks, demonstrating the effectiveness of disentangled semantic modeling in video-language alignment.
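
A toy PyTorch sketch of the dual-branch routing the abstract describes: the [EOS] token drives a sentence-level attention path while word tokens are grouped into phrase-level units. The chunk-based phrase grouping, the `dual_branch_attention` helper, and all shapes are assumptions for illustration; the paper's clustering and attention design may well differ.

```python
import torch
import torch.nn.functional as F

def dual_branch_attention(video, tokens, eos_idx, n_phrases=4):
    """Toy dual-branch text-to-video interaction.

    Routes the [EOS] token through a sentence-level attention path and
    mean-pools contiguous word-token chunks into phrase-level units; the
    chunking rule and n_phrases are illustrative stand-ins.
    """
    eos = tokens[eos_idx:eos_idx + 1]                            # (1, C) global semantics
    words = torch.cat([tokens[:eos_idx], tokens[eos_idx + 1:]])  # word tokens only
    phrases = torch.stack([c.mean(dim=0) for c in words.chunk(n_phrases)])

    def attend(q, kv):  # plain scaled dot-product attention
        attn = F.softmax(q @ kv.t() / kv.shape[-1] ** 0.5, dim=-1)
        return attn @ kv

    sent_ctx = attend(video, eos)         # sentence-level branch
    phrase_ctx = attend(video, phrases)   # phrase-level branch
    return video + sent_ctx + phrase_ctx  # fused but structurally separate context

video = torch.randn(75, 512)   # 75 clip features
tokens = torch.randn(12, 512)  # 12 text tokens; index 11 plays the [EOS] role
out = dual_branch_attention(video, tokens, eos_idx=11)
```

Keeping the two branches as separate attention calls, rather than concatenating [EOS] with the word tokens, is what makes the global and local contributions structurally disentangled in this sketch.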

AAAI 2025 · Conference Paper

Video Diffusion Models Are Strong Video Inpainter

  • Minhyeok Lee
  • Suhwan Cho
  • Chajin Shin
  • JungHo Lee
  • Sunghun Yang
  • Sangyoun Lee

Propagation-based video inpainting using optical flow at the pixel or feature level has recently garnered significant attention. However, it has limitations such as inaccurate optical flow prediction and the propagation of noise over time. These issues result in non-uniform noise and temporal consistency problems throughout the video, which are particularly pronounced when the removed area is large and involves substantial movement. To address these issues, we propose a novel First Frame Filling Video Diffusion Inpainting model (FFF-VDI). We design FFF-VDI inspired by the capabilities of pre-trained image-to-video diffusion models that can transform a first-frame image into a highly natural video. To apply this to the video inpainting task, we propagate the noise latent information of future frames to fill the masked areas of the first frame's noise latent code. Next, we fine-tune the pre-trained image-to-video diffusion model to generate the inpainted video. The proposed model addresses the limitations of existing methods that rely on optical flow quality, producing much more natural and temporally consistent videos. The proposed approach is the first to effectively integrate image-to-video diffusion models into video inpainting tasks. Through various comparative experiments, we demonstrate that the proposed model robustly handles diverse inpainting types with high quality.
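
As described, the core first-frame-filling step can be sketched as copying visible future-frame latents into the masked region of the first frame's latent before handing off to an image-to-video diffusion model. The (T, C, H, W) layout and the earliest-visible-frame rule in `fill_first_frame_latent` below are illustrative assumptions, not the paper's exact propagation scheme.

```python
import torch

def fill_first_frame_latent(latents, masks):
    """Toy first-frame filling over noise latents.

    For each masked position in frame 0, copy the latent from the earliest
    future frame where that position is visible. The (T, C, H, W) layout and
    this nearest-visible-frame rule are illustrative assumptions.
    """
    t = latents.shape[0]
    first = latents[0].clone()
    hole = masks[0].bool()                          # (H, W); True = masked out
    for f in range(1, t):                           # scan future frames in order
        visible = ~masks[f].bool() & hole           # holes this frame can fill
        first[:, visible] = latents[f][:, visible]  # propagate their latents
        hole &= ~visible                            # mark those holes as filled
        if not hole.any():
            break
    return first  # completed first-frame latent, ready for the I2V model

latents = torch.randn(8, 4, 32, 32)          # stand-in VAE noise latents, 8 frames
masks = torch.zeros(8, 32, 32)
for f in range(8):
    masks[f, 8:16, 8 + 2 * f:16 + 2 * f] = 1  # removed object drifts right over time
filled = fill_first_frame_latent(latents, masks)
```

Because the object moves across frames, each first-frame hole pixel eventually becomes visible in some later frame, which is the intuition behind borrowing future-frame latents instead of relying on optical flow.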