Author name cluster

Junhwa Hur

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

8 papers

2 author rows

NeurIPS Conference 2025 Conference Paper

Emergent Temporal Correspondences from Video Diffusion Transformers

Jisu Nam
Soowon Son
Dahyun Chung
Jiyoung Kim
Siyoon Jin
Junhwa Hur
Seungryong Kim

Recent advancements in video diffusion models based on Diffusion Transformers (DiTs) have achieved remarkable success in generating temporally coherent videos. Yet, a fundamental question persists: how do these models internally establish and represent temporal correspondences across frames? We introduce DiffTrack, the first quantitative analysis framework designed to answer this question. DiffTrack constructs a dataset of prompt-generated video with pseudo ground-truth tracking annotations and proposes novel evaluation metrics to systematically analyze how each component within the full 3D attention mechanism of DiTs (e. g. , representations, layers, and timesteps) contributes to establishing temporal correspondences. Our analysis reveals that query-key similarities in specific (but not all) layers play a critical role in temporal matching, and that this matching becomes increasingly prominent throughout denoising. We demonstrate practical applications of DiffTrack in zero-shot point tracking, where it achieves state-of-the-art performance compared to existing vision foundation and self-supervised video models. Further, we extend our findings to motion-enhanced video generation with a novel guidance method that improves temporal consistency of generated videos without additional training. We believe our work offers crucial insights into the inner workings of video DiTs and establishes a foundation for further research and applications leveraging their temporal understanding.

PDF Details

AAAI Conference 2025 Conference Paper

High-Resolution Frame Interpolation with Patch-based Cascaded Diffusion

Junhwa Hur
Charles Herrmann
Saurabh Saxena
Janne Kontkanen
Wei-Sheng Lai
Yichang Shih
Michael Rubinstein
David J. Fleet

Despite the recent progress, existing frame interpolation methods still struggle with processing extremely high resolution input and handling challenging cases such as repetitive textures, thin objects, and large motion. To address these issues, we introduce a patch-based cascaded pixel diffusion model for high resolution frame interpolation, HiFI, that excels in these scenarios while achieving competitive performance on standard benchmarks. Cascades, which generate a series of images from low to high resolution, can help significantly with large or complex motion that require both global context for a coarse solution and detailed context for high resolution output. However, contrary to prior work on cascaded diffusion models which perform diffusion on increasingly large resolutions, we use a single model that always performs diffusion at the same resolution and upsamples by processing patches of the inputs and the prior solution. At inference time, this drastically reduces memory usage and allows a single model, solving both frame interpolation (base model’s task) and spatial up-sampling, saving training cost as well. HiFI excels at high-resolution images and complex repeated textures that require global context, achieving comparable or state-of-the-art performance on various benchmarks (Vimeo, Xiph, X-Test, and SEPE-8K). We further introduce a new dataset, LaMoR, that focuses on particularly challenging cases, and HiFI significantly outperforms other baselines.

PDF Details DOI

ICLR Conference 2025 Conference Paper

MonST3R: A Simple Approach for Estimating Geometry in the Presence of Motion

Junyi Zhang 0004
Charles Herrmann
Junhwa Hur
Varun Jampani
Trevor Darrell
Forrester Cole
Deqing Sun
Ming-Hsuan Yang 0001

Estimating geometry from dynamic scenes, where objects move and deform over time, remains a core challenge in computer vision. Current approaches often rely on multi-stage pipelines or global optimizations that decompose the problem into subtasks, like depth and flow, leading to complex systems prone to errors. In this paper, we present Motion DUSt3R (MonST3R), a novel geometry-first approach that directly estimates per-timestep geometry from dynamic scenes. Our key insight is that by simply estimating a pointmap for each timestep, we can effectively adapt DUSt3R’s representation, previously only used for static scenes, to dynamic scenes. However, this approach presents a significant challenge: the scarcity of suitable training data, namely dynamic, posed videos with depth labels. Despite this, we show that by posing the problem as a fine-tuning task, identifying several suitable datasets, and strategically training the model on this limited data, we can surprisingly enable the model to handle dynamics, even without an explicit motion representation. Based on this, we introduce new optimizations for several downstream video-specific tasks and demonstrate strong performance on video depth and camera pose estimation, outperforming prior work in terms of robustness and efficiency. Moreover, MonST3R shows promising results for primarily feed-forward 4D reconstruction. Interactive 4D results, source code, and trained models are available at: https://monst3r-project.github.io/.

Details

NeurIPS Conference 2023 Conference Paper

A Tale of Two Features: Stable Diffusion Complements DINO for Zero-Shot Semantic Correspondence

Junyi Zhang
Charles Herrmann
Junhwa Hur
Luisa Polania Cabrera
Varun Jampani
Deqing Sun
Ming-Hsuan Yang

Text-to-image diffusion models have made significant advances in generating and editing high-quality images. As a result, numerous approaches have explored the ability of diffusion model features to understand and process single images for downstream tasks, e. g. , classification, semantic segmentation, and stylization. However, significantly less is known about what these features reveal across multiple, different images and objects. In this work, we exploit Stable Diffusion (SD) features for semantic and dense correspondence and discover that with simple post-processing, SD features can perform quantitatively similar to SOTA representations. Interestingly, the qualitative analysis reveals that SD features have very different properties compared to existing representation learning features, such as the recently released DINOv2: while DINOv2 provides sparse but accurate matches, SD features provide high-quality spatial information but sometimes inaccurate semantic matches. We demonstrate that a simple fusion of these two features works surprisingly well, and a zero-shot evaluation using nearest neighbors on these fused features provides a significant performance gain over state-of-the-art methods on benchmark datasets, e. g. , SPair-71k, PF-Pascal, and TSS. We also show that these correspondences can enable interesting applications such as instance swapping in two images. Project page: https: //sd-complements-dino. github. io/.

PDF Details

NeurIPS Conference 2023 Conference Paper

The Surprising Effectiveness of Diffusion Models for Optical Flow and Monocular Depth Estimation

Saurabh Saxena
Charles Herrmann
Junhwa Hur
Abhishek Kar
Mohammad Norouzi
Deqing Sun
David J. Fleet

Denoising diffusion probabilistic models have transformed image generation with their impressive fidelity and diversity. We show that they also excel in estimating optical flow and monocular depth, surprisingly without task-specific architectures and loss functions that are predominant for these tasks. Compared to the point estimates of conventional regression-based methods, diffusion models also enable Monte Carlo inference, e. g. , capturing uncertainty and ambiguity in flow and depth. With self-supervised pre-training, the combined use of synthetic and real data for supervised training, and technical innovations (infilling and step-unrolled denoising diffusion training) to handle noisy-incomplete training data, one can train state-of-the-art diffusion models for depth and optical flow estimation, with additional zero-shot coarse-to-fine refinement for high resolution estimates. Extensive experiments focus on quantitative performance against benchmarks, ablations, and the model's ability to capture uncertainty and multimodality, and impute missing values. Our model obtains a state-of-the-art relative depth error of 0. 074 on the indoor NYU benchmark and an Fl-all score of 3. 26\% on the KITTI optical flow benchmark, about 25\% better than the best published method.

PDF Details

IROS Conference 2022 Conference Paper

MasKGrasp: Mask-based Grasping for Scenes with Multiple General Real-world Objects

Junho Lee
Junhwa Hur
Inwoo Hwang
Young Min Kim 0001

In this paper, we introduce a mask-based grasping method that discerns multiple objects within the scene regard-less of transparency or specularity and finds the optimal grasp position avoiding clutter. Conventional vision-based robotic grasping approaches often fail to extend to the scenes containing transparent objects due to their different visual appearance. To handle the different visual characteristics, we first segment both transparent and opaque objects into instance masks, which serve as the domain-agnostic intermediate representation of both object types, using a neural network. While there exists no labelled training dataset that strongly represents both object types, we overcome the limitation by augmenting transparent objects on an existing large-scale dataset. Then, given the object instance masks, our method selects the top K discrete masks and robustly estimates grasp poses avoiding clutter. Through experiments, we verify that the instance masks are light-weight yet provide sufficient information for vision-based grasping agnostic of various appearances. On an unseen real-world test environment with complex objects, our method substantially outperforms previous methods without fine-tuning.

Details

NeurIPS Conference 2022 Conference Paper

Self-supervised surround-view depth estimation with volumetric feature fusion

Jung-Hee Kim
Junhwa Hur
Tien Phuoc Nguyen
Seong-Gyun Jeong

We present a self-supervised depth estimation approach using a unified volumetric feature fusion for surround-view images. Given a set of surround-view images, our method constructs a volumetric feature map by extracting image feature maps from surround-view images and fuse the feature maps into a shared, unified 3D voxel space. The volumetric feature map then can be used for estimating a depth map at each surround view by projecting it into an image coordinate. A volumetric feature contains 3D information at its local voxel coordinate; thus our method can also synthesize a depth map at arbitrary rotated viewpoints by projecting the volumetric feature map into the target viewpoints. Furthermore, assuming static camera extrinsics in the multi-camera system, we propose to estimate a canonical camera motion from the volumetric feature map. Our method leverages 3D spatio- temporal context to learn metric-scale depth and the canonical camera motion in a self-supervised manner. Our method outperforms the prior arts on DDAD and nuScenes datasets, especially estimating more accurate metric-scale depth and consistent depth between neighboring views.

PDF Details

AAAI Conference 2018 Conference Paper

UnFlow: Unsupervised Learning of Optical Flow With a Bidirectional Census Loss

Simon Meister
Junhwa Hur
Stefan Roth

In the era of end-to-end deep learning, many advances in computer vision are driven by large amounts of labeled data. In the optical ﬂow setting, however, obtaining dense perpixel ground truth for real scenes is difﬁcult and thus such data is rare. Therefore, recent end-to-end convolutional networks for optical ﬂow rely on synthetic datasets for supervision, but the domain mismatch between training and test scenarios continues to be a challenge. Inspired by classical energy-based optical ﬂow methods, we design an unsupervised loss based on occlusion-aware bidirectional ﬂow estimation and the robust census transform to circumvent the need for ground truth ﬂow. On the KITTI benchmarks, our unsupervised approach outperforms previous unsupervised deep networks by a large margin, and is even more accurate than similar supervised methods trained on synthetic datasets alone. By optionally ﬁne-tuning on the KITTI training data, our method achieves competitive optical ﬂow accuracy on the KITTI 2012 and 2015 benchmarks, thus in addition enabling generic pre-training of supervised networks for datasets with limited amounts of ground truth.

PDF Details