Arrow Research

Author name cluster

Sihaeng Lee

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

3 papers
2 author rows

Possible papers


IROS Conference 2022 Conference Paper

Fully Convolutional Transformer with Local-Global Attention

  • Sihaeng Lee
  • Eojindl Yi
  • Janghyeon Lee 0001
  • Jinsu Yoo
  • Honglak Lee
  • Seung Hwan Kim

In an attempt to carry the success of transformers in natural language processing over to computer vision tasks, vision transformers (ViTs) have recently gained attention. Performance breakthroughs have been achieved in coarse-grained tasks like classification. However, dense prediction tasks, such as detection, segmentation, and depth estimation, require additional modifications and have been tackled only in an ad-hoc manner, by replacing the convolutional neural network encoder backbone of an existing architecture with a ViT. This study proposes a fully convolutional transformer that can perform both coarse and dense prediction tasks. The proposed architecture is, to the best of our knowledge, the first architecture composed of attention layers even in the decoder part of the network. This is because our newly proposed local-global attention (LGA) can flexibly perform both downsampling and upsampling of spatial features, which are key operations required for dense prediction. Compared with existing ViTs on classification tasks, our architecture shows a reasonable trade-off between performance and efficiency. In the depth estimation task, our architecture achieves performance comparable to that of state-of-the-art transformer-based methods.
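The abstract's key claim is that attention itself can downsample or upsample spatial features. The authors' LGA is not reproduced here; the sketch below only illustrates the underlying idea with plain cross-attention in NumPy: when the query set has a different number of tokens than the input, the output resolution changes accordingly. All function names, shapes, and the randomly initialised projections are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_resample(x, n_out, seed=0):
    """Cross-attention whose query count sets the output resolution.

    x: (n_in, d) sequence of spatial tokens. Queries are a set of n_out
    tokens (randomly initialised here; learned in practice), so the output
    has n_out tokens: n_out < n_in downsamples, n_out > n_in upsamples.
    """
    n_in, d = x.shape
    rng = np.random.default_rng(seed)
    q = rng.standard_normal((n_out, d))     # query tokens fix output size
    k = x @ rng.standard_normal((d, d))     # keys from input tokens
    v = x @ rng.standard_normal((d, d))     # values from input tokens
    attn = softmax(q @ k.T / np.sqrt(d))    # (n_out, n_in) attention weights
    return attn @ v                         # (n_out, d) resampled tokens

tokens = np.random.default_rng(1).standard_normal((16, 8))  # 16 input tokens
down = attention_resample(tokens, n_out=4)    # 4 tokens: downsampled
up = attention_resample(tokens, n_out=64)     # 64 tokens: upsampled
print(down.shape, up.shape)                   # (4, 8) (64, 8)
```

This is why an attention-only decoder is feasible: the same primitive replaces both strided convolution (downsampling) and transposed convolution (upsampling).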

IROS Conference 2022 Conference Paper

Multi-Scaled and Densely Connected Locally Convolutional Layers for Depth Completion

  • Sihaeng Lee
  • Eojindl Yi
  • Janghyeon Lee 0001
  • Junmo Kim 0002

The depth completion task aims to predict a dense depth map from a sparse LiDAR point cloud and an RGB image. This task is critical because an accurate depth map can be used as prior information to solve many computer vision tasks, such as downstream tasks in autonomous vehicles and robot vision. Previous deep learning methods that focus on local affinity have achieved impressive results. However, an architecture that is directly designed to extract local affinity has not been proposed yet. In this paper, we propose multi-scaled and densely connected locally convolutional layers to learn the affinity of the neighborhood. We set a different grid factor for each step of this module, and each step consists of several convolutional layers applied only to the local area assigned by the grid factor. In addition, each step is densely connected, sequentially, to take advantage of the multi-scale receptive fields. The proposed module effectively learns the neighborhood's affinity in a local area at multiple scales, while keeping the network size small. As a result, our architecture achieves state-of-the-art performance compared to published works on the KITTI depth completion benchmark. On the NYU Depth V2 completion benchmark, our method achieves performance comparable to state-of-the-art approaches.
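The mechanism described, convolutions restricted to grid cells of varying size, with steps densely connected, can be sketched minimally in NumPy. This is an assumption-laden illustration, not the authors' module: a 3x3 box filter stands in for the learned convolutional layers, averaging the previous steps stands in for dense concatenation, and all names and grid factors are made up for the example.

```python
import numpy as np

def local_filter(feat, grid):
    """Apply a 3x3 box filter independently inside each grid cell.

    feat: (H, W) feature map; grid: number of cells per side. The filter
    never reads across a cell boundary, so each output pixel mixes only
    pixels from its own local area, as assigned by the grid factor.
    """
    h, w = feat.shape
    ch, cw = h // grid, w // grid
    out = np.zeros_like(feat)
    for gy in range(grid):
        for gx in range(grid):
            cell = feat[gy*ch:(gy+1)*ch, gx*cw:(gx+1)*cw]
            pad = np.pad(cell, 1, mode="edge")   # pad inside the cell only
            acc = np.zeros_like(cell)
            for dy in range(3):                  # 3x3 box filter
                for dx in range(3):
                    acc += pad[dy:dy+ch, dx:dx+cw]
            out[gy*ch:(gy+1)*ch, gx*cw:(gx+1)*cw] = acc / 9.0
    return out

feat = np.random.default_rng(0).standard_normal((16, 16))
# Steps with different grid factors; each step sees all previous outputs
# (averaging here as a stand-in for dense channel concatenation).
steps = [feat]
for grid in (1, 2, 4):
    steps.append(local_filter(sum(steps) / len(steps), grid))
stacked = np.stack(steps)        # (4, 16, 16): input plus three local steps
print(stacked.shape)
```

Larger grid factors confine mixing to smaller cells, so the stacked steps cover multiple receptive-field scales with the same small filter.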

AAAI Conference 2021 Conference Paper

Patch-Wise Attention Network for Monocular Depth Estimation

  • Sihaeng Lee
  • Janghyeon Lee
  • Byungju Kim
  • Eojindl Yi
  • Junmo Kim

In computer vision, monocular depth estimation is the problem of obtaining a high-quality depth map from a two-dimensional image. This map provides information on three-dimensional scene geometry, which is necessary for various applications in academia and industry, such as robotics and autonomous driving. Recent studies based on convolutional neural networks have achieved impressive results for this task. However, most previous studies did not consider the relationships between the neighboring pixels in a local area of the scene. To overcome the drawbacks of existing methods, we propose a patch-wise attention method for focusing on each local area. After extracting patches from an input feature map, our module generates attention maps for each local patch, using two attention modules for each patch along the channel and spatial dimensions. Subsequently, the attention maps return to their initial positions and merge into one attention feature. Our method is straightforward but effective. The experimental results on two challenging datasets, KITTI and NYU Depth V2, demonstrate that the proposed method achieves strong performance. Furthermore, our method outperforms other state-of-the-art methods on the KITTI depth estimation benchmark.
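The pipeline the abstract describes (extract patches, attend along channel and spatial dimensions per patch, write the reweighted patches back in place) can be sketched as below. This is a minimal NumPy illustration under assumptions of my own, not the paper's network: mean pooling plus a sigmoid stands in for the learned channel and spatial attention modules, and all names are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def patch_wise_attention(feat, patch):
    """Channel and spatial attention computed per local patch.

    feat: (C, H, W) feature map. Non-overlapping patch x patch windows are
    extracted; each window gets a channel-attention vector (from its
    spatial mean) and a spatial-attention map (from its channel mean).
    The reweighted window is written back to its original position, so
    the attention statistics stay local to each patch.
    """
    c, h, w = feat.shape
    out = np.empty_like(feat)
    for y in range(0, h, patch):
        for x in range(0, w, patch):
            p = feat[:, y:y+patch, x:x+patch]
            ch_att = sigmoid(p.mean(axis=(1, 2)))[:, None, None]  # (C, 1, 1)
            sp_att = sigmoid(p.mean(axis=0))[None, :, :]          # (1, p, p)
            out[:, y:y+patch, x:x+patch] = p * ch_att * sp_att
    return out

feat = np.random.default_rng(2).standard_normal((8, 16, 16))
att = patch_wise_attention(feat, patch=4)
print(att.shape)   # (8, 16, 16): same shape, locally reweighted
```

Because every attention statistic is computed within a single patch, the module captures exactly the neighboring-pixel relationships the abstract says global pooling misses.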