Arrow Research search

Author name cluster

Mingjun Yin

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

2 papers
1 author row

Possible papers (2)

IJCAI 2024 · Conference Paper

CoAtFormer: Vision Transformer with Composite Attention

  • Zhiyong Chang
  • Mingjun Yin
  • Yan Wang

Transformers have recently gained significant attention and achieved state-of-the-art performance in various computer vision applications, including image classification, instance segmentation, and object detection. However, the self-attention mechanism underlying the transformer incurs quadratic computational cost with respect to image size, limiting its widespread adoption in state-of-the-art vision backbones. In this paper, we introduce an efficient and effective attention module we call Composite Attention. It features parallel branches, enabling the modeling of various global dependencies. In each composite attention module, one branch employs a dynamic channel attention module to capture global channel dependencies, while the other branch utilizes an efficient spatial attention module to extract long-range spatial interactions. In addition, we effectively blend the composite attention module with convolutions, and accordingly develop a simple hierarchical vision backbone, dubbed CoAtFormer, by repeating the basic building block over multiple stages. Extensive experiments show our CoAtFormer achieves state-of-the-art results on various tasks. Without any pre-training or extra data, CoAtFormer-Tiny, CoAtFormer-Small, and CoAtFormer-Base achieve 84.4%, 85.3%, and 85.9% top-1 accuracy on ImageNet-1K with 24M, 37M, and 73M parameters, respectively. Furthermore, CoAtFormer also consistently outperforms prior work in other vision tasks such as object detection, instance segmentation, and semantic segmentation. When further pre-trained on the larger ImageNet-22K dataset, we achieve 88.7% top-1 accuracy on ImageNet-1K.
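The two parallel branches described in the abstract (a channel-attention branch and a spatial-attention branch whose outputs are fused) can be caricatured with a toy sketch. This is NOT the paper's actual module: the gating functions and the additive fusion below are illustrative placeholders chosen for simplicity, and the function names (`channel_branch`, `spatial_branch`, `composite_attention`) are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def channel_branch(x):
    # x: (C, H, W). Global average pool gives one statistic per channel,
    # which is turned into a per-channel gate (global channel dependencies).
    gap = x.mean(axis=(1, 2))                 # (C,)
    return x * sigmoid(gap)[:, None, None]

def spatial_branch(x):
    # Cross-channel mean gives one statistic per pixel, which is turned
    # into a per-location gate (a crude stand-in for long-range spatial
    # interactions).
    smap = x.mean(axis=0)                     # (H, W)
    return x * sigmoid(smap)[None, :, :]

def composite_attention(x):
    # Run both branches in parallel and fuse; the paper's exact fusion
    # rule is not specified in the abstract, so we simply sum.
    return channel_branch(x) + spatial_branch(x)

x = np.random.randn(8, 4, 4)                  # toy feature map: 8 channels, 4x4
y = composite_attention(x)
assert y.shape == x.shape
```

The point of the sketch is only the structure: two branches consume the same feature map, one reducing over space to weight channels and one reducing over channels to weight locations, and their outputs are combined.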

AAAI 2022 · Conference Paper

Context-Aware Transfer Attacks for Object Detection

  • Zikui Cai
  • Xinxin Xie
  • Shasha Li
  • Mingjun Yin
  • Chengyu Song
  • Srikanth V. Krishnamurthy
  • Amit K. Roy-Chowdhury
  • M. Salman Asif

Blackbox transfer attacks for image classifiers have been extensively studied in recent years. In contrast, little progress has been made on transfer attacks for object detectors. Object detectors take a holistic view of the image, and the detection of one object (or lack thereof) often depends on other objects in the scene. This makes such detectors inherently context-aware, and adversarial attacks in this space are more challenging than those targeting image classifiers. In this paper, we present a new approach to generate context-aware attacks for object detectors. We show that by using co-occurrence of objects and their relative locations and sizes as context information, we can successfully generate targeted mis-categorization attacks that achieve higher transfer success rates on blackbox object detectors than the state-of-the-art. We test our approach on a variety of object detectors with images from the PASCAL VOC and MS COCO datasets and demonstrate up to 20 percentage points improvement in performance compared to other state-of-the-art methods.
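The core idea in the abstract — preferring target mislabels that are consistent with the rest of the scene — can be illustrated with a minimal co-occurrence scorer. Everything here is hypothetical: the class list, the co-occurrence counts, and the `context_score` helper are made up for illustration and are not the authors' actual formulation.

```python
import numpy as np

# Hypothetical co-occurrence counts among four classes. In practice such a
# matrix would be estimated from a dataset like PASCAL VOC or MS COCO.
labels = ["person", "bicycle", "car", "cow"]
cooc = np.array([
    [50, 20, 30,  2],   # person
    [20, 15,  5,  0],   # bicycle
    [30,  5, 40,  1],   # car
    [ 2,  0,  1, 10],   # cow
], dtype=float)

def context_score(target, scene_labels):
    """Average co-occurrence plausibility of `target` given the other
    objects detected in the scene (higher = more context-consistent)."""
    t = labels.index(target)
    # Normalize each row into a co-occurrence probability distribution.
    probs = cooc / cooc.sum(axis=1, keepdims=True)
    return float(np.mean([probs[t, labels.index(s)] for s in scene_labels]))

# In a street scene containing a person and a car, "bicycle" is a more
# context-consistent target mislabel for a victim object than "cow",
# so a context-aware attack would favor it.
scene = ["person", "car"]
assert context_score("bicycle", scene) > context_score("cow", scene)
```

A context-aware attacker would use a score like this to pick target labels (and plausible locations/sizes) that a holistic detector will not find anomalous, which is what drives the improved blackbox transfer rates the paper reports.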