Arrow Research search

Author name cluster

Qibin Hou

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

19 papers
2 author rows

Possible papers

19

AAAI Conference 2026 Conference Paper

SM3Det: A Unified Model for Multi-Modal Remote Sensing Object Detection

  • Yuxuan Li
  • Xiang Li
  • Yunheng Li
  • Yicheng Zhang
  • Yimian Dai
  • Qibin Hou
  • Ming-Ming Cheng
  • Jian Yang

With the rapid advancement of remote sensing technology, high-resolution multi-modal imagery is now more widely accessible. Conventional object detection models are trained on a single dataset, often restricted to a specific imaging modality and annotation format. However, such an approach overlooks the valuable knowledge shared across modalities and limits the model's applicability in more versatile scenarios. This paper introduces a new task called Multi-Modal Datasets and Multi-Task Object Detection (M2Det) for remote sensing, designed to accurately detect horizontal or oriented objects from any sensor modality. This task poses challenges due to 1) the trade-offs involved in multi-modal modeling and 2) the complexities of multi-task optimization. To address these, we establish a benchmark dataset and propose a unified model, SM3Det (Single Model for Multi-Modal datasets and Multi-Task object Detection). SM3Det leverages a grid-level sparse MoE backbone to enable joint knowledge learning while preserving distinct feature representations for different modalities. Furthermore, we propose a novel consistency and synchronization optimization mechanism, allowing it to effectively handle varying levels of learning difficulty across modalities and tasks. Extensive experiments demonstrate SM3Det's effectiveness and generalizability, consistently outperforming the combination of specialized models on individual datasets.

AAAI Conference 2026 Conference Paper

Strip R-CNN: Large Strip Convolution for Remote Sensing Object Detection

  • Xinbin Yuan
  • Zhaohui Zheng
  • Yuxuan Li
  • Xialei Liu
  • Li Liu
  • Xiang Li
  • Qibin Hou
  • Ming-Ming Cheng

In this paper, we show that current approaches using large square kernels or transformer-based global modeling aggregate contextual information uniformly across spatial dimensions, leading to feature dilution and localization errors for elongated targets. To mitigate this issue, we propose Strip R-CNN, the first work to systematically explore large strip convolutions for remote sensing object detection. Our key insight is that strip convolutions enable directional feature aggregation along the dominant spatial dimension of slender objects, reducing background interference while preserving essential geometric information. We design two core components: (i) StripNet, a backbone network employing sequential orthogonal large strip convolutions to capture anisotropic spatial patterns, and (ii) Strip Head, which enhances localization precision by incorporating strip convolutions into the detection head. Unlike previous large-kernel approaches that suffer from computational redundancy and isotropic limitations, our method achieves superior performance with remarkable efficiency. Extensive experiments on multiple benchmarks (DOTA, FAIR1M, HRSC2016, and DIOR) demonstrate significant improvements, with our 30M parameter model achieving 82.75% mAP on DOTA-v1.0, establishing a new state-of-the-art record while providing new insights into anisotropic feature learning for remote sensing applications.
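
Not part of the abstract above: as a hedged illustration of the core idea, a large strip convolution can be modeled as a pair of orthogonal 1-D passes, a 1×k horizontal convolution followed by a k×1 vertical one. The function names, the zero-padding choice, and the toy feature maps below are assumptions for the sketch, not the paper's implementation.

```python
def strip_conv(feat, kernel, horizontal=True):
    """Apply a 1-D 'strip' convolution (zero-padded, stride 1) to a 2-D
    feature map `feat` (list of lists of floats) along one axis."""
    h, w = len(feat), len(feat[0])
    pad = len(kernel) // 2
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for i, kv in enumerate(kernel):
                off = i - pad
                yy, xx = (y, x + off) if horizontal else (y + off, x)
                if 0 <= yy < h and 0 <= xx < w:  # zero padding at borders
                    acc += kv * feat[yy][xx]
            out[y][x] = acc
    return out

def orthogonal_strips(feat, kernel):
    """Sequential orthogonal strips: a 1xk pass followed by a kx1 pass,
    aggregating context along each spatial dimension separately."""
    return strip_conv(strip_conv(feat, kernel, horizontal=True),
                      kernel, horizontal=False)
```

For an elongated target (e.g. a row of high activations), the horizontal pass accumulates evidence along the object's dominant dimension while the vertical pass stays narrow, which is the directional-aggregation intuition the abstract describes.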

ICLR Conference 2025 Conference Paper

Multi-Task Dense Predictions via Unleashing the Power of Diffusion

  • Yuqi Yang
  • Peng-Tao Jiang
  • Qibin Hou
  • Hao Zhang 0063
  • Jinwei Chen 0003
  • Bo Li

Diffusion models have exhibited extraordinary performance in dense prediction tasks. However, there are few works exploring the diffusion pipeline for multi-task dense predictions. In this paper, we unlock the potential of diffusion models in solving multi-task dense predictions and propose a novel diffusion-based method, called TaskDiffusion, which leverages the conditional diffusion process in the decoder. Instead of denoising the noisy labels for different tasks separately, we propose a novel joint denoising diffusion process to capture the task relations during denoising. To be specific, our method first encodes the task-specific labels into a task-integration feature space to unify the encoding strategy. This allows us to get rid of the cumbersome task-specific encoding process. In addition, we also propose a cross-task diffusion decoder conditioned on task-specific multi-level features, which can model the interactions among different tasks and levels explicitly while preserving efficiency. Experiments show that our TaskDiffusion outperforms previous state-of-the-art methods for all dense prediction tasks on the widely-used PASCAL-Context and NYUD-v2 datasets. Our code is available at https://github.com/YuqiYang213/TaskDiffusion.

NeurIPS Conference 2025 Conference Paper

OmniSegmentor: A Flexible Multi-Modal Learning Framework for Semantic Segmentation

  • Bo-Wen Yin
  • Jiao-Long Cao
  • Xuying Zhang
  • Yuming Chen
  • Ming-Ming Cheng
  • Qibin Hou

Recent research on representation learning has proved the merits of multi-modal clues for robust semantic segmentation. Nevertheless, a flexible pretrain-and-finetune pipeline for multiple visual modalities remains unexplored. In this paper, we propose a novel multi-modal learning framework, termed OmniSegmentor. It has two key innovations: 1) Based on ImageNet, we assemble a large-scale dataset for multi-modal pretraining, called OmniSegmentor, which contains five popular visual modalities; 2) We provide an efficient pretraining manner to endow the model with the capacity to encode different modality information in the OmniSegmentor. For the first time, we introduce a universal multi-modal pretraining framework that consistently amplifies the model's perceptual capabilities across various scenarios, regardless of the arbitrary combination of the involved modalities. Remarkably, our OmniSegmentor achieves new state-of-the-art records on a wide range of multi-modal semantic segmentation datasets, including NYU Depthv2, EventScape, MFNet, DeLiVER, SUNRGBD, and KITTI-360. Data, model checkpoints, and source code will be made publicly available: https://github.com/VCIP-RGBD/DFormer.

ICLR Conference 2025 Conference Paper

Re-Aligning Language to Visual Objects with an Agentic Workflow

  • Yuming Chen
  • Jiangyan Feng
  • Haodong Zhang
  • Lijun Gong
  • Feng Zhu 0006
  • Rui Zhao 0001
  • Qibin Hou
  • Ming-Ming Cheng

Language-based object detection (LOD) aims to align visual objects with language expressions. A large amount of paired data is utilized to improve LOD model generalization. During the training process, recent studies leverage vision-language models (VLMs) to automatically generate human-like expressions for visual objects, facilitating training data scaling up. In this process, we observe that VLM hallucinations bring inaccurate object descriptions (e.g., object name, color, and shape) that deteriorate VL alignment quality. To reduce VLM hallucinations, we propose an agentic workflow controlled by an LLM to re-align language to visual objects via adaptively adjusting image and text prompts. We name this workflow Real-LOD, which includes planning, tool use, and reflection steps. Given an image with detected objects and raw VLM language expressions, Real-LOD reasons about its state automatically and arranges actions based on our neural symbolic designs (i.e., planning). The action adaptively adjusts the image and text prompts and sends them to VLMs for object re-description (i.e., tool use). Then, we use another LLM to analyze these refined expressions for feedback (i.e., reflection). These steps are conducted in a cyclic form to gradually improve language descriptions for re-aligning to visual objects. We construct a dataset of only 0.18M images with re-aligned language expressions and train a popular LOD model that surpasses existing LOD methods by around 50% on standard benchmarks. Our Real-LOD workflow, with automatic VL refinement, reveals a potential to preserve data quality along with scaling up data quantity, which further improves LOD performance from a data-alignment perspective.

NeurIPS Conference 2025 Conference Paper

SE-GUI: Enhancing Visual Grounding for GUI Agents via Self-Evolutionary Reinforcement Learning

  • Xinbin Yuan
  • Jian Zhang
  • Kaixin Li
  • Zhuoxuan Cai
  • Lujian Yao
  • Jie Chen
  • Enguang Wang
  • Qibin Hou

Graphical User Interface (GUI) agents have made substantial strides in understanding and executing user instructions across diverse platforms. Yet, grounding these instructions to precise interface elements remains challenging, especially in complex, high-resolution, professional environments. Traditional supervised fine-tuning (SFT) methods often require large volumes of diverse data and exhibit weak generalization. To overcome these limitations, we introduce a reinforcement learning (RL)-based framework that incorporates three core strategies: (1) seed data curation to ensure high-quality training samples, (2) a dense policy gradient that provides continuous feedback based on prediction accuracy, and (3) a self-evolutionary reinforcement finetuning mechanism that iteratively refines the model using attention maps. With only 3k training samples, our 7B-parameter model achieves state-of-the-art results among similarly sized models on three grounding benchmarks. Notably, it attains 47.3% accuracy on the ScreenSpot-Pro dataset, outperforming much larger models, such as UI-TARS-72B, by a margin of 24.2%. These findings underscore the effectiveness of RL-based approaches in enhancing GUI agent performance, particularly in high-resolution, complex environments.

NeurIPS Conference 2025 Conference Paper

TempSamp-R1: Effective Temporal Sampling with Reinforcement Fine-Tuning for Video LLMs

  • Yunheng Li
  • Jing Cheng
  • Shaoyong Jia
  • Hangyi Kuang
  • Shaohui Jiao
  • Qibin Hou
  • Ming-Ming Cheng

This paper introduces TempSamp-R1, a new reinforcement fine-tuning framework designed to improve the effectiveness of adapting multimodal large language models (MLLMs) to video temporal grounding tasks. We reveal that existing reinforcement learning methods, such as Group Relative Policy Optimization (GRPO), rely on on-policy sampling for policy updates. However, in tasks with large temporal search spaces, this strategy becomes both inefficient and limited in performance, as it often fails to identify temporally accurate solutions. To address this limitation, TempSamp-R1 leverages ground-truth annotations as off-policy supervision to provide temporally precise guidance, effectively compensating for the sparsity and misalignment in on-policy solutions. To further stabilize training and reduce variance in reward-based updates, TempSamp-R1 provides a non-linear soft advantage computation method that dynamically reshapes the reward feedback via an asymmetric transformation. By employing a hybrid Chain-of-Thought (CoT) training paradigm, TempSamp-R1 optimizes a single unified model to support both CoT and non-CoT inference modes, enabling efficient handling of queries with varying reasoning complexity. Experimental results demonstrate that TempSamp-R1 outperforms GRPO-based baselines, establishing new state-of-the-art performance on benchmark datasets: Charades-STA (R1@0.7: 52.9%, +2.7%), ActivityNet Captions (R1@0.5: 56.0%, +5.3%), and QVHighlights (mAP: 30.0%, +3.0%). Moreover, TempSamp-R1 shows robust few-shot generalization capabilities under limited data. Code is available at https://github.com/HVision-NKU/TempSamp-R1.

ICML Conference 2024 Conference Paper

Cascade-CLIP: Cascaded Vision-Language Embeddings Alignment for Zero-Shot Semantic Segmentation

  • Yunheng Li
  • Zhong-Yu Li 0006
  • Quan-Sheng Zeng 0001
  • Qibin Hou
  • Ming-Ming Cheng

Pre-trained vision-language models, e.g., CLIP, have been successfully applied to zero-shot semantic segmentation. Existing CLIP-based approaches primarily utilize visual features from the last layer to align with text embeddings, while they neglect the crucial information in intermediate layers that contain rich object details. However, we find that directly aggregating the multi-level visual features weakens the zero-shot ability for novel classes. The large differences between the visual features from different layers make these features hard to align well with the text embeddings. We resolve this problem by introducing a series of independent decoders to align the multi-level visual features with the text embeddings in a cascaded way, forming a novel but simple framework named Cascade-CLIP. Our Cascade-CLIP is flexible and can be easily applied to existing zero-shot semantic segmentation methods. Experimental results show that our simple Cascade-CLIP achieves superior zero-shot performance on segmentation benchmarks, like COCO-Stuff, Pascal-VOC, and Pascal-Context. Our code is available at https://github.com/HVision-NKU/Cascade-CLIP.

ICLR Conference 2024 Conference Paper

DFormer: Rethinking RGBD Representation Learning for Semantic Segmentation

  • Bowen Yin
  • Xuying Zhang
  • Zhong-Yu Li 0006
  • Li Liu 0002
  • Ming-Ming Cheng
  • Qibin Hou

We present DFormer, a novel RGB-D pretraining framework to learn transferable representations for RGB-D segmentation tasks. DFormer has two new key innovations: 1) Unlike previous works that encode RGB-D information with RGB pretrained backbone, we pretrain the backbone using image-depth pairs from ImageNet-1K, and thus the DFormer is endowed with the capacity to encode RGB-D representations; 2) DFormer comprises a sequence of RGB-D blocks, which are tailored for encoding both RGB and depth information through a novel building block design. DFormer avoids the mismatched encoding of the 3D geometry relationships in depth maps by RGB pretrained backbones, which widely lies in existing methods but has not been resolved. We finetune the pretrained DFormer on two popular RGB-D tasks, i.e., RGB-D semantic segmentation and RGB-D salient object detection, with a lightweight decoder head. Experimental results show that our DFormer achieves new state-of-the-art performance on these two tasks with less than half of the computational cost of the current best methods on two RGB-D semantic segmentation datasets and five RGB-D salient object detection datasets. Code will be made publicly available.

ICLR Conference 2024 Conference Paper

Get What You Want, Not What You Don't: Image Content Suppression for Text-to-Image Diffusion Models

  • Senmao Li
  • Joost van de Weijer 0001
  • Taihang Hu
  • Fahad Shahbaz Khan
  • Qibin Hou
  • Yaxing Wang
  • Jian Yang 0003

The success of recent text-to-image diffusion models is largely due to their capacity to be guided by a complex text prompt, which enables users to precisely describe the desired content. However, these models struggle to effectively suppress the generation of undesired content, which is explicitly requested to be omitted from the generated image in the prompt. In this paper, we analyze how to manipulate the text embeddings and remove unwanted content from them. We introduce two contributions, which we refer to as soft-weighted regularization and inference-time text embedding optimization. The first regularizes the text embedding matrix and effectively suppresses the undesired content. The second method aims to further suppress the unwanted content generation of the prompt, and encourages the generation of desired content. We evaluate our method quantitatively and qualitatively on extensive experiments, validating its effectiveness. Furthermore, our method is generalizability to both the pixel-space diffusion models (i.e. DeepFloyd-IF) and the latent-space diffusion models (i.e. Stable Diffusion).

NeurIPS Conference 2024 Conference Paper

OPUS: Occupancy Prediction Using a Sparse Set

  • Jiabao Wang
  • Zhaojiang Liu
  • Qiang Meng
  • Liujiang Yan
  • Ke Wang
  • Jie Yang
  • Wei Liu
  • Qibin Hou

Occupancy prediction, aiming at predicting the occupancy status within a voxelized 3D environment, is quickly gaining momentum within the autonomous driving community. Mainstream occupancy prediction works first discretize the 3D environment into voxels, then perform classification on such dense grids. However, inspection of sample data reveals that the vast majority of voxels are unoccupied. Performing classification on these empty voxels leads to suboptimal allocation of computational resources, and reducing such empty voxels necessitates complex algorithm designs. To this end, we present a novel perspective on the occupancy prediction task: formulating it as a streamlined set prediction paradigm without the need for explicit space modeling or complex sparsification procedures. Our proposed framework, called OPUS, utilizes a transformer encoder-decoder architecture to simultaneously predict occupied locations and classes using a set of learnable queries. Firstly, we employ the Chamfer distance loss to scale the set-to-set comparison problem to unprecedented magnitudes, making end-to-end training of such a model a reality. Subsequently, semantic classes are adaptively assigned using nearest neighbor search based on the learned locations. In addition, OPUS incorporates a suite of non-trivial strategies to enhance model performance, including coarse-to-fine learning, consistent point sampling, and adaptive re-weighting. Finally, compared with current state-of-the-art methods, our lightest model achieves superior RayIoU on the Occ3D-nuScenes dataset at near 2x FPS, while our heaviest model surpasses previous best results by 6.1 RayIoU.
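
Not from the paper itself: the set-to-set comparison the abstract mentions is driven by a Chamfer distance between predicted and ground-truth point sets. A minimal quadratic-time sketch (names are assumptions; real implementations batch this on GPU):

```python
def chamfer_distance(pred, gt):
    """Symmetric Chamfer distance between two 3-D point sets, each given as
    a list of (x, y, z) tuples: the mean nearest-neighbor squared distance
    from pred to gt, plus the same from gt to pred."""
    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    def one_way(src, dst):
        # For each source point, find its closest destination point.
        return sum(min(sq_dist(p, q) for q in dst) for p in src) / len(src)

    return one_way(pred, gt) + one_way(gt, pred)
```

Because the loss only asks each predicted point to be near *some* ground-truth point (and vice versa), no bipartite matching is needed, which is what lets the comparison scale to large sets.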

AAAI Conference 2024 Conference Paper

Polyper: Boundary Sensitive Polyp Segmentation

  • Hao Shao
  • Yang Zhang
  • Qibin Hou

We present a new boundary sensitive framework for polyp segmentation, termed Polyper. Our method is motivated by a clinical practice in which seasoned medical practitioners leverage the inherent features of interior polyp regions to tackle blurred boundaries. Inspired by this, we propose to explicitly leverage boundary regions to bolster the model's boundary discrimination capability while minimizing wasted computational resources. Our approach first extracts low-confidence boundary regions and high-confidence prediction regions from an initial segmentation map through differentiable morphological operators. Then, we design a boundary sensitive attention module that augments the features near the boundary regions using the characteristics of the high-confidence prediction regions to generate good segmentation results. Our proposed method can be seamlessly integrated with classical encoder networks, like ResNet-50, MiT-B1, and Swin Transformer. To evaluate the effectiveness of Polyper, we conduct experiments on five publicly available challenging datasets, and achieve state-of-the-art performance on all of them. Code is available at https://github.com/haoshao-nku/medical_seg.git.
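
As an illustrative sketch only (not Polyper's code): the discrete analogue of extracting a boundary region with morphological operators is dilation minus erosion of a binary prediction mask. The 3×3 structuring element and function names below are assumptions.

```python
def morph(mask, op):
    """3x3 dilation (op='max') or erosion (op='min') of a binary mask
    given as a list of lists of 0/1 values."""
    h, w = len(mask), len(mask[0])
    agg = max if op == "max" else min
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            vals = [mask[yy][xx]
                    for yy in range(max(0, y - 1), min(h, y + 2))
                    for xx in range(max(0, x - 1), min(w, x + 2))]
            row.append(agg(vals))
        out.append(row)
    return out

def boundary_region(mask):
    """Dilation minus erosion: a thin band around the predicted edge,
    i.e. the low-confidence region a boundary-sensitive module would refine."""
    d, e = morph(mask, "max"), morph(mask, "min")
    return [[di - ei for di, ei in zip(dr, er)] for dr, er in zip(d, e)]
```

Pixels deep inside the predicted polyp (and far outside it) cancel out, leaving only the uncertain rim, which matches the abstract's split into boundary and high-confidence regions.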

NeurIPS Conference 2024 Conference Paper

SARDet-100K: Towards Open-Source Benchmark and ToolKit for Large-Scale SAR Object Detection

  • Yuxuan Li
  • Xiang Li
  • Weijie Li
  • Qibin Hou
  • Li Liu
  • Ming-Ming Cheng
  • Jian Yang

Synthetic Aperture Radar (SAR) object detection has gained significant attention recently due to its irreplaceable all-weather imaging capabilities. However, this research field suffers from both limited public datasets (mostly comprising <2K images with only mono-category objects) and inaccessible source code. To tackle these challenges, we establish a new benchmark dataset and an open-source method for large-scale SAR object detection. Our dataset, SARDet-100K, is a result of intense surveying, collecting, and standardizing 10 existing SAR detection datasets, providing a large-scale and diverse dataset for research purposes. To the best of our knowledge, SARDet-100K is the first COCO-level large-scale multi-class SAR object detection dataset ever created. With this high-quality dataset, we conducted comprehensive experiments and uncovered a crucial challenge in SAR object detection: the substantial disparities between pretraining on RGB datasets and finetuning on SAR datasets in terms of both data domain and model structure. To bridge these gaps, we propose a novel Multi-Stage with Filter Augmentation (MSFA) pretraining framework that tackles the problems from the perspective of data input, domain transition, and model migration. The proposed MSFA method significantly enhances the performance of SAR object detection models while demonstrating exceptional generalizability and flexibility across diverse models. This work aims to pave the way for further advancements in SAR object detection. The dataset and code are available at https://github.com/zcablii/SARDet_100K.

NeurIPS Conference 2024 Conference Paper

StoryDiffusion: Consistent Self-Attention for Long-Range Image and Video Generation

  • Yupeng Zhou
  • Daquan Zhou
  • Ming-Ming Cheng
  • Jiashi Feng
  • Qibin Hou

For recent diffusion-based generative models, maintaining consistent content across a series of generated images, especially those containing subjects and complex details, presents a significant challenge. In this paper, we propose a simple but effective self-attention mechanism, termed Consistent Self-Attention, that boosts the consistency between the generated images. It can be used to augment pre-trained diffusion-based text-to-image models in a zero-shot manner. Based on the images with consistent content, we further show that our method can be extended to long range video generation by introducing a semantic space temporal motion prediction module, named Semantic Motion Predictor. It is trained to estimate the motion conditions between two provided images in the semantic spaces. This module converts the generated sequence of images into videos with smooth transitions and consistent subjects that are more stable than the modules based on latent spaces only, especially in the context of long video generation. By merging these two novel components, our framework, referred to as StoryDiffusion, can describe a text-based story with consistent images or videos encompassing a rich variety of contents. The proposed StoryDiffusion encompasses pioneering explorations in visual story generation with the presentation of images and videos, which we hope could inspire more research from the aspect of architectural modifications.

NeurIPS Conference 2022 Conference Paper

SegNeXt: Rethinking Convolutional Attention Design for Semantic Segmentation

  • Meng-Hao Guo
  • Cheng-Ze Lu
  • Qibin Hou
  • Zhengning Liu
  • Ming-Ming Cheng
  • Shi-Min Hu

We present SegNeXt, a simple convolutional network architecture for semantic segmentation. Recent transformer-based models have dominated the field of semantic segmentation due to the efficiency of self-attention in encoding spatial information. In this paper, we show that convolutional attention is a more efficient and effective way to encode contextual information than the self-attention mechanism in transformers. By re-examining the characteristics owned by successful segmentation models, we discover several key components leading to the performance improvement of segmentation models. This motivates us to design a novel convolutional attention network that uses cheap convolutional operations. Without bells and whistles, our SegNeXt significantly improves the performance of previous state-of-the-art methods on popular benchmarks, including ADE20K, Cityscapes, COCO-Stuff, Pascal VOC, Pascal Context, and iSAID. Notably, SegNeXt outperforms EfficientNet-L2 w/ NAS-FPN and achieves 90.6% mIoU on the Pascal VOC 2012 test leaderboard using only 1/10 of its parameters. On average, SegNeXt achieves about 2.0% mIoU improvement compared to the state-of-the-art methods on the ADE20K dataset with the same or fewer computations.

NeurIPS Conference 2021 Conference Paper

All Tokens Matter: Token Labeling for Training Better Vision Transformers

  • Zi-Hang Jiang
  • Qibin Hou
  • Li Yuan
  • Daquan Zhou
  • Yujun Shi
  • Xiaojie Jin
  • Anran Wang
  • Jiashi Feng

In this paper, we present token labeling---a new training objective for training high-performance vision transformers (ViTs). Different from the standard training objective of ViTs that computes the classification loss on an additional trainable class token, our proposed one takes advantage of all the image patch tokens to compute the training loss in a dense manner. Specifically, token labeling reformulates the image classification problem into multiple token-level recognition problems and assigns each patch token an individual location-specific supervision generated by a machine annotator. Experiments show that token labeling can clearly and consistently improve the performance of various ViT models across a wide spectrum. For a vision transformer with 26M learnable parameters serving as an example, with token labeling, the model can achieve 84.4% Top-1 accuracy on ImageNet. The result can be further increased to 86.4% by slightly scaling the model size up to 150M, making it the smallest model among previous models (250M+) to reach 86%. We also show that token labeling can clearly improve the generalization capability of the pretrained models on downstream tasks with dense prediction, such as semantic segmentation. Our code and model are publicly available at https://github.com/zihangJiang/TokenLabeling.
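
A hedged sketch, not the paper's objective in detail: the dense supervision described above can be modeled as the class-token cross-entropy plus the mean cross-entropy over patch tokens, each patch supervised by its own soft label. The weight `beta` and the exact label format are assumptions here.

```python
import math

def cross_entropy(probs, target):
    """Cross-entropy between a predicted distribution and a soft target."""
    return -sum(t * math.log(p) for p, t in zip(probs, target) if t > 0)

def token_labeling_loss(cls_probs, patch_probs, cls_target, patch_targets,
                        beta=0.5):
    """Class-token CE plus beta times the mean CE over all patch tokens,
    each patch matched to its location-specific soft label."""
    token_loss = sum(cross_entropy(p, t)
                     for p, t in zip(patch_probs, patch_targets))
    token_loss /= len(patch_probs)
    return cross_entropy(cls_probs, cls_target) + beta * token_loss
```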

ICLR Conference 2020 Conference Paper

Neural Epitome Search for Architecture-Agnostic Network Compression

  • Daquan Zhou
  • Xiaojie Jin
  • Qibin Hou
  • Kaixin Wang
  • Jianchao Yang
  • Jiashi Feng

Traditional compression methods including network pruning, quantization, low rank factorization and knowledge distillation all assume that network architectures and parameters should be hardwired. In this work, we propose a new perspective on network compression, i.e., network parameters can be disentangled from the architectures. From this viewpoint, we present the Neural Epitome Search (NES), a new neural network compression approach that learns to find compact yet expressive epitomes for weight parameters of a specified network architecture end-to-end. The complete network to compress can be generated from the learned epitome via a novel transformation method that adaptively transforms the epitomes to match shapes of the given architecture. Compared with existing compression methods, NES allows the weight tensors to be independent of the architecture design and hence can achieve a good trade-off between model compression rate and performance given a specific model size constraint. Experiments demonstrate that, on ImageNet, when taking MobileNetV2 as backbone, our approach improves the full-model baseline by 1.47% in top-1 accuracy with 25% MAdd reduction and AutoML for Model Compression (AMC) by 2.5% with nearly the same compression ratio. Moreover, taking EfficientNet-B0 as baseline, our NES yields an improvement of 1.2% while using 10% fewer MAdds. In particular, our method achieves a new state-of-the-art result of 77.5% under mobile settings (<350M MAdds). Code will be made publicly available.

AAAI Conference 2018 Conference Paper

FLIC: Fast Linear Iterative Clustering With Active Search

  • Jiaxing Zhao
  • Bo Ren
  • Qibin Hou
  • Ming-Ming Cheng
  • Paul Rosin

In this paper, we reconsider the clustering problem for image over-segmentation from a new perspective. We propose a novel search algorithm named "active search" which explicitly considers neighboring continuity. Based on this search method, we design a back-and-forth traversal strategy and a "joint" assignment and update step to speed up the algorithm. Compared to earlier works, such as Simple Linear Iterative Clustering (SLIC) and its follow-ups, which use fixed search regions and perform the assignment and update steps separately, our novel scheme reduces the number of iterations required for convergence, and also improves the boundary sensitivity of the over-segmentation results. Extensive evaluations on the Berkeley segmentation benchmark verify that our method outperforms competing methods under various evaluation metrics. In particular, it reports the lowest time cost among existing methods (approximately 30 fps for a 481 × 321 image on a single CPU core). To facilitate the development of over-segmentation, the code will be publicly available.

NeurIPS Conference 2018 Conference Paper

Self-Erasing Network for Integral Object Attention

  • Qibin Hou
  • PengTao Jiang
  • Yunchao Wei
  • Ming-Ming Cheng

Recently, adversarial erasing for weakly-supervised object attention has been deeply studied due to its capability in localizing integral object regions. However, such a strategy raises one key problem that attention regions will gradually expand to non-object regions as training iterations continue, which significantly decreases the quality of the produced attention maps. To tackle such an issue as well as promote the quality of object attention, we introduce a simple yet effective Self-Erasing Network (SeeNet) to prohibit attentions from spreading to unexpected background regions. In particular, SeeNet leverages two self-erasing strategies to encourage networks to use reliable object and background cues for learning to attention. In this way, integral object regions can be effectively highlighted without including much more background regions. To test the quality of the generated attention maps, we employ the mined object regions as heuristic cues for learning semantic segmentation models. Experiments on Pascal VOC well demonstrate the superiority of our SeeNet over other state-of-the-art methods.
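
Sketch only, with assumed names and threshold: the erasing step that adversarial-erasing approaches (and SeeNet's self-erasing variant) build on can be illustrated as zeroing features wherever the current attention map is already confident, forcing subsequent learning to rely on the remaining regions.

```python
def erase_high_attention(feat, attn, thresh=0.5):
    """Zero the feature entries wherever attention exceeds `thresh`, so the
    network must discover evidence in the remaining (less salient) regions.
    `feat` and `attn` are same-shaped 2-D lists; `thresh` is an assumption."""
    return [[f if a <= thresh else 0.0 for f, a in zip(frow, arow)]
            for frow, arow in zip(feat, attn)]
```

SeeNet's contribution, per the abstract, is to constrain this process with background cues so the erased attention does not drift into non-object regions; that control logic is beyond this one-step sketch.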