Author name cluster

Ming-Ming Cheng

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

41 papers
2 author rows

Possible papers

AAAI Conference 2026 Conference Paper

DenoDet V2: Phase-Amplitude Cross Denoising for SAR Object Detection

  • Kang Ni
  • Minrui Zou
  • Yuxuan Li
  • Xiang Li
  • Kehua Guo
  • Ming-Ming Cheng
  • Yimian Dai

One of the primary challenges in Synthetic Aperture Radar (SAR) object detection lies in the pervasive influence of coherent noise. As a common practice, most existing methods, whether handcrafted approaches or deep learning-based methods, employ the analysis or enhancement of object spatial-domain characteristics to achieve implicit denoising. In this paper, we propose DenoDet V2, which explores a completely novel and different perspective to deconstruct and modulate the features in the transform domain via a carefully designed attention architecture. Compared to DenoDet V1, DenoDet V2 is a major advancement that exploits the complementary nature of amplitude and phase information through a band-wise mutual modulation mechanism, which enables a reciprocal enhancement between phase and amplitude spectra. Extensive experiments on various SAR datasets demonstrate the state-of-the-art performance of DenoDet V2. Notably, DenoDet V2 achieves a significant 0.8% improvement on the SARDet-100K dataset compared to DenoDet V1, while reducing the model complexity by half.
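The amplitude and phase spectra mentioned in the abstract can be illustrated with a small sketch. This is not the DenoDet V2 architecture: it only shows, on a 1-D toy signal, how a transform-domain representation splits into amplitude and phase, how the two recombine losslessly, and how per-band gains (a hypothetical stand-in for learned modulation) act on the amplitude alone. All function names and the `gains` values are assumptions made for illustration.

```python
import cmath
import math


def dft(signal):
    """Naive discrete Fourier transform of a sequence."""
    n = len(signal)
    return [
        sum(signal[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft(spectrum):
    """Naive inverse discrete Fourier transform."""
    n = len(spectrum)
    return [
        sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n
        for t in range(n)
    ]


def split_amplitude_phase(spectrum):
    """Return (amplitude, phase) lists for a complex spectrum."""
    return [abs(c) for c in spectrum], [cmath.phase(c) for c in spectrum]


def merge_amplitude_phase(amplitude, phase):
    """Rebuild a complex spectrum from amplitude and phase."""
    return [cmath.rect(a, p) for a, p in zip(amplitude, phase)]


signal = [1.0, 2.0, 0.5, -1.0, 0.0, 3.0, -2.0, 1.5]
amplitude, phase = split_amplitude_phase(dft(signal))

# Recombining the untouched amplitude and phase recovers the signal.
reconstructed = [c.real for c in idft(merge_amplitude_phase(amplitude, phase))]

# Hypothetical band-wise gains: halve the three highest-frequency bins (3, 4, 5).
# The gains are symmetric (gains[k] == gains[n - k]) so the output stays real.
gains = [1.0, 1.0, 1.0, 0.5, 0.5, 0.5, 1.0, 1.0]
damped_amplitude = [a * g for a, g in zip(amplitude, gains)]
damped = idft(merge_amplitude_phase(damped_amplitude, phase))
```

In a learned model the gains would depend on the features themselves, and, per the abstract, phase and amplitude would modulate each other rather than being handled separately as here.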

AAAI Conference 2026 Conference Paper

SM3Det: A Unified Model for Multi-Modal Remote Sensing Object Detection

  • Yuxuan Li
  • Xiang Li
  • Yunheng Li
  • Yicheng Zhang
  • Yimian Dai
  • Qibin Hou
  • Ming-Ming Cheng
  • Jian Yang

With the rapid advancement of remote sensing technology, high-resolution multi-modal imagery is now more widely accessible. Conventional object detection models are trained on a single dataset, often restricted to a specific imaging modality and annotation format. However, such an approach overlooks the valuable shared knowledge across multi-modalities and limits the model’s applicability in more versatile scenarios. This paper introduces a new task called Multi-Modal Datasets and Multi-Task Object Detection (M2Det) for remote sensing, designed to accurately detect horizontal or oriented objects from any sensor modality. This task poses challenges due to 1) the trade-offs involved in managing multi-modal modelling and 2) the complexities of multi-task optimization. To address these, we establish a benchmark dataset and propose a unified model, SM3Det (Single Model for Multi-Modal datasets and Multi-Task object Detection). SM3Det leverages a grid-level sparse MoE backbone to enable joint knowledge learning while preserving distinct feature representations for different modalities. Furthermore, we propose a novel consistency and synchronization optimization mechanism, allowing it to effectively handle varying levels of learning difficulty across modalities and tasks. Extensive experiments demonstrate SM3Det's effectiveness and generalizability, consistently outperforming the combination of specialized models on individual datasets.

AAAI Conference 2026 Conference Paper

Strip R-CNN: Large Strip Convolution for Remote Sensing Object Detection

  • Xinbin Yuan
  • Zhaohui Zheng
  • Yuxuan Li
  • Xialei Liu
  • Li Liu
  • Xiang Li
  • Qibin Hou
  • Ming-Ming Cheng

In this paper, we show that current approaches using large square kernels or transformer-based global modeling aggregate contextual information uniformly across spatial dimensions, leading to feature dilution and localization errors for elongated targets. To mitigate this issue, we propose Strip R-CNN, the first work to systematically explore large strip convolutions for remote sensing object detection. Our key insight is that strip convolutions enable directional feature aggregation along the dominant spatial dimension of slender objects, reducing background interference while preserving essential geometric information. We design two core components: (i) StripNet, a backbone network employing sequential orthogonal large strip convolutions to capture anisotropic spatial patterns, and (ii) Strip Head, which enhances localization precision by incorporating strip convolutions into the detection head. Unlike previous large-kernel approaches that suffer from computational redundancy and isotropic limitations, our method achieves superior performance with remarkable efficiency. Extensive experiments on multiple benchmarks (DOTA, FAIR1M, HRSC2016, and DIOR) demonstrate significant improvements, with our 30M parameter model achieving 82.75% mAP on DOTA-v1.0, establishing a new state-of-the-art record while providing new insights into anisotropic feature learning for remote sensing applications.
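A minimal sketch of the idea of sequential orthogonal strip convolutions follows. It is an illustration under stated assumptions, not the StripNet implementation: the kernels are fixed averaging filters rather than learned weights, `k` is assumed odd, and the helper names `conv2d_same` and `strip_block` are hypothetical. It shows that a 1×k pass followed by a k×1 pass covers a k×k neighbourhood with 2k weights instead of k².

```python
def conv2d_same(image, kernel):
    """Zero-padded 'same' 2-D cross-correlation on nested lists."""
    h, w = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    ph, pw = kh // 2, kw // 2
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = 0.0
            for a in range(kh):
                for b in range(kw):
                    y, x = i + a - ph, j + b - pw
                    if 0 <= y < h and 0 <= x < w:
                        acc += image[y][x] * kernel[a][b]
            out[i][j] = acc
    return out


def strip_block(image, k):
    """Apply a horizontal 1 x k strip, then a vertical k x 1 strip."""
    horizontal = [[1.0 / k] * k]               # shape 1 x k
    vertical = [[1.0 / k] for _ in range(k)]   # shape k x 1
    return conv2d_same(conv2d_same(image, horizontal), vertical)


# A single bright pixel in the centre of a 7 x 7 image.
image = [[0.0] * 7 for _ in range(7)]
image[3][3] = 1.0

k = 5
out = strip_block(image, k)
strip_weights = 2 * k      # weights used by the two strips
square_weights = k * k     # weights a dense k x k kernel would need
```

After both passes the impulse has spread over a 5×5 window, which is the receptive field of a dense 5×5 kernel, while only 10 weights were involved instead of 25. Applying just one of the two strips would aggregate along a single direction, which is the directional behaviour the abstract associates with slender objects.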

NeurIPS Conference 2025 Conference Paper

AngleRoCL: Angle-Robust Concept Learning for Physically View-Invariant Adversarial Patches

  • Wenjun Ji
  • Yuxiang Fu
  • Luyang Ying
  • Deng-Ping Fan
  • Yuyi Wang
  • Ming-Ming Cheng
  • Ivor Tsang
  • Qing Guo

Cutting-edge works have demonstrated that text-to-image (T2I) diffusion models can generate adversarial patches that mislead state-of-the-art object detectors in the physical world, revealing detectors' vulnerabilities and risks. However, these methods neglect the T2I patches' attack effectiveness when observed from different views in the physical world (i.e., angle robustness of the T2I adversarial patches). In this paper, we study the angle robustness of T2I adversarial patches comprehensively, revealing their angle-robustness issues, demonstrating that texts affect the angle robustness of generated patches significantly, and task-specific linguistic instructions fail to enhance the angle robustness. Motivated by the studies, we introduce Angle-Robust Concept Learning (AngleRoCL), a simple and flexible approach that learns a generalizable concept (i.e., text embeddings in implementation) representing the capability of generating angle-robust patches. The learned concept can be incorporated into textual prompts and guides T2I models to generate patches with their attack effectiveness inherently resistant to viewpoint variations. Through extensive simulation and physical-world experiments on five SOTA detectors across multiple views, we demonstrate that AngleRoCL significantly enhances the angle robustness of T2I adversarial patches compared to baseline methods. Our patches maintain high attack success rates even under challenging viewing conditions, with over 50% average relative improvement in attack effectiveness across multiple angles. This research advances the understanding of physically angle-robust patches and provides insights into the relationship between textual concepts and physical properties in T2I-generated content. We released our code at https://github.com/tsingqguo/anglerocl.

NeurIPS Conference 2025 Conference Paper

DepthVanish: Optimizing Adversarial Interval Structures for Stereo-Depth-Invisible Patches

  • Yun Xing
  • Yue Cao
  • Nhat Chung
  • Jie Zhang
  • Ivor Tsang
  • Ming-Ming Cheng
  • Yang Liu
  • Lei Ma

Stereo depth estimation is a critical task in autonomous driving and robotics, where inaccuracies (such as misidentifying nearby objects as distant) can lead to dangerous situations. Adversarial attacks against stereo depth estimation can help reveal vulnerabilities before deployment. Previous works have shown that repeating optimized textures can effectively mislead stereo depth estimation in digital settings. However, our research reveals that these naively repeated textures perform poorly in physical implementations, $\textit{i.e.}$, when deployed as patches, limiting their practical utility for stress-testing stereo depth estimation systems. In this work, for the first time, we discover that introducing regular intervals among the repeated textures, creating a grid structure, significantly enhances the patch attack performance. Through extensive experimentation, we analyze how variations of this novel structure influence the adversarial effectiveness. Based on these insights, we develop a novel stereo depth attack that jointly optimizes both the interval structure and texture elements. Our generated adversarial patches can be inserted into any scene and successfully attack advanced stereo depth estimation methods of different paradigms, $\textit{i.e.}$, RAFT-Stereo and STTR. Most critically, our patch can also attack commercial RGB-D cameras (Intel RealSense) in real-world conditions, demonstrating its practical relevance for security assessment of stereo systems. The code is officially released at: https://github.com/WiWiN42/DepthVanish

NeurIPS Conference 2025 Conference Paper

DIPO: Dual-State Images Controlled Articulated Object Generation Powered by Diverse Data

  • Ruiqi Wu
  • Xinjie Wang
  • Chun-Le Guo
  • Jiaxiong Qiu
  • Chongyi Li
  • Lichao Huang
  • Zhizhong Su
  • Ming-Ming Cheng

We present DIPO, a novel framework for the controllable generation of articulated 3D objects from a pair of images: one depicting the object in a resting state and the other in an articulated state. Compared to the single-image approach, our dual-image input imposes only a modest overhead for data collection, but at the same time provides important motion information, which is a reliable guide for predicting kinematic relationships between parts. Specifically, we propose a dual-image diffusion model that captures relationships between the image pair to generate part layouts and joint parameters. In addition, we introduce a Chain-of-Thought (CoT) based graph reasoner that explicitly infers part connectivity relationships. To further improve robustness and generalization on complex articulated objects, we develop a fully automated dataset expansion pipeline, named LEGO-Art, that enriches the diversity and complexity of the PartNet-Mobility dataset. We propose PM-X, a large-scale dataset of complex articulated 3D objects, accompanied by rendered images, URDF annotations, and textual descriptions. Extensive experiments demonstrate that DIPO significantly outperforms existing baselines in both the resting state and the articulated state, while the proposed PM-X dataset further enhances generalization to diverse and structurally complex articulated objects. Our code and dataset are available at https://github.com/RQ-Wu/DIPO.

AAAI Conference 2025 Conference Paper

From Words to Worth: Newborn Article Impact Prediction with LLM

  • Penghai Zhao
  • Qinghua Xing
  • Kairan Dou
  • Jinyu Tian
  • Ying Tai
  • Jian Yang
  • Ming-Ming Cheng
  • Xiang Li

Predicting the future impact of newly published articles is pivotal for advancing scientific discovery in an era of unprecedented scholarly expansion. This paper introduces a promising approach, leveraging the capabilities of LLMs to predict the future impact of newborn articles solely based on titles and abstracts. Breaking away from traditional methods heavily reliant on external data, we propose fine-tuning the LLM to uncover the intrinsic semantic patterns shared by highly impactful articles from a vast collection of text-score pairs. These semantic features are further utilized to predict the proposed indicator, TNCSIsp, which incorporates favorable normalization properties across value, field, and time. To facilitate parameter-efficient fine-tuning of the LLM, we have also meticulously curated a dataset containing over 12,000 entries, each annotated with titles, abstracts, and their corresponding TNCSIsp values. Experimental results reveal an MAE of 0.216 and an NDCG@20 of 0.901, setting new benchmarks in predicting the impact of newborn articles. Finally, we present a real-world application example for predicting the impact of newborn journal articles to demonstrate its noteworthy practical value. Overall, our findings challenge existing paradigms and propose a shift towards a more content-focused prediction of academic impact, offering new insights for article impact prediction.

ICLR Conference 2025 Conference Paper

InterLCM: Low-Quality Images as Intermediate States of Latent Consistency Models for Effective Blind Face Restoration

  • Senmao Li
  • Kai Wang 0060
  • Joost van de Weijer 0001
  • Fahad Shahbaz Khan
  • Chun-Le Guo
  • Shiqi Yang 0002
  • Yaxing Wang
  • Jian Yang 0003

Diffusion priors have been used for blind face restoration (BFR) by fine-tuning diffusion models (DMs) on restoration datasets to recover low-quality images. However, the naive application of DMs presents several key limitations: (i) the diffusion prior has inferior semantic consistency (e.g., ID, structure, and color), increasing the difficulty of optimizing the BFR model; (ii) it relies on hundreds of denoising iterations, preventing effective cooperation with perceptual losses, which is crucial for faithful restoration. Observing that the latent consistency model (LCM) learns consistency noise-to-data mappings on the ODE-trajectory and therefore shows more semantic consistency in the subject identity, structural information and color preservation, we propose $\textit{InterLCM}$ to leverage the LCM for its superior semantic consistency and efficiency to counter the above issues. Treating low-quality images as the intermediate state of LCM, $\textit{InterLCM}$ achieves a balance between fidelity and quality by starting from earlier LCM steps. LCM also allows the integration of perceptual loss during training, leading to improved restoration quality, particularly in real-world scenarios. To mitigate structural and semantic uncertainties, $\textit{InterLCM}$ incorporates a Visual Module to extract visual features and a Spatial Encoder to capture spatial details, enhancing the fidelity of restored images. Extensive experiments demonstrate that $\textit{InterLCM}$ outperforms existing approaches in both synthetic and real-world datasets while also achieving faster inference speed. Code and models will be publicly available.

NeurIPS Conference 2025 Conference Paper

Knowledge Graph Enhanced Generative Multi-modal Models for Class-Incremental Learning

  • Xusheng Cao
  • Haori Lu
  • Linlan Huang
  • Fei Yang
  • Xialei Liu
  • Ming-Ming Cheng

Continual learning in computer vision faces the critical challenge of catastrophic forgetting, where models struggle to retain prior knowledge while adapting to new tasks. Although recent studies have attempted to leverage the generalization capabilities of pre-trained models to mitigate overfitting on current tasks, models still tend to forget details of previously learned categories as tasks progress, leading to misclassification. To address these limitations, we introduce a novel Knowledge Graph Enhanced Generative Multi-modal model (KG-GMM) that builds an evolving knowledge graph throughout the learning process. Our approach utilizes relationships within the knowledge graph to augment the class labels and assigns different relations to similar categories to enhance model differentiation. During testing, we propose a Knowledge Graph Augmented Inference method that locates specific categories by analyzing relationships within the generated text, thereby reducing the loss of detailed information about old classes when learning new knowledge and alleviating forgetting. Experiments demonstrate that our method effectively leverages relational information to help the model correct mispredictions, achieving state-of-the-art results in both conventional CIL and few-shot CIL settings, confirming the efficacy of knowledge graphs at preserving knowledge in the continual learning scenarios.

ICLR Conference 2025 Conference Paper

Multimodality Helps Few-shot 3D Point Cloud Semantic Segmentation

  • Zhaochong An
  • Guolei Sun
  • Yun Liu 0011
  • Runjia Li
  • Min Wu 0008
  • Ming-Ming Cheng
  • Ender Konukoglu
  • Serge J. Belongie

Few-shot 3D point cloud segmentation (FS-PCS) aims at generalizing models to segment novel categories with minimal annotated support samples. While existing FS-PCS methods have shown promise, they primarily focus on unimodal point cloud inputs, overlooking the potential benefits of leveraging multimodal information. In this paper, we address this gap by introducing a multimodal FS-PCS setup, utilizing textual labels and the potentially available 2D image modality. Under this easy-to-achieve setup, we present the MultiModal Few-Shot SegNet (MM-FSS), a model effectively harnessing complementary information from multiple modalities. MM-FSS employs a shared backbone with two heads to extract intermodal and unimodal visual features, and a pretrained text encoder to generate text embeddings. To fully exploit the multimodal information, we propose a Multimodal Correlation Fusion (MCF) module to generate multimodal correlations, and a Multimodal Semantic Fusion (MSF) module to refine the correlations using text-aware semantic guidance. Additionally, we propose a simple yet effective Test-time Adaptive Cross-modal Calibration (TACC) technique to mitigate training bias, further improving generalization. Experimental results on S3DIS and ScanNet datasets demonstrate significant performance improvements achieved by our method. The efficacy of our approach indicates the benefits of leveraging commonly-ignored free modalities for FS-PCS, providing valuable insights for future research. The code is available at github.com/ZhaochongAn/Multimodality-3D-Few-Shot.

NeurIPS Conference 2025 Conference Paper

OmniSegmentor: A Flexible Multi-Modal Learning Framework for Semantic Segmentation

  • Bo-Wen Yin
  • Jiao-Long Cao
  • Xuying Zhang
  • Yuming Chen
  • Ming-Ming Cheng
  • Qibin Hou

Recent research on representation learning has proved the merits of multi-modal clues for robust semantic segmentation. Nevertheless, a flexible pretrain-and-finetune pipeline for multiple visual modalities remains unexplored. In this paper, we propose a novel multi-modal learning framework, termed OmniSegmentor. It has two key innovations: 1) Based on ImageNet, we assemble a large-scale dataset for multi-modal pretraining, called OmniSegmentor, which contains five popular visual modalities; 2) We provide an efficient pretraining manner to endow the model with the capacity to encode different modality information in the OmniSegmentor. For the first time, we introduce a universal multi-modal pretraining framework that consistently amplifies the model's perceptual capabilities across various scenarios, regardless of the arbitrary combination of the involved modalities. Remarkably, our OmniSegmentor achieves new state-of-the-art records on a wide range of multi-modal semantic segmentation datasets, including NYU Depthv2, EventScape, MFNet, DeLiVER, SUNRGBD, and KITTI-360. Data, model checkpoints, and source code will be made publicly available: https://github.com/VCIP-RGBD/DFormer.

ICLR Conference 2025 Conference Paper

One-Prompt-One-Story: Free-Lunch Consistent Text-to-Image Generation Using a Single Prompt

  • Tao Liu
  • Kai Wang 0060
  • Senmao Li
  • Joost van de Weijer 0001
  • Fahad Shahbaz Khan
  • Shiqi Yang 0002
  • Yaxing Wang
  • Jian Yang 0003

Text-to-image generation models can create high-quality images from input prompts. However, they struggle to generate the consistent, identity-preserving images that storytelling requires. Existing approaches to this problem typically require extensive training on large datasets or additional modifications to the original model architectures. This limits their applicability across different domains and diverse diffusion model configurations. In this paper, we first observe the inherent capability of language models, coined $\textit{context consistency}$, to comprehend identity through context with a single prompt. Drawing inspiration from the inherent $\textit{context consistency}$, we propose a novel $\textit{training-free}$ method for consistent text-to-image (T2I) generation, termed "One-Prompt-One-Story" ($\textit{1Prompt1Story}$). Our approach $\textit{1Prompt1Story}$ concatenates all prompts into a single input for T2I diffusion models, initially preserving character identities. We then refine the generation process using two novel techniques: $\textit{Singular-Value Reweighting}$ and $\textit{Identity-Preserving Cross-Attention}$, ensuring better alignment with the input description for each frame. In our experiments, we compare our method against various existing consistent T2I generation approaches to demonstrate its effectiveness, through quantitative metrics and qualitative assessments. Code is available at https://github.com/byliutao/1Prompt1Story.

ICLR Conference 2025 Conference Paper

Re-Aligning Language to Visual Objects with an Agentic Workflow

  • Yuming Chen
  • Jiangyan Feng
  • Haodong Zhang
  • Lijun Gong
  • Feng Zhu 0006
  • Rui Zhao 0001
  • Qibin Hou
  • Ming-Ming Cheng

Language-based object detection (LOD) aims to align visual objects with language expressions. A large amount of paired data is utilized to improve LOD model generalizations. During the training process, recent studies leverage vision-language models (VLMs) to automatically generate human-like expressions for visual objects, facilitating training data scaling up. In this process, we observe that VLM hallucinations bring inaccurate object descriptions (e.g., object name, color, and shape) to deteriorate VL alignment quality. To reduce VLM hallucinations, we propose an agentic workflow controlled by an LLM to re-align language to visual objects via adaptively adjusting image and text prompts. We name this workflow Real-LOD, which includes planning, tool use, and reflection steps. Given an image with detected objects and VLM raw language expressions, Real-LOD reasons its state automatically and arranges action based on our neural symbolic designs (i.e., planning). The action will adaptively adjust the image and text prompts and send them to VLMs for object re-description (i.e., tool use). Then, we use another LLM to analyze these refined expressions for feedback (i.e., reflection). These steps are conducted in a cyclic form to gradually improve language descriptions for re-aligning to visual objects. We construct a dataset that contains a tiny amount of 0.18M images with re-aligned language expression and train a prevalent LOD model to surpass existing LOD methods by around 50% on the standard benchmarks. Our Real-LOD workflow, with automatic VL refinement, reveals a potential to preserve data quality along with scaling up data quantity, which further improves LOD performance from a data-alignment perspective.

NeurIPS Conference 2025 Conference Paper

Representation Entanglement for Generation: Training Diffusion Transformers Is Much Easier Than You Think

  • Ge Wu
  • Shen Zhang
  • Ruijing Shi
  • Shanghua Gao
  • Zhenyuan Chen
  • Lei Wang
  • Zhaowei Chen
  • Hongcheng Gao

REPA and its variants effectively mitigate training challenges in diffusion models by incorporating external visual representations from pretrained models, through alignment between the noisy hidden projections of denoising networks and foundational clean image representations. We argue that the external alignment, which is absent during the entire denoising inference process, falls short of fully harnessing the potential of discriminative representations. In this work, we propose a straightforward method called $\textit{\textbf{R}epresentation \textbf{E}ntanglement for \textbf{G}eneration}$ ($\textbf{REG}$), which entangles low-level image latents with a single high-level class token from pretrained foundation models for denoising. REG acquires the capability to produce coherent image-class pairs directly from pure noise, substantially improving both generation quality and training efficiency. This is accomplished with negligible additional inference overhead, requiring only a single additional token for denoising (<0.5% increase in FLOPs and latency). The inference process concurrently reconstructs both image latents and their corresponding global semantics, where the acquired semantic knowledge actively guides and enhances the image generation process. On ImageNet 256$\times$256, SiT-XL/2 + REG demonstrates remarkable convergence acceleration, achieving $\textbf{63}\times$ and $\textbf{23}\times$ faster training than SiT-XL/2 and SiT-XL/2 + REPA, respectively. More impressively, SiT-L/2 + REG trained for merely 400K iterations outperforms SiT-XL/2 + REPA trained for 4M iterations ($\textbf{10}\times$ longer). Code is available at: https://github.com/Martinser/REG.

NeurIPS Conference 2025 Conference Paper

Revisiting End-to-End Learning with Slide-level Supervision in Computational Pathology

  • Wenhao Tang
  • Rong Qin
  • Heng Fang
  • Fengtao Zhou
  • Hao Chen
  • Xiang Li
  • Ming-Ming Cheng

Pre-trained encoders for offline feature extraction followed by multiple instance learning (MIL) aggregators have become the dominant paradigm in computational pathology (CPath), benefiting cancer diagnosis and prognosis. However, performance limitations arise from the absence of encoder fine-tuning for downstream tasks and disjoint optimization with MIL. While slide-level supervised end-to-end (E2E) learning is an intuitive solution to this issue, it faces challenges such as high computational demands and suboptimal results. These limitations motivate us to revisit E2E learning. We argue that prior work neglects inherent E2E optimization challenges, leading to performance disparities compared to traditional two-stage methods. In this paper, we pioneer the elucidation of the optimization challenge caused by sparse-attention MIL and propose a novel MIL called ABMILX. ABMILX mitigates this problem through global correlation-based attention refinement and multi-head mechanisms. With the efficient multi-scale random patch sampling strategy, an E2E trained ResNet with ABMILX surpasses SOTA foundation models under the two-stage paradigm across multiple challenging benchmarks, while remaining computationally efficient ($<$ 10 RTX3090 GPU hours). We demonstrate the potential of E2E learning in CPath and call for greater research focus in this area. The code is available at https://github.com/DearCaat/E2E-WSI-ABMILX.

NeurIPS Conference 2025 Conference Paper

TempSamp-R1: Effective Temporal Sampling with Reinforcement Fine-Tuning for Video LLMs

  • Yunheng Li
  • Jing Cheng
  • Shaoyong Jia
  • Hangyi Kuang
  • Shaohui Jiao
  • Qibin Hou
  • Ming-Ming Cheng

This paper introduces TempSamp-R1, a new reinforcement fine-tuning framework designed to improve the effectiveness of adapting multimodal large language models (MLLMs) to video temporal grounding tasks. We reveal that existing reinforcement learning methods, such as Group Relative Policy Optimization (GRPO), rely on on-policy sampling for policy updates. However, in tasks with large temporal search spaces, this strategy becomes both inefficient and limited in performance, as it often fails to identify temporally accurate solutions. To address this limitation, TempSamp-R1 leverages ground-truth annotations as off-policy supervision to provide temporally precise guidance, effectively compensating for the sparsity and misalignment in on-policy solutions. To further stabilize training and reduce variance in reward-based updates, TempSamp-R1 provides a non-linear soft advantage computation method that dynamically reshapes the reward feedback via an asymmetric transformation. By employing a hybrid Chain-of-Thought (CoT) training paradigm, TempSamp-R1 optimizes a single unified model to support both CoT and non-CoT inference modes, enabling efficient handling of queries with varying reasoning complexity. Experimental results demonstrate that TempSamp-R1 outperforms GRPO-based baselines, establishing new state-of-the-art performance on benchmark datasets: Charades-STA (R1@0.7: 52.9%, +2.7%), ActivityNet Captions (R1@0.5: 56.0%, +5.3%), and QVHighlights (mAP: 30.0%, +3.0%). Moreover, TempSamp-R1 shows robust few-shot generalization capabilities under limited data. Code is available at https://github.com/HVision-NKU/TempSamp-R1.
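The abstract reports scores such as R1 at an IoU threshold of 0.7 without defining the metric. The sketch below assumes the definition commonly used in temporal grounding: a top-1 predicted segment counts as a hit when its temporal intersection-over-union with the ground-truth segment reaches the threshold. The function names and the sample segments are hypothetical and are not taken from the paper or its benchmarks.

```python
def temporal_iou(pred, gt):
    """Intersection-over-union of two (start, end) segments in seconds."""
    (ps, pe), (gs, ge) = pred, gt
    inter = max(0.0, min(pe, ge) - max(ps, gs))
    union = (pe - ps) + (ge - gs) - inter
    return inter / union if union > 0 else 0.0


def recall_at_1(predictions, ground_truths, threshold):
    """Fraction of queries whose top-1 prediction reaches the IoU threshold."""
    hits = sum(
        temporal_iou(p, g) >= threshold
        for p, g in zip(predictions, ground_truths)
    )
    return hits / len(ground_truths)


# Made-up top-1 predictions and annotations for three queries.
predictions = [(2.0, 8.0), (0.0, 5.0), (10.0, 20.0)]
ground_truths = [(2.0, 10.0), (4.0, 9.0), (11.0, 19.0)]

ious = [temporal_iou(p, g) for p, g in zip(predictions, ground_truths)]
r1_at_07 = recall_at_1(predictions, ground_truths, 0.7)
```

Here the first and third predictions overlap their annotations with IoU 0.75 and 0.8, while the second barely overlaps, so two of the three queries count as hits at threshold 0.7.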

ICML Conference 2024 Conference Paper

Cascade-CLIP: Cascaded Vision-Language Embeddings Alignment for Zero-Shot Semantic Segmentation

  • Yunheng Li
  • Zhong-Yu Li 0006
  • Quan-Sheng Zeng 0001
  • Qibin Hou
  • Ming-Ming Cheng

Pre-trained vision-language models, e.g., CLIP, have been successfully applied to zero-shot semantic segmentation. Existing CLIP-based approaches primarily utilize visual features from the last layer to align with text embeddings, while they neglect the crucial information in intermediate layers that contain rich object details. However, we find that directly aggregating the multi-level visual features weakens the zero-shot ability for novel classes. The large differences between the visual features from different layers make these features hard to align well with the text embeddings. We resolve this problem by introducing a series of independent decoders to align the multi-level visual features with the text embeddings in a cascaded way, forming a novel but simple framework named Cascade-CLIP. Our Cascade-CLIP is flexible and can be easily applied to existing zero-shot semantic segmentation methods. Experimental results show that our simple Cascade-CLIP achieves superior zero-shot performance on segmentation benchmarks, like COCO-Stuff, Pascal-VOC, and Pascal-Context. Our code is available at https://github.com/HVision-NKU/Cascade-CLIP.

ICLR Conference 2024 Conference Paper

DFormer: Rethinking RGBD Representation Learning for Semantic Segmentation

  • Bowen Yin
  • Xuying Zhang
  • Zhong-Yu Li 0006
  • Li Liu 0002
  • Ming-Ming Cheng
  • Qibin Hou

We present DFormer, a novel RGB-D pretraining framework to learn transferable representations for RGB-D segmentation tasks. DFormer has two key innovations: 1) Unlike previous works that encode RGB-D information with RGB-pretrained backbones, we pretrain the backbone using image-depth pairs from ImageNet-1K, and thus the DFormer is endowed with the capacity to encode RGB-D representations; 2) DFormer comprises a sequence of RGB-D blocks, which are tailored for encoding both RGB and depth information through a novel building block design. DFormer avoids the mismatched encoding of the 3D geometry relationships in depth maps by RGB-pretrained backbones, a problem that is widespread in existing methods but has not been resolved. We finetune the pretrained DFormer on two popular RGB-D tasks, i.e., RGB-D semantic segmentation and RGB-D salient object detection, with a lightweight decoder head. Experimental results show that our DFormer achieves new state-of-the-art performance on these two tasks with less than half of the computational cost of the current best methods on two RGB-D semantic segmentation datasets and five RGB-D salient object detection datasets. Code will be made publicly available.

NeurIPS Conference 2024 Conference Paper

Faster Diffusion: Rethinking the Role of the Encoder for Diffusion Model Inference

  • Senmao Li
  • Taihang Hu
  • Joost van de Weijer
  • Fahad S. Khan
  • Tao Liu
  • Linxuan Li
  • Shiqi Yang
  • Yaxing Wang

One of the main drawbacks of diffusion models is the slow inference time for image generation. Among the most successful approaches to addressing this problem are distillation methods. However, these methods require considerable computational resources. In this paper, we take another approach to diffusion model acceleration. We conduct a comprehensive study of the UNet encoder and empirically analyze the encoder features. This provides insights regarding their changes during the inference process. In particular, we find that encoder features change minimally, whereas the decoder features exhibit substantial variations across different time-steps. This insight motivates us to omit encoder computation at certain adjacent time-steps and reuse encoder features of previous time-steps as input to the decoder in multiple time-steps. Importantly, this allows us to perform decoder computation in parallel, further accelerating the denoising process. Additionally, we introduce a prior noise injection method to improve the texture details in the generated image. Besides the standard text-to-image task, we also validate our approach on other tasks: text-to-video, personalized generation and reference-guided generation. Without utilizing any knowledge distillation technique, our approach accelerates both the Stable Diffusion (SD) and DeepFloyd-IF model sampling by 41% and 24% respectively, and DiT model sampling by 34%, while maintaining high-quality generation performance. Our code will be publicly released.
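The encoder-reuse idea described in the abstract can be sketched as a caching loop. This is a toy illustration under assumptions, not the authors' code: `encoder` and `decoder` are hypothetical stand-ins for the two halves of a UNet, the latent is a single float, the choice of key time-steps is arbitrary, and neither the parallel decoder evaluation nor the prior noise injection is shown.

```python
def sample(timesteps, key_timesteps, encoder, decoder, latent):
    """Run a denoising loop, recomputing encoder features only at key steps."""
    cached_features = None
    encoder_calls = 0
    for t in timesteps:
        if t in key_timesteps or cached_features is None:
            cached_features = encoder(latent, t)  # fresh encoder pass
            encoder_calls += 1
        # Non-key steps reuse the most recent cached encoder features.
        latent = decoder(latent, cached_features, t)
    return latent, encoder_calls


def toy_encoder(latent, t):
    """Hypothetical encoder: a cheap function of the latent."""
    return 0.5 * latent


def toy_decoder(latent, features, t):
    """Hypothetical decoder: one small denoising update."""
    return latent - 0.1 * features


timesteps = list(range(10, 0, -1))   # 10, 9, ..., 1
key_timesteps = {10, 7, 4, 1}        # arbitrary choice for illustration

_, calls_with_reuse = sample(timesteps, key_timesteps, toy_encoder, toy_decoder, 1.0)
_, calls_without_reuse = sample(timesteps, set(timesteps), toy_encoder, toy_decoder, 1.0)
```

With these settings the encoder runs 4 times instead of 10. Because the non-key steps no longer need a new encoder pass, their decoder passes depend only on cached features, which is what makes the parallel decoding mentioned in the abstract possible.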

AAAI Conference 2024 Conference Paper

Fine-Grained Knowledge Selection and Restoration for Non-exemplar Class Incremental Learning

  • Jiang-Tian Zhai
  • Xialei Liu
  • Lu Yu
  • Ming-Ming Cheng

Non-exemplar class incremental learning aims to learn both the new and old tasks without accessing any training data from the past. This strict restriction increases the difficulty of alleviating catastrophic forgetting since all techniques can only be applied to current task data. Considering this challenge, we propose a novel framework of fine-grained knowledge selection and restoration. The conventional knowledge distillation-based methods place overly strict constraints on the network parameters and features to prevent forgetting, which limits the training of new tasks. To loosen this constraint, we propose a novel fine-grained selective patch-level distillation to adaptively balance plasticity and stability. Some task-agnostic patches can be used to preserve the decision boundary of the old task, while some patches containing the important foreground are favorable for learning the new task. Moreover, we employ a task-agnostic mechanism to generate more realistic prototypes of old tasks with the current task sample for reducing classifier bias for fine-grained knowledge restoration. Extensive experiments on CIFAR100, TinyImageNet and ImageNet-Subset demonstrate the effectiveness of our method. Code is available at https://github.com/scok30/vit-cil.

IJCAI Conference 2024 Conference Paper

Let’s Start Over: Retraining with Selective Samples for Generalized Category Discovery

  • Zhimao Peng
  • Enguang Wang
  • Xialei Liu
  • Ming-Ming Cheng

Generalized Category Discovery (GCD) presents a realistic and challenging problem in open-world learning. Given a partially labeled dataset, GCD aims to categorize unlabeled data by leveraging visual knowledge from the labeled data, where the unlabeled data includes both known and unknown classes. Existing methods based on parametric/non-parametric classifiers attempt to generate pseudo-labels/relationships for the unlabeled data to enhance representation learning. However, the lack of ground-truth labels for novel classes often leads to noisy pseudo-labels/relationships, resulting in suboptimal representation learning. This paper introduces a novel method using Nearest Neighbor Distance-aware Label Consistency sample selection. It creates class-consistent subsets for novel class sample clusters from the current GCD method, acting as “pseudo-labeled sets” to mitigate representation bias. We propose progressive supervised representation learning with selected samples to optimize the trade-off between quantity and purity in each subset. Our method is versatile and applicable to various GCD methods, whether parametric or non-parametric. We conducted extensive experiments on multiple generic and fine-grained image classification datasets to evaluate the effectiveness of our approach. The results demonstrate the superiority of our method in achieving improved performance in generalized category discovery tasks.

NeurIPS Conference 2024 Conference Paper

OPUS: Occupancy Prediction Using a Sparse Set

  • Jiabao Wang
  • Zhaojiang Liu
  • Qiang Meng
  • Liujiang Yan
  • Ke Wang
  • Jie Yang
  • Wei Liu
  • Qibin Hou

Occupancy prediction, aiming at predicting the occupancy status within a voxelized 3D environment, is quickly gaining momentum within the autonomous driving community. Mainstream occupancy prediction works first discretize the 3D environment into voxels, then perform classification on such dense grids. However, inspection of sample data reveals that the vast majority of voxels are unoccupied. Performing classification on these empty voxels leads to suboptimal allocation of computation resources, and reducing such empty voxels necessitates complex algorithm designs. To this end, we present a novel perspective on the occupancy prediction task: formulating it as a streamlined set prediction paradigm without the need for explicit space modeling or complex sparsification procedures. Our proposed framework, called OPUS, utilizes a transformer encoder-decoder architecture to simultaneously predict occupied locations and classes using a set of learnable queries. Firstly, we employ the Chamfer distance loss to scale the set-to-set comparison problem to unprecedented magnitudes, making training such a model end-to-end a reality. Subsequently, semantic classes are adaptively assigned using nearest neighbor search based on the learned locations. In addition, OPUS incorporates a suite of non-trivial strategies to enhance model performance, including coarse-to-fine learning, consistent point sampling, and adaptive re-weighting. Finally, compared with current state-of-the-art methods, our lightest model achieves superior RayIoU on the Occ3D-nuScenes dataset at nearly 2x FPS, while our heaviest model surpasses previous best results by 6.1 RayIoU.
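For readers unfamiliar with the Chamfer distance named in this abstract, a minimal sketch follows. It assumes the common symmetric form with Euclidean point distances averaged over each set; the exact variant and weighting used by OPUS are not stated in the abstract, and the function and variable names are illustrative.

```python
import math


def chamfer_distance(pred, target):
    """Symmetric Chamfer distance between two non-empty point sets.

    Each point in one set is matched to its nearest neighbor in the other
    set, and the mean nearest-neighbor distances in both directions are
    summed. No one-to-one assignment between the sets is required.
    """
    def mean_nearest(src, dst):
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)

    return mean_nearest(pred, target) + mean_nearest(target, pred)


# Hypothetical predicted and ground-truth occupied locations.
pred = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
target = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 1.0, 0.0)]
print(chamfer_distance(pred, target))  # close to 1/3: one target point is unmatched
```

Because the two sets may have different sizes and need no explicit matching, this kind of loss suits set prediction; the brute-force nearest-neighbor search here is quadratic and only meant for illustration.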

NeurIPS Conference 2024 Conference Paper

SARDet-100K: Towards Open-Source Benchmark and ToolKit for Large-Scale SAR Object Detection

  • Yuxuan Li
  • Xiang Li
  • Weijie Li
  • Qibin Hou
  • Li Liu
  • Ming-Ming Cheng
  • Jian Yang

Synthetic Aperture Radar (SAR) object detection has gained significant attention recently due to its irreplaceable all-weather imaging capabilities. However, this research field suffers from both limited public datasets (mostly comprising <2K images with only mono-category objects) and inaccessible source code. To tackle these challenges, we establish a new benchmark dataset and an open-source method for large-scale SAR object detection. Our dataset, SARDet-100K, is a result of intense surveying, collecting, and standardizing 10 existing SAR detection datasets, providing a large-scale and diverse dataset for research purposes. To the best of our knowledge, SARDet-100K is the first COCO-level large-scale multi-class SAR object detection dataset ever created. With this high-quality dataset, we conducted comprehensive experiments and uncovered a crucial challenge in SAR object detection: the substantial disparities between the pretraining on RGB datasets and finetuning on SAR datasets in terms of both data domain and model structure. To bridge these gaps, we propose a novel Multi-Stage with Filter Augmentation (MSFA) pretraining framework that tackles the problems from the perspective of data input, domain transition, and model migration. The proposed MSFA method significantly enhances the performance of SAR object detection models while demonstrating exceptional generalizability and flexibility across diverse models. This work aims to pave the way for further advancements in SAR object detection. The dataset and code are available at https://github.com/zcablii/SARDet_100K.

NeurIPS Conference 2024 Conference Paper

StoryDiffusion: Consistent Self-Attention for Long-Range Image and Video Generation

  • Yupeng Zhou
  • Daquan Zhou
  • Ming-Ming Cheng
  • Jiashi Feng
  • Qibin Hou

For recent diffusion-based generative models, maintaining consistent content across a series of generated images, especially those containing subjects and complex details, presents a significant challenge. In this paper, we propose a simple but effective self-attention mechanism, termed Consistent Self-Attention, that boosts the consistency between the generated images. It can be used to augment pre-trained diffusion-based text-to-image models in a zero-shot manner. Based on the images with consistent content, we further show that our method can be extended to long range video generation by introducing a semantic space temporal motion prediction module, named Semantic Motion Predictor. It is trained to estimate the motion conditions between two provided images in the semantic spaces. This module converts the generated sequence of images into videos with smooth transitions and consistent subjects that are more stable than the modules based on latent spaces only, especially in the context of long video generation. By merging these two novel components, our framework, referred to as StoryDiffusion, can describe a text-based story with consistent images or videos encompassing a rich variety of contents. The proposed StoryDiffusion encompasses pioneering explorations in visual story generation with the presentation of images and videos, which we hope could inspire more research from the aspect of architectural modifications.

NeurIPS Conference 2024 Conference Paper

Token Merging for Training-Free Semantic Binding in Text-to-Image Synthesis

  • Taihang Hu
  • Linxuan Li
  • Joost van de Weijer
  • Hongcheng Gao
  • Fahad S. Khan
  • Jian Yang
  • Ming-Ming Cheng
  • Kai Wang

Although text-to-image (T2I) models exhibit remarkable generation capabilities, they frequently fail to accurately bind semantically related objects or attributes in the input prompts; a challenge termed semantic binding. Previous approaches either involve intensive fine-tuning of the entire T2I model or require users or large language models to specify generation layouts, adding complexity. In this paper, we define semantic binding as the task of associating a given object with its attribute, termed attribute binding, or linking it to other related sub-objects, referred to as object binding. We introduce a novel method called Token Merging (ToMe), which enhances semantic binding by aggregating relevant tokens into a single composite token. This ensures that the object, its attributes and sub-objects all share the same cross-attention map. Additionally, to address potential confusion among main objects with complex textual prompts, we propose end token substitution as a complementary strategy. To further refine our approach in the initial stages of T2I generation, where layouts are determined, we incorporate two auxiliary losses, an entropy loss and a semantic binding loss, to iteratively update the composite token to improve the generation integrity. We conducted extensive experiments to validate the effectiveness of ToMe, comparing it against various existing methods on the T2I-CompBench and our proposed GPT-4o object binding benchmark. Our method is particularly effective in complex scenarios that involve multiple objects and attributes, which previous methods often fail to address. The code will be publicly available at https://github.com/hutaihang/ToMe

ICLR Conference 2022 Conference Paper

On the Connection between Local Attention and Dynamic Depth-wise Convolution

  • Qi Han 0007
  • Zejia Fan
  • Qi Dai 0001
  • Lei Sun
  • Ming-Ming Cheng
  • Jiaying Liu 0001
  • Jingdong Wang 0001

Vision Transformer (ViT) attains state-of-the-art performance in visual recognition, and the variant, Local Vision Transformer, makes further improvements. The major component in Local Vision Transformer, local attention, performs the attention separately over small local windows. We rephrase local attention as a channel-wise locally-connected layer and analyze it from two network regularization manners, sparse connectivity and weight sharing, as well as dynamic weight computation. We point out that local attention resembles depth-wise convolution and its dynamic variants in sparse connectivity: there is no connection across channels, and each position is connected to the positions within a small local window. The main differences lie in (i) weight sharing - depth-wise convolution shares connection weights (kernel weights) across spatial positions and attention shares the connection weights across channels, and (ii) dynamic weight computation manners - local attention is based on dot-products between pairwise positions in the local window, and dynamic convolution is based on linear projections conducted on the center representation or the globally pooled representation. The connection between local attention and dynamic depth-wise convolution is empirically verified by the ablation study about weight sharing and dynamic weight computation in Local Vision Transformer and (dynamic) depth-wise convolution. We empirically observe that the models based on depth-wise convolution and the dynamic variants with lower computation complexity perform on-par with or slightly better than Swin Transformer, an instance of Local Vision Transformer, for ImageNet classification, COCO object detection and ADE semantic segmentation. Code is available at https://github.com/Atten4Vis/DemystifyLocalViT.

NeurIPS Conference 2022 Conference Paper

SegNeXt: Rethinking Convolutional Attention Design for Semantic Segmentation

  • Meng-Hao Guo
  • Cheng-Ze Lu
  • Qibin Hou
  • Zhengning Liu
  • Ming-Ming Cheng
  • Shi-Min Hu

We present SegNeXt, a simple convolutional network architecture for semantic segmentation. Recent transformer-based models have dominated the field of semantic segmentation due to the efficiency of self-attention in encoding spatial information. In this paper, we show that convolutional attention is a more efficient and effective way to encode contextual information than the self-attention mechanism in transformers. By re-examining the characteristics owned by successful segmentation models, we discover several key components leading to the performance improvement of segmentation models. This motivates us to design a novel convolutional attention network that uses cheap convolutional operations. Without bells and whistles, our SegNeXt significantly improves the performance of previous state-of-the-art methods on popular benchmarks, including ADE20K, Cityscapes, COCO-Stuff, Pascal VOC, Pascal Context, and iSAID. Notably, SegNeXt outperforms EfficientNet-L2 w/ NAS-FPN and achieves 90.6% mIoU on the Pascal VOC 2012 test leaderboard using only 1/10 of its parameters. On average, SegNeXt achieves about 2.0% mIoU improvements compared to the state-of-the-art methods on the ADE20K datasets with the same or fewer computations.

UAI Conference 2021 Conference Paper

Structured sparsification with joint optimization of group convolution and channel shuffle

  • Xin-Yu Zhang 0023
  • Kai Zhao 0012
  • Taihong Xiao
  • Ming-Ming Cheng
  • Ming-Hsuan Yang 0001

Recent advances in convolutional neural networks (CNNs) usually come with the expense of excessive computational overhead and memory footprint. Network compression aims to alleviate this issue by training compact models with comparable performance. However, existing compression techniques either entail dedicated expert design or compromise with a moderate performance drop. In this paper, we propose a novel structured sparsification method for efficient network compression. The proposed method automatically induces structured sparsity on the convolutional weights, thereby facilitating the implementation of the compressed model with the highly-optimized group convolution. We further address the problem of inter-group communication with a learnable channel shuffle mechanism. The proposed approach can be easily applied to compress many network architectures with a negligible performance drop. Extensive experimental results and analysis demonstrate that our approach gives a competitive performance against the recent network compression counterparts with a sound accuracy-complexity trade-off.

NeurIPS Conference 2020 Conference Paper

ICNet: Intra-saliency Correlation Network for Co-Saliency Detection

  • Wen-Da Jin
  • Jun Xu
  • Ming-Ming Cheng
  • Yi Zhang
  • Wei Guo

Intra-saliency and inter-saliency cues have been extensively studied for co-saliency detection (Co-SOD). Model-based methods produce coarse Co-SOD results due to hand-crafted intra- and inter-saliency features. Current data-driven models exploit inter-saliency cues, but undervalue the potential power of intra-saliency cues. In this paper, we propose an Intra-saliency Correlation Network (ICNet) to extract intra-saliency cues from the single image saliency maps (SISMs) predicted by any off-the-shelf SOD method, and obtain inter-saliency cues by correlation techniques. Specifically, we adopt normalized masked average pooling (NMAP) to extract latent intra-saliency categories from the SISMs and semantic features as intra cues. Then we employ a correlation fusion module (CFM) to obtain inter cues by exploiting correlations between the intra cues and single-image features. To improve Co-SOD performance, we propose a category-independent rearranged self-correlation feature (RSCF) strategy. Experiments on three benchmarks show that our ICNet outperforms previous state-of-the-art methods on Co-SOD. Ablation studies validate the effectiveness of our contributions. The PyTorch code is available at https://github.com/blanclist/ICNet.

AAAI Conference 2020 Conference Paper

Image Formation Model Guided Deep Image Super-Resolution

  • Jinshan Pan
  • Yang Liu
  • Deqing Sun
  • Jimmy Ren
  • Ming-Ming Cheng
  • Jian Yang
  • Jinhui Tang

We present a simple and effective image super-resolution algorithm that imposes an image formation constraint on the deep neural networks via pixel substitution. The proposed algorithm first uses a deep neural network to estimate intermediate high-resolution images, blurs the intermediate images using known blur kernels, and then substitutes values of the pixels at the un-decimated positions with those of the corresponding pixels from the low-resolution images. The output of the pixel substitution process strictly satisfies the image formation model and is further refined by the same deep neural network in a cascaded manner. The proposed framework is trained in an end-to-end fashion and can work with existing feed-forward deep neural networks for super-resolution and converges fast in practice. Extensive experimental results show that the proposed algorithm performs favorably against state-of-the-art methods.
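The pixel substitution step described in this abstract can be sketched on plain nested lists. This is a minimal illustration under stated assumptions, not the authors' code: it assumes decimation keeps the top-left pixel of every `scale` x `scale` block, so the un-decimated positions are `(i * scale, j * scale)`, and it takes an already blurred intermediate image as input rather than performing the blur. The function name and toy data are hypothetical.

```python
def pixel_substitution(hr_blurred, lr, scale):
    """Overwrite the un-decimated positions of a blurred HR estimate with LR pixels.

    hr_blurred: blurred intermediate high-resolution image (list of rows).
    lr:         observed low-resolution image (list of rows).
    scale:      integer downsampling factor (assumed sampling grid, see above).
    """
    out = [row[:] for row in hr_blurred]  # copy so the input stays untouched
    for i, lr_row in enumerate(lr):
        for j, value in enumerate(lr_row):
            out[i * scale][j * scale] = value
    return out


# Toy 4x4 blurred estimate and 2x2 low-resolution observation.
hr_blurred = [[0.0] * 4 for _ in range(4)]
lr = [[1.0, 2.0], [3.0, 4.0]]
result = pixel_substitution(hr_blurred, lr, scale=2)

# Decimating the result reproduces the LR image exactly under this sampling grid.
decimated = [row[::2] for row in result[::2]]
print(decimated == lr)
```

The check at the end shows why the substituted output is consistent with the low-resolution observation: decimating it returns the observed pixels unchanged.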

AAAI Conference 2020 Conference Paper

Pyramid Constrained Self-Attention Network for Fast Video Salient Object Detection

  • Yuchao Gu
  • Lijuan Wang
  • Ziqin Wang
  • Yun Liu
  • Ming-Ming Cheng
  • Shao-Ping Lu

Spatiotemporal information is essential for video salient object detection (VSOD) due to the highly attractive object motion for human’s attention. Previous VSOD methods usually use Long Short-Term Memory (LSTM) or 3D ConvNet (C3D), which can only encode motion information through step-by-step propagation in the temporal domain. Recently, the non-local mechanism is proposed to capture long-range dependencies directly. However, it is not straightforward to apply the non-local mechanism into VSOD, because i) it fails to capture motion cues and tends to learn motion-independent global contexts; ii) its computation and memory costs are prohibitive for video dense prediction tasks such as VSOD. To address the above problems, we design a Constrained Self-Attention (CSA) operation to capture motion cues, based on the prior that objects always move in a continuous trajectory. We group a set of CSA operations in Pyramid structures (PCSA) to capture objects at various scales and speeds. Extensive experimental results demonstrate that our method outperforms previous state-of-the-art methods in both accuracy and speed (110 FPS on a single Titan Xp) on five challenge datasets. Our code is available at https://github.com/guyuchao/PyramidCSA.

NeurIPS Conference 2019 Conference Paper

Unsupervised Scale-consistent Depth and Ego-motion Learning from Monocular Video

  • Jiawang Bian
  • Zhichao Li
  • Naiyan Wang
  • Huangying Zhan
  • Chunhua Shen
  • Ming-Ming Cheng
  • Ian Reid

Recent work has shown that CNN-based depth and ego-motion estimators can be learned using unlabelled monocular videos. However, the performance is limited by unidentified moving objects that violate the underlying static scene assumption in geometric image reconstruction. More significantly, due to lack of proper constraints, networks output scale-inconsistent results over different samples, i.e., the ego-motion network cannot provide full camera trajectories over a long video sequence because of the per-frame scale ambiguity. This paper tackles these challenges by proposing a geometry consistency loss for scale-consistent predictions and an induced self-discovered mask for handling moving objects and occlusions. Since we do not leverage multi-task learning like recent works, our framework is much simpler and more efficient. Comprehensive evaluation results demonstrate that our depth estimator achieves the state-of-the-art performance on the KITTI dataset. Moreover, we show that our ego-motion network is able to predict a globally scale-consistent camera trajectory for long video sequences, and the resulting visual odometry accuracy is competitive with the recent model that is trained using stereo videos. To the best of our knowledge, this is the first work to show that deep networks trained using unlabelled monocular videos can predict globally scale-consistent camera trajectories over a long video sequence.

AAAI Conference 2018 Conference Paper

Automatic Model Selection in Subspace Clustering via Triplet Relationships

  • Jufeng Yang
  • Jie Liang
  • Kai Wang
  • Yong-Liang Yang
  • Ming-Ming Cheng

This paper addresses both the model selection (i.e., estimating the number of clusters K) and subspace clustering problems in a unified model. The real data always distribute on a union of low-dimensional sub-manifolds which are embedded in a high-dimensional ambient space. In this regard, the state-of-the-art subspace clustering approaches firstly learn the affinity among samples, followed by a spectral clustering to generate the segmentation. However, arguably, the intrinsic geometrical structures among samples are rarely considered in the optimization process. In this paper, we propose to simultaneously estimate K and segment the samples according to the local similarity relationships derived from the affinity matrix. Given the correlations among samples, we define a novel data structure termed the Triplet, each of which reflects a high relevance and locality among three samples which are aimed to be segmented into the same subspace. While the traditional pairwise distance can be close between inter-cluster samples lying on the intersection of two subspaces, the wrong assignments can be avoided by the hyper-correlation derived from the proposed triplets due to the complementarity of multiple constraints. Sequentially, we propose to greedily optimize a new model selection reward to estimate K according to the correlations between inter-cluster triplets. We simultaneously optimize a fusion reward based on the similarities between triplets and clusters to generate the final segmentation. Extensive experiments on the benchmark datasets demonstrate the effectiveness and robustness of the proposed approach.

IJCAI Conference 2018 Conference Paper

DEL: Deep Embedding Learning for Efficient Image Segmentation

  • Yun Liu
  • Peng-tao Jiang
  • Vahan Petrosyan
  • Shi-Jie Li
  • Jiawang Bian
  • Le Zhang
  • Ming-Ming Cheng

Image segmentation has been explored for many years and still remains a crucial vision problem. Some efficient or accurate segmentation algorithms have been widely used in many vision applications. However, it is difficult to design an image segmenter that is both efficient and accurate. In this paper, we propose a novel method called DEL (deep embedding learning) which can efficiently transform superpixels into image segmentation. Starting with the SLIC superpixels, we train a fully convolutional network to learn the feature embedding space for each superpixel. The learned feature embedding corresponds to a similarity measure that measures the similarity between two adjacent superpixels. With the deep similarities, we can directly merge the superpixels into large segments. The evaluation results on BSDS500 and PASCAL Context demonstrate that our approach achieves a good trade-off between efficiency and effectiveness. Specifically, our DEL algorithm can achieve comparable segments when compared with MCG but is much faster, i.e., 11.4 fps vs. 0.07 fps.

ICRA Conference 2018 Conference Paper

Direct Line Guidance Odometry

  • Shijie Li 0006
  • Bo Ren 0003
  • Yun Liu 0011
  • Ming-Ming Cheng
  • Duncan P. Frost
  • Victor Adrian Prisacariu

Modern visual odometry algorithms utilize sparse point-based features for tracking due to their low computational cost. Current state-of-the-art methods are split between indirect methods that process features extracted from the image, and direct methods that operate directly on pixel intensities. In recent years, line-based features have been used in SLAM and have shown an increase in performance albeit with an increase in computational cost. In this paper, we propose an extension to a point-based direct monocular visual odometry method. Here we use lines to guide keypoint selection rather than treating them as features. Points on a line are treated as stronger keypoints than those in other parts of the image, steering point-selection away from less distinctive points and thereby increasing efficiency. By combining intensity and geometry information from a set of points on a line, accuracy may also be increased.

IJCAI Conference 2018 Conference Paper

Enhanced-alignment Measure for Binary Foreground Map Evaluation

  • Deng-Ping Fan
  • Cheng Gong
  • Yang Cao
  • Bo Ren
  • Ming-Ming Cheng
  • Ali Borji

The existing binary foreground map (FM) measures address various types of errors in either pixel-wise or structural ways. These measures consider pixel-level match or image-level information independently, while cognitive vision studies have shown that human vision is highly sensitive to both global information and local details in scenes. In this paper, we take a detailed look at current binary FM evaluation measures and propose a novel and effective E-measure (Enhanced-alignment measure). Our measure combines local pixel values with the image-level mean value in one term, jointly capturing image-level statistics and local pixel matching information. We demonstrate the superiority of our measure over the available measures on 4 popular datasets via 5 meta-measures, including ranking models for applications, demoting generic, random Gaussian noise maps, ground-truth switch, as well as human judgments. We find large improvements in almost all the meta-measures. For instance, in terms of application ranking, we observe improvement ranging from 9.08% to 19.65% compared with other popular measures.

AAAI Conference 2018 Conference Paper

FLIC: Fast Linear Iterative Clustering With Active Search

  • Jiaxing Zhao
  • Bo Ren
  • Qibin Hou
  • Ming-Ming Cheng
  • Paul Rosin

In this paper, we reconsider the clustering problem for image over-segmentation from a new perspective. We propose a novel search algorithm named “active search” which explicitly considers neighboring continuity. Based on this search method, we design a back-and-forth traversal strategy and a “joint” assignment and update step to speed up the algorithm. Compared to earlier works, such as Simple Linear Iterative Clustering (SLIC) and its follow-ups, which use fixed search regions and perform the assignment and the update step separately, our novel scheme reduces the number of iterations required for convergence, and also improves the boundary sensitivity of the over-segmentation results. Extensive evaluations on the Berkeley segmentation benchmark verify that our method outperforms competing methods under various evaluation metrics. In particular, the lowest time cost is reported among existing methods (approximately 30 fps for a 481 × 321 image on a single CPU core). To facilitate the development of over-segmentation, the code will be publicly available.

IJCAI Conference 2018 Conference Paper

Hi-Fi: Hierarchical Feature Integration for Skeleton Detection

  • Kai Zhao
  • Wei Shen
  • Shanghua Gao
  • Dandan Li
  • Ming-Ming Cheng

In natural images, the scales (thickness) of object skeletons may dramatically vary among objects and object parts. Thus, robust skeleton detection requires powerful multi-scale feature integration ability. To address this issue, we present a new convolutional neural network (CNN) architecture by introducing a novel hierarchical feature integration mechanism, named Hi-Fi, to address the object skeleton detection problem. The proposed CNN-based approach intrinsically captures high-level semantics from deeper layers, as well as low-level details from shallower layers. By hierarchically integrating different CNN feature levels with bidirectional guidance, our approach (1) enables mutual refinement across features of different levels, and (2) possesses the strong ability to capture both rich object context and high-resolution details. Experimental results show that our method significantly outperforms the state-of-the-art methods in terms of effectively fusing features from very different scales, as evidenced by a considerable performance improvement on several benchmarks.

NeurIPS Conference 2018 Conference Paper

Self-Erasing Network for Integral Object Attention

  • Qibin Hou
  • PengTao Jiang
  • Yunchao Wei
  • Ming-Ming Cheng

Recently, adversarial erasing for weakly-supervised object attention has been deeply studied due to its capability in localizing integral object regions. However, such a strategy raises one key problem that attention regions will gradually expand to non-object regions as training iterations continue, which significantly decreases the quality of the produced attention maps. To tackle such an issue as well as promote the quality of object attention, we introduce a simple yet effective Self-Erasing Network (SeeNet) to prohibit attentions from spreading to unexpected background regions. In particular, SeeNet leverages two self-erasing strategies to encourage networks to use reliable object and background cues for learning to attention. In this way, integral object regions can be effectively highlighted without including much more background regions. To test the quality of the generated attention maps, we employ the mined object regions as heuristic cues for learning semantic segmentation models. Experiments on Pascal VOC well demonstrate the superiority of our SeeNet over other state-of-the-art methods.

IROS Conference 2018 Conference Paper

Structured Skip List: A Compact Data Structure for 3D Reconstruction

  • Shijie Li 0006
  • Ming-Ming Cheng
  • Yun Liu 0011
  • Shao-Ping Lu
  • Yahui Wang
  • Victor Adrian Prisacariu

The model produced by a 3D reconstruction algorithm is usually represented by voxels. The management of these voxels is usually divided into two categories: ordered and unordered methods. The ordered method holds too many empty voxels to maintain data order, which leads to a low storage efficiency. On the contrary, the unordered method keeps massive index data to only store nonempty voxels. In this paper, we design a new data management method for real-time indoor 3D reconstruction, called Structured Skip List (SSL). The SSL can be treated as a semi-ordered method, because the advantages of both the ordered and unordered methods are taken into account: 1) it only holds nonempty voxels similar to the unordered method; 2) the structured information is introduced to reduce the storage space of index data. By these designs, the SSL has a better performance on storage efficiency. To handle the data collision in voxel allocation, a hash allocation list (HAL) is proposed. The length of each Skip List is kept balanced by fusing the IMU (Inertial Measurement Unit) information for a high operation efficiency. The storage efficiency analysis of different data management methods is shown in this paper. Moreover, exhaustive investigation is carried out on several datasets with these methods. The experimental result demonstrates that our design can achieve a high storage efficiency with little time loss compared to the state-of-the-art methods.

AAAI Conference 2018 Conference Paper

Understanding Image Impressiveness Inspired by Instantaneous Human Perceptual Cues

  • Jufeng Yang
  • Yan Sun
  • Jie Liang
  • Yong-Liang Yang
  • Ming-Ming Cheng

With the explosion of visual information nowadays, millions of digital images are available to the users. How to efficiently explore a large set of images and retrieve useful information thus becomes extremely important. Unfortunately only some of the images can impress the user at first glance. Others that make little sense in human perception are often discarded, while still costing valuable time and space. Therefore, it is significant to identify these two kinds of images for relieving the load of online repositories and accelerating information retrieval process. However, most of the existing image properties, e.g., memorability and popularity, are based on repeated human interactions, which limit the research and application of evaluating image quality in terms of instantaneous impression. In this paper, we propose a novel image property, called impressiveness, that measures how images impress people with a short-term contact. This is based on an impression-driven model inspired by a number of important human perceptual cues. To achieve this, we first collect three datasets in various domains, which are labeled according to the instantaneous sensation of the annotators. Then we investigate the impressiveness property via six established human perceptual cues as well as the corresponding features from pixel to semantic levels. Sequentially, we verify the consistency of the impressiveness which can be quantitatively measured by multiple visual representations, and evaluate their latent relationships. Finally, we apply the proposed impressiveness property to rank the images for an efficient image recommendation system.