ICLR 2025

Visually Consistent Hierarchical Image Classification

Conference Paper Accept (Poster) Artificial Intelligence · Machine Learning

Abstract

Hierarchical classification predicts labels across multiple levels of a taxonomy, e.g., from coarse-level Bird to mid-level Hummingbird to fine-level Green hermit, allowing flexible recognition under varying visual conditions. It is commonly framed as multiple single-level tasks, but each level may rely on different visual cues. Distinguishing Bird from Plant relies on global features like feathers or leaves, while separating Anna's hummingbird from Green hermit requires local details such as head coloration. Prior methods improve accuracy using external semantic supervision, but such statistical learning criteria fail to ensure consistent visual grounding at test time, resulting in incorrect hierarchical classification. We propose, for the first time, to enforce internal visual consistency by aligning fine-to-coarse predictions through intra-image segmentation. Our method outperforms zero-shot CLIP and state-of-the-art baselines on hierarchical classification benchmarks, achieving both higher accuracy and more consistent predictions. It also improves internal image segmentation without requiring pixel-level annotations.
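
To make the notion of "consistent predictions" concrete, here is a minimal sketch of one way to check whether per-level predictions agree with a taxonomy. It is not the paper's method or its evaluation metric: the toy taxonomy, the function names (`ancestors`, `is_consistent`, `consistency_rate`), and the definition of consistency as "all three labels lie on one root-to-leaf path" are assumptions made for illustration, reusing only the label names from the abstract.

```python
# Hypothetical toy taxonomy (child -> parent), built from the abstract's example labels.
PARENT = {
    "Green hermit": "Hummingbird",
    "Anna's hummingbird": "Hummingbird",
    "Hummingbird": "Bird",
}


def ancestors(label):
    """Return the ancestors of `label`, nearest first."""
    chain = []
    while label in PARENT:
        label = PARENT[label]
        chain.append(label)
    return chain


def is_consistent(coarse, mid, fine):
    """True if the three per-level predictions lie on one path of the taxonomy."""
    return ancestors(fine) == [mid, coarse]


def consistency_rate(predictions):
    """Fraction of (coarse, mid, fine) triples that are hierarchically consistent."""
    if not predictions:
        return 0.0
    return sum(is_consistent(*p) for p in predictions) / len(predictions)


if __name__ == "__main__":
    preds = [
        ("Bird", "Hummingbird", "Green hermit"),   # labels agree across levels
        ("Plant", "Hummingbird", "Green hermit"),  # coarse label contradicts the fine label
    ]
    print(consistency_rate(preds))
```

Under this assumed definition, a model can be accurate at each level separately and still be inconsistent, which is the failure mode the abstract attributes to treating the levels as independent tasks.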

Authors

Keywords

  • Hierarchical classification
  • visual grounding

Context

Venue
International Conference on Learning Representations
Paper id
320398371914810374