Arrow Research search

Author name cluster

Haosen Yang

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

4 papers
1 author row

Possible papers (4)

AAAI Conference 2025 Conference Paper

Unsupervised Audio-Visual Segmentation with Modality Alignment

  • Swapnil Bhosale
  • Haosen Yang
  • Diptesh Kanojia
  • Jiankang Deng
  • Xiatian Zhu

Audio-Visual Segmentation (AVS) aims to identify, at the pixel level, the object in a visual scene that produces a given sound. Current AVS methods rely on costly fine-grained annotations of mask-audio pairs, making them impractical to scale. To address this, we propose the Modality Correspondence Alignment (MoCA) framework, which seamlessly integrates off-the-shelf foundation models such as DINO, SAM, and ImageBind. Our approach leverages the knowledge already present in these models and optimizes their joint usage for multimodal association, relying on estimated positive and negative image pairs in the feature space. For pixel-level association, we introduce an audio-visual adapter and a novel pixel matching aggregation strategy within an image-level contrastive learning framework. This allows a flexible connection between object appearance and the audio signal at the pixel level, with tolerance to imaging variations such as translation and rotation. Extensive experiments on the AVSBench (single- and multi-object splits) and AVSS datasets demonstrate that MoCA outperforms unsupervised baselines and some supervised counterparts, particularly in complex scenarios with multiple auditory objects. In terms of mIoU, MoCA achieves substantial improvements over baselines on both the AVSBench (S4: +17.24%, MS3: +67.64%) and AVSS (+19.23%) audio-visual segmentation challenges.
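
As a rough illustration of how such a pipeline could be wired together, the sketch below shows a pixel matching aggregation step feeding an image-level contrastive loss, following the description above. All module names, feature dimensions, and the top-k aggregation rule are assumptions for illustration rather than the paper's actual implementation; the frozen backbones (e.g., DINO for pixel features, ImageBind for audio embeddings) are represented only by random stand-in tensors.

```python
# Hedged sketch (PyTorch): pixel-level audio-visual matching aggregated into an
# image-level contrastive loss, in the spirit of the MoCA description above.
# Any frozen backbones yielding per-pixel visual features and a global audio
# embedding would fit the role of DINO / ImageBind here.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AudioVisualAdapter(nn.Module):
    """Lightweight adapter projecting frozen visual features into the audio space."""
    def __init__(self, vis_dim: int, audio_dim: int):
        super().__init__()
        self.proj = nn.Conv2d(vis_dim, audio_dim, kernel_size=1)

    def forward(self, vis_feats: torch.Tensor) -> torch.Tensor:
        # vis_feats: (B, C_vis, H, W) -> (B, C_audio, H, W)
        return self.proj(vis_feats)

def pixel_matching_aggregation(vis_feats, audio_emb, topk_ratio=0.1):
    """Aggregate per-pixel audio-visual similarities into one image-level score.

    vis_feats: (B, D, H, W) adapted visual features
    audio_emb: (B, D) audio embeddings
    Returns a (B, B) matrix of audio-image matching scores.
    """
    B, D, H, W = vis_feats.shape
    pix = F.normalize(vis_feats.flatten(2), dim=1)        # (B, D, HW)
    aud = F.normalize(audio_emb, dim=1)                   # (B, D)
    # Similarity of every audio clip against every pixel of every image.
    sim = torch.einsum("ad,bdp->abp", aud, pix)           # (B_audio, B_image, HW)
    # Keep the most responsive pixels per (audio, image) pair and average them.
    k = max(1, int(topk_ratio * H * W))
    topk = sim.topk(k, dim=-1).values                     # (B, B, k)
    return topk.mean(dim=-1)                              # (B, B)

def image_level_contrastive_loss(scores, temperature=0.07):
    """InfoNCE over the batch: matched audio-image pairs lie on the diagonal."""
    logits = scores / temperature
    targets = torch.arange(scores.size(0), device=scores.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

if __name__ == "__main__":
    adapter = AudioVisualAdapter(vis_dim=768, audio_dim=512)
    vis = torch.randn(4, 768, 32, 32)   # stand-in for frozen DINO patch features
    aud = torch.randn(4, 512)           # stand-in for frozen ImageBind audio embeddings
    scores = pixel_matching_aggregation(adapter(vis), aud)
    print(image_level_contrastive_loss(scores).item())
```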

NeurIPS Conference 2024 Conference Paper

AV-GS: Learning Material and Geometry Aware Priors for Novel View Acoustic Synthesis

  • Swapnil Bhosale
  • Haosen Yang
  • Diptesh Kanojia
  • Jiankang Deng
  • Xiatian Zhu

Novel view acoustic synthesis (NVAS) aims to render binaural audio at any target viewpoint, given mono audio emitted by a sound source in a 3D scene. Existing methods have proposed NeRF-based implicit models that exploit visual cues as a condition for synthesizing binaural audio. However, beyond the low efficiency caused by heavy NeRF rendering, these methods have a limited ability to characterize the overall scene environment, including room geometry, material properties, and the spatial relation between the listener and the sound source. To address these issues, we propose a novel Audio-Visual Gaussian Splatting (AV-GS) model. To obtain a material- and geometry-aware condition for audio synthesis, we learn an explicit point-based scene representation with audio-guidance parameters on locally initialized Gaussian points, taking into account the spatial relation between the listener and the sound source. To make the visual scene model audio-adaptive, we propose a point densification and pruning strategy that optimally distributes Gaussian points according to each point's contribution to sound propagation (e.g., more points are needed on texture-less wall surfaces because they affect sound path diversion). Extensive experiments validate the superiority of AV-GS over existing alternatives on the real-world RWAS and simulation-based SoundSpaces datasets. Project page: https://surrey-uplab.github.io/research/avgs/
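
A minimal sketch of what "audio-guidance parameters on Gaussian points" could look like in code is given below, assuming per-point learnable features that are pooled into a condition vector from the listener and source positions. The class names, the inverse-distance weighting, and the toy renderer are illustrative guesses, not the AV-GS implementation; the densification and pruning strategy is omitted.

```python
# Hedged sketch (PyTorch): per-point "audio-guidance" features attached to an
# explicit Gaussian point cloud, pooled into a geometry/material-aware condition
# for a binaural renderer. Names, shapes, and the weighting rule are assumptions.
import torch
import torch.nn as nn

class AudioGaussianPoints(nn.Module):
    """Gaussian point positions plus learnable audio-guidance features per point."""
    def __init__(self, num_points: int, audio_feat_dim: int = 32):
        super().__init__()
        self.positions = nn.Parameter(torch.randn(num_points, 3))
        self.audio_feats = nn.Parameter(torch.zeros(num_points, audio_feat_dim))

    def condition(self, listener_pos, source_pos):
        """Pool per-point features, weighted by proximity to listener and source."""
        d_listener = torch.cdist(self.positions, listener_pos[None]).squeeze(-1)
        d_source = torch.cdist(self.positions, source_pos[None]).squeeze(-1)
        # Nearby points influence sound propagation more (a simple inverse-distance
        # prior); per-point weights like these could also drive densification/pruning.
        weights = torch.softmax(-(d_listener + d_source), dim=0)     # (N,)
        return (weights[:, None] * self.audio_feats).sum(dim=0)      # (audio_feat_dim,)

class BinauralRenderer(nn.Module):
    """Tiny stand-in renderer: mono waveform + scene condition -> stereo waveform."""
    def __init__(self, audio_feat_dim: int = 32, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(1 + audio_feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2),
        )

    def forward(self, mono, cond):
        # mono: (T,) samples; cond: (audio_feat_dim,)
        x = torch.cat([mono[:, None], cond[None].expand(mono.size(0), -1)], dim=-1)
        return self.net(x)                                           # (T, 2) binaural

if __name__ == "__main__":
    scene = AudioGaussianPoints(num_points=1000)
    renderer = BinauralRenderer()
    cond = scene.condition(listener_pos=torch.tensor([0., 0., 1.]),
                           source_pos=torch.tensor([2., 0., 1.]))
    stereo = renderer(torch.randn(16000), cond)   # e.g. one second at 16 kHz
    print(stereo.shape)
```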

NeurIPS Conference 2024 Conference Paper

Recognize Any Regions

  • Haosen Yang
  • Chuofan Ma
  • Bin Wen
  • Yi Jiang
  • Zehuan Yuan
  • Xiatian Zhu

Understanding the semantics of individual regions or patches of unconstrained images, as in open-world object detection, remains a critical yet challenging task in computer vision. Building on the success of powerful image-level vision-language (ViL) foundation models like CLIP, recent efforts have sought to harness their capabilities by either training a contrastive model from scratch on an extensive collection of region-label pairs or aligning the outputs of a detection model with image-level representations of region proposals. Despite notable progress, these approaches are plagued by computationally intensive training requirements, susceptibility to data noise, and a deficiency of contextual information. To address these limitations, we explore the synergistic potential of off-the-shelf foundation models, leveraging their respective strengths in localization and semantics. We introduce a novel, generic, and efficient architecture, named RegionSpot, designed to integrate position-aware localization knowledge from a localization foundation model (e.g., SAM) with semantic information from a ViL model (e.g., CLIP). To fully exploit pretrained knowledge while minimizing training overhead, we keep both foundation models frozen and focus optimization solely on a lightweight attention-based knowledge integration module. Extensive experiments on open-world object recognition show that RegionSpot achieves significant performance gains over prior alternatives, along with substantial computational savings (e.g., training on 3 million samples in a single day using 8 V100 GPUs). RegionSpot outperforms GLIP-L by 2.9 mAP on the LVIS val set, with an even larger margin of 13.1 AP for the more challenging rare categories, and a 2.5 AP increase on ODinW. Furthermore, it exceeds GroundingDINO-L by 11.0 AP for rare categories on the LVIS minival set.
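
The pattern described here, two frozen foundation models bridged by a small trainable attention module, can be sketched roughly as below. The tensor shapes, projection layers, and text-matching head are assumptions for illustration; only the overall idea of region tokens cross-attending to image-level features, with both backbones frozen, is taken from the abstract.

```python
# Hedged sketch (PyTorch): a lightweight cross-attention module fusing frozen
# region-level localization tokens (e.g. from SAM) with frozen image-level
# semantic features (e.g. from CLIP), in the spirit of the RegionSpot description.
import torch
import torch.nn as nn
import torch.nn.functional as F

class KnowledgeIntegration(nn.Module):
    """Region tokens query CLIP-like patch features; only this module is trained."""
    def __init__(self, region_dim=256, clip_dim=768, text_dim=512, num_heads=8):
        super().__init__()
        self.q_proj = nn.Linear(region_dim, clip_dim)
        self.attn = nn.MultiheadAttention(clip_dim, num_heads, batch_first=True)
        self.to_text = nn.Linear(clip_dim, text_dim)

    def forward(self, region_tokens, clip_patches):
        # region_tokens: (B, R, region_dim) from a frozen localization model
        # clip_patches:  (B, P, clip_dim)   from a frozen ViL model
        q = self.q_proj(region_tokens)
        fused, _ = self.attn(q, clip_patches, clip_patches)
        return F.normalize(self.to_text(fused), dim=-1)       # (B, R, text_dim)

def classify_regions(region_emb, text_emb, temperature=0.01):
    """Match region embeddings against (frozen) class text embeddings."""
    text_emb = F.normalize(text_emb, dim=-1)                   # (K, text_dim)
    return region_emb @ text_emb.t() / temperature             # (B, R, K) logits

if __name__ == "__main__":
    module = KnowledgeIntegration()
    regions = torch.randn(2, 10, 256)    # stand-in for SAM region/mask tokens
    patches = torch.randn(2, 196, 768)   # stand-in for CLIP ViT patch features
    texts = torch.randn(80, 512)         # stand-in for CLIP text class embeddings
    logits = classify_regions(module(regions, patches), texts)
    print(logits.shape)                  # (2, 10, 80)
```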

AAAI Conference 2022 Conference Paper

Temporal Action Proposal Generation with Background Constraint

  • Haosen Yang
  • Wenhao Wu
  • Lining Wang
  • Sheng Jin
  • Boyang Xia
  • Hongxun Yao
  • Hujie Huang

Temporal action proposal generation (TAPG) is a challenging task that aims to locate action instances with temporal boundaries in untrimmed videos. To evaluate the confidence of proposals, existing works typically predict an action score for each proposal, supervised by the temporal Intersection-over-Union (tIoU) between the proposal and the ground truth. In this paper, we propose a general auxiliary Background Constraint idea that further suppresses low-quality proposals by using the background prediction score to restrict proposal confidence. In this way, the Background Constraint concept can easily be plugged into existing TAPG methods (e.g., BMN, G-TAD). From this perspective, we propose the Background Constraint Network (BC-Net) to further exploit the rich information of action and background. Specifically, we introduce an Action-Background Interaction module for reliable confidence evaluation, which models the inconsistency between action and background with attention mechanisms at the frame and clip levels. Extensive experiments are conducted on two popular benchmarks, i.e., ActivityNet-1.3 and THUMOS14. The results demonstrate that our method outperforms state-of-the-art methods. Equipped with an existing action classifier, our method also achieves remarkable performance on the temporal action localization task.
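
One plausible reading of the background-constraint idea is sketched below: a proposal's background score suppresses its action confidence, and low-tIoU proposals supervise the background head. The fusion rule, the 0.3 threshold, and the loss are illustrative choices, not necessarily those used by BC-Net.

```python
# Hedged sketch (PyTorch): background score restricting proposal confidence,
# one possible instantiation of the Background Constraint idea described above.
import torch

def constrained_confidence(action_score, background_score):
    """Suppress proposals whose clips look like background.

    action_score:     (N,) predicted action confidence per proposal
    background_score: (N,) predicted probability that the proposal is background
    """
    return action_score * (1.0 - background_score)

def background_constraint_loss(background_score, tiou):
    """Supervise the background head: low-tIoU proposals should score as background.

    tiou: (N,) temporal IoU between each proposal and its best-matching ground truth.
    """
    target = (tiou < 0.3).float()   # threshold is an illustrative choice
    return torch.nn.functional.binary_cross_entropy(background_score, target)

if __name__ == "__main__":
    action = torch.tensor([0.9, 0.8, 0.7])
    background = torch.tensor([0.1, 0.7, 0.2])
    tiou = torch.tensor([0.8, 0.1, 0.6])
    print(constrained_confidence(action, background))
    print(background_constraint_loss(background, tiou).item())
```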