
Author name cluster

Jiaying Lin

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

12 papers
2 author rows

Possible papers (12)

AAAI 2026 · Conference Paper

OpenScan: A Benchmark for Generalized Open-Vocabulary 3D Scene Understanding

  • Youjun Zhao
  • Jiaying Lin
  • Shuquan Ye
  • Qianshi Pang
  • Rynson W. H. Lau

Open-vocabulary 3D scene understanding (OV-3D) aims to localize and classify novel objects beyond the closed set of object classes. However, existing approaches and benchmarks primarily focus on the open-vocabulary problem within the context of object classes, which is insufficient for holistically evaluating the extent to which a model understands a 3D scene. In this paper, we introduce a more challenging task called Generalized Open-Vocabulary 3D Scene Understanding (GOV-3D) to explore the open-vocabulary problem beyond object classes. It encompasses an open and diverse set of generalized knowledge, expressed as linguistic queries of fine-grained and object-specific attributes. To this end, we contribute a new benchmark named OpenScan, which covers 3D object attributes across eight representative linguistic aspects, including affordance, property, and material. We further evaluate state-of-the-art OV-3D methods on our OpenScan benchmark and discover that these methods struggle to comprehend the abstract vocabularies of the GOV-3D task, a challenge that cannot be addressed simply by scaling up object classes during training. We highlight the limitations of existing methodologies and explore promising directions to overcome the identified shortcomings.
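The abstract does not describe the scoring protocol, but open-vocabulary attribute queries are plausibly scored CLIP-style against object embeddings in a shared space. A minimal sketch under that assumption (all names and shapes below are illustrative, not from the paper):

```python
# Hedged sketch (not the authors' code): scoring 3D object features against
# free-form attribute queries, CLIP-style. Names and shapes are assumptions.
import torch
import torch.nn.functional as F

def score_attribute_queries(object_feats: torch.Tensor,
                            query_feats: torch.Tensor) -> torch.Tensor:
    """object_feats: (N, D) embeddings of detected 3D objects.
    query_feats: (Q, D) text embeddings of attribute queries, e.g.
    "can be sat on" (affordance) or "made of metal" (material).
    Returns an (N, Q) cosine-similarity matrix."""
    obj = F.normalize(object_feats, dim=-1)
    qry = F.normalize(query_feats, dim=-1)
    return obj @ qry.t()

# Toy usage: 4 objects, 3 attribute queries, 512-d shared embedding space.
sims = score_attribute_queries(torch.randn(4, 512), torch.randn(3, 512))
best_query_per_object = sims.argmax(dim=1)  # (4,)
```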

AAAI 2025 · Conference Paper

Hierarchical Cross-Modal Alignment for Open-Vocabulary 3D Object Detection

  • Youjun Zhao
  • Jiaying Lin
  • Rynson W. H. Lau

Open-vocabulary 3D object detection (OV-3DOD) aims at localizing and classifying novel objects beyond closed sets. The recent success of vision-language models (VLMs) has demonstrated their remarkable capabilities to understand open vocabularies. Existing works that leverage VLMs for 3D object detection (3DOD) generally resort to representations that lose the rich scene context required for 3D perception. To address this problem, we propose in this paper a hierarchical framework, named HCMA, to simultaneously learn local object and global scene information for OV-3DOD. Specifically, we first design a Hierarchical Data Integration (HDI) approach to obtain coarse-to-fine 3D-image-text data, which is fed into a VLM to extract object-centric knowledge. To facilitate the association of feature hierarchies, we then propose an Interactive Cross-Modal Alignment (ICMA) strategy to establish effective intra-level and inter-level feature connections. To better align features across different levels, we further propose an Object-Focusing Context Adjustment (OFCA) module to refine multi-level features by emphasizing object-related features. Extensive experiments demonstrate that the proposed method outperforms SOTA methods on the existing OV-3DOD benchmarks. It also achieves promising OV-3DOD results even without any 3D annotations.
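The alignment objective is not given in the abstract; a common choice for such intra-level 3D-text alignment is a symmetric InfoNCE loss. A hedged sketch under that assumption (names are illustrative, and this is not necessarily the exact HCMA loss):

```python
# Hedged sketch: symmetric InfoNCE alignment between 3D features and text
# features at one hierarchy level (object, view, or scene).
import torch
import torch.nn.functional as F

def infonce_alignment(feats_3d, feats_text, temperature=0.07):
    """feats_3d, feats_text: (B, D) paired embeddings for one level."""
    a = F.normalize(feats_3d, dim=-1)
    b = F.normalize(feats_text, dim=-1)
    logits = a @ b.t() / temperature          # (B, B) similarity matrix
    targets = torch.arange(a.size(0))         # diagonal pairs are positives
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Hierarchical total: sum the same loss over coarse-to-fine levels.
levels = [(torch.randn(8, 256), torch.randn(8, 256)) for _ in range(3)]
loss = sum(infonce_alignment(p, t) for p, t in levels)
```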

AAAI 2025 · Conference Paper

Leveraging RGB-D Data with Cross-Modal Context Mining for Glass Surface Detection

  • Jiaying Lin
  • Yuen-Hei Yeung
  • Shuquan Ye
  • Rynson W. H. Lau

Glass surfaces are becoming increasingly ubiquitous, as modern buildings make extensive use of glass panels. This, however, poses substantial challenges to the operation of autonomous systems such as robots, self-driving cars, and drones, as glass panels become transparent obstacles to navigation. Existing works attempt to exploit various cues, including glass boundary context or reflections, as priors. However, they are all based on input RGB images. We observe that the transmission of 3D depth sensor light through glass surfaces often produces blank regions in the depth maps, which can offer additional insights to complement the RGB image features for glass surface detection. In this work, we propose a large-scale RGB-D glass surface detection dataset, RGB-D GSD, for rigorous experiments and future research. It contains 3,009 images covering a wide range of real-world glass surface categories, paired with precise annotations. Moreover, we propose a novel glass surface detection framework combining RGB and depth information, with two novel modules: a cross-modal context mining (CCM) module to adaptively learn individual and mutual context features from RGB and depth information, and a depth-missing aware attention (DAA) module to explicitly exploit the spatial locations where depth is missing to help detect the presence of glass surfaces. Experimental results show that our proposed model outperforms state-of-the-art methods.
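The DAA module's design is not detailed here, but the underlying cue (depth sensor dropouts on glass) can be sketched as a learned gate over a missing-depth mask. All names below are illustrative assumptions:

```python
# Hedged sketch of the depth-missing cue: turn invalid depth readings into a
# spatial attention map. This only illustrates the idea, not the real DAA.
import torch
import torch.nn as nn

class DepthMissingGate(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        # Small conv head that learns how much each missing-depth region
        # should boost the fused RGB-D features.
        self.gate = nn.Sequential(
            nn.Conv2d(1, channels, kernel_size=3, padding=1),
            nn.Sigmoid(),
        )

    def forward(self, feats: torch.Tensor, depth: torch.Tensor) -> torch.Tensor:
        """feats: (B, C, H, W) fused features; depth: (B, 1, H, W) raw depth.
        Assumes sensor dropouts (common on glass) are stored as 0."""
        missing = (depth == 0).float()        # 1 where depth is missing
        return feats * (1.0 + self.gate(missing))

feats = torch.randn(2, 64, 32, 32)
depth = torch.rand(2, 1, 32, 32)
depth[depth < 0.2] = 0                        # simulate sensor dropouts
out = DepthMissingGate(64)(feats, depth)
```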

IROS 2025 · Conference Paper

Recognizing Skeleton-Based Actions As Points

  • Baiqiao Yin
  • Jiaying Lin
  • Jiajun Wen 0003
  • Yue Li
  • Jinfu Liu
  • Yanfei Wang
  • Mengyuan Liu 0001

Recent advances in skeleton-based action recognition have been primarily driven by Graph Convolutional Networks (GCNs) and skeleton transformers. While conventional approaches focus on modeling joint co-occurrences through skeletal connections, they overlook the inherent positional information in 3D coordinates. Although Hyper-Graphs partially address the limitation of pairwise aggregation in capturing higher-order kinematic dependencies, challenges remain in their topological definitions. To solve these problems, this paper proposes a skeleton-to-point network (Skeleton2Point) to model joints' positional relationships directly in three-dimensional space without fixed topology limitations, making it the first work to treat skeleton-based action recognition as a point cloud problem. However, simply considering the raw 3D coordinates would result in the loss of the anatomical identity of each keypoint and its temporal position in the sequence. To address this limitation, we augment the three-dimensional spatial coordinates with two additional dimensions, the anatomical index of each keypoint and its corresponding frame number, via a proposed Information Transform Module (ITM). This transformation extends the representation from a three-dimensional to a five-dimensional feature space. Furthermore, we propose a Cluster-Dispatch-Based Interaction module (CDI) to enhance the discrimination of local-global information. In comparison with existing methods on the NTU-RGB+D 60 and NTU-RGB+D 120 datasets, Skeleton2Point has demonstrated state-of-the-art performance on both joint modality and stream fusion. Notably, on the challenging NTU-RGB+D 120 dataset under the X-Sub and X-Set settings, the accuracies reach 90.63% and 91.92%.
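The 3D-to-5D lifting described for the ITM is concrete enough to sketch directly; the learned components of the module are omitted and the function name is illustrative:

```python
# Sketch of the 3D-to-5D lifting the abstract describes: append each joint's
# anatomical index and frame number to its xyz coordinates, turning a skeleton
# sequence into a point cloud. (The learned parts of the ITM are omitted.)
import numpy as np

def skeleton_to_points(skeleton: np.ndarray) -> np.ndarray:
    """skeleton: (T, V, 3) array of T frames x V joints x xyz.
    Returns (T*V, 5) points: [x, y, z, joint_index, frame_index]."""
    T, V, _ = skeleton.shape
    frame_idx = np.repeat(np.arange(T), V)        # (T*V,)
    joint_idx = np.tile(np.arange(V), T)          # (T*V,)
    xyz = skeleton.reshape(T * V, 3)
    return np.column_stack([xyz, joint_idx, frame_idx]).astype(np.float32)

points = skeleton_to_points(np.random.randn(64, 25, 3))  # NTU uses 25 joints
assert points.shape == (64 * 25, 5)
```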

NeurIPS 2024 · Conference Paper

Boosting Weakly Supervised Referring Image Segmentation via Progressive Comprehension

  • Zaiquan Yang
  • Yuhao Liu
  • Jiaying Lin
  • Gerhard Hancke
  • Rynson W. Lau

This paper explores the weakly-supervised referring image segmentation (WRIS) problem, and focuses on a challenging setup where target localization is learned directly from image-text pairs. We note that the input text description typically already contains detailed information on how to localize the target object, and we also observe that humans often follow a step-by-step comprehension process (i.e., progressively utilizing target-related attributes and relations as cues) to identify the target object. Hence, we propose a novel Progressive Comprehension Network (PCNet) to leverage target-related textual cues from the input description for progressively localizing the target object. Specifically, we first use a Large Language Model (LLM) to decompose the input text description into short phrases. These short phrases are taken as target-related cues and fed into a Conditional Referring Module (CRM) in multiple stages, to allow updating the referring text embedding and enhance the response map for target localization in a multi-stage manner. Based on the CRM, we then propose a Region-aware Shrinking (RaS) loss to constrain the visual localization to be conducted progressively in a coarse-to-fine manner across different stages. Finally, we introduce an Instance-aware Disambiguation (IaD) loss to suppress instance localization ambiguity by differentiating overlapping response maps generated by different referring texts on the same image. Extensive experiments show that our method outperforms SOTA methods on three common benchmarks.
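A hedged sketch of the progressive scheme: decompose the sentence into phrases (the LLM call is stubbed out) and refine a referring embedding one cue per stage. The GRU-style update below is an assumption for illustration, not the paper's CRM:

```python
# Hedged sketch of progressive comprehension: refine the referring embedding
# with one target-related phrase per stage. `decompose_with_llm` is a stub.
import torch
import torch.nn as nn

def decompose_with_llm(text: str) -> list[str]:
    # Placeholder for the LLM decomposition step described in the paper.
    return ["red jacket", "person on the left", "holding a cup"]

class ConditionalReferringStage(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.update = nn.GRUCell(dim, dim)   # one simple choice of update rule

    def forward(self, query: torch.Tensor, cue: torch.Tensor) -> torch.Tensor:
        return self.update(cue, query)       # the cue conditions the query

dim = 256
query = torch.randn(1, dim)                       # initial sentence embedding
stages = nn.ModuleList(ConditionalReferringStage(dim) for _ in range(3))
for stage, phrase in zip(stages, decompose_with_llm("...")):
    cue = torch.randn(1, dim)                     # stand-in phrase embedding
    query = stage(query, cue)                     # progressively refined query
```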

AAAI 2024 · Conference Paper

Multi-View Dynamic Reflection Prior for Video Glass Surface Detection

  • Fang Liu
  • Yuhao Liu
  • Jiaying Lin
  • Ke Xu
  • Rynson W.H. Lau

Recent research has shown significant interest in image-based glass surface detection (GSD). However, detecting glass surfaces in dynamic scenes remains largely unexplored due to the lack of a high-quality dataset and an effective video glass surface detection (VGSD) method. In this paper, we propose the first VGSD approach. Our key observation is that reflections frequently appear on glass surfaces, but they change dynamically as the camera moves. Based on this observation, we propose to offset the excessive dependence on a single, uncertain reflection by jointly modeling temporal and spatial reflection cues. To this end, we propose VGSD-Net with two novel modules: a Location-aware Reflection Extraction (LRE) module and a Context-enhanced Reflection Integration (CRI) module, for position-aware reflection feature extraction and spatio-temporal reflection cue integration, respectively. We have also created the first large-scale video glass surface dataset (VGSD-D), consisting of 19,166 image frames with accurately annotated glass masks extracted from 297 videos. Extensive experiments demonstrate that VGSD-Net outperforms state-of-the-art approaches adapted from related fields. Code and dataset will be available at https://github.com/fawnliu/VGSD.
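The key observation (reflections shift as the camera moves) admits a very simple illustration: frame-to-frame feature differences as a glass cue. This sketches the observation only, not the LRE or CRI modules; the function name is an assumption:

```python
# Hedged sketch of the temporal reflection cue: reflections on glass shift
# between frames, so feature differences across time highlight candidate
# glass regions.
import torch

def temporal_reflection_cue(feats: torch.Tensor) -> torch.Tensor:
    """feats: (B, T, C, H, W) per-frame features of a short clip.
    Returns (B, T-1, 1, H, W) motion-of-appearance maps."""
    diff = (feats[:, 1:] - feats[:, :-1]).abs()   # frame-to-frame change
    return diff.mean(dim=2, keepdim=True)         # collapse channels

cue = temporal_reflection_cue(torch.randn(2, 5, 64, 32, 32))
```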

AAAI 2023 · Conference Paper

Efficient Mirror Detection via Multi-Level Heterogeneous Learning

  • Ruozhen He
  • Jiaying Lin
  • Rynson W.H. Lau

We present HetNet (Multi-level Heterogeneous Network), a highly efficient mirror detection network. Current mirror detection methods focus more on performance than efficiency, which limits real-time applications (e.g., on drones). Their inefficiency arises from the common design of adopting homogeneous modules at different levels, which ignores the differences between features at different levels. In contrast, HetNet detects potential mirror regions initially through low-level understandings (e.g., intensity contrasts) and then combines them with high-level understandings (e.g., contextual discontinuity) to finalize the predictions. To perform accurate yet efficient mirror detection, HetNet follows an effective architecture that obtains specific information at different stages to detect mirrors. We further propose a multi-orientation intensity-based contrasted module (MIC) and a reflection semantic logical module (RSL), equipped on HetNet, to predict potential mirror regions from low-level understandings and to analyze semantic logic in scenarios from high-level understandings, respectively. Compared to the state-of-the-art method, HetNet runs 664% faster and achieves an average performance gain of 8.9% on MAE, 3.1% on IoU, and 2.0% on F-measure on two mirror detection benchmarks. The code is available at https://github.com/Catherine-R-He/HetNet.
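The low-level intensity-contrast idea behind MIC can be illustrated with fixed directional difference kernels; the real module is learned, so this is only a sketch:

```python
# Hedged sketch of multi-orientation intensity contrast: fixed difference
# kernels in four orientations respond to the low-level contrast edges that
# often delimit mirrors. Illustrative only, not the learned MIC module.
import torch
import torch.nn.functional as F

def orientation_contrast(gray: torch.Tensor) -> torch.Tensor:
    """gray: (B, 1, H, W) intensity image. Returns (B, 4, H, W) contrasts
    along horizontal, vertical, and the two diagonal orientations."""
    k = torch.zeros(4, 1, 3, 3)
    k[0, 0, 1, 0], k[0, 0, 1, 2] = -1.0, 1.0      # horizontal difference
    k[1, 0, 0, 1], k[1, 0, 2, 1] = -1.0, 1.0      # vertical difference
    k[2, 0, 0, 0], k[2, 0, 2, 2] = -1.0, 1.0      # main diagonal
    k[3, 0, 0, 2], k[3, 0, 2, 0] = -1.0, 1.0      # anti-diagonal
    return F.conv2d(gray, k, padding=1).abs()

contrast = orientation_contrast(torch.rand(1, 1, 64, 64))
```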

AAAI 2023 · Conference Paper

Symmetry-Aware Transformer-Based Mirror Detection

  • Tianyu Huang
  • Bowen Dong
  • Jiaying Lin
  • Xiaohui Liu
  • Rynson W.H. Lau
  • Wangmeng Zuo

Mirror detection aims to identify the mirror regions in the given input image. Existing works mainly focus on integrating the semantic features and structural features to mine specific relations between mirror and non-mirror regions, or introducing mirror properties like depth or chirality to help analyze the existence of mirrors. In this work, we observe that a real object typically forms a loose symmetry relationship with its corresponding reflection in the mirror, which is beneficial in distinguishing mirrors from real objects. Based on this observation, we propose a dual-path Symmetry-Aware Transformer-based mirror detection Network (SATNet), which includes two novel modules: Symmetry-Aware Attention Module (SAAM) and Contrast and Fusion Decoder Module (CFDM). Specifically, we first adopt a transformer backbone to model global information aggregation in images, extracting multi-scale features in two paths. We then feed the high-level dual-path features to SAAMs to capture the symmetry relations. Finally, we fuse the dual-path features and refine our prediction maps progressively with CFDMs to obtain the final mirror mask. Experimental results show that SATNet outperforms both RGB and RGB-D mirror detection methods on all available mirror detection datasets.
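The loose object-reflection symmetry can be probed by cross-attending between a feature map and its horizontal flip; a hedged sketch (the actual SAAM is dual-path and more elaborate, and all names here are illustrative):

```python
# Hedged sketch of symmetry-aware attention: cross-attend between features
# and their horizontally flipped copy, so an object can match its mirror
# reflection.
import torch
import torch.nn as nn

class FlipCrossAttention(nn.Module):
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        """feats: (B, C, H, W). Queries come from the original map, keys and
        values from its left-right flip."""
        B, C, H, W = feats.shape
        x = feats.flatten(2).transpose(1, 2)                  # (B, HW, C)
        x_flip = feats.flip(-1).flatten(2).transpose(1, 2)    # flipped tokens
        out, _ = self.attn(x, x_flip, x_flip)
        return out.transpose(1, 2).reshape(B, C, H, W)

sym = FlipCrossAttention(64)(torch.randn(2, 64, 16, 16))
```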

AAAI 2023 · Conference Paper

Weakly-Supervised Camouflaged Object Detection with Scribble Annotations

  • Ruozhen He
  • Qihua Dong
  • Jiaying Lin
  • Rynson W.H. Lau

Existing camouflaged object detection (COD) methods rely heavily on large-scale datasets with pixel-wise annotations. However, due to the ambiguous boundaries, annotating camouflaged objects pixel-wise is very time-consuming and labor-intensive, taking ~60 minutes per image. In this paper, we propose the first weakly-supervised COD method, using scribble annotations as supervision. To achieve this, we first relabel 4,040 images in existing camouflaged object datasets with scribbles, which takes ~10 seconds per image. As scribble annotations only describe the primary structure of objects without details, for the network to learn to localize the boundaries of camouflaged objects, we propose a novel consistency loss composed of two parts: a cross-view loss to attain reliable consistency over different images, and an inside-view loss to maintain consistency inside a single prediction map. Besides, we observe that humans use semantic information to segment regions near the boundaries of camouflaged objects. Hence, we further propose a feature-guided loss, which includes visual features directly extracted from images and semantically significant features captured by the model. Finally, we propose a novel network for COD via scribble learning on structural information and semantic relations. Our network has two novel modules: the local-context contrasted (LCC) module, which mimics visual inhibition to enhance image contrast/sharpness and expand the scribbles into potential camouflaged regions, and the logical semantic relation (LSR) module, which analyzes the semantic relations to determine the regions representing the camouflaged object. Experimental results show that our model outperforms relevant SOTA methods on three COD benchmarks with an average improvement of 11.0% on MAE, 3.2% on S-measure, 2.5% on E-measure, and 4.4% on weighted F-measure.
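The two-part consistency idea can be sketched generically: a cross-view term that asks predictions to commute with an input transform (here a flip), and an inside-view term that asks for local smoothness. The paper's exact losses may differ; everything below is an illustrative assumption:

```python
# Hedged sketch of a two-part consistency loss for scribble supervision.
import torch
import torch.nn.functional as F

def consistency_loss(model, image: torch.Tensor) -> torch.Tensor:
    """image: (B, 3, H, W); model maps images to (B, 1, H, W) logits."""
    pred = torch.sigmoid(model(image))
    pred_of_flip = torch.sigmoid(model(image.flip(-1)))
    cross_view = F.l1_loss(pred.flip(-1), pred_of_flip)   # flip-equivariance
    # Inside-view: penalize disagreement between horizontal/vertical neighbors.
    inside_view = ((pred[..., :, 1:] - pred[..., :, :-1]).abs().mean() +
                   (pred[..., 1:, :] - pred[..., :-1, :]).abs().mean())
    return cross_view + inside_view

net = torch.nn.Conv2d(3, 1, 3, padding=1)                 # stand-in model
loss = consistency_loss(net, torch.rand(2, 3, 64, 64))
```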

NeurIPS 2022 · Conference Paper

Exploiting Semantic Relations for Glass Surface Detection

  • Jiaying Lin
  • Yuen-Hei Yeung
  • Rynson Lau

Glass surfaces are omnipresent in our daily lives and often go unnoticed by the majority of us. While humans are generally able to infer their locations and thus avoid collisions, it can be difficult for current object detection systems to handle them due to the transparent nature of glass surfaces. Previous methods approached the problem by extracting global context information to obtain priors such as object boundaries and reflections. However, their performances cannot be guaranteed when these deterministic features are not available. We observe that humans often reason through the semantic context of the environment, which offers insights into the categories of and proximity between entities that are expected to appear in the surrounding. For example, the odds of co-occurrence of glass windows with walls and curtains are generally higher than that with other objects such as cars and trees, which have relatively less semantic relevance. Based on this observation, we propose a model ('GlassSemNet') that integrates the contextual relationship of the scenes for glass surface detection with two novel modules: (1) Scene Aware Activation (SAA) Module to adaptively filter critical channels with respect to spatial and semantic features, and (2) Context Correlation Attention (CCA) Module to progressively learn the contextual correlations among objects both spatially and semantically. In addition, we propose a large-scale glass surface detection dataset named {\it Glass Surface Detection - Semantics} ('GSD-S'), which contains 4, 519 real-world RGB glass surface images from diverse real-world scenes with detailed annotations for both glass surface detection and semantic segmentation. Experimental results show that our model outperforms contemporary works, especially with 42. 6\% MAE improvement on our proposed GSD-S dataset. Code, dataset, and models are available at https: //jiaying. link/neurips2022-gsds/