
Author name cluster

Haiyang Wang

This page lists possible papers associated with this exact author name in Arrow. It groups case-insensitive exact-name matches and is not a full identity-disambiguation profile.

9 papers
2 author rows

Possible papers (9)

AAAI 2025 Conference Paper

Pedestrian Attribute Recognition: A New Benchmark Dataset and a Large Language Model Augmented Framework

  • Jiandong Jin
  • Xiao Wang
  • Qian Zhu
  • Haiyang Wang
  • Chenglong Li

Pedestrian Attribute Recognition (PAR) is one of the indispensable tasks in human-centered research. However, existing datasets neglect differences across domains (e.g., environments, times, populations, and data sources), use only simple random splits, and performance on them has already approached saturation. No large-scale PAR dataset has been released publicly in the past five years. To address this issue, this paper proposes a new large-scale, cross-domain pedestrian attribute recognition dataset, termed MSP60K, to fill the data gap. It consists of 60,122 images with 57 attribute annotations across eight scenarios. Synthetic degradation is also applied to further narrow the gap between the dataset and challenging real-world scenarios. To establish a more rigorous benchmark, we evaluate 17 representative PAR models on our dataset under both random and cross-domain split protocols. Additionally, we propose an innovative Large Language Model (LLM) augmented PAR framework, named LLM-PAR. This framework extracts features from pedestrian images with a Vision Transformer (ViT) backbone and introduces a multi-embedding query Transformer to learn partial-aware features for attribute classification. Significantly, we enhance this framework with an LLM for ensemble learning and visual feature augmentation. Comprehensive experiments across multiple PAR benchmark datasets thoroughly validate the efficacy of our proposed framework.
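
The query-Transformer component can be sketched from the abstract alone. The PyTorch toy below uses learnable attribute query tokens that cross-attend to ViT patch features, with each query decoded into one attribute logit, one per MSP60K attribute. Class and parameter names here are assumptions, and the LLM-based ensemble and feature augmentation are omitted; this is not LLM-PAR's actual code.

```python
# Hedged sketch of a "multi-embedding query Transformer" attribute head.
# AttributeQueryHead and n_attrs are illustrative names, not the paper's API.
import torch
import torch.nn as nn

class AttributeQueryHead(nn.Module):
    def __init__(self, dim=768, n_attrs=57):
        super().__init__()
        # One learnable query token per attribute.
        self.queries = nn.Parameter(torch.randn(n_attrs, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.classifier = nn.Linear(dim, 1)          # one logit per attribute query

    def forward(self, patch_feats):                  # (B, N_patches, dim) from a ViT
        B = patch_feats.shape[0]
        q = self.queries.unsqueeze(0).expand(B, -1, -1)
        # Queries attend over patch features to gather part-level evidence.
        attended, _ = self.cross_attn(q, patch_feats, patch_feats)
        return self.classifier(attended).squeeze(-1)  # (B, 57) attribute logits

logits = AttributeQueryHead()(torch.randn(2, 196, 768))
print(logits.shape)  # torch.Size([2, 57])
```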

ICLR 2025 Conference Paper

TokenFormer: Rethinking Transformer Scaling with Tokenized Model Parameters

  • Haiyang Wang
  • Yue Fan
  • Muhammad Ferjad Naeem
  • Yongqin Xian
  • Jan Eric Lenssen
  • Liwei Wang 0001
  • Federico Tombari
  • Bernt Schiele

Transformers have become the predominant architecture in foundation models due to their excellent performance across various domains. However, the substantial cost of scaling these models remains a significant concern. The problem arises primarily from their reliance on a fixed number of parameters within linear projections: when architectural modifications (e.g., channel dimensions) are introduced, the entire model typically must be retrained from scratch. As model sizes continue to grow, this strategy results in increasingly high computational costs and becomes unsustainable. To overcome this problem, we introduce TokenFormer, a natively scalable architecture that leverages the attention mechanism not only for computations among input tokens but also for interactions between tokens and model parameters, thereby enhancing architectural flexibility. By treating model parameters as tokens, we replace all the linear projections in Transformers with our token-parameter attention layer, where input tokens act as queries and model parameters as keys and values. This reformulation allows for progressive and efficient scaling without retraining from scratch. Our model scales from 124M to 1.4B parameters by incrementally adding new key-value parameter pairs, achieving performance comparable to Transformers trained from scratch while greatly reducing training costs. Code and models are available at https://github.com/Haiyang-W/TokenFormer.git
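
The token-parameter attention reformulation is concrete enough to sketch. The PyTorch toy below treats model parameters as learnable key/value tokens attended to by input-token queries, with a grow() method that appends new key-value pairs for progressive scaling. The class name, the plain softmax normalization, and the naive growth initialization are assumptions on my part; the actual implementation lives in the linked repository.

```python
# Minimal sketch of token-parameter attention as described in the abstract:
# input tokens are queries, learnable parameter tokens are keys and values.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PattentionLayer(nn.Module):
    def __init__(self, dim: int, n_param_tokens: int):
        super().__init__()
        # Model parameters stored as tokens (key/value pairs), replacing
        # the weight matrix of a dense linear projection.
        self.param_keys = nn.Parameter(torch.randn(n_param_tokens, dim) * dim ** -0.5)
        self.param_values = nn.Parameter(torch.randn(n_param_tokens, dim) * dim ** -0.5)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, dim). Attention of input tokens over parameter tokens.
        attn = F.softmax(x @ self.param_keys.t() / x.shape[-1] ** 0.5, dim=-1)
        return attn @ self.param_values

    @torch.no_grad()
    def grow(self, extra: int):
        # Progressive scaling: append new key-value parameter pairs instead of
        # retraining from scratch. This naive initialization (tiny keys, zero
        # values) is an assumption; the paper's scheme may differ.
        dim = self.param_keys.shape[1]
        self.param_keys = nn.Parameter(
            torch.cat([self.param_keys, torch.randn(extra, dim) * 1e-4]))
        self.param_values = nn.Parameter(
            torch.cat([self.param_values, torch.zeros(extra, dim)]))

layer = PattentionLayer(dim=256, n_param_tokens=1024)
y = layer(torch.randn(2, 8, 256))   # (2, 8, 256)
layer.grow(512)                     # now 1536 key-value parameter pairs
```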

NeurIPS 2025 Conference Paper

UFO: A Unified Approach to Fine-grained Visual Perception via Open-ended Language Interface

  • Hao Tang
  • Chen-Wei Xie
  • Haiyang Wang
  • Xiaoyi Bao
  • Tingyu Weng
  • Pandeng Li
  • Yun Zheng
  • Liwei Wang

Generalist models have achieved remarkable success in both language and vision-language tasks, showcasing the potential of unified modeling. However, effectively integrating fine-grained perception tasks like detection and segmentation into these models remains a significant challenge, primarily because these tasks rely heavily on task-specific designs and architectures that complicate the modeling process. To address this challenge, we present UFO, a framework that unifies fine-grained visual perception tasks through an open-ended language interface. By transforming all perception targets into the language space, UFO unifies object-level detection, pixel-level segmentation, and image-level vision-language tasks into a single model. Additionally, we introduce a novel embedding retrieval approach that relies solely on the language interface to support segmentation tasks. Our framework bridges the gap between fine-grained perception and vision-language tasks, significantly simplifying architectural design and training strategies while achieving comparable or superior performance to methods with intricate task-specific designs. After multi-task training on five standard visual perception datasets, UFO outperforms the previous state-of-the-art generalist models by 12.3 mAP on COCO instance segmentation and 3.3 mIoU on ADE20K semantic segmentation. Furthermore, our method seamlessly integrates with existing MLLMs, effectively combining fine-grained perception capabilities with their advanced language abilities and thereby achieving superior performance on challenging reasoning segmentation. Code and models are available at https://github.com/nnnth/UFO.
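
The embedding retrieval idea for segmentation admits a compact sketch. In the PyTorch toy below, a mask embedding (imagined as decoded through the language interface) is compared against per-pixel image features by dot product, and the similarity map is thresholded into a binary mask. Shapes, names, and the threshold are illustrative assumptions, not UFO's actual decoder.

```python
# Hedged sketch of an "embedding retrieval" style mask decoder.
import torch

def retrieve_mask(pixel_feats: torch.Tensor, mask_embed: torch.Tensor):
    """pixel_feats: (C, H, W) image features; mask_embed: (C,) embedding
    assumed to come from the language interface. Returns an (H, W) mask."""
    # Dot-product retrieval: similarity of every pixel to the mask embedding.
    sim = torch.einsum("chw,c->hw", pixel_feats, mask_embed)
    return sim.sigmoid() > 0.5       # threshold into a binary mask

feats = torch.randn(256, 64, 64)
embed = torch.randn(256)
mask = retrieve_mask(feats, embed)
print(mask.float().mean())           # fraction of pixels selected
```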

AAAI 2023 Conference Paper

C-NTPP: Learning Cluster-Aware Neural Temporal Point Process

  • Fangyu Ding
  • Junchi Yan
  • Haiyang Wang

Event sequences in continuous time are ubiquitous across applications and have been intensively studied with both classic temporal point process (TPP) models and their recent deep network variants. This work is motivated by the observation that many event datasets exhibit inherent clustering patterns in terms of sparse correlation among events, a characteristic seldom explicitly considered in existing neural TPP models, whose history encoders are typically RNNs or Transformers. In this work, we propose C-NTPP (Cluster-Aware Neural Temporal Point Process), a model that leverages a sequential variational autoencoder framework to infer the latent cluster each event in the sequence belongs to. Specifically, a novel event-clustered attention mechanism is devised to learn each cluster, and the clusters are then aggregated to obtain the final representation of each event. Extensive experiments show that C-NTPP achieves superior performance on both real-world and synthetic datasets and can also uncover the underlying clustering correlations.
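
One plausible reading of the event-clustered attention mechanism can be sketched directly. In the PyTorch toy below, each event receives a soft latent cluster assignment via Gumbel-softmax, and attention between events is biased toward pairs in the same cluster before per-event representations are aggregated. The sequential-VAE inference machinery is omitted, and all names and the exact biasing scheme are assumptions rather than the paper's design.

```python
# Hedged sketch of event-clustered attention over an event sequence.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ClusteredAttention(nn.Module):
    def __init__(self, dim=32, n_clusters=3):
        super().__init__()
        self.assign = nn.Linear(dim, n_clusters)   # latent cluster logits
        self.qkv = nn.Linear(dim, 3 * dim)
        self.d = dim

    def forward(self, h):                          # h: (B, L, dim) event embeddings
        # Soft (differentiable) cluster assignment per event.
        z = F.gumbel_softmax(self.assign(h), tau=0.5, dim=-1)   # (B, L, K)
        q, k, v = self.qkv(h).chunk(3, dim=-1)
        logits = q @ k.transpose(-2, -1) / self.d ** 0.5        # (B, L, L)
        # Events in the same latent cluster attend strongly; cross-cluster
        # attention is suppressed by the assignment similarity z @ z^T.
        same_cluster = z @ z.transpose(-2, -1)                  # (B, L, L)
        attn = F.softmax(logits + torch.log(same_cluster + 1e-6), dim=-1)
        return attn @ v                            # aggregated per-event representation

out = ClusteredAttention()(torch.randn(4, 10, 32))  # (4, 10, 32)
```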

NeurIPS 2023 Conference Paper

PRED: Pre-training via Semantic Rendering on LiDAR Point Clouds

  • Hao Yang
  • Haiyang Wang
  • Di Dai
  • Liwei Wang

Pre-training is crucial in 3D-related fields such as autonomous driving, where point cloud annotation is costly and challenging. Many recent studies on point cloud pre-training, however, have overlooked the issue of incompleteness: only a fraction of the points are captured by LiDAR, leading to ambiguity during the training phase. Images, on the other hand, offer more comprehensive information and richer semantics that can bolster point cloud encoders in addressing the incompleteness inherent in point clouds. Yet incorporating images into point cloud pre-training presents its own challenges due to occlusions, which can cause misalignments between points and pixels. In this work, we propose PRED, a novel image-assisted, occlusion-aware pre-training framework for outdoor point clouds. The main ingredient of our framework is a Bird's-Eye-View (BEV) feature-map-conditioned semantic rendering, which leverages the semantics of images for supervision through neural rendering. We further enhance our model's performance by incorporating point-wise masking with a high mask ratio (95%). Extensive experiments demonstrate PRED's superiority over prior point cloud pre-training methods, providing significant improvements on various large-scale datasets for 3D perception tasks. Code will be available at https://github.com/PRED4pc/PRED.
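
Of the components mentioned, the point-wise masking step is easy to illustrate. The numpy sketch below drops a random 95% of input points, keeping only the visible subset for the encoder during pre-training. The function name and point layout are assumptions; the BEV-conditioned semantic rendering that provides supervision is not reproduced here.

```python
# Hedged sketch of point-wise masking with a high mask ratio (95%).
import numpy as np

def mask_points(points: np.ndarray, mask_ratio: float = 0.95, seed: int = 0):
    """points: (N, 3+) LiDAR points; returns (visible points, masked indices)."""
    rng = np.random.default_rng(seed)
    n_mask = int(len(points) * mask_ratio)
    perm = rng.permutation(len(points))
    masked_idx, visible_idx = perm[:n_mask], perm[n_mask:]
    return points[visible_idx], masked_idx

pts = np.random.rand(100_000, 4)     # x, y, z, intensity
visible, masked_idx = mask_points(pts)
print(visible.shape)                 # (5000, 4): only 5% kept for the encoder
```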

NeurIPS 2022 Conference Paper

CAGroup3D: Class-Aware Grouping for 3D Object Detection on Point Clouds

  • Haiyang Wang
  • Lihe Ding
  • Shaocong Dong
  • Shaoshuai Shi
  • Aoxue Li
  • Jianan Li
  • Zhenguo Li
  • Liwei Wang

We present CAGroup3D, a novel two-stage, fully sparse convolutional 3D object detection framework. Our method first generates high-quality 3D proposals by applying a class-aware local grouping strategy to object surface voxels with the same semantic predictions, accounting for the semantic consistency and diverse locality that previous bottom-up approaches neglect. Then, to recover the features of voxels missed due to incorrect voxel-wise segmentation, we build a fully sparse convolutional RoI pooling module that directly aggregates fine-grained spatial information from the backbone for further proposal refinement. It is memory- and computation-efficient and better encodes the geometry-specific features of each 3D proposal. Our model achieves state-of-the-art 3D detection performance, with remarkable gains of +3.6% on ScanNet V2 and +2.6% on SUN RGB-D in terms of mAP@0.25. Code will be available at https://github.com/Haiyang-W/CAGroup3D.
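
To make the grouping step concrete, here is a hedged numpy sketch of class-aware local grouping as the abstract describes it: each voxel is grouped with nearby voxels that share its predicted semantic class. The radius, shapes, and function name are illustrative assumptions, and the brute-force neighbor search stands in for whatever spatial indexing the real implementation uses.

```python
# Hedged sketch of class-aware local grouping over voxel centers.
import numpy as np

def class_aware_group(centers: np.ndarray, sem_pred: np.ndarray, radius: float = 0.3):
    """centers: (N, 3) voxel centers; sem_pred: (N,) predicted class ids.
    Returns, per voxel, indices of same-class neighbors within `radius`."""
    groups = []
    for i in range(len(centers)):
        d = np.linalg.norm(centers - centers[i], axis=1)
        # Group only semantically consistent neighbors.
        same = (sem_pred == sem_pred[i]) & (d < radius)
        groups.append(np.flatnonzero(same))
    return groups

centers = np.random.rand(200, 3)
sem = np.random.randint(0, 4, size=200)
groups = class_aware_group(centers, sem)
print(len(groups[0]), "same-class neighbors for voxel 0")
```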

NeurIPS 2022 Conference Paper

MsSVT: Mixed-scale Sparse Voxel Transformer for 3D Object Detection on Point Clouds

  • Shaocong Dong
  • Lihe Ding
  • Haiyang Wang
  • Tingfa Xu
  • Xinli Xu
  • Jie Wang
  • Ziyang Bian
  • Ying Wang

3D object detection from LiDAR point clouds is fundamental to autonomous driving. Large-scale outdoor scenes usually feature significant variance in instance scales, requiring features rich in both long-range and fine-grained information to support accurate detection. Recent detectors leverage window-based transformers to model long-range dependencies but tend to blur out fine-grained details. To bridge this gap, we present MsSVT, a novel Mixed-scale Sparse Voxel Transformer that captures both types of information simultaneously through a divide-and-conquer approach. Specifically, MsSVT explicitly divides attention heads into multiple groups, each responsible for attending to information within a particular range. The outputs of all groups are merged to obtain the final mixed-scale features. Moreover, we provide a novel chessboard sampling strategy to reduce the computational complexity of applying a window-based transformer in 3D voxel space. To improve efficiency, we also implement the voxel sampling and gathering operations sparsely with a hash map. Endowed with this powerful yet efficient modeling of mixed-scale information, our single-stage detector built on top of MsSVT surprisingly outperforms state-of-the-art two-stage detectors on Waymo. Our project page: https://github.com/dscdyc/MsSVT.
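
The head-grouping idea is concrete enough to sketch. Below is a dense 1-D PyTorch toy in which attention heads are split so each attends only within its own radius, and the group outputs are concatenated into mixed-scale features. This is not the paper's sparse 3-D voxel implementation; class and parameter names (MixedScaleAttention, ranges) are assumptions, and the chessboard sampling and hash-map gathering are omitted.

```python
# Hedged 1-D toy of mixed-scale attention: one attention radius per head.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixedScaleAttention(nn.Module):
    def __init__(self, dim=64, heads=4, ranges=(2, 4, 8, 16)):
        super().__init__()
        assert heads == len(ranges)
        self.h, self.d = heads, dim // heads
        self.ranges = ranges                       # per-head attention radius
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                          # x: (B, N, dim)
        B, N, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q = q.view(B, N, self.h, self.d).transpose(1, 2)   # (B, h, N, d)
        k = k.view(B, N, self.h, self.d).transpose(1, 2)
        v = v.view(B, N, self.h, self.d).transpose(1, 2)
        idx = torch.arange(N)
        dist = (idx[:, None] - idx[None, :]).abs()         # (N, N) distances
        outs = []
        for h, r in enumerate(self.ranges):
            # Each head group only attends within its own radius r.
            logits = q[:, h] @ k[:, h].transpose(-2, -1) / self.d ** 0.5
            logits = logits.masked_fill(dist > r, float("-inf"))
            outs.append(F.softmax(logits, dim=-1) @ v[:, h])
        out = torch.cat(outs, dim=-1)                      # merge all groups
        return self.proj(out)                              # mixed-scale features

y = MixedScaleAttention()(torch.randn(2, 32, 64))          # (2, 32, 64)
```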

NeurIPS 2021 Conference Paper

Non-convex Distributionally Robust Optimization: Non-asymptotic Analysis

  • Jikai Jin
  • Bohang Zhang
  • Haiyang Wang
  • Liwei Wang

Distributionally robust optimization (DRO) is a widely used approach to learning models that are robust to distribution shift. Compared with the standard optimization setting, the objective function in DRO is more difficult to optimize, and most existing theoretical results make strong assumptions on the loss function. In this work we bridge the gap by studying DRO algorithms for general smooth non-convex losses. By carefully exploiting the specific form of the DRO objective, we provide non-asymptotic convergence guarantees even though the objective function is possibly non-convex, non-smooth, and has unbounded gradient noise. In particular, we prove that mini-batch normalized gradient descent with momentum can find an $\epsilon$-first-order stationary point within $\mathcal{O}(\epsilon^{-4})$ gradient complexity. We also discuss the conditional value-at-risk (CVaR) setting, where we propose a penalized DRO objective based on a smoothed version of the CVaR that allows us to obtain a similar convergence guarantee. We finally verify our theoretical results on a number of tasks and find that the proposed algorithm consistently achieves notable acceleration.
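
The analyzed algorithm is easy to state in code. The numpy sketch below runs mini-batch normalized gradient descent with momentum on a toy smoothed worst-case (CVaR-like) objective, where per-sample losses are adversarially re-weighted by a softmax. The objective, step size, and momentum constant are illustrative assumptions, not the paper's exact setting.

```python
# Hedged sketch: mini-batch normalized gradient descent with momentum on a
# toy softmax-smoothed DRO objective for linear regression.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=1000)

def dro_grad(w, idx, lam=1.0):
    # Gradient of a smoothed worst-case objective over a mini-batch:
    # per-sample losses are re-weighted by softmax(loss / lam).
    r = X[idx] @ w - y[idx]
    losses = 0.5 * r ** 2
    p = np.exp((losses - losses.max()) / lam)
    p /= p.sum()                                 # adversarial sample weights
    return (p * r) @ X[idx]

w = np.zeros(5)
m = np.zeros(5)
beta, lr = 0.9, 0.1
for t in range(500):
    idx = rng.choice(len(X), size=64, replace=False)
    g = dro_grad(w, idx)
    m = beta * m + (1 - beta) * g                # momentum buffer
    w -= lr * m / (np.linalg.norm(m) + 1e-12)    # normalized update
print("RMSE after training:", np.linalg.norm(X @ w - y) / np.sqrt(len(X)))
```

Normalizing the update direction is what makes the analysis tolerate unbounded gradient noise: the step length is bounded by the learning rate regardless of how large the stochastic gradient is.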

ICRA 2014 Conference Paper

Human gait modeling and gait analysis based on Kinect

  • Baiqing Sun
  • Xiaogang Liu
  • Xuetang Wu
  • Haiyang Wang

Real-time monitoring of elderly movement can provide valuable information about an individual's degree of functional rehabilitation. Many laboratory-based studies have described gait detection systems built on various wearable inertial sensors, but only a limited number of papers have addressed the problem with non-wearable sensors. This paper proposes a practical method for gait information detection and gait analysis using an inexpensive Microsoft Kinect mounted at the midpoint of a lower-extremity rehabilitation robot. The horizontal distances between the Kinect plane and markers attached to the lower extremities are acquired. Taking the characteristics of the gait distance series into consideration, an Autoregressive Moving Average (ARMA) model is established to capture the evolution of the gait status. Combined with a Kalman filter, the gait information reflecting the rehabilitation status at the next moment is predicted accurately. Finally, the method is verified through extensive gait experiments.
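
The ARMA-plus-Kalman prediction step is simple enough to illustrate. The numpy sketch below fits a low-order autoregressive model to a synthetic gait distance series by least squares and runs a scalar Kalman filter with the fitted dynamics to produce one-step-ahead forecasts. The AR order, noise covariances, and synthetic signal are assumptions; the paper's exact ARMA specification is not reproduced.

```python
# Hedged sketch: AR model + scalar Kalman filter for one-step gait prediction.
import numpy as np

rng = np.random.default_rng(1)
t = np.arange(300)
gait = np.sin(2 * np.pi * t / 40) + 0.05 * rng.normal(size=t.size)  # toy series

# Fit AR(2) coefficients by least squares: x_t ≈ a1*x_{t-1} + a2*x_{t-2}.
X = np.column_stack([gait[1:-1], gait[:-2]])
a = np.linalg.lstsq(X, gait[2:], rcond=None)[0]

# Scalar Kalman filter with the AR model as the (approximate) dynamics.
Q, R = 1e-4, 0.05 ** 2        # process / measurement noise (assumed)
x_est, P = gait[1], 1.0
preds = []
for k in range(2, t.size):
    x_pred = a[0] * x_est + a[1] * gait[k - 2]   # predict via AR dynamics
    P_pred = a[0] ** 2 * P + Q
    preds.append(x_pred)                          # one-step-ahead forecast
    K = P_pred / (P_pred + R)                     # Kalman gain
    x_est = x_pred + K * (gait[k] - x_pred)       # update with new measurement
    P = (1 - K) * P_pred

rmse = np.sqrt(np.mean((np.array(preds) - gait[2:]) ** 2))
print(f"one-step prediction RMSE: {rmse:.3f}")
```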