Arrow Research search

Author name cluster

Yaowei Wang

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

24 papers
1 author row

Possible papers

24

AAAI Conference 2026 Conference Paper

Exo2Ego: Exocentric Knowledge Guided MLLM for Egocentric Video Understanding

  • Haoyu Zhang
  • Qiaohui Chu
  • Meng Liu
  • Haoxiang Shi
  • Yaowei Wang
  • Liqiang Nie

AI personal assistants, deployed through robots or wearables, require embodied understanding to collaborate effectively with humans. However, current Multimodal Large Language Models (MLLMs) primarily focus on third-person (exocentric) vision, overlooking the unique challenges of first-person (egocentric) videos. Additionally, high acquisition costs limit data size, impairing MLLM performance. To address these challenges, we propose learning the mapping between exocentric and egocentric domains, leveraging the extensive exocentric knowledge within existing MLLMs to enhance egocentric video understanding. To this end, we introduce Ego-ExoClip, a pre-training dataset comprising 1.1M synchronized ego-exo clip-text pairs derived from Ego-Exo4D, together with the instruction-tuning dataset EgoIT, which is collected from multiple sources to enhance the model's instruction-following capabilities. Building upon the datasets, we propose a migration strategy and further design a progressive mapping learning pipeline with three stages: Demonstrator Self-Preparation, Demonstrator-Learner Guidance, and Learner Self-Practice. Extensive experiments across diverse egocentric tasks reveal that existing MLLMs perform inadequately in egocentric video understanding, while our model significantly outperforms these leading models.

IJCAI Conference 2025 Conference Paper

A Survey on the Feedback Mechanism of LLM-based AI Agents

  • Zhipeng Liu
  • Xuefeng Bai
  • Kehai Chen
  • Xinyang Chen
  • Xiucheng Li
  • Yang Xiang
  • Jin Liu
  • Hong-Dong Li

Large language models (LLMs) are increasingly being adopted to develop general-purpose AI agents. However, it remains challenging for these LLM-based AI agents to efficiently learn from feedback and iteratively optimize their strategies. To address this challenge, tremendous efforts have been dedicated to designing diverse feedback mechanisms for LLM-based AI agents. To provide a comprehensive overview of this rapidly evolving field, this paper presents a systematic review of these studies, offering a holistic perspective on the feedback mechanisms in LLM-based AI agents. We begin by discussing the construction of LLM-based AI agents, introducing a generalized framework that encapsulates much of the existing work. Next, we delve into the exploration of feedback mechanisms, categorizing them into four distinct types: internal feedback, external feedback, multi-agent feedback, and human feedback. Additionally, we provide an overview of evaluation protocols and benchmarks specifically tailored for LLM-based AI agents. Finally, we highlight the significant challenges and identify potential directions for future studies. The relevant papers are summarized and will be consistently updated at https://github.com/kevinson7515/Agents-Feedback-Mechanisms.

NeurIPS Conference 2025 Conference Paper

FOCUS: Unified Vision-Language Modeling for Interactive Editing Driven by Referential Segmentation

  • Fan Yang
  • Yousong Zhu
  • Xin Li
  • Yufei Zhan
  • Hongyin Zhao
  • Shurong Zheng
  • Yaowei Wang
  • Ming Tang

Recent Large Vision Language Models (LVLMs) demonstrate promising capabilities in unifying visual understanding and generative modeling, enabling both accurate content understanding and flexible editing. However, current approaches treat "what to see" and "how to edit" separately: they either perform isolated object segmentation or utilize segmentation masks merely as conditional prompts for local edit generation tasks, often relying on multiple disjointed models. To bridge these gaps, we introduce FOCUS, a unified LVLM that integrates segmentation-aware perception and controllable object-centric generation within an end-to-end framework. FOCUS employs a dual-branch visual encoder to simultaneously capture global semantic context and fine-grained spatial details. In addition, we leverage a MoVQGAN-based visual tokenizer to produce discrete visual tokens that enhance generation quality. To enable accurate and controllable image editing, we propose a progressive multi-stage training pipeline, where segmentation masks are jointly optimized and used as spatial condition prompts to guide the diffusion decoder. This strategy aligns visual encoding, segmentation, and generation modules, effectively bridging segmentation-aware perception with fine-grained visual synthesis. Extensive experiments across three core tasks, including multimodal understanding, referring segmentation accuracy, and controllable image generation, demonstrate that FOCUS achieves strong performance by jointly optimizing visual perception and generative capabilities.

NeurIPS Conference 2025 Conference Paper

LoRATv2: Enabling Low-Cost Temporal Modeling in One-Stream Trackers

  • Liting Lin
  • Heng Fan
  • Zhipeng Zhang
  • Yuqing Huang
  • Yaowei Wang
  • Yong Xu
  • Haibin Ling

Transformer-based algorithms, such as LoRAT, have significantly enhanced object-tracking performance. However, these approaches rely on a standard attention mechanism, which incurs quadratic token complexity, making real-time inference computationally expensive. In this paper, we introduce LoRATv2, a novel tracking framework that addresses these limitations with three main contributions. First, LoRATv2 integrates frame-wise causal attention, which ensures full self-attention within each frame while enabling causal dependencies across frames, significantly reducing computational overhead. Moreover, key-value (KV) caching is employed to efficiently reuse past embeddings for further speedup. Second, building on LoRAT's parameter-efficient fine-tuning, we propose Stream-Specific LoRA Adapters (SSLA). As frame-wise causal attention introduces asymmetry in how streams access temporal information, SSLA assigns dedicated LoRA modules to the template and each search stream, with the main ViT backbone remaining frozen. This allows specialized adaptation for each stream's role in temporal tracking. Third, we introduce a two-phase progressive training strategy, which first trains a single-search-frame tracker and then gradually extends it to multi-search-frame inputs by introducing additional LoRA modules. This curriculum-based learning paradigm improves long-term tracking while maintaining training efficiency. In extensive experiments on multiple benchmarks, LoRATv2 achieves state-of-the-art performance, substantially improved efficiency, and a superior performance-to-FLOPs ratio over state-of-the-art trackers. The code is available at https://github.com/LitingLin/LoRATv2.
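
A minimal sketch of the frame-wise causal attention mask described above (shapes and names are illustrative assumptions, not LoRATv2's code): tokens attend bidirectionally within their own frame and causally to earlier frames, which is what makes key-value caching of past frames safe.

```python
import torch

def frame_causal_mask(num_frames: int, tokens_per_frame: int) -> torch.Tensor:
    """Boolean attention mask that is fully bidirectional within each frame
    but causal across frames: tokens of frame i may attend to every token of
    frames 0..i and to none of the later frames."""
    frame_ids = torch.arange(num_frames).repeat_interleave(tokens_per_frame)
    # mask[q, k] is True where query token q may attend to key token k
    return frame_ids.unsqueeze(1) >= frame_ids.unsqueeze(0)

# 3 frames of 4 tokens each -> a 12x12 block-lower-triangular mask.
mask = frame_causal_mask(num_frames=3, tokens_per_frame=4)
print(mask.shape)          # torch.Size([12, 12])
print(mask[:4, 4:].any())  # tensor(False): frame 0 never attends to later frames
```

Because no token ever attends to a later frame, the keys and values of past frames can be cached and reused at inference time instead of being recomputed.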

AAAI Conference 2025 Conference Paper

Pilot: Building the Federated Multimodal Instruction Tuning Framework

  • Baochen Xiong
  • Xiaoshan Yang
  • Yaguang Song
  • Yaowei Wang
  • Changsheng Xu

In this paper, we explore a novel federated multimodal instruction tuning task (FedMIT), which is significant for collaboratively fine-tuning MLLMs on different types of multimodal instruction data on distributed devices. To solve the new task, we propose a federated multimodal instruction tuning framework (Pilot). Our framework integrates two stages of "adapter on adapter" into the connector of the vision encoder and the LLM. In stage 1, we extract task-specific features and client-specific features from visual information. In stage 2, we build the cross-task Mixture-of-Adapters (CT-MoA) module to perform cross-task interaction. Each client can not only capture personalized information of local data and learn task-related multimodal information, but also learn general knowledge from other tasks. In addition, we introduce an adaptive parameter aggregation strategy for text training parameters, which optimizes parameter aggregation by calculating weights based on the Euclidean distance between parameters, so that parameter aggregation can benefit from positive effects to the greatest extent while effectively reducing negative effects. Our framework can collaboratively exploit distributed data from different local clients to learn cross-task knowledge without being affected by task heterogeneity during instruction tuning. The effectiveness of our method is verified in two different cross-task scenarios.
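
As a rough illustration of the distance-based aggregation idea (the abstract only states that weights are computed from Euclidean distances between parameters, so the exact formula below is an assumption), one could weight each client's text-side parameters inversely to their average distance from the other clients:

```python
import numpy as np

def aggregate_by_distance(client_params: list[np.ndarray], eps: float = 1e-8) -> np.ndarray:
    """Aggregate client parameter vectors with weights derived from pairwise
    Euclidean distances: clients whose parameters lie closer to the others get
    larger weights, damping the influence of outlier updates."""
    stacked = np.stack(client_params)                  # (n_clients, dim)
    diffs = stacked[:, None, :] - stacked[None, :, :]  # pairwise differences
    dists = np.linalg.norm(diffs, axis=-1)             # (n_clients, n_clients)
    avg_dist = dists.sum(axis=1) / (len(client_params) - 1)
    weights = 1.0 / (avg_dist + eps)                   # closer => larger weight
    weights = weights / weights.sum()
    return (weights[:, None] * stacked).sum(axis=0)

# Toy example: three clients, one of which drifted away from the consensus.
params = [np.array([1.0, 1.0]), np.array([1.1, 0.9]), np.array([5.0, 5.0])]
print(aggregate_by_distance(params))
```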

NeurIPS Conference 2025 Conference Paper

Spatial Understanding from Videos: Structured Prompts Meet Simulation Data

  • Haoyu Zhang
  • Meng Liu
  • Zaijing Li
  • Haokun Wen
  • Weili Guan
  • Yaowei Wang
  • Liqiang Nie

Visual-spatial understanding, the ability to infer object relationships and layouts from visual input, is fundamental to downstream tasks such as robotic navigation and embodied interaction. However, existing methods face spatial uncertainty and data scarcity, limiting the 3D spatial reasoning capability of pre-trained vision-language models (VLMs). To address these challenges, we present a unified framework for enhancing 3D spatial reasoning in pre-trained VLMs without modifying their architecture. This framework combines SpatialMind, a structured prompting strategy that decomposes complex scenes and questions into interpretable reasoning steps, with ScanForgeQA, a scalable question-answering dataset built from diverse 3D simulation scenes through an automated construction process designed for fine-tuning. Extensive experiments across multiple benchmarks demonstrate the individual and combined effectiveness of our prompting and fine-tuning strategies, and yield insights that may inspire future research on visual-spatial understanding.

AAAI Conference 2025 Conference Paper

Unsupervised Degradation Representation Aware Transform for Real-World Blind Image Super-Resolution

  • Sen Chen
  • Hongying Liu
  • Chaowei Fang
  • Fanhua Shang
  • Yuanyuan Liu
  • Liang Wan
  • Dongmei Jiang
  • Yaowei Wang

Blind image super-resolution (blind SR) aims to restore a high-resolution (HR) image from a low-resolution (LR) image with unknown degradation. Many existing methods explicitly estimate degradation information from various LR images. However, in most cases, image degradations are independent of image content, so estimations influenced by the image content become inaccurate. Unlike existing works, we design a dual-encoder for degradation representation (DEDR) to preclude the influence of image content from LR images. This helps extract the intrinsic degradation representation more accurately. To the best of our knowledge, this paper is the first work that estimates the degradation representation by filtering out image content. Based on the degradation representation extracted by DEDR, we present a novel framework, named degradation representation aware transform network (DRAT), for blind SR. We propose global degradation aware (GDA) blocks to propagate degradation information across spatial and channel dimensions, in which a degradation representation transform module (DRT) is introduced to render features degradation-aware, thereby enhancing the restoration of LR images. Extensive experiments are conducted on three benchmark datasets (including Gaussian8, DIV2KRK, and real-world datasets) under large scaling factors with complex degradations. The experimental results demonstrate that DRAT surpasses state-of-the-art supervised kernel estimation and unsupervised degradation representation methods.

AAAI Conference 2024 Conference Paper

Feature Distribution Matching by Optimal Transport for Effective and Robust Coreset Selection

  • Weiwei Xiao
  • Yongyong Chen
  • Qiben Shan
  • Yaowei Wang
  • Jingyong Su

Training neural networks with good generalization requires large computational costs in many deep learning methods due to large-scale datasets and over-parameterized models. Despite the emergence of a number of coreset selection methods to reduce the computational costs, the problem of coreset distribution bias, i.e., the skewed distribution between the coreset and the entire dataset, has not been well studied. In this paper, we find that the closer the feature distribution of the coreset is to that of the entire dataset, the better the generalization performance of the coreset, particularly under extreme pruning. This motivates us to propose a simple yet effective method for coreset selection to alleviate the distribution bias between the coreset and the entire dataset, called feature distribution matching (FDMat). Unlike gradient-based methods, which select samples with larger gradient values or approximate the gradient values of the entire dataset, FDMat aims to select the coreset whose feature distribution is closest to that of the entire dataset. Specifically, FDMat formulates coreset selection as an optimal transport problem from the coreset to the entire dataset in the feature embedding space. Moreover, our method shows strong robustness due to the removal of samples far from the distribution, especially when the entire dataset contains noisy and class-imbalanced samples. Extensive experiments on multiple benchmarks show that FDMat improves coreset selection performance over existing coreset methods. The code is available at https://github.com/successhaha/FDMat.
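
A minimal numpy sketch of the matching criterion (not the authors' optimization code, and the regularization choice is an assumption): score a candidate coreset by the entropic optimal-transport cost between its features and those of the entire dataset, then keep the candidate with the smallest cost.

```python
import numpy as np

def sinkhorn_cost(x_coreset, x_full, reg_ratio=0.1, n_iters=200):
    """Entropic optimal-transport cost between the feature distribution of a
    candidate coreset and that of the full dataset (uniform weights on both
    sides). A smaller cost means the coreset matches the full distribution
    more closely."""
    a = np.full(len(x_coreset), 1.0 / len(x_coreset))
    b = np.full(len(x_full), 1.0 / len(x_full))
    # squared Euclidean ground cost between feature embeddings
    cost = ((x_coreset[:, None, :] - x_full[None, :, :]) ** 2).sum(-1)
    reg = reg_ratio * cost.mean()            # regularization scaled to the cost
    K = np.exp(-cost / reg)
    u = np.ones_like(a)
    for _ in range(n_iters):                 # Sinkhorn fixed-point iterations
        v = b / (K.T @ u)
        u = a / (K @ v)
    plan = u[:, None] * K * v[None, :]       # approximate transport plan
    return (plan * cost).sum()

# Pick, among random candidate coresets, the one closest in OT cost.
rng = np.random.default_rng(0)
features = rng.normal(size=(1000, 16))       # stand-in for penultimate-layer features
candidates = [rng.choice(1000, 50, replace=False) for _ in range(5)]
best = min(candidates, key=lambda idx: sinkhorn_cost(features[idx], features))
```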

AAAI Conference 2024 Conference Paper

HARDVS: Revisiting Human Activity Recognition with Dynamic Vision Sensors

  • Xiao Wang
  • Zongzhen Wu
  • Bo Jiang
  • Zhimin Bao
  • Lin Zhu
  • Guoqi Li
  • Yaowei Wang
  • Yonghong Tian

Mainstream human activity recognition (HAR) algorithms are developed based on RGB cameras, which usually suffer from illumination, fast motion, privacy preservation, and large energy consumption. Meanwhile, biologically inspired event cameras have attracted great interest due to their unique features, such as high dynamic range, dense temporal but sparse spatial resolution, low latency, low power, etc. As the event camera is a newly emerging sensor, there is not yet a realistic large-scale dataset for HAR. Considering its great practical value, in this paper, we propose a large-scale benchmark dataset to bridge this gap, termed HARDVS, which contains 300 categories and more than 100K event sequences. We evaluate and report the performance of multiple popular HAR algorithms, which provides extensive baselines for future works to compare against. More importantly, we propose a novel spatial-temporal feature learning and fusion framework, termed ESTF, for event-stream-based human activity recognition. It first projects the event streams into spatial and temporal embeddings using StemNet, then encodes and fuses the dual-view representations using Transformer networks. Finally, the dual features are concatenated and fed into a classification head for activity prediction. Extensive experiments on multiple datasets fully validate the effectiveness of our model. Both the dataset and source code will be released at https://github.com/Event-AHU/HARDVS.

NeurIPS Conference 2024 Conference Paper

LG-VQ: Language-Guided Codebook Learning

  • Guotao Liang
  • Baoquan Zhang
  • Yaowei Wang
  • Xutao Li
  • Yunming Ye
  • Huaibin Wang
  • Chuyao Luo
  • Kola Ye

Vector quantization (VQ) is a key technique in high-resolution and high-fidelity image synthesis, which aims to learn a codebook to encode an image with a sequence of discrete codes and then generate an image in an auto-regressive manner. Although existing methods have shown superior performance, most methods prefer to learn a single-modal codebook (e.g., image), resulting in suboptimal performance when the codebook is applied to multi-modal downstream tasks (e.g., text-to-image, image captioning) due to the existence of modal gaps. In this paper, we propose a novel language-guided codebook learning framework, called LG-VQ, which aims to learn a codebook that can be aligned with the text to improve the performance of multi-modal downstream tasks. Specifically, we first introduce pre-trained text semantics as prior knowledge, then design two novel alignment modules (i.e., a Semantic Alignment Module and a Relationship Alignment Module) to transfer such prior knowledge into codes for achieving codebook-text alignment. In particular, our LG-VQ method is model-agnostic, which can be easily integrated into existing VQ models. Experimental results show that our method achieves superior performance on reconstruction and various multi-modal downstream tasks.
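
For context, the base vector-quantization step that LG-VQ builds on can be sketched as a nearest-codebook lookup with a straight-through gradient; the language-guided alignment modules themselves are not shown, and all names here are illustrative.

```python
import torch

def vector_quantize(z: torch.Tensor, codebook: torch.Tensor):
    """Assign each encoder feature to its nearest codebook entry and return
    both the discrete code indices and the quantized features, using a
    straight-through estimator so gradients flow back to the encoder."""
    # z: (num_tokens, dim), codebook: (codebook_size, dim)
    dists = torch.cdist(z, codebook)      # pairwise L2 distances
    indices = dists.argmin(dim=1)         # discrete code per token
    z_q = codebook[indices]
    z_q = z + (z_q - z).detach()          # straight-through estimator
    return indices, z_q

# Toy usage: a 16x16 feature map with 64-dim features and a 512-entry codebook.
codes, z_q = vector_quantize(torch.randn(16 * 16, 64), torch.randn(512, 64))
```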

NeurIPS Conference 2024 Conference Paper

M³GPT: An Advanced Multimodal, Multitask Framework for Motion Comprehension and Generation

  • Mingshuang Luo
  • RuiBing Hou
  • Zhuo Li
  • Hong Chang
  • Zimo Liu
  • Yaowei Wang
  • Shiguang Shan

This paper presents M³GPT, an advanced Multimodal, Multitask framework for Motion comprehension and generation. M³GPT operates on three fundamental principles. The first focuses on creating a unified representation space for various motion-relevant modalities. We employ discrete vector quantization for multimodal conditional signals, such as text, music and motion/dance, enabling seamless integration into a large language model (LLM) with a single vocabulary. The second involves modeling motion generation directly in the raw motion space. This strategy circumvents the information loss associated with a discrete tokenizer, resulting in more detailed and comprehensive motion generation. Third, M³GPT learns to model the connections and synergies among various motion-relevant tasks. Text, the most familiar and well-understood modality for LLMs, is utilized as a bridge to establish connections between different motion tasks, facilitating mutual reinforcement. To our knowledge, M³GPT is the first model capable of comprehending and generating motions based on multiple signals. Extensive experiments highlight M³GPT's superior performance across various motion-relevant tasks and its powerful zero-shot generalization capabilities for extremely challenging tasks. Project page: https://github.com/luomingshuang/M3GPT.

IJCAI Conference 2024 Conference Paper

MLP-DINO: Category Modeling and Query Graphing with Deep MLP for Object Detection

  • Guiping Cao
  • Wenjian Huang
  • Xiangyuan Lan
  • Jianguo Zhang
  • Dongmei Jiang
  • Yaowei Wang

Popular transformer-based detectors detect objects in a one-to-one manner, where both the bounding box and category of each object are predicted only by a single query, leading to box-sensitive category predictions. Additionally, the initialization of positional queries solely based on the predicted confidence scores or learnable embeddings neglects the significant spatial interrelation between different queries. This oversight leads to an imbalanced spatial distribution of queries (SDQ). In this paper, we propose a new MLP-DINO model to address these issues. Firstly, we present a new Query-Independent Category Supervision (QICS) approach for modeling category information, decoupling it from the sensitive bounding box prediction process to improve the detection performance. Secondly, to further improve the category predictions, we introduce a deep MLP model into the transformer-based detection framework to capture long-range and short-range information simultaneously. Thirdly, to balance the SDQ, we design a novel Graph-based Query Selection (GQS) method that distributes each query point in a discrete manner by graphing the spatial information of queries to cover a broader range of potential objects, significantly enhancing the hit-rate of queries. Experimental results on COCO indicate that our MLP-DINO achieves 54.6% AP with only 44M parameters under the 36-epoch setting, greatly outperforming the original DINO by +3.7% AP with fewer parameters and FLOPs. The source codes will be available at https://github.com/Med-Process/MLP-DINO.

NeurIPS Conference 2024 Conference Paper

OneRef: Unified One-tower Expression Grounding and Segmentation with Mask Referring Modeling

  • Linhui Xiao
  • Xiaoshan Yang
  • Fang Peng
  • Yaowei Wang
  • Changsheng Xu

Constrained by the separate encoding of vision and language, existing grounding and referring segmentation works heavily rely on bulky Transformer-based fusion en-/decoders and a variety of early-stage interaction technologies. Simultaneously, the current mask visual language modeling (MVLM) fails to capture the nuanced referential relationship between image and text in referring tasks. In this paper, we propose OneRef, a minimalist referring framework built on the modality-shared one-tower transformer that unifies the visual and linguistic feature spaces. To model the referential relationship, we introduce a novel MVLM paradigm called Mask Referring Modeling (MRefM), which encompasses both referring-aware mask image modeling and referring-aware mask language modeling. Both modules reconstruct not only modality-related content but also cross-modal referring content. Within MRefM, we propose a referring-aware dynamic image masking strategy that is aware of the referred region rather than relying on fixed ratios or generic random masking schemes. By leveraging the unified visual language feature space and incorporating MRefM's ability to model referential relations, our approach enables direct regression of the referring results without resorting to various complex techniques. Our method consistently surpasses existing approaches and achieves SoTA performance on both grounding and segmentation tasks, providing valuable insights for future research. Our code and models are available at https://github.com/linhuixiao/OneRef.

NeurIPS Conference 2024 Conference Paper

VMamba: Visual State Space Model

  • Yue Liu
  • Yunjie Tian
  • Yuzhong Zhao
  • Hongtian Yu
  • Lingxi Xie
  • Yaowei Wang
  • Qixiang Ye
  • Jianbin Jiao

Designing computationally efficient network architectures remains an ongoing necessity in computer vision. In this paper, we adapt Mamba, a state-space language model, into VMamba, a vision backbone with linear time complexity. At the core of VMamba is a stack of Visual State-Space (VSS) blocks with the 2D Selective Scan (SS2D) module. By traversing along four scanning routes, SS2D bridges the gap between the ordered nature of 1D selective scan and the non-sequential structure of 2D vision data, which facilitates the collection of contextual information from various sources and perspectives. Based on the VSS blocks, we develop a family of VMamba architectures and accelerate them through a succession of architectural and implementation enhancements. Extensive experiments demonstrate VMamba's promising performance across diverse visual perception tasks, highlighting its superior input scaling efficiency compared to existing benchmark models. Source code is available at https://github.com/MzeroMiko/VMamba.
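
A minimal sketch of the four scanning routes behind SS2D (the exact traversal orders used by VMamba may differ) simply unfolds a 2D feature map into four 1D sequences, each of which a 1D selective scan can then process:

```python
import torch

def four_way_scan(x: torch.Tensor) -> torch.Tensor:
    """Unfold a (channels, height, width) feature map into four token
    sequences: row-major, reversed row-major, column-major, and reversed
    column-major."""
    c = x.shape[0]
    row_major = x.reshape(c, -1)                  # left-to-right, top-to-bottom
    col_major = x.transpose(1, 2).reshape(c, -1)  # top-to-bottom, left-to-right
    return torch.stack([
        row_major,
        row_major.flip(-1),    # reversed row-major
        col_major,
        col_major.flip(-1),    # reversed column-major
    ])                          # (4, channels, height * width)

routes = four_way_scan(torch.randn(96, 14, 14))
print(routes.shape)  # torch.Size([4, 96, 196])
```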

AAAI Conference 2023 Conference Paper

Isolation and Impartial Aggregation: A Paradigm of Incremental Learning without Interference

  • Yabin Wang
  • Zhiheng Ma
  • Zhiwu Huang
  • Yaowei Wang
  • Zhou Su
  • Xiaopeng Hong

This paper focuses on the prevalent stage interference and stage performance imbalance of incremental learning. To avoid obvious stage learning bottlenecks, we propose a new incremental learning framework, which leverages a series of stage-isolated classifiers to perform the learning task at each stage, without interference from others. To be concrete, to aggregate multiple stage classifiers as a uniform one impartially, we first introduce a temperature-controlled energy metric for indicating the confidence score levels of the stage classifiers. We then propose an anchor-based energy self-normalization strategy to ensure the stage classifiers work at the same energy level. Finally, we design a voting-based inference augmentation strategy for robust inference. The proposed method is rehearsal-free and can work for almost all incremental learning scenarios. We evaluate the proposed method on four large datasets. Extensive results demonstrate the superiority of the proposed method in establishing new state-of-the-art overall performance. Code is available at https://github.com/iamwangyabin/ESN.
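
The temperature-controlled energy metric mentioned above is, in a minimal reading, the standard energy score over a classifier's logits; the sketch below shows that score only, not the anchor-based self-normalization or the voting strategy.

```python
import torch

def energy_score(logits: torch.Tensor, temperature: float = 1.0) -> torch.Tensor:
    """Temperature-controlled energy E(x; T) = -T * logsumexp(logits / T).
    Lower energy usually indicates higher confidence, so comparing per-stage
    energies puts stage-isolated classifiers on a common confidence scale."""
    return -temperature * torch.logsumexp(logits / temperature, dim=-1)

# Two stage classifiers scoring the same sample: the lower-energy stage is
# treated as the more confident vote during aggregation.
logits_stage1 = torch.tensor([[2.5, 0.1, -1.0]])
logits_stage2 = torch.tensor([[0.3, 0.2, 0.1]])
print(energy_score(logits_stage1), energy_score(logits_stage2))
```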

AAAI Conference 2023 Conference Paper

Learned Distributed Image Compression with Multi-Scale Patch Matching in Feature Domain

  • Yujun Huang
  • Bin Chen
  • Shiyu Qin
  • Jiawei Li
  • Yaowei Wang
  • Tao Dai
  • Shu-Tao Xia

Beyond achieving higher compression efficiency over classical image compression codecs, deep image compression is expected to be improved with additional side information, e.g., another image from a different perspective of the same scene. To better utilize the side information under the distributed compression scenario, the existing method only implements patch matching in the image domain to solve the parallax problem caused by the difference in viewing points. However, patch matching in the image domain is not robust to variations in scale, shape, and illumination caused by the different viewing angles, and cannot make full use of the rich texture information of the side information image. To resolve this issue, we propose Multi-Scale Feature Domain Patch Matching (MSFDPM) to fully utilize side information at the decoder of the distributed image compression model. Specifically, MSFDPM consists of a side information feature extractor, a multi-scale feature domain patch matching module, and a multi-scale feature fusion network. Furthermore, we reuse inter-patch correlation from the shallow layer to accelerate the patch matching of the deep layer. Finally, we find that our patch matching in a multi-scale feature domain further improves the compression rate by about 20% compared with patch matching in the image domain.

NeurIPS Conference 2023 Conference Paper

Learning Mask-aware CLIP Representations for Zero-Shot Segmentation

  • Siyu Jiao
  • Yunchao Wei
  • Yaowei Wang
  • Yao Zhao
  • Humphrey Shi

Recently, pre-trained vision-language models have been increasingly used to tackle the challenging zero-shot segmentation task. Typical solutions follow the paradigm of first generating mask proposals and then adopting CLIP to classify them. To maintain CLIP's zero-shot transferability, previous practices favour freezing CLIP during training. However, in this paper, we reveal that CLIP is insensitive to different mask proposals and tends to produce similar predictions for various mask proposals of the same image. This insensitivity results in numerous false positives when classifying mask proposals. This issue mainly relates to the fact that CLIP is trained with image-level supervision. To alleviate this issue, we propose a simple yet effective method, named Mask-aware Fine-tuning (MAFT). Specifically, an Image-Proposals CLIP Encoder (IP-CLIP Encoder) is proposed to handle arbitrary numbers of images and mask proposals simultaneously. Then, a mask-aware loss and a self-distillation loss are designed to fine-tune the IP-CLIP Encoder, ensuring CLIP is responsive to different mask proposals while not sacrificing transferability. In this way, mask-aware representations can be easily learned to make the true positives stand out. Notably, our solution can seamlessly plug into most existing methods without introducing any new parameters during the fine-tuning process. We conduct extensive experiments on the popular zero-shot benchmarks. With MAFT, the performance of the state-of-the-art methods is promoted by a large margin: 50.4% (+8.2%) on COCO, 81.8% (+3.2%) on Pascal-VOC, and 8.7% (+4.3%) on ADE20K in terms of mIoU for unseen classes. Code is provided for reproducibility at https://github.com/jiaosiyu1999/MAFT.git.

IJCAI Conference 2023 Conference Paper

Manifold-Aware Self-Training for Unsupervised Domain Adaptation on Regressing 6D Object Pose

  • Yichen Zhang
  • Jiehong Lin
  • Ke Chen
  • Zelin Xu
  • Yaowei Wang
  • Kui Jia

The domain gap between synthetic and real data in visual regression (e.g., 6D pose estimation) is bridged in this paper via global feature alignment and local refinement on the coarse classification of discretized anchor classes in target space, which imposes a piece-wise target manifold regularization into domain-invariant representation learning. Specifically, our method incorporates an explicit self-supervised manifold regularization, revealing consistent cumulative target dependency across domains, into a self-training scheme (e.g., the popular Self-Paced Self-Training) to encourage more discriminative transferable representations of regression tasks. Moreover, learning unified implicit neural functions to estimate the relative direction and distance of targets to their nearest class bins aims to refine target classification predictions, which can gain robust performance against inconsistent feature scaling sensitive to UDA regressors. Experimental results on three public benchmarks of the challenging 6D pose estimation task verify the effectiveness of our method, which consistently achieves superior performance to the state of the art for UDA on 6D pose estimation. Codes and pre-trained models are available at https://github.com/Gorilla-Lab-SCUT/MAST.

NeurIPS Conference 2022 Conference Paper

Learning to Share in Networked Multi-Agent Reinforcement Learning

  • Yuxuan Yi
  • Ge Li
  • Yaowei Wang
  • Zongqing Lu

In this paper, we study the problem of networked multi-agent reinforcement learning (MARL), where a number of agents are deployed as a partially connected network and each interacts only with nearby agents. Networked MARL requires all agents to make decisions in a decentralized manner to optimize a global objective with restricted communication between neighbors over the network. Inspired by the fact that sharing plays a key role in humans' learning of cooperation, we propose LToS, a hierarchically decentralized MARL framework that enables agents to learn to dynamically share reward with neighbors so as to encourage agents to cooperate on the global objective through collectives. For each agent, the high-level policy learns how to share reward with neighbors to decompose the global objective, while the low-level policy learns to optimize the local objective induced by the high-level policies in the neighborhood. The two policies form a bi-level optimization and learn alternately. We empirically demonstrate that LToS outperforms existing methods in both social dilemma and networked MARL scenarios across scales.
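
The reward-sharing arithmetic at the heart of LToS can be illustrated as follows; in the paper the sharing weights are produced by each agent's high-level policy, whereas here they are fixed toy values and every name is an assumption.

```python
import numpy as np

def shared_rewards(rewards: np.ndarray, share: np.ndarray, adjacency: np.ndarray) -> np.ndarray:
    """Redistribute individual rewards over a partially connected network.
    share[i, j] is the fraction of agent i's reward given to neighbor j (rows
    sum to at most 1); the remainder is kept. The result is the local
    objective each low-level policy would optimize."""
    share = share * adjacency              # only neighbors may receive a share
    kept = 1.0 - share.sum(axis=1)         # fraction each agent keeps
    return kept * rewards + share.T @ rewards

# Three agents on a line graph 0-1-2; agent 1 shares part of its reward.
adjacency = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
share = np.array([[0.0, 0.2, 0.0], [0.3, 0.0, 0.3], [0.0, 0.1, 0.0]])
print(shared_rewards(np.array([1.0, 2.0, 0.5]), share, adjacency))
```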

AAAI Conference 2022 Conference Paper

Towards End-to-End Image Compression and Analysis with Transformers

  • Yuanchao Bai
  • Xu Yang
  • Xianming Liu
  • Junjun Jiang
  • Yaowei Wang
  • Xiangyang Ji
  • Wen Gao

We propose an end-to-end image compression and analysis model with Transformers, targeting the cloud-based image classification application. Instead of placing an existing Transformer-based image classification model directly after an image codec, we aim to redesign the Vision Transformer (ViT) model to perform image classification from the compressed features and facilitate image compression with the long-term information from the Transformer. Specifically, we first replace the patchify stem (i.e., image splitting and embedding) of the ViT model with a lightweight image encoder modelled by a convolutional neural network. The compressed features generated by the image encoder are injected with convolutional inductive bias and are fed to the Transformer for image classification, bypassing image reconstruction. Meanwhile, we propose a feature aggregation module to fuse the compressed features with the selected intermediate features of the Transformer, and feed the aggregated features to a deconvolutional neural network for image reconstruction. The aggregated features can obtain the long-term information from the self-attention mechanism of the Transformer and improve the compression performance. The rate-distortion-accuracy optimization problem is finally solved by a two-step training strategy. Experimental results demonstrate the effectiveness of the proposed model in both the image compression and the classification tasks.

IJCAI Conference 2021 Conference Paper

Direct Measure Matching for Crowd Counting

  • Hui Lin
  • Xiaopeng Hong
  • Zhiheng Ma
  • Xing Wei
  • Yunfeng Qiu
  • Yaowei Wang
  • Yihong Gong

Traditional crowd counting approaches usually use Gaussian assumption to generate pseudo density ground truth, which suffers from problems like inaccurate estimation of the Gaussian kernel sizes. In this paper, we propose a new measure-based counting approach to regress the predicted density maps to the scattered point-annotated ground truth directly. First, crowd counting is formulated as a measure matching problem. Second, we derive a semi-balanced form of Sinkhorn divergence, based on which a Sinkhorn counting loss is designed for measure matching. Third, we propose a self-supervised mechanism by devising a Sinkhorn scale consistency loss to resist scale changes. Finally, an efficient optimization method is provided to minimize the overall loss function. Extensive experiments on four challenging crowd counting datasets, namely ShanghaiTech, UCF-QNRF, JHU++ and NWPU, have validated the proposed method.

AAAI Conference 2021 Conference Paper

Hierarchically and Cooperatively Learning Traffic Signal Control

  • Bingyu Xu
  • Yaowei Wang
  • Zhaozhi Wang
  • Huizhu Jia
  • Zongqing Lu

Deep reinforcement learning (RL) has been applied to traffic signal control recently and has demonstrated superior performance to conventional control methods. However, there are still several challenges we have to address before fully applying deep RL to traffic signal control. Firstly, the objective of traffic signal control is to optimize average travel time, which is a delayed reward over a long time horizon in the context of RL. However, existing work simplifies the optimization by using queue length, waiting time, delay, etc., as the immediate reward and presumes these short-term targets are always aligned with the objective. Nevertheless, these targets may deviate from the objective in different road networks with various traffic patterns. Secondly, it remains unsolved how to cooperatively control traffic signals to directly optimize average travel time. To address these challenges, we propose a hierarchical and cooperative reinforcement learning method, HiLight. HiLight enables each agent to learn a high-level policy that optimizes the objective locally by selecting among sub-policies that respectively optimize short-term targets. Moreover, the high-level policy additionally considers the objective in the neighborhood with adaptive weighting to encourage agents to cooperate on the objective in the road network. Empirically, we demonstrate that HiLight outperforms state-of-the-art RL methods for traffic signal control in real road networks with real traffic.

AAAI Conference 2020 Conference Paper

Towards Accurate Low Bit-Width Quantization with Multiple Phase Adaptations

  • Zhaoyi Yan
  • Yemin Shi
  • Yaowei Wang
  • Mingkui Tan
  • Zheyang Li
  • Wenming Tan
  • Yonghong Tian

Low bit-width model quantization is highly desirable when deploying a deep neural network on mobile and edge devices. Quantization is an effective way to reduce the model size with low bit-width weight representation. However, the unacceptable accuracy drop hinders the development of this approach. One possible reason for this is that the weights in quantization intervals are directly assigned to the center. At the same time, some quantization approaches are limited by the variety of different network models. Accordingly, in this paper, we propose Multiple Phase Adaptations (MPA), a framework designed to address these two problems. Firstly, weights in the target interval are assigned to the center by gradually spreading the quantization range. During the MPA process, the accuracy drop can be compensated for by the unquantized parts. Moreover, as MPA does not introduce hyperparameters that depend on different models or bit-widths, the framework can be conveniently applied to various models. Extensive experiments demonstrate that MPA achieves higher accuracy than most existing methods on classification tasks for AlexNet, VGG-16 and ResNet.
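
A minimal sketch of the "assign in-interval weights to the center, keep the rest full precision" idea follows; the actual MPA phase schedule and interval definition are assumptions here, not the authors' procedure.

```python
import numpy as np

def partial_quantize(weights: np.ndarray, num_bits: int, spread: float) -> np.ndarray:
    """Snap to the nearest low bit-width level only those weights lying within
    `spread` (a fraction of half the quantization step) of that level; the
    remaining weights stay full precision and can compensate during training.
    Growing `spread` from 0 to 1 over several phases quantizes the whole tensor."""
    levels = 2 ** num_bits
    w_min, w_max = weights.min(), weights.max()
    step = (w_max - w_min) / (levels - 1)
    centers = w_min + step * np.round((weights - w_min) / step)  # nearest level
    close = np.abs(weights - centers) <= spread * (step / 2)
    return np.where(close, centers, weights)

# Phase-by-phase toy run with 3-bit quantization (fine-tuning between phases omitted).
w = np.random.default_rng(0).normal(size=1000).astype(np.float32)
for spread in (0.25, 0.5, 1.0):
    w = partial_quantize(w, num_bits=3, spread=spread)
# After the last phase every weight sits on one of the 8 quantization levels.
```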

AAAI Conference 2015 Conference Paper

Swiss-System Based Cascade Ranking for Gait-Based Person Re-Identification

  • Lan Wei
  • Yonghong Tian
  • Yaowei Wang
  • Tiejun Huang

Human gait has been shown to be an efficient biometric measure for person identification at a distance. However, it often needs different gait features to handle various covariate conditions including viewing angles, walking speed, carrying an object and wearing different types of shoes. In order to improve the robustness of gait-based person re-identification on such multi-covariate conditions, a novel Swiss-system based cascade ranking model is proposed in this paper. Since the ranking model is able to learn a subspace where the potential true match is given the highest ranking, we formulate the gait-based person re-identification as a bipartite ranking problem and utilize it as an effective way for multi-feature ensemble learning. Then a Swiss multi-round competition system is developed for the cascade ranking model to optimize its effectiveness and efficiency. Extensive experiments on three indoor and outdoor public datasets demonstrate that our model outperforms several state-of-the-art methods remarkably.