Arrow Research

Author name cluster

Xue Yang

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

23 papers
2 author rows

Possible papers (23)

AAAI Conference 2026 Conference Paper

Earth-Adapter: Bridge the Geospatial Domain Gaps with a Frequency-Guided Mixture of Adapters

  • Xiaoxing Hu
  • ZiYang Gong
  • Yupei Wang
  • Yuru Jia
  • Fei Lin
  • Dexiang Gao
  • Ke An
  • Jianhong Han

Vision Foundation Models (VFMs), while powerful, often struggle in Remote Sensing (RS) segmentation tasks when combined with existing Parameter-Efficient Fine-Tuning (PEFT) methods. We observe that this limitation primarily arises from their inability to effectively handle the pervasive artifacts in RS imagery. To address this, we introduce Earth-Adapter, the first PEFT method specifically designed for RS artifact mitigation. Earth-Adapter introduces a novel Frequency-Guided Mixture of Adapters (MoA) approach, structured around a "divide and conquer" strategy. It first utilizes the Discrete Fourier Transform (DFT) to "divide" features into distinct frequency components, thereby effectively isolating artifact-related information from semantic signals. Subsequently, to "conquer" these artifacts, MoA independently optimizes features within different subspaces and dynamically assigns weights via a router to aggregate the refined representations. This enables adaptive refinement of the VFM's representation space to mitigate the impact of artifacts. This simple yet highly effective PEFT method demonstrably mitigates artifacts and significantly enhances VFM performance on RS segmentation tasks. Extensive experiments demonstrate Earth-Adapter's effectiveness on in-domain semantic segmentation (SS), as well as Domain Adaptive (DA) and Domain Generalized (DG) semantic segmentation tasks. Compared with the baseline Rein, Earth-Adapter significantly improves mIoU by 1.2% on SS, 9.0% on DA, and 3.1% on DG benchmarks. Our code and weights will be released soon.
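
To make the "divide and conquer" mechanism concrete, here is a minimal sketch of a frequency-guided mixture-of-adapters layer: a 2-D DFT splits features into low- and high-frequency parts, per-branch bottleneck adapters refine them, and a router aggregates the results. The cutoff rule, branch count, and module sizes are assumptions for illustration, not the authors' released code.

```python
import torch
import torch.nn as nn


class FrequencyMoA(nn.Module):
    """Sketch of a frequency-guided mixture of adapters (hypothetical layout)."""

    def __init__(self, dim: int, bottleneck: int = 64, cutoff: float = 0.25):
        super().__init__()
        self.cutoff = cutoff  # fraction of the spectrum treated as "low frequency"
        # One bottleneck adapter per branch: low-frequency, high-frequency, identity.
        self.adapters = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, bottleneck), nn.GELU(), nn.Linear(bottleneck, dim))
            for _ in range(3)
        )
        self.router = nn.Linear(dim, 3)  # dynamic per-token weights over branches

    def split(self, x):
        # x: (B, H, W, C). "Divide" via 2-D DFT: mask the spectrum, invert back.
        freq = torch.fft.fft2(x, dim=(1, 2))
        _, H, W, _ = x.shape
        fy = torch.fft.fftfreq(H, device=x.device).abs().view(1, H, 1, 1)
        fx = torch.fft.fftfreq(W, device=x.device).abs().view(1, 1, W, 1)
        mask = ((fy <= self.cutoff) & (fx <= self.cutoff)).float()
        low = torch.fft.ifft2(freq * mask, dim=(1, 2)).real
        return low, x - low  # low- and high-frequency components

    def forward(self, x):
        low, high = self.split(x)
        branches = [self.adapters[0](low), self.adapters[1](high), self.adapters[2](x)]
        weights = torch.softmax(self.router(x), dim=-1)  # (B, H, W, 3)
        mixed = sum(weights[..., i:i + 1] * b for i, b in enumerate(branches))
        return x + mixed  # residual connection, as is typical for adapters
```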

AAAI Conference 2026 Conference Paper

LWGANet: Addressing Spatial and Channel Redundancy in Remote Sensing Visual Tasks with Light-Weight Grouped Attention

  • Wei Lu
  • Xue Yang
  • Si-Bao Chen

Light-weight neural networks for remote sensing (RS) visual analysis must overcome two inherent redundancies: spatial redundancy from vast, homogeneous backgrounds, and channel redundancy, where extreme scale variations render a single feature space inefficient. Existing models, often designed for natural images, fail to address this dual challenge in RS scenarios. To bridge this gap, we propose LWGANet, a light-weight backbone engineered for RS-specific properties. LWGANet introduces two core innovations: a Top-K Global Feature Interaction (TGFI) module that mitigates spatial redundancy by focusing computation on salient regions, and a Light-Weight Grouped Attention (LWGA) module that resolves channel redundancy by partitioning channels into specialized, scale-specific pathways. By synergistically resolving these core inefficiencies, LWGANet achieves a superior trade-off between feature representation quality and computational cost. Extensive experiments on twelve diverse datasets across four major RS tasks—scene classification, oriented object detection, semantic segmentation, and change detection—demonstrate that LWGANet consistently outperforms state-of-the-art light-weight backbones in both accuracy and efficiency. Our work establishes a new, robust baseline for efficient visual analysis in RS images.
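
As a rough sketch of the TGFI idea, the block below scores all tokens, runs full attention only among the top-K salient ones, and scatters the refined tokens back, so homogeneous background costs almost nothing. The scoring head and the value of K are illustrative assumptions rather than LWGANet's actual design.

```python
import torch
import torch.nn as nn


class TopKGlobalInteraction(nn.Module):
    """Sketch: restrict global attention to the K most salient tokens."""

    def __init__(self, dim: int, k: int = 64, heads: int = 4):
        super().__init__()
        self.k = k
        self.score = nn.Linear(dim, 1)  # per-token saliency score
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):
        # x: (B, N, C) flattened spatial tokens.
        B, N, C = x.shape
        k = min(self.k, N)
        scores = self.score(x).squeeze(-1)            # (B, N)
        idx = scores.topk(k, dim=1).indices           # (B, k)
        gather = idx.unsqueeze(-1).expand(B, k, C)
        salient = x.gather(1, gather)                 # (B, k, C)
        # O(k^2) attention among salient tokens instead of O(N^2) over all.
        refined, _ = self.attn(salient, salient, salient)
        # Scatter refined tokens back; background tokens pass through unchanged.
        return x.scatter(1, gather, salient + refined)
```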

AAAI Conference 2025 Conference Paper

DiffCLIP: Few-shot Language-driven Multimodal Classifier

  • Jiaqing Zhang
  • Mingxiang Cao
  • Xue Yang
  • Kai Jiang
  • Yunsong Li

Visual language models like Contrastive Language-Image Pretraining (CLIP) have shown impressive performance in analyzing natural images with language information. However, these models often encounter challenges when applied to specialized domains such as remote sensing due to the limited availability of image-text pairs for training. To tackle this issue, we introduce DiffCLIP, a novel framework that extends CLIP to effectively convey comprehensive language-driven semantic information for accurate classification of high-dimensional multimodal remote sensing images. DiffCLIP is a few-shot learning method that leverages unlabeled images for pretraining. It employs unsupervised mask diffusion learning to capture the distribution of diverse modalities without requiring labels. The modality-shared image encoder maps multimodal data into a unified subspace, extracting shared features with consistent parameters across modalities. A well-trained image encoder further enhances learning by aligning visual representations with class-label text information from CLIP. By integrating these approaches, DiffCLIP significantly boosts CLIP performance using a minimal number of image-text pairs. We evaluate DiffCLIP on widely used high-dimensional multimodal datasets, demonstrating its effectiveness in addressing few-shot annotated classification tasks. DiffCLIP achieves an overall accuracy improvement of 10.65% across three remote sensing datasets compared with CLIP, while utilizing only 2-shot image-text pairs.

NeurIPS Conference 2025 Conference Paper

Envisioning Beyond the Pixels: Benchmarking Reasoning-Informed Visual Editing

  • Xiangyu Zhao
  • Peiyuan Zhang
  • Kexian Tang
  • Xiaorong Zhu
  • Hao Li
  • Wenhao Chai
  • Zicheng Zhang
  • Renqiu Xia

Large Multi-modality Models (LMMs) have made significant progress in visual understanding and generation, but they still face challenges in General Visual Editing, particularly in following complex instructions, preserving appearance consistency, and supporting flexible input formats. To study this gap, we introduce RISEBench, the first benchmark for evaluating Reasoning-Informed viSual Editing (RISE). RISEBench focuses on four key reasoning categories: Temporal, Causal, Spatial, and Logical Reasoning. We curate high-quality test cases for each category and propose a robust evaluation framework that assesses Instruction Reasoning, Appearance Consistency, and Visual Plausibility with both human judges and the LMM-as-a-judge approach. We conducted experiments evaluating nine prominent visual editing models, comprising both open-source and proprietary models. The evaluation results demonstrate that current models face significant challenges in reasoning-based editing tasks. Even the most powerful model evaluated, GPT-image-1, achieves an accuracy of merely 28.8%. RISEBench effectively highlights the limitations of contemporary editing models, provides valuable insights, and indicates potential future directions for the field of reasoning-aware visual editing. Our code and data have been released at https://github.com/PhoenixZ810/RISEBench.

NeurIPS Conference 2025 Conference Paper

GeneMAN: Generalizable Single-Image 3D Human Reconstruction from Multi-Source Human Data

  • Wentao Wang
  • Hang Ye
  • Fangzhou Hong
  • Xue Yang
  • Jianfu Zhang
  • Yizhou Wang
  • Ziwei Liu
  • Liang Pan

Given a single in-the-wild human photo, it remains a challenging task to reconstruct a high-fidelity 3D human model. Existing methods face difficulties including a) the varying body proportions captured by in-the-wild human images; b) diverse personal belongings within the shot; and c) ambiguities in human postures and inconsistency in human textures. In addition, the scarcity of high-quality human data intensifies the challenge. To address these problems, we propose a Generalizable image-to-3D huMAN reconstruction framework, dubbed GeneMAN, building upon a comprehensive multi-source collection of high-quality human data, including 3D scans, multi-view videos, single photos, and our generated synthetic human data. GeneMAN encompasses three key modules. 1) Without relying on parametric human models (e.g., SMPL), GeneMAN first trains a human-specific text-to-image diffusion model and a view-conditioned diffusion model, serving as GeneMAN's 2D and 3D human priors for reconstruction, respectively. 2) With the help of the pretrained human prior models, the Geometry Initialization-&-Sculpting pipeline is leveraged to recover high-quality 3D human geometry from a single image. 3) To achieve high-fidelity 3D human textures, GeneMAN employs the Multi-Space Texture Refinement pipeline, consecutively refining textures in the latent and the pixel spaces. Extensive experimental results demonstrate that GeneMAN can generate high-quality 3D human models from a single image input, outperforming prior state-of-the-art methods. Notably, GeneMAN shows much better generalizability in dealing with in-the-wild images, often yielding high-quality 3D human models in natural poses with common items, regardless of the body proportions in the input images.

NeurIPS Conference 2025 Conference Paper

InstructSAM: A Training-free Framework for Instruction-Oriented Remote Sensing Object Recognition

  • Yijie Zheng
  • Weijie Wu
  • Qingyun Li
  • Xuehui Wang
  • Xu Zhou
  • Aiai Ren
  • Jun Shen
  • Long Zhao

Language-guided object recognition in remote sensing imagery is crucial for large-scale mapping and automated data annotation. However, existing open-vocabulary and visual grounding methods rely on explicit category cues, limiting their ability to handle complex or implicit queries that require advanced reasoning. To address this issue, we introduce a new suite of tasks, including Instruction-Oriented Object Counting, Detection, and Segmentation (InstructCDS), covering open-vocabulary, open-ended, and open-subclass scenarios. We further present EarthInstruct, the first InstructCDS benchmark for earth observation. It is constructed from two diverse remote sensing datasets with varying spatial resolutions and annotation rules across 20 categories, requiring models to interpret dataset-specific instructions. Given the scarcity of semantically rich labeled data in remote sensing, we propose InstructSAM, a training-free framework for instruction-driven object recognition. InstructSAM leverages large vision-language models to interpret user instructions and estimate object counts, employs SAM2 for mask proposal, and formulates mask-label assignment as a binary integer programming problem. By integrating semantic similarity with counting constraints, InstructSAM efficiently assigns categories to predicted masks without relying on confidence thresholds. Experiments demonstrate that InstructSAM matches or surpasses specialized baselines across multiple tasks while maintaining near-constant inference time regardless of object count, reducing output tokens by 89% and overall runtime by over 32% compared to direct generation approaches. We believe the contributions of the proposed tasks, benchmark, and effective approach will advance future research in developing versatile object recognition systems. The code is available at https://VoyagerXvoyagerx.github.io/InstructSAM.
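
The counting-constrained assignment can be written down in a few lines. Below is a hedged sketch using SciPy's milp solver: maximize total mask-label similarity subject to each mask taking at most one label and each category being used at most its estimated count. The matrix layout and constraint form are illustrative assumptions, not the InstructSAM code.

```python
import numpy as np
from scipy.optimize import milp, LinearConstraint, Bounds


def assign_labels(similarity: np.ndarray, counts: np.ndarray) -> np.ndarray:
    """similarity: (M masks, C categories); counts: (C,) estimated object counts.
    Returns a binary (M, C) assignment matrix."""
    M, C = similarity.shape
    c = -similarity.ravel()  # milp minimizes, so negate the similarity

    # Each mask gets at most one label: sum_j x[i, j] <= 1.
    row = np.zeros((M, M * C))
    for i in range(M):
        row[i, i * C:(i + 1) * C] = 1
    # Each category is used at most counts[j] times: sum_i x[i, j] <= counts[j].
    col = np.zeros((C, M * C))
    for j in range(C):
        col[j, j::C] = 1

    res = milp(
        c,
        integrality=np.ones(M * C),  # all variables integer (binary with bounds)
        bounds=Bounds(0, 1),
        constraints=[LinearConstraint(row, -np.inf, 1),
                     LinearConstraint(col, -np.inf, counts)],
    )
    return res.x.reshape(M, C).round().astype(int)


# Toy usage: 3 proposal masks, 2 categories, at most one instance of each;
# the third mask is left unassigned once both category budgets are spent.
sim = np.array([[0.9, 0.2], [0.4, 0.8], [0.5, 0.6]])
print(assign_labels(sim, counts=np.array([1, 1])))
```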

NeurIPS Conference 2025 Conference Paper

OPMapper: Enhancing Open-Vocabulary Semantic Segmentation with Multi-Guidance Information

  • Xuehui Wang
  • Chongjie Si
  • Xue Yang
  • Yuzhi Zhao
  • Wenhai Wang
  • Xiaokang Yang
  • Wei Shen

Open-vocabulary semantic segmentation assigns every pixel a label drawn from an open-ended, text-defined space. Vision–language models such as CLIP excel at zero-shot recognition, yet their image-level pre-training hinders dense prediction. Current approaches either fine-tune CLIP—at high computational cost—or adopt training-free attention refinements that favor local smoothness while overlooking global semantics. In this paper, we present OPMapper, a lightweight, plug-and-play module that injects both local compactness and global connectivity into attention maps of CLIP. It combines Context-aware Attention Injection, which embeds spatial and semantic correlations, and Semantic Attention Alignment, which iteratively aligns the enriched weights with textual prompts. By jointly modeling token dependencies and leveraging textual guidance, OPMapper enhances visual understanding. OPMapper is highly flexible and can be seamlessly integrated into both training-based and training-free paradigms with minimal computational overhead. Extensive experiments demonstrate its effectiveness, yielding significant improvements across 8 open-vocabulary segmentation benchmarks.

NeurIPS Conference 2025 Conference Paper

Raw2Drive: Reinforcement Learning with Aligned World Models for End-to-End Autonomous Driving (in CARLA v2)

  • Zhenjie Yang
  • Xiaosong Jia
  • Qifeng Li
  • Xue Yang
  • Maoqing Yao
  • Junchi Yan

Reinforcement Learning (RL) can mitigate the causal confusion and distribution shift inherent to imitation learning (IL). However, applying RL to end-to-end autonomous driving (E2E-AD) remains an open problem due to its training difficulty, and IL is still the mainstream paradigm in both academia and industry. Recently, Model-based Reinforcement Learning (MBRL) has demonstrated promising results in neural planning; however, these methods typically require privileged information as input rather than raw sensor data. We fill this gap by designing Raw2Drive, a dual-stream MBRL approach. Initially, we efficiently train an auxiliary privileged world model paired with a neural planner that uses privileged information as input. Subsequently, we introduce a raw sensor world model trained via our proposed Guidance Mechanism, which ensures consistency between the raw sensor world model and the privileged world model during rollouts. Finally, the raw sensor world model combines the prior knowledge embedded in the heads of the privileged world model to effectively guide the training of the raw sensor policy. Raw2Drive is so far the only RL-based end-to-end method on CARLA Leaderboard 2.0 and Bench2Drive, and it achieves state-of-the-art performance.

NeurIPS Conference 2024 Conference Paper

Drones Help Drones: A Collaborative Framework for Multi-Drone Object Trajectory Prediction and Beyond

  • Zhechao Wang
  • Peirui Cheng
  • Mingxin Chen
  • Pengju Tian
  • Zhirui Wang
  • Xinming Li
  • Xue Yang
  • Xian Sun

Collaborative trajectory prediction can comprehensively forecast the future motion of objects through multi-view complementary information. However, it encounters two main challenges in multi-drone collaboration settings. The expansive aerial observations make it difficult to generate precise Bird's Eye View (BEV) representations. Besides, excessive interactions cannot meet real-time prediction requirements within the constrained drone-based communication bandwidth. To address these problems, we propose a novel framework named "Drones Help Drones" (DHD). Firstly, we incorporate the ground priors provided by the drone's inclined observation to estimate the distance between objects and drones, leading to more precise BEV generation. Secondly, we design a selective mechanism based on the local feature discrepancy to prioritize the critical information contributing to prediction tasks during inter-drone interactions. Additionally, we create the first dataset for multi-drone collaborative prediction, named "Air-Co-Pred", and conduct quantitative and qualitative experiments to validate the effectiveness of our DHD framework. The results demonstrate that compared to state-of-the-art approaches, DHD reduces position deviation in BEV representations by over 20% and requires only a quarter of the transmission ratio for interactions while achieving comparable prediction performance. Moreover, DHD also shows promising generalization to collaborative 3D object detection on CoPerception-UAVs.

NeurIPS Conference 2024 Conference Paper

E2E-MFD: Towards End-to-End Synchronous Multimodal Fusion Detection

  • Jiaqing Zhang
  • Mingxiang Cao
  • Weiying Xie
  • Jie Lei
  • Daixun Li
  • Wenbo Huang
  • Yunsong Li
  • Xue Yang

Multimodal image fusion and object detection are crucial for autonomous driving. While current methods have advanced the fusion of texture details and semantic information, their complex training processes hinder broader applications. Addressing this challenge, we introduce E2E-MFD, a novel end-to-end algorithm for multimodal fusion detection. E2E-MFD streamlines the process, achieving high performance with a single training phase. It employs synchronous joint optimization across components to avoid suboptimal solutions associated with individual tasks. Furthermore, it implements a comprehensive optimization strategy in the gradient matrix for shared parameters, ensuring convergence to an optimal fusion detection configuration. Our extensive testing on multiple public datasets reveals E2E-MFD's superior capabilities, showcasing not only visually appealing image fusion but also impressive detection outcomes, such as a 3.9% and 2.0% mAP50 increase on the horizontal object detection dataset M3FD and the oriented object detection dataset DroneVehicle, respectively, compared to state-of-the-art approaches.

NeurIPS Conference 2024 Conference Paper

Parameter-Inverted Image Pyramid Networks

  • Xizhou Zhu
  • Xue Yang
  • Zhaokai Wang
  • Hao Li
  • Wenhan Dou
  • Junqi Ge
  • Lewei Lu
  • Yu Qiao

Image pyramids are commonly used in modern computer vision tasks to obtain multi-scale features for precise understanding of images. However, image pyramids process multiple resolutions of images using the same large-scale model, which requires significant computational cost. To overcome this issue, we propose a novel network architecture known as Parameter-Inverted Image Pyramid Networks (PIIP). Our core idea is to use models with different parameter sizes to process different resolution levels of the image pyramid, thereby balancing computational efficiency and performance. Specifically, the input to PIIP is a set of multi-scale images, where higher-resolution images are processed by smaller networks. We further propose a feature interaction mechanism to allow features of different resolutions to complement each other and effectively integrate information from different spatial scales. Extensive experiments demonstrate that PIIP achieves superior performance in tasks such as object detection, segmentation, and image classification, compared to traditional image pyramid methods and single-branch networks, while reducing computational cost. Notably, when applying our method to the large-scale vision foundation model InternViT-6B, we improve its performance by 1%-2% on detection and segmentation with only 40%-60% of the original computation. These results validate the effectiveness of the PIIP approach and provide a new technical direction for future vision computing tasks.
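
A toy version of the parameter-inverted layout described above: the highest-resolution image goes through the narrowest branch and the lowest-resolution image through the widest, and a simple projection-plus-upsampling step lets the branches exchange information. The branch widths and the fusion rule are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class PIIPBlock(nn.Module):
    """Sketch: small models on large images, large models on small images."""

    def __init__(self, dims=(64, 128, 256), out_dim=256):
        super().__init__()
        # dims[0] (the narrowest branch) handles the highest resolution.
        self.branches = nn.ModuleList(
            nn.Sequential(nn.Conv2d(3, d, 3, 2, 1), nn.GELU(), nn.Conv2d(d, d, 3, 2, 1))
            for d in dims
        )
        self.proj = nn.ModuleList(nn.Conv2d(d, out_dim, 1) for d in dims)

    def forward(self, pyramid):
        # pyramid: images ordered high -> low resolution, e.g. 448, 224, 112 px.
        feats = [branch(img) for branch, img in zip(self.branches, pyramid)]
        # Interaction: project to a shared width and fuse at a common scale.
        target = feats[0].shape[-2:]
        return sum(
            F.interpolate(proj(f), size=target, mode="bilinear", align_corners=False)
            for proj, f in zip(self.proj, feats)
        )


imgs = [torch.randn(1, 3, s, s) for s in (448, 224, 112)]
print(PIIPBlock()(imgs).shape)  # torch.Size([1, 256, 112, 112])
```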

NeurIPS Conference 2023 Conference Paper

H2RBox-v2: Incorporating Symmetry for Boosting Horizontal Box Supervised Oriented Object Detection

  • Yi Yu
  • Xue Yang
  • Qingyun Li
  • Yue Zhou
  • Feipeng Da
  • Junchi Yan

With the rapidly increasing demand for oriented object detection, e.g., in autonomous driving and remote sensing, the recently proposed paradigm involving the weakly-supervised detector H2RBox for learning rotated boxes (RBox) from the more readily available horizontal boxes (HBox) has shown promise. This paper presents H2RBox-v2 to further bridge the gap between HBox-supervised and RBox-supervised oriented object detection. Specifically, we propose to leverage reflection symmetry via flip and rotate consistencies, using a weakly-supervised network branch similar to H2RBox, together with a novel self-supervised branch that learns orientations from the symmetry inherent in visual objects. The detector is further stabilized and enhanced by practical techniques to cope with peripheral issues, e.g., angular periodicity. To the best of our knowledge, H2RBox-v2 is the first symmetry-aware self-supervised paradigm for oriented object detection. In particular, our method shows less susceptibility to low-quality annotation and insufficient training data compared to H2RBox. Specifically, H2RBox-v2 achieves very close performance to its rotation-annotation-trained counterpart, Rotated FCOS: 1) DOTA-v1.0/1.5/2.0: 72.31%/64.76%/50.33% vs. 72.44%/64.53%/51.77%; 2) HRSC: 89.66% vs. 88.99%; 3) FAIR1M: 42.27% vs. 41.25%.
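
A minimal sketch of the flip/rotate consistency behind the self-supervised branch: the angle predicted on a rotated view should shift by the applied rotation, and a horizontal flip should negate it, all measured on a pi-periodic circle. The angle-only model interface below is a simplifying assumption.

```python
import math
import torch


def angle_diff(a, b):
    # Distance on a pi-periodic circle (a box looks identical rotated by 180 deg).
    d = (a - b) % math.pi
    return torch.minimum(d, math.pi - d)


def symmetry_consistency_loss(model, images):
    # `model` is assumed to map images (B, C, H, W) to one orientation per image.
    rotated = torch.rot90(images, 1, dims=(2, 3))  # rotate the view by 90 degrees
    flipped = torch.flip(images, dims=(3,))        # horizontal flip
    a0, a_rot, a_flip = model(images), model(rotated), model(flipped)
    loss_rot = angle_diff(a_rot, a0 + math.pi / 2).mean()  # rotation consistency
    loss_flip = angle_diff(a_flip, -a0).mean()             # flip negates orientation
    return loss_rot + loss_flip
```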

NeurIPS Conference 2021 Conference Paper

Learning High-Precision Bounding Box for Rotated Object Detection via Kullback-Leibler Divergence

  • Xue Yang
  • Xiaojiang Yang
  • Jirui Yang
  • Qi Ming
  • Wentao Wang
  • Qi Tian
  • Junchi Yan

Existing rotated object detectors are mostly inherited from the horizontal detection paradigm, as the latter has evolved into a well-developed area. However, these detectors struggle to deliver high-precision detection due to the limitation of current regression loss design, especially for objects with large aspect ratios. Taking the perspective that horizontal detection is a special case of rotated object detection, in this paper we are motivated to change the design of the rotation regression loss from an induction paradigm to a deduction methodology, in terms of the relation between rotated and horizontal detection. We show that one essential challenge is how to modulate the coupled parameters in the rotation regression loss, such that the estimated parameters can influence each other during dynamic joint optimization in an adaptive and synergetic way. Specifically, we first convert the rotated bounding box into a 2-D Gaussian distribution, and then calculate the Kullback-Leibler Divergence (KLD) between the Gaussian distributions as the regression loss. By analyzing the gradient of each parameter, we show that KLD (and its derivatives) can dynamically adjust the parameter gradients according to the characteristics of the object. For instance, it will adjust the importance (gradient weight) of the angle parameter according to the aspect ratio. This mechanism can be vital for high-precision detection, as a slight angle error would cause a serious accuracy drop for objects with large aspect ratios. More importantly, we have proved that KLD is scale invariant. We further show that the KLD loss can be degenerated into the popular Ln-norm loss for horizontal detection. Experimental results on seven datasets using different detectors show its consistent superiority, and codes are available at https://github.com/yangxue0827/RotationDetection.
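
The conversion and loss described above fit in a few lines. The sketch below turns (cx, cy, w, h, angle) boxes into 2-D Gaussians and computes the closed-form KL divergence between them; the final log1p smoothing is an assumption, as the paper's released loss transform differs in detail.

```python
import torch


def box_to_gaussian(box):
    # box: (..., 5) as (cx, cy, w, h, angle). Mean = center; covariance comes
    # from the axis-aligned variances (w^2/4, h^2/4) rotated by the angle.
    cx, cy, w, h, a = box.unbind(-1)
    mu = torch.stack([cx, cy], dim=-1)
    cos, sin = torch.cos(a), torch.sin(a)
    R = torch.stack([torch.stack([cos, -sin], -1),
                     torch.stack([sin, cos], -1)], -2)
    S = torch.diag_embed(torch.stack([w, h], -1) ** 2 / 4)
    return mu, R @ S @ R.transpose(-1, -2)


def kld_loss(pred, target):
    mu_p, sig_p = box_to_gaussian(pred)
    mu_t, sig_t = box_to_gaussian(target)
    inv_t = torch.inverse(sig_t)
    dmu = (mu_t - mu_p).unsqueeze(-1)
    # Closed-form KL(N_p || N_t) for 2-D Gaussians.
    kld = 0.5 * ((inv_t @ sig_p).diagonal(dim1=-2, dim2=-1).sum(-1)
                 + (dmu.transpose(-1, -2) @ inv_t @ dmu).squeeze(-1).squeeze(-1)
                 + torch.logdet(sig_t) - torch.logdet(sig_p) - 2)
    return torch.log1p(kld).mean()  # assumed smoothing into a bounded loss


pred = torch.tensor([[0.0, 0.0, 4.0, 1.0, 0.1]])
target = torch.tensor([[0.0, 0.0, 4.0, 1.0, 0.0]])
print(kld_loss(pred, target))  # a small angle error on an elongated box
```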

AAAI Conference 2021 Conference Paper

Learning Modulated Loss for Rotated Object Detection

  • Wen Qian
  • Xue Yang
  • Silong Peng
  • Junchi Yan
  • Yue Guo

Popular rotated detection methods usually use five parameters (coordinates of the central point, width, height, and rotation angle) or eight parameters (coordinates of four vertices) to describe the rotated bounding box, with an ℓ1 loss as the loss function. In this paper, we argue that the aforementioned integration can cause training instability and performance degeneration. The main reason is the discontinuity of the loss, which is caused by the contradiction between the definition of the rotated bounding box and the loss function. We refer to the above issues as rotation sensitivity error (RSE) and propose a modulated rotation loss to dismiss the discontinuity of the loss. The modulated rotation loss achieves consistent improvement on both the five-parameter and eight-parameter methods. Experimental results using one-stage and two-stage detectors demonstrate the effectiveness of our loss. The integrated network achieves competitive performance on several benchmarks, including DOTA and UCAS-AOD. The code is available at https://github.com/yangxue0827/RotationDetection.
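
As a sketch of the modulated-loss idea: the same physical box has two five-parameter descriptions (sides swapped, angle shifted by 90 degrees), and taking the smaller of the two regression losses removes the boundary discontinuity. The box convention below is an assumption for illustration.

```python
import math
import torch
import torch.nn.functional as F


def modulated_rotation_loss(pred, target):
    # pred, target: (..., 5) boxes as (cx, cy, w, h, angle).
    cx, cy, w, h, a = target.unbind(-1)
    # The same physical box, described with swapped sides and a shifted angle.
    alt = torch.stack([cx, cy, h, w, a + math.pi / 2], dim=-1)
    l_orig = F.smooth_l1_loss(pred, target, reduction="none").sum(-1)
    l_alt = F.smooth_l1_loss(pred, alt, reduction="none").sum(-1)
    # Keep the smaller loss so boundary cases stop incurring a huge penalty.
    return torch.minimum(l_orig, l_alt).mean()
```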

AAAI Conference 2021 Conference Paper

R3Det: Refined Single-Stage Detector with Feature Refinement for Rotating Object

  • Xue Yang
  • Junchi Yan
  • Ziming Feng
  • Tao He

Rotation detection is a challenging task due to the difficulties of locating multi-angle objects and separating them effectively from the background. Though considerable progress has been made, challenges remain in practical settings for rotating objects with large aspect ratios, dense distributions, and extreme category imbalance. In this paper, we propose an end-to-end refined single-stage rotation detector for fast and accurate object detection, using a progressive regression approach from coarse to fine granularity. Considering the shortcoming of feature misalignment in existing refined single-stage detectors, we design a feature refinement module that improves detection performance by obtaining more accurate features. The key idea of the feature refinement module is to re-encode the position information of the current refined bounding box to the corresponding feature points through pixel-wise feature interpolation, realizing feature reconstruction and alignment. For more accurate rotation estimation, an approximate SkewIoU loss is proposed to address the fact that the exact SkewIoU computation is not differentiable. Experiments on three popular remote sensing public datasets (DOTA, HRSC2016, UCAS-AOD) as well as one scene text dataset (ICDAR2015) show the effectiveness of our approach. The source code is available at https://github.com/Thinklab-SJTU/R3Det_Tensorflow and is also integrated in our open-source rotation detection benchmark: https://github.com/yangxue0827/RotationDetection.
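
A minimal sketch of the pixel-wise re-encoding idea: features are bilinearly sampled at the refined boxes' positions and fused back at each feature location, which is the alignment effect the abstract describes. Sampling only the box center (rather than several key points per box) is a simplifying assumption.

```python
import torch
import torch.nn.functional as F


def refine_features(feat, centers):
    """feat: (B, C, H, W) feature map; centers: (B, H, W, 2) refined box centers
    per feature location, in normalized [-1, 1] (x, y) coordinates."""
    # Bilinear interpolation re-encodes each location with the feature found
    # at its refined box position, realigning features with the current box.
    sampled = F.grid_sample(feat, centers, mode="bilinear", align_corners=False)
    return feat + sampled
```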

AAAI Conference 2018 Conference Paper

A Network-Specific Markov Random Field Approach to Community Detection

  • Dongxiao He
  • Xinxin You
  • Zhiyong Feng
  • Di Jin
  • Xue Yang
  • Weixiong Zhang

Markov Random Field (MRF) is a powerful framework for developing probabilistic models of complex problems. MRF models possess rich structures to represent properties and constraints of a problem. It has been successful on many application problems, particularly those of computer vision and image processing, where data are structured, e.g., pixels are organized on grids. The problem of identifying communities in networks, which is essential for network analysis, is in principle analogous to finding objects in images. It is surprising that MRF has not yet been explored for network community detection. It is challenging to apply MRF to network analysis problems where data are organized on graphs with irregular structures. Here we present a network-specific MRF approach to community detection. The new method effectively encodes the structural properties of an irregular network in an energy function (the core of an MRF model) so that the minimization of the function gives rise to the best community structures. We analyzed the new MRF-based method on several synthetic benchmarks and real-world networks, showing its superior performance over the state-of-the-art methods for community identification.
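
To make the energy-minimization framing concrete, here is a toy energy over a graph: separating connected nodes costs energy, and so does grouping non-neighbors. The actual network-specific potentials in the paper are richer; this is only an illustrative stand-in.

```python
def mrf_energy(edges, labels, penalty=0.5):
    """edges: set of (u, v) pairs; labels: dict mapping node -> community id.
    Lower energy corresponds to a better community structure in this toy model."""
    energy = 0.0
    nodes = list(labels)
    for i, u in enumerate(nodes):
        for v in nodes[i + 1:]:
            same = labels[u] == labels[v]
            if (u, v) in edges or (v, u) in edges:
                energy += 0.0 if same else 1.0  # separating neighbors costs energy
            elif same:
                energy += penalty               # grouping non-neighbors also costs
    return energy


# Toy usage: a triangle plus a pendant node prefers {0, 1, 2} together, {3} apart.
edges = {(0, 1), (1, 2), (0, 2), (2, 3)}
print(mrf_energy(edges, {0: 0, 1: 0, 2: 0, 3: 1}))  # 1.0 (only the pendant edge)
print(mrf_energy(edges, {0: 0, 1: 1, 2: 0, 3: 1}))  # 3.5 (worse partition)
```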

ICRA Conference 2017 Conference Paper

Sequence-based multimodal apprenticeship learning for robot perception and decision making

  • Fei Han 0002
  • Xue Yang
  • Yu Zhang
  • Hao Zhang 0011

Apprenticeship learning has recently attracted wide attention due to its capability of allowing robots to learn physical tasks directly from demonstrations provided by human experts. Most previous techniques assumed that the state space is known a priori or employed simple state representations that usually suffer from perceptual aliasing. Different from previous research, we propose a novel approach named Sequence-based Multimodal Apprenticeship Learning (SMAL), which is capable of simultaneously fusing temporal information and multimodal data, and of integrating robot perception with decision making. To evaluate the SMAL approach, experiments are performed using both simulations and real-world robots in challenging search and rescue scenarios. The empirical study has validated that our SMAL approach can effectively learn plans for robots to make decisions using sequences of multimodal observations. Experimental results have also shown that SMAL outperforms baseline methods that use individual images.

ICRA Conference 2017 Conference Paper

Simultaneous Feature and Body-Part Learning for real-time robot awareness of human behaviors

  • Fei Han 0002
  • Xue Yang
  • Christopher M. Reardon
  • Yu Zhang
  • Hao Zhang 0011

Robot awareness of human actions is an essential research problem in robotics with many important real-world applications, including human-robot collaboration and teaming. Over the past few years, depth sensors have become a standard device widely used by intelligent robots for 3D perception, and they can also offer human skeletal data in 3D space. Several methods based on skeletal data were designed to enable robot awareness of human actions with satisfactory accuracy. However, previous methods treated all body parts and features as equally important, without the capability to identify discriminative body parts and features. In this paper, we propose a novel simultaneous Feature And Body-part Learning (FABL) approach that simultaneously identifies discriminative body parts and features, and efficiently integrates all available information together to enable real-time robot awareness of human behaviors. We formulate FABL as a regression-like optimization problem with structured sparsity-inducing norms to model interrelationships of body parts and features. We also develop an optimization algorithm to solve the formulated problem, which possesses a theoretical guarantee to find the optimal solution. To evaluate FABL, three experiments were performed using public benchmark datasets, including the MSR Action3D and CAD-60 datasets, as well as a Baxter robot in practical assistive-living applications. Experimental results show that our FABL approach obtains a high recognition accuracy with a processing speed on the order of 10^1 Hz, which makes FABL a promising method to enable real-time robot awareness of human behaviors in practical robotics applications.
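
The structured-sparsity formulation can be illustrated with a small objective: a least-squares fit plus an l2,1-style penalty that is an l2 norm within each body-part group and an l1 sum across groups, so entire uninformative parts are driven to zero. The groupings, weights, and plain-NumPy form are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np


def fabl_objective(W, X, Y, groups, lam=0.1):
    """W: (d, c) weights; X: (n, d) features; Y: (n, c) targets;
    groups: list of row-index arrays, one per body part."""
    fit = 0.5 * np.linalg.norm(X @ W - Y) ** 2
    # l2 within a group, summed (l1) across groups: whole body parts switch off.
    reg = sum(np.linalg.norm(W[g]) for g in groups)
    return fit + lam * reg
```

A proximal-gradient or block-coordinate solver would minimize such an objective; the paper's own algorithm additionally carries a theoretical optimality guarantee.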

IJCAI Conference 2016 Conference Paper

Enforcing Template Representability and Temporal Consistency for Adaptive Sparse Tracking

  • Xue Yang
  • Fei Han
  • Hua Wang
  • Hao Zhang

Sparse representation has been widely studied in visual tracking and has shown promising tracking performance. Despite a lot of progress, visual tracking remains a challenging task due to appearance variations over time. In this paper, we propose a novel sparse tracking algorithm that well addresses temporal appearance changes by enforcing template representability and temporal consistency (TRAC). By modeling temporal consistency, our algorithm addresses the issue of drifting away from a tracking target. By exploring the templates' long-term and short-term representability, the proposed method adaptively updates the dictionary using the most descriptive templates, which significantly improves the robustness to target appearance changes. We compare our TRAC algorithm against state-of-the-art approaches on 12 challenging benchmark image sequences. Both qualitative and quantitative results demonstrate that our algorithm significantly outperforms previous state-of-the-art trackers.
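
A hedged sketch of sparse coding with a temporal-consistency pull, in the spirit of the abstract: the candidate is coded over the template dictionary with an l1 penalty plus a quadratic term tying the code to the previous frame's. The plain ISTA-style solver and the weights are illustrative assumptions.

```python
import numpy as np


def trac_code(D, y, x_prev, lam=0.1, mu=0.5, lr=0.01, steps=200):
    """D: (d, k) template dictionary; y: (d,) candidate; x_prev: (k,) last code.
    Minimizes 0.5*||Dx - y||^2 + mu/2*||x - x_prev||^2 + lam*||x||_1."""
    x = x_prev.copy()
    for _ in range(steps):
        grad = D.T @ (D @ x - y) + mu * (x - x_prev)            # smooth part
        x = x - lr * grad
        x = np.sign(x) * np.maximum(np.abs(x) - lr * lam, 0.0)  # soft threshold (l1 prox)
    return x
```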