Arrow Research search

Author name cluster

Xi Li

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

42 papers
2 author rows

Possible papers

42

AAAI Conference 2026 Conference Paper

IGFuse: Interactive 3D Gaussian Scene Reconstruction via Multi-Scans Fusion

  • Wenhao Hu
  • Zesheng Li
  • Haonan Zhou
  • Liu Liu
  • Xuexiang Wen
  • Zhizhong Su
  • Xi Li
  • Gaoang Wang

Reconstructing complete and interactive 3D scenes remains a fundamental challenge in computer vision and robotics, particularly due to persistent object occlusions and limited sensor coverage. Even multi-view observations from a single scene scan often fail to capture the full structural details. Existing approaches typically rely on multi-stage pipelines—such as segmentation, background completion, and inpainting—or require per-object dense scanning, both of which are error-prone and not easily scalable. We propose IGFuse, a novel framework that reconstructs interactive Gaussian scenes by fusing observations from multiple scans, where natural object rearrangement between captures reveals previously occluded regions. Our method constructs segmentation-aware Gaussian fields and enforces bi-directional photometric and semantic consistency across scans. To handle spatial misalignments, we introduce a pseudo-intermediate scene state for symmetric alignment, alongside collaborative co-pruning strategies to refine geometry. IGFuse enables high-fidelity rendering and object-level scene manipulation without dense observations or complex pipelines. Extensive experiments validate the framework’s strong generalization to novel scene configurations, demonstrating its effectiveness for real-world 3D reconstruction and real-to-simulation transfer.

AAAI Conference 2026 Conference Paper

UniScene-MoTion: Unified Scene & Motion-aware Diffusion Transition Framework

  • Rui Jiang
  • Chongmian Wang
  • Xinghe Fu
  • Yehao Lu
  • Teng Li
  • Xi Li

Video transitions are critical for ensuring temporal coherence in edited media, yet existing methods often rely on handcrafted effects or relative-scale trajectories that fail to capture the physical structure of real-world scenes. In this work, we introduce a scale-aware video transition framework that explicitly incorporates depth-aware 3D reasoning into a diffusion-based generation pipeline. Built upon a powerful I2V foundation, our method leverages single-image depth prediction to align camera motion with metric-scale geometry, enabling physically consistent transitions. To reduce reliance on precise camera inputs, we propose a bidirectional conditional control module and a progressive training strategy with conditional dropout, enhancing generalization to loosely specified or missing camera trajectories. Extensive experiments demonstrate that our approach achieves state-of-the-art performance, delivering realistic, geometrically coherent transitions across diverse scenes and applications with minimal input guidance.

ICML Conference 2025 Conference Paper

AAAR-1.0: Assessing AI's Potential to Assist Research

  • Renze Lou
  • Hanzi Xu
  • Sijia Wang
  • Jiangshu Du
  • Ryo Kamoi
  • Xiaoxin Lu
  • Jian Xie
  • Yuxuan Sun 0002

Numerous studies have assessed the proficiency of AI systems, particularly large language models (LLMs), in facilitating everyday tasks such as email writing, question answering, and creative content generation. However, researchers face unique challenges and opportunities in leveraging LLMs for their own work, such as brainstorming research ideas, designing experiments, and writing or reviewing papers. In this study, we introduce AAAR-1.0, a benchmark dataset designed to evaluate LLM performance in three fundamental, expertise-intensive research tasks: (i) EquationInference, assessing the correctness of equations based on the contextual information in paper submissions; (ii) ExperimentDesign, designing experiments to validate research ideas and solutions; and (iii) PaperWeakness, identifying weaknesses in paper submissions. AAAR-1.0 differs from prior benchmarks in two key ways: first, it is explicitly research-oriented, with tasks requiring deep domain expertise; second, it is researcher-oriented, mirroring the primary activities that researchers engage in on a daily basis. An evaluation of both open-source and proprietary LLMs reveals their potential as well as limitations in conducting sophisticated research tasks. We will release AAAR-1.0 and keep iterating it to new versions.

NeurIPS Conference 2025 Conference Paper

Collaborating Vision, Depth, and Thermal Signals for Multi-Modal Tracking: Dataset and Algorithm

  • Xue-Feng Zhu
  • Tianyang Xu
  • Yifan Pan
  • Jinjie Gu
  • Xi Li
  • Jiwen Lu
  • Xiaojun Wu
  • Josef Kittler

Existing multi-modal object tracking approaches primarily focus on dual-modal paradigms, such as RGB-Depth or RGB-Thermal, yet remain challenged in complex scenarios due to limited input modalities. To address this gap, this work introduces a novel multi-modal tracking task that leverages three complementary modalities, including visible RGB, Depth (D), and Thermal Infrared (TIR), aiming to enhance robustness in complex scenarios. To support this task, we construct a new multi-modal tracking dataset, coined RGBDT500, which consists of 500 videos with synchronised frames across the three modalities. Each frame provides spatially aligned RGB, depth, and thermal infrared images with precise object bounding box annotations. Furthermore, we propose a novel multi-modal tracker, dubbed RDTTrack. RDTTrack integrates tri-modal information for robust tracking by leveraging a pretrained RGB-only tracking model and prompt learning techniques. Specifically, RDTTrack fuses the thermal infrared and depth modalities under a proposed orthogonal projection constraint, then integrates them with RGB signals as prompts for the pre-trained foundation tracking model, effectively harmonising tri-modal complementary cues. The experimental results demonstrate the effectiveness and advantages of the proposed method, showing significant improvements over existing dual-modal approaches in terms of tracking accuracy and robustness in complex scenarios. The dataset and source code are publicly available at https://xuefeng-zhu5.github.io/RGBDT500.

AAAI Conference 2025 Conference Paper

CustomCrafter: Customized Video Generation with Preserving Motion and Concept Composition Abilities

  • Tao Wu
  • Yong Zhang
  • Xintao Wang
  • Xianpan Zhou
  • Guangcong Zheng
  • Zhongang Qi
  • Ying Shan
  • Xi Li

Customized video generation aims to generate high-quality videos guided by text prompts and the subject's reference images. However, since it is only trained on static images, the fine-tuning process of subject learning disrupts the abilities of video diffusion models (VDMs) to combine concepts and generate motion. To restore these abilities, some methods use an additional video similar to the prompt to fine-tune or guide the model. This requires frequent changes of guiding videos and even re-tuning of the model when generating different motions, which is very inconvenient for users. In this paper, we propose CustomCrafter, a novel framework that preserves the model's motion generation and concept combination abilities without additional video or fine-tuning for recovery. To preserve the concept combination ability, we design a plug-and-play module that updates a few parameters in VDMs, enhancing the model's ability to capture appearance details and to combine concepts for new subjects. For motion generation, we observe that VDMs tend to restore the motion of a video in the early stage of denoising, while focusing on the recovery of subject details in the later stage. Therefore, we propose a Dynamic Weighted Video Sampling Strategy. Using the pluggability of our subject learning modules, we reduce the impact of these modules on motion generation in the early stage of denoising, preserving the VDMs' ability to generate motion. In the later stage of denoising, we restore these modules to repair the appearance details of the specified subject, thereby ensuring the fidelity of the subject's appearance. Experimental results show that our method achieves a significant improvement over previous methods.

AAAI Conference 2025 Conference Paper

Energy-Guided Optimization for Personalized Image Editing with Pretrained Text-to-Image Diffusion Models

  • Rui Jiang
  • Xinghe Fu
  • Guangcong Zheng
  • Teng Li
  • Taiping Yao
  • Xi Li

The rapid advancement of pretrained text-driven diffusion models has significantly enriched applications in image generation and editing. However, as the demand for personalized content editing increases, new challenges emerge, especially when dealing with arbitrary objects and complex scenes. Existing methods usually mistake the mask for an object shape prior, and thus struggle to achieve seamless integration. The commonly used inversion-noise initialization also hinders identity consistency with the target object. To address these challenges, we propose a novel training-free framework that formulates personalized content editing as the optimization of edited images in the latent space, using diffusion models as the energy function guidance conditioned by reference text-image pairs. A coarse-to-fine strategy is proposed that employs text energy guidance at the early stage to achieve a natural transition toward the target class and uses point-to-point feature-level image energy guidance to perform fine-grained appearance alignment with the target object. Additionally, we introduce latent space content composition to enhance overall identity consistency with the target. Extensive experiments demonstrate that our method excels in object replacement even with a large domain gap, highlighting its potential for high-quality, personalized image editing.

AAAI Conference 2025 Conference Paper

Envisioning Class Entity Reasoning by Large Language Models for Few-shot Learning

  • Mushui Liu
  • Fangtai Wu
  • Bozheng Li
  • Ziqian Lu
  • Yunlong Yu
  • Xi Li

Few-shot learning (FSL) aims to recognize new concepts using a limited number of visual samples. Existing methods attempt to incorporate semantic information into the limited visual data for category understanding. However, these methods often enrich class-level feature representations with abstract category names, failing to capture nuanced features essential for effective generalization. To address this issue, we propose a novel framework for FSL, which incorporates both the abstract class semantics and the concrete class entities extracted from Large Language Models (LLMs), to enhance the representation of the class prototypes. Specifically, our framework comprises a Semantic-guided Visual Pattern Extraction (SVPE) module and a Prototype-Calibration (PC) module, where the SVPE meticulously extracts semantic-aware visual patterns across diverse scales, while the PC module seamlessly integrates these patterns to refine the visual prototype, enhancing its representativeness. Extensive experiments on four few-shot classification benchmarks and the BSCD-FSL cross-domain benchmark showcase remarkable advancements over the current state-of-the-art methods. Notably, for the challenging one-shot setting, our approach, utilizing the ResNet-12 backbone, achieves an impressive average improvement of 1.95% over the second-best competitor.

AAAI Conference 2025 Conference Paper

Exploring Unbiased Deepfake Detection via Token-Level Shuffling and Mixing

  • Xinghe Fu
  • Zhiyuan Yan
  • Taiping Yao
  • Shen Chen
  • Xi Li

The generalization problem is broadly recognized as a critical challenge in detecting deepfakes. Most previous work believes that the generalization gap is caused by the differences among various forgery methods. However, our investigation reveals that the generalization issue can still occur when forgery-irrelevant factors shift. In this work, we identify two biases that detectors are also prone to overfitting to: position bias and content bias, as depicted in Fig. 1. For the position bias, we observe that detectors are prone to “lazily” depending on specific positions within an image (e.g., central regions, even when they contain no forgery). As for content bias, we argue that detectors may potentially and mistakenly utilize forgery-unrelated information for detection (e.g., background and hair). To intervene in these biases, we propose two branches for shuffling and mixing tokens in the latent space of transformers. For the shuffling branch, we rearrange the tokens and corresponding position embeddings for each image while maintaining the local correlation. For the mixing branch, we randomly select and mix the tokens in the latent space between two images with the same label within the mini-batch to recombine the content information. During the learning process, we align the outputs of detectors from different branches in both feature space and logit space. Contrastive losses for features and divergence losses for logits are applied to obtain unbiased feature representations and classifiers. We demonstrate and verify the effectiveness of our method through extensive experiments on widely used evaluation datasets.
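As a rough, non-authoritative sketch of the two branches described above, the Python snippet below permutes tokens together with their position embeddings (shuffling branch) and swaps a fraction of tokens between two same-label images in a mini-batch (mixing branch). The tensor shapes, the per-token permutation (the paper shuffles while maintaining local correlation), and the mixing ratio are illustrative assumptions; the alignment losses are omitted.

```python
import torch


def shuffle_branch(tokens: torch.Tensor, pos_embed: torch.Tensor):
    """Rearrange tokens together with their position embeddings.

    tokens, pos_embed: (batch, num_tokens, dim).
    """
    b, n, d = tokens.shape
    perm = torch.stack([torch.randperm(n) for _ in range(b)])  # one permutation per image
    idx = perm.unsqueeze(-1).expand(-1, -1, d)
    return tokens.gather(1, idx), pos_embed.gather(1, idx)


def mix_branch(tokens: torch.Tensor, labels: torch.Tensor, ratio: float = 0.3):
    """Replace a fraction of tokens with tokens from another same-label image."""
    b, n, _ = tokens.shape
    mixed = tokens.clone()
    for i in range(b):
        partners = (labels == labels[i]).nonzero(as_tuple=True)[0]
        partners = partners[partners != i]
        if len(partners) == 0:
            continue  # no other image with the same label in this mini-batch
        j = partners[torch.randint(len(partners), (1,))].item()
        pick = torch.randperm(n)[: max(1, int(ratio * n))]  # token positions to swap in
        mixed[i, pick] = tokens[j, pick]
    return mixed


if __name__ == "__main__":
    tok, pos = torch.randn(4, 196, 768), torch.randn(4, 196, 768)
    labels = torch.tensor([0, 0, 1, 1])  # real/fake labels within the mini-batch
    s_tok, s_pos = shuffle_branch(tok, pos)
    m_tok = mix_branch(tok, labels)
    print(s_tok.shape, m_tok.shape)
```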

ICLR Conference 2025 Conference Paper

Solving Token Gradient Conflict in Mixture-of-Experts for Large Vision-Language Model

  • Longrong Yang
  • Dong Shen
  • Chaoxiang Cai
  • Fan Yang
  • Tingting Gao
  • Di Zhang
  • Xi Li

The Mixture-of-Experts (MoE) has gained increasing attention in studying Large Vision-Language Models (LVLMs). It uses a sparse model to replace the dense model, achieving comparable performance while activating fewer parameters during inference, thus significantly reducing the inference cost. Existing MoE methods in LVLMs encourage different experts to specialize in different tokens, and they usually employ a router to predict the routing of each token. However, the router is not optimized with respect to the distinct parameter optimization directions generated by tokens within an expert. This may lead to severe interference between tokens within an expert. To address this problem, we propose Solving Token Gradient Conflict (STGC) via token-level gradient analysis in this paper. Specifically, we first use token-level gradients to identify conflicting tokens in experts. After that, we add a regularization loss tailored to encourage conflicting tokens to route from their current experts to other experts, reducing interference between tokens within an expert. Our method can serve as a plug-in for diverse LVLM methods, and extensive experimental results demonstrate its effectiveness. The code will be publicly available at https://github.com/longrongyang/STGC.

JBHI Journal 2025 Journal Article

TKR-FSOD: Fetal Anatomical Structure Few-Shot Detection Utilizing Topological Knowledge Reasoning

  • Xi Li
  • Ying Tan
  • Bocheng Liang
  • Bin Pu
  • Jiewen Yang
  • Lei Zhao
  • Yanqing Kong
  • Lixian Yang

Fetal multi-anatomical structure detection in ultrasound (US) images can clearly present the relationships and interactions between anatomical structures, providing more comprehensive information about fetal organ structures and assisting sonographers in making more accurate diagnoses; it is widely used in structure evaluation. Recently, deep learning methods have shown superior performance in detecting various anatomical structures in ultrasound images, but they still leave room for improvement in categories where it is difficult to obtain samples, such as rare diseases. Few-shot learning has attracted a lot of attention in medical image analysis due to its ability to address data scarcity. However, existing few-shot learning research in medical image analysis focuses on classification and segmentation, and research on object detection has been neglected. In this paper, we propose a novel fetal anatomical structure few-shot detection method for ultrasound images, TKR-FSOD, which learns topological knowledge through a Topological Knowledge Reasoning Module to help the model reason about and detect anatomical structures. Furthermore, we propose a Discriminate Ability Enhanced Feature Learning Module that extracts abundant discriminative features to enhance the model's discriminative ability. Experimental results demonstrate that our method outperforms the state-of-the-art baseline methods, exceeding the second-best method by a maximum margin of 4.8% in the 5-shot setting of split 1 under the four-chamber cardiac view.

AAAI Conference 2024 Conference Paper

Prompting Segmentation with Sound Is Generalizable Audio-Visual Source Localizer

  • Yaoting Wang
  • Weisong Liu
  • Guangyao Li
  • Jian Ding
  • Di Hu
  • Xi Li

Never having seen an object and heard its sound simultaneously, can the model still accurately localize its visual position from the input audio? In this work, we concentrate on the Audio-Visual Localization and Segmentation tasks but under the demanding zero-shot and few-shot scenarios. To achieve this goal, different from existing approaches that mostly employ the encoder-fusion-decoder paradigm to decode localization information from the fused audio-visual feature, we introduce the encoder-prompt-decoder paradigm, aiming to better fit the data scarcity and varying data distribution dilemmas with the help of abundant knowledge from pre-trained models. Specifically, we first propose to construct a Semantic-aware Audio Prompt (SAP) to help the visual foundation model focus on sounding objects, while also encouraging the semantic gap between the visual and audio modalities to shrink. Then, we develop a Correlation Adapter (ColA) to keep training efforts minimal while maintaining adequate knowledge of the visual foundation model. Equipped with these components, this new paradigm outperforms other fusion-based methods in both the unseen-class and cross-dataset settings, as extensive experiments demonstrate. We hope that our work can further promote the generalization study of Audio-Visual Localization and Segmentation in practical application scenarios. Project page: https://github.com/GeWu-Lab/Generalizable-Audio-Visual-Segmentation

AAAI Conference 2024 Conference Paper

SphereDiffusion: Spherical Geometry-Aware Distortion Resilient Diffusion Model

  • Tao Wu
  • Xuewei Li
  • Zhongang Qi
  • Di Hu
  • Xintao Wang
  • Ying Shan
  • Xi Li

Controllable spherical panoramic image generation holds substantial applicative potential across a variety of domains. However, it remains a challenging task due to the inherent spherical distortion and geometry characteristics, resulting in low-quality content generation. In this paper, we introduce a novel framework, SphereDiffusion, to address these unique challenges and to better generate high-quality and precisely controllable spherical panoramic images. For the spherical distortion characteristic, we embed the semantics of the distorted object with text encoding, then explicitly construct the relationship with text-object correspondence to better use the pre-trained knowledge of the planar images. Meanwhile, we employ a deformable technique to mitigate the semantic deviation in latent space caused by spherical distortion. For the spherical geometry characteristic, by virtue of spherical rotation invariance, we improve the data diversity and optimization objectives in the training process, enabling the model to better learn the spherical geometry characteristic. Furthermore, we enhance the denoising process of the diffusion model, enabling it to effectively use the learned geometric characteristic to ensure the boundary continuity of the generated images. With these specific techniques, experiments on the Structured3D dataset show that SphereDiffusion significantly improves the quality of controllable spherical image generation and reduces FID by around 35% on average in relative terms.

AAAI Conference 2024 Conference Paper

Temporal-Distributed Backdoor Attack against Video Based Action Recognition

  • Xi Li
  • Songhe Wang
  • Ruiquan Huang
  • Mahanth Gowda
  • George Kesidis

Deep neural networks (DNNs) have achieved tremendous success in various applications including video action recognition, yet remain vulnerable to backdoor attacks (Trojans). The backdoor-compromised model will mis-classify to the target class chosen by the attacker when a test instance (from a non-target class) is embedded with a specific trigger, while maintaining high accuracy on attack-free instances. Although there are extensive studies on backdoor attacks against image data, the susceptibility of video-based systems under backdoor attacks remains largely unexplored. Current studies are direct extensions of approaches proposed for image data, e.g., the triggers are independently embedded within the frames, which tend to be detectable by existing defenses. In this paper, we introduce a simple yet effective backdoor attack against video data. Our proposed attack, adding perturbations in a transformed domain, plants an imperceptible, temporally distributed trigger across the video frames, and is shown to be resilient to existing defensive strategies. The effectiveness of the proposed attack is demonstrated by extensive experiments with various well-known models on two video recognition benchmarks, UCF101 and HMDB51, and a sign language recognition benchmark, Greek Sign Language (GSL) dataset. We delve into the impact of several influential factors on our proposed attack and identify an intriguing effect termed "collateral damage" through extensive studies.
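A hedged illustration of the idea of a temporally distributed trigger planted in a transformed domain rather than frame-by-frame: the sketch below perturbs a single temporal frequency of a clip and maps it back to the pixel domain, so the change is small per frame but spread across all frames. The choice of transform (a temporal FFT), the frequency index, and the amplitude are assumptions made for illustration, not the paper's exact trigger design.

```python
import numpy as np


def embed_temporal_trigger(video: np.ndarray, freq_idx: int = 3, amp: float = 2.0) -> np.ndarray:
    """video: float array of shape (T, H, W, C) with pixel values in [0, 255]."""
    spectrum = np.fft.fft(video, axis=0)           # transform along the temporal axis
    spectrum[freq_idx] += amp                      # perturb one temporal frequency bin
    spectrum[-freq_idx] += amp                     # and its conjugate bin, keeping the signal real
    poisoned = np.fft.ifft(spectrum, axis=0).real  # back to the pixel domain
    return np.clip(poisoned, 0.0, 255.0)


if __name__ == "__main__":
    clip = np.random.uniform(0, 255, size=(16, 112, 112, 3))
    poisoned = embed_temporal_trigger(clip)
    print("mean per-pixel change:", np.abs(poisoned - clip).mean())
```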

IJCAI Conference 2023 Conference Paper

DenseDINO: Boosting Dense Self-Supervised Learning with Token-Based Point-Level Consistency

  • Yike Yuan
  • Xinghe Fu
  • Yunlong Yu
  • Xi Li

In this paper, we propose a simple yet effective transformer framework for self-supervised learning, called DenseDINO, to learn dense visual representations. To exploit the spatial information that dense prediction tasks require but that is neglected by existing self-supervised transformers, we introduce point-level supervision across views in a novel token-based way. Specifically, DenseDINO introduces some extra input tokens, called reference tokens, to match point-level features with the position prior. With the reference tokens, the model can maintain spatial consistency and deal with multi-object complex scene images, thus generalizing better on dense prediction tasks. Compared with the vanilla DINO, our approach obtains competitive performance when evaluated on classification in ImageNet and achieves a large improvement (+7.2% mIoU) in semantic segmentation on PascalVOC under the linear probing protocol for segmentation.

JBHI Journal 2023 Journal Article

dMIL-Transformer: Multiple Instance Learning Via Integrating Morphological and Spatial Information for Lymph Node Metastasis Classification

  • Yang Chen
  • Zhuchen Shao
  • Hao Bian
  • Zijie Fang
  • Yifeng Wang
  • Yuanhao Cai
  • Haoqian Wang
  • Guojun Liu

Automated classification of lymph node metastasis (LNM) plays an important role in diagnosis and prognosis. However, it is very challenging to achieve satisfactory performance in LNM classification, because both the morphology and the spatial distribution of tumor regions should be taken into account. To address this problem, this article proposes a two-stage dMIL-Transformer framework, which integrates both the morphological and spatial information of the tumor regions based on the theory of multiple instance learning (MIL). In the first stage, a double Max-Min MIL (dMIL) strategy is devised to select the suspected top-K positive instances from each input histopathology image, which contains tens of thousands of patches (primarily negative). The dMIL strategy enables a better decision boundary for selecting the critical instances compared with other methods. In the second stage, a Transformer-based MIL aggregator is designed to integrate all the morphological and spatial information of the selected instances from the first stage. The self-attention mechanism is further employed to characterize the correlation between different instances and learn the bag-level representation for predicting the LNM category. The proposed dMIL-Transformer can effectively handle the challenging LNM classification task while offering good visualization and interpretability. We conduct various experiments over three LNM datasets and achieve a 1.79%-7.50% performance improvement compared with other state-of-the-art methods.

AAAI Conference 2023 Conference Paper

PUPS: Point Cloud Unified Panoptic Segmentation

  • Shihao Su
  • Jianyun Xu
  • Huanyu Wang
  • Zhenwei Miao
  • Xin Zhan
  • Dayang Hao
  • Xi Li

Point cloud panoptic segmentation is a challenging task that seeks a holistic solution for both semantic and instance segmentation to predict groupings of coherent points. Previous approaches treat semantic and instance segmentation as surrogate tasks, and they either use clustering methods or bounding boxes to gather instance groupings, with costly computation and hand-crafted designs in the instance segmentation task. In this paper, we propose a simple but effective point cloud unified panoptic segmentation (PUPS) framework, which uses a set of point-level classifiers to directly predict semantic and instance groupings in an end-to-end manner. To realize PUPS, we introduce bipartite matching to our training pipeline so that our classifiers are able to exclusively predict groupings of instances, getting rid of hand-crafted designs, e.g. anchors and Non-Maximum Suppression (NMS). In order to achieve better grouping results, we utilize a transformer decoder to iteratively refine the point classifiers and develop a context-aware CutMix augmentation to overcome the class imbalance problem. As a result, PUPS achieves 1st place on the leaderboard of the SemanticKITTI panoptic segmentation task and state-of-the-art results on nuScenes.
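The bipartite matching step mentioned above can be pictured with the small sketch below: each point-level classifier is assigned to at most one ground-truth instance via Hungarian matching, so predictions stay exclusive without anchors or NMS. The cost used here (negative class probability plus a negative soft IoU over points) is an assumed stand-in for the paper's actual matching cost.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def match_predictions(pred_probs, pred_masks, gt_classes, gt_masks):
    """pred_probs: (Q, C) class probabilities; pred_masks: (Q, N) per-point scores;
    gt_classes: (G,) class ids; gt_masks: (G, N) binary point memberships."""
    cls_cost = -pred_probs[:, gt_classes]                  # (Q, G): favor the correct class
    inter = pred_masks @ gt_masks.T                        # soft intersection over points
    union = pred_masks.sum(1, keepdims=True) + gt_masks.sum(1) - inter
    iou_cost = -(inter / np.maximum(union, 1e-6))          # (Q, G): favor point overlap
    row, col = linear_sum_assignment(cls_cost + iou_cost)  # Hungarian matching
    return list(zip(row.tolist(), col.tolist()))           # (prediction, ground-truth) pairs


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    probs = rng.random((8, 5))
    probs /= probs.sum(1, keepdims=True)
    masks = rng.random((8, 1000))
    gt_masks = (rng.random((2, 1000)) > 0.5).astype(float)
    print(match_predictions(probs, masks, np.array([1, 3]), gt_masks))
```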

AAAI Conference 2023 Conference Paper

Referring Expression Comprehension Using Language Adaptive Inference

  • Wei Su
  • Peihan Miao
  • Huanzhang Dou
  • Yongjian Fu
  • Xi Li

Different from universal object detection, referring expression comprehension (REC) aims to locate specific objects referred to by natural language expressions. The expression provides high-level concepts of relevant visual and contextual patterns, which vary significantly with different expressions and account for only a few of those encoded in the REC model. This leads us to a question: do we really need the entire network with a fixed structure for various referring expressions? Ideally, given an expression, only expression-relevant components of the REC model are required. These components should be small in number as each expression only contains very few visual and contextual clues. This paper explores the adaptation between expressions and REC models for dynamic inference. Concretely, we propose a neat yet efficient framework named Language Adaptive Dynamic Subnets (LADS), which can extract language-adaptive subnets from the REC model conditioned on the referring expressions. By using the compact subnet, the inference can be more economical and efficient. Extensive experiments on RefCOCO, RefCOCO+, RefCOCOg, and Referit show that the proposed method achieves faster inference speed and higher accuracy against state-of-the-art approaches.

IJCAI Conference 2023 Conference Paper

SGAT4PASS: Spherical Geometry-Aware Transformer for PAnoramic Semantic Segmentation

  • Xuewei Li
  • Tao Wu
  • Zhongang Qi
  • Gaoang Wang
  • Ying Shan
  • Xi Li

As an important and challenging problem in computer vision, PAnoramic Semantic Segmentation (PASS) gives complete scene perception based on an ultra-wide angle of view. Usually, prevalent PASS methods with 2D panoramic image input focus on solving image distortions but lack consideration of the 3D properties of the original 360-degree data. Therefore, their performance drops considerably when the input panoramic images contain 3D disturbances. To be more robust to 3D disturbance, we propose our Spherical Geometry-Aware Transformer for PAnoramic Semantic Segmentation (SGAT4PASS), considering 3D spherical geometry knowledge. Specifically, a spherical geometry-aware framework is proposed for PASS. It includes three modules, i.e., spherical geometry-aware image projection, spherical deformable patch embedding, and a panorama-aware loss, which take input images with 3D disturbance into account, add a spherical geometry-aware constraint on the existing deformable patch embedding, and indicate the pixel density of the original 360-degree data, respectively. Experimental results on the Stanford2D3D Panoramic dataset show that SGAT4PASS significantly improves performance and robustness, with approximately a 2% increase in mIoU, and when small 3D disturbances occur in the data, the stability of our performance is improved by an order of magnitude. Our code and supplementary material are available at https://github.com/TencentARC/SGAT4PASS.

IJCAI Conference 2021 Conference Paper

Automatic Translation of Music-to-Dance for In-Game Characters

  • Yinglin Duan
  • Tianyang Shi
  • Zhipeng Hu
  • Zhengxia Zou
  • Changjie Fan
  • Yi Yuan
  • Xi Li

Music-to-dance translation is an emerging and powerful feature in recent role-playing games. Previous works on this topic consider music-to-dance as a supervised motion generation problem based on time-series data. However, these methods require a large number of training data pairs and may suffer from degradation of the movements. This paper provides a new solution to this task, where we re-formulate the translation as a piece-wise dance phrase retrieval problem based on choreography theory. With such a design, players are allowed to optionally edit the dance movements on top of our generation, whereas other regression-based methods ignore such user interactivity. Considering that dance motion capture is expensive and requires the assistance of professional dancers, we train our method in a semi-supervised fashion with a large unlabeled music dataset (20x larger than our labeled one) and also introduce self-supervised pre-training to improve the training stability and generalization performance. Experimental results suggest that our method not only generalizes well over various styles of music but also succeeds in choreography for game players. Our project, including the large-scale dataset and supplemental materials, is available at https://github.com/FuxiCV/music-to-dance.

AAAI Conference 2019 Conference Paper

Learning a Key-Value Memory Co-Attention Matching Network for Person Re-Identification

  • Yaqing Zhang
  • Xi Li
  • Zhongfei Zhang

Person re-identification (Re-ID) is typically cast as the problem of semantic representation and alignment, which requires precisely discovering and modeling the inherent spatial structure information on person images. Motivated by this observation, we propose a Key-Value Memory Matching Network (KVM-MN) model that consists of key-value memory representation and key-value co-attention matching. The proposed KVM-MN model is capable of building an effective local position-aware person representation that encodes the spatial feature information in the form of multi-head key-value memory. Furthermore, the proposed KVM-MN model makes use of multi-head co-attention to automatically learn a number of cross-person-matching patterns, resulting in more robust and interpretable matching results. Finally, we build a set-wise learning mechanism that implements a more generalized query-to-gallery-image-set learning procedure. Experimental results demonstrate the effectiveness of the proposed model against the state-of-the-art.

AAAI Conference 2019 Conference Paper

Spatio-Temporal Graph Routing for Skeleton-Based Action Recognition

  • Bin Li
  • Xi Li
  • Zhongfei Zhang
  • Fei Wu

With the representation effectiveness, skeleton-based human action recognition has received considerable research attention, and has a wide range of real applications. In this area, many existing methods typically rely on a fixed physical-connectivity skeleton structure for recognition, which is incapable of well capturing the intrinsic high-order correlations among skeleton joints. In this paper, we propose a novel spatio-temporal graph routing (STGR) scheme for skeleton-based action recognition, which adaptively learns the intrinsic high-order connectivity relationships for physically apart skeleton joints. Specifically, the scheme is composed of two components: spatial graph router (SGR) and temporal graph router (TGR). The SGR aims to discover the connectivity relationships among the joints based on sub-group clustering along the spatial dimension, while the TGR explores the structural information by measuring the correlation degrees between temporal joint node trajectories. The proposed scheme is naturally and seamlessly incorporated into the framework of graph convolutional networks (GCNs) to produce a set of skeleton-joint-connectivity graphs, which are further fed into the classification networks. Moreover, an insightful analysis on the receptive field of graph nodes is provided to explain the necessity of our method. Experimental results on two benchmark datasets (NTU-RGB+D and Kinetics) demonstrate the effectiveness against the state-of-the-art.

IJCAI Conference 2018 Conference Paper

Deep Convolutional Neural Networks with Merge-and-Run Mappings

  • Liming Zhao
  • Mingjie Li
  • Depu Meng
  • Xi Li
  • Zhaoxiang Zhang
  • Yueting Zhuang
  • Zhuowen Tu
  • Jingdong Wang

A deep residual network, built by stacking a sequence of residual blocks, is easy to train, because identity mappings skip residual branches and thus improve information flow. To further reduce the training difficulty, we present a simple network architecture, deep merge-and-run neural networks. The novelty lies in a modularized building block, merge-and-run block, which assembles residual branches in parallel through a merge-and-run mapping: average the inputs of these residual branches (Merge), and add the average to the output of each residual branch as the input of the subsequent residual branch (Run), respectively. We show that the merge-and-run mapping is a linear idempotent function in which the transformation matrix is idempotent, and thus improves information flow, making training easy. In comparison with residual networks, our networks enjoy compelling advantages: they contain much shorter paths and the width, i.e., the number of channels, is increased, and the time complexity remains unchanged. We evaluate the performance on the standard recognition tasks. Our approach demonstrates consistent improvements over ResNets with the comparable setup, and achieves competitive results (e.g., 3.06% testing error on CIFAR-10, 17.55% on CIFAR-100, 1.51% on SVHN).
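The merge-and-run mapping described above is concrete enough to sketch: the inputs of two parallel residual branches are averaged (Merge), and the average is added to each branch's output to form the inputs of the next block (Run). The branch depth and layer choices below are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn


def conv_bn_relu(channels: int) -> nn.Sequential:
    return nn.Sequential(
        nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
        nn.BatchNorm2d(channels),
        nn.ReLU(inplace=True),
    )


class MergeAndRunBlock(nn.Module):
    """Two parallel residual branches coupled by a merge-and-run mapping."""

    def __init__(self, channels: int):
        super().__init__()
        self.branch_a = nn.Sequential(conv_bn_relu(channels), conv_bn_relu(channels))
        self.branch_b = nn.Sequential(conv_bn_relu(channels), conv_bn_relu(channels))

    def forward(self, x_a: torch.Tensor, x_b: torch.Tensor):
        merged = 0.5 * (x_a + x_b)            # Merge: average the two branch inputs
        out_a = self.branch_a(x_a) + merged   # Run: add the average to each branch output,
        out_b = self.branch_b(x_b) + merged   # producing the inputs of the next block
        return out_a, out_b


if __name__ == "__main__":
    block = MergeAndRunBlock(channels=16)
    x = torch.randn(2, 16, 32, 32)
    y_a, y_b = block(x, x)  # both branches start from the same input tensor
    print(y_a.shape, y_b.shape)
```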

AAAI Conference 2018 Short Paper

FR-ANet: A Face Recognition Guided Facial Attribute Classification Network

  • Jiajiong Cao
  • Yingming Li
  • Xi Li
  • Zhongfei Zhang

In this paper, we study the problem of facial attribute learning. In particular, we propose a Face Recognition guided facial Attribute classification Network, called FR-ANet. All the attributes share low-level features, while high-level features are specially learned for attribute groups. Further, to utilize the identity information, high-level features are merged to perform face identity recognition. The experimental results on CelebA and LFWA datasets demonstrate the promise of the FR-ANet.

IJCAI Conference 2018 Conference Paper

Knowledge-Guided Agent-Tactic-Aware Learning for StarCraft Micromanagement

  • Yue Hu
  • Juntao Li
  • Xi Li
  • Gang Pan
  • Mingliang Xu

As an important and challenging problem in artificial intelligence (AI) game playing, StarCraft micromanagement involves a dynamically adversarial game playing process with complex multi-agent control within a large action space. In this paper, we propose a novel knowledge-guided agent-tactic-aware learning scheme, that is, opponent-guided tactic learning (OGTL), to cope with this micromanagement problem. In principle, the proposed scheme takes a two-stage cascaded learning strategy which is capable of not only transferring the human tactic knowledge from the human-made opponent agents to our AI agents but also improving the adversarial ability. With the power of reinforcement learning, such a knowledge-guided agent-tactic-aware scheme has the ability to guide the AI agents to achieve high winning-rate performances while accelerating the policy exploration process in a tactic-interpretable fashion. Experimental results demonstrate the effectiveness of the proposed scheme against the state-of-the-art approaches in several benchmark combat scenarios.

AAAI Conference 2018 Conference Paper

Multi-Channel Pyramid Person Matching Network for Person Re-Identification

  • Chaojie Mao
  • Yingming Li
  • Yaqing Zhang
  • Zhongfei Zhang
  • Xi Li

In this work, we present a Multi-Channel deep convolutional Pyramid Person Matching Network (MC-PPMN) based on the combination of the semantic-components and the color-texture distributions to address the problem of person re-identification. In particular, we learn separate deep representations for semantic-components and color-texture distributions from two person images and then employ pyramid person matching network (PPMN) to obtain correspondence representations. These correspondence representations are fused to perform the re-identification task. Further, the proposed framework is optimized via a unified end-to-end deep learning scheme. Extensive experiments on several benchmark datasets demonstrate the effectiveness of our approach against the state-of-the-art literature, especially on the rank-1 recognition rate.

IJCAI Conference 2018 Conference Paper

Progressive Blockwise Knowledge Distillation for Neural Network Acceleration

  • Hui Wang
  • Hanbin Zhao
  • Xi Li
  • Xu Tan

As an important and challenging problem in machine learning and computer vision, neural network acceleration essentially aims to enhance the computational efficiency without sacrificing the model accuracy too much. In this paper, we propose a progressive blockwise learning scheme for teacher-student model distillation at the subnetwork block level. The proposed scheme is able to distill the knowledge of the entire teacher network by locally extracting the knowledge of each block in terms of progressive blockwise function approximation. Furthermore, we propose a structure design criterion for the student subnetwork block, which is able to effectively preserve the original receptive field from the teacher network. Experimental results demonstrate the effectiveness of the proposed scheme against the state-of-the-art approaches.

IJCAI Conference 2018 Conference Paper

Semantic Locality-Aware Deformable Network for Clothing Segmentation

  • Wei Ji
  • Xi Li
  • Yueting Zhuang
  • Omar El Farouk Bourahla
  • Yixin Ji
  • Shihao Li
  • Jiabao Cui

Clothing segmentation is a challenging vision problem typically implemented within a fine-grained semantic segmentation framework. Different from conventional segmentation, clothing segmentation has some domain-specific properties such as texture richness, diverse appearance variations, non-rigid geometry deformations, and small sample learning. To deal with these points, we propose a semantic locality-aware segmentation model, which adaptively attaches an original clothing image with a semantically similar (e.g., appearance or pose) auxiliary exemplar by search. Through considering the interactions of the clothing image and its exemplar, more intrinsic knowledge about the locality manifold structures of clothing images is discovered to make the learning process of small sample problem more stable and tractable. Furthermore, we present a CNN model based on the deformable convolutions to extract the non-rigid geometry-aware features for clothing images. Experimental results demonstrate the effectiveness of the proposed model against the state-of-the-art approaches.

IJCAI Conference 2017 Conference Paper

Boosted Zero-Shot Learning with Semantic Correlation Regularization

  • Te Pi
  • Xi Li
  • Zhongfei (Mark) Zhang

We study zero-shot learning (ZSL) as a transfer learning problem, and focus on the two key aspects of ZSL, model effectiveness and model adaptation. For effective modeling, we adopt the boosting strategy to learn a zero-shot classifier from weak models to a strong model. For adaptable knowledge transfer, we devise a Semantic Correlation Regularization (SCR) approach to regularize the boosted model to be consistent with the inter-class semantic correlations. With SCR embedded in the boosting objective, and with a self-controlled sample selection for learning robustness, we propose a unified framework, Boosted Zero-shot classification with Semantic Correlation Regularization (BZ-SCR). By balancing the SCR-regularized boosted model selection and the self-controlled sample selection, BZ-SCR is capable of capturing both discriminative and adaptable feature-to-class semantic alignments, while ensuring the reliability and adaptability of the learned samples. The experiments on two ZSL datasets show the superiority of BZ-SCR over the state-of-the-art methods.

IJCAI Conference 2017 Conference Paper

Deep Optical Flow Estimation Via Multi-Scale Correspondence Structure Learning

  • Shanshan Zhao
  • Xi Li
  • Omar El Farouk Bourahla

As an important and challenging problem in computer vision, learning based optical flow estimation aims to discover the intrinsic correspondence structure between two adjacent video frames through statistical learning. Therefore, a key issue to solve in this area is how to effectively model the multi-scale correspondence structure properties in an adaptive end-to-end learning fashion. Motivated by this observation, we propose an end-to-end multi-scale correspondence structure learning (MSCSL) approach for optical flow estimation. In principle, the proposed MSCSL approach is capable of effectively capturing the multi-scale inter-image-correlation correspondence structures within a multi-level feature space from deep learning. Moreover, the proposed MSCSL approach builds a spatial Conv-GRU neural network model to adaptively model the intrinsic dependency relationships among these multi-scale correspondence structures. Finally, the above procedures for correspondence structure learning and multi-scale dependency modeling are implemented in a unified end-to-end deep learning framework. Experimental results on several benchmark datasets demonstrate the effectiveness of the proposed approach.

IJCAI Conference 2017 Conference Paper

Group-wise Deep Co-saliency Detection

  • Lina Wei
  • Shanshan Zhao
  • Omar El Farouk Bourahla
  • Xi Li
  • Fei Wu

In this paper, we propose an end-to-end group-wise deep co-saliency detection approach to address the co-salient object discovery problem based on the fully convolutional network (FCN) with group input and group output. The proposed approach captures the group-wise interaction information for group images by learning a semantics-aware image representation based on a convolutional neural network, which adaptively learns the group-wise features for co-saliency detection. Furthermore, the proposed approach discovers the collaborative and interactive relationships between group-wise feature representation and single-image individual feature representation, and models them in a collaborative learning framework. Finally, we set up a unified end-to-end deep learning scheme to jointly optimize the process of group-wise feature representation learning and the collaborative learning, leading to more reliable and robust co-saliency detection results. Experimental results demonstrate the effectiveness of our approach in comparison with the state-of-the-art approaches.

IJCAI Conference 2016 Conference Paper

Diverse Image Captioning via GroupTalk

  • Zhuhao Wang
  • Fei Wu
  • Weiming Lu
  • Jun Xiao
  • Xi Li
  • Zitong Zhang
  • Yueting Zhuang

Generally speaking, different persons tend to describe images from various aspects due to their individually subjective perception. As a result, generating the appropriate descriptions of images with both diversity and high quality is of great importance. In this paper, we propose a framework called GroupTalk to learn multiple image caption distributions simultaneously and effectively mimic the diversity of the image captions written by human beings. In particular, a novel iterative update strategy is proposed to separate training sentence samples into groups and learn their distributions at the same time. Furthermore, we introduce an efficient classifier to solve the problem brought about by the non-linear and discontinuous nature of language distributions which will impair performance. Experiments on several benchmark datasets show that GroupTalk naturally diversifies the generated captions of each image without sacrificing the accuracy.

IJCAI Conference 2016 Conference Paper

Self-Paced Boost Learning for Classification

  • Te Pi
  • Xi Li
  • Zhongfei Zhang
  • Deyu Meng
  • Fei Wu
  • Jun Xiao
  • Yueting Zhuang

Effectiveness and robustness are two essential aspects of supervised learning studies. For effective learning, ensemble methods are developed to build a strong effective model from ensemble of weak models. For robust learning, self-paced learning (SPL) is proposed to learn in a self-controlled pace from easy samples to complex ones. Motivated by simultaneously enhancing the learning effectiveness and robustness, we propose a unified framework, Self-Paced Boost Learning (SPBL). With an adaptive from-easy-to-hard pace in boosting process, SPBL asymptotically guides the model to focus more on the insufficiently learned samples with higher reliability. Via a max-margin boosting optimization with self-paced sample selection, SPBL is capable of capturing the intrinsic inter-class discriminative patterns while ensuring the reliability of the samples involved in learning. We formulate SPBL as a fully-corrective optimization for classification. The experiments on several real-world datasets show the superiority of SPBL in terms of both effectiveness and robustness.
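A loose sketch of the from-easy-to-hard idea behind SPBL: in each boosting round, only samples whose current loss is below a pace threshold are used to fit the next weak learner, and the threshold grows so that harder samples are admitted over time. The hard 0/1 selection rule, decision stumps, and the exponential loss are simplifying assumptions; the paper's fully-corrective max-margin formulation is not reproduced here.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier


def self_paced_boost(X, y, rounds: int = 10, pace: float = 0.5, pace_growth: float = 1.3):
    """y in {-1, +1}. Returns a list of (weight, weak_learner) pairs."""
    ensemble, scores = [], np.zeros(len(y), dtype=float)
    for _ in range(rounds):
        losses = np.exp(-y * scores)                 # per-sample exponential loss
        selected = losses < pace                     # self-paced step: keep easy samples
        if not selected.any():
            selected = losses <= np.median(losses)   # fall back to the easier half
        stump = DecisionTreeClassifier(max_depth=1).fit(X[selected], y[selected])
        pred = stump.predict(X)
        err = np.clip(np.mean(pred[selected] != y[selected]), 1e-6, 1 - 1e-6)
        alpha = 0.5 * np.log((1 - err) / err)        # standard boosting weight
        scores += alpha * pred
        ensemble.append((alpha, stump))
        pace *= pace_growth                          # admit harder samples next round
    return ensemble


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 5))
    y = np.where(X[:, 0] + 0.3 * rng.normal(size=200) > 0, 1, -1)
    print(len(self_paced_boost(X, y)), "weak learners trained")
```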

IJCAI Conference 2016 Conference Paper

Semantics-Aware Deep Correspondence Structure Learning for Robust Person Re-Identification

  • Yaqing Zhang
  • Xi Li
  • Liming Zhao
  • Zhongfei Zhang

In this paper, we propose an end-to-end deep correspondence structure learning (DCSL) approach to address the cross-camera person-matching problem in the person re-identification task. The proposed DCSL approach captures the intrinsic structural information on persons by learning a semantics-aware image representation based on convolutional neural networks, which adaptively learns discriminative features for person identification. Furthermore, the proposed DCSL approach seeks to adaptively learn a hierarchical data-driven feature matching function which outputs the matching correspondence results between the learned semantics-aware image representations for a person pair. Finally, we set up a unified end-to-end deep learning scheme to jointly optimize the processes of semantics-aware image representation learning and cross-person correspondence structure learning, leading to more reliable and robust person re-identification results in complicated scenarios. Experimental results on several benchmark datasets demonstrate the effectiveness of our approach against the state-of-the-art approaches.

AAAI Conference 2015 Conference Paper

Metric Learning Driven Multi-Task Structured Output Optimization for Robust Keypoint Tracking

  • Liming Zhao
  • Xi Li
  • Jun Xiao
  • Fei Wu
  • Yueting Zhuang

As an important and challenging problem in computer vision and graphics, keypoint-based object tracking is typically formulated in a spatio-temporal statistical learning framework. However, most existing keypoint trackers are incapable of effectively modeling and balancing the following three aspects in a simultaneous manner: temporal model coherence across frames, spatial model consistency within frames, and discriminative feature construction. To address this issue, we propose a robust keypoint tracker based on spatio-temporal multi-task structured output optimization driven by discriminative metric learning. Consequently, temporal model coherence is characterized by multi-task structured keypoint model learning over several adjacent frames, while spatial model consistency is modeled by solving a geometric verification based structured learning problem. Discriminative feature construction is enabled by metric learning to ensure the intra-class compactness and inter-class separability. Finally, the above three modules are simultaneously optimized in a joint learning scheme. Experimental results have demonstrated the effectiveness of our tracker.

AAAI Conference 2015 Conference Paper

Structured Embedding via Pairwise Relations and Long-Range Interactions in Knowledge Base

  • Fei Wu
  • Jun Song
  • Yi Yang
  • Xi Li
  • Zhongfei Zhang
  • Yueting Zhuang

We consider the problem of embedding entities and relations of knowledge bases into low-dimensional continuous vector spaces (distributed representations). Unlike most existing approaches, which are primarily efficient for modelling pairwise relations between entities, we attempt to explicitly model both pairwise relations and long-range interactions between entities, by interpreting them as linear operators on the low-dimensional embeddings of the entities. Therefore, in this paper we introduce path ranking to capture the long-range interactions of a knowledge graph while preserving its pairwise relations; we call the resulting model structured embedding via pairwise relations and long-range interactions (referred to as SePLi). Compared with the state-of-the-art models, SePLi achieves better embedding performance.

TIST Journal 2013 Journal Article

A survey of appearance models in visual object tracking

  • Xi Li
  • Weiming Hu
  • Chunhua Shen
  • Zhongfei Zhang
  • Anthony Dick
  • Anton van den Hengel

Visual object tracking is a significant computer vision task which can be applied to many domains, such as visual surveillance, human computer interaction, and video compression. Despite extensive research on this topic, it still suffers from difficulties in handling complex object appearance changes caused by factors such as illumination variation, partial occlusion, shape deformation, and camera motion. Therefore, effective modeling of the 2D appearance of tracked objects is a key issue for the success of a visual tracker. In the literature, researchers have proposed a variety of 2D appearance models. To help readers swiftly learn the recent advances in 2D appearance models for visual object tracking, we contribute this survey, which provides a detailed review of the existing 2D appearance models. In particular, this survey takes a module-based architecture that enables readers to easily grasp the key points of visual object tracking. In this survey, we first decompose the problem of appearance modeling into two different processing stages: visual representation and statistical modeling. Then, different 2D appearance models are categorized and discussed with respect to their composition modules. Finally, we address several issues of interest as well as the remaining challenges for future research on this topic. The contributions of this survey are fourfold. First, we review the literature of visual representations according to their feature-construction mechanisms (i.e., local and global). Second, the existing statistical modeling schemes for tracking-by-detection are reviewed according to their model-construction mechanisms: generative, discriminative, and hybrid generative-discriminative. Third, each type of visual representations or statistical modeling techniques is analyzed and discussed from a theoretical or practical viewpoint. Fourth, the existing benchmark resources (e.g., source codes and video datasets) are examined in this survey.

IJCAI Conference 2009 Conference Paper

  • Xi Li
  • Kazuhiro Fukui
  • Nanning Zheng

Object recognition using an image set or video sequence as input tends to be more robust, since an image set or video sequence provides much more information than a single snapshot about the variability in the appearance of the target subject. The Constrained Mutual Subspace Method (CMSM) is one of the state-of-the-art algorithms for image-set based object recognition; it first projects the image-set patterns onto the so-called generalized difference subspace and then classifies them based on the principal-angle-based mutual subspace distance. By treating the subspace bases of each image-set pattern as basic elements on the Grassmann manifold, this paper presents a framework for robust image-set based recognition via CMSM-based ensemble learning in a boosting manner. The proposed Boosting Constrained Mutual Subspace Method (BCMSM) improves the original CMSM in the following ways: a) the proposed BCMSM algorithm is insensitive to the dimension of the generalized difference subspace, whereas the performance of the original CMSM algorithm depends heavily on this dimension and selecting the optimal value is empirical and case-dependent; b) by taking advantage of both boosting and CMSM techniques, the generalization ability is improved and much higher classification performance can be achieved. Extensive experiments on real-life datasets (two face recognition tasks and one 3D object category classification task) show that the proposed method outperforms the previous state-of-the-art algorithms greatly in terms of classification accuracy.