
Author name cluster

Xiang Long

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

4 papers
1 author row

Possible papers (4)

AAAI Conference 2021 · Conference Paper

RSPNet: Relative Speed Perception for Unsupervised Video Representation Learning

  • Peihao Chen
  • Deng Huang
  • Dongliang He
  • Xiang Long
  • Runhao Zeng
  • Shilei Wen
  • Mingkui Tan
  • Chuang Gan

We study unsupervised video representation learning that seeks to learn both motion and appearance features from unlabeled video only, which can be reused for downstream tasks such as action recognition. This task, however, is extremely challenging due to 1) the highly complex spatio-temporal information in videos and 2) the lack of labeled data for training. Unlike representation learning for static images, it is difficult to construct a suitable self-supervised task to effectively model both motion and appearance features. More recently, several attempts have been made to learn video representations through video playback speed prediction. However, it is non-trivial to obtain precise speed labels for the videos. More critically, the learned models may tend to focus on motion patterns and thus may not learn appearance features well. In this paper, we observe that the relative playback speed is more consistent with motion patterns and thus provides more effective and stable supervision for representation learning. Therefore, we propose a new way to perceive the playback speed and exploit the relative speed between two video clips as labels. In this way, we are able to effectively perceive speed and learn better motion features. Moreover, to ensure the learning of appearance features, we further propose an appearance-focused task, where we enforce the model to perceive the appearance difference between two video clips. We show that jointly optimizing the two tasks consistently improves performance on two downstream tasks (namely, action recognition and video retrieval) w.r.t. the increasing pre-training epochs. Remarkably, for action recognition on the UCF101 dataset, we achieve 93.7% accuracy without the use of labeled data for pre-training, which outperforms the ImageNet supervised pre-trained model. Our code, pre-trained models, and supplementary materials can be found at https://github.com/PeihaoChen/RSPNet.
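A rough illustration of the relative-speed idea from the abstract: two clips sampled from videos at different playback speeds are compared, and the model is trained to judge which clip plays faster rather than to regress an absolute speed label. Everything below (the toy 3D-conv encoder, the hypothetical RelativeSpeedHead, the random data) is an assumed minimal sketch in PyTorch, not the authors' RSPNet implementation.

```python
import torch
import torch.nn as nn

class ClipEncoder(nn.Module):
    """Tiny 3D-conv encoder mapping a clip (B, C, T, H, W) to a feature vector."""
    def __init__(self, dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.Conv3d(32, 64, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),
        )
        self.fc = nn.Linear(64, dim)

    def forward(self, x):
        return self.fc(self.conv(x).flatten(1))

class RelativeSpeedHead(nn.Module):
    """Classifies a clip pair as (first slower / same speed / first faster)."""
    def __init__(self, dim=128):
        super().__init__()
        self.cls = nn.Linear(2 * dim, 3)

    def forward(self, f1, f2):
        return self.cls(torch.cat([f1, f2], dim=1))

encoder, head = ClipEncoder(), RelativeSpeedHead()
loss_fn = nn.CrossEntropyLoss()

# Toy batch: two clips per video, each resampled at a random playback speed.
clip_a = torch.randn(4, 3, 8, 32, 32)      # (B, C, T, H, W)
clip_b = torch.randn(4, 3, 8, 32, 32)
relative_label = torch.randint(0, 3, (4,))  # 0: slower, 1: same, 2: faster

logits = head(encoder(clip_a), encoder(clip_b))
loss = loss_fn(logits, relative_label)
loss.backward()
```

The point of the relative label is visible in the head: it only ever sees the pair jointly, so no per-video absolute speed annotation is needed.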

AAAI Conference 2020 · Conference Paper

Cross-Modality Attention with Semantic Graph Embedding for Multi-Label Classification

  • Renchun You
  • Zhiyao Guo
  • Lei Cui
  • Xiang Long
  • Yingze Bao
  • Shilei Wen

Multi-label image and video classification are fundamental yet challenging tasks in computer vision. The main challenges lie in capturing spatial or temporal dependencies between labels and discovering the locations of discriminative features for each class. To overcome these challenges, we propose cross-modality attention with semantic graph embedding for multi-label classification. Based on the constructed label graph, we propose an adjacency-based similarity graph embedding method to learn semantic label embeddings, which explicitly exploit label relationships. Our novel cross-modality attention maps are then generated with the guidance of the learned label embeddings. Experiments on two multi-label image classification datasets (MS-COCO and NUS-WIDE) show that our method outperforms existing state-of-the-art methods. In addition, we validate our method on a large multi-label video classification dataset (YouTube-8M Segments), and the results demonstrate the generalization capability of our method.
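A minimal sketch of the cross-modality attention step described above, assuming PyTorch and a plain nn.Embedding as a stand-in for the paper's graph-derived semantic label embeddings: each label embedding attends over the spatial CNN feature map, yielding one attention map and one logit per label. Shapes and module names are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalityAttention(nn.Module):
    def __init__(self, num_labels=80, dim=256):
        super().__init__()
        # Stand-in for the graph-derived semantic label embeddings.
        self.label_emb = nn.Embedding(num_labels, dim)
        self.score = nn.Linear(dim, 1)

    def forward(self, feat):                       # feat: (B, dim, H, W)
        B, D, H, W = feat.shape
        feat = feat.flatten(2).transpose(1, 2)     # (B, HW, D)
        labels = self.label_emb.weight             # (L, D)
        attn = torch.einsum('bnd,ld->bln', feat, labels)  # (B, L, HW)
        attn = F.softmax(attn, dim=-1)             # per-label spatial attention
        label_feat = torch.bmm(attn, feat)         # (B, L, D)
        return self.score(label_feat).squeeze(-1)  # (B, L) multi-label logits

model = CrossModalityAttention()
logits = model(torch.randn(2, 256, 14, 14))        # e.g. a ResNet feature map
targets = torch.randint(0, 2, (2, 80)).float()     # multi-hot labels
loss = F.binary_cross_entropy_with_logits(logits, targets)
```

Because each label attends to its own spatial locations, the softmaxed `attn` tensor doubles as a per-class localization map, which matches the "discovering the locations of discriminative features" motivation.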

AAAI Conference 2020 · Conference Paper

Multi-Label Classification with Label Graph Superimposing

  • Ya Wang
  • Dongliang He
  • Fu Li
  • Xiang Long
  • Zhichao Zhou
  • Jinwen Ma
  • Shilei Wen

Images and videos typically contain multiple objects or actions. Multi-label recognition has achieved impressive performance thanks to the rapid development of deep learning technologies. Recently, graph convolutional networks (GCNs) have been leveraged to boost multi-label recognition performance. However, the best way to model label correlations, and how feature learning can be improved with awareness of the label system, remain unclear. In this paper, we propose a label graph superimposing framework that improves the conventional GCN+CNN framework for multi-label recognition in two respects. First, we model label correlations by superimposing a label graph built from statistical co-occurrence information onto a graph constructed from knowledge priors of the labels, and then apply multi-layer graph convolutions on the superimposed graph to abstract label embeddings. Second, we propose to leverage the embedding of the whole label system for better representation learning. In detail, lateral connections between the GCN and the CNN are added at shallow, middle, and deep layers to inject label-system information into the backbone CNN, making feature learning label-aware. Extensive experiments on the MS-COCO and Charades datasets show that our solution greatly improves recognition performance and achieves new state-of-the-art results.
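The superimposing step might look roughly like the following PyTorch sketch: a statistical co-occurrence adjacency and a knowledge-prior adjacency are mixed with an assumed weight alpha, normalized, and fed through standard GCN layers to produce label embeddings. The mixing rule, normalization, and toy matrices are assumptions for illustration; the paper's actual graph construction may differ.

```python
import torch
import torch.nn as nn

def normalize_adj(a):
    """Symmetric normalization D^-1/2 (A + I) D^-1/2, as in standard GCNs."""
    a = a + torch.eye(a.size(0))
    d = a.sum(1).pow(-0.5)
    return d.unsqueeze(1) * a * d.unsqueeze(0)

class GCNLayer(nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.w = nn.Linear(in_dim, out_dim)

    def forward(self, x, adj):
        return torch.relu(self.w(adj @ x))

num_labels, alpha = 20, 0.5
a_stat = torch.rand(num_labels, num_labels)   # from label co-occurrence counts
a_prior = torch.rand(num_labels, num_labels)  # from knowledge priors of labels
adj = normalize_adj(alpha * a_stat + (1 - alpha) * a_prior)  # superimposed graph

x = torch.randn(num_labels, 300)              # initial label features (e.g. word vectors)
gcn1, gcn2 = GCNLayer(300, 128), GCNLayer(128, 128)
label_embeddings = gcn2(gcn1(x, adj), adj)    # (num_labels, 128)
```

The lateral GCN-to-CNN connections the abstract mentions would then feed `label_embeddings` (or intermediate GCN activations) into the backbone at several depths; that wiring is omitted here.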

AAAI Conference 2018 · Conference Paper

Multimodal Keyless Attention Fusion for Video Classification

  • Xiang Long
  • Chuang Gan
  • Gerard de Melo
  • Xiao Liu
  • Yandong Li
  • Fu Li
  • Shilei Wen

The problem of video classification is inherently sequential and multimodal, and deep neural models hence need to capture and aggregate the most pertinent signals for a given input video. We propose Keyless Attention as an elegant and efficient means to more effectively account for the sequential nature of the data. Moreover, comparing a variety of multimodal fusion methods, we find that Multimodal Keyless Attention Fusion is the most successful at discerning interactions between modalities. We experiment on four highly heterogeneous datasets (UCF101, ActivityNet, Kinetics, and YouTube-8M) to validate our conclusion, and show that our approach achieves highly competitive results. Especially on large-scale data, our method offers substantial advantages in efficiency and performance. Most remarkably, our best single model achieves 77.0% top-1 accuracy and 93.2% top-5 accuracy on the Kinetics validation set, and 82.2% GAP@20 on the official YouTube-8M test set.
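One plausible minimal reading of "keyless" attention, sketched in PyTorch: the attention weight for each time step is computed from that step's features alone, with no separate query/key pair, and a weighted sum pools the sequence. The two-modality fusion by concatenation and the dimensions below are illustrative assumptions about the setup, not the paper's exact model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class KeylessAttention(nn.Module):
    """Pools a sequence (B, T, D) into (B, D) with learned per-step weights."""
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, seq):                                       # (B, T, D)
        weights = F.softmax(self.score(torch.tanh(seq)), dim=1)   # (B, T, 1)
        return (weights * seq).sum(dim=1)                         # (B, D)

# Fuse two modalities (e.g. visual and audio streams) by attending over each
# sequence separately, then concatenating the summaries for classification.
visual, audio = torch.randn(2, 30, 512), torch.randn(2, 30, 128)
att_v, att_a = KeylessAttention(512), KeylessAttention(128)
fused = torch.cat([att_v(visual), att_a(audio)], dim=1)           # (2, 640)
logits = nn.Linear(640, 400)(fused)                               # e.g. 400 classes
```

Dropping the query/key machinery keeps the pooling to a single linear scorer per modality, which is consistent with the abstract's emphasis on efficiency at large scale.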