Arrow Research

Author name cluster

Xiaowei Guo

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

10 papers
1 author row

Possible papers

10

IJCAI Conference 2021 · Conference Paper

Dig into Multi-modal Cues for Video Retrieval with Hierarchical Alignment

  • Wenzhe Wang
  • Mengdan Zhang
  • Runnan Chen
  • Guanyu Cai
  • Penghao Zhou
  • Pai Peng
  • Xiaowei Guo
  • Jian Wu

Multi-modal cues presented in videos are usually beneficial for the challenging video-text retrieval task on internet-scale datasets. Recent video retrieval methods take advantage of multi-modal cues by aggregating them into holistic high-level semantics for matching with text representations in a global view. In contrast to this global alignment, the local alignment between the detailed semantics encoded in multi-modal cues and distinct phrases is still not well explored. Thus, in this paper, we leverage hierarchical video-text alignment to fully explore the detailed, diverse characteristics in multi-modal cues for fine-grained alignment with local semantics from phrases, as well as to capture a high-level semantic correspondence. Specifically, multi-step attention is learned for progressively comprehensive local alignment, and a holistic transformer is utilized to summarize multi-modal cues for global alignment. With hierarchical alignment, our model outperforms state-of-the-art methods on three public video retrieval datasets.
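
As a rough illustration of the two alignment levels described in the abstract, the sketch below (hypothetical names, not the authors' code) computes a local similarity by letting each phrase attend over the multi-modal cue features, plus a global similarity between pooled video and sentence embeddings.

    import torch
    import torch.nn.functional as F

    def hierarchical_similarity(cue_feats, phrase_feats, video_global, text_global, alpha=0.5):
        """Toy sketch of hierarchical video-text alignment (illustrative only).

        cue_feats:    (num_cues, d)    per-cue features (e.g. appearance, motion, audio)
        phrase_feats: (num_phrases, d) per-phrase text features
        video_global: (d,)             holistic video representation (e.g. transformer-pooled cues)
        text_global:  (d,)             holistic sentence representation
        """
        # Local alignment: each phrase attends over the multi-modal cues.
        attn = torch.softmax(phrase_feats @ cue_feats.t() / cue_feats.shape[1] ** 0.5, dim=-1)
        aligned = attn @ cue_feats                                    # (num_phrases, d)
        local_sim = F.cosine_similarity(aligned, phrase_feats, dim=-1).mean()

        # Global alignment: holistic video embedding vs. sentence embedding.
        global_sim = F.cosine_similarity(video_global, text_global, dim=0)
        return alpha * local_sim + (1 - alpha) * global_sim

    d = 64
    score = hierarchical_similarity(torch.randn(5, d), torch.randn(7, d),
                                    torch.randn(d), torch.randn(d))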

AAAI Conference 2021 · Conference Paper

Enhancing Unsupervised Video Representation Learning by Decoupling the Scene and the Motion

  • Jinpeng Wang
  • Yuting Gao
  • Ke Li
  • Jianguo Hu
  • Xinyang Jiang
  • Xiaowei Guo
  • Rongrong Ji
  • Xing Sun

One significant factor we expect video representation learning to capture, especially in contrast with image representation learning, is object motion. However, we found that in the current mainstream video datasets, some action categories are highly related to the scene where the action happens, making the model tend to degrade to a solution where only the scene information is encoded. For example, a trained model may predict a video as playing football simply because it sees the field, neglecting that the subject is dancing as a cheerleader on the field. This runs against our original intention for video representation learning and may bring a non-negligible scene bias to a different dataset. In order to tackle this problem, we propose to decouple the scene and the motion (DSM) with two simple operations, so that the model pays more attention to the motion information. Specifically, we construct a positive clip and a negative clip for each video. Compared to the original video, the positive clip is motion-untouched but scene-broken, while the negative clip is motion-broken but scene-untouched, obtained by Spatial Local Disturbance and Temporal Local Disturbance. Our objective is to pull the positive closer to and push the negative farther from the original clip in the latent space. In this way, the impact of the scene is weakened while the temporal sensitivity of the network is further enhanced. We conduct experiments on two tasks with various backbones and different pre-training datasets, and find that our method surpasses the SOTA methods with remarkable 8.1% and 8.8% improvements on the action recognition task on the UCF101 and HMDB51 datasets respectively, using the same backbone.
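
A minimal sketch of the pull/push objective described above, assuming generic clip embeddings and a simple margin-based formulation (the actual DSM loss and the disturbance operations are not reproduced here):

    import torch
    import torch.nn.functional as F

    def dsm_style_loss(anchor, positive, negative, margin=0.5):
        """Toy sketch of the decoupling objective (not the authors' implementation).

        anchor:   embedding of the original clip
        positive: embedding of a scene-broken but motion-preserved clip
        negative: embedding of a motion-broken but scene-preserved clip
        All tensors have shape (batch, dim); distances are cosine-based.
        """
        pos_dist = 1 - F.cosine_similarity(anchor, positive, dim=-1)
        neg_dist = 1 - F.cosine_similarity(anchor, negative, dim=-1)
        # Pull the positive closer and push the negative farther, up to a margin.
        return F.relu(pos_dist - neg_dist + margin).mean()

    loss = dsm_style_loss(torch.randn(8, 128), torch.randn(8, 128), torch.randn(8, 128))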

AAAI Conference 2021 · Conference Paper

One for More: Selecting Generalizable Samples for Generalizable ReID Model

  • Enwei Zhang
  • Xinyang Jiang
  • Hao Cheng
  • Ancong Wu
  • Fufu Yu
  • Ke Li
  • Xiaowei Guo
  • Feng Zheng

Current training objectives of existing person Re-IDentification (ReID) models only ensure that the loss of the model decreases on the selected training batch, with no regard to the performance on samples outside the batch. This inevitably causes the model to over-fit the data in the dominant position (e.g., head data in imbalanced classes, easy samples, or noisy samples). The latest resampling methods address the issue by designing specific criteria to select specific samples that make the model generalize better on a certain type of data (e.g., hard samples, tail data), which is not adaptive to the inconsistent distributions of real-world ReID data. Therefore, instead of simply presuming which samples are generalizable, this paper proposes a one-for-more training objective that directly takes the generalization ability of selected samples as a loss function and learns a sampler to automatically select generalizable samples. More importantly, the proposed one-for-more sampler can be seamlessly integrated into the ReID training framework, which makes it possible to simultaneously train the ReID model and the sampler in an end-to-end fashion. The experimental results show that our method can effectively improve ReID model training and boost the performance of ReID models.
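
The idea of taking "the generalization ability of selected samples" as the training signal can be pictured with a crude approximation: score a candidate batch by how much a simulated update on it reduces the loss on samples outside the batch. The sketch below uses hypothetical names and is only an illustration, not the learned sampler from the paper.

    import copy
    import torch
    import torch.nn.functional as F

    def generalization_gain(model, batch, probe, lr=0.1):
        """Illustrative score: loss reduction on a held-out probe set after one
        simulated SGD step on the candidate batch (not the paper's sampler)."""
        xb, yb = batch
        xp, yp = probe
        with torch.no_grad():
            before = F.cross_entropy(model(xp), yp).item()

        trial = copy.deepcopy(model)               # simulate one SGD step on the batch
        F.cross_entropy(trial(xb), yb).backward()
        with torch.no_grad():
            for p in trial.parameters():
                p -= lr * p.grad
            after = F.cross_entropy(trial(xp), yp).item()
        return before - after                      # > 0: the batch generalizes to the probe set

    model = torch.nn.Linear(16, 4)
    batch = (torch.randn(8, 16), torch.randint(0, 4, (8,)))
    probe = (torch.randn(32, 16), torch.randint(0, 4, (32,)))
    print(generalization_gain(model, batch, probe))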

AAAI Conference 2020 · Conference Paper

Asymmetric Co-Teaching for Unsupervised Cross-Domain Person Re-Identification

  • Fengxiang Yang
  • Ke Li
  • Zhun Zhong
  • Zhiming Luo
  • Xing Sun
  • Hao Cheng
  • Xiaowei Guo
  • Feiyue Huang

Person re-identification (re-ID) is a challenging task due to the high variance within identity samples and imaging conditions. Although recent advances in deep learning have achieved remarkable accuracy in settled scenes, i.e., the source domain, few works generalize well on an unseen target domain. One popular solution is assigning unlabeled target images pseudo labels by clustering, and then retraining the model. However, clustering methods tend to introduce noisy labels and discard low-confidence samples as outliers, which may hinder the retraining process and thus limit the generalization ability. In this study, we argue that by explicitly adding a sample filtering procedure after the clustering, the mined examples can be used much more efficiently. To this end, we design an asymmetric co-teaching framework, which resists noisy labels by having two models cooperate to select data with possibly clean labels for each other. Meanwhile, one of the models receives samples as pure as possible, while the other takes in samples as diverse as possible. This procedure encourages the selected training samples to be both clean and diverse, and the two models to promote each other iteratively. Extensive experiments show that the proposed framework can consistently benefit most clustering-based methods, and boosts the state-of-the-art adaptation accuracy. Our code is available at https://github.com/FlyingRoastDuck/ACT_AAAI20.
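
A toy sketch of the co-teaching exchange at the heart of such a framework, in which each network hands the samples it finds low-loss (likely clean pseudo-labels) to its peer for the update step; the asymmetric pure-versus-diverse selection described in the abstract is not modeled here.

    import torch
    import torch.nn.functional as F

    def coteach_select(model_a, model_b, x, y, keep_ratio=0.8):
        """Illustrative small-loss sample exchange (not the full asymmetric framework)."""
        with torch.no_grad():
            loss_a = F.cross_entropy(model_a(x), y, reduction="none")
            loss_b = F.cross_entropy(model_b(x), y, reduction="none")
        k = max(1, int(keep_ratio * len(y)))
        idx_for_b = loss_a.topk(k, largest=False).indices   # A's cleanest samples train B
        idx_for_a = loss_b.topk(k, largest=False).indices   # B's cleanest samples train A
        return idx_for_a, idx_for_b

    net_a, net_b = torch.nn.Linear(32, 10), torch.nn.Linear(32, 10)
    x, y = torch.randn(16, 32), torch.randint(0, 10, (16,))
    idx_a, idx_b = coteach_select(net_a, net_b, x, y)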

NeurIPS Conference 2020 · Conference Paper

Pruning Filter in Filter

  • Fanxu Meng
  • Hao Cheng
  • Ke Li
  • Huixiang Luo
  • Xiaowei Guo
  • Guangming Lu
  • Xing Sun

Pruning has become a very powerful and effective technique to compress and accelerate modern neural networks. Existing pruning methods can be grouped into two categories: filter pruning (FP) and weight pruning (WP). FP wins at hardware compatibility but loses at the compression ratio compared with WP. To combine the strengths of both methods, we propose to prune the filter in the filter. Specifically, we treat a filter F, whose size is C × K × K, as K × K stripes, i.e., 1 × 1 filters; then, by pruning the stripes instead of the whole filter, we can achieve finer granularity than traditional FP while remaining hardware friendly. We term our method SWP (Stripe-Wise Pruning). SWP is implemented by introducing a novel learnable matrix called the Filter Skeleton, whose values reflect the optimal shape of each filter. As some recent work has shown that the pruned architecture is more crucial than the inherited important weights, we argue that the architecture of a single filter, i.e., the Filter Skeleton, also matters. Through extensive experiments, we demonstrate that SWP is more effective than previous FP-based methods and achieves the state-of-the-art pruning ratio on the CIFAR-10 and ImageNet datasets without an obvious accuracy drop.
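
A minimal sketch of the Filter Skeleton idea as read from the abstract (hypothetical class and parameter names, not the official SWP implementation): a learnable K × K matrix per filter scales its stripes, and stripes whose entries shrink toward zero under a sparsity penalty can be pruned away.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SkeletonConv(nn.Module):
        """Toy sketch of stripe-wise pruning with a learnable Filter Skeleton."""

        def __init__(self, in_ch, out_ch, k=3):
            super().__init__()
            self.conv = nn.Conv2d(in_ch, out_ch, k, padding=k // 2, bias=False)
            # One learnable scale per spatial position of each filter: the K x K "stripes".
            self.skeleton = nn.Parameter(torch.ones(out_ch, 1, k, k))

        def forward(self, x):
            # Every C x 1 x 1 stripe is scaled by its skeleton entry; stripes whose
            # entries are driven toward 0 (e.g. by an L1 penalty) can be pruned away.
            return F.conv2d(x, self.conv.weight * self.skeleton, padding=self.conv.padding)

        def skeleton_l1(self):
            return self.skeleton.abs().sum()

    layer = SkeletonConv(16, 32)
    out = layer(torch.randn(2, 16, 8, 8))
    loss = out.mean() + 1e-4 * layer.skeleton_l1()   # sparsity pressure on the skeleton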

AAAI Conference 2020 · Conference Paper

Rethinking Temporal Fusion for Video-Based Person Re-Identification on Semantic and Time Aspect

  • Xinyang Jiang
  • Yifei Gong
  • Xiaowei Guo
  • Qize Yang
  • Feiyue Huang
  • Wei-Shi Zheng
  • Feng Zheng
  • Xing Sun

Recently, the research interest of person re-identification (ReID) has gradually turned to video-based methods, which acquire a person representation by aggregating frame features of an entire video. However, existing video-based ReID methods do not consider the semantic difference brought by the outputs of different network stages, which potentially compromises the information richness of the person features. Furthermore, traditional methods ignore important relationships among frames, which causes information redundancy in fusion along the time axis. To address these issues, we propose a novel general temporal fusion framework that aggregates frame features on both the semantic aspect and the time aspect. For the semantic aspect, a multi-stage fusion network is explored to fuse richer frame features at multiple semantic levels, which can effectively reduce the information loss caused by traditional single-stage fusion. For the time aspect, the existing intra-frame attention method is improved by adding a novel inter-frame attention module, which effectively reduces the information redundancy in temporal fusion by taking the relationships among frames into consideration. The experimental results show that our approach can effectively improve video-based re-identification accuracy, achieving state-of-the-art performance.
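
One way to picture the inter-frame attention idea is a fusion weight that depends on how redundant a frame is with the rest of the tracklet. The sketch below is an illustrative simplification with hypothetical names, not the paper's module.

    import torch
    import torch.nn.functional as F

    def interframe_fusion(frame_feats):
        """Illustrative inter-frame attention pooling: frames that are redundant
        with many other frames contribute less to the fused person representation.

        frame_feats: (num_frames, dim) per-frame features from one tracklet.
        """
        sim = F.cosine_similarity(frame_feats.unsqueeze(1), frame_feats.unsqueeze(0), dim=-1)
        redundancy = sim.mean(dim=1)                  # high value = similar to many frames
        weights = torch.softmax(-redundancy, dim=0)   # down-weight redundant frames
        return (weights.unsqueeze(1) * frame_feats).sum(dim=0)

    fused = interframe_fusion(torch.randn(12, 256))   # one 256-d tracklet descriptor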

AAAI Conference 2020 · Conference Paper

Viewpoint-Aware Loss with Angular Regularization for Person Re-Identification

  • Zhihui Zhu
  • Xinyang Jiang
  • Feng Zheng
  • Xiaowei Guo
  • Feiyue Huang
  • Xing Sun
  • Weishi Zheng

Although great progress in supervised person re-identification (Re-ID) has been made recently, Re-ID remains a massive visual challenge due to the viewpoint variation of a person. Most existing viewpoint-based person Re-ID methods project images from each viewpoint into separate and unrelated sub-feature spaces. They only model the identity-level distribution inside an individual viewpoint but ignore the underlying relationship between different viewpoints. To address this problem, we propose a novel approach, called Viewpoint-Aware Loss with Angular Regularization (VA-reID). Instead of one subspace for each viewpoint, our method projects the features from different viewpoints into a unified hypersphere and effectively models the feature distribution on both the identity level and the viewpoint level. In addition, rather than modeling different viewpoints as hard labels used for conventional viewpoint classification, we introduce viewpoint-aware adaptive label smoothing regularization (VALSR), which assigns an adaptive soft label to the feature representation. VALSR can effectively solve the ambiguity of viewpoint cluster label assignment. Extensive experiments on the Market1501 and DukeMTMC-reID datasets demonstrate that our method outperforms state-of-the-art supervised Re-ID methods.
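
A simplified sketch of viewpoint-aware label smoothing, assuming (identity, viewpoint) pairs as classes and a fixed smoothing weight in place of the adaptive soft labels used by VALSR; names and dimensions are illustrative only.

    import torch
    import torch.nn.functional as F

    def viewpoint_soft_labels(ids, views, num_ids, num_views, eps=0.2):
        """Each sample keeps 1 - eps on its own (identity, viewpoint) class and
        spreads eps over the same identity's other viewpoints (illustrative)."""
        target = torch.zeros(len(ids), num_ids * num_views)
        for i, (pid, v) in enumerate(zip(ids, views)):
            target[i, pid * num_views + v] = 1.0 - eps
            others = [pid * num_views + u for u in range(num_views) if u != v]
            target[i, others] = eps / len(others)
        return target

    def soft_cross_entropy(logits, soft_target):
        return -(soft_target * F.log_softmax(logits, dim=-1)).sum(dim=-1).mean()

    t = viewpoint_soft_labels(ids=[0, 3], views=[1, 2], num_ids=10, num_views=3)
    loss = soft_cross_entropy(torch.randn(2, 30), t)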

IJCAI Conference 2019 · Conference Paper

A Part Power Set Model for Scale-Free Person Retrieval

  • Yunhang Shen
  • Rongrong Ji
  • Xiaopeng Hong
  • Feng Zheng
  • Xiaowei Guo
  • Yongjian Wu
  • Feiyue Huang

Recently, person re-identification (re-ID) has attracted increasing research attention, with broad application prospects in video surveillance and beyond. Most existing methods rely heavily on well-aligned pedestrian images and a hand-engineered part-based model on the coarsest feature map. In this paper, to lighten the restriction of such fixed and coarse input alignment, an end-to-end part power set model with multi-scale features is proposed, which captures the discriminative parts of pedestrians from global to local, and from coarse to fine, enabling part-based, scale-free person re-ID. In particular, we first factorize the visual appearance by enumerating $k$-combinations, for all $k$, of the $n$ body parts to exploit rich global and partial information and learn discriminative feature maps. Then, a combination ranking module is introduced to guide the model training with all combinations of body parts, which alternates between ranking combinations and estimating an appearance model. To enable scale-free input, we further exploit the pyramid architecture of deep networks to construct multi-scale feature maps with a feasible amount of extra cost in terms of memory and time. Extensive experiments on the mainstream evaluation datasets, including Market-1501, DukeMTMC-reID and CUHK03, validate that our method achieves state-of-the-art performance.
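
The power-set enumeration itself is easy to sketch: every non-empty combination of the n part features is pooled into one descriptor, from single fine parts up to the whole body. The snippet below is illustrative only; the paper's combination ranking module and multi-scale pyramid are not included.

    from itertools import combinations
    import torch

    def part_power_set(part_feats):
        """Enumerate every non-empty combination of the n body-part features and
        average-pool each combination into one descriptor (illustrative sketch).

        part_feats: (n_parts, dim) features for n horizontal body parts.
        """
        n = part_feats.shape[0]
        descriptors = []
        for k in range(1, n + 1):
            for combo in combinations(range(n), k):
                descriptors.append(part_feats[list(combo)].mean(dim=0))
        return torch.stack(descriptors)           # (2**n - 1, dim)

    desc = part_power_set(torch.randn(4, 128))    # 15 combination descriptors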

IJCAI Conference 2018 · Conference Paper

Adversarial Attribute-Image Person Re-identification

  • Zhou Yin
  • Wei-Shi Zheng
  • Ancong Wu
  • Hong-Xing Yu
  • Hai Wan
  • Xiaowei Guo
  • Feiyue Huang
  • Jianhuang Lai

While attributes have been widely used for person re-identification (Re-ID), which aims at matching the same person's images across disjoint camera views, they are used either as extra features or for performing multi-task learning to assist the image-image matching task. However, how to find a set of person images according to a given attribute description, which is very practical in many surveillance applications, remains a rarely investigated cross-modality matching problem in person Re-ID. In this work, we present this challenge and leverage adversarial learning to formulate the attribute-image cross-modality person Re-ID model. By imposing a semantic consistency constraint across modalities as a regularization, the adversarial learning enables the generation of image-analogous concepts from query attributes for matching the corresponding images at both the global level and the semantic ID level. We conducted extensive experiments on three attribute datasets and demonstrated that the regularized adversarial modelling is so far the most effective method for the attribute-image cross-modality person Re-ID problem.
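
A toy sketch of the cross-modality setup, with hypothetical dimensions and a plain MSE standing in for the paper's semantic consistency constraint: a generator maps an attribute query to an image-analogous embedding, and a discriminator is fooled into treating it as a real image embedding, so that attribute queries and person images can be matched in one shared space.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    attr_dim, emb_dim = 30, 256
    generator = nn.Sequential(nn.Linear(attr_dim, 128), nn.ReLU(), nn.Linear(128, emb_dim))
    discriminator = nn.Sequential(nn.Linear(emb_dim, 64), nn.ReLU(), nn.Linear(64, 1))

    attrs = torch.rand(8, attr_dim)          # attribute queries
    img_emb = torch.randn(8, emb_dim)        # embeddings of the matching person images

    fake_emb = generator(attrs)              # image-analogous concepts of the attributes
    adv_loss = F.binary_cross_entropy_with_logits(
        discriminator(fake_emb), torch.ones(8, 1))     # fool the discriminator
    consistency = F.mse_loss(fake_emb, img_emb)        # stand-in for semantic consistency
    g_loss = adv_loss + consistency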

IJCAI Conference 2016 · Conference Paper

Towards Convolutional Neural Networks Compression via Global Error Reconstruction

  • Shaohui Lin
  • Rongrong Ji
  • Xiaowei Guo
  • Xuelong Li

In recent years, convolutional neural networks (CNNs) have achieved remarkable success in various applications such as image classification, object detection, object parsing and face alignment. Such CNN models are extremely powerful at dealing with massive amounts of training data by using millions or even billions of parameters. However, these models come at a heavy cost in model storage, which prohibits their usage in resource-limited applications like mobile or embedded devices. In this paper, we aim to compress CNN models to an extreme without significantly losing their discriminability. Our main idea is to explicitly model the output reconstruction error between the original and compressed CNNs, which is minimized to pursue a satisfactory rate-distortion trade-off after compression. In particular, a global error reconstruction method termed GER is presented, which first leverages an SVD-based low-rank approximation to coarsely compress the parameters in the fully connected layers in a layer-wise manner. Subsequently, these layer-wise initial compressions are jointly optimized from a global perspective via back-propagation. The proposed GER method is evaluated on the ILSVRC2012 image classification benchmark, with implementations on two widely adopted convolutional neural networks, i.e., AlexNet and VGGNet-19. Compared to several state-of-the-art and alternative methods of CNN compression, the proposed scheme demonstrates the best rate-distortion performance on both networks.
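
The layer-wise initialization step lends itself to a short sketch: approximate a fully connected layer's weight matrix with a rank-r SVD factorization, realized as two smaller linear layers. The global joint fine-tuning stage is not shown, and the names below are illustrative rather than the paper's code.

    import torch
    import torch.nn as nn

    def lowrank_fc(fc, rank):
        """Replace a Linear layer with a rank-`rank` SVD factorization (two smaller
        Linear layers), as a layer-wise initialization for later joint fine-tuning."""
        W = fc.weight.data                          # (out_features, in_features)
        U, S, Vh = torch.linalg.svd(W, full_matrices=False)
        A = nn.Linear(fc.in_features, rank, bias=False)
        B = nn.Linear(rank, fc.out_features, bias=True)
        A.weight.data = torch.diag(S[:rank].sqrt()) @ Vh[:rank]    # (rank, in)
        B.weight.data = U[:, :rank] @ torch.diag(S[:rank].sqrt())  # (out, rank)
        B.bias.data = fc.bias.data.clone()
        return nn.Sequential(A, B)                  # W x + b  ~=  B(A(x))

    fc = nn.Linear(512, 512)
    compressed = lowrank_fc(fc, rank=64)            # ~4x fewer FC parameters
    x = torch.randn(2, 512)
    approx_error = (fc(x) - compressed(x)).abs().max()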