Arrow Research

Author name cluster

Xiaowei Guo

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

10 papers
1 author row

Possible papers

10

IJCAI Conference 2021 · Conference Paper

Dig into Multi-modal Cues for Video Retrieval with Hierarchical Alignment

  • Wenzhe Wang
  • Mengdan Zhang
  • Runnan Chen
  • Guanyu Cai
  • Penghao Zhou
  • Pai Peng
  • Xiaowei Guo
  • Jian Wu

Multi-modal cues presented in videos are usually beneficial for the challenging video-text retrieval task on internet-scale datasets. Recent video retrieval methods take advantage of multi-modal cues by aggregating them into holistic high-level semantics for matching with text representations in a global view. In contrast to this global alignment, the local alignment between the detailed semantics encoded in multi-modal cues and distinct phrases is still not well explored. Thus, in this paper, we leverage hierarchical video-text alignment to fully explore the detailed, diverse characteristics in multi-modal cues for fine-grained alignment with local semantics from phrases, as well as to capture a high-level semantic correspondence. Specifically, multi-step attention is learned for progressively comprehensive local alignment, and a holistic transformer is utilized to summarize multi-modal cues for global alignment. With hierarchical alignment, our model outperforms state-of-the-art methods on three public video retrieval datasets.
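
As a rough illustration of the two alignment levels described in the abstract, the sketch below (hypothetical names, not the authors' code) computes a local similarity by letting each phrase attend over the multi-modal cue features, plus a global similarity between pooled video and sentence embeddings.

    import torch
    import torch.nn.functional as F

    def hierarchical_similarity(cue_feats, phrase_feats, video_global, text_global, alpha=0.5):
        """Toy sketch of hierarchical video-text alignment (illustrative only).

        cue_feats:    (num_cues, d)    per-cue features (e.g. appearance, motion, audio)
        phrase_feats: (num_phrases, d) per-phrase text features
        video_global: (d,)             holistic video representation (e.g. transformer-pooled cues)
        text_global:  (d,)             holistic sentence representation
        """
        # Local alignment: each phrase attends over the multi-modal cues.
        attn = torch.softmax(phrase_feats @ cue_feats.t() / cue_feats.shape[1] ** 0.5, dim=-1)
        aligned = attn @ cue_feats                                    # (num_phrases, d)
        local_sim = F.cosine_similarity(aligned, phrase_feats, dim=-1).mean()

        # Global alignment: holistic video embedding vs. sentence embedding.
        global_sim = F.cosine_similarity(video_global, text_global, dim=0)
        return alpha * local_sim + (1 - alpha) * global_sim

    d = 64
    score = hierarchical_similarity(torch.randn(5, d), torch.randn(7, d),
                                    torch.randn(d), torch.randn(d))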

AAAI Conference 2021 · Conference Paper

Enhancing Unsupervised Video Representation Learning by Decoupling the Scene and the Motion

  • Jinpeng Wang
  • Yuting Gao
  • Ke Li
  • Jianguo Hu
  • Xinyang Jiang
  • Xiaowei Guo
  • Rongrong Ji
  • Xing Sun

One significant factor we expect video representation learning to capture, especially in contrast with image representation learning, is object motion. However, we found that in the current mainstream video datasets, some action categories are highly related to the scene where the action happens, making the model tend to degrade to a solution where only the scene information is encoded. For example, a trained model may predict a video as playing football simply because it sees the field, neglecting that the subject is dancing as a cheerleader on the field. This runs against our original intention for video representation learning and may bring a non-negligible scene bias to a different dataset. In order to tackle this problem, we propose to decouple the scene and the motion (DSM) with two simple operations, so that the model pays more attention to the motion information. Specifically, we construct a positive clip and a negative clip for each video. Compared to the original video, the positive clip is motion-untouched but scene-broken, while the negative clip is motion-broken but scene-untouched, obtained by Spatial Local Disturbance and Temporal Local Disturbance. Our objective is to pull the positive closer to and push the negative farther from the original clip in the latent space. In this way, the impact of the scene is weakened while the temporal sensitivity of the network is further enhanced. We conduct experiments on two tasks with various backbones and different pre-training datasets, and find that our method surpasses the SOTA methods with remarkable 8.1% and 8.8% improvements on the action recognition task on the UCF101 and HMDB51 datasets respectively, using the same backbone.
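
A minimal sketch of the pull/push objective described above, assuming generic clip embeddings and a simple margin-based formulation (the actual DSM loss and the disturbance operations are not reproduced here):

    import torch
    import torch.nn.functional as F

    def dsm_style_loss(anchor, positive, negative, margin=0.5):
        """Toy sketch of the decoupling objective (not the authors' implementation).

        anchor:   embedding of the original clip
        positive: embedding of a scene-broken but motion-preserved clip
        negative: embedding of a motion-broken but scene-preserved clip
        All tensors have shape (batch, dim); distances are cosine-based.
        """
        pos_dist = 1 - F.cosine_similarity(anchor, positive, dim=-1)
        neg_dist = 1 - F.cosine_similarity(anchor, negative, dim=-1)
        # Pull the positive closer and push the negative farther, up to a margin.
        return F.relu(pos_dist - neg_dist + margin).mean()

    loss = dsm_style_loss(torch.randn(8, 128), torch.randn(8, 128), torch.randn(8, 128))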

AAAI Conference 2021 · Conference Paper

One for More: Selecting Generalizable Samples for Generalizable ReID Model

  • Enwei Zhang
  • Xinyang Jiang
  • Hao Cheng
  • Ancong Wu
  • Fufu Yu
  • Ke Li
  • Xiaowei Guo
  • Feng Zheng

Current training objectives of existing person Re-IDentification (ReID) models only ensure that the loss of the model decreases on the selected training batch, with no regard to the performance on samples outside the batch. This inevitably causes the model to over-fit the data in the dominant position (e.g., head data in imbalanced classes, easy samples, or noisy samples). The latest resampling methods address the issue by designing specific criteria to select specific samples that make the model generalize better on a certain type of data (e.g., hard samples, tail data), which is not adaptive to the inconsistent distributions of real-world ReID data. Therefore, instead of simply presuming which samples are generalizable, this paper proposes a one-for-more training objective that directly takes the generalization ability of selected samples as a loss function and learns a sampler to automatically select generalizable samples. More importantly, the proposed one-for-more sampler can be seamlessly integrated into the ReID training framework, which makes it possible to simultaneously train the ReID model and the sampler in an end-to-end fashion. The experimental results show that our method can effectively improve ReID model training and boost the performance of ReID models.
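
The idea of taking "the generalization ability of selected samples" as the training signal can be pictured with a crude approximation: score a candidate batch by how much a simulated update on it reduces the loss on samples outside the batch. The sketch below uses hypothetical names and is only an illustration, not the learned sampler from the paper.

    import copy
    import torch
    import torch.nn.functional as F

    def generalization_gain(model, batch, probe, lr=0.1):
        """Illustrative score: loss reduction on a held-out probe set after one
        simulated SGD step on the candidate batch (not the paper's sampler)."""
        xb, yb = batch
        xp, yp = probe
        with torch.no_grad():
            before = F.cross_entropy(model(xp), yp).item()

        trial = copy.deepcopy(model)               # simulate one SGD step on the batch
        F.cross_entropy(trial(xb), yb).backward()
        with torch.no_grad():
            for p in trial.parameters():
                p -= lr * p.grad
            after = F.cross_entropy(trial(xp), yp).item()
        return before - after                      # > 0: the batch generalizes to the probe set

    model = torch.nn.Linear(16, 4)
    batch = (torch.randn(8, 16), torch.randint(0, 4, (8,)))
    probe = (torch.randn(32, 16), torch.randint(0, 4, (32,)))
    print(generalization_gain(model, batch, probe))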

AAAI Conference 2020 · Conference Paper

Asymmetric Co-Teaching for Unsupervised Cross-Domain Person Re-Identification

  • Fengxiang Yang
  • Ke Li
  • Zhun Zhong
  • Zhiming Luo
  • Xing Sun
  • Hao Cheng
  • Xiaowei Guo
  • Feiyue Huang

Person re-identification (re-ID) is a challenging task due to the high variance within identity samples and imaging conditions. Although recent advances in deep learning have achieved remarkable accuracy in settled scenes, i.e., the source domain, few works generalize well on an unseen target domain. One popular solution is assigning unlabeled target images pseudo labels by clustering, and then retraining the model. However, clustering methods tend to introduce noisy labels and discard low-confidence samples as outliers, which may hinder the retraining process and thus limit the generalization ability. In this study, we argue that by explicitly adding a sample filtering procedure after the clustering, the mined examples can be used much more efficiently. To this end, we design an asymmetric co-teaching framework, which resists noisy labels by having two models cooperate to select data with possibly clean labels for each other. Meanwhile, one of the models receives samples as pure as possible, while the other takes in samples as diverse as possible. This procedure encourages the selected training samples to be both clean and diverse, and the two models to promote each other iteratively. Extensive experiments show that the proposed framework can consistently benefit most clustering-based methods, and boosts the state-of-the-art adaptation accuracy. Our code is available at https://github.com/FlyingRoastDuck/ACT_AAAI20.
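
A toy sketch of the co-teaching exchange at the heart of such a framework, in which each network hands the samples it finds low-loss (likely clean pseudo-labels) to its peer for the update step; the asymmetric pure-versus-diverse selection described in the abstract is not modeled here.

    import torch
    import torch.nn.functional as F

    def coteach_select(model_a, model_b, x, y, keep_ratio=0.8):
        """Illustrative small-loss sample exchange (not the full asymmetric framework)."""
        with torch.no_grad():
            loss_a = F.cross_entropy(model_a(x), y, reduction="none")
            loss_b = F.cross_entropy(model_b(x), y, reduction="none")
        k = max(1, int(keep_ratio * len(y)))
        idx_for_b = loss_a.topk(k, largest=False).indices   # A's cleanest samples train B
        idx_for_a = loss_b.topk(k, largest=False).indices   # B's cleanest samples train A
        return idx_for_a, idx_for_b

    net_a, net_b = torch.nn.Linear(32, 10), torch.nn.Linear(32, 10)
    x, y = torch.randn(16, 32), torch.randint(0, 10, (16,))
    idx_a, idx_b = coteach_select(net_a, net_b, x, y)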

NeurIPS Conference 2020 · Conference Paper

Pruning Filter in Filter

  • Fanxu Meng
  • Hao Cheng
  • Ke Li
  • Huixiang Luo
  • Xiaowei Guo
  • Guangming Lu
  • Xing Sun

Pruning has become a very powerful and effective technique to compress and accelerate modern neural networks. Existing pruning methods can be grouped into two categories: filter pruning (FP) and weight pruning (WP). FP wins at hardware compatibility but loses at the compression ratio compared with WP. To combine the strengths of both methods, we propose to prune the filter in the filter. Specifically, we treat a filter F, whose size is C × K × K, as K × K stripes, i.e., 1 × 1 filters; then, by pruning the stripes instead of the whole filter, we can achieve finer granularity than traditional FP while remaining hardware friendly. We term our method SWP (Stripe-Wise Pruning). SWP is implemented by introducing a novel learnable matrix called the Filter Skeleton, whose values reflect the optimal shape of each filter. As some recent work has shown that the pruned architecture is more crucial than the inherited important weights, we argue that the architecture of a single filter, i.e., the Filter Skeleton, also matters. Through extensive experiments, we demonstrate that SWP is more effective than previous FP-based methods and achieves the state-of-the-art pruning ratio on the CIFAR-10 and ImageNet datasets without an obvious accuracy drop.
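
A minimal sketch of the Filter Skeleton idea as read from the abstract (hypothetical class and parameter names, not the official SWP implementation): a learnable K × K matrix per filter scales its stripes, and stripes whose entries shrink toward zero under a sparsity penalty can be pruned away.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SkeletonConv(nn.Module):
        """Toy sketch of stripe-wise pruning with a learnable Filter Skeleton."""

        def __init__(self, in_ch, out_ch, k=3):
            super().__init__()
            self.conv = nn.Conv2d(in_ch, out_ch, k, padding=k // 2, bias=False)
            # One learnable scale per spatial position of each filter: the K x K "stripes".
            self.skeleton = nn.Parameter(torch.ones(out_ch, 1, k, k))

        def forward(self, x):
            # Every C x 1 x 1 stripe is scaled by its skeleton entry; stripes whose
            # entries are driven toward 0 (e.g. by an L1 penalty) can be pruned away.
            return F.conv2d(x, self.conv.weight * self.skeleton, padding=self.conv.padding)

        def skeleton_l1(self):
            return self.skeleton.abs().sum()

    layer = SkeletonConv(16, 32)
    out = layer(torch.randn(2, 16, 8, 8))
    loss = out.mean() + 1e-4 * layer.skeleton_l1()   # sparsity pressure on the skeleton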

AAAI Conference 2020 · Conference Paper

Rethinking Temporal Fusion for Video-Based Person Re-Identification on Semantic and Time Aspect

  • Xinyang Jiang
  • Yifei Gong
  • Xiaowei Guo
  • Qize Yang
  • Feiyue Huang
  • Wei-Shi Zheng
  • Feng Zheng
  • Xing Sun

Recently, the research interest of person re-identification (ReID) has gradually turned to video-based methods, which acquire a person representation by aggregating frame features of an entire video. However, existing video-based ReID methods do not consider the semantic difference brought by the outputs of different network stages, which potentially compromises the information richness of the person features. Furthermore, traditional methods ignore important relationships among frames, which causes information redundancy in fusion along the time axis. To address these issues, we propose a novel general temporal fusion framework that aggregates frame features on both the semantic aspect and the time aspect. For the semantic aspect, a multi-stage fusion network is explored to fuse richer frame features at multiple semantic levels, which can effectively reduce the information loss caused by traditional single-stage fusion. For the time aspect, the existing intra-frame attention method is improved by adding a novel inter-frame attention module, which effectively reduces the information redundancy in temporal fusion by taking the relationships among frames into consideration. The experimental results show that our approach can effectively improve video-based re-identification accuracy, achieving state-of-the-art performance.
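
One way to picture the inter-frame attention idea is a fusion weight that depends on how redundant a frame is with the rest of the tracklet. The sketch below is an illustrative simplification with hypothetical names, not the paper's module.

    import torch
    import torch.nn.functional as F

    def interframe_fusion(frame_feats):
        """Illustrative inter-frame attention pooling: frames that are redundant
        with many other frames contribute less to the fused person representation.

        frame_feats: (num_frames, dim) per-frame features from one tracklet.
        """
        sim = F.cosine_similarity(frame_feats.unsqueeze(1), frame_feats.unsqueeze(0), dim=-1)
        redundancy = sim.mean(dim=1)                  # high value = similar to many frames
        weights = torch.softmax(-redundancy, dim=0)   # down-weight redundant frames
        return (weights.unsqueeze(1) * frame_feats).sum(dim=0)

    fused = interframe_fusion(torch.randn(12, 256))   # one 256-d tracklet descriptor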

AAAI Conference 2020 · Conference Paper

Viewpoint-Aware Loss with Angular Regularization for Person Re-Identification

  • Zhihui Zhu
  • Xinyang Jiang
  • Feng Zheng
  • Xiaowei Guo
  • Feiyue Huang
  • Xing Sun
  • Weishi Zheng

Although great progress in supervised person re-identification (Re-ID) has been made recently, Re-ID remains a massive visual challenge due to the viewpoint variation of a person. Most existing viewpoint-based person Re-ID methods project images from each viewpoint into separate and unrelated sub-feature spaces. They only model the identity-level distribution inside an individual viewpoint but ignore the underlying relationship between different viewpoints. To address this problem, we propose a novel approach, called Viewpoint-Aware Loss with Angular Regularization (VA-reID). Instead of one subspace for each viewpoint, our method projects the features from different viewpoints into a unified hypersphere and effectively models the feature distribution on both the identity level and the viewpoint level. In addition, rather than modeling different viewpoints as hard labels used for conventional viewpoint classification, we introduce viewpoint-aware adaptive label smoothing regularization (VALSR), which assigns an adaptive soft label to the feature representation. VALSR can effectively solve the ambiguity of viewpoint cluster label assignment. Extensive experiments on the Market1501 and DukeMTMC-reID datasets demonstrate that our method outperforms state-of-the-art supervised Re-ID methods.
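
A simplified sketch of viewpoint-aware label smoothing, assuming (identity, viewpoint) pairs as classes and a fixed smoothing weight in place of the adaptive soft labels used by VALSR; names and dimensions are illustrative only.

    import torch
    import torch.nn.functional as F

    def viewpoint_soft_labels(ids, views, num_ids, num_views, eps=0.2):
        """Each sample keeps 1 - eps on its own (identity, viewpoint) class and
        spreads eps over the same identity's other viewpoints (illustrative)."""
        target = torch.zeros(len(ids), num_ids * num_views)
        for i, (pid, v) in enumerate(zip(ids, views)):
            target[i, pid * num_views + v] = 1.0 - eps
            others = [pid * num_views + u for u in range(num_views) if u != v]
            target[i, others] = eps / len(others)
        return target

    def soft_cross_entropy(logits, soft_target):
        return -(soft_target * F.log_softmax(logits, dim=-1)).sum(dim=-1).mean()

    t = viewpoint_soft_labels(ids=[0, 3], views=[1, 2], num_ids=10, num_views=3)
    loss = soft_cross_entropy(torch.randn(2, 30), t)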

IJCAI Conference 2019 · Conference Paper

A Part Power Set Model for Scale-Free Person Retrieval

  • Yunhang Shen
  • Rongrong Ji
  • Xiaopeng Hong
  • Feng Zheng
  • Xiaowei Guo
  • Yongjian Wu
  • Feiyue Huang

Recently, person re-identification (re-ID) has attracted increasing research attention, with broad application prospects in video surveillance and beyond. Most existing methods rely heavily on well-aligned pedestrian images and a hand-engineered part-based model on the coarsest feature map. In this paper, to lighten the restriction of such fixed and coarse input alignment, an end-to-end part power set model with multi-scale features is proposed, which captures the discriminative parts of pedestrians from global to local, and from coarse to fine, enabling part-based, scale-free person re-ID. In particular, we first factorize the visual appearance by enumerating $k$-combinations, for all $k$, of the $n$ body parts to exploit rich global and partial information and learn discriminative feature maps. Then, a combination ranking module is introduced to guide the model training with all combinations of body parts, which alternates between ranking combinations and estimating an appearance model. To enable scale-free input, we further exploit the pyramid architecture of deep networks to construct multi-scale feature maps with a feasible amount of extra cost in terms of memory and time. Extensive experiments on the mainstream evaluation datasets, including Market-1501, DukeMTMC-reID and CUHK03, validate that our method achieves state-of-the-art performance.
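
The power-set enumeration itself is easy to sketch: every non-empty combination of the n part features is pooled into one descriptor, from single fine parts up to the whole body. The snippet below is illustrative only; the paper's combination ranking module and multi-scale pyramid are not included.

    from itertools import combinations
    import torch

    def part_power_set(part_feats):
        """Enumerate every non-empty combination of the n body-part features and
        average-pool each combination into one descriptor (illustrative sketch).

        part_feats: (n_parts, dim) features for n horizontal body parts.
        """
        n = part_feats.shape[0]
        descriptors = []
        for k in range(1, n + 1):
            for combo in combinations(range(n), k):
                descriptors.append(part_feats[list(combo)].mean(dim=0))
        return torch.stack(descriptors)           # (2**n - 1, dim)

    desc = part_power_set(torch.randn(4, 128))    # 15 combination descriptors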

IJCAI Conference 2018 · Conference Paper

Adversarial Attribute-Image Person Re-identification

  • Zhou Yin
  • Wei-Shi Zheng
  • Ancong Wu
  • Hong-Xing Yu
  • Hai Wan
  • Xiaowei Guo
  • Feiyue Huang
  • Jianhuang Lai

While attributes have been widely used for person re-identification (Re-ID), which aims at matching the same person's images across disjoint camera views, they are used either as extra features or for performing multi-task learning to assist the image-image matching task. However, how to find a set of person images according to a given attribute description, which is very practical in many surveillance applications, remains a rarely investigated cross-modality matching problem in person Re-ID. In this work, we present this challenge and leverage adversarial learning to formulate the attribute-image cross-modality person Re-ID model. By imposing a semantic consistency constraint across modalities as a regularization, the adversarial learning enables the generation of image-analogous concepts from query attributes for matching the corresponding images at both the global level and the semantic ID level. We conducted extensive experiments on three attribute datasets and demonstrated that the regularized adversarial modelling is so far the most effective method for the attribute-image cross-modality person Re-ID problem.
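
A toy sketch of the cross-modality setup, with hypothetical dimensions and a plain MSE standing in for the paper's semantic consistency constraint: a generator maps an attribute query to an image-analogous embedding, and a discriminator is fooled into treating it as a real image embedding, so that attribute queries and person images can be matched in one shared space.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    attr_dim, emb_dim = 30, 256
    generator = nn.Sequential(nn.Linear(attr_dim, 128), nn.ReLU(), nn.Linear(128, emb_dim))
    discriminator = nn.Sequential(nn.Linear(emb_dim, 64), nn.ReLU(), nn.Linear(64, 1))

    attrs = torch.rand(8, attr_dim)          # attribute queries
    img_emb = torch.randn(8, emb_dim)        # embeddings of the matching person images

    fake_emb = generator(attrs)              # image-analogous concepts of the attributes
    adv_loss = F.binary_cross_entropy_with_logits(
        discriminator(fake_emb), torch.ones(8, 1))     # fool the discriminator
    consistency = F.mse_loss(fake_emb, img_emb)        # stand-in for semantic consistency
    g_loss = adv_loss + consistency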

IJCAI Conference 2016 · Conference Paper

Towards Convolutional Neural Networks Compression via Global Error Reconstruction

  • Shaohui Lin
  • Rongrong Ji
  • Xiaowei Guo
  • Xuelong Li

In recent years, convolutional neural networks (CNNs) have achieved remarkable success in various applications such as image classification, object detection, object parsing and face alignment. Such CNN models are extremely powerful at dealing with massive amounts of training data by using millions or even billions of parameters. However, these models come at a heavy cost in model storage, which prohibits their usage in resource-limited applications like mobile or embedded devices. In this paper, we aim to compress CNN models to an extreme without significantly losing their discriminability. Our main idea is to explicitly model the output reconstruction error between the original and compressed CNNs, which is minimized to pursue a satisfactory rate-distortion trade-off after compression. In particular, a global error reconstruction method termed GER is presented, which first leverages an SVD-based low-rank approximation to coarsely compress the parameters in the fully connected layers in a layer-wise manner. Subsequently, these layer-wise initial compressions are jointly optimized from a global perspective via back-propagation. The proposed GER method is evaluated on the ILSVRC2012 image classification benchmark, with implementations on two widely adopted convolutional neural networks, i.e., AlexNet and VGGNet-19. Compared to several state-of-the-art and alternative methods of CNN compression, the proposed scheme demonstrates the best rate-distortion performance on both networks.
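
The layer-wise initialization step lends itself to a short sketch: approximate a fully connected layer's weight matrix with a rank-r SVD factorization, realized as two smaller linear layers. The global joint fine-tuning stage is not shown, and the names below are illustrative rather than the paper's code.

    import torch
    import torch.nn as nn

    def lowrank_fc(fc, rank):
        """Replace a Linear layer with a rank-`rank` SVD factorization (two smaller
        Linear layers), as a layer-wise initialization for later joint fine-tuning."""
        W = fc.weight.data                          # (out_features, in_features)
        U, S, Vh = torch.linalg.svd(W, full_matrices=False)
        A = nn.Linear(fc.in_features, rank, bias=False)
        B = nn.Linear(rank, fc.out_features, bias=True)
        A.weight.data = torch.diag(S[:rank].sqrt()) @ Vh[:rank]    # (rank, in)
        B.weight.data = U[:, :rank] @ torch.diag(S[:rank].sqrt())  # (out, rank)
        B.bias.data = fc.bias.data.clone()
        return nn.Sequential(A, B)                  # W x + b  ~=  B(A(x))

    fc = nn.Linear(512, 512)
    compressed = lowrank_fc(fc, rank=64)            # ~4x fewer FC parameters
    x = torch.randn(2, 512)
    approx_error = (fc(x) - compressed(x)).abs().max()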