Arrow Research search

Author name cluster

Yuchen Guo

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

34 papers
2 author rows

Possible papers (34)

AAAI Conference 2026 Conference Paper

2D-CrossScan Mamba: Enhancing State Space Models with Spatially Consistent Multi-Path 2D Information Propagation

  • Longlong Yu
  • Wenxi Li
  • Yaoqi Sun
  • Hang Xu
  • Chenggang Yan
  • Yuchen Guo

Despite recent progress in adapting State Space Models such as Mamba to vision tasks, their intrinsic 1D scanning mechanism imposes limitations when applied to inherently 2D-structured data like images. Existing adaptations, including VMamba and 2DMamba, either suffer from inconsistency between scanning order and spatial locality or restrict inter-patch communication to singular paths, hindering effective information propagation. In this paper, we propose 2D-CrossScan, a novel 2D-compatible scan framework that enables spatially consistent, multi-path hidden state propagation by integrating modified state equations over two-dimensional neighborhoods. Furthermore, we mitigate redundant information accumulation due to overlapping paths via cross-directional subtraction. To fully align with the 2D spatial structure, we introduce a multi-directional scanning strategy that starts simultaneously from all four corners of the image, enabling diverse propagation paths and better feature integration. Our approach maintains efficiency, requiring only minimal architectural changes to existing Mamba variants. Experimental results demonstrate substantial improvements in multiple visual tasks, including object detection and semantic segmentation on PANDA and COCO datasets. Compared to baseline SSM-based methods, 2D-CrossScan consistently yields better spatial representations, as confirmed by extensive effective receptive field visualizations and attention analyses. These results highlight the importance of geometry-aware state propagation and validate 2D-CrossScan as a simple yet powerful extension to SSMs for vision.

AAAI Conference 2026 Conference Paper

GigaMoE: Sparsity-Guided Mixture of Experts for Efficient Gigapixel Object Detection

  • Xiang Li
  • Wenxi Li
  • Yuetong Wang
  • Chenyang Lyu
  • Haozhe Lin
  • Guiguang Ding
  • Yuchen Guo

Object detection in High-Resolution Wide (HRW) shots, or gigapixel images, presents unique challenges due to extreme object sparsity and vast scale variations. State-of-the-art methods like SparseFormer have pioneered sparse processing by selectively focusing on important regions, yet they apply a uniform computational model to all selected regions, overlooking their intrinsic complexity differences. This leads to a suboptimal trade-off between performance and efficiency. In this paper, we introduce GigaMoE, a novel backbone architecture that pioneers adaptive computation for this domain by replacing the standard Feed-Forward Networks (FFNs) with a Mixture-of-Experts (MoE) module. Our architecture first employs a shared expert to provide a robust feature baseline for all selected regions. Upon this foundation, our core innovation, a novel Sparsity-Guided Routing mechanism, repurposes importance scores from the sparse backbone to provide a "computational bonus," dynamically engaging a variable number of specialized experts based on content complexity. The entire system is trained efficiently via a loss-free load-balancing technique, eliminating the need for cumbersome auxiliary losses. Extensive experiments show that GigaMoE sets a new state of the art on the PANDA benchmark, improving detection accuracy by 1.1% over SparseFormer while simultaneously reducing the computational cost (FLOPs) by a remarkable 32.3%.
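The sparsity-guided routing idea in the abstract can be sketched in a few lines: a hypothetical `route_experts` helper maps a region's backbone importance score to a variable expert count (the "computational bonus") and picks that many top-scoring experts. The function name, the linear score-to-count mapping, and the parameter choices are illustrative assumptions, not the paper's implementation.

```python
def route_experts(importance, expert_scores, k_min=1, k_max=4):
    """Sparsity-guided routing sketch: regions with higher backbone
    importance receive a larger 'computational bonus', i.e. more experts.
    `importance` is assumed to lie in [0, 1]; `expert_scores` are the
    router's per-expert logits for this region."""
    # Map importance to a variable expert count (the "bonus").
    k = k_min + round(importance * (k_max - k_min))
    # Engage the k highest-scoring specialized experts.
    ranked = sorted(range(len(expert_scores)),
                    key=lambda i: expert_scores[i], reverse=True)
    return sorted(ranked[:k])

# A simple background patch gets one expert; a dense crowd region gets four.
print(route_experts(0.0, [0.2, 0.9, 0.1, 0.5]))  # one expert engaged
print(route_experts(1.0, [0.2, 0.9, 0.1, 0.5]))  # all four engaged
```

The shared expert described in the abstract would run unconditionally alongside whichever specialized experts this router selects.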

JBHI Journal 2025 Journal Article

MAT: Mixing Attention Transfer from Multiple Transformers for Medical Tasks

  • Zi-Hao Bo
  • Yuchen Guo
  • Xiangru Chen
  • Jing Xie
  • Lishan Ye
  • Feng Xu

Transformers have been widely used for image analysis tasks, but in medicine they suffer from limited data availability. To overcome this challenge, we propose a novel approach, specially designed for transformers, that transfers knowledge from multiple sources to target medical tasks with limited data, named Mixing Attention Transfer (MAT). MAT aims to harness and merge knowledge from multiple source transformers at the token and layer levels to improve the performance of target medical tasks. The core component of MAT is the Mixing Attention layer, which encompasses: 1) token-level Routing and Fusion modules that allocate input images to adequate source modules; 2) a sequence-level Aligned-Attention module that adaptively aligns outputs produced by different source modules. To the best of our knowledge, this is the first multi-source transfer learning approach specifically designed for transformers. Through extensive evaluations, we demonstrate the effectiveness of MAT in three medical scenarios: noisy-labeled, class-imbalanced, and fine-grained tasks.

AAAI Conference 2024 Conference Paper

Debiased Novel Category Discovering and Localization

  • Juexiao Feng
  • Yuhong Yang
  • Yanchun Xie
  • Yaqian Li
  • Yandong Guo
  • Yuchen Guo
  • Yuwei He
  • Liuyu Xiang

In recent years, object detection in deep learning has experienced rapid development. However, most existing object detection models perform well only on closed-set datasets, ignoring a large number of potential objects whose categories are not defined in the training set. These objects are often identified as background or incorrectly classified as pre-defined categories by the detectors. In this paper, we focus on the challenging problem of Novel Class Discovery and Localization (NCDL), aiming to train detectors that can detect the categories present in the training data while also actively discovering, localizing, and clustering new categories. We analyze existing NCDL methods and identify the core issue: object detectors tend to be biased towards seen objects, which leads to the neglect of unseen targets. To address this issue, we first propose a Debiased Region Mining (DRM) approach that combines a class-agnostic Region Proposal Network (RPN) and a class-aware RPN in a complementary manner. Additionally, we propose improving the representation network through semi-supervised contrastive learning by leveraging unlabeled data. Finally, we adopt a simple and efficient mini-batch K-means clustering method for novel class discovery. We conduct extensive experiments on the NCDL benchmark, and the results demonstrate that the proposed DRM approach significantly outperforms previous methods, establishing a new state of the art.
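The final clustering step named in the abstract is standard mini-batch K-means. A minimal sketch in the spirit of Sculley's mini-batch variant, operating on plain tuples of region features; the batch size, iteration count, and per-center learning rate schedule are illustrative choices, not the paper's settings:

```python
import random

def minibatch_kmeans(points, k, batch_size=8, iters=50, seed=0):
    """Mini-batch K-means sketch: sample a small batch per iteration,
    assign each sample to its nearest center, and nudge that center
    toward the sample with a per-center decaying learning rate."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    counts = [0] * k
    for _ in range(iters):
        batch = rng.sample(points, min(batch_size, len(points)))
        for p in batch:
            # Nearest center by squared Euclidean distance.
            j = min(range(k), key=lambda c: sum((a - b) ** 2
                                                for a, b in zip(p, centers[c])))
            counts[j] += 1
            eta = 1.0 / counts[j]  # per-center decaying learning rate
            centers[j] = [(1 - eta) * a + eta * b
                          for a, b in zip(centers[j], p)]
    return centers

# Two tight blobs of unlabeled region features -> two discovered clusters.
pts = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 4.9)]
centers = minibatch_kmeans(pts, 2)
```

In the NCDL setting, `points` would be the learned embeddings of proposals not matched to any seen class, and each discovered center would stand for a candidate novel category.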

ECAI Conference 2024 Conference Paper

Detecting Objects as Cascade Corners

  • Chenglong Liu
  • Jintao Liu
  • Haoran Wei
  • Jinze Yang
  • Liangyu Xu
  • Yuchen Guo
  • Lu Fang 0001

The corner-based detection paradigm has the potential to produce high-quality boxes, but its development is constrained by three factors: 1) Hard-to-match corners: heuristic corner matching algorithms can produce incorrect boxes, especially when similar-looking objects co-occur. 2) Poor instance context: two separate corners preserve little instance semantics, so it is difficult to guarantee that both class-specific corners land on the same heatmap channel. 3) An unfriendly backbone: the training cost of the hourglass network is high. Accordingly, we build a novel corner-based framework named Corner2Net. To achieve a corner-matching-free manner, we devise a cascade corner pipeline which progressively predicts the associated corner pair in two steps instead of synchronously searching two independent corners via parallel heads. Corner2Net decouples corner localization from object classification. Both corners are class-agnostic, and the instance-specific bottom-right corner further simplifies its search space. Meanwhile, RoI features with rich semantics are extracted for classification. Popular backbones (e.g., ResNeXt) can be easily connected to Corner2Net. Experimental results on COCO show Corner2Net surpasses all existing corner-based detectors by a large margin in accuracy and speed.

AAAI Conference 2024 Conference Paper

GigaHumanDet: Exploring Full-Body Detection on Gigapixel-Level Images

  • Chenglong Liu
  • Haoran Wei
  • Jinze Yang
  • Jintao Liu
  • Wenxi Li
  • Yuchen Guo
  • Lu Fang

Performing person detection in super-high-resolution images is a challenging task. For such a task, modern detectors, which usually encode a box with a center and width/height, struggle with accuracy due to two factors: 1) Human characteristics: people come in various postures, and the center, with its high degree of freedom, makes it difficult to capture robust visual patterns; 2) Image characteristics: due to the vast scale diversity of the input (gigapixel-level), distance regression (for width and height) is hard to pinpoint, especially for a person of substantial scale who is near the camera. To address these challenges, we propose GigaHumanDet, an innovative solution aimed at further enhancing detection accuracy on gigapixel-level images. GigaHumanDet employs corner modeling to avoid the potential issues of a high degree of freedom in center pinpointing. To better distinguish similar-looking persons and enforce instance consistency of corner pairs, an instance-guided learning approach is designed to capture discriminative individual semantics. Further, we devise a reliable shape-aware bodyness measure equipped with a multi-precision strategy as the human corner matching guidance, appropriately adapted to single-view large scenes. Experimental results on the PANDA and STCrowd datasets show the superiority and strong applicability of our design. Notably, our model achieves 82.4% in terms of AP, outperforming the current state of the art by more than 10%.

AAMAS Conference 2024 Conference Paper

JDRec: Practical Actor-Critic Framework for Online Combinatorial Recommender System

  • Xin Zhao
  • Jiaxin Li
  • Zhiwei Fang
  • Yuchen Guo
  • Jinyuan Zhao
  • Jie He
  • Wenlong Chen
  • Changping Peng

In the realm of online recommendation systems, the Combinatorial Recommender (CR) system stands out for its unique approach. It presents users with a list of items on a result page, where user behavior is simultaneously influenced by contextual information and the items listed. Formulated as a combinatorial optimization problem, the objective of the CR system is to maximize the recommendation reward across the entire list of items. Despite the significant potential of CR systems, developing a practical and efficient model remains a substantial challenge. These challenges stem from the dynamic nature of online environments and the pressing need for personalized recommendations. To tackle them, we decompose the overarching problem into two sub-problems: list generation and list evaluation. We propose novel and pragmatic model architectures for each sub-problem, aiming to concurrently enhance both effectiveness and efficiency. To further adapt the CR system to online scenarios, we integrate a bootstrap algorithm into an actor-critic reinforcement framework. This approach, called the JD Recommender System (JDRec), is designed to continuously refine the recommendation model through sustained user interaction, ensuring the system's adaptability and relevance. The proposed JDRec framework, tested through rigorous offline and online experiments, has shown promising results. It has been successfully deployed in JD's online recommendation systems, yielding a notable improvement in click-through rate of 2.6% and augmenting the total value of the platform by 5.03%. We also release the large-scale dataset used in our work to facilitate further research.

ECAI Conference 2024 Conference Paper

SaccadeMOT: Enhancing Object Detection and Tracking in Gigapixel Images via Scale-Aware Density Estimation

  • Wenxi Li
  • Ruxin Zhang
  • Haozhe Lin
  • Yuchen Guo
  • Chao Ma 0004
  • Xiaokang Yang 0001

The proliferation of gigapixel imaging has ushered in unprecedented challenges in object detection and tracking due to the intense computational demands. Previous deep learning approaches, often tailored for megapixel images, fall short in addressing the unique complexities presented at the gigapixel level. To bridge this gap, we introduce SaccadeMOT, a novel architecture designed for efficient gigapixel-level multi-object tracking. Based on our observations of density map regression in crowd counting and small object detection in object detection tasks, we propose a novel gigapixel detection paradigm that combines the strengths of both approaches. Firstly, the “saccade” stage swiftly identifies regions likely containing objects, followed by the “gaze” stage that refines the detection within these areas. This strategic region selection is complemented by a robust tracking mechanism that combines head and body tracking, enhancing accuracy in environments with potential occlusions. Validated on the PANDA dataset, SaccadeMOT not only demonstrates a 13× speed improvement over the existing state-of-the-art tracker BotSORT but also exhibits promising applications in gigapixel-level pathology analysis, particularly in Whole Slide Imaging (WSI). This approach sets a new benchmark for handling super-high-resolution images, offering significant advancements in both the speed and precision of object tracking technologies.

ICLR Conference 2023 Conference Paper

Consolidator: Mergable Adapter with Group Connections for Visual Adaptation

  • Tianxiang Hao 0001
  • Hui Chen 0013
  • Yuchen Guo
  • Guiguang Ding

Recently, transformers have shown strong ability as visual feature extractors, surpassing traditional convolution-based models in various scenarios. However, the success of vision transformers owes largely to their capacity to accommodate numerous parameters. As a result, new challenges arise for adapting a well-trained transformer to downstream tasks. On the one hand, classic fine-tuning tunes all parameters in a huge model for every downstream task and thus easily overfits, leading to inferior performance. On the other hand, on resource-limited devices, fine-tuning stores a full copy of all parameters and is thus usually impracticable due to the shortage of storage space. However, few works have focused on how to efficiently and effectively transfer knowledge in a vision transformer. Existing methods did not dive into the properties of visual features, leading to inferior performance. Moreover, some of them incur heavy inference cost despite saving storage. To tackle these problems, we propose the consolidator to achieve efficient transfer learning for large vision models. Our consolidator modifies the pre-trained model with the addition of a small set of tunable parameters to temporarily store the task-specific knowledge while freezing the backbone model during adaptation. Motivated by the success of group-wise convolution, we adopt grouped connections across the features extracted by fully connected layers to construct the tunable parts in a consolidator. To further enhance the model's capacity to transfer knowledge under a constrained storage budget and keep inference efficient, we consolidate the parameters in two stages: 1) between adaptation and storage, and 2) between loading and inference. On a series of downstream visual tasks, our consolidator can reach up to 7.56 better accuracy than full fine-tuning with merely 0.35% of the parameters, and outperforms state-of-the-art parameter-efficient tuning methods by a clear margin.
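The consolidation step, the part that keeps inference cost unchanged, amounts to folding the trained adapter blocks back into the frozen weight matrix. A toy sketch: the `(row, col, block)` encoding of the grouped connections is an illustrative assumption, and the paper's two-stage consolidation (adaptation-to-storage, loading-to-inference) is collapsed into a single merge here.

```python
def consolidate(W, groups):
    """Fold trained group-connected adapter blocks back into the frozen
    weight, so the adapted model keeps the original shape and inference
    cost. `W` is a dense weight matrix (list of rows); `groups` is a list
    of (row_offset, col_offset, block) triples placing each tunable block."""
    merged = [row[:] for row in W]          # leave the frozen weight intact
    for r0, c0, block in groups:
        for i, row in enumerate(block):
            for j, v in enumerate(row):
                merged[r0 + i][c0 + j] += v  # additive merge of the adapter
    return merged

# Two 1x1 adapter blocks sitting on the diagonal of a 2x2 weight.
W = [[1.0, 0.0], [0.0, 1.0]]
merged = consolidate(W, [(0, 0, [[0.5]]), (1, 1, [[0.25]])])
```

Only the small blocks need to be stored per task; the dense backbone weight is shared, which is where the storage saving comes from.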
Code is available at github.

AAAI Conference 2022 Conference Paper

ReMoNet: Recurrent Multi-Output Network for Efficient Video Denoising

  • Liuyu Xiang
  • Jundong Zhou
  • Jirui Liu
  • Zerun Wang
  • Haidong Huang
  • Jie Hu
  • Jungong Han
  • Yuchen Guo

While deep neural network-based video denoising methods have achieved promising results, it is still hard to deploy them on mobile devices due to their high computational cost and memory demands. This paper aims to develop a lightweight deep video denoising method that is friendly to resource-constrained mobile devices. Inspired by the facts that 1) consecutive video frames usually contain redundant temporal coherency, and 2) neural networks are usually over-parameterized, we propose a multi-input multi-output (MIMO) paradigm to process consecutive video frames within one forward pass. The basic idea is concretized into a novel architecture termed Recurrent Multi-output Network (ReMoNet), which consists of recurrent temporal fusion and temporal aggregation blocks and is further reinforced by similarity-based mutual distillation. We conduct extensive experiments on an NVIDIA GPU and the Qualcomm Snapdragon 888 mobile platform with Gaussian noise and simulated Image Signal Processor (ISP) noise. The experimental results show that ReMoNet is both effective and efficient at video denoising. Moreover, we show that ReMoNet is more robust under higher noise levels.

AAAI Conference 2022 Conference Paper

SECRET: Self-Consistent Pseudo Label Refinement for Unsupervised Domain Adaptive Person Re-identification

  • Tao He
  • Leqi Shen
  • Yuchen Guo
  • Guiguang Ding
  • Zhenhua Guo

Unsupervised domain adaptive person re-identification aims at learning on an unlabeled target domain with only labeled data in the source domain. Currently, state-of-the-art methods usually solve this problem by pseudo-label-based clustering and fine-tuning in the target domain. However, the reason behind the noise in pseudo labels is not sufficiently explored, especially for the popular multi-branch models. We argue that the consistency between different feature spaces is the key to the pseudo labels' quality. We therefore propose a SElf-Consistent pseudo label RefinEmenT method, termed SECRET, to improve consistency by mutually refining the pseudo labels generated from different feature spaces. The proposed SECRET gradually encourages the improvement of pseudo labels' quality during the training process, which further leads to better cross-domain Re-ID performance. Extensive experiments on benchmark datasets show the superiority of our method. Specifically, our method outperforms the state of the art by 6.3% in terms of mAP on the challenging MSMT17 dataset. In the purely unsupervised setting, our method also surpasses existing works by a large margin. Code is available at https://github.com/LunarShen/SECRET.

AAAI Conference 2020 Conference Paper

Heterogeneous Transfer Learning with Weighted Instance-Correspondence Data

  • Yuwei He
  • Xiaoming Jin
  • Guiguang Ding
  • Yuchen Guo
  • Jungong Han
  • Jiyong Zhang
  • Sicheng Zhao

Instance-correspondence (IC) data are potent resources for heterogeneous transfer learning (HeTL) due to their capability of bridging the source and target domains at the instance level. To this end, people tend to use machine-generated IC data, because manually establishing IC data is expensive. However, existing IC data generators are not perfect and often produce data that are not of high quality, thus hampering the performance of domain adaptation. In this paper, instead of improving the IC data generator, which might not be an optimal way, we accept the fact that data quality variation does exist but find a better way to use the data. Specifically, we propose a novel heterogeneous transfer learning method named Transfer Learning with Weighted Correspondence (TLWC), which utilizes IC data to adapt the source domain to the target domain. Rather than treating IC data equally, TLWC assigns an appropriate weight to each IC data pair depending on its quality. We conduct extensive experiments on HeTL datasets, and the state-of-the-art results verify the effectiveness of TLWC.

ICML Conference 2019 Conference Paper

Approximated Oracle Filter Pruning for Destructive CNN Width Optimization

  • Xiaohan Ding
  • Guiguang Ding
  • Yuchen Guo
  • Jungong Han
  • Chenggang Yan 0001

It is not easy to design and run Convolutional Neural Networks (CNNs) due to: 1) finding the optimal number of filters (i.e., the width) at each layer is tricky, given an architecture; and 2) the computational intensity of CNNs impedes deployment on computationally limited devices. Oracle Pruning is designed to remove the unimportant filters from a well-trained CNN; it estimates the filters' importance by ablating them in turn and evaluating the model, thus delivering high accuracy, but it suffers from intolerable time complexity and requires a given resulting width rather than finding one automatically. To address these problems, we propose Approximated Oracle Filter Pruning (AOFP), which keeps searching for the least important filters in a binary search manner, makes pruning attempts by masking out filters randomly, accumulates the resulting errors, and fine-tunes the model via a multi-path framework. As AOFP enables simultaneous pruning on multiple layers, we can prune an existing very deep CNN with acceptable time cost, negligible accuracy drop, and no heuristic knowledge, or re-design a model which achieves higher accuracy and faster inference.

AAAI Conference 2019 Conference Paper

CycleEmotionGAN: Emotional Semantic Consistency Preserved CycleGAN for Adapting Image Emotions

  • Sicheng Zhao
  • Chuang Lin
  • Pengfei Xu
  • Sendong Zhao
  • Yuchen Guo
  • Ravi Krishna
  • Guiguang Ding
  • Kurt Keutzer

Deep neural networks excel at learning from large-scale labeled training data, but cannot generalize the learned knowledge well to new domains or datasets. Domain adaptation studies how to transfer models trained on one labeled source domain to another sparsely labeled or unlabeled target domain. In this paper, we investigate the unsupervised domain adaptation (UDA) problem in image emotion classification. Specifically, we develop a novel cycle-consistent adversarial model, termed CycleEmotionGAN, by enforcing emotional semantic consistency while adapting images cycle-consistently. By alternately optimizing the CycleGAN loss, the emotional semantic consistency loss, and the target classification loss, CycleEmotionGAN can adapt source domain images to have similar distributions to the target domain without using aligned image pairs. Simultaneously, the annotation information of the source images is preserved. Extensive experiments are conducted on the ArtPhoto and FI datasets, and the results demonstrate that CycleEmotionGAN significantly outperforms the state-of-the-art UDA approaches.

AAAI Conference 2019 Conference Paper

Dual-View Ranking with Hardness Assessment for Zero-Shot Learning

  • Yuchen Guo
  • Guiguang Ding
  • Jungong Han
  • Xiaohan Ding
  • Sicheng Zhao
  • Zheng Wang
  • Chenggang Yan
  • Qionghai Dai

Zero-shot learning (ZSL) is to build recognition models for previously unseen target classes which have no labeled data for training, by transferring knowledge from related auxiliary source classes with abundant labeled samples, with class attributes as the bridge. The key is to learn a similarity-based ranking function between samples and class labels using the labeled source classes, so that the proper (unseen) class label for a test sample can be identified by the function. To learn the function, a single-view ranking-based loss is widely used, which aims to rank the true label before the other labels for a training sample. However, we argue that the ranking can also be performed from the other view, which aims to place the images belonging to a label before the images from other classes. Motivated by this, we propose a novel DuAl-view RanKing (DARK) loss for zero-shot learning that simultaneously ranks labels for an image by a point-to-point metric and ranks images for a label by a point-to-set metric, which is capable of better modeling the relationship between images and classes. In addition, we notice that previous ZSL approaches mostly fail to fully exploit the hardness of training samples, either using only very hard ones or using all samples indiscriminately. In this work, we therefore introduce a sample hardness assessment method to ZSL which assigns different weights to training samples based on their hardness, leading to a more accurate and robust ZSL model. Experiments on benchmarks demonstrate that DARK outperforms state-of-the-art methods for (generalized) ZSL.

NeurIPS Conference 2019 Conference Paper

Global Sparse Momentum SGD for Pruning Very Deep Neural Networks

  • Xiaohan Ding
  • Guiguang Ding
  • Xiangxin Zhou
  • Yuchen Guo
  • Jungong Han
  • Ji Liu

Deep Neural Networks (DNNs) are powerful but computationally expensive and memory intensive, thus impeding their practical usage on resource-constrained front-end devices. DNN pruning is an approach for deep model compression, which aims at eliminating some parameters with tolerable performance degradation. In this paper, we propose a novel momentum-SGD-based optimization method to reduce the network complexity by on-the-fly pruning. Concretely, given a global compression ratio, we categorize all the parameters into two parts at each training iteration, which are updated using different rules. In this way, we gradually zero out the redundant parameters, as we update them using only the ordinary weight decay but no gradients derived from the objective function. As a departure from prior methods that require heavy human effort to tune the layer-wise sparsity ratios, prune by solving complicated non-differentiable problems, or fine-tune the model after pruning, our method is characterized by 1) global compression that automatically finds the appropriate per-layer sparsity ratios; 2) end-to-end training; 3) no need for a time-consuming re-training process after pruning; and 4) a superior capability to find better winning tickets which have won the initialization lottery.
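The two-part update described above can be sketched as a single optimizer step: pick the q globally most important parameters, give only those the objective gradient, and let the rest receive weight decay alone so they drift toward zero. The importance proxy |w·g| used here is a first-order Taylor term standing in for the paper's exact metric, and the hyperparameters are illustrative.

```python
def gsm_step(weights, grads, momentum, q, lr=0.1, wd=0.01, beta=0.9):
    """One global-sparse-momentum step (sketch) over flat parameter lists.
    Only the q most 'important' parameters get the objective gradient;
    the rest are updated with weight decay alone and gradually vanish."""
    # Importance proxy: |w * g| (an assumption, not the paper's exact rule).
    importance = [abs(w * g) for w, g in zip(weights, grads)]
    order = sorted(range(len(weights)), key=importance.__getitem__, reverse=True)
    active = set(order[:q])
    for i in range(len(weights)):
        g = grads[i] if i in active else 0.0   # redundant params: no gradient
        momentum[i] = beta * momentum[i] + g + wd * weights[i]
        weights[i] -= lr * momentum[i]
    return weights, momentum

# With q=1 only the more important weight moves with the gradient;
# the other just decays slightly under weight decay.
w, m = [1.0, 0.5], [0.0, 0.0]
gsm_step(w, [1.0, 1.0], m, q=1)
```

Because the active set is recomputed every iteration from a global ranking, the per-layer sparsity ratios emerge on their own, matching the "global compression" property the abstract claims.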

IJCAI Conference 2019 Conference Paper

Landmark Selection for Zero-shot Learning

  • Yuchen Guo
  • Guiguang Ding
  • Jungong Han
  • Chenggang Yan
  • Jiyong Zhang
  • Qionghai Dai

Zero-shot learning (ZSL) is an emerging research topic whose goal is to build recognition models for previously unseen classes. The basic idea of ZSL is heterogeneous feature matching, which learns a compatibility function between image and class features using seen classes. The function is constructed based on one-vs-all training, in which each class has only one class feature and many image features. Existing ZSL works mostly treat all image features equivalently. However, in this paper we argue that it is more reasonable to use some representative cross-domain data instead of all of them. Motivated by this idea, we propose a novel approach, termed Landmark Selection (LAST), for ZSL. LAST is able to identify representative cross-domain features which further lead to a better image-class compatibility function. Experiments on several ZSL datasets including ImageNet demonstrate the superiority of LAST over state-of-the-art methods.

IJCAI Conference 2019 Conference Paper

Zero-shot Learning with Many Classes by High-rank Deep Embedding Networks

  • Yuchen Guo
  • Guiguang Ding
  • Jungong Han
  • Hang Shao
  • Xin Lou
  • Qionghai Dai

Zero-shot learning (ZSL) is a recently emerging research topic which aims to build classification models for unseen classes with knowledge from auxiliary seen classes. Though many ZSL works have shown promising results on small-scale datasets by utilizing a bilinear compatibility function, the ZSL performance on large-scale datasets with many classes (say, ImageNet) is still unsatisfactory. We argue that the bilinear compatibility function is a low-rank approximation of the true compatibility function such that it is not expressive enough especially when there are a large number of classes because of the rank limitation. To address this issue, we propose a novel approach, termed as High-rank Deep Embedding Networks (GREEN), for ZSL with many classes. In particular, we propose a feature-dependent mixture of softmaxes as the image-class compatibility function, which is a simple extension of the bilinear compatibility function, but yields much better results. It utilizes a mixture of non-linear transformations with feature-dependent latent variables to approximate the true function in a high-rank way, which makes GREEN more expressive. Experiments on several datasets including ImageNet demonstrate GREEN significantly outperforms the state-of-the-art approaches.
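The compatibility function described above is easy to write down: a convex combination of K softmax distributions, each computed under its own transform of the image feature. A sketch with hand-picked toy weights; in the model the priors pi_k(x) would come from a small feature-dependent gating network, but here they are passed in precomputed, and all shapes are illustrative.

```python
import math

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def mos_compatibility(x, class_embs, transforms, priors):
    """Mixture-of-softmaxes scoring (sketch): component k maps the image
    feature with its own linear transform, scores it against every class
    embedding by inner product, and the K softmax distributions are
    blended by the priors pi_k(x)."""
    out = [0.0] * len(class_embs)
    for T, pi in zip(transforms, priors):
        hx = [sum(t * xi for t, xi in zip(row, x)) for row in T]
        logits = [sum(h * c for h, c in zip(hx, cls)) for cls in class_embs]
        for j, p in enumerate(softmax(logits)):
            out[j] += pi * p
    return out

# Two components (identity map and a coordinate swap), equal priors.
x = [1.0, 0.0]
classes = [[1.0, 0.0], [0.0, 1.0]]
transforms = [[[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]]]
scores = mos_compatibility(x, classes, transforms, [0.5, 0.5])
```

Since each component is itself a distribution over classes, the mixture is too; yet, unlike a single softmax over a bilinear score, the blend is not limited to a low-rank logit matrix, which is the high-rank argument the abstract makes.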

IJCAI Conference 2018 Conference Paper

Grouping Attribute Recognition for Pedestrian with Joint Recurrent Learning

  • Xin Zhao
  • Liufang Sang
  • Guiguang Ding
  • Yuchen Guo
  • Xiaoming Jin

Pedestrian attribute recognition is to predict attribute labels of pedestrians from surveillance images, which is a very challenging task for computer vision due to poor imaging quality and small training datasets. It is observed that the semantic pedestrian attributes to be recognized tend to show semantic or visual spatial correlations. Attributes can be grouped by this correlation, yet previous works mostly ignore this phenomenon. Inspired by the Recurrent Neural Network (RNN)'s strong capability of learning context correlations, this paper proposes an end-to-end Grouping Recurrent Learning (GRL) model that takes advantage of intra-group mutual exclusion and inter-group correlation to improve the performance of pedestrian attribute recognition. Our GRL method starts with the detection of a precise body region via Body Region Proposal, followed by feature extraction from the detected regions. These features, along with the semantic groups, are fed into an RNN for recurrent grouping attribute recognition, where intra-group correlations can be learned. Extensive empirical evidence shows that our GRL model achieves state-of-the-art results on pedestrian attribute datasets, i.e., the standard PETA and RAP datasets.

IJCAI Conference 2018 Conference Paper

Implicit Non-linear Similarity Scoring for Recognizing Unseen Classes

  • Yuchen Guo
  • Guiguang Ding
  • Jungong Han
  • Sicheng Zhao
  • Bin Wang

Recognizing unseen classes is an important task for real-world applications because: 1) it is common that some classes in reality have no labeled image exemplars for training; and 2) novel classes emerge rapidly. Recently, to address this task many zero-shot learning (ZSL) approaches have been proposed in which explicit linear scores, like the inner product score, are employed to measure the similarity between a class and an image. We argue that explicit linear scoring (ELS) is too weak to capture complicated image-class correspondence. We propose a simple yet effective framework, called Implicit Non-linear Similarity Scoring (ICINESS). In particular, we train a scoring network which uses image and class features as input, fuses them by hidden layers, and outputs the similarity. Based on the universal approximation theorem, it can approximate the true similarity function between images and classes if a proper structure is used, in an implicit non-linear way, which is more flexible and powerful. With the ICINESS framework, we implement ZSL algorithms with shallow and deep networks, which yield consistently superior results.
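The scoring network the abstract describes reduces, in its smallest form, to an MLP over the concatenated image and class features: the network itself defines the similarity rather than a fixed bilinear form. All weights below are hand-picked toys, not trained parameters.

```python
import math

def iciness_score(img_feat, cls_feat, W1, b1, w2, b2):
    """Implicit non-linear scoring (sketch): concatenate image and class
    features, pass them through one hidden tanh layer, and read out a
    scalar similarity. The hidden layer is what makes the score non-linear
    in the (image, class) pair, unlike an inner-product score."""
    z = list(img_feat) + list(cls_feat)              # fuse by concatenation
    h = [math.tanh(sum(w * v for w, v in zip(row, z)) + b)
         for row, b in zip(W1, b1)]                  # hidden non-linearity
    return sum(w * v for w, v in zip(w2, h)) + b2    # scalar similarity

# Toy weights: each hidden unit reads one coordinate of the fused vector.
W1 = [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
b1 = [0.0, 0.0]
w2, b2 = [1.0, 1.0], 0.0
score = iciness_score([1.0, 0.0], [1.0, 0.0], W1, b1, w2, b2)
```

At test time, an unseen image would be assigned the class whose attribute vector maximizes this score, the same decision rule as with an explicit linear score.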

AAAI Conference 2018 Conference Paper

On Trivial Solution and High Correlation Problems in Deep Supervised Hashing

  • Yuchen Guo
  • Xin Zhao
  • Guiguang Ding
  • Jungong Han

Deep supervised hashing (DSH), which combines binary code learning and convolutional neural networks, has attracted considerable research interest and achieved promising performance for highly efficient image retrieval. In this paper, we show that the widely used loss functions, pair-wise loss and triplet loss, suffer from the trivial solution problem and usually lead to highly correlated bits in practice, limiting the performance of DSH. One important reason is that it is difficult to incorporate proper constraints into the loss functions under mini-batch-based optimization. To tackle these problems, we propose to adopt an ensemble learning strategy for deep model training. We find that this simple strategy is capable of effectively decorrelating different bits, making the hashcodes more informative. Moreover, it is very easy to parallelize the training and support incremental model learning, which are very useful for real-world applications but usually ignored by existing DSH approaches. Experiments on benchmarks demonstrate that the proposed ensemble-based DSH can improve the performance of DSH approaches significantly.

IJCAI Conference 2018 Conference Paper

Where to Prune: Using LSTM to Guide End-to-end Pruning

  • Jing Zhong
  • Guiguang Ding
  • Yuchen Guo
  • Jungong Han
  • Bin Wang

Recent years have witnessed the great success of convolutional neural networks (CNNs) in many related fields. However, their huge model size and computational complexity make it difficult to deploy CNNs in some scenarios, such as embedded systems with low computational power. To address this issue, many works have proposed pruning filters in CNNs to reduce computation, but they mainly focus on identifying which filters are unimportant within a layer and then prune filters layer by layer or globally. In this paper, we argue that the pruning order is also very significant for model pruning. We propose a novel approach to decide which layers should be pruned at each step. First, we utilize a long short-term memory (LSTM) network to learn the hierarchical characteristics of a network and generate a pruning decision for each layer, which is the main difference from previous works. Next, a channel-based method is adopted to evaluate the importance of filters in a to-be-pruned layer, followed by an accelerated recovery step. Experimental results demonstrate that our approach can reduce FLOPs by 70.1% for VGG and 47.5% for ResNet-56 with comparable accuracy. The learning results also appear to reveal the sensitivity of each network layer.
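The per-layer filter-evaluation step can be sketched in numpy. The L1-norm importance criterion below is an assumed stand-in; the paper's actual channel-based criterion, and the LSTM policy that decides which layer to prune, are not reproduced here.

```python
import numpy as np

def prune_layer(weights, ratio):
    """Rank filters in one conv layer and drop the least important
    fraction `ratio`. `weights` has shape (out_ch, in_ch, k, k);
    L1 norm per filter is an assumed importance proxy."""
    importance = np.abs(weights).sum(axis=(1, 2, 3))      # per-filter L1 norm
    n_keep = int(round(len(importance) * (1.0 - ratio)))
    keep = np.sort(np.argsort(importance)[::-1][:n_keep])  # keep top filters, in order
    return weights[keep], keep

rng = np.random.default_rng(2)
W = rng.standard_normal((64, 32, 3, 3))    # a hypothetical 64-filter conv layer
W_pruned, kept = prune_layer(W, ratio=0.5)
```

In the full method, a decision like `ratio` per layer would come from the LSTM rather than being fixed by hand.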

AAAI Conference 2018 Conference Paper

Zero-Shot Learning With Attribute Selection

  • Yuchen Guo
  • Guiguang Ding
  • Jungong Han
  • Sheng Tang

Zero-shot learning (ZSL) is regarded as an effective way to construct classification models for target classes that have no labeled samples available. The basic framework is to transfer knowledge from (different) auxiliary source classes, which have sufficient labeled samples, using attributes shared by target and source classes as a bridge. Attributes play an important role in ZSL but have not gained sufficient attention in recent years. Previous works mostly assume attributes are perfect and treat each attribute equally. However, as shown in this paper, different attributes have different properties, such as their class distribution, variance, and entropy, which may have considerable impact on ZSL accuracy if all attributes are treated equally. Based on this observation, we propose to use a subset of attributes, instead of the whole set, for building ZSL models. The attribute selection is conducted by considering information amount and predictability under a novel joint optimization framework. To our knowledge, this is the first work that examines the influence of the attributes themselves and proposes to use a refined attribute set for ZSL. Since our approach focuses on selecting good attributes for ZSL, it can be combined with any attribute-based ZSL approach to augment its performance. Experiments on four ZSL benchmarks demonstrate that our approach can improve zero-shot classification accuracy and yield state-of-the-art results.

AAAI Conference 2017 Conference Paper

Active Learning with Cross-Class Similarity Transfer

  • Yuchen Guo
  • Guiguang Ding
  • Yue Gao
  • Jungong Han

How to save labeling effort for training supervised classifiers is an important research topic in the machine learning community. Active learning (AL) and transfer learning (TL) are two useful tools to achieve this goal, and their combination, i.e., transfer active learning (T-AL), has also attracted considerable research interest. However, existing T-AL approaches consider transferring knowledge from a source/auxiliary domain that has the same class labels as the target domain, ignoring the relationship among classes. In this paper, we investigate a more practical setting where the classes in the source domain are related/similar to, but different from, the target domain classes. Specifically, we propose a novel cross-class T-AL approach to simultaneously transfer knowledge from the source domain and actively annotate the most informative samples in the target domain, so that satisfactory classifiers can be trained with as few labeled samples as possible. In particular, based on class-class similarity and sample-sample similarity, we adopt similarity propagation to find the source-domain samples that best capture the characteristics of a target class and then transfer those samples as (pseudo-)labeled data for the target class. In turn, the labeled and transferred samples are used to train classifiers and actively select new samples for annotation. Extensive experiments on three datasets demonstrate that the proposed approach significantly outperforms the state-of-the-art related approaches.
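One scoring step of the cross-class transfer idea can be sketched as follows. The class-class similarities, the seed sample, and the top-5 cutoff are all hypothetical illustration values; the paper's iterative similarity propagation is reduced here to a single combination of class-level and sample-level similarity.

```python
import numpy as np

rng = np.random.default_rng(6)
S_cls = np.array([0.9, 0.2])            # similarity of 2 source classes to one target class
src_labels = rng.integers(0, 2, 30)     # class of each source sample
X_src = rng.standard_normal((30, 5))    # source-domain sample features
x_target_seed = rng.standard_normal(5)  # a sample believed typical of the target class

# Combine class-level and sample-level similarity in one step.
samp_sim = X_src @ x_target_seed
score = S_cls[src_labels] * samp_sim

# Transfer the top-scoring source samples as pseudo-labeled data
# for the target class.
transfer_idx = np.argsort(score)[::-1][:5]
```

The transferred samples would then join the actively annotated ones to train the target-class classifier.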

AAAI Conference 2017 Short Paper

Problems in Large-Scale Image Classification

  • Yuchen Guo

The number of images has grown rapidly in recent years because of the development of the Internet, especially social networks like Facebook, and the popularization of portable image-capture devices like smartphones. Annotating them with semantically meaningful words to describe them, i.e., classification, is a useful way to manage these images. However, the huge numbers of images and classes bring several challenges to classification, two of which are: 1) how to measure similarity efficiently between large-scale collections of images (measuring similarity between samples is, for example, the building block for SVM and kNN classifiers); and 2) how to train supervised classification models for newly emerging classes with only a few, or even no, labeled samples, since new concepts, like Tesla's Model S, appear on the Web every day. My Ph.D. thesis focuses on these two problems in large-scale image classification. Formally, they are termed large-scale similarity search, which concerns the large scale of samples/images, and zero-shot/few-shot learning, which concerns the large scale of classes. Specifically, my research considers the following three aspects: 1) hashing-based large-scale similarity search, which adopts hashing to improve efficiency; 2) cross-class transfer active learning, which simultaneously transfers knowledge from the abundant labeled samples on the Web and selects the most informative samples for expert labeling, so that effective classifiers for novel classes can be constructed with only a few labeled samples; and 3) zero-shot learning, which utilizes no labeled samples for novel classes at all, building supervised classifiers for them by transferring knowledge from related classes.

IJCAI Conference 2017 Conference Paper

SitNet: Discrete Similarity Transfer Network for Zero-shot Hashing

  • Yuchen Guo
  • Guiguang Ding
  • Jungong Han
  • Yue Gao

Hashing has been widely utilized for fast image retrieval recently. With semantic information as supervision, hashing approaches perform much better, especially when combined with a deep convolutional neural network (CNN). However, in practice, new concepts emerge every day, making it infeasible to collect supervised information for re-training the hashing model. In this paper, we propose a novel zero-shot hashing approach, called Discrete Similarity Transfer Network (SitNet), to preserve the semantic similarity between images from both ``seen'' concepts and new ``unseen'' concepts. Motivated by zero-shot learning, the semantic vectors of concepts are adopted to capture the similarity structure among classes, so that a model trained on seen concepts generalizes well to unseen ones, benefiting from the transferability of the semantic vector space. We adopt a multi-task architecture to exploit the supervised information for seen concepts and the semantic vectors simultaneously. Moreover, a discrete hashing layer is integrated into the network for hashcode generation, avoiding the information loss caused by real-value relaxation in the training phase, which is a critical problem in existing works. Experiments on three benchmarks validate the superiority of SitNet over the state-of-the-art.
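The motivation for a discrete hashing layer can be illustrated by comparing a real-valued relaxation with direct sign binarization. This sketch only shows the quantization gap that relaxation leaves behind; it does not reproduce SitNet's training procedure.

```python
import numpy as np

rng = np.random.default_rng(3)
u = rng.standard_normal(16)                  # pre-binarization activations

relaxed = np.tanh(u)                         # real-valued relaxation (common in prior work)
discrete = np.where(u >= 0, 1.0, -1.0)       # output of a discrete (sign) hashing layer

# Quantization error that the relaxation would leave behind at test
# time, when codes must finally be binarized:
err = np.abs(relaxed - discrete).mean()
```

Training against the discrete codes directly, as SitNet does, removes this train/test mismatch.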

IJCAI Conference 2017 Conference Paper

Synthesizing Samples for Zero-shot Learning

  • Yuchen Guo
  • Guiguang Ding
  • Jungong Han
  • Yue Gao

Zero-shot learning (ZSL) constructs recognition models for unseen target classes that have no labeled samples for training. It utilizes class attributes or semantic vectors as side information and transfers supervision from related source classes with abundant labeled samples. Existing ZSL approaches adopt an intermediary embedding space to measure the similarity between a sample and the attributes of a target class to perform zero-shot classification. However, this approach may suffer from information loss caused by the embedding process, and the similarity measure cannot fully exploit the data distribution. In this paper, we propose a novel approach that turns the ZSL problem into a conventional supervised learning problem by synthesizing samples for the unseen classes. Firstly, the probability distribution of an unseen class is estimated using the knowledge from seen classes and the class attributes. Secondly, samples are synthesized from this distribution for the unseen class. Finally, any supervised classifier can be trained on the synthesized samples. Extensive experiments on benchmarks demonstrate the superiority of the proposed approach over state-of-the-art ZSL approaches.
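The three-step pipeline (estimate the unseen-class distribution, synthesize samples, train any supervised classifier) can be sketched with Gaussian class models. The attribute-derived weights, the shared covariance, and the two-class setup are illustrative assumptions, not the paper's actual estimator.

```python
import numpy as np

rng = np.random.default_rng(4)

# Seen-class statistics (illustrative): per-class mean and a shared covariance.
seen_means = {"cat": np.array([1.0, 0.0]), "dog": np.array([0.0, 1.0])}
shared_cov = 0.1 * np.eye(2)

# Hypothetical attribute similarity of the unseen class to each seen class
# (in the paper these weights would come from class attribute vectors).
w = {"cat": 0.7, "dog": 0.3}

# Step 1: estimate the unseen-class distribution from the seen classes.
mu_unseen = sum(w[c] * seen_means[c] for c in seen_means)

# Step 2: synthesize labeled samples for the unseen class.
samples = rng.multivariate_normal(mu_unseen, shared_cov, size=100)

# Step 3: the pairs (samples, unseen-class label) can now train any
# conventional supervised classifier.
```

The key design choice is that classification happens in the original feature space, so no embedding step can lose information.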

IJCAI Conference 2017 Conference Paper

TUCH: Turning Cross-view Hashing into Single-view Hashing via Generative Adversarial Nets

  • Xin Zhao
  • Guiguang Ding
  • Yuchen Guo
  • Jungong Han
  • Yue Gao

Cross-view retrieval, which focuses on searching for images in response to text queries or vice versa, has received increasing attention recently. Cross-view hashing efficiently solves the cross-view retrieval problem with binary hash codes. Most existing works on cross-view hashing exploit multi-view embedding methods to tackle this problem, which inevitably causes information loss in both the image and text domains. Inspired by Generative Adversarial Nets (GANs), this paper presents a new model that is able to Turn Cross-view Hashing into single-view hashing (TUCH), enabling the image information to be preserved as much as possible. TUCH is a novel deep architecture that integrates a language model network T for text feature extraction, a generator network G to generate fake images from text features, and a hashing network H for learning hash functions to generate compact binary codes. Our architecture effectively unifies joint generative adversarial learning and cross-view hashing. Extensive empirical evidence shows that our TUCH approach achieves state-of-the-art results, especially on text-to-image retrieval, on image-sentence datasets, i.e., the standard IAPRTC-12 and the large-scale Microsoft COCO.

IJCAI Conference 2017 Conference Paper

Unsupervised Deep Video Hashing with Balanced Rotation

  • Gengshen Wu
  • Li Liu
  • Yuchen Guo
  • Guiguang Ding
  • Jungong Han
  • Jialie Shen
  • Ling Shao

Recently, hashing video contents for fast retrieval has received increasing attention due to the enormous growth of online videos. As an extension of image hashing techniques, traditional video hashing methods mainly focus on seeking appropriate video features but pay little attention to how the video-specific features can be leveraged to achieve optimal binarization. In this paper, an end-to-end hashing framework, namely Unsupervised Deep Video Hashing (UDVH), is proposed, where feature extraction, balanced code learning and hash function learning are integrated and optimized in a self-taught manner. Particularly, distinguished from previous work, our framework enjoys two novelties: 1) an unsupervised hashing method that integrates feature clustering and feature binarization, enabling the neighborhood structure to be preserved in the binary space; 2) a smart rotation applied to the video-specific features, which are widely spread in the low-dimensional space, such that the variance of the dimensions can be balanced, thus generating more effective hash codes. Extensive experiments have been performed on two real-world datasets and the results demonstrate its superiority compared to the state-of-the-art video hashing methods. To bootstrap further developments, the source code will be made publicly available.
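The variance-balancing effect of a rotation can be shown in two dimensions, where a 45-degree rotation is the balanced choice for two decorrelated dimensions. This is only a toy illustration of why rotating features before binarization helps; UDVH learns its rotation jointly with the codes, which is not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(7)
# Features with very unbalanced per-dimension variance.
X = rng.standard_normal((1000, 2)) * np.array([3.0, 0.3])

theta = np.pi / 4                      # 45-degree rotation balances two dims
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
Xr = X @ R

var_before = X.var(axis=0)             # one dominant dimension
var_after = Xr.var(axis=0)             # variance spread across both dims
```

With variance balanced, a per-dimension sign threshold yields bits that each carry comparable information.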

AAAI Conference 2017 Conference Paper

Zero-Shot Recognition via Direct Classifier Learning with Transferred Samples and Pseudo Labels

  • Yuchen Guo
  • Guiguang Ding
  • Jungong Han
  • Yue Gao

As an interesting and emerging topic, zero-shot recognition (ZSR) makes it possible to train a recognition model by specifying a category's attributes when no labeled exemplars are available. The fundamental idea of ZSR is to transfer knowledge from the abundant labeled data in different but related source classes via the class attributes. Conventional ZSR approaches adopt a two-step strategy at test time: samples are first projected into the attribute space, and recognition is then carried out based on the relationship between samples and classes in that space. Due to this intermediate transformation, information loss is unavoidable, degrading the performance of the overall system. Rather than following this two-step strategy, in this paper we propose a novel one-step approach that performs ZSR in the original feature space using directly trained classifiers. To tackle the problem that no labeled samples of target classes are available, we propose to assign pseudo labels to samples based on reliability and diversity, which in turn are used to train the classifiers. Moreover, we adopt a robust SVM that accounts for the unreliability of pseudo labels. Extensive experiments on four datasets demonstrate consistent performance gains of our approach over state-of-the-art two-step ZSR approaches.
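The reliability-based pseudo-labeling step can be sketched as a margin rule over attribute-derived class prototypes. The margin criterion and the 20% cutoff are illustrative assumptions; the paper additionally enforces diversity and trains a robust SVM, both omitted here.

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.standard_normal((50, 4))         # unlabeled target-class samples
protos = rng.standard_normal((3, 4))     # attribute-derived class prototypes

# Reliability: margin between the best and second-best class affinity.
aff = X @ protos.T
order = np.sort(aff, axis=1)
margin = order[:, -1] - order[:, -2]
labels = aff.argmax(axis=1)

# Keep only confidently pseudo-labeled samples (top 20% by margin);
# these would then train per-class classifiers in the original space.
thresh = np.quantile(margin, 0.8)
keep = margin >= thresh
pseudo_X, pseudo_y = X[keep], labels[keep]
```

Samples below the margin threshold stay unlabeled rather than risk polluting the classifier with wrong pseudo labels.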

AAAI Conference 2016 Conference Paper

Active Learning with Cross-Class Knowledge Transfer

  • Yuchen Guo
  • Guiguang Ding
  • Yuqi Wang
  • Xiaoming Jin

When there are insufficient labeled samples for training a supervised model, we can adopt active learning to select the most informative samples for human labeling, or transfer learning to transfer knowledge from a related labeled data source. Combining transfer learning with active learning has attracted much research interest in recent years. Most existing works follow the setting where the class labels in the source domain are the same as those in the target domain. In this paper, we focus on a more challenging cross-class setting where the class labels are totally different in the two domains but related to each other in an intermediary attribute space, which has barely been investigated before. We propose a novel and effective method that utilizes the attribute representation as seed parameters to generate the classification models for classes. We further propose a joint learning framework that takes into account both the knowledge from the related classes in the source domain and the information in the target domain. Besides, uncertainty sampling, a fundamental technique for active learning, is simple to perform within this framework. We conduct experiments on three benchmark datasets and the results demonstrate the efficacy of the proposed method.

IJCAI Conference 2016 Conference Paper

Semi-Supervised Active Learning with Cross-Class Sample Transfer

  • Yuchen Guo
  • Guiguang Ding
  • Yue Gao
  • Jianmin Wang

To save the labeling effort for training a classification model, we can simultaneously adopt Active Learning (AL) to select the most informative samples for human labeling, and Semi-supervised Learning (SSL) to construct effective classifiers using a few labeled samples and a large number of unlabeled samples. Recently, using Transfer Learning (TL) to enhance AL and SSL, i.e., T-SS-AL, has gained considerable attention. However, existing T-SS-AL methods mostly focus on the situation where the source domain and the target domain share the same classes. In this paper, we consider a more practical and challenging setting where the source domain and the target domain have different but related classes. We propose a novel cross-class sample transfer based T-SS-AL method, called CC-SS-AL, to exploit the information from the source domain. Our key idea is to select samples from the source domain that are very similar to the target domain classes and assign pseudo labels to them for classifier training. Extensive experiments on three datasets verify the efficacy of the proposed method.

AAAI Conference 2016 Conference Paper

Transductive Zero-Shot Recognition via Shared Model Space Learning

  • Yuchen Guo
  • Guiguang Ding
  • Xiaoming Jin
  • Jianmin Wang

Zero-shot recognition (ZSR) learns recognition models for novel classes without labeled data. It is a challenging task and has drawn considerable attention in recent years. The basic idea is to transfer knowledge from seen classes via shared attributes. This paper focuses on transductive ZSR, i.e., the setting where unlabeled data for the novel classes are available. Instead of learning models for seen and novel classes separately, as in existing works, we put forward a novel joint learning approach which learns a shared model space (SMS) such that knowledge can be effectively transferred between classes using the attributes. An effective algorithm is proposed for optimization. We conduct comprehensive experiments on three benchmark datasets for ZSR. The results demonstrate that the proposed SMS significantly outperforms the state-of-the-art related approaches, validating its efficacy for the ZSR task.

AAAI Conference 2015 Conference Paper

Learning Predictable and Discriminative Attributes for Visual Recognition

  • Yuchen Guo
  • Guiguang Ding
  • Xiaoming Jin
  • Jianmin Wang

Utilizing attributes for visual recognition has attracted increasing interest because attributes can effectively bridge the semantic gap between low-level visual features and high-level semantic labels. In this paper, we propose a novel method for learning predictable and discriminative attributes. Specifically, we require that the learned attributes can be reliably predicted from visual features and that they discover the inherent discriminative structure of the data. In addition, we propose to exploit the intra-category locality of data to overcome the intra-category variance in visual data. We conduct extensive experiments on the Animals with Attributes (AwA) and Caltech256 datasets, and the results demonstrate that the proposed method achieves state-of-the-art performance.