Arrow Research search

Author name cluster

Li Niu

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

30 papers

Possible papers (30)

AAAI Conference 2026 Conference Paper

CareCom: Generative Image Composition with Calibrated Reference Features

  • Jiaxuan Chen
  • Bo Zhang
  • Qingdong He
  • Jinlong Peng
  • Li Niu

Image composition aims to seamlessly insert a foreground object into a background image. Despite the huge progress in generative image composition, existing methods still struggle with simultaneous detail preservation and foreground pose/view adjustment. To address this issue, we extend the existing generative composition model to a multi-reference version, which allows using an arbitrary number of foreground reference images. Furthermore, we propose to calibrate the global and local features of foreground reference images to make them compatible with the background information. The calibrated reference features can supplement the original reference features with useful global and local information of the proper pose/view. Extensive experiments on MVImgNet and MureCom demonstrate that the generative model can greatly benefit from the calibrated reference features.

AAAI Conference 2026 Conference Paper

D3ToM: Decider-Guided Dynamic Token Merging for Accelerating Diffusion MLLMs

  • Shuochen Chang
  • Xiaofeng Zhang
  • Qingyang Liu
  • Li Niu

Diffusion-based multimodal large language models (Diffusion MLLMs) have recently demonstrated impressive non-autoregressive generative capabilities across vision-and-language tasks. However, Diffusion MLLMs exhibit substantially slower inference than autoregressive models: each denoising step employs full bidirectional self-attention over the entire sequence, resulting in cubic decoding complexity that becomes computationally impractical with thousands of visual tokens. To address this challenge, we propose D³ToM, a Decider-guided dynamic token merging method that dynamically merges redundant visual tokens at different denoising steps to accelerate inference in Diffusion MLLMs. At each denoising step, D³ToM uses decider tokens (the tokens generated in the previous denoising step) to build an importance map over all visual tokens. It then retains a proportion of the most salient tokens and merges the remainder through similarity-based aggregation. This plug-and-play module integrates into a single transformer layer, physically shortening the visual token sequence for all subsequent layers without altering model parameters. Moreover, D³ToM employs a merge ratio that varies dynamically across denoising steps, aligning with the native decoding process of Diffusion MLLMs and achieving superior performance under equivalent computational budgets. Extensive experiments show that D³ToM accelerates inference while preserving competitive performance.
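
The merging step described above can be pictured with a short, hedged sketch. The function below is illustrative only (the function name, the mean-similarity scoring, and the mean-aggregation merge rule are assumptions, not the authors' released code): it scores visual tokens by their similarity to the decider tokens, keeps the most salient fraction, and folds each dropped token into its most similar kept token.

```python
# Hedged sketch of decider-guided token merging (illustrative, not the paper's code).
import torch
import torch.nn.functional as F

def merge_visual_tokens(visual: torch.Tensor, decider: torch.Tensor, keep_ratio: float) -> torch.Tensor:
    """visual: [N, d] visual tokens; decider: [M, d] tokens from the previous
    denoising step; keep_ratio in (0, 1]: fraction of visual tokens to retain."""
    # Importance of each visual token = average cosine similarity to the decider tokens.
    importance = (F.normalize(decider, dim=-1) @ F.normalize(visual, dim=-1).T).mean(dim=0)  # [N]

    n_keep = max(1, int(keep_ratio * visual.size(0)))
    mask = torch.zeros(visual.size(0), dtype=torch.bool, device=visual.device)
    mask[importance.topk(n_keep).indices] = True
    kept, dropped = visual[mask], visual[~mask]
    if dropped.numel() == 0:
        return kept

    # Merge each dropped token into its most similar kept token by averaging.
    assign = (F.normalize(dropped, dim=-1) @ F.normalize(kept, dim=-1).T).argmax(dim=-1)
    merged = kept.clone()
    for j in range(kept.size(0)):
        group = dropped[assign == j]
        if group.numel() > 0:
            merged[j] = torch.cat([kept[j:j + 1], group], dim=0).mean(dim=0)
    return merged

# Example: keep 25% of 256 visual tokens, guided by 16 decider tokens.
# shortened = merge_visual_tokens(torch.randn(256, 64), torch.randn(16, 64), keep_ratio=0.25)
```

A dynamic schedule would simply pass a different keep_ratio at each denoising step.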

NeurIPS Conference 2025 Conference Paper

Weak-shot Keypoint Estimation via Keyness and Correspondence Transfer

  • Junjie Chen
  • Zeyu Luo
  • Zezheng Liu
  • Wenhui Jiang
  • Li Niu
  • Yuming Fang

Keypoint estimation is a fundamental task in computer vision, but generally requires large-scale annotated data for training. Few-shot and unsupervised keypoint estimation are prevalent economical paradigms, but the former still requires annotations for extensive novel classes while the latter only supports a single class. In this paper, we focus on the task of weak-shot keypoint estimation, where multiple novel classes are learned from unlabeled images with the help of labeled base classes. The key problem is what to transfer from base classes to novel classes, and we propose to transfer keyness and correspondence, which essentially involve comparing entities and are thus class-agnostic and class-wise transferable. The keyness compares which pixel in a local region is more key, which can guide the keypoints of novel classes to move towards the local maximum (i.e., obtaining keypoints). The correspondence compares whether two pixels belong to the same semantic part, which can activate the keypoints of novel classes by reinforcing the consistency between corresponding points on two paired images. By transferring keyness and correspondence, our framework achieves favourable performance for weak-shot keypoint estimation. Extensive experiments and analyses on the large-scale benchmark MP-100 demonstrate the effectiveness of our method.
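
The keyness idea (nudging a novel-class keypoint toward the local maximum of a keyness map) can be loosely illustrated as below; the function name, the window size, and the plain numpy heatmap are assumptions for illustration, not the paper's training procedure.

```python
# Hedged illustration: move a keypoint to the local maximum of a keyness map.
import numpy as np

def refine_keypoint(keyness: np.ndarray, y: int, x: int, radius: int = 4):
    """keyness: [H, W] map scoring how 'key' each pixel is; (y, x): initial keypoint."""
    h, w = keyness.shape
    y0, y1 = max(0, y - radius), min(h, y + radius + 1)
    x0, x1 = max(0, x - radius), min(w, x + radius + 1)
    window = keyness[y0:y1, x0:x1]
    dy, dx = np.unravel_index(window.argmax(), window.shape)  # local maximum
    return y0 + int(dy), x0 + int(dx)

# refined = refine_keypoint(np.random.rand(64, 64), y=30, x=22)
```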

NeurIPS Conference 2024 Conference Paper

DomainGallery: Few-shot Domain-driven Image Generation by Attribute-centric Finetuning

  • Yuxuan Duan
  • Yan Hong
  • Bo Zhang
  • Jun Lan
  • Huijia Zhu
  • Weiqiang Wang
  • Jianfu Zhang
  • Li Niu

The recent progress in text-to-image models pretrained on large-scale datasets has enabled us to generate various images as long as we provide a text prompt describing what we want. Nevertheless, the availability of these models is still limited when we expect to generate images that fall into a specific domain that is either hard to describe or simply unseen by the models. In this work, we propose DomainGallery, a few-shot domain-driven image generation method which aims at finetuning pretrained Stable Diffusion on few-shot target datasets in an attribute-centric manner. Specifically, DomainGallery features prior attribute erasure, attribute disentanglement, regularization and enhancement. These techniques are tailored to few-shot domain-driven generation in order to solve key issues that previous works have failed to settle. Extensive experiments validate the superior performance of DomainGallery on a variety of domain-driven generation scenarios.

AAAI Conference 2024 Conference Paper

Painterly Image Harmonization by Learning from Painterly Objects

  • Li Niu
  • Junyan Cao
  • Yan Hong
  • Liqing Zhang

Given a composite image with a photographic object and a painterly background, painterly image harmonization aims to stylize the composite object to be compatible with the background. Despite the competitive performance of existing painterly harmonization works, they do not fully leverage the painterly objects in artistic paintings. In this work, we explore learning from painterly objects for painterly image harmonization. In particular, we learn a mapping from background style and object information to object style based on painterly objects in artistic paintings. With the learnt mapping, we can hallucinate the target style of the composite object, which is used to harmonize encoder feature maps to produce the harmonized image. Extensive experiments on the benchmark dataset demonstrate the effectiveness of our proposed method.

AAAI Conference 2024 Conference Paper

Progressive Painterly Image Harmonization from Low-Level Styles to High-Level Styles

  • Li Niu
  • Yan Hong
  • Junyan Cao
  • Liqing Zhang

Painterly image harmonization aims to harmonize a photographic foreground object on the painterly background. Different from previous auto-encoder based harmonization networks, we develop a progressive multi-stage harmonization network, which harmonizes the composite foreground from low-level styles (e.g., color, simple texture) to high-level styles (e.g., complex texture). Our network has better interpretability and harmonization performance. Moreover, we design an early-exit strategy to automatically decide the proper stage to exit, which can skip the unnecessary and even harmful late stages. Extensive experiments on the benchmark dataset demonstrate the effectiveness of our progressive harmonization network.

AAAI Conference 2024 Conference Paper

Shadow Generation with Decomposed Mask Prediction and Attentive Shadow Filling

  • Xinhao Tao
  • Junyan Cao
  • Yan Hong
  • Li Niu

Image composition refers to inserting a foreground object into a background image to obtain a composite image. In this work, we focus on generating plausible shadows for the inserted foreground object to make the composite image more realistic. To supplement the existing small-scale dataset, we create a large-scale dataset called RdSOBA with rendering techniques. Moreover, we design a two-stage network named DMASNet with decomposed mask prediction and attentive shadow filling. Specifically, in the first stage, we decompose shadow mask prediction into box prediction and shape prediction. In the second stage, we attend to reference background shadow pixels to fill the foreground shadow. Abundant experiments prove that our DMASNet achieves better visual effects and generalizes well to real composite images.

AAAI Conference 2024 Conference Paper

WeditGAN: Few-Shot Image Generation via Latent Space Relocation

  • Yuxuan Duan
  • Li Niu
  • Yan Hong
  • Liqing Zhang

In few-shot image generation, directly training GAN models on just a handful of images faces the risk of overfitting. A popular solution is to transfer models pretrained on large source domains to small target ones. In this work, we introduce WeditGAN, which realizes model transfer by editing the intermediate latent codes w in StyleGANs with learned constant offsets (delta w), discovering and constructing target latent spaces by simply relocating the distribution of source latent spaces. The established one-to-one mapping between latent spaces naturally prevents mode collapse and overfitting. Besides, we also propose variants of WeditGAN that further enhance the relocation process by regularizing the direction or finetuning the intensity of delta w. Experiments on a collection of widely used source/target datasets demonstrate the capability of WeditGAN, which is simple yet highly effective, in generating realistic and diverse images for few-shot image generation. Codes are available at https://github.com/Ldhlwh/WeditGAN.
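
Because the core operation is just adding a learned constant offset to the intermediate latent codes, a minimal sketch is easy to give; the module and names below are illustrative assumptions, and the actual implementation is at the repository linked above.

```python
# Minimal sketch of latent space relocation via a learned constant offset.
import torch
import torch.nn as nn

class LatentRelocation(nn.Module):
    """Shifts source latent codes w by a learned constant offset delta_w,
    relocating the source latent distribution toward the target domain."""
    def __init__(self, w_dim: int):
        super().__init__()
        self.delta_w = nn.Parameter(torch.zeros(w_dim))  # the only newly trained parameter

    def forward(self, w: torch.Tensor) -> torch.Tensor:
        return w + self.delta_w  # one-to-one mapping between source and target latent spaces

# Usage: w comes from the frozen source StyleGAN mapping network; only delta_w
# (plus whatever the training objective requires) is learned on the few target images.
# relocate = LatentRelocation(w_dim=512)
# w_target = relocate(w_source)
```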

AAAI Conference 2023 Conference Paper

Amodal Instance Segmentation via Prior-Guided Expansion

  • Junjie Chen
  • Li Niu
  • Jianfu Zhang
  • Jianlou Si
  • Chen Qian
  • Liqing Zhang

Amodal instance segmentation aims to infer the amodal mask, including both the visible part and the occluded part of each object instance. Predicting the occluded parts is challenging. Existing methods often produce incomplete amodal boxes and amodal masks, probably due to a lack of visual evidence to expand the boxes and masks. To this end, we propose a prior-guided expansion framework, which builds on a two-stage segmentation model (i.e., Mask R-CNN) and performs box-level (resp., pixel-level) expansion for amodal box (resp., mask) prediction, by retrieving regression (resp., flow) transformations from a memory bank of expansion priors. We conduct extensive experiments on the KINS, D2SA, and COCOA-cls datasets, which show the effectiveness of our method.

AAAI Conference 2023 Conference Paper

Few-Shot Defect Image Generation via Defect-Aware Feature Manipulation

  • Yuxuan Duan
  • Yan Hong
  • Li Niu
  • Liqing Zhang

The performance of defect inspection has been severely hindered by insufficient defect images in industry, which can be alleviated by generating more samples as data augmentation. We propose the first defect image generation method for the challenging few-shot case. Given just a handful of defect images and relatively more defect-free ones, our goal is to augment the dataset with new defect images. Our method consists of two training stages. First, we train a data-efficient StyleGAN2 on defect-free images as the backbone. Second, we attach defect-aware residual blocks to the backbone, which learn to produce reasonable defect masks and accordingly manipulate the features within the masked regions by training the added modules on limited defect images. Extensive experiments on the MVTec AD dataset not only validate the effectiveness of our method in generating realistic and diverse defect images, but also manifest the benefits it brings to downstream defect inspection tasks. Codes are available at https://github.com/Ldhlwh/DFMGAN.

AAAI Conference 2023 Conference Paper

Geometric Inductive Biases for Identifiable Unsupervised Learning of Disentangled Representations

  • Ziqi Pan
  • Li Niu
  • Liqing Zhang

Model identifiability is a considerable issue in the unsupervised learning of disentangled representations. The PCA inductive biases recently revealed for unsupervised disentangling in VAE-based models are shown to improve local alignment of latent dimensions with principal components of the data. In this paper, in addition to the PCA inductive biases, we propose novel geometric inductive biases from the manifold perspective for unsupervised disentangling, which induce the model to capture the global geometric properties of the data manifold with guaranteed model identifiability. We also propose a Geometric Disentangling Regularized AutoEncoder (GDRAE) that combines the PCA and the proposed geometric inductive biases in one unified framework. The experimental results show the usefulness of the geometric inductive biases in unsupervised disentangling and the effectiveness of our GDRAE in capturing the geometric inductive biases.

AAAI Conference 2023 Conference Paper

Isometric Manifold Learning Using Hierarchical Flow

  • Ziqi Pan
  • Jianfu Zhang
  • Li Niu
  • Liqing Zhang

We propose the Hierarchical Flow (HF) model constrained by isometric regularizations for manifold learning, which combines manifold learning goals such as dimensionality reduction, inference, sampling, projection and density estimation into one unified framework. Our proposed HF model is regularized not only to produce embeddings preserving the geometric structure of the manifold, but also to project samples onto the manifold in a manner conforming to the rigorous definition of projection. Theoretical guarantees are provided for our HF model to satisfy the two desired properties. In order to detect the real dimensionality of the manifold, we also propose a two-stage dimensionality reduction algorithm, which is time-efficient thanks to the hierarchical architecture design of our HF model. Experimental results justify our theoretical analysis, demonstrate the superiority of our dimensionality reduction algorithm in terms of training time, and verify the effect of the aforementioned properties in improving performance on downstream tasks such as anomaly detection.

AAAI Conference 2023 Conference Paper

Painterly Image Harmonization in Dual Domains

  • Junyan Cao
  • Yan Hong
  • Li Niu

Image harmonization aims to produce visually harmonious composite images by adjusting the foreground appearance to be compatible with the background. When the composite image has a photographic foreground and a painterly background, the task is called painterly image harmonization. There are only a few works on this task, which are either time-consuming or weak in generating well-harmonized results. In this work, we propose a novel painterly harmonization network consisting of a dual-domain generator and a dual-domain discriminator, which harmonizes the composite image in both the spatial domain and the frequency domain. The dual-domain generator performs harmonization using AdaIN modules in the spatial domain and our proposed ResFFT modules in the frequency domain. The dual-domain discriminator attempts to distinguish the inharmonious patches based on the spatial feature and frequency feature of each patch, which can enhance the ability of the generator in an adversarial manner. Extensive experiments on the benchmark dataset show the effectiveness of our method. Our code and model are available at https://github.com/bcmi/PHDNet-Painterly-Image-Harmonization.
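
For readers unfamiliar with the spatial-domain operation mentioned above, here is a minimal AdaIN sketch; this is the standard adaptive instance normalization formula, not the paper's full dual-domain generator, and the ResFFT frequency modules are not reproduced.

```python
# Standard AdaIN: match the channel-wise statistics of content features to style features.
import torch

def adain(content: torch.Tensor, style: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    """content, style: feature maps [B, C, H, W]; returns content re-normalized so its
    per-channel mean/std match those of the style (e.g., background) features."""
    c_mean = content.mean(dim=(2, 3), keepdim=True)
    c_std = content.std(dim=(2, 3), keepdim=True) + eps
    s_mean = style.mean(dim=(2, 3), keepdim=True)
    s_std = style.std(dim=(2, 3), keepdim=True) + eps
    return (content - c_mean) / c_std * s_std + s_mean

# e.g. harmonized_fg_feat = adain(foreground_feat, background_feat)
```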

AAAI Conference 2023 Conference Paper

Video Object of Interest Segmentation

  • Siyuan Zhou
  • Chunru Zhan
  • Biao Wang
  • Tiezheng Ge
  • Yuning Jiang
  • Li Niu

In this work, we present a new computer vision task named video object of interest segmentation (VOIS). Given a video and a target image of interest, our objective is to simultaneously segment and track all objects in the video that are relevant to the target image. This problem combines the traditional video object segmentation task with an additional image indicating the content that users are concerned with. Since no existing dataset is perfectly suitable for this new task, we specifically construct a large-scale dataset called LiveVideos, which contains 2418 pairs of target images and live videos with instance-level annotations. In addition, we propose a transformer-based method for this task. We revisit Swin Transformer and design a dual-path structure to fuse video and image features. Then, a transformer decoder is employed to generate object proposals for segmentation and tracking from the fused features. Extensive experiments on LiveVideos dataset show the superiority of our proposed method.

AAAI Conference 2022 Conference Paper

Action-Aware Embedding Enhancement for Image-Text Retrieval

  • Jiangtong Li
  • Li Niu
  • Liqing Zhang

Image-text retrieval plays a central role in bridging vision and language, aiming to reduce the semantic discrepancy between images and texts. Most existing works rely on refined word and object representations obtained through data-oriented methods to capture word-object co-occurrence. Such approaches are prone to ignoring the asymmetric action relation between images and texts, that is, the text has an explicit action representation (i.e., a verb phrase) while the image only contains implicit action information. In this paper, we propose the Action-aware Memory-Enhanced embedding (AME) method for image-text retrieval, which aims to emphasize the action information when mapping images and texts into a shared embedding space. Specifically, we integrate action prediction along with an action-aware memory bank to enrich the image and text features with action-similar text features. The effectiveness of our proposed AME method is verified by comprehensive experimental results on two benchmark datasets.

IJCAI Conference 2022 Conference Paper

Deep Video Harmonization With Color Mapping Consistency

  • Xinyuan Lu
  • Shengyuan Huang
  • Li Niu
  • Wenyan Cong
  • Liqing Zhang

Video harmonization aims to adjust the foreground of a composite video to make it compatible with the background. So far, video harmonization has received only limited attention and there is no public dataset for video harmonization. In this work, we construct a new video harmonization dataset HYouTube by adjusting the foreground of real videos to create synthetic composite videos. Moreover, we consider the temporal consistency in the video harmonization task. Unlike previous works which establish spatial correspondence, we design a novel framework based on the assumption of color mapping consistency, which leverages the color mapping of neighboring frames to refine the current frame. Extensive experiments on our HYouTube dataset prove the effectiveness of our proposed framework. Our dataset and code are available at https://github.com/bcmi/Video-Harmonization-Dataset-HYouTube.
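
The color-mapping-consistency assumption can be sketched in a hedged way as follows: estimate a per-channel color mapping (a lookup table) from a neighboring frame's composite/harmonized pair and apply it to the current composite frame. Function and variable names are illustrative, not from the paper's code.

```python
# Hedged sketch: per-channel LUT estimated from frame t-1, applied to frame t.
import numpy as np

def estimate_lut(composite: np.ndarray, harmonized: np.ndarray) -> np.ndarray:
    """composite, harmonized: uint8 images [H, W, 3] of the same frame.
    Returns a [256, 3] lookup table mapping composite intensities to harmonized ones."""
    lut = np.zeros((256, 3), dtype=np.float32)
    for c in range(3):
        src, dst = composite[..., c].ravel(), harmonized[..., c].ravel()
        for v in range(256):
            hits = dst[src == v]
            lut[v, c] = hits.mean() if hits.size else v  # identity where unobserved
    return lut

def apply_lut(frame: np.ndarray, lut: np.ndarray) -> np.ndarray:
    out = np.stack([lut[frame[..., c], c] for c in range(3)], axis=-1)
    return out.clip(0, 255).astype(np.uint8)

# Usage: refine frame t with the color mapping estimated from frame t-1.
# lut = estimate_lut(composite_prev, harmonized_prev)
# refined_current = apply_lut(composite_current, lut)
```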

AAAI Conference 2022 Conference Paper

Inharmonious Region Localization by Magnifying Domain Discrepancy

  • Jing Liang
  • Li Niu
  • Penghao Wu
  • Fengjun Guo
  • Teng Long

Inharmonious region localization aims to localize the region in a synthetic image which is incompatible with the surrounding background. The inharmony issue is mainly attributed to the color and illumination inconsistency produced by image editing techniques. In this work, we propose to transform the input image to another color space to magnify the domain discrepancy between the inharmonious region and the background, so that the model can identify the inharmonious region more easily. To this end, we present a novel framework consisting of a color mapping module and an inharmonious region localization network, in which the former is equipped with a novel domain discrepancy magnification loss and the latter can be an arbitrary localization network. Extensive experiments on the image harmonization dataset show the superiority of our designed framework.

AAAI Conference 2022 Conference Paper

Shadow Generation for Composite Image in Real-World Scenes

  • Yan Hong
  • Li Niu
  • Jianfu Zhang

Image composition targets at inserting a foreground object into a background image. Most previous image composition methods focus on adjusting the foreground to make it compatible with the background while ignoring the shadow effect of the foreground on the background. In this work, we focus on generating a plausible shadow for the foreground object in the composite image. First, we contribute a real-world shadow generation dataset DESOBA by generating synthetic composite images based on paired real images and deshadowed images. Then, we propose a novel shadow generation network SGRNet, which consists of a shadow mask prediction stage and a shadow filling stage. In the shadow mask prediction stage, foreground and background information are thoroughly interacted to generate the foreground shadow mask. In the shadow filling stage, shadow parameters are predicted to fill the shadow area. Extensive experiments on our DESOBA dataset and real composite images demonstrate the effectiveness of our proposed method. Our dataset and code are available at https://github.com/bcmi/Object-Shadow-Generation-Dataset-DESOBA.

NeurIPS Conference 2022 Conference Paper

UniGAN: Reducing Mode Collapse in GANs using a Uniform Generator

  • Ziqi Pan
  • Li Niu
  • Liqing Zhang

Despite the significant progress that has been made in the training of Generative Adversarial Networks (GANs), the mode collapse problem remains a major challenge, which refers to a lack of diversity in generated samples. In this paper, we propose a new type of generative diversity named uniform diversity, which relates to a newly proposed type of mode collapse named u-mode collapse, where the generated samples distribute non-uniformly over the data manifold. From a geometric perspective, we show that uniform diversity is closely related to the generator uniformity property, and the maximum uniform diversity is achieved if the generator is uniform. To learn a uniform generator, we propose UniGAN, a generative framework with a Normalizing Flow based generator and a simple yet sample-efficient generator uniformity regularization, which can be easily adapted to any other generative framework. A new type of diversity metric named udiv is also proposed to estimate the uniform diversity given a set of generated samples in practice. Experimental results verify the effectiveness of our UniGAN in learning a uniform generator and improving uniform diversity.

NeurIPS Conference 2022 Conference Paper

Weak-shot Semantic Segmentation via Dual Similarity Transfer

  • Junjie Chen
  • Li Niu
  • Siyuan Zhou
  • Jianlou Si
  • Chen Qian
  • Liqing Zhang

Semantic segmentation is a practical and active task, but suffers severely from the expensive cost of pixel-level labels when extending to more classes in wider applications. To this end, we focus on the problem named weak-shot semantic segmentation, where novel classes are learnt from cheaper image-level labels with the support of base classes having off-the-shelf pixel-level labels. To tackle this problem, we propose a dual similarity transfer framework, which is built upon MaskFormer to disentangle the semantic segmentation task into single-label classification and binary segmentation for each proposal. Specifically, the binary segmentation sub-task allows proposal-pixel similarity transfer from base classes to novel classes, which enables the mask learning of novel classes. We also learn pixel-pixel similarity from base classes and distill such class-agnostic semantic similarity to the semantic masks of novel classes, which regularizes the segmentation model with pixel-level semantic relationships across images. In addition, we propose a complementary loss to facilitate the learning of novel classes. Comprehensive experiments on the challenging COCO-Stuff-10K and ADE20K datasets demonstrate the effectiveness of our method.

AAAI Conference 2021 Conference Paper

Activity Image-to-Video Retrieval by Disentangling Appearance and Motion

  • Liu Liu
  • Jiangtong Li
  • Li Niu
  • Ruicong Xu
  • Liqing Zhang

With the rapid emergence of video data, image-to-video retrieval has attracted much attention. There are two types of image-to-video retrieval: instance-based and activity-based. The former task aims to retrieve videos containing the same main objects as the query image, while the latter focuses on finding videos with a similar activity. Since dynamic information plays a significant role in videos, we pay attention to the latter task to explore the motion relation between images and videos. In this paper, we propose a Motion-assisted Activity Proposal-based Image-to-Video Retrieval (MAP-IVR) approach to disentangle video features into motion features and appearance features and obtain appearance features from the images. Then, we perform image-to-video translation to improve the disentanglement quality. The retrieval is performed in both the appearance and video feature spaces. Extensive experiments demonstrate that our MAP-IVR approach remarkably outperforms the state-of-the-art approaches on two benchmark activity-based video datasets.

AAAI Conference 2021 Conference Paper

Depth Privileged Object Detection in Indoor Scenes via Deformation Hallucination

  • Zhijie Zhang
  • Yan Liu
  • Junjie Chen
  • Li Niu
  • Liqing Zhang

RGB-D object detection has achieved significant advances, because depth provides complementary geometric information to RGB images. Considering that depth images are unavailable in some scenarios, we focus on depth privileged object detection in indoor scenes, where the depth images are only available in the training stage. Under this setting, one prevalent research line is modality hallucination, in which depth images and depth features are common hallucination targets. In contrast, we choose to hallucinate depth deformation, which benefits a lot from the rich geometric information in depth data. Specifically, we employ a deformable convolutional layer with augmented offsets to perform geometric deformation, because the offsets enable flexibly sampling over the object and transforming it to a canonical shape for ease of object detection. In addition, we design a quality-based weighted transfer loss to avoid negative transfer of depth deformation. Experimental results on NYUDv2 and SUN RGB-D demonstrate the effectiveness of our method against the state-of-the-art methods for depth privileged object detection.
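
To make the deformation mechanism concrete, here is a hedged sketch of a deformable convolution driven by predicted offsets, built on torchvision's DeformConv2d; the offset predictor and block structure are illustrative assumptions, not the paper's hallucination architecture or transfer loss.

```python
# Illustrative deformable convolution block with predicted sampling offsets.
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DeformBlock(nn.Module):
    def __init__(self, channels: int, kernel_size: int = 3):
        super().__init__()
        pad = kernel_size // 2
        # Two offset values (dy, dx) per kernel location, predicted from the features.
        self.offset_pred = nn.Conv2d(channels, 2 * kernel_size * kernel_size,
                                     kernel_size, padding=pad)
        self.deform = DeformConv2d(channels, channels, kernel_size, padding=pad)

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        offset = self.offset_pred(feat)   # [B, 2*k*k, H, W] sampling offsets
        return self.deform(feat, offset)  # features resampled according to the offsets

# block = DeformBlock(64); out = block(torch.randn(1, 64, 32, 32))
```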

AAAI Conference 2021 Conference Paper

Disentangled Information Bottleneck

  • Ziqi Pan
  • Li Niu
  • Jianfu Zhang
  • Liqing Zhang

The information bottleneck (IB) method is a technique for extracting information that is relevant for predicting the target random variable from the source random variable, which is typically implemented by optimizing the IB Lagrangian that balances the compression and prediction terms. However, the IB Lagrangian is hard to optimize, and multiple trials for tuning the value of the Lagrangian multiplier are required. Moreover, we show that the prediction performance strictly decreases as the compression gets stronger when optimizing the IB Lagrangian. In this paper, we implement the IB method from the perspective of supervised disentangling. Specifically, we introduce the Disentangled Information Bottleneck (DisenIB), which consistently compresses the source maximally without loss of target prediction performance (maximum compression). Theoretical and experimental results demonstrate that our method is consistent on maximum compression, and performs well in terms of generalization, robustness to adversarial attack, out-of-distribution detection, and supervised disentangling.

NeurIPS Conference 2021 Conference Paper

Mixed Supervised Object Detection by Transferring Mask Prior and Semantic Similarity

  • Yan Liu
  • Zhijie Zhang
  • Li Niu
  • Junjie Chen
  • Liqing Zhang

Object detection has achieved promising success, but requires large-scale fully-annotated data, which is time-consuming and labor-intensive to collect. Therefore, we consider object detection with mixed supervision, which learns novel object categories using weak annotations with the help of full annotations of existing base object categories. Previous works using mixed supervision mainly learn the class-agnostic objectness from fully-annotated categories, which can be transferred to upgrade the weak annotations to pseudo full annotations for novel categories. In this paper, we further transfer mask prior and semantic similarity to bridge the gap between novel categories and base categories. Specifically, the ability of using mask prior to help detect objects is learned from base categories and transferred to novel categories. Moreover, the semantic similarity between objects learned from base categories is transferred to denoise the pseudo full annotations for novel categories. Experimental results on three benchmark datasets demonstrate the effectiveness of our method over existing methods. Codes are available at https://github.com/bcmi/TraMaS-Weak-Shot-Object-Detection.

NeurIPS Conference 2021 Conference Paper

Weak-shot Fine-grained Classification via Similarity Transfer

  • Junjie Chen
  • Li Niu
  • Liu Liu
  • Liqing Zhang

Recognizing fine-grained categories remains a challenging task, due to the subtle distinctions among different subordinate categories, which results in the need for abundant annotated samples. To alleviate the data-hungry problem, we consider the problem of learning novel categories from web data with the support of a clean set of base categories, which is referred to as weak-shot learning. In this setting, we propose a method called SimTrans to transfer pairwise semantic similarity from base categories to novel categories. Specifically, we first train a similarity net on clean data, and then leverage the transferred similarity to denoise web training data using two simple yet effective strategies. In addition, we apply an adversarial loss on the similarity net to enhance the transferability of similarity. Comprehensive experiments demonstrate the effectiveness of our weak-shot setting and our SimTrans method.
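
As a hedged sketch of how transferred pairwise similarity might be used to denoise web images, the snippet below keeps the images that agree most with the rest of their (noisy) class; the mean-similarity rule and keep ratio are illustrative assumptions, not necessarily the paper's exact strategies.

```python
# Illustrative denoising: keep web images with high average similarity to their class.
import numpy as np

def filter_web_images(similarity: np.ndarray, keep_ratio: float = 0.8) -> np.ndarray:
    """similarity: [N, N] pairwise semantic similarities (N > 1) predicted by the
    similarity net for web images of one novel class. Returns indices to retain."""
    n = similarity.shape[0]
    mean_sim = (similarity.sum(axis=1) - np.diag(similarity)) / (n - 1)
    n_keep = max(1, int(keep_ratio * n))
    return np.argsort(-mean_sim)[:n_keep]  # indices of the most "agreeing" images

# kept_idx = filter_web_images(sim_matrix, keep_ratio=0.8)
```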

AAAI Conference 2020 Conference Paper

A Proposal-Based Approach for Activity Image-to-Video Retrieval

  • Ruicong Xu
  • Li Niu
  • Jianfu Zhang
  • Liqing Zhang

The activity image-to-video retrieval task aims to retrieve videos containing a similar activity to the query image, which is challenging because videos generally have many background segments irrelevant to the activity. In this paper, we utilize the R-C3D model to represent a video by a bag of activity proposals, which can filter out background segments to some extent. However, there are still noisy proposals in each bag. Thus, we propose an Activity Proposal-based Image-to-Video Retrieval (APIVR) approach, which incorporates multi-instance learning into a cross-modal retrieval framework to address the proposal noise issue. Specifically, we propose a Graph Multi-Instance Learning (GMIL) module with a graph convolutional layer, and integrate this module with classification loss, adversarial loss, and triplet loss in our cross-modal retrieval framework. Moreover, we propose a geometry-aware triplet loss based on point-to-subspace distance to preserve the structural information of activity proposals. Extensive experiments on three widely-used datasets verify the effectiveness of our approach.

AAAI Conference 2020 Conference Paper

Exploiting Motion Information from Unlabeled Videos for Static Image Action Recognition

  • Yiyi Zhang
  • Li Niu
  • Ziqi Pan
  • Meichao Luo
  • Jianfu Zhang
  • Dawei Cheng
  • Liqing Zhang

Static image action recognition, which aims to recognize actions based on a single image, usually relies on expensive human labeling effort, such as adequate labeled action images and large-scale labeled image datasets. In contrast, abundant unlabeled videos can be economically obtained. Therefore, several works have explored using unlabeled videos to facilitate image action recognition, which can be categorized into the following two groups: (a) enhance visual representations of action images with a designed proxy task on unlabeled videos, which falls into the scope of self-supervised learning; (b) generate auxiliary representations for action images with a generator learned from unlabeled videos. In this paper, we integrate the above two strategies in a unified framework, which consists of a Visual Representation Enhancement (VRE) module and a Motion Representation Augmentation (MRA) module. Specifically, the VRE module includes a proxy task which imposes a pseudo motion label constraint and a temporal coherence constraint on unlabeled videos, while the MRA module can predict the motion information of a static action image by exploiting unlabeled videos. We demonstrate the superiority of our framework on four benchmark human action datasets with limited labeled data.

AAAI Conference 2020 Conference Paper

Image Cropping with Composition and Saliency Aware Aesthetic Score Map

  • Yi Tu
  • Li Niu
  • Weijie Zhao
  • Dawei Cheng
  • Liqing Zhang

Aesthetic image cropping is a practical but challenging task which aims at finding the best crops with the highest aesthetic quality in an image. Recently, many deep learning methods have been proposed to address this problem, but they do not reveal the intrinsic mechanism of aesthetic evaluation. In this paper, we propose an interpretable image cropping model to unveil the mystery. For each image, we use a fully convolutional network to produce an aesthetic score map, which is shared among all candidate crops during crop-level aesthetic evaluation. Then, we require the aesthetic score map to be both composition-aware and saliency-aware. In particular, the same region is assigned different aesthetic scores based on its relative positions in different crops. Moreover, a visually salient region is supposed to have more sensitive aesthetic scores so that our network can learn to place salient objects at more proper positions. Such an aesthetic score map can be used to localize aesthetically important regions in an image, which sheds light on the composition rules learned by our model. We show the competitive performance of our model in the image cropping task on several benchmark datasets, and also demonstrate its generality in real-world applications.
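
Crop-level evaluation with a shared score map can be sketched very simply: average a per-pixel score map inside each candidate crop and pick the best crop. This deliberately omits the composition-aware re-weighting (where a region's score depends on its relative position within the crop) and uses illustrative names.

```python
# Simplified sketch: score candidate crops by averaging a shared aesthetic score map.
import numpy as np

def crop_score(score_map: np.ndarray, crop: tuple) -> float:
    """score_map: [H, W] per-pixel aesthetic scores; crop: (top, left, height, width)."""
    t, l, h, w = crop
    return float(score_map[t:t + h, l:l + w].mean())

# best_crop = max(candidate_crops, key=lambda c: crop_score(score_map, c))
```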

IJCAI Conference 2017 Conference Paper

Robust Survey Aggregation with Student-t Distribution and Sparse Representation

  • Qingtao Tang
  • Tao Dai
  • Li Niu
  • Yisen Wang
  • Shu-Tao Xia
  • Jianfei Cai

Most existing survey aggregation methods assume that the sample data follow a Gaussian distribution. However, these methods are sensitive to outliers, due to the thin-tailed property of the Gaussian distribution. To address this issue, we propose a robust survey aggregation method based on the Student-t distribution and sparse representation. Specifically, we assume that the samples follow a Student-t distribution, instead of the common Gaussian distribution. Due to the Student-t distribution, our method is robust to outliers, which can be explained from both a Bayesian point of view and a non-Bayesian point of view. In addition, inspired by the James-Stein estimator (JS) and Compressive Averaging (CAvg), we propose to sparsely represent the global mean vector by an adaptive basis comprising both a data-specific basis and combined generic bases. Theoretically, we prove that JS and CAvg are special cases of our method. Extensive experiments demonstrate that our proposed method achieves significant improvement over the state-of-the-art methods on both synthetic and real datasets.

IJCAI Conference 2017 Conference Paper

Student-t Process Regression with Student-t Likelihood

  • Qingtao Tang
  • Li Niu
  • Yisen Wang
  • Tao Dai
  • Wangpeng An
  • Jianfei Cai
  • Shu-Tao Xia

Gaussian Process Regression (GPR) is a powerful Bayesian method. However, the performance of GPR can be significantly degraded when the training data are contaminated by outliers, including target outliers and input outliers. Although there are some variants of GPR (e.g., GPR with Student-t likelihood (GPRT)) aiming to handle outliers, most of the variants focus on handling the target outliers while little effort has been made to deal with the input outliers. In contrast, in this work, we aim to handle both the target outliers and the input outliers at the same time. Specifically, we replace the Gaussian noise in GPR with independent Student-t noise to cope with the target outliers. Moreover, to enhance the robustness w.r.t. the input outliers, we use a Student-t Process prior instead of the common Gaussian Process prior, leading to Student-t Process Regression with Student-t Likelihood (TPRT). We theoretically show that TPRT is more robust to both input and target outliers than GPR and GPRT, and prove that both GPR and GPRT are special cases of TPRT. Various experiments demonstrate that TPRT outperforms GPR and its variants on both synthetic and real datasets.
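
As a small, hedged illustration of why heavy-tailed Student-t assumptions resist outliers better than Gaussian ones (a toy example with scipy, not the TPRT model itself): the sample mean, which is the Gaussian maximum-likelihood location, is dragged toward outliers, while a fitted Student-t location stays near the inlier mode.

```python
# Toy comparison: Gaussian mean vs. Student-t location under 5% outliers.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
samples = np.concatenate([rng.normal(5.0, 1.0, 95), rng.normal(50.0, 1.0, 5)])

gaussian_mean = samples.mean()            # pulled toward the outliers
df, loc, scale = stats.t.fit(samples)     # heavy-tailed fit down-weights outliers
print(f"Gaussian mean: {gaussian_mean:.2f}, Student-t location: {loc:.2f}")
```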