Arrow Research search

Author name cluster

Qi Tian

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

55 papers
1 author row

Possible papers (55)

AAAI Conference 2026 Conference Paper

A Principle-Driven Adaptive Policy for Group Cognitive Stimulation Dialogue for Elderly with Cognitive Impairment

  • Jiyue Jiang
  • Yanyu Chen
  • Pengan Chen
  • Kai Liu
  • Jingqi Zhou
  • Zheyong Zhu
  • He Hu
  • Fei Ma

Cognitive impairment is becoming a major public health challenge. Cognitive Stimulation Therapy (CST) is an effective intervention for cognitive impairment, but traditional methods are difficult to scale, and existing digital systems struggle with group dialogues and cognitive stimulation principles. While Large Language Models (LLMs) are powerful, their application in this context faces key challenges: the absence of established cognitive stimulation dialogue paradigms, a lack of therapeutic reasoning, and static-only user modeling. To address these issues, we propose a principle-driven adaptive policy actualized through a Group Cognitive Stimulation Dialogue (GCSD) system. We first construct a dataset with over 500 hours of real-world CST conversations and 10,000+ simulated dialogues generated via our Principle-Guided Scenario Simulation strategy. Our GCSD system then integrates four core modules to overcome LLM limitations: (i) a multi-speaker context controller to resolve role confusion; (ii) dynamic participant cognitive state modeling for personalized interaction; (iii) a cognitive stimulation-focused attention loss to instill cognitive stimulation reasoning; and (iv) a multi-dimensional reward strategy to enhance response value. Experimental results demonstrate that GCSD significantly outperforms baseline models across various evaluation metrics. Future work will focus on long-term clinical validation to bridge the gap between computational performance and clinical efficacy.

AAAI Conference 2026 Conference Paper

Dereflection Any Image with Diffusion Priors and Diversified Data

  • Jichen Hu
  • Chen Yang
  • Zanwei Zhou
  • Jiemin Fang
  • Qi Tian
  • Wei Shen

Reflection removal of a single image remains a highly challenging task due to the complex entanglement between target scenes and unwanted reflections. Despite significant progress, existing methods are hindered by the scarcity of high-quality, diverse data and insufficient restoration priors, resulting in limited generalization across various real-world scenarios. In this paper, we propose Dereflection Any Image, a comprehensive solution with an efficient data preparation pipeline and a generalizable model for robust reflection removal. First, we introduce a dataset named Diverse Reflection Removal (DRR) created by randomly rotating reflective mediums in target scenes, enabling variation of reflection angles and intensities, and setting a new benchmark in scale, quality, and diversity. Second, we propose a diffusion-based framework with one-step diffusion for deterministic outputs and fast inference. To ensure stable learning, we design a three-stage progressive training strategy, including reflection-invariant finetuning to encourage consistent outputs across varying reflection patterns that characterize our dataset. Extensive experiments show that our method achieves SOTA performance on both common benchmarks and challenging in-the-wild images, showing superior generalization across diverse real-world scenes.

AAAI Conference 2026 Conference Paper

Few-step Flow for 3D Generation via Marginal-Data Transport Distillation

  • Zanwei Zhou
  • Taoran Yi
  • Jiemin Fang
  • Chen Yang
  • Lingxi Xie
  • Xinggang Wang
  • Wei Shen
  • Qi Tian

Flow-based 3D generation models typically require dozens of sampling steps during inference. Though few-step distillation methods, particularly Consistency Models (CMs), have achieved substantial advancements in accelerating 2D diffusion models, they remain under-explored for more complex 3D generation tasks. In this study, we propose a novel framework, MDT-dist, for few-step 3D flow distillation. Our approach is built upon a primary objective: distilling the pretrained model to learn the Marginal-Data Transport. Directly learning this objective requires integrating the velocity fields, but this integral is intractable to compute in practice. Therefore, we propose two optimizable objectives, Velocity Matching (VM) and Velocity Distillation (VD), to equivalently convert the optimization target from the transport level to the velocity level and the distribution level, respectively. Velocity Matching (VM) learns to stably match the velocity fields between the student and the teacher, but inevitably provides biased gradient estimates. Velocity Distillation (VD) further enhances the optimization process by leveraging the learned velocity fields to perform probability density distillation. When evaluated on the pioneering 3D generation framework TRELLIS, our method reduces the sampling steps of each flow transformer from 25 to 1–2, achieving 0.68s (1 step x2) and 0.94s (2 steps x2) latency with 9.0x and 6.5x speedups on an A800, while preserving high visual and geometric fidelity. Experiments demonstrate that our method significantly outperforms existing CM distillation methods, and enables TRELLIS to achieve superior performance in few-step 3D generation.
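
The Velocity Matching idea can be pictured as a simple regression between the student's and the frozen teacher's velocity fields. The sketch below is illustrative only; the function names, the `(x_t, t)` signature, and the plain MSE form are assumptions, and the paper's actual VM/VD objectives and weighting are not reproduced here.

```python
import torch

def velocity_matching_loss(student, teacher, x_t, t):
    """Minimal sketch of a velocity-matching objective: the student's predicted
    velocity at (x_t, t) is regressed toward the frozen teacher's prediction.
    The paper's full method also uses Velocity Distillation (VD) at the
    distribution level, which is omitted here."""
    with torch.no_grad():
        v_teacher = teacher(x_t, t)      # frozen teacher velocity field
    v_student = student(x_t, t)          # few-step student velocity field
    return torch.mean((v_student - v_teacher) ** 2)
```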

AAAI Conference 2026 Conference Paper

O-DisCo-Edit: Object Distortion Control for Unified Realistic Video Editing

  • Yuqing Chen
  • Junjie Wang
  • Lin Liu
  • Ruihang Chu
  • Xiaopeng Zhang
  • Qi Tian
  • Yujiu Yang

Diffusion models have recently advanced video editing, yet controllable editing remains challenging due to the need for precise manipulation of diverse object properties. Current methods require different control signals for diverse editing tasks, which complicates model design and demands significant training resources. To address this, we propose O-DisCo-Edit, a unified framework that incorporates a novel object distortion control (O-DisCo). This signal, based on random and adaptive noise, flexibly encapsulates a wide range of editing cues within a single representation. Paired with a “copy-form” preservation module that protects non-edited regions, O-DisCo-Edit enables efficient, high-fidelity editing through an effective training paradigm. Extensive experiments and comprehensive human evaluations consistently demonstrate that O-DisCo-Edit surpasses both specialized and multitask state-of-the-art methods across various video editing tasks.

AAAI Conference 2026 Conference Paper

SFedHIFI: Fire Rate-Based Heterogeneous Information Fusion for Spiking Federated Learning

  • Ran Tao
  • Qiugang Zhan
  • Shantian Yang
  • Xiurui Xie
  • Qi Tian
  • Guisong Liu

Spiking Federated Learning (SFL) has been widely studied owing to the energy efficiency of Spiking Neural Networks (SNNs). However, existing SFL methods require model homogeneity and assume all clients have sufficient computational resources, resulting in the exclusion of some resource-constrained clients. To address the prevalent system heterogeneity in real-world scenarios, it is crucial to enable heterogeneous SFL systems that allow clients to adaptively deploy models of different scales based on their local resources. To this end, we introduce SFedHIFI, a novel Spiking Federated Learning framework with Fire Rate-Based Heterogeneous Information Fusion. Specifically, SFedHIFI employs channel-wise matrix decomposition to deploy SNN models of adaptive complexity on clients with heterogeneous resources. Building on this, the proposed heterogeneous information fusion module enables cross-scale aggregation among models of different widths, thereby enhancing the utilization of diverse local knowledge. Extensive experiments on three public benchmarks demonstrate that SFedHIFI can effectively enable heterogeneous SFL, consistently outperforming all three baseline methods. Compared with ANN-based FL, it achieves significant energy savings with only a marginal trade-off in accuracy.

AAAI Conference 2026 Conference Paper

WorldGrow: Generating Infinite 3D World

  • Sikuang Li
  • Chen Yang
  • Jiemin Fang
  • Taoran Yi
  • Jia Lu
  • Jiazhong Cen
  • Lingxi Xie
  • Wei Shen

We tackle the challenge of generating the infinitely extendable 3D world -- large, continuous environments with coherent geometry and realistic appearance. Existing methods face key challenges: 2D-lifting approaches suffer from geometric and appearance inconsistencies across views, 3D implicit representations are hard to scale up, and current 3D foundation models are mostly object-centric, limiting their applicability to scene-level generation. Our key insight is leveraging strong generation priors from pre-trained 3D models for structured scene block generation. To this end, we propose WorldGrow, a hierarchical framework for unbounded 3D scene synthesis. Our method features three core components: (1) a data curation pipeline that extracts high-quality scene blocks for training, making the 3D structured latent representations suitable for scene generation; (2) a 3D block inpainting mechanism that enables context-aware scene extension; and (3) a coarse-to-fine generation strategy that ensures both global layout plausibility and local geometric/textural fidelity. Evaluated on the large-scale 3D-FRONT dataset, WorldGrow achieves SOTA performance in geometry reconstruction, while uniquely supporting infinite scene generation with photorealistic and structurally consistent outputs. These results highlight its capability for constructing large-scale virtual environments and potential for building future world models.

AAAI Conference 2025 Conference Paper

Boosting Segment Anything Model Towards Open-Vocabulary Learning

  • Xumeng Han
  • Longhui Wei
  • Xuehui Yu
  • Zhiyang Dou
  • Xin He
  • Kuiran Wang
  • Yingfei Sun
  • Zhenjun Han

The recent Segment Anything Model (SAM) has emerged as a new paradigmatic vision foundation model, showcasing potent zero-shot generalization and flexible prompting. Despite SAM finding applications and adaptations in various domains, its primary limitation lies in the inability to grasp object semantics. In this paper, we present Sambor to seamlessly integrate SAM with the open-vocabulary object detector in an end-to-end framework. While retaining all the remarkable capabilities inherent to SAM, we boost it to detect arbitrary objects from human inputs like category names or reference expressions. Building upon the SAM image encoder, we introduce a novel SideFormer module designed to acquire SAM features adept at perceiving objects and inject comprehensive semantic information for recognition. In addition, we devise an Open-set RPN that leverages SAM proposals to assist in finding potential objects. Consequently, Sambor enables the open-vocabulary detector to equally focus on generalizing both localization and classification sub-tasks. Our approach demonstrates superior zero-shot performance across benchmarks, including COCO and LVIS, proving highly competitive against previous state-of-the-art methods. We aspire for this work to serve as a meaningful endeavor in endowing SAM to recognize diverse object categories and advancing open-vocabulary learning with the support of vision foundation models.

NeurIPS Conference 2025 Conference Paper

Diffusion-Driven Progressive Target Manipulation for Source-Free Domain Adaptation

  • Yuyang Huang
  • Yabo Chen
  • Junyu Zhou
  • Wenrui Dai
  • Xiaopeng Zhang
  • Junni Zou
  • Hongkai Xiong
  • Qi Tian

Source-free domain adaptation (SFDA) is a challenging task that tackles domain shifts using only a pre-trained source model and unlabeled target data. Existing SFDA methods are restricted by the fundamental limitation of source-target domain discrepancy. Non-generation SFDA methods suffer from unreliable pseudo-labels in challenging scenarios with large domain discrepancies, while generation-based SFDA methods are evidently degraded due to enlarged domain discrepancies in creating pseudo-source data. To address this limitation, we propose a novel generation-based framework named Diffusion-Driven Progressive Target Manipulation (DPTM) that leverages unlabeled target data as references to reliably generate and progressively refine a pseudo-target domain for SFDA. Specifically, we divide the target samples into a trust set and a non-trust set based on the reliability of pseudo-labels to sufficiently and reliably exploit their information. For samples from the non-trust set, we develop a manipulation strategy to semantically transform them into the newly assigned categories, while simultaneously maintaining them in the target distribution via a latent diffusion model. Furthermore, we design a progressive refinement mechanism that progressively reduces the domain discrepancy between the pseudo-target domain and the real target domain via iterative refinement. Experimental results demonstrate that DPTM outperforms existing methods by a large margin and achieves state-of-the-art performance on four prevailing SFDA benchmark datasets with different scales. Remarkably, DPTM can significantly enhance the performance by up to 18.6% in scenarios with large source-target gaps.
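
The trust/non-trust split can be illustrated with a simple confidence rule. A minimal sketch follows, assuming the split uses the maximum softmax probability of the source model's predictions; the actual reliability criterion in DPTM may differ.

```python
def split_trust_sets(probs, threshold=0.9):
    """Sketch of dividing target samples into a trust set and a non-trust set.
    probs: (N, C) softmax outputs of the source model on target samples.
    threshold: hypothetical confidence cutoff, not the paper's exact rule."""
    confidence, pseudo_labels = probs.max(dim=1)
    trust_mask = confidence >= threshold      # reliable pseudo-labels
    return trust_mask, pseudo_labels          # ~trust_mask selects the non-trust set
```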

IJCAI Conference 2025 Conference Paper

Incorporating Visual Experts to Resolve the Information Loss in Multimodal Large Language Models

  • Xin He
  • Longhui Wei
  • Lingxi Xie
  • Qi Tian

Multimodal Large Language Models (MLLMs) are experiencing rapid growth, yielding a plethora of novel works recently. The prevailing trend involves adopting data-driven methodologies, wherein diverse instruction-following datasets are collected. However, these approaches always face the challenge of limited visual perception capabilities, as they solely utilize CLIP-like encoders to extract visual information from inputs. Though these encoders are pre-trained on billions of image-text pairs, they still grapple with the information loss dilemma, given that textual captions only partially capture the contents depicted in images. To address this limitation, this paper proposes to improve the visual perception ability of MLLMs through a mixture-of-experts knowledge enhancement mechanism. Specifically, this work introduces a novel method that incorporates multi-task encoders and existing visual tools into the MLLMs training and inference pipeline, aiming to provide a more comprehensive summarization of visual inputs. Extensive experiments have evaluated its effectiveness in advancing MLLMs, showcasing improved visual perception capability achieved through the integration of visual experts.

AAAI Conference 2025 Conference Paper

Infinite-Canvas: Higher-Resolution Video Outpainting with Extensive Content Generation

  • Qihua Chen
  • Yue Ma
  • Hongfa Wang
  • Junkun Yuan
  • Wenzhe Zhao
  • Qi Tian
  • Hongmei Wang
  • Shaobo Min

This paper explores higher-resolution video outpainting with extensive content generation. We point out common issues faced by existing methods when attempting to outpaint videos extensively: the generation of low-quality content and limitations imposed by GPU memory. To address these challenges, we propose a diffusion-based method called Infinite-Canvas. It builds upon two core designs. First, instead of employing the common practice of "single-shot" outpainting, we distribute the task across spatial windows and seamlessly merge them. This allows us to outpaint videos of any size and resolution without being constrained by GPU memory. Second, the source video and its relative positional relation are injected into the generation process of each window. This makes the generated spatial layout within each window harmonize with the source video. Coupling these two designs enables us to generate higher-resolution outpainting videos with rich content while keeping spatial and temporal consistency. Infinite-Canvas excels in large-scale video outpainting, e.g., from 512 × 512 to 1152 × 2048 (9×), while producing high-quality and aesthetically pleasing results. It achieves the best quantitative results across various resolution and scale setups. The code is available at https://github.com/mayuelala/FollowYourCanvas.
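
The window-wise strategy amounts to tiling the target canvas into overlapping crops that each fit in GPU memory, then blending the per-window results. Below is a minimal sketch of the tiling step only; the window size and overlap are arbitrary illustrative values, not the authors' settings.

```python
def split_windows(height, width, win=512, overlap=128):
    """Tile an output canvas of size (height, width) into overlapping windows.
    Each (y0, x0, y1, x1) region would be outpainted separately and the overlaps
    blended afterwards; win/overlap here are illustrative, not the paper's values."""
    step = win - overlap
    ys = list(range(0, max(height - win, 0) + 1, step))
    xs = list(range(0, max(width - win, 0) + 1, step))
    if ys[-1] + win < height:
        ys.append(height - win)      # make sure the bottom edge is covered
    if xs[-1] + win < width:
        xs.append(width - win)       # make sure the right edge is covered
    return [(y, x, y + win, x + win) for y in ys for x in xs]

# e.g. cover a 1152 x 2048 canvas with overlapping 512 x 512 windows
print(len(split_windows(1152, 2048)))
```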

NeurIPS Conference 2025 Conference Paper

MagCache: Fast Video Generation with Magnitude-Aware Cache

  • Zehong Ma
  • Longhui Wei
  • Feng Wang
  • Shiliang Zhang
  • Qi Tian

Existing acceleration techniques for video diffusion models often rely on uniform heuristics or time-embedding variants to skip timesteps and reuse cached features. These approaches typically require extensive calibration with curated prompts and risk inconsistent outputs due to prompt-specific overfitting. In this paper, we introduce a novel and robust discovery: a unified magnitude law observed across different models and prompts. Specifically, the magnitude ratio of successive residual outputs decreases monotonically: steadily over most timesteps and rapidly over the last several steps. Leveraging this insight, we introduce a Magnitude-aware Cache (MagCache) that adaptively skips unimportant timesteps using an error modeling mechanism and an adaptive caching strategy. Unlike existing methods requiring dozens of curated samples for calibration, MagCache only requires a single sample for calibration. Experimental results show that MagCache achieves 2.10×-2.68× speedups on Open-Sora, CogVideoX, Wan 2.1, and HunyuanVideo, while preserving superior visual fidelity. It significantly outperforms existing methods in LPIPS, SSIM, and PSNR under similar computational budgets.
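
A magnitude-aware cache of this kind can be sketched as a small controller that reuses the previous residual while an accumulated error estimate, derived from the calibrated magnitude ratios, stays under a budget. The class below is an illustrative sketch only; the names, error model, and threshold are assumptions and not MagCache's actual implementation.

```python
class MagnitudeAwareCacheSketch:
    """Illustrative residual cache: skip a denoising step and reuse the cached
    residual while the accumulated magnitude-ratio error stays under a budget."""

    def __init__(self, error_budget=0.06):
        self.error_budget = error_budget   # hypothetical tolerance
        self.cached_residual = None
        self.accum_error = 0.0

    def step(self, x, model_fn, ratio_estimate):
        # ratio_estimate: calibrated magnitude ratio of successive residuals (~1.0)
        step_error = abs(1.0 - ratio_estimate)
        if self.cached_residual is not None and self.accum_error + step_error < self.error_budget:
            self.accum_error += step_error
            return x + self.cached_residual          # reuse: no forward pass
        residual = model_fn(x) - x                   # full forward pass
        self.cached_residual = residual
        self.accum_error = 0.0
        return x + residual
```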

JBHI Journal 2025 Journal Article

Multi-Scale Group Agent Attention-Based Graph Convolutional Decoding Networks for 2D Medical Image Segmentation

  • Zhichao Wang
  • Lin Guo
  • Shuchang Zhao
  • Shiqing Zhang
  • Xiaoming Zhao
  • Jiangxiong Fang
  • Guoyu Wang
  • Hongsheng Lu

Automated medical image segmentation plays a crucial role in assisting doctors in diagnosing diseases. Feature decoding is a critical yet challenging issue for medical image segmentation. To address this issue, this work proposes a novel feature decoding network, called multi-scale group agent attention-based graph convolutional decoding networks (MSGAA-GCDN), to learn local-global features in graph structures for 2D medical image segmentation. The proposed MSGAA-GCDN combines graph convolutional network (GCN) and a lightweight multi-scale group agent attention (MSGAA) mechanism to represent features globally and locally within a graph structure. Moreover, in skip connections a simple yet efficient attention-based upsampling convolution fusion (AUCF) module is designed to enhance encoder-decoder feature fusion in both channel and spatial dimensions. Extensive experiments are conducted on three typical medical image segmentation tasks, namely Synapse abdominal multi-organs, Cardiac organs, and Polyp lesions. Experimental results demonstrate that the proposed MSGAA-GCDN outperforms the state-of-the-art methods, and the designed MSGAA is a lightweight yet effective attention architecture. The proposed MSGAA-GCDN can be easily taken as a plug-and-play decoder cascaded with other encoders for general medical image segmentation tasks.

AAAI Conference 2025 Conference Paper

Optimize Incompatible Parameters Through Compatibility-aware Knowledge Integration

  • Zheqi Lv
  • Keming Ye
  • Zishu Wei
  • Qi Tian
  • Shengyu Zhang
  • Wenqiao Zhang
  • Wenjie Wang
  • Kun Kuang

Deep neural networks have become foundational to advancements in multiple domains, including recommendation systems, natural language processing, and so on. Despite their successes, these models often contain incompatible parameters that can be underutilized or detrimental to model performance, particularly when faced with specific, varying data distributions. Existing research excels in removing such parameters or merging the outputs of multiple different pretrained models. However, the former focuses on efficiency rather than performance, while the latter requires several times more computing and storage resources to support inference. In this paper, we set the goal to explicitly improve these incompatible parameters by leveraging the complementary strengths of different models, thereby directly enhancing the models without any additional parameters. Specifically, we propose Compatibility-aware Knowledge Integration (CKI), which consists of Parameter Compatibility Assessment and Parameter Splicing, which are used to evaluate the knowledge content of multiple models and integrate the knowledge into one model, respectively. The integrated model can be used directly for inference or for further fine-tuning. Extensive experiments on various recommendation and language datasets show that CKI can effectively optimize incompatible parameters under multiple tasks and settings to break through the training limit of the original model without increasing the inference cost.
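
Parameter Splicing can be pictured as a compatibility-weighted merge of two checkpoints. The function below is a minimal sketch under assumed interfaces; how the per-parameter compatibility scores are assessed is the paper's contribution and is not reproduced here.

```python
def splice_parameters(state_a, state_b, compat):
    """Merge two state dicts with per-tensor compatibility weights in [0, 1].
    state_a, state_b: dicts of parameter tensors with matching keys and shapes.
    compat: dict mapping parameter name -> weight for model A (assumed given)."""
    merged = {}
    for name, param_a in state_a.items():
        w = float(compat.get(name, 0.5))          # fall back to an even blend
        merged[name] = w * param_a + (1.0 - w) * state_b[name]
    return merged
```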

JBHI Journal 2025 Journal Article

PFPRNet: A Phase-Wise Feature Pyramid With Retention Network for Polyp Segmentation

  • Jinghui Chu
  • Wangtao Liu
  • Qi Tian
  • Wei Lu

Early detection of colonic polyps is crucial for the prevention and diagnosis of colorectal cancer. Currently, deep learning-based polyp segmentation methods have become mainstream and achieved remarkable results. Acquiring a large number of labeled data is time-consuming and labor-intensive, and meanwhile the presence of numerous similar wrinkles in polyp images also hampers model prediction performance. In this paper, we propose a novel approach called Phase-wise Feature Pyramid with Retention Network (PFPRNet), which leverages a pre-trained Transformer-based Encoder to obtain multi-scale feature maps. A Phase-wise Feature Pyramid with Retention Decoder is designed to gradually integrate global features into local features and guide the model's attention towards key regions. Additionally, our custom Enhance Perception module enables capturing image information from a broader perspective. Finally, we introduce an innovative Low-layer Retention module as an alternative to Transformer for more efficient global attention modeling. Evaluation results on several widely-used polyp segmentation datasets demonstrate that our proposed method has strong learning ability and generalization capability, and outperforms the state-of-the-art approaches.

AAAI Conference 2025 Conference Paper

Segment Any 3D Gaussians

  • Jiazhong Cen
  • Jiemin Fang
  • Chen Yang
  • Lingxi Xie
  • Xiaopeng Zhang
  • Wei Shen
  • Qi Tian

This paper presents SAGA (Segment Any 3D GAussians), a highly efficient 3D promptable segmentation method based on 3D Gaussian Splatting (3D-GS). Given 2D visual prompts as input, SAGA can segment the corresponding 3D target represented by 3D Gaussians within 4 ms. This is achieved by attaching a scale-gated affinity feature to each 3D Gaussian to endow it with a new property for multi-granularity segmentation. Specifically, a scale-aware contrastive training strategy is proposed for the scale-gated affinity feature learning. It 1) distills the segmentation capability of the Segment Anything Model (SAM) from 2D masks into the affinity features and 2) employs a soft scale gate mechanism to deal with multi-granularity ambiguity in 3D segmentation by adjusting the magnitude of each feature channel according to a specified 3D physical scale. Evaluations demonstrate that SAGA achieves real-time multi-granularity segmentation with quality comparable to state-of-the-art methods. As one of the first methods addressing promptable segmentation in 3D-GS, the simplicity and effectiveness of SAGA pave the way for future advancements in this field.
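
The soft scale gate can be sketched as a tiny network that maps a queried 3D physical scale to per-channel gates applied to each Gaussian's affinity feature. The module below is illustrative; the feature dimension, gate network, and interface are assumptions rather than SAGA's actual design.

```python
import torch
import torch.nn as nn

class SoftScaleGate(nn.Module):
    """Gate affinity-feature channels by a queried 3D physical scale (sketch)."""

    def __init__(self, feat_dim=32):
        super().__init__()
        self.gate_mlp = nn.Sequential(nn.Linear(1, feat_dim), nn.Sigmoid())

    def forward(self, affinity_feats, scale):
        # affinity_feats: (num_gaussians, feat_dim); scale: float physical scale
        gate = self.gate_mlp(torch.tensor([[scale]], dtype=torch.float32))
        return affinity_feats * gate    # channels irrelevant at this scale are damped
```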

AAAI Conference 2024 Conference Paper

LION: Implicit Vision Prompt Tuning

  • Haixin Wang
  • Jianlong Chang
  • Yihang Zhai
  • Xiao Luo
  • Jinan Sun
  • Zhouchen Lin
  • Qi Tian

Despite recent promising performances across a range of vision tasks, vision Transformers still have an issue of high computational costs. Recently, vision prompt learning has provided an economical solution to this problem without fine-tuning the whole large-scale model. However, the efficiency and effectiveness of existing models are still far from satisfactory due to the parameter cost of extensive prompt blocks and tricky prompt framework designs. In this paper, we propose a light-weight prompt framework named impLicit vIsion prOmpt tuNing (LION), which is motivated by deep implicit models with stable low memory costs for various complex tasks. In particular, we merely insert two equilibrium implicit layers at the two ends of the pre-trained backbone with its parameters frozen. Moreover, according to the lottery ticket hypothesis, we further prune the parameters to relieve the computation burden in implicit layers. Various experiments have validated that our LION obtains promising performances on a wide range of datasets. Most importantly, LION reduces the number of training parameters by up to 11.5% while obtaining higher performance than the state-of-the-art VPT, especially in challenging scenes. Furthermore, we find that our proposed LION has excellent generalization performance, making it an easy way to boost transfer learning in the future.
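
An equilibrium (implicit) layer of the kind referenced here computes a fixed point rather than a fixed-depth stack of layers. The toy layer below illustrates that idea only; LION's actual layer design, solver, and the constant-memory implicit differentiation trick are not reproduced.

```python
import torch
import torch.nn as nn

class ToyImplicitLayer(nn.Module):
    """Solve z* = tanh(W z* + x) by naive fixed-point iteration (illustrative;
    real deep-equilibrium layers use implicit differentiation to save memory)."""

    def __init__(self, dim, iters=20):
        super().__init__()
        self.lin = nn.Linear(dim, dim)
        self.iters = iters

    def forward(self, x):
        z = torch.zeros_like(x)
        for _ in range(self.iters):     # iterate toward the equilibrium point
            z = torch.tanh(self.lin(z) + x)
        return z
```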

AAAI Conference 2024 Conference Paper

LogFormer: A Pre-train and Tuning Pipeline for Log Anomaly Detection

  • Hongcheng Guo
  • Jian Yang
  • Jiaheng Liu
  • Jiaqi Bai
  • Boyang Wang
  • Zhoujun Li
  • Tieqiao Zheng
  • Bo Zhang

Log anomaly detection is a key component in the field of artificial intelligence for IT operations (AIOps). Considering log data from different domains, retraining the whole network for unknown domains is inefficient in real industrial scenarios. However, previous deep models merely focused on extracting the semantics of log sequences in the same domain, leading to poor generalization on multi-domain logs. To alleviate this issue, we propose a unified Transformer-based framework for Log anomaly detection (LogFormer) to improve the generalization ability across different domains, where we establish a two-stage process including a pre-training stage and an adapter-based tuning stage. Specifically, our model is first pre-trained on the source domain to obtain shared semantic knowledge of log data. Then, we transfer such knowledge to the target domain via shared parameters. Besides, the Log-Attention module is proposed to supplement the information ignored by log parsing. The proposed method is evaluated on three public datasets and one real-world dataset. Experimental results on multiple benchmarks demonstrate the effectiveness of our LogFormer with fewer trainable parameters and lower training costs.

NeurIPS Conference 2023 Conference Paper

AiluRus: A Scalable ViT Framework for Dense Prediction

  • Jin Li
  • Yaoming Wang
  • Xiaopeng Zhang
  • Bowen Shi
  • Dongsheng Jiang
  • Chenglin Li
  • Wenrui Dai
  • Hongkai Xiong

Vision transformers (ViTs) have emerged as a prevalent architecture for vision tasks owing to their impressive performance. However, their complexity dramatically increases when handling long token sequences, particularly for dense prediction tasks that require high-resolution input. Notably, dense prediction tasks, such as semantic segmentation or object detection, focus more on the contours or shapes of objects, while the texture inside objects is less informative. Motivated by this observation, we propose to apply adaptive resolution for different regions in the image according to their importance. Specifically, at the intermediate layer of the ViT, we select anchors from the token sequence using the proposed spatial-aware density-based clustering algorithm. Tokens that are adjacent to anchors are merged to form low-resolution regions, while others are preserved independently as high-resolution. This strategy could significantly reduce the number of tokens, and the following layers only handle the reduced token sequence for acceleration. At the output end, the resolution of the feature map is recovered by unfolding merged tokens for task prediction. Consequently, we can considerably accelerate ViTs for dense prediction tasks. The proposed method is evaluated across three different datasets and demonstrates promising performance. For instance, "Segmenter ViT-L" can be accelerated by 48% in FPS without fine-tuning, while maintaining performance. Moreover, our method can also be applied to accelerate fine-tuning. Experiments indicate that we can save 52% of the training time while accelerating FPS by 2.46× with only a 0.09% performance drop.
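
Anchor-based token merging can be pictured as averaging each low-importance token into its assigned anchor while keeping the remaining tokens at full resolution, so later layers process a shorter sequence. The sketch below assumes the anchor assignment is already given; the spatial-aware density-based clustering and the unfolding step at the output are omitted.

```python
import torch

def merge_tokens_to_anchors(tokens, anchor_idx, assign):
    """tokens: (N, C) token features; anchor_idx: (K,) indices of anchor tokens;
    assign: (N,) anchor id in [0, K) for tokens to be merged, or -1 to keep a
    token unmerged. Returns a shorter sequence of K merged + unmerged tokens."""
    merged = []
    for k in range(anchor_idx.numel()):
        group = tokens[assign == k]
        merged.append(group.mean(dim=0) if group.numel() > 0 else tokens[anchor_idx[k]])
    kept = tokens[assign == -1]                      # high-resolution tokens
    return torch.cat([torch.stack(merged), kept], dim=0)
```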

AAAI Conference 2023 Conference Paper

DE-net: Dynamic Text-Guided Image Editing Adversarial Networks

  • Ming Tao
  • Bing-Kun Bao
  • Hao Tang
  • Fei Wu
  • Longhui Wei
  • Qi Tian

Text-guided image editing models have shown remarkable results. However, there remain two problems. First, they employ fixed manipulation modules for various editing requirements (e.g., color changing, texture changing, content adding and removing), which results in over-editing or insufficient editing. Second, they do not clearly distinguish between text-required and text-irrelevant parts, which leads to inaccurate editing. To solve these limitations, we propose: (i) a Dynamic Editing Block (DEBlock) that composes different editing modules dynamically for various editing requirements. (ii) a Composition Predictor (Comp-Pred), which predicts the composition weights for DEBlock according to the inference on target texts and source images. (iii) a Dynamic text-adaptive Convolution Block (DCBlock) that queries source image features to distinguish text-required parts and text-irrelevant parts. Extensive experiments demonstrate that our DE-Net achieves excellent performance and manipulates source images more correctly and accurately.

AAAI Conference 2023 Conference Paper

Fine-Grained Retrieval Prompt Tuning

  • Shijie Wang
  • Jianlong Chang
  • Zhihui Wang
  • Haojie Li
  • Wanli Ouyang
  • Qi Tian

Fine-grained object retrieval aims to learn discriminative representation to retrieve visually similar objects. However, existing top-performing works usually impose pairwise similarities on the semantic embedding spaces or design a localization sub-network to continually fine-tune the entire model in limited data scenarios, thus resulting in convergence to suboptimal solutions. In this paper, we develop Fine-grained Retrieval Prompt Tuning (FRPT), which steers a frozen pre-trained model to perform the fine-grained retrieval task from the perspectives of sample prompting and feature adaptation. Specifically, FRPT only needs to learn fewer parameters in the prompt and adaptation instead of fine-tuning the entire model, thus solving the issue of convergence to suboptimal solutions caused by fine-tuning the entire model. Technically, a discriminative perturbation prompt (DPP) is introduced and deemed as a sample prompting process, which amplifies and even exaggerates some discriminative elements contributing to category prediction via a content-aware inhomogeneous sampling operation. In this way, DPP can make the fine-grained retrieval task aided by the perturbation prompts close to the solved task during the original pre-training. Thereby, it preserves the generalization and discrimination of representation extracted from input samples. Besides, a category-specific awareness head is proposed and regarded as feature adaptation, which removes the species discrepancies in features extracted by the pre-trained model using category-guided instance normalization. And thus, it makes the optimized features only include the discrepancies among subcategories. Extensive experiments demonstrate that our FRPT with fewer learnable parameters achieves the state-of-the-art performance on three widely-used fine-grained datasets.

AAAI Conference 2023 Conference Paper

Learning from Good Trajectories in Offline Multi-Agent Reinforcement Learning

  • Qi Tian
  • Kun Kuang
  • Furui Liu
  • Baoxiang Wang

Offline multi-agent reinforcement learning (MARL) aims to learn effective multi-agent policies from pre-collected datasets, which is an important step toward the deployment of multi-agent systems in real-world applications. However, in practice, each individual behavior policy that generates multi-agent joint trajectories usually performs at a different level, e.g., one agent may follow a random policy while the other agents follow medium-quality policies. In the cooperative game with global reward, one agent learned by existing offline MARL often inherits this random policy, jeopardizing the utility of the entire team. In this paper, we investigate offline MARL with explicit consideration of the diversity of agent-wise trajectories and propose a novel framework called Shared Individual Trajectories (SIT) to address this problem. Specifically, an attention-based reward decomposition network assigns the credit to each agent through a differentiable key-value memory mechanism in an offline manner. These decomposed credits are then used to reconstruct the joint offline datasets into prioritized experience replay with individual trajectories, so that agents can then share their good trajectories and conservatively train their policies with a graph attention network (GAT) based critic. We evaluate our method in both discrete control (i.e., StarCraft II and the multi-agent particle environment) and continuous control (i.e., multi-agent MuJoCo). The results indicate that our method achieves significantly better results in complex and mixed offline multi-agent datasets, especially when the difference in data quality between individual trajectories is large.

NeurIPS Conference 2023 Conference Paper

Learning to Parameterize Visual Attributes for Open-set Fine-grained Retrieval

  • Shijie Wang
  • Jianlong Chang
  • Haojie Li
  • Zhihui Wang
  • Wanli Ouyang
  • Qi Tian

Open-set fine-grained retrieval is an emerging challenging task that allows to retrieve unknown categories beyond the training set. The best solution for handling unknown categories is to represent them using a set of visual attributes learnt from known categories, as widely used in zero-shot learning. Though important, attribute modeling usually requires significant manual annotations and thus is labor-intensive. Therefore, it is worth investigating how to transform retrieval models trained by image-level supervision from category semantic extraction to attribute modeling. To this end, we propose a novel Visual Attribute Parameterization Network (VAPNet) to learn visual attributes from known categories and parameterize them into the retrieval model, without the involvement of any attribute annotations. In this way, VAPNet could utilize its parameters to parse a set of visual attributes from unknown categories and precisely represent them. Technically, VAPNet explicitly attains some semantics with rich details by making use of local image patches and distills the visual attributes from these discovered semantics. Additionally, it integrates the online refinement of these visual attributes into the training process to iteratively enhance their quality. Simultaneously, VAPNet treats these attributes as supervisory signals to tune the retrieval models, thereby achieving attribute parameterization. Extensive experiments on open-set fine-grained retrieval datasets validate the superior performance of our VAPNet over existing solutions.

AAAI Conference 2023 Conference Paper

Low-Light Video Enhancement with Synthetic Event Guidance

  • Lin Liu
  • Junfeng An
  • Jianzhuang Liu
  • Shanxin Yuan
  • Xiangyu Chen
  • Wengang Zhou
  • Houqiang Li
  • Yan Feng Wang

Low-light video enhancement (LLVE) is an important yet challenging task with many applications such as photographing and autonomous driving. Unlike single image low-light enhancement, most LLVE methods utilize temporal information from adjacent frames to restore the color and remove the noise of the target frame. However, these algorithms, based on the framework of multi-frame alignment and enhancement, may produce multi-frame fusion artifacts when encountering extreme low light or fast motion. In this paper, inspired by the low latency and high dynamic range of events, we use synthetic events from multiple frames to guide the enhancement and restoration of low-light videos. Our method contains three stages: 1) event synthesis and enhancement, 2) event and image fusion, and 3) low-light enhancement. In this framework, we design two novel modules (event-image fusion transform and event-guided dual branch) for the second and third stages, respectively. Extensive experiments show that our method outperforms existing low-light video or single image enhancement approaches on both synthetic and real LLVE datasets. Our code will be available at https://gitee.com/mindspore/models/tree/master/research/cv/LLVE-SEG.

NeurIPS Conference 2023 Conference Paper

Parameter-efficient Tuning of Large-scale Multimodal Foundation Model

  • Haixin Wang
  • Xinlong Yang
  • Jianlong Chang
  • Dian Jin
  • Jinan Sun
  • Shikun Zhang
  • Xiao Luo
  • Qi Tian

Driven by the progress of large-scale pre-training, parameter-efficient transfer learning has gained immense popularity across different subfields of Artificial Intelligence. The core is to adapt the model to downstream tasks with only a small set of parameters. Recently, researchers have leveraged such proven techniques in multimodal tasks and achieved promising results. However, two critical issues remain unresolved: how to further reduce complexity with a lightweight design and how to boost alignment between modalities under extremely low parameter budgets. In this paper, we propose A gracefUl pRompt framewOrk for cRoss-modal trAnsfer (AURORA) to overcome these challenges. Considering the redundancy in existing architectures, we first utilize the mode approximation to generate 0.1M trainable parameters to implement the multimodal parameter-efficient tuning, which explores the low intrinsic dimension with only 0.04% of the parameters of the pre-trained model. Then, for better modality alignment, we propose the Informative Context Enhancement and Gated Query Transformation modules for extremely low-parameter settings. A thorough evaluation on six cross-modal benchmarks shows that it not only outperforms the state-of-the-art but even outperforms the full fine-tuning approach. Our code is available at https://github.com/WillDreamer/Aurora.

NeurIPS Conference 2023 Conference Paper

Segment Anything in 3D with NeRFs

  • Jiazhong Cen
  • Zanwei Zhou
  • Jiemin Fang
  • Chen Yang
  • Wei Shen
  • Lingxi Xie
  • Dongsheng Jiang
  • Xiaopeng Zhang

Recently, the Segment Anything Model (SAM) emerged as a powerful vision foundation model that is capable of segmenting anything in 2D images. This paper aims to generalize SAM to segment 3D objects. Rather than replicating the data acquisition and annotation procedure which is costly in 3D, we design an efficient solution, leveraging the Neural Radiance Field (NeRF) as a cheap and off-the-shelf prior that connects multi-view 2D images to the 3D space. We refer to the proposed solution as SA3D, for Segment Anything in 3D. It is only required to provide a manual segmentation prompt (e.g., rough points) for the target object in a single view, which is used to generate its 2D mask in this view with SAM. Next, SA3D alternately performs mask inverse rendering and cross-view self-prompting across various views to iteratively complete the 3D mask of the target object constructed with voxel grids. The former projects the 2D mask obtained by SAM in the current view onto a 3D mask with the guidance of the density distribution learned by the NeRF; the latter extracts reliable prompts automatically as the input to SAM from the NeRF-rendered 2D mask in another view. We show in experiments that SA3D adapts to various scenes and achieves 3D segmentation within minutes. Our research offers a generic and efficient methodology to lift a 2D vision foundation model to 3D, as long as the 2D model can steadily address promptable segmentation across multiple views.

JBHI Journal 2023 Journal Article

Self-Supervised Tumor Segmentation With Sim2Real Adaptation

  • Xiaoman Zhang
  • Weidi Xie
  • Chaoqin Huang
  • Ya Zhang
  • Xin Chen
  • Qi Tian
  • Yanfeng Wang

This paper targets self-supervised tumor segmentation. We make the following contributions: (i) taking inspiration from the observation that tumors are often characterised independently of their contexts, we propose a novel proxy task, “layer-decomposition”, that closely matches the goal of the downstream task, and design a scalable pipeline for generating synthetic tumor data for pre-training; (ii) we propose a two-stage Sim2Real training regime for unsupervised tumor segmentation, where we first pre-train a model with simulated tumors, and then adopt a self-training strategy for downstream data adaptation; (iii) when evaluating on different tumor segmentation benchmarks, e.g., BraTS2018 for brain tumor segmentation and LiTS2017 for liver tumor segmentation, our approach achieves state-of-the-art segmentation performance under the unsupervised setting. While transferring the model for tumor segmentation under a low-annotation regime, the proposed approach also outperforms all existing self-supervised approaches; (iv) we conduct extensive ablation studies to analyse the critical components in data simulation, and validate the necessity of different proxy tasks. We demonstrate that, with sufficient texture randomization in simulation, a model trained on synthetic data can effortlessly generalise to datasets with real tumors.

AAAI Conference 2023 Conference Paper

ShiftDDPMs: Exploring Conditional Diffusion Models by Shifting Diffusion Trajectories

  • Zijian Zhang
  • Zhou Zhao
  • Jun Yu
  • Qi Tian

Diffusion models have recently exhibited remarkable abilities to synthesize striking image samples since the introduction of denoising diffusion probabilistic models (DDPMs). Their key idea is to disrupt images into noise through a fixed forward process and learn its reverse process to generate samples from noise in a denoising way. For conditional DDPMs, most existing practices relate conditions only to the reverse process and fit it to the reversal of the unconditional forward process. We find that this limits condition modeling and generation to a small time window. In this paper, we propose a novel and flexible conditional diffusion model by introducing conditions into the forward process. We utilize an extra latent space to allocate an exclusive diffusion trajectory for each condition based on some shifting rules, which will disperse condition modeling to all timesteps and improve the learning capacity of the model. We formulate our method, which we call ShiftDDPMs, and provide a unified point of view on existing related methods. Extensive qualitative and quantitative experiments on image synthesis demonstrate the feasibility and effectiveness of ShiftDDPMs.

AAAI Conference 2022 Conference Paper

Can Semantic Labels Assist Self-Supervised Visual Representation Learning?

  • Longhui Wei
  • Lingxi Xie
  • Jianzhong He
  • Xiaopeng Zhang
  • Qi Tian

Recently, contrastive learning has largely advanced the progress of unsupervised visual representation learning. Pretrained on ImageNet, some self-supervised algorithms reported higher transfer learning performance compared to fully-supervised methods, seeming to deliver the message that human labels hardly contribute to learning transferable visual features. In this paper, we defend the usefulness of semantic labels but point out that fully-supervised and self-supervised methods are pursuing different kinds of features. To alleviate this issue, we present a new algorithm named Supervised Contrastive Adjustment in Neighborhood (SCAN) that maximally prevents the semantic guidance from damaging the appearance feature embedding. In a series of downstream tasks, SCAN achieves superior performance compared to previous fully-supervised and self-supervised methods, and sometimes the gain is significant. More importantly, our study reveals that semantic labels are useful in assisting self-supervised methods, opening a new direction for the community.

NeurIPS Conference 2022 Conference Paper

ConfounderGAN: Protecting Image Data Privacy with Causal Confounder

  • Qi Tian
  • Kun Kuang
  • Kelu Jiang
  • Furui Liu
  • Zhihua Wang
  • Fei Wu

The success of deep learning is partly attributed to the availability of massive data downloaded freely from the Internet. However, it also means that users' private data may be collected by commercial organizations without consent and used to train their models. Therefore, it is important and necessary to develop a method or tool to prevent unauthorized data exploitation. In this paper, we propose ConfounderGAN, a generative adversarial network (GAN) that can make personal image data unlearnable to protect the data privacy of its owners. Specifically, the noise produced by the generator for each image has the confounder property. It can build spurious correlations between images and labels, so that the model cannot learn the correct mapping from images to labels in this noise-added dataset. Meanwhile, the discriminator is used to ensure that the generated noise is small and imperceptible, thereby retaining the normal utility of the encrypted image for humans. The experiments are conducted on six image classification datasets, including three natural object datasets and three medical datasets. The results demonstrate that our method not only outperforms state-of-the-art methods in standard settings, but can also be applied to fast encryption scenarios. Moreover, we show a series of transferability and stability experiments to further illustrate the effectiveness and superiority of our method.

NeurIPS Conference 2022 Conference Paper

Fine-Grained Semantically Aligned Vision-Language Pre-Training

  • Juncheng Li
  • Xin He
  • Longhui Wei
  • Long Qian
  • Linchao Zhu
  • Lingxi Xie
  • Yueting Zhuang
  • Qi Tian

Large-scale vision-language pre-training has shown impressive advances in a wide range of downstream tasks. Existing methods mainly model the cross-modal alignment by the similarity of the global representations of images and text, or advanced cross-modal attention upon image and text features. However, they fail to explicitly learn the fine-grained semantic alignment between visual regions and textual phrases, as only global image-text alignment information is available. In this paper, we introduce LOUPE, a fine-grained semantically aLigned visiOn-langUage PrE-training framework, which learns fine-grained semantic alignment from the novel perspective of game-theoretic interactions. To efficiently estimate the game-theoretic interactions, we further propose an uncertainty-aware neural Shapley interaction learning module. Experiments show that LOUPE achieves state-of-the-art performance on a variety of vision-language tasks. Without any object-level human annotations and fine-tuning, LOUPE achieves competitive performance on object detection and visual grounding. More importantly, LOUPE opens a new promising direction of learning fine-grained semantics from large-scale raw image-text pairs.

AAAI Conference 2022 Conference Paper

SiamTrans: Zero-Shot Multi-Frame Image Restoration with Pre-trained Siamese Transformers

  • Lin Liu
  • Shanxin Yuan
  • Jianzhuang Liu
  • Xin Guo
  • Youliang Yan
  • Qi Tian

We propose a novel zero-shot multi-frame image restoration method for removing unwanted obstruction elements (such as rains, snow, and moiré patterns) that vary in successive frames. It has three stages: transformer pre-training, zero-shot restoration, and hard patch refinement. Using the pre-trained transformers, our model is able to tell the motion difference between the true image information and the obstructing elements. For zero-shot image restoration, we design a novel model, termed SiamTrans, which is constructed by Siamese transformers, encoders, and decoders. Each transformer has a temporal attention layer and several self-attention layers, to capture both temporal and spatial information of multiple frames. Only pre-trained (self-supervised) on the denoising task, SiamTrans is tested on three different low-level vision tasks (deraining, demoiréing, and desnowing). Compared with related methods, ours achieves the best performances, even outperforming those with supervised learning.

AAAI Conference 2021 Conference Paper

Dual Distribution Alignment Network for Generalizable Person Re-Identification

  • Peixian Chen
  • Pingyang Dai
  • Jianzhuang Liu
  • Feng Zheng
  • Mingliang Xu
  • Qi Tian
  • Rongrong Ji

Domain generalization (DG) offers a preferable real-world setting for Person Re-Identification (Re-ID), which trains a model using multiple source domain datasets and expects it to perform well in an unseen target domain without any model updating. Unfortunately, most DG approaches are designed explicitly for classification tasks, which fundamentally differs from the retrieval task Re-ID. Moreover, existing applications of DG in Re-ID cannot correctly handle the massive variation among Re-ID datasets. In this paper, we identify two fundamental challenges in DG for Person Re-ID: domain-wise variations and identity-wise similarities. To this end, we propose an end-to-end Dual Distribution Alignment Network (DDAN) to learn domain-invariant features with dual-level constraints: the domain-wise adversarial feature learning and the identity-wise similarity enhancement. These constraints further reduce the domain shift among multiple source domains while remaining consistent with real-world scenarios. We evaluate our method on a large-scale DG Re-ID benchmark and compare it with various cutting-edge DG approaches. Quantitative results show that DDAN achieves state-of-the-art performance.

AAAI Conference 2021 Conference Paper

Fitting the Search Space of Weight-sharing NAS with Graph Convolutional Networks

  • Xin Chen
  • Lingxi Xie
  • Jun Wu
  • Longhui Wei
  • Yuhui Xu
  • Qi Tian

Neural architecture search has attracted wide attention in both academia and industry. To accelerate it, researchers proposed weight-sharing methods which first train a super-network to reuse computation among different operators, from which exponentially many sub-networks can be sampled and efficiently evaluated. These methods enjoy great advantages in terms of computational costs, but the sampled sub-networks are not guaranteed to be estimated precisely unless an individual training process is carried out. This paper attributes such inaccuracy to the inevitable mismatch between assembled network layers, so that there is a random error term added to each estimation. We alleviate this issue by training a graph convolutional network to fit the performance of sampled sub-networks so that the impact of random errors becomes minimal. With this strategy, we achieve a higher rank correlation coefficient in the selected set of candidates, which consequently leads to better performance of the final architecture. In addition, our approach also enjoys the flexibility of being used under different hardware constraints, since the graph convolutional network has provided an efficient lookup table of the performance of architectures in the entire search space.

NeurIPS Conference 2021 Conference Paper

Learning High-Precision Bounding Box for Rotated Object Detection via Kullback-Leibler Divergence

  • Xue Yang
  • Xiaojiang Yang
  • Jirui Yang
  • Qi Ming
  • Wentao Wang
  • Qi Tian
  • Junchi Yan

Existing rotated object detectors are mostly inherited from the horizontal detection paradigm, as the latter has evolved into a well-developed area. However, these detectors struggle to perform well in high-precision detection due to the limitations of current regression loss designs, especially for objects with large aspect ratios. Taking the perspective that horizontal detection is a special case of rotated object detection, in this paper, we are motivated to change the design of the rotation regression loss from an induction paradigm to a deduction methodology, in terms of the relation between rotation and horizontal detection. We show that one essential challenge is how to modulate the coupled parameters in the rotation regression loss, such that the estimated parameters can influence each other during the dynamic joint optimization in an adaptive and synergetic way. Specifically, we first convert the rotated bounding box into a 2-D Gaussian distribution, and then calculate the Kullback-Leibler Divergence (KLD) between the Gaussian distributions as the regression loss. By analyzing the gradient of each parameter, we show that KLD (and its derivatives) can dynamically adjust the parameter gradients according to the characteristics of the object. For instance, it will adjust the importance (gradient weight) of the angle parameter according to the aspect ratio. This mechanism can be vital for high-precision detection as a slight angle error would cause a serious accuracy drop for objects with large aspect ratios. More importantly, we have proved that KLD is scale invariant. We further show that the KLD loss can be degenerated into the popular Ln-norm loss for horizontal detection. Experimental results on seven datasets using different detectors show its consistent superiority, and the code is available at https://github.com/yangxue0827/RotationDetection.
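
The box-to-Gaussian conversion and the Gaussian KLD both have standard closed forms, which makes the loss easy to sketch. The NumPy version below is illustrative; conventions such as using half-extents for the covariance are assumptions, and the paper's exact parameterization and loss normalization may differ.

```python
import numpy as np

def rbox_to_gaussian(x, y, w, h, theta):
    """Rotated box (center x, y; size w, h; angle theta) -> 2-D Gaussian (mu, Sigma)."""
    mu = np.array([x, y])
    R = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
    S = np.diag([w / 2.0, h / 2.0])                  # half-extents (assumed convention)
    return mu, R @ S @ S @ R.T

def gaussian_kld(mu_p, sig_p, mu_q, sig_q):
    """Closed-form KL(N_p || N_q) between two 2-D Gaussians."""
    inv_q = np.linalg.inv(sig_q)
    diff = mu_q - mu_p
    return 0.5 * (np.trace(inv_q @ sig_p) + diff @ inv_q @ diff - 2.0
                  + np.log(np.linalg.det(sig_q) / np.linalg.det(sig_p)))

# A 5-degree angle error is penalized far more for an elongated 10 x 1 box than
# for a square 10 x 10 box (where rotation leaves the Gaussian unchanged).
for w, h in [(10, 1), (10, 10)]:
    p = rbox_to_gaussian(0, 0, w, h, 0.0)
    q = rbox_to_gaussian(0, 0, w, h, np.deg2rad(5))
    print((w, h), gaussian_kld(*p, *q))
```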

NeurIPS Conference 2021 Conference Paper

Rectifying the Shortcut Learning of Background for Few-Shot Learning

  • Xu Luo
  • Longhui Wei
  • Liangjian Wen
  • Jinrong Yang
  • Lingxi Xie
  • Zenglin Xu
  • Qi Tian

The category gap between training and evaluation has been characterised as one of the main obstacles to the success of Few-Shot Learning (FSL). In this paper, we for the first time empirically identify image background, common in realistic images, as a shortcut knowledge helpful for in-class classification but ungeneralizable beyond training categories in FSL. A novel framework, COSOC, is designed to tackle this problem by extracting foreground objects in images at both training and evaluation without any extra supervision. Extensive experiments carried on inductive FSL tasks demonstrate the effectiveness of our approaches.

TIST Journal 2020 Journal Article

A Novel Multi-task Tensor Correlation Neural Network for Facial Attribute Prediction

  • Mingxing Duan
  • Kenli Li
  • Keqin Li
  • Qi Tian

Multi-task learning plays an important role in face multi-attribute prediction. At present, most studies excavate the shared information between attributes by sharing all convolutional layers. However, it is not appropriate to treat the low-level and high-level features of facial multi-attributes equally, because the high-level features are more biased toward the specific content of the category. In this article, a novel multi-attribute tensor correlation neural network (MTCN) is used to predict face attributes. MTCN shares all attribute features at the low-level layers, and then distinguishes each attribute feature at the high-level layers. To better excavate the correlations among high-level attribute features, each sub-network explores useful information from other networks to enhance its original information. Then a tensor canonical correlation analysis method is used to seek the correlations among the highest-level attributes, which enhances the original information of each attribute. After that, these features are mapped into a highly correlated space through the correlation matrix. Finally, we conduct extensive experiments to verify the performance of MTCN on the CelebA and LFWA datasets, and our MTCN achieves the best performance compared with the latest multi-attribute recognition algorithms under the same settings.

IJCAI Conference 2020 Conference Paper

A Structured Latent Variable Recurrent Network With Stochastic Attention For Generating Weibo Comments

  • Shijie Yang
  • Liang Li
  • Shuhui Wang
  • Weigang Zhang
  • Qingming Huang
  • Qi Tian

Building intelligent agents to generate realistic Weibo comments is challenging. For such realistic Weibo comments, the key criterion is improving diversity while maintaining coherency. Considering that the variability of linguistic comments arises from multi-level sources, including both discourse-level properties and word-level selections, we improve the comment diversity by leveraging such inherent hierarchy. In this paper, we propose a structured latent variable recurrent network, which exploits the hierarchical-structured latent variables with stochastic attention to model the variations of comments. First, we endow both discourse-level and word-level latent variables with hierarchical and temporal dependencies for constructing multi-level hierarchy. Second, we introduce a stochastic attention to infer the key-words of interest in the input post. As a result, diverse comments can be generated with both discourse-level properties and local-word selections. Experiments on open-domain Weibo data show that our model generates more diverse and realistic comments.

AAAI Conference 2020 Conference Paper

Adversarial Domain Adaptation with Domain Mixup

  • Minghao Xu
  • Jian Zhang
  • Bingbing Ni
  • Teng Li
  • Chengjie Wang
  • Qi Tian
  • Wenjun Zhang

Recent works on domain adaptation reveal the effectiveness of adversarial learning in reducing the discrepancy between source and target domains. However, two common limitations exist in current adversarial-learning-based methods. First, samples from the two domains alone are not sufficient to ensure domain invariance over most of the latent space. Second, the domain discriminator involved in these methods can only judge real or fake under the guidance of a hard label, while it is more reasonable to use soft scores to evaluate the generated images or features, i.e., to fully utilize the inter-domain information. In this paper, we present adversarial domain adaptation with domain mixup (DM-ADA), which guarantees domain invariance in a more continuous latent space and guides the domain discriminator in judging samples' difference relative to the source and target domains. Domain mixup is jointly conducted at the pixel and feature levels to improve the robustness of the models. Extensive experiments prove that the proposed approach achieves superior performance on tasks with various degrees of domain shift and data complexity.
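
As a concrete illustration of the generic mixup operation the abstract refers to, the following NumPy sketch interpolates a source batch and a target batch and returns the mixing coefficient as a soft domain label for a discriminator; the function name, the Beta(alpha, alpha) sampling, and all hyper-parameters are illustrative assumptions, not the authors' DM-ADA code.

```python
# Minimal sketch of domain mixup (not the authors' DM-ADA implementation):
# interpolate source/target samples and give the domain discriminator a soft
# target equal to the mixing coefficient, instead of a hard 0/1 label.
import numpy as np

rng = np.random.default_rng(0)

def domain_mixup(x_src, x_tgt, alpha=2.0):
    """Pixel/feature-level mixup between a source and a target batch.

    Returns the mixed batch and the soft domain label lambda for each sample
    (lambda = 1 means purely source, lambda = 0 means purely target).
    """
    lam = rng.beta(alpha, alpha, size=(x_src.shape[0], 1))
    x_mix = lam * x_src + (1.0 - lam) * x_tgt
    return x_mix, lam.squeeze(-1)

# toy usage with flattened "features"
x_src = rng.normal(size=(4, 8))
x_tgt = rng.normal(size=(4, 8))
x_mix, soft_domain_label = domain_mixup(x_src, x_tgt)
print(x_mix.shape, soft_domain_label)  # (4, 8) and 4 soft labels in (0, 1)
```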

NeurIPS Conference 2020 Conference Paper

One-bit Supervision for Image Classification

  • Hengtong Hu
  • Lingxi Xie
  • Zewei Du
  • Richang Hong
  • Qi Tian

This paper presents one-bit supervision, a novel setting for learning from incomplete annotations in image classification. Instead of training a model on the accurate label of each sample, our setting requires the model to query with a predicted label for each sample and learn from the answer whether the guess is correct. This provides one bit (yes or no) of information, and, more importantly, annotating each sample becomes much easier than finding the accurate label among many candidate classes. There are two keys to training a model under one-bit supervision: improving the guess accuracy and making use of incorrect guesses. For these purposes, we propose a multi-stage training paradigm that incorporates negative label suppression into an off-the-shelf semi-supervised learning algorithm. On three popular image classification benchmarks, our approach achieves higher efficiency in utilizing the limited amount of annotations.
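
The query-and-suppress interaction can be made concrete with a toy simulation; the code below is an illustrative sketch of the one-bit exchange and of recording wrong guesses as negative labels, not the paper's multi-stage training pipeline.

```python
# Illustrative sketch of the one-bit query loop (not the paper's training code):
# the model guesses a class, the annotator answers yes/no, and wrong guesses are
# recorded as negative labels so that class can be suppressed later.
import numpy as np

rng = np.random.default_rng(0)
num_classes = 10
true_labels = rng.integers(0, num_classes, size=5)       # hidden from the model
model_scores = rng.normal(size=(5, num_classes))          # pretend predictions

positive_label = {}    # sample index -> confirmed class
negative_labels = {}   # sample index -> set of classes ruled out

for i, scores in enumerate(model_scores):
    guess = int(np.argmax(scores))
    answer_is_yes = (guess == true_labels[i])             # the one bit of information
    if answer_is_yes:
        positive_label[i] = guess
    else:
        negative_labels.setdefault(i, set()).add(guess)   # negative label suppression

print(positive_label, negative_labels)
```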

IJCAI Conference 2020 Conference Paper

Polar Relative Positional Encoding for Video-Language Segmentation

  • Ke Ning
  • Lingxi Xie
  • Fei Wu
  • Qi Tian

In this paper, we tackle a challenging task named video-language segmentation. Given a video and a sentence in natural language, the goal is to segment the object or actor described by the sentence in the video frames. To accurately denote a target object, the given sentence usually refers to multiple attributes, such as nearby objects with spatial relations. We propose a novel Polar Relative Positional Encoding (PRPE) mechanism that represents spatial relations in a "linguistic" way, i.e., in terms of direction and range. Sentence features can interact with positional embeddings in a more direct way to extract the implied relative positional relations. We also propose parameterized functions for these positional embeddings to adapt to real-valued directions and ranges. With PRPE, we design a Polar Attention Module (PAM) as the basic module for vision-language fusion. Our method outperforms the previous best method by a large margin of 11.4% absolute improvement in terms of mAP on the challenging A2D Sentences dataset. Our method also achieves competitive performance on the J-HMDB Sentences dataset.
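
To make the "direction and range" idea concrete, the sketch below converts relative offsets into polar coordinates and maps them to a toy sinusoidal embedding; the embedding function is an assumption for illustration only and is not the paper's parameterized PRPE functions.

```python
# Sketch of representing a relative position in polar form (direction, range),
# which is the core idea behind PRPE; the sinusoidal embedding below is an
# assumption for illustration, not the paper's parameterized functions.
import numpy as np

def polar_relative(positions, reference):
    """positions: (N, 2) pixel coordinates; reference: (2,) anchor point."""
    delta = positions - reference
    direction = np.arctan2(delta[:, 1], delta[:, 0])   # angle in (-pi, pi]
    rng_ = np.linalg.norm(delta, axis=1)               # range (distance)
    return direction, rng_

def toy_embedding(direction, rng_, dim=8):
    """Map (direction, range) to a fixed-size vector with sines/cosines."""
    freqs = 2.0 ** np.arange(dim // 4)
    ang = direction[:, None] * freqs
    rad = np.log1p(rng_)[:, None] * freqs
    return np.concatenate([np.sin(ang), np.cos(ang), np.sin(rad), np.cos(rad)], axis=1)

pos = np.array([[10.0, 5.0], [3.0, 40.0]])
d, r = polar_relative(pos, reference=np.array([8.0, 8.0]))
print(toy_embedding(d, r).shape)   # (2, 8)
```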

NeurIPS Conference 2020 Conference Paper

Self-Adaptively Learning to Demoiré from Focused and Defocused Image Pairs

  • Lin Liu
  • Shanxin Yuan
  • Jianzhuang Liu
  • Liping Bao
  • Gregory Slabaugh
  • Qi Tian

Moiré artifacts are common in digital photography, resulting from the interference between high-frequency scene content and the color filter array of the camera. Existing deep learning-based demoiréing methods trained on large-scale datasets are limited in handling various complex moiré patterns, and mainly focus on demoiréing photos taken of digital displays. Moreover, moiré-free ground truth in natural scenes is difficult to obtain but is needed for training. In this paper, we propose a self-adaptive learning method for demoiréing a high-frequency image with the help of an additional defocused, moiré-free blur image. Given an image degraded by moiré artifacts and a moiré-free blur image, our network predicts a moiré-free clean image and a blur kernel with a self-adaptive strategy that requires no explicit training stage and instead performs test-time adaptation. Our model has two sub-networks and works iteratively. During each iteration, one sub-network takes the moiré image as input, removing moiré patterns and restoring image details, while the other sub-network estimates the blur kernel from the blur image. The two sub-networks are jointly optimized. Extensive experiments demonstrate that our method outperforms state-of-the-art methods and produces high-quality demoiréd results. It also generalizes well to removing moiré artifacts caused by display screens. In addition, we build a new moiré dataset, including images with screen and texture moiré artifacts. As far as we know, this is the first dataset with real texture moiré patterns.
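
Purely as an illustration of the "no explicit training stage, adapt at test time" pattern, the PyTorch sketch below jointly optimizes a tiny restoration network and a free-form blur kernel on a single image pair through an assumed reconstruction constraint; the architectures, the kernel estimator, and the loss are placeholders and do not reproduce the paper's model.

```python
# Illustrative test-time-adaptation loop: one tiny network predicts a clean image,
# a free-form kernel stands in for the kernel-estimation sub-network, and both are
# optimized on a single (moire, defocused) pair. Not the paper's architecture or losses.
import torch
import torch.nn as nn
import torch.nn.functional as F

moire = torch.rand(1, 3, 64, 64)     # observed moire-degraded image
blur = torch.rand(1, 3, 64, 64)      # defocused, moire-free companion shot

demoire_net = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 3, 3, padding=1),
)
kernel_logits = nn.Parameter(torch.zeros(1, 1, 9, 9))   # stand-in "kernel estimator"
opt = torch.optim.Adam(list(demoire_net.parameters()) + [kernel_logits], lr=1e-3)

for step in range(5):                # adapt on this single image pair only
    clean = demoire_net(moire)
    k = torch.softmax(kernel_logits.flatten(), 0).view(1, 1, 9, 9).repeat(3, 1, 1, 1)
    reblurred = F.conv2d(clean, k, padding=4, groups=3)  # blur the clean estimate
    loss = F.l1_loss(reblurred, blur)  # assumed constraint: clean * kernel matches blur shot
    opt.zero_grad()
    loss.backward()
    opt.step()
print(float(loss))
```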

AAAI Conference 2020 Conference Paper

Single Camera Training for Person Re-Identification

  • Tianyu Zhang
  • Lingxi Xie
  • Longhui Wei
  • Yongfei Zhang
  • Bo Li
  • Qi Tian

Person re-identification (ReID) aims at finding the same person across different cameras. Training such systems usually requires a large number of cross-camera pedestrians to be annotated from surveillance videos, which is labor-consuming, especially when the number of cameras is large. In contrast, this paper investigates ReID in an unexplored single-camera-training (SCT) setting, where each person in the training set appears in only one camera. To the best of our knowledge, this setting has never been studied before. SCT enjoys the advantage of low-cost data collection and annotation, and thus makes it easier to train ReID systems in a brand-new environment. However, it raises major challenges due to the lack of cross-camera person occurrences, which conventional approaches heavily rely on to extract discriminative features. The key to dealing with the challenges of the SCT setting lies in designing an effective mechanism to complement the missing cross-camera annotation. We start with a regular deep network for feature extraction, upon which we propose a novel loss function named multi-camera negative loss (MCNL). This is a metric learning loss motivated by probability, suggesting that in a multi-camera system, one image is more likely to be closer to the most similar negative sample in other cameras than to the most similar negative sample in the same camera. In experiments, MCNL significantly boosts ReID accuracy in the SCT setting, which paves the way for fast deployment of ReID systems with good performance on new target scenes.
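
The probabilistic intuition behind MCNL can be sketched as a hinge term that compares the hardest same-camera negative with the hardest cross-camera negative; the margin value and the exact loss form below are illustrative assumptions, and the full MCNL in the paper contains additional terms.

```python
# Simplified NumPy sketch of the ranking intuition behind MCNL: for an anchor,
# the nearest negative from *other* cameras should be closer than the nearest
# negative from the *same* camera. Margin and loss form are illustrative only.
import numpy as np

def mcnl_style_term(anchor, negatives, cameras, anchor_cam, margin=0.1):
    d = np.linalg.norm(negatives - anchor, axis=1)      # distances to all negatives
    same = d[cameras == anchor_cam]
    cross = d[cameras != anchor_cam]
    if same.size == 0 or cross.size == 0:
        return 0.0
    return max(0.0, margin + cross.min() - same.min())  # hinge on hardest negatives

rng = np.random.default_rng(0)
anchor = rng.normal(size=16)
negatives = rng.normal(size=(20, 16))
cams = rng.integers(0, 4, size=20)
print(mcnl_style_term(anchor, negatives, cams, anchor_cam=0))
```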

IJCAI Conference 2019 Conference Paper

Dense Temporal Convolution Network for Sign Language Translation

  • Dan Guo
  • Shuo Wang
  • Qi Tian
  • Meng Wang

Sign language translation (SLT), which aims at translating a sign language video into natural language, is a weakly supervised task, given that there is no exact mapping between visual actions and textual words in a sentence label. To align the sign language actions and translate them into the corresponding words automatically, this paper proposes a dense temporal convolution network, termed DenseTCN, which captures the actions in hierarchical views. Within this network, a temporal convolution (TC) is designed to learn the short-term correlation among adjacent features and is further extended to a dense hierarchical structure. In the k-th TC layer, we integrate the outputs of all preceding layers: (1) the TC in a deeper layer essentially has a larger receptive field, which captures long-term temporal context through the hierarchical content transition; (2) the integration addresses the SLT problem from different views, including embedded short-term and extended long-term sequential learning. Finally, we adopt the CTC loss and a fusion strategy to learn the feature-wise classification and generate the translated sentence. The experimental results on two popular sign language benchmarks, i.e., PHOENIX and USTCConSents, demonstrate the effectiveness of our proposed method in terms of various measurements.
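
A minimal PyTorch sketch of a dense temporal convolution block, in which each layer consumes the channel-wise concatenation of all earlier outputs so that deeper layers cover longer temporal context; the channel sizes and kernel width are illustrative, not the paper's DenseTCN configuration.

```python
# Minimal sketch of a "dense" temporal convolution block: layer k sees the
# channel-wise concatenation of all earlier outputs, so deeper layers cover
# longer temporal context. Hyper-parameters below are illustrative.
import torch
import torch.nn as nn

class DenseTemporalBlock(nn.Module):
    def __init__(self, in_ch=64, growth=32, num_layers=3, kernel=5):
        super().__init__()
        self.layers = nn.ModuleList()
        ch = in_ch
        for _ in range(num_layers):
            self.layers.append(nn.Sequential(
                nn.Conv1d(ch, growth, kernel, padding=kernel // 2),
                nn.ReLU(inplace=True),
            ))
            ch += growth                      # the next layer also sees this output

    def forward(self, x):                     # x: (batch, channels, time)
        feats = [x]
        for layer in self.layers:
            feats.append(layer(torch.cat(feats, dim=1)))
        return torch.cat(feats, dim=1)        # all temporal views of the sequence

frames = torch.randn(2, 64, 100)              # e.g. 100 per-frame visual features
print(DenseTemporalBlock()(frames).shape)     # torch.Size([2, 160, 100])
```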

NeurIPS Conference 2019 Conference Paper

Information Competing Process for Learning Diversified Representations

  • Jie Hu
  • Rongrong Ji
  • Shengchuan Zhang
  • Xiaoshuai Sun
  • Qixiang Ye
  • Chia-Wen Lin
  • Qi Tian

Learning representations with diversified information remains an open problem. Towards learning diversified representations, a new approach, termed Information Competing Process (ICP), is proposed in this paper. Aiming to enrich the information carried by feature representations, ICP separates a representation into two parts with different mutual information constraints. The separated parts are forced to accomplish the downstream task independently in a competitive environment, which prevents either part from learning what the other has learned for the downstream task. Such competing parts are then combined synergistically to complete the task. By fusing representation parts learned competitively under different conditions, ICP facilitates obtaining diversified representations that contain rich information. Experiments on image classification and image reconstruction tasks demonstrate the great potential of ICP to learn discriminative and disentangled representations in both supervised and self-supervised learning settings.

IJCAI Conference 2018 Conference Paper

Rethinking Diversified and Discriminative Proposal Generation for Visual Grounding

  • Zhou Yu
  • Jun Yu
  • Chenchao Xiang
  • Zhou Zhao
  • Qi Tian
  • Dacheng Tao

Visual grounding aims to localize an object in an image referred to by a textual query phrase. Various visual grounding approaches have been proposed, and the problem can be modularized into a general framework: proposal generation, multi-modal feature representation, and proposal ranking. Of these three modules, most existing approaches focus on the latter two, and the importance of proposal generation is generally neglected. In this paper, we rethink what properties make a good proposal generator. We introduce diversity and discrimination simultaneously when generating proposals, and in doing so propose the Diversified and Discriminative Proposal Networks model (DDPN). Based on the proposals generated by DDPN, we propose a high-performance baseline model for visual grounding and evaluate it on four benchmark datasets. Experimental results demonstrate that our model delivers significant improvements on all the tested datasets (e.g., 18.8% improvement on ReferItGame and 8.2% improvement on Flickr30k Entities over the existing state of the art, respectively).

IJCAI Conference 2017 Conference Paper

Adaptively Unified Semi-supervised Learning for Cross-Modal Retrieval

  • Liang Zhang
  • Bingpeng Ma
  • Jianfeng He
  • Guorong Li
  • Qingming Huang
  • Qi Tian

Motivated by the fact that both the relevance of class labels and unlabeled data can help strengthen multi-modal correlation, this paper proposes a novel method for cross-modal retrieval. To make each sample move toward its relevant labels while staying far away from its irrelevant ones, a novel dragging technique is fused into a unified linear regression model. In this way, not only the relation between embedded features and relevant class labels but also the relation between embedded features and irrelevant class labels can be exploited. Moreover, considering that some unlabeled data contain specific semantic information, a weighted regression model is designed to adaptively enlarge their contribution while weakening that of unlabeled data with non-specific semantic information. Hence, unlabeled data can supply semantic information to enhance the discriminative ability of the classifier. Finally, we integrate the constraints into a joint minimization formulation and develop an efficient optimization algorithm to learn a discriminative common subspace for different modalities. Experimental results on the Wiki, Pascal, and NUS-WIDE datasets show that the proposed method outperforms state-of-the-art methods even when 20% of the samples have no class labels.

AAAI Conference 2017 Conference Paper

Image Caption with Global-Local Attention

  • Linghui Li
  • Sheng Tang
  • Lixi Deng
  • Yongdong Zhang
  • Qi Tian

Image captioning is becoming important in the field of artificial intelligence. Most existing methods based on the CNN-RNN framework suffer from object missing and misprediction due to the mere use of a global, image-level representation. To address these problems, in this paper, we propose a global-local attention (GLA) method that integrates object-level local representations with image-level global representations through an attention mechanism. Our method can thus predict the salient objects more precisely and with high recall, while concurrently keeping image-level context information. Therefore, the proposed GLA method generates more relevant sentences and achieves state-of-the-art performance on the well-known Microsoft COCO caption dataset across several popular metrics.

AAAI Conference 2017 Conference Paper

Multidimensional Scaling on Multiple Input Distance Matrices

  • Song Bai
  • Xiang Bai
  • Longin Jan Latecki
  • Qi Tian

Multidimensional Scaling (MDS) is a classic technique that seeks vectorial representations for data points, given the pairwise distances between them. In recent years, data are usually collected from diverse sources or have multiple heterogeneous representations. However, how to perform multidimensional scaling on multiple input distance matrices remains, to the best of our knowledge, unsolved. In this paper, we first define this new task formally. Then, we propose a new algorithm called Multi-View Multidimensional Scaling (MVMDS), which treats each input distance matrix as one view. The proposed algorithm can learn the weights of the views (i.e., distance matrices) automatically by exploring the consensus information and complementary nature of the views. Experimental results on synthetic as well as real datasets demonstrate the effectiveness of MVMDS. We hope that our work encourages a wider consideration in the many domains where MDS is needed.
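
To make the multi-matrix setting concrete, the sketch below runs classical MDS on an equally weighted average of the input distance matrices; this is only a naive baseline for the task, whereas MVMDS learns the per-view weights automatically.

```python
# Not the MVMDS algorithm: a naive baseline that makes the task concrete by
# running classical MDS on a (fixed, equally) weighted average of the input
# distance matrices. MVMDS instead learns the per-view weights.
import numpy as np

def classical_mds(D, dim=2):
    """Classical MDS from a single distance matrix D (n x n)."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (D ** 2) @ J               # double-centered Gram matrix
    vals, vecs = np.linalg.eigh(B)
    idx = np.argsort(vals)[::-1][:dim]
    return vecs[:, idx] * np.sqrt(np.maximum(vals[idx], 0.0))

def naive_multiview_mds(distance_matrices, dim=2, weights=None):
    if weights is None:
        weights = np.ones(len(distance_matrices)) / len(distance_matrices)
    D_avg = sum(w * D for w, D in zip(weights, distance_matrices))
    return classical_mds(D_avg, dim)

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 5))
D1 = np.linalg.norm(X[:, None] - X[None, :], axis=-1)   # view 1: L2 distances
D2 = np.abs(X[:, None] - X[None, :]).sum(-1)            # view 2: L1 distances
print(naive_multiview_mds([D1, D2]).shape)              # (10, 2)
```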

AAAI Conference 2017 Conference Paper

Regularized Diffusion Process for Visual Retrieval

  • Song Bai
  • Xiang Bai
  • Qi Tian
  • Longin Jan Latecki

The diffusion process has greatly advanced visual retrieval owing to its capacity to capture the geometric structure of the underlying manifold. Recent studies (Donoser and Bischof 2013) have experimentally demonstrated that diffusion on the tensor product graph yields better retrieval performance than diffusion on the original affinity graph. However, the principle behind this kind of diffusion process remains unclear, i.e., what kind of manifold structure is captured and how it is reflected. In this paper, we propose a new variant of the diffusion process, which also operates on a tensor product graph. It is defined in three equivalent formulations (a regularization framework, an iterative framework, and a limit framework). Based on our study, three insightful conclusions are drawn which theoretically explain how this kind of diffusion process can better reveal the intrinsic relationship between objects. In addition, extensive experimental results on various retrieval tasks testify to the validity of the proposed method.
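
For background, the iterative style of tensor-product-graph diffusion that this line of work builds on can be sketched as repeatedly applying W <- alpha * S W S^T + I with a normalized affinity S; the normalization and step size below are illustrative choices made so the iteration converges, and the regularized variant proposed in the paper is not reproduced here.

```python
# Background sketch of diffusion on a tensor product graph (the family of methods
# the abstract builds on). The specific regularized formulation of the paper differs.
import numpy as np

def tpg_diffusion(A, alpha=0.5, iters=50):
    d = A.sum(axis=1)
    S = A / np.sqrt(np.outer(d, d))            # symmetric normalization of the affinity
    W = np.eye(A.shape[0])
    for _ in range(iters):
        W = alpha * S @ W @ S.T + np.eye(A.shape[0])
    return W                                    # diffused affinities, usable for re-ranking

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 4))
A = np.exp(-np.linalg.norm(X[:, None] - X[None, :], axis=-1))  # Gaussian affinity
np.fill_diagonal(A, 0.0)
print(tpg_diffusion(A).round(2))
```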

TIST Journal 2015 Journal Article

When Location Meets Social Multimedia

  • Rongrong Ji
  • Yue Gao
  • Wei Liu
  • Xing Xie
  • Qi Tian
  • Xuelong Li

With the popularity of multimedia sharing platforms such as Facebook and Flickr, recent years have witnessed an explosive growth of geographical tags on social multimedia content. This trend enables a wide variety of emerging applications, for example, mobile location search, landmark recognition, scene reconstruction, and touristic recommendation, which range from pure research prototypes to commercial systems. In this article, we give a comprehensive survey of these applications, covering recent advances in the recognition and mining of geographically aware social multimedia. We review related work from the past decade on location recognition, scene summarization, tourism suggestion, 3D building modeling, mobile visual search, and city navigation. Finally, we discuss potential challenges, future topics, and open issues related to geo-social multimedia computing, recognition, mining, and analytics.

AAAI Conference 2014 Conference Paper

Similarity-Preserving Binary Signature for Linear Subspaces

  • Jianqiu Ji
  • Jianmin Li
  • Shuicheng Yan
  • Qi Tian
  • Bo Zhang

Linear subspaces are an important representation for many kinds of real-world data in computer vision and pattern recognition, e.g., faces, motion videos, and speech. In this paper, we first define a pairwise angular similarity and an angular distance for linear subspaces. The angular distance satisfies non-negativity, identity of indiscernibles, symmetry, and the triangle inequality, and is thus a metric. We then propose a method to compress linear subspaces into compact, similarity-preserving binary signatures, between which the normalized Hamming distance is an unbiased estimator of the angular distance. We provide a lower bound on the length of the binary signatures that suffices to guarantee uniform distance preservation within a set of subspaces. Experiments on face recognition demonstrate the effectiveness of the binary signature in terms of recognition accuracy, speed, and storage requirements. The results show that, compared with the exact method, the approximation with binary signatures achieves an order-of-magnitude speedup while requiring a significantly smaller amount of storage, yet it still accurately preserves the similarity and achieves recognition accuracy comparable to the exact method in face recognition.

TIST Journal 2012 Journal Article

Context-Aware Semi-Local Feature Detector

  • Rongrong Ji
  • Hongxun Yao
  • Qi Tian
  • Pengfei Xu
  • Xiaoshuai Sun
  • Xianming Liu

How can interest point detectors benefit from contextual cues? In this article, we introduce a context-aware semi-local detector (CASL) framework to give a systematic answer with three contributions: (1) we integrate the context of interest points to recurrently refine their detections; (2) this integration boosts interest point detectors from the traditional local scale to a semi-local scale to discover more discriminative salient regions; (3) such a context-aware structure further enables us to bring category learning (usually in the subsequent recognition phase) forward into interest point detection to locate category-aware, meaningful salient regions. Our CASL detector consists of two phases. The first phase accumulates multiscale spatial correlations of local features into a difference of contextual Gaussians (DoCG) field. DoCG quantizes detector context to highlight contextually salient regions at a semi-local scale, which also reveals visual attention to a certain extent. The second phase locates contextual peaks by mean shift search over the DoCG field, which subsequently integrates contextual cues into feature description. This phase enables us to integrate category learning into the mean shift search kernels. This learning-based CASL mechanism produces more category-aware features, which substantially benefits the subsequent visual categorization process. We conducted experiments on image search, object characterization, and feature detector repeatability evaluation, which show superior discriminability and repeatability comparable to state-of-the-art works.

NeurIPS Conference 2012 Conference Paper

Super-Bit Locality-Sensitive Hashing

  • Jianqiu Ji
  • Jianmin Li
  • Shuicheng Yan
  • Bo Zhang
  • Qi Tian

Sign-random-projection locality-sensitive hashing (SRP-LSH) is a probabilistic dimension reduction method that provides an unbiased estimate of angular similarity, yet suffers from the large variance of its estimation. In this work, we propose Super-Bit locality-sensitive hashing (SBLSH). It is easy to implement, orthogonalizing the random projection vectors in batches, and it is theoretically guaranteed that SBLSH also provides an unbiased estimate of angular similarity, yet with a smaller variance when the angle to estimate is within $(0, \pi/2]$. Extensive experiments on real data validate that, given the same length of binary code, SBLSH may achieve a significant reduction in mean squared error when estimating pairwise angular similarity. Moreover, SBLSH shows superiority over SRP-LSH in approximate nearest neighbor (ANN) retrieval experiments.
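
The core construction stated in the abstract, drawing random Gaussian projection vectors, orthogonalizing them in batches, and hashing by sign, can be sketched as follows; the code length and batch (super-bit) depth below are arbitrary illustrative values, not the paper's settings.

```python
# Sketch of the Super-Bit idea: orthogonalize random projection vectors in batches
# ("super-bits") and hash by the sign of each projection. The normalized Hamming
# distance between codes then estimates the angle between the original vectors.
import numpy as np

def super_bit_projections(dim, code_len, super_bit=8, seed=0):
    rng = np.random.default_rng(seed)
    blocks = []
    for start in range(0, code_len, super_bit):
        k = min(super_bit, code_len - start)
        G = rng.normal(size=(dim, k))
        Q, _ = np.linalg.qr(G)                # orthogonalize this batch of vectors
        blocks.append(Q[:, :k])
    return np.hstack(blocks)                  # (dim, code_len)

def hash_codes(X, P):
    return (X @ P >= 0).astype(np.uint8)      # sign bits

rng = np.random.default_rng(1)
X = rng.normal(size=(5, 64))
P = super_bit_projections(dim=64, code_len=32, super_bit=8)
codes = hash_codes(X, P)
hamming = (codes[0] != codes[1]).mean()       # estimates angle / pi between samples 0 and 1
print(codes.shape, hamming)
```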