Author name cluster

Jianlong Fu

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

17 papers

2 author rows

NeurIPS Conference 2024 Conference Paper

PromptFix: You Prompt and We Fix the Photo

yongsheng yu
Ziyun Zeng
Hang Hua
Jianlong Fu
Jiebo Luo

Diffusion models equipped with language models demonstrate excellent controllability in image generation tasks, allowing image processing to adhere to human instructions. However, the lack of diverse instruction-following data hampers the development of models that effectively recognize and execute user-customized instructions, particularly in low-level tasks. Moreover, the stochastic nature of the diffusion process leads to deficiencies in image generation or editing tasks that require the detailed preservation of the generated images. To address these limitations, we propose PromptFix, a comprehensive framework that enables diffusion models to follow human instructions to perform a wide variety of image-processing tasks. First, we construct a large-scale instruction-following dataset that covers comprehensive image-processing tasks, including low-level tasks, image editing, and object creation. Next, we propose a high-frequency guidance sampling method to explicitly control the denoising process and preserve high-frequency details in unprocessed areas. Finally, we design an auxiliary prompting adapter, utilizing Vision-Language Models (VLMs) to enhance text prompts and improve the model's task generalization. Experimental results show that PromptFix outperforms previous methods in various image-processing tasks. Our proposed model also achieves comparable inference efficiency with these baseline models and exhibits superior zero-shot capabilities in blind restoration and combination tasks.

PDF Details DOI

ICLR Conference 2024 Conference Paper

Solving Diffusion ODEs with Optimal Boundary Conditions for Better Image Super-Resolution

Yiyang Ma
Huan Yang 0005
Wenhan Yang
Jianlong Fu
Jiaying Liu 0001

Diffusion models, as a kind of powerful generative model, have given impressive results on image super-resolution (SR) tasks. However, due to the randomness introduced in the reverse process of diffusion models, the performances of diffusion-based SR models are fluctuating at every time of sampling, especially for samplers with few resampled steps. This inherent randomness of diffusion models results in ineffectiveness and instability, making it challenging for users to guarantee the quality of SR results. However, our work takes this randomness as an opportunity: fully analyzing and leveraging it leads to the construction of an effective plug-and-play sampling method that owns the potential to benefit a series of diffusion-based SR methods. More in detail, we propose to steadily sample high-quality SR images from pre-trained diffusion-based SR models by solving diffusion ordinary differential equations (diffusion ODEs) with optimal boundary conditions (BCs) and analyze the characteristics between the choices of BCs and their corresponding SR results. Our analysis shows the route to obtain an approximately optimal BC via an efficient exploration in the whole space. The quality of SR results sampled by the proposed method with fewer steps outperforms the quality of results sampled by current methods with randomness from the same pre-trained diffusion-based SR model, which means that our sampling method ''boosts'' current diffusion-based SR models without any additional training.

Details

ICLR Conference 2023 Conference Paper

CLIP-ViP: Adapting Pre-trained Image-Text Model to Video-Language Alignment

Hongwei Xue
Yuchong Sun
Bei Liu 0001
Jianlong Fu
Ruihua Song
Houqiang Li
Jiebo Luo 0001

Pre-trained image-text models, like CLIP, have demonstrated the strong power of vision-language representation learned from a large scale of web-collected image-text data. In light of the well-learned visual features, there are works that transfer image representation to the video domain and achieve good results. However, adapting image-text pre-trained models to video-text pre-training (i.e., post-pretraining) has not demonstrated a significant advantage yet. In this paper, we tackle this challenge by raising and addressing two questions: 1) what are the factors hindering post-pretraining CLIP from improving performance on video-text tasks, and 2) how to mitigate the impact of these factors. Through a series of comparative experiments and analyses, we find that the data scale and domain gap between language sources have large impacts. By these observations, we propose an Omnisource Cross-modal Learning method equipped with a Video Proxy mechanism on the basis of CLIP, namely CLIP-ViP. Extensive results show that our approach improves the performance of CLIP on video-text retrieval by a large margin. Our model achieves state-of-the-art results on a variety of datasets, including MSR-VTT, DiDeMo, LSMDC, and ActivityNet. We release our code and pre-trained CLIP-ViP models at \url{https://github.com/microsoft/XPretrain/tree/main/CLIP-ViP}.

Details

NeurIPS Conference 2022 Conference Paper

Long-Form Video-Language Pre-Training with Multimodal Temporal Contrastive Learning

Yuchong Sun
Hongwei Xue
Ruihua Song
Bei Liu
Huan Yang
Jianlong Fu

Large-scale video-language pre-training has shown significant improvement in video-language understanding tasks. Previous studies of video-language pretraining mainly focus on short-form videos (i. e. , within 30 seconds) and sentences, leaving long-form video-language pre-training rarely explored. Directly learning representation from long-form videos and language may benefit many long-formvideo-language understanding tasks. However, it is challenging due to the difficulty of modeling long-range relationships and the heavy computational burden caused by more frames. In this paper, we introduce a Long-Form VIdeo-LAnguage pre-training model (LF-VILA) and train it on a large-scale long-form video and paragraph dataset constructed from an existing public dataset. To effectively capturethe rich temporal dynamics and to better align video and language in an efficient end-to-end manner, we introduce two novel designs in our LF-VILA model. We first propose a Multimodal Temporal Contrastive (MTC) loss to learn the temporal relation across different modalities by encouraging fine-grained alignment between long-form videos and paragraphs. Second, we propose a Hierarchical Temporal Window Attention (HTWA) mechanism to effectively capture long-range dependency while reducing computational cost in Transformer. We fine-tune the pre-trained LF-VILA model on seven downstream long-form video-language understanding tasks of paragraph-to-video retrieval and long-form video question-answering, and achieve new state-of-the-art performances. Specifically, our model achieves 16. 1% relative improvement on ActivityNet paragraph-to-video retrieval task and 2. 4% on How2QA task, respectively. We release our code, dataset, and pre-trained models at https: //github. com/microsoft/XPretrain.

PDF Details

NeurIPS Conference 2021 Conference Paper

Improving Visual Quality of Image Synthesis by A Token-based Generator with Transformers

Yanhong Zeng
Huan Yang
Hongyang Chao
Jianbo Wang
Jianlong Fu

We present a new perspective of achieving image synthesis by viewing this task as a visual token generation problem. Different from existing paradigms that directly synthesize a full image from a single input (e. g. , a latent code), the new formulation enables a flexible local manipulation for different image regions, which makes it possible to learn content-aware and fine-grained style control for image synthesis. Specifically, it takes as input a sequence of latent tokens to predict the visual tokens for synthesizing an image. Under this perspective, we propose a token-based generator (i. e. , TokenGAN). Particularly, the TokenGAN inputs two semantically different visual tokens, i. e. , the learned constant content tokens and the style tokens from the latent space. Given a sequence of style tokens, the TokenGAN is able to control the image synthesis by assigning the styles to the content tokens by attention mechanism with a Transformer. We conduct extensive experiments and show that the proposed TokenGAN has achieved state-of-the-art results on several widely-used image synthesis benchmarks, including FFHQ and LSUN CHURCH with different resolutions. In particular, the generator is able to synthesize high-fidelity images with (1024x1024) size, dispensing with convolutions entirely.

PDF Details

NeurIPS Conference 2021 Conference Paper

Probing Inter-modality: Visual Parsing with Self-Attention for Vision-and-Language Pre-training

Hongwei Xue
Yupan Huang
Bei Liu
Houwen Peng
Jianlong Fu
Houqiang Li
Jiebo Luo

Vision-Language Pre-training (VLP) aims to learn multi-modal representations from image-text pairs and serves for downstream vision-language tasks in a fine-tuning fashion. The dominant VLP models adopt a CNN-Transformer architecture, which embeds images with a CNN, and then aligns images and text with a Transformer. Visual relationship between visual contents plays an important role in image understanding and is the basic for inter-modal alignment learning. However, CNNs have limitations in visual relation learning due to local receptive field's weakness in modeling long-range dependencies. Thus the two objectives of learning visual relation and inter-modal alignment are encapsulated in the same Transformer network. Such design might restrict the inter-modal alignment learning in the Transformer by ignoring the specialized characteristic of each objective. To tackle this, we propose a fully Transformer visual embedding for VLP to better learn visual relation and further promote inter-modal alignment. Specifically, we propose a metric named Inter-Modality Flow (IMF) to measure the interaction between vision and language modalities (i. e. , inter-modality). We also design a novel masking optimization mechanism named Masked Feature Regression (MFR) in Transformer to further promote the inter-modality learning. To the best of our knowledge, this is the first study to explore the benefit of Transformer for visual feature learning in VLP. We verify our method on a wide range of vision-language tasks, including Visual Question Answering (VQA), Visual Entailment and Visual Reasoning. Our approach not only outperforms the state-of-the-art VLP performance, but also shows benefits on the IMF metric.

PDF Details

NeurIPS Conference 2021 Conference Paper

Searching the Search Space of Vision Transformer

Minghao Chen
Kan Wu
Bolin Ni
Houwen Peng
Bei Liu
Jianlong Fu
Hongyang Chao
Haibin Ling

Vision Transformer has shown great visual representation power in substantial vision tasks such as recognition and detection, and thus been attracting fast-growing efforts on manually designing more effective architectures. In this paper, we propose to use neural architecture search to automate this process, by searching not only the architecture but also the search space. The central idea is to gradually evolve different search dimensions guided by their E-T Error computed using a weight-sharing supernet. Moreover, we provide design guidelines of general vision transformers with extensive analysis according to the space searching process, which could promote the understanding of vision transformer. Remarkably, the searched models, named S3 (short for Searching the Search Space), from the searched space achieve superior performance to recently proposed models, such as Swin, DeiT and ViT, when evaluated on ImageNet. The effectiveness of S3 is also illustrated on object detection, semantic segmentation and visual question answering, demonstrating its generality to downstream vision and vision-language tasks. Code and models will be available at https: //github. com/microsoft/Cream.

PDF Details

NeurIPS Conference 2020 Conference Paper

Cream of the Crop: Distilling Prioritized Paths For One-Shot Neural Architecture Search

Houwen Peng
Hao Du
Hongyuan Yu
Qi Li
Jing Liao
Jianlong Fu

One-shot weight sharing methods have recently drawn great attention in neural architecture search due to high efficiency and competitive performance. However, weight sharing across models has an inherent deficiency, i. e. , insufficient training of subnetworks in the hypernetwork. To alleviate this problem, we present a simple yet effective architecture distillation method. The central idea is that subnetworks can learn collaboratively and teach each other throughout the training process, aiming to boost the convergence of individual models. We introduce the concept of prioritized path, which refers to the architecture candidates exhibiting superior performance during training. Distilling knowledge from the prioritized paths is able to boost the training of subnetworks. Since the prioritized paths are changed on the fly depending on their performance and complexity, the final obtained paths are the cream of the crop. We directly select the most promising one from the prioritized paths as the final architecture, without using other complex search methods, such as reinforcement learning or evolution algorithms. The experiments on ImageNet verify such path distillation method can improve the convergence ratio and performance of the hypernetwork, as well as boosting the training of subnetworks. The discovered architectures achieve superior performance compared to the recent MobileNetV3 and EfficientNet families under aligned settings. Moreover, the experiments on object detection and more challenging search space show the generality and robustness of the proposed method. Code and models are available at \url{https: //github. com/neurips-20/cream. git}.

PDF Details

AAAI Conference 2020 Conference Paper

DGCN: Dynamic Graph Convolutional Network for Efficient Multi-Person Pose Estimation

Zhongwei Qiu
Kai Qiu
Jianlong Fu
Dongmei Fu

Multi-person pose estimation aims to detect human keypoints from images with multiple persons. Bottom-up methods for multi-person pose estimation have attracted extensive attention, owing to the good balance between efﬁciency and accuracy. Recent bottom-up methods usually follow the principle of keypoints localization and grouping, where relations between keypoints are the keys to group keypoints. These relations spontaneously construct a graph of keypoints, where the edges represent the relations between two nodes (i. e. , keypoints). Existing bottom-up methods mainly deﬁne relations by empirically picking out edges from this graph, while omitting edges that may contain useful semantic relations. In this paper, we propose a novel Dynamic Graph Convolutional Module (DGCM) to model rich relations in the keypoints graph. Speciﬁcally, we take into account all relations (all edges of the graph) and construct dynamic graphs to tolerate large variations of human pose. The DGCM is quite lightweight, which allows it to be stacked like a pyramid architecture and learn structural relations from multi-level features. Our network with single DGCM based on ResNet-50 achieves relative gains of 3. 2% and 4. 8% over state-of-the-art bottom-up methods on COCO keypoints and MPII dataset, respectively.

PDF Details

AAAI Conference 2020 Conference Paper

Learning 2D Temporal Adjacent Networks for Moment Localization with Natural Language

Songyang Zhang
Houwen Peng
Jianlong Fu
Jiebo Luo

We address the problem of retrieving a speciﬁc moment from an untrimmed video by a query sentence. This is a challenging problem because a target moment may take place in relations to other temporal moments in the untrimmed video. Existing methods cannot tackle this challenge well since they consider temporal moments individually and neglect the temporal dependencies. In this paper, we model the temporal relations between video moments by a two-dimensional map, where one dimension indicates the starting time of a moment and the other indicates the end time. This 2D temporal map can cover diverse video moments with different lengths, while representing their adjacent relations. Based on the 2D map, we propose a Temporal Adjacent Network (2D-TAN), a single-shot framework for moment localization. It is capable of encoding the adjacent temporal relation, while learning discriminative features for matching video moments with referring expressions. We evaluate the proposed 2D-TAN on three challenging benchmarks, i. e. , Charades-STA, ActivityNet Captions, and TACoS, where our 2D-TAN outperforms the state-of-the-art.

PDF Details

NeurIPS Conference 2020 Conference Paper

Learning Semantic-aware Normalization for Generative Adversarial Networks

Heliang Zheng
Jianlong Fu
Yanhong Zeng
Jiebo Luo
Zheng-Jun Zha

The recent advances in image generation have been achieved by style-based image generators. Such approaches learn to disentangle latent factors in different image scales and encode latent factors as “style” to control image synthesis. However, existing approaches cannot further disentangle fine-grained semantics from each other, which are often conveyed from feature channels. In this paper, we propose a novel image synthesis approach by learning Semantic-aware relative importance for feature channels in Generative Adversarial Networks (SariGAN). Such a model disentangles latent factors according to the semantic of feature channels by channel-/group- wise fusion of latent codes and feature channels. Particularly, we learn to cluster feature channels by semantics and propose an adaptive group-wise Normalization (AdaGN) to independently control the styles of different channel groups. For example, we can adjust the statistics of channel groups for a human face to control the open and close of the mouth, while keeping other facial features unchanged. We propose to use adversarial training, a channel grouping loss, and a mutual information loss for joint optimization, which not only enables high-fidelity image synthesis but leads to superior interpretable properties. Extensive experiments show that our approach outperforms the SOTA style-based approaches in both unconditional image generation and conditional image inpainting tasks.

PDF Details

IJCAI Conference 2019 Conference Paper

From Words to Sentences: A Progressive Learning Approach for Zero-resource Machine Translation with Visual Pivots

Shizhe Chen
Qin Jin
Jianlong Fu

The neural machine translation model has suffered from the lack of large-scale parallel corpora. In contrast, we humans can learn multi-lingual translations even without parallel texts by referring our languages to the external world. To mimic such human learning behavior, we employ images as pivots to enable zero-resource translation learning. However, a picture tells a thousand words, which makes multi-lingual sentences pivoted by the same image noisy as mutual translations and thus hinders the translation model learning. In this work, we propose a progressive learning approach for image-pivoted zero-resource machine translation. Since words are less diverse when grounded in the image, we first learn word-level translation with image pivots, and then progress to learn the sentence-level translation by utilizing the learned word translation to suppress noises in image-pivoted multi-lingual sentences. Experimental results on two widely used image-pivot translation datasets, IAPR-TC12 and Multi30k, show that the proposed approach significantly outperforms other state-of-the-art methods.

PDF Details

NeurIPS Conference 2019 Conference Paper

Learning Deep Bilinear Transformation for Fine-grained Image Representation

Heliang Zheng
Jianlong Fu
Zheng-Jun Zha
Jiebo Luo

Bilinear feature transformation has shown the state-of-the-art performance in learning fine-grained image representations. However, the computational cost to learn pairwise interactions between deep feature channels is prohibitively expensive, which restricts this powerful transformation to be used in deep neural networks. In this paper, we propose a deep bilinear transformation (DBT) block, which can be deeply stacked in convolutional neural networks to learn fine-grained image representations. The DBT block can uniformly divide input channels into several semantic groups. As bilinear transformation can be represented by calculating pairwise interactions within each group, the computational cost can be heavily relieved. The output of each block is further obtained by aggregating intra-group bilinear features, with residuals from the entire input features. We found that the proposed network achieves new state-of-the-art in several fine-grained image recognition benchmarks, including CUB-Bird, Stanford-Car, and FGVC-Aircraft.

PDF Details

AAAI Conference 2018 Conference Paper

Self-View Grounding Given a Narrated 360° Video

Shih-Han Chou
Yi-Chun Chen
Kuo-Hao Zeng
Hou-Ning Hu
Jianlong Fu
Min Sun

Narrated 360◦ videos are typically provided in many touring scenarios to mimic real-world experience. However, previous work has shown that smart assistance (i. e. , providing visual guidance) can signiﬁcantly help users to follow the Normal Field of View (NFoV) corresponding to the narrative. In this project, we aim at automatically grounding the NFoVs of a 360◦ video given subtitles of the narrative (referred to as “NFoV-grounding”). We propose a novel Visual Grounding Model (VGM) to implicitly and efﬁciently predict the NFoVs given the video content and subtitles. Speciﬁcally, at each frame, we efﬁciently encode the panorama into feature map of candidate NFoVs using a Convolutional Neural Network (CNN) and the subtitles to the same hidden space using an RNN with Gated Recurrent Units (GRU). Then, we apply soft-attention on candidate NFoVs to trigger sentence decoder aiming to minimize the reconstruct loss between the generated and given sentence. Finally, we obtain the NFoV as the candidate NFoV with the maximum attention without any human supervision. To train VGM more robustly, we also generate a reverse sentence conditioning on one minus the soft-attention such that the attention focuses on candidate NFoVs less relevant to the given sentence. The negative log reconstruction loss of the reverse sentence (referred to as “irrelevant loss”) is jointly minimized to encourage the reverse sentence to be different from the given sentence. To evaluate our method, we collect the ﬁrst narrated 360◦ videos dataset and achieve state-of-the-art NFoV-grounding performance.

PDF Details

AAAI Conference 2018 Conference Paper

Show, Reward and Tell: Automatic Generation of Narrative Paragraph From Photo Stream by Adversarial Training

Jing Wang
Jianlong Fu
Jinhui Tang
Zechao Li
Tao Mei

Impressive image captioning results (i. e. , an objective description for an image) are achieved with plenty of training pairs. In this paper, we take one step further to investigate the creation of narrative paragraph for a photo stream. This task is even more challenging due to the difﬁculty in modeling an ordered photo sequence and in generating a relevant paragraph with expressive language style for storytelling. The dif- ﬁculty can even be exacerbated by the limited training data, so that existing approaches almost focus on search-based solutions. To deal with these challenges, we propose a sequenceto-sequence modeling approach with reinforcement learning and adversarial training. First, to model the ordered photo stream, we propose a hierarchical recurrent neural network as story generator, which is optimized by reinforcement learning with rewards. Second, to generate relevant and story-style paragraphs, we design the rewards with two critic networks, including a multi-modal and a language-style discriminator. Third, we further consider the story generator and reward critics as adversaries. The generator aims to create indistinguishable paragraphs to human-level stories, whereas the critics aim at distinguishing them and further improving the generator by policy gradient. Experiments on three widely-used datasets show the effectiveness, against state-of-the-art methods with relative increase of 20. 2% by METEOR. We also show the subjective preference for the proposed approach over the baselines through a user study with 30 human subjects.

PDF Details

AAAI Conference 2017 Conference Paper

Let Your Photos Talk: Generating Narrative Paragraph for Photo Stream via Bidirectional Attention Recurrent Neural Networks

Yu Liu
Jianlong Fu
Tao Mei
Chang Wen Chen

Automatic generation of natural language description for individual images (a. k. a. image captioning) has attracted extensive research attention. In this paper, we take one step further to investigate the generation of a paragraph to describe a photo stream for the purpose of storytelling. This task is even more challenging than individual image description due to the difﬁculty in modeling the large visual variance in an ordered photo collection and in preserving the long-term language coherence among multiple sentences. To deal with these challenges, we formulate the task as a sequence-to-sequence learning problem and propose a novel joint learning model by leveraging the semantic coherence in a photo stream. Specifically, to reduce visual variance, we learn a semantic space by jointly embedding each photo with its corresponding contextual sentence, so that the semantically related photos and their correlations are discovered. Then, to preserve language coherence in the paragraph, we learn a novel Bidirectional Attention-based Recurrent Neural Network (BARNN) model, which can attend on the discovered semantic relation to produce a sentence sequence and maintain its consistence with the photo stream. We integrate the two-step learning components into one single optimization formulation and train the network in an end-to-end manner. Experiments on three widely-used datasets (NYC/Disney/SIND) show that the proposed approach outperforms state-of-the-art methods with large margins for both retrieval and paragraph generation tasks. We also show the subjective preference of the machinegenerated stories by the proposed approach over the baselines through a user study with 40 human subjects.

PDF Details

IJCAI Conference 2016 Conference Paper

Beyond Object Recognition: Visual Sentiment Analysis with Deep Coupled Adjective and Noun Neural Networks

Jingwen Wang
Jianlong Fu
Yong Xu
Tao Mei

Visual sentiment analysis aims to automatically recognize positive and negative emotions from images. There are three main challenges, including large intra-class variance, fine-grained image categories, and scalability. Most existing methods predominantly focus on one or two challenges, which has limited their performance. In this paper, we propose a novel visual sentiment analysis approach with deep coupled adjective and noun neural networks. Specifically, to reduce the large intra-class variance, we first learn a shared middle-level sentiment representation by jointly learning an adjective and a noun deep neural network with weak label supervision. Second, based on the learned sentiment representation, a prediction network is further optimized to deal with the subtle differences which often exist in the fine-grained image categories. The three networks are trained in end-to-end manner, where the middle-level representation learned in previous two networks can guide the sentiment network to achieve high performance and fast convergence. Third, we generalize the training with mutual supervision between the learned adjective and noun networks by a Rectified Kullback-Leibler loss (ReKL), when the adjective and noun labels are not available. Extensive experiments on two widely-used datasets show that our method outperforms the state-of-the-art on SentiBank dataset with 10. 2% accuracy gain and surpasses the previous best approach on Twitter dataset with clear margin.

PDF Details