Arrow Research

Author name cluster

Jason Corso

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

5 papers
1 author row

Possible papers (5)

AAAI Conference 2020 Conference Paper

Unified Vision-Language Pre-Training for Image Captioning and VQA

  • Luowei Zhou
  • Hamid Palangi
  • Lei Zhang
  • Houdong Hu
  • Jason Corso
  • Jianfeng Gao

This paper presents a unified Vision-Language Pre-training (VLP) model. The model is unified in that (1) it can be fine-tuned for either vision-language generation (e.g., image captioning) or understanding (e.g., visual question answering) tasks, and (2) it uses a shared multi-layer transformer network for both encoding and decoding, which differs from many existing methods where the encoder and decoder are implemented as separate models. The unified VLP model is pre-trained on a large number of image-text pairs using the unsupervised learning objectives of two tasks: bidirectional and sequence-to-sequence (seq2seq) masked vision-language prediction. The two tasks differ solely in the context the prediction conditions on, which is controlled by applying specific self-attention masks to the shared transformer network. To the best of our knowledge, VLP is the first reported model that achieves state-of-the-art results on both vision-language generation and understanding tasks, as disparate as image captioning and visual question answering, across three challenging benchmark datasets: COCO Captions, Flickr30k Captions, and VQA 2.0. The code and the pre-trained models are available at https://github.com/LuoweiZhou/VLP.
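A minimal sketch (not the authors' code) of the central mechanism in this abstract: the bidirectional and seq2seq objectives are realized purely through different self-attention masks on one shared transformer. The token layout ([image regions | caption tokens]) and the convention mask[i, j] = 1 meaning "position i may attend to position j" are assumptions for illustration.

```python
import torch

def build_attention_mask(num_img: int, num_txt: int, seq2seq: bool) -> torch.Tensor:
    """Return a (num_img+num_txt) x (num_img+num_txt) self-attention mask."""
    n = num_img + num_txt
    if not seq2seq:
        # Bidirectional objective: every position sees every other position.
        return torch.ones(n, n)
    mask = torch.zeros(n, n)
    # Image regions attend bidirectionally, but only among themselves.
    mask[:num_img, :num_img] = 1
    # Text tokens attend to all image regions ...
    mask[num_img:, :num_img] = 1
    # ... and causally (left-to-right) to earlier text tokens, enabling generation.
    mask[num_img:, num_img:] = torch.tril(torch.ones(num_txt, num_txt))
    return mask

if __name__ == "__main__":
    print(build_attention_mask(num_img=3, num_txt=4, seq2seq=True))
```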

AAAI Conference 2018 Conference Paper

Towards Automatic Learning of Procedures From Web Instructional Videos

  • Luowei Zhou
  • Chenliang Xu
  • Jason Corso

The potential for agents, whether embodied or software, to learn by observing other agents performing procedures involving objects and actions is rich. Current research on automatic procedure learning relies heavily on action labels or video subtitles, even during the evaluation phase, which makes it infeasible in real-world scenarios. This leads to our question: can the human-consensus structure of a procedure be learned from a large set of long, unconstrained videos (e.g., instructional videos from YouTube) with only visual evidence? To answer this question, we introduce the problem of procedure segmentation: segmenting a video procedure into category-independent procedure segments. Given that no large-scale dataset is available for this problem, we collect a large-scale procedure segmentation dataset with procedure segments temporally localized and described; we use cooking videos and name the dataset YouCook2. We propose a segment-level recurrent network for generating procedure segments by modeling the dependencies across segments. The generated segments can be used as pre-processing for other tasks, such as dense video captioning and event parsing. We show in our experiments that the proposed model outperforms competitive baselines in procedure segmentation.
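An illustrative sketch, under stated assumptions, of the "segment-level recurrent network" idea in this abstract: candidate segments are scored in temporal order by a GRU, so each decision is conditioned on the segments seen so far. The feature dimensions, the single-score output head, and all names are hypothetical, not the paper's implementation.

```python
import torch
import torch.nn as nn

class SegmentLevelRNN(nn.Module):
    def __init__(self, feat_dim: int = 512, hidden_dim: int = 256):
        super().__init__()
        self.rnn = nn.GRUCell(feat_dim, hidden_dim)
        self.score = nn.Linear(hidden_dim, 1)   # keep-this-segment score

    def forward(self, segment_feats: torch.Tensor) -> torch.Tensor:
        """segment_feats: (num_candidates, feat_dim), ordered by start time."""
        h = torch.zeros(1, self.rnn.hidden_size)
        scores = []
        for feat in segment_feats:              # temporal order gives cross-segment dependency
            h = self.rnn(feat.unsqueeze(0), h)
            scores.append(self.score(h).squeeze())
        return torch.stack(scores)              # one score per candidate segment

if __name__ == "__main__":
    model = SegmentLevelRNN()
    candidates = torch.randn(6, 512)            # e.g. 6 candidate segments from one video
    print(model(candidates))
```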

AAAI Conference 2015 Conference Paper

Jointly Modeling Deep Video and Compositional Text to Bridge Vision and Language in a Unified Framework

  • Ran Xu
  • Caiming Xiong
  • Wei Chen
  • Jason Corso

Recently, joint video-language modeling has been attracting more and more attention. However, most existing approaches focus on building a language model on top of a fixed visual model. In this paper, we propose a unified framework that jointly models videos and the corresponding text sentences. The framework consists of three parts: a compositional-semantics language model, a deep video model, and a joint embedding model. In the language model, we propose a dependency-tree structured model that embeds a sentence into a continuous vector space, preserving both visually grounded meanings and word order. In the visual model, we leverage deep neural networks to capture essential semantic information from videos. In the joint embedding model, we minimize the distance between the outputs of the deep video model and the compositional language model in the joint space, and update the two models jointly. Based on these three parts, our system can accomplish three tasks: 1) natural language generation, 2) video retrieval, and 3) language retrieval. In experiments, our approach outperforms SVM, CRF, and CCA baselines in Subject-Verb-Object triplet prediction and natural sentence generation, and is better than CCA in video retrieval and language retrieval.
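A minimal sketch of the joint embedding idea described above: project video features and sentence features into a shared space and train by pulling matched pairs together, so gradients update both sides jointly. The small MLP encoders (standing in for the deep video model and the dependency-tree language model), the dimensions, and the plain squared-distance loss are assumptions, not the paper's formulation.

```python
import torch
import torch.nn as nn

class JointEmbedding(nn.Module):
    def __init__(self, video_dim=2048, text_dim=300, joint_dim=256):
        super().__init__()
        self.video_proj = nn.Sequential(nn.Linear(video_dim, joint_dim), nn.ReLU(),
                                        nn.Linear(joint_dim, joint_dim))
        self.text_proj = nn.Sequential(nn.Linear(text_dim, joint_dim), nn.ReLU(),
                                       nn.Linear(joint_dim, joint_dim))

    def forward(self, video_feat, text_feat):
        v = self.video_proj(video_feat)
        t = self.text_proj(text_feat)
        # Joint-space objective: minimize the distance between matched pairs,
        # updating the video side and the language side together.
        return ((v - t) ** 2).sum(dim=-1).mean()

if __name__ == "__main__":
    model = JointEmbedding()
    loss = model(torch.randn(8, 2048), torch.randn(8, 300))
    loss.backward()     # gradients flow into both encoders
    print(float(loss))
```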

AAAI Conference 2014 Conference Paper

Latent Domains Modeling for Visual Domain Adaptation

  • Caiming Xiong
  • Scott McCloskey
  • Shao-Hang Hsieh
  • Jason Corso

To improve robustness to significant mismatches between the source domain and target domain, arising from changes such as illumination, pose, and image quality, domain adaptation is increasingly popular in computer vision. However, most methods assume that the source data comes from a single domain, or that multi-domain datasets provide a domain label for each training instance. In practice, most datasets are mixtures of multiple latent domains, and it is difficult to manually provide the domain label of each data point. In this paper, we propose a model that automatically discovers latent domains in visual datasets. We first assume the images are sampled from multiple manifolds, each representing a different domain and modeled by a different subspace. Using the neighborhood structure estimated from images belonging to the same category, we approximate a local linear invariant subspace for each image based on its local structure, eliminating the category-specific elements of the feature. Building on this representation, we then propose a squared-loss mutual information based clustering model with a category distribution prior in each domain to infer the domain assignment for images. In experiments on two common image datasets, our method outperforms existing state-of-the-art methods and demonstrates the benefit of discovering multiple latent domains.
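A rough, illustrative sketch of the first stage described above, with all details assumed rather than taken from the paper: for each image, same-category nearest neighbours are used to estimate a local linear subspace after removing the local category mean, so the remaining directions reflect domain-style variation rather than category content. The second stage, the squared-loss mutual information clustering with a category prior, is not sketched here.

```python
import numpy as np

def local_subspace(features, labels, idx, k=10, dim=5):
    """Return an orthonormal basis (d x dim) of the local subspace around image idx."""
    same_cat = np.where(labels == labels[idx])[0]
    dists = np.linalg.norm(features[same_cat] - features[idx], axis=1)
    neighbors = same_cat[np.argsort(dists)[:k]]   # k nearest same-category images
    local = features[neighbors]
    local = local - local.mean(axis=0)            # drop the category-specific mean
    # PCA via SVD: the top principal directions span the local subspace.
    _, _, vt = np.linalg.svd(local, full_matrices=False)
    return vt[:dim].T

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 64))                # toy image features
    y = rng.integers(0, 5, size=200)              # toy category labels
    print(local_subspace(X, y, idx=0).shape)      # (64, 5)
```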

AAAI Conference 2014 Conference Paper

Learning Compositional Sparse Models of Bimodal Percepts

  • Suren Kumar
  • Vikas Dhiman
  • Jason Corso

Various perceptual domains have underlying compositional semantics that are rarely captured in current models. We suspect this is because directly learning the compositional structure has evaded these models. Yet the compositional structure of a given domain can be grounded in a separate domain, thereby simplifying its learning. To that end, we propose a new approach to modeling bimodal percepts that explicitly relates distinct projections across each modality and then jointly learns a bimodal sparse representation. The resulting model enables compositionality across these distinct projections and hence can generalize to unobserved percepts spanned by this compositional basis. For example, our model can be trained on red triangles and blue squares, yet it will implicitly also have learned red squares and blue triangles. The structure of the projections, and hence the compositional basis, is learned automatically for a given language model. To test our model, we have acquired a new bimodal dataset comprising images and spoken utterances of colored shapes in a tabletop setup. Our experiments demonstrate the benefits of explicitly leveraging compositionality in both quantitative and human evaluation studies.
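A toy stand-in for the jointly learned bimodal sparse code described above, with assumptions throughout: the visual features and spoken-utterance features of each percept are stacked into one vector and a shared dictionary is learned over them, so each atom ties together patterns from both modalities. Here scikit-learn's generic dictionary learning replaces the paper's compositional formulation; it illustrates only the shared-sparse-code idea, not the compositional basis itself.

```python
import numpy as np
from sklearn.decomposition import DictionaryLearning

rng = np.random.default_rng(0)
n_samples, vis_dim, aud_dim = 40, 20, 15
visual = rng.normal(size=(n_samples, vis_dim))     # e.g. colored-shape image features
audio = rng.normal(size=(n_samples, aud_dim))      # e.g. spoken-utterance features
bimodal = np.hstack([visual, audio])               # one joint vector per percept

dico = DictionaryLearning(n_components=8, transform_algorithm="lasso_lars",
                          transform_alpha=0.5, random_state=0)
codes = dico.fit_transform(bimodal)                # sparse code shared by both modalities
print(codes.shape, (np.abs(codes) > 1e-8).mean())  # (40, 8) and the fraction of active atoms
```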