Arrow Research search

Author name cluster

Thomas Huang

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

21 papers
1 author row

Possible papers (21)

AAAI Conference 2021 Short Paper

Text Embedding Bank for Detailed Image Paragraph Captioning

  • Arjun Gupta
  • Zengming Shen
  • Thomas Huang

Existing deep learning-based models for image captioning typically consist of an image encoder to extract visual features and a language model decoder, an architecture that has shown promising results in single high-level sentence generation. However, only the word-level guiding signal is available when the image encoder is optimized to extract visual features. The inconsistency between the parallel extraction of visual features and sequential text supervision limits its success when the length of the generated text is long (more than 50 words). We propose a new module, called the Text Embedding Bank (TEB), to address this problem for image paragraph captioning. This module uses the paragraph vector model to learn fixed-length feature representations from a variable-length paragraph. We refer to the fixed-length feature as the TEB. This TEB module plays two roles to benefit paragraph captioning performance. First, it acts as a form of global and coherent deep supervision to regularize visual feature extraction in the image encoder. Second, it acts as a distributed memory to provide features of the whole paragraph to the language model, which alleviates the long-term dependency problem. Adding this module to two existing state-of-the-art methods achieves a new state-of-the-art result on the Stanford Visual Genome paragraph captioning dataset.

AAAI Conference 2020 Conference Paper

FLNet: Landmark Driven Fetching and Learning Network for Faithful Talking Facial Animation Synthesis

  • Kuangxiao Gu
  • Yuqian Zhou
  • Thomas Huang

Talking face synthesis has been widely studied in either appearance-based or warping-based methods. Previous works mostly utilize a single face image as a source, and generate novel facial animations by merging another person's facial features. However, some facial regions like eyes or teeth, which may be hidden in the source image, cannot be synthesized faithfully and stably. In this paper, we present a landmark driven two-stream network to generate faithful talking facial animation, in which more facial details are created, preserved and transferred from multiple source images instead of a single one. Specifically, we propose a network consisting of a learning and a fetching stream. The fetching sub-net directly learns to attentively warp and merge facial regions from five source images of distinctive landmarks, while the learning pipeline renders facial organs from the training face space to compensate. Compared to baseline algorithms, extensive experiments demonstrate that the proposed method achieves a higher performance both quantitatively and qualitatively. Code is at https://github.com/kgu3/FLNet_AAAI2020.

AAAI Conference 2020 Conference Paper

When AWGN-Based Denoiser Meets Real Noises

  • Yuqian Zhou
  • Jianbo Jiao
  • Haibin Huang
  • Yang Wang
  • Jue Wang
  • Honghui Shi
  • Thomas Huang

Discriminative learning based image denoisers have achieved promising performance on synthetic noises such as Additive White Gaussian Noise (AWGN). The synthetic noises adopted in most previous work are pixel-independent, but real noises are mostly spatially/channel-correlated and spatially/channel-variant. This domain gap yields unsatisfactory performance on images with real noises if the model is only trained with AWGN. In this paper, we propose a novel approach to boost the performance of a real image denoiser which is trained only with synthetic pixel-independent noise data dominated by AWGN. First, we train a deep model that consists of a noise estimator and a denoiser with mixed AWGN and Random Value Impulse Noise (RVIN). We then investigate the Pixel-shuffle Down-sampling (PD) strategy to adapt the trained model to real noises. Extensive experiments demonstrate the effectiveness and generalization of the proposed approach. Notably, our method achieves state-of-the-art performance on real sRGB images in the DND benchmark among models trained with synthetic noises. Code is available at https://github.com/yzhouas/PD-Denoising-pytorch.
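The PD idea in the abstract above is concrete enough to sketch: down-sampling by pixel-shuffle with stride s splits the image into s² mosaiced sub-images whose noise is closer to pixel-independent, and the inverse interleaves them back. A minimal numpy illustration (not the authors' code; the denoising step applied to each sub-image between the two operations is omitted):

```python
import numpy as np

def pd_downsample(img, s):
    """Split an (H, W) image into s*s mosaiced sub-images of size (H//s, W//s)."""
    return [img[i::s, j::s] for i in range(s) for j in range(s)]

def pd_upsample(subs, s):
    """Inverse of pd_downsample: interleave the sub-images back together."""
    h, w = subs[0].shape
    out = np.empty((h * s, w * s), dtype=subs[0].dtype)
    for k, sub in enumerate(subs):
        i, j = divmod(k, s)
        out[i::s, j::s] = sub
    return out
```

The round trip is lossless, so any per-sub-image denoiser can be slotted between the two calls.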

AAAI Conference 2019 Short Paper

Adaptation Strategies for Applying AWGN-Based Denoiser to Realistic Noise

  • Yuqian Zhou
  • Jianbo Jiao
  • Haibin Huang
  • Jue Wang
  • Thomas Huang

A discriminative learning based denoising model trained with Additive White Gaussian Noise (AWGN) performs well on synthesized noise. However, realistic noise can be spatial-variant, signal-dependent and a mixture of complicated noises. In this paper, we explore multiple strategies for applying an AWGN-based denoiser to realistic noise. Specifically, we trained a deep network integrating a noise estimator and a denoiser with mixed Gaussian (AWGN) and Random Value Impulse Noise (RVIN). To adapt the model to realistic noises, we investigated multi-channel, multi-scale and super-resolution approaches. Our preliminary results demonstrated the effectiveness of the newly-proposed noise model and adaptation strategies.

AAAI Conference 2019 Conference Paper

Horizontal Pyramid Matching for Person Re-Identification

  • Yang Fu
  • Yunchao Wei
  • Yuqian Zhou
  • Honghui Shi
  • Gao Huang
  • Xinchao Wang
  • Zhiqiang Yao
  • Thomas Huang

Despite the remarkable progress in person re-identification (Re-ID), such approaches still suffer from failure cases where the discriminative body parts are missing. To mitigate this type of failure, we propose a simple yet effective Horizontal Pyramid Matching (HPM) approach to fully exploit various partial information of a given person, so that correct person candidates can be identified even if some key parts are missing. With HPM, we make the following contributions to produce more robust feature representations for the Re-ID task: 1) we learn to classify using partial feature representations at different horizontal pyramid scales, which successfully enhances the discriminative capabilities of various person parts; 2) we exploit average and max pooling strategies to account for person-specific discriminative information in a global-local manner. To validate the effectiveness of our proposed HPM method, extensive experiments are conducted on three popular datasets including Market-1501, DukeMTMC-ReID and CUHK03. We achieve mAP scores of 83.1%, 74.5% and 59.7% respectively on these challenging benchmarks, which are the new state of the art.
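The pyramid pooling in contributions 1) and 2) can be sketched directly: the conv feature map is sliced into 1, 2, 4 and 8 horizontal strips, and each strip is reduced by both average and max pooling. A simplified numpy sketch (fusing the two pooled vectors by summation is an assumption here; the paper applies further per-strip convolutions and classifiers):

```python
import numpy as np

def horizontal_pyramid_features(feat, scales=(1, 2, 4, 8)):
    """feat: (C, H, W) conv feature map; H assumed divisible by every scale."""
    C, H, W = feat.shape
    descriptors = []
    for n in scales:
        strip_h = H // n
        for k in range(n):
            strip = feat[:, k * strip_h:(k + 1) * strip_h, :]
            avg = strip.mean(axis=(1, 2))   # average-pooling branch
            mx = strip.max(axis=(1, 2))     # max-pooling branch
            descriptors.append(avg + mx)    # fused per-strip descriptor (C,)
    return np.stack(descriptors)            # (1+2+4+8, C) = (15, C)
```

The coarsest scale (one strip) recovers ordinary global pooling; the finer scales supply the part-level descriptors that survive when a body part is missing.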

AAAI Conference 2019 Conference Paper

STA: Spatial-Temporal Attention for Large-Scale Video-Based Person Re-Identification

  • Yang Fu
  • Xiaoyang Wang
  • Yunchao Wei
  • Thomas Huang

In this work, we propose a novel Spatial-Temporal Attention (STA) approach to tackle the large-scale person re-identification task in videos. Different from most existing methods, which simply compute representations of video clips using frame-level aggregation (e.g. average pooling), the proposed STA adopts a more effective way of producing robust clip-level feature representations. Concretely, our STA fully exploits the discriminative parts of one target person in both spatial and temporal dimensions, which results in a 2-D attention score matrix via inter-frame regularization to measure the importance of spatial parts across different frames. Thus, a more robust clip-level feature representation can be generated by a weighted sum operation guided by the mined 2-D attention score matrix. In this way, challenging cases for video-based person re-identification such as pose variation and partial occlusion can be well tackled by the STA. We conduct extensive experiments on two large-scale benchmarks, i.e. MARS and DukeMTMC-VideoReID. In particular, the mAP reaches 87.7% on MARS, which significantly outperforms the state of the art by a large margin of more than 11.6%.
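The core aggregation step admits a compact sketch: given per-frame, per-part features and a 2-D (frames × parts) score matrix, normalize the scores across frames and take a weighted sum. A numpy sketch under the assumption that a plain softmax over frames stands in for the paper's inter-frame regularization:

```python
import numpy as np

def sta_clip_feature(part_feats, scores):
    """part_feats: (T, K, C) per-frame spatial-part features.
    scores: (T, K) raw attention scores for T frames and K spatial parts."""
    # Normalize over the frame axis so each spatial part's weights sum to 1.
    w = np.exp(scores - scores.max(axis=0, keepdims=True))
    w = w / w.sum(axis=0, keepdims=True)            # (T, K)
    clip = (w[..., None] * part_feats).sum(axis=0)  # (K, C) weighted sum over frames
    return clip.reshape(-1)                          # concatenate parts -> (K*C,)
```

With uniform scores this degenerates to frame-level average pooling, which is exactly the baseline the abstract contrasts against.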

AAAI Conference 2019 Conference Paper

Weakly Supervised Scene Parsing with Point-Based Distance Metric Learning

  • Rui Qian
  • Yunchao Wei
  • Honghui Shi
  • Jiachen Li
  • Jiaying Liu
  • Thomas Huang

Semantic scene parsing suffers from the fact that pixel-level annotations are hard to collect. To tackle this issue, we propose Point-based Distance Metric Learning (PDML) in this paper. PDML does not require dense annotated masks and only leverages several labeled points, which are much easier to obtain, to guide the training process. Concretely, we leverage the semantic relationship among the annotated points by encouraging the feature representations of the intra- and inter-category points to stay consistent, i.e. points within the same category should have more similar feature representations compared to those from different categories. We formulate such a characteristic into a simple distance metric loss, which collaborates with the point-wise cross-entropy loss to optimize the deep neural networks. Furthermore, to fully exploit the limited annotations, distance metric learning is conducted across different training images instead of simply adopting an image-dependent manner. We conduct extensive experiments on two challenging scene parsing benchmarks, PASCAL-Context and ADE20K, to validate the effectiveness of our PDML, and competitive mIoU scores are achieved.
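The intra-/inter-category consistency described above is a contrastive-style objective over the sparse labeled points. A hedged numpy sketch of one plausible formulation (the paper's exact loss and margin handling may differ):

```python
import numpy as np

def point_metric_loss(feats, labels, margin=1.0):
    """Contrastive-style distance loss over sparsely annotated points.
    feats: (N, C) point features; labels: (N,) category ids."""
    loss, pairs = 0.0, 0
    for i in range(len(feats)):
        for j in range(i + 1, len(feats)):
            d = np.linalg.norm(feats[i] - feats[j])
            if labels[i] == labels[j]:
                loss += d ** 2                     # pull intra-category points together
            else:
                loss += max(0.0, margin - d) ** 2  # push inter-category points apart
            pairs += 1
    return loss / pairs
```

Pooling `feats` from points across several training images, as the abstract emphasizes, changes nothing in the loss itself, only which pairs it sees.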

NeurIPS Conference 2018 Conference Paper

Learning Hierarchical Semantic Image Manipulation through Structured Representations

  • Seunghoon Hong
  • Xinchen Yan
  • Thomas Huang
  • Honglak Lee

Understanding, reasoning about, and manipulating semantic concepts of images have been a fundamental research problem for decades. Previous work mainly focused on direct manipulation of the natural image manifold through color strokes, key-points, textures, and holes-to-fill. In this work, we present a novel hierarchical framework for semantic image manipulation. Key to our hierarchical framework is that we employ a structured semantic layout as our intermediate representation for manipulation. Initialized with coarse-level bounding boxes, our layout generator first creates a pixel-wise semantic layout capturing the object shape, object-object interactions, and object-scene relations. Then our image generator fills in the pixel-level textures guided by the semantic layout. Such a framework allows a user to manipulate images at the object level by adding, removing, and moving one bounding box at a time. Experimental evaluations demonstrate the advantages of the hierarchical manipulation framework over existing image generation and context hole-filling models, both qualitatively and quantitatively. Benefits of the hierarchical framework are further demonstrated in applications such as semantic object manipulation, interactive image editing, and data-driven image manipulation.

NeurIPS Conference 2018 Conference Paper

Non-Local Recurrent Network for Image Restoration

  • Ding Liu
  • Bihan Wen
  • Yuchen Fan
  • Chen Change Loy
  • Thomas Huang

Many classic methods have shown non-local self-similarity in natural images to be an effective prior for image restoration. However, it remains unclear and challenging to make use of this intrinsic property via deep networks. In this paper, we propose a non-local recurrent network (NLRN) as the first attempt to incorporate non-local operations into a recurrent neural network (RNN) for image restoration. The main contributions of this work are: (1) Unlike existing methods that measure self-similarity in an isolated manner, the proposed non-local module can be flexibly integrated into existing deep networks for end-to-end training to capture deep feature correlation between each location and its neighborhood. (2) We fully employ the RNN structure for its parameter efficiency and allow deep feature correlation to be propagated along adjacent recurrent states. This new design boosts robustness against inaccurate correlation estimation due to severely degraded images. (3) We show that it is essential to maintain a confined neighborhood for computing deep feature correlation given degraded images. This is in contrast to existing practice that deploys the whole image. Extensive experiments on both image denoising and super-resolution tasks are conducted. Thanks to the recurrent non-local operations and correlation propagation, the proposed NLRN achieves superior results to state-of-the-art methods with many fewer parameters.
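Contribution (3), the confined neighborhood, is easy to illustrate in one dimension: each position attends only to positions within a fixed radius rather than the whole signal. A simplified numpy sketch with identity embeddings (the actual NLRN uses learned embeddings on 2-D feature maps inside a recurrent network):

```python
import numpy as np

def confined_nonlocal(x, radius):
    """x: (N, C) deep features along one dimension.
    Each position aggregates only a confined neighborhood of the given radius."""
    out = np.zeros_like(x)
    for i in range(len(x)):
        lo, hi = max(0, i - radius), min(len(x), i + radius + 1)
        sims = x[lo:hi] @ x[i]          # dot-product similarity (identity embeddings)
        w = np.exp(sims - sims.max())
        w /= w.sum()                    # softmax over the neighborhood
        out[i] = w @ x[lo:hi]           # weighted aggregation of neighbors
    return out
```

Restricting `lo:hi` to a window is the point of contribution (3): on degraded inputs, similarities to distant positions are unreliable, so the whole-image attention used in prior work hurts.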

AAAI Conference 2018 Short Paper

Visual Recognition in Very Low-Quality Settings: Delving Into the Power of Pre-Training

  • Bowen Cheng
  • Ding Liu
  • Zhangyang Wang
  • Haichao Zhang
  • Thomas Huang

Visual recognition from very low-quality images is an extremely challenging task with great practical value. While deep networks have been extensively applied to low-quality image restoration and high-quality image recognition tasks respectively, little work has been done on the important problem of recognition from very low-quality images. This paper presents a degradation-robust pre-training approach to improving deep learning models in this direction. Extensive experiments on different datasets validate the effectiveness of our proposed method.

IJCAI Conference 2018 Conference Paper

When Image Denoising Meets High-Level Vision Tasks: A Deep Learning Approach

  • Ding Liu
  • Bihan Wen
  • Xianming Liu
  • Zhangyang Wang
  • Thomas Huang

Conventionally, image denoising and high-level vision tasks are handled separately in computer vision. In this paper, we cope with the two jointly and explore the mutual influence between them. First we propose a convolutional neural network for image denoising which achieves the state-of-the-art performance. Second we propose a deep neural network solution that cascades two modules for image denoising and various high-level tasks, respectively, and use the joint loss for updating only the denoising network via back-propagation. We demonstrate that on one hand, the proposed denoiser has the generality to overcome the performance degradation of different high-level vision tasks. On the other hand, with the guidance of high-level vision information, the denoising network can generate more visually appealing results. To the best of our knowledge, this is the first work investigating the benefit of exploiting image semantics simultaneously for image denoising and high-level vision tasks via deep learning.

NeurIPS Conference 2017 Conference Paper

Dilated Recurrent Neural Networks

  • Shiyu Chang
  • Yang Zhang
  • Wei Han
  • Mo Yu
  • Xiaoxiao Guo
  • Wei Tan
  • Xiaodong Cui
  • Michael Witbrock

Learning with recurrent neural networks (RNNs) on long sequences is a notoriously difficult task. There are three major challenges: 1) complex dependencies, 2) vanishing and exploding gradients, and 3) efficient parallelization. In this paper, we introduce a simple yet effective RNN connection structure, the DilatedRNN, which simultaneously tackles all of these challenges. The proposed architecture is characterized by multi-resolution dilated recurrent skip connections and can be combined flexibly with diverse RNN cells. Moreover, the DilatedRNN reduces the number of parameters needed and enhances training efficiency significantly, while matching state-of-the-art performance (even with standard RNN cells) in tasks involving very long-term dependencies. To provide a theory-based quantification of the architecture's advantages, we introduce a memory capacity measure, the mean recurrent length, which is more suitable for RNNs with long skip connections than existing measures. We rigorously prove the advantages of the DilatedRNN over other recurrent neural architectures. The code for our method is publicly available at https://github.com/code-terminator/DilatedRNN.
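The dilated recurrent skip connection is the whole trick: the hidden state at step t is computed from the state at step t − d rather than t − 1, splitting the sequence into d independent, shorter chains. A single-layer numpy sketch with a vanilla tanh cell (the paper stacks such layers with exponentially increasing dilations and arbitrary cells):

```python
import numpy as np

def dilated_rnn_layer(xs, W, U, b, dilation):
    """Vanilla tanh RNN whose recurrent edge skips `dilation` steps:
    h_t = tanh(x_t W + h_{t-dilation} U + b)."""
    T, H = len(xs), b.shape[0]
    hs = np.zeros((T, H))
    for t in range(T):
        h_prev = hs[t - dilation] if t >= dilation else np.zeros(H)
        hs[t] = np.tanh(xs[t] @ W + h_prev @ U + b)
    return hs
```

With dilation d, gradients need only flow through T/d recurrent steps, which is how the architecture eases the vanishing-gradient and parallelization problems the abstract lists.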

AAAI Conference 2016 Conference Paper

Epitomic Image Super-Resolution

  • Yingzhen Yang
  • Zhangyang Wang
  • Zhaowen Wang
  • Shiyu Chang
  • Ding Liu
  • Honghui Shi
  • Thomas Huang

We propose Epitomic Image Super-Resolution (ESR) to enhance current internal SR methods that exploit the self-similarities in the input. Instead of the local nearest neighbor patch matching used in most existing internal SR methods, ESR employs epitomic patch matching, which features robustness to noise as well as both local and non-local patch matching. Extensive objective and subjective evaluations demonstrate the effectiveness and advantage of ESR on various images.

AAAI Conference 2016 Conference Paper

Learning Deep ℓ0 Encoders

  • Zhangyang Wang
  • Qing Ling
  • Thomas Huang

Despite its nonconvex nature, ℓ0 sparse approximation is desirable in many theoretical and application cases. We study the ℓ0 sparse approximation problem with the tool of deep learning, by proposing Deep ℓ0 Encoders. Two typical forms, the ℓ0-regularized problem and the M-sparse problem, are investigated. Based on solid iterative algorithms, we model them as feed-forward neural networks by introducing novel neurons and pooling functions. Enforcing such structural priors acts as an effective network regularization. The deep encoders also enjoy faster inference, larger learning capacity, and better scalability compared to conventional sparse coding solutions. Furthermore, under task-driven losses, the models can be conveniently optimized end to end. Numerical results demonstrate the impressive performance of the proposed encoders. Dedication: Zhangyang and Qing would like to dedicate the paper to their friend, Mr. Yuan Song (10/09/1984 - 07/13/2015).

IJCAI Conference 2015 Conference Paper

A Space Alignment Method for Cold-Start TV Show Recommendations

  • Shiyu Chang
  • Jiayu Zhou
  • Pirooz Chubak
  • Junling Hu
  • Thomas Huang

In recent years, recommendation algorithms have become one of the most active research areas, driven by enormous industrial demand. Most existing recommender systems focus on domains such as movies, music, e-commerce, etc., which essentially differ from TV show recommendation due to cold-start and temporal dynamics. Both effectiveness (effectively handling cold-start TV shows) and efficiency (efficiently updating the model to reflect temporal data changes) concerns have to be addressed to design real-world TV show recommendation algorithms. In this paper, we introduce a novel hybrid recommendation algorithm incorporating both the collaborative user-item relationship and item content features. Cold-start TV shows can be correctly recommended to desired users via a so-called space alignment technique. On the other hand, an online updating scheme is developed to utilize new user watching behaviors. We present experimental results on a real TV watch behavior data set to demonstrate the significant performance improvement over other state-of-the-art algorithms.
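One simple way to realize the space-alignment idea is a linear map from the item content space into the collaborative latent space, fitted on warm items and then applied to cold-start items that have content features but no watch history. A toy numpy sketch (all names, shapes, and the least-squares formulation are illustrative, not the paper's exact method):

```python
import numpy as np

# Toy data: latent factors V learned from watch behavior for 20 warm shows,
# and content features F for the same shows.
rng = np.random.default_rng(0)
F = rng.random((20, 6))            # 20 warm shows x 6 content features
M_true = rng.random((6, 3))
V = F @ M_true                     # 3-d latent factors consistent with F

# Space alignment: learn a linear map M from content space to latent space.
M, *_ = np.linalg.lstsq(F, V, rcond=None)

# A cold-start show has content features but no watch history; map it over.
f_cold = rng.random(6)
v_cold = f_cold @ M                # latent vector usable for recommendation
```

Scoring `v_cold` against user latent factors then ranks the cold-start show exactly like a warm one.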

AAAI Conference 2014 Conference Paper

Data Clustering by Laplacian Regularized ℓ1-Graph

  • Yingzhen Yang
  • Zhangyang Wang
  • Jianchao Yang
  • Jiangping Wang
  • Shiyu Chang
  • Thomas Huang

ℓ1-Graph has been proven to be effective in data clustering, partitioning the data space by using the sparse representation of the data as the similarity measure. However, the sparse representation is computed for each datum separately without taking into account the geometric structure of the data. Motivated by ℓ1-Graph and manifold learning, we propose the Laplacian Regularized ℓ1-Graph (LRℓ1-Graph) for data clustering. The sparse representations of the LRℓ1-Graph are regularized by the geometric information of the data so that they vary smoothly along the geodesics of the data manifold, via the graph Laplacian, according to the manifold assumption. Moreover, we propose an iterative regularization scheme, where the sparse representation obtained from the previous iteration is used to build the graph Laplacian for the current iteration of regularization. Experimental results on real data sets demonstrate the superiority of our algorithm compared to ℓ1-Graph and other competing clustering methods.

NeurIPS Conference 2014 Conference Paper

On a Theory of Nonparametric Pairwise Similarity for Clustering: Connecting Clustering to Classification

  • Yingzhen Yang
  • Feng Liang
  • Shuicheng Yan
  • Zhangyang Wang
  • Thomas Huang

Pairwise clustering methods partition the data space into clusters by the pairwise similarity between data points. The success of pairwise clustering largely depends on the pairwise similarity function defined over the data points, where kernel similarity is broadly used. In this paper, we present a novel pairwise clustering framework that bridges the gap between clustering and multi-class classification. This framework learns an unsupervised nonparametric classifier from each data partition, and searches for the optimal partition of the data by minimizing the generalization error of the learned classifiers associated with the data partitions. We consider two nonparametric classifiers in this framework, i.e. the nearest neighbor classifier and the plug-in classifier. Modeling the underlying data distribution by nonparametric kernel density estimation, the generalization error bounds for both unsupervised nonparametric classifiers are sums of nonparametric pairwise similarity terms between the data points, serving the purpose of clustering. Under a uniform distribution, the nonparametric similarity terms induced by both unsupervised classifiers exhibit a well-known form of kernel similarity. We also prove that the generalization error bound for the unsupervised plug-in classifier is asymptotically equal to the weighted volume of the cluster boundary for Low Density Separation, a widely used criterion for semi-supervised learning and clustering. Based on the derived nonparametric pairwise similarity using the plug-in classifier, we propose a new nonparametric exemplar-based clustering method with enhanced discriminative capability, whose superiority is evidenced by the experimental results.

AAAI Conference 2012 Conference Paper

Pairwise Exemplar Clustering

  • Yingzhen Yang
  • Xinqi Chu
  • Feng Liang
  • Thomas Huang

Exemplar-based clustering methods have been extensively shown to be effective in many clustering problems. They adaptively determine the number of clusters and hold the appealing advantage of not requiring the estimation of latent parameters, which is otherwise difficult in the case of complicated parametric models and high dimensionality of the data. However, modeling an arbitrary underlying distribution of the data is still difficult for existing exemplar-based clustering methods. We present Pairwise Exemplar Clustering (PEC) to alleviate this problem by modeling the underlying cluster distributions more accurately with non-parametric kernel density estimation. Interpreting the clusters as classes from a supervised learning perspective, we search for an optimal partition of the data that balances two quantities: (1) the misclassification rate of the data partition for separating the clusters; (2) the sum of within-cluster dissimilarities for controlling the cluster size. The broadly used kernel form of cut turns out to be a special case of our formulation. Moreover, we optimize the corresponding objective function by a new efficient algorithm for message computation in a pairwise MRF. Experimental results on synthetic and real data demonstrate the effectiveness of our method.

NeurIPS Conference 2011 Conference Paper

Learning to Search Efficiently in High Dimensions

  • Zhen Li
  • Huazhong Ning
  • LiangLiang Cao
  • Tong Zhang
  • Yihong Gong
  • Thomas Huang

High dimensional similarity search in large scale databases has become an important challenge with the advent of the Internet. For such applications, specialized data structures are required to achieve computational efficiency. Traditional approaches relied on algorithmic constructions that are often data independent (such as Locality Sensitive Hashing) or weakly dependent (such as kd-trees and k-means trees). While supervised learning algorithms have been applied to related problems, those proposed in the literature mainly focused on learning hash codes optimized for compact embedding of the data rather than search efficiency. Consequently such an embedding has to be used with linear scan or another search algorithm, so learning to hash does not directly address the search efficiency issue. This paper considers a new framework that applies supervised learning to directly optimize a data structure that supports efficient large scale search. Our approach takes both search quality and computational cost into consideration. Specifically, we learn a boosted search forest that is optimized using pairwise similarity labeled examples. The output of this search forest can be efficiently converted into an inverted indexing data structure, which can leverage modern text search infrastructure to achieve both scalability and efficiency. Experimental results show that our approach significantly outperforms state-of-the-art learning to hash methods (such as spectral hashing), as well as state-of-the-art high dimensional search algorithms (such as LSH and k-means trees).

IJCAI Conference 2007 Conference Paper

  • Huan Wang
  • Shuicheng Yan
  • Thomas Huang
  • Xiaoou Tang

Recently, substantial efforts have been devoted to subspace learning techniques based on tensor representation, such as 2DLDA, DATER and Tensor Subspace Analysis (TSA). In this context, a vital yet unsolved problem is that the computational convergence of these iterative algorithms is not guaranteed. In this work, we present a novel solution procedure for general tensor-based subspace learning, followed by a detailed convergence proof for the solution projection matrices and the objective function value. Extensive experiments on real-world databases verify the high convergence speed of the proposed procedure, as well as its superiority in classification capability over traditional solution procedures.

NeurIPS Conference 2000 Conference Paper

Probabilistic Semantic Video Indexing

  • Milind Naphade
  • Igor Kozintsev
  • Thomas Huang

We propose a novel probabilistic framework for semantic video indexing. We define probabilistic multimedia objects (multijects) to map low-level media features to high-level semantic labels. A graphical network of such multijects (a multinet) captures scene context by discovering intra-frame as well as inter-frame dependency relations between the concepts. The main contribution is a novel application of a factor graph framework to model this network. We model relations between semantic concepts in terms of their co-occurrence as well as the temporal dependencies between these concepts within video shots. Using the sum-product algorithm [1] for approximate or exact inference in these factor graph multinets, we attempt to correct errors made during isolated concept detection by forcing high-level constraints. This results in a significant improvement in the overall detection performance.