Author name cluster

Hongyang Chao

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

5 papers

1 author row

NeurIPS Conference 2021 Conference Paper

Improving Visual Quality of Image Synthesis by A Token-based Generator with Transformers

Yanhong Zeng
Huan Yang
Hongyang Chao
Jianbo Wang
Jianlong Fu

We present a new perspective of achieving image synthesis by viewing this task as a visual token generation problem. Different from existing paradigms that directly synthesize a full image from a single input (e. g. , a latent code), the new formulation enables a flexible local manipulation for different image regions, which makes it possible to learn content-aware and fine-grained style control for image synthesis. Specifically, it takes as input a sequence of latent tokens to predict the visual tokens for synthesizing an image. Under this perspective, we propose a token-based generator (i. e. , TokenGAN). Particularly, the TokenGAN inputs two semantically different visual tokens, i. e. , the learned constant content tokens and the style tokens from the latent space. Given a sequence of style tokens, the TokenGAN is able to control the image synthesis by assigning the styles to the content tokens by attention mechanism with a Transformer. We conduct extensive experiments and show that the proposed TokenGAN has achieved state-of-the-art results on several widely-used image synthesis benchmarks, including FFHQ and LSUN CHURCH with different resolutions. In particular, the generator is able to synthesize high-fidelity images with (1024x1024) size, dispensing with convolutions entirely.

PDF Details

NeurIPS Conference 2021 Conference Paper

Searching the Search Space of Vision Transformer

Minghao Chen
Kan Wu
Bolin Ni
Houwen Peng
Bei Liu
Jianlong Fu
Hongyang Chao
Haibin Ling

Vision Transformer has shown great visual representation power in substantial vision tasks such as recognition and detection, and thus been attracting fast-growing efforts on manually designing more effective architectures. In this paper, we propose to use neural architecture search to automate this process, by searching not only the architecture but also the search space. The central idea is to gradually evolve different search dimensions guided by their E-T Error computed using a weight-sharing supernet. Moreover, we provide design guidelines of general vision transformers with extensive analysis according to the space searching process, which could promote the understanding of vision transformer. Remarkably, the searched models, named S3 (short for Searching the Search Space), from the searched space achieve superior performance to recently proposed models, such as Swin, DeiT and ViT, when evaluated on ImageNet. The effectiveness of S3 is also illustrated on object detection, semantic segmentation and visual question answering, demonstrating its generality to downstream vision and vision-language tasks. Code and models will be available at https: //github. com/microsoft/Cream.

PDF Details

AAAI Conference 2019 Conference Paper

Temporal Deformable Convolutional Encoder-Decoder Networks for Video Captioning

Jingwen Chen
Yingwei Pan
Yehao Li
Ting Yao
Hongyang Chao
Tao Mei

It is well believed that video captioning is a fundamental but challenging task in both computer vision and artificial intelligence fields. The prevalent approach is to map an input video to a variable-length output sentence in a sequence to sequence manner via Recurrent Neural Network (RNN). Nevertheless, the training of RNN still suffers to some degree from vanishing/exploding gradient problem, making the optimization difficult. Moreover, the inherently recurrent dependency in RNN prevents parallelization within a sequence during training and therefore limits the computations. In this paper, we present a novel design — Temporal Deformable Convolutional Encoder-Decoder Networks (dubbed as TD- ConvED) that fully employ convolutions in both encoder and decoder networks for video captioning. Technically, we exploit convolutional block structures that compute intermediate states of a fixed number of inputs and stack several blocks to capture long-term relationships. The structure in encoder is further equipped with temporal deformable convolution to enable free-form deformation of temporal sampling. Our model also capitalizes on temporal attention mechanism for sentence generation. Extensive experiments are conducted on both MSVD and MSR-VTT video captioning datasets, and superior results are reported when comparing to conventional RNN-based encoder-decoder techniques. More remarkably, TDConvED increases CIDEr-D performance from 58. 8% to 67. 2% on MSVD.

PDF Details

AAAI Conference 2017 Conference Paper

Building an End-to-End Spatial-Temporal Convolutional Network for Video Super-Resolution

Jun Guo
Hongyang Chao

We propose an end-to-end deep network for video superresolution. Our network is composed of a spatial component that encodes intra-frame visual patterns, a temporal component that discovers inter-frame relations, and a reconstruction component that aggregates information to predict details. We make the spatial component deep, so that it can better leverage spatial redundancies for rebuilding high-frequency structures. We organize the temporal component in a bidirectional and multi-scale fashion, to better capture how frames change across time. The effectiveness of the proposed approach is highlighted on two datasets, where we observe substantial improvements relative to the state of the arts.

PDF Details

AAAI Conference 2015 Conference Paper

Building Effective Representations for Sketch Recognition

Jun Guo
Changhu Wang
Hongyang Chao

As the popularity of touch-screen devices, understanding a user’s hand-drawn sketch has become an increasingly important research topic in artificial intelligence and computer vision. However, different from natural images, the hand-drawn sketches are often highly abstract, with sparse visual information and large intraclass variance, making the problem more challenging. In this work, we study how to build effective representations for sketch recognition. First, to capture saliency patterns of different scales and spatial arrangements, a Gabor-based low-level representation is proposed. Then, based on this representation, to discovery more complex patterns in a sketch, a Hybrid Multilayer Sparse Coding (HMSC) model is proposed to learn midlevel representations. An improved dictionary learning algorithm is also leveraged in HMSC to reduce overfitting to common but trivial patterns. Extensive experiments show that the proposed representations are highly discriminative and lead to large improvements over the state of the arts.

PDF Details