Arrow Research · Search

Author name cluster

Jiahui Yu

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

15 papers
2 author rows

Possible papers (15)

JBHI · 2025 · Journal Article

Prior Visual-Guided Self-Supervised Learning Enables Color Vignetting Correction for High-Throughput Microscopic Imaging

  • Jianhang Wang
  • Tianyu Ma
  • Luhong Jin
  • Yunqi Zhu
  • Jiahui Yu
  • Feng Chen
  • Shujun Fu
  • Yingke Xu

Vignetting is a prevalent optical degradation that significantly compromises the quality of biomedical microscopic imaging, yet a robust and efficient vignetting correction method for multi-channel microscopic images is still lacking. In this paper, we exploit prior knowledge about the homogeneity of microscopic images and the radial attenuation property of vignetting to develop a self-supervised deep learning algorithm that removes complex vignetting from color microscopic images. Our proposed method, the vignetting correction lookup table (VCLUT), is trainable on both single and multiple images and employs adversarial learning to transfer the good imaging conditions of a user-defined central region of the light field to the entire image. To illustrate its effectiveness, we performed individual correction experiments on data from five distinct biological specimens; the results demonstrate that VCLUT outperforms classical methods. We further examined its performance as a multi-image-based approach on a pathological dataset, where it surpasses other state-of-the-art approaches in both qualitative and quantitative measurements. Moreover, it generalizes across various levels of vignetting intensity and computes ultra-fast, making it well-suited for integration into the high-throughput imaging pipelines of digital microscopy.
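
As a toy illustration of the two priors the method builds on, image homogeneity and radial attenuation, the sketch below fits one gain per radial annulus so that outer rings match the mean intensity of a user-chosen central region. This is not the paper's adversarial VCLUT; all names and parameters are hypothetical.

```python
import numpy as np

def radial_gain_correction(img, center_frac=0.2, n_bins=64):
    """img: (H, W) grayscale float image; returns a vignetting-corrected copy."""
    h, w = img.shape
    yy, xx = np.mgrid[0:h, 0:w]
    r = np.hypot(yy - h / 2, xx - w / 2)
    r = r / r.max()                                   # normalized radius in [0, 1]
    ref = img[r < center_frac].mean()                 # "good" central region
    corrected = img.astype(float)
    bins = np.linspace(0, 1, n_bins + 1)
    for lo, hi in zip(bins[:-1], bins[1:]):           # one gain per radial annulus
        mask = (r >= lo) & (r < hi)
        if mask.any():
            corrected[mask] *= ref / max(img[mask].mean(), 1e-6)
    return corrected
```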

JBHI · 2025 · Journal Article

Semi-Supervised Instance Segmentation in Whole Slide Images via Dense Spatial Variability Enhancing

  • Jiahui Yu
  • Tianyu Ma
  • Dong Hua
  • Feng Chen
  • Junfen Fu
  • Yingke Xu

Current whole slide image (WSI) segmentation aims at extracting tumor regions from the background; by contrast, segmenting distinct tumor areas (instances) within a WSI from limited annotated data remains under-explored. In this paper, we formally propose semi-supervised instance segmentation (Semi-IS) in WSIs and address a key challenge: learning intra-class similarity and inter-class dissimilarity from unlabeled data. Specifically, we treat each patch as a set of tokens considered jointly, rather than as a single unit, and employ contrastive learning to build a segmentation framework. In Semi-IS, we observe that the boundaries of segmented instances are usually disturbed by noise, and we address this by jointly suppressing noisy boundary features while preserving informative ones. We conduct extensive experiments on histopathology and cellular pathology to evaluate the effectiveness and generalizability of Semi-IS. The results show that on clinical multi-instance segmentation tasks, Semi-IS nearly matches fully supervised state-of-the-art results with only 30% of the annotations, and it improves segmentation accuracy by about 2% on public cell pathology datasets.
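
A minimal sketch of the contrastive component only, assuming an InfoNCE loss between the token embeddings of two augmented views of the same patch; the full Semi-IS pipeline (pseudo-labeling, boundary noise handling) is not reproduced here.

```python
import torch
import torch.nn.functional as F

def token_info_nce(tokens_a, tokens_b, temperature=0.1):
    """tokens_a, tokens_b: (N, D) embeddings of corresponding tokens in two views."""
    a = F.normalize(tokens_a, dim=1)
    b = F.normalize(tokens_b, dim=1)
    logits = a @ b.t() / temperature          # (N, N) cross-view similarities
    targets = torch.arange(a.size(0))         # token i in view A matches token i in view B
    return F.cross_entropy(logits, targets)

loss = token_info_nce(torch.randn(16, 128), torch.randn(16, 128))
```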

ICLR · 2024 · Conference Paper

CoBIT: A Contrastive Bi-directional Image-Text Generation Model

  • Haoxuan You
  • Mandy Guo
  • Zhecan Wang
  • Kai-Wei Chang 0001
  • Jason M. Baldridge
  • Jiahui Yu

The field of Vision-and-Language (VL) has witnessed a proliferation of pretrained foundation models. Current techniques typically employ only one type of training objective: (1) contrastive objectives (like CLIP), (2) image-to-text generative objectives (like PaLI), or (3) text-to-image generative objectives (like Parti). However, these three objectives are mutually relevant and are all based on image-text pairs. Intuitively, the first two objectives can be considered complementary projections between the two modalities, and contrastive learning preserves global alignment while generation facilitates fine-grained understanding. Inspired by this, we present a Contrastive Bi-directional Image-Text generation model (CoBIT) that, for the first time, unifies the three pre-training objectives in one framework. Specifically, CoBIT employs a novel unicoder-decoder structure consisting of an image unicoder, a text unicoder, and a cross-modal decoder. The image/text unicoders can switch between encoding and decoding for different tasks, enabling flexibility and shared knowledge that benefits both image-to-text and text-to-image generation. CoBIT achieves superior performance in image understanding, image-text understanding (retrieval, captioning, VQA, SNLI-VE), and text-based content creation, particularly in zero-shot scenarios.
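
A hedged sketch of how the three objectives could be summed in one training step, computed here on dummy tensors; the unicoder-decoder wiring that actually produces these embeddings and logits is omitted.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(img_emb, txt_emb, t=0.07):
    img, txt = F.normalize(img_emb, dim=1), F.normalize(txt_emb, dim=1)
    logits = img @ txt.t() / t
    labels = torch.arange(img.size(0))
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))

def lm_loss(logits, targets):
    # next-token prediction: position t predicts token t+1
    return F.cross_entropy(logits[:, :-1].reshape(-1, logits.size(-1)),
                           targets[:, 1:].reshape(-1))

# dummy batch: 4 image-text pairs, 128-d embeddings, vocabulary of 1000
img_emb, txt_emb = torch.randn(4, 128), torch.randn(4, 128)
i2t_logits, text_ids = torch.randn(4, 16, 1000), torch.randint(1000, (4, 16))   # captioning
t2i_logits, image_ids = torch.randn(4, 32, 1000), torch.randint(1000, (4, 32))  # image tokens

total = (contrastive_loss(img_emb, txt_emb)
         + lm_loss(i2t_logits, text_ids)       # image-to-text generation
         + lm_loss(t2i_logits, image_ids))     # text-to-image generation
```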

NeurIPS · 2023 · Conference Paper

Module-wise Adaptive Distillation for Multimodality Foundation Models

  • Chen Liang
  • Jiahui Yu
  • Ming-Hsuan Yang
  • Matthew Brown
  • Yin Cui
  • Tuo Zhao
  • Boqing Gong
  • Tianyi Zhou

Pre-trained multimodal foundation models have demonstrated remarkable generalizability but pose challenges for deployment due to their large sizes. One effective approach to reducing their sizes is layerwise distillation, wherein small student models are trained to match the hidden representations of large teacher models at each layer. Motivated by our observation that certain architecture components, referred to as modules, contribute more significantly to the student's performance than others, we propose to track the contributions of individual modules by recording the loss decrement after distilling each module, and to distill modules with greater contributions more frequently. Such an approach can be naturally formulated as a multi-armed bandit (MAB) problem, where modules and loss decrements are considered as arms and rewards, respectively. We then develop a modified Thompson sampling algorithm named OPTIMA to address the nonstationarity of module contributions resulting from model updating. Specifically, we leverage the observed contributions in recent history to estimate the changing contribution of each module and select modules based on these estimations to maximize the cumulative contribution. We evaluate the effectiveness of OPTIMA through distillation experiments on various multimodal understanding and image captioning tasks, using the CoCa-Large model (Yu et al., 2022) as the teacher.
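
The bandit view is easy to make concrete. Below is a minimal sketch, not the paper's OPTIMA code: each module is an arm, the observed loss decrement is its reward, and a recency-weighted Gaussian Thompson sampler picks the next module to distill, so modules that help more get distilled more often.

```python
import random

class ModuleBandit:
    def __init__(self, n_modules, decay=0.9, sigma=0.1):
        self.mean = [0.0] * n_modules    # recency-weighted reward estimate per module
        self.decay, self.sigma = decay, sigma

    def pick(self):
        # Thompson sampling: draw a plausible reward per arm, take the argmax
        samples = [random.gauss(m, self.sigma) for m in self.mean]
        return max(range(len(samples)), key=samples.__getitem__)

    def update(self, module, loss_decrement):
        # exponential moving average tracks the nonstationary contributions
        self.mean[module] = self.decay * self.mean[module] + (1 - self.decay) * loss_decrement

bandit = ModuleBandit(n_modules=4)
for step in range(100):
    m = bandit.pick()
    reward = random.random() * (m + 1) / 4     # stand-in for an observed loss decrement
    bandit.update(m, reward)
```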

TMLR · 2022 · Journal Article

CoCa: Contrastive Captioners are Image-Text Foundation Models

  • Jiahui Yu
  • Zirui Wang
  • Vijay Vasudevan
  • Legg Yeung
  • Mojtaba Seyedhosseini
  • Yonghui Wu

Exploring large-scale pretrained foundation models is of significant interest in computer vision because these models can be quickly transferred to many downstream tasks. This paper presents Contrastive Captioner (CoCa), a minimalist design to pretrain an image-text encoder-decoder foundation model jointly with contrastive loss and captioning loss, thereby subsuming model capabilities from contrastive approaches like CLIP and generative methods like SimVLM. In contrast to standard encoder-decoder transformers where all decoder layers attend to encoder outputs, CoCa omits cross-attention in the first half of decoder layers to encode unimodal text representations, and cascades the remaining decoder layers which cross-attend to the image encoder for multimodal image-text representations. We apply a contrastive loss between unimodal image and text embeddings, in addition to a captioning loss on the multimodal decoder outputs which predicts text tokens autoregressively. By sharing the same computational graph, the two training objectives are computed efficiently with minimal overhead. CoCa is pretrained end-to-end and from scratch on both web-scale alt-text data and annotated images by treating all labels simply as text, seamlessly unifying natural language supervision for representation learning. Empirically, CoCa achieves state-of-the-art performance with zero-shot transfer or minimal task-specific adaptation on a broad range of downstream tasks, spanning visual recognition (ImageNet, Kinetics-400/600/700, Moments-in-Time), crossmodal retrieval (MSCOCO, Flickr30K, MSR-VTT), multimodal understanding (VQA, SNLI-VE, NLVR2), and image captioning (MSCOCO, NoCaps). Notably on ImageNet classification, CoCa obtains 86.3% zero-shot top-1 accuracy, 90.6% with a frozen encoder and learned classification head, and 91.0% with a finetuned encoder.
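
The decoder split described above can be sketched on dummy tensors as follows; the blocks below are generic PyTorch layers standing in for CoCa's causal decoder halves (masking and pooling details are simplified), not the released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

B, T, D, V = 4, 16, 128, 1000
img_feats = torch.randn(B, 49, D)                   # image encoder output (dummy)
txt = torch.randint(V, (B, T))

embed = nn.Embedding(V, D)
uni = nn.TransformerEncoderLayer(D, nhead=4, batch_first=True)    # stand-in: unimodal half, no cross-attention
cross = nn.MultiheadAttention(D, num_heads=4, batch_first=True)   # stand-in: multimodal half
to_vocab = nn.Linear(D, V)

h_uni = uni(embed(txt))                             # unimodal text states
txt_emb, img_emb = h_uni[:, -1], img_feats.mean(dim=1)
logits = F.normalize(img_emb, dim=1) @ F.normalize(txt_emb, dim=1).t() / 0.07
labels = torch.arange(B)
contrastive = 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))

h_multi, _ = cross(h_uni, img_feats, img_feats)     # cross-attend text states to the image
caption = F.cross_entropy(to_vocab(h_multi)[:, :-1].reshape(-1, V), txt[:, 1:].reshape(-1))
loss = contrastive + caption                        # the two objectives share one graph
```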

TMLR · 2022 · Journal Article

Scaling Autoregressive Models for Content-Rich Text-to-Image Generation

  • Jiahui Yu
  • Yuanzhong Xu
  • Jing Yu Koh
  • Thang Luong
  • Gunjan Baid
  • Zirui Wang
  • Vijay Vasudevan
  • Alexander Ku

We present the Pathways Autoregressive Text-to-Image (Parti) model, which generates high-fidelity photorealistic images and supports content-rich synthesis involving complex compositions and world knowledge. Parti treats text-to-image generation as a sequence-to-sequence modeling problem, akin to machine translation, with sequences of image tokens as the target outputs rather than text tokens in another language. This strategy can naturally tap into the rich body of prior work on large language models, which have seen continued advances in capabilities and performance through scaling data and model sizes. Our approach is simple: First, Parti uses a Transformer-based image tokenizer, ViT-VQGAN, to encode images as sequences of discrete tokens. Second, we achieve consistent quality improvements by scaling the encoder-decoder Transformer model up to 20B parameters, with a new state-of-the-art zero-shot FID score of 7.23 and finetuned FID score of 3.22 on MS-COCO. Our detailed analysis on Localized Narratives as well as PartiPrompts (P2), a new holistic benchmark of over 1600 English prompts, demonstrates the effectiveness of Parti across a wide variety of categories and difficulty aspects. We also explore and highlight limitations of our models in order to define and exemplify key areas of focus for further improvements.
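
The sequence-to-sequence framing reduces image generation to ordinary autoregressive decoding over image-token ids, as in the sketch below; the "model" is a random stub, not ViT-VQGAN or the 20B Parti transformer.

```python
import torch

def sample_image_tokens(step_logits_fn, n_tokens=256, temperature=1.0):
    """Autoregressively sample image-token ids; a ViT-VQGAN-style decoder
    would then turn the ids back into pixels."""
    tokens = []
    for _ in range(n_tokens):
        logits = step_logits_fn(tokens) / temperature   # condition on text + token prefix
        probs = torch.softmax(logits, dim=-1)
        tokens.append(torch.multinomial(probs, 1).item())
    return tokens

stub = lambda prefix: torch.randn(8192)                 # stand-in for the decoder
ids = sample_image_tokens(stub, n_tokens=16)
```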

ICML · 2022 · Conference Paper

Self-supervised learning with random-projection quantizer for speech recognition

  • Chung-Cheng Chiu
  • James Qin
  • Yu Zhang
  • Jiahui Yu
  • Yonghui Wu

We present a simple and effective self-supervised learning approach for speech recognition. The approach learns a model to predict masked speech signals, in the form of discrete labels generated with a random-projection quantizer. In particular, the quantizer projects speech inputs with a randomly initialized matrix and does a nearest-neighbor lookup in a randomly initialized codebook. Neither the matrix nor the codebook is updated during self-supervised learning. Since the random-projection quantizer is not trained and is separate from the speech recognition model, the design makes the approach flexible and compatible with universal speech recognition architectures. On LibriSpeech, our approach achieves word error rates similar to previous self-supervised learning work with non-streaming models, and lower word error rates than previous work with streaming models. On multilingual tasks, the approach also provides significant improvement over wav2vec 2.0 and w2v-BERT.
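
The quantizer itself fits in a few lines, as sketched below with arbitrary dimensions: a frozen random matrix projects each frame, and the training label is the index of the nearest vector in a frozen random codebook.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_code, n_codes = 80, 16, 8192           # e.g. 80-dim log-mel frames
P = rng.normal(size=(d_in, d_code))            # random projection, never trained
codebook = rng.normal(size=(n_codes, d_code))  # random codebook, never trained
codebook /= np.linalg.norm(codebook, axis=1, keepdims=True)

def quantize(frames):
    """frames: (T, d_in) -> (T,) discrete labels used as masked-prediction targets."""
    z = frames @ P
    z /= np.linalg.norm(z, axis=1, keepdims=True)
    # with unit-norm vectors, the nearest l2 neighbor is the highest cosine match
    return (z @ codebook.T).argmax(axis=1)

labels = quantize(rng.normal(size=(100, d_in)))
```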

ICLR · 2022 · Conference Paper

SimVLM: Simple Visual Language Model Pretraining with Weak Supervision

  • Zirui Wang
  • Jiahui Yu
  • Adams Wei Yu
  • Zihang Dai
  • Yulia Tsvetkov
  • Yuan Cao 0007

With recent progress in joint modeling of visual and textual representations, Vision-Language Pretraining (VLP) has achieved impressive performance on many multimodal downstream tasks. However, the requirement for expensive annotations including clean image captions and regional labels limits the scalability of existing approaches, and complicates the pretraining procedure with the introduction of multiple dataset-specific objectives. In this work, we relax these constraints and present a minimalist pretraining framework, named Simple Visual Language Model (SimVLM). Unlike prior work, SimVLM reduces the training complexity by exploiting large-scale weak supervision, and is trained end-to-end with a single prefix language modeling objective. Without utilizing extra data or task-specific customization, the resulting model significantly outperforms previous pretraining methods and achieves new state-of-the-art results on a wide range of discriminative and generative vision-language benchmarks, including VQA (+3.74% vqa-score), NLVR2 (+1.17% accuracy), SNLI-VE (+1.37% accuracy) and image captioning tasks (+10.1% average CIDEr score). Furthermore, we demonstrate that SimVLM acquires strong generalization and transfer ability, enabling zero-shot behavior including open-ended visual question answering and cross-modality transfer.
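
The single PrefixLM objective amounts to next-token prediction restricted to positions past the prefix, as in this sketch on dummy tensors (the bidirectional-over-prefix attention masking inside the model is elided):

```python
import torch
import torch.nn.functional as F

B, T, V, prefix_len = 4, 20, 1000, 8
logits = torch.randn(B, T, V)                  # decoder outputs (dummy)
tokens = torch.randint(V, (B, T))

# position t predicts token t+1, but only where t+1 lies beyond the prefix
shift_logits, shift_targets = logits[:, :-1], tokens[:, 1:]
mask = torch.arange(1, T).expand(B, T - 1) >= prefix_len
loss = F.cross_entropy(shift_logits[mask], shift_targets[mask])
```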

ICLR · 2022 · Conference Paper

Vector-quantized Image Modeling with Improved VQGAN

  • Jiahui Yu
  • Xin Li
  • Jing Yu Koh
  • Han Zhang 0010
  • Ruoming Pang
  • James Qin
  • Alexander Ku
  • Yuanzhong Xu

Pretraining language models with next-token prediction on massive text corpora has delivered phenomenal zero-shot, few-shot, transfer learning and multi-tasking capabilities on both generative and discriminative language tasks. Motivated by this success, we explore a Vector-quantized Image Modeling (VIM) approach that involves pretraining a Transformer to predict rasterized image tokens autoregressively. The discrete image tokens are encoded from a learned Vision-Transformer-based VQGAN (ViT-VQGAN). We first propose multiple improvements over vanilla VQGAN from architecture to codebook learning, yielding better efficiency and reconstruction fidelity. The improved ViT-VQGAN further improves vector-quantized image modeling tasks, including unconditional and class-conditioned image generation and unsupervised representation learning. When trained on ImageNet at 256x256 resolution, we achieve an Inception Score (IS) of 175.1 and a Fréchet Inception Distance (FID) of 4.17, a dramatic improvement over the vanilla VQGAN, which obtains 70.6 and 17.04 for IS and FID, respectively. Based on ViT-VQGAN and unsupervised pretraining, we further evaluate the pretrained Transformer by averaging intermediate features, similar to Image GPT (iGPT). This ImageNet-pretrained VIM-L significantly beats iGPT-L, improving linear-probe accuracy from 60.3% to 73.2% at a similar model size. VIM-L also outperforms iGPT-XL, which is trained with extra web image data and a larger model.
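
A generic vector-quantization bottleneck of the kind ViT-VQGAN builds on is sketched below with a straight-through gradient; the paper's specific codebook improvements (e.g. factorized low-dimensional codes) are omitted.

```python
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    def __init__(self, n_codes=8192, dim=32, beta=0.25):
        super().__init__()
        self.codebook = nn.Embedding(n_codes, dim)
        self.beta = beta

    def forward(self, z):                        # z: (N, dim) encoder outputs
        d = torch.cdist(z, self.codebook.weight)            # distances to all codes
        ids = d.argmin(dim=1)                               # nearest code per vector
        z_q = self.codebook(ids)
        # codebook + commitment losses; straight-through gradient to the encoder
        loss = ((z_q - z.detach()) ** 2).mean() + self.beta * ((z_q.detach() - z) ** 2).mean()
        z_q = z + (z_q - z).detach()
        return z_q, ids, loss

z_q, ids, vq_loss = VectorQuantizer()(torch.randn(10, 32))
```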

ICLR · 2021 · Conference Paper

Dual-mode ASR: Unify and Improve Streaming ASR with Full-context Modeling

  • Jiahui Yu
  • Wei Han 0002
  • Anmol Gulati
  • Chung-Cheng Chiu
  • Bo Li 0028
  • Tara N. Sainath
  • Yonghui Wu
  • Ruoming Pang

Streaming automatic speech recognition (ASR) aims to emit each hypothesized word as quickly and accurately as possible, while full-context ASR waits for the completion of a full speech utterance before emitting completed hypotheses. In this work, we propose a unified framework, Dual-mode ASR, to train a single end-to-end ASR model with shared weights for both streaming and full-context speech recognition. We show that the latency and accuracy of streaming ASR significantly benefit from weight sharing and joint training with full-context ASR, especially with in-place knowledge distillation during training. The Dual-mode ASR framework can be applied to recent state-of-the-art convolution-based and transformer-based ASR networks. We present extensive experiments with two state-of-the-art ASR networks, ContextNet and Conformer, on two datasets: the widely used public LibriSpeech dataset and the large-scale MultiDomain dataset. Experiments and ablation studies demonstrate that Dual-mode ASR not only simplifies the workflow of training and deploying streaming and full-context ASR models, but also significantly improves both the emission latency and recognition accuracy of streaming ASR. With Dual-mode ASR, we achieve new state-of-the-art streaming ASR results on both LibriSpeech and MultiDomain in terms of accuracy and latency.
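
A toy rendering of the dual-mode idea, using a generic attention layer rather than ContextNet or Conformer: the same weights run once with full context and once with a causal mask, and in-place distillation pulls the streaming predictions toward the full-context ones.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

T, D, V = 50, 64, 30
attn = nn.MultiheadAttention(D, num_heads=4, batch_first=True)   # shared weights for both modes
head = nn.Linear(D, V)
x = torch.randn(1, T, D)                                         # acoustic features (dummy)

causal = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)  # block future frames
full, _ = attn(x, x, x)                                          # full-context mode
stream, _ = attn(x, x, x, attn_mask=causal)                      # streaming mode, same weights

p_full = F.log_softmax(head(full), dim=-1)
p_stream = F.log_softmax(head(stream), dim=-1)
# in-place distillation: the streaming student matches the full-context teacher
distill = F.kl_div(p_stream, p_full.detach(), log_target=True, reduction="batchmean")
```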

ICLR · 2020 · Conference Paper

FSNet: Compression of Deep Convolutional Neural Networks by Filter Summary

  • Yingzhen Yang
  • Jiahui Yu
  • Nebojsa Jojic
  • Jun Huan
  • Thomas S. Huang

We present a novel method for compressing deep Convolutional Neural Networks (CNNs) by weight sharing through a new representation of convolutional filters. The proposed method reduces the number of parameters of each convolutional layer by learning a 1D vector termed the Filter Summary (FS). The convolutional filters are located in the FS as overlapping 1D segments, and nearby filters in the FS share weights in their overlapping regions in a natural way. The resultant neural network based on this weight-sharing scheme, termed Filter Summary CNNs or FSNet, has an FS in each convolution layer instead of the set of independent filters in a conventional convolution layer. FSNet has the same architecture as the baseline CNN to be compressed, and each convolution layer of FSNet has the same number of filters, drawn from the FS, as the baseline CNN in the forward process. With a compelling computational acceleration ratio, the parameter space of FSNet is much smaller than that of the baseline CNN. In addition, FSNet is quantization friendly: FSNet with weight quantization leads to an even higher compression ratio without noticeable performance loss. We further propose Differentiable FSNet (DFSNet), where the way filters share weights is learned in a differentiable and end-to-end manner. Experiments demonstrate the effectiveness of FSNet in the compression of CNNs for computer vision tasks including image classification and object detection, and the effectiveness of DFSNet is evidenced by the task of Neural Architecture Search.
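
The weight-sharing scheme is easy to picture: every filter is an overlapping window of one shared 1D vector. The sketch below uses illustrative shapes, not the paper's configuration.

```python
import numpy as np

def filters_from_summary(fs, n_filters, filter_len, stride):
    """Slice n_filters overlapping windows of length filter_len out of the FS vector."""
    return np.stack([fs[i * stride : i * stride + filter_len] for i in range(n_filters)])

k, length, stride = 64, 27, 3                      # e.g. 64 filters of shape 3x3x3 = 27 weights
fs = np.random.randn(stride * (k - 1) + length)    # the Filter Summary
W = filters_from_summary(fs, k, length, stride)    # (64, 27); reshape to (64, 3, 3, 3) for conv
print(fs.size, "stored weights instead of", W.size)
```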

NeurIPS · 2020 · Conference Paper

Neural Sparse Representation for Image Restoration

  • Yuchen Fan
  • Jiahui Yu
  • Yiqun Mei
  • Yulun Zhang
  • Yun Fu
  • Ding Liu
  • Thomas S. Huang

Inspired by the robustness and efficiency of sparse representation in sparse coding based image restoration models, we investigate the sparsity of neurons in deep networks. Our method structurally enforces sparsity constraints upon hidden neurons. The sparsity constraints are favorable for gradient-based learning algorithms and attachable to convolution layers in various networks. Sparsity in neurons enables computation saving by only operating on non-zero components without hurting accuracy. Meanwhile, our method can magnify representation dimensionality and model capacity with negligible additional computation cost. Experiments show that sparse representation is crucial in deep neural networks for multiple image restoration tasks, including image super-resolution, image denoising, and image compression artifacts removal.
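
One simple way to enforce such structural sparsity is a top-k gate over channels, as sketched below; this is a generic stand-in for the paper's constraint, not its exact formulation.

```python
import torch

def topk_channel_sparsity(x, k):
    """x: (B, C, H, W); zero all but the k largest-magnitude channels per position."""
    thresh = x.abs().topk(k, dim=1).values[:, -1:, :, :]   # k-th largest magnitude per pixel
    return x * (x.abs() >= thresh)

x = torch.randn(2, 64, 8, 8)
y = topk_channel_sparsity(x, k=16)    # 75% of activations become exactly zero
```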

AAAI · 2020 · Conference Paper

Scale-Wise Convolution for Image Restoration

  • Yuchen Fan
  • Jiahui Yu
  • Ding Liu
  • Thomas S. Huang

While scale-invariant modeling has substantially boosted the performance of visual recognition tasks, it remains largely under-explored in deep-network-based image restoration. Naively applying scale-invariant techniques (e.g., multi-scale testing, random-scale data augmentation) to image restoration tasks usually leads to inferior performance. In this paper, we show that properly modeling scale-invariance in neural networks can bring significant benefits to image restoration performance. Inspired by spatial-wise convolution for shift-invariance, “scale-wise convolution” is proposed to convolve across multiple scales for scale-invariance. In our scale-wise convolutional network (SCN), we first map the input image to the feature space and then build a feature pyramid representation via progressive bilinear down-scaling. The feature pyramid is then passed to a residual network with scale-wise convolutions. The proposed scale-wise convolution learns to dynamically activate and aggregate features from different input scales in each residual building block, in order to exploit contextual information on multiple scales. In experiments, we compare restoration accuracy and parameter efficiency between our model and many different variants of multi-scale neural networks. The proposed network with scale-wise convolution achieves superior performance in multiple image restoration tasks including image super-resolution, image denoising, and image compression artifacts removal. Code and models are available at: https://github.com/ychfan/scn_sr.
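
The analogy to spatial convolution can be sketched as below, assuming a shared convolution applied across a bilinear pyramid with each scale aggregating its resized neighbors; the real SCN block's details are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

conv = nn.Conv2d(16, 16, 3, padding=1)             # one convolution shared across scales

def scale_wise(pyramid):
    out = []
    for i, f in enumerate(pyramid):
        agg = conv(f)                               # current scale
        for j in (i - 1, i + 1):                    # neighboring scales, resized to match
            if 0 <= j < len(pyramid):
                agg = agg + F.interpolate(conv(pyramid[j]), size=f.shape[-2:],
                                          mode="bilinear", align_corners=False)
        out.append(agg)
    return out

x = torch.randn(1, 16, 64, 64)
pyramid = [x, F.interpolate(x, scale_factor=0.5, mode="bilinear", align_corners=False)]
features = scale_wise(pyramid)
```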

UAI · 2019 · Conference Paper

Fast Proximal Gradient Descent for A Class of Non-convex and Non-smooth Sparse Learning Problems

  • Yingzhen Yang
  • Jiahui Yu

Non-convex and non-smooth optimization problems are important for statistics and machine learning, but solving such problems is challenging. In this paper, we propose fast proximal gradient descent based methods to solve a class of non-convex and non-smooth sparse learning problems, i.e., the $\ell^0$ regularization problems. We prove an improved convergence rate of proximal gradient descent on the $\ell^0$ regularization problems, and propose two accelerated versions by support projection. The proposed accelerated proximal gradient descent methods by support projection have convergence rates matching Nesterov's optimal convergence rate of first-order methods on smooth convex objective functions with Lipschitz continuous gradients. Experimental results demonstrate the effectiveness of the proposed algorithms. We also propose feedforward neural networks as fast encoders to approximate the optimization results generated by the proposed accelerated algorithms.
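
The proximal map of the $\ell^0$ penalty is hard thresholding, which makes the basic iteration short. The sketch below is plain proximal gradient descent on $\ell^0$-regularized least squares, without the paper's support-projection acceleration.

```python
import numpy as np

def prox_l0(z, lam, t):
    # keep entries whose magnitude exceeds the hard threshold sqrt(2*lam*t)
    return np.where(np.abs(z) > np.sqrt(2 * lam * t), z, 0.0)

def pgd_l0(A, b, lam, n_iters=200):
    t = 1.0 / np.linalg.norm(A, 2) ** 2     # step size 1/L, L = Lipschitz constant of the gradient
    x = np.zeros(A.shape[1])
    for _ in range(n_iters):
        x = prox_l0(x - t * A.T @ (A @ x - b), lam, t)
    return x

rng = np.random.default_rng(0)
A, x_true = rng.normal(size=(60, 100)), np.zeros(100)
x_true[:5] = 3.0
x_hat = pgd_l0(A, A @ x_true, lam=0.5)      # recovers a 5-sparse vector
```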

UAI · 2017 · Conference Paper

Neighborhood Regularized ℓ1-Graph

  • Yingzhen Yang
  • Jiashi Feng
  • Jiahui Yu
  • Jianchao Yang
  • Thomas S. Huang

ℓ1-Graph, which learns a sparse graph over the data by sparse representation, has been demonstrated to be effective in clustering, especially for high-dimensional data. Although it achieves compelling performance, the sparse graph generated by ℓ1-Graph ignores the geometric information of the data, since it computes the sparse representation of each datum separately. To obtain a sparse graph that is aligned to the underlying manifold structure of the data, we propose the novel Neighborhood Regularized ℓ1-Graph (NRℓ1-Graph). NRℓ1-Graph learns a sparse graph with a locally consistent neighborhood by encouraging nearby data to have similar neighbors in the constructed sparse graph. We present the optimization algorithm of NRℓ1-Graph with theoretical guarantees on the convergence and on the gap between the suboptimal solution and the globally optimal solution in each step of the coordinate descent, which is essential for the overall optimization of NRℓ1-Graph. A provably accelerated version, NRℓ1-Graph by Random Projection (NRℓ1-Graph-RP), which employs randomized data matrix decomposition, is also presented to improve the efficiency of the optimization of NRℓ1-Graph. Experimental results on various real data sets demonstrate the effectiveness of both NRℓ1-Graph and NRℓ1-Graph-RP.
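
For context, the baseline ℓ1-Graph that NRℓ1-Graph regularizes can be built as below: each datum is sparsely coded over all other points (one lasso per datum) and the code magnitudes become edge weights. The neighborhood-regularization term coupling nearby data's codes is omitted from this sketch.

```python
import numpy as np
from sklearn.linear_model import Lasso

def l1_graph(X, alpha=0.05):
    """X: (n, d) data matrix; returns an (n, n) sparse affinity matrix."""
    n = X.shape[0]
    W = np.zeros((n, n))
    for i in range(n):
        others = np.delete(X, i, axis=0).T           # dictionary: all other points
        code = Lasso(alpha=alpha, max_iter=5000).fit(others, X[i]).coef_
        W[i, np.delete(np.arange(n), i)] = np.abs(code)
    return (W + W.T) / 2                             # symmetrize for spectral clustering

W = l1_graph(np.random.randn(20, 10))
```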