Arrow Research · Search

Author name cluster

Jiahui Yu

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

15 papers
2 author rows

Possible papers (15)

JBHI · 2025 · Journal Article

Prior Visual-Guided Self-Supervised Learning Enables Color Vignetting Correction for High-Throughput Microscopic Imaging

  • Jianhang Wang
  • Tianyu Ma
  • Luhong Jin
  • Yunqi Zhu
  • Jiahui Yu
  • Feng Chen
  • Shujun Fu
  • Yingke Xu

Vignetting is a prevalent optical degradation that significantly compromises the quality of biomedical microscopic imaging, yet a robust and efficient vignetting correction method for multi-channel microscopic images is still lacking. In this paper, we exploit prior knowledge about the homogeneity of microscopic images and the radial attenuation property of vignetting to develop a self-supervised deep learning algorithm that removes complex vignetting from color microscopic images. Our proposed method, the vignetting correction lookup table (VCLUT), is trainable on both single and multiple images and employs adversarial learning to transfer the good imaging conditions of a user-defined central region of the light field to the entire image. To illustrate its effectiveness, we performed individual correction experiments on data from five distinct biological specimens; the results demonstrate that VCLUT outperforms classical methods. We further examined its performance as a multi-image-based approach on a pathological dataset, where it surpasses other state-of-the-art approaches in both qualitative and quantitative measurements. Moreover, it generalizes across various levels of vignetting intensity and computes ultra-fast, making it well-suited for integration into the high-throughput imaging pipelines of digital microscopy.
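
As a toy illustration of the two priors the method builds on, image homogeneity and radial attenuation, the sketch below fits one gain per radial annulus so that outer rings match the mean intensity of a user-chosen central region. This is not the paper's adversarial VCLUT; all names and parameters are hypothetical.

```python
import numpy as np

def radial_gain_correction(img, center_frac=0.2, n_bins=64):
    """img: (H, W) grayscale float image; returns a vignetting-corrected copy."""
    h, w = img.shape
    yy, xx = np.mgrid[0:h, 0:w]
    r = np.hypot(yy - h / 2, xx - w / 2)
    r = r / r.max()                                   # normalized radius in [0, 1]
    ref = img[r < center_frac].mean()                 # "good" central region
    corrected = img.astype(float)
    bins = np.linspace(0, 1, n_bins + 1)
    for lo, hi in zip(bins[:-1], bins[1:]):           # one gain per radial annulus
        mask = (r >= lo) & (r < hi)
        if mask.any():
            corrected[mask] *= ref / max(img[mask].mean(), 1e-6)
    return corrected
```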

JBHI · 2025 · Journal Article

Semi-Supervised Instance Segmentation in Whole Slide Images via Dense Spatial Variability Enhancing

  • Jiahui Yu
  • Tianyu Ma
  • Dong Hua
  • Feng Chen
  • Junfen Fu
  • Yingke Xu

Current whole slide image (WSI) segmentation aims at extracting tumor regions from the background; by contrast, segmenting distinct tumor areas (instances) within a WSI from limited annotated data remains under-explored. In this paper, we formally propose semi-supervised instance segmentation (Semi-IS) in WSIs and address a key challenge: learning intra-class similarity and inter-class dissimilarity from unlabeled data. Specifically, we treat each patch as a set of tokens considered jointly, rather than as a single unit, and employ contrastive learning to build a segmentation framework. In Semi-IS, we observe that the boundaries of segmented instances are usually disturbed by noise, and we address this by jointly suppressing noisy boundary features while preserving informative ones. We conduct extensive experiments on histopathology and cellular pathology to evaluate the effectiveness and generalizability of Semi-IS. The results show that on clinical multi-instance segmentation tasks, Semi-IS nearly matches fully supervised state-of-the-art results with only 30% of the annotations, and it improves segmentation accuracy by about 2% on public cell pathology datasets.
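
A minimal sketch of the contrastive component only, assuming an InfoNCE loss between the token embeddings of two augmented views of the same patch; the full Semi-IS pipeline (pseudo-labeling, boundary noise handling) is not reproduced here.

```python
import torch
import torch.nn.functional as F

def token_info_nce(tokens_a, tokens_b, temperature=0.1):
    """tokens_a, tokens_b: (N, D) embeddings of corresponding tokens in two views."""
    a = F.normalize(tokens_a, dim=1)
    b = F.normalize(tokens_b, dim=1)
    logits = a @ b.t() / temperature          # (N, N) cross-view similarities
    targets = torch.arange(a.size(0))         # token i in view A matches token i in view B
    return F.cross_entropy(logits, targets)

loss = token_info_nce(torch.randn(16, 128), torch.randn(16, 128))
```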

ICLR · 2024 · Conference Paper

CoBIT: A Contrastive Bi-directional Image-Text Generation Model

  • Haoxuan You
  • Mandy Guo
  • Zhecan Wang
  • Kai-Wei Chang 0001
  • Jason M. Baldridge
  • Jiahui Yu

The field of Vision-and-Language (VL) has witnessed a proliferation of pretrained foundation models. Current techniques typically employ only one type of training objective: (1) contrastive objectives (like CLIP), (2) image-to-text generative objectives (like PaLI), or (3) text-to-image generative objectives (like Parti). However, these three objectives are mutually relevant and are all based on image-text pairs. Intuitively, the first two objectives can be considered complementary projections between the two modalities, and contrastive learning preserves global alignment while generation facilitates fine-grained understanding. Inspired by this, we present a Contrastive Bi-directional Image-Text generation model (CoBIT) that, for the first time, unifies the three pre-training objectives in one framework. Specifically, CoBIT employs a novel unicoder-decoder structure consisting of an image unicoder, a text unicoder, and a cross-modal decoder. The image/text unicoders can switch between encoding and decoding for different tasks, enabling flexibility and shared knowledge that benefits both image-to-text and text-to-image generation. CoBIT achieves superior performance in image understanding, image-text understanding (retrieval, captioning, VQA, SNLI-VE), and text-based content creation, particularly in zero-shot scenarios.
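
A hedged sketch of how the three objectives could be summed in one training step, computed here on dummy tensors; the unicoder-decoder wiring that actually produces these embeddings and logits is omitted.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(img_emb, txt_emb, t=0.07):
    img, txt = F.normalize(img_emb, dim=1), F.normalize(txt_emb, dim=1)
    logits = img @ txt.t() / t
    labels = torch.arange(img.size(0))
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))

def lm_loss(logits, targets):
    # next-token prediction: position t predicts token t+1
    return F.cross_entropy(logits[:, :-1].reshape(-1, logits.size(-1)),
                           targets[:, 1:].reshape(-1))

# dummy batch: 4 image-text pairs, 128-d embeddings, vocabulary of 1000
img_emb, txt_emb = torch.randn(4, 128), torch.randn(4, 128)
i2t_logits, text_ids = torch.randn(4, 16, 1000), torch.randint(1000, (4, 16))   # captioning
t2i_logits, image_ids = torch.randn(4, 32, 1000), torch.randint(1000, (4, 32))  # image tokens

total = (contrastive_loss(img_emb, txt_emb)
         + lm_loss(i2t_logits, text_ids)       # image-to-text generation
         + lm_loss(t2i_logits, image_ids))     # text-to-image generation
```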

NeurIPS · 2023 · Conference Paper

Module-wise Adaptive Distillation for Multimodality Foundation Models

  • Chen Liang
  • Jiahui Yu
  • Ming-Hsuan Yang
  • Matthew Brown
  • Yin Cui
  • Tuo Zhao
  • Boqing Gong
  • Tianyi Zhou

Pre-trained multimodal foundation models have demonstrated remarkable generalizability but pose challenges for deployment due to their large sizes. One effective approach to reducing their sizes is layerwise distillation, wherein small student models are trained to match the hidden representations of large teacher models at each layer. Motivated by our observation that certain architecture components, referred to as modules, contribute more significantly to the student's performance than others, we propose to track the contributions of individual modules by recording the loss decrement after distilling each module, and to distill modules with greater contributions more frequently. Such an approach can be naturally formulated as a multi-armed bandit (MAB) problem, where modules and loss decrements are considered as arms and rewards, respectively. We then develop a modified Thompson sampling algorithm named OPTIMA to address the nonstationarity of module contributions resulting from model updating. Specifically, we leverage the observed contributions in recent history to estimate the changing contribution of each module and select modules based on these estimations to maximize the cumulative contribution. We evaluate the effectiveness of OPTIMA through distillation experiments on various multimodal understanding and image captioning tasks, using the CoCa-Large model (Yu et al., 2022) as the teacher.
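
The bandit view is easy to make concrete. Below is a minimal sketch, not the paper's OPTIMA code: each module is an arm, the observed loss decrement is its reward, and a recency-weighted Gaussian Thompson sampler picks the next module to distill, so modules that help more get distilled more often.

```python
import random

class ModuleBandit:
    def __init__(self, n_modules, decay=0.9, sigma=0.1):
        self.mean = [0.0] * n_modules    # recency-weighted reward estimate per module
        self.decay, self.sigma = decay, sigma

    def pick(self):
        # Thompson sampling: draw a plausible reward per arm, take the argmax
        samples = [random.gauss(m, self.sigma) for m in self.mean]
        return max(range(len(samples)), key=samples.__getitem__)

    def update(self, module, loss_decrement):
        # exponential moving average tracks the nonstationary contributions
        self.mean[module] = self.decay * self.mean[module] + (1 - self.decay) * loss_decrement

bandit = ModuleBandit(n_modules=4)
for step in range(100):
    m = bandit.pick()
    reward = random.random() * (m + 1) / 4     # stand-in for an observed loss decrement
    bandit.update(m, reward)
```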

TMLR · 2022 · Journal Article

CoCa: Contrastive Captioners are Image-Text Foundation Models

  • Jiahui Yu
  • Zirui Wang
  • Vijay Vasudevan
  • Legg Yeung
  • Mojtaba Seyedhosseini
  • Yonghui Wu

Exploring large-scale pretrained foundation models is of significant interest in computer vision because these models can be quickly transferred to many downstream tasks. This paper presents Contrastive Captioner (CoCa), a minimalist design to pretrain an image-text encoder-decoder foundation model jointly with contrastive loss and captioning loss, thereby subsuming model capabilities from contrastive approaches like CLIP and generative methods like SimVLM. In contrast to standard encoder-decoder transformers where all decoder layers attend to encoder outputs, CoCa omits cross-attention in the first half of decoder layers to encode unimodal text representations, and cascades the remaining decoder layers which cross-attend to the image encoder for multimodal image-text representations. We apply a contrastive loss between unimodal image and text embeddings, in addition to a captioning loss on the multimodal decoder outputs which predicts text tokens autoregressively. By sharing the same computational graph, the two training objectives are computed efficiently with minimal overhead. CoCa is pretrained end-to-end and from scratch on both web-scale alt-text data and annotated images by treating all labels simply as text, seamlessly unifying natural language supervision for representation learning. Empirically, CoCa achieves state-of-the-art performance with zero-shot transfer or minimal task-specific adaptation on a broad range of downstream tasks, spanning visual recognition (ImageNet, Kinetics-400/600/700, Moments-in-Time), crossmodal retrieval (MSCOCO, Flickr30K, MSR-VTT), multimodal understanding (VQA, SNLI-VE, NLVR2), and image captioning (MSCOCO, NoCaps). Notably on ImageNet classification, CoCa obtains 86.3% zero-shot top-1 accuracy, 90.6% with a frozen encoder and learned classification head, and 91.0% with a finetuned encoder.
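
The decoder split described above can be sketched on dummy tensors as follows; the blocks below are generic PyTorch layers standing in for CoCa's causal decoder halves (masking and pooling details are simplified), not the released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

B, T, D, V = 4, 16, 128, 1000
img_feats = torch.randn(B, 49, D)                   # image encoder output (dummy)
txt = torch.randint(V, (B, T))

embed = nn.Embedding(V, D)
uni = nn.TransformerEncoderLayer(D, nhead=4, batch_first=True)    # stand-in: unimodal half, no cross-attention
cross = nn.MultiheadAttention(D, num_heads=4, batch_first=True)   # stand-in: multimodal half
to_vocab = nn.Linear(D, V)

h_uni = uni(embed(txt))                             # unimodal text states
txt_emb, img_emb = h_uni[:, -1], img_feats.mean(dim=1)
logits = F.normalize(img_emb, dim=1) @ F.normalize(txt_emb, dim=1).t() / 0.07
labels = torch.arange(B)
contrastive = 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))

h_multi, _ = cross(h_uni, img_feats, img_feats)     # cross-attend text states to the image
caption = F.cross_entropy(to_vocab(h_multi)[:, :-1].reshape(-1, V), txt[:, 1:].reshape(-1))
loss = contrastive + caption                        # the two objectives share one graph
```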

TMLR · 2022 · Journal Article

Scaling Autoregressive Models for Content-Rich Text-to-Image Generation

  • Jiahui Yu
  • Yuanzhong Xu
  • Jing Yu Koh
  • Thang Luong
  • Gunjan Baid
  • Zirui Wang
  • Vijay Vasudevan
  • Alexander Ku

We present the Pathways Autoregressive Text-to-Image (Parti) model, which generates high-fidelity photorealistic images and supports content-rich synthesis involving complex compositions and world knowledge. Parti treats text-to-image generation as a sequence-to-sequence modeling problem, akin to machine translation, with sequences of image tokens as the target outputs rather than text tokens in another language. This strategy can naturally tap into the rich body of prior work on large language models, which have seen continued advances in capabilities and performance through scaling data and model sizes. Our approach is simple: First, Parti uses a Transformer-based image tokenizer, ViT-VQGAN, to encode images as sequences of discrete tokens. Second, we achieve consistent quality improvements by scaling the encoder-decoder Transformer model up to 20B parameters, with a new state-of-the-art zero-shot FID score of 7.23 and finetuned FID score of 3.22 on MS-COCO. Our detailed analysis on Localized Narratives as well as PartiPrompts (P2), a new holistic benchmark of over 1600 English prompts, demonstrates the effectiveness of Parti across a wide variety of categories and difficulty aspects. We also explore and highlight limitations of our models in order to define and exemplify key areas of focus for further improvements.
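
The sequence-to-sequence framing reduces image generation to ordinary autoregressive decoding over image-token ids, as in the sketch below; the "model" is a random stub, not ViT-VQGAN or the 20B Parti transformer.

```python
import torch

def sample_image_tokens(step_logits_fn, n_tokens=256, temperature=1.0):
    """Autoregressively sample image-token ids; a ViT-VQGAN-style decoder
    would then turn the ids back into pixels."""
    tokens = []
    for _ in range(n_tokens):
        logits = step_logits_fn(tokens) / temperature   # condition on text + token prefix
        probs = torch.softmax(logits, dim=-1)
        tokens.append(torch.multinomial(probs, 1).item())
    return tokens

stub = lambda prefix: torch.randn(8192)                 # stand-in for the decoder
ids = sample_image_tokens(stub, n_tokens=16)
```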

ICML · 2022 · Conference Paper

Self-supervised learning with random-projection quantizer for speech recognition

  • Chung-Cheng Chiu
  • James Qin
  • Yu Zhang
  • Jiahui Yu
  • Yonghui Wu

We present a simple and effective self-supervised learning approach for speech recognition. The approach learns a model to predict masked speech signals, in the form of discrete labels generated with a random-projection quantizer. In particular, the quantizer projects speech inputs with a randomly initialized matrix and does a nearest-neighbor lookup in a randomly initialized codebook. Neither the matrix nor the codebook is updated during self-supervised learning. Since the random-projection quantizer is not trained and is separate from the speech recognition model, the design makes the approach flexible and compatible with universal speech recognition architectures. On LibriSpeech, our approach achieves word error rates similar to previous self-supervised learning work with non-streaming models, and lower word error rates than previous work with streaming models. On multilingual tasks, the approach also provides significant improvement over wav2vec 2.0 and w2v-BERT.
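
The quantizer itself fits in a few lines, as sketched below with arbitrary dimensions: a frozen random matrix projects each frame, and the training label is the index of the nearest vector in a frozen random codebook.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_code, n_codes = 80, 16, 8192           # e.g. 80-dim log-mel frames
P = rng.normal(size=(d_in, d_code))            # random projection, never trained
codebook = rng.normal(size=(n_codes, d_code))  # random codebook, never trained
codebook /= np.linalg.norm(codebook, axis=1, keepdims=True)

def quantize(frames):
    """frames: (T, d_in) -> (T,) discrete labels used as masked-prediction targets."""
    z = frames @ P
    z /= np.linalg.norm(z, axis=1, keepdims=True)
    # with unit-norm vectors, the nearest l2 neighbor is the highest cosine match
    return (z @ codebook.T).argmax(axis=1)

labels = quantize(rng.normal(size=(100, d_in)))
```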

ICLR · 2022 · Conference Paper

SimVLM: Simple Visual Language Model Pretraining with Weak Supervision

  • Zirui Wang
  • Jiahui Yu
  • Adams Wei Yu
  • Zihang Dai
  • Yulia Tsvetkov
  • Yuan Cao 0007

With recent progress in joint modeling of visual and textual representations, Vision-Language Pretraining (VLP) has achieved impressive performance on many multimodal downstream tasks. However, the requirement for expensive annotations including clean image captions and regional labels limits the scalability of existing approaches, and complicates the pretraining procedure with the introduction of multiple dataset-specific objectives. In this work, we relax these constraints and present a minimalist pretraining framework, named Simple Visual Language Model (SimVLM). Unlike prior work, SimVLM reduces the training complexity by exploiting large-scale weak supervision, and is trained end-to-end with a single prefix language modeling objective. Without utilizing extra data or task-specific customization, the resulting model significantly outperforms previous pretraining methods and achieves new state-of-the-art results on a wide range of discriminative and generative vision-language benchmarks, including VQA (+3.74% vqa-score), NLVR2 (+1.17% accuracy), SNLI-VE (+1.37% accuracy) and image captioning tasks (+10.1% average CIDEr score). Furthermore, we demonstrate that SimVLM acquires strong generalization and transfer ability, enabling zero-shot behavior including open-ended visual question answering and cross-modality transfer.
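
The single PrefixLM objective amounts to next-token prediction restricted to positions past the prefix, as in this sketch on dummy tensors (the bidirectional-over-prefix attention masking inside the model is elided):

```python
import torch
import torch.nn.functional as F

B, T, V, prefix_len = 4, 20, 1000, 8
logits = torch.randn(B, T, V)                  # decoder outputs (dummy)
tokens = torch.randint(V, (B, T))

# position t predicts token t+1, but only where t+1 lies beyond the prefix
shift_logits, shift_targets = logits[:, :-1], tokens[:, 1:]
mask = torch.arange(1, T).expand(B, T - 1) >= prefix_len
loss = F.cross_entropy(shift_logits[mask], shift_targets[mask])
```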

ICLR · 2022 · Conference Paper

Vector-quantized Image Modeling with Improved VQGAN

  • Jiahui Yu
  • Xin Li
  • Jing Yu Koh
  • Han Zhang 0010
  • Ruoming Pang
  • James Qin
  • Alexander Ku
  • Yuanzhong Xu

Pretraining language models with next-token prediction on massive text corpora has delivered phenomenal zero-shot, few-shot, transfer learning and multi-tasking capabilities on both generative and discriminative language tasks. Motivated by this success, we explore a Vector-quantized Image Modeling (VIM) approach that involves pretraining a Transformer to predict rasterized image tokens autoregressively. The discrete image tokens are encoded from a learned Vision-Transformer-based VQGAN (ViT-VQGAN). We first propose multiple improvements over vanilla VQGAN from architecture to codebook learning, yielding better efficiency and reconstruction fidelity. The improved ViT-VQGAN further improves vector-quantized image modeling tasks, including unconditional and class-conditioned image generation and unsupervised representation learning. When trained on ImageNet at 256x256 resolution, we achieve an Inception Score (IS) of 175.1 and a Fréchet Inception Distance (FID) of 4.17, a dramatic improvement over the vanilla VQGAN, which obtains 70.6 and 17.04 for IS and FID, respectively. Based on ViT-VQGAN and unsupervised pretraining, we further evaluate the pretrained Transformer by averaging intermediate features, similar to Image GPT (iGPT). This ImageNet-pretrained VIM-L significantly beats iGPT-L, improving linear-probe accuracy from 60.3% to 73.2% at a similar model size. VIM-L also outperforms iGPT-XL, which is trained with extra web image data and a larger model.
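
A generic vector-quantization bottleneck of the kind ViT-VQGAN builds on is sketched below with a straight-through gradient; the paper's specific codebook improvements (e.g. factorized low-dimensional codes) are omitted.

```python
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    def __init__(self, n_codes=8192, dim=32, beta=0.25):
        super().__init__()
        self.codebook = nn.Embedding(n_codes, dim)
        self.beta = beta

    def forward(self, z):                        # z: (N, dim) encoder outputs
        d = torch.cdist(z, self.codebook.weight)            # distances to all codes
        ids = d.argmin(dim=1)                               # nearest code per vector
        z_q = self.codebook(ids)
        # codebook + commitment losses; straight-through gradient to the encoder
        loss = ((z_q - z.detach()) ** 2).mean() + self.beta * ((z_q.detach() - z) ** 2).mean()
        z_q = z + (z_q - z).detach()
        return z_q, ids, loss

z_q, ids, vq_loss = VectorQuantizer()(torch.randn(10, 32))
```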

ICLR · 2021 · Conference Paper

Dual-mode ASR: Unify and Improve Streaming ASR with Full-context Modeling

  • Jiahui Yu
  • Wei Han 0002
  • Anmol Gulati
  • Chung-Cheng Chiu
  • Bo Li 0028
  • Tara N. Sainath
  • Yonghui Wu
  • Ruoming Pang

Streaming automatic speech recognition (ASR) aims to emit each hypothesized word as quickly and accurately as possible, while full-context ASR waits for the completion of a full speech utterance before emitting completed hypotheses. In this work, we propose a unified framework, Dual-mode ASR, to train a single end-to-end ASR model with shared weights for both streaming and full-context speech recognition. We show that the latency and accuracy of streaming ASR significantly benefit from weight sharing and joint training with full-context ASR, especially with in-place knowledge distillation during training. The Dual-mode ASR framework can be applied to recent state-of-the-art convolution-based and transformer-based ASR networks. We present extensive experiments with two state-of-the-art ASR networks, ContextNet and Conformer, on two datasets: the widely used public LibriSpeech dataset and the large-scale MultiDomain dataset. Experiments and ablation studies demonstrate that Dual-mode ASR not only simplifies the workflow of training and deploying streaming and full-context ASR models, but also significantly improves both the emission latency and recognition accuracy of streaming ASR. With Dual-mode ASR, we achieve new state-of-the-art streaming ASR results on both LibriSpeech and MultiDomain in terms of accuracy and latency.
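
A toy rendering of the dual-mode idea, using a generic attention layer rather than ContextNet or Conformer: the same weights run once with full context and once with a causal mask, and in-place distillation pulls the streaming predictions toward the full-context ones.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

T, D, V = 50, 64, 30
attn = nn.MultiheadAttention(D, num_heads=4, batch_first=True)   # shared weights for both modes
head = nn.Linear(D, V)
x = torch.randn(1, T, D)                                         # acoustic features (dummy)

causal = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)  # block future frames
full, _ = attn(x, x, x)                                          # full-context mode
stream, _ = attn(x, x, x, attn_mask=causal)                      # streaming mode, same weights

p_full = F.log_softmax(head(full), dim=-1)
p_stream = F.log_softmax(head(stream), dim=-1)
# in-place distillation: the streaming student matches the full-context teacher
distill = F.kl_div(p_stream, p_full.detach(), log_target=True, reduction="batchmean")
```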

ICLR · 2020 · Conference Paper

FSNet: Compression of Deep Convolutional Neural Networks by Filter Summary

  • Yingzhen Yang
  • Jiahui Yu
  • Nebojsa Jojic
  • Jun Huan
  • Thomas S. Huang

We present a novel method for compressing deep Convolutional Neural Networks (CNNs) by weight sharing through a new representation of convolutional filters. The proposed method reduces the number of parameters of each convolutional layer by learning a 1D vector termed the Filter Summary (FS). The convolutional filters are located in the FS as overlapping 1D segments, and nearby filters in the FS share weights in their overlapping regions in a natural way. The resultant neural network based on this weight-sharing scheme, termed Filter Summary CNNs or FSNet, has an FS in each convolution layer instead of the set of independent filters in a conventional convolution layer. FSNet has the same architecture as the baseline CNN to be compressed, and each convolution layer of FSNet has the same number of filters, drawn from the FS, as the baseline CNN in the forward process. With a compelling computational acceleration ratio, the parameter space of FSNet is much smaller than that of the baseline CNN. In addition, FSNet is quantization friendly: FSNet with weight quantization leads to an even higher compression ratio without noticeable performance loss. We further propose Differentiable FSNet (DFSNet), where the way filters share weights is learned in a differentiable and end-to-end manner. Experiments demonstrate the effectiveness of FSNet in the compression of CNNs for computer vision tasks including image classification and object detection, and the effectiveness of DFSNet is evidenced by the task of Neural Architecture Search.
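
The weight-sharing scheme is easy to picture: every filter is an overlapping window of one shared 1D vector. The sketch below uses illustrative shapes, not the paper's configuration.

```python
import numpy as np

def filters_from_summary(fs, n_filters, filter_len, stride):
    """Slice n_filters overlapping windows of length filter_len out of the FS vector."""
    return np.stack([fs[i * stride : i * stride + filter_len] for i in range(n_filters)])

k, length, stride = 64, 27, 3                      # e.g. 64 filters of shape 3x3x3 = 27 weights
fs = np.random.randn(stride * (k - 1) + length)    # the Filter Summary
W = filters_from_summary(fs, k, length, stride)    # (64, 27); reshape to (64, 3, 3, 3) for conv
print(fs.size, "stored weights instead of", W.size)
```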

NeurIPS · 2020 · Conference Paper

Neural Sparse Representation for Image Restoration

  • Yuchen Fan
  • Jiahui Yu
  • Yiqun Mei
  • Yulun Zhang
  • Yun Fu
  • Ding Liu
  • Thomas S. Huang

Inspired by the robustness and efficiency of sparse representation in sparse coding based image restoration models, we investigate the sparsity of neurons in deep networks. Our method structurally enforces sparsity constraints upon hidden neurons. The sparsity constraints are favorable for gradient-based learning algorithms and attachable to convolution layers in various networks. Sparsity in neurons enables computation saving by only operating on non-zero components without hurting accuracy. Meanwhile, our method can magnify representation dimensionality and model capacity with negligible additional computation cost. Experiments show that sparse representation is crucial in deep neural networks for multiple image restoration tasks, including image super-resolution, image denoising, and image compression artifacts removal.
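
One simple way to enforce such structural sparsity is a top-k gate over channels, as sketched below; this is a generic stand-in for the paper's constraint, not its exact formulation.

```python
import torch

def topk_channel_sparsity(x, k):
    """x: (B, C, H, W); zero all but the k largest-magnitude channels per position."""
    thresh = x.abs().topk(k, dim=1).values[:, -1:, :, :]   # k-th largest magnitude per pixel
    return x * (x.abs() >= thresh)

x = torch.randn(2, 64, 8, 8)
y = topk_channel_sparsity(x, k=16)    # 75% of activations become exactly zero
```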

AAAI · 2020 · Conference Paper

Scale-Wise Convolution for Image Restoration

  • Yuchen Fan
  • Jiahui Yu
  • Ding Liu
  • Thomas S. Huang

While scale-invariant modeling has substantially boosted the performance of visual recognition tasks, it remains largely under-explored in deep-network-based image restoration. Naively applying scale-invariant techniques (e.g., multi-scale testing, random-scale data augmentation) to image restoration tasks usually leads to inferior performance. In this paper, we show that properly modeling scale-invariance in neural networks can bring significant benefits to image restoration performance. Inspired by spatial-wise convolution for shift-invariance, “scale-wise convolution” is proposed to convolve across multiple scales for scale-invariance. In our scale-wise convolutional network (SCN), we first map the input image to the feature space and then build a feature pyramid representation via progressive bilinear down-scaling. The feature pyramid is then passed to a residual network with scale-wise convolutions. The proposed scale-wise convolution learns to dynamically activate and aggregate features from different input scales in each residual building block, in order to exploit contextual information on multiple scales. In experiments, we compare restoration accuracy and parameter efficiency between our model and many different variants of multi-scale neural networks. The proposed network with scale-wise convolution achieves superior performance in multiple image restoration tasks including image super-resolution, image denoising, and image compression artifacts removal. Code and models are available at: https://github.com/ychfan/scn_sr.
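
The analogy to spatial convolution can be sketched as below, assuming a shared convolution applied across a bilinear pyramid with each scale aggregating its resized neighbors; the real SCN block's details are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

conv = nn.Conv2d(16, 16, 3, padding=1)             # one convolution shared across scales

def scale_wise(pyramid):
    out = []
    for i, f in enumerate(pyramid):
        agg = conv(f)                               # current scale
        for j in (i - 1, i + 1):                    # neighboring scales, resized to match
            if 0 <= j < len(pyramid):
                agg = agg + F.interpolate(conv(pyramid[j]), size=f.shape[-2:],
                                          mode="bilinear", align_corners=False)
        out.append(agg)
    return out

x = torch.randn(1, 16, 64, 64)
pyramid = [x, F.interpolate(x, scale_factor=0.5, mode="bilinear", align_corners=False)]
features = scale_wise(pyramid)
```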

UAI · 2019 · Conference Paper

Fast Proximal Gradient Descent for A Class of Non-convex and Non-smooth Sparse Learning Problems

  • Yingzhen Yang
  • Jiahui Yu

Non-convex and non-smooth optimization problems are important for statistics and machine learning, but solving such problems is challenging. In this paper, we propose fast proximal gradient descent based methods to solve a class of non-convex and non-smooth sparse learning problems, i.e., the $\ell^0$ regularization problems. We prove an improved convergence rate of proximal gradient descent on the $\ell^0$ regularization problems, and propose two accelerated versions by support projection. The proposed accelerated proximal gradient descent methods by support projection have convergence rates matching Nesterov's optimal convergence rate of first-order methods on smooth convex objective functions with Lipschitz continuous gradients. Experimental results demonstrate the effectiveness of the proposed algorithms. We also propose feedforward neural networks as fast encoders to approximate the optimization results generated by the proposed accelerated algorithms.
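
The proximal map of the $\ell^0$ penalty is hard thresholding, which makes the basic iteration short. The sketch below is plain proximal gradient descent on $\ell^0$-regularized least squares, without the paper's support-projection acceleration.

```python
import numpy as np

def prox_l0(z, lam, t):
    # keep entries whose magnitude exceeds the hard threshold sqrt(2*lam*t)
    return np.where(np.abs(z) > np.sqrt(2 * lam * t), z, 0.0)

def pgd_l0(A, b, lam, n_iters=200):
    t = 1.0 / np.linalg.norm(A, 2) ** 2     # step size 1/L, L = Lipschitz constant of the gradient
    x = np.zeros(A.shape[1])
    for _ in range(n_iters):
        x = prox_l0(x - t * A.T @ (A @ x - b), lam, t)
    return x

rng = np.random.default_rng(0)
A, x_true = rng.normal(size=(60, 100)), np.zeros(100)
x_true[:5] = 3.0
x_hat = pgd_l0(A, A @ x_true, lam=0.5)      # recovers a 5-sparse vector
```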

UAI · 2017 · Conference Paper

Neighborhood Regularized ℓ1-Graph

  • Yingzhen Yang
  • Jiashi Feng
  • Jiahui Yu
  • Jianchao Yang
  • Thomas S. Huang

ℓ1-Graph, which learns a sparse graph over the data by sparse representation, has been demonstrated to be effective in clustering, especially for high-dimensional data. Although it achieves compelling performance, the sparse graph generated by ℓ1-Graph ignores the geometric information of the data, since it computes the sparse representation of each datum separately. To obtain a sparse graph that is aligned to the underlying manifold structure of the data, we propose the novel Neighborhood Regularized ℓ1-Graph (NRℓ1-Graph). NRℓ1-Graph learns a sparse graph with a locally consistent neighborhood by encouraging nearby data to have similar neighbors in the constructed sparse graph. We present the optimization algorithm of NRℓ1-Graph with theoretical guarantees on the convergence and on the gap between the suboptimal solution and the globally optimal solution in each step of the coordinate descent, which is essential for the overall optimization of NRℓ1-Graph. A provably accelerated version, NRℓ1-Graph by Random Projection (NRℓ1-Graph-RP), which employs randomized data matrix decomposition, is also presented to improve the efficiency of the optimization of NRℓ1-Graph. Experimental results on various real data sets demonstrate the effectiveness of both NRℓ1-Graph and NRℓ1-Graph-RP.
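
For context, the baseline ℓ1-Graph that NRℓ1-Graph regularizes can be built as below: each datum is sparsely coded over all other points (one lasso per datum) and the code magnitudes become edge weights. The neighborhood-regularization term coupling nearby data's codes is omitted from this sketch.

```python
import numpy as np
from sklearn.linear_model import Lasso

def l1_graph(X, alpha=0.05):
    """X: (n, d) data matrix; returns an (n, n) sparse affinity matrix."""
    n = X.shape[0]
    W = np.zeros((n, n))
    for i in range(n):
        others = np.delete(X, i, axis=0).T           # dictionary: all other points
        code = Lasso(alpha=alpha, max_iter=5000).fit(others, X[i]).coef_
        W[i, np.delete(np.arange(n), i)] = np.abs(code)
    return (W + W.T) / 2                             # symmetrize for spectral clustering

W = l1_graph(np.random.randn(20, 10))
```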