Arrow Research search

Author name cluster

Junyi Wei

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

5 papers
2 author rows

Possible papers

5

NeurIPS Conference 2024 Conference Paper

Interfacing Foundation Models' Embeddings

  • Xueyan Zou
  • Linjie Li
  • Jianfeng Wang
  • Jianwei Yang
  • Mingyu Ding
  • Junyi Wei
  • Zhengyuan Yang
  • Feng Li

Foundation models possess strong capabilities in reasoning and memorizing across modalities. To further unleash the power of foundation models, we present FIND, a generalized interface for aligning foundation models' embeddings with unified image and dataset-level understanding spanning modality and granularity. As shown in Fig. 1, a lightweight transformer interface without tuning any foundation model weights is enough for segmentation, grounding, and retrieval in an interleaved manner. The proposed interface has the following favorable attributes: (1) Generalizable. It applies to various tasks spanning retrieval, segmentation, etc. , under the same architecture and weights. (2) Interleavable. With the benefit of multi-task multi-modal training, the proposed interface creates an interleaved shared embedding space. (3) Extendable. The proposed interface is adaptive to new tasks, and new models. In light of the interleaved embedding space, we introduce FIND-Bench, which introduces new training and evaluation annotations to the COCO dataset for interleaved segmentation and retrieval. We are the first work aligning foundations models' embeddings for interleave understanding. Meanwhile, our approach achieves state-of-the-art performance on FIND-Bench and competitive performance on standard retrieval and segmentation settings.

ICLR Conference 2024 Conference Paper

Towards Few-Shot Adaptation of Foundation Models via Multitask Finetuning

  • Zhuoyan Xu
  • Zhenmei Shi
  • Junyi Wei
  • Fangzhou Mu
  • Yin Li 0003
  • Yingyu Liang

Foundation models have emerged as a powerful tool for many AI problems. Despite the tremendous success of foundation models, effective adaptation to new tasks, particularly those with limited labels, remains an open question and lacks theoretical understanding. An emerging solution with recent success in vision and NLP involves finetuning a foundation model on a selection of relevant tasks, before its adaptation to a target task with limited labeled samples. In this paper, we study the theoretical justification of this multitask finetuning approach. Our theoretical analysis reveals that with a diverse set of related tasks, this multitask finetuning leads to reduced error in the target task, in comparison to directly adapting the same pretrained model. We quantify the relationship between finetuning tasks and target tasks by diversity and consistency metrics, and further propose a practical task selection algorithm. We substantiate our theoretical claims with extensive empirical evidence. Further, we present results affirming our task selection algorithm adeptly chooses related finetuning tasks, providing advantages to the model performance on target tasks. We believe our study shed new light on the effective adaptation of foundation models to new tasks that lack abundant labels. Our code is available at https://github.com/OliverXUZY/Foudation-Model_Multitask.

ICML Conference 2024 Conference Paper

Why Larger Language Models Do In-context Learning Differently?

  • Zhenmei Shi
  • Junyi Wei
  • Zhuoyan Xu
  • Yingyu Liang

Large language models (LLM) have emerged as a powerful tool for AI, with the key ability of in-context learning (ICL), where they can perform well on unseen tasks based on a brief series of task examples without necessitating any adjustments to the model parameters. One recent interesting mysterious observation is that models of different scales may have different ICL behaviors: larger models tend to be more sensitive to noise in the test context. This work studies this observation theoretically aiming to improve the understanding of LLM and ICL. We analyze two stylized settings: (1) linear regression with one-layer single-head linear transformers and (2) parity classification with two-layer multiple attention heads transformers (non-linear data and non-linear model). In both settings, we give closed-form optimal solutions and find that smaller models emphasize important hidden features while larger ones cover more hidden features; thus, smaller models are more robust to noise while larger ones are more easily distracted, leading to different ICL behaviors. This sheds light on where transformers pay attention to and how that affects ICL. Preliminary experimental results on large base and chat models provide positive support for our analysis.

NeurIPS Conference 2023 Conference Paper

Provable Guarantees for Neural Networks via Gradient Feature Learning

  • Zhenmei Shi
  • Junyi Wei
  • Yingyu Liang

Neural networks have achieved remarkable empirical performance, while the current theoretical analysis is not adequate for understanding their success, e. g. , the Neural Tangent Kernel approach fails to capture their key feature learning ability, while recent analyses on feature learning are typically problem-specific. This work proposes a unified analysis framework for two-layer networks trained by gradient descent. The framework is centered around the principle of feature learning from gradients, and its effectiveness is demonstrated by applications in several prototypical problems, such as mixtures of Gaussians and parity functions. The framework also sheds light on interesting network learning phenomena such as feature learning beyond kernels and the lottery ticket hypothesis.

ICLR Conference 2022 Conference Paper

A Theoretical Analysis on Feature Learning in Neural Networks: Emergence from Inputs and Advantage over Fixed Features

  • Zhenmei Shi
  • Junyi Wei
  • Yingyu Liang

An important characteristic of neural networks is their ability to learn representations of the input data with effective features for prediction, which is believed to be a key factor to their superior empirical performance. To better understand the source and benefit of feature learning in neural networks, we consider learning problems motivated by practical data, where the labels are determined by a set of class relevant patterns and the inputs are generated from these along with some background patterns. We prove that neural networks trained by gradient descent can succeed on these problems. The success relies on the emergence and improvement of effective features, which are learned among exponentially many candidates efficiently by exploiting the data (in particular, the structure of the input distribution). In contrast, no linear models on data-independent features of polynomial sizes can learn to as good errors. Furthermore, if the specific input structure is removed, then no polynomial algorithm in the Statistical Query model can learn even weakly. These results provide theoretical evidence showing that feature learning in neural networks depends strongly on the input structure and leads to the superior performance. Our preliminary experimental results on synthetic and real data also provide positive support.