Arrow Research search

Author name cluster

Yaru Hao

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

8 papers
2 author rows

Possible papers (8)

ICLR Conference 2025 Conference Paper

Data Selection via Optimal Control for Language Models

  • Yuxian Gu
  • Li Dong 0004
  • Hongning Wang
  • Yaru Hao
  • Qingxiu Dong
  • Furu Wei
  • Minlie Huang

This work investigates the selection of high-quality pre-training data from massive corpora to enhance LMs' capabilities for downstream usage. We formulate data selection as a generalized Optimal Control problem, which can be solved theoretically by Pontryagin's Maximum Principle (PMP), yielding a set of necessary conditions that characterize the relationship between optimal data selection and LM training dynamics. Based on these theoretical results, we introduce **P**MP-based **D**ata **S**election (**PDS**), a framework that approximates optimal data selection by solving the PMP conditions. In our experiments, we adopt PDS to select data from CommonCrawl and show that the PDS-selected corpus accelerates the learning of LMs and consistently boosts their performance on a wide range of downstream tasks across various model sizes. Moreover, the benefits of PDS extend to ~400B models trained on ~10T tokens, as evidenced by the extrapolation of the test loss curves according to the Scaling Laws. PDS also improves data utilization when the pre-training data is limited, by reducing the data demand by 1.8 times, which helps mitigate the quick exhaustion of available web-crawled corpora. Our code, model, and data can be found at https://github.com/microsoft/LMOps/tree/main/data_selection.
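Schematically, the optimal-control view the abstract describes can be written as follows (the notation here is chosen for illustration; the paper's exact formulation may differ):

```latex
\min_{\gamma \in U} \; J(\gamma) = L(\theta_T)
\quad \text{s.t.} \quad
\theta_{t+1} = \theta_t - \eta \, \nabla_{\theta} \sum_{n} \gamma_n \, \ell(x_n, \theta_t)
```

where the per-example data weights \(\gamma\) play the role of the control, the model parameters \(\theta_t\) are the state evolving under the training dynamics, and \(L\) is the downstream objective. PMP then supplies necessary conditions coupling the optimal \(\gamma\) to a costate trajectory along training, and PDS approximates a solution to those conditions.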

ICLR Conference 2024 Conference Paper

Grounding Multimodal Large Language Models to the World

  • Zhiliang Peng
  • Wenhui Wang 0003
  • Li Dong 0004
  • Yaru Hao
  • Shaohan Huang
  • Shuming Ma
  • Qixiang Ye
  • Furu Wei

We introduce Kosmos-2, a Multimodal Large Language Model (MLLM) with new capabilities for perceiving object descriptions (e.g., bounding boxes) and grounding text to the visual world. Specifically, we represent text spans (i.e., referring expressions and noun phrases) as links in Markdown, i.e., [text span](bounding boxes), where object descriptions are sequences of location tokens. To train the model, we construct a large-scale dataset of grounded image-text pairs (GrIT) together with multimodal corpora. In addition to the existing capabilities of MLLMs (e.g., perceiving general modalities, following instructions, and performing in-context learning), Kosmos-2 integrates the grounding capability into downstream applications. Kosmos-2 is evaluated on a wide range of tasks, including (i) multimodal grounding, such as referring expression comprehension and phrase grounding, (ii) multimodal referring, such as referring expression generation, (iii) perception-language tasks, and (iv) language understanding and generation. This study sheds light on the big convergence of language, multimodal perception, and world modeling, which is a key step toward artificial general intelligence. Code can be found at [https://aka.ms/kosmos-2](https://aka.ms/kosmos-2).
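A minimal sketch of the `[text span](bounding boxes)` representation the abstract describes, assuming boxes are quantized onto a discrete grid; the grid size and `<loc_*>` token naming here are illustrative, not the model's actual vocabulary:

```python
def box_to_loc_tokens(box, image_size=(224, 224), bins=32):
    """Quantize a pixel bounding box (x1, y1, x2, y2) into two discrete
    location tokens (top-left and bottom-right grid cells)."""
    w, h = image_size
    x1, y1, x2, y2 = box

    def cell(x, y):
        # Map a point to an index on a bins x bins grid.
        col = min(int(x / w * bins), bins - 1)
        row = min(int(y / h * bins), bins - 1)
        return row * bins + col

    return f"<loc_{cell(x1, y1)}><loc_{cell(x2, y2)}>"

def ground(span, box):
    # Represent the text span as a Markdown-style link to its location tokens.
    return f"[{span}]({box_to_loc_tokens(box)})"

print(ground("a snowman", (10, 20, 120, 200)))  # [a snowman](<loc_65><loc_913>)
```

The grounded string can then be interleaved with ordinary text, so the same sequence model reads and emits both words and locations.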

YNIMG Journal 2024 Journal Article

Higher emotional synchronization is modulated by relationship quality in romantic relationships and not in close friendships

  • Yijun Chen
  • Shen Liu
  • Yaru Hao
  • Qian Zhao
  • Jiecheng Ren
  • Yi Piao
  • Liuyun Wang
  • Yunping Yang

Emotions are fundamental to social interaction and deeply intertwined with interpersonal dynamics, especially in romantic relationships. Although the neural basis of interaction processes in romance has been widely explored, the underlying emotions and the connection between relationship quality and neural synchronization remain less understood. Our study employed EEG hyperscanning during a non-interactive video-watching paradigm to compare the emotional coordination between romantic couples and close friends. Couples showed significantly greater behavioral and prefrontal alpha synchronization than friends. Notably, couples with low relationship quality required heightened neural synchronization to maintain robust behavioral synchronization. Further support vector machine analysis underscores the crucial role of prefrontal activity in differentiating couples from friends. In summary, our research addresses gaps concerning how intrinsic emotions linked to relationship quality influence neural and behavioral synchronization by investigating a natural non-interactive context, thereby advancing our understanding of the neural mechanisms underlying emotional coordination in romantic relationships.

NeurIPS Conference 2023 Conference Paper

Language Is Not All You Need: Aligning Perception with Language Models

  • Shaohan Huang
  • Li Dong
  • Wenhui Wang
  • Yaru Hao
  • Saksham Singhal
  • Shuming Ma
  • Tengchao Lv
  • Lei Cui

A big convergence of language, multimodal perception, action, and world modeling is a key step toward artificial general intelligence. In this work, we introduce KOSMOS-1, a Multimodal Large Language Model (MLLM) that can perceive general modalities, learn in context (i.e., few-shot), and follow instructions (i.e., zero-shot). Specifically, we train KOSMOS-1 from scratch on web-scale multi-modal corpora, including arbitrarily interleaved text and images, image-caption pairs, and text data. We evaluate various settings, including zero-shot, few-shot, and multimodal chain-of-thought prompting, on a wide range of tasks without any gradient updates or finetuning. Experimental results show that KOSMOS-1 achieves impressive performance on (i) language understanding, generation, and even OCR-free NLP (directly fed with document images), (ii) perception-language tasks, including multimodal dialogue, image captioning, visual question answering, and (iii) vision tasks, such as image recognition with descriptions (specifying classification via text instructions). We also show that MLLMs can benefit from cross-modal transfer, i.e., transferring knowledge from language to multimodal settings and from multimodal settings to language. In addition, we introduce a Raven IQ test dataset, which diagnoses the nonverbal reasoning capability of MLLMs.

NeurIPS Conference 2023 Conference Paper

Optimizing Prompts for Text-to-Image Generation

  • Yaru Hao
  • Zewen Chi
  • Li Dong
  • Furu Wei

Well-designed prompts can guide text-to-image models to generate amazing images. However, performant prompts are often model-specific and misaligned with user input. Instead of laborious human engineering, we propose prompt adaptation, a general framework that automatically adapts original user input to model-preferred prompts. Specifically, we first perform supervised fine-tuning with a pretrained language model on a small collection of manually engineered prompts. Then we use reinforcement learning to explore better prompts. We define a reward function that encourages the policy to generate more aesthetically pleasing images while preserving the original user intentions. Experimental results on Stable Diffusion show that our method outperforms manual prompt engineering in terms of both automatic metrics and human preference ratings. Moreover, reinforcement learning further boosts performance, especially on out-of-domain prompts.
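The reward the abstract describes balances two terms: image aesthetics and preservation of user intent. A hedged sketch of that shape, where `aesthetic_fn`, `relevance_fn`, and the weight `alpha` are all placeholders for the learned models and tuning the paper would use:

```python
def prompt_reward(user_prompt, adapted_prompt, aesthetic_fn, relevance_fn,
                  alpha=0.5):
    """Combine an aesthetic score with an intent-preservation score.
    Both scoring functions stand in for learned models (an aesthetic
    predictor over generated images and a prompt similarity measure)."""
    return (alpha * aesthetic_fn(adapted_prompt)
            + (1 - alpha) * relevance_fn(user_prompt, adapted_prompt))

# Toy scorers standing in for the real models.
aesthetic = lambda prompt: min(len(prompt) / 100, 1.0)             # more detail scores higher
relevance = lambda user, adapted: 1.0 if user in adapted else 0.5  # crude intent check

r = prompt_reward("a cat", "a cat, highly detailed, studio lighting",
                  aesthetic, relevance)
```

The RL policy (the fine-tuned prompt rewriter) is then optimized to maximize this reward, which is why it can discover out-of-domain prompt phrasings that supervised fine-tuning alone would not.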

ICLR Conference 2023 Conference Paper

Prototypical Calibration for Few-shot Learning of Language Models

  • Zhixiong Han
  • Yaru Hao
  • Li Dong 0004
  • Yutao Sun
  • Furu Wei

In-context learning of GPT-like models has been recognized as fragile across different hand-crafted templates and demonstration permutations. In this work, we propose prototypical calibration to adaptively learn a more robust decision boundary for zero- and few-shot classification, instead of greedy decoding. Concretely, our method first adopts a Gaussian mixture distribution to estimate the prototypical clusters for all categories. Then we assign each cluster to the corresponding label by solving a weighted bipartite matching problem. Given an example, its prediction is calibrated by the likelihood of prototypical clusters. Experimental results show that prototypical calibration yields a substantial improvement on a diverse set of tasks. Extensive analysis across different scales also indicates that our method calibrates the decision boundary as expected, greatly improving the robustness of GPT to templates, permutations, and class imbalance.
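The cluster-to-label step is a weighted bipartite matching, solvable with the Hungarian algorithm. A small sketch of that step and the calibrated prediction, assuming the Gaussian-mixture clusters have already been fit (so cluster likelihoods and cluster/label weights are given as inputs):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def assign_clusters_to_labels(cluster_label_weight):
    """Solve the weighted bipartite matching: cluster_label_weight[i, j]
    is the probability mass cluster i puts on label j. We maximize total
    weight by negating it for the min-cost solver."""
    rows, cols = linear_sum_assignment(-cluster_label_weight)
    return {int(r): int(c) for r, c in zip(rows, cols)}  # cluster -> label

def calibrated_predict(cluster_likelihoods, cluster_to_label):
    """Predict via the most likely prototypical cluster, replacing
    greedy decoding over raw label-token probabilities."""
    best_cluster = int(np.argmax(cluster_likelihoods))
    return cluster_to_label[best_cluster]

# Toy example: 2 clusters, 2 labels.
W = np.array([[0.9, 0.1],
              [0.2, 0.8]])
mapping = assign_clusters_to_labels(W)   # {0: 0, 1: 1}
pred = calibrated_predict([0.3, 0.7], mapping)  # 1
```

Because the decision boundary comes from the estimated clusters rather than from fixed verbalizer probabilities, the prediction is less sensitive to template wording and demonstration order.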

AAAI Conference 2023 Conference Paper

Prototypical Fine-Tuning: Towards Robust Performance under Varying Data Sizes

  • Yiqiao Jin
  • Xiting Wang
  • Yaru Hao
  • Yizhou Sun
  • Xing Xie

In this paper, we move towards combining large parametric models with non-parametric prototypical networks. We propose prototypical fine-tuning, a novel prototypical framework for fine-tuning pretrained language models (LM), which automatically learns a bias to improve predictive performance for varying data sizes, especially low-resource settings. Our prototypical fine-tuning approach can automatically adjust the model capacity according to the number of data points and the model's inherent attributes. Moreover, we propose four principles for effective prototypical fine-tuning towards the optimal solution. Experimental results across various datasets show that our work achieves significant performance improvements under various low-resource settings, as well as comparable and usually better performance in high-resource scenarios.

AAAI Conference 2021 Conference Paper

Self-Attention Attribution: Interpreting Information Interactions Inside Transformer

  • Yaru Hao
  • Li Dong
  • Furu Wei
  • Ke Xu

The great success of Transformer-based models benefits from the powerful multi-head self-attention mechanism, which learns token dependencies and encodes contextual information from the input. Prior work strives to attribute model decisions to individual input features with different saliency measures, but they fail to explain how these input features interact with each other to reach predictions. In this paper, we propose a self-attention attribution method to interpret the information interactions inside Transformer. We take BERT as an example to conduct extensive studies. Firstly, we apply self-attention attribution to identify the important attention heads, while others can be pruned with marginal performance degradation. Furthermore, we extract the most salient dependencies in each layer to construct an attribution tree, which reveals the hierarchical interactions inside Transformer. Finally, we show that the attribution results can be used as adversarial patterns to implement non-targeted attacks towards BERT.
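The attribution score underlying the method takes an integrated-gradients form over the attention matrix; schematically (symbols chosen here for illustration):

```latex
\mathrm{Attr}_h(A) \;=\; A_h \odot \int_{0}^{1} \frac{\partial F(\alpha A)}{\partial A_h} \, d\alpha
```

where \(A_h\) is head \(h\)'s attention matrix, \(F\) the model's output for the target prediction, and \(\odot\) the elementwise product; in practice the integral is approximated by a Riemann sum over a few scaled copies \(\alpha A\). Heads whose attribution mass is consistently small are the ones that can be pruned with marginal degradation.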