Author name cluster

Ruibo Liu

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

15 papers

2 author rows

NeurIPS Conference 2025 Conference Paper

OmniBench: Towards The Future of Universal Omni-Language Models

Yizhi Li
Ge Zhang
Yinghao Ma
Ruibin Yuan
Hangyu Guo
Yiming Liang
Jiaheng Liu
Noah Wang

Recent advancements in multimodal large language models (MLLMs) have focused on integrating multiple modalities, yet their ability to simultaneously process and reason across different inputs remains underexplored. We introduce OmniBench, a novel benchmark designed to evaluate models’ ability to recognize, interpret, and reason across visual, acoustic, and textual inputs simultaneously. We define language models capable of such tri-modal processing as omni-language models (OLMs). OmniBench features high-quality human annotations that require integrated understanding across all modalities. Our evaluation reveals that: i) open-source OLMs show significant limitations in instruction-following and reasoning in tri-modal contexts; and ii) most baseline models perform poorly (below 50% accuracy) even with textual alternatives to image/audio inputs. To address these limitations, we develop OmniInstruct, an 96K-sample instruction tuning dataset for training OLMs. We advocate for developing more robust tri-modal integration techniques and training strategies to enhance OLM performance. Codes and data could be found at https: //m-a-p. ai/OmniBench/.

PDF Details

ICML Conference 2025 Conference Paper

Scaling Laws for Differentially Private Language Models

Ryan McKenna
Yangsibo Huang
Amer Sinha
Borja Balle
Zachary Charles
Christopher A. Choquette-Choo
Badih Ghazi
Georgios Kaissis

Scaling laws have emerged as important components of large language model (LLM) training as they can predict performance gains through scale, and provide guidance on important hyper-parameter choices that would otherwise be expensive. LLMs also rely on large, high-quality training datasets, like those sourced from (sometimes sensitive) user data. Training models on this sensitive user data requires careful privacy protections like differential privacy (DP). However, the dynamics of DP training are significantly different, and consequently their scaling laws are not yet fully understood. In this work, we establish scaling laws that accurately model the intricacies of DP LLM training, providing a complete picture of the compute-privacy-utility and the optimal training configurations in many settings.

Details

NeurIPS Conference 2024 Conference Paper

II-Bench: An Image Implication Understanding Benchmark for Multimodal Large Language Models

Ziqiang Liu
Feiteng Fang
Xi Feng
Xinrun Du
Chenhao Zhang
Zekun Wang
yuelin bai
Qixuan Zhao

The rapid advancements in the development of multimodal large language models (MLLMs) have consistently led to new breakthroughs on various benchmarks. In response, numerous challenging and comprehensive benchmarks have been proposed to more accurately assess the capabilities of MLLMs. However, there is a dearth of exploration of the higher-order perceptual capabilities of MLLMs. To fill this gap, we propose the Image Implication understanding Benchmark, II-Bench, which aims to evaluate the model's higher-order perception of images. Through extensive experiments on II-Bench across multiple MLLMs, we have made significant findings. Initially, a substantial gap is observed between the performance of MLLMs and humans on II-Bench. The pinnacle accuracy of MLLMs attains 74. 8%, whereas human accuracy averages 90%, peaking at an impressive 98%. Subsequently, MLLMs perform worse on abstract and complex images, suggesting limitations in their ability to understand high-level semantics and capture image details. Finally, it is observed that most models exhibit enhanced accuracy when image sentiment polarity hints are incorporated into the prompts. This observation underscores a notable deficiency in their inherent understanding of image sentiment. We believe that II-Bench will inspire the community to develop the next generation of MLLMs, advancing the journey towards expert artificial general intelligence (AGI). II-Bench is publicly available at https: //huggingface. co/datasets/m-a-p/II-Bench.

PDF Details DOI

NeurIPS Conference 2024 Conference Paper

Long-form factuality in large language models

Jerry Wei
Chengrun Yang
Xinying Song
Yifeng Lu
Nathan Hu
Jie Huang
Dustin Tran
Daiyi Peng

Large language models (LLMs) often generate content that contains factual errors when responding to fact-seeking prompts on open-ended topics. To benchmark a model’s long-form factuality in open domains, we first use GPT-4 to generate LongFact, a prompt set comprising thousands of questions spanning 38 topics. We then propose that LLM agents can be used as automated evaluators for long-form factuality through a method which we call Search-Augmented Factuality Evaluator (SAFE). SAFE utilizes an LLM to break down a long-form response into a set of individual facts and to evaluate the accuracy of each fact using a multi-step reasoning process comprising sending search queries to Google Search and determining whether a fact is supported by the search results. Furthermore, we propose extending F1 score as an aggregated metric for long-form factuality. To do so, we balance the percentage of supported facts in a response (precision) with the percentage of provided facts relative to a hyperparameter representing a user’s preferred response length (recall). Empirically, we demonstrate that LLM agents can outperform crowdsourced human annotators—on a set of∼16k individual facts, SAFE agrees with crowdsourced human annotators 72% of the time, and on a random subset of 100 disagreement cases, SAFE wins 76% of the time. At the same time, SAFE is more than 20 times cheaper than human annotators. We also benchmark thirteen language models on LongFact across four model families (Gemini, GPT, Claude, and PaLM-2), finding that larger language models generally achieve better long-form factuality. LongFact, SAFE, and all experimental code are available at https: //github. com/google-deepmind/long-form-factuality.

PDF Details DOI

ICLR Conference 2024 Conference Paper

MERT: Acoustic Music Understanding Model with Large-Scale Self-supervised Training

Yizhi Li
Ruibin Yuan
Ge Zhang 0009
Yinghao Ma
Xingran Chen
Hanzhi Yin
Chenghao Xiao
Chenghua Lin 0002

Self-supervised learning (SSL) has recently emerged as a promising paradigm for training generalisable models on large-scale data in the fields of vision, text, and speech. Although SSL has been proven effective in speech and audio, its application to music audio has yet to be thoroughly explored. This is partially due to the distinctive challenges associated with modelling musical knowledge, particularly tonal and pitched characteristics of music. To address this research gap, we propose an acoustic **M**usic und**ER**standing model with large-scale self-supervised **T**raining (**MERT**), which incorporates teacher models to provide pseudo labels in the masked language modelling (MLM) style acoustic pre-training. In our exploration, we identified an effective combination of teacher models, which outperforms conventional speech and audio approaches in terms of performance. This combination includes an acoustic teacher based on Residual Vector Quantization - Variational AutoEncoder (RVQ-VAE) and a musical teacher based on the Constant-Q Transform (CQT). Furthermore, we explore a wide range of settings to overcome the instability in acoustic language model pre-training, which allows our designed paradigm to scale from 95M to 330M parameters. Experimental results indicate that our model can generalise and perform well on 14 music understanding tasks and attain state-of-the-art (SOTA) overall scores.

Details

IROS Conference 2024 Conference Paper

Reconfigurable Robot Identification from Motion Data

Yuhang Hu
Yunzhe Wang
Ruibo Liu
Zhou Shen
Hod Lipson

Integrating Large Language Models (LLMs) and Vision-Language Models (VLMs) with robotic systems enables robots to process and understand complex natural language instructions and visual information. However, a fundamental challenge remains: for robots to fully capitalize on these advancements, they must have a deep understanding of their physical embodiment. The gap between AI models’ cognitive capabilities and the understanding of physical embodiment leads to the following question: Can a robot autonomously understand and adapt to its physical form and functionalities through interaction with its environment? This question underscores the transition towards developing self-modeling robots without reliance on external sensory or pre-programmed knowledge about their structure. Here, we propose a meta-self-modeling that can deduce robot morphology through proprioception—the robot’s internal sense of its body’s position and movement. Our study introduces a 12-DoF reconfigurable legged robot, accompanied by a diverse dataset of 200k unique configurations, to systematically investigate the relationship between robotic motion and robot morphology. Utilizing a deep neural network model comprising a robot signature encoder and a configuration decoder, we demonstrate the capability of our system to accurately predict robot configurations from proprioceptive signals. This research contributes to the field of robotic self-modeling, aiming to enhance robot’s understanding of their physical embodiment and adaptability in real-world scenarios.

Details

ICLR Conference 2024 Conference Paper

Training Socially Aligned Language Models on Simulated Social Interactions

Ruibo Liu
Ruixin Yang
Chenyan Jia
Ge Zhang
Diyi Yang
Soroush Vosoughi

The goal of social alignment for AI systems is to make sure these models can conduct themselves appropriately following social values. Unlike humans who establish a consensus on value judgments through social interaction, current language models (LMs) are trained to rigidly recite the corpus in social isolation, which causes poor generalization in unfamiliar cases and the lack of robustness under adversarial attacks. In this work, we introduce a new training paradigm that enables LMs to learn from simulated social interactions. Compared with existing methods, our method is much more scalable and efficient, and shows superior performance in alignment benchmarks and human evaluation.

Details

NeurIPS Conference 2023 Conference Paper

MARBLE: Music Audio Representation Benchmark for Universal Evaluation

Ruibin Yuan
Yinghao Ma
Yizhi Li
Ge Zhang
Xingran Chen
Hanzhi Yin
zhuo le
Yiqi Liu

In the era of extensive intersection between art and Artificial Intelligence (AI), such as image generation and fiction co-creation, AI for music remains relatively nascent, particularly in music understanding. This is evident in the limited work on deep music representations, the scarcity of large-scale datasets, and the absence of a universal and community-driven benchmark. To address this issue, we introduce the Music Audio Representation Benchmark for universaL Evaluation, termed MARBLE. It aims to provide a benchmark for various Music Information Retrieval (MIR) tasks by defining a comprehensive taxonomy with four hierarchy levels, including acoustic, performance, score, and high-level description. We then establish a unified protocol based on 18 tasks on 12 public-available datasets, providing a fair and standard assessment of representations of all open-sourced pre-trained models developed on music recordings as baselines. Besides, MARBLE offers an easy-to-use, extendable, and reproducible suite for the community, with a clear statement on copyright issues on datasets. Results suggest recently proposed large-scale pre-trained musical language models perform the best in most tasks, with room for further improvement. The leaderboard and toolkit repository are published to promote future music AI research.

PDF Details

ICLR Conference 2023 Conference Paper

Mind's Eye: Grounded Language Model Reasoning through Simulation

Ruibo Liu
Jason Wei
Shixiang Gu
Te-Yen Wu
Soroush Vosoughi
Claire Cui
Denny Zhou
Andrew M. Dai

Successful and effective communication between humans and AI relies on a shared experience of the world. By training solely on written text, current language models (LMs) miss the grounded experience of humans in the real-world---their failure to relate language to the physical world causes knowledge to be misrepresented and obvious mistakes in their reasoning. We present Mind's Eye, a paradigm to ground language model reasoning in the physical world. Given a physical reasoning question, we use a computational physics engine (DeepMind's MuJoCo) to simulate the possible outcomes, and then use the simulation results as part of the input, which enables language models to perform reasoning. Experiments on 39 tasks in a physics alignment benchmark demonstrate that Mind's Eye can improve reasoning ability by a large margin (27.9% zero-shot, and 46.0% few-shot absolute accuracy improvement on average). Smaller language models armed with Mind's Eye can obtain similar performance to models that are 100x larger. Finally, we confirm the robustness of Mind's Eye through ablation studies.

Details

ICLR Conference 2022 Conference Paper

Knowledge Infused Decoding

Ruibo Liu
Guoqing Zheng
Shashank Gupta
Radhika Gaonkar
Chongyang Gao
Soroush Vosoughi
Milad Shokouhi
Ahmed Hassan Awadallah

Pre-trained language models (LMs) have been shown to memorize a substantial amount of knowledge from the pre-training corpora; however, they are still limited in recalling factually correct knowledge given a certain context. Hence. they tend to suffer from counterfactual or hallucinatory generation when used in knowledge-intensive natural language generation (NLG) tasks. Recent remedies to this problem focus on modifying either the pre-training or task fine-tuning objectives to incorporate knowledge, which normally require additional costly training or architecture modification of LMs for practical applications. We present Knowledge Infused Decoding (KID)---a novel decoding algorithm for generative LMs, which dynamically infuses external knowledge into each step of the LM decoding. Specifically, we maintain a local knowledge memory based on the current context, interacting with a dynamically created external knowledge trie, and continuously update the local memory as a knowledge-aware constraint to guide decoding via reinforcement learning. On six diverse knowledge-intensive NLG tasks, task-agnostic LMs (e.g., GPT-2 and BART) armed with KID outperform many task-optimized state-of-the-art models, and show particularly strong performance in few-shot scenarios over seven related knowledge-infusion techniques. Human evaluation confirms KID's ability to generate more relevant and factual language for the input context when compared with multiple baselines. Finally, KID also alleviates exposure bias and provides stable generation quality when generating longer sequences.

Details

ICLR Conference 2022 Conference Paper

Non-Parallel Text Style Transfer with Self-Parallel Supervision

Ruibo Liu
Chongyang Gao
Chenyan Jia
Guangxuan Xu
Soroush Vosoughi

The performance of existing text style transfer models is severely limited by the non-parallel datasets on which the models are trained. In non-parallel datasets, no direct mapping exists between sentences of the source and target style; the style transfer models thus only receive weak supervision of the target sentences during training, which often leads the model to discard too much style-independent information, or utterly fail to transfer the style. In this work, we propose LaMer, a novel text style transfer framework based on large-scale language models. LaMer first mines the roughly parallel expressions in the non-parallel datasets with scene graphs, and then employs MLE training, followed by imitation learning refinement, to leverage the intrinsic parallelism within the data. On two benchmark tasks (sentiment & formality transfer) and a newly proposed challenging task (political stance transfer), our model achieves qualitative advances in transfer accuracy, content preservation, and fluency. Further empirical and human evaluations demonstrate that our model not only makes training more efficient, but also generates more readable and diverse expressions than previous models.

Details

AIJ Journal 2022 Journal Article

Quantifying and alleviating political bias in language models

Ruibo Liu
Chenyan Jia
Jason Wei
Guangxuan Xu
Soroush Vosoughi

Details DOI

NeurIPS Conference 2022 Conference Paper

Second Thoughts are Best: Learning to Re-Align With Human Values from Text Edits

Ruibo Liu
Chenyan Jia
Ge Zhang
Ziyu Zhuang
Tony Liu
Soroush Vosoughi

We present Second Thoughts, a new learning paradigm that enables language models (LMs) to re-align with human values. By modeling the chain-of-edits between value-unaligned and value-aligned text, with LM fine-tuning and additional refinement through reinforcement learning, Second Thoughts not only achieves superior performance in three value alignment benchmark datasets but also shows strong human-value transfer learning ability in few-shot scenarios. The generated editing steps also offer better interpretability and ease for interactive error correction. Extensive human evaluations further confirm its effectiveness.

PDF Details

AAAI Conference 2021 Conference Paper

Embedding Heterogeneous Networks into Hyperbolic Space Without Meta-path

Lili Wang
Chongyang Gao
Chenghan Huang
Ruibo Liu
Weicheng Ma
Soroush Vosoughi

Networks found in the real-world are numerous and varied. A common type of network is the heterogeneous network, where the nodes (and edges) can be of different types. Accordingly, there have been efforts at learning representations of these heterogeneous networks in low-dimensional space. However, most of the existing heterogeneous network embedding methods suffer from the following two drawbacks: (1) The target space is usually Euclidean. Conversely, many recent works have shown that complex networks may have hyperbolic latent anatomy, which is non-Euclidean. (2) These methods usually rely on meta-paths, which require domainspecific prior knowledge for meta-path selection. Additionally, different down-streaming tasks on the same network might require different meta-paths in order to generate taskspecific embeddings. In this paper, we propose a novel selfguided random walk method that does not require metapath for embedding heterogeneous networks into hyperbolic space. We conduct thorough experiments for the tasks of network reconstruction and link prediction on two public datasets, showing that our model outperforms a variety of well-known baselines across all tasks.

PDF Details

AAAI Conference 2021 Conference Paper

Mitigating Political Bias in Language Models through Reinforced Calibration

Ruibo Liu
Chenyan Jia
Jason Wei
Guangxuan Xu
Lili Wang
Soroush Vosoughi

Current large-scale language models can be politically biased as a result of the data they are trained on, potentially causing serious problems when they are deployed in realworld settings. In this paper, we describe metrics for measuring political bias in GPT-2 generation and propose a reinforcement learning (RL) framework for mitigating political biases in generated text. By using rewards from word embeddings or a classifier, our RL framework guides debiased generation without having access to the training data or requiring the model to be retrained. In empirical experiments on three attributes sensitive to political bias (gender, location, and topic), our methods reduced bias according to both our metrics and human evaluation, while maintaining readability and semantic coherence.

PDF Details