Author name cluster

Ru Peng

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

4 papers

2 author rows

ICLR Conference 2025 Conference Paper

DataMan: Data Manager for Pre-training Large Language Models

Ru Peng
Kexin Yang 0002
Yawen Zeng
Junyang Lin
Dayiheng Liu
Junbo Zhao 0002

The performance emergence of large language models (LLMs) driven by data scaling laws makes the selection of pre-training data increasingly important. However, existing methods rely on limited heuristics and human intuition, lacking comprehensive and clear guidelines. To address this, we are inspired by *"reverse thinking"*: prompting LLMs to self-identify which criteria benefit their performance. As their pre-training capabilities are related to perplexity (PPL), we derive 14 quality criteria from the causes of text perplexity anomalies and introduce 15 common application domains to support domain mixing. In this paper, we train a **Data** **Man**ager (**DataMan**) to learn quality ratings and domain recognition from pointwise rating, and use it to annotate a 447B token pre-training corpus with 14 quality ratings and a domain type. Our experiments validate our approach, using DataMan to select 30B tokens to train a 1.3B-parameter language model, demonstrating significant improvements in in-context learning (ICL), perplexity, and instruction-following ability over the state-of-the-art baseline. The best-performing model, based on the *Overall Score l=5*, surpasses a model trained with 50% more data using uniform sampling. We continue pre-training with high-rated, domain-specific data annotated by DataMan to enhance domain-specific ICL performance and thus verify DataMan's domain mixing ability. Our findings emphasize the importance of quality ranking, the complementary nature of quality criteria, and their low correlation with perplexity, analyzing misalignment between PPL and ICL performance. We also thoroughly analyze our pre-training dataset, examining its composition, the distribution of quality ratings, and the original document sources.

ICLR Conference 2024 Conference Paper

Energy-based Automated Model Evaluation

Ru Peng
Heming Zou
Haobo Wang 0001
Yawen Zeng
Zenan Huang
Junbo Zhao 0002

Conventional evaluation protocols for machine learning models rely heavily on a labeled, i.i.d.-assumed testing dataset, which is often unavailable in real-world applications. Automated Model Evaluation (AutoEval) offers an alternative to this traditional workflow by forming a proximal prediction pipeline of the testing performance without the presence of ground-truth labels. Despite its recent successes, AutoEval frameworks still suffer from overconfidence and from substantial storage and computational costs. In that regard, we propose a novel measure, Meta-Distribution Energy (MDE), that allows the AutoEval framework to be both more efficient and more effective. The core of MDE is to establish a meta-distribution statistic on the information (energy) associated with individual samples, then offer a smoother representation enabled by energy-based learning. We further provide theoretical insights by connecting MDE with the classification loss. We provide extensive experiments across modalities, datasets, and different architectural backbones to confirm MDE's validity, together with its superiority compared with prior approaches. We also demonstrate MDE's versatility by showing its seamless integration with large-scale models and its easy adaptation to learning scenarios with noisy or imbalanced labels.

ICRA Conference 2024 Conference Paper

Experience Consistency Distillation Continual Reinforcement Learning for Robotic Manipulation Tasks

Chao Zhao
Jie Xu
Ru Peng
Xingyu Chen
Kuizhi Mei
Xuguang Lan

Continual reinforcement learning aims to help robots acquire skills without catastrophic forgetting, obviating the need to re-learn all tasks from scratch. To enable lifelong acquisition of skills in robots, replay-based continual reinforcement learning has emerged as a promising research direction. These techniques replay data from previous tasks to mitigate forgetting when learning new skills. However, existing replay-based methods store poorly representative experience, and their use of experience from old tasks is inefficient. To address these issues, we propose an experience consistency distillation method for robot continual reinforcement learning to improve the data efficiency of the experience. Specifically, the experience of old tasks is distilled to obtain Markov Decision Process (MDP) data with a high compression ratio and high information content. To ensure consistent data distributions before and after distillation, we further utilize a Fréchet Inception Distance (FID) loss as a regularization constraint. To improve experience utilization efficiency, the policy is then trained using both the distilled data and current task data, with policy distillation performed based on uncertainty metrics. Our method is validated on a continual reinforcement learning simulation platform and in a real-world scene with a UR5e robot arm. Experimental results indicate that our method achieves higher success and a lower buffer size requirement compared to other methods.

NeurIPS Conference 2022 Conference Paper

TaiSu: A 166M Large-scale High-Quality Dataset for Chinese Vision-Language Pre-training

Yulong Liu
Guibo Zhu
Bin Zhu
Qi Song
Guojing Ge
Haoran Chen
GuanHui Qiao
Ru Peng

Vision-Language Pre-training (VLP) has been shown to be an efficient method to improve the performance of models on different vision-and-language downstream tasks. Numerous studies have shown that neural networks may be able to learn some general rules about language and visual concepts from a large-scale weakly labeled image-text dataset. However, most of the public cross-modal datasets that contain more than 100M image-text pairs are in English; there is a lack of available large-scale and high-quality Chinese VLP datasets. In this work, we propose a new framework for automatic dataset acquisition and cleaning, with which we construct a new large-scale and high-quality cross-modal dataset named TaiSu, containing 166 million images and 219 million Chinese captions. Compared with the recently released Wukong dataset, our dataset is built with much stricter restrictions on the semantic correlation of image-text pairs. We also propose to combine texts collected from the web with texts generated by a pre-trained image-captioning model. To the best of our knowledge, TaiSu is currently the largest publicly accessible Chinese cross-modal dataset. Furthermore, we test our dataset on several vision-language downstream tasks. TaiSu outperforms BriVL by a large margin on the zero-shot image-text retrieval task and the zero-shot image classification task. TaiSu also shows better performance than Wukong on the image-retrieval task without using image augmentation for training. Results demonstrate that TaiSu can serve as a promising VLP dataset, both for understanding and generative tasks. More information is available at https://github.com/ksOAn6g5/TaiSu.
