Author name cluster

Dingli Yu

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

15 papers

2 author rows

ICML Conference 2025 Conference Paper

Generalizing from SIMPLE to HARD Visual Reasoning: Can We Mitigate Modality Imbalance in VLMs?

Simon Park 0002
Abhishek Panigrahi
Yun Cheng
Dingli Yu
Anirudh Goyal
Sanjeev Arora

Vision Language Models (VLMs) are impressive at visual question answering and image captioning. But they underperform on multi-step visual reasoning, even compared to LLMs on the same tasks presented in text form, giving rise to perceptions of modality imbalance or brittleness. Towards a systematic study of such issues, we introduce a synthetic framework for assessing the ability of VLMs to perform algorithmic visual reasoning, comprising three tasks: Table Readout, Grid Navigation, and Visual Analogy. Each has two levels of difficulty, SIMPLE and HARD, and even the SIMPLE versions are difficult for frontier VLMs. We propose strategies for training on the SIMPLE version of tasks that improve performance on the corresponding HARD task, i.e., simple-to-hard (S2H) generalization. This controlled setup, where each task also has an equivalent text-only version, allows a quantification of the modality imbalance and how it is impacted by training strategy. We show that 1) explicit image-to-text conversion is important in promoting S2H generalization on images, by transferring reasoning from text; 2) conversion can be internalized at test time. We also report results of a mechanistic study of this phenomenon. We identify measures of gradient alignment that can identify training strategies that promote better S2H generalization. Ablations highlight the importance of chain-of-thought.

EAAI Journal 2025 Journal Article

Handling data heterogeneity for wind turbine fault diagnosis via dynamic ensemble multilevel interactive learning

Shuangxin Wang
Hongrui Li
Jiading Jiang
Meng Li
Junmei Ou
Dingli Yu

The artificial intelligence methods used in wind turbine fault diagnosis have not adequately considered the inherent heterogeneity of SCADA data. This heterogeneity arises from the stochastic nature of wind and the operational characteristics of turbines. Neglecting the heterogeneity may lead to models struggling to capture fault characteristic patterns across varying wind speed ranges, diminishing diagnostic performance. To handle this, a novel dynamic ensemble multilevel interactive (DEMI) learning method is proposed. Firstly, a new index of wind speed jump value is presented and applied to the fine data division of diagnostic units. This process assists diagnostic units in focusing more on learning fault distribution patterns within each wind speed range. Simultaneously, to capture fault feature information from complex and variable data, deep small-world networks with short-path propagation and high clustering properties are designed within each unit for multilevel feature interaction learning. Subsequently, a multi-wind-speed unit model library is constructed, and a dynamic selection algorithm is employed to find high-performance classifiers to reduce the impact of these networks' random topology. Finally, in the diagnostic phase, the final diagnostic results are obtained using online dynamic retrieval ensemble. The experimental results indicate that DEMI improves diagnostic accuracy compared to the currently advanced methods that overlook heterogeneity, while reducing false alarm rates.

ICML Conference 2025 Conference Paper

Weak-to-Strong Generalization Even in Random Feature Networks, Provably

Marko Medvedev
Kaifeng Lyu
Dingli Yu
Sanjeev Arora
Zhiyuan Li 0005
Nathan Srebro

Weak-to-Strong Generalization (Burns et al., 2024) is the phenomenon whereby a strong student, say GPT-4, learns a task from a weak teacher, say GPT-2, and ends up significantly outperforming the teacher. We show that this phenomenon does not require a complex and pretrained learner like GPT-4, and can arise even in simple non-pretrained models, simply due to the size advantage of the student. But, we also show that there are inherent limits to the extent of such weak-to-strong generalization. We consider students and teachers that are random feature models, described by two-layer networks with a random and fixed bottom layer and trained top layer. A ‘weak’ teacher, with a small number of units (i.e., random features), is trained on the population, and a ‘strong’ student, with a much larger number of units (i.e., random features), is trained only on labels generated by the weak teacher. We demonstrate, prove, and understand how the student can outperform the teacher, even though trained only on data labeled by the teacher. We also explain how such weak-to-strong generalization is enabled by early stopping. We then show the quantitative limits of weak-to-strong generalization in this model, and in fact in a much broader class of models, for arbitrary teacher and student feature spaces and a broad class of learning rules, including when the student features are pre-trained or otherwise more informative. In particular, we show that in such models the student’s error can only approach zero if the teacher’s error approaches zero, and a strong student cannot “boost” a slightly-better-than-chance teacher to obtain a small error.

NeurIPS Conference 2024 Conference Paper

Can Models Learn Skill Composition from Examples?

Haoyu Zhao
Simran Kaur
Dingli Yu
Anirudh Goyal
Sanjeev Arora

As large language models (LLMs) become increasingly advanced, their ability to exhibit compositional generalization---the capacity to combine learned skills in novel ways not encountered during training---has garnered significant attention. This type of generalization, particularly in scenarios beyond training data, is also of great interest in the study of AI safety and alignment. A recent study introduced the Skill-Mix evaluation, where models are tasked with composing a short paragraph demonstrating the use of a specified $k$-tuple of language skills. While small models struggled with composing even with $k=3$, larger models like GPT-4 performed reasonably well with $k=5$ and $6$. In this paper, we employ a setup akin to Skill-Mix to evaluate the capacity of smaller models to learn compositional generalization from examples. Utilizing a diverse set of language skills---including rhetorical, literary, reasoning, theory of mind, and common sense---GPT was used to generate text samples that exhibit random subsets of $k$ skills. Subsequent fine-tuning of 7B and 13B parameter models on these combined skill texts, for increasing values of $k$, revealed the following findings: (1) Training on combinations of $k=2$ and $3$ skills results in noticeable improvements in the ability to compose texts with $k=4$ and $5$ skills, despite models never having seen such examples during training. (2) When skill categories are split into training and held-out groups, models significantly improve at composing texts with held-out skills during testing despite having only seen training skills during fine-tuning, illustrating the efficacy of the training approach even with previously unseen skills. This study also suggests that incorporating skill-rich (potentially synthetic) text into training can substantially enhance the compositional capabilities of models.

NeurIPS Conference 2024 Conference Paper

ConceptMix: A Compositional Image Generation Benchmark with Controllable Difficulty

Xindi Wu
Dingli Yu
Yangsibo Huang
Olga Russakovsky
Sanjeev Arora

Compositionality is a critical capability in Text-to-Image (T2I) models, as it reflects their ability to understand and combine multiple concepts from text descriptions. Existing evaluations of compositional capability rely heavily on human-designed text prompts or fixed templates, limiting their diversity and complexity, and yielding low discriminative power. We propose ConceptMix, a scalable, controllable, and customizable benchmark which automatically evaluates compositional generation ability of T2I models. This is done in two stages. First, ConceptMix generates the text prompts: concretely, using categories of visual concepts (e.g., objects, colors, shapes, spatial relationships), it randomly samples an object and k-tuples of visual concepts, then uses GPT-4o to generate text prompts for image generation based on these sampled concepts. Second, ConceptMix evaluates the images generated in response to these prompts: concretely, it checks how many of the k concepts actually appeared in the image by generating one question per visual concept and using a strong VLM to answer them. Through administering ConceptMix to a diverse set of T2I models (proprietary as well as open ones) using increasing values of k, we show that our ConceptMix has higher discrimination power than earlier benchmarks. Specifically, ConceptMix reveals that the performance of several models, especially open models, drops dramatically with increased k. Importantly, it also provides insight into the lack of prompt diversity in widely-used training datasets. Additionally, we conduct extensive human studies to validate the design of ConceptMix and compare our automatic grading with human judgement. We hope it will guide future T2I model development.

NeurIPS Conference 2024 Conference Paper

Keeping LLMs Aligned After Fine-tuning: The Crucial Role of Prompt Templates

Kaifeng Lyu
Haoyu Zhao
Xinran Gu
Dingli Yu
Anirudh Goyal
Sanjeev Arora

Public LLMs such as Llama 2-Chat underwent alignment training and were considered safe. Recently Qi et al. (2024) reported that even benign fine-tuning on seemingly safe datasets can give rise to unsafe behaviors in the models. The current paper is about methods and best practices to mitigate such loss of alignment. We focus on the setting where a public model is fine-tuned before serving users for specific usage, where the model should improve on the downstream task while maintaining alignment. Through extensive experiments on several chat models (Meta's Llama 2-Chat, Mistral AI's Mistral 7B Instruct v0.2, and OpenAI's GPT-3.5 Turbo), this paper uncovers that the prompt templates used during fine-tuning and inference play a crucial role in preserving safety alignment, and proposes the “Pure Tuning, Safe Testing” (PTST) strategy --- fine-tune models without a safety prompt, but include it at test time. This seemingly counterintuitive strategy incorporates an intended distribution shift to encourage alignment preservation. Fine-tuning experiments on GSM8K, ChatDoctor, and OpenOrca show that PTST significantly reduces the rise of unsafe behaviors.

ICLR Conference 2024 Conference Paper

SKILL-MIX: a Flexible and Expandable Family of Evaluations for AI Models

Dingli Yu
Simran Kaur 0001
Arushi Gupta
Jonah Brown-Cohen
Anirudh Goyal
Sanjeev Arora

With LLMs shifting their role from statistical modeling of language to serving as general-purpose AI agents, how should LLM evaluations change? Arguably, a key ability of an AI agent is to flexibly combine, as needed, the basic skills it has learned. The capability to combine skills plays an important role in (human) pedagogy and also in a paper on emergence phenomena (Arora & Goyal, 2023). This work introduces SKILL-MIX, a new evaluation to measure ability to combine skills. Using a list of $N$ skills the evaluator repeatedly picks random subsets of $k$ skills and asks the LLM to produce text combining that subset of skills. Since the number of subsets grows like $N^k$, for even modest $k$ this evaluation will, with high probability, require the LLM to produce text significantly different from any text in the training set. The paper develops a methodology for (a) designing and administering such an evaluation, and (b) automatic grading (plus spot-checking by humans) of the results using GPT-4 as well as the open LLaMA-2 70B model. Administering a version of SKILL-MIX to popular chatbots gave results that, while generally in line with prior expectations, contained surprises. Sizeable differences exist among model capabilities that are not captured by their ranking on popular LLM leaderboards ("cramming for the leaderboard"). Furthermore, simple probability calculations indicate that GPT-4's reasonable performance on $k=5$ is suggestive of going beyond "stochastic parrot" behavior (Bender et al., 2021), i.e., it combines skills in ways that it had not seen during training. We sketch how the methodology can lead to a SKILL-MIX based eco-system of open evaluations for AI capabilities of future models. We maintain a leaderboard of SKILL-MIX at [https://skill-mix.github.io](https://skill-mix.github.io).

ICLR Conference 2024 Conference Paper

Tensor Programs VI: Feature Learning in Infinite Depth Neural Networks

Greg Yang
Dingli Yu
Chen Zhu
Soufiane Hayou

Empirical studies have consistently demonstrated that increasing the size of neural networks often yields superior performance in practical applications. However, there is a lack of consensus regarding the appropriate scaling strategy, particularly when it comes to increasing the depth of neural networks. In practice, excessively large depths can lead to model performance degradation. In this paper, we introduce Depth-$\mu$P, a principled approach for depth scaling, allowing for the training of arbitrarily deep architectures while maximizing feature learning and diversity among nearby layers. Our method involves dividing the contribution of each residual block and the parameter update by the square root of the depth. Through the use of Tensor Programs, we rigorously establish the existence of a limit for infinitely deep neural networks under the proposed scaling scheme. This scaling strategy ensures more stable training for deep neural networks and guarantees the transferability of hyperparameters from shallow to deep models. To substantiate the efficacy of our scaling method, we conduct empirical validation on neural networks with depths up to $2^{10}$.

ICML Conference 2023 Conference Paper

A Kernel-Based View of Language Model Fine-Tuning

Sadhika Malladi
Alexander Wettig
Dingli Yu
Danqi Chen 0001
Sanjeev Arora

It has become standard to solve NLP tasks by fine-tuning pre-trained language models (LMs), especially in low-data settings. There is minimal theoretical understanding of empirical success, e.g., why fine-tuning a model with $10^8$ or more parameters on a couple dozen training points does not result in overfitting. We investigate whether the Neural Tangent Kernel (NTK), which originated as a model to study the gradient descent dynamics of infinitely wide networks with suitable random initialization, describes fine-tuning of pre-trained LMs. This study was inspired by the decent performance of NTK for computer vision tasks (Wei et al., 2022). We extend the NTK formalism to Adam and use Tensor Programs (Yang, 2020) to characterize conditions under which the NTK lens may describe fine-tuning updates to pre-trained language models. Extensive experiments on 14 NLP tasks validate our theory and show that formulating the downstream task as a masked word prediction problem through prompting often induces kernel-based dynamics during fine-tuning. Finally, we use this kernel view to propose an explanation for the success of parameter-efficient subspace-based fine-tuning methods.

NeurIPS Conference 2022 Conference Paper

Fast Mixing of Stochastic Gradient Descent with Normalization and Weight Decay

Zhiyuan Li
Tianhao Wang
Dingli Yu

We prove the Fast Equilibrium Conjecture proposed by Li et al. (2020), i.e., stochastic gradient descent (SGD) on a scale-invariant loss (e.g., using networks with various normalization schemes) with learning rate $\eta$ and weight decay factor $\lambda$ mixes in function space in $\tilde{\mathcal{O}}(\frac{1}{\lambda\eta})$ steps, under two standard assumptions: (1) the noise covariance matrix is non-degenerate and (2) the minimizers of the loss form a connected, compact and analytic manifold. The analysis uses the framework of Li et al. (2021) and shows that for every $T>0$, the iterates of SGD with learning rate $\eta$ and weight decay factor $\lambda$ on the scale-invariant loss converge in distribution in $\Theta\left(\eta^{-1}\lambda^{-1}(T+\ln(\lambda/\eta))\right)$ iterations as $\eta\lambda\to 0$ while satisfying $\eta \le O(\lambda)\le O(1)$. Moreover, the evolution of the limiting distribution can be described by a stochastic differential equation that mixes to the same equilibrium distribution for every initialization around the manifold of minimizers as $T\to\infty$.

NeurIPS Conference 2022 Conference Paper

New Definitions and Evaluations for Saliency Methods: Staying Intrinsic, Complete and Sound

Arushi Gupta
Nikunj Saunshi
Dingli Yu
Kaifeng Lyu
Sanjeev Arora

Saliency methods compute heat maps that highlight portions of an input that were most important for the label assigned to it by a deep net. Evaluations of saliency methods convert this heat map into a new masked input by retaining the $k$ highest-ranked pixels of the original input and replacing the rest with "uninformative" pixels, and checking if the net's output is mostly unchanged. This is usually seen as an explanation of the output, but the current paper highlights reasons why this inference of causality may be suspect. Inspired by logic concepts of completeness & soundness, it observes that the above type of evaluation focuses on completeness of the explanation, but ignores soundness. New evaluation metrics are introduced to capture both notions, while staying in an intrinsic framework---i.e., using the dataset and the net, but no separately trained nets, human evaluations, etc. A simple saliency method is described that matches or outperforms prior methods in the evaluations. Experiments also suggest new intrinsic justifications, based on soundness, for popular heuristic tricks such as TV regularization and upsampling.

ICLR Conference 2020 Conference Paper

Harnessing the Power of Infinitely Wide Deep Nets on Small-data Tasks

Sanjeev Arora
Simon S. Du
Zhiyuan Li 0005
Ruslan Salakhutdinov
Ruosong Wang
Dingli Yu

Recent research shows that the following two models are equivalent: (a) infinitely wide neural networks (NNs) trained under l2 loss by gradient descent with infinitesimally small learning rate, and (b) kernel regression with respect to so-called Neural Tangent Kernels (NTKs) (Jacot et al., 2018). An efficient algorithm to compute the NTK, as well as its convolutional counterparts, appears in Arora et al. (2019a), which allowed studying performance of infinitely wide nets on datasets like CIFAR-10. However, super-quadratic running time of kernel methods makes them best suited for small-data tasks. We report results suggesting neural tangent kernels perform strongly on low-data tasks. 1. On a standard testbed of classification/regression tasks from the UCI database, NTK SVM beats the previous gold standard, Random Forests (RF), and also the corresponding finite nets. 2. On CIFAR-10 with 10 – 640 training samples, Convolutional NTK consistently beats ResNet-34 by 1% - 3%. 3. On VOC07 testbed for few-shot image classification tasks on ImageNet with transfer learning (Goyal et al., 2019), replacing the linear SVM currently used with a Convolutional NTK SVM consistently improves performance. 4. Comparing the performance of NTK with the finite-width net it was derived from, NTK behavior starts at lower net widths than suggested by theoretical analysis (Arora et al., 2019a). NTK’s efficacy may trace to lower variance of output.

ICLR Conference 2020 Conference Paper

Simple and Effective Regularization Methods for Training on Noisily Labeled Data with Generalization Guarantee

Wei Hu 0014
Zhiyuan Li 0005
Dingli Yu

Over-parameterized deep neural networks trained by simple first-order methods are known to be able to fit any labeling of data. Such over-fitting ability hinders generalization when mislabeled training examples are present. On the other hand, simple regularization methods like early-stopping can often achieve highly nontrivial performance on clean test data in these scenarios, a phenomenon not theoretically understood. This paper proposes and analyzes two simple and intuitive regularization methods: (i) regularization by the distance between the network parameters to initialization, and (ii) adding a trainable auxiliary variable to the network output for each training example. Theoretically, we prove that gradient descent training with either of these two methods leads to a generalization guarantee on the clean data distribution despite being trained using noisy labels. Our generalization analysis relies on the connection between wide neural network and neural tangent kernel (NTK). The generalization bound is independent of the network size, and is comparable to the bound one can get when there is no label noise. Experimental results verify the effectiveness of these methods on noisily labeled datasets.

AAMAS Conference 2018 Conference Paper

Balanced Outcomes in Wage Bargaining

Pingzhong Tang
Dingli Yu

Balanced outcomes are a subset of core outcomes that take into consideration fairness and agents’ power in bargaining networks. In this paper, following the seminal works by [3] and [6] on modeling and computing balanced outcomes in unit-capacity trading networks, we explore this concept further by considering its generalization in the so-called wage bargaining network where agents on one side (the employers’ side) may have multiple capacity. It turns out that previous definitions do not trivially extend to this setting. Our first contribution is to incorporate insights from the bargaining theory and define a generalized notion of balanced outcomes in wage bargaining networks. We then consider computational aspects of these newly proposed solutions. We show that there are polynomial-time combinatorial algorithms to compute such solutions in both unweighted and weighted graphs. Our algorithms and proofs are enabled by novel generalizations of techniques proposed by Kleinberg and Tardos and an original technique proposed in this paper called “loose chain”.

AAAI Conference 2018 Conference Paper

Fair Rent Division on a Budget

Ariel Procaccia
Rodrigo Velez
Dingli Yu

The standard approach to fair rent division assumes that agents have quasi-linear utilities, and seeks allocations that are envy free; it underlies an algorithm that is widely used in practice. However, this approach does not take budget constraints into account, and, therefore, may assign agents to rooms they cannot afford. By contrast, we design a polynomial-time algorithm that takes budget constraints as part of its input; it determines whether there exist envy-free allocations that satisfy the budget constraints, and, if so, computes one that optimizes an additional criterion of justice. In particular, this gives a polynomial-time implementation of the budget-constrained maximin solution, where the maximization objective is the minimum utility of any agent. We show that, like its non-budget-constrained counterpart, this solution is unique in terms of utilities (when it exists), and satisfies additional desirable properties.