Author name cluster

David Mueller

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

2 papers

1 author row

TMLR Journal 2025 Journal Article

Can Optimization Trajectories Explain Multi-Task Transfer?

David Mueller
Mark Dredze
Nicholas Andrews

Despite the widespread adoption of multi-task training in deep learning, little is understood about how multi-task learning (MTL) affects generalization. Prior work has conjectured that the negative effects of MTL are due to optimization challenges that arise during training, and many optimization methods have been proposed to improve multi-task performance. However, recent work has shown that these methods fail to consistently improve multi-task generalization. In this work, we seek to improve our understanding of these failures by empirically studying how MTL impacts the optimization of tasks, and whether this impact can explain the effects of MTL on generalization. We show that MTL results in a generalization gap (a gap in generalization at comparable training loss) between single-task and multi-task trajectories early into training. However, we find that factors of the optimization trajectory previously proposed to explain generalization gaps in single-task settings cannot explain the generalization gaps between single-task and multi-task models. Moreover, we show that the amount of gradient conflict between tasks is correlated with negative effects to task optimization, but is not predictive of generalization. Our work sheds light on the underlying causes for failures in MTL and, importantly, raises questions about the role of general purpose multi-task optimization algorithms.

PDF Details

NeurIPS Conference 2024 Conference Paper

Where does In-context Learning Happen in Large Language Models?

Suzanna Sia
David Mueller
Kevin Duh

Self-supervised large language models have demonstrated the ability to perform various tasks via in-context learning, but little is known about where the model locates the task with respect to prompt instructions and demonstration examples. In this work, we attempt to characterize the region where large language models transition from recognizing the task to performing the task. Through a series of layer-wise context-masking experiments on GPTNeo2. 7B, Bloom3B, Starcoder2-7B, Llama3. 1-8B, Llama3. 1-8B-Instruct, on Machine Translation and Code generation, we demonstrate evidence of a "task recognition" point where the task is encoded into the input representations and attention to context is no longer necessary. Taking advantage of this redundancy results in 45% computational savings when prompting with 5 examples, and task recognition achieved at layer 14 / 32 using an example with Machine Translation. Our findings also have implications for resource and parameter efficient fine-tuning; we observe a correspondence between strong fine-tuning performance of individual LoRA layers and the task recognition layers.

PDF Details DOI