
NeurIPS 2025

Computational Budget Should Be Considered in Data Selection

Conference Paper Main Conference Track Artificial Intelligence · Machine Learning

Abstract

Data selection improves computational efficiency by choosing informative subsets of training samples. However, existing methods ignore the compute budget, evaluating and selecting data independently of budget constraints, even though empirical studies show that no algorithm consistently outperforms others (or even random selection) across varying budgets. We therefore argue that the compute budget must be an integral part of any data-selection strategy, since different budgets impose distinct requirements on data quantity, quality, and distribution for effective training. To this end, we propose a novel Computational budget-Aware Data Selection (CADS) method and naturally formulate it as a bilevel optimization problem, where the inner loop trains the model on a selected subset of training data within the computational budget, while the outer loop optimizes data selection based on model evaluation. Our technical contributions lie in addressing two main challenges in solving this bilevel problem: the expensive Hessian estimation required for outer-loop gradients and the computational burden of reaching inner-loop optimality at every iteration. For the first, we propose a probabilistic reparameterization strategy and compute the gradient with a Hessian-free policy gradient estimator. For the second, we transform the inner optimization problem into a penalty term in the outer objective and further show that only the minimum of a one-dimensional loss needs to be estimated to compute the gradient, significantly improving efficiency. To accommodate different data-selection granularities, we present two complementary CADS variants: an example-level version (CADS-E) offering fine-grained control and a source-level version (CADS-S) that aggregates samples into source groups for scalable, efficient selection without sacrificing effectiveness.

Extensive experiments show that our method achieves performance gains of up to 14.42% over baselines on vision and language benchmarks. Additionally, CADS achieves a 3-20× speedup over conventional bilevel implementations, with the acceleration correlating positively with compute budget size.
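The key mechanism in the abstract — reparameterizing the selection mask as independent Bernoulli variables and estimating the outer-loop gradient with a Hessian-free score-function (policy gradient) estimator — can be illustrated with a toy sketch. This is not the authors' implementation: the proxy outer loss (a closed-form least-squares fit standing in for "train under budget, then evaluate"), the hyperparameters, and all names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def outer_loss(mask, X, y):
    # Hypothetical proxy for "train on the selected subset within the
    # budget, then evaluate": fit least squares on the kept samples and
    # report mean squared error over the full set.
    sel = mask.astype(bool)
    if sel.sum() < 2:
        return 1.0
    w, *_ = np.linalg.lstsq(X[sel], y[sel], rcond=None)
    return float(np.mean((X @ w - y) ** 2))

def policy_gradient_step(theta, X, y, n_mc=32, lr=0.5):
    # Probabilistic reparameterization: sample i is kept with probability
    # p_i = sigmoid(theta_i). The Hessian-free score-function estimator is
    #   grad_theta E[L(m)] = E[(L(m) - b) * grad_theta log P(m | theta)],
    # where b is a baseline for variance reduction.
    p = 1.0 / (1.0 + np.exp(-theta))
    masks = [(rng.random(p.shape) < p).astype(float) for _ in range(n_mc)]
    losses = [outer_loss(m, X, y) for m in masks]
    baseline = np.mean(losses)
    grad = np.zeros_like(theta)
    for m, L in zip(masks, losses):
        # d/dtheta log Bernoulli(m; sigmoid(theta)) = m - p
        grad += (L - baseline) * (m - p)
    return theta - lr * grad / n_mc

# Toy data: linear targets with a few corrupted samples worth de-selecting.
n, d = 40, 3
X = rng.normal(size=(n, d))
y = X @ np.array([1.0, -2.0, 0.5])
y[:5] += 10 * rng.normal(size=5)

theta = np.zeros(n)
for _ in range(200):
    theta = policy_gradient_step(theta, X, y)

keep_prob = 1.0 / (1.0 + np.exp(-theta))  # learned per-sample keep probabilities
```

Because the estimator only needs samples of the outer loss and the score `m - p`, no second-order derivatives of the inner training problem ever appear, which is the point of the Hessian-free construction.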

Authors

Keywords

No keywords are indexed for this paper.

Context

Venue
Annual Conference on Neural Information Processing Systems
Archive span
1987-2025
Indexed papers
30776
Paper id
1116679342494314827