ICML 2025
Sparse Autoencoders for Hypothesis Generation
Abstract
We describe HypotheSAEs, a general method to hypothesize interpretable relationships between text data (e.g., headlines) and a target variable (e.g., clicks). HypotheSAEs has three steps: (1) train a sparse autoencoder on text embeddings to produce interpretable features describing the data distribution, (2) select features that predict the target variable, and (3) generate a natural language interpretation of each feature (e.g., mentions being surprised or shocked) using an LLM. Each interpretation serves as a hypothesis about what predicts the target variable. Compared to baselines, our method better identifies reference hypotheses on synthetic datasets (at least +0.06 in F1) and produces more predictive hypotheses on real datasets (twice as many significant findings), despite requiring 1-2 orders of magnitude less compute than recent LLM-based methods. HypotheSAEs also produces novel discoveries on two well-studied tasks: explaining partisan differences in Congressional speeches and identifying drivers of engagement with online headlines.
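Steps (1) and (2) can be sketched in a minimal, self-contained form. The code below is an illustration under stated assumptions, not the paper's exact recipe: it trains a tied-weights sparse autoencoder with a ReLU activation and an L1 sparsity penalty on synthetic "embeddings," then ranks hidden features by their absolute correlation with the target. All names (`W`, `lam`, the synthetic data) are hypothetical; step (3), interpreting selected features with an LLM, is noted in a comment since it requires model calls.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for text embeddings: 200 points in 8-d built from
# 3 sparse latent factors; the target y is driven by factor 0 only.
n, d, m = 200, 8, 16                      # samples, embedding dim, SAE hidden dim
true_factors = rng.normal(size=(3, d))
codes = (rng.random((n, 3)) < 0.3) * rng.random((n, 3))
X = codes @ true_factors + 0.01 * rng.normal(size=(n, d))
y = codes[:, 0]

# Step 1 (sketch): tied-weights sparse autoencoder, ReLU hidden units,
# L1 penalty for sparsity, trained by plain gradient descent.
W = 0.1 * rng.normal(size=(d, m))
b = np.zeros(m)
lr, lam = 0.05, 0.02
for _ in range(2000):
    H = np.maximum(X @ W + b, 0.0)        # sparse hidden activations
    Xhat = H @ W.T                        # tied decoder reconstruction
    err = Xhat - X
    dH = (err @ W + lam * np.sign(H)) * (H > 0)   # backprop through ReLU
    gW = X.T @ dH / n + err.T @ H / n     # encoder-path + decoder-path grads
    W -= lr * gW
    b -= lr * dH.mean(axis=0)

# Step 2 (sketch): select hidden features by |correlation| with the target.
H = np.maximum(X @ W + b, 0.0)
active = H.std(axis=0) > 1e-6             # skip dead features
corrs = np.zeros(m)
corrs[active] = [abs(np.corrcoef(H[:, j], y)[0, 1]) for j in np.where(active)[0]]
top = np.argsort(-corrs)[:3]              # candidate hypothesis features

# Step 3 (not shown): pass top-activating examples for each selected feature
# to an LLM and ask for a natural-language interpretation.
```

In practice the paper's setting differs in scale (real embedding models, much larger hidden dimensions) and in how predictive features are selected, but the pipeline shape, sparse coding followed by target-based feature selection followed by interpretation, is the same.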
Context
- Venue
- International Conference on Machine Learning