Arrow Research search

Author name cluster

Junfeng Wen

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

11 papers
2 author rows

Possible papers (11)

AAAI Conference 2026 Conference Paper

An MRP Formulation for Supervised Learning: Generalized Temporal Difference Learning Models (Abstract Reprint)

  • Yangchen Pan
  • Junfeng Wen
  • Chenjun Xiao
  • Philip Torr

Background: Traditional supervised learning (SL) assumes data points are independently and identically distributed (i.i.d.), which overlooks dependencies in real-world data. Reinforcement learning (RL), in contrast, models dependencies through state transitions. Objectives: This study aims to bridge SL and RL by reformulating SL problems as RL tasks, enabling the application of RL techniques to a wider range of SL scenarios. We aim to model SL data as interconnected and develop novel temporal difference (TD) algorithms that can accommodate diverse data types. Our objectives are to (1) establish conditions where TD outperforms ordinary least squares (OLS), (2) provide convergence guarantees for the generalized TD algorithm, and (3) validate the approach empirically using synthetic and real-world datasets. Methods: We reformulate traditional SL as an RL problem by modeling data points as a Markov Reward Process (MRP). We then introduce a concept analogous to the inverse link function in generalized linear models, allowing our TD algorithm to handle various data types. Our analysis, grounded in variance estimation, identifies conditions where TD outperforms OLS. We establish a convergence guarantee by conceptualizing the TD update rule as a generalized Bellman operator. Empirical validation begins with synthetic data progressively matching theoretical assumptions to verify our analysis, followed by evaluations on real-world datasets to demonstrate practical utility. Results: Our theoretical analysis shows that TD can outperform OLS in estimation accuracy when data noise is correlated. Our approach generalizes across various loss functions and SL datasets. We prove that the Bellman operator in our TD framework is a contraction, ensuring convergence for both expected and stochastic TD updates. Empirically, TD outperforms SL baselines when data aligns with its assumptions, remains competitive across diverse datasets, and is robust to hyperparameter choices.
Conclusions: This study demonstrates that SL can be reformulated as a problem of interconnected data modeled by an MRP, effectively solved using TD learning. Our generalized TD is theoretically sound, with convergence guarantees, and practically effective. It generalizes OLS, offering superior performance on correlated data. This work enables RL techniques to benefit SL tasks, offering a pathway for future advancements.

JAIR Journal 2025 Journal Article

An MRP Formulation for Supervised Learning: Generalized Temporal Difference Learning Models

  • Yangchen Pan
  • Junfeng Wen
  • Chenjun Xiao
  • Philip H. S. Torr

Background: Traditional supervised learning (SL) assumes data points are independently and identically distributed (i.i.d.), which overlooks dependencies in real-world data. Reinforcement learning (RL), in contrast, models dependencies through state transitions. Objectives: This study aims to bridge SL and RL by reformulating SL problems as RL tasks, enabling the application of RL techniques to a wider range of SL scenarios. We aim to model SL data as interconnected and develop novel temporal difference (TD) algorithms that can accommodate diverse data types. Our objectives are to (1) establish conditions where TD outperforms ordinary least squares (OLS), (2) provide convergence guarantees for the generalized TD algorithm, and (3) validate the approach empirically using synthetic and real-world datasets. Methods: We reformulate traditional SL as an RL problem by modeling data points as a Markov Reward Process (MRP). We then introduce a concept analogous to the inverse link function in generalized linear models, allowing our TD algorithm to handle various data types. Our analysis, grounded in variance estimation, identifies conditions where TD outperforms OLS. We establish a convergence guarantee by conceptualizing the TD update rule as a generalized Bellman operator. Empirical validation begins with synthetic data progressively matching theoretical assumptions to verify our analysis, followed by evaluations on real-world datasets to demonstrate practical utility. Results: Our theoretical analysis shows that TD can outperform OLS in estimation accuracy when data noise is correlated. Our approach generalizes across various loss functions and SL datasets. We prove that the Bellman operator in our TD framework is a contraction, ensuring convergence for both expected and stochastic TD updates. Empirically, TD outperforms SL baselines when data aligns with its assumptions, remains competitive across diverse datasets, and is robust to hyperparameter choices.
Conclusions: This study demonstrates that SL can be reformulated as a problem of interconnected data modeled by an MRP, effectively solved using TD learning. Our generalized TD is theoretically sound, with convergence guarantees, and practically effective. It generalizes OLS, offering superior performance on correlated data. This work enables RL techniques to benefit SL tasks, offering a pathway for future advancements.
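The abstract above describes casting supervised data as a Markov Reward Process and fitting it with temporal-difference learning. As a minimal orientation to the TD machinery involved (a standard TD(0) update with linear function approximation, not the paper's generalized algorithm; the transition format, step size, and function name are illustrative assumptions):

```python
import numpy as np

def td0_linear(transitions, dim, alpha=0.1, gamma=0.9, epochs=500):
    """TD(0) with a linear value function v(x) = theta @ x.

    transitions: list of (x, r, x_next) triples, where x and x_next
    are feature vectors and r is the observed reward.
    """
    theta = np.zeros(dim)
    for _ in range(epochs):
        for x, r, x_next in transitions:
            # TD error: bootstrapped target minus current estimate
            td_error = r + gamma * (theta @ x_next) - (theta @ x)
            theta = theta + alpha * td_error * np.asarray(x)
    return theta
```

On a deterministic two-state chain with reward 1 everywhere and gamma = 0.5, this converges to the Bellman fixed point v = 2 for both states, which is the kind of contraction behavior the paper's convergence analysis generalizes.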

NeurIPS Conference 2025 Conference Paper

LookWhere? Efficient Visual Recognition by Learning Where to Look and What to See from Self-Supervision

  • Anthony Fuller
  • Yousef Yassin
  • Junfeng Wen
  • Tarek Ibrahim
  • Daniel Kyrollos
  • James Green
  • Evan Shelhamer

Vision transformers are ever larger, more accurate, and more expensive to compute. At high resolution, the expense is even more extreme as the number of tokens grows quadratically in the image size. We turn to adaptive computation to cope with this cost by learning to predict where to compute. Our LookWhere method divides the computation between a low-resolution selector and a high-resolution extractor without ever processing the full high-resolution input. We jointly pretrain the selector and extractor without task supervision by distillation from a self-supervised teacher, in effect learning where and what to compute at the same time. Unlike prior token reduction methods, which pay to save by pruning already-computed tokens, and prior token selection methods, which require complex and expensive per-task optimization, LookWhere economically and accurately selects and extracts transferable representations of images. We show that LookWhere excels at sparse recognition on high-resolution inputs (Traffic Signs), maintaining accuracy while reducing FLOPs by 17x and time by 4x, and standard recognition tasks that are global (ImageNet classification) and local (ADE20K segmentation), improving accuracy while reducing time by 1.36x.

ICML Conference 2022 Conference Paper

A Parametric Class of Approximate Gradient Updates for Policy Optimization

  • Ramki Gummadi
  • Saurabh Kumar 0004
  • Junfeng Wen
  • Dale Schuurmans

Approaches to policy optimization have been motivated from diverse principles, based on how the parametric model is interpreted (e.g. value versus policy representation) or how the learning objective is formulated, yet they share a common goal of maximizing expected return. To better capture the commonalities and identify key differences between policy optimization methods, we develop a unified perspective that re-expresses the underlying updates in terms of a limited choice of gradient form and scaling function. In particular, we identify a parameterized space of approximate gradient updates for policy optimization that is highly structured, yet covers both classical and recent examples, including PPO. As a result, we obtain novel yet well-motivated updates that generalize existing algorithms in a way that can deliver benefits both in terms of convergence speed and final result quality. An experimental investigation demonstrates that the additional degrees of freedom provided in the parameterized family of updates can be leveraged to obtain non-trivial improvements both in synthetic domains and on popular deep RL benchmarks.

ICML Conference 2021 Conference Paper

Characterizing the Gap Between Actor-Critic and Policy Gradient

  • Junfeng Wen
  • Saurabh Kumar 0004
  • Ramki Gummadi
  • Dale Schuurmans

Actor-critic (AC) methods are ubiquitous in reinforcement learning. Although it is understood that AC methods are closely related to policy gradient (PG), their precise connection has not been fully characterized previously. In this paper, we explain the gap between AC and PG methods by identifying the exact adjustment to the AC objective/gradient that recovers the true policy gradient of the cumulative reward objective (PG). Furthermore, by viewing the AC method as a two-player Stackelberg game between the actor and critic, we show that the Stackelberg policy gradient can be recovered as a special case of our more general analysis. Based on these results, we develop practical algorithms, Residual Actor-Critic and Stackelberg Actor-Critic, for estimating the correction between AC and PG and use these to modify the standard AC algorithm. Experiments on popular tabular and continuous environments show the proposed corrections can improve both the sample efficiency and final performance of existing AC methods.

ICML Conference 2020 Conference Paper

Batch Stationary Distribution Estimation

  • Junfeng Wen
  • Bo Dai 0001
  • Lihong Li 0001
  • Dale Schuurmans

We consider the problem of approximating the stationary distribution of an ergodic Markov chain given a set of sampled transitions. Classical simulation-based approaches assume access to the underlying process so that trajectories of sufficient length can be gathered to approximate stationary sampling. Instead, we consider an alternative setting where a fixed set of transitions has been collected beforehand, by a separate, possibly unknown procedure. The goal is still to estimate properties of the stationary distribution, but without additional access to the underlying system. We propose a consistent estimator that is based on recovering a correction ratio function over the given data. In particular, we develop a variational power method (VPM) that provides provably consistent estimates under general conditions. In addition to unifying a number of existing approaches from different subfields, we also find that VPM yields significantly better estimates across a range of problems, including queueing, stochastic differential equations, post-processing MCMC, and off-policy evaluation.
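The abstract contrasts its batch, ratio-based estimator with classical approaches that assume access to the underlying process. For orientation only, here is the classical baseline that the variational power method generalizes: plain power iteration on the distribution, which requires the full transition matrix (something the paper's batch setting does not assume). The interface is an illustrative sketch:

```python
import numpy as np

def stationary_power_iteration(P, tol=1e-10, max_iter=10_000):
    """Stationary distribution of a row-stochastic matrix P by power iteration.

    P[i, j] is the probability of moving from state i to state j.
    """
    n = P.shape[0]
    mu = np.full(n, 1.0 / n)        # start from the uniform distribution
    for _ in range(max_iter):
        mu_next = mu @ P            # one step of the chain, in distribution
        if np.abs(mu_next - mu).max() < tol:
            return mu_next
        mu = mu_next
    return mu
```

The returned vector satisfies mu = mu @ P up to tolerance; VPM instead learns a correction ratio against the (unknown) data-collecting distribution from sampled transitions alone.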

ICML Conference 2020 Conference Paper

Domain Aggregation Networks for Multi-Source Domain Adaptation

  • Junfeng Wen
  • Russell Greiner
  • Dale Schuurmans

In many real-world applications, we want to exploit multiple source datasets to build a model for a different but related target dataset. Despite the recent empirical success, most existing research has used ad-hoc methods to combine multiple sources, leading to a gap between theory and practice. In this paper, we develop a finite-sample generalization bound based on domain discrepancy and accordingly propose a theoretically justified optimization procedure. Our algorithm, Domain AggRegation Network (DARN), can automatically and dynamically balance between including more data to increase effective sample size and excluding irrelevant data to avoid negative effects during training. We find that DARN can significantly outperform the state-of-the-art alternatives on multiple real-world tasks, including digit/object recognition and sentiment analysis.

NeurIPS Conference 2016 Conference Paper

Convex Two-Layer Modeling with Latent Structure

  • Vignesh Ganapathiraman
  • Xinhua Zhang
  • Yaoliang Yu
  • Junfeng Wen

Unsupervised learning of structured predictors has been a long-standing pursuit in machine learning. Recently a conditional random field auto-encoder has been proposed in a two-layer setting, allowing latent structured representation to be automatically inferred. Aside from being nonconvex, it also requires the demanding inference of normalization. In this paper, we develop a convex relaxation of a two-layer conditional model which captures latent structure and estimates model parameters, jointly and optimally. We further expand its applicability by resorting to a weaker form of inference---maximum a posteriori. The flexibility of the model is demonstrated on two structures based on total unimodularity---graph matching and linear chain. Experimental results confirm the promise of the method.

IJCAI Conference 2015 Conference Paper

Correcting Covariate Shift with the Frank-Wolfe Algorithm

  • Junfeng Wen
  • Russell Greiner
  • Dale Schuurmans

Covariate shift is a fundamental problem for learning in non-stationary environments where the conditional distribution p(y|x) is the same between training and test data while their marginal distributions p_tr(x) and p_te(x) are different. Although many covariate shift correction techniques remain effective for real-world problems, most do not scale well in practice. In this paper, using inspiration from recent optimization techniques, we apply the Frank-Wolfe algorithm to two well-known covariate shift correction techniques, Kernel Mean Matching (KMM) and Kullback-Leibler Importance Estimation Procedure (KLIEP), and identify an important connection between kernel herding and KMM. Our complexity analysis shows the benefits of the Frank-Wolfe approach over projected gradient methods in solving KMM and KLIEP. An empirical study then demonstrates the effectiveness and efficiency of the Frank-Wolfe algorithm for correcting covariate shift in practice.
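What makes Frank-Wolfe attractive for problems like KMM and KLIEP is that its linear subproblem over a simplex-like constraint set is solved by a single vertex, so each iterate is a sparse convex combination and no projection is needed. A generic sketch on a simplex-constrained quadratic; the objective, step schedule, and names here are illustrative, not the actual KMM/KLIEP objectives:

```python
import numpy as np

def frank_wolfe_simplex(grad_f, n, iters=2000):
    """Minimize a smooth convex f over the probability simplex in R^n.

    grad_f: function returning the gradient of f at a point w.
    """
    w = np.full(n, 1.0 / n)                     # start at the simplex centre
    for t in range(iters):
        g = grad_f(w)
        s = np.zeros(n)
        s[np.argmin(g)] = 1.0                   # linear minimization -> one vertex
        step = 2.0 / (t + 2.0)                  # standard diminishing step size
        w = (1.0 - step) * w + step * s         # stays feasible by convexity
    return w
```

For example, minimizing ||w - c||^2 for a target c inside the simplex drives w toward c at the usual O(1/t) rate, with every iterate automatically nonnegative and summing to one.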

AAAI Conference 2015 Conference Paper

Optimal Estimation of Multivariate ARMA Models

  • Martha White
  • Junfeng Wen
  • Michael Bowling
  • Dale Schuurmans

Autoregressive moving average (ARMA) models are a fundamental tool in time series analysis that offer intuitive modeling capability and efficient predictors. Unfortunately, the lack of globally optimal parameter estimation strategies for these models remains a problem: application studies often adopt the simpler autoregressive model that can be easily estimated by maximizing (a posteriori) likelihood. We develop a (regularized, imputed) maximum likelihood criterion that admits efficient global estimation via structured matrix norm optimization methods. An empirical evaluation demonstrates the benefits of globally optimal parameter estimation over local and moment matching approaches.
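The abstract notes that the simpler autoregressive model is easy to estimate, in contrast to full ARMA. A sketch of that easy case: conditional least squares for AR(p), which coincides with conditional maximum likelihood under Gaussian noise. The function name and interface are illustrative assumptions:

```python
import numpy as np

def fit_ar(series, p):
    """Least-squares fit of an AR(p) model: x_t ~ sum_k a_k * x_{t-k}.

    Returns the estimated coefficients (a_1, ..., a_p).
    """
    series = np.asarray(series, dtype=float)
    n = len(series)
    # Column k holds the lag-(k+1) values aligned with targets series[p:]
    X = np.column_stack([series[p - k - 1 : n - k - 1] for k in range(p)])
    y = series[p:]
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef
```

On a noiseless AR(1) sequence x_t = 0.5 * x_{t-1}, the fit recovers the coefficient 0.5 exactly; the paper's contribution is a criterion with comparable tractability for the harder moving-average part.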

ICML Conference 2014 Conference Paper

Robust Learning under Uncertain Test Distributions: Relating Covariate Shift to Model Misspecification

  • Junfeng Wen
  • Chun-Nam Yu
  • Russell Greiner

Many learning situations involve learning the conditional distribution p(y|x) when the training instances are drawn from the training distribution p_tr(x), even though it will later be used to predict for instances drawn from a different test distribution p_te(x). Most current approaches focus on learning how to reweight the training examples, to make them resemble the test distribution. However, reweighting does not always help, because (we show that) the test error also depends on the correctness of the underlying model class. This paper analyses this situation by viewing the problem of learning under changing distributions as a game between a learner and an adversary. We characterize when such reweighting is needed, and also provide an algorithm, robust covariate shift adjustment (RCSA), that provides relevant weights. Our empirical studies, on UCI datasets and a real-world cancer prognostic prediction dataset, show that our analysis applies, and that our RCSA works effectively.
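Reweighting the training examples, as discussed in the abstract above, typically means minimizing an importance-weighted loss. A minimal sketch for squared loss with given per-example weights; how the weights are chosen (density-ratio estimates, or adversarial weights as in RCSA) is the hard part the paper addresses, and this helper is purely illustrative:

```python
import numpy as np

def weighted_least_squares(X, y, w):
    """Solve min_theta sum_i w_i * (y_i - x_i @ theta)**2 in closed form.

    Uses the weighted normal equations (X^T W X) theta = X^T W y.
    """
    Xw = X * w[:, None]                  # scale each row x_i by its weight w_i
    return np.linalg.solve(X.T @ Xw, Xw.T @ y)
```

When the model class is correct (y exactly linear in x), any positive weighting recovers the same parameters, which is one side of the paper's point: reweighting only matters under model misspecification.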