Targeted Learning

Causal Inference for Observational and Experimental Data

Targeted learning is a framework for causal and statistical inference that incorporates machine learning.

The book Targeted Learning: Causal Inference for Observational and Experimental Data, by Mark J. van der Laan and Sherri Rose, was published in 2011. This text focuses largely on cross-sectional studies.

The second book by van der Laan and Rose, Targeted Learning in Data Science: Causal Inference for Complex Longitudinal Studies, was released by Springer in March 2018. This sequel covers the complicated research questions that arise with longitudinal and dependent data structures.

New Paper Posted: "A Generally Efficient Targeted Minimum Loss-Based Estimator"

Mark van der Laan / January 25, 2016

I recently posted a technical report titled "A Generally Efficient Targeted Minimum Loss-Based Estimator." This article has drastically changed my understanding of the theory of statistical estimation. Let me try to explain.

The goal is to estimate a certain feature, a target parameter, of the data distribution. One example is the estimand for the average causal effect of a binary treatment on an outcome of interest controlling for the measured confounders, under a realistic set of statistical assumptions. Since parametric models are never realistic and knowledge is often limited to conditional independence assumptions and bounds, such realistic statistical models are always highly nonparametric, and possibly completely nonparametric. So the question I asked myself was: Is it possible to construct an asymptotically consistent, normally distributed, efficient substitution estimator of this target parameter under such weak assumptions?
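To make the running example concrete, the estimand can be written out as follows. The notation is an assumption introduced here for illustration and does not appear in the post: W denotes the measured confounders, A the binary treatment, Y the outcome, and P_0 the true data distribution with expectation E_0.

```latex
\Psi(P_0) = E_0\left[ E_0(Y \mid A = 1, W) - E_0(Y \mid A = 0, W) \right]
```

A substitution estimator plugs an estimate of the inner regression into this mapping and averages over the observed values of W. The outcome regression E_0(Y | A, W) and the propensity score P_0(A = 1 | W) are the nuisance parameters of this example.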

The general wisdom in our community has been that, due to the curse of dimensionality of the data and the model, the answer is only yes if one is willing to make enormously strong smoothness assumptions about the density of the data. Specifically, one needs to estimate the relevant nuisance parameters, such as the propensity score or outcome regression in the average causal effect example, at a rate faster than n^{-1/4} as sample size n converges to infinity. Additionally, the general wisdom (e.g., minimax rates for densities and regression functions in nonparametric models) states that this requires enormous smoothness assumptions in combination with using highly data-adaptive estimators.
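Where the n^{-1/4} threshold comes from can be sketched for the average causal effect example. The symbols below are assumptions introduced for illustration: \bar{Q}_0 and g_0 are the true outcome regression and propensity score, \bar{Q}_n and g_n are their estimates, R_2 is the second-order remainder in the expansion of the estimator, C is a constant, and the norm is the L^2(P_0) norm. The bound assumes g_n stays bounded away from 0 and 1.

```latex
|R_2| \le C \, \|\bar{Q}_n - \bar{Q}_0\|_{P_0} \, \|g_n - g_0\|_{P_0},
\qquad
\|\bar{Q}_n - \bar{Q}_0\|_{P_0} = o_P(n^{-1/4})
\ \text{and}\
\|g_n - g_0\|_{P_0} = o_P(n^{-1/4})
\;\Longrightarrow\;
R_2 = o_P(n^{-1/2}).
```

The remainder is a product of two estimation errors, so it becomes negligible at the n^{-1/2} scale of the estimator once each error shrinks faster than n^{-1/4}.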

However, in this new paper I show that we can construct a cross-validation based ensemble learner, a so-called super learner, of these nuisance parameters that converges at a rate faster than n^{-1/4} for any dimension of the data and any size of the model, as long as the true nuisance parameter is right-continuous with left-hand limits and has a bounded variation norm. This convergence is with respect to the so-called (square root of the) loss-based dissimilarity for the loss function for the nuisance parameter. For example, if the nuisance parameter is a density, this would be the Kullback-Leibler divergence, and if the nuisance parameter is a conditional expectation, then this could be the mean squared error norm. Remarkably, there is no need to know the bound on this variation norm of the true nuisance parameter, just that the true nuisance parameter has a variation norm smaller than infinity. In our average causal effect example, we only need to know that the propensity score and outcome regression, as functions of the baseline covariates, have a variation norm smaller than infinity.
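The cross-validation idea behind a super learner can be illustrated with a minimal sketch. This is a simplified "discrete" version that selects the single candidate with the smallest cross-validated risk rather than forming a weighted ensemble, and it is not the construction analyzed in the report. The three candidate estimators, the simulated data, and all function names are hypothetical choices made for illustration.

```python
import random

def fit_mean(train):
    """Candidate 1: ignore x and predict the training mean of y."""
    m = sum(y for _, y in train) / len(train)
    return lambda x: m

def fit_linear(train):
    """Candidate 2: ordinary least squares line in one covariate."""
    n = len(train)
    mx = sum(x for x, _ in train) / n
    my = sum(y for _, y in train) / n
    sxx = sum((x - mx) ** 2 for x, _ in train)
    sxy = sum((x - mx) * (y - my) for x, y in train)
    slope = sxy / sxx if sxx > 0 else 0.0
    return lambda x: my + slope * (x - mx)

def fit_histogram(train, n_bins=5):
    """Candidate 3: piecewise-constant fit on equal-width bins of [0, 1)."""
    overall = sum(y for _, y in train) / len(train)
    sums, counts = [0.0] * n_bins, [0] * n_bins
    for x, y in train:
        b = min(int(x * n_bins), n_bins - 1)
        sums[b] += y
        counts[b] += 1
    # Empty bins fall back to the overall training mean.
    means = [s / c if c else overall for s, c in zip(sums, counts)]
    return lambda x: means[min(int(x * n_bins), n_bins - 1)]

def cv_risk(fit, data, n_folds=5):
    """V-fold cross-validated mean squared error of one candidate."""
    total = 0.0
    for v in range(n_folds):
        valid = data[v::n_folds]
        train = [obs for i, obs in enumerate(data) if i % n_folds != v]
        predict = fit(train)
        total += sum((y - predict(x)) ** 2 for x, y in valid)
    return total / len(data)

# Simulated data (an assumption): a step function in x plus Gaussian noise.
random.seed(0)
data = []
for _ in range(200):
    x = random.random()
    y = (1.0 if x >= 0.5 else 0.0) + random.gauss(0.0, 0.1)
    data.append((x, y))

library = {"mean": fit_mean, "linear": fit_linear, "histogram": fit_histogram}
risks = {name: cv_risk(fit, data) for name, fit in library.items()}
selected = min(risks, key=risks.get)   # cross-validation selector
final_fit = library[selected](data)    # refit the selected candidate on all data
```

The squared-error loss matches the mean squared error dissimilarity mentioned above for a conditional expectation; for a density one would swap in the negative log-likelihood loss.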

So, under essentially no assumptions we can construct a super learner converging at a rate that, even in the worst case, is faster than the critical rate n^{-1/4}. Because the super learner converges to the truth as fast as the oracle-selected estimator (i.e., the best possible choice in the library of candidate estimators for the given data set), its rate and finite-sample performance can be much better than this worst case, and the worst-case rate itself improves as the size of the model shrinks. This worst-case rate is guaranteed by including maximum likelihood estimators that minimize the empirical risk (e.g., maximize the log-likelihood) over all functions in the parameter space that have a variation norm smaller than a set constant M, across a range of values of M. I show in this article that these maximum likelihood estimators correspond to minimizing the empirical risk over linear combinations of indicator basis functions under the constraint that the sum of the absolute values of the coefficients is bounded by this M. That is, in the average causal effect example, we can use constrained penalized linear regression, such as the LASSO, to implement such estimators.
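The link between indicator basis functions and an L1 bound can be sketched in one dimension. The code below is an illustrative assumption, not the implementation from the report: it uses a single covariate, squared-error loss, the penalized (Lagrangian) form of the LASSO in place of the explicit constraint by M, an unpenalized intercept, and plain coordinate descent. All names are hypothetical. In this one-dimensional setting the sum of the absolute coefficients equals the total variation of the fitted step function.

```python
import random

def soft_threshold(z, t):
    """Soft-thresholding operator used by lasso coordinate descent."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def fit_indicator_lasso(xs, ys, lam, n_sweeps=100):
    """Minimize (1/2n) * RSS + lam * sum|beta_j| over step functions
    f(x) = intercept + sum_j beta_j * 1{x >= knot_j}."""
    n = len(xs)
    # One indicator per observed point; the smallest point is dropped
    # because its indicator would duplicate the intercept.
    knots = sorted(set(xs))[1:]
    X = [[1.0 if x >= k else 0.0 for k in knots] for x in xs]
    p = len(knots)
    # For 0/1 columns the mean square equals the mean.
    col_sq = [sum(row[j] for row in X) / n for j in range(p)]
    intercept = sum(ys) / n
    beta = [0.0] * p
    resid = [y - intercept for y in ys]
    for _ in range(n_sweeps):
        # Unpenalized intercept update.
        shift = sum(resid) / n
        intercept += shift
        resid = [r - shift for r in resid]
        for j in range(p):
            rho = sum(X[i][j] * resid[i] for i in range(n)) / n + col_sq[j] * beta[j]
            new_beta = soft_threshold(rho, lam) / col_sq[j]
            delta = new_beta - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                beta[j] = new_beta
    return knots, intercept, beta

def predict(x, knots, intercept, beta):
    """Evaluate the fitted step function at x."""
    return intercept + sum(b for k, b in zip(knots, beta) if x >= k)

# Simulated data (an assumption): a single jump at 0.5 plus small noise.
random.seed(1)
xs = [random.random() for _ in range(50)]
ys = [(1.0 if x >= 0.5 else 0.0) + random.gauss(0.0, 0.05) for x in xs]

knots, intercept, beta = fit_indicator_lasso(xs, ys, lam=0.01)
variation_norm = sum(abs(b) for b in beta)  # L1 norm of the fitted jumps
fitted = [predict(x, knots, intercept, beta) for x in xs]
mse = sum((y - f) ** 2 for y, f in zip(ys, fitted)) / len(ys)

# A very large penalty forces every jump to zero, leaving only the mean.
_, flat_intercept, flat_beta = fit_indicator_lasso(xs, ys, lam=10.0)
```

Running the fit over a grid of penalty values plays the role of the range of values of M: each penalty yields one candidate estimator, and cross-validation can then choose among them.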

Due to this property of this new general super learner for any infinite-dimensional parameter, we can now establish asymptotic efficiency of a one-step targeted maximum likelihood estimator that uses this super learner as an initial estimator for any path-wise differentiable target parameter and any statistical model, as long as this bounded variation assumption holds for the true nuisance parameters relevant for the target parameter.
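For readers who want to see the shape of a targeted maximum likelihood estimator, here is a minimal sketch for the average causal effect with a binary confounder, treatment, and outcome. It uses the standard single logistic fluctuation with a "clever covariate", which may differ from the one-step construction in the report, and it uses simple stratum proportions where a super learner would normally be used. The simulated data-generating process (true effect 0.3) and all names are assumptions made for illustration.

```python
import math
import random

def expit(z):
    return 1.0 / (1.0 + math.exp(-z))

def logit(p):
    return math.log(p / (1.0 - p))

def simulate(n, seed=0):
    """Hypothetical data: W confounds A and Y; the true effect of A is 0.3."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        w = 1 if rng.random() < 0.5 else 0
        a = 1 if rng.random() < 0.3 + 0.4 * w else 0
        y = 1 if rng.random() < 0.2 + 0.3 * a + 0.3 * w else 0
        data.append((w, a, y))
    return data

def tmle_ate(data):
    n = len(data)
    # Propensity score g(w) = P(A = 1 | W = w), estimated per stratum.
    g = {}
    for w in (0, 1):
        stratum = [a for (wi, a, _) in data if wi == w]
        g[w] = sum(stratum) / len(stratum)
    # Deliberately crude initial outcome regression that ignores W,
    # clipped away from 0 and 1 so the logit is defined.
    q0 = {}
    for a in (0, 1):
        arm = [y for (_, ai, y) in data if ai == a]
        q0[a] = min(max(sum(arm) / len(arm), 0.01), 0.99)

    def clever(a, w):
        return a / g[w] - (1 - a) / (1.0 - g[w])

    def q_star(a, w, eps):
        # Logistic fluctuation of the initial fit along the clever covariate.
        return expit(logit(q0[a]) + eps * clever(a, w))

    def score(eps):
        # Derivative of the log-likelihood in eps (decreasing in eps).
        return sum(clever(a, w) * (y - q_star(a, w, eps)) for w, a, y in data) / n

    # Solve score(eps) = 0 by bisection.
    lo, hi = -10.0, 10.0
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if score(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    eps = 0.5 * (lo + hi)
    # Substitution estimator based on the updated outcome regression.
    ate = sum(q_star(1, w, eps) - q_star(0, w, eps) for w, _, _ in data) / n
    naive = q0[1] - q0[0]  # unadjusted difference in means, for comparison
    return ate, naive, score(eps)

data = simulate(4000)
ate, naive, final_score = tmle_ate(data)
```

In this simulation the unadjusted difference in means targets 0.42 in the population because of confounding by W, while the targeting step pulls the substitution estimator toward the true value 0.3 even though the initial outcome regression ignores W.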

This finding has dramatically changed my perspective on the impact of the curse of dimensionality on the construction of data-adaptive estimators of infinite-dimensional parameters and on the construction of asymptotically efficient estimators of smooth functionals of the data distribution. I believe the inclusion of these "variation norm specific maximum likelihood estimators" in the library of the super learner will have a dramatic impact on its practical performance, beyond the above-mentioned impact on the theoretical asymptotic performance. The practical and theoretical performance of this super learner is of course also very important when the goal is to estimate these infinite-dimensional parameters themselves, such as in prediction and density estimation.

In addition, this performance of the super learner also carries over to any other (i.e., non-targeted maximum likelihood) type of estimator of low-dimensional target parameters that uses this super learner. For example, inverse probability of treatment weighted estimators or estimating equation-based estimators benefit as much from using this super learner for the nuisance parameters as the targeted maximum likelihood estimator does. I focused my study on the targeted maximum likelihood estimator since it is not only asymptotically efficient, but also finite-sample robust due to being a substitution estimator respecting the global constraints in the model.
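As a point of comparison, an inverse probability of treatment weighted estimator of the average causal effect can be sketched in a few lines. The propensity score estimator is the pluggable piece: here it is a simple stratum proportion standing in for a super learner fit. The simulated data (true effect 0.3) and all names are assumptions made for illustration.

```python
import random

def simulate(n, seed=0):
    """Hypothetical data: W confounds A and Y; the true effect of A is 0.3."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        w = 1 if rng.random() < 0.5 else 0
        a = 1 if rng.random() < 0.3 + 0.4 * w else 0
        y = 1 if rng.random() < 0.2 + 0.3 * a + 0.3 * w else 0
        data.append((w, a, y))
    return data

def fit_propensity(data):
    """Estimate g(w) = P(A = 1 | W = w) by the treated proportion per stratum."""
    g = {}
    for w in (0, 1):
        stratum = [a for (wi, a, _) in data if wi == w]
        g[w] = sum(stratum) / len(stratum)
    return g

def iptw_ate(data, g):
    """Weight each outcome by the inverse probability of the treatment received."""
    n = len(data)
    treated = sum(a * y / g[w] for w, a, y in data) / n
    control = sum((1 - a) * y / (1.0 - g[w]) for w, a, y in data) / n
    return treated - control

data = simulate(4000)
g_hat = fit_propensity(data)
iptw = iptw_ate(data, g_hat)
```

Unlike the substitution estimator, nothing forces the two weighted means to stay inside [0, 1] when some estimated propensity scores are close to 0 or 1, which is one way to read the finite-sample robustness remark above.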

So, a challenge for the future is to get this new super learner implemented and evaluated in some practical examples. In the meantime, we can enjoy the beautiful theoretical statement that in great generality one can construct an efficient targeted maximum likelihood estimator without making statistical model assumptions we know are not true.