## Machine Learning and Statistical Inference

**Mark van der Laan** and **Sherri Rose** / April 18, 2016

**The general wisdom is that statistical inference is not possible in the context of data-adaptive (i.e., machine-learning-based) estimation in nonparametric or semiparametric models.** Let's make this statement more concrete. Suppose we have computed a machine-learning-based fit of the conditional mean of a clinical outcome as a function of a treatment and patient characteristics in an observational study. We can use an ensemble learner for this, one that combines a library of algorithms and relies on cross-validation, such as the super learner. This fit is mapped into an estimate of the treatment-specific mean by 1) evaluating the predicted outcome for each subject under the specified treatment condition and 2) averaging these predictions across all n subjects in the sample.
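
Steps 1) and 2) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the data are simulated, an ordinary least-squares fit stands in for the super learner, and the names `plug_in_treatment_mean` and `predict` are hypothetical.

```python
import numpy as np

def plug_in_treatment_mean(predict, W, a):
    """Substitution estimate: set treatment to `a` for everyone, predict, average."""
    A_set = np.full(W.shape[0], a)             # step 1: impose treatment a on all subjects
    return float(np.mean(predict(A_set, W)))   # step 2: average the n predictions

# Simulated observational data (assumption, for illustration only).
rng = np.random.default_rng(0)
n = 2000
W = rng.normal(size=(n, 2))                         # patient characteristics
A = rng.binomial(1, 1 / (1 + np.exp(-W[:, 0])))     # treatment depends on W
Y = 1.0 + 2.0 * A + W[:, 0] - 0.5 * W[:, 1] + rng.normal(size=n)

# Stand-in regression fit: least squares on (1, A, W).
# In the setting of this post, a super learner fit would take its place.
X = np.column_stack([np.ones(n), A, W])
beta, *_ = np.linalg.lstsq(X, Y, rcond=None)

def predict(A_new, W_new):
    X_new = np.column_stack([np.ones(len(A_new)), A_new, W_new])
    return X_new @ beta

psi_1 = plug_in_treatment_mean(predict, W, a=1)
print(f"plug-in estimate of the mean outcome under treatment: {psi_1:.3f}")
```

Any fitted regression that exposes a prediction function of treatment and covariates could be passed in place of `predict`.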

Historically, the default approach has not been to use machine learning; instead, the regression has been estimated with a maximum likelihood estimator (MLE) based on a parametric regression model. Under this setting, the resulting treatment-specific mean is a simple function of the MLE of the unknown regression coefficients. As a consequence, if the regression model is correctly specified, this MLE of the treatment-specific mean is asymptotically linear. (This means that the MLE minus the true treatment-specific mean equals an empirical mean of its influence curve up to a negligible remainder.) As a result, it is approximately normally distributed with mean equal to the true treatment-specific mean and variance equal to the variance of the influence curve divided by the sample size. Confidence intervals are constructed analogously to confidence intervals based on sample means. However, in practice, we know that this parametric model is misspecified. The MLE is then still approximately normally distributed, but biased: the bias does not vanish as the sample size grows while the width of the interval shrinks, so the 95% confidence intervals will have asymptotic coverage equal to zero.
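
In symbols, the asymptotic linearity described above can be written as follows. The notation is an assumption introduced here for illustration: ψ_n is the estimator, ψ_0 the true treatment-specific mean, IC the influence curve, O_i the data on subject i, and σ_n² the sample variance of the estimated influence curve.

$$
\psi_n - \psi_0 = \frac{1}{n}\sum_{i=1}^{n} IC(O_i) + o_P\!\left(n^{-1/2}\right),
\qquad
\sqrt{n}\,(\psi_n - \psi_0) \xrightarrow{d} N\!\left(0, \sigma^2\right),
\quad \sigma^2 = \operatorname{Var}\, IC(O),
$$

$$
\text{95\% confidence interval:}\quad \psi_n \pm 1.96\,\frac{\sigma_n}{\sqrt{n}} .
$$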

If we use a machine learning algorithm, as initially proposed above, then the estimator of the treatment-specific mean will generally not be normally distributed and will have a bias that shrinks more slowly than n^{-1/2}. Because of this, the difference between the estimator and its true value, multiplied by the square root of n, diverges! Since the sampling distribution of the estimator is generally not well approximated by a specified distribution (such as a normal distribution), statistical inference based on such a limit distribution is not an option.

**Remarkably, a minor targeted modification of the machine-learning-based fit may make the resulting estimator of the treatment-specific mean asymptotically linear with influence curve equal to the efficient influence curve.** Thus, this minor modification takes an initial estimator (of the data distribution, or its relevant part, such as the regression function in our example) whose substitution estimator of the target parameter is generally too biased and not normally distributed, and maps it into an updated estimator whose substitution estimator is approximately unbiased and has a normal limit distribution with minimal variance.

Two key conditions must be satisfied for this to be true.

- First, the targeted modification needs to guarantee that the target-parameter-specific score equation is solved by the updated regression estimator. This can be achieved with targeted maximum likelihood estimation by maximizing the likelihood of a one-dimensional parametric submodel through the initial estimate that has score (at zero fluctuation) equal to the target-parameter-specific score. Formally, the target-parameter-specific score is defined as the canonical gradient (i.e., efficient influence curve) of the pathwise derivative of the target parameter.
- Second, the initial estimator must converge to the truth at a rate faster than n^{-1/4}. With the super learner, it is guaranteed that the initial estimator converges at least as fast as the best estimator in the library. Recent work has shown that by including certain algorithms in the library, the rate of the super learner is guaranteed to be faster than n^{-1/4}, for every dimension of the covariate vector!
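
The targeting step in the first condition can be sketched as below. This is a minimal illustration under loud assumptions, not the authors' implementation: the outcome is binary, the submodel is a logistic fluctuation with "clever covariate" A/g(W), the treatment mechanism g is taken as known, and a deliberately crude constant fit stands in for the initial machine-learning estimate. The function name `tmle_treatment_mean` and all variable names are hypothetical.

```python
import numpy as np

def expit(x):
    return 1.0 / (1.0 + np.exp(-x))

def logit(p):
    return np.log(p / (1.0 - p))

def tmle_treatment_mean(Y, A, Q_AW, Q_1W, g_1W):
    """Targeted update of an initial fit for the mean outcome under A=1.

    Y, A  : binary outcome and treatment
    Q_AW  : initial predictions at the observed treatment
    Q_1W  : initial predictions with treatment set to 1
    g_1W  : P(A=1 | W), assumed known here (in practice it is estimated)
    """
    Q_AW = np.clip(Q_AW, 1e-6, 1 - 1e-6)
    Q_1W = np.clip(Q_1W, 1e-6, 1 - 1e-6)
    g_1W = np.clip(g_1W, 0.01, 1.0)      # practical guard against tiny probabilities
    H_AW = A / g_1W                      # clever covariate at the observed treatment
    H_1W = 1.0 / g_1W                    # clever covariate with treatment set to 1

    def score(eps):
        # Derivative in eps of the log-likelihood of the fluctuation submodel.
        return np.sum(H_AW * (Y - expit(logit(Q_AW) + eps * H_AW)))

    # The score is decreasing in eps, so bisection finds the maximum likelihood eps.
    lo, hi = -10.0, 10.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if score(mid) > 0:
            lo = mid
        else:
            hi = mid
    eps = 0.5 * (lo + hi)

    Q_star_AW = expit(logit(Q_AW) + eps * H_AW)   # updated fit, observed treatment
    Q_star_1W = expit(logit(Q_1W) + eps * H_1W)   # updated fit, treatment set to 1
    psi = float(np.mean(Q_star_1W))               # substitution estimator
    ic = H_AW * (Y - Q_star_AW) + Q_star_1W - psi # estimated efficient influence curve
    se = float(np.sqrt(np.var(ic) / len(Y)))
    return psi, se, ic

# Simulated data (assumption, for illustration only).
rng = np.random.default_rng(1)
n = 5000
W = rng.normal(size=n)
g_1W = expit(0.5 * W)
A = rng.binomial(1, g_1W)
Y = rng.binomial(1, expit(-0.5 + A + W))

# Crude initial fit: a constant, ignoring A and W (stand-in for a real learner).
Q_init = np.full(n, Y.mean())

psi, se, ic = tmle_treatment_mean(Y, A, Q_init, Q_init, g_1W)
print(f"targeted estimate: {psi:.3f}, 95% CI: ({psi - 1.96 * se:.3f}, {psi + 1.96 * se:.3f})")
```

After the update, the empirical mean of `ic` is zero up to numerical error, which is the score equation the first condition asks for. The confidence interval is then built from the sample variance of `ic`, just as for a sample mean.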

*For more details on this new advance see this post and the working paper.*