Explain bootstrapping. When would you use a bootstrap confidence interval instead of a CLT-based one? Walk me through the algorithm and give a concrete example of when bootstrap is the right tool.

Question

Accepted Answer

The core idea: plug-in principle

The sampling distribution of a statistic θ̂ (e.g., the sample mean) tells you how much θ̂ varies across repeated samples from the population. Computing it exactly requires the true population distribution, which you don't have.

Bootstrap's insight: treat the sample as the population. Resample from it with replacement to simulate new samples and compute how much your statistic varies.

Algorithm

1. Start with your sample of size n: X = (x_1, ..., x_n)
2. Draw B bootstrap samples (B=1000-10000 is typical):
   - Each bootstrap sample: draw n points from X with replacement
   - Some original points appear multiple times; some not at all (on average about 36.8% of the original points are left out of each bootstrap sample, since (1 - 1/n)^n ≈ 1/e)
3. Compute your statistic θ̂_b on each bootstrap sample
4. The distribution of (θ̂_1, ..., θ̂_B) approximates the sampling distribution of θ̂
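
The four steps above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the function name `bootstrap_distribution`, the seed, and the synthetic log-normal data are all assumptions made for the example. It uses only the standard library to stay self-contained; NumPy would be the usual choice for speed.

```python
import random
import statistics


def bootstrap_distribution(sample, statistic, n_boot=2000, seed=0):
    """Return n_boot bootstrap replicates of statistic(sample)."""
    rng = random.Random(seed)  # fixed seed only so the sketch is reproducible
    n = len(sample)
    replicates = []
    for _ in range(n_boot):
        # Step 2: draw n points from the sample with replacement.
        resample = rng.choices(sample, k=n)
        # Step 3: recompute the statistic on the resample.
        replicates.append(statistic(resample))
    return replicates


# Toy data (made up for illustration): a right-skewed sample of size 50.
data_rng = random.Random(42)
sample = [data_rng.lognormvariate(0.0, 1.0) for _ in range(50)]

replicates = bootstrap_distribution(sample, statistics.median)

# Step 4: the spread of the replicates estimates the standard error.
print("sample median:", statistics.median(sample))
print("bootstrap SE :", statistics.stdev(replicates))
```

Any function that maps a list to a number can be passed as `statistic`, which is what makes the procedure generic.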

Bootstrap confidence interval (percentile method)

95% CI = [θ̂_(0.025), θ̂_(0.975)]

The 2.5th and 97.5th percentiles of your bootstrap distribution. No formula needed.

More accurate: BCa (bias-corrected and accelerated) bootstrap. It shifts the percentile cutoffs to account for bias and skewness in the sampling distribution, which the plain percentile interval ignores.
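
To make the difference between the two intervals concrete, here is a hedged sketch that computes both on the same replicates. The helper names (`percentile`, `bootstrap_cis`) and the synthetic data are assumptions for illustration. The BCa part follows the standard construction: a bias correction `z0` from the share of replicates below the point estimate, and an acceleration `a` from jackknife (leave-one-out) estimates.

```python
import math
import random
import statistics
from statistics import NormalDist


def percentile(sorted_vals, q):
    """Linear-interpolation percentile of an already sorted list, q in [0, 1]."""
    pos = q * (len(sorted_vals) - 1)
    lo, hi = math.floor(pos), math.ceil(pos)
    frac = pos - lo
    return sorted_vals[lo] * (1 - frac) + sorted_vals[hi] * frac


def bootstrap_cis(sample, statistic, n_boot=2000, alpha=0.05, seed=0):
    """Return (point estimate, percentile CI, BCa CI)."""
    rng = random.Random(seed)
    n = len(sample)
    theta_hat = statistic(sample)
    boot = sorted(statistic(rng.choices(sample, k=n)) for _ in range(n_boot))

    # Percentile interval: read the quantiles straight off the replicates.
    pct = (percentile(boot, alpha / 2), percentile(boot, 1 - alpha / 2))

    # Bias correction z0: how far theta_hat sits from the bootstrap median.
    norm = NormalDist()
    frac_below = sum(b < theta_hat for b in boot) / n_boot
    frac_below = min(max(frac_below, 1 / n_boot), 1 - 1 / n_boot)  # keep inv_cdf finite
    z0 = norm.inv_cdf(frac_below)

    # Acceleration a from leave-one-out (jackknife) estimates.
    jack = [statistic(sample[:i] + sample[i + 1:]) for i in range(n)]
    jack_mean = statistics.fmean(jack)
    num = sum((jack_mean - j) ** 3 for j in jack)
    den = 6.0 * sum((jack_mean - j) ** 2 for j in jack) ** 1.5
    a = num / den if den > 0 else 0.0

    def adjusted(q):
        # Map a nominal quantile level to its bias- and skew-adjusted level.
        z = norm.inv_cdf(q)
        return norm.cdf(z0 + (z0 + z) / (1 - a * (z0 + z)))

    bca = (percentile(boot, adjusted(alpha / 2)),
           percentile(boot, adjusted(1 - alpha / 2)))
    return theta_hat, pct, bca


# Synthetic right-skewed sample (made up for illustration).
data_rng = random.Random(42)
sample = [data_rng.lognormvariate(0.0, 1.0) for _ in range(50)]

theta_hat, pct, bca = bootstrap_cis(sample, statistics.fmean)
print("mean          :", theta_hat)
print("percentile CI :", pct)
print("BCa CI        :", bca)
```

In practice a library routine such as SciPy's `scipy.stats.bootstrap` is a safer choice than hand-rolled BCa; the sketch is only meant to show where the adjustment comes from.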

When to use bootstrap over CLT-based CI

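Estimator has no convenient closed-form variance: the standard error of the sample median, correlation coefficient, AUC, NDCG, or F1 either has no simple formula or only an asymptotic one that depends on unknown quantities. Bootstrap gives you a variance estimate for most statistics with the same procedure. The main exceptions are non-smooth statistics such as the sample maximum, where it can fail.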

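Non-normal, small samples: the CLT only holds asymptotically. For n=50 with a heavily skewed metric (revenue, session time), the normal approximation can be poor. Bootstrap makes no normality assumption, although it is still limited by how well the sample represents the population (see Limitations).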

Complex statistics: ratios, differences of medians, maximum of several statistics. CLT-based CIs for these require delta-method approximations where an analytic form exists at all. Bootstrap handles them directly.
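
As an illustration of the ratio case, the sketch below bootstraps revenue per session, a ratio of two sums. The data, the function name `revenue_per_session`, and the row layout `(revenue, sessions)` are hypothetical. The point to notice is that whole rows are resampled, so each user's revenue stays paired with that user's session count.

```python
import random
import statistics


def revenue_per_session(rows):
    """Ratio statistic: total revenue divided by total sessions."""
    total_revenue = sum(revenue for revenue, _ in rows)
    total_sessions = sum(sessions for _, sessions in rows)
    return total_revenue / total_sessions


# Synthetic per-user rows (revenue, sessions); all values are made up.
rng = random.Random(0)
users = []
for _ in range(200):
    sessions = rng.randint(1, 10)
    revenue = rng.expovariate(1 / 20.0) if rng.random() < 0.3 else 0.0
    users.append((revenue, sessions))

# Resample whole users so each revenue stays paired with its session count.
n_boot = 2000
boot = [
    revenue_per_session(rng.choices(users, k=len(users)))
    for _ in range(n_boot)
]

# quantiles(..., n=40) returns cut points at 2.5%, 5%, ..., 97.5%.
cuts = statistics.quantiles(boot, n=40)
ci_low, ci_high = cuts[0], cuts[-1]
point = revenue_per_session(users)
print(f"revenue/session = {point:.3f}, 95% CI = [{ci_low:.3f}, {ci_high:.3f}]")
```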

Concrete ML examples

Model evaluation CI: you have 500 test examples. What's the 95% CI on your model's AUC?
- Bootstrap sample the test set, compute AUC each time
- Take percentiles of the AUC distribution
- No formula required; the interval is only as trustworthy as the test set is representative, so treat it with care when the test set is very small
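
A minimal sketch of the AUC interval, assuming binary labels and real-valued scores. The labels and scores here are synthetic, and `auc` and `bootstrap_auc_ci` are names invented for the example. AUC is computed in its Mann-Whitney form (probability that a positive outscores a negative, ties counted as one half) so the block needs no third-party library.

```python
import random
import statistics
from bisect import bisect_left, bisect_right


def auc(labels, scores):
    """AUC as P(score_pos > score_neg), counting ties as 1/2."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = sorted(s for y, s in zip(labels, scores) if y == 0)
    wins = 0.0
    for p in pos:
        below = bisect_left(neg, p)          # negatives strictly below p
        ties = bisect_right(neg, p) - below  # negatives tied with p
        wins += below + 0.5 * ties
    return wins / (len(pos) * len(neg))


def bootstrap_auc_ci(labels, scores, n_boot=1000, seed=0):
    """Percentile 95% CI for AUC, resampling test examples with replacement."""
    rng = random.Random(seed)
    n = len(labels)
    reps = []
    while len(reps) < n_boot:
        idx = rng.choices(range(n), k=n)
        ys = [labels[i] for i in idx]
        if len(set(ys)) < 2:
            continue  # AUC is undefined when a resample has only one class
        reps.append(auc(ys, [scores[i] for i in idx]))
    cuts = statistics.quantiles(reps, n=40)  # cut points at 2.5%, ..., 97.5%
    return cuts[0], cuts[-1]


# Synthetic test set of 500 examples (made up for illustration).
rng = random.Random(7)
labels = [rng.randint(0, 1) for _ in range(500)]
scores = [rng.gauss(1.0 if y == 1 else 0.0, 1.0) for y in labels]

point = auc(labels, scores)
ci_low, ci_high = bootstrap_auc_ci(labels, scores)
print(f"AUC = {point:.3f}, 95% CI = [{ci_low:.3f}, {ci_high:.3f}]")
```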

Comparing two models: is Model A's AUC significantly higher than Model B's?
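- Bootstrap the test set, scoring both models on the same resampled examples, and compute AUC_A - AUC_B each time
- If 0 falls below the 2.5th percentile of the bootstrap distribution (that is, outside the 95% CI), the difference is significant at the two-sided 5% level. Using the 5th percentile instead corresponds to a one-sided test at the 5% level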
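
A sketch of the paired comparison follows. Both models are evaluated on the same resampled indices in every replicate, so the correlation between their errors is preserved, and the decision uses a two-sided 95% interval for the difference. The two score vectors are synthetic stand-ins for real model outputs, and all names are assumptions for the example.

```python
import random
import statistics
from bisect import bisect_left, bisect_right


def auc(labels, scores):
    """AUC as P(score_pos > score_neg), counting ties as 1/2."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = sorted(s for y, s in zip(labels, scores) if y == 0)
    wins = 0.0
    for p in pos:
        below = bisect_left(neg, p)
        wins += below + 0.5 * (bisect_right(neg, p) - below)
    return wins / (len(pos) * len(neg))


# Synthetic test set: a shared noise term makes the two models' errors
# correlated, as they would be when scoring the same examples.
rng = random.Random(1)
labels = [rng.randint(0, 1) for _ in range(500)]
shared = [rng.gauss(0.0, 0.7) for _ in labels]
scores_a = [1.2 * y + e + rng.gauss(0.0, 0.7) for y, e in zip(labels, shared)]
scores_b = [1.0 * y + e + rng.gauss(0.0, 0.7) for y, e in zip(labels, shared)]

n = len(labels)
diffs = []
while len(diffs) < 1000:
    idx = rng.choices(range(n), k=n)  # one set of indices for both models
    ys = [labels[i] for i in idx]
    if len(set(ys)) < 2:
        continue  # AUC undefined for a single-class resample
    auc_a = auc(ys, [scores_a[i] for i in idx])
    auc_b = auc(ys, [scores_b[i] for i in idx])
    diffs.append(auc_a - auc_b)

cuts = statistics.quantiles(diffs, n=40)  # cut points at 2.5%, ..., 97.5%
ci_low, ci_high = cuts[0], cuts[-1]
excludes_zero = ci_low > 0 or ci_high < 0
print(f"AUC_A - AUC_B 95% CI = [{ci_low:.4f}, {ci_high:.4f}]")
print("CI excludes 0:", excludes_zero)
```

Resampling the two models independently would ignore that they share test examples and would typically give a wider interval for the difference.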

Feature importance variance: how stable is the feature importance ranking from your random forest?
- Bootstrap the training data, fit a forest each time, record importance rankings
- Identify which features have stable vs volatile importance

Limitations

Doesn't fix small sample problems: bootstrap approximates the true sampling distribution using your sample as a proxy for the population. If your sample is small and unrepresentative, bootstrap distributions are also unrepresentative. Bootstrap estimates the variability of your statistic; it doesn't add information that isn't in your data.

i.i.d. assumption: standard bootstrap assumes the observations are independent and identically distributed. For time series data, a block bootstrap is needed: resample contiguous blocks to preserve temporal correlation.
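
One common variant is the moving block bootstrap, sketched below under stated assumptions: the AR(1) series is synthetic, the block length of 20 is an arbitrary choice for illustration (in practice it has to be tuned to the strength of the dependence), and `moving_block_resample` is a name invented for this example.

```python
import random
import statistics


def moving_block_resample(series, block_len, rng):
    """Concatenate randomly chosen contiguous blocks until the length is n."""
    n = len(series)
    n_starts = n - block_len + 1  # number of valid block start positions
    out = []
    while len(out) < n:
        start = rng.randrange(n_starts)
        out.extend(series[start:start + block_len])
    return out[:n]  # trim the last block so the resample has length n


# Synthetic autocorrelated series: AR(1) with coefficient 0.8.
rng = random.Random(0)
series = [0.0]
for _ in range(299):
    series.append(0.8 * series[-1] + rng.gauss(0.0, 1.0))

n_boot = 1000
# Naive i.i.d. bootstrap of the mean (ignores the time ordering).
iid_means = [
    statistics.fmean(rng.choices(series, k=len(series)))
    for _ in range(n_boot)
]
# Block bootstrap of the mean (keeps short-range dependence inside blocks).
block_means = [
    statistics.fmean(moving_block_resample(series, 20, rng))
    for _ in range(n_boot)
]

se_iid = statistics.stdev(iid_means)
se_block = statistics.stdev(block_means)
print("SE of mean, i.i.d. bootstrap:", se_iid)
print("SE of mean, block bootstrap :", se_block)
```

With positively autocorrelated data the i.i.d. bootstrap tends to understate the standard error of the mean, which is the failure the block version is meant to address.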

Computationally expensive: 10,000 bootstraps × fitting a model each time is slow. When the model is expensive to refit, delta method approximations are often preferred; Bayesian methods are another route to uncertainty estimates, though they are not necessarily cheaper.