Regression and Correlation
Regression models the relationship between variables. It is the workhorse of predictive modeling and causal analysis.
Simple Linear Regression
Model: Y = β₀ + β₁X + ε, where ε ~ N(0, σ²).
Least Squares Estimates
Minimize Σ(yᵢ - β₀ - β₁xᵢ)²:
β̂₁ = Σ(xᵢ - x̄)(yᵢ - ȳ) / Σ(xᵢ - x̄)² = S_xy / S_xx
β̂₀ = ȳ - β̂₁x̄
The regression line passes through (x̄, ȳ).
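The closed-form estimates can be computed directly; a minimal numpy sketch with made-up toy data:

```python
import numpy as np

# Toy data (made up for illustration).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.0, 6.2, 7.9, 10.1])

x_bar, y_bar = x.mean(), y.mean()
S_xx = np.sum((x - x_bar) ** 2)
S_xy = np.sum((x - x_bar) * (y - y_bar))

beta1 = S_xy / S_xx            # slope: S_xy / S_xx
beta0 = y_bar - beta1 * x_bar  # intercept: y_bar - beta1 * x_bar
```

By construction, plugging x̄ into the fitted line returns ȳ, which is the "passes through (x̄, ȳ)" property above.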
Interpretation
- β̂₁: For a one-unit increase in X, Y changes by β̂₁ units on average.
- β̂₀: Predicted Y when X = 0 (may not be meaningful if 0 is outside data range).
Coefficient of Determination (R²)
R² = 1 - SS_res / SS_tot = 1 - Σ(yᵢ - ŷᵢ)² / Σ(yᵢ - ȳ)²
- R² ∈ [0, 1]
- Proportion of variance in Y explained by the model
- R² = r² for simple linear regression (square of Pearson correlation)
Inference on β₁
Standard error: SE(β̂₁) = s / √S_xx, where s² = SS_res/(n-2).
t-test for H₀: β₁ = 0: t = β̂₁ / SE(β̂₁), df = n-2.
Confidence interval: β̂₁ ± t_{α/2, n-2} · SE(β̂₁).
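A sketch of the slope test on toy data (made up; a clear linear trend, so the t statistic comes out large):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.2, 2.1, 2.9, 4.2, 4.8, 6.1])
n = len(x)

S_xx = np.sum((x - x.mean()) ** 2)
beta1 = np.sum((x - x.mean()) * (y - y.mean())) / S_xx
beta0 = y.mean() - beta1 * x.mean()

resid = y - (beta0 + beta1 * x)
s2 = np.sum(resid ** 2) / (n - 2)    # s^2 = SS_res / (n-2)
se_beta1 = np.sqrt(s2 / S_xx)        # SE(beta1) = s / sqrt(S_xx)
t_stat = beta1 / se_beta1            # compare against t with n-2 df
```

Reject H₀: β₁ = 0 when |t_stat| exceeds t_{α/2, n-2}.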
Prediction
Point prediction: ŷ = β̂₀ + β̂₁x₀.
Confidence interval for mean response (narrower): reflects only uncertainty about the regression line itself.
CI: ŷ ± t_{α/2, n-2} · s√(1/n + (x₀ - x̄)²/S_xx)
Prediction interval for individual response (wider): additionally includes the individual error ε.
PI: ŷ ± t_{α/2, n-2} · s√(1 + 1/n + (x₀ - x̄)²/S_xx)
Multiple Linear Regression
Model: Y = β₀ + β₁X₁ + β₂X₂ + ... + βₚXₚ + ε.
Matrix form: Y = Xβ + ε.
Least Squares Solution
β̂ = (XᵀX)⁻¹Xᵀy
Predicted values: ŷ = Xβ̂ = X(XᵀX)⁻¹Xᵀy = Hy (H is the hat matrix).
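A sketch with simulated data (toy coefficients, seed fixed); in practice `np.linalg.lstsq` is numerically safer than forming (XᵀX)⁻¹ explicitly:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])  # intercept + 2 predictors
beta_true = np.array([1.0, 2.0, -3.0])
y = X @ beta_true + 0.1 * rng.normal(size=n)

# Least squares via lstsq (avoids explicitly inverting X^T X).
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

# Hat matrix, fine to form at this size; y_hat = H y.
H = X @ np.linalg.inv(X.T @ X) @ X.T
y_hat = H @ y
```

H is a projection matrix, so its trace equals the number of fitted parameters (p + 1 here).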
Interpretation
β̂ⱼ: The change in Y for a one-unit increase in Xⱼ, holding all other predictors constant (partial effect).
Adjusted R²
Penalizes for number of predictors:
R²_adj = 1 - (1 - R²)(n-1)/(n-p-1)
R²_adj can decrease if adding a predictor doesn't improve fit enough.
Multicollinearity
When predictors are highly correlated:
- Coefficient estimates become unstable (large standard errors)
- Individual coefficients may be insignificant even if the model is significant overall
Detection: Variance Inflation Factor VIF(Xⱼ) = 1/(1 - R²ⱼ) where R²ⱼ is R² from regressing Xⱼ on all other predictors. VIF > 10 suggests problems.
Remedies: Remove correlated predictors, PCA, ridge regression.
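A sketch of VIF by its definition (regress each predictor on the rest); the data are simulated so that the first two columns are nearly collinear:

```python
import numpy as np

def vif(X):
    """VIF for each column of predictor matrix X (no intercept column)."""
    out = []
    for j in range(X.shape[1]):
        Xj = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(len(Xj)), others])  # intercept + other predictors
        coef, *_ = np.linalg.lstsq(A, Xj, rcond=None)
        resid = Xj - A @ coef
        r2 = 1 - resid @ resid / np.sum((Xj - Xj.mean()) ** 2)
        out.append(1 / (1 - r2))
    return np.array(out)

rng = np.random.default_rng(1)
z = rng.normal(size=100)
X = np.column_stack([z,
                     z + 0.05 * rng.normal(size=100),   # nearly a copy of column 0
                     rng.normal(size=100)])             # independent predictor
v = vif(X)
```

The two near-duplicate columns get very large VIFs; the independent column stays near 1.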
Polynomial Regression
Model: Y = β₀ + β₁X + β₂X² + ... + βₖXᵏ + ε.
Still a linear model (linear in the parameters β), solved by least squares with the design matrix containing X, X², ..., Xᵏ.
Overfitting risk: High-degree polynomials fit training data perfectly but generalize poorly. Use cross-validation to select degree.
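Because the model is linear in β, fitting is just least squares on the expanded design matrix; a sketch with a simulated quadratic (toy coefficients, seed fixed):

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(-2, 2, 40)
y = 1 + 0.5 * x - 2 * x**2 + 0.1 * rng.normal(size=40)  # true curve is quadratic

# Design matrix with columns [1, x, x^2]; still ordinary least squares.
X = np.vander(x, 3, increasing=True)
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
```

The recovered coefficients land close to the true (1, 0.5, −2); degree selection would be done by cross-validation, as noted above.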
Correlation Measures
Pearson Correlation (r)
r = Σ(xᵢ - x̄)(yᵢ - ȳ) / √(Σ(xᵢ - x̄)² · Σ(yᵢ - ȳ)²)
Measures linear association. r ∈ [-1, 1].
Hypothesis test: H₀: ρ = 0. t = r√(n-2)/√(1-r²), df = n-2.
Fisher z-transformation: z = ½ ln((1+r)/(1-r)) ≈ N(½ ln((1+ρ)/(1-ρ)), 1/(n-3)). Used for CI and comparing correlations.
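A sketch of a 95% CI for ρ via the Fisher transform (z_crit = 1.96 assumed; transforming back with tanh keeps the interval inside [−1, 1]):

```python
import numpy as np

def pearson_ci(r, n, z_crit=1.96):
    """Approximate 95% CI for rho via Fisher z (assumes bivariate normality)."""
    z = 0.5 * np.log((1 + r) / (1 - r))   # same as np.arctanh(r)
    se = 1 / np.sqrt(n - 3)
    lo_z, hi_z = z - z_crit * se, z + z_crit * se
    return np.tanh(lo_z), np.tanh(hi_z)   # back-transform to the r scale

lo, hi = pearson_ci(0.6, 50)
```

Note the interval is asymmetric around r = 0.6, a consequence of the nonlinear back-transform.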
Spearman Rank Correlation (rₛ)
Pearson correlation on the ranks of the data. Measures monotonic (not necessarily linear) association.
rₛ = 1 - 6Σdᵢ² / (n(n²-1))
where dᵢ = rank(xᵢ) - rank(yᵢ).
Robust to outliers. Appropriate for ordinal data.
Kendall's Tau (τ)
Based on concordant and discordant pairs:
τ = (concordant - discordant) / C(n, 2)
A pair (i, j) is concordant if xᵢ < xⱼ and yᵢ < yⱼ (or both reversed), and discordant if the two orderings disagree; pairs tied in x or y are neither.
More robust than Spearman for small samples and tied ranks.
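A direct pair-counting sketch (this is tau-a, i.e. no tie correction, so it matches the formula above exactly when there are no ties):

```python
import numpy as np
from itertools import combinations

def kendall_tau(x, y):
    """Kendall's tau-a by direct pair counting (O(n^2), no tie correction)."""
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        s = (x[i] - x[j]) * (y[i] - y[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    n = len(x)
    return (concordant - discordant) / (n * (n - 1) / 2)

x = [1, 2, 3, 4, 5]
y = [1, 3, 2, 4, 5]   # one swapped pair -> one discordant pair out of C(5,2) = 10
tau = kendall_tau(x, y)
```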
Partial Correlation
Correlation between X and Y after removing the linear effect of Z:
r_{XY·Z} = (r_XY - r_XZ · r_YZ) / √((1-r²_XZ)(1-r²_YZ))
Used to identify direct vs. spurious relationships.
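A sketch of a spurious correlation unmasked by partialling out Z (simulated data where X and Y are both driven by a common cause Z):

```python
import numpy as np

def partial_corr(x, y, z):
    """Correlation of x and y after removing the linear effect of z."""
    r_xy = np.corrcoef(x, y)[0, 1]
    r_xz = np.corrcoef(x, z)[0, 1]
    r_yz = np.corrcoef(y, z)[0, 1]
    return (r_xy - r_xz * r_yz) / np.sqrt((1 - r_xz**2) * (1 - r_yz**2))

rng = np.random.default_rng(3)
z = rng.normal(size=500)
x = z + 0.3 * rng.normal(size=500)   # x and y share the common cause z
y = z + 0.3 * rng.normal(size=500)

raw = np.corrcoef(x, y)[0, 1]        # strong, but entirely via z
partial = partial_corr(x, y, z)      # near zero once z is controlled for
```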
ANOVA (Analysis of Variance)
Tests whether the means of k groups are equal.
One-way ANOVA: H₀: μ₁ = μ₂ = ... = μₖ.
F = (between-group variance) / (within-group variance) = MS_between / MS_within
Under H₀: F ~ F(k-1, N-k).
Decomposition: SS_total = SS_between + SS_within.
Post-hoc tests (after rejecting H₀): Tukey HSD, Bonferroni, Scheffé — identify which pairs differ.
Two-way ANOVA: Two factors + interaction.
Connection to regression: ANOVA is a special case of regression with categorical predictors (dummy/indicator variables).
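The one-way F statistic can be computed straight from the sum-of-squares decomposition; a sketch on simulated groups (seed fixed, equal group sizes assumed only for convenience):

```python
import numpy as np

def one_way_anova_F(groups):
    """F = MS_between / MS_within for a list of 1-D samples."""
    data = np.concatenate(groups)
    grand = data.mean()
    k, N = len(groups), len(data)
    ss_between = sum(len(g) * (g.mean() - grand) ** 2 for g in groups)
    ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
    return (ss_between / (k - 1)) / (ss_within / (N - k))

rng = np.random.default_rng(4)
F_same = one_way_anova_F([rng.normal(0, 1, 30) for _ in range(3)])      # H0 true
F_diff = one_way_anova_F([rng.normal(m, 1, 30) for m in (0, 1, 2)])     # H0 false
```

Under H₀ the statistic hovers around 1; shifted group means inflate it well past typical F(2, 87) critical values.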
Model Selection
Criteria
- AIC (Akaike): AIC = -2ℓ + 2p. Balances fit and complexity.
- BIC (Bayesian): BIC = -2ℓ + p ln(n). Penalizes complexity more than AIC.
- Adjusted R²: Penalizes for number of predictors.
- Cross-validation: k-fold CV gives out-of-sample error estimate.
- Mallows' Cp: Estimates prediction error.
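For a Gaussian linear model, −2ℓ reduces to n·ln(RSS/n) plus a constant, so AIC/BIC comparisons need only the residual sum of squares. A sketch comparing a degree-1 and a degree-5 fit on data whose true relationship is linear (simulated, seed fixed):

```python
import numpy as np

def aic_bic(y, y_hat, p):
    """AIC and BIC for a Gaussian linear model, up to a shared additive constant.
    p counts all estimated mean parameters (including the intercept)."""
    n = len(y)
    rss = np.sum((y - y_hat) ** 2)
    m2ll = n * np.log(rss / n)          # -2 * max log-likelihood + const
    return m2ll + 2 * p, m2ll + p * np.log(n)

rng = np.random.default_rng(5)
n = 200
x = rng.uniform(-1, 1, n)
y = 1 + 2 * x + 0.3 * rng.normal(size=n)   # true model is linear

fits = {}
for deg in (1, 5):
    X = np.vander(x, deg + 1, increasing=True)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    fits[deg] = aic_bic(y, X @ beta, deg + 1)  # (AIC, BIC) per degree
```

BIC's ln(n) penalty (here ln 200 ≈ 5.3 > 2) makes it prefer the simpler, correct degree-1 model.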
Selection Methods
- Forward selection: Start empty, add the best predictor one at a time.
- Backward elimination: Start full, remove the worst predictor one at a time.
- Stepwise: Combine forward and backward.
- Best subsets: Try all 2ᵖ subsets (only feasible for small p).
- Regularization: LASSO (L1) performs automatic variable selection.
Residual Analysis
Assumptions (LINE)
- Linearity: Relationship between X and Y is linear.
- Independence: Errors are independent.
- Normality: Errors are normally distributed.
- Equal variance (homoscedasticity): Var(ε) is constant.
Diagnostic Plots
- Residuals vs fitted: Should show no pattern. Patterns indicate non-linearity or heteroscedasticity.
- Normal QQ plot: Should be approximately linear. Departures indicate non-normality.
- Scale-location: √|standardized residuals| vs fitted. Should be flat. Increasing spread indicates heteroscedasticity.
- Residuals vs leverage: Identifies influential points. Cook's distance combines leverage and residual.
Influential Points
- Leverage: hᵢᵢ (diagonal of hat matrix). High leverage = unusual X value.
- Cook's distance: Measures influence on all predictions. Dᵢ > 1 is concerning.
- DFBETAS: Change in each coefficient when observation i is removed.
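These diagnostics come straight from the hat matrix; a sketch on simulated data with one deliberately extreme X value (seed fixed):

```python
import numpy as np

rng = np.random.default_rng(6)
x = np.concatenate([rng.uniform(0, 10, 30), [30.0]])  # last point: extreme X value
y = 2 * x + rng.normal(0, 1, 31)

X = np.column_stack([np.ones_like(x), x])
H = X @ np.linalg.inv(X.T @ X) @ X.T
h = np.diag(H)                                 # leverages h_ii
resid = y - H @ y
p = X.shape[1]
s2 = resid @ resid / (len(y) - p)              # residual variance
cook = resid**2 / (p * s2) * h / (1 - h)**2    # Cook's distance per observation
```

The extreme-x point gets by far the largest leverage, and the leverages sum to the number of fitted parameters (the trace of H).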
Applications in CS
- Performance modeling: Predict latency from request size, concurrency, memory usage. Multiple regression identifies key factors.
- Cost estimation: Software effort estimation (COCOMO-like models) use regression.
- Capacity planning: Model throughput as a function of resources.
- ML baselines: Linear regression is the simplest baseline. Understanding it deeply helps understand more complex models.
- Causal inference: Regression with controls, instrumental variables, regression discontinuity. A/B testing is experimental regression.
- Recommendation systems: Matrix factorization can be cast as regularized least-squares regression (alternating least squares solves one regression per factor matrix).
- Scientific computing: Fitting models to simulation data. Surrogate models use polynomial regression.