Regression and Correlation
Regression models the relationship between variables. It is the workhorse of predictive modeling and causal analysis.
Simple Linear Regression
Model: Y = β₀ + β₁X + ε, where ε ~ N(0, σ²).
Least Squares Estimates
Minimize Σ(yᵢ - β₀ - β₁xᵢ)²:
β̂₁ = Σ(xᵢ - x̄)(yᵢ - ȳ) / Σ(xᵢ - x̄)² = S_xy / S_xx
β̂₀ = ȳ - β̂₁x̄
The regression line passes through (x̄, ȳ).
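The closed-form estimates can be computed directly; a minimal numpy sketch with made-up toy data:

```python
import numpy as np

# Toy data (made up for illustration).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.0, 6.2, 7.9, 10.1])

x_bar, y_bar = x.mean(), y.mean()
S_xx = np.sum((x - x_bar) ** 2)
S_xy = np.sum((x - x_bar) * (y - y_bar))

beta1 = S_xy / S_xx            # slope: S_xy / S_xx
beta0 = y_bar - beta1 * x_bar  # intercept: y_bar - beta1 * x_bar
```

By construction, plugging x̄ into the fitted line returns ȳ, which is the "passes through (x̄, ȳ)" property above.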
Interpretation
- β̂₁: For a one-unit increase in X, Y changes by β̂₁ units on average.
- β̂₀: Predicted Y when X = 0 (may not be meaningful if 0 is outside data range).
Coefficient of Determination (R²)
R² = 1 - SS_res / SS_tot = 1 - Σ(yᵢ - ŷᵢ)² / Σ(yᵢ - ȳ)²
- R² ∈ [0, 1]
- Proportion of variance in Y explained by the model
- R² = r² for simple linear regression (square of Pearson correlation)
Inference on β₁
Standard error: SE(β̂₁) = s / √S_xx, where s² = SS_res/(n-2).
t-test for H₀: β₁ = 0: t = β̂₁ / SE(β̂₁), df = n-2.
Confidence interval: β̂₁ ± t_{α/2, n-2} · SE(β̂₁).
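A sketch of the slope test on toy data (made up; a clear linear trend, so the t statistic comes out large):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.2, 2.1, 2.9, 4.2, 4.8, 6.1])
n = len(x)

S_xx = np.sum((x - x.mean()) ** 2)
beta1 = np.sum((x - x.mean()) * (y - y.mean())) / S_xx
beta0 = y.mean() - beta1 * x.mean()

resid = y - (beta0 + beta1 * x)
s2 = np.sum(resid ** 2) / (n - 2)    # s^2 = SS_res / (n-2)
se_beta1 = np.sqrt(s2 / S_xx)        # SE(beta1) = s / sqrt(S_xx)
t_stat = beta1 / se_beta1            # compare against t with n-2 df
```

Reject H₀: β₁ = 0 when |t_stat| exceeds t_{α/2, n-2}.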
Prediction
Point prediction: ŷ = β̂₀ + β̂₁x₀.
Confidence interval for mean response (narrower): reflects only uncertainty about the regression line itself.
CI: ŷ ± t_{α/2, n-2} · s√(1/n + (x₀ - x̄)²/S_xx)
Prediction interval for individual response (wider): additionally includes the individual error ε.
PI: ŷ ± t_{α/2, n-2} · s√(1 + 1/n + (x₀ - x̄)²/S_xx)
Multiple Linear Regression
Model: Y = β₀ + β₁X₁ + β₂X₂ + ... + βₚXₚ + ε.
Matrix form: Y = Xβ + ε.
Least Squares Solution
β̂ = (XᵀX)⁻¹Xᵀy
Predicted values: ŷ = Xβ̂ = X(XᵀX)⁻¹Xᵀy = Hy (H is the hat matrix).
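A sketch with simulated data (toy coefficients, seed fixed); in practice `np.linalg.lstsq` is numerically safer than forming (XᵀX)⁻¹ explicitly:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])  # intercept + 2 predictors
beta_true = np.array([1.0, 2.0, -3.0])
y = X @ beta_true + 0.1 * rng.normal(size=n)

# Least squares via lstsq (avoids explicitly inverting X^T X).
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

# Hat matrix, fine to form at this size; y_hat = H y.
H = X @ np.linalg.inv(X.T @ X) @ X.T
y_hat = H @ y
```

H is a projection matrix, so its trace equals the number of fitted parameters (p + 1 here).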
Interpretation
β̂ⱼ: The change in Y for a one-unit increase in Xⱼ, holding all other predictors constant (partial effect).
Adjusted R²
Penalizes for number of predictors:
R²_adj = 1 - (1 - R²)(n-1)/(n-p-1)
R²_adj can decrease if adding a predictor doesn't improve fit enough.
Multicollinearity
When predictors are highly correlated:
- Coefficient estimates become unstable (large standard errors)
- Individual coefficients may be insignificant even if the model is significant overall
Detection: Variance Inflation Factor VIF(Xⱼ) = 1/(1 - R²ⱼ) where R²ⱼ is R² from regressing Xⱼ on all other predictors. VIF > 10 suggests problems.
Remedies: Remove correlated predictors, PCA, ridge regression.
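A sketch of VIF by its definition (regress each predictor on the rest); the data are simulated so that the first two columns are nearly collinear:

```python
import numpy as np

def vif(X):
    """VIF for each column of predictor matrix X (no intercept column)."""
    out = []
    for j in range(X.shape[1]):
        Xj = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(len(Xj)), others])  # intercept + other predictors
        coef, *_ = np.linalg.lstsq(A, Xj, rcond=None)
        resid = Xj - A @ coef
        r2 = 1 - resid @ resid / np.sum((Xj - Xj.mean()) ** 2)
        out.append(1 / (1 - r2))
    return np.array(out)

rng = np.random.default_rng(1)
z = rng.normal(size=100)
X = np.column_stack([z,
                     z + 0.05 * rng.normal(size=100),   # nearly a copy of column 0
                     rng.normal(size=100)])             # independent predictor
v = vif(X)
```

The two near-duplicate columns get very large VIFs; the independent column stays near 1.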
Polynomial Regression
Model: Y = β₀ + β₁X + β₂X² + ... + βₖXᵏ + ε.
Still a linear model (linear in the parameters β), solved by least squares with the design matrix containing X, X², ..., Xᵏ.
Overfitting risk: High-degree polynomials fit training data perfectly but generalize poorly. Use cross-validation to select degree.
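Because the model is linear in β, fitting is just least squares on the expanded design matrix; a sketch with a simulated quadratic (toy coefficients, seed fixed):

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(-2, 2, 40)
y = 1 + 0.5 * x - 2 * x**2 + 0.1 * rng.normal(size=40)  # true curve is quadratic

# Design matrix with columns [1, x, x^2]; still ordinary least squares.
X = np.vander(x, 3, increasing=True)
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
```

The recovered coefficients land close to the true (1, 0.5, −2); degree selection would be done by cross-validation, as noted above.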
Correlation Measures
Pearson Correlation (r)
r = Σ(xᵢ - x̄)(yᵢ - ȳ) / √(Σ(xᵢ - x̄)² · Σ(yᵢ - ȳ)²)
Measures linear association. r ∈ [-1, 1].
Hypothesis test: H₀: ρ = 0. t = r√(n-2)/√(1-r²), df = n-2.
Fisher z-transformation: z = ½ ln((1+r)/(1-r)) ≈ N(½ ln((1+ρ)/(1-ρ)), 1/(n-3)). Used for CI and comparing correlations.
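A sketch of a 95% CI for ρ via the Fisher transform (z_crit = 1.96 assumed; transforming back with tanh keeps the interval inside [−1, 1]):

```python
import numpy as np

def pearson_ci(r, n, z_crit=1.96):
    """Approximate 95% CI for rho via Fisher z (assumes bivariate normality)."""
    z = 0.5 * np.log((1 + r) / (1 - r))   # same as np.arctanh(r)
    se = 1 / np.sqrt(n - 3)
    lo_z, hi_z = z - z_crit * se, z + z_crit * se
    return np.tanh(lo_z), np.tanh(hi_z)   # back-transform to the r scale

lo, hi = pearson_ci(0.6, 50)
```

Note the interval is asymmetric around r = 0.6, a consequence of the nonlinear back-transform.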
Spearman Rank Correlation (rₛ)
Pearson correlation on the ranks of the data. Measures monotonic (not necessarily linear) association.
rₛ = 1 - 6Σdᵢ² / (n(n²-1))
where dᵢ = rank(xᵢ) - rank(yᵢ).
Robust to outliers. Appropriate for ordinal data.
Kendall's Tau (τ)
Based on concordant and discordant pairs:
τ = (concordant - discordant) / C(n, 2)
A pair (i, j) is concordant if xᵢ < xⱼ and yᵢ < yⱼ (or both reversed), and discordant if the two orderings disagree; pairs tied in x or y are neither.
More robust than Spearman for small samples and tied ranks.
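A direct pair-counting sketch (this is tau-a, i.e. no tie correction, so it matches the formula above exactly when there are no ties):

```python
import numpy as np
from itertools import combinations

def kendall_tau(x, y):
    """Kendall's tau-a by direct pair counting (O(n^2), no tie correction)."""
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        s = (x[i] - x[j]) * (y[i] - y[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    n = len(x)
    return (concordant - discordant) / (n * (n - 1) / 2)

x = [1, 2, 3, 4, 5]
y = [1, 3, 2, 4, 5]   # one swapped pair -> one discordant pair out of C(5,2) = 10
tau = kendall_tau(x, y)
```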
Partial Correlation
Correlation between X and Y after removing the linear effect of Z:
r_{XY·Z} = (r_XY - r_XZ · r_YZ) / √((1-r²_XZ)(1-r²_YZ))
Used to identify direct vs. spurious relationships.
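A sketch of a spurious correlation unmasked by partialling out Z (simulated data where X and Y are both driven by a common cause Z):

```python
import numpy as np

def partial_corr(x, y, z):
    """Correlation of x and y after removing the linear effect of z."""
    r_xy = np.corrcoef(x, y)[0, 1]
    r_xz = np.corrcoef(x, z)[0, 1]
    r_yz = np.corrcoef(y, z)[0, 1]
    return (r_xy - r_xz * r_yz) / np.sqrt((1 - r_xz**2) * (1 - r_yz**2))

rng = np.random.default_rng(3)
z = rng.normal(size=500)
x = z + 0.3 * rng.normal(size=500)   # x and y share the common cause z
y = z + 0.3 * rng.normal(size=500)

raw = np.corrcoef(x, y)[0, 1]        # strong, but entirely via z
partial = partial_corr(x, y, z)      # near zero once z is controlled for
```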
ANOVA (Analysis of Variance)
Tests whether the means of k groups are equal.
One-way ANOVA: H₀: μ₁ = μ₂ = ... = μₖ.
F = (between-group variance) / (within-group variance) = MS_between / MS_within
Under H₀: F ~ F(k-1, N-k).
Decomposition: SS_total = SS_between + SS_within.
Post-hoc tests (after rejecting H₀): Tukey HSD, Bonferroni, Scheffé — identify which pairs differ.
Two-way ANOVA: Two factors + interaction.
Connection to regression: ANOVA is a special case of regression with categorical predictors (dummy/indicator variables).
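The one-way F statistic can be computed straight from the sum-of-squares decomposition; a sketch on simulated groups (seed fixed, equal group sizes assumed only for convenience):

```python
import numpy as np

def one_way_anova_F(groups):
    """F = MS_between / MS_within for a list of 1-D samples."""
    data = np.concatenate(groups)
    grand = data.mean()
    k, N = len(groups), len(data)
    ss_between = sum(len(g) * (g.mean() - grand) ** 2 for g in groups)
    ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
    return (ss_between / (k - 1)) / (ss_within / (N - k))

rng = np.random.default_rng(4)
F_same = one_way_anova_F([rng.normal(0, 1, 30) for _ in range(3)])      # H0 true
F_diff = one_way_anova_F([rng.normal(m, 1, 30) for m in (0, 1, 2)])     # H0 false
```

Under H₀ the statistic hovers around 1; shifted group means inflate it well past typical F(2, 87) critical values.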
Model Selection
Criteria
- AIC (Akaike): AIC = -2ℓ + 2p. Balances fit and complexity.
- BIC (Bayesian): BIC = -2ℓ + p ln(n). Penalizes complexity more than AIC.
- Adjusted R²: Penalizes for number of predictors.
- Cross-validation: k-fold CV gives out-of-sample error estimate.
- Mallows' Cp: Estimates prediction error.
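For a Gaussian linear model, −2ℓ reduces to n·ln(RSS/n) plus a constant, so AIC/BIC comparisons need only the residual sum of squares. A sketch comparing a degree-1 and a degree-5 fit on data whose true relationship is linear (simulated, seed fixed):

```python
import numpy as np

def aic_bic(y, y_hat, p):
    """AIC and BIC for a Gaussian linear model, up to a shared additive constant.
    p counts all estimated mean parameters (including the intercept)."""
    n = len(y)
    rss = np.sum((y - y_hat) ** 2)
    m2ll = n * np.log(rss / n)          # -2 * max log-likelihood + const
    return m2ll + 2 * p, m2ll + p * np.log(n)

rng = np.random.default_rng(5)
n = 200
x = rng.uniform(-1, 1, n)
y = 1 + 2 * x + 0.3 * rng.normal(size=n)   # true model is linear

fits = {}
for deg in (1, 5):
    X = np.vander(x, deg + 1, increasing=True)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    fits[deg] = aic_bic(y, X @ beta, deg + 1)  # (AIC, BIC) per degree
```

BIC's ln(n) penalty (here ln 200 ≈ 5.3 > 2) makes it prefer the simpler, correct degree-1 model.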
Selection Methods
- Forward selection: Start empty, add the best predictor one at a time.
- Backward elimination: Start full, remove the worst predictor one at a time.
- Stepwise: Combine forward and backward.
- Best subsets: Try all 2ᵖ subsets (only feasible for small p).
- Regularization: LASSO (L1) performs automatic variable selection.
Residual Analysis
Assumptions (LINE)
- Linearity: Relationship between X and Y is linear.
- Independence: Errors are independent.
- Normality: Errors are normally distributed.
- Equal variance (homoscedasticity): Var(ε) is constant.
Diagnostic Plots
- Residuals vs fitted: Should show no pattern. Patterns indicate non-linearity or heteroscedasticity.
- Normal QQ plot: Should be approximately linear. Departures indicate non-normality.
- Scale-location: √|standardized residuals| vs fitted. Should be flat. Increasing spread indicates heteroscedasticity.
- Residuals vs leverage: Identifies influential points. Cook's distance combines leverage and residual.
Influential Points
- Leverage: hᵢᵢ (diagonal of hat matrix). High leverage = unusual X value.
- Cook's distance: Measures influence on all predictions. Dᵢ > 1 is concerning.
- DFBETAS: Change in each coefficient when observation i is removed.
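These diagnostics come straight from the hat matrix; a sketch on simulated data with one deliberately extreme X value (seed fixed):

```python
import numpy as np

rng = np.random.default_rng(6)
x = np.concatenate([rng.uniform(0, 10, 30), [30.0]])  # last point: extreme X value
y = 2 * x + rng.normal(0, 1, 31)

X = np.column_stack([np.ones_like(x), x])
H = X @ np.linalg.inv(X.T @ X) @ X.T
h = np.diag(H)                                 # leverages h_ii
resid = y - H @ y
p = X.shape[1]
s2 = resid @ resid / (len(y) - p)              # residual variance
cook = resid**2 / (p * s2) * h / (1 - h)**2    # Cook's distance per observation
```

The extreme-x point gets by far the largest leverage, and the leverages sum to the number of fitted parameters (the trace of H).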
Applications in CS
- Performance modeling: Predict latency from request size, concurrency, memory usage. Multiple regression identifies key factors.
- Cost estimation: Software effort estimation (COCOMO-like models) use regression.
- Capacity planning: Model throughput as a function of resources.
- ML baselines: Linear regression is the simplest baseline. Understanding it deeply helps understand more complex models.
- Causal inference: Regression with controls, instrumental variables, regression discontinuity. A/B testing is experimental regression.
- Recommendation systems: Matrix factorization can be cast as regularized least-squares regression (alternating least squares solves one regression per factor matrix).
- Scientific computing: Fitting models to simulation data. Surrogate models use polynomial regression.