Descriptive Statistics
Descriptive statistics summarize and describe the main features of a dataset. Before any modeling, understanding the data through descriptive statistics is the essential first step.
Measures of Central Tendency
Mean (Average)
Arithmetic mean: x̄ = (1/n) Σ xᵢ
Weighted mean: x̄ = Σ wᵢxᵢ / Σ wᵢ
Geometric mean: (Π xᵢ)^(1/n) — used for growth rates, ratios.
Harmonic mean: n / Σ(1/xᵢ) — used for rates and ratios (e.g., F1 score is harmonic mean of precision and recall).
Trimmed mean: Remove the top and bottom k% before averaging. Robust to outliers.
Properties: The mean minimizes the sum of squared deviations: argmin_μ Σ(xᵢ - μ)².
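The different means can be compared with Python's `statistics` module. A minimal sketch with made-up values; `trimmed_mean` is a hypothetical helper, not part of the stdlib:

```python
import statistics

data = [1.0, 2.0, 4.0, 8.0]

arithmetic = statistics.mean(data)           # 3.75
geometric = statistics.geometric_mean(data)  # 64 ** 0.25 ≈ 2.83
harmonic = statistics.harmonic_mean(data)    # 4 / 1.875 ≈ 2.13
# For positive data, AM ≥ GM ≥ HM always holds.

def trimmed_mean(xs, k):
    """Drop the lowest and highest k-fraction of values before averaging."""
    xs = sorted(xs)
    cut = int(len(xs) * k)
    return statistics.mean(xs[cut:len(xs) - cut])
```

Note how the trimmed mean shrugs off an extreme value: `trimmed_mean([1, 2, 3, 4, 100], 0.2)` discards `1` and `100` and averages the rest.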
Median
The middle value when data is sorted. For n values:
- Odd n: the (n+1)/2-th value
- Even n: average of the n/2-th and (n/2+1)-th values
Properties:
- Robust to outliers (unlike mean)
- Minimizes the sum of absolute deviations: argmin_μ Σ|xᵢ - μ|
- For skewed distributions, median is often more representative than mean
Mode
The most frequently occurring value. Can have multiple modes (bimodal, multimodal) or no mode.
Best suited for categorical data.
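A quick illustration of the mean's outlier sensitivity versus the median's robustness, using hypothetical salary data (in $k):

```python
import statistics

salaries = [40, 45, 50, 55, 60, 1000]  # one extreme outlier

mean_salary = statistics.mean(salaries)      # 208.33: dragged up by the outlier
median_salary = statistics.median(salaries)  # 52.5: average of 3rd and 4th values

colors = ["red", "blue", "red", "green"]
most_common = statistics.mode(colors)        # mode works on categorical data
```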
When to Use What
| Measure | Best for | Sensitive to outliers |
|---|---|---|
| Mean | Symmetric distributions, continuous data | Yes |
| Median | Skewed distributions, ordinal data | No |
| Mode | Categorical data, identifying peaks | No |
Measures of Dispersion
Range
Range = max - min
Simple but very sensitive to outliers.
Variance
Population variance: σ² = (1/N) Σ(xᵢ - μ)²
Sample variance: s² = (1/(n-1)) Σ(xᵢ - x̄)²
The n-1 denominator (Bessel's correction) makes the sample variance an unbiased estimator of population variance.
Computational formula: s² = (Σxᵢ² - n·x̄²) / (n-1)
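Both variants plus the computational shortcut, sketched with the stdlib (example values are arbitrary):

```python
import statistics

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
n = len(data)
xbar = statistics.mean(data)          # 5.0

pop_var = statistics.pvariance(data)  # divides by N     → 4.0
samp_var = statistics.variance(data)  # divides by n - 1 → 32/7 ≈ 4.57

# Computational formula: s² = (Σxᵢ² - n·x̄²) / (n - 1)
shortcut = (sum(x * x for x in data) - n * xbar ** 2) / (n - 1)
```

The shortcut is algebraically identical to the definition, though numerically it can lose precision when the mean is large relative to the spread.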
Standard Deviation
σ = √(σ²) or s = √(s²)
Same units as the data. Interpretable as "typical distance from the mean."
Coefficient of Variation (CV)
CV = s / x̄ × 100%
Relative (unitless) measure of variability. Allows comparison between datasets with different units or scales, but is only meaningful for ratio-scale data with x̄ > 0.
Interquartile Range (IQR)
IQR = Q₃ - Q₁
The range of the middle 50% of data. Robust to outliers.
Outlier detection (Tukey's rule): Values below Q₁ - 1.5·IQR or above Q₃ + 1.5·IQR are potential outliers.
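Tukey's rule as a small sketch; `tukey_outliers` is a hypothetical helper, and `statistics.quantiles` (default 'exclusive' method) supplies the quartiles:

```python
import statistics

def tukey_outliers(xs, k=1.5):
    """Return values outside [Q1 - k·IQR, Q3 + k·IQR]."""
    q1, _, q3 = statistics.quantiles(xs, n=4)
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return [x for x in xs if x < lo or x > hi]
```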
Mean Absolute Deviation (MAD)
MAD = (1/n) Σ|xᵢ - x̄|
Or the median absolute deviation: MAD = median(|xᵢ - median(x)|). The latter is very robust; note that both measures share the abbreviation MAD, so check which one a source means.
Quantiles and Percentiles
The p-th quantile (or 100p-th percentile) is the value below which a fraction p of the data falls.
- Q₁ (25th percentile): First quartile
- Q₂ (50th percentile): Median
- Q₃ (75th percentile): Third quartile
- Deciles: 10th, 20th, ..., 90th percentiles
- Percentiles: 1st through 99th
Multiple interpolation methods exist for computing quantiles when the index is not an integer.
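`statistics.quantiles` exposes two such methods: the default 'exclusive' interpolates at positions (n+1)p, while 'inclusive' uses (n-1)p + 1:

```python
import statistics

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

q_exc = statistics.quantiles(data, n=4)                      # [2.75, 5.5, 8.25]
q_inc = statistics.quantiles(data, n=4, method='inclusive')  # [3.25, 5.5, 7.75]
```

The median agrees under both methods; the outer quartiles differ, which is why quartile values from different tools rarely match exactly.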
Skewness
Measures the asymmetry of the distribution.
Skewness = (1/n) Σ((xᵢ - x̄)/s)³
- Skewness = 0: Symmetric (e.g., normal distribution)
- Skewness > 0: Right-skewed (long right tail) — mean > median
- Skewness < 0: Left-skewed (long left tail) — mean < median
Examples: Income distribution is right-skewed. Test scores on an easy exam are left-skewed.
Kurtosis
Measures the tailedness (extremity of outliers) of the distribution.
Kurtosis = (1/n) Σ((xᵢ - x̄)/s)⁴
Excess kurtosis = Kurtosis - 3 (so normal distribution has excess kurtosis 0).
- Mesokurtic (≈ 0): Normal-like tails
- Leptokurtic (> 0): Heavy tails, more outliers (e.g., t-distribution, Laplace)
- Platykurtic (< 0): Light tails, fewer outliers (e.g., uniform)
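Both moment formulas written out directly as hypothetical helpers, using the population standard deviation to match the 1/n normalization:

```python
import statistics

def skewness(xs):
    """Third standardized moment: (1/n) Σ((xᵢ - x̄)/s)³."""
    xbar, s, n = statistics.mean(xs), statistics.pstdev(xs), len(xs)
    return sum(((x - xbar) / s) ** 3 for x in xs) / n

def excess_kurtosis(xs):
    """Fourth standardized moment minus 3: (1/n) Σ((xᵢ - x̄)/s)⁴ - 3."""
    xbar, s, n = statistics.mean(xs), statistics.pstdev(xs), len(xs)
    return sum(((x - xbar) / s) ** 4 for x in xs) / n - 3
```

Sanity checks: symmetric data gives skewness ≈ 0, a single large value gives positive skew, and flat (uniform-like) data gives negative excess kurtosis.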
Data Visualization
Box Plot (Box-and-Whisker)
Displays the five-number summary: min, Q₁, median, Q₃, max.
 |-----[=====|=====]-----|     ○
min    Q₁    Q₂    Q₃  max  outlier
Whiskers extend to the most extreme values within 1.5×IQR of the box (or simply to min/max in the basic variant). Points beyond are plotted individually as outliers.
Advantages: Compact comparison of distributions, shows skewness, highlights outliers.
Histogram
Groups data into bins and shows frequency (or relative frequency) of each bin.
Choosing bin width: Sturges' rule (k = ⌈log₂n⌉ + 1), Scott's rule (h = 3.5s/n^(1/3)), Freedman-Diaconis (h = 2·IQR/n^(1/3)).
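The first and last rules as hypothetical helpers (Sturges gives a bin count, Freedman-Diaconis a bin width):

```python
import math
import statistics

def sturges_bins(xs):
    """Sturges' rule: k = ⌈log₂ n⌉ + 1 bins."""
    return math.ceil(math.log2(len(xs))) + 1

def freedman_diaconis_width(xs):
    """Freedman-Diaconis rule: h = 2·IQR / n^(1/3)."""
    q1, _, q3 = statistics.quantiles(xs, n=4)
    return 2 * (q3 - q1) / len(xs) ** (1 / 3)
```

Freedman-Diaconis is usually preferred for skewed or heavy-tailed data because IQR, unlike s, ignores outliers.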
Kernel Density Estimation
Smooth estimate of the PDF. Place a kernel (usually Gaussian) at each data point and sum:
f̂(x) = (1/nh) Σ K((x - xᵢ)/h)
where h is the bandwidth (smoothing parameter). Too small → noisy. Too large → over-smoothed.
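A literal translation of the formula with a Gaussian kernel K(u) = exp(-u²/2)/√(2π); a hypothetical sketch only (in practice one would reach for scipy.stats.gaussian_kde):

```python
import math

def gaussian_kde(data, h):
    """Return f̂(x) = (1/nh) Σ K((x - xᵢ)/h) with a standard normal kernel."""
    n = len(data)
    def f_hat(x):
        total = sum(math.exp(-((x - xi) / h) ** 2 / 2) for xi in data)
        return total / (n * h * math.sqrt(2 * math.pi))
    return f_hat
```

With a single data point at 0 and h = 1, the estimate at 0 is just the standard normal density 1/√(2π) ≈ 0.399.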
Other Visualizations
- Scatter plot: Two variables. Shows correlation, clusters, outliers.
- Violin plot: Box plot + mirrored KDE. Shows full distribution shape.
- QQ plot: Quantiles of data vs. quantiles of a theoretical distribution. Straight line indicates the data follows that distribution.
- Heatmap: Matrix of values with color coding. Correlation matrices, confusion matrices.
- ECDF: Empirical cumulative distribution function. Step function from 0 to 1.
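An ECDF is a few lines with `bisect` (hypothetical helper): F(x) is simply the fraction of data points ≤ x.

```python
import bisect

def ecdf(data):
    """Return the empirical CDF F(x) = #{xᵢ ≤ x} / n as a function."""
    xs = sorted(data)
    n = len(xs)
    def F(x):
        return bisect.bisect_right(xs, x) / n
    return F
```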
Correlation
Covariance
Cov(X, Y) = (1/(n-1)) Σ(xᵢ - x̄)(yᵢ - ȳ)
Positive: X and Y tend to increase together. Negative: inverse relationship. Zero: no linear relationship.
Pearson Correlation Coefficient
r = Cov(X, Y) / (s_x · s_y)
- r ∈ [-1, 1]
- r = 1: perfect positive linear relationship
- r = -1: perfect negative linear relationship
- r = 0: no linear relationship (but may have nonlinear relationship!)
Spearman Rank Correlation
Pearson correlation applied to the ranks of the data. Measures monotonic (not just linear) relationship. Robust to outliers.
Kendall's Tau
Based on concordant and discordant pairs. Another rank-based correlation measure. More robust than Spearman for small samples.
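Pearson and Spearman written out from their definitions as hypothetical helpers; the ranking step assumes no ties for brevity (real implementations average tied ranks):

```python
def pearson(x, y):
    """r = Σ(xᵢ - x̄)(yᵢ - ȳ) / (√Σ(xᵢ - x̄)² · √Σ(yᵢ - ȳ)²)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def ranks(xs):
    """Rank positions 1..n (assumes no ties)."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0] * len(xs)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def spearman(x, y):
    """Pearson correlation of the ranks: captures any monotonic relation."""
    return pearson(ranks(x), ranks(y))
```

For y = x², a monotonic but nonlinear relation on positive x, Spearman is exactly 1 while Pearson falls short of 1.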
Summary Statistics in Practice
Robust Statistics
Outliers disproportionately affect mean, variance, and Pearson correlation. Use:
- Median instead of mean
- IQR or MAD instead of standard deviation
- Spearman/Kendall instead of Pearson correlation
- Trimmed/winsorized versions of any statistic
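Winsorizing clamps extremes to the nearest kept value rather than dropping them, so the sample size is preserved. A hypothetical sketch:

```python
def winsorize(xs, k):
    """Clamp the lowest and highest k-fraction of values to the nearest kept value."""
    xs = sorted(xs)
    cut = int(len(xs) * k)
    lo, hi = xs[cut], xs[-cut - 1]
    return [min(max(x, lo), hi) for x in xs]
```

For example, winsorizing `[1, 2, 3, 4, 100]` at k = 0.2 clamps `1` up to `2` and `100` down to `4`.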
Anscombe's Quartet
Four datasets with identical summary statistics (mean, variance, correlation, regression line) but very different distributions. Always visualize your data!
Simpson's Paradox
A trend that appears in several groups may reverse when the groups are combined. Always consider confounding variables.
Applications in CS
- Exploratory data analysis: First step in any data science project. Understand distributions, find outliers, check assumptions.
- Feature engineering: Standardization (z-score: (x-μ)/σ), normalization (min-max scaling), log transforms for skewed features.
- Monitoring and alerting: Track mean, percentiles (p50, p95, p99) of latency, error rates. Anomaly detection via deviation from baselines.
- A/B testing: Compare summary statistics between control and treatment groups.
- Database query optimization: Histograms on column values help the query optimizer estimate selectivity and cardinality.
- Performance benchmarking: Report median and percentiles (not just mean). Use violin plots for latency distributions.
- ML preprocessing: Detect and handle outliers, check feature distributions, compute correlation matrices for feature selection.
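The two scalings from the feature-engineering bullet, sketched with the stdlib (hypothetical helpers):

```python
import statistics

def z_score(xs):
    """Standardize to mean 0 and (population) standard deviation 1."""
    mu, sigma = statistics.mean(xs), statistics.pstdev(xs)
    return [(x - mu) / sigma for x in xs]

def min_max(xs):
    """Rescale linearly to the [0, 1] range."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]
```

Z-scoring is the usual choice when a model assumes roughly Gaussian features; min-max scaling when a bounded range is required (e.g., pixel intensities).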