Descriptive Statistics

Descriptive statistics summarize and describe the main features of a dataset. Before any modeling, understanding the data through descriptive statistics is the essential first step.

Measures of Central Tendency

Mean (Average)

Arithmetic mean: x̄ = (1/n) Σ xᵢ

Weighted mean: x̄ = Σ wᵢxᵢ / Σ wᵢ

Geometric mean: (Π xᵢ)^(1/n) — used for growth rates, ratios.

Harmonic mean: n / Σ(1/xᵢ) — used for averaging rates (e.g., the F1 score is the harmonic mean of precision and recall).

Trimmed mean: Remove the top and bottom k% before averaging. Robust to outliers.

Properties: The mean minimizes the sum of squared deviations: argmin_μ Σ(xᵢ - μ)².
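The means above are easy to compute directly. A minimal sketch with a hypothetical sample (chosen so the outlier's effect is visible; the 20% trim level is arbitrary):

```python
import math

data = [2.0, 4.0, 4.0, 8.0, 100.0]  # hypothetical sample with one outlier

n = len(data)
arithmetic = sum(data) / n                  # pulled toward 100.0
geometric = math.prod(data) ** (1 / n)      # assumes all values > 0
harmonic = n / sum(1 / x for x in data)     # assumes all values > 0

def trimmed_mean(xs, k=0.2):
    """Drop the top and bottom k fraction before averaging."""
    xs = sorted(xs)
    cut = int(len(xs) * k)
    kept = xs[cut:len(xs) - cut] if cut else xs
    return sum(kept) / len(kept)

robust = trimmed_mean(data, k=0.2)  # drops 2.0 and 100.0 before averaging
```

Note that harmonic ≤ geometric ≤ arithmetic always holds for positive data (the AM-GM-HM inequality).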

Median

The middle value when data is sorted. For n values:

  • Odd n: the (n+1)/2-th value
  • Even n: average of the n/2-th and (n/2+1)-th values

Properties:

  • Robust to outliers (unlike mean)
  • Minimizes the sum of absolute deviations: argmin_μ Σ|xᵢ - μ|
  • For skewed distributions, median is often more representative than mean
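The median's robustness is easy to see with the stdlib `statistics` module (hypothetical data with one planted outlier):

```python
import statistics

data = [1, 2, 3, 4, 1000]        # one extreme outlier
mean = statistics.mean(data)     # 202 -- pulled far toward the outlier
med = statistics.median(data)    # 3   -- unaffected

even = [1, 2, 3, 4]
mid = statistics.median(even)    # 2.5 -- average of the two middle values
```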

Mode

The most frequently occurring value. Can have multiple modes (bimodal, multimodal) or no mode.

Best suited for categorical data.

When to Use What

| Measure | Best for | Sensitive to outliers |
|---|---|---|
| Mean | Symmetric distributions, continuous data | Yes |
| Median | Skewed distributions, ordinal data | No |
| Mode | Categorical data, identifying peaks | No |

Measures of Dispersion

Range

Range = max - min

Simple but very sensitive to outliers.

Variance

Population variance: σ² = (1/N) Σ(xᵢ - μ)²

Sample variance: s² = (1/(n-1)) Σ(xᵢ - x̄)²

The n-1 denominator (Bessel's correction) makes the sample variance an unbiased estimator of population variance.

Computational formula: s² = (Σxᵢ² - n·x̄²) / (n-1)

Standard Deviation

σ = √(σ²) or s = √(s²)

Same units as the data. Interpretable as "typical distance from the mean."
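A quick sketch (hypothetical data) verifying that the definitional formula with Bessel's correction matches the stdlib, which also uses the n-1 denominator for `variance`:

```python
import math
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]
x_bar = sum(data) / len(data)  # 5.0

# Sample variance with Bessel's correction (n-1 denominator)
s2 = sum((x - x_bar) ** 2 for x in data) / (len(data) - 1)
assert math.isclose(s2, statistics.variance(data))

# Population variance (n denominator)
sigma2 = statistics.pvariance(data)  # 4.0 for this sample

s = math.sqrt(s2)  # standard deviation, same units as the data
```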

Coefficient of Variation (CV)

CV = s / x̄ × 100%

Relative (unitless) measure of variability that allows comparison between datasets with different units or scales. Only meaningful for ratio-scale data with a nonzero mean.

Interquartile Range (IQR)

IQR = Q₃ - Q₁

The range of the middle 50% of data. Robust to outliers.

Outlier detection (Tukey's rule): Values below Q₁ - 1.5·IQR or above Q₃ + 1.5·IQR are potential outliers.

Mean Absolute Deviation (MAD)

MAD = (1/n) Σ|xᵢ - x̄|

The same abbreviation is often used for the median absolute deviation: MAD = median(|xᵢ - median(x)|). The latter is even more robust, since both the center and the deviations are medians.
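The robust dispersion measures and Tukey's rule fit in a few lines of stdlib Python. A sketch with a hypothetical sample containing one planted outlier:

```python
import statistics

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 50]        # 50 is a planted outlier
q1, q2, q3 = statistics.quantiles(data, n=4)  # default "exclusive" method
iqr = q3 - q1

# Tukey's rule: flag points beyond 1.5*IQR from the quartiles
lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [x for x in data if x < lo or x > hi]  # [50]

# Median absolute deviation: very robust scale estimate
med = statistics.median(data)
mad = statistics.median(abs(x - med) for x in data)
```

Note that the outlier barely moves the IQR or the median absolute deviation, whereas it would roughly quintuple the standard deviation.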

Quantiles and Percentiles

The p-th quantile (or 100p-th percentile) is the value below which a fraction p of the data falls.

  • Q₁ (25th percentile): First quartile
  • Q₂ (50th percentile): Median
  • Q₃ (75th percentile): Third quartile
  • Deciles: 10th, 20th, ..., 90th percentiles
  • Percentiles: 1st through 99th

Multiple interpolation methods exist for computing quantiles when the index is not an integer.
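The stdlib exposes two of these interpolation conventions directly, and they can disagree on the same data:

```python
import statistics

data = [10, 20, 30, 40, 50]

# Two common conventions give different first and third quartiles:
excl = statistics.quantiles(data, n=4, method="exclusive")  # [15.0, 30.0, 45.0]
incl = statistics.quantiles(data, n=4, method="inclusive")  # [20.0, 30.0, 40.0]
```

Libraries such as NumPy and R offer even more methods, so always check which convention a tool uses before comparing quantiles across tools.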

Skewness

Measures the asymmetry of the distribution.

Skewness = (1/n) Σ((xᵢ - x̄)/s)³
  • Skewness = 0: Symmetric (e.g., normal distribution)
  • Skewness > 0: Right-skewed (long right tail) — mean > median
  • Skewness < 0: Left-skewed (long left tail) — mean < median

Examples: Income distribution is right-skewed. Test scores on an easy exam are left-skewed.

Kurtosis

Measures the tailedness (extremity of outliers) of the distribution.

Kurtosis = (1/n) Σ((xᵢ - x̄)/s)⁴

Excess kurtosis = Kurtosis - 3 (so normal distribution has excess kurtosis 0).

  • Mesokurtic (≈ 0): Normal-like tails
  • Leptokurtic (> 0): Heavy tails, more outliers (e.g., t-distribution, Laplace)
  • Platykurtic (< 0): Light tails, fewer outliers (e.g., uniform)
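Both shape statistics follow directly from the moment formulas above. A sketch using the biased (1/n) estimators as written, with small hypothetical samples:

```python
import math

def skew_kurtosis(data):
    """Biased moment estimators matching the formulas above."""
    n = len(data)
    x_bar = sum(data) / n
    s = math.sqrt(sum((x - x_bar) ** 2 for x in data) / n)  # population sd
    z = [(x - x_bar) / s for x in data]
    skew = sum(t ** 3 for t in z) / n
    excess_kurtosis = sum(t ** 4 for t in z) / n - 3
    return skew, excess_kurtosis

right_skew, _ = skew_kurtosis([1, 1, 2, 2, 3, 10])  # long right tail -> skew > 0
_, flat_kurt = skew_kurtosis([1, 2, 3, 4, 5])       # uniform-like -> excess kurtosis < 0
```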

Data Visualization

Box Plot (Box-and-Whisker)

Displays the five-number summary: min, Q₁, median, Q₃, max.

  |----[=====|=====]----|       ○
 min   Q₁   Q₂    Q₃   max   outlier

Whiskers extend to the most extreme data points within 1.5×IQR of the quartiles (or simply to min/max). Points beyond the whiskers are plotted individually as outliers.

Advantages: Compact comparison of distributions, shows skewness, highlights outliers.

Histogram

Groups data into bins and shows frequency (or relative frequency) of each bin.

Choosing bin width: Sturges' rule (k = ⌈log₂n⌉ + 1), Scott's rule (h = 3.5s/n^(1/3)), Freedman-Diaconis (h = 2·IQR/n^(1/3)).
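Two of these rules, sketched directly from their formulas (note Sturges gives a bin *count* while Freedman-Diaconis gives a bin *width*):

```python
import math
import statistics

def sturges_bins(data):
    """Sturges' rule: k = ceil(log2 n) + 1 bins."""
    return math.ceil(math.log2(len(data))) + 1

def fd_bin_width(data):
    """Freedman-Diaconis: h = 2*IQR / n^(1/3)."""
    q1, _, q3 = statistics.quantiles(data, n=4)
    return 2 * (q3 - q1) / len(data) ** (1 / 3)
```

Freedman-Diaconis is usually preferred for skewed or heavy-tailed data because it uses the robust IQR rather than the standard deviation.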

Kernel Density Estimation

Smooth estimate of the PDF. Place a kernel (usually Gaussian) at each data point and sum:

f̂(x) = (1/nh) Σ K((x - xᵢ)/h)

where h is the bandwidth (smoothing parameter). Too small → noisy. Too large → over-smoothed.
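The estimator f̂ is just a sum of bumps, one per data point. A minimal Gaussian-kernel sketch (the data and bandwidth are hypothetical; real libraries also offer automatic bandwidth selection):

```python
import math

def gaussian_kde(data, h):
    """Return f_hat(x): average of Gaussian kernels of bandwidth h at each point."""
    n = len(data)
    def f_hat(x):
        return sum(
            math.exp(-0.5 * ((x - xi) / h) ** 2) / math.sqrt(2 * math.pi)
            for xi in data
        ) / (n * h)
    return f_hat

f = gaussian_kde([0.0, 1.0, 1.2, 3.0], h=0.5)
# Density is higher near the cluster around 1.0-1.2 than near the lone point at 3.0
```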

Other Visualizations

  • Scatter plot: Two variables. Shows correlation, clusters, outliers.
  • Violin plot: Box plot + mirrored KDE. Shows full distribution shape.
  • QQ plot: Quantiles of data vs. quantiles of a theoretical distribution. Straight line indicates the data follows that distribution.
  • Heatmap: Matrix of values with color coding. Correlation matrices, confusion matrices.
  • ECDF: Empirical cumulative distribution function. Step function from 0 to 1.

Correlation

Covariance

Cov(X, Y) = (1/(n-1)) Σ(xᵢ - x̄)(yᵢ - ȳ)

Positive: X and Y tend to increase together. Negative: inverse relationship. Zero: no linear relationship.

Pearson Correlation Coefficient

r = Cov(X, Y) / (sₓ · s_y)
  • r ∈ [-1, 1]
  • r = 1: perfect positive linear relationship
  • r = -1: perfect negative linear relationship
  • r = 0: no linear relationship (but may have nonlinear relationship!)

Spearman Rank Correlation

Pearson correlation applied to the ranks of the data. Measures monotonic (not just linear) relationship. Robust to outliers.

Kendall's Tau

Based on concordant and discordant pairs. Another rank-based correlation measure. More robust than Spearman for small samples.
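Pearson and Spearman can be sketched from their definitions; Spearman is literally Pearson applied to ranks (tie handling is omitted in this sketch, so it is not production-ready):

```python
def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def ranks(v):
    # Rank 1..n; ties are NOT handled in this sketch
    order = sorted(range(len(v)), key=lambda i: v[i])
    r = [0] * len(v)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def spearman(x, y):
    return pearson(ranks(x), ranks(y))

x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]  # y = x^2: perfectly monotonic but not linear
# spearman(x, y) is exactly 1; pearson(x, y) is strong but below 1
```

This illustrates the key distinction: Spearman captures any monotonic relationship, while Pearson only captures linear ones.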

Summary Statistics in Practice

Robust Statistics

Outliers disproportionately affect mean, variance, and Pearson correlation. Use:

  • Median instead of mean
  • IQR or MAD instead of standard deviation
  • Spearman/Kendall instead of Pearson correlation
  • Trimmed/winsorized versions of any statistic

Anscombe's Quartet

Four datasets with identical summary statistics (mean, variance, correlation, regression line) but very different distributions. Always visualize your data!

Simpson's Paradox

A trend that appears in several groups may reverse when the groups are combined. Always consider confounding variables.

Applications in CS

  • Exploratory data analysis: First step in any data science project. Understand distributions, find outliers, check assumptions.
  • Feature engineering: Standardization (z-score: (x-μ)/σ), normalization (min-max scaling), log transforms for skewed features.
  • Monitoring and alerting: Track mean, percentiles (p50, p95, p99) of latency, error rates. Anomaly detection via deviation from baselines.
  • A/B testing: Compare summary statistics between control and treatment groups.
  • Database query optimization: Histograms on column values help the query optimizer estimate selectivity and cardinality.
  • Performance benchmarking: Report median and percentiles (not just mean). Use violin plots for latency distributions.
  • ML preprocessing: Detect and handle outliers, check feature distributions, compute correlation matrices for feature selection.
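The two scaling transforms mentioned under feature engineering can be sketched in a few lines (hypothetical data; real pipelines would fit μ, σ, min, and max on training data only):

```python
data = [2.0, 4.0, 6.0, 8.0]

# Standardization (z-score): subtract the mean, divide by the std dev
mu = sum(data) / len(data)
sigma = (sum((x - mu) ** 2 for x in data) / len(data)) ** 0.5
z = [(x - mu) / sigma for x in data]  # mean 0, standard deviation 1

# Min-max normalization: rescale to [0, 1]
mn, mx = min(data), max(data)
minmax = [(x - mn) / (mx - mn) for x in data]
```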