Object Storage as a Foundation

Most modern data architectures start with object storage: Amazon S3, Google Cloud Storage (GCS), or Azure Blob Storage. It is cheap, durable, and practically infinite. The data lake concept is simple: dump everything into object storage, organize it later, and use the right engine to query it when you need it.

Why Object Storage Won

Before cloud object storage, storing large amounts of data meant buying and managing HDFS clusters, SAN arrays, or NAS appliances. You had to provision capacity upfront, manage replication, and replace failed disks.

Object storage changed the economics:

Cost comparison (approximate, per GB per month):
  SSD (local):           $0.10 - $0.25
  HDD (local):           $0.02 - $0.05
  HDFS (managed):        $0.02 - $0.04
  S3 Standard:           $0.023
  S3 Infrequent Access:  $0.0125
  S3 Glacier:            $0.004
  GCS Standard:          $0.020
  Azure Blob Hot:        $0.018

At $0.02/GB/month, storing 100 TB costs about $2,000/month. That is a small fraction of the monthly cost of a single engineer. Storage is no longer a constraint. The question shifted from "can we afford to store this?" to "can we afford not to?"
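
The arithmetic behind that figure can be sketched in a few lines. The prices are the approximate ones from the table above; the function name and the decimal convention (1 TB = 1,000 GB) are assumptions for illustration, and a real bill would also include request and data transfer charges.

```python
# Approximate storage prices in USD per GB per month, copied from the table above.
PRICE_PER_GB_MONTH = {
    "S3 Standard": 0.023,
    "S3 Infrequent Access": 0.0125,
    "S3 Glacier": 0.004,
    "GCS Standard": 0.020,
    "Azure Blob Hot": 0.018,
}

GB_PER_TB = 1000  # assumption: decimal units (1 TB = 1,000 GB)


def monthly_storage_cost(terabytes: float, price_per_gb: float) -> float:
    """Return the storage-only monthly cost in USD (no request or egress fees)."""
    return terabytes * GB_PER_TB * price_per_gb


if __name__ == "__main__":
    for tier, price in PRICE_PER_GB_MONTH.items():
        cost = monthly_storage_cost(100, price)
        print(f"{tier:<22} 100 TB -> ${cost:,.0f}/month")
```

Swapping in your own volume and tier makes it easy to see how little the choice of storage class matters until the dataset reaches petabyte scale.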

Durability & Availability

S3 provides 99.999999999% (11 nines) durability. That means if you store 10 million objects, you can expect to lose one every 10,000 years. The data is automatically replicated across multiple data centers within a region.
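
The "one object every 10,000 years" figure follows directly from the durability number. The sketch below assumes durability is read as an annual, per-object survival probability, and it uses exact fractions to avoid floating-point noise; the function name is illustrative.

```python
from fractions import Fraction

# 99.999999999% durability (11 nines), expressed as an exact fraction.
DURABILITY = Fraction(99_999_999_999, 100_000_000_000)

# Assumption: durability is an annual, per-object survival probability,
# so the expected annual loss rate per object is 1 - durability = 1e-11.
ANNUAL_LOSS_RATE = 1 - DURABILITY


def years_per_lost_object(object_count: int) -> Fraction:
    """Average number of years between single-object losses."""
    expected_losses_per_year = object_count * ANNUAL_LOSS_RATE
    return 1 / expected_losses_per_year


if __name__ == "__main__":
    print(years_per_lost_object(10_000_000))
```

For the 10 million objects mentioned above, years_per_lost_object(10_000_000) evaluates to 10,000.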

This durability level is higher than most organizations could achieve with self-managed infrastructure.

Scalability

There is no capacity planning with object storage. No partitions to resize, no disks to add, no clusters to expand. You write data and the system handles the rest. An S3 bucket can hold an unlimited number of objects of unlimited total size.

The Data Lake Pattern

A data lake is not a product. It is an architecture pattern: store all your data in object storage in its original or near-original form, then process it with whatever engine fits the use case.

Data Lake Architecture:
  Sources (databases, APIs, events, logs, files)
      |
      v
  Ingestion Layer (Kafka, Fivetran, custom scripts)
      |
      v
  Object Storage (S3/GCS/Azure Blob)
      /          |          \
  Spark      Trino/Presto   DuckDB
  (batch)    (interactive)   (local)
      \          |          /
      v          v          v
  Dashboards, ML models, reports

The key insight: storage and compute are fully decoupled. The data sits in one place. Any compute engine that can read from object storage can query it.

Organizing a Data Lake

A common directory structure:

s3://company-data-lake/
  raw/
    stripe/
      charges/
        year=2025/month=01/day=15/charges_20250115.parquet
    salesforce/
      opportunities/
        year=2025/month=01/opportunities_202501.parquet
    events/
      year=2025/month=01/day=15/hour=14/events.parquet
  staging/
    stripe/
      stg_charges/
        year=2025/month=01/day=15/part-00000.parquet
  curated/
    revenue/
      daily_revenue/
        year=2025/month=01/daily_revenue_202501.parquet

The structure follows a progression: raw (untransformed), staging (cleaned), curated (business-ready). Hive-style partitioning (year=2025/month=01/day=15/) enables query engines to skip irrelevant directories.
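
A minimal sketch of how a pipeline might build and read these partition paths. The helper names are hypothetical, and the layout simply mirrors the layer/source/table/year/month/day convention shown above.

```python
from datetime import date


def partition_prefix(layer: str, source: str, table: str, day: date) -> str:
    """Build a Hive-style key prefix such as raw/stripe/charges/year=2025/..."""
    return (
        f"{layer}/{source}/{table}/"
        f"year={day.year:04d}/month={day.month:02d}/day={day.day:02d}/"
    )


def parse_partitions(key: str) -> dict[str, str]:
    """Extract the name=value partition segments from an object key."""
    partitions = {}
    for segment in key.split("/"):
        name, sep, value = segment.partition("=")
        if sep and name and value:
            partitions[name] = value
    return partitions


if __name__ == "__main__":
    prefix = partition_prefix("raw", "stripe", "charges", date(2025, 1, 15))
    print(prefix)
    print(parse_partitions(prefix + "charges_20250115.parquet"))
```

A query engine does something similar to parse_partitions when it decides which prefixes a filter such as year = 2025 AND month = 01 can skip.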

File Formats

The format you store data in matters enormously. The wrong format can make queries 100x slower and storage 10x more expensive.

CSV: The Worst Option for Analytics

CSV is human-readable and universally supported. That is where the advantages end.

Problems with CSV:
  - No schema embedded in the file (what type is column 3?)
  - No compression by default (massive file sizes)
  - Parsing is slow and error-prone (escaped commas, newlines in fields)
  - No column pruning (must read entire rows)
  - No predicate pushdown (must read all data to filter)
  - Type ambiguity ("123" is a string or an integer?)

CSV is fine for one-off data transfers between humans. It is not a data lake storage format.
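
Two of the problems listed above, type ambiguity and fragile parsing, are easy to reproduce with Python's standard csv module. The sample data and the naive_infer helper are made up for illustration.

```python
import csv
import io

# Hypothetical sample: a ZIP code with a leading zero and a quoted comma.
RAW = 'id,zip_code,note\n1,02134,"contains, a comma"\n2,10001,plain\n'

# A proper CSV parser handles the quoted comma, but every value is a string.
rows = list(csv.DictReader(io.StringIO(RAW)))


def naive_infer(value: str):
    """Guess a type the way an ad hoc loader might: int if it parses, else str."""
    try:
        return int(value)
    except ValueError:
        return value


if __name__ == "__main__":
    print(rows[0])                           # all values arrive as str
    print(naive_infer(rows[0]["zip_code"]))  # the leading zero is lost
    print(RAW.splitlines()[1].split(","))    # naive splitting breaks the quoted field
```

Nothing in the file says whether zip_code is text or a number, so each consumer has to decide for itself, and different consumers can decide differently.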

JSON: Better, But Not Great

JSON embeds field names alongside the values and handles nested data well. But it carries little type information and shares many of CSV's problems.

JSON trade-offs:
  + Self-describing (field names are in the data)
  + Handles nested and semi-structured data
  + Universal support across languages and tools
  - Row-oriented (no column pruning)
  - Verbose (field names repeated for every row)
  - Compression is decent but not as good as columnar formats
  - Parsing is slower than binary formats
  - No native date, timestamp, or decimal types (they travel as strings or floats)

JSON (or JSONL, one JSON object per line) is reasonable for event streams and API responses. For analytical workloads, convert it to a columnar format as soon as possible.
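
To see why the row-oriented layout is verbose, the sketch below pivots a batch of JSONL-style records into a column-oriented structure where each field name appears once. It is a toy illustration, not a real columnar format: it assumes every record has the same keys, and the sample events are generated for the example.

```python
import json

# Hypothetical event records, one dict per row.
events = [
    {
        "user_id": i,
        "event_type": "click" if i % 2 else "view",
        "ts": f"2025-01-15T14:00:{i:02d}Z",
    }
    for i in range(60)
]


def to_jsonl(records: list[dict]) -> str:
    """Row-oriented: one JSON object per line, field names repeated every row."""
    return "\n".join(json.dumps(record) for record in records)


def to_columns(records: list[dict]) -> dict[str, list]:
    """Column-oriented: one list of values per field, field names stored once."""
    columns: dict[str, list] = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


if __name__ == "__main__":
    print("row-oriented bytes:   ", len(to_jsonl(events)))
    print("column-oriented bytes:", len(json.dumps(to_columns(events))))
```

Grouping values by column also places similar values next to each other, which is what lets columnar formats compress well and read a single field without touching the rest.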

Parquet: The Standard

Apache Parquet is a columnar binary format designed for analytical workloads. It is the default format for modern data lakes.

Parquet advantages:
  + Columnar storage (read only the columns you need)
  + Excellent compression (2-10x smaller than JSON/CSV)
  + Schema embedded in the file (self-describing)
  + Predicate pushdown (skip row groups that do not match filters)
  + Rich type system (timestamps, decimals, nested structs, arrays)
  + Universally supported (Spark, Trino, DuckDB, Pandas, BigQuery, etc.)
  + Row group statistics (min/max per column per row group)

Internally, a Parquet file is organized into row groups, column chunks, and a footer:

Parquet file structure:
  Row Group 1 (e.g., 100,000 rows)
    Column Chunk: user_id    [compressed data] [statistics: min=1, max=50000]
    Column Chunk: event_type [compressed data] [statistics: min="click", max="view"]
    Column Chunk: timestamp  [compressed data] [statistics: min=2025-01-01, max=2025-01-15]
  Row Group 2 (next 100,000 rows)
    Column Chunk: user_id    [compressed data] [statistics: ...]
    ...
  Footer: schema, row group metadata, column statistics

When a query asks for SELECT user_id FROM events WHERE timestamp > '2025-06-01', the engine:

  1. Reads the footer to find row group statistics
  2. Skips row groups where timestamp max <= 2025-06-01
  3. Reads only the user_id and timestamp column chunks from the remaining row groups

This is why Parquet queries are fast even on object storage. Most of the data is never read.
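
The pruning decision can be simulated without a Parquet library. The sketch below models footer statistics as plain Python objects; RowGroupStats and row_groups_to_read are hypothetical names, and the statistics are invented for the example. A real reader also uses the column minimum and handles nulls and other filter types.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class RowGroupStats:
    """Min/max statistics for the timestamp column of one row group."""
    index: int
    ts_min: date
    ts_max: date
    num_rows: int


def row_groups_to_read(footer: list[RowGroupStats], cutoff: date) -> list[int]:
    """For WHERE timestamp > cutoff, keep row groups whose max exceeds the cutoff."""
    return [rg.index for rg in footer if rg.ts_max > cutoff]


# Invented footer metadata for a file with three row groups.
footer = [
    RowGroupStats(0, date(2025, 1, 1), date(2025, 1, 15), 100_000),
    RowGroupStats(1, date(2025, 5, 20), date(2025, 6, 10), 100_000),
    RowGroupStats(2, date(2025, 6, 11), date(2025, 6, 30), 100_000),
]

if __name__ == "__main__":
    keep = row_groups_to_read(footer, date(2025, 6, 1))
    skipped_rows = sum(rg.num_rows for rg in footer if rg.index not in keep)
    print("row groups to read:", keep)
    print("rows skipped without reading:", skipped_rows)
```

With a cutoff of 2025-06-01, row group 0 is skipped entirely and only groups 1 and 2 are read. Group 1 still contains some rows before the cutoff, so the engine filters those after decoding.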

ORC: The Alternative

Apache ORC (Optimized Row Columnar) is another columnar format. It was developed for Hive and is common in Hadoop ecosystems.

ORC vs Parquet:
  - Both are columnar, compressed, and schema-embedded
  - ORC has slightly better compression for some data types
  - ORC has better support for ACID transactions (in Hive)
  - Parquet has broader ecosystem support (more engines, more tools)
  - Parquet is the de facto standard outside the Hadoop ecosystem

If you are starting fresh, use Parquet. If you have an existing Hive/Hadoop ecosystem using ORC, there is no urgent reason to migrate.

Why Parquet Won

Parquet became the standard because it hit the right trade-offs for the cloud era:

  - Columnar for analytics. Read only the columns you need, which is critical when data sits in object storage and every byte read costs money.
  - Compression. Parquet files are typically 2-10x smaller than equivalent JSON or CSV. Smaller files mean less storage cost, less network transfer, and faster queries.
  - Schema embedded. No external schema registry needed to read the file. The schema is right there in the footer.
  - Universal support. Every major query engine, data processing framework, and cloud warehouse can read Parquet natively.
  - Statistics for pruning. Row group and page-level statistics let engines skip data without reading it.

Practical Considerations

File Sizing

Just like warehouse partitions, data lake files have a sweet spot for size.

Too small (< 10 MB):
  - Object storage charges per request, not just per byte
  - 1 million 1 KB files cost far more to list and read than 1 GB in one file
  - Query engines spend more time opening files than reading data

Too large (> 1 GB):
  - Cannot parallelize reads within a single file effectively
  - If a job fails, you must re-read the entire large file
  - Harder to partition and prune

Sweet spot: 128 MB - 512 MB per file

If your pipeline produces many small files, add a compaction step that merges them into larger files periodically.
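
A compaction job first has to decide which small files to merge together. The sketch below shows one simple way to plan that, assuming a greedy grouping in key order and a 256 MB target (an arbitrary point inside the 128 MB - 512 MB range above). The function name and file names are hypothetical, and the actual read-and-rewrite step is left to whichever engine runs the job.

```python
def plan_compaction(
    file_sizes_mb: dict[str, float], target_mb: float = 256.0
) -> list[list[str]]:
    """Greedily group files, in name order, into batches of at most target_mb.

    A single file larger than target_mb ends up in a batch of its own.
    """
    batches: list[list[str]] = []
    current: list[str] = []
    current_size = 0.0
    for name in sorted(file_sizes_mb):
        size = file_sizes_mb[name]
        # Close the current batch if adding this file would exceed the target.
        if current and current_size + size > target_mb:
            batches.append(current)
            current, current_size = [], 0.0
        current.append(name)
        current_size += size
    if current:
        batches.append(current)
    return batches


if __name__ == "__main__":
    # Hypothetical input: 100 micro-batch files of 8 MB each.
    small_files = {f"part-{i:05d}.parquet": 8.0 for i in range(100)}
    for number, batch in enumerate(plan_compaction(small_files)):
        print(f"output file {number}: merges {len(batch)} inputs")
```

Each batch becomes one output file. Grouping in key order keeps files from the same partition together when the key embeds the partition path.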

Compression Codecs

Parquet supports multiple compression algorithms. The common choices:

Snappy:   Fast compression/decompression, moderate ratio. Default in many tools.
GZIP:     Higher compression ratio, slower. Good for cold storage.
ZSTD:     Best balance of speed and ratio. Increasingly the recommended default.
LZO:      Fast decompression, lower ratio. Common in legacy Hadoop setups.

For most workloads, ZSTD or Snappy is the right choice. Use GZIP only when storage cost dominates and query speed is not critical.
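
The speed-versus-ratio trade-off can be measured on your own data before you commit to a codec. Snappy and ZSTD need third-party packages, so this sketch uses Python's built-in zlib (the DEFLATE algorithm behind GZIP) at a fast and a thorough level as a stand-in. The payload is synthetic, and results on real data will differ.

```python
import time
import zlib

# Synthetic, repetitive event data standing in for one column chunk.
payload = b"".join(
    f"{i},{'click' if i % 3 else 'view'},2025-01-15T14:{i % 60:02d}:00Z\n".encode()
    for i in range(50_000)
)


def measure(level: int) -> tuple[int, float]:
    """Compress payload at the given zlib level; return (size in bytes, seconds)."""
    start = time.perf_counter()
    compressed = zlib.compress(payload, level)
    elapsed = time.perf_counter() - start
    # Round-trip check: decompression must restore the original bytes.
    assert zlib.decompress(compressed) == payload
    return len(compressed), elapsed


if __name__ == "__main__":
    print(f"original: {len(payload):,} bytes")
    for level in (1, 9):
        size, seconds = measure(level)
        print(f"level {level}: {size:,} bytes in {seconds * 1000:.1f} ms")
```

The same kind of measurement, run with the codecs your engine supports, is a more reliable guide than general rules of thumb.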

Common Pitfalls

Storing analytics data as CSV or JSON long-term. Convert to Parquet at ingestion or as soon as possible. The performance difference for analytical queries is not marginal; it is 10-100x.

Millions of tiny files. A common result of streaming ingestion that writes one file per event or per micro-batch. Implement compaction to merge small files into larger ones.

No directory organization. Dumping everything into a flat bucket with no partitioning structure makes it impossible for query engines to prune data. Use Hive-style partitioning from day one.

Ignoring file format when estimating costs. A 1 TB CSV dataset might compress to 100 GB in Parquet. That is 10x savings on storage and 10x savings on query costs in engines that charge per byte scanned.

Not including schema in files. If you use CSV without a header or a schema file, every consumer must guess the column types. Parquet embeds the schema. Use it.

Treating object storage like a filesystem. Object storage has no directories; it has key prefixes that look like directories. Listing operations are expensive (especially with millions of objects). Design your key structure for efficient listing patterns.
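
The "directories are just prefixes" point can be made concrete with a small in-memory simulation. The function below mimics the behavior of a prefix-plus-delimiter listing (S3's ListObjectsV2, for example, accepts Prefix and Delimiter parameters and returns CommonPrefixes). The function name is hypothetical, and a real service pages through an index rather than scanning a Python list.

```python
def list_keys(
    keys: list[str], prefix: str = "", delimiter: str = "/"
) -> tuple[list[str], list[str]]:
    """Simulate a prefix + delimiter listing over a flat key namespace.

    Returns (objects directly under the prefix, common prefixes one level down).
    """
    objects: list[str] = []
    common_prefixes: set[str] = set()
    for key in sorted(keys):
        if not key.startswith(prefix):
            continue
        rest = key[len(prefix):]
        head, sep, _ = rest.partition(delimiter)
        if sep:
            # The key continues past a delimiter: report only the "folder".
            common_prefixes.add(prefix + head + delimiter)
        else:
            objects.append(key)
    return objects, sorted(common_prefixes)


# The bucket is a flat set of keys; the slashes are just characters.
KEYS = [
    "raw/stripe/charges/year=2025/month=01/day=15/charges_20250115.parquet",
    "raw/salesforce/opportunities/year=2025/month=01/opportunities_202501.parquet",
    "staging/stripe/stg_charges/year=2025/month=01/day=15/part-00000.parquet",
]

if __name__ == "__main__":
    print(list_keys(KEYS))
    print(list_keys(KEYS, prefix="raw/"))
```

Because every listing is a scan over matching keys, a key layout that lets consumers ask for a narrow prefix keeps listings small as the bucket grows.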

Key Takeaways

  - Object storage (S3, GCS, Azure Blob) is the foundation of modern data architectures: cheap, durable, and infinitely scalable
  - A data lake stores data in object storage and decouples storage from compute, letting any engine query the same data
  - Parquet is the standard file format for analytical data: columnar, compressed, schema-embedded, and universally supported
  - File sizing matters: aim for 128 MB to 512 MB per file to balance read efficiency and parallelism
  - Organize data lakes with Hive-style partitioning and a clear raw/staging/curated structure
  - Convert from CSV and JSON to Parquet as early in the pipeline as possible; the performance and cost difference is enormous