Object Storage as a Foundation
Nearly every modern data architecture starts with object storage: S3, GCS, or Azure Blob Storage. It is cheap, durable, and practically infinite. The data lake concept is simple: dump everything into object storage, organize it later, and use the right engine to query it when you need it.
Why Object Storage Won
Before cloud object storage, storing large amounts of data meant buying and managing HDFS clusters, SAN arrays, or NAS appliances. You had to provision capacity upfront, manage replication, and replace failed disks.
Object storage changed the economics:
Cost comparison (approximate, per GB per month):
SSD (local): $0.10 - $0.25
HDD (local): $0.02 - $0.05
HDFS (managed): $0.02 - $0.04
S3 Standard: $0.023
S3 Infrequent Access: $0.0125
S3 Glacier: $0.004
GCS Standard: $0.020
Azure Blob Hot: $0.018
At roughly $0.02 per GB per month, 100 TB of data costs about $2,000/month. That is on the order of a few days of one engineer's salary. Storage is no longer a constraint. The question shifted from "can we afford to store this?" to "can we afford not to?"
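The arithmetic behind this comparison is easy to check. The sketch below uses the approximate per-GB prices from the list above; the function name `monthly_storage_cost` and the 100 TB dataset size are assumptions made for illustration, and 1 TB is treated as 1,000 GB.

```python
# Approximate prices in USD per GB per month, copied from the comparison above.
PRICE_PER_GB_MONTH = {
    "S3 Standard": 0.023,
    "S3 Infrequent Access": 0.0125,
    "S3 Glacier": 0.004,
    "GCS Standard": 0.020,
    "Azure Blob Hot": 0.018,
}


def monthly_storage_cost(size_tb: float, price_per_gb: float) -> float:
    """Return the monthly storage bill in USD, treating 1 TB as 1,000 GB."""
    return size_tb * 1000 * price_per_gb


if __name__ == "__main__":
    # Hypothetical dataset size: 100 TB in each tier.
    for tier, price in PRICE_PER_GB_MONTH.items():
        cost = monthly_storage_cost(100, price)
        print(f"{tier:<22} 100 TB -> ${cost:,.0f}/month")
```

These figures cover storage only. Request, retrieval, and data transfer charges are billed separately and are not modeled here.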
Durability & Availability
S3 is designed for 99.999999999% (11 nines) durability of objects over a given year. That means if you store 10 million objects, you can expect to lose one, on average, every 10,000 years. The data is automatically replicated across multiple data centers within a region.
This durability level is higher than most organizations could achieve with self-managed infrastructure.
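The "one object every 10,000 years" figure follows directly from the durability number. The sketch below assumes that 11 nines means a 1-in-10^11 chance of losing any given object in a year and that losses are independent; the function names are illustrative.

```python
from fractions import Fraction

# Assumption: 99.999999999% annual durability = 1-in-10^11 loss chance per object per year.
ANNUAL_LOSS_PROBABILITY = Fraction(1, 10**11)


def expected_losses_per_year(object_count: int) -> Fraction:
    """Expected number of objects lost per year, assuming independent losses."""
    return object_count * ANNUAL_LOSS_PROBABILITY


def years_per_lost_object(object_count: int) -> Fraction:
    """Average number of years between single-object losses."""
    return 1 / expected_losses_per_year(object_count)


if __name__ == "__main__":
    print(years_per_lost_object(10_000_000))  # 10000
```

Exact fractions are used instead of floats because `1 - 0.99999999999` is not represented exactly in floating point.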
Scalability
There is no capacity planning with object storage. No partitions to resize, no disks to add, no clusters to expand. You write data and the system handles the rest. An S3 bucket can hold an unlimited number of objects of unlimited total size.
The Data Lake Pattern
A data lake is not a product. It is an architecture pattern: store all your data in object storage in its original or near-original form, then process it with whatever engine fits the use case.
Data Lake Architecture:
  Sources (databases, APIs, events, logs, files)
                      |
                      v
  Ingestion Layer (Kafka, Fivetran, custom scripts)
                      |
                      v
  Object Storage (S3/GCS/Azure Blob)
            /         |         \
      Spark     Trino/Presto     DuckDB
     (batch)    (interactive)    (local)
            \         |         /
             v        v        v
  Dashboards, ML models, reports
The key insight: storage and compute are fully decoupled. The data sits in one place. Any compute engine that can read from object storage can query it.
Organizing a Data Lake
A common directory structure:
s3://company-data-lake/
  raw/
    stripe/
      charges/
        year=2025/month=01/day=15/charges_20250115.parquet
    salesforce/
      opportunities/
        year=2025/month=01/opportunities_202501.parquet
    events/
      year=2025/month=01/day=15/hour=14/events.parquet
  staging/
    stripe/
      stg_charges/
        year=2025/month=01/day=15/part-00000.parquet
  curated/
    revenue/
      daily_revenue/
        year=2025/month=01/daily_revenue_202501.parquet
The structure follows a progression: raw (untransformed), staging (cleaned), curated (business-ready). Hive-style partitioning (year=2025/month=01/day=15/) enables query engines to skip irrelevant directories.
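To make the partition layout concrete, here is a minimal sketch of building and filtering Hive-style keys. The helpers `partition_key`, `parse_partitions`, and `prune` are hypothetical names, not part of any library, and the filtering runs over an in-memory list rather than a real bucket.

```python
from datetime import date


def partition_key(layer: str, source: str, table: str, day: date, filename: str) -> str:
    """Build an object key with Hive-style year=/month=/day= partition segments."""
    return (
        f"{layer}/{source}/{table}/"
        f"year={day.year:04d}/month={day.month:02d}/day={day.day:02d}/{filename}"
    )


def parse_partitions(key: str) -> dict[str, str]:
    """Extract name=value partition segments from an object key."""
    return dict(seg.split("=", 1) for seg in key.split("/") if "=" in seg)


def prune(keys: list[str], **wanted: str) -> list[str]:
    """Keep only keys whose partition values match every requested filter."""
    return [
        k for k in keys
        if all(parse_partitions(k).get(name) == value for name, value in wanted.items())
    ]


if __name__ == "__main__":
    keys = [
        partition_key("raw", "stripe", "charges", date(2025, 1, d), f"charges_202501{d:02d}.parquet")
        for d in (14, 15, 16)
    ]
    print(prune(keys, year="2025", month="01", day="15"))
```

A real query engine gets the same effect more cheaply: it lists only the prefixes that match the filter instead of listing everything and discarding keys afterwards.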
File Formats
The format you store data in matters enormously. The wrong format can make queries 100x slower and storage 10x more expensive.
CSV: The Worst Option for Analytics
CSV is human-readable and universally supported. That is where the advantages end.
Problems with CSV:
- No schema embedded in the file (what type is column 3?)
- No compression by default (massive file sizes)
- Parsing is slow and error-prone (escaped commas, newlines in fields)
- No column pruning (must read entire rows)
- No predicate pushdown (must read all data to filter)
- Type ambiguity (is "123" a string or an integer?)
CSV is fine for one-off data transfers between humans. It is not a data lake storage format.
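A few of these problems show up even in a four-line file. The sketch below uses Python's standard `csv` module on a made-up sample; the data and the `read_rows` helper are illustrative assumptions.

```python
import csv
import io

# A tiny CSV with a quoted comma, a quoted newline, and a leading-zero ZIP code.
RAW = 'id,zip,note\n1,02134,"hello, world"\n2,90210,"line one\nline two"\n'


def read_rows(text: str) -> list[dict[str, str]]:
    """Parse CSV text with the standard library; every value comes back as str."""
    return list(csv.DictReader(io.StringIO(text)))


if __name__ == "__main__":
    rows = read_rows(RAW)
    # Splitting on newlines miscounts records because one field contains a newline.
    print(len(RAW.splitlines()), "physical lines,", len(rows), "records")
    print(type(rows[0]["id"]).__name__)  # str: the file carries no type information
    print(int(rows[0]["zip"]))           # 2134: guessing "integer" drops the leading zero
```

Every consumer of the file has to make the same guesses about types and quoting, and nothing in the file guarantees they will guess the same way.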
JSON: Better, But Not Great
JSON carries field names with the data (keys are embedded in every record) and handles nested data well. But it shares many of CSV's problems.
JSON trade-offs:
+ Self-describing (field names are in the data)
+ Handles nested and semi-structured data
+ Universal support across languages and tools
- Row-oriented (no column pruning)
- Verbose (field names repeated for every row)
- Compression is decent but not as good as columnar formats
- Parsing is slower than binary formats
- No native date, timestamp, or decimal types (dates travel as strings, and numbers are often parsed as floating point)
JSON (or JSONL, one JSON object per line) is reasonable for event streams and API responses. For analytical workloads, convert it to a columnar format as soon as possible.
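The cost of repeating field names, and the benefit of a column-oriented layout, can be seen with plain JSON. This is a simplified sketch, not how Parquet encodes data: `to_jsonl` and `to_columns` are hypothetical helpers, and all rows are assumed to share the same fields.

```python
import json


def to_jsonl(rows: list[dict]) -> str:
    """Row-oriented: one JSON object per line, field names repeated in every row."""
    return "\n".join(json.dumps(r) for r in rows)


def to_columns(rows: list[dict]) -> dict[str, list]:
    """Column-oriented: each field name appears once, followed by all of its values."""
    return {name: [r[name] for r in rows] for name in rows[0]}


if __name__ == "__main__":
    rows = [
        {"user_id": i, "event_type": "click", "ts": "2025-01-15T14:00:00Z"}
        for i in range(1000)
    ]
    columns = to_columns(rows)
    print(len(to_jsonl(rows)), "bytes as JSONL")
    print(len(json.dumps(columns)), "bytes with each field name stored once")
    print(columns["user_id"][:3])  # reading one column touches no other field
```

Columnar formats go further than this by adding typed binary encodings and per-column compression, which is where most of their size advantage comes from.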
Parquet: The Standard
Apache Parquet is a columnar binary format designed for analytical workloads. It is the default format for modern data lakes.
Parquet advantages:
+ Columnar storage (read only the columns you need)
+ Excellent compression (2-10x smaller than JSON/CSV)
+ Schema embedded in the file (self-describing)
+ Predicate pushdown (skip row groups that do not match filters)
+ Rich type system (timestamps, decimals, nested structs, arrays)
+ Universally supported (Spark, Trino, DuckDB, pandas, BigQuery, etc.)
+ Row group statistics (min/max per column per row group)
Internally, a Parquet file is organized into row groups, column chunks, and a footer:
Parquet file structure:
  Row Group 1 (e.g., 100,000 rows)
    Column Chunk: user_id     [compressed data] [statistics: min=1, max=50000]
    Column Chunk: event_type  [compressed data] [statistics: min="click", max="view"]
    Column Chunk: timestamp   [compressed data] [statistics: min=2025-01-01, max=2025-01-15]
  Row Group 2 (next 100,000 rows)
    Column Chunk: user_id     [compressed data] [statistics: ...]
    ...
  Footer: schema, row group metadata, column statistics
When a query asks for SELECT user_id FROM events WHERE timestamp > '2025-06-01', the engine:
- Reads the footer to find row group statistics
- Skips row groups where timestamp max <= 2025-06-01
- Reads only the user_id and timestamp column chunks from the relevant row groups
This is why Parquet queries are fast even on object storage. Most of the data is never read.
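The pruning decision itself is a comparison against footer statistics. The sketch below models it without a Parquet library: `RowGroup` is a hypothetical stand-in for the footer metadata of one row group, and the dates are made up for illustration.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class RowGroup:
    """Hypothetical stand-in for footer metadata: min/max of the timestamp column."""
    index: int
    ts_min: date
    ts_max: date


def groups_to_read(groups: list[RowGroup], after: date) -> list[int]:
    """Return indexes of row groups that may contain rows with timestamp > after.

    A group is skipped only when its max proves that no row can match.
    """
    return [g.index for g in groups if g.ts_max > after]


if __name__ == "__main__":
    footer = [
        RowGroup(0, date(2025, 1, 1), date(2025, 1, 15)),
        RowGroup(1, date(2025, 5, 20), date(2025, 6, 10)),
        RowGroup(2, date(2025, 6, 11), date(2025, 6, 30)),
    ]
    print(groups_to_read(footer, after=date(2025, 6, 1)))  # [1, 2]
```

Statistics can only rule row groups out. Group 1 straddles the cutoff, so the engine still has to read it and filter its rows individually. Pruning works best when data is sorted or clustered on the filtered column.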
ORC: The Alternative
Apache ORC (Optimized Row Columnar) is another columnar format. It was developed for Hive and is common in Hadoop ecosystems.
ORC vs Parquet:
- Both are columnar, compressed, and schema-embedded
- ORC has slightly better compression for some data types
- ORC has better support for ACID transactions (in Hive)
- Parquet has broader ecosystem support (more engines, more tools)
- Parquet is the de facto standard outside the Hadoop ecosystem
If you are starting fresh, use Parquet. If you have an existing Hive/Hadoop ecosystem using ORC, there is no urgent reason to migrate.
Why Parquet Won
Parquet became the standard because it hit the right trade-offs for the cloud era:
- Columnar for analytics. Read only the columns you need, which is critical when data sits in object storage and every byte read costs money.
- Compression. Parquet files are typically 2-10x smaller than equivalent JSON or CSV. Smaller files mean less storage cost, less network transfer, and faster queries.
- Schema embedded. No external schema registry needed to read the file. The schema is right there in the footer.
- Universal support. Every major query engine, data processing framework, and cloud warehouse can read Parquet natively.
- Statistics for pruning. Row group and page-level statistics let engines skip data without reading it.
Practical Considerations
File Sizing
Just like warehouse partitions, data lake files have a sweet spot for size.
Too small (< 10 MB):
- Object storage charges per request, not just per byte
- One million 1 KB files cost far more to list and read than a single 1 GB file
- Query engines spend more time opening files than reading data
Too large (> 1 GB):
- Fewer files mean fewer units of parallel work, especially for engines that do not split a single file across tasks
- If a task fails, the entire large file must be read or written again
- Harder to partition and prune
Sweet spot: 128 MB - 512 MB per file
If your pipeline produces many small files, add a compaction step that merges them into larger files periodically.
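A compaction step starts by deciding which small files to merge. The sketch below shows one simple greedy plan; `plan_compaction` and its 256 MB default target are assumptions chosen to fall inside the range above. It only groups keys and does not read or rewrite any data.

```python
MB = 1024 * 1024


def plan_compaction(file_sizes: dict[str, int], target_bytes: int = 256 * MB) -> list[list[str]]:
    """Group small files into batches of at most roughly target_bytes each.

    Greedy and order-preserving: a batch is closed once adding the next file
    would exceed the target. Files already at or above the target are skipped,
    and single-file batches are dropped because rewriting them gains nothing.
    """
    batches: list[list[str]] = []
    current: list[str] = []
    current_size = 0
    for key, size in file_sizes.items():
        if size >= target_bytes:
            continue  # already large enough
        if current and current_size + size > target_bytes:
            if len(current) > 1:
                batches.append(current)
            current, current_size = [], 0
        current.append(key)
        current_size += size
    if len(current) > 1:
        batches.append(current)
    return batches


if __name__ == "__main__":
    sizes = {"a": 100 * MB, "b": 100 * MB, "c": 100 * MB, "d": 300 * MB, "e": 50 * MB}
    print(plan_compaction(sizes))  # [['a', 'b'], ['c', 'e']]
```

In practice each batch would be restricted to a single partition, so that merged files never mix data from different partition directories.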
Compression Codecs
Parquet supports multiple compression algorithms. The common choices:
Snappy: Fast compression/decompression, moderate ratio. Default in many tools.
GZIP: Higher compression ratio, slower. Good for cold storage.
ZSTD: Best balance of speed and ratio. Increasingly the recommended default.
LZO: Fast decompression, lower ratio. Common in legacy Hadoop setups.
For most workloads, ZSTD or Snappy is the right choice. Use GZIP only when storage cost dominates and query speed is not critical.
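The speed-versus-ratio trade-off can be felt with the standard library alone. The sketch below uses `zlib` (DEFLATE, the algorithm behind GZIP) at different levels as a stand-in; it does not measure Snappy, ZSTD, or LZO. The sample data and the `measure` helper are assumptions for illustration.

```python
import time
import zlib


def measure(data: bytes, level: int) -> tuple[float, float]:
    """Return (compression ratio, seconds spent compressing) for one zlib level."""
    start = time.perf_counter()
    compressed = zlib.compress(data, level)
    elapsed = time.perf_counter() - start
    assert zlib.decompress(compressed) == data  # lossless round trip
    return len(data) / len(compressed), elapsed


if __name__ == "__main__":
    # Repetitive, CSV-like sample data; real ratios depend heavily on the data.
    sample = b"".join(b"%d,click,2025-01-15T14:00:00Z\n" % i for i in range(200_000))
    for level in (1, 6, 9):
        ratio, seconds = measure(sample, level)
        print(f"zlib level {level}: {ratio:.1f}x smaller, {seconds * 1000:.0f} ms")
```

Higher levels spend more CPU time for a somewhat better ratio. The same shape of trade-off applies when choosing between Parquet codecs, so it is worth running a comparison like this on a sample of your own data.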
Common Pitfalls
Storing analytics data as CSV or JSON long-term. Convert to Parquet at ingestion or as soon as possible. The performance difference for analytical queries is not marginal; it is often 10-100x.
Millions of tiny files. A common result of streaming ingestion that writes one file per event or per micro-batch. Implement compaction to merge small files into larger ones.
No directory organization. Dumping everything into a flat bucket with no partitioning structure makes it impossible for query engines to prune data. Use Hive-style partitioning from day one.
Ignoring file format when estimating costs. A 1 TB CSV dataset might compress to 100 GB in Parquet. That is a 10x saving on storage and roughly a 10x saving on query costs in engines that charge per byte scanned, before any further savings from reading only the columns a query needs.
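A rough estimate that includes the file format might look like the sketch below. The per-TB scan price is a hypothetical placeholder rather than a quoted rate, the storage price is the approximate S3 Standard figure used earlier, and every query is assumed to scan the full dataset.

```python
def monthly_cost(
    size_gb: float,
    scans_per_month: int,
    storage_price_per_gb: float = 0.023,
    scan_price_per_tb: float = 5.0,
) -> float:
    """Storage plus full-scan query cost in USD for one month.

    scan_price_per_tb is a hypothetical per-byte-scanned rate, not a quoted price.
    """
    storage = size_gb * storage_price_per_gb
    scanning = (size_gb / 1000) * scan_price_per_tb * scans_per_month
    return storage + scanning


if __name__ == "__main__":
    csv_cost = monthly_cost(size_gb=1000, scans_per_month=100)
    parquet_cost = monthly_cost(size_gb=100, scans_per_month=100)
    print(f"CSV: ${csv_cost:,.2f}  Parquet: ${parquet_cost:,.2f}")
```

With frequent queries the scan charges dominate the bill, so the format choice matters more for query cost than for storage cost.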
Not including schema in files. If you use CSV without a header or a schema file, every consumer must guess the column types. Parquet embeds the schema. Use it.
Treating object storage like a filesystem. Object storage has no directories; it has key prefixes that look like directories. Listing operations are expensive (especially with millions of objects). Design your key structure for efficient listing patterns.
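The "directories are just prefixes" point can be illustrated with a flat list of keys. The sketch below mimics a prefix-and-delimiter listing in memory; `list_prefix` is a hypothetical helper, not a storage API, though S3's list operation groups keys under common prefixes in a similar way.

```python
def list_prefix(
    keys: list[str], prefix: str = "", delimiter: str = "/"
) -> tuple[list[str], list[str]]:
    """Mimic a delimiter-based listing over a flat key space.

    Returns (objects directly under prefix, common prefixes one level down).
    The "directories" are computed from key strings; nothing is stored for them.
    """
    objects: list[str] = []
    common: set[str] = set()
    for key in keys:
        if not key.startswith(prefix):
            continue
        rest = key[len(prefix):]
        head, sep, _ = rest.partition(delimiter)
        if sep:
            common.add(prefix + head + delimiter)
        else:
            objects.append(key)
    return sorted(objects), sorted(common)


if __name__ == "__main__":
    keys = [
        "raw/stripe/charges/year=2025/month=01/day=15/charges_20250115.parquet",
        "raw/salesforce/opportunities/year=2025/month=01/opportunities_202501.parquet",
        "README.txt",
    ]
    print(list_prefix(keys))          # (['README.txt'], ['raw/'])
    print(list_prefix(keys, "raw/"))  # ([], ['raw/salesforce/', 'raw/stripe/'])
```

Putting the most selective partition values early in the key lets a reader narrow the listing with a prefix instead of paging through the whole bucket.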
Key Takeaways
- Object storage (S3, GCS, Azure Blob) is the foundation of modern data architectures: cheap, durable, and practically unlimited in scale
- A data lake stores data in object storage and decouples storage from compute, letting any engine query the same data
- Parquet is the standard file format for analytical data: columnar, compressed, schema-embedded, and universally supported
- File sizing matters: aim for 128 MB to 512 MB per file to balance read efficiency and parallelism
- Organize data lakes with Hive-style partitioning and a clear raw/staging/curated structure
- Convert from CSV and JSON to Parquet as early in the pipeline as possible; the performance and cost difference is enormous