Protocol Buffers

Protocol Buffers (protobuf) is Google's binary serialization format. It is schema-defined, language-neutral, and backward compatible by design. You write a .proto file describing your data structures, run a code generator, and get typed serialization/deserialization code in Go, Rust, Python, Java, C++, or a dozen other languages.

Protobuf is the serialization layer beneath gRPC, but it is also used independently for storage, message queues, and configuration. If you need to send structured data between systems and JSON is too slow, too large, or too loosely typed, protobuf is the standard alternative.

Why Not JSON

JSON is human-readable, universally supported, and good enough for most APIs. Protobuf is better when:

  • Size matters — protobuf messages are often 3-10x smaller than the equivalent JSON, depending on the data. Field names are replaced by field numbers. Integers use variable-length encoding. No quotes, no braces, no whitespace.
  • Speed matters — protobuf serialization and deserialization is often 2-10x faster than JSON parsing, depending on the language and library. No string escaping, no number parsing, no dynamic type resolution.
  • Types matter — JSON has strings, numbers, booleans, arrays, and objects. Protobuf has int32, int64, float, double, bool, string, bytes, enums, nested messages, maps, and more. A protobuf schema catches type errors at compile time; JSON catches them at runtime (or never).

A comparison for a simple order message:

{
  "order_id": "ord_abc123",
  "customer_id": "cus_def456",
  "items": [
    {"product_id": "prod_789", "quantity": 2, "unit_price_cents": 1500},
    {"product_id": "prod_012", "quantity": 1, "unit_price_cents": 3000}
  ],
  "total_cents": 6000,
  "currency": "usd",
  "status": "confirmed",
  "created_at": "2024-01-15T12:00:00Z"
}

This JSON is approximately 350 bytes. The equivalent protobuf binary is approximately 80 bytes. Over millions of messages per second between microservices, this difference compounds into real bandwidth and CPU savings.
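The size estimate can be checked by hand. The following is a minimal sketch, not the official protobuf library: it hand-encodes the order above using the field numbers from the Order schema shown in the next section, and assumes that status is written as the enum value 2 and created_at as a Timestamp with only its seconds field set. The helper names (encode_varint, varint_field, bytes_field, string_field, encode_line_item) are hypothetical.

```python
import json
from datetime import datetime, timezone


def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint."""
    out = bytearray()
    while True:
        low_bits = value & 0x7F
        value >>= 7
        if value:
            out.append(low_bits | 0x80)  # continuation bit: more bytes follow
        else:
            out.append(low_bits)
            return bytes(out)


def varint_field(field_number: int, value: int) -> bytes:
    """Tag with wire type 0 (varint), then the value."""
    return encode_varint((field_number << 3) | 0) + encode_varint(value)


def bytes_field(field_number: int, payload: bytes) -> bytes:
    """Tag with wire type 2 (length-delimited), then length, then payload."""
    return encode_varint((field_number << 3) | 2) + encode_varint(len(payload)) + payload


def string_field(field_number: int, text: str) -> bytes:
    return bytes_field(field_number, text.encode("utf-8"))


def encode_line_item(product_id: str, quantity: int, unit_price_cents: int) -> bytes:
    return (
        string_field(1, product_id)
        + varint_field(2, quantity)
        + varint_field(3, unit_price_cents)
    )


created_at = datetime(2024, 1, 15, 12, 0, 0, tzinfo=timezone.utc)
# Timestamp.seconds is field 1; nanos is 0 and therefore omitted.
timestamp = varint_field(1, int(created_at.timestamp()))

order_bytes = (
    string_field(1, "ord_abc123")
    + string_field(2, "cus_def456")
    + bytes_field(3, encode_line_item("prod_789", 2, 1500))
    + bytes_field(3, encode_line_item("prod_012", 1, 3000))
    + varint_field(4, 6000)
    + string_field(5, "usd")
    + varint_field(6, 2)  # ORDER_STATUS_CONFIRMED
    + bytes_field(7, timestamp)
)

order_json = json.dumps(
    {
        "order_id": "ord_abc123",
        "customer_id": "cus_def456",
        "items": [
            {"product_id": "prod_789", "quantity": 2, "unit_price_cents": 1500},
            {"product_id": "prod_012", "quantity": 1, "unit_price_cents": 3000},
        ],
        "total_cents": 6000,
        "currency": "usd",
        "status": "confirmed",
        "created_at": "2024-01-15T12:00:00Z",
    },
    separators=(",", ":"),  # compact form, no whitespace
)

print(len(order_bytes), len(order_json))
```

Under these assumptions the hand-encoded message comes to 76 bytes, while even whitespace-free JSON is more than three times larger.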

The .proto File

A .proto file defines message types (data structures) and their fields:

syntax = "proto3";

package commerce.v1;

import "google/protobuf/timestamp.proto";

option go_package = "github.com/example/commerce/v1";

message Order {
  string order_id = 1;
  string customer_id = 2;
  repeated LineItem items = 3;
  int64 total_cents = 4;
  string currency = 5;
  OrderStatus status = 6;
  google.protobuf.Timestamp created_at = 7;
}

message LineItem {
  string product_id = 1;
  int32 quantity = 2;
  int64 unit_price_cents = 3;
}

enum OrderStatus {
  ORDER_STATUS_UNSPECIFIED = 0;
  ORDER_STATUS_PENDING = 1;
  ORDER_STATUS_CONFIRMED = 2;
  ORDER_STATUS_SHIPPED = 3;
  ORDER_STATUS_DELIVERED = 4;
  ORDER_STATUS_CANCELED = 5;
}

Syntax Declaration

syntax = "proto3" selects version 3 of the Protocol Buffers language. Proto3 simplified the language by removing required fields, custom default values, and some other proto2 features. Always use proto3 for new projects.

Packages

package commerce.v1 namespaces the types. The v1 in the package name is intentional — it supports major version changes in the schema without breaking existing consumers.

Message Types

Messages are the data structures. Each field has a type, a name, and a unique field number:

message User {
  string id = 1;          // field number 1
  string email = 2;       // field number 2
  string name = 3;        // field number 3
  int64 created_at = 4;   // field number 4
}

Field Numbers

Field numbers are the most important thing to understand about protobuf. The binary encoding uses field numbers, not names, to identify fields. The name is only for human readability and code generation.

Rules

  • Field numbers must be unique within a message
  • Numbers 1-15 need one byte for the field tag (use these for frequently set fields)
  • Numbers 16-2047 need two bytes for the tag
  • Numbers 19000-19999 are reserved by the protobuf implementation
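
The byte counts in these rules follow from how a field is identified on the wire: the field number is shifted left by three bits, combined with a three-bit wire type, and written as a varint. A minimal sketch of that calculation, with hypothetical helper names:

```python
def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint."""
    out = bytearray()
    while True:
        low_bits = value & 0x7F
        value >>= 7
        if value:
            out.append(low_bits | 0x80)  # continuation bit: more bytes follow
        else:
            out.append(low_bits)
            return bytes(out)


def encode_tag(field_number: int, wire_type: int = 0) -> bytes:
    """Build the tag that precedes every field value."""
    return encode_varint((field_number << 3) | wire_type)


tag_sizes = {n: len(encode_tag(n)) for n in (1, 15, 16, 2047, 2048)}
print(tag_sizes)  # {1: 1, 15: 1, 16: 2, 2047: 2, 2048: 3}
```

Only four bits of the first byte are left for the field number once the wire type and the continuation bit are accounted for, which is why the one-byte range stops at 15.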

Never Reuse Deleted Field Numbers

If you remove a field, its number must never be reused. Old data may still contain that field number with the old type. Reusing it with a new type causes silent data corruption.
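
To see why reuse is dangerous, consider a minimal sketch in which stored data was written when field 4 meant phone_number, and a newer schema reuses field 4 for a hypothetical display_name. The decoder below is hand-written for illustration and only handles string fields; the schema dictionaries and sample values are assumptions.

```python
def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint."""
    out = bytearray()
    while True:
        low_bits = value & 0x7F
        value >>= 7
        if value:
            out.append(low_bits | 0x80)
        else:
            out.append(low_bits)
            return bytes(out)


def decode_varint(data: bytes, pos: int):
    """Return (value, next_position) for the varint starting at pos."""
    result, shift = 0, 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def encode_string_field(field_number: int, text: str) -> bytes:
    payload = text.encode("utf-8")
    return encode_varint((field_number << 3) | 2) + encode_varint(len(payload)) + payload


def decode_string_fields(data: bytes, names: dict) -> dict:
    """Decode a message made only of length-delimited string fields."""
    fields, pos = {}, 0
    while pos < len(data):
        key, pos = decode_varint(data, pos)
        field_number, wire_type = key >> 3, key & 0x07
        if wire_type != 2:
            raise ValueError("this sketch only handles length-delimited fields")
        length, pos = decode_varint(data, pos)
        value = data[pos:pos + length].decode("utf-8")
        pos += length
        if field_number in names:
            fields[names[field_number]] = value
    return fields


NEW_SCHEMA = {1: "id", 4: "display_name"}  # hypothetical: field 4 recycled

# Bytes written under the old schema, where field 4 was phone_number.
stored = encode_string_field(1, "usr_1") + encode_string_field(4, "+1-555-0100")

decoded = decode_string_fields(stored, NEW_SCHEMA)
print(decoded)  # {'id': 'usr_1', 'display_name': '+1-555-0100'}
```

Nothing fails: the old phone number is returned as the display name, because the bytes carry only the number 4 and not the field's name or meaning.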

Use the reserved keyword to prevent accidental reuse:

message User {
  reserved 4, 7;
  reserved "phone_number", "legacy_role";

  string id = 1;
  string email = 2;
  string name = 3;
  // field 4 was phone_number (removed in v2)
  UserRole role = 5;
  int64 created_at = 6;
  // field 7 was legacy_role (removed in v3)
}

If someone later tries to use field number 4 or the name phone_number, the protobuf compiler rejects it.

Field Types

Scalar Types

message Example {
  int32 small_number = 1;       // 32-bit integer
  int64 big_number = 2;         // 64-bit integer
  float ratio = 3;              // 32-bit floating point
  double precise_ratio = 4;     // 64-bit floating point
  bool active = 5;              // boolean
  string name = 6;              // UTF-8 string
  bytes payload = 7;            // arbitrary bytes
}

Use int64 for monetary values (cents), timestamps (Unix seconds), and any number that might exceed 2 billion. Use string for identifiers, even if they look numeric — IDs are not quantities, and treating them as strings prevents arithmetic mistakes.

Repeated Fields

repeated means "zero or more" — it is protobuf's array type:

message Order {
  repeated LineItem items = 1;     // list of line items
  repeated string tags = 2;        // list of strings
}

Maps

Maps are key-value pairs. Keys must be an integer type, bool, or string (not float, double, bytes, an enum, or a message). Values can be any type except another map:

message Config {
  map<string, string> labels = 1;
  map<string, int32> feature_flags = 2;
}

Oneof

oneof means at most one of the enclosed fields is set at a time; it is also valid for none of them to be set. It is protobuf's union type:

message PaymentMethod {
  string id = 1;
  oneof details {
    CreditCard credit_card = 2;
    BankAccount bank_account = 3;
    DigitalWallet digital_wallet = 4;
  }
}

message CreditCard {
  string last_four = 1;
  string brand = 2;
  int32 exp_month = 3;
  int32 exp_year = 4;
}

message BankAccount {
  string routing_number = 1;
  string last_four = 2;
}

message DigitalWallet {
  string provider = 1;
  string account_id = 2;
}

Only one of credit_card, bank_account, or digital_wallet can be set. Setting one clears the others. The generated code includes a method to check which field is populated.
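
The "setting one clears the others" behavior can be sketched with a hand-written stand-in. This is not the code protoc generates; the class, its method names, and the sample values are hypothetical and exist only to illustrate the semantics. In Python generated code, the check for which member is populated is the message method WhichOneof("details").

```python
from typing import Optional


class PaymentMethod:
    """Hand-written stand-in for a message with a oneof named `details`."""

    DETAILS_FIELDS = ("credit_card", "bank_account", "digital_wallet")

    def __init__(self, id: str = "") -> None:
        self.id = id
        self._details = {}  # holds at most one entry

    def set_details(self, field: str, value: dict) -> None:
        if field not in self.DETAILS_FIELDS:
            raise ValueError(f"{field!r} is not part of the oneof")
        # Replacing the whole dict clears any member that was set earlier.
        self._details = {field: value}

    def which_details(self) -> Optional[str]:
        """Return the name of the populated member, or None if nothing is set."""
        return next(iter(self._details), None)

    def get(self, field: str) -> Optional[dict]:
        return self._details.get(field)


method = PaymentMethod(id="pm_1")
method.set_details("credit_card", {"last_four": "4242", "brand": "visa"})
method.set_details("bank_account", {"routing_number": "110000000", "last_four": "6789"})

print(method.which_details())     # bank_account
print(method.get("credit_card"))  # None
```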

Enums

Enums define a fixed set of values. The first value must be 0 and should be the "unspecified" or "unknown" value:

enum Currency {
  CURRENCY_UNSPECIFIED = 0;
  CURRENCY_USD = 1;
  CURRENCY_EUR = 2;
  CURRENCY_GBP = 3;
  CURRENCY_JPY = 4;
}

The zero value convention (_UNSPECIFIED = 0) is important. Proto3 uses 0 as the default for unset fields. If your first enum value is a real value (like USD = 0), you cannot distinguish between "the caller set USD" and "the caller did not set a currency."
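
The ambiguity is visible on the wire. A proto3 field without explicit presence is not serialized at all when it holds its default value, so a zero enum value and an unset field produce the same bytes. A minimal sketch, assuming the enum sits at field number 1 of some message and using a hypothetical encode_enum_field helper:

```python
def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint."""
    out = bytearray()
    while True:
        low_bits = value & 0x7F
        value >>= 7
        if value:
            out.append(low_bits | 0x80)
        else:
            out.append(low_bits)
            return bytes(out)


def encode_enum_field(field_number: int, value: int) -> bytes:
    """Mimic proto3 implicit presence: the default value 0 is never written."""
    if value == 0:
        return b""
    return encode_varint((field_number << 3) | 0) + encode_varint(value)


CURRENCY_UNSPECIFIED = 0
CURRENCY_USD = 1

never_set = b""  # a message whose currency field was never touched
set_to_zero = encode_enum_field(1, CURRENCY_UNSPECIFIED)
set_to_usd = encode_enum_field(1, CURRENCY_USD)

print(set_to_zero == never_set)  # True: indistinguishable on the wire
print(set_to_usd)                # b'\x08\x01'
```

If USD were 0, the first comparison would describe every USD order, which is exactly the case the _UNSPECIFIED convention avoids.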

Code Generation

The protoc compiler reads .proto files and generates code:

# Generate Go code (requires the protoc-gen-go plugin on your PATH)
protoc --go_out=. --go_opt=paths=source_relative order.proto

# Generate Python code
protoc --python_out=. order.proto

# Generate Rust code (using prost)
# Typically configured in build.rs rather than invoked directly

The generated code includes:

  • Struct/class definitions for each message type
  • Serialization methods (message to bytes)
  • Deserialization methods (bytes to message)
  • Builder/setter methods for constructing messages
  • Enum definitions with string conversions

The generated types are fully typed in languages that support it. In Go, Order.TotalCents is an int64, not an interface{} that you cast at runtime.

Backward Compatibility

Protobuf is designed for schema evolution. You can add fields, deprecate fields, and maintain compatibility with existing data:

Safe Changes

  • Add a new field — old readers ignore it; new readers get the default value from old data
  • Remove a field — old data with the field is still readable; reserve the field number
  • Rename a field — the binary encoding uses numbers, not names; renaming has no wire effect

Unsafe Changes

  • Change a field's type — changing int32 to string will corrupt data
  • Change a field's number — old data mapped to the old number will be misinterpreted
  • Reuse a deleted field number — the old type and new type will collide

Example Evolution

Version 1:

message User {
  string id = 1;
  string name = 2;
  string email = 3;
}

Version 2 (backward compatible; an intermediate revision, not shown, added phone as field 4 and later removed it):

message User {
  reserved 4;
  reserved "phone";

  string id = 1;
  string name = 2;
  string email = 3;
  // field 4 was phone (removed)
  string avatar_url = 5;
  UserRole role = 6;
}

Old clients that send Version 1 messages can still be read by Version 2 servers (avatar_url and role default to empty/zero). Version 2 messages can be read by Version 1 clients (unknown fields 5 and 6 are ignored).
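
The second half of that claim, an older reader ignoring fields it does not know, can be sketched with a hand-written decoder. This is an illustration under stated assumptions rather than the protobuf library: it handles only varint and length-delimited fields, and the sample values and helper names are hypothetical.

```python
def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint."""
    out = bytearray()
    while True:
        low_bits = value & 0x7F
        value >>= 7
        if value:
            out.append(low_bits | 0x80)
        else:
            out.append(low_bits)
            return bytes(out)


def decode_varint(data: bytes, pos: int):
    """Return (value, next_position) for the varint starting at pos."""
    result, shift = 0, 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def string_field(field_number: int, text: str) -> bytes:
    payload = text.encode("utf-8")
    return encode_varint((field_number << 3) | 2) + encode_varint(len(payload)) + payload


def varint_field(field_number: int, value: int) -> bytes:
    return encode_varint((field_number << 3) | 0) + encode_varint(value)


def decode_message(data: bytes, known: dict):
    """Decode known fields by number and skip over everything else."""
    fields, skipped, pos = {}, [], 0
    while pos < len(data):
        key, pos = decode_varint(data, pos)
        field_number, wire_type = key >> 3, key & 0x07
        if wire_type == 0:  # varint
            value, pos = decode_varint(data, pos)
        elif wire_type == 2:  # length-delimited
            length, pos = decode_varint(data, pos)
            value = data[pos:pos + length].decode("utf-8")
            pos += length
        else:
            raise ValueError("wire type not handled in this sketch")
        if field_number in known:
            fields[known[field_number]] = value
        else:
            skipped.append(field_number)  # the wire type says how far to skip
    return fields, skipped


# A Version 2 writer sets the new fields 5 (avatar_url) and 6 (role).
v2_bytes = (
    string_field(1, "usr_1")
    + string_field(2, "Ada")
    + string_field(3, "ada@example.com")
    + string_field(5, "https://example.com/ada.png")
    + varint_field(6, 2)
)

# A reader that only knows fields 1 to 3.
V1_FIELDS = {1: "id", 2: "name", 3: "email"}
fields, skipped = decode_message(v2_bytes, V1_FIELDS)

print(fields)   # {'id': 'usr_1', 'name': 'Ada', 'email': 'ada@example.com'}
print(skipped)  # [5, 6]
```

The reader never needs the newer schema: each tag carries a wire type, and the wire type alone is enough to know how many bytes to skip.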

Common Pitfalls

  • Reusing field numbers — the most dangerous mistake in protobuf. Deleted field numbers must be reserved, not recycled. This causes silent data corruption that is extremely difficult to debug.
  • Using 0 as a meaningful enum value — proto3 defaults unset enums to 0. If 0 = ACTIVE, you cannot tell whether the caller intended "active" or forgot to set the field. Always use 0 = UNSPECIFIED.
  • Floating point for money — float and double have precision issues. Use int64 for monetary values in the smallest unit (cents, pence).
  • Nested message overuse — deeply nesting messages (5+ levels) makes the schema hard to navigate and the generated code unwieldy. Flatten when the nesting does not represent real domain relationships.
  • No package versioning — declaring package myservice instead of a versioned package such as package myservice.v1. When you need a breaking change, you have no migration path.
  • Ignoring well-known types — protobuf provides google.protobuf.Timestamp, google.protobuf.Duration, google.protobuf.Struct, and others. Use them instead of inventing your own timestamp or duration encoding.
  • Large messages — protobuf is efficient but not designed for multi-megabyte messages. For large payloads, stream them in chunks or use a different transport.
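
The floating point pitfall is easy to reproduce. The sketch below assumes a price of 19.99 and uses Python's struct module to round-trip it through a 32-bit IEEE 754 value, which is the representation behind a protobuf float field; roundtrip_float32 is a hypothetical helper.

```python
import struct


def roundtrip_float32(value: float) -> float:
    """Store a value as a 32-bit float and read it back."""
    return struct.unpack("<f", struct.pack("<f", value))[0]


price = 19.99
stored = roundtrip_float32(price)

print(stored == price)   # False: 19.99 has no exact 32-bit representation
print(0.1 + 0.2 == 0.3)  # False even with 64-bit doubles

# Integer cents stay exact under addition and multiplication.
price_cents = 1999
total_cents = price_cents * 3
print(total_cents)       # 5997
```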

Key Takeaways

  • Protobuf is a binary serialization format that is smaller, faster, and more strongly typed than JSON. Use it when size, speed, or type safety matters — typically in service-to-service communication.
  • The .proto file is the schema. It defines message types, field types, enums, and relationships. It is the single source of truth for data structures across all languages.
  • Field numbers are the wire format identity. Never reuse deleted field numbers. Reserve them with the reserved keyword.
  • Always use 0 = UNSPECIFIED for the first enum value. Proto3 defaults unset fields to zero, and you need to distinguish "unset" from a real value.
  • Protobuf is designed for backward-compatible evolution. Add new fields freely. Remove fields by reserving their numbers. Never change a field's type or number.
  • Use protoc to generate typed code in any supported language. The generated code handles serialization, deserialization, and type enforcement.