Binary Pattern Matching

Most languages treat binary data as an opaque blob and force you to reach for a serialization library, a struct module, or a hand-written byte-walker. Elixir lets you pattern-match directly on the bits. The <<...>> syntax is one of the features Elixir inherited from Erlang that has no real counterpart in Python, Ruby, or Go, and it is one of the reasons the BEAM is so well-suited to networking and protocol work.

A binary in Elixir is a sequence of bytes. A bitstring is the same idea, but the total length is not required to be a multiple of 8. Both share the <<...>> literal and matching syntax. You will write binaries far more often than non-byte-aligned bitstrings, but the underlying machinery is the same.

The Basic Syntax

A binary literal is a list of segments separated by commas, wrapped in << and >>:

<<1, 2, 3>>           # three bytes
<<"hello">>           # a UTF-8 binary, equivalent to "hello"
<<255, 0, 255, 0>>    # four bytes

A segment is one piece of the binary. By default, each integer segment is 8 bits (one byte), unsigned, big-endian. You can override every part of that.

The interesting part is matching:

<<a, b, c>> = <<1, 2, 3>>
# a = 1, b = 2, c = 3

<<first, rest::binary>> = <<10, 20, 30, 40>>
# first = 10, rest = <<20, 30, 40>>

That second match is the move you will reach for constantly: pull off a known-size header, bind the remainder to a variable, recurse or hand it off.
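As a minimal sketch of that recursion, the module below walks a binary one byte at a time and adds the bytes up. The names ByteWalker and sum are invented for illustration and are not part of any library.

```elixir
defmodule ByteWalker do
  # Public entry point: start the walk with an accumulator of 0.
  def sum(data) when is_binary(data), do: sum(data, 0)

  # Peel off one byte, add it to the accumulator, recurse on the remainder.
  defp sum(<<first, rest::binary>>, acc), do: sum(rest, acc + first)

  # Empty binary: nothing left to consume, so return the total.
  defp sum(<<>>, acc), do: acc
end
```

Calling ByteWalker.sum(<<10, 20, 30, 40>>) returns 100. The same two-clause shape (one clause that consumes a header, one that matches the empty binary) carries over to most parsers built this way.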

Segment Types

Each segment can declare its type after the :: operator:

<<value::integer>>     # 8-bit integer (default)
<<value::float>>       # 64-bit IEEE 754 float
<<value::binary>>      # raw binary, byte-aligned
<<value::bitstring>>   # raw bits, no alignment required
<<value::utf8>>        # one UTF-8 codepoint
<<value::utf16>>       # one UTF-16 codepoint
<<value::utf32>>       # one UTF-32 codepoint

The utf8, utf16, and utf32 types are codepoint-aware: they consume however many bytes the encoding needs for one codepoint. The integer, float, binary, and bitstring types are raw. integer and float have default sizes (8 and 64 bits); binary and bitstring without an explicit size take everything that remains, so in a match they must come last.
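To make the difference between a codepoint-aware type and a raw type concrete, here is a small sketch. The module name SegmentPeek and its functions are hypothetical, chosen only for this example.

```elixir
defmodule SegmentPeek do
  # utf8 consumes as many bytes as the first codepoint needs (1 to 4).
  def first_codepoint(<<cp::utf8, rest::binary>>), do: {cp, rest}

  # integer (the default type) consumes exactly one byte here.
  def first_byte(<<byte::integer, rest::binary>>), do: {byte, rest}
end
```

Given the string "é!" (é is U+00E9, stored as the two bytes 195 and 169), first_codepoint returns {233, "!"} while first_byte returns {195, <<169, 33>>}. The raw match cuts the character in half; the utf8 match does not.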

Sizes and the size Modifier

Sizes are the second important modifier. They are written with size(N) or the shorthand ::N:

<<x::32>> = <<0, 0, 1, 0>>
# x = 256, a 32-bit unsigned big-endian integer

<<x::size(16)>> = <<1, 0>>
# x = 256, a 16-bit integer

<<x::integer-size(24)>> = <<0, 1, 0>>
# x = 256, explicit type and size

For integer segments, the size is in bits. For binary and bitstring, the size is multiplied by a unit (8 for binary, 1 for bitstring) and represents bytes or bits respectively.

<<header::binary-size(4), rest::binary>> = <<"PNG\r", "more", "data">>
# header = "PNG\r", rest = "moredata"

You can mix multiple segments with different sizes in a single match. This is how you peel apart a fixed-format record:

<<version::4, ihl::4, tos::8, total_length::16>> = <<0x45, 0x00, 0x00, 0x3C>>
# version = 4, ihl = 5, tos = 0, total_length = 60

That is the first four bytes of an IPv4 header, decoded in one line. version and ihl are 4-bit nibbles inside a single byte. tos is a full byte. total_length spans two bytes. No bit shifting, no masking, no byte order conversions in the body of your code.

Endianness

Most network protocols are big-endian (network byte order). x86 hardware is little-endian, so a file written by a C program that dumps its in-memory structs on x86 Linux is almost certainly little-endian. The big, little, and native modifiers let you say which one you mean:

<<x::little-32>> = <<0, 0, 1, 0>>
# x = 65536, interpreted little-endian

<<x::big-32>> = <<0, 0, 1, 0>>
# x = 256, interpreted big-endian (default)

<<x::native-32>> = <<0, 0, 1, 0>>
# whatever your CPU uses; portable code should avoid this

big is the default and matches network protocols. You only need little when reading on-disk formats from x86 sources or interacting with C code over a port.
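One on-disk format where little comes up is the BMP image file header. The sketch below assumes the usual 14-byte layout: the two-byte signature "BM", a 32-bit little-endian file size, four reserved bytes, and a 32-bit little-endian offset to the pixel data. The module name BmpHeader is made up for the example, and a real reader would go on to parse the info header that follows.

```elixir
defmodule BmpHeader do
  # Assumed layout: "BM", file size (LE), 4 reserved bytes, pixel offset (LE).
  def parse(<<"BM", file_size::little-32, _reserved::32,
              pixel_offset::little-32, _rest::binary>>) do
    {:ok, %{file_size: file_size, pixel_offset: pixel_offset}}
  end

  # Wrong signature or fewer than 14 bytes.
  def parse(_other), do: :error
end
```

Reading the same two fields without the little modifier would still match, but a file size of 70 would come back as 1174405120.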

Signed and Unsigned

Integers are unsigned by default. Add signed when you need two's-complement:

<<x::signed-8>> = <<255>>
# x = -1

<<x::8>> = <<255>>
# x = 255, unsigned default

Most protocol fields are unsigned. The places you reach for signed are sensor data, audio samples, and any format borrowed from C where int and short are signed.
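For the audio case, a rough sketch of decoding raw 16-bit samples might look like this. It assumes the samples are signed, little-endian, and packed back to back with no header; the module name Pcm is hypothetical.

```elixir
defmodule Pcm do
  # Each sample is two bytes: signed, little-endian, 16 bits.
  def samples(<<sample::signed-little-16, rest::binary>>) do
    [sample | samples(rest)]
  end

  # Fewer than two bytes left: stop (a trailing odd byte is dropped).
  def samples(rest) when is_binary(rest), do: []
end
```

Pcm.samples(<<0xFF, 0xFF, 0x00, 0x80, 0x01, 0x00>>) returns [-1, -32768, 1]. Without signed, the first two values would read as 65535 and 32768.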

Stacking the Modifiers

The modifiers are written with - between them, in any order:

<<x::integer-signed-little-32>>
<<x::little-signed-integer-size(32)>>

Both are the same thing: a 32-bit, signed, little-endian integer. The compiler resolves them. In practice, pick an order and stick with it. The community convention is roughly type-signed-endianness-size, but no one will fault you for varying it.
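A quick way to convince yourself that the order does not matter is to put both spellings in function heads and feed them the same bytes. The module below is only an illustration; its name and function names are invented.

```elixir
defmodule ModifierOrder do
  # Type first, size last.
  def decode_a(<<x::integer-signed-little-32>>), do: x

  # Endianness first, explicit size(...) form.
  def decode_b(<<x::little-signed-integer-size(32)>>), do: x
end
```

Both functions return -2 for <<0xFE, 0xFF, 0xFF, 0xFF>>.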

Why This Beats the Alternatives

Compare a single IPv4 header parse in Elixir, C, and Python.

Elixir:

<<version::4, ihl::4, tos::8, total_length::16,
  id::16, flags::3, frag_offset::13,
  ttl::8, protocol::8, checksum::16,
  src::32, dst::32,
  rest::binary>> = packet

In C, you write a packed struct, hope your compiler honored __attribute__((packed)), manually call ntohs and ntohl on every multi-byte field, and remember that the bitfields for version and ihl may or may not lay out the way you expect depending on the platform.

In Python, you reach for the struct module:

import struct
v_ihl, tos, total_length, ident, flags_frag, ttl, proto, checksum, src, dst = \
    struct.unpack("!BBHHHBBHII", packet[:20])
version = v_ihl >> 4
ihl = v_ihl & 0x0F
flags = flags_frag >> 13
frag_offset = flags_frag & 0x1FFF

You memorize format-string characters (!BBHHHBBHII), you manually shift and mask the bit-packed fields, and you keep a comment nearby explaining what each letter means. It works, but the code is opaque six months later.

The Elixir version reads exactly like the protocol spec. Each field's name, position, and width is right there in the match. That is the whole pitch.
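In real code the match usually lives in a function head, so that a short or malformed packet falls through to an error clause instead of raising. The following is a sketch under that assumption: the module name Ipv4Header, the shape of the returned map, and the :truncated error atom are all choices made for this example, and IPv4 options (present when ihl is greater than 5) are left unparsed in the remainder.

```elixir
defmodule Ipv4Header do
  # Matches the fixed 20-byte part of the header; addresses are split
  # into four one-byte segments each so they come out as tuples.
  def parse(<<version::4, ihl::4, _tos::8, total_length::16,
              _id::16, _flags::3, _frag_offset::13,
              ttl::8, protocol::8, _checksum::16,
              a, b, c, d,
              e, f, g, h,
              rest::binary>>) do
    header = %{
      version: version,
      ihl: ihl,
      total_length: total_length,
      ttl: ttl,
      protocol: protocol,
      src: {a, b, c, d},
      dst: {e, f, g, h}
    }

    {:ok, header, rest}
  end

  # Anything shorter than 20 bytes cannot be a complete header.
  def parse(_too_short), do: {:error, :truncated}
end
```

The caller then pattern-matches on {:ok, header, rest} or {:error, :truncated}, which keeps the failure path as explicit as the field layout.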

Building Binaries

Construction uses the same syntax:

header = <<4::4, 5::4, 0::8, 60::16>>
# <<0x45, 0x00, 0x00, 0x3C>>

packet = <<header::binary, payload::binary>>

Anything you can match on, you can construct. The same size and endianness rules apply.

A common pattern is building a fixed-size record from variables:

def build_request(id, payload) do
  size = byte_size(payload)
  <<0x01::8, id::32, size::32, payload::binary>>
end

That gives you a four-line function that produces a wire-format binary with a one-byte type tag, a four-byte request ID, a four-byte length, and an arbitrary-length payload. No serialization library required.
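Because matching and construction share one syntax, the parser for that record is nearly a mirror image of the builder. The sketch below pairs the two in a hypothetical Request module; the return shapes ({:ok, id, payload, rest} and :error) are assumptions made for the example.

```elixir
defmodule Request do
  # Build: type tag, 32-bit id, 32-bit payload length, then the payload.
  def build(id, payload) when is_binary(payload) do
    size = byte_size(payload)
    <<0x01::8, id::32, size::32, payload::binary>>
  end

  # Parse: size is bound by an earlier segment and then used to
  # delimit the payload within the same pattern.
  def parse(<<0x01::8, id::32, size::32, payload::binary-size(size), rest::binary>>) do
    {:ok, id, payload, rest}
  end

  # Unknown tag, or not enough bytes for the declared payload length.
  def parse(_other), do: :error
end
```

Request.parse(Request.build(7, "hi")) returns {:ok, 7, "hi", ""}. Any bytes after the payload come back in the last element, ready for the next parse call.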

Bitstrings Versus Binaries

A binary is a bitstring whose length is a multiple of 8. Every binary is a bitstring; not every bitstring is a binary.

iex> is_binary(<<1::4>>)
false

iex> is_bitstring(<<1::4>>)
true

iex> is_binary(<<1::8>>)
true

You will rarely produce non-byte-aligned bitstrings deliberately. They come up when parsing protocols that have sub-byte fields and the total ends up unaligned, or when working with formats like compressed bit streams. For normal protocol work, you slice things into bytes and the result is always a binary.
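If you are unsure which of the two you are holding, a small helper along these lines can report it. The module name BitInfo and the keys of the returned map are invented for this sketch.

```elixir
defmodule BitInfo do
  # Summarize a bitstring: is it byte-aligned, and how long is it?
  def describe(bits) when is_bitstring(bits) do
    %{
      aligned: is_binary(bits),
      bits: bit_size(bits),
      # byte_size rounds up to whole bytes for unaligned input.
      bytes: byte_size(bits)
    }
  end
end
```

BitInfo.describe(<<1::4>>) returns %{aligned: false, bits: 4, bytes: 1}, and appending another nibble, as in <<1::4, 5::4>>, gives %{aligned: true, bits: 8, bytes: 1}.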

Common Pitfalls

Forgetting that the default segment size is 8 bits. <<x>> matches one byte. Writing <<x>> = <<1, 2>> is a match error because the right side is two bytes and the left side expects one. Use <<x, _rest::binary>> to grab the first byte and ignore the rest.

Mixing up size(N) units. For integer, the size is in bits. For binary, the size is in bytes (because the unit is 8). <<x::binary-size(4)>> is four bytes; <<x::size(4)>> is four bits. The two read similarly but mean very different things.

Defaulting to big-endian when the source is little-endian. A 32-bit field read with the wrong endianness gives you a wildly wrong number. If a field that should hold a small value such as 1 comes back as 16777216 (0x01000000), the bytes are reversed: swap to little and try again.

Trying to match a variable-size segment without telling the compiler where it ends. <<a::binary, b::binary>> = ... is ambiguous and will not compile. Either give the first segment an explicit size, or put binary last so it consumes the remainder.

Using byte_size/1 when you wanted bit_size/1. byte_size is bytes; bit_size is bits. For a binary, bit_size is exactly 8 times byte_size. For a non-aligned bitstring, byte_size rounds up to the next whole byte, so the two no longer agree by a clean factor of 8.

Reaching for :binary.split/2 when bit syntax would do. The :binary module from Erlang stdlib has useful helpers, but for fixed-format records, bit syntax is faster and clearer than splitting on delimiters.
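The size-unit pitfall is the easiest of these to see in running code. In this sketch (module and function names are hypothetical), the two function heads differ only in the type attached to a size of 4.

```elixir
defmodule SizeUnits do
  # binary-size(4): the unit is 8, so this takes four whole bytes.
  def four_bytes(<<x::binary-size(4), _rest::binary>>), do: x

  # size(4) on the default integer type: four bits, the high nibble
  # of the first byte. The remainder is no longer byte-aligned, so
  # it has to be matched as a bitstring rather than a binary.
  def four_bits(<<x::size(4), _rest::bitstring>>), do: x
end
```

For the input "ABCDEF", four_bytes returns "ABCD" and four_bits returns 4 (the high nibble of 0x41, the byte for "A").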

Key Takeaways

The <<...>> syntax matches and builds binaries with field names, sizes, and types declared in one place.
The default segment is an 8-bit, unsigned, big-endian integer. Override with ::type, ::size(N), signed, little.
integer, float, binary, bitstring, utf8, utf16, utf32 are the segment types you reach for.
Binaries are byte-aligned bitstrings; bit syntax handles both seamlessly.
Bit syntax beats C's struct-and-ntohl dance and Python's struct format strings for protocol parsing: the code reads like the spec.