Parsing With Bit Syntax
Bit syntax is the feature; parsing is where you actually use it. Once you have seen one or two real protocol decoders written in Elixir, it becomes clear why people build network code on the BEAM. The same syntax that lets you write <<version::4, ihl::4, ...>> also compiles to tight, branch-friendly code that the BEAM optimizes heavily. Cowboy, long the default HTTP server underneath Phoenix, parses HTTP requests with binary pattern matching, and cheap parsing is one reason a Phoenix node can handle tens of thousands of requests per second.
This topic walks through four concrete parsers: an IPv4 header, a UDP header, the PNG file signature and chunk layout, and a small custom binary protocol. All four follow the same shape: match a known-size head, bind the rest, recurse if needed.
Parsing an IPv4 Header
Here is the layout of an IPv4 header per RFC 791 (first 20 bytes, ignoring options):
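 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|Version|  IHL  |Type of Service|          Total Length         |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|         Identification        |Flags|     Fragment Offset     |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|      TTL      |    Protocol   |        Header Checksum        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                        Source Address                         |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                      Destination Address                      |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+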
In Elixir:
defmodule IPv4 do
  def parse(<<
        version::4,
        ihl::4,
        tos::8,
        total_length::16,
        id::16,
        flags::3,
        frag_offset::13,
        ttl::8,
        protocol::8,
        checksum::16,
        src1::8, src2::8, src3::8, src4::8,
        dst1::8, dst2::8, dst3::8, dst4::8,
        rest::binary
      >>) do
    %{
      version: version,
      ihl: ihl,
      tos: tos,
      total_length: total_length,
      id: id,
      flags: flags,
      frag_offset: frag_offset,
      ttl: ttl,
      protocol: protocol,
      checksum: checksum,
      src: {src1, src2, src3, src4},
      dst: {dst1, dst2, dst3, dst4},
      payload: rest
    }
  end
end
That is the entire parser. No state machine, no offset arithmetic, no memcpy. The match itself is the spec.
To test it:
>
packet = <<
  0x45, 0x00, 0x00, 0x3C,
  0x1C, 0x46, 0x40, 0x00,
  0x40, 0x06, 0xB1, 0xE6,
  192, 168, 1, 100,
  93, 184, 216, 34,
  "TCP header and payload would go here"
>>

IPv4.parse(packet)
# %{version: 4, ihl: 5, tos: 0, total_length: 60,
#   id: 7238, flags: 2, frag_offset: 0, ttl: 64,
#   protocol: 6, checksum: 45542, src: {192, 168, 1, 100},
#   dst: {93, 184, 216, 34}, payload: "..."}
The protocol: 6 field tells you this is TCP. Layer that with a TCP parser using the same approach and you have the first two layers of a packet decoder in roughly 60 lines of Elixir.
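The parser above assumes a 20-byte header, so when IHL is greater than 5 the option bytes end up at the front of the payload. A minimal sketch of one way to honour IHL follows; the module name `IPv4Options` and its return shapes are hypothetical, not part of the `IPv4` module above.

```elixir
defmodule IPv4Options do
  # Hypothetical helper: splits a raw packet into the fixed 20-byte header,
  # the option bytes (possibly empty), and the payload, using the IHL field.
  def split(<<4::4, ihl::4, _::binary>> = packet) when ihl >= 5 do
    # IHL counts 32-bit words, so the full header is ihl * 4 bytes long.
    options_len = ihl * 4 - 20

    case packet do
      <<fixed::binary-size(20), options::binary-size(options_len), payload::binary>> ->
        {:ok, fixed, options, payload}

      _ ->
        # Fewer bytes than the header claims to occupy.
        {:error, :truncated}
    end
  end

  # Wrong version nibble, IHL below 5, or an empty binary.
  def split(_), do: {:error, :invalid_header}
end
```

The fixed part can then be handed to a field-level parser, while the options are decoded separately or ignored.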
Parsing a UDP Header
UDP is even simpler. Eight bytes of header before the payload:
defmodule UDP do
  def parse(<<
        src_port::16,
        dst_port::16,
        length::16,
        checksum::16,
        payload::binary
      >>) do
    %{
      src_port: src_port,
      dst_port: dst_port,
      length: length,
      checksum: checksum,
      payload: payload
    }
  end
end
Four 16-bit fields and a payload. Large BEAM deployments such as WhatsApp's messaging backend and Discord's real-time gateways handle very high message volumes (Discord has talked publicly about handling millions of concurrent voice users), and a fixed-size header match like this one adds very little per-packet cost once the JIT has compiled it to native code.
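The UDP length field counts the 8-byte header plus the payload, so a stricter parser can use it to reject short datagrams and drop trailing bytes. The sketch below is one possible variant; `UDPStrict` and its error atoms are hypothetical names chosen for illustration.

```elixir
defmodule UDPStrict do
  # Hypothetical variant of UDP.parse/1 that honours the length field.
  @header_size 8

  def parse(<<src_port::16, dst_port::16, len::16, checksum::16, rest::binary>>)
      when len >= @header_size do
    payload_len = len - @header_size

    case rest do
      # Take exactly the bytes the header promises; ignore anything after them.
      <<payload::binary-size(payload_len), _trailing::binary>> ->
        {:ok, %{src_port: src_port, dst_port: dst_port, checksum: checksum, payload: payload}}

      _ ->
        {:error, :truncated}
    end
  end

  # Shorter than 8 bytes, or a length field smaller than the header itself.
  def parse(_), do: {:error, :invalid_header}
end
```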
The PNG Signature
Every PNG file starts with an eight-byte signature: 137 80 78 71 13 10 26 10. You verify it like this:
defmodule PNG do
  @signature <<137, 80, 78, 71, 13, 10, 26, 10>>

  def valid?(<<@signature, _rest::binary>>), do: true
  def valid?(_), do: false
end
Module attributes can hold binaries, and you can match against them in function heads. After the signature, PNG files are a series of chunks, each shaped like length::32, type::4-bytes, data::length-bytes, crc::32:
defmodule PNG.Chunks do
  def parse(<<137, 80, 78, 71, 13, 10, 26, 10, rest::binary>>) do
    parse_chunks(rest, [])
  end

  defp parse_chunks(<<>>, acc), do: Enum.reverse(acc)

  defp parse_chunks(
         <<length::32, type::binary-size(4), data::binary-size(length),
           _crc::32, rest::binary>>,
         acc
       ) do
    parse_chunks(rest, [{type, data} | acc])
  end
end
Notice data::binary-size(length) — the size of one segment is bound by a value matched earlier in the same head. This is the move that makes bit-syntax parsers truly powerful. You read a length field, then in the same head you grab exactly that many bytes. Without this you would need a multi-step parser with explicit state.
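The chunk parser above discards the CRC. As a sketch of how the same length-driven match can also validate a chunk, the hypothetical `PNGChunkCheck` module below compares the stored CRC with `:erlang.crc32/1`, relying on the fact that PNG computes its CRC-32 over the type and data fields but not the length.

```elixir
defmodule PNGChunkCheck do
  # Hypothetical helper: parses one chunk and verifies its CRC.
  def parse_chunk(
        <<len::32, type::binary-size(4), data::binary-size(len), crc::32, rest::binary>>
      ) do
    # :erlang.crc32/1 accepts iodata, so type and data need not be concatenated.
    if :erlang.crc32([type, data]) == crc do
      {:ok, {type, data}, rest}
    else
      {:error, {:bad_crc, type}}
    end
  end

  # Not enough bytes for a whole chunk.
  def parse_chunk(_), do: {:error, :truncated}
end

# The IEND chunk carries no data, so its twelve bytes are the same in every PNG.
iend = <<0::32, "IEND", 0xAE, 0x42, 0x60, 0x82>>
PNGChunkCheck.parse_chunk(iend)
# => {:ok, {"IEND", ""}, ""}
```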
The Recursive Parsing Pattern
The shape of every binary parser in Elixir is the same:
def parse(binary), do: parse(binary, [])

defp parse(<<>>, acc), do: Enum.reverse(acc)

defp parse(<<header::size, rest::binary>>, acc) do
  # ... extract a record from header (or header + part of rest) ...
  parse(rest, [record | acc])
end
Match the header, peel it off, push the record onto an accumulator, recurse on the remainder. The empty-binary clause terminates. The compiler turns this into a tight loop because tail calls are optimized and bit-syntax matching is one of the most heavily tuned operations in the BEAM.
This pattern is how Cowboy parses HTTP, how :gen_tcp-based servers parse line-delimited protocols, how MongoDB's BEAM driver decodes BSON, and how the various RabbitMQ and Kafka clients written in Elixir handle their wire formats.
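To make the template concrete, here is a small sketch for a made-up tag-length-value format (1-byte tag, 2-byte big-endian length, value). The `TLV` module and the record format are assumptions for illustration only; they do not describe any of the libraries mentioned above.

```elixir
defmodule TLV do
  # Hypothetical record format: 1-byte tag, 2-byte big-endian length, value.
  def parse(binary), do: parse(binary, [])

  # Nothing left: return the records in their original order.
  defp parse(<<>>, acc), do: {:ok, Enum.reverse(acc)}

  # Peel one record off the front and recurse on the remainder.
  defp parse(<<tag::8, len::16, value::binary-size(len), rest::binary>>, acc) do
    parse(rest, [{tag, value} | acc])
  end

  # Leftover bytes that do not form a whole record.
  defp parse(_other, _acc), do: {:error, :malformed}
end

TLV.parse(<<1, 0, 3, "abc", 2, 0, 0, 7, 0, 1, "z">>)
# => {:ok, [{1, "abc"}, {2, ""}, {7, "z"}]}
```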
A Custom Binary Protocol
Suppose you are designing a small protocol between an Elixir service and an embedded device. Every message has a 1-byte type, a 4-byte sequence number, a 2-byte payload length, and a variable-length payload.
defmodule MyProto do
  @msg_ping 0x01
  @msg_pong 0x02
  @msg_data 0x03

  def encode(:ping, seq), do: <<@msg_ping, seq::32, 0::16>>
  def encode(:pong, seq), do: <<@msg_pong, seq::32, 0::16>>

  def encode({:data, payload}, seq) do
    size = byte_size(payload)
    <<@msg_data, seq::32, size::16, payload::binary>>
  end

  def decode(<<@msg_ping, seq::32, 0::16, rest::binary>>) do
    {:ok, {:ping, seq}, rest}
  end

  def decode(<<@msg_pong, seq::32, 0::16, rest::binary>>) do
    {:ok, {:pong, seq}, rest}
  end

  def decode(<<@msg_data, seq::32, size::16, payload::binary-size(size),
               rest::binary>>) do
    {:ok, {:data, seq, payload}, rest}
  end

  def decode(_), do: {:error, :incomplete}
end
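The decoder returns {:ok, message, rest} so the caller can keep parsing if the buffer contains multiple messages. The catch-all decode/1 clause returns {:error, :incomplete} when the binary does not match, which is typical for stream-based protocols where the next chunk has not arrived yet. Note that this clause also fires for an unknown type byte or a ping with a nonzero length field, so a production decoder should distinguish malformed input from a short read.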
This is roughly the shape most Elixir TCP servers use. Receive bytes from :gen_tcp, accumulate them in a buffer, try to decode one message, repeat. If decode returns {:error, :incomplete}, wait for more bytes. Production servers add framing for very large payloads, but the core is this loop.
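The accumulate-and-decode loop can be sketched without a socket. The `FrameBuffer` module below is hypothetical; it uses the same wire layout as the protocol above (1-byte type, 4-byte sequence, 2-byte length, payload) but keeps the type as a raw integer so the block stays self-contained.

```elixir
defmodule FrameBuffer do
  # Hypothetical buffering layer for a type/seq/length/payload wire format.

  # Appends a newly received chunk to the buffer and decodes every complete
  # frame. Returns {frames, leftover_buffer}.
  def feed(buffer, chunk) when is_binary(buffer) and is_binary(chunk) do
    drain(buffer <> chunk, [])
  end

  defp drain(buffer, acc) do
    case decode(buffer) do
      {:ok, frame, rest} -> drain(rest, [frame | acc])
      {:error, :incomplete} -> {Enum.reverse(acc), buffer}
    end
  end

  # A complete frame: header plus exactly `size` payload bytes.
  defp decode(<<type::8, seq::32, size::16, payload::binary-size(size), rest::binary>>) do
    {:ok, {type, seq, payload}, rest}
  end

  # Fewer bytes than a full frame: keep them and wait for the next chunk.
  defp decode(_), do: {:error, :incomplete}
end

# One and a half frames arrive first, then the missing byte.
{frames1, buf} = FrameBuffer.feed(<<>>, <<1, 7::32, 0::16, 3, 8::32, 2::16, "h">>)
# frames1 => [{1, 7, ""}]
{frames2, buf} = FrameBuffer.feed(buf, "i")
# frames2 => [{3, 8, "hi"}], buf => ""
```

In a process that owns an active-mode socket, `feed/2` would be called with the data from each `{:tcp, socket, data}` message and the leftover buffer kept in the process state.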
Why This Is Fast
The BEAM has been optimizing binary handling for over two decades. A few things matter in practice:
Sub-binary references. When you match <<a::32, rest::binary>> on a 1 MB binary, rest does not copy the remaining bytes. It is a reference into the original binary with an offset. This makes peel-and-recurse parsers run in constant memory per step, not in O(n) per step.
Pattern-match compilation. The Erlang compiler turns a sequence of <<>> clauses into a decision tree that minimizes the work needed to dispatch. You write what looks like multiple full matches; the compiler emits something close to a switch statement.
JIT-friendly layout. The BEAM's JIT (enabled by default since OTP 24 on supported platforms) compiles binary-match instructions to native code that pulls bytes directly out of the binary's backing buffer. Profile a Cowboy or Phoenix process under load and you will typically see very little time spent in the request line parser.
This is part of why WhatsApp could run two million TCP connections per node on the BEAM. When per-connection parsing is this cheap, the bottleneck tends to be the network, not the parser.
Common Pitfalls
Forgetting that sub-binary references retain the parent. When you keep a small slice of a large binary alive, the BEAM keeps the whole binary alive too. If you parse 100 MB of input and stash a 10-byte field in long-lived state, you are holding 100 MB. Call :binary.copy/1 on the field to detach it before storing.
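The retention effect can be observed with `:binary.referenced_byte_size/1`, which reports the size of the underlying binary a value points into. A small sketch, using a 1 MB binary of zeros as a stand-in for real input:

```elixir
# Stand-in for a large input buffer: 1,000,000 zero bytes.
big = :binary.copy(<<0>>, 1_000_000)
<<field::binary-size(10), _rest::binary>> = big

# The slice is 10 bytes long but still points into the 1 MB parent.
byte_size(field)
# => 10
:binary.referenced_byte_size(field)
# => 1000000

# :binary.copy/1 produces a standalone 10-byte binary.
detached = :binary.copy(field)
:binary.referenced_byte_size(detached)
# => 10
```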
Matching variable-size segments without a known length. <<data::binary-size(n), rest::binary>> requires n to be already bound, either before the match or by an earlier segment in the same pattern. To match an unknown-length field that has no length prefix, you need a delimiter or a different strategy.
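For delimiter-based formats, one option is to split on the delimiter with `:binary.split/2` and keep the unterminated tail for the next read. The `Lines` module below is a hypothetical sketch for newline-terminated messages.

```elixir
defmodule Lines do
  # Hypothetical line-oriented splitter: returns the complete lines plus
  # whatever trailing bytes have not yet been terminated by "\n".
  def split(buffer), do: split(buffer, [])

  defp split(buffer, acc) do
    # :binary.split/2 splits at the first occurrence of the pattern only.
    case :binary.split(buffer, "\n") do
      [line, rest] -> split(rest, [line | acc])
      [partial] -> {Enum.reverse(acc), partial}
    end
  end
end

Lines.split("PING\nPONG\nPAR")
# => {["PING", "PONG"], "PAR"}
```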
Not handling partial frames. TCP delivers byte streams, not message boundaries. A single recv might give you half a message or one and a half messages. Your decoder needs to return :incomplete and your loop needs to accumulate. Forgetting this works in dev and breaks the moment a packet gets fragmented in production.
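Trying to use bit syntax on lists. Bit syntax works on binaries and bitstrings, not lists. If you have a list of bytes, convert it first with :erlang.list_to_binary/1 or IO.iodata_to_binary/1. A charlist holds code points rather than bytes, so use List.to_string/1 when it may contain characters above 255.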
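Building binaries with <> in a loop. The runtime optimizes appending to the end of a single accumulator, but other concatenation patterns, such as prepending or joining many intermediate binaries, copy data on every step. To build output incrementally, use an iolist (a nested list of binaries and bytes) and pass it to IO.iodata_to_binary/1 at the end, or pass it directly to most BEAM I/O functions, which accept iolists natively.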
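A minimal sketch of the iolist approach, assuming a made-up framing where each payload is prefixed by a 2-byte length (`FrameWriter` is a hypothetical name):

```elixir
defmodule FrameWriter do
  # Hypothetical encoder: length-prefixes each payload and returns iodata.
  def encode_all(payloads) do
    Enum.map(payloads, fn payload ->
      # A small header binary followed by the payload itself, uncopied.
      [<<byte_size(payload)::16>>, payload]
    end)
  end
end

iodata = FrameWriter.encode_all(["ab", "cde"])
IO.iodata_to_binary(iodata)
# => <<0, 2, 97, 98, 0, 3, 99, 100, 101>>
```

When the result goes straight to a socket or file, the flattening step can be skipped, since functions such as `:gen_tcp.send/2` accept iodata as is.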
Assuming the parser handles the whole input. Always include a catch-all clause that returns an error, otherwise an unexpected byte pattern crashes the process. In a GenServer, that crash kills your connection.
Key Takeaways
- The recursive parse-and-accumulate pattern is the shape of nearly every binary parser in Elixir.
- A size matched earlier in a clause can drive the size of a later segment in the same clause — this is what lets you parse length-prefixed records cleanly.
- Sub-binary references make peel-and-recurse parsers run without copying, but they hold the parent binary alive. Use :binary.copy/1 for long-lived slices.
- Real systems like Cowboy, RabbitMQ clients, and Discord's voice gateway rely on this pattern for high-throughput protocol work.
- TCP-style streaming parsers must handle incomplete frames; return :incomplete and accumulate until the next match succeeds.
- Build outgoing binaries with iolists, not repeated <> concatenation.