Strings vs Binaries

Every Elixir string is a binary. There is no separate string type, no string class, no boxing layer. A double-quoted literal produces a UTF-8 encoded binary, and that is what every string-handling function in the standard library expects. This unification is a quiet superpower — it means the same byte-level tools that parse network protocols also handle text, and the same memory shape that makes binaries cheap to slice makes strings cheap to slice too.

The confusion that follows from this is usually about counting. How long is a string? It depends on what you are counting — bytes, codepoints, or graphemes. Each has a different answer and a different function, and using the wrong one is the most common source of "why is this number weird" bugs in text-heavy Elixir code.

The Three Ways to Count

iex> byte_size("hello")
5

iex> String.codepoints("hello") |> length()
5

iex> String.length("hello")
5

For ASCII, all three agree. Then you add an accented character:

iex> byte_size("café")
5

iex> String.length("café")
4

The é takes two bytes in UTF-8 (0xC3 0xA9), so byte_size/1 reports 5. String.length/1 counts graphemes, the characters a reader perceives, and returns 4. The codepoint count is also 4 here, because this é is a single precomposed codepoint (U+00E9).

Then you add a combining character (or a multi-codepoint emoji) and the codepoint and grapheme counts diverge:

iex> s = "e\u0301"   # 'e' followed by a combining acute accent
"é"

iex> byte_size(s)
3

iex> String.codepoints(s) |> length()
2

iex> String.length(s)
1

That "é" is built from two codepoints, a base e and a combining accent, but renders as one grapheme cluster, which is what a user perceives as one character. String.codepoints/1 yields two elements because there are two codepoints. String.length/1 returns 1 because it counts graphemes, exactly as String.graphemes(s) |> length() would.

For UI work where you need to know how many "characters" a user will see, you want graphemes, which is what String.length/1 returns. For low-level byte budgets (database column lengths, network MTUs), you want bytes. Codepoint counts matter mainly when an external spec defines its limits in codepoints. For most ordinary text processing, String.length/1 is the right default.
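As a quick way to see all three numbers side by side, here is a minimal sketch. The module name TextCounts and the function measure/1 are hypothetical, invented for illustration; only byte_size/1, String.codepoints/1, and String.length/1 come from the standard library.

```elixir
defmodule TextCounts do
  # Hypothetical helper: reports the three counts for one string.
  def measure(string) when is_binary(string) do
    %{
      # Bytes in the underlying UTF-8 binary.
      bytes: byte_size(string),
      # Unicode codepoints.
      codepoints: string |> String.codepoints() |> length(),
      # User-perceived characters (grapheme clusters).
      graphemes: String.length(string)
    }
  end
end

TextCounts.measure("e\u0301")
# => %{bytes: 3, codepoints: 2, graphemes: 1}
```

Running it on your own input is usually the fastest way to find out which count a validation rule is actually using.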

byte_size Is O(1), String.length Is O(n)

byte_size/1 reads the binary's length header. It is constant time regardless of how long the binary is. String.length/1 walks the binary, decoding UTF-8 and grouping codepoints into graphemes, so it is linear time.

This rarely matters at human scale, but if you find yourself doing String.length(very_long_string) == 0, prefer byte_size(very_long_string) == 0. An empty string has zero bytes either way; the byte version is instant.

For the same reason, string == "" is the cleanest empty check in Elixir. It compiles to a binary comparison against a zero-byte literal.
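A related practical point: byte_size/1 is allowed in guards, while String.length/1 is not, so the constant-time check can sit directly in a function head. The sketch below assumes a hypothetical Username module; the names and return shapes are illustrative only.

```elixir
defmodule Username do
  # Hypothetical validator: rejects the empty string without scanning it.
  # byte_size/1 is guard-safe and constant time.
  def validate(name) when is_binary(name) and byte_size(name) > 0 do
    {:ok, name}
  end

  # The empty binary matches a literal pattern directly.
  def validate(""), do: {:error, :empty}
end

Username.validate("ada")
# => {:ok, "ada"}
Username.validate("")
# => {:error, :empty}
```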

Slicing Is Grapheme-Aware

String.slice/3 and String.at/2 work in graphemes, not bytes. You can pass them indices without worrying about cutting a multi-byte character in half:

iex> String.slice("café", 0, 3)
"caf"

iex> String.slice("café", 3, 1)
"é"

iex> String.at("café", 3)
"é"

If you need byte-level slicing — for example, you are reading a fixed-width binary field that happens to contain text — use binary_part/3:

iex> binary_part("café", 0, 3)
"caf"

iex> binary_part("café", 3, 2)
"é"

iex> binary_part("café", 3, 1)
<<195>>

The last one returns an invalid UTF-8 fragment because the byte boundary cut a codepoint in half. That is the kind of bug String.slice/3 exists to prevent.
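When a byte budget and human-readable text meet (a fixed-width column, say), one option is to measure in bytes but cut only on grapheme boundaries. The following is a sketch under that assumption; SafeTruncate and to_bytes/2 are hypothetical names, not part of the standard library.

```elixir
defmodule SafeTruncate do
  # Hypothetical helper: keeps as many whole graphemes as fit in max_bytes.
  def to_bytes(string, max_bytes)
      when is_binary(string) and is_integer(max_bytes) and max_bytes >= 0 do
    kept =
      string
      |> String.graphemes()
      |> Enum.reduce_while(0, fn grapheme, used ->
        next = used + byte_size(grapheme)
        # Stop before the first grapheme that would exceed the budget.
        if next <= max_bytes, do: {:cont, next}, else: {:halt, used}
      end)

    # kept always falls on a grapheme boundary, so this slice stays valid UTF-8.
    binary_part(string, 0, kept)
  end
end

SafeTruncate.to_bytes("café", 4)
# => "caf"
```

The é needs two bytes and only one is left in the budget, so it is dropped whole rather than split.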

Charlists: A List of Integers

Single-quoted literals are charlists, not strings. A charlist is a list of integer codepoints:

iex> ~c"hello"
~c"hello"

iex> ~c"hello" == [104, 101, 108, 108, 111]
true

iex> is_list(~c"hello")
true

iex> is_binary(~c"hello")
false

Elixir 1.15 and later display charlists with the ~c sigil to make them visually distinct from strings. In older code you will see plain single quotes, which mean the same thing.

Charlists exist because Erlang predates Unicode binaries. Erlang's standard library represents strings as lists of integers, and Elixir needs to interoperate with that. Functions like :os.cmd/1, :inet.gethostname/0, and many of the :string module's functions return charlists.

iex> :os.cmd(~c"echo hello")
~c"hello\n"

iex> {:ok, hostname} = :inet.gethostname()
{:ok, ~c"my-laptop"}

Convert at the boundary. Hold strings as binaries everywhere else.

iex> List.to_string(~c"hello")
"hello"

iex> to_string(~c"hello")
"hello"

iex> String.to_charlist("hello")
~c"hello"

The to_string/1 kernel function is the catch-all converter. It calls the String.Chars protocol, which any data type can implement. For charlists specifically, List.to_string/1 is the most direct.

When to Use Charlists

Almost never, in application code. The rules:

  1. When an Erlang library returns a charlist, convert it immediately and move on with a binary.
  2. When an Erlang library requires a charlist argument, pass a charlist — usually via the ~c sigil — at the call site.
  3. Otherwise, use binaries.
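The first two rules can be sketched as a thin wrapper around the Erlang calls mentioned earlier. The module name ErlangBoundary and its functions are hypothetical; the point is only that charlists appear inside the wrapper and nowhere else.

```elixir
defmodule ErlangBoundary do
  # Rule 1: an Erlang function returns a charlist, so convert immediately.
  def hostname do
    {:ok, charlist} = :inet.gethostname()
    List.to_string(charlist)
  end

  # Rule 2: an Erlang function requires a charlist, so convert at the call site.
  # The caller passes and receives binaries only.
  def run(command) when is_binary(command) do
    command
    |> String.to_charlist()
    |> :os.cmd()
    |> List.to_string()
  end
end
```

Callers of ErlangBoundary.hostname/0 and ErlangBoundary.run/1 never see a charlist, which keeps the rest of the application on binaries.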

You will see new Elixir developers write things like 'admin' = role and wonder why their pattern match fails when role is "admin". Single quotes and double quotes are not interchangeable. Recent Elixir versions warn on single-quoted literals and suggest either ~c"..." (if you really want a charlist) or double quotes (if you want a string), exactly because this confusion is so common.

Non-UTF-8 Binaries

Not every binary is a UTF-8 string. You can have binaries that are sensor readings, compressed data, encrypted blobs, or text in another encoding. The String module's functions assume UTF-8 and will misbehave on other encodings.

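If you have a binary in Latin-1, GB18030, or Shift-JIS (common in legacy protocols, old Windows systems, certain Asian financial messages), convert it to UTF-8 before doing string operations. The :unicode module from Erlang's stdlib handles Latin-1 and the UTF-16 and UTF-32 families; other encodings such as GB18030 or Shift-JIS need a dedicated conversion library. For Latin-1: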

iex> utf8_text = :unicode.characters_to_binary(<<99, 97, 102, 233>>, :latin1, :utf8)
"café"

iex> byte_size(utf8_text)
5

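The single byte 233 (0xE9) is é in Latin-1, but in UTF-8 it is the lead byte of a three-byte sequence, so on its own it is invalid. Pass raw Latin-1 bytes to String functions and the results are unreliable: counts may look plausible while the text is garbage, and some functions, such as String.to_charlist/1, raise. Convert first.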
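:unicode.characters_to_binary/3 returns a plain binary on success and an {:error, ...} or {:incomplete, ...} tuple on failure, so a boundary function usually wants to normalize those shapes. Here is one possible sketch; the module name Transcode and the error atoms are assumptions made for illustration.

```elixir
defmodule Transcode do
  # Hypothetical helper: converts bytes in a :unicode-supported encoding
  # (:latin1, :utf8, {:utf16, :big}, and so on) to a UTF-8 binary.
  def to_utf8(bytes, from) when is_binary(bytes) do
    case :unicode.characters_to_binary(bytes, from, :utf8) do
      utf8 when is_binary(utf8) -> {:ok, utf8}
      # Input contained a sequence that is illegal in the source encoding.
      {:error, _converted, _rest} -> {:error, :invalid_input}
      # Input ended in the middle of a multi-byte sequence.
      {:incomplete, _converted, _rest} -> {:error, :incomplete_input}
    end
  end
end

Transcode.to_utf8(<<99, 97, 102, 233>>, :latin1)
# => {:ok, "café"}
```

Every byte is a valid Latin-1 character, so the error branches matter mostly for UTF-8 and UTF-16 input.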

For non-text binaries, do not call String functions at all. Use byte_size/1, binary_part/3, and bit syntax. Treat them as the bag of bytes they are.

Checking for Valid UTF-8

If you receive bytes from the network or a file and you do not know whether they are valid UTF-8, ask:

iex> String.valid?("café")
true

iex> String.valid?(<<99, 97, 102, 233>>)
false

String.valid?/1 walks the binary and confirms every byte sequence is a legal UTF-8 codepoint. For inputs you do not control, this is the gate before calling String.length/1 or String.slice/3.
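In practice the gate is often a small function at the edge of the system. A minimal sketch, assuming a hypothetical Inbound module and an :invalid_utf8 error atom chosen for this example:

```elixir
defmodule Inbound do
  # Hypothetical gate: only valid UTF-8 reaches the String functions.
  def text_length(bytes) when is_binary(bytes) do
    if String.valid?(bytes) do
      {:ok, String.length(bytes)}
    else
      {:error, :invalid_utf8}
    end
  end
end

Inbound.text_length("café")
# => {:ok, 4}
Inbound.text_length(<<99, 97, 102, 233>>)
# => {:error, :invalid_utf8}
```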

IO Lists for Building Strings

When you are building a string from many parts (a templated email, an HTML response, a CSV row), repeated <> concatenation can end up copying the accumulated data again and again. The idiomatic Elixir approach is an iolist:

greeting = ["Hello, ", name, ", you have ", to_string(count), " messages."]

This is iodata: a list, possibly nested, of binaries and byte-valued integers. You can pass it to IO.puts/2, IO.write/2, :gen_tcp.send/2, or File.write/2 directly, because they all accept iodata. To get a flat binary, use IO.iodata_to_binary/1:

IO.iodata_to_binary(["Hello, ", name, "!"])
# "Hello, Ada!"

Phoenix's response rendering is built on iolists end to end. The template engine produces nested lists of strings, and the framework hands the whole thing to the socket without ever flattening. That is part of why Phoenix has a reputation for fast response times under load.
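To illustrate the nesting, here is a small sketch that builds CSV output as iodata. The Csv module is hypothetical and deliberately naive (no quoting or escaping of fields); it only shows that parts can be nested freely and flattened once, if at all.

```elixir
defmodule Csv do
  # Hypothetical, naive row builder: no quoting or escaping of fields.
  def row(fields), do: [Enum.intersperse(fields, ","), "\n"]

  # A list of rows is itself valid iodata; nothing is concatenated here.
  def rows(list), do: Enum.map(list, &row/1)
end

iodata = Csv.rows([["id", "name"], ["1", "Ada"]])

IO.iodata_length(iodata)
# => 14

IO.iodata_to_binary(iodata)
# => "id,name\n1,Ada\n"
```

The same iodata value could be handed to File.write/2 or IO.write/2 without the final flattening step.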

Common Pitfalls

Using byte_size/1 when you meant String.length/1. If your "max 50 character" validation rejects emoji-heavy input as too long while accepting ASCII of the same character count, you are counting bytes. Switch to String.length/1 for user-perceived characters, or String.codepoints/1 |> length/1 if the limit is defined in codepoints.

Slicing strings with binary_part/3. You can cut a multi-byte character in half and produce an invalid UTF-8 binary. Use String.slice/3 for any string a human will read.

Mixing single and double quotes by accident. 'admin' == "admin" is false. The first is a charlist, the second a binary. Recent Elixir versions warn on single-quoted literals; heed the warning and choose ~c"..." or double quotes explicitly.

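Calling String.to_atom/1 on untrusted input. Atoms are not garbage collected and the atom table has a fixed limit, so anyone who controls the input can fill the table and crash the VM. This comes up regularly in security reviews of Elixir code. Use String.to_existing_atom/1 if you must, and prefer not converting at all.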

Treating non-UTF-8 binaries as strings. Latin-1 bytes look fine until you hit a non-ASCII character, then String functions produce garbage or raise. Convert with :unicode.characters_to_binary/3 at the boundary.

Building large strings with <>. Repeated concatenation can copy the accumulated data over and over. Build an iolist and flatten once at the end, or pass the iolist directly to whatever consumes it.
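For the String.to_atom/1 pitfall in particular, one common alternative is an explicit allow-list, so untrusted input can only ever map to atoms that already exist in the code. A minimal sketch, with a hypothetical Role module and an invented set of roles:

```elixir
defmodule Role do
  # Hypothetical allow-list: the only atoms that input can ever produce.
  @roles %{"admin" => :admin, "editor" => :editor, "viewer" => :viewer}

  # Returns {:ok, atom} for known roles and :error for anything else.
  # No new atom is created at runtime, whatever the input is.
  def parse(input) when is_binary(input), do: Map.fetch(@roles, input)
end

Role.parse("admin")
# => {:ok, :admin}
Role.parse("root")
# => :error
```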

Key Takeaways

  • Every Elixir string is a UTF-8 encoded binary. There is no separate string type.
  • byte_size/1 counts bytes (O(1)), String.codepoints/1 |> length/1 counts codepoints, and String.length/1 counts graphemes, the user-perceived characters (both O(n)).
  • String.slice/3 is grapheme-aware; binary_part/3 is byte-aware. Use the former for human-readable text.
  • Charlists (single-quoted, list of integers) exist for Erlang interop. Convert at the boundary; do not propagate them through application code.
  • For text in an encoding :unicode supports (such as Latin-1), convert with :unicode.characters_to_binary/3 before touching String functions. For non-text binaries, stay with byte-level tools.
  • Build large strings as iolists and pass them directly to IO functions — Phoenix scales partly because of this pattern.