Usability Testing

Usability testing is the practice of watching real people try to use your product while you observe. Not asking them what they think. Not showing them a demo. Watching them — silently — as they try to complete a real task. It is the fastest way to destroy your illusions about how "intuitive" your product is.

You will sit there while a user stares at the screen, clicks the wrong button, backtracks, reads the help text, ignores the help text, tries something random, and eventually either figures it out or gives up. The experience is humbling. It is also the single most effective method for identifying usability problems before they cost you users.

The Basics

5 Users, 30 Minutes Each

Research by Jakob Nielsen and Tom Landauer found that five users uncover approximately 85% of usability problems in a given flow, assuming each user surfaces about 31% of them on average. This is not a magic number, but it is a widely used guideline. After five users, you start hearing the same problems repeated. The marginal return of users six through ten is low for most studies.
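
The 85% figure comes from a simple discovery model: if each participant independently uncovers a fixed share p of the problems, then n participants are expected to uncover 1 - (1 - p)^n of them. The sketch below assumes p = 0.31, the average Nielsen reports across the projects he studied. The function name and default are illustrative, and for a product where p is lower, five users will find less.

```python
def share_of_problems_found(n_users: int, p_per_user: float = 0.31) -> float:
    """Expected share of usability problems found by n_users.

    Assumes every user independently uncovers the same share p_per_user
    of the problems. The 0.31 default is an assumption taken from
    Nielsen's reported average; your product may differ.
    """
    if n_users < 0:
        raise ValueError("n_users must be non-negative")
    if not 0.0 <= p_per_user <= 1.0:
        raise ValueError("p_per_user must be between 0 and 1")
    return 1.0 - (1.0 - p_per_user) ** n_users


# Show how quickly the curve flattens as participants are added.
for n in (1, 3, 5, 10):
    print(f"{n:>2} users: {share_of_problems_found(n):.0%}")
```

Under these assumptions five users land at roughly 84%, and doubling to ten users only adds about 13 percentage points, which is the diminishing return described above.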

Testing economics:
  5 users x 30 minutes each = 2.5 hours of testing
  + 2 hours of preparation
  + 2 hours of synthesis
  = ~6.5 hours total

  Compare to: shipping a confusing feature, watching 40% of users
  drop off, spending 3 sprints redesigning, and shipping again.

  6.5 hours of usability testing saves weeks of rework.

What Usability Testing Is Not

Usability testing IS:
  - Watching users attempt real tasks
  - Observing behavior (what they do)
  - Finding problems in the interface
  - Evaluative: testing something that exists (or a prototype)

Usability testing IS NOT:
  - Asking users what they want (that's an interview)
  - A/B testing (that's quantitative optimization)
  - A focus group (that's group opinion, not individual behavior)
  - A demo (you show nothing; the user drives)
  - Beta testing (that's broader, unstructured feedback)

Task-Based Testing

The core of usability testing is giving users realistic tasks and watching them try to complete those tasks without help.

Writing Good Tasks

Tasks should be realistic, specific, and self-contained. Do not tell users how to accomplish the task — that defeats the purpose.

Bad tasks (too vague or leading):
  "Explore the dashboard."
  "Use the search feature to find a flight."
  "Click on Settings and update your profile."

Good tasks (realistic scenarios):
  "You want to fly from New York to London next Tuesday,
   returning the following Monday. Find a flight."

  "Your boss asked you to create a monthly sales report
   for the East region. Generate that report."

  "You just hired a new team member named Alex Chen.
   Add them to your team with editor permissions."

  "You received a notification about a billing issue.
   Find out what the issue is and resolve it."

Good tasks are framed as situations, not instructions. They describe what the user wants to accomplish, not how to accomplish it. This reveals whether the interface communicates the path.

Task Difficulty Progression

Start with easier tasks and build to harder ones. This helps users warm up and builds their confidence before they hit the tricky parts.

Task order for a project management tool:
  1. Easy:   Create a new project called "Q3 Marketing Plan"
  2. Medium: Add 3 tasks to the project with due dates
  3. Medium: Assign one task to a teammate
  4. Hard:   Set up a recurring weekly status update
  5. Hard:   Find all tasks assigned to you across all projects
             that are overdue

How Many Tasks

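Five to seven tasks per 30-minute session is typical. Each task should take 2-5 minutes, but not every task can sit at the top of that range: seven 5-minute tasks would already overrun the session. Leave time for the think-aloud protocol, follow-up questions, and the post-test debrief.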
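
A quick way to sanity-check a session plan is to add the task time to the fixed overhead. The sketch below does that arithmetic. The 3-minute introduction is an assumption, the 5-minute debrief matches the post-test debrief length used later in this guide, and both function names are hypothetical.

```python
def session_minutes(n_tasks: int, minutes_per_task: float,
                    intro_minutes: float = 3.0,
                    debrief_minutes: float = 5.0) -> float:
    """Total planned session length: intro + tasks + debrief."""
    return intro_minutes + n_tasks * minutes_per_task + debrief_minutes


def fits_session(n_tasks: int, minutes_per_task: float,
                 session_limit: float = 30.0) -> bool:
    """True if the plan fits inside the session limit (default overhead)."""
    return session_minutes(n_tasks, minutes_per_task) <= session_limit


# Three candidate plans for a 30-minute session.
for n_tasks, per_task in ((5, 4), (7, 3), (7, 5)):
    total = session_minutes(n_tasks, per_task)
    verdict = "fits" if fits_session(n_tasks, per_task) else "overruns"
    print(f"{n_tasks} tasks x {per_task} min = {total:.0f} min total ({verdict})")
```

With this overhead, five 4-minute tasks (28 minutes) and seven 3-minute tasks (29 minutes) fit, while seven 5-minute tasks (43 minutes) do not.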

Think-Aloud Protocol

Ask users to verbalize their thoughts as they work through each task. This is the think-aloud protocol, and it is the most important technique in usability testing.

Instructions to the participant:
  "As you work through each task, please think out loud.
   Tell me what you're looking at, what you're thinking,
   what you expect to happen, and what confuses you.
   There's no wrong answer — we're testing the product,
   not you."

What think-aloud sounds like:
  "Okay, I need to add a team member... I'm looking for something
   like 'Team' or 'Members' in the navigation... I see 'Settings,'
   maybe it's in there... [clicks Settings] Hmm, I see 'Account,'
   'Billing,' 'Integrations'... no 'Team' option. Let me go back.
   Maybe it's under this people icon? [clicks] Oh wait, this is
   just my profile. I'm stuck."

Think-aloud gives you access to the user's mental model — what they expect, what they look for, and where the interface violates their expectations. Without it, you can see that they clicked the wrong button, but you cannot see why.

When Think-Aloud Fails

Some participants go silent under pressure. Gentle prompts help:

Useful prompts (non-leading):
  "What are you thinking right now?"
  "What are you looking for?"
  "What did you expect to happen?"
  "Tell me what you see on this screen."

Prompts to avoid (leading):
  "Did you try clicking that button?"
  "The answer is in the top menu."
  "Most people find it under Settings."

Never help the user unless they are completely stuck and visibly distressed. The point is to observe failure, not prevent it. If they cannot complete the task, that is a finding — arguably the most important kind.

Running the Session

The Introduction

Moderator script (adapt to your style):
  "Thanks for being here. We're going to look at [product/prototype]
   today, and I'd like you to try a few tasks while thinking out loud.

   A few important things:
   - We're testing the product, not you. There are no wrong answers.
   - If something is confusing, that's the product's fault, not yours.
   - I might stay quiet while you work. That's not because you're
     doing anything wrong — I just want to see how you'd naturally
     approach things.
   - You can stop at any time if you're uncomfortable.

   Do you have any questions before we start?"

During the Session

Moderator behavior:
  DO:
  - Stay neutral (no facial reactions to mistakes)
  - Take timestamped notes
  - Ask follow-up questions AFTER the task, not during
  - Let the user struggle (this is where insights come from)
  - Thank them after each task, whether or not they completed it

  DO NOT:
  - Help them unless they explicitly ask and are stuck
  - React when they miss something obvious
  - Explain how the feature works
  - Say "that's correct" or "good job" (creates performance anxiety)
  - Check your phone or look disengaged

After Each Task

Post-task questions:
  "How did that go?"
  "Was there anything confusing about that?"
  "How difficult was that on a scale of 1-5?"
     (1 = very easy, 5 = very difficult)
  "Is that how you expected it to work?"

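The post-task difficulty rating gives you a quantitative measure to complement the qualitative observation. It is a simplified form of the Single Ease Question (SEQ), which is conventionally asked on a 7-point scale from very difficult (1) to very easy (7).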
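
The difficulty ratings are easiest to use once they are summarized per task. The following is a minimal sketch using the 1-5 scale from the post-task questions (1 = very easy, 5 = very difficult). The task names and ratings are made-up illustration data, and `summarize` is a hypothetical helper.

```python
from statistics import mean, median

# Hypothetical ratings: task name -> one rating per participant
# (1 = very easy, 5 = very difficult).
ratings = {
    "Create project": [1, 1, 2, 1, 2],
    "Add team member": [4, 5, 3, 5, 4],
    "Recurring status update": [3, 4, 4, 2, 5],
}


def summarize(ratings):
    """Return (task, mean, median) tuples, hardest task first."""
    rows = [(task, mean(scores), median(scores))
            for task, scores in ratings.items()]
    return sorted(rows, key=lambda row: row[1], reverse=True)


for task, avg, med in summarize(ratings):
    print(f"{task:<26} mean={avg:.1f}  median={med}")
```

With only five participants, treat these numbers as a way to rank tasks against each other rather than as precise measurements.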

After the Session

Post-test debrief (5 minutes):
  "Now that you've used the product, what stands out?"
  "What was the most frustrating part?"
  "What, if anything, did you like?"
  "Is there anything else you'd like to share?"

Remote Usability Testing

Remote testing has become the default for many product teams. It eliminates travel, expands the participant pool, and produces recordings that the whole team can review.

Moderated Remote Testing

The moderator joins a video call with the participant, who shares their screen. The moderator observes in real time and can ask questions.

Tools: Zoom, Google Meet, Lookback, UserTesting.com (moderated mode)

Pros:
  - Real-time observation and follow-up questions
  - Can adapt tasks based on what you observe
  - Builds rapport, richer qualitative data

Cons:
  - Scheduling overhead
  - Limited by participants' time zones and availability
  - Technical issues (screen sharing, audio)

Best for: Complex flows, new concepts, enterprise products

For simple flows or quick validation, unmoderated remote testing (UserTesting.com, Maze, Lyssna) skips the moderator entirely — participants complete tasks on their own, recorded for later review. It is faster and cheaper but loses the ability to ask follow-up questions.

Analyzing Results

Severity Rating

Not all usability problems are equal. Rate each finding by severity to prioritize fixes.

Severity scale:
  Critical (4): User cannot complete the task. Prevents core
                functionality. Must fix before launch.
                Example: Checkout button does not work on mobile.

  Major (3):    User can complete the task but with significant
                difficulty or frustration. Many users will give up.
                Example: Required field is not labeled as required.

  Minor (2):    User notices the issue but can work around it.
                Causes mild frustration, not failure.
                Example: Button label is ambiguous but clickable.

  Cosmetic (1): User might not notice. Does not affect task
                completion or satisfaction.
                Example: Inconsistent icon style on settings page.

Frequency Matrix

Track which users hit which problems:

Problem                          U1  U2  U3  U4  U5  Frequency
-------------------------------------------------------------------
Could not find "Add member"       X   X   X       X    4/5 (80%)
Confused by permission levels     X       X   X        3/5 (60%)
Missed confirmation message           X           X    2/5 (40%)
Expected autosave, lost work      X                    1/5 (20%)

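Problems that affect 3+ out of 5 users are almost certainly real issues that will affect a large portion of your user base. Problems that affect 1 out of 5 may be edge cases or individual differences, but check their severity before dismissing them: a problem that only one user hits can still be serious if it causes lost work.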
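
The matrix can be built directly from session notes. The sketch below assumes the notes have been reduced to one set of problems per participant, using the same data as the table above. The `observations` structure and `frequency_table` helper are illustrative, not part of any tool.

```python
from collections import Counter

# Participant -> set of problems that participant hit (same data as the matrix).
observations = {
    "U1": {"Could not find 'Add member'", "Confused by permission levels",
           "Expected autosave, lost work"},
    "U2": {"Could not find 'Add member'", "Missed confirmation message"},
    "U3": {"Could not find 'Add member'", "Confused by permission levels"},
    "U4": {"Confused by permission levels"},
    "U5": {"Could not find 'Add member'", "Missed confirmation message"},
}


def frequency_table(observations):
    """Return (problem, count, share) tuples, most frequent problem first."""
    counts = Counter(problem
                     for problems in observations.values()
                     for problem in problems)
    total = len(observations)
    return [(problem, n, n / total) for problem, n in counts.most_common()]


for problem, n, share in frequency_table(observations):
    print(f"{problem:<32} {n}/{len(observations)} ({share:.0%})")
```

Using a set per participant means a problem is counted once per person, even if that person ran into it on several tasks.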

Presenting Findings

Format for each finding:
  Problem:     [What the issue is]
  Severity:    [Critical / Major / Minor / Cosmetic]
  Frequency:   [X out of 5 users]
  Evidence:    [What you observed + key quote]
  Recommendation: [Suggested fix]

Example:
  Problem:     Users struggle to find the "Add team member" function
  Severity:    Major
  Frequency:   4 out of 5 users
  Evidence:    All 4 users looked under "Settings" first. User 3
               said "I assumed team management would be in Settings,
               that's where it is in every other tool I use."
  Recommendation: Add "Team" as a top-level navigation item, or add
                  an "Add member" shortcut to the team member list view.
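
If findings are kept as structured records, the report format and the priority order can both be generated from them. This is a sketch under stated assumptions: the `Severity` values mirror the severity scale above, while the `Finding` class, the `prioritize` helper, and the severities assigned to the sample findings are hypothetical.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    """Severity scale from this guide (higher = worse)."""
    COSMETIC = 1
    MINOR = 2
    MAJOR = 3
    CRITICAL = 4


@dataclass
class Finding:
    problem: str
    severity: Severity
    users_affected: int
    users_tested: int
    evidence: str
    recommendation: str

    def render(self) -> str:
        """Format the finding using the report layout shown above."""
        return "\n".join([
            f"Problem:        {self.problem}",
            f"Severity:       {self.severity.name.title()}",
            f"Frequency:      {self.users_affected} out of {self.users_tested} users",
            f"Evidence:       {self.evidence}",
            f"Recommendation: {self.recommendation}",
        ])


def prioritize(findings):
    """Order findings by severity first, then by number of users affected."""
    return sorted(findings,
                  key=lambda f: (f.severity, f.users_affected),
                  reverse=True)


# Illustrative findings; severities and placeholder text are assumptions.
findings = [
    Finding("Missed confirmation message", Severity.MINOR, 2, 5,
            "(placeholder evidence)", "(placeholder recommendation)"),
    Finding('Could not find "Add member"', Severity.MAJOR, 4, 5,
            'All 4 users looked under "Settings" first.',
            'Add "Team" as a top-level navigation item.'),
    Finding("Confused by permission levels", Severity.MAJOR, 3, 5,
            "(placeholder evidence)", "(placeholder recommendation)"),
]

for finding in prioritize(findings):
    print(finding.render(), end="\n\n")
```

Sorting on severity before frequency is one reasonable choice; a team that prefers to weight frequency more heavily can swap the two keys.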

Common Pitfalls

Helping the user — the moment you say "try clicking that button," the test is compromised. The user's struggle is the data. Let them struggle.
Testing with colleagues or friends — they know too much about the product and will behave differently from real users. Always test with people who represent your actual target audience.
Writing tasks as instructions — "Click on Reports, then select Monthly, then export as PDF" tests whether users can follow directions, not whether the interface is usable. Write tasks as goals.
Running tests but not fixing anything — a usability test that does not lead to changes is a waste of everyone's time. Prioritize the findings and fix the critical and major issues.
Testing too late — testing after the feature is built and shipped means fixes require rework. Test on prototypes when changes are cheap.
Only testing happy paths — real users make mistakes, misunderstand instructions, and use unexpected workflows. Include tasks that test error recovery and edge cases.
Skipping the think-aloud protocol — without think-aloud, you see what users do but not why. The "why" is where the actionable insights live.
Over-designing the test — you do not need a lab, a one-way mirror, or specialized equipment. A Zoom call with screen sharing and five willing participants is enough to find critical problems.

Key Takeaways

Five users and 30 minutes each will reveal the majority of usability problems in a given flow. The time investment is trivial compared to shipping a confusing product.
Write tasks as realistic scenarios, not step-by-step instructions. The user should figure out the path, not follow directions.
Use the think-aloud protocol to access the user's mental model. What they say while struggling is more valuable than whether they succeed.
Test on prototypes before code is written. Fixing a Figma mockup costs an afternoon; fixing production code costs a sprint.
Do not help the user during the test unless they are completely stuck and visibly distressed. Their confusion, frustration, and failure are the findings. If you prevent failure, you prevent learning.
Rate findings by severity and frequency. Fix critical and major issues immediately. Track minor issues for future improvement.