Few-Shot & Chain-of-Thought

Overview

Zero-shot, few-shot, and chain-of-thought are three foundational prompting techniques. They form a progression: zero-shot is the simplest, few-shot adds examples, and chain-of-thought adds reasoning steps. Knowing when to use each saves time and can significantly improve output quality.

Few-shot prompting is one of the most underused techniques in production AI systems. Adding 3-5 examples to a prompt often produces a bigger quality improvement than switching to a more expensive model.

Zero-Shot Prompting

Zero-shot means giving the model a task with no examples. You describe what you want and the model figures out how to do it.

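# Zero-shot: just ask (assumes the official openai Python SDK, v1 or later)
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "Classify the following text as either 'spam' or 'not_spam'. Return only the label."},
        {"role": "user", "content": "Congratulations! You've won a free iPhone. Click here to claim your prize!"}
    ],
    temperature=0
)
print(response.choices[0].message.content)
# Expected output: "spam"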

When Zero-Shot Works

Simple, well-defined tasks (yes/no, single classification)
Tasks the model has seen extensively in training (translation, summarization)
When the output format is straightforward
When you need to move fast and good-enough is acceptable

When Zero-Shot Fails

Ambiguous tasks where "correct" depends on your specific context
Domain-specific formatting requirements
Tasks that require understanding your conventions or edge cases
Classification with categories the model has not seen before

Few-Shot Prompting

Few-shot means including examples of the input-output pairs you expect. The model learns the pattern from your examples and applies it to new inputs.

system_prompt = """Classify customer support emails into categories.
Return a JSON object with 'category' and 'priority'.

Examples:

Input: "I was charged twice for my last order #5821"
Output: {"category": "billing", "priority": "high"}

Input: "How do I change my shipping address?"
Output: {"category": "account", "priority": "low"}

Input: "The widget I received is broken and missing a piece"
Output: {"category": "product_quality", "priority": "high"}

Input: "Do you ship to Canada?"
Output: {"category": "shipping", "priority": "low"}

Input: "I've been waiting 3 weeks for my order and tracking shows no movement"
Output: {"category": "shipping", "priority": "high"}
"""

user_prompt = "My account was hacked and someone placed orders using my saved payment method"
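# Possible output: {"category": "security", "priority": "critical"}
# Neither "security" nor "critical" appears in the examples. Without an
# explicit list of allowed labels, the model is free to invent new ones.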
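The examples only ever show four categories and two priorities, so a reply like the one above is a label the rest of your pipeline may not expect. A minimal sketch of checking the model's reply in code, assuming the allowed label sets are exactly the ones used in the examples (your real taxonomy may differ):

```python
import json

# Label sets taken from the five examples above. This is an assumption:
# replace them with your real taxonomy and list them in the prompt too.
ALLOWED_CATEGORIES = {"billing", "account", "product_quality", "shipping"}
ALLOWED_PRIORITIES = {"low", "high"}


def parse_classification(raw: str) -> dict:
    """Parse the model's JSON reply and reject labels outside the allowed sets."""
    data = json.loads(raw)
    if data.get("category") not in ALLOWED_CATEGORIES:
        raise ValueError(f"unexpected category: {data.get('category')!r}")
    if data.get("priority") not in ALLOWED_PRIORITIES:
        raise ValueError(f"unexpected priority: {data.get('priority')!r}")
    return data


print(parse_classification('{"category": "billing", "priority": "high"}'))
try:
    parse_classification('{"category": "security", "priority": "critical"}')
except ValueError as err:
    print("rejected:", err)
```

If security incidents are a real category for you, the fix is to add "security" and "critical" to both the prompt's label list and a worked example, not just to the validator.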

Why Few-Shot Works So Well

The examples communicate several things simultaneously:

Output format: The model sees the exact JSON structure you expect
Category definitions: The model infers what each category means from examples, not from descriptions
Priority logic: The model learns that financial impact and urgency drive priority
Tone and style: The model mirrors the style of your example outputs
Edge case handling: Examples of tricky cases teach the model your decision boundaries

How Many Examples

Number of examples vs quality improvement (typical pattern):

  0 examples (zero-shot):    Baseline quality
  1 example:                 Significant improvement in format consistency
  3 examples:                Major quality jump; often the sweet spot
  5 examples:                Diminishing returns begin
  10+ examples:              Marginal gains, token cost increases
  
Recommendation: Use 3-5 examples for most tasks.
Include at least one example per output category.
Include at least one "tricky" example that shows edge case handling.
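The curve above is a typical pattern, not a guarantee, so it is worth measuring on your own task. A minimal sketch of an accuracy sweep over the number of examples, assuming a hypothetical `classify(system_prompt, user_text)` function that wraps your model call and returns a label string:

```python
from typing import Callable, Sequence

Example = tuple[str, str]  # (input text, expected output label)


def build_prompt(instructions: str, shots: Sequence[Example]) -> str:
    """Render instructions plus k Input/Output pairs in the style used above."""
    parts = [instructions, ""]
    for text, label in shots:
        parts.append(f'Input: "{text}"')
        parts.append(f"Output: {label}")
        parts.append("")
    return "\n".join(parts)


def accuracy_by_k(
    classify: Callable[[str, str], str],  # hypothetical: (system_prompt, user_text) -> label
    instructions: str,
    pool: Sequence[Example],
    eval_set: Sequence[Example],
    ks: Sequence[int] = (0, 1, 3, 5),
) -> dict[int, float]:
    """Score the same held-out set using the first k examples from the pool."""
    results = {}
    for k in ks:
        prompt = build_prompt(instructions, pool[:k])
        correct = sum(classify(prompt, text) == label for text, label in eval_set)
        results[k] = correct / len(eval_set)
    return results
```

Because `pool[:k]` takes examples in order, put one example per category (and your tricky case) near the front of the pool so small values of k still cover every label. Keep the evaluation set separate from the pool.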

Choosing Good Examples

Example selection guidelines:

DO include:
  - One example per category/class
  - Edge cases that are commonly misclassified
  - Examples that demonstrate your specific conventions
  - Examples with varying input lengths

DO NOT include:
  - Only easy, obvious examples
  - Examples that are all the same category
  - Very long examples (wastes tokens, dilutes signal)
  - Examples with errors in the output
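Some of these guidelines can be checked mechanically before a prompt ships. A minimal sketch, assuming examples are stored as `(input_text, label)` pairs; the `max_chars` cap is an arbitrary assumption, and problems like wrong outputs still need a human review:

```python
from collections import Counter


def review_examples(examples, categories, max_chars=300):
    """Flag example sets that break the selection guidelines.

    examples: list of (input_text, label) pairs.
    categories: every label the task can produce.
    max_chars: hypothetical length cap for a single example input.
    """
    problems = []
    counts = Counter(label for _, label in examples)
    missing = sorted(set(categories) - set(counts))
    if missing:
        problems.append(f"no example for: {', '.join(missing)}")
    if len(counts) == 1:
        problems.append("all examples share one label")
    for text, _ in examples:
        if len(text) > max_chars:
            problems.append(f"example longer than {max_chars} chars: {text[:40]!r}...")
    return problems


print(review_examples([("a", "x"), ("b", "x")], ["x", "y"]))
# ['no example for: y', 'all examples share one label']
```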

Real-World Example: Entity Extraction

system_prompt = """Extract product mentions from customer reviews.
Return a JSON array of objects with 'product', 'sentiment', and 'aspect'.

Examples:

Input: "The battery life on the XPS 15 is incredible, easily lasts 12 hours"
Output: [{"product": "XPS 15", "sentiment": "positive", "aspect": "battery_life"}]

Input: "I returned the AirPods Pro because the noise cancellation kept cutting out, but the sound quality was excellent"
Output: [
  {"product": "AirPods Pro", "sentiment": "negative", "aspect": "noise_cancellation"},
  {"product": "AirPods Pro", "sentiment": "positive", "aspect": "sound_quality"}
]

Input: "Nothing special about it tbh"
Output: []
"""

user_prompt = "The Pixel 8 camera is amazing in low light but the phone gets hot during video calls"
# Output: [
#   {"product": "Pixel 8", "sentiment": "positive", "aspect": "camera"},
#   {"product": "Pixel 8", "sentiment": "negative", "aspect": "thermal"}
# ]

The third example (empty output) is critical. Without it, the model might force-extract entities from text that has none.
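The same idea applies when parsing the reply: downstream code should treat `[]` as a valid answer rather than an error. A minimal sketch, assuming the model returns bare JSON and that only "positive" and "negative" sentiments are allowed (the examples never show "neutral"):

```python
import json

VALID_SENTIMENTS = {"positive", "negative"}  # assumption based on the examples
EXPECTED_KEYS = {"product", "sentiment", "aspect"}


def parse_mentions(raw: str) -> list:
    """Parse the model's JSON array; an empty array means 'no product mentions'."""
    mentions = json.loads(raw)
    if not isinstance(mentions, list):
        raise ValueError("expected a JSON array")
    for m in mentions:
        if not isinstance(m, dict) or set(m) != EXPECTED_KEYS:
            raise ValueError(f"unexpected item: {m!r}")
        if m["sentiment"] not in VALID_SENTIMENTS:
            raise ValueError(f"unexpected sentiment: {m['sentiment']!r}")
    return mentions


print(parse_mentions("[]"))  # []
```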

Chain-of-Thought Prompting

Chain-of-thought (CoT) asks the model to reason through a problem step by step before giving a final answer. This can substantially improve performance on tasks that require logic, math, or multi-step reasoning.

Basic Chain-of-Thought

# WITHOUT CoT: model jumps straight to an answer (more error-prone on multi-step problems)
prompt_no_cot = "A store has 15 apples. They sell 7 in the morning, receive a shipment of 12 in the afternoon, then sell 9 before closing. How many apples remain?"

# WITH CoT: model reasons through the problem
prompt_cot = """A store has 15 apples. They sell 7 in the morning, receive a 
shipment of 12 in the afternoon, then sell 9 before closing. How many apples remain?

Think through this step by step before giving your final answer."""
# Model response:
# Step 1: Start with 15 apples
# Step 2: Sell 7 in the morning: 15 - 7 = 8
# Step 3: Receive 12 in the afternoon: 8 + 12 = 20
# Step 4: Sell 9 before closing: 20 - 9 = 11
# Final answer: 11 apples remain.
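Downstream code usually needs only the last line of a response like this. A minimal sketch of pulling it out, assuming you ask the model to finish with a line that starts with "Final answer:" (the prompt above only says "final answer", so making the marker explicit in the prompt makes parsing more reliable):

```python
import re
from typing import Optional

COT_RESPONSE = """Step 1: Start with 15 apples
Step 2: Sell 7 in the morning: 15 - 7 = 8
Step 3: Receive 12 in the afternoon: 8 + 12 = 20
Step 4: Sell 9 before closing: 20 - 9 = 11
Final answer: 11 apples remain."""


def extract_final_answer(response: str) -> Optional[str]:
    """Return the text after the last 'Final answer:' marker, or None if absent."""
    matches = re.findall(r"Final answer:\s*(.+)", response, flags=re.IGNORECASE)
    return matches[-1].strip() if matches else None


def extract_final_number(response: str) -> Optional[int]:
    """Pull the first integer out of the final answer line."""
    answer = extract_final_answer(response)
    if answer is None:
        return None
    match = re.search(r"-?\d+", answer)
    return int(match.group()) if match else None


print(extract_final_answer(COT_RESPONSE))  # 11 apples remain.
print(extract_final_number(COT_RESPONSE))  # 11
```

Returning `None` instead of guessing lets the caller retry or flag the response when the model skips the marker.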

When Chain-of-Thought Helps

Strong improvement:
  - Math and arithmetic problems
  - Multi-step logical reasoning
  - Code debugging ("trace through this code step by step")
  - Complex classification with multiple criteria
  - Decision-making with tradeoffs

Minimal improvement:
  - Simple classification (spam/not spam)
  - Text extraction (pull out the email address)
  - Translation
  - Simple formatting tasks
  
CoT adds tokens (cost) and latency. Don't use it when the task
is simple enough that the model gets it right without reasoning.

Few-Shot Chain-of-Thought

Combine few-shot examples with worked reasoning steps when a task needs both your domain rules and multi-step logic:

system_prompt = """Determine if the customer is eligible for a refund.
Think through the eligibility criteria step by step.

Example 1:
Input: Customer bought a laptop 45 days ago. Item is unopened. 
Reason: "Changed my mind."
Thinking: 
- Return window is 30 days. Purchase was 45 days ago. Outside return window.
- Even though item is unopened, the time limit applies.
- "Changed my mind" is a valid reason but does not override the time limit.
Decision: NOT ELIGIBLE. Reason: Outside 30-day return window.

Example 2:
Input: Customer bought headphones 10 days ago. Item is defective 
(left ear has no sound). Reason: "Product defect."
Thinking:
- Return window is 30 days. Purchase was 10 days ago. Within window.
- Item is defective. Defective items are eligible regardless of time.
- Product defect is a valid return reason.
Decision: ELIGIBLE. Reason: Defective product within return window.

Example 3:
Input: Customer bought software license 5 days ago. Digital product. 
License has been activated. Reason: "Not what I expected."
Thinking:
- Return window is 30 days. Purchase was 5 days ago. Within window.
- Digital products are non-refundable after activation.
- License has been activated. This overrides the time window.
Decision: NOT ELIGIBLE. Reason: Digital product already activated.
"""

This approach works because the model learns both the answer and the reasoning process from your examples.
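To check that the model actually follows the policy the examples teach, you can encode that policy as a plain function and compare it against parsed model decisions. A minimal sketch; the rule order, and the choice that day 30 still counts as inside the window, are assumptions the three examples do not settle:

```python
import re
from typing import Optional

RETURN_WINDOW_DAYS = 30  # from the examples above


def expected_decision(days_since_purchase: int, defective: bool,
                      digital: bool, activated: bool) -> bool:
    """Reference policy implied by the examples (hypothetical helper).

    Assumption: an activated digital product is never refundable, even if
    defective; none of the examples covers that combination.
    """
    if digital and activated:
        return False
    if defective:
        return True  # "Defective items are eligible regardless of time."
    return days_since_purchase <= RETURN_WINDOW_DAYS


def parse_decision(response: str) -> Optional[bool]:
    """Read the last 'Decision: ELIGIBLE' / 'Decision: NOT ELIGIBLE' line."""
    matches = re.findall(r"Decision:\s*(NOT ELIGIBLE|ELIGIBLE)", response)
    if not matches:
        return None
    return matches[-1] == "ELIGIBLE"


# The three worked examples, as (days, defective, digital, activated):
print(expected_decision(45, False, False, False))  # False
print(expected_decision(10, True, False, False))   # True
print(expected_decision(5, False, True, True))     # False
```

Running `parse_decision(model_output) == expected_decision(...)` over a batch of synthetic cases gives a cheap regression test for the prompt whenever you edit its examples.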

ReAct: Reasoning & Acting

ReAct combines chain-of-thought with tool use. The model reasons about what to do, takes an action (calls a tool), observes the result, then reasons about what to do next.

ReAct pattern:

  Thought: I need to find the user's order status. Let me look up their order.
  Action: lookup_order(order_id="5821")
  Observation: {"status": "shipped", "tracking": "1Z999AA10123456784", "eta": "2025-03-20"}
  Thought: The order is shipped. The customer asked about delivery. Let me check 
           the tracking for current location.
  Action: track_package(tracking_number="1Z999AA10123456784")
  Observation: {"location": "Chicago, IL", "last_update": "2025-03-18", "status": "in_transit"}
  Thought: I now have all the information to answer the customer's question.
  Answer: Your order #5821 is in transit. It's currently in Chicago, IL as of 
          March 18th. Expected delivery is March 20th.

ReAct is implemented using function calling / tool use APIs. You define available tools with JSON schemas, and the model decides which to call and in what order. This is the pattern behind most AI agent frameworks (LangChain agents, OpenAI assistants, Anthropic tool use).
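A minimal sketch of that loop, assuming the OpenAI Python SDK's Chat Completions `tools` format. The two tools are hypothetical stubs that return the data from the trace above, and `react_loop` accepts any client object with the same interface. With native tool calling the "Thought" steps are usually implicit; you can ask the model to write them out if you want them logged.

```python
import json

# Tool definitions: a name, a description, and a JSON Schema for the arguments.
TOOLS = [
    {"type": "function", "function": {
        "name": "lookup_order",
        "description": "Look up an order's status by order ID.",
        "parameters": {"type": "object",
                       "properties": {"order_id": {"type": "string"}},
                       "required": ["order_id"]}}},
    {"type": "function", "function": {
        "name": "track_package",
        "description": "Get the current location of a shipped package.",
        "parameters": {"type": "object",
                       "properties": {"tracking_number": {"type": "string"}},
                       "required": ["tracking_number"]}}},
]


# Hypothetical stub implementations returning the data from the trace above.
def lookup_order(order_id: str) -> dict:
    return {"status": "shipped", "tracking": "1Z999AA10123456784", "eta": "2025-03-20"}


def track_package(tracking_number: str) -> dict:
    return {"location": "Chicago, IL", "last_update": "2025-03-18", "status": "in_transit"}


TOOL_IMPLS = {"lookup_order": lookup_order, "track_package": track_package}


def run_tool_call(call) -> str:
    """Action -> Observation: execute one tool call and return the result as JSON."""
    args = json.loads(call.function.arguments)
    return json.dumps(TOOL_IMPLS[call.function.name](**args))


def react_loop(client, question: str, model: str = "gpt-4o-mini", max_steps: int = 5) -> str:
    """Alternate between model turns and tool calls until the model answers."""
    messages = [{"role": "user", "content": question}]
    for _ in range(max_steps):
        response = client.chat.completions.create(model=model, messages=messages, tools=TOOLS)
        message = response.choices[0].message
        if not message.tool_calls:
            return message.content  # no further actions: this is the final answer
        messages.append(message)  # keep the assistant's tool-call turn in the history
        for call in message.tool_calls:
            messages.append({"role": "tool", "tool_call_id": call.id,
                             "content": run_tool_call(call)})
    raise RuntimeError("no final answer within max_steps")
```

Usage, with the SDK installed and an API key set: `from openai import OpenAI` and then `react_loop(OpenAI(), "Where is my order #5821?")`. The `max_steps` cap keeps a confused model from looping on tool calls indefinitely.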

Choosing the Right Technique

Task complexity → Technique:

  Simple, well-defined task          → Zero-shot
  Format-specific or domain-specific → Few-shot (3-5 examples)
  Multi-step reasoning required      → Chain-of-thought
  Domain-specific + reasoning        → Few-shot CoT
  Requires external data or actions  → ReAct (tool use)

Cost/latency consideration:
  Zero-shot:    Least tokens, fastest, cheapest
  Few-shot:     More tokens (examples), moderate cost
  CoT:          More output tokens (reasoning), higher cost
  Few-shot CoT: Most tokens per call, highest cost per call, usually highest quality on reasoning tasks
  ReAct:        Multiple API calls, highest end-to-end latency

Common Pitfalls

Not using few-shot when you should: This is one of the most commonly missed optimizations. If your zero-shot prompt produces inconsistent output, add 3 examples before trying anything else. It is usually cheaper than switching to a better model.
Using only easy or near-identical examples: If all your few-shot examples are easy cases, the model learns nothing about hard cases. Include at least one edge case and one negative example.
Using CoT for simple tasks: Chain-of-thought adds latency and cost. For binary classification or simple extraction, it slows things down without improving quality.
Not extracting the final answer from CoT: When using chain-of-thought, the reasoning is useful but the answer is what you need. Parse the output to extract just the final answer for downstream processing.
Mixing examples into the per-request input: Few-shot examples are instructions, not input. Keep them in the system prompt (or as fixed user/assistant example turns placed before the real request) so they act as stable context, rather than concatenating them with each user's text, where the model can confuse them with the content it should process.
Too many examples: Beyond about 5 examples, returns typically diminish. More examples mean more tokens per request, which adds cost at scale. Measure whether additional examples actually improve quality.

Key Takeaways

Few-shot prompting (3-5 examples) is often the most underused, highest-impact technique. Try it before switching models or building complex pipelines.
Chain-of-thought improves reasoning tasks by forcing the model to work through the problem step by step. Skip it for simple tasks.
Choose the technique based on task complexity: zero-shot for simple, few-shot for format/domain-specific, CoT for reasoning, ReAct for tool use.
Example selection matters more than example quantity. Include edge cases, negative examples, and one example per output category.
ReAct (reasoning + tool use) is the pattern behind modern AI agents. The model decides what tools to call based on step-by-step reasoning.