Safety & When Not to Use AI

Overview

Hallucinations are not bugs. They are a fundamental property of how generative models work. A language model samples each next token from a probability distribution conditioned on the current context, and that distribution is learned from training data. Nothing in this process checks the output against the world: the model has no built-in way to verify facts, and no internal signal that reliably separates recalling something real from making something up. More training data and better grounding can lower the hallucination rate, but they do not remove it, because it follows from how generation works.

This matters because people deploy LLMs in contexts where being wrong has real consequences: medical advice, legal guidance, financial recommendations, safety-critical systems. Understanding when AI is appropriate and when it is dangerous is one of the most important skills in applied AI.

Hallucinations: Understanding the Problem

Why Hallucinations Happen

def demonstrate_hallucination_mechanism():
    """
    LLMs don't retrieve facts. They generate probable text.
    
    When asked 'Who wrote the paper on transformer attention?'
    the model doesn't look up the paper. It generates text
    that statistically follows from the question.
    
    Usually, the statistical pattern leads to the right answer
    ('Vaswani et al., 2017'). But sometimes it generates a
    plausible-sounding but completely fabricated answer.
    
    The model has no mechanism to distinguish between:
    - 'I know this because I saw it in training data'
    - 'I am generating this because it sounds right'
    """
    # Both statements are fluent, specific, and confident-sounding.
    # Nothing in the text itself marks the second one as fabricated.
    correct = "The Transformer was introduced by Vaswani et al. in 2017"
    hallucinated = "The Transformer was introduced by Chen et al. in 2016"
    return correct, hallucinated
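
The mechanism described in the docstring above can be made concrete with a toy sampler. This is a minimal sketch, not how any real model is implemented: the probabilities are invented for illustration, and `NEXT_TEXT_PROBS`, `sample_continuation`, and `fabrication_rate` are hypothetical names. The point is that the sampler picks text by weight alone and never checks whether the text is true.

```python
import random

# Hypothetical, hand-picked probabilities for continuing the prompt
# "The Transformer was introduced by". A real model computes a
# distribution like this over its whole vocabulary at every step.
NEXT_TEXT_PROBS = {
    "Vaswani et al. in 2017": 0.90,  # matches the real paper
    "Chen et al. in 2016": 0.06,     # fabricated, but fluent
    "Smith et al. in 2018": 0.04,    # fabricated, but fluent
}


def sample_continuation(probs, rng):
    """Pick one continuation in proportion to its probability."""
    options = list(probs)
    weights = [probs[option] for option in options]
    return rng.choices(options, weights=weights, k=1)[0]


def fabrication_rate(probs, correct, n_samples, seed=0):
    """Fraction of sampled continuations that differ from the correct one."""
    rng = random.Random(seed)
    wrong = sum(
        sample_continuation(probs, rng) != correct for _ in range(n_samples)
    )
    return wrong / n_samples
```

With these made-up weights, roughly one sampled answer in ten is fabricated, and each fabricated answer is produced by exactly the same code path as the correct one.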

Types of Hallucinations

Factual hallucinations:
  "The Eiffel Tower is 450 meters tall" (it's 330m)
  The model generates a plausible but wrong number.

Citation hallucinations:
  "According to Smith et al. (2019) in Nature..."
  The paper doesn't exist. The model invented a citation.

Logical hallucinations:
  "Since X implies Y, and Y implies Z, therefore X implies Z"
  The inference is valid, but the model may have fabricated a
  premise, so the conclusion can still be false.

Temporal hallucinations:
  "As of 2025, Company X uses technology Y"
  The model's training data has a cutoff. It may not know
  current facts but will generate confident statements anyway.

Confident uncertainty:
  "The answer is definitely 42.7 meters"
  The model presents a specific number with no basis,
  and the specificity makes it seem more credible.

When Hallucinations Are Dangerous

def categorize_hallucination_risk(domain):
    """Assess how dangerous hallucinations are in a given domain.
    
    The key question: what happens when the AI is wrong?
    """
    risk_levels = {
        "creative_writing": {
            "risk": "low",
            "consequence": "User reads something inaccurate",
            "mitigation": "None needed, creativity is the goal",
        },
        "customer_support": {
            "risk": "medium",
            "consequence": "User gets wrong information about "
                          "a product or policy",
            "mitigation": "Ground responses in knowledge base, "
                         "add human escalation",
        },
        "medical_advice": {
            "risk": "critical",
            "consequence": "Patient makes health decisions based "
                          "on fabricated information",
            "mitigation": "Do not use for direct advice. Use only "
                         "as a tool for medical professionals.",
        },
        "legal_guidance": {
            "risk": "critical",
            "consequence": "Person takes legal action based on "
                          "non-existent laws or precedents",
            "mitigation": "Always require lawyer review. "
                         "Never present as authoritative.",
        },
        "financial_advice": {
            "risk": "high",
            "consequence": "Person makes investment or tax decisions "
                          "based on fabricated information",
            "mitigation": "Disclaim clearly. Use for analysis, "
                         "not recommendations.",
        },
        "code_generation": {
            "risk": "medium",
            "consequence": "Generated code has bugs or security "
                          "vulnerabilities",
            "mitigation": "Always review, test, and run security "
                         "scans on generated code.",
        },
    }
    
    return risk_levels.get(domain, {
        "risk": "unknown",
        "consequence": "Assess for your specific use case",
        "mitigation": "Apply the precautionary principle",
    })

Safety Layers

Content Filtering

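def build_safety_pipeline(user_input, model_output):
    """Multi-layer safety pipeline for LLM applications.
    
    Don't rely on a single safety check. Layer multiple
    independent checks so that if one fails, others catch it.
    
    check_input_safety, detect_prompt_injection, check_output_safety,
    detect_pii, and redact_pii are application-specific helpers
    that you must supply.
    """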
    # Layer 1: Input filtering
    input_check = check_input_safety(user_input)
    if input_check["blocked"]:
        return {
            "output": "I can't help with that request.",
            "reason": input_check["reason"],
            "safety_checks_passed": False,
        }
    
    # Layer 2: Prompt injection detection
    injection_check = detect_prompt_injection(user_input)
    if injection_check["detected"]:
        return {
            "output": "I detected an unusual pattern in your input. "
                      "Could you rephrase your question?",
            "reason": "prompt_injection",
            "safety_checks_passed": False,
        }
    
    # Layer 3: Generate response (with system prompt guardrails).
    # model_output is passed in already generated to keep this example
    # short. In production, run layers 1-2 before calling the model.
    
    # Layer 4: Output filtering
    output_check = check_output_safety(model_output)
    if output_check["blocked"]:
        return {
            "output": "I generated a response, but it didn't meet "
                      "our safety standards. Please rephrase your request.",
            "reason": output_check["reason"],
            "safety_checks_passed": False,
        }
    
    # Layer 5: PII detection (redact rather than block)
    pii_check = detect_pii(model_output)
    if pii_check["found"]:
        model_output = redact_pii(model_output, pii_check["entities"])
    
    # Every return path uses the same keys so callers can handle them uniformly.
    return {"output": model_output, "reason": None, "safety_checks_passed": True}
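
The pipeline above calls `detect_pii` and `redact_pii` without defining them. The following is one possible sketch of those two helpers, assuming the return shapes the pipeline expects (`"found"` and `"entities"`). The regular expressions are hypothetical and deliberately narrow (email addresses and US-style phone numbers only); real PII detection needs far broader coverage and usually a dedicated tool.

```python
import re

# Hypothetical, deliberately narrow patterns for illustration only.
PII_PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "phone": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}


def detect_pii(text):
    """Return {"found": bool, "entities": [...]} with character spans."""
    entities = []
    for label, pattern in PII_PATTERNS.items():
        for match in pattern.finditer(text):
            entities.append(
                {"type": label, "start": match.start(), "end": match.end()}
            )
    return {"found": bool(entities), "entities": entities}


def redact_pii(text, entities):
    """Replace each detected span with a [TYPE] placeholder."""
    # Work from the end of the string so earlier offsets stay valid.
    for entity in sorted(entities, key=lambda e: e["start"], reverse=True):
        placeholder = f"[{entity['type'].upper()}]"
        text = text[: entity["start"]] + placeholder + text[entity["end"]:]
    return text
```

Redacting instead of blocking keeps the response useful while removing the sensitive spans, which matches how layer 5 is used in the pipeline.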

Human-in-the-Loop

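def human_in_the_loop_pipeline(input_data, model, confidence_threshold):
    """Route to humans when the model is not confident enough.
    
    This is the single most effective safety measure.
    When in doubt, ask a human.
    
    Assumes model.predict returns a dict with "output" and a
    "confidence" in [0, 1], and that assign_to_reviewer is defined
    elsewhere. An LLM's self-reported confidence is often poorly
    calibrated, so validate it before relying on a threshold.
    """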
    prediction = model.predict(input_data)
    confidence = prediction["confidence"]
    
    if confidence >= confidence_threshold:
        # High confidence: auto-approve; the caller should log this for audit
        return {
            "decision": prediction["output"],
            "source": "automated",
            "confidence": confidence,
            "needs_review": False,
        }
    else:
        # Low confidence: queue for human review
        return {
            "decision": None,
            "source": "pending_human_review",
            "confidence": confidence,
            "needs_review": True,
            "suggested_output": prediction["output"],
            "review_queue": assign_to_reviewer(input_data),
        }

Confidence Thresholds

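Set thresholds based on risk. As starting points, low-risk tasks can auto-approve at 0.7 confidence, medium-risk at 0.85, and high-risk at 0.95; critical-risk tasks should always route to human review regardless of confidence. These numbers only mean something if the confidence score is calibrated, so check them against labeled data before relying on them. The higher the stakes, the more human involvement you need.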
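
The tiers above can be written as a small lookup. This is a sketch under the assumption that each task is tagged with one of four risk levels; `RISK_THRESHOLDS` and `needs_human_review` are hypothetical names, and the threshold values are the ones given in the text.

```python
# Threshold values come from the text above; None means "never auto-approve".
RISK_THRESHOLDS = {
    "low": 0.70,
    "medium": 0.85,
    "high": 0.95,
    "critical": None,
}


def needs_human_review(risk_level, confidence):
    """Return True when a prediction must go to a human reviewer."""
    if risk_level not in RISK_THRESHOLDS:
        # Unknown risk tier: fail closed and ask a human.
        return True
    threshold = RISK_THRESHOLDS[risk_level]
    if threshold is None:
        # Critical tasks are always reviewed, whatever the confidence.
        return True
    return confidence < threshold
```

Failing closed on an unrecognized risk level is a design choice: a typo in a tier name then produces extra review work instead of a silent auto-approval.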

When Not to Use AI

Safety-Critical Systems

Do NOT use ML/LLMs as the sole decision-maker for:

Medical diagnosis:
  AI can assist radiologists (flag suspicious scans)
  AI should not diagnose patients without human oversight
  
Autonomous vehicle control:
  AI can assist (lane keeping, emergency braking)
  AI should not make life-or-death decisions without
  extensive validation and regulatory approval
  
Air traffic control:
  AI can assist (conflict detection, scheduling optimization)
  AI should not replace human controllers
  
Nuclear/industrial safety systems:
  Rule-based safety interlocks exist for a reason
  ML should never override deterministic safety logic
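
The last point, that ML should never override deterministic safety logic, can be sketched as an ordering rule: evaluate the interlock first, and only consult the model when the interlock has not tripped. Everything here is hypothetical, including the limits, the `SensorReading` fields, and the action names; real limits come from the plant's safety engineering, not from application code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SensorReading:
    temperature_c: float
    pressure_kpa: float


# Hypothetical hard limits for illustration only.
MAX_TEMPERATURE_C = 350.0
MAX_PRESSURE_KPA = 1200.0

# The only actions the ML component is allowed to choose between.
ML_ALLOWED_ACTIONS = {"increase_output", "hold", "decrease_output"}


def interlock_tripped(reading):
    """Deterministic rule: trip when any hard limit is exceeded."""
    return (
        reading.temperature_c > MAX_TEMPERATURE_C
        or reading.pressure_kpa > MAX_PRESSURE_KPA
    )


def choose_action(reading, ml_recommendation):
    """Evaluate the interlock first; the ML output cannot override it."""
    if interlock_tripped(reading):
        return "shutdown"
    if ml_recommendation in ML_ALLOWED_ACTIONS:
        return ml_recommendation
    # Unrecognized recommendation: take the conservative default.
    return "hold"
```

Note that the model has no code path that can return anything when the interlock trips, and it cannot request `"shutdown"` or any action outside the allowed set.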

When You Need Guarantees

Use rules instead of ML when: the task must be deterministic, 100% accuracy is required, regulators demand explainable logic, failure has legal consequences, or well-defined rules already exist. ML gives probabilities, not guarantees. Tax calculations, access control, and regulatory compliance should use rules, not ML.
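
Access control is a convenient illustration of why rules fit here. In the sketch below, the role table and the names `PERMISSIONS` and `is_allowed` are hypothetical. The same inputs always give the same answer, and the reason for any decision can be read directly from the table.

```python
# Hypothetical role-to-permission table for illustration.
PERMISSIONS = {
    "admin": frozenset({"read", "write", "delete"}),
    "editor": frozenset({"read", "write"}),
    "viewer": frozenset({"read"}),
}


def is_allowed(role, action):
    """Deterministic check: unknown roles get no permissions."""
    return action in PERMISSIONS.get(role, frozenset())
```

A classifier trained on past access decisions might agree with this table most of the time, but "most of the time" is exactly the guarantee that access control cannot accept.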

The AI Appropriateness Checklist

Before deploying AI, ask these questions:

1. What happens when the AI is wrong?
   - Minor inconvenience? AI is fine.
   - Financial loss? AI with human review.
   - Physical harm? Do not use AI as sole decision-maker.

2. Can a simpler solution work?
   - If regex can solve it, use regex.
   - If a lookup table can solve it, use a lookup table.
   - If a decision tree with 10 rules can solve it, use rules.
   - Only use ML when simpler solutions genuinely fail.

3. Do you have a feedback mechanism?
   - Can you detect when the AI is wrong?
   - Can users report problems?
   - Can you fix issues quickly?
   - If you can't monitor, you shouldn't deploy.

4. Do you have a fallback?
   - What happens when the API is down?
   - What happens when the model returns nonsense?
   - Is there a manual process users can fall back to?

5. Are you prepared for the worst case?
   - What is the worst output the model could produce?
   - What damage could that output cause?
   - Is that damage acceptable?
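
Question 4 in the checklist can be sketched as a thin wrapper around the model call. This is one possible shape, not a prescribed one: `call_model` and `is_valid` are hypothetical callables supplied by the application, and the fallback here is simply a fixed message pointing the user to a manual process.

```python
def answer_with_fallback(question, call_model, is_valid, fallback_message):
    """Try the model; use a non-AI path on error or invalid output.

    call_model and is_valid are hypothetical callables supplied by the caller.
    """
    try:
        output = call_model(question)
    except Exception as exc:  # API down, timeout, and similar failures
        return {
            "output": fallback_message,
            "source": "fallback",
            "error": repr(exc),
        }
    if not is_valid(output):
        # The model responded, but the output failed validation.
        return {
            "output": fallback_message,
            "source": "fallback",
            "error": "invalid_output",
        }
    return {"output": output, "source": "model", "error": None}
```

Recording the `source` and `error` fields also supports question 3: counting fallbacks over time is a simple way to detect when the model is failing.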

Building Responsible AI Systems

Defense in Depth

Layer multiple independent safety mechanisms. No single layer catches everything:

Layer                     Catches
Input validation          Prompt injection, adversarial inputs
System prompt guardrails  Scope violations, inappropriate content
Output filtering          Harmful content, PII leakage
Confidence thresholding   Uncertain predictions, edge cases
Factual grounding (RAG)   Hallucinations, fabricated facts
Human escalation          Everything the other layers miss
Monitoring and alerting   Emerging patterns, drift, new attacks

Incident Response for AI

When your AI system causes harm: (1) assess severity and disable the feature if users are at risk, (2) collect the exact input that caused the failure, along with the output it produced, (3) add the failing case to the regression test suite and fix the root cause, (4) inform affected users transparently and file regulatory reports if required.

Real-World Example: AI-Assisted Medical Triage

A hospital builds an AI system to help triage emergency department patients.

What AI does: Analyzes symptoms described by the patient and suggests a triage priority level (1-5) to the nurse.

What AI does not do: Make the final triage decision. The nurse always reviews and can override.

Safety layers: (1) The system never suggests a lower priority than what the rules-based scoring system calculates. (2) If the AI and rules-based system disagree by more than 1 level, the case is flagged for senior nurse review. (3) The system explicitly says "This is a suggestion for the triage nurse. It is not a diagnosis." (4) All AI-assisted triage decisions are logged for retrospective review.
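
Safety layers (1) to (3) can be sketched in a few lines. The example does not say which end of the 1-5 scale is most urgent, so this sketch assumes that level 1 is the most urgent and level 5 the least; under the opposite convention, `min` would become `max`. The function name `combine_triage` and the returned field names are hypothetical.

```python
NOTICE = "This is a suggestion for the triage nurse. It is not a diagnosis."


def combine_triage(ai_level, rules_level):
    """Combine AI and rules-based triage levels.

    Assumption: 1 is the most urgent level and 5 the least urgent.
    """
    for level in (ai_level, rules_level):
        if level not in range(1, 6):
            raise ValueError(f"triage level must be 1-5, got {level!r}")
    # Layer 1: never suggest a less urgent level than the rules-based score.
    suggested = min(ai_level, rules_level)
    # Layer 2: a gap of more than one level goes to a senior nurse.
    needs_senior_review = abs(ai_level - rules_level) > 1
    # Layer 3: every suggestion carries the explicit notice.
    return {
        "suggested_level": suggested,
        "needs_senior_review": needs_senior_review,
        "notice": NOTICE,
    }
```

Under this assumption, the AI can only raise the urgency of a case relative to the rules-based score, never lower it, which is why its mistakes cost extra nurse attention instead of delayed care.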

Outcome: After 6 months, the AI-assisted process reduces average triage time by 20% and catches 3% of cases that the rules-based system would have under-triaged. But the nurse always makes the final call. The AI is a tool, not a replacement.

Common Pitfalls

Treating hallucinations as a fixable bug: Hallucinations are inherent to generative models. You can reduce their frequency but not eliminate them. Design your system assuming they will happen.
Deploying AI without a fallback: Every AI-powered feature needs a non-AI fallback for when the model fails, the API is down, or the output is unsafe.
Using AI because it is trendy: If a simpler solution works, use it. AI adds complexity, cost, non-determinism, and new failure modes. It should earn its place.
Ignoring low-probability catastrophic failures: A system that works 99.9% of the time but causes serious harm 0.1% of the time may not be acceptable. Evaluate worst-case scenarios, not just averages.
Relying on a single safety layer: No single filter, prompt, or check catches everything. Layer multiple independent safety mechanisms.
Not having an incident response plan: When (not if) your AI system causes a problem, you need a pre-planned response. Figuring it out during a crisis leads to worse outcomes.
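
The 99.9% figure in the pitfall about low-probability failures is easier to judge once it is multiplied by volume. The sketch below assumes failures are independent and uses a hypothetical request volume; the function names are illustrative.

```python
def expected_harmful_outputs(n_requests, failure_rate):
    """Expected number of harmful outputs across n_requests."""
    return n_requests * failure_rate


def prob_at_least_one_failure(n_requests, failure_rate):
    """Chance of one or more failures, assuming independent requests."""
    return 1.0 - (1.0 - failure_rate) ** n_requests


# Hypothetical volume: one million requests at a 0.1% failure rate.
expected = expected_harmful_outputs(1_000_000, 0.001)
# Chance that a batch of only 1,000 requests contains a failure.
at_least_one = prob_at_least_one_failure(1_000, 0.001)
```

At a 0.1% failure rate, one million requests yield about 1,000 harmful outputs in expectation, and even a batch of 1,000 requests has roughly a 63% chance of containing at least one. Whether that is acceptable depends on how severe each failure is, not on the average.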

Key Takeaways

Hallucinations are a fundamental property of generative models, not a bug that will be fixed. Design systems that handle them gracefully.
Safety layers should be defense in depth: input filtering, output filtering, confidence thresholds, human-in-the-loop, and monitoring. No single layer is sufficient.
Do not use AI when you need deterministic guarantees, when the problem is well-defined with clear rules, or when being wrong has severe consequences and no human oversight is in place.
Always provide a path to human review. The most effective safety measure is a knowledgeable human who can override the AI.
Before deploying AI, ask: what is the worst output the model could produce, and is that damage acceptable? If not, either add sufficient safety layers or do not deploy.
AI is a tool, not a replacement for human judgment. The best AI systems augment human decision-making rather than replacing it, especially in high-stakes domains.