{"page":1,"totalPages":49,"scrolls":[{"id":"b84f047b237b","subject":"ai-and-ml-applied","topic":"foundations","subtopic":"applied-ai-vs-ml-theory","chunkType":"section","body":"## Applied AI vs ML Theory\n\nComputer science covers the math behind machine learning: gradient descent, loss functions, backpropagation, regularization. That knowledge matters. But knowing how a neural network learns weights does not tell you how to build an AI-powered product that users rely on every day.\n\nApplied AI is the discipline of taking ML capabilities and turning them into reliable, maintainable software. The gap between \"I trained a model in a Jupyter notebook\" and \"this system handles 10,000 requests per minute with 99.9% uptime\" is enormous. This document covers that gap.","bodyHtml":"<h2>Applied AI vs ML Theory</h2>\n<p>Computer science covers the math behind machine learning: gradient descent, loss functions, backpropagation, regularization. That knowledge matters. But knowing how a neural network learns weights does not tell you how to build an AI-powered product that users rely on every day.</p>\n<p>Applied AI is the discipline of taking ML capabilities and turning them into reliable, maintainable software. The gap between \"I trained a model in a Jupyter notebook\" and \"this system handles 10,000 requests per minute with 99.9% uptime\" is enormous. This document covers that gap.</p>","sourceUrl":"/ai-and-ml-applied/foundations/applied-ai-vs-ml-theory","sourceAnchor":"overview","domain":"data-ai","accentColor":"#08d7d9"},{"id":"f2146b17f797","subject":"ai-and-ml-applied","topic":"foundations","subtopic":"applied-ai-vs-ml-theory","chunkType":"section","body":"## Real-World Example: Document Processing\n\nA company needs to extract key fields from invoices (vendor name, total amount, due date, line items).\n\n**The theory-only approach**: Train a custom document understanding model. Collect 10,000 labeled invoices. Fine-tune a layout-aware transformer. 
Build training infrastructure. Three months of work.\n\n**The applied approach**:\n\n1. Start with an LLM API call: send the invoice text to GPT-4o with extraction instructions\n2. Validate the output against expected formats (is the amount a valid number? is the date parseable?)\n3. Build a test set of 100 invoices with known correct answers\n4. Measure accuracy: 94% on first attempt\n5. Add few-shot examples for the edge cases: accuracy goes to 97%\n6. Add a human review queue for low-confidence extractions\n7. Ship it. Total development time: two weeks\n\nThe applied approach is not less sophisticated. It is more pragmatic. If accuracy needs to reach 99.5%, you can invest in fine-tuning later, with a working system already in production generating the training data you need.","bodyHtml":"<h2>Real-World Example: Document Processing</h2>\n<p>A company needs to extract key fields from invoices (vendor name, total amount, due date, line items).</p>\n<p><strong>The theory-only approach</strong>: Train a custom document understanding model. Collect 10,000 labeled invoices. Fine-tune a layout-aware transformer. Build training infrastructure. Three months of work.</p>\n<p><strong>The applied approach</strong>:</p>\n<ol>\n<li>Start with an LLM API call: send the invoice text to GPT-4o with extraction instructions</li>\n<li>Validate the output against expected formats (is the amount a valid number? is the date parseable?)</li>\n<li>Build a test set of 100 invoices with known correct answers</li>\n<li>Measure accuracy: 94% on first attempt</li>\n<li>Add few-shot examples for the edge cases: accuracy goes to 97%</li>\n<li>Add a human review queue for low-confidence extractions</li>\n<li>Ship it. Total development time: two weeks</li>\n</ol>\n<p>The applied approach is not less sophisticated. It is more pragmatic. 
If accuracy needs to reach 99.5%, you can invest in fine-tuning later, with a working system already in production generating the training data you need.</p>","sourceUrl":"/ai-and-ml-applied/foundations/applied-ai-vs-ml-theory","sourceAnchor":"real-world-example-document-processing","domain":"data-ai","accentColor":"#08d7d9"},{"id":"c3040ebfa7b2","subject":"ai-and-ml-applied","topic":"foundations","subtopic":"applied-ai-vs-ml-theory","chunkType":"pitfall","body":"## Pitfalls — Applied AI vs ML Theory\n\n**Starting with training instead of prompting**: The most common mistake. Try the simplest approach first. You can always add complexity later; removing it is harder.\n\n**Treating ML as a black box**: \"We'll throw ML at it\" is not a plan. You need to define what success looks like, how you'll measure it, and what happens when the model is wrong.\n\n**Ignoring the data pipeline**: The model is 5% of the system. If your data pipeline is fragile, your AI feature is fragile.\n\n**Optimizing for the wrong metric**: High accuracy on a test set means nothing if users hate the output. Measure what matters to users.\n\n**No fallback path**: Every ML system needs a graceful degradation strategy. What happens when the API is down? When the model returns garbage? When latency spikes to 30 seconds?\n\n**Skipping evaluation**: If you don't have automated evals, you are shipping blind. Every prompt change, every model upgrade, every data change should be tested against a known-good dataset.","bodyHtml":"<h2>Pitfalls — Applied AI vs ML Theory</h2>\n<p><strong>Starting with training instead of prompting</strong>: The most common mistake. Try the simplest approach first. You can always add complexity later; removing it is harder.</p>\n<p><strong>Treating ML as a black box</strong>: \"We'll throw ML at it\" is not a plan. 
You need to define what success looks like, how you'll measure it, and what happens when the model is wrong.</p>\n<p><strong>Ignoring the data pipeline</strong>: The model is 5% of the system. If your data pipeline is fragile, your AI feature is fragile.</p>\n<p><strong>Optimizing for the wrong metric</strong>: High accuracy on a test set means nothing if users hate the output. Measure what matters to users.</p>\n<p><strong>No fallback path</strong>: Every ML system needs a graceful degradation strategy. What happens when the API is down? When the model returns garbage? When latency spikes to 30 seconds?</p>\n<p><strong>Skipping evaluation</strong>: If you don't have automated evals, you are shipping blind. Every prompt change, every model upgrade, every data change should be tested against a known-good dataset.</p>","sourceUrl":"/ai-and-ml-applied/foundations/applied-ai-vs-ml-theory","sourceAnchor":"common-pitfalls","domain":"data-ai","accentColor":"#08d7d9"},{"id":"dda6aa69225b","subject":"ai-and-ml-applied","topic":"foundations","subtopic":"applied-ai-vs-ml-theory","chunkType":"takeaway","body":"## Key takeaways — Applied AI vs ML Theory\n\nApplied AI is about building reliable systems, not training models. The gap between theory and production is where most of the work lives.\n\n90% of AI work is data, integration, evaluation, and monitoring. Model training or prompt writing is a small fraction.\n\nUse ML when the problem involves ambiguity, natural language, or pattern recognition. Use rules when the logic is deterministic.\n\nAlways start with the simplest approach: an API call with a good prompt. Escalate complexity only when measurement proves you need it.\n\nEvery AI feature needs evaluation, monitoring, and a fallback path. These are not optional extras.","bodyHtml":"<h2>Key takeaways — Applied AI vs ML Theory</h2>\n<p>Applied AI is about building reliable systems, not training models. 
The gap between theory and production is where most of the work lives.</p>\n<p>90% of AI work is data, integration, evaluation, and monitoring. Model training or prompt writing is a small fraction.</p>\n<p>Use ML when the problem involves ambiguity, natural language, or pattern recognition. Use rules when the logic is deterministic.</p>\n<p>Always start with the simplest approach: an API call with a good prompt. Escalate complexity only when measurement proves you need it.</p>\n<p>Every AI feature needs evaluation, monitoring, and a fallback path. These are not optional extras.</p>","sourceUrl":"/ai-and-ml-applied/foundations/applied-ai-vs-ml-theory","sourceAnchor":"key-takeaways","domain":"data-ai","accentColor":"#08d7d9"},{"id":"c45f2abcc505","subject":"ai-and-ml-applied","topic":"foundations","subtopic":"the-ai-landscape-for-engineers","chunkType":"section","body":"## The AI Landscape for Engineers\n\nThe AI landscape changes fast, but the categories of tools are stable. As an engineer building AI-powered features, you need to know what is available, what each category is good at, what it costs, and how to make the build-vs-buy decision. This document is a practical map of the terrain.","bodyHtml":"<h2>The AI Landscape for Engineers</h2>\n<p>The AI landscape changes fast, but the categories of tools are stable. As an engineer building AI-powered features, you need to know what is available, what each category is good at, what it costs, and how to make the build-vs-buy decision. 
This document is a practical map of the terrain.</p>","sourceUrl":"/ai-and-ml-applied/foundations/the-ai-landscape-for-engineers","sourceAnchor":"overview","domain":"data-ai","accentColor":"#08d7d9"},{"id":"b6cc4661c909","subject":"ai-and-ml-applied","topic":"foundations","subtopic":"the-ai-landscape-for-engineers","chunkType":"pitfall","body":"## Pitfalls — The AI Landscape for Engineers\n\n**Using the biggest model for everything**: GPT-4o or Claude Opus for simple classification is like using a sledgehammer to hang a picture. Use the smallest model that gets the job done.\n\n**Ignoring open source**: For many tasks, a fine-tuned 8B-parameter open-source model can match or outperform a general-purpose API model at a fraction of the cost.\n\n**Not accounting for latency**: API calls add roughly 200 ms to 5 s of latency. If your feature is latency-sensitive, factor this into the design from the start.\n\n**Vendor lock-in without abstraction**: Wrap your AI calls behind an interface. Switching from OpenAI to Anthropic should be a configuration change, not a rewrite.\n\n**Ignoring rate limits**: Every API has rate limits. At scale, you need queuing, retries, and backoff. Design for this from day one.\n\n**Assuming prices are fixed**: AI API pricing drops 30-50% per year. A \"too expensive\" model today may be affordable in six months. Revisit cost assumptions quarterly.","bodyHtml":"<h2>Pitfalls — The AI Landscape for Engineers</h2>\n<p><strong>Using the biggest model for everything</strong>: GPT-4o or Claude Opus for simple classification is like using a sledgehammer to hang a picture. Use the smallest model that gets the job done.</p>\n<p><strong>Ignoring open source</strong>: For many tasks, a fine-tuned 8B-parameter open-source model can match or outperform a general-purpose API model at a fraction of the cost.</p>\n<p><strong>Not accounting for latency</strong>: API calls add roughly 200 ms to 5 s of latency. 
If your feature is latency-sensitive, factor this into the design from the start.</p>\n<p><strong>Vendor lock-in without abstraction</strong>: Wrap your AI calls behind an interface. Switching from OpenAI to Anthropic should be a configuration change, not a rewrite.</p>\n<p><strong>Ignoring rate limits</strong>: Every API has rate limits. At scale, you need queuing, retries, and backoff. Design for this from day one.</p>\n<p><strong>Assuming prices are fixed</strong>: AI API pricing drops 30-50% per year. A \"too expensive\" model today may be affordable in six months. Revisit cost assumptions quarterly.</p>","sourceUrl":"/ai-and-ml-applied/foundations/the-ai-landscape-for-engineers","sourceAnchor":"common-pitfalls","domain":"data-ai","accentColor":"#08d7d9"},{"id":"5e0377f07a12","subject":"ai-and-ml-applied","topic":"foundations","subtopic":"the-ai-landscape-for-engineers","chunkType":"takeaway","body":"## Key takeaways — The AI Landscape for Engineers\n\nThe AI landscape has clear categories: foundation models (proprietary and open source), embedding models, image generation, and speech/audio. Know what each category does.\n\nAlmost always start with a proprietary API. It is faster to integrate, requires no infrastructure, and quality is high. Build custom only when you have a specific reason.\n\nUse tiered model strategies: cheap models for simple tasks, expensive models for complex ones. This optimizes both cost and latency.\n\nEmbedding models are the foundation of semantic search and RAG. They are cheap and essential for most AI applications.\n\nWrap AI calls behind abstractions to avoid vendor lock-in. The landscape shifts fast and you need the ability to switch providers.","bodyHtml":"<h2>Key takeaways — The AI Landscape for Engineers</h2>\n<p>The AI landscape has clear categories: foundation models (proprietary and open source), embedding models, image generation, and speech/audio. 
Know what each category does.</p>\n<p>Almost always start with a proprietary API. It is faster to integrate, requires no infrastructure, and quality is high. Build custom only when you have a specific reason.</p>\n<p>Use tiered model strategies: cheap models for simple tasks, expensive models for complex ones. This optimizes both cost and latency.</p>\n<p>Embedding models are the foundation of semantic search and RAG. They are cheap and essential for most AI applications.</p>\n<p>Wrap AI calls behind abstractions to avoid vendor lock-in. The landscape shifts fast and you need the ability to switch providers.</p>","sourceUrl":"/ai-and-ml-applied/foundations/the-ai-landscape-for-engineers","sourceAnchor":"key-takeaways","domain":"data-ai","accentColor":"#08d7d9"},{"id":"7e6d0bac42ff","subject":"ai-and-ml-applied","topic":"foundations","subtopic":"when-to-use-ai","chunkType":"section","body":"## When to Use AI\n\nAI is a powerful tool, but it is not the right tool for every problem. Engineers who understand when AI adds value and when it adds unnecessary complexity make better architectural decisions. This document provides a practical decision framework for evaluating whether AI belongs in your solution.\n\nThe core question is not \"can AI do this?\" but \"should AI do this?\" A language model can multiply two numbers, but a calculator is faster, cheaper, and always correct.","bodyHtml":"<h2>When to Use AI</h2>\n<p>AI is a powerful tool, but it is not the right tool for every problem. Engineers who understand when AI adds value and when it adds unnecessary complexity make better architectural decisions. 
This document provides a practical decision framework for evaluating whether AI belongs in your solution.</p>\n<p>The core question is not \"can AI do this?\" but \"should AI do this?\" A language model can multiply two numbers, but a calculator is faster, cheaper, and always correct.</p>","sourceUrl":"/ai-and-ml-applied/foundations/when-to-use-ai","sourceAnchor":"overview","domain":"data-ai","accentColor":"#08d7d9"},{"id":"ac1477339575","subject":"ai-and-ml-applied","topic":"foundations","subtopic":"when-to-use-ai","chunkType":"pitfall","body":"## Pitfalls — When to Use AI\n\n**Reaching for AI when a regex would work**: If you need to extract email addresses from a fixed-format report, a regular expression is faster, cheaper, and 100% reliable. AI is for when patterns are complex and variable.\n\n**Using AI because it's trendy**: \"We should add AI to this\" is not a requirement. Start with the user problem, then evaluate solutions. Sometimes the best solution is a well-designed form.\n\n**Underestimating the \"sometimes wrong\" problem**: A model that is 95% accurate sounds good until you realize that means 1 in 20 users gets a wrong answer. At scale, that is thousands of errors per day.\n\n**Not defining success criteria upfront**: \"We want AI to make this better\" is not measurable. Define specific metrics: \"AI should correctly classify 90% of tickets within 2 seconds.\"\n\n**Ignoring the maintenance burden**: AI systems need ongoing monitoring, evaluation dataset updates, prompt tuning, and model upgrades. They are not \"set and forget.\"\n\n**All-or-nothing thinking**: AI does not have to make the final decision. Using AI to assist humans (highlight, suggest, pre-fill) captures most of the value with much less risk.","bodyHtml":"<h2>Pitfalls — When to Use AI</h2>\n<p><strong>Reaching for AI when a regex would work</strong>: If you need to extract email addresses from a fixed-format report, a regular expression is faster, cheaper, and 100% reliable. 
AI is for when patterns are complex and variable.</p>\n<p><strong>Using AI because it's trendy</strong>: \"We should add AI to this\" is not a requirement. Start with the user problem, then evaluate solutions. Sometimes the best solution is a well-designed form.</p>\n<p><strong>Underestimating the \"sometimes wrong\" problem</strong>: A model that is 95% accurate sounds good until you realize that means 1 in 20 users gets a wrong answer. At scale, that is thousands of errors per day.</p>\n<p><strong>Not defining success criteria upfront</strong>: \"We want AI to make this better\" is not measurable. Define specific metrics: \"AI should correctly classify 90% of tickets within 2 seconds.\"</p>\n<p><strong>Ignoring the maintenance burden</strong>: AI systems need ongoing monitoring, evaluation dataset updates, prompt tuning, and model upgrades. They are not \"set and forget.\"</p>\n<p><strong>All-or-nothing thinking</strong>: AI does not have to make the final decision. Using AI to assist humans (highlight, suggest, pre-fill) captures most of the value with much less risk.</p>","sourceUrl":"/ai-and-ml-applied/foundations/when-to-use-ai","sourceAnchor":"common-pitfalls","domain":"data-ai","accentColor":"#08d7d9"},{"id":"445e8f1bbfec","subject":"ai-and-ml-applied","topic":"foundations","subtopic":"when-to-use-ai","chunkType":"takeaway","body":"## Key takeaways — When to Use AI\n\nAI excels at tasks involving natural language, pattern recognition, recommendations, semantic search, and anomaly detection. 
These are problems where rules-based approaches fail.\n\nAI is a poor fit for deterministic logic, small datasets, regulatory-heavy domains, and any task where \"sometimes wrong\" has severe consequences.\n\nUse the decision framework: define the problem, check if rules work, assess error cost, evaluate data availability, then consider constraints.\n\nHybrid approaches (AI + rules + human oversight) are the most common and most effective pattern in production systems.\n\nAlways define measurable success criteria before starting. \"Add AI\" is not a goal; \"reduce ticket routing errors by 60%\" is.","bodyHtml":"<h2>Key takeaways — When to Use AI</h2>\n<p>AI excels at tasks involving natural language, pattern recognition, recommendations, semantic search, and anomaly detection. These are problems where rules-based approaches fail.</p>\n<p>AI is a poor fit for deterministic logic, small datasets, regulatory-heavy domains, and any task where \"sometimes wrong\" has severe consequences.</p>\n<p>Use the decision framework: define the problem, check if rules work, assess error cost, evaluate data availability, then consider constraints.</p>\n<p>Hybrid approaches (AI + rules + human oversight) are the most common and most effective pattern in production systems.</p>\n<p>Always define measurable success criteria before starting. \"Add AI\" is not a goal; \"reduce ticket routing errors by 60%\" is.</p>","sourceUrl":"/ai-and-ml-applied/foundations/when-to-use-ai","sourceAnchor":"key-takeaways","domain":"data-ai","accentColor":"#08d7d9"},{"id":"9bb6abe3deba","subject":"ai-and-ml-applied","topic":"prompt-engineering","subtopic":"effective-prompting","chunkType":"section","body":"## Effective Prompting\n\nPrompt engineering is the practice of crafting inputs to language models that produce reliable, useful outputs. 
It is the most cost-effective way to improve AI system quality: a better prompt costs little beyond engineering time, while a better model raises the cost of every request.\n\nThe core principle is simple: be specific. Vague instructions produce vague outputs. Clear, detailed instructions with explicit format requirements produce consistent, usable results.","bodyHtml":"<h2>Effective Prompting</h2>\n<p>Prompt engineering is the practice of crafting inputs to language models that produce reliable, useful outputs. It is the most cost-effective way to improve AI system quality: a better prompt costs little beyond engineering time, while a better model raises the cost of every request.</p>\n<p>The core principle is simple: be specific. Vague instructions produce vague outputs. Clear, detailed instructions with explicit format requirements produce consistent, usable results.</p>","sourceUrl":"/ai-and-ml-applied/prompt-engineering/effective-prompting","sourceAnchor":"overview","domain":"data-ai","accentColor":"#08d7d9"},{"id":"79a8e1a1f1bb","subject":"ai-and-ml-applied","topic":"prompt-engineering","subtopic":"effective-prompting","chunkType":"pitfall","body":"## Pitfalls — Effective Prompting\n\n**\"You are a helpful assistant\"**: This tells the model nothing it does not already know. Every token in your system prompt should add information. If you removed a sentence and the output would be the same, remove it.\n\n**Prompt stuffing**: Cramming every possible instruction into one prompt. If your prompt is 2,000 words, the model will forget parts of it. Keep instructions focused. Use multiple calls if needed.\n\n**Not testing with adversarial inputs**: Your prompt works great with clean data. What about empty strings, HTML, code injection, or a 100KB wall of text? Test edge cases.\n\n**Ignoring token costs**: Long system prompts are sent with every request. A 1,000-token system prompt at 100K requests/day adds up. Be concise.\n\n**Not versioning prompts**: Prompts are code. Version them, test them, review changes. 
A one-word change in a prompt can dramatically alter behavior.\n\n**Assuming determinism at temperature 0**: Even at temperature 0, LLM outputs can vary slightly between API calls or model versions. Always validate output format programmatically.","bodyHtml":"<h2>Pitfalls — Effective Prompting</h2>\n<p><strong>\"You are a helpful assistant\"</strong>: This tells the model nothing it does not already know. Every token in your system prompt should add information. If you removed a sentence and the output would be the same, remove it.</p>\n<p><strong>Prompt stuffing</strong>: Cramming every possible instruction into one prompt. If your prompt is 2,000 words, the model will forget parts of it. Keep instructions focused. Use multiple calls if needed.</p>\n<p><strong>Not testing with adversarial inputs</strong>: Your prompt works great with clean data. What about empty strings, HTML, code injection, or a 100KB wall of text? Test edge cases.</p>\n<p><strong>Ignoring token costs</strong>: Long system prompts are sent with every request. A 1,000-token system prompt at 100K requests/day adds up. Be concise.</p>\n<p><strong>Not versioning prompts</strong>: Prompts are code. Version them, test them, review changes. A one-word change in a prompt can dramatically alter behavior.</p>\n<p><strong>Assuming determinism at temperature 0</strong>: Even at temperature 0, LLM outputs can vary slightly between API calls or model versions. Always validate output format programmatically.</p>","sourceUrl":"/ai-and-ml-applied/prompt-engineering/effective-prompting","sourceAnchor":"common-pitfalls","domain":"data-ai","accentColor":"#08d7d9"},{"id":"56dff6e77955","subject":"ai-and-ml-applied","topic":"prompt-engineering","subtopic":"effective-prompting","chunkType":"takeaway","body":"## Key takeaways — Effective Prompting\n\nBe specific. Every vague word in your prompt is a chance for the model to do something unexpected. 
Specify format, length, focus, edge cases, and constraints.\n\nSystem prompts define behavior; user prompts provide input. Keep stable instructions in the system prompt and variable content in the user prompt.\n\nContext is critical. Models with relevant context dramatically outperform models without it. But context must be relevant and concise.\n\nUse temperature 0 for production tasks where consistency matters. Save higher temperatures for creative or exploratory tasks.\n\nIterate on prompts like you iterate on code. Start simple, test, identify failures, add specificity. Version and track your prompts.","bodyHtml":"<h2>Key takeaways — Effective Prompting</h2>\n<p>Be specific. Every vague word in your prompt is a chance for the model to do something unexpected. Specify format, length, focus, edge cases, and constraints.</p>\n<p>System prompts define behavior; user prompts provide input. Keep stable instructions in the system prompt and variable content in the user prompt.</p>\n<p>Context is critical. Models with relevant context dramatically outperform models without it. But context must be relevant and concise.</p>\n<p>Use temperature 0 for production tasks where consistency matters. Save higher temperatures for creative or exploratory tasks.</p>\n<p>Iterate on prompts like you iterate on code. Start simple, test, identify failures, add specificity. Version and track your prompts.</p>","sourceUrl":"/ai-and-ml-applied/prompt-engineering/effective-prompting","sourceAnchor":"key-takeaways","domain":"data-ai","accentColor":"#08d7d9"},{"id":"ef91fcb51166","subject":"ai-and-ml-applied","topic":"prompt-engineering","subtopic":"few-shot-and-chain-of-thought","chunkType":"section","body":"## Few-Shot & Chain-of-Thought\n\nZero-shot, few-shot, and chain-of-thought are the three most important prompting techniques. They form a progression: zero-shot is the simplest, few-shot adds examples, and chain-of-thought adds reasoning steps. 
Knowing when to use each saves time and improves output quality significantly.\n\nFew-shot prompting is the most underused technique in production AI systems. Adding 3-5 examples to a prompt often produces a bigger quality improvement than switching to a more expensive model.","bodyHtml":"<h2>Few-Shot &#x26; Chain-of-Thought</h2>\n<p>Zero-shot, few-shot, and chain-of-thought are the three most important prompting techniques. They form a progression: zero-shot is the simplest, few-shot adds examples, and chain-of-thought adds reasoning steps. Knowing when to use each saves time and improves output quality significantly.</p>\n<p>Few-shot prompting is the most underused technique in production AI systems. Adding 3-5 examples to a prompt often produces a bigger quality improvement than switching to a more expensive model.</p>","sourceUrl":"/ai-and-ml-applied/prompt-engineering/few-shot-and-chain-of-thought","sourceAnchor":"overview","domain":"data-ai","accentColor":"#08d7d9"},{"id":"5c20dc02c2a3","subject":"ai-and-ml-applied","topic":"prompt-engineering","subtopic":"few-shot-and-chain-of-thought","chunkType":"pitfall","body":"## Pitfalls — Few-Shot & Chain-of-Thought\n\n**Not using few-shot when you should**: This is the single most common missed optimization. If your zero-shot prompt produces inconsistent output, add 3 examples before trying anything else. It is almost always cheaper than switching to a better model.\n\n**Using identical examples**: If all your few-shot examples are easy cases, the model learns nothing about hard cases. Include at least one edge case and one negative example.\n\n**Using CoT for simple tasks**: Chain-of-thought adds latency and cost. For binary classification or simple extraction, it slows things down without improving quality.\n\n**Not extracting the final answer from CoT**: When using chain-of-thought, the reasoning is useful but the answer is what you need. 
Parse the output to extract just the final answer for downstream processing.\n\n**Putting examples in the user message instead of the system message**: Few-shot examples are instructions, not input. They belong in the system prompt where they are treated as stable context, not per-request content.\n\n**Too many examples**: After 5-7 examples, returns diminish. More examples mean more tokens per request, which adds cost at scale. Measure whether additional examples actually improve quality.","bodyHtml":"<h2>Pitfalls — Few-Shot &#x26; Chain-of-Thought</h2>\n<p><strong>Not using few-shot when you should</strong>: This is the single most common missed optimization. If your zero-shot prompt produces inconsistent output, add 3 examples before trying anything else. It is almost always cheaper than switching to a better model.</p>\n<p><strong>Using identical examples</strong>: If all your few-shot examples are easy cases, the model learns nothing about hard cases. Include at least one edge case and one negative example.</p>\n<p><strong>Using CoT for simple tasks</strong>: Chain-of-thought adds latency and cost. For binary classification or simple extraction, it slows things down without improving quality.</p>\n<p><strong>Not extracting the final answer from CoT</strong>: When using chain-of-thought, the reasoning is useful but the answer is what you need. Parse the output to extract just the final answer for downstream processing.</p>\n<p><strong>Putting examples in the user message instead of the system message</strong>: Few-shot examples are instructions, not input. They belong in the system prompt where they are treated as stable context, not per-request content.</p>\n<p><strong>Too many examples</strong>: After 5-7 examples, returns diminish. More examples mean more tokens per request, which adds cost at scale. 
Measure whether additional examples actually improve quality.</p>","sourceUrl":"/ai-and-ml-applied/prompt-engineering/few-shot-and-chain-of-thought","sourceAnchor":"common-pitfalls","domain":"data-ai","accentColor":"#08d7d9"},{"id":"7212d2045631","subject":"ai-and-ml-applied","topic":"prompt-engineering","subtopic":"few-shot-and-chain-of-thought","chunkType":"takeaway","body":"## Key takeaways — Few-Shot & Chain-of-Thought\n\nFew-shot prompting (3-5 examples) is the most underused and highest-impact technique. Try it before switching models or building complex pipelines.\n\nChain-of-thought improves reasoning tasks by forcing the model to work through the problem step by step. Skip it for simple tasks.\n\nChoose the technique based on task complexity: zero-shot for simple, few-shot for format/domain-specific, CoT for reasoning, ReAct for tool use.\n\nExample selection matters more than example quantity. Include edge cases, negative examples, and one example per output category.\n\nReAct (reasoning + tool use) is the pattern behind modern AI agents. The model decides what tools to call based on step-by-step reasoning.","bodyHtml":"<h2>Key takeaways — Few-Shot &#x26; Chain-of-Thought</h2>\n<p>Few-shot prompting (3-5 examples) is the most underused and highest-impact technique. Try it before switching models or building complex pipelines.</p>\n<p>Chain-of-thought improves reasoning tasks by forcing the model to work through the problem step by step. Skip it for simple tasks.</p>\n<p>Choose the technique based on task complexity: zero-shot for simple, few-shot for format/domain-specific, CoT for reasoning, ReAct for tool use.</p>\n<p>Example selection matters more than example quantity. Include edge cases, negative examples, and one example per output category.</p>\n<p>ReAct (reasoning + tool use) is the pattern behind modern AI agents. 
The model decides what tools to call based on step-by-step reasoning.</p>","sourceUrl":"/ai-and-ml-applied/prompt-engineering/few-shot-and-chain-of-thought","sourceAnchor":"key-takeaways","domain":"data-ai","accentColor":"#08d7d9"},{"id":"9acd044a84d2","subject":"ai-and-ml-applied","topic":"prompt-engineering","subtopic":"guardrails-and-output-control","chunkType":"section","body":"## Guardrails & Output Control\n\nLanguage models are probabilistic. They do not guarantee output format, content safety, or adherence to constraints. Guardrails are the mechanisms you build around model outputs to ensure reliability, safety, and consistency.\n\nProduction AI systems need guardrails at three levels: input (what goes into the model), output (what comes out), and structural (enforcing format and schema). Skipping any of these levels means your system will fail in unpredictable ways when real users interact with it.","bodyHtml":"<h2>Guardrails &#x26; Output Control</h2>\n<p>Language models are probabilistic. They do not guarantee output format, content safety, or adherence to constraints. Guardrails are the mechanisms you build around model outputs to ensure reliability, safety, and consistency.</p>\n<p>Production AI systems need guardrails at three levels: input (what goes into the model), output (what comes out), and structural (enforcing format and schema). Skipping any of these levels means your system will fail in unpredictable ways when real users interact with it.</p>","sourceUrl":"/ai-and-ml-applied/prompt-engineering/guardrails-and-output-control","sourceAnchor":"overview","domain":"data-ai","accentColor":"#08d7d9"},{"id":"c12d9bece44e","subject":"ai-and-ml-applied","topic":"prompt-engineering","subtopic":"guardrails-and-output-control","chunkType":"pitfall","body":"## Pitfalls — Guardrails & Output Control\n\n**Trusting JSON mode alone**: JSON mode guarantees valid JSON, not valid schema. 
Always validate against a schema with Pydantic or equivalent.\n\n**No retry logic**: Models sometimes produce invalid output. A single retry with a clearer prompt fixes most cases. Build retry logic into your pipeline.\n\n**Blocking on false positives**: Overly aggressive input filtering rejects legitimate queries. \"How do I ignore previous settings in the app?\" is a valid question, not an injection attack. Tune your filters.\n\n**Not logging guardrail triggers**: Every time a guardrail activates, log it. These logs tell you what your users are actually doing and where your model is failing.\n\n**Relying on prompt instructions for safety**: \"Never say harmful things\" in a system prompt is not a guardrail. It is a suggestion. Real guardrails are code that runs on the output.\n\n**Skipping output checks in development**: Guardrails feel unnecessary during development with clean test data. They become essential in production with real user input. Build them from the start.","bodyHtml":"<h2>Pitfalls — Guardrails &#x26; Output Control</h2>\n<p><strong>Trusting JSON mode alone</strong>: JSON mode guarantees valid JSON, not valid schema. Always validate against a schema with Pydantic or equivalent.</p>\n<p><strong>No retry logic</strong>: Models sometimes produce invalid output. A single retry with a clearer prompt fixes most cases. Build retry logic into your pipeline.</p>\n<p><strong>Blocking on false positives</strong>: Overly aggressive input filtering rejects legitimate queries. \"How do I ignore previous settings in the app?\" is a valid question, not an injection attack. Tune your filters.</p>\n<p><strong>Not logging guardrail triggers</strong>: Every time a guardrail activates, log it. These logs tell you what your users are actually doing and where your model is failing.</p>\n<p><strong>Relying on prompt instructions for safety</strong>: \"Never say harmful things\" in a system prompt is not a guardrail. It is a suggestion. 
Real guardrails are code that runs on the output.</p>\n<p><strong>Skipping output checks in development</strong>: Guardrails feel unnecessary during development with clean test data. They become essential in production with real user input. Build them from the start.</p>","sourceUrl":"/ai-and-ml-applied/prompt-engineering/guardrails-and-output-control","sourceAnchor":"common-pitfalls","domain":"data-ai","accentColor":"#08d7d9"},{"id":"e102ca73b01b","subject":"ai-and-ml-applied","topic":"prompt-engineering","subtopic":"guardrails-and-output-control","chunkType":"takeaway","body":"## Key takeaways — Guardrails & Output Control\n\nStructured output (JSON mode, function calling, schema validation) eliminates the most common class of LLM integration bugs. Use Pydantic or equivalent for validation.\n\nInput guardrails (length limits, injection detection, content filtering) protect against malicious and malformed inputs. No single defense is sufficient; layer multiple strategies.\n\nOutput guardrails (format validation, safety checks, length limits) ensure model responses meet your requirements before reaching users.\n\nPrompt injection is the SQL injection of AI applications. Defend with delimiters, explicit instructions, pattern detection, and output monitoring.\n\nBuild guardrails from day one. They are not optional extras for production. Logging guardrail triggers is essential for understanding real-world failure modes.","bodyHtml":"<h2>Key takeaways — Guardrails &#x26; Output Control</h2>\n<p>Structured output (JSON mode, function calling, schema validation) eliminates the most common class of LLM integration bugs. Use Pydantic or equivalent for validation.</p>\n<p>Input guardrails (length limits, injection detection, content filtering) protect against malicious and malformed inputs. 
No single defense is sufficient; layer multiple strategies.</p>\n<p>Output guardrails (format validation, safety checks, length limits) ensure model responses meet your requirements before reaching users.</p>\n<p>Prompt injection is the SQL injection of AI applications. Defend with delimiters, explicit instructions, pattern detection, and output monitoring.</p>\n<p>Build guardrails from day one. They are not optional extras for production. Logging guardrail triggers is essential for understanding real-world failure modes.</p>","sourceUrl":"/ai-and-ml-applied/prompt-engineering/guardrails-and-output-control","sourceAnchor":"key-takeaways","domain":"data-ai","accentColor":"#08d7d9"},{"id":"ecb3312430e7","subject":"ai-and-ml-applied","topic":"retrieval-augmented-generation","subtopic":"what-rag-is","chunkType":"section","body":"## What RAG Is\n\nRetrieval-Augmented Generation (RAG) is a technique that combines information retrieval with language model generation. Instead of relying solely on what the model learned during training, you retrieve relevant documents from your own data and include them in the prompt. The model then generates an answer grounded in those documents.\n\nRAG addresses two fundamental problems with LLMs: they have a knowledge cutoff (they know nothing that postdates their training data), and they hallucinate (they confidently state things that are not true). By giving the model the right documents at the right time, you get answers that are both current and grounded in real data.","bodyHtml":"<h2>What RAG Is</h2>\n<p>Retrieval-Augmented Generation (RAG) is a technique that combines information retrieval with language model generation. Instead of relying solely on what the model learned during training, you retrieve relevant documents from your own data and include them in the prompt. 
The model then generates an answer grounded in those documents.</p>\n<p>RAG addresses two fundamental problems with LLMs: they have a knowledge cutoff (they know nothing that postdates their training data), and they hallucinate (they confidently state things that are not true). By giving the model the right documents at the right time, you get answers that are both current and grounded in real data.</p>","sourceUrl":"/ai-and-ml-applied/retrieval-augmented-generation/what-rag-is","sourceAnchor":"overview","domain":"data-ai","accentColor":"#08d7d9"},{"id":"943258ff1c23","subject":"ai-and-ml-applied","topic":"retrieval-augmented-generation","subtopic":"what-rag-is","chunkType":"pitfall","body":"## Pitfalls — What RAG Is\n\n**Skipping evaluation**: \"It seems to work\" is not evaluation. Build a test set of question-answer pairs and measure retrieval quality (are the right documents found?) and answer quality (is the generated answer correct?).\n\n**Wrong chunk size**: Too small and chunks lack context. Too big and you waste context window space on irrelevant text. Start with 500-1000 tokens per chunk and adjust based on your data.\n\n**No \"I don't know\" path**: If the retrieved documents do not contain the answer, the model should say so. Without explicit instructions, models will hallucinate an answer using the irrelevant context.\n\n**Ignoring retrieval quality**: The generation can only be as good as the retrieval. If the wrong documents are retrieved, the model will produce a confident wrong answer based on those documents.\n\n**Not considering context window limits**: If you retrieve 10 chunks of 1000 tokens each, that is 10K tokens of context. Add the system prompt and the model's output, and you may be hitting limits. Budget your context window.\n\n**Treating RAG as a one-time setup**: RAG systems need ongoing maintenance. Documents change, user questions evolve, and retrieval quality can degrade. 
Monitor and iterate.","bodyHtml":"<h2>Pitfalls — What RAG Is</h2>\n<p><strong>Skipping evaluation</strong>: \"It seems to work\" is not evaluation. Build a test set of question-answer pairs and measure retrieval quality (are the right documents found?) and answer quality (is the generated answer correct?).</p>\n<p><strong>Wrong chunk size</strong>: Too small and chunks lack context. Too big and you waste context window space on irrelevant text. Start with 500-1000 tokens per chunk and adjust based on your data.</p>\n<p><strong>No \"I don't know\" path</strong>: If the retrieved documents do not contain the answer, the model should say so. Without explicit instructions, models will hallucinate an answer using the irrelevant context.</p>\n<p><strong>Ignoring retrieval quality</strong>: The generation can only be as good as the retrieval. If the wrong documents are retrieved, the model will produce a confident wrong answer based on those documents.</p>\n<p><strong>Not considering context window limits</strong>: If you retrieve 10 chunks of 1000 tokens each, that is 10K tokens of context. Add the system prompt and the model's output, and you may be hitting limits. Budget your context window.</p>\n<p><strong>Treating RAG as a one-time setup</strong>: RAG systems need ongoing maintenance. Documents change, user questions evolve, and retrieval quality can degrade. Monitor and iterate.</p>","sourceUrl":"/ai-and-ml-applied/retrieval-augmented-generation/what-rag-is","sourceAnchor":"common-pitfalls","domain":"data-ai","accentColor":"#08d7d9"},{"id":"568dd3f08760","subject":"ai-and-ml-applied","topic":"retrieval-augmented-generation","subtopic":"what-rag-is","chunkType":"takeaway","body":"## Key takeaways — What RAG Is\n\nRAG combines retrieval (search) with generation (LLM) to ground model outputs in real data. 
This reduces hallucination and solves the knowledge cutoff problem.\n\nRAG beats fine-tuning for most knowledge-grounding use cases because it supports real-time updates, source attribution, and access control without retraining.\n\nThe quality of a RAG system depends on both retrieval quality (finding the right documents) and generation quality (synthesizing a correct answer). Evaluate both separately.\n\nRAG is the standard architecture for any application where the model needs to answer questions about data it was not trained on: support bots, internal search, document Q&A, legal research.\n\nAlways include an \"I don't know\" instruction. A model that admits uncertainty is more trustworthy than one that confidently generates wrong answers from irrelevant context.","bodyHtml":"<h2>Key takeaways — What RAG Is</h2>\n<p>RAG combines retrieval (search) with generation (LLM) to ground model outputs in real data. This reduces hallucination and solves the knowledge cutoff problem.</p>\n<p>RAG beats fine-tuning for most knowledge-grounding use cases because it supports real-time updates, source attribution, and access control without retraining.</p>\n<p>The quality of a RAG system depends on both retrieval quality (finding the right documents) and generation quality (synthesizing a correct answer). Evaluate both separately.</p>\n<p>RAG is the standard architecture for any application where the model needs to answer questions about data it was not trained on: support bots, internal search, document Q&#x26;A, legal research.</p>\n<p>Always include an \"I don't know\" instruction. 
A model that admits uncertainty is more trustworthy than one that confidently generates wrong answers from irrelevant context.</p>","sourceUrl":"/ai-and-ml-applied/retrieval-augmented-generation/what-rag-is","sourceAnchor":"key-takeaways","domain":"data-ai","accentColor":"#08d7d9"},{"id":"a81dff398d1f","subject":"ai-and-ml-applied","topic":"retrieval-augmented-generation","subtopic":"chunking-and-indexing","chunkType":"section","body":"## Chunking & Indexing\n\nBefore a RAG system can retrieve relevant documents, those documents must be split into chunks, converted to embeddings, and stored in a searchable index. This process — the indexing pipeline — determines the ceiling of your RAG system's quality. Bad chunking means bad retrieval, and no amount of prompt engineering fixes that.\n\nThe core tradeoff: chunks must be small enough to be specific (so you retrieve relevant content, not entire chapters) but large enough to be meaningful (so the retrieved content has enough context to be useful).","bodyHtml":"<h2>Chunking &#x26; Indexing</h2>\n<p>Before a RAG system can retrieve relevant documents, those documents must be split into chunks, converted to embeddings, and stored in a searchable index. This process — the indexing pipeline — determines the ceiling of your RAG system's quality. Bad chunking means bad retrieval, and no amount of prompt engineering fixes that.</p>\n<p>The core tradeoff: chunks must be small enough to be specific (so you retrieve relevant content, not entire chapters) but large enough to be meaningful (so the retrieved content has enough context to be useful).</p>","sourceUrl":"/ai-and-ml-applied/retrieval-augmented-generation/chunking-and-indexing","sourceAnchor":"overview","domain":"data-ai","accentColor":"#08d7d9"},{"id":"7ae04a2af391","subject":"ai-and-ml-applied","topic":"retrieval-augmented-generation","subtopic":"chunking-and-indexing","chunkType":"section","body":"## Document Parsing\n\nBefore chunking, you need clean text. 
PDFs are the hardest (varying layouts, tables, OCR quality). HTML needs boilerplate stripped. Markdown is cleanest. Use libraries like PyMuPDF or pdfplumber for PDFs with Tesseract as an OCR fallback. Always clean artifacts: collapse excessive whitespace, strip headers/footers, normalize encoding.","bodyHtml":"<h2>Document Parsing</h2>\n<p>Before chunking, you need clean text. PDFs are the hardest (varying layouts, tables, OCR quality). HTML needs boilerplate stripped. Markdown is cleanest. Use libraries like PyMuPDF or pdfplumber for PDFs with Tesseract as an OCR fallback. Always clean artifacts: collapse excessive whitespace, strip headers/footers, normalize encoding.</p>","sourceUrl":"/ai-and-ml-applied/retrieval-augmented-generation/chunking-and-indexing","sourceAnchor":"document-parsing","domain":"data-ai","accentColor":"#08d7d9"},{"id":"42625419ccf1","subject":"ai-and-ml-applied","topic":"retrieval-augmented-generation","subtopic":"chunking-and-indexing","chunkType":"pitfall","body":"## Pitfalls — Chunking & Indexing\n\n**One chunk size for all content**: A 500-token chunk works for prose but destroys code (splits functions mid-body) and tables (separates headers from data). Adapt chunking strategy to content type.\n\n**No overlap**: Chunks that split mid-sentence or mid-paragraph lose critical context at boundaries. Always use overlap, typically 10-20% of chunk size.\n\n**Ignoring metadata**: Without metadata, you cannot filter by source, date, or department. This means every query searches everything, returning irrelevant results for scoped questions.\n\n**Not testing chunk quality**: After chunking, manually inspect 20-30 chunks. Are they coherent? Do they contain complete thoughts? Would a human find them useful as context for answering questions?\n\n**Embedding stale data**: If your documents change weekly but your index updates monthly, users get outdated answers. 
Match index freshness to data change frequency.\n\n**Skipping keyword search**: Pure semantic search fails on exact terms (error codes, product SKUs, names). Hybrid search catches these cases with minimal additional complexity.","bodyHtml":"<h2>Pitfalls — Chunking &#x26; Indexing</h2>\n<p><strong>One chunk size for all content</strong>: A 500-token chunk works for prose but destroys code (splits functions mid-body) and tables (separates headers from data). Adapt chunking strategy to content type.</p>\n<p><strong>No overlap</strong>: Chunks that split mid-sentence or mid-paragraph lose critical context at boundaries. Always use overlap, typically 10-20% of chunk size.</p>\n<p><strong>Ignoring metadata</strong>: Without metadata, you cannot filter by source, date, or department. This means every query searches everything, returning irrelevant results for scoped questions.</p>\n<p><strong>Not testing chunk quality</strong>: After chunking, manually inspect 20-30 chunks. Are they coherent? Do they contain complete thoughts? Would a human find them useful as context for answering questions?</p>\n<p><strong>Embedding stale data</strong>: If your documents change weekly but your index updates monthly, users get outdated answers. Match index freshness to data change frequency.</p>\n<p><strong>Skipping keyword search</strong>: Pure semantic search fails on exact terms (error codes, product SKUs, names). Hybrid search catches these cases with minimal additional complexity.</p>","sourceUrl":"/ai-and-ml-applied/retrieval-augmented-generation/chunking-and-indexing","sourceAnchor":"common-pitfalls","domain":"data-ai","accentColor":"#08d7d9"},{"id":"9a5aaebf744c","subject":"ai-and-ml-applied","topic":"retrieval-augmented-generation","subtopic":"chunking-and-indexing","chunkType":"takeaway","body":"## Key takeaways — Chunking & Indexing\n\nChunk size is the most impactful parameter in a RAG system. 
Start with 500-800 tokens, use overlap of 10-20%, and adjust based on your data and evaluation results.\n\nSemantic chunking (splitting at natural boundaries like headers and paragraphs) produces better chunks than fixed-size splitting for structured documents.\n\nMetadata on chunks enables filtering, which dramatically improves retrieval relevance for scoped queries (e.g., \"search only in HR policies\").\n\nHybrid search (keyword + semantic) outperforms either method alone. Keyword search catches exact terms; semantic search catches meaning.\n\nThe indexing pipeline needs maintenance. Documents change, and your index must stay fresh. Incremental updates are the standard approach.","bodyHtml":"<h2>Key takeaways — Chunking &#x26; Indexing</h2>\n<p>Chunk size is the most impactful parameter in a RAG system. Start with 500-800 tokens, use overlap of 10-20%, and adjust based on your data and evaluation results.</p>\n<p>Semantic chunking (splitting at natural boundaries like headers and paragraphs) produces better chunks than fixed-size splitting for structured documents.</p>\n<p>Metadata on chunks enables filtering, which dramatically improves retrieval relevance for scoped queries (e.g., \"search only in HR policies\").</p>\n<p>Hybrid search (keyword + semantic) outperforms either method alone. Keyword search catches exact terms; semantic search catches meaning.</p>\n<p>The indexing pipeline needs maintenance. Documents change, and your index must stay fresh. 
Incremental updates are the standard approach.</p>","sourceUrl":"/ai-and-ml-applied/retrieval-augmented-generation/chunking-and-indexing","sourceAnchor":"key-takeaways","domain":"data-ai","accentColor":"#08d7d9"},{"id":"e7af43e5c244","subject":"ai-and-ml-applied","topic":"retrieval-augmented-generation","subtopic":"building-a-rag-system","chunkType":"section","body":"## Building a RAG System\n\nThis document walks through building a complete RAG system end-to-end: document ingestion, embedding generation, vector storage, retrieval, prompt construction, and generation. It also covers evaluation, common failure modes, and how to debug them.\n\nA working RAG system is not complicated. The first version can be built in a day. Making it reliable, accurate, and fast enough for production takes iteration guided by evaluation.","bodyHtml":"<h2>Building a RAG System</h2>\n<p>This document walks through building a complete RAG system end-to-end: document ingestion, embedding generation, vector storage, retrieval, prompt construction, and generation. It also covers evaluation, common failure modes, and how to debug them.</p>\n<p>A working RAG system is not complicated. The first version can be built in a day. Making it reliable, accurate, and fast enough for production takes iteration guided by evaluation.</p>","sourceUrl":"/ai-and-ml-applied/retrieval-augmented-generation/building-a-rag-system","sourceAnchor":"overview","domain":"data-ai","accentColor":"#08d7d9"},{"id":"78a482a999ab","subject":"ai-and-ml-applied","topic":"retrieval-augmented-generation","subtopic":"building-a-rag-system","chunkType":"section","body":"## Step 2: Chunking\n\nTry semantic chunking first (split on headers/paragraphs), fall back to recursive fixed-size chunking (500 tokens, 100 overlap) for unstructured text. 
Preserve metadata lineage: each chunk carries its source document ID, chunk index, and original document metadata.","bodyHtml":"<h2>Step 2: Chunking</h2>\n<p>Try semantic chunking first (split on headers/paragraphs), fall back to recursive fixed-size chunking (500 tokens, 100 overlap) for unstructured text. Preserve metadata lineage: each chunk carries its source document ID, chunk index, and original document metadata.</p>","sourceUrl":"/ai-and-ml-applied/retrieval-augmented-generation/building-a-rag-system","sourceAnchor":"step-2-chunking","domain":"data-ai","accentColor":"#08d7d9"},{"id":"af84bb3d46ee","subject":"ai-and-ml-applied","topic":"retrieval-augmented-generation","subtopic":"building-a-rag-system","chunkType":"pitfall","body":"## Pitfalls — Building a RAG System\n\n**Shipping without evaluation**: \"It seems to work\" is not enough. Build a test set of at least 50 question-answer pairs. Run automated evals on every change to chunking, prompts, or models.\n\n**Not inspecting retrieved chunks**: When debugging, always look at what was actually retrieved. Most RAG failures are retrieval failures, not generation failures.\n\n**Overly large chunks**: Using 2000-token chunks means fewer chunks fit in context. If the relevant information is 50 tokens within a 2000-token chunk, 97.5% of that context is noise.\n\n**No similarity threshold**: Returning the top 5 results even when none are relevant leads to hallucination. Set a minimum similarity threshold and return \"I don't know\" when nothing qualifies.\n\n**Ignoring document freshness**: If your knowledge base has versioned documents, old versions can poison results. Delete or demote outdated chunks when new versions are indexed.\n\n**Building before measuring**: Set up evaluation before building the full pipeline. 
Otherwise you have no way to know if your changes are improvements.","bodyHtml":"<h2>Pitfalls — Building a RAG System</h2>\n<p><strong>Shipping without evaluation</strong>: \"It seems to work\" is not enough. Build a test set of at least 50 question-answer pairs. Run automated evals on every change to chunking, prompts, or models.</p>\n<p><strong>Not inspecting retrieved chunks</strong>: When debugging, always look at what was actually retrieved. Most RAG failures are retrieval failures, not generation failures.</p>\n<p><strong>Overly large chunks</strong>: Using 2000-token chunks means fewer chunks fit in context. If the relevant information is 50 tokens within a 2000-token chunk, 97.5% of that context is noise.</p>\n<p><strong>No similarity threshold</strong>: Returning the top 5 results even when none are relevant leads to hallucination. Set a minimum similarity threshold and return \"I don't know\" when nothing qualifies.</p>\n<p><strong>Ignoring document freshness</strong>: If your knowledge base has versioned documents, old versions can poison results. Delete or demote outdated chunks when new versions are indexed.</p>\n<p><strong>Building before measuring</strong>: Set up evaluation before building the full pipeline. Otherwise you have no way to know if your changes are improvements.</p>","sourceUrl":"/ai-and-ml-applied/retrieval-augmented-generation/building-a-rag-system","sourceAnchor":"common-pitfalls","domain":"data-ai","accentColor":"#08d7d9"},{"id":"ea39376c57d2","subject":"ai-and-ml-applied","topic":"retrieval-augmented-generation","subtopic":"building-a-rag-system","chunkType":"takeaway","body":"## Key takeaways — Building a RAG System\n\nA complete RAG system has two pipelines: offline (ingest, chunk, embed, store) and online (embed query, retrieve, prompt, generate). Both need attention.\n\nEvaluate retrieval and generation separately. Most quality problems are retrieval problems. 
If the right chunks are not found, the generation cannot be correct.\n\nStart simple: Chroma or pgvector, fixed-size chunks, text-embedding-3-small, and GPT-4o. Optimize only when evaluation shows where the bottleneck is.\n\nSet a similarity threshold and implement \"I don't know\" responses. A system that admits uncertainty is more trustworthy than one that confidently hallucinates.\n\nDebug systematically: inspect retrieved chunks first, then check if the generation prompt is clear, then consider model quality. Most issues live in retrieval or chunking.","bodyHtml":"<h2>Key takeaways — Building a RAG System</h2>\n<p>A complete RAG system has two pipelines: offline (ingest, chunk, embed, store) and online (embed query, retrieve, prompt, generate). Both need attention.</p>\n<p>Evaluate retrieval and generation separately. Most quality problems are retrieval problems. If the right chunks are not found, the generation cannot be correct.</p>\n<p>Start simple: Chroma or pgvector, fixed-size chunks, text-embedding-3-small, and GPT-4o. Optimize only when evaluation shows where the bottleneck is.</p>\n<p>Set a similarity threshold and implement \"I don't know\" responses. A system that admits uncertainty is more trustworthy than one that confidently hallucinates.</p>\n<p>Debug systematically: inspect retrieved chunks first, then check if the generation prompt is clear, then consider model quality. Most issues live in retrieval or chunking.</p>","sourceUrl":"/ai-and-ml-applied/retrieval-augmented-generation/building-a-rag-system","sourceAnchor":"key-takeaways","domain":"data-ai","accentColor":"#08d7d9"},{"id":"ff5f6d21e925","subject":"ai-and-ml-applied","topic":"fine-tuning","subtopic":"when-to-fine-tune","chunkType":"section","body":"## When to Fine-Tune\n\nFine-tuning is the process of taking a pre-trained model and training it further on your own data so it learns your specific patterns, style, or domain knowledge. 
It is a powerful technique — and it is almost never the first thing you should try.\n\nThe escalation ladder for AI quality improvement is: prompting, few-shot examples, RAG, then fine-tuning. Each step is more expensive and complex than the last. Most production AI features never need fine-tuning because the earlier steps are sufficient.","bodyHtml":"<h2>When to Fine-Tune</h2>\n<p>Fine-tuning is the process of taking a pre-trained model and training it further on your own data so it learns your specific patterns, style, or domain knowledge. It is a powerful technique — and it is almost never the first thing you should try.</p>\n<p>The escalation ladder for AI quality improvement is: prompting, few-shot examples, RAG, then fine-tuning. Each step is more expensive and complex than the last. Most production AI features never need fine-tuning because the earlier steps are sufficient.</p>","sourceUrl":"/ai-and-ml-applied/fine-tuning/when-to-fine-tune","sourceAnchor":"overview","domain":"data-ai","accentColor":"#08d7d9"},{"id":"a8011ccb6101","subject":"ai-and-ml-applied","topic":"fine-tuning","subtopic":"when-to-fine-tune","chunkType":"pitfall","body":"## Pitfalls — When to Fine-Tune\n\n**Fine-tuning as the first step**: This is the most expensive mistake. Always try prompting and few-shot first. Many teams spend weeks preparing fine-tuning data only to discover that a better prompt solves the problem.\n\n**Fine-tuning for knowledge**: Use RAG for knowledge grounding. Fine-tuning bakes knowledge into model weights where it cannot be easily updated, audited, or attributed.\n\n**Not evaluating before and after**: Without baseline metrics from prompting, you cannot prove fine-tuning was worth the effort. Measure the prompt-based approach first, then measure the fine-tuned approach on the same test set.\n\n**Using noisy training data**: 1,000 carefully reviewed examples outperform 50,000 scraped examples with errors. 
Quality over quantity applies double to fine-tuning data.\n\n**Catastrophic forgetting**: Fine-tuning on a narrow dataset can degrade the model's general capabilities. Test not just your target task but also related tasks the model should still handle.\n\n**One-and-done training**: Fine-tuned models need retraining as your domain evolves. Budget for ongoing data collection, evaluation, and retraining cycles.","bodyHtml":"<h2>Pitfalls — When to Fine-Tune</h2>\n<p><strong>Fine-tuning as the first step</strong>: This is the most expensive mistake. Always try prompting and few-shot first. Many teams spend weeks preparing fine-tuning data only to discover that a better prompt solves the problem.</p>\n<p><strong>Fine-tuning for knowledge</strong>: Use RAG for knowledge grounding. Fine-tuning bakes knowledge into model weights where it cannot be easily updated, audited, or attributed.</p>\n<p><strong>Not evaluating before and after</strong>: Without baseline metrics from prompting, you cannot prove fine-tuning was worth the effort. Measure the prompt-based approach first, then measure the fine-tuned approach on the same test set.</p>\n<p><strong>Using noisy training data</strong>: 1,000 carefully reviewed examples outperform 50,000 scraped examples with errors. Quality over quantity applies double to fine-tuning data.</p>\n<p><strong>Catastrophic forgetting</strong>: Fine-tuning on a narrow dataset can degrade the model's general capabilities. Test not just your target task but also related tasks the model should still handle.</p>\n<p><strong>One-and-done training</strong>: Fine-tuned models need retraining as your domain evolves. 
Budget for ongoing data collection, evaluation, and retraining cycles.</p>","sourceUrl":"/ai-and-ml-applied/fine-tuning/when-to-fine-tune","sourceAnchor":"common-pitfalls","domain":"data-ai","accentColor":"#08d7d9"},{"id":"0862256ad8c6","subject":"ai-and-ml-applied","topic":"fine-tuning","subtopic":"when-to-fine-tune","chunkType":"takeaway","body":"## Key takeaways — When to Fine-Tune\n\nFollow the escalation ladder: prompting, few-shot, RAG, then fine-tuning. Each step is more expensive. Most features stop at step 1 or 2.\n\nFine-tune when you need consistent style/format, domain-specific behavior, or cost/latency reduction through smaller models. These are the strongest use cases.\n\nDo not fine-tune for knowledge (use RAG), for general tasks (use prompting), or when you have fewer than 500 quality examples (use few-shot).\n\nAlways measure the baseline (prompt-only performance) before investing in fine-tuning. If prompting gets you to 95% and fine-tuning gets you to 97%, the improvement may not justify the cost.\n\nFine-tuning is not a one-time investment. Plan for ongoing data collection, evaluation, and retraining as your domain evolves.","bodyHtml":"<h2>Key takeaways — When to Fine-Tune</h2>\n<p>Follow the escalation ladder: prompting, few-shot, RAG, then fine-tuning. Each step is more expensive. Most features stop at step 1 or 2.</p>\n<p>Fine-tune when you need consistent style/format, domain-specific behavior, or cost/latency reduction through smaller models. These are the strongest use cases.</p>\n<p>Do not fine-tune for knowledge (use RAG), for general tasks (use prompting), or when you have fewer than 500 quality examples (use few-shot).</p>\n<p>Always measure the baseline (prompt-only performance) before investing in fine-tuning. If prompting gets you to 95% and fine-tuning gets you to 97%, the improvement may not justify the cost.</p>\n<p>Fine-tuning is not a one-time investment. 
Plan for ongoing data collection, evaluation, and retraining as your domain evolves.</p>","sourceUrl":"/ai-and-ml-applied/fine-tuning/when-to-fine-tune","sourceAnchor":"key-takeaways","domain":"data-ai","accentColor":"#08d7d9"},{"id":"d713576b92f7","subject":"ai-and-ml-applied","topic":"fine-tuning","subtopic":"data-preparation","chunkType":"section","body":"## Data Preparation\n\nThe quality of your fine-tuned model is determined by the quality of your training data. This is not a platitude — it is the single most predictive factor. A model fine-tuned on 1,000 carefully curated examples will outperform one trained on 100,000 noisy examples almost every time.\n\nData preparation for fine-tuning is methodical work: collecting examples, formatting them correctly, cleaning errors, removing duplicates, balancing categories, and validating the final dataset. Shortcuts here cost you weeks of debugging later.","bodyHtml":"<h2>Data Preparation</h2>\n<p>The quality of your fine-tuned model is determined by the quality of your training data. This is not a platitude — it is the single most predictive factor. A model fine-tuned on 1,000 carefully curated examples will outperform one trained on 100,000 noisy examples almost every time.</p>\n<p>Data preparation for fine-tuning is methodical work: collecting examples, formatting them correctly, cleaning errors, removing duplicates, balancing categories, and validating the final dataset. Shortcuts here cost you weeks of debugging later.</p>","sourceUrl":"/ai-and-ml-applied/fine-tuning/data-preparation","sourceAnchor":"overview","domain":"data-ai","accentColor":"#08d7d9"},{"id":"cdd5f2bfcc15","subject":"ai-and-ml-applied","topic":"fine-tuning","subtopic":"data-preparation","chunkType":"pitfall","body":"## Pitfalls — Data Preparation\n\n**Garbage in, garbage out**: This cliche is especially true for fine-tuning. Every error in your training data teaches the model to make that error. 
Spend 80% of your fine-tuning effort on data quality.\n\n**Training on model outputs**: Fine-tuning GPT-4o-mini on outputs from GPT-4o-mini creates a feedback loop. Use human-written or expert-reviewed outputs as the gold standard.\n\n**Ignoring PII**: If your training data contains personal information, the model may memorize and reproduce it. Scrub PII before training.\n\n**No held-out test set**: If you evaluate on training data, you measure memorization, not generalization. Always hold out a test set that the model never sees during training.\n\n**Imbalanced data without addressing it**: A model trained on 95% positive and 5% negative examples will almost always predict positive. Balance your categories or at least acknowledge the bias.\n\n**Quantity over quality**: 500 expert-reviewed examples beat 50,000 noisy scraped examples. If you have to choose between more data and better data, choose better data every time.","bodyHtml":"<h2>Pitfalls — Data Preparation</h2>\n<p><strong>Garbage in, garbage out</strong>: This cliche is especially true for fine-tuning. Every error in your training data teaches the model to make that error. Spend 80% of your fine-tuning effort on data quality.</p>\n<p><strong>Training on model outputs</strong>: Fine-tuning GPT-4o-mini on outputs from GPT-4o-mini creates a feedback loop. Use human-written or expert-reviewed outputs as the gold standard.</p>\n<p><strong>Ignoring PII</strong>: If your training data contains personal information, the model may memorize and reproduce it. Scrub PII before training.</p>\n<p><strong>No held-out test set</strong>: If you evaluate on training data, you measure memorization, not generalization. Always hold out a test set that the model never sees during training.</p>\n<p><strong>Imbalanced data without addressing it</strong>: A model trained on 95% positive and 5% negative examples will almost always predict positive. 
Balance your categories or at least acknowledge the bias.</p>\n<p><strong>Quantity over quality</strong>: 500 expert-reviewed examples beat 50,000 noisy scraped examples. If you have to choose between more data and better data, choose better data every time.</p>","sourceUrl":"/ai-and-ml-applied/fine-tuning/data-preparation","sourceAnchor":"common-pitfalls","domain":"data-ai","accentColor":"#08d7d9"},{"id":"a8f1f668a543","subject":"ai-and-ml-applied","topic":"fine-tuning","subtopic":"data-preparation","chunkType":"takeaway","body":"## Key takeaways — Data Preparation\n\nData quality is the most important factor in fine-tuning success. Invest 80% of your fine-tuning time in data preparation.\n\nUse the instruction-response format: system prompt, user message, ideal assistant response. Multi-turn conversations work the same way with more messages.\n\nClean your data systematically: remove bad examples, deduplicate, balance categories, scrub PII. Validate with human review before training.\n\nSynthetic data generation (using a stronger model to create training data) is a legitimate augmentation strategy, but always mix with real data and validate quality.\n\nAlways split into train/validation/test sets. Never evaluate on training data. Hold the test set sacred — use it only for final evaluation.","bodyHtml":"<h2>Key takeaways — Data Preparation</h2>\n<p>Data quality is the most important factor in fine-tuning success. Invest 80% of your fine-tuning time in data preparation.</p>\n<p>Use the instruction-response format: system prompt, user message, ideal assistant response. Multi-turn conversations work the same way with more messages.</p>\n<p>Clean your data systematically: remove bad examples, deduplicate, balance categories, scrub PII. 
Validate with human review before training.</p>\n<p>Synthetic data generation (using a stronger model to create training data) is a legitimate augmentation strategy, but always mix with real data and validate quality.</p>\n<p>Always split into train/validation/test sets. Never evaluate on training data. Hold the test set sacred — use it only for final evaluation.</p>","sourceUrl":"/ai-and-ml-applied/fine-tuning/data-preparation","sourceAnchor":"key-takeaways","domain":"data-ai","accentColor":"#08d7d9"},{"id":"3fae68bb2894","subject":"ai-and-ml-applied","topic":"fine-tuning","subtopic":"fine-tuning-techniques","chunkType":"section","body":"## Fine-Tuning Techniques\n\nThere are multiple ways to fine-tune a language model, ranging from full fine-tuning (updating every parameter) to parameter-efficient methods like LoRA that train a tiny fraction of the model. The right choice depends on your hardware, budget, data size, and performance requirements.\n\nFor most practitioners, the decision is simple: use LoRA. Full fine-tuning requires more memory than most teams have available, and LoRA achieves comparable quality for the vast majority of use cases.","bodyHtml":"<h2>Fine-Tuning Techniques</h2>\n<p>There are multiple ways to fine-tune a language model, ranging from full fine-tuning (updating every parameter) to parameter-efficient methods like LoRA that train a tiny fraction of the model. The right choice depends on your hardware, budget, data size, and performance requirements.</p>\n<p>For most practitioners, the decision is simple: use LoRA. 
Full fine-tuning requires more memory than most teams have available, and LoRA achieves comparable quality for the vast majority of use cases.</p>","sourceUrl":"/ai-and-ml-applied/fine-tuning/fine-tuning-techniques","sourceAnchor":"overview","domain":"data-ai","accentColor":"#08d7d9"},{"id":"1e6d9b03380c","subject":"ai-and-ml-applied","topic":"fine-tuning","subtopic":"fine-tuning-techniques","chunkType":"section","body":"## Training Process\n\nUse Hugging Face's `SFTTrainer` (from the `trl` library) for the training loop. Load your dataset from JSONL, format examples using the tokenizer's chat template, and enable `packing=True` to pack short examples together for efficiency. Save the adapter after training with `trainer.save_model()`.","bodyHtml":"<h2>Training Process</h2>\n<p>Use Hugging Face's <code>SFTTrainer</code> (from the <code>trl</code> library) for the training loop. Load your dataset from JSONL, format examples using the tokenizer's chat template, and enable <code>packing=True</code> to pack short examples together for efficiency. Save the adapter after training with <code>trainer.save_model()</code>.</p>","sourceUrl":"/ai-and-ml-applied/fine-tuning/fine-tuning-techniques","sourceAnchor":"training-process","domain":"data-ai","accentColor":"#08d7d9"},{"id":"2a2f7219d561","subject":"ai-and-ml-applied","topic":"fine-tuning","subtopic":"fine-tuning-techniques","chunkType":"pitfall","body":"## Pitfalls — Fine-Tuning Techniques\n\n**Not comparing against prompting baselines**: If you don't measure prompt-only performance first, you cannot prove fine-tuning was worth the effort. Always establish a baseline.\n\n**Training for too many epochs**: More epochs do not always produce a better model. Watch validation loss and stop when it plateaus or increases. Three epochs is a good starting point.\n\n**Using too high a learning rate**: This is the most common training failure. 
If loss spikes or the model produces garbage, reduce the learning rate by 2-5x.\n\n**Ignoring catastrophic forgetting**: Test your fine-tuned model on general tasks, not just your specific task. If it has lost the ability to follow basic instructions, your learning rate or epoch count is too high.\n\n**Evaluating only on loss**: Low loss does not mean the model is good at your task. Always evaluate on real task examples with task-specific metrics.\n\n**Not versioning models and data**: Track which data version produced which model. When a fine-tuned model behaves unexpectedly, you need to trace back to the training data.","bodyHtml":"<h2>Pitfalls — Fine-Tuning Techniques</h2>\n<p><strong>Not comparing against prompting baselines</strong>: If you don't measure prompt-only performance first, you cannot prove fine-tuning was worth the effort. Always establish a baseline.</p>\n<p><strong>Training for too many epochs</strong>: More epochs do not always produce a better model. Watch validation loss and stop when it plateaus or increases. Three epochs is a good starting point.</p>\n<p><strong>Using too high a learning rate</strong>: This is the most common training failure. If loss spikes or the model produces garbage, reduce the learning rate by 2-5x.</p>\n<p><strong>Ignoring catastrophic forgetting</strong>: Test your fine-tuned model on general tasks, not just your specific task. If it has lost the ability to follow basic instructions, your learning rate or epoch count is too high.</p>\n<p><strong>Evaluating only on loss</strong>: Low loss does not mean the model is good at your task. Always evaluate on real task examples with task-specific metrics.</p>\n<p><strong>Not versioning models and data</strong>: Track which data version produced which model. 
When a fine-tuned model behaves unexpectedly, you need to trace back to the training data.</p>","sourceUrl":"/ai-and-ml-applied/fine-tuning/fine-tuning-techniques","sourceAnchor":"common-pitfalls","domain":"data-ai","accentColor":"#08d7d9"},{"id":"73f97161abd8","subject":"ai-and-ml-applied","topic":"fine-tuning","subtopic":"fine-tuning-techniques","chunkType":"takeaway","body":"## Key takeaways — Fine-Tuning Techniques\n\nLoRA is the default fine-tuning technique for most practitioners. It achieves 95-99% of full fine-tuning quality at a fraction of the memory cost. Use QLoRA when GPU memory is very limited.\n\nThe most important hyperparameters are learning rate and number of epochs. Start with learning rate 2e-4 (LoRA) or 2e-5 (full), 3 epochs, and adjust based on validation loss.\n\nEvaluate on real tasks, not just loss. A model with low loss can still produce wrong answers. Compare against prompting baselines to prove fine-tuning adds value.\n\nWhen your fine-tuned model is worse than the base model, the problem is usually data quality, overfitting, or catastrophic forgetting. Diagnose systematically before adding more training.\n\nAPI-based fine-tuning (OpenAI, etc.) is the easiest path. Self-hosted fine-tuning with LoRA/QLoRA gives more control but requires GPU infrastructure and ML engineering expertise.","bodyHtml":"<h2>Key takeaways — Fine-Tuning Techniques</h2>\n<p>LoRA is the default fine-tuning technique for most practitioners. It achieves 95-99% of full fine-tuning quality at a fraction of the memory cost. Use QLoRA when GPU memory is very limited.</p>\n<p>The most important hyperparameters are learning rate and number of epochs. Start with learning rate 2e-4 (LoRA) or 2e-5 (full), 3 epochs, and adjust based on validation loss.</p>\n<p>Evaluate on real tasks, not just loss. A model with low loss can still produce wrong answers. 
Compare against prompting baselines to prove fine-tuning adds value.</p>\n<p>When your fine-tuned model is worse than the base model, the problem is usually data quality, overfitting, or catastrophic forgetting. Diagnose systematically before adding more training.</p>\n<p>API-based fine-tuning (OpenAI, etc.) is the easiest path. Self-hosted fine-tuning with LoRA/QLoRA gives more control but requires GPU infrastructure and ML engineering expertise.</p>","sourceUrl":"/ai-and-ml-applied/fine-tuning/fine-tuning-techniques","sourceAnchor":"key-takeaways","domain":"data-ai","accentColor":"#08d7d9"},{"id":"b0dffd33b3d0","subject":"ai-and-ml-applied","topic":"embeddings-and-vector-search","subtopic":"what-embeddings-are","chunkType":"intro","body":"An embedding is a mapping from something humans understand — a word, a sentence, an image, a song — to a dense vector of floating-point numbers that a machine can compare, cluster, and search. The core idea: things with similar meaning end up close together in vector space, and things with different meaning end up far apart. \"King\" is closer to \"Queen\" than it is to \"Banana.\" \"A photo of a sunset\" is closer to \"golden hour over the ocean\" than it is to \"tax return spreadsheet.\"\n\nThis is not a metaphor. It is literally a list of numbers. An embedding model takes your input and returns something like `[0.023, -0.118, 0.541, ...]` with anywhere from 256 to 3072 dimensions. Those numbers encode semantic information learned during training on massive text (or image, or audio) corpora.","bodyHtml":"<p>An embedding is a mapping from something humans understand — a word, a sentence, an image, a song — to a dense vector of floating-point numbers that a machine can compare, cluster, and search. The core idea: things with similar meaning end up close together in vector space, and things with different meaning end up far apart. 
\"King\" is closer to \"Queen\" than it is to \"Banana.\" \"A photo of a sunset\" is closer to \"golden hour over the ocean\" than it is to \"tax return spreadsheet.\"</p>\n<p>This is not a metaphor. It is literally a list of numbers. An embedding model takes your input and returns something like <code>[0.023, -0.118, 0.541, ...]</code> with anywhere from 256 to 3072 dimensions. Those numbers encode semantic information learned during training on massive text (or image, or audio) corpora.</p>","sourceUrl":"/ai-and-ml-applied/embeddings-and-vector-search/what-embeddings-are","sourceAnchor":"","domain":"data-ai","accentColor":"#08d7d9"},{"id":"ff3240b125d7","subject":"ai-and-ml-applied","topic":"embeddings-and-vector-search","subtopic":"what-embeddings-are","chunkType":"section","body":"## What Embeddings Are\n\nTraditional text comparison is keyword-based. If a user searches \"how to fix a slow database,\" keyword search finds documents containing those exact words. It misses a document titled \"Optimizing PostgreSQL query performance\" even though it is exactly what the user wants.\n\nEmbeddings solve this. Both \"how to fix a slow database\" and \"optimizing PostgreSQL query performance\" map to nearby vectors because the embedding model learned that they mean similar things. Search becomes semantic, not lexical.\n\nThis extends beyond text. Embed an image and a text description into the same vector space (as CLIP does), and you can search images using natural language queries. Embed audio transcripts and you can search podcasts. The technique is general: if you can feed it to a neural network, you can embed it.","bodyHtml":"<h2>What Embeddings Are</h2>\n<p>Traditional text comparison is keyword-based. If a user searches \"how to fix a slow database,\" keyword search finds documents containing those exact words. It misses a document titled \"Optimizing PostgreSQL query performance\" even though it is exactly what the user wants.</p>\n<p>Embeddings solve this. 
Both \"how to fix a slow database\" and \"optimizing PostgreSQL query performance\" map to nearby vectors because the embedding model learned that they mean similar things. Search becomes semantic, not lexical.</p>\n<p>This extends beyond text. Embed an image and a text description into the same vector space (as CLIP does), and you can search images using natural language queries. Embed audio transcripts and you can search podcasts. The technique is general: if you can feed it to a neural network, you can embed it.</p>","sourceUrl":"/ai-and-ml-applied/embeddings-and-vector-search/what-embeddings-are","sourceAnchor":"why-embeddings-matter","domain":"data-ai","accentColor":"#08d7d9"},{"id":"419add7ed62e","subject":"ai-and-ml-applied","topic":"embeddings-and-vector-search","subtopic":"what-embeddings-are","chunkType":"pitfall","body":"## Pitfalls — What Embeddings Are\n\n**Using keyword search when you need semantic search, or vice versa.** Embeddings are great at \"find me something with similar meaning.\" They are not great at exact match. If a user searches for invoice number `INV-2024-0847`, keyword search wins.\n\n**Ignoring the context window.** Stuffing a 10,000-word document into a model with a 512-token window silently truncates it. You get an embedding of the first 512 tokens only.\n\n**Mixing embedding models.** Vectors from different models live in different spaces. You cannot compare an OpenAI embedding to a sentence-transformers embedding. Pick one model and stick with it (or re-embed everything when you switch).\n\n**Not normalizing before dot product.** If your vectors are not unit-length and you use dot product instead of cosine similarity, longer documents (with larger magnitude vectors) will dominate results regardless of relevance.\n\n**Assuming embeddings are deterministic across versions.** Model updates can change the embedding space. 
Pin your model version and re-embed when you upgrade.\n\n**Embedding too much at once.** A single embedding for a 50-page document is a blurry summary. Chunk it and embed the chunks separately for better retrieval granularity.","bodyHtml":"<h2>Pitfalls — What Embeddings Are</h2>\n<p><strong>Using keyword search when you need semantic search, or vice versa.</strong> Embeddings are great at \"find me something with similar meaning.\" They are not great at exact match. If a user searches for invoice number <code>INV-2024-0847</code>, keyword search wins.</p>\n<p><strong>Ignoring the context window.</strong> Stuffing a 10,000-word document into a model with a 512-token window silently truncates it. You get an embedding of the first 512 tokens only.</p>\n<p><strong>Mixing embedding models.</strong> Vectors from different models live in different spaces. You cannot compare an OpenAI embedding to a sentence-transformers embedding. Pick one model and stick with it (or re-embed everything when you switch).</p>\n<p><strong>Not normalizing before dot product.</strong> If your vectors are not unit-length and you use dot product instead of cosine similarity, longer documents (with larger magnitude vectors) will dominate results regardless of relevance.</p>\n<p><strong>Assuming embeddings are deterministic across versions.</strong> Model updates can change the embedding space. Pin your model version and re-embed when you upgrade.</p>\n<p><strong>Embedding too much at once.</strong> A single embedding for a 50-page document is a blurry summary. 
Chunk it and embed the chunks separately for better retrieval granularity.</p>","sourceUrl":"/ai-and-ml-applied/embeddings-and-vector-search/what-embeddings-are","sourceAnchor":"common-pitfalls","domain":"data-ai","accentColor":"#08d7d9"},{"id":"c18d23a0c6c2","subject":"ai-and-ml-applied","topic":"embeddings-and-vector-search","subtopic":"what-embeddings-are","chunkType":"takeaway","body":"## Key takeaways — What Embeddings Are\n\nAn embedding is a dense vector of floats that encodes semantic meaning. Similar inputs produce nearby vectors.\n\nCosine similarity is the default metric. Use dot product if you pre-normalize.\n\nChoose dimensions based on your quality-vs-cost tradeoff. 768 to 1536 covers most use cases.\n\nChunking strategy matters as much as model choice. Bad chunks produce bad embeddings.\n\nDo not mix vectors from different models. They live in incompatible spaces.\n\nOpen-source models (sentence-transformers, BGE, Nomic) are production-ready and eliminate API dependency.\n\nBatch your embedding requests. Single-document calls waste latency and cost.\n\nTest general-purpose models on your domain before investing in fine-tuning.","bodyHtml":"<h2>Key takeaways — What Embeddings Are</h2>\n<p>An embedding is a dense vector of floats that encodes semantic meaning. Similar inputs produce nearby vectors.</p>\n<p>Cosine similarity is the default metric. Use dot product if you pre-normalize.</p>\n<p>Choose dimensions based on your quality-vs-cost tradeoff. 768 to 1536 covers most use cases.</p>\n<p>Chunking strategy matters as much as model choice. Bad chunks produce bad embeddings.</p>\n<p>Do not mix vectors from different models. They live in incompatible spaces.</p>\n<p>Open-source models (sentence-transformers, BGE, Nomic) are production-ready and eliminate API dependency.</p>\n<p>Batch your embedding requests. 
Single-document calls waste latency and cost.</p>\n<p>Test general-purpose models on your domain before investing in fine-tuning.</p>","sourceUrl":"/ai-and-ml-applied/embeddings-and-vector-search/what-embeddings-are","sourceAnchor":"key-takeaways","domain":"data-ai","accentColor":"#08d7d9"},{"id":"e4388db5a145","subject":"ai-and-ml-applied","topic":"embeddings-and-vector-search","subtopic":"vector-databases","chunkType":"intro","body":"You have millions of embedding vectors. A user sends a query, you embed it, and you need to find the 10 most similar vectors out of those millions — in under 100 milliseconds. A brute-force comparison against every vector is O(n), which is too slow at scale. Vector databases solve this with specialized indexing algorithms that trade a small amount of accuracy for orders-of-magnitude speed improvement.","bodyHtml":"<p>You have millions of embedding vectors. A user sends a query, you embed it, and you need to find the 10 most similar vectors out of those millions — in under 100 milliseconds. A brute-force comparison against every vector is O(n), which is too slow at scale. 
Vector databases solve this with specialized indexing algorithms that trade a small amount of accuracy for orders-of-magnitude speed improvement.</p>","sourceUrl":"/ai-and-ml-applied/embeddings-and-vector-search/vector-databases","sourceAnchor":"","domain":"data-ai","accentColor":"#08d7d9"},{"id":"2534d33fd8cf","subject":"ai-and-ml-applied","topic":"embeddings-and-vector-search","subtopic":"vector-databases","chunkType":"section","body":"## Filtering with Metadata\n\nReal queries are rarely \"find the 10 most similar vectors.\" They are \"find the 10 most similar vectors that belong to this tenant, were created in the last 30 days, and are in the engineering category.\"","bodyHtml":"<h2>Filtering with Metadata</h2>\n<p>Real queries are rarely \"find the 10 most similar vectors.\" They are \"find the 10 most similar vectors that belong to this tenant, were created in the last 30 days, and are in the engineering category.\"</p>","sourceUrl":"/ai-and-ml-applied/embeddings-and-vector-search/vector-databases","sourceAnchor":"filtering-with-metadata","domain":"data-ai","accentColor":"#08d7d9"},{"id":"c5fdf9f86c85","subject":"ai-and-ml-applied","topic":"embeddings-and-vector-search","subtopic":"vector-databases","chunkType":"pitfall","body":"## Pitfalls — Vector Databases\n\n**Choosing a dedicated vector database before you need one.** pgvector in PostgreSQL handles millions of vectors. Start there. Migrate when you hit a real scaling wall, not a theoretical one.\n\n**No index on the vector column.** Without an ANN index, every query is a full table scan. This is the number one performance complaint from pgvector users, and the cause is almost always a missing index.\n\n**Wrong distance metric.** If you build an HNSW index with cosine distance but query with L2 distance, the index is useless and the database falls back to sequential scan.\n\n**Ignoring index build time.** HNSW indexes on millions of vectors take minutes to hours to build. 
Plan for this during initial load and migrations.\n\n**Filtering after vector search.** If you search for top 10 and then filter, you might end up with 2 results. Over-fetch (top 50) and filter, or use a database that supports pre-filtering.\n\n**Not monitoring recall.** ANN search is approximate. Periodically spot-check results against brute-force search to verify your index parameters give acceptable recall.","bodyHtml":"<h2>Pitfalls — Vector Databases</h2>\n<p><strong>Choosing a dedicated vector database before you need one.</strong> pgvector in PostgreSQL handles millions of vectors. Start there. Migrate when you hit a real scaling wall, not a theoretical one.</p>\n<p><strong>No index on the vector column.</strong> Without an ANN index, every query is a full table scan. This is the number one performance complaint from pgvector users, and the cause is almost always a missing index.</p>\n<p><strong>Wrong distance metric.</strong> If you build an HNSW index with cosine distance but query with L2 distance, the index is useless and the database falls back to sequential scan.</p>\n<p><strong>Ignoring index build time.</strong> HNSW indexes on millions of vectors take minutes to hours to build. Plan for this during initial load and migrations.</p>\n<p><strong>Filtering after vector search.</strong> If you search for top 10 and then filter, you might end up with 2 results. Over-fetch (top 50) and filter, or use a database that supports pre-filtering.</p>\n<p><strong>Not monitoring recall.</strong> ANN search is approximate. 
Periodically spot-check results against brute-force search to verify your index parameters give acceptable recall.</p>","sourceUrl":"/ai-and-ml-applied/embeddings-and-vector-search/vector-databases","sourceAnchor":"common-pitfalls","domain":"data-ai","accentColor":"#08d7d9"},{"id":"fd9f069e7e12","subject":"ai-and-ml-applied","topic":"embeddings-and-vector-search","subtopic":"vector-databases","chunkType":"takeaway","body":"## Key takeaways — Vector Databases\n\npgvector in PostgreSQL is the right starting point for most applications. It handles millions of vectors, offers SQL joins with relational data, and requires no new infrastructure.\n\nPurpose-built vector databases (Pinecone, Qdrant, Milvus) earn their place at hundreds of millions of vectors, sub-10ms latency requirements, or when you need distributed search.\n\nHNSW is the default indexing algorithm. It gives the best recall-speed tradeoff for most workloads.\n\nAlways create an ANN index. Without one, every search is a full scan.\n\nMetadata filtering is a first-class concern. Most real queries combine vector similarity with structured filters.\n\nQuantization (float16, int8) cuts memory usage substantially with minimal quality loss.","bodyHtml":"<h2>Key takeaways — Vector Databases</h2>\n<p>pgvector in PostgreSQL is the right starting point for most applications. It handles millions of vectors, offers SQL joins with relational data, and requires no new infrastructure.</p>\n<p>Purpose-built vector databases (Pinecone, Qdrant, Milvus) earn their place at hundreds of millions of vectors, sub-10ms latency requirements, or when you need distributed search.</p>\n<p>HNSW is the default indexing algorithm. It gives the best recall-speed tradeoff for most workloads.</p>\n<p>Always create an ANN index. Without one, every search is a full scan.</p>\n<p>Metadata filtering is a first-class concern. 
Most real queries combine vector similarity with structured filters.</p>\n<p>Quantization (float16, int8) cuts memory usage substantially with minimal quality loss.</p>","sourceUrl":"/ai-and-ml-applied/embeddings-and-vector-search/vector-databases","sourceAnchor":"key-takeaways","domain":"data-ai","accentColor":"#08d7d9"},{"id":"cece9367c717","subject":"ai-and-ml-applied","topic":"embeddings-and-vector-search","subtopic":"use-cases-and-patterns","chunkType":"intro","body":"Embeddings are not a solution looking for a problem. They solve a specific class of problems: anything where you need to find things based on meaning rather than exact match. Once you internalize the pattern — embed everything, compare vectors — you start seeing applications everywhere.","bodyHtml":"<p>Embeddings are not a solution looking for a problem. They solve a specific class of problems: anything where you need to find things based on meaning rather than exact match. Once you internalize the pattern — embed everything, compare vectors — you start seeing applications everywhere.</p>","sourceUrl":"/ai-and-ml-applied/embeddings-and-vector-search/use-cases-and-patterns","sourceAnchor":"","domain":"data-ai","accentColor":"#08d7d9"},{"id":"51eb1e433710","subject":"ai-and-ml-applied","topic":"embeddings-and-vector-search","subtopic":"use-cases-and-patterns","chunkType":"pitfall","body":"## Pitfalls — Use Cases & Patterns\n\n**Using embeddings for everything.** Embeddings excel at fuzzy semantic matching. For exact lookups (user IDs, order numbers, enum values), use traditional indexes. Combining both is the right approach.\n\n**Not evaluating retrieval quality.** Build a test set of queries with known-relevant documents. Measure recall@k and precision@k. Without this, you are guessing about quality.\n\n**Stale embeddings.** If the source content changes, the embedding must be regenerated. 
Build this into your update pipeline, not as an afterthought.\n\n**One-size-fits-all chunking.** A chunking strategy that works for technical documentation will fail for conversational chat logs. Tune chunk size and overlap per content type.\n\n**Ignoring the cost of embedding at scale.** Embedding 10 million documents with OpenAI's API costs real money. Calculate costs upfront. Consider open-source models for large-scale batch embedding.\n\n**Skipping hybrid search.** Pure vector search misses exact-match queries. Pure keyword search misses semantic queries. Combine both.","bodyHtml":"<h2>Pitfalls — Use Cases &#x26; Patterns</h2>\n<p><strong>Using embeddings for everything.</strong> Embeddings excel at fuzzy semantic matching. For exact lookups (user IDs, order numbers, enum values), use traditional indexes. Combining both is the right approach.</p>\n<p><strong>Not evaluating retrieval quality.</strong> Build a test set of queries with known-relevant documents. Measure recall@k and precision@k. Without this, you are guessing about quality.</p>\n<p><strong>Stale embeddings.</strong> If the source content changes, the embedding must be regenerated. Build this into your update pipeline, not as an afterthought.</p>\n<p><strong>One-size-fits-all chunking.</strong> A chunking strategy that works for technical documentation will fail for conversational chat logs. Tune chunk size and overlap per content type.</p>\n<p><strong>Ignoring the cost of embedding at scale.</strong> Embedding 10 million documents with OpenAI's API costs real money. Calculate costs upfront. Consider open-source models for large-scale batch embedding.</p>\n<p><strong>Skipping hybrid search.</strong> Pure vector search misses exact-match queries. Pure keyword search misses semantic queries. 
Combine both.</p>","sourceUrl":"/ai-and-ml-applied/embeddings-and-vector-search/use-cases-and-patterns","sourceAnchor":"common-pitfalls","domain":"data-ai","accentColor":"#08d7d9"},{"id":"b58be21d6315","subject":"ai-and-ml-applied","topic":"embeddings-and-vector-search","subtopic":"use-cases-and-patterns","chunkType":"takeaway","body":"## Key takeaways — Use Cases & Patterns\n\nSemantic search is the entry-point use case, but embeddings enable recommendations, deduplication, classification, anomaly detection, and multi-modal search with the same underlying primitive.\n\nHybrid search (vector + keyword) outperforms either approach alone in nearly every benchmark and real-world test.\n\nZero-shot classification with embeddings requires no training data and is good enough for many production routing and tagging tasks.\n\nThe embedding-first architecture treats vector similarity as a fundamental building block, not a bolted-on feature.\n\nAlways measure retrieval quality. Build evaluation sets early and track recall and precision as you iterate.\n\nMulti-modal embeddings (CLIP and successors) unlock cross-modal search, which is one of the most underused capabilities available today.","bodyHtml":"<h2>Key takeaways — Use Cases &#x26; Patterns</h2>\n<p>Semantic search is the entry-point use case, but embeddings enable recommendations, deduplication, classification, anomaly detection, and multi-modal search with the same underlying primitive.</p>\n<p>Hybrid search (vector + keyword) outperforms either approach alone in nearly every benchmark and real-world test.</p>\n<p>Zero-shot classification with embeddings requires no training data and is good enough for many production routing and tagging tasks.</p>\n<p>The embedding-first architecture treats vector similarity as a fundamental building block, not a bolted-on feature.</p>\n<p>Always measure retrieval quality. 
Build evaluation sets early and track recall and precision as you iterate.</p>\n<p>Multi-modal embeddings (CLIP and successors) unlock cross-modal search, which is one of the most underused capabilities available today.</p>","sourceUrl":"/ai-and-ml-applied/embeddings-and-vector-search/use-cases-and-patterns","sourceAnchor":"key-takeaways","domain":"data-ai","accentColor":"#08d7d9"},{"id":"1fb43f08de2d","subject":"ai-and-ml-applied","topic":"ml-in-production","subtopic":"model-serving","chunkType":"intro","body":"A model in a Jupyter notebook is a prototype. A model behind an API serving thousands of requests per second is a product. The gap between these two states is where most ML projects die. Model serving is the discipline of getting a trained model into production where it can receive inputs, produce predictions, and do so reliably, cheaply, and fast enough for your use case.","bodyHtml":"<p>A model in a Jupyter notebook is a prototype. A model behind an API serving thousands of requests per second is a product. The gap between these two states is where most ML projects die. Model serving is the discipline of getting a trained model into production where it can receive inputs, produce predictions, and do so reliably, cheaply, and fast enough for your use case.</p>","sourceUrl":"/ai-and-ml-applied/ml-in-production/model-serving","sourceAnchor":"","domain":"data-ai","accentColor":"#08d7d9"},{"id":"c71ec779fbe9","subject":"ai-and-ml-applied","topic":"ml-in-production","subtopic":"model-serving","chunkType":"pitfall","body":"## Pitfalls — Model Serving\n\n**Loading the model on every request.** Load once at startup, predict on every request. This is the most common performance bug in model serving code.\n\n**No health checks or readiness probes.** Your model server needs a `/health` endpoint that verifies the model is loaded and ready. 
Without it, load balancers send traffic to servers that are still loading models.\n\n**Ignoring cold start latency.** If your GPU model takes 30 seconds to load into memory, the first request after scaling up will time out. Pre-warm instances or use persistent serving.\n\n**Not batching inference.** Processing one input at a time wastes GPU parallelism. Batch requests together for 3-10x throughput improvement on GPU.\n\n**Premature self-hosting.** Teams spend months building serving infrastructure when an API call would have shipped the feature in a day. Start with APIs, optimize later.\n\n**No fallback for model server failures.** Your model server will go down. Have a degraded experience (cached results, rule-based fallback, or a simpler model) ready.","bodyHtml":"<h2>Pitfalls — Model Serving</h2>\n<p><strong>Loading the model on every request.</strong> Load once at startup, predict on every request. This is the most common performance bug in model serving code.</p>\n<p><strong>No health checks or readiness probes.</strong> Your model server needs a <code>/health</code> endpoint that verifies the model is loaded and ready. Without it, load balancers send traffic to servers that are still loading models.</p>\n<p><strong>Ignoring cold start latency.</strong> If your GPU model takes 30 seconds to load into memory, the first request after scaling up will time out. Pre-warm instances or use persistent serving.</p>\n<p><strong>Not batching inference.</strong> Processing one input at a time wastes GPU parallelism. Batch requests together for 3-10x throughput improvement on GPU.</p>\n<p><strong>Premature self-hosting.</strong> Teams spend months building serving infrastructure when an API call would have shipped the feature in a day. Start with APIs, optimize later.</p>\n<p><strong>No fallback for model server failures.</strong> Your model server will go down. 
Have a degraded experience (cached results, rule-based fallback, or a simpler model) ready.</p>","sourceUrl":"/ai-and-ml-applied/ml-in-production/model-serving","sourceAnchor":"common-pitfalls","domain":"data-ai","accentColor":"#08d7d9"},{"id":"c9f7c4e195e6","subject":"ai-and-ml-applied","topic":"ml-in-production","subtopic":"model-serving","chunkType":"takeaway","body":"## Key takeaways — Model Serving\n\nMatch serving strategy to latency requirements. Batch for offline, REST API for most real-time, gRPC for high-throughput internal services.\n\nStart with API providers (OpenAI, Anthropic, Cohere). Self-host when volume makes the cost equation favor it — typically above 100K-500K requests per day.\n\nQuantization (INT8, INT4) is not optional for large models. It cuts costs 2-4x with minimal quality loss.\n\nvLLM is the current standard for self-hosted LLM inference. FastAPI + a transformer model is the standard for everything else.\n\nLoad models once at startup. Implement health checks. Batch inference on GPU. These three things solve 80% of serving performance problems.\n\nAlways have a fallback. Model servers fail. Plan for degraded service, not total outage.","bodyHtml":"<h2>Key takeaways — Model Serving</h2>\n<p>Match serving strategy to latency requirements. Batch for offline, REST API for most real-time, gRPC for high-throughput internal services.</p>\n<p>Start with API providers (OpenAI, Anthropic, Cohere). Self-host when volume makes the cost equation favor it — typically above 100K-500K requests per day.</p>\n<p>Quantization (INT8, INT4) is not optional for large models. It cuts costs 2-4x with minimal quality loss.</p>\n<p>vLLM is the current standard for self-hosted LLM inference. FastAPI + a transformer model is the standard for everything else.</p>\n<p>Load models once at startup. Implement health checks. Batch inference on GPU. These three things solve 80% of serving performance problems.</p>\n<p>Always have a fallback. Model servers fail. 
Plan for degraded service, not total outage.</p>","sourceUrl":"/ai-and-ml-applied/ml-in-production/model-serving","sourceAnchor":"key-takeaways","domain":"data-ai","accentColor":"#08d7d9"},{"id":"7b8043f3111a","subject":"ai-and-ml-applied","topic":"ml-in-production","subtopic":"monitoring-ml-systems","chunkType":"intro","body":"Software monitoring tells you whether your application is up and responding. ML monitoring tells you whether your model is still giving good answers. A model can return HTTP 200 on every request while silently producing garbage predictions because the world changed and the model did not. Traditional monitoring catches the first problem. ML monitoring catches the second.","bodyHtml":"<p>Software monitoring tells you whether your application is up and responding. ML monitoring tells you whether your model is still giving good answers. A model can return HTTP 200 on every request while silently producing garbage predictions because the world changed and the model did not. Traditional monitoring catches the first problem. ML monitoring catches the second.</p>","sourceUrl":"/ai-and-ml-applied/ml-in-production/monitoring-ml-systems","sourceAnchor":"","domain":"data-ai","accentColor":"#08d7d9"},{"id":"c7cf4020a942","subject":"ai-and-ml-applied","topic":"ml-in-production","subtopic":"monitoring-ml-systems","chunkType":"pitfall","body":"## Pitfalls — Monitoring ML Systems\n\n**Monitoring only infrastructure, not data or predictions.** An ML system can be healthy by every infrastructure metric and still produce useless predictions. Monitor all four layers.\n\n**No reference distribution.** You cannot detect drift if you do not know what \"normal\" looks like. Save statistics from your training data and first successful production deployment as a baseline.\n\n**Alerting on single-feature drift.** Individual features drift naturally. Alert when multiple features drift simultaneously or when output distribution changes. 
Single-feature drift is informational, not actionable.\n\n**Not logging model inputs and outputs.** You cannot debug a production issue if you do not know what the model saw and what it predicted. Log everything (with appropriate PII protections).\n\n**Delayed ground truth treated as no ground truth.** Even if labels arrive weeks later, they are valuable for detecting model degradation. Build a feedback pipeline that joins predictions with delayed labels.\n\n**Ignoring training/serving skew.** The most insidious ML bug: the model works perfectly in evaluation but fails in production because features are computed differently. A feature store or shared computation code prevents this.","bodyHtml":"<h2>Pitfalls — Monitoring ML Systems</h2>\n<p><strong>Monitoring only infrastructure, not data or predictions.</strong> An ML system can be healthy by every infrastructure metric and still produce useless predictions. Monitor all four layers.</p>\n<p><strong>No reference distribution.</strong> You cannot detect drift if you do not know what \"normal\" looks like. Save statistics from your training data and first successful production deployment as a baseline.</p>\n<p><strong>Alerting on single-feature drift.</strong> Individual features drift naturally. Alert when multiple features drift simultaneously or when output distribution changes. Single-feature drift is informational, not actionable.</p>\n<p><strong>Not logging model inputs and outputs.</strong> You cannot debug a production issue if you do not know what the model saw and what it predicted. Log everything (with appropriate PII protections).</p>\n<p><strong>Delayed ground truth treated as no ground truth.</strong> Even if labels arrive weeks later, they are valuable for detecting model degradation. 
Build a feedback pipeline that joins predictions with delayed labels.</p>\n<p><strong>Ignoring training/serving skew.</strong> The most insidious ML bug: the model works perfectly in evaluation but fails in production because features are computed differently. A feature store or shared computation code prevents this.</p>","sourceUrl":"/ai-and-ml-applied/ml-in-production/monitoring-ml-systems","sourceAnchor":"common-pitfalls","domain":"data-ai","accentColor":"#08d7d9"},{"id":"9e0c38ce7b69","subject":"ai-and-ml-applied","topic":"ml-in-production","subtopic":"monitoring-ml-systems","chunkType":"takeaway","body":"## Key takeaways — Monitoring ML Systems\n\nML monitoring has four layers: infrastructure, data (input drift), predictions (output drift), and outcomes (ground truth comparison). Most teams only implement the first layer.\n\nData drift detection is the highest-leverage monitoring investment. It catches problems before they affect users.\n\nFeature stores enforce consistency between training and serving, eliminating an entire class of silent production bugs.\n\nLog model inputs, outputs, and confidence scores. You cannot debug what you did not record.\n\nSet tiered alerts. Not every drift signal needs to wake someone up. Reserve paging for model collapse and severe metric drops.\n\nGround truth may arrive with a delay, but it always arrives eventually. Build the pipeline to use it when it does.","bodyHtml":"<h2>Key takeaways — Monitoring ML Systems</h2>\n<p>ML monitoring has four layers: infrastructure, data (input drift), predictions (output drift), and outcomes (ground truth comparison). Most teams only implement the first layer.</p>\n<p>Data drift detection is the highest-leverage monitoring investment. It catches problems before they affect users.</p>\n<p>Feature stores enforce consistency between training and serving, eliminating an entire class of silent production bugs.</p>\n<p>Log model inputs, outputs, and confidence scores. 
You cannot debug what you did not record.</p>\n<p>Set tiered alerts. Not every drift signal needs to wake someone up. Reserve paging for model collapse and severe metric drops.</p>\n<p>Ground truth may arrive with a delay, but it always arrives eventually. Build the pipeline to use it when it does.</p>","sourceUrl":"/ai-and-ml-applied/ml-in-production/monitoring-ml-systems","sourceAnchor":"key-takeaways","domain":"data-ai","accentColor":"#08d7d9"},{"id":"a95d2e4e5d02","subject":"ai-and-ml-applied","topic":"ml-in-production","subtopic":"ab-testing-models","chunkType":"intro","body":"Your new model has better accuracy on the test set. Should you ship it? Not yet. Offline metrics (accuracy, F1, AUC) tell you how the model performs on historical data. Online metrics (click-through rate, conversion, revenue, user satisfaction) tell you how it performs in the real world. These two sets of metrics disagree more often than you would expect. A/B testing is how you resolve the disagreement.","bodyHtml":"<p>Your new model has better accuracy on the test set. Should you ship it? Not yet. Offline metrics (accuracy, F1, AUC) tell you how the model performs on historical data. Online metrics (click-through rate, conversion, revenue, user satisfaction) tell you how it performs in the real world. These two sets of metrics disagree more often than you would expect. A/B testing is how you resolve the disagreement.</p>","sourceUrl":"/ai-and-ml-applied/ml-in-production/ab-testing-models","sourceAnchor":"","domain":"data-ai","accentColor":"#08d7d9"},{"id":"e4d080bd3fe0","subject":"ai-and-ml-applied","topic":"ml-in-production","subtopic":"ab-testing-models","chunkType":"pitfall","body":"## Pitfalls — A/B Testing Models\n\n**Peeking at results early.** Checking p-values repeatedly during the experiment inflates the false positive rate. 
Either commit to a fixed sample size upfront, or use sequential testing methods designed for continuous monitoring.\n\n**Not accounting for novelty effects.** Users click on new things because they are new, not because they are better. Run experiments for at least 2 weeks to let novelty wear off.\n\n**Testing on the wrong metric.** If you optimize for click-through rate, you may get more clicks but lower conversion. Choose a metric that aligns with actual business value.\n\n**Network effects between variants.** In social or marketplace applications, control and treatment users interact with each other, contaminating results. Use cluster-based randomization (randomize by geographic region or social cluster).\n\n**Ignoring guardrail metrics.** You test one primary metric, but you should also monitor guardrails — metrics that should not degrade. A model that improves search relevance but doubles page load time is a net loss.\n\n**Running too many simultaneous experiments on the same surface.** If three experiments modify the same feature, interactions between them make results uninterpretable. Coordinate experiments across teams.","bodyHtml":"<h2>Pitfalls — A/B Testing Models</h2>\n<p><strong>Peeking at results early.</strong> Checking p-values repeatedly during the experiment inflates the false positive rate. Either commit to a fixed sample size upfront, or use sequential testing methods designed for continuous monitoring.</p>\n<p><strong>Not accounting for novelty effects.</strong> Users click on new things because they are new, not because they are better. Run experiments for at least 2 weeks to let novelty wear off.</p>\n<p><strong>Testing on the wrong metric.</strong> If you optimize for click-through rate, you may get more clicks but lower conversion. 
Choose a metric that aligns with actual business value.</p>\n<p><strong>Network effects between variants.</strong> In social or marketplace applications, control and treatment users interact with each other, contaminating results. Use cluster-based randomization (randomize by geographic region or social cluster).</p>\n<p><strong>Ignoring guardrail metrics.</strong> You test one primary metric, but you should also monitor guardrails — metrics that should not degrade. A model that improves search relevance but doubles page load time is a net loss.</p>\n<p><strong>Running too many simultaneous experiments on the same surface.</strong> If three experiments modify the same feature, interactions between them make results uninterpretable. Coordinate experiments across teams.</p>","sourceUrl":"/ai-and-ml-applied/ml-in-production/ab-testing-models","sourceAnchor":"common-pitfalls","domain":"data-ai","accentColor":"#08d7d9"},{"id":"5e521ae1f681","subject":"ai-and-ml-applied","topic":"ml-in-production","subtopic":"ab-testing-models","chunkType":"takeaway","body":"## Key takeaways — A/B Testing Models\n\nOffline metrics (accuracy, F1) and online metrics (CTR, revenue) disagree regularly. A/B testing is the only way to know which model is actually better for your users.\n\nUse shadow mode to de-risk deployment before any user sees the new model. It catches crashes, latency regressions, and obvious quality problems.\n\nCanary deployments let you gradually shift traffic and automatically roll back on failure. Start at 1%, increase over days.\n\nCalculate sample size before starting the experiment. Under-powered tests waste time and produce inconclusive results.\n\nDo not peek at results. Commit to a sample size or use sequential testing methods.\n\nMulti-armed bandits reduce exposure to the worse model but sacrifice clean statistical inference. 
Use them when the cost of serving wrong predictions is high.","bodyHtml":"<h2>Key takeaways — A/B Testing Models</h2>\n<p>Offline metrics (accuracy, F1) and online metrics (CTR, revenue) disagree regularly. A/B testing is the only way to know which model is actually better for your users.</p>\n<p>Use shadow mode to de-risk deployment before any user sees the new model. It catches crashes, latency regressions, and obvious quality problems.</p>\n<p>Canary deployments let you gradually shift traffic and automatically roll back on failure. Start at 1%, increase over days.</p>\n<p>Calculate sample size before starting the experiment. Under-powered tests waste time and produce inconclusive results.</p>\n<p>Do not peek at results. Commit to a sample size or use sequential testing methods.</p>\n<p>Multi-armed bandits reduce exposure to the worse model but sacrifice clean statistical inference. Use them when the cost of serving wrong predictions is high.</p>","sourceUrl":"/ai-and-ml-applied/ml-in-production/ab-testing-models","sourceAnchor":"key-takeaways","domain":"data-ai","accentColor":"#08d7d9"},{"id":"87865f90f362","subject":"ai-and-ml-applied","topic":"llm-application-patterns","subtopic":"agents-and-tool-use","chunkType":"intro","body":"A chatbot generates text. An agent takes actions. The difference is that an agent can call external tools — search the web, query a database, run code, hit an API — and incorporate the results into its reasoning. This turns an LLM from a text generator into something that can actually do things in the world.","bodyHtml":"<p>A chatbot generates text. An agent takes actions. The difference is that an agent can call external tools — search the web, query a database, run code, hit an API — and incorporate the results into its reasoning. 
This turns an LLM from a text generator into something that can actually do things in the world.</p>","sourceUrl":"/ai-and-ml-applied/llm-application-patterns/agents-and-tool-use","sourceAnchor":"","domain":"data-ai","accentColor":"#08d7d9"},{"id":"fe08534356e6","subject":"ai-and-ml-applied","topic":"llm-application-patterns","subtopic":"agents-and-tool-use","chunkType":"pitfall","body":"## Pitfalls — Agents & Tool Use\n\n**Giving agents too many tools.** Each tool is a decision the model must make. More tools means more chances for the wrong choice. Start with 3-5 tools and add more only when you have evidence they are needed.\n\n**Tool descriptions that are vague.** \"Search for stuff\" is a bad tool description. \"Search the company knowledge base for policy documents, product documentation, and engineering guides. Returns the top 5 most relevant results.\" is much better. The quality of your tool descriptions directly affects how well the agent uses them.\n\n**No iteration limit.** Without a maximum number of steps, agents can loop indefinitely, consuming tokens and money. Always set a cap.\n\n**Not logging the full agent trace.** When an agent gives a wrong answer, you need to see every step: what it thought, what tools it called, what results it got, and where it went wrong. Log the entire message history.\n\n**Treating agents as reliable.** Agents are probabilistic systems. They will sometimes call the wrong tool, misinterpret results, or hallucinate actions. Design your system to handle failures gracefully — confirm destructive actions with the user, validate outputs, and have fallback paths.","bodyHtml":"<h2>Pitfalls — Agents &#x26; Tool Use</h2>\n<p><strong>Giving agents too many tools.</strong> Each tool is a decision the model must make. More tools means more chances for the wrong choice. 
Start with 3-5 tools and add more only when you have evidence they are needed.</p>\n<p><strong>Tool descriptions that are vague.</strong> \"Search for stuff\" is a bad tool description. \"Search the company knowledge base for policy documents, product documentation, and engineering guides. Returns the top 5 most relevant results.\" is much better. The quality of your tool descriptions directly affects how well the agent uses them.</p>\n<p><strong>No iteration limit.</strong> Without a maximum number of steps, agents can loop indefinitely, consuming tokens and money. Always set a cap.</p>\n<p><strong>Not logging the full agent trace.</strong> When an agent gives a wrong answer, you need to see every step: what it thought, what tools it called, what results it got, and where it went wrong. Log the entire message history.</p>\n<p><strong>Treating agents as reliable.</strong> Agents are probabilistic systems. They will sometimes call the wrong tool, misinterpret results, or hallucinate actions. Design your system to handle failures gracefully — confirm destructive actions with the user, validate outputs, and have fallback paths.</p>","sourceUrl":"/ai-and-ml-applied/llm-application-patterns/agents-and-tool-use","sourceAnchor":"common-pitfalls","domain":"data-ai","accentColor":"#08d7d9"},{"id":"abf623070dbd","subject":"ai-and-ml-applied","topic":"llm-application-patterns","subtopic":"agents-and-tool-use","chunkType":"takeaway","body":"## Key takeaways — Agents & Tool Use\n\nAgents are LLMs in a loop: observe, think, act, repeat. Tool use is what makes them agents rather than chatbots.\n\nGood tool descriptions matter more than model choice. A clear description of when and how to use a tool prevents most agent failures.\n\nLimit the iteration count, limit the tool set, validate tool inputs. Defensive patterns are not optional.\n\nAgents work well for bounded tasks with clear sub-steps (lookup, calculate, compare). 
They struggle with open-ended, multi-step reasoning that requires dozens of actions.\n\nLog the full agent trace. You cannot debug what you cannot see.\n\nStart simple. A retrieval step plus a single LLM call often beats a multi-step agent for straightforward tasks.","bodyHtml":"<h2>Key takeaways — Agents &#x26; Tool Use</h2>\n<p>Agents are LLMs in a loop: observe, think, act, repeat. Tool use is what makes them agents rather than chatbots.</p>\n<p>Good tool descriptions matter more than model choice. A clear description of when and how to use a tool prevents most agent failures.</p>\n<p>Limit the iteration count, limit the tool set, validate tool inputs. Defensive patterns are not optional.</p>\n<p>Agents work well for bounded tasks with clear sub-steps (lookup, calculate, compare). They struggle with open-ended, multi-step reasoning that requires dozens of actions.</p>\n<p>Log the full agent trace. You cannot debug what you cannot see.</p>\n<p>Start simple. A retrieval step plus a single LLM call often beats a multi-step agent for straightforward tasks.</p>","sourceUrl":"/ai-and-ml-applied/llm-application-patterns/agents-and-tool-use","sourceAnchor":"key-takeaways","domain":"data-ai","accentColor":"#08d7d9"},{"id":"47a0d9f7bdb5","subject":"ai-and-ml-applied","topic":"llm-application-patterns","subtopic":"structured-output-and-chains","chunkType":"intro","body":"LLMs produce text. Your application needs data. A JSON object with specific fields, a classification label from a fixed set, a list of extracted entities. Getting reliable, parseable output from a model that fundamentally produces a stream of tokens is one of the most practical challenges in LLM application development.","bodyHtml":"<p>LLMs produce text. Your application needs data. A JSON object with specific fields, a classification label from a fixed set, a list of extracted entities. 
Getting reliable, parseable output from a model that fundamentally produces a stream of tokens is one of the most practical challenges in LLM application development.</p>","sourceUrl":"/ai-and-ml-applied/llm-application-patterns/structured-output-and-chains","sourceAnchor":"","domain":"data-ai","accentColor":"#08d7d9"},{"id":"289117569e06","subject":"ai-and-ml-applied","topic":"llm-application-patterns","subtopic":"structured-output-and-chains","chunkType":"pitfall","body":"## Pitfalls — Structured Output & Chains\n\n**Parsing free-form LLM output with regex.** This breaks constantly. Use function calling or structured outputs. If you are writing regex to parse LLM output, you are solving the wrong problem.\n\n**Over-chaining.** Each step in a chain adds latency, cost, and a failure point. A 7-step chain with 95% reliability per step has only 70% end-to-end reliability. Keep chains short.\n\n**No validation between steps.** If step 1 produces garbage, steps 2-5 amplify the garbage. Validate intermediate outputs and fail fast.\n\n**Same model for every step.** Use cheap, fast models (GPT-4o-mini, Haiku) for simple steps like classification and routing. Reserve expensive models for steps that need high quality.\n\n**Ignoring latency.** A 3-step chain where each step takes 2 seconds means 6 seconds of user-visible latency. Pipeline where possible, and consider whether the chain is really necessary.\n\n**Not handling partial extraction gracefully.** If the model cannot extract a field, returning `null` is better than hallucinating a value. Design your schema to allow optional fields and handle missing data downstream.","bodyHtml":"<h2>Pitfalls — Structured Output &#x26; Chains</h2>\n<p><strong>Parsing free-form LLM output with regex.</strong> This breaks constantly. Use function calling or structured outputs. 
If you are writing regex to parse LLM output, you are solving the wrong problem.</p>\n<p><strong>Over-chaining.</strong> Each step in a chain adds latency, cost, and a failure point. A 7-step chain with 95% reliability per step has only 70% end-to-end reliability. Keep chains short.</p>\n<p><strong>No validation between steps.</strong> If step 1 produces garbage, steps 2-5 amplify the garbage. Validate intermediate outputs and fail fast.</p>\n<p><strong>Same model for every step.</strong> Use cheap, fast models (GPT-4o-mini, Haiku) for simple steps like classification and routing. Reserve expensive models for steps that need high quality.</p>\n<p><strong>Ignoring latency.</strong> A 3-step chain where each step takes 2 seconds means 6 seconds of user-visible latency. Pipeline where possible, and consider whether the chain is really necessary.</p>\n<p><strong>Not handling partial extraction gracefully.</strong> If the model cannot extract a field, returning <code>null</code> is better than hallucinating a value. Design your schema to allow optional fields and handle missing data downstream.</p>","sourceUrl":"/ai-and-ml-applied/llm-application-patterns/structured-output-and-chains","sourceAnchor":"common-pitfalls","domain":"data-ai","accentColor":"#08d7d9"},{"id":"f6d667fbf39c","subject":"ai-and-ml-applied","topic":"llm-application-patterns","subtopic":"structured-output-and-chains","chunkType":"takeaway","body":"## Key takeaways — Structured Output & Chains\n\nUse structured outputs (function calling, Pydantic models) as the default for any task that needs parseable output. Stop writing regex to parse LLM text.\n\nChains decompose complex tasks into reliable subtasks. Each step should do one thing, and validation should happen between steps.\n\nError handling is not optional. LLM calls fail, produce bad output, and time out. Handle every step.\n\nUse cheaper models for simple chain steps. Not every step needs the most capable model.\n\nKeep chains short. 
Each step multiplies latency and reduces reliability. If you can do it in one well-crafted prompt, do that instead.\n\nFew-shot examples in the prompt remain one of the most effective ways to get consistent output format, even when not using formal schema enforcement.","bodyHtml":"<h2>Key takeaways — Structured Output &#x26; Chains</h2>\n<p>Use structured outputs (function calling, Pydantic models) as the default for any task that needs parseable output. Stop writing regex to parse LLM text.</p>\n<p>Chains decompose complex tasks into reliable subtasks. Each step should do one thing, and validation should happen between steps.</p>\n<p>Error handling is not optional. LLM calls fail, produce bad output, and time out. Handle every step.</p>\n<p>Use cheaper models for simple chain steps. Not every step needs the most capable model.</p>\n<p>Keep chains short. Each step multiplies latency and reduces reliability. If you can do it in one well-crafted prompt, do that instead.</p>\n<p>Few-shot examples in the prompt remain one of the most effective ways to get consistent output format, even when not using formal schema enforcement.</p>","sourceUrl":"/ai-and-ml-applied/llm-application-patterns/structured-output-and-chains","sourceAnchor":"key-takeaways","domain":"data-ai","accentColor":"#08d7d9"},{"id":"dc63fb11f5a8","subject":"ai-and-ml-applied","topic":"llm-application-patterns","subtopic":"routing-and-fallbacks","chunkType":"intro","body":"Not every query needs your most expensive model. A user asking \"what are your business hours?\" does not require GPT-4o. A user asking \"analyze this 50-page contract and identify all liability clauses\" does. Routing directs each query to the right model based on complexity, and fallbacks ensure that if the first choice fails, a better model catches it. Together, these patterns can cut LLM costs by 50-80% without measurable quality loss.","bodyHtml":"<p>Not every query needs your most expensive model. 
A user asking \"what are your business hours?\" does not require GPT-4o. A user asking \"analyze this 50-page contract and identify all liability clauses\" does. Routing directs each query to the right model based on complexity, and fallbacks ensure that if the first choice fails, a better model catches it. Together, these patterns can cut LLM costs by 50-80% without measurable quality loss.</p>","sourceUrl":"/ai-and-ml-applied/llm-application-patterns/routing-and-fallbacks","sourceAnchor":"","domain":"data-ai","accentColor":"#08d7d9"},{"id":"cf182f29d943","subject":"ai-and-ml-applied","topic":"llm-application-patterns","subtopic":"routing-and-fallbacks","chunkType":"section","body":"## Multi-Provider Fallback\n\nDo not depend on a single LLM provider. APIs go down. Rate limits hit. Build a provider abstraction that tries your primary provider first and falls back to a secondary. Maintain a list of provider configurations (client, model name, API format), iterate through them on failure, and raise only when all providers are exhausted. This is production hygiene, not premature optimization.","bodyHtml":"<h2>Multi-Provider Fallback</h2>\n<p>Do not depend on a single LLM provider. APIs go down. Rate limits hit. Build a provider abstraction that tries your primary provider first and falls back to a secondary. Maintain a list of provider configurations (client, model name, API format), iterate through them on failure, and raise only when all providers are exhausted. 
This is production hygiene, not premature optimization.</p>","sourceUrl":"/ai-and-ml-applied/llm-application-patterns/routing-and-fallbacks","sourceAnchor":"multi-provider-fallback","domain":"data-ai","accentColor":"#08d7d9"},{"id":"a6d28cc7270b","subject":"ai-and-ml-applied","topic":"llm-application-patterns","subtopic":"routing-and-fallbacks","chunkType":"pitfall","body":"## Pitfalls — Routing & Fallbacks\n\n**Routing overhead exceeds savings.** If your router uses the same expensive model you are trying to avoid, you have gained nothing. Use the cheapest possible classifier or heuristic for routing.\n\n**Over-routing to the cheap model.** Aggressive cost optimization leads to quality degradation on queries that needed the better model. Monitor user satisfaction by route, not just overall.\n\n**No fallback strategy.** Routing without fallbacks means misrouted queries get bad answers with no recovery. Always have a path from the cheap model to the expensive one.\n\n**Hardcoded thresholds.** A confidence threshold of 0.8 might be right today and wrong next month as your query distribution changes. Monitor and adjust regularly.\n\n**Single provider dependency.** If 100% of your LLM traffic goes through one provider and they have an outage, your product is down. Multi-provider fallback is production hygiene, not over-engineering.\n\n**Not A/B testing the router.** Your routing logic is itself a model decision. A/B test routed traffic vs always-expensive-model traffic to verify that routing does not hurt quality.","bodyHtml":"<h2>Pitfalls — Routing &#x26; Fallbacks</h2>\n<p><strong>Routing overhead exceeds savings.</strong> If your router uses the same expensive model you are trying to avoid, you have gained nothing. Use the cheapest possible classifier or heuristic for routing.</p>\n<p><strong>Over-routing to the cheap model.</strong> Aggressive cost optimization leads to quality degradation on queries that needed the better model. 
Monitor user satisfaction by route, not just overall.</p>\n<p><strong>No fallback strategy.</strong> Routing without fallbacks means misrouted queries get bad answers with no recovery. Always have a path from the cheap model to the expensive one.</p>\n<p><strong>Hardcoded thresholds.</strong> A confidence threshold of 0.8 might be right today and wrong next month as your query distribution changes. Monitor and adjust regularly.</p>\n<p><strong>Single provider dependency.</strong> If 100% of your LLM traffic goes through one provider and they have an outage, your product is down. Multi-provider fallback is production hygiene, not over-engineering.</p>\n<p><strong>Not A/B testing the router.</strong> Your routing logic is itself a model decision. A/B test routed traffic vs always-expensive-model traffic to verify that routing does not hurt quality.</p>","sourceUrl":"/ai-and-ml-applied/llm-application-patterns/routing-and-fallbacks","sourceAnchor":"common-pitfalls","domain":"data-ai","accentColor":"#08d7d9"},{"id":"a1437e812a03","subject":"ai-and-ml-applied","topic":"llm-application-patterns","subtopic":"routing-and-fallbacks","chunkType":"takeaway","body":"## Key takeaways — Routing & Fallbacks\n\nMost LLM queries do not need the most expensive model. Route simple queries to cheap models and complex queries to capable ones. Typical savings are 50-80%.\n\nThree routing approaches: LLM classifier (most accurate, adds latency), embedding-based (fast, no API call), heuristic (simplest, surprisingly effective). Start with heuristics, upgrade if needed.\n\nAlways implement fallbacks. Try the cheap model, fall back to the expensive one on low confidence or failure. This is the highest-leverage pattern for cost optimization with quality preservation.\n\nBuild multi-provider fallback as production infrastructure. Single-provider dependency is a reliability risk.\n\nMonitor routing decisions, costs, and quality per model tier. 
You cannot optimize routing without data on how each tier performs.\n\nThe router itself must be cheap. An expensive routing step defeats the purpose.","bodyHtml":"<h2>Key takeaways — Routing &#x26; Fallbacks</h2>\n<p>Most LLM queries do not need the most expensive model. Route simple queries to cheap models and complex queries to capable ones. Typical savings are 50-80%.</p>\n<p>Three routing approaches: LLM classifier (most accurate, adds latency), embedding-based (fast, no API call), heuristic (simplest, surprisingly effective). Start with heuristics, upgrade if needed.</p>\n<p>Always implement fallbacks. Try the cheap model, fall back to the expensive one on low confidence or failure. This is the highest-leverage pattern for cost optimization with quality preservation.</p>\n<p>Build multi-provider fallback as production infrastructure. Single-provider dependency is a reliability risk.</p>\n<p>Monitor routing decisions, costs, and quality per model tier. You cannot optimize routing without data on how each tier performs.</p>\n<p>The router itself must be cheap. An expensive routing step defeats the purpose.</p>","sourceUrl":"/ai-and-ml-applied/llm-application-patterns/routing-and-fallbacks","sourceAnchor":"key-takeaways","domain":"data-ai","accentColor":"#08d7d9"},{"id":"ae7eccf7cceb","subject":"ai-and-ml-applied","topic":"data-for-ml","subtopic":"data-collection-and-labeling","chunkType":"section","body":"## Data Collection & Labeling\n\nData is the most expensive part of any ML project. Not compute, not model architecture, not deployment infrastructure. Getting high-quality, representative training data consumes more time and budget than everything else combined. Yet most teams underestimate this cost and jump straight to modeling.\n\nThe quality of your data puts a hard ceiling on the quality of your model. A simple model trained on excellent data will outperform a sophisticated model trained on noisy, biased, or insufficient data. 
Every hour spent improving your data pipeline pays more dividends than an hour spent tuning hyperparameters.","bodyHtml":"<h2>Data Collection &#x26; Labeling</h2>\n<p>Data is the most expensive part of any ML project. Not compute, not model architecture, not deployment infrastructure. Getting high-quality, representative training data consumes more time and budget than everything else combined. Yet most teams underestimate this cost and jump straight to modeling.</p>\n<p>The quality of your data puts a hard ceiling on the quality of your model. A simple model trained on excellent data will outperform a sophisticated model trained on noisy, biased, or insufficient data. Every hour spent improving your data pipeline pays more dividends than an hour spent tuning hyperparameters.</p>","sourceUrl":"/ai-and-ml-applied/data-for-ml/data-collection-and-labeling","sourceAnchor":"overview","domain":"data-ai","accentColor":"#08d7d9"},{"id":"d2aee8bcc51d","subject":"ai-and-ml-applied","topic":"data-for-ml","subtopic":"data-collection-and-labeling","chunkType":"section","body":"## Real-World Example: Building a Content Moderation Dataset\n\nA social media platform needs to detect toxic comments.\n\n**Phase 1: Bootstrap with existing data.** Pull 50,000 comments that were reported by users and reviewed by moderators. This gives you labels for free, but with selection bias: only reported comments are labeled, and most comments are never reported.\n\n**Phase 2: Active learning on unlabeled data.** Train a model on the Phase 1 data. Run it on 1 million unlabeled comments. Select the 5,000 where the model is most uncertain. Send those to human annotators. This fills gaps in the training distribution.\n\n**Phase 3: Synthetic augmentation.** Generate 2,000 synthetic toxic comments covering categories underrepresented in real data (subtle sarcasm, coded language, context-dependent toxicity). 
Mix with real data at a 10% ratio.\n\n**Phase 4: Continuous labeling.** In production, sample 100 comments per day for human review. Use disagreements between the model and humans as training signals. The dataset grows and improves continuously.","bodyHtml":"<h2>Real-World Example: Building a Content Moderation Dataset</h2>\n<p>A social media platform needs to detect toxic comments.</p>\n<p><strong>Phase 1: Bootstrap with existing data.</strong> Pull 50,000 comments that were reported by users and reviewed by moderators. This gives you labels for free, but with selection bias: only reported comments are labeled, and most comments are never reported.</p>\n<p><strong>Phase 2: Active learning on unlabeled data.</strong> Train a model on the Phase 1 data. Run it on 1 million unlabeled comments. Select the 5,000 where the model is most uncertain. Send those to human annotators. This fills gaps in the training distribution.</p>\n<p><strong>Phase 3: Synthetic augmentation.</strong> Generate 2,000 synthetic toxic comments covering categories underrepresented in real data (subtle sarcasm, coded language, context-dependent toxicity). Mix with real data at a 10% ratio.</p>\n<p><strong>Phase 4: Continuous labeling.</strong> In production, sample 100 comments per day for human review. Use disagreements between the model and humans as training signals. The dataset grows and improves continuously.</p>","sourceUrl":"/ai-and-ml-applied/data-for-ml/data-collection-and-labeling","sourceAnchor":"real-world-example-building-a-content-moderation-dataset","domain":"data-ai","accentColor":"#08d7d9"},{"id":"cd3226699608","subject":"ai-and-ml-applied","topic":"data-for-ml","subtopic":"data-collection-and-labeling","chunkType":"pitfall","body":"## Pitfalls — Data Collection & Labeling\n\n**Collecting data without a clear task definition**: If you don't know exactly what the model needs to predict, you will collect the wrong data. 
Define the task first, then design the data collection.\n\n**Assuming more data always helps**: After a certain point, adding more of the same type of data has diminishing returns. Diversity and quality matter more than raw volume.\n\n**Ignoring class imbalance**: If 95% of your labels are one class, the model learns to always predict that class. Oversample the minority class or use weighted loss functions.\n\n**Not versioning your datasets**: When your model breaks, you need to know whether the data changed. Track every version of every dataset.\n\n**Labeling in isolation**: Labels created by people who don't understand the downstream task are often useless. Annotators need context about why the labels matter.\n\n**Skipping the pilot round**: Always label a small batch first, measure agreement, fix the guidelines, then scale up. Scaling bad guidelines wastes money.","bodyHtml":"<h2>Pitfalls — Data Collection &#x26; Labeling</h2>\n<p><strong>Collecting data without a clear task definition</strong>: If you don't know exactly what the model needs to predict, you will collect the wrong data. Define the task first, then design the data collection.</p>\n<p><strong>Assuming more data always helps</strong>: After a certain point, adding more of the same type of data has diminishing returns. Diversity and quality matter more than raw volume.</p>\n<p><strong>Ignoring class imbalance</strong>: If 95% of your labels are one class, the model learns to always predict that class. Oversample the minority class or use weighted loss functions.</p>\n<p><strong>Not versioning your datasets</strong>: When your model breaks, you need to know whether the data changed. Track every version of every dataset.</p>\n<p><strong>Labeling in isolation</strong>: Labels created by people who don't understand the downstream task are often useless. 
Annotators need context about why the labels matter.</p>\n<p><strong>Skipping the pilot round</strong>: Always label a small batch first, measure agreement, fix the guidelines, then scale up. Scaling bad guidelines wastes money.</p>","sourceUrl":"/ai-and-ml-applied/data-for-ml/data-collection-and-labeling","sourceAnchor":"common-pitfalls","domain":"data-ai","accentColor":"#08d7d9"},{"id":"6af5ae54f589","subject":"ai-and-ml-applied","topic":"data-for-ml","subtopic":"data-collection-and-labeling","chunkType":"takeaway","body":"## Key takeaways — Data Collection & Labeling\n\nData collection and labeling consume more time and budget than any other part of an ML project. Plan for this upfront.\n\nStart with data you already have: application databases, user behavior logs, and existing human decisions are free training labels.\n\nLabel quality matters more than quantity. One thousand clean, consistent labels outperform ten thousand noisy ones.\n\nUse semi-automated labeling and active learning to reduce costs by 50-80% without sacrificing quality.\n\nMeasure inter-annotator agreement early. Low agreement means your task definition is ambiguous, and no amount of data will fix that.\n\nSynthetic data is a useful supplement but not a replacement for real data. Mix it in at low ratios and validate carefully.","bodyHtml":"<h2>Key takeaways — Data Collection &#x26; Labeling</h2>\n<p>Data collection and labeling consume more time and budget than any other part of an ML project. Plan for this upfront.</p>\n<p>Start with data you already have: application databases, user behavior logs, and existing human decisions are free training labels.</p>\n<p>Label quality matters more than quantity. One thousand clean, consistent labels outperform ten thousand noisy ones.</p>\n<p>Use semi-automated labeling and active learning to reduce costs by 50-80% without sacrificing quality.</p>\n<p>Measure inter-annotator agreement early. 
Low agreement means your task definition is ambiguous, and no amount of data will fix that.</p>\n<p>Synthetic data is a useful supplement but not a replacement for real data. Mix it in at low ratios and validate carefully.</p>","sourceUrl":"/ai-and-ml-applied/data-for-ml/data-collection-and-labeling","sourceAnchor":"key-takeaways","domain":"data-ai","accentColor":"#08d7d9"},{"id":"f9af7619f56d","subject":"ai-and-ml-applied","topic":"data-for-ml","subtopic":"data-augmentation-and-synthetic-data","chunkType":"section","body":"## Data Augmentation & Synthetic Data\n\nYou often need more training data than you have. Collecting and labeling real data is slow and expensive. Data augmentation and synthetic data generation let you multiply your existing dataset or create new examples from scratch. When done well, these techniques improve model robustness and reduce overfitting. When done poorly, they introduce noise and artifacts that degrade performance.\n\nThe core principle is simple: create new training examples that are different enough to teach the model something new, but realistic enough that they represent the actual distribution the model will encounter in production.","bodyHtml":"<h2>Data Augmentation &#x26; Synthetic Data</h2>\n<p>You often need more training data than you have. Collecting and labeling real data is slow and expensive. Data augmentation and synthetic data generation let you multiply your existing dataset or create new examples from scratch. When done well, these techniques improve model robustness and reduce overfitting. 
When done poorly, they introduce noise and artifacts that degrade performance.</p>\n<p>The core principle is simple: create new training examples that are different enough to teach the model something new, but realistic enough that they represent the actual distribution the model will encounter in production.</p>","sourceUrl":"/ai-and-ml-applied/data-for-ml/data-augmentation-and-synthetic-data","sourceAnchor":"overview","domain":"data-ai","accentColor":"#08d7d9"},{"id":"8d6beab911c4","subject":"ai-and-ml-applied","topic":"data-for-ml","subtopic":"data-augmentation-and-synthetic-data","chunkType":"section","body":"## Real-World Example: Multi-Language Support\n\nA company has a text classifier trained on 50,000 English customer messages. They need to support Spanish, French, and German, but have fewer than 500 labeled messages in each language.\n\n**Step 1**: Machine-translate the English dataset into each target language to create 50,000 synthetic examples per language. This gives a rough starting point.\n\n**Step 2**: Use GPT-4o to generate 2,000 native-sounding examples per language, with prompts written by native speakers who understand the cultural differences in how customers communicate.\n\n**Step 3**: Have native speakers label 500 real messages per language as a test set. Never use synthetic data for evaluation.\n\n**Step 4**: Train with a mix of machine-translated, LLM-generated, and real data. Use the real test set to measure actual performance.\n\n**Result**: The augmented model achieves 85% accuracy in new languages compared to 91% in English. Without augmentation, accuracy was 62%. The model improves to 89% after one month of collecting real labeled data in production.","bodyHtml":"<h2>Real-World Example: Multi-Language Support</h2>\n<p>A company has a text classifier trained on 50,000 English customer messages. 
They need to support Spanish, French, and German, but have fewer than 500 labeled messages in each language.</p>\n<p><strong>Step 1</strong>: Machine-translate the English dataset into each target language to create 50,000 synthetic examples per language. This gives a rough starting point.</p>\n<p><strong>Step 2</strong>: Use GPT-4o to generate 2,000 native-sounding examples per language, with prompts written by native speakers who understand the cultural differences in how customers communicate.</p>\n<p><strong>Step 3</strong>: Have native speakers label 500 real messages per language as a test set. Never use synthetic data for evaluation.</p>\n<p><strong>Step 4</strong>: Train with a mix of machine-translated, LLM-generated, and real data. Use the real test set to measure actual performance.</p>\n<p><strong>Result</strong>: The augmented model achieves 85% accuracy in new languages compared to 91% in English. Without augmentation, accuracy was 62%. The model improves to 89% after one month of collecting real labeled data in production.</p>","sourceUrl":"/ai-and-ml-applied/data-for-ml/data-augmentation-and-synthetic-data","sourceAnchor":"real-world-example-multi-language-support","domain":"data-ai","accentColor":"#08d7d9"},{"id":"d73dc794274d","subject":"ai-and-ml-applied","topic":"data-for-ml","subtopic":"data-augmentation-and-synthetic-data","chunkType":"pitfall","body":"## Pitfalls — Data Augmentation & Synthetic Data\n\n**Using synthetic data for evaluation**: This is the most common and most damaging mistake. Your test set must always be real, representative data. Synthetic evaluation tells you nothing about production performance.\n\n**Over-augmenting**: If you generate 100x more synthetic data than real data, the model learns the synthetic distribution, not the real one. Keep ratios reasonable.\n\n**Not checking for duplicates**: LLMs often generate near-identical examples. 
Deduplicate aggressively before training.\n\n**Ignoring the generator's biases**: GPT-4o has its own biases about how text should look. Synthetic data inherits these biases. Check that synthetic examples match real data distributions.\n\n**Augmenting without a baseline**: Always measure performance on real data before and after augmentation. If augmentation doesn't improve your real-data metrics, it is not helping.\n\n**Treating all augmentation techniques equally**: Back-translation works well for classification but poorly for extraction tasks where exact spans matter. Match the technique to the task.","bodyHtml":"<h2>Pitfalls — Data Augmentation &#x26; Synthetic Data</h2>\n<p><strong>Using synthetic data for evaluation</strong>: This is the most common and most damaging mistake. Your test set must always be real, representative data. Synthetic evaluation tells you nothing about production performance.</p>\n<p><strong>Over-augmenting</strong>: If you generate 100x more synthetic data than real data, the model learns the synthetic distribution, not the real one. Keep ratios reasonable.</p>\n<p><strong>Not checking for duplicates</strong>: LLMs often generate near-identical examples. Deduplicate aggressively before training.</p>\n<p><strong>Ignoring the generator's biases</strong>: GPT-4o has its own biases about how text should look. Synthetic data inherits these biases. Check that synthetic examples match real data distributions.</p>\n<p><strong>Augmenting without a baseline</strong>: Always measure performance on real data before and after augmentation. If augmentation doesn't improve your real-data metrics, it is not helping.</p>\n<p><strong>Treating all augmentation techniques equally</strong>: Back-translation works well for classification but poorly for extraction tasks where exact spans matter. 
Match the technique to the task.</p>","sourceUrl":"/ai-and-ml-applied/data-for-ml/data-augmentation-and-synthetic-data","sourceAnchor":"common-pitfalls","domain":"data-ai","accentColor":"#08d7d9"},{"id":"534ad2481f88","subject":"ai-and-ml-applied","topic":"data-for-ml","subtopic":"data-augmentation-and-synthetic-data","chunkType":"takeaway","body":"## Key takeaways — Data Augmentation & Synthetic Data\n\nData augmentation multiplies your existing dataset. Synthetic generation creates data from scratch. Both require careful quality control.\n\nFor text, the most effective augmentation techniques are back-translation, LLM paraphrasing, and minority class generation. Synonym replacement is limited.\n\nUse synthetic data for training, never for evaluation. The evaluation set must always be real, representative data.\n\nQuality control is mandatory: filter for duplicates, verify labels, check that synthetic examples match real data distributions.\n\nMix synthetic and real data at appropriate ratios. More real data means less synthetic data needed.\n\nLLM-generated synthetic data is a powerful tool for bootstrapping new tasks and balancing class distributions, but it cannot replace real data collection long-term.","bodyHtml":"<h2>Key takeaways — Data Augmentation &#x26; Synthetic Data</h2>\n<p>Data augmentation multiplies your existing dataset. Synthetic generation creates data from scratch. Both require careful quality control.</p>\n<p>For text, the most effective augmentation techniques are back-translation, LLM paraphrasing, and minority class generation. Synonym replacement is limited.</p>\n<p>Use synthetic data for training, never for evaluation. The evaluation set must always be real, representative data.</p>\n<p>Quality control is mandatory: filter for duplicates, verify labels, check that synthetic examples match real data distributions.</p>\n<p>Mix synthetic and real data at appropriate ratios. 
More real data means less synthetic data needed.</p>\n<p>LLM-generated synthetic data is a powerful tool for bootstrapping new tasks and balancing class distributions, but it cannot replace real data collection long-term.</p>","sourceUrl":"/ai-and-ml-applied/data-for-ml/data-augmentation-and-synthetic-data","sourceAnchor":"key-takeaways","domain":"data-ai","accentColor":"#08d7d9"},{"id":"cf76bf78d3c2","subject":"ai-and-ml-applied","topic":"data-for-ml","subtopic":"bias-and-dataset-versioning","chunkType":"section","body":"## Bias & Dataset Versioning\n\nEvery dataset has bias. This is not a moral judgment but a statistical reality. The data you collect reflects who collected it, how it was collected, and what was available at the time. A hiring dataset built from historical decisions encodes past discrimination. A medical dataset from one hospital reflects that hospital's patient demographics. A language dataset from the internet overrepresents English speakers with internet access.\n\nAcknowledging bias is the first step. Measuring it is the second. Versioning your datasets so you can trace which data produced which model is the third. Together, these practices make the difference between ML systems that fail silently and systems you can debug, improve, and trust.","bodyHtml":"<h2>Bias &#x26; Dataset Versioning</h2>\n<p>Every dataset has bias. This is not a moral judgment but a statistical reality. The data you collect reflects who collected it, how it was collected, and what was available at the time. A hiring dataset built from historical decisions encodes past discrimination. A medical dataset from one hospital reflects that hospital's patient demographics. A language dataset from the internet overrepresents English speakers with internet access.</p>\n<p>Acknowledging bias is the first step. Measuring it is the second. Versioning your datasets so you can trace which data produced which model is the third. 
Together, these practices make the difference between ML systems that fail silently and systems you can debug, improve, and trust.</p>","sourceUrl":"/ai-and-ml-applied/data-for-ml/bias-and-dataset-versioning","sourceAnchor":"overview","domain":"data-ai","accentColor":"#08d7d9"},{"id":"7d027f49f6b0","subject":"ai-and-ml-applied","topic":"data-for-ml","subtopic":"bias-and-dataset-versioning","chunkType":"section","body":"## Real-World Example: Detecting Bias in a Loan Approval Model\n\nA bank builds a model to predict loan defaults. The model achieves 92% accuracy overall but a fairness audit reveals problems.\n\n**Step 1: Demographic analysis.** The training data contains 70% white applicants, 15% Hispanic, 10% Black, 5% other. The actual applicant pool is more evenly distributed. The model has less data to learn from for minority groups.\n\n**Step 2: Slice-based evaluation.** Accuracy for white applicants is 94%. Accuracy for Black applicants is 81%. The model is significantly worse for underrepresented groups.\n\n**Step 3: Feature analysis.** The model uses zip code as a feature. Zip codes correlate with race due to historical segregation. Removing zip code and using income-based features instead reduces the accuracy gap from 13 points to 4 points.\n\n**Step 4: Dataset versioning.** The team versions the dataset, records the fairness metrics for each version, and establishes a rule: no model ships if the accuracy gap between any two demographic groups exceeds 5 percentage points.\n\n**Step 5: Ongoing monitoring.** In production, the team monitors approval rates by demographic group weekly. If a drift is detected, they investigate whether the data distribution has shifted.","bodyHtml":"<h2>Real-World Example: Detecting Bias in a Loan Approval Model</h2>\n<p>A bank builds a model to predict loan defaults. 
The model achieves 92% accuracy overall but a fairness audit reveals problems.</p>\n<p><strong>Step 1: Demographic analysis.</strong> The training data contains 70% white applicants, 15% Hispanic, 10% Black, 5% other. The actual applicant pool is more evenly distributed. The model has less data to learn from for minority groups.</p>\n<p><strong>Step 2: Slice-based evaluation.</strong> Accuracy for white applicants is 94%. Accuracy for Black applicants is 81%. The model is significantly worse for underrepresented groups.</p>\n<p><strong>Step 3: Feature analysis.</strong> The model uses zip code as a feature. Zip codes correlate with race due to historical segregation. Removing zip code and using income-based features instead reduces the accuracy gap from 13 points to 4 points.</p>\n<p><strong>Step 4: Dataset versioning.</strong> The team versions the dataset, records the fairness metrics for each version, and establishes a rule: no model ships if the accuracy gap between any two demographic groups exceeds 5 percentage points.</p>\n<p><strong>Step 5: Ongoing monitoring.</strong> In production, the team monitors approval rates by demographic group weekly. If a drift is detected, they investigate whether the data distribution has shifted.</p>","sourceUrl":"/ai-and-ml-applied/data-for-ml/bias-and-dataset-versioning","sourceAnchor":"real-world-example-detecting-bias-in-a-loan-approval-model","domain":"data-ai","accentColor":"#08d7d9"},{"id":"6a141ca75213","subject":"ai-and-ml-applied","topic":"data-for-ml","subtopic":"bias-and-dataset-versioning","chunkType":"pitfall","body":"## Pitfalls — Bias & Dataset Versioning\n\n**Assuming your dataset is unbiased**: Every dataset has bias. The question is not \"is there bias?\" but \"what kind of bias, and how much does it matter for this use case?\"\n\n**Measuring only overall metrics**: A model with 95% overall accuracy can have 60% accuracy on a minority subgroup. 
Always evaluate by slice.\n\n**Using protected attributes as features**: Even if you remove race or gender from the feature set, proxies like zip code or name can encode the same information.\n\n**Not versioning datasets**: When your model degrades, you need to know if the data changed. Without versioning, you are debugging blind.\n\n**Manual dataset management**: Copying files to folders named \"data_v2_final_FINAL\" does not scale. Use proper versioning tools.\n\n**Treating reproducibility as optional**: If you cannot reproduce a model, you cannot debug it, audit it, or explain it. Reproducibility is a requirement, not a nice-to-have.","bodyHtml":"<h2>Pitfalls — Bias &#x26; Dataset Versioning</h2>\n<p><strong>Assuming your dataset is unbiased</strong>: Every dataset has bias. The question is not \"is there bias?\" but \"what kind of bias, and how much does it matter for this use case?\"</p>\n<p><strong>Measuring only overall metrics</strong>: A model with 95% overall accuracy can have 60% accuracy on a minority subgroup. Always evaluate by slice.</p>\n<p><strong>Using protected attributes as features</strong>: Even if you remove race or gender from the feature set, proxies like zip code or name can encode the same information.</p>\n<p><strong>Not versioning datasets</strong>: When your model degrades, you need to know if the data changed. Without versioning, you are debugging blind.</p>\n<p><strong>Manual dataset management</strong>: Copying files to folders named \"data_v2_final_FINAL\" does not scale. Use proper versioning tools.</p>\n<p><strong>Treating reproducibility as optional</strong>: If you cannot reproduce a model, you cannot debug it, audit it, or explain it. 
Reproducibility is a requirement, not a nice-to-have.</p>","sourceUrl":"/ai-and-ml-applied/data-for-ml/bias-and-dataset-versioning","sourceAnchor":"common-pitfalls","domain":"data-ai","accentColor":"#08d7d9"},{"id":"1ccc73721f03","subject":"ai-and-ml-applied","topic":"data-for-ml","subtopic":"bias-and-dataset-versioning","chunkType":"takeaway","body":"## Key takeaways — Bias & Dataset Versioning\n\nEvery dataset has bias: selection bias, measurement bias, historical bias, and survivorship bias. Acknowledge it, measure it, and mitigate it.\n\nSlice-based evaluation is essential. Overall metrics hide disparities between subgroups. Break down performance by every relevant dimension.\n\nDataset versioning is as important as code versioning. Use DVC, git-lfs, or LakeFS depending on your data scale.\n\nReproducibility requires recording the exact dataset version, code commit, random seed, and hyperparameters for every model you train.\n\nBias detection is not a one-time audit. Monitor fairness metrics continuously in production, just as you monitor accuracy and latency.\n\nThe same data plus the same code should produce the same model. If it does not, you have a reproducibility gap that will eventually cause problems.","bodyHtml":"<h2>Key takeaways — Bias &#x26; Dataset Versioning</h2>\n<p>Every dataset has bias: selection bias, measurement bias, historical bias, and survivorship bias. Acknowledge it, measure it, and mitigate it.</p>\n<p>Slice-based evaluation is essential. Overall metrics hide disparities between subgroups. Break down performance by every relevant dimension.</p>\n<p>Dataset versioning is as important as code versioning. Use DVC, git-lfs, or LakeFS depending on your data scale.</p>\n<p>Reproducibility requires recording the exact dataset version, code commit, random seed, and hyperparameters for every model you train.</p>\n<p>Bias detection is not a one-time audit. 
Monitor fairness metrics continuously in production, just as you monitor accuracy and latency.</p>\n<p>The same data plus the same code should produce the same model. If it does not, you have a reproducibility gap that will eventually cause problems.</p>","sourceUrl":"/ai-and-ml-applied/data-for-ml/bias-and-dataset-versioning","sourceAnchor":"key-takeaways","domain":"data-ai","accentColor":"#08d7d9"},{"id":"39acc6e00d8e","subject":"ai-and-ml-applied","topic":"evaluation-and-testing","subtopic":"evaluating-ml-systems","chunkType":"section","body":"## Evaluating ML Systems\n\nOffline metrics tell you how a model performs on a test set. Online metrics tell you how it performs in the real world. These two numbers often disagree. A model with 99% accuracy on a carefully curated test set can fail spectacularly when deployed to actual users, because the test set does not reflect the messy, shifting distribution of real-world inputs.\n\nThe gap between offline evaluation and production performance is where most ML projects fail. Bridging that gap requires measuring the right things at the right time, and never trusting a single number to tell the whole story.","bodyHtml":"<h2>Evaluating ML Systems</h2>\n<p>Offline metrics tell you how a model performs on a test set. Online metrics tell you how it performs in the real world. These two numbers often disagree. A model with 99% accuracy on a carefully curated test set can fail spectacularly when deployed to actual users, because the test set does not reflect the messy, shifting distribution of real-world inputs.</p>\n<p>The gap between offline evaluation and production performance is where most ML projects fail. 
Bridging that gap requires measuring the right things at the right time, and never trusting a single number to tell the whole story.</p>","sourceUrl":"/ai-and-ml-applied/evaluation-and-testing/evaluating-ml-systems","sourceAnchor":"overview","domain":"data-ai","accentColor":"#08d7d9"},{"id":"5eb9aaec7b48","subject":"ai-and-ml-applied","topic":"evaluation-and-testing","subtopic":"evaluating-ml-systems","chunkType":"section","body":"## Real-World Example: Search Ranking Model\n\nA team builds a new search ranking model for an e-commerce site.\n\n**Offline evaluation**: NDCG@10 improves from 0.72 to 0.78 on the test set. The team is excited and prepares to deploy.\n\n**Shadow deployment**: The new model runs alongside the old one on live traffic, but its results are not shown to users. The team discovers that the new model is 3x slower (800ms vs 250ms latency). At scale, this would degrade the user experience.\n\n**After optimization**: Latency is reduced to 300ms. The team runs an A/B test: 50% of users see the new model's results.\n\n**A/B test results**: Click-through rate improves by 4%. But revenue per session is flat. Users click more but buy the same amount. The model is surfacing more \"interesting\" results that attract clicks but not purchases.\n\n**Iteration**: The team adds purchase-weighted relevance to the training signal. The updated model shows both higher click-through rate (+3%) and higher revenue per session (+1.5%). This version ships.\n\nThe offline metric (NDCG) was directionally correct but did not predict the revenue impact. Only the online A/B test revealed the full picture.","bodyHtml":"<h2>Real-World Example: Search Ranking Model</h2>\n<p>A team builds a new search ranking model for an e-commerce site.</p>\n<p><strong>Offline evaluation</strong>: NDCG@10 improves from 0.72 to 0.78 on the test set. 
The team is excited and prepares to deploy.</p>\n<p><strong>Shadow deployment</strong>: The new model runs alongside the old one on live traffic, but its results are not shown to users. The team discovers that the new model is 3x slower (800ms vs 250ms latency). At scale, this would degrade the user experience.</p>\n<p><strong>After optimization</strong>: Latency is reduced to 300ms. The team runs an A/B test: 50% of users see the new model's results.</p>\n<p><strong>A/B test results</strong>: Click-through rate improves by 4%. But revenue per session is flat. Users click more but buy the same amount. The model is surfacing more \"interesting\" results that attract clicks but not purchases.</p>\n<p><strong>Iteration</strong>: The team adds purchase-weighted relevance to the training signal. The updated model shows both higher click-through rate (+3%) and higher revenue per session (+1.5%). This version ships.</p>\n<p>The offline metric (NDCG) was directionally correct but did not predict the revenue impact. Only the online A/B test revealed the full picture.</p>","sourceUrl":"/ai-and-ml-applied/evaluation-and-testing/evaluating-ml-systems","sourceAnchor":"real-world-example-search-ranking-model","domain":"data-ai","accentColor":"#08d7d9"},{"id":"5f0084a27917","subject":"ai-and-ml-applied","topic":"evaluation-and-testing","subtopic":"evaluating-ml-systems","chunkType":"pitfall","body":"## Pitfalls — Evaluating ML Systems\n\n**Trusting a single metric**: No single number captures model quality. Report accuracy, precision, recall, and task-specific metrics together.\n\n**Evaluating only on the test set**: The test set is a starting point, not the final answer. Shadow deployments and A/B tests are required before you can trust a model in production.\n\n**Ignoring class imbalance**: Accuracy on imbalanced datasets is meaningless. 
Use precision, recall, F1, or AUC instead.\n\n**Not measuring latency and cost**: A model that is 5% more accurate but 10x slower or 10x more expensive is often a bad trade.\n\n**Skipping slice-based evaluation**: Overall metrics hide failures on subgroups. Always break down performance by relevant dimensions.\n\n**Comparing to the wrong baseline**: Compare your model to the current production system, not to random chance. A model with 90% accuracy is useless if the existing rule-based system achieves 89%.","bodyHtml":"<h2>Pitfalls — Evaluating ML Systems</h2>\n<p><strong>Trusting a single metric</strong>: No single number captures model quality. Report accuracy, precision, recall, and task-specific metrics together.</p>\n<p><strong>Evaluating only on the test set</strong>: The test set is a starting point, not the final answer. Shadow deployments and A/B tests are required before you can trust a model in production.</p>\n<p><strong>Ignoring class imbalance</strong>: Accuracy on imbalanced datasets is meaningless. Use precision, recall, F1, or AUC instead.</p>\n<p><strong>Not measuring latency and cost</strong>: A model that is 5% more accurate but 10x slower or 10x more expensive is often a bad trade.</p>\n<p><strong>Skipping slice-based evaluation</strong>: Overall metrics hide failures on subgroups. Always break down performance by relevant dimensions.</p>\n<p><strong>Comparing to the wrong baseline</strong>: Compare your model to the current production system, not to random chance. 
A model with 90% accuracy is useless if the existing rule-based system achieves 89%.</p>","sourceUrl":"/ai-and-ml-applied/evaluation-and-testing/evaluating-ml-systems","sourceAnchor":"common-pitfalls","domain":"data-ai","accentColor":"#08d7d9"},{"id":"3b25bd720131","subject":"ai-and-ml-applied","topic":"evaluation-and-testing","subtopic":"evaluating-ml-systems","chunkType":"takeaway","body":"## Key takeaways — Evaluating ML Systems\n\nOffline metrics (accuracy, F1, AUC) are necessary but insufficient. They tell you how a model performs on a frozen test set, not on live traffic.\n\nOnline metrics (user satisfaction, task completion, business outcomes) are the true measure of model quality. Always validate with A/B tests before full deployment.\n\nThe evaluation gap between research and production is real. A model that wins on benchmarks can fail in production due to distribution shift, latency, cost, or edge cases.\n\nUse multiple evaluation strategies: offline metrics for fast iteration, shadow deployments for safety, A/B tests for final validation.\n\nAlways evaluate by slice. Overall metrics hide failures on subgroups that matter to your users and your business.","bodyHtml":"<h2>Key takeaways — Evaluating ML Systems</h2>\n<p>Offline metrics (accuracy, F1, AUC) are necessary but insufficient. They tell you how a model performs on a frozen test set, not on live traffic.</p>\n<p>Online metrics (user satisfaction, task completion, business outcomes) are the true measure of model quality. Always validate with A/B tests before full deployment.</p>\n<p>The evaluation gap between research and production is real. A model that wins on benchmarks can fail in production due to distribution shift, latency, cost, or edge cases.</p>\n<p>Use multiple evaluation strategies: offline metrics for fast iteration, shadow deployments for safety, A/B tests for final validation.</p>\n<p>Always evaluate by slice. 
Overall metrics hide failures on subgroups that matter to your users and your business.</p>","sourceUrl":"/ai-and-ml-applied/evaluation-and-testing/evaluating-ml-systems","sourceAnchor":"key-takeaways","domain":"data-ai","accentColor":"#08d7d9"},{"id":"23fa204fc298","subject":"ai-and-ml-applied","topic":"evaluation-and-testing","subtopic":"evaluating-llm-outputs","chunkType":"section","body":"## Evaluating LLM Outputs\n\nEvaluating LLMs is fundamentally harder than evaluating traditional ML models. A spam classifier has one correct answer per input. An LLM asked to \"write a professional email declining a meeting\" has thousands of valid answers. There is no single ground truth to compare against.\n\nThis makes LLM evaluation a combination of art and engineering. You need multiple evaluation approaches, each capturing different aspects of quality. Human evaluation is the gold standard but too expensive for every change. Automated metrics are cheap but miss nuance. LLM-as-judge sits in between, offering scalable evaluation with known biases. The practical answer is to use all three, with the right approach for the right situation.","bodyHtml":"<h2>Evaluating LLM Outputs</h2>\n<p>Evaluating LLMs is fundamentally harder than evaluating traditional ML models. A spam classifier has one correct answer per input. An LLM asked to \"write a professional email declining a meeting\" has thousands of valid answers. There is no single ground truth to compare against.</p>\n<p>This makes LLM evaluation a combination of art and engineering. You need multiple evaluation approaches, each capturing different aspects of quality. Human evaluation is the gold standard but too expensive for every change. Automated metrics are cheap but miss nuance. LLM-as-judge sits in between, offering scalable evaluation with known biases. 
The practical answer is to use all three, with the right approach for the right situation.</p>","sourceUrl":"/ai-and-ml-applied/evaluation-and-testing/evaluating-llm-outputs","sourceAnchor":"overview","domain":"data-ai","accentColor":"#08d7d9"},{"id":"b8ad12fffedd","subject":"ai-and-ml-applied","topic":"evaluation-and-testing","subtopic":"evaluating-llm-outputs","chunkType":"section","body":"## Real-World Example: Evaluating a Customer Support Bot\n\nA team builds an LLM-powered customer support bot. They need to evaluate quality before launch.\n\n**Automated metrics (run on every PR)**: Format compliance (does the response include a greeting and sign-off?), length check (50-500 words), safety check (no PII leakage, no harmful content). These catch obvious failures in seconds.\n\n**LLM-as-judge (run weekly)**: Rate 500 responses on helpfulness, accuracy, and tone. Compare the bot's performance across different ticket categories. Identify categories where the bot struggles. Cost: approximately 25 dollars per run.\n\n**Human evaluation (run before major releases)**: Domain experts review 100 responses from the bot, focusing on factual accuracy and policy compliance. Pairwise comparison against the previous version. Cost: approximately 2,000 dollars per run.\n\n**Result**: Automated metrics catch 80% of regressions immediately. LLM-as-judge catches subtle quality issues within a week. Human evaluation validates that the bot meets company standards before launch.","bodyHtml":"<h2>Real-World Example: Evaluating a Customer Support Bot</h2>\n<p>A team builds an LLM-powered customer support bot. They need to evaluate quality before launch.</p>\n<p><strong>Automated metrics (run on every PR)</strong>: Format compliance (does the response include a greeting and sign-off?), length check (50-500 words), safety check (no PII leakage, no harmful content). 
These catch obvious failures in seconds.</p>\n<p><strong>LLM-as-judge (run weekly)</strong>: Rate 500 responses on helpfulness, accuracy, and tone. Compare the bot's performance across different ticket categories. Identify categories where the bot struggles. Cost: approximately 25 dollars per run.</p>\n<p><strong>Human evaluation (run before major releases)</strong>: Domain experts review 100 responses from the bot, focusing on factual accuracy and policy compliance. Pairwise comparison against the previous version. Cost: approximately 2,000 dollars per run.</p>\n<p><strong>Result</strong>: Automated metrics catch 80% of regressions immediately. LLM-as-judge catches subtle quality issues within a week. Human evaluation validates that the bot meets company standards before launch.</p>","sourceUrl":"/ai-and-ml-applied/evaluation-and-testing/evaluating-llm-outputs","sourceAnchor":"real-world-example-evaluating-a-customer-support-bot","domain":"data-ai","accentColor":"#08d7d9"},{"id":"9d24c394c226","subject":"ai-and-ml-applied","topic":"evaluation-and-testing","subtopic":"evaluating-llm-outputs","chunkType":"pitfall","body":"## Pitfalls — Evaluating LLM Outputs\n\n**Relying on BLEU or ROUGE for open-ended generation**: These metrics measure word overlap, not quality. A response can have zero BLEU score and still be excellent.\n\n**Using the same model as both generator and judge**: If GPT-4o generates the response and GPT-4o judges it, the evaluation is biased toward that model's style and preferences.\n\n**Not randomizing order in pairwise comparisons**: LLM judges strongly prefer the first response shown. Always randomize and run both orderings.\n\n**Evaluating on too few examples**: Fifty examples is not enough for reliable evaluation. 
Aim for 200+ for automated metrics and 100+ for human evaluation.\n\n**Ignoring edge cases in evaluation**: Your evaluation set should include adversarial inputs, ambiguous queries, and out-of-scope requests, not just happy-path examples.\n\n**Treating evaluation as a one-time activity**: LLM quality can change with model updates, prompt changes, or data shifts. Evaluate continuously.","bodyHtml":"<h2>Pitfalls — Evaluating LLM Outputs</h2>\n<p><strong>Relying on BLEU or ROUGE for open-ended generation</strong>: These metrics measure word overlap, not quality. A response can have zero BLEU score and still be excellent.</p>\n<p><strong>Using the same model as both generator and judge</strong>: If GPT-4o generates the response and GPT-4o judges it, the evaluation is biased toward that model's style and preferences.</p>\n<p><strong>Not randomizing order in pairwise comparisons</strong>: LLM judges strongly prefer the first response shown. Always randomize and run both orderings.</p>\n<p><strong>Evaluating on too few examples</strong>: Fifty examples is not enough for reliable evaluation. Aim for 200+ for automated metrics and 100+ for human evaluation.</p>\n<p><strong>Ignoring edge cases in evaluation</strong>: Your evaluation set should include adversarial inputs, ambiguous queries, and out-of-scope requests, not just happy-path examples.</p>\n<p><strong>Treating evaluation as a one-time activity</strong>: LLM quality can change with model updates, prompt changes, or data shifts. 
Evaluate continuously.</p>","sourceUrl":"/ai-and-ml-applied/evaluation-and-testing/evaluating-llm-outputs","sourceAnchor":"common-pitfalls","domain":"data-ai","accentColor":"#08d7d9"},{"id":"ce27af2476dd","subject":"ai-and-ml-applied","topic":"evaluation-and-testing","subtopic":"evaluating-llm-outputs","chunkType":"takeaway","body":"## Key takeaways — Evaluating LLM Outputs\n\nLLM evaluation requires multiple approaches: automated metrics for speed, LLM-as-judge for scale, human evaluation for ground truth.\n\nTraditional NLP metrics (BLEU, ROUGE) are mostly useless for open-ended generation. Use them only for translation and summarization.\n\nLLM-as-judge is powerful but biased. Mitigate position bias by swapping order. Never use the same model as both generator and judge.\n\nTask-specific evaluation (factuality, format compliance, safety) is more valuable than general quality scores.\n\nBuild a layered evaluation pipeline: cheap automated checks on every change, LLM-based evaluation on major changes, human evaluation before releases.\n\nHuman evaluation remains the gold standard. Budget for it and use it to calibrate your automated approaches.","bodyHtml":"<h2>Key takeaways — Evaluating LLM Outputs</h2>\n<p>LLM evaluation requires multiple approaches: automated metrics for speed, LLM-as-judge for scale, human evaluation for ground truth.</p>\n<p>Traditional NLP metrics (BLEU, ROUGE) are mostly useless for open-ended generation. Use them only for translation and summarization.</p>\n<p>LLM-as-judge is powerful but biased. Mitigate position bias by swapping order. Never use the same model as both generator and judge.</p>\n<p>Task-specific evaluation (factuality, format compliance, safety) is more valuable than general quality scores.</p>\n<p>Build a layered evaluation pipeline: cheap automated checks on every change, LLM-based evaluation on major changes, human evaluation before releases.</p>\n<p>Human evaluation remains the gold standard. 
Budget for it and use it to calibrate your automated approaches.</p>","sourceUrl":"/ai-and-ml-applied/evaluation-and-testing/evaluating-llm-outputs","sourceAnchor":"key-takeaways","domain":"data-ai","accentColor":"#08d7d9"},{"id":"cf4487cf880c","subject":"ai-and-ml-applied","topic":"evaluation-and-testing","subtopic":"regression-testing-for-ml","chunkType":"section","body":"## Regression Testing for ML\n\nYour new model is better on average. The aggregate metrics improved across the board. You deploy it, and within hours, a specific customer segment reports that the system is broken for them. The new model handles 95% of cases better but handles 5% of cases significantly worse. Nobody noticed because the 5% was invisible in the averages.\n\nML regression testing exists to prevent this. It is the practice of maintaining a curated set of inputs with known expected outputs, running every model change against that set, and blocking deployment when important cases degrade. If traditional software testing asks \"does the code still work?\", ML regression testing asks \"does the model still work for the cases we know matter?\"","bodyHtml":"<h2>Regression Testing for ML</h2>\n<p>Your new model is better on average. The aggregate metrics improved across the board. You deploy it, and within hours, a specific customer segment reports that the system is broken for them. The new model handles 95% of cases better but handles 5% of cases significantly worse. Nobody noticed because the 5% was invisible in the averages.</p>\n<p>ML regression testing exists to prevent this. It is the practice of maintaining a curated set of inputs with known expected outputs, running every model change against that set, and blocking deployment when important cases degrade. 
If traditional software testing asks \"does the code still work?\", ML regression testing asks \"does the model still work for the cases we know matter?\"</p>","sourceUrl":"/ai-and-ml-applied/evaluation-and-testing/regression-testing-for-ml","sourceAnchor":"overview","domain":"data-ai","accentColor":"#08d7d9"},{"id":"85d8d861c024","subject":"ai-and-ml-applied","topic":"evaluation-and-testing","subtopic":"regression-testing-for-ml","chunkType":"section","body":"## Real-World Example: Deploying a New Summarization Model\n\nA team upgrades their document summarization model from v3 to v4.\n\n**Step 1**: Run the golden dataset (200 documents with expert-written reference summaries). Results: 195 pass, 5 fail. All 5 failures are on legal documents where the new model omits key clauses.\n\n**Step 2**: Investigate the failures. The new model was trained on more general text and has less exposure to legal language. The team adds 50 legal documents to the training data and retrains.\n\n**Step 3**: Rerun. All 200 pass. Run full evaluation: ROUGE scores improve on average. Slice-based evaluation shows improvement across all categories including legal.\n\n**Step 4**: Shadow deployment for 48 hours. Compare v3 and v4 outputs on live traffic. No anomalies detected.\n\n**Step 5**: Canary deployment at 10%. Monitor acceptance rate (do users edit the summaries?). V4 acceptance rate is 2% higher than v3. Roll out to 100%.\n\n**Step 6**: Add the 5 failing legal documents to the golden dataset permanently. Future model changes will be tested against them.","bodyHtml":"<h2>Real-World Example: Deploying a New Summarization Model</h2>\n<p>A team upgrades their document summarization model from v3 to v4.</p>\n<p><strong>Step 1</strong>: Run the golden dataset (200 documents with expert-written reference summaries). Results: 195 pass, 5 fail. All 5 failures are on legal documents where the new model omits key clauses.</p>\n<p><strong>Step 2</strong>: Investigate the failures. 
The new model was trained on more general text and has less exposure to legal language. The team adds 50 legal documents to the training data and retrains.</p>\n<p><strong>Step 3</strong>: Rerun. All 200 pass. Run full evaluation: ROUGE scores improve on average. Slice-based evaluation shows improvement across all categories including legal.</p>\n<p><strong>Step 4</strong>: Shadow deployment for 48 hours. Compare v3 and v4 outputs on live traffic. No anomalies detected.</p>\n<p><strong>Step 5</strong>: Canary deployment at 10%. Monitor acceptance rate (do users edit the summaries?). V4 acceptance rate is 2% higher than v3. Roll out to 100%.</p>\n<p><strong>Step 6</strong>: Add the 5 failing legal documents to the golden dataset permanently. Future model changes will be tested against them.</p>","sourceUrl":"/ai-and-ml-applied/evaluation-and-testing/regression-testing-for-ml","sourceAnchor":"real-world-example-deploying-a-new-summarization-model","domain":"data-ai","accentColor":"#08d7d9"},{"id":"377a0976bf44","subject":"ai-and-ml-applied","topic":"evaluation-and-testing","subtopic":"regression-testing-for-ml","chunkType":"pitfall","body":"## Pitfalls — Regression Testing for ML\n\n**No regression test suite at all**: Many ML teams ship model updates without any regression testing. This is the equivalent of deploying code without running tests.\n\n**Testing only on aggregate metrics**: If you only check overall accuracy, you will miss regressions on specific subgroups that matter.\n\n**Stale golden datasets**: If your golden dataset does not evolve with your product, it stops catching relevant failures. Review and update it regularly.\n\n**Flaky tests due to non-determinism**: LLM outputs are non-deterministic. Use temperature=0 for regression tests or use assertion types that tolerate variation (semantic similarity, contains-check instead of exact match).\n\n**Blocking on every failure**: Not all failures are equal. 
A critical flow regression should block deployment. A minor edge case regression might be acceptable. Use priority levels.\n\n**Manual regression testing**: If humans have to run the tests manually, they will eventually skip them. Automate regression tests in CI/CD.","bodyHtml":"<h2>Pitfalls — Regression Testing for ML</h2>\n<p><strong>No regression test suite at all</strong>: Many ML teams ship model updates without any regression testing. This is the equivalent of deploying code without running tests.</p>\n<p><strong>Testing only on aggregate metrics</strong>: If you only check overall accuracy, you will miss regressions on specific subgroups that matter.</p>\n<p><strong>Stale golden datasets</strong>: If your golden dataset does not evolve with your product, it stops catching relevant failures. Review and update it regularly.</p>\n<p><strong>Flaky tests due to non-determinism</strong>: LLM outputs are non-deterministic. Use temperature=0 for regression tests or use assertion types that tolerate variation (semantic similarity, contains-check instead of exact match).</p>\n<p><strong>Blocking on every failure</strong>: Not all failures are equal. A critical flow regression should block deployment. A minor edge case regression might be acceptable. Use priority levels.</p>\n<p><strong>Manual regression testing</strong>: If humans have to run the tests manually, they will eventually skip them. Automate regression tests in CI/CD.</p>","sourceUrl":"/ai-and-ml-applied/evaluation-and-testing/regression-testing-for-ml","sourceAnchor":"common-pitfalls","domain":"data-ai","accentColor":"#08d7d9"},{"id":"9b5bf5d7816d","subject":"ai-and-ml-applied","topic":"evaluation-and-testing","subtopic":"regression-testing-for-ml","chunkType":"takeaway","body":"## Key takeaways — Regression Testing for ML\n\nML regression testing prevents the \"better on average, worse for you\" problem. 
Every model change should be tested against a curated set of known-good examples.\n\nBuild a golden dataset that includes common cases, edge cases, past bug fixes, and critical user flows. Grow it from every production incident.\n\nUse soft assertions for ML: semantic similarity, contains-checks, classification matching. Exact match is too brittle for non-deterministic outputs.\n\nEmbed regression tests in CI/CD. Block deployment on critical failures. Warn on minor degradations.\n\nTest embeddings separately. A new embedding model can silently break all downstream systems that depend on vector similarity.\n\nAuto-rollback is your safety net. Monitor online metrics and revert automatically when they cross thresholds.","bodyHtml":"<h2>Key takeaways — Regression Testing for ML</h2>\n<p>ML regression testing prevents the \"better on average, worse for you\" problem. Every model change should be tested against a curated set of known-good examples.</p>\n<p>Build a golden dataset that includes common cases, edge cases, past bug fixes, and critical user flows. Grow it from every production incident.</p>\n<p>Use soft assertions for ML: semantic similarity, contains-checks, classification matching. Exact match is too brittle for non-deterministic outputs.</p>\n<p>Embed regression tests in CI/CD. Block deployment on critical failures. Warn on minor degradations.</p>\n<p>Test embeddings separately. A new embedding model can silently break all downstream systems that depend on vector similarity.</p>\n<p>Auto-rollback is your safety net. 
Monitor online metrics and revert automatically when they cross thresholds.</p>","sourceUrl":"/ai-and-ml-applied/evaluation-and-testing/regression-testing-for-ml","sourceAnchor":"key-takeaways","domain":"data-ai","accentColor":"#08d7d9"},{"id":"958b3584148b","subject":"ai-and-ml-applied","topic":"ethics-and-responsible-ai","subtopic":"bias-and-fairness","chunkType":"section","body":"## Bias & Fairness\n\nAI systems amplify the biases present in their training data. A hiring algorithm trained on a decade of hiring decisions learns not just what makes a good candidate but also which candidates the company historically preferred, including any discriminatory patterns. A loan approval model trained on historical approvals inherits redlining practices from decades past. A facial recognition system trained predominantly on light-skinned faces performs dramatically worse on dark-skinned faces.\n\nThese are not hypothetical scenarios. They are documented failures from real systems deployed by real companies. The root cause is always the same: the model learned what the data taught it, and the data reflected a biased world.","bodyHtml":"<h2>Bias &#x26; Fairness</h2>\n<p>AI systems amplify the biases present in their training data. A hiring algorithm trained on a decade of hiring decisions learns not just what makes a good candidate but also which candidates the company historically preferred, including any discriminatory patterns. A loan approval model trained on historical approvals inherits redlining practices from decades past. A facial recognition system trained predominantly on light-skinned faces performs dramatically worse on dark-skinned faces.</p>\n<p>These are not hypothetical scenarios. They are documented failures from real systems deployed by real companies. 
The root cause is always the same: the model learned what the data taught it, and the data reflected a biased world.</p>","sourceUrl":"/ai-and-ml-applied/ethics-and-responsible-ai/bias-and-fairness","sourceAnchor":"overview","domain":"data-ai","accentColor":"#08d7d9"},{"id":"aa32983582af","subject":"ai-and-ml-applied","topic":"ethics-and-responsible-ai","subtopic":"bias-and-fairness","chunkType":"pitfall","body":"## Pitfalls — Bias & Fairness\n\n**Assuming removing protected attributes fixes bias**: Models learn proxy features. Removing \"gender\" doesn't help if \"played on women's lacrosse team\" is still in the data.\n\n**Optimizing for one fairness metric while ignoring others**: Achieving demographic parity can worsen calibration. Understand the trade-offs for your specific use case.\n\n**Treating fairness as a one-time check**: Bias can emerge over time as data distributions shift. Monitor fairness metrics continuously in production.\n\n**Ignoring intersectional bias**: A model might be fair for women overall and fair for Black applicants overall, but unfair for Black women specifically. Check subgroup intersections.\n\n**Using biased data to define \"ground truth\"**: If your labels come from a biased process (e.g., biased performance reviews), a \"fair\" model trained on those labels still encodes the original bias.\n\n**Believing AI is objective**: AI is as objective as the data and decisions that created it. The appearance of mathematical objectivity makes bias harder to detect, not less present.","bodyHtml":"<h2>Pitfalls — Bias &#x26; Fairness</h2>\n<p><strong>Assuming removing protected attributes fixes bias</strong>: Models learn proxy features. Removing \"gender\" doesn't help if \"played on women's lacrosse team\" is still in the data.</p>\n<p><strong>Optimizing for one fairness metric while ignoring others</strong>: Achieving demographic parity can worsen calibration. 
Understand the trade-offs for your specific use case.</p>\n<p><strong>Treating fairness as a one-time check</strong>: Bias can emerge over time as data distributions shift. Monitor fairness metrics continuously in production.</p>\n<p><strong>Ignoring intersectional bias</strong>: A model might be fair for women overall and fair for Black applicants overall, but unfair for Black women specifically. Check subgroup intersections.</p>\n<p><strong>Using biased data to define \"ground truth\"</strong>: If your labels come from a biased process (e.g., biased performance reviews), a \"fair\" model trained on those labels still encodes the original bias.</p>\n<p><strong>Believing AI is objective</strong>: AI is as objective as the data and decisions that created it. The appearance of mathematical objectivity makes bias harder to detect, not less present.</p>","sourceUrl":"/ai-and-ml-applied/ethics-and-responsible-ai/bias-and-fairness","sourceAnchor":"common-pitfalls","domain":"data-ai","accentColor":"#08d7d9"},{"id":"05ff6f8a090b","subject":"ai-and-ml-applied","topic":"ethics-and-responsible-ai","subtopic":"bias-and-fairness","chunkType":"takeaway","body":"## Key takeaways — Bias & Fairness\n\nAI amplifies biases in training data. A model trained on biased hiring decisions will make biased hiring recommendations.\n\nBias enters through training data, measurement processes, representation gaps, and feedback loops. All four must be addressed.\n\nThere are multiple definitions of fairness (demographic parity, equal opportunity, calibration), and they cannot all be satisfied simultaneously. Choose the definition that best fits your use case.\n\nFairness auditing should happen before deployment and continuously in production. Use quantitative metrics, not assumptions.\n\nMitigation strategies exist at every stage: fix the data, constrain the training, or adjust the outputs. Each approach has trade-offs.\n\nRemoving protected attributes does not remove bias. 
Models learn proxy features that correlate with the attributes you removed.","bodyHtml":"<h2>Key takeaways — Bias &#x26; Fairness</h2>\n<p>AI amplifies biases in training data. A model trained on biased hiring decisions will make biased hiring recommendations.</p>\n<p>Bias enters through training data, measurement processes, representation gaps, and feedback loops. All four must be addressed.</p>\n<p>There are multiple definitions of fairness (demographic parity, equal opportunity, calibration), and they cannot all be satisfied simultaneously. Choose the definition that best fits your use case.</p>\n<p>Fairness auditing should happen before deployment and continuously in production. Use quantitative metrics, not assumptions.</p>\n<p>Mitigation strategies exist at every stage: fix the data, constrain the training, or adjust the outputs. Each approach has trade-offs.</p>\n<p>Removing protected attributes does not remove bias. Models learn proxy features that correlate with the attributes you removed.</p>","sourceUrl":"/ai-and-ml-applied/ethics-and-responsible-ai/bias-and-fairness","sourceAnchor":"key-takeaways","domain":"data-ai","accentColor":"#08d7d9"},{"id":"ee0bf3cf32ed","subject":"ai-and-ml-applied","topic":"ethics-and-responsible-ai","subtopic":"transparency-and-explainability","chunkType":"section","body":"## Transparency & Explainability\n\nUsers deserve to know when they are interacting with AI. Stakeholders deserve to understand why the AI made a particular decision. Regulators increasingly require both. Transparency is about disclosure: telling people that AI is involved. Explainability is about understanding: answering \"why did the model produce this output?\"\n\nThese are not nice-to-have features. They are rapidly becoming legal requirements. The EU AI Act mandates transparency for high-risk AI systems. GDPR Article 22 gives individuals the right not to be subject to decisions based solely on automated processing. 
The practical question is not whether to build explainable AI, but how.","bodyHtml":"<h2>Transparency &#x26; Explainability</h2>\n<p>Users deserve to know when they are interacting with AI. Stakeholders deserve to understand why the AI made a particular decision. Regulators increasingly require both. Transparency is about disclosure: telling people that AI is involved. Explainability is about understanding: answering \"why did the model produce this output?\"</p>\n<p>These are not nice-to-have features. They are rapidly becoming legal requirements. The EU AI Act mandates transparency for high-risk AI systems. GDPR Article 22 gives individuals the right not to be subject to decisions based solely on automated processing. The practical question is not whether to build explainable AI, but how.</p>","sourceUrl":"/ai-and-ml-applied/ethics-and-responsible-ai/transparency-and-explainability","sourceAnchor":"overview","domain":"data-ai","accentColor":"#08d7d9"},{"id":"4d4e17ddf555","subject":"ai-and-ml-applied","topic":"ethics-and-responsible-ai","subtopic":"transparency-and-explainability","chunkType":"section","body":"## Real-World Example: Explainable Credit Scoring\n\nA fintech company builds an ML-based credit scoring system. Regulators require explanations for every denial.\n\n**Step 1: Model selection.** They choose a gradient-boosted tree model instead of a neural network, partly because tree models are easier to explain with SHAP.\n\n**Step 2: Feature transparency.** Every feature used by the model is documented with plain-language descriptions. \"DTI_ratio\" becomes \"your total monthly debt payments divided by your monthly income.\"\n\n**Step 3: Per-decision explanations.** When an application is denied, the system generates a letter listing the top 3 factors that contributed to the denial, with specific values and thresholds.\n\n**Step 4: Human review pathway.** Any applicant can request a human review. 
A trained analyst reviews the model's explanation, the raw data, and the applicant's context before making a final decision.\n\n**Step 5: Regular auditing.** Monthly fairness audits check that denial explanations are consistent across demographic groups. Quarterly reports are submitted to regulators.","bodyHtml":"<h2>Real-World Example: Explainable Credit Scoring</h2>\n<p>A fintech company builds an ML-based credit scoring system. Regulators require explanations for every denial.</p>\n<p><strong>Step 1: Model selection.</strong> They choose a gradient-boosted tree model instead of a neural network, partly because tree models are easier to explain with SHAP.</p>\n<p><strong>Step 2: Feature transparency.</strong> Every feature used by the model is documented with plain-language descriptions. \"DTI_ratio\" becomes \"your total monthly debt payments divided by your monthly income.\"</p>\n<p><strong>Step 3: Per-decision explanations.</strong> When an application is denied, the system generates a letter listing the top 3 factors that contributed to the denial, with specific values and thresholds.</p>\n<p><strong>Step 4: Human review pathway.</strong> Any applicant can request a human review. A trained analyst reviews the model's explanation, the raw data, and the applicant's context before making a final decision.</p>\n<p><strong>Step 5: Regular auditing.</strong> Monthly fairness audits check that denial explanations are consistent across demographic groups. 
Quarterly reports are submitted to regulators.</p>","sourceUrl":"/ai-and-ml-applied/ethics-and-responsible-ai/transparency-and-explainability","sourceAnchor":"real-world-example-explainable-credit-scoring","domain":"data-ai","accentColor":"#08d7d9"},{"id":"2ac8358205ea","subject":"ai-and-ml-applied","topic":"ethics-and-responsible-ai","subtopic":"transparency-and-explainability","chunkType":"pitfall","body":"## Pitfalls — Transparency & Explainability\n\n**Treating explanation as an afterthought**: If you build a black-box model first and try to explain it later, the explanations will be poor. Consider explainability from the start.\n\n**Confusing model confidence with explanation**: Saying \"the model is 87% confident\" is not an explanation. Users need to know which factors drove the decision, not just how sure the model is.\n\n**Over-relying on post-hoc explanations**: SHAP and LIME explain what the model did, not why the model is right. A biased model will have \"explanations\" that reflect its bias.\n\n**Providing explanations that nobody reads**: A 10-page technical report is not transparency. Match the explanation depth and format to the audience.\n\n**Assuming chain-of-thought equals reasoning**: When an LLM shows its \"reasoning,\" it is generating plausible-looking text, not revealing its actual computation. Use it for user-facing transparency, not for debugging.\n\n**Not providing a path to human review**: Explanations without recourse are theater. Users must be able to challenge automated decisions.","bodyHtml":"<h2>Pitfalls — Transparency &#x26; Explainability</h2>\n<p><strong>Treating explanation as an afterthought</strong>: If you build a black-box model first and try to explain it later, the explanations will be poor. Consider explainability from the start.</p>\n<p><strong>Confusing model confidence with explanation</strong>: Saying \"the model is 87% confident\" is not an explanation. 
Users need to know which factors drove the decision, not just how sure the model is.</p>\n<p><strong>Over-relying on post-hoc explanations</strong>: SHAP and LIME explain what the model did, not why the model is right. A biased model will have \"explanations\" that reflect its bias.</p>\n<p><strong>Providing explanations that nobody reads</strong>: A 10-page technical report is not transparency. Match the explanation depth and format to the audience.</p>\n<p><strong>Assuming chain-of-thought equals reasoning</strong>: When an LLM shows its \"reasoning,\" it is generating plausible-looking text, not revealing its actual computation. Use it for user-facing transparency, not for debugging.</p>\n<p><strong>Not providing a path to human review</strong>: Explanations without recourse are theater. Users must be able to challenge automated decisions.</p>","sourceUrl":"/ai-and-ml-applied/ethics-and-responsible-ai/transparency-and-explainability","sourceAnchor":"common-pitfalls","domain":"data-ai","accentColor":"#08d7d9"}]}