{
 "nbformat": 4,
 "nbformat_minor": 5,
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "name": "python",
   "version": "3.11.0"
  }
 },
 "cells": [
  {
   "cell_type": "markdown",
   "id": "cell-0",
   "metadata": {},
   "source": [
    "# B5 - RAG Advanced: Reranking and Contextual Retrieval\n",
    "\n",
    "Companion notebook for article **B5** in *Building with Claude - A Practitioner's Guide to the Anthropic API*.\n",
    "\n",
    "**Attribution.** Concepts adapted from Anthropic's \"Building with the Claude API\" course (Coursera), public API documentation at [docs.anthropic.com](https://docs.anthropic.com), and the Anthropic blog post \"Contextual Retrieval\" (November 2024). All code below is original work (c) 2026 DataMy. Not affiliated with Anthropic.\n",
    "\n",
    "---\n",
    "\n",
    "## What you'll build in this notebook\n",
    "\n",
    "Two upgrades on top of the B4 hybrid RAG pipeline:\n",
    "\n",
    "1. **Reranking** -- retrieve 20 candidates with hybrid search, then re-score them with VoyageAI's cross-encoder reranker to get the true top-5.\n",
    "2. **Contextual retrieval** -- for each chunk, ask Claude to write a short context paragraph that situates the chunk in its document, then prepend that context before embedding. Uses prompt caching to make the per-chunk document-read nearly free after the first chunk.\n",
    "3. **Full advanced pipeline** -- contextual chunks + hybrid search + reranking combined.\n",
    "4. **Before/after comparison** -- run the same questions through the B4 baseline and the advanced pipeline; compare retrieved chunks and answer quality.\n",
    "\n",
    "**Prerequisites:**\n",
    "- `pip install -r ../requirements.txt`\n",
    "- A `.env` file with `ANTHROPIC_API_KEY` and `VOYAGE_API_KEY` set\n",
    "- Datasets built by `python ../scripts/generate_data.py` (same corpus as B4)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-1",
   "metadata": {},
   "source": [
    "## Section 1 - Setup\n",
    "\n",
    "This notebook is self-contained: it re-builds the B4 chunking, embedding, and BM25 infrastructure\n",
    "from scratch rather than importing from the B4 notebook. This keeps the notebook runnable on its own.\n",
    "\n",
    "The setup section embeds the baseline (non-contextual) chunks. Section 4 adds the contextual layer\n",
    "and re-embeds. The two embedding calls together are the main API cost of running this notebook."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-2",
   "metadata": {},
   "outputs": [],
   "source": [
    "import math\nimport re\nfrom pathlib import Path\n\nimport voyageai\nfrom dotenv import load_dotenv\nfrom rank_bm25 import BM25Okapi\n\nfrom llm_client import ClaudeClient\n\nload_dotenv(\"../.env\")\n\nDATA_DIR = Path(\"..\") / \"data\"\nCORPUS_PATHS = {\n    \"warehouse_runbook\": DATA_DIR / \"runbook_warehouse_cost.md\",\n    \"quality_runbook\":   DATA_DIR / \"runbook_data_quality.md\",\n    \"qbr_q3_2025\":       DATA_DIR / \"qbr_q3_2025.md\",\n}\nfor path in CORPUS_PATHS.values():\n    assert path.exists(), f\"Missing: {path}. Run python ../scripts/generate_data.py\"\n\nCORPUS = {name: path.read_text() for name, path in CORPUS_PATHS.items()}\n\nvc = voyageai.Client()\ncc = ClaudeClient()\n\n\n# -- Chunking (same as B4) --------------------------------------------------\ndef chunk_by_section(text: str, max_words: int = 600) -> list[str]:\n    raw_sections = re.split(r\"\\n(?=## )\", text)\n    chunks = []\n    for section in raw_sections:\n        words = section.split()\n        if len(words) <= max_words:\n            chunks.append(section)\n        else:\n            paras = section.split(\"\\n\\n\")\n            current, current_words = [], 0\n            for para in paras:\n                pw = len(para.split())\n                if current_words + pw > max_words and current:\n                    chunks.append(\"\\n\\n\".join(current))\n                    current, current_words = [], 0\n                current.append(para)\n                current_words += pw\n            if current:\n                chunks.append(\"\\n\\n\".join(current))\n    return [c for c in chunks if c.strip()]\n\n\n# Build baseline chunk list\nbase_chunks: list[dict] = []\nfor doc_name, doc_text in CORPUS.items():\n    for chunk_text in chunk_by_section(doc_text):\n        base_chunks.append({\"source\": doc_name, \"text\": chunk_text})\n\nprint(f\"Baseline corpus: {len(base_chunks)} chunks across {len(CORPUS)} documents\")\n\n\n# -- Baseline embeddings + BM25 (same as B4) --------------------------------\nprint(\"Embedding baseline chunks ...\")\nbase_emb_result = vc.embed(\n    [c[\"text\"] for c in base_chunks],\n    model=\"voyage-3\",\n    input_type=\"document\",\n)\nfor i, emb in enumerate(base_emb_result.embeddings):\n    base_chunks[i][\"embedding\"] = emb\n\nbase_bm25 = BM25Okapi([c[\"text\"].lower().split() for c in base_chunks])\nprint(f\"Baseline index ready. Dimension: {len(base_emb_result.embeddings[0])}\")\n\n\n# -- Shared retrieval helpers -----------------------------------------------\ndef cosine_sim(a: list[float], b: list[float]) -> float:\n    dot   = sum(x * y for x, y in zip(a, b))\n    norm_a = math.sqrt(sum(x * x for x in a))\n    norm_b = math.sqrt(sum(x * x for x in b))\n    return dot / (norm_a * norm_b + 1e-10)\n\n\ndef embed_query(query: str) -> list[float]:\n    return vc.embed([query], model=\"voyage-3\", input_type=\"query\").embeddings[0]\n\n\ndef hybrid_search(\n    query: str,\n    chunks: list[dict],\n    bm25_index: BM25Okapi,\n    k: int = 5,\n    fetch: int = 20,\n) -> list[tuple[dict, float]]:\n    \"\"\"RRF hybrid search over a given chunk list and BM25 index.\"\"\"\n    q_emb = embed_query(query)\n    vec_scored = sorted(\n        [(cosine_sim(q_emb, c[\"embedding\"]), i) for i, c in enumerate(chunks)],\n        reverse=True,\n    )[:fetch]\n    bm25_scores = bm25_index.get_scores(query.lower().split())\n    bm25_top = sorted(range(len(bm25_scores)), key=lambda i: bm25_scores[i], reverse=True)[:fetch]\n\n    vec_ranks  = {i: r + 1 for r, (_, i) in enumerate(vec_scored)}\n    bm25_ranks = {i: r + 1 for r, i in enumerate(bm25_top)}\n\n    candidates = set(vec_ranks) | set(bm25_ranks)\n    rrf: dict[int, float] = {}\n    for idx in candidates:\n        rrf[idx] = (\n            (1.0 / (60 + vec_ranks[idx])  if idx in vec_ranks  else 0.0) +\n            (1.0 / (60 + bm25_ranks[idx]) if idx in bm25_ranks else 0.0)\n        )\n    top = sorted(rrf, key=lambda i: rrf[i], reverse=True)[:k]\n    return [(chunks[i], rrf[i]) for i in top]\n\n\nprint(\"\\nAll baseline helpers ready.\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-3",
   "metadata": {},
   "source": [
    "## Section 2 - Reranking with VoyageAI\n",
    "\n",
    "The rerank API takes a query and a list of candidate documents, and returns them ordered by\n",
    "relevance as scored by a cross-encoder model. Unlike cosine similarity, the cross-encoder reads\n",
    "the query and each document jointly -- it can detect specific token-level interactions that the\n",
    "bi-encoder embedding missed.\n",
    "\n",
    "Pattern: fetch a wider candidate set (k=20) with fast hybrid search, then rerank to top-5.\n",
    "The reranker only processes 20 candidates, not the full corpus -- so the cost and latency are low.\n",
    "\n",
    "Note: `relevance_score` values are not bounded and not comparable across queries. Use only the\n",
    "rank order, not the score magnitude."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-4",
   "metadata": {},
   "outputs": [],
   "source": [
    "def reranked_search(\n    query: str,\n    chunks: list[dict],\n    bm25_index: BM25Okapi,\n    k: int = 5,\n    fetch: int = 20,\n) -> list[tuple[dict, float]]:\n    \"\"\"Hybrid retrieval followed by cross-encoder reranking.\"\"\"\n    results = hybrid_search(query, chunks, bm25_index, k=fetch, fetch=fetch)\n    if not results:\n        return []\n    candidates, _ = zip(*results)\n    candidates = list(candidates)\n\n    result = vc.rerank(\n        query=query,\n        documents=[c[\"text\"] for c in candidates],\n        model=\"rerank-2\",\n        top_k=k,\n    )\n    return [(candidates[r.index], r.relevance_score) for r in result.results]\n\n\n# Demo: a query where reranking sharpens the ranking\ndemo_q = \"What is the step-by-step procedure to identify which warehouse is causing a cost spike?\"\n\nprint(f\"Query: {demo_q}\\n\")\n\nprint(\"--- Hybrid only (top 5) ---\")\nfor rank, (c, score) in enumerate(hybrid_search(demo_q, base_chunks, base_bm25, k=5), 1):\n    print(f\"  {rank}. [{c['source']}] rrf={score:.4f}  {c['text'][:80].replace(chr(10),' ')} ...\")\n\nprint()\nprint(\"--- Hybrid + rerank (top 5) ---\")\nfor rank, (c, score) in enumerate(reranked_search(demo_q, base_chunks, base_bm25, k=5), 1):\n    print(f\"  {rank}. [{c['source']}] rel={score:.4f}  {c['text'][:80].replace(chr(10),' ')} ...\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-5",
   "metadata": {},
   "source": [
    "## Section 3 - Contextual retrieval: generating chunk context\n",
    "\n",
    "For each chunk, we ask Claude to write a 2-3 sentence context paragraph that situates the chunk\n",
    "within its source document. This context is prepended to the chunk before embedding.\n",
    "\n",
    "**Prompt caching is essential here.** The full document is identical for every chunk from that\n",
    "document. We place the document in a cached system block. The first chunk from each document\n",
    "writes the document to cache; all subsequent chunks read it at ~10% of the full input cost.\n",
    "\n",
    "This cell demonstrates context generation for 3 sample chunks before we run it over the full corpus."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-6",
   "metadata": {},
   "outputs": [],
   "source": [
    "CONTEXT_INSTRUCTION = (\n",
    "    \"Here is a chunk from this document:\\n\\n\"\n",
    "    \"<chunk>\\n{chunk_text}\\n</chunk>\\n\\n\"\n",
    "    \"Write a short 2-3 sentence context that situates this chunk within the overall document. \"\n",
    "    \"Include the document type, the section it belongs to, and key terms a search query might \"\n",
    "    \"use to find this content. Answer with only the context paragraph, nothing else.\"\n",
    ")\n",
    "\n",
    "\n",
    "def generate_chunk_context(document_text: str, chunk_text: str) -> str:\n",
    "    \"\"\"Ask Claude to write a context paragraph for a chunk.\n",
    "\n",
    "    The document is sent as a cached system block so all chunks from the same\n",
    "    document share the cache entry after the first call.\n",
    "    \"\"\"\n",
    "    resp = cc.client.messages.create(\n",
    "        model=cc.default_model,\n",
    "        max_tokens=150,\n",
    "        temperature=0,\n",
    "        system=[\n",
    "            {\n",
    "                \"type\": \"text\",\n",
    "                \"text\": f\"<document>\\n{document_text}\\n</document>\",\n",
    "                \"cache_control\": {\"type\": \"ephemeral\"},\n",
    "            }\n",
    "        ],\n",
    "        messages=[{\n",
    "            \"role\": \"user\",\n",
    "            \"content\": CONTEXT_INSTRUCTION.format(chunk_text=chunk_text),\n",
    "        }],\n",
    "    )\n",
    "    return resp.content[0].text.strip()\n",
    "\n",
    "\n",
    "# Demo: generate context for 3 sample chunks (one per document)\n",
    "sample_chunks = []\n",
    "seen_docs = set()\n",
    "for c in base_chunks:\n",
    "    if c[\"source\"] not in seen_docs:\n",
    "        sample_chunks.append(c)\n",
    "        seen_docs.add(c[\"source\"])\n",
    "    if len(sample_chunks) == 3:\n",
    "        break\n",
    "\n",
    "print(\"Generating context for 3 sample chunks ...\\n\")\n",
    "for c in sample_chunks:\n",
    "    doc_text = CORPUS[c[\"source\"]]\n",
    "    ctx = generate_chunk_context(doc_text, c[\"text\"])\n",
    "    print(f\"[{c['source']}]\")\n",
    "    print(f\"CHUNK  (first 120 chars): {c['text'][:120].replace(chr(10),' ')} ...\")\n",
    "    print(f\"CONTEXT: {ctx}\")\n",
    "    print()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-7",
   "metadata": {},
   "source": [
    "## Section 4 - Building the contextual index\n",
    "\n",
    "Generate context for all chunks, prepend context to each chunk, re-embed, and rebuild BM25.\n",
    "\n",
    "This is the indexing cost you pay once. Progress is printed per document so you can see the\n",
    "cache behaviour: the first chunk from each document shows `cache_creation_input_tokens` > 0;\n",
    "subsequent chunks from the same document show `cache_read_input_tokens` > 0 instead.\n",
    "\n",
    "For a corpus of ~30 chunks across 3 documents, expect ~30 Claude calls. With caching the total\n",
    "token cost is approximately equivalent to 3 full document reads (one write + ~9 cheap reads per\n",
    "document) plus 30 short output completions."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-8",
   "metadata": {},
   "outputs": [],
   "source": [
    "def build_contextual_chunks(\n    base_chunks: list[dict],\n    corpus: dict[str, str],\n    *,\n    max_chunks: int | None = None,\n) -> list[dict]:\n    \"\"\"Generate context for every chunk and return a new contextual chunk list.\n\n    max_chunks: if set, only the first N chunks are processed. Useful for a quick\n    preview before committing to the full (costly) indexing pass.\n    \"\"\"\n    if max_chunks is not None:\n        base_chunks = base_chunks[:max_chunks]\n    contextual: list[dict] = []\n    cache_stats: dict[str, dict] = {doc: {\"writes\": 0, \"reads\": 0} for doc in corpus}\n\n    for i, c in enumerate(base_chunks):\n        doc_text = corpus[c[\"source\"]]\n        resp = cc.client.messages.create(\n            model=cc.default_model,\n            max_tokens=150,\n            temperature=0,\n            system=[\n                {\n                    \"type\": \"text\",\n                    \"text\": f\"<document>\\n{doc_text}\\n</document>\",\n                    \"cache_control\": {\"type\": \"ephemeral\"},\n                }\n            ],\n            messages=[{\n                \"role\": \"user\",\n                \"content\": CONTEXT_INSTRUCTION.format(chunk_text=c[\"text\"]),\n            }],\n        )\n        ctx_text = resp.content[0].text.strip()\n        contextual.append({\n            \"source\":   c[\"source\"],\n            \"text\":     ctx_text + \"\\n\\n\" + c[\"text\"],   # enriched chunk\n            \"raw_text\": c[\"text\"],                        # original chunk (for generation)\n        })\n\n        u = resp.usage\n        writes = getattr(u, \"cache_creation_input_tokens\", 0) or 0\n        reads  = getattr(u, \"cache_read_input_tokens\", 0) or 0\n        if writes:\n            cache_stats[c[\"source\"]][\"writes\"] += 1\n        if reads:\n            cache_stats[c[\"source\"]][\"reads\"] += 1\n\n        print(f\"  [{i+1:2d}/{len(base_chunks)}] {c['source']:25s}  \"\n              f\"cache_write={writes:>5,}  cache_read={reads:>5,}  \"\n              f\"out={u.output_tokens}\")\n\n    print(\"\\nCache summary per document:\")\n    for doc, stats in cache_stats.items():\n        print(f\"  {doc:25s}  writes={stats['writes']}  reads={stats['reads']}\")\n    return contextual\n\n\n# Set to None to index all chunks (costly: one Claude call per chunk).\nMAX_CONTEXT_CHUNKS = 3\n\nprint(\"Building contextual chunk index (one Claude call per chunk) ...\\n\")\nctx_chunks = build_contextual_chunks(base_chunks, CORPUS, max_chunks=MAX_CONTEXT_CHUNKS)\n\n# Embed contextual chunks\nprint(\"\\nEmbedding contextual chunks ...\")\nctx_emb_result = vc.embed(\n    [c[\"text\"] for c in ctx_chunks],\n    model=\"voyage-3\",\n    input_type=\"document\",\n)\nfor i, emb in enumerate(ctx_emb_result.embeddings):\n    ctx_chunks[i][\"embedding\"] = emb\n\n# Build contextual BM25\nctx_bm25 = BM25Okapi([c[\"text\"].lower().split() for c in ctx_chunks])\nprint(f\"Contextual index ready. {len(ctx_chunks)} chunks embedded.\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-9",
   "metadata": {},
   "source": [
    "## Section 5 - Full advanced pipeline\n",
    "\n",
    "The complete pipeline: contextual chunks -> hybrid retrieval -> reranking -> generation.\n",
    "\n",
    "One implementation detail: the context paragraph helps retrieval, but we pass only the original\n",
    "chunk text (`raw_text`) to the generation call. This avoids the model echoing the meta-commentary\n",
    "(\"this chunk is from the diagnosis playbook section...\") in its answer."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-10",
   "metadata": {},
   "outputs": [],
   "source": [
    "RAG_SYSTEM = (\n",
    "    \"You are a data platform assistant for Acme SaaS Co. \"\n",
    "    \"Answer questions using ONLY the context provided. \"\n",
    "    \"Cite the source document name when referencing specific facts. \"\n",
    "    \"If the answer is not in the context, say so explicitly.\"\n",
    ")\n",
    "\n",
    "\n",
    "def advanced_rag_answer(question: str, k: int = 4) -> tuple[str, list]:\n",
    "    \"\"\"Full pipeline: contextual hybrid search + reranking + generation.\"\"\"\n",
    "    candidates, _ = zip(\n",
    "        *hybrid_search(question, ctx_chunks, ctx_bm25, k=20, fetch=20)\n",
    "    )\n",
    "    candidates = list(candidates)\n",
    "\n",
    "    rerank_result = vc.rerank(\n",
    "        query=question,\n",
    "        documents=[c[\"text\"] for c in candidates],\n",
    "        model=\"rerank-2\",\n",
    "        top_k=k,\n",
    "    )\n",
    "    top_k = [(candidates[r.index], r.relevance_score) for r in rerank_result.results]\n",
    "\n",
    "    # Use raw_text for generation; enriched text was only for retrieval\n",
    "    context = \"\\n\\n---\\n\\n\".join(\n",
    "        f\"[Source: {c['source']}]\\n{c.get('raw_text', c['text'])}\"\n",
    "        for c, _ in top_k\n",
    "    )\n",
    "\n",
    "    resp = cc.client.messages.create(\n",
    "        model=cc.default_model,\n",
    "        max_tokens=800,\n",
    "        temperature=0,\n",
    "        system=RAG_SYSTEM,\n",
    "        messages=[{\n",
    "            \"role\": \"user\",\n",
    "            \"content\": f\"Context:\\n\\n{context}\\n\\nQuestion: {question}\",\n",
    "        }],\n",
    "    )\n",
    "    return resp.content[0].text, top_k\n",
    "\n",
    "\n",
    "# Quick smoke test\n",
    "test_q = \"What are the recommended auto-suspend settings for each warehouse type?\"\n",
    "answer, sources = advanced_rag_answer(test_q)\n",
    "print(f\"Q: {test_q}\")\n",
    "print(f\"\\nA: {answer[:400]} ...\")\n",
    "print(\"\\nRetrieved from:\")\n",
    "for c, score in sources:\n",
    "    print(f\"  [{c['source']}] rel={score:.4f}  {c.get('raw_text', c['text'])[:60].replace(chr(10),' ')} ...\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-11",
   "metadata": {},
   "source": [
    "## Section 6 - Before/after comparison\n",
    "\n",
    "Three questions, two pipelines. For each question we print:\n",
    "- Which chunk ranked first in each pipeline\n",
    "- The first 300 characters of each generated answer\n",
    "\n",
    "The baseline uses the B4 pipeline (hybrid search, no reranking, no contextual enrichment).\n",
    "The advanced pipeline uses contextual hybrid + reranking.\n",
    "\n",
    "Expected pattern:\n",
    "- Named-entity queries (dates, incident IDs): similar results -- BM25 already handles these.\n",
    "- Specific procedural questions: advanced pipeline retrieves more precisely because the context\n",
    "  paragraph added the section name that the chunk text itself was missing.\n",
    "- Paraphrased semantic questions: advanced pipeline often matches or improves slightly."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-12",
   "metadata": {},
   "outputs": [],
   "source": [
    "def baseline_rag_answer(question: str, k: int = 4) -> tuple[str, list]:\n",
    "    \"\"\"B4-style pipeline: hybrid search only, no reranking, no contextual enrichment.\"\"\"\n",
    "    retrieved = hybrid_search(question, base_chunks, base_bm25, k=k)\n",
    "    context = \"\\n\\n---\\n\\n\".join(\n",
    "        f\"[Source: {c['source']}]\\n{c['text']}\" for c, _ in retrieved\n",
    "    )\n",
    "    resp = cc.client.messages.create(\n",
    "        model=cc.default_model,\n",
    "        max_tokens=800,\n",
    "        temperature=0,\n",
    "        system=RAG_SYSTEM,\n",
    "        messages=[{\"role\": \"user\", \"content\": f\"Context:\\n\\n{context}\\n\\nQuestion: {question}\"}],\n",
    "    )\n",
    "    return resp.content[0].text, retrieved\n",
    "\n",
    "\n",
    "comparison_questions = [\n",
    "    \"Walk me through every step of the cost spike diagnosis playbook.\",\n",
    "    \"What was the credit impact of the 2025-05-18 embedded dashboard incident?\",\n",
    "    \"How do I tell if a dbt model has been accidentally converted from incremental to full table?\",\n",
    "]\n",
    "\n",
    "for q in comparison_questions:\n",
    "    base_ans, base_srcs = baseline_rag_answer(q)\n",
    "    adv_ans,  adv_srcs  = advanced_rag_answer(q)\n",
    "\n",
    "    print(\"=\" * 72)\n",
    "    print(f\"Q: {q}\")\n",
    "    print()\n",
    "    print(f\"BASELINE top chunk : [{base_srcs[0][0]['source']}]  \"\n",
    "          f\"{base_srcs[0][0]['text'][:80].replace(chr(10),' ')} ...\")\n",
    "    print(f\"ADVANCED top chunk : [{adv_srcs[0][0]['source']}]  \"\n",
    "          f\"{adv_srcs[0][0].get('raw_text', adv_srcs[0][0]['text'])[:80].replace(chr(10),' ')} ...\")\n",
    "    print()\n",
    "    print(f\"BASELINE answer: {base_ans[:280].rstrip()} ...\")\n",
    "    print()\n",
    "    print(f\"ADVANCED answer: {adv_ans[:280].rstrip()} ...\")\n",
    "    print()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-13",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Note: this notebook routes model calls through cc.client.messages.create() directly\n# (for cache_control support), not through cc.complete(). cc.records will be empty;\n# per-call usage is printed inline by build_contextual_chunks and advanced_rag_answer.\ncc.print_summary()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-14",
   "metadata": {},
   "source": [
    "## Section 7 - Practitioner Lab\n",
    "\n",
    "Open-ended extension. No reference solution.\n",
    "\n",
    "**Goal:** build a simple retrieval evaluation harness.\n",
    "\n",
    "**Setup:** create a list of 8-10 question/expected-source pairs for the Acme corpus. Each pair\n",
    "specifies a question and the `source` document name (and optionally a keyword that should appear\n",
    "in the top-1 retrieved chunk's text).\n",
    "\n",
    "```python\n",
    "EVAL_SET = [\n",
    "    {\n",
    "        \"question\": \"What is Step 2 of the cost spike diagnosis playbook?\",\n",
    "        \"expected_source\": \"warehouse_runbook\",\n",
    "        \"expected_keyword\": \"time window\",\n",
    "    },\n",
    "    {\n",
    "        \"question\": \"What dbt test catches duplicate primary keys?\",\n",
    "        \"expected_source\": \"quality_runbook\",\n",
    "        \"expected_keyword\": \"unique\",\n",
    "    },\n",
    "    # ... add more\n",
    "]\n",
    "```\n",
    "\n",
    "**Task:** write an `evaluate(pipeline_fn, eval_set, k=5)` function that:\n",
    "1. For each eval pair, calls the pipeline function with `question`.\n",
    "2. Checks whether `expected_source` appears in the top-k retrieved sources (recall@k).\n",
    "3. Optionally checks whether `expected_keyword` appears in the top-1 chunk text.\n",
    "4. Reports: recall@1, recall@3, recall@5, and keyword hit rate.\n",
    "\n",
    "**Run it on both pipelines** and report the numbers.\n",
    "\n",
    "```python\n",
    "print(\"Baseline:\")\n",
    "evaluate(lambda q: baseline_rag_answer(q)[1], EVAL_SET)\n",
    "\n",
    "print(\"Advanced:\")\n",
    "evaluate(lambda q: advanced_rag_answer(q)[1], EVAL_SET)\n",
    "```\n",
    "\n",
    "**Stretch:** add a third metric -- faithfulness -- by asking Claude to judge whether each generated\n",
    "answer is supported by the retrieved chunks (a simple LLM-as-judge pattern). This is a standard\n",
    "component of production RAG evaluation frameworks (RAGAS, TruLens, etc.).\n",
    "\n",
    "Why this matters: retrieval quality is invisible without measurement. Developers routinely ship\n",
    "improvements that feel better in manual testing but are neutral or negative in aggregate. A 10-pair\n",
    "eval set takes 30 minutes to write and turns subjective impressions into numbers you can track\n",
    "across pipeline changes.\n",
    "\n",
    "---\n",
    "\n",
    "*Companion article: B5 - RAG Advanced: Reranking and Contextual Retrieval.*\n",
    "*Next notebook: C1_builtin_tools.ipynb*"
   ]
  }
 ]
}