{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# B3 - Augmenting Model Reasoning: Extended Thinking and Prompt Caching\n",
    "\n",
    "Companion notebook for article **B3** in *Building with Claude - A Practitioner's Guide to the Anthropic API*.\n",
    "\n",
    "**Attribution.** Concepts adapted from Anthropic's \"Building with the Claude API\" course (Coursera) and public API documentation at [docs.anthropic.com](https://docs.anthropic.com). All code below is original work (c) 2026 DataMy. Not affiliated with Anthropic.\n",
    "\n",
    "---\n",
    "\n",
    "## What you'll build in this notebook\n",
    "\n",
    "Four working patterns, building toward the canonical \"operational copilot\" shape:\n",
    "\n",
    "1. **Extended thinking** -- a hard analytical question over the warehouse cost runbook, with `thinking` enabled. Inspect the thinking content alongside the final answer.\n",
    "2. **Prompt caching** -- call the same question twice; see `cache_creation_input_tokens` on the first call and `cache_read_input_tokens` on the second.\n",
    "3. **Caching break-even calculator** -- a small Python function that computes the cost of N calls with and without caching, given a prefix size and current pricing.\n",
    "4. **The two together** -- thinking-enabled call grounded in a cached runbook, with a volatile user question. The canonical production shape.\n",
    "\n",
    "**Prerequisites:**\n",
    "- `pip install -r ../requirements.txt`\n",
    "- A `.env` file with `ANTHROPIC_API_KEY` set\n",
    "- Datasets built by `python ../scripts/generate_data.py` (creates `runbook_warehouse_cost.md`)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Section 1 - Setup\n",
    "\n",
    "Standard import pattern from B1 onward: `ClaudeClient` from `llm_client.py`, data from `../data/`. See the **\"Project structure & file roles\"** section of the project [`README.md`](../README.md) for the full picture.\n",
    "\n",
    "Like B2, this notebook calls `client.messages.create()` directly in some places instead of the `complete()` helper, because `thinking` and `cache_control` need full control over the request shape."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from pathlib import Path\n",
    "\n",
    "from llm_client import ClaudeClient, estimate_cost_usd\n",
    "\n",
    "DATA_DIR     = Path(\"..\") / \"data\"\n",
    "RUNBOOK_PATH = DATA_DIR / \"runbook_warehouse_cost.md\"\n",
    "\n",
    "assert RUNBOOK_PATH.exists(), (\n",
    "    f\"Missing {RUNBOOK_PATH}. Run python ../scripts/generate_data.py\"\n",
    ")\n",
    "\n",
    "runbook = RUNBOOK_PATH.read_text()\n",
    "print(f\"Loaded runbook: {len(runbook):,} chars (rough estimate ~{len(runbook)//4:,} tokens)\")\n",
    "\n",
    "cc = ClaudeClient()\n",
    "print(\"Client ready. Default model:\", cc.default_model)\n",
    "\n",
    "ANALYST_PERSONA = \"\"\"\n",
    "<role>\n",
    "You are a Snowflake cost analyst supporting Acme SaaS Co's data platform team.\n",
    "</role>\n",
    "\n",
    "<constraints>\n",
    "- Only answer using the runbook provided in the system context. Never invent figures or procedures.\n",
    "- Cite the runbook section number whenever you reference content from it.\n",
    "- Quantify your answers with credit estimates where the runbook gives you the numbers.\n",
    "</constraints>\n",
    "\n",
    "<style>\n",
    "- Direct. Numbered steps when answering procedural questions.\n",
    "- For trade-off questions, surface the trade-off explicitly before recommending.\n",
    "</style>\n",
    "\"\"\".strip()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Section 2 - Extended thinking on a hard analytical question\n",
    "\n",
    "We ask a multi-step question: identify the highest-leverage cost change for next quarter, given the runbook. This is the kind of question where thinking measurably improves the answer -- there is no single right answer in the runbook; the model has to weigh several optimization patterns against the incident history.\n",
    "\n",
    "The response object will contain two content blocks: one of `type=\"thinking\"` (the scratchpad) and one of `type=\"text\"` (the final answer)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "hard_question = (\n",
    "    \"Across all the optimization patterns and incident examples in this runbook, \"\n",
    "    \"if you could implement only ONE change next quarter to reduce warehouse spend, \"\n",
    "    \"which would have the highest leverage and why? Be specific about expected credit impact.\"\n",
    ")\n",
    "\n",
    "thinking_resp = cc.client.messages.create(\n",
    "    model=cc.default_model,\n",
    "    max_tokens=4096,\n",
    "    thinking={\"type\": \"enabled\", \"budget_tokens\": 6000},\n",
    "    system=[{\"type\": \"text\", \"text\": ANALYST_PERSONA + \"\\n\\n---\\n\\n\" + runbook}],\n",
    "    messages=[{\"role\": \"user\", \"content\": hard_question}],\n",
    ")\n",
    "\n",
    "# Separate the thinking and final answer blocks\n",
    "for block in thinking_resp.content:\n",
    "    btype = getattr(block, \"type\", None)\n",
    "    if btype == \"thinking\":\n",
    "        print(\"=== THINKING (first 400 chars) ===\")\n",
    "        print(block.thinking[:400] + \" ...\\n\")\n",
    "    elif btype == \"text\":\n",
    "        print(\"=== FINAL ANSWER ===\")\n",
    "        print(block.text)\n",
    "\n",
    "print(\"\\n---\")\n",
    "u = thinking_resp.usage\n",
    "print(f\"Input tokens : {u.input_tokens:,}\")\n",
    "print(f\"Output tokens: {u.output_tokens:,} (includes thinking)\")\n",
    "print(f\"Est cost USD : ${estimate_cost_usd(cc.default_model, u):.6f}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Section 3 - Prompt caching: call twice, watch the cache work\n",
    "\n",
    "Same persona + runbook, two different questions, but the system block is identical between the calls. On call 1 you should see a non-zero `cache_creation_input_tokens` and zero `cache_read_input_tokens`. On call 2 the relationship flips.\n",
    "\n",
    "If you run this more than ~5 minutes apart, you'll see a second cache_creation -- that is the default 5-minute TTL expiring. The `ttl: \"1h\"` option (commented in the code below) extends this to one hour."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def call_with_cache(question: str):\n",
    "    \"\"\"Call Claude with the persona + runbook as a cached system prefix.\"\"\"\n",
    "    return cc.client.messages.create(\n",
    "        model=cc.default_model,\n",
    "        max_tokens=600,\n",
    "        temperature=0,\n",
    "        system=[\n",
    "            {\n",
    "                \"type\": \"text\",\n",
    "                \"text\": ANALYST_PERSONA,\n",
    "                \"cache_control\": {\"type\": \"ephemeral\"},\n",
    "            },\n",
    "            {\n",
    "                \"type\": \"text\",\n",
    "                \"text\": runbook,\n",
    "                \"cache_control\": {\"type\": \"ephemeral\"},\n",
    "                # For a 1-hour TTL instead of the default 5min, use:\n",
    "                # \"cache_control\": {\"type\": \"ephemeral\", \"ttl\": \"1h\"},\n",
    "            },\n",
    "        ],\n",
    "        messages=[{\"role\": \"user\", \"content\": question}],\n",
    "    )\n",
    "\n",
    "\n",
    "def show_cache_usage(label: str, resp):\n",
    "    u = resp.usage\n",
    "    creation = getattr(u, \"cache_creation_input_tokens\", 0) or 0\n",
    "    read     = getattr(u, \"cache_read_input_tokens\", 0) or 0\n",
    "    print(f\"--- {label} ---\")\n",
    "    print(f\"  uncached input tokens : {u.input_tokens:>7,}\")\n",
    "    print(f\"  cache_creation tokens : {creation:>7,}\")\n",
    "    print(f\"  cache_read tokens     : {read:>7,}\")\n",
    "    print(f\"  output tokens         : {u.output_tokens:>7,}\\n\")\n",
    "\n",
    "\n",
    "resp1 = call_with_cache(\"In section 3 of the runbook, what is the most frequent cause of cost spikes, and what is the recommended fix?\")\n",
    "show_cache_usage(\"Call 1 (first hit - should populate cache)\", resp1)\n",
    "\n",
    "resp2 = call_with_cache(\"Which incident in section 6 had the largest credit impact, and how was it detected?\")\n",
    "show_cache_usage(\"Call 2 (second hit - should read from cache)\", resp2)\n",
    "\n",
    "print(\"Call 1 answer (first 200 chars):\", resp1.content[0].text[:200].replace(\"\\n\", \" \"), \"...\")\n",
    "print()\n",
    "print(\"Call 2 answer (first 200 chars):\", resp2.content[0].text[:200].replace(\"\\n\", \" \"), \"...\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Section 4 - A caching break-even calculator\n",
    "\n",
    "Pure-Python helper. Plug in your prefix size, your expected calls per TTL window, and current pricing from [anthropic.com/pricing](https://www.anthropic.com/pricing). Returns the cost with and without caching plus the savings ratio.\n",
    "\n",
    "Defaults below use illustrative ratios (`cache_read = 0.1 * input`, `cache_write = 1.25 * input`). Replace with real per-token-million rates for an actual cost estimate."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from dataclasses import dataclass, field\n\n\n# Cache write multipliers relative to base input price (source: Anthropic pricing page).\n# Use the constant that matches the TTL you specified in cache_control.\nCACHE_WRITE_5M_MULTIPLIER = 1.25   # default ephemeral TTL (5 minutes)\nCACHE_WRITE_1H_MULTIPLIER = 2.00   # extended TTL (\"ttl\": \"1h\")\nCACHE_READ_MULTIPLIER     = 0.10   # same for both TTL tiers\n\n\n@dataclass\nclass CachingMath:\n    prefix_tokens: int            # size of cacheable prefix\n    calls_per_ttl_window: int     # how many calls hit the same prefix per TTL window\n    input_price_per_m: float      # USD per 1M input tokens\n    ttl: str = \"5m\"               # \"5m\" (default) or \"1h\" -- sets write multiplier\n    cache_read_multiplier: float  = CACHE_READ_MULTIPLIER\n\n    @property\n    def cache_write_multiplier(self) -> float:\n        \"\"\"Write multiplier derived from TTL; 1.25\u00d7 for 5-min, 2.00\u00d7 for 1-hour.\"\"\"\n        if self.ttl == \"1h\":\n            return CACHE_WRITE_1H_MULTIPLIER\n        return CACHE_WRITE_5M_MULTIPLIER\n\n    def cost_without_cache(self) -> float:\n        return (self.prefix_tokens * self.calls_per_ttl_window\n                * self.input_price_per_m / 1_000_000)\n\n    def cost_with_cache(self) -> float:\n        write = (self.prefix_tokens * self.input_price_per_m\n                 * self.cache_write_multiplier / 1_000_000)\n        reads = (self.prefix_tokens * max(self.calls_per_ttl_window - 1, 0)\n                 * self.input_price_per_m * self.cache_read_multiplier / 1_000_000)\n        return write + reads\n\n    def report(self) -> None:\n        no_cache = self.cost_without_cache()\n        cached   = self.cost_with_cache()\n        savings  = no_cache - cached\n        pct      = (savings / no_cache * 100) if no_cache else 0\n        print(f\"Prefix size       : {self.prefix_tokens:>10,} tokens\")\n        print(f\"Calls / TTL window: {self.calls_per_ttl_window:>10,}\")\n        print(f\"TTL               : {self.ttl}  (write multiplier: {self.cache_write_multiplier:.2f}x)\")\n        print(f\"Input price       : ${self.input_price_per_m:.2f} per 1M tokens\")\n        print()\n        print(f\"Without caching   : ${no_cache:.6f}\")\n        print(f\"With caching      : ${cached:.6f}\")\n        print(f\"Savings           : ${savings:.6f} ({pct:.1f}%)\")\n\n\n# --- 5-minute TTL (default ephemeral cache) ---\n# 5000-token runbook prefix, hit 20 times per 5-minute window.\nprint(\"=== 5-minute TTL ===\")\nCachingMath(prefix_tokens=5000, calls_per_ttl_window=20, input_price_per_m=3.00).report()\n\nprint(\"\\n---\\n\")\n# Heavier scenario: 30k system prompt, 200 calls per window.\nCachingMath(prefix_tokens=30_000, calls_per_ttl_window=200, input_price_per_m=3.00).report()\n\n# --- 1-hour TTL (extended cache) ---\n# Same prefix but the TTL window is now 1 hour, so more calls accumulate before\n# the cache expires.  Write cost is higher (2.00x), but read savings compound.\nprint(\"\\n=== 1-hour TTL ===\")\nCachingMath(\n    prefix_tokens=30_000,\n    calls_per_ttl_window=200,\n    input_price_per_m=3.00,\n    ttl=\"1h\",\n).report()\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Section 5 - The two together: thinking over cached context\n",
    "\n",
    "The canonical \"operational copilot\" shape from the article. The runbook (long, stable) is cached. Thinking is enabled because the question requires real analysis. The user message is the only volatile part of the call.\n",
    "\n",
    "In a real deployment, you would call this function dozens of times an hour with different user questions. The cached prefix is reused, the thinking budget is paid per call, and the answer quality justifies both."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def operational_copilot(question: str, thinking_budget: int = 6000):\n",
    "    \"\"\"Cached runbook + extended thinking, the canonical production shape.\"\"\"\n",
    "    return cc.client.messages.create(\n",
    "        model=cc.default_model,\n",
    "        max_tokens=4096,\n",
    "        thinking={\"type\": \"enabled\", \"budget_tokens\": thinking_budget},\n",
    "        system=[\n",
    "            {\n",
    "                \"type\": \"text\",\n",
    "                \"text\": ANALYST_PERSONA,\n",
    "                \"cache_control\": {\"type\": \"ephemeral\"},\n",
    "            },\n",
    "            {\n",
    "                \"type\": \"text\",\n",
    "                \"text\": runbook,\n",
    "                \"cache_control\": {\"type\": \"ephemeral\"},\n",
    "            },\n",
    "        ],\n",
    "        messages=[{\"role\": \"user\", \"content\": question}],\n",
    "    )\n",
    "\n",
    "\n",
    "complex_q = (\n",
    "    \"We're seeing a 22% week-over-week climb in WH_BI_M credits, no obvious single \"\n",
    "    \"spike. Walk me through the diagnosis playbook from section 4, identifying which \"\n",
    "    \"of the section 3 causes most likely fits this symptom pattern, and what query I should run first.\"\n",
    ")\n",
    "\n",
    "resp = operational_copilot(complex_q)\n",
    "\n",
    "for block in resp.content:\n",
    "    btype = getattr(block, \"type\", None)\n",
    "    if btype == \"text\":\n",
    "        print(block.text)\n",
    "\n",
    "print(\"\\n---\")\n",
    "show_cache_usage(\"Operational copilot call\", resp)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Note: calls made directly via cc.client.messages.create() (Sections 2 and 3\n# above) are NOT recorded in cc.records. print_summary() reflects only the calls\n# routed through cc.complete() or cc.stream().\ncc.print_summary()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Section 6 - Practitioner Lab\n",
    "\n",
    "Open-ended extension. No reference solution.\n",
    "\n",
    "**Goal:** build a `cache_hit_rate` measurement helper that, given a list of calls, reports the percentage of input tokens served from cache versus paid at full input price.\n",
    "\n",
    "**Constraints:**\n",
    "1. Accept a sequence of response objects (or `usage` objects) and aggregate the three relevant token counters: `input_tokens`, `cache_creation_input_tokens`, `cache_read_input_tokens`.\n",
    "2. Report:\n",
    "   - Cache hit rate as a percentage of total input tokens.\n",
    "   - Estimated dollar savings versus a hypothetical \"no cache\" baseline.\n",
    "3. Flag any call where `cache_creation_input_tokens > 0` after the first call -- that means the cache expired or the prefix drifted, and you want to know about it.\n",
    "\n",
    "**Stretch:** wire this into your `ClaudeClient` wrapper (`llm_client.py`) so every call automatically updates rolling cache-hit metrics. A line of operational observability that pays for itself the first time something breaks.\n",
    "\n",
    "Why this matters: caching looks correct in development and silently breaks in production when someone injects a timestamp or user-specific token into the system prompt above the cache marker. A simple cache-hit-rate gauge is the early warning system that catches this within minutes rather than weeks.\n",
    "\n",
    "---\n",
    "\n",
    "*Companion article: B3 - Augmenting Model Reasoning: Extended Thinking and Prompt Caching.*\n",
    "*Next notebook: B4_rag_essentials.ipynb*"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "name": "python",
   "version": "3.11"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}