{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# B2 - Multi-Modal Inputs: Feeding Claude More Than Text\n",
    "\n",
    "Companion notebook for article **B2** in *Building with Claude - A Practitioner's Guide to the Anthropic API*.\n",
    "\n",
    "**Attribution.** Concepts adapted from Anthropic's \"Building with the Claude API\" course (Coursera) and public API documentation at [docs.anthropic.com](https://docs.anthropic.com). All code below is original work (c) 2026 DataMy. Not affiliated with Anthropic.\n",
    "\n",
    "---\n",
    "\n",
    "## What you'll build in this notebook\n",
    "\n",
    "Four working multi-modal call patterns:\n",
    "\n",
    "1. **Image input** -- feed Claude a BI dashboard PNG and ask for trend interpretation.\n",
    "2. **PDF input** -- feed Claude a quarterly business review PDF and ask narrow questions.\n",
    "3. **Chart vs CSV trade-off** -- ask the same question of the dashboard image and the underlying CSV; compare cost and precision.\n",
    "4. **Convert-first pattern** -- one helper that routes a file path to the right Claude content block by file extension. This is the function that makes the rest of your codebase modality-agnostic.\n",
    "\n",
    "**Prerequisites:**\n",
    "- `pip install -r ../requirements.txt`\n",
    "- A `.env` file with `ANTHROPIC_API_KEY` set\n",
    "- Datasets built by `python ../scripts/generate_data.py` (creates the dashboard PNG, QBR Markdown, and QBR PDF)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Section 1 - Setup\n",
    "\n",
    "### What this notebook imports, and where it comes from\n",
    "\n",
    "Same architectural rule as B1: import `ClaudeClient` from `llm_client.py`, read data files from `../data/`. See the **\"Project structure & file roles\"** section of the project [`README.md`](../README.md) for the full picture.\n",
    "\n",
    "This notebook calls `ClaudeClient.client.messages.create()` directly in some places instead of the `complete()` helper, because multi-modal content requires a list of content blocks rather than a plain string prompt. The helper would hide what we are trying to show. We still benefit from the shared cost/token logging because every successful call's `usage` is on the response object."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import base64\n",
    "from pathlib import Path\n",
    "\n",
    "from llm_client import ClaudeClient, estimate_cost_usd\n",
    "\n",
    "DATA_DIR  = Path(\"..\") / \"data\"\n",
    "PNG_PATH  = DATA_DIR / \"dashboard_screenshot.png\"\n",
    "PDF_PATH  = DATA_DIR / \"qbr_q3_2025.pdf\"\n",
    "MD_PATH   = DATA_DIR / \"qbr_q3_2025.md\"\n",
    "CSV_PATH  = DATA_DIR / \"saas_metrics.csv\"\n",
    "\n",
    "for p in (PNG_PATH, PDF_PATH, MD_PATH, CSV_PATH):\n",
    "    assert p.exists(), f\"Missing: {p}. Run python ../scripts/generate_data.py\"\n",
    "\n",
    "cc = ClaudeClient()\n",
    "print(\"Client ready. Default model:\", cc.default_model)\n",
    "print(\"Files:\")\n",
    "for p in (PNG_PATH, PDF_PATH, MD_PATH, CSV_PATH):\n",
    "    print(f\"  {p.name:30s}  {p.stat().st_size:>8,} bytes\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Section 2 - Image input: read a BI dashboard\n",
    "\n",
    "Pass the PNG inline as base64. The user message is a list of content blocks: image first, then a text instruction. Notice we read the raw `usage` object off the response to log input tokens for the image."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def encode_b64(path: Path) -> str:\n",
    "    return base64.standard_b64encode(path.read_bytes()).decode(\"utf-8\")\n",
    "\n",
    "image_b64 = encode_b64(PNG_PATH)\n",
    "\n",
    "image_resp = cc.client.messages.create(\n",
    "    model=cc.default_model,\n",
    "    max_tokens=600,\n",
    "    temperature=0,\n",
    "    messages=[{\n",
    "        \"role\": \"user\",\n",
    "        \"content\": [\n",
    "            {\n",
    "                \"type\": \"image\",\n",
    "                \"source\": {\n",
    "                    \"type\": \"base64\",\n",
    "                    \"media_type\": \"image/png\",\n",
    "                    \"data\": image_b64,\n",
    "                },\n",
    "            },\n",
    "            {\n",
    "                \"type\": \"text\",\n",
    "                \"text\": (\n",
    "                    \"This is a SaaS revenue dashboard for 2025. \"\n",
    "                    \"Summarise the trend across the three segments and identify the \"\n",
    "                    \"strongest performer in the most recent month. Use the figures shown on the chart.\"\n",
    "                ),\n",
    "            },\n",
    "        ],\n",
    "    }],\n",
    ")\n",
    "\n",
    "print(image_resp.content[0].text)\n",
    "print(\"\\n---\")\n",
    "print(\"Input tokens (incl. image):\", image_resp.usage.input_tokens)\n",
    "print(\"Output tokens:              \", image_resp.usage.output_tokens)\n",
    "print(\"Estimated cost USD:         \", round(estimate_cost_usd(cc.default_model, image_resp.usage), 6))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Section 3 - PDF input: read a quarterly business review\n",
    "\n",
    "Same pattern as image, but with `\"type\": \"document\"` and `\"media_type\": \"application/pdf\"`. Claude renders each page visually and processes the text layer alongside. We ask a question that requires synthesising across multiple sections of the report."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pdf_b64 = encode_b64(PDF_PATH)\n",
    "\n",
    "pdf_resp = cc.client.messages.create(\n",
    "    model=cc.default_model,\n",
    "    max_tokens=800,\n",
    "    temperature=0,\n",
    "    messages=[{\n",
    "        \"role\": \"user\",\n",
    "        \"content\": [\n",
    "            {\n",
    "                \"type\": \"document\",\n",
    "                \"source\": {\n",
    "                    \"type\": \"base64\",\n",
    "                    \"media_type\": \"application/pdf\",\n",
    "                    \"data\": pdf_b64,\n",
    "                },\n",
    "            },\n",
    "            {\n",
    "                \"type\": \"text\",\n",
    "                \"text\": (\n",
    "                    \"From this quarterly business review, list the Q4 priorities and, for each, \"\n",
    "                    \"identify the segment most affected. Be specific about the numbers.\"\n",
    "                ),\n",
    "            },\n",
    "        ],\n",
    "    }],\n",
    ")\n",
    "\n",
    "print(pdf_resp.content[0].text)\n",
    "print(\"\\n---\")\n",
    "print(\"PDF size on disk:          \", PDF_PATH.stat().st_size, \"bytes\")\n",
    "print(\"Input tokens (incl. PDF):  \", pdf_resp.usage.input_tokens)\n",
    "print(\"Output tokens:             \", pdf_resp.usage.output_tokens)\n",
    "print(\"Estimated cost USD:        \", round(estimate_cost_usd(cc.default_model, pdf_resp.usage), 6))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Section 4 - The PDF vs Markdown cost trade-off\n",
    "\n",
    "If you generated the PDF yourself, sending the original Markdown is dramatically cheaper. Same question, Markdown input. The article shipped with the same content, but the input token count tells a very different story."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "qbr_md = MD_PATH.read_text()\n",
    "\n",
    "md_resp = cc.client.messages.create(\n",
    "    model=cc.default_model,\n",
    "    max_tokens=800,\n",
    "    temperature=0,\n",
    "    messages=[{\n",
    "        \"role\": \"user\",\n",
    "        \"content\": (\n",
    "            \"From this quarterly business review, list the Q4 priorities and, for each, \"\n",
    "            \"identify the segment most affected. Be specific about the numbers.\\n\\n\"\n",
    "            \"---\\n\\n\" + qbr_md\n",
    "        ),\n",
    "    }],\n",
    ")\n",
    "\n",
    "print(md_resp.content[0].text[:600], \"...\\n\")\n",
    "print(\"---\")\n",
    "print(\"PDF version:       \", pdf_resp.usage.input_tokens, \"input tokens\")\n",
    "print(\"Markdown version:  \", md_resp.usage.input_tokens, \"input tokens\")\n",
    "if pdf_resp.usage.input_tokens and md_resp.usage.input_tokens:\n",
    "    ratio = pdf_resp.usage.input_tokens / md_resp.usage.input_tokens\n",
    "    print(f\"PDF cost multiplier: ~{ratio:.1f}x more expensive on input\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Takeaway.** If you control the source format and layout is not load-bearing, send the source. Reach for PDFs (and pay the page-by-page rendering cost) only when you don't control the format or when visual structure matters to the question."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Section 5 - Chart image vs CSV text: which gives the better answer?\n",
    "\n",
    "A different version of the same trade-off. We ask the same question of:\n",
    "\n",
    "- the dashboard PNG (model has to read approximate values off chart axes)\n",
    "- the underlying CSV (model has the exact numbers)\n",
    "\n",
    "Watch what changes: CSV answers are more precise (exact figures), and the cost difference is typically an order of magnitude."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "question = (\n",
    "    \"What was the Enterprise MRR in 2025-12 (USD), and how much did it grow versus 2025-01? \"\n",
    "    \"Give exact figures.\"\n",
    ")\n",
    "\n",
    "# --- Chart image version ---\n",
    "chart_resp = cc.client.messages.create(\n",
    "    model=cc.default_model,\n",
    "    max_tokens=300,\n",
    "    temperature=0,\n",
    "    messages=[{\n",
    "        \"role\": \"user\",\n",
    "        \"content\": [\n",
    "            {\"type\": \"image\",\n",
    "             \"source\": {\"type\": \"base64\", \"media_type\": \"image/png\", \"data\": image_b64}},\n",
    "            {\"type\": \"text\", \"text\": question},\n",
    "        ],\n",
    "    }],\n",
    ")\n",
    "\n",
    "# --- CSV text version ---\n",
    "csv_text = CSV_PATH.read_text()\n",
    "csv_resp = cc.client.messages.create(\n",
    "    model=cc.default_model,\n",
    "    max_tokens=300,\n",
    "    temperature=0,\n",
    "    messages=[{\n",
    "        \"role\": \"user\",\n",
    "        \"content\": f\"{question}\\n\\nCSV:\\n{csv_text}\",\n",
    "    }],\n",
    ")\n",
    "\n",
    "print(\"=== CHART IMAGE ===\\n\")\n",
    "print(chart_resp.content[0].text)\n",
    "print(f\"\\n[{chart_resp.usage.input_tokens} input / {chart_resp.usage.output_tokens} output tokens]\")\n",
    "\n",
    "print(\"\\n=== CSV TEXT ===\\n\")\n",
    "print(csv_resp.content[0].text)\n",
    "print(f\"\\n[{csv_resp.usage.input_tokens} input / {csv_resp.usage.output_tokens} output tokens]\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Section 6 - The `to_claude_input()` convert-first helper\n",
    "\n",
    "One function that inspects a file extension and returns the right Claude content block. Once this exists, the rest of your application becomes modality-agnostic: callers pass in any path and get back something the API understands.\n",
    "\n",
    "This is a starter scaffold -- the article covers the conversion patterns for audio, video, DOCX, XLSX, and HTML. Each one is a `TODO` in the function below, with the recommended library in the comment."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def to_claude_input(path: str | Path) -> dict:\n",
    "    \"\"\"Inspect a file extension and return a Claude content block (or list of blocks).\n",
    "\n",
    "    Native modalities: text-like extensions become text blocks; PNG/JPG/GIF/WebP\n",
    "    become image blocks; PDFs become document blocks.\n",
    "\n",
    "    Non-native modalities raise NotImplementedError with a hint about the right conversion.\n",
    "    \"\"\"\n",
    "    path = Path(path)\n",
    "    ext = path.suffix.lower()\n",
    "\n",
    "    # --- Native: text-like (send as text) ---\n",
    "    text_exts = {\".txt\", \".md\", \".csv\", \".json\", \".yaml\", \".yml\", \".py\", \".sql\", \".html\"}\n",
    "    if ext in text_exts:\n",
    "        return {\"type\": \"text\", \"text\": path.read_text()}\n",
    "\n",
    "    # --- Native: images ---\n",
    "    image_exts = {\".png\": \"image/png\", \".jpg\": \"image/jpeg\", \".jpeg\": \"image/jpeg\",\n",
    "                  \".gif\": \"image/gif\", \".webp\": \"image/webp\"}\n",
    "    if ext in image_exts:\n",
    "        return {\n",
    "            \"type\": \"image\",\n",
    "            \"source\": {\n",
    "                \"type\": \"base64\",\n",
    "                \"media_type\": image_exts[ext],\n",
    "                \"data\": base64.standard_b64encode(path.read_bytes()).decode(\"utf-8\"),\n",
    "            },\n",
    "        }\n",
    "\n",
    "    # --- Native: PDF ---\n",
    "    if ext == \".pdf\":\n",
    "        return {\n",
    "            \"type\": \"document\",\n",
    "            \"source\": {\n",
    "                \"type\": \"base64\",\n",
    "                \"media_type\": \"application/pdf\",\n",
    "                \"data\": base64.standard_b64encode(path.read_bytes()).decode(\"utf-8\"),\n",
    "            },\n",
    "        }\n",
    "\n",
    "    # --- Non-native: convert-first ---\n",
    "    if ext in {\".docx\", \".pptx\"}:\n",
    "        # TODO: convert to PDF via LibreOffice headless, or extract text via python-docx / python-pptx,\n",
    "        # then recurse: return to_claude_input(converted_path).\n",
    "        raise NotImplementedError(\n",
    "            f\"{ext}: convert to PDF (LibreOffice) or extract text (python-docx / python-pptx) first.\"\n",
    "        )\n",
    "    if ext == \".xlsx\":\n",
    "        # TODO: convert sheet to CSV via openpyxl/pandas, then return text block.\n",
    "        raise NotImplementedError(\n",
    "            f\"{ext}: convert to CSV via openpyxl or pandas, then send as text.\"\n",
    "        )\n",
    "    if ext in {\".mp3\", \".wav\", \".m4a\", \".aac\", \".flac\"}:\n",
    "        # TODO: transcribe via Whisper / AssemblyAI / Deepgram, then return text block.\n",
    "        raise NotImplementedError(\n",
    "            f\"{ext}: transcribe via Whisper / AssemblyAI / Deepgram first.\"\n",
    "        )\n",
    "    if ext in {\".mp4\", \".mov\", \".webm\", \".mkv\"}:\n",
    "        # TODO: extract frames with ffmpeg AND/OR audio track. Frame -> images, audio -> transcript.\n",
    "        raise NotImplementedError(\n",
    "            f\"{ext}: extract frames or audio with ffmpeg first.\"\n",
    "        )\n",
    "\n",
    "    raise NotImplementedError(f\"Unhandled extension: {ext}\")\n",
    "\n",
    "\n",
    "# Demo: route the three native files through the helper.\n",
    "for p in (CSV_PATH, PNG_PATH, PDF_PATH):\n",
    "    block = to_claude_input(p)\n",
    "    summary = block[\"type\"] + (f' ({block[\"source\"][\"media_type\"]})' if \"source\" in block else \"\")\n",
    "    print(f\"{p.name:30s} -> {summary}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Compose a multi-modal user message using the helper.\n",
    "user_blocks = [\n",
    "    to_claude_input(PNG_PATH),\n",
    "    {\"type\": \"text\", \"text\": (\n",
    "        \"This dashboard summarises the year. Based ONLY on what you can see, \"\n",
    "        \"which segment had the steepest growth curve?\"\n",
    "    )},\n",
    "]\n",
    "\n",
    "composed_resp = cc.client.messages.create(\n",
    "    model=cc.default_model,\n",
    "    max_tokens=400,\n",
    "    temperature=0,\n",
    "    messages=[{\"role\": \"user\", \"content\": user_blocks}],\n",
    ")\n",
    "print(composed_resp.content[0].text)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Section 7 - Practitioner Lab\n",
    "\n",
    "Open-ended extension. No reference solution.\n",
    "\n",
    "**Goal:** flesh out the convert-first helper for one non-native modality of your choice.\n",
    "\n",
    "**Pick one path:**\n",
    "\n",
    "1. **XLSX -> CSV text.** Implement the `.xlsx` branch using `pandas.read_excel` or `openpyxl`. For multi-sheet workbooks, emit one CSV section per sheet, separated by `\\n--- sheet: <name> ---\\n`. Test it against any real Excel file you have lying around.\n",
    "2. **Audio -> transcript.** Implement the audio branch using OpenAI Whisper (local or API) or AssemblyAI. For meeting recordings, include speaker labels and timestamps. Test it on a short recording.\n",
    "3. **DOCX -> Markdown.** Implement the `.docx` branch using `python-docx`. Preserve headings, bullets, and basic table structure. Test it on a real Word doc.\n",
    "\n",
    "**Stretch:** add an `image-resize` step to the existing `.png`/`.jpg`/`.jpeg` branches. Use `Pillow` to downsize anything wider than 1500px before base64-encoding. Measure the input-token reduction on a single large screenshot.\n",
    "\n",
    "Why this matters: every team eventually needs to read \"the file the user uploaded\" without knowing in advance what format it will be. A single, well-tested `to_claude_input()` is the difference between elegant multi-modal code and a tangle of format-specific if-statements scattered across the codebase.\n",
    "\n",
    "---\n",
    "\n",
    "*Companion article: B2 - Multi-Modal Inputs: Feeding Claude More Than Text.*\n",
    "*Next notebook: B3_caching_and_thinking.ipynb*"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "name": "python",
   "version": "3.11"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}