# Snowflake Warehouse Cost Runbook

*Owner: Data Platform Team. Last reviewed: 2025-09-15.*

This runbook is the single source of truth for diagnosing, attributing, and reducing Snowflake warehouse spend at Acme SaaS Co. Anyone on the data team -- engineering, analytics engineering, or analytics -- should be able to follow it end to end to triage a cost spike or run a regular optimization pass.

The runbook assumes familiarity with the Snowflake credit model, the `SNOWFLAKE.ACCOUNT_USAGE` schema, and basic SQL. It does **not** assume familiarity with our specific warehouse layout -- that is documented in Section 2.

---

## 1. Cost categories

Snowflake compute spend at Acme breaks down into five categories. In FY2025 the approximate split was:

| Category                          | Share | Notes |
|---|---|---|
| Scheduled ETL (dbt + Airflow)     | 48%   | Predictable load. Largest single category. |
| Ad-hoc analytics (Tableau, ad-hoc SQL) | 22% | Variable, peaks during quarter-close. |
| Reverse ETL (Hightouch, Census)   | 11%   | Mostly steady; spikes when audiences re-sync. |
| Embedded customer-facing dashboards | 10% | Latency-sensitive; cannot run on smaller warehouses. |
| Data Science / ML training        | 9%    | Mostly Snowpark. Bursty. |

The remaining ~0% covers internal admin queries, replication, and snowpipe ingest -- not zero, but small enough to live in the operational rounding.

---

## 2. Warehouse layout

Acme operates seven Snowflake warehouses, separated by workload class and SLA:

- `WH_ELT_M`         (Medium, auto-suspend 60s) -- bulk of dbt models, scheduled
- `WH_ELT_L`         (Large, auto-suspend 60s) -- large or fact-table dbt models
- `WH_AD_HOC_S`      (Small, auto-suspend 30s) -- analyst ad-hoc, capped daily credit budget
- `WH_BI_M`          (Medium, multi-cluster 1-3, auto-suspend 60s) -- Tableau extracts and dashboards
- `WH_REVERSE_S`     (Small, auto-suspend 60s) -- Hightouch and Census
- `WH_EMBEDDED_M`    (Medium, multi-cluster 1-6, auto-suspend 60s) -- customer dashboards (latency SLA 3s)
- `WH_DS_L`          (Large, auto-suspend 300s) -- ML training, Snowpark

Each warehouse has an owning team. Cost attribution flows through warehouse name -- there is **no** shared warehouse used by multiple teams. This decision predates the current platform team and is the single most useful piece of cost hygiene we have. Do not break it.

---

## 3. Common drivers of cost spikes

When monthly warehouse spend exceeds the rolling 90-day average by more than 15%, treat it as a spike and follow the diagnosis playbook in Section 4. The most common root causes, in rough order of frequency:

### 3.1 A new dbt model with an unintended full table scan

By far the most common cause. A new model joins against a fact table without a proper filter, or against `INFORMATION_SCHEMA` in a way that scans every micro-partition. Symptom: a single query in `QUERY_HISTORY` consuming hundreds of credits with no clear justification.

### 3.2 An accidentally-disabled materialization

A model whose materialization was incrementally implemented but is now running as a full `table` because someone removed the `unique_key` config during refactoring. Symptom: gradual but persistent climb in `WH_ELT_M` credit consumption over multiple days, without any specific spike.

### 3.3 Tableau extract proliferation

A new BI consumer creates one extract for every dashboard rather than reusing a shared extract. Symptom: `WH_BI_M` credits climb in step with new dashboard counts; the worst offender shows in `QUERY_HISTORY` as repeated identical SELECTs filtered by `extract_id`.

### 3.4 A misconfigured embedded dashboard

A customer-facing dashboard is missing its row-level security predicate and falls back to scanning the full multi-tenant table. Symptom: `WH_EMBEDDED_M` credits spike; the same query (parameterized) appears hundreds of times in a short window. **Treat this as a P1.** It is both a cost issue and a security issue.

### 3.5 An ML training job that wasn't supposed to run

Symptom: `WH_DS_L` credits spike outside of business hours, often associated with a single notebook or Snowpark session. Usually the result of an iteration loop someone forgot to stop.

### 3.6 Reverse ETL audience explosion

A marketing operator adds a new audience to Hightouch that is much larger than expected (e.g., "all users" instead of "active paid users"). Symptom: `WH_REVERSE_S` credits spike on the day a new sync goes live.

---

## 4. Diagnosis playbook

When a spike is detected, follow these steps in order. Time-box each step to 15 minutes -- if the answer is not obvious, escalate per Section 7 rather than dig deeper alone.

### Step 1: identify which warehouse

Run the standard cost-by-warehouse query (in Section 8) over the past 7 days. Compare against the trailing 90-day baseline. The warehouse with the largest absolute increase is your starting point; secondary warehouses can wait.

### Step 2: identify which time window

Within the offending warehouse, narrow the time window to the hour or day the spike began. Look at the `QUERY_HISTORY` aggregate by hour. A sudden cliff indicates a new query pattern was introduced; a gradual ramp indicates an existing query that grew expensive over time.

### Step 3: identify which query pattern

Group `QUERY_HISTORY` by `QUERY_PARAMETERIZED_HASH` within the offending window. The top 3-5 hashes by total credits consumed are your candidates. Pull the parameterized text of each and examine it for the patterns in Section 3.

### Step 4: identify which actor

For each candidate query pattern, the `USER_NAME` and (if applicable) the `dbt run_id` or Tableau `external_token` identifies the team responsible. Route the issue to the owning team via Slack with the parameterized query, the time window, and the estimated credit impact.

### Step 5: implement the fix

The owning team owns the fix. The platform team owns following up to confirm spend returned to the baseline within 72 hours.

---

## 5. Optimization patterns

Independent of incidents, the following patterns reduce warehouse spend by 10-30% in typical Snowflake accounts and should be applied opportunistically.

### 5.1 Auto-suspend tuning

The default 60-second auto-suspend is too generous for most workloads. Lower to 30 seconds for analyst warehouses (`WH_AD_HOC_S`); keep at 60 seconds for ETL where serial dbt models would otherwise reload the cache; consider 300 seconds only for warehouses with cached working sets that must stay warm (e.g., embedded dashboards during business hours).

### 5.2 Warehouse right-sizing

A Medium warehouse running at 30% queue depth is not a Large-warehouse candidate -- it is an over-provisioned Medium. Inspect the `WAREHOUSE_LOAD_HISTORY` view weekly. If average concurrent queries are well below the warehouse's capacity, step down. **Do not step up without evidence**: stepping up doubles credit cost.

### 5.3 Materialization review

For dbt projects, the rule of thumb: small dimensions = `table`, large or incremental facts = `incremental`, throwaway intermediate = `ephemeral` or `view`. Annual review pass against the project to confirm.

### 5.4 Clustering keys on hot fact tables

For any fact table that is the target of high-volume time-filtered queries (e.g., `fct_events` filtered by `event_date`), clustering by date can reduce per-query cost by 5-10x. Validate with `SYSTEM$CLUSTERING_INFORMATION` before deciding.

### 5.5 Query tag discipline

Every automated workload should set `QUERY_TAG` to identify its source (dbt project, Airflow DAG, Tableau workbook). Without this, attribution in Step 3 above is much harder. The platform team has a quarterly audit; teams without proper query tags lose access to peak-hour warehouse capacity.

---

## 6. Incident examples

### Incident 2025-03-04 (P2): dbt model full scan

A new `fct_subscription_events` model joined against the `EVENTS_RAW` table without filtering on `event_date`. Every run scanned 18 months of data. Daily impact: ~85 credits. Detected on 2025-03-04; root caused on 2025-03-04; fixed on 2025-03-05 by adding incremental config with `event_date >= dateadd(day, -3, current_date())` filter. Total credit waste: ~340 credits ($1,360 at our enterprise rate).

### Incident 2025-05-18 (P1): embedded dashboard missing RLS

A new customer-facing dashboard shipped with an incorrect Snowflake role configuration that bypassed row-level security. Every dashboard load scanned the full multi-tenant fact table. Daily impact during the 30-hour window before detection: ~520 credits. Also a security incident -- reported to security ops per playbook. Fixed by re-binding the dashboard's Snowflake user to the correct role.

### Incident 2025-07-22 (P3): Tableau extract proliferation

A new BI consumer (the Customer Success team) shipped 23 new dashboards, each backed by its own Tableau extract refreshing hourly. `WH_BI_M` credits climbed 40% in a week. Root caused via Step 3 of the diagnosis playbook. Fix: consolidated to three shared extracts feeding all 23 dashboards. Daily impact dropped by 35 credits.

### Incident 2025-08-30 (P3): out-of-hours ML training loop

A data scientist's Snowpark notebook was left running in a loop overnight after their kernel was supposed to have shut down. `WH_DS_L` ran for 11 hours of unattended training. Fixed with an `auto_suspend=300` config and a warehouse-level cron job that force-suspends `WH_DS_L` between 22:00 and 06:00 SGT. Net daily savings: ~40 credits.

---

## 7. Escalation

Route P1 (customer-impacting or security-related cost incidents) to the platform on-call and security on-call simultaneously. Route P2 (significant spend impact, no customer or security impact) to the platform team Slack channel within one business hour. Route P3 (housekeeping-level cost issues) into the weekly cost review backlog.

The platform team maintains a 30-day rolling P1 budget of $5,000. Incidents that exceed individual or cumulative budget thresholds trigger an automatic incident review.

---

## 8. Reference queries

Standard SQL snippets used by the diagnosis playbook. All target `SNOWFLAKE.ACCOUNT_USAGE` -- they require the `ACCOUNTADMIN` role or membership in the `PLATFORM_ANALYST` role.

### Cost by warehouse, last 7 days

```sql
select warehouse_name,
       sum(credits_used) as credits_7d
from snowflake.account_usage.warehouse_metering_history
where start_time >= dateadd(day, -7, current_timestamp())
group by warehouse_name
order by credits_7d desc;
```

### Top parameterized query patterns by spend, last 24h

```sql
select query_parameterized_hash,
       any_value(query_text)       as sample_text,
       count(*)                    as runs,
       sum(credits_used_cloud_services + credits_used_compute) as credits
from snowflake.account_usage.query_history
where start_time >= dateadd(hour, -24, current_timestamp())
  and warehouse_name = :warehouse
group by 1
order by credits desc
limit 10;
```

### Warehouse load profile

```sql
select date_trunc('hour', start_time) as hr,
       avg_running                    as avg_running_queries,
       avg_queued_load                as avg_queued_queries
from snowflake.account_usage.warehouse_load_history
where warehouse_name = :warehouse
  and start_time >= dateadd(day, -14, current_timestamp())
order by hr;
```

---

*End of runbook. For platform-team Slack, see #data-platform-ops. For escalation, see Section 7.*
