# Data Quality Runbook

*Owner: Analytics Engineering Team. Last reviewed: 2025-10-01.*

This runbook documents how Acme SaaS Co. detects, diagnoses, remediates, and prevents data quality failures across the production data platform. It covers the dbt test suite, the Great Expectations checkpoint layer, and the manual investigation playbook for DQ incidents that slip through automated checks.

Read together with the Snowflake Warehouse Cost Runbook for incidents where a DQ failure causes a cost spike (e.g., a broken incremental filter that forces a full re-scan).

---

## 1. What counts as a data quality issue at Acme

Acme classifies DQ failures along four axes:

| Dimension    | Definition                                                  | Example failure                                      |
|---|---|---|
| Completeness | Expected rows or columns are missing                        | Daily event table has no rows after 14:00 UTC        |
| Freshness    | Data is older than the SLA allows                           | `fct_subscriptions` last updated 26 hours ago        |
| Accuracy     | Values are incorrect relative to the source of truth        | MRR figures in Snowflake differ from Stripe by >0.5% |
| Consistency  | Referential or business-rule integrity is broken            | `subscription_id` in `fct_events` has no match in `dim_subscriptions` |

A fifth dimension -- **uniqueness** -- is implicit in all four: duplicated primary keys amplify every other failure mode and are treated as P1 incidents regardless of downstream impact.

---

## 2. DQ monitoring stack

Acme runs two layers of automated DQ monitoring:

### 2.1 dbt tests (first layer -- model-level, runs on every dbt job)

Every production model has at minimum:
- `not_null` on all primary key columns
- `unique` on all primary key columns
- `accepted_values` on any column with a defined enum (e.g., `segment`, `plan_type`, `event_type`)
- `relationships` checks on all foreign keys pointing to `dim_*` tables

Failing a dbt test blocks the downstream model from materialising. Test results are emitted to the `DBT_TEST_RESULTS` table in Snowflake and surfaced in the Elementary dashboard.

### 2.2 Great Expectations checkpoints (second layer -- dataset-level, runs post-dbt)

For `fct_subscriptions`, `fct_mrr`, and `fct_events` (the three tables downstream consumers depend on most heavily), a Great Expectations suite runs after the dbt job completes. It checks:
- Row count within 20% of the trailing 7-day average (freshness + completeness proxy)
- MRR total within 0.5% of the Stripe API pull (accuracy)
- Zero null `account_id` values (completeness, belt-and-suspenders over the dbt test)
- Event timestamp distribution not bimodal (distribution check against a known-bad pattern)

GE checkpoint failures send a PagerDuty alert to the analytics engineering on-call. dbt test failures send a Slack alert to `#data-quality-alerts`.

---

## 3. Common root causes of DQ failures

In order of frequency over the past 12 months:

### 3.1 Source schema drift

An upstream SaaS vendor changes a field name, type, or enum value without notice. Common offenders: Salesforce field renames, Stripe API version upgrades, and HubSpot property migrations. Symptom: a dbt `accepted_values` test fails for a value that appears in the source but not in the expected list.

### 3.2 Incremental model filter breakage

A developer updates an incremental model's unique key or filter logic during a refactor. On the next run, the model either re-materialises rows it already has (duplicate primary keys) or silently drops rows that fall outside the new filter window. Symptom: the `unique` dbt test on the model starts failing; row counts in GE show a sudden non-monotonic drop.

### 3.3 Late-arriving events

The events pipeline has a known 2-4 hour lag from source to Snowflake under normal conditions. When upstream systems are degraded (typically 3rd-party webhooks), events arrive 12-48 hours late. Symptom: GE row count check flags the table as under-count; the lag clears itself without intervention but causes downstream freshness SLA breaches in the meantime.

### 3.4 Timezone handling error

A pipeline that ran cleanly under UTC broke when a developer introduced local-time logic for a Singapore-based client report. Timestamps shifted by 8 hours, causing events to land in the wrong partition and breaking daily aggregations at the date boundary. Symptom: row count for a specific date is roughly double; the adjacent date is roughly half.

### 3.5 Referential integrity break after a dim table truncate

A dimension table (`dim_accounts`) was truncated and rebuilt during a one-time migration. `fct_events` rows referencing old `account_id` values now have no matching dim row. The `relationships` dbt test catches this immediately -- but only if the test runs before the downstream models that join the two tables.

---

## 4. Diagnosis playbook

Follow these steps when a DQ alert fires. Time-box each step to 10 minutes before escalating.

### Step 1: identify the affected model and dimension

Read the alert. It will name the model, the test, and the column. If the alert is from GE rather than dbt, identify which expectation suite fired and which table it covers.

### Step 2: check if the failure is new or recurring

Query `DBT_TEST_RESULTS` for the past 7 runs of the same test. A failure that first appeared today on an otherwise clean test is a new incident. A failure that has been firing intermittently for weeks is a known issue -- check the backlog before creating a new ticket.

### Step 3: check the source data

For accuracy or completeness failures, query the raw source table (`RAW.STRIPE.*`, `RAW.SALESFORCE.*`, etc.) for the same time window. If the raw source also shows the anomaly, the issue is upstream. Route to the source system owner; the analytics engineering team is not the fix.

### Step 4: check recent model changes

Run `git log --since="48 hours ago" -- models/` to find any dbt model changes in the past 48 hours that touch the affected model or its direct parents. A schema change in a parent model is the most common culprit after source drift.

### Step 5: implement the fix or roll back

For a broken incremental filter: add or restore the correct `unique_key` and filter, then run `dbt run --full-refresh --select <model>` to re-materialise cleanly. For a source schema drift: update the staging model's column mapping and accepted values, test locally with `dbt test --select <model>`, then promote.

---

## 5. Remediation patterns

### 5.1 Full-refresh as the safe default

When in doubt, a `dbt run --full-refresh` on the affected model is the safe remediation. It is more expensive (a full scan rather than incremental) but it guarantees idempotence. For large fact tables, co-ordinate with the platform team to avoid running a full-refresh on `WH_ELT_L` during peak analytics hours.

### 5.2 Soft-delete over hard-delete

When removing rows from a production table (e.g., to fix duplicates), prefer marking rows with a `_deleted_at` timestamp and filtering in downstream views rather than issuing a `DELETE` statement. This preserves the audit trail and makes rollback trivial if the fix was wrong.

### 5.3 Backfill windows

For late-arriving event fixes, a backfill job re-runs the incremental model for the affected date range with `--start-date` and `--end-date` args. Backfills always run on `WH_ELT_L` (not `WH_ELT_M`) to avoid queue contention with the regular scheduled jobs.

---

## 6. Incident examples

### Incident 2025-02-11 (P2): Stripe schema drift breaks MRR model

Stripe deprecated the `amount` field on the `invoice` object in favour of `amount_due`. The `stg_stripe__invoices` staging model silently returned null for all invoices after the migration date. `fct_mrr` showed a 100% churn rate for the affected week. Detected by GE accuracy check (MRR vs Stripe API pull diverged by 87%). Fix: updated staging model column reference; backfilled 6 days of MRR. Downstream impact: three Tableau dashboards showed incorrect data for 14 hours.

### Incident 2025-04-03 (P1): duplicate primary keys in fct_subscriptions

A developer removed the `unique_key` config from `fct_subscriptions` during a schema refactor, converting the model from incremental to full table. Every subsequent run appended a full copy of all historical rows. The `unique` dbt test caught the failure on the first run after the change -- but the change merged on a Friday afternoon and the weekend on-call rotation missed the Slack alert. By Monday, three runs had occurred and row count was 4x. Fix: restored `unique_key`, ran `--full-refresh`, re-notified downstream consumers. Post-mortem action: add PagerDuty escalation for `unique` test failures on P0 models regardless of time.

### Incident 2025-06-17 (P3): timezone shift corrupts daily event aggregation

A developer working on a Singapore-time client report introduced `convert_timezone('Asia/Singapore', event_timestamp)` in a shared staging model rather than an isolated view. All downstream daily aggregations shifted by 8 hours. Detected during a weekly data review meeting by a CSM who noticed the previous day's event count was anomalously high. Fix: reverted the staging model change; created an isolated Singapore-time view for the specific report. Lesson: timezone conversions must not occur in shared staging models.

### Incident 2025-09-08 (P2): GE row-count check false positive on bulk migration

A legitimate bulk data migration added 2.1M historical events to `fct_events` in a single run -- an expected one-time backfill. The GE row count check (flagging any run with >20% row count increase) fired as a false positive. The on-call engineer spent 45 minutes investigating before the migration team confirmed the cause. Action: GE checkpoint now accepts a `suppress_for_run_id` parameter that migration jobs can set to skip the volumetric check for a known-good bulk load.

---

## 7. Escalation

- **P1** (duplicated PKs on P0 models, MRR accuracy off by >1%, security-related integrity failure): page analytics engineering on-call and data platform on-call simultaneously.
- **P2** (data incorrect but not duplicated; freshness breach > 4 hours; GE suite failure on core fact tables): Slack alert to `#data-quality-alerts`; on-call responds within 1 business hour.
- **P3** (non-critical model test failure; accepted-values drift on a non-core dimension): ticket in the analytics engineering backlog; review within 2 business days.

---

## 8. Reference queries

### Recent dbt test failures

```sql
select tested_at, model_name, test_name, column_name, status, failures
from dbt_test_results
where tested_at >= dateadd(day, -3, current_timestamp())
  and status = 'fail'
order by tested_at desc;
```

### Row count trend for a model (freshness check)

```sql
select date_trunc('day', loaded_at) as day,
       count(*) as row_count
from analytics.fct_subscriptions
group by 1
order by 1 desc
limit 14;
```

### MRR accuracy spot-check (compare Snowflake vs Stripe pull)

```sql
select s.month,
       s.mrr_end as snowflake_mrr,
       st.mrr    as stripe_mrr,
       abs(s.mrr_end - st.mrr) / nullif(st.mrr, 0) as pct_diff
from analytics.fct_mrr s
join raw.stripe_mrr_reconciliation st using (month)
where s.month >= dateadd(month, -3, current_date())
order by s.month desc;
```

---

*End of runbook. For analytics engineering Slack, see #analytics-engineering. For escalation, see Section 7.*
