Module 4: Staging Models

Clean source data gently: rename, cast, standardize, and expose a stable base layer.

100 minutes. 1 inline exercise. Free course module.

Learning Objectives

  • Build staging models that stay close to the source
  • Apply safe renaming and type casting
  • Avoid burying business logic too early

Why This Matters

Staging models are the clean mirror of raw sources. They should make data easier to use without making heavy business decisions.

Follow the arrows. Each box is one idea you will practice in this module.

Raw (step 1) → Rename (step 2) → Cast (step 3) → Clean (step 4) → Stage (step 5)

Production analytics engineering turns raw records into governed, trusted business meaning.
Architecture diagram for Module 4: Staging Models.

Lesson Content

The Mental Model

A staging model is like rewriting messy notes into clean handwriting. You are not changing the story yet; you are making it readable.

Tiny Example

We will use a small ecommerce dataset throughout the course. Think of these as the only tables in your first warehouse:

Table            | Grain                             | Example columns
raw_orders       | one row per order event           | order_id, customer_id, amount, status, created_at
raw_order_items  | one row per item inside an order  | order_id, product_id, quantity, item_price
raw_customers    | one row per customer              | customer_id, email, country, created_at
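
To make the rename, cast, and standardize steps concrete, here is a minimal sketch of what a staging query over raw_customers might look like. It assumes a Postgres- or DuckDB-style SQL dialect. The sample rows are invented for illustration, the inline CTE stands in for the real raw table, and the country_code name is a hypothetical rename, not something the course prescribes.

```sql
-- Sketch only: sample rows are made up; the CTE stands in for the raw table.
with raw_customers (customer_id, email, country, created_at) as (
    values
        ('1', ' Ana@Example.COM ', 'in', '2024-01-05 10:15:00'),
        ('2', 'bo@example.com',    'US', '2024-02-11 08:00:00')
)

select
    cast(customer_id as integer)  as customer_id,   -- cast: text to integer
    lower(trim(email))            as email,         -- standardize: trim and lowercase
    upper(country)                as country_code,  -- hypothetical rename, consistent casing
    cast(created_at as timestamp) as created_at     -- cast: text to timestamp
from raw_customers;
```

The grain is unchanged: still one row per customer. Nothing is filtered, joined, or aggregated, which is what keeps the model close to the source.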

Interactive Check

Question: Should a staging model calculate lifetime customer value?

Answer:

No. That is business logic across many events and belongs later. Staging should focus on source cleanup: names, types, null handling, and basic standardization.
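
For contrast, a lifetime value calculation might live in a later model that reads from staging. The sketch below assumes a Postgres- or DuckDB-style dialect and a cleaned stg_orders with one row per order. The sample rows are invented, and summing every order amount is a deliberate simplification, not a recommended definition.

```sql
-- Sketch of a downstream (non-staging) query; sample rows are made up.
with stg_orders (order_id, customer_id, amount) as (
    values
        (101, 1, 40.00),
        (102, 1, 60.00),
        (103, 2, 25.00)
)

select
    customer_id,
    sum(amount) as lifetime_value  -- aggregates across many orders: business logic
from stg_orders
group by customer_id
order by customer_id;
```

Deciding which orders count (for example, whether cancelled orders are excluded) is exactly the kind of business decision that staging avoids.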

Inline Practice Lab

This lab is intentionally small. You can solve it by reading the table, writing the SQL/YAML mentally, or pasting the snippet into any SQL scratchpad later.

-- Example starter table
select
  order_id,
  customer_id,
  amount,
  status,
  created_at
from raw_orders;

The goal is not tooling setup. The goal is learning the production habit: state the grain, clean one thing, test one assumption, and explain the downstream impact.
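
One way to practice "state the grain, test one assumption" is a duplicate check. The sketch below tests the assumption that order_id is unique in raw_orders. It assumes a Postgres- or DuckDB-style dialect, and the sample rows are invented for illustration.

```sql
-- Sketch only: sample rows are made up; the CTE stands in for the raw table.
with raw_orders (order_id, customer_id, amount, status, created_at) as (
    values
        (101, 1, 40.00, 'Placed',  '2024-03-01 09:00:00'),
        (101, 1, 40.00, 'SHIPPED', '2024-03-02 14:30:00'),
        (102, 2, 25.00, 'placed',  '2024-03-03 11:00:00')
)

-- Any row returned here is an order_id that appears more than once.
-- With the sample rows above, order 101 is returned because it has two event rows.
select
    order_id,
    count(*) as row_count
from raw_orders
group by order_id
having count(*) > 1;
```

A non-empty result does not always mean the data is wrong. If the grain is one row per order event, repeated order_id values are expected, and the grain sentence should say so.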

Self-Check Quiz

  1. What is the grain of the table you are building?
  2. Which downstream metric or dashboard would be wrong if this model broke?
  3. What test would catch the most likely beginner mistake here?

Real-World Use Cases

  • Reliable executive dashboards that do not disagree across teams
  • AI analytics agents that query governed metrics instead of guessing SQL
  • Auditable metric changes where owners can see downstream impact before merge

Production Notes

  • Use one staging model per source table. It gives every raw table one official cleaned interface.

Common Mistakes

  • Joining multiple sources in staging
  • Adding metrics to staging models
  • Leaving cryptic source column names unchanged
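
To show where a join could live instead, here is a sketch of a downstream query that combines two staging models. It assumes a Postgres- or DuckDB-style dialect and a stg_orders with one row per order. The sample rows and the output column names item_count and items_total are hypothetical.

```sql
-- Sketch of a downstream model; staging models stay single-source.
-- Sample rows are made up for illustration.
with stg_orders (order_id, customer_id, status) as (
    values
        (101, 1, 'placed'),
        (102, 2, 'shipped')
),

stg_order_items (order_id, product_id, quantity, item_price) as (
    values
        (101, 'p1', 2, 10.00),
        (101, 'p2', 1, 20.00),
        (102, 'p1', 1, 10.00)
)

select
    o.order_id,
    o.customer_id,
    count(*)                       as item_count,   -- item rows per order
    sum(i.quantity * i.item_price) as items_total   -- derived value, kept out of staging
from stg_orders as o
join stg_order_items as i
    on i.order_id = o.order_id
group by o.order_id, o.customer_id
order by o.order_id;
```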

Think Like an Engineer

  • Can you explain the grain of this model in one sentence?
  • What breaks downstream if this field becomes null tomorrow?
  • Where should this logic live so it is reused instead of copied?
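
The null question can be turned into a small check. The sketch below follows the convention dbt uses for singular tests, where a query returns the failing rows and an empty result means the check passes. It assumes a Postgres- or DuckDB-style dialect, and the sample rows are invented.

```sql
-- Sketch only: sample rows are made up; the CTE stands in for the staging model.
with stg_orders (order_id, customer_id, status) as (
    values
        (101, 1,    'placed'),
        (102, null, 'shipped'),
        (103, 2,    null)
)

-- Returns every row where a required field is missing.
-- With the sample rows above, orders 102 and 103 are returned.
select *
from stg_orders
where customer_id is null
   or status is null;
```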

Career Relevance

Analytics engineering is the bridge between SQL skill and production data ownership. Freshers who learn tests, lineage, metrics, and semantic modeling early stand out because they can reason about trust, not just queries.

Key Terms

Staging model
A dbt model that cleans and standardizes one raw source table.
Source
An upstream table that dbt reads but does not create.
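
In a dbt project, a staging model usually reads its raw table through the source() function rather than a hard-coded table name. The fragment below is a sketch. The source name ecommerce is hypothetical, and it assumes the source has been declared in the project's YAML configuration. Because it contains Jinja, it runs through dbt rather than in a plain SQL scratchpad.

```sql
-- Sketch of a dbt staging model body.
-- 'ecommerce' is a hypothetical source name; 'raw_orders' is the raw table.
select
    order_id,
    customer_id,
    amount,
    status,
    created_at
from {{ source('ecommerce', 'raw_orders') }}
```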

Inline Exercises

  1. Fix stg_orders

    Turn a messy raw_orders table into a clean staging model.

    30-45 minutes - Beginner

    • Rename id to order_id (this exercise assumes the messy source names the key id)
    • Cast created_at to a timestamp
    • Standardize status values to lowercase
    • Keep source-level fields only
    • Write the model grain in one sentence

    Inline lab: complete the exercise directly in the course page.
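
If you want something to practice against in a scratchpad, the sketch below provides a small messy input without solving the exercise. The rows are invented for illustration. It assumes a Postgres- or DuckDB-style dialect and a raw table that stores every column as text and names the key id.

```sql
-- Practice input only: rows are made up and every value is stored as text.
with raw_orders (id, customer_id, amount, status, created_at) as (
    values
        ('101', '1', '40.00', 'Placed',  '2024-03-01 09:00:00'),
        ('102', '2', '25.00', 'SHIPPED', '2024-03-03 11:00:00'),
        ('103', '1', '15.50', 'shipped', '2024-03-04 16:45:00')
)

-- Replace this select with your own staging logic.
select *
from raw_orders;
```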

Key Takeaways

  • Staging models are stable cleaned source interfaces
  • Keep business logic out of staging; source-specific cleanup is the only logic that belongs there
  • Good staging makes every downstream model simpler