End-to-End Retail Data Platform on GCP

From two raw CSV files to a live analytics dashboard

Project Overview

A production-grade analytics platform for a UK giftware retailer, built from scratch

01
Data Pipeline
Raw CSVs → dbt transformations → star schema
02
Infrastructure
GCP services + Terraform automation
03
BI Layer
Looker Studio dashboards answering business questions

Part 1

Data Pipeline

Part 1: Data Pipeline

Business Problem

UK retailer selling into 38 countries; raw CSV data: not clean, not structured, not usable for analytics

InvoiceNo | StockCode | Quantity | InvoiceDate | CustomerID
536365 | 85123A | 6 | 12/1/10 8:26 | 17850
C536379 | D | -1 | 12/1/10 9:41 | 14527
536370 | POST | 3 | 12/1/10 8:45 | 12583
  • C-prefixed invoices: cancellations with negative quantities
  • POST, D stock codes: non-product rows
  • Dates stored as text, two different formats in the same column
  • Same StockCode with different descriptions and prices

Four Questions They Couldn't Answer

  1. Which countries generate the most revenue?
  2. Which products sell the most?
  3. How does revenue trend month over month?
  4. Who are our highest-value customers?

The Approach

Raw files must be loaded, then cleaned, structured, and modelled before they can answer anything

The Pattern
ELT
Extract: pull raw CSV files
Load: drop them into BigQuery as-is
Transform: clean and model inside the warehouse
Unlike ETL, transformation happens after loading, leveraging the warehouse's compute power.
The Tool
dbt
Data Build Tool: defines SQL transformations as versioned, testable models.
It is the T in ELT.
Each model is a .sql file. dbt compiles, runs, and tests them against the warehouse in the right order.
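In practice each model is a single SELECT statement; a minimal sketch of what one such .sql file might look like (file, source, and column names here are illustrative, not the project's actual code):

```sql
-- models/staging/stg_invoices.sql (illustrative name)
-- dbt compiles {{ source(...) }} into a fully qualified BigQuery table
-- and materialises this SELECT as a table or view in the warehouse.
SELECT
    InvoiceNo,
    StockCode,
    CAST(Quantity AS INT64) AS quantity,
    InvoiceDate,
    CustomerID
FROM {{ source('raw', 'raw_invoices') }}
WHERE InvoiceNo IS NOT NULL
```

Because each model declares its inputs via source()/ref(), dbt can infer the dependency graph and run the models in the right order.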

dbt Transformations

Model | Problem | Solution
dim_datetime | Dates as text, two formats | PARSE_DATETIME selected by string length
dim_product | StockCode not unique | Surrogate key = hash(code + desc + price)
dim_customer | No ISO codes, duplicates | Deduplicate + join country reference
fct_invoices | Returns, noise, no totals | Filter Qty > 0 + compute Qty × Price

541,909 raw rows → 397,884 validated sale facts  ·  Tests: unique, not_null, relationships
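The two trickiest fixes in the table can be sketched in BigQuery SQL. Column names and the exact date formats are assumptions based on the slides, not the project's actual code:

```sql
-- dim_datetime: dates stored as text in two formats; pick the parser
-- by string length (the real lengths/formats may differ)
SELECT
    CASE
        WHEN LENGTH(invoice_date) = 16
            THEN PARSE_DATETIME('%m/%d/%Y %H:%M', invoice_date)
        ELSE PARSE_DATETIME('%m/%d/%y %H:%M', invoice_date)
    END AS invoice_datetime
FROM {{ ref('stg_invoices') }};

-- dim_product: StockCode alone is not unique, so hash code + description
-- + price into a surrogate key that is
SELECT DISTINCT
    TO_HEX(MD5(CONCAT(
        stock_code, '-', description, '-', CAST(unit_price AS STRING)
    ))) AS product_id,
    stock_code,
    description,
    unit_price
FROM {{ ref('stg_invoices') }};
```

The unique, not_null, and relationships checks are then declared on these columns in dbt's schema YAML and run with `dbt test`.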


Star Schema

A modelling pattern where one central fact table is surrounded by dimension tables, like a star. Optimised for analytical queries.

Dimension tables
Describe the context of an event, who bought, what product, when it happened. Rows are entities: customers, products, dates.
Fact table
Records the events themselves, one row per sale. Stores measurable values (quantity, revenue) and foreign keys pointing to each dimension.
dim_customer (4,380) · dim_product (16,282) · dim_datetime (23,260)
→ fct_invoices (397,884 rows)
Star schema diagram, dimensions surrounding fct_invoices
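With this shape, a typical analytical query is a single join from the fact to a dimension plus an aggregate (table and column names assumed from the slides):

```sql
-- Top products by revenue: fact rows aggregated through dim_product
SELECT
    p.description,
    SUM(f.quantity) AS units_sold,
    SUM(f.total)    AS revenue
FROM fct_invoices AS f
JOIN dim_product  AS p
    ON f.product_id = p.product_id
GROUP BY p.description
ORDER BY revenue DESC
LIMIT 10
```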

Report Layer

Pre-aggregated for Looker Studio, one model per business question

report_customer_invoices
Countries
Groups by country + ISO code
→ revenue, invoice count, top 10
report_product_invoices
Products
Groups by stock code + description
→ total quantity sold, top 10
report_year_invoices
Revenue over Time
Groups by year + month
→ invoice count, total revenue
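One of these report models might look like the following sketch (model and column names are assumptions based on the slide, not the actual source):

```sql
-- models/report/report_customer_invoices.sql (sketch)
-- Pre-aggregates revenue per country so Looker Studio reads a tiny table
SELECT
    c.country,
    c.iso_code,
    COUNT(DISTINCT f.invoice_id) AS invoice_count,
    SUM(f.total)                 AS revenue
FROM {{ ref('fct_invoices') }}  AS f
JOIN {{ ref('dim_customer') }}  AS c
    ON f.customer_id = c.customer_id
GROUP BY c.country, c.iso_code
ORDER BY revenue DESC
```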

Part 2

Infrastructure

Part 2: Infrastructure

GCP Architecture

GCP Architecture, full pipeline from GCS to Looker Studio

GCP Components

Component | Role
GCS | Raw CSV file storage, where the data journey begins
Eventarc | Fires on object.finalized: zero lag, reacts the moment a file lands
Cloud Workflows | Orchestrates: load CSVs first → then trigger dbt
BigQuery | 3 datasets: raw, transform, report
Cloud Run Job | Runs the dbt Docker image on demand, stateless, exits when done
Secret Manager | BigQuery credentials, no passwords hardcoded
Artifact Registry | Stores the dbt Docker image
Cloud Build | CI/CD: rebuilds and pushes the dbt image on git push

Part 3

Industrialisation

Part 3: Industrialisation

Terraform

Infrastructure as Code: every GCP resource in version-controlled files

modules/platform
Platform
GCS buckets, IAM service accounts, API enablement, Artifact Registry
modules/pipeline
Pipeline
Cloud Workflows, Eventarc trigger, Cloud Run Job, Secret Manager
modules/observability
Observability (planned)
Monitoring dashboards, alerting, pipeline health tracking
  • terraform apply → creates every resource from scratch
  • Reproducible: deploy identically to dev, staging, prod
  • Every IAM permission is a line of code, not a console click
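A hedged sketch of how one such resource and IAM grant might be declared; the bucket name, variables, and module layout are illustrative assumptions, not the project's actual code:

```hcl
# modules/platform/main.tf - raw-data landing bucket (illustrative)
resource "google_storage_bucket" "raw_data" {
  name                        = "${var.project_id}-raw-data"
  location                    = var.region
  uniform_bucket_level_access = true
}

# The IAM grant lives next to the resource it secures: a line of code,
# not a console click
resource "google_storage_bucket_iam_member" "pipeline_reader" {
  bucket = google_storage_bucket.raw_data.name
  role   = "roles/storage.objectViewer"
  member = "serviceAccount:${var.pipeline_sa_email}"
}
```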

Part 4

BI Layer

Part 4: BI Layer

Answering the Business Questions

Looker Studio → BigQuery → report dataset  ·  View live dashboard

Which countries generate the most revenue?
report_customer_invoices · Bar chart
Which products sell the most?
report_product_invoices · Bar chart
How does revenue trend month over month?
report_year_invoices · Line chart
Who are our highest-value customers?
Custom BigQuery query on raw_invoice · Horizontal bar by revenue

Expected: UK dominates · Q4 spike (Oct to Dec) · small decorative items drive volume
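The one question served by a custom query rather than a report model could be sketched like this (column names on raw_invoice are assumptions):

```sql
-- Highest-value customers, queried directly against raw_invoice
SELECT
    CustomerID,
    SUM(Quantity * UnitPrice) AS revenue
FROM raw_invoice
WHERE CustomerID IS NOT NULL
  AND Quantity > 0          -- exclude cancellations / returns
GROUP BY CustomerID
ORDER BY revenue DESC
LIMIT 10
```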

From two CSV files and four unanswered questions
to a fully automated, tested, event-driven analytics platform.
Zero manual steps between data landing and dashboard updating.
BigQuery dbt GCS + Eventarc Cloud Workflows Cloud Run Terraform Looker Studio