Modern Data Infrastructure
for ABC Foodmart
An end-to-end database system that transforms 30 years of fragmented Excel files into a centralized, 3NF-normalized PostgreSQL database with a Python ETL pipeline, 12 analytical SQL procedures, and executive dashboards in Metabase.
Problem & Objectives
ABC Foodmart is a neighborhood grocery chain that has operated for over 30 years using Excel spreadsheets, email attachments, and paper binders to manage products, staff schedules, sales logs, and vendor information. With plans to expand from two stores to five, this fragmented workflow was causing slow decision-making, reporting inconsistencies, and data entry errors.
Our solution: a fully normalized 3NF PostgreSQL relational database supported by a reproducible Python ETL pipeline and Metabase dashboards, giving analysts SQL access and executives real-time insights without requiring technical expertise.
Database Schema
The schema organizes data into six dimensional tables surrounding a central sales fact table, separating slow-changing descriptive data from high-volume transactional records. All tables are in Third Normal Form: atomic attributes, no partial or transitive dependencies.
Key design choice: Customer location does not store country_name directly. Instead, customers → city_id → cities → country_id → countries, eliminating the transitive dependency customer_id → city_id → country_name.
Design decisions: BIGINT for sales_id to accommodate high transaction volume; NUMERIC(12,2) for all prices to avoid floating-point rounding errors in financial calculations; ON DELETE RESTRICT prevents orphaned child records; a PostgreSQL trigger recomputes total_price = quantity × price × (1 − discount) on every insert, guaranteeing consistency.
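The pricing formula the trigger enforces can be sketched in plain Python with `decimal.Decimal`, illustrating why `NUMERIC(12,2)` was chosen over floating point. This is a minimal sketch, not the pipeline's code; the actual enforcement lives in the database trigger, and `ROUND_HALF_UP` is assumed here to mirror how PostgreSQL rounds positive `NUMERIC` values to scale.

```python
from decimal import Decimal, ROUND_HALF_UP

def total_price(quantity: int, price: Decimal, discount: Decimal) -> Decimal:
    """Mirror of the trigger's formula: quantity * price * (1 - discount),
    rounded to 2 decimal places like NUMERIC(12,2)."""
    raw = Decimal(quantity) * price * (Decimal("1") - discount)
    return raw.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)

# Binary floats drift on decimal fractions; Decimal does not.
assert 0.1 + 0.2 != 0.3                                    # float artifact
assert Decimal("0.1") + Decimal("0.2") == Decimal("0.3")   # exact

print(total_price(3, Decimal("19.99"), Decimal("0.10")))   # 53.97
```

Doing the same arithmetic in binary floats (`3 * 19.99 * 0.9`) yields `53.972999…`, which is exactly the class of rounding error the `NUMERIC` columns avoid.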
ETL Pipeline
The ETL pipeline converts seven CSV files from Kaggle into a clean, normalized PostgreSQL database. Built with pandas and SQLAlchemy, the workflow is fully repeatable: every run truncates all tables with RESTART IDENTITY CASCADE before reloading, ensuring idempotency.
Standardize column names (e.g., CityName → city_name). Validate structure: check for missing columns, unexpected types, and malformed dates before any transformation begins.
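A structural check of this kind can be sketched with pandas. This is a hypothetical `validate_structure` helper, not the pipeline's actual code, and the column names are illustrative:

```python
import pandas as pd

REQUIRED_COLUMNS = {"EmployeeID", "FirstName", "LastName", "BirthDate", "Gender"}

def validate_structure(df: pd.DataFrame, required: set) -> list:
    """Return a list of problems found; empty means the frame looks sane."""
    problems = [f"missing column: {c}" for c in sorted(required - set(df.columns))]
    if "BirthDate" in df.columns:
        parsed = pd.to_datetime(df["BirthDate"], errors="coerce")
        # Dates that were present but failed to parse
        bad = int(parsed.isna().sum()) - int(df["BirthDate"].isna().sum())
        if bad:
            problems.append(f"{bad} malformed dates in BirthDate")
    return problems

df = pd.DataFrame({"EmployeeID": [1], "FirstName": ["Ana"],
                   "LastName": ["Ng"], "BirthDate": ["not-a-date"]})
print(validate_structure(df, REQUIRED_COLUMNS))
# ['missing column: Gender', '1 malformed dates in BirthDate']
```

Running checks like this before any transformation means a malformed source file fails loudly at the start of the run instead of loading partial garbage.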
Normalize gender values to Male | Female | Other | Unspecified. Parse dates from mixed formats to ISO DATE. Map boolean-like strings to TRUE / FALSE / NULL. Validate numerics against CHECK constraints (price ≥ 0, 0 ≤ discount ≤ 1, quantity > 0). Resolve foreign key IDs by joining country and city name lookups before loading child tables.
Load tables in foreign-key dependency order: countries → cities → customers → employees → categories → products → sales. High-volume tables (especially sales) are written in chunks to avoid memory pressure. total_price is intentionally set to 0 on insert; a trigger fires immediately after each row to recompute the correct value, guaranteeing accuracy regardless of the source data.
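The dependency ordering can be made explicit and checked programmatically, so a refactor cannot accidentally load a child table before its parent. `LOAD_ORDER` matches the order above; `FK_DEPS` is an illustrative map of the schema's parent relationships, not code from the original pipeline:

```python
# Parent tables must be loaded before any table holding a FK to them.
LOAD_ORDER = ["countries", "cities", "customers", "employees",
              "categories", "products", "sales"]

FK_DEPS = {                       # child -> parents it references
    "cities": {"countries"},
    "customers": {"cities"},
    "products": {"categories"},
    "sales": {"customers", "employees", "products"},
}

def order_is_valid(order, deps):
    """Every parent must appear before its child in the load order."""
    position = {t: i for i, t in enumerate(order)}
    return all(position[parent] < position[child]
               for child, parents in deps.items()
               for parent in parents)

assert order_is_valid(LOAD_ORDER, FK_DEPS)
```

With ON DELETE RESTRICT in place, a wrong ordering fails immediately on the first foreign-key violation, so this check simply moves that failure from runtime to a unit test.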
Core ETL Code
Below are the key patterns used throughout the pipeline: practical SQL-in-Python patterns directly applicable in junior data engineering roles.
```python
# Connect to PostgreSQL via SQLAlchemy
from sqlalchemy import create_engine, text
import pandas as pd

engine = create_engine(
    "postgresql+psycopg2://user:password@localhost:5432/foodmart"
)

# Truncate all tables before each reload to ensure idempotency.
# engine.begin() opens a transaction and commits on success.
with engine.begin() as conn:
    conn.execute(text("""
        TRUNCATE TABLE sales, products, categories, employees,
                       customers, cities, countries
        RESTART IDENTITY CASCADE;
    """))
```
Python · ETL Setup
```python
# --- Transformation: clean gender field ---
gender_map = {
    "m": "Male", "f": "Female",
    "male": "Male", "female": "Female",
    "other": "Other",
}
df_emp["gender"] = (
    df_emp["gender"]
    .str.lower()
    .map(gender_map)
    .fillna("Unspecified")
)

# Parse mixed date formats safely
df_emp["birth_date"] = pd.to_datetime(
    df_emp["birth_date"], errors="coerce"
).dt.date

# Clamp discount to valid range [0, 1]
df_sales["discount"] = df_sales["discount"].clip(0, 1)
```
Python · Transform
```python
# Resolve FK: map city names → city_id from loaded cities table
city_lookup = pd.read_sql(
    "SELECT city_id, city_name FROM cities", engine
).set_index("city_name")["city_id"]
df_customers["city_id"] = df_customers["CityName"].map(city_lookup)

# Load sales in chunks for performance (large table)
df_sales.to_sql(
    "sales", engine,
    if_exists="append", index=False,
    chunksize=5000, method="multi",
)
```
Python · Load
The total_price trigger was a deliberate design choice: rather than computing the value in Python and risking rounding inconsistencies, we let PostgreSQL enforce the calculation at the database layer, ensuring every row in the sales table satisfies the same pricing formula regardless of how data enters the system.
Analytical Procedures
Twelve SQL procedures were implemented across three domains: product and category performance, salesperson effectiveness, and geographic opportunity. Each uses multiple joined tables and aggregation logic relevant to ABC Foodmart's business questions.
Product & Category Analysis
Six queries addressing category revenue trends, seasonality, peak timing, hero products, underperformers, and discount responsiveness.
Track category-level revenue across months to detect demand shifts, seasonal dips, and growth windows, replacing the manual monthly spreadsheet previously emailed between stores.
```sql
SELECT
    cat.category_name,
    EXTRACT(MONTH FROM s.sales_date) AS month,
    SUM(s.total_price) AS monthly_revenue
FROM sales s
JOIN products p ON s.product_id = p.product_id
JOIN categories cat ON p.category_id = cat.category_id
GROUP BY cat.category_name, month
ORDER BY cat.category_name, month;
```
SQL
Sample output (Beverages category, first 5 months):
| category_name | month | monthly_revenue |
|---|---|---|
| Beverages | 1 | 265,010.48 |
| Beverages | 2 | 243,361.08 |
| Beverages | 3 | 220,100.45 |
| Beverages | 4 | 241,889.08 |
| Beverages | 5 | 74,865.80 |
Identify which categories have strong vs. weak seasonal patterns. A seasonality index above 1.0 indicates a peak month; below 1.0 indicates a trough, guiding procurement timing and promotional calendars.
```sql
WITH monthly_rev AS (
    SELECT
        cat.category_name,
        EXTRACT(MONTH FROM s.sales_date) AS month,
        SUM(s.total_price) AS revenue
    FROM sales s
    JOIN products p ON s.product_id = p.product_id
    JOIN categories cat ON p.category_id = cat.category_id
    GROUP BY cat.category_name, month
),
category_avg AS (
    SELECT category_name, AVG(revenue) AS avg_revenue
    FROM monthly_rev
    GROUP BY category_name
)
SELECT
    m.category_name,
    m.month,
    m.revenue,
    ROUND(m.revenue / c.avg_revenue, 2) AS seasonality_index
FROM monthly_rev m
JOIN category_avg c ON m.category_name = c.category_name
ORDER BY m.category_name, m.month;
```
SQL · CTE
Surface the single best-performing month for each category using a window function, replacing the manual chart-reading previously done in Excel.
```sql
WITH monthly AS (
    SELECT
        cat.category_name,
        EXTRACT(MONTH FROM s.sales_date) AS month,
        SUM(s.total_price) AS revenue
    FROM sales s
    JOIN products p ON s.product_id = p.product_id
    JOIN categories cat ON p.category_id = cat.category_id
    GROUP BY cat.category_name, month
)
SELECT category_name, month AS peak_month, revenue
FROM (
    SELECT category_name, month, revenue,
           RANK() OVER (
               PARTITION BY category_name
               ORDER BY revenue DESC
           ) AS rnk
    FROM monthly
) ranked
WHERE rnk = 1
ORDER BY category_name;
```
SQL · RANK() window fn
| category_name | peak_month | revenue |
|---|---|---|
| Beverages | 1 | 265,010.48 |
| Cereals | 1 | 292,262.81 |
| Confections | 1 | 419,263.41 |
| Dairy | 1 | 266,983.76 |
| Grain | 1 | 238,437.17 |
Identify "hero products" driving category revenue (for sourcing and shelf priority) and the weakest performers (for rationalization or repositioning). The same CTE is reused; only the ORDER BY direction changes.
```sql
WITH product_revenue AS (
    SELECT
        cat.category_name,
        p.product_id,
        p.product_name,
        SUM(s.total_price) AS revenue
    FROM sales s
    JOIN products p ON s.product_id = p.product_id
    JOIN categories cat ON p.category_id = cat.category_id
    GROUP BY cat.category_name, p.product_id, p.product_name
),
ranked AS (
    SELECT *,
           RANK() OVER (
               PARTITION BY category_name
               ORDER BY revenue DESC   -- swap to ASC for lowest
           ) AS rnk
    FROM product_revenue
)
SELECT category_name, product_id, product_name, revenue AS total_revenue
FROM ranked
WHERE rnk = 1
ORDER BY category_name;
```
SQL · RANK() PARTITION BY
Top per category
| category | product | revenue |
|---|---|---|
| Beverages | Tia Maria | 60,895 |
| Cereals | Shrimp 31/40 | 59,439 |
| Confections | Pail w/ Handle | 77,734 |
| Dairy | Scampi Tail | 66,104 |
Lowest per category
| category | product | revenue |
|---|---|---|
| Beverages | Halibut Fletches | 1,582 |
| Cereals | Seedlings Mix | 506 |
| Confections | French Baguette | 3,573 |
| Dairy | Bread Fig Almond | 470 |
Measure discount elasticity: which products actually sell more when discounted, and which don't respond. This prevents margin loss from blanket promotions on inelastic items.
```sql
WITH product_qty AS (
    SELECT
        p.product_id,
        p.product_name,
        CASE WHEN s.discount > 0 THEN 'discounted'
             ELSE 'no_discount' END AS price_type,
        AVG(s.quantity) AS avg_qty
    FROM sales s
    JOIN products p ON s.product_id = p.product_id
    GROUP BY p.product_id, p.product_name, price_type
),
pivoted AS (
    SELECT
        product_id,
        product_name,
        MAX(CASE WHEN price_type = 'discounted' THEN avg_qty END) AS avg_qty_discount,
        MAX(CASE WHEN price_type = 'no_discount' THEN avg_qty END) AS avg_qty_no_discount
    FROM product_qty
    GROUP BY product_id, product_name
)
SELECT
    product_id,
    product_name,
    avg_qty_discount,
    avg_qty_no_discount,
    (avg_qty_discount - avg_qty_no_discount) AS qty_increase_due_to_discount
FROM pivoted
ORDER BY qty_increase_due_to_discount DESC NULLS LAST
LIMIT 10;
```
SQL · CASE pivot
Salesperson Analysis
Two queries: a global revenue ranking and a city-level performance ranking, both using RANK() window functions.
Create a dynamic revenue leaderboard for all salespeople, supporting performance reviews, bonus calculations, and territory planning without any manual Excel sorting.
```sql
SELECT
    emp.employee_id,
    emp.first_name || ' ' || emp.last_name AS employee_name,
    SUM(s.total_price) AS revenue,
    RANK() OVER (ORDER BY SUM(s.total_price) DESC) AS revenue_rank
FROM sales s
JOIN employees emp ON emp.employee_id = s.sales_person_id
GROUP BY emp.employee_id, employee_name
ORDER BY revenue DESC;
```
SQL · RANK() global
| rank | employee_name | revenue |
|---|---|---|
| 1 | Chadwick Walton | 607,337.21 |
| 2 | Wendi Buckley | 598,446.45 |
| 3 | Chadwick Cook | 591,659.32 |
| 4 | Pablo Cline | 589,504.76 |
| 5 | Katina Marks | 589,415.83 |
Rank salespeople within their local market, accounting for the fact that a city with lower overall demand should be measured differently. This identifies regional champions whose practices can be replicated.
```sql
WITH city_employee_revenue AS (
    SELECT
        city.city_name,
        emp.employee_id,
        emp.first_name || ' ' || emp.last_name AS employee_name,
        SUM(s.total_price) AS revenue
    FROM sales s
    JOIN employees emp ON s.sales_person_id = emp.employee_id
    JOIN customers cust ON s.customer_id = cust.customer_id
    JOIN cities city ON cust.city_id = city.city_id
    GROUP BY city.city_name, emp.employee_id, employee_name
),
ranked AS (
    SELECT *,
           RANK() OVER (
               PARTITION BY city_name
               ORDER BY revenue DESC
           ) AS rnk
    FROM city_employee_revenue
)
SELECT city_name, employee_id, employee_name, revenue AS total_revenue
FROM ranked
WHERE rnk = 1
ORDER BY city_name;
```
SQL · RANK() PARTITION BY city
Geographic Analysis
Four queries examining revenue distribution across countries and cities, from top markets down to underperforming locations, plus category preference by city.
Reveal which countries and cities drive the most sales, informing store expansion planning, marketing resource allocation, and inventory logistics.
```sql
SELECT
    ctry.country_name,
    city.city_name,
    SUM(s.total_price) AS total_revenue
FROM sales s
JOIN customers cust ON s.customer_id = cust.customer_id
JOIN cities city ON cust.city_id = city.city_id
JOIN countries ctry ON city.country_id = ctry.country_id
GROUP BY ctry.country_name, city.city_name
ORDER BY total_revenue DESC;
```
SQL · 4-table JOIN
| country | city | total_revenue |
|---|---|---|
| United States | St. Petersburg | 166,774.35 |
| United States | Philadelphia | 157,442.27 |
| United States | Aurora | 156,522.37 |
| United States | Albuquerque | 155,400.82 |
| United States | Dallas | 155,382.13 |
Local preferences vary: stocking Confections uniformly across all cities ignores meaningful regional differences. This query enables geographic assortment tailoring.
```sql
WITH city_category_revenue AS (
    SELECT
        city.city_name,
        cat.category_name,
        SUM(s.total_price) AS revenue
    FROM sales s
    JOIN customers cust ON s.customer_id = cust.customer_id
    JOIN cities city ON cust.city_id = city.city_id
    JOIN products p ON s.product_id = p.product_id
    JOIN categories cat ON p.category_id = cat.category_id
    GROUP BY city.city_name, cat.category_name
),
ranked AS (
    SELECT *,
           RANK() OVER (
               PARTITION BY city_name
               ORDER BY revenue DESC
           ) AS rnk
    FROM city_category_revenue
)
SELECT city_name, category_name AS top_category, revenue AS total_revenue
FROM ranked
WHERE rnk = 1
ORDER BY city_name;
```
SQL · 5-table JOIN + RANK() PARTITION BY
| city | top_category | revenue |
|---|---|---|
| Akron | Poultry | 21,087.67 |
| Albuquerque | Cereals | 24,840.13 |
| Anaheim | Confections | 23,321.72 |
| Anchorage | Confections | 21,997.42 |
| Arlington | Dairy | 14,020.53 |
Finding: Confections dominated as the top revenue category in most cities, confirmed across both the SQL output and the Metabase dashboard. This validates it as the category to anchor national promotions on, while allowing local adjustments for cities like Akron (Poultry) and Albuquerque (Cereals).
Executive Dashboard
The Metabase dashboard gives leadership real-time visibility into revenue, product mix, and market dynamics without requiring SQL knowledge. It is organized into two views: an Executive Overview and a deeper Product & Market Insights view.
Business Scale at a Glance
The three KPI cards at the top confirm the database is operating at a realistic scale, demonstrating that the schema and ETL pipeline handle production-level volume correctly.
Note: the dashboard screenshot could not be uploaded; the widget above recreates the key panels from the Metabase output for portfolio presentation.
Key Dashboard Insights
Key Takeaways & Reflection
What Worked Well
- The `RESTART IDENTITY CASCADE` truncation pattern made the ETL fully repeatable, critical for testing and future data refreshes without orphaned records.
- Delegating `total_price` computation to a PostgreSQL trigger instead of Python ensured every row satisfied the same pricing formula, eliminating a class of transformation bugs.
- Using `RANK() OVER (PARTITION BY ...)` consistently across analyses (peak month, top product, top employee per city) reduced query complexity and made results directly comparable across entities.
- Keeping the schema in 3NF, routing location through `Customer → City → Country` rather than denormalizing, paid off immediately in the geographic queries, which joined cleanly across all four tables without any duplication issues.
Challenges & Lessons
- Mixed date formats in the raw CSVs required `pd.to_datetime(errors='coerce')` plus manual inspection of NULL rates before loading, a good reminder to validate before transforming, not after.
- The `UNIQUE NULLS NOT DISTINCT` constraint on `country_code` is a PostgreSQL 15+ feature, a useful reminder that constraint syntax varies by database version.
- Some product names in the raw data exceeded the initial `VARCHAR(100)` limit, requiring adjustment to `VARCHAR(200)` before loading, caught through constraint testing before the production load.
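Both lessons point toward a lightweight pre-load audit. A sketch of what that might look like with pandas follows; `preflight` is a hypothetical helper (not part of the original pipeline), and the 200-character limit mirrors the adjusted `VARCHAR(200)`:

```python
import pandas as pd

def preflight(df: pd.DataFrame, date_cols=(), text_limits=None) -> dict:
    """Report coerced-date NULL rates and over-length strings before loading."""
    report = {}
    for col in date_cols:
        # Share of values that would become NULL after coercion
        parsed = pd.to_datetime(df[col], errors="coerce")
        report[f"{col}_null_rate"] = float(parsed.isna().mean())
    for col, limit in (text_limits or {}).items():
        # Count of strings that would violate a VARCHAR(limit) column
        too_long = df[col].astype("string").str.len() > limit
        report[f"{col}_over_{limit}"] = int(too_long.sum())
    return report

df = pd.DataFrame({
    "sales_date": ["2024-01-05", "13/45/2024"],   # second date is malformed
    "product_name": ["Tia Maria", "x" * 250],     # second name is too long
})
print(preflight(df, date_cols=["sales_date"],
                text_limits={"product_name": 200}))
```

Run against each source file before `to_sql`, a report like this would have surfaced both the malformed dates and the over-length product names ahead of any constraint violation.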