Sep to Dec 2025 · PostgreSQL · Python · SQL · Metabase

Modern Data Infrastructure
for ABC Foodmart

An end-to-end database system that transforms 30 years of fragmented Excel files into a centralized, 3NF-normalized PostgreSQL database with a Python ETL pipeline, 12 analytical SQL procedures, and executive dashboards in Metabase.

Problem & Objectives

ABC Foodmart is a neighborhood grocery chain that has operated for over 30 years using Excel spreadsheets, email attachments, and paper binders to manage products, staff schedules, sales logs, and vendor information. With plans to expand from two stores to five, this fragmented workflow was causing slow decision-making, reporting inconsistencies, and data entry errors.

Our solution: a fully normalized 3NF PostgreSQL relational database supported by a reproducible Python ETL pipeline and Metabase dashboards, giving analysts SQL access and executives real-time insights without requiring technical expertise.

Database Schema

The schema organizes data into six dimension tables surrounding a central sales fact table, separating slow-changing descriptive data from high-volume transactional records. All tables are in Third Normal Form: atomic attributes, no partial dependencies, no transitive dependencies.

Key design choice: Customer location does not store country_name directly. Instead, customers → city_id → cities → country_id → countries, eliminating the transitive dependency customer_id → city_id → country_name.

Countries
  PK country_id INT
     country_name VARCHAR(100)
     country_code VARCHAR(3)
Cities
  PK city_id INT
     city_name VARCHAR(120)
     zipcode VARCHAR(20)
  FK country_id INT
Customers
  PK customer_id INT
     first_name VARCHAR(60)
     last_name VARCHAR(60)
     address VARCHAR(200)
  FK city_id INT
Employees
  PK employee_id INT
     first_name VARCHAR(60)
     birth_date DATE
     gender VARCHAR(10)
     hire_date DATE
  FK city_id INT
Categories
  PK category_id INT
     category_name VARCHAR(100)
Products
  PK product_id INT
     product_name VARCHAR(200)
     price NUMERIC(12,2)
     class / resistant VARCHAR
     is_allergic BOOLEAN
  FK category_id INT
Sales Fact Table
  PK sales_id BIGINT
     quantity INT (> 0)
  FK sales_person_id INT
     discount NUMERIC(5,4) (0 to 1)
  FK customer_id INT
     total_price NUMERIC(14,2) (computed by trigger)
  FK product_id INT
     sales_date DATE

Design decisions: BIGINT for sales_id to accommodate high transaction volume; NUMERIC types for all prices to avoid floating-point rounding errors in financial calculations; ON DELETE RESTRICT to prevent orphaned child records; and a PostgreSQL trigger that recomputes total_price = quantity × price × (1 − discount) on every insert, guaranteeing consistency.
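As a sanity check during development, the trigger's formula can be mirrored in Python with decimal arithmetic. This is a sketch only: enforcement lives in the database trigger, and ROUND_HALF_UP is assumed to match PostgreSQL's NUMERIC rounding for positive values.

```python
from decimal import Decimal, ROUND_HALF_UP

def expected_total_price(quantity: int, price: Decimal,
                         discount: Decimal) -> Decimal:
    """Mirror the trigger: quantity * price * (1 - discount),
    rounded to 2 places like NUMERIC(14,2)."""
    raw = Decimal(quantity) * price * (Decimal(1) - discount)
    return raw.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)

# 3 units at 19.99 with a 10% discount: 59.97 * 0.9 = 53.973 -> 53.97
print(expected_total_price(3, Decimal("19.99"), Decimal("0.1000")))
```

Comparing this function's output against a sample of loaded rows is a cheap way to verify the trigger fired on every insert.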

ETL Pipeline

The ETL pipeline converts seven CSV files from Kaggle into a clean, normalized PostgreSQL database. Built with pandas and SQLAlchemy, the workflow is fully repeatable: every run truncates all tables with RESTART IDENTITY CASCADE before reloading, ensuring idempotency.

Extract
Load & Standardize Raw CSVs
Import each file into a pandas DataFrame. Standardize column names to match the SQL schema (e.g., CityName → city_name). Validate structure: check for missing columns, unexpected types, and malformed dates before any transformation begins.
Transform
Clean, Normalize & Map Foreign Keys
Sanitize name fields into first / middle_initial / last components. Standardize gender to Male | Female | Other | Unspecified. Parse dates from mixed formats to ISO DATE. Map boolean-like strings to TRUE / FALSE / NULL. Validate numerics against CHECK constraints (price ≥ 0, 0 ≤ discount ≤ 1, quantity > 0). Resolve foreign key IDs by joining country and city name lookups before loading child tables.
Load
Dependency-Ordered Insert into PostgreSQL
Load in referential order: countries → cities → customers → employees → categories → products → sales. High-volume tables (especially sales) are written in chunks to avoid memory pressure. total_price is intentionally set to 0 on insert; a trigger fires immediately after each row to recompute the correct value, guaranteeing accuracy regardless of the source data.
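The extract-stage column standardization can be sketched as below. The renaming rule is an illustrative assumption, not the project's exact helper; real schema names (for example zipcode, which has no underscore) would need explicit overrides.

```python
import re
import pandas as pd

# Referential load order: parents before children
LOAD_ORDER = ["countries", "cities", "customers", "employees",
              "categories", "products", "sales"]

def standardize_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Extract step: CamelCase CSV headers -> snake_case schema names.
    Generic rule; columns like 'zipcode' would need a manual override."""
    out = df.copy()
    out.columns = [re.sub(r"(?<!^)(?=[A-Z])", "_", c).lower()
                   for c in out.columns]
    return out

raw = pd.DataFrame({"CityName": ["Austin"], "ZipCode": ["73301"]})
clean = standardize_columns(raw)
print(list(clean.columns))  # ['city_name', 'zip_code']
```

Keeping the load order in a single list makes the dependency ordering explicit and reusable for both truncation (reversed) and insertion.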

Core ETL Code

Below are the key patterns used throughout the pipeline: practical SQL-in-Python patterns directly applicable in junior data engineering roles.

# Connect to PostgreSQL via SQLAlchemy
from sqlalchemy import create_engine, text
import pandas as pd

engine = create_engine(
    "postgresql+psycopg2://user:password@localhost:5432/foodmart"
)

# Truncate all tables before each reload: ensures idempotency.
# engine.begin() wraps the statement in a transaction that commits on
# success; text() is required for raw SQL strings in SQLAlchemy 2.0.
with engine.begin() as conn:
    conn.execute(text("""
        TRUNCATE TABLE sales, products, categories,
                       employees, customers, cities, countries
        RESTART IDENTITY CASCADE;
    """))
Python · ETL Setup
# --- Transformation: clean gender field ---
gender_map = {
    "m": "Male", "f": "Female",
    "male": "Male", "female": "Female",
    "other": "Other"
}
df_emp["gender"] = (
    df_emp["gender"]
    .str.lower()
    .map(gender_map)
    .fillna("Unspecified")
)

# Parse mixed date formats safely
df_emp["birth_date"] = pd.to_datetime(
    df_emp["birth_date"], errors="coerce"
).dt.date

# Clamp discount to valid range [0, 1]
df_sales["discount"] = df_sales["discount"].clip(0, 1)
Python · Transform
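The transform stage also maps boolean-like strings to TRUE / FALSE / NULL, which the snippets above do not show. A minimal sketch, assuming hypothetical raw values like "Yes" and "no":

```python
import pandas as pd

# Boolean-like raw strings (hypothetical values) -> True / False / <NA>
bool_map = {"yes": True, "y": True, "true": True, "1": True,
            "no": False, "n": False, "false": False, "0": False}

s = pd.Series(["Yes", "no", "TRUE", "maybe", None])
cleaned = s.str.strip().str.lower().map(bool_map).astype("boolean")
print(cleaned.tolist())  # [True, False, True, <NA>, <NA>]
```

Unrecognized values fall through to pandas' nullable <NA>, which loads into the BOOLEAN column as SQL NULL rather than silently defaulting to False.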
# Resolve FK: map city names → city_id from loaded cities table
city_lookup = pd.read_sql(
    "SELECT city_id, city_name FROM cities", engine
).set_index("city_name")["city_id"]

df_customers["city_id"] = (
    df_customers["CityName"].map(city_lookup)
)

# Load sales in chunks for performance (large table)
df_sales.to_sql(
    "sales", engine,
    if_exists="append",
    index=False,
    chunksize=5000,
    method="multi"
)
Python · Load
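Before the dependency-ordered load, it pays to fail fast on foreign keys that did not resolve, since unmapped names become NaN and would violate the FK constraint. A sketch with toy lookup data:

```python
import pandas as pd

# Toy lookup standing in for the loaded cities table
city_lookup = pd.Series({"Austin": 1, "Dallas": 2}, name="city_id")

df_customers = pd.DataFrame(
    {"CityName": ["Austin", "Springfield", "Dallas"]}
)
df_customers["city_id"] = df_customers["CityName"].map(city_lookup)

# Unresolved names become NaN and would violate the FK on load
unmapped = (df_customers.loc[df_customers["city_id"].isna(), "CityName"]
            .unique().tolist())
print(unmapped)  # ['Springfield']
```

Surfacing the offending names up front turns an opaque IntegrityError at load time into an actionable data-quality report.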

The total_price trigger was a deliberate design choice: rather than computing the value in Python and risking rounding inconsistencies, we let PostgreSQL enforce the calculation at the database layer, ensuring every row in the sales table satisfies the same pricing formula regardless of how data enters the system.

Analytical Procedures

Twelve SQL procedures were implemented across three domains: product and category performance, salesperson effectiveness, and geographic opportunity. Each uses multiple joined tables and aggregation logic relevant to ABC Foodmart's business questions.

Product & Category Analysis

Six queries addressing category revenue trends, seasonality, peak timing, hero products, underperformers, and discount responsiveness.

1.1 Monthly Revenue Trends by Category tables: sales · products · categories
Business need

Track category-level revenue across months to detect demand shifts, seasonal dips, and growth windows, replacing the manual monthly spreadsheet previously emailed between stores.

SELECT
    cat.category_name,
    EXTRACT(MONTH FROM s.sales_date) AS month,
    SUM(s.total_price)             AS monthly_revenue
FROM sales s
JOIN products  p   ON s.product_id  = p.product_id
JOIN categories cat ON p.category_id = cat.category_id
GROUP BY cat.category_name, month
ORDER BY cat.category_name, month;
SQL

Sample output (Beverages category, first 5 months):

category_name | month | monthly_revenue
Beverages | 1 | 265,010.48
Beverages | 2 | 243,361.08
Beverages | 3 | 220,100.45
Beverages | 4 | 241,889.08
Beverages | 5 | 74,865.80
1.2 Category Seasonality Index tables: sales · products · categories
Business need

Identify which categories have strong vs. weak seasonal patterns. A seasonality index above 1.0 indicates a peak month; below 1.0 indicates a trough, guiding procurement timing and promotional calendars.

WITH monthly_rev AS (
    SELECT
        cat.category_name,
        EXTRACT(MONTH FROM s.sales_date) AS month,
        SUM(s.total_price) AS revenue
    FROM sales s
    JOIN products p    ON s.product_id  = p.product_id
    JOIN categories cat ON p.category_id = cat.category_id
    GROUP BY cat.category_name, month
),
category_avg AS (
    SELECT category_name, AVG(revenue) AS avg_revenue
    FROM monthly_rev
    GROUP BY category_name
)
SELECT
    m.category_name,
    m.month,
    m.revenue,
    ROUND(m.revenue / c.avg_revenue, 2) AS seasonality_index
FROM monthly_rev m
JOIN category_avg c ON m.category_name = c.category_name
ORDER BY m.category_name, m.month;
SQL · CTE
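As a development cross-check (not part of the production pipeline), the seasonality index can be replicated in pandas with groupby/transform; a sketch on toy figures:

```python
import pandas as pd

# Toy monthly revenue for one category
monthly = pd.DataFrame({
    "category_name": ["Beverages"] * 3,
    "month": [1, 2, 3],
    "revenue": [300.0, 200.0, 100.0],
})

# seasonality_index = month revenue / category's average monthly revenue
avg = monthly.groupby("category_name")["revenue"].transform("mean")
monthly["seasonality_index"] = (monthly["revenue"] / avg).round(2)
print(monthly["seasonality_index"].tolist())  # [1.5, 1.0, 0.5]
```

transform("mean") broadcasts each category's average back onto its rows, which is exactly what the category_avg CTE join accomplishes in the SQL.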
1.3 Peak Month Identification per Category tables: sales · products · categories · window fn
Business need

Surface the single best-performing month for each category using a window function, replacing the manual chart-reading previously done in Excel.

WITH monthly AS (
    SELECT
        cat.category_name,
        EXTRACT(MONTH FROM s.sales_date) AS month,
        SUM(s.total_price) AS revenue
    FROM sales s
    JOIN products p    ON s.product_id  = p.product_id
    JOIN categories cat ON p.category_id = cat.category_id
    GROUP BY cat.category_name, month
)
SELECT category_name, month, revenue
FROM (
    SELECT
        category_name, month, revenue,
        RANK() OVER (
            PARTITION BY category_name
            ORDER BY revenue DESC
        ) AS rnk
    FROM monthly
) ranked
WHERE rnk = 1
ORDER BY category_name;
SQL · RANK() window fn
category_name | peak_month | revenue
Beverages | 1 | 265,010.48
Cereals | 1 | 292,262.81
Confections | 1 | 419,263.41
Dairy | 1 | 266,983.76
Grain | 1 | 238,437.17
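For validating the RANK() result during development, a rough pandas equivalent can be sketched with idxmax. Note one behavioral difference: idxmax keeps only the first row on ties, whereas RANK() would return all tied peak months.

```python
import pandas as pd

monthly = pd.DataFrame({
    "category_name": ["Beverages", "Beverages", "Dairy", "Dairy"],
    "month": [1, 2, 1, 2],
    "revenue": [265010.48, 243361.08, 266983.76, 150000.00],
})

# idxmax keeps one row per category; unlike RANK(), ties return
# only the first match
peaks = monthly.loc[monthly.groupby("category_name")["revenue"].idxmax()]
print(peaks[["category_name", "month"]].to_dict("records"))
```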
1.4 & 1.5 Top & Lowest-Grossing Product per Category tables: sales · products · categories · RANK()
Business need

Identify "hero products" driving category revenue (for sourcing and shelf priority) and the weakest performers (for rationalization or repositioning). The same CTE is reused; only the ORDER BY direction changes.

WITH product_revenue AS (
    SELECT
        cat.category_name,
        p.product_id,
        p.product_name,
        SUM(s.total_price) AS revenue
    FROM sales s
    JOIN products p    ON s.product_id  = p.product_id
    JOIN categories cat ON p.category_id = cat.category_id
    GROUP BY cat.category_name, p.product_id, p.product_name
),
ranked AS (
    SELECT *,
        RANK() OVER (
            PARTITION BY category_name
            ORDER BY revenue DESC  -- swap ASC for lowest
        ) AS rnk
    FROM product_revenue
)
SELECT category_name, product_id, product_name,
       revenue AS total_revenue
FROM ranked
WHERE rnk = 1
ORDER BY category_name;
SQL · RANK() PARTITION BY

Top per category

category | product | revenue
Beverages | Tia Maria | 60,895
Cereals | Shrimp 31/40 | 59,439
Confections | Pail w/ Handle | 77,734
Dairy | Scampi Tail | 66,104

Lowest per category

category | product | revenue
Beverages | Halibut Fletches | 1,582
Cereals | Seedlings Mix | 506
Confections | French Baguette | 3,573
Dairy | Bread Fig Almond | 470
1.6 Products Most Impacted by Discounts tables: sales · products · CASE pivot
Business need

Measure discount elasticity: which products actually sell more when discounted, and which don't respond. This prevents margin loss from blanket promotions on inelastic items.

WITH product_qty AS (
    SELECT
        p.product_id,
        p.product_name,
        CASE WHEN s.discount > 0
             THEN 'discounted'
             ELSE 'no_discount'
        END AS price_type,
        AVG(s.quantity) AS avg_qty
    FROM sales s
    JOIN products p ON s.product_id = p.product_id
    GROUP BY p.product_id, p.product_name, price_type
),
pivoted AS (
    SELECT
        product_id, product_name,
        MAX(CASE WHEN price_type = 'discounted'
               THEN avg_qty END) AS avg_qty_discount,
        MAX(CASE WHEN price_type = 'no_discount'
               THEN avg_qty END) AS avg_qty_no_discount
    FROM product_qty
    GROUP BY product_id, product_name
)
SELECT
    product_id, product_name,
    avg_qty_discount,
    avg_qty_no_discount,
    (avg_qty_discount - avg_qty_no_discount)
        AS qty_increase_due_to_discount
FROM pivoted
ORDER BY qty_increase_due_to_discount DESC NULLS LAST
LIMIT 10;
SQL · CASE pivot
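The same pivot can be prototyped in pandas before committing to the SQL, using pivot_table in place of the MAX(CASE WHEN ...) pattern; a sketch on toy rows:

```python
import pandas as pd

# Toy sales rows for one product
sales = pd.DataFrame({
    "product_name": ["Tia Maria"] * 4,
    "discount": [0.0, 0.2, 0.0, 0.1],
    "quantity": [2, 5, 4, 7],
})
sales["price_type"] = sales["discount"].gt(0).map(
    {True: "discounted", False: "no_discount"}
)

# pivot_table replaces the MAX(CASE WHEN ...) pattern in the SQL
pivoted = sales.pivot_table(index="product_name", columns="price_type",
                            values="quantity", aggfunc="mean")
pivoted["qty_lift"] = pivoted["discounted"] - pivoted["no_discount"]
print(pivoted["qty_lift"].tolist())  # [3.0]
```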

Salesperson Analysis

Two queries: a global revenue ranking and a city-level performance ranking, both using RANK() window functions.

2.1 Revenue-Based Salesperson Ranking tables: sales · employees · RANK()
Business need

Create a dynamic revenue leaderboard for all salespeople, supporting performance reviews, bonus calculations, and territory planning without manual Excel sorting.

SELECT
    emp.employee_id,
    emp.first_name || ' ' || emp.last_name AS employee_name,
    SUM(s.total_price) AS revenue,
    RANK() OVER (
        ORDER BY SUM(s.total_price) DESC
    ) AS revenue_rank
FROM sales s
JOIN employees emp ON emp.employee_id = s.sales_person_id
GROUP BY emp.employee_id, employee_name
ORDER BY revenue DESC;
SQL · RANK() global
rank | employee_name | revenue
1 | Chadwick Walton | 607,337.21
2 | Wendi Buckley | 598,446.45
3 | Chadwick Cook | 591,659.32
4 | Pablo Cline | 589,504.76
5 | Katina Marks | 589,415.83
2.2 Top-Grossing Employee in Each City tables: sales · employees · customers · cities · PARTITION BY
Business need

Rank salespeople within their local market, accounting for the fact that a city with lower overall demand should be measured differently. This identifies regional champions whose practices can be replicated.

WITH city_employee_revenue AS (
    SELECT
        city.city_name,
        emp.employee_id,
        emp.first_name || ' ' || emp.last_name AS employee_name,
        SUM(s.total_price) AS revenue
    FROM sales s
    JOIN employees emp  ON s.sales_person_id = emp.employee_id
    JOIN customers cust ON s.customer_id     = cust.customer_id
    JOIN cities city    ON cust.city_id      = city.city_id
    GROUP BY city.city_name, emp.employee_id, employee_name
),
ranked AS (
    SELECT *,
        RANK() OVER (
            PARTITION BY city_name
            ORDER BY revenue DESC
        ) AS rnk
    FROM city_employee_revenue
)
SELECT city_name, employee_id, employee_name,
       revenue AS total_revenue
FROM ranked
WHERE rnk = 1
ORDER BY city_name;
SQL · RANK() PARTITION BY city

Geographic Analysis

Four queries examining revenue distribution across countries and cities, from top markets down to underperforming locations, plus category preference by city.

3.1 Regional Revenue Distribution tables: sales · customers · cities · countries
Business need

Reveal which countries and cities drive the most sales, informing store expansion planning, marketing resource allocation, and inventory logistics.

SELECT
    ctry.country_name,
    city.city_name,
    SUM(s.total_price) AS total_revenue
FROM sales s
JOIN customers cust ON s.customer_id  = cust.customer_id
JOIN cities city    ON cust.city_id   = city.city_id
JOIN countries ctry ON city.country_id = ctry.country_id
GROUP BY ctry.country_name, city.city_name
ORDER BY total_revenue DESC;
SQL · 4-table JOIN
country | city | total_revenue
United States | St. Petersburg | 166,774.35
United States | Philadelphia | 157,442.27
United States | Aurora | 156,522.37
United States | Albuquerque | 155,400.82
United States | Dallas | 155,382.13
3.4 Top-Grossing Category in Each City tables: sales · customers · cities · products · categories · RANK()
Business need

Local preferences vary: stocking Confections uniformly across all cities ignores meaningful regional differences. This query enables geographic assortment tailoring.

WITH city_category_revenue AS (
    SELECT
        city.city_name,
        cat.category_name,
        SUM(s.total_price) AS revenue
    FROM sales s
    JOIN customers cust ON s.customer_id  = cust.customer_id
    JOIN cities city    ON cust.city_id   = city.city_id
    JOIN products p     ON s.product_id   = p.product_id
    JOIN categories cat ON p.category_id  = cat.category_id
    GROUP BY city.city_name, cat.category_name
),
ranked AS (
    SELECT *,
        RANK() OVER (
            PARTITION BY city_name
            ORDER BY revenue DESC
        ) AS rnk
    FROM city_category_revenue
)
SELECT
    city_name,
    category_name AS top_category,
    revenue       AS total_revenue
FROM ranked
WHERE rnk = 1
ORDER BY city_name;
SQL · 5-table JOIN + RANK() PARTITION BY
city | top_category | revenue
Akron | Poultry | 21,087.67
Albuquerque | Cereals | 24,840.13
Anaheim | Confections | 23,321.72
Anchorage | Confections | 21,997.42
Arlington | Dairy | 14,020.53

Finding: Confections dominated as the top revenue category in most cities, confirmed across both the SQL output and the Metabase dashboard. This validates it as the category to anchor national promotions on, while allowing local adjustments for cities like Akron (Poultry) and Albuquerque (Cereals).

Executive Dashboard

The Metabase dashboard gives leadership real-time visibility into revenue, product mix, and market dynamics without requiring SQL knowledge. It is organized into two views: an Executive Overview and a deeper Product & Market Insights view.

Business Scale at a Glance

The three KPI cards at the top confirm the database is operating at a realistic scale, demonstrating that the schema and ETL pipeline handle enterprise-level volume correctly.

SQL Project Dashboard (Metabase · Executive Overview)

KPI cards: $4.3B Total Sales · 98,759 Total Customers · 6.75M Total Transactions

Sales by Category: Confections $535M · Meat $496M · Poultry $468M · Cereals $402M · Beverages $368M · Dairy $323M

Top Salesperson Revenue Ranking: C. Walton $607K · W. Buckley $598K · C. Cook $592K · P. Cline $590K · K. Marks $589K

Note: Dashboard screenshot could not be uploaded; the widget above recreates the key panels from the Metabase output for portfolio presentation.

Key Dashboard Insights

Category
Revenue Concentration in a Few Key Categories
Confections, Meat, and Poultry consistently outperform the rest, and the bar chart confirms this concentration. These categories warrant priority attention in purchasing, shelf placement, and promotional investment, rather than spreading resources evenly across all categories.
Geography
Even City Distribution Suggests Diversified Geographic Risk
Revenue is relatively evenly spread across cities, with no single city dominating or lagging far behind. This suggests geographic risk is diversified and that many markets contribute meaningfully. Attention should focus on sharing best practices from the slightly stronger cities rather than over-investing in a few hero markets.
Discount
Only a Subset of Products Responds to Price Cuts
The discount elasticity analysis surfaces a small set of highly responsive SKUs (e.g., Cattail Hearts, French Pastry) that warrant targeted promotions. The majority of products show minimal quantity lift from discounts; applying blanket promotions to them would only erode margin without driving incremental volume.
Salesperson
Local Champions as Training References
The top-grossing employee per city output enables management to identify which salespeople outperform in their local market context and use them as coaching references for regional staffing and incentive design.

Key Takeaways & Reflection

What Worked Well

  • The RESTART IDENTITY CASCADE truncation pattern made the ETL fully repeatable, which proved critical for testing and future data refreshes without orphaned records.
  • Delegating total_price computation to a PostgreSQL trigger instead of Python ensured every row satisfied the same pricing formula, eliminating a class of transformation bugs.
  • Using RANK() OVER (PARTITION BY ...) consistently across analyses (peak month, top product, top employee per city) reduced query complexity and made results directly comparable across entities.
  • Keeping the schema in 3NF (routing location through Customer → City → Country rather than denormalizing) paid off immediately in the geographic queries, which joined cleanly across all four tables without duplication issues.

Challenges & Lessons

  • Mixed date formats in the raw CSVs required pd.to_datetime(errors='coerce') plus manual inspection of NULL rates before loading: a good reminder to validate before transforming, not after.
  • The UNIQUE NULLS NOT DISTINCT constraint on country_code is a PostgreSQL 15+ feature, a useful reminder that constraint syntax varies by database version.
  • Some product names in the raw data exceeded the initial VARCHAR(100) limit, requiring adjustment to VARCHAR(200); this was caught through constraint testing before the production load.