Modern Data Infrastructure
for ABC Foodmart
An end-to-end database system that transforms 30 years of fragmented Excel files into a centralized, 3NF-normalized PostgreSQL database with a Python ETL pipeline, 12 analytical SQL procedures, and executive dashboards in Metabase.
Problem & Objectives
ABC Foodmart is a neighborhood grocery chain that has operated for over 30 years using Excel spreadsheets, email attachments, and paper binders to manage products, staff schedules, sales logs, and vendor information. With plans to expand from two stores to five, this fragmented workflow was causing slow decision-making, reporting inconsistencies, and data entry errors.
Our solution: a fully normalized 3NF PostgreSQL relational database supported by a reproducible Python ETL pipeline and Metabase dashboards, giving analysts SQL access and executives real-time insights without requiring technical expertise.
Database Schema
The schema organizes data into six dimensional tables surrounding a central sales fact table, separating slow-changing descriptive data from high-volume transactional records. All tables are in Third Normal Form: atomic attributes, no partial or transitive dependencies.
Key design choice: Customer location does not store country_name directly. Instead, customers → city_id → cities → country_id → countries, eliminating the transitive dependency customer_id → city_id → country_name.
Design decisions: BIGINT for sales_id to accommodate high transaction volume; NUMERIC(12,2) for all prices to avoid floating-point rounding errors in financial calculations; ON DELETE RESTRICT prevents orphaned child records; a PostgreSQL trigger recomputes total_price = quantity × price × (1 − discount) on every insert, guaranteeing consistency.
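The pricing formula the trigger enforces can be sketched in plain Python with `decimal.Decimal`, illustrating why `NUMERIC(12,2)` was chosen over floating point. This is a minimal sketch, not the pipeline's code; the actual enforcement lives in the database trigger, and `ROUND_HALF_UP` is assumed here to mirror how PostgreSQL rounds positive `NUMERIC` values to scale.

```python
from decimal import Decimal, ROUND_HALF_UP

def total_price(quantity: int, price: Decimal, discount: Decimal) -> Decimal:
    """Mirror of the trigger's formula: quantity * price * (1 - discount),
    rounded to 2 decimal places like NUMERIC(12,2)."""
    raw = Decimal(quantity) * price * (Decimal("1") - discount)
    return raw.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)

# Binary floats drift on decimal fractions; Decimal does not.
assert 0.1 + 0.2 != 0.3                                    # float artifact
assert Decimal("0.1") + Decimal("0.2") == Decimal("0.3")   # exact

print(total_price(3, Decimal("19.99"), Decimal("0.10")))   # 53.97
```

Doing the same arithmetic in binary floats (`3 * 19.99 * 0.9`) yields `53.972999…`, which is exactly the class of rounding error the `NUMERIC` columns avoid.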
ETL Pipeline
The ETL pipeline converts seven CSV files from Kaggle into a clean, normalized PostgreSQL database. Built with pandas and SQLAlchemy, the workflow is fully repeatable: every run truncates all tables with RESTART IDENTITY CASCADE before reloading, ensuring idempotency.
Standardize column names (e.g., CityName → city_name). Validate structure: check for missing columns, unexpected types, and malformed dates before any transformation begins.
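A structural check of this kind can be sketched with pandas. This is a hypothetical `validate_structure` helper, not the pipeline's actual code, and the column names are illustrative:

```python
import pandas as pd

REQUIRED_COLUMNS = {"EmployeeID", "FirstName", "LastName", "BirthDate", "Gender"}

def validate_structure(df: pd.DataFrame, required: set) -> list:
    """Return a list of problems found; empty means the frame looks sane."""
    problems = [f"missing column: {c}" for c in sorted(required - set(df.columns))]
    if "BirthDate" in df.columns:
        parsed = pd.to_datetime(df["BirthDate"], errors="coerce")
        # Dates that were present but failed to parse
        bad = int(parsed.isna().sum()) - int(df["BirthDate"].isna().sum())
        if bad:
            problems.append(f"{bad} malformed dates in BirthDate")
    return problems

df = pd.DataFrame({"EmployeeID": [1], "FirstName": ["Ana"],
                   "LastName": ["Ng"], "BirthDate": ["not-a-date"]})
print(validate_structure(df, REQUIRED_COLUMNS))
# ['missing column: Gender', '1 malformed dates in BirthDate']
```

Running checks like this before any transformation means a malformed source file fails loudly at the start of the run instead of loading partial garbage.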
Normalize gender values to Male | Female | Other | Unspecified. Parse dates from mixed formats to ISO DATE. Map boolean-like strings to TRUE / FALSE / NULL. Validate numerics against CHECK constraints (price ≥ 0, 0 ≤ discount ≤ 1, quantity > 0). Resolve foreign key IDs by joining country and city name lookups before loading child tables.
Load tables in foreign-key dependency order: countries → cities → customers → employees → categories → products → sales. High-volume tables (especially sales) are written in chunks to avoid memory pressure. total_price is intentionally set to 0 on insert; a trigger fires immediately after each row to recompute the correct value, guaranteeing accuracy regardless of the source data.
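The dependency ordering can be made explicit and checked programmatically, so a refactor cannot accidentally load a child table before its parent. `LOAD_ORDER` matches the order above; `FK_DEPS` is an illustrative map of the schema's parent relationships, not code from the original pipeline:

```python
# Parent tables must be loaded before any table holding a FK to them.
LOAD_ORDER = ["countries", "cities", "customers", "employees",
              "categories", "products", "sales"]

FK_DEPS = {                       # child -> parents it references
    "cities": {"countries"},
    "customers": {"cities"},
    "products": {"categories"},
    "sales": {"customers", "employees", "products"},
}

def order_is_valid(order, deps):
    """Every parent must appear before its child in the load order."""
    position = {t: i for i, t in enumerate(order)}
    return all(position[parent] < position[child]
               for child, parents in deps.items()
               for parent in parents)

assert order_is_valid(LOAD_ORDER, FK_DEPS)
```

With ON DELETE RESTRICT in place, a wrong ordering fails immediately on the first foreign-key violation, so this check simply moves that failure from runtime to a unit test.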
Core ETL Code
Below are the key patterns used throughout the pipeline: practical SQL-in-Python patterns directly applicable in junior data engineering roles.
```python
# Connect to PostgreSQL via SQLAlchemy
from sqlalchemy import create_engine, text
import pandas as pd

engine = create_engine(
    "postgresql+psycopg2://user:password@localhost:5432/foodmart"
)

# Truncate all tables before each reload to ensure idempotency.
# engine.begin() opens a transaction and commits on success.
with engine.begin() as conn:
    conn.execute(text("""
        TRUNCATE TABLE sales, products, categories, employees,
                       customers, cities, countries
        RESTART IDENTITY CASCADE;
    """))
```
Python · ETL Setup
```python
# --- Transformation: clean gender field ---
gender_map = {
    "m": "Male", "f": "Female",
    "male": "Male", "female": "Female",
    "other": "Other",
}
df_emp["gender"] = (
    df_emp["gender"]
    .str.lower()
    .map(gender_map)
    .fillna("Unspecified")
)

# Parse mixed date formats safely
df_emp["birth_date"] = pd.to_datetime(
    df_emp["birth_date"], errors="coerce"
).dt.date

# Clamp discount to valid range [0, 1]
df_sales["discount"] = df_sales["discount"].clip(0, 1)
```
Python · Transform
```python
# Resolve FK: map city names → city_id from loaded cities table
city_lookup = pd.read_sql(
    "SELECT city_id, city_name FROM cities", engine
).set_index("city_name")["city_id"]
df_customers["city_id"] = df_customers["CityName"].map(city_lookup)

# Load sales in chunks for performance (large table)
df_sales.to_sql(
    "sales", engine,
    if_exists="append", index=False,
    chunksize=5000, method="multi",
)
```
Python · Load
The total_price trigger was a deliberate design choice: rather than computing the value in Python and risking rounding inconsistencies, we let PostgreSQL enforce the calculation at the database layer, ensuring every row in the sales table satisfies the same pricing formula regardless of how data enters the system.
Analytical Procedures
Twelve SQL procedures were implemented across three domains: product and category performance, salesperson effectiveness, and geographic opportunity. Each uses multiple joined tables and aggregation logic relevant to ABC Foodmart's business questions.
Product & Category Analysis
Six queries addressing category revenue trends, seasonality, peak timing, hero products, underperformers, and discount responsiveness.
Track category-level revenue across months to detect demand shifts, seasonal dips, and growth windows, replacing the manual monthly spreadsheet previously emailed between stores.
```sql
SELECT
    cat.category_name,
    EXTRACT(MONTH FROM s.sales_date) AS month,
    SUM(s.total_price) AS monthly_revenue
FROM sales s
JOIN products p ON s.product_id = p.product_id
JOIN categories cat ON p.category_id = cat.category_id
GROUP BY cat.category_name, month
ORDER BY cat.category_name, month;
```
SQL
Sample output (Beverages category, first 5 months):
| category_name | month | monthly_revenue |
|---|---|---|
| Beverages | 1 | 265,010.48 |
| Beverages | 2 | 243,361.08 |
| Beverages | 3 | 220,100.45 |
| Beverages | 4 | 241,889.08 |
| Beverages | 5 | 74,865.80 |
Identify which categories have strong vs. weak seasonal patterns. A seasonality index above 1.0 indicates a peak month; below 1.0 indicates a trough, guiding procurement timing and promotional calendars.
```sql
WITH monthly_rev AS (
    SELECT
        cat.category_name,
        EXTRACT(MONTH FROM s.sales_date) AS month,
        SUM(s.total_price) AS revenue
    FROM sales s
    JOIN products p ON s.product_id = p.product_id
    JOIN categories cat ON p.category_id = cat.category_id
    GROUP BY cat.category_name, month
),
category_avg AS (
    SELECT category_name, AVG(revenue) AS avg_revenue
    FROM monthly_rev
    GROUP BY category_name
)
SELECT
    m.category_name,
    m.month,
    m.revenue,
    ROUND(m.revenue / c.avg_revenue, 2) AS seasonality_index
FROM monthly_rev m
JOIN category_avg c ON m.category_name = c.category_name
ORDER BY m.category_name, m.month;
```
SQL · CTE
Surface the single best-performing month for each category using a window function, replacing the manual chart-reading previously done in Excel.
```sql
WITH monthly AS (
    SELECT
        cat.category_name,
        EXTRACT(MONTH FROM s.sales_date) AS month,
        SUM(s.total_price) AS revenue
    FROM sales s
    JOIN products p ON s.product_id = p.product_id
    JOIN categories cat ON p.category_id = cat.category_id
    GROUP BY cat.category_name, month
)
SELECT category_name, month AS peak_month, revenue
FROM (
    SELECT category_name, month, revenue,
           RANK() OVER (
               PARTITION BY category_name
               ORDER BY revenue DESC
           ) AS rnk
    FROM monthly
) ranked
WHERE rnk = 1
ORDER BY category_name;
```
SQL · RANK() window fn
| category_name | peak_month | revenue |
|---|---|---|
| Beverages | 1 | 265,010.48 |
| Cereals | 1 | 292,262.81 |
| Confections | 1 | 419,263.41 |
| Dairy | 1 | 266,983.76 |
| Grain | 1 | 238,437.17 |
Identify "hero products" driving category revenue (for sourcing and shelf priority) and the weakest performers (for rationalization or repositioning). The same CTE is reused; only the ORDER BY direction changes.
```sql
WITH product_revenue AS (
    SELECT
        cat.category_name,
        p.product_id,
        p.product_name,
        SUM(s.total_price) AS revenue
    FROM sales s
    JOIN products p ON s.product_id = p.product_id
    JOIN categories cat ON p.category_id = cat.category_id
    GROUP BY cat.category_name, p.product_id, p.product_name
),
ranked AS (
    SELECT *,
           RANK() OVER (
               PARTITION BY category_name
               ORDER BY revenue DESC   -- swap to ASC for lowest
           ) AS rnk
    FROM product_revenue
)
SELECT category_name, product_id, product_name, revenue AS total_revenue
FROM ranked
WHERE rnk = 1
ORDER BY category_name;
```
SQL · RANK() PARTITION BY
Top per category
| category | product | revenue |
|---|---|---|
| Beverages | Tia Maria | 60,895 |
| Cereals | Shrimp 31/40 | 59,439 |
| Confections | Pail w/ Handle | 77,734 |
| Dairy | Scampi Tail | 66,104 |
Lowest per category
| category | product | revenue |
|---|---|---|
| Beverages | Halibut Fletches | 1,582 |
| Cereals | Seedlings Mix | 506 |
| Confections | French Baguette | 3,573 |
| Dairy | Bread Fig Almond | 470 |
Measure discount elasticity: which products actually sell more when discounted, and which don't respond. This prevents margin loss from blanket promotions on inelastic items.
```sql
WITH product_qty AS (
    SELECT
        p.product_id,
        p.product_name,
        CASE WHEN s.discount > 0 THEN 'discounted'
             ELSE 'no_discount' END AS price_type,
        AVG(s.quantity) AS avg_qty
    FROM sales s
    JOIN products p ON s.product_id = p.product_id
    GROUP BY p.product_id, p.product_name, price_type
),
pivoted AS (
    SELECT
        product_id,
        product_name,
        MAX(CASE WHEN price_type = 'discounted' THEN avg_qty END) AS avg_qty_discount,
        MAX(CASE WHEN price_type = 'no_discount' THEN avg_qty END) AS avg_qty_no_discount
    FROM product_qty
    GROUP BY product_id, product_name
)
SELECT
    product_id,
    product_name,
    avg_qty_discount,
    avg_qty_no_discount,
    (avg_qty_discount - avg_qty_no_discount) AS qty_increase_due_to_discount
FROM pivoted
ORDER BY qty_increase_due_to_discount DESC NULLS LAST
LIMIT 10;
```
SQL · CASE pivot
Salesperson Analysis
Two queries: a global revenue ranking and a city-level performance ranking, both using RANK() window functions.
Create a dynamic revenue leaderboard for all salespeople, supporting performance reviews, bonus calculations, and territory planning without any manual Excel sorting.
```sql
SELECT
    emp.employee_id,
    emp.first_name || ' ' || emp.last_name AS employee_name,
    SUM(s.total_price) AS revenue,
    RANK() OVER (ORDER BY SUM(s.total_price) DESC) AS revenue_rank
FROM sales s
JOIN employees emp ON emp.employee_id = s.sales_person_id
GROUP BY emp.employee_id, employee_name
ORDER BY revenue DESC;
```
SQL · RANK() global
| rank | employee_name | revenue |
|---|---|---|
| 1 | Chadwick Walton | 607,337.21 |
| 2 | Wendi Buckley | 598,446.45 |
| 3 | Chadwick Cook | 591,659.32 |
| 4 | Pablo Cline | 589,504.76 |
| 5 | Katina Marks | 589,415.83 |
Rank salespeople within their local market, accounting for the fact that a city with lower overall demand should be measured differently. This identifies regional champions whose practices can be replicated.
```sql
WITH city_employee_revenue AS (
    SELECT
        city.city_name,
        emp.employee_id,
        emp.first_name || ' ' || emp.last_name AS employee_name,
        SUM(s.total_price) AS revenue
    FROM sales s
    JOIN employees emp ON s.sales_person_id = emp.employee_id
    JOIN customers cust ON s.customer_id = cust.customer_id
    JOIN cities city ON cust.city_id = city.city_id
    GROUP BY city.city_name, emp.employee_id, employee_name
),
ranked AS (
    SELECT *,
           RANK() OVER (
               PARTITION BY city_name
               ORDER BY revenue DESC
           ) AS rnk
    FROM city_employee_revenue
)
SELECT city_name, employee_id, employee_name, revenue AS total_revenue
FROM ranked
WHERE rnk = 1
ORDER BY city_name;
```
SQL · RANK() PARTITION BY city
Geographic Analysis
Four queries examining revenue distribution across countries and cities, from top markets down to underperforming locations, plus category preference by city.
Reveal which countries and cities drive the most sales, informing store expansion planning, marketing resource allocation, and inventory logistics.
```sql
SELECT
    ctry.country_name,
    city.city_name,
    SUM(s.total_price) AS total_revenue
FROM sales s
JOIN customers cust ON s.customer_id = cust.customer_id
JOIN cities city ON cust.city_id = city.city_id
JOIN countries ctry ON city.country_id = ctry.country_id
GROUP BY ctry.country_name, city.city_name
ORDER BY total_revenue DESC;
```
SQL · 4-table JOIN
| country | city | total_revenue |
|---|---|---|
| United States | St. Petersburg | 166,774.35 |
| United States | Philadelphia | 157,442.27 |
| United States | Aurora | 156,522.37 |
| United States | Albuquerque | 155,400.82 |
| United States | Dallas | 155,382.13 |
Local preferences vary: stocking Confections uniformly across all cities ignores meaningful regional differences. This query enables geographic assortment tailoring.
```sql
WITH city_category_revenue AS (
    SELECT
        city.city_name,
        cat.category_name,
        SUM(s.total_price) AS revenue
    FROM sales s
    JOIN customers cust ON s.customer_id = cust.customer_id
    JOIN cities city ON cust.city_id = city.city_id
    JOIN products p ON s.product_id = p.product_id
    JOIN categories cat ON p.category_id = cat.category_id
    GROUP BY city.city_name, cat.category_name
),
ranked AS (
    SELECT *,
           RANK() OVER (
               PARTITION BY city_name
               ORDER BY revenue DESC
           ) AS rnk
    FROM city_category_revenue
)
SELECT city_name, category_name AS top_category, revenue AS total_revenue
FROM ranked
WHERE rnk = 1
ORDER BY city_name;
```
SQL · 5-table JOIN + RANK() PARTITION BY
| city | top_category | revenue |
|---|---|---|
| Akron | Poultry | 21,087.67 |
| Albuquerque | Cereals | 24,840.13 |
| Anaheim | Confections | 23,321.72 |
| Anchorage | Confections | 21,997.42 |
| Arlington | Dairy | 14,020.53 |
Finding: Confections dominated as the top revenue category in most cities, confirmed across both the SQL output and the Metabase dashboard. This validates it as the category to anchor national promotions on, while allowing local adjustments for cities like Akron (Poultry) and Albuquerque (Cereals).
Executive Dashboard
The Metabase dashboard gives leadership real-time visibility into revenue, product mix, and market dynamics without requiring SQL knowledge. It is organized into two views: an Executive Overview and a deeper Product & Market Insights view.
Business Scale at a Glance
The three KPI cards at the top confirm the database is operating at a realistic scale, demonstrating that the schema and ETL pipeline handle production-level volume correctly.
Note: the dashboard screenshot could not be uploaded; the widget above recreates the key panels from the Metabase output for portfolio presentation.
Key Dashboard Insights
Key Takeaways & Reflection
What Worked Well
- The `RESTART IDENTITY CASCADE` truncation pattern made the ETL fully repeatable, critical for testing and future data refreshes without orphaned records.
- Delegating `total_price` computation to a PostgreSQL trigger instead of Python ensured every row satisfied the same pricing formula, eliminating a class of transformation bugs.
- Using `RANK() OVER (PARTITION BY ...)` consistently across analyses (peak month, top product, top employee per city) reduced query complexity and made results directly comparable across entities.
- Keeping the schema in 3NF, routing location through `Customer → City → Country` rather than denormalizing, paid off immediately in the geographic queries, which joined cleanly across all four tables without any duplication issues.
Challenges & Lessons
- Mixed date formats in the raw CSVs required `pd.to_datetime(errors='coerce')` plus manual inspection of NULL rates before loading, a good reminder to validate before transforming, not after.
- The `UNIQUE NULLS NOT DISTINCT` constraint on `country_code` is a PostgreSQL 15+ feature, a useful reminder that constraint syntax varies by database version.
- Some product names in the raw data exceeded the initial `VARCHAR(100)` limit, requiring adjustment to `VARCHAR(200)` before loading, caught through constraint testing before the production load.
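Both lessons point toward a lightweight pre-load audit. A sketch of what that might look like with pandas follows; `preflight` is a hypothetical helper (not part of the original pipeline), and the 200-character limit mirrors the adjusted `VARCHAR(200)`:

```python
import pandas as pd

def preflight(df: pd.DataFrame, date_cols=(), text_limits=None) -> dict:
    """Report coerced-date NULL rates and over-length strings before loading."""
    report = {}
    for col in date_cols:
        # Share of values that would become NULL after coercion
        parsed = pd.to_datetime(df[col], errors="coerce")
        report[f"{col}_null_rate"] = float(parsed.isna().mean())
    for col, limit in (text_limits or {}).items():
        # Count of strings that would violate a VARCHAR(limit) column
        too_long = df[col].astype("string").str.len() > limit
        report[f"{col}_over_{limit}"] = int(too_long.sum())
    return report

df = pd.DataFrame({
    "sales_date": ["2024-01-05", "13/45/2024"],   # second date is malformed
    "product_name": ["Tia Maria", "x" * 250],     # second name is too long
})
print(preflight(df, date_cols=["sales_date"],
                text_limits={"product_name": 200}))
```

Run against each source file before `to_sql`, a report like this would have surfaced both the malformed dates and the over-length product names ahead of any constraint violation.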