
Flexible Catalog – JSON + Virtual Columns 🚀

Runner Up at MariaDB Python Hackathon 2025

A product catalog where JSON columns hold dynamic attributes, exposed by virtual/persistent columns for indexing and fast filters — giving the agility of semi-structured data with the performance of relational indexes.

Problem: e-commerce and marketplace catalogs carry many product-specific attributes (RAM, GPU, screen_size, battery_mAh, etc.) that vary by category. Storing every attribute as a dedicated column is inflexible; storing a JSON blob is flexible but performs poorly for filters unless indexed.

Solution: keep a JSON column (specs) for flexible attributes. Expose a selected set of frequently filtered paths as virtual (generated) columns, marked persistent/stored where the engine requires it for indexing, and add B-tree indexes on them to speed up range and equality searches. Keep JSON for everything else.
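The pattern can be sketched in miniature. The snippet below uses Python's sqlite3 as a stand-in (SQLite 3.31+ also supports stored generated columns and json_extract); the repo's real DDL is in schema.sql and targets MySQL/MariaDB, where the equivalent syntax is `col TYPE GENERATED ALWAYS AS (JSON_EXTRACT(specs, '$.path')) STORED`. Table and column names here mirror the README but are illustrative.

```python
import json
import sqlite3

# Minimal sketch of the JSON + generated-column pattern, using SQLite
# (3.31+ with JSON1) as a stand-in for MySQL/MariaDB.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE products (
    id      INTEGER PRIMARY KEY,
    name    TEXT NOT NULL,
    specs   TEXT NOT NULL,  -- JSON blob of dynamic attributes
    -- hot paths extracted as stored generated columns, then indexed
    ram_gb  INT  GENERATED ALWAYS AS (json_extract(specs, '$.ram'))   STORED,
    brand   TEXT GENERATED ALWAYS AS (json_extract(specs, '$.brand')) STORED
);
CREATE INDEX idx_products_ram ON products(ram_gb);
""")
conn.execute(
    "INSERT INTO products (name, specs) VALUES (?, ?)",
    ("ZenBook", json.dumps({"ram": 32, "brand": "Asus", "gpu": "RTX 4070"})),
)
# The generated column is populated automatically from the JSON blob,
# so the filter below can use the B-tree index instead of parsing JSON.
rows = conn.execute(
    "SELECT name, ram_gb FROM products WHERE ram_gb >= 16").fetchall()
print(rows)
```

Everything not promoted to a generated column (gpu above) stays queryable through JSON paths, just without index support.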

This repository is a full example: database schema (schema.sql), a data generator to produce 300k rows (load_data.py), a small Express API (/api) that prefers generated columns for filters, and a Docker Compose setup to run everything locally.


This project solves real-world problems:

  • Handling dynamic product attributes without rigid schema changes.
  • Enabling fast filtering and sorting on frequently queried fields.
  • Scaling data ingestion for large datasets (300k+ rows).
  • Maintaining data integrity and validation even with semi-structured JSON.
  • Providing an API that safely exposes complex queries while preventing injection or malformed requests.
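The injection-prevention point deserves a concrete shape. The repo's actual middleware is api/middleware/validateFilter.js (Node); the Python sketch below shows the same whitelist idea with hypothetical field/operator tables: only known fields and operators reach the SQL text, and values travel as bind parameters.

```python
# Hedged sketch (Python for brevity; the real middleware is Node) of
# whitelist-based filter validation. Field and operator lists are assumptions.
ALLOWED_FIELDS = {"ram": "ram_gb", "category": "category", "brand": "brand"}
ALLOWED_OPS = {"=", ">=", "<=", ">", "<"}

def build_where(filters):
    """Turn API filter objects into a parameterized WHERE clause."""
    clauses, params = [], []
    for f in filters:
        column = ALLOWED_FIELDS.get(f["field"])
        if column is None or f["op"] not in ALLOWED_OPS:
            raise ValueError(f"rejected filter: {f}")
        # column and operator come from the whitelist only; the user-supplied
        # value never appears in the SQL string, only in the bind params.
        clauses.append(f"{column} {f['op']} ?")
        params.append(f["value"])
    return " AND ".join(clauses), params

where, params = build_where([
    {"field": "ram", "op": ">=", "value": 32},
    {"field": "category", "op": "=", "value": "laptop"},
])
print(where, params)  # ram_gb >= ? AND category = ? [32, 'laptop']
```

A filter naming an unknown field (or smuggling SQL in the field name) is rejected before any query is built.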

Built with:

MariaDB 10.6+ / MySQL 8.0+ (JSON support + generated columns)

Node.js + Express (API with validation + metrics)

Docker Compose (for reproducible setup)

Python (Faker) for synthetic data generation


🌟 Project Overview

Hybrid database design where:

  • Product attributes (RAM, GPU, screen size, battery, etc.) are stored as JSON.
  • Frequently queried attributes are extracted as generated columns for B-tree indexing.
  • Full API layer enables complex filters (e.g., ram >= 16, gpu = "RTX 4070").
  • Supports high-volume data ingestion efficiently via SQL inserts or CSV + LOAD DATA INFILE.
  • Provides built-in analytics views and constraints for robust data validation.


A production-ready, containerized flexible product catalog that stores arbitrary product attributes in a specs JSON column while providing fast, indexed SQL access via either MySQL 8 (original) or MariaDB (compatibility option). The repo contains both MySQL and MariaDB flavors — pick the one that suits your environment.

📁 Repo structure

flexible-catalog/
├── docker-compose.yml                    # MySQL compose
├── docker-compose-mariadb.yml            # MariaDB compose (optional)
├── .env
├── schema.sql                            # MySQL schema (JSON + generated columns)
├── schema-mariadb-compat.sql             # MariaDB-compatible schema (physical columns)
├── load_data.py                          # MySQL loader (for generated columns)
├── load_data_mariadb.py                  # MariaDB loader (populates physical columns)
├── products_inserts.sql                  # generated by load_data.py
├── products_inserts_mariadb.sql          # generated by load_data_mariadb.py
├── products.csv                          # generated CSVs
├── api/
│   ├── Dockerfile
│   ├── package.json
│   ├── server.js
│   └── middleware/
│       └── validateFilter.js
└── README.md

🎯 Quick summary

  • Two supported DB flows: MySQL 8 (uses JSON + generated stored columns) and MariaDB (uses physical indexed columns populated by loader).
  • Files provided: docker-compose.yml (MySQL), schema.sql (MySQL DDL), load_data.py (MySQL loader), and docker-compose-mariadb.yml, schema-mariadb-compat.sql, load_data_mariadb.py (MariaDB-compatible).
  • API: Node.js (Express) in api/ with endpoints for search, pagination, product detail, and analytics.

⚙️ Requirements & Versions

  • Docker 20.10+ & Docker Compose 1.29+ (container orchestration)
  • MariaDB 10.6+ or MySQL 8.0+ (JSON support, generated columns, CHECK constraints)
  • Node.js v18+ (API server)
  • Python 3.9+ (data generator with faker library)
  • NPM / Yarn (API dependencies)

Optional:

  • Postman or curl for testing API endpoints
  • Prometheus / Grafana for monitoring

🚀 Getting Started

1. Clone Repository

git clone https://github.com/your-username/flexible-catalog.git
cd flexible-catalog

2. ⚙️ Configure environment variables

Create a .env file in the repo root (example):

DB_HOST=db
DB_PORT=3306
DB_USER=catalog_user
DB_PASS=SecurePass123!
DB_NAME=catalog
PORT=3000

Both compose files accept these environment variables. The MariaDB compose also sets MARIADB_* envs internally for compatibility.

🧭 Option A — Run with MySQL 8 (default)

This is the original flow in the repo and uses MySQL 8 features (a JSON column plus generated stored columns). Use it if you want the simplest parity with the original schema and the performance of MySQL JSON + generated columns.

Files used

  • docker-compose.yml (uses mysql:8.0)
  • schema.sql (DDL with JSON + generated columns)
  • load_data.py (generates products_inserts.sql and products.csv)
  • api/ (Node API)

Steps

  1. Ensure Docker is running.
  2. Start the stack:
docker-compose up -d
  3. Generate demo data locally (creates 300k rows by default unless you override):
python load_data.py
# or a smaller quick run
NUM_PRODUCTS=50000 python load_data.py
  4. After the DB container finishes initialization (watch the logs), load the generated SQL into MySQL:
docker exec -i catalog_db mysql -uroot -p$DB_PASS $DB_NAME < products_inserts.sql
  5. Confirm API health:
curl http://localhost:3000/health

Verification & tests

Run these queries to confirm data & performance:

# Basic list
curl http://localhost:3000/products

# Search: laptops with >=32GB RAM
curl -X POST http://localhost:3000/products/search -H "Content-Type: application/json" -d '{"filters":[{"field":"ram","op":">=","value":32},{"field":"category","op":"=","value":"laptop"}], "page":1, "pageSize":10}'

# Analytics summary
curl http://localhost:3000/analytics/summary

To compare JSON-path vs virtual column performance (optional):

-- Run in MySQL client inside container
SELECT * FROM products WHERE JSON_EXTRACT(specs, '$.ram') >= 32;    -- JSON-path (no index)
SELECT * FROM products WHERE ram_gb >= 32;                          -- generated column (indexed)

Measure response times in your client; indexed queries should be much faster.


🧩 Option B — Run with MariaDB (compatibility mode)

This option is included for environments that prefer MariaDB. The repo provides schema-mariadb-compat.sql and load_data_mariadb.py. The loader writes both specs (JSON) and physical indexed columns (ram_gb, storage_gb, brand, etc.) so you avoid relying on DB-specific generated columns.
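The "writes both" behavior is the key difference from the MySQL flow. A hypothetical per-row sketch of what load_data_mariadb.py does (attribute names follow the README; the real script also uses Faker and covers more fields):

```python
import json
import random

# Illustrative sketch of the MariaDB loader's per-row logic: build the specs
# JSON, then duplicate the hot attributes into the physical indexed columns
# (ram_gb, storage_gb, brand) so filtering works without generated columns.
def make_row(rng):
    specs = {
        "ram": rng.choice([8, 16, 32, 64]),
        "storage": rng.choice([256, 512, 1024]),
        "brand": rng.choice(["Asus", "Dell", "Lenovo"]),
        "gpu": rng.choice(["RTX 4060", "RTX 4070", "Iris Xe"]),
    }
    return {
        "specs": json.dumps(specs),     # full flexible blob, kept verbatim
        "ram_gb": specs["ram"],         # copies for the indexed columns
        "storage_gb": specs["storage"],
        "brand": specs["brand"],
    }

row = make_row(random.Random(42))
print(row["ram_gb"], json.loads(row["specs"])["ram"])  # always equal
```

Because the physical columns are derived from the same dict as the JSON, the two representations can never disagree at load time.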

Files used

  • docker-compose-mariadb.yml (uses mariadb:10.11)
  • schema-mariadb-compat.sql (DDL creating physical indexed columns)
  • load_data_mariadb.py (generates products_inserts_mariadb.sql and products_mariadb.csv)
  • api/ (Node API — unchanged)

Steps

  1. Ensure Docker is running.
  2. Start MariaDB via the provided compose file:
# Option 1 — run directly with file name
docker-compose -f docker-compose-mariadb.yml up -d

# Option 2 — rename/move file to docker-compose.yml and run
mv docker-compose-mariadb.yml docker-compose.yml
docker-compose up -d
  3. Generate demo data (example):
# default smaller run for a quick test
NUM_PRODUCTS=50000 python load_data_mariadb.py
# full run
NUM_PRODUCTS=300000 python load_data_mariadb.py
  4. Load the SQL into MariaDB (after DB initialization):
docker exec -i catalog_db mysql -uroot -p$DB_PASS catalog < products_inserts_mariadb.sql
  5. Confirm API health:
curl http://localhost:3000/health

Verification & tests

Use the same API calls as MySQL to verify responses:

# Health
curl http://localhost:3000/health

# Search: laptops with >=32GB RAM
curl -X POST http://localhost:3000/products/search -H "Content-Type: application/json" -d '{"filters":[{"field":"ram","op":">=","value":32},{"field":"category","op":"=","value":"laptop"}], "page":1, "pageSize":10}'

# Analytics
curl http://localhost:3000/analytics/summary

Because the physical columns are indexed, hot-query performance should be similar to the MySQL generated-column approach. If you see mismatched column values, make sure you used the load_data_mariadb.py loader so those columns are populated.


🔁 Switching between MySQL & MariaDB

  • The API (api/server.js) reads columns like ram_gb, brand, etc. If you run MySQL with generated stored columns, those appear automatically. For MariaDB mode, ensure your loader populated these physical columns.
  • To switch DB engines: stop the current compose (docker-compose down), start the desired compose file, and reload the data with the appropriate loader's SQL file.

⚠️ Troubleshooting & tips

  • DB init errors: If /docker-entrypoint-initdb.d/01-schema.sql fails, check container logs: docker logs catalog_db and inspect the SQL error. Common causes: function not supported in older DB versions, syntax differences, or file encoding problems.
  • Slow inserts: Use LOAD DATA INFILE for the CSV files rather than many single inserts when possible. For MySQL, products.csv + LOAD DATA INFILE is fastest.
  • Connection refused: ensure .env DB_HOST is db (the compose service name) and containers are on the same network.
  • Data mismatch: If ram_gb shows NULL, you likely loaded data with the wrong loader — rerun the correct loader (load_data.py vs load_data_mariadb.py).
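On the bulk-load tip: the one subtlety when preparing a CSV that carries a JSON column is quoting, since the serialized JSON contains commas and double quotes. A small sketch (column names assumed from the README; the real generator is load_data.py, which writes products.csv):

```python
import csv
import io
import json

# Sketch of preparing CSV rows for bulk import. csv.writer quotes the JSON
# field and doubles its embedded quotes, so LOAD DATA INFILE with
# OPTIONALLY ENCLOSED BY '"' can read it back intact.
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["name", "specs", "ram_gb"])
writer.writerow(["ZenBook", json.dumps({"ram": 32, "gpu": "RTX 4070"}), 32])
print(buf.getvalue())
```

The matching import statement would look roughly like `LOAD DATA INFILE '...' INTO TABLE products FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"' IGNORE 1 LINES;` (exact path subject to secure_file_priv, as noted in Troubleshooting).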

🧪 Suggested verification script (optional)

Create verify.sh for quick smoke testing (adjust host/port as needed):

#!/usr/bin/env bash
set -e

echo "Health check:"
curl -s http://localhost:3000/health | jq

echo "Get products sample:"
curl -s http://localhost:3000/products | jq '.pagination, .data | .[0:2]'

echo "Search laptops with >=32GB RAM"
curl -s -X POST http://localhost:3000/products/search -H "Content-Type: application/json" -d '{"filters":[{"field":"ram","op":">=","value":32},{"field":"category","op":"=","value":"laptop"}], "page":1, "pageSize":5}' | jq

echo "Analytics summary"
curl -s http://localhost:3000/analytics/summary | jq

📦 Which files to use (summary)

  • Default (recommended for parity): docker-compose.yml, schema.sql, load_data.py
  • If you prefer MariaDB: docker-compose-mariadb.yml, schema-mariadb-compat.sql, load_data_mariadb.py

📊 Benchmarking

  • JSON-only query:
SELECT * FROM products WHERE JSON_EXTRACT(specs, '$.ram') >= 16;
  • Virtual-column indexed query (fast):
SELECT * FROM products WHERE ram_gb >= 16;

In this project's benchmarks, the indexed virtual-column query ran 3–10× faster than the JSON-path query on 300k+ rows.
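The reason for the gap can be reproduced in miniature. The hedged sketch below uses SQLite as a stand-in for MySQL/MariaDB, with one table holding only the JSON blob and one materializing the hot path into an indexed column; the query plans show the full scan versus the index search.

```python
import json
import sqlite3

# Miniature reproduction of the benchmark's shape: the JSON-path predicate
# forces a full scan (every row's JSON is parsed), while the materialized
# column is answered from the B-tree index.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE products_json (specs TEXT);              -- JSON only
CREATE TABLE products_col  (specs TEXT, ram_gb INT);  -- hot path materialized
CREATE INDEX idx_ram ON products_col(ram_gb);
""")
for r in (8, 16, 32, 64):
    specs = json.dumps({"ram": r})
    conn.execute("INSERT INTO products_json VALUES (?)", (specs,))
    conn.execute("INSERT INTO products_col VALUES (?, ?)", (specs, r))

plan_json = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM products_json "
    "WHERE json_extract(specs, '$.ram') >= 16").fetchall()
plan_col = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM products_col "
    "WHERE ram_gb >= 16").fetchall()
print(plan_json)  # full table scan
print(plan_col)   # search using idx_ram
```

The same contrast shows up in MySQL/MariaDB with EXPLAIN: `type: ALL` for the JSON-path query versus `type: range` with the generated-column index.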


🛠️ Advanced Features & Technical Details

  • Constraints: JSON validation, numeric fields, positive price.
  • Triggers: normalize JSON on insert, auto-update timestamps.
  • Views: product_summary for analytics.
  • Middleware: validates filters before hitting DB.
  • Logging: query execution times via morgan.
  • Healthchecks: Docker ensures service recovery.
  • Scalable Data Ingestion: CSV bulk load for large datasets.
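The "JSON validation" and "positive price" constraints above can be illustrated with CHECK constraints. The sketch uses SQLite's json_valid() as a stand-in; MySQL's JSON column type enforces validity implicitly, while MariaDB uses `CHECK (JSON_VALID(specs))` in the same spirit. Column names follow the README.

```python
import sqlite3

# Hedged illustration of constraint-based validation: malformed JSON and
# non-positive prices are rejected at the database layer, not just the API.
conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE products (
    specs TEXT NOT NULL CHECK (json_valid(specs)),
    price REAL NOT NULL CHECK (price > 0)
)""")
conn.execute("INSERT INTO products VALUES (?, ?)", ('{"ram": 16}', 999.0))

try:
    # invalid JSON blob: the CHECK constraint rejects the row
    conn.execute("INSERT INTO products VALUES (?, ?)", ("not json", 999.0))
except sqlite3.IntegrityError as e:
    print("rejected:", e)
```

Pushing these rules into the schema means even a buggy loader or a bypassed API cannot persist inconsistent rows.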

Problems Solved:

  • Flexible schema for dynamic attributes.
  • Fast search/filtering on common fields.
  • Data consistency in semi-structured datasets.
  • Safe API interface preventing invalid queries.

🧪 Troubleshooting & Fallbacks

  • CSV Import Error → check MySQL secure_file_priv. Ensure file path is correct and accessible.
  • Docker container restarts → check .env DB settings; verify ports and credentials.
  • Slow inserts → prefer CSV import over SQL inserts.
  • API returns errors → verify request payload and filters; check server logs.
  • Database connection issues → ensure DB container is running and accessible; retry connection.
  • Schema load fails → check MySQL version compatibility (8.0+ or MariaDB 10.6+).

For persistent issues, consult logs in Docker containers (docker logs <container>) and ensure all dependencies meet version requirements.


📝 Next Steps / Stretch Goals

  • Full-text search for JSON specs.
  • GraphQL API support.
  • Compare PostgreSQL JSONB vs MySQL JSON for performance.
  • Prometheus/Grafana for monitoring.

🔥 Demonstrates real-world DB optimization, scalable ingestion, API best practices, solving semi-structured data challenges, and includes guidance for handling failures.
