Skip to content

Data Persistence Architecture - Migration from File-Based to Database-First #35

@FL-AntoineDurand

Description

@FL-AntoineDurand

Data Persistence Architecture - Migration from File-Based to Database-First

Summary

Our current data persistence architecture has critical limitations that prevent scaling and risk data loss. After comprehensive analysis of requirements, constraints, and technology options, we recommend migrating to a PostgreSQL + Apache AGE architecture for production deployment.

Current Issues (HIGH Priority):

  • 5-minute data loss window (gateway saves every 5 min)
  • No version history or point-in-time recovery
  • Project-level granularity prevents lazy loading
  • File-based storage without transactions or backup strategy
  • No search capability across projects or semantic search

Key Requirements Analyzed

Must-Have Constraints:

  • Cloud-agnostic (no AWS/GCP/Azure-specific services)
  • Gateway-agnostic architecture (gateways are stateless, can serve any org)
  • Schema-less storage (modules define unknown data structures at runtime)
  • Lazy loading (cannot load entire projects into memory)
  • CRDT metadata retention (24h only, with snapshot compaction)
  • Full-text AND semantic search ("car" matches "vehicle")
  • Entity-level persistence (not project-level)
  • Built-in backup and recovery strategy

Graph Nature of Our Platform:

  • Whiteboard with nodes and edges (literal graph structure)
  • Entity relationships across modules
  • Dependency tracking ("what depends on this?")
  • Version trees (v1 → v2 → v3)
  • Path finding and traversal queries are common

Recommended Architecture: PostgreSQL + Apache AGE

Why This Combination:

  1. Apache AGE = Graph Extension for PostgreSQL

    • No new infrastructure (just PostgreSQL extension)
    • SQL for structured queries + Cypher for graph queries
    • Single database = unified backup/restore/monitoring
    • Cloud-agnostic, open-source (Apache 2.0)
  2. Addresses All Requirements:

    • ✅ Schema-less: JSONB for entity properties
    • ✅ Graph queries: Native Cypher support
    • ✅ Entity-level storage: Load entities on-demand
    • ✅ Full-text search: PostgreSQL tsvector
    • ✅ Semantic search: pgvector extension + Ollama embeddings
    • ✅ Versioning: Entity history with deltas
    • ✅ CRDT compaction: Hourly snapshots, 24h operation retention
    • ✅ Gateway-agnostic: Centralized PostgreSQL
  3. Operational Simplicity:

    • Already using PostgreSQL (no learning curve)
    • Same backup tools (pgBackRest)
    • Same monitoring and alerting
    • Gradual adoption (add graph queries where beneficial)
  4. Cost-Effective:

    • $50-100/month (VPS + storage)
    • Scales to 1K organizations, 10K projects, 1M+ entities
    • No additional licensing (Apache 2.0)

Technology Stack (Tier 2 - Recommended)

Primary Database: PostgreSQL 15+ with Apache AGE extension
  - Core data (users, orgs, projects)
  - Entity storage (JSONB for schema-less data)
  - Graph relationships (Cypher queries)
  - Full-text search (tsvector + GIN indexes)
  - Vector search (pgvector extension)
  - Entity versioning (history table with deltas)
  - CRDT operation log (24h retention with auto-cleanup)

Object Storage: MinIO (S3-compatible, self-hosted)
  - Large snapshots (>1MB)
  - Binary files and attachments
  - Backup destination

Full-Text Search: Meilisearch
  - Fast typo-tolerant search
  - Better UX than PostgreSQL FTS alone
  - Faceting and filtering

Semantic Search: pgvector + Ollama
  - Self-hosted embeddings (nomic-embed-text model)
  - "car" matches "vehicle" via vector similarity
  - Cloud-agnostic (no OpenAI dependency)

CRDT Management: Yjs + PostgreSQL
  - In-memory collaboration
  - Operations logged to PostgreSQL (30s flush)
  - Hourly snapshot compaction
  - Auto-cleanup after 24h

Backup: pgBackRest
  - Incremental backups
  - Point-in-time recovery
  - Automated verification

Migration Path

Phase 1: Foundation

  • Install Apache AGE extension in PostgreSQL
  • Create new schema (entities, entity_history, entity_relationships)
  • Add pgvector extension for semantic search
  • Set up MinIO for object storage

Phase 2: Migration

  • Migrate organization data from JSON files to database tables
  • Create graph nodes/edges from existing entities
  • Migrate permissions and OAuth data
  • Verify data integrity

Phase 3: Search & Indexing

  • Deploy Meilisearch
  • Set up Ollama for embeddings
  • Index existing entities for full-text and semantic search
  • Build search API endpoints

Phase 4: Gateway Integration

  • Update gateway to use new persistence layer
  • Implement lazy loading with LRU cache
  • Add CRDT operation logging
  • Deploy snapshot compaction jobs

Phase 5: Monitoring & Optimization

  • Set up monitoring and alerting
  • Optimize query performance
  • Load testing and tuning
  • Documentation and runbooks

Alternative Considered: PostgreSQL + Neo4j

If graph performance becomes critical (>5 level traversals, graph algorithms needed), we can upgrade to dual database architecture:

  • PostgreSQL for source of truth
  • Neo4j for graph projection
  • Change Data Capture (CDC) for sync
  • Cost: $100-200/month, Complexity: ⭐⭐⭐⭐

When to Reconsider: Only if AGE performance becomes a bottleneck or we need advanced graph algorithms (PageRank, community detection).

Expected Outcomes

Reliability:

  • Zero data loss (ACID transactions)
  • Point-in-time recovery to any moment
  • Automated backups with verification

Performance:

  • Sub-second entity queries
  • Lazy loading enables 10K+ entity projects
  • Graph traversal 10x faster than recursive CTEs

Scalability:

  • 1K organizations, 10K projects, 1M+ entities
  • Linear scaling with data growth
  • Entity-level granularity prevents memory issues

Maintainability:

  • Single database (PostgreSQL)
  • Standard tools and practices
  • Clear monitoring and alerting

Next Steps

  1. Review and approve this architectural direction
  2. Allocate resources for implementation
  3. Set up staging environment for testing
  4. Begin Phase 1 (foundation work)

References


Labels: enhancement, architecture, database, high-priority
Milestone: Production Readiness
Assignee: TBD

Metadata

Metadata

Assignees

No one assigned

    Labels

    enhancementNew feature or requestinfrastructureInfrastructure and DevOps related

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions