Data Persistence Architecture - Migration from File-Based to Database-First

# Data Persistence Architecture - Migration from File-Based to Database-First

## Summary

Our current data persistence architecture has critical limitations that prevent scaling and risk data loss. After comprehensive analysis of requirements, constraints, and technology options, we recommend migrating to a **PostgreSQL + Apache AGE** architecture for production deployment.

**Current Issues (HIGH Priority):**
- 5-minute data loss window (gateway saves every 5 min)
- No version history or point-in-time recovery
- Project-level granularity prevents lazy loading
- File-based storage without transactions or backup strategy
- No search capability across projects or semantic search

## Key Requirements Analyzed

**Must-Have Constraints:**
- Cloud-agnostic (no AWS/GCP/Azure-specific services)
- Gateway-agnostic architecture (gateways are stateless, can serve any org)
- Schema-less storage (modules define unknown data structures at runtime)
- Lazy loading (cannot load entire projects into memory)
- CRDT metadata retention (24h only, with snapshot compaction)
- Full-text AND semantic search ("car" matches "vehicle")
- Entity-level persistence (not project-level)
- Built-in backup and recovery strategy

**Graph Nature of Our Platform:**
- Whiteboard with nodes and edges (literal graph structure)
- Entity relationships across modules
- Dependency tracking ("what depends on this?")
- Version trees (v1 → v2 → v3)
- Path finding and traversal queries are common

## Recommended Architecture: PostgreSQL + Apache AGE

**Why This Combination:**

1. **Apache AGE = Graph Extension for PostgreSQL**
   - No new infrastructure (just PostgreSQL extension)
   - SQL for structured queries + Cypher for graph queries
   - Single database = unified backup/restore/monitoring
   - Cloud-agnostic, open-source (Apache 2.0)

2. **Addresses All Requirements:**
   - ✅ Schema-less: JSONB for entity properties
   - ✅ Graph queries: Native Cypher support
   - ✅ Entity-level storage: Load entities on-demand
   - ✅ Full-text search: PostgreSQL tsvector
   - ✅ Semantic search: pgvector extension + Ollama embeddings
   - ✅ Versioning: Entity history with deltas
   - ✅ CRDT compaction: Hourly snapshots, 24h operation retention
   - ✅ Gateway-agnostic: Centralized PostgreSQL

3. **Operational Simplicity:**
   - Already using PostgreSQL (no learning curve)
   - Same backup tools (pgBackRest)
   - Same monitoring and alerting
   - Gradual adoption (add graph queries where beneficial)

4. **Cost-Effective:**
   - $50-100/month (VPS + storage)
   - Scales to 1K organizations, 10K projects, 1M+ entities
   - No additional licensing (Apache 2.0)

## Technology Stack (Tier 2 - Recommended)

```yaml
Primary Database: PostgreSQL 15+ with Apache AGE extension
  - Core data (users, orgs, projects)
  - Entity storage (JSONB for schema-less data)
  - Graph relationships (Cypher queries)
  - Full-text search (tsvector + GIN indexes)
  - Vector search (pgvector extension)
  - Entity versioning (history table with deltas)
  - CRDT operation log (24h retention with auto-cleanup)

Object Storage: MinIO (S3-compatible, self-hosted)
  - Large snapshots (>1MB)
  - Binary files and attachments
  - Backup destination

Full-Text Search: Meilisearch
  - Fast typo-tolerant search
  - Better UX than PostgreSQL FTS alone
  - Faceting and filtering

Semantic Search: pgvector + Ollama
  - Self-hosted embeddings (nomic-embed-text model)
  - "car" matches "vehicle" via vector similarity
  - Cloud-agnostic (no OpenAI dependency)

CRDT Management: Yjs + PostgreSQL
  - In-memory collaboration
  - Operations logged to PostgreSQL (30s flush)
  - Hourly snapshot compaction
  - Auto-cleanup after 24h

Backup: pgBackRest
  - Incremental backups
  - Point-in-time recovery
  - Automated verification
```

## Migration Path

**Phase 1: Foundation**
- Install Apache AGE extension in PostgreSQL
- Create new schema (entities, entity_history, entity_relationships)
- Add pgvector extension for semantic search
- Set up MinIO for object storage

**Phase 2: Migration**
- Migrate organization data from JSON files to database tables
- Create graph nodes/edges from existing entities
- Migrate permissions and OAuth data
- Verify data integrity

**Phase 3: Search & Indexing**
- Deploy Meilisearch
- Set up Ollama for embeddings
- Index existing entities for full-text and semantic search
- Build search API endpoints

**Phase 4: Gateway Integration**
- Update gateway to use new persistence layer
- Implement lazy loading with LRU cache
- Add CRDT operation logging
- Deploy snapshot compaction jobs

**Phase 5: Monitoring & Optimization**
- Set up monitoring and alerting
- Optimize query performance
- Load testing and tuning
- Documentation and runbooks

## Alternative Considered: PostgreSQL + Neo4j

If graph performance becomes critical (>5 level traversals, graph algorithms needed), we can upgrade to dual database architecture:
- PostgreSQL for source of truth
- Neo4j for graph projection
- Change Data Capture (CDC) for sync
- Cost: $100-200/month, Complexity: ⭐⭐⭐⭐

**When to Reconsider:** Only if AGE performance becomes a bottleneck or we need advanced graph algorithms (PageRank, community detection).

## Expected Outcomes

**Reliability:**
- Zero data loss (ACID transactions)
- Point-in-time recovery to any moment
- Automated backups with verification

**Performance:**
- Sub-second entity queries
- Lazy loading enables 10K+ entity projects
- Graph traversal 10x faster than recursive CTEs

**Scalability:**
- 1K organizations, 10K projects, 1M+ entities
- Linear scaling with data growth
- Entity-level granularity prevents memory issues

**Maintainability:**
- Single database (PostgreSQL)
- Standard tools and practices
- Clear monitoring and alerting

## Next Steps

1. **Review and approve** this architectural direction
2. **Allocate resources** for implementation
3. **Set up staging environment** for testing
4. **Begin Phase 1** (foundation work)

## References

- **Full Analysis:** [`doc/architecture/DATA_PERSISTENCE_ANALYSIS.md`](../doc/architecture/DATA_PERSISTENCE_ANALYSIS.md) (2500+ lines)
  - Complete requirements and constraints
  - 40+ technologies analyzed
  - 6 architecture patterns evaluated
  - Algorithms and strategies for each aspect
  - Trade-off analysis and cost modeling

- **Apache AGE:** https://age.apache.org/
- **pgvector:** https://github.com/pgvector/pgvector
- **Meilisearch:** https://www.meilisearch.com/
- **MinIO:** https://min.io/
- **Ollama:** https://ollama.ai/

---

**Labels:** `enhancement`, `architecture`, `database`, `high-priority`  
**Milestone:** Production Readiness  
**Assignee:** TBD



Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Data Persistence Architecture - Migration from File-Based to Database-First #35

Data Persistence Architecture - Migration from File-Based to Database-First

Summary

Key Requirements Analyzed

Recommended Architecture: PostgreSQL + Apache AGE

Technology Stack (Tier 2 - Recommended)

Migration Path

Alternative Considered: PostgreSQL + Neo4j

Expected Outcomes

Next Steps

References

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Data Persistence Architecture - Migration from File-Based to Database-First #35

Description

Data Persistence Architecture - Migration from File-Based to Database-First

Summary

Key Requirements Analyzed

Recommended Architecture: PostgreSQL + Apache AGE

Technology Stack (Tier 2 - Recommended)

Migration Path

Alternative Considered: PostgreSQL + Neo4j

Expected Outcomes

Next Steps

References

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions