The Custom AtomSpace Builder is a graph processing system that transforms structured tabular data into multiple knowledge graph formats, including MeTTa-formatted graphs and Neo4j-compatible output, and integrates with the MORK database. Built on Apache HugeGraph Loader with custom modifications, it provides a complete pipeline for data ingestion, schema transformation, and graph generation across multiple output formats and backends.
This project provides a flexible graph data loading and transformation pipeline with:
- RESTful API interface for data loading operations
- Multi-backend support (Neo4j, HugeGraph, and MORK) with configurable writer system
- Automatic MeTTa language representation generation for logic-based graph processing
- MORK database integration for advanced knowledge representation
- Multi-tenancy support with tenant-based graph isolation
- Comprehensive job management with history tracking
- Schema conversion and transformation utilities
- Integration with annotation services for AtomSpace interaction
- Thread-safe file operations for concurrent processing
- Direct Neo4j integration with CSV and Cypher file generation
- Makefile-based development workflow for streamlined operations
Project structure:

```
custom-atomspace-builder/
├── app/
│ ├── __init__.py
│ ├── main.py # FastAPI app setup and lifespan
│ ├── config.py # Centralized configuration management
│ │
│ ├── core/ # Core functionality
│ │ ├── database.py # Neo4j connection management
│ │ ├── session_manager.py # Upload session management
│ │ └── background_tasks.py # Background cleanup tasks
│ │
│ ├── models/ # Data models
│ │ ├── schemas.py # Pydantic models
│ │ └── enums.py # Enumerations
│ │
│ ├── services/ # Business logic
│ │ ├── hugegraph_service.py # HugeGraph operations
│ │ ├── neo4j_service.py # Neo4j operations
│ │ ├── annotation_service.py # Annotation service communication
│ │ └── graph_info_service.py # Graph info generation
│ │
│ ├── api/ # API endpoints
│ │ ├── upload.py # Upload endpoints
│ │ ├── jobs.py # Job management endpoints
│ │ ├── graph.py # Graph info endpoints
│ │ └── admin.py # Admin endpoints
│ │
│ └── utils/ # Utilities
│ ├── file_utils.py # File operations utilities
│ ├── schema_converter.py # Schema conversion utilities
│ └── helpers.py # General helper functions
│
├── Makefile # Development and deployment commands
├── config.yaml # Application configuration
├── requirements.txt # Python dependencies
├── docker-compose.yml # Production Docker setup
├── docker-compose.dev.yml # Development Docker setup
├── example.env # Environment variables template
└── README.md
```

Key features:
- Multi-Format Output: Support for MeTTa-formatted knowledge graphs, Neo4j-compatible formats, and MORK database integration
- Data Loading: Upload CSV/JSON files with session-based processing and separate schema/configuration submission
- Neo4j Integration: CSV and Cypher file generation pipeline compatible with Neo4j's LOAD CSV functionality
- MORK Integration: Advanced knowledge representation and reasoning capabilities through MORK database backend
- Schema Management: Retrieve schema information formatted for annotation services and visualization tools
- Job Tracking: Comprehensive job history with metadata including node/edge counts, data size, and timestamps
- Output Management: Download job outputs as zip files or individual files
- Graph Analytics: Automatic generation of graph statistics including top entities, connections, and relationship patterns
- Multi-Tenancy: User-specific subgraph isolation using tenant IDs for nodes and edges
- Annotation Service Integration: Schema formatting optimized for graph annotation interfaces
- Thread-Safe Operations: File locking mechanisms preventing data loss during concurrent access
- History Management: Individual job deletion and complete history clearing capabilities
- Session-Based Processing: Support for multi-step workflows using session IDs
- Makefile Workflow: Streamlined development and deployment commands
Prerequisites:

- Python 3.8+
- Docker and Docker Compose
- Neo4j 4.x+ (for Neo4j backend)
- Apache HugeGraph Server (for HugeGraph backend)
- MORK database system (for MORK backend)
- Maven 3.6+ and Java 8+ (if building HugeGraph from source)
Installation:

```bash
git clone https://github.com/rejuve-bio/custom-atomspace-builder
cd custom-atomspace-builder
```

Create a `.env` file from the example template:

```bash
cp example.env .env
```

Edit the `.env` file with your specific configuration:

```env
# ========================================
# API Configuration
# ========================================
API_PORT=8000
# ========================================
# Neo4j Database Configuration
# ========================================
NEO4J_USERNAME=neo4j
NEO4J_PASSWORD=your_password_here
NEO4J_HOST=localhost
NEO4J_PORT=7687
NEO4J_DATABASE=neo4j
# Neo4j Web Interface Ports
NEO4J_HTTP_PORT=7474
NEO4J_BOLT_PORT=7687
# ========================================
# HugeGraph Configuration
# ========================================
HUGEGRAPH_HOST=localhost
HUGEGRAPH_PORT=8080
HUGEGRAPH_GRAPH=hugegraph
# HugeGraph Service Ports
HUGEGRAPH_REST_PORT=8080
HUGEGRAPH_GREMLIN_PORT=8182
# ========================================
# External Service URLs
# ========================================
ANNOTATION_SERVICE_URL=http://localhost:5800/annotation/load
ANNOTATION_SERVICE_TIMEOUT=300.0
# ========================================
# Application Settings
# ========================================
ENVIRONMENT=production
LOG_LEVEL=INFO
SESSION_TIMEOUT_HOURS=24
OUTPUT_DIR=./output
```

Update the `config.yaml` file for additional configuration:

```yaml
paths:
  hugegraph_loader: "/app/hugegraph-loader/apache-hugegraph-loader-incubating-1.5.0/bin/hugegraph-loader.sh"
  output_dir: "./output"

cors:
  allow_origins: ["*"]
  allow_credentials: true
  allow_methods: ["*"]
  allow_headers: ["*"]

uploads:
  session_timeout: 24  # hours
```
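app/config.py centralizes configuration. A minimal sketch of how the two layers (environment variables and `config.yaml`) might be combined, assuming the `python-dotenv` and `PyYAML` packages; the field names below are illustrative, not the module's actual API:

```python
import os

import yaml  # PyYAML
from dotenv import load_dotenv  # python-dotenv


def load_settings(yaml_path: str = "config.yaml") -> dict:
    """Combine .env variables with config.yaml values (illustrative)."""
    load_dotenv()  # copies .env entries into os.environ
    with open(yaml_path) as f:
        file_cfg = yaml.safe_load(f)

    host = os.environ.get("NEO4J_HOST", "localhost")
    port = os.environ.get("NEO4J_PORT", "7687")
    return {
        "api_port": int(os.environ.get("API_PORT", "8000")),
        "neo4j_uri": f"bolt://{host}:{port}",
        "output_dir": file_cfg["paths"]["output_dir"],
        "session_timeout_hours": file_cfg["uploads"]["session_timeout"],
        "cors": file_cfg["cors"],
    }
```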
The project includes a comprehensive Makefile for streamlined development and deployment operations:

```bash
# Show all available commands
make help
# Development workflow
make build-dev # Build all services in development mode
make up-dev # Start all services in development mode
make logs-dev # View development logs
make down-dev # Stop development services
# Production workflow
make build # Build all services for production
make up # Start all services in production mode
make logs # View production logs
make down # Stop all services
# Quick API-only operations
make up-api # Start only the API service (production)
make up-api-dev # Start only the API service (development)
# Maintenance operations
make rebuild # Force rebuild and restart all services
make rebuild-dev # Force rebuild and restart (development)
make clean # Clean all containers and volumes (⚠️ deletes data)
make clean-dev # Clean development containers and volumes
```

A typical development session:

```bash
# 1. Set up environment
cp example.env .env
# Edit .env with your settings
# 2. Start development environment
make build-dev
# 3. View logs
make logs-dev
# 4. When making code changes (API only restart)
make up-api-dev
# 5. Clean shutdown
make down-dev
```

Build the development images:

```bash
# Using Makefile (recommended)
make build-dev
```

Build the production images:

```bash
# Using Makefile (recommended)
make build
```

Run the stack:

```bash
# Quick start with Makefile
make build-dev

# Traditional method
docker compose -f docker-compose.dev.yml up -d
```

Once running, you can access:
- API: http://localhost:8000
  - Health check: http://localhost:8000/api/health
  - API documentation: http://localhost:8000/docs
- Neo4j Browser: http://localhost:7474
  - Username: `neo4j`
  - Password: from your `.env` file
- HugeGraph: http://localhost:8080
  - REST API: http://localhost:8080/graphs/hugegraph/conf
  - Gremlin WebSocket: ws://localhost:8182
- Hubble UI: http://localhost:8088 (if enabled)
API endpoints:

Upload sessions:
- `POST /api/upload/create-session` - Create a new upload session
- `POST /api/upload/files` - Upload files to a session (session_id in request body)
- `GET /api/upload/status` - Get session status (session_id in request body)
- `DELETE /api/upload/files/{filename}` - Delete a file from a session (session_id in request body)

Data loading:
- `POST /api/load` - Load data files into the selected backend (Neo4j, HugeGraph, or MORK)
- `POST /api/convert-schema` - Convert a JSON schema to a backend-specific format

Job management:
- `GET /api/history` - Get complete job history with statistics
- `POST /api/select-job` - Select a specific job for subsequent operations
- `DELETE /api/history/{job_id}` - Remove a specific job and delete its associated files
- `DELETE /api/clear-history` - Remove all jobs and reset the system

Schema and output:
- `GET /api/schema/` - Get the schema for the currently selected job
- `GET /api/schema/{job_id}` - Get schema information for a specific job
- `GET /api/kg-info/{job_id}` - Get comprehensive graph information
- `GET /api/output/{job_id}` - Download all job outputs as a zip archive
- `GET /api/output-file/{job_id}/{filename}` - Download a specific file

System:
- `GET /api/health` - Health check endpoint
- `GET /api/config` - Get current configuration
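For a quick smoke test against a running instance, these endpoints can be exercised from Python with the `requests` package (the job ID is a placeholder; response shapes are not reproduced here):

```python
import requests

BASE = "http://localhost:8000"

# Health check and current configuration
print(requests.get(f"{BASE}/api/health").json())
print(requests.get(f"{BASE}/api/config").json())

# Complete job history with statistics
print(requests.get(f"{BASE}/api/history").json())

# Download all outputs of one job as a zip archive
job_id = "your-job-id"  # placeholder
resp = requests.get(f"{BASE}/api/output/{job_id}")
with open(f"{job_id}.zip", "wb") as f:
    f.write(resp.content)
```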
Supported backends:

Neo4j backend:
- CSV and Cypher file generation
- Direct database integration
- LOAD CSV functionality
- Multi-tenant support

HugeGraph backend:
- Apache TinkerPop Gremlin compatibility
- Distributed graph processing
- RESTful API integration
- High-performance analytics

MORK backend:
- Advanced knowledge representation
- Logic-based reasoning capabilities
- Semantic query processing
- Integration with the MeTTa language
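MeTTa represents knowledge as S-expressions. Purely to illustrate the kind of transformation the pipeline performs, the hypothetical snippet below turns one CSV row into MeTTa-style atoms; the builder's actual serialization format may differ:

```python
import csv
import io

# Hypothetical sample row; the real loader reads uploaded CSV files.
sample = io.StringIO("id,name,age\np1,Alice,30\n")

for row in csv.DictReader(sample):
    node_id = row.pop("id")
    print(f"(node person {node_id})")                 # e.g. (node person p1)
    for key, value in row.items():
        print(f"(property {node_id} {key} {value})")  # e.g. (property p1 name Alice)
```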
Example: loading data through the API:

```bash
# Load data into Neo4j
curl -X POST "http://localhost:8000/api/load" \
-F "files=@data.csv" \
-F "config=$(cat struct.json)" \
-F "schema_json=$(cat schema.json)" \
-F "writer_type=neo4j" \
-F "tenant_id=tenant_123" \
-F "session_id=unique_session_456"
# Load data into MORK
curl -X POST "http://localhost:8000/api/load" \
-F "files=@data.csv" \
-F "config=$(cat struct.json)" \
-F "schema_json=$(cat schema.json)" \
-F "writer_type=mork" \
-F "tenant_id=tenant_123" \
-F "session_id=unique_session_456"# Step 1: Create session
curl -X POST "http://localhost:8000/api/upload/create-session"
# Step 2: Upload files with session ID
curl -X POST "http://localhost:8000/api/upload-files" \
-F "files=@data1.csv" \
-F "files=@data2.csv" \
-F "session_id=unique_session_456"
# Step 3: Submit schema for the same session
curl -X POST "http://localhost:8000/api/submit-schema" \
-F "schema=$(cat schema.json)" \
-F "config=$(cat config.json)" \
-F "session_id=unique_session_456" \
-F "writer_type=mork"# Start development environment
make up-dev
# Make code changes and restart only API
make up-api-dev
# View logs for debugging
make logs-dev
# Clean rebuild when needed
make rebuild-dev
# Stop everything
make down-dev
```

The system implements robust multi-tenancy across all backends:
- Each node and edge is tagged with a tenant ID
- Complete subgraph isolation in shared database instances
- User-specific data separation while maintaining performance
- Queries automatically filtered by tenant context
- Session-based isolation for concurrent operations
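As an illustration of tenant-scoped querying against the Neo4j backend, the sketch below assumes nodes carry a `tenant_id` property; the property name and query shape are assumptions, not the service's exact implementation:

```python
from neo4j import GraphDatabase

# Credentials come from your .env file
driver = GraphDatabase.driver("bolt://localhost:7687",
                              auth=("neo4j", "your_password_here"))

def fetch_tenant_nodes(tenant_id: str, limit: int = 25):
    """Return only the nodes belonging to one tenant's subgraph."""
    query = "MATCH (n) WHERE n.tenant_id = $tenant_id RETURN n LIMIT $limit"
    with driver.session() as session:
        return [rec["n"] for rec in session.run(query, tenant_id=tenant_id, limit=limit)]

print(fetch_tenant_nodes("tenant_123"))
```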
The thread-safety implementation includes:
- File locking prevents data loss during concurrent access
- Atomic operations for multi-file processing
- Session-based isolation for parallel workflows
- Queue-based job processing for high-load scenarios
- Automatic retry mechanisms for lock conflicts
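A minimal sketch of the locking-with-retry pattern, using the POSIX `fcntl` module from the standard library; the actual mechanism in app/utils/file_utils.py may differ:

```python
import fcntl
import json
import time


def append_history(path: str, entry: dict, retries: int = 5) -> None:
    """Append a job record to a shared JSON list under an exclusive lock."""
    for attempt in range(retries):
        try:
            with open(path, "r+") as f:
                # Non-blocking exclusive lock; raises BlockingIOError if held
                fcntl.flock(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
                history = json.load(f)
                history.append(entry)
                f.seek(0)
                json.dump(history, f, indent=2)
                f.truncate()
                return  # lock is released when the file closes
        except BlockingIOError:
            time.sleep(0.1 * (attempt + 1))  # back off, then retry
    raise RuntimeError(f"could not acquire lock on {path}")
```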
Comprehensive error handling includes:
- Thread-safe file operations with automatic retry
- Graceful handling of job deletion with automatic selection updates
- Transaction rollback for data integrity
- Session validation and timeout management
- Detailed error messages for debugging
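As one concrete illustration of the last point, a FastAPI handler can surface detailed errors through HTTPException. This is a self-contained sketch with an in-memory job store, not the code in app/api/jobs.py:

```python
from fastapi import FastAPI, HTTPException

app = FastAPI()

# In-memory stand-in for the real job store (illustrative only)
jobs = {"job-123": {"nodes": 10, "edges": 4}}

@app.delete("/api/history/{job_id}")
def delete_job(job_id: str):
    if job_id not in jobs:
        # A detailed message makes client-side debugging easier
        raise HTTPException(status_code=404, detail=f"Job '{job_id}' not found")
    del jobs[job_id]
    return {"status": "deleted", "job_id": job_id}
```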
The system has been tested with:
- Large-scale datasets for performance validation
- Concurrent user scenarios with session isolation
- Multi-tenant isolation verification across all backends
- Integration testing with annotation services
- Thread safety validation under high load
- MORK backend integration and performance testing
Troubleshooting:

```bash
# Check service status
make logs-dev
# Restart specific service
make up-api-dev
# Clean start (removes all data)
make clean-dev
make up-dev
# View all available commands
make help
```

Common issues:

- Port conflicts: change the ports in your `.env` file
- Permission issues: ensure Docker has the required permissions
- Memory issues: adjust memory settings in `.env` for development
- Build failures: run `make rebuild-dev` to force a rebuild
Completed features:

- Neo4j backend integration for Custom AtomSpace Builder
- Multi-tenancy support with tenant IDs
- Docker containerization
- Thread-safe file operations
- Annotation service integration
- Session-based processing support
- MORK database integration
- Makefile-based development workflow
Roadmap:

- Deploy beta version with user authentication and feedback collection
- Database storage for metadata (replacing static JSON)
- Integrate automated schema inference from multiple data formats
- Build real-time validation and quality assessment pipeline
- Implement containerized deployment strategy for all backends
- Optimize performance for concurrent users and large datasets
- AWS deployment templates
- Advanced graph analytics and ML integration
This project maintains the same license as the original Apache HugeGraph Toolchain project.
Acknowledgments:

- Apache HugeGraph Community
- Neo4j Community
- MORK Database Development Team
- SingularityNET for MeTTa language specification
- FastAPI framework contributors
For issues and feature requests, please create an issue in the repository.