# BERDL System Documentation

This directory contains documentation for BERDL, a purpose-built data lakehouse system.

All source code repositories are located in the BERDataLakehouse GitHub Organization.

**Note:** This documentation provides a brief introduction to each core component of the BERDL system. For detailed development and service information, please refer to each repository's README file.

## Authentication

All BERDL services require KBase authentication using a KBase Token. Users must have the BERDL_USER role assigned to their KBase account to access the platform. Admin operations additionally require the CDM_JUPYTERHUB_ADMIN role.
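As a minimal sketch of what a client would do, the helper below builds the authentication header for a KBase token. The header scheme and the environment variable name are assumptions for illustration; consult each service repository's README for the exact API details.

```python
"""Sketch: attaching a KBase token to requests against BERDL services.

The header format and the KBASE_AUTH_TOKEN variable name are assumptions,
not confirmed by the BERDL documentation.
"""
import os


def kbase_auth_headers(token: str) -> dict[str, str]:
    """Build the Authorization header carrying the KBase token."""
    return {"Authorization": token}


# Example usage: the token is typically read from the environment rather
# than hard-coded, then passed to an HTTP client such as requests/httpx:
#   requests.get(service_url, headers=kbase_auth_headers(token))
token = os.environ.get("KBASE_AUTH_TOKEN", "example-token")
headers = kbase_auth_headers(token)
```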

## System Architecture

BERDL utilizes a microservices architecture to provide a secure, scalable, and interactive data analysis environment. The core components include dynamic notebook spawning, secure credential management, and an MCP (Model Context Protocol) server for AI-assisted data operations.

```mermaid
graph TD
    %% Use subgraphs to organize hierarchically

    subgraph Users ["User Layer"]
        User([User])
    end

    subgraph Entry ["Platform Entry"]
        JH[BERDL JupyterHub]
        NB[Spark Notebook]
    end

    subgraph Core ["Core Services"]
        direction TB
        MMS[MinIO Manager Service]
        SCM[Spark Cluster Manager]
        MCP[Datalake MCP Server]
    end

    subgraph Admin ["Admin Services"]
        TAS[Tenant Access Request]
        Slack([Slack])
    end

    subgraph Compute ["Compute Layer"]
        DYNC[Dynamic Spark Cluster]
        SM[Shared Static Cluster]
    end

    subgraph Data ["Data & Metadata"]
        S3[MinIO Storage]
        HM[Hive Metastore]
        Disk[(Storage)]
    end

    %% Interactions

    %% User Entry Flow
    User -->|"Browser (Login & UI)"| JH
    User -->|Direct API| MCP

    %% JupyterHub Internal Flow
    JH -->|Proxies UI| NB
    JH -->|Init Policy| MMS
    JH -->|Trigger Create| SCM

    %% Service Logic
    SCM -->|Spawns| DYNC
    NB -->|Uses| DYNC

    %% Notebook Interactions
    NB -->|Auth| MMS
    NB -->|Query| MCP

    %% Admin Flow (Access Requests)
    NB -->|Request Access| TAS
    TAS -->|Notify| Slack
    TAS -->|Add to Group| MMS

    %% MCP Logic
    MCP -->|Direct/Fallback| SM
    MCP -->|Via Hub| DYNC

    %% Data Access
    NB -->|S3| S3
    NB -->|Meta| HM
    DYNC -->|Process| S3
    SM -->|Process| S3
    S3 -.-> Disk

    %% Styling
    classDef service fill:#f9f,stroke:#333,stroke-width:2px;
    classDef storage fill:#ff9,stroke:#333,stroke-width:2px;
    classDef compute fill:#cce6ff,stroke:#333,stroke-width:2px;
    classDef external fill:#e8e8e8,stroke:#333,stroke-width:1px;

    class JH,NB,MMS,SCM,MCP,TAS service;
    class S3,HM,Disk storage;
    class DYNC,SM compute;
    class Slack external;
```

## Container Dependency Architecture

The following diagram illustrates the build hierarchy and base image dependencies for the BERDL services.

```mermaid
graph TD
    %% Base Images
    JQ[quay.io/jupyter/pyspark-notebook]
    PUB_JH[jupyterhub/jupyterhub]
    PY313[python:3.13-slim]
    PY311[python:3.11-slim]

    %% Internal Base
    subgraph Foundation
        id1(spark_notebook_base)
    end

    %% Services
    subgraph Services
        NB[spark_notebook]
        MCP[datalake-mcp-server]
        MMS[minio_manager_service]
        SCM[spark_cluster_manager]
        JH[BERDL_JupyterHub]
        TAS[tenant_access_request_service]
    end

    %% Dynamic Compute
    subgraph DynamicCompute ["Dynamic Compute"]
        DYNC["Dynamic Spark Cluster (kube_spark_manager_image)"]
    end

    %% Relations
    JQ -->|FROM| id1

    id1 -->|FROM| NB
    id1 -->|FROM| MCP

    NB -->|FROM| DYNC

    PUB_JH -->|FROM| JH
    PY313 -->|FROM| MMS
    PY313 -->|FROM| TAS
    PY311 -->|FROM| SCM

    %% Styling
    classDef external fill:#eee,stroke:#333,stroke-dasharray: 5 5;
    classDef internal fill:#cce6ff,stroke:#333,stroke-width:2px;
    classDef service fill:#f9f,stroke:#333,stroke-width:2px;
    classDef compute fill:#ffcc00,stroke:#333,stroke-width:2px;

    class JQ,PUB_JH,PY313,PY311 external;
    class id1 internal;
    class NB,MCP,MMS,SCM,JH,TAS service;
    class DYNC compute;
```
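To illustrate the build chain above, a downstream service image would start `FROM` the internal base, which itself starts from the public PySpark notebook image. The registry path and tag below are assumptions for illustration; the real values live in each repository's Dockerfile.

```dockerfile
# Hypothetical sketch of the image hierarchy shown above.
# Foundation layer (built elsewhere):
#   FROM quay.io/jupyter/pyspark-notebook  ->  spark_notebook_base

# Service layer: e.g. the spark_notebook image extends the internal base.
# The registry path and tag are illustrative, not the published names.
FROM ghcr.io/berdatalakehouse/spark_notebook_base:latest

# ...service-specific dependencies and configuration follow here...
```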

## Python Dependency Architecture

The following diagram illustrates the internal Python package dependencies.

```mermaid
graph TD
    %% Clients
    SMC[cdm-spark-manager-client]
    MMSC[minio-manager-service-client]
    MCPC[datalake-mcp-server-client]

    %% Base Package
    subgraph Base ["spark_notebook_base"]
        PNB[berdl-notebook-python-base]
    end

    %% Service Implementations
    subgraph NotebookUtils ["spark_notebook"]
        NU[berdl_notebook_utils]
    end

    subgraph MCPServer ["datalake-mcp-server"]
        MCP[datalake-mcp-server]
    end

    subgraph JupyterHub ["BERDL_JupyterHub"]
        JH[berdl-jupyterhub]
    end

    %% Dependencies
    PNB -->|Dep| SMC
    PNB -->|Dep| MMSC
    PNB -->|Dep| MCPC

    NU -->|Dep| PNB
    MCP -->|Dep| NU

    JH -->|Dep| SMC

    %% Styling
    classDef client fill:#ffedea,stroke:#cc0000,stroke-width:1px;
    classDef pkg fill:#e1f5fe,stroke:#01579b,stroke-width:2px;

    class SMC,MMSC,MCPC client;
    class PNB,NU,MCP,JH pkg;
```
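The package layering above would surface in each project's dependency metadata. As a sketch, a `pyproject.toml` for the base package might declare the three clients; the exact specifiers and version pins are assumptions, so check each repository's own `pyproject.toml`.

```toml
# Hypothetical dependency declaration mirroring the diagram above;
# the real specifiers live in the spark_notebook_base repository.
[project]
name = "berdl-notebook-python-base"
dependencies = [
    "cdm-spark-manager-client",
    "minio-manager-service-client",
    "datalake-mcp-server-client",
]
```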

## Core Components

| Service | Description | Documentation Repository |
| --- | --- | --- |
| JupyterHub | Manages user sessions and spawns individual notebook servers. | BERDL JupyterHub Repo |
| Spark Notebook | User's personal workspace with Spark pre-configured. | Spark Notebook Repo |
| Spark Notebook Base | Foundational Docker image with PySpark and common dependencies. | Spark Notebook Base Repo |
| Datalake MCP Server | FastAPI data API with an MCP layer for AI interactions and direct queries. | Datalake MCP Service Repo |
| MinIO Manager Service | Handles dynamic credentials and IAM policies for secure data access. | MinIO Manager Service Repo |
| Spark Cluster Manager | API for managing dynamic, personal Spark clusters on K8s (primary for users). | Spark Cluster Manager Repo |
| Hive Metastore | Stores metadata for Delta Lake tables. | Hive Metastore Repo |
| Spark Cluster | Spark master/worker image for static and dynamic clusters. | Spark Cluster Repo |
| BERDL Access Request Extension | JupyterLab extension providing the UI for tenant access requests. | Access Request Extension Repo |
| Tenant Access Request Service | Self-service Slack workflow for users to request access to tenant groups. | Tenant Access Request Service Repo |
