This directory contains documentation for the BERDL purpose-built data lakehouse system.
All source code repositories are hosted in the BERDataLakehouse GitHub organization.
Note: This documentation provides a brief introduction to each core component of the BERDL system. For detailed development and service information, please refer to each repository's README file.
All BERDL services require KBase authentication using a KBase Token. Users must have the BERDL_USER role assigned to their KBase account to access the platform. Admin operations additionally require the CDM_JUPYTERHUB_ADMIN role.
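The role requirements above can be read as a simple rule. The sketch below is only an illustration, not the services' actual implementation: `is_authorized` and its parameters are hypothetical names, and looking up a user's roles from KBase auth with their token is not shown.

```python
# Role names as stated in this document.
BERDL_USER_ROLE = "BERDL_USER"
ADMIN_ROLE = "CDM_JUPYTERHUB_ADMIN"


def is_authorized(roles, admin_operation=False):
    """Return True if the given KBase roles permit the operation.

    Hypothetical helper: `roles` is assumed to be an iterable of role
    names already retrieved for the authenticated user.
    """
    role_set = set(roles)
    # Every user needs the base platform role.
    if BERDL_USER_ROLE not in role_set:
        return False
    # Admin operations require the admin role in addition to the base role.
    if admin_operation and ADMIN_ROLE not in role_set:
        return False
    return True
```

Under this reading, a user holding only `CDM_JUPYTERHUB_ADMIN` without `BERDL_USER` would still be denied, because the admin role is described as an additional requirement.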
BERDL uses a microservices architecture to provide a secure, scalable, and interactive data analysis environment. The core components are dynamic notebook spawning, secure credential management, and an MCP (Model Context Protocol) server for AI-assisted data operations.
graph TD
%% Use subgraphs to organize hierarchically
subgraph Users ["User Layer"]
User([User])
end
subgraph Entry ["Platform Entry"]
JH[BERDL JupyterHub]
NB[Spark Notebook]
end
subgraph Core ["Core Services"]
direction TB
MMS[MinIO Manager Service]
SCM[Spark Cluster Manager]
MCP[Datalake MCP Server]
end
subgraph Admin ["Admin Services"]
TAS[Tenant Access Request]
Slack([Slack])
end
subgraph Compute ["Compute Layer"]
DYNC[Dynamic Spark Cluster]
SM[Shared Static Cluster]
end
subgraph Data ["Data & Metadata"]
S3[MinIO Storage]
HM[Hive Metastore]
Disk[(Storage)]
end
%% Interactions
%% User Entry Flow
User -->|"Browser (Login & UI)"| JH
User -->|Direct API| MCP
%% JupyterHub Internal Flow
JH -->|Proxies UI| NB
JH -->|Init Policy| MMS
JH -->|Trigger Create| SCM
%% Service Logic
SCM -->|Spawns| DYNC
NB -->|Uses| DYNC
%% Notebook Interactions
NB -->|Auth| MMS
NB -->|Query| MCP
%% Admin Flow (Access Requests)
NB -->|Request Access| TAS
TAS -->|Notify| Slack
TAS -->|Add to Group| MMS
%% MCP Logic
MCP -->|Direct/Fallback| SM
MCP -->|Via Hub| DYNC
%% Data Access
NB -->|S3| S3
NB -->|Meta| HM
DYNC -->|Process| S3
SM -->|Process| S3
S3 -.-> Disk
%% Styling
classDef service fill:#f9f,stroke:#333,stroke-width:2px;
classDef storage fill:#ff9,stroke:#333,stroke-width:2px;
classDef compute fill:#cce6ff,stroke:#333,stroke-width:2px;
classDef external fill:#e8e8e8,stroke:#333,stroke-width:1px;
class JH,NB,MMS,SCM,MCP,TAS service;
class S3,HM,Disk storage;
class DYNC,SM compute;
class Slack external;
The following diagram illustrates the build hierarchy and base image dependencies for the BERDL services.
graph TD
%% Base Images
JQ[quay.io/jupyter/pyspark-notebook]
PUB_JH[jupyterhub/jupyterhub]
PY313[python:3.13-slim]
PY311[python:3.11-slim]
%% Internal Base
subgraph Foundation
id1(spark_notebook_base)
end
%% Services
subgraph Services
NB[spark_notebook]
MCP[datalake-mcp-server]
MMS[minio_manager_service]
SCM[spark_cluster_manager]
JH[BERDL_JupyterHub]
TAS[tenant_access_request_service]
end
%% Dynamic Compute
subgraph DynamicCompute ["Dynamic Compute"]
DYNC["Dynamic Spark Cluster (kube_spark_manager_image)"]
end
%% Relations
JQ -->|FROM| id1
id1 -->|FROM| NB
id1 -->|FROM| MCP
NB -->|FROM| DYNC
PUB_JH -->|FROM| JH
PY313 -->|FROM| MMS
PY313 -->|FROM| TAS
PY311 -->|FROM| SCM
%% Styling
classDef external fill:#eee,stroke:#333,stroke-dasharray: 5 5;
classDef internal fill:#cce6ff,stroke:#333,stroke-width:2px;
classDef service fill:#f9f,stroke:#333,stroke-width:2px;
classDef compute fill:#ffcc00,stroke:#333,stroke-width:2px;
class JQ,PUB_JH,PY313,PY311 external;
class id1 internal;
class NB,MCP,MMS,SCM,JH,TAS service;
class DYNC compute;
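One practical use of the build hierarchy is working out which images to rebuild after a base image changes. The sketch below encodes the `FROM` edges of the diagram as a plain mapping. The helper names `build_order` and `needs_rebuild` are hypothetical, and the mapping is assumed to mirror the diagram rather than being read from the actual Dockerfiles.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# image -> base image it is built FROM (edges copied from the diagram)
BASE_IMAGE = {
    "spark_notebook_base": "quay.io/jupyter/pyspark-notebook",
    "spark_notebook": "spark_notebook_base",
    "datalake-mcp-server": "spark_notebook_base",
    "kube_spark_manager_image": "spark_notebook",
    "BERDL_JupyterHub": "jupyterhub/jupyterhub",
    "minio_manager_service": "python:3.13-slim",
    "tenant_access_request_service": "python:3.13-slim",
    "spark_cluster_manager": "python:3.11-slim",
}


def build_order(base_image):
    """Return all images ordered so that each base precedes its dependents."""
    graph = {image: {base} for image, base in base_image.items()}
    return list(TopologicalSorter(graph).static_order())


def needs_rebuild(changed, base_image):
    """Return the set of images downstream of `changed`."""
    affected = set()
    frontier = [changed]
    while frontier:
        current = frontier.pop()
        for image, base in base_image.items():
            if base == current and image not in affected:
                affected.add(image)
                frontier.append(image)
    return affected
```

For example, `needs_rebuild("spark_notebook_base", BASE_IMAGE)` returns `spark_notebook`, `datalake-mcp-server`, and `kube_spark_manager_image`, while a change to `python:3.11-slim` affects only `spark_cluster_manager`.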
The following diagram illustrates the internal Python package dependencies.
graph TD
%% Clients
SMC[cdm-spark-manager-client]
MMSC[minio-manager-service-client]
MCPC[datalake-mcp-server-client]
%% Base Package
subgraph Base ["spark_notebook_base"]
PNB[berdl-notebook-python-base]
end
%% Service Implementations
subgraph NotebookUtils ["spark_notebook"]
NU[berdl_notebook_utils]
end
subgraph MCPServer ["datalake-mcp-server"]
MCP[datalake-mcp-server]
end
subgraph JupyterHub ["BERDL_JupyterHub"]
JH[berdl-jupyterhub]
end
%% Dependencies
PNB -->|Dep| SMC
PNB -->|Dep| MMSC
PNB -->|Dep| MCPC
NU -->|Dep| PNB
MCP -->|Dep| NU
JH -->|Dep| SMC
%% Styling
classDef client fill:#ffedea,stroke:#cc0000,stroke-width:1px;
classDef pkg fill:#e1f5fe,stroke:#01579b,stroke-width:2px;
class SMC,MMSC,MCPC client;
class PNB,NU,MCP,JH pkg;
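The package diagram shows only direct dependencies. To see everything a package pulls in indirectly, the edges can be followed transitively. The sketch below is a minimal illustration: `DEPENDS_ON` is assumed to mirror the diagram, and `transitive_deps` is a hypothetical helper, not part of any BERDL package.

```python
# package -> packages it depends on directly (edges copied from the diagram)
DEPENDS_ON = {
    "berdl-notebook-python-base": [
        "cdm-spark-manager-client",
        "minio-manager-service-client",
        "datalake-mcp-server-client",
    ],
    "berdl_notebook_utils": ["berdl-notebook-python-base"],
    "datalake-mcp-server": ["berdl_notebook_utils"],
    "berdl-jupyterhub": ["cdm-spark-manager-client"],
}


def transitive_deps(package, depends_on):
    """Return every package reachable from `package` via dependency edges."""
    seen = set()
    stack = list(depends_on.get(package, []))
    while stack:
        dep = stack.pop()
        if dep not in seen:
            seen.add(dep)
            # Follow the dependency's own dependencies, if it has any.
            stack.extend(depends_on.get(dep, []))
    return seen
```

Under these edges, `datalake-mcp-server` reaches all three client packages through `berdl_notebook_utils` and `berdl-notebook-python-base`, whereas `berdl-jupyterhub` reaches only `cdm-spark-manager-client`.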
| Service | Description | Documentation | Repository |
|---|---|---|---|
| JupyterHub | Manages user sessions and spawns individual notebook servers. | BERDL JupyterHub | Repo |
| Spark Notebook | User's personal workspace with Spark pre-configured. | Spark Notebook | Repo |
| Spark Notebook Base | Foundational Docker image with PySpark and common dependencies. | Spark Notebook Base | Repo |
| Datalake MCP Server | FastAPI Data API with MCP layer for AI interactions and direct queries. | Datalake MCP Service | Repo |
| MinIO Manager Service | Handles dynamic credentials and IAM policies for secure data access. | MinIO Manager Service | Repo |
| Spark Cluster Manager | API for managing dynamic, personal Spark clusters on Kubernetes (primary for users). | Spark Cluster Manager | Repo |
| Hive Metastore | Stores metadata for Delta Lake tables. | Hive Metastore | Repo |
| Spark Cluster | Spark master/worker image for static and dynamic clusters. | Spark Cluster | Repo |
| BERDL Access Request Extension | JupyterLab extension providing UI for tenant access requests. | Access Request Extension | Repo |
| Tenant Access Request Service | Self-service Slack workflow for users to request access to tenant groups. | Tenant Access Request Service | Repo |