pcalnon/juniper-data

Juniper Data

Dataset generation and management service for the Juniper ecosystem.

Overview

Juniper Data provides a centralized service for generating, storing, and serving datasets used by the Juniper neural network projects. It ships eight generator types, ranging from synthetic benchmarks such as the classic two-spiral classification problem to imported data such as MNIST and ARC-AGI.

Ecosystem Compatibility

This service is part of the Juniper ecosystem. Verified compatible versions:

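| juniper-data | juniper-cascor | juniper-canopy | data-client | cascor-client | cascor-worker |
|--------------|----------------|----------------|-------------|---------------|---------------|
| 0.4.x        | 0.3.x          | 0.2.x          | >=0.3.1     | >=0.1.0       | >=0.1.0       |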
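
A set of installed versions can be checked against the table above with a small matcher. The sketch below is illustrative only: `matches` and `incompatible` are hypothetical helper names, not part of any Juniper package, and the matcher handles just the two spec forms used in the table (`0.4.x` and `>=0.3.1`) with purely numeric version strings.

```python
# Spec strings copied from the compatibility table; keys are the column headers.
COMPATIBLE = {
    "juniper-data": "0.4.x",
    "juniper-cascor": "0.3.x",
    "juniper-canopy": "0.2.x",
    "data-client": ">=0.3.1",
    "cascor-client": ">=0.1.0",
    "cascor-worker": ">=0.1.0",
}


def parse_version(text):
    """Turn '0.3.1' into (0, 3, 1). Pre-release suffixes are not handled."""
    return tuple(int(part) for part in text.split("."))


def matches(version, spec):
    """Return True if `version` satisfies a '>=a.b.c' or 'a.b.x' spec."""
    if spec.startswith(">="):
        return parse_version(version) >= parse_version(spec[2:])
    if spec.endswith(".x"):
        prefix = parse_version(spec[:-2])
        return parse_version(version)[: len(prefix)] == prefix
    return parse_version(version) == parse_version(spec)


def incompatible(installed):
    """List the components in `installed` that fall outside the table."""
    return sorted(
        name
        for name, version in installed.items()
        if name in COMPATIBLE and not matches(version, COMPATIBLE[name])
    )
```

For example, `incompatible({"juniper-data": "0.4.2", "data-client": "0.3.0"})` reports only `data-client`, because `0.3.0` is below the `>=0.3.1` floor.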

For full-stack Docker deployment and integration tests, see juniper-deploy.

Architecture

juniper-data is the foundational data layer of the Juniper Project ecosystem. juniper-cascor and juniper-canopy both call juniper-data to generate and retrieve datasets.

┌─────────────────────┐     REST+WS      ┌──────────────────────┐
│   juniper-canopy    │ ◄──────────────► │    juniper-cascor    │
│   Dashboard         │                  │    Training Svc      │
│   Port 8050         │                  │    Port 8200         │
└──────────┬──────────┘                  └──────────┬───────────┘
           │ REST                                   │ REST
           ▼                                        ▼
┌──────────────────────────────────────────────────────────────┐
│                      JuniperData  ◄── (this service)         │
│                   Dataset Service  ·  Port 8100              │
└──────────────────────────────────────────────────────────────┘

Data contract: datasets are served as NPZ archives with keys X_train, y_train, X_test, y_test, X_full, y_full (all float32).
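
A consumer can verify the data contract before training. The following is a minimal sketch, assuming NumPy is installed; `validate_npz` and `make_example_archive` are hypothetical helpers written for this example, and the array shapes in the sample archive are placeholders rather than anything the service guarantees.

```python
import io

import numpy as np

# Keys named in the data contract above.
REQUIRED_KEYS = ("X_train", "y_train", "X_test", "y_test", "X_full", "y_full")


def validate_npz(source):
    """Load an NPZ archive and check keys and dtypes against the contract."""
    with np.load(source) as archive:
        missing = [key for key in REQUIRED_KEYS if key not in archive.files]
        if missing:
            raise ValueError(f"NPZ archive is missing keys: {missing}")
        arrays = {key: archive[key] for key in REQUIRED_KEYS}
    wrong_dtype = [key for key, arr in arrays.items() if arr.dtype != np.float32]
    if wrong_dtype:
        raise TypeError(f"expected float32 arrays, got other dtypes for: {wrong_dtype}")
    return arrays


def make_example_archive():
    """Build an in-memory archive with placeholder shapes (assumption)."""
    rng = np.random.default_rng(0)
    X = rng.random((10, 2), dtype=np.float32)
    y = rng.integers(0, 2, size=(10, 1)).astype(np.float32)
    buffer = io.BytesIO()
    np.savez(
        buffer,
        X_train=X[:8], y_train=y[:8],
        X_test=X[8:], y_test=y[8:],
        X_full=X, y_full=y,
    )
    buffer.seek(0)
    return buffer


arrays = validate_npz(make_example_archive())
```

`validate_npz` accepts a file path or any file-like object, so the same check works on a downloaded artifact saved to disk.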

Related Services

| Service             | Relationship                                | Environment Variable                   |
|---------------------|---------------------------------------------|----------------------------------------|
| juniper-cascor      | Consumes JuniperData for training datasets  | JUNIPER_DATA_URL=http://localhost:8100 |
| juniper-canopy      | Consumes JuniperData for visualization data | JUNIPER_DATA_URL=http://localhost:8100 |
| juniper-data-client | PyPI client library for this service        | pip install juniper-data-client        |

Service Configuration

| Variable               | Default | Description    |
|------------------------|---------|----------------|
| JUNIPER_DATA_HOST      | 0.0.0.0 | Listen address |
| JUNIPER_DATA_PORT      | 8100    | Service port   |
| JUNIPER_DATA_LOG_LEVEL | INFO    | Log verbosity  |
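
The variables above can be resolved with their documented defaults as in the sketch below. `ServiceSettings` and `load_settings` are hypothetical names used for illustration; how the service itself reads its configuration is not shown in this README.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceSettings:
    """Settings mirroring the table above, with the documented defaults."""
    host: str = "0.0.0.0"
    port: int = 8100
    log_level: str = "INFO"


def load_settings(env=None):
    """Read JUNIPER_DATA_* variables from `env` (defaults to os.environ)."""
    env = os.environ if env is None else env
    return ServiceSettings(
        host=env.get("JUNIPER_DATA_HOST", "0.0.0.0"),
        port=int(env.get("JUNIPER_DATA_PORT", "8100")),
        log_level=env.get("JUNIPER_DATA_LOG_LEVEL", "INFO").upper(),
    )
```

Passing a plain dictionary instead of `os.environ` keeps the function easy to test, for example `load_settings({"JUNIPER_DATA_PORT": "9000"})`.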

Docker Deployment

# Full stack with all three services:
git clone https://github.com/pcalnon/juniper-deploy.git  # (private repository)
cd juniper-deploy && docker compose up --build

Dependency Lockfile

The requirements.lock file pins exact dependency versions for reproducible Docker builds. The pyproject.toml retains flexible >= ranges for local development.

Regenerate after changing dependencies in pyproject.toml:

uv pip compile pyproject.toml --extra api --extra observability -o requirements.lock

Installation

Basic Installation

pip install -e .

With API Support

pip install -e ".[api]"

Development Installation

pip install -e ".[dev]"

Full Installation

pip install -e ".[all]"

Quick Start

Generate a Spiral Dataset

from juniper_data.generators.spiral import SpiralGenerator

generator = SpiralGenerator()
dataset = generator.generate(n_points=100, n_spirals=2, noise=0.1)

Start the API Server

uvicorn juniper_data.api.app:app --reload

API Endpoints

| Endpoint                     | Method | Description                                          |
|------------------------------|--------|------------------------------------------------------|
| /v1/health                   | GET    | Health check                                         |
| /v1/health/live              | GET    | Liveness probe                                       |
| /v1/health/ready             | GET    | Readiness probe (checks storage)                     |
| /v1/generators               | GET    | List all generators with schemas                     |
| /v1/generators/{name}/schema | GET    | Get parameter schema for a generator                 |
| /v1/datasets                 | POST   | Create dataset (or return cached dataset)            |
| /v1/datasets                 | GET    | List dataset IDs                                     |
| /v1/datasets/filter          | GET    | Filter metadata by generator/tags/date/name/version  |
| /v1/datasets/stats           | GET    | Aggregate dataset statistics                         |
| /v1/datasets/versions        | GET    | List all versions for a logical dataset name         |
| /v1/datasets/latest          | GET    | Get latest version for a logical dataset name        |
| /v1/datasets/batch-create    | POST   | Create multiple datasets                             |
| /v1/datasets/batch-delete    | POST   | Delete multiple datasets                             |
| /v1/datasets/batch-tags      | PATCH  | Update tags on multiple datasets                     |
| /v1/datasets/batch-export    | POST   | Export multiple datasets as ZIP                      |
| /v1/datasets/cleanup-expired | POST   | Delete expired datasets                              |
| /v1/datasets/{id}            | GET    | Get dataset metadata                                 |
| /v1/datasets/{id}            | DELETE | Delete a dataset                                     |
| /v1/datasets/{id}/artifact   | GET    | Download NPZ artifact                                |
| /v1/datasets/{id}/preview    | GET    | Preview first N samples as JSON                      |
| /v1/datasets/{id}/tags       | PATCH  | Add/remove tags on one dataset                       |

See docs/api/JUNIPER_DATA_API.md for full endpoint documentation, including filtering, batch operations, and tagging.
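
For callers that do not use juniper-data-client, the GET endpoints can be reached with the standard library alone. The sketch below only builds URLs from the paths listed above; `build_url`, `artifact_url`, and `fetch_json` are hypothetical helpers, and response shapes are deliberately not assumed (see the API documentation for those).

```python
import json
import os
import urllib.request
from urllib.parse import quote, urlencode

DEFAULT_URL = "http://localhost:8100"  # matches the documented default port


def build_url(path, base_url=None, **query):
    """Join the service base URL, an endpoint path, and optional query params."""
    base = (base_url or os.environ.get("JUNIPER_DATA_URL", DEFAULT_URL)).rstrip("/")
    url = f"{base}{path}"
    if query:
        url += "?" + urlencode(query)
    return url


def artifact_url(dataset_id, base_url=None):
    """URL of the NPZ artifact for one dataset; the ID is percent-encoded."""
    return build_url(f"/v1/datasets/{quote(dataset_id, safe='')}/artifact", base_url)


def fetch_json(url, timeout=5.0):
    """GET a JSON endpoint such as /v1/health (requires a running service)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)
```

With a server running locally, `fetch_json(build_url("/v1/health"))` would query the health check, and `build_url("/v1/datasets/versions", name="spirals")` produces the version-history URL for a dataset named `spirals` (an example name).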

Named Dataset Versioning

POST /v1/datasets supports logical names for versioned datasets:

  • Set name to group related datasets into a version series.
  • Persisted creates that share a name auto-increment meta.dataset_version (1, 2, 3, ...).
  • Repeating an identical request returns the cached dataset and keeps its existing version.
  • Use GET /v1/datasets/versions?name=<dataset_name> to view history and GET /v1/datasets/latest?name=<dataset_name> to resolve the latest.
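
The versioning rules above can be modeled in a few lines. This is a toy in-memory model, not the service's implementation: `ToyVersionRegistry` is a hypothetical class, and hashing the request to detect an "identical request" is an assumption made for the example.

```python
import hashlib
import json


class ToyVersionRegistry:
    """In-memory model of named dataset versioning (illustrative only)."""

    def __init__(self):
        self._by_fingerprint = {}  # request fingerprint -> stored record
        self._series = {}          # logical name -> records in version order

    @staticmethod
    def _fingerprint(name, request):
        # Assumption: two requests are "identical" if name and params match.
        payload = json.dumps({"name": name, "request": request}, sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def create(self, name, request):
        """Return the cached record for a repeat request, else add a version."""
        key = self._fingerprint(name, request)
        if key in self._by_fingerprint:
            return self._by_fingerprint[key]  # cached: version is unchanged
        series = self._series.setdefault(name, [])
        record = {"id": key[:12], "name": name, "dataset_version": len(series) + 1}
        series.append(record)
        self._by_fingerprint[key] = record
        return record

    def versions(self, name):
        """History for a logical name, oldest first."""
        return list(self._series.get(name, []))

    def latest(self, name):
        """Most recent record for a logical name, or None if there is none."""
        series = self._series.get(name, [])
        return series[-1] if series else None
```

Creating `{"n_points": 100}` and then `{"n_points": 200}` under the same name yields versions 1 and 2; repeating the first request returns the version 1 record again instead of creating version 3.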

Project Structure

juniper-data/
├── juniper_data/
│   ├── core/           # Core functionality and base classes
│   ├── generators/     # Dataset generators (8 types)
│   │   ├── spiral/     # Multi-spiral classification
│   │   ├── xor/        # XOR classification
│   │   ├── gaussian/   # Mixture of Gaussians
│   │   ├── circles/    # Concentric circles
│   │   ├── checkerboard/ # 2D checkerboard pattern
│   │   ├── csv_import/ # CSV/JSON file import
│   │   ├── mnist/      # MNIST / Fashion-MNIST
│   │   └── arc_agi/    # ARC-AGI visual reasoning
│   ├── storage/        # Dataset persistence layer
│   ├── api/            # FastAPI application
│   │   └── routes/     # API route handlers
│   └── tests/          # Test suite
│       ├── unit/       # Unit tests
│       └── integration/ # Integration tests
├── pyproject.toml      # Project configuration
└── README.md           # This file

Development

Running Tests

pytest

Running Tests with Coverage

pytest --cov=juniper_data --cov-report=html

Code Formatting

ruff format juniper_data tests
ruff check --fix juniper_data tests

Type Checking

mypy juniper_data

Juniper Ecosystem

| Repository            | Description                            |
|-----------------------|----------------------------------------|
| juniper-data          | Dataset generation service (this repo) |
| juniper-cascor        | CasCor neural network training service |
| juniper-canopy        | Real-time monitoring dashboard         |
| juniper-data-client   | PyPI: juniper-data-client              |
| juniper-cascor-client | PyPI: juniper-cascor-client            |
| juniper-cascor-worker | PyPI: juniper-cascor-worker            |

License

MIT License - Copyright (c) 2024-2026 Paul Calnon
