aporb/data-science-learning-handbook
Federal Data Science Handbook

A practitioner's guide to data science on federal government platforms — and the first reference resource for a regulated domain designed to work natively with AI coding agents.

License: MIT + CC BY 4.0 · Python 3.11+ · Agent-Ready

Read online · Agent Integration · Chapters · Platform Guides · Agent Commands · Docker Environment



The Problem

We spend billions on federal data platforms. We hire analysts with degrees and certifications. And then we hand them a login and tell them to figure it out.

Commercial tutorials assume you have pip install, unrestricted internet access, and a cloud account you control. None of that is true in a DoD environment where CAC authentication, Impact Level restrictions, ATO processes, and air-gapped networks shape every technical decision.

The average federal data analyst spends their first six months learning the platform, not doing the mission work. That's a capability gap and an absurd waste of money.

This handbook closes it. Thirteen chapters covering the full data science lifecycle — environment setup through generative AI — grounded in how the work actually gets done on the five platforms where federal data science happens. Every code example is written to run in the constrained environments it describes, not on a local machine with unconstrained internet.


🤖 Agent-Ready — Works With Your AI Coding Agent

Clone this repo. Open it in Claude Code, Cursor, OpenCode, or Cline. Your AI agent now understands clearances, platform constraints, and compliance requirements — without you explaining them.

We believe this is the first reference resource for a regulated industry domain that ships native, multi-platform AI agent configuration — slash commands, workflow definitions, and structured context files — alongside the knowledge itself. If we're wrong, we want to know.

What your agent gets

| Capability | What It Does |
| --- | --- |
| `/compliance-check` | Reviews your code against NIST 800-53, DoD AI Ethics, FedRAMP, and IL requirements. Outputs a structured severity table with remediation pointing to specific handbook sections. |
| `/generate-federal-code` | Generates platform-appropriate Python with correct headers, security patterns, and IL-level constraints. Knows that Databricks doesn't allow pip install in cells, that Foundry uses `palantir_models`, and that IL4+ means no external API calls. |
| `/teach` | Interactive tutor mode. Opens with the chapter's narrative hook, walks through concepts with code, tracks which learning objectives you've covered. |

Supported agent platforms

| AI Coding Agent | Config File | Auto-loaded? |
| --- | --- | --- |
| Claude Code | `CLAUDE.md` + `.claude/commands/` | Yes — on session start |
| Cursor | `.cursorrules` | Yes — on project open |
| OpenCode | `.opencode/config.yaml` + `.opencode/commands/` | Yes — on project open |
| Cline | `.clinerules/` + `.clinerules/workflows/` | Yes — on session start |
| Any agent | `AGENTS.md` | Read on first query |

How it works: The agent interface layer encodes 96,000 words of non-inferable federal domain knowledge into structured context files — what Impact Level means, why IL4+ prohibits external API calls, what CAC/PIV authentication requires, which packages are available on each platform. Your agent doesn't need to hallucinate or ask you to explain your environment. It reads the handbook.


Get Started in 30 Seconds

New to federal data science? Read chapters 1 through 4 in order. They cover the environment, access model, where data lives, and how to work with it.

Using an AI coding agent? Clone this repo, open it in your agent's IDE, and use the pre-built commands above. The agent picks up context automatically.

Switching platforms? Go directly to the platform guide for your new environment. Each is self-contained.

Need a specific capability? Jump to the relevant chapter (ML, MLOps, visualization, deployment, ethics). Each includes platform-specific implementation notes.

Building an AI/LLM application? Chapter 13 and the Palantir AIP guide cover the current landscape. Read chapter 12 (Ethics and Governance) in parallel.


Who This Is For

  • Junior analysts onboarding to any of the five platforms — skip the 18-month learning curve
  • Team leads building data science practices inside DoD programs
  • GovCon firms winning data task orders and needing to stand up teams fast
  • AI coding agent users who want their agent to understand federal constraints without re-explaining them every session
  • Anyone who's ever said "I can't find good training for [federal platform]"

Chapters

| # | Title | What You'll Learn |
| --- | --- | --- |
| 01 | Introduction to Data Science in Government | Clearances, CAC auth, Impact Levels, ATO — everything that shapes the work before you write code |
| 02 | Python and R Foundations | Air-gapped pip mirrors, conda on IL4/IL5, and the reality of getting a working environment on each platform |
| 03 | Data Acquisition | Where federal data lives — USASpending, SAM.gov, data.gov — and how to pull it programmatically |
| 04 | Data Wrangling and Cleaning | 47 million rows of procurement data that a program office called "analysis-ready" — pandas, Spark, and Delta Lake at scale |
| 05 | Exploratory Data Analysis | EDA without a data dictionary, on a platform that may not support interactive notebooks |
| 06 | Supervised Machine Learning | Building classifiers on DoD data: feature engineering on MILSTRIP, XGBoost on Databricks, and what accuracy means in a briefing |
| 07 | Unsupervised Machine Learning | Anomaly detection on GFEBS transactions, clustering readiness data, and turning unsupervised results into actionable findings |
| 08 | Deep Learning and Neural Networks | Object detection on drone video at 30fps with a 400ms inference budget — deep learning in constrained federal environments |
| 09 | MLOps and Production Pipelines | MLflow, model registries, drift detection, and the ATO implications of updating a production model |
| 10 | Visualization and Dashboards | Qlik, Advana dashboards, Databricks SQL — design principles that separate briefing-ready visuals from data art |
| 11 | Deployment and Scaling | Containers, artifact registries, API gateways, and an ATO process that treats every deployment as a risk event |
| 12 | Ethics, Governance, and Compliance | DoD AI Ethics Principles, NIST AI RMF, bias auditing, and what responsible AI governance looks like on an active program |
| 13 | Advanced Topics — GenAI, RAG, and LLMs | RAG at IL4/IL5, Palantir AIP Logic, fine-tuning on classified data, and the gap between commercial LLMs and federal deployments |

Every chapter includes working Python code examples and hands-on exercises with solutions.
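
As a taste of the chapter 3 material: pulling award data from the public USASpending API is a plain HTTPS POST with a JSON filter payload, using nothing beyond the standard library. The endpoint below is the publicly documented v2 search endpoint, but treat the specific filter keys and field names as assumptions to verify against the API docs before relying on them:

```python
import json
import urllib.request

# Public endpoint documented at api.usaspending.gov; no authentication required.
USASPENDING_URL = "https://api.usaspending.gov/api/v2/search/spending_by_award/"

def build_award_search_payload(fiscal_year: int, limit: int = 10) -> dict:
    """Build a spending_by_award request body (shape per the public v2 docs)."""
    return {
        "filters": {
            # Federal fiscal years run 1 Oct to 30 Sep.
            "time_period": [{
                "start_date": f"{fiscal_year - 1}-10-01",
                "end_date": f"{fiscal_year}-09-30",
            }],
            "award_type_codes": ["A", "B", "C", "D"],  # contract award types
        },
        "fields": ["Award ID", "Recipient Name", "Award Amount"],
        "limit": limit,
        "page": 1,
    }

def fetch_awards(payload: dict) -> dict:
    """POST the payload. Needs outbound internet, so this is a local/IL2 pattern."""
    req = urllib.request.Request(
        USASPENDING_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)
```

On IL4+ networks the same pattern runs against internal mirrors instead of the public endpoint; the chapter covers both paths.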


Platform Guides

| Platform | IL Levels | What It Covers |
| --- | --- | --- |
| Advana | IL4, IL5 | DoD enterprise analytics — JupyterHub, Qlik, 100+ data sources, 100K+ users |
| Databricks | IL2–IL5 | Unity Catalog, Delta Lake, MLflow on AWS GovCloud and Azure Government |
| Navy Jupiter | IL4, IL5 | Department of the Navy — bronze/silver/gold data tiers, Navy-specific constraints |
| Palantir AIP / Foundry | IL4–IL6 | Ontology-based analytics, Pipeline Builder, AIP Logic for LLM workflows |
| Qlik | IL2, IL4 | Associative engine for federal BI — NIPRNet, Advana-hosted, and GovCloud |

Each guide is self-contained: access, setup, development environment, code patterns, and deployment.


Local Development Environment

This handbook ships with a Docker Compose stack that mirrors federal platform constraints locally:

| Service | Port | Purpose |
| --- | --- | --- |
| Jupyter | 8888 | Development notebooks (+ Streamlit, Dash) |
| MLflow | 5000 | Experiment tracking and model registry |
| PostgreSQL | 5432 | Relational database |
| Redis | 6379 | Caching and session store |
| Nginx | 80/443 | Reverse proxy with TLS |
| Prometheus | 9090 | Metrics collection |
| Grafana | 3000 | Monitoring dashboards |
| Vault | 8200 | Secret management |
| CAC-auth | 8001 | CAC/PIV authentication simulator |

    cp .env.example .env && docker compose up -d

See docs/LOCAL_ENVIRONMENT.md for full setup instructions.
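
Once the stack is up, a quick stdlib-only check can confirm each service is listening. The port map comes straight from the table above (with Nginx on 443); the helper itself is illustrative, not part of the repository:

```python
import socket

# Ports as published by the handbook's docker-compose stack.
SERVICES = {
    "jupyter": 8888,
    "mlflow": 5000,
    "postgres": 5432,
    "redis": 6379,
    "nginx": 443,
    "prometheus": 9090,
    "grafana": 3000,
    "vault": 8200,
    "cac-auth": 8001,
}

def check_stack(host: str = "localhost", timeout: float = 2.0) -> dict[str, bool]:
    """Return {service: reachable} by attempting a TCP connect to each port."""
    status = {}
    for name, port in SERVICES.items():
        try:
            with socket.create_connection((host, port), timeout=timeout):
                status[name] = True
        except OSError:
            status[name] = False
    return status
```

Run `check_stack()` after `docker compose up -d`; any `False` entry points at a container worth inspecting with `docker compose logs <service>`.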


Security Compliance Reference

The security-compliance/ directory contains reference implementations for federal security patterns — not toy examples, but working code for:

  • CAC/PIV authentication with PKCS#11 smart card integration and OAuth bridging
  • RBAC/ABAC with MAC enforcement (Bell-LaPadula), role hierarchies, and database-backed permission resolution
  • FIPS 140-2 encryption with AES-256 at rest, TLS 1.3 in transit, and HSM key management
  • NIST 800-53 compliance with automated control assessment, evidence collection, and reporting
  • Audit logging with immutable trails and 7-year retention policies
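
For a flavor of the MAC enforcement pattern: the core Bell-LaPadula rules ("no read up", "no write down") reduce to two comparisons over a clearance lattice. A minimal sketch; the level names and function signatures here are illustrative, not the module's actual interface:

```python
from enum import IntEnum

class Level(IntEnum):
    """Clearance lattice: higher value = more sensitive."""
    UNCLASSIFIED = 0
    CONFIDENTIAL = 1
    SECRET = 2
    TOP_SECRET = 3

def can_read(subject: Level, obj: Level) -> bool:
    """Simple security property: a subject may not read above its clearance."""
    return subject >= obj

def can_write(subject: Level, obj: Level) -> bool:
    """Star property: a subject may not write below its clearance."""
    return subject <= obj
```

The repository's implementation layers role hierarchies and database-backed permission resolution on top of this core check.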

See security-compliance/CLAUDE.md for a module-by-module guide.
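
"Immutable trails" typically means hash chaining: each log entry commits to the previous entry's digest, so any retroactive edit invalidates everything after it. A stdlib-only sketch of the idea (the repository's implementation may differ):

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder hash for the first entry

def append_entry(log: list[dict], event: dict) -> None:
    """Append an event, chaining it to the previous entry's hash."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    body = json.dumps({"event": event, "prev": prev_hash}, sort_keys=True)
    log.append({
        "event": event,
        "prev": prev_hash,
        "hash": hashlib.sha256(body.encode()).hexdigest(),
    })

def verify_chain(log: list[dict]) -> bool:
    """Recompute every hash; any tampered entry breaks the chain."""
    prev_hash = GENESIS
    for entry in log:
        body = json.dumps({"event": entry["event"], "prev": prev_hash},
                          sort_keys=True)
        if entry["prev"] != prev_hash:
            return False
        if hashlib.sha256(body.encode()).hexdigest() != entry["hash"]:
            return False
        prev_hash = entry["hash"]
    return True
```

Anchoring the latest hash somewhere external (WORM storage, a signed timestamp) is what makes the 7-year retention claim auditable rather than trust-based.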


Contributing

Contributions that improve accuracy, add platform-specific detail, or extend coverage are welcome. See CONTRIBUTING.md for guidelines.


License

Code is released under the MIT License; written content under CC BY 4.0.

Content is based on publicly available information. Nothing in this repository is classified or export-controlled. Platform-specific details reflect publicly documented capabilities as of early 2026.


Read online · Agent Integration Guide · Star this repo · Report an issue
