Decentralized AI Data Platform - Production Ready
IPFS Datasets Python is a comprehensive platform for decentralized AI data processing, combining mathematical theorem proving, AI-powered document intelligence, multimedia processing, and knowledge graph operations, all on decentralized IPFS infrastructure.
- Key Features
- Quick Start
- CLI Tools
- MCP Server
- MCP Dashboard
- Core Modules
- Functional Modules
- Documentation
- Contributing
- License
- Mathematical Theorem Proving - Convert legal text to verified formal logic (Z3, CVC5, Lean 4, Coq)
- GraphRAG Document Processing - AI-powered PDF analysis with knowledge graphs
- Universal File Conversion - Convert any file type to text for AI processing
- Universal Media Processing - Download and process from 1000+ platforms (yt-dlp + FFmpeg)
- Knowledge Graph Intelligence - Cross-document reasoning with semantic search
- Decentralized Storage - IPFS-native with content addressing (ipfs_kit_py)
- Hardware Acceleration - 2-20x speedup with multi-backend support (ipfs_accelerate_py)
- MCP Server - 200+ tools for AI assistants (Claude, ChatGPT, etc.)
- Auto-Fix with GitHub Copilot - Production-ready AI code fixes (100% verified)
- Automatic Error Reporting - Runtime errors → GitHub issues automatically
- Production Monitoring - Dashboards, analytics, and observability
- Distributed Compute - P2P networking and distributed workflows
- Enterprise Ready - Security, audit logging, and provenance tracking
# Clone repository
git clone https://github.com/endomorphosis/ipfs_datasets_py.git
cd ipfs_datasets_py
# Quick setup (core dependencies)
python scripts/setup/install.py --quick
# Or install with specific features
pip install -e ".[all]" # All features
pip install -e ".[ml]" # ML/AI features onlyZ3, CVC5, Lean, and Coq are external system tools (not Python packages). ipfs_datasets_py can use them for symbolic proof execution when installed.
- Manual best-effort installer:

ipfs-datasets-install-provers --yes --z3 --cvc5 --lean --coq

- Auto-run after setup.py install/develop (enabled by default): IPFS_DATASETS_PY_AUTO_INSTALL_PROVERS=1 (set to 0 to disable)
- Fine-grained toggles: IPFS_DATASETS_PY_AUTO_INSTALL_Z3=1, IPFS_DATASETS_PY_AUTO_INSTALL_CVC5=1, IPFS_DATASETS_PY_AUTO_INSTALL_LEAN=1, IPFS_DATASETS_PY_AUTO_INSTALL_COQ=1

Notes:
- Lean installs via elan into your user home.
- Z3/CVC5/Coq installation depends on your OS/package manager; auto-install may require root (apt) or manual steps.
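Because the provers are external binaries, it helps to confirm they are on your PATH before running proofs. A minimal stdlib check (the executable names below are assumptions based on the standard distributions of each tool; adjust for your install):

import shutil

# Typical executable names for each prover; adjust if your install differs.
provers = {"z3": "z3", "cvc5": "cvc5", "lean": "lean", "coq": "coqc"}
for name, binary in provers.items():
    path = shutil.which(binary)
    print(f"{name}: {'found at ' + path if path else 'not found on PATH'}")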
from ipfs_datasets_py.dataset_manager import DatasetManager
# Load and process datasets
manager = DatasetManager()
dataset = manager.load_dataset("squad", split="train[:1000]")
manager.save_dataset(dataset, "output/processed_data.parquet")Routers support dependency injection via a shared RouterDeps container.
This lets you reuse the same heavyweight managers/clients (and avoid repeated
initialization cascades) across multiple modules and even across related repos
within the same Python process.
from ipfs_datasets_py.router_deps import RouterDeps
from ipfs_datasets_py import llm_router, embeddings_router, ipfs_backend_router
deps = RouterDeps()
text = llm_router.generate_text("Write a short summary", deps=deps)
vecs = embeddings_router.embed_texts(["hello", "world"], deps=deps)
cid = ipfs_backend_router.add_bytes(b"data", deps=deps)

Notes:
- Set IPFS_DATASETS_PY_ROUTER_CACHE=0 to disable in-process caching.
- You can pass provider_instance= (LLM/embeddings) or backend_instance= (IPFS) if you want full control over the exact instance being used.
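If you do want to pin the exact instance, the keyword usage looks like this sketch (my_provider is a hypothetical placeholder for whatever pre-configured LLM provider object your deployment builds; only the provider_instance= keyword comes from the note above):

from ipfs_datasets_py import llm_router

# my_provider is hypothetical: substitute your own pre-built provider object.
my_provider = build_my_llm_provider()  # placeholder, not a real helper
text = llm_router.generate_text(
    "Write a short summary",
    provider_instance=my_provider,  # bypasses the router's own provider selection
)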
# Convert any file type to text for GraphRAG
from ipfs_datasets_py.file_converter import FileConverter
converter = FileConverter() # Auto-selects best backend
result = await converter.convert('document.pdf')
print(result.text) # Ready for knowledge graph processing
# Or use synchronously
result = converter.convert_sync('document.pdf')

# Theorem proving: Website text → Verified formal logic
python scripts/demo/demonstrate_complete_pipeline.py --test-provers
# GraphRAG: AI-powered PDF processing
python scripts/demo/demonstrate_graphrag_pdf.py --create-sample

The ipfs-datasets CLI provides comprehensive command-line access to all features.
# System status and information
ipfs-datasets info status
ipfs-datasets info version
ipfs-datasets info defaults
# Save configuration
ipfs-datasets save-defaults

# Start MCP server
ipfs-datasets mcp start
# Stop server
ipfs-datasets mcp stop
# Check status
ipfs-datasets mcp status
# View logs
ipfs-datasets mcp logs

# List all tool categories (200+ tools)
ipfs-datasets tools categories
# List tools in a category
ipfs-datasets tools list dataset_tools
# Execute a tool directly
ipfs-datasets tools run dataset_tools load_dataset --source squad
# Alternative execution
ipfs-datasets tools execute dataset_tools load_dataset --source squad --split train

# Check VSCode CLI status
ipfs-datasets vscode status
# Install VSCode CLI
ipfs-datasets vscode install
# Configure authentication
ipfs-datasets vscode auth
# Install with auth in one step
ipfs-datasets vscode install-with-auth
# Manage extensions
ipfs-datasets vscode extensions list
ipfs-datasets vscode extensions install ms-python.python
# Tunnel management
ipfs-datasets vscode tunnel --name my-tunnel

# Check GitHub CLI status
ipfs-datasets github status
# Install GitHub CLI
ipfs-datasets github install
# Authenticate
ipfs-datasets github auth login
# Execute GitHub commands
ipfs-datasets github execute issue list
ipfs-datasets github execute pr create

# Send Discord message
ipfs-datasets discord send "Hello from IPFS Datasets!" --channel general
# Send to webhook
ipfs-datasets discord webhook "Status update" --url https://discord.com/api/webhooks/...
# Send file
ipfs-datasets discord file report.pdf --channel reports

# Send email
ipfs-datasets email send --to user@example.com --subject "Report" --body "See attached"
# Check email status
ipfs-datasets email status

The Model Context Protocol (MCP) server provides 200+ tools across 50+ categories for AI assistant integration.
MCP (Model Context Protocol) enables AI assistants like Claude, ChatGPT, and GitHub Copilot to access external tools and services. Our MCP server exposes comprehensive dataset operations, web scraping, knowledge graphs, and more.
# Start MCP server
ipfs-datasets mcp start
# Server runs on http://localhost:8765
# Dashboard available at http://localhost:8765/dashboard

The MCP server provides tools in these categories (a quick reachability check follows the list):
- dataset_tools - Load, process, and manage datasets
- web_archive_tools - Web scraping, yt-dlp, Common Crawl
- vector_tools - Vector embeddings and similarity search
- pdf_tools - PDF processing and GraphRAG
- knowledge_graph_tools - Entity extraction and graph operations
- ipfs_tools - IPFS operations (add, get, pin, cat)
- p2p_tools - Distributed computing and workflows
- cache_tools - Caching strategies
- monitoring_tools - System monitoring and metrics
- ...and 40+ more categories
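Once the server is up, you can verify it is responding using only the stdlib (a minimal sketch; it just confirms the dashboard URL from the quick start answers over HTTP):

import urllib.request

# Ping the dashboard endpoint started by `ipfs-datasets mcp start`.
try:
    with urllib.request.urlopen("http://localhost:8765/dashboard", timeout=5) as resp:
        print(f"MCP dashboard reachable (HTTP {resp.status})")
except OSError as exc:
    print(f"MCP server not reachable: {exc}")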
# Discover available tools
ipfs-datasets tools categories
# List tools in a category
ipfs-datasets tools list dataset_tools
# Run a tool
ipfs-datasets tools run dataset_tools load_dataset --source squad --split train
# Run web archive tool
ipfs-datasets tools run web_archive_tools search_common_crawl --query "AI research"

Once the MCP server is running, AI assistants can discover and execute tools:
# From Claude Desktop, ChatGPT, or GitHub Copilot
"Load the squad dataset using MCP tools"
"Search Common Crawl for AI research papers"
"Extract entities from this document using knowledge graph tools"
The MCP Dashboard provides real-time monitoring and analytics for all MCP operations.
- Investigation Tracking - Track the GitHub issues → MCP tools → AI suggestions workflow
- Tool Usage Analytics - See which tools are used most, execution times, success rates
- System Monitoring - Real-time system health, resource usage, error tracking
- Real-time Updates - WebSocket-based live updates of all operations
# Start MCP server (includes dashboard)
ipfs-datasets mcp start
# Access dashboard in browser
# URL: http://localhost:8765/dashboard

- Investigation Panel - Active investigations, GitHub issue tracking, tool execution history
- Analytics Panel - Tool usage statistics, performance metrics, success/failure rates
- System Panel - CPU/memory usage, active connections, error logs
- Tool Explorer - Browse all 200+ tools, test executions, view documentation
The package includes 13 core modules providing foundational functionality.
Dataset loading and management with hardware acceleration support.
from ipfs_datasets_py.dataset_manager import DatasetManager
# Initialize with acceleration
manager = DatasetManager(use_accelerate=True)
# Load dataset
dataset = manager.load_dataset("squad", split="train[:1000]")
# Save processed dataset
manager.save_dataset(dataset, "output/processed.parquet")Features:
- HuggingFace datasets integration
- Hardware acceleration support (ipfs_accelerate_py)
- Multiple format support (Parquet, JSONL, CSV)
- Caching and optimization
Configuration management with TOML support and override capabilities.
from ipfs_datasets_py.config import config
# Load configuration
cfg = config()
# Access configuration
database_url = cfg.baseConfig['database']['url']
# Override configuration
cfg.overrideToml(cfg.baseConfig, {'database': {'url': 'new_url'}})

Features:
- TOML-based configuration
- Hierarchical overrides
- Environment variable support
- Default value handling
Security, authentication, and authorization (planned expansion).
Planned Features:
- API key management
- OAuth integration
- Rate limiting
- Access control lists
System monitoring, metrics collection, and health checks.
from ipfs_datasets_py.monitoring import monitor
# Track metrics
monitor.track_metric("api_calls", 1)
monitor.track_metric("processing_time", 0.5)
# Health check
status = monitor.health_check()

Features:
- Prometheus metrics
- Health check endpoints
- Resource usage tracking
- Performance monitoring
Core IPFS operations and dataset handling.
from ipfs_datasets_py.ipfs_datasets import ipfs_datasets
# Initialize IPFS operations
ipfs = ipfs_datasets()
# IPFS operations
cid = ipfs.add_file("data.json")
content = ipfs.get_file(cid)

Features:
- IPFS add/get operations
- Content addressing
- Dataset storage on IPFS
- Pinning management
IPFS content discovery and indexing.
Features:
- Content indexing
- Discovery protocols
- Metadata management
- Search capabilities
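The indexing idea in miniature, as a pure-Python stand-in (a conceptual illustration only, not this module's actual implementation or API):

# Conceptual CID -> metadata index with naive substring search.
index = {}

def add_entry(cid, metadata):
    index[cid] = metadata

def search(term):
    return [cid for cid, meta in index.items() if term.lower() in str(meta).lower()]

add_entry("bafkrei-example", {"title": "AI research notes"})
print(search("ai"))  # ['bafkrei-example']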
Automatic dependency management and installation.
Features:
- Dependency resolution
- Automatic installation
- Version management
- Platform detection
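The general auto-install pattern (an illustrative sketch, not this module's internal code) is to attempt an import and fall back to pip:

import importlib
import subprocess
import sys

def ensure_package(module_name, pip_name=None):
    """Import a module, pip-installing it first if missing (illustrative pattern)."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        subprocess.check_call([sys.executable, "-m", "pip", "install", pip_name or module_name])
        return importlib.import_module(module_name)

# Example: ensure_package("yaml", pip_name="pyyaml")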
Comprehensive audit logging for compliance and security.
from ipfs_datasets_py.audit import audit
# Log audit event
audit.log_event("data_access", {"user": "admin", "resource": "dataset_123"})

Features:
- Event logging
- Compliance tracking
- Security auditing
- Log rotation
File type detection and validation.
Features:
- MIME type detection
- File format validation
- Content analysis
- Extension mapping
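For a feel of extension-based MIME detection, the stdlib stand-in below works out of the box (the module itself goes further, with content analysis and validation):

import mimetypes

# Extension-based guess only; content-based detection requires reading the file.
mime, _encoding = mimetypes.guess_type("report.pdf")
print(mime)  # application/pdf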
Administrative web interface for system management.
Features:
- User management
- System configuration
- Monitoring dashboard
- Log viewer
Internal dependency tracking and management.
Features:
- Dependency graph
- Version tracking
- Import analysis
- Health checks
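Version tracking of installed packages can be illustrated with the stdlib alone (not this module's API):

from importlib import metadata

# Report installed versions of a couple of common dependencies.
for pkg in ("datasets", "requests"):
    try:
        print(pkg, metadata.version(pkg))
    except metadata.PackageNotFoundError:
        print(pkg, "not installed")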
IPFS multiformat support (CID, multibase, multihash).
Features:
- CID encoding/decoding
- Multibase conversion
- Multihash operations
- Format validation
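As background on what the module works with, a CIDv1 for raw bytes can be assembled by hand using only the stdlib (a conceptual sketch of the format, not the module's API):

import base64
import hashlib

data = b"hello ipfs"
digest = hashlib.sha256(data).digest()
# multihash = <sha2-256 code 0x12><digest length 0x20><digest>
multihash = bytes([0x12, 0x20]) + digest
# CIDv1 = <version 0x01><raw codec 0x55><multihash>
cid_bytes = bytes([0x01, 0x55]) + multihash
# multibase base32: lowercase, unpadded, with "b" prefix
cid = "b" + base64.b32encode(cid_bytes).decode().lower().rstrip("=")
print(cid)  # starts with "bafkrei"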
K-Nearest Neighbors indexing for IPFS content.
Features:
- Vector indexing
- Similarity search
- Distributed KNN
- IPFS integration
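The similarity-search core of KNN, in miniature (a numpy brute-force sketch; the module layers distributed, IPFS-backed indexing on top):

import numpy as np

def knn(query, vectors, k=3):
    """Return indices of the k nearest rows by cosine similarity."""
    q = query / np.linalg.norm(query)
    v = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    return np.argsort(-(v @ q))[:k]

vectors = np.random.rand(100, 8)
print(knn(np.random.rand(8), vectors))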
The package includes 11 functional modules providing specialized capabilities.
Dashboard implementations for monitoring and analytics.
Components:
- mcp_dashboard.py - Main MCP server dashboard (real-time monitoring)
- mcp_investigation_dashboard.py - Investigation workflow tracking
- unified_monitoring_dashboard.py - System-wide monitoring
- news_analysis_dashboard.py - News aggregation and analysis
from ipfs_datasets_py.dashboards.mcp_dashboard import MCPDashboard
dashboard = MCPDashboard(port=8765)
dashboard.start()

Features:
- Real-time updates via WebSocket
- Tool usage analytics
- System health monitoring
- Investigation tracking
CLI tool implementations for various integrations.
Components:
- discord_cli.py - Discord messaging integration
- github_cli.py - GitHub operations
- vscode_cli.py - VSCode editor integration
- email_cli.py - Email notification system
from ipfs_datasets_py.cli.discord_cli import send_discord_message
send_discord_message("Status update", channel="general")

Features:
- Discord webhooks and messaging
- GitHub CLI integration
- VSCode remote development
- Email notifications
Data processors for various formats and operations.
Components:
- graphrag_processor.py - GraphRAG document processing
- document_processor.py - General document processing
- image_processor.py - Image analysis
- video_processor.py - Video processing
from ipfs_datasets_py.processors.graphrag_processor import GraphRAGProcessor
processor = GraphRAGProcessor()
result = processor.process_pdf("document.pdf")

Features:
- PDF to knowledge graph
- Document intelligence
- Image OCR and analysis
- Video transcription
Caching systems for performance optimization.
Components:
- cache.py - GitHub API cache
- query_cache.py - Query result caching
- distributed_cache.py - P2P distributed cache
from ipfs_datasets_py.caching.cache import GitHubAPICache
cache = GitHubAPICache()
result = cache.get_or_fetch("repos/user/project")

Features:
- LRU caching
- Distributed caching
- Cache invalidation
- Performance optimization
Web scraping and archiving capabilities.
Components:
- web_archive.py - Main web archiving
- yt_dlp_integration.py - yt-dlp wrapper (1000+ platforms)
- ffmpeg_integration.py - FFmpeg video processing
- common_crawl.py - Common Crawl search
from ipfs_datasets_py.web_archiving import create_web_archive
archive = create_web_archive("https://example.com")

Features:
- yt-dlp integration (1000+ platforms)
- FFmpeg video processing
- Common Crawl search
- WARC file creation
P2P networking and distributed compute.
Components:
- libp2p_kit.py - libp2p integration
- p2p_workflow_scheduler.py - Distributed workflows
- p2p_peer_registry.py - Peer management
from ipfs_datasets_py.p2p_networking.libp2p_kit import LibP2PKit
p2p = LibP2PKit()
p2p.start_node()

Features:
- libp2p networking
- Distributed workflows
- Peer discovery
- Task distribution
Knowledge graph extraction and operations.
Components:
- knowledge_graph_extraction.py - Entity extraction
- graph_operations.py - Graph queries
- reasoning.py - Graph reasoning
from ipfs_datasets_py.knowledge_graphs import KnowledgeGraphExtractor
extractor = KnowledgeGraphExtractor()
graph = extractor.extract_from_text(document)

Features:
- Entity extraction
- Relationship detection
- Graph construction
- Cross-document reasoning
Data format conversion and transformation.
Components:
- car_conversion.py - CAR file operations
- parquet_conversion.py - Parquet format
- jsonl_conversion.py - JSONL format
- format_detector.py - Format detection
from ipfs_datasets_py.data_transformation import convert_to_parquet
convert_to_parquet("data.jsonl", "output.parquet")

Features:
- CAR file creation
- Format conversion
- Data validation
- Schema detection
Third-party service integrations.
Components:
- accelerate_integration.py - ipfs_accelerate_py integration
- graphrag_integration.py - GraphRAG integration
- vscode_integration.py - VSCode integration
- github_integration.py - GitHub API integration
from ipfs_datasets_py.integrations.accelerate_integration import AccelerateManager
accelerator = AccelerateManager()
accelerator.setup_distributed()

Features:
- Hardware acceleration (ipfs_accelerate_py)
- GraphRAG integration
- GitHub API
- VSCode remote development
Logic and reasoning systems.
Components:
- deontological_reasoning.py - Deontic logic
- theorem_proving.py - Formal verification
from ipfs_datasets_py.reasoning import TheoremProver
prover = TheoremProver(backend="z3")
result = prover.prove(formal_logic)

Features:
- Deontic logic
- Theorem proving (Z3, CVC5, Lean 4, Coq)
- Legal text → formal logic
- Formal verification
IPFS format handling and operations.
Components:
- car_files.py - CAR archive operations
- ipld_operations.py - IPLD operations
- content_addressing.py - CID operations
from ipfs_datasets_py.ipfs_formats import create_car_file
car_cid = create_car_file("data/", "output.car")

Features:
- CAR archive creation
- IPLD path resolution
- Content addressing
- Format conversion
- Quick Start Guide - Complete getting started tutorial
- Installation Guide - Detailed installation instructions
- Architecture Overview - Package structure and design
- Complete Features List - All capabilities explained
- Hardware Acceleration - ipfs_accelerate_py (2-20x speedup)
- IPFS Operations - ipfs_kit_py integration
- Best Practices - Performance, security, patterns
- Migration Guide - Updating from old versions
- CLI Tools - Command-line interface guide
This project is licensed under the AGPL-3.0 License - see the LICENSE file for details.
- Documentation: docs/
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- Endomorphosis (GitHub) - Senior ML-OPS Architect
- The-Ride-Never-Ends (GitHub) - Junior Developer / Political Scientist
- Coregod360 (GitHub) - Formerly Junior Developer
Built with ❤️ for decentralized AI