
DataSmart

A pay-per-query data marketplace with AI-powered natural language querying and instant Solana micropayments.

DataSmart Logo

Tech Stack

Python · FastAPI · DuckDB · Polars · React · TypeScript · Vite · Tailwind CSS · Google Gemini · Nessie API · Solana Web3.js

Overview

DataSmart bridges the gap between public datasets that lack depth and private datasets that require expensive upfront purchases. Our platform allows data consumers to explore and query datasets before committing to a full purchase, paying only for the queries they execute.

Unlike traditional data marketplaces, DataSmart enables users to ask questions in plain English, receive instant results from validated datasets, and pay small per-query fees settled on the Solana blockchain.

Key Features

  • Natural Language Querying: Ask questions in plain English using Google Gemini AI
  • Pay-Per-Query Model: Microtransactions via Solana (typically 0.01-0.10 SOL per query)
  • Data Validation Pipeline: Automatic quality scoring (0-100) using Polars
  • Free Tier Access: 2 free queries per dataset per wallet address
  • Premium Datasets: Integration with Capital One Nessie API for financial datasets
  • Secure Preview: View schema and metadata without exposing raw data
  • Instant Results: Queries execute in DuckDB, with results limited to a 5% sample

Screenshots

Landing Page


The landing page introduces DataSmart's core value proposition with an interactive interface showcasing the marketplace capabilities.

Marketplace


Browse available datasets with filtering options, quality scores, and pricing information. Premium Capital One datasets are highlighted with special badges.

Validation Pipeline


Upload datasets and watch them go through comprehensive validation checks including completeness, duplicates, data types, and statistical quality analysis.

FAQ


Common questions about DataSmart, payments, data quality standards, and usage guidelines.

System Architecture


DataSmart consists of three main components:

Frontend: React-based single-page application with Solana wallet integration

  • Marketplace browsing and filtering
  • Natural language query interface
  • Dataset upload and validation dashboard
  • Payment processing via Solana

Backend: FastAPI REST API handling all business logic

  • Dataset storage and management in DuckDB
  • Natural language to SQL conversion using Gemini AI
  • Data validation pipeline using Polars
  • Premium dataset integration with Nessie API
  • Transaction recording and verification

Data Layer: Local-first storage with intelligent caching

  • DuckDB for persistent dataset storage
  • Separate cache database for Nessie API data
  • Enriched datasets built via SQL joins on cached raw data
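The enrichment step above can be sketched in SQL. This is a minimal, illustrative example using Python's stdlib sqlite3 as a stand-in for DuckDB (the join shape is identical); the table and column names are assumptions, not the project's actual schema.

```python
import sqlite3

# In-memory stand-in for the cache database; the project uses DuckDB,
# but the SQL join is the same shape. Names here are illustrative only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id TEXT PRIMARY KEY, city TEXT);
    CREATE TABLE accounts  (account_id TEXT, customer_id TEXT, balance REAL);
    INSERT INTO customers VALUES ('c1', 'Gainesville'), ('c2', 'Miami');
    INSERT INTO accounts  VALUES ('a1', 'c1', 1200.0), ('a2', 'c1', 300.0),
                                 ('a3', 'c2', 50.0);
""")

# An enriched "wealth profile": demographics joined with aggregated balances.
rows = conn.execute("""
    SELECT c.customer_id, c.city, SUM(a.balance) AS total_balance
    FROM customers c
    JOIN accounts a ON a.customer_id = c.customer_id
    GROUP BY c.customer_id, c.city
    ORDER BY c.customer_id
""").fetchall()

print(rows)  # [('c1', 'Gainesville', 1500.0), ('c2', 'Miami', 50.0)]
```

Because the raw Nessie responses are cached locally first, enriched views like this can be rebuilt cheaply without re-calling the API.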

Technology Stack

Backend

  • Python 3.8+ - Core language
  • FastAPI - REST API framework
  • DuckDB - Analytical database for storage and querying
  • Polars - Data validation and processing
  • Google Gemini API - Natural language to SQL conversion
  • Nessie API - Capital One financial datasets
  • Uvicorn - ASGI server

Frontend

  • React 19 - UI framework
  • TypeScript - Type safety
  • Vite - Build tool and dev server
  • Tailwind CSS - Styling
  • Solana Web3.js - Blockchain integration
  • React Router - Client-side routing

Data Processing

  • Polars - Fast DataFrame operations
  • PyArrow - Columnar data format support
  • DuckDB - SQL query execution

Getting Started

Prerequisites

  • Python 3.8 or higher
  • Node.js 18 or higher
  • Solana wallet (Phantom, Solflare, etc.)
  • Google API key for Gemini
  • Nessie API key (optional, for premium datasets)

Installation

  1. Clone the repository:
     git clone https://github.com/yourusername/swamphacks26.git
     cd swamphacks26
  2. Set up the backend:
     cd backend
     pip install -r requirements.txt
  3. Create a .env file in the backend directory:
     GOOGLE_API_KEY=your_gemini_api_key
     NESSIE_API_KEY=your_nessie_api_key  # Optional
  4. Set up the frontend:
     cd ../frontend
     npm install
  5. Start the backend server:
     cd ../backend
     python server.py
  6. Start the frontend development server:
     cd ../frontend
     npm run dev

The application will be available at http://localhost:5173 (or the port Vite assigns).

Project Structure

swamphacks26/
├── backend/
│   ├── api/                 # API endpoint definitions
│   ├── data/                # Dataset storage and cache
│   │   ├── datasets.db      # Main DuckDB database
│   │   └── nessie_cache.db  # Nessie API cache
│   ├── validation_pipeline/ # Data validation modules
│   │   ├── validator.py     # Quality scoring logic
│   │   └── storage.py       # DuckDB storage operations
│   ├── server.py            # FastAPI application
│   ├── nessie_service.py    # Capital One API integration
│   └── requirements.txt     # Python dependencies
├── frontend/
│   ├── src/
│   │   ├── pages/           # Page components
│   │   │   ├── Landing.tsx
│   │   │   ├── Marketplace.tsx
│   │   │   ├── QueryInterface.tsx
│   │   │   └── UploadDashboard.tsx
│   │   ├── components/      # Reusable components
│   │   ├── utils/           # Utility functions
│   │   └── main.tsx         # Application entry point
│   ├── public/              # Static assets
│   └── package.json         # Node dependencies
└── README.md

How It Works

For Data Producers

  1. Upload CSV or Parquet files through the web interface
  2. System automatically validates data quality using weighted scoring:
    • Missing values (25% weight)
    • Data completeness (20% weight)
    • Duplicate detection (15% weight)
    • Data type consistency (15% weight)
    • Statistical quality (10% weight)
    • Schema consistency (10% weight)
    • Data range validation (5% weight)
  3. Datasets scoring 80/100 or higher are stored in DuckDB
  4. Set pricing per query and provide metadata (category, description)
  5. Receive payments directly to your Solana wallet address
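The weighted scoring in step 2 can be sketched as a simple combination of per-check sub-scores. This is an assumption about the shape of the logic, not the actual code in backend/validation_pipeline/validator.py; only the weights come from the list above.

```python
# Hypothetical sketch of the weighted scoring described above. Each check
# is assumed to return a sub-score in [0, 100], combined with the
# documented weights (which sum to 1.0).
WEIGHTS = {
    "missing_values": 0.25,
    "completeness": 0.20,
    "duplicates": 0.15,
    "data_types": 0.15,
    "statistical_quality": 0.10,
    "schema_consistency": 0.10,
    "data_range": 0.05,
}

def quality_score(sub_scores):
    """Combine per-check sub-scores (0-100) into one weighted score."""
    return round(sum(WEIGHTS[k] * sub_scores[k] for k in WEIGHTS), 2)

example = {
    "missing_values": 95, "completeness": 90, "duplicates": 100,
    "data_types": 85, "statistical_quality": 70, "schema_consistency": 100,
    "data_range": 100,
}
score = quality_score(example)
# 23.75 + 18 + 15 + 12.75 + 7 + 10 + 5 = 91.5 -> passes the 80 threshold
print(score)
```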

For Data Consumers

  1. Browse the marketplace and filter by category, file type, or quality score
  2. Preview dataset schema, column names, and quality metrics
  3. Connect Solana wallet to enable querying
  4. Ask questions in natural language (e.g., "What's the average age of passengers?")
  5. System converts query to SQL using Gemini AI
  6. Execute query and receive results (limited to 5% of total rows)
  7. Pay per query via Solana micropayment
  8. Option to purchase full dataset access

Query Processing Flow

  1. User submits natural language query
  2. Gemini AI interprets intent and generates SQL query
  3. Query is validated and executed against DuckDB
  4. Results are limited to a 5% sample (max 100 rows)
  5. User approves Solana payment
  6. Transaction is recorded and results are returned
  7. Cryptographic receipt is generated for auditability
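Step 4's row cap can be sketched as a small wrapper applied to the generated SQL. The function and parameter names below are assumptions for illustration, not the project's actual API.

```python
# Illustrative sketch of step 4: cap results at a 5% sample, at most 100 rows.
def sample_limit(total_rows, fraction=0.05, hard_cap=100):
    """Rows a single query may return: 5% of the dataset, capped at 100."""
    return max(1, min(int(total_rows * fraction), hard_cap))

def wrap_with_limit(sql, total_rows):
    """Wrap a generated SELECT in a subquery so the limit always applies."""
    return f"SELECT * FROM ({sql}) AS q LIMIT {sample_limit(total_rows)}"

print(sample_limit(10_000))  # 100 (5% would be 500, capped)
print(sample_limit(500))     # 25
print(wrap_with_limit("SELECT age FROM passengers", 500))
```

Wrapping the generated query in a subquery, rather than appending LIMIT to it, ensures the cap applies even when the AI-generated SQL contains its own ORDER BY or LIMIT clause.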

Data Validation

The validation pipeline evaluates datasets across seven dimensions:

  • Missing Values: Percentage of null or empty cells
  • Completeness: Overall data coverage per column
  • Duplicates: Detection of duplicate rows
  • Data Types: Consistency and appropriateness of column types
  • Statistical Quality: Distribution analysis and outlier detection
  • Schema Consistency: Structure validation and column count checks
  • Data Range: Validation of value ranges (e.g., negative ages)

Only datasets scoring 80/100 or higher are accepted into the marketplace, ensuring high-quality data for consumers.
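Two of the seven dimensions can be illustrated in plain Python. The project computes these with Polars over full DataFrames; this stdlib sketch over rows-as-dicts only shows the metric definitions, and the function names are assumptions.

```python
# A sketch of two of the seven checks; None or "" marks a missing cell.
def missing_value_pct(rows, columns):
    """Percentage of cells that are null or empty."""
    total = len(rows) * len(columns)
    missing = sum(1 for r in rows for c in columns if r.get(c) in (None, ""))
    return 100.0 * missing / total if total else 0.0

def duplicate_row_pct(rows, columns):
    """Percentage of rows that exactly duplicate an earlier row."""
    seen, dupes = set(), 0
    for r in rows:
        key = tuple(r.get(c) for c in columns)
        if key in seen:
            dupes += 1
        seen.add(key)
    return 100.0 * dupes / len(rows) if rows else 0.0

rows = [
    {"name": "Ada", "age": 36},
    {"name": "Ada", "age": 36},     # duplicate row
    {"name": "Alan", "age": None},  # missing cell
]
print(round(missing_value_pct(rows, ["name", "age"]), 2))  # 16.67
print(round(duplicate_row_pct(rows, ["name", "age"]), 2))  # 33.33
```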

Premium Datasets

DataSmart includes premium financial datasets powered by Capital One's Nessie API:

  • Spending Insights: Transaction data enriched with merchant categories
  • Customer Wealth Profiles: Demographics with aggregated account balances
  • P2P Transfer Network: Peer-to-peer transfer activity analysis

These datasets are cached locally in DuckDB for fast access and enriched via SQL joins to provide additional insights.

Payment System

Payments are processed directly on the Solana blockchain:

  • Microtransactions: Per-query pricing set by the data producer (typically 0.01-0.10 SOL)
  • Instant Settlement: Transactions confirm in seconds
  • Transparent Pricing: See cost before executing queries
  • Wallet Integration: Works with Phantom, Solflare, and other Solana wallets
  • Transaction Receipts: Every query generates an immutable on-chain record
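The query receipt mentioned above can be sketched as a deterministic digest over the query metadata. This is a hedged, stdlib-only illustration; the field names are assumptions, and the project's actual receipt format and on-chain recording are not shown here.

```python
import hashlib
import json

# Hypothetical receipt: a SHA-256 digest over canonicalized query metadata.
# Field names and example values are illustrative, not the real schema.
def make_receipt(wallet, dataset_id, sql, tx_signature, timestamp):
    payload = json.dumps(
        {"wallet": wallet, "dataset": dataset_id, "sql": sql,
         "tx": tx_signature, "ts": timestamp},
        sort_keys=True, separators=(",", ":"),  # canonical form
    )
    return hashlib.sha256(payload.encode()).hexdigest()

r1 = make_receipt("wallet-A", "titanic", "SELECT AVG(age) FROM t", "sig-1", 1700000000)
r2 = make_receipt("wallet-A", "titanic", "SELECT AVG(age) FROM t", "sig-1", 1700000000)
assert r1 == r2 and len(r1) == 64  # deterministic, 256-bit hex digest
```

Because the digest is deterministic, anyone holding the same metadata can recompute it and verify the receipt matches.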

Security and Privacy

  • Raw data is never exposed in previews
  • Query results are limited to 5% samples
  • All queries are logged for auditability
  • Wallet-based authentication
  • No arbitrary SQL execution
  • Schema enforcement prevents malicious queries
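A guard in the spirit of "no arbitrary SQL execution" can be sketched as a read-only allow-list check. The project's real validation is stricter (schema enforcement against known columns); this minimal version and its names are assumptions.

```python
import re

# Minimal sketch: allow a single read-only SELECT, reject everything else.
FORBIDDEN = re.compile(
    r"\b(insert|update|delete|drop|alter|create|attach|copy|pragma)\b", re.I
)

def is_safe_select(sql):
    stripped = sql.strip().rstrip(";")
    return (
        stripped.lower().startswith("select")
        and ";" not in stripped            # single statement only
        and not FORBIDDEN.search(stripped)
    )

assert is_safe_select("SELECT AVG(age) FROM passengers")
assert not is_safe_select("DROP TABLE passengers")
assert not is_safe_select("SELECT 1; DROP TABLE passengers")
```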

Future Enhancements

  • Redis queues and Celery for job scheduling
  • AWS S3 integration for cloud storage
  • Enhanced query validation and security
  • Analytics dashboard for data producers
  • Support for additional data formats
  • Multi-chain payment support

Contributing

Contributions are welcome. Please open an issue to discuss major changes before submitting a pull request.

License

This project was built for SwampHacks 2026.

Acknowledgments

  • Capital One for providing the Nessie API
  • Google for Gemini AI capabilities
  • Solana Foundation for blockchain infrastructure
