
DataSmart

A pay-per-query data marketplace with AI-powered natural language querying and instant Solana micropayments.

DataSmart Logo

Tech Stack

Python · FastAPI · DuckDB · Polars · React · TypeScript · Vite · Tailwind CSS · Google Gemini · Nessie API · Solana Web3.js

Overview

DataSmart bridges the gap between public datasets that lack depth and private datasets that require expensive upfront purchases. Our platform allows data consumers to explore and query datasets before committing to a full purchase, paying only for the queries they execute.

Unlike traditional data marketplaces, DataSmart enables users to ask questions in plain English, receive instant results from validated datasets, and pay small per-query fees settled on the Solana blockchain.

Key Features

  • Natural Language Querying: Ask questions in plain English using Google Gemini AI
  • Pay-Per-Query Model: Microtransactions via Solana (typically 0.01-0.10 SOL per query)
  • Data Validation Pipeline: Automatic quality scoring (0-100) using Polars
  • Free Tier Access: 2 free queries per dataset per wallet address
  • Premium Datasets: Integration with Capital One Nessie API for financial datasets
  • Secure Preview: View schema and metadata without exposing raw data
  • Instant Results: Queries execute in DuckDB, with results limited to a 5% sample

Screenshots

Landing Page


The landing page introduces DataSmart's core value proposition with an interactive interface showcasing the marketplace capabilities.

Marketplace


Browse available datasets with filtering options, quality scores, and pricing information. Premium Capital One datasets are highlighted with special badges.

Validation Pipeline


Upload datasets and watch them go through comprehensive validation checks including completeness, duplicates, data types, and statistical quality analysis.

FAQ


Common questions about DataSmart, payments, data quality standards, and usage guidelines.

System Architecture


DataSmart consists of three main components:

Frontend: React-based single-page application with Solana wallet integration

  • Marketplace browsing and filtering
  • Natural language query interface
  • Dataset upload and validation dashboard
  • Payment processing via Solana

Backend: FastAPI REST API handling all business logic

  • Dataset storage and management in DuckDB
  • Natural language to SQL conversion using Gemini AI
  • Data validation pipeline using Polars
  • Premium dataset integration with Nessie API
  • Transaction recording and verification

Data Layer: Local-first storage with intelligent caching

  • DuckDB for persistent dataset storage
  • Separate cache database for Nessie API data
  • Enriched datasets built via SQL joins on cached raw data
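The enrichment step above can be sketched in SQL. This is a minimal, illustrative example using Python's stdlib sqlite3 as a stand-in for DuckDB (the join shape is identical); the table and column names are assumptions, not the project's actual schema.

```python
import sqlite3

# In-memory stand-in for the cache database; the project uses DuckDB,
# but the SQL join is the same shape. Names here are illustrative only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id TEXT PRIMARY KEY, city TEXT);
    CREATE TABLE accounts  (account_id TEXT, customer_id TEXT, balance REAL);
    INSERT INTO customers VALUES ('c1', 'Gainesville'), ('c2', 'Miami');
    INSERT INTO accounts  VALUES ('a1', 'c1', 1200.0), ('a2', 'c1', 300.0),
                                 ('a3', 'c2', 50.0);
""")

# An enriched "wealth profile": demographics joined with aggregated balances.
rows = conn.execute("""
    SELECT c.customer_id, c.city, SUM(a.balance) AS total_balance
    FROM customers c
    JOIN accounts a ON a.customer_id = c.customer_id
    GROUP BY c.customer_id, c.city
    ORDER BY c.customer_id
""").fetchall()

print(rows)  # [('c1', 'Gainesville', 1500.0), ('c2', 'Miami', 50.0)]
```

Because the raw Nessie responses are cached locally first, enriched views like this can be rebuilt cheaply without re-calling the API.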

Technology Stack

Backend

  • Python 3.8+ - Core language
  • FastAPI - REST API framework
  • DuckDB - Analytical database for storage and querying
  • Polars - Data validation and processing
  • Google Gemini API - Natural language to SQL conversion
  • Nessie API - Capital One financial datasets
  • Uvicorn - ASGI server

Frontend

  • React 19 - UI framework
  • TypeScript - Type safety
  • Vite - Build tool and dev server
  • Tailwind CSS - Styling
  • Solana Web3.js - Blockchain integration
  • React Router - Client-side routing

Data Processing

  • Polars - Fast DataFrame operations
  • PyArrow - Columnar data format support
  • DuckDB - SQL query execution

Getting Started

Prerequisites

  • Python 3.8 or higher
  • Node.js 18 or higher
  • Solana wallet (Phantom, Solflare, etc.)
  • Google API key for Gemini
  • Nessie API key (optional, for premium datasets)

Installation

  1. Clone the repository:
     git clone https://github.com/yourusername/swamphacks26.git
     cd swamphacks26
  2. Set up the backend:
     cd backend
     pip install -r requirements.txt
  3. Create a .env file in the backend directory:
     GOOGLE_API_KEY=your_gemini_api_key
     NESSIE_API_KEY=your_nessie_api_key  # Optional
  4. Set up the frontend:
     cd ../frontend
     npm install
  5. Start the backend server:
     cd ../backend
     python server.py
  6. Start the frontend development server:
     cd ../frontend
     npm run dev

The application will be available at http://localhost:5173 (or the port Vite assigns).

Project Structure

swamphacks26/
├── backend/
│   ├── api/                 # API endpoint definitions
│   ├── data/                # Dataset storage and cache
│   │   ├── datasets.db      # Main DuckDB database
│   │   └── nessie_cache.db  # Nessie API cache
│   ├── validation_pipeline/ # Data validation modules
│   │   ├── validator.py     # Quality scoring logic
│   │   └── storage.py       # DuckDB storage operations
│   ├── server.py            # FastAPI application
│   ├── nessie_service.py    # Capital One API integration
│   └── requirements.txt     # Python dependencies
├── frontend/
│   ├── src/
│   │   ├── pages/           # Page components
│   │   │   ├── Landing.tsx
│   │   │   ├── Marketplace.tsx
│   │   │   ├── QueryInterface.tsx
│   │   │   └── UploadDashboard.tsx
│   │   ├── components/      # Reusable components
│   │   ├── utils/           # Utility functions
│   │   └── main.tsx         # Application entry point
│   ├── public/              # Static assets
│   └── package.json         # Node dependencies
└── README.md

How It Works

For Data Producers

  1. Upload CSV or Parquet files through the web interface
  2. System automatically validates data quality using weighted scoring:
    • Missing values (25% weight)
    • Data completeness (20% weight)
    • Duplicate detection (15% weight)
    • Data type consistency (15% weight)
    • Statistical quality (10% weight)
    • Schema consistency (10% weight)
    • Data range validation (5% weight)
  3. Datasets scoring 80/100 or higher are stored in DuckDB
  4. Set pricing per query and provide metadata (category, description)
  5. Receive payments directly to your Solana wallet address
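The weighted scoring in step 2 can be sketched as a simple combination of per-check sub-scores. This is an assumption about the shape of the logic, not the actual code in backend/validation_pipeline/validator.py; only the weights come from the list above.

```python
# Hypothetical sketch of the weighted scoring described above. Each check
# is assumed to return a sub-score in [0, 100], combined with the
# documented weights (which sum to 1.0).
WEIGHTS = {
    "missing_values": 0.25,
    "completeness": 0.20,
    "duplicates": 0.15,
    "data_types": 0.15,
    "statistical_quality": 0.10,
    "schema_consistency": 0.10,
    "data_range": 0.05,
}

def quality_score(sub_scores):
    """Combine per-check sub-scores (0-100) into one weighted score."""
    return round(sum(WEIGHTS[k] * sub_scores[k] for k in WEIGHTS), 2)

example = {
    "missing_values": 95, "completeness": 90, "duplicates": 100,
    "data_types": 85, "statistical_quality": 70, "schema_consistency": 100,
    "data_range": 100,
}
score = quality_score(example)
# 23.75 + 18 + 15 + 12.75 + 7 + 10 + 5 = 91.5 -> passes the 80 threshold
print(score)
```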

For Data Consumers

  1. Browse the marketplace and filter by category, file type, or quality score
  2. Preview dataset schema, column names, and quality metrics
  3. Connect Solana wallet to enable querying
  4. Ask questions in natural language (e.g., "What's the average age of passengers?")
  5. System converts query to SQL using Gemini AI
  6. Execute query and receive results (limited to 5% of total rows)
  7. Pay per query via Solana micropayment
  8. Option to purchase full dataset access

Query Processing Flow

  1. User submits natural language query
  2. Gemini AI interprets intent and generates SQL query
  3. Query is validated and executed against DuckDB
  4. Results are limited to a 5% sample (max 100 rows)
  5. User approves Solana payment
  6. Transaction is recorded and results are returned
  7. Cryptographic receipt is generated for auditability
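Step 4's row cap can be sketched as a small wrapper applied to the generated SQL. The function and parameter names below are assumptions for illustration, not the project's actual API.

```python
# Illustrative sketch of step 4: cap results at a 5% sample, at most 100 rows.
def sample_limit(total_rows, fraction=0.05, hard_cap=100):
    """Rows a single query may return: 5% of the dataset, capped at 100."""
    return max(1, min(int(total_rows * fraction), hard_cap))

def wrap_with_limit(sql, total_rows):
    """Wrap a generated SELECT in a subquery so the limit always applies."""
    return f"SELECT * FROM ({sql}) AS q LIMIT {sample_limit(total_rows)}"

print(sample_limit(10_000))  # 100 (5% would be 500, capped)
print(sample_limit(500))     # 25
print(wrap_with_limit("SELECT age FROM passengers", 500))
```

Wrapping the generated query in a subquery, rather than appending LIMIT to it, ensures the cap applies even when the AI-generated SQL contains its own ORDER BY or LIMIT clause.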

Data Validation

The validation pipeline evaluates datasets across seven dimensions:

  • Missing Values: Percentage of null or empty cells
  • Completeness: Overall data coverage per column
  • Duplicates: Detection of duplicate rows
  • Data Types: Consistency and appropriateness of column types
  • Statistical Quality: Distribution analysis and outlier detection
  • Schema Consistency: Structure validation and column count checks
  • Data Range: Validation of value ranges (e.g., negative ages)

Only datasets scoring 80/100 or higher are accepted into the marketplace, ensuring high-quality data for consumers.
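Two of the seven dimensions can be illustrated in plain Python. The project computes these with Polars over full DataFrames; this stdlib sketch over rows-as-dicts only shows the metric definitions, and the function names are assumptions.

```python
# A sketch of two of the seven checks; None or "" marks a missing cell.
def missing_value_pct(rows, columns):
    """Percentage of cells that are null or empty."""
    total = len(rows) * len(columns)
    missing = sum(1 for r in rows for c in columns if r.get(c) in (None, ""))
    return 100.0 * missing / total if total else 0.0

def duplicate_row_pct(rows, columns):
    """Percentage of rows that exactly duplicate an earlier row."""
    seen, dupes = set(), 0
    for r in rows:
        key = tuple(r.get(c) for c in columns)
        if key in seen:
            dupes += 1
        seen.add(key)
    return 100.0 * dupes / len(rows) if rows else 0.0

rows = [
    {"name": "Ada", "age": 36},
    {"name": "Ada", "age": 36},     # duplicate row
    {"name": "Alan", "age": None},  # missing cell
]
print(round(missing_value_pct(rows, ["name", "age"]), 2))  # 16.67
print(round(duplicate_row_pct(rows, ["name", "age"]), 2))  # 33.33
```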

Premium Datasets

DataSmart includes premium financial datasets powered by Capital One's Nessie API:

  • Spending Insights: Transaction data enriched with merchant categories
  • Customer Wealth Profiles: Demographics with aggregated account balances
  • P2P Transfer Network: Peer-to-peer transfer activity analysis

These datasets are cached locally in DuckDB for fast access and enriched via SQL joins to provide additional insights.

Payment System

Payments are processed directly on the Solana blockchain:

  • Microtransactions: Per-query pricing set by the data producer (typically 0.01-0.10 SOL)
  • Instant Settlement: Transactions confirm in seconds
  • Transparent Pricing: See cost before executing queries
  • Wallet Integration: Works with Phantom, Solflare, and other Solana wallets
  • Transaction Receipts: Every query generates an immutable on-chain record
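The query receipt mentioned above can be sketched as a deterministic digest over the query metadata. This is a hedged, stdlib-only illustration; the field names are assumptions, and the project's actual receipt format and on-chain recording are not shown here.

```python
import hashlib
import json

# Hypothetical receipt: a SHA-256 digest over canonicalized query metadata.
# Field names and example values are illustrative, not the real schema.
def make_receipt(wallet, dataset_id, sql, tx_signature, timestamp):
    payload = json.dumps(
        {"wallet": wallet, "dataset": dataset_id, "sql": sql,
         "tx": tx_signature, "ts": timestamp},
        sort_keys=True, separators=(",", ":"),  # canonical form
    )
    return hashlib.sha256(payload.encode()).hexdigest()

r1 = make_receipt("wallet-A", "titanic", "SELECT AVG(age) FROM t", "sig-1", 1700000000)
r2 = make_receipt("wallet-A", "titanic", "SELECT AVG(age) FROM t", "sig-1", 1700000000)
assert r1 == r2 and len(r1) == 64  # deterministic, 256-bit hex digest
```

Because the digest is deterministic, anyone holding the same metadata can recompute it and verify the receipt matches.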

Security and Privacy

  • Raw data is never exposed in previews
  • Query results are limited to 5% samples
  • All queries are logged for auditability
  • Wallet-based authentication
  • No arbitrary SQL execution
  • Schema enforcement prevents malicious queries
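A guard in the spirit of "no arbitrary SQL execution" can be sketched as a read-only allow-list check. The project's real validation is stricter (schema enforcement against known columns); this minimal version and its names are assumptions.

```python
import re

# Minimal sketch: allow a single read-only SELECT, reject everything else.
FORBIDDEN = re.compile(
    r"\b(insert|update|delete|drop|alter|create|attach|copy|pragma)\b", re.I
)

def is_safe_select(sql):
    stripped = sql.strip().rstrip(";")
    return (
        stripped.lower().startswith("select")
        and ";" not in stripped            # single statement only
        and not FORBIDDEN.search(stripped)
    )

assert is_safe_select("SELECT AVG(age) FROM passengers")
assert not is_safe_select("DROP TABLE passengers")
assert not is_safe_select("SELECT 1; DROP TABLE passengers")
```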

Future Enhancements

  • Redis queues and Celery for job scheduling
  • AWS S3 integration for cloud storage
  • Enhanced query validation and security
  • Analytics dashboard for data producers
  • Support for additional data formats
  • Multi-chain payment support

Contributing

Contributions are welcome. Please open an issue to discuss major changes before submitting a pull request.

License

This project was built for SwampHacks 2026.

Acknowledgments

  • Capital One for providing the Nessie API
  • Google for Gemini AI capabilities
  • Solana Foundation for blockchain infrastructure
