A pay-per-query data marketplace with AI-powered natural language querying and instant Solana micropayments.
DataSmart bridges the gap between public datasets that lack depth and private datasets that require expensive upfront purchases. Our platform allows data consumers to explore and query datasets before committing to a full purchase, paying only for the queries they execute.
Unlike traditional data marketplaces, DataSmart enables users to ask questions in plain English, receive instant results from validated datasets, and pay a small per-query fee settled on the Solana blockchain.
- Natural Language Querying: Ask questions in plain English using Google Gemini AI
- Pay-Per-Query Model: Microtransactions via Solana (typically 0.01-0.10 SOL per query)
- Data Validation Pipeline: Automatic quality scoring (0-100) using Polars
- Free Tier Access: 2 free queries per dataset per wallet address
- Premium Datasets: Integration with Capital One Nessie API for financial datasets
- Secure Preview: View schema and metadata without exposing raw data
- Instant Results: Queries execute in DuckDB, with results limited to a 5% sample
- Landing: Introduces DataSmart's core value proposition with an interactive interface showcasing the marketplace capabilities.
- Marketplace: Browse available datasets with filtering options, quality scores, and pricing information. Premium Capital One datasets are highlighted with special badges.
- Upload Dashboard: Upload datasets and watch them go through comprehensive validation checks covering completeness, duplicates, data types, and statistical quality.
- FAQ: Common questions about DataSmart, payments, data quality standards, and usage guidelines.
DataSmart consists of three main components:
Frontend: React-based single-page application with Solana wallet integration
- Marketplace browsing and filtering
- Natural language query interface
- Dataset upload and validation dashboard
- Payment processing via Solana
Backend: FastAPI REST API handling all business logic
- Dataset storage and management in DuckDB
- Natural language to SQL conversion using Gemini AI
- Data validation pipeline using Polars
- Premium dataset integration with Nessie API
- Transaction recording and verification
Data Layer: Local-first storage with intelligent caching
- DuckDB for persistent dataset storage
- Separate cache database for Nessie API data
- Enriched datasets built via SQL joins on cached raw data
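As a rough sketch of this layout, the enrichment step might look like the following, using DuckDB's `ATTACH` to join across the two databases. The `purchases` and `merchants` table names are hypothetical placeholders, not the project's actual schema:

```python
import duckdb

# Open the main datasets database and attach the Nessie cache read-only
# (paths mirror backend/data/ in the project tree below)
con = duckdb.connect("backend/data/datasets.db")
con.execute("ATTACH 'backend/data/nessie_cache.db' AS cache (READ_ONLY)")

# Build an enriched dataset via a SQL join over cached raw tables,
# e.g. purchases enriched with merchant categories
con.execute("""
    CREATE OR REPLACE TABLE spending_insights AS
    SELECT p.*, m.category AS merchant_category
    FROM cache.purchases AS p
    JOIN cache.merchants AS m ON p.merchant_id = m.id
""")
```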
Backend stack:
- Python 3.8+ - Core language
- FastAPI - REST API framework
- DuckDB - Analytical database for storage and querying
- Polars - Data validation and processing
- Google Gemini API - Natural language to SQL conversion
- Nessie API - Capital One financial datasets
- Uvicorn - ASGI server
Frontend stack:
- React 19 - UI framework
- TypeScript - Type safety
- Vite - Build tool and dev server
- Tailwind CSS - Styling
- Solana Web3.js - Blockchain integration
- React Router - Client-side routing
Data processing:
- Polars - Fast DataFrame operations
- PyArrow - Columnar data format support
- DuckDB - SQL query execution
Prerequisites:
- Python 3.8 or higher
- Node.js 18 or higher
- Solana wallet (Phantom, Solflare, etc.)
- Google API key for Gemini
- Nessie API key (optional, for premium datasets)
- Clone the repository:

```bash
git clone https://github.com/yourusername/swamphacks26.git
cd swamphacks26
```

- Set up the backend:

```bash
cd backend
pip install -r requirements.txt
```

- Create a `.env` file in the backend directory:

```
GOOGLE_API_KEY=your_gemini_api_key
NESSIE_API_KEY=your_nessie_api_key  # Optional
```

- Set up the frontend:

```bash
cd ../frontend
npm install
```

- Start the backend server:

```bash
cd ../backend
python server.py
```

- Start the frontend development server:

```bash
cd ../frontend
npm run dev
```

The application will be available at http://localhost:5173 (or the port Vite assigns).
```
swamphacks26/
├── backend/
│ ├── api/ # API endpoint definitions
│ ├── data/ # Dataset storage and cache
│ │ ├── datasets.db # Main DuckDB database
│ │ └── nessie_cache.db # Nessie API cache
│ ├── validation_pipeline/ # Data validation modules
│ │ ├── validator.py # Quality scoring logic
│ │ └── storage.py # DuckDB storage operations
│ ├── server.py # FastAPI application
│ ├── nessie_service.py # Capital One API integration
│ └── requirements.txt # Python dependencies
├── frontend/
│ ├── src/
│ │ ├── pages/ # Page components
│ │ │ ├── Landing.tsx
│ │ │ ├── Marketplace.tsx
│ │ │ ├── QueryInterface.tsx
│ │ │ └── UploadDashboard.tsx
│ │ ├── components/ # Reusable components
│ │ ├── utils/ # Utility functions
│ │ └── main.tsx # Application entry point
│ ├── public/ # Static assets
│ └── package.json # Node dependencies
└── README.md
```
For Data Producers:
- Upload CSV or Parquet files through the web interface
- System automatically validates data quality using weighted scoring:
  - Missing values (25% weight)
  - Data completeness (20% weight)
  - Duplicate detection (15% weight)
  - Data type consistency (15% weight)
  - Statistical quality (10% weight)
  - Schema consistency (10% weight)
  - Data range validation (5% weight)
- Datasets scoring 80/100 or higher are stored in DuckDB
- Set pricing per query and provide metadata (category, description)
- Receive payments directly to your Solana wallet address
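A minimal sketch of what the upload-and-validate step could look like on the FastAPI side. The endpoint path, `validate_dataset`, and `store_dataset` are illustrative stand-ins for the project's `validation_pipeline` modules, not its actual code:

```python
import io

import duckdb
import polars as pl
from fastapi import FastAPI, HTTPException, UploadFile

app = FastAPI()

def validate_dataset(df: pl.DataFrame) -> float:
    # Stub for the weighted seven-dimension scorer; here it only
    # penalizes missing values (see Data Quality Standards below)
    null_ratio = sum(df.null_count().row(0)) / max(df.height * df.width, 1)
    return 100.0 * (1 - null_ratio)

def store_dataset(df: pl.DataFrame, name: str) -> None:
    # Persist an accepted dataset into the main DuckDB database
    con = duckdb.connect("backend/data/datasets.db")
    con.register("incoming", df)
    con.execute(f'CREATE OR REPLACE TABLE "{name}" AS SELECT * FROM incoming')

@app.post("/datasets")
async def upload_dataset(file: UploadFile):
    raw = await file.read()
    # Accept CSV or Parquet, matching the supported upload formats
    if file.filename and file.filename.endswith(".parquet"):
        df = pl.read_parquet(io.BytesIO(raw))
    else:
        df = pl.read_csv(io.BytesIO(raw))

    score = validate_dataset(df)
    if score < 80:  # marketplace acceptance threshold
        raise HTTPException(status_code=422, detail=f"Quality score {score:.0f} is below 80")

    store_dataset(df, name=file.filename or "unnamed")
    return {"quality_score": score}
```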
For Data Consumers:
- Browse the marketplace and filter by category, file type, or quality score
- Preview dataset schema, column names, and quality metrics
- Connect Solana wallet to enable querying
- Ask questions in natural language (e.g., "What's the average age of passengers?")
- System converts query to SQL using Gemini AI
- Execute query and receive results (limited to 5% of total rows)
- Pay per query via Solana micropayment
- Option to purchase full dataset access
Query Flow (a minimal sketch follows the list):
- User submits a natural language query
- Gemini AI interprets intent and generates SQL query
- Query is validated and executed against DuckDB
- Results are limited to a 5% sample (max 100 rows)
- User approves Solana payment
- Transaction is recorded and results are returned
- Cryptographic receipt is generated for auditability
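A minimal sketch of the natural-language-to-SQL portion of this flow, assuming the `google-generativeai` client. The model name, prompt wording, and sampling math are illustrative assumptions, not the project's exact implementation:

```python
import duckdb
import google.generativeai as genai

genai.configure(api_key="YOUR_GOOGLE_API_KEY")
model = genai.GenerativeModel("gemini-1.5-flash")  # assumed model choice

def answer_question(question: str, table: str, schema: str):
    # Interpret intent and generate SQL from the natural language query
    prompt = (
        f'Given the DuckDB table "{table}" with schema:\n{schema}\n'
        f"Write one SELECT statement that answers: {question}\n"
        "Return only the SQL, with no explanation or markdown."
    )
    sql = model.generate_content(prompt).text.strip().strip("`")

    # Execute against DuckDB, capping results at a 5% sample (max 100 rows)
    con = duckdb.connect("backend/data/datasets.db", read_only=True)
    total = con.execute(f'SELECT count(*) FROM "{table}"').fetchone()[0]
    limit = min(max(total // 20, 1), 100)
    return con.execute(f"SELECT * FROM ({sql}) LIMIT {limit}").fetchall()
```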
The validation pipeline evaluates datasets across seven dimensions:
- Missing Values: Percentage of null or empty cells
- Completeness: Overall data coverage per column
- Duplicates: Detection of duplicate rows
- Data Types: Consistency and appropriateness of column types
- Statistical Quality: Distribution analysis and outlier detection
- Schema Consistency: Structure validation and column count checks
- Data Range: Validation of value ranges (e.g., negative ages)
Only datasets scoring 80/100 or higher are accepted into the marketplace, ensuring high-quality data for consumers.
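A simplified sketch of how such a weighted score could be computed with Polars; only three of the seven dimensions are scored here, and the rest are left at a perfect 100 for brevity:

```python
import polars as pl

WEIGHTS = {
    "missing_values": 0.25, "completeness": 0.20, "duplicates": 0.15,
    "data_types": 0.15, "statistical_quality": 0.10,
    "schema_consistency": 0.10, "data_range": 0.05,
}

def quality_score(df: pl.DataFrame) -> float:
    cells = max(df.height * df.width, 1)
    null_ratio = sum(df.null_count().row(0)) / cells
    dup_ratio = (df.height - df.unique().height) / max(df.height, 1)

    scores = {dim: 100.0 for dim in WEIGHTS}  # unscored dimensions assumed perfect
    scores["missing_values"] = 100.0 * (1 - null_ratio)
    scores["completeness"] = 100.0 * (1 - null_ratio)
    scores["duplicates"] = 100.0 * (1 - dup_ratio)
    return sum(WEIGHTS[dim] * scores[dim] for dim in WEIGHTS)

df = pl.DataFrame({"age": [34, 34, None], "name": ["Ann", "Ann", None]})
print(quality_score(df), quality_score(df) >= 80)  # acceptance threshold
```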
DataSmart includes premium financial datasets powered by Capital One's Nessie API:
- Spending Insights: Transaction data enriched with merchant categories
- Customer Wealth Profiles: Demographics with aggregated account balances
- P2P Transfer Network: Peer-to-peer transfer activity analysis
These datasets are cached locally in DuckDB for fast access and enriched via SQL joins to provide additional insights.
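The caching step might look roughly like this; the base URL and `/customers` endpoint follow Nessie's public documentation, but the response layout and table names here are assumptions:

```python
import duckdb
import polars as pl
import requests

NESSIE_BASE = "http://api.nessieisreal.com"
NESSIE_KEY = "YOUR_NESSIE_API_KEY"

def cache_nessie_table(endpoint: str, table: str) -> None:
    # Pull raw records from the Nessie API...
    resp = requests.get(f"{NESSIE_BASE}/{endpoint}", params={"key": NESSIE_KEY}, timeout=10)
    resp.raise_for_status()
    df = pl.DataFrame(resp.json())

    # ...and persist them into the local cache database
    con = duckdb.connect("backend/data/nessie_cache.db")
    con.register("incoming", df)
    con.execute(f'CREATE OR REPLACE TABLE "{table}" AS SELECT * FROM incoming')

cache_nessie_table("customers", "customers")
```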
Payments are processed directly on the Solana blockchain:
- Microtransactions: Queries typically cost 0.01-0.10 SOL
- Instant Settlement: Transactions confirm in seconds
- Transparent Pricing: See cost before executing queries
- Wallet Integration: Works with Phantom, Solflare, and other Solana wallets
- Transaction Receipts: Every query generates an immutable on-chain record
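On the backend, confirming a payment before releasing results could be sketched with `solana-py`; the devnet RPC endpoint and the status check shown are illustrative assumptions, not the project's verified logic:

```python
from solana.rpc.api import Client
from solders.signature import Signature

client = Client("https://api.devnet.solana.com")  # assumed RPC endpoint

def payment_confirmed(signature: str) -> bool:
    # Look up the transaction status and treat any confirmation
    # level (processed/confirmed/finalized) as paid
    sig = Signature.from_string(signature)
    status = client.get_signature_statuses([sig]).value[0]
    return status is not None and status.confirmation_status is not None
```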
Security:
- Raw data is never exposed in previews
- Query results are limited to 5% samples
- All queries are logged for auditability
- Wallet-based authentication
- No arbitrary SQL execution
- Schema enforcement prevents malicious queries
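An illustrative guard of the kind these rules imply: accept only a single SELECT statement and reject anything containing write or DDL keywords. The project's real schema-enforcement logic is likely more involved:

```python
import re

# Keywords that would mutate data or attach external resources
FORBIDDEN = re.compile(
    r"\b(insert|update|delete|drop|alter|create|attach|copy|pragma|install|load)\b",
    re.IGNORECASE,
)

def is_safe_select(sql: str) -> bool:
    stmt = sql.strip().rstrip(";")
    return (
        stmt.lower().startswith("select")  # reads only
        and ";" not in stmt                # a single statement
        and not FORBIDDEN.search(stmt)
    )

assert is_safe_select("SELECT avg(age) FROM passengers")
assert not is_safe_select("DROP TABLE passengers")
```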
Roadmap:
- Redis queues and Celery for job scheduling
- AWS S3 integration for cloud storage
- Enhanced query validation and security
- Analytics dashboard for data producers
- Support for additional data formats
- Multi-chain payment support
Contributions are welcome. Please open an issue to discuss major changes before submitting a pull request.
This project was built for SwampHacks 2026.
Acknowledgments:
- Capital One for providing the Nessie API
- Google for Gemini AI capabilities
- Solana Foundation for blockchain infrastructure