karan3691/dataset-builder


🔍 Dataset Builder 📚

A complete Python pipeline that automates the creation of structured datasets from natural language search queries. This tool searches the web for content matching your query, scrapes and cleans the content, and outputs a structured dataset in multiple formats.

🌟 Features

  • Flexible Search Options: Use DuckDuckGo search API or SerpAPI with fallback to HTML scraping
  • Robust Content Extraction: Multi-strategy approach using newspaper3k, readability-lxml, and custom BeautifulSoup extractors
  • Advanced Text Cleaning: Remove boilerplate text, normalize formatting, and detect duplicates
  • Multiple Output Formats: Save as JSON, CSV, or Hugging Face datasets
  • HuggingFace Integration: Directly upload to Hugging Face Hub
  • Intelligent Dataset Naming: Automatic naming based on search queries for easy identification
  • Interactive Streamlit UI: Build and browse datasets through an intuitive web interface
  • REST API Access: FastAPI endpoint for programmatic dataset creation
  • Preview Functionality: View your dataset content with professionally formatted Markdown tables
  • CLI Interface: Simple command-line interface for easy usage
  • Extensible Design: Modular architecture for easy customization and extension
  • Comprehensive Testing: Extensive test suite covering all components

📺 Interactive UI

Dataset Builder includes a Streamlit-based user interface for an intuitive visual experience:

Key UI Features

  • Dataset Creation: Build datasets from natural language queries with a simple form
  • Dataset Browsing: View and explore all your existing datasets
  • Rich Previews: See nicely formatted tables of your dataset content
  • Full Article View: Read the complete content of any article in your datasets
  • Configuration: Adjust all settings through the UI without editing config files
  • HuggingFace Integration: Upload to HF Hub directly from the interface

Running the UI

# Start the Streamlit app
python -m ui.streamlit_app

Then open your browser to http://localhost:8501 to access the UI.

UI Screenshots

Dataset Creation

[Screenshot: Dataset Creation screen] The Create Dataset tab allows you to input a search query and build a new dataset with customizable settings.

Dataset Browsing

[Screenshot: Browse Datasets screen] The Browse Datasets tab lets you explore existing datasets with rich preview functionality and full article content viewing.

🌐 REST API

Dataset Builder provides a FastAPI-based REST API for programmatic access:

Key API Features

  • Dataset Creation: POST endpoint to build datasets from queries
  • Configuration Access: GET endpoint to retrieve current settings
  • Fully Configurable: All pipeline options configurable via API
  • JSON Response: Structured responses with dataset paths and stats
  • HuggingFace Integration: Upload datasets to HF Hub through the API

Running the API Server

# Start the FastAPI server
python -m ui.fastapi_app

The API will be available at http://localhost:8000, with automatic documentation at http://localhost:8000/docs.

API Usage Examples

Creating a Dataset

# Create a dataset with curl
curl -X POST "http://localhost:8000/dataset" \
  -H "Content-Type: application/json" \
  -d '{
    "query": "AI advances 2023",
    "max_results": 15,
    "save_format": "all",
    "upload_to_hf": false
  }'

Response:

{
  "success": true,
  "message": "Dataset created successfully",
  "dataset_path": "./output/ai_advances_2023_20250405_123456",
  "hf_repo_url": null,
  "stats": {
    "query": "AI advances 2023",
    "engine": "duckduckgo",
    "max_results": 15
  }
}

Getting Configuration

# Retrieve current configuration
curl http://localhost:8000/config

Response:

{
  "search": {
    "engine": "duckduckgo",
    "max_results": 20
  },
  "dataset": {
    "output_dir": "./output",
    "dataset_name": null,
    "save_format": "all",
    "upload_to_hf": false
  }
}

Python Client Example

import requests
import json

# API endpoint
api_url = "http://localhost:8000"

# Create a dataset
response = requests.post(
    f"{api_url}/dataset",
    json={
        "query": "quantum computing breakthroughs",
        "max_results": 10,
        "save_format": "hf",
        "upload_to_hf": True,
        "hf_repo_id": "yourusername/quantum-computing-dataset"
    }
)

# Print the response
result = response.json()
print(f"Success: {result['success']}")
print(f"Dataset path: {result['dataset_path']}")

if result['hf_repo_url']:
    print(f"Hugging Face URL: {result['hf_repo_url']}")

Using with JavaScript/Node.js

// Node.js 18+ provides a global fetch; on older versions, install node-fetch
const fetch = require('node-fetch');

async function createDataset() {
  const response = await fetch('http://localhost:8000/dataset', {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
    },
    body: JSON.stringify({
      query: 'renewable energy research',
      max_results: 20,
      save_format: 'json',
    }),
  });
  
  const data = await response.json();
  console.log(`Dataset created: ${data.success}`);
  console.log(`Path: ${data.dataset_path}`);
}

createDataset();

API Documentation

For complete API documentation, visit the auto-generated Swagger UI at http://localhost:8000/docs after starting the server.

📋 Requirements

  • Python 3.7+
  • Required packages:
    • Web & Data: requests, beautifulsoup4, newspaper3k, readability-lxml, pandas
    • Datasets: datasets, huggingface-hub
    • Search: duckduckgo-search, google-search-results (for SerpAPI)
    • UI: streamlit, fastapi, uvicorn
    • Utils: python-dotenv, tqdm, pyyaml

Full dependencies are listed in requirements.txt.

🚀 Installation

From Source

  1. Clone the repository:

    git clone https://github.com/yourusername/dataset-builder.git
    cd dataset-builder
  2. Install the package:

    pip install -e .
  3. Or with development dependencies:

    pip install -e ".[dev]"

Using Pip

pip install git+https://github.com/yourusername/dataset-builder.git

Using the Command-line Tool

After installation, the package provides a command-line tool:

# Use the command-line tool directly
dataset-builder build "AI research papers"

# Get help
dataset-builder --help

Dependencies

The package requirements are installed automatically, but you can also install them manually:

# Core dependencies
pip install -r requirements.txt

# Development dependencies (optional)
pip install pytest pytest-cov black mypy isort

⚙️ Configuration

  1. Copy the example environment file and edit it:

    cp .env.example .env
  2. Edit .env to configure:

    • API keys for search providers
    • Search settings (max results, engine)
    • Scraping parameters
    • Text cleaning options
    • Dataset output preferences
    • Hugging Face Hub credentials (if uploading)

All configuration options can also be adjusted through the Streamlit UI.
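
For reference, a minimal `.env` might look like the following; the variable names here are illustrative, so check `.env.example` for the exact keys your version expects:

```ini
# Search settings (names illustrative -- see .env.example for the real keys)
SEARCH_ENGINE=duckduckgo
MAX_RESULTS=20
SERPAPI_API_KEY=your-serpapi-key

# Dataset output
OUTPUT_DIR=./output
SAVE_FORMAT=all

# Hugging Face Hub (only needed when uploading)
HF_TOKEN=your-hf-token
```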

📊 Usage

Streamlit Interface

The easiest way to use Dataset Builder is through the Streamlit UI:

python -m ui.streamlit_app

The UI provides two main tabs:

  1. Create Dataset: Build new datasets from search queries
  2. Browse Datasets: Explore and preview existing datasets

Creating a Dataset in the UI

  1. Enter your search query in the text area
  2. Adjust settings in the sidebar if needed
  3. Click "Build Dataset"
  4. Wait for the process to complete
  5. View the dataset preview and results

Browsing Datasets in the UI

  1. Navigate to the "Browse Datasets" tab
  2. Select a dataset from the dropdown
  3. Adjust the number of rows to preview
  4. Optionally view the full content of any article

Command Line Interface

Building Datasets

Build a dataset from a search query:

python -m dataset_builder.main build "climate change articles 2023"

The dataset name will be automatically generated from your search query (e.g., climate_change_articles_2023_20250405_001907).

Specifying Output Format

Choose specific output formats with the --format parameter:

# Create only CSV format
python -m dataset_builder.main build "AI news articles" --format csv

# Create only JSON format
python -m dataset_builder.main build "AI news articles" --format json

# Create only Hugging Face dataset format
python -m dataset_builder.main build "AI news articles" --format hf

# Create all formats (default)
python -m dataset_builder.main build "AI news articles" --format all

Additional Options

# Custom output directory
python -m dataset_builder.main build "Python tutorials" --output-dir ./my_datasets 

# Custom dataset name
python -m dataset_builder.main build "Python tutorials" --name python_advanced_tutorials

# Set maximum number of search results
python -m dataset_builder.main build "Python tutorials" --max-results 30

# Upload to Hugging Face Hub
python -m dataset_builder.main build "Python tutorials" --upload --repo-id yourusername/python-tutorials

Display Current Configuration

python -m dataset_builder.main info

Working with Datasets

Previewing Datasets

After creating a dataset, you can easily preview its contents:

# Basic preview in terminal
python preview_dataset.py --path ./output/your_dataset_name

# Specify number of rows to show
python preview_dataset.py --path ./output/your_dataset_name --rows 10

# Save to Markdown file for better formatting
python preview_dataset.py --path ./output/your_dataset_name --output preview.md

# Add a custom title
python preview_dataset.py --path ./output/your_dataset_name --title "My Amazing Dataset" --output preview.md

Checking Dataset Validity

Verify your dataset is properly structured and accessible:

python check_dataset.py

The script will check:

  • Local dataset validity
  • Remote dataset connection (if uploaded to HF Hub)
  • Compatibility with both standard and non-standard dataset structures

Loading Datasets in Python

# Load a Hugging Face format dataset from disk
from datasets import load_from_disk

# Load the dataset
dataset = load_from_disk('./output/your_dataset_name')

# View basic info
print(dataset)
# Example output: Dataset({features: ['url', 'title', 'content', 'author', 'publish_date', 'source'], num_rows: 18})

# See the first example
print(dataset[0])

# Access specific fields
for item in dataset.select(range(3)):  # First 3 items
    print(f"Title: {item['title']}")
    print(f"Source: {item['source']}")
    print(f"Content preview: {item['content'][:200]}...")

Loading from Hugging Face Hub

If you've uploaded your dataset to Hugging Face Hub:

from datasets import load_dataset

# Load dataset from HF Hub
dataset = load_dataset("yourusername/dataset-repo-name")

📁 Output Formats & Structure

The dataset builder creates structured datasets with the following fields:

  • url: Source URL of the content
  • title: Title of the article/content
  • content: Main text content
  • author: Author information (if available)
  • publish_date: Publication date (if available)
  • source: Website domain or explicit source name
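
For instance, a single record serialized to JSON might look like this (all values are illustrative):

```json
{
  "url": "https://example.com/articles/climate-report",
  "title": "Example Article Title",
  "content": "Main body text of the article...",
  "author": "Jane Doe",
  "publish_date": "2023-04-15",
  "source": "example.com"
}
```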

Output Format Details

  1. JSON (.json):

    • Complete raw data with all fields
    • Useful for processing with JavaScript or other JSON-compatible tools
    • Example: ./output/climate_change_articles_20250405_001907.json
  2. CSV (.csv):

    • Tabular format with all main fields
    • Compatible with Excel, Google Sheets, pandas, etc.
    • Example: ./output/climate_change_articles_20250405_001907.csv
  3. Hugging Face Dataset (directory):

    • Complete dataset in optimized Arrow format
    • Includes metadata and dataset card (README.md)
    • Compatible with 🤗 Datasets library and HF Hub
    • Example: ./output/climate_change_articles_20250405_001907/
    • Contains:
      • data-00000-of-00001.arrow - Arrow format data
      • dataset_info.json - Schema information
      • README.md - Dataset description
      • state.json - Metadata

Dataset Naming

Datasets are automatically named based on your search query:

  • Search terms are converted to lowercase and spaces replaced with underscores
  • Special characters are removed
  • A timestamp is added to ensure uniqueness
  • Example: Search for "Machine Learning Ethics" → machine_learning_ethics_20250405_001907
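
The naming scheme can be sketched as a small helper (a simplified approximation for illustration, not the package's actual code):

```python
import re
from datetime import datetime
from typing import Optional

def make_dataset_name(query: str, now: Optional[datetime] = None) -> str:
    """Approximate the automatic naming scheme: lowercase, spaces to
    underscores, special characters removed, timestamp suffix appended."""
    slug = re.sub(r"[^a-z0-9\s]", "", query.lower())  # drop special characters
    slug = re.sub(r"\s+", "_", slug.strip())          # spaces -> underscores
    timestamp = (now or datetime.now()).strftime("%Y%m%d_%H%M%S")
    return f"{slug}_{timestamp}"

print(make_dataset_name("Machine Learning Ethics",
                        datetime(2025, 4, 5, 0, 19, 7)))
# machine_learning_ethics_20250405_001907
```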

🔍 Dataset Preview Features

The dataset builder includes powerful preview capabilities:

  • Scholarly Source Attribution: Display professional source names instead of search engines
  • Cleaner Content: Removes HTML, CSS fragments, and unnecessary formatting
  • Standardized Date Format: Consistent date presentation across all sources
  • Formatted Word Counts: Easy-to-read numbers with thousands separators
  • Truncated Text: Prevents table breaking with sensible column width limits
  • Full Content View: Read complete articles in a scrollable area

Preview Formatting

The preview_dataset.py script applies several transformations to make your data more presentable:

  • Title Cleaning: Removes trailing website names and truncates long titles
  • Author Formatting: Strips HTML/CSS fragments and limits to first few authors with "et al." notation
  • Source Enhancement: Replaces generic sources like "duckduckgo" with scholarly publication names
  • Date Standardization: Converts dates to a consistent "Apr 15, 2023" format
  • Word Count Formatting: Adds thousands separators (e.g., "1,234" instead of "1234")
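
Two of these transformations can be sketched in a few lines (simplified stand-ins for illustration, not the script's actual helpers):

```python
from datetime import datetime

def format_date(raw: str) -> str:
    """Convert an ISO-style date to the 'Apr 15, 2023' presentation."""
    dt = datetime.strptime(raw, "%Y-%m-%d")
    # %d pads with a zero; strip it for days 1-9 to keep the format portable
    return dt.strftime("%b %d, %Y").replace(" 0", " ")

def format_word_count(n: int) -> str:
    """Add thousands separators, e.g. 1234 -> '1,234'."""
    return f"{n:,}"

print(format_date("2023-04-15"))   # Apr 15, 2023
print(format_word_count(1234))     # 1,234
```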

πŸ“‚ Project Structure

dataset-builder/
├── dataset_builder/        # Core package
│   ├── config.py           # Configuration management
│   ├── pipeline.py         # Main dataset pipeline
│   ├── search/             # Search provider implementations
│   ├── scrapers/           # Content extraction modules
│   ├── cleaners/           # Text cleaning utilities
│   ├── dataset/            # Dataset creation and formatting
│   ├── utils/              # Helper utilities
│   └── main.py             # CLI interface
├── ui/
│   ├── streamlit_app.py    # Streamlit web interface
│   └── fastapi_app.py      # REST API interface
├── tests/                  # Comprehensive test suite
│   ├── test_search.py      # Tests for search providers
│   ├── test_scraper.py     # Tests for content extraction
│   ├── test_cleaner.py     # Tests for text cleaning
│   ├── test_dataset.py     # Tests for dataset creation
│   └── test_pipeline.py    # Integration tests for the pipeline
├── preview_dataset.py      # Dataset preview functionality
├── check_dataset.py        # Dataset validation tool
├── requirements.txt        # Dependencies
├── .env.example            # Example environment config
├── setup.py                # Python package installation
├── MANIFEST.in             # Package manifest file
├── LICENSE                 # MIT License
├── .gitignore              # Git ignore file
└── README.md               # This documentation

🔧 Extending the Tool

The modular design allows for customization:

  • New Search Providers: Extend SearchProvider class in dataset_builder/search/
  • Custom Extraction Logic: Modify or add new extractors in dataset_builder/scrapers/
  • Advanced Cleaning: Enhance the TextCleaner class in dataset_builder/cleaners/
  • New Output Formats: Extend the DatasetBuilder in dataset_builder/dataset/
  • UI Components: Add new Streamlit features to ui/streamlit_app.py
  • API Endpoints: Extend the FastAPI application in ui/fastapi_app.py
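
As an illustration, a custom search provider might be sketched like this; the `SearchProvider` base class is real per the project layout, but the interface shown here is an assumption:

```python
from typing import Dict, List

# NOTE: the real SearchProvider base class lives in dataset_builder/search/;
# this stand-in is hypothetical and its method names may differ.
class SearchProvider:
    def search(self, query: str, max_results: int = 20) -> List[Dict[str, str]]:
        raise NotImplementedError

class StaticSearchProvider(SearchProvider):
    """Toy provider returning canned results; a real one would call an API."""
    def search(self, query: str, max_results: int = 20) -> List[Dict[str, str]]:
        results = [
            {"url": "https://example.com/a", "title": f"First result for {query}"},
            {"url": "https://example.com/b", "title": f"Second result for {query}"},
        ]
        return results[:max_results]

provider = StaticSearchProvider()
print(len(provider.search("AI research", max_results=1)))  # 1
```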

🧪 Testing

The project includes a comprehensive test suite that covers all components:

# Run all tests
pytest

# Run tests with coverage report
pytest --cov=dataset_builder

# Run specific test suite
pytest tests/test_search.py

Test Coverage

  • Search: Tests different search providers and result handling
  • Scraping: Tests content extraction from different HTML structures
  • Cleaning: Tests text normalization and de-duplication
  • Dataset: Tests dataset creation, formatting, and uploading
  • Pipeline: Integration tests for the complete pipeline

📦 Package Distribution

This project is set up as a proper Python package with:

  • setup.py: Standard Python package setup
  • MANIFEST.in: List of non-Python files to include in distribution
  • LICENSE: MIT License
  • requirements.txt: All dependencies with version specifications

You can build the package for distribution with:

# Build the package
python -m pip install build
python -m build

# The package will be available in the dist/ directory

📈 Current & Future Enhancements

Implemented:

  • Streamlit UI for interactive dataset building
  • Professional data preview with scholar-focused formatting
  • Full article content viewing
  • Support for datasets without a 'train' split
  • Multi-format dataset output (JSON, CSV, HF)
  • FastAPI REST endpoint for programmatic access

Planned:

  • Language detection and filtering
  • Content filtering by readability score
  • Advanced duplicate detection
  • Dataset merging capabilities

📄 License

MIT License

🤝 Contributing

Contributions welcome! Please feel free to submit pull requests.

πŸ† Quick Start Guide

First-time User Guide

  1. Install the package

    git clone https://github.com/yourusername/dataset-builder.git
    cd dataset-builder
    pip install -e .
  2. Launch the UI

    python -m ui.streamlit_app
  3. Create your first dataset

    • Enter a search query like "AI advances 2023"
    • Click "Build Dataset"
    • Explore the results in the preview
  4. Check out your existing datasets

    • Go to the "Browse Datasets" tab
    • Select any dataset to preview its contents
    • View full article content by checking the option
  5. Load and use your dataset in code

    from datasets import load_from_disk
    # replace the timestamp suffix with your dataset's actual directory name
    dataset = load_from_disk('./output/ai_advances_2023_20250405_123456')
    print(dataset[0]['title'])

Happy dataset building!
