A complete Python pipeline that automates the creation of structured datasets from natural language search queries. This tool searches the web for content matching your query, scrapes and cleans the content, and outputs a structured dataset in multiple formats.
- Flexible Search Options: Use the DuckDuckGo search API or SerpAPI, with fallback to HTML scraping
- Robust Content Extraction: Multi-strategy approach using newspaper3k, readability-lxml, and custom BeautifulSoup extractors
- Advanced Text Cleaning: Remove boilerplate text, normalize formatting, and detect duplicates
- Multiple Output Formats: Save as JSON, CSV, or Hugging Face datasets
- HuggingFace Integration: Directly upload to Hugging Face Hub
- Intelligent Dataset Naming: Automatic naming based on search queries for easy identification
- Interactive Streamlit UI: Build and browse datasets through an intuitive web interface
- REST API Access: FastAPI endpoint for programmatic dataset creation
- Preview Functionality: View your dataset content with professionally formatted Markdown tables
- CLI Interface: Simple command-line interface for easy usage
- Extensible Design: Modular architecture for easy customization and extension
- Comprehensive Testing: Extensive test suite covering all components
Dataset Builder includes a Streamlit-based user interface for an intuitive visual experience:
- Dataset Creation: Build datasets from natural language queries with a simple form
- Dataset Browsing: View and explore all your existing datasets
- Rich Previews: See nicely formatted tables of your dataset content
- Full Article View: Read the complete content of any article in your datasets
- Configuration: Adjust all settings through the UI without editing config files
- HuggingFace Integration: Upload to HF Hub directly from the interface
```bash
# Start the Streamlit app
python -m ui.streamlit_app
```

Then open your browser to http://localhost:8501 to access the UI.
The Create Dataset tab allows you to input a search query and build a new dataset with customizable settings.
The Browse Datasets tab lets you explore existing datasets with rich preview functionality and full article content viewing.
Dataset Builder provides a FastAPI-based REST API for programmatic access:
- Dataset Creation: POST endpoint to build datasets from queries
- Configuration Access: GET endpoint to retrieve current settings
- Fully Configurable: All pipeline options configurable via API
- JSON Response: Structured responses with dataset paths and stats
- HuggingFace Integration: Upload datasets to HF Hub through the API
```bash
# Start the FastAPI server
python -m ui.fastapi_app
```

The API will be available at http://localhost:8000, with automatic documentation at http://localhost:8000/docs.
```bash
# Create a dataset with curl
curl -X POST "http://localhost:8000/dataset" \
  -H "Content-Type: application/json" \
  -d '{
    "query": "AI advances 2023",
    "max_results": 15,
    "save_format": "all",
    "upload_to_hf": false
  }'
```

Response:

```json
{
  "success": true,
  "message": "Dataset created successfully",
  "dataset_path": "./output/ai_advances_2023_20250405_123456",
  "hf_repo_url": null,
  "stats": {
    "query": "AI advances 2023",
    "engine": "duckduckgo",
    "max_results": 15
  }
}
```

```bash
# Retrieve current configuration
curl http://localhost:8000/config
```

Response:

```json
{
  "search": {
    "engine": "duckduckgo",
    "max_results": 20
  },
  "dataset": {
    "output_dir": "./output",
    "dataset_name": null,
    "save_format": "all",
    "upload_to_hf": false
  }
}
```

```python
import requests

# API endpoint
api_url = "http://localhost:8000"

# Create a dataset
response = requests.post(
    f"{api_url}/dataset",
    json={
        "query": "quantum computing breakthroughs",
        "max_results": 10,
        "save_format": "hf",
        "upload_to_hf": True,
        "hf_repo_id": "yourusername/quantum-computing-dataset"
    }
)

# Print the response
result = response.json()
print(f"Success: {result['success']}")
print(f"Dataset path: {result['dataset_path']}")
if result['hf_repo_url']:
    print(f"Hugging Face URL: {result['hf_repo_url']}")
```

```javascript
// Using Node.js with fetch
const fetch = require('node-fetch');

async function createDataset() {
  const response = await fetch('http://localhost:8000/dataset', {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
    },
    body: JSON.stringify({
      query: 'renewable energy research',
      max_results: 20,
      save_format: 'json',
    }),
  });
  const data = await response.json();
  console.log(`Dataset created: ${data.success}`);
  console.log(`Path: ${data.dataset_path}`);
}

createDataset();
```

For complete API documentation, visit the auto-generated Swagger UI at http://localhost:8000/docs after starting the server.
- Python 3.7+
- Required packages:
- Web & Data: requests, beautifulsoup4, newspaper3k, readability-lxml, pandas
- Datasets: datasets, huggingface-hub
- Search: duckduckgo-search, google-search-results (for SerpAPI)
- UI: streamlit, fastapi, uvicorn
- Utils: python-dotenv, tqdm, pyyaml
Full dependencies are listed in requirements.txt.
1. Clone the repository:

   ```bash
   git clone https://github.com/yourusername/dataset-builder.git
   cd dataset-builder
   ```

2. Install the package:

   ```bash
   pip install -e .
   ```

   Or with development dependencies:

   ```bash
   pip install -e ".[dev]"
   ```

Alternatively, install directly from GitHub:

```bash
pip install git+https://github.com/yourusername/dataset-builder.git
```

After installation, the package provides a command-line tool:
```bash
# Use the command-line tool directly
dataset-builder build "AI research papers"

# Get help
dataset-builder --help
```

The package requirements are automatically installed, but you may need to manually install:

```bash
# Core dependencies
pip install -r requirements.txt

# Development dependencies (optional)
pip install pytest pytest-cov black mypy isort
```
1. Copy the example environment file:

   ```bash
   cp .env.example .env
   ```

2. Edit `.env` to configure:
   - API keys for search providers
   - Search settings (max results, engine)
   - Scraping parameters
   - Text cleaning options
   - Dataset output preferences
   - Hugging Face Hub credentials (if uploading)
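For illustration, a filled-in `.env` might look like the fragment below; the variable names here are assumptions for the sketch, so check `.env.example` for the actual keys:

```ini
# Hypothetical values — see .env.example for the real variable names
SERPAPI_API_KEY=your-serpapi-key
SEARCH_ENGINE=duckduckgo
MAX_RESULTS=20
OUTPUT_DIR=./output
HF_TOKEN=hf_your_token_here
```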
All configuration options can also be adjusted through the Streamlit UI.
The easiest way to use Dataset Builder is through the Streamlit UI:
```bash
python -m ui.streamlit_app
```

The UI provides two main tabs:
- Create Dataset: Build new datasets from search queries
- Browse Datasets: Explore and preview existing datasets
To create a dataset:
- Enter your search query in the text area
- Adjust settings in the sidebar if needed
- Click "Build Dataset"
- Wait for the process to complete
- View the dataset preview and results
To browse existing datasets:
- Navigate to the "Browse Datasets" tab
- Select a dataset from the dropdown
- Adjust the number of rows to preview
- Optionally view the full content of any article
Build a dataset from a search query:

```bash
python -m dataset_builder.main build "climate change articles 2023"
```

The dataset name will be automatically generated from your search query (e.g., `climate_change_articles_2023_20250405_001907`).
Choose specific output formats with the `--format` parameter:

```bash
# Create only CSV format
python -m dataset_builder.main build "AI news articles" --format csv

# Create only JSON format
python -m dataset_builder.main build "AI news articles" --format json

# Create only Hugging Face dataset format
python -m dataset_builder.main build "AI news articles" --format hf

# Create all formats (default)
python -m dataset_builder.main build "AI news articles" --format all
```

```bash
# Custom output directory
python -m dataset_builder.main build "Python tutorials" --output-dir ./my_datasets

# Custom dataset name
python -m dataset_builder.main build "Python tutorials" --name python_advanced_tutorials

# Set maximum number of search results
python -m dataset_builder.main build "Python tutorials" --max-results 30

# Upload to Hugging Face Hub
python -m dataset_builder.main build "Python tutorials" --upload --repo-id yourusername/python-tutorials
```

Show pipeline information:

```bash
python -m dataset_builder.main info
```

After creating a dataset, you can easily preview its contents:
```bash
# Basic preview in terminal
python preview_dataset.py --path ./output/your_dataset_name

# Specify number of rows to show
python preview_dataset.py --path ./output/your_dataset_name --rows 10

# Save to Markdown file for better formatting
python preview_dataset.py --path ./output/your_dataset_name --output preview.md

# Add a custom title
python preview_dataset.py --path ./output/your_dataset_name --title "My Amazing Dataset" --output preview.md
```

Verify your dataset is properly structured and accessible:

```bash
python check_dataset.py
```

The script will check:
- Local dataset validity
- Remote dataset connection (if uploaded to HF Hub)
- Compatibility with both standard and non-standard dataset structures
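The "local dataset validity" check can be approximated with a few lines of standard-library Python. This is only a sketch, assuming the `save_to_disk` layout shown later in this README (`dataset_info.json` and `state.json` inside the dataset directory), not the actual logic of `check_dataset.py`:

```python
import os
import tempfile

def looks_like_hf_dataset(path: str) -> bool:
    """Rough local-validity check: the directory exists and contains
    the metadata files a saved-to-disk HF dataset is expected to have."""
    expected = {"dataset_info.json", "state.json"}
    return os.path.isdir(path) and expected.issubset(os.listdir(path))

# Demo against a throwaway directory
demo = tempfile.mkdtemp()
before = looks_like_hf_dataset(demo)   # metadata files missing -> False
for name in ("dataset_info.json", "state.json"):
    open(os.path.join(demo, name), "w").close()
after = looks_like_hf_dataset(demo)    # both files present -> True
```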
```python
# Load a Hugging Face format dataset from disk
from datasets import load_from_disk

# Load the dataset
dataset = load_from_disk('./output/your_dataset_name')

# View basic info
print(dataset)
# Example output: Dataset({features: ['url', 'title', 'content', 'author', 'publish_date', 'source'], num_rows: 18})

# See the first example
print(dataset[0])

# Access specific fields
for item in dataset.select(range(3)):  # First 3 items
    print(f"Title: {item['title']}")
    print(f"Source: {item['source']}")
    print(f"Content preview: {item['content'][:200]}...")
```

If you've uploaded your dataset to Hugging Face Hub:

```python
from datasets import load_dataset

# Load dataset from HF Hub
dataset = load_dataset("yourusername/dataset-repo-name")
```

The dataset builder creates structured datasets with the following fields:
- `url`: Source URL of the content
- `title`: Title of the article/content
- `content`: Main text content
- `author`: Author information (if available)
- `publish_date`: Publication date (if available)
- `source`: Website domain or explicit source name
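For reference, a single record with these fields might look like the following (every value here is invented for illustration):

```python
# Illustrative record only — the keys match the schema above,
# but all values are made up.
record = {
    "url": "https://example.com/article",
    "title": "Example Article",
    "content": "Main text of the article...",
    "author": "Jane Doe",
    "publish_date": "2023-04-15",
    "source": "example.com",
}
```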
- JSON (`.json`):
  - Complete raw data with all fields
  - Useful for processing with JavaScript or other JSON-compatible tools
  - Example: `./output/climate_change_articles_20250405_001907.json`
- CSV (`.csv`):
  - Tabular format with all main fields
  - Compatible with Excel, Google Sheets, pandas, etc.
  - Example: `./output/climate_change_articles_20250405_001907.csv`
- Hugging Face Dataset (directory):
  - Complete dataset in optimized Arrow format
  - Includes metadata and dataset card (README.md)
  - Compatible with the 🤗 Datasets library and HF Hub
  - Example: `./output/climate_change_articles_20250405_001907/`, containing:
    - `data-00000-of-00001.arrow`: Arrow-format data
    - `dataset_info.json`: Schema information
    - `README.md`: Dataset description
    - `state.json`: Metadata
Datasets are automatically named based on your search query:
- Search terms are converted to lowercase and spaces replaced with underscores
- Special characters are removed
- A timestamp is added to ensure uniqueness
- Example: Search for "Machine Learning Ethics" → `machine_learning_ethics_20250405_001907`
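The naming rules above can be sketched in a few lines; `make_dataset_name` is a hypothetical helper for illustration, not the pipeline's actual function:

```python
import re
from datetime import datetime

def make_dataset_name(query: str) -> str:
    """Hypothetical sketch of the naming scheme described above."""
    slug = query.lower()                        # lowercase
    slug = re.sub(r"[^a-z0-9\s]", "", slug)     # remove special characters
    slug = re.sub(r"\s+", "_", slug.strip())    # spaces -> underscores
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")  # uniqueness
    return f"{slug}_{stamp}"

name = make_dataset_name("Machine Learning Ethics")
print(name)  # e.g. machine_learning_ethics_20250405_001907
```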
The dataset builder includes powerful preview capabilities:
- Scholarly Source Attribution: Display professional source names instead of search engines
- Cleaner Content: Removes HTML, CSS fragments, and unnecessary formatting
- Standardized Date Format: Consistent date presentation across all sources
- Formatted Word Counts: Easy-to-read numbers with thousands separators
- Truncated Text: Prevents table breaking with sensible column width limits
- Full Content View: Read complete articles in a scrollable area
The preview_dataset.py script applies several transformations to make your data more presentable:
- Title Cleaning: Removes trailing website names and truncates long titles
- Author Formatting: Strips HTML/CSS fragments and limits to first few authors with "et al." notation
- Source Enhancement: Replaces generic sources like "duckduckgo" with scholarly publication names
- Date Standardization: Converts dates to a consistent "Apr 15, 2023" format
- Word Count Formatting: Adds thousands separators (e.g., "1,234" instead of "1234")
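Two of these transformations are simple enough to sketch; the helpers below are illustrative stand-ins, not the script's actual functions, and the 60-character width limit is an assumption:

```python
def format_word_count(n: int) -> str:
    # Thousands separators: 1234 -> "1,234"
    return f"{n:,}"

def truncate(text: str, width: int = 60) -> str:
    # Keep table columns from breaking by capping their width
    return text if len(text) <= width else text[: width - 1] + "…"

count = format_word_count(1234)
short = truncate("A very long article title that would otherwise break the preview table layout")
```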
```
dataset-builder/
├── dataset_builder/          # Core package
│   ├── config.py             # Configuration management
│   ├── pipeline.py           # Main dataset pipeline
│   ├── search/               # Search provider implementations
│   ├── scrapers/             # Content extraction modules
│   ├── cleaners/             # Text cleaning utilities
│   ├── dataset/              # Dataset creation and formatting
│   ├── utils/                # Helper utilities
│   └── main.py               # CLI interface
├── ui/
│   ├── streamlit_app.py      # Streamlit web interface
│   └── fastapi_app.py        # REST API interface
├── tests/                    # Comprehensive test suite
│   ├── test_search.py        # Tests for search providers
│   ├── test_scraper.py       # Tests for content extraction
│   ├── test_cleaner.py       # Tests for text cleaning
│   ├── test_dataset.py       # Tests for dataset creation
│   └── test_pipeline.py      # Integration tests for the pipeline
├── preview_dataset.py        # Dataset preview functionality
├── check_dataset.py          # Dataset validation tool
├── requirements.txt          # Dependencies
├── .env.example              # Example environment config
├── setup.py                  # Python package installation
├── MANIFEST.in               # Package manifest file
├── LICENSE                   # MIT License
├── .gitignore                # Git ignore file
└── README.md                 # This documentation
```
The modular design allows for customization:
- New Search Providers: Extend the `SearchProvider` class in `dataset_builder/search/`
- Custom Extraction Logic: Modify or add new extractors in `dataset_builder/scrapers/`
- Advanced Cleaning: Enhance the `TextCleaner` class in `dataset_builder/cleaners/`
- New Output Formats: Extend the `DatasetBuilder` in `dataset_builder/dataset/`
- UI Components: Add new Streamlit features to `ui/streamlit_app.py`
- API Endpoints: Extend the FastAPI application in `ui/fastapi_app.py`
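As a sketch of the first extension point, a new provider might subclass the base class and implement a `search` method. The interface below is an assumption for illustration; check `dataset_builder/search/` for the real `SearchProvider` signature:

```python
# Hypothetical interface — the actual SearchProvider base class may differ.
class SearchProvider:
    def search(self, query: str, max_results: int) -> list:
        raise NotImplementedError

class StaticSearchProvider(SearchProvider):
    """Toy provider returning canned results, handy for testing."""
    def __init__(self, results):
        self._results = results

    def search(self, query, max_results):
        return self._results[:max_results]

provider = StaticSearchProvider([{"url": "https://example.com", "title": "Example"}])
hits = provider.search("anything", max_results=5)
```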
The project includes a comprehensive test suite that covers all components:
```bash
# Run all tests
pytest

# Run tests with coverage report
pytest --cov=dataset_builder

# Run specific test suite
pytest tests/test_search.py
```

- Search: Tests different search providers and result handling
- Scraping: Tests content extraction from different HTML structures
- Cleaning: Tests text normalization and de-duplication
- Dataset: Tests dataset creation, formatting, and uploading
- Pipeline: Integration tests for the complete pipeline
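A cleaning test might look like the following pytest-style sketch; `normalize_whitespace` is a stand-in, since the real `TextCleaner` API may differ:

```python
# Hypothetical test in the spirit of tests/test_cleaner.py.
def normalize_whitespace(text: str) -> str:
    # Stand-in for one TextCleaner behavior: collapse runs of whitespace.
    return " ".join(text.split())

def test_normalize_whitespace():
    assert normalize_whitespace("Hello\n\n   world") == "Hello world"

test_normalize_whitespace()
```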
This project is set up as a proper Python package with:
- setup.py: Standard Python package setup
- MANIFEST.in: List of non-Python files to include in distribution
- LICENSE: MIT License
- requirements.txt: All dependencies with version specifications
You can build the package for distribution with:
```bash
# Install the build tool
python -m pip install build

# Build the package
python -m build
```

The package will be available in the `dist/` directory.

- Streamlit UI for interactive dataset building
- Professional data preview with scholar-focused formatting
- Full article content viewing
- Support for datasets without 'train' split
- Multi-format dataset output (JSON, CSV, HF)
- FastAPI REST endpoint for programmatic access
- Language detection and filtering
- Content filtering by readability score
- Advanced duplicate detection
- Dataset merging capabilities
MIT License
Contributions welcome! Please feel free to submit pull requests.
1. Install the package:

   ```bash
   git clone https://github.com/yourusername/dataset-builder.git
   cd dataset-builder
   pip install -e .
   ```

2. Launch the UI:

   ```bash
   python -m ui.streamlit_app
   ```

3. Create your first dataset:
   - Enter a search query like "AI advances 2023"
   - Click "Build Dataset"
   - Explore the results in the preview

4. Check out your existing datasets:
   - Go to the "Browse Datasets" tab
   - Select any dataset to preview its contents
   - View full article content by checking the option

5. Load and use your dataset in code:

   ```python
   from datasets import load_from_disk

   # Use the actual timestamped directory name from your run
   dataset = load_from_disk('./output/ai_advances_2023_20250405_123456')
   print(dataset[0]['title'])
   ```
Happy dataset building!