A complete Python pipeline that automates the creation of structured datasets from natural language search queries. This tool searches the web for content matching your query, scrapes and cleans the content, and outputs a structured dataset in multiple formats.
- Flexible Search Options: Use the DuckDuckGo search API or SerpAPI, with fallback to HTML scraping
- Robust Content Extraction: Multi-strategy approach using newspaper3k, readability-lxml, and custom BeautifulSoup extractors
- Advanced Text Cleaning: Remove boilerplate text, normalize formatting, and detect duplicates
- Multiple Output Formats: Save as JSON, CSV, or Hugging Face datasets
- HuggingFace Integration: Directly upload to Hugging Face Hub
- Intelligent Dataset Naming: Automatic naming based on search queries for easy identification
- Interactive Streamlit UI: Build and browse datasets through an intuitive web interface
- REST API Access: FastAPI endpoint for programmatic dataset creation
- Preview Functionality: View your dataset content with professionally formatted Markdown tables
- CLI Interface: Simple command-line interface for easy usage
- Extensible Design: Modular architecture for easy customization and extension
- Comprehensive Testing: Extensive test suite covering all components
Dataset Builder includes a Streamlit-based user interface for an intuitive visual experience:
- Dataset Creation: Build datasets from natural language queries with a simple form
- Dataset Browsing: View and explore all your existing datasets
- Rich Previews: See nicely formatted tables of your dataset content
- Full Article View: Read the complete content of any article in your datasets
- Configuration: Adjust all settings through the UI without editing config files
- HuggingFace Integration: Upload to HF Hub directly from the interface
```bash
# Start the Streamlit app
python -m ui.streamlit_app
```

Then open your browser to http://localhost:8501 to access the UI.
The Create Dataset tab allows you to input a search query and build a new dataset with customizable settings.
The Browse Datasets tab lets you explore existing datasets with rich preview functionality and full article content viewing.
Dataset Builder provides a FastAPI-based REST API for programmatic access:
- Dataset Creation: POST endpoint to build datasets from queries
- Configuration Access: GET endpoint to retrieve current settings
- Fully Configurable: All pipeline options configurable via API
- JSON Response: Structured responses with dataset paths and stats
- HuggingFace Integration: Upload datasets to HF Hub through the API
```bash
# Start the FastAPI server
python -m ui.fastapi_app
```

The API will be available at http://localhost:8000, with automatic documentation at http://localhost:8000/docs.
```bash
# Create a dataset with curl
curl -X POST "http://localhost:8000/dataset" \
  -H "Content-Type: application/json" \
  -d '{
    "query": "AI advances 2023",
    "max_results": 15,
    "save_format": "all",
    "upload_to_hf": false
  }'
```

Response:

```json
{
  "success": true,
  "message": "Dataset created successfully",
  "dataset_path": "./output/ai_advances_2023_20250405_123456",
  "hf_repo_url": null,
  "stats": {
    "query": "AI advances 2023",
    "engine": "duckduckgo",
    "max_results": 15
  }
}
```

```bash
# Retrieve current configuration
curl http://localhost:8000/config
```

Response:

```json
{
  "search": {
    "engine": "duckduckgo",
    "max_results": 20
  },
  "dataset": {
    "output_dir": "./output",
    "dataset_name": null,
    "save_format": "all",
    "upload_to_hf": false
  }
}
```

```python
import requests

# API endpoint
api_url = "http://localhost:8000"

# Create a dataset
response = requests.post(
    f"{api_url}/dataset",
    json={
        "query": "quantum computing breakthroughs",
        "max_results": 10,
        "save_format": "hf",
        "upload_to_hf": True,
        "hf_repo_id": "yourusername/quantum-computing-dataset"
    }
)

# Print the response
result = response.json()
print(f"Success: {result['success']}")
print(f"Dataset path: {result['dataset_path']}")
if result['hf_repo_url']:
    print(f"Hugging Face URL: {result['hf_repo_url']}")
```

```javascript
// Using Node.js with fetch
const fetch = require('node-fetch');

async function createDataset() {
  const response = await fetch('http://localhost:8000/dataset', {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
    },
    body: JSON.stringify({
      query: 'renewable energy research',
      max_results: 20,
      save_format: 'json',
    }),
  });
  const data = await response.json();
  console.log(`Dataset created: ${data.success}`);
  console.log(`Path: ${data.dataset_path}`);
}

createDataset();
```

For complete API documentation, visit the auto-generated Swagger UI at http://localhost:8000/docs after starting the server.
- Python 3.7+
- Required packages:
- Web & Data: requests, beautifulsoup4, newspaper3k, readability-lxml, pandas
- Datasets: datasets, huggingface-hub
- Search: duckduckgo-search, google-search-results (for SerpAPI)
- UI: streamlit, fastapi, uvicorn
- Utils: python-dotenv, tqdm, pyyaml
Full dependencies are listed in requirements.txt.
1. Clone the repository:

   ```bash
   git clone https://github.com/yourusername/dataset-builder.git
   cd dataset-builder
   ```

2. Install the package:

   ```bash
   pip install -e .
   ```

   Or with development dependencies:

   ```bash
   pip install -e ".[dev]"
   ```

Alternatively, install directly from GitHub:

```bash
pip install git+https://github.com/yourusername/dataset-builder.git
```

After installation, the package provides a command-line tool:
```bash
# Use the command-line tool directly
dataset-builder build "AI research papers"

# Get help
dataset-builder --help
```

The package requirements are automatically installed, but you may need to manually install:

```bash
# Core dependencies
pip install -r requirements.txt

# Development dependencies (optional)
pip install pytest pytest-cov black mypy isort
```
1. Copy the example environment file:

   ```bash
   cp .env.example .env
   ```

2. Edit `.env` to configure:
   - API keys for search providers
   - Search settings (max results, engine)
   - Scraping parameters
   - Text cleaning options
   - Dataset output preferences
   - Hugging Face Hub credentials (if uploading)
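For illustration, a filled-in `.env` might look like the fragment below; the variable names here are assumptions for the sketch, so check `.env.example` for the actual keys:

```ini
# Hypothetical values — see .env.example for the real variable names
SERPAPI_API_KEY=your-serpapi-key
SEARCH_ENGINE=duckduckgo
MAX_RESULTS=20
OUTPUT_DIR=./output
HF_TOKEN=hf_your_token_here
```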
All configuration options can also be adjusted through the Streamlit UI.
The easiest way to use Dataset Builder is through the Streamlit UI:
```bash
python -m ui.streamlit_app
```

The UI provides two main tabs:
- Create Dataset: Build new datasets from search queries
- Browse Datasets: Explore and preview existing datasets
To create a dataset:
- Enter your search query in the text area
- Adjust settings in the sidebar if needed
- Click "Build Dataset"
- Wait for the process to complete
- View the dataset preview and results
To browse existing datasets:
- Navigate to the "Browse Datasets" tab
- Select a dataset from the dropdown
- Adjust the number of rows to preview
- Optionally view the full content of any article
Build a dataset from a search query:

```bash
python -m dataset_builder.main build "climate change articles 2023"
```

The dataset name will be automatically generated from your search query (e.g., `climate_change_articles_2023_20250405_001907`).
Choose specific output formats with the `--format` parameter:

```bash
# Create only CSV format
python -m dataset_builder.main build "AI news articles" --format csv

# Create only JSON format
python -m dataset_builder.main build "AI news articles" --format json

# Create only Hugging Face dataset format
python -m dataset_builder.main build "AI news articles" --format hf

# Create all formats (default)
python -m dataset_builder.main build "AI news articles" --format all
```

```bash
# Custom output directory
python -m dataset_builder.main build "Python tutorials" --output-dir ./my_datasets

# Custom dataset name
python -m dataset_builder.main build "Python tutorials" --name python_advanced_tutorials

# Set maximum number of search results
python -m dataset_builder.main build "Python tutorials" --max-results 30

# Upload to Hugging Face Hub
python -m dataset_builder.main build "Python tutorials" --upload --repo-id yourusername/python-tutorials
```

Show pipeline information:

```bash
python -m dataset_builder.main info
```

After creating a dataset, you can easily preview its contents:
```bash
# Basic preview in terminal
python preview_dataset.py --path ./output/your_dataset_name

# Specify number of rows to show
python preview_dataset.py --path ./output/your_dataset_name --rows 10

# Save to Markdown file for better formatting
python preview_dataset.py --path ./output/your_dataset_name --output preview.md

# Add a custom title
python preview_dataset.py --path ./output/your_dataset_name --title "My Amazing Dataset" --output preview.md
```

Verify your dataset is properly structured and accessible:

```bash
python check_dataset.py
```

The script will check:
- Local dataset validity
- Remote dataset connection (if uploaded to HF Hub)
- Compatibility with both standard and non-standard dataset structures
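The "local dataset validity" check can be approximated with a few lines of standard-library Python. This is only a sketch, assuming the `save_to_disk` layout shown later in this README (`dataset_info.json` and `state.json` inside the dataset directory), not the actual logic of `check_dataset.py`:

```python
import os
import tempfile

def looks_like_hf_dataset(path: str) -> bool:
    """Rough local-validity check: the directory exists and contains
    the metadata files a saved-to-disk HF dataset is expected to have."""
    expected = {"dataset_info.json", "state.json"}
    return os.path.isdir(path) and expected.issubset(os.listdir(path))

# Demo against a throwaway directory
demo = tempfile.mkdtemp()
before = looks_like_hf_dataset(demo)   # metadata files missing -> False
for name in ("dataset_info.json", "state.json"):
    open(os.path.join(demo, name), "w").close()
after = looks_like_hf_dataset(demo)    # both files present -> True
```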
```python
# Load a Hugging Face format dataset from disk
from datasets import load_from_disk

# Load the dataset
dataset = load_from_disk('./output/your_dataset_name')

# View basic info
print(dataset)
# Example output: Dataset({features: ['url', 'title', 'content', 'author', 'publish_date', 'source'], num_rows: 18})

# See the first example
print(dataset[0])

# Access specific fields
for item in dataset.select(range(3)):  # First 3 items
    print(f"Title: {item['title']}")
    print(f"Source: {item['source']}")
    print(f"Content preview: {item['content'][:200]}...")
```

If you've uploaded your dataset to Hugging Face Hub:

```python
from datasets import load_dataset

# Load dataset from HF Hub
dataset = load_dataset("yourusername/dataset-repo-name")
```

The dataset builder creates structured datasets with the following fields:
- `url`: Source URL of the content
- `title`: Title of the article/content
- `content`: Main text content
- `author`: Author information (if available)
- `publish_date`: Publication date (if available)
- `source`: Website domain or explicit source name
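For reference, a single record with these fields might look like the following (every value here is invented for illustration):

```python
# Illustrative record only — the keys match the schema above,
# but all values are made up.
record = {
    "url": "https://example.com/article",
    "title": "Example Article",
    "content": "Main text of the article...",
    "author": "Jane Doe",
    "publish_date": "2023-04-15",
    "source": "example.com",
}
```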
- JSON (`.json`):
  - Complete raw data with all fields
  - Useful for processing with JavaScript or other JSON-compatible tools
  - Example: `./output/climate_change_articles_20250405_001907.json`
- CSV (`.csv`):
  - Tabular format with all main fields
  - Compatible with Excel, Google Sheets, pandas, etc.
  - Example: `./output/climate_change_articles_20250405_001907.csv`
- Hugging Face Dataset (directory):
  - Complete dataset in optimized Arrow format
  - Includes metadata and dataset card (README.md)
  - Compatible with the 🤗 Datasets library and HF Hub
  - Example: `./output/climate_change_articles_20250405_001907/`, containing:
    - `data-00000-of-00001.arrow`: Arrow-format data
    - `dataset_info.json`: Schema information
    - `README.md`: Dataset description
    - `state.json`: Metadata
Datasets are automatically named based on your search query:
- Search terms are converted to lowercase and spaces replaced with underscores
- Special characters are removed
- A timestamp is added to ensure uniqueness
- Example: Search for "Machine Learning Ethics" → `machine_learning_ethics_20250405_001907`
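The naming rules above can be sketched in a few lines; `make_dataset_name` is a hypothetical helper for illustration, not the pipeline's actual function:

```python
import re
from datetime import datetime

def make_dataset_name(query: str) -> str:
    """Hypothetical sketch of the naming scheme described above."""
    slug = query.lower()                        # lowercase
    slug = re.sub(r"[^a-z0-9\s]", "", slug)     # remove special characters
    slug = re.sub(r"\s+", "_", slug.strip())    # spaces -> underscores
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")  # uniqueness
    return f"{slug}_{stamp}"

name = make_dataset_name("Machine Learning Ethics")
print(name)  # e.g. machine_learning_ethics_20250405_001907
```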
The dataset builder includes powerful preview capabilities:
- Scholarly Source Attribution: Display professional source names instead of search engines
- Cleaner Content: Removes HTML, CSS fragments, and unnecessary formatting
- Standardized Date Format: Consistent date presentation across all sources
- Formatted Word Counts: Easy-to-read numbers with thousands separators
- Truncated Text: Prevents table breaking with sensible column width limits
- Full Content View: Read complete articles in a scrollable area
The preview_dataset.py script applies several transformations to make your data more presentable:
- Title Cleaning: Removes trailing website names and truncates long titles
- Author Formatting: Strips HTML/CSS fragments and limits to first few authors with "et al." notation
- Source Enhancement: Replaces generic sources like "duckduckgo" with scholarly publication names
- Date Standardization: Converts dates to a consistent "Apr 15, 2023" format
- Word Count Formatting: Adds thousands separators (e.g., "1,234" instead of "1234")
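Two of these transformations are simple enough to sketch; the helpers below are illustrative stand-ins, not the script's actual functions, and the 60-character width limit is an assumption:

```python
def format_word_count(n: int) -> str:
    # Thousands separators: 1234 -> "1,234"
    return f"{n:,}"

def truncate(text: str, width: int = 60) -> str:
    # Keep table columns from breaking by capping their width
    return text if len(text) <= width else text[: width - 1] + "…"

count = format_word_count(1234)
short = truncate("A very long article title that would otherwise break the preview table layout")
```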
```
dataset-builder/
├── dataset_builder/          # Core package
│   ├── config.py             # Configuration management
│   ├── pipeline.py           # Main dataset pipeline
│   ├── search/               # Search provider implementations
│   ├── scrapers/             # Content extraction modules
│   ├── cleaners/             # Text cleaning utilities
│   ├── dataset/              # Dataset creation and formatting
│   ├── utils/                # Helper utilities
│   └── main.py               # CLI interface
├── ui/
│   ├── streamlit_app.py      # Streamlit web interface
│   └── fastapi_app.py        # REST API interface
├── tests/                    # Comprehensive test suite
│   ├── test_search.py        # Tests for search providers
│   ├── test_scraper.py       # Tests for content extraction
│   ├── test_cleaner.py       # Tests for text cleaning
│   ├── test_dataset.py       # Tests for dataset creation
│   └── test_pipeline.py      # Integration tests for the pipeline
├── preview_dataset.py        # Dataset preview functionality
├── check_dataset.py          # Dataset validation tool
├── requirements.txt          # Dependencies
├── .env.example              # Example environment config
├── setup.py                  # Python package installation
├── MANIFEST.in               # Package manifest file
├── LICENSE                   # MIT License
├── .gitignore                # Git ignore file
└── README.md                 # This documentation
```
The modular design allows for customization:
- New Search Providers: Extend the `SearchProvider` class in `dataset_builder/search/`
- Custom Extraction Logic: Modify or add new extractors in `dataset_builder/scrapers/`
- Advanced Cleaning: Enhance the `TextCleaner` class in `dataset_builder/cleaners/`
- New Output Formats: Extend the `DatasetBuilder` in `dataset_builder/dataset/`
- UI Components: Add new Streamlit features to `ui/streamlit_app.py`
- API Endpoints: Extend the FastAPI application in `ui/fastapi_app.py`
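As a sketch of the first extension point, a new provider might subclass the base class and implement a `search` method. The interface below is an assumption for illustration; check `dataset_builder/search/` for the real `SearchProvider` signature:

```python
# Hypothetical interface — the actual SearchProvider base class may differ.
class SearchProvider:
    def search(self, query: str, max_results: int) -> list:
        raise NotImplementedError

class StaticSearchProvider(SearchProvider):
    """Toy provider returning canned results, handy for testing."""
    def __init__(self, results):
        self._results = results

    def search(self, query, max_results):
        return self._results[:max_results]

provider = StaticSearchProvider([{"url": "https://example.com", "title": "Example"}])
hits = provider.search("anything", max_results=5)
```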
The project includes a comprehensive test suite that covers all components:
```bash
# Run all tests
pytest

# Run tests with coverage report
pytest --cov=dataset_builder

# Run specific test suite
pytest tests/test_search.py
```

- Search: Tests different search providers and result handling
- Scraping: Tests content extraction from different HTML structures
- Cleaning: Tests text normalization and de-duplication
- Dataset: Tests dataset creation, formatting, and uploading
- Pipeline: Integration tests for the complete pipeline
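A cleaning test might look like the following pytest-style sketch; `normalize_whitespace` is a stand-in, since the real `TextCleaner` API may differ:

```python
# Hypothetical test in the spirit of tests/test_cleaner.py.
def normalize_whitespace(text: str) -> str:
    # Stand-in for one TextCleaner behavior: collapse runs of whitespace.
    return " ".join(text.split())

def test_normalize_whitespace():
    assert normalize_whitespace("Hello\n\n   world") == "Hello world"

test_normalize_whitespace()
```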
This project is set up as a proper Python package with:
- setup.py: Standard Python package setup
- MANIFEST.in: List of non-Python files to include in distribution
- LICENSE: MIT License
- requirements.txt: All dependencies with version specifications
You can build the package for distribution with:
```bash
# Install the build tool
python -m pip install build

# Build the package
python -m build
```

The package will be available in the `dist/` directory.

- Streamlit UI for interactive dataset building
- Professional data preview with scholar-focused formatting
- Full article content viewing
- Support for datasets without 'train' split
- Multi-format dataset output (JSON, CSV, HF)
- FastAPI REST endpoint for programmatic access
- Language detection and filtering
- Content filtering by readability score
- Advanced duplicate detection
- Dataset merging capabilities
MIT License
Contributions welcome! Please feel free to submit pull requests.
1. Install the package:

   ```bash
   git clone https://github.com/yourusername/dataset-builder.git
   cd dataset-builder
   pip install -e .
   ```

2. Launch the UI:

   ```bash
   python -m ui.streamlit_app
   ```

3. Create your first dataset:
   - Enter a search query like "AI advances 2023"
   - Click "Build Dataset"
   - Explore the results in the preview

4. Check out your existing datasets:
   - Go to the "Browse Datasets" tab
   - Select any dataset to preview its contents
   - View full article content by checking the option

5. Load and use your dataset in code:

   ```python
   from datasets import load_from_disk

   # Use the actual timestamped directory name from your run
   dataset = load_from_disk('./output/ai_advances_2023_20250405_123456')
   print(dataset[0]['title'])
   ```
Happy dataset building!