solr-datagen

Generate and index realistic documents into Apache Solr at scale.

Given a Solr URL (with collection/core name) and a target document count, solr-datagen introspects the schema, generates documents with realistic data across all field types, and indexes them in parallel batches. Works with Solr 7.x through 10.x.

Features

Schema-aware: automatically discovers fields, types, the unique key, and multiValued settings
Type-diverse generation: covers strings, text, integers, longs, floats, doubles, dates, and booleans
Solr 7–10 compatible: handles both Trie (deprecated since 7.0 but still present through 10.x) and Point (the default numeric and date types since 7.0) field type classes transparently
Scales to millions: threaded batch submission with backpressure, plus commitWithin so commits are deferred and grouped rather than issued per batch
Reproducible: optional --seed for deterministic output
Resilient: exponential-backoff retries on batch failures, graceful Ctrl+C handling
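
The batching, backpressure, and retry features listed above can be pictured with the following minimal sketch. It is not the project's indexer.py: the names batched, submit_with_retry, and index_all, the limit of 2 * workers in-flight batches, and the retry defaults are all assumptions made for illustration. The send callable stands in for whatever actually posts a batch to Solr and is expected to return the number of documents it indexed.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def batched(docs, batch_size):
    """Yield lists of at most batch_size documents from any iterable."""
    batch = []
    for doc in docs:
        batch.append(doc)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch


def submit_with_retry(send, batch, max_attempts=4, base_delay=0.5, sleep=time.sleep):
    """Call send(batch); on failure wait base_delay * 2**attempt, then retry."""
    for attempt in range(max_attempts):
        try:
            return send(batch)
        except Exception:
            if attempt == max_attempts - 1:
                raise  # retries exhausted; let the caller see the error
            sleep(base_delay * (2 ** attempt))


def index_all(docs, send, batch_size=500, workers=4, sleep=time.sleep):
    """Submit batches on a thread pool, keeping at most 2 * workers in flight."""
    slots = threading.BoundedSemaphore(workers * 2)  # hypothetical backpressure limit
    futures = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for batch in batched(docs, batch_size):
            slots.acquire()  # blocks the producer while the pool is saturated
            future = pool.submit(submit_with_retry, send, batch, sleep=sleep)
            future.add_done_callback(lambda _f: slots.release())
            futures.append(future)
    # Leaving the with-block waits for all workers; result() re-raises failures.
    return sum(f.result() for f in futures)
```

Because the producer blocks on the semaphore, a lazy generator of documents is consumed only as fast as Solr accepts batches, which keeps memory use bounded for large counts.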

Requirements

Python 3.9+
A running Apache Solr instance with at least one collection/core

Installation

git clone https://github.com/rahulgoswami/solr-datagen.git
cd solr-datagen
pip install -r requirements.txt

Usage

python -m solr_datagen <solr_url> <count> [options]

Examples

# Dry run — inspect schema without indexing
python -m solr_datagen http://localhost:8983/solr/my_collection 0 --dry-run

# Index 1,000 documents with defaults
python -m solr_datagen http://localhost:8983/solr/my_collection 1000

# Index 1M documents with tuned settings
python -m solr_datagen http://localhost:8983/solr/my_collection 1000000 \
  --batch-size 1000 --workers 8

# With basic auth (Solr 9.x)
python -m solr_datagen http://localhost:8983/solr/my_collection 5000 \
  --auth admin:secret

# Reproducible run
python -m solr_datagen http://localhost:8983/solr/my_collection 500 --seed 42
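
A reproducible run depends on every random choice flowing from a single seeded generator. The sketch below shows one way that could work, combined with pre-computed value pools. It is an illustration only: it uses the standard library's random module in place of Faker, and build_pools, generate_docs, the category names, the pool size, and the "id" unique key are assumptions rather than the tool's actual internals.

```python
import random
from datetime import datetime, timedelta, timezone


def build_pools(rng, pool_size=1000):
    """Pre-compute reusable values once so per-document work is a cheap lookup."""
    words = ["solr", "index", "shard", "replica", "query", "facet", "commit", "core"]
    epoch = datetime(2020, 1, 1, tzinfo=timezone.utc)
    return {
        "string": [f"{rng.choice(words)}-{rng.randrange(10_000)}" for _ in range(pool_size)],
        "text": [" ".join(rng.choices(words, k=12)) for _ in range(pool_size)],
        "date": [
            # Solr date fields expect ISO-8601 UTC timestamps ending in "Z".
            (epoch + timedelta(seconds=rng.randrange(10**8))).strftime("%Y-%m-%dT%H:%M:%SZ")
            for _ in range(pool_size)
        ],
    }


def generate_docs(count, fields, seed=None):
    """Yield count documents; fields maps field name -> type category."""
    rng = random.Random(seed)  # private RNG: the same seed gives the same stream
    pools = build_pools(rng)
    makers = {
        "int": lambda: rng.randint(-2**31, 2**31 - 1),
        "long": lambda: rng.randint(-2**63, 2**63 - 1),
        "float": lambda: round(rng.uniform(-1e6, 1e6), 3),
        "double": lambda: rng.uniform(-1e12, 1e12),
        "boolean": lambda: rng.random() < 0.5,
    }
    for i in range(count):
        doc = {"id": f"doc-{i}"}  # assumes the unique key field is named "id"
        for name, category in fields.items():
            if category in pools:
                doc[name] = rng.choice(pools[category])
            else:
                doc[name] = makers[category]()
        yield doc
```

Using a private random.Random instance rather than the module-level functions keeps the output stable even if other code in the process also draws random numbers.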

Options

Flag	Default	Description
`solr_url`	required	Solr collection URL, e.g. `http://localhost:8983/solr/my_core`
`count`	required	Number of documents to generate
`-b`, `--batch-size`	500	Documents per HTTP request
`-c`, `--commit-within`	5000	`commitWithin` in milliseconds
`-f`, `--max-fields`	20	Max fields to select from schema
`--fields-per-type`	3	Max fields per type category
`-w`, `--workers`	4	Parallel submission threads
`-a`, `--auth`	None	Basic auth as `user:password`
`-s`, `--seed`	None	Random seed for reproducibility
`--dry-run`	false	Analyse schema only, don't index
`-v`, `--verbose`	false	Enable debug logging
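
For readers who want to see how these options fit together, here is a sketch of an argparse parser that mirrors the table, plus a helper that turns the parsed options into one JSON update request. It is reconstructed from the table only and is not the project's cli.py or solr_client.py: build_parser and build_update_request are hypothetical names, and the argument types (for example an integer seed) are assumptions. The request targets Solr's standard /update handler with the commitWithin parameter and is built but never sent.

```python
import argparse
import base64
import json
import urllib.parse
import urllib.request


def build_parser():
    """Parser whose flags and defaults follow the options table."""
    parser = argparse.ArgumentParser(prog="solr_datagen")
    parser.add_argument("solr_url", help="Solr collection URL")
    parser.add_argument("count", type=int, help="number of documents to generate")
    parser.add_argument("-b", "--batch-size", type=int, default=500)
    parser.add_argument("-c", "--commit-within", type=int, default=5000, help="milliseconds")
    parser.add_argument("-f", "--max-fields", type=int, default=20)
    parser.add_argument("--fields-per-type", type=int, default=3)
    parser.add_argument("-w", "--workers", type=int, default=4)
    parser.add_argument("-a", "--auth", default=None, metavar="USER:PASSWORD")
    parser.add_argument("-s", "--seed", type=int, default=None)
    parser.add_argument("--dry-run", action="store_true")
    parser.add_argument("-v", "--verbose", action="store_true")
    return parser


def build_update_request(args, docs):
    """Build (but do not send) one JSON update request for a batch of documents."""
    query = urllib.parse.urlencode({"commitWithin": args.commit_within})
    url = f"{args.solr_url.rstrip('/')}/update?{query}"
    headers = {"Content-Type": "application/json"}
    if args.auth:
        # HTTP Basic auth: base64 of "user:password".
        token = base64.b64encode(args.auth.encode("utf-8")).decode("ascii")
        headers["Authorization"] = f"Basic {token}"
    body = json.dumps(docs).encode("utf-8")
    return urllib.request.Request(url, data=body, headers=headers, method="POST")
```

Passing the resulting request to urllib.request.urlopen would submit the batch; the sketch stops short of that so it can be read and tested without a running Solr instance.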

How It Works

Connect — validates the Solr URL, detects version and mode (standalone/SolrCloud)
Introspect — fetches fields and field types from the Schema API, skips internal and non-stored fields
Select — picks a diverse subset of fields (up to --max-fields), ensuring representation across type categories
Generate — creates documents using pre-computed data pools (via Faker) for high throughput
Index — submits documents in parallel batches with backpressure, retries, and progress reporting
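
The Introspect and Select steps above can be sketched as follows. This is an illustrative reading of the description, not the code in schema_analyzer.py: the function names, the rule that names starting with an underscore are internal, and the round-robin selection strategy are assumptions. The input shapes follow the Schema API's /schema/fields and /schema/fieldtypes responses, and the class names are Solr's standard field type classes.

```python
# Trie and Point classes map to the same category, so callers never
# need to know which generation of numeric types a schema uses.
CATEGORY_BY_CLASS = {
    "StrField": "string", "TextField": "text", "BoolField": "boolean",
    "TrieIntField": "int", "IntPointField": "int",
    "TrieLongField": "long", "LongPointField": "long",
    "TrieFloatField": "float", "FloatPointField": "float",
    "TrieDoubleField": "double", "DoublePointField": "double",
    "TrieDateField": "date", "DatePointField": "date",
}


def categorize(class_name):
    """Map 'solr.IntPointField' or a fully qualified class name to a category."""
    short = class_name.rsplit(".", 1)[-1]
    return CATEGORY_BY_CLASS.get(short)


def usable_fields(fields, field_types, unique_key):
    """Return (name, category, multi_valued) for stored, non-internal fields."""
    class_by_type = {ft["name"]: ft["class"] for ft in field_types}
    result = []
    for field in fields:
        name = field["name"]
        if name == unique_key or name.startswith("_"):
            continue  # assumed convention: _version_, _root_ etc. are internal
        # Assumes fields were fetched with showDefaults=true so "stored" is explicit.
        if not field.get("stored", True):
            continue
        category = categorize(class_by_type.get(field["type"], ""))
        if category:
            result.append((name, category, field.get("multiValued", False)))
    return result


def select_fields(candidates, max_fields=20, fields_per_type=3):
    """Pick fields round-robin across categories so every category is represented."""
    by_category = {}
    for candidate in candidates:
        by_category.setdefault(candidate[1], []).append(candidate)
    picked = []
    for round_index in range(fields_per_type):
        for group in by_category.values():
            if round_index < len(group) and len(picked) < max_fields:
                picked.append(group[round_index])
    return picked
```

Taking one field per category in each round means a schema dominated by one type, such as dozens of string fields, cannot crowd out the others before the --max-fields limit is reached.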

Project Structure

solr_datagen/
├── __init__.py
├── __main__.py          # python -m solr_datagen entry point
├── cli.py               # argument parsing and orchestration
├── config.py            # constants and field-type mappings
├── solr_client.py       # Solr HTTP client
├── schema_analyzer.py   # schema introspection and field selection
├── data_generator.py    # per-type random data generation
├── indexer.py           # batch submission pipeline
└── progress.py          # progress tracking and reporting

License

MIT
