Generate and index realistic documents into Apache Solr at scale.
Given a Solr URL (with collection/core name) and a target document count, solr-datagen introspects the schema, generates documents with realistic data across all field types, and indexes them in parallel batches. Works with Solr 7.x through 10.x.
- Schema-aware — automatically discovers fields, types, unique key, and multiValued settings
- Type-diverse generation — covers strings, text, integers, longs, floats, doubles, dates, and booleans
- Solr 7–10 compatible — handles both Trie (7.x; deprecated but still present through 10.x) and Point (8.x+) field type classes transparently
- Scales to millions — threaded batch submission with backpressure, `commitWithin` for optimal throughput
- Reproducible — optional `--seed` for deterministic output
- Resilient — exponential-backoff retries on batch failures, graceful Ctrl+C handling
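The Trie/Point compatibility above boils down to mapping both generations of Solr field type classes onto a single generator category. A minimal sketch of the idea — the class names are standard Solr ones, but the category names and the `category_for` helper are illustrative, not the actual contents of `config.py`:

```python
from typing import Optional

# Map both Trie (Solr 7.x) and Point (8.x+) field type classes onto one
# generator category, so the rest of the pipeline never needs to know
# which generation of types the schema uses.
TYPE_CATEGORIES = {
    "solr.TrieIntField": "int",       "solr.IntPointField": "int",
    "solr.TrieLongField": "long",     "solr.LongPointField": "long",
    "solr.TrieFloatField": "float",   "solr.FloatPointField": "float",
    "solr.TrieDoubleField": "double", "solr.DoublePointField": "double",
    "solr.TrieDateField": "date",     "solr.DatePointField": "date",
    "solr.StrField": "string",
    "solr.TextField": "text",
    "solr.BoolField": "boolean",
}

def category_for(class_name: str) -> Optional[str]:
    """Return the generator category for a Solr field type class, if known."""
    return TYPE_CATEGORIES.get(class_name)
```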
- Python 3.9+
- A running Apache Solr instance with at least one collection/core
```shell
git clone https://github.com/rahulgoswami/solr-datagen.git
cd solr-datagen
pip install -r requirements.txt
```

```shell
python -m solr_datagen <solr_url> <count> [options]
```

```shell
# Dry run — inspect schema without indexing
python -m solr_datagen http://localhost:8983/solr/my_collection 0 --dry-run

# Index 1,000 documents with defaults
python -m solr_datagen http://localhost:8983/solr/my_collection 1000

# Index 1M documents with tuned settings
python -m solr_datagen http://localhost:8983/solr/my_collection 1000000 \
  --batch-size 1000 --workers 8

# With basic auth (Solr 9.x)
python -m solr_datagen http://localhost:8983/solr/my_collection 5000 \
  --auth admin:secret

# Reproducible run
python -m solr_datagen http://localhost:8983/solr/my_collection 500 --seed 42
```

| Flag | Default | Description |
|---|---|---|
| `solr_url` | required | Solr collection URL, e.g. `http://localhost:8983/solr/my_core` |
| `count` | required | Number of documents to generate |
| `-b, --batch-size` | 500 | Documents per HTTP request |
| `-c, --commit-within` | 5000 | `commitWithin` in milliseconds |
| `-f, --max-fields` | 20 | Max fields to select from schema |
| `--fields-per-type` | 3 | Max fields per type category |
| `-w, --workers` | 4 | Parallel submission threads |
| `-a, --auth` | None | Basic auth as `user:password` |
| `-s, --seed` | None | Random seed for reproducibility |
| `--dry-run` | false | Analyse schema only, don't index |
| `-v, --verbose` | false | Enable debug logging |
- Connect — validates the Solr URL, detects version and mode (standalone/SolrCloud)
- Introspect — fetches fields and field types from the Schema API, skips internal and non-stored fields
- Select — picks a diverse subset of fields (up to `--max-fields`), ensuring representation across type categories
- Generate — creates documents using pre-computed data pools (via Faker) for high throughput
- Index — submits documents in parallel batches with backpressure, retries, and progress reporting
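The Index step can be sketched without a live Solr: chunk documents into batches, then submit each batch with exponential-backoff retries. In the sketch below, `send` is a stand-in for the real HTTP POST to the update handler, and the function names are illustrative rather than `indexer.py`'s actual API:

```python
import itertools
import time
from typing import Callable, Iterable, Iterator, List

def chunked(docs: Iterable[dict], size: int) -> Iterator[List[dict]]:
    """Yield successive batches of at most `size` documents."""
    it = iter(docs)
    while batch := list(itertools.islice(it, size)):
        yield batch

def submit_with_retries(send: Callable[[List[dict]], None],
                        batch: List[dict],
                        retries: int = 3,
                        base_delay: float = 0.01) -> None:
    """Call send(batch), retrying with exponential backoff on failure."""
    for attempt in range(retries + 1):
        try:
            send(batch)
            return
        except Exception:
            if attempt == retries:
                raise
            time.sleep(base_delay * (2 ** attempt))  # 1x, 2x, 4x, ...

# Demo with a flaky stand-in sender that fails twice, then succeeds.
calls = {"n": 0}
def flaky_send(batch):
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("simulated transient failure")

docs = [{"id": str(i)} for i in range(7)]
batches = list(chunked(docs, 3))          # batch sizes: 3, 3, 1
submit_with_retries(flaky_send, batches[0])
```

In the real pipeline these submissions run on worker threads, with a bounded queue providing the backpressure mentioned above.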
```
solr_datagen/
├── __init__.py
├── __main__.py          # python -m solr_datagen entry point
├── cli.py               # argument parsing and orchestration
├── config.py            # constants and field-type mappings
├── solr_client.py       # Solr HTTP client
├── schema_analyzer.py   # schema introspection and field selection
├── data_generator.py    # per-type random data generation
├── indexer.py           # batch submission pipeline
└── progress.py          # progress tracking and reporting
```
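The `--seed` flag works because all randomness flows through a single seeded source. A simplified sketch in the spirit of `data_generator.py` — the real module draws from Faker-backed pools, whereas this version uses only the standard library, and the category names are assumptions:

```python
import random
from datetime import datetime, timedelta, timezone

class DataGenerator:
    """Generate one value per supported Solr type category."""

    def __init__(self, seed=None):
        # One seeded Random instance makes every run reproducible.
        self.rng = random.Random(seed)

    def value(self, category: str):
        r = self.rng
        if category == "int":
            return r.randint(-2**31, 2**31 - 1)
        if category == "long":
            return r.randint(-2**63, 2**63 - 1)
        if category in ("float", "double"):
            return r.uniform(-1e6, 1e6)
        if category == "boolean":
            return r.random() < 0.5
        if category == "date":
            start = datetime(2020, 1, 1, tzinfo=timezone.utc)
            dt = start + timedelta(seconds=r.randint(0, 5 * 365 * 86400))
            return dt.strftime("%Y-%m-%dT%H:%M:%SZ")  # Solr date format
        return "word-" + str(r.randint(0, 9999))      # string/text fallback

# Two generators with the same seed produce identical streams.
a = DataGenerator(seed=42)
b = DataGenerator(seed=42)
```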
MIT