Solr JavaBin Generator

A high-performance tool for generating and uploading JavaBin batch files to Apache Solr for vector search workloads.

Prerequisites

Maven 3.6+
Java 17+

Build

To build the project:

mvn clean package

This creates an executable JAR with all dependencies: target/javabin-generator-1.0-SNAPSHOT-jar-with-dependencies.jar

Usage

Example: 1M Wikipedia Vectors from NVIDIA

Generate JavaBin files from the 1M Wikipedia vector dataset (768 dimensions):

mvn clean package
wget https://data.rapids.ai/raft/datasets/wiki_all_1M/wiki_all_1M.tar
tar -xf wiki_all_1M.tar
java -jar target/javabin-generator-1.0-SNAPSHOT-jar-with-dependencies.jar data_file=base.1M.fbin output_dir=wiki_batches batch_size=10000 docs_count=1000000 threads=all legacy=true

Performance: For large datasets, running with multiple threads (threads=all, or a fixed count such as threads=4) can be significantly faster than the single-threaded default, roughly 80% faster.
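Before generating batches, it can help to confirm that docs_count and the vector dimension match the input file. The sketch below is not part of this tool. It assumes the common .fbin layout, which this README does not state: an 8-byte header holding two little-endian unsigned 32-bit integers (vector count, then dimension), followed by the float32 vectors. The function names are hypothetical, and the sketch does not apply to .fvecs files, which store the dimension in front of every vector.

```python
import struct


def read_fbin_header(path):
    """Return (num_vectors, dimensions) read from the assumed 8-byte .fbin header."""
    with open(path, "rb") as f:
        header = f.read(8)
    if len(header) != 8:
        raise ValueError(f"{path}: file too short for an .fbin header")
    # Assumed layout: two little-endian uint32 values, vector count then dimension.
    num_vectors, dims = struct.unpack("<II", header)
    return num_vectors, dims


def check_docs_count(path, docs_count):
    """Raise if docs_count exceeds the vectors in the file; return the dimension."""
    num_vectors, dims = read_fbin_header(path)
    if docs_count > num_vectors:
        raise ValueError(
            f"docs_count={docs_count} exceeds the {num_vectors} vectors in {path}"
        )
    return dims
```

For the example above, check_docs_count("base.1M.fbin", 1000000) should return 768 if the file follows this layout.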

Customization

Customize JavaBin generation by changing the command-line parameters below or by modifying the source code.

Command-line Parameters

data_file: Path to the .fbin/.fvecs input file
output_dir: Directory where JavaBin batch files will be created
batch_size: Number of documents per batch file (default: 1000)
docs_count: Total number of documents to process (default: 10000)
threads: Number of parallel threads for processing (default: 1; use threads=all to use all available processors)
overwrite: Delete existing files in output directory before processing (default: false)
legacy: When true, uses standard Solr JavaBin format; when false, writes compact float[] blocks for smaller files (default: true)
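To illustrate how the key=value parameters and their defaults fit together, here is a minimal sketch. It is not the tool's actual parser, and the function names are hypothetical. It makes two assumptions the parameter list does not confirm: that threads=all resolves to the machine's processor count, and that a final partial batch is written as a smaller file.

```python
import os

# Defaults copied from the parameter list above; values stay strings until used.
DEFAULTS = {
    "batch_size": "1000",
    "docs_count": "10000",
    "threads": "1",
    "overwrite": "false",
    "legacy": "true",
}


def parse_args(argv):
    """Parse key=value tokens into a dict, filling in the documented defaults."""
    params = dict(DEFAULTS)
    for token in argv:
        key, sep, value = token.partition("=")
        if not sep or not key:
            raise ValueError(f"expected key=value, got {token!r}")
        params[key] = value
    return params


def resolve_threads(value):
    """Assumption: threads=all means one thread per available processor."""
    if value == "all":
        return os.cpu_count() or 1
    return int(value)


def plan_batches(docs_count, batch_size):
    """Return (start, end) document ranges, end exclusive, one per batch file."""
    # Assumption: a trailing partial batch becomes a smaller final range.
    return [
        (start, min(start + batch_size, docs_count))
        for start in range(0, docs_count, batch_size)
    ]
```

Under this sketch, the Wikipedia example (docs_count=1000000, batch_size=10000) yields 100 ranges, and the defaults (docs_count=10000, batch_size=1000) yield 10.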

Performance Benchmark

The benchmark compares single-threaded and multi-threaded processing.

Prerequisite: download the 1M Wikipedia dataset first:

wget https://data.rapids.ai/raft/datasets/wiki_all_1M/wiki_all_1M.tar
tar -xf wiki_all_1M.tar

Run the benchmark:

mvn clean package
java -cp target/javabin-generator-1.0-SNAPSHOT-jar-with-dependencies.jar com.searchscale.benchmarks.PerformanceBenchmark file=base.1M.fbin total_docs=50000 batch_size=2500
