A high-performance tool for generating and uploading JavaBin batch files to Apache Solr for vector search workloads.
- Maven 3.6+
- Java 17+
To build the project:
mvn clean package
This creates an executable JAR with all dependencies: target/javabin-generator-1.0-SNAPSHOT-jar-with-dependencies.jar
Generate JavaBin files from the 1M Wikipedia vector dataset (768 dimensions):
mvn clean package
wget https://data.rapids.ai/raft/datasets/wiki_all_1M/wiki_all_1M.tar
tar -xf wiki_all_1M.tar
java -jar target/javabin-generator-1.0-SNAPSHOT-jar-with-dependencies.jar data_file=base.1M.fbin output_dir=wiki_batches batch_size=10000 docs_count=1000000 threads=all legacy=true
Performance: Using threads=all or threads=4 can improve performance significantly (~80% faster) for large datasets.
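Before picking docs_count, it can help to check how many vectors the input file actually holds. The sketch below is not part of this tool. It assumes the common .fbin layout (a little-endian int32 vector count, a little-endian int32 dimension, then the raw float32 values); the class name FbinHeader is hypothetical.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Hypothetical helper, not part of javabin-generator. */
final class FbinHeader {
    final int vectorCount;
    final int dimensions;

    FbinHeader(int vectorCount, int dimensions) {
        this.vectorCount = vectorCount;
        this.dimensions = dimensions;
    }

    /** Reads the first 8 bytes, assuming two little-endian int32 values: count, then dimension. */
    static FbinHeader read(Path file) throws IOException {
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            ByteBuffer buffer = ByteBuffer.allocate(8).order(ByteOrder.LITTLE_ENDIAN);
            while (buffer.hasRemaining()) {
                if (channel.read(buffer) < 0) {
                    throw new IOException("File too short for an .fbin header: " + file);
                }
            }
            buffer.flip();
            return new FbinHeader(buffer.getInt(), buffer.getInt());
        }
    }

    /** File size implied by the header: 8 header bytes plus 4 bytes per float. */
    long expectedFileSize() {
        return 8L + (long) vectorCount * dimensions * Float.BYTES;
    }

    public static void main(String[] args) throws IOException {
        Path file = Path.of(args.length > 0 ? args[0] : "base.1M.fbin");
        FbinHeader header = read(file);
        System.out.println("vectors=" + header.vectorCount
                + " dimensions=" + header.dimensions
                + " expectedBytes=" + header.expectedFileSize());
    }
}
```

If the reported size does not match the size of the file on disk, the layout assumption above probably does not hold for that file.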
You can customize the JavaBin generation by modifying the command-line parameters or the source code:
- data_file: Path to the .fbin/.fvecs input file
- output_dir: Directory where JavaBin batch files will be created
- batch_size: Number of documents per batch file (default: 1000)
- docs_count: Total number of documents to process (default: 10000)
- threads: Number of parallel threads for processing (default: 1; use threads=all for all available processors)
- overwrite: Delete existing files in output directory before processing (default: false)
- legacy: When true, uses standard Solr JavaBin format; when false, writes compact float[] blocks for smaller files (default: true)
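For readers who want to change the parameter handling in the source, the key=value convention above can be sketched roughly as follows. This is an illustration, not the tool's actual parser: the class KeyValueArgs and its methods are hypothetical, and only the parameter names and defaults are taken from the list above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch of key=value argument handling; not the tool's real parser. */
final class KeyValueArgs {

    /** Parses arguments of the form key=value on top of the documented defaults. */
    static Map<String, String> parse(String[] args) {
        Map<String, String> params = new LinkedHashMap<>();
        // Defaults as documented; data_file and output_dir have no documented default.
        params.put("batch_size", "1000");
        params.put("docs_count", "10000");
        params.put("threads", "1");
        params.put("overwrite", "false");
        params.put("legacy", "true");

        for (String arg : args) {
            int eq = arg.indexOf('=');
            if (eq <= 0) {
                throw new IllegalArgumentException("Expected key=value, got: " + arg);
            }
            params.put(arg.substring(0, eq), arg.substring(eq + 1));
        }
        return params;
    }

    /** Maps threads=all to the number of available processors, otherwise parses the number. */
    static int resolveThreads(String value) {
        return "all".equals(value)
                ? Runtime.getRuntime().availableProcessors()
                : Integer.parseInt(value);
    }

    public static void main(String[] args) {
        Map<String, String> params = parse(args);
        System.out.println(params);
        System.out.println("resolved threads: " + resolveThreads(params.get("threads")));
    }
}
```

Seeding the map with defaults first keeps the override logic to a single loop, so any parameter not given on the command line falls back to its documented value.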
To run a performance benchmark comparing single-threaded vs multi-threaded processing:
Prerequisites: Download the 1M Wikipedia dataset first:
wget https://data.rapids.ai/raft/datasets/wiki_all_1M/wiki_all_1M.tar
tar -xf wiki_all_1M.tar
Run the benchmark:
mvn clean package
java -cp target/javabin-generator-1.0-SNAPSHOT-jar-with-dependencies.jar com.searchscale.benchmarks.PerformanceBenchmark file=base.1M.fbin total_docs=50000 batch_size=2500
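To give a feel for what a single-threaded vs multi-threaded comparison measures, here is a minimal timing sketch. It is not the PerformanceBenchmark class: BatchTimingSketch and its methods are hypothetical, and the per-batch work is a stand-in checksum rather than real JavaBin encoding. With total_docs=50000 and batch_size=2500 as in the command above, the documents split into 20 batches.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical sketch; the real benchmark's internals may differ. */
final class BatchTimingSketch {

    /** Number of batches needed to cover totalDocs (ceiling division). */
    static int batchCount(int totalDocs, int batchSize) {
        return (totalDocs + batchSize - 1) / batchSize;
    }

    /** Stand-in workload: sums the document ids in one batch. */
    static long processBatch(int firstDoc, int size) {
        long checksum = 0;
        for (int i = 0; i < size; i++) {
            checksum += firstDoc + i;
        }
        return checksum;
    }

    /** Processes all batches on a fixed-size pool and returns the combined checksum. */
    static long runAll(int totalDocs, int batchSize, int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Long>> futures = new ArrayList<>();
            for (int b = 0; b < batchCount(totalDocs, batchSize); b++) {
                final int firstDoc = b * batchSize;
                final int size = Math.min(batchSize, totalDocs - firstDoc);
                Callable<Long> task = () -> processBatch(firstDoc, size);
                futures.add(pool.submit(task));
            }
            long total = 0;
            for (Future<Long> future : futures) {
                total += future.get(); // waits for each batch to finish
            }
            return total;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        int totalDocs = 50_000;
        int batchSize = 2_500;
        int allThreads = Runtime.getRuntime().availableProcessors();
        for (int threads : new int[] {1, allThreads}) {
            long start = System.nanoTime();
            long checksum = runAll(totalDocs, batchSize, threads);
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;
            System.out.println("threads=" + threads + " checksum=" + checksum
                    + " elapsedMs=" + elapsedMs);
        }
    }
}
```

The checksum is the same for any thread count, which gives a cheap sanity check that parallel runs processed every document exactly once. Timings from a toy workload like this say little about real encoding throughput.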