SearchScale/solr-javabin-generator

Solr JavaBin Generator

A high-performance tool for generating and uploading JavaBin batch files to Apache Solr for vector search workloads.

Prerequisites

  • Maven 3.6+
  • Java 17+

Build

To build the project:

mvn clean package

This creates an executable JAR with all dependencies: target/javabin-generator-1.0-SNAPSHOT-jar-with-dependencies.jar

Usage

Example: 1M Wikipedia Vectors from NVIDIA

Generate JavaBin files from the 1M Wikipedia vector dataset (768 dimensions):

mvn clean package
wget https://data.rapids.ai/raft/datasets/wiki_all_1M/wiki_all_1M.tar
tar -xf wiki_all_1M.tar
java -jar target/javabin-generator-1.0-SNAPSHOT-jar-with-dependencies.jar data_file=base.1M.fbin output_dir=wiki_batches batch_size=10000 docs_count=1000000 threads=all legacy=true

Performance: For large datasets, setting threads=all (or an explicit count such as threads=4) can speed up generation significantly (about 80% faster).
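
Before generating, it can help to confirm the vector count and dimensionality of the input file. The following is a minimal sketch, assuming base.1M.fbin follows the common .fbin layout (two little-endian 32-bit unsigned integers, vector count then dimension, followed by float32 data) and that the host is little-endian. fbin_header is a hypothetical helper, not part of this tool:

```shell
# fbin_header FILE: print "<num_vectors> <dimensions>" from the first 8 bytes.
# Assumption: common .fbin layout (uint32 count, uint32 dim, then float32 rows).
# od decodes in host byte order, so this is only correct on a little-endian host.
fbin_header() {
  od -An -t u4 -N 8 "$1" | awk 'NR == 1 { print $1, $2 }'
}

# Only runs once the dataset has been downloaded and extracted.
if [ -f base.1M.fbin ]; then
  fbin_header base.1M.fbin
fi
```

If the layout assumption holds, the second number should be 768, matching the dimensionality stated above, and the first should be at least as large as the docs_count you pass.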

Customization

Customize JavaBin generation through the command-line parameters below or by modifying the source code.

Command-line Parameters

  • data_file: Path to the .fbin/.fvecs input file
  • output_dir: Directory where JavaBin batch files will be created
  • batch_size: Number of documents per batch file (default: 1000)
  • docs_count: Total number of documents to process (default: 10000)
  • threads: Number of parallel processing threads (default: 1; use threads=all for all available processors)
  • overwrite: Delete existing files in output directory before processing (default: false)
  • legacy: When true, uses the standard Solr JavaBin format; when false, writes compact float[] blocks for smaller files (default: true)
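
As a rough guide to how docs_count and batch_size interact: assuming each batch file holds up to batch_size documents, the number of files written is docs_count divided by batch_size, rounded up. The sketch below illustrates the arithmetic. batch_count is a hypothetical helper, and the rounding behavior is an assumption that has not been confirmed against the source code:

```shell
# batch_count DOCS_COUNT BATCH_SIZE: ceiling division using shell integer arithmetic.
# Assumption: a final partial batch, if any, is written as its own file.
batch_count() {
  echo $(( ($1 + $2 - 1) / $2 ))
}

batch_count 1000000 10000   # the Wikipedia example above: prints 100
batch_count 10000 1000      # the documented defaults: prints 10
```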

Performance Benchmark

The benchmark compares single-threaded and multi-threaded processing.

Prerequisite: download the 1M Wikipedia dataset first:

wget https://data.rapids.ai/raft/datasets/wiki_all_1M/wiki_all_1M.tar
tar -xf wiki_all_1M.tar

Run the benchmark:

mvn clean package
java -cp target/javabin-generator-1.0-SNAPSHOT-jar-with-dependencies.jar com.searchscale.benchmarks.PerformanceBenchmark file=base.1M.fbin total_docs=50000 batch_size=2500
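
For an informal cross-check of the bundled benchmark, you can also time the generator itself with threads=1 and threads=all. This is only a sketch using the parameters documented above. time_cmd and the bench_threads_* output directories are hypothetical names, and whole-second timing from date +%s is coarse, so treat the numbers as indicative:

```shell
# time_cmd CMD...: run a command, discard its stdout, print elapsed whole seconds.
time_cmd() {
  start=$(date +%s)
  "$@" > /dev/null
  end=$(date +%s)
  echo $(( end - start ))
}

JAR=target/javabin-generator-1.0-SNAPSHOT-jar-with-dependencies.jar

# Only runs when both the built JAR and the extracted dataset are present.
if [ -f "$JAR" ] && [ -f base.1M.fbin ]; then
  for t in 1 all; do
    secs=$(time_cmd java -jar "$JAR" data_file=base.1M.fbin \
      output_dir="bench_threads_$t" batch_size=2500 docs_count=50000 \
      threads="$t" overwrite=true)
    echo "threads=$t: ${secs}s"
  done
fi
```

Each run writes to its own directory with overwrite=true, so repeated runs start from a clean output directory.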
