
Benchmarking and Performance

Stefan de Konink edited this page Aug 11, 2025 · 1 revision

Badger uses an LMDB database. This means that lookups can be performed at lightning speed, whereas other operations are inevitably far more complex to execute and are effectively bounded by the amount of memory available in the system for optimisation. The benchmark below is based on the fr_sncf_ter dataset, which contains numerous issues. These can be addressed, but the time required depends heavily on the chosen approach. The table represents a baseline: “if you need to scan for all values, this is the cost when done single-threaded”.

One of our most expensive operations involves (batch) updates, for example repairing references. This process requires the deletion and recreation of the references, inverse references, embedding, and inverse embedding. If we only overwrote instead of deleting first, the update would still work, but it would leave behind lingering entries that refer to objects that no longer exist.

Since LMDB is not a relational database, we cannot simply search for any arbitrary value in a secondary index. We have therefore developed a method that starts from the outward reference (where the computed key is known), then traverses to the inward reference of the referenced objects. From there, all inward reference values are retrieved, deserialised, and compared to match the class, id, and version.
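The comparison step might look roughly like this — a hedged sketch with invented names, and JSON standing in for whatever serialisation format is actually used:

```python
import json

# Hypothetical: given candidate keys reached via the outward reference,
# fetch each inward-reference value, deserialise it, and keep those whose
# (class, id, version) triple matches the target.
def find_inward_entries(txn, inward_db, candidates, klass, obj_id, version):
    hits = []
    for ref_key in candidates:
        raw = txn.get(ref_key, db=inward_db)
        if raw is None:
            continue
        entry = json.loads(raw)  # the deserialisation is the dominant cost
        if (entry["class"], entry["id"], entry["version"]) == (klass, obj_id, version):
            hits.append(ref_key)
    return hits
```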

Deleting a single record from the reference table alone already cost 600 ms per object. Considering that deserialising all objects in the entire database takes 21 seconds, it follows that when updating more than 35 objects, a full table scan might be preferable.

Through targeted benchmarking, we have reduced the 600 ms per-object cost to 60 ms. With this improvement, the threshold at which a full table scan becomes worth considering rises to around 350 objects per batch. Using the benchmark data below, 200 records take roughly 14 seconds, approximately 68 ms per record.
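The break-even points quoted above follow directly from the fixed scan cost and the stated per-object figures:

```python
# Full deserialising scan of the database: ~21 s (see the table below).
full_scan_s = 21.0

# Per-object cost of the targeted lookup, before and after optimisation.
threshold_before = full_scan_s / 0.600  # 600 ms/object: scan wins above ~35
threshold_after = full_scan_s / 0.060   # 60 ms/object: scan wins above ~350
```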

Even if the per-record operation time could be further reduced, there will always be a sweet spot where the batch operation achieves stable performance and outpaces “smart lookups”. In the current implementation, that point is likely somewhere around 235 objects.

Looking at the full table below, one might wonder: in a significant batch operation, what if we simply discarded all entries and recomputed them? We have benchmarked this version as well. Processing 126519 objects took 367 seconds, averaging about 3 ms per object.
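As a quick sanity check on that figure:

```python
# 126519 objects recomputed in 367 s gives roughly 2.9 ms per object,
# which the text rounds to 3 ms.
per_object_ms = 367 / 126519 * 1000
```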

LMDB Benchmark Results

| Database | Entries | Time (s) |
| --- | ---: | ---: |
| Connection | 322 | 0.0509 |
| CoupledJourney | 125 | 0.0072 |
| DayType | 904 | 0.0347 |
| DayTypeAssignment | 904 | 0.0470 |
| DefaultConnection | 4 | 0.0008 |
| DestinationDisplay | 652 | 0.0800 |
| JourneyPart | 455 | 0.0451 |
| JourneyPartCouple | 125 | 0.0117 |
| Line | 556 | 0.1967 |
| Network | 13 | 0.0068 |
| Operator | 3 | 0.0012 |
| PassengerStopAssignment | 3461 | 0.4613 |
| Route | 15723 | 6.8190 |
| RouteLink | 34376 | 1.5615 |
| RoutePoint | 7673 | 0.3293 |
| ScheduledStopPoint | 3461 | 0.4111 |
| ServiceJourney | 22479 | 8.5755 |
| ServiceJourneyPattern | 5659 | 1.4576 |
| StopPlace | 3461 | 0.7745 |
| TopographicPlace | 2773 | 0.1798 |
| TrainNumber | 22479 | 0.5661 |
| TypeOfService | 3 | 0.0007 |
| UicOperatingPeriod | 904 | 0.0326 |
| ValidBetween | 1 | 0.0006 |
| ValueSet | 3 | 0.0010 |
| Total | 126519 | 21.6529 |

Metadata

| Database | Entries | Time (s) |
| --- | ---: | ---: |
| `_embedding` | 763263 | 6.4301 |
| `_embedding_inverse` | 763263 | 8.8429 |
| `_referencing` | 959927 | 12.6878 |
| `_referencing_inwards` | 959927 | 11.4773 |
