Welcome to a forensics-style Ruby challenge inspired by real problems in fintech.
You're working with a financial accounting platform that tracks user deposits and withdrawals. There are two ledgers: the events ledger records user-initiated deposits and withdrawals in the app, while the bank transaction ledger describes deposits and withdrawals at the bank.
Matching an event with a bank transaction is called reconciliation. It indicates that what was expected to happen (a deposit or withdrawal) in the events ledger was correctly actioned at the bank.
You’re given two datasets:
- `events_with_withdrawals.json` — each entry represents a user event, like a deposit or withdrawal. These events are internal to the system and can be considered pending until they are later reconciled by a corresponding real-world bank transaction (see below).
- `bank_txns_with_withdrawals.json` — each entry is a transaction imported from the bank, matched to an `event[:id]`. Where a bank transaction corresponds to a specific event, that event can be considered reconciled.
Your mission is to:
- ✅ Detect duplicate events (i.e. same `id`)
- ✅ Detect events missing a matching bank transaction (i.e. they have not yet been reconciled)
- ✅ Detect orphaned bank transactions (i.e. they reference non-existent events)
- ✅ Compute a final reconciled balance based only on:
  - Non-duplicate events
  - Events that have been reconciled with a corresponding bank transaction
  - Correct handling of deposits (inflows) and withdrawals (outflows)
You can use the provided reconciler.rb as a starting point. To run your tool:
```sh
ruby reconciler.rb events_with_withdrawals.json bank_txns_with_withdrawals.json
```

Expected output shape:

```json
{
  "duplicates": ["evt_abc123"],
  "missing_bank_txns": ["evt_deadbeef"],
  "orphaned_bank_txns": ["txn_unknown123"],
  "final_reconciled_balance": 1275000
}
```

Example event:

```json
{
  "id": "evt_abc123",
  "type": "deposit",
  "user_id": "user_0012",
  "amount": 10000,
  "timestamp": "2024-01-01T12:00:00"
}
```

Example bank transaction:

```json
{
  "id": "txn_456xyz",
  "type": "deposit",
  "reference_id": "evt_abc123",
  "amount": 10000,
  "timestamp": "2024-01-01T12:05:00"
}
```
- Please fork the repository and invite the Tembo Engineers as viewers
- Your repo should contain:
  - Your Ruby script(s)
  - Instructions on how to run them
  - (Optional) Tests or CLI improvements

Please feel free to structure your work however you feel is best.
This challenge should take a few hours for a strong first pass. Feel free to go deeper if you're having fun.
You don’t have to implement these, but they’re great discussion points in follow-up:
- How would you design this system to prevent duplicates in the first place?
- How would you persist and enforce idempotency?
- What would this tool look like if it were running on millions of records daily?
Good luck — and enjoy cleaning up this financial mess 💸🧹
```sh
bundle install
ruby run.rb fixtures/events_with_withdrawals.json fixtures/bank_txns_with_withdrawals.json
```

Input: 5,052 events · 4,904 bank transactions. Output:

```json
{
  "duplicates": ["evt_f80aaeb0", "evt_6e324766", "..."],
  "missing_bank_txns": ["evt_85252717", "evt_3e614235", "..."],
  "orphaned_bank_txns": ["txn_87741495", "txn_1fdaba6f", "..."],
  "final_reconciled_balance": 14050838
}
```

| Category | Count |
|---|---|
| Duplicate events | 52 |
| Events missing a bank transaction | 371 |
| Orphaned bank transactions | 275 |
| Final reconciled balance | 14,050,838 |
The entry point is run.rb, not reconciler.rb. The README suggests reconciler.rb as a starting point, but mixing CLI concerns (reading ARGV, opening files, printing to stdout) with orchestration logic would violate the single responsibility principle. run.rb is infrastructure — it handles I/O and delegates everything else to Reconciler. reconciler.rb stays focused on orchestrating the domain logic and knows nothing about how it is invoked.
sequenceDiagram
participant R as Reconciler
participant I as IndexBuilder
participant RED as ReconciledEventDetector
participant UED as UnreconciledEventDetector
participant OTD as OrphanTransactionDetector
participant BC as BalanceCalculator
Note over R: run_report(events:, bank_txns:)
R->>I: call(collection: events, key: :id)
I-->>R: { indexed: Hash, duplicates: Array }
R->>I: call(collection: bank_txns, key: :reference_id)
I-->>R: { indexed: Hash, duplicates: Array }
R->>RED: call(indexed_events:, indexed_txns:)
RED-->>R: reconciled_events — Array of Hashes
R->>UED: call(indexed_events:, reconciled_events:)
UED-->>R: missing_bank_txns — Array of ids
R->>OTD: call(bank_txns:, indexed_events:)
OTD-->>R: orphaned_bank_txns — Array of ids
R->>BC: call(reconciled_events:)
BC-->>R: final_reconciled_balance — Integer
Note over R: returns { duplicates:, missing_bank_txns:, orphaned_bank_txns:, final_reconciled_balance: }
Each class has a single responsibility:
| Class | Responsibility |
|---|---|
| `IndexBuilder` | Builds a hash index from a collection, detects duplicates in O(n) |
| `ReconciledEventDetector` | Returns events that have a corresponding bank transaction |
| `UnreconciledEventDetector` | Returns IDs of events that have not been reconciled with a bank transaction |
| `OrphanTransactionDetector` | Returns bank transaction IDs referencing non-existent events |
| `BalanceCalculator` | Computes net balance (deposits − withdrawals) from reconciled events |
| `Reconciler` | Orchestrates all of the above via dependency injection |
All lookups use hash indexes — O(n) throughout, no nested iteration.
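To make the indexing step concrete, here is a minimal sketch of how an `IndexBuilder`-style class can build a hash index and collect duplicates in a single O(n) pass. The class name and return shape mirror the diagram above, but the body is illustrative, not the repository's actual code:

```ruby
# Illustrative sketch: index a collection by a key in one pass,
# recording any id seen more than once as a duplicate.
class IndexSketch
  # Returns { indexed: Hash, duplicates: Array }.
  def self.call(collection:, key:)
    indexed = {}
    duplicates = []
    collection.each do |record|
      id = record[key]
      duplicates << id if indexed.key?(id) # second sighting => duplicate
      indexed[id] ||= record               # keep the first occurrence only
    end
    { indexed: indexed, duplicates: duplicates.uniq }
  end
end

events = [
  { id: "evt_1", amount: 100 },
  { id: "evt_2", amount: 200 },
  { id: "evt_1", amount: 100 } # duplicate id
]
result = IndexSketch.call(collection: events, key: :id)
result[:duplicates] # => ["evt_1"]
```

Each record is touched exactly once, and every membership check is a hash lookup, which is what keeps the whole pipeline linear.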
| Time | |
|---|---|
| Single run (~10k records) | ~5–6ms |
| 100 consecutive runs | ~534ms (~5.3ms avg) |
Scaling test to verify O(n) — if complexity were O(n²), doubling the dataset would quadruple the time:
| Dataset | Records | Avg time | Expected if O(n²) |
|---|---|---|---|
| 1× | 5,052 events | ~5.3ms | — |
| 2× | 10,104 events | ~6.6ms | ~21ms |
| 4× | 20,208 events | ~9.9ms | ~85ms |
Growth is linear — confirming O(n) throughout.
```sh
bundle exec rspec
```

Unit specs cover each class in isolation. The integration spec (`spec/reconciler_spec.rb`) runs the full pipeline against JSON fixtures.
The current implementation makes two passes over the matched events — one for deposits, one for withdrawals:

```ruby
deposits = matched_events.select { |e| e[:type] == 'deposit' }.sum { |e| e[:amount] }
withdrawals = matched_events.select { |e| e[:type] == 'withdrawal' }.sum { |e| e[:amount] }
deposits - withdrawals
```

This creates two intermediate arrays. An alternative single-pass approach avoids them:

```ruby
matched_events.sum { |e| e[:type] == 'deposit' ? e[:amount] : -e[:amount] }
```

The two-pass version was chosen for readability — deposits and withdrawals are explicit and easy to reason about. At this dataset size the difference is negligible. For very large datasets the single-pass version would be preferable.
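A quick sanity check, using made-up sample events, that the two styles agree:

```ruby
# Hypothetical sample data; amounts are integer cents as in the fixtures.
matched_events = [
  { type: "deposit",    amount: 10_000 },
  { type: "withdrawal", amount: 2_500 },
  { type: "deposit",    amount: 500 }
]

# Two-pass: explicit deposits and withdrawals subtotals.
deposits    = matched_events.select { |e| e[:type] == "deposit" }.sum { |e| e[:amount] }
withdrawals = matched_events.select { |e| e[:type] == "withdrawal" }.sum { |e| e[:amount] }
two_pass = deposits - withdrawals

# Single-pass: sign each amount as it streams by.
single_pass = matched_events.sum { |e| e[:type] == "deposit" ? e[:amount] : -e[:amount] }

two_pass == single_pass # => true, both are 8_000
```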
group_by is idiomatic Ruby and keeps the implementation concise. The trade-off is that it builds a full array per key in memory before transform_values discards the duplicates. For the datasets in this challenge that overhead is trivial.
If the input file were very large (gigabytes), a streaming each_with_object approach would be preferable — it can discard duplicates on the fly without ever holding multiple copies of the same record in memory at once.
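The two deduplication styles can be sketched side by side on a few made-up records. `group_by` materialises an array per key before discarding all but the first element; `each_with_object` keeps at most one record per key as it streams:

```ruby
records = [
  { id: "evt_1" }, { id: "evt_2" }, { id: "evt_1" }
]

# group_by style: builds { "evt_1" => [r, r], "evt_2" => [r] } in full first.
grouped  = records.group_by { |r| r[:id] }
deduped_a = grouped.transform_values(&:first).values

# Streaming style: never holds more than one record per id in memory.
deduped_b = records.each_with_object({}) { |r, acc| acc[r[:id]] ||= r }.values

deduped_a == deduped_b # => true
```

For the streaming variant to pay off on gigabyte files, the record source itself would also need to be streamed (e.g. a JSON event per line) rather than loaded with a single `JSON.parse`.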
Enforce a unique constraint on event.id at the database level and use INSERT ... ON CONFLICT DO NOTHING (Postgres) or equivalent. This makes deduplication a database concern rather than an application one.
Keep a processed event log keyed by event.id. On each ingest, check whether the ID has already been processed before acting on it. This makes the pipeline safe to replay after failures or retries.
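A minimal in-memory sketch of that processed-event-log idea (the class name is hypothetical, and a real system would back the set with a durable store such as the Postgres table above): ingesting an already-seen `id` is a no-op, so the pipeline can be replayed safely.

```ruby
require "set"

# Illustrative only: ingest applies an event exactly once per id.
class IdempotentIngestor
  def initialize
    @processed = Set.new # stands in for a durable processed-event log
    @applied = []
  end

  attr_reader :applied

  # Returns true if the event was applied, false if skipped as a replay.
  def ingest(event)
    return false unless @processed.add?(event[:id]) # Set#add? is nil when already present
    @applied << event
    true
  end
end

ingestor = IdempotentIngestor.new
ingestor.ingest({ id: "evt_1", amount: 100 }) # => true, applied
ingestor.ingest({ id: "evt_1", amount: 100 }) # => false, replay ignored
```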
Note: the debug script above calculates a single global balance for auditing purposes. A production system processing millions of events daily is built to serve individual customer wallets. The architecture below scales the reconciliation engine to project individual user balances in real-time.
At scale, the reconciliation pipeline would move from a batch file-processing model to an event-driven architecture:
sequenceDiagram
participant U as User / Bank Import
participant API as API
participant Redis
participant Kafka
participant Worker as Sidekiq Workers
participant PG as Primary DB (Postgres)
participant Replica as Read Replica
participant Client as Client Dashboard
Note over U,Client: 1. Write path (high frequency)
U->>API: POST /deposit (Idempotency-Key: uuid)
API->>Redis: LOCK event_{uuid}
Redis-->>API: acquired
API->>Kafka: publish → events topic
API-->>U: 202 Accepted
Note over U,Client: 2. Workers consume and reconcile
Kafka->>Worker: consume event
Worker->>PG: INSERT ... ON CONFLICT DO NOTHING
Worker->>PG: match event ↔ bank_txn → write reconciled record
Worker->>Worker: enqueue RefreshBalanceJob (throttled)
Note over U,Client: 3. WAL replication (milliseconds)
PG-->>Replica: WAL stream
Note over U,Client: 4. Materialized view refresh (on Primary)
Worker->>PG: REFRESH MATERIALIZED VIEW CONCURRENTLY user_balances
PG-->>Replica: WAL stream propagates the updated view automatically
Note over U,Client: 5. Read path (~2ms)
Client->>Replica: GET /dashboard
Replica-->>Client: single row from user_balances view (balance)
Replica-->>Client: last 20 rows from reconciled_events (feed)
- Idempotency Key — frontend generates a UUID per form render; the API acquires a Redis lock on it, so a double-click never produces two events in Kafka
- ON CONFLICT DO NOTHING — if two workers race to insert the same event, Postgres handles the row-level lock silently; no exception, no Rollbar noise
- WAL streaming — the read replica receives changes in milliseconds; it is consistent for direct queries
- Materialized view — pre-computes the reconciled balance per user so the dashboard reads a single row (~2ms) instead of summing the full history. The transaction feed (last N events) is a separate paginated query (`SELECT ... ORDER BY timestamp DESC LIMIT 20`) against `reconciled_events` — cheap with a standard index on `(user_id, timestamp)`, no materialised view needed
- RefreshBalanceJob — enqueued by the Reconciliation Worker after each commit, throttled with `sidekiq-unique-jobs` so that if the job is already queued or running, duplicate enqueues are dropped. 1,000 reconciliations in 2 seconds trigger one refresh, not 1,000. Throttling (rate-limiting concurrent execution) is the right primitive here — we want the refresh to happen as soon as possible, just not redundantly
- 5GB historical file — streamed record by record with `yajl-ruby` into Kafka; the large file becomes millions of small messages processed in parallel by the existing workers
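The idempotency-key lock can be sketched without a running Redis. Here a tiny in-memory fake stands in for Redis; with the real `redis-rb` gem the same pattern is `redis.set(key, "1", nx: true, ex: 60)`, which returns a truthy value only for the first caller. Everything below is an illustrative stand-in, not production code:

```ruby
# Fake Redis exposing only the SET ... NX semantics we need.
class FakeRedis
  def initialize
    @store = {}
  end

  # Succeeds (returns true) only if the key is not already set.
  def set_nx(key, value)
    return false if @store.key?(key)
    @store[key] = value
    true
  end
end

redis = FakeRedis.new
idempotency_key = "4f3c2a" # hypothetical UUID generated per form render

first_click  = redis.set_nx("lock:event_#{idempotency_key}", "1") # => true: publish to Kafka
double_click = redis.set_nx("lock:event_#{idempotency_key}", "1") # => false: drop duplicate
```

In production the lock would also carry a TTL (`ex:`) so an abandoned request does not block the key forever.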
In a standard Rails CRUD app, the API inserts into Postgres synchronously and returns 201 Created. That works well for most applications. At scale, the order is intentionally reversed: the API writes to Kafka first and returns 202 Accepted before anything reaches the database. Three reasons:
Shock absorber — a synchronous INSERT ties API availability directly to Postgres throughput. Under a traffic spike, Postgres acquires locks, CPU saturates, Rails requests time out, and the platform goes down. Writing to Kafka (an append to an in-memory/sequential disk buffer) takes ~1ms regardless of load. The API stays up; workers drain the queue at a safe rate.
No dual-write problem — if you write to Postgres first and then publish to Kafka, a network failure between the two steps leaves an event in the database that Kafka consumers (Notifications, Fraud, Analytics) will never see — silent inconsistency. Solving this without Kafka-first requires a complex Transactional Outbox pattern. With Kafka as the single synchronous write, failure is atomic: either Kafka accepts the message (202 returned) or it does not (500 returned, nothing written).
Event sourcing — in a financial ledger, Kafka is the source of truth. Postgres is a projection — a read-optimised view derived from that log. Writing to Kafka first makes this explicit: the event exists the moment it lands in the log; the database is updated as a consequence, not as the origin. In practice Kafka retains recent data on fast disks for a configurable TTL (commonly 7–30 days); historical events are offloaded transparently to cheap object storage (S3, GCS) via Tiered Storage, giving infinite replayability without infinite disk cost.
At 10M records/day (~115 msg/sec), Sidekiq on Redis would handle the throughput comfortably. The reason for Kafka is architectural: in a financial platform, a single deposit event is typically consumed by multiple independent teams (Ledger, Notifications, Fraud, Analytics). Kafka's persistent, append-only log means each consumer reads at its own pace from the same topic. If a fraud detection worker crashes and loses state, it can replay from any offset. Sidekiq destroys messages on consumption — there is no replay.
They are not competing tools — most banks use both simultaneously. The distinction is conceptual:
| | Kafka | RabbitMQ |
|---|---|---|
| Broker | Dumb — stores bytes sequentially | Smart — applies routing rules, retries, DLQ |
| Consumer | Smart — tracks its own offset, manages errors | Dumb — picks up message, does work, ACKs |
| Purpose | Project — build and rebuild state from an immutable history of facts | React — execute a command once, reliably |
| Message lifecycle | Retained based on config (TTL, size, or Tiered Storage to S3) | Deleted on ACK |
Kafka fits this challenge because reconciliation is a projection: we are building the ledger state (final_reconciled_balance) by replaying an ordered, immutable log of financial events. Three properties are critical here:
Replayability — if a bug in BalanceCalculator corrupts reconciled balances for 3 hours, reset the consumer offset to before the incident and Kafka re-injects the exact same events for reprocessing. Impossible with RabbitMQ without manual DB surgery.
Efficient fan-out — multiple consumer groups (Ledger, Fraud, Notifications) each maintain their own offset pointer against the same physical log. RabbitMQ must duplicate messages into separate queues per consumer, multiplying RAM usage.
Strict ordering per user — using user_id as the partition key, Kafka guarantees all events for a given user land in the same partition and are processed by the same worker in chronological order. With RabbitMQ and multiple workers, Worker 2 could process a withdrawal before Worker 1 finishes the preceding deposit, producing a false negative balance.
Where RabbitMQ would be the right choice — once a reconciliation is complete, if we need to asynchronously generate a PDF statement or send a notification email, RabbitMQ is the better fit: the broker handles retries, dead-letter queues, and complex routing. The message is a command, not an event — once executed it has no historical value and can be discarded.
Both solve the same architectural problem: a pre-computed, denormalised read model that can be served without joins or aggregations at query time. The difference is who maintains it:
- JSONB read model — a worker builds and updates a JSON document (`{ user_id:, balance:, last_txns: [...] }`) stored in a `jsonb` column (Postgres, MongoDB, DynamoDB). The application code is responsible for the projection logic, handling concurrent updates and partial failures.
- Materialised view — you give Postgres a SQL query (`SUM`, `GROUP BY`, etc.) and it executes it in the background, storing the result as a physical table on disk. `SELECT balance FROM user_balances WHERE user_id = 123` reads that row directly — no computation at query time. Postgres owns the projection logic.
For this challenge the materialised view is the right fit: the projection logic is a straightforward aggregation that SQL expresses cleanly, and delegating it to the database engine avoids writing and maintaining worker-side update logic.
Introducing a separate database for reads makes sense when you are crossing a Bounded Context boundary — for example, sending data to a Marketing system or an analytical Data Warehouse where the schema, vocabulary, and business rules are fundamentally different.
Here the balance calculation belongs to the same domain as the transaction record: the Ledger. The schema is identical, the vocabulary is the same. Splitting into two databases would add operational complexity (distributed sync, split-brain failures) without any domain justification.
What we have is pragmatic CQRS within a single Bounded Context:
- Write model — normalised `events` and `bank_txns` tables, optimised for fast inserts
- Read model — the `user_balances` materialised view, a denormalised projection optimised for O(1) reads
Both live in the same Postgres engine. WAL replication is guaranteed at the system level — there is no risk of the primary and the read model diverging due to a failed sync job. A separate database would introduce exactly that failure mode.
A separate read database would only be warranted if the read domain had genuinely different requirements: full-text search (Elasticsearch), time-series analytics (TimescaleDB), or a different team owning a different Bounded Context entirely.
REFRESH MATERIALIZED VIEW runs on the Primary. Postgres stores the materialised view as a physical table on disk — it is not computed on read. When the job runs, the Primary uses CPU to compute the aggregation and writes the result to its own disk. The Replica never calculates anything; it simply receives the resulting disk blocks via WAL and applies them.
Two concerns arise:
Does it block INSERT workers? No. Postgres uses MVCC (Multi-Version Concurrency Control). REFRESH only acquires an ACCESS SHARE lock on the source tables (events, bank_txns), so workers can continue inserting thousands of rows per second without contention.
Does it block client reads on the view? This depends on the command used:
- `REFRESH MATERIALIZED VIEW user_balances` — acquires an `ACCESS EXCLUSIVE` lock on the view. Clients querying the balance are blocked for the entire duration of the refresh. Unacceptable in production.
- `REFRESH MATERIALIZED VIEW CONCURRENTLY user_balances` — Postgres computes the new result in a temporary table in the background, then swaps it in atomically. Clients keep reading the previous version until the swap completes. Zero downtime, zero blocking. Requires a unique index on `user_id` on the view.
The diagram uses CONCURRENTLY for this reason.
The read replica is consistent (WAL lag is milliseconds). The materialized view is only as fresh as the last RefreshBalanceJob execution. Because the job is throttled, there may be a short window after a reconciliation where a user does not see the update immediately. The frontend handles this with an optimistic UI update — it adds the amount to the displayed balance locally the moment the request succeeds, without waiting for the next refresh.