cannyedge34/tembo_ruby_test

🔍 Ledger Debug & Reconciliation Challenge

Welcome to a forensics-style Ruby challenge inspired by real problems in fintech.

You're working with a financial accounting platform that tracks user deposits and withdrawals. There are two ledgers: the events ledger records user-initiated deposits and withdrawals in the app, while the bank transaction ledger records deposits and withdrawals at the bank.

Matching an event with a bank transaction is called reconciliation. It confirms that what was expected to happen (a deposit or withdrawal) according to the events ledger was correctly actioned at the bank.


🧠 Your Objective

You’re given two datasets:

  • events_with_withdrawals.json
    Each entry represents a user event, such as a deposit or withdrawal. These events are internal to the system and can be considered pending until they are reconciled by a corresponding real-world bank transaction (see below).

  • bank_txns_with_withdrawals.json
    Each entry is a transaction imported from the bank, matched to an event via its reference_id. Where a bank transaction corresponds to a specific event, that event can be considered reconciled.

Your mission is to:

1. Data sanitisation. Build a tool that identifies:

  • ✅ Duplicate events (i.e. same id)
  • ✅ Events missing a matching bank transaction (i.e. they have not yet been reconciled)
  • ✅ Orphaned bank transactions (i.e. they reference non-existent events)

2. Reconciliation.

  • ✅ Compute a final reconciled balance based only on:
    • Non-duplicate events
    • Events that have been reconciled with a corresponding bank transaction
    • Correct handling of deposits (inflows) and withdrawals (outflows)

🛠 Getting Started

You can use the provided reconciler.rb as a starting point. To run your tool:

```sh
ruby reconciler.rb events_with_withdrawals.json bank_txns_with_withdrawals.json
```

✅ Example Output

```json
{
  "duplicates": ["evt_abc123"],
  "missing_bank_txns": ["evt_deadbeef"],
  "orphaned_bank_txns": ["txn_unknown123"],
  "final_reconciled_balance": 1275000
}
```

📄 Sample Data Format

Event

```json
{
  "id": "evt_abc123",
  "type": "deposit",
  "user_id": "user_0012",
  "amount": 10000,
  "timestamp": "2024-01-01T12:00:00"
}
```

Bank Transaction

```json
{
  "id": "txn_456xyz",
  "type": "deposit",
  "reference_id": "evt_abc123",
  "amount": 10000,
  "timestamp": "2024-01-01T12:05:00"
}
```

🏁 Deliverables

  • Please fork the repository and invite the Tembo Engineers as viewers

  • Your repo should contain

    • Your Ruby script(s)
    • Instructions on how to run them
    • (Optional) Tests or CLI improvements

    Please feel free to structure your work however you feel is best.


⏱ Time Expectation

This challenge should take a few hours for a strong first pass. Feel free to go deeper if you're having fun.


🚀 Bonus Topics

You don’t have to implement these, but they’re great discussion points in follow-up:

  • How would you design this system to prevent duplicates in the first place?
  • How would you persist and enforce idempotency?
  • What would this tool look like if it were running on millions of records daily?

Good luck — and enjoy cleaning up this financial mess 💸🧹


Solution

How to run

```sh
bundle install
ruby run.rb fixtures/events_with_withdrawals.json fixtures/bank_txns_with_withdrawals.json
```

Results against the provided dataset

5,052 events · 4,904 bank transactions

```json
{
  "duplicates": ["evt_f80aaeb0", "evt_6e324766", "..."],
  "missing_bank_txns": ["evt_85252717", "evt_3e614235", "..."],
  "orphaned_bank_txns": ["txn_87741495", "txn_1fdaba6f", "..."],
  "final_reconciled_balance": 14050838
}
```

| Category | Count |
| --- | --- |
| Duplicate events | 52 |
| Events missing a bank transaction | 371 |
| Orphaned bank transactions | 275 |
| Final reconciled balance | 14,050,838 |

CLI vs Orchestrator

The entry point is run.rb, not reconciler.rb. The README suggests reconciler.rb as a starting point, but mixing CLI concerns (reading ARGV, opening files, printing to stdout) with orchestration logic would violate the single responsibility principle. run.rb is infrastructure — it handles I/O and delegates everything else to Reconciler. reconciler.rb stays focused on orchestrating the domain logic and knows nothing about how it is invoked.

Architecture

```mermaid
sequenceDiagram
    participant R as Reconciler
    participant I as IndexBuilder
    participant RED as ReconciledEventDetector
    participant UED as UnreconciledEventDetector
    participant OTD as OrphanTransactionDetector
    participant BC as BalanceCalculator

    Note over R: run_report(events:, bank_txns:)

    R->>I: call(collection: events, key: :id)
    I-->>R: { indexed: Hash, duplicates: Array }

    R->>I: call(collection: bank_txns, key: :reference_id)
    I-->>R: { indexed: Hash, duplicates: Array }

    R->>RED: call(indexed_events:, indexed_txns:)
    RED-->>R: reconciled_events — Array of Hashes

    R->>UED: call(indexed_events:, reconciled_events:)
    UED-->>R: missing_bank_txns — Array of ids

    R->>OTD: call(bank_txns:, indexed_events:)
    OTD-->>R: orphaned_bank_txns — Array of ids

    R->>BC: call(reconciled_events:)
    BC-->>R: final_reconciled_balance — Integer

    Note over R: returns { duplicates:, missing_bank_txns:, orphaned_bank_txns:, final_reconciled_balance: }
```

Each class has a single responsibility:

| Class | Responsibility |
| --- | --- |
| IndexBuilder | Builds a hash index from a collection, detects duplicates in O(n) |
| ReconciledEventDetector | Returns events that have a corresponding bank transaction |
| UnreconciledEventDetector | Returns IDs of events that have not been reconciled with a bank transaction |
| OrphanTransactionDetector | Returns bank transaction IDs referencing non-existent events |
| BalanceCalculator | Computes net balance (deposits − withdrawals) from reconciled events |
| Reconciler | Orchestrates all of the above via dependency injection |

All lookups use hash indexes — O(n) throughout, no nested iteration.
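The IndexBuilder contract from the diagram can be reproduced in a few lines. This is a simplified sketch of the same idea, not the repo's actual class: one pass, a hash keyed by `key`, duplicates collected as they are seen.

```ruby
# Sketch of the IndexBuilder contract: { indexed:, duplicates: } in one O(n) pass.
module IndexBuilder
  def self.call(collection:, key:)
    indexed    = {}
    duplicates = []
    collection.each do |record|
      id = record[key]
      duplicates << id if indexed.key?(id) # second sighting => duplicate
      indexed[id] ||= record               # first occurrence wins
    end
    { indexed: indexed, duplicates: duplicates }
  end
end

events = [
  { id: 'evt_1', type: 'deposit',    amount: 100 },
  { id: 'evt_2', type: 'withdrawal', amount: 40 },
  { id: 'evt_1', type: 'deposit',    amount: 100 } # duplicate id
]

result = IndexBuilder.call(collection: events, key: :id)
result[:duplicates] # => ["evt_1"]
```

Once `indexed` exists, every later lookup (matching a bank transaction to its event, detecting orphans) is a constant-time hash access rather than a scan.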

Performance

| Scenario | Time |
| --- | --- |
| Single run (~10k records) | ~5–6ms |
| 100 consecutive runs | ~534ms (~5.3ms avg) |

Scaling test to verify O(n) — if complexity were O(n²), doubling the dataset would quadruple the time:

| Records | Avg time | Expected if O(n²) |
| --- | --- | --- |
| 5,052 events | ~5.3ms | (baseline) |
| 10,104 events | ~6.6ms | ~21ms |
| 20,208 events | ~9.9ms | ~85ms |

Growth is linear — confirming O(n) throughout.

Tests

```sh
bundle exec rspec
```

Unit specs cover each class in isolation. The integration spec (spec/reconciler_spec.rb) runs the full pipeline against JSON fixtures.


Design trade-offs

BalanceCalculator: two passes vs one pass

The current implementation makes two passes over the matched events — one for deposits, one for withdrawals:

```ruby
deposits    = matched_events.select { |e| e[:type] == 'deposit' }.sum { |e| e[:amount] }
withdrawals = matched_events.select { |e| e[:type] == 'withdrawal' }.sum { |e| e[:amount] }
deposits - withdrawals
```

This creates two intermediate arrays. An alternative single-pass approach avoids them:

```ruby
matched_events.sum { |e| e[:type] == 'deposit' ? e[:amount] : -e[:amount] }
```

The two-pass version was chosen for readability — deposits and withdrawals are explicit and easy to reason about. At this dataset size the difference is negligible. For very large datasets the single-pass version would be preferable.
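The two variants are easy to check against each other on a toy dataset (the records below are made up for illustration):

```ruby
matched_events = [
  { type: 'deposit',    amount: 10_000 },
  { type: 'withdrawal', amount: 2_500 },
  { type: 'deposit',    amount: 500 }
]

# Two passes: explicit deposits and withdrawals, two intermediate arrays.
deposits    = matched_events.select { |e| e[:type] == 'deposit' }.sum { |e| e[:amount] }
withdrawals = matched_events.select { |e| e[:type] == 'withdrawal' }.sum { |e| e[:amount] }
two_pass = deposits - withdrawals

# One pass: a signed sum, no intermediate arrays.
one_pass = matched_events.sum { |e| e[:type] == 'deposit' ? e[:amount] : -e[:amount] }

two_pass == one_pass # => true, both are 8_000
```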

IndexBuilder: group_by vs each_with_object

group_by is idiomatic Ruby and keeps the implementation concise. The trade-off is that it builds a full array per key in memory before transform_values discards the duplicates. For the datasets in this challenge that overhead is trivial.

If the input file were very large (gigabytes), a streaming each_with_object approach would be preferable — it can discard duplicates on the fly without ever holding multiple copies of the same record in memory at once.
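A minimal each_with_object variant of the same indexing step (an illustrative sketch, not the repo's code) shows the difference: duplicates are noted and dropped on the fly, so at most one copy of a record per key is ever retained.

```ruby
# One pass, accumulator-based: never builds the per-key arrays that group_by would.
def index_streaming(collection, key:)
  collection.each_with_object({ indexed: {}, duplicates: [] }) do |record, acc|
    id = record[key]
    if acc[:indexed].key?(id)
      acc[:duplicates] << id # duplicate: record the id, discard the record
    else
      acc[:indexed][id] = record # first sighting wins
    end
  end
end

events = [
  { id: 'evt_1', amount: 100 },
  { id: 'evt_1', amount: 100 },
  { id: 'evt_2', amount: 40 }
]

index_streaming(events, key: :id)[:duplicates] # => ["evt_1"]
```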


Bonus Topics

How would you prevent duplicates in the first place?

Enforce a unique constraint on event.id at the database level and use INSERT ... ON CONFLICT DO NOTHING (Postgres) or equivalent. This makes deduplication a database concern rather than an application one.

How would you persist and enforce idempotency?

Keep a processed event log keyed by event.id. On each ingest, check whether the ID has already been processed before acting on it. This makes the pipeline safe to replay after failures or retries.
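A toy in-memory version of that idea, with `ProcessedLog` as a hypothetical name; a real system would back the seen-set with a database table or Redis rather than process memory, and would only mark an ID as processed after the work commits:

```ruby
require 'set'

class ProcessedLog
  def initialize
    @seen = Set.new
  end

  # Yields only the first time an id is seen; replays become no-ops.
  def process(id)
    return :skipped unless @seen.add?(id) # Set#add? returns nil if already present
    yield
    :processed
  end
end

log = ProcessedLog.new
log.process('evt_abc123') { credit_account = true } # => :processed
log.process('evt_abc123') { credit_account = true } # => :skipped (safe replay)
```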

What would this tool look like running on millions of records daily?

Note: the debug script above calculates a single global balance for auditing purposes. A production system processing millions of events daily is built to serve individual customer wallets. The architecture below scales the reconciliation engine to project individual user balances in real-time.

At scale, the reconciliation pipeline would move from a batch file-processing model to an event-driven architecture:

```mermaid
sequenceDiagram
    participant U as User / Bank Import
    participant API as API
    participant Redis
    participant Kafka
    participant Worker as Sidekiq Workers
    participant PG as Primary DB (Postgres)
    participant Replica as Read Replica
    participant Client as Client Dashboard

    Note over U,Client: 1. Write path (high frequency)

    U->>API: POST /deposit (Idempotency-Key: uuid)
    API->>Redis: LOCK event_{uuid}
    Redis-->>API: acquired
    API->>Kafka: publish → events topic
    API-->>U: 202 Accepted

    Note over U,Client: 2. Workers consume and reconcile

    Kafka->>Worker: consume event
    Worker->>PG: INSERT ... ON CONFLICT DO NOTHING
    Worker->>PG: match event ↔ bank_txn → write reconciled record
    Worker->>Worker: enqueue RefreshBalanceJob (throttled)

    Note over U,Client: 3. WAL replication (milliseconds)

    PG-->>Replica: WAL stream

    Note over U,Client: 4. Materialized view refresh (on Primary)

    Worker->>PG: REFRESH MATERIALIZED VIEW CONCURRENTLY user_balances
    PG-->>Replica: WAL stream propagates the updated view automatically

    Note over U,Client: 5. Read path (~2ms)

    Client->>Replica: GET /dashboard
    Replica-->>Client: single row from user_balances view (balance)
    Replica-->>Client: last 20 rows from reconciled_events (feed)
```

Key design decisions

  • Idempotency Key — frontend generates a UUID per form render; the API acquires a Redis lock on it, so a double-click never produces two events in Kafka
  • ON CONFLICT DO NOTHING — if two workers race to insert the same event, Postgres handles the row-level lock silently; no exception, no Rollbar noise
  • WAL streaming — the read replica receives changes in milliseconds; it is consistent for direct queries
  • Materialized view — pre-computes the reconciled balance per user so the dashboard reads a single row (~2ms) instead of summing the full history. The transaction feed (last N events) is a separate paginated query (SELECT ... LIMIT 20 ORDER BY timestamp DESC) against reconciled_events — cheap with a standard index on (user_id, timestamp), no materialised view needed
  • RefreshBalanceJob — enqueued by the Reconciliation Worker after each commit, throttled with sidekiq-unique-jobs so that if the job is already queued or running, duplicate enqueues are dropped. 1,000 reconciliations in 2 seconds trigger one refresh, not 1,000. Throttling (rate-limiting concurrent execution) is the right primitive here — we want the refresh to happen as soon as possible, just not redundantly
  • 5GB historical file — streamed record by record with yajl-ruby into Kafka; the large file becomes millions of small messages processed in parallel by the existing workers
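yajl-ruby can parse a multi-gigabyte stream chunk by chunk; with newline-delimited JSON the same flat-memory shape is expressible with the stdlib alone. A sketch (the block stands in for a Kafka producer):

```ruby
require 'json'
require 'stringio'

# Streams a newline-delimited JSON file one record at a time,
# handing each record to a block. Memory stays flat regardless of file size
# because no more than one line is ever held at once.
def stream_records(io)
  io.each_line do |line|
    line.strip!
    next if line.empty?
    yield JSON.parse(line, symbolize_names: true)
  end
end

sample = StringIO.new(<<~NDJSON)
  {"id":"evt_1","type":"deposit","amount":10000}
  {"id":"evt_2","type":"withdrawal","amount":2500}
NDJSON

stream_records(sample) { |event| event } # each event handled individually
```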

Why 202 Accepted and not 201 Created?

In a standard Rails CRUD app, the API inserts into Postgres synchronously and returns 201 Created. That works well for most applications. At scale, the order is intentionally reversed: the API writes to Kafka first and returns 202 Accepted before anything reaches the database. Three reasons:

Shock absorber — a synchronous INSERT ties API availability directly to Postgres throughput. Under a traffic spike, Postgres acquires locks, CPU saturates, Rails requests time out, and the platform goes down. Writing to Kafka (an append to an in-memory/sequential disk buffer) takes ~1ms regardless of load. The API stays up; workers drain the queue at a safe rate.

No dual-write problem — if you write to Postgres first and then publish to Kafka, a network failure between the two steps leaves an event in the database that Kafka consumers (Notifications, Fraud, Analytics) will never see — silent inconsistency. Solving this without Kafka-first requires a complex Transactional Outbox pattern. With Kafka as the single synchronous write, failure is atomic: either Kafka accepts the message (202 returned) or it does not (500 returned, nothing written).

Event sourcing — in a financial ledger, Kafka is the source of truth. Postgres is a projection — a read-optimised view derived from that log. Writing to Kafka first makes this explicit: the event exists the moment it lands in the log; the database is updated as a consequence, not as the origin. In practice Kafka retains recent data on fast disks for a configurable TTL (commonly 7–30 days); historical events are offloaded transparently to cheap object storage (S3, GCS) via Tiered Storage, giving infinite replayability without infinite disk cost.

Why Kafka and not just Sidekiq?

At 10M records/day (~115 msg/sec), Sidekiq on Redis would handle the throughput comfortably. The reason for Kafka is architectural: in a financial platform, a single deposit event is typically consumed by multiple independent teams (Ledger, Notifications, Fraud, Analytics). Kafka's persistent, append-only log means each consumer reads at its own pace from the same topic. If a fraud detection worker crashes and loses state, it can replay from any offset. Sidekiq destroys messages on consumption — there is no replay.

Why Kafka and not RabbitMQ?

They are not competing tools — most banks use both simultaneously. The distinction is conceptual:

| | Kafka | RabbitMQ |
| --- | --- | --- |
| Broker | Dumb — stores bytes sequentially | Smart — applies routing rules, retries, DLQ |
| Consumer | Smart — tracks its own offset, manages errors | Dumb — picks up message, does work, ACKs |
| Purpose | Project — build and rebuild state from an immutable history of facts | React — execute a command once, reliably |
| Message lifecycle | Retained based on config (TTL, size, or Tiered Storage to S3) | Deleted on ACK |

Kafka fits this challenge because reconciliation is a projection: we are building the ledger state (final_reconciled_balance) by replaying an ordered, immutable log of financial events. Three properties are critical here:

Replayability — if a bug in BalanceCalculator corrupts reconciled balances for 3 hours, reset the consumer offset to before the incident and Kafka re-injects the exact same events for reprocessing. Impossible with RabbitMQ without manual DB surgery.

Efficient fan-out — multiple consumer groups (Ledger, Fraud, Notifications) each maintain their own offset pointer against the same physical log. RabbitMQ must duplicate messages into separate queues per consumer, multiplying RAM usage.

Strict ordering per user — using user_id as the partition key, Kafka guarantees all events for a given user land in the same partition and are processed by the same worker in chronological order. With RabbitMQ and multiple workers, Worker 2 could process a withdrawal before Worker 1 finishes the preceding deposit, producing a false negative balance.
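The sticky routing falls out of hashing the partition key. Kafka's default partitioner actually uses murmur2; the CRC32 below is only to illustrate the property that the same key always maps to the same partition:

```ruby
require 'zlib'

PARTITIONS = 12 # hypothetical partition count for the events topic

# Deterministic key → partition mapping: every event for a given user
# lands on the same partition, hence the same worker, in order.
def partition_for(user_id)
  Zlib.crc32(user_id) % PARTITIONS
end

partition_for('user_0012') == partition_for('user_0012') # => true
```

Note the flip side: repartitioning the topic changes the mapping, which is why partition counts are chosen generously up front.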

Where RabbitMQ would be the right choice — once a reconciliation is complete, if we need to asynchronously generate a PDF statement or send a notification email, RabbitMQ is the better fit: the broker handles retries, dead-letter queues, and complex routing. The message is a command, not an event — once executed it has no historical value and can be discarded.

Materialised view vs JSONB read model

Both solve the same architectural problem: a pre-computed, denormalised read model that can be served without joins or aggregations at query time. The difference is who maintains it:

  • JSONB read model — a worker builds and updates a JSON document ({ user_id:, balance:, last_txns: [...] }) stored in a jsonb column (Postgres, MongoDB, DynamoDB). The application code is responsible for the projection logic, handling concurrent updates and partial failures.
  • Materialised view — you give Postgres a SQL query (SUM, GROUP BY, etc.) and it executes it in the background, storing the result as a physical table on disk. SELECT balance FROM user_balances WHERE user_id = 123 reads that row directly — no computation at query time. Postgres owns the projection logic.

For this challenge the materialised view is the right fit: the projection logic is a straightforward aggregation that SQL expresses cleanly, and delegating it to the database engine avoids writing and maintaining worker-side update logic.

Why a read replica and not a separate database?

Introducing a separate database for reads makes sense when you are crossing a Bounded Context boundary — for example, sending data to a Marketing system or an analytical Data Warehouse where the schema, vocabulary, and business rules are fundamentally different.

Here the balance calculation belongs to the same domain as the transaction record: the Ledger. The schema is identical, the vocabulary is the same. Splitting into two databases would add operational complexity (distributed sync, split-brain failures) without any domain justification.

What we have is pragmatic CQRS within a single Bounded Context:

  • Write model — normalised events and bank_txns tables, optimised for fast inserts
  • Read model — the user_balances materialised view, a denormalised projection optimised for O(1) reads

Both live in the same Postgres engine. WAL replication is guaranteed at the system level — there is no risk of the primary and the read model diverging due to a failed sync job. A separate database would introduce exactly that failure mode.

A separate read database would only be warranted if the read domain had genuinely different requirements: full-text search (Elasticsearch), time-series analytics (TimescaleDB), or a different team owning a different Bounded Context entirely.

Does REFRESH block writes or reads?

REFRESH MATERIALIZED VIEW runs on the Primary. Postgres stores the materialised view as a physical table on disk — it is not computed on read. When the job runs, the Primary uses CPU to compute the aggregation and writes the result to its own disk. The Replica never calculates anything; it simply receives the resulting disk blocks via WAL and applies them.

Two concerns arise:

Does it block INSERT workers? No. Postgres uses MVCC (Multi-Version Concurrency Control). REFRESH only acquires an ACCESS SHARE lock on the source tables (events, bank_txns), so workers can continue inserting thousands of rows per second without contention.

Does it block client reads on the view? This depends on the command used:

  • REFRESH MATERIALIZED VIEW user_balances — acquires an ACCESS EXCLUSIVE lock on the view. Clients querying the balance are blocked for the entire duration of the refresh. Unacceptable in production.
  • REFRESH MATERIALIZED VIEW CONCURRENTLY user_balances — Postgres computes the new result in a temporary table in the background, then swaps it in atomically. Clients keep reading the previous version until the swap completes. Zero downtime, zero blocking. Requires a unique index on user_id on the view.

The diagram uses CONCURRENTLY for this reason.
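A minimal shape for the refresh job, with the connection injected so the SQL string is the only Postgres-specific piece (the class and method names here are illustrative, not the repo's code):

```ruby
class RefreshBalanceJob
  REFRESH_SQL = 'REFRESH MATERIALIZED VIEW CONCURRENTLY user_balances'.freeze

  def initialize(connection)
    @connection = connection # anything responding to #execute
  end

  def call
    @connection.execute(REFRESH_SQL)
  end
end

# In tests, a stub connection records the statement instead of hitting Postgres.
StubConnection = Struct.new(:statements) do
  def execute(sql)
    statements << sql
  end
end

stub = StubConnection.new([])
RefreshBalanceJob.new(stub).call
stub.statements # => ["REFRESH MATERIALIZED VIEW CONCURRENTLY user_balances"]
```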

Eventual consistency trade-off

The read replica is consistent (WAL lag is milliseconds). The materialised view is only as fresh as the last RefreshBalanceJob execution. Because the job is throttled, there may be a short window after a reconciliation where a user does not see the update immediately. The frontend handles this with an optimistic UI update — it adds the amount to the displayed balance locally the moment the request succeeds, without waiting for the next refresh.
