Streaming Sketches

Sketch-based streaming algorithms for estimating counts and L1 differences between two event streams. The repo contains pedagogical implementations of:

A classic Cauchy projection sketch (L1DifferenceSketch) that maintains a dense random matrix and updates it online.
A hash-based variant (HashCauchyL1DifferenceSketch) that recreates matrix entries on the fly so the memory footprint depends only on the sketch dimension.
A simple Count-Min Sketch that provides the usual epsilon/delta guarantees for point queries in a data stream.

The code favors clarity over micro-optimizations and prints intermediate values so it doubles as a learning aid.

Repository layout

├── README.md              ← You are here
├── requirements.txt       ← Python dependencies
└── src
    ├── main.py            ← Small driver script with demo streams
    ├── sketching.py       ← Sketch implementations
    └── utils.py           ← Helper for pretty-printing matrices

Requirements

Python 3.10+ (tested with CPython 3.11)
pip and (optionally) python -m venv
Dependencies listed in requirements.txt (currently only NumPy)

Setup

python -m venv .venv
source .venv/bin/activate  # Windows: .venv\\Scripts\\activate
pip install -r requirements.txt

Running the examples

The driver in src/main.py wires up both L1-difference sketches with tiny streams so you can see the estimator in action.

python src/main.py

Inside main.py, toggle the function calls at the bottom to switch between the dense and hash-based sketches. Each run prints the sketch parameters, intermediate projections, and the final estimate so you can compare it against the hand-computed truth.

Using the sketches from your own code

All sketches live in src/sketching.py. Import the class you need and process events as they arrive.

from sketching import L1DifferenceSketch, HashCauchyL1DifferenceSketch, CountMinSketch

sketch = L1DifferenceSketch(n=1_000, k=50, random_seed=7)
sketch.process_stream([
    ("red", 10),
    ("blue", 10),
    ("red", 42),
])
print(sketch.estimate_l1_distance())

hash_sketch = HashCauchyL1DifferenceSketch(n=1_000_000, k=50, seed=99)
for color, idx in stream:
    hash_sketch.process_event(color, idx)

cms = CountMinSketch(epsilon=0.01, delta=0.001)
for item in stream_of_items:
    cms.update(item)
print(cms.estimate("my-item"))

Choosing parameters

k controls the accuracy of the L1 difference sketches: k = Θ(1/ε²) yields a (1 ± ε) approximation with constant probability.
n should match the size of the event universe; the hash-based sketch never materializes the k × n matrix, so it is better suited to very large universes.
For Count-Min Sketch, epsilon controls the additive error (≈ ε · total_count) and delta the failure probability (1 - δ).

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Streaming Sketches

Repository layout

Requirements

Setup

Running the examples

Using the sketches from your own code

Choosing parameters

About

Uh oh!

Releases

Packages

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 3 Commits
src		src
README.md		README.md
requirements.txt		requirements.txt

miguel-merlin/streaming_algorithms

Folders and files

Latest commit

History

Repository files navigation

Streaming Sketches

Repository layout

Requirements

Setup

Running the examples

Using the sketches from your own code

Choosing parameters

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages