
Commit 9203da7

dependabot[bot] authored and git-steven committed

Bump tqdm from 4.66.2 to 4.66.3 in /python-git-steven (#1)

Bumps [tqdm](https://github.com/tqdm/tqdm) from 4.66.2 to 4.66.3.

- [Release notes](https://github.com/tqdm/tqdm/releases)
- [Commits](tqdm/tqdm@v4.66.2...v4.66.3)

---
updated-dependencies:
- dependency-name: tqdm
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

1 parent: de99395

18 files changed (+2719, -9 lines)

.gitignore

Lines changed: 3 additions & 1 deletion

```diff
@@ -4,4 +4,6 @@ _site
 Gemfile.lock
 .DS_Store
 **/.ipynb_checkpoints/*
-.idea
+.idea
+*.bkp
+*.dtmp
```
.ruby-version

Lines changed: 1 addition & 0 deletions

```diff
@@ -0,0 +1 @@
+3.3.10
```

CLAUDE.md

Lines changed: 73 additions & 0 deletions

# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Project Overview

This is a Jekyll-based blog/portfolio site using the Minimal Mistakes remote theme. The site showcases technical articles on data engineering, machine learning, software architecture, and related topics.

## Site Architecture

* **Static Site Generator**: Jekyll with GitHub Pages
* **Theme**: Minimal Mistakes (remote theme)
* **Content Structure**:
  * `_posts/` - Blog posts in markdown format with YAML front matter
  * `_pages/` - Static pages (About, Projects, Archives, etc.)
  * `_data/` - Site configuration data (navigation, etc.)
  * `_includes/` - Template partials and custom HTML
  * `assets/` - Images, CSS, and other static assets
  * `_site/` - Generated site output (ignored in git)

## Post Format

Blog posts follow Jekyll conventions:
* File naming: `YYYY-MM-DD-title-slug.md`
* YAML front matter includes: `title`, `date`, `categories`, `author`
* Content uses GitHub-flavored markdown with kramdown
* Posts may include draw.io diagrams (`.drawio` files) and exported images (`.drawio.png`)

## Common Development Commands

### Local Development

```bash
# Install dependencies
bundle install

# Serve site locally with live reload
bundle exec jekyll serve

# Serve with drafts visible
bundle exec jekyll serve --drafts

# Build site without serving
bundle exec jekyll build
```

### Working with Diagrams

The site uses draw.io diagrams embedded in posts. When creating or updating diagrams:
* Store `.drawio` source files in the `_posts/` directory alongside related posts
* Export to PNG using `draw.io.export <diagram>.drawio` (creates `.drawio.png`)
* Reference exported images in markdown: `![](/path/to/diagram.drawio.png)`
* Follow diagram styling from `~/.claude/CLAUDE.md` (rounded boxes, Sketch style, gradients)

### Content Guidelines

* Technical posts should be informative and approachable (see existing posts for tone)
* Include relevant images and diagrams to support explanations
* Use proper categorization for posts (data-engineering, machine learning, architecture, etc.)
* Posts cover topics: Python, scikit-learn, PySpark, FastAPI, AWS CDK, architecture patterns

## Site Configuration

* Main config: `_config.yml`
* Navigation: `_data/navigation.yml`
* Author info, social links, and site metadata in `_config.yml:62-107`
* Default post layout includes: author profile, read time, comments, sharing, related posts

## Theme Customization

* Skin: "dirt" (defined in `_config.yml:30`)
* Custom includes in `_includes/head/` for additional head elements
* Site uses pagination (5 posts per page)
* Archive pages generated via liquid templates (by year, category, tag)
_posts/2024-08-06-techniques-large-data-vis.md

Lines changed: 18 additions & 5 deletions

```diff
@@ -1,6 +1,6 @@
----
+---
 title: "Taming the Data Beast: A Guide to Visualizing Massive Datasets"
-date: 2024-08-06 09:45:00 -0500
+date: 2024-08-06 0:45:00 -0500
 categories:
 - data-engineering
 - data-visualization
@@ -19,7 +19,7 @@ Remember the good old days when a "large dataset" meant a few thousand rows in E

 But fear not, dear reader! We're about to embark on a journey through the wild world of big data preprocessing. Buckle up, because things are about to get... mildly exciting.

-## The Art of Data Reduction
+## The Art and Science of Data Reduction

 ### Sampling: Less is More (Sometimes)

@@ -33,6 +33,8 @@ This is like organizing a really diverse party. You make sure every group is rep

 Aggregation is the art of smooshing data together until it fits into a manageable size. It's like making a smoothie out of your fruit salad – you lose some detail, but at least it fits in the cup.

+Aggregations include `sum`, `count`, `average`, `median`, as well as [Standard Deviation](https://en.wikipedia.org/wiki/Standard_deviation) and [Variance](https://en.wikipedia.org/wiki/Variance).
+
 ### Binning: Put a Lid on It

 Continuous data is like that friend who never stops talking. Binning can be particularly useful when you're dealing with things like age ranges or income brackets.
```
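For a concrete feel for the aggregation and binning this hunk adds, here is a minimal pandas sketch (the `transactions` frame and its columns are made up for illustration, not from the post):

```python
import numpy as np
import pandas as pd

# Hypothetical data: a million customer ages and order amounts
rng = np.random.default_rng(42)
transactions = pd.DataFrame({
    "age": rng.integers(18, 90, 1_000_000),
    "amount": rng.gamma(2.0, 35.0, 1_000_000),
})

# Binning: collapse the continuous age column into labeled brackets
transactions["age_bracket"] = pd.cut(
    transactions["age"],
    bins=[17, 25, 35, 50, 65, 90],
    labels=["18-25", "26-35", "36-50", "51-65", "66+"],
)

# Aggregation: one summary row per bracket instead of a million points
summary = transactions.groupby("age_bracket", observed=True)["amount"].agg(
    ["count", "mean", "median", "std"]
)
print(summary)
```

Five summary rows plot comfortably; a million raw points do not.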
```diff
@@ -43,11 +45,18 @@ Continuous data is like that friend who never stops talking. Binning can be part
 Dimensionality reduction is a technique used to reduce the number of features (or dimensions) in a dataset while preserving as much of the important information as possible. It's like taking a complex, multi-faceted object and creating a simpler representation that still captures its essence. It is used a lot in data science, data engineering and machine learning where the data has high-dimensionality.

 ### PCA (Principal Component Analysis)
-Imagine you're trying to describe your eccentric aunt to a friend. Instead of listing all her quirks, you focus on the top three that really capture her essence. That's PCA in a nutshell.
+Imagine you're trying to describe your eccentric aunt to a friend. Instead of listing all her quirks, you focus on the top three that really capture her essence. That's [PCA](https://en.wikipedia.org/wiki/Principal_component_analysis) in a nutshell.

-### t-SNE and UMAP
+### [t-SNE](https://en.wikipedia.org/wiki/T-distributed_stochastic_neighbor_embedding) and [UMAP](https://en.wikipedia.org/wiki/Nonlinear_dimensionality_reduction#Uniform_manifold_approximation_and_projection)
 These are the cool kids of dimension reduction. They're great at preserving local structures in your data, kind of like how a good caricature exaggerates your most distinctive features.

+* **t-SNE** (t-Distributed Stochastic Neighbor Embedding): A nonlinear dimensionality reduction technique that excels at preserving local structures in high-dimensional data by modeling similar data points as nearby points in a lower-dimensional space, making it particularly effective for visualizing clusters or patterns in complex datasets.
+
+* **UMAP** (Uniform Manifold Approximation and Projection): A dimensionality reduction algorithm that aims to preserve both local and global structures of high-dimensional data in lower dimensions, offering faster computation times than t-SNE and often providing a better balance between maintaining local relationships and capturing the overall data topology.
+
 ## The "Let's Not Crash Our Computer" Techniques

 ### Incremental Processing
```
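The dimensionality-reduction section maps directly onto scikit-learn. A hedged sketch of the PCA step on synthetic data (nothing here comes from the post itself):

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic high-dimensional data: 10,000 samples, 50 correlated features
# generated from 3 hidden factors
rng = np.random.default_rng(0)
latent = rng.normal(size=(10_000, 3))
X = latent @ rng.normal(size=(3, 50)) + 0.1 * rng.normal(size=(10_000, 50))

# Keep the top three components -- the aunt's "top three quirks"
pca = PCA(n_components=3)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                # (10000, 3)
print(pca.explained_variance_ratio_)  # variance captured per component
```

t-SNE and UMAP drop in similarly (`sklearn.manifold.TSNE`, the `umap-learn` package), though both are far more sensitive to their hyperparameters than PCA.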
```diff
@@ -56,6 +65,7 @@ This is the data equivalent of eating an elephant one bite at a time. It's not f
 ### Data Sketching
 Think of this as the CliffsNotes of your data. It gives you the gist without all the details. Data sketching is a set of techniques used to process and analyze very large datasets efficiently, often with a single pass through the data. These methods provide approximate answers to queries about the data, trading off some accuracy for significant gains in speed and memory usage.

+
 #### Key aspects of data sketching:
 * **Single-pass algorithms:** They typically only need to see each data item once.
 * **Sub-linear space usage:** They use much less memory than the size of the input.
@@ -69,6 +79,9 @@ Common data sketching techniques include:
 * **T-Digest:** Estimates quantiles and histograms.
 * **Reservoir Sampling:** Maintains a random sample of a stream.

+[Data Sketching](#data-sketching) probably deserves its own article, and it will likely be my next one.
+
+
 ## Applying These Techniques

 Let's say you're a data engineer at GigantoCorp, and you've just been handed a dataset with 500 million customer transactions. Your boss wants a "quick visual summary" by tomorrow morning. (Because apparently, that's a reasonable request.)
```
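Incremental processing, from the hunk above, is something pandas supports out of the box via chunked reads. A sketch, assuming a hypothetical `transactions.csv` with an `amount` column:

```python
import pandas as pd

# Stream a file too big for memory, one million rows at a time,
# accumulating only the running totals
total, count = 0.0, 0
for chunk in pd.read_csv("transactions.csv", chunksize=1_000_000):
    total += chunk["amount"].sum()
    count += len(chunk)

print(f"mean amount over {count:,} rows: {total / count:.2f}")
```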
Lines changed: 169 additions & 0 deletions

---
title: "Data Sketching: The Art of Guesstimating with Big Data"
date: 2024-08-07 7:45:00 -0500
categories:
- dataengineering
- datavisualization
- datasketching
- datasketch
author: steven
---

# Data Sketching: The Art of Guesstimating with Big Data

Picture this: You're at a county fair, and there's a huge jar of jellybeans. The person who guesses closest to the actual number wins a prize. Now, you could try counting each jellybean individually, but that would take forever. Instead, you might estimate based on the jar's size, how densely packed the beans are, and maybe a quick count of one small section. That's essentially what data sketching does, but for massive datasets.

## What is Data Sketching?

Data sketching is like being a clever detective with big data. Instead of examining every single piece of evidence (which could take years), you use smart techniques to get a good idea of what's going on without looking at everything. It's all about making educated guesses that are "good enough" for practical purposes.

The beauty of data sketching is that it lets you work with enormous amounts of data using limited memory and processing power. It's like summarizing a thousand-page novel in a few paragraphs - you lose some details, but you capture the essence.

## Aspects of Data Sketching

1. **One-Pass Processing**: Data sketching algorithms typically only need to see each data point once. This is crucial when dealing with streaming data or datasets too large to fit in memory.

2. **Approximate Results**: Sketches provide estimates, not exact answers. But these estimates often come with provable error bounds, so you know how much to trust them.

3. **Space Efficiency**: Sketches use way less memory than the full dataset. It's like compressing a high-res photo into a smaller file - you lose some quality, but it's much easier to store and share.

4. **Fast Processing**: Because sketches are small and approximate, computations on them are typically very fast.

## Data Sketching Techniques and When to Use Them

### Counting and Frequency Estimation

- **Count-Min Sketch**:
  - What it does: Estimates how often items appear in a data stream.
  - When to use it: Tracking trending hashtags on social media or popular products in e-commerce.

- **HyperLogLog**:
  - What it does: Estimates the number of unique elements (cardinality) in a dataset.
  - When to use it: Counting unique visitors to a website or unique products in a large inventory.

### Set Operations

**Bloom Filters**:
- What it does: Tests whether an element is a member of a set.
- When to use it: Checking if a username already exists or if an email is in a spam database.
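The `datasketch` examples later in this post cover HyperLogLog and Count-Min Sketch but not Bloom filters, so here is a minimal from-scratch sketch for illustration (the bit-array size and hash count are hand-picked, not derived from a target false-positive rate):

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, n_bits=10_000, n_hashes=5):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = bytearray(n_bits)  # one byte per bit, for simplicity

    def _positions(self, item):
        # Derive k bit positions by seeding a digest per hash index
        for seed in range(self.n_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.n_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item):
        return all(self.bits[pos] for pos in self._positions(item))

usernames = BloomFilter()
usernames.add("ada")
usernames.add("grace")
print("ada" in usernames)   # True
print("alan" in usernames)  # False (or, rarely, a false positive)
```

Note that deletions aren't supported: clearing a bit could erase evidence of other items, which is why counting Bloom filters exist as a variant.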
### Quantiles and Rankings

**T-Digest**:
- What it does: Estimates percentiles and creates histograms.
- When to use it: Analyzing response times in a web service or summarizing large datasets of numerical values.

### Sampling

**Reservoir Sampling**:
- What it does: Maintains a random sample of items from a data stream.
- When to use it: Keeping a representative sample of user actions on a busy website.
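Reservoir sampling is simple enough to write out by hand. Here's a quick illustrative sketch of Algorithm R (not code from the library examples below):

```python
import random

def reservoir_sample(stream, k):
    """Keep a uniform random sample of k items from a stream of unknown length."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            # Item i replaces a reservoir slot with probability k / (i + 1)
            j = random.randrange(i + 1)
            if j < k:
                reservoir[j] = item
    return reservoir

# A uniform sample of 5 "user actions" from a stream of a million
print(reservoir_sample(range(1_000_000), 5))
```

Every item ends up in the sample with the same probability, k/n, even though n is never known in advance.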
## Data Sketching in Python with Visualizations

In Python, the `datasketch` library is a popular choice for data sketching. Let's look at some examples and create some impressive visualizations.

### HyperLogLog for Cardinality Estimation

First, let's use HyperLogLog to estimate unique visitors to a website:

```python
import numpy as np
import matplotlib.pyplot as plt
from datasketch import HyperLogLog


def simulate_website_traffic(n_days, n_visitors_per_day):
    hll = HyperLogLog()
    true_uniques = set()
    estimated_uniques = []
    true_uniques_count = []

    for _ in range(n_days):
        # Draw visitor IDs from a pool 10x the daily volume so repeats occur
        daily_visitors = np.random.randint(0, n_visitors_per_day * 10, n_visitors_per_day)
        for visitor in daily_visitors:
            hll.update(str(visitor).encode('utf8'))
            true_uniques.add(visitor)
        estimated_uniques.append(hll.count())  # count() returns the cardinality estimate
        true_uniques_count.append(len(true_uniques))

    return estimated_uniques, true_uniques_count


# Simulate 30 days of traffic
n_days = 30
n_visitors_per_day = 10000
estimated, true = simulate_website_traffic(n_days, n_visitors_per_day)

# Plotting
plt.figure(figsize=(12, 6))
plt.plot(range(1, n_days + 1), true, label='True Uniques', marker='o')
plt.plot(range(1, n_days + 1), estimated, label='HyperLogLog Estimate', marker='x')
plt.title('Website Unique Visitors: True vs HyperLogLog Estimate')
plt.xlabel('Days')
plt.ylabel('Unique Visitors')
plt.legend()
plt.grid(True)
plt.show()
```
This code simulates website traffic and compares the true number of unique visitors with the HyperLogLog estimate. The resulting plot shows how accurate the estimation is over time.
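As a rough calibration (an editorial estimate, assuming `datasketch`'s default precision of p = 8): HyperLogLog's standard error is about 1.04/√m with m = 2^p registers, or roughly 6.5% here. You can check the simulation against that directly:

```python
# Relative error of the final-day estimate, using the arrays computed above
rel_err = abs(estimated[-1] - true[-1]) / true[-1]
print(f"Final relative error: {rel_err:.1%}")  # typically a few percent
```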
### Count-Min Sketch for Frequency Estimation

Now, let's use Count-Min Sketch to estimate word frequencies in a large text:

```python
import numpy as np
import matplotlib.pyplot as plt
from countminsketch import CountMinSketch


def generate_word_stream(n_words, vocabulary_size):
    return np.random.randint(0, vocabulary_size, n_words)


# Generate a stream of words
n_words = 1000000
vocabulary_size = 1000
word_stream = generate_word_stream(n_words, vocabulary_size)

# Use Count-Min Sketch: 1000 counters per row, 10 hash functions
cms = CountMinSketch(1000, 10)
for word in word_stream:
    cms.add(int(word))  # hash plain ints so add() and query() agree

# Get true frequencies
true_freq = np.bincount(word_stream)

# Get estimated frequencies; query() returns the estimated count for an item
estimated_freq = [cms.query(i) for i in range(vocabulary_size)]

# Plot results
plt.figure(figsize=(12, 6))
plt.scatter(true_freq, estimated_freq, alpha=0.5)
plt.plot([0, max(true_freq)], [0, max(true_freq)], 'r--')  # Perfect estimation line
plt.title('Word Frequency: True vs Count-Min Sketch Estimate')
plt.xlabel('True Frequency')
plt.ylabel('Estimated Frequency')
plt.grid(True)
plt.show()
```
This code generates a stream of "words" (represented by integers) and uses Count-Min Sketch to estimate their frequencies. The scatter plot compares true frequencies with estimated frequencies. Since Count-Min Sketch can only overcount (hash collisions add to a counter, never subtract), the points fall on or above the perfect-estimation line.

These visualizations demonstrate the power of data sketching techniques. They allow us to process and analyze enormous amounts of data efficiently, providing useful insights without the need to store or process every single data point.

Remember, data sketching is all about making smart trade-offs. You're exchanging a bit of accuracy for a lot of speed and efficiency. It's not about getting perfect answers, but about getting useful insights from data that would otherwise be too big to handle.

So next time you're faced with a dataset as big as that jellybean jar at the county fair, don't panic! Reach for a data sketch, and you'll be making educated guesses in no time.
