git-steven
diff --git a/.gitignore b/.gitignore
Lines changed: 3 additions & 1 deletion
diff --git a/.ruby-version b/.ruby-version
Lines changed: 1 addition & 0 deletions
diff --git a/CLAUDE.md b/CLAUDE.md
Lines changed: 73 additions & 0 deletions
diff --git a/_posts/2024-08-06-techniques-large-data-vis.md b/_posts/2024-08-06-techniques-large-data-vis.md
Lines changed: 18 additions & 5 deletions
diff --git a/_posts/2024-08-07-data-sketching.md b/_posts/2024-08-07-data-sketching.md
Lines changed: 169 additions & 0 deletions
@@ -4,4 +4,6 @@ _site
 Gemfile.lock
 .DS_Store
 **/.ipynb_checkpoints/*
-.idea
+.idea
+*.bkp
+*.dtmp
@@ -0,0 +1 @@
+3.3.10
@@ -0,0 +1,73 @@
+# CLAUDE.md
+
+This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
+
+## Project Overview
+
+This is a Jekyll-based blog/portfolio site using the Minimal Mistakes remote theme. The site showcases technical articles on data engineering, machine learning, software architecture, and related topics.
+
+## Site Architecture
+
+* **Static Site Generator**: Jekyll with GitHub Pages
+* **Theme**: Minimal Mistakes (remote theme)
+* **Content Structure**:
+  * `_posts/` - Blog posts in markdown format with YAML front matter
+  * `_pages/` - Static pages (About, Projects, Archives, etc.)
+  * `_data/` - Site configuration data (navigation, etc.)
+  * `_includes/` - Template partials and custom HTML
+  * `assets/` - Images, CSS, and other static assets
+  * `_site/` - Generated site output (ignored in git)
+
+## Post Format
+
+Blog posts follow Jekyll conventions:
+* File naming: `YYYY-MM-DD-title-slug.md`
+* YAML front matter includes: `title`, `date`, `categories`, `author`
+* Content uses GitHub-flavored markdown with kramdown
+* Posts may include draw.io diagrams (`.drawio` files) and exported images (`.drawio.png`)
+
+## Common Development Commands
+
+### Local Development
+```bash
+# Install dependencies
+bundle install
+
+# Serve site locally with live reload
+bundle exec jekyll serve
+
+# Serve with drafts visible
+bundle exec jekyll serve --drafts
+
+# Build site without serving
+bundle exec jekyll build
+```
+
+### Working with Diagrams
+
+The site uses draw.io diagrams embedded in posts. When creating or updating diagrams:
+* Store `.drawio` source files in `_posts/` directory alongside related posts
+* Export to PNG using `draw.io.export <diagram>.drawio` (creates `.drawio.png`)
+* Reference exported images in markdown: `![](/path/to/diagram.drawio.png)`
+* Follow diagram styling from `~/.claude/CLAUDE.md` (rounded boxes, Sketch style, gradients)
+
+### Content Guidelines
+
+* Technical posts should be informative and approachable (see existing posts for tone)
+* Include relevant images and diagrams to support explanations
+* Use proper categorization for posts (data-engineering, machine learning, architecture, etc.)
+* Posts cover topics: Python, scikit-learn, PySpark, FastAPI, AWS CDK, architecture patterns
+
+## Site Configuration
+
+* Main config: `_config.yml`
+* Navigation: `_data/navigation.yml`
+* Author info, social links, and site metadata in `_config.yml:62-107`
+* Default post layout includes: author profile, read time, comments, sharing, related posts
+
+## Theme Customization
+
+* Skin: "dirt" (defined in `_config.yml:30`)
+* Custom includes in `_includes/head/` for additional head elements
+* Site uses pagination (5 posts per page)
+* Archive pages generated via liquid templates (by year, category, tag)
@@ -1,6 +1,6 @@
----
+---
 title:  "Taming the Data Beast: A Guide to Visualizing Massive Datasets"
-date:   2024-08-06 09:45:00 -0500
+date:   2024-08-06 0:45:00 -0500
 categories:
 - data-engineering
 - data-visualization
@@ -19,7 +19,7 @@ Remember the good old days when a "large dataset" meant a few thousand rows in E
 
 But fear not, dear reader! We're about to embark on a journey through the wild world of big data preprocessing. Buckle up, because things are about to get... mildly exciting.
 
-## The Art of Data Reduction
+## The Art and Science of Data Reduction
 
 ### Sampling: Less is More (Sometimes)
 
@@ -33,6 +33,8 @@ This is like organizing a really diverse party. You make sure every group is rep
 
 Aggregation is the art of smooshing data together until it fits into a manageable size. It's like making a smoothie out of your fruit salad – you lose some detail, but at least it fits in the cup.
 
+Aggregations include `sum`, `count`, `average`, `median`, as well as [Standard Deviation](https://en.wikipedia.org/wiki/Standard_deviation) and [Variance](https://en.wikipedia.org/wiki/Variance).
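
To make the aggregation list concrete, here is a minimal sketch using only the Python standard library. The `sales` records and the `aggregate_by_key` helper are hypothetical names invented for illustration; a real pipeline would more likely use pandas or Spark for the same grouping step.

```python
from collections import defaultdict
from statistics import mean, median, pstdev, pvariance

# Hypothetical raw records: (region, sale_amount)
sales = [
    ("north", 120.0), ("north", 80.0), ("north", 100.0),
    ("south", 200.0), ("south", 220.0),
]


def aggregate_by_key(rows):
    """Collapse (key, value) rows into one summary dict per key."""
    groups = defaultdict(list)
    for key, value in rows:
        groups[key].append(value)
    return {
        key: {
            "count": len(values),
            "sum": sum(values),
            "average": mean(values),
            "median": median(values),
            "std_dev": pstdev(values),      # population standard deviation
            "variance": pvariance(values),  # population variance
        }
        for key, values in groups.items()
    }


summary = aggregate_by_key(sales)
for region, stats in summary.items():
    print(region, stats)
```

Five rows shrink to two summary rows here; the same idea is what turns millions of transactions into a chart-sized table.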
+
 ### Binning: Put a Lid on It
 
 Continuous data is like that friend who never stops talking. Binning can be particularly useful when you're dealing with things like age ranges or income brackets.
@@ -43,11 +45,18 @@ Continuous data is like that friend who never stops talking. Binning can be part
 Dimensionality reduction is a technique used to reduce the number of features (or dimensions) in a dataset while preserving as much of the important information as possible. It's like taking a complex, multi-faceted object and creating a simpler representation that still captures its essence. It is widely used in data science, data engineering, and machine learning, where datasets are often high-dimensional.
 
 ### PCA (Principal Component Analysis)
-Imagine you're trying to describe your eccentric aunt to a friend. Instead of listing all her quirks, you focus on the top three that really capture her essence. That's PCA in a nutshell.
+Imagine you're trying to describe your eccentric aunt to a friend. Instead of listing all her quirks, you focus on the top three that really capture her essence. That's [PCA](https://en.wikipedia.org/wiki/Principal_component_analysis) in a nutshell.
 
-### t-SNE and UMAP
+### [t-SNE](https://en.wikipedia.org/wiki/T-distributed_stochastic_neighbor_embedding) and [UMAP](https://en.wikipedia.org/wiki/Nonlinear_dimensionality_reduction#Uniform_manifold_approximation_and_projection)
 These are the cool kids of dimension reduction. They're great at preserving local structures in your data, kind of like how a good caricature exaggerates your most distinctive features.
 
+* **t-SNE** (t-Distributed Stochastic Neighbor Embedding): A nonlinear dimensionality reduction technique that excels at preserving local structures in high-dimensional data by modeling similar data points as nearby points in a lower-dimensional space, making it particularly effective for visualizing clusters or patterns in complex datasets.
+
+* **UMAP** (Uniform Manifold Approximation and Projection): A dimensionality reduction algorithm that aims to preserve both local and global structures of high-dimensional data in lower dimensions, offering faster computation times than t-SNE and often providing a better balance between maintaining local relationships and capturing the overall data topology.
+
 ## The "Let's Not Crash Our Computer" Techniques
 
 ### Incremental Processing
@@ -56,6 +65,7 @@ This is the data equivalent of eating an elephant one bite at a time. It's not f
 ### Data Sketching
 Think of this as the CliffsNotes of your data. It gives you the gist without all the details. Data sketching is a set of techniques used to process and analyze very large datasets efficiently, often with a single pass through the data. These methods provide approximate answers to queries about the data, trading off some accuracy for significant gains in speed and memory usage.
 
+
 #### Key aspects of data sketching:
 * **Single-pass algorithms:** They typically only need to see each data item once.
 * **Sub-linear space usage:** They use memory much less than the size of the input.
@@ -69,6 +79,9 @@ Common data sketching techniques include:
 * **T-Digest:** Estimates quantiles and histograms.
 * **Reservoir Sampling:** Maintains a random sample of a stream.
 
+[Data Sketching](#data-sketching) deserves its own article, which will probably be my next one.
+
+
 ## Applying These Techniques
 
 Let's say you're a data engineer at GigantoCorp, and you've just been handed a dataset with 500 million customer transactions. Your boss wants a "quick visual summary" by tomorrow morning. (Because apparently, that's a reasonable request.)
 
@@ -0,0 +1,169 @@
+---
+title: "Data Sketching: The Art of Guesstimating with Big Data"
+date: 2024-08-07 07:45:00 -0500
+categories:
+  - data-engineering
+  - data-visualization
+  - datasketching
+  - datasketch
+author: steven
+---
+
+# Data Sketching: The Art of Guesstimating with Big Data
+
+Picture this: You're at a county fair, and there's a huge jar of jellybeans. The person who guesses closest to the actual number wins a
+prize. Now, you could try counting each jellybean individually, but that would take forever. Instead, you might estimate based on the jar's
+size, how densely packed the beans are, and maybe a quick count of one small section. That's essentially what data sketching does, but for
+massive datasets.
+
+## What is Data Sketching?
+
+Data sketching is like being a clever detective with big data. Instead of examining every single piece of evidence (which could take years),
+you use smart techniques to get a good idea of what's going on without looking at everything. It's all about making educated guesses that
+are "good enough" for practical purposes.
+
+The beauty of data sketching is that it lets you work with enormous amounts of data using limited memory and processing power. It's like
+summarizing a thousand-page novel in a few paragraphs - you lose some details, but you capture the essence.
+
+## Aspects of Data Sketching
+
+1. **One-Pass Processing**: Data sketching algorithms typically only need to see each data point once. This is crucial when dealing with
+   streaming data or datasets too large to fit in memory.
+
+2. **Approximate Results**: Sketches provide estimates, not exact answers. But these estimates often come with provable error bounds, so you
+   know how much to trust them.
+
+3. **Space Efficiency**: Sketches use way less memory than the full dataset. It's like compressing a high-res photo into a smaller file -
+   you lose some quality, but it's much easier to store and share.
+
+4. **Fast Processing**: Because sketches are small and approximate, computations on them are typically very fast.
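
To put numbers on "provable error bounds", here is a small sketch of two standard textbook results. HyperLogLog with `m` registers has a relative standard error of roughly `1.04 / sqrt(m)`. A Count-Min Sketch with width `ceil(e / epsilon)` and depth `ceil(ln(1 / delta))` overestimates any count by at most `epsilon * N` (where `N` is the total number of items seen) with probability at least `1 - delta`. The function names below are made up for illustration and are not part of any library.

```python
import math


def hll_relative_error(p):
    """Approximate relative standard error of HyperLogLog with 2**p registers."""
    m = 2 ** p
    return 1.04 / math.sqrt(m)


def count_min_dimensions(epsilon, delta):
    """Width and depth so that overcount <= epsilon * N with probability >= 1 - delta."""
    width = math.ceil(math.e / epsilon)
    depth = math.ceil(math.log(1 / delta))
    return width, depth


print(hll_relative_error(12))            # 4096 registers -> about 1.6% error
print(count_min_dimensions(0.01, 0.01))  # (272, 5)
```

The takeaway is that accuracy is something you choose up front by sizing the sketch, not something you discover afterwards.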
+
+## Data Sketching Techniques and When to Use Them
+
+### Counting and Frequency Estimation
+- **Count-Min Sketch**:
+  - What it does: Estimates how often items appear in a data stream.
+  - When to use it: Tracking trending hashtags on social media or popular products in e-commerce.
+
+- **HyperLogLog**:
+  - What it does: Estimates the number of unique elements (cardinality) in a dataset.
+  - When to use it: Counting unique visitors to a website or unique products in a large inventory.
+
+### Set Operations
+- **Bloom Filter**:
+  - What it does: Tests whether an element is probably in a set. It can return false positives, but never false negatives.
+  - When to use it: Checking if a username already exists or if an email is in a spam database.
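
As a rough illustration of the Bloom filter idea, here is a minimal pure-Python sketch. The `BloomFilter` class, its default sizes, and the trick of salting SHA-256 with a seed are all assumptions made for this example; production code would pick the bit count and number of hashes from the expected item count and target false-positive rate, and would pack the bits more tightly.

```python
import hashlib


class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for clarity over compactness

    def _positions(self, item):
        # Derive num_hashes positions by salting a single hash function with a seed
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item):
        # If any position is still 0, the item was definitely never added
        return all(self.bits[pos] for pos in self._positions(item))


taken = BloomFilter()
for name in ["alice", "bob", "carol"]:
    taken.add(name)

print("alice" in taken)  # True: added items are always found
print("zed" in taken)    # usually False; a True here would be a false positive
```

Note that the filter never stores the usernames themselves, only bits, which is where the memory savings come from.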
+
+### Quantiles and Rankings
+- **T-Digest**:
+  - What it does: Estimates percentiles and creates histograms.
+  - When to use it: Analyzing response times in a web service or summarizing large datasets of numerical values.
+
+### Sampling
+- **Reservoir Sampling**:
+  - What it does: Maintains a random sample of items from a data stream.
+  - When to use it: Keeping a representative sample of user actions on a busy website.
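
Reservoir sampling is short enough to show in full. The sketch below follows the classic "Algorithm R" approach; the function name `reservoir_sample` and the optional `rng` parameter are choices made for this example, not part of any library.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Return k items chosen uniformly at random from a stream of unknown length."""
    rng = rng or random.Random()
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)  # fill the reservoir with the first k items
        else:
            j = rng.randint(0, i)   # inclusive on both ends
            if j < k:
                reservoir[j] = item  # keep the new item with probability k / (i + 1)
    return reservoir


sample = reservoir_sample(range(100_000), k=10, rng=random.Random(42))
print(sample)
```

The stream is read once, and memory stays at `k` items no matter how long the stream runs.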
+
+## Data Sketching in Python with Visualizations
+In Python, the `datasketch` library is a popular choice for data sketching and provides the HyperLogLog used below; the Count-Min Sketch
+example relies on a separate package. Let's look at some examples and plot the results.
+
+### HyperLogLog for Cardinality Estimation
+
+First, let's use HyperLogLog to estimate unique visitors to a website:
+
+```python
+import numpy as np
+import matplotlib.pyplot as plt
+from datasketch import HyperLogLog
+
+
+def simulate_website_traffic(n_days, n_visitors_per_day):
+  hll = HyperLogLog()
+  true_uniques = set()
+  estimated_uniques = []
+  true_uniques_count = []
+
+  for _ in range(n_days):
+    daily_visitors = np.random.randint(0, n_visitors_per_day * 10, n_visitors_per_day)
+    for visitor in daily_visitors:
+      hll.update(str(visitor).encode('utf8'))
+      true_uniques.add(visitor)
+    estimated_uniques.append(len(hll))
+    true_uniques_count.append(len(true_uniques))
+
+  return estimated_uniques, true_uniques_count
+
+
+# Simulate 30 days of traffic
+n_days = 30
+n_visitors_per_day = 10000
+estimated, true = simulate_website_traffic(n_days, n_visitors_per_day)
+
+# Plotting
+plt.figure(figsize=(12, 6))
+plt.plot(range(1, n_days + 1), true, label='True Uniques', marker='o')
+plt.plot(range(1, n_days + 1), estimated, label='HyperLogLog Estimate', marker='x')
+plt.title('Website Unique Visitors: True vs HyperLogLog Estimate')
+plt.xlabel('Days')
+plt.ylabel('Unique Visitors')
+plt.legend()
+plt.grid(True)
+plt.show()
+```
+
+This code simulates website traffic and compares the true number of unique visitors with the HyperLogLog estimate. The resulting plot shows
+how accurate the estimation is over time.
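
For a feel of what happens inside a HyperLogLog, here is a stripped-down, standard-library-only sketch. It is not how `datasketch` implements it: the `TinyHyperLogLog` class, the use of SHA-1 as the hash, and the default `p=12` are assumptions for illustration, and the large-range corrections of the full algorithm are left out.

```python
import hashlib
import math


class TinyHyperLogLog:
    def __init__(self, p=12):
        self.p = p                # number of index bits (this sketch assumes p >= 7)
        self.m = 1 << p           # number of registers
        self.registers = [0] * self.m

    def add(self, item):
        h = int.from_bytes(hashlib.sha1(item.encode("utf8")).digest()[:8], "big")
        index = h >> (64 - self.p)                     # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)          # remaining 64 - p bits
        rank = (64 - self.p) - rest.bit_length() + 1   # position of the leftmost 1-bit
        self.registers[index] = max(self.registers[index], rank)

    def count(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)          # bias correction for m >= 128
        raw = alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            return self.m * math.log(self.m / zeros)   # small-range (linear counting) fix
        return raw


hll = TinyHyperLogLog()
for i in range(100_000):
    hll.add(f"visitor-{i % 50_000}")  # 50,000 distinct visitors, each seen twice

print(round(hll.count()))  # should land close to 50000
```

Repeat visitors hash to the same register and rank, so they never change the estimate; only the 4,096 small registers are kept in memory.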
+
+### Count-Min Sketch for Frequency Estimation
+
+Now, let's use Count-Min Sketch to estimate word frequencies in a large text:
+
+```python
+import numpy as np
+import matplotlib.pyplot as plt
+from probables import CountMinSketch  # assumes the pyprobables package (pip install pyprobables)
+
+
+def generate_word_stream(n_words, vocabulary_size):
+  return np.random.randint(0, vocabulary_size, n_words)
+
+
+# Generate a stream of words
+n_words = 1000000
+vocabulary_size = 1000
+word_stream = generate_word_stream(n_words, vocabulary_size)
+
+# Use Count-Min Sketch (keys are passed as strings)
+cms = CountMinSketch(width=1000, depth=10)
+for word in word_stream:
+  cms.add(str(word))
+
+# Get true frequencies (minlength keeps one slot per vocabulary entry)
+true_freq = np.bincount(word_stream, minlength=vocabulary_size)
+
+# Get estimated frequencies
+estimated_freq = [cms.check(str(i)) for i in range(vocabulary_size)]
+
+# Plot results
+plt.figure(figsize=(12, 6))
+plt.scatter(true_freq, estimated_freq, alpha=0.5)
+plt.plot([0, max(true_freq)], [0, max(true_freq)], 'r--')  # Perfect estimation line
+plt.title('Word Frequency: True vs Count-Min Sketch Estimate')
+plt.xlabel('True Frequency')
+plt.ylabel('Estimated Frequency')
+plt.grid(True)
+plt.show()
+```
+
+This code generates a stream of "words" (represented by integers) and uses Count-Min Sketch to estimate their frequencies. The scatter plot
+compares true frequencies with estimated frequencies.
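
To see why points in a plot like this sit on or above the diagonal, it helps to look at how a Count-Min Sketch works internally. The following is a minimal standard-library sketch, not the implementation of any particular package; the `TinyCountMinSketch` class and its salted SHA-256 hashing are assumptions made for illustration.

```python
import hashlib
from collections import Counter


class TinyCountMinSketch:
    def __init__(self, width=1000, depth=5):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]  # depth rows of width counters

    def _cells(self, item):
        # One column per row, each chosen by a differently salted hash
        for row in range(self.depth):
            digest = hashlib.sha256(f"{row}:{item}".encode("utf8")).digest()
            yield row, int.from_bytes(digest[:8], "big") % self.width

    def add(self, item, count=1):
        for row, col in self._cells(item):
            self.table[row][col] += count

    def estimate(self, item):
        # Collisions only ever inflate a counter, so the minimum is the best guess
        return min(self.table[row][col] for row, col in self._cells(item))


words = ["data"] * 50 + ["sketch"] * 20 + ["jellybean"] * 5
tiny_cms = TinyCountMinSketch(width=64, depth=4)
for word in words:
    tiny_cms.add(word)

truth = Counter(words)
for word, true_count in truth.items():
    print(word, true_count, tiny_cms.estimate(word))
```

Because every counter an item touches includes that item's own count, the estimate can overshoot the true frequency but never undershoot it.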
+
+These visualizations demonstrate the power of data sketching techniques. They allow us to process and analyze enormous amounts of data
+efficiently, providing useful insights without the need to store or process every single data point.
+
+Remember, data sketching is all about making smart trade-offs. You're exchanging a bit of accuracy for a lot of speed and efficiency. It's
+not about getting perfect answers, but about getting useful insights from data that would otherwise be too big to handle.
+
+So next time you're faced with a dataset as big as that jellybean jar at the county fair, don't panic! Reach for a data sketch, and you'll
+be making educated guesses in no time.