`test-env-optimization/cross-repo-branch-match-analysis.md`

# Cross-Repo Branch Match Analysis

**Author**: Willis Kirkham
**Analysis Date**: February 8, 2026
**Data Period**: February 2025 - February 2026 (52 weeks)

## Executive Summary

Approximately **14.4% of all PRs** across `tactivos/murally` and `tactivos/mural-api` were part of a cross-repo effort — a PR in one repo with a matching branch name in the other.

Over the last year, **482 unique branch names appeared in both repos**, representing at least 964 PRs (one in each repo per branch) out of 6,682 total.

| Metric | Value |
|--------|-------|
| Total murally PRs analyzed | 4,301 |
| Total mural-api PRs analyzed | 2,381 |
| Total PRs (combined) | 6,682 |
| Branch names appearing in both repos | 482 |
| PRs involved in cross-repo efforts | ~964 (14.4%) |

---

## The Question

**How often do PRs in murally and mural-api share the same branch name?** This determines how frequently developers work on features that span both repos simultaneously — a key input for decisions about merge and deployment tooling.

---

## Key Findings

### Cross-Repo Branch Matches by Window Size

Two PRs (one from each repo) are classified as "matched" when they share the same branch name and were created within N days of each other. Branches outside the window are treated as coincidental naming.

| Window | Matched Branches | Coincidental |
|--------|:---:|:---:|
| 1 day | 408 | 74 |
| 2 days | 427 | 55 |
| **3 days** | **440** | **42** |
| 4 days | 448 | 34 |
| 5 days | 453 | 29 |
| 6 days | 458 | 24 |
| 7 days | 463 | 19 |
| 8 days | 467 | 15 |

**The total (482) is constant across all window sizes** — it represents every branch name that appeared in both repos during the year. The window only determines how many are classified as matched vs coincidental.

### Diminishing Returns Beyond 3 Days

Widening the window from 1 day to 3 days captures 32 additional matches (408 → 440), while widening from 3 days to 8 days captures only 27 more (440 → 467). A 3-day window is the practical sweet spot for matching accuracy.
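
The diminishing returns are easiest to see in the marginal gains per extra day. A quick check, with the figures copied from the table above:

```python
# Matched-branch counts by window size, copied from the table above.
matched = {1: 408, 2: 427, 3: 440, 4: 448, 5: 453, 6: 458, 7: 463, 8: 467}

# Marginal matches gained by widening the window one day at a time.
gains = {n: matched[n] - matched[n - 1] for n in sorted(matched)[1:]}
print(gains)  # {2: 19, 3: 13, 4: 8, 5: 5, 6: 5, 7: 5, 8: 4}

print(matched[3] - matched[1])  # 1-day -> 3-day: 32 extra matches
print(matched[8] - matched[3])  # 3-day -> 8-day: 27 extra matches
```

The per-day gain drops from 19 to single digits after day 3, which is why 3 days was chosen as the default.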

---

## Supporting Analysis

### Weekly Cross-Repo Match Volume

Cross-repo matches occur steadily throughout the year, with no pronounced seasonal spikes:

| Metric | Value |
|--------|-------|
| Average cross-repo matches per week | 8.1 |
| Peak cross-repo matches in a single week | 15 (week of Apr 7, 2025) |
| Minimum cross-repo matches in a week | 0 (holiday week Dec 29, 2025) |
| Weeks with 10+ cross-repo matches | 15 of 53 weeks |

### Typical Cross-Repo Match Profile

The `matched_branches_sample` data shows most cross-repo branches are created within 0-1 days of each other:

- **0-day delta**: Branches created the same day in both repos (most common)
- **1-day delta**: Branch created in one repo, then the other the next day (very common)
- **2-3 day delta**: Less common, typically larger features where backend work starts after frontend

---

## Methodology

### Data Collection

PR data was fetched from GitHub's GraphQL API for both `tactivos/murally` and `tactivos/mural-api` covering February 3, 2025 through February 3, 2026. Fields collected: PR number, branch name (`headRefName`), `createdAt`, `closedAt`, `mergedAt`, and state.
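
The shape of that query can be sketched as follows. This is a sketch only: the exact query text in `fetch_pr_data.py` may differ, though the field names are GitHub's public GraphQL schema and match the fields listed above.

```python
# A paginated pull-request query, POSTed to https://api.github.com/graphql
# with the token in the Authorization header. The exact query used by
# fetch_pr_data.py may differ; the field names are GitHub's GraphQL schema.
QUERY = """
query($owner: String!, $name: String!, $cursor: String) {
  repository(owner: $owner, name: $name) {
    pullRequests(first: 100, after: $cursor,
                 orderBy: {field: CREATED_AT, direction: ASC}) {
      pageInfo { hasNextPage endCursor }
      nodes { number headRefName createdAt closedAt mergedAt state }
    }
  }
}
"""
```

Each page returns up to 100 PRs; `pageInfo.endCursor` feeds the `$cursor` variable of the next request until `hasNextPage` is false.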

### Matching Logic

1. Group PRs by branch name across both repos
2. For each branch name appearing in both repos, compare the earliest `createdAt` from each repo
3. If the delta is within the configured window (N days), classify as **matched**
4. If the delta exceeds the window, classify as **coincidental naming**
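
The steps above can be sketched in Python (a sketch of the logic, not the exact code in `analyze_branches.py`; the branch names in the example are hypothetical):

```python
from datetime import datetime, timedelta

def earliest_by_branch(prs):
    """Step 1, plus the multiple-PRs-per-branch edge case:
    map branch name -> earliest createdAt."""
    earliest = {}
    for branch, created in prs:
        if branch not in earliest or created < earliest[branch]:
            earliest[branch] = created
    return earliest

def classify(murally_prs, api_prs, window_days=3):
    """Steps 2-4: classify branch names shared by both repos
    as matched (within window) or coincidental naming."""
    murally = earliest_by_branch(murally_prs)
    api = earliest_by_branch(api_prs)
    window = timedelta(days=window_days)
    matched, coincidental = [], []
    for branch in murally.keys() & api.keys():
        delta = abs(murally[branch] - api[branch])
        (matched if delta <= window else coincidental).append(branch)
    return matched, coincidental

# Hypothetical example: same branch created one day apart in each repo.
m, c = classify(
    [("feat/board-export", datetime(2025, 3, 3))],
    [("feat/board-export", datetime(2025, 3, 4))],
)
print(m, c)  # ['feat/board-export'] []
```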

### Edge Cases

| Case | Handling |
|------|----------|
| Multiple PRs with the same branch in one repo | Uses the earliest creation date |
| Branch exists in only one repo | Not counted (no cross-repo match possible) |
| Same branch name, >N days apart | Treated as coincidental naming, counted separately |

### Scripts

All scripts and data are in `deduplicate_branch_analysis/`:

| Script | Purpose |
|--------|---------|
| `fetch_pr_data.py` | Fetches PR data via GitHub GraphQL API |
| `analyze_branches.py` | Cross-repo branch matching and weekly statistics |
| `analyze_concurrency.py` | Concurrent environment simulation (used in [PR Volume Risk Analysis](pr-volume-risk-analysis.md)) |

### Reproducing the Analysis

```bash
cd deduplicate_branch_analysis/

# Fetch PR data (uses cached data if available)
python fetch_pr_data.py

# Run analysis for a specific window size
python analyze_branches.py --window-days 3

# Run for all window sizes
for n in 1 2 3 4 5 6 7 8; do
python analyze_branches.py --window-days $n
done
```
---

`test-env-optimization/deduplicate_branch_analysis/README.md`

# PR Branch Deduplication Analysis

Analyze PR data from `tactivos/murally` and `tactivos/mural-api` to identify cross-repo branch matches and calculate deduplicated test environment estimates.

## Background

When developers work on features spanning both frontend (murally) and backend (mural-api), they typically use the **same branch name** in both repositories. The test environment system creates **only one environment per unique branch name**, so counting PRs independently overcounts the actual number of test environments needed.

This tool:
1. Fetches PR data from both repositories
2. Identifies branch names that appear in both repos within a configurable time window
3. Calculates the deduplicated count of unique test environments
4. Outputs weekly statistics and sample matched branches for validation

## Setup

```bash
# Install dependencies
pip install -r requirements.txt

# Set GitHub token (one of these methods)
export GITHUB_TOKEN=$(gh auth token)
# or
export GITHUB_TOKEN=ghp_your_token_here
```

## Usage

### Step 1: Fetch PR Data (One Time)

Fetch and cache PR data from both repositories:

```bash
python fetch_pr_data.py
```

Options:
- `--start YYYY-MM-DD`: Start date (default: 2025-02-03)
- `--end YYYY-MM-DD`: End date (default: 2026-02-03)
- `--output-dir DIR`: Output directory for JSON files (default: current)
- `--force`: Force re-fetch even if cached data exists

This creates cached data files:
- `pr_data_murally.json`
- `pr_data_mural-api.json`

**Note:** If cache files already exist, the script will skip fetching and inform you. Use `--force` to re-fetch.

### Step 2: Run Analysis (Repeatable with Different Parameters)

Analyze the cached data to find cross-repo matches:

```bash
python analyze_branches.py
```

Options:
- `--window-days N`: Time window for matching (default: 3 days)
- `--data-dir DIR`: Directory containing cached JSON files (default: current)
- `--output-dir DIR`: Output directory for results (default: current)

This creates (with window size in filename):
- `weekly_analysis_3days.csv`: Weekly breakdown of PRs and environments
- `matched_branches_sample_3days.txt`: Sample of matched branches for validation

### Experimenting with Different Window Sizes

The main benefit of caching is that you can run analysis multiple times with different window sizes without re-fetching data from GitHub:

```bash
# Fetch data once
python fetch_pr_data.py

# Try different window sizes
python analyze_branches.py --window-days 1
python analyze_branches.py --window-days 3
python analyze_branches.py --window-days 5
python analyze_branches.py --window-days 7
python analyze_branches.py --window-days 14
```

Each run creates uniquely named output files:
- `weekly_analysis_1days.csv`, `matched_branches_sample_1days.txt`
- `weekly_analysis_3days.csv`, `matched_branches_sample_3days.txt`
- etc.

This lets you compare how the deduplication percentage changes based on the matching window.

## Output Files

Output files include the window size in the filename (e.g., `_3days`) to allow comparison across different window settings.

### weekly_analysis_{N}days.csv

Weekly statistics with columns:
- `week_start`: Monday of the week
- `murally_prs`: Number of PRs in murally
- `api_prs`: Number of PRs in mural-api
- `combined_prs`: Total PRs
- `murally_branches`: Unique branches in murally
- `api_branches`: Unique branches in mural-api
- `cross_repo_matches`: Branches matched across repos
- `deduplicated_envs`: Actual test environments needed
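
A row in this layout can be processed with the standard library alone. The two rows below are synthetic, illustrative values, not real output:

```python
import csv
import io

# Two synthetic rows in the weekly_analysis_{N}days.csv layout described
# above (illustrative values, not real data).
sample = """\
week_start,murally_prs,api_prs,combined_prs,murally_branches,api_branches,cross_repo_matches,deduplicated_envs
2025-02-03,80,45,125,75,42,9,108
2025-02-10,90,50,140,85,47,11,121
"""

for row in csv.DictReader(io.StringIO(sample)):
    naive = int(row["murally_branches"]) + int(row["api_branches"])
    saved = naive - int(row["deduplicated_envs"])
    print(row["week_start"], f"dedup saves {saved} envs")
```

For each week, the environments saved by deduplication equal the naive branch total minus `deduplicated_envs`, which matches `cross_repo_matches` for that week.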

### matched_branches_sample_{N}days.txt

Sample of cross-repo matches showing:
- Branch name
- PR numbers and dates from each repo
- Time delta between PRs
- Whether it's a match (within window) or coincidental naming

## Deduplication Logic

Two PRs (one from each repo) are considered "the same effort" and deduplicated if:
1. They have **exactly the same branch name** (case-sensitive)
2. Their creation dates are **within N days of each other** (default: 3 days)

### Edge Cases

1. **Multiple PRs same branch, same repo**: Uses earliest creation date
2. **Branch in one repo only**: Counted as one environment
3. **Same branch name, >N days apart**: Treated as separate efforts (coincidental naming), counted as two environments
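
The rules above reduce to a small counting function. A minimal sketch (not the exact `analyze_branches.py` code), assuming edge case 1 has already been resolved to one earliest date per branch:

```python
from datetime import datetime, timedelta

def environment_counts(murally, api, window_days=3):
    """Return (naive, deduplicated) environment counts.

    `murally` and `api` map branch name -> earliest PR creation date
    in that repo (multiple PRs per branch already collapsed)."""
    window = timedelta(days=window_days)
    shared = murally.keys() & api.keys()
    matched = sum(1 for b in shared if abs(murally[b] - api[b]) <= window)
    naive = len(murally) + len(api)  # one env per branch per repo
    return naive, naive - matched    # each matched pair shares one env

# feat/a matches across repos; fix/b is coincidental naming (9 days apart).
print(environment_counts(
    {"feat/a": datetime(2025, 3, 1), "fix/b": datetime(2025, 3, 1)},
    {"feat/a": datetime(2025, 3, 2), "fix/b": datetime(2025, 3, 10)},
))  # (4, 3)
```

A branch present in only one repo contributes one environment to both counts, and a coincidental pair contributes two, exactly as in edge cases 2 and 3.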

## Example Output

```
============================================================
SUMMARY STATISTICS
============================================================

-----------------------PR Counts-----------------------
Total PRs (murally): 4,304
Total PRs (mural-api): 2,390
Total PRs (combined): 6,694

--------------------Branch Analysis--------------------
Unique branches (murally): 3,850
Unique branches (mural-api): 2,100
Branches in both repos: 450

----------Cross-Repo Matching (window: 3 days)----------
Cross-repo matches: 380
Coincidental naming (>window): 70

-----------------Deduplication Results-----------------
Branches only in murally: 3,400
Branches only in mural-api: 1,650
Matched cross-repo branches: 380
Non-matched (counted twice): 70

Naive branch count: 5,950
Deduplicated env count: 5,570
Reduction from dedup: 380 (6.4%)
```

## Concurrent Environment Analysis

While `analyze_branches.py` calculates **total environments created per week**, it doesn't account for the fact that test environments are automatically cleaned up when PRs close. The actual infrastructure load depends on **concurrent open PRs**, not total created.

`analyze_concurrency.py` extends the analysis to calculate concurrent environment counts over time.

### Running Concurrency Analysis

First, ensure you have PR data with close times:

```bash
# Re-fetch data to include closedAt/mergedAt fields
python fetch_pr_data.py --force

# Run concurrency analysis (hourly resolution - more accurate, slower)
python analyze_concurrency.py

# Or with daily resolution (faster, less granular)
python analyze_concurrency.py --resolution daily
```

Options:
- `--resolution hourly|daily`: Time resolution (default: hourly)
- `--data-dir DIR`: Directory containing cached JSON files (default: current)
- `--output-dir DIR`: Output directory for results (default: current)

### Concurrency Output Files

#### concurrent_envs_timeseries.csv

Time series data with concurrent environment counts:
- `timestamp`: ISO timestamp of the data point
- `murally_active`: PRs with open status in murally at this time
- `api_active`: PRs with open status in mural-api at this time
- `naive_total`: Sum of murally + api (without deduplication)
- `deduplicated_envs`: Unique branch names (with cross-repo deduplication)
- `cross_repo_active`: Branches with open PRs in both repos simultaneously

#### concurrent_envs_summary.txt

Summary statistics including:
- PR lifespan statistics (median, P90, P95 by repository)
- Concurrent environment statistics (peak, average, P95)
- Peak timestamp
- Cross-repo concurrency metrics
- Long-lived PRs (>30 days open)

#### pr_lifespan_distribution.csv

Histogram of PR lifespans:
- `lifespan_bucket`: Time range (0-1h, 1-2h, 2-4h, etc.)
- `murally_count`: Number of murally PRs in this bucket
- `api_count`: Number of mural-api PRs in this bucket

### Concurrency Deduplication Logic

For concurrent environments, deduplication is straightforward:
- At any time point T, if branch "feature/foo" has an open PR in **both** repos, count as **1 environment**
- No time-window matching needed—if both PRs are open at time T, they share an environment

### Edge Cases

1. **Still-open PRs**: Treated as open until the analysis end time
2. **Very long-lived PRs**: Flagged in summary if open >30 days (potential outliers)
3. **Draft PRs**: Included in analysis (would get environments with auto-provisioning)

### Example Concurrency Output

```
============================================================
CONCURRENT ENVIRONMENT ANALYSIS
============================================================

PR Lifespan Statistics:
------------------------------
Murally PRs analyzed: 4,301
Median lifespan: 8.2 hours
90th percentile: 72.5 hours
95th percentile: 168.0 hours

mural-api PRs analyzed: 2,390
Median lifespan: 12.4 hours
90th percentile: 96.3 hours
95th percentile: 240.0 hours

Concurrent Environment Statistics:
------------------------------
Peak concurrent (naive): 142
Peak concurrent (deduplicated): 135
Average concurrent: 89.3
Median concurrent (P50): 85.0
P95 concurrent: 120.0

Peak occurred at: 2025-06-15T14:00:00

Cross-Repo Concurrency:
------------------------------
Max branches active in both: 12
Avg branches active in both: 4.2
Deduplication savings: 5.2% reduction
```

## Notes

- The GraphQL API is used for data fetching; it returns exactly the fields needed in far fewer requests than the REST API would
- Data is cached locally to allow re-running analysis with different parameters
- GitHub API rate limit: 5,000 requests/hour for authenticated users
- Estimated requests needed: ~15-20 (paginated, 100 PRs per page)
- The fetch script includes the `closedAt` and `mergedAt` fields, which the concurrency analysis requires
- Hourly resolution provides more accurate peak detection but takes longer to compute