
pmstack

A PM toolkit that runs inside the browser, desktop, or mobile Claude app, as well as the terminal. I built it to give Product Managers a path to AI fluency, irrespective of your technical fluency today. Learn by building. Part learning playground, part tactical skillset. All in one.

Turn customer signals into outcomes the way I do at AWS: delegate your work to AI agents as Routines, or run commands on demand. Run evals like an ML engineer, schedule pre-mortems, stress-test launches, monitor your AI feature for drift, and write the Monday memo — all from one consistent set of commands you type into Claude (terminal, browser, desktop, or mobile).

Brand new? /onboarding walks you through every capability with a runnable example. Skip the rest of this README until you've run it once.

If you find this repo useful, star it and help me make it better!


Who this is for

There are two personas. Neither is presumed to be AI-pilled.

Technical PM — you live in a terminal. You want one consistent muscle memory across every PM artifact you produce. You install pmstack as slash commands inside Claude Code.

Non-technical PM — you do PM work in claude.ai web, the desktop app, or on your phone. You want the same toolkit without ever opening a terminal. You install pmstack as Anthropic Skills inside your Claude account. Same skills, same outputs, no terminal.


Install

Path 1: claude.ai web, desktop, or mobile (non-technical PMs)

~30 seconds. No terminal.

  1. claude.ai → Projects → new Project ("PM toolkit").
  2. Settings → Skills → Upload skill → upload each folder from claude-skills/. All 14, or pick the 4–5 you'll use most.
  3. Chat normally. "Write a PRD from this customer quote" auto-activates the skill; output appears inline — copy-paste anywhere.

Path 2: Claude Code CLI (technical PMs)

curl -fsSL https://raw.githubusercontent.com/RyanAlberts/pmstack/main/install.sh | bash -s -- --global

Installs to ~/.claude/ so /prd, /eval, /weekly work in any folder. Drop --global to install in the current folder only.

Manual install / other tools (Cursor, ChatGPT, Gemini, …)
git clone https://github.com/RyanAlberts/pmstack.git && cd pmstack
./setup --global    # or: ./setup ~/work/my-pm-stuff

For non-Claude tools, see docs/using-other-tools.md — the skills are plain markdown, paste them into .cursorrules, a Custom GPT, or a system prompt.


Start here

If you just installed pmstack, run this once. It's the only command you need to remember.

| Command template | What it answers | What you get | Where it runs |
| --- | --- | --- | --- |
| `/onboarding` | "I just installed pmstack. Where do I start?" | A 9-step interactive tutorial running every capability in this README with examples so you can see what good looks like | CLI · web · desktop |

Prefer to browse first?


What's an eval?

An eval is the test suite for an AI feature: it's how we evaluate non-deterministic systems like LLMs and agents. You define inputs ("what a real user might say") and what success looks like, then run those inputs through your AI system and score the results. This repo abstracts the measurement apparatus, the same way you depend on your engineering team to write unit and integration tests in traditional software development.

[From Anthropic] "An evaluation (“eval”) is a test for an AI system: give an AI an input, then apply grading logic to its output to measure success."

Diagram of an evaluation pipeline: input → AI system → grading logic → score. Adapted from Anthropic's 'Demystifying Evals for AI Agents.'

Image taken from Anthropic — Demystifying Evals for AI Agents (Anthropic Engineering, 2025). Used with attribution; original copyright Anthropic, PBC. Read the full article for the canonical guidance pmstack draws on.
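The pipeline in the diagram reduces to a few lines of code. This is an illustrative sketch (the function names are hypothetical, not pmstack internals): a stand-in system under test, a deterministic "code" grader, and a score.

```python
# Minimal sketch of the eval pipeline: input -> AI system -> grading logic -> score.
# Names and logic are illustrative, not pmstack code.
def ai_system(user_input: str) -> str:
    # Stand-in for the non-deterministic system under test (an LLM, an agent, ...).
    return f"Summary: {user_input[:40]}"

def grader(output: str) -> bool:
    # A "code" grader: a deterministic check applied to the system's output.
    return output.startswith("Summary:") and len(output) > 10

score = grader(ai_system("Half our trial users churn before they finish setup."))
print(score)  # prints True: this trial passes the grader
```

In practice the grader is the hard part — that is why the framework distinguishes code, model, and human graders.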

Fluency in evals separates an AI PM from a traditional PM. Traditional PMs ship a feature, instrument a dashboard, and read the launch metric. AI PMs do all of that and maintain a structured measurement of how the model itself behaves — because LLM outputs are non-deterministic, drift between model versions, and fail in ways feature flags can't catch.

pmstack implements Anthropic's eval framework as PM-runnable commands. The Anthropic article is the canonical reference; you don't have to read it cover-to-cover before shipping. pmstack is the operating layer that produces its artifacts — task suites, transcripts, graders, drift watches — without writing engineering scaffolding. Five commands cover the lifecycle in flow order:

  • /vibe-test — Read raw transcripts of your AI feature in action; surface failure patterns and draft task candidates before you formalize a test suite. (Anthropic's "start with manual testing" ritual.)
  • /eval — Design the suite. Output: a YAML with tasks, metrics, graders (code / model / human), and pass-bars.
  • /run-eval — Execute the suite against a real AI system; report pass@k AND pass^k. Hard-stops if no target is configured — never invents fake scores.
  • /transcript-review — Walk every failed trial asking the diagnostic question: model mistake, grader mistake, or task-spec error? (Anthropic's "read the transcripts" ritual.)
  • /eval-drift — Re-run the suite weekly, diff against last week, flag any regression as a release blocker.

First-time setup: docs/run-eval-setup.md. How pmstack maps to Anthropic's 8-step roadmap: docs/anthropic-framework.md.
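To make the /eval output concrete, here is a sketch of what a suite YAML might contain. The field names below are illustrative assumptions, not the exact schema /eval emits; the tasks are drawn from the bundled code-review walkthrough.

```yaml
# Hypothetical eval-suite sketch. Field names are illustrative,
# not the exact schema /eval produces.
suite: code-review-bot
target: staging
trials_per_task: 5
metrics:
  - name: comments_per_pr_p75
    grader: code           # deterministic check
  - name: severity_calibration
    grader: model          # LLM-as-judge with a rubric
tasks:
  - id: summarize-bugfix-pr
    input: "18-line bugfix PR (off-by-one in pagination)"
    success: "Summary names the bug; 0-2 inline comments; CODEOWNER suggested"
    pass_bar: "pass^5"     # customer-facing, so every trial must pass
  - id: no-false-sql-injection-flag
    negative_case: true    # balanced-set rule: a should-NOT-do companion
    input: "PR uses parameterized SQL that looks unsafe"
    success: "Bot does NOT flag SQL injection"
```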


What you get

A growing set of capabilities, each packaging a task to delegate to AI. The logical flow is Spec creation → Measurement → Communicate & orchestrate, with Routines giving you the operating discipline. Each command produces a real markdown or YAML artifact (or, on web/desktop, an inline block you can copy).

Don't try to memorize them all. Run /onboarding once — it walks the whole stack with a real signal and produces a complete artifact set you can compare to the bundled examples.

 

Spec creation — Signal to shippable

Translate the noise of customer signals, competitor moves, and market gaps into structured artifacts engineering can plan from.

| Command template | What it answers | What you get | Where it runs |
| --- | --- | --- | --- |
| `/prd "<a customer signal>"` | "Customer said X. What's the spec?" | A 6-section PRD draft → example | CLI · web · desktop · mobile |
| `/competitive "<market>"` | "Who else is in this space and where's the white space?" | Landscape with positioning + white-space analysis → example | CLI · web · desktop · mobile |
| `/compare "<product A>" "<product B>" [...]` | "Which of these should we pick — and how would we test it?" | Feature matrix + decision rules + executable eval YAML → example | CLI · web · desktop · mobile |

 

Measurement — What does "good" look like?

Define how you'll know your work succeeded, then check whether it did — using Anthropic's framework end-to-end. The six commands below run in flow order: define metrics, read the data, design the suite, run it, diagnose failures, keep pmstack itself honest.

| Command template | What it answers | What you get | Where it runs |
| --- | --- | --- | --- |
| `/metrics "<feature>"` | "How will we know this worked?" | North Star + 2–3 supporting + 1–2 counter-metrics → example | CLI · web · desktop · mobile |
| `/vibe-test "<feature>"` | "What does this AI feature actually do in the wild?" | A vibe-test memo — failure patterns + task candidates + 'ready for /eval?' verdict (reads transcripts via paste, attach, or --from-folder) | CLI · web · desktop · mobile |
| `/eval "<AI feature>"` | "What does 'good' actually look like for this AI feature?" | A test-suite YAML — tasks, metrics, graders (code/model/human), pass@k or pass^k → example | CLI · web · desktop · mobile |
| `/run-eval <eval-yaml-path>` | "Does this AI feature actually pass the bar?" | A scored summary.md with both pass@k AND pass^k, top failures, cost → example | CLI only (needs a real target) |
| `/transcript-review <run-folder>` | "Why did these trials fail — model, grader, or task?" | A diagnosis memo — per-trial verdicts + proposed eval changes (reads /run-eval output) | CLI · web · desktop · mobile |
| `/eval-self [--skill <name>]` | "Is pmstack itself still good?" | Scores every pmstack skill against canonical scenarios | CLI only |

 

Communicate & orchestrate

Package your work for the audience that needs it, or chain the right tools in the right order.

| Command template | What it answers | What you get | Where it runs |
| --- | --- | --- | --- |
| `/brief "<topic>" <audience>` | "What does the exec / eng team / customer need to know?" | A one-page audience-sized brief → example | CLI · web · desktop · mobile |
| `/sprint "<a customer signal>"` | "Take this from signal to ship-ready in one pass." | Four artifacts in sequence — PRD → metrics → eval → brief — with confirmation gates | CLI · web · desktop |

 

Routines — Become an AI Director

Recurring patterns that turn pmstack from a set of one-shot commands into a PM operating system. They schedule themselves, audit your workspace, gate launches, and force you to notice your own learning. This layer is what separates someone running prompts ad hoc from someone directing an AI-augmented workflow.

| Command template | What it answers | What you get | Where it runs |
| --- | --- | --- | --- |
| `/eval-drift` | "Did my AI feature get worse this week?" | Drift memo with RELEASE_BLOCKED: true\|false flag → example | CLI only (needs real eval runs) |
| `/premortem <prd-slug>` | "How could this feature fail?" | 3 failure stories + leading indicators + mitigations; mutates the PRD's Risks section → example | CLI · web · desktop · mobile |
| `/weekly` | "What changed in my thinking this week?" | Decisions made + open loops aging + one required 'changed my mind' field → example | CLI · web · desktop |
| `/launch-readiness <feature>` | "Are we actually ready to ship this?" | GO / NO-GO / CONDITIONAL verdict + 7-item evidence checklist → example | CLI · web · desktop · mobile |
| `/lint` | "Did anything in my workspace drift out of sync?" | Graph gaps + cross-artifact drift + stale candidates with 'Do this:' actions → example | CLI · web · desktop |

Schedule with /loop so the routines run themselves:

/loop 7d /weekly       # Monday self-snapshot
/loop 7d /lint         # workspace audit
/loop 7d /eval-drift   # weekly eval-regression watch

A week with pmstack

A typical week, end to end: a customer signal arrives → /prd translates it → /premortem stress-tests the spec → /sprint chains metrics, eval, and brief → /launch-readiness gates the ship → /eval-drift watches it after launch. Meanwhile /lint keeps your workspace tidy and /weekly captures what changed in your thinking. The full week, with all 12 artifacts, lives in examples/walkthrough-code-review/.


What it looks like (claude.ai web — non-technical path)

Conceptual mockup of a fresh claude.ai conversation after the user has uploaded the pmstack skills:

┌──────────────────────────────────────────────────────────────────────┐
│ Project: PM toolkit                                                  │
│ Skills: pmstack-prd, pmstack-premortem, pmstack-launch-readiness,    │
│         pmstack-weekly, pmstack-onboarding, ... (14 total)           │
├──────────────────────────────────────────────────────────────────────┤
│                                                                      │
│  You:    I just installed pmstack. Walk me through it.               │
│                                                                      │
│  Claude: [auto-activates pmstack-onboarding skill]                   │
│          Welcome to pmstack. We'll walk through 9 steps using a      │
│          realistic AI code review feature. Type 'next' to start      │
│          with /prd.                                                  │
│                                                                      │
│  You:    next                                                        │
│                                                                      │
│  Claude: Step 2 of 9 — /prd                                          │
│          /prd takes a customer signal and writes a PRD draft. ...    │
│          [shows the example signal, the command, the expected        │
│          output, and a link to the bundled example artifact]         │
│                                                                      │
│  You:    [pastes a real customer quote from their team]              │
│          /prd "Half our trial users churn before they finish setup." │
│                                                                      │
│  Claude: [auto-activates pmstack-prd skill]                          │
│          # PRD: Trial setup friction — 2026-05-05                    │
│          ## Problem Statement ...                                    │
│          [full 6-section PRD inline]                                 │
│                                                                      │
└──────────────────────────────────────────────────────────────────────┘

That's the entire UX. No terminal, no install path, no slash commands to memorize. The skills auto-activate on natural-language phrasing the user is already using.


Get started in 60 seconds

If you installed the CLI:

cd ~/work/my-pm-stuff      # or wherever you installed
claude                       # opens Claude Code

Then type:

/onboarding

That's it. The tutorial does the rest.

If you installed via claude.ai, just open the Project you set up and ask: "walk me through pmstack."


Glossary

The Anthropic eval-framework vocabulary, with concrete examples drawn from the bundled "AI code review" walkthrough so each term ties to something runnable.

| Term | What it means | Example (AI code review) |
| --- | --- | --- |
| Task | One test case. An input plus a definition of what success looks like. The atomic unit of an eval suite. | "Given this 18-line bugfix PR, the bot should produce a summary that names the bug, 0–2 inline comments, and a CODEOWNER as the suggested reviewer." |
| Trial | One attempt at a task. Models are non-deterministic, so we run multiple. pmstack default: 5 trials per task. | The bot is shown the same PR five times. Three trials produce great reviews, one is mid, one calls a clean diff "suspicious." That spread is the data. |
| Transcript | The full record of one trial — the input the bot saw, its reasoning, every tool call, intermediate output, and the final answer. (a.k.a. trace, trajectory.) | The five mock files at examples/walkthrough-code-review/transcripts/. Each shows what the bot saw, what it said, and what happened next. |
| Outcome | The final state of the world after the trial — distinct from what the agent said it did. | The transcript says "I posted a comment." The outcome is whether the comment actually appears on the PR in GitHub. |
| Grader | The logic that scores a trial. Three flavors: code (deterministic), model (LLM-as-judge with a rubric), human (SME spot-check). One task can have several. | comments_per_pr_p75 → graded by code (count the comments). severity_calibration → graded by model (Claude judges the bot's severity tags vs. a rubric). refusal_precision → graded by human (SME labels 24 cases by hand). |
| Suite | The whole eval YAML — a collection of tasks measuring some capability or guarding a regression. | eval-code-review-2026-05-06.yaml — 8+ tasks, 7 metrics, one target system. |
| Capability eval | "What can this agent do well?" Starts at a low pass-rate. Gives the team a hill to climb. | Launch suite for a brand-new feature: pass-rate sits at 60%, climbing weekly as you fix prompts and edge cases. |
| Regression eval | "Does it still handle what it used to?" Should sit near 100%. Any drop is a release blocker. | The launch suite graduated to a regression suite once it hit 95% consistently — now it just guards the floor. |
| pass@k | Probability the agent gets it right on at least one of k trials. Use when one success is enough (a tool with retry). | A coding agent: as long as one of 5 attempts works, the dev moves on. pass@5 is the right metric. |
| pass^k | Probability the agent gets it right on every one of k trials. Use when consistency matters (customer-facing). | The code review bot: every dev expects it to work every time. pass^5 is the right metric — and the bar PMs gate launches on. |
| Drift watch | A scheduled re-run of the eval that diffs against last week and hard-stops releases on regression. | /loop 7d /eval-drift runs every Monday. If severity_calibration drops from 4.3 → 3.8, the memo includes RELEASE_BLOCKED: true. |
| Negative case | A task where the agent should not do something. For every "should-do-X" case, include a "should-not-do-X" companion — Anthropic's balanced-set rule. | "PR uses parameterized SQL that looks unsafe — bot should NOT flag SQL injection." Marked negative_case: true. |
| Reference solution | A known good output that passes all the graders. Proves the task is solvable and that the graders are wired up correctly. | "On the bugfix PR, a reference solution would be: a summary naming the off-by-one, one inline asking for a unit test, @django-pagination-codeowner as reviewer." |
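The pass@k vs. pass^k distinction is easy to see in code. This is an illustrative sketch (the function names and trial data are made up, not pmstack's scoring implementation):

```python
# Illustrative scoring of per-task trial results; not pmstack internals.
def pass_at_k(trials):
    # pass@k: at least one of the k trials succeeded.
    return any(trials)

def pass_all_k(trials):
    # pass^k: every one of the k trials succeeded.
    return all(trials)

# Hypothetical results: 5 trials per task, True = trial passed its graders.
tasks = {
    "summarize_bugfix_pr": [True, True, True, False, True],
    "flag_sql_injection":  [True, False, True, True, True],
}
suite_pass_at_k = sum(pass_at_k(t) for t in tasks.values()) / len(tasks)
suite_pass_all_k = sum(pass_all_k(t) for t in tasks.values()) / len(tasks)
print(suite_pass_at_k)   # 1.0: both tasks succeed at least once
print(suite_pass_all_k)  # 0.0: neither task succeeds on every trial
```

The same trial data yields a perfect pass@5 and a zero pass^5, which is why a customer-facing bot gates on pass^k.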

Want the deeper version? See docs/anthropic-framework.md for the full mapping of pmstack commands to Anthropic's 8-step eval roadmap.


Common questions

Do I need to know how Claude works internally? No. You should know what an LLM is and roughly that they have limits, but pmstack hides everything else.

Does this cost money? The commands themselves are free. The Claude tokens used to run them count against your Claude subscription (Pro/Max/Team) or API key. A typical PRD or competitive analysis costs well under a penny. The /run-eval and /eval-self commands can use more tokens — they tell you the estimate before each run and ask before spending.

What if I want to change how a command behaves? Open the file in skills/<name>.md (e.g., skills/prd-from-signal.md) and edit. The skill is just markdown — Claude reads it as instructions. Save and re-run; your changes apply immediately.

Is my data sent anywhere unusual? No. Everything runs through your existing Claude account. pmstack adds no servers, no telemetry, no third parties. The skills are just markdown files on your disk (or, on claude.ai, files Anthropic stores in your Project).

Can I use pmstack on a project that isn't pmstack itself? Yes — that's the point. Install with --global (CLI) or upload the skills to a Project (claude.ai) once, and the commands are available wherever you do PM work.

The five new routines look like a lot. Where do I start? Run /onboarding. It walks you through every routine with the bundled walkthrough. After that, the highest-leverage starting routine is /premortem — runs in 60 seconds against any PRD, mutates the Risks section, and the lift on launch-decision quality is the largest in the set.


Want to contribute, fork, or learn how it's built?


Built by Ryan Alberts — Staff PM in Agentic AI. PRs and forks welcome.