Replies: 1 comment
Namit, David, Parag, this resonates. I like the framing because it puts the TPM role where it belongs for intelligent systems: helping the team make safe decisions under uncertainty, not just reporting task progress. A project can be green on scope, schedule, and risks while still being unsafe if the team has not earned confidence in model behavior, data quality, evaluation coverage, operator workflow, rollback posture, or drift detection. I have a few additions to consider as you move toward pilots.
The caution I would add is process weight. We should pilot the smallest useful pack first, probably the Confidence Dashboard, Decision Ledger, Uncertainty Framing Contract, and one evaluation/readiness artifact. If the tooling helps a crew make a safer decision faster, it belongs in HVE Core. If it mostly creates another status surface, it will give back the gains we are trying to protect. I would be happy to help review or pilot from the TPM Guild side.
As we have all realised, intelligent systems don't behave like traditional software projects. Outcomes are probabilistic, model performance shifts over time, and progress can't be measured by task completion alone. Yet most TPM tooling (scoping, scheduling, risk tracking, RAG dashboards) still assumes deterministic delivery. This creates a structural gap: projects report green while decisions remain unsafe. We've been working on closing that gap with tools and processes built specifically for probabilistic delivery.
What We Are Doing
We have kicked off an initiative to build TPM tooling specifically for intelligent systems delivery. We studied how Google, Microsoft, and AWS define production readiness for ML systems and mapped where standard TPM practices hold up and where they break down. From that research, we are redefining 10 core TPM disciplines (scope, schedule, risk, quality, progress tracking, dependencies, stakeholder management, definition of done, operational readiness, reporting and change control) for environments where behaviour is learned from data and adapts to context. Each discipline gets purpose-built tools and processes: for example, confidence dashboards instead of RAG status, hypothesis-driven scope packs instead of fixed requirements, timeboxed learning cycles instead of milestone-driven schedules, and kill criteria and risk heatmaps that account for drift and silent failure modes.
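To make the contrast with binary RAG status concrete, here is a minimal sketch of what a confidence-dashboard entry with a kill criterion could look like. All names, thresholds, and bands here are illustrative assumptions, not the actual toolkit artifacts:

```python
from dataclasses import dataclass

@dataclass
class ConfidenceSignal:
    """One row of a hypothetical confidence dashboard.

    Instead of a single green/amber/red rollup, each delivery
    dimension carries an evidence-backed confidence score in
    [0, 1] and a kill threshold below which work stops and a
    decision review is triggered.
    """
    dimension: str         # e.g. "model quality", "data freshness"
    confidence: float      # current evidence-backed confidence
    kill_threshold: float  # below this, the kill criterion fires

    def status(self) -> str:
        if self.confidence < self.kill_threshold:
            return "KILL"          # stop work, re-plan
        if self.confidence < 0.7:  # illustrative caution band
            return "AT_RISK"
        return "CONFIDENT"

signals = [
    ConfidenceSignal("model quality", 0.82, 0.4),
    ConfidenceSignal("data freshness", 0.35, 0.4),
]
for s in signals:
    print(f"{s.dimension}: {s.status()}")  # CONFIDENT, then KILL
```

The point of the sketch is that a project can be "green" on every task while one dimension sits below its kill threshold, which is exactly the structural gap described above.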
Where We Are
Research across three industry frameworks is complete. Ten TPM responsibility areas have been redefined with evidence-based rationale. Twenty-seven tools and processes have been documented with structure, templates, and alignment to the responsibility model. These include artifacts like the Confidence Dashboard, Decision Ledger, Uncertainty Framing Contract, Quality Assurance Pack (covering data, model, system, and drift gates), Model Operations Readiness Checklist, and Monitoring Signal Map. Everything is catalogued and queryable. The next step is piloting on live engagements to validate what works.
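As a flavour of what one of the Quality Assurance Pack gates could check, here is a hypothetical drift gate: a simple mean-shift test of a live feature distribution against its baseline. The function name, the threshold, and the statistic are assumptions for illustration; a real gate would likely use a richer test such as PSI or KS:

```python
import statistics

def drift_gate(baseline: list[float], live: list[float],
               max_shift: float = 0.25) -> dict:
    """Hypothetical drift gate: fails when the live feature mean
    has shifted from the baseline mean by more than `max_shift`
    baseline standard deviations."""
    base_mean = statistics.mean(baseline)
    base_std = statistics.stdev(baseline)
    shift = abs(statistics.mean(live) - base_mean) / base_std
    return {"shift_in_stdevs": round(shift, 2),
            "gate": "FAIL" if shift > max_shift else "PASS"}

# A drifted live sample fails the gate; a stable one passes.
print(drift_gate([1.0, 1.2, 0.9, 1.1, 1.0], [1.5, 1.6, 1.4]))
print(drift_gate([1.0, 1.2, 0.9, 1.1, 1.0], [1.05, 1.0, 1.1]))
```

The value of wiring checks like this into a gate, rather than a dashboard, is that silent drift becomes a blocking delivery signal instead of a chart someone has to remember to look at.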
What's Next
We are hoping to integrate this as an accelerator into HVE Core so it becomes part of the standard toolkit available to all teams. Before that, we need to pilot on real engagements and iterate based on feedback.
If this problem resonates with you or you want to help shape the tooling, reach out to any of us. We welcome contributors, reviewers, and teams willing to pilot.
Namit (namit.t@microsoft.com)
David (david.ratcliffe@microsoft.com)
Parag (paragalurkar@microsoft.com)