A reproducible methodology for diagnosing the developer experience (DX) of any Python SDK built on top of an HTTP API. The goal is not a binary pass/fail verdict but to find weak spots in the user experience and prioritize improvements.
The methodology is universal: it is not tied to specific classes, files, or method names. All criteria are formulated through roles, contracts, and behavior, not code identifiers.
- Integration experience. The path from "installed the package" to "made the first successful request".
- Day-to-day development experience. Speed of finding the right method, predictability of behavior, cost of debugging, cost of testing one's own code on top of the SDK.
- Maintenance experience. Behavior on upgrade, on upstream API changes, on environment changes, on network failures.
The methodology does not evaluate:
- performance (latency, throughput, memory);
- architectural cleanliness for its own sake;
- upstream API feature completeness (this is a separate "coverage" metric, only partially included).
Goals:
- Produce a quantitative score: `Score ∈ [0, 100]`, comparable across releases and across different SDKs.
- Localize weak spots down to the behavior/contract level (without binding to a specific file).
- Produce a prioritized improvement backlog (via the `Pain` index, see §6).
Constraints:
- The methodology is designed for SDK wrappers on top of HTTP/REST APIs. For gRPC, WebSocket, or low-level libraries, some criteria are not applicable and are explicitly disabled.
- Minimum evaluation: one evaluator in ~1 working day. Reference evaluation: two independent evaluators with cross-check.
- Some criteria require user participation (persona sessions); if access to real users is unavailable, a simulation mode is used (the evaluator plays the persona role).
Assumptions:
- The SDK has a public repository or source code access.
- The SDK is distributed via PyPI/similar index, installable in one step.
- At least minimal documentation exists (README, docstrings).
Each sub-criterion is evaluated from the perspective of three personas. The final score for a sub-criterion = minimum of the three scores (the weakest link determines DX).
- P1. Newcomer integrator. Using the SDK for the first time. Relies on README, quickstart, and IDE hints. Reads source code only as a last resort.
- P2. Experienced developer. Writes production code, uses `mypy`, debugs incidents, writes their own tests on top of the SDK.
- P3. Maintainer. Updates the SDK version in their project, monitors releases, maintains backward compatibility of their own code, collects bug reports.
A single set of methods — each criterion explicitly states which apply.
| Code | Method | Description |
|---|---|---|
| AC | Auditing Code | Manual reading of the public API without diving into implementation. The evaluator's role is "smart reader", not author |
| SA | Static Analysis | Applying static analyzers: types, style, docstrings, dead code, security lint |
| FI | Fault Injection | Controlled error injection (5xx, 401, timeout, malformed JSON) via mocks and proxies |
| TT | Timed Task | Time-based task: measuring how long it takes a persona to reach the result |
| TAL | Think-Aloud Lab | User session with a "think aloud" protocol: recording moments of getting stuck |
| HE | Heuristic Evaluation | Heuristic analysis using 10 DX heuristics (see §5.5) |
| CB | Comparative Benchmark | Comparing specific behavior against reference SDKs (Stripe, Azure, Google Cloud, boto3, OpenAI) |
| DA | Document Audit | Verifying public documentation against actual SDK behavior |
| LA | Log Analysis | Analyzing SDK logs: what is written, at what levels, whether secrets leak |
| AM | Automated Metric | Any machine-extractable metric: test coverage, number of public symbols with docstrings, number of TODO/FIXME, etc. |
Each sub-criterion must have ≥1 method from {SA, FI, DA, AM, CB} — at least one source independent of subjective judgment. Manual methods (AC, TT, TAL, HE) are applied in addition, but not as the sole source.
- ruff — style violations, basic anti-patterns, dead code, subset of `flake8` rules.
- mypy in `--strict` mode — completeness of type annotations, `Any` leaks, annotation/runtime mismatches.
- pyright (alternative) — faster and stricter static type analyzer.
- bandit — security lint (hardcoded secrets, unsafe patterns).
- vulture — dead code detection: unused imports, functions, classes.
- radon / xenon — cyclomatic complexity of public methods.
- interrogate — docstring coverage of public API.
- pydocstyle — docstring format.
- pytest — base for contract, integration, and error-mapping tests.
- respx / httpx-mock — HTTP layer mock for `httpx`-based clients.
- vcrpy / pytest-recording — recording and replaying real HTTP sessions for regression.
- pytest-httpserver — spins up a test HTTP server capable of simulating 4xx/5xx and mixed response sequences.
- toxiproxy / mitmproxy — simulating network anomalies: jitter, drops, rate limiting.
- hypothesis — property-based tests for edge cases in serialization and mappers.
- freezegun — time freezing for checking retry-after and backoff behavior.
- vale — prose linter for documentation style and terminology.
- markdownlint — Markdown quality.
- sphinx-build -W — reference build without warnings.
- link-checker (e.g., `lychee`) — dead link detection.
- Diátaxis matrix — manual categorization of existing materials (see §9.15).
- Time-to-First-Call (TTFC) — metric in seconds from `pip install` to the first successful real request, measured with a stopwatch.
- Task Success Rate (TSR) — share of users who solved the test task without assistance.
- System Usability Scale (SUS) in DevSUS variant — 10-question questionnaire adapted for SDKs. Result from 0 to 100, reference threshold ≥80.
- Think-aloud protocol — user narrates their progress; evaluator records points of getting stuck.
- Diary study (for P3) — collecting a maintainer's notes over 1–2 sprints of use.
Adapted Nielsen checklist for SDKs:
- Visibility of state. The user understands where a network call happens and where it does not.
- Match to domain language. Method names reflect domain operations, not HTTP.
- Freedom and control. Per-operation overrides exist, and there is an "escape hatch" to low-level primitives when needed.
- Consistency and standards. The SDK follows Python conventions (snake_case, PEP 8) and internal conventions throughout the surface.
- Error prevention. Incorrect input is caught before the network call (fail-fast).
- Recognition over recall. IDE autocomplete provides everything needed without consulting documentation.
- Flexibility and efficiency. A newcomer completes the happy path in minutes; an experienced user has access to fine-tuning.
- Minimalism. The public API contains only what is needed; everything else is hidden.
- Language of errors. Messages help diagnose and act, not merely state a fact.
- Help and documentation. All four Diátaxis modes are covered: tutorials, how-to, reference, explanations.
Behavior comparisons are made against reference Python SDKs:
- Stripe (`stripe-python`) — idempotency, auto-pagination, error mapping.
- Azure SDK for Python — client constructor, long-running operations, `ItemPaged`.
- Google Cloud Python — ADC, transparent pagination.
- boto3 — configuration, retries, sessions.
- OpenAI Python SDK — async/sync parity, streaming responses.
For each behavior, the methodology requires formulating an answer: "in the reference it is implemented via X, in the evaluated SDK via Y; the DX difference is Z".
| Level | Score | Indicator |
|---|---|---|
| Absent | 0% | Requirement not met; defect visible in the first minutes |
| Partial | 25% | Met in some places, violated elsewhere |
| Basic | 50% | Met in most cases, with systematic exceptions |
| Good | 75% | Met everywhere, rare minor deviations |
| Reference | 100% | Met, no signs of weakness, comparable to best SDKs |
Intermediate values are allowed with justification.
Score = Σ_i Σ_j ( weight_subcriterion_ij × grade_ij )

where grade_ij ∈ [0, 1], Σ weight_criterion = 100%, and within each criterion Σ weight_subcriterion = weight_criterion, so sub-criterion weights already carry their criterion's weight and are summed directly.
- ≥ 90% — reference DX, comparable to industry-leading SDKs.
- 75–89% — working tool, noticeable rough edges in 1–2 criteria.
- 60–74% — DX debt, planned refactoring needed.
- < 60% — architectural revision required before wide publication.
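A minimal sketch of the aggregation in Python. The data layout (a flat mapping of sub-criterion IDs to (weight, grade) pairs) is the evaluator's choice, not something the methodology prescribes, and the entries below are illustrative:

```python
# Minimal sketch of the Score aggregation (hypothetical data layout).
# Weights are absolute percentages; within each criterion they sum to
# the criterion weight, so a flat weighted sum yields Score directly.
grades = {
    # sub-criterion id: (weight in %, grade in [0, 1])
    "1.1": (1.0, 1.00),
    "1.2": (2.0, 0.75),
    "4.1": (3.0, 0.50),
    # ... remaining sub-criteria
}

score = sum(weight * grade for weight, grade in grades.values())
print(f"Score = {score:.1f}%")  # meaningful once all 18 criteria are filled in
```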
For each sub-criterion with grade < 0.75:
Pain = weight_subcriterion × (1 - grade) × persona_multiplier × cadence_multiplier
where:
- `persona_multiplier = 1.5` if the deficiency affects persona P1 (newcomer), otherwise 1.0 — newcomers leave an SDK fastest;
- `cadence_multiplier = 1.3` if the weakness manifests in everyday scenarios (written/read in every other line of user code), 1.0 otherwise.
The backlog is sorted by descending Pain. The top quartile is the target for the nearest sprint.
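The backlog ordering can be sketched the same way; the `findings` tuples below are illustrative, not real data:

```python
# Sketch of the Pain index and backlog ordering, using the two
# multipliers defined in §6 (hypothetical findings).
findings = [
    # (sub-criterion id, weight %, grade, affects P1, everyday scenario)
    ("1.2", 2.0, 0.50, True, True),
    ("6.3", 2.0, 0.25, False, True),
    ("18.2", 1.0, 0.50, False, False),
]

backlog = []
for sub_id, weight, grade, hits_p1, everyday in findings:
    if grade >= 0.75:
        continue  # only deficient sub-criteria enter the backlog
    pain = (
        weight
        * (1 - grade)
        * (1.5 if hits_p1 else 1.0)   # newcomers leave an SDK fastest
        * (1.3 if everyday else 1.0)  # everyday friction hurts more
    )
    backlog.append((pain, sub_id))

backlog.sort(reverse=True)  # top quartile = target for the nearest sprint
for pain, sub_id in backlog:
    print(f"{sub_id}: Pain = {pain:.2f}")
```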
Sum of weights = 100%.
| # | Criterion | Weight |
|---|---|---|
| 1 | Onboarding and Time-to-First-Call | 10% |
| 2 | Public API Discoverability | 8% |
| 3 | Naming and Consistency | 4% |
| 4 | Typing and IDE-friendliness | 10% |
| 5 | Data Modeling and Return Types | 6% |
| 6 | Error Handling and Actionability | 8% |
| 7 | Safety of Use | 6% |
| 8 | Reliability Under Unstable Network | 7% |
| 9 | Idempotency of Write Operations | 3% |
| 10 | Pagination and Collection Handling | 4% |
| 11 | Per-operation Overrides | 3% |
| 12 | Async/Sync Parity | 3% |
| 13 | Configuration and Fail-fast | 4% |
| 14 | API Contract Coverage | 4% |
| 15 | Documentation (Diátaxis) | 7% |
| 16 | Testability by User Code | 5% |
| 17 | Observability and Logging | 3% |
| 18 | Compatibility and Deprecation Policy | 5% |
Detailed descriptions of each criterion follow.
Goal. Measure the path from "learned about the SDK" to "saw the first result".
Key metric. TTFC (minutes).
| # | Sub-criterion | Weight | Methods | Tools |
|---|---|---|---|---|
| 1.1 | Installation with one command, no system dependencies | 1% | TT, DA | pip install in a clean venv, measure time and errors |
| 1.2 | README quickstart reproducible via copy-paste | 2% | TT, DA | Clean venv, run README examples, count manual edits |
| 1.3 | First successful call requires ≤5 lines of code | 2% | AC, TT | LoC count for the canonical scenario |
| 1.4 | Environment variables are documented, priority resolution is described | 1% | DA | Verify documentation against behavior when env vs .env conflict |
| 1.5 | Fail-fast on invalid configuration | 2% | FI | Remove required fields, verify: error before network, clear exception class, no secrets |
| 1.6 | Canonical TTFC ≤ 15 minutes for P1 | 2% | TT, TAL | Video recording of newcomer session, stopwatch |
Procedure.
- Spin up a clean venv, record exact time.
- Run `pip install <sdk>`; record installation time and warnings.
- Copy the quickstart to a file and run it. Each manual edit reduces score 1.2 by 25%.
- Measure TTFC until the first real network response.
- Intentionally remove one required config field, catch the exception, evaluate the message quality.
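Step 5 can be automated as a fault-injection test. A sketch assuming a hypothetical `ExampleClient` that reads `EXAMPLE_API_KEY` from the environment; substitute the SDK's real entry point, env var, and (ideally dedicated) configuration error type:

```python
import pytest
import respx

from example_sdk import ExampleClient  # hypothetical SDK under evaluation


@respx.mock
def test_fail_fast_on_missing_config(monkeypatch):
    # Hypothetical required env var; substitute the SDK's real one.
    monkeypatch.delenv("EXAMPLE_API_KEY", raising=False)

    with pytest.raises(Exception) as exc_info:  # ideally a dedicated ConfigError
        client = ExampleClient()
        client.items.list()  # hypothetical operation

    # The error must be raised before any HTTP traffic (sub-criterion 1.5).
    assert respx.calls.call_count == 0
    # The message should name the missing field without leaking secret values.
    assert "EXAMPLE_API_KEY" in str(exc_info.value)
```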
Symptoms of weak spots.
- README leads through a "full example" before a "minimal example".
- Required fields require creating intermediate objects.
- Configuration error masquerades as a network exception.
- `pip install` requires system packages without warning.
Recording template: "Canonical scenario — N lines, TTFC — M minutes, manual edits — K, first configuration error message — <quote>".
Goal. Measure how much effort is needed to find the right method without reading source code.
| # | Sub-criterion | Weight | Methods | Tools |
|---|---|---|---|---|
| 2.1 | Single entry point (facade) | 2% | AC, HE-2, CB | Comparison with Stripe/Azure: all scenarios from one root |
| 2.2 | Factory methods by domain area | 2% | AC | IDE autocomplete on the root client shows all domains |
| 2.3 | One obvious path for each operation | 2% | AC, DA | Search for duplicates in documentation and public names |
| 2.4 | Package structure navigation is predictable | 1% | AC, HE-4 | Sub-package names match API sections |
| 2.5 | Method names reflect business actions | 1% | AC, HE-2 | Absence of post_*/do_* style names |
Procedure.
- Give P1 a task formulated in domain language (e.g., "send message X to user Y"). Measure time to the correct method.
- Open the root SDK object in IDE, verify all domains appear in autocomplete.
- Collect list of public methods and check for duplicates: two access points to the same data — deduct score 2.3.
Symptoms.
- Two equivalent paths exist for one action without a deprecation marker.
- Sub-package names don't match familiar operation names.
- `find . -name "utils*.py"` finds files containing significant business logic.
Goal. Consistent naming style = less time to learn.
| # | Sub-criterion | Weight | Methods | Tools |
|---|---|---|---|---|
| 3.1 | Follows PEP 8 and Python conventions | 1% | SA | ruff, pep8-naming |
| 3.2 | Consistent operation prefixes (`get_`, `list_`, `create_`, ...) | 1% | AC, AM | Script grouping public methods by prefix |
| 3.3 | Domain-specific identifiers (no `resource_id`, `entity_id`) | 1% | AM | grep for abstract names in public signatures |
| 3.4 | Consistent parameter order in similar methods | 1% | AC | Visual comparison of signatures in the same family |
Procedure. Run `ruff` and `pep8-naming`; export the list of public methods to a CSV with columns prefix, nouns, params; analyze visually.
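A minimal sketch of the grouping script for 3.2; `example_sdk` is a placeholder package name:

```python
# Group public methods by their verb prefix (sub-criterion 3.2).
import inspect
from collections import defaultdict

import example_sdk  # hypothetical SDK under evaluation

prefixes = defaultdict(list)
for cls_name, cls in inspect.getmembers(example_sdk, inspect.isclass):
    for name, _ in inspect.getmembers(cls, callable):
        if name.startswith("_"):
            continue  # skip non-public members
        prefix = name.split("_", 1)[0]
        prefixes[prefix].append(f"{cls_name}.{name}")

# A long tail of one-off prefixes signals inconsistent naming.
for prefix, methods in sorted(prefixes.items(), key=lambda kv: -len(kv[1])):
    print(f"{prefix:>10}: {len(methods)}")
```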
Goal. Protect the user at the static layer; provide maximum hints in the IDE.
| # | Sub-criterion | Weight | Methods | Tools |
|---|---|---|---|---|
| 4.1 | Public API is fully typed | 3% | SA | mypy --strict, pyright --strict |
| 4.2 | No `Any`/`dict[str, Any]`/`object` in public signatures | 2% | SA, AM | grep signatures + mypy report |
| 4.3 | Runtime type matches annotation | 1% | AC, FI | Compare declared vs actual on collections (iterators, lazy) |
| 4.4 | Closed value sets expressed as enums | 2% | AC, DA | Cross-check with OpenAPI: fields with enum must be typed |
| 4.5 | IDE shows argument types, return types, and exceptions | 2% | HE-6, AC | Open 10 random public methods in IDE |
Procedure. Run `mypy --strict` and `pyright --strict` in CI and export the error list; grep for `: Any`, `-> Any`, `-> dict`; sample 10 methods for a manual IDE hover check.
Metric extraction: sub-criterion 4.1 — mypy exit code; 4.2 — grep hit count divided by the total number of public signatures; 4.5 — assessed via HE.
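A rough sketch of the 4.2 counter; `example_sdk` is a placeholder, and string matching on signatures is a heuristic, not a substitute for the mypy report:

```python
# Count public signatures that leak weak types (sub-criterion 4.2).
import inspect

import example_sdk  # hypothetical SDK under evaluation

total, hits = 0, 0
for _, cls in inspect.getmembers(example_sdk, inspect.isclass):
    for name, fn in inspect.getmembers(cls, inspect.isfunction):
        if name.startswith("_"):
            continue
        total += 1
        sig = str(inspect.signature(fn))
        if any(bad in sig for bad in ("Any", "-> dict", "-> object")):
            hits += 1

print(f"{hits}/{total} public signatures leak weak types")
```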
Goal. Users work with domain objects, not "raw" structures.
| # | Sub-criterion | Weight | Methods |
|---|---|---|---|
| 5.1 | Public methods return typed models, not dictionaries | 2% | AC, SA |
| 5.2 | Model fields have concrete domain names | 1% | AC |
| 5.3 | `required` and `Optional` are expressed explicitly and match the spec | 1% | DA |
| 5.4 | Uniform serialization (`to_dict`/`model_dump`) for all public models | 1% | AC, AM |
| 5.5 | Transport fields are hidden from public models | 1% | AC |
Procedure. Sample 20 public models; for each: serialize to JSON using a standard function, verify absence of internal fields, cross-check nullability against the specification.
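A sketch of the per-model check; the `Message` model, its fields, and the forbidden-field markers are hypothetical, and the serializer name (`model_dump` vs `to_dict`) depends on the SDK:

```python
# Serialize a sample model and check that transport internals stay hidden.
import json

from example_sdk.models import Message  # hypothetical public model

FORBIDDEN = {"_raw_response", "_client", "headers"}  # transport leakage markers

model = Message(id="m1", text="hello")
# Prefer the SDK's documented serializer; pydantic v2 models use model_dump().
data = model.model_dump() if hasattr(model, "model_dump") else model.to_dict()

json.dumps(data)  # must not require custom encoders (see also 16.3)
assert not FORBIDDEN & set(data), f"transport fields leaked: {FORBIDDEN & set(data)}"
```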
Goal. Errors help diagnose and act.
| # | Sub-criterion | Weight | Methods | Tools |
|---|---|---|---|---|
| 6.1 | Domain exception hierarchy covers significant HTTP statuses | 2% | FI | pytest-httpserver with sequence 400/401/403/404/409/422/429/5xx |
| 6.2 | Semantically distinct errors are not related by inheritance | 1% | AC | Exception graph analysis |
| 6.3 | Error messages contain operation, status, request-id, attempt | 2% | FI, AC | Capture exception, check attributes |
| 6.4 | Messages are actionable (what happened + what to do) | 1% | HE-9 | Sample 20 errors, evaluate against actionability checklist |
| 6.5 | Parse errors and transport errors are distinguishable by type | 1% | FI | Submit malformed JSON vs network drop |
| 6.6 | Probe methods return `bool`, don't throw | 1% | FI, AC | Check exists-style behavior on 404 |
Procedure. Build a matrix (status × failure type) → expected exception class. Submit each case through pytest-httpserver or respx. Record mismatches. For 6.4, apply the actionability checklist: (a) what happened, (b) where, (c) what to do, (d) where to get help.
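One row family of the matrix can be expressed as a parametrized respx test. `ExampleClient` and the classes in `example_sdk.errors` are hypothetical placeholders, and retries are assumed disabled (or already exhausted) so the mapping is observed directly:

```python
import httpx
import pytest
import respx

from example_sdk import ExampleClient
from example_sdk import errors  # hypothetical domain exception module


@pytest.mark.parametrize(
    "status, expected",
    [
        (401, errors.AuthError),
        (404, errors.NotFoundError),
        (429, errors.RateLimitError),
        (500, errors.ServerError),
    ],
)
@respx.mock
def test_status_maps_to_domain_exception(status, expected):
    respx.get("https://api.example.com/items").mock(
        return_value=httpx.Response(status, json={"error": "boom"})
    )
    client = ExampleClient(api_key="test", base_url="https://api.example.com")
    with pytest.raises(expected):
        client.items.list()
```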
Goal. Users must not leak secrets through the SDK.
| # | Sub-criterion | Weight | Methods | Tools |
|---|---|---|---|---|
| 7.1 | Secrets do not appear in logs at any level | 2% | LA | Intercept logging, regex check against markers Bearer, client_secret=, Authorization |
| 7.2 | Secrets do not appear in exception messages | 1% | FI | Provoke an error with a known secret value in a parameter |
| 7.3 | Client diagnostic snapshot is safe by default | 1% | AC, FI | Call a "diagnostic" function, check content |
| 7.4 | Serialization of public models does not expose internal fields | 1% | AC | json.dumps(model.to_dict()) on a model sample |
| 7.5 | bandit reports no high-severity findings in public modules | 1% | SA | `bandit -r <package>` |
Procedure. Set up a test that intercepts logging at DEBUG level, run 10 typical scenarios, grep output against secret regexes. For 7.5 — CI integration of bandit.
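A sketch of the interception test using pytest's `caplog` fixture; the client, operation, and secret markers are placeholders:

```python
# Intercept DEBUG logs and grep for secret markers (sub-criterion 7.1).
import logging
import re

import httpx
import respx

from example_sdk import ExampleClient  # hypothetical SDK under evaluation

SECRET_PATTERNS = [r"Bearer\s+\S+", r"client_secret=\S+", r"Authorization"]


@respx.mock
def test_no_secrets_in_logs(caplog):
    respx.get("https://api.example.com/items").mock(
        return_value=httpx.Response(200, json=[])
    )
    caplog.set_level(logging.DEBUG)  # capture everything the SDK emits

    client = ExampleClient(api_key="sk-very-secret", base_url="https://api.example.com")
    client.items.list()

    for pattern in SECRET_PATTERNS:
        assert not re.search(pattern, caplog.text), f"leak matches {pattern!r}"
    assert "sk-very-secret" not in caplog.text  # the literal key must never appear
```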
Goal. The SDK survives network failures without user intervention.
| # | Sub-criterion | Weight | Methods | Tools |
|---|---|---|---|---|
| 8.1 | Retries only on safe scenarios | 2% | FI | Sequences [500, 500, 200], [429, 200], [timeout, 200], [POST, 500] |
| 8.2 | Exponential backoff with jitter | 1% | FI, AM | freezegun + measure delays between attempts, check variance |
| 8.3 | `Retry-After` is respected on 429 | 1% | FI | Server returns 429 with `Retry-After: 3`, check actual delay |
| 8.4 | After retries exhausted — domain exception, not transport exception | 1% | FI | [500, 500, 500, 500] → not httpx.*, but a domain type |
| 8.5 | Connect/read/write timeouts are configured explicitly | 1% | AC | Read transport constructor |
| 8.6 | Retry/timeout policy is configurable | 1% | AC, CB | Compare with stripe-python and Azure SDK |
Procedure. Connect toxiproxy or a mock server with programmable delay; for 8.2 — ≥5 attempts, measure stdev(delays) > 0; for 8.3 — compare actual delay with Retry-After ± 20%.
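A sketch of the 8.3 check; it assumes the SDK sleeps between attempts via `time.sleep` (if it uses its own clock abstraction, patch that instead), and all names are hypothetical:

```python
# Record the delays the SDK schedules between retry attempts.
import time

import httpx
import respx

from example_sdk import ExampleClient  # hypothetical SDK under evaluation


@respx.mock
def test_retry_after_is_respected(monkeypatch):
    delays = []
    monkeypatch.setattr(time, "sleep", lambda s: delays.append(s))

    respx.get("https://api.example.com/items").mock(
        side_effect=[
            httpx.Response(429, headers={"Retry-After": "3"}),
            httpx.Response(200, json=[]),
        ]
    )
    client = ExampleClient(api_key="test", base_url="https://api.example.com")
    client.items.list()

    # Actual delay must be within ±20% of the advertised Retry-After.
    assert delays and abs(delays[0] - 3.0) <= 0.6
```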
Goal. A safe repeat of a write call does not create duplicates.
| # | Sub-criterion | Weight | Methods |
|---|---|---|---|
| 9.1 | Public write methods accept `idempotency_key` | 1% | AC, CB (reference — Stripe) |
| 9.2 | The same key is used throughout the entire retry chain | 1% | FI |
| 9.3 | Absence of key = no retry on non-idempotent statuses | 1% | FI |
Procedure. Scenario [transport_error, 200] for POST: with key — two requests with the same header; without key — one attempt and exception.
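A sketch of the with-key half of the scenario; the header name `Idempotency-Key` and the client surface are assumptions to substitute with the SDK's actual contract:

```python
# The same idempotency key must appear on every attempt (sub-criterion 9.2).
import httpx
import respx

from example_sdk import ExampleClient  # hypothetical SDK under evaluation


@respx.mock
def test_idempotency_key_survives_retry():
    route = respx.post("https://api.example.com/items").mock(
        side_effect=[
            httpx.Response(500),           # first attempt fails
            httpx.Response(201, json={}),  # retry succeeds
        ]
    )
    client = ExampleClient(api_key="test", base_url="https://api.example.com")
    client.items.create(name="x", idempotency_key="key-123")

    assert route.call_count == 2
    keys = {call.request.headers.get("Idempotency-Key") for call in route.calls}
    assert keys == {"key-123"}  # one key across the whole retry chain
```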
Goal. Large collections don't require manual page stitching.
| # | Sub-criterion | Weight | Methods |
|---|---|---|---|
| 10.1 | Lazy container returned by default | 1% | AC, FI |
| 10.2 | Iterating first N elements → `ceil(N/page)` requests | 1% | FI, AM |
| 10.3 | Explicit full materialization without repeated requests | 1% | FI |
| 10.4 | Empty collection makes no extra requests; page N error propagates when reading N | 1% | FI |
Procedure. Mock server with N pages: count HTTP calls for a slice [:k], for full materialization, for empty response, for error on page 3.
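A sketch of the slice check for 10.2; the cursor protocol, the page size of 10, and the client surface are all hypothetical:

```python
# Slicing a lazy collection must fetch only ceil(N/page_size) pages.
import itertools

import httpx
import respx

from example_sdk import ExampleClient  # hypothetical SDK under evaluation

PAGES = {None: (list(range(10)), "c1"), "c1": (list(range(10, 20)), None)}


@respx.mock
def test_slice_fetches_minimum_pages():
    def paged(request):
        cursor = request.url.params.get("cursor") or None
        items, next_cursor = PAGES[cursor]
        return httpx.Response(200, json={"items": items, "next": next_cursor})

    route = respx.get("https://api.example.com/items").mock(side_effect=paged)
    client = ExampleClient(api_key="test", base_url="https://api.example.com")

    first_five = list(itertools.islice(client.items.list(), 5))
    assert len(first_five) == 5
    assert route.call_count == 1  # 5 items, page size 10 → exactly one request
```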
Goal. Flexibility at the individual call level without mutating the client.
| # | Sub-criterion | Weight | Methods |
|---|---|---|---|
| 11.1 | `timeout` can be overridden at the method level | 1% | AC, FI |
| 11.2 | Retry policy can be disabled/strengthened for a single call | 1% | AC, FI |
| 11.3 | Override does not mutate the client and does not leak between calls | 1% | FI |
Procedure. Pass timeout=0.001 to one call, verify that a neighboring call with the default timeout is not affected.
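A sketch of the no-leak check for 11.3; the `client.timeout` introspection point is an assumption, and any equivalent way to observe the client's default configuration works:

```python
# A per-call override must not mutate client state (sub-criterion 11.3).
import httpx
import respx

from example_sdk import ExampleClient  # hypothetical SDK under evaluation


@respx.mock
def test_override_does_not_leak():
    respx.get("https://api.example.com/items").mock(
        return_value=httpx.Response(200, json=[])
    )
    client = ExampleClient(api_key="test", base_url="https://api.example.com")
    default_timeout = client.timeout  # hypothetical introspection point

    client.items.list(timeout=0.001)  # per-call override
    client.items.list()               # neighbouring call with defaults

    assert client.timeout == default_timeout  # the override did not stick
```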
Applicable only if the SDK declares an async surface; otherwise the criterion is disabled and the weight is redistributed proportionally.
| # | Sub-criterion | Weight | Methods |
|---|---|---|---|
| 12.1 | Async and sync live in separate namespaces | 1% | AC, CB |
| 12.2 | Signatures are identical (differ only by `async`/`await`) | 1% | AC, AM |
| 12.3 | Feature parity within a single release | 1% | DA |
| # | Sub-criterion | Weight | Methods |
|---|---|---|---|
| 13.1 | Configuration from environment, explicit object, and direct arguments — all three paths work | 1% | AC, TT |
| 13.2 | Priority resolution (env > .env > defaults) is deterministic and documented | 1% | FI, DA |
| 13.3 | Missing required fields are caught before the first HTTP call | 1% | FI |
| 13.4 | Generic env names (`TOKEN`, `SECRET`) are not official aliases | 1% | DA |
| # | Sub-criterion | Weight | Methods |
|---|---|---|---|
| 14.1 | All operations in the upstream spec have a public method | 2% | AM, DA |
| 14.2 | Field names, nullability, and enums match the spec | 1% | DA |
| 14.3 | Deprecated operations are explicitly marked with deprecation | 1% | AC, DA |
Procedure. Extract the list of operationId from OpenAPI/Swagger; map to public methods via a correspondence table; cross-check fields against schema on a sample of 20 models.
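A sketch of the operationId extraction; the manual `mapping` table is the evaluator's artifact, and the two entries shown are illustrative:

```python
# Compare spec operationIds with the public surface (sub-criterion 14.1).
import json

spec = json.load(open("openapi.json"))  # the upstream spec file

operation_ids = {
    op["operationId"]
    for path in spec["paths"].values()
    for op in path.values()
    if isinstance(op, dict) and "operationId" in op
}

# Hypothetical correspondence table, maintained by the evaluator:
# operationId -> public SDK method (dotted path), filled in manually.
mapping = {"listItems": "client.items.list", "createItem": "client.items.create"}

uncovered = operation_ids - set(mapping)
print(f"coverage: {len(mapping)}/{len(operation_ids)}; missing: {sorted(uncovered)}")
```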
| # | Sub-criterion | Weight | Methods | Tools |
|---|---|---|---|---|
| 15.1 | Tutorial (step-by-step from zero to first success) is present | 1.5% | DA, TT | Run P1 persona through the tutorial |
| 15.2 | How-to guides for typical tasks | 1.5% | DA | Inventory of ready-made recipes |
| 15.3 | Reference covers the public API | 1.5% | AM | interrogate — docstring coverage |
| 15.4 | Explanations for key concepts (transport, retry, pagination) | 1% | DA | |
| 15.5 | CHANGELOG is maintained, linked to releases | 0.5% | DA | |
| 15.6 | Documentation examples are executable | 1% | TT | Copy-paste all snippets into a venv |
Procedure. Build a Diátaxis matrix (4 columns, documentation sections as rows). Each section belongs to exactly one cell. Empty columns = failure.
| # | Sub-criterion | Weight | Methods |
|---|---|---|---|
| 16.1 | Public fake/mock transport is available from a documented namespace | 2% | AC, DA |
| 16.2 | Mock contract is documented: response script, call inspection, error injection | 1% | DA |
| 16.3 | Public models serialize stably via `json.dumps` without custom encoders | 1% | AM |
| 16.4 | Context manager closes resources; calling after close gives a clear error | 1% | FI |
Procedure. Write a "canonical consumer test": create a client with the public mock, run 3 scenarios, serialize results to JSON, close the client, attempt to call a method — record behavior.
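A sketch of the canonical consumer test; the `example_sdk.testing.MockTransport` namespace and its constructor contract are hypothetical placeholders for whatever the SDK actually documents (16.1/16.2):

```python
import json

from example_sdk import ExampleClient
from example_sdk.testing import MockTransport  # hypothetical public mock


def test_canonical_consumer():
    # Scripted responses: the mock contract is assumed, not prescribed.
    transport = MockTransport(responses=[{"id": "m1"}, {"id": "m2"}, {"id": "m3"}])
    client = ExampleClient(api_key="test", transport=transport)

    results = [client.items.get(i) for i in ("m1", "m2", "m3")]
    json.dumps([r.to_dict() for r in results])  # stable serialization (16.3)

    client.close()
    try:
        client.items.get("m1")  # expected: a clear "client closed" error (16.4)
    except Exception as exc:  # record the observed behavior as evidence
        print(f"call after close raised: {type(exc).__name__}: {exc}")
```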
| # | Sub-criterion | Weight | Methods | Tools |
|---|---|---|---|---|
| 17.1 | SDK uses standard `logging` under a named logger | 1% | AC, LA | |
| 17.2 | Structured fields: operation, endpoint, status, attempt, latency | 1% | LA | |
| 17.3 | Every network call leaves at least one record at `DEBUG` | 1% | LA | |
Procedure. Enable logging at DEBUG, run 5 typical scenarios, export records as JSON lines, verify field presence and absence of secrets.
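A sketch of the 17.2/17.3 verification; the logger name `example_sdk` and the structured-field names are assumptions:

```python
# Verify structured fields on SDK log records.
import logging

import httpx
import respx

from example_sdk import ExampleClient  # hypothetical SDK under evaluation

REQUIRED_FIELDS = {"operation", "endpoint", "status", "attempt", "latency"}


@respx.mock
def test_log_records_carry_structured_fields(caplog):
    respx.get("https://api.example.com/items").mock(
        return_value=httpx.Response(200, json=[])
    )
    caplog.set_level(logging.DEBUG, logger="example_sdk")

    ExampleClient(api_key="test", base_url="https://api.example.com").items.list()

    sdk_records = [r for r in caplog.records if r.name.startswith("example_sdk")]
    assert sdk_records, "no DEBUG record for a network call (17.3 fails)"
    missing = REQUIRED_FIELDS - set(vars(sdk_records[-1]))
    assert not missing, f"missing structured fields: {missing}"
```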
| # | Sub-criterion | Weight | Methods |
|---|---|---|---|
| 18.1 | Semver is followed (breaking → major, additive → minor, fix → patch) | 1% | DA |
| 18.2 | Deprecation period is explicit, at least two minor releases | 1% | DA |
| 18.3 | Deprecated symbols emit `DeprecationWarning` with a link to the replacement | 1% | AC, FI |
| 18.4 | CHANGELOG contains sections Added/Changed/Deprecated/Removed/Fixed | 1% | DA |
| 18.5 | Public renames always go through an alias with a warning | 1% | AC, DA |
Procedure. Take the last 3–5 releases; for each — match the nature of changes against the versioning. Import deprecated symbols, verify DeprecationWarning.
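A sketch of the 18.3 check; the deprecated method name is a placeholder:

```python
# Deprecated symbols must warn and point to the replacement.
import pytest

from example_sdk import ExampleClient  # hypothetical SDK under evaluation


def test_deprecated_symbol_warns():
    client = ExampleClient(api_key="test")
    with pytest.warns(DeprecationWarning, match="use .* instead"):
        client.items.old_list()  # hypothetical deprecated method
```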
Recommended duration for a full cycle — 1–2 days per evaluator; 3–4 days with user participation.
- Clean venv, fix SDK SHA/version.
- Install tools from §5.
- Create a working folder `evaluation/<sha>/` for artifacts.
Run sequentially and save output:
- `ruff check`, `ruff format --check`
- `mypy --strict` or `pyright --strict`
- `bandit -r`
- `vulture`
- `interrogate`
- a custom script comparing the upstream spec with the public API
Use this output to grade all sub-criteria whose methods include SA or AM.
- Build the Diátaxis matrix.
- Cross-check the list of env variables in documentation against actual behavior.
- Review CHANGELOG for recent releases.
- Verify executability of all documentation snippets.
- Spin up `pytest-httpserver` or `respx`.
- Run the matrices: (status × scenario) → expected behavior.
- Record results in a table.
- P1 receives the canonical onboarding task.
- P2 receives the task "write a test on top of the SDK".
- P3 (optional; via a diary study if a real maintainer is available) records version-upgrade incidents.
- Screen recording, think-aloud protocol.
Using 10 DX heuristics (§5.5). Each heuristic produces a short observation + reference to affected sub-criteria.
Select 2–3 reference SDKs from §5.6. Record divergences on the same scenarios.
- Fill in the summary table `criterion × subcriterion × grade × evidence`.
- Calculate `Score`.
- Calculate `Pain` for each sub-criterion with `grade < 0.75`.
- Sort the backlog.
- Format the report following the template in §27.
Fixed structure to make results comparable across releases.
- Score: `XX%` (category: reference / working / debt / refactoring).
- Evaluation date, SDK version, evaluator name.
- Top 3 findings in 1–2 sentences.
Columns: #, criterion, weight, weighted-average score, contribution to Score, trend (↑/↓/→ relative to previous evaluation).
For each criterion:
- Sub-criterion scores with brief justification.
- Evidence — a link to a log, screenshot, quoted message, or session recording. Do not reference concrete filenames from source code: describe the behavior, not its location.
- What will earn +25% in the next evaluation.
Table: #, observation, sub-criterion, Pain, estimated fix effort, recommendation.
- Raw tool output (`ruff.txt`, `mypy.txt`, `bandit.json`).
- FI matrix with expected and observed reactions.
- Persona session transcripts (de-identified).
- DevSUS questionnaires.
The methodology is considered valid if:
- Reproducibility. Two independent evaluators produce `Score` within ±5% on the same revision.
- Pain-ranking stability. The top-10 sub-criteria by `Pain` agree between sessions ≥80%.
- Automation. Each sub-criterion has at least one objective data source (SA/FI/DA/AM/CB). No purely subjective sub-criteria.
- Predictability. Implementing all backlog recommendations produces a `Score` increase no less than the sum of weights of the corresponding sub-criteria.
- Portability. The methodology applies to any Python SDK without changes, except for disabling criterion 12 (async/sync parity) for synchronous SDKs.
- Quarterly for stable SDKs.
- Before every major release as a release gate.
- After any architectural rework of the public API.
- After negative user feedback appears — targeted evaluation of the criteria indicated by the feedback.
Results — Score, delta, Pain-top — are published in release notes as part of the release quality description.