Triage pipeline letting near-duplicate sidewalk topics through (7 sidewalk topics, 2 merge clusters) #90

@AndreRobitaille

Observation

While backfilling TopicBriefing records for the homepage headline voice rewrite (spec: docs/superpowers/specs/2026-04-10-homepage-headline-voice-design.md), I noticed multiple sidewalk-related topics that should have been merged or aliased by the triage pipeline but weren't. There are 7 sidewalk topics on prod, split into three clusters.

Cluster 1: Physical sidewalk maintenance (2 duplicates)

| ID | Name | Status | Impact | Appearances | Description |
| --- | --- | --- | --- | --- | --- |
| 470 | sidewalk program | approved | 4 | 6 | Plans for building, repairing, and maintaining neighborhood sidewalks |
| 520 | sidewalk trip hazard repairs | approved | 4 | 2 | Fix minor sidewalk trip hazards with saw-cutting to improve safety and ADA |

520 is a specific program inside the broader 470 topic. Both are surfacing on the homepage at the same impact score, creating visible card redundancy.

Proposed merge: sidewalk trip hazard repairs (520) → sidewalk program (470).

Cluster 2: Snow clearing / shoveling enforcement (3 approved duplicates + 1 blocked)

| ID | Name | Status | Impact | Appearances |
| --- | --- | --- | --- | --- |
| 530 | sidewalk snow clearing enforcement | approved | | 1 |
| 579 | sidewalk snow removal rules | approved | 4 | 1 |
| 559 | sidewalk snow shoveling enforcement | approved | | 2 |
| 166 | sidewalk snow shoveling | blocked | | 3 |

Three approved topics cover the identical civic concern (rules and enforcement for property owners clearing snow from sidewalks in winter). Only 166 was caught and blocked.

Proposed merge: 530 and 579 → 559 (the most appearances among the approved topics). Keep 166 blocked as-is.

Cluster 3: Correctly blocked

  • [536] ebikes on sidewalks — blocked, 1 appearance. Correct outcome.

The underlying issue

The triage pipeline is letting close duplicates through. Specifically:

  • The auto-triage blocklist matching is name-based, so it misses near-duplicates whose names differ in wording or word order
  • Topic alias auto-matching isn't kicking in for obvious synonyms like "snow clearing" vs "snow shoveling" vs "snow removal"
  • There's no post-creation near-duplicate check against existing approved topics
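To make the synonym and word-order gap concrete, here is a minimal sketch in plain Ruby. It is not how the triage pipeline works today: `SYNONYMS`, `STOPWORDS`, `canonical_tokens`, and `name_similarity` are all hypothetical names invented for illustration. Names are compared as sets of canonical tokens, so word order stops mattering and the listed snow synonyms collapse to one term.

```ruby
require "set"

# Hypothetical synonym table (an assumption for this sketch, not existing code).
SYNONYMS = {
  "clearing"  => "removal",
  "shoveling" => "removal"
}.freeze

# Assumed list of low-signal words to ignore when comparing names.
STOPWORDS = Set.new(%w[on of the and for]).freeze

# Reduce a topic name to a set of canonical tokens so that word order
# and the listed synonyms no longer affect the comparison.
def canonical_tokens(name)
  name.downcase.scan(/[a-z0-9]+/)
      .reject { |t| STOPWORDS.include?(t) }
      .map { |t| SYNONYMS.fetch(t, t) }
      .to_set
end

# Jaccard similarity: shared tokens divided by all distinct tokens.
def name_similarity(a, b)
  ta = canonical_tokens(a)
  tb = canonical_tokens(b)
  return 0.0 if ta.empty? && tb.empty?

  (ta & tb).size.to_f / (ta | tb).size
end

# 530 vs 559: identical once the synonyms are folded together.
puts name_similarity("sidewalk snow clearing enforcement",
                     "sidewalk snow shoveling enforcement") # 1.0
# 579 vs 559: three of five distinct tokens are shared.
puts name_similarity("sidewalk snow removal rules",
                     "sidewalk snow shoveling enforcement") # 0.6
```

A hand-maintained synonym table only covers the cases someone thought to list. It is shown here because it is the smallest change that would have caught Cluster 2, not because it is necessarily the right long-term approach.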

Worth a broader look at whether:

  1. The triage pipeline should run a similarity check against all approved topics when creating a new one
  2. The ExtractTopicsJob should be more aggressive about reusing existing approved topics instead of creating new ones
  3. There should be a periodic "find near-duplicates" job that proposes merges for admin review
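As a rough illustration of option 3, the sketch below groups approved topics by name overlap and proposes merging each group into its member with the most appearances. Everything in it is an assumption: the `Topic` struct is an in-memory stand-in rather than the app's model, `propose_merges` and the 0.3 threshold are invented for the example, and the appearance counts are as I read them from the tables above.

```ruby
require "set"

# Hypothetical in-memory stand-in for an approved topic record.
Topic = Struct.new(:id, :name, :appearances)

def name_tokens(name)
  name.downcase.scan(/[a-z0-9]+/).to_set
end

def jaccard(a, b)
  union = a | b
  return 0.0 if union.empty?

  (a & b).size.to_f / union.size
end

# Group topics whose names overlap at or above the threshold, then propose
# merging each group into its member with the most appearances.
# The output is a list of candidates for admin review; nothing is merged here.
def propose_merges(topics, threshold: 0.3)
  # parent maps each topic id to its group representative (simple union-find).
  parent = topics.to_h { |t| [t.id, t.id] }
  find = lambda do |id|
    id = parent[id] while parent[id] != id
    id
  end

  topics.combination(2).each do |a, b|
    next if jaccard(name_tokens(a.name), name_tokens(b.name)) < threshold

    parent[find.call(a.id)] = find.call(b.id)
  end

  topics.group_by { |t| find.call(t.id) }.values
        .select { |group| group.size > 1 }
        .map do |group|
          # Ties on appearances go to the lower id.
          target = group.max_by { |t| [t.appearances, -t.id] }
          { target_id: target.id, source_ids: (group - [target]).map(&:id).sort }
        end
end

approved = [
  Topic.new(470, "sidewalk program", 6),
  Topic.new(520, "sidewalk trip hazard repairs", 2),
  Topic.new(530, "sidewalk snow clearing enforcement", 1),
  Topic.new(579, "sidewalk snow removal rules", 1),
  Topic.new(559, "sidewalk snow shoveling enforcement", 2)
]

proposals = propose_merges(approved)
p proposals
```

On this data the sketch proposes 530 and 579 → 559, which matches the Cluster 2 merge. It does not flag Cluster 1, because "sidewalk program" and "sidewalk trip hazard repairs" share only one token. Catching that pair would likely need a semantic comparison rather than token overlap.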

Suggested fixes (this issue)

  1. Manually merge Cluster 1: `520 → 470`
  2. Manually merge Cluster 2: `530 → 559`, `579 → 559`
  3. Re-run `Topics::GenerateTopicBriefingJob.perform_now` on 470 and 559 after the merges so the briefings pick up the combined appearance history. Note: if the merges happen after the backfill completes, these re-runs will fall outside the current headline-voice backfill scope.

Follow-up work (separate issue suggested)

  • Investigate adding a similarity check to `Topics::TriageTool` or `Topics::FindOrCreateService`
  • Consider a periodic duplicate-detection job that surfaces candidate merges in the admin UI

Context

Related work: #89 (homepage cards were showing content-thin meetings), #76 (resolver prefers specific sub-items), and the current headline voice rewrite (not yet tracked as an issue). The homepage headline backfill is in progress as of this writing (~18 of 53 topics done).

Found via: manual inspection of homepage cards during the post-prompt-rewrite backfill on 2026-04-11.
