Event identifier hash uses date-only startDate, causing same-day same-venue collisions #219

@chubes4

Summary

EventIdentifierGenerator::generate() is being called with a date-only startDate (e.g. 2026-04-26) instead of the full datetime (2026-04-26 13:30:00), so two events with the same title at the same venue on the same day collide on the same hash. The collision marks the second event as a duplicate, and dedup kills it forever.

This is silently dropping recurring shows network-wide.

Concrete reproduction

Charleston Pour House Tribe Events API returns 12 events for Apr 25 - May 1. Three of those events are missing from our database:

Date        Title                    Stage  In DB?
Sun Apr 26  Motown Throwdown         Deck   No
Tue Apr 28  Guilt Ridden Troubadour  Deck   No
Wed Apr 29  The Reckoning            Deck   No

All 9 other events for the week are correctly captured. Pour House has two stages (Deck + Main), so multiple events on the same day are normal.

The flow runs 6× daily and returns completed_no_items because the dedup says "already processed."

Root cause trace

Step 1: Tribe API returns full datetime

$ curl 'https://charlestonpourhouse.com/wp-json/tribe/events/v1/events?per_page=100' \
    | jq '.events[] | select(.title=="Motown Throwdown") | {title, start_date, slug}'
{
  "title": "Motown Throwdown",
  "start_date": "2026-04-26 13:30:00",       ← full datetime, correct
  "slug": "motown-throwdown-3-2026-04-26"
}

Step 2: WordPressExtractor splits date and time into separate fields

inc/Steps/EventImport/Handlers/WebScraper/Extractors/WordPressExtractor.php:198-237

private function mapTribeV1Event( array $event, string $source_url ): array {
    $start_date = '';
    $start_time = '';
    // ...
    if ( ! empty( $event['start_date'] ) ) {
        $parsed     = $this->parseDatetime( $event['start_date'] );
        $start_date = $parsed['date'];   // ← '2026-04-26'
        $start_time = $parsed['time'];   // ← '13:30'
    }
    // ...
    return array(
        // ...
        'startDate' => $start_date,      // ← date only
        'startTime' => $start_time,      // ← time, separate
        // ...
    );
}

This separation is fine on its own — the upsert layer needs them split anyway.

Step 3: StructuredDataProcessor only feeds startDate to the hash

inc/Steps/EventImport/Handlers/WebScraper/StructuredDataProcessor.php:73-77

$event_identifier = EventIdentifierGenerator::generate(
    $event['title'],
    $event['startDate'] ?? '',   // ← '2026-04-26' (NO TIME)
    $event['venue'] ?? ''
);

$event['startTime'] is right there in the same array, but never makes it into the hash.

Step 4: Hash collisions on same-day same-venue events

md5( "Motown Throwdown" + "2026-04-26" + "Charleston Pour House" )
  → c0741f83bbeafa73c59dba405f1eba0f

md5( "MJT" + "2026-04-26" + "Charleston Pour House" )
  → some other hash (titles differ — fine)

md5( "Motown Throwdown" + "2026-04-26" + "Charleston Pour House" )  // 2nd time on later run
  → c0741f83bbeafa73c59dba405f1eba0f                                  ← SAME HASH → dedup kills it

So the immediate Pour House symptom isn't actually a collision between different titles on the same day. It comes from the next-layer-up bug interacting with this one (the processed_items lifecycle bug described under "Why this matters"). But this date-only hash IS the underlying weakness that makes the dedup unable to recover.
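
The collision itself can be reproduced in isolation. The sketch below uses a hypothetical stand-in, sketch_identifier(), not the real EventIdentifierGenerator::generate() (its normalization is not shown in this issue), so the hashes it produces will not match the ones quoted here. The 20:00 showing is also made up for illustration. It only demonstrates that a hash built from title, date and venue cannot tell two showtimes on the same day apart.

```php
<?php
// Hypothetical stand-in for EventIdentifierGenerator::generate().
// Assumption: the real method normalizes its inputs differently, so these
// hashes will NOT match the ones quoted in this issue.
function sketch_identifier( string $title, string $start, string $venue ): string {
    $parts = array_map(
        static fn( string $p ): string => strtolower( trim( $p ) ),
        array( $title, $start, $venue )
    );
    return md5( implode( '|', $parts ) );
}

// Two showings of the same title at the same venue on one day.
// The 20:00 showing is hypothetical.
$early = array( 'title' => 'Motown Throwdown', 'startDate' => '2026-04-26', 'startTime' => '13:30', 'venue' => 'Charleston Pour House' );
$late  = array( 'title' => 'Motown Throwdown', 'startDate' => '2026-04-26', 'startTime' => '20:00', 'venue' => 'Charleston Pour House' );

// Date-only input: both showings get the same identifier.
$date_only_collides =
    sketch_identifier( $early['title'], $early['startDate'], $early['venue'] )
    === sketch_identifier( $late['title'], $late['startDate'], $late['venue'] );

// Date plus time: the identifiers differ.
$with_time_collides =
    sketch_identifier( $early['title'], $early['startDate'] . ' ' . $early['startTime'], $early['venue'] )
    === sketch_identifier( $late['title'], $late['startDate'] . ' ' . $late['startTime'], $late['venue'] );

var_dump( $date_only_collides ); // bool(true)
var_dump( $with_time_collides ); // bool(false)
```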

Verification

// Reproducing what the scraper hashes
EventIdentifierGenerator::generate('Motown Throwdown', '2026-04-26', 'Charleston Pour House')
// → c0741f83bbeafa73c59dba405f1eba0f  ← THIS hash is in c8c_7_datamachine_processed_items ✓

EventIdentifierGenerator::generate('Motown Throwdown', '2026-04-26 13:30:00', 'Charleston Pour House')
// → 32917cd554a43193972a412249a3c308  ← NOT in processed_items

Confirmed via direct database query against flow_step_id = '3_b27b1934-40c7-478f-9586-e474bd7d84ad_9'.

Suggested fix

Option A — pass startTime through to the hash (minimal)

// StructuredDataProcessor.php:73-77
$event_identifier = EventIdentifierGenerator::generate(
    $event['title'],
    trim( ( $event['startDate'] ?? '' ) . ' ' . ( $event['startTime'] ?? '' ) ),
    $event['venue'] ?? ''
);

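This passes '2026-04-26 13:30' instead of '2026-04-26', so the hash becomes time-aware with no signature change to EventIdentifierGenerator::generate(). Note that this string lacks the seconds used in the Verification call ('2026-04-26 13:30:00'), so unless generate() normalizes the datetime it will not reproduce that exact hash.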
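
To see how the trim() expression in Option A behaves when the time is empty or missing, here is a small sketch. The helper name sketch_start_for_hash() is hypothetical and exists only for this illustration.

```php
<?php
// Hypothetical helper that mirrors the trim() expression in Option A.
function sketch_start_for_hash( array $event ): string {
    return trim( ( $event['startDate'] ?? '' ) . ' ' . ( $event['startTime'] ?? '' ) );
}

$timed   = sketch_start_for_hash( array( 'startDate' => '2026-04-26', 'startTime' => '13:30' ) );
$no_time = sketch_start_for_hash( array( 'startDate' => '2026-04-26', 'startTime' => '' ) );
$missing = sketch_start_for_hash( array( 'startDate' => '2026-04-26' ) );
$empty   = sketch_start_for_hash( array() );

var_dump( $timed );   // string(16) "2026-04-26 13:30"
var_dump( $no_time ); // string(10) "2026-04-26"
var_dump( $missing ); // string(10) "2026-04-26"
var_dump( $empty );   // string(0) ""
```

Events without a start time keep feeding the same string into the hash as today, so their identifiers stay stable. Events with a start time get a new identifier, which means their existing processed_items entries will no longer match.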

Option B — change the EventIdentifierGenerator contract (cleaner)

Add explicit startTime parameter:

public static function generate(
    string $title,
    string $startDate,
    string $venue,
    string $startTime = ''
): string {
    // ...
    return md5( $normalized_title . $startDate . $startTime . $normalized_venue );
}

Then audit all callers (there are several) to pass startTime where available. Backwards-compatible because the new param defaults to empty.
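
The backwards-compatibility claim can be checked with a minimal sketch. Both functions below are hypothetical and skip the title and venue normalization that the real method performs (elided as "// ..." above), so only the relationship between the hashes is meaningful, not their values.

```php
<?php
// Hypothetical "before" version: three inputs, no normalization.
function sketch_generate_old( string $title, string $startDate, string $venue ): string {
    return md5( $title . $startDate . $venue );
}

// Hypothetical "after" version following Option B: optional startTime.
function sketch_generate_new( string $title, string $startDate, string $venue, string $startTime = '' ): string {
    return md5( $title . $startDate . $startTime . $venue );
}

$old      = sketch_generate_old( 'Motown Throwdown', '2026-04-26', 'Charleston Pour House' );
$new_3arg = sketch_generate_new( 'Motown Throwdown', '2026-04-26', 'Charleston Pour House' );
$new_4arg = sketch_generate_new( 'Motown Throwdown', '2026-04-26', 'Charleston Pour House', '13:30' );

var_dump( $old === $new_3arg ); // bool(true): callers that omit the time are unchanged
var_dump( $old === $new_4arg ); // bool(false): time-aware callers get a new identifier
```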

Audit other callers

EventIdentifierGenerator::generate() is also called from:

  • inc/Abilities/DuplicateDetectionAbilities.php:309, 333 — duplicate detection
  • inc/Abilities/EventQualityAuditAbilities.php — quality audits
  • inc/Steps/Upsert/Events/EventUpsert.php:540, 692, 787, 826, 891 — upsert dedup
  • Plus EventIdentifierGenerator::titlesMatch() and friends in the upsert fuzzy-match path

Need to verify each caller has access to startTime and passes it. The upsert callers DO have full datetime available (they query event_dates.start_datetime), so they should be migrated to time-aware hashing.

Without auditing every caller, you'd get the situation where the scraper hashes one way and the upsert dedup hashes a different way for the same event — even worse than current state.
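
One way to guard against that mismatch is to normalize the start value before it reaches the hash, since the scraper holds '13:30' while a datetime column would typically hold '13:30:00'. The following is a sketch under that assumption. sketch_normalize_start() is a hypothetical helper, not part of the codebase, and the 'Y-m-d H:i' target format is an arbitrary choice.

```php
<?php
// Hypothetical normalizer: reduce any parseable start value to 'Y-m-d H:i'
// so every caller feeds the hash the same string. Date-only and empty input
// are returned unchanged, because parsing a bare date would add a midnight time.
function sketch_normalize_start( string $start ): string {
    $start = trim( $start );
    if ( '' === $start || preg_match( '/^\d{4}-\d{2}-\d{2}$/', $start ) ) {
        return $start;
    }
    try {
        return ( new DateTimeImmutable( $start ) )->format( 'Y-m-d H:i' );
    } catch ( Exception $e ) {
        return $start; // unparseable: pass through unchanged
    }
}

// Scraper-style input: separate date and time fields joined with a space.
$from_scraper = sketch_normalize_start( '2026-04-26' . ' ' . '13:30' );
// Upsert-style input: a full datetime string with seconds.
$from_upsert  = sketch_normalize_start( '2026-04-26 13:30:00' );
// Date-only input stays date-only.
$date_only    = sketch_normalize_start( '2026-04-26' );

var_dump( $from_scraper === $from_upsert ); // bool(true)
var_dump( $date_only );                     // string(10) "2026-04-26"
```

Whichever format is chosen, the point is that every caller in the audit list goes through the same normalization before hashing.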

Why this matters

The Charleston Pour House example is one venue. Any venue with multiple stages (House of Blues, Brooklyn Steel, Music Hall of Williamsburg, etc.) hits this. Any festival with multiple parallel sets hits this. Any "weekly residency" with a different opener each week hits this if the resident name is the title.

This compounds with the separate processed_items lifecycle bug filed at [data-machine repo TBD] — items get marked processed before they're confirmed-saved, so a hash collision that drops an event also makes it unrecoverable forever.
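
The lifecycle interaction can be modelled in a few lines. This is a hypothetical in-memory model, not data-machine code: sketch_run(), the $processed array and the flaky save callback are all invented for illustration. It contrasts marking an item as processed before the save with marking it only after the save succeeds.

```php
<?php
// Hypothetical model of one flow run. $processed plays the role of the
// processed_items table; $save returns true when the item was stored.
function sketch_run( array $identifiers, array &$processed, callable $save, bool $mark_before_save ): array {
    $saved = array();
    foreach ( $identifiers as $id ) {
        if ( isset( $processed[ $id ] ) ) {
            continue; // dedup: "already processed"
        }
        if ( $mark_before_save ) {
            $processed[ $id ] = true; // marked even if the save below fails
        }
        if ( $save( $id ) ) {
            $saved[]          = $id;
            $processed[ $id ] = true;
        }
    }
    return $saved;
}

// A save that fails on its first attempt and succeeds afterwards.
$attempts = 0;
$flaky    = function ( string $id ) use ( &$attempts ): bool {
    return ++$attempts > 1;
};

// Mark-before-save: the failed item is never retried.
$processed_a = array();
sketch_run( array( 'abc' ), $processed_a, $flaky, true );
$second_a = sketch_run( array( 'abc' ), $processed_a, $flaky, true );

// Mark-after-save: the failed item is retried on the next run.
$attempts    = 0;
$processed_b = array();
sketch_run( array( 'abc' ), $processed_b, $flaky, false );
$second_b = sketch_run( array( 'abc' ), $processed_b, $flaky, false );

var_dump( $second_a ); // empty array: dropped for good
var_dump( $second_b ); // array containing "abc": recovered on run 2
```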

Diagnostic queries (for verification)

-- How many "completed_no_items" runs has flow 9 had?
SELECT status, COUNT(*) FROM c8c_7_datamachine_jobs
WHERE flow_id = 9 GROUP BY status;
-- 61 × completed_no_items
-- 50 × completed
-- 7 × failed
-- (most runs return zero new items because dedup kills them)

// Reproduce a hash and check if it's in processed_items
$hash = EventIdentifierGenerator::generate('Motown Throwdown', '2026-04-26', 'Charleston Pour House');
// Then: SELECT * FROM c8c_7_datamachine_processed_items WHERE item_identifier = $hash

Related

  • Discovered while building wp extrachill events roundup (extrachill-cli#14) and noticing missing Pour House events in real production data.
  • This is the scraper-side bug. There's a separate core-engine architectural bug to be filed against data-machine about the lifecycle of processed_items records (marked-processed before confirmed-ingested).
