Summary
EventIdentifierGenerator::generate() is being called with a date-only startDate (e.g. 2026-04-26) instead of the full datetime (2026-04-26 13:30:00), so two same-titled events at the same venue on the same day collide on the same hash. The collision marks the second event as a duplicate, and dedup kills it forever.
This is silently dropping recurring shows network-wide.
Concrete reproduction
Charleston Pour House Tribe Events API returns 12 events for Apr 25 - May 1. Three of those events are missing from our database:
| Date | Title | Stage | In DB? |
| --- | --- | --- | --- |
| Sun Apr 26 | Motown Throwdown | Deck | ❌ |
| Tue Apr 28 | Guilt Ridden Troubadour | Deck | ❌ |
| Wed Apr 29 | The Reckoning | Deck | ❌ |
All 9 other events for the week are correctly captured. Pour House has two stages (Deck + Main), so multiple events on the same day are normal.
The flow runs 6× daily and returns completed_no_items because the dedup says "already processed."
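The failure mode can be reproduced without any of the plugin code. A minimal sketch — `dateOnlyHash` here is a hypothetical stand-in; the real generator also normalizes title and venue before hashing:

```php
<?php
// Hypothetical stand-in for the date-only hashing; the real
// EventIdentifierGenerator also normalizes its inputs.
function dateOnlyHash( string $title, string $start, string $venue ): string {
    return md5( $title . $start . $venue );
}

// Two distinct same-day sets (e.g. a matinee and an evening show) collapse
// to one identifier when only the date is hashed:
$matinee = dateOnlyHash( 'Motown Throwdown', '2026-04-26', 'Charleston Pour House' );
$evening = dateOnlyHash( 'Motown Throwdown', '2026-04-26', 'Charleston Pour House' );
var_dump( $matinee === $evening ); // bool(true) — second event looks "already processed"

// Feed the full datetime instead and the identifiers diverge:
$m = dateOnlyHash( 'Motown Throwdown', '2026-04-26 13:30:00', 'Charleston Pour House' );
$e = dateOnlyHash( 'Motown Throwdown', '2026-04-26 20:00:00', 'Charleston Pour House' );
var_dump( $m === $e ); // bool(false)
```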
Root cause trace
Step 1: Tribe API returns full datetime
$ curl 'https://charlestonpourhouse.com/wp-json/tribe/events/v1/events?per_page=100' \
| jq '.events[] | select(.title=="Motown Throwdown") | {title, start_date, slug}'
{
"title": "Motown Throwdown",
"start_date": "2026-04-26 13:30:00", ← full datetime, correct
"slug": "motown-throwdown-3-2026-04-26"
}
Step 2: WordPressExtractor splits date and time into separate fields
inc/Steps/EventImport/Handlers/WebScraper/Extractors/WordPressExtractor.php:198-237
private function mapTribeV1Event( array $event, string $source_url ): array {
$start_date = '';
$start_time = '';
// ...
if ( ! empty( $event['start_date'] ) ) {
$parsed = $this->parseDatetime( $event['start_date'] );
$start_date = $parsed['date']; // ← '2026-04-26'
$start_time = $parsed['time']; // ← '13:30'
}
// ...
return array(
// ...
'startDate' => $start_date, // ← date only
'startTime' => $start_time, // ← time, separate
// ...
);
}
This separation is fine on its own — the upsert layer needs them split anyway.
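parseDatetime() itself isn't shown above; a plausible sketch of the split it performs (the body below is an assumption — only the '2026-04-26' / '13:30' output shape is taken from the comments in the real code):

```php
<?php
// Assumed sketch of the date/time split; the real parseDatetime() lives in
// WordPressExtractor and may differ in validation details.
function parseDatetime( string $datetime ): array {
    $dt = date_create( $datetime );
    if ( false === $dt ) {
        return array( 'date' => '', 'time' => '' );
    }
    return array(
        'date' => $dt->format( 'Y-m-d' ), // '2026-04-26'
        'time' => $dt->format( 'H:i' ),   // '13:30' — seconds are dropped
    );
}
```

The H:i output (seconds dropped) matters later: whatever string eventually feeds the hash must use one canonical time format across all callers.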
Step 3: StructuredDataProcessor only feeds startDate to the hash
inc/Steps/EventImport/Handlers/WebScraper/StructuredDataProcessor.php:73-77
$event_identifier = EventIdentifierGenerator::generate(
$event['title'],
$event['startDate'] ?? '', // ← '2026-04-26' (NO TIME)
$event['venue'] ?? ''
);
$event['startTime'] is right there in the same array, but never makes it into the hash.
Step 4: Hash collisions on same-day same-venue events
md5( "Motown Throwdown" + "2026-04-26" + "Charleston Pour House" )
→ c0741f83bbeafa73c59dba405f1eba0f
md5( "MJT" + "2026-04-26" + "Charleston Pour House" )
→ some other hash (titles differ — fine)
md5( "Motown Throwdown" + "2026-04-26" + "Charleston Pour House" ) // 2nd time on later run
→ c0741f83bbeafa73c59dba405f1eba0f ← SAME HASH → dedup kills it
So the immediate Pour House symptom isn't actually title collisions on the same day — it's the processed_items lifecycle bug interacting with this one (see "Why this matters" below). But this date-only hash IS the underlying weakness that makes the dedup unable to recover.
Verification
// Reproducing what the scraper hashes
EventIdentifierGenerator::generate('Motown Throwdown', '2026-04-26', 'Charleston Pour House')
// → c0741f83bbeafa73c59dba405f1eba0f ← THIS hash is in c8c_7_datamachine_processed_items ✓
EventIdentifierGenerator::generate('Motown Throwdown', '2026-04-26 13:30:00', 'Charleston Pour House')
// → 32917cd554a43193972a412249a3c308 ← NOT in processed_items
Confirmed via direct database query against flow_step_id = '3_b27b1934-40c7-478f-9586-e474bd7d84ad_9'.
Suggested fix
Option A — pass startTime through to the hash (minimal)
// StructuredDataProcessor.php:73-77
$event_identifier = EventIdentifierGenerator::generate(
$event['title'],
trim( ( $event['startDate'] ?? '' ) . ' ' . ( $event['startTime'] ?? '' ) ),
$event['venue'] ?? ''
);
Passes '2026-04-26 13:30' instead of '2026-04-26'. Hash becomes time-aware. No signature change to EventIdentifierGenerator::generate().
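Two consequences of Option A worth calling out, sketched via the string it builds:

```php
<?php
// What Option A feeds the generator, with and without a time component.
$timed  = trim( '2026-04-26' . ' ' . '13:30' ); // '2026-04-26 13:30'
$allDay = trim( '2026-04-26' . ' ' . '' );      // '2026-04-26' — trim() drops the trailing space

// All-day / time-less events therefore keep their existing identifiers and
// stay matched in processed_items, while every timed event gets a new
// identifier and will be re-imported once after the change deploys.
```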
Option B — change the EventIdentifierGenerator contract (cleaner)
Add explicit startTime parameter:
public static function generate(
string $title,
string $startDate,
string $venue,
string $startTime = ''
): string {
// ...
return md5( $normalized_title . $startDate . $startTime . $normalized_venue );
}
Then audit all callers (there are several) to pass startTime where available. Backwards-compatible because the new param defaults to empty.
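The backwards-compatibility claim can be sanity-checked: with the new parameter defaulting to '', un-migrated callers concatenate the identical input string. The sketch below elides the title/venue normalization the real generator presumably applies:

```php
<?php
// Simplified sketch of the Option B contract; normalization elided.
function generate( string $title, string $startDate, string $venue, string $startTime = '' ): string {
    return md5( $title . $startDate . $startTime . $venue );
}

$legacy = md5( 'Motown Throwdown' . '2026-04-26' . 'Charleston Pour House' );
$compat = generate( 'Motown Throwdown', '2026-04-26', 'Charleston Pour House' );
var_dump( $legacy === $compat ); // bool(true) — un-migrated callers are unaffected
```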
Audit other callers
EventIdentifierGenerator::generate() is also called from:
- inc/Abilities/DuplicateDetectionAbilities.php:309, 333 — duplicate detection
- inc/Abilities/EventQualityAuditAbilities.php — quality audits
- inc/Steps/Upsert/Events/EventUpsert.php:540, 692, 787, 826, 891 — upsert dedup
- Plus EventIdentifierGenerator::titlesMatch() and friends in the upsert fuzzy-match path
Need to verify each caller has access to startTime and passes it. The upsert callers DO have full datetime available (they query event_dates.start_datetime), so they should be migrated to time-aware hashing.
Without auditing every caller, you'd get the situation where the scraper hashes one way and the upsert dedup hashes a different way for the same event — even worse than current state.
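One concrete divergence to watch for during that audit: the extractor emits H:i ('13:30') while event_dates.start_datetime presumably carries seconds ('13:30:00'), and those strings hash differently. A sketch, assuming plain concatenation into md5:

```php
<?php
// Same event, hashed from two layers with unnormalized time formats.
$scraperSide = md5( 'Motown Throwdown' . '2026-04-26 13:30' . 'Charleston Pour House' );
$upsertSide  = md5( 'Motown Throwdown' . '2026-04-26 13:30:00' . 'Charleston Pour House' );
var_dump( $scraperSide === $upsertSide ); // bool(false) — both layers must agree on
// one canonical format (e.g. always 'Y-m-d H:i') before hashing.
```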
Why this matters
The Charleston Pour House example is one venue. Any venue with multiple stages (House of Blues, Brooklyn Steel, Music Hall of Williamsburg, etc.) hits this. Any festival with multiple parallel sets hits this. Any "weekly residency" with a different opener each week hits this if the resident name is the title.
This compounds with the separate processed_items lifecycle bug filed at [data-machine repo TBD] — items get marked processed before they're confirmed-saved, so a hash collision that drops an event also makes it unrecoverable forever.
Diagnostic queries (for verification)
-- How many "completed_no_items" runs has flow 9 had?
SELECT status, COUNT(*) FROM c8c_7_datamachine_jobs
WHERE flow_id = 9 GROUP BY status;
-- 61 × completed_no_items
-- 50 × completed
-- 7 × failed
-- (most runs return zero new items because dedup kills them)
// Reproduce a hash and check if it's in processed_items
$hash = EventIdentifierGenerator::generate('Motown Throwdown', '2026-04-26', 'Charleston Pour House');
// Then: SELECT * FROM datamachine_processed_items WHERE item_identifier = $hash
Related
- Discovered while building `wp extrachill events roundup` (extrachill-cli#14) and noticing missing Pour House events in real production data.
- This is the scraper-side bug. There's a separate core-engine architectural bug to be filed against data-machine about the lifecycle of processed_items records (marked-processed before confirmed-ingested).