Feature/library etl #153

Open
AyBruno wants to merge 18 commits into main from feature/library-etl

@AyBruno AyBruno commented Feb 13, 2026

Summary

  • Add cronjob deployment support for jobs/* targets in deploy-base.yml, including cron schedule setup, validation, and verification on EC2.
  • Add Dockerfile.library-etl to build/run the library ETL job container and update workflow detection to include job targets.
  • Introduce the new library-etl job implementation with legacy ETL logic, format parsing, artist/genre handling, and cronjob run tracking.
  • Extend legacy database access utilities (e.g., MirrorSQL cleanup/timeout handling) and update schema/migrations to support new ETL fields (cronjob runs, code volume letters, etc.).

@AyBruno AyBruno force-pushed the feature/library-etl branch 2 times, most recently from b3b6edf to 900fefd Compare February 13, 2026 04:52
@AyBruno AyBruno force-pushed the feature/library-etl branch from 900fefd to bcb14fa Compare February 13, 2026 05:02
@jakebromberg commented

How does this handle ALPHABETICAL_NAME?

@AyBruno AyBruno requested a review from jakebromberg February 28, 2026 20:35
@jakebromberg jakebromberg left a comment

Okay, I munged around the old DB and found some stuff.

Will Crash or Silently Drop Data

  1. Orphaned releases (missing LIBRARY_CODE)

Release IDs 36676 (Burnin' Love, LIBRARY_CODE_ID=13408) and 64658 (Gamelan of Peliatan, LIBRARY_CODE_ID=23619) reference LIBRARY_CODE rows that don't exist. The ETL's JOIN LIBRARY_CODE lc ON lr.LIBRARY_CODE_ID = lc.ID silently excludes them. These releases won't be imported.
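A pre-flight check could surface these rows before the JOIN hides them. The following is a minimal sketch, assuming the release and code rows have already been loaded into memory; the interfaces and the `findOrphanedReleases` helper are hypothetical and only the column names come from the legacy schema.

```typescript
// Hypothetical row shapes; only the column names come from the legacy schema.
interface LibraryRelease {
  ID: number;
  LIBRARY_CODE_ID: number;
}
interface LibraryCode {
  ID: number;
}

// Return releases whose LIBRARY_CODE_ID has no matching LIBRARY_CODE row,
// i.e. the rows an inner JOIN would drop without any warning.
function findOrphanedReleases(
  releases: LibraryRelease[],
  codes: LibraryCode[],
): LibraryRelease[] {
  const knownCodeIds = new Set(codes.map((code) => code.ID));
  return releases.filter((release) => !knownCodeIds.has(release.LIBRARY_CODE_ID));
}

const orphans = findOrphanedReleases(
  [
    { ID: 36676, LIBRARY_CODE_ID: 13408 }, // orphan reported above
    { ID: 1, LIBRARY_CODE_ID: 100 }, // made-up row with a valid code
  ],
  [{ ID: 100 }],
);

for (const release of orphans) {
  console.warn(
    `orphaned release ${release.ID}: LIBRARY_CODE_ID=${release.LIBRARY_CODE_ID} not found`,
  );
}
```

Running a check like this before the import would let the job log or fail on orphans instead of silently skipping them.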

  2. Six releases with empty titles

| Release ID | CALL_NUMBERS | CALL_LETTERS |
| --- | --- | --- |
| 21107 | 1 | (empty) |
| 39290 | 9 | (empty) |
| 51871 | 4 | (empty) |
| 52374 | 109 | (empty) |
| 65301 | 35 | (empty) |
| 66329 | 8 | (empty) |

The ETL checks albumTitle.length === 0 and increments skippedCount, so these releases are skipped rather than crashing the run. However, nothing is logged to identify which specific releases were skipped for this reason.
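One way to make the skips traceable is to collect the IDs alongside the count. This is a sketch only: `ReleaseRow` and `partitionByTitle` are hypothetical names, and the check mirrors the `albumTitle.length === 0` condition described above rather than the PR's actual loop.

```typescript
// Hypothetical shape of a row as it reaches the title check.
interface ReleaseRow {
  id: number;
  albumTitle: string;
}

// Split rows into importable ones and the IDs skipped for an empty title,
// so the skip reason can be logged per release instead of only counted.
function partitionByTitle(rows: ReleaseRow[]): {
  kept: ReleaseRow[];
  skippedIds: number[];
} {
  const kept: ReleaseRow[] = [];
  const skippedIds: number[] = [];
  for (const row of rows) {
    if (row.albumTitle.length === 0) {
      skippedIds.push(row.id);
    } else {
      kept.push(row);
    }
  }
  return { kept, skippedIds };
}

const partitioned = partitionByTitle([
  { id: 21107, albumTitle: "" },
  { id: 39290, albumTitle: "" },
  { id: 1, albumTitle: "Made-up Title" }, // illustrative row, not from the legacy DB
]);

if (partitioned.skippedIds.length > 0) {
  console.warn(`skipped releases with empty titles: ${partitioned.skippedIds.join(", ")}`);
}
```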

Will Produce Incorrect Data

  3. "various" (lowercase, no hyphen) -- 603 releases misclassified

Two LIBRARY_CODE entries named "various" (not "Various Artists"):

| PRESENTATION_NAME | CALL_LETTERS | CALL_NUMBERS | Genre | Releases |
| --- | --- | --- | --- | --- |
| various | Z-- | 0 | Jazz | 308 |
| various | Z-- | 0 | Reggae | 295 |

The regex in normalizeArtistName requires "various artists" (two words) followed by a hyphen: /^various\s+artists\s*-\s*/i. The name "various" does not match, so these 603 releases will create a regular artist named "various" with code letters "Z-" (truncated from "Z--" by normalizeCodeLetters) instead of being treated as compilations with V/A code letters.
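The mismatch is easy to reproduce with the pattern quoted above. The sketch below also tries a broader pattern; `BROADER_PATTERN` is an assumption for illustration, not code from this PR, and whether bare "various" or "[group]" should count as a compilation is a decision for the author.

```typescript
// Pattern quoted in this review for normalizeArtistName.
const CURRENT_PATTERN = /^various\s+artists\s*-\s*/i;

// Hypothetical broader pattern (an assumption, not code from this PR):
// accepts bare "various", a trailing "[...]" qualifier, or the hyphenated form.
const BROADER_PATTERN = /^various(\s+artists)?\s*(-|\[|$)/i;

const sampleNames = [
  "various",
  "Various Artists [group]",
  "Various Artists - Rock - A",
  "Vanilla Ice",
];

const currentMatches = sampleNames.map((name) => CURRENT_PATTERN.test(name));
const broaderMatches = sampleNames.map((name) => BROADER_PATTERN.test(name));

sampleNames.forEach((name, i) => {
  console.log(`${name}: current=${currentMatches[i]} broader=${broaderMatches[i]}`);
});
```

Only the hyphenated name matches the current pattern, while the broader one also catches the first two names and still rejects a regular artist.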

  4. "Various Artists [group]" -- 2 releases misclassified

| PRESENTATION_NAME | CALL_LETTERS | CALL_NUMBERS | Genre |
| --- | --- | --- | --- |
| Various Artists [group] | Va | 6 | Hiphop |

This name does not match the hyphen regex, so it will be treated as a regular artist in the same namespace as "Vanilla Ice" (Va/1), "Sven Vath" (Va/2), etc. The intended Hip-hop "Various Artists" entry is "V/A" at Va/0.

  5. Rock compilation sub-categories collapse into one artist (27 entries)

All "Various Artists - Rock - A" through "Various Artists - Rock - Z" plus "Various Artists - Rock" have 3-character code letters (Z-A, Z-B, ..., Z--).

These DO match the hyphen regex, so they'll all normalize to VARIOUS_ARTISTS_NAME = "Various Artists" with code letters VARIOUS_ARTISTS_CODE_LETTERS = "V/A" (overriding the original code letters). The ensureArtist cache key for isVarious is "various artists|V/A", so all 27 distinct compilation buckets collapse into a single "Various Artists" artist.

The albums will still import (different titles), but the distinct sub-categorization by letter is lost.
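The collapse follows directly from the key construction. Here is a minimal sketch, assuming the cache key is the lowercased name joined to the code letters with "|" (inferred from the "various artists|V/A" key quoted above); `cacheKeySketch` and the alternative key that keeps the legacy letters are hypothetical.

```typescript
const VARIOUS_ARTISTS_NAME = "Various Artists";
const VARIOUS_ARTISTS_CODE_LETTERS = "V/A";

// Assumed key shape, inferred from the "various artists|V/A" key quoted above.
function cacheKeySketch(name: string, codeLetters: string): string {
  return `${name.toLowerCase()}|${codeLetters}`;
}

// A few of the legacy code letters for the Rock compilation buckets.
const legacyRockLetters = ["Z-A", "Z-B", "Z-S", "Z--"];

// Current behaviour as described: every bucket maps to the same key.
const collapsedKeys = new Set(
  legacyRockLetters.map(() =>
    cacheKeySketch(VARIOUS_ARTISTS_NAME, VARIOUS_ARTISTS_CODE_LETTERS),
  ),
);

// Hypothetical alternative: append the legacy letters so buckets stay distinct.
const distinctKeys = new Set(
  legacyRockLetters.map(
    (letters) =>
      `${cacheKeySketch(VARIOUS_ARTISTS_NAME, VARIOUS_ARTISTS_CODE_LETTERS)}|${letters}`,
  ),
);

console.log(`collapsed: ${collapsedKeys.size}, distinct: ${distinctKeys.size}`);
```

Whether the sub-categories should survive as separate artists, or as some other field on the album, is a schema decision this sketch does not settle.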

Affected entries and approximate release counts:

| PRESENTATION_NAME | Code Letters | Releases |
| --- | --- | --- |
| Various Artists - Rock - S | Z-S | 254 |
| Various Artists - Rock - B | Z-B | 141 |
| Various Artists - Rock - T | Z-T | 133 |
| Various Artists - Rock - P | Z-P | 121 |
| Various Artists - Rock - M | Z-M | 120 |
| Various Artists - Rock - C | Z-C | 108 |
| Various Artists - Rock - L | Z-L | 107 |
| Various Artists - Rock - R | Z-R | 104 |
| Various Artists - Rock - W | Z-W | 100 |
| Various Artists - Rock - D | Z-D | 96 |
| Various Artists - Rock - A | Z-A | 91 |
| Various Artists - Rock - F | Z-F | 90 |
| Various Artists - Rock - I | Z-I | 78 |
| Various Artists - Rock - H | Z-H | 73 |
| Various Artists - Rock - E | Z-E | 69 |
| Various Artists - Rock - N | Z-N | 69 |
| Various Artists - Rock - G | Z-G | 66 |
| Various Artists - Rock - K | Z-K | 45 |
| Various Artists - Rock - O | Z-O | 43 |
| Various Artists - Rock - J | Z-J | 31 |
| Various Artists - Rock - V | Z-V | 29 |
| Various Artists - Rock - Y | Z-Y | 22 |
| Various Artists - Rock - U | Z-U | 20 |
| Various Artists - Rock - Z | Z-Z | 8 |
| Various Artists - Rock - X | Z-X | 3 |
| Various Artists - Rock - Q | Z-Q | 2 |

  6. Soundtrack sub-categories -- same truncation problem (27 entries)

"Soundtracks - A" through "Soundtracks - Z" have 3-character code letters (Z-A through Z-Z). These do NOT start with "Various Artists" so the regex won't match. They'll each be treated as separate artists, but normalizeCodeLetters truncates "Z-A" to "Z-".

Since they have distinct names, they won't collide in the artist cache. However, they'll all share code letters "Z-" and code number 0 in the Soundtracks genre. Notable entries:

| PRESENTATION_NAME | Releases |
| --- | --- |
| Soundtracks - S | 142 |
| Soundtracks - M | 104 |
| Soundtracks - C | 81 |
| Soundtracks - T | 73 |
| Soundtracks - B | 67 |

  7. Leading space in code letters: " Vi" for "Ben Vida"

normalizeCodeLetters will trim and uppercase " Vi" to "VI", matching it with other Hip-hop VI artists (Stephen Vitiello VI/4, Villian Accelerate VI/5, etc.). Ben Vida is at call number 12 so the cache key won't collide, but the original " Vi" vs "VI" distinction is lost.

  8. "LOD" code letters for "Dem Franchize Boyz"

The 3-character code letters are truncated to "LO", so the original categorization intent is lost.

  9. "Unk" code letters for "Unknown" artist

The 3-character code letters are truncated to "UN", which may collide with other UN artists in Hip-hop.
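The code-letter cases above (Z-A, " Vi", LOD, Unk) all come down to the same normalization. The function below is a sketch of the behaviour as this review describes it (trim, uppercase, keep two characters, null when empty); `normalizeCodeLettersSketch` is a hypothetical stand-in, not the PR's normalizeCodeLetters.

```typescript
// Assumed behaviour, reconstructed from the review's description:
// trim, uppercase, keep the first two characters, return null when empty.
function normalizeCodeLettersSketch(raw: string | null): string | null {
  const cleaned = (raw ?? "").trim().toUpperCase();
  return cleaned.length === 0 ? null : cleaned.slice(0, 2);
}

// Legacy values mentioned in this review.
const legacyCodeLetters = ["Z-A", "Z-S", "Z--", " Vi", "LOD", "Unk", ""];

const normalizationReport = legacyCodeLetters.map((raw) => ({
  raw,
  normalized: normalizeCodeLettersSketch(raw),
}));

for (const { raw, normalized } of normalizationReport) {
  // Flag every value whose original form cannot be recovered after normalization.
  if (normalized !== raw) {
    console.warn(`code letters ${JSON.stringify(raw)} become ${JSON.stringify(normalized)}`);
  }
}
```

A report like this, run over every LIBRARY_CODE row, would list the exact set of entries whose code letters change during import.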

  10. 131 artists with CALL_NUMBERS = 0 -- potential artist merging

131 LIBRARY_CODE entries have CALL_NUMBERS = 0. Many also have empty CALL_LETTERS, for which normalizeCodeLetters returns null, so the code letters fall back to "??". Multiple unrelated artists sharing the same (genre_id, "??", 0) cache key will be merged into one artist record.

Includes test data ("LibCodeTest abf68760", "BrowserTestArtist 0f80f91f") and real artists:

| Artist (sample) | Genre |
| --- | --- |
| Little Brother | DB_ONLY |
| Junior Boys | DB_ONLY |
| Paris | DB_ONLY |
| Cash, Larry Jr. | DB_ONLY |
| Kim Jung Mi | DB_ONLY |
| Rayvn Lenae | DB_ONLY |

Note: DB_ONLY genre entries (4 releases) are skipped by the ETL via isDbOnlyGenre, so the DB_ONLY artists above won't be imported. However, any zero-code artists in non-DB_ONLY genres would be affected.
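A collision check over the legacy rows would show which non-DB_ONLY artists are actually at risk. This is a sketch under assumptions: the key shape follows the (genre_id, code letters, code number) tuple described above, the "??" fallback is taken from this review, and `LegacyArtistRow`, `sketchCacheKey`, `findKeyCollisions` and the sample rows are all made up for illustration.

```typescript
// Hypothetical, simplified view of a LIBRARY_CODE row.
interface LegacyArtistRow {
  name: string;
  genreId: number;
  callLetters: string;
  callNumbers: number;
}

// Assumed key: genre, normalized letters (or "??" when empty), call number.
function sketchCacheKey(row: LegacyArtistRow): string {
  const letters = row.callLetters.trim().toUpperCase().slice(0, 2);
  return `${row.genreId}|${letters.length > 0 ? letters : "??"}|${row.callNumbers}`;
}

// Group artist names by key and keep only keys shared by more than one name.
function findKeyCollisions(rows: LegacyArtistRow[]): Map<string, string[]> {
  const byKey = new Map<string, string[]>();
  for (const row of rows) {
    const key = sketchCacheKey(row);
    const names = byKey.get(key) ?? [];
    names.push(row.name);
    byKey.set(key, names);
  }
  const collisions = new Map<string, string[]>();
  byKey.forEach((names, key) => {
    if (new Set(names).size > 1) {
      collisions.set(key, names);
    }
  });
  return collisions;
}

// Made-up rows: two unrelated artists with empty letters and number 0.
const collisions = findKeyCollisions([
  { name: "Artist A", genreId: 7, callLetters: "", callNumbers: 0 },
  { name: "Artist B", genreId: 7, callLetters: "", callNumbers: 0 },
  { name: "Artist C", genreId: 7, callLetters: "Ab", callNumbers: 3 },
]);

collisions.forEach((names, key) => {
  console.warn(`cache key ${key} is shared by: ${names.join(", ")}`);
});
```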

  11. Inconsistent ALPHABETICAL_NAME for "303 Committee"

Different LIBRARY_CODE rows for this artist carry different alphabetical names: "303 Committee" and "Three 0 Three". The ETL doesn't update existing artist records (ensureArtist returns early on match), so the first row processed wins.
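The first-row-wins behaviour can be illustrated with a small cache. This is only a sketch: `ensureArtistSketch`, its name-based key, and the conflict list are assumptions for illustration and do not reflect how the PR's ensureArtist builds its keys.

```typescript
interface ArtistRecord {
  name: string;
  alphabeticalName: string;
}

const artistCache = new Map<string, ArtistRecord>();
const alphabeticalNameConflicts: string[] = [];

// Return the cached record on a match (never updating it), as described above,
// but record a conflict when a later row disagrees on the alphabetical name.
function ensureArtistSketch(name: string, alphabeticalName: string): ArtistRecord {
  const key = name.toLowerCase();
  const existing = artistCache.get(key);
  if (existing !== undefined) {
    if (existing.alphabeticalName !== alphabeticalName) {
      alphabeticalNameConflicts.push(
        `${name}: kept "${existing.alphabeticalName}", ignored "${alphabeticalName}"`,
      );
    }
    return existing;
  }
  const created: ArtistRecord = { name, alphabeticalName };
  artistCache.set(key, created);
  return created;
}

ensureArtistSketch("303 Committee", "303 Committee");
const secondLookup = ensureArtistSketch("303 Committee", "Three 0 Three");

for (const conflict of alphabeticalNameConflicts) {
  console.warn(conflict);
}
```

Logging the ignored value at least makes the outcome deterministic to audit, even if row order still decides which name is stored.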
