How does this handle |
…eference to maintain atomic artist entries.
jakebromberg
left a comment
Okay, I munged around the old DB and found some stuff.
Will Crash or Silently Drop Data
- Orphaned releases (missing LIBRARY_CODE)
Release IDs 36676 (Burnin' Love, LIBRARY_CODE_ID=13408) and 64658 (Gamelan of Peliatan, LIBRARY_CODE_ID=23619) reference LIBRARY_CODE rows that don't exist. The ETL's JOIN LIBRARY_CODE lc ON lr.LIBRARY_CODE_ID = lc.ID silently excludes them. These releases won't be imported.
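One way to surface these before the import runs is an anti-join style check. This is a minimal sketch, assuming the rows are already loaded into memory; the `ReleaseRow` and `LibraryCodeRow` shapes and the `findOrphanedReleases` helper are hypothetical names, not part of the ETL:

```typescript
// Hypothetical row shapes; only the column meanings come from the review above.
interface ReleaseRow {
  id: number;
  title: string;
  libraryCodeId: number; // LIBRARY_CODE_ID on the release row
}

interface LibraryCodeRow {
  id: number; // LIBRARY_CODE.ID
}

// Return releases whose LIBRARY_CODE_ID has no matching LIBRARY_CODE row,
// i.e. the rows an inner JOIN drops without raising any error.
function findOrphanedReleases(
  releases: ReleaseRow[],
  codes: LibraryCodeRow[],
): ReleaseRow[] {
  const knownIds = new Set(codes.map((c) => c.id));
  return releases.filter((r) => !knownIds.has(r.libraryCodeId));
}

const orphans = findOrphanedReleases(
  [
    { id: 36676, title: "Burnin' Love", libraryCodeId: 13408 },
    { id: 1, title: "Example", libraryCodeId: 7 }, // made-up row with a valid code
  ],
  [{ id: 7 }],
);
console.log(orphans.map((r) => r.id)); // logs only the orphaned release ID, 36676
```

The same check can be expressed in SQL with a LEFT JOIN and a `WHERE lc.ID IS NULL` filter, if running it against the legacy database directly is easier.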
- Six releases with empty titles
| Release ID | CALL_NUMBERS | CALL_LETTERS |
|---|---|---|
| 21107 | 1 | (empty) |
| 39290 | 9 | (empty) |
| 51871 | 4 | (empty) |
| 52374 | 109 | (empty) |
| 65301 | 35 | (empty) |
| 66329 | 8 | (empty) |
The ETL checks albumTitle.length === 0 and increments skippedCount, so these are handled. No logging identifies which specific releases were skipped for this reason.
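To make the skips traceable, the IDs could be collected next to the counter. A rough sketch, assuming a row shape with `id` and `albumTitle`; `SkipReport` and `collectEmptyTitleSkips` are hypothetical names and the real ETL loop will look different:

```typescript
// Hypothetical row shape for illustration.
interface TitledRelease {
  id: number;
  albumTitle: string;
}

interface SkipReport {
  skippedCount: number;
  skippedEmptyTitleIds: number[];
}

// Same condition the review describes (albumTitle.length === 0),
// but remembering which release IDs hit it so they can be logged.
function collectEmptyTitleSkips(rows: TitledRelease[]): SkipReport {
  const skippedEmptyTitleIds: number[] = [];
  for (const row of rows) {
    if (row.albumTitle.length === 0) {
      skippedEmptyTitleIds.push(row.id);
    }
  }
  return { skippedCount: skippedEmptyTitleIds.length, skippedEmptyTitleIds };
}

const skipReport = collectEmptyTitleSkips([
  { id: 21107, albumTitle: "" },
  { id: 39290, albumTitle: "" },
  { id: 1, albumTitle: "Example" }, // made-up row with a title
]);
console.log(`Skipped ${skipReport.skippedCount} releases with empty titles:`, skipReport.skippedEmptyTitleIds);
```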
Will Produce Incorrect Data
- "various" (lowercase, no hyphen) -- 603 releases misclassified
Two LIBRARY_CODE entries named "various" (not "Various Artists"):
| PRESENTATION_NAME | CALL_LETTERS | CALL_NUMBERS | Genre | Releases |
|---|---|---|---|---|
| various | Z-- | 0 | Jazz | 308 |
| various | Z-- | 0 | Reggae | 295 |
The ETL regex normalizeArtistName requires "various artists" (two words) followed by a hyphen: /^various\s+artists\s*-\s*/i. The name "various" does not match. These 603 releases will create a regular artist named "various" with code letters "Z-" (truncated from "Z--" by normalizeCodeLetters) instead of being treated as compilations with V/A code letters.
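The mismatch is easy to reproduce in isolation. The pattern below is the one quoted above; the constant name is made up for this snippet:

```typescript
// Pattern quoted in the review; VARIOUS_HYPHEN_PATTERN is a name chosen here.
const VARIOUS_HYPHEN_PATTERN = /^various\s+artists\s*-\s*/i;

const sampleNames = ["various", "Various Artists", "Various Artists - Rock - S"];

for (const name of sampleNames) {
  // Only names with "various artists" followed by a hyphen match.
  console.log(`${JSON.stringify(name)} matches: ${VARIOUS_HYPHEN_PATTERN.test(name)}`);
}
```

Only the third name matches; the bare "various" rows fall through to the regular-artist path.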
- "Various Artists [group]" -- 2 releases misclassified
| PRESENTATION_NAME | CALL_LETTERS | CALL_NUMBERS | Genre |
|---|---|---|---|
| Various Artists [group] | Va | 6 | Hiphop |
Does not match the hyphen regex. Will be treated as a regular artist in the same namespace as "Vanilla Ice" (Va/1), "Sven Vath" (Va/2), etc. The intended Hip-hop "Various Artists" entry is "V/A" at Va/0.
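If the intent is to treat these as compilations too, one option is a broader name check. This is only a sketch: `COMPILATION_NAME_PATTERN` and `looksLikeCompilation` are hypothetical, and whether "various", "V/A", and "Various Artists [group]" should all count as compilations is a decision for whoever owns the library data:

```typescript
// Hypothetical broader pattern: "various", "various artists", or "v/a",
// optionally followed by a "- ..." or "[...]" suffix.
const COMPILATION_NAME_PATTERN = /^(?:various(?:\s+artists)?|v\/a)(?:\s*[-\[].*)?$/i;

function looksLikeCompilation(presentationName: string): boolean {
  return COMPILATION_NAME_PATTERN.test(presentationName.trim());
}

const candidateNames = [
  "various",
  "Various Artists [group]",
  "Various Artists - Rock - A",
  "V/A",
  "Vanilla Ice",
];
for (const name of candidateNames) {
  console.log(`${JSON.stringify(name)} -> ${looksLikeCompilation(name)}`);
}
```

The pattern is anchored at both ends so that ordinary artist names beginning with "Various" followed by other words are not swept in.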
- Rock compilation sub-categories collapse into one artist (27 entries)
All "Various Artists - Rock - A" through "Various Artists - Rock - Z" plus "Various Artists - Rock" have 3-character code letters (Z-A, Z-B, ..., Z--).
These DO match the hyphen regex, so they'll all normalize to VARIOUS_ARTISTS_NAME = "Various Artists" with code letters VARIOUS_ARTISTS_CODE_LETTERS = "V/A" (overriding the original code letters). The ensureArtist cache key for isVarious is "various artists|V/A", so all 27 distinct compilation buckets collapse into a single "Various Artists" artist.
The albums will still import (different titles), but the distinct sub-categorization by letter is lost.
Affected entries and approximate release counts:
| PRESENTATION_NAME | Code Letters | Releases |
|---|---|---|
| Various Artists - Rock - S | Z-S | 254 |
| Various Artists - Rock - B | Z-B | 141 |
| Various Artists - Rock - T | Z-T | 133 |
| Various Artists - Rock - P | Z-P | 121 |
| Various Artists - Rock - M | Z-M | 120 |
| Various Artists - Rock - C | Z-C | 108 |
| Various Artists - Rock - L | Z-L | 107 |
| Various Artists - Rock - R | Z-R | 104 |
| Various Artists - Rock - W | Z-W | 100 |
| Various Artists - Rock - D | Z-D | 96 |
| Various Artists - Rock - A | Z-A | 91 |
| Various Artists - Rock - F | Z-F | 90 |
| Various Artists - Rock - I | Z-I | 78 |
| Various Artists - Rock - H | Z-H | 73 |
| Various Artists - Rock - E | Z-E | 69 |
| Various Artists - Rock - N | Z-N | 69 |
| Various Artists - Rock - G | Z-G | 66 |
| Various Artists - Rock - K | Z-K | 45 |
| Various Artists - Rock - O | Z-O | 43 |
| Various Artists - Rock - J | Z-J | 31 |
| Various Artists - Rock - V | Z-V | 29 |
| Various Artists - Rock - Y | Z-Y | 22 |
| Various Artists - Rock - U | Z-U | 20 |
| Various Artists - Rock - Z | Z-Z | 8 |
| Various Artists - Rock - X | Z-X | 3 |
| Various Artists - Rock - Q | Z-Q | 2 |
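To illustrate the collapse: the two constants and the "various artists|V/A" key format below come from the description above, while `collapsedKey` and `bucketedKey` are hypothetical helpers. The second one sketches one possible way to keep the buckets apart by carrying the original code letters in the key:

```typescript
const VARIOUS_ARTISTS_NAME = "Various Artists";
const VARIOUS_ARTISTS_CODE_LETTERS = "V/A";

// Key as described in the review: identical for every compilation row.
function collapsedKey(): string {
  return `${VARIOUS_ARTISTS_NAME.toLowerCase()}|${VARIOUS_ARTISTS_CODE_LETTERS}`;
}

// Hypothetical alternative: append the original code letters so that
// Z-A, Z-B, ... remain distinct cache entries.
function bucketedKey(originalCodeLetters: string): string {
  return `${collapsedKey()}|${originalCodeLetters.trim().toUpperCase()}`;
}

const originalLetters = ["Z-A", "Z-B", "Z--"];
const collapsedKeys = new Set(originalLetters.map(() => collapsedKey()));
const bucketedKeys = new Set(originalLetters.map((letters) => bucketedKey(letters)));

// One distinct key with the current scheme, three with the bucketed one.
console.log(collapsedKeys.size, bucketedKeys.size);
```

Whether the sub-buckets should survive as separate artists or as an attribute on the releases is a schema question; the sketch only shows where the information is currently dropped.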
- Soundtrack sub-categories -- same truncation problem (27 entries)
"Soundtracks - A" through "Soundtracks - Z" have 3-character code letters (Z-A through Z-Z). These do NOT start with "Various Artists" so the regex won't match. They'll each be treated as separate artists, but normalizeCodeLetters truncates "Z-A" to "Z-".
Since they have distinct names, they won't collide in the artist cache. However, they'll all share code letters "Z-" and code number 0 in the Soundtracks genre. Notable entries:
| PRESENTATION_NAME | Releases |
|---|---|
| Soundtracks - S | 142 |
| Soundtracks - M | 104 |
| Soundtracks - C | 81 |
| Soundtracks - T | 73 |
| Soundtracks - B | 67 |
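A pre-import report of entries that end up sharing a code would make this visible. A minimal sketch, assuming the code letters have already been normalized; `ArtistEntry` and `findSharedCodes` are hypothetical names:

```typescript
// Hypothetical shape: one LIBRARY_CODE entry after normalization.
interface ArtistEntry {
  name: string;
  genre: string;
  codeLetters: string;
  codeNumber: number;
}

// Group entries by (genre, code letters, code number) and keep only the
// groups that contain more than one distinct entry.
function findSharedCodes(entries: ArtistEntry[]): Map<string, string[]> {
  const byCode = new Map<string, string[]>();
  for (const entry of entries) {
    const key = `${entry.genre}|${entry.codeLetters}|${entry.codeNumber}`;
    const names = byCode.get(key) ?? [];
    names.push(entry.name);
    byCode.set(key, names);
  }
  const shared = new Map<string, string[]>();
  byCode.forEach((names, key) => {
    if (names.length > 1) shared.set(key, names);
  });
  return shared;
}

const sharedCodes = findSharedCodes([
  { name: "Soundtracks - S", genre: "Soundtracks", codeLetters: "Z-", codeNumber: 0 },
  { name: "Soundtracks - M", genre: "Soundtracks", codeLetters: "Z-", codeNumber: 0 },
  { name: "Example Artist", genre: "Soundtracks", codeLetters: "EX", codeNumber: 1 }, // made-up row
]);
sharedCodes.forEach((names, key) => console.log(key, names));
```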
- Leading space in code letters: " Vi" for "Ben Vida"
normalizeCodeLetters will trim and uppercase " Vi" to "VI", matching it with other Hip-hop VI artists (Stephen Vitiello VI/4, Villian Accelerate VI/5, etc.). Ben Vida is at call number 12 so the cache key won't collide, but the original " Vi" vs "VI" distinction is lost.
- "LOD" code letters for "Dem Franchize Boyz"
3-character code letters truncated to "LO". Original categorization intent is lost.
- "Unk" code letters for "Unknown" artist
3-character code letters truncated to "UN". May collide with other UN artists in Hip-hop.
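For reference, here is a reconstruction of the behavior described in this comment (trim, uppercase, cut to two characters, null for empty input). It is inferred from the observed outputs, not copied from the ETL, so the real normalizeCodeLetters may differ in details:

```typescript
// Reconstruction of the described behavior; the name carries a "Sketch"
// suffix to make clear this is not the ETL's actual function.
function normalizeCodeLettersSketch(raw: string | null | undefined): string | null {
  const cleaned = (raw ?? "").trim().toUpperCase();
  if (cleaned.length === 0) {
    return null; // the review says callers fall back to "??" in this case
  }
  return cleaned.slice(0, 2); // anything past two characters is dropped
}

const rawCodeLetters = [" Vi", "LOD", "Unk", "Z--", "Z-A", ""];
for (const raw of rawCodeLetters) {
  console.log(`${JSON.stringify(raw)} -> ${JSON.stringify(normalizeCodeLettersSketch(raw))}`);
}
```

Logging every input where the normalized value differs from the trimmed original would give a complete list of affected rows rather than the samples found by hand here.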
- 131 artists with CALL_NUMBERS = 0 -- potential artist merging
131 LIBRARY_CODE entries have CALL_NUMBERS = 0. Many also have empty CALL_LETTERS, which normalizeCodeLetters returns null for, falling back to "??". Multiple unrelated artists sharing the same (genre_id, "??", 0) cache key will be merged into one artist record.
Includes test data ("LibCodeTest abf68760", "BrowserTestArtist 0f80f91f") and real artists:
| Artist (sample) | Genre |
|---|---|
| Little Brother | DB_ONLY |
| Junior Boys | DB_ONLY |
| Paris | DB_ONLY |
| Cash, Larry Jr. | DB_ONLY |
| Kim Jung Mi | DB_ONLY |
| Rayvn Lenae | DB_ONLY |
Note: DB_ONLY genre entries (4 releases) are skipped by the ETL via isDbOnlyGenre, so the DB_ONLY artists above won't be imported. However, any zero-code artists in non-DB_ONLY genres would be affected.
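The merge can be sketched with a simplified cache. The "??" fallback and the (genre_id, letters, number) key come from the description above; `ensureArtistSketch`, the key format, and the warning list are hypothetical and only meant to show where a merge could be detected:

```typescript
interface ArtistRecord {
  id: number;
  name: string;
}

const FALLBACK_CODE_LETTERS = "??";
const artistCache = new Map<string, ArtistRecord>();
const mergeWarnings: string[] = [];
let nextArtistId = 1;

// Simplified stand-in for ensureArtist: returns early on a cache hit,
// but records a warning when the hit belongs to a differently named artist.
function ensureArtistSketch(
  name: string,
  genreId: number,
  codeLetters: string | null,
  codeNumber: number,
): ArtistRecord {
  const key = `${genreId}|${codeLetters ?? FALLBACK_CODE_LETTERS}|${codeNumber}`;
  const existing = artistCache.get(key);
  if (existing) {
    if (existing.name !== name) {
      mergeWarnings.push(`"${name}" merged into "${existing.name}" (key ${key})`);
    }
    return existing;
  }
  const created: ArtistRecord = { id: nextArtistId++, name };
  artistCache.set(key, created);
  return created;
}

// Two made-up artists in a made-up genre, both with empty letters and number 0.
const firstArtist = ensureArtistSketch("Artist A", 1, null, 0);
const secondArtist = ensureArtistSketch("Artist B", 1, null, 0);
console.log(firstArtist.id === secondArtist.id, mergeWarnings);
```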
- Inconsistent ALPHABETICAL_NAME for "303 Committee"
There's both "303 Committee" and "Three 0 Three" alphabetical names on different LIBRARY_CODE rows. The ETL doesn't update existing artist records (ensureArtist returns early on match), so the first row processed wins.
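Conflicts like this could be listed up front instead of being resolved by row order. A small sketch, assuming rows expose the presentation and alphabetical names; `NamedCodeRow` and `findAlphabeticalNameConflicts` are hypothetical:

```typescript
// Hypothetical shape for a LIBRARY_CODE row.
interface NamedCodeRow {
  presentationName: string;
  alphabeticalName: string;
}

// Map each presentation name to its distinct alphabetical names and
// keep only the artists that have more than one.
function findAlphabeticalNameConflicts(rows: NamedCodeRow[]): Map<string, string[]> {
  const seen = new Map<string, Set<string>>();
  for (const row of rows) {
    const names = seen.get(row.presentationName) ?? new Set<string>();
    names.add(row.alphabeticalName);
    seen.set(row.presentationName, names);
  }
  const conflicts = new Map<string, string[]>();
  seen.forEach((names, presentationName) => {
    if (names.size > 1) conflicts.set(presentationName, Array.from(names));
  });
  return conflicts;
}

const nameConflicts = findAlphabeticalNameConflicts([
  { presentationName: "303 Committee", alphabeticalName: "303 Committee" },
  { presentationName: "303 Committee", alphabeticalName: "Three 0 Three" },
  { presentationName: "Example Artist", alphabeticalName: "Example Artist" }, // made-up row
]);
nameConflicts.forEach((names, artist) => console.log(artist, names));
```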
Summary
- jobs/* targets in deploy-base.yml, including cron schedule setup, validation, and verification on EC2.
- Dockerfile.library-etl to build/run the library ETL job container and update workflow detection to include job targets.
- library-etl job implementation with legacy ETL logic, format parsing, artist/genre handling, and cronjob run tracking.
- … (MirrorSQL cleanup/timeout handling) and update schema/migrations to support new ETL fields (cronjob runs, code volume letters, etc.).