[SYNC] Seed Ordo from url #253

@kilip

Overview

Add a scraper to packages/sync that extracts liturgical calendar data from lagumisa.web.id/saranps.php and seeds the ordo table via a staging pipeline.

Data extracted:

  1. Liturgical feast/celebration — name, date, rank, liturgical color
  2. Scripture readings per celebration
  3. Puji Syukur song suggestions per celebration

Background

lagumisa.web.id is a public, server-side rendered page — cheerio + undici is sufficient, no Playwright needed. No authentication required. A separate lightweight PublicHttpClient is used (no session management).

Source URL: https://www.lagumisa.web.id/saranps.php

Reference: packages/sync/docs/sync-tdd.md Section 6.


HTML Structure

Each celebration is a <tr> row in the main table.

Date column:

<time class="iconminggu">   <!-- iconminggu = Sunday, icon = weekday -->
  <em>Desember</em>
  <strong>Minggu</strong>
  <span>20</span>
</time>

Content column:

<div id="ccungu" class="circle"></div>   <!-- color encoded in id -->
<strong>HARI MINGGU ADVEN IV</strong>
<strong>Bacaan: </strong><a class=aayat>2Sam. 7:1-5...</a>; ...
<strong>Saran Nyanyian: </strong><font class=psnum>PS 440, 441, ...</font>

Color map:

div id         LiturgicalColor
ccungu         purple
ccputih        white
ccmerah        red
ccijau         green
ccmerahmuda    rose
cchitam        black

Rank inference from celebration name prefix:

Prefix         CelebrationRank
HARI RAYA      solemnity
PESTA          feast
PERINGATAN     memorial
Pw.            commemoration
(default)      feria
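
The two lookup tables above could be encoded roughly as follows. This is a minimal sketch, not the final constants file: the `LiturgicalColor` and `CelebrationRank` unions are stand-ins for whatever types the repo already defines, `inferRank` and `mapLiturgicalColor` are hypothetical helper names, and case-sensitive prefix matching and throwing on an unknown id are assumptions.

```typescript
// Stand-in types (assumption); the real ones presumably live elsewhere in the repo.
type LiturgicalColor = 'purple' | 'white' | 'red' | 'green' | 'rose' | 'black';
type CelebrationRank = 'solemnity' | 'feast' | 'memorial' | 'commemoration' | 'feria';

// Raw circle div id -> color, copied from the color map above.
const LITURGICAL_COLOR_MAP: Record<string, LiturgicalColor> = {
  ccungu: 'purple',
  ccputih: 'white',
  ccmerah: 'red',
  ccijau: 'green',
  ccmerahmuda: 'rose',
  cchitam: 'black',
};

// Ordered prefix rules, copied from the rank table above.
const RANK_INFERENCE_RULES: ReadonlyArray<readonly [string, CelebrationRank]> = [
  ['HARI RAYA', 'solemnity'],
  ['PESTA', 'feast'],
  ['PERINGATAN', 'memorial'],
  ['Pw.', 'commemoration'],
];

// Hypothetical helper: first matching prefix wins, otherwise the default rank.
function inferRank(celebrationName: string): CelebrationRank {
  const name = celebrationName.trim();
  for (const [prefix, rank] of RANK_INFERENCE_RULES) {
    if (name.startsWith(prefix)) return rank;
  }
  return 'feria';
}

// Hypothetical helper: exact id lookup; failing loudly on unknown ids is an assumption.
function mapLiturgicalColor(divId: string): LiturgicalColor {
  const color = LITURGICAL_COLOR_MAP[divId];
  if (color === undefined) {
    throw new Error(`Unknown liturgical color id: ${divId}`);
  }
  return color;
}
```

Under these rules `HARI MINGGU ADVEN IV` falls through to `feria`, because only the `HARI RAYA` prefix is listed for solemnities.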

Staging Table

Add to packages/db/src/schema/sync-staging.ts (schema: sync_staging):

export const syncStagingLiturgi = pgTable('liturgi', {
  id: serial('id').primaryKey(),
  celebrationName: text('celebration_name').notNull(),
  month: text('month').notNull(),
  dayName: text('day_name').notNull(),
  dateNumber: integer('date_number').notNull(),
  isSunday: boolean('is_sunday').notNull().default(false),
  liturgicalColor: text('liturgical_color').notNull(),  // raw id e.g. "ccungu"
  massLabel: text('mass_label'),
  readings: text('readings').array().notNull().default([]),
  songs: text('songs'),
  scrapedAt: timestamp('scraped_at', { withTimezone: true }).defaultNow(),
}, (t) => ({
  uniq: unique().on(t.celebrationName, t.month, t.dateNumber, t.massLabel).nullsNotDistinct(),
}))

Implementation Plan

1. PublicHttpClient - packages/sync/src/public-http-client.ts

Simple fetch wrapper, no session/auth. Throws SyncScrapeError on non-200.
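
One possible shape for this client, as a sketch only. The `SyncScrapeError` class below is a stand-in for the existing error type (its real constructor may differ), and `FetchLike`, `resolveUrl`, and `getText` are hypothetical names. The fetch function is injected so that tests can pass a fake.

```typescript
// Stand-in for the repo's SyncScrapeError (assumption about its shape).
class SyncScrapeError extends Error {
  readonly status?: number;

  constructor(message: string, status?: number) {
    super(message);
    this.name = 'SyncScrapeError';
    this.status = status;
  }
}

// Minimal structural type for the injected fetch function (hypothetical name).
type FetchLike = (url: string) => Promise<{ status: number; text(): Promise<string> }>;

class PublicHttpClient {
  private readonly baseUrl: string;
  private readonly fetchImpl: FetchLike;

  constructor(baseUrl: string, fetchImpl: FetchLike) {
    this.baseUrl = baseUrl;
    this.fetchImpl = fetchImpl;
  }

  // Join base URL and path with exactly one slash between them.
  resolveUrl(path: string): string {
    return `${this.baseUrl.replace(/\/+$/, '')}/${path.replace(/^\/+/, '')}`;
  }

  // Fetch a page as text; any non-200 status is treated as a scrape failure.
  async getText(path: string): Promise<string> {
    const url = this.resolveUrl(path);
    const res = await this.fetchImpl(url);
    if (res.status !== 200) {
      throw new SyncScrapeError(`GET ${url} returned ${res.status}`, res.status);
    }
    return res.text();
  }
}
```

In production code the second constructor argument could be Node's global `fetch` or the `fetch` exported by undici; both return an object with `status` and `text()`.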

2. Scraper — packages/sync/src/scraper/liturgi.ts

export async function scrapeLiturgi(client: PublicHttpClient): Promise<RawLiturgiEntry[]>

Fetches /saranps.php, passes HTML to parser.

3. Parser — packages/sync/src/parser/liturgi.ts

Cheerio-based. Per row:

  1. Extract month, dayName, dateNumber, isSunday from <time>
  2. Map circle div id → liturgicalColor via LITURGICAL_COLOR_MAP
  3. Extract celebration name (first <strong> after circle, skip Bacaan: / Saran Nyanyian: labels)
  4. Split multi-mass entries by sub-mass label (Misa Malam, Misa Fajar, Misa Siang)
  5. Extract <a class=aayat> text → readings[]
  6. Extract <font class=psnum> text → songs
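
Steps 1 to 3 can be kept separate from the cheerio DOM work, which makes them easy to unit test. The sketch below assumes the strings have already been pulled out of one row. `RowParts` and `toRawEntry` are hypothetical names, `RawLiturgiEntry` is assumed to mirror the staging columns, and the multi-mass split from step 4 is left out, so `massLabel` is always null here.

```typescript
// Hypothetical intermediate shape: strings already extracted from one <tr>.
interface RowParts {
  timeClass: string;     // class attribute of <time>, e.g. "iconminggu"
  month: string;         // <em> text
  dayName: string;       // <strong> text inside <time>
  dateText: string;      // <span> text
  circleId: string;      // id of the circle div, e.g. "ccungu"
  strongTexts: string[]; // text of each <strong> in the content column, in order
  readings: string[];    // text of each <a class=aayat>
  songs: string | null;  // text of <font class=psnum>
}

// Assumed to mirror the staging table columns; the real type may differ.
interface RawLiturgiEntry {
  celebrationName: string;
  month: string;
  dayName: string;
  dateNumber: number;
  isSunday: boolean;
  liturgicalColor: string; // raw id, mapped later in the transform
  massLabel: string | null;
  readings: string[];
  songs: string | null;
}

// <strong> elements that are field labels rather than the celebration name.
const FIELD_LABELS = ['Bacaan:', 'Saran Nyanyian:'];

function toRawEntry(parts: RowParts): RawLiturgiEntry {
  const dateNumber = Number.parseInt(parts.dateText.trim(), 10);
  if (Number.isNaN(dateNumber)) {
    throw new Error(`Unparseable date number: "${parts.dateText}"`);
  }
  // First non-empty <strong> that is not a field label.
  const celebrationName = parts.strongTexts
    .map((text) => text.trim())
    .find((text) => text !== '' && !FIELD_LABELS.includes(text));
  if (celebrationName === undefined) {
    throw new Error('Row has no celebration name');
  }
  return {
    celebrationName,
    month: parts.month.trim(),
    dayName: parts.dayName.trim(),
    dateNumber,
    isSunday: parts.timeClass.split(/\s+/).includes('iconminggu'),
    liturgicalColor: parts.circleId,
    massLabel: null,
    readings: parts.readings.map((r) => r.trim()),
    songs: parts.songs === null ? null : parts.songs.trim(),
  };
}
```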

4. Constants — packages/sync/src/constants/liturgi.ts

LITURGICAL_COLOR_MAP and RANK_INFERENCE_RULES (prefix → CelebrationRank).

5. Staging writer — extend packages/sync/src/staging/index.ts

export async function writeLiturgiToStaging(
  db: DrizzleClient,
  entries: RawLiturgiEntry[],
  logger: ILogger,
): Promise<void>

Insert with ON CONFLICT DO NOTHING, so that re-running the scrape does not create duplicate staging rows.
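
The conflict target is the staging table's unique key (celebrationName, month, dateNumber, massLabel), where two null massLabel values count as equal. The sketch below mirrors that key in memory, which may be handy for asserting "no duplicates" in unit tests or for logging how many distinct entries a scrape produced. `StagingKeyFields`, `stagingKey`, and `dedupeByStagingKey` are hypothetical names, and the database constraint remains the real guard.

```typescript
// Only the fields that take part in the unique constraint.
interface StagingKeyFields {
  celebrationName: string;
  month: string;
  dateNumber: number;
  massLabel: string | null;
}

// Serialize the key; null massLabel values produce identical keys (nulls not distinct).
function stagingKey(entry: StagingKeyFields): string {
  return JSON.stringify([
    entry.celebrationName,
    entry.month,
    entry.dateNumber,
    entry.massLabel ?? null,
  ]);
}

// Keep the first entry for each key, preserving input order.
function dedupeByStagingKey<T extends StagingKeyFields>(entries: T[]): T[] {
  const seen = new Set<string>();
  return entries.filter((entry) => {
    const key = stagingKey(entry);
    if (seen.has(key)) return false;
    seen.add(key);
    return true;
  });
}
```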

6. Transform — packages/sync/src/transform/liturgi.ts

Maps sync_staging.liturgi → ordo:

export async function transformLiturgi(
  db: DrizzleClient,
  year: number,
  logger: ILogger,
): Promise<void>

  • Resolve full date from month (Indonesian name) + dateNumber + year
  • Map liturgicalColor → LiturgicalColor enum value
  • Infer rank from celebrationName prefix
  • Upsert into ordo with ON CONFLICT (date, massLabel) DO UPDATE; skip if source = 'manual'
  • Set source = 'lagumisa', createdBy = null
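
The date resolution step in the list above might look like the sketch below. `resolveOrdoDate` is a hypothetical helper name, and it assumes the site prints full Indonesian month names (as in the `Desember` example) and that `ordo.date` accepts an ISO `YYYY-MM-DD` string.

```typescript
// Full Indonesian month names, index 0 = January.
const INDONESIAN_MONTHS = [
  'Januari', 'Februari', 'Maret', 'April', 'Mei', 'Juni',
  'Juli', 'Agustus', 'September', 'Oktober', 'November', 'Desember',
];

// Hypothetical helper: build an ISO date string from the staging columns plus a year.
function resolveOrdoDate(month: string, dateNumber: number, year: number): string {
  const wanted = month.trim().toLowerCase();
  const monthIndex = INDONESIAN_MONTHS.findIndex((name) => name.toLowerCase() === wanted);
  if (monthIndex === -1) {
    throw new Error(`Unknown Indonesian month: "${month}"`);
  }
  const date = new Date(Date.UTC(year, monthIndex, dateNumber));
  // Date.UTC silently rolls invalid days over (31 Februari becomes a day in Maret),
  // so confirm that the components survived the round trip.
  if (date.getUTCMonth() !== monthIndex || date.getUTCDate() !== dateNumber) {
    throw new Error(`Invalid date: ${dateNumber} ${month} ${year}`);
  }
  return date.toISOString().slice(0, 10);
}
```

Working in UTC keeps the result independent of the server's time zone. For the manual-entry rule, PostgreSQL allows a WHERE clause on ON CONFLICT ... DO UPDATE, which is one way to leave rows with source = 'manual' untouched.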

7. Export from packages/sync/src/index.ts

Export scrapeLiturgi and transformLiturgi as part of the public API.


Acceptance Criteria

  • PublicHttpClient fetches /saranps.php without auth
  • Parser correctly extracts all fields for a regular weekday entry
  • Parser correctly extracts all fields for a Sunday entry (isSunday: true)
  • Parser correctly splits multi-mass celebration (e.g., Natal) into separate entries with massLabel
  • liturgicalColor correctly maps from div id to LiturgicalColor value
  • rank correctly inferred from celebration name prefix
  • writeLiturgiToStaging() upserts without duplicates
  • transformLiturgi() resolves full date correctly
  • transformLiturgi() never overwrites ordo entries with source = 'manual'
  • Staging table migration included in packages/db
  • ILogger injected via constructor throughout
  • Unit tests for parser covering: regular weekday, Sunday, multi-mass (Natal)

Files to Create / Modify

packages/sync/src/
  public-http-client.ts         # new
  scraper/liturgi.ts            # new
  parser/liturgi.ts             # new
  transform/liturgi.ts          # new
  constants/liturgi.ts          # new
  staging/index.ts              # extend
  index.ts                      # export new public API

packages/db/src/schema/
  sync-staging.ts               # add syncStagingLiturgi

packages/db/drizzle/
  <timestamp>_add_sync_staging_liturgi.sql   # new migration

References

  • HTML structure analysis: packages/sync/docs/sync-tdd.md Section 6
  • Staging table definition: docs/erd.md Section 6.1
  • ordo table definition: docs/erd.md Section 2.5
  • ILogger injection pattern: docs/tdd.md Section 11.2
  • Existing scraper pattern: packages/sync/src/scraper/umat.ts
