Python Scraper Functions

Akshay B edited this page Mar 16, 2026 · 1 revision

Python Scraper Functions (Azure Functions)

This document details scraper routes in PluckIt.Processor.

Documentation metadata

  • Audience: external contributors
  • Last reviewed: 2026-03-16
  • Scope: scraper contract only

Endpoint inventory

ScraperFunctions

  • GET /api/scraper/sources
    • Returns configured scraping sources.
    • Source metadata is read-only on this route; new sources are created via POST /api/scraper/sources.
  • POST /api/scraper/sources
    • Creates a new scraper source configuration.
    • Establishes crawl targets for later lease/run execution.
  • POST /api/scraper/lease/{source_id}
    • Leases source work for a scheduler cycle.
    • Prevents duplicate concurrent runs for a given source.
  • POST /api/scraper/ingest/reddit
    • Triggers the Reddit ingestion flow.
    • Accepts a payload specific to Reddit ingestion requests.
  • POST /api/admin/unban/{target_user_id}
    • Removes the scraper-side restriction on the target user.
    • Admin-only operation.
  • POST /api/scraper/subscribe/{source_id}
    • Subscribes the current user to source updates.
    • Writes the user-source preferences used by item fanout.
  • DELETE /api/scraper/subscribe/{source_id}
    • Removes the current user's subscription to a source.
    • Stops future delivery from that source to that user.
  • GET /api/scraper/items
    • Lists scraped items available to the current user context.
    • Represents the current scraped dataset view for authenticated users.
  • POST /api/scraper/items/{item_id}/feedback
    • Records item feedback.
    • Feeds preference signals into downstream taste and digest behavior.
  • POST /api/scraper/run/{source_id}
    • Runs the source pipeline on demand.
    • Admin-only operation for manual reprocessing and investigation.
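
The lease route above is described as preventing duplicate concurrent runs for a source. The following is a minimal sketch of that idea, not the PluckIt.Processor implementation: the `LeaseTable` class, its in-memory storage, and the `ttl_seconds` value are all hypothetical, chosen only to illustrate how a time-limited lease can reject a second caller.

```python
import time
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class LeaseTable:
    """Hypothetical in-memory lease store; a real service would likely use a shared database."""

    ttl_seconds: float = 300.0  # assumed lease lifetime, not taken from the source
    _expires: Dict[str, float] = field(default_factory=dict)

    def try_acquire(self, source_id: str, now: Optional[float] = None) -> bool:
        """Return True if the caller obtained the lease, False if another run still holds it."""
        now = time.monotonic() if now is None else now
        expiry = self._expires.get(source_id)
        if expiry is not None and expiry > now:
            return False  # an unexpired lease exists, so refuse a duplicate run
        self._expires[source_id] = now + self.ttl_seconds
        return True

    def release(self, source_id: str) -> None:
        """Drop the lease early, for example after a run finishes."""
        self._expires.pop(source_id, None)
```

A scheduler built on this sketch would call `try_acquire` before starting a run and skip the source when it returns `False`. The expiry means that a crashed run cannot block a source indefinitely.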

Admin routes

  • POST /api/admin/unban/{target_user_id}
  • POST /api/scraper/run/{source_id}
  • Both routes verify that the caller's ID is listed in ADMIN_USER_IDS.
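
The admin check above can be sketched as follows. This is an illustration under assumptions, not the project's code: it assumes ADMIN_USER_IDS is an environment variable holding a comma-separated list of user IDs, and the helper names `parse_admin_ids` and `is_admin` are hypothetical.

```python
import os
from typing import Optional, Set


def parse_admin_ids(raw: str) -> Set[str]:
    """Split an assumed comma-separated ID list, ignoring blanks and surrounding whitespace."""
    return {part.strip() for part in raw.split(",") if part.strip()}


def is_admin(caller_id: str, raw_admin_ids: Optional[str] = None) -> bool:
    """Return True when caller_id appears in the configured admin list."""
    if raw_admin_ids is None:
        # Assumption: the setting is exposed to the function app as an environment variable.
        raw_admin_ids = os.environ.get("ADMIN_USER_IDS", "")
    return caller_id in parse_admin_ids(raw_admin_ids)
```

In this sketch an unset or empty setting yields an empty admin set, so every caller is rejected by default. A route handler would typically return a 403 response when `is_admin` is false.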

Notes

  • These routes interact with the timer-based scraper pipeline described in Python-Background-Processing.md for delayed processing stages.
  • Admin routes can also be used to correct source state and remove user-level restrictions during operational incidents.
  • Scraped item and feedback endpoints are part of the active preference signal chain feeding analysis jobs.
