This is a clean, runnable reference implementation for automated market analysis in the cultural resource & heritage management space (designed originally to be hosted and run on a local Synology NAS).
CHARM = Cultural Heritage & Archaeological Resource Management
The point of CHARM is to guide the investment of resources and the development of undergraduate, graduate, and non-degree/professional programs and curricula, including courses, micro-degrees, and professional certificates. While it was built with cultural heritage and archaeology in mind, the pipeline is intentionally modular and can be adapted to other disciplines with minimal changes.
Outcomes: scrape job postings (American Anthropological Association & American Cultural Resources Association) → clean/dedupe → parse uploaded PDFs (industry reports) → spaCy Natural Language Processing (entity + skill extraction) → sentiment → geocode → analysis → insights → SQLite/CSVs → optional Google Sheets → Streamlit dashboard (Folium + Plotly) → downloadable PDF report.
Demo mode (no external services): set DEMO_MODE=1 when launching the dashboard to use the bundled synthetic snapshot in demo/processed/. Streamlit Cloud should set this env var so it never scrapes or calls paid APIs.
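One way to run the demo locally, assuming the same `streamlit run` entry point shown later in this README:

DEMO_MODE=1 streamlit run dashboard/app.py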
If you want the easiest local setup, start the Streamlit app and use the built-in wizard. It can ingest PDFs, run the pipeline, and show results in one place.
# 1. Clone and enter the repository
git clone https://github.com/YOUR_USERNAME/charm-market-intelligence-engine.git
cd charm-market-intelligence-engine
# 2. Set up environment
make setup
# 3. Create a local .env file
cp .env.example .env
# 4. Allow runs from the Streamlit wizard
# This should stay false in shared or hosted environments
# Open .env and set ALLOW_PIPELINE_RUN=true
# 5. Launch the app
make run-dashboard

If you are comfortable with the command line and prefer to run the pipeline directly, this is the simplest path:
# 1. Clone and enter the repository
git clone https://github.com/YOUR_USERNAME/charm-market-intelligence-engine.git
cd charm-market-intelligence-engine
# 2. Set up environment (creates venv, installs deps, downloads NLP models)
make setup
# 3. Run the pipeline
make run-pipeline
# 4. Launch the dashboard
make run-dashboard

Alternative (manual setup):
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
cp .env.example .env
# Download NLP models
python -m spacy download en_core_web_sm
python -c "import nltk; nltk.download('vader_lexicon')"
# Run pipeline and dashboard
python scripts/pipeline.py
streamlit run dashboard/app.py

If you are comfortable with Docker, you can run CHARM with containers:
cp .env.example .env
docker compose up --build dashboard

To run the pipeline in Docker:
docker compose run --rm pipeline

Cost-safe dry run:
`USE_LLM` and `USE_SHEETS` default to `false` in `.env.example`, so you can run the full pipeline locally without triggering OpenAI tokens or Google Sheets API calls. Flip them to `true` only after you are ready to authenticate those paid services.
After running the pipeline, check that these files were created:
| File | What it contains | Success indicator |
|---|---|---|
| `data/processed/jobs.csv` | Scraped and enriched job postings | File exists, has rows with title, company, skills columns |
| `data/processed/analysis.json` | Summary statistics | Contains num_jobs, top_skills, schema_version |
| `data/processed/insights.md` | Human-readable market brief | Contains "## In-demand Skills" section |
| `data/charm.db` | SQLite database | File exists (if USE_SQLITE=true) |
| `data/reports/CHARM_Report_*.pdf` | PDF report | File exists, starts with %PDF, > 10 KB |
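In addition to the shell commands below, a minimal Python sanity check of `analysis.json` might look like this; the key names follow the table above, and the exact schema should be confirmed against `docs/data_contract.md`:

```python
# sanity_check.py -- minimal sketch; key names assumed from the outputs table above
import json
from pathlib import Path

data = json.loads(Path("data/processed/jobs.csv").with_name("analysis.json").read_text())

# Fail loudly if the expected summary fields are missing or empty
for key in ("num_jobs", "top_skills", "schema_version"):
    assert key in data, f"missing key: {key}"
assert data["num_jobs"] > 0, "pipeline produced zero jobs"
print("analysis.json looks sane:", data["num_jobs"], "jobs")
```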
Quick validation command:
# Check that key outputs exist and have content
ls -la data/processed/
head -5 data/processed/jobs.csv
cat data/processed/analysis.json | head -20

Dashboard validation:
- Run `make run-dashboard`
- Open http://localhost:8501 in your browser
- You should see key findings cards, a map, and skill charts
- A "Download report" button in the header generates a PDF on demand
- If "No data yet" appears, the pipeline hasn't run successfully
CHARM organizes data into specific directories. Understanding this structure helps with debugging and customization.
charm-market-intelligence-engine/
├── config/ # Configuration files
│ ├── .env.example # Environment variable template
│ ├── insight_prompt.md # LLM prompt template (editable)
│ └── job_patterns.json # Job type/seniority regex patterns
├── data/ # All generated data (gitignored)
│ ├── cache/ # Cached API responses
│ │ ├── job_descriptions.json # Cached job detail pages
│ │ ├── reports_cache.json # Cached PDF extractions
│ │ └── gsheets_jobs_urls.txt # Synced job URLs
│ ├── processed/ # Pipeline outputs (dashboard reads these)
│ │ ├── jobs.csv # Enriched job postings
│ │ ├── reports.csv # Parsed PDF reports
│ │ ├── analysis.json # Summary statistics
│ │ ├── insights.md # Generated brief
│ │ └── wordcloud.png # Visualization
│ ├── geocache.csv # Location → lat/lon cache
│ └── charm.db # SQLite database
├── reports/ # Drop PDFs here for parsing
├── reports/ (Python) # PDF report generation package
│ ├── context.py # Build report context from pipeline artifacts
│ ├── pdf_report.py # Assemble PDF with ReportLab
│ └── styles.py # Page layout, fonts, table styles
├── skills/ # Taxonomy definitions
│ └── skills_taxonomy.csv # Skill aliases → normalized names
└── secrets/ # Credentials (gitignored)
└── service_account.json # Google API credentials
What gets cached (and why):
- Job descriptions: Avoids re-fetching the same posting detail pages
- PDF text: Avoids re-parsing unchanged PDFs
- Geocoding: Nominatim rate limits require caching location lookups
To clear caches and force fresh data:
rm -rf data/cache/
make run-pipeline

Copy .env.example to .env and configure as needed. The file includes detailed comments explaining each variable.
- `GOOGLE_SERVICE_ACCOUNT_FILE=ENTER_PATH_TO_SERVICE_ACCOUNT_JSON_HERE` - Path to your service account key file (like `secrets/service_account.json`).
- `GOOGLE_SHEET_ID=ENTER_GOOGLE_SHEET_ID_HERE` - The ID from your Sheet URL.
- `OPENAI_API_KEY=ENTER_OPENAI_API_KEY_HERE` - Only needed if `USE_LLM=true` and `LLM_PROVIDER=openai`.
- `GEOCODE_CONTACT_EMAIL=ENTER_CONTACT_EMAIL_HERE` - Required by Nominatim usage guidelines so your geocoding requests have a contact.
| Variable | Default | Purpose |
|---|---|---|
| `USE_SQLITE` | `true` | Persist processed jobs/reports into `data/charm.db`. |
| `USE_SHEETS` | `false` | Append jobs/reports to Google Sheets when credentials are configured. |
| `USE_LLM` | `false` | Enable the optional LLM brief via `config/insight_prompt.md`. |
| `ALLOW_PIPELINE_RUN` | `false` | Allow the Streamlit wizard to run the pipeline on this machine. |
| `PIPELINE_SCRAPE` | `true` | Run the job board scraper step. |
| `PIPELINE_REPORTS` | `true` | Parse PDFs in the reports folder. |
| `PIPELINE_NLP` | `true` | Run spaCy and skills extraction. |
| `PIPELINE_SENTIMENT` | `true` | Add sentiment scores. |
| `PIPELINE_GEOCODE` | `true` | Geocode locations using Nominatim. |
| `USER_AGENT` | `CHARM/1.0 (research)` | HTTP header for scrapers; include contact info. |
| `GEOCODE_CONTACT_EMAIL` | (empty) | Injected into the Nominatim UA per policy. |
| `LLM_PROVIDER`, `LLM_MODEL` | `openai`, `gpt-4o-mini` | Choose an LLM backend/model when `USE_LLM=true`. |
| `LLM_BASE_URL` | (empty) | Base URL for OpenAI-compatible providers when `LLM_PROVIDER=openai_compat`. |
| `HF_TOKEN`, `HF_MODEL` | (empty) | Use Hugging Face hosted inference when `LLM_PROVIDER=hf_inference`. |
| `GOOGLE_SERVICE_ACCOUNT_FILE`, `GOOGLE_SHEET_ID` | (empty) | Required for Sheets sync/tests. |
| `SCRAPER_MAX_WORKERS` | `4` | Number of concurrent detail-page fetches; lower for stricter rate limits. |
| `SCRAPER_REQUEST_INTERVAL` | `0.8` | Minimum seconds between outbound requests (global). Increase to slow the scraper. |
After editing, verify:
python scripts/gsheets_test.py # check Google Sheets access
python scripts/pipeline.py       # run the end-to-end pipeline

- n8n orchestration: n8n is an open-source workflow automation tool; we use it for Cron/Webhook triggers that run the pipeline via one Execute Command.
- Python pipeline: scraping → cleaning/dedupe → report parsing → Natural Language Processing (NLP) → sentiment → geocoding → analysis → insights → persistence.
- Storage: CSVs for the dashboard + SQLite for durable querying; Google Sheets for sharing raw rows.
- Dashboard: Streamlit with Plotly charts and a Folium map (heatmap + clustered markers).
- Large Language Model (LLM, optional): brief insights when enabled in `.env`.
Import n8n/charm_workflow.json and point Execute Command to:
bash -lc "cd /data/charm-market-intelligence-engine && source .venv/bin/activate && python scripts/pipeline.py"- Add parsers in
scripts/scrape_jobs.pyfor new job boards. - Update
skills/skills_taxonomy.csvwith additional skills/aliases. - Expand rules in
scripts/insights.pyto map skills → program formats. - Customize
config/job_patterns.jsonto tweak job-type/seniority detection; runpython scripts/validate_patterns.py(ormake validate-patterns) after edits to ensure regexes compile.
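To illustrate how taxonomy-driven normalization can work, here is a minimal sketch; the column names `alias` and `skill` are assumptions, so check the header row of `skills/skills_taxonomy.csv` before reusing it:

```python
# normalize_skills.py -- illustrative sketch; taxonomy column names are assumed, not verified
import pandas as pd

taxonomy = pd.read_csv("skills/skills_taxonomy.csv")
# Build a lowercase alias -> normalized-skill lookup (assumed columns: alias, skill)
alias_map = dict(zip(taxonomy["alias"].str.lower(), taxonomy["skill"]))

def normalize(raw_skills: str) -> str:
    """Map a comma-separated skills string onto canonical taxonomy names."""
    found = {alias_map.get(s.strip().lower(), s.strip()) for s in raw_skills.split(",") if s.strip()}
    return ", ".join(sorted(found))

print(normalize("arcgis pro, GIS, section 106"))  # e.g. "ArcGIS, GIS, Section 106"
```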
- Enable Google Sheets API and Google Drive API in Google Cloud Platform (GCP).
- Create a Service Account, download the JSON key to `secrets/service_account.json` (or your path).
- Set `GOOGLE_SHEET_ID` and `GOOGLE_SERVICE_ACCOUNT_FILE` in `.env` (replace the placeholders).
- Share the Sheet with the service account email as Editor.
- Test: `python scripts/gsheets_test.py`

| Symptom | Likely Cause | Fix |
|---|---|---|
| `ModuleNotFoundError: No module named 'spacy'` | Virtual environment not activated | Run `source .venv/bin/activate` (macOS/Linux) or `.venv\Scripts\activate` (Windows) |
| `OSError: [E050] Can't find model 'en_core_web_sm'` | spaCy language model not downloaded | Run `python -m spacy download en_core_web_sm` |
| No jobs scraped (empty CSV) | CSS selectors out of date OR site blocking requests | Check `scripts/scrape_jobs.py` selectors; add delays between requests |
| `FileNotFoundError: config/.env` | Environment file missing | Copy `.env.example` to `config/.env` and fill in values |
| Geocoding extremely slow | Nominatim rate limiting | Normal behavior; the geocache (`data/geocache.csv`) speeds up repeat runs |
| `google.auth.exceptions.DefaultCredentialsError` | Service account JSON missing or path wrong | Verify `GOOGLE_SERVICE_ACCOUNT_FILE` path in `.env` |
| Sheets append fails silently | Sheet ID incorrect or missing permissions | Double-check `GOOGLE_SHEET_ID`; share sheet with service account email |
| Dashboard won't start | Port 8501 already in use | Kill other Streamlit processes or use `streamlit run dashboard/app.py --server.port 8502` |
| `OPENAI_API_KEY` error when `USE_LLM=true` | Key not set or invalid | Add valid key to `.env` or set `USE_LLM=false` |
If you're running on a NAS or shared drive:
- Ensure write access to the `data/` directory
- SQLite may perform poorly over network mounts; consider running locally
Most scripts print progress to stdout. For more detail:
python scripts/pipeline.py 2>&1 | tee pipeline.log

The LLM question set lives in `config/insight_prompt.md`. It’s plain text with `{{variables}}` you can edit:
- `{{INDUSTRY}}`, `{{DATE_TODAY}}`, `{{NUM_JOBS}}`, `{{UNIQUE_EMPLOYERS}}`, `{{GEOCODED}}`
- `{{TOP_SKILLS_BULLETS}}` → a bullet list of top skills and counts
To see the fully rendered prompt (before sending to an LLM):
python scripts/preview_prompt.py

If the file is missing, the workflow falls back to a concise built-in prompt.
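If you want to experiment with the template outside the pipeline, rendering can be as simple as string substitution; this sketch assumes the `{{NAME}}` placeholder convention shown above, and the replacement values are invented for illustration:

```python
# render_prompt.py -- minimal sketch of {{variable}} substitution; values below are invented
from pathlib import Path

template = Path("config/insight_prompt.md").read_text()
values = {
    "INDUSTRY": "Cultural resource management",   # illustrative values only
    "DATE_TODAY": "2024-01-01",
    "NUM_JOBS": "128",
    "UNIQUE_EMPLOYERS": "57",
    "GEOCODED": "104",
    "TOP_SKILLS_BULLETS": "- ArcGIS (42)\n- Section 106 (31)\n- NEPA (27)",
}
for name, value in values.items():
    template = template.replace("{{" + name + "}}", value)
print(template)
```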
Dashboard design notes:
- Single column rhythm with clear section spacing; primary actions (filters, downloads) are easy to find.
- Subtle cards for key findings; no heavy boxes or loud colors.
- Plotly (plotly_white) with reduced chart chrome; labels kept concise.
- Folium map with heatmap + clustered markers for fast spatial scanning.
- Hidden default Streamlit menu/footer to keep focus on data.
- Sidebar filters drive all sections, so the page stays uncluttered.
- PDF export via a header download button; the report is generated with ReportLab and cached by a content fingerprint so repeated clicks are instant.
The dashboard includes a "Download report" button that generates a multi-section PDF from the latest pipeline run. The report is built with ReportLab and uses the Inter font family (falls back to Helvetica if the TTFs are missing).
Sections included:
- Cover page (title, date range, fingerprint)
- Executive Summary (5 data-driven bullets)
- Key Findings (up to 9 metrics in a 3-column grid)
- Trends & Signals (top 12 skills + emerging skills tables)
- Implications & Opportunities (actionable cards with "Why it matters")
- Methods & Governance (data sources, approach, limitations)
- Appendix (definitions and employer sources)
How it works:
- `reports/context.py` reads `jobs.csv`, `analysis.json`, and `insights.md` from the processed directory and builds a normalized context dict.
- `reports/pdf_report.py` turns that context into ReportLab flowables and assembles the PDF.
- `reports/styles.py` defines the page layout, paragraph styles, and table styles.
- `dashboard/header.py` wires the download button into the Streamlit header; the PDF bytes are cached by a SHA-256 fingerprint of the source files, so the report is only regenerated when data changes (a fingerprint sketch follows this list).
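The fingerprint idea is straightforward; here is a minimal sketch of hashing the three source files together. The exact file list and ordering used by `reports/context.py` are assumptions here:

```python
# fingerprint.py -- illustrative SHA-256 fingerprint over the report's source files
# (file list and ordering are assumptions; see reports/context.py for the real logic)
import hashlib
from pathlib import Path

def fingerprint(proc_dir: str = "data/processed") -> str:
    digest = hashlib.sha256()
    for name in ("jobs.csv", "analysis.json", "insights.md"):
        path = Path(proc_dir) / name
        if path.exists():
            digest.update(path.read_bytes())
    return digest.hexdigest()[:12]  # a short prefix is enough for a cache key

print(fingerprint())
```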
Generating a report from the command line:
python scripts/generate_report.py --proc-dir data/processed --out-dir data/reports

This writes a CHARM_Report_<fingerprint>.pdf to the output directory and runs basic validity checks (PDF header present, file size > 10 KB).
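To repeat those checks by hand, something like the following shell snippet works; the glob assumes the `CHARM_Report_*.pdf` naming shown above:

```bash
# Verify the newest report starts with the PDF magic bytes and is larger than 10 KB
report=$(ls -t data/reports/CHARM_Report_*.pdf | head -1)
head -c 4 "$report" | grep -q '%PDF' && echo "header OK"
[ "$(wc -c < "$report")" -gt 10240 ] && echo "size OK"
```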
By default USE_LLM=false in config/.env.example, so the rules-based brief runs without triggering model calls. Flip it to true only when you are ready to authenticate a provider. When enabled, the pipeline renders the external prompt (config/insight_prompt.md) with your current data.
Choose a provider in .env:
- `LLM_PROVIDER=openai` → set `OPENAI_API_KEY=ENTER_OPENAI_API_KEY_HERE`
- `LLM_PROVIDER=ollama` → set `OLLAMA_BASE_URL` (default `http://localhost:11434`) and `LLM_MODEL` (e.g., `llama3:instruct`)
If no key is present or the call fails, the pipeline still produces rules-based insights.
The pipeline can also append a reports tab with parsed report metadata:
- Worksheet name (default): `reports` (configure via `GOOGLE_SHEET_WORKSHEET_REPORTS`)
- Columns: report_name, word_count, skills (comma-separated)
Use the included Makefile to run common tasks with short commands:
make setup # venv + requirements + models
make run # scrape → process → analyze → insights → SQLite/CSVs → Sheets
make dash # launch the Streamlit dashboard
make sheets-test # verify Google Sheets setup
make prompt # preview rendered LLM prompt
make reset-db # delete data/charm.db (keeps CSVs)
make clean        # clear caches

On macOS/Linux it works out of the box. On Windows, use Git Bash or WSL.
- Scrape job boards (AAA + ACRA) with pagination → `scripts/scrape_jobs.py`
  - Collects: source, title, company, location, date_posted, job_url, description
  - Walks “Next” pages safely (limit=10) and de-dupes by `job_url`.
- Clean & de-duplicate → `scripts/data_cleaning.py`
  - Normalizes text, hashes `(title|company|desc-snippet)` to drop dupes (see the hashing sketch after this list).
  - Extracts salary hints (`salary_min`, `salary_max`, `currency`) when present.
- Parse industry reports (PDFs) → `scripts/parse_reports.py`
  - Reads PDFs from `/reports/` with PyMuPDF; outputs one row per report.
- NLP enrichment (jobs + reports) → `scripts/nlp_entities.py`
  - spaCy NER (ORG/GPE/LOC) and skills taxonomy matching (`skills/skills_taxonomy.csv`).
- Sentiment (optional) → `scripts/sentiment_salience.py`
- Geocode locations with Nominatim + on-disk cache → `scripts/geocode.py`
- Persist results
  - CSVs to `data/processed/` (for the dashboard)
  - SQLite to `data/charm.db` (for durable querying and auditing)
- Share (optional): append jobs + reports to Google Sheets
- Analyze → `scripts/analyze.py` (top skills, counts; optional clustering)
- Generate insights → `scripts/insights.py`
  - Rules-based recommendations (always)
  - LLM brief using the external prompt (`config/insight_prompt.md`)
- Visualize → `dashboard/app.py` (Streamlit + Plotly + Folium)
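The content-hash dedupe mentioned above is worth illustrating; this sketch follows the `(title|company|desc-snippet)` recipe from the list, while the exact snippet length and normalization used in `scripts/data_cleaning.py` are assumptions:

```python
# dedupe_sketch.py -- illustrative content hashing for duplicate detection
# (field recipe follows the list above; snippet length and normalization are assumed)
import hashlib
import pandas as pd

def content_key(row: pd.Series) -> str:
    snippet = str(row.get("description", ""))[:200].lower().strip()
    raw = "|".join([str(row.get("title", "")).lower().strip(),
                    str(row.get("company", "")).lower().strip(),
                    snippet])
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()

jobs = pd.read_csv("data/processed/jobs.csv")
jobs["content_key"] = jobs.apply(content_key, axis=1)
deduped = jobs.drop_duplicates(subset=["job_url"]).drop_duplicates(subset=["content_key"])
print(f"{len(jobs) - len(deduped)} duplicates removed")
```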
- Drop .pdf files into the `reports/` folder.
- On the next run, the pipeline will extract text with PyMuPDF, enrich with NER + skills, and:
  - write `data/processed/reports.csv`
  - upsert into `data/charm.db` (`reports` table)
  - append a concise row to Google Sheets (worksheet: `reports`) with `report_name`, `word_count`, and aggregated `skills`.
- Parsed text is cached in `data/cache/reports_cache.json`, so unchanged PDFs aren't re-read on every execution.
- Reports are combined with job data in analysis and in the LLM prompt context to surface trends and gaps.
The insights module translates demand signals into program formats:
- Undergraduate (online), Graduate (online)
- Certificate, Post-baccalaureate
- Workshop, Microlearning
How it works:
- `scripts/insights.py` contains simple, transparent rules that map top skills to program formats (see the sketch after this list).
- If `USE_LLM=true`, the external prompt (`config/insight_prompt.md`) requests 5 trend statements, 3 emerging skills, and 3 program gaps/opportunities, explicitly referencing those formats.
- To tailor outputs for different catalogs or brands, adjust:
  - `skills/skills_taxonomy.csv` (aliases & categories)
  - mapping rules in `scripts/insights.py`
  - the prompt language in `config/insight_prompt.md`
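The rules themselves can be tiny; here is a hedged sketch of mapping skill demand to the program formats listed above. The thresholds and groupings are invented for illustration and are not the actual rules in `scripts/insights.py`:

```python
# map_formats.py -- illustrative mapping of skill demand to program formats
# (thresholds and format choices are invented; see scripts/insights.py for the real rules)
TOP_SKILLS = {"ArcGIS": 42, "Section 106": 31, "NEPA": 27, "Drone mapping": 6}

def recommend(skill: str, count: int) -> str:
    if count >= 30:
        return "Certificate or Graduate (online) course"   # sustained, high-volume demand
    if count >= 15:
        return "Workshop or Post-baccalaureate module"      # steady demand
    return "Microlearning unit"                             # emerging or niche demand

for skill, count in sorted(TOP_SKILLS.items(), key=lambda kv: -kv[1]):
    print(f"{skill} ({count} postings) → {recommend(skill, count)}")
```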
- Pagination: the scrapers follow “Next” links (rel/aria/title/text) with a safe page limit.
- Politeness: configurable rate limiting + polite User-Agent (see `SCRAPER_MAX_WORKERS` / `SCRAPER_REQUEST_INTERVAL` in `.env`). For example, the defaults (~4 workers, 0.8 s interval) average ~5 detail fetches/sec; lower these values if the target site throttles faster (see the sketch after this list).
- Job-description caching: fetched detail pages are stored in `data/cache/job_descriptions.json` so reruns avoid hammering the same postings. Adjust worker count/interval in `.env` to tune throughput.
- Dedupe: by `job_url` and content hash to avoid churn and inflated counts.
- Respect sites: check each site's robots.txt and Terms of Service before scraping; scale cautiously and cache aggressively.
- No PII: the pipeline collects job-level, non-personal data only; avoid ingesting personally identifiable information (PII).
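A minimal sketch of the "interval + cache" pattern described above, using the env var names from the configuration table; the cache file layout is an assumption, not the scraper's actual format:

```python
# polite_fetch.py -- minimal sketch of a global request interval and cached detail fetches
# (env var names match the configuration table; cache file layout is an assumption)
import json, os, time
from pathlib import Path
import requests

INTERVAL = float(os.getenv("SCRAPER_REQUEST_INTERVAL", "0.8"))
HEADERS = {"User-Agent": os.getenv("USER_AGENT", "CHARM/1.0 (research)")}
CACHE = Path("data/cache/job_descriptions.json")
cache = json.loads(CACHE.read_text()) if CACHE.exists() else {}
_last_request = 0.0

def polite_get(url: str) -> str:
    """Return cached HTML if available; otherwise fetch, respecting the global interval."""
    global _last_request
    if url in cache:
        return cache[url]
    wait = INTERVAL - (time.monotonic() - _last_request)
    if wait > 0:
        time.sleep(wait)
    response = requests.get(url, headers=HEADERS, timeout=30)
    _last_request = time.monotonic()
    cache[url] = response.text
    return response.text
```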
The pipeline can post a completion message to a Mattermost channel using an incoming webhook.
Setup:
- Create an Incoming Webhook in your Mattermost workspace (bound to a channel).
- In `.env`, set:
  - `MATTERMOST_WEBHOOK_URL=ENTER_MATTERMOST_WEBHOOK_URL_HERE`
  - `DASHBOARD_URL=ENTER_DASHBOARD_URL_HERE` (public or internal URL for your Streamlit app)
- Run `make run` (or `python scripts/pipeline.py`).
What it sends:
- A check-marked completion line
- A short summary (total postings, employers, geocoded count, top 5 skills)
- A link to the dashboard
- An optional short snippet from `insights.md` (“LLM Brief” if present)
Where to customize: open n8n/charm_workflow_mattermost.json, find the Notification Config node, and edit the embedded JavaScript to change the summary payload.
Import n8n/charm_workflow_mattermost.json for a version of the workflow that notifies Mattermost after each run.
Configure in the “Notification Config” node:
- `webhookUrl` → your Mattermost Incoming Webhook URL
- `dashboardUrl` → link to your Streamlit app (public or internal)
- `mention` → optional (`@channel`, `@here`, or empty)
- `thresholdSkillsCsv` → comma-separated skills to track (e.g., `ArcGIS,Section 106,NEPA`)
- `thresholdPercent` → percentage change to trigger an alert (default `20`)
What the message includes:
- ✅ Completion line
- Totals (postings, employers, geocoded)
- Top 5 skills
- Dashboard link
- Alerts section if thresholds are hit (↑/↓ with % change vs previous run)
- A short Brief snippet extracted from `insights.md`
How thresholds work:
- The workflow reads the previous snapshot from `data/processed/analysis_prev.json` (if present)
- It writes the current `analysis.json` to that file after posting the message, so the next run compares properly
- Zero results trigger a “No jobs scraped” alert automatically
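To make the comparison concrete, here is a hedged Python sketch of the percent-change check; the n8n workflow does this in embedded JavaScript, and the exact JSON shape of `top_skills` is an assumption:

```python
# threshold_check.py -- illustrative skill-count comparison between two analysis snapshots
# (top_skills is assumed to be a {skill: count} mapping; the real schema may differ)
import json
from pathlib import Path

THRESHOLD_PERCENT = 20
TRACKED = ["ArcGIS", "Section 106", "NEPA"]

current = json.loads(Path("data/processed/analysis.json").read_text())
prev_path = Path("data/processed/analysis_prev.json")
previous = json.loads(prev_path.read_text()) if prev_path.exists() else {}

alerts = []
for skill in TRACKED:
    now = current.get("top_skills", {}).get(skill, 0)
    before = previous.get("top_skills", {}).get(skill, 0)
    if before and abs(now - before) / before * 100 >= THRESHOLD_PERCENT:
        arrow = "↑" if now > before else "↓"
        alerts.append(f"{arrow} {skill}: {before} → {now}")

print("\n".join(alerts) or "No threshold alerts")
# Snapshot the current analysis so the next run has something to compare against
prev_path.write_text(json.dumps(current))
```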
This pipeline supports two classes of LLM backends:
Cloud (commercial): OpenAI
- Set `LLM_PROVIDER=openai` and `OPENAI_API_KEY=ENTER_OPENAI_API_KEY_HERE`
- Pros: highest quality, simple setup, scalable.
- Cons: usage cost; data governance requires key management.
Self-hosted:
- Ollama (simple local runner on CPU/GPU; great for demos)
  - Set `LLM_PROVIDER=ollama`, `OLLAMA_BASE_URL=http://localhost:11434`, `LLM_MODEL=llama3:instruct` (or similar)
  - Pros: easiest local setup; good developer ergonomics.
  - Cons: slower on CPU-only; fewer enterprise durability knobs.
- OpenAI-compatible server (e.g., vLLM on GPU)
  - Set `LLM_PROVIDER=openai_compat`, `LLM_BASE_URL=http://YOUR-HOST:PORT/v1`, `LLM_MODEL=YourModelName`
  - Pros: production-friendly throughput and token cost control; keeps data in your infra.
  - Cons: requires GPU provisioning & ops (e.g., vLLM/TGI deployment).
Recommendation: For a Synology/NAS demo, Ollama is the fastest path to a working self-host. For higher throughput or larger prompts, deploy vLLM with an OpenAI-compatible endpoint and switch to LLM_PROVIDER=openai_compat.
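As a rough illustration of what the provider switch amounts to, here is a minimal sketch that routes a prompt either to OpenAI (or an OpenAI-compatible endpoint) or to Ollama's local HTTP API; it is not the project's actual `scripts/insights.py` logic, and the model names are placeholders:

```python
# llm_switch.py -- hedged sketch of routing a prompt to OpenAI-compatible or Ollama backends
import os
import requests

def generate(prompt: str) -> str:
    provider = os.getenv("LLM_PROVIDER", "openai")
    model = os.getenv("LLM_MODEL", "gpt-4o-mini")
    if provider == "ollama":
        # Ollama's local generate endpoint returns {"response": "..."} when stream=false
        base = os.getenv("OLLAMA_BASE_URL", "http://localhost:11434")
        r = requests.post(f"{base}/api/generate",
                          json={"model": model, "prompt": prompt, "stream": False}, timeout=120)
        return r.json()["response"]
    # openai / openai_compat: same client, optionally pointed at LLM_BASE_URL
    from openai import OpenAI
    client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"),
                    base_url=os.getenv("LLM_BASE_URL") or None)
    chat = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}])
    return chat.choices[0].message.content

print(generate("Summarize hiring demand for GIS skills in two sentences."))
```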
Supporting both cloud and local LLMs is practical:
- Cost control: Cloud models are metered. Ollama lets you run unlimited local inferences at no extra cost.
- Data governance: Some orgs want text to stay on-prem. A local model keeps everything in your infrastructure.
- Portability: Anyone can clone this repo and get working insights with Ollama -- no API key required.
- Failover: If your cloud quota runs out or there's an outage, the local model keeps the pipeline functional.
This repository is organized so a reviewer can read it top-down and understand exactly how the system works. Every piece below has a clear, single responsibility.
- `n8n/charm_workflow.json` - Minimal scheduler/trigger that runs the Python pipeline via Execute Command from a Cron or Webhook.
- `n8n/charm_workflow_mattermost.json` - Same as above, with post-run Mattermost notifications. Reads `analysis.json` and `insights.md`, composes a short message (totals, top skills, optional alerts, brief), posts to your incoming webhook, and snapshots the current analysis for next-run comparisons.
- `config/.env.example` - Environment variables with explicit placeholders (e.g., `ENTER_GOOGLE_SHEET_ID_HERE`). Copy to `.env` and fill in. Includes LLM provider switches, user agent, and dashboard URL.
- `config/insight_prompt.md` - The human-editable prompt template used when `USE_LLM=true`. It’s rendered with live variables (date, counts, top skills) before calling the model.
- `skills/skills_taxonomy.csv` - Deterministic mapping of common terms/aliases to normalized skill names and (optionally) categories. This keeps “GIS” vs “ArcGIS” vs “ArcGIS Pro” consistent in analysis.
- `scripts/pipeline.py` - The orchestrator. Runs end-to-end: scrape → clean/dedupe → parse reports → NLP/skills → sentiment → geocode → analyze → insights → persist (CSV/SQLite) → optional Google Sheets append.
- `scripts/scrape_jobs.py` - Scrapers for ACRA + AAA with pagination and per-item description fetching. Uses a configurable `USER_AGENT` and polite defaults.
- `scripts/data_cleaning.py` - Normalization and duplicate detection (content hashing across title/company/description snippet). Also extracts salary hints when present.
- `scripts/parse_reports.py` - Reads PDFs from `/reports/` with PyMuPDF; emits one record per report with the raw text for downstream NLP.
- `scripts/nlp_entities.py` - spaCy NER for organizations and locations + taxonomy-based skill extraction. Produces a comma-separated `skills` column.
- `scripts/sentiment_salience.py` - Optional lightweight sentiment using VADER (useful for qualitative clustering or future labeling).
- `scripts/geocode.py` - Geocodes the `location` field using Nominatim with on-disk caching; attaches `lat` and `lon` for mapping.
- `scripts/analyze.py` - Pandas-based summaries (top skills, counts, employers, geocoded totals). Can be extended to clustering or time-series.
- `scripts/insights.py` - Generates a short, human-readable brief. Always emits rule-based recommendations; when `USE_LLM=true`, renders `config/insight_prompt.md` and calls the selected provider (OpenAI or Ollama).
- `scripts/gsheets_sync.py` - Appends jobs and reports to Google Sheets. Handles worksheet creation and de-dupe by URL or name.
- `scripts/gsheets_test.py` - One-liner connectivity check for Sheets credentials and permissions.
- `scripts/preview_prompt.py` - Renders the final LLM prompt (with current data) so you can review or paste it elsewhere.
- `scripts/pandas_examples.py` - Extra recipes for ad-hoc analysis; helpful for quick CSV exports during exploration.
- `data/charm.db` - SQLite database created on first run (durable auditing and ad-hoc queries).
- `data/processed/` - CSV and artifacts used by the dashboard: `jobs.csv`, `reports.csv`, `analysis.json`, `insights.md`, and `wordcloud.png`.
- `docs/sql_examples.sql` - A few ready-to-use SQL queries against `charm.db` (e.g., salary by skill, recent Section 106/NEPA postings).
- `docs/data_contract.md` - Field-level documentation for each exported file so downstream teams know how to consume them.
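For quick ad-hoc querying without opening a SQL client, a small sqlite3 sketch like this works; the table and column names (`jobs`, `skills`, `salary_min`) are assumptions, so check `docs/data_contract.md` or `docs/sql_examples.sql` for the real schema:

```python
# query_charm.py -- hedged example of querying the SQLite store
# (table/column names are assumptions; consult docs/data_contract.md for the real schema)
import sqlite3

with sqlite3.connect("data/charm.db") as conn:
    rows = conn.execute(
        """
        SELECT company, title, salary_min
        FROM jobs
        WHERE skills LIKE '%Section 106%'
        ORDER BY salary_min DESC
        LIMIT 10
        """
    ).fetchall()

for company, title, salary in rows:
    print(f"{company}: {title} (min salary: {salary})")
```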
- `dashboard/app.py` - Single-page, minimalist UI:
  - Key findings cards (postings, employers, geocoded)
  - Top skills bar chart (Plotly)
  - Job map with heatmap + clustered markers (Folium)
  - Insights panel and word cloud
  - Sidebar filters and simple download actions (filtered CSV, analysis JSON)
- `dashboard/header.py` - Header strip with the PDF download button; caches generated bytes by content fingerprint.
- `.streamlit/config.toml` - Neutral, brand-agnostic theme.
- `reports/context.py` - Builds and normalizes a report context dict from pipeline artifacts (`jobs.csv`, `analysis.json`, `insights.md`). Computes a SHA-256 fingerprint for cache invalidation.
- `reports/pdf_report.py` - Assembles the multi-section PDF from the context dict using ReportLab flowables.
- `reports/styles.py` - Page layout, paragraph styles, table styles, and font registration (Inter with Helvetica fallback).
- `scripts/generate_report.py` - CLI script to generate and validate a PDF report without the dashboard.
- `Makefile` - Short commands for setup, running the pipeline, launching the dashboard, testing Sheets, previewing the prompt, and cleanup.
- `requirements.txt` - Python dependencies (scraping, NLP, analysis, dashboard, LLM providers).
- `LICENSE` - MIT license.
- `CHANGELOG.md` - A concise record of what’s included in this release and why certain decisions were made.
- n8n triggers the run (Cron or Webhook) → executes `scripts/pipeline.py` in the repo directory on your NAS.
- The pipeline scrapes jobs (with pagination), parses any PDFs in `/reports/`, enriches with NLP/skills, geocodes, analyzes, and writes outputs to both CSV and SQLite.
- Google Sheets (optional) is updated with appended rows for jobs and reports (so stakeholders can view raw structured data).
- The dashboard reads `data/processed/*` and refreshes automatically when files change.
- The n8n Mattermost workflow (optional) reads the latest outputs, composes a short message (totals, top skills, alerts), posts to your channel, and snapshots the analysis for the next run.
Everything is idempotent: duplicates are filtered, pagination is capped, geocoding is cached, and runs can be scheduled safely.
- LLM calls (optional): Each pipeline run with `gpt-4o-mini` costs well under $1 (usually a few cents). The default prompt and response fit comfortably within 1200 tokens. Set `USE_LLM=false` for completely free runs. If you're running this on a schedule, estimate your monthly call volume and budget accordingly.
- Google Sheets sync (optional): Setting `USE_SHEETS=true` turns on both Sheets and Drive APIs. They are metered after the free tier, and every run makes a few dozen append/read calls. Leave it `false` until you create a GCP project, confirm quotas, and budget for increased throughput (e.g., batch jobs nightly instead of per-scrape).
- Geocoding: The built-in Nominatim client is free but rate-limited to 1 request/sec; heavy usage may require hosting your own instance. Because geocoding is cached in `data/geocache.csv`, reruns stay cost-free unless you clear the cache.
- Storage/dashboards: Streamlit + SQLite incur no extra spend -- everything runs locally. When deploying to cloud infrastructure, include VM/storage costs in your overall estimate.
- Sheets cache resets: The Google Sheets sync stores cached job/report IDs under `data/cache/`. If someone edits or deletes rows directly in the Sheet, clear those files before the next run so the pipeline can rebuild its local view of existing rows.
Document these toggles in your runbook so reviewers understand how to perform a zero-cost demo vs. a production run with LLM + Sheets enabled.
- With the defaults (4 workers, 0.8 s interval) expect roughly 5 detail-page fetches per second. Increase the interval or lower workers if a target board publishes stricter rate limits.