Expand Ashby slug pool, add slug verification utility, and expose Ashby filters in API #2
Conversation
Pull request overview
This PR expands Ashby ingestion by switching to Ashby’s public posting API, greatly enlarging the default Ashby company slug pool, and exposing per-request filtering controls through the Step 1 pipeline endpoint. It also adds a small Python utility to generate and optionally verify candidate Ashby slugs, plus documentation and env-config updates to support the new behavior.
Changes:
- Refactored Ashby/TheirStack fetcher to use Ashby’s posting API, add keyword/date filtering, and support per-request options (companies/keywords/freshness/limit/range).
- Extended `/api/pipeline/step1` to accept and forward Ashby filter query params.
- Added new Ashby-related env config, a slug verification script, and README/.env.example documentation updates.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 10 comments.
| File | Description |
|---|---|
| scripts/ashby_slugs_verified.py | New CLI utility to write curated slug lists and optionally verify slugs via jobs.json. |
| backend/src/services/theirstack-fetcher.ts | Switches to Ashby posting API, expands slug pool, adds filtering/normalization, delay, and option-driven fetching. |
| backend/src/routes/pipeline.ts | Adds query parsing for Ashby filters and forwards them into the fetcher. |
| backend/src/config.ts | Adds env-config for Ashby keywords, lookback window, compensation inclusion, and request delay. |
| README.md | Documents new Ashby behavior, request params, and slug script usage. |
| .env.example | Adds example values for the new Ashby env vars. |
```ts
const companies = resolveCompanies(options.companySlugs);
const keywords = normalizeKeywords(options.keywords);
const lookbackHours = options.publishedWithinHours ?? config.ashbyPublishedWithinHours;
const postedTodayOnly = options.postedTodayOnly || false;
```
When companySlugs is omitted, resolveCompanies(...) returns the full ASHBY_COMPANY_SLUGS list (currently very large), so a default Step 1 run will issue hundreds of requests. With the default per-company delay, this can turn Step 1 into a multi-minute operation. Consider capping the default number of companies per run and/or adding limited concurrency with rate limiting to keep ingestion time bounded.
@copilot maybe we can put this under user control, letting the caller choose how many companies to fetch.
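If the company count does become user-controlled, the limited concurrency suggested above could look like the following minimal sketch. `mapWithConcurrency` is a hypothetical helper (not part of the PR), and the cap and concurrency values are illustrative:

```typescript
// Sketch: run per-company fetches with bounded concurrency so a large slug
// pool does not turn Step 1 into a strictly sequential multi-minute run.
async function mapWithConcurrency<T, R>(
  items: T[],
  limit: number,
  fn: (item: T) => Promise<R>,
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0; // shared cursor: each worker claims the next unprocessed index
  async function worker(): Promise<void> {
    while (next < items.length) {
      const i = next++;
      results[i] = await fn(items[i]);
    }
  }
  // Spawn at most `limit` workers; each drains the shared queue.
  await Promise.all(
    Array.from({ length: Math.min(limit, items.length) }, () => worker()),
  );
  return results;
}
```

A default cap could then be applied before mapping, e.g. `companies.slice(0, maxCompanies)` with `maxCompanies` coming from a query param or config.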
```ts
const companies = parseCommaList(request.query.companies).map((slug) => slug.toLowerCase());
const keywords = parseCommaList(request.query.keywords).map((keyword) => keyword.toLowerCase());
const postedTodayOnly = parseBoolean(request.query.postedToday);
```
If a caller supplies companies=..., unknown slugs are silently kept in the request but later dropped by resolveCompanies(...), which can lead to a successful-but-empty run that’s hard to debug. Consider validating companies against the supported slug pool and returning 400 (or at least logging/returning the invalid slugs) when none match.
@copilot open a new pull request to apply changes based on this feedback
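One way to surface invalid slugs instead of silently dropping them, sketched under assumed names (`validateCompanies` is hypothetical; the route would translate `ok: false` into a 400 response):

```typescript
// Sketch: validate requested company slugs against the supported pool.
// Fails loudly when nothing matches, and reports partial mismatches.
function validateCompanies(
  requested: string[],
  supported: Set<string>,
): { ok: boolean; unknown: string[] } {
  const unknown = requested.filter((slug) => !supported.has(slug));
  if (unknown.length > 0 && unknown.length === requested.length) {
    // No requested slug is supported: better a 400 than an empty "success".
    return { ok: false, unknown };
  }
  // Partial matches proceed; unknown slugs can be logged or echoed back.
  return { ok: true, unknown };
}
```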
```ts
const limit = !isNaN(limitVal) ? limitVal : 10;
```
limit is defaulted to 10 even when the query param isn’t provided. Since this value is always forwarded to fetchTheirStack, it overrides config.jobCount/JOB_COUNT for source=theirstack runs. Consider leaving limit as undefined unless request.query.limit is present (or defaulting it from config.jobCount specifically for TheirStack).
@copilot open a new pull request to apply changes based on this feedback
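A minimal sketch of the suggested fix, assuming a hypothetical `parseLimit` helper in the route; leaving the value `undefined` lets the fetcher keep falling back to `config.jobCount`:

```typescript
// Sketch: only produce a limit when the query param was actually supplied,
// so an absent param does not override config.jobCount / JOB_COUNT.
function parseLimit(raw: string | undefined): number | undefined {
  if (raw === undefined) return undefined; // defer to config.jobCount downstream
  const value = parseInt(raw, 10);
  return Number.isNaN(value) ? undefined : value;
}
```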
```ts
  ),
  ashbyIncludeCompensation:
    (process.env.ASHBY_INCLUDE_COMPENSATION || 'true').toLowerCase() === 'true',
  ashbyRequestDelayMs: parseInt(process.env.ASHBY_REQUEST_DELAY_MS || '1000', 10),
```
ashbyRequestDelayMs is parsed with parseInt(...) but not validated. If ASHBY_REQUEST_DELAY_MS is non-numeric, this becomes NaN and the > 0 check later will evaluate false, effectively disabling request throttling unexpectedly. Consider falling back to the default (1000ms) when Number.isNaN(...) (and/or clamping to >= 0).
@copilot open a new pull request to apply changes based on this feedback
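A sketch of the suggested fallback, using a hypothetical `parseDelayMs` helper; it falls back to the 1000 ms default on non-numeric input and clamps negatives to zero:

```typescript
// Sketch: parse ASHBY_REQUEST_DELAY_MS defensively so a typo in the env var
// cannot silently disable request throttling via NaN.
function parseDelayMs(raw: string | undefined, fallback = 1000): number {
  const value = parseInt(raw ?? '', 10);
  return Number.isNaN(value) ? fallback : Math.max(0, value);
}
```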
```ts
const result: AshbyPublicResponse = await response.json();
const postings = result.jobs ?? [];
return postings.map((posting) => ({ company: companyName, slug, posting }));
```
await response.json() can throw if Ashby returns non-JSON (e.g., HTML error page) even when response.ok is true, which would currently fail the entire Step 1 run. Consider guarding JSON parsing with try/catch (and logging the slug + status) so one bad response doesn’t stop ingestion.
@copilot open a new pull request to apply changes based on this feedback
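A possible guard, extracted into a hypothetical `parseJobs` helper (the types are assumptions based on the snippet above):

```typescript
// Sketch: guard JSON parsing so one malformed Ashby response (e.g. an HTML
// error page with a 200 status) skips that company instead of aborting Step 1.
async function parseJobs(response: Response, slug: string): Promise<unknown[]> {
  try {
    const result = (await response.json()) as { jobs?: unknown[] };
    return result.jobs ?? [];
  } catch (err) {
    // Log the slug and status so bad responses are debuggable, then move on.
    console.warn(`Skipping ${slug}: non-JSON response (status ${response.status})`, err);
    return [];
  }
}
```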
```python
# Compact curated seed list; can be extended over time.
SLUGS = [
    'airtable','alan','altura','away','deliveroo','duolingo','flock-safety','hackerone','notion','opendoor',
    'oyster','posthog','ramp','sequoia','sony','vanta','cursor','deel','harvey','modern-treasury','openai',
    'reddit','shopify','snowflake','apify','ashby','buffer','factory','hcompany','jerry.ai','lightning','linear',
```
The curated slug seed list is duplicated between this script (SLUGS) and the backend’s ASHBY_COMPANY_SLUGS, which can drift over time. Consider sourcing both from a single canonical file (e.g., have the script generate a file consumed by the backend, or have the script read the TS list) to avoid divergence.
@copilot open a new pull request to apply changes based on this feedback
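One way to establish a single source of truth, sketched in Python: the script writes one canonical JSON file that the TypeScript backend can read or import at build time. The path `shared/ashby_slugs.json` and the trimmed `SLUGS` list here are illustrative, not from the PR:

```python
# Sketch: emit one canonical slug file consumed by both the Python script
# and the backend, so the two lists cannot drift apart.
import json
from pathlib import Path

SLUGS = ['ramp', 'notion', 'airtable']  # trimmed example; real list is larger


def write_canonical(path: str) -> None:
    """Write the deduplicated, sorted slug list as JSON."""
    Path(path).write_text(json.dumps(sorted(set(SLUGS)), indent=2) + '\n')
```

On the TypeScript side, `resolveCompanies` could then load this JSON (e.g. via `import slugs from '../shared/ashby_slugs.json'` with `resolveJsonModule` enabled) instead of keeping its own copy.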
```ts
source === 'theirstack'
  ? {
      companies,
      keywords,
```
For source=theirstack, when no keywords query param is provided you pass keywords=[] and the fetcher falls back to config.ashbyKeywords, but the response still echoes keywords: []. This makes the API response misleading vs. the actual behavior. Consider omitting keywords from filters when not provided, or returning something like keywords: 'default' / the resolved keyword list.
Suggested change:

```diff
-            keywords,
+            keywords: request.query.keywords ? keywords : 'default',
```
```ts
const response = await fetch(url.toString(), {
  method: 'GET',
  headers: {
    Accept: 'application/json',
    'User-Agent': 'JobSync-Service/1.0',
```
The fetch(...) call isn’t wrapped in a try/catch. With the expanded slug pool, a single transient network failure (DNS, timeout, connection reset) will throw and abort the entire Step 1 run instead of logging and moving on to the next company. Consider catching request errors in fetchCompanyJobs and returning [] on failure.
@copilot open a new pull request to apply changes based on this feedback
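A sketch of the suggested per-company guard; `safeFetchCompanyJobs` is a hypothetical wrapper (not the PR's actual `fetchCompanyJobs`) that returns an empty list on any network or parse failure:

```typescript
// Sketch: catch transient network errors per company so a single DNS failure,
// timeout, or connection reset does not abort the whole Step 1 run.
async function safeFetchCompanyJobs(url: string, slug: string): Promise<unknown[]> {
  try {
    const response = await fetch(url, {
      headers: { Accept: 'application/json', 'User-Agent': 'JobSync-Service/1.0' },
    });
    if (!response.ok) return []; // non-2xx: skip this company
    const result = (await response.json()) as { jobs?: unknown[] };
    return result.jobs ?? [];
  } catch (err) {
    // Log and continue with the next slug rather than rethrowing.
    console.warn(`Fetch failed for ${slug}, continuing`, err);
    return [];
  }
}
```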
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

@Mayank-glitch-cpu I've opened a new pull request, #3, to work on those changes. Once the pull request is ready, I'll request review from you.

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

@Mayank-glitch-cpu I've opened a new pull request, #4, to work on those changes. Once the pull request is ready, I'll request review from you.

@Mayank-glitch-cpu I've opened a new pull request, #5, to work on those changes. Once the pull request is ready, I'll request review from you.

@Mayank-glitch-cpu I've opened a new pull request, #6, to work on those changes. Once the pull request is ready, I'll request review from you.

@Mayank-glitch-cpu I've opened a new pull request, #7, to work on those changes. Once the pull request is ready, I'll request review from you.

@Mayank-glitch-cpu I've opened a new pull request, #8, to work on those changes. Once the pull request is ready, I'll request review from you.
…ilures Co-authored-by: Mayank-glitch-cpu <85422185+Mayank-glitch-cpu@users.noreply.github.com>
…work Wrap fetch() in try/catch to handle network failures gracefully
Motivation
- Ingest jobs directly from `jobs.ashbyhq.com`.
- Allow per-request filtering (`companies`, `keywords`, `postedToday`, `publishedWithinHours`) instead of only using a static built-in set.

Description

- Expanded the `ASHBY_COMPANY_SLUGS` array and updated the fetcher to use the Ashby public posting API (`backend/src/services/theirstack-fetcher.ts`), adding parsing, normalization, keyword/date filters, request delays, and dynamic name generation via `slugToCompanyName` and `resolveCompanies`.
- Added `FetchAshbyOptions` support and refactored `fetchTheirStack` to accept `companySlugs`, `keywords`, `postedTodayOnly`, `publishedWithinHours`, `limit`, and `range`, and to return deduplicated, filtered `RawJob` objects.
- Updated the Step 1 route to parse the Ashby filter query params (`companies`, `keywords`, `postedToday`, `publishedWithinHours`, `limit`) and forward them into `fetchTheirStack` (`backend/src/routes/pipeline.ts`).
- Extended `backend/src/config.ts` to support `ASHBY_KEYWORDS`, `ASHBY_PUBLISHED_WITHIN_HOURS`, `ASHBY_INCLUDE_COMPENSATION`, and `ASHBY_REQUEST_DELAY_MS`.
- Added `scripts/ashby_slugs_verified.py`, a small CLI that writes curated slug lists to disk and can optionally verify each slug by checking `https://jobs.ashbyhq.com/<slug>/jobs.json`.
- Updated `README.md` with usage notes for the Ashby smart-ingestion request parameters and instructions for the slug verification utility.

Testing

- Ran `pnpm --filter jobslist-backend build`; the build completed successfully.
- Ran `python scripts/ashby_slugs_verified.py --verify`; the script ran and generated outputs, but network egress to `jobs.ashbyhq.com` is blocked in this environment, so verification returned `0/83` verified (no reachable endpoints).