
feat: dynamic sitemap with page router, add daily cron refresh and google console submission #355

Closed
amaan-bhati wants to merge 37 commits into main from dynamic-sitemap-update

Conversation

amaan-bhati (Member) commented Mar 31, 2026

feat: Dynamic Sitemap Generation with Cron Refresh, Snapshot Fallback, and Google Search Console Auto-Submission

The blog previously served a static, manually maintained sitemap.xml; new posts, authors, and tags were never reflected unless someone updated the file by hand. This PR replaces it with a fully automated sitemap pipeline at https://keploy.io/blog/sitemap.xml that crawls WordPress GraphQL, refreshes daily via Vercel cron, maintains a three-tier snapshot fallback so crawlers always receive valid XML during outages, and auto-submits to Google Search Console after every successful refresh.

Iteration History

Multiple rounds of GitHub Copilot code-review feedback were addressed across iterations: cursor-pagination infinite-loop guard, CSP regex exclusion for the sitemap route, CRON_SECRET misconfiguration handling (500 vs 401 distinction), no-store on 503, /tmp snapshot edge cases, TypeScript strict null guards, empty tag-slug guard, and maxDuration tuning.

File and Folder Structure

blog-website/
├── lib/
│   ├── sitemap.ts                            # Core: paginator, retry, entry builders, fallback, serialization
│   └── google-search-console.ts             # GSC OAuth 2.0 JWT flow + sitemap submission (no Google SDK)
├── pages/
│   ├── sitemap.xml.ts                        # SSR route serving /blog/sitemap.xml (maxDuration: 60)
│   └── api/cron/
│       └── refresh-sitemap.ts               # Cron endpoint: auth guard, refresh, GSC submit (maxDuration: 300)
├── scripts/
│   └── submit-sitemap-to-search-console.mjs # Standalone script to verify GSC credentials locally
├── tests/e2e/
│   ├── Sitemap.spec.ts                       # E2E: HTTP 200, Content-Type, Cache-Control, XML structure
│   └── RefreshSitemapCron.spec.ts            # E2E: 401/405/200 auth + method guards, response shape
└── vercel.json                               # Cron schedule 0 0 * * *, cache headers, CSP exclusions

WordPress GraphQL Paginator

fetchAllPosts() paginates with first: 50, after: cursor ordered by modified DESC.

  • 25s timeout per request (AbortSignal.timeout)
  • 6 retry attempts, linear backoff: 2000ms × attempt (2s → 12s, max 42s total)
  • 250ms settle delay between pages to reduce WPGraphQL pressure
  • Retryable: 408, 429, 500–504, AbortError, TypeError, network failures, and GraphQL-level errors on 200 OK (WPGraphQL returns these during plugin reload / DB lock)
  • Non-retryable: other 4xx, missing data field
  • Cursor guard: if WordPress returns hasNextPage: true with no endCursor, throws immediately to prevent infinite page-1 re-fetch
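The paginator loop above can be sketched as follows. This is an illustrative reconstruction, not the PR's actual code: `fetchAllNodes` and the injected `fetchPage` stand in for `fetchAllPosts()`/`fetchGraphQL()` (whose exact signatures aren't shown in this description), but the linear backoff and the `hasNextPage`-without-`endCursor` guard mirror the behavior described.

```typescript
// Sketch of the cursor-pagination loop with the infinite-loop guard and
// linear-backoff retry. `fetchPage` is injected so the control flow can be
// shown without a live WPGraphQL endpoint (an assumption of this sketch).
type Page<T> = { nodes: T[]; hasNextPage: boolean; endCursor: string | null };

const RETRIES = 6;
const BACKOFF_MS = 2000; // delay = 2000ms * attempt (2s, 4s, ... 12s; 42s max total)

async function fetchWithRetry<T>(
  fetchPage: (cursor: string | null) => Promise<Page<T>>,
  cursor: string | null,
): Promise<Page<T>> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= RETRIES; attempt++) {
    try {
      return await fetchPage(cursor);
    } catch (err) {
      lastError = err;
      if (attempt < RETRIES) {
        await new Promise((r) => setTimeout(r, BACKOFF_MS * attempt));
      }
    }
  }
  throw lastError;
}

async function fetchAllNodes<T>(
  fetchPage: (cursor: string | null) => Promise<Page<T>>,
): Promise<T[]> {
  const all: T[] = [];
  let cursor: string | null = null;
  for (;;) {
    const page = await fetchWithRetry(fetchPage, cursor);
    all.push(...page.nodes);
    if (!page.hasNextPage) break;
    // Guard: hasNextPage=true with no cursor would re-fetch page 1 forever.
    if (!page.endCursor) {
      throw new Error("WPGraphQL returned hasNextPage=true without endCursor");
    }
    cursor = page.endCursor;
  }
  return all;
}
```

The real implementation additionally classifies errors as retryable vs non-retryable and adds a 250ms settle delay between pages; those details are omitted here for brevity.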

Category-to-Route Mapping

mapCategoriesToRoutes() maps WP categories to "technology" or "community" by matching both slug and name (lowercased) to handle editorial inconsistencies. Posts with no matching category are excluded.
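A minimal sketch of that mapping, under the assumption (from the description above) that the only two routes are "technology" and "community" and that matching is case-insensitive on both slug and name:

```typescript
// Match on both slug and lowercased name so editorial variants
// ("Technology" vs "technology") land on the same route. Posts matching
// neither route return null and are excluded from the sitemap.
type Category = { slug: string; name: string };

const ROUTES = ["technology", "community"] as const;
type Route = (typeof ROUTES)[number];

function mapCategoriesToRoute(categories: Category[]): Route | null {
  for (const route of ROUTES) {
    const match = categories.some(
      (c) => c.slug.toLowerCase() === route || c.name.toLowerCase() === route,
    );
    if (match) return route;
  }
  return null; // excluded from the sitemap
}
```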

Entry Builders

Posts: priority: 0.8 if modified in last 30 days, 0.5 otherwise; changefreq: weekly; slug is encodeURIComponent-encoded.

Authors: lastmod = newest post by that author; priority: 0.7.

Tags: one entry per unique tag from included posts; tag display names sanitized to URL slugs via sanitizeStringForURL(); empty/whitespace tags skipped; priority: 0.7.

Static routes: 7 hardcoded entries; lastmod set to the newest post modified time so listing pages reflect the freshest content.
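The recency rule for post priority and the newest-post lastmod for static routes can be sketched like this (helper names here are illustrative, not the PR's actual identifiers):

```typescript
// Recency-based priority as described above: 0.8 if modified within the
// last 30 days, 0.5 otherwise. `now` is injectable so the rule is testable.
const THIRTY_DAYS_MS = 30 * 24 * 60 * 60 * 1000;

function postPriority(modifiedISO: string, now: Date = new Date()): number {
  const age = now.getTime() - new Date(modifiedISO).getTime();
  return age <= THIRTY_DAYS_MS ? 0.8 : 0.5;
}

// Static routes borrow the newest post's modified time as lastmod.
// ISO-8601 strings in a uniform format compare correctly lexicographically.
function newestModified(modifiedISOs: string[]): string | null {
  if (modifiedISOs.length === 0) return null;
  return modifiedISOs.reduce((a, b) => (a > b ? a : b));
}
```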

XML Serialization

Manual generation: no external library. All values passed through escapeXml() (&, ", ', <, >). Priority formatted via .toFixed(1). dedupeEntries() uses a Map keyed by URL to eliminate any overlapping entries before serialization.
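A condensed sketch of those three serialization pieces — the five-character escape, Map-keyed dedupe, and `.toFixed(1)` priority formatting. The entry shape is assumed from the sitemap protocol; the real builders carry more fields:

```typescript
// Minimal versions of the serialization helpers described above.
type Entry = { loc: string; lastmod: string; changefreq: string; priority: number };

function escapeXml(value: string): string {
  return value
    .replace(/&/g, "&amp;") // must run first so later entities aren't re-escaped
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&apos;");
}

function dedupeEntries(entries: Entry[]): Entry[] {
  // Map keyed by URL: duplicates collapse to one entry, last write wins.
  return [...new Map(entries.map((e) => [e.loc, e])).values()];
}

function serialize(entries: Entry[]): string {
  const urls = dedupeEntries(entries)
    .map(
      (e) =>
        `  <url><loc>${escapeXml(e.loc)}</loc><lastmod>${escapeXml(e.lastmod)}</lastmod>` +
        `<changefreq>${e.changefreq}</changefreq><priority>${e.priority.toFixed(1)}</priority></url>`,
    )
    .join("\n");
  return `<?xml version="1.0" encoding="UTF-8"?>\n<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n${urls}\n</urlset>`;
}
```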

Three-Tier Snapshot Fallback

When fresh generation fails:

  • Tier 1: In-memory (lastSuccessfulSitemapXml): updated after every successful refresh; instant, no I/O; survives within the same Lambda instance.
  • Tier 2: /tmp file (/tmp/keploy-blog-sitemap.xml): written after every successful refresh; validated by isValidSitemapXml() (checks XML declaration, <urlset> namespace, closing tag) before use.
  • Tier 3: Static-only fallback (getStaticFallbackXml()): 7 hardcoded routes returned with HTTP 503 + Cache-Control: no-store — never cached by edge, enabling immediate recovery once WordPress is back.

How /tmp works: Vercel exposes a writable /tmp directory (up to 500 MB) per serverless function instance, scoped to that instance's lifetime. It is not a shared filesystem — different instances have independent /tmp directories. Source: Vercel Functions: Runtimes.
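The Tier-2 write/read path can be sketched as below. The snapshot filename comes from the PR; the exact validation checks are assumed from the description (XML declaration, urlset namespace, closing tag):

```typescript
import { writeFileSync, readFileSync, existsSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Persist to the instance-local /tmp after a successful refresh, and
// validate before trusting the file on the read path.
const SNAPSHOT_PATH = join(tmpdir(), "keploy-blog-sitemap.xml");

function isValidSitemapXml(xml: string): boolean {
  return (
    xml.startsWith("<?xml") &&
    xml.includes('<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"') &&
    xml.trimEnd().endsWith("</urlset>")
  );
}

function persistSnapshot(xml: string): void {
  writeFileSync(SNAPSHOT_PATH, xml, "utf8");
}

function readSnapshot(): string | null {
  if (!existsSync(SNAPSHOT_PATH)) return null;
  const xml = readFileSync(SNAPSHOT_PATH, "utf8");
  // A truncated write or corrupted file falls through to the static tier.
  return isValidSitemapXml(xml) ? xml : null;
}
```

Because each serverless instance has its own /tmp, a snapshot written by one instance is invisible to its siblings; this tier only helps warm instances that have already refreshed at least once.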

How in-memory state works: Node.js module-level variables (like lastSuccessfulSitemapXml) persist across multiple requests handled by the same warm instance. When Vercel reuses an existing Lambda container for a new invocation (warm start), the module is not re-evaluated; variable state is retained. On a cold start or a new instance, the module is re-evaluated and the variable resets to null. Source: Vercel Functions: Concepts, Vercel: Improving Cold Start Performance.

Concurrency Guard

refreshSitemapPromise is a module-level deduplication guard. Concurrent callers share one in-flight crawl rather than each triggering an independent WordPress fetch. Cleared in .finally() so the next call after resolution starts fresh.

Cron Endpoint Security

Auth is checked before the method check — this prevents leaking valid HTTP methods to unauthenticated callers.

  • GET only: 405 Method Not Allowed with an Allow: GET header for anything else
  • Missing CRON_SECRET → 500 (misconfiguration), not 401 (wrong token): distinguishes a deployment error from an auth failure
  • Google submission is non-blocking: a GSC failure still returns 200 ok: true; the sitemap refresh is never coupled to Google's availability

Vercel automatically injects Authorization: Bearer <CRON_SECRET> on every cron invocation when CRON_SECRET is set in project settings. Source: Vercel Cron Jobs: Quickstart, Vercel: Managing Cron Jobs.
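The guard ordering can be shown framework-agnostically (this is a sketch of the logic described above, not the actual Next.js handler):

```typescript
// Guard order: 1) missing CRON_SECRET -> 500 (deployment error),
// 2) bad or absent bearer token -> 401, 3) non-GET -> 405 with Allow: GET.
// Auth runs before the method check, so unauthenticated callers never
// learn which methods are valid.
type GuardResult = { status: number; allowHeader?: string } | null;

function cronGuard(
  method: string,
  authHeader: string | undefined,
  cronSecret: string | undefined,
): GuardResult {
  if (!cronSecret) return { status: 500 }; // misconfiguration, not auth failure
  if (authHeader !== `Bearer ${cronSecret}`) return { status: 401 };
  if (method !== "GET") return { status: 405, allowHeader: "GET" };
  return null; // proceed with refresh + GSC submission
}
```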

Google Search Console Integration

Full OAuth 2.0 service account flow, no third-party Google SDK:

  1. RS256-signed JWT constructed from GOOGLE_SERVICE_ACCOUNT_EMAIL + GOOGLE_SERVICE_ACCOUNT_PRIVATE_KEY using Node.js crypto.createSign('RSA-SHA256'): per RFC 7518 (JWA) and RFC 7519 (JWT)
  2. JWT exchanged for OAuth access token via grant_type: urn:ietf:params:oauth:grant-type:jwt-bearer at oauth2.googleapis.com/token: per RFC 7523 and Google OAuth 2.0 Service Account docs
  3. PUT to googleapis.com/webmasters/v3/sites/{siteUrl}/sitemaps/{sitemapUrl}: Search Console Sitemaps API
  4. Private key \\n sequences replaced with real newlines before signing: Vercel stores multi-line env vars with literal \n; crypto.createSign requires a valid PEM with real line breaks. Source: Node.js crypto docs

Entirely optional: isSearchConsoleSubmissionConfigured() checks all three required env vars; if any one is missing, the step is skipped with skipped: true in the response.
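Step 1 of the flow (the RS256 JWT) can be sketched with Node's crypto alone. The claim names follow RFC 7519 and Google's service-account docs; the function name and structure here are illustrative, and the `\n` fix-up mirrors how Vercel stores multi-line env vars:

```typescript
import { createSign } from "node:crypto";

// Build an RS256 service-account JWT without a Google SDK.
function base64url(input: string): string {
  return Buffer.from(input).toString("base64url");
}

function createServiceAccountJwt(
  email: string,
  privateKeyPem: string,
  scope = "https://www.googleapis.com/auth/webmasters",
): string {
  // Vercel stores multi-line env vars with literal \n; PEM needs real newlines.
  const pem = privateKeyPem.replace(/\\n/g, "\n");
  const now = Math.floor(Date.now() / 1000);
  const header = base64url(JSON.stringify({ alg: "RS256", typ: "JWT" }));
  const payload = base64url(
    JSON.stringify({
      iss: email,
      scope,
      aud: "https://oauth2.googleapis.com/token",
      iat: now,
      exp: now + 3600,
    }),
  );
  const signer = createSign("RSA-SHA256");
  signer.update(`${header}.${payload}`);
  const signature = signer.sign(pem).toString("base64url");
  return `${header}.${payload}.${signature}`;
}
```

The resulting JWT is then POSTed to oauth2.googleapis.com/token with the jwt-bearer grant type to obtain the access token used for the Sitemaps API PUT (steps 2 and 3 above).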

Cache-Control Strategy

| Response | Cache-Control |
| --- | --- |
| 200 sitemap | public, max-age=0, s-maxage=86400, stale-while-revalidate=86400 |
| 503 static fallback | no-store |
| /blog/api/(.*) | no-store |
| /_next/static/(.*) | public, max-age=31536000, immutable |
| All other /blog/ pages | public, max-age=3600, s-maxage=86400, stale-while-revalidate=604800 |

The sitemap is excluded from CSP headers in both vercel.json and next.config.js via regex; XML responses have no use for CSP.

Build-Time Validation

next.config.js calls URL.canParse(process.env.WORDPRESS_API_URL) at build time: the build fails with a clear error if the variable is missing or invalid, preventing a silently misconfigured deploy (as suggested and improved by Copilot across multiple review rounds).
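A minimal version of that guard (the function name is illustrative; the real check lives inline in next.config.js). URL.canParse requires Node 18.17+:

```typescript
// Fail fast at build time if WORDPRESS_API_URL is missing or unparseable,
// instead of deploying a sitemap that can never crawl.
function assertWordpressUrl(value: string | undefined): string {
  if (!value || !URL.canParse(value)) {
    throw new Error(
      "WORDPRESS_API_URL is missing or not a valid URL. " +
        "Set it in the Vercel project environment variables before building.",
    );
  }
  return value;
}
```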

Edge Cases Covered

| Edge case | How it is handled |
| --- | --- |
| WordPress completely unreachable | Three-tier fallback → in-memory → /tmp → static 503 |
| WordPress returns partial data | assertFullSitemap() throws if < 5 posts per category |
| hasNextPage: true but no cursor | Throws immediately; prevents infinite page-1 re-fetch |
| GraphQL errors on 200 OK | Treated as retryable; up to 6 attempts |
| Non-retryable 4xx | Fails fast without exhausting the retry budget |
| Concurrent refresh requests | Shared refreshSitemapPromise — one crawl, all callers wait |
| Corrupted /tmp snapshot | isValidSitemapXml() rejects it; falls back to the static fallback |
| CRON_SECRET not set | Returns 500 (misconfiguration), not 401 |
| GSC credentials wrong / Google down | Caught; cron still returns 200 ok: true |
| Post with no matching category | Silently excluded from the sitemap |
| Author name normalizes to empty slug | Skipped; no empty /authors/ URL emitted |
| Tag name empty or normalizes to empty slug | Skipped via flatMap + early-return guard |
| XML-sensitive chars in URLs/dates | escapeXml() escapes &, ", ', <, > |
| Duplicate URLs across generation paths | dedupeEntries() Map-based deduplication before serialization |
| 503 cached by edge | no-store on 503; edge retries origin on next request |
| Private key stored with literal \n | .replace(/\\n/g, "\n") before JWT signing |

Testing

E2E Tests (Playwright)

tests/e2e/Sitemap.spec.ts: GET /sitemap.xml → HTTP 200, Content-Type: application/xml, Cache-Control has s-maxage=86400 + max-age=0 + stale-while-revalidate=86400, valid XML declaration, correct <urlset> namespace, core <loc> entries present.

tests/e2e/RefreshSitemapCron.spec.ts: no auth → 401; wrong token → 401; POST with valid token → 405 + Allow: GET; GET with valid token → 200, ok: true, entryCount > 0, generatedAt as string, searchConsole.submitted as boolean.

Live Verification Results

| Check | Result |
| --- | --- |
| HTTP status | 200 |
| XML parse (Python xml.etree) | PASS — well-formed |
| Total URLs | 1,561 (0 duplicates) |
| All entries have <lastmod> | 1,561 / 1,561 |
| Static / Technology / Community / Authors / Tags | 7 / 37 / 457 / 86 / 974 |
| /tmp snapshot written | YES — 267,306 bytes |
| Cron (valid auth) | 200 {ok: true, entryCount: 1561, searchConsole.submitted: true} |
| Cron (wrong secret / no auth / POST) | 401 / 401 / 405 |
| GSC submission | SUBMITTED to sc-domain:keploy.io |
| End-to-end crawl time | ~13 seconds |

Priority distribution: 1.0 × 1, 0.9 × 2, 0.8 × 84, 0.7 × 1060, 0.6 × 2, 0.5 × 412.

Local GSC Verification Script

(This script was generated with Claude for testing; the verification steps completed successfully.)

node scripts/submit-sitemap-to-search-console.mjs

Mirrors the production JWT + submission flow; loads from .env.local; exits code 1 with structured JSON error on failure. Use this to verify credentials before deploy without running the full app.

…d failure

Signed-off-by: amaan-bhati <amaanbhati49@gmail.com>
Copilot AI review requested due to automatic review settings March 31, 2026 14:04
kilo-code-bot Bot commented Mar 31, 2026

Code Review Summary

Status: 4 Issues Found | Recommendation: Address before merge

Overview

| Severity | Count |
| --- | --- |
| CRITICAL | 0 |
| WARNING | 4 |
| SUGGESTION | 0 |

Issue Details

WARNING

| File | Line | Issue |
| --- | --- | --- |
| pages/sitemap.xml.ts | 9 | Missing error handling for generateSitemapXml() — if all fallbacks fail, unhandled errors will cause 500 responses without actionable context |
| lib/sitemap.ts | 443 | Error log in persistSitemapSnapshot() lacks actionable next steps per project guidelines |
| pages/api/cron/refresh-sitemap.ts | 71 | Error log for sitemap refresh failure lacks actionable next steps per project guidelines |
| pages/api/cron/refresh-sitemap.ts | 47 | NEW: Error log for Google Search Console submission failure lacks actionable next steps per project guidelines |
Improvements Since Last Review
  • Google Search Console integration added: New lib/google-search-console.ts implements sitemap submission to Google Search Console using service account JWT authentication
  • Graceful degradation: Search Console submission failure doesn't block the cron response - errors are captured and returned in the response body
  • Configuration check: isSearchConsoleSubmissionConfigured() allows optional Search Console submission based on environment variables
  • Robust retry logic: fetchGraphQL() includes exponential backoff retry with configurable limits (6 retries, 2s delay multiplier, 25s timeout)
  • Fallback mechanism: generateSitemapXml() caches successful sitemaps in memory and persists to /tmp for resilience
  • Validation added: assertFullSitemap() ensures both technology and community posts are present before serving
  • Promise coalescing: refreshSitemapSnapshot() deduplicates concurrent refresh requests using a shared promise
  • XML validation: isValidSitemapXml() validates persisted snapshot before serving stale data
Incremental Review (aff778d..4c2eeb1)

Changes reviewed:

  • lib/google-search-console.ts - NEW Google Search Console sitemap submission module with JWT auth
  • pages/api/cron/refresh-sitemap.ts - Added Search Console submission after sitemap refresh (1 new issue)

New issue introduced: Error log for Search Console submission failure lacks actionable next steps.

Positive observations in new code:

  • Proper service account JWT creation with RS256 signing
  • Good error handling with informative error messages including response body
  • Configuration check before attempting submission prevents unnecessary API calls
  • Failure doesn't break the cron job - gracefully reports submission status in response
  • URL encoding for siteUrl and sitemapUrl in the API call prevents injection issues
Additional Notes

Positive observations:

  • Excellent fallback strategy: in-memory cache → persisted snapshot → throw
  • Good use of proper XML escaping in escapeXml() to prevent injection attacks
  • Appropriate deduplication of sitemap entries
  • Good caching strategy with s-maxage=86400, stale-while-revalidate=86400
  • Sequential fetching with settle delay reduces burst pressure on WPGraphQL
  • Clean separation of concerns with dedicated builder functions
  • Retryable status codes properly identified (408, 429, 500, 502, 503, 504)
  • Promise coalescing prevents duplicate concurrent sitemap generations
  • XML validation ensures corrupted snapshots aren't served
  • Cron authentication properly requires CRON_SECRET environment variable
Files Reviewed (9 files)
  • lib/google-search-console.ts - no issues (new file, well-structured)
  • lib/sitemap.ts - 1 issue (error log lacks actionable next steps)
  • pages/sitemap.xml.ts - 1 issue (missing error handling)
  • pages/api/cron/refresh-sitemap.ts - 2 issues (error logs lack actionable next steps)
  • vercel.json - no issues (added cron config, fixed JSON structure)
  • pages/_document.tsx - no issues (previous review: import reorder + EOF formatting)
  • public/robots.txt - no issues
  • public/sitemap.xml - no issues (deleted static file)
  • package-lock.json - no issues



Reviewed by claude-4.5-opus-20251124 · 205,846 tokens


Copilot AI left a comment


Pull request overview

Implements a dynamic sitemap endpoint for the Next.js pages router (replacing the previously committed static public/sitemap.xml) and updates SEO-related plumbing to reference the new sitemap location.

Changes:

  • Removed the large static public/sitemap.xml and added a server-rendered /sitemap.xml route.
  • Added sitemap generation utilities (lib/sitemap.ts) that aggregate posts/tags/authors into sitemap entries.
  • Updated robots.txt sitemap URL and adjusted Node engine range in package-lock.json to match package.json.

Reviewed changes

Copilot reviewed 5 out of 6 changed files in this pull request and generated 4 comments.

Show a summary per file
| File | Description |
| --- | --- |
| public/sitemap.xml | Removed committed static sitemap in favor of a dynamic endpoint. |
| public/robots.txt | Points crawlers to the new sitemap URL. |
| pages/sitemap.xml.ts | Adds SSR route that emits generated sitemap XML with caching headers. |
| lib/sitemap.ts | Implements sitemap entry collection + XML serialization. |
| pages/_document.tsx | Adds missing next/script import (enables Script usage in Document). |
| package-lock.json | Updates Node engine constraint to >=18. |


@amaan-bhati amaan-bhati requested a review from Copilot April 1, 2026 09:31

Copilot AI left a comment


Pull request overview

Copilot reviewed 4 out of 5 changed files in this pull request and generated 3 comments.



Copilot AI review requested due to automatic review settings April 1, 2026 16:01

Copilot AI left a comment


Pull request overview

Copilot reviewed 7 out of 8 changed files in this pull request and generated 9 comments.



@amaan-bhati amaan-bhati changed the title feat: implement dynamic sitemap with page router and updated wrt build failure feat: implement dynamic sitemap with page router, add daily cron referesh and google console submission Apr 2, 2026
@amaan-bhati amaan-bhati changed the title feat: implement dynamic sitemap with page router, add daily cron referesh and google console submission feat: dynamic sitemap with page router, add daily cron referesh and google console submission Apr 2, 2026

Copilot AI left a comment


Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.

Comments suppressed due to low confidence (1)

vercel.json:59

  • This route sets Cache-Control in getServerSideProps, but this global /blog/(.*) headers rule also sets Cache-Control for /blog/sitemap.xml. Having two different caching directives can result in the sitemap being cached differently than intended (depending on precedence). Consider excluding sitemap.xml from this global rule or defining its caching behavior in one place.
    {
      "source": "/blog/(.*)",
      "headers": [
        {
          "key": "Content-Security-Policy",
          "value": "connect-src 'self' https://px.ads.linkedin.com https://www.google-analytics.com https://analytics.google.com https://region1.google-analytics.com https://stats.g.doubleclick.net https://rp.liadm.com https://idx.liadm.com https://pagead2.googlesyndication.com https://*.clarity.ms https://news.google.com https://assets.apollo.io https://wp.keploy.io https://cdn.hashnode.com https://keploy-websites.vercel.app https://blog-website-phi-eight.vercel.app https://docbot.keploy.io https://www.youtube.com https://youtube.com https://www.youtube-nocookie.com https://*.youtube.com https://*.googlevideo.com https://googleads.g.doubleclick.net https://marketplace.visualstudio.com https://api.github.com https://pro.ip-api.com https://api.vector.co https://aplo-evnt.com https://ep1.adtrafficquality.google https://ppptg.com https://telemetry.keploy.io; frame-src 'self' https://www.googletagmanager.com https://keploy-websites.vercel.app https://blog-website-phi-eight.vercel.app https://docbot.keploy.io https://www.youtube.com https://youtube.com https://www.youtube-nocookie.com https://*.youtube.com https://news.google.com https://googleads.g.doubleclick.net https://*.google.com https://ppptg.com; img-src 'self' https://c.bing.com https://ppptg.com https://pbs.twimg.com https://secure.gravatar.com https://wp.keploy.io https://keploy.io data:;"
        },
        {
          "key": "Cache-Control",
          "value": "public, max-age=3600, s-maxage=86400, stale-while-revalidate=604800"
        }


@amaan-bhati amaan-bhati requested a review from Copilot April 6, 2026 15:01

Copilot AI left a comment


Pull request overview

Copilot reviewed 13 out of 14 changed files in this pull request and generated 2 comments.




Copilot AI left a comment


Pull request overview

Copilot reviewed 13 out of 14 changed files in this pull request and generated no new comments.




amaan-bhati (Member, Author) commented:

  • Addressed all Copilot review comments until none remained.

  • After that, gave Claude a detailed context prompt to test the entire implementation for potential bugs. This was the third such pass; the earlier passes, run once Copilot had no remaining comments, surfaced 4–5 bugs that were then fixed. This time only one tradeoff remained:

    • Edge cache vs cron timing: s-maxage=86400 is based on when the edge first cached the response, not when the cron last ran. Worst case: ~24–48h stale window (TTL + stale-while-revalidate). Acceptable for a blog sitemap; GSC submission ensures Google discovers new posts independently.
  • Tested the rest manually: visited the sitemap locally; the updated sitemap loads instantly (screenshot attached).

  • Ran npm run build; nothing breaks (screenshot attached).

Important precaution

Things we'd need to ensure before deploying to prod:

  • Vercel runs next build; next.config.js calls URL.canParse(process.env.WORDPRESS_API_URL) at build time. If WORDPRESS_API_URL is missing or invalid in the Vercel project env vars, the build will fail here before anything deploys. Check the Vercel dashboard build logs first.

    1. Confirm all env vars are set in Vercel
    • WORDPRESS_API_URL: required, build fails without it
    • CRON_SECRET: required, cron returns 500 without it
    • GOOGLE_SERVICE_ACCOUNT_EMAIL
    • GOOGLE_SERVICE_ACCOUNT_PRIVATE_KEY
    • GOOGLE_SEARCH_CONSOLE_SITE_URL: should be sc-domain:keploy.io
    • SITEMAP_PUBLIC_URL: should be https://keploy.io/blog/sitemap.xml

slayerjain (Member) commented:

Review: Simpler architecture available — ISR replaces most of this code

The WordPress crawling logic, fallback handling, and edge cases are well thought out — good work on the robustness. But the core architecture can be dramatically simplified because Vercel ISR already does what the cron + in-memory cache + /tmp snapshot is trying to do.

The core issue: ~900 lines reimplements ISR

The three-tier fallback (in-memory → /tmp → static) is fundamentally unreliable on Vercel serverless:

  • In-memory (lastSuccessfulSitemapXml) — lost on every cold start, which happens frequently
  • /tmp snapshot — not shared across instances, wiped on every deploy
  • So 2 of 3 fallback tiers fail exactly when you need them most (after deploys or during traffic spikes that spawn new instances)

Vercel's ISR solves all of this natively:

  • Generates at build time → always has a valid cached version
  • Revalidates in background on a timer → no cron needed
  • If WordPress is down during regen → serves stale (valid) version automatically
  • Edge-cached → no serverless function invocation for most requests

Suggested approach: getStaticProps + revalidate

pages/sitemap.xml.ts (~80 lines):

import { GetStaticProps } from "next";
import { getAllPosts, getAllTags, getAllAuthors } from "../lib/api"; // already exists!

export const getStaticProps: GetStaticProps = async () => {
  const [posts, tags, authors] = await Promise.all([
    getAllPosts(),
    getAllTags(),
    getAllAuthors(),
  ]);
  
  const xml = buildSitemapXml(posts, tags, authors);
  return { props: { xml }, revalidate: 3600 }; // regenerate hourly in background
};

pages/api/cron/submit-gsc.ts (~30 lines) — cron ONLY pings GSC, no sitemap regen

What this eliminates

| Current PR | With ISR |
| --- | --- |
| lib/sitemap.ts (733 lines) — custom paginator, retry, fallback tiers | Reuse existing lib/api.ts (already has getAllPosts, getAllTags, getAllAuthors with pagination) |
| getServerSideProps — function runs on every request | getStaticProps + revalidate — static, edge-cached |
| 3-tier fallback (memory → /tmp → static) | Vercel ISR cache is the fallback (stale-while-revalidate built in) |
| Concurrency guard (refreshSitemapPromise) | ISR deduplicates revalidation natively |
| maxDuration: 300 on cron | ISR revalidation runs in background, no long-running function |
| Cron refreshes sitemap + submits GSC | Cron only submits to GSC (~5 lines) |
| ~900 lines across 4 new files | ~110 lines across 2 files |

What to keep

  • The GSC OAuth implementation (lib/google-search-console.ts) is solid — keep it (or simplify with google-auth-library)
  • The XML escaping and dedup logic is fine, just move it into the ISR page
  • The E2E tests are good — adapt them for the simpler route
  • The CSP exclusion for sitemap.xml is correct

The lib/api.ts duplication problem

The blog already has fully paginated queries in lib/api.ts:

  • getAllPosts() — paginated, returns slug, categories, author, date
  • getAllTags() — all tags
  • getAllAuthors() — all authors from posts

The PR rewrites all of these from scratch in lib/sitemap.ts with a parallel fetchGraphQL function, parallel retry logic, and parallel pagination. This means two independent WordPress data layers to maintain. If someone fixes a bug in lib/api.ts, the sitemap won't pick it up and vice versa.

TL;DR

The defensive programming is genuinely good, but it's solving problems that Vercel's platform already solves. Simplify to ISR + reuse lib/api.ts, and the PR drops from ~900 lines to ~110 while being more reliable (edge-cached, platform-managed fallback, no ephemeral /tmp dependency).

Signed-off-by: amaan-bhati <amaanbhati49@gmail.com>
Copilot AI review requested due to automatic review settings April 10, 2026 11:47

Copilot AI left a comment


Pull request overview

Copilot reviewed 23 out of 25 changed files in this pull request and generated 5 comments.



Comment thread tests/mock-server.js
return technologyPosts;
}

if (query.includes('query SitemapPosts')) {

Copilot AI Apr 10, 2026


handleGraphQL never returns sitemapPostsResponse for the sitemap crawl: the sitemap route calls getAllPosts() which issues query AllPosts(...), but the mock server only checks for query SitemapPosts. As a result the sitemap request gets technologyPosts only (via the final if (query.includes('AllPosts'))), causing assertFullSitemap to fail (community count = 0) and the sitemap endpoint to fall back to 503 in e2e runs. Update the mock routing to return sitemapPostsResponse for the actual sitemap query shape (e.g., detect the AllPosts query without tagName, or detect the presence of sitemap-specific fields like modified/pageInfo).

Suggested change:

- if (query.includes('query SitemapPosts')) {
+ if (
+   query.includes('query SitemapPosts') ||
+   (
+     query.includes('AllPosts') &&
+     !query.includes('AllPostsForCategory') &&
+     !query.includes('tagName') &&
+     (query.includes('modified') || query.includes('pageInfo'))
+   )
+ ) {

Comment thread app/sitemap.xml/route.ts
Comment on lines +31 to +72
export const revalidate = 3600;

export async function GET(): Promise<Response> {
  try {
    // reuse the existing getAllPosts() paginator from lib/api.ts.
    // as of the pagination fix, this fetches ALL posts (not just the first 50).
    const allPostsResult = await getAllPosts();

    // convert getAllPosts() return shape into SitemapPost[] for the entry builders.
    const posts = adaptPostsForSitemap(allPostsResult);

    // reject partial wordpress responses before they replace a good cached version.
    // throws if fewer than 5 posts per category; ISR will not cache a thrown error,
    // so Vercel keeps serving the previous good cached version automatically.
    assertFullSitemap(posts);

    // static routes get lastmod = newest post modification time,
    // so listing pages reflect when the freshest underlying content changed.
    const latestModified = getLatestModified(posts) ?? new Date().toISOString();
    const staticEntries = STATIC_ROUTES.map((r) => ({
      ...r,
      lastModified: latestModified,
    }));

    const entries = dedupeEntries([
      ...staticEntries,
      ...buildPostEntries(posts),
      ...buildAuthorEntries(posts),
      ...buildTagEntries(posts),
    ]);

    const xml = serializeSitemap(entries);

    return new Response(xml, {
      status: 200,
      headers: {
        "Content-Type": "application/xml",
        // s-maxage instructs Vercel's CDN to cache for 1h (matches revalidate above).
        // stale-while-revalidate lets the CDN serve stale while regenerating in background.
        // max-age=0 ensures browsers always revalidate with the CDN rather than caching locally.
        "Cache-Control": "public, max-age=0, s-maxage=3600, stale-while-revalidate=3600",
      },

Copilot AI Apr 10, 2026


The sitemap route sets Cache-Control to s-maxage=3600 / stale-while-revalidate=3600, but this PR’s documented cache strategy and the new Playwright sitemap test assert s-maxage=86400 / stale-while-revalidate=86400. Align the implementation with the intended caching (either change revalidate + Cache-Control to 86400, or update the tests/docs to expect 3600) so CI and behavior are consistent.

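If the documented daily policy is the intended one, one way to keep the header and the tests aligned is to build the `Cache-Control` value from a single TTL constant. A minimal sketch under that assumption (the constant name is illustrative; note Next.js requires `revalidate` itself to be a statically analyzable literal, so it must stay duplicated as `export const revalidate = 86400;` rather than derived from the constant):

```typescript
// Single source of truth for the CDN TTL. The route's
// `export const revalidate = 86400;` must remain a literal; keeping it equal
// to this constant is a convention, not enforced by Next.js.
const SITEMAP_TTL_SECONDS = 86_400; // daily, matching the cron cadence

const cacheControl =
  `public, max-age=0, s-maxage=${SITEMAP_TTL_SECONDS}, ` +
  `stale-while-revalidate=${SITEMAP_TTL_SECONDS}`;
```

The Playwright assertions could interpolate the same constant, so a future TTL change only touches one place.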
Comment thread tests/e2e/Sitemap.spec.ts
Comment on lines +9 to +11
expect(response.headers()['cache-control']).toContain('s-maxage=86400');
expect(response.headers()['cache-control']).toContain('max-age=0');
expect(response.headers()['cache-control']).toContain('stale-while-revalidate=86400');

Copilot AI Apr 10, 2026


This test asserts cache-control contains s-maxage=86400 and stale-while-revalidate=86400, but app/sitemap.xml/route.ts currently returns s-maxage=3600 / stale-while-revalidate=3600 and there is no vercel.json override for /blog/sitemap.xml. Update the expected values (or adjust the route’s headers/revalidate) so the test matches actual sitemap caching behavior.

Suggested change
- expect(response.headers()['cache-control']).toContain('s-maxage=86400');
- expect(response.headers()['cache-control']).toContain('max-age=0');
- expect(response.headers()['cache-control']).toContain('stale-while-revalidate=86400');
+ expect(response.headers()['cache-control']).toContain('s-maxage=3600');
+ expect(response.headers()['cache-control']).toContain('max-age=0');
+ expect(response.headers()['cache-control']).toContain('stale-while-revalidate=3600');

Comment on lines +7 to +60
// GSC submission is fast — no WordPress crawl happens here anymore.
// Sitemap generation is handled by ISR in app/sitemap.xml/route.ts.
export const config = { maxDuration: 30 };

export default async function handler(req: NextApiRequest, res: NextApiResponse) {
  const expectedSecret = process.env.CRON_SECRET;

  // distinguish a deployment misconfiguration (500) from a wrong token (401).
  if (!expectedSecret) {
    console.error(
      "CRON_SECRET is not configured. Set it in Vercel environment variables and redeploy."
    );
    return res.status(500).json({
      ok: false,
      message: "Server misconfiguration — CRON_SECRET is not configured",
    });
  }

  // auth is checked before method to avoid leaking valid HTTP methods to
  // unauthenticated callers. vercel cron automatically injects this header.
  if (req.headers.authorization !== `Bearer ${expectedSecret}`) {
    return res.status(401).json({ ok: false, message: "Unauthorized" });
  }

  if (req.method !== "GET") {
    res.setHeader("Allow", "GET");
    return res.status(405).json({ ok: false, message: "Method not allowed" });
  }

  // skip silently if google search console env vars are not all configured.
  if (!isSearchConsoleSubmissionConfigured()) {
    return res.status(200).json({
      ok: true,
      message: "Google Search Console submission is not configured — skipped",
    });
  }

  try {
    // notify google that the sitemap has been updated so it re-crawls it.
    // the sitemap itself is generated and cached by ISR — no crawl needed here.
    const result = await submitSitemapToSearchConsole();
    return res.status(200).json({ ok: true, ...result });
  } catch (error) {
    console.error(
      "Google Search Console sitemap submission failed. " +
        "Verify GOOGLE_SERVICE_ACCOUNT_EMAIL, GOOGLE_SERVICE_ACCOUNT_PRIVATE_KEY, " +
        "GOOGLE_SEARCH_CONSOLE_SITE_URL, and Search Console property access for the service account.",
      error
    );
    return res.status(500).json({
      ok: false,
      message:
        error instanceof Error ? error.message : "Google Search Console submission failed",
    });

Copilot AI Apr 10, 2026


The cron handler no longer performs a sitemap refresh/warm (it only submits to Google) and its success payload is { ok: true, siteUrl, sitemapUrl, submittedAt }, but the PR description + added e2e test expect refresh metadata like entryCount, generatedAt, and searchConsole.submitted, and also state that Google failures should be non-blocking. Consider (1) triggering a fetch of /sitemap.xml (using the incoming request host/proto) to ensure the sitemap is regenerated/warmed before submission, (2) returning the documented metadata in the JSON response, and (3) on GSC failure returning 200 ok: true with a searchConsole error payload instead of a 500 so cron refresh isn’t coupled to Google availability.

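Point (3) of the review — decoupling cron success from Google availability — could be sketched as a small wrapper that folds the GSC outcome into the response payload instead of failing the request. This is illustrative, not the PR's actual code; `submitSitemapToSearchConsole` is the PR's helper, and the wrapper name and payload shape are assumptions:

```typescript
// Wrap the GSC call so a Google outage degrades to a reported failure
// in the JSON payload rather than a 500 for the whole cron run.
type SearchConsoleOutcome =
  | { submitted: true; submittedAt: string }
  | { submitted: false; error: string };

async function submitNonBlocking(
  submit: () => Promise<{ submittedAt: string }>
): Promise<SearchConsoleOutcome> {
  try {
    const { submittedAt } = await submit();
    return { submitted: true, submittedAt };
  } catch (error) {
    // Google being down should not mark the cron run itself as failed.
    return {
      submitted: false,
      error: error instanceof Error ? error.message : "GSC submission failed",
    };
  }
}

// In the handler (sketch):
// const searchConsole = await submitNonBlocking(submitSitemapToSearchConsole);
// return res.status(200).json({ ok: true, generatedAt: new Date().toISOString(), searchConsole });
```

This shape would also satisfy the e2e test's `searchConsole.submitted` boolean assertion in both the success and failure paths.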
Comment thread tests/e2e/RefreshSitemapCron.spec.ts
Signed-off-by: amaan-bhati <amaanbhati49@gmail.com>
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 24 out of 26 changed files in this pull request and generated 5 comments.

Comments suppressed due to low confidence (1)

vercel.json:66

  • /blog/sitemap.xml is excluded from the /blog/* catch-all headers rule, but there’s no dedicated headers rule for the sitemap in vercel.json. If the intended Cache-Control policy is the documented s-maxage=86400 (daily edge cache), add an explicit /blog/sitemap.xml headers entry (or update the documentation/tests to match the route handler’s 1h policy).
    {
      "source": "/blog/((?!(?:sitemap\\.xml$|api/|_next/static/)).*)",
      "headers": [
        {
          "key": "Content-Security-Policy",
          "value": "connect-src 'self' https://px.ads.linkedin.com https://www.google-analytics.com https://analytics.google.com https://region1.google-analytics.com https://stats.g.doubleclick.net https://rp.liadm.com https://idx.liadm.com https://pagead2.googlesyndication.com https://*.clarity.ms https://news.google.com https://assets.apollo.io https://wp.keploy.io https://cdn.hashnode.com https://keploy-websites.vercel.app https://blog-website-phi-eight.vercel.app https://docbot.keploy.io https://www.youtube.com https://youtube.com https://www.youtube-nocookie.com https://*.youtube.com https://*.googlevideo.com https://googleads.g.doubleclick.net https://marketplace.visualstudio.com https://api.github.com https://pro.ip-api.com https://api.vector.co https://aplo-evnt.com https://ep1.adtrafficquality.google https://ppptg.com https://telemetry.keploy.io; frame-src 'self' https://www.googletagmanager.com https://keploy-websites.vercel.app https://blog-website-phi-eight.vercel.app https://docbot.keploy.io https://www.youtube.com https://youtube.com https://www.youtube-nocookie.com https://*.youtube.com https://news.google.com https://googleads.g.doubleclick.net https://*.google.com https://ppptg.com; img-src 'self' https://c.bing.com https://ppptg.com https://pbs.twimg.com https://secure.gravatar.com https://wp.keploy.io https://keploy.io data:;"
        },
        {
          "key": "Cache-Control",
          "value": "public, max-age=3600, s-maxage=86400, stale-while-revalidate=604800"
        }
      ]
    }


Comment thread app/sitemap.xml/route.ts Outdated
Comment on lines +1 to +3
import https from "node:https";
import {
adaptPostsForSitemap,

Copilot AI Apr 10, 2026


fetchGraphQL uses node:https unconditionally, so a WORDPRESS_API_URL like http://localhost:4000/graphql (used by Playwright’s mock server) will fail the TLS handshake and force the sitemap route into the 503 fallback. Parse new URL(apiUrl).protocol and use node:http for http: (or switch to a protocol-agnostic client) so local/dev/test URLs work.

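The protocol-dispatch fix the reviewer suggests could look like the sketch below. The helper names (`clientFor`, `portFor`) are illustrative, not the PR's actual API; the point is only that the transport module and default port are derived from `new URL(apiUrl).protocol`:

```typescript
import http from "node:http";
import https from "node:https";

// Choose the transport from the URL's protocol so http://localhost mock
// servers (Playwright) work alongside production HTTPS WordPress.
function clientFor(url: URL): typeof http | typeof https {
  return url.protocol === "http:" ? http : https;
}

// An explicit port wins; otherwise fall back to the protocol default.
function portFor(url: URL): number {
  if (url.port) return Number(url.port);
  return url.protocol === "http:" ? 80 : 443;
}

const mock = new URL("http://localhost:4000/graphql");
const prod = new URL("https://wp.keploy.io/graphql");
// clientFor(mock) is node:http; clientFor(prod) is node:https.
```

The existing `https.request(...)` call site would then become `clientFor(url).request(...)` with `port: portFor(url)`.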
Comment thread app/sitemap.xml/route.ts Outdated
Comment on lines +119 to +123
  const edges = data?.posts?.edges ?? [];
  allEdges = [...allEdges, ...edges];
  hasNextPage = data?.posts?.pageInfo?.hasNextPage ?? false;
  endCursor = data?.posts?.pageInfo?.endCursor ?? null;
}

Copilot AI Apr 10, 2026


The pagination loop can become infinite if WPGraphQL returns hasNextPage: true with a missing/null endCursor (the request will keep sending after: null and re-fetch page 1). Add a guard to throw/fail fast when hasNextPage is true but endCursor is falsy (or when the cursor doesn’t advance).

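The guard the reviewer asks for could be sketched as below. `paginateAll` and `fetchPage` are illustrative stand-ins for the route's loop and its GraphQL call, not the PR's actual names; the essential part is the fail-fast check when `hasNextPage` is true but the cursor is missing or has not advanced:

```typescript
type Page<T> = {
  edges: T[];
  pageInfo: { hasNextPage: boolean; endCursor: string | null };
};

async function paginateAll<T>(
  fetchPage: (after: string | null) => Promise<Page<T>>
): Promise<T[]> {
  let allEdges: T[] = [];
  let endCursor: string | null = null;
  let hasNextPage = true;

  while (hasNextPage) {
    const { edges, pageInfo } = await fetchPage(endCursor);
    allEdges = allEdges.concat(edges);
    hasNextPage = pageInfo.hasNextPage;

    // Guard: hasNextPage=true with a falsy or stalled cursor would resend
    // `after: null` (or the same cursor) and re-fetch page 1 forever.
    // Throwing here lets the route fall through to its fallback path.
    if (hasNextPage && (!pageInfo.endCursor || pageInfo.endCursor === endCursor)) {
      throw new Error(
        "Sitemap pagination stalled: hasNextPage=true without an advancing endCursor"
      );
    }
    endCursor = pageInfo.endCursor;
  }
  return allEdges;
}
```

With this shape, a well-behaved WPGraphQL response paginates to completion, while a malformed `pageInfo` fails in one round trip instead of looping until `maxDuration` kills the function.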
Comment thread app/sitemap.xml/route.ts Outdated
Comment on lines +42 to +55
return new Promise((resolve, reject) => {
  const req = https.request(
    {
      hostname: url.hostname,
      port: url.port || 443,
      path: url.pathname + url.search,
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        "Content-Length": Buffer.byteLength(body),
        "User-Agent": "keploy-blog-sitemap/1.0",
      },
    },
    (res) => {

Copilot AI Apr 10, 2026


https.request() here has no overall timeout/abort. If the upstream stalls (socket hang, never-ending response), the route handler can hang until the platform kills it, which prevents clean 503 fallback behavior and ties up concurrency. Add a request/response timeout (and ensure the request is destroyed/aborted) so failures reliably hit the catch block.

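A timeout along the lines the reviewer suggests could be added with `ClientRequest.setTimeout` plus `destroy`, so a stalled socket rejects the promise and hits the route's catch block. This is a hedged sketch, not the PR's code: `postJsonWithTimeout` and `timeoutMs` are illustrative names, and the transport is chosen per protocol so the sketch is self-testable against a local server:

```typescript
import http from "node:http";
import https from "node:https";

function postJsonWithTimeout(
  apiUrl: string,
  body: string,
  timeoutMs = 10_000
): Promise<string> {
  const url = new URL(apiUrl);
  const transport = url.protocol === "http:" ? http : https;

  return new Promise((resolve, reject) => {
    const req = transport.request(
      {
        hostname: url.hostname,
        port: url.port || (url.protocol === "http:" ? 80 : 443),
        path: url.pathname + url.search,
        method: "POST",
        headers: {
          "Content-Type": "application/json",
          "Content-Length": Buffer.byteLength(body),
        },
      },
      (res) => {
        let data = "";
        res.on("data", (chunk) => (data += chunk));
        res.on("end", () => resolve(data));
      }
    );

    // Fires when the socket is idle for timeoutMs (covers a stalled connect
    // or a never-ending response). destroy(err) emits 'error', which rejects.
    req.setTimeout(timeoutMs, () => {
      req.destroy(new Error(`Request to ${url.host} timed out after ${timeoutMs}ms`));
    });
    req.on("error", reject);
    req.write(body);
    req.end();
  });
}
```

Note that `setTimeout` here is an idle-socket timeout, not a hard wall-clock cap; if a slow-drip response must also be bounded, an additional `setTimeout`/`AbortController` around the whole promise would be needed.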
Comment thread app/sitemap.xml/route.ts Outdated
Comment on lines +19 to +21
// If WordPress is down during regen: Vercel keeps serving previous good version automatically.
// Cold-start / first request: mitigated by build-time pre-generation (see scripts/prewarm-sitemap.mjs)
// and post-deploy warming triggered by the Vercel deployment hook in GitHub Actions.

Copilot AI Apr 10, 2026


The header comment mentions build-time pre-generation via scripts/prewarm-sitemap.mjs and a post-deploy warming hook, but there’s no such script in the repo. Update the comment to match the actual warming strategy (or add the referenced script) so future maintainers aren’t misled.

Suggested change
- // If WordPress is down during regen: Vercel keeps serving previous good version automatically.
- // Cold-start / first request: mitigated by build-time pre-generation (see scripts/prewarm-sitemap.mjs)
- // and post-deploy warming triggered by the Vercel deployment hook in GitHub Actions.
+ // If WordPress is down during regen: Vercel keeps serving the previous good version automatically.
+ // The first request after deploy or cache expiry triggers generation; there is no
+ // repository-managed build-time pre-generation or post-deploy warming script here.

Comment on lines +45 to +50
const body = await response.json();
expect(body.ok).toBe(true);
expect(body.entryCount).toBeGreaterThan(0);
expect(typeof body.generatedAt).toBe('string');
expect(typeof body.searchConsole?.submitted).toBe('boolean');
});

Copilot AI Apr 10, 2026


This test expects entryCount, generatedAt, and searchConsole.submitted, but pages/api/cron/refresh-sitemap.ts returns either { ok, message } (skipped) or { ok, siteUrl, sitemapUrl, submittedAt } (submission). Update the assertions (or change the endpoint response shape) so the test matches the real API contract.

Signed-off-by: amaan-bhati <amaanbhati49@gmail.com>
@amaan-bhati
Member Author

Closing this PR since we found a cheaper, more optimised, and faster approach in #374.
