
Fix: export OOM on large crawls — add streaming export endpoint#48

Open
liquidpurple wants to merge 2 commits into PhialsBasement:main from liquidpurple:feature/streaming-export

Conversation

@liquidpurple

The current export flow multiplies the exported data roughly 4× in memory,
which causes OOM kills on memory-constrained instances:

  1. Frontend fetches ALL URLs from /api/crawl_status (full data in browser)
  2. Frontend sends ALL data back in POST body to /api/export_data
  3. Backend generates full export string in memory
  4. Backend wraps in jsonify() JSON envelope
  5. Frontend parses JSON, extracts content, creates Blob

For a crawl of ~2200 URLs (~45MB in SQLite), this consistently OOMs on a
2GB RAM instance with systemd MemoryMax.

This PR adds a streaming GET endpoint (/api/export_stream) that uses Python
generators to yield CSV/JSON/XML rows one at a time via Flask Response.
The browser downloads the file directly — no JSON envelope, no Blob,
no round-trip of all data.
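
A minimal sketch of the generator-based streaming described above (function and
column names here are illustrative, not the PR's actual code): a generator
yields one encoded CSV line per crawl record, and Flask's `Response` streams
those chunks to the browser so the full export string never exists in memory.

```python
import csv
import io

from flask import Flask, Response

app = Flask(__name__)


def iter_csv(rows):
    """Yield a CSV header line, then one encoded CSV line per record."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["url", "status", "content"])  # hypothetical columns
    yield buf.getvalue()
    for row in rows:
        buf.seek(0)
        buf.truncate(0)
        writer.writerow([row["url"], row["status"], row["content"]])
        yield buf.getvalue()


@app.route("/api/export_stream")
def export_stream():
    # fetch_rows_from_sqlite() is a stand-in for lazily iterating the
    # crawl table (e.g. a sqlite3 cursor), so rows are never all in RAM.
    rows = fetch_rows_from_sqlite()
    return Response(
        iter_csv(rows),
        mimetype="text/csv",
        headers={"Content-Disposition": "attachment; filename=export.csv"},
    )
```

Because the response body is a generator, Flask sends each yielded chunk as it
is produced; peak memory stays near one row's size regardless of crawl size.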

The old /api/export_data endpoint is preserved for backward compatibility.

Changes:

  • main.py: new /api/export_stream endpoint + 8 streaming generator functions
  • web/static/js/app.js: exportData() rewritten to use streaming endpoint

