Integrate Modal-hosted Marker and add env-based switch to offload layout extraction from local RQ #157

priyankeshh · 2025-09-25T15:03:40Z

Summary

This PR adds a clean integration path to run Marker on Modal instead of locally via RQ. It introduces an environment-based switch (MARKER_RUN_MODE) and an HTTP client that forwards PDFs to a Modal-hosted Marker service. This lets us use proper GPU/compute and avoid HF Space resource limits. Layout correctness improvements can be iterated later; this PR focuses on wiring and compute offload.

Key changes

ocr_jobs.py
- Loads .env early to pick up configuration.
- Adds a modal execution branch that calls the Modal endpoint for output_format=json and parses the response.
- Logs the selected run mode and endpoint for visibility.
extralit_server/integrations/marker_modal_client.py
- New lightweight client to POST PDFs to Modal /convert, with timeout and basic error handling.
- Reads MARKER_MODAL_BASE_URL and optional MARKER_MODAL_TIMEOUT_SECS from env.
.env.example
- Documents MARKER_RUN_MODE=modal and MARKER_MODAL_BASE_URL required to enable Modal integration.
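
As a rough illustration of the client described above, the sketch below POSTs a PDF to /convert as multipart/form-data. It is not the code from this PR: it uses only the standard library (urllib) to stay dependency-free, whereas a later commit in this PR switches to httpx, and the names build_multipart and convert_pdf are hypothetical. The form field names (file, output_format) are taken from the curl command in the test instructions; the 600-second default timeout and the assumption that the service answers with a JSON body are just that, assumptions.

```python
import json
import urllib.request
import uuid


def build_multipart(fields: dict, filename: str, file_bytes: bytes):
    """Encode plain form fields plus one PDF upload as multipart/form-data.

    Returns (body_bytes, content_type_header_value).
    """
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append(
            (
                f"--{boundary}\r\n"
                f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
                f"{value}\r\n"
            ).encode()
        )
    # The uploaded PDF goes in a part named "file", matching `curl -F "file=@..."`.
    parts.append(
        (
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
            "Content-Type: application/pdf\r\n\r\n"
        ).encode()
        + file_bytes
        + b"\r\n"
    )
    parts.append(f"--{boundary}--\r\n".encode())
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def convert_pdf(base_url: str, pdf_path: str, timeout: float = 600.0) -> dict:
    """POST a PDF to <base_url>/convert and return the decoded JSON response.

    urlopen raises urllib.error.HTTPError for non-2xx responses and
    urllib.error.URLError (an OSError) for connection problems.
    """
    with open(pdf_path, "rb") as fh:
        body, content_type = build_multipart(
            {"output_format": "json"}, "document.pdf", fh.read()
        )
    request = urllib.request.Request(
        base_url.rstrip("/") + "/convert",
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Keeping the body construction separate from the network call makes the encoding testable without a deployed endpoint.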

Configuration

Set the following variables (local .env for dev; real env in prod/worker runtime):
- MARKER_RUN_MODE=modal
- MARKER_MODAL_BASE_URL=https://YOUR-ENDPOINT.modal.run
- Optional: MARKER_MODAL_TIMEOUT_SECS=600
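
One way the variables above could be read and validated is sketched below. The MarkerSettings class and load_marker_settings function are hypothetical names, and the fallbacks (run mode "local", timeout 600 seconds) are assumptions rather than documented defaults. Later commits in this PR move to an OCR_MARKER_MODAL_* naming convention, so the exact variable names may differ in the merged code.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class MarkerSettings:
    run_mode: str                  # "modal" or (assumed default) "local"
    modal_base_url: Optional[str]  # required only when run_mode == "modal"
    modal_timeout_secs: float


def load_marker_settings(env: Mapping[str, str] = os.environ) -> MarkerSettings:
    """Read the Marker switch from an environment-like mapping."""
    run_mode = env.get("MARKER_RUN_MODE", "local").strip().lower()
    base_url = env.get("MARKER_MODAL_BASE_URL", "").strip() or None
    timeout = float(env.get("MARKER_MODAL_TIMEOUT_SECS", "600"))
    # Fail early instead of discovering a missing endpoint at request time.
    if run_mode == "modal" and base_url is None:
        raise ValueError(
            "MARKER_MODAL_BASE_URL is required when MARKER_RUN_MODE=modal"
        )
    return MarkerSettings(run_mode, base_url, timeout)
```

Passing the mapping explicitly (instead of always reading os.environ) keeps the function easy to exercise in tests.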

How to test

Ensure a Modal Marker service is deployed and healthy

Deploy the example from datalab-to/marker (examples) with JSON path fixed
Verify it responds:
- curl https://YOUR-ENDPOINT.modal.run/health
- curl -F "file=@/path/to/sample.pdf" -F "output_format=json" https://YOUR-ENDPOINT.modal.run/convert

Local CLI test (bypassing RQ)

In repo root (so .env is picked up):
- python extralit-server/src/extralit_server/jobs/ocr_jobs.py "/path/to/sample.pdf" --extract-text
Expected logs:
- Starting Marker layout extraction ... (mode=modal)
- Using Modal endpoint: https://YOUR-ENDPOINT.modal.run
Expected output: JSON with tables, figures, text_blocks, metadata.

Checklist

Env-based switch to Modal
New Modal client and wiring in ocr_jobs
.env.example updated
Local CLI test path documented
Optional: add Modal auth header support
Optional: retries/backoff and timeouts tuning
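
For the optional retries/backoff item, a minimal sketch of what such a wrapper might look like follows. It is not part of this PR; call_with_backoff is a hypothetical helper, and the choice to retry only on OSError (which covers urllib.error.URLError, connection errors, and socket timeouts) is an assumption that would need revisiting for an httpx-based client.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_backoff(
    func: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func, retrying on OSError with exponentially growing delays.

    The delay before retry k (0-based) is base_delay * 2**k. The last
    failure is re-raised unchanged.
    """
    for attempt in range(attempts):
        try:
            return func()
        except OSError:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)
    # Only reachable when attempts < 1.
    raise ValueError("attempts must be at least 1")
```

The sleep function is injectable so the retry schedule can be checked without actually waiting.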


extralit-server/src/extralit_server/integrations/modal/marker_client.py

JonnyTran · 2025-09-30T06:01:17Z

extralit-server/src/extralit_server/jobs/ocr_jobs.py


 def parse_marker_output(result: "JSONOutput") -> dict[str, Any]:
    """
    Parse Marker JSONOutput into our application's expected layout format.
-
-    Args:
-        result: JSONOutput object from Marker
-
-    Returns:
-        A dictionary with a structured list of pages and their blocks.
    """
    layout_data = {"pages": []}
-
    if result.children:
        for page_idx, page in enumerate(result.children):
            page_data = {"page": page_idx, "blocks": []}
-
            if page.children:
                for block in page.children:
                    block_data = {
-                        "type": block.block_type or "unknown",
-                        "bbox": block.bbox or [],
-                        "content": (block.html or "").strip(),
-                        "id": block.id or "",
-                        "score": None,  # Marker doesn't provide confidence scores
+                        "type": getattr(block, "block_type", None) or "unknown",
+                        "bbox": getattr(block, "bbox", None) or [],
+                        "content": (getattr(block, "html", None) or "").strip(),
+                        "id": getattr(block, "id", None) or "",
+                        "score": None,
                    }
                    page_data["blocks"].append(block_data)
-
            layout_data["pages"].append(page_data)
+    return layout_data

+
+def parse_marker_json_output(result_json: dict[str, Any]) -> dict[str, Any]:
+    """
+    Parse the JSON renderer payload returned by Modal (modal_resp['json']).
+    Mirrors Marker JSONOutput.model_dump().
+    """
+    layout_data = {"pages": []}
+    children = result_json.get("children") or []
+    for page_idx, page in enumerate(children):
+        page_data = {"page": page_idx, "blocks": []}
+        for block in page.get("children") or []:
+            block_data = {
+                "type": block.get("block_type") or "unknown",
+                "bbox": block.get("bbox") or [],
+                "content": (block.get("html") or "").strip(),
+                "id": block.get("id") or "",
+                "score": None,
+            }
+            page_data["blocks"].append(block_data)
+        layout_data["pages"].append(page_data)
    return layout_data


Hey @priyankeshh, this code handles nested dicts for both function arguments and return types; it would be better to define our own data types with pydantic.BaseModel or import an already-defined type from marker. We want to avoid this nested-dict parsing approach because the code is hard to maintain without type hints. Can you define the models at extralit-server/src/extralit_server/api/schemas/v1/document/layout.py?

You can see here for an example: https://github.com/Extralit/extralit/blob/e919e0453c808c89a4e7bfa331f1542fde5c2674/extralit-server/src/extralit_server/api/schemas/v1/document/metadata.py

Refactored the code to use Pydantic models for layout parsing and defined them in layout.py, as suggested.
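
A typed version of the layout structure discussed in this thread could look roughly like the sketch below. The model names (LayoutBlock, LayoutPage, LayoutDocument) and the helper layout_from_marker_json are hypothetical and not necessarily what layout.py contains; the field names mirror the dict keys in the diff above (type, bbox, content, id, score, page, blocks, pages), and the float element type for bbox is an assumption.

```python
from typing import Any, Dict, List, Optional

from pydantic import BaseModel, Field


class LayoutBlock(BaseModel):
    type: str = "unknown"
    bbox: List[float] = Field(default_factory=list)
    content: str = ""
    id: str = ""
    score: Optional[float] = None  # left as None, as in the diff above


class LayoutPage(BaseModel):
    page: int
    blocks: List[LayoutBlock] = Field(default_factory=list)


class LayoutDocument(BaseModel):
    pages: List[LayoutPage] = Field(default_factory=list)


def layout_from_marker_json(result_json: Dict[str, Any]) -> LayoutDocument:
    """Map a Marker JSON payload (pages under "children") to typed models."""
    pages = []
    for page_idx, page in enumerate(result_json.get("children") or []):
        blocks = [
            LayoutBlock(
                type=block.get("block_type") or "unknown",
                bbox=block.get("bbox") or [],
                content=(block.get("html") or "").strip(),
                id=block.get("id") or "",
            )
            for block in page.get("children") or []
        ]
        pages.append(LayoutPage(page=page_idx, blocks=blocks))
    return LayoutDocument(pages=pages)
```

Compared with returning dict[str, Any], callers get attribute access, validation of each field, and type hints that editors and type checkers can follow.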

… OCR settings and client
- Deleted the .env.example file as it is no longer needed.
- Added new layout.py for PDF OCR settings using Pydantic.
- Created marker_client.py to handle interactions with the Modal-hosted Marker service.
- Updated ocr_jobs.py to import the new Modal client for document conversion.

feat: add Marker service configuration and integrate Modal support for layout extraction

98a19ca

JonnyTran reviewed Sep 30, 2025


JonnyTran and others added 4 commits September 29, 2025 23:18

using httpx

7333c3c

Use Pydantic models for layout parsing

09db08d

minor fix

daa9ffb

JonnyTran marked this pull request as ready for review October 7, 2025 05:11

JonnyTran requested a review from a team as a code owner October 7, 2025 05:11

priyankeshh added 3 commits October 9, 2025 03:12

added modal deployment

d00dc41

refactor: use PDFOCRSettings with OCR_ prefix for marker modal config

5af7dd4

fix: use OCR_MARKER_MODAL_* naming convention as requested

76f718c
