@priyankeshh
Contributor

Summary

This PR adds a clean integration path to run Marker on Modal instead of locally via RQ. It introduces an environment-based switch (MARKER_RUN_MODE) and an HTTP client that forwards PDFs to a Modal-hosted Marker service. This lets us use proper GPU compute and avoid HF Space resource limits. Layout correctness improvements can be iterated on later; this PR focuses on wiring and compute offload.
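
As a rough illustration of the environment-based switch described above, here is a minimal sketch. Only the MARKER_RUN_MODE variable and the "modal" value come from this PR; the "local" default, the function names, and the callables passed in are assumptions for illustration, not the actual code in ocr_jobs.py.

```python
import os
from typing import Any, Callable, Mapping, Optional

# "modal" is the value named in this PR; "local" is an assumed name for the RQ path.
VALID_RUN_MODES = {"local", "modal"}


def get_run_mode(env: Optional[Mapping[str, str]] = None) -> str:
    """Read MARKER_RUN_MODE, defaulting to the (assumed) local mode."""
    env = os.environ if env is None else env
    mode = env.get("MARKER_RUN_MODE", "local").strip().lower()
    if mode not in VALID_RUN_MODES:
        raise ValueError(f"Unsupported MARKER_RUN_MODE: {mode!r}")
    return mode


def run_marker(
    pdf_path: str,
    run_local: Callable[[str], Any],
    run_modal: Callable[[str], Any],
    env: Optional[Mapping[str, str]] = None,
) -> Any:
    """Dispatch to the Modal branch or the local branch based on the run mode."""
    if get_run_mode(env) == "modal":
        return run_modal(pdf_path)
    return run_local(pdf_path)
```

Passing the two runners in as callables keeps the switch testable without a Modal deployment or an RQ worker.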

Key changes

  • ocr_jobs.py
    • Loads .env early to pick up configuration.
    • Adds a modal execution branch that calls the Modal endpoint for output_format=json and parses the response.
    • Logs the selected run mode and endpoint for visibility.
  • extralit_server/integrations/marker_modal_client.py
    • New lightweight client to POST PDFs to Modal /convert, with timeout and basic error handling.
    • Reads MARKER_MODAL_BASE_URL and optional MARKER_MODAL_TIMEOUT_SECS from env.
  • .env.example
    • Documents MARKER_RUN_MODE=modal and MARKER_MODAL_BASE_URL, which are required to enable the Modal integration.
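
To make the client described above more concrete, here is a minimal sketch of posting a PDF to the /convert endpoint with a timeout and basic error handling. It is not the code in marker_modal_client.py: the standard-library transport, the raw PDF request body, the output_format query parameter, the 300-second default timeout, and the names MarkerModalError, load_modal_settings, and convert_pdf are all assumptions. Only the two environment variable names and the /convert path come from this PR.

```python
import json
import os
import urllib.request
from urllib.error import HTTPError, URLError


class MarkerModalError(RuntimeError):
    """Raised when the Modal-hosted Marker service is misconfigured or fails."""


def load_modal_settings(env=None):
    """Return (base_url, timeout_secs) read from the environment."""
    env = os.environ if env is None else env
    base_url = env.get("MARKER_MODAL_BASE_URL", "").rstrip("/")
    if not base_url:
        raise MarkerModalError("MARKER_MODAL_BASE_URL is not set")
    # The 300 s default is an assumption; the PR only says the variable is optional.
    timeout = float(env.get("MARKER_MODAL_TIMEOUT_SECS", "300"))
    return base_url, timeout


def convert_pdf(pdf_bytes, env=None, opener=urllib.request.urlopen):
    """POST a PDF to <base_url>/convert and return the decoded JSON response."""
    base_url, timeout = load_modal_settings(env)
    request = urllib.request.Request(
        # Passing output_format as a query parameter is an assumed request shape.
        f"{base_url}/convert?output_format=json",
        data=pdf_bytes,
        headers={"Content-Type": "application/pdf"},
        method="POST",
    )
    try:
        with opener(request, timeout=timeout) as response:
            return json.loads(response.read().decode("utf-8"))
    except HTTPError as exc:  # must come first: HTTPError subclasses URLError
        raise MarkerModalError(f"Modal returned HTTP {exc.code}") from exc
    except (URLError, TimeoutError) as exc:
        raise MarkerModalError(f"Could not reach Modal: {exc}") from exc
```

The opener parameter exists only so the function can be exercised with a fake transport in tests; a real client would likely send multipart form data and an auth header if the service requires them.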

Configuration

  • Set the following variables (local .env for dev; real env in prod/worker runtime):
    • MARKER_RUN_MODE=modal
    • MARKER_MODAL_BASE_URL
    • MARKER_MODAL_TIMEOUT_SECS (optional)

How to test

  1. Ensure a Modal Marker service is deployed and healthy
  2. Local CLI test (bypassing RQ)
  • In repo root (so .env is picked up):
    • python extralit-server/src/extralit_server/jobs/ocr_jobs.py "/path/to/sample.pdf" --extract-text
  • Expected logs:
  • Expected output: JSON with tables, figures, text_blocks, metadata.

Checklist

  • Env-based switch to Modal
  • New Modal client and wiring in ocr_jobs
  • .env.example updated
  • Local CLI test path documented
  • Optional: add Modal auth header support
  • Optional: retries/backoff and timeouts tuning

Comment on lines 174 to 216

def parse_marker_output(result: "JSONOutput") -> dict[str, Any]:
    """
    Parse Marker JSONOutput into our application's expected layout format.

    Args:
        result: JSONOutput object from Marker

    Returns:
        A dictionary with a structured list of pages and their blocks.
    """
    layout_data = {"pages": []}

    if result.children:
        for page_idx, page in enumerate(result.children):
            page_data = {"page": page_idx, "blocks": []}

            if page.children:
                for block in page.children:
                    # getattr tolerates blocks that lack one of these attributes.
                    block_data = {
                        "type": getattr(block, "block_type", None) or "unknown",
                        "bbox": getattr(block, "bbox", None) or [],
                        "content": (getattr(block, "html", None) or "").strip(),
                        "id": getattr(block, "id", None) or "",
                        "score": None,  # Marker doesn't provide confidence scores
                    }
                    page_data["blocks"].append(block_data)

            layout_data["pages"].append(page_data)
    return layout_data


def parse_marker_json_output(result_json: dict[str, Any]) -> dict[str, Any]:
    """
    Parse the JSON renderer payload returned by Modal (modal_resp['json']).
    Mirrors Marker JSONOutput.model_dump().
    """
    layout_data = {"pages": []}
    children = result_json.get("children") or []
    for page_idx, page in enumerate(children):
        page_data = {"page": page_idx, "blocks": []}
        for block in page.get("children") or []:
            block_data = {
                "type": block.get("block_type") or "unknown",
                "bbox": block.get("bbox") or [],
                "content": (block.get("html") or "").strip(),
                "id": block.get("id") or "",
                "score": None,
            }
            page_data["blocks"].append(block_data)
        layout_data["pages"].append(page_data)
    return layout_data
Member

Hey @priyankeshh, this code handles nested dicts for both the function arguments and the return types, and it would be better to define our own data types with pydantic.BaseModel or import a defined type from marker. We want to avoid this nested dict parsing approach, since the code is hard to maintain without type hints. Can you define the models at extralit-server/src/extralit_server/api/schemas/v1/document/layout.py?

See here for an example: https://github.com/Extralit/extralit/blob/e919e0453c808c89a4e7bfa331f1542fde5c2674/extralit-server/src/extralit_server/api/schemas/v1/document/metadata.py

Contributor Author

Refactored the code to use Pydantic models for layout parsing and defined them in layout.py, as suggested.
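
For readers following the thread, a typed version of the layout parsing might look roughly like the sketch below. The model names (LayoutBlock, LayoutPage, DocumentLayout) are hypothetical and are not taken from the actual layout.py; the field names mirror the dict keys in the reviewed code, and the float type for bbox entries is an assumption.

```python
from typing import Any, Optional

from pydantic import BaseModel, Field


class LayoutBlock(BaseModel):
    """One block on a page (hypothetical model name)."""

    type: str = "unknown"
    bbox: list[float] = Field(default_factory=list)
    content: str = ""
    id: str = ""
    score: Optional[float] = None  # Marker doesn't provide confidence scores


class LayoutPage(BaseModel):
    """A page index plus the blocks found on that page."""

    page: int
    blocks: list[LayoutBlock] = Field(default_factory=list)


class DocumentLayout(BaseModel):
    """Top-level layout: an ordered list of pages."""

    pages: list[LayoutPage] = Field(default_factory=list)


def parse_marker_json_output(result_json: dict[str, Any]) -> DocumentLayout:
    """Convert the JSON renderer payload into typed layout models."""
    pages = []
    for page_idx, page in enumerate(result_json.get("children") or []):
        blocks = [
            LayoutBlock(
                type=block.get("block_type") or "unknown",
                bbox=block.get("bbox") or [],
                content=(block.get("html") or "").strip(),
                id=block.get("id") or "",
            )
            for block in page.get("children") or []
        ]
        pages.append(LayoutPage(page=page_idx, blocks=blocks))
    return DocumentLayout(pages=pages)
```

With models like these, callers get attribute access and validation (for example, layout.pages[0].blocks[0].bbox) instead of indexing into untyped nested dicts.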

JonnyTran and others added 4 commits September 29, 2025 23:18
… OCR settings and client

- Deleted the .env.example file as it is no longer needed.
- Added new layout.py for PDF OCR settings using Pydantic.
- Created marker_client.py to handle interactions with the Modal-hosted Marker service.
- Updated ocr_jobs.py to import the new Modal client for document conversion.
@JonnyTran JonnyTran marked this pull request as ready for review October 7, 2025 05:11
@JonnyTran JonnyTran requested a review from a team as a code owner October 7, 2025 05:11

3 participants