Integrate Modal-hosted Marker and add env-based switch to offload layout extraction from local RQ #157
base: develop
…r layout extraction
extralit-server/src/extralit_server/integrations/modal/marker_client.py
from typing import Any


def parse_marker_output(result: "JSONOutput") -> dict[str, Any]:
    """
    Parse Marker JSONOutput into our application's expected layout format.

    Args:
        result: JSONOutput object from Marker

    Returns:
        A dictionary with a structured list of pages and their blocks.
    """
    layout_data = {"pages": []}

    if result.children:
        for page_idx, page in enumerate(result.children):
            page_data = {"page": page_idx, "blocks": []}

            if page.children:
                for block in page.children:
                    # getattr guards against blocks that lack an attribute.
                    block_data = {
                        "type": getattr(block, "block_type", None) or "unknown",
                        "bbox": getattr(block, "bbox", None) or [],
                        "content": (getattr(block, "html", None) or "").strip(),
                        "id": getattr(block, "id", None) or "",
                        "score": None,  # Marker doesn't provide confidence scores
                    }
                    page_data["blocks"].append(block_data)

            layout_data["pages"].append(page_data)
    return layout_data


def parse_marker_json_output(result_json: dict[str, Any]) -> dict[str, Any]:
    """
    Parse the JSON renderer payload returned by Modal (modal_resp['json']).
    Mirrors Marker JSONOutput.model_dump().
    """
    layout_data = {"pages": []}
    children = result_json.get("children") or []
    for page_idx, page in enumerate(children):
        page_data = {"page": page_idx, "blocks": []}
        for block in page.get("children") or []:
            block_data = {
                "type": block.get("block_type") or "unknown",
                "bbox": block.get("bbox") or [],
                "content": (block.get("html") or "").strip(),
                "id": block.get("id") or "",
                "score": None,
            }
            page_data["blocks"].append(block_data)
        layout_data["pages"].append(page_data)
    return layout_data
Hey @priyankeshh, this code passes nested dicts as both function arguments and return values; it would be better to define our own data types with pydantic.BaseModel or import an existing type from marker. We want to avoid this nested-dict parsing approach because it is hard to maintain without type hints. Can you define the models at extralit-server/src/extralit_server/api/schemas/v1/document/layout.py?
You can see here for an example: https://github.com/Extralit/extralit/blob/e919e0453c808c89a4e7bfa331f1542fde5c2674/extralit-server/src/extralit_server/api/schemas/v1/document/metadata.py
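To illustrate the typed approach requested above, here is a minimal sketch. It uses stdlib dataclasses as a stand-in for pydantic.BaseModel so that it runs without extra dependencies; the class names (LayoutBlock, LayoutPage, Layout) and the function name parse_marker_json_to_models are hypothetical and are not taken from the PR. The field names mirror the keys of the dict-based parser in this thread.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


# Hypothetical model names; stdlib dataclasses stand in for pydantic.BaseModel.
@dataclass
class LayoutBlock:
    type: str = "unknown"
    bbox: list[float] = field(default_factory=list)
    content: str = ""
    id: str = ""
    score: Optional[float] = None  # Marker doesn't provide confidence scores


@dataclass
class LayoutPage:
    page: int
    blocks: list[LayoutBlock] = field(default_factory=list)


@dataclass
class Layout:
    pages: list[LayoutPage] = field(default_factory=list)


def parse_marker_json_to_models(result_json: dict[str, Any]) -> Layout:
    """Convert a Marker-style JSON payload into typed layout objects."""
    layout = Layout()
    for page_idx, page in enumerate(result_json.get("children") or []):
        blocks = [
            LayoutBlock(
                type=block.get("block_type") or "unknown",
                bbox=block.get("bbox") or [],
                content=(block.get("html") or "").strip(),
                id=block.get("id") or "",
            )
            for block in (page.get("children") or [])
        ]
        layout.pages.append(LayoutPage(page=page_idx, blocks=blocks))
    return layout


# Made-up sample payload: one page, one populated block and one empty block.
sample = {
    "children": [
        {
            "children": [
                {
                    "block_type": "Text",
                    "bbox": [0, 0, 10, 10],
                    "html": " <p>Hi</p> ",
                    "id": "/page/0/Text/0",
                },
                {},
            ]
        }
    ]
}
layout = parse_marker_json_to_models(sample)
```

With typed objects, callers read layout.pages[0].blocks[0].content instead of indexing nested dicts, so editors and type checkers can catch misspelled keys.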
Refactored the code to use Pydantic models for layout parsing and defined them in layout.py as suggested.
… OCR settings and client
- Deleted the .env.example file as it is no longer needed.
- Added new layout.py for PDF OCR settings using Pydantic.
- Created marker_client.py to handle interactions with the Modal-hosted Marker service.
- Updated ocr_jobs.py to import the new Modal client for document conversion.
Summary
This PR adds a clean integration path to run Marker on Modal instead of locally via RQ. It introduces an environment-based switch (MARKER_RUN_MODE) and an HTTP client that forwards PDFs to a Modal-hosted Marker service. This lets us use proper GPU/compute and avoid HF Space resource limits. Layout correctness improvements can be iterated later; this PR focuses on wiring and compute offload.
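As a rough sketch of how an environment-based switch like this could be wired: only the MARKER_RUN_MODE variable name comes from this PR. The accepted values ("local", "modal"), the default of "local", and the helper names resolve_run_mode and extract_layout are assumptions made for illustration, and the runners are stubs rather than the real RQ job or Modal HTTP client.

```python
import os
from typing import Any, Callable, Mapping

# A runner takes raw PDF bytes and returns layout data.
LayoutRunner = Callable[[bytes], dict[str, Any]]

# Assumed values; the PR only names the MARKER_RUN_MODE variable.
VALID_MODES = {"local", "modal"}


def resolve_run_mode(env: Mapping[str, str] = os.environ) -> str:
    """Read MARKER_RUN_MODE, defaulting to 'local' (an assumed default)."""
    mode = (env.get("MARKER_RUN_MODE") or "local").strip().lower()
    if mode not in VALID_MODES:
        raise ValueError(f"Unsupported MARKER_RUN_MODE: {mode!r}")
    return mode


def extract_layout(
    pdf_bytes: bytes,
    runners: Mapping[str, LayoutRunner],
    env: Mapping[str, str] = os.environ,
) -> dict[str, Any]:
    """Dispatch layout extraction to the runner selected by the environment."""
    return runners[resolve_run_mode(env)](pdf_bytes)


# Stub runners standing in for the local RQ job and the Modal HTTP client.
runners = {
    "local": lambda pdf: {"via": "local", "size": len(pdf)},
    "modal": lambda pdf: {"via": "modal", "size": len(pdf)},
}
result = extract_layout(b"%PDF-1.7", runners, {"MARKER_RUN_MODE": "modal"})
```

Passing the environment and the runners as arguments keeps the dispatch testable without touching real environment variables or network services.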
Key changes
Configuration
How to test
Checklist