Skip to content

fix(parser): clean up resources on parse failure#101

Open
Co-Messi wants to merge 2 commits intorun-llama:mainfrom
Co-Messi:codex/parser-finally-cleanup
Open

fix(parser): clean up resources on parse failure#101
Co-Messi wants to merge 2 commits intorun-llama:mainfrom
Co-Messi:codex/parser-finally-cleanup

Conversation

@Co-Messi
Copy link
Copy Markdown
Contributor

Problem

LiteParse.parse() only cleans up the converted PDF, closes the PDF document, and terminates Tesseract on the happy path. If parsing throws after conversion, the temporary conversion directory and OCR worker cleanup are skipped.

What changed

  • moved parser resource cleanup into a finally path
  • ensured PDF document close, Tesseract termination, and temp conversion cleanup all run after parse failures
  • added parser tests covering failure after conversion with and without OCR enabled

Why this fixes it

This keeps resource cleanup tied to parser lifetime instead of parser success, so failed parses do not leak temp files or OCR workers.

Verification

  • npm test -- src/core/parser.test.ts
  • npm run format:check
  • npm run build

Fixes #100

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b5cebae6fd

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/core/parser.ts Outdated
Comment on lines +229 to +232
if (this.ocrEngine && "terminate" in this.ocrEngine) {
await cleanupResource(async () => {
await (this.ocrEngine as TesseractEngine).terminate();
});
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Guard OCR termination to the current parse invocation

LiteParse keeps this.ocrEngine on the parser instance, but this finally block now calls terminate() on every parse exit, including failures that occur before OCR work starts. If callers run parse() concurrently on the same instance, a fast-failing call can tear down Tesseract while another call is still in runOCR, and processPageOcr will swallow those OCR errors and return degraded output. Please only terminate OCR resources that were actually used by this invocation (or otherwise prevent concurrent use) to avoid this regression.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This feels like a sensible suggestion? A little worried that this might destroy concurrent calls to parse()

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adjusted this so OCR termination only runs if that parse invocation actually entered OCR work. A failure before runOCR() now still closes the PDF and cleans up temp files, but it no longer tears down the shared Tesseract engine for another concurrent parse. I also updated the parser tests to cover that pre-OCR failure path while keeping termination on successful OCR parses.

@logan-markewich
Copy link
Copy Markdown
Contributor

whoops, I let this sit a little too long. Seems like some gnarly merge conflicts

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug] parse() skips cleanup on failure after conversion

2 participants