
[BUG]: Resume preprocessing can duplicate output after crash between checkpoint saves #76

@AnshuPriya-1


Bug Description

The current resume logic in extract_text_from_xml() uses last_processed_page from data/checkpoint.json to decide where to continue, and opens the output file in append mode whenever start_page > 0.

However, output is written continuously while checkpoints are only saved every 1000 pages. If the process crashes after writing additional pages but before the next checkpoint update, the output file may already contain more content than the checkpoint reflects.

On resume:

  • pages up to last_processed_page are skipped
  • output file is reopened in append mode
  • pages written after the last checkpoint may be appended again

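This can duplicate processed content in data/processed/wiki_clean.txt. Because the duplicated pages change the file's bytes, any hashes or Merkle roots computed over the processed output will also differ from those of an uninterrupted run.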
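The failure mode can be illustrated with a toy model. The sketch below is not the code from openverifiablellm/utils.py; `run`, `demo`, `checkpoint_every` and `crash_after` are hypothetical names, and the checkpoint interval is shrunk from 1000 pages to 4 so the effect is visible with ten pages. It assumes the pages written after the last checkpoint actually reached the disk before the crash.

```python
import json
import tempfile
from pathlib import Path


def run(pages, out_path, ckpt_path, checkpoint_every, crash_after=None):
    """Toy stand-in for the resume logic described above (hypothetical, not the project's code)."""
    start_page = 0
    if ckpt_path.exists():
        start_page = json.loads(ckpt_path.read_text())["last_processed_page"]

    # Append whenever we are resuming, as the issue describes.
    mode = "a" if start_page > 0 else "w"
    with out_path.open(mode, encoding="utf-8") as out:
        for page_no, text in enumerate(pages, start=1):
            if page_no <= start_page:
                continue  # already covered by the checkpoint
            out.write(text + "\n")  # output is written for every page
            if page_no % checkpoint_every == 0:
                ckpt_path.write_text(json.dumps({"last_processed_page": page_no}))
            if crash_after is not None and page_no == crash_after:
                # Simulated crash: the output is on disk, the checkpoint is stale.
                return


def demo():
    pages = [f"page-{i}" for i in range(1, 11)]
    with tempfile.TemporaryDirectory() as tmp:
        out_path = Path(tmp) / "wiki_clean.txt"
        ckpt_path = Path(tmp) / "checkpoint.json"
        # First run: checkpoint saved at page 4, interrupted after page 6.
        run(pages, out_path, ckpt_path, checkpoint_every=4, crash_after=6)
        # Second run: resumes from page 4 and appends pages 5 and 6 again.
        run(pages, out_path, ckpt_path, checkpoint_every=4)
        return out_path.read_text(encoding="utf-8").splitlines()


if __name__ == "__main__":
    lines = demo()
    print(len(lines), "lines written for 10 pages")
    print("page-5 occurrences:", lines.count("page-5"))
```

In this model the output ends up with 12 lines for 10 pages, because pages 5 and 6 are written once before the interruption and once more after the resume.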

Steps to Reproduce

  1. Create or use an XML dump with multiple <page> entries.
  2. Run extract_text_from_xml() and let it process beyond the latest saved checkpoint.
  3. Simulate an interruption after extra cleaned text has already been written to data/processed/wiki_clean.txt but before the next checkpoint update is saved.
  4. Ensure data/checkpoint.json still contains an older last_processed_page value than the content already written to the output file.
  5. Run extract_text_from_xml() again to resume processing.
  6. Observe that pages written after the last checkpoint can be appended again, causing duplicated cleaned text in data/processed/wiki_clean.txt.
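For steps 1 and 6, a small fixture with one unique marker per page makes duplicates easy to spot. This is only a sketch: `build_dump`, `duplicated_markers` and the `MARKER` convention are hypothetical helpers, the element layout is a simplified guess at a MediaWiki-style dump, and it assumes the cleaning step leaves plain alphanumeric tokens intact. The input format that extract_text_from_xml() actually expects (for example a compressed file) is not stated here, so the fixture may need adapting.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter


def build_dump(num_pages):
    """Return a small MediaWiki-like XML string with one unique marker per page."""
    root = ET.Element("mediawiki")
    for i in range(1, num_pages + 1):
        page = ET.SubElement(root, "page")
        ET.SubElement(page, "title").text = f"Test page {i}"
        revision = ET.SubElement(page, "revision")
        # The zero-padded marker identifies the page in the cleaned output.
        ET.SubElement(revision, "text").text = f"Body of MARKER{i:06d} for testing."
    return ET.tostring(root, encoding="unicode")


def duplicated_markers(cleaned_text):
    """Return a dict of markers that occur more than once in the cleaned output."""
    counts = Counter(re.findall(r"MARKER\d{6}", cleaned_text))
    return {marker: n for marker, n in counts.items() if n > 1}
```

After the resumed run, passing the contents of data/processed/wiki_clean.txt to `duplicated_markers` should return an empty dict if no page was written twice.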

Logs and Screenshots

No traceback is involved; the issue shows up in the output data rather than as a runtime error.

The problem is a data-correctness bug in the resume logic:

  • checkpoint progress is tracked by page number
  • output progress is tracked only by the contents already appended to wiki_clean.txt
  • if these get out of sync after an interruption, resume may duplicate already-written content
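One possible way to keep the two progress markers in sync is to store the output size alongside the page number and truncate the output to that size on resume. The sketch below is a suggestion under stated assumptions, not the project's implementation: `save_checkpoint`, `open_for_resume` and the `output_bytes` field are hypothetical, the output is opened in binary mode so that offsets are byte counts, and checkpoints written by the current code (without `output_bytes`) would need separate handling.

```python
import json
import os
from pathlib import Path


def save_checkpoint(ckpt_path, page_no, out_file):
    """Record the page number together with the output size at that moment."""
    out_file.flush()
    os.fsync(out_file.fileno())  # make sure the bytes counted below are on disk
    state = {"last_processed_page": page_no, "output_bytes": out_file.tell()}
    tmp_path = ckpt_path.with_suffix(".tmp")
    tmp_path.write_text(json.dumps(state))
    os.replace(tmp_path, ckpt_path)  # swap in the new checkpoint in one step


def open_for_resume(ckpt_path, out_path):
    """Return (start_page, binary file handle positioned at the checkpointed offset)."""
    if not ckpt_path.exists():
        return 0, out_path.open("wb")  # fresh run: start with an empty output file

    state = json.loads(ckpt_path.read_text())
    offset = state["output_bytes"]
    if out_path.stat().st_size < offset:
        raise RuntimeError("output file is shorter than the checkpoint expects")

    out_file = out_path.open("r+b")
    out_file.truncate(offset)  # drop anything written after the last checkpoint
    out_file.seek(offset)
    return state["last_processed_page"], out_file
```

With this approach, pages written after the last checkpoint are discarded on resume and regenerated, so the output matches what an uninterrupted run would have produced. The cost is reprocessing at most one checkpoint interval of pages.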

Environment Details

  • OS: Windows 11
  • Python version: 3.x
  • Repository: AOSSIE-Org/OpenVerifiableLLM
  • Relevant file: openverifiablellm/utils.py
  • Function: extract_text_from_xml()

This issue appears to be logic-related and should be reproducible across environments.

Impact

High - Major feature is broken

Code of Conduct

  • I have joined the Discord server and will post updates there
  • I have searched existing issues to avoid duplicates
