Description
Bug Description
The current resume logic in extract_text_from_xml() uses last_processed_page from data/checkpoint.json to decide where to continue, and opens the output file in append mode whenever start_page > 0.
However, output is written continuously while checkpoints are only saved every 1000 pages. If the process crashes after writing additional pages but before the next checkpoint update, the output file may already contain more content than the checkpoint reflects.
On resume:
- pages up to last_processed_page are skipped
- the output file is reopened in append mode
- pages written after the last checkpoint may be appended again
This can duplicate processed content in data/processed/wiki_clean.txt, which may also affect processed hashes and Merkle roots.
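To illustrate why duplicated text matters for hashing, here is a minimal sketch. It assumes the processed file is hashed with SHA-256; the hashing and Merkle code the project actually uses is not shown in this report, and the helper name sha256_hex is hypothetical.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of data as a hex string (assumed hash choice)."""
    return hashlib.sha256(data).hexdigest()


# Output as it should look after a clean run.
clean = b"page 1\npage 2\npage 3\n"

# Output after a resume that appended "page 3" a second time.
duplicated = b"page 1\npage 2\npage 3\npage 3\n"

# The digests differ, so any hash or Merkle root derived from the
# file content would no longer match a clean run over the same dump.
digests_differ = sha256_hex(clean) != sha256_hex(duplicated)
```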
Steps to Reproduce
- Create or use an XML dump with multiple <page> entries.
- Run extract_text_from_xml() and let it process beyond the latest saved checkpoint.
- Simulate an interruption after extra cleaned text has already been written to data/processed/wiki_clean.txt but before the next checkpoint update is saved.
- Ensure data/checkpoint.json still contains an older last_processed_page value than the content already written to the output file.
- Run extract_text_from_xml() again to resume processing.
- Observe that pages written after the last checkpoint can be appended again, causing duplicated cleaned text in data/processed/wiki_clean.txt.
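The steps above can be simulated without a real dump. The sketch below is a toy stand-in for the resume logic as described in this report, not the code in openverifiablellm/utils.py: toy_extract, simulate_crash_and_resume, and the crash_after parameter are hypothetical names, and each "page" is just a line of text. It keeps the behaviours that matter here: a checkpoint every 1000 pages and append mode when start_page > 0.

```python
import json
import os
import tempfile

CHECKPOINT_EVERY = 1000  # checkpoint interval stated in the report


def toy_extract(pages, out_path, ckpt_path, crash_after=None):
    """Hypothetical stand-in for the resume logic; not the project's code."""
    start_page = 0
    if os.path.exists(ckpt_path):
        with open(ckpt_path, encoding="utf-8") as f:
            start_page = json.load(f)["last_processed_page"]

    # Append when resuming, as the report describes.
    mode = "a" if start_page > 0 else "w"
    with open(out_path, mode, encoding="utf-8") as out:
        for page_no, text in enumerate(pages, start=1):
            if page_no <= start_page:
                continue  # skipped on resume
            out.write(text + "\n")  # output is written for every page
            if page_no % CHECKPOINT_EVERY == 0:
                with open(ckpt_path, "w", encoding="utf-8") as f:
                    json.dump({"last_processed_page": page_no}, f)
            if page_no == crash_after:
                return  # simulated interruption between checkpoints


def simulate_crash_and_resume():
    """Crash at page 1500 (checkpoint says 1000), then resume to the end."""
    pages = [f"page {i}" for i in range(1, 2501)]
    with tempfile.TemporaryDirectory() as tmp:
        out_path = os.path.join(tmp, "wiki_clean.txt")
        ckpt_path = os.path.join(tmp, "checkpoint.json")
        toy_extract(pages, out_path, ckpt_path, crash_after=1500)
        toy_extract(pages, out_path, ckpt_path)  # resume run
        with open(out_path, encoding="utf-8") as f:
            lines = f.read().splitlines()
    return len(lines), len(set(lines))
```

Calling simulate_crash_and_resume() returns (3000, 2500): the output holds 3000 lines for 2500 distinct pages, because pages 1001 to 1500 were written once before the simulated crash and once more after the resume.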
Logs and Screenshots
No runtime traceback is required to observe this issue.
The problem is a data-correctness bug in the resume logic:
- checkpoint progress is tracked by page number
- output progress is tracked only by the contents already appended to wiki_clean.txt
- if these get out of sync after an interruption, resume may duplicate already-written content
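One possible way to keep the two kinds of progress in sync is to record the output file's byte offset in the checkpoint and truncate the output back to that offset on resume. This is only a sketch of that idea, not a proposed patch: the output_offset field and the helpers save_checkpoint and open_for_resume are hypothetical and do not exist in the repository as far as this report shows. The output is opened in binary mode so that tell() is a plain byte offset.

```python
import json
import os
import tempfile


def save_checkpoint(ckpt_path, page_no, out_file):
    """Record the page number together with the output size at that point."""
    out_file.flush()
    os.fsync(out_file.fileno())  # make sure the bytes are on disk first
    state = {
        "last_processed_page": page_no,
        "output_offset": out_file.tell(),  # hypothetical extra field
    }
    tmp_path = ckpt_path + ".tmp"
    with open(tmp_path, "w", encoding="utf-8") as f:
        json.dump(state, f)
    os.replace(tmp_path, ckpt_path)  # swap in the new checkpoint atomically


def open_for_resume(ckpt_path, out_path):
    """Return (start_page, output file positioned at the checkpointed offset)."""
    if not os.path.exists(ckpt_path):
        return 0, open(out_path, "wb")  # fresh run: start from an empty file
    with open(ckpt_path, encoding="utf-8") as f:
        state = json.load(f)
    out = open(out_path, "r+b")
    out.truncate(state["output_offset"])  # drop pages written after the checkpoint
    out.seek(state["output_offset"])
    return state["last_processed_page"], out
```

With this shape, anything written after the last checkpoint is discarded before processing continues, so the resumed run rewrites those pages exactly once instead of appending them a second time.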
Environment Details
- OS: Windows 11
- Python version: 3.x
- Repository: AOSSIE-Org/OpenVerifiableLLM
- Relevant file: openverifiablellm/utils.py
- Function: extract_text_from_xml()
This issue appears to be logic-related and should be reproducible across environments.
Impact
High - Major feature is broken
Code of Conduct
- I have joined the Discord server and will post updates there
- I have searched existing issues to avoid duplicates