[BUG]: Resume preprocessing can duplicate output after crash between checkpoint saves

### Bug Description

The current resume logic in `extract_text_from_xml()` uses `last_processed_page` from `data/checkpoint.json` to decide where to continue, and opens the output file in append mode whenever `start_page > 0`.

However, output is written continuously, while the checkpoint is saved only every 1000 pages. If the process crashes after writing additional pages but before the next checkpoint save, the output file already contains more pages than `last_processed_page` reflects.

On resume:
- pages up to `last_processed_page` are skipped
- output file is reopened in append mode
- pages written after the last checkpoint may be appended again

This can duplicate processed content in `data/processed/wiki_clean.txt`. Any hash or Merkle root computed over the processed output may change as a result.

### Steps to Reproduce

1. Create or use an XML dump with multiple `<page>` entries.
2. Run `extract_text_from_xml()` and let it process beyond the latest saved checkpoint.
3. Simulate an interruption after extra cleaned text has already been written to `data/processed/wiki_clean.txt` but before the next checkpoint update is saved.
4. Ensure `data/checkpoint.json` still contains an older `last_processed_page` value than the content already written to the output file.
5. Run `extract_text_from_xml()` again to resume processing.
6. Observe that pages written after the last checkpoint can be appended again, causing duplicated cleaned text in `data/processed/wiki_clean.txt`.
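
The steps above can be modelled without a real XML dump. The following is a minimal sketch, not the code in `openverifiablellm/utils.py`: `toy_extract`, `run_demo`, the `crash_at` parameter, and the checkpoint interval of 3 (standing in for 1000) are all hypothetical names and values chosen for illustration. It only assumes the behaviour described in this issue: skip pages up to `last_processed_page`, append when `start_page > 0`, and checkpoint periodically.

```python
import json
import os
import tempfile

CHECKPOINT_EVERY = 3  # hypothetical stand-in for the 1000-page interval


def toy_extract(pages, out_path, ckpt_path, crash_at=None):
    """Toy model of the resume logic described in this issue (not the real function)."""
    start_page = 0
    if os.path.exists(ckpt_path):
        with open(ckpt_path) as f:
            start_page = json.load(f)["last_processed_page"]

    # Append on resume, overwrite on a fresh run.
    mode = "a" if start_page > 0 else "w"
    with open(out_path, mode) as out:
        for page_no, text in enumerate(pages, start=1):
            if page_no <= start_page:
                continue  # already covered by the checkpoint
            if crash_at is not None and page_no > crash_at:
                return  # simulated crash: output is on disk, checkpoint is stale
            out.write(text + "\n")
            out.flush()
            if page_no % CHECKPOINT_EVERY == 0:
                with open(ckpt_path, "w") as f:
                    json.dump({"last_processed_page": page_no}, f)


def run_demo():
    """Crash after page 5 (checkpoint says 3), resume, and return the output lines."""
    pages = [f"page-{i}" for i in range(1, 8)]
    with tempfile.TemporaryDirectory() as tmp:
        out_path = os.path.join(tmp, "wiki_clean.txt")
        ckpt_path = os.path.join(tmp, "checkpoint.json")
        toy_extract(pages, out_path, ckpt_path, crash_at=5)
        toy_extract(pages, out_path, ckpt_path)  # resume run
        with open(out_path) as f:
            return f.read().split()


if __name__ == "__main__":
    print(run_demo())
```

In this model the first run writes pages 1 to 5 while the checkpoint stays at 3, so the resume run appends pages 4 and 5 a second time.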

### Logs and Screenshots

No runtime traceback is required to observe this issue.

The problem is a data-correctness bug in the resume logic:
- checkpoint progress is tracked by page number
- output progress is tracked only by the contents already appended to `wiki_clean.txt`
- if these get out of sync after an interruption, resume may duplicate already-written content
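
One possible way to keep the two in sync, offered as a sketch rather than as the project's intended fix, is to store the output size alongside the page number and truncate the output file back to that size on resume. The helper names `save_checkpoint` and `open_output_for_resume` and the `output_offset` key are assumptions introduced here; only `last_processed_page` comes from the existing checkpoint format. The output is opened in binary mode so that `tell()` is a plain byte offset.

```python
import json
import os


def save_checkpoint(ckpt_path, page_no, out_file):
    """Record the page number together with the output size at that moment."""
    out_file.flush()
    os.fsync(out_file.fileno())  # make sure the bytes counted below are on disk
    state = {
        "last_processed_page": page_no,
        "output_offset": out_file.tell(),  # hypothetical extra field
    }
    tmp_path = ckpt_path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, ckpt_path)  # swap in the new checkpoint in one step


def open_output_for_resume(ckpt_path, out_path):
    """Return (start_page, binary file) with output cut back to the checkpoint."""
    if not os.path.exists(ckpt_path):
        return 0, open(out_path, "wb")  # fresh run: start from an empty file
    with open(ckpt_path) as f:
        state = json.load(f)
    # Raises FileNotFoundError if the checkpoint exists but the output does not.
    out_file = open(out_path, "r+b")
    out_file.truncate(state["output_offset"])  # drop pages written after the checkpoint
    out_file.seek(state["output_offset"])
    return state["last_processed_page"], out_file
```

With this approach, anything written after the last checkpoint is discarded and regenerated on resume, so each page appears in the output once. The trade-off is that up to one checkpoint interval of work is redone after a crash.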

### Environment Details

- OS: Windows 11
- Python version: 3.x
- Repository: AOSSIE-Org/OpenVerifiableLLM
- Relevant file: `openverifiablellm/utils.py`
- Function: `extract_text_from_xml()`

This issue appears to be logic-related and should be reproducible across environments.

### Impact

High - Major feature is broken

### Code of Conduct

- [x] I have joined the [Discord server](https://discord.gg/hjUhu33uAn) and will post updates there
- [x] I have searched existing issues to avoid duplicates
