
[bug] ChatGPT HTML bulk import failed. #181

@wxxb789

Description


Describe the bug

Three distinct issues encountered during a 1,537-conversation ChatGPT HTML bulk import:

Issue 1 — Import dialog silently dismissed on focus loss
While the first import was running (~800 records processed over an afternoon), clicking outside the import dialog caused it to disappear. No confirmation prompt, no way to recover or resume. The import process stopped.

Issue 2 — Re-import performance collapses
After re-importing the same file, the second run completed only 37 records in ~2.7 hours (progress.log shows timestamps from 03:44:30 to 06:23:30 local time (UTC+8), Apr 10). Before the interruption, the first import had been processing at a much higher rate.

Issue 3 — Server-wide degradation after re-import
After the re-import, the entire Nowledge Mem server became degraded. Ingesting data through any channel (Alma plugin, nmem CLI, MCP) started failing at a high rate.

To Reproduce

  1. Prepare a ChatGPT HTML export (~1,537 conversations, format: chatgpt_html_bulk)
  2. Start import in Nowledge Mem desktop app
  3. Let it run — first import proceeds at reasonable speed (~800 records in an afternoon)
  4. Click outside the import dialog → dialog disappears, import stops
  5. Re-import the same file
  6. Observe: import crawls (~37 records in 2.7 hours), server degrades for all ingest channels

Expected behavior

  • Import dialog should be a persistent modal — not dismissable by accidental click
  • Import should be resumable / support deduplication on re-import
  • A failed or repeated import should not corrupt the WAL or degrade the server
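The deduplication ask above could be as simple as keying each conversation on a stable content hash before ingest, so a re-run skips work already done. A minimal sketch — the `ingest` call and the persisted `seen` store are hypothetical, not part of Nowledge Mem's actual pipeline:

```python
import hashlib
import json

def conversation_key(conv: dict) -> str:
    """Stable key for a conversation: SHA-256 of its canonical JSON form."""
    canonical = json.dumps(conv, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def import_conversations(convs, seen: set) -> int:
    """Import only records not seen in a previous attempt, so a re-run
    resumes instead of re-processing all 1,537 conversations."""
    imported = 0
    for conv in convs:
        key = conversation_key(conv)
        if key in seen:
            continue          # already ingested in an earlier attempt
        # ingest(conv)        # hypothetical call into the import pipeline
        seen.add(key)
        imported += 1
    return imported
```

Persisting `seen` (e.g. in a side table keyed by job) would also make an interrupted import resumable.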

Screenshots

(Available on request)

Additional context

Environment

| Item | Detail |
| --- | --- |
| OS | Windows 11 (VM), build 10.0.26200.7985 |
| RAM | 64 GB |
| CPU | AMD EPYC 7763, 16 vCPU |
| Nowledge Mem version | 0.6.19 (confirmed via UI) |
| Embedding model | BAAI/bge-m3 (1024-dim, fastembed backend) |
| LLM model | gpt-5.3-codex |
| DB size | 331 MB (nowledge_graph_v2_v1.db) |
| Search index size | 319 MB (Lance format, 6 indices) |
| Total log volume | ~153 MB across 6 rotated log files |
| Import file | ChatGPT HTML export, 1,537 conversations |

Diagnostic evidence from local logs

5 separate import attempts logged (all for the same 1,537-thread file)

| # | Timestamp (local, UTC+8) | Job ID |
| --- | --- | --- |
| 1 | 2026-04-09 14:37 | ed4da345-6f26-4ea7-bd9d-d15cfb07809f |
| 2 | 2026-04-09 23:19 | 51e53486-e7f2-4745-91b9-4c86e1a91dff |
| 3 | 2026-04-10 01:54 | 28058c7e-0dd7-4733-9374-e05e65fcce2b |
| 4 | 2026-04-10 11:43 | 627b6e82-4621-4fc3-bdd5-f3acd72a528a |
| 5 | 2026-04-10 15:02 | 1850ac39-3d05-4df5-97cf-c14c4e5cf305 |

progress.log: only 37 records imported, with erratic timing

2026-04-10T03:44:30  (record 1)
2026-04-10T03:44:31  (record 2,  1s gap)
2026-04-10T03:51:47  (record 3,  7min gap)
2026-04-10T04:41:49  (record 4,  50min gap)
2026-04-10T05:44:00  (record 5,  62min gap)
2026-04-10T05:46:36 → 05:46:49  (records 6–18, ~1s each — burst)
2026-04-10T05:49:24 → 05:52:05  (records 19–25, mixed)
2026-04-10T05:54:41 → 06:02:47  (records 26–35, mixed)
2026-04-10T06:10:27  (record 36, 8min gap)
2026-04-10T06:23:30  (record 37, 13min gap)

Pattern: short bursts of ~1s/record interrupted by long stalls (up to 62 minutes), suggesting repeated contention or blocking.
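The burst-and-stall pattern can be quantified directly from progress.log by computing the gap between consecutive record timestamps — a small sketch, assuming the ISO timestamp format shown in the excerpt above:

```python
from datetime import datetime

def record_gaps(timestamps):
    """Parse ISO timestamps and return the gap in seconds between
    consecutive records, to separate ~1s bursts from multi-minute stalls."""
    parsed = [datetime.fromisoformat(t) for t in timestamps]
    return [(b - a).total_seconds() for a, b in zip(parsed, parsed[1:])]

sample = ["2026-04-10T03:44:30", "2026-04-10T03:44:31", "2026-04-10T03:51:47"]
print(record_gaps(sample))  # → [1.0, 436.0]
```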

WAL corruption detected twice

[2026-04-10 08:56:07] WARN: Detected corrupted WAL file, attempting recovery
  error: "Corrupted wal file. Read out invalid WAL record type."
  → Moved aside as *.wal.corrupt-20260410T005607Z (36 KB)

An earlier corruption was also recovered on Apr 7 (103 KB). Both timestamps correlate with import activity.

Checkpoint contention — error distribution across all logs

| Log file | Time range | Checkpoint retrying | Checkpoint ALL RETRIES EXHAUSTED | FTS search failed | Corrupted WAL |
| --- | --- | --- | --- | --- | --- |
| app.log.5 | Apr 8 01:36–02:20 | 0 | 0 | 66 | 0 |
| app.log.4 | Apr 8 08:54–22:01 | 11 | 3 | 278 | 0 |
| app.log.3 | Apr 8 22:01–Apr 9 01:54 | 8 | 1 | 12 | 13 |
| app.log.2 | Apr 9 01:54–Apr 10 00:25 | 212 | 16 | 9 | 4 |
| app.log.1 | Apr 10 00:25–08:54 | 66 | 21 | 3 | 0 |
| app.log | Apr 10 08:55–15:22 | 71 | 18 | 199 | 3 |

The recurring error message:

"Timeout waiting for active transactions to leave the system before checkpointing.
 If you have an open transaction, please close it and try again."

Checkpoint failures per hour on Apr 10 (showing peak contention midday):

09h: 4  |  10h: 1  |  11h: 6  |  12h: 23  |  13h: 24  |  14h: 23  |  15h: 9
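The per-hour tally above can be reproduced from the app logs with a short script — a sketch, assuming log lines begin with a bracketed local timestamp as in the excerpts quoted in this report:

```python
import re
from collections import Counter

def failures_per_hour(lines, needle="Timeout waiting for active transactions"):
    """Count error lines containing `needle` per hour, keyed by 'HHh'."""
    hours = Counter()
    stamp = re.compile(r"\[\d{4}-\d{2}-\d{2} (\d{2}):\d{2}:\d{2}\]")
    for line in lines:
        if needle not in line:
            continue
        m = stamp.match(line)
        if m:
            hours[m.group(1) + "h"] += 1
    return dict(hours)
```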

FTS index corruption

197 "FTS search failed on entities_index" errors, caused by missing Lance data files:

error: "Not found: .../entities_index.lance/data/<hash>.lance"
error: "FileDoesNotExist("meta.json")"

Other ingest channels affected

Alma plugin thread appends fail with "Thread not found" errors, confirming server-wide degradation:

[2026-04-10 09:00:05] ERROR: Invalid request for append — Thread not found: alma-<redacted>
[2026-04-10 14:23:27] ERROR: Invalid request for append — Thread not found: alma-<redacted>
[2026-04-10 15:12:06] ERROR: Invalid request for append — Thread not found: alma-<redacted>

Windows Defender / Exploit Guard investigation (ruled out)

We investigated Windows Defender Real-Time Protection (RTP) and Controlled Folder Access (CFA) as a possible cause of the WAL corruption and import performance issues. After thorough analysis, Defender has been ruled out as a contributing factor.

Key evidence

  • CFA mode = 3 (audit only), 0 events targeting the NowledgeGraph data directory. Audit mode does not block or delay file operations — it only logs what would be blocked in enforcement mode.
  • All 104 Nowledge-related CFA events target the exe file (nowledge-mem.exe), not the data directory. No CFA events reference any .db, .wal, or .lance files.
  • The NowledgeGraph data directory (%LOCALAPPDATA%\NowledgeGraph) is NOT in CFA's default protected folders. CFA protects user-profile folders like Documents, Desktop, and Pictures — not AppData\Local. Writes to the data directory are not subject to CFA at all.
  • Zero Defender events during the import stall windows (03:44–06:24 on Apr 10). The 50–62 minute stalls in progress.log have no corresponding Defender activity.
  • Write benchmarks show negligible RTP overhead: 1.1ms vs 1.0ms per 4KB write. Real-Time Protection scanning adds ~0.1ms per write — nowhere near enough to explain multi-minute stalls.
  • Zero Defender events mention any .db, .wal, or .lance files. Defender is simply not interacting with the Nowledge data files.

Conclusion

Windows Defender is not the root cause of the WAL corruption or import performance issues.


Summary

Core asks for the maintainers:

  • Make the import dialog a persistent modal and add resume capability — if the import is interrupted or the dialog is accidentally dismissed, there is currently no way to resume, and the partial state left behind leads to the errors documented above
  • Investigate WAL/checkpoint contention and state cleanup after interrupted or repeated imports — the WAL corruption and FTS index errors appear reproducible when an import is retried after a failure, which points to an application-level issue (lock contention or incomplete cleanup) rather than an environmental one
