Fix Apple Mail V10 import: discover messages in numeric partition directories #166

Open
schlabrendorff wants to merge 1 commit into wesm:main from schlabrendorff:fix/apple-mail-v10-partitions

@schlabrendorff

This is a vibecoded fix for #157

msgvault v0.9.0
  commit:  4801588b
  built:   2026-02-26T20:45:45Z
  go:      go1.25.7
  os/arch: darwin/amd64

Summary

  • Discover .emlx files in V10 numeric partition subdirectories (Data/0/3/Messages/, Data/9/Messages/, etc.) at arbitrary nesting depths
  • Handle partition-only mailboxes where Data/Messages/ does not exist at all
  • Resolve partition file paths transparently via FileIndex map + FilePath() method, requiring only a one-line change in the importer

Problem

Apple Mail V10 splits large mailboxes across single-digit numeric partition directories under Data/. A real-world mailbox might look like:

~/Library/Mail/V10/<account-GUID>/Sent Messages.mbox/
  <GUID>/Data/
    Messages/           <- top-level (may be empty or absent)
      1.emlx
    0/
      3/
        Messages/       <- 2-level partition
          123.emlx
          124.emlx
    9/
      Messages/         <- 1-level partition
        456.emlx
      9/
        Messages/       <- 2-level partition
          500.emlx

Real tree -d output from a 20-year Inbox.mbox (Attachments dirs omitted):

~/Library/Mail/V10/<account-GUID>/Inbox.mbox/<GUID>/Data
├── 0
│   ├── 1
│   │   └── Messages
│   ├── 3
│   │   └── Messages
│   ├── 4
│   │   └── Messages
│   ├── 6
│   │   └── Messages
│   └── 7
│       └── Messages
├── 1
│   ├── 1
│   │   └── Messages
│   ├── 3
│   │   └── Messages
│   ├── 4
│   │   └── Messages
│   ├── 6
│   │   └── Messages
│   ├── 7
│   │   └── Messages
│   └── 8
│       └── Messages
├── 2
│   ├── 3
│   │   └── Messages
│   ├── 4
│   │   └── Messages
│   ├── 7
│   │   └── Messages
│   ├── 8
│   │   └── Messages
│   ├── 9
│   │   └── Messages
│   └── Messages
...
└── Messages

108 directories

The previous code only looked for .emlx files in Data/Messages/. It had no awareness of numeric partition directories at all. This meant:

  1. Missing messages: Any .emlx file stored in a partition directory (e.g., Data/0/3/Messages/123.emlx) was silently skipped. On a real 20-year Apple Mail archive (~6 accounts), only 363 files across 5 mailboxes were found instead of the full 39,417 files across 32 mailboxes — 99% of messages were invisible.

  2. Invisible mailboxes: Some mailboxes have no top-level Data/Messages/ directory. Every message lives in partitions. These mailboxes were not detected at all by findMessagesDir (it checked os.Stat(Data/Messages) and gave up if that directory didn't exist), so isMailboxDir returned false and the entire mailbox was skipped during discovery.

  3. Path resolution assumed flat structure: The importer used filepath.Join(mb.MsgDir, fileName) to build file paths, which only works when all .emlx files are in a single Messages/ directory. Partition files live in different Messages/ directories at different paths, so this produced incorrect paths that would fail on os.Stat.

Solution

The fix adds partition awareness at two levels: discovery (finding files) and resolution (building paths).

Discovery

findMessagesDir now checks for partition directories when Data/Messages/ is absent or empty. It calls hasEmlxFilesInPartitions, which recursively walks single-digit subdirectories (0-9) looking for Messages/ dirs that contain .emlx files. This lets isMailboxDir return true for partition-only mailboxes.
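
A minimal sketch of what such an existence check could look like, assuming only what this description states: the function names come from this PR, but the bodies, the error handling (unreadable directories count as empty), and the demo helper are illustrative guesses, not the code in discover.go.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// Helper bodies are assumed from the PR description.
func isDigitDir(name string) bool {
	return len(name) == 1 && name[0] >= '0' && name[0] <= '9'
}

func isEmlxFile(name string) bool {
	return strings.HasSuffix(name, ".emlx") && !strings.HasSuffix(name, ".partial.emlx")
}

// hasEmlxFilesInPartitions reports whether dir, or any single-digit
// subdirectory below it, has a Messages/ directory with an .emlx file.
func hasEmlxFilesInPartitions(dir string) bool {
	entries, err := os.ReadDir(dir)
	if err != nil {
		return false // assumption: unreadable directories count as empty
	}
	for _, e := range entries {
		if !e.IsDir() {
			continue
		}
		sub := filepath.Join(dir, e.Name())
		switch {
		case e.Name() == "Messages":
			files, _ := os.ReadDir(sub)
			for _, f := range files {
				if !f.IsDir() && isEmlxFile(f.Name()) {
					return true // stop at the first message found
				}
			}
		case isDigitDir(e.Name()):
			if hasEmlxFilesInPartitions(sub) {
				return true
			}
		}
	}
	return false
}

// demo (hypothetical) builds two throwaway Data/ trees and probes each.
func demo() (bool, bool, error) {
	root, err := os.MkdirTemp("", "v10-demo")
	if err != nil {
		return false, false, err
	}
	defer os.RemoveAll(root)

	// Partition-only mailbox: no Data/Messages/, one message three levels down.
	deep := filepath.Join(root, "a", "Data", "0", "0", "1", "Messages")
	if err := os.MkdirAll(deep, 0o755); err != nil {
		return false, false, err
	}
	if err := os.WriteFile(filepath.Join(deep, "7.emlx"), nil, 0o644); err != nil {
		return false, false, err
	}
	// Mailbox whose only partition directory holds no messages.
	if err := os.MkdirAll(filepath.Join(root, "b", "Data", "3", "Messages"), 0o755); err != nil {
		return false, false, err
	}
	return hasEmlxFilesInPartitions(filepath.Join(root, "a", "Data")),
		hasEmlxFilesInPartitions(filepath.Join(root, "b", "Data")), nil
}

func main() {
	fmt.Println(demo())
}
```

The early return matters for discovery cost: the check only needs to answer yes or no, so it stops at the first .emlx file instead of walking the whole partition tree.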

listEmlxFiles now collects files from both the primary Messages/ dir and all partition Messages/ dirs. It calls collectPartitionFiles, which recursively descends digit directories and collects every .emlx file it finds in any nested Messages/ directory, building a map from filename to the absolute path of its containing directory.
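
The collection walk might be sketched as follows. The name collectPartitionFiles is from this PR; its signature, the partitionIndex wrapper, and the demo fixture are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

func isDigitDir(name string) bool {
	return len(name) == 1 && name[0] >= '0' && name[0] <= '9'
}

func isEmlxFile(name string) bool {
	return strings.HasSuffix(name, ".emlx") && !strings.HasSuffix(name, ".partial.emlx")
}

// collectPartitionFiles maps each .emlx filename found in dir/Messages/
// to that directory, then recurses into single-digit subdirectories.
// The signature is an assumption, not the PR's actual one.
func collectPartitionFiles(dir string, index map[string]string) {
	entries, err := os.ReadDir(dir)
	if err != nil {
		return
	}
	for _, e := range entries {
		if !e.IsDir() {
			continue
		}
		sub := filepath.Join(dir, e.Name())
		switch {
		case e.Name() == "Messages":
			files, _ := os.ReadDir(sub)
			for _, f := range files {
				if !f.IsDir() && isEmlxFile(f.Name()) {
					index[f.Name()] = sub
				}
			}
		case isDigitDir(e.Name()):
			collectPartitionFiles(sub, index)
		}
	}
}

// partitionIndex (hypothetical wrapper) starts the walk in the digit
// directories directly under Data/, so Data/Messages/ stays unindexed.
func partitionIndex(dataDir string) map[string]string {
	index := make(map[string]string)
	entries, _ := os.ReadDir(dataDir)
	for _, e := range entries {
		if e.IsDir() && isDigitDir(e.Name()) {
			collectPartitionFiles(filepath.Join(dataDir, e.Name()), index)
		}
	}
	return index
}

// demo builds a small Data/ tree and returns filename -> directory,
// with directories made relative to Data/ for readability.
func demo() (map[string]string, error) {
	data, err := os.MkdirTemp("", "v10-data")
	if err != nil {
		return nil, err
	}
	defer os.RemoveAll(data)
	for _, rel := range []string{
		"Messages/1.emlx",
		"0/3/Messages/123.emlx",
		"9/Messages/456.emlx",
		"9/Messages/457.partial.emlx",
	} {
		p := filepath.Join(data, filepath.FromSlash(rel))
		if err := os.MkdirAll(filepath.Dir(p), 0o755); err != nil {
			return nil, err
		}
		if err := os.WriteFile(p, nil, 0o644); err != nil {
			return nil, err
		}
	}
	out := make(map[string]string)
	for name, dir := range partitionIndex(data) {
		rel, err := filepath.Rel(data, dir)
		if err != nil {
			return nil, err
		}
		out[name] = filepath.ToSlash(rel)
	}
	return out, nil
}

func main() {
	fmt.Println(demo())
}
```

In this fixture only 123.emlx and 456.emlx end up in the map: 1.emlx lives in the primary Messages/ directory, and the .partial.emlx file is filtered out.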

Both recursive functions (hasEmlxFilesInPartitions and collectPartitionFiles) follow the same pattern: at each directory level, enter Messages/ to check/collect files, and recurse into any single-digit subdirectory. This mirrors Apple Mail's actual partition structure, which nests digit dirs arbitrarily deep (observed up to 3 levels in real data).

Resolution

The Mailbox struct gains a FileIndex field (map of filename to its Messages/ directory path) and a FilePath() method. For files in the primary Messages/ dir, FileIndex has no entry and FilePath falls back to filepath.Join(MsgDir, fileName) — identical to the old behavior. For partition files, FileIndex maps the filename to its specific Messages/ directory, and FilePath joins that path with the filename.
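
The lookup-with-fallback can be sketched like this. Only the three fields named in this description are shown; the real Mailbox struct presumably has more, and the method body is an assumption based on the behavior described above.

```go
package main

import (
	"fmt"
	"path/filepath"
)

// Mailbox is reduced to the fields this description mentions.
type Mailbox struct {
	MsgDir    string            // primary Data/Messages/ directory
	Files     []string          // bare filenames, top-level and partition
	FileIndex map[string]string // partition filename -> its Messages/ dir
}

// FilePath resolves a bare filename to a full path. Files without a
// FileIndex entry fall back to MsgDir, matching the old behavior.
func (mb *Mailbox) FilePath(fileName string) string {
	if dir, ok := mb.FileIndex[fileName]; ok {
		return filepath.Join(dir, fileName)
	}
	return filepath.Join(mb.MsgDir, fileName)
}

func main() {
	mb := &Mailbox{
		MsgDir:    "/data/Messages",
		Files:     []string{"1.emlx", "123.emlx"},
		FileIndex: map[string]string{"123.emlx": "/data/0/3/Messages"},
	}
	for _, f := range mb.Files {
		fmt.Println(mb.FilePath(f))
	}
}
```

Because reading from a nil map is safe in Go, a Mailbox built by older code paths with no FileIndex at all would still resolve every file through MsgDir.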

The importer change is a single line: filepath.Join(mb.MsgDir, fileName) becomes mb.FilePath(fileName). No other importer logic changes because the Files slice already contains all filenames (both top-level and partition) and they sort/deduplicate identically.

Code consolidation

Shared logic is extracted into two small helpers to avoid duplication:

  • isDigitDir(name) — checks if a directory name is a single digit 0-9 (used in 4 places)
  • isEmlxFile(name) — checks .emlx suffix excluding .partial.emlx (used in 3 places)
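
For illustration, the two helpers could be as small as the sketch below. The names match this PR; the bodies are assumptions, including the choice to compare suffixes case-sensitively.

```go
package main

import (
	"fmt"
	"strings"
)

// isDigitDir reports whether name is exactly one ASCII digit (0-9).
func isDigitDir(name string) bool {
	return len(name) == 1 && name[0] >= '0' && name[0] <= '9'
}

// isEmlxFile reports whether name ends in .emlx but not .partial.emlx.
// Assumption: the match is case-sensitive.
func isEmlxFile(name string) bool {
	return strings.HasSuffix(name, ".emlx") && !strings.HasSuffix(name, ".partial.emlx")
}

func main() {
	fmt.Println(isDigitDir("3"), isDigitDir("10"), isDigitDir("Messages"))
	fmt.Println(isEmlxFile("123.emlx"), isEmlxFile("123.partial.emlx"))
}
```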

The recursive walks for existence checking (hasEmlxFilesInPartitions) and file collection (collectPartitionFiles) are each a single function rather than being split across multiple wrappers.

Tests

Three new tests cover the distinct partition scenarios encountered in real Apple Mail data:

TestDiscoverMailboxes_V10Partitioned — The mixed case: a mailbox has files in both the primary Data/Messages/ and in partition directories at different depths (Data/0/3/Messages/, Data/9/Messages/). Verifies that all 3 files are discovered, that FilePath() resolves each to an existing file on disk, that top-level files are not in FileIndex (they resolve via MsgDir), and that partition files are in FileIndex.

TestDiscoverMailboxes_V10PartitionedOnly — The empty-primary case: Data/Messages/ exists but contains no .emlx files; all messages are in Data/3/Messages/. Verifies the mailbox is detected and both partition files are found. This catches a regression where findMessagesDir would find the empty Messages/ dir, listEmlxFiles would return zero files, and the mailbox would be silently skipped.

TestDiscoverMailboxes_V10NoTopLevelMessages — The most extreme case: Data/Messages/ does not exist at all. Messages are in partitions at 2-3 levels deep (Data/9/9/Messages/, Data/0/0/1/Messages/). This is the scenario that motivated the hasEmlxFilesInPartitions check in findMessagesDir — without it, os.Stat(Data/Messages) fails and the mailbox is invisible. Verifies all 3 files across two separate partition paths are found and resolve correctly.

All existing tests continue to pass unchanged, confirming no regression in legacy or standard V10 layouts.

Verification against real data

Tested against a real Apple Mail archive spanning 20+ years (~6 accounts, 39,417 .emlx files on disk):

Metric                 Before   After
Mailboxes discovered   5        32
Files found            363      39,417

The 5 mailboxes found before were the only ones with any .emlx files in the top-level Data/Messages/ directory. The remaining 27 mailboxes stored all messages exclusively in partition directories and were completely invisible.

Future simplification

The current Mailbox.Files field stores bare filenames (e.g., "123.emlx"), with a separate FileIndex map to resolve partition files back to their containing directory. This two-layer design exists because Files predates partition support — when all files lived in a single Messages/ dir, filenames alone were sufficient.

A cleaner approach would be to store full absolute paths in Files directly and drop FileIndex/FilePath() entirely. The discovery walk already knows the full path of every file it finds; the current code discards that information and reconstructs it later via the map lookup. Storing full paths would simplify the Mailbox struct, eliminate FileIndex, and make the recursive walk simpler (just collect paths, no map needed). The importer would use the path directly instead of calling FilePath(). The main thing to watch is that resume checkpoints currently store the last processed filename — that format would need to change to full paths.
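
A rough sketch of that simplification, under the assumption that the walk returns a flat slice of paths. collectEmlxPaths and demo are hypothetical names that do not exist in this PR, and the demo prints paths relative to Data/ only to keep the output readable.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"sort"
	"strings"
)

func isDigitDir(name string) bool {
	return len(name) == 1 && name[0] >= '0' && name[0] <= '9'
}

func isEmlxFile(name string) bool {
	return strings.HasSuffix(name, ".emlx") && !strings.HasSuffix(name, ".partial.emlx")
}

// collectEmlxPaths (hypothetical) returns the full path of every .emlx
// file in dir/Messages/ and in Messages/ dirs under digit directories.
func collectEmlxPaths(dir string) []string {
	entries, err := os.ReadDir(dir)
	if err != nil {
		return nil
	}
	var paths []string
	for _, e := range entries {
		if !e.IsDir() {
			continue
		}
		sub := filepath.Join(dir, e.Name())
		switch {
		case e.Name() == "Messages":
			files, _ := os.ReadDir(sub)
			for _, f := range files {
				if !f.IsDir() && isEmlxFile(f.Name()) {
					paths = append(paths, filepath.Join(sub, f.Name()))
				}
			}
		case isDigitDir(e.Name()):
			paths = append(paths, collectEmlxPaths(sub)...)
		}
	}
	return paths
}

// demo builds a Data/ tree in which 1.emlx exists twice and returns
// the discovered paths relative to Data/, sorted.
func demo() ([]string, error) {
	data, err := os.MkdirTemp("", "v10-paths")
	if err != nil {
		return nil, err
	}
	defer os.RemoveAll(data)
	for _, rel := range []string{
		"Messages/1.emlx",
		"9/Messages/1.emlx",
		"0/3/Messages/123.emlx",
	} {
		p := filepath.Join(data, filepath.FromSlash(rel))
		if err := os.MkdirAll(filepath.Dir(p), 0o755); err != nil {
			return nil, err
		}
		if err := os.WriteFile(p, nil, 0o644); err != nil {
			return nil, err
		}
	}
	var out []string
	for _, p := range collectEmlxPaths(data) {
		rel, err := filepath.Rel(data, p)
		if err != nil {
			return nil, err
		}
		out = append(out, filepath.ToSlash(rel))
	}
	sort.Strings(out)
	return out, nil
}

func main() {
	fmt.Println(demo())
}
```

One side effect of this layout is that two files sharing a basename in different directories remain distinct entries, since nothing is keyed by filename anymore.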

…ectories

Apple Mail V10 stores large mailboxes by splitting .emlx files across
numeric partition subdirectories (0-9) nested under Data/ at arbitrary
depths (e.g., Data/0/3/Messages/123.emlx). Previously, only the
top-level Data/Messages/ directory was scanned, missing the majority of
messages in partitioned mailboxes.

Changes:
- Add FileIndex map to Mailbox struct for resolving partition file paths
- Add FilePath() method for transparent path resolution
- Add recursive partition discovery (hasEmlxFilesInPartitions,
  collectPartitionFiles) with isDigitDir/isEmlxFile helpers
- Handle partition-only layouts where Data/Messages/ doesn't exist
- Use FilePath() in emlx importer instead of direct path join

Before: 12 mailboxes, 31k files (top-level Messages/ only)
After:  52 mailboxes, 39k files (all partition depths)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@roborev-ci

roborev-ci bot commented Mar 2, 2026

roborev: Combined Review (bca4ea0)

Verdict: The commit successfully extends .emlx discovery for Apple Mail V10 layouts, but requires a fix for a medium-severity basename collision issue.

Medium

Basename collisions can cause wrong-file imports and dropped messages

  • Refs: /home/roborev/repos/msgvault/internal/emlx/discover.go (lines 29, 31, 357, 358, 396, 402)
  • Description: Partitioned files and FileIndex are keyed only by bare filename (map[string]string). If the same .emlx basename appears in multiple locations (e.g., across different partitions or between a partition and the top-level Messages/ directory), one path silently overwrites the other. This causes FilePath() to resolve incorrectly, potentially duplicating one file import while dropping another, which enables input-controlled file shadowing.
  • Remediation: Track files using a collision-safe identifier, such as the full absolute path or a stable relative path from the mailbox root, instead of basename-only indexing. At a minimum, detect duplicate filenames explicitly and handle the collision deterministically (e.g., fail the import or log a warning and skip) rather than silently overwriting index entries.
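
As a sketch of the minimum remediation (explicit duplicate detection), an index builder could refuse a basename that shows up in two different directories. indexFiles is a hypothetical helper, not code from discover.go, and failing the whole build is just one of the deterministic options listed above.

```go
package main

import (
	"fmt"
	"path/filepath"
)

// indexFiles (hypothetical) builds a basename -> directory map from
// full paths and fails on the first basename seen in two directories.
func indexFiles(paths []string) (map[string]string, error) {
	index := make(map[string]string, len(paths))
	for _, p := range paths {
		dir, name := filepath.Dir(p), filepath.Base(p)
		if prev, ok := index[name]; ok && prev != dir {
			return nil, fmt.Errorf("duplicate basename %q in %s and %s", name, prev, dir)
		}
		index[name] = dir
	}
	return index, nil
}

func main() {
	_, err := indexFiles([]string{
		"/data/Messages/1.emlx",
		"/data/9/Messages/1.emlx", // same basename, different directory
	})
	fmt.Println(err)
}
```

Returning an error surfaces the collision to the caller, which can then decide whether to abort the import or log and skip the affected files.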

Synthesized from 3 reviews (agents: codex, gemini | types: default, security)
