Fix Apple Mail V10 import: discover messages in numeric partition directories #166
Open
schlabrendorff wants to merge 1 commit into wesm:main from
…ectories

Apple Mail V10 stores large mailboxes by splitting .emlx files across numeric partition subdirectories (0-9) nested under Data/ at arbitrary depths (e.g., Data/0/3/Messages/123.emlx). Previously, only the top-level Data/Messages/ directory was scanned, missing the majority of messages in partitioned mailboxes.

Changes:
- Add FileIndex map to Mailbox struct for resolving partition file paths
- Add FilePath() method for transparent path resolution
- Add recursive partition discovery (hasEmlxFilesInPartitions, collectPartitionFiles) with isDigitDir/isEmlxFile helpers
- Handle partition-only layouts where Data/Messages/ doesn't exist
- Use FilePath() in emlx importer instead of direct path join

Before: 12 mailboxes, 31k files (top-level Messages/ only)
After: 52 mailboxes, 39k files (all partition depths)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
This is a vibecoded fix for #157
Summary
- Discover `.emlx` files in V10 numeric partition subdirectories (`Data/0/3/Messages/`, `Data/9/Messages/`, etc.) at arbitrary nesting depths
- Handle mailboxes where `Data/Messages/` does not exist at all
- Add a `FileIndex` map + `FilePath()` method, requiring only a one-line change in the importer

Problem
Apple Mail V10 splits large mailboxes across single-digit numeric partition directories under `Data/`. A real-world mailbox might look like:

Real `tree -d` output from a 20-year Inbox.mbox (Attachments dirs omitted):

The previous code only looked for `.emlx` files in `Data/Messages/`. It had no awareness of numeric partition directories at all. This meant:

- Missing messages: Any `.emlx` file stored in a partition directory (e.g., `Data/0/3/Messages/123.emlx`) was silently skipped. On a real 20-year Apple Mail archive (~6 accounts), only 363 files across 5 mailboxes were found instead of the full 39,417 files across 32 mailboxes; 99% of messages were invisible.
- Invisible mailboxes: Some mailboxes have no top-level `Data/Messages/` directory, and every message lives in partitions. These mailboxes were not detected at all by `findMessagesDir` (it checked `os.Stat(Data/Messages)` and gave up if that directory didn't exist), so `isMailboxDir` returned false and the entire mailbox was skipped during discovery.
- Path resolution assumed flat structure: The importer used `filepath.Join(mb.MsgDir, fileName)` to build file paths, which only works when all `.emlx` files are in a single `Messages/` directory. Partition files live in different `Messages/` directories at different paths, so this produced incorrect paths that would fail on `os.Stat`.

Solution
The fix adds partition awareness at two levels: discovery (finding files) and resolution (building paths).
Discovery
`findMessagesDir` now checks for partition directories when `Data/Messages/` is absent or empty. It calls `hasEmlxFilesInPartitions`, which recursively walks single-digit subdirectories (0-9) looking for `Messages/` dirs that contain `.emlx` files. This lets `isMailboxDir` return true for partition-only mailboxes.

`listEmlxFiles` now collects files from both the primary `Messages/` dir and all partition `Messages/` dirs. It calls `collectPartitionFiles`, which recursively descends digit directories and collects every `.emlx` file it finds in any nested `Messages/` directory, building a map from filename to the absolute path of its containing directory.

Both recursive functions (`hasEmlxFilesInPartitions` and `collectPartitionFiles`) follow the same pattern: at each directory level, enter `Messages/` to check/collect files, and recurse into any single-digit subdirectory. This mirrors Apple Mail's actual partition structure, which nests digit dirs arbitrarily deep (observed up to 3 levels in real data).

Resolution
The `Mailbox` struct gains a `FileIndex` field (map of filename to its `Messages/` directory path) and a `FilePath()` method. For files in the primary `Messages/` dir, `FileIndex` has no entry and `FilePath` falls back to `filepath.Join(MsgDir, fileName)`, identical to the old behavior. For partition files, `FileIndex` maps the filename to its specific `Messages/` directory, and `FilePath` joins that path with the filename.

The importer change is a single line: `filepath.Join(mb.MsgDir, fileName)` becomes `mb.FilePath(fileName)`. No other importer logic changes because the `Files` slice already contains all filenames (both top-level and partition) and they sort/deduplicate identically.

Code consolidation
Shared logic is extracted into two small helpers to avoid duplication:
- `isDigitDir(name)`: checks if a directory name is a single digit 0-9 (used in 4 places)
- `isEmlxFile(name)`: checks for the `.emlx` suffix, excluding `.partial.emlx` (used in 3 places)

The recursive walks for existence checking (`hasEmlxFilesInPartitions`) and file collection (`collectPartitionFiles`) are each a single function rather than being split across multiple wrappers.

Tests
Three new tests cover the distinct partition scenarios encountered in real Apple Mail data:
- `TestDiscoverMailboxes_V10Partitioned`: The mixed case. A mailbox has files in both the primary `Data/Messages/` and in partition directories at different depths (`Data/0/3/Messages/`, `Data/9/Messages/`). Verifies that all 3 files are discovered, that `FilePath()` resolves each to an existing file on disk, that top-level files are not in `FileIndex` (they resolve via `MsgDir`), and that partition files are in `FileIndex`.
- `TestDiscoverMailboxes_V10PartitionedOnly`: The empty-primary case. `Data/Messages/` exists but contains no `.emlx` files; all messages are in `Data/3/Messages/`. Verifies the mailbox is detected and both partition files are found. This catches a regression where `findMessagesDir` would find the empty `Messages/` dir, `listEmlxFiles` would return zero files, and the mailbox would be silently skipped.
- `TestDiscoverMailboxes_V10NoTopLevelMessages`: The most extreme case. `Data/Messages/` does not exist at all. Messages are in partitions 2-3 levels deep (`Data/9/9/Messages/`, `Data/0/0/1/Messages/`). This is the scenario that motivated the `hasEmlxFilesInPartitions` check in `findMessagesDir`: without it, `os.Stat(Data/Messages)` fails and the mailbox is invisible. Verifies all 3 files across two separate partition paths are found and resolve correctly.

All existing tests continue to pass unchanged, confirming no regression in legacy or standard V10 layouts.
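To make the mixed layout concrete, here is a minimal, self-contained sketch of the discovery walk and path resolution that the first test exercises. It is not the code from this PR: the `Mailbox` struct is a simplified stand-in, and `collect`, `discover`, `writeFixture`, and `exists` are hypothetical names introduced for illustration. For brevity, `collect` merges the primary `Messages/` scan and the partition walk into one function, and it assumes filenames are unique across directories.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"sort"
	"strings"
)

// Mailbox is a simplified stand-in for the struct described in the PR.
type Mailbox struct {
	MsgDir    string            // primary Data/Messages/ directory
	Files     []string          // bare filenames, e.g. "123.emlx"
	FileIndex map[string]string // partition filename -> its Messages/ dir
}

// FilePath resolves partition files via FileIndex and falls back to MsgDir.
func (mb *Mailbox) FilePath(name string) string {
	if dir, ok := mb.FileIndex[name]; ok {
		return filepath.Join(dir, name)
	}
	return filepath.Join(mb.MsgDir, name)
}

func isDigitDir(name string) bool { return len(name) == 1 && name[0] >= '0' && name[0] <= '9' }

func isEmlxFile(name string) bool {
	return strings.HasSuffix(name, ".emlx") && !strings.HasSuffix(name, ".partial.emlx")
}

// collect (hypothetical) records the .emlx files in dir/Messages, then
// recurses into every single-digit subdirectory of dir.
func collect(dir string, mb *Mailbox) {
	msgDir := filepath.Join(dir, "Messages")
	files, _ := os.ReadDir(msgDir) // a missing Messages/ yields no entries
	for _, f := range files {
		if f.IsDir() || !isEmlxFile(f.Name()) {
			continue
		}
		mb.Files = append(mb.Files, f.Name())
		if msgDir != mb.MsgDir {
			mb.FileIndex[f.Name()] = msgDir // partition file
		}
	}
	subdirs, _ := os.ReadDir(dir)
	for _, d := range subdirs {
		if d.IsDir() && isDigitDir(d.Name()) {
			collect(filepath.Join(dir, d.Name()), mb)
		}
	}
}

// discover (hypothetical) builds a Mailbox from a Data/ directory.
func discover(dataDir string) *Mailbox {
	mb := &Mailbox{MsgDir: filepath.Join(dataDir, "Messages"), FileIndex: map[string]string{}}
	collect(dataDir, mb)
	sort.Strings(mb.Files)
	return mb
}

// writeFixture (hypothetical) creates the mixed layout in a temp directory.
func writeFixture() (dataDir string, cleanup func()) {
	root, err := os.MkdirTemp("", "mbox")
	if err != nil {
		panic(err)
	}
	dataDir = filepath.Join(root, "Data")
	for _, rel := range []string{"Messages/1.emlx", "0/3/Messages/2.emlx", "9/Messages/3.emlx"} {
		p := filepath.Join(dataDir, filepath.FromSlash(rel))
		if err := os.MkdirAll(filepath.Dir(p), 0o755); err != nil {
			panic(err)
		}
		if err := os.WriteFile(p, []byte("stub"), 0o644); err != nil {
			panic(err)
		}
	}
	return dataDir, func() { os.RemoveAll(root) }
}

func exists(path string) bool {
	_, err := os.Stat(path)
	return err == nil
}

func main() {
	dataDir, cleanup := writeFixture()
	defer cleanup()
	mb := discover(dataDir)
	for _, name := range mb.Files {
		fmt.Println(name, "->", mb.FilePath(name), exists(mb.FilePath(name)))
	}
}
```

In this sketch only the two partition files end up in `FileIndex`; the top-level file resolves through the `MsgDir` fallback, which is the property the mixed-case test checks.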
Verification against real data
Tested against a real Apple Mail archive spanning 20+ years (~6 accounts, 39,417 `.emlx` files on disk):

The 5 mailboxes found before were the only ones with any `.emlx` files in the top-level `Data/Messages/` directory. The remaining 27 mailboxes stored all messages exclusively in partition directories and were completely invisible.

Future simplification
The current `Mailbox.Files` field stores bare filenames (e.g., `"123.emlx"`), with a separate `FileIndex` map to resolve partition files back to their containing directory. This two-layer design exists because `Files` predates partition support: when all files lived in a single `Messages/` dir, filenames alone were sufficient.

A cleaner approach would be to store full absolute paths in `Files` directly and drop `FileIndex`/`FilePath()` entirely. The discovery walk already knows the full path of every file it finds; the current code discards that information and reconstructs it later via the map lookup. Storing full paths would simplify the `Mailbox` struct, eliminate `FileIndex`, and make the recursive walk simpler (just collect paths, no map needed). The importer would use the path directly instead of calling `FilePath()`. The main thing to watch is that resume checkpoints currently store the last processed filename; that format would need to change to full paths.
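As a rough illustration of that simplification, the sketch below collects full paths directly, with no map. It is only a sketch under stated assumptions: `collectEmlxPaths`, `resumeAfter`, and `writeFixture` are hypothetical names, not part of this PR, and the checkpoint is assumed to be the full path of the last processed file.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"sort"
	"strings"
)

func isDigitDir(name string) bool { return len(name) == 1 && name[0] >= '0' && name[0] <= '9' }

func isEmlxFile(name string) bool {
	return strings.HasSuffix(name, ".emlx") && !strings.HasSuffix(name, ".partial.emlx")
}

// collectEmlxPaths (hypothetical) returns the full path of every .emlx file in
// dir/Messages and in Messages/ dirs under nested single-digit directories.
func collectEmlxPaths(dir string) []string {
	var paths []string
	msgDir := filepath.Join(dir, "Messages")
	files, _ := os.ReadDir(msgDir) // a missing Messages/ yields no entries
	for _, f := range files {
		if !f.IsDir() && isEmlxFile(f.Name()) {
			paths = append(paths, filepath.Join(msgDir, f.Name()))
		}
	}
	subdirs, _ := os.ReadDir(dir)
	for _, d := range subdirs {
		if d.IsDir() && isDigitDir(d.Name()) {
			paths = append(paths, collectEmlxPaths(filepath.Join(dir, d.Name()))...)
		}
	}
	return paths
}

// resumeAfter (hypothetical) returns the paths that sort after the checkpoint,
// assuming sorted is in ascending order and the checkpoint is a full path.
func resumeAfter(sorted []string, checkpoint string) []string {
	i := sort.SearchStrings(sorted, checkpoint)
	if i < len(sorted) && sorted[i] == checkpoint {
		i++
	}
	return sorted[i:]
}

// writeFixture (hypothetical) creates a small mixed layout in a temp directory.
func writeFixture() (dataDir string, cleanup func()) {
	root, err := os.MkdirTemp("", "mbox")
	if err != nil {
		panic(err)
	}
	dataDir = filepath.Join(root, "Data")
	for _, rel := range []string{"Messages/1.emlx", "0/3/Messages/2.emlx", "9/Messages/3.emlx"} {
		p := filepath.Join(dataDir, filepath.FromSlash(rel))
		if err := os.MkdirAll(filepath.Dir(p), 0o755); err != nil {
			panic(err)
		}
		if err := os.WriteFile(p, []byte("stub"), 0o644); err != nil {
			panic(err)
		}
	}
	return dataDir, func() { os.RemoveAll(root) }
}

func main() {
	dataDir, cleanup := writeFixture()
	defer cleanup()
	paths := collectEmlxPaths(dataDir)
	sort.Strings(paths)
	// Pretend the first path was already imported and resume from there.
	for _, p := range resumeAfter(paths, paths[0]) {
		fmt.Println(p)
	}
}
```

One design point to weigh: sorting full paths orders files by directory first, which differs from the current filename ordering, so a migration would need to decide which ordering resume should follow.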