This repository was archived by the owner on Dec 31, 2025. It is now read-only.
[WIP] Ingest outlook pst files using pypff #20
Closed
Fixes https://github.com/alephdata/ingestors/issues/13.
Some notes about the implementation:
This implementation opens the PST file and walks its folders recursively. Each folder is exported to a temp directory, along with the emails and any other files we can recognize, preserving the folder hierarchy. The temp directory is then fed to the DirectoryIngestor.
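The recursive export described above could look roughly like the sketch below. It is not the code from this PR: the folder and message objects are duck-typed stand-ins, and the attribute names (`name`, `sub_folders`, `sub_messages`, `subject`, `transport_headers`, `plain_text_body`) are assumptions modelled on how I understand the pypff bindings. The helpers `safe_name`, `to_text`, `export_folder` and `demo` are hypothetical names used only for illustration.

```python
import os
import re
import tempfile
from types import SimpleNamespace


def safe_name(name, fallback):
    """Turn a folder name or subject into a filesystem-safe name."""
    cleaned = re.sub(r"[^\w.\- ]", "_", name or "").strip()
    return cleaned or fallback


def to_text(value):
    """Accept bytes or str (or None) and always return str."""
    if isinstance(value, bytes):
        return value.decode("utf-8", errors="replace")
    return value or ""


def export_folder(folder, target_dir):
    """Recursively mirror a PST-like folder tree under target_dir.

    Assumption: `folder` exposes `sub_messages` and `sub_folders`
    iterables, similar to pypff folder objects.
    """
    os.makedirs(target_dir, exist_ok=True)
    for index, message in enumerate(folder.sub_messages):
        file_name = "%s_%d.eml" % (safe_name(message.subject, "message"), index)
        headers = to_text(message.transport_headers).rstrip("\r\n")
        body = to_text(message.plain_text_body)
        path = os.path.join(target_dir, file_name)
        with open(path, "w", encoding="utf-8") as fh:
            fh.write(headers + "\n\n" + body)
    for index, child in enumerate(folder.sub_folders):
        child_name = safe_name(child.name, "folder_%d" % index)
        export_folder(child, os.path.join(target_dir, child_name))


def demo():
    """Export a tiny stub tree and return the relative paths written."""
    message = SimpleNamespace(
        subject="Hello/World",
        transport_headers="Subject: Hello/World\r\n",
        plain_text_body=b"Hi there",
    )
    inbox = SimpleNamespace(name="Inbox", sub_folders=[], sub_messages=[message])
    root = SimpleNamespace(name="root", sub_folders=[inbox], sub_messages=[])
    exported = []
    with tempfile.TemporaryDirectory() as temp_dir:
        export_folder(root, temp_dir)
        # At this point the temp directory would be handed to the
        # directory ingestor; here we only list what was written.
        for dirpath, _dirnames, filenames in os.walk(temp_dir):
            for filename in filenames:
                full_path = os.path.join(dirpath, filename)
                exported.append(os.path.relpath(full_path, temp_dir))
    return sorted(exported)
```

With a real PST, the root object would presumably come from something like `pypff.file()`, `open(path)` and `get_root_folder()`; treat that call sequence as an assumption to check against the pypff version in use.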
As far as I can see, there are two issues with this implementation.
Ideally, we should parse the email files only once. With this implementation we end up parsing them twice: once to export them and again to ingest them.
Some files are not parsed correctly. For example, some messages have no transport headers, so they are parsed as HTML files, yet some of these HTML files have attachments. For now I'm exporting those attachments as separate files in the same parent folder. Similarly, some messages contain only RTF text; Aleph tries to show them as PDF documents, which of course fails.
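To make the second issue concrete, here is a minimal sketch of how an exporter might pick a file extension from whatever content a message actually carries. This is an illustration, not what the PR does: `export_extension` is a hypothetical helper, and the attribute names `transport_headers`, `html_body` and `rtf_body` are assumptions modelled on pypff message objects.

```python
from types import SimpleNamespace


def export_extension(message):
    """Choose an export extension from the content a message has.

    Assumption: missing parts are None or empty, as on the stub
    objects below.
    """
    if getattr(message, "transport_headers", None):
        return ".eml"   # full email with headers
    if getattr(message, "html_body", None):
        return ".html"  # no headers, HTML body only
    if getattr(message, "rtf_body", None):
        return ".rtf"   # RTF-only message
    return ".txt"       # fall back to plain text


# Stub messages covering the cases mentioned above.
with_headers = SimpleNamespace(
    transport_headers="Subject: hi", html_body=None, rtf_body=None
)
html_only = SimpleNamespace(
    transport_headers=None, html_body=b"<p>hi</p>", rtf_body=None
)
rtf_only = SimpleNamespace(
    transport_headers=None, html_body=None, rtf_body=b"{\\rtf1 hi}"
)
```

Tagging RTF-only messages with their own extension would at least let the downstream ingestor tell them apart, instead of having them fall through to a handler that cannot render them.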
To avoid parsing the files twice, I tried implementing the PST ingestor in a non-recursive way. That didn't work out well, because Aleph expects ingestion to be recursive and otherwise doesn't create a child document for a nested result.
On the libpff side, this PR depends on libyal/libpff#69 getting merged, so I'm waiting on that to fix the build. Alternatively, we could build libpff from source from a fork.