This repository was archived by the owner on Dec 31, 2025. It is now read-only.

Conversation

@sunu
Contributor

@sunu sunu commented Sep 18, 2018

Fixes https://github.com/alephdata/ingestors/issues/13.

Some notes about the implementation:

In this implementation, we open the PST file and walk its folders recursively. Each folder is exported to a temp directory, along with the emails and any other files we can recognize, preserving the folder hierarchy. We then feed that temp directory to the DirectoryIngestor.
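A minimal sketch of that export step, to make the flow concrete. The `Folder` and `Message` classes, `safe_name`, `export_folder` and `demo` are hypothetical stand-ins invented for illustration; they are not the pypff API and not the code in this PR, and the `.eml` extension is an assumption.

```python
import tempfile
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class Message:
    # Hypothetical stand-in for a PST message; not the pypff API.
    name: str
    payload: bytes


@dataclass
class Folder:
    # Hypothetical stand-in for a PST folder; not the pypff API.
    name: str
    messages: list = field(default_factory=list)
    folders: list = field(default_factory=list)


def safe_name(name, fallback):
    # Replace anything that is not clearly safe in a file name.
    cleaned = "".join(c if c.isalnum() or c in "-_. " else "_" for c in name)
    # Strip spaces and dots so names like ".." cannot escape the target.
    return cleaned.strip(" .") or fallback


def export_folder(folder, target):
    """Recursively mirror `folder` under `target`, one file per message."""
    target.mkdir(parents=True, exist_ok=True)
    for index, message in enumerate(folder.messages):
        # The index prefix keeps messages with the same subject apart.
        file_name = "%04d_%s.eml" % (index, safe_name(message.name, "message"))
        (target / file_name).write_bytes(message.payload)
    for index, child in enumerate(folder.folders):
        child_name = safe_name(child.name, "folder_%d" % index)
        export_folder(child, target / child_name)


def demo():
    # Build a tiny in-memory tree and export it to a temp directory.
    archive = Folder("Archive", messages=[Message("Old", b"Subject: Old\r\n\r\nBye")])
    inbox = Folder("Inbox", messages=[Message("Hello", b"Subject: Hello\r\n\r\nHi")],
                   folders=[archive])
    root = Folder("root", folders=[inbox])
    with tempfile.TemporaryDirectory() as temp_dir:
        export_folder(root, Path(temp_dir))
        # This is the point where the directory would be handed to
        # the DirectoryIngestor.
        return sorted(p.relative_to(temp_dir).as_posix()
                      for p in Path(temp_dir).rglob("*.eml"))
```

Calling `demo()` returns the relative paths of the exported files, which shows that the on-disk layout follows the folder hierarchy of the in-memory tree.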

There are two issues with the implementation, as far as I can see.

Ideally, we would parse each email file once. With this implementation we end up parsing the files twice: once to export them and again to ingest them.

Some files are not parsed correctly. For example, some messages don't have transport headers, so they are parsed as HTML files. But some of these HTML files have attachments; for now I'm exporting those attachments as separate files in the same parent folder. Similarly, some messages contain only RTF text. Aleph tries to show them as PDF documents, which of course fails.
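The second issue can be pictured as a decision about which file type each message is exported as. The sketch below is only an illustration of that decision under stated assumptions: `choose_extension` and `attachment_path` are hypothetical helpers, the extensions are guesses, and none of this is taken from the PR's actual code.

```python
from pathlib import Path


def choose_extension(transport_headers, html_body, rtf_body):
    """Pick an export extension for one message; arguments may be None."""
    if transport_headers:
        # A message with transport headers can be written as an email file.
        return ".eml"
    if html_body:
        # No headers: the message falls back to a plain HTML file.
        return ".html"
    if rtf_body:
        # RTF-only messages are the case that ends up displayed badly.
        return ".rtf"
    return ".txt"


def attachment_path(message_path, attachment_name):
    # Attachments are written as siblings of the exported message,
    # i.e. into the same parent folder.
    return Path(message_path).parent / attachment_name
```

For example, `choose_extension(None, "<p>hi</p>", None)` picks `.html`, and an attachment for `Inbox/0001_report.html` lands in `Inbox/` next to it, so the link between a message and its attachments is only implied by the shared folder.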

To avoid parsing the files twice, I tried implementing the PST ingestor in a non-recursive way. That didn't work out well, because Aleph expects ingestion to be recursive and otherwise doesn't create any child documents for a nested result.


On the libpff side, this PR depends on libyal/libpff#69 being merged, so I'm waiting on that to fix the build. Alternatively, we could build from source from a fork.

@pudo
Contributor

pudo commented Feb 4, 2019

Can you remind me of the status of this?

@sunu
Contributor Author

sunu commented Feb 11, 2019

The PR to libpff is still not merged. The build is passing because it uses my fork. The implementation details are the same as described in the original comment. It pulls out more files than readpst, but some of those extra files can't be parsed correctly.

Development

Successfully merging this pull request may close these issues.

Consider doing PST support inline?
