Faster email parser #5

@rth

The default email.Parser (which converts the raw email text to a structured dict) is written in pure Python in the standard library and is somewhat slow. As a result, when threading emails, the performance bottleneck is the email parsing. Here is a benchmark for a dataset of 5,000 emails:

  • email.Parser: 33.329 s
  • converting to jwthreading.Message format: 0.121 s
  • the JWZ threading algorithm: 0.031 s
  • sorting of threads: 0.002 s

Possible solutions:

  • use the MIME parser from https://github.com/mailgun/flanker (though it has no PY3 support for the moment and pulls in a lot of additional dependencies)
  • adapt https://github.com/jkr/pygmime (no PY3 support either, and cross-platform support would be difficult)
  • write a custom simplified email.Parser (the JWZ algorithm only requires the References:, In-Reply-To: and Subject: header fields)
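
To make the third option more concrete, here is a minimal sketch of what such a simplified parser could look like. It is not the project's implementation: the function name `parse_threading_headers` and the `WANTED` tuple are hypothetical, and it assumes the raw email is already available as a decoded string. It only scans the header block, unfolds continuation lines, and keeps the three fields needed for threading. It does not decode RFC 2047 encoded words or handle MIME bodies.

```python
# Hypothetical helper, for illustration only (not part of any existing library).
WANTED = ("references", "in-reply-to", "subject")


def parse_threading_headers(raw):
    """Return a dict holding only the headers needed for JWZ threading.

    Keys are lower-cased header names. Parsing stops at the first blank
    line, so the message body is never scanned.
    """
    headers = {}
    current = None  # name of the wanted header being read, if any
    for line in raw.splitlines():
        if not line.strip():
            break  # a blank line ends the header block
        if line[0] in " \t":
            # Folded continuation line: append to the current header.
            if current is not None:
                headers[current] += " " + line.strip()
            continue
        name, sep, value = line.partition(":")
        name = name.strip().lower()
        if sep and name in WANTED:
            current = name
            headers[current] = value.strip()  # last occurrence wins
        else:
            current = None  # ignore this header and its continuations
    return headers


if __name__ == "__main__":
    sample = (
        "From: someone@example.com\r\n"
        "Subject: Re: hello\r\n"
        "References: <a@example.com>\r\n"
        " <b@example.com>\r\n"
        "In-Reply-To: <b@example.com>\r\n"
        "\r\n"
        "Body text\r\n"
    )
    print(parse_threading_headers(sample))
```

Because the loop stops at the first blank line and never builds a message tree, the cost per email should be roughly proportional to the size of the header block rather than the whole message. Whether that is enough to remove the bottleneck would need to be measured on the same 5,000-email dataset.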
