The default email.Parser (which converts the raw email text to a structured dict) is written in pure Python in the standard library and is somewhat slow. As a result, when threading emails, the performance bottleneck is the email parsing. Here is a benchmark for a dataset of 5,000 emails:
- email.Parser: 33.329 s
- converting to jwthreading.Message format: 0.121 s
- the JWZ threading algorithm: 0.031 s
- sorting of threads: 0.002 s
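For anyone who wants to reproduce per-stage timings like the ones above, here is a minimal sketch. It assumes Python 3, where the parser lives in `email.parser.Parser`. The `time_stage` and `parse_all` helpers and the synthetic messages are hypothetical stand-ins for illustration, not the code or dataset used for the benchmark.

```python
import time
from email.parser import Parser


def time_stage(func, *args):
    """Run func(*args) once and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


def parse_all(raw_messages):
    """Parse every raw message with the standard library parser."""
    parser = Parser()
    return [parser.parsestr(raw) for raw in raw_messages]


# Tiny synthetic input (assumption: replace with the real raw emails).
raw_messages = [
    "Subject: test %d\nMessage-ID: <%d@example.com>\n\nbody\n" % (i, i)
    for i in range(100)
]

parsed, elapsed = time_stage(parse_all, raw_messages)
print("email.Parser: %.3f s" % elapsed)
```

The same `time_stage` wrapper can be put around the conversion, threading and sorting steps to get one figure per stage.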
A solution could be to:
- use the MIME parser from https://github.com/mailgun/flanker (though it has no PY3 support for the moment and pulls in a lot of additional dependencies)
- adapt https://github.com/jkr/pygmime (no PY3 support either, and cross-platform support would be difficult)
- write a custom simplified email.Parser (we only require the References:, In-Reply-To: and Subject header fields for the JWZ algorithm)
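To make the last option concrete, here is a rough sketch of what a simplified header-only parser could look like. The names `parse_threading_headers` and `message_ids` are hypothetical. The sketch assumes well-formed headers, keeps only the first occurrence of each field, unfolds continuation lines, and does not decode RFC 2047 encoded words or touch the message body. It has not been benchmarked against the numbers above.

```python
import re

# Header fields needed for the JWZ algorithm, per the list above (lower-cased).
WANTED = frozenset({"references", "in-reply-to", "subject"})


def parse_threading_headers(raw, wanted=WANTED):
    """Return {lower-cased field name: value} for the wanted headers only."""
    headers = {}
    current = None  # wanted header currently being continued, if any
    for line in raw.splitlines():
        if not line.strip():
            break  # blank line ends the header block; the body is skipped
        if line[0] in " \t":
            # Folded continuation line: append it to the header being read.
            if current is not None:
                headers[current] += " " + line.strip()
            continue
        name, sep, value = line.partition(":")
        name = name.strip().lower()
        if sep and name in wanted and name not in headers:
            headers[name] = value.strip()
            current = name
        else:
            current = None  # unwanted or repeated field: ignore its folds
    return headers


def message_ids(value):
    """Extract the <...> message IDs from a References/In-Reply-To value."""
    return re.findall(r"<[^<>\s]+>", value)


raw = (
    "From: a@example.com\r\n"
    "Subject: Re: parser\r\n"
    " speed\r\n"
    "In-Reply-To: <1@example.com>\r\n"
    "References: <0@example.com>\r\n"
    "\t<1@example.com>\r\n"
    "\r\n"
    "Subject: not a header\r\n"
)
headers = parse_threading_headers(raw)
print(headers["subject"])
print(message_ids(headers["references"]))
```

Stopping at the first blank line means the body is never scanned, which is where most of the savings over a full MIME parse would be expected to come from.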