Add DKIM style ID generation #517
tools/generators.py
    v = v.replace(b"\r\n", b"")
    v = v.replace(b"\t", b" ")
    v = v.strip(b" ")
    v = b" ".join(vv for vv in v.split(b" ") if vv)
NOTE: Used --amend to fix final v to vv here.
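For reference, the four lines under review can be exercised in isolation. The sketch below wraps them in a function; the name `relaxed_header_value` and the sample input are assumptions for illustration and do not appear in the PR.

```python
def relaxed_header_value(v: bytes) -> bytes:
    """Normalise whitespace in a header value, following the reviewed lines."""
    v = v.replace(b"\r\n", b"")   # unfold: drop CRLF line breaks
    v = v.replace(b"\t", b" ")    # treat tabs as spaces
    v = v.strip(b" ")             # trim leading and trailing spaces
    # Collapse runs of spaces by dropping the empty pieces left by split()
    return b" ".join(vv for vv in v.split(b" ") if vv)


sample = b"  a \t b\r\n   c  "
normalised = relaxed_header_value(sample)
```

With the `vv` fix mentioned in the note, the generator expression filters on each piece rather than on the whole value, so empty pieces are dropped and runs of spaces collapse to one.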
|
Please note that this PR also adds an extra parameter passing the bytes of the original message to the |
|
Thanks very much. This looks very good. However I think the hash won't be sufficiently unique, given that PonyMail uses the hash both for the Permalink and for storing the copy of the email. It is possible that two separate messages sent to a mailing list can end up with the same hash. But list software such as ezmlm generally does not care what an email contains and would treat this as a new message, so would generate a new sequence number for the message. This could result in gaps in the PonyMail record, since that uses the hash as a unique id for the message. There would be no way to know what caused the missing sequence number -- was it such a duplicate, or did the email get lost in transit? -- without access to a separate archive. I think this problem can be avoided by including some of the last received headers in the hash. The ones after any List-xxx headers added by the list software will relate to the journey the mail took to reach the mail server, so should not be affected by subsequent deliveries to mail archivers. Note: this is only an issue because PonyMail uses the same hash for Permalinks and the database id. |
|
Isn't that just trading one potential risk (if you will) for another? I think the DKIM list makes sense, and would trade the stable generated hash over whether or not you can debug ezmlm with it. Having said all that, what I could envision is for the next generation pony mail to use this generator for the permalink, but store multiple copies of the source, as they come in, and referencing them in a metadata table (or just storing the pibble in the source as well, but not use it for the document ID). When viewing the source, you could be presented with all existing options. Thus, you would have just the one copy of the email when viewing the list (which I think makes the most sense in any case), but multiple options when viewing the source. That is, to sum up, I like this generator and what it accomplishes - I think it looks far better than our previous/current solutions, and if not for this version, I'd be interested in pulling this PR into the next generation of pony mail. |
|
To expand upon my previous comment: Email A comes in. It gets "pibbled" to abcdefg1234. A SHA3 digest is 123412341234 |
|
AFAICT, so long as one only takes into account the Received headers that relate to the hops before arrival at ezmlm, all recipients of the email will be able to generate the same hash. If for some reason the email does not have a list-id or another way of determining whether the Received header was added before arrival at the list server, then ignore the header. Note that this is not a question of 'debugging' ezmlm. For example, can you tell me why As to your example, if email A and email B are identical, I don't see how they will get different pibbles/digests unless at least some of the received headers are taken into account. Please explain. |
|
They would have the same pibble, but different SHA3 if the SHA3 is done using the full message source. What headers are/aren't in the source wouldn't matter, as it would refer to the same pibble at any time because of how the DKIM generator works. As for the return path you mention, that is specific to a specific installation. I don't think we should allow/deny PRs based on just that. I for one don't get such return paths in my mbox source, as it goes through another alias before it hits my inbox - the same could be true for the archiver, in which case it wouldn't matter what the original return path was. |
|
Email A: pibble is abcdefg1234, SHA3 of full message is 123412341234 When importing from a third source (email C) into a DB from scratch, pibble would again be abcdefg1234, and SHA3 could perhaps be 111222333444, it wouldn't matter as the pibble metadata is the same, so a search would find it in the DB. |
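The storage idea in this comment can be sketched roughly as follows: each distinct source is stored under the SHA3 digest of the full message, and a separate index maps the stable permalink ID to every source that shares it. This is only a sketch under stated assumptions. `SourceStore` and `toy_permalink` are hypothetical names, and `toy_permalink` merely drops `Received` headers as a stand-in for the DKIM style generator.

```python
import hashlib
from collections import defaultdict


def toy_permalink(raw: bytes) -> str:
    """Hypothetical stand-in for the stable ID: ignores Received headers."""
    kept = [line for line in raw.split(b"\r\n")
            if not line.lower().startswith(b"received:")]
    return hashlib.sha3_256(b"\r\n".join(kept)).hexdigest()[:16]


class SourceStore:
    """Stores every distinct source once, indexed by the shared permalink ID."""

    def __init__(self):
        self.sources = {}                      # full-source digest -> raw bytes
        self.by_permalink = defaultdict(list)  # permalink ID -> source digests

    def add(self, raw: bytes) -> str:
        source_id = hashlib.sha3_256(raw).hexdigest()
        if source_id not in self.sources:
            self.sources[source_id] = raw
            self.by_permalink[toy_permalink(raw)].append(source_id)
        return source_id

    def lookup(self, permalink: str) -> list:
        return [self.sources[s] for s in self.by_permalink.get(permalink, [])]


email_a = b"Received: from hop-a\r\nSubject: hi\r\n\r\nbody"
email_b = b"Received: from hop-b\r\nSubject: hi\r\n\r\nbody"
store = SourceStore()
id_a = store.add(email_a)
id_b = store.add(email_b)
```

Here the two copies get different source digests but the same permalink ID, so a lookup by permalink returns both sources.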
|
Furthermore, blue-skying here, this could be made backwards compatible with older databases easily. Thus, to figure out if we need to access directly via permalink ID or via a pibble keyword search, we'd just assess whether the ID of the email being accessed is a pibble or not. Things to consider for later:
|
|
On closer examination, I see that the DKIM generator has several options as to how the hash is generated. This means that the generated hash will depend on which options are chosen, rather than just on the mail content. If for some reason emails have to be reloaded either in the same installation or another, it is vital that the same hashes are generated. My conclusion is that options cannot be allowed if Permalinks are to be truly permanent. |
|
Color me stupid, but...you would manually have to go in and change those settings to get a different result, would you not? The nonce is the only option I could see anyone purposefully changing, and that's not defined in the generator itself, but in the pony mail config. |
|
Yes, you would have to change the options. I see the DKIM hash being used as a more reliable Message-Id. |
|
The optional nonce makes things worse as there are effectively infinite values it can take. |
|
If different instances have different settings, then that is the problem of the person who set them up, not ours, and not the generator's fault. As for the other comment, infinite values is exactly the point of the nonce as I read it. It is to prevent collision attacks from being possible. In my mind, this is a great addition to the generator. If I set up an instance with a nonce, only I can reasonably collide IDs (with a lot of effort still), no one else. If you don't want that extra security, don't set a nonce. |
|
An 80 bit truncated hash provides 80 bits of preimage resistance, but only 40 bits of collision resistance. In terms of Ponymail, preimage resistance prevents forgery of new messages with existing IDs. Collision resistance prevents a forger generating two different messages with a single new ID. Back of the envelope calculations, which I am not claiming as the actual security of this PR code (i.e. this must be independently verified), using SUPERCOP benchmarks show that 40 bits enables a forgery under that model in a matter of hours. 64 bit security would be possible only by a nation-state actor. Forgery under 80 bit security is infeasible now, and is likely to remain so for a reasonable duration. In other words, 80 bit encoded DKIM style IDs are not suitable for use in the way that you suggest, as a nonce-free identifier for abstract emails. They would be far too easily forged, in a matter of hours even with standard hardware. This is why the option for adding a nonce exists. General forgeries could be mitigated by storing the actual DKIM signature (if present) as metadata within Ponymail, and this would probably be a good idea in any case. Identifier forgeries can be partially mitigated in Ponymail by canonically showing the first message to have been received by the system. But this only works for currently hosted mailing lists; it does not work for message archive imports. A minimum of 128 bit security is standard for security in general scenarios. That is why 256 bit hashes have become widespread, because they provide 128 bits of collision resistance security. But for Ponymail preimage resistance is the primary concern, and collision resistance less so. For your use case I would suggest 128 bit identifiers, providing 64 bits of collision resistance. The code could still be written with the idea that collision attacks would be feasible, providing further mitigations even in this very unlikely case. 
Another advantage of using 128 bits is that SHAKE-128 output could be used untruncated. The alternative base32 pibble encoding proposed in this PR has the advantages that it is case insensitive and avoids a wide range of cultural taboo substrings, but it is less compact than urlsafe base64, which would be a potential alternative if using longer IDs. Here is an example of a 16 character pibble encoded 80 bit identifier: And here is a 22 character base64 encoded 128 bit identifier: The hashes of original messages should use 256 bit hashes. They could use SHA3-256 and be stored using urlsafe base64 encoding. Specifically, it is not secure to use 128 bit identifiers because the threat model is different. The DKIM style identifier is a one-to-many mapping, where the many values are preserved. Collisions in that context can be mitigated. The message source identifier is a one-to-one mapping, and so a collision would result in lost data, which is not acceptable. In summary: if the nonce option is preserved, 80 bit truncated cryptographically secure hash (CSH) identifiers may be suitable for generic messages but not for message sources. If the nonce option is removed, 128 bit CSH identifiers may be suitable for generic messages, but not for message sources. Only 256 bit CSH identifiers are suitable for message sources. I apologise that you were led astray by the ignis fatuus of the
The options can even be safely removed if |
|
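The figures in the comment above follow from the generic birthday bound and from the bits carried per encoded character. The sketch below reproduces that arithmetic only; it assumes an ideal hash and says nothing about the actual security of the PR code. The function names are illustrative.

```python
import math


def collision_bits(id_bits: int) -> int:
    """Generic collision resistance of an ideal n-bit hash: about n/2 bits."""
    return id_bits // 2


def birthday_collision_probability(id_bits: int, attempts: int) -> float:
    """Approximate chance of at least one collision among `attempts` random IDs."""
    exponent = attempts * (attempts - 1) / 2 ** (id_bits + 1)
    return -math.expm1(-exponent)  # 1 - exp(-exponent), computed stably


def encoded_length(id_bits: int, bits_per_char: int) -> int:
    """Characters needed to encode id_bits without padding."""
    return math.ceil(id_bits / bits_per_char)


p_80 = birthday_collision_probability(80, 2 ** 40)
```

For an 80 bit ID, about 2^40 attempts already give a collision chance near 39%, which is why the comment treats 80 bits as 40 bits of collision resistance. Base32 carries 5 bits per character and base64 carries 6, giving the 16 and 22 character lengths quoted.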
@sbp This sounds like we have two options here then:
As for using the complete 256 bit CSH for sources, I am fully on board there, but I think this is best saved for the next gen. That doesn't mean the dkim generator isn't suitable for this gen (from what I can tell, it is far superior for clustered environments compared to any other algorithm), but rather that we'd save using the 256 bit CSH for sources for the next generation. Does that make sense? |
|
@sbp == Not all emails in archives have List- headers. |
|
AIUI, the destination list-id (not the one in the origin, which may not exist) is appended in the generator with: headers.append([b"X-Archive-List-ID", xali_value]) The destination list ID is thus both in the hash and also in the mbox document (as it always is). |
|
I mean that the Permalink should include the list id, as it does at present. For example: aabbcc@<lid> |
|
I don't think the ID should include the list name by default, I like it short and neat - makes life easier for people using links :) |
|
Collision forgery would require control over entire input messages, unless the source identifier algorithm uses a subset. It also does not enable attacks against the identifiers of existing messages. If a It is reasonable to save this feature for the next generation as long as at least one kind of existing message source identifier has enough collision resistance to make these attacks impractical. Since a range of identifiers are available, their security levels could be noted in the documentation so that implementers can understand the security consequences and decide. |
|
If an externally provided list-id is included in the hash, then the hash will change if the lid changes. It is vital that the hash depends only on the message source (possibly plus a fixed pepper), otherwise reloading the messages may generate a different hash and thus Permalink. As to characters such as @<> being a problem, I agree, but there are other characters that could be used to separate the lid from the hash. For example one could use hash_dev.ponymail.apache.org. I agree that it would be nice to have shorter links, but that should not be at the cost of unstable Permalinks. r3358b63557d3a40e179f6ca498a38b9aaf0b2532aba48bfc03c7a1a0@<dev.project.apache.org> |
|
Arguably, if you use a custom list ID different from what's in the source, you are going to potentially 404 your permalink in any case if you change it or forget what it was when reimporting, no matter what generator you use. All current generators use the list ID in their input/output. The only difference of real importance, in my view, is that the list ID is visible in current generated IDs, and hidden in the dkim generator to get as short an ID as is reasonably possible (by hidden I mean it's used inside the hash, instead of being plaintext outside the hash). I will agree that having the list ID in the generated ID can be very helpful for administrators if they need to reimport and had a lot of custom overrides they can't remember or don't have a backed-up configuration of. What if we make this an option for the administrator to decide on? Thus, we could accept this as is, and then have a second, the current dkim generator appeals to a certain group of people, and your suggestion of appending the list ID will appeal to other groups of people. Neither solution will be 100.00% stable against all edge cases (take for instance losing your database and re-importing from gmail mbox sources, that would not work). Let the administrators running PM decide between a shorter ID for neater URLs, or a longer ID that could make recovery easier. But let that be their decision, I don't think we should be imposing one or the other option. |
|
"re-importing from gmail mbox sources, that would not work" - why not? It would certainly work with the mod_mbox software, as that relies on an intrinsic part of the message (Message-Id) That is surely one of the main ideas behind the dkim hash - generate an id that is the same for all instances of the same original message? |
|
gmail does some (nasty) normalization of header values, such as lower-casing email addresses, which is not standard practice, so you cannot reliably generate the same hash for all emails if your previous import was based off non-gmail mbox files. I've run into this problem a few times, where the mbox address had caps in it, and gmail removed those caps, hence why I know. |
|
In which case maybe dkim should do the same normalisation as GMail to avoid the issue? |
|
Mails where we use a list override (that is, any mail that arrives to an archiver with --lid $something) should have the list ID appended to the generated ID as you proposed. All emails that do not go through a list override should not need to have anything appended. Does this satisfy you both? :) It would satisfy my requirements. |
|
I think that should work for mails that are archived. For the importer, generally it makes sense to always provide the --lid override to ensure the mails are added to the expected list. I think it makes sense to pick up an idea from the original PR, which is to only add the lid if it is different from the lid (if any) in the email. However instead of including the lid in the hash input as before, now it is appended to the generated Permalink. I hope this would avoid most of the issues with reproducibility of Permalinks. |
|
Imagine we have a small mailing list archive in an mbox file called Now the catastrophic scenario happens, and the Ponymail database is lost! But we still have The idea behind putting There are a couple of problems with this:
Therefore, the only reason to put a manual List-ID in permalinks is to support an unreliable backup strategy. The strategy is unreliable because it depends on arbitrary users to retain copies of the permalinks that can then be consulted to restore the data in the case of catastrophic database loss. And an unreliable backup strategy is made unacceptable when there is a reliable alternative. One alternative is that we can just rename Instead, here is a reliable alternative: Imagine we import our three emails from When we performed this import, we generated the three DKIM-IDs We now perform standard backup procedures for Does this strategy scale? Consider a very large mailing list that has a million emails in it. The manual List-ID What are the problems with this strategy? Unlike the unreliable and unacceptable backup strategy described above, it does not rely on arbitrary users or search engines to back up our data for us. It does not lead to the problem of wondering who to consult to restore that data. It follows established, standard industry practices for backing up our manual List-IDs, instead of the existing ad hoc and idiosyncratic method. For that reason, I could never recommend the strategy where manual List-IDs are part of the permalinks. I could never recommend that people use it as their backup strategy, because this superior strategy is available instead and it ticks all the boxes. It is, however, sometimes necessary to include the manual List-ID in the URL somewhere for UI purposes. Consider the email It doesn't know which List-ID to present to the user. In fact, in Ponymail and in Foal it doesn't even retain the information that this message was sent to six mailing lists if the DKIM-ID is the primary permalink, which is necessary in Ponymail if DKIM-ID is used at all, and is not necessary in Foal but is still possible. This would not be a problem if the manual List-ID were part of the permalink, but it solves one problem and causes another.
DKIM-IDs were designed to deduplicate emails. If List-IDs are part of the DKIM-ID permalink, this means we would have to store six copies of the Thankfully there is a simple solution. In Foal commit Then, if the user browses the URL above: They can be presented with a list showing all List-IDs that this message belongs to, and the option to display the message in its context in those lists, specialising its UI. Or, they can still browse a version that contains the List-ID: But, importantly, To argue that manual List-IDs should be part of DKIM-IDs would remove all of the above. In particular:
|
|
Seems to me if no-one has a copy of the permalink then it does not matter as much if it is lost. Whereas the other way round, if there is a known permalink, but it is not in the database, it is vital to be able to reconstruct it. If the link contains a lid suffix, then it should be possible to find the mail by re-loading all the relevant archives. If there is no separate lid suffix in the missing Permalink, then the entire corpus may need to be reloaded unless there is some context that helps narrow down the search (e.g. dates, possible list names). The advantage of the above approach is that there is no need for additional backup data beyond the original archives. |
|
What if somebody has a permalink that the administrator doesn't know about? Is it vital to be able to reconstruct that? The advantage of including manual List-IDs is that it requires no additional backup. But the disadvantages are that it is unreliable, and that it does not remove duplicates across lists, which is what DKIM-ID was designed to do in the first place. Why recommend an unreliable backup strategy that forces users to bear the burden of storage when there is a reliable alternative that keeps that responsibility with the archive administrator? |
|
No, if the link is not known, then clearly it does not have to be reconstructed at that time. However if all the archives are available, they can all be reloaded. == I don't understand the part about DKIM and de-duplication across lists. I don't think it's possible to share message content across lists, because the sources will have different headers. The current design does share attachments, but there is no attempt to share any other parts of messages. Doing so would require a redesign, and I'm not sure it would be worth the extra complication and house-keeping. |
|
If the link is not reconstructed before it is requested then the request fails. Whether a link is known or not to the administrator does not necessarily correlate with how "vital" that link is. An administrator's knowledge of link usage can only ever be ad hoc and incomplete. The reliable backup mechanism avoids this issue, and enables prompt restoration. Duplicate headers were tested by sending a message to two different mailing lists, both using Mailman and Ponymail. The only difference was the |
|
Obviously it is better to have backups that simplify and speed up restoration. == I still don't understand your point about de-duplication. |
|
If an email is sent with the addresses of two or more mailing lists in the Since the discussion topic is the case where As for examples, any email with two or more mailing lists in the This is not the only possible example scenario, as mentioned. |
|
Sorry, but I am not convinced. I need to see actual data. I have been looking for examples in the ASF corpus, but have yet to find one where the list-ids are not added. |
|
The There is no |
|
Discussion of DKIM-IDs with manual List-IDs appended should be moved to Issue #523. This thread is for discussion of DKIM-IDs as they appear in the present commit of this PR. |
|
This PR is now more than a month old. |
|
There is a trade-off here. Let's call the two Permalink designs O and OL, where:
O = opaque hash created from the message source plus the LID (if it differs from the LID in the headers)
OL = opaque hash created from the message source only. If the LID differs from the headers, it is appended to the opaque hash.
Style O. Advantage: fixed Permalink style.
Style OL. Advantage: can regenerate Permalinks from just the mail sources.
Neither is perfect; seems to me that the appropriate choice will depend on the installation. |
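The two designs can be contrasted with a small sketch. This is an illustration of the definitions above, not code from the PR: `opaque_hash` is a hypothetical stand-in for the generator, and the `_` separator is borrowed from the `hash_dev.ponymail.apache.org` suggestion made earlier in the thread.

```python
import hashlib

SEPARATOR = "_"  # assumption: separator between hash and LID in style OL


def opaque_hash(data: bytes) -> str:
    """Hypothetical stand-in for the opaque hash produced by a generator."""
    return hashlib.sha3_256(data).hexdigest()[:20]


def style_o(source: bytes, header_lid: str, lid: str) -> str:
    """O: the LID goes into the hash input when it differs from the headers."""
    if lid != header_lid:
        return opaque_hash(source + b"\n" + lid.encode("utf-8"))
    return opaque_hash(source)


def style_ol(source: bytes, header_lid: str, lid: str) -> str:
    """OL: hash the source only; append the LID when it differs."""
    base = opaque_hash(source)
    if lid != header_lid:
        return base + SEPARATOR + lid
    return base


source = b"Subject: hi\r\n\r\nbody"
same = ("dev.example.org", "dev.example.org")
override = ("dev.example.org", "archive.example.org")
```

When the LID matches the headers, both styles give the same link. With an override, style O changes the whole hash, while style OL keeps the source-only hash as a prefix and makes the LID visible.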
|
In Style OL it is not possible to "regenerate Permalinks from just the mail sources" because the LID must still be known to obtain the suffix of the full Permalink. It is only possible to regenerate a prefix of the Permalink. Therefore the LIDs must still be known even in Style OL, to obtain the full Permalink. This disproves the supposed advantage of Style OL. And since the LIDs must be known even in Style OL, Style O is always possible too. Therefore Style OL is an inferior variant of Style O. |
|
In the case of an OL Permalink which is missing from a rebuilt database, there are two possibilities:
When rebuilding a database using OL links, any mails that don't have LID headers will need a suffix. In the case of an O-style Permalink, in order to match the opaque hash, it is necessary to know the exact LID that was used. Note: I am assuming here that the O-style Permalink includes the lid in the hash, either as an existing header, or as an addition. That is not actually the case for the amended PR, however as I indicated earlier that will need to be fixed. |
|
The original PR included the LID in the hash. This should ensure that the target is an email on the correct list. The current PR -- dfd18eb -- does not include the LID in either the hash or as a suffix. |
|
It does not suffice to "use a best guess as to the original LID that was used" to reconstruct an OL permalink. If somebody has an Moreover, if the appended lid in OL were not an integral part of the permalink then it would be necessary to use it anyway to disambiguate which list UI to apply to a message in the archives. This is because, for example and amongst other possible scenarios, the same message could be imported twice with different manual command line lid overrides. Using lid UI disambiguation has already been proposed, and requires handling of the case where the lid is omitted from the link. One could consider that the current commit dfd18eb already implements OL in the presence of lid UI disambiguation. It is therefore not an accurate characterisation that the current commit "will need to be fixed" to include the lid in the input, because the intention was that lid UI disambiguation would be introduced alongside the current commit, or implemented in a future version with a warning to users in the documentation meanwhile. |
|
I agree that choosing a different LID when re-importing will affect a full OL permalink. Of course, if the original LIDs are known then this won't be an issue. The point is that an OL-style Permalink can be recovered without needing to keep a database of the imports. === I don't know what "if the appended lid in OL were not an integral part of the permalink" means. |
|
The lid is not an integral part of the permalink. One may also say that the lid is not part of the permalink at all, if only the mandatory parts of a link intended for permanence are defined as "the permalink", but that is mere terminological choice. Commit dfd18eb implements OL in the presence of lid UI disambiguation. Are there any remaining objections to merging this pull request? There are no review comments on the code in the most recent commit, a month later. |
|
I think we should merge. We can work out differences in opinion later on. I'm leaning towards having two generators and leaving it up to the end user to make a decision. Can you adjust the PR so it can be merged? |
|
AFAICT, commit dfd18eb does not implement OL, at least not in the way that I defined it. If the generator output is also used as the database id, as is currently the case, only one message will be stored. This issue occurs where a message is sent to multiple lists and where the headers used by dkim are identical. |
|
If this PR were to be modified so that the lid is used in the input of the DKIM-ID algorithm, but only when it does not match the |
|
I have no issue with the code in dkim-id.py per se. However the dkimid method in generators.py needs to take note of the LID where it differs from the List-ID in the headers. |
The question was specifically about using the lid in the DKIM-ID input. Would that be acceptable? |
|
@sebbASF Would it be acceptable to insert the lid into the DKIM-ID input in case of a mismatch instead of appending it to the DKIM-ID output? Thirty days have elapsed since this thread was active. |
|
@sbp I find it acceptable, I think a solution here is to implement that as you wish, and if someone else wants to make an alternate DKIM generator that appends to the ID, then they can make that a reality at a later point in time. |
This PR adds DKIM style Ponymail ID generation.
Why?
There are a number of existing Ponymail ID generators, two of which are currently recommended: full for a single node, and cluster for multiple nodes. The purpose of the latter in particular is to generate an ID based on the hash of the email once elements that may vary across cluster nodes have been excluded.
There are other situations in which elements of an email may vary from environment to environment. One such scenario is that archives of a mailing list collected by different people or organisations in different locations may contain different actual email sources. The Received headers, for example, will be different depending on the routes of emails sent out by the mailing list software.
Variability in emails is a problem in another area too: guaranteeing email authenticity. Authenticity has been studied and gradually solved over the years by technologies such as SPF, DKIM, and ARC. Since DKIM involves signing emails, the designers had to solve email variability in such a way that signatures would be consistent. They did this by allowing signers to (a) subset the original headers before signing, and (b) apply different canonicalisation algorithms to the headers and to the body.
This PR solves the problem of robust ID generation by leveraging the existing DKIM mechanism for robust signing.
What?
To generate a DKIM style Ponymail ID, we first parse the email using a superset of the algorithm used by the popular Python dkimpy package. We then subset the headers to an RFC recommended subset, and apply DKIM relaxed/simple canonicalisation. Finally, we hash the canonicalised subset message using SHA3-256, and encode the first 80 bits of the digest using a custom base32 alphabet. The encoded digest prefix is the resulting Ponymail ID.
How?
We first implemented a superset of the RFC 822 parser in dkimpy (the reference parser) from scratch.
Why did we use this reference parser instead of following an algorithm in an RFC? Neither RFC 822 nor RFC 6376 (the DKIM RFC) nor any other RFC that we found gives a parsing algorithm for inputs which are broken. Ponymail must generate an ID no matter what the form of the input is. Therefore we followed an RFC 6376 implementation as a reference because it already covers some broken inputs.
But we went further. Our version shares no code with the original parser, and has other advantages in that it:
We subset the email using headers recommended by RFC 4871, the precursor to RFC 6376. We do this because the former had a more comprehensive set of recommendations on which headers to include. We sent the following email, which explains the situation in detail, to the authors of RFC 6376:
RFC 6376, whilst contradicting itself, does at least say that ultimately the choice is up to the implementer. We use the more comprehensive recommendations of RFC 4871 as providing a better baseline. We also add the DKIM-Signature header itself so that signed and unsigned messages cannot have the same ID.
We hash the result with SHA3-256. RFC 6376 and its successors do not yet provide for signatures using SHA3-256. We chose SHA3-256 as it is more likely to be resistant to cryptanalysis than SHA2, and the sponge construction has more practical uses such as STROBE, and therefore SHA3 is more likely to be robust. We could have used SHAKE for a shorter digest to truncate from, but SHA3-256 is more ubiquitous. We use only the first 80 bits, to make IDs easier to share; Ponymail IDs are IDs, not hashes. Although 80 bits is enough to make collisions in normal use unlikely, collision attacks are still possible. To avoid collision attacks, we allow configuration of a nonce which is added as a header to the email before canonicalisation.
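The effect of the nonce can be illustrated with a rough sketch. Everything here is an assumption made for illustration: the header name `x-example-nonce`, the function `digest_prefix`, and the simplified serialisation are hypothetical and do not reflect how the generator in this PR names or places the nonce.

```python
import hashlib


def digest_prefix(headers, body: bytes, nonce: bytes = None) -> bytes:
    """Hash headers and body with SHA3-256 and keep the first 80 bits."""
    headers = list(headers)
    if nonce is not None:
        # Hypothetical header name; the nonce is added before hashing
        headers.append((b"x-example-nonce", nonce))
    data = b"".join(k + b":" + v + b"\r\n" for k, v in headers)
    data += b"\r\n" + body
    return hashlib.sha3_256(data).digest()[:10]  # 10 bytes = 80 bits


headers = [(b"from", b"alice@example.org"), (b"subject", b"hi")]
plain = digest_prefix(headers, b"body\r\n")
site_a = digest_prefix(headers, b"body\r\n", nonce=b"secret-a")
site_b = digest_prefix(headers, b"body\r\n", nonce=b"secret-b")
```

Because the nonce is part of the hashed input, an attacker who does not know it cannot predict the resulting ID. The cost is that installations with different nonces produce different IDs for the same message.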
Finally, we encode the output using base32, but with the alphabet [0-9b-df-hj-tv-z] in place of the normal base32 alphabet. We position 0-9 first by analogy with base16. We use a-z without a, e, i, u to minimise the probability that cultural taboo words will appear in generated IDs. We call this the pibble encoding, which is short for "pony nibble".
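The final hashing and encoding step might look roughly like the sketch below. It assumes the pibble encoding uses the same bit order as standard base32 with only the alphabet swapped. That assumption is not confirmed by the description above, so the PR's actual output may differ. The names `pibble` and `sketch_id` are illustrative.

```python
import base64
import hashlib

# 0-9 followed by a-z without a, e, i, u: 32 symbols in total
PIBBLE_ALPHABET = "0123456789bcdfghjklmnopqrstvwxyz"
_STANDARD_B32 = "ABCDEFGHIJKLMNOPQRSTUVWXYZ234567"
_TO_PIBBLE = str.maketrans(_STANDARD_B32, PIBBLE_ALPHABET)


def pibble(data: bytes) -> str:
    """Base32-encode data, then substitute the custom alphabet."""
    encoded = base64.b32encode(data).decode("ascii").rstrip("=")
    return encoded.translate(_TO_PIBBLE)


def sketch_id(canonical_message: bytes) -> str:
    """SHA3-256 the canonicalised message and encode the first 80 bits."""
    digest = hashlib.sha3_256(canonical_message).digest()
    return pibble(digest[:10])  # 10 bytes = 80 bits = 16 characters


example = sketch_id(b"from:alice@example.org\r\n\r\nbody\r\n")
```

Ten bytes divide evenly into sixteen 5-bit groups, so the encoded ID needs no padding and is always 16 characters long.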