@sbp sbp commented Aug 11, 2020

This PR adds DKIM style Ponymail ID generation.

Why?

There are a number of existing Ponymail ID generators, two of which are currently recommended: full for a single node, and cluster for multiple nodes. The purpose of the latter in particular is to generate an ID based on the hash of the email once elements that may vary across cluster nodes have been excluded.

There are other situations in which elements of an email may vary from environment to environment. One such scenario is that archives of a mailing list collected by different people or organisations in different locations may contain different actual email sources. The Received headers, for example, will be different depending on the routes of emails sent out by the mailing list software.

Variability in emails is a problem in another area too: guaranteeing email authenticity. Authenticity has been studied and gradually solved over the years by technologies such as SPF, DKIM, and ARC. Since DKIM involves signing emails, the designers had to solve email variability in such a way that signatures would be consistent. They did this by allowing signers to (a) subset the original headers before signing, and (b) apply different canonicalisation algorithms to the headers and to the body.

This PR solves the problem of robust ID generation by leveraging the existing DKIM mechanism for robust signing.

What?

To generate a DKIM style Ponymail ID, we first parse the email using a superset of the algorithm used by the popular Python dkimpy package. We then subset the headers to an RFC recommended subset, and apply DKIM relaxed/simple canonicalisation. Finally, we hash the canonicalised subset message using SHA3-256, and encode the first 80 bits of the digest using a custom base32 alphabet. The encoded digest prefix is the resulting Ponymail ID.

How?

We first implemented a superset of the RFC 822 parser in dkimpy (the reference parser) from scratch.

Why did we use this reference parser instead of following an algorithm in an RFC? Neither RFC 822 nor RFC 6376 (the DKIM RFC) nor any other RFC that we found gives a parsing algorithm for inputs which are broken. Ponymail must generate an ID no matter what the form of the input is. Therefore we followed an RFC 6376 implementation as a reference because it already covers some broken inputs.

But we went further. Our version shares no code with the original parser, and has other advantages in that it:

  • Gives identical output to the reference parser in all cases we tested for which the reference parser gave output.
  • Provides output for all byte sequence inputs, unlike the reference parser. In cases where the reference parser would throw an error, we make reasonable assumptions as to what the form of the output should be.
  • Performs about 2x faster. It's so fast that even when it performs canonicalisation, it's still faster than the reference parser.
  • Depends on no modules, even from the Python standard library.
  • Can exclude headers whilst parsing.

We subset the email using headers recommended by RFC 4871, the precursor to RFC 6376. We do this because the former had a more comprehensive set of recommendations on which headers to include. We sent the following email, which explains the situation in detail, to the authors of RFC 6376:

In RFC 6376 you removed several entries from the list of headers in
_Recommended Signature Content_ (5.4.1) that had been present in RFC
4871. The (Resent-)Sender and (Resent-)Message-ID headers were
removed, as well as all MIME headers.

In an _INFORMATIVE OPERATIONS NOTE_ (5.4) earlier on in RFC 6376,
though, it is "highly advised" that both Sender and all MIME headers
are present. This earlier NOTE is unchanged from RFC 4871.

Later on in RFC 6376 the removal of Message-ID is discussed, and its
new status is explained to be contextual. But what was the rationale
for removing the other headers, and what is their new status? Was the
NOTE accidentally not updated when the other headers were removed from
the list? Or are (Resent-)Sender and the MIME headers still
recommended for inclusion?

RFC 6376, whilst contradicting itself, does at least say that ultimately the choice is up to the implementer. We use the more comprehensive recommendations of RFC 4871 as providing a better baseline. We also add the DKIM-Signature header itself so that signed and unsigned messages cannot have the same ID.

We hash the result with SHA3-256. RFC 6376 and its successors do not yet provide for signatures using SHA3-256. We chose SHA3-256 as it is more likely to be resistant to cryptanalysis than SHA2, and the sponge construction has more practical uses such as STROBE, and therefore SHA3 is more likely to be robust. We could have used SHAKE for a shorter digest to truncate from, but SHA3-256 is more ubiquitous. We use only the first 80 bits, to make IDs easier to share; Ponymail IDs are IDs, not hashes. Although 80 bits is enough to make collisions in normal use unlikely, collision attacks are still possible. To avoid collision attacks, we allow configuration of a nonce which is added as a header to the email before canonicalisation.

Finally, we encode the output using base32, but with the alphabet [0-9b-df-hj-tv-z] in place of the normal base32 alphabet. We position 0-9 first by analogy with base16. We use a-z without a, e, i, u to minimise the probability that cultural taboo words will appear in generated IDs. We call this the pibble encoding, which is short for "pony nibble".
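
The hashing and encoding steps can be sketched as follows. This is a minimal illustration rather than the code in this PR: the function names are hypothetical, the input is assumed to be already subsetted and canonicalised, and the assumption that pibble keeps the standard base32 bit grouping and only swaps the alphabet is mine.

```python
import base64
import hashlib

# The alphabet named in the text: 0-9, then a-z without a, e, i, u.
PIBBLE_ALPHABET = b"0123456789bcdfghjklmnopqrstvwxyz"
# Standard RFC 4648 base32 alphabet, used only as an intermediate step.
_STANDARD_B32 = b"ABCDEFGHIJKLMNOPQRSTUVWXYZ234567"
_TO_PIBBLE = bytes.maketrans(_STANDARD_B32, PIBBLE_ALPHABET)


def pibble(data: bytes) -> str:
    """Base32-encode data, then swap in the pibble alphabet (assumed mapping)."""
    encoded = base64.b32encode(data).rstrip(b"=")
    return encoded.translate(_TO_PIBBLE).decode("ascii")


def dkim_style_id(canonical_message: bytes, bits: int = 80) -> str:
    """Hash already-canonicalised bytes and encode the digest prefix."""
    digest = hashlib.sha3_256(canonical_message).digest()
    return pibble(digest[: bits // 8])
```

With the default of 80 bits, the digest prefix is 10 bytes, which encodes to exactly 16 characters with no padding.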

v = v.replace(b"\r\n", b"")
v = v.replace(b"\t", b" ")
v = v.strip(b" ")
v = b" ".join(vv for vv in v.split(b" ") if vv)
Author

NOTE: Used --amend to fix final v to vv here.
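
For readers without the diff open, the four lines quoted above can be wrapped into a self-contained function. The function name is hypothetical and this covers only the header value; lower-casing the header name and trimming around the colon, which relaxed canonicalisation also requires, are not shown.

```python
def relaxed_header_value(v: bytes) -> bytes:
    """Collapse whitespace in a header value, in the style of DKIM 'relaxed'."""
    # Unfold: drop the CRLF of folded continuation lines.
    v = v.replace(b"\r\n", b"")
    # Treat tabs as spaces so all whitespace runs collapse uniformly.
    v = v.replace(b"\t", b" ")
    # Remove leading and trailing spaces.
    v = v.strip(b" ")
    # Collapse each run of spaces into a single space.
    return b" ".join(vv for vv in v.split(b" ") if vv)
```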

@sbp
Author

sbp commented Aug 11, 2020

Please note that this PR also adds an extra parameter passing the bytes of the original message to the compute_updates and generator functions in archiver.py and generators.py respectively. This is because the DKIM style ID generator works on the original bytes, and recovering these from msg.as_bytes() is not only slower but also more fragile in that the behaviour of that method depends on the upstream implementation in the Python standard library.

@sebbASF
Contributor

sebbASF commented Aug 11, 2020

Thanks very much. This looks very good.

However I think the hash won't be sufficiently unique, given that PonyMail uses the hash both for the Permalink and for storing the copy of the email.

It is possible that two separate messages sent to a mailing list can end up with the same hash.
AFAICT, the sender has control over all the fields that are checked (assuming the MTA does not generate DKIM headers), so can send mails that are duplicates as far as the DKIM calculation is concerned.

But list software such as ezmlm generally does not care what an email contains and would treat this as a new message, so would generate a new sequence number for the message. This could result in gaps in the PonyMail record, since that uses the hash as a unique id for the message.

There would be no way to know what caused the missing sequence number -- was it such a duplicate, or did the email get lost in transit? -- without access to a separate archive.

I think this problem can be avoided by including some of the last received headers in the hash. The ones after any List-xxx headers added by the list software will relate to the journey the mail took to reach the mail server, so should not be affected by subsequent deliveries to mail archivers.

Note: this is only an issue because PonyMail uses the same hash for Permalinks and the database id.

@Humbedooh
Member

Isn't that just trading one potential risk (if you will) for another?
If you keep any received headers in the permalink ID, you risk losing reproducibility in permalinks for emails that either don't have a list-id, or are sent via multiple lists.

I think the DKIM list makes sense, and would trade the stable generated hash over whether or not you can debug ezmlm with it.
ezmlm's return path only applies if your email address is the direct recipient after the list was invoked. If it goes past that, you may lose it, and the index number will be lost anyway. I'm also unsure whether it's Pony Mail's job to debug whether postfix/etc received an email or not.

Having said all that, what I could envision is for the next generation pony mail to use this generator for the permalink, but store multiple copies of the source, as they come in, and referencing them in a metadata table (or just storing the pibble in the source as well, but not use it for the document ID). When viewing the source, you could be presented with all existing options. Thus, you would have just the one copy of the email when viewing the list (which I think makes the most sense in any case), but multiple options when viewing the source.

That is, to sum up, I like this generator and what it accomplishes - I think it looks far better than our previous/current solutions, and if not for this version, I'd be interested in pulling this PR into the next generation of pony mail.

@Humbedooh
Member

To expand upon my previous comment:

Email A comes in. It gets "pibbled" to abcdefg1234. A SHA3 digest is 123412341234
Email B comes in, identical to A but with a different route. It gets pibbled as abcdefg1234, and the SHA3 digest is now 432143214321
The permalink for BOTH emails would then be abcdefg1234 (as we only ever need that one copy of the contents for basic search/viewing), but because that pibble is stored in both sources (which have different SHA3 digests), you would be able to see both options when wanting to view the source.

@sebbASF
Contributor

sebbASF commented Aug 11, 2020

AFAICT, so long as one only takes into account the Received headers that relate to the hops before arrival at ezmlm, all recipients of the email will be able to generate the same hash. If for some reason the email does not have a list-id or another way of determining whether the Received header was added before arrival at the list server, then ignore the header.

Note that this is not a question of 'debugging' ezmlm.
It is a question of knowing why Ponymail does not have a particular email sent by the mailing list.
Was it lost in transit, or was it a duplicate?

For example, can you tell me why
dev-return-1032-archive-asf-public=cust-asf.ponee.io@ponymail.incubator.apache.org
is missing from Ponymail without looking at some other archive?

As to your example, if email A and email B are identical, I don't see how they will get different pibbles/digests unless at least some of the received headers are taken into account. Please explain.

@Humbedooh
Member

They would have the same pibble, but different SHA3 if the SHA3 is done using the full message source. What headers are/aren't in the source wouldn't matter, as it would refer to the same pibble at any time because of how the DKIM generator works.

As for the return path you mention, that is specific to a specific installation. I don't think we should allow/deny PRs based on just that. I for one don't get such return paths in my mbox source, as it goes through another alias before it hits my inbox - the same could be true for the archiver, in which case it wouldn't matter what the original return path was.

@Humbedooh
Member

Email A: pibble is abcdefg1234, SHA3 of full message is 123412341234
Email B: pibble is still abcdefg1234, SHA3 of full message is 432143214321
Both have the same pibble, both could show up as variants when you went to look at the source.

When importing from a third source (email C) into a DB from scratch, pibble would again be abcdefg1234, and SHA3 could perhaps be 111222333444, it wouldn't matter as the pibble metadata is the same, so a search would find it in the DB.
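
A toy in-memory sketch of this idea, assuming the permalink ID is computed elsewhere and passed in. The class and method names are hypothetical and not part of Pony Mail.

```python
import hashlib
from collections import defaultdict


class SourceStore:
    """One entry per distinct source, grouped by the shared permalink ID."""

    def __init__(self):
        self.sources = {}                   # full-source digest -> raw bytes
        self.by_pibble = defaultdict(list)  # permalink ID -> list of digests

    def add(self, pibble_id: str, raw_source: bytes) -> str:
        # The SHA3-256 of the full source distinguishes route variants.
        digest = hashlib.sha3_256(raw_source).hexdigest()
        if digest not in self.sources:
            self.sources[digest] = raw_source
            self.by_pibble[pibble_id].append(digest)
        return digest

    def variants(self, pibble_id: str) -> list:
        # All stored sources that share this permalink ID.
        return [self.sources[d] for d in self.by_pibble.get(pibble_id, [])]
```

Adding emails A and B under the same pibble yields two variants for the source view, while the list view needs only one of them.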

@Humbedooh
Member

Furthermore, blue-skying here, this could be made backwards compatible with older databases easily.
For all new sources, store the source document with a pibble field inside (next to the source field).
For all old sources, the doc ID is the permalink.

Thus, to figure out if we need to access directly via permalink ID or via a pibble keyword search, we'd just assess whether the ID of the email being accessed is a pibble or not.

Things to consider for later:

  • Not accessing documents directly via their ID has performance costs - I don't know how significant these are. For general single-source viewing, this should not be much if any. For generating mbox files, it will need some careful thought.
  • Viewing source via source.(lua|py|whatever) should only yield one result. There should perhaps be an X-Source-Alternate header if an alternate source exists in the DB, so you can look this up if debugging why an email is presumed missing.

@sebbASF
Contributor

sebbASF commented Aug 12, 2020

On closer examination, I see that the DKIM generator has several options as to how the hash is generated.

This means that the generated hash will depend on which options are chosen, rather than just on the mail content.

If for some reason emails have to be reloaded either in the same installation or another, it is vital that the same hashes are generated.

My conclusion is that options cannot be allowed if Permalinks are to be truly permanent.

@Humbedooh
Member

Color me stupid, but...you would manually have to go in and change those settings to get a different result, would you not?
Whether there are options set or not would then not matter, as you could change the behavior however you see fit in any case.
What matters is that the defaults, and what is used, stay the same, IMHO.

The nonce is the only option I could see anyone purposefully changing, and that's not defined in the generator itself, but in the pony mail config.

@sebbASF
Contributor

sebbASF commented Aug 12, 2020

Yes, you would have to change the options.
However if different instances have different settings, then their hashes won't be the same.

I see the DKIM hash being used as a more reliable Message-Id.
This means it must be immutable.
Changing the options is akin to changing a Message-Id.

@sebbASF
Contributor

sebbASF commented Aug 12, 2020

The optional nonce makes things worse as there are effectively infinite values it can take.
At least with boolean options it would be possible to generate all the different hashes.

@Humbedooh
Member

If different instances have different settings, then that is the problem of the person that set that up, not us, not the generator's fault.
Having options makes it easier for others to reuse the code in my view. I don't see it as a fault or a bad thing.

As for the other comment, infinite values is exactly the point of the nonce as I read it. It is to prevent collision attacks from being possible. In my mind, this is a great addition to the generator. If I set up an instance with a nonce, only I can reasonably collide IDs (with a lot of effort still), no one else. If you don't want that extra security, don't set a nonce.

@sbp
Author

sbp commented Aug 12, 2020

@sebbASF

An 80 bit truncated hash provides 80 bits of preimage resistance, but only 40 bits of collision resistance. In terms of Ponymail, preimage resistance prevents forgery of new messages with existing IDs. Collision resistance prevents a forger generating two different messages with a single new ID.
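
The gap between the two bounds comes from the birthday effect: a collision among n-bit values is expected after roughly 2^(n/2) attempts. A small demonstration on a deliberately tiny 24 bit truncation, where a collision typically appears after a few thousand hashes. This is only a sketch; the helper names and the counter-based messages are illustrative.

```python
import hashlib


def truncated(data: bytes, bits: int) -> int:
    """Return the first `bits` bits of SHA3-256(data) as an integer."""
    digest = hashlib.sha3_256(data).digest()
    return int.from_bytes(digest, "big") >> (256 - bits)


def find_collision(bits: int):
    """Birthday search: hash distinct inputs until two share a truncated digest."""
    seen = {}
    counter = 0
    while True:
        message = b"message-%d" % counter
        key = truncated(message, bits)
        if key in seen:
            return seen[key], message, counter + 1
        seen[key] = message
        counter += 1


first, second, attempts = find_collision(24)
print(first, second, attempts)
```

The same search against 80 bits would need on the order of 2^40 hashes, which is the 40 bit figure above.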

Back of the envelope calculations, which I am not claiming as the actual security of this PR code (i.e. this must be independently verified), using SUPERCOP benchmarks show that 40 bits enables a forgery under that model in a matter of hours. 64 bit security would be possible only by a nation-state actor. Forgery under 80 bit security is infeasible now, and is likely to remain so for a reasonable duration.

In other words, 80 bit encoded DKIM style IDs are not suitable for use in the way that you suggest, as a nonce-free identifier for abstract emails. They would be far too easily forged, in a matter of hours even with standard hardware. This is why the option for adding a nonce exists.

General forgeries could be mitigated by storing the actual DKIM signature (if present) as metadata within Ponymail, and this would probably be a good idea in any case. Identifier forgeries can be partially mitigated in Ponymail by canonically showing the first message to have been received by the system. But this only works for currently hosted mailing lists; it does not work for message archive imports.

A minimum of 128 bit security is standard for security in general scenarios. That is why 256 bit hashes have become widespread, because they provide 128 bits of collision resistance security. But for Ponymail preimage resistance is the primary concern, and collision resistance less so.

For your use case I would suggest 128 bit identifiers, providing 64 bits of collision resistance. The code could still be written with the idea that collision attacks would be feasible, providing further mitigations even in this very unlikely case. Another advantage of using 128 bits is that SHAKE-128 output could be used untruncated.

The alternative base32 pibble encoding proposed in this PR has the advantages that it is case insensitive and avoids a wide range of cultural taboo substrings, but it is less compact than urlsafe base64, which would be a potential alternative if using longer IDs. Here is an example of a 16 character pibble encoded 80 bit identifier:

t3wfqdtjor8kghwn

And here is a 22 character base64 encoded 128 bit identifier:

MTIzNDU2Nzg5MDEyMzQ1Ng
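
The lengths follow from the bits carried per character: 5 for base32 and 6 for base64. A quick check, assuming unpadded encodings; the helper names are illustrative. The base64 sample quoted above happens to be the unpadded encoding of the 16 ASCII bytes `1234567890123456`.

```python
import base64
import math


def base32_length(bits: int) -> int:
    return math.ceil(bits / 5)  # 5 bits per character, padding stripped


def base64_length(bits: int) -> int:
    return math.ceil(bits / 6)  # 6 bits per character, padding stripped


# 80 bit base32, 128 bit base32, 128 bit base64
print(base32_length(80), base32_length(128), base64_length(128))

# Reproduce the 22 character sample from 16 bytes (128 bits) of input.
sample = base64.urlsafe_b64encode(b"1234567890123456").rstrip(b"=")
print(sample.decode("ascii"))
```

So a 128 bit identifier costs 26 characters in a base32 alphabet such as pibble, against 22 in urlsafe base64.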

The hashes of original messages should use 256 bit hashes. They could use SHA3-256 and be stored using urlsafe base64 encoding. Specifically, it is not secure to use 128 bit identifiers because the threat model is different. The DKIM style identifier is a one-to-many mapping, where the many values are preserved. Collisions in that context can be mitigated. The message source identifier is a one-to-one mapping, and so a collision would result in lost data, which is not acceptable.

In summary: if the nonce option is preserved, 80 bit truncated cryptographically secure hash (CSH) identifiers may be suitable for generic messages but not for message sources. If the nonce option is removed, 128 bit CSH identifiers may be suitable for generic messages, but not for message sources. Only 256 bit CSH identifiers are suitable for message sources.

I apologise that you were led astray by the ignis fatuus of the dkim options. They are not user facing options. They are only there for:

  • The participants of this PR thread, to facilitate changing parameters if the initially submitted defaults are agreed to be unreasonable.
  • Advanced programmer users who have the ability to patch Ponymail code, would like to customise their identifiers, and can therefore use the dkim options to achieve what would have taken them more effort as programmers.

The options can even be safely removed if dkim is standardised.

@Humbedooh
Member

@sbp This sounds like we have two options here then:

  1. pibble with 80 bits if nonce is set, 128 bits if no nonce?
  2. always use 128 bits for pibbling?

As for using the complete 256 bit CSH for sources, I am fully on board there, but I think this is best saved for the next gen. That doesn't mean the dkim generator isn't suitable for this gen (from what I can tell, it is far superior for clustered environments compared to any other algorithm), but rather that we'd save using the 256 bit CSH for sources for the next generation. Does that make sense?

@sebbASF
Contributor

sebbASF commented Aug 13, 2020

@sbp
In the case of the nonce, does the additional security rely on using a variable nonce, or would a fixed nonce be sufficient?

==

Not all emails in archives have List- headers.
For example email aliases, and early emails before list managers were common.
I think this means that the Permalink must contain the list id separately from the message hash.

@Humbedooh
Member

AIUI, the destination list-id (not the one in the origin, which may not exist) is appended in the generator with:

        headers.append([b"X-Archive-List-ID", xali_value])

The destination list ID is thus both in the hash and also in the mbox document (as it always is).
I'm unsure what you mean by the permalink containing the list id. I don't see that as a must, it's in the document itself, and the pibble would change if the destination list ID changed, thus two identical emails for the two separate lists would have different pibbles for each list.

@sebbASF
Contributor

sebbASF commented Aug 13, 2020

I mean that the Permalink should include the list id, as it does at present. For example: aabbcc@<lid>

@sbp
Author

sbp commented Aug 13, 2020

@sebbASF

One nonce can be used for all messages archived by a host, but it must never be disclosed. It is more accurately called a pepper, which is a secret salt. Once it is set in the configuration it does not have to be changed, but disclosure is catastrophic.
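
A minimal sketch of the effect of a pepper, assuming it is prepended as an extra header line before hashing. The header name `x-example-pepper` and the function name are hypothetical, and the hex output stands in for the pibble encoding.

```python
import hashlib


def peppered_id(canonical_message: bytes, pepper: bytes = b"") -> str:
    hasher = hashlib.sha3_256()
    if pepper:
        # Hypothetical header name; the secret value never leaves the host.
        hasher.update(b"x-example-pepper:" + pepper + b"\r\n")
    hasher.update(canonical_message)
    # First 80 bits of the digest, shown as 20 hex characters for brevity.
    return hasher.hexdigest()[:20]
```

Someone who does not know the pepper cannot compute the ID of a candidate message offline, so they cannot search for colliding pairs. Two hosts with different peppers will also produce different IDs for the same message.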

@Humbedooh
Member

Humbedooh commented Aug 13, 2020

I don't think the ID should include the list name by default, I like it short and neat - makes life easier for people using links :)
It could perhaps be an option to append, but I don't really see a need to always have the list ID in the permalink.
Having a long permalink leads to a worse user experience, and using <@> chars etc often leads to encoding bugs.

@sbp
Author

sbp commented Aug 13, 2020

@Humbedooh

Collision forgery would require control over entire input messages, unless the source identifier algorithm uses a subset. It also does not enable attacks against the identifiers of existing messages. If a Received header was added and is used to compute the identifier, this increases the difficulty of the attack. If an unpredictable header is added by Ponymail and used, this thwarts attacks even against imported archives. But using a 256 bit CSH of the whole message means you get all the security of the hash and no longer have to threat model such collision forgeries. A 256 bit CSH is cheap and currently reliable security.

It is reasonable to save this feature for the next generation as long as at least one kind of existing message source identifier has enough collision resistance to make these attacks impractical. Since a range of identifiers are available, their security levels could be noted in the documentation so that implementers can understand the security consequences and decide.

@sebbASF
Contributor

sebbASF commented Aug 16, 2020

If an externally provided list-id is included in the hash, then the hash will change if the lid changes.
Suppose there is an mbox to be imported.
If the individual messages all have list-ids there is no need to specify the lid on the command-line.
However if you do provide one (which is advisable in case there are any missing ids), then the hashes will be different.

It is vital that the hash depends only on the message source (possibly plus a fixed pepper), otherwise reloading the messages may generate a different hash and thus Permalink.

As to characters such as @<> being a problem, I agree, but there are other characters that could be used to separate the lid from the hash. For example one could use hash_dev.ponymail.apache.org.

I agree that it would be nice to have shorter links, but that should not be at the cost of unstable Permalinks.
Now it already looks like it should be possible to use a shorter hash by using base64 so a Permalink of the form:

r3358b63557d3a40e179f6ca498a38b9aaf0b2532aba48bfc03c7a1a0@<dev.project.apache.org>
might become
MTIzNDU2Nzg5MDEyMzQ1Ng_dev.project.apache.org

@Humbedooh
Member

Arguably, if you use a custom list ID different from what's in the source, you are going to potentially 404 your permalink in any case if you change it or forget what it was when reimporting, no matter what generator you use. All current generators use the list ID in their input/output. The only difference of real importance, in my view, is that the list ID is visible in current generated IDs, and hidden in the dkim generator to get as short an ID as is reasonably possible (by hidden I mean it's used inside the hash, instead of being plaintext outside the hash).

I will agree that having the list ID in the generated ID can be very helpful for administrators if they need to reimport and had a lot of custom overrides they can't remember or don't have a backed-up configuration of. What if we make this an option for the administrator to decide on? Thus, we could accept this as is, and then have a second, dkim_long or such, where the list-id is appended to the generated ID instead.

The current dkim generator appeals to a certain group of people, and your suggestion of appending the list ID will appeal to other groups of people. Neither solution will be 100.00% stable against all edge cases (take for instance losing your database and re-importing from gmail mbox sources, that would not work).

Let the administrators running PM decide between a shorter ID for neater URLs, or a longer ID that could make recovery easier. But let that be their decision, I don't think we should be imposing one or the other option.

@sebbASF
Contributor

sebbASF commented Aug 17, 2020

"re-importing from gmail mbox sources, that would not work" - why not?

It would certainly work with the mod_mbox software, as that relies on an intrinsic part of the message (Message-Id)

That is surely one of the main ideas behind the dkim hash - generate an id that is the same for all instances of the same original message?
It would not be needed if Message-Id were universal and unique, but unfortunately that's not the case.

@Humbedooh
Member

gmail does some (nasty) normalization of header values, such as lower-casing email addresses, which is not standard practice, so you cannot reliably generate the same hash for all emails if your previous import was based off non-gmail mbox files. I've run into this problem a few times, where the mbox address had caps in it, and gmail removed those caps, hence why I know.

@sebbASF
Contributor

sebbASF commented Aug 17, 2020

In which case maybe dkim should do the same normalisation as GMail to avoid the issue?

@Humbedooh
Member

Mails where we use a list override (that is, any mail that arrives to an archiver with --lid $something) should have the list ID appended to the generated ID as you proposed. All emails that do not go through a list override should not need to have anything appended. Does this satisfy you both? :) It would satisfy my requirements.

@sebbASF
Contributor

sebbASF commented Sep 13, 2020

I think that should work for mails that are archived.

For the importer, generally it makes sense to always provide the --lid override to ensure the mails are added to the expected list. I think it makes sense to pick up an idea from the original PR, which is to only add the lid if it is different from the lid (if any) in the email. However instead of including the lid in the hash input as before, now it is appended to the generated Permalink.

I hope this would avoid most of the issues with reproducibility of Permalinks.
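
A sketch of the rule proposed here, assuming both list IDs have already been normalised to the same form and using the underscore separator from the earlier example. The function name is hypothetical.

```python
from typing import Optional


def permalink(generated_id: str,
              override_lid: Optional[str],
              message_lid: Optional[str]) -> str:
    """Append the override lid only when it differs from the email's own lid."""
    if override_lid and override_lid != message_lid:
        return f"{generated_id}_{override_lid}"
    return generated_id
```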

@sbp
Author

sbp commented Sep 14, 2020

Imagine we have a small mailing list archive in an mbox file called ancient.mbox. This mbox archive contains three emails, none of which contain a List-ID header. We import it using the manual command line List-ID alt.small.archive, and use the DKIM-ID generator that appends the manual List-ID. The permalinks are like this:

https://lists.example.org/t/bnbqz6hb4gplpvtz7zmlhymj_alt.small.archive
https://lists.example.org/t/jnb2msdg3j6of3dco2bw5fct_alt.small.archive
https://lists.example.org/t/oxbrv2q2wbd23vfz5ttfcs7v_alt.small.archive

Now the catastrophic scenario happens, and the Ponymail database is lost! But we still have ancient.mbox, and we want to restore its three emails back into the lists.example.org Ponymail instance. How do we know what manual List-ID to use? The three emails do not include a List-ID. We made up alt.small.archive, but we did not record this fact and we no longer remember it. How do we get the List-ID?

The idea behind putting _alt.small.archive in the permalinks is that now we can send a plea to our users asking: "does anybody have any links to an email that was in the database I just lost?", or use a search engine to try to find such permalinks.

There are a couple of problems with this:

  • If the archive is small, what if nobody, including search engines, ever recorded those links? Our mailing list archives may be unpopular, private, or hidden from search engines using robots.txt.
  • Whom do we consult to find those permalinks? In other words, how do we even know who our users are? For sites that have a community around them, there may be a straightforward answer to this. But there is not a general answer to this.

Therefore, the only reason to put a manual List-ID in permalinks is to support an unreliable backup strategy. The strategy is unreliable because it depends on arbitrary users to retain copies of the permalinks that can then be consulted to restore the data in the case of catastrophic database loss. And an unreliable backup strategy is made unacceptable when there is a reliable alternative.

One alternative is that we can just rename ancient.mbox to alt.small.archive.mbox when we import it. Or we can record the hash of ancient.mbox into a file called alt.small.archive.mbox-sha3 and keep it alongside ancient.mbox. But those approaches have drawbacks too, e.g. if we obtain an mbox file which is differently ordered.

Instead, here is a reliable alternative:

Imagine we import our three emails from ancient.mbox, but this time without a manual List-ID in the permalinks. The permalinks are like this:

https://lists.example.org/t/bnbqz6hb4gplpvtz7zmlhymj
https://lists.example.org/t/jnb2msdg3j6of3dco2bw5fct
https://lists.example.org/t/oxbrv2q2wbd23vfz5ttfcs7v

When we performed this import, we generated the three DKIM-IDs bnbqz6hb4gplpvtz7zmlhymj, jnb2msdg3j6of3dco2bw5fct, and oxbrv2q2wbd23vfz5ttfcs7v. These are each encodings of 16 bytes, for a total of 48 bytes. In general this is 16 * n bytes, where n is the number of emails imported. We store these bytes in a file called alt.small.archive.dkim-ids.

We now perform standard backup procedures for alt.small.archive.dkim-ids. We replicate it across environments, storing as many copies as possible in different geographic locations using different setups. This is easy to do because the file is only 48 bytes long. We only need to store 48 bytes, several times, to have a reliable backup of our manual List-ID. We can even include the manual List-ID plus line feed at the start of the file, so that we're not relying on the filename itself.

Does this strategy scale? Consider a very large mailing list that has a million emails in it. The manual List-ID .dkim-ids backup file for such a list would be 16 * n or 16 * 1,000,000 or 16,000,000 bytes long. This is only 15.2 MiB. As of 2020 it is trivial to widely and reliably replicate fifteen mebibytes for backup purposes.
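
A minimal sketch of such a backup file, assuming the layout described above: the manual List-ID and a line feed, followed by the raw IDs at 16 bytes each. The function names are hypothetical, and decoding permalinks back to raw bytes is left out.

```python
import io

ID_SIZE = 16  # bytes per raw ID, as stated above


def write_id_backup(stream, list_id: str, raw_ids) -> int:
    """Write the List-ID line followed by the concatenated raw IDs."""
    written = stream.write(list_id.encode("utf-8") + b"\n")
    for raw in raw_ids:
        if len(raw) != ID_SIZE:
            raise ValueError("each ID must be exactly 16 bytes")
        written += stream.write(raw)
    return written


def read_id_backup(stream):
    """Return the List-ID and the list of raw IDs from a backup stream."""
    list_id = stream.readline().rstrip(b"\n").decode("utf-8")
    payload = stream.read()
    ids = [payload[i:i + ID_SIZE] for i in range(0, len(payload), ID_SIZE)]
    return list_id, ids


# Size of the ID payload for a million emails, in MiB.
print(ID_SIZE * 1_000_000 / 2 ** 20)
```

Any binary stream works here, so the same functions can target a local file or an in-memory buffer such as `io.BytesIO`.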

What are the problems with this strategy? Unlike the unreliable and unacceptable backup strategy described above, it does not rely on arbitrary users or search engines to back up our data for us. It does not lead to the problem of wondering who to consult to restore that data. It follows established, standard industry practices for backing up our manual List-IDs, instead of the existing ad hoc and idiosyncratic method.

For that reason, I could never recommend the strategy where manual List-IDs are part of the permalinks. I could never recommend that people use it as their backup strategy, because this superior strategy is available instead and it ticks all the boxes.

It is, however, sometimes necessary to include the manual List-ID in the URL somewhere for UI purposes. Consider the email bnbqz6hb4gplpvtz7zmlhymj above. Let's say that as well as appearing in our ancient.mbox archive it was also sent to five other mailing lists, for a total of six, all of which are in the lists.example.org Ponymail instance. What should the UI say if a user visits the following address?

https://lists.example.org/t/bnbqz6hb4gplpvtz7zmlhymj

It doesn't know which List-ID to present to the user. In fact, neither Ponymail nor Foal even retains the information that this message was sent to six mailing lists if the DKIM-ID is the primary permalink. Making it the primary permalink is necessary in Ponymail if DKIM-ID is used at all; it is not necessary in Foal, but it is still possible.

This would not be a problem if the manual List-ID were part of the permalink, but it solves one problem and causes another. DKIM-IDs were designed to deduplicate emails. If List-IDs are part of the DKIM-ID permalink, this means we would have to store six copies of the bnbqz6hb4gplpvtz7zmlhymj metadata, and six copies of its source too. But DKIM-IDs were explicitly designed to prevent this. Therefore we should solve the problem of retaining manual List-IDs another way. If we added them to the hash input of DKIM-IDs then we would lose our reliable backup strategy presented above.

Thankfully there is a simple solution.

In Foal commit 178b729, the multiple ID generators feature was added with the field permalinks. To support multiple manual List-IDs for a DKIM-ID identified message, all that would be required is to have an analogous field called lids for an array of List-IDs, just like permalinks is an array of generated permalink IDs.

Then, if the user browses the URL above:

https://lists.example.org/t/bnbqz6hb4gplpvtz7zmlhymj

They can be presented with a list showing all List-IDs that this message belongs to, and the option to display the message in its context in those lists, specialising its UI. Or, they can still browse a version that contains the List-ID:

https://lists.example.org/alt.small.archive/t/bnbqz6hb4gplpvtz7zmlhymj

But, importantly, alt.small.archive is not part of the DKIM-ID here. This means that messages are still deduplicated even when they appear in multiple mailing lists.
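To make the idea concrete, here is a minimal in-memory sketch, not Foal's actual storage code. The `Archive` class and its methods are hypothetical; only the `lids` field name comes from the proposal above. Each message is stored once under its DKIM-ID, and the List-IDs it belongs to are kept in an array alongside it.

```python
class Archive:
    """Toy archive: one stored document per DKIM-ID, many List-IDs."""

    def __init__(self):
        # DKIM-ID -> {"source": bytes, "lids": [List-ID, ...]}
        self.messages = {}

    def add(self, dkim_id, source, lid):
        """Store the message once; record each list it was seen on."""
        doc = self.messages.setdefault(dkim_id, {"source": source, "lids": []})
        if lid not in doc["lids"]:
            doc["lids"].append(lid)

    def resolve(self, dkim_id, lid=None):
        """Resolve /t/<dkim_id> (lid=None) or /<lid>/t/<dkim_id>."""
        doc = self.messages.get(dkim_id)
        if doc is None:
            return None
        if lid is None:
            # Bare permalink: report every list the message belongs to.
            return {"source": doc["source"], "lids": list(doc["lids"])}
        if lid not in doc["lids"]:
            return None
        # List-specific URL: same stored message, one list context.
        return {"source": doc["source"], "lids": [lid]}
```

Under this sketch, importing the same message into six lists leaves one copy of the source and a `lids` array of length six.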

Making manual List-IDs part of DKIM-IDs would give up all of the above. In particular:

  • Metadata and sources would be duplicated across mailing lists
  • Showing what lists a DKIM-ID appears in would require a prefix search incompatible with Elasticsearch keyword arguments
  • Ponymail administrators would be induced to just rely on their users to back up the permalinks in case of catastrophic data loss instead of performing the reliable backup method described above

@sebbASF
Contributor

sebbASF commented Sep 14, 2020

Seems to me if no-one has a copy of the permalink then it does not matter as much if it is lost.

Whereas the other way round, if there is a known permalink, but it is not in the database, it is vital to be able to reconstruct it.

If the link contains a lid suffix, then it should be possible to find the mail by re-loading all the relevant archives.

If there is no separate lid suffix in the missing Permalink, then the entire corpus may need to be reloaded unless there is some context that helps narrow down the search (e.g. dates, possible list names).

The advantage of the above approach is that there is no need for additional backup data beyond the original archives.
Indeed it should be possible to use sibling archives recorded by other subscribers to the list.

@sbp
Author

sbp commented Sep 14, 2020

What if somebody has a permalink that the administrator doesn't know about? Is it vital to be able to reconstruct that?

The advantage of including manual List-IDs is that it requires no additional backup. But the disadvantages are that it is unreliable, and that it does not remove duplicates across lists, which is what DKIM-ID was designed to do in the first place.

Why recommend an unreliable backup strategy that forces users to bear the burden of storage when there is a reliable alternative that keeps that responsibility with the archive administrator?

@sebbASF
Contributor

sebbASF commented Sep 14, 2020

No, if the link is not known, then clearly it does not have to be reconstructed at that time.

However if all the archives are available, they can all be reloaded.
If a different lid suffix is used (for mails with no List-Id header), that does not matter because the hash prefix will still be the same, so if the Permalink turns up later it will have the same prefix and can thus be found.

==

I don't understand the part about DKIM and de-duplication across lists.
I thought the hash included the List-id etc as input, so it will only de-duplicate mails sent to lists without such headers?

I don't think it's possible to share message content across lists, because the sources will have different headers.
This is true even if no List-* headers are involved.

The current design does share attachments, but there is no attempt to share any other parts of messages. Doing so would require a redesign, and I'm not sure it would be worth the extra complication and house-keeping.

@sbp
Author

sbp commented Sep 14, 2020

If the link is not reconstructed before it is requested then the request fails. Whether a link is known or not to the administrator does not necessarily correlate with how "vital" that link is. An administrator's knowledge of link usage can only ever be ad hoc and incomplete. The reliable backup mechanism avoids this issue, and enables prompt restoration.

Duplicate headers were tested by sending a message to two different mailing lists, both using Mailman and Ponymail. The only difference was the List-ID header, even when including the headers that DKIM-ID discards. Deduplication may occur in other cases too, such as when importing an email that only exists in one archive but is addressed to multiple mailing lists.

@sebbASF
Contributor

sebbASF commented Sep 14, 2020

Obviously it is better to have backups that simplify and speed up restoration.
My point is that with the appropriate Permalink design all is not lost even if such backups are not available.

==

I still don't understand your point about de-duplication.
Perhaps you can provide some examples of such emails?

@sbp
Author

sbp commented Sep 14, 2020

If an email is sent with the addresses of two or more mailing lists in the To headers, then Ponymail will store identical copies of that email in mbox_source except that the List-ID headers will be different. This has been tested with Mailman as the mailing list software.

Since the discussion topic is the case where List-ID is for some reason not added, and since the rest of the mbox_source values are identical apart from that header, the values would be exact duplicates if List-ID were not added.

As for examples, any email with two or more mailing lists in the To header should suffice. The test did not involve any special conditions or setup, just writing a regular mailing list message in a regular email client. Nor was there any Mailman or Ponymail configuration that would affect the test.

This is not the only possible example scenario, as mentioned.

@sebbASF
Contributor

sebbASF commented Sep 14, 2020

Sorry, but I am not convinced. I need to see actual data.

I have been looking for examples in the ASF corpus, but have yet to find one where the list-ids are not added.
However, if the list-id headers added by ezmlm are ignored, the mails are still not identical.
See for example:
apache/incubator-ponymail-unit-tests@be029d9
The Received and Delivered-To lines are different. I don't see how that can fail to be the case.

@sbp
Author

sbp commented Sep 14, 2020

The Received headers are the same.

There is no Delivered-To header. It is not a standard header.

@sbp
Author

sbp commented Sep 14, 2020

Discussion of DKIM-IDs with manual List-IDs appended should be moved to Issue #523.

This thread is for discussion of DKIM-IDs as they appear in the present commit of this PR.

@Humbedooh
Member

This PR is now more than a month old.
I am satisfied that the generator, as intended, works. I intend to merge it.
I agree that we should work on the alternate strategy elsewhere. I see the benefits of it, but I think it's ultimately unfair to the original submission to keep requiring changes that are sort of tangential at best to the original issue, as well as a "side-step" from DKIM.

@sebbASF
Contributor

sebbASF commented Sep 16, 2020

There is a trade-off here.

Let's call the two Permalink designs O and OL, where:

O = opaque hash created from the message source plus the LID (if it differs from the LID in the headers)

OL = opaque hash created from the message source only. If the LID differs from the headers, it is appended to the opaque hash.

Style O

Advantage: fixed Permalink style
Disadvantage: not feasible to regenerate Permalinks without a separate database to relate the LIDs to the messages

Style OL

Advantage: can regenerate Permalinks from just the mail sources
Disadvantage: Permalinks are longer for some mails

Neither is perfect; seems to me that the appropriate choice will depend on the installation.
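The two styles can be sketched side by side. This is an illustration only: `opaque` uses SHA3-256 truncated to 80 bits with the standard base32 alphabet as a stand-in for the PR's canonicalisation and custom alphabet, the way the LID is mixed into the hash input for Style O is an assumption, and the `_` separator for Style OL is borrowed from the `abcd_lid1` notation used in this thread.

```python
import base64
import hashlib


def opaque(data):
    """Stand-in opaque hash: SHA3-256, first 10 bytes, lowercase base32."""
    digest = hashlib.sha3_256(data).digest()[:10]
    return base64.b32encode(digest).decode("ascii").lower()


def style_o(source, lid, header_lid):
    """Style O: a differing LID becomes part of the hash input."""
    if lid != header_lid:
        # Hypothetical way of mixing the LID into the input.
        return opaque(lid.encode("utf-8") + b"\n" + source)
    return opaque(source)


def style_ol(source, lid, header_lid):
    """Style OL: hash the source only; append a differing LID in clear."""
    base = opaque(source)
    if lid != header_lid:
        return base + "_" + lid
    return base


source = b"Subject: hi\r\n\r\nbody\r\n"  # toy message with no List-Id header
o_1, o_2 = style_o(source, "lid1", None), style_o(source, "lid2", None)
ol_1, ol_2 = style_ol(source, "lid1", None), style_ol(source, "lid2", None)
```

In this sketch `o_1` and `o_2` share nothing recognisable, while `ol_1` and `ol_2` differ only in the suffix after `_`. When the LID matches the header, both styles return the same value.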

@sbp
Author

sbp commented Sep 16, 2020

In Style OL it is not possible to "regenerate Permalinks from just the mail sources" because the LID must still be known to obtain the suffix of the full Permalink. It is only possible to regenerate a prefix of the Permalink. Therefore the LIDs must still be known even in Style OL, to obtain the full Permalink. This disproves the supposed advantage of Style OL. And since the LIDs must be known even in Style OL, Style O is always possible too. Therefore Style OL is an inferior variant of Style O.

@sebbASF
Contributor

sebbASF commented Sep 16, 2020

In the case of an OL Permalink which is missing from a rebuilt database, there are two possibilities:

  • The link contains the LID as a suffix, in which case the lid is obviously known.
  • The link does not contain a LID, in which case it is not needed.

When rebuilding a database using OL links, any mails that don't have LID headers will need a suffix.
If possible the original LIDs should be used, but unlike with the O-style links it is not essential.
It suffices to use a best guess as to the original LID that was used.
This is because the opaque hash part is unaffected by the chosen LID.
The opaque hash prefix can be used to find a missing link; it is not essential to rebuild the full Permalink (though of course that is better).

In the case of an O-style Permalink, in order to match the opaque hash, it is necessary to know the exact LID that was used.
Of course if the LID was the same as in the headers that's not necessary, but in that case the O and OL Permalinks are the same anyway.
If the LID is different from the headers, then it's not possible to regenerate the same hash unless the same LID is used.
Given a missing Permalink, AFAICT it will be necessary to generate hashes for each mail with each possible LID, of which there may be hundreds.

Note: I am assuming here that the O-style Permalink includes the lid in the hash, either as an existing header, or as an addition. That is not actually the case for the amended PR; however, as I indicated earlier, that will need to be fixed.
If the same message is posted to two different lists, the copies need to have different Permalinks.
If the O-style link does not include a LID in the hash, then some other means will need to be found to distinguish copies on other lists, for example by appending the LID in clear, which is the OL-style approach.

@sebbASF
Contributor

sebbASF commented Sep 18, 2020

The original PR included the LID in the hash. This should ensure that the target is an email on the correct list.

The current PR -- dfd18eb -- does not include the LID in either the hash or as a suffix.
As such, the same hash may be generated for mails on multiple lists.
A Permalink needs to lead directly to the correct list; it should not be necessary to select a list id from a list of options.

@sbp
Author

sbp commented Oct 2, 2020

It does not suffice to "use a best guess as to the original LID that was used" to reconstruct an OL permalink. If somebody has an opaque_lid permalink, and the message is lost but the Ponymail operator guesses an incorrect new lid, then the original opaque_lid link will break. If links break then they cannot be considered permalinks.

Moreover, if the appended lid in OL were not an integral part of the permalink then it would be necessary to use it anyway to disambiguate which list UI to apply to a message in the archives. This is because, for example and amongst other possible scenarios, the same message could be imported twice with different manual command line lid overrides. Using lid UI disambiguation has already been proposed, and requires handling of the case where the lid is omitted from the link.

One could consider that the current commit dfd18eb already implements OL in the presence of lid UI disambiguation. It is therefore not an accurate characterisation that the current commit "will need to be fixed" to include the lid in the input, because the intention was that lid UI disambiguation would be introduced alongside the current commit, or implemented in a future version with a warning to users in the documentation meanwhile.

@sebbASF
Contributor

sebbASF commented Oct 2, 2020

I agree that choosing a different LID when re-importing will affect a full OL permalink.
However, assuming that the missing message has been reloaded somewhere, it will have the same opaque part, and it is fairly simple to match abcd_lid1 against abcd_lid2.

Of course, if the original LIDs are known then this won't be an issue.

The point is that an OL-style Permalink can be recovered without needing to keep a database of the imports.
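A minimal sketch of that matching, assuming the opaque part never contains `_` and that the LID, when present, follows the first `_`. The function names are hypothetical.

```python
def opaque_part(permalink):
    """Return the hash prefix of an OL-style permalink."""
    return permalink.split("_", 1)[0]


def find_by_opaque_part(missing, rebuilt):
    """Find rebuilt permalinks that share the missing link's hash prefix."""
    wanted = opaque_part(missing)
    return [p for p in rebuilt if opaque_part(p) == wanted]


matches = find_by_opaque_part("abcd_lid1", ["abcd_lid2", "efgh_lid1", "abcd"])
# matches == ["abcd_lid2", "abcd"]
```

Note that this finds candidate messages for a missing link; it does not by itself make the original `abcd_lid1` URL resolve again.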

===

I don't know what "if the appended lid in OL were not an integral part of the permalink" means.
The OL is the variable part of the permalink, and the appended lid is part of the OL, so I don't see how the lid can be "not part of the permalink".

@sbp
Author

sbp commented Oct 3, 2020

The lid is not an integral part of the permalink. One may also say that the lid is not part of the permalink at all, if only the mandatory parts of a link intended for permanence are defined as "the permalink", but that is a mere terminological choice.

Commit dfd18eb implements OL in the presence of lid UI disambiguation. Are there any remaining objections to merging this pull request? There are no review comments on the code in the most recent commit, a month later.

@Humbedooh
Member

I think we should merge. We can work out differences in opinion later on. I'm leaning towards having two generators and leaving it up to the end user to make a decision. Can you adjust the PR so it can be merged?

@sebbASF
Contributor

sebbASF commented Oct 3, 2020

AFAICT, commit dfd18eb does not implement OL, at least not in the way that I defined it.
The output from the generator does not depend on the LID, whereas my definition requires the LID to be appended if the LID differs from the List-Id in the message headers.

If the generator output is also used as the database id, as is currently the case, only one message will be stored.
That needs to be fixed, either by changing the generator output, or by changing the underlying design of Ponymail.

This issue occurs where a message is sent to multiple lists and where the headers used by dkim are identical.
I think this applies mainly to list aliases. However I think Ponymail should be able to handle any type of mail archives.

@sbp
Author

sbp commented Oct 3, 2020

If this PR were to be modified so that the lid is used in the input of the DKIM-ID algorithm, but only when it does not match the List-Id header value of an email, would that be sufficient for this PR to be merged? Or would there be further objections?

@sebbASF
Contributor

sebbASF commented Oct 3, 2020

I have no issue with the code in dkim-id.py per se.

However the dkimid method in generators.py needs to take note of the LID where it differs from the List-ID in the headers.
For example by appending it to the returned id.

@sbp
Author

sbp commented Oct 3, 2020

"For example by appending it to the returned id."

The question was specifically about using the lid in the DKIM-ID input. Would that be acceptable?

@sbp
Author

sbp commented Nov 2, 2020

@sebbASF Would it be acceptable to insert the lid into the DKIM-ID input in case of a mismatch instead of appending it to the DKIM-ID output? Thirty days have elapsed since this thread was active.

@Humbedooh
Member

@sbp I find it acceptable, I think a solution here is to implement that as you wish, and if someone else wants to make an alternate DKIM generator that appends to the ID, then they can make that a reality at a later point in time.

4 participants