Successive syncs create duplicate chat objects #135

@netvor

Successive (full) syncs cause the same chat object to be downloaded multiple times. The compressed files all differ, yet their decompressed contents are identical:

$ cd gmvault-db/db/chats/

$ md5sum */1437017132845624821.eml.gz
93db2cdf56f65c3393f48a7ac3822a89  subchats-2/1437017132845624821.eml.gz
7d891df7672a2346741de39773dc9810  subchats-3/1437017132845624821.eml.gz
cc2dac40e35bdfbcd06db0e6785d1f77  subchats-4/1437017132845624821.eml.gz

$ cp subchats-2/1437017132845624821.eml.gz /tmp/sc2.gz
$ cp subchats-3/1437017132845624821.eml.gz /tmp/sc3.gz
$ cp subchats-4/1437017132845624821.eml.gz /tmp/sc4.gz
$ gunzip /tmp/sc2.gz
$ gunzip /tmp/sc3.gz
$ gunzip /tmp/sc4.gz

$ md5sum /tmp/sc*
8f96d8ec223ea64c13a028cc9038a694  /tmp/sc2
8f96d8ec223ea64c13a028cc9038a694  /tmp/sc3
8f96d8ec223ea64c13a028cc9038a694  /tmp/sc4
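The manual copy/gunzip/md5sum check above can be scripted. This is only a minimal Python sketch: it assumes the layout shown in the transcript (`subchats-*/<id>.eml.gz` under `db/chats/`), the helper names are mine and not part of gmvault, and it hashes with SHA-256 instead of MD5.

```python
import gzip
import hashlib
from collections import defaultdict
from pathlib import Path


def plaintext_digest(path):
    """Hash the decompressed payload of one .gz file."""
    with gzip.open(path, "rb") as fh:
        return hashlib.sha256(fh.read()).hexdigest()


def group_by_plaintext(db_chats_dir, chat_id):
    """Map each plaintext digest to the subchats-* directories holding that copy."""
    groups = defaultdict(list)
    pattern = f"subchats-*/{chat_id}.eml.gz"
    for path in sorted(Path(db_chats_dir).glob(pattern)):
        groups[plaintext_digest(path)].append(path.parent.name)
    return dict(groups)


if __name__ == "__main__":
    # Hypothetical invocation from the directory that contains gmvault-db/.
    for digest, dirs in group_by_plaintext("gmvault-db/db/chats", "1437017132845624821").items():
        print(digest[:12], dirs)
```

A single digest that lists several directories means those copies are byte-identical once decompressed.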

These duplicates are not created every time. Generally when there is nothing to update (no new emails or chats) it does not happen, but when there is a new chat recorded, I usually get a duplicate. As far as I know this only affects chat objects, not mail objects.

To localize the problem better, I disabled compression and did a series of --chats-only syncs.

  1. Initial sync: gmvault sync --no-compression --chats-only. 1267 chats stored in subchats-1
  2. Force an update, i.e. send a chat message (I use a 3rd-party Jabber client, not the native Google app)
  3. gmvault sync --no-compression --chats-only. This time 1268 chats stored in both subchats-1 and subchats-2
  4. gmvault sync --no-compression --chats-only. This time no change.
  5. Force an update
  6. gmvault sync --no-compression --chats-only. This time 1268 chats stored in subchats-1 and subchats-2, 1269 chats stored in subchats-3
  7. gmvault sync --no-compression --chats-only (so no update). This time 1268 in subchats-1 and -2, 1269 in -3 and 538 (huh?) in subchats-4
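The per-directory counts in the steps above could be collected with something like the following Python sketch. It assumes chat files are named `<id>.eml` (or `<id>.eml.gz` when compression is on) inside `subchats-N` directories; `summarize_subchats` is a name I made up, not a gmvault function.

```python
from collections import defaultdict
from pathlib import Path


def summarize_subchats(db_chats_dir):
    """Return (chat count per subchats-* dir, chat ids stored in more than one dir)."""
    counts = {}
    locations = defaultdict(list)
    for subdir in sorted(Path(db_chats_dir).glob("subchats-*")):
        # Count only message files, not the accompanying .meta files.
        emls = [p for p in subdir.iterdir() if p.name.endswith((".eml", ".eml.gz"))]
        counts[subdir.name] = len(emls)
        for p in emls:
            chat_id = p.name.split(".", 1)[0]
            locations[chat_id].append(subdir.name)
    duplicates = {cid: dirs for cid, dirs in locations.items() if len(dirs) > 1}
    return counts, duplicates


if __name__ == "__main__":
    # Hypothetical invocation from the directory that contains gmvault-db/.
    counts, duplicates = summarize_subchats("gmvault-db/db/chats")
    print(counts)
    print(len(duplicates), "chat ids stored more than once")
```

Running it after each sync shows which step introduces a new directory and how many ids it repeats.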

So the behavior is not very predictable. Also, the differing md5sums of the .gz duplicates are only a side effect of gzip storing the timestamp of the .eml in the .gz file.
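To illustrate the timestamp effect in isolation: the gzip header carries a 4-byte MTIME field at offset 4, so the same payload compressed with two different timestamps yields different bytes. The sketch below uses Python's standard `gzip.GzipFile` with its `mtime` argument; the payload and timestamps are made up, and I have not checked which timestamp gmvault itself passes.

```python
import gzip
import io
import struct


def gzip_with_mtime(payload: bytes, mtime: int) -> bytes:
    """Compress payload in memory, stamping the given mtime into the gzip header."""
    buf = io.BytesIO()
    with gzip.GzipFile(fileobj=buf, mode="wb", mtime=mtime) as gz:
        gz.write(payload)
    return buf.getvalue()


def header_mtime(blob: bytes) -> int:
    """Read the 4-byte little-endian MTIME field at offset 4 of a gzip member."""
    return struct.unpack("<I", blob[4:8])[0]


# Made-up payload and timestamps, one minute apart.
payload = b"From: someone@example.com\r\n\r\nsame chat body\r\n"
first = gzip_with_mtime(payload, 1_000_000_000)
second = gzip_with_mtime(payload, 1_000_000_060)

print(first == second)                                    # compressed bytes differ
print(gzip.decompress(first) == gzip.decompress(second))  # plaintext is the same
```

Only the four MTIME bytes differ between the two blobs, which is enough to change the md5sum of the whole .gz file.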

As to the duplicates, after accumulating these four subchats- folders, I discovered they are not always identical: if they are Content-Type: multipart/alternative, then the "boundary" string differs between duplicates. The .meta files are always identical.
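To compare duplicates while ignoring the boundary difference, one option is to replace each boundary string with a fixed placeholder before hashing. This is a rough Python sketch using the standard `email` module; the function name and placeholder are my own, and it assumes the boundary is the only thing that differs between copies.

```python
import email
import hashlib

PLACEHOLDER = b"NORMALIZED-BOUNDARY"


def boundary_insensitive_digest(raw: bytes) -> str:
    """Hash a raw message after replacing every MIME boundary with a placeholder."""
    msg = email.message_from_bytes(raw)
    for part in msg.walk():
        # get_boundary() returns None for parts without a boundary parameter.
        boundary = part.get_boundary()
        if boundary:
            raw = raw.replace(boundary.encode("ascii", "replace"), PLACEHOLDER)
    return hashlib.sha256(raw).hexdigest()


def sample_message(boundary: str, body: str = "hello") -> bytes:
    """Build a tiny multipart/alternative message (made-up test data)."""
    return (
        f'Content-Type: multipart/alternative; boundary="{boundary}"\r\n'
        "\r\n"
        f"--{boundary}\r\n"
        "Content-Type: text/plain\r\n"
        "\r\n"
        f"{body}\r\n"
        f"--{boundary}--\r\n"
    ).encode("ascii")


a = boundary_insensitive_digest(sample_message("=_part_0001"))
b = boundary_insensitive_digest(sample_message("=_part_9999"))
print(a == b)
```

Two copies that differ only in their boundary strings then hash to the same value, while a genuine content difference still shows up.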

I suppose my main question is: what is the logic behind creating new subchats- directories?
