
feat: beginnings of a mysql backend #22

Merged
bbangert merged 1 commit into master from feat/18 on Sep 11, 2018

Conversation


@pjenvey pjenvey commented Sep 10, 2018

w/ some initial calls and a test suite migrated from the sqlite version

  • Prefers raw DQL (note: not DML) queries over diesel's query builder, for
    potential reuse with other backends (spanner)

  • TODO: further fleshing out of the types, likely wanting i64 or wrappers
    everywhere (as all spanner has is INT64) -- the db layer also shouldn't be
    responsible for conversions from unsigned

Issue #18

@pjenvey pjenvey requested a review from bbangert September 10, 2018 22:52
@bbangert bbangert requested a review from rfk September 11, 2018 00:13
Contributor

What kind of modification would this undergo?

Contributor

I don't think we need a modified timestamp here, because any modification timestamp is going to be per-user rather than global.

Contributor

Again, I'm curious about what kind of modifications are being tracked here. I realize this is likely due to something the Go/Python version tracks, but it'd be nice to document either inline or near here why this field exists.

Contributor

This should be the last-modified timestamp of the collection as a whole, and should change whenever an item is added, modified or deleted in the collection. It will be greater-than-or-equal-to the MAX() of the modified column for that collection in the bso table, with the greater-than case being because an item was deleted.

FWIW, the python server has a bug here in its handling of deleted collections:

mozilla-services/server-syncstorage#62

Basically, it tries to calculate the last-modified time of the storage as a whole by doing SELECT MAX(modified) FROM user_collections WHERE uid = X. That's incorrect in the case of a deleted collection, which should cause the last-modified time of the storage as a whole to increase, but won't affect the last-modified time of any remaining collections.

It's clearly an edge case, because we haven't bothered to actually fix it in the python version. But for greenfield code it's probably worth doing it right the first time. I suggest adding a deleted_at timestamp field to the user_collections table, so that we can explicitly track the deletion of collections. But that can be a follow-up issue if necessary.
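A hedged sketch of that suggestion (table and column names follow the discussion above; the exact layout is illustrative, not from the PR):

```sql
-- Hypothetical sketch: record collection deletions explicitly instead
-- of only deleting the rows.
ALTER TABLE user_collections ADD COLUMN deleted_at BIGINT DEFAULT NULL;

-- The storage-level last-modified time then accounts for deletions too:
SELECT MAX(GREATEST(modified, COALESCE(deleted_at, 0)))
  FROM user_collections
 WHERE uid = :uid;
```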

Member Author

#23

src/db/error.rs Outdated
Contributor

Nice!

Contributor

@rfk rfk left a comment

This looks like a great start!

Architecturally, I'm a little worried about the way that the DB is taking transactions and querying the current time from behind its API boundary. This could lead to subtle edge cases under high concurrency, such as two concurrent PUTs inserting items with the same timestamp into the same collection, which could cause the two clients to never sync down each other's changes.

Ideally, each HTTP request would be processed in a single logical transaction and at a single logical instant in time. (It might help to think of the modified integers here not as timestamps, but as opaque version numbers, with a new version number being generated for each change to the collection).

I wonder if there's a way to help enforce that at the DB trait API level here.
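One way to make the "opaque version number" idea concrete (a standalone sketch under my own assumptions, not the PR's actual API): a source that hands out strictly increasing modified values even when the clock hasn't advanced between two concurrent writes.

```rust
use std::sync::atomic::{AtomicI64, Ordering};

// Sketch only: a source of "modified" versions that can never hand the
// same value to two concurrent writers, even if both observe the same
// wall-clock millisecond.
struct VersionSource {
    last: AtomicI64,
}

impl VersionSource {
    fn new() -> Self {
        VersionSource { last: AtomicI64::new(0) }
    }

    // Prefer the wall clock, but bump past the previous version if the
    // clock hasn't advanced (or has gone backwards).
    fn next(&self, now_ms: i64) -> i64 {
        let mut prev = self.last.load(Ordering::Relaxed);
        loop {
            let candidate = now_ms.max(prev + 1);
            match self.last.compare_exchange(
                prev, candidate, Ordering::SeqCst, Ordering::Relaxed,
            ) {
                Ok(_) => return candidate,
                Err(actual) => prev = actual,
            }
        }
    }
}

fn main() {
    let vs = VersionSource::new();
    assert_eq!(vs.next(100), 100); // clock value used directly
    assert_eq!(vs.next(100), 101); // same millisecond: bumped past
    assert_eq!(vs.next(50), 102);  // clock went backwards: still monotonic
    println!("ok");
}
```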

Contributor

The BSO payload can in theory be arbitrary unicode, since it's JSON. Should we set an explicit encoding like utf8mb4 to guard against any weirdness in unicode handling?
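For context, MySQL's legacy utf8 charset stores at most 3 bytes per character, so 4-byte code points (emoji, some CJK) require utf8mb4. A hedged sketch of an explicit declaration (the bso table name follows the discussion; the column layout is illustrative):

```sql
-- Illustrative only: declare 4-byte unicode explicitly so arbitrary
-- JSON payloads round-trip without charset surprises.
CREATE TABLE bso (
    payload MEDIUMTEXT CHARACTER SET utf8mb4 COLLATE utf8mb4_bin NOT NULL
    -- ... other columns elided ...
) ENGINE = InnoDB DEFAULT CHARSET = utf8mb4;
```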

Contributor

This payload_size is a separate column to (in theory) make it quicker to calculate total size of items stored in a collection. IIRC we don't ever do that in practice, so it may be worth considering whether that optimization still makes sense for us here.

Member Author

payload_size does appear to be utilized in the info/quota and info/collection_usage calls

Contributor

It might be interesting to compare SELECT SUM(payload_size) vs SELECT SUM(LENGTH(payload)) for this purpose to see whether having it as a separate column really provides much value in practice.

To be clear, I don't have any particular objection to keeping it, just wondering whether the extra complexity pays for itself or not.


Contributor

The "id" here is the batch id, right? It may be worth naming it "batch_id" or similar to avoid confusion.

Contributor

It's not obvious to me why some of these are suffixed with _sync and some are not; what's the significance?

Member Author

The higher-level db interface (trait) supplies async calls, so the MysqlDb impl of this trait will call the sync methods via the tokio blocking wrapper call.
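A minimal illustration of that split (names are illustrative, and std::thread stands in for tokio's blocking pool, which isn't available in a self-contained sketch):

```rust
use std::thread;

// Sketch: the *_sync suffix marks the blocking worker that actually
// talks to the database; the async-facing layer dispatches it onto a
// blocking-friendly executor (tokio's blocking call in the real code).
struct MysqlDb;

impl MysqlDb {
    fn get_collection_id_sync(&self, name: &str) -> i64 {
        // Stand-in for a real blocking query.
        name.len() as i64
    }
}

fn main() {
    let db = MysqlDb;
    // thread::spawn stands in for the tokio blocking wrapper here.
    let handle = thread::spawn(move || db.get_collection_id_sync("bookmarks"));
    assert_eq!(handle.join().unwrap(), 9);
    println!("ok");
}
```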

Contributor

Please do :-)

Contributor

It's not obvious to me whether this defaults sortindex to NULL or to 0; I believe NULL is the correct behaviour.

Member Author

Good catch; this is leftover from the port from Go, which instead defaulted to 0.
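A hedged fragment of what the NULL-defaulting column definition could look like (hypothetical; not quoted from any file in the PR):

```sql
-- An insert that omits sortindex then stores NULL, not 0.
sortindex INT DEFAULT NULL
```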

Contributor

The docs aren't clear on what happens when limit < 0; does it default to "no limit"?

Contributor

FWIW I wouldn't expect the ttl to be updated unless you explicitly sent a new TTL in the update.

Contributor

I assume the // XXX comments are there to maybe do a conversion with ? so we can ensure we catch casting errors?
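If that's the intent, the conversion could look something like this (a sketch; to_db_int is a hypothetical helper name, not from the PR):

```rust
use std::convert::TryFrom;

// Sketch: convert unsigned input into the i64 the db layer stores,
// surfacing overflow as an error (usable with ?) instead of silently
// wrapping.
fn to_db_int(n: u64) -> Result<i64, std::num::TryFromIntError> {
    i64::try_from(n)
}

fn main() {
    assert_eq!(to_db_int(42).unwrap(), 42);
    assert!(to_db_int(u64::MAX).is_err()); // exceeds i64::MAX
    println!("ok");
}
```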

bbangert previously approved these changes Sep 11, 2018
Member Author

pjenvey commented Sep 11, 2018

Will address further things later (likely switching get_bsos limit to u64, table encoding, batches table/architecture is still up in the air).

I'm already intending to have the transactions work like you suggest @rfk -- a lot like the python version: one transaction started per handler request (TODO: a transaction() call added to the Db trait), with all db calls taking place within it.

The modified timestamp will likely follow the same pattern.
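That per-request transaction shape might look roughly like this (illustrative names only; the actual trait method is still a TODO above):

```rust
// Sketch: every handler runs all of its db calls inside one
// transaction scope provided by the Db trait.
trait Db {
    // A real impl would BEGIN, run the closure, then COMMIT or ROLLBACK.
    fn transaction<T, F: FnOnce(&Self) -> T>(&self, f: F) -> T;
}

struct MysqlDb;

impl Db for MysqlDb {
    fn transaction<T, F: FnOnce(&Self) -> T>(&self, f: F) -> T {
        // BEGIN would go here.
        let result = f(self);
        // COMMIT (or ROLLBACK on error) would go here.
        result
    }
}

fn main() {
    let db = MysqlDb;
    // A handler's db work all happens inside the one scope.
    let n = db.transaction(|_db| 1 + 1);
    assert_eq!(n, 2);
    println!("ok");
}
```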

@pjenvey pjenvey force-pushed the feat/18 branch 2 times, most recently from 5722817 to 64d8a48 on September 11, 2018 at 23:28
@bbangert bbangert merged commit fbd314b into master Sep 11, 2018
@bbangert bbangert deleted the feat/18 branch September 11, 2018 23:53