Skip to content

Fix tag updates atomicity and deadlocks#483

Merged
Meow merged 7 commits intophilomena-dev:masterfrom
MareStare:fix/get-or-create-tags-atomicity
May 12, 2025
Merged

Fix tag updates atomicity and deadlocks#483
Meow merged 7 commits intophilomena-dev:masterfrom
MareStare:fix/get-or-create-tags-atomicity

Conversation

@MareStare
Copy link
Contributor

@MareStare MareStare commented Mar 30, 2025

Before you begin

  • I understand my contributions may be rejected for any reason
  • I understand my contributions are for the benefit of Derpibooru and/or the Philomena software
  • I understand my contributions are licensed under the GNU AGPLv3
  • I understand all of the above

I've noticed unique constraint violation errors in get_or_create_tags when working on #481. So I replaced the existing impl with two separate queries INSERT ON CONFLICT DO NOTHING and a subsequent SELECT.

But then I started seeing deadlocks in update_all queries for incrementing the image counts on tags. So I fixed it this way.

Deadlock demo diagram is below. This can only happen in bulk-update transactions.

The problem is that postgres doesn't guarantee the order of acquiring row-level locks during the update. So one transaction can acquire its part of the locks, and the other transaction may acquire another part of the locks, thus they both wait on each other's remaining part of the records to be released falling into a deadlock.

image

Tested it with the image seeding script in #483, that does ~50 image uploads concurrently.

@MareStare MareStare force-pushed the fix/get-or-create-tags-atomicity branch from a222e25 to a7acf06 Compare March 30, 2025 19:44
@MareStare MareStare marked this pull request as draft March 30, 2025 19:56
@MareStare MareStare force-pushed the fix/get-or-create-tags-atomicity branch from a7acf06 to ba2fcdc Compare March 30, 2025 20:13
@MareStare MareStare force-pushed the fix/get-or-create-tags-atomicity branch from ba2fcdc to 945b6a3 Compare March 30, 2025 20:37
@MareStare MareStare marked this pull request as ready for review March 30, 2025 20:39
@MareStare MareStare requested a review from liamwhite March 30, 2025 20:39
ok

{:error, err} ->
raise "get_or_create_tags failed: #{inspect(err)}\ntag_names: #{inspect(tag_names)}"
Copy link
Contributor Author

@MareStare MareStare Mar 30, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't like this error handling story, but this function never returned a Result in the first place, but it wasn't annotated with the ! suffix. In fact, it seems that all Repo.*() methods never return a Result. Is this expected? I suppose they use raise internally, but shouldn't they have a ! suffix?

@MareStare MareStare changed the title Fix the tag updates atomicity and deadlocks Fix tag updates atomicity and deadlocks Mar 30, 2025
@MareStare MareStare force-pushed the fix/get-or-create-tags-atomicity branch from e120fae to e72f7db Compare March 30, 2025 21:06
@MareStare
Copy link
Contributor Author

Found this error during seeding after recent change, trying to debug:

app-1  | [error] Task #PID<0.1927.0> started from #PID<0.1926.0> terminating
app-1  | ** (Ecto.NoResultsError) expected at least one result but got none in query:
app-1  | 
app-1  | from i0 in Philomena.Images.Image,
app-1  |   where: i0.id == ^30
app-1  | 
app-1  |     (ecto 3.12.5) lib/ecto/repo/queryable.ex:164: Ecto.Repo.Queryable.one!/3
app-1  |     (philomena 1.1.0) lib/philomena/images/thumbnailer.ex:77: Philomena.Images.Thumbnailer.generate_thumbnails/1
app-1  |     (philomena 1.1.0) lib/philomena/workers/thumbnail_worker.ex:6: Philomena.ThumbnailWorker.perform/1
app-1  |     (exq 0.19.0) lib/exq/worker/server.ex:175: anonymous fn/4 in Exq.Worker.Server.dispatch_work/3
app-1  |     (elixir 1.18.1) lib/task/supervised.ex:101: Task.Supervised.invoke_mfa/2
app-1  | Function: #Function<0.120145007/0 in Exq.Worker.Server.dispatch_work/3>
app-1  |     Args: []

@MareStare MareStare marked this pull request as draft March 30, 2025 21:12
@MareStare
Copy link
Contributor Author

MareStare commented Mar 30, 2025

Hm, can't reproduce that error mode after docker compose down. I suppose I just had my DB messed up during intermediate iterations, where no images were uploaded. Reopening this again for review

@MareStare MareStare marked this pull request as ready for review March 30, 2025 21:21
@MareStare
Copy link
Contributor Author

MareStare commented May 2, 2025

@Meow I discussed this deadlock in PM with Liam. Note that postgres logs every deadlock occurrence in its sever logs like this:

postgres-1  | 2025-03-31 21:18:23.319 UTC [289] DETAIL:  Process 289 waits for ShareLock on transaction 149633; blocked by process 296.
postgres-1  |   Process 296 waits for ShareLock on transaction 149621; blocked by process 289.
postgres-1  |   Process 289: UPDATE "tags" AS t0 SET "images_count" = t0."images_count" + $1 WHERE (t0."id" IN (SELECT st0."id" FROM "tags" AS st0 WHERE (st0."id" = ANY($2)) ORDER BY st0."id" FOR UPDATE))
postgres-1  |   Process 296: UPDATE "tags" AS t0 SET "images_count" = t0."images_count" + $1 WHERE (t0."id" IN (SELECT st0."id" FROM "tags" AS st0 WHERE (st0."id" = ANY($2)) ORDER BY st0."id" FOR UPDATE))

Here, there is good enough context on which queries trapped into a deadlock so it becomes much easier to find the root cause. Postgres never actually hangs on a deadlock, it just detects them and cancels all participating transactions (there can be more than 2 transactions in the deadlock chain, and it happened for me during my testing of my dev seeds parallelization efforts), therefore it's technically not a critical error.

I recommend you to dump the logs from postgres in prod and grep them for ERROR: deadlock detected (not in log snippet higher, but it's also nearby it) to check if there are any real deadlocks happening in prod today.

@Meow Meow added this to the 1.2 milestone May 9, 2025
tags = Tag |> where([t], t.id in ^tag_ids)

{count, nil} = repo.update_all(tags, inc: [images_count: 1])
count = Tags.update_images_counts(repo, +1, tag_ids)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can this not be written simply as 1 instead of +1?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed the +

"""
@spec update_images_counts(term(), integer(), [integer()]) :: integer()
def update_images_counts(repo, diff, tag_ids) do
case tag_ids do
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could this case statement not be eliminated by using pattern matching in the function definition like def update_image_counts(repo, diff, tag_ids) when length(tag_ids) == 0, do: 0

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i'd also rename the function to update_image_counts, instead of plural images

|> Enum.sort()
|> Enum.dedup()
|> Enum.map(
&(%Tag{}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i'd not use this shorthand form here for clarity's sake, and instead would write Enum.map(fn tag_name -> ...

Tag
|> where([t], t.id in ^tag_ids)
|> Repo.update_all(inc: [images_count: 1])
update_images_counts(Repo, +1, tag_ids)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove the + here as well

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in 79c794c


_ ->
locked_tags = vectorized_mutation_lock("FOR NO KEY UPDATE", tag_ids)
def update_image_counts(nil, _diff, []), do: 0
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

match repo to nil? what if repo isn't nil but tag_ids is []?

Copy link
Contributor Author

@MareStare MareStare May 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, I haven't noticed that nil. Copilot autocomplete betrayed me there 💀. FIxed

@Meow Meow merged commit 9331ad4 into philomena-dev:master May 12, 2025
6 checks passed
Meow pushed a commit that referenced this pull request May 12, 2025
* Fix the tag updates atomicity and deadlocks

* Replace `Map.drop()` with `Map.take()` as per Liam's feedback in DM

* Fix error handling since there will be a tuple of more than 2 items returned as per docs

* Replace `+1` with `1` for update_images_counts

* Rename `update_images_counts` to `update_image_counts` and replace `case` with a function-level match

* Replace the capture operator with an explicit `fn tag_name ->`

* Fix remove repo nil match autocompleted by Github Copilot (bruh)
MareStare added a commit to MareStare/philomena that referenced this pull request May 20, 2025
* Fix the tag updates atomicity and deadlocks

* Replace `Map.drop()` with `Map.take()` as per Liam's feedback in DM

* Fix error handling since there will be a tuple of more than 2 items returned as per docs

* Replace `+1` with `1` for update_images_counts

* Rename `update_images_counts` to `update_image_counts` and replace `case` with a function-level match

* Replace the capture operator with an explicit `fn tag_name ->`

* Fix remove repo nil match autocompleted by Github Copilot (bruh)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants