[RHELMISC-26727] Convert dory device to dm-lossy for upstreaming #437
lorelei-sakai wants to merge 33 commits into dm-vdo:main from
Conversation
Copy dm-dory sources to prepare for the transition to a proper device mapper target, and rename files. Signed-off-by: Matthew Sakai <msakai@redhat.com>
Merge common.[hc] into the dm-lossy-target.c and update makefiles to use the new target name. Also remove perforce tags and other unnecessary bits from files. Signed-off-by: Matthew Sakai <msakai@redhat.com>
Also add a carveout for the header file not needed upstream. Signed-off-by: Matthew Sakai <msakai@redhat.com>
Replace typedefs with bare type definitions. Signed-off-by: Matthew Sakai <msakai@redhat.com>
Also remove unused functions and declarations, and move a few functions to more logical locations. Signed-off-by: Matthew Sakai <msakai@redhat.com>
Signed-off-by: Matthew Sakai <msakai@redhat.com>
Signed-off-by: Matthew Sakai <msakai@redhat.com>
Replace sscanf with kstrtouint, kstrtoul, and kstrtoull. Replace print warning with DMWARN. Signed-off-by: Matthew Sakai <msakai@redhat.com>
Signed-off-by: Matthew Sakai <msakai@redhat.com>
Also remove unused values. Signed-off-by: Matthew Sakai <msakai@redhat.com>
Also replace variable names derived from the dory name, and functions using the 'common' prefix. Signed-off-by: Matthew Sakai <msakai@redhat.com>
Signed-off-by: Matthew Sakai <msakai@redhat.com>
Signed-off-by: Matthew Sakai <msakai@redhat.com>
Remove unnecessary braces for single-line branch statements and explicit checks against NULL or 0. Also remove many unnecessary parentheses. Signed-off-by: Matthew Sakai <msakai@redhat.com>
Move the mask and modulus values to the table line, since they rarely change while the device is in use. Validate the values given for these parameters to ensure they are meaningful. Also remove the messages for setting and displaying the mask and modulus, because they are no longer necessary. Signed-off-by: Matthew Sakai <msakai@redhat.com>
Remove the message for setting the returnEIO value explicitly. In its place, add a presuspend function to stop returning bio errors when suspending, so that stacked devices can clean up easily. This makes dm-lossy behave like our perl tests without any explicit management of the error return value. Signed-off-by: Matthew Sakai <msakai@redhat.com>
Signed-off-by: Matthew Sakai <msakai@redhat.com>
Do not update types defined by external interfaces, though. Signed-off-by: Matthew Sakai <msakai@redhat.com>
Rename functions to better describe what they do, and rename variables to be more descriptive and less awkward. Signed-off-by: Matthew Sakai <msakai@redhat.com>
Signed-off-by: Matthew Sakai <msakai@redhat.com>
Pare down unnecessary comments and convert remaining ones to correct kernel comment format. Also convert function doc comments to proper kerneldoc format. Signed-off-by: Matthew Sakai <msakai@redhat.com>
Signed-off-by: Matthew Sakai <msakai@redhat.com>
Combine the COPYING and WRITING cache states, because the device does not care why a block is busy, only that it is. Also trim down the description of the remaining cache states. Signed-off-by: Matthew Sakai <msakai@redhat.com>
Most use cases won't need very many cache blocks, and keeping the cache size small reduces the potential for memory allocation or management problems. Signed-off-by: Matthew Sakai <msakai@redhat.com>
Signed-off-by: Matthew Sakai <msakai@redhat.com>
The bio_add_page call should never fail, but we still need to check the return value. If it fails anyway, fail the bio and don't send it onwards. This is the same thing we would do if the device were stopped, so combine those paths. This removes a warning, but end_flush_cache_block() will log an error so it will still get reported. Signed-off-by: Matthew Sakai <msakai@redhat.com>
This simplifies bio initialization and cleanup, and using pooled bios may also be more memory-efficient. Signed-off-by: Matthew Sakai <msakai@redhat.com>
Signed-off-by: Matthew Sakai <msakai@redhat.com>
Each if-else branch returns from the function, so get rid of the 'else' statements to improve legibility. Signed-off-by: Matthew Sakai <msakai@redhat.com>
Also fix other whitespace style mismatches, and a few line length issues. Signed-off-by: Matthew Sakai <msakai@redhat.com>
Thanks for bringing that to my attention. When preparing for this work, I looked at some dm targets for similar functionality, but I didn't take a look at this one specifically. However it looks to me like they do very different (though complementary) things. dm-log-writes appears to record and replay a specific series of disk events, meaning that you can analyze and reproduce any issue that you have seen. On the other hand, lossy (dory) uses adversarial flushing to try to provoke possible but problematic disk states which may be difficult to produce with normal disk operation (and which can then be analyzed with dm-log-writes). The device I thought was closest to dm-lossy is dm-flakey, which also tries to simulate bad storage behavior. dm-lossy's main advantage there is in being able to control the timing of the bad disk behavior, instead of relying on a timer. Thoughts?
Dory's function is that it drops recent writes that haven't been flushed, right? It keeps a cache of unflushed writes, and drops them when it is signaled. (I think you took out the torn write support, since it's equivalent to having 512-byte blocks?) As I recall, the key feature is that it drops some writes shortly after the last flush, which is more interesting than just curtailing sometime after the last flush. fstests has some tests which use dm-log-writes to exhaustively test the consistency of the filesystem at every flush/fua/etc. For instance, generic/482. I believe this came out of the CrashMonkey paper. This seems like using dory, but being able to check every flush, every interesting point generated by a particular workload's write pattern, instead of just one. I think the replay-log tool for dm-log-writes could be extended pretty easily to get an equivalent effect -- write some things after the last flush, but not all of them / not just the first ones -- strictly with userspace changes. So... I think dory is great, it's very easy to use and fast, we've gotten a lot of good out of it, but I do think dm-log-writes can do very close to what dory is used for already :(
		return -ENOMEM;
	}

	if (cache_block_count > 0) {
If cache_block_count is 0 then we don't allocate anything for it here and the pointer is what was set up on the stack? What would happen then in the vfree call later on?
vfree can take a NULL pointer (like most memory freeing functions) so the free call ends up being a no-op.
	unsigned int sz = 0;

	switch (type) {
	case STATUSTYPE_INFO:
I mentioned this to you in person but I will write it here. We should consider which of the dmsetup message read items might belong here instead.
I'll think about what this should look like next week.
		result = DM_MAPIO_SUBMITTED;
	} else {
		sector_t block_number;
		u16 slotNumber;
		DMEMIT("%u %s %u %llu\n", i, state,
		       waiter_count, (unsigned long long)sector);
	}
}
We probably need a default here for messages it doesn't understand.
Good catch, I'll add one.
 * Some I/O is currently using the cache block contents. This can
 * be an incoming write storing data in this cache block, or else
 * the block contents are being written to the underlying storage.
 * Incoming I/O must wait for the current operation to complete.
I feel like we're talking past each other here. dm-lossy does a fundamentally different thing from dm-log-writes. They cannot simply be substituted for one another. Saying dm-lossy is not useful for the particular testing scenario you're describing is very different from saying it's not useful for anything at all. It seems to me that the requirements for file system testing and the requirements for block driver testing are not the same.
dm-lossy specifically caches and drops the oldest unflushed writes. This is most interesting when some newer unflushed writes are persisted while the oldest ones are dropped. (By the way, you can read the device code. I would much prefer to discuss what this actual driver does, rather than a hazy recollection of what it used to do. I was surprised myself to discover that the behavior of this device did not align with my recollections of what I thought it did.)
Torn write support is not inherent in dm-lossy; it's just a thing that can be produced by dropping some single-sector writes. But I did not remove any of it, and in fact I had to fix it because it relied on caching code that had stopped working.
The CrashMonkey paper specifically says they don't consider failures that happen between persistence points (i.e., they don't cover every interesting point). This may be a valid assumption for file system tests, but that does not mean it is valid for driver testing. I believe the interesting cases for drivers are precisely the ones where some unflushed writes are persisted and others are not.
In particular, I think dm-lossy's value is in helping to determine which persistence patterns can cause problems. The paper specifically notes that exhaustively testing every possible write pattern is expensive to the point of being intractable. replay-log may be able to deterministically produce any desired write pattern, but it doesn't seem to do anything to help determine which write patterns would be worth reproducing.
I have, indeed, not successfully communicated my point: the state generated by dm-lossy could be generated from the data saved by dm-log-writes, and thus the same testing could be reimplemented atop dm-log-writes with some userspace changes. When there is such similar functionality, and we could get away with userspace-only changes extending the existing kernel tool, we shouldn't add something new to the kernel and fragment the testing ecosystem further. As such, I do not believe dm-lossy should go in the upstream kernel. I would be fine to delay this conversation until you send to the mailing list, if you'd prefer -- I thought it'd be lighter-weight to have this conversation on github ahead of time.
Apologies for the weak wording; I did read the code before writing.
Just because something is done in filesystem testing land doesn't mean it's the right solution for block device testing, or vice versa, agreed. But we should cross-pollinate good ideas in both directions, and avoid redundancy.
Torn write support is equivalent to 512 byte block support plus a patterned choice of what sectors to drop, instead of always picking the first ones after the flush, isn't it? (I think supporting arbitrary block sizes is a useful property, particularly 512 byte...)
I believe it's an equivalently valid assumption for both (that is to say: the most interesting points are the ones at flushes, but it's possible for there to be interesting points in between). I think it is useful to test points other than flush points, and vdo certainly pioneers this, which is great.
I think dm-lossy represents two particular choices of persistence pattern: drop by pattern (torn writes), and drop the first N blocks touched after a flush. The current use of dm-log-writes plus the replay tool represents a different choice of persistence pattern: drop the last N writes. dm-lossy requires picking an N before testing; dm-log-writes allows picking every N or only a specific N. Both are persistence patterns picked by the test author. But we can extend the replay tool to these additional persistence patterns without making any kernel change. And if we can already achieve the same result with existing kernel code, adding a second way to test, with less flexibility and more straightforward use, is not a good choice: it locks the kernel into a new interface presented to userspace forever, even if it's just intended for testing, and therefore additional code forever. (I envision new args --drop-since-flush= and --prev-flush to replay-log. Then replay-log --end-mark=stoppoint --prev-flush --drop-since-flush=ffffffff would replay up to the last flush before stoppoint, then drop 32 writes, then continue replaying. --prev-flush requires a scan backwards to find the last flush before the end-mark, but the existing log_seek_next_entry infrastructure already supports this pattern; seek_to_mark does exactly the same thing. Filtering by ordinal position is one additional condition in log_replay_next_entry alongside the existing log_should_skip sector-range filter, which already demonstrates this pattern. Thus, the dm-lossy behavior can be achieved purely with userspace changes.) If you do send this upstream, please be sure to cc me (sweettea-kernel@dorminy.me).
#include "dm-lossy-target.h"
#endif /* VDO_UPSTREAM */

#define MAX_CACHE_BLOCKS 236
Why is this 236? This deserves at least an explanatory comment, although being more UINT_MAX shaped would be even better.
The number is sort of arbitrary; the limit used to be UINT16_MAX - 19, and now it's UINT8_MAX - 19, but I never figured out why the 19 was there. At any rate, the maximum number of cache blocks should be fairly low. There isn't much value in having a large number of cache blocks, since that would probably end up more or less identical to simply dropping all writes after a flush. And because we have to allocate backing storage for all cache blocks, a larger number of blocks also increases the memory cost of this target. (In our actual testing, the number of cache blocks is never higher than 11.)
I can add a comment to that effect, though.
	atomic64_t submitted_returns;
	atomic64_t submitted_bios;
	atomic64_t success_bios;
	atomic64_t error_bios;
So many statistics. Total_reads and total_writes are redundant with queue stats, aren't they? Since it's a test device, can we just use a BPF program in the test setup to count each of these instead of putting it in the kernel tree?
I wasn't thinking about the content of the logging very hard, and I agree it should really get trimmed down to not be redundant with other, more generic reporting mechanisms. I'll take a pass at that if this ends up moving forward.
Man I really wish github had a way to properly thread this conversation. Anyway, thank you for sticking with it. Your extended comments have, I think, finally allowed me to understand your original point. I believe you are saying that whatever interesting pattern of writes dory might produce, we should be able to produce and test the same patterns with dm-log-writes (and probably at lower cost, as well). That point sounds very plausible to me. The main motivation for upstreaming dm-lossy is to enable certain types of dm-vdo testing. The next step for me, then, is to see if we can, in fact, create the testing we want by using dm-log-writes (or other existing dm targets) instead. If that works out, I agree there's no real purpose in submitting dm-lossy. (I do mildly wish that we had had this discussion a month ago, when I was trying to figure out whether this project was useful. But c'est la vie.)
On a strictly pedantic note, I'm not at all certain that what's interesting to a filesystem is the same as what's interesting to a block driver. But it also doesn't matter. We can each test whatever scenarios we find interesting.
For our torn write testing, we'd also want to test the address of the writes to drop, to make sure we're dropping particular sectors. And we might also want to filter for address range to hit particular kinds of metadata. But that's just a couple more filters so it still seems tractable.
Will do.
Putting this on hold while investigating an alternate testing strategy.
@lorelei-sakai I have a question about what kinds of tests you want dm-lossy to do. It seems to me that the biggest difference between dm-lossy and dm-log-writes is that dm-log-writes will only allow you to simulate what happens when you are starting up after a crash. You run your test to generate the IO pattern, replay a portion of it, possibly skipping some writes along the way, and then start up vdo on the device, just like you were booting up after a crash. If that's all that you want to use dm-lossy for, then I think @sweettea is correct. You can generate your IO pattern with dm-log-writes, and then use replay-log (possibly with some modifications for individual dropped writes) to get to whatever crash state you want. dm-log-writes can't show you what happens to a running system, when writes fail or don't get written durably to storage. Like you said @lorelei-sakai, that is something that dm-flakey can do, but not with any precision. If you're trying to test what will happen in a specific situation, dm-flakey does not make that very easy or reproducible. So, I can see dm-lossy having value as a more tunable dm-flakey, if you are looking to test the effect of errors on a running system. You could also look into modifying dm-flakey with a message() function, and a more complex flakey_map() function for deciding when you should flake on a bio. But dm-flakey doesn't have the capability to cache a dropped write for a while, to more accurately mimic a device with a writeback cache that is going to fail when it's time to flush the block to durable storage. I'm not sure how important that is for testing specific scenarios.
@bmarzins The existing tests are all crash recovery tests, so I'm going to focus on those scenarios first. But I'm intrigued by the possibility of tests of a running system. It might be worth spelling out what that sort of testing would look like, to figure out how valuable it is, and whether those cases could be served by extending dm-flakey in some way, or whether they might justify a new target.
This series transforms dmDory into the dm-lossy device in preparation for sending it upstream. When submitting this upstream, all the changes will be squashed into a single commit. However, this series is 33 commits so that I can show the transformation in smaller, hopefully easy-to-understand steps. Each commit should be focused on a single concept or cleanup, and as a result some patches are quite small.
In addition to the target transformation (30 commits), this series also includes the test change necessary to move our existing Dory tests to use dm-lossy instead (3 commits). The device code has been validated by running our existing Rebuild and TornWrite tests.