Add zstd support #241

Open
sweettea wants to merge 18 commits into dm-vdo:main from sweettea:features/zstd

Conversation

Contributor

@sweettea sweettea commented Feb 3, 2025

This adds zstd support to VDO.

Zstd support has been requested on the mailing list. It's a feature which comparable storage technologies have, and which is generally viewed as the best available compression technology.

This adds support and some tests for it.

@github-actions github-actions bot requested a review from bjohnsto February 3, 2025 18:07
@sweettea sweettea force-pushed the features/zstd branch 2 times, most recently from ca7a259 to 62873c0, February 4, 2025 03:14
@sweettea sweettea closed this Feb 4, 2025
@sweettea sweettea reopened this Feb 4, 2025
Contributor Author

sweettea commented Feb 4, 2025

@bjohnsto hopefully this one I did right?

@sweettea sweettea marked this pull request as ready for review February 4, 2025 03:21
@lorelei-sakai
Member

I denied testing on this one partly to see if it would work, but mostly because this is a big PR and I haven't had time to give it a good read yet. Not that I don't trust you but... Let's stick to using the small PR to test the CI, okay? Thanks.

@lorelei-sakai
Member

I will say, I like what I've read so far. I think this feature could use some real design discussion so we all agree on how it should work. If you have any high-level design thoughts to share, that would be great. (The sort of thing that might go in a cover letter, for example.)

For some initial highlights:

  1. Do you envision being able to change the compression algorithm any time we reload a new table, or just set it at format time? (I think setting it on table reload should work fine, but I want to clarify what you intended.)
    1a) I don't really like the idea of reading the device_config every time we go to compress a block. I don't think we can change this more often than once per suspend/resume cycle. So couldn't you read the device config once (say at resume time) and store the algorithm/level values locally in the packer? That's probably more efficient.
  2. Setting the compression level is a cool addition, but lz4 also supports different levels and I think it would make sense to have a "compression-level" parameter in general, independent of what algorithm is used. (In our internal discussions, we've imagined "add new compression algorithm support" and "add support for configuring the compression level" as separate stories, although they are obviously closely related.)
  3. Putting the compression algorithm in the block header is nice, it means we can have a mix of old and new compressed blocks, but it does mean having all algorithms available in case we need to decompress data written with a different algorithm than we're currently using. That's probably not an issue? It also means we have to flush the packer so we don't mix fragments in a single block, but I think suspending will already do that. How closely did you look at this case?
  4. The zstd-shims.h thing is ugly, I wonder if there's a better way around that?
  5. Since this is a change to the outward facing interface (i.e. the table line), it should be documented.
  5a) Also the dm device version should be updated to reflect the changed behavior. (I didn't see it, but I might have overlooked it?) I really like the idea of replacing compression=(on|off) with an algorithm name, but it's a change.
    5b) If either the compression flag or the compression level is a required parameter, that means making a new version for the table line, too.
  6. There's work to do to integrate this with lvm tools, but we can deal with that after the work is done, I suppose that doesn't really affect anything here as long as you get the versions correct so lvm can know which version it's dealing with.

If there's some time that you could jump on a call, it might be worth trying to hash some of this out in real time. Though I realize that might not be practical.

@lorelei-sakai
Member

Oh, would you mind moving this one back to draft? It'll let you post new versions without triggering the CI each time.

@sweettea sweettea marked this pull request as draft February 4, 2025 05:04
Contributor Author

sweettea commented Feb 4, 2025

I denied testing on this one partly to see if it would work, but mostly because this is a big PR and I haven't had time to give it a good read yet. Not that I don't trust you but... Let's stick to using the small PR to test the CI, okay? Thanks.

100%. I repushed this one partly because I couldn't figure out the right order of things on the other one, partly wanting to see what the Ready for Review button would do.

back to draft

Done. Didn't know one could do that, great button, love it a lot.

thanks for the various comments! here are some quick replies:

Design.. like in a cover letter

In hindsight you make a great point there, I was too excited to have gotten something working during the weekend to actually write a proper cover letter. Sorry about that, will accompany the next version with one.

Do you envision being able to change the compression algorithm any time we reload a new table, or just set it at format time? (I think setting it on table reload should work fine, but I want to clarify what you intended.)

On reload. There's a bit of weirdness insofar as you can make a table with compression off, then turn it on by message, or vice versa, but that's a wrinkle we could move away from long-term in favor of table reload.
Also, a format-time option gives up a lot of flexibility in device evolution during use, in favor of epsilon less memory usage.

1a) I don't really like the idea of reading the device_config every time we go to compress a block. I don't think we can change this more often than once per suspend/resume cycle. So couldn't you read the device config once (say at resume time) and store the algorithm/level values locally in the packer? That's probably more efficient.

Could do. I like having a single source of truth, the device config; the packer is not on the latency-sensitive pre-acknowledgement path; and there are no multithreaded implications, since it can only change from dmsetup message (which sends a notification to the packer, thus doing sufficient memory barriers) or from dmsetup resume (which also does sufficient memory barriers). I view it as the same in principle as the vdo->compressing bool, which we have no difficulty accessing from the packing paths.

Setting the compression level is a cool addition, but lz4 also supports different levels and I think it would make sense to have a "compression-level" parameter in general, independent of what algorithm is used. (In our internal discussions, we've imagined "add new compression algorithm support" and "add support for configuring the compression level" as separate stories, although they are obviously closely related.)

It's easy enough to pull the level-specification parsing out into its own function for easier reuse in the next version. Being able to select zstd levels is fairly standard, and I haven't ever wanted to do that for lz4, so I didn't think about it. With the level parsing factored out, adding lz4 levels looks pretty easy: roughly the last three or four changes from here, plus one to actually parse the level in analogy to zstd. However, I do think it's definitely orthogonal to adding zstd, while having levels in zstd is fairly core to zstd in my mind, so it feels like it deserves to be a different patchset beyond making it easy to add.

Putting the compression algorithm in the block header is nice, it means we can have a mix of old and new compressed blocks, but it does mean having all algorithms available in case we need to decompress data written with a different algorithm than we're currently using. That's probably not an issue?

We can share the memory blob between both algorithms, so it's not much additional memory to support having both and adds flexibility to folks using it in the future. I thought about just unilaterally upgrading everyone to zstd, but we still have to support LZ4 for old vdos, and if we still have to be able to read lz4 it's not much more work to allow writing new stuff with lz4 too.

It also means we have to flush the packer so we don't mix fragments in a single block, but I think suspending will already do that. How closely did you look at this case?

Yes, we flush the packer in between every potential change, so vios will only ever compress with one algo and we won't mix algos. Definitely a thing that deserves a test.

The zstd-shims.h thing is ugly, I wonder if there's a better way around that?

I was hung up on it for a couple hours. I tried using #include_next but couldn't get it to work. Basically we'd need #include <linux/zstd.h> to grab fake/linux/zstd.h, but then the #include <zstd.h> inside it must not grab itself. I'll try it again.

Since this is a change to the outward facing interface (i.e. the table line), it should be documented.

Ah, yes. Where does the documentation live in this tree?

5a) Also the dm device version should be updated to reflect the changed behavior. (I didn't see it, but I might have overlooked it?)

Hmmm, I don't know the dm device version rules? Where do I update that?

I really like the idea of replacing compression=(on|off) with an algorithm name, but it's a change.

This changeset leaves on/off as an option, but also allows lz4, zstd, zstd:[level] as options?

5b) If either the compression flag or the compression level is a required parameter, that means making a new version for the table line, too.

Optional :D

There's work to do to integrate this with lvm tools, but we can deal with that after the work is done, I suppose that doesn't really affect anything here as long as you get the versions correct so lvm can know which version it's dealing with.

Ah, magnificent.

Contributor Author

sweettea commented Feb 4, 2025

One specific question on the form of changes: do you want capitals in subject lines at all, or all lower case? Do you want things adding fakes to be before the change using those fakes, or after?

Contributor Author

sweettea commented Feb 4, 2025

Latest push changes: edited zstd-shims.h to carry over the license and copyright notice from the kernel zstd.h; also looked up what my copyright line on personal contributions to open source projects is supposed to look like at present, and added that.

@bjohnsto
Member

bjohnsto commented Feb 4, 2025

5a) Also the dm device version should be updated to reflect the changed behavior. (I didn't see it, but I might have overlooked it?)

Hmmm, I don't know the dm device version rules? Where do I update that?

Since compression is an optional parameter, there is no need to change the version value in the actual table line. However, we want to make sure that when there are changes that affect clients (LVM, Stratis, etc.) they have some way of knowing when new features are valid. We do this with two steps: 1. Update the struct target_type vdo_target_bio .version field in dm-vdo-target.c; we usually update the second value, so 9.1 to 9.2 and so on. 2. Notify the clients of the change so they can implement the change on their side.
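The version bump described above can be sketched as follows. This is a standalone illustration, not the real code: the struct and helper here are hypothetical stand-ins for the .version array in the kernel's struct target_type.

```c
#include <assert.h>

/*
 * Hypothetical stand-in for the .version member of struct target_type
 * in dm-vdo-target.c. The real field is an array of three unsigned
 * values; this sketch just mirrors the idea of bumping the second one.
 */
struct target_version {
	unsigned int major;
	unsigned int minor;
	unsigned int patch;
};

/* Before the new table-line option: 9.1.0; after: 9.2.0. */
static const struct target_version vdo_target_version = { 9, 2, 0 };

/* Clients (LVM, Stratis, etc.) can gate new features on the version. */
static int supports_compression_type(const struct target_version *v)
{
	return (v->major > 9) || (v->major == 9 && v->minor >= 2);
}
```

The point of the second step (notifying clients) is that a version check like the one above is only useful if clients know which minor version introduced which feature.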

Either style would likely work, but I personally would prefer two parameters: 1. Leave the compression on/off parameter and its parsing alone; it will then just work for backwards compatibility, and there would be no need to test that the parsing didn't get busted somehow. 2. Add a compressionType optional parameter that carries type and level. Then the parsing additions can be self-contained, and the change to the .version would relate to adding that field to the table line. Just my feelings on the matter, though.

@lorelei-sakai
Member

For commit titles, dm prefers all lower case. And in general acronyms like vdo and vio get downcased where they occur, too. The title should also include a subsystem where it's applicable: "dm vdo packer: pack all the things"
For commit ordering, whatever makes sense to you. I tend to put internal and test commits after the kernel-visible commits, but we don't require that the tree build after every commit so it doesn't matter too much. I would try to keep them near what they support, though. The kstrtoint commit, for example, seems to appear several commits before it's actually used.

Member

@lorelei-sakai lorelei-sakai left a comment

First off, sorry if this comes out awkward. There are a couple of large design issues, and I want to discuss each one in a separate conversation so that the discussions don't become muddled. I can't see how to do that aside from making separate review comments in the code, but these issues are only loosely tied to the lines I've attached them to.

Also, in general these commits are very small. While I appreciate keeping them simple, in some cases you seem to add variables or structures in one commit and make use of them in another, which can make it hard to evaluate the individual commits.

Comment thread src/c++/vdo/base/dm-vdo-target.c Outdated
if (strcmp("off", value) == 0) {
config->compression = VDO_NO_COMPRESSION;
} else if (strcmp("on", value) == 0 || strcmp("lz4", value) == 0) {
config->compression = VDO_LZ4;
Member

What should the table line parameters look like? In our internal discussion, there was some support for having three separate parameters:
--compression (bool): Is compression enabled at all? (This is the one we already have, with no changes.)
--compressionAlgorithm (lz4|zstd|etc): What algorithm is used for compression?
--compressionOptions (e.g. "level=3"): Additional options such as the compression level to use.
I think we want to keep the existing option unchanged, to prevent confusion about the meaning of --compression=on in the future. The other information could be combined though, so instead of the last two parameters we could have a --compressionType="zstd:3" to specify algorithm and options in one string.

Whatever options we pick here, the new options will require updating the target_type version number as Bruce mentioned.

Contributor Author

See below for discussion of following next-gen filesystem interfaces.

It seems redundant to specify both 'compression on' and also 'compressionType zstd:4'. It seems ambiguous to specify 'compression on' without specifying a type. I would rather have ambiguity when specifying the old way than redundancy in specifying the current best practice way, as a general rule.

Can we issue a deprecation warning in the system log a la filesystems for compression [on/off], treat it as compressionType lz4 for now, and remove it in ten kernel versions or something lengthy like that?

Member

Something we forgot the first time around is that vdo also has a dmsetup message that can enable or disable compression without suspending the device. This is somewhat contrary to the general dm model (which wants all changes like this to be managed by loading a new table), so we might look at deprecating that. However, we do have to continue to support it for the moment. The message probably can't change the compression algorithm because it can't allocate new memory, so we wouldn't be able to create support structures for the new algorithm.

We can't remove the existing table line option, either. Between that and the message, we agreed the simplest thing to do is leave the existing compression on/off options, and add a new option for specifying the compression algorithm, plus any options.

We didn't discuss the exact format. How about '--compressionType "<algorithm>:<options>"', where <algorithm> is a string specifying the algorithm and <options> is a string specifying any parameters the algorithm wants? For now the compression level can be the only option. By letting that string be context-dependent, we don't have to over-engineer it now.
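Parsing a value of that shape could look roughly like the sketch below. This is a hedged illustration only: the enum, struct, and function names are hypothetical, not the actual dm-vdo parsing code.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical names; the real dm-vdo config types differ. */
enum compression_algorithm { ALG_INVALID, ALG_LZ4, ALG_ZSTD };

struct compression_config {
	enum compression_algorithm algorithm;
	int level; /* algorithm-specific; 0 means "use the default" */
};

static int parse_compression_type(const char *value,
				  struct compression_config *config)
{
	char buf[64];
	char *options;

	if (strlen(value) >= sizeof(buf))
		return -1;
	strcpy(buf, value);

	/* Split "<algorithm>:<options>" at the first colon, if any. */
	options = strchr(buf, ':');
	if (options != NULL)
		*options++ = '\0';

	if (strcmp(buf, "lz4") == 0)
		config->algorithm = ALG_LZ4;
	else if (strcmp(buf, "zstd") == 0)
		config->algorithm = ALG_ZSTD;
	else
		return -1;

	/* For now the only supported option is a bare numeric level. */
	config->level = 0;
	if (options != NULL) {
		char *end;
		long level = strtol(options, &end, 10);

		if (end == options || *end != '\0')
			return -1;
		config->level = (int) level;
	}
	return 0;
}
```

Keeping the options string opaque to everything except the chosen algorithm's parser is what lets the format stay context-dependent, as suggested above.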

Comment thread src/c++/vdo/base/dm-vdo-target.c Outdated
if (strcmp("off", value) == 0) {
config->compression = VDO_NO_COMPRESSION;
} else if (strcmp("on", value) == 0 || strcmp("lz4", value) == 0) {
config->compression = VDO_LZ4;
Member

Should the compression type be a runtime option, changeable whenever we reload a table, or should it be a format-time option? Both options will require updating the super block, and this choice will affect what the table line parameters look like. If you look at PR 216, you'll see we're soon going to add formatting options on the table line, but it will be important to make sure format-time parameters are in separate options from the regular reload options.

Making the algorithm a format-time option would mean writing that choice down in the super block, so it's a format change. It would also mean that we don't have to change the compressed_block format, since there can only be one format in a given volume. This would also reduce the number of scenarios we would need to test. And how realistic is it that users would want to switch algorithms dynamically? This case does involve updating vdoformat to support the new parameter.

Making the choice a reload option requires updating the compressed_block format (as you've done) but would not require a super block format change. (The super block version still needs to change to indicate the change in on-disk format, though.) It may require more testing to make sure that VDO uses the correct decompression for different blocks of compressed data.

To me it feels like the format-time option is simpler with very little loss of functionality. What do you think?

Contributor Author

The question, I think, boils down to two higher level questions: should VDO follow the UI patterns forged by next-generation filesystems? and, should users expect to need to reformat to use new features?

Based on the lessons the filesystem community has learned over the years, I think it makes sense to 'act like filesystems' when there's a clear norm for a feature, and I think in-place enablement of new features avoids penalizing early adopters, who are the most valuable users. If someone's already tried a feature on other filesystems, it's negatively surprising for it to work differently with vdo, and makes them less likely to stick with vdo. Similarly, if a user has already adopted VDO, it might be very difficult for them to migrate their data off for a reformat -- they may not have a sufficiently large or fast disk. It's a negative interaction to try to enable a feature and discover your volume was formatted too far in the past. Thus, in-place enablement of new features is important to avoid negative surprises for early adopters.

In specifics, I do acknowledge format-time is simpler: the testing burden, the magnitude of on disk changes, the magnitude of code changes all weigh on the side of format time option. However, I think some existing users may want to try out zstd, and think next-generation filesystems allow runtime changes.

As I understand it, zstd provides better compression, at the cost of slower compression and much slower decompression. So, for the write-mostly usecase, zstd could make sense, and it would be unsurprising to want to switch to zstd on an existing vdo. We could add a re-compressor, but that's a lot of work. So that leaves having blocks in both compression formats in some form, or disallowing in-place enablement.

As for comparable technologies: In filesystem-land, every next-gen filesystem (btrfs, bcachefs, zfs, f2fs) supporting compression allows specifying compression and compression algorithm on a per-file basis. Moreover, as I understand it, all but bcachefs allow changing the default compression going forward by remount. While VDO is not a filesystem, it is more like an advanced filesystem than anything else in DM land, and a vdo resume is like a mount. So it seems like following the UI pattern of next-gen filesystems would also encourage allowing changing compression technology per table line.

Member

First, I think it's important to be clear that the filesystem interface (with a basic unit of a file) is different from the block device interface (with a basic unit of a data block). Users interact with them differently, they have different information about the data they're handling, etc. I think analogizing VDO to a filesystem is a good way of misunderstanding its audience and its limits. Even the deduplication and compression features of VDO are rather different from the similarly-named features in newer filesystems.

Having said that, I think we agreed that allowing users to change the algorithm when reloading a table makes sense. Requiring reformatting is too restrictive.

Comment thread src/c++/vdo/base/packer.c
__func__);
return VDO_INVALID_FRAGMENT;
}

Member

The super block will need to get its version increased at minimum. We don't want to allow an older version of VDO to try to open a volume that might have compressed blocks it can't read. (The current version is 67; you'd bump to 68.)

There's an upgrade question, too. If the new driver loads a VDO with super block version 67, we should probably write it out again as 67 and only upgrade them if they've indicated they want to use zstd. This will help prevent hassles for users who don't care about zstd. (This is less of an issue if the algorithm is a format-time choice.)

And as long as we're changing the super block, is there any other information we'd want to add for this feature, or for a foreseeable future feature? As an example, we could record an enumeration of the algorithms we support, or that this volume has ever used. We probably don't have enough insight into future usecases to do something like this, but I wanted to see if you have any ideas.

Member

I think we decided that we need to define superblock version 68. This version should add an extra 4-byte field that we can use as feature flags. (We'll only need two for now, though: one for lz4 compression and one for zstd.) As mentioned above, we can continue to write the old version (67) until a volume enables zstd, to try to prevent downgrade issues where possible. Once there is a possibility that the volume has zstd-compressed blocks written, we have to ensure that the volume can continue to read those blocks after future reloads.

Confirmed that we do write the super block every time we resume the device (whether we saved the vdo state or not), so that should work fine.
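The flag-and-version scheme above might be sketched like this. All names, values, and the flag layout are illustrative assumptions, not the real superblock format.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical sketch of the proposed versioning scheme: version 68
 * adds a 4-byte feature-flags field, and the old version (67) keeps
 * being written until zstd is actually enabled on the volume.
 */
#define SUPERBLOCK_VERSION_OLD   67
#define SUPERBLOCK_VERSION_FLAGS 68

#define VDO_FEATURE_LZ4  (UINT32_C(1) << 0)
#define VDO_FEATURE_ZSTD (UINT32_C(1) << 1)

struct superblock_header {
	uint32_t version;
	uint32_t feature_flags; /* only meaningful for version >= 68 */
};

/* Record that a feature has been used; this is a one-way transition. */
static void enable_feature(struct superblock_header *header, uint32_t flag)
{
	header->feature_flags |= flag;
}

/*
 * Choose the version to write out at resume: stay at 67 unless some
 * flag beyond the baseline lz4 support has ever been set, so volumes
 * that never used zstd avoid needless downgrade hassles.
 */
static uint32_t version_to_write(const struct superblock_header *header)
{
	if ((header->feature_flags & ~VDO_FEATURE_LZ4) == 0)
		return SUPERBLOCK_VERSION_OLD;
	return SUPERBLOCK_VERSION_FLAGS;
}
```

Since the super block is rewritten on every resume, the version upgrade would take effect the first time a table enabling zstd is resumed.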

Contributor Author

Ah, I haven't gotten this part yet. Tomorrow.

Member

I was planning on doing the super block part, but I certainly have no objection if you end up doing it instead.

Comment thread src/c++/vdo/base/packer.c
}

return VDO_SUCCESS;

Member

Finally, the question of how to get this upstream. For a small change like your other PR, I don't mind pulling it into our tree and then pushing it upstream with the rest of our commits. But for a real feature like this, it should probably go to the dm-devel list for more public comment. I'm happy to work to refine this PR here, and then you can send it to the list once we think it's reasonably complete. We could also take it into our tree and then send it to the list for comment, but it would be better if you (as the author) could reply to any feedback directly on list. (You'd send just the kernel version of course, which should just be a matter of dropping the commits that change vdo internal files.)

Also, I do find it very weird to add zstd with level configuration without also adding support for configuring lz4 levels. Since that isn't related to your series, I'll see if we can put together a separate series for that. I'm mentioning that mostly because the table line questions would also affect that series, so just be aware we may duplicate some of that stuff in another PR. I don't currently have a good guess for whether that would come before or after your series yet.

Contributor Author

Re upstream question: Excellent. I recognize that I'm forging relatively unknown territory here about how vdo adds features in the kernel world. I certainly like developing here to use the extensive test suite but I do think it's valuable to send it to dm-devel for comments too.

I don't know how to turn commits here into kernel tree commits; is there an existing best practice? I'm guessing it'd be git format-patch $patch | <rewrite vdo-devel paths into kernel paths> | git am in some form?

Level config: sounds good, pending outcomes of the ui question at the start. I think it should be easy enough to cherry-pick my relevant commits from this branch into the new level config branch, thereby preserving attribution?

Member

The tools for doing this aren't as friendly as they could be, but you can use src/packaging/kpatch/commit_to_overlay.sh. This script can take a series of patches from vdo-devel and a kernel repo and translate those patches into a branch on the kernel repo. (It will actually copy the entire vdo tree for each patch, so it's easy to pick up unrelated cruft depending on the state of the kernel repo you're using. I usually create a dummy patch out of the base of my series first to capture all the tree differences, and then afterwards rebase to remove that patch from my branch. We are planning to improve this process, it's just been a low priority.)

If getting that script working proves overly challenging, src/perl/bin/transformUpstreamPatch.pl will take a formatted patch and rewrite it so all the file paths are updated to the proper kernel locations. Your suggestion will work fairly well if you use this script as the rewrite step, but you'll still have to deal with trimming out the internal code that doesn't go into the kernel proper. Using "git am --reject" will prevent git from just refusing to apply changes when the internal stuff creates a conflict, and then you can resolve the conflicts manually.
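The rewrite step in the format-patch pipeline amounts to a path substitution over the patch text. The toy sketch below illustrates the idea only: the real tool is src/perl/bin/transformUpstreamPatch.pl, and the mapping from src/c++/vdo/base/ to drivers/md/dm-vdo/ is an assumed example, not the script's actual mapping table.

```shell
#!/bin/sh
# Toy illustration of the rewrite step in the pipeline
#   git format-patch ... | <rewrite paths> | git am --reject
# Path mapping below is an assumption for illustration only.

rewrite_paths() {
	# Map a vdo-devel source path to its (assumed) kernel location.
	sed -e 's|src/c++/vdo/base/|drivers/md/dm-vdo/|g'
}

# Example: one diff header line gets relocated.
printf '%s\n' '--- a/src/c++/vdo/base/packer.c' | rewrite_paths
```

Any hunks touching internal-only files would still have to be trimmed by hand, as noted above.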

Member

My thoughts on this now are, we'll get this cleaned up and into our (vdo-devel) repo. Then, I'll collect this and my compressionType series and any other superblock bits we need to add, and send this to the mailing list as a single series. It will need a cover letter, which I can write, but if you've got thoughts about what it should mention, please send them to me or put them in the PR description. The resulting series might be a bit long, but I think it'll hang together better than trying to justify the stages piecemeal, and it means we won't have to deal with versions where some but not all of the bits are in.

@sweettea
Contributor Author

Just wanted to ping this -- it looks like I could fit in a synchronous meeting tomorrow anytime from about 1pm Eastern onward, if y'all have any free time.

@lorelei-sakai
Member

Now that I've committed PR #269, it would be good to get this PR cleaned up and aligned with that so we can get it committed also. I'm working on the super block changes, but I think they're pretty small, and it might make sense for them to get committed after this series anyway, since the whole motivation for adding the flags is to support another compression algorithm.

Member

@lorelei-sakai lorelei-sakai left a comment

Okay, I've added some more granular comments to give you some ideas to work with in terms of cleaning this up. I didn't mention anything about aligning it with the table line interface introduced by PR #269 since that seems like it should be straightforward, but let me know if you have any questions or difficulties with it.

Comment thread src/c++/vdo/base/packer.c
return VDO_SUCCESS;
}

/*
Member

So these first 3-4 patches move the compression and decompression routines from data-vio into the packer. But my question is, why? What does this code motion improve? It's not like the compression work happens in the packer or even on the packer thread. (It's farmed out to a CPU thread.)

Contributor Author

I think it makes sense for all the compression/decompression stuff to live in one place.
Right now, compression contexts are allocated in vdo.c and compression/decompression is called in data-vio.c, but those are simple, straightforward calls using LZ4 constants -- essentially the complication is offloaded to lz4. However, in the zstd universe, the size of the compression context is a more complicated calculation, and comp/decomp is conditional on the algorithm.

With the added conditionals, I think it becomes clear that we should have all the compression calculations together so we can verify in one place that they're all parallel. And packer is the most compression-ish place to do that.
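The "all compression calculations in one place" idea might look something like the table below. This is a sketch with placeholder names and sizes, not the real LZ4/zstd kernel API or the actual dm-vdo code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical algorithm table; sizes are stand-ins, not real values. */
enum compression_algorithm { ALG_LZ4, ALG_ZSTD, ALG_COUNT };

struct compression_ops {
	const char *name;
	/* How much scratch memory one compression context needs. */
	size_t (*context_size)(int level);
};

static size_t lz4_context_size(int level)
{
	(void) level;   /* lz4 scratch size is level-independent here */
	return 16384;   /* stand-in for a constant like LZ4_MEM_COMPRESS */
}

static size_t zstd_context_size(int level)
{
	/* Stand-in: zstd workspace grows with the compression level. */
	return 65536 + (size_t) level * 4096;
}

static const struct compression_ops ops[ALG_COUNT] = {
	[ALG_LZ4]  = { "lz4",  lz4_context_size },
	[ALG_ZSTD] = { "zstd", zstd_context_size },
};

/* One place to compute the worst-case context size across algorithms. */
static size_t max_context_size(int level)
{
	size_t max = 0;

	for (int i = 0; i < ALG_COUNT; i++) {
		size_t size = ops[i].context_size(level);

		if (size > max)
			max = size;
	}
	return max;
}
```

Sizing contexts for the worst case across algorithms is one way to let both algorithms share a single memory blob, as discussed earlier in the thread.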

Member

Okay, the argument seems reasonable to me. I wonder if we should rename the file from packer.[ch] to something like compress.[ch], though? (Similar to dedupe.[ch].) Probably not worth worrying about at this time.
If we did something like that, it should be in its own commit, because in the kernel we'd need to add some files to the commit (Makefiles, mostly) which aren't reflected in this repo.

Comment thread src/c++/vdo/base/packer.h
struct compressed_block {
union {
struct compressed_block_1_0 v1;
};
Member

I would definitely prefer not to have a union here, and I don't think it helps anything in this case. There's no reason we can't use a single in-memory representation for both formats. (We might have to call out an in-memory structure that's separate from the packed, on-storage representation; I haven't thought it through enough to figure that out yet.)

Contributor Author

We benefit from some sort of struct compressed_block in data-vio.h, as a member of compression_state; and then we can pass it to compress_buffer and uncompress_to_buffer.

I can grant it not being a union. Though, my understanding of dm norms is that the structure on-disk needs to be defined as a struct in a header file, so I don't think it makes much difference whether it's a union or a 4k array of char?

Member

Okay, the problem here is that the existing code uses the packed on-disk buffer as its working in-memory object also. You're necessarily changing that here (struct compressed_block is the in-memory one), so that structure could just have the fragment offset array, data pointer, and type. We'd have to do a bit more marshalling/unmarshalling, but not much, and it simplifies the code that needs to refer to, e.g., the data pointer without caring what the version is.

Part of me is thinking that the minor version field is four bytes and we could abuse it slightly and say version 1.0 is lz4, version 1.1 is zstd. That's probably a bad idea (but it would mean the on-disk header wouldn't change size...).
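One way to read "just the fragment offset array, data pointer, and type" as an in-memory struct is sketched below. The names, the 14-fragment bound, and the helper are illustrative guesses, not VDO's actual definitions, and marshalling to/from the packed on-storage format would happen separately.

```c
#include <stdint.h>

#define TOY_MAX_FRAGMENTS 14 /* bounded by the block mapping states */

/* Hypothetical in-memory representation, distinct from the packed
 * on-storage layout; no version-specific union is needed here. */
struct toy_compressed_block {
	uint16_t offsets[TOY_MAX_FRAGMENTS]; /* where each fragment starts */
	char *data;                          /* points into the block buffer */
	uint8_t type;                        /* which compression algorithm */
};

/* A fragment's size falls out of adjacent offsets. */
static uint16_t toy_fragment_size(const struct toy_compressed_block *block,
				  int fragment)
{
	return block->offsets[fragment + 1] - block->offsets[fragment];
}
```

Code that only needs the data pointer or a fragment's extent can then use this struct without caring which on-disk version produced it.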

Comment thread src/c++/vdo/base/packer.h Outdated

/* Compression type in use for this block's fragments */
u8 type;

Member

If you put the type after the array of fragment offsets, it means these structures are identical except for whether you read that final byte. That would simplify a lot, I think. It does mean we'd have problems making the number of fragments larger, but that limitation is set by the number of block mapping states, so if we wanted to make that change it would still require a fair bit of restructuring. (Realistically, I don't think we'll increase the number of fragments; your data has to be fairly particular to even get to 14.)

Also as a minor thing, I don't like that the array of u16s ends up misaligned. That just feels messy.

Contributor Author

Makes sense. Thanks.

Comment thread src/c++/vdo/base/packer.h Outdated
Comment thread src/c++/vdo/base/vdo.c Outdated
@@ -0,0 +1,4 @@
#include <zstd.h>
Member

Given how much eventually ends up in this file, do we really need to import the userspace zstd.h? Why not just copy over the whole kernel implementation into a fake header? How much code is that, really? (I haven't looked, I genuinely don't know.)

If there is a lot of code inclusion you save this way, it also seems like there should be some way of specifying where to look for the userspace header so this file wouldn't end up trying to include itself, but I haven't researched that, so I don't know.

Contributor Author

$ wc include/linux/zstd.h
  614  3045 23006 include/linux/zstd.h
  

I don't think it qualifies as really small enough. :(

Member

600 lines isn't that bad. It looks like once you're done, the shims file is a bit over a hundred lines anyway.

zstd probably doesn't have this problem, but I'm also influenced by the discovery that our lz4 userspace implementation is entirely different from the one in the kernel, so what we test in userspace doesn't actually reflect exactly what the kernel does. So I'd be just as happy if our fake files reflected the actual kernel code.
It's possible you could trim out some bits that aren't needed by our tests, too. Though even if not, 600 lines is not that bad. (By comparison, the userspace fake of dm-bufio is about 233 lines, which is probably the largest one we currently have. So, could go either way, I guess.)

Contributor Author

That's just the header that's 600 lines. The implementation is 22k more.

Comment thread src/c++/vdo/base/packer.c
zstd_dstream_workspace_bound(VDO_BLOCK_SIZE));
for (int level = zstd_min_clevel(); level <= zstd_max_clevel(); level++) {
zstd_parameters params = zstd_get_params(level, VDO_BLOCK_SIZE);
context_size = max_t(size_t, context_size,
Member

These sizes don't change from run to run, do they? They're fixed by the algorithm, I hope. We should be able to figure them out at compile time or something and not do this calculation here. (Ideally, zstd would even have constants defined that we could include, but I presume because you wrote this that it doesn't.)

Contributor Author

Wouldn't that be a nice universe? :/

It's not even as good as "we can check context size of min and max clevel and take the max of those" -- it's non-linear and there are no constants defined (because it depends on the max block size). This sort of loop is what btrfs does at https://elixir.bootlin.com/linux/v6.13.7/source/fs/btrfs/zstd.c#L160 .
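The btrfs-style loop reads roughly like the sketch below. The `TOY_*` constants and `toy_cctx_workspace_bound()` are stand-ins for the kernel's `zstd_min_clevel()`/`zstd_max_clevel()` and the `zstd_get_params()`-plus-workspace-bound query; the real bound is non-linear in the level, so any monotone stand-in shows the shape of the iteration without reproducing the actual sizes.

```c
#include <stddef.h>

/* Stand-ins for the kernel zstd API; illustrative values only. */
#define TOY_MIN_CLEVEL 1
#define TOY_MAX_CLEVEL 22
#define TOY_BLOCK_SIZE 4096

static size_t toy_cctx_workspace_bound(int level)
{
	/* The real bound is non-linear in the level; any monotone
	 * function demonstrates the loop. */
	return (size_t)level * level * 1024 + TOY_BLOCK_SIZE;
}

/* Walk every supported level and keep the largest workspace requirement,
 * as btrfs does, since no single predefined constant covers all levels. */
static size_t max_compression_context_size(void)
{
	size_t context_size = 0;
	int level;

	for (level = TOY_MIN_CLEVEL; level <= TOY_MAX_CLEVEL; level++) {
		size_t bound = toy_cctx_workspace_bound(level);

		if (bound > context_size)
			context_size = bound;
	}
	return context_size;
}
```

The alternative discussed below, sizing contexts for the one configured level, would replace this loop with a single bound query at table-load time.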

Member

You know, it occurs to me that for any given instance of vdo, we know what kind of compression it is using (and what level) so we can allocate contexts that are exactly the right size without looping. This would be a bit more complicated, because whenever we suspend and resume, we'd have to reallocate these contexts in case the configuration changed, but that still seems like way less work than trying to define a generic maximum.

I really like that idea, but if that seems too complicated, we can also do this calculation offline, add a constant describing the maximum context size (with an appropriately detailed comment), and then add a unit test to test all the possibilities regularly so we know if/when the computed maximum becomes invalid.

Member

By the way, the work to reallocate the contexts would be similar to the work we already do for replacing the underlying block device: compare the old device config to the new one, and only do the work if something has changed. So most of the time we wouldn't even need to reallocate anything.

@lorelei-sakai
Member

Gentle ping. I would try to get this moving sooner rather than later. Also upon investigation, the lvm adjustments are more complicated than I'd expected, so I'd like to try to get all the compressionType work (yours and mine) done as a unit so we only need to do the lvm work once.

@sweettea
Contributor Author

Sorry sorry! I meant to reply last weekend that I didn't think I could get to it before this weekend. Which looks vaguely accurate, just finished a masssssssive overdue thing at work so my weekends are free again.

@sweettea force-pushed the features/zstd branch 2 times, most recently from e953fd7 to b0c0c67 on April 3, 2025 07:46
sweettea added 2 commits April 3, 2025 04:04
Move LZ4 compression and decompression functionality from data-vio.c to packer.c,
centralizing compression functionality ahead of making it more complex
with zstd. This refactoring:

- Moves uncompress_data_vio() to packer.c as vdo_uncompress_to_buffer()
- Creates new vdo_compress_buffer() function in packer.c
- Makes vdo_get_compressed_block_fragment() static as all users are now
  in packer.c.
Refactor the compressed block structure to support multiple format versions.
This change:

- Creates new compressed_block_header_1_0 and compressed_block_1_0 structures
- Renames VDO_COMPRESSED_BLOCK_DATA_SIZE to VDO_COMPRESSED_BLOCK_DATA_SIZE_1_0

Signed-off-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
sweettea added 11 commits April 3, 2025 20:12
- Update parse_compression() to handle both LZ4 and ZSTD algorithms
- Set compression_type in device_config for both algorithms
- Output compression type as appropriate.
- Upgrade to version 68.

Signed-off-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Add necessary fake file to reflect the use of kernel
zstd_{min,max}_clevel() functions.

Include search rules mean that we can't actually have a
fake/linux/zstd.h which includes system <zstd.h>. Hence, users are
condemned to have a conditional __KERNEL__ include, and the fake/linux/
file is named zstd-shims.h, to avoid the name collision.

(Kernel header only works with the kernel c files, as they have slightly
different names than the userspace functions.)

Signed-off-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
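The conditional include described in that commit message would look something like the fragment below; the zstd-shims.h name comes from this commit, while the exact relative path is an assumption.

```c
#ifdef __KERNEL__
#include <linux/zstd.h>            /* kernel zstd API */
#else
#include "fake/linux/zstd-shims.h" /* shim wrapping the system <zstd.h> */
#endif
```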
Add -lzstd to the relevant linker flags.

Signed-off-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Refactor compression context management by relocating LZ4 workspace allocation
from vdo.c to packer.c where the rest of compression functionality now
lives. This change:

- Adds context_count and compression_context fields to packer structure
- Moves allocation/deallocation code from vdo to packer module
- Updates thread initialization to reference packer's contexts
- Removes compression_context field from vdo structure

This centralizes all compression-related functionality within the packer module
and simplifies the vdo initialization path.

Signed-off-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Restructure compression context sizing and initialization to support
both LZ4 and zstd:

- Calculate maximum context size needed for zstd
- Initialize the context buffer as needed before each use.
Actually enable using zstd implementation by adding compression and
decompression calls as appropriate.

This completes the zstd feature.

Signed-off-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Signed-off-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Signed-off-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Add a couple of zstd variants of compression tests to nightly.

Signed-off-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
With the addition of zstd support, document this in the appropriate
place.

Signed-off-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
@sweettea
Contributor Author

sweettea commented Apr 4, 2025

Okay, I think this is in reviewable state.

One open question, and we might have discussed it: when do we upgrade from volume version 67 to 68? Do I need to implement 'keep using v1 until they opt into zstd'?

A mechanical question: I've tried to make fairly thorough descriptions for kernel patches; should I do that for the rest too?

Thanks!

@sweettea sweettea marked this pull request as ready for review April 4, 2025 00:14
@lorelei-sakai
Member

I will try to get to this review tomorrow.

One open question, and we might have discussed it: when do we upgrade from volume version 67 to 68? Do I need to implement 'keep using v1 until they opt into zstd'?

The superblock changes I will do; I will add them as a patch at the end of the series, or maybe in the middle if I decide that makes more sense. I'd hoped to have a draft out so you could see what I intend, but I've been distracted this week.

For the compressed block format, we should continue to use v1 if the customer has not used zstd, for backwards compatibility.

A mechanical question: I've tried to make fairly thorough descriptions for kernel patches; should I do that for the rest too?

Kernel patch descriptions are the most important. For userspace and internal items, it's enough to be reasonably descriptive. I'll note any specifics as I review.

@github-actions

github-actions bot commented Apr 4, 2025

Unit Test Results

3 files, 87 suites, 35m 22s ⏱️
267 tests: 266 ✔️ passed, 0 💤 skipped, 1 failed
268 runs: 267 ✔️ passed, 0 💤 skipped, 1 failed

For more details on these failures, see this check.

Results for commit 51299c1.

Member

@lorelei-sakai lorelei-sakai left a comment


Generally seems like an improvement. I've noted a few places where I think different architectural choices have to be made, or where it looks like you rebased without noticing how the compressionType series changed some of this code.

One big conclusion is that the super block work probably needs to happen before this series (so you can use the feature flags it will add). I'm planning on getting that and some other fixes out next week. So there are parts you can't accurately update until those things are resolved.

Comment thread src/c++/vdo/base/packer.c

data_vio->compression.block->data, VDO_BLOCK_SIZE,
VDO_MAX_COMPRESSED_FRAGMENT_SIZE,
vdo->device_config->compression_level,
(char *) vdo_get_work_queue_private_data());
Member

Okay, so wait, the old code passes the LZ4 compression_level into the compress function, and it looks like you just dropped that. That's no good.

Looking ahead, it seems like you eventually add a third parameter to vdo_compress_buffer, so just do that up front. Also, given the level of the interface, it seems like passing the compression level directly would make more sense than passing an entire vdo pointer.

Comment thread src/c++/vdo/base/packer.c
/**
* vdo_get_compressed_block_fragment() - Get a reference to a compressed fragment from a compressed
* block.
* block.
Member

The reformatting seems unnecessary (and wrong), so I presume it was an accident. Similarly down on lines 44-45 (though in that case we could just truncate the comment after VDO_INVALID_FRAGMENT, as the remainder is sort of redundant).

Comment thread src/c++/vdo/base/packer.c
struct compressed_block *block,
u16 *fragment_offset, u16 *fragment_size)
STATIC int
vdo_get_compressed_block_fragment(enum block_mapping_state mapping_state,
Member

I suppose it's not a big deal, but I don't think we need to wrap this line; it looks like it would end up being about 84 characters in that case?

Comment thread src/c++/vdo/base/packer.c
*/

int vdo_uncompress_to_buffer(enum block_mapping_state mapping_state,
struct compressed_block *block, char *buffer)
Member

This signature feels weird to me. Maybe make struct compressed_block the first argument, since this is a compression function? Part of the issue is that it doesn't feel parallel to vdo_compress_buffer(), below.

Comment thread src/c++/vdo/base/packer.c
}

return VDO_SUCCESS;

Member

My thoughts on this now are: we'll get this cleaned up and into our (vdo-devel) repo. Then I'll collect this and my compressionType series and any other superblock bits we need to add, and send this to the mailing list as a single series. It will need a cover letter, which I can write, but if you've got thoughts about what it should mention, please send them to me or put them in the PR description. The resulting series might be a bit long, but I think it'll hang together better than trying to justify the stages piecemeal, and it means we won't have to deal with versions where some but not all of the bits are in.

.physicalThreadCount = 1,
.hashZoneThreadCount = 1,
.compression = VDO_LZ4,
.compression = VDO_ZSTD,
Member

I don't think we should give up the lz4 compression tests; rather, we should add new ones for zstd. Hopefully we can parameterize it to just change the compression method and not actually have to duplicate the meat of the tests.

@{$suiteNames{tornWrites}},
"ZstdCompress01",
"ZstdGenData02",
"ZstdRebuild04",
Member

This is a good idea. I'd call out a compression suite, though, and run all of them with zstd and lz4. Maybe not the long ones (Perf Fio), but any default suite tests that use compression would be good.

Comment thread doc/vdo.rst

For zstd, the option is the compression level. See zstd
documentation for more detail on the compression level.

Member

Don't forget to update the doc on the status line, too. Though as I noted elsewhere, the status line is currently wrong so the doc is, too. So that might have to wait.

vdo->states.vdo.config = *config;
vdo->states.vdo.nonce = nonce;
vdo->states.volume_version = VDO_VOLUME_VERSION_67_0;
vdo->states.volume_version = VDO_VOLUME_VERSION_68_0;
Member

Oh, one more thing. VDOVersion_t1 failed, because it compares the on-disk format with a pickled version to make sure nothing has changed. Updating the formatted version number means you have to update that pickle too. But this should be dealt with in the commit that actually does the super block changes, wherever that lands.

Contributor Author

Yeah, I figured leaving VDOVersion_t1 broken was the best way to make sure I didn't forget the remaining work to sync it with the upcoming superblock work. Thanks!

@sweettea
Contributor Author

sweettea commented Apr 5, 2025

Wow, I never thought I'd say this, but I got thoroughly lost in the GitHub review conversation formatting and I miss email-based reviewing. Apologies if there are disconnected comments in the wrong place.

@sweettea
Contributor Author

A while back we discussed whether or not zstd on 4k blocks even made sense: https://news.ycombinator.com/item?id=43655010 is a comment from a company doing zstd on 4-8k blocks of memory, so they clearly find it useful to do zstd in such small chunks, so I was wrong.

@lorelei-sakai
Member

It's been a few weeks, have you had a chance to get back to this? Do you have any estimate of when you might be able to devote some time to this? Thanks.

@sweettea
Contributor Author

sweettea commented Jun 18, 2025 via email

@lorelei-sakai
Member

End of month is no problem, just trying to update expectations. I hope you have a chance to get some rest this weekend.

@lorelei-sakai added the "on hold" (PR is on hold for now) label on Jul 25, 2025
@lorelei-sakai
Member

It's been a while since this has been updated, so I'm putting this PR on hold. If and when you get a chance to work on this again, let us know, and feel free to remove the "on hold" label as well.
