Add zstd support #241

Open
sweettea wants to merge 18 commits into dm-vdo:main from sweettea:features/zstd

Conversation

Contributor

@sweettea sweettea commented Feb 3, 2025

This adds zstd support to VDO.

Zstd support has been requested on the mailing list. It's a feature which comparable storage technologies have, and which is generally viewed as the best available compression technology.

This adds support and some tests for it.

@github-actions github-actions bot requested a review from bjohnsto February 3, 2025 18:07
@sweettea sweettea force-pushed the features/zstd branch 2 times, most recently from ca7a259 to 62873c0, February 4, 2025 03:14
@sweettea sweettea closed this Feb 4, 2025
@sweettea sweettea reopened this Feb 4, 2025
Contributor Author

sweettea commented Feb 4, 2025

@bjohnsto hopefully this one I did right?

@sweettea sweettea marked this pull request as ready for review February 4, 2025 03:21
@lorelei-sakai
Member

I denied testing on this one partly to see if it would work, but mostly because this is a big PR and I haven't had time to give it a good read yet. Not that I don't trust you but... Let's stick to using the small PR to test the CI, okay? Thanks.

@lorelei-sakai
Member

I will say, I like what I've read so far. I think this feature could use some real design discussion so we all agree on how it should work. If you have any high-level design thoughts to share, that would be great. (The sort of thing that might go in a cover letter, for example.)

For some initial highlights:

  1. Do you envision being able to change the compression algorithm any time we reload a new table, or just set it at format time? (I think setting it on table reload should work fine, but I want to clarify what you intended.)
    1a) I don't really like the idea of reading the device_config every time we go to compress a block. I don't think we can change this more often than once per suspend/resume cycle. So couldn't you read the device config once (say at resume time) and store the algorithm/level values locally in the packer? That's probably more efficient.
  2. Setting the compression level is a cool addition, but lz4 also supports different levels and I think it would make sense to have a "compression-level" parameter in general, independent of what algorithm is used. (In our internal discussions, we've imagined "add new compression algorithm support" and "add support for configuring the compression level" as separate stories, although they are obviously closely related.)
  3. Putting the compression algorithm in the block header is nice, it means we can have a mix of old and new compressed blocks, but it does mean having all algorithms available in case we need to decompress data written with a different algorithm than we're currently using. That's probably not an issue? It also means we have to flush the packer so we don't mix fragments in a single block, but I think suspending will already do that. How closely did you look at this case?
  4. The zstd-shims.h thing is ugly, I wonder if there's a better way around that?
  5. Since this is a change to the outward facing interface (i.e. the table line), it should be documented.
  5a) Also the dm device version should be updated to reflect the changed behavior. (I didn't see it, but I might have overlooked it?) I really like the idea of replacing compression=(on|off) with an algorithm name, but it's a change.
    5b) If either the compression flag or the compression level is a required parameter, that means making a new version for the table line, too.
  6. There's work to do to integrate this with lvm tools, but we can deal with that after the work is done, I suppose that doesn't really affect anything here as long as you get the versions correct so lvm can know which version it's dealing with.

If there's some time that you could jump on a call, it might be worth trying to hash some of this out in real time. Though I realize that might not be practical.

@lorelei-sakai
Member

Oh, would you mind moving this one back to draft? It'll let you post new versions without triggering the CI each time.

@sweettea sweettea marked this pull request as draft February 4, 2025 05:04
Contributor Author

sweettea commented Feb 4, 2025

I denied testing on this one partly to see if it would work, but mostly because this is a big PR and I haven't had time to give it a good read yet. Not that I don't trust you but... Let's stick to using the small PR to test the CI, okay? Thanks.

100%. I repushed this one partly because I couldn't figure out the right order of things on the other one, partly wanting to see what the Ready for Review button would do.

back to draft

Done. Didn't know one could do that, great button, love it a lot.

thanks for the various comments! here are some quick replies:

Design.. like in a cover letter

In hindsight you make a great point there, I was too excited to have gotten something working during the weekend to actually write a proper cover letter. Sorry about that, will accompany the next version with one.

Do you envision being able to change the compression algorithm any time we reload a new table, or just set it at format time? (I think setting it on table reload should work fine, but I want to clarify what you intended.)

On reload. There's a bit of weirdness insofar as you can make a table with compression off, then turn it on by message, or vice versa, but that's a wrinkle we could move away from long-term in favor of table reload.
Also, a format-time option gives up a lot of flexibility in device evolution during use, in favor of epsilon less memory usage.

1a) I don't really like the idea of reading the device_config every time we go to compress a block. I don't think we can change this more often than once per suspend/resume cycle. So couldn't you read the device config once (say at resume time) and store the algorithm/level values locally in the packer? That's probably more efficient.

Could do. I like having a single source of truth, the device config; the packer is not on the latency-sensitive pre-acknowledgement path; and there are no multithreaded implications, since it can only change from dmsetup message (which sends a notification to the packer, thus doing sufficient memory barriers) or from dmsetup resume (which also does sufficient memory barriers). I view it as the same in principle as the vdo->compressing bool, which we have no difficulty accessing from the packing paths.

Setting the compression level is a cool addition, but lz4 also supports different levels and I think it would make sense to have a "compression-level" parameter in general, independent of what algorithm is used. (In our internal discussions, we've imagined "add new compression algorithm support" and "add support for configuring the compression level" as separate stories, although they are obviously closely related.)

It's easy enough to pull the level-specification parsing out into its own function for easier reuse in the next version. Being able to select zstd levels is fairly standard, and I haven't ever wanted to do that for lz4, so I didn't think about it. With the level parsing factored out, adding lz4 levels looks pretty easy: roughly the last three or four changes from here, plus one to actually parse the level in analogy to zstd. However, I do think it's definitely orthogonal to adding zstd, while having levels in zstd is fairly core to zstd in my mind, so it feels like it deserves to be a different patchset beyond making it easy to add.

Putting the compression algorithm in the block header is nice, it means we can have a mix of old and new compressed blocks, but it does mean having all algorithms available in case we need to decompress data written with a different algorithm than we're currently using. That's probably not an issue?

We can share the memory blob between both algorithms, so it's not much additional memory to support having both and adds flexibility to folks using it in the future. I thought about just unilaterally upgrading everyone to zstd, but we still have to support LZ4 for old vdos, and if we still have to be able to read lz4 it's not much more work to allow writing new stuff with lz4 too.

It also means we have to flush the packer so we don't mix fragments in a single block, but I think suspending will already do that. How closely did you look at this case?

Yes, we flush the packer in between every potential change, so vios will only ever compress with one algo and we won't mix algos. Definitely a thing that deserves a test.

The zstd-shims.h thing is ugly, I wonder if there's a better way around that?

I was hung up on it for a couple hours. I tried using #include_next but couldn't get it to work. Basically we'd need #include <linux/zstd.h> to grab fake/linux/zstd.h, but then the #include <zstd.h> inside it must not grab itself. I'll try it again.

Since this is a change to the outward facing interface (i.e. the table line), it should be documented.

Ah, yes. Where does the documentation live in this tree?

5a) Also the dm device version should be updated to reflect the changed behavior. (I didn't see it, but I might have overlooked it?)

Hmmm, I don't know the dm device version rules? Where do I update that?

I really like the idea of replacing compression=(on|off) with an algorithm name, but it's a change.

This changeset leaves on/off as an option, but also allows lz4, zstd, zstd:[level] as options?

5b) If either the compression flag or the compression level is a required parameter, that means making a new version for the table line, too.

Optional :D

There's work to do to integrate this with lvm tools, but we can deal with that after the work is done, I suppose that doesn't really affect anything here as long as you get the versions correct so lvm can know which version it's dealing with.

Ah, magnificent.

Contributor Author

sweettea commented Feb 4, 2025

One specific question on the form of changes: do you want capitals in subject lines at all, or all lower case? Do you want things adding fakes to be before the change using those fakes, or after?

Contributor Author

sweettea commented Feb 4, 2025

Latest push changes: edited zstd-shims.h to carry over the license and copyright notice from the kernel zstd.h; also looked up what my copyright line on personal contributions to open source projects is supposed to look like at present, and added that.

@bjohnsto
Member

bjohnsto commented Feb 4, 2025

5a) Also the dm device version should be updated to reflect the changed behavior. (I didn't see it, but I might have overlooked it?)

Hmmm, I don't know the dm device version rules? Where do I update that?

Since compression is an optional parameter, there is no need to change the version value in the actual table line. However, we want to make sure that when there are changes that affect clients (LVM, Stratis, etc.) they have some way of knowing when new features are valid. We do this with two steps: 1. Update the struct target_type vdo_target_bio .version field in dm-vdo-target.c; we usually update the second value, so 9.1 to 9.2 and so on. 2. Notify the clients of the change so they can implement the change on their side.
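The version bump described above can be sketched as follows. This is a standalone illustration, not the real code: the struct and helper here are hypothetical stand-ins for the .version array in the kernel's struct target_type.

```c
#include <assert.h>

/*
 * Hypothetical stand-in for the .version member of struct target_type
 * in dm-vdo-target.c. The real field is an array of three unsigned
 * values; this sketch just mirrors the idea of bumping the second one.
 */
struct target_version {
	unsigned int major;
	unsigned int minor;
	unsigned int patch;
};

/* Before the new table-line option: 9.1.0; after: 9.2.0. */
static const struct target_version vdo_target_version = { 9, 2, 0 };

/* Clients (LVM, Stratis, etc.) can gate new features on the version. */
static int supports_compression_type(const struct target_version *v)
{
	return (v->major > 9) || (v->major == 9 && v->minor >= 2);
}
```

The point of the second step (notifying clients) is that a version check like the one above is only useful if clients know which minor version introduced which feature.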

Either style would likely work, but I personally would prefer two parameters: 1. Leave the compression on/off parameter and its parsing alone; it will then just work for backwards compatibility, and there would be no need to test that the parsing didn't get busted somehow. 2. Add a compressionType optional parameter that carries type and level. Then the parsing additions can be self-contained, and the change to the .version would relate to adding that field to the table line. Just my feelings on the matter, though.

@lorelei-sakai
Member

For commit titles, dm prefers all lower case. And in general acronyms like vdo and vio get downcased where they occur, too. The title should also include a subsystem where it's applicable: "dm vdo packer: pack all the things"
For commit ordering, whatever makes sense to you. I tend to put internal and test commits after the kernel-visible commits, but we don't require that the tree build after every commit so it doesn't matter too much. I would try to keep them near what they support, though. The kstrtoint commit, for example, seems to appear several commits before it's actually used.

Member

@lorelei-sakai lorelei-sakai left a comment

First off, sorry if this comes out awkward. There are a couple of large design issues, and I want to discuss each one in a separate conversation so that the discussions don't become muddled. I can't see how to do that aside from making separate review comments in the code, but these issues are only loosely tied to the lines I've attached them to.

Also, in general these commits are very small. While I appreciate keeping them simple, in some cases you seem to add variables or structures in one commit and make use of them in another, which can make it hard to evaluate the individual commits.

Comment thread src/c++/vdo/base/dm-vdo-target.c Outdated
if (strcmp("off", value) == 0) {
config->compression = VDO_NO_COMPRESSION;
} else if (strcmp("on", value) == 0 || strcmp("lz4", value) == 0) {
config->compression = VDO_LZ4;
Member

What should the table line parameters look like? In our internal discussion, there was some support for having three separate parameters:
--compression (bool): Is compression enabled at all? (This is the one we already have, with no changes.)
--compressionAlgorithm (lz4|zstd|etc): What algorithm is used for compression?
--compressionOptions (e.g. "level=3"): Additional options such as the compression level to use.
I think we want to keep the existing option unchanged, to prevent confusion about the meaning of --compression=on in the future. The other information could be combined though, so instead of the last two parameters we could have a --compressionType="zstd:3" to specify algorithm and options in one string.

Whatever options we pick here, the new options will require updating the target_type version number as Bruce mentioned.

Contributor Author

See below for discussion of following next-gen filesystem interfaces.

It seems redundant to specify both 'compression on' and also 'compressionType zstd:4'. It seems ambiguous to specify 'compression on' without specifying a type. I would rather have ambiguity when specifying the old way than redundancy in specifying the current best practice way, as a general rule.

Can we issue a deprecation warning in the system log a la filesystems for compression [on/off], treat it as compressionType lz4 for now, and remove it in ten kernel versions or something lengthy like that?

Member

Something we forgot the first time around is that vdo also has a dmsetup message that can enable or disable compression without suspending the device. This is somewhat contrary to the general dm model (which wants all changes like this to be managed by loading a new table), so we might look at deprecating that. However, we do have to continue to support it for the moment. The message probably can't change the compression algorithm because it can't allocate new memory, so we wouldn't be able to create support structures for the new algorithm.

We can't remove the existing table line option, either. Between that and the message, we agreed the simplest thing to do is leave the existing compression on/off options, and add a new option for specifying the compression algorithm, plus any options.

We didn't discuss the exact format. How about '--compressionType "<algorithm>:<options>"', where <algorithm> is a string specifying the algorithm and <options> is a string specifying any parameters the algorithm wants? For now the compression level can be the only option. By letting that string be context-dependent, we don't have to over-engineer it now.
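Parsing a value of that shape could look roughly like the sketch below. This is a hedged illustration only: the enum, struct, and function names are hypothetical, not the actual dm-vdo parsing code.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical names; the real dm-vdo config types differ. */
enum compression_algorithm { ALG_INVALID, ALG_LZ4, ALG_ZSTD };

struct compression_config {
	enum compression_algorithm algorithm;
	int level; /* algorithm-specific; 0 means "use the default" */
};

static int parse_compression_type(const char *value,
				  struct compression_config *config)
{
	char buf[64];
	char *options;

	if (strlen(value) >= sizeof(buf))
		return -1;
	strcpy(buf, value);

	/* Split "<algorithm>:<options>" at the first colon, if any. */
	options = strchr(buf, ':');
	if (options != NULL)
		*options++ = '\0';

	if (strcmp(buf, "lz4") == 0)
		config->algorithm = ALG_LZ4;
	else if (strcmp(buf, "zstd") == 0)
		config->algorithm = ALG_ZSTD;
	else
		return -1;

	/* For now the only supported option is a bare numeric level. */
	config->level = 0;
	if (options != NULL) {
		char *end;
		long level = strtol(options, &end, 10);

		if (end == options || *end != '\0')
			return -1;
		config->level = (int) level;
	}
	return 0;
}
```

Keeping the options string opaque to everything except the chosen algorithm's parser is what lets the format stay context-dependent, as suggested above.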

Comment thread src/c++/vdo/base/dm-vdo-target.c Outdated
if (strcmp("off", value) == 0) {
config->compression = VDO_NO_COMPRESSION;
} else if (strcmp("on", value) == 0 || strcmp("lz4", value) == 0) {
config->compression = VDO_LZ4;
Member

Should the compression type be a runtime option, changeable whenever we reload a table, or should it be a format-time option? Both options will require updating the super block, and this choice will affect what the table line parameters look like. If you look at PR 216, you'll see we're soon going to add formatting options on the table line, but it will be important to make sure format-time parameters are in separate options from the regular reload options.

Making the algorithm a format-time option would mean writing that choice down in the super block, so it's a format change. It would also mean that we don't have to change the compressed_block format, since there can only be one format in a given volume. This would also reduce the number of scenarios we would need to test. And how realistic is it that users would want to switch algorithms dynamically? This case does involve updating vdoformat to support the new parameter.

Making the choice a reload option requires updating the compressed_block format (as you've done) but would not require a super block format change. (The super block version still needs to change to indicate the change in on-disk format, though.) It may require more testing to make sure that VDO uses the correct decompression for different blocks of compressed data.

To me it feels like the format-time option is simpler with very little loss of functionality. What do you think?

Contributor Author

The question, I think, boils down to two higher level questions: should VDO follow the UI patterns forged by next-generation filesystems? and, should users expect to need to reformat to use new features?

Based on the lessons the filesystem community has learned over the years, I think it makes sense to 'act like filesystems' when there's a clear norm for a feature, and I think in-place enablement of new features avoids penalizing early adopters, who are the most valuable users. If someone's already tried a feature on other filesystems, it's negatively surprising for it to work differently with vdo, and makes them less likely to stick with vdo. Similarly, if a user has already adopted VDO, it might be very difficult for them to migrate their data off for a reformat -- they may not have a sufficiently large or fast disk. It's a negative interaction to try to enable a feature and discover your volume was formatted too far in the past. Thus, in-place enablement of new features is important to avoid negative surprises for early adopters.

In specifics, I do acknowledge format-time is simpler: the testing burden, the magnitude of on disk changes, the magnitude of code changes all weigh on the side of format time option. However, I think some existing users may want to try out zstd, and think next-generation filesystems allow runtime changes.

As I understand it, zstd provides better compression, at the cost of slower compression and much slower decompression. So, for the write-mostly usecase, zstd could make sense, and it would be unsurprising to want to switch to zstd on an existing vdo. We could add a re-compressor, but that's a lot of work. So that leaves having blocks in both compression formats in some form, or disallowing in-place enablement.

As for comparable technologies: In filesystem-land, every next-gen filesystem (btrfs, bcachefs, zfs, f2fs) supporting compression allows specifying compression and compression algorithm on a per-file basis. Moreover, as I understand it, all but bcachefs allow changing the default compression going forward by remount. While VDO is not a filesystem, it is more like an advanced filesystem than anything else in DM land, and a vdo resume is like a mount. So it seems like following the UI pattern of next-gen filesystems would also encourage allowing changing compression technology per table line.

Member

First, I think it's important to be clear that the filesystem interface (with a basic unit of a file) is different from the block device interface (with a basic unit of a data block). Users interact with them differently, they have different information about the data they're handling, etc. I think analogizing VDO to a filesystem is a good way of misunderstanding its audience and its limits. Even the deduplication and compression features of VDO are rather different from the similarly-named features in newer filesystems.

Having said that, I think we agreed that allowing users to change the algorithm when reloading a table makes sense. Requiring reformatting is too restrictive.

Comment thread src/c++/vdo/base/packer.c
__func__);
return VDO_INVALID_FRAGMENT;
}

Member

The super block will need to get its version increased at minimum. We don't want to allow an older version of VDO to try to open a volume that might have compressed blocks it can't read. (The current version is 67; you'd bump to 68.)

There's an upgrade question, too. If the new driver loads a VDO with super block version 67, we should probably write it out again as 67 and only upgrade them if they've indicated they want to use zstd. This will help prevent hassles for users who don't care about zstd. (This is less of an issue if the algorithm is a format-time choice.)

And as long as we're changing the super block, is there any other information we'd want to add for this feature, or for a foreseeable future feature? As an example, we could record an enumeration of the algorithms we support, or that this volume has ever used. We probably don't have enough insight into future usecases to do something like this, but I wanted to see if you have any ideas.

Member

I think we decided that we need to define superblock version 68. This version should add an extra 4-byte field that we can use as feature flags. (We'll only need two for now, though: one for lz4 compression and one for zstd.) As mentioned above, we can continue to write the old version (67) until a volume enables zstd, to try to prevent downgrade issues where possible. Once there is a possibility that the volume has zstd-compressed blocks written, we have to ensure that the volume can continue to read those blocks after future reloads.

Confirmed that we do write the super block every time we resume the device (whether we saved the vdo state or not), so that should work fine.
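The flag-and-version scheme above might be sketched like this. All names, values, and the flag layout are illustrative assumptions, not the real superblock format.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical sketch of the proposed versioning scheme: version 68
 * adds a 4-byte feature-flags field, and the old version (67) keeps
 * being written until zstd is actually enabled on the volume.
 */
#define SUPERBLOCK_VERSION_OLD   67
#define SUPERBLOCK_VERSION_FLAGS 68

#define VDO_FEATURE_LZ4  (UINT32_C(1) << 0)
#define VDO_FEATURE_ZSTD (UINT32_C(1) << 1)

struct superblock_header {
	uint32_t version;
	uint32_t feature_flags; /* only meaningful for version >= 68 */
};

/* Record that a feature has been used; this is a one-way transition. */
static void enable_feature(struct superblock_header *header, uint32_t flag)
{
	header->feature_flags |= flag;
}

/*
 * Choose the version to write out at resume: stay at 67 unless some
 * flag beyond the baseline lz4 support has ever been set, so volumes
 * that never used zstd avoid needless downgrade hassles.
 */
static uint32_t version_to_write(const struct superblock_header *header)
{
	if ((header->feature_flags & ~VDO_FEATURE_LZ4) == 0)
		return SUPERBLOCK_VERSION_OLD;
	return SUPERBLOCK_VERSION_FLAGS;
}
```

Since the super block is rewritten on every resume, the version upgrade would take effect the first time a table enabling zstd is resumed.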

Contributor Author

Ah, I haven't gotten this part yet. Tomorrow.

Member

I was planning on doing the super block part, but I certainly have no objection if you end up doing it instead.

Comment thread src/c++/vdo/base/packer.c
}

return VDO_SUCCESS;

Member

Finally, the question of how to get this upstream. For a small change like your other PR, I don't mind pulling it into our tree and then pushing it upstream with the rest of our commits. But for a real feature like this, it should probably go to the dm-devel list for more public comment. I'm happy to work to refine this PR here, and then you can send it to the list once we think it's reasonably complete. We could also take it into our tree and then send it to the list for comment, but it would be better if you (as the author) could reply to any feedback directly on list. (You'd send just the kernel version of course, which should just be a matter of dropping the commits that change vdo internal files.)

Also, I do find it very weird to add zstd with level configuration without also adding support for configuring lz4 levels. Since that isn't related to your series, I'll see if we can put together a separate series for that. I'm mentioning that mostly because the table line questions would also affect that series, so just be aware we may duplicate some of that stuff in another PR. I don't currently have a good guess for whether that would come before or after your series yet.

Contributor Author

Re upstream question: Excellent. I recognize that I'm forging relatively unknown territory here about how vdo adds features in the kernel world. I certainly like developing here to use the extensive test suite but I do think it's valuable to send it to dm-devel for comments too.

I don't know how to turn commits here into kernel tree commits; is there an existing best practice? I'm guessing it'd be git format-patch $patch | <rewrite vdo-devel paths into kernel paths> | git am in some form?

Level config: sounds good, pending outcomes of the ui question at the start. I think it should be easy enough to cherry-pick my relevant commits from this branch into the new level config branch, thereby preserving attribution?

Member

The tools for doing this aren't as friendly as they could be, but you can use src/packaging/kpatch/commit_to_overlay.sh. This script can take a series of patches from vdo-devel and a kernel repo and translate those patches into a branch on the kernel repo. (It will actually copy the entire vdo tree for each patch, so it's easy to pick up unrelated cruft depending on the state of the kernel repo you're using. I usually create a dummy patch out of the base of my series first to capture all the tree differences, and then afterwards rebase to remove that patch from my branch. We are planning to improve this process, it's just been a low priority.)

If getting that script working proves overly challenging, src/perl/bin/transformUpstreamPatch.pl will take a formatted patch and rewrite it so all the file paths are updated to the proper kernel locations. Your suggestion will work fairly well if you use this script as the rewrite step, but you'll still have to deal with trimming out the internal code that doesn't go into the kernel proper. Using "git am --reject" will prevent git from just refusing to apply changes when the internal stuff creates a conflict, and then you can resolve the conflicts manually.
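The rewrite step in the format-patch pipeline amounts to a path substitution over the patch text. The toy sketch below illustrates the idea only: the real tool is src/perl/bin/transformUpstreamPatch.pl, and the mapping from src/c++/vdo/base/ to drivers/md/dm-vdo/ is an assumed example, not the script's actual mapping table.

```shell
#!/bin/sh
# Toy illustration of the rewrite step in the pipeline
#   git format-patch ... | <rewrite paths> | git am --reject
# Path mapping below is an assumption for illustration only.

rewrite_paths() {
	# Map a vdo-devel source path to its (assumed) kernel location.
	sed -e 's|src/c++/vdo/base/|drivers/md/dm-vdo/|g'
}

# Example: one diff header line gets relocated.
printf '%s\n' '--- a/src/c++/vdo/base/packer.c' | rewrite_paths
```

Any hunks touching internal-only files would still have to be trimmed by hand, as noted above.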

Member

My thoughts on this now are, we'll get this cleaned up and into our (vdo-devel) repo. Then, I'll collect this and my compressionType series and any other superblock bits we need to add, and send this to the mailing list as a single series. It will need a cover letter, which I can write, but if you've got thoughts about what it should mention, please send them to me or put them in the PR description. The resulting series might be a bit long, but I think it'll hang together better than trying to justify the stages piecemeal, and it means we won't have to deal with versions where some but not all of the bits are in.

@sweettea
Contributor Author

Just wanted to ping this -- it looks like I could fit in a synchronous meeting tomorrow anytime from about 1pm Eastern onward, if y'all have any free time.

@lorelei-sakai
Member

Now that I've committed PR #269, it would be good to get this PR cleaned up and aligned with that so we can get it committed also. I'm working on the super block changes, but I think they're pretty small, and it might make sense for them to get committed after this series anyway, since the whole motivation for adding the flags is to support another compression algorithm.

Member

@lorelei-sakai lorelei-sakai left a comment

Okay, I've added some more granular comments to give you some ideas to work with in terms of cleaning this up. I didn't mention anything about aligning it with the table line interface introduced by PR #269 since that seems like it should be straightforward, but let me know if you have any questions or difficulties with it.

Comment thread src/c++/vdo/base/packer.c
return VDO_SUCCESS;
}

/*
Member

So these first 3-4 patches move the compression and decompression routines from data-vio into the packer. But my question is, why? What does this code motion improve? It's not like the compression work happens in the packer or even on the packer thread. (It's farmed out to a CPU thread.)

Contributor Author

I think it makes sense for all the compression/decompression stuff to live in one place.
Right now, compression contexts are allocated in vdo.c and compression/decompression is called in data-vio.c, but those are simple, straightforward calls using LZ4 constants -- essentially the complication is offloaded to lz4. However, in the zstd universe, the size of the compression context is a more complicated calculation, and comp/decomp is conditional on the algorithm.

With the added conditionals, I think it becomes clear that we should have all the compression calculations together so we can verify in one place that they're all parallel. And packer is the most compression-ish place to do that.
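The "all compression calculations in one place" idea might look something like the table below. This is a sketch with placeholder names and sizes, not the real LZ4/zstd kernel API or the actual dm-vdo code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical algorithm table; sizes are stand-ins, not real values. */
enum compression_algorithm { ALG_LZ4, ALG_ZSTD, ALG_COUNT };

struct compression_ops {
	const char *name;
	/* How much scratch memory one compression context needs. */
	size_t (*context_size)(int level);
};

static size_t lz4_context_size(int level)
{
	(void) level;   /* lz4 scratch size is level-independent here */
	return 16384;   /* stand-in for a constant like LZ4_MEM_COMPRESS */
}

static size_t zstd_context_size(int level)
{
	/* Stand-in: zstd workspace grows with the compression level. */
	return 65536 + (size_t) level * 4096;
}

static const struct compression_ops ops[ALG_COUNT] = {
	[ALG_LZ4]  = { "lz4",  lz4_context_size },
	[ALG_ZSTD] = { "zstd", zstd_context_size },
};

/* One place to compute the worst-case context size across algorithms. */
static size_t max_context_size(int level)
{
	size_t max = 0;

	for (int i = 0; i < ALG_COUNT; i++) {
		size_t size = ops[i].context_size(level);

		if (size > max)
			max = size;
	}
	return max;
}
```

Sizing contexts for the worst case across algorithms is one way to let both algorithms share a single memory blob, as discussed earlier in the thread.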

Member

Okay, the argument seems reasonable to me. I wonder if we should rename the file from packer.[ch] to something like compress.[ch], though? (Similar to dedupe.[ch].) Probably not worth worrying about at this time.
If we did something like that, it should be in its own commit, because in the kernel we'd need to add some files to the commit (Makefiles, mostly) which aren't reflected in this repo.

Comment thread src/c++/vdo/base/packer.h
struct compressed_block {
union {
struct compressed_block_1_0 v1;
};
Member

I would definitely prefer not to have a union here, and I don't think it helps anything in this case. There's no reason we can't use a single in-memory representation for both formats. (We might have to call out an in-memory structure that's separate from the packed, on-storage representation; I haven't thought it through enough to figure that out yet.)

Contributor Author

We benefit from some sort of struct compressed_block in data-vio.h, as a member of compression_state; and then we can pass it to compress_buffer and uncompress_to_buffer.

I can grant it not being a union. Though, my understanding of dm norms is that the structure on-disk needs to be defined as a struct in a header file, so I don't think it makes much difference whether it's a union or a 4k array of char?

Member

Okay, the problem here is that the existing code uses the packed on-disk buffer as its working in-memory object also. You're necessarily changing that here (struct compressed_block is the in-memory one), so that structure could just have the fragment offset array, data pointer, and type. We'd have to do a bit more marshalling/unmarshalling, but not much, and it simplifies the code that needs to refer to, e.g., the data pointer without caring what the version is.

Part of me is thinking that the minor version field is four bytes and we could abuse it slightly and say version 1.0 is lz4, version 1.1 is zstd. That's probably a bad idea (but it would mean the on-disk header wouldn't change size...).
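One way to read "just the fragment offset array, data pointer, and type" as an in-memory struct is sketched below. The names, the 14-fragment bound, and the helper are illustrative guesses, not VDO's actual definitions, and marshalling to/from the packed on-storage format would happen separately.

```c
#include <stdint.h>

#define TOY_MAX_FRAGMENTS 14 /* bounded by the block mapping states */

/* Hypothetical in-memory representation, distinct from the packed
 * on-storage layout; no version-specific union is needed here. */
struct toy_compressed_block {
	uint16_t offsets[TOY_MAX_FRAGMENTS]; /* where each fragment starts */
	char *data;                          /* points into the block buffer */
	uint8_t type;                        /* which compression algorithm */
};

/* A fragment's size falls out of adjacent offsets. */
static uint16_t toy_fragment_size(const struct toy_compressed_block *block,
				  int fragment)
{
	return block->offsets[fragment + 1] - block->offsets[fragment];
}
```

Code that only needs the data pointer or a fragment's extent can then use this struct without caring which on-disk version produced it.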

Comment thread src/c++/vdo/base/packer.h Outdated

/* Compression type in use for this block's fragments */
u8 type;

Member

If you put the type after the array of fragment offsets, it means these structures are identical except for whether you read that final byte. That would simplify a lot, I think. It does mean we'd have problems making the number of fragments larger, but that limitation is set by the number of block mapping states, so if we wanted to make that change it would still require a fair bit of restructuring. (Realistically, I don't think we'll increase the number of fragments; your data has to be fairly particular to even get to 14.)

Also as a minor thing, I don't like that the array of u16s ends up misaligned. That just feels messy.

Contributor Author

Makes sense. Thanks.

Comment thread src/c++/vdo/base/packer.h Outdated
Comment thread src/c++/vdo/base/vdo.c Outdated
@@ -0,0 +1,4 @@
#include <zstd.h>
Member

Given how much eventually ends up in this file, do we really need to import the userspace zstd.h? Why not just copy over the whole kernel implementation into a fake header? How much code is that, really? (I haven't looked, I genuinely don't know.)

If there is a lot of code inclusion you save this way, it also seems like there should be some way of specifying where to look for the userspace header so this file wouldn't end up trying to include itself, but I haven't researched that, so I don't know.

Contributor Author

$ wc include/linux/zstd.h
  614  3045 23006 include/linux/zstd.h
  

I don't think it qualifies as really small enough. :(

Member

600 lines isn't that bad. It looks like once you're done, the shims file is a bit over a hundred lines anyway.

zstd probably doesn't have this problem, but I'm also influenced by the discovery that our lz4 userspace implementation is entirely different from the one in the kernel, so what we test in userspace doesn't actually reflect exactly what the kernel does. So I'd be just as happy if our fake files reflected the actual kernel code.
It's possible you could trim out some bits that aren't needed by our tests, too. Though even if not, 600 lines is not that bad. (By comparison, the userspace fake of dm-bufio is about 233 lines, which is probably the largest one we currently have. So, could go either way, I guess.)

Contributor Author

That's just the header that's 600 lines. The implementation is 22k more.

Comment thread src/c++/vdo/base/packer.c
zstd_dstream_workspace_bound(VDO_BLOCK_SIZE));
for (int level = zstd_min_clevel(); level <= zstd_max_clevel(); level++) {
zstd_parameters params = zstd_get_params(level, VDO_BLOCK_SIZE);
context_size = max_t(size_t, context_size,
Member

These sizes don't change from run to run, do they? They're fixed by the algorithm, I hope. We should be able to figure them out at compile time or something and not do this calculation here. (Ideally, zstd would even have constants defined that we could include, but I presume because you wrote this that it doesn't.)

Contributor Author

Wouldn't that be a nice universe? :/

It's not even as good as "we can check context size of min and max clevel and take the max of those" -- it's non-linear and there are no constants defined (because it depends on the max block size). This sort of loop is what btrfs does at https://elixir.bootlin.com/linux/v6.13.7/source/fs/btrfs/zstd.c#L160 .
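The btrfs-style loop reads roughly like the sketch below. The `TOY_*` constants and `toy_cctx_workspace_bound()` are stand-ins for the kernel's `zstd_min_clevel()`/`zstd_max_clevel()` and the `zstd_get_params()`-plus-workspace-bound query; the real bound is non-linear in the level, so any monotone stand-in shows the shape of the iteration without reproducing the actual sizes.

```c
#include <stddef.h>

/* Stand-ins for the kernel zstd API; illustrative values only. */
#define TOY_MIN_CLEVEL 1
#define TOY_MAX_CLEVEL 22
#define TOY_BLOCK_SIZE 4096

static size_t toy_cctx_workspace_bound(int level)
{
	/* The real bound is non-linear in the level; any monotone
	 * function demonstrates the loop. */
	return (size_t)level * level * 1024 + TOY_BLOCK_SIZE;
}

/* Walk every supported level and keep the largest workspace requirement,
 * as btrfs does, since no single predefined constant covers all levels. */
static size_t max_compression_context_size(void)
{
	size_t context_size = 0;
	int level;

	for (level = TOY_MIN_CLEVEL; level <= TOY_MAX_CLEVEL; level++) {
		size_t bound = toy_cctx_workspace_bound(level);

		if (bound > context_size)
			context_size = bound;
	}
	return context_size;
}
```

The alternative discussed below, sizing contexts for the one configured level, would replace this loop with a single bound query at table-load time.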

Member

You know, it occurs to me that for any given instance of vdo, we know what kind of compression it is using (and what level) so we can allocate contexts that are exactly the right size without looping. This would be a bit more complicated, because whenever we suspend and resume, we'd have to reallocate these contexts in case the configuration changed, but that still seems like way less work than trying to define a generic maximum.

I really like that idea, but if that seems too complicated, we can also do this calculation offline, add a constant describing the maximum context size (with an appropriately detailed comment), and then add a unit test to test all the possibilities regularly so we know if/when the computed maximum becomes invalid.

Member

By the way, the work to reallocate the contexts would be similar to the work we already do for replacing the underlying block device: compare the old device config to the new one, and only do the work if something has changed. So most of the time we wouldn't even need to reallocate anything.

@lorelei-sakai
Member

Gentle ping. I would try to get this moving sooner rather than later. Also upon investigation, the lvm adjustments are more complicated than I'd expected, so I'd like to try to get all the compressionType work (yours and mine) done as a unit so we only need to do the lvm work once.

@sweettea
Contributor Author

Sorry sorry! I meant to reply last weekend that I didn't think I could get to it before this weekend. Which looks vaguely accurate, just finished a masssssssive overdue thing at work so my weekends are free again.

@sweettea force-pushed the features/zstd branch 2 times, most recently from e953fd7 to b0c0c67 on April 3, 2025 07:46
sweettea added 2 commits April 3, 2025 04:04
Move LZ4 compression and decompression functionality from data-vio.c to packer.c,
centralizing compression functionality ahead of making it more complex
with zstd. This refactoring:

- Moves uncompress_data_vio() to packer.c as vdo_uncompress_to_buffer()
- Creates new vdo_compress_buffer() function in packer.c
- Makes vdo_get_compressed_block_fragment() static as all users are now
  in packer.c.
Refactor the compressed block structure to support multiple format versions.
This change:

- Creates new compressed_block_header_1_0 and compressed_block_1_0 structures
- Renames VDO_COMPRESSED_BLOCK_DATA_SIZE to VDO_COMPRESSED_BLOCK_DATA_SIZE_1_0

Signed-off-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
sweettea added 11 commits April 3, 2025 20:12
- Update parse_compression() to handle both LZ4 and ZSTD algorithms
- Set compression_type in device_config for both algorithms
- Output compression type as appropriate.
- Upgrade to version 68.

Signed-off-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Add necessary fake file to reflect the use of kernel
zstd_{min,max}_clevel() functions.

Include search rules mean that we can't actually have a
fake/linux/zstd.h which includes system <zstd.h>. Hence, users are
condemned to have a conditional __KERNEL__ include, and the fake/linux/
file is named zstd-shims.h, to avoid the name collision.

(Kernel header only works with the kernel c files, as they have slightly
different names than the userspace functions.)

Signed-off-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
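The conditional include described in that commit message would look something like the fragment below; the zstd-shims.h name comes from this commit, while the exact relative path is an assumption.

```c
#ifdef __KERNEL__
#include <linux/zstd.h>            /* kernel zstd API */
#else
#include "fake/linux/zstd-shims.h" /* shim wrapping the system <zstd.h> */
#endif
```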
Add -lzstd to the relevant linker flags.

Signed-off-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Refactor compression context management by relocating LZ4 workspace allocation
from vdo.c to packer.c where the rest of compression functionality now
lives. This change:

- Adds context_count and compression_context fields to packer structure
- Moves allocation/deallocation code from vdo to packer module
- Updates thread initialization to reference packer's contexts
- Removes compression_context field from vdo structure

This centralizes all compression-related functionality within the packer module
and simplifies the vdo initialization path.

Signed-off-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Restructure compression context sizing and initialization to support
both LZ4 and zstd:

- Calculate maximum context size needed for zstd
- Initialize the context buffer as needed before each use.
Actually enable using zstd implementation by adding compression and
decompression calls as appropriate.

This completes the zstd feature.

Signed-off-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Signed-off-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Signed-off-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Add a couple of zstd variants of compression tests to nightly.

Signed-off-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
With the addition of zstd support, document this in the appropriate
place.

Signed-off-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
@sweettea
Contributor Author

sweettea commented Apr 4, 2025

Okay, I think this is in reviewable state.

One open question, and we might have discussed it: when do we upgrade from volume version 67 to 68? Do I need to implement 'keep using v1 until they opt into zstd'?

A mechanical question: I've tried to make fairly thorough descriptions for kernel patches; should I do that for the rest too?

Thanks!

@sweettea sweettea marked this pull request as ready for review April 4, 2025 00:14
@lorelei-sakai
Member

I will try to get to this review tomorrow.

One open question, and we might have discussed it: when do we upgrade from volume version 67 to 68? Do I need to implement 'keep using v1 until they opt into zstd'?

The superblock changes I will do; I will add them as a patch at the end of the series, or maybe in the middle if I decide that makes more sense. I'd hoped to have a draft out so you could see what I intend, but I've been distracted this week.

For the compressed block format, we should continue to use v1 if the customer has not used zstd, for backwards compatibility.

A mechanical question: I've tried to make fairly thorough descriptions for kernel patches; should I do that for the rest too?

Kernel patch descriptions are the most important. For userspace and internal items, it's enough to be reasonably descriptive. I'll note any specifics as I review.

@github-actions

github-actions bot commented Apr 4, 2025

Unit Test Results

3 files, 87 suites, 35m 22s ⏱️
267 tests: 266 ✔️ passed, 0 💤 skipped, 1 failed
268 runs: 267 ✔️ passed, 0 💤 skipped, 1 failed

For more details on these failures, see this check.

Results for commit 51299c1.

Member

@lorelei-sakai lorelei-sakai left a comment


Generally seems like an improvement. I've noted a few places where I think different architectural choices have to be made, or where it looks like you rebased without noticing how the compressionType series changed some of this code.

One big conclusion is that the super block work probably needs to happen before this series (so you can use the feature flags it will add). I'm planning on getting that and some other fixes out next week. So there are parts you can't accurately update until those things are resolved.

Comment thread src/c++/vdo/base/packer.c

data_vio->compression.block->data, VDO_BLOCK_SIZE,
VDO_MAX_COMPRESSED_FRAGMENT_SIZE,
vdo->device_config->compression_level,
(char *) vdo_get_work_queue_private_data());
Member

Okay, so wait, the old code passes the LZ4 compression_level into the compress function, and it looks like you just dropped that. That's no good.

Looking ahead, it seems like you eventually add a third parameter to vdo_compress_buffer, so just do that up front. Also, given the level of the interface, it seems like passing the compression level directly would make more sense than passing an entire vdo pointer.

Comment thread src/c++/vdo/base/packer.c
/**
* vdo_get_compressed_block_fragment() - Get a reference to a compressed fragment from a compressed
* block.
* block.
Member

The reformatting seems unnecessary (and wrong), so I presume it was an accident. Similarly down on lines 44-45 (though in that case we could just truncate the comment after VDO_INVALID_FRAGMENT, as the remainder is sort of redundant).

Comment thread src/c++/vdo/base/packer.c
struct compressed_block *block,
u16 *fragment_offset, u16 *fragment_size)
STATIC int
vdo_get_compressed_block_fragment(enum block_mapping_state mapping_state,
Member

I suppose it's not a big deal, but I don't think we need to wrap this line; it looks like it would end up being about 84 characters in that case?

Comment thread src/c++/vdo/base/packer.c
*/

int vdo_uncompress_to_buffer(enum block_mapping_state mapping_state,
struct compressed_block *block, char *buffer)
Member

This signature feels weird to me. Maybe make struct compressed_block the first argument, since this is a compression function? Part of the issue is that it doesn't feel parallel to vdo_compress_buffer(), below.

Comment thread src/c++/vdo/base/packer.c
}

return VDO_SUCCESS;

Member

My thoughts on this now are: we'll get this cleaned up and into our (vdo-devel) repo. Then I'll collect this and my compressionType series and any other superblock bits we need to add, and send this to the mailing list as a single series. It will need a cover letter, which I can write, but if you've got thoughts about what it should mention, please send them to me or put them in the PR description. The resulting series might be a bit long, but I think it'll hang together better than trying to justify the stages piecemeal, and it means we won't have to deal with versions where some but not all of the bits are in.

.physicalThreadCount = 1,
.hashZoneThreadCount = 1,
.compression = VDO_LZ4,
.compression = VDO_ZSTD,
Member

I don't think we should give up the lz4 compression tests; rather, we should add new ones for zstd. Hopefully we can parameterize it to just change the compression method and not actually have to duplicate the meat of the tests.

@{$suiteNames{tornWrites}},
"ZstdCompress01",
"ZstdGenData02",
"ZstdRebuild04",
Member

This is a good idea. I'd call out a compression suite, though, and run all of them with zstd and lz4. Maybe not the long ones (Perf Fio), but any default suite tests that use compression would be good.

Comment thread doc/vdo.rst

For zstd, the option is the compression level. See zstd
documentation for more detail on the compression level.

Member

Don't forget to update the doc on the status line, too. Though as I noted elsewhere, the status line is currently wrong so the doc is, too. So that might have to wait.

vdo->states.vdo.config = *config;
vdo->states.vdo.nonce = nonce;
vdo->states.volume_version = VDO_VOLUME_VERSION_67_0;
vdo->states.volume_version = VDO_VOLUME_VERSION_68_0;
Member

Oh, one more thing. VDOVersion_t1 failed, because it compares the on-disk format with a pickled version to make sure nothing has changed. Updating the formatted version number means you have to update that pickle too. But this should be dealt with in the commit that actually does the super block changes, wherever that lands.

Contributor Author

Yeah, I figured leaving VDOVersion_t1 broken was the best way to make sure I didn't forget the remaining work to sync it with the upcoming superblock work. Thanks!

@sweettea
Contributor Author

sweettea commented Apr 5, 2025

Wow, I never thought I'd say this, but I got thoroughly lost in the GitHub review conversation formatting and I miss email-based reviewing. Apologies if there are disconnected comments in the wrong place.

@sweettea
Contributor Author

A while back we discussed whether or not zstd on 4k blocks even made sense: https://news.ycombinator.com/item?id=43655010 is a comment from a company doing zstd on 4-8k blocks of memory, so they clearly find it useful to do zstd in such small chunks, so I was wrong.

@lorelei-sakai
Member

It's been a few weeks, have you had a chance to get back to this? Do you have any estimate of when you might be able to devote some time to this? Thanks.

@sweettea
Contributor Author

sweettea commented Jun 18, 2025 via email

@lorelei-sakai
Member

End of month is no problem, just trying to update expectations. I hope you have a chance to get some rest this weekend.

@lorelei-sakai added the "on hold" (PR is on hold for now) label on Jul 25, 2025
@lorelei-sakai
Member

It's been a while since this has been updated, so I'm putting this PR on hold. If and when you get a chance to work on this again, let us know, and feel free to remove the "on hold" label as well.
