Dictionary Creation #91
Consider it a draft for the time being
That's very cool! Thanks for the update on your progress :)
Update: the core algorithm is complete; now all that's left is cleanup, and possibly performance/efficiency improvements in a different PR. Benchmarking on the
@KillingSpark this is ready for review :)
Nice! I'll do a complete review in a bit. A few small things I noticed while skimming the change set:
It'll probably be a few days until I find the time to do a full review, but I'm excited to have this functionality merged soon :)
Hi, finally found the time to review. This looks great! One thing needs fixing before I can merge: the change in lib.rs where you removed the
And one more nitpick, which I wouldn't bother you with if it was just that: in Cargo.toml there's still a typo changing
Otherwise great work, I'll gladly merge it with the issue above resolved :)
- Fix typo in cargo.toml
- Set VERBOSE to false and add a test to verify it's false
- Remove commented out bench code from zstd_dict.rs
The removal of
Other feedback has been addressed
Thanks!
It seems a little deceptive to claim "compression gains within a percentage of the original implementation" and then clarify that refers to raw content dictionaries and not the kind of huffman-coded/FSE dictionaries produced by zstd --train.
How are you even getting this:
zstd-rs/ruzstd/src/dictionary/mod.rs Lines 6 to 8 in 20a5fb9
But the document from the
This seems to be very, very specific to the zstd implementation, and not remotely about data frequency. Dictionaries based on the data itself would seem to refer to the FSE/huffman entropy tables produced by zstd --train.
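To make that distinction concrete, here is a small illustrative sketch of my own (not code from this PR), assuming the zstd crate's dict::from_files and Encoder::with_dictionary APIs: a trained dictionary begins with the dictionary magic number and carries entropy tables plus a dictID, while a raw content dictionary is just a blob of bytes used as prefix content.
use std::io::Write;
use std::path::Path;

// Sketch only: shows how the two kinds of dictionaries are produced and loaded.
fn dictionary_kinds(sample_files: &[&Path], payload: &[u8]) -> std::io::Result<()> {
    // Trained dictionary: built by zstd's dictBuilder, starts with the 0xEC30A437
    // magic and contains huffman/FSE tables plus selected content.
    let trained: Vec<u8> = zstd::dict::from_files(sample_files.iter().copied(), 112_640)?;

    // "Raw content" dictionary: arbitrary bytes with no header; the zstd library
    // treats a buffer that does not start with the dictionary magic as raw content
    // and uses it purely as prefix data for matches.
    let raw: Vec<u8> = payload[..payload.len().min(112_640)].to_vec();

    for dict in [&trained, &raw] {
        let mut out = Vec::new();
        let mut enc = zstd::Encoder::with_dictionary(&mut out, 3, dict)?;
        enc.write_all(payload)?;
        enc.finish()?;
        println!("compressed {} -> {} bytes", payload.len(), out.len());
    }
    Ok(())
}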
I can't find any indication that this library has addressed that very explicit warning about safety and untrusted input. The decoding errors are of course completely undocumented: https://docs.rs/ruzstd/latest/ruzstd/decoding/errors/index.html. Your methodology for the
As mentioned, the
We'll come back to this claim.
I can't find this commit in your repo or in the upstream
This is the sum of all the individual json files after tar extraction, which is a very different thing from compression alone. When
The compression settings weren't provided. Using
"of no dict size"? Were you compressing the result of the first compression? This is RIDICULOUSLY misleading. This of course makes the numbers smaller. When you give % of original size (the % that the zstd cli provides) and then switch to % of a smaller number, it looks like you're trying to deceive people. This particular trick is quite literally an example of academic dishonesty that would at the very least get a paper retracted after publication.
I still have no clue how you generated the "reference dictionary", given that to my knowledge zstd has no API to generate a raw content dictionary. Is there a benchmarking tool in their repo I'm missing? Now here's my benchmark:
# cosmicexplorer@terrestrial-gamma-ray-flash: ~/tools/zstd 14:50:56
; zstd --train --filelist gh-json.txt -o gh-json-text-noargs.dict
# cosmicexplorer@terrestrial-gamma-ray-flash: ~/tools/zstd 14:51:04
; find github -name '*.json' -type f > gh-json.txt
# cosmicexplorer@terrestrial-gamma-ray-flash: ~/tools/zstd 14:51:17
; zstd --train --filelist gh-json.txt -o gh-json-text-noargs.dict
Trying 5 different sets of parameters
k=1024
d=8
f=20
steps=4
split=75
accel=1
Save dictionary of size 112640 into file gh-json-text-noargs.dict
# cosmicexplorer@terrestrial-gamma-ray-flash: ~/tools/zstd 14:53:24
; rm -rf gh-trained && mkdir -v gh-trained && zstd --filelist gh-json.txt -D gh-json-text-noargs.dict --output-dir-mirror gh-trained --verbose --threads=0
mkdir: created directory 'gh-trained'
*** Zstandard CLI (64-bit) v1.5.7, by Yann Collet ***
Note: 8 physical core(s) detected
github/user.xZXXt8FeFI.json : 9.55% ( 869 B => 83 B, gh-trained/github/user.xZXXt8FeFI.json.zst)
# ... [output omitted] ...
github/user.SHuIWPzlXT.json : 9.90% ( 869 B => 86 B, gh-trained/github/user.SHuIWPzlXT.json.zst)
9114 files compressed : 10.17% ( 7.14 MiB => 743 KiB)
Using only the defaults from the CLI, that's 10.17%. Again, let's review the claim on the front page of your repo:
Where is 0.2% from? Did you just subtract 16.28% - 16.16%? I won't accuse you of intentional academic dishonesty for this, but that is not how percentages work.
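(Worked out with the figures quoted above, purely for illustration: 16.28% − 16.16% = 0.12 percentage points, whereas the relative difference between the two outputs is 16.28 / 16.16 − 1 ≈ 0.7%. Neither of those is 0.2%, and the two quantities mean different things.)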
You know for a damn fact that 99% of users have no clue what this means, since you don't clarify that
Did you find a resource that describes the raw content format as lempel-ziv encoding? Because the zstd dictionary format document very clearly states that raw content has no format. Is there no existing implementation of lempel-ziv window encoding in rust? Because it really seems like you just put in that citation to intentionally deceive your users about having implemented the actually useful form of dictionary encoding, the one that
I'd love to see a comparison to randomly generated bytes. Here's an example of that in the zip file crate with all the tricks I've developed: https://github.com/zip-rs/zip2/blob/c864a14035439154187aa6a6c894c7031e9087aa/benches/merge_archive.rs#L20
fn generate_random_archive(
num_entries: usize,
entry_size: usize,
options: SimpleFileOptions,
) -> ZipResult<(usize, ZipArchive<Cursor<Vec<u8>>>)> {
let buf = Cursor::new(Vec::new());
let mut zip = ZipWriter::new(buf);
let mut bytes = vec![0u8; entry_size];
for i in 0..num_entries {
let name = format!("random{i}.dat");
zip.start_file(name, options)?;
getrandom::fill(&mut bytes).unwrap();
zip.write_all(&bytes)?;
}
let buf = zip.finish()?.into_inner();
let len = buf.len();
Ok((len, ZipArchive::new(Cursor::new(buf))?))
}
We also compare against different types of inputs, like Shakespeare, or video files. Now let's see your repo's benchmark: https://github.com/KillingSpark/zstd-rs/blob/master/ruzstd/benches/decode_all.rs
It looks like you choose exactly one input file to compare against. You also include the file as a slice and use the frame encoder, which is going to provide a better number than the streaming encoder one would normally expect to use from a
Your "round trip" for FSE is also extremely suspicious:
if data.len() < 2 {
return;
}
if data.iter().all(|x| *x == data[0]) {
return;
}
if data.len() < 64 {
return;
}
You need to have a distinct phase of input parsing so that you don't need these kinds of one-off checks cluttering up your encoding and decoding logic. That's how vulnerabilities happen. Here's an example of using proptest:
#[cfg(any(test, feature = "proptest"))]
impl Arbitrary for ValidPath<ffi::OsString> {
type Parameters = (Option<usize>, Option<usize>);
type Strategy = proptest::strategy::BoxedStrategy<Self>;
#[cfg(unix)]
fn arbitrary_with(args: (Option<usize>, Option<usize>)) -> Self::Strategy {
use std::os::unix::ffi::OsStringExt;
let (min_len, max_len) = args;
proptest::string::bytes_regex(&format!(
r"(?s-u:[^\x00]{{{},{}}})",
min_len.unwrap_or(0),
max_len.unwrap_or(500),
))
.unwrap()
.prop_map(|s| unsafe { Self::new_unchecked(ffi::OsString::from_vec(s)) })
.boxed()
}
}
Also, it completely defeats the purpose of fuzzing if you use fixed parameters like the number 22 on this line:
Line 96 in 20a5fb9
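As a generic illustration of the alternative (a hypothetical sketch, not ruzstd's actual fuzz target or API; fuzz_one and fse_roundtrip are made-up names), the parameter can be derived from the fuzz input itself so the fuzzer explores many table sizes instead of one:
// Hypothetical sketch: let the fuzzer choose the table-log parameter instead of
// hard-coding a single value.
fn fuzz_one(data: &[u8]) {
    let Some((&selector, payload)) = data.split_first() else {
        return;
    };
    // Map the first input byte onto a plausible range (5..=22 here, chosen
    // arbitrarily for the sketch) so both tight and oversized tables get exercised.
    let max_log = 5 + (selector % 18);
    fse_roundtrip(payload, max_log);
}

// Placeholder so the sketch compiles on its own; the real round-trip logic lives
// in the crate's fuzz targets.
fn fse_roundtrip(_data: &[u8], _max_log: u8) {}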
Finally I'm just flummoxed about this part:
What's the point of using rust at all if you're not going to provide a safe interface? Why on earth isn't the while loop approach in the library itself? Why isn't it the default approach for the streaming decoder? I saw this crate included in the rust compiler's transitive dependencies after pulling from
I would love to help describe or explain further how to produce trustworthy benchmark and fuzzing processes for rust projects. I would be delighted to pair with you if you are genuinely interested in achieving scientific reproducibility guarantees for your benchmarks. But the language around dictionary construction is impossible for me to interpret as anything except malicious deception. If that is not your intent, please please let me know and I would love to work with you. But otherwise I will probably just make my own zstd implementation in rust, because it is legitimately frightening to me that the standard library in every one of my rust programs relies upon a library with essentially zero fuzzing coverage that actively attempts to deceive readers about its degree of support for dictionary generation.
Please help me understand why you provide numbers from a completely unreproducible benchmark on the main docs page, and why you change from measuring % of original to % of the no dict output, when the
While I appreciate your passion, the tone with which you approached your feedback is not appreciated, and I will attempt to offer you more courtesy in my response. No malice was intended with this PR, and I would gain nothing from it anyway. It was purely a passion project. I'm not a maintainer of this project, so any success it brought the project would likely not go to me.
Raw Content vs State Dictionaries
When starting this project I needed to define ways to quantify the progress I made. I came to the conclusion that the main benchmark of success for this project would be how effective the inclusion of a dictionary created by this PR was at improving compression ratio. Other possible benchmarks that I did not include were the speed of dictionary generation, the effectiveness of the dictionary per unit size, and how the dictionary impacted compression/decompression size. As this was a passion project done in my limited time as an unpaid undergraduate research project, I could not focus on everything. I decided that it didn't matter to the end user whether raw content dictionaries or state based dictionaries were used; it's an internal implementation detail. Any distinction between the two in benchmarking was something that could be added down the line. Later in your comment, you state this:
The Github Users Dataset
https://github.com/facebook/zstd?tab=readme-ov-file#the-case-for-small-data-compression
I did not link the github users dataset by mistake. I don't love that the original zstd project leaves it buried under the attachments of a minor release; that is something that should be corrected.
Lempel Ziv
The paper referenced is the same paper utilized by Yann Collet in the reference implementation. The paper's title is slightly misleading: it describes a generalized dictionary creation algorithm not tied to any particular format, and the relation to Lempel-Ziv was purely circumstantial in its application.
Untrusted input
You are correct, but words are cheap: submit PRs. Should you like to fund someone to contribute fixes for these issues, that is also an acceptable solution.
Research fabrication
You state you were unable to find that commit; it may have been lost in the squashing. It can be found here: 09e52d0, and it is the commit directly above this comment: #91 (comment)
Document Summation
Small documents benefit the most from the utilization of a dictionary. I documented my benchmarking process in a lackluster manner because I did not see the need to do it better. The code for benchmarking is included in the commit you missed, albeit sloppy and poorly documented. To be concise, it is included below:
use std::fs::{self, File};
use std::io::{self, Read, Write};
use std::path::{Path, PathBuf};

struct BenchmarkResults {
uncompressed_size: usize,
nodict_size: usize,
reference_size: usize,
our_size: usize,
}
fn bench<P: AsRef<Path>>(input_path: P) -> BenchmarkResults {
// At what compression level the dicts are built with
let compression_level = 22;
// 1. Collect a list of a path to every file in the directory into `file_paths`
println!("[bench]: collecting list of input files");
let mut file_paths: Vec<PathBuf> = Vec::new();
let dir: fs::ReadDir = fs::read_dir(&input_path).expect("read input path");
fn recurse_read(dir: fs::ReadDir, file_paths: &mut Vec<PathBuf>) -> Result<(), io::Error> {
for entry in dir {
let entry = entry?;
if entry.file_type()?.is_dir() {
recurse_read(fs::read_dir(&entry.path())?, file_paths)?;
} else {
file_paths.push(entry.path());
}
}
Ok(())
}
recurse_read(dir, &mut file_paths).expect("recursing over input dir");
// 2. Create two dictionaries, one with our strategy, and one with theirs
println!("[bench]: creating reference dict");
let reference_dict =
zstd::dict::from_files(file_paths.iter(), 112640).expect("create reference dict");
let mut our_dict = Vec::with_capacity(112640);
println!("[bench]: creating our dict");
create_dict_from_dir(input_path, &mut our_dict, 112640).expect("create our dict");
// Open each file and compress it
let mut uncompressed_size: usize = 0;
let mut nodict_size: usize = 0;
let mut reference_size: usize = 0;
let mut our_size: usize = 0;
    for (idx, path) in file_paths.iter().enumerate() {
        println!("[bench]: compressing file {}/{}", idx + 1, file_paths.len());
        let mut handle = File::open(path).unwrap();
        let mut data = Vec::new();
        handle.read_to_end(&mut data).expect("read input file");
        uncompressed_size += data.len();
        // Compress with no dict
        let nodict_output = zstd::encode_all(data.as_slice(), compression_level).unwrap();
        nodict_size += nodict_output.len();
        // Compress with the reference dict (a fresh encoder per file, since
        // finishing an encoder ends that compression stream)
        let mut reference_output: Vec<u8> = Vec::with_capacity(128_000);
        let mut reference_encoder = zstd::Encoder::with_dictionary(
            &mut reference_output,
            compression_level,
            &reference_dict,
        )
        .unwrap();
        reference_encoder.multithread(8).unwrap();
        reference_encoder.write_all(data.as_slice()).unwrap();
        reference_encoder
            .do_finish()
            .expect("reference encoder finishes");
        drop(reference_encoder);
        reference_size += reference_output.len();
        // Compress with our dict
        let mut our_output: Vec<u8> = Vec::with_capacity(128_000);
        let mut our_encoder =
            zstd::Encoder::with_dictionary(&mut our_output, compression_level, &our_dict).unwrap();
        our_encoder.multithread(8).unwrap();
        our_encoder.write_all(data.as_slice()).unwrap();
        our_encoder.finish().expect("our encoder finishes");
        our_size += our_output.len();
    }
BenchmarkResults {
uncompressed_size,
nodict_size,
reference_size,
our_size,
}
}
misc
That percentage is the difference in compression ratios achieved by zstd-rs's implementation vs facebook's. Feel free to contribute a more detailed explanation. Just because you're allowed to put anything in a raw content dictionary does not mean that it'll improve compression ratios meaningfully. The paper describes ways to find the most valuable content to include in a content dictionary.
That information can be removed; it was included with the intent to invite contributors to implement it, but that's a poor place to put it.
Feel free to add one. You talk about the benchmarking methodologies and mention benchmarking against different media types. I did not write this code so I cannot comment, but my understanding is that the maintainer tested locally but did not develop continual benchmarking infrastructure. This is fine because it's a hobby project, done in their free time, and they can do whatever they please. Again, feel free to contribute said infrastructure.
The rest of the comment goes outside of the scope of this PR, so I will not be addressing it here. If it's relevant to my work and is submitted in the correct location, I will happily provide context.
To reiterate, the tone of your comment is not productive, and it comes across as berating and demanding. I apologize if I came across as terse; to open your GitHub notifications to find a harshly worded 2000-word essay critiquing work done for fun was not pleasant and negatively impacted my outlook.
Best,
I forgot to include this, but if you open issues and tag me in them I'll try my best to fix them; this is just too much to filter through in one discussion :)
Hey, just wanted to let you know I am aware of this; I'll address some of the concerns tomorrow. I am currently visiting family, but I want to at least follow up on some of the concerns. I'll likely also put some of this in separate issues, since there are many different parts to the comment that concern parts of the library that do not have anything to do with this PR. I must agree with the PR author, the tone at the end of this comment got... out of order. I would be lying if I said I never lost myself in a rant, but I will also ask to keep this more civil in the future.
Right, couldn't wait to answer this late at night after reading this fully... All in all I must say, the language you are using @cosmicexplorer is very unfriendly, condescending in some parts, accusatory in others. And I'm not going to engage with you any further if this doesn't change. @zleyyij answered a lot of stuff already; I'll try not to repeat points they already made. Thanks @zleyyij for the well-written response, not everyone would have reacted in such a calm manner. I certainly didn't achieve that as well as you did, but I hope it's still acceptable. I'll answer the parts I deemed important, quoting them. If you feel that you have more questions, please open separate issues for separate questions, and reconsider your choice on how you want to treat me or the other contributors to this crate.
The benchmark you linked is used to compare different versions of ruzstd against each other. Would a more diverse benchmarking harness be better? Almost certainly. You're free to write one. This was sufficient to guide me through some optimizations, which was the use case for writing this benchmark.
The streaming decoder (there is no streaming encoder...) uses the frame decoder internally. It's literally just a convenience wrapper around the framedecoder. This fuzzes both levels. If you do not trust this, feel free to use your own resources and time to fuzz this crate further. I'm happy to take contributions. I'm also happy to look into any fuzzing results you might find.
FSE encoding as implemented in zstd and ruzstd doesn't make sense for slices that only include one distinct symbol. This has to do with how the tables that encode the state transitions are defined. Just remove the check and step through the code; you'll see why. This isn't dangerous btw: if the calling code tries to do that, it will panic. It would be stupid, because zstd supports RLE encoding on almost every level of the format, so any sensible encoder will not call this method with a slice that contains only one distinct symbol.
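To spell that out with a sketch (my own illustration with made-up names, not ruzstd's actual encoder code): an encoder typically branches to RLE before the FSE path ever sees a single-symbol slice.
// Hypothetical sketch: pick RLE for a single distinct symbol so the FSE table
// builder never sees that degenerate case.
enum SymbolCompression {
    Rle(u8),
    Fse,
}

fn choose_symbol_compression(symbols: &[u8]) -> SymbolCompression {
    match symbols {
        // All bytes equal (including a single byte): describe the run with one
        // symbol instead of building an FSE table over a one-symbol alphabet.
        [first, rest @ ..] if rest.iter().all(|b| b == first) => SymbolCompression::Rle(*first),
        // Otherwise at least two distinct symbols are present and FSE is meaningful
        // (a real encoder would have rejected empty input earlier).
        _ => SymbolCompression::Fse,
    }
}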
You'll notice that this kind of guard is just in place to stop the fuzzing from producing noisy results for known "bad" inputs. This has nothing to do with the encoding/decoding logic. The encoder of course does these checks before calling the FSE encoding, because it could just do the RLE encoding instead, which, as mentioned above, FSE does not work for. This has nothing to do with an "input parsing phase". This is just literally avoiding inputs that don't make sense and should, and also will, panic. This is, if anything, a lack of documentation of some internal functions and their preconditions.
Edit:
Do you even know what that number 22 does in this line? Seems like you don't. The necessary amount of bits allowed depends on the amount of different symbols in the data. Choosing this to be optimal for the random input would actually hit fewer interesting codepaths than keeping this constant and having it sometimes be optimal and sometimes suboptimal. Choosing it big enough to suit every input avoids false positive fuzzing results.
Do you know what safe means in rust? I can literally write
It literally is. The options are: decode everything fully in memory, or loop and decode chunks at a time. The framedecoder gives you more control over the specifics; the streaming decoder helps you do it more conveniently. And the most convenient method, which you should only use for known-good inputs, is to just do it all at once in memory.
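For concreteness, here is a minimal sketch of the "loop and decode chunks at a time" option with a caller-chosen output limit, assuming ruzstd's StreamingDecoder (under ruzstd::decoding in recent versions) and its std::io::Read implementation; this is my own illustration, not an API the crate ships.
use std::io::{self, Read};

// Sketch: decode an untrusted zstd stream chunk by chunk, refusing to grow the
// output beyond `limit` bytes regardless of what the frame header claims.
fn decode_bounded<R: Read>(compressed: R, limit: usize) -> io::Result<Vec<u8>> {
    let mut decoder = ruzstd::decoding::StreamingDecoder::new(compressed)
        .map_err(|e| io::Error::new(io::ErrorKind::InvalidData, format!("{e:?}")))?;
    let mut out = Vec::new();
    let mut chunk = [0u8; 4096];
    loop {
        let n = decoder.read(&mut chunk)?;
        if n == 0 {
            break; // end of frame
        }
        if out.len() + n > limit {
            return Err(io::Error::new(
                io::ErrorKind::InvalidData,
                "decompressed data exceeds the configured limit",
            ));
        }
        out.extend_from_slice(&chunk[..n]);
    }
    Ok(out)
}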
It's used for reading zstd-compressed metadata for backtraces, yeah. If this is any consolation, they use an API that ensures that only a very specific amount of memory will be used. Also: neither I nor anyone working on this crate pushed for this. The std devs stepped up, contributed the necessary magic to the Cargo.toml, and used it as a dependency.
Go ahead, it's a free world. I would be happy to have another person working on this crate though. It is however wrong to claim this: "with essentially zero fuzzing coverage". There is fuzzing in place for the decoding; it runs pretty much all parts of the decoding API.
At the time of opening it's not quite in a state where it's ready to be merged, but I want you to be able to track progress and provide feedback.
My roadmap looks like this:
Once that's done, I could begin improving dictionary creation in a separate PR.