
Conversation

@zleyyij
Contributor

@zleyyij zleyyij commented Jul 21, 2025

At the time of opening it's not quite in a state where it's ready to be merged, but I want you to be able to track progress and provide feedback.

My roadmap looks like this:

  • Design and implement an interface for churning through a lot of files without loading them all into memory at the same time
  • Split text into epochs
  • Score each segment in an epoch using a reservoir sampler to estimate segment frequency within the epoch (a rough sketch of this idea is shown after this list)
  • Determine the best order to concatenate k-mers to form the dictionary (likely simple)
  • Implement zstd format wrappers around the dictionary
  • Create user-facing methods for creating a new dictionary
  • Create rough benchmarks to ensure that created dictionaries are effective and creation is usably fast for enwik9
  • Add dict creation to the CLI
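
As a rough illustration of the reservoir-sampling step, here is a minimal sketch (not this PR's actual code); the k-mer length, reservoir size, function name, and use of the rand crate are assumptions made for the example:

use rand::Rng;

// Illustrative constants, not the values used in this PR
const K: usize = 8; // k-mer length
const RESERVOIR: usize = 256; // number of sampled k-mers kept per epoch

/// Uniformly sample k-mers from one epoch's bytes (Algorithm R), so frequent
/// k-mers can be estimated without holding the epoch's full k-mer set in memory.
fn sample_kmers(epoch: &[u8]) -> Vec<&[u8]> {
    let mut rng = rand::thread_rng();
    let mut reservoir: Vec<&[u8]> = Vec::with_capacity(RESERVOIR);
    for (seen, kmer) in epoch.windows(K).enumerate() {
        if reservoir.len() < RESERVOIR {
            reservoir.push(kmer);
        } else {
            // Replace a random slot with probability RESERVOIR / (seen + 1)
            let j = rng.gen_range(0..=seen);
            if j < RESERVOIR {
                reservoir[j] = kmer;
            }
        }
    }
    reservoir
}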

Once that's done, I could begin improving dictionary creation in a separate PR.

@zleyyij
Contributor Author

zleyyij commented Jul 21, 2025

Consider it a draft for the time being

@KillingSpark
Owner

That's very cool! Thanks for the update on your progress :)

@zleyyij
Contributor Author

zleyyij commented Aug 11, 2025

Update: the core algorithm is complete; now all that's left is cleanup, and possibly performance/efficiency improvements in a different PR.

Benchmarking on the github-users sample set, we achieve compression gains within a percentage point of the original implementation.

uncompressed: 100.00% (7484607 bytes)
no dict: 34.99% of original size (2618872 bytes)
reference dict: 16.16% of no dict size (2195672 bytes smaller)
our dict: 16.28% of no dict size (2192400 bytes smaller)
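
(To spell out the arithmetic behind the last two lines: the no-dict output totals 2618872 bytes; with the reference dictionary the output shrinks by 2195672 bytes to 423200 bytes, i.e. about 16.16% of the no-dict size, and with our dictionary it shrinks by 2192400 bytes to 426472 bytes, i.e. about 16.28% of the no-dict size.)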

@zleyyij
Contributor Author

zleyyij commented Aug 19, 2025

@KillingSpark this is ready for review :)

@KillingSpark
Owner

Nice! I'll do a complete review in a bit; a few small things I noticed while skimming the change set:

  • Typo introduced in Cargo.toml the -> theea
  • lib.rs VERBOSE should not be set to true
  • The big chunk of commented out code in zstd_dict.rs should either be used or deleted

It'll probably be a few days until I find the time to do a full review, but I'm excited to have this functionality merged soon :)

@KillingSpark
Owner

Hi, finally found the time to review. This looks great! One thing needs fixing before I can merge: the change in lib.rs where you removed mod tests; and included your own small tests module. I like the test for verbose_disabled though; you could copy that into tests/mod.rs. But removing mod tests; means the existing tests no longer run.
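
A minimal sketch of what I mean; the module layout and the exact shape of VERBOSE are assumptions from memory:

// lib.rs (sketch)
const VERBOSE: bool = false; // the flag mentioned above, assumed to be a bool const

#[cfg(test)]
mod tests; // keep this so the existing suite in tests/mod.rs still runs

// tests/mod.rs (sketch)
#[test]
fn verbose_disabled() {
    assert!(!crate::VERBOSE);
}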

And one more nitpick, which I wouldn't bother you with if it was just that: in Cargo.toml there's still a typo changing the -> thee.

Otherwise great work, I'll gladly merge it with the issue above resolved :)

- Fix typo in cargo.toml
- set VERBOSE to false and add a test to verify it's false
- remove commented out bench code from zstd_dict.rs
@zleyyij
Contributor Author

zleyyij commented Sep 20, 2025

The removal of mod tests must have happened automatically from my IDE; that's my bad.

Other feedback has been addressed.

@KillingSpark
Owner

Thanks!

@KillingSpark KillingSpark merged commit 2cebcfc into KillingSpark:master Sep 24, 2025
2 checks passed
@cosmicexplorer

It seems a little deceptive to claim "compression gains within a percentage of the original implementation" and then clarify that refers to raw content dictionaries and not the kind of huffman-coded/FSE dictionaries produced by zstd --train. The github-users dataset isn't mentioned anywhere else in the codebase, and the upstream zstd project obscures the link, so that benchmark is already just completely nonreproducible.

How are you even getting zstd to generate raw content dictionaries? I couldn't find anywhere in the public API to do that when I tried to understand the raw content format recently. You describe a paper for lempel-ziv dictionary construction here:

//! Implemented following the paper "Effective construction of
//! Relative Lempel-Ziv Dictionaries", by Kewen Liao, Matthias Petri,
//! Alistair Moffat, and Anthony Wirth

But the document from the zstd repo doesn't mention lempel or ziv at all? https://github.com/facebook/zstd/blob/dev/doc/zstd_compression_format.md#dictionary-format

Content: The rest of the dictionary is its content. The content act as a "past" in front of data to compress or decompress, so it can be referenced in sequence commands. As long as the amount of data decoded from this frame is less than or equal to Window_Size, sequence commands may specify offsets longer than the total length of decoded output so far to reference back to the dictionary, even parts of the dictionary with offsets larger than Window_Size. After the total output has surpassed Window_Size however, this is no longer allowed and the dictionary is no longer accessible.

This seems to be very very specific to the zstd implementation, and not remotely about data frequency. Dictionaries based on the data itself would seem to refer to the FSE/huffman entropy tables produced by zstd --train, which you do not attempt to do. The zstd library makes it very clear that you can provide literally anything in a raw content dictionary. They also have this warning:

If a dictionary is provided by an external source, it should be loaded with great care, its content considered untrusted.

I can't find any indication that this library has addressed that very explicit warning about safety and untrusted input. The decoding errors are of course completely undocumented: https://docs.rs/ruzstd/latest/ruzstd/decoding/errors/index.html.
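
To make the "literally anything" point concrete, here is a minimal sketch (assuming the `zstd` bindings crate, with made-up sample data) of an arbitrary byte blob being used as a raw content dictionary for a round trip:

use std::io::{Read, Write};

fn main() -> std::io::Result<()> {
    // Any bytes at all are accepted as a "raw content" dictionary; this blob
    // merely shares substrings with the payload (both are invented for the example).
    let raw_dict: &[u8] = br#"{"login":"","id":0,"site_admin":false}"#;
    let payload: &[u8] = br#"{"login":"octocat","id":1,"site_admin":false}"#;

    // Compress with the blob loaded as a dictionary...
    let mut compressed = Vec::new();
    let mut enc = zstd::Encoder::with_dictionary(&mut compressed, 3, raw_dict)?;
    enc.write_all(payload)?;
    enc.finish()?;

    // ...and decompress with the same blob.
    let mut dec = zstd::Decoder::with_dictionary(&compressed[..], raw_dict)?;
    let mut roundtripped = Vec::new();
    dec.read_to_end(&mut roundtripped)?;
    assert_eq!(roundtripped, payload);
    Ok(())
}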

Your methodology for the github-users benchmark is a textbook example of fabricated research. Let's go line by line:

On the github-users sample set,

As mentioned, the github-users sample set is not described, nor linked to, nor available by any means. It is actively hidden by the facebook zstd repo for some reason. Despite this, I was able to obtain it.

our implementation benchmarks within 0.2% of the official implementation

We'll come back to this claim.

(as of commit 09e52d07340acdb2e13817b066e8be6e424f7258):

I can't find this commit in your repo or in the upstream zstd repo. git fetch ... 09e52d07340acdb2e13817b066e8be6e424f7258 fails. Again we find this number is unreproducible.

uncompressed: 100.00% (7484607 bytes)

This is the sum of all the individual json files after tar extraction, which is a very different thing from compression alone. When github_users.tar.zst is provided, zstd is decompressing the tar archive as a stream, not as 9114 distinct json documents. So this is measuring a very specific use case and not simply decompressing a tar file.

no dict: 34.99% of original size (2618872 bytes)

The compression settings weren't provided. Using zstd -19 I get closer to 32%.

reference dict: 16.16% of no dict size (2195672 bytes smaller)

"of no dict size"? Were you compressing the result of the first compression? This is RIDICULOUSLY misleading. This of course makes the numbers smaller. When you give % of original size (the % that the zstd cli provides) and then switch to % of a smaller number, it looks like you're trying to deceive people. This particular trick is quite literally an example of academic dishonesty that would at the very least get a paper retracted after publication.

our dict: 16.28% of no dict size (2192400 bytes smaller)

I still have no clue how you generated the "reference dictionary", given that to my knowledge zstd has no API to generate a raw content dictionary. Is there a benchmarking tool in their repo I'm missing?

Now here's my benchmark:

# cosmicexplorer@terrestrial-gamma-ray-flash: ~/tools/zstd 14:50:56
; zstd --train --filelist gh-json.txt -o gh-json-text-noargs.dict 
# cosmicexplorer@terrestrial-gamma-ray-flash: ~/tools/zstd 14:51:04
; find github -name '*.json' -type f > gh-json.txt
# cosmicexplorer@terrestrial-gamma-ray-flash: ~/tools/zstd 14:51:17 
; zstd --train --filelist gh-json.txt -o gh-json-text-noargs.dict 
Trying 5 different sets of parameters                                          
k=1024                                                                         
d=8
f=20
steps=4
split=75
accel=1
Save dictionary of size 112640 into file gh-json-text-noargs.dict 
# cosmicexplorer@terrestrial-gamma-ray-flash: ~/tools/zstd 14:53:24 
; rm -rf gh-trained && mkdir -v gh-trained && zstd --filelist gh-json.txt -D gh-json-text-noargs.dict --output-dir-mirror gh-trained --verbose --threads=0 
mkdir: created directory 'gh-trained'
*** Zstandard CLI (64-bit) v1.5.7, by Yann Collet ***
Note: 8 physical core(s) detected 
github/user.xZXXt8FeFI.json :  9.55%   (   869 B =>     83 B, gh-trained/github/user.xZXXt8FeFI.json.zst) 
# ... [output omitted] ...
github/user.SHuIWPzlXT.json :  9.90%   (   869 B =>     86 B, gh-trained/github/user.SHuIWPzlXT.json.zst) 
9114 files compressed : 10.17% (  7.14 MiB =>    743 KiB) 

Using only the defaults from the CLI, that's 10.17%. Again, let's review the claim on the front page of your repo:

On the github-users sample set, our implementation benchmarks within 0.2% of the official implementation (as of commit 09e52d0):

Where is 0.2% from? Did you just subtract 16.28% - 16.16%? I won't accuse you of intentional academic dishonesty for this, but that is not how percentages work.

The dictionary generator only provides support for creating “raw content” dictionaries. Tagged dictionaries are currently unsupported.

You know for a damn fact that 99% of users have no clue what this means, since you don't clarify that zstd --train produces "tagged dictionaries" and "raw content" is merely a curiosity and the format is not actually specified.

Did you find a resource that describes the raw content format as lempel-ziv encoding? Because the zstd dictionary format document very clearly states that raw content has no format. Is there no existing implementation of lempel-ziv window encoding in rust? Because it really seems like you just put in that citation to intentionally deceive your users about having implemented the actually useful form of dictionary encoding, the one that zstd --train does, while knowing that literally any content will suffice when used in a raw content dictionary.

I'd love to see a comparison to randomly generated bytes. Here's an example of that in the zip file crate with all the tricks I've developed: https://github.com/zip-rs/zip2/blob/c864a14035439154187aa6a6c894c7031e9087aa/benches/merge_archive.rs#L20

fn generate_random_archive(
    num_entries: usize,
    entry_size: usize,
    options: SimpleFileOptions,
) -> ZipResult<(usize, ZipArchive<Cursor<Vec<u8>>>)> {
    let buf = Cursor::new(Vec::new());
    let mut zip = ZipWriter::new(buf);

    let mut bytes = vec![0u8; entry_size];
    for i in 0..num_entries {
        let name = format!("random{i}.dat");
        zip.start_file(name, options)?;
        getrandom::fill(&mut bytes).unwrap();
        zip.write_all(&bytes)?;
    }

    let buf = zip.finish()?.into_inner();
    let len = buf.len();

    Ok((len, ZipArchive::new(Cursor::new(buf))?))
}

We also compare against different types of inputs, like Shakespeare, or video files. Now let's see your repo's benchmark: https://github.com/KillingSpark/zstd-rs/blob/master/ruzstd/benches/decode_all.rs

It looks like you choose exactly one input file to compare against. You also include the file as a slice and use the frame encoder, which is going to provide a better number than the streaming encoder one would normally expect to use from a .zst file. But curiously, your fuzzing only tests the streaming decoder (https://github.com/KillingSpark/zstd-rs/blob/master/ruzstd/fuzz/fuzz_targets/decode.rs). Why is that?

Your "round trip" for FSE is also extremely suspicious (

round_trip(data);
). Generally, a "round trip" test works because it very clearly generates the encoded data, then reproduces the decoded data, then compares the two. This makes it very easy to audit. But your round trip method "only for testing" has nothing of the sort:

    if data.len() < 2 {
        return;
    }
    if data.iter().all(|x| *x == data[0]) {
        return;
    }
    if data.len() < 64 {
        return;
    }

if data.iter().all(|x| *x == data[0]) is a ridiculous thing to be checking without any sort of comment, especially within the specific method used for fuzz testing. Why are you discarding these inputs? Does your input parsing code not provide a specific error case for invalid inputs? Does your input parsing code not handle these?

You need to have a distinct phase of input parsing so that you don't need these kinds of one-off checks cluttering up your encoding and decoding logic. That's how vulnerabilities happen. Here's an example of using proptest, one of my favorite crates: https://codeberg.org/cosmicexplorer/deep-link/src/commit/1ea3eba5d599d8c48ea56816b7103b12ee49d505/d-major/pathing/src/splitting/valid_path.rs#L305-L322

#[cfg(any(test, feature = "proptest"))]
impl Arbitrary for ValidPath<ffi::OsString> {
  type Parameters = (Option<usize>, Option<usize>);
  type Strategy = proptest::strategy::BoxedStrategy<Self>;
  #[cfg(unix)]
  fn arbitrary_with(args: (Option<usize>, Option<usize>)) -> Self::Strategy {
    use std::os::unix::ffi::OsStringExt;
    let (min_len, max_len) = args;
    proptest::string::bytes_regex(&format!(
      r"(?s-u:[^\x00]{{{},{}}})",
      min_len.unwrap_or(0),
      max_len.unwrap_or(500),
    ))
    .unwrap()
    .prop_map(|s| unsafe { Self::new_unchecked(ffi::OsString::from_vec(s)) })
    .boxed()
  }
}

Also, it completely defeats the purpose of fuzzing if you use fixed parameters like the number 22 on this line.

fse_encoder::build_table_from_data(data.iter().copied(), 22, false),

Please take a look at my fuzzing PR for the zip crate for how to do this right: zip-rs/zip2#437

Finally I'm just flummoxed about this part:

This might be a problem if you are accepting user provided data. Frames can be REALLY big when decoded. If this is the case you should either check how big the frame actually is or use the memory efficient approach described below.
Use the StreamingDecoder and use a while loop to fill your buffer (see src/bin/zstd_stream.rs for an example).

What's the point of using rust at all if you're not going to provide a safe interface? Why on earth isn't the while loop approach in the library itself? Why isn't it the default approach for the streaming decoder?

I saw this crate included in the rust compiler's transitive dependencies after pulling from main today and I find almost all of this documentation to be incredibly deceptive, to a degree I cannot believe is purely accidental. I cannot BELIEVE the error codes aren't even documented. I just checked and it's apparently embedded in the standard library??????

I would love to help describe or explain further how to produce trustworthy benchmark and fuzzing processes for rust projects. I would be delighted to pair with you if you are genuinely interested in achieving scientific reproducibility guarantees for your benchmarks.

But the language around dictionary construction is impossible for me to interpret as anything except malicious deception. If that is not your intent, please please let me know and I would love to work with you.

But otherwise I will probably just make my own zstd implementation in rust, because it is legitimately frightening to me that the standard library in every one of my rust programs relies upon a library with essentially zero fuzzing coverage that actively attempts to deceive readers about its degree of support for dictionary generation.

Please help me understand why you provide numbers from a completely unreproducible benchmark on the main docs page, and why you change from measuring % of original to % of the no dict output, when the zstd cli prints the % of original in its output, and you would have had to intentionally use a separate calculation method to even calculate the % of no dict output. That is one of the most wildly deceptive paragraphs I have read in a long time.

@zleyyij
Contributor Author

zleyyij commented Dec 20, 2025

While I appreciate your passion, the tone with which you approached your feedback is not appreciated, and I will attempt to offer you more courtesy in my response. No malice was intended with this PR, and I would gain nothing from it anyway. It was purely a passion project. I'm not a maintainer of this project, so any success it brought the project would likely not go to me.

Raw Content vs State Dictionaries

When starting this project I needed to define ways to quantify the progress I made. I came to the conclusion that the main benchmark of success for this project would be how effective the inclusion of a dictionary created by this implementation was at improving compression ratio. Other possible benchmarks that I did not include were the speed of dictionary generation, the effectiveness of the dictionary per unit of size, and how the dictionary impacted compression/decompression speed. As this was a passion project done unpaid in my limited time as an undergraduate, I could not focus on everything.

I decided that it didn't matter to the end user whether raw content dictionaries or state based dictionaries were used, it's an internal implementation detail. Any distinction between the two in benchmarking was something that could be added down the line.

Later in your comment, you state this:

Dictionaries based on the data itself would seem to refer to the FSE/huffman entropy tables produced by ..., which you do not attempt to do.

I'm afraid I don't understand this; you'll need to clarify. The way I'm interpreting your statement, "data" is ambiguous, so it's impossible to provide a concise answer.

The Github Users Dataset

https://github.com/facebook/zstd?tab=readme-ov-file#the-case-for-small-data-compression
https://github.com/facebook/zstd/releases/tag/v1.1.3

I did not link the github-users dataset by mistake; I don't love that the original zstd project leaves it buried under the attachments of a minor release. That is something that should be corrected.

Lempel Ziv

The paper referenced is the same paper utilized by Yann Collet in the reference implementation.

https://github.com/facebook/zstd/blob/ae9f20ca2716f2605822ca375995b7d876389b64/lib/dictBuilder/cover.c#L11-L19

The paper's title is slightly misleading; it describes a generalized dictionary creation algorithm not tied to any particular format, and its relation to Lempel-Ziv is purely circumstantial.

Untrusted Input

You are correct, but words are cheap; submit PRs. Should you like to fund someone to contribute fixes for these issues, that is also an acceptable solution.

Research Fabrication

You state you were unable to find that commit; it may have been lost in the squashing.

It can be found here 09e52d0, and is the commit directly above this comment: #91 (comment)

Document Summation

Small documents benefit the most from the utilization of a dictionary. I documented my benchmarking process in a lackluster manner because I did not see the need to do it better.

The code for benchmarking is included in the commit you missed, albeit sloppily written and poorly documented. For reference, it is included below:

use std::fs::{self, File};
use std::io::{self, Read, Write};
use std::path::{Path, PathBuf};

struct BenchmarkResults {
    uncompressed_size: usize,
    nodict_size: usize,
    reference_size: usize,
    our_size: usize,
}

fn bench<P: AsRef<Path>>(input_path: P) -> BenchmarkResults {
    // Compression level used when compressing each file
    let compression_level = 22;

    // 1. Collect a path to every file in the directory into `file_paths`
    println!("[bench]: collecting list of input files");
    let mut file_paths: Vec<PathBuf> = Vec::new();
    let dir: fs::ReadDir = fs::read_dir(&input_path).expect("read input path");
    fn recurse_read(dir: fs::ReadDir, file_paths: &mut Vec<PathBuf>) -> Result<(), io::Error> {
        for entry in dir {
            let entry = entry?;
            if entry.file_type()?.is_dir() {
                recurse_read(fs::read_dir(entry.path())?, file_paths)?;
            } else {
                file_paths.push(entry.path());
            }
        }
        Ok(())
    }
    recurse_read(dir, &mut file_paths).expect("recursing over input dir");

    // 2. Create two dictionaries, one with our strategy, and one with theirs
    println!("[bench]: creating reference dict");
    let reference_dict =
        zstd::dict::from_files(file_paths.iter(), 112640).expect("create reference dict");
    let mut our_dict = Vec::with_capacity(112640);
    println!("[bench]: creating our dict");
    create_dict_from_dir(input_path, &mut our_dict, 112640).expect("create our dict");

    // 3. Open each file and compress it with no dict, the reference dict, and
    //    our dict, summing the output sizes
    let mut uncompressed_size: usize = 0;
    let mut nodict_size: usize = 0;
    let mut reference_size: usize = 0;
    let mut our_size: usize = 0;

    for (idx, path) in file_paths.iter().enumerate() {
        println!("[bench]: compressing file {}/{}", idx + 1, file_paths.len());
        let mut handle = File::open(path).expect("open input file");
        let mut data = Vec::new();
        handle.read_to_end(&mut data).expect("read input file");
        uncompressed_size += data.len();

        // Compress with no dict
        let nodict_output = zstd::encode_all(data.as_slice(), compression_level).unwrap();
        nodict_size += nodict_output.len();

        // Compress with the reference dict (a fresh encoder per file, since
        // `finish` consumes the encoder)
        let mut reference_output: Vec<u8> = Vec::with_capacity(128_000);
        let mut reference_encoder = zstd::Encoder::with_dictionary(
            &mut reference_output,
            compression_level,
            &reference_dict,
        )
        .unwrap();
        reference_encoder.multithread(8).unwrap();
        reference_encoder
            .write_all(data.as_slice())
            .expect("write to reference encoder");
        reference_encoder
            .finish()
            .expect("reference encoder finishes");
        reference_size += reference_output.len();

        // Compress with our dict
        let mut our_output: Vec<u8> = Vec::with_capacity(128_000);
        let mut our_encoder =
            zstd::Encoder::with_dictionary(&mut our_output, compression_level, &our_dict).unwrap();
        our_encoder.multithread(8).unwrap();
        our_encoder
            .write_all(data.as_slice())
            .expect("write to our encoder");
        our_encoder.finish().expect("our encoder finishes");
        our_size += our_output.len();
    }

    BenchmarkResults {
        uncompressed_size,
        nodict_size,
        reference_size,
        our_size,
    }
}

Misc

Did you find a resource that describes the raw content format as lempel-ziv encoding? Because the zstd dictionary format document very clearly states that raw content has no format. Is there no existing implementation of lempel-ziv window encoding in rust? Because it really seems like you just put in that citation to intentionally deceive your users about having implemented the actually useful form of dictionary encoding, the one that zstd --train does, while knowing that literally any content will suffice when used in a raw content dictionary.

Where is 0.2% from? Did you just subtract 16.28% - 16.16%? I won't accuse you of intentional academic dishonesty for this, but that is not how percentages work.

That percentage is the difference in compression ratios achieved by zstd-rs's implementation vs facebook's. Feel free to contribute a more detailed explanation.

Just because you're allowed to put anything in a raw content dictionary does not mean that it'll improve compression ratios meaningfully. The paper describes ways to find the most valuable content to include in a content dictionary.

You know for a damn fact that 99% of users have no clue what this means, since you don't clarify that zstd --train produces "tagged dictionaries" and "raw content" is merely a curiosity and the format is not actually specified.

That information can be removed; it was included with the intent of inviting contributors to implement it, but that's a poor place to put it.

I'd love to see a comparison to randomly generated bytes. Here's an example of that in the zip file crate with all the tricks I've developed: https://github.com/zip-rs/zip2/blob/c864a14035439154187aa6a6c894c7031e9087aa/benches/merge_archive.rs#L20

Feel free to add one.
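
For reference, a minimal sketch of such a comparison (assuming the `zstd` bindings crate and the rand crate; the file names, dictionary size, and compression level are illustrative) could look like this:

use std::io::Write;

use rand::RngCore;

fn compressed_size(payload: &[u8], dict: Option<&[u8]>, level: i32) -> std::io::Result<usize> {
    let mut out = Vec::new();
    let mut enc = match dict {
        Some(d) => zstd::Encoder::with_dictionary(&mut out, level, d)?,
        None => zstd::Encoder::new(&mut out, level)?,
    };
    enc.write_all(payload)?;
    enc.finish()?;
    Ok(out.len())
}

fn main() -> std::io::Result<()> {
    let payload = std::fs::read("sample.json")?; // illustrative input file
    let mut random_dict = vec![0u8; 112_640];
    rand::thread_rng().fill_bytes(&mut random_dict); // dictionary of random bytes
    let content_dict = std::fs::read("trained.dict")?; // dictionary built from sample data

    println!("no dict:      {} bytes", compressed_size(&payload, None, 19)?);
    println!("random dict:  {} bytes", compressed_size(&payload, Some(&random_dict), 19)?);
    println!("content dict: {} bytes", compressed_size(&payload, Some(&content_dict), 19)?);
    Ok(())
}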

You talk about the benchmarking methodologies and mention benchmarking against different media types. I did not write this code, so I cannot comment, but my understanding is that the maintainer tested locally and did not develop continuous benchmarking infrastructure. This is fine because it's a hobby project, done in their free time, and they can do whatever they please. Again, feel free to contribute said infrastructure.

The rest of the comment goes outside the scope of this PR, so I will not be addressing it here. If it's relevant to my work and is submitted in the correct location, I will happily provide context.

To reiterate, the tone of your comment is not productive, and it comes across as berating and demanding. I apologize if I came across as terse; opening my GitHub notifications to find a harshly worded 2000-word essay critiquing work done for fun was not pleasant and negatively impacted my outlook.

Best,
James

@zleyyij
Contributor Author

zleyyij commented Dec 20, 2025

I forgot to include this: if you open issues and tag me in them, I'll try my best to fix them; this is just too much to filter through in one discussion :)

@KillingSpark
Owner

Hey, just wanted to let you know I am aware of this; I'll address some of the concerns tomorrow. I am currently visiting family, but I want to at least follow up on some of them.

I'll likely also put some of this in separate issues, since there are many different parts to the comment that concern parts of the library that do not have anything to do with this PR.

I must agree with the PR author: the tone at the end of this comment got... out of order. I would be lying if I said I never lost myself in a rant, but I will also ask that we keep this more civil in the future.

@KillingSpark
Owner

KillingSpark commented Dec 22, 2025

Right, couldn't wait to answer this late at night after reading this fully...

All in all I must say, the language you are using @cosmicexplorer is very unfriendly, condescending in some parts, accusatory in others. And I'm not going to engage with you any further if this doesn't change.

@zleyyij answered a lot of this already, so I'll try not to repeat points they already made. Thanks @zleyyij for the well-written response; not everyone would have reacted in such a calm manner. I certainly didn't manage it as well as you did, but I hope this is still acceptable.


I'll answer the parts I deemed important, quoting them. If you have more questions, please open separate issues for separate questions, and reconsider how you want to treat me and the other contributors to this crate.

We also compare against different types of inputs, like Shakespeare, or video files. Now let's see your repo's benchmark: https://github.com/KillingSpark/zstd-rs/blob/master/ruzstd/benches/decode_all.rs

It looks like you choose exactly one input file to compare against.

The benchmark you linked to is used to compare different versions of ruzstd against each other. Would a more diverse benchmarking harness be better? Almost certainly. You're free to write one. This was sufficient to guide me through some optimizations, which was the use case for writing this benchmark.

You also include the file as a slice and use the frame encoder, which is going to provide a better number than the streaming encoder one would normally expect to use from a .zst file. But curiously, your fuzzing only tests the streaming decoder (https://github.com/KillingSpark/zstd-rs/blob/master/ruzstd/fuzz/fuzz_targets/decode.rs). Why is that?

The streaming decoder (there is no streaming encoder...) uses the frame decoder internally. It's literally just a convenience wrapper around the frame decoder. This fuzzes both levels. If you do not trust this, feel free to use your own resources and time to fuzz this crate further. I'm happy to take contributions. I'm also happy to look into any fuzzing results you might find.

if data.iter().all(|x| *x == data[0]) is a ridiculous thing to be checking without any sort of comment, especially within the specific method used for fuzz testing. Why are you discarding these inputs? Does your input parsing code not provide a specific error case for invalid inputs? Does your input parsing code not handle these?

FSE encoding as implemented in zstd and ruzstd doesn't make sense for slices that only include one distinct symbol. This has to do with how the tables that encode the state transitions are defined. Just remove the check and step through the code; you'll see why. This isn't dangerous, by the way: if the calling code tries to do that, it will panic. It would also be stupid, because zstd supports RLE encoding on almost every level of the format, so any sensible encoder will not call this method with a slice that contains only one distinct symbol.
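
As an illustration only (not ruzstd's actual code), an encoder can branch to RLE before FSE table construction is ever reached, which is why single-symbol slices are excluded in the fuzz guard:

// Illustrative sketch: decide the block encoding before calling into FSE
fn choose_block_encoding(data: &[u8]) -> &'static str {
    match data {
        [] => "raw (empty input)",
        // one distinct symbol: run-length encoding, never FSE
        [first, rest @ ..] if rest.iter().all(|b| b == first) => "RLE",
        _ => "FSE/Huffman",
    }
}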

You need to have a distinct phase of input parsing so that you don't need these kinds of one-off checks cluttering up your encoding and decoding logic. That's how vulnerabilities happen.

You'll notice that this kind of guard is just in place to stop the fuzzing from producing noisy results for known "bad" inputs. This has nothing to do with the encoding/decoding logic. The encoder of course does these checks before calling the FSE encoding, because it can just use RLE encoding instead and, as mentioned above, FSE does not work for such inputs.

This has nothing to do with an "input parsing phase". This is literally just avoiding inputs that don't make sense and should, and also will, panic. This is, if anything, a lack of documentation of some internal functions and their preconditions.

Edit
I also don't understand how you think this has anything to do with parsing the inputs. The compressor literally takes any slice of symbols, except slices that only contain one distinct symbol, and compresses them. The round trip is compress -> decompress, not the other way around. Is this just an artifact of this becoming a rant rather than a constructive piece of work?
/Edit

Also, it completely defeats the purpose of fuzzing if you use fixed parameters like the number 22 on this line.

Do you even know what that number 22 does in this line? It seems like you don't. The necessary number of bits depends on the number of distinct symbols in the data. Choosing it to be optimal for the random input would actually hit fewer interesting code paths than keeping it constant and having it sometimes be optimal and sometimes suboptimal. Choosing it big enough to suit every input avoids false positive fuzzing results.

Finally I'm just flummoxed about this part:

This might be a problem if you are accepting user provided data. Frames can be REALLY big when decoded. If this is the case you should either check how big the frame actually is or use the memory efficient approach described below. Use the StreamingDecoder and use a while loop to fill your buffer (see src/bin/zstd_stream.rs for an example).

What's the point of using rust at all if you're not going to provide a safe interface?

Do you know what "safe" means in Rust? I can literally write Vec::with_capacity(usize::MAX) and watch my process die. That's the same risk you run by trying to decode untrusted inputs completely in memory, which neither I nor anyone else can protect you from. All I can do is warn you about it, which I did.

Why on earth isn't the while loop approach in the library itself? Why isn't it the default approach for the streaming decoder?

It literally is. The options are: decode everything fully in memory, or loop and decode a chunk at a time. The frame decoder gives you more control over the specifics; the streaming decoder helps you do it more conveniently. And the most convenient method, which you should only use for known-good inputs, is to just do it all at once in memory.
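
For readers, a minimal sketch of the chunked approach (assuming ruzstd's StreamingDecoder, which implements std::io::Read; the module path and buffer size may differ between versions):

use std::io::{Read, Write};

fn decode_bounded<R: Read, W: Write>(compressed: R, mut sink: W) -> std::io::Result<u64> {
    // StreamingDecoder parses the frame header up front and then yields
    // decompressed bytes through the Read trait.
    let mut decoder =
        ruzstd::decoding::StreamingDecoder::new(compressed).expect("valid zstd frame");
    let mut chunk = [0u8; 16 * 1024]; // fixed-size scratch buffer bounds memory use
    let mut total = 0u64;
    loop {
        let n = decoder.read(&mut chunk)?;
        if n == 0 {
            break; // frame fully decoded
        }
        sink.write_all(&chunk[..n])?;
        total += n as u64;
    }
    Ok(total)
}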

I saw this crate included in the rust compiler's transitive dependencies after pulling from main today and I find almost all of this documentation to be incredibly deceptive, to a degree I cannot believe is purely accidental. I cannot BELIEVE the error codes aren't even documented. I just checked and it's apparently embedded in the standard library??????

It's used for reading zstd-compressed metadata for backtraces, yes. If it is any consolation, they use an API that ensures only a very specific amount of memory will be used. Also: neither I nor anyone else working on this crate pushed for this. The std devs stepped up, contributed the necessary magic to the Cargo.toml, and used it as a dependency.

But otherwise I will probably just make my own zstd implementation in rust, because it is legitimately frightening to me that the standard library in every one of my rust programs relies upon a library with essentially zero fuzzing coverage that actively attempts to deceive readers about its degree of support for dictionary generation.

Go ahead, it's a free world. I would be happy to have another person working on this crate, though. It is, however, wrong to claim this: "with essentially zero fuzzing coverage". There is fuzzing in place for the decoding; it exercises pretty much all parts of the decoding API.
