Not using memchr when parsing?
#398
Hello @BurntSushi, out of curiosity I was wondering why your parser doesn't use `memchr` when parsing.
IIRC, I tried using `memchr` because of a similar inclination when I wrote the parser. I can't remember the results, but IIRC, it improved some cases (CSV data with a lot of large fields) but regressed other cases (CSV data with short fields). I perceive the latter as more common. That is, while some CSV data might have a column or two with longer fields, I think most fields are very short (e.g., numbers). In this case, the overhead of starting and stopping `memchr` on each field ends up being slower.

That's generally consistent with my understanding of `memchr`. That is, `haystack.iter().position(|&b| b == needle)` is likely to be a hair faster on very tiny haystacks because there is less startup overhead.
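A minimal sketch of that trade-off (not from the thread; the function names and the comma-only scan with no quoting are illustrative assumptions, using the `memchr` crate):

```rust
use memchr::memchr;

/// Byte-at-a-time scan for the next delimiter. The tight loop has
/// essentially no startup cost, so it tends to win on tiny fields.
fn field_end_scalar(input: &[u8]) -> usize {
    input.iter().position(|&b| b == b',').unwrap_or(input.len())
}

/// memchr-based scan. Its vectorized inner loop wins on long fields,
/// but each call pays a fixed setup cost, which dominates when the
/// typical field is only a few bytes long.
fn field_end_memchr(input: &[u8]) -> usize {
    memchr(b',', input).unwrap_or(input.len())
}

fn main() {
    // Short numeric fields: the common case described above, where
    // restarting memchr on every field is mostly overhead.
    let mut rest: &[u8] = b"3,14,1592,65,35";
    while !rest.is_empty() {
        let end = field_end_memchr(rest); // compare with field_end_scalar
        println!("{}", String::from_utf8_lossy(&rest[..end]));
        rest = &rest[(end + 1).min(rest.len())..];
    }
}
```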
There are perhaps more speculative optimizations. Like, you could maybe run `memchr` …

This is a good example of how it can be difficult to optimize a general-purpose library. If you know what your data looks like (i.e., you know your CSV data has tons of large fields with little escaping), then you can write something simpler than this library that parses it more quickly.
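To make that last point concrete, here is a hedged sketch of the kind of special-purpose splitter such a user could write. It assumes the data contains no quoting or escaping at all, an assumption a general CSV parser cannot make (`split_unquoted` is a hypothetical name, not part of the csv crate):

```rust
use memchr::memchr;

/// Split one record into fields, assuming no quotes and no escaped
/// delimiters anywhere in the data. With that guarantee, every field
/// boundary is just the next raw delimiter byte, so memchr can run
/// uninterrupted over each remainder of the line.
fn split_unquoted(line: &[u8], delim: u8) -> Vec<&[u8]> {
    let mut fields = Vec::new();
    let mut rest = line;
    loop {
        match memchr(delim, rest) {
            Some(i) => {
                fields.push(&rest[..i]);
                rest = &rest[i + 1..];
            }
            None => {
                fields.push(rest);
                return fields;
            }
        }
    }
}

fn main() {
    for field in split_unquoted(b"alpha,beta,gamma", b',') {
        println!("{}", String::from_utf8_lossy(field));
    }
}
```

On data that really is all large, unquoted fields, this does strictly less work per byte than a general parser, which is exactly the trade-off the reply describes.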