- 
                Notifications
    
You must be signed in to change notification settings  - Fork 266
 
Use memchr to search for characters to escape #664
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| 
           I tried this a while back with   | 
    
| 
           Yeah, I actually just posted a PR to jetscii, and found that even for cases that needed multiple memchr calls, memchr seemed to be faster, and based on the fact that the benchmarks were xml-related, came here to try. 😄  | 
    
| 
           Yes,  I've seen some interesting ideas about using lookup tables and/or avx256 to get really good performance but haven't really investigated further.  | 
    
I believe this is quite a bit faster, because rust only has to verify that each string slicing operation starts/ends at character boundaries, none of the inner bytes need to be checked for UTF8-ness since they're already from a `&str` Also, when initially creating the escaped string, preallocate a little extra room, since we know the string will grow.
56841c9    to
    dcb2104      
    Compare
  
    | 
           I was also playing with a bitset lookup table in master...Dr-Emann:quick-xml:bitset_escape  | 
    
          Codecov Report
 ❗ Your organization needs to install the Codecov GitHub app to enable full functionality. @@            Coverage Diff             @@
##           master     #664      +/-   ##
==========================================
+ Coverage   64.63%   64.68%   +0.05%     
==========================================
  Files          36       37       +1     
  Lines       17289    17618     +329     
==========================================
+ Hits        11175    11397     +222     
- Misses       6114     6221     +107     
 Flags with carried forward coverage won't be shown. Click here to find out more. 
  | 
    
          
 How did that go? Do you have results for that?  | 
    
| 
           It looks like using a bitmap is better than master, but this PR is better than both nearly across the board, at least on my M1 mac. Note, the results for this PR are a little better than what is in the original PR description, thanks to dcb2104. master vs bitmap on M1 macbitmap vs this PR, M1 macCombined results | 
    
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some benchmarks shows regression about 2% in speed on two of my machines (Ubuntu 22.04 and Windows 7), mostly *_no_copy benchmarks, but other shows improvements about 2-7%. In summary, there are more improvements than regressions, so I think we can merge this.
| pub(crate) struct MergeIter<It1, It2> | ||
| where | ||
| It1: Iterator, | ||
| It2: Iterator<Item = It1::Item>, | ||
| { | ||
| it1: Peekable<It1>, | ||
| it2: Peekable<It2>, | ||
| } | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be great if you add several tests to check that iterator work as expected
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added some unit tests
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice! @dralley , feel free to merge when you ready.
e6889ff    to
    c6ec23a      
    Compare
  
    | 
           My results are quite a bit more mixed, on an R5 3600 CPU. I'm actually seeing a few more whole-document regressions than improvements (but mostly it's just a wash). Generally speaking I'm seeing more of a penalty than you are on short strings and less of a benefit on long strings. I'm going to keep looking at this tomorrow, do a bit more testing, and paste my results.  | 
    
| 
           Note, things might get even muddier: There are likely improvements to   | 
    
| 
           Great to hear! Thanks for investigating this I'll try out my PR with your jetscii changes tomorrow and see if it changes things  | 
    
| 
           R5 3600 i7-8665U This is a bit of a complex decision. I don't think jetscii does anything on ARM, so the mergeiter approach is probably still a winner there. And in many cases it's the winner against jetscii on x86, but it might be completely document dependent. Short strings with no escaping is a very very common case in a lot of documents and maybe enough so that jetscii would come out ahead. mergeiter actually regresses from baseline there on x86, possibly by enough to end up worse overall? We should probably test this somehow, but the macrobenchmarks that currently exist, aren't doing any "escaping" at all, just unescaping. I assume that is why those benchmarks show practically zero difference. We should probably create a benchmark that parses those documents into an event stream (outside the benchmark) and then writes them back to a buffer, in such a way that the escaping / construction cost is captured.  | 
    
| 
           Some ideas: 
 I think that's probably all I've got. Daniel Lemire's blog might be worth checking out.  | 
    
Per suggestion from @BurntSushi [here](tafia/quick-xml#664 (comment)) On my M1, tt appears to be slower but competitive with memchr up to memchr3, then start being the from 5-16
| 
           The Teddy results are pretty promising looking in the jetscii benchmarks for me (faster than merging memchrs), although it does seem a fair bit slower for small haystacks, so we might need to do one algorithm for a small haystacks (or CPUs teddy doesn't support), and use teddy for longer ones  | 
    
Per suggestion from @BurntSushi [here](tafia/quick-xml#664 (comment)) On my M1, tt appears to be slower but competitive with memchr up to memchr3, then start being the from 5-16
Use memchr iterators to search for characters to escape.
In some (most) cases, we need to combine multiple memchr searches (since memchr allows searching for up to 3 chars at the same time), so introduce a
MergeItertype, which takes two iterators, and combines them in order.This appears to be better for performance in almost all cases,
except it's very slightly slower for escaping small strings with escapes presentthanks to dcb2104, it's actually faster even in that case. However, it's MUCH faster when there are no characters to escape, or when escaping a long string.Only results shown with >1% change, reported by
critcmpResults for M1 Pro Macbook
Results for x64 i5-6600K Windows