splitting Vec<u8> of non unicode characters #928

bobi6666 · 2022-11-17T12:14:02Z

bobi6666
Nov 17, 2022

Hello, I have a Vec<u8> that may contain non-Unicode bytes, and I need to split it wherever there is a \0. If there is a \n or \r\n, the split should not stop there but continue. Can you please show me an example?

Answered by BurntSushi

BurntSushi · 2022-11-17T12:32:17Z

BurntSushi
Nov 17, 2022
Maintainer

I think you probably want to use bytes::Regex for this.

As for a specific example, I don't really understand what you're saying. Could you please provide some input and the desired output?

0 replies

bobi6666 · 2022-11-17T12:51:37Z

bobi6666
Nov 17, 2022
Author

I would just like an example of how to build a regex with Regex::new from the regex::bytes module that covers my question.

0 replies

BurntSushi · 2022-11-17T13:18:47Z

BurntSushi
Nov 17, 2022
Maintainer

It's right there in the link I gave you. :-) For example: https://docs.rs/regex/latest/regex/bytes/struct.Regex.html#method.find

0 replies

bobi6666 · 2022-11-17T13:33:27Z

bobi6666
Nov 17, 2022
Author

Yesterday I wanted to build with Unicode disabled, and when I did let re = regex::bytes::Regex::new(r"0x0").unwrap(); it didn't work.

2 replies

BurntSushi Nov 17, 2022
Maintainer

If you can't answer my original question

> Could you please provide some input and the desired output?

then I can't help you. I won't be responding further if your next comment doesn't answer that question. I apologize if I come across as rude, but this has been a common pattern with you in past discussions. They just go back-and-forth endlessly without my understanding what you actually want to do.

All I'm asking for is a simple example that shows the input and what you want your output to be. If that isn't something you know how to provide, then I'm sorry, but I just don't have the time to give the help required here.

BurntSushi Nov 17, 2022
Maintainer

This might help: https://jvns.ca/blog/good-questions/

(I linked to ESR's version of the same thing previously, but I forgot just how patronizing it was.)

bobi6666 · 2022-11-17T13:57:11Z

bobi6666
Nov 17, 2022
Author

https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=65826762b40d386ae495a1781305930f

2 replies

BurntSushi Nov 17, 2022
Maintainer

Thanks, but you still didn't answer my question! You didn't tell me what output you wanted. And the program you've given me has zero output.

Anyway, I took a guess: https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=d315997c5951b2aca3fd1e974fccbc2a

BurntSushi Nov 17, 2022
Maintainer

Notice that you don't even need a bytes::Regex because a NUL byte is valid UTF-8.
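As a rough illustration of that point, the sketch below checks that a lone NUL byte is valid UTF-8 and then splits a string on it. It uses the standard library's str::split as a stand-in rather than a regex, and the function name and sample text are hypothetical.

```rust
/// Splits a string slice at every NUL character, using only std.
/// (Hypothetical helper, not the code from the linked playground.)
fn split_str_on_nul(text: &str) -> Vec<&str> {
    text.split('\0').collect()
}

fn main() {
    // A single NUL byte is valid UTF-8: it encodes U+0000.
    assert!(std::str::from_utf8(&[0u8]).is_ok());
    // A byte such as 0xFF is not valid UTF-8, so data containing it
    // cannot be viewed as a `&str` and needs a byte-oriented API instead.
    assert!(std::str::from_utf8(&[0xFFu8]).is_err());

    let parts = split_str_on_nul("first\0second\r\nstill second\0third");
    println!("{:?}", parts);
}
```

This only applies when the whole input is valid UTF-8. If the input may contain arbitrary bytes, a byte-slice API is the safer choice.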

bobi6666 · 2022-11-17T14:14:12Z

bobi6666
Nov 17, 2022
Author

The output will later be a Vec<&[u8]>, but I am fine with that part. If your regex idea does not stop at non-UTF-16 characters, then you did what I wanted.

0 replies

bobi6666 · 2022-11-17T14:28:31Z

bobi6666
Nov 17, 2022
Author

I'll try to translate it so you can better understand what I need: in the example I sent you, there may be bytes that are not UTF-8 but may be UTF-16. I need to split them in such a way that the pieces can be collected at the end into a Vec<&[u8]> or Vec<Vec<u8>>. I believe that if they are split this way, the bytes will be preserved exactly as they are, and nothing will complain if they are not valid UTF-8.

1 reply

BurntSushi Nov 17, 2022
Maintainer

A &str can be converted to &[u8] via str::as_bytes.
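A small sketch of how that conversion could produce the two collection types mentioned earlier in the thread. It assumes the input is a valid &str and splits with std's str::split; both function names are hypothetical.

```rust
/// Splits on NUL and returns each piece as a byte slice borrowed from `text`.
fn split_to_byte_slices(text: &str) -> Vec<&[u8]> {
    text.split('\0').map(|piece| piece.as_bytes()).collect()
}

/// Same split, but copies each piece into its own owned vector.
fn split_to_owned(text: &str) -> Vec<Vec<u8>> {
    text.split('\0').map(|piece| piece.as_bytes().to_vec()).collect()
}

fn main() {
    let text = "one\0two\r\nstill two\0three";
    println!("{:?}", split_to_byte_slices(text));
    println!("{:?}", split_to_owned(text));
}
```

The borrowed version avoids copying but ties the result's lifetime to the input. The owned version copies every piece and can outlive it.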

bobi6666 · 2022-11-17T14:47:40Z

bobi6666
Nov 17, 2022
Author

Do you think this split will be faster than the normal split from std if the file I work with is big?

1 reply

BurntSushi Nov 17, 2022
Maintainer

Maybe. Benchmark it.
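One possible starting point for such a measurement is a std-only baseline timed with std::time::Instant. This is a rough sketch, not a rigorous benchmark: the data is synthetic, the function name split_std is hypothetical, and a regex-based variant would need to be timed the same way for comparison.

```rust
use std::time::Instant;

/// Baseline: split at NUL bytes with the standard library's `slice::split`.
fn split_std(data: &[u8]) -> Vec<&[u8]> {
    data.split(|&b| b == 0).collect()
}

fn main() {
    // Synthetic test data: 100,000 short records, each terminated by NUL.
    let data: Vec<u8> = b"record\r\nbody\0".repeat(100_000);

    let start = Instant::now();
    let pieces = split_std(&data);
    let elapsed = start.elapsed();

    println!("std split: {} pieces in {:?}", pieces.len(), elapsed);
}
```

Note that slice::split yields an empty final piece when the data ends with the separator. For meaningful numbers, build in release mode and run against data that resembles the real file.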

bobi6666 · 2022-11-24T13:06:35Z

bobi6666
Nov 24, 2022
Author

Hello, I want to ask whether aho_corasick supports processing non-UTF-8 bytes if I am using the replace_all_with method?

1 reply

BurntSushi Nov 24, 2022
Maintainer

Yes... Of course...
