-
Notifications
You must be signed in to change notification settings - Fork 24
Description
Proposal
Problem statement
There's an inconsistency in the standard library's methods for converting a byte slice into a string slice:
| Type | Unchecked Conversion | Checked Conversion |
|---|---|---|
&str |
str::from_utf8_unchecked |
str::from_utf8 |
&mut str |
str::from_utf8_unchecked_mut |
str::from_utf8_mut |
Box<str> |
str::from_boxed_utf8_unchecked |
none |
While workarounds are possible in both safe and unsafe code, the inconsistency can be a source of confusion, and the workarounds are clunkier than they need to be.
The pull request that added str::from_boxed_utf8_unchecked originally was going to add str::from_boxed_utf8 as well, but did not, as it would drop the allocation. With that concern in mind, instead of returning Utf8Error, str::from_boxed_utf8 could return a FromUtf8Error, as that preserves the allocation.
Motivating examples or use cases
I recently was trying to implement a function to load a struct containing a Box<str> from a file, and was able to read it as a Box<[u8]>. In order to convert the byte slice into a string slice without rolling my own UTF-8 validation, I was trying to find a checked equivalent to str::from_boxed_utf8_unchecked, and was surprised that I didn't find one.
My workaround was a bit clunky:
let label_text = if str::from_utf8(&raw_label_text).is_ok() {
// SAFETY: Already validated
unsafe { std::str::from_boxed_utf8_unchecked(raw_label_text) }
} else {
return Err(Error::NonUtf8Label(raw_label_text));
};Solution sketch
Add fn from_boxed_utf8() -> Result<Box<str>, {ERROR}> to alloc::str, where {ERROR} is either core::str::Utf8Error, or alloc::string::FromUtf8Error. My test implementation used the former type, but that was before I saw the concern about the dropped allocation.
If core::str::validations::run_utf8_validation were re-exported by core::str as part of str_internals, it could be a two-line function returning a Utf8Error, and only slightly more complex if returning a FromUtf8Error.
The Utf8Error version:
pub fn from_boxed_utf8(v: Box<[u8]>) -> Result<Box<str>, Utf8Error> {
core::str::run_utf8_validation(&v)?;
// SAFETY: validation succeeded.
Ok(unsafe { from_boxed_utf8_unchecked(v) })
}A FromUtf8Error version that avoids discarding the allocation could look like this, with the current FromUtf8Error definition:
pub fn from_boxed_utf8(v: Box<[u8]>) -> Result<Box<str>, FromUtf8Error> {
if let Err(error) = core::str::run_utf8_validation(&v) {
return Err(FromUtf8Error { bytes: Vec::from(v), error });
}
// SAFETY: validation succeeded.
Ok(unsafe { from_boxed_utf8_unchecked(v) })
}Given that #151822 is working on making FromUtf8Error support other bytes types (see quote in Alternatives), it might make sense to wait for that work to complete, in order to avoid the need to convert the Box<[u8]> into a Vec<u8> for the error if this approach is taken.
Alternatives
The first alternative I can think of is simply leaving things as they are. There are at least 2 relatively simple ways to convert a Box<[u8]> into a Box<str> with UTF-8 validation, without API changes, though one requires unsafe:
// returns `FromUtf8Error` for invalid bytes
String::from_utf8(Vec::from(boxed)).map(String::into_boxed_str)// returns `Utf8Error` for invalid bytes
str::from_utf8(&boxed)?;
Ok(unsafe { str::from_boxed_utf8_unchecked(boxed) })The second alternative was mentioned by @programmerjake on Zulip:
actually, #151822 is busy making FromUtf8Error generic over the bytes type, we could easily add:
Box: TryFrom<Box[u8], Error = FromUtf8Error<Box<[u8]>>>
Edit: To be clear, I think that the inconsistency is worth addressing for the sake of a more consistent API. Implementing TryFrom would certainly make for a more ergonomic safe conversion, and the current approaches I mentioned make the conversion possible already, but it's still an unexpected (minor) source of potential confusion.
Links and related work
GitHub issue comment explaining why str::from_boxed_utf8 wasn't added originally.
What happens now?
This issue contains an API change proposal (or ACP) and is part of the libs-api team feature lifecycle. Once this issue is filed, the libs-api team will review open proposals as capability becomes available. Current response times do not have a clear estimate, but may be up to several months.
Possible responses
The libs team may respond in various different ways. First, the team will consider the problem (this doesn't require any concrete solution or alternatives to have been proposed):
- We think this problem seems worth solving, and the standard library might be the right place to solve it.
- We think that this probably doesn't belong in the standard library.
Second, if there's a concrete solution:
- We think this specific solution looks roughly right, approved, you or someone else should implement this. (Further review will still happen on the subsequent implementation PR.)
- We're not sure this is the right solution, and the alternatives or other materials don't give us enough information to be sure about that. Here are some questions we have that aren't answered, or rough ideas about alternatives we'd want to see discussed.