Skip to content

ACP: add str::from_boxed_utf8 for API consistency #737

@eliminmax

Description

@eliminmax

Proposal

Problem statement

There's an inconsistency in the standard library's methods for converting a byte slice into a string slice:

Type Unchecked Conversion Checked Conversion
&str str::from_utf8_unchecked str::from_utf8
&mut str str::from_utf8_unchecked_mut str::from_utf8_mut
Box<str> str::from_boxed_utf8_unchecked none

While workarounds are possible in both safe and unsafe code, the inconsistency can be a source of confusion, and the workarounds are clunkier than they need to be.

The pull request that added str::from_boxed_utf8_unchecked originally was going to add str::from_boxed_utf8 as well, but did not, as it would drop the allocation. With that concern in mind, instead of returning Utf8Error, str::from_boxed_utf8 could return a FromUtf8Error, as that preserves the allocation.

Motivating examples or use cases

I recently was trying to implement a function to load a struct containing a Box<str> from a file, and was able to read it as a Box<[u8]>. In order to convert the byte slice into a string slice without rolling my own UTF-8 validation, I was trying to find a checked equivalent to str::from_boxed_utf8_unchecked, and was surprised that I didn't find one.

My workaround was a bit clunky:

let label_text = if str::from_utf8(&raw_label_text).is_ok() {
    // SAFETY: Already validated
    unsafe { std::str::from_boxed_utf8_unchecked(raw_label_text) }
} else {
    return Err(Error::NonUtf8Label(raw_label_text));
};

Solution sketch

Add fn from_boxed_utf8() -> Result<Box<str>, {ERROR}> to alloc::str, where {ERROR} is either core::str::Utf8Error, or alloc::string::FromUtf8Error. My test implementation used the former type, but that was before I saw the concern about the dropped allocation.

If core::str::validations::run_utf8_validation were re-exported by core::str as part of str_internals, it could be a two-line function returning a Utf8Error, and only slightly more complex if returning a FromUtf8Error.

The Utf8Error version:

pub fn from_boxed_utf8(v: Box<[u8]>) -> Result<Box<str>, Utf8Error> {
    core::str::run_utf8_validation(&v)?;
    // SAFETY: validation succeeded.
    Ok(unsafe { from_boxed_utf8_unchecked(v) })
}

A FromUtf8Error version that avoids discarding the allocation could look like this, with the current FromUtf8Error definition:

pub fn from_boxed_utf8(v: Box<[u8]>) -> Result<Box<str>, FromUtf8Error> {
    if let Err(error) = core::str::run_utf8_validation(&v) {
        return Err(FromUtf8Error { bytes: Vec::from(v), error });
    }
    // SAFETY: validation succeeded.
    Ok(unsafe { from_boxed_utf8_unchecked(v) })
}

Given that #151822 is working on making FromUtf8Error support other bytes types (see quote in Alternatives), it might make sense to wait for that work to complete, in order to avoid the need to convert the Box<[u8]> into a Vec<u8> for the error if this approach is taken.

Alternatives

The first alternative I can think of is simply leaving things as they are. There are at least 2 relatively simple ways to convert a Box<[u8]> into a Box<str> with UTF-8 validation, without API changes, though one requires unsafe:

// returns `FromUtf8Error` for invalid bytes
String::from_utf8(Vec::from(boxed)).map(String::into_boxed_str)
// returns `Utf8Error` for invalid bytes
str::from_utf8(&boxed)?;
Ok(unsafe { str::from_boxed_utf8_unchecked(boxed) })

The second alternative was mentioned by @programmerjake on Zulip:

actually, #151822 is busy making FromUtf8Error generic over the bytes type, we could easily add:
Box: TryFrom<Box[u8], Error = FromUtf8Error<Box<[u8]>>>

Edit: To be clear, I think that the inconsistency is worth addressing for the sake of a more consistent API. Implementing TryFrom would certainly make for a more ergonomic safe conversion, and the current approaches I mentioned make the conversion possible already, but it's still an unexpected (minor) source of potential confusion.

Links and related work

GitHub issue comment explaining why str::from_boxed_utf8 wasn't added originally.

What happens now?

This issue contains an API change proposal (or ACP) and is part of the libs-api team feature lifecycle. Once this issue is filed, the libs-api team will review open proposals as capability becomes available. Current response times do not have a clear estimate, but may be up to several months.

Possible responses

The libs team may respond in various different ways. First, the team will consider the problem (this doesn't require any concrete solution or alternatives to have been proposed):

  • We think this problem seems worth solving, and the standard library might be the right place to solve it.
  • We think that this probably doesn't belong in the standard library.

Second, if there's a concrete solution:

  • We think this specific solution looks roughly right, approved, you or someone else should implement this. (Further review will still happen on the subsequent implementation PR.)
  • We're not sure this is the right solution, and the alternatives or other materials don't give us enough information to be sure about that. Here are some questions we have that aren't answered, or rough ideas about alternatives we'd want to see discussed.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions