ACP: Expose more Unicode casing data in libcore

# Proposal

## Problem statement

While Rust has fairly robust Unicode support, it only exposes casing data in a limited capacity, via methods like `char::is_lowercase` and `str::to_lowercase`. The standard library itself already contains data for the additional Unicode properties `Cased` and `Case_Ignorable`, but this data is not exposed publicly, so user code cannot reuse it to implement its own versions of methods like `to_lowercase` on custom string types.

Additionally, lowercase and uppercase alone are not enough to do proper case-insensitive matching: this requires case folding, which is entirely absent from the standard library. The compiler (mostly via its `clap` dependency) even brings in the external `unicase` crate to solve this problem.

## Motivating examples or use cases

As mentioned, the standard library *already* includes the `Cased` and `Case_Ignorable` property data in its own code, but does not expose this publicly. There would not be a substantial maintenance burden to expose `char::is_cased` and `char::is_case_ignorable` methods in libcore, since it's just a matter of offering a public API surface.

While case folding data isn't directly included in the standard library, it is structurally no different from the lowercase and uppercase mapping tables; it could easily be generated the same way and offered through a very similar API.

While this code isn't strictly required in the standard library and the ecosystem has done mostly fine with crates like `unicase`, the primary benefit of including this data in the standard library is to expose data that is mostly already used by the compiler and to offer a solution to people who are averse to the idea of adding new dependencies.

## Solution sketch

I'm going to separate this into a basic core of methods that I think should be added for this proposal, and a set of "stretch goal" methods which would be nice complements to these, but not strictly required.

The base methods:

```rust
impl char {
    // currently exported as unstable `core::unicode::Cased`
    // corresponds to unicode `Cased` property
    // is not equivalent to `is_lowercase() || is_uppercase()`:
    //   it also includes title-case ligature characters like Lj
    const fn is_cased(self) -> bool;

    // currently exported as unstable `core::unicode::Case_Ignorable`
    // corresponds to unicode `Case_Ignorable` property
    // indicates characters which are skipped over when determining the
    //   context for case mapping (e.g. for the final-sigma rule);
    //   is mostly used for implementing casing algorithms
    const fn is_case_ignorable(self) -> bool;

    // not included currently
    // represents full case-folding as defined by `CaseFolding.txt`
    // should use same code as `ToLowercase` and `ToUppercase`
    // note that Turkic mappings are excluded;
    //   they're excluded by default, and the mapping is only two characters, so
    //   anyone can trivially special-case those ones
    fn to_folded_case(self) -> ToFoldedCase;
}

impl str {
    // not included currently
    // equivalent to `chars().flat_map(char::to_folded_case).collect()`
    // analogue to `to_lowercase` and `to_uppercase`
    fn to_folded_case(&self) -> String;

    // not included currently
    // equivalent to `chars().flat_map(char::to_folded_case)`
    // analogue to ACP-accepted `lowercase_chars` and `uppercase_chars`
    fn folded_chars(&self) -> FoldedChars;
}

impl String {
    // not included currently
    // equivalent to `*self = self.to_folded_case()`
    // analogue to ACP-accepted `make_lowercase` and `make_uppercase`
    fn fold_case(&mut self);
}
```
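To illustrate why `is_cased` and `is_case_ignorable` matter for custom casing code, here is a minimal sketch of the Unicode `Final_Sigma` context check, which decides whether Σ lowercases to ς or σ. The name `is_final_sigma` and the two `toy_*` predicates are hypothetical stand-ins for illustration only; the toy predicates merely approximate the real properties and would be replaced by the proposed methods.

```rust
/// Toy stand-in for the proposed `char::is_cased`.
/// Assumption: it misses titlecase letters such as 'ǅ'.
fn toy_is_cased(c: char) -> bool {
    c.is_lowercase() || c.is_uppercase()
}

/// Toy stand-in for the proposed `char::is_case_ignorable`.
/// Assumption: the real property covers far more than the apostrophe.
fn toy_is_case_ignorable(c: char) -> bool {
    c == '\''
}

/// Returns true if `text[i]` is in "final sigma" position: preceded by a
/// cased letter (skipping case-ignorable characters) and not followed by one.
/// `i` must be a valid index into `text`.
fn is_final_sigma(
    text: &[char],
    i: usize,
    is_cased: impl Fn(char) -> bool,
    is_case_ignorable: impl Fn(char) -> bool,
) -> bool {
    // Scan backwards past case-ignorable characters; the first remaining
    // character, if any, must be cased.
    let cased_before = text[..i]
        .iter()
        .rev()
        .find(|&&c| !is_case_ignorable(c))
        .is_some_and(|&c| is_cased(c));
    // Same scan forwards; a cased character here means the sigma is not final.
    let cased_after = text[i + 1..]
        .iter()
        .find(|&&c| !is_case_ignorable(c))
        .is_some_and(|&c| is_cased(c));
    cased_before && !cased_after
}
```

With these toy predicates, the last Σ of "ΟΔΟΣ" counts as final, while the first Σ of "ΣΑΣ" does not.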

It would additionally be nice to make the case-conversion methods usable in const code, as a stretch goal:

```rust
impl char {
    // these methods are now made const;
    // they only weren't because they weren't useful as const before
    const fn to_lowercase(self) -> ToLowercase;
    const fn to_uppercase(self) -> ToUppercase;
    const fn to_folded_case(self) -> ToFoldedCase;
}

impl To{Lowercase,Uppercase,FoldedCase} {
    // effectively same as `fmt::Write` impl, analogue to `char::encode_utf8`
    // allows usage in const code before const traits,
    //   and can be used for implementing own case methods
    const fn encode_utf8(&self, buffer: &mut [u8; 12]) -> &mut str;

    // analogue to `char::len_utf8`
    const fn len_utf8(&self) -> usize;

    // analogue to `char::encode_utf16`
    const fn encode_utf16(&self, buffer: &mut [u16; 6]) -> &mut [u16];

    // analogue to `char::len_utf16`
    const fn len_utf16(&self) -> usize;
}
```
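As a rough illustration of what `encode_utf8` on these iterators would compute, here is a non-const sketch built only on today's stable API. The free function `encode_mapping_utf8` is a hypothetical name, and it assumes the mapping yields at most three characters (so at most 12 bytes of UTF-8); a longer iterator would panic on the slice index.

```rust
/// Writes every character produced by a case-mapping iterator into `buffer`
/// and returns the written prefix as a string slice.
fn encode_mapping_utf8(
    mapping: impl Iterator<Item = char>,
    buffer: &mut [u8; 12],
) -> &mut str {
    let mut len = 0;
    for c in mapping {
        // `char::encode_utf8` writes into the remaining space and returns
        // the encoded slice, whose length tells us how far to advance.
        len += c.encode_utf8(&mut buffer[len..]).len();
    }
    // Everything in `buffer[..len]` was produced by `char::encode_utf8`,
    // so it is valid UTF-8.
    core::str::from_utf8_mut(&mut buffer[..len]).expect("encoded chars are valid UTF-8")
}
```

For example, passing `'ß'.to_uppercase()` with a zeroed 12-byte buffer yields `"SS"`.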

It might also be useful to have title-case data, since the number of title-cased characters is small. However, this is less useful because many people will not want explicit title-case (for example, "This Title Is Title Case" should probably be "This Title is Title Case") and because title-casing is much more language-dependent.

Note that this uses Unicode's choice of "Titlecase" as one word instead of two separate words.

```rust
impl char {
    // note: `is_cased()` is now explicitly `is_lowercase() || is_uppercase() || is_titlecase()`
    // titlecase follows the unicode property `Titlecase_Letter`
    const fn is_titlecase(self) -> bool;

    // equivalent to `to_uppercase` for most characters,
    //   but different specifically for ligature characters
    // marking as const here in case we include earlier proposal
    const fn to_titlecase(self) -> ToTitlecase;
}

impl str {
    // would implement title-case algorithm, *and* include the final sigma rules
    fn to_titlecase(&self) -> String;
    fn titlecase_chars(&self) -> TitlecaseChars;
}

impl String {
    fn make_titlecase(&mut self);
}
```
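To make the difference between `to_uppercase` and `to_titlecase` concrete, here is a small sketch that special-cases the four Latin digraph triples. The function `to_titlecase_sketch` is a hypothetical name, and the fallback to `to_uppercase` is an assumption that is wrong for some characters (for example, the titlecase of `ß` is `Ss`, not `SS`), so this is not a complete mapping.

```rust
/// Sketch of a per-character titlecase mapping.
/// Only the digraphs Ǆ/ǅ/ǆ, Ǉ/ǈ/ǉ, Ǌ/ǋ/ǌ and Ǳ/ǲ/ǳ are handled specially;
/// everything else falls back to `char::to_uppercase`.
fn to_titlecase_sketch(c: char) -> String {
    match c {
        '\u{01C4}'..='\u{01C6}' => '\u{01C5}'.to_string(), // ǅ
        '\u{01C7}'..='\u{01C9}' => '\u{01C8}'.to_string(), // ǈ
        '\u{01CA}'..='\u{01CC}' => '\u{01CB}'.to_string(), // ǋ
        '\u{01F1}'..='\u{01F3}' => '\u{01F2}'.to_string(), // ǲ
        // Assumption: uppercase is a good enough approximation elsewhere.
        _ => c.to_uppercase().collect(),
    }
}
```

Here `to_titlecase_sketch('\u{01C6}')` returns `ǅ` (U+01C5), whereas `'\u{01C6}'.to_uppercase()` yields `Ǆ` (U+01C4).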

It might be nice to have an `eq_ignore_case` method for strings that uses case folding:

```rust
impl str {
    // uses case folding
    fn eq_ignore_case(&self, rhs: &str) -> bool;
}
```
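A minimal sketch of how such a comparison could work lazily, without allocating folded copies of both strings. Both function names are hypothetical, and `fold_sketch` is only a stand-in for the proposed `char::to_folded_case`: it special-cases `ß` and otherwise falls back to `to_lowercase`, which is not the full `CaseFolding.txt` mapping.

```rust
/// Stand-in for the proposed `char::to_folded_case`.
/// Assumption: only 'ß' is folded specially; all other characters use
/// `to_lowercase`, which differs from real case folding for some inputs.
fn fold_sketch(c: char) -> Vec<char> {
    if c == 'ß' {
        vec!['s', 's']
    } else {
        c.to_lowercase().collect()
    }
}

/// Compares two strings by their folded character sequences.
fn eq_ignore_case_sketch(a: &str, b: &str) -> bool {
    // `Iterator::eq` walks both folded streams in lockstep, so a character
    // that folds to several characters still lines up correctly.
    a.chars()
        .flat_map(fold_sketch)
        .eq(b.chars().flat_map(fold_sketch))
}
```

Under this sketch, `eq_ignore_case_sketch("Straße", "STRASSE")` is true even though the two strings have different lengths.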

The following comparison methods are omitted from the core proposal but might be useful to include; `cmp_ignore_case` would use case folding. I put them at the end since people will generally prefer to perform case folding once in advance rather than on demand for every comparison, and we may want to encourage that specifically.

```rust
// note that char::eq_ignore_case is absent,
//   since case conversions can expand to multiple characters

impl str {
    fn cmp_ignore_ascii_case(&self, rhs: &str) -> Ordering;
    fn cmp_ignore_case(&self, rhs: &str) -> Ordering;
}
```
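The ASCII variant can already be approximated on stable Rust, which gives a feel for the intended semantics. This is only a sketch under the assumption that ordering is defined bytewise over ASCII-lowercased bytes; `cmp_ignore_ascii_case_sketch` is a hypothetical free function, not the proposed method.

```rust
use std::cmp::Ordering;

/// Orders two strings bytewise after ASCII-lowercasing each byte.
/// Non-ASCII bytes are compared unchanged.
fn cmp_ignore_ascii_case_sketch(a: &str, b: &str) -> Ordering {
    a.bytes()
        .map(|byte| byte.to_ascii_lowercase())
        .cmp(b.bytes().map(|byte| byte.to_ascii_lowercase()))
}
```

Note that `"Zebra"` sorts after `"apple"` here, while the plain `str` ordering puts it first because `'Z'` is less than `'a'`.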

I'm also adding this one just because I wanted it myself. I will be surprised if it's actually accepted, but it technically is included in `unicase`:

```rust
impl str {
    fn hash_ignore_ascii_case<H: Hasher>(&self, state: &mut H);
    fn hash_ignore_case<H: Hasher>(&self, state: &mut H);
}
```
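For the ASCII case, a sketch on stable Rust might look like the following. The function name is hypothetical, and the choice of a `0xff` terminator byte is an assumption made here so that consecutive hashed strings do not run together; the actual proposal does not specify the byte stream.

```rust
use std::hash::Hasher;

/// Feeds the ASCII-lowercased bytes of `s` into `state`, so that strings
/// equal under `eq_ignore_ascii_case` produce the same hash.
fn hash_ignore_ascii_case_sketch<H: Hasher>(s: &str, state: &mut H) {
    for byte in s.bytes() {
        state.write_u8(byte.to_ascii_lowercase());
    }
    // 0xff never appears in valid UTF-8, so it works as an end marker.
    state.write_u8(0xff);
}
```

This keeps the usual contract between equality and hashing: any two strings for which `eq_ignore_ascii_case` returns true feed identical bytes to the hasher.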



## Alternatives



Right now, the alternatives already exist as crates on crates.io. The primary benefit of adding this to the standard library is that a lot of the work is already done for upper/lowercase mappings, and some of this data is already included but not exposed publicly. Additionally, adding case folding to the standard library will make people aware of its existence, rather than simply converting everything to lowercase or uppercase to compare, which is technically incorrect. (The simplest example is that `lower(ß) != lower(SS)`, but `fold(ß) = fold(SS)`.)
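The lowercase half of that example can be checked with today's standard library. The helper name `lowercase_eq` is hypothetical and stands for the common "lowercase both sides" idiom:

```rust
/// The common but incomplete idiom: compare after lowercasing both sides.
fn lowercase_eq(a: &str, b: &str) -> bool {
    a.to_lowercase() == b.to_lowercase()
}
```

`lowercase_eq("Hello", "HELLO")` is true, but `lowercase_eq("ß", "SS")` is false, because `ß` lowercases to itself while `SS` lowercases to `ss`.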

## Links and related work



I'm mostly creating this issue because it hasn't really been discussed since before 1.0.

If desired, I can dredge up some of the suggestions I found, but there aren't many, and I'm not sure they're relevant.

## What happens now?

This issue contains an API change proposal (or ACP) and is part of the libs-api team [feature lifecycle]. Once this issue is filed, the libs-api team will review open proposals as capability becomes available. Current response times do not have a clear estimate, but may be up to several months.

[feature lifecycle]: https://std-dev-guide.rust-lang.org/development/feature-lifecycle.html

## Possible responses

The libs team may respond in various different ways. First, the team will consider the *problem* (this doesn't require any concrete solution or alternatives to have been proposed):

- We think this problem seems worth solving, and the standard library might be the right place to solve it.
- We think that this probably doesn't belong in the standard library.

Second, if there's a concrete solution:

- We think this specific solution looks roughly right, approved, you or someone else should implement this. (Further review will still happen on the subsequent implementation PR.)
- We're not sure this is the right solution, and the alternatives or other materials don't give us enough information to be sure about that. Here are some questions we have that aren't answered, or rough ideas about alternatives we'd want to see discussed.
