Description
Proposal
Problem statement
While Rust has pretty robust Unicode support, it only offers casing data in a limited capacity, via methods like char::is_lowercase and str::to_lowercase. The standard library already contains data for the additional Unicode properties Cased and Case_Ignorable, but this data is not exposed publicly, so code cannot reuse it to implement its own versions of methods like to_lowercase on custom string types.
Additionally, lowercase and uppercase alone are not enough to do proper case-insensitive matching: this requires case folding, which is entirely absent from the standard library. The compiler (mostly via its clap dependency) even brings in the external unicase crate to solve this problem.
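As a minimal sketch of the problem, the following uses only methods that exist in the standard library today. The helper name `eq_via_lowercase` is made up for illustration; it is the naive comparison that case folding is meant to replace.

```rust
/// Naive case-insensitive comparison: lowercase both sides, then compare.
/// (Hypothetical helper, not part of any proposal.)
pub fn eq_via_lowercase(a: &str, b: &str) -> bool {
    a.to_lowercase() == b.to_lowercase()
}

/// Returns the two lowercased forms so the mismatch can be inspected.
pub fn lowercased_pair(a: &str, b: &str) -> (String, String) {
    (a.to_lowercase(), b.to_lowercase())
}
```

With these, `eq_via_lowercase("Hello", "hELLO")` holds, but `eq_via_lowercase("Straße", "STRASSE")` does not, because lowercasing leaves `ß` as it is while `SS` becomes `ss`.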
Motivating examples or use cases
As mentioned, the standard library already includes the Cased and Case_Ignorable property data in its own code, but does not expose this publicly. There would not be a substantial maintenance burden to expose char::is_cased and char::is_case_ignorable methods in libcore, since it's just a matter of offering a public API surface.
While case folding data isn't directly included in the standard library, it is no different from the lowercase and uppercase mapping tables, so it could easily be generated the same way and offered through a very similar API.
While this code isn't strictly required in the standard library and the ecosystem has done mostly fine with crates like unicase, the primary benefit of including this data in the standard library is to expose data that is mostly already used by the compiler and to offer a solution to people who are averse to the idea of adding new dependencies.
Solution sketch
I'm going to separate this into a basic core of methods that I think should be added for this proposal, and a set of "stretch goal" methods which would be nice complements to these, but not strictly required.
The base methods:
impl char {
// currently exported as unstable `core::unicode::Cased`
// corresponds to unicode `Cased` property
// is not equivalent to `is_lowercase() || is_uppercase()`:
// it also includes title-case ligature characters like Lj
const fn is_cased(self) -> bool;
// currently exported as unstable `core::unicode::Case_Ignorable`
// corresponds to unicode `Case_Ignorable` property
// indicates characters which are completely ignored when case mapping;
// is mostly used for implementing casing algorithms
const fn is_case_ignorable(self) -> bool;
// not included currently
// represents full case-folding as defined by `CaseFolding.txt`
// should use same code as `ToLowercase` and `ToUppercase`
// note that Turkic mappings are excluded;
// they're excluded by default, and the mapping is only two characters, so
// anyone can trivially special-case those ones
fn to_folded_case(self) -> ToFoldedCase;
}
impl str {
// not included currently
// equivalent to `chars().flat_map(char::to_folded_case).collect()`
// analogue to `to_lowercase` and `to_uppercase`
fn to_folded_case(&self) -> String;
// not included currently
// equivalent to `chars().flat_map(char::to_folded_case)`
// analogue to ACP-accepted `lowercase_chars` and `uppercase_chars`
fn folded_chars(&self) -> FoldedChars;
}
impl String {
// not included currently
// equivalent to `*self = self.to_folded_case()`
// analogue to ACP-accepted `make_lowercase` and `make_uppercase`
fn fold_case(&mut self);
}
As a stretch goal, it would additionally be nice to make the case-conversion methods usable in const code:
impl char {
// these methods are now made const;
// they only weren't because they weren't useful as const before
const fn to_lowercase(self) -> ToLowercase;
const fn to_uppercase(self) -> ToUppercase;
const fn to_folded_case(self) -> ToFoldedCase;
}
impl To{Lowercase,Uppercase,FoldedCase} {
// effectively same as `fmt::Write` impl, analogue to `char::encode_utf8`
// allows usage in const code before const traits,
// and can be used for implementing own case methods
const fn encode_utf8(&self, buffer: &mut [u8; 12]) -> &mut str;
// analogue to `char::len_utf8`
const fn len_utf8(&self) -> usize;
// analogue to `char::encode_utf16`
const fn encode_utf16(&self, buffer: &mut [u16; 6]) -> &mut [u16];
// analogue to `char::len_utf16`
const fn len_utf16(&self) -> usize;
}
Perhaps it would also be useful to have title-case data, since the number of title-cased characters is small. However, this is less useful because many people will not want explicit title-case (for example, "This Title Is Title Case" should probably be "This Title is Title Case") and because title-case is much more language-dependent.
Note that this uses Unicode's choice of "Titlecase" as one word instead of two separate words.
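To give a sense of how small the title-case data is, here is a rough sketch that hard-codes the code points I understand to be in general category Lt (Titlecase_Letter). The `sketch_` names are hypothetical, the list should be checked against UnicodeData.txt before being relied on, and a real implementation would use generated tables instead.

```rust
/// Sketch of a possible `char::is_titlecase`, assuming the Lt category
/// consists of exactly the code points listed here.
pub const fn sketch_is_titlecase(c: char) -> bool {
    matches!(
        c,
        // Latin digraphs with a title-case form (Dž, Lj, Nj, Dz)
        '\u{01C5}' | '\u{01C8}' | '\u{01CB}' | '\u{01F2}'
        // Greek capital letters with prosgegrammeni
        | '\u{1F88}'..='\u{1F8F}'
        | '\u{1F98}'..='\u{1F9F}'
        | '\u{1FA8}'..='\u{1FAF}'
        | '\u{1FBC}' | '\u{1FCC}' | '\u{1FFC}'
    )
}

/// Sketch of `is_cased` expressed through the three casing predicates,
/// following the relationship described in this proposal.
pub fn sketch_is_cased(c: char) -> bool {
    c.is_lowercase() || c.is_uppercase() || sketch_is_titlecase(c)
}
```

For example, U+01C5 is neither lowercase nor uppercase according to the existing methods, but `sketch_is_cased('\u{01C5}')` is still true.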
impl char {
// note: `is_cased()` is now explicitly `is_lowercase() || is_uppercase() || is_titlecase()`
// titlecase follows the unicode property `Titlecase_Letter`
const fn is_titlecase(self) -> bool;
// equivalent to `to_uppercase` for most characters,
// but different specifically for ligature characters
// marking as const here in case we include earlier proposal
const fn to_titlecase(self) -> ToTitlecase;
}
impl str {
// would implement title-case algorithm, *and* include the final sigma rules
fn to_titlecase(&self) -> String;
fn titlecase_chars(&self) -> TitlecaseChars;
}
impl String {
fn make_titlecase(&mut self);
}
It might be nice to have an eq_ignore_case method for strings that uses case folding:
impl str {
// uses case folding
fn eq_ignore_case(&self, rhs: &str) -> bool;
}
Building on case folding, the following methods are omitted from the base set but might be useful to include. I put them at the end since people will generally prefer to perform case folding in advance rather than on demand every time, and we may want to encourage that specifically.
// note that char::eq_ignore_case is absent,
// since case conversions can expand to multiple characters
impl str {
fn cmp_ignore_ascii_case(&self, rhs: &str) -> Ordering;
fn cmp_ignore_case(&self, rhs: &str) -> Ordering;
}
I'm also adding this one just because I wanted it myself. I will be surprised if it's actually accepted, but it technically is included in unicase:
impl str {
fn hash_ignore_ascii_case<H: Hasher>(&self, state: &mut H);
fn hash_ignore_case<H: Hasher>(&self, state: &mut H);
}
Alternatives
Right now, the alternatives already exist as crates on crates.io. The primary benefit of adding this to the standard library is that a lot of the work is already done for the uppercase and lowercase mappings, and some of this data is already included but not exposed publicly. Adding case folding to the standard library would also make people aware that it exists, rather than simply converting everything to lowercase or uppercase to compare, which is technically incorrect. The simplest example is that lower(ß) != lower(SS), but fold(ß) = fold(SS).
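The ß example can be made concrete with a toy stand-in for the proposed folding methods. Everything named `toy_*` here is hypothetical: it hard-codes only two mappings that appear in CaseFolding.txt (U+00DF to "ss" and U+03C2 to U+03C3) and falls back to `char::to_lowercase` for everything else, which is not correct folding in general. It is only meant to show the shape of `chars().flat_map(char::to_folded_case)`.

```rust
/// Toy stand-in for the proposed `char::to_folded_case`.
/// Only two special mappings are hard-coded; all other characters fall
/// back to lowercasing, which is an approximation.
pub fn toy_fold_char(c: char) -> impl Iterator<Item = char> {
    let special: &'static str = match c {
        'ß' => "ss", // U+00DF folds to two characters
        'ς' => "σ",  // final sigma folds to regular sigma
        _ => "",
    };
    // Use the lowercase mapping only when no special mapping applied.
    let fallback = if special.is_empty() {
        Some(c.to_lowercase())
    } else {
        None
    };
    special.chars().chain(fallback.into_iter().flatten())
}

/// Toy analogue of the proposed `str::to_folded_case`.
pub fn toy_to_folded_case(s: &str) -> String {
    s.chars().flat_map(toy_fold_char).collect()
}

/// Toy analogue of the proposed `str::eq_ignore_case`: compares the
/// folded character streams without allocating.
pub fn toy_eq_ignore_case(a: &str, b: &str) -> bool {
    a.chars()
        .flat_map(toy_fold_char)
        .eq(b.chars().flat_map(toy_fold_char))
}
```

Under these assumptions, `toy_to_folded_case("Straße")` and `toy_to_folded_case("STRASSE")` both give `"strasse"`, so `toy_eq_ignore_case("Straße", "STRASSE")` is true even though the lowercased strings differ.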
Links and related work
I'm mostly creating this issue because it hasn't really been discussed since before 1.0.
If desired, I can dredge up some of the suggestions I found, but there aren't many, and I'm not sure they're relevant.
What happens now?
This issue contains an API change proposal (or ACP) and is part of the libs-api team feature lifecycle. Once this issue is filed, the libs-api team will review open proposals as capability becomes available. Current response times do not have a clear estimate, but may be up to several months.
Possible responses
The libs team may respond in various different ways. First, the team will consider the problem (this doesn't require any concrete solution or alternatives to have been proposed):
- We think this problem seems worth solving, and the standard library might be the right place to solve it.
- We think that this probably doesn't belong in the standard library.
Second, if there's a concrete solution:
- We think this specific solution looks roughly right, approved, you or someone else should implement this. (Further review will still happen on the subsequent implementation PR.)
- We're not sure this is the right solution, and the alternatives or other materials don't give us enough information to be sure about that. Here are some questions we have that aren't answered, or rough ideas about alternatives we'd want to see discussed.