217 changes: 199 additions & 18 deletions library/std/src/path.rs
@@ -3308,9 +3308,146 @@ impl Path {
fs::canonicalize(self)
}

/// Normalize a path, including `..` without traversing the filesystem.
/// Does the path represent a child of the current location? Path components are evaluated
/// naively with no filesystem traversal, so a "bounded" path may still be able to escape the
/// working directory when applied to the filesystem if symlinks are present.
///
/// Returns an error if normalization would leave leading `..` components.
/// # Examples
///
/// ```
/// #![feature(normalize_lexically)]
/// use std::path::Path;
///
/// assert!(Path::new("abc").is_lexically_bounded());
/// assert!(Path::new("abc/../def").is_lexically_bounded());
///
/// assert!(!Path::new("").is_lexically_bounded());
/// assert!(!Path::new(".").is_lexically_bounded());
Contributor:

This one feels weird to me; should this not be bounded? I would expect anything above the current directory to not be bounded, but at or below to be bounded.

Contributor (Author):

That was a last-minute change on my part; I previously had it implemented as you describe. But I thought the most likely use case for this would be recreating a directory structure from an untrusted source (e.g. extracting an archive), where you wouldn't want the source data to be able to address the pwd directly. I'm not sure there's a compelling use case for pwd-or-lower.
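The archive-extraction use case described here can be sketched with a depth-counting guard. This is a hypothetical helper (`is_safe_entry` is not part of the PR), but it mirrors the semantics being argued for: an entry that is absolute, escapes via `..`, or names the extraction root itself (`.`) is refused.

```rust
use std::path::{Component, Path};

// Hypothetical guard for the archive-extraction use case: accept only
// entries that stay strictly below the extraction root.
fn is_safe_entry(entry: &Path) -> bool {
    entry
        .components()
        .try_fold(0usize, |depth, component| match component {
            // Absolute paths and Windows prefixes are never safe.
            Component::Prefix(_) | Component::RootDir => None,
            // `.` neither enters nor leaves a directory.
            Component::CurDir => Some(depth),
            Component::Normal(_) => Some(depth + 1),
            // `checked_sub` fails exactly when `..` would climb past the root.
            Component::ParentDir => depth.checked_sub(1),
        })
        // Final depth 0 means the entry resolves to the root itself; refuse it.
        .is_some_and(|depth| depth > 0)
}

fn main() {
    assert!(is_safe_entry(Path::new("docs/readme.txt")));
    assert!(!is_safe_entry(Path::new("../outside")));
    assert!(!is_safe_entry(Path::new(".")));
    assert!(!is_safe_entry(Path::new("/etc/passwd")));
}
```

Note that, as with the method under review, this is purely lexical: it cannot account for symlinks already present on disk.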

Contributor (@clarfonthey, Mar 9, 2026):

That's fair; perhaps is_strictly_lexically_bounded and is_lexically_bounded could make that distinction? They could use the same code internally.

In the extracting files case, extracting multiple files would require a strict bound, but extracting a single file could allow for a loose bound. At least, when talking about the directory the files are in.

Contributor (Author):

But even when extracting a single file you wouldn't want to allow a "file" called ".".

The more elegant solution might be an enum along the lines of Inner/Pwd/Outer/Absolute. Which could ultimately stand in for is_absolute() as well. But I'm not sure we gain much from the extra complexity.

Contributor (Author):

Reading your first paragraph before you added the second, I actually wasn't sure whether is_strictly_lexically_bounded was meant to be the case with or without `..`, which implies some naming confusion if nothing else.

Contributor:

Yeah, sorry, I was typing this up on the bus, so I was trying to be brief and then correcting myself as I realised how under-detailed it was.

Your suggestion of having an enum actually does sound like a good idea, since it feels similar to how classify works on floats: you have one method to get every possible class, and other methods to explicitly check which class it is. is_absolute is a special case because it can short-circuit whereas the others have to read the whole path.

Contributor:

Also, to clarify on the strictness: I was thinking of it in the sense "strict" is used in mathematics: a subset of a set may be the set itself, but a strict subset explicitly forbids this.

So, in that sense, it's whether the path is strictly bounded within the current directory, lexically: a strict bound excludes the bound itself, whereas a loose bound allows it.

Contributor (Author):

impl Path {
    pub fn lexical_bound(&self) -> LexicalBound;
}

pub enum LexicalBound {
    Absolute,
    Outer,
    Pwd,
    Inner,
}

impl LexicalBound {
    pub fn is_absolute(&self) -> bool {
        matches!(self, Self::Absolute)
    }

    pub fn is_relative(&self) -> bool {
        !self.is_absolute()
    }

    pub fn is_pwd_or_inner(&self) -> bool {
        matches!(self, Self::Pwd | Self::Inner)
    }

    pub fn is_inner(&self) -> bool {
        matches!(self, Self::Inner)
    }
}

I'm mildly in favour of introducing an enum at this point and then revisiting the choice during the stabilization discussion; if the only real-world uses involve checking is_inner(), we should drop the enum and just return a boolean to that effect.
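The proposal above leaves `lexical_bound` as a bare signature; it could be filled in with the same depth-counting scheme as `is_lexically_bounded`. The sketch below is a free-function version (the names `LexicalBound` and `lexical_bound` come from this thread, not a committed std API), grouping prefixes with absolute paths for simplicity even though the thread notes `C:` is a murkier case.

```rust
use std::path::{Component, Path};

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum LexicalBound {
    Absolute,
    Outer,
    Pwd,
    Inner,
}

fn lexical_bound(path: &Path) -> LexicalBound {
    let mut depth: i64 = 0;
    for component in path.components() {
        match component {
            Component::Prefix(_) | Component::RootDir => return LexicalBound::Absolute,
            Component::CurDir => {}
            Component::Normal(_) => depth += 1,
            Component::ParentDir => {
                depth -= 1;
                // Dipping below the starting directory even once means the
                // path can address its parents, so classify as Outer.
                if depth < 0 {
                    return LexicalBound::Outer;
                }
            }
        }
    }
    if depth > 0 { LexicalBound::Inner } else { LexicalBound::Pwd }
}

fn main() {
    assert_eq!(lexical_bound(Path::new("abc/../def")), LexicalBound::Inner);
    assert_eq!(lexical_bound(Path::new(".")), LexicalBound::Pwd);
    assert_eq!(lexical_bound(Path::new("abc/../../def")), LexicalBound::Outer);
    assert_eq!(lexical_bound(Path::new("/abc")), LexicalBound::Absolute);
}
```

With this in place, `is_inner()` gives the strict bound and `is_pwd_or_inner()` the loose one, using the same pass over the components.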

/// assert!(!Path::new("..").is_lexically_bounded());
/// assert!(!Path::new("abc/../../def").is_lexically_bounded());
/// assert!(!Path::new("abc/..").is_lexically_bounded());
/// assert!(!Path::new("/abc").is_lexically_bounded());
/// ```
#[unstable(feature = "normalize_lexically", issue = "134694")]
pub fn is_lexically_bounded(&self) -> bool {
    use Component::*;

    self.components()
        .try_fold(0usize, |depth, component| match component {
            Prefix(_) | RootDir => None,
            CurDir => Some(depth),
            Normal(_) => Some(depth + 1),
            ParentDir => depth.checked_sub(1),
        })
        .is_some_and(|i| i > 0)
}

/// Is the path normalized, i.e. expressed in the simplest possible terms? A normalized path:
///
/// * Starts with either
/// * a prefix followed by root (`C:\`); or
/// * a prefix (`C:`); or
Member:

I realise this is a pre-existing issue but I think a "normalized" path shouldn't have some of the weirder Windows prefixes because they're harder to handle. For example, C: is technically relative but not relative to the current directory. This can break people's assumptions of what is_relative means.

Contributor (Author):

My interpretation of "normalized" is just "is expressed in simplest possible terms". There is no path, however strange, that can't have a boolean response to that question. Although this is a good point for normalize_lexically, which I think in its present implementation will normalize C:..\ to C:\.

Member:

I don't think C:foo is expressed in its simplest terms. It's roughly equivalent to doing %=C:%\foo in a command prompt. Simplifying it would require expanding it to get the real path.

Contributor (Author):

Sure, but that's not applicable to this use case since it's for working with paths that do not (yet) exist on the local filesystem. We do already have path::absolute() and fs::canonicalize() to address your definition of "simplifying", but those are both dependent on the runtime environment in a way that may not be possible or desirable (if the output will ultimately be consumed by another machine/process).

Ultimately, I'm not especially concerned with the specifics, but standing by the "every input has a normalized output" principle keeps this infallible, while adding obscure edge cases will require us to go back to returning a Result. Worse, weird Windows prefixes aren't parsed at all on *nix platforms, where you'll have to handle a Result that can't ever occur.

/// * root (`/`); or
/// * `.`; or
/// * zero or more `..`
/// * Continues with zero or more normal segments (`abc`)
/// * Contains only the primary platform separator, if applicable (eg. only `\\` rather than
/// `/` on Windows)
/// * Contains no repeated separators unless as part of the prefix
/// * Does not end with a separator unless as part of the prefix
///
/// # Examples
///
/// TODO: examples for non-*nix platforms
///
/// ```
/// #![feature(normalize_lexically)]
/// use std::path::Path;
///
/// assert!(Path::new("").is_normalized());
/// assert!(Path::new(".").is_normalized());
/// assert!(Path::new("abc").is_normalized());
/// assert!(Path::new("./abc").is_normalized());
/// assert!(Path::new("../../abc").is_normalized());
/// assert!(Path::new("/").is_normalized());
///
/// assert!(!Path::new("abc/../def").is_normalized());
/// assert!(!Path::new("//abc").is_normalized());
/// assert!(!Path::new(".//abc").is_normalized());
/// assert!(!Path::new("//").is_normalized());
/// assert!(!Path::new("/../abc").is_normalized());
/// assert!(!Path::new("abc/./def").is_normalized());
/// assert!(!Path::new("abc/").is_normalized());
/// assert!(!Path::new("abc/.").is_normalized());
/// assert!(!Path::new("./").is_normalized());
/// assert!(!Path::new("/.").is_normalized());
/// ```
#[unstable(feature = "normalize_lexically", issue = "134694")]
pub fn is_normalized(&self) -> bool {
    use Component::*;

    // TODO: This can be compiled out on platforms that only recognize one separator and can be
    // optimized on platforms with two.
    // See: https://github.com/rust-lang/libs-team/issues/744
    if self.as_u8_slice().iter().any(|&b| is_sep_byte(b) && b != MAIN_SEPARATOR as u8) {
        return false;
    }

    let mut components = self.components();
    let Some(first) = components.next() else {
        return true;
    };

    if !match first {
        Prefix(_) => {
            components.skip_while(|c| matches!(c, RootDir)).all(|c| matches!(c, Normal(_)))
        }
        RootDir | CurDir | Normal(_) => components.all(|c| matches!(c, Normal(_))),
        ParentDir => {
            components.skip_while(|c| matches!(c, ParentDir)).all(|c| matches!(c, Normal(_)))
        }
    } {
        return false;
    }

    // TODO: Checking for the component iterator silently dropping repeated separators or
    // current directory components can be done inline with the previous pass and should maybe
    // be done without hooking into the iterator internals.
    components = self.components(); // restart the iterator
    let mut prev = None;
    while components.front < State::Body {
        prev = components.next().or(prev);
    }

    let mut is_consecutive_empty = false;
    while !components.path.is_empty() {
        // This is how the iterator internally communicates skipping a component
        let (len, component) = components.parse_next_component();
        if component.is_some() {
            is_consecutive_empty = false;
            prev = component;
        } else {
            if prev != Some(CurDir) || is_consecutive_empty {
                return false;
            }
            is_consecutive_empty = true;
        }
        components.path = &components.path[len..];
    }

    if let Some(prev) = prev
        && !self.as_u8_slice().ends_with(prev.as_os_str().as_encoded_bytes())
    {
        return false;
    }

    true
}
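The invariants listed in the doc comment can be cross-checked with a simplified, Unix-only, string-level sketch. This is not the PR's implementation (which works through the component iterator and handles Windows prefixes); it just restates the rules for `/`-separated paths.

```rust
// Simplified, Unix-only cross-check of the normalization invariants:
// no repeated or trailing separators, an optional leading root, `.`,
// or run of `..`, then only normal segments.
fn is_normalized_unix(path: &str) -> bool {
    // Repeated or trailing separators are never normalized ("/" itself is ok).
    if path.contains("//") || (path.len() > 1 && path.ends_with('/')) {
        return false;
    }
    let (absolute, body) = match path.strip_prefix('/') {
        Some(rest) => (true, rest),
        None => (false, path),
    };
    if body.is_empty() {
        return true; // "" or the lone root "/"
    }
    let mut segments = body.split('/').peekable();
    if !absolute {
        if segments.peek() == Some(&".") {
            // A single leading "." is allowed ("." or "./abc").
            segments.next();
        } else {
            // Otherwise zero or more leading ".." are allowed ("../../abc").
            while segments.peek() == Some(&"..") {
                segments.next();
            }
        }
    }
    // Everything remaining must be a normal segment.
    segments.all(|s| !s.is_empty() && s != "." && s != "..")
}

fn main() {
    for ok in ["", ".", "abc", "./abc", "../../abc", "/"] {
        assert!(is_normalized_unix(ok), "{ok:?} should be normalized");
    }
    for bad in ["abc/../def", "//abc", ".//abc", "/../abc", "abc/", "abc/.", "./", "/."] {
        assert!(!is_normalized_unix(bad), "{bad:?} should not be normalized");
    }
}
```

The cases above mirror the doctest examples in the diff, so the two formulations agree on every example shown.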

/// Normalize a path, including `..` without traversing the filesystem. Any remaining `..`
/// components that can't be normalized are collected at the beginning of the path. Returns
/// [`Cow::Borrowed`] if the path is already normalized, otherwise [`Cow::Owned`] containing
/// the normalized form in a [`PathBuf`].
///
/// <div class="warning">
///
@@ -3320,18 +3457,48 @@ impl Path {
///
/// </div>
///
/// [`path::absolute`](absolute) is an alternative that preserves `..`.
/// Or [`Path::canonicalize`] can be used to resolve any `..` by querying the filesystem.
/// # Alternatives
///
/// * [`path::absolute`](absolute) is an alternative that preserves `..`.
/// * [`canonicalize`] can be used to resolve any `..` by querying the filesystem.
/// * [`is_normalized`] checks if the `Path` is already in normalized form.
/// * [`is_lexically_bounded`] checks if the `Path` escapes the current directory using `..` or
/// absolute paths, but does not query the filesystem and so cannot account for symbolic
/// links.
///
/// # Examples
///
/// ```
/// #![feature(normalize_lexically)]
/// use std::path::Path;
///
/// assert_eq!("def", Path::new("abc/../def").normalize_lexically().as_ref());
/// assert_eq!("../def", Path::new("abc/../../def").normalize_lexically().as_ref());
/// assert_eq!(".", Path::new("abc/..").normalize_lexically().as_ref());
/// assert_eq!("./abc/def", Path::new(".//abc/./def/").normalize_lexically().as_ref());
/// assert_eq!("", Path::new("").normalize_lexically().as_ref());
/// assert_eq!("/", Path::new("/").normalize_lexically().as_ref());
/// assert_eq!("/", Path::new("/..").normalize_lexically().as_ref());
/// ```
#[unstable(feature = "normalize_lexically", issue = "134694")]
pub fn normalize_lexically(&self) -> Result<PathBuf, NormalizeError> {
pub fn normalize_lexically(&self) -> Cow<'_, Path> {
if self.is_normalized() {
Cow::Borrowed(self)
} else {
Cow::Owned(self.normalize_lexically_internal())
}
}

/// Owned logic for [`normalize_lexically`] is split into its own method to facilitate testing
/// of [`is_normalized`].
fn normalize_lexically_internal(&self) -> PathBuf {
let mut lexical = PathBuf::new();
let mut iter = self.components().peekable();

// Find the root, if any, and add it to the lexical path.
// Here we treat the Windows path "C:\" as a single "root" even though
// `components` splits it into two: (Prefix, RootDir).
let root = match iter.peek() {
Some(Component::ParentDir) => return Err(NormalizeError),
Some(p @ Component::RootDir) | Some(p @ Component::CurDir) => {
lexical.push(p);
iter.next();
@@ -3346,27 +3513,41 @@
}
lexical.as_os_str().len()
}
None => return Ok(PathBuf::new()),
Some(Component::Normal(_)) => 0,
Some(Component::ParentDir) | Some(Component::Normal(_)) => 0,

// This will not happen in production code because an empty path satisfies
// `is_normalized`.
None => return lexical,
};

let mut depth = 0usize;
for component in iter {
match component {
Component::RootDir => unreachable!(),
Component::Prefix(_) => return Err(NormalizeError),
Component::CurDir => continue,
Component::ParentDir => {
// It's an error if ParentDir causes us to go above the "root".
if lexical.as_os_str().len() == root {
return Err(NormalizeError);
} else {
c @ Component::ParentDir => {
if depth > 0 {
lexical.pop();
depth -= 1;
} else if root == 0 {
lexical.push(c);
}
}
Component::Normal(path) => lexical.push(path),
Component::Normal(path) => {
lexical.push(path);
depth += 1;
}
Component::CurDir | Component::RootDir | Component::Prefix(_) => unreachable!(
"these components should have been consumed while handling the prefix"
),
}
}
Ok(lexical)

// An empty input satisfies `is_normalized`, so the only way to arrive here empty-handed is
// when given an input that resolves to the current directory, eg. "abc/..".
if lexical.as_os_str().is_empty() {
lexical.push(Component::CurDir);
}

lexical
}
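The behaviour exercised by the doctest examples above can be sketched as a standalone, Unix-oriented free function (Windows prefix handling omitted): `..` pops a previously pushed normal segment, unresolved `..` collects at the front of a relative path and is swallowed by the root on an absolute one, a leading `/` or `.` is preserved verbatim, and a path that cancels out entirely becomes `.`.

```rust
use std::path::{Component, Path, PathBuf};

// Unix-oriented sketch of lexical normalization, not the std implementation.
fn normalize_lexically(path: &Path) -> PathBuf {
    let mut out = PathBuf::new();
    let mut depth = 0usize; // normal segments currently pushed onto `out`
    let mut iter = path.components().peekable();
    // A leading `/` or `.` is preserved verbatim; only bare relative paths
    // may keep unresolved `..` components at the front.
    let keep_escapes = match iter.peek().copied() {
        Some(c @ (Component::RootDir | Component::CurDir)) => {
            iter.next();
            out.push(c);
            false
        }
        _ => true,
    };
    for component in iter {
        match component {
            Component::CurDir => {}
            Component::ParentDir => {
                if depth > 0 {
                    out.pop();
                    depth -= 1;
                } else if keep_escapes {
                    out.push(component);
                }
            }
            Component::Normal(name) => {
                out.push(name);
                depth += 1;
            }
            // Only possible as the first component on Unix, handled above.
            Component::RootDir | Component::Prefix(_) => {}
        }
    }
    // "abc/.." cancels to nothing; represent that as ".". The empty input
    // is already normalized and stays empty.
    if out.as_os_str().is_empty() && !path.as_os_str().is_empty() {
        out.push(".");
    }
    out
}

fn main() {
    assert_eq!(normalize_lexically(Path::new("abc/../def")), PathBuf::from("def"));
    assert_eq!(normalize_lexically(Path::new("abc/../../def")), PathBuf::from("../def"));
    assert_eq!(normalize_lexically(Path::new("abc/..")), PathBuf::from("."));
    assert_eq!(normalize_lexically(Path::new(".//abc/./def/")), PathBuf::from("./abc/def"));
    assert_eq!(normalize_lexically(Path::new("/..")), PathBuf::from("/"));
}
```

The outputs match the doctest examples in the diff; the real method additionally returns `Cow::Borrowed` when `is_normalized` already holds, avoiding the allocation entirely.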

/// Reads a symbolic link, returning the file that the link points to.