[CDRIVER-5983] Refactor String Handling Around URI Parsing #2047

vector-of-bool · 2025-06-26T22:08:08Z

Background

This PR is a subset of the changes that were original slated for CDRIVER-5983. The URI parameter refactor turned out to be significantly more impractical than initially though, so the changes related to parameter handling have been deferred until the relevant spec documents can consolidate on a cohesive behavior.

Regardless, some of the changes from that work can still be extracted and is useful in general throughout the codebase, specifically related to string handling and parsing.

This PR introduces minor behavioral changes surrounding some edge cases related to parsing.

While initially the commit history was going to be clean for the overall changes, the partial rollback around parameter handling means that commits themselves are not entirely useful in isolation.

Change Summary

String Views

This PR introduces the mlib/str.h header. This header has been re-introduced several times in libmongocrypt and amongoc, and this is the latest incarnation, which I believe is the most solid as it can build upon existing work regarding safe arithmetic and integer utilities.

mstr_view is a "string view" type. It consists of a const char* data pointer and a size_t len, wheren len refers to the length of the pointed-to string view.
Importantly, the data in mstr_view NOT necessarily null terminated! This means that mstr_view requires new string handling routines that respect the length.
Because we can pass slices of strings, this greatly reduces the need to strndup every time we want to take a string slice to pass to another function, further reducing the occurrences of memory leaking, goto fail, and buffer overruns. Code using mtsr_view is almost always far simpler than trying to juggle C strings.
There are several string algorithms: mstr_find, mstr_find_first_of, mlib_substr, mstr_contains, which inspect and manipulate string views.
mstr_cmp is a string comparison function like mlib_cmp, which supports natural syntax like mstr_cmp(a, !=, b)
mstr_split_at(S, Pos, [Drop,] *Pfx, *Sfx) is the most useful algorithm. It takes an index and optional skip parameter, and produces two new string views from an input.
mstr_split_around(S, Infix, *Pfx, *Sfx) -> bool builds upon mstr_split_at, and splits the string S around the first occurrence of Infix, returning true if it found Infix. This one API produces the greatest complexity reduction of all. For parsing a CSV string:
```
mstr_view remain = csv;
while (remain.len) {
  mstr_view entry;
  mstr_split_around(remain, mlib_cstring(","), &entry, &remain);
  // Handle `entry`
}
```
mstr_cstring creates an mstr_view from a C string. This is unfortunately necessary for calling a lot of APIs, but may be unnecessary in the future (see the note on _Generic in mlib/str.h)
All algorithms that accept a string position index also accept a negative value to index from the end of the string. They also use upsize_integer in order to inhibit any integer conversion warnings. No need to worry about passing the right type to these functions, it will just Do the Right Thing without any implicit conversions.
To printf-format a sized string, one must use the %.*s printf-specifier. This expect an int to specify the string length, followed by the string pointer. Because this is common, the macro MSTR_FMT(X) will expand to the two arguments required for the %.*s specifier.

Case Sensitivity

mlib_latin_tolower: Transforms a Basic Latin codepoint to a lowercase letter if it is an uppercase letter. Unlike standard tolower, this function is total in 32-bit integers, is very clear about what "lowercasing" it does, is not sensitive to the locale, and has no undefined behavior.
mlib_latin_charcasecmp: Compare two codepoints for equivalence with case-insensitivity in Basic Latin. Uses mlib_latin_tolower.
mstr_latin_casecmp: Like mstr_cmp, but case-insensitive using mlib_latin_charcasecmp.

`bson_error_t`

`bson_error_clear`

This function takes a bson_error_t* and resets it to a non-error value if the pointer is not null.

`bson_error_reset`

The new function-like macro bson_error_reset is defined. It is equivalent to bson_error_clear and takes an l-value to a pointer to bson_error_t. If the pointer is null, it is made to point to an anonymous bson_error_t in the local scope, allowing the function to rely on the present of an error object, even if the caller didn't pass one:

bool foo(bson_error_t* err) {
  bson_error_reset(err);  // Clears any "err"
  // `err` is now guaranteed to be non-null
}

This feature relies on C99 compound literals, so this function-like macro cannot be used in C++ 😢.

`bson/mongoc_set_error`

The set_error functions that do a string-format into the error message now support passing the error message of the object as an input parameter to the format string itself:

bson_set_error(&error, Cat, Code, "Got an error while creating a widget: %s", error.message);

Previously, this would garble the content of error.message. Now, the code uses a temporary buffer to format the message, then copies it over the error.message.

Integer Parsing

Because we are now using sized strings and not C strings, we need new integer parsing, because strtoll and friends expect a null terminator. Additionally, we can do better than strtoll:

mlib_i64_parse(Str, Base, *Out) -> int parses a 64-bit signed integer in Str. It differs from strtoll in a few important ways:
- It does not skip leading whitespace. That's an error.
- It doesn't immediately stop on a non-digit. That's an error.
- Doesn't silently reject empty strings. That's an error.
- It doesn't require a null terminator, of course.
mlib_i32_parse(Str, Base, *Out) -> int int32_t version of mlib_i64_parse
mlib_nat64_parse(Str, Base, *Out) -> int like mlib_i64_parse, but does not allow sign or base prefixes. Base must be non-zero. The input must be a string of only digits. This is the core of integer parsing.

Previousy, our code using strtoll was silently relying on whitespace trimming and had clunky or incomplete detecting whether strtoll stopped early. These parsing APIs are much more strict.

The implementation may be less efficient than the hyper-optimized strtoll, but it's much more easy to use correctly. Optimizations can come later.

Error Messages

Several APIs have been modified to return better error messages, including APIs that can fail with true/false. For example, %-decoding now explains why %-string is invalid, and that will be included when a URI is rejected.

Other Changes

The following other minor changes were made:

Passing an empty string to mongoc_uri_set_compressors was previously an error. The new behavior is to clear the compressors on the URI (as if passing a null pointer).
The failure strings in test assertions macros were modified so that VSCode generates proper clickable file links in terminal output.
Invalid UTF-8 following %-decoding generated a descriptive log warning in addition to a vague error message. This is changed to just generate a descriptive error without logging.

Not Included

Owning Sized-String `mstr`

There was going to be a type mstr that owned a sized mutable null-terminated string, with its own set of useful "Does the Right Thing" algorithms (append, splice, copy, erase, replace, push, etc.). I have the code for this, but it can wait for a future PR.

UTF-8 Validation

It may be useful to force mstr_view to only contain UTF-8 validation (i.e. you cannot create an mstr_view without checking that you have valid UTF-8), but that's a larger change for the potential future.

Logging/Errors/Warning Mess

The behavior around logging, warning, and erroring in URI handling is a mess. There are cases where we test that an API logs, and sometimes we expect it to be silent and put the error on an error object instead, and sometimes we expect both!

This inconsistency has been mostly retained. In the future, any API that accepts a bson_error_t* probably shouldn't log anything.

This commit adds `mlib_str_view` and utlities for manipulating sized string views.

- This change adds new functions to *unset* certain attributes on `mongoc_write_concern_t` objects, where previously it was only possible to set the value, but there was no way to clear the value. - This also changes the definition of `is_default()` to consider the values of the relevant attributes directly, rather than having an `is_default` field that gets set upon any modification. This allows a write concern to re-enter the default state after it has been modified. - The `wtimeout` field of the struct was renamed to `_wtimeout` to emphasize that it is private and that no one should modify it directly. There were test cases to check what happens if `wtimeout` is directly assigned an invalid (negative) value, but this is not actually possible to happen from the external API, because the `wtimeout` setter already guards against this case. The test cases for this situation have been removed. The rename of `_wtimeout` should discourange any code from attempting to modify this in a way that could bypass the validation of the setter.

This greatly reduces the number of allocated strings to manage, and simplifies parsing operations. - Split the URI into all components before trying to apply any of those to the URI object. - Update several internal APIs to pass sized string views, reducing the requirement to pass null-terminated strings, and reduing the number of redundant `strlen` calls. - `mstr_split_at/around` greatly simplifies parsing of delimited strings. - Put case normalization at a lower level to reduce need to case-fold strings which necessitates a strdup. Instead, use case-insensitive compares in more locations. - Behavior change: Setting compressors to empty string `""` now clears compressors rather than it being an error.

- %-decoding now indicates the position and kind of error to the caller. - %-decoding doesn't use sscanf - %-decoding allocates the full string up-front - Error messages related to %-decoding now explain the problem. - Use new sized integer parsing functions.

Allow passing `error.message` as an input to `_mongoc_set_error` by storing the temporary format output in a temporary buffer, then copying that over the `error.message`

This changes host parsing to use sized strings, and adds more specific error messages in case of parse failures.

This reverts commit ad481cc.

kevinAlbs

LGTM with comments addressed. I really like these changes. I think mstr_view significantly simplifies parsing.

Aside: I ran a local fuzz test with sanitizers on mongoc_uri_new, but did not discover anything.

Owning Sized-String mstr

If / when that is done, it may be worth combining with the existing mcommon_string_t.

kevinAlbs · 2025-07-08T13:43:24Z

src/common/src/mlib/intencode.h

+ * - A value of `0` indicates that the parse was successful.
+ * - A value of `EINVAL` indicates that the input string is not a valid
+ *   representation of an integer.
+ * - A value of `ERANGE` indicates thath the input string is a valid integer,


Suggested change

* - A value of `ERANGE` indicates thath the input string is a valid integer,

* - A value of `ERANGE` indicates that the input string is a valid integer,

kevinAlbs · 2025-07-08T13:51:48Z

src/common/src/mlib/intencode.h

+      in = mstr_substr (in, 1);
+      sign = -1;
+   }
+


Check length again after possibly removing sign:

if (in.len == 0) { // String with only sign is invalid return EINVAL; }

Avoids a possible buffer-overflow for a non-NULL terminated "+".

Tweaked with a check around the logic around base parsing. Added test cases for this that will be triggered during ASan builds.

kevinAlbs · 2025-07-08T15:51:04Z

src/common/src/mlib/str.h

+/**
+ * @brief Create an `mstr_view` that views the given array of `char`
+ *
+ * @param data Pointer to the beginning of the


Suggested change

* @param data Pointer to the beginning of the

* @param data Pointer to the beginning of the array

kevinAlbs · 2025-07-08T17:21:46Z

src/common/src/mlib/str.h

+{
+   if (clamp_to_length) {
+      if (pos.is_signed) {
+         if (pos.i.s > 0 && (size_t) pos.i.s > s.len) {


Suggest using mlib_cmp to handle the possibility that pos.i.s (intmax_t) be greater than SIZE_MAX. Return early to avoid the (unsafe?) cast of (intmax_t) s.len:

if (clamp_to_length) { if (pos.is_signed) { if (pos.i.s > 0 && mlib_cmp (pos.i.s, >, s.len)) { return s.len; } } else { if (pos.i.u > s.len) { return s.len; } } }

Heavily simplified this section

kevinAlbs · 2025-07-08T18:02:47Z

src/common/src/mlib/str.h

+ * with C string APIs could truncate unexpectedly.
+ */
+typedef struct mstr_view {
+   // Pointer to the first character in the string


Suggest clarifying if NULL is OK for data. E.g. is it OK to create an empty string like mstr_view_data (NULL, 0);?

Added clarification about null string views as an explicitly supported use case

kevinAlbs · 2025-07-09T19:19:16Z