-
Notifications
You must be signed in to change notification settings - Fork 455
[CDRIVER-5983] Refactor String Handling Around URI Parsing #2047
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[CDRIVER-5983] Refactor String Handling Around URI Parsing #2047
Conversation
This commit adds `mlib_str_view` and utlities for manipulating sized string views.
- This change adds new functions to *unset* certain attributes on `mongoc_write_concern_t` objects, where previously it was only possible to set the value, but there was no way to clear the value. - This also changes the definition of `is_default()` to consider the values of the relevant attributes directly, rather than having an `is_default` field that gets set upon any modification. This allows a write concern to re-enter the default state after it has been modified. - The `wtimeout` field of the struct was renamed to `_wtimeout` to emphasize that it is private and that no one should modify it directly. There were test cases to check what happens if `wtimeout` is directly assigned an invalid (negative) value, but this is not actually possible to happen from the external API, because the `wtimeout` setter already guards against this case. The test cases for this situation have been removed. The rename of `_wtimeout` should discourange any code from attempting to modify this in a way that could bypass the validation of the setter.
This greatly reduces the number of allocated strings to manage, and simplifies parsing operations. - Split the URI into all components before trying to apply any of those to the URI object. - Update several internal APIs to pass sized string views, reducing the requirement to pass null-terminated strings, and reduing the number of redundant `strlen` calls. - `mstr_split_at/around` greatly simplifies parsing of delimited strings. - Put case normalization at a lower level to reduce need to case-fold strings which necessitates a strdup. Instead, use case-insensitive compares in more locations. - Behavior change: Setting compressors to empty string `""` now clears compressors rather than it being an error.
- %-decoding now indicates the position and kind of error to the caller. - %-decoding doesn't use sscanf - %-decoding allocates the full string up-front - Error messages related to %-decoding now explain the problem. - Use new sized integer parsing functions.
Allow passing `error.message` as an input to `_mongoc_set_error` by storing the temporary format output in a temporary buffer, then copying that over the `error.message`
This changes host parsing to use sized strings, and adds more specific error messages in case of parse failures.
This reverts commit ad481cc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM with comments addressed. I really like these changes. I think mstr_view
significantly simplifies parsing.
Aside: I ran a local fuzz test with sanitizers on mongoc_uri_new
, but did not discover anything.
Owning Sized-String
mstr
If / when that is done, it may be worth combining with the existing mcommon_string_t
.
src/common/src/mlib/intencode.h
Outdated
* - A value of `0` indicates that the parse was successful. | ||
* - A value of `EINVAL` indicates that the input string is not a valid | ||
* representation of an integer. | ||
* - A value of `ERANGE` indicates thath the input string is a valid integer, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
* - A value of `ERANGE` indicates thath the input string is a valid integer, | |
* - A value of `ERANGE` indicates that the input string is a valid integer, |
in = mstr_substr (in, 1); | ||
sign = -1; | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Check length again after possibly removing sign:
if (in.len == 0) {
// String with only sign is invalid
return EINVAL;
}
Avoids a possible buffer-overflow for a non-NULL terminated "+".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tweaked with a check around the logic around base parsing. Added test cases for this that will be triggered during ASan builds.
src/common/src/mlib/str.h
Outdated
/** | ||
* @brief Create an `mstr_view` that views the given array of `char` | ||
* | ||
* @param data Pointer to the beginning of the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
* @param data Pointer to the beginning of the | |
* @param data Pointer to the beginning of the array |
src/common/src/mlib/str.h
Outdated
{ | ||
if (clamp_to_length) { | ||
if (pos.is_signed) { | ||
if (pos.i.s > 0 && (size_t) pos.i.s > s.len) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suggest using mlib_cmp
to handle the possibility that pos.i.s
(intmax_t) be greater than SIZE_MAX
. Return early to avoid the (unsafe?) cast of (intmax_t) s.len
:
if (clamp_to_length) {
if (pos.is_signed) {
if (pos.i.s > 0 && mlib_cmp (pos.i.s, >, s.len)) {
return s.len;
}
} else {
if (pos.i.u > s.len) {
return s.len;
}
}
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Heavily simplified this section
src/common/src/mlib/str.h
Outdated
* with C string APIs could truncate unexpectedly. | ||
*/ | ||
typedef struct mstr_view { | ||
// Pointer to the first character in the string |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suggest clarifying if NULL
is OK for data
. E.g. is it OK to create an empty string like mstr_view_data (NULL, 0);
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added clarification about null string views as an explicitly supported use case
return false; | ||
} | ||
// Check if the database name contains and invalid characters after the %-decode | ||
if (mstr_contains_any_of (mstr_cstring (uri->database), mstr_cstring ("/\\. \n\r\f\t\"$"))) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if (mstr_contains_any_of (mstr_cstring (uri->database), mstr_cstring ("/\\. \n\r\f\t\"$"))) { | |
if (mstr_contains_any_of (mstr_cstring (uri->database), mstr_cstring ("/\\. \"$"))) { |
Suggest reverting disallowed characters. I appear to be able to create a database name with "\n\r\f\t":
from pymongo import MongoClient
client = MongoClient()
for letter in "/\\. \n\r\f\t\"$":
dbname = f"foo{letter}bar"
try:
client[dbname].command({"ping": 1})
print ("{} ... ok".format(repr(dbname)))
except Exception:
print ("{} ... failed".format(repr(dbname)))
* | ||
*-------------------------------------------------------------------------- | ||
* @param doc The document to be updtaed |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
* @param doc The document to be updtaed | |
* @param doc The document to be updated |
continue; | ||
} | ||
/** | ||
* @brief Update a BSON document with a UTF-8 value, replacing it if it alread |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
* @brief Update a BSON document with a UTF-8 value, replacing it if it alread | |
* @brief Update a BSON document with a UTF-8 value, replacing it if it already |
bson_free (userinfo); | ||
size_t userinfo_end_pos = mstr_find (remain, AT); | ||
if (userinfo_end_pos == SIZE_MAX) { | ||
// There is no userinfo, so we don't need to do anything speical |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// There is no userinfo, so we don't need to do anything speical | |
// There is no userinfo, so we don't need to do anything special |
// * remain = "query#fragment" (Each following component is optional, but this is the proper order) | ||
const size_t hash_pos = mstr_find_first_of (remain, mstr_cstring ("#")); | ||
mstr_split_at (remain, hash_pos, &parts->query, &remain); | ||
// * remain = "#fragment" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suggest removing special handling for #fragment
. The connection string spec does not appear to mention a fragment. And this may trigger a backwards breaking behavior change. Example:
// Previously OK. Now a parse error:
mongoc_uri_t *uri = mongoc_uri_new ("mongodb://host/?appName=bar#baz");
const char *got = mongoc_uri_get_option_as_utf8 (uri, "appName", "");
ASSERT_CMPSTR (got, "bar#baz");
mongoc_uri_destroy (uri);
Background
Refer: CDRIVER-5983
This PR is a subset of the changes that were original slated for CDRIVER-5983. The URI parameter refactor turned out to be significantly more impractical than initially though, so the changes related to parameter handling have been deferred until the relevant spec documents can consolidate on a cohesive behavior.
Regardless, some of the changes from that work can still be extracted and is useful in general throughout the codebase, specifically related to string handling and parsing.
This PR introduces minor behavioral changes surrounding some edge cases related to parsing.
While initially the commit history was going to be clean for the overall changes, the partial rollback around parameter handling means that commits themselves are not entirely useful in isolation.
Change Summary
String Views
This PR introduces the
mlib/str.h
header. This header has been re-introduced several times in libmongocrypt and amongoc, and this is the latest incarnation, which I believe is the most solid as it can build upon existing work regarding safe arithmetic and integer utilities.mstr_view
is a "string view" type. It consists of aconst char* data
pointer and asize_t len
, wherenlen
refers to the length of the pointed-to string view.Importantly, the data in
mstr_view
NOT necessarily null terminated! This means thatmstr_view
requires new string handling routines that respect the length.Because we can pass slices of strings, this greatly reduces the need to
strndup
every time we want to take a string slice to pass to another function, further reducing the occurrences of memory leaking,goto fail
, and buffer overruns. Code usingmtsr_view
is almost always far simpler than trying to juggle C strings.There are several string algorithms:
mstr_find
,mstr_find_first_of
,mlib_substr
,mstr_contains
, which inspect and manipulate string views.mstr_cmp
is a string comparison function likemlib_cmp
, which supports natural syntax likemstr_cmp(a, !=, b)
mstr_split_at(S, Pos, [Drop,] *Pfx, *Sfx)
is the most useful algorithm. It takes an index and optional skip parameter, and produces two new string views from an input.mstr_split_around(S, Infix, *Pfx, *Sfx) -> bool
builds uponmstr_split_at
, and splits the stringS
around the first occurrence ofInfix
, returningtrue
if it foundInfix
. This one API produces the greatest complexity reduction of all. For parsing a CSV string:mstr_cstring
creates anmstr_view
from a C string. This is unfortunately necessary for calling a lot of APIs, but may be unnecessary in the future (see the note on_Generic
inmlib/str.h
)All algorithms that accept a string position index also accept a negative value to index from the end of the string. They also use
upsize_integer
in order to inhibit any integer conversion warnings. No need to worry about passing the right type to these functions, it will just Do the Right Thing without any implicit conversions.To printf-format a sized string, one must use the
%.*s
printf-specifier. This expect anint
to specify the string length, followed by the string pointer. Because this is common, the macroMSTR_FMT(X)
will expand to the two arguments required for the%.*s
specifier.Case Sensitivity
mlib_latin_tolower
: Transforms a Basic Latin codepoint to a lowercase letter if it is an uppercase letter. Unlike standardtolower
, this function is total in 32-bit integers, is very clear about what "lowercasing" it does, is not sensitive to the locale, and has no undefined behavior.mlib_latin_charcasecmp
: Compare two codepoints for equivalence with case-insensitivity in Basic Latin. Usesmlib_latin_tolower
.mstr_latin_casecmp
: Likemstr_cmp
, but case-insensitive usingmlib_latin_charcasecmp
.bson_error_t
bson_error_clear
This function takes a
bson_error_t*
and resets it to a non-error value if the pointer is not null.bson_error_reset
The new function-like macro
bson_error_reset
is defined. It is equivalent tobson_error_clear
and takes an l-value to a pointer tobson_error_t
. If the pointer is null, it is made to point to an anonymousbson_error_t
in the local scope, allowing the function to rely on the present of an error object, even if the caller didn't pass one:This feature relies on C99 compound literals, so this function-like macro cannot be used in C++ 😢.
bson/mongoc_set_error
The
set_error
functions that do a string-format into the error message now support passing the error message of the object as an input parameter to the format string itself:Previously, this would garble the content of
error.message
. Now, the code uses a temporary buffer to format the message, then copies it over theerror.message
.Integer Parsing
Because we are now using sized strings and not C strings, we need new integer parsing, because
strtoll
and friends expect a null terminator. Additionally, we can do better thanstrtoll
:mlib_i64_parse(Str, Base, *Out) -> int
parses a 64-bit signed integer inStr
. It differs fromstrtoll
in a few important ways:mlib_i32_parse(Str, Base, *Out) -> int
int32_t
version ofmlib_i64_parse
mlib_nat64_parse(Str, Base, *Out) -> int
likemlib_i64_parse
, but does not allow sign or base prefixes.Base
must be non-zero. The input must be a string of only digits. This is the core of integer parsing.Previousy, our code using
strtoll
was silently relying on whitespace trimming and had clunky or incomplete detecting whetherstrtoll
stopped early. These parsing APIs are much more strict.The implementation may be less efficient than the hyper-optimized
strtoll
, but it's much more easy to use correctly. Optimizations can come later.Error Messages
Several APIs have been modified to return better error messages, including APIs that can fail with
true/false
. For example, %-decoding now explains why %-string is invalid, and that will be included when a URI is rejected.Other Changes
The following other minor changes were made:
mongoc_uri_set_compressors
was previously an error. The new behavior is to clear the compressors on the URI (as if passing a null pointer).Not Included
Owning Sized-String
mstr
There was going to be a type
mstr
that owned a sized mutable null-terminated string, with its own set of useful "Does the Right Thing" algorithms (append, splice, copy, erase, replace, push, etc.). I have the code for this, but it can wait for a future PR.UTF-8 Validation
It may be useful to force
mstr_view
to only contain UTF-8 validation (i.e. you cannot create anmstr_view
without checking that you have valid UTF-8), but that's a larger change for the potential future.Logging/Errors/Warning Mess
The behavior around logging, warning, and erroring in URI handling is a mess. There are cases where we test that an API logs, and sometimes we expect it to be silent and put the error on an error object instead, and sometimes we expect both!
This inconsistency has been mostly retained. In the future, any API that accepts a
bson_error_t*
probably shouldn't log anything.