Conversation

@subrata-ms
Contributor

@subrata-ms subrata-ms commented Dec 8, 2025

Work Item / Issue Reference

AB#40879

GitHub Issue: #<ISSUE_NUMBER>


Summary

This pull request refactors and optimizes the string conversion utilities in unix_utils.cpp for converting between SQLWCHAR arrays and std::wstring on macOS/Linux. The new implementation eliminates intermediate buffers and reliance on codecvt, resulting in more efficient and robust conversions, especially for Unicode characters outside the Basic Multilingual Plane (BMP).

String conversion optimizations:

  • Replaced the previous SQLWCHARToWString implementation with a direct UTF-16 to UTF-32 conversion, handling surrogate pairs explicitly and removing the use of std::wstring_convert and intermediate buffers.
  • Improved the WStringToSQLWCHAR function to convert std::wstring (UTF-32) to UTF-16, encoding surrogate pairs manually and streamlining the conversion logic for better performance and branch prediction.
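
The UTF-16 to UTF-32 direction described above can be sketched roughly as follows. This is a minimal illustration, not the code from unix_utils.cpp: the function name `utf16_to_utf32` is hypothetical, `std::u16string` and `std::u32string` stand in for the SQLWCHAR array and the 32-bit `std::wstring`, and replacing unpaired surrogates with U+FFFD is an assumed policy.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not the PR's code): decode UTF-16 code units into UTF-32.
std::u32string utf16_to_utf32(const std::u16string& in) {
    std::u32string out;
    out.reserve(in.size());  // upper bound: at most one code point per code unit
    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t u = in[i];
        if (u >= 0xD800 && u <= 0xDBFF) {  // high surrogate
            if (i + 1 < in.size() && in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
                char32_t lo = in[i + 1];
                // Combine the two 10-bit halves and add the 0x10000 offset.
                out.push_back(0x10000 + ((u - 0xD800) << 10) + (lo - 0xDC00));
                ++i;  // consume the low surrogate as well
            } else {
                out.push_back(0xFFFD);  // unpaired high surrogate (assumed policy)
            }
        } else if (u >= 0xDC00 && u <= 0xDFFF) {
            out.push_back(0xFFFD);  // stray low surrogate (assumed policy)
        } else {
            out.push_back(u);  // BMP code point, copied as-is
        }
    }
    return out;
}
```

For example, the pair 0xD83D 0xDE00 combines to 0x10000 + (0x3D << 10) + 0x200 = 0x1F600, a code point outside the BMP.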

Robustness and correctness improvements:

  • Added explicit handling for invalid surrogate pairs and code points, ensuring that malformed input does not cause conversion failures or exceptions.
  • Ensured that both conversion functions always append a null terminator to the output, maintaining compatibility with ODBC expectations.
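
A sketch of the opposite direction, showing one way the two points above could fit together. The name `utf32_to_utf16z` and the choice to substitute U+FFFD for invalid code points are assumptions for illustration; the PR's actual handling of invalid input may differ.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper (not the PR's code): encode UTF-32 as null-terminated UTF-16.
std::vector<std::uint16_t> utf32_to_utf16z(const std::u32string& in) {
    std::vector<std::uint16_t> out;
    out.reserve(in.size() + 1);  // exact for BMP-only input, grows otherwise
    for (char32_t cp : in) {
        if ((cp >= 0xD800 && cp <= 0xDFFF) || cp > 0x10FFFF) {
            cp = 0xFFFD;  // not a Unicode scalar value (assumed replacement policy)
        }
        if (cp < 0x10000) {
            out.push_back(static_cast<std::uint16_t>(cp));
        } else {
            cp -= 0x10000;  // 20 bits remain, split into two 10-bit halves
            out.push_back(static_cast<std::uint16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<std::uint16_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    out.push_back(0);  // terminator appended unconditionally, even for empty input
    return out;
}
```

Because the terminator is pushed after the loop, an empty input still yields a one-element buffer holding only the null code unit.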

Code simplification:

  • Removed exception handling and fallback code paths by providing a single, reliable conversion strategy for both directions.

@github-actions

github-actions bot commented Dec 8, 2025

📊 Code Coverage Report

🔥 Diff Coverage

74%


🎯 Overall Coverage

75%


📈 Total Lines Covered: 5245 out of 6993
📁 Project: mssql-python


Diff Coverage

Diff: main...HEAD, staged and unstaged changes

  • mssql_python/pybind/ddbc_bindings.h (74.2%): Missing lines 476-478,486-487,493-495,504-505,512-514,524-525,528-529

Summary

  • Total: 66 lines
  • Missing: 17 lines
  • Coverage: 74%

mssql_python/pybind/ddbc_bindings.h

  472         // 2-byte sequence: 110xxxxx 10xxxxxx
  473         if ((byte & 0xE0) == 0xC0 && i + 1 < len) {
  474             // Validate continuation byte has correct bit pattern (10xxxxxx)
  475             if ((data[i + 1] & 0xC0) != 0x80) {
! 476                 ++i;
! 477                 return 0xFFFD;  // Invalid continuation byte
! 478             }
  479             uint32_t cp = ((static_cast<uint32_t>(byte & 0x1F) << 6) | (data[i + 1] & 0x3F));
  480             // Reject overlong encodings (must be >= 0x80)
  481             if (cp >= 0x80) {
  482                 i += 2;
  483                 return static_cast<wchar_t>(cp);
  484             }
  485             // Overlong encoding - invalid
! 486             ++i;
! 487             return 0xFFFD;
  488         }
  489         // 3-byte sequence: 1110xxxx 10xxxxxx 10xxxxxx
  490         if ((byte & 0xF0) == 0xE0 && i + 2 < len) {
  491             // Validate continuation bytes have correct bit pattern (10xxxxxx)
  492             if ((data[i + 1] & 0xC0) != 0x80 || (data[i + 2] & 0xC0) != 0x80) {
! 493                 ++i;
! 494                 return 0xFFFD;  // Invalid continuation bytes
! 495             }
  496             uint32_t cp = ((static_cast<uint32_t>(byte & 0x0F) << 12) |
  497                            ((data[i + 1] & 0x3F) << 6) | (data[i + 2] & 0x3F));
  498             // Reject overlong encodings (must be >= 0x800) and surrogates (0xD800-0xDFFF)
  499             if (cp >= 0x800 && (cp < 0xD800 || cp > 0xDFFF)) {

  500                 i += 3;
  501                 return static_cast<wchar_t>(cp);
  502             }
  503             // Overlong encoding or surrogate - invalid
! 504             ++i;
! 505             return 0xFFFD;
  506         }
  507         // 4-byte sequence: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
  508         if ((byte & 0xF8) == 0xF0 && i + 3 < len) {
  509             // Validate continuation bytes have correct bit pattern (10xxxxxx)
  510             if ((data[i + 1] & 0xC0) != 0x80 || (data[i + 2] & 0xC0) != 0x80 ||
  511                 (data[i + 3] & 0xC0) != 0x80) {
! 512                 ++i;
! 513                 return 0xFFFD;  // Invalid continuation bytes
! 514             }
  515             uint32_t cp =
  516                 ((static_cast<uint32_t>(byte & 0x07) << 18) | ((data[i + 1] & 0x3F) << 12) |
  517                  ((data[i + 2] & 0x3F) << 6) | (data[i + 3] & 0x3F));
  518             // Reject overlong encodings (must be >= 0x10000) and values above max Unicode

  520                 i += 4;
  521                 return static_cast<wchar_t>(cp);
  522             }
  523             // Overlong encoding or out of range - invalid
! 524             ++i;
! 525             return 0xFFFD;
  526         }
  527         // Invalid sequence - skip byte
! 528         ++i;
! 529         return 0xFFFD;  // Unicode replacement character
  530     };
  531 
  532     std::wstring result;
  533     result.reserve(str.size());  // Reserve assuming mostly ASCII
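
All of the lines flagged as missing above are the error branches that return 0xFFFD, so they are only reached by malformed input. The sketch below is a compact stand-in for the decoder, written to mirror the excerpt's behavior of advancing one byte on any error; `decode_one` and `decode_all` are hypothetical names and this is not the code in ddbc_bindings.h. It is meant to show which byte sequences would drive each branch.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical stand-in for the decoder under review; names are illustrative.
// Decodes one code point starting at s[i] and advances i. Any malformed
// sequence yields U+FFFD and advances by a single byte.
char32_t decode_one(const std::string& s, std::size_t& i) {
    auto at = [&](std::size_t k) { return static_cast<std::uint8_t>(s[k]); };
    auto cont = [&](std::size_t k) { return k < s.size() && (at(k) & 0xC0) == 0x80; };
    const std::uint8_t b = at(i);
    std::uint32_t cp = 0xFFFD;
    std::size_t len = 1;
    if (b < 0x80) {
        cp = b;  // ASCII
    } else if ((b & 0xE0) == 0xC0 && cont(i + 1)) {
        const std::uint32_t v = ((b & 0x1Fu) << 6) | (at(i + 1) & 0x3Fu);
        if (v >= 0x80) { cp = v; len = 2; }  // otherwise overlong
    } else if ((b & 0xF0) == 0xE0 && cont(i + 1) && cont(i + 2)) {
        const std::uint32_t v =
            ((b & 0x0Fu) << 12) | ((at(i + 1) & 0x3Fu) << 6) | (at(i + 2) & 0x3Fu);
        // Reject overlong forms and encoded surrogates.
        if (v >= 0x800 && (v < 0xD800 || v > 0xDFFF)) { cp = v; len = 3; }
    } else if ((b & 0xF8) == 0xF0 && cont(i + 1) && cont(i + 2) && cont(i + 3)) {
        const std::uint32_t v = ((b & 0x07u) << 18) | ((at(i + 1) & 0x3Fu) << 12) |
                                ((at(i + 2) & 0x3Fu) << 6) | (at(i + 3) & 0x3Fu);
        // Reject overlong forms and values above the Unicode maximum.
        if (v >= 0x10000 && v <= 0x10FFFF) { cp = v; len = 4; }
    }
    i += len;
    return static_cast<char32_t>(cp);
}

// Decode a whole byte string by repeatedly calling decode_one.
std::u32string decode_all(const std::string& s) {
    std::u32string out;
    for (std::size_t i = 0; i < s.size();) out.push_back(decode_one(s, i));
    return out;
}
```

Inputs that would exercise the error paths in this sketch include "\xC3\x28" (bad continuation byte), "\xC0\x80" (overlong 2-byte form), "\xED\xA0\x80" (encoded surrogate U+D800), "\xF4\x90\x80\x80" (above U+10FFFF), and a lone "\x80" (invalid lead byte).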


📋 Files Needing Attention

📉 Files with overall lowest coverage
mssql_python.pybind.logger_bridge.hpp: 58.8%
mssql_python.pybind.logger_bridge.cpp: 59.2%
mssql_python.pybind.ddbc_bindings.cpp: 66.2%
mssql_python.row.py: 66.2%
mssql_python.helpers.py: 67.5%
mssql_python.pybind.connection.connection.cpp: 73.6%
mssql_python.ddbc_bindings.py: 79.6%
mssql_python.connection.py: 83.7%
mssql_python.cursor.py: 84.2%
mssql_python.logging.py: 85.3%

@subrata-ms subrata-ms changed the title unix utility function fixes FIX: Fix for depricated lib function wstring_convert Dec 8, 2025
@github-actions github-actions bot added the pr-size: medium Moderate update size label Dec 8, 2025
@subrata-ms subrata-ms marked this pull request as ready for review December 8, 2025 07:25
Copilot AI review requested due to automatic review settings December 8, 2025 07:25
Contributor

Copilot AI left a comment

Pull request overview

This pull request removes the deprecated std::wstring_convert and std::codecvt utilities and replaces them with manual UTF-8/UTF-16/UTF-32 conversion implementations for Unix platforms (macOS/Linux). The changes aim to modernize the codebase and improve performance through direct encoding/decoding without intermediate buffers.

Key changes:

  • Refactored SQLWCHARToWString and WStringToSQLWCHAR in unix_utils.cpp to manually handle UTF-16 ↔ UTF-32 conversions with explicit surrogate pair encoding/decoding
  • Implemented manual UTF-8 to UTF-32 decoder in Utf8ToWString function in ddbc_bindings.h

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 8 comments.

File | Description
mssql_python/pybind/unix_utils.cpp | Replaces codecvt-based conversions with manual UTF-16 ↔ UTF-32 conversion logic including surrogate pair handling
mssql_python/pybind/ddbc_bindings.h | Adds manual UTF-8 decoder with multi-byte sequence handling to replace deprecated wstring_convert

Critical Issues Identified:

  • Security: UTF-8 decoder lacks validation for overlong encodings and malformed continuation bytes, which could lead to security vulnerabilities
  • Correctness: Invalid surrogate code points (0xD800-0xDFFF) are not properly validated in WStringToSQLWCHAR, allowing invalid Unicode to be generated
  • Robustness: Flawed logic in invalid sequence detection at line 516 of ddbc_bindings.h may cause incorrect behavior

The implementation introduces several critical bugs that deviate from the existing, more robust implementation already present in ddbc_bindings.h (lines 79-167). The existing implementation properly validates Unicode scalars and replaces invalid sequences with the Unicode replacement character (0xFFFD), while the new code in unix_utils.cpp silently passes through or skips invalid values inconsistently.
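
To illustrate why the overlong-encoding point is treated as a security concern, the following sketch compares a 2-byte decode step with and without a range check. Both function names are hypothetical and neither is taken from the PR. The overlong pair 0xC0 0xAF decodes to 0x2F ('/') when unchecked, which is how a byte-level filter for '/' can be bypassed.

```cpp
#include <cstdint>

// Illustrative only: a 2-byte decode step with no range check.
constexpr std::uint32_t decode2_unchecked(std::uint8_t b0, std::uint8_t b1) {
    return ((b0 & 0x1Fu) << 6) | (b1 & 0x3Fu);
}

// Same step, mapping overlong forms (values below 0x80) to U+FFFD.
constexpr std::uint32_t decode2_checked(std::uint8_t b0, std::uint8_t b1) {
    return decode2_unchecked(b0, b1) >= 0x80 ? decode2_unchecked(b0, b1) : 0xFFFDu;
}

static_assert(decode2_unchecked(0xC0, 0xAF) == 0x2F, "overlong '/' slips through");
static_assert(decode2_checked(0xC0, 0xAF) == 0xFFFD, "overlong form is rejected");
static_assert(decode2_checked(0xC3, 0xA9) == 0xE9, "U+00E9 still decodes");
```

The check is a single comparison because every valid 2-byte sequence encodes a value of at least 0x80; anything smaller has a shorter canonical form.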


@github-actions github-actions bot added pr-size: large Substantial code update and removed pr-size: medium Moderate update size labels Dec 8, 2025
sumitmsft previously approved these changes Dec 8, 2025
Contributor

@gargsaumya gargsaumya left a comment

Once the Alpine PR is merged, please update the x86_64 Alpine pipeline in this PR to use alpine:latest. This PR will resolve the compiler warning issue and allow x86 to move back to the latest image.

The ARM64 pipeline can be updated to latest once the ODBC fix is in (not part of this PR).

sumitmsft previously approved these changes Dec 9, 2025
@subrata-ms subrata-ms changed the title FIX: Fix for depricated lib function wstring_convert FIX: Fix for deprecated lib function wstring_convert Dec 9, 2025
Collaborator

@bewithgaurav bewithgaurav left a comment

Requesting tests for the uncovered logic.

gargsaumya previously approved these changes Dec 9, 2025
@subrata-ms subrata-ms force-pushed the subrata-ms/DepricatedFixLinux branch from 419b024 to ac56363 Compare December 9, 2025 16:18
sumitmsft previously approved these changes Dec 10, 2025
gargsaumya previously approved these changes Dec 10, 2025
@subrata-ms subrata-ms dismissed stale reviews from gargsaumya and sumitmsft via ab15ef9 December 10, 2025 09:54
@subrata-ms subrata-ms dismissed bewithgaurav’s stale review December 10, 2025 11:47

@gaurav, I have addressed the review comments about diff code coverage: I added the required tests, and diff coverage is now 74%. Since you are not available for a quick review right now, I am removing you from the mandatory reviewer list in order to merge the PR.

@subrata-ms subrata-ms merged commit f119d05 into main Dec 10, 2025
27 checks passed
@subrata-ms subrata-ms deleted the subrata-ms/DepricatedFixLinux branch December 10, 2025 15:12

Labels

pr-size: large Substantial code update

5 participants