FEAT: Add arrow fetch support #354
Conversation
@microsoft-github-policy-service agree
Pull request overview
This PR adds Apache Arrow fetch support to the mssql-python driver, enabling efficient columnar data retrieval from SQL Server. The implementation provides three new cursor methods (arrow_batch(), arrow(), and arrow_reader()) that convert result sets into Apache Arrow data structures using the Arrow C Data Interface, bypassing Python object creation in the hot path for improved performance.
Key changes:
- Implemented Arrow fetch functionality in C++ that directly converts ODBC result sets to Arrow format
- Added three Python API methods for different Arrow data consumption patterns (single batch, full table, streaming reader)
- Added comprehensive test coverage for various data types, LOB columns, and edge cases
Reviewed changes
Copilot reviewed 3 out of 4 changed files in this pull request and generated 9 comments.
| File | Description |
|---|---|
| mssql_python/pybind/ddbc_bindings.cpp | Core C++ implementation: Added FetchArrowBatch_wrap() function with Arrow C Data Interface structures, column buffer management, data type conversion logic, and memory management for Arrow structures |
| mssql_python/cursor.py | Python API layer: Added arrow_batch(), arrow(), and arrow_reader() methods that wrap the C++ bindings and handle pyarrow imports |
| tests/test_004_cursor.py | Comprehensive test suite covering wide tables, LOB columns, individual data types, empty result sets, datetime handling, and batch operations |
| requirements.txt | Added pyarrow as a dependency for development and testing |
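A rough sketch of how the three cursor methods might be consumed. This is illustrative only: the connection string, table name, and `process()` consumer are placeholders, and `connect` is assumed to be mssql-python's top-level constructor.

```python
from mssql_python import connect  # top-level constructor assumed

conn = connect("Server=localhost;Database=testdb;Trusted_Connection=yes;")
cursor = conn.cursor()

# 1) Single batch: at most chunk_size rows as one pyarrow.RecordBatch
cursor.execute("SELECT id, name FROM my_table")
batch = cursor.arrow_batch(chunk_size=10000)

# 2) Whole result set materialized as one pyarrow.Table
cursor.execute("SELECT id, name FROM my_table")
table = cursor.arrow()

# 3) Streaming: a pyarrow.RecordBatchReader, one chunk in memory at a time
cursor.execute("SELECT id, name FROM my_table")
for record_batch in cursor.arrow_reader(chunk_size=10000):
    process(record_batch)  # placeholder consumer
```

Each pattern re-executes the query here because every variant consumes the same underlying result set.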
```cpp
// Convert SQL_DATE_STRUCT to Arrow Date32 (days since epoch)
std::tm tm_date = {};
tm_date.tm_year = year - 1900;  // tm_year is years since 1900
tm_date.tm_mon = month - 1;     // tm_mon is 0-11
tm_date.tm_mday = day;

std::time_t time_since_epoch = std::mktime(&tm_date);
if (time_since_epoch == -1) {
    LOG("Failed to convert SQL_DATE_STRUCT to time_t");
    ThrowStdException("Date conversion error");
}
// Calculate days since epoch
return time_since_epoch / 86400;
```
Copilot AI, Nov 30, 2025:
The dateAsDayCount function uses std::mktime which interprets the date in the local timezone and may apply DST adjustments. This can cause incorrect day counts, especially around DST transitions. Arrow Date32 should represent dates as days since the Unix epoch (1970-01-01) in UTC, not local time.
Additionally, std::mktime is not guaranteed to work correctly for dates outside the system's time_t range (often limited on 32-bit systems).
Consider using a proper date calculation algorithm that doesn't depend on timezone:
```cpp
int32_t dateAsDayCount(SQLUSMALLINT year, SQLUSMALLINT month, SQLUSMALLINT day) {
    // Calculate days since the Unix epoch (1970-01-01) without any timezone
    // dependency, using the standard Gregorian-to-Julian-day-number formula.
    int a = (14 - month) / 12;
    int y = year + 4800 - a;  // +4800 keeps y positive and aligns the leap-year divisions
    int m = month + 12 * a - 3;
    int jdn = day + (153 * m + 2) / 5 + 365 * y + y / 4 - y / 100 + y / 400 - 32045;
    const int jdn_epoch = 2440588;  // Julian day number for 1970-01-01
    return jdn - jdn_epoch;
}
```
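As a sanity check (illustrative only, not driver code), the JDN arithmetic can be transcribed to Python and compared against `datetime.date`, which counts days in the proleptic Gregorian calendar. Note that the formula needs the `+4800` year offset so the integer divisions land on leap-year boundaries and the absolute result matches `jdn_epoch = 2440588`:

```python
from datetime import date

def date_as_day_count(year: int, month: int, day: int) -> int:
    """Days since 1970-01-01 via Julian day number (Gregorian calendar)."""
    a = (14 - month) // 12
    y = year + 4800 - a  # +4800 offset aligns the leap-year divisions
    m = month + 12 * a - 3
    jdn = day + (153 * m + 2) // 5 + 365 * y + y // 4 - y // 100 + y // 400 - 32045
    return jdn - 2440588  # JDN of 1970-01-01

# Cross-check against datetime.date, including pre-epoch and leap-day cases
epoch = date(1970, 1, 1)
for d in (date(1970, 1, 1), date(1969, 12, 31), date(2000, 2, 29), date(2038, 1, 19)):
    assert date_as_day_count(d.year, d.month, d.day) == (d - epoch).days
```

Because the arithmetic is pure integer math, it is independent of the local timezone and of the platform's `time_t` range.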
I'm not sure if this is true. The tests don't indicate such an issue.
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
```cpp
std::string formatStr = formatStream.str();
size_t formatLen = formatStr.length() + 1;
columnFormats[i] = std::make_unique<char[]>(formatLen);
std::memcpy(columnFormats[i].get(), formatStr.c_str(), formatLen);
```
Code scanning / devskim (note): There are a number of conditions in which memcpy can introduce a vulnerability (mismatched buffer sizes, null pointers, etc.). More secure alternatives perform additional validation of the source and destination buffers.
```cpp
    target_vec->resize(target_vec->size() * 2);
}

std::memcpy(&(*target_vec)[start], &buffers.charBuffers[col - 1][idxRowSql * fetchBufferSize], dataLen);
```
Code scanning / devskim: same memcpy notice as above.
```cpp
while (target_vec->size() < start + utf8str.size()) {
    target_vec->resize(target_vec->size() * 2);
}
std::memcpy(&(*target_vec)[start], utf8str.data(), utf8str.size());
```
Code scanning / devskim: same memcpy notice as above.
Hi @ffelixg, thanks for raising this PR. Please allow us time to review and share our comments. We appreciate your diligence in strengthening this project.

Sumit
```cpp
std::string columnName = colMeta["ColumnName"].cast<std::string>();
size_t nameLen = columnName.length() + 1;
columnNamesCStr[i] = std::make_unique<char[]>(nameLen);
std::memcpy(columnNamesCStr[i].get(), columnName.c_str(), nameLen);
```
Code scanning / devskim: same memcpy notice as above.
```cpp
if (!columnFormats[i]) {
    size_t formatLen = format.length() + 1;
    columnFormats[i] = std::make_unique<char[]>(formatLen);
    std::memcpy(columnFormats[i].get(), format.c_str(), formatLen);
}
```
Code scanning / devskim: same memcpy notice as above.
```cpp
// so total length is value at index idxRowArrow
auto data_buf_len_total = buffersArrow.var[col][idxRowArrow];
auto dataBuffer = std::make_unique<uint8_t[]>(data_buf_len_total);
std::memcpy(dataBuffer.get(), buffersArrow.var_data[col].data(), data_buf_len_total);
```
Code scanning / devskim: same memcpy notice as above.
Hello @ffelixg, my team and I are in the process of reviewing your PR. While we are getting started, it would be great to have some preliminary information from you on the following items:

Regards,
Hello @sumitmsft, I'm happy to hear that.

Regards,
Work Item / Issue Reference
Summary
Hey, you mentioned in issue #130 that you were willing to consider community contributions for adding Apache Arrow support, so here you go. I have focused only on fetching data into Arrow structures from the database.
The function signatures I chose are:

- `arrow_batch(chunk_size=10000)`: Fetches a single `pyarrow.RecordBatch`; the base for the other two methods.
- `arrow(chunk_size=10000)`: Fetches the entire result set as a single `pyarrow.Table`.
- `arrow_reader(chunk_size=10000)`: Returns a `pyarrow.RecordBatchReader` for streaming results without loading the entire dataset into RAM.

Using `fetch_arrow...` instead of just `arrow...` could also be a good option, but I think the terse version is not too ambiguous.

Technical details
I am not very familiar with C++, but I did have some prior practice for this task from implementing my own ODBC driver in Zig (a very good language for projects like this!). The implementation is written almost entirely in C++ in the `FetchArrowBatch_wrap` function, which produces PyCapsules that are then consumed by `arrow_batch` and turned into actual Arrow objects.

The function itself is very large. I'm sure it could be factored in a better way, even sharing some code with the other methods of fetching, but my goal was to keep the whole thing as straightforward as possible.
I have also implemented my own loop around SQLGetData for LOB columns. Unlike with the Python fetch methods, I don't use the result directly, but instead copy it into the same buffer I would use in the bound-columns case. Maybe that's an abstraction that would make sense for that case as well.
Notes on data types
I noticed that you use SQL_C_TYPE_TIME for time(x) columns. The Arrow fetch does the same, but I think it would be better to use SQL_C_SS_TIME2, since that supports fractional seconds.
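To illustrate what is at stake (plain `datetime`, not driver code): a SQL_C_TYPE_TIME struct carries only hour, minute, and second, so fetching a `time(7)` value through it drops the fractional seconds that a SQL_C_SS_TIME2-based fetch could preserve.

```python
from datetime import time

# A SQL Server time(7) value with fractional seconds
full = time(13, 45, 30, 123456)  # microsecond precision

# What a SQL_C_TYPE_TIME-style struct can represent: hour/minute/second only
truncated = full.replace(microsecond=0)

assert truncated == time(13, 45, 30)
assert full != truncated  # fractional seconds were lost
```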
Datetimeoffset is a bit tricky, since SQL Server stores timezone information alongside each cell, while Arrow tables expect a fixed timezone for the entire column. I don't really see any solution other than converting everything to UTC and returning a UTC column, so that's what I did.
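That normalization can be sketched with the standard library alone (the offsets here are example values, not driver code):

```python
from datetime import datetime, timedelta, timezone

# Two cells with different stored offsets, as SQL Server's datetimeoffset allows
cells = [
    datetime(2024, 6, 1, 12, 0, tzinfo=timezone(timedelta(hours=2))),
    datetime(2024, 6, 1, 12, 0, tzinfo=timezone(timedelta(hours=-5))),
]

# Normalize every cell to UTC so the whole column shares one fixed timezone,
# matching Arrow's timestamp-with-timezone model
utc_column = [c.astimezone(timezone.utc) for c in cells]

assert [c.hour for c in utc_column] == [10, 17]
```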
SQL_C_CHAR columns get copied directly into Arrow utf8 arrays. Maybe some encoding options would be useful.
Performance
I think the main performance win to be gained is not interacting with any Python data structures in the hot path. That is satisfied. Further optimizations, which I did not make, are:

Instead of looping over rows and columns and then switching on the data type for each cell, you could

Overall the Arrow performance seems not too far off from what I achieved with zodbc.