
Conversation

ffelixg commented Nov 30, 2025

Work Item / Issue Reference

GitHub Issue: #130


Summary

Hey, you mentioned in issue #130 that you were willing to consider community contributions for adding Apache Arrow support, so here you go. I have focused only on fetching data from the database into Arrow structures.

The function signatures I chose are:

  • arrow_batch(chunk_size=10000): Fetches a single pyarrow.RecordBatch; this is the basis for the other two methods.
  • arrow(chunk_size=10000): Fetches the entire result set as a single pyarrow.Table.
  • arrow_reader(chunk_size=10000): Returns a pyarrow.RecordBatchReader for streaming results without loading the entire dataset into RAM.

Using fetch_arrow... instead of just arrow... could also be a good option, but I think the terse version is not too ambiguous.

Technical details

I am not very familiar with C++, but I did get some prior practice for this task from implementing my own ODBC driver in Zig (a very good language for projects like this!). The implementation is written almost entirely in C++, in the FetchArrowBatch_wrap function, which produces PyCapsules that are then consumed by arrow_batch and turned into actual Arrow objects.
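To sketch the capsule handoff (illustrative names, not the PR's exact code; the ArrowArray layout and the "arrow_array" capsule name come from the Arrow C Data Interface and the Arrow PyCapsule interface specs):

#include <Python.h>
#include <cstdint>
#include <cstdlib>

// ArrowArray as defined by the Arrow C Data Interface (arrow/c/abi.h).
struct ArrowArray {
    int64_t length;
    int64_t null_count;
    int64_t offset;
    int64_t n_buffers;
    int64_t n_children;
    const void** buffers;
    struct ArrowArray** children;
    struct ArrowArray* dictionary;
    void (*release)(struct ArrowArray*);
    void* private_data;
};

// Capsule destructor: if the Python side never imported the capsule,
// release the Arrow structure and free it. An importer that takes
// ownership sets `release` to null, so this is safe either way.
static void ReleaseArrowArrayCapsule(PyObject* capsule) {
    auto* array = static_cast<ArrowArray*>(
        PyCapsule_GetPointer(capsule, "arrow_array"));
    if (array != nullptr) {
        if (array->release != nullptr) {
            array->release(array);
        }
        std::free(array);
    }
}

// Hand a heap-allocated, fully populated ArrowArray to Python; pyarrow
// can then import it zero-copy on the arrow_batch side.
PyObject* MakeArrowArrayCapsule(ArrowArray* array) {
    return PyCapsule_New(array, "arrow_array", ReleaseArrowArrayCapsule);
}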

The function itself is very large. I'm sure it could be factored in a better way, even sharing some code with the other fetch methods, but my goal was to keep the whole thing as straightforward as possible.

I have also implemented my own loop over SQLGetData for LOB columns. Unlike with the Python fetch methods, I don't use the result directly, but instead copy it into the same buffer I would use in the bound-columns case. Maybe that's an abstraction that would make sense there as well.
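Roughly, that loop looks like this (a hedged sketch with illustrative names and SQL_C_BINARY retrieval, not the exact code in the PR):

#include <sql.h>
#include <sqlext.h>
#include <vector>

// Fetch one LOB cell chunk by chunk, appending into the same growable
// buffer used for bound columns. Names here are illustrative only.
bool FetchLobCell(SQLHSTMT hStmt, SQLUSMALLINT col, std::vector<char>& out) {
    char chunk[8192];
    SQLLEN indicator = 0;
    SQLRETURN rc;
    while ((rc = SQLGetData(hStmt, col, SQL_C_BINARY, chunk,
                            sizeof(chunk), &indicator)) != SQL_NO_DATA) {
        if (!SQL_SUCCEEDED(rc)) {
            return false;  // driver error; real code would log diagnostics
        }
        if (indicator == SQL_NULL_DATA) {
            return true;   // NULL cell, nothing to copy
        }
        // SQL_SUCCESS_WITH_INFO means the chunk buffer was filled and more
        // data remains; on the final SQL_SUCCESS, the indicator holds the
        // size of the last partial chunk.
        size_t got = (rc == SQL_SUCCESS_WITH_INFO)
                         ? sizeof(chunk)
                         : static_cast<size_t>(indicator);
        out.insert(out.end(), chunk, chunk + got);
    }
    return true;
}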

Notes on data types

I noticed that you use SQL_C_TYPE_TIME for time(x) columns. The Arrow fetch does the same, but I think it would be better to use SQL_C_SS_TIME2, since that supports fractional seconds.
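For reference, SQL_C_SS_TIME2 fills the SQL_SS_TIME2_STRUCT from msodbcsql.h, whose fraction field carries nanoseconds; a sketch of the conversion an Arrow time64[ns] column would then need, assuming that layout:

#include <cstdint>
#include <sql.h>

// SQL_SS_TIME2_STRUCT as declared in msodbcsql.h; unlike SQL_TIME_STRUCT
// (used with SQL_C_TYPE_TIME), it carries fractional seconds.
typedef struct tagSS_TIME2_STRUCT {
    SQLUSMALLINT hour;
    SQLUSMALLINT minute;
    SQLUSMALLINT second;
    SQLUINTEGER  fraction;  // nanoseconds
} SQL_SS_TIME2_STRUCT;

// Map a time(x) value onto Arrow time64[ns]: nanoseconds since midnight.
int64_t time2AsNanos(const SQL_SS_TIME2_STRUCT& t) {
    int64_t seconds = int64_t(t.hour) * 3600 + t.minute * 60 + t.second;
    return seconds * 1000000000LL + t.fraction;
}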

Datetimeoffset is a bit tricky, since SQL Server stores timezone information alongside each cell, while Arrow tables expect a fixed timezone for the entire column. I don't really see any solution other than converting everything to UTC and returning a UTC column, so that's what I did.
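A minimal sketch of that normalization, assuming the SQL_SS_TIMESTAMPOFFSET_STRUCT layout from msodbcsql.h and a hypothetical timezone-free daysSinceEpoch() helper:

#include <cstdint>
#include <sql.h>

// SQL_SS_TIMESTAMPOFFSET_STRUCT as declared in msodbcsql.h.
typedef struct tagSS_TIMESTAMPOFFSET_STRUCT {
    SQLSMALLINT  year;
    SQLUSMALLINT month;
    SQLUSMALLINT day;
    SQLUSMALLINT hour;
    SQLUSMALLINT minute;
    SQLUSMALLINT second;
    SQLUINTEGER  fraction;        // nanoseconds
    SQLSMALLINT  timezone_hour;   // signed offset hours
    SQLSMALLINT  timezone_minute; // assumed to carry the same sign
} SQL_SS_TIMESTAMPOFFSET_STRUCT;

// Hypothetical timezone-free civil-date helper: days since 1970-01-01.
int64_t daysSinceEpoch(int year, int month, int day);

// Normalize one datetimeoffset cell to UTC microseconds since the epoch,
// suitable for an Arrow timestamp[us, tz=UTC] column.
int64_t dtoAsUtcMicros(const SQL_SS_TIMESTAMPOFFSET_STRUCT& v) {
    int64_t secs = daysSinceEpoch(v.year, v.month, v.day) * 86400
                 + int64_t(v.hour) * 3600 + v.minute * 60 + v.second;
    // Subtract the stored offset to move the wall-clock value to UTC.
    secs -= int64_t(v.timezone_hour) * 3600 + int64_t(v.timezone_minute) * 60;
    return secs * 1000000 + v.fraction / 1000;
}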

SQL_C_CHAR columns get copied directly into Arrow utf8 arrays. Maybe some encoding options would be useful.

Performance

I think the main performance win to be gained is not interacting with any Python data structures in the hot path, and that is satisfied. Further optimizations, which I did not make, are:

  • Releasing the GIL for the entire fetch loop
  • Sharing the bound fetch buffer across repeated fetch calls
  • Improving the hot-loop switching

Instead of looping over rows and columns and then switching on the data type for each cell, you could:

  • Put the row loop inside each switch case (fastest, I think, but it would bloat the code a lot more)
  • Use function pointers, as you recently did for Python fetching (this has some overhead because of the indirect function call, I think, and the code is more scattered)
  • Replace both loops and the switch with computed gotos (see the sketch after this list). That's what I opted for in my ODBC driver (the Zig equivalent is a labeled switch), and I am quite happy with how it came out. Performance seems very good, and it allows you to abstract the fetching process on a row-by-row basis. I don't know how well that would translate to C++.
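To make the computed-goto option concrete, here is a hedged sketch using the GCC/Clang labels-as-values extension (non-standard C++, and not what this PR implements); the label table and names are purely illustrative:

// Computed-goto dispatch over cells: one jump target per column, chosen
// by its C type, plus a trailing row-end sentinel, so the hot loop never
// re-enters a switch. Requires the GCC/Clang "labels as values" extension.
void convertRows(size_t numRows, size_t numCols, const int* colLabel) {
    // colLabel has numCols + 1 entries; colLabel[numCols] indexes kRowEnd.
    (void)numCols;
    void* labels[] = { &&kInt32, &&kFloat64, &&kRowEnd };
    size_t row = 0, col = 0;
    if (numRows == 0) return;
    goto *labels[colLabel[0]];
kInt32:
    /* append the int32 cell at (row, col) to its Arrow buffer */
    goto *labels[colLabel[++col]];
kFloat64:
    /* append the float64 cell at (row, col) */
    goto *labels[colLabel[++col]];
kRowEnd:
    col = 0;
    if (++row < numRows) goto *labels[colLabel[0]];
}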

Overall, the Arrow performance seems not too far off from what I achieved with zodbc.

Copilot AI review requested due to automatic review settings November 30, 2025 21:00
ffelixg (Author) commented Nov 30, 2025

@microsoft-github-policy-service agree

Copilot finished reviewing on behalf of ffelixg November 30, 2025 21:04
Copilot AI (Contributor) left a comment

Pull request overview

This PR adds Apache Arrow fetch support to the mssql-python driver, enabling efficient columnar data retrieval from SQL Server. The implementation provides three new cursor methods (arrow_batch(), arrow(), and arrow_reader()) that convert result sets into Apache Arrow data structures using the Arrow C Data Interface, bypassing Python object creation in the hot path for improved performance.

Key changes:

  • Implemented Arrow fetch functionality in C++ that directly converts ODBC result sets to Arrow format
  • Added three Python API methods for different Arrow data consumption patterns (single batch, full table, streaming reader)
  • Added comprehensive test coverage for various data types, LOB columns, and edge cases

Reviewed changes

Copilot reviewed 3 out of 4 changed files in this pull request and generated 9 comments.

Changed files:

  • mssql_python/pybind/ddbc_bindings.cpp: core C++ implementation; adds the FetchArrowBatch_wrap() function with Arrow C Data Interface structures, column buffer management, data type conversion logic, and memory management for Arrow structures
  • mssql_python/cursor.py: Python API layer; adds the arrow_batch(), arrow(), and arrow_reader() methods that wrap the C++ bindings and handle pyarrow imports
  • tests/test_004_cursor.py: comprehensive test suite covering wide tables, LOB columns, individual data types, empty result sets, datetime handling, and batch operations
  • requirements.txt: adds pyarrow as a dependency for development and testing


Comment on lines 4152 to 4164
// Convert SQL_DATE_STRUCT to Arrow Date32 (days since epoch)
std::tm tm_date = {};
tm_date.tm_year = year - 1900;  // tm_year is years since 1900
tm_date.tm_mon = month - 1;     // tm_mon is 0-11
tm_date.tm_mday = day;

std::time_t time_since_epoch = std::mktime(&tm_date);
if (time_since_epoch == -1) {
    LOG("Failed to convert SQL_DATE_STRUCT to time_t");
    ThrowStdException("Date conversion error");
}
// Calculate days since epoch
return time_since_epoch / 86400;
Copilot AI commented Nov 30, 2025
The dateAsDayCount function uses std::mktime, which interprets the date in the local timezone and may apply DST adjustments. This can cause incorrect day counts, especially around DST transitions. Arrow Date32 should represent dates as days since the Unix epoch (1970-01-01) in UTC, not local time.

Additionally, std::mktime is not guaranteed to work correctly for dates outside the system's time_t range (often limited on 32-bit systems).

Consider using a proper date calculation algorithm that doesn't depend on timezone:

int32_t dateAsDayCount(SQLUSMALLINT year, SQLUSMALLINT month, SQLUSMALLINT day) {
    // Algorithm to calculate days since Unix epoch (1970-01-01) without timezone dependency
    // Using the formula for Julian day number conversion
    int a = (14 - month) / 12;
    int y = year - a;
    int m = month + 12 * a - 3;
    int jdn = day + (153 * m + 2) / 5 + 365 * y + y / 4 - y / 100 + y / 400 - 32045;
    const int jdn_epoch = 2440588;  // Julian day number for 1970-01-01
    return jdn - jdn_epoch;
}
ffelixg (Author) replied:

I'm not sure if this is true. The tests don't indicate such an issue.

ffelixg and others added 6 commits November 30, 2025 22:32
std::string formatStr = formatStream.str();
size_t formatLen = formatStr.length() + 1;
columnFormats[i] = std::make_unique<char[]>(formatLen);
std::memcpy(columnFormats[i].get(), formatStr.c_str(), formatLen);

Check notice (Code scanning / devskim): Problematic C function detected (memcpy). There are a number of conditions in which memcpy can introduce a vulnerability (mismatched buffer sizes, null pointers, etc.); more secure alternatives perform additional validation of the source and destination buffers.
    target_vec->resize(target_vec->size() * 2);
}

std::memcpy(&(*target_vec)[start], &buffers.charBuffers[col - 1][idxRowSql * fetchBufferSize], dataLen);

Check notice (Code scanning / devskim): Problematic C function detected (memcpy).
    target_vec->resize(target_vec->size() * 2);
}

std::memcpy(&(*target_vec)[start], &buffers.charBuffers[col - 1][idxRowSql * fetchBufferSize], dataLen);

Check notice (Code scanning / devskim): Problematic C function detected (memcpy).
while (target_vec->size() < start + utf8str.size()) {
    target_vec->resize(target_vec->size() * 2);
}
std::memcpy(&(*target_vec)[start], utf8str.data(), utf8str.size());

Check notice (Code scanning / devskim): Problematic C function detected (memcpy).
sumitmsft self-assigned this Dec 1, 2025
sumitmsft added the enhancement (New feature or request) label Dec 1, 2025
sumitmsft (Contributor) commented:

Hi @ffelixg

Thanks for raising this PR. Please allow us time to review and share our comments.

Appreciate your diligence in strengthening this project.

Sumit

sumitmsft added the inADO (under development) and community (PR or Issue raised by community members) labels Dec 1, 2025
std::string columnName = colMeta["ColumnName"].cast<std::string>();
size_t nameLen = columnName.length() + 1;
columnNamesCStr[i] = std::make_unique<char[]>(nameLen);
std::memcpy(columnNamesCStr[i].get(), columnName.c_str(), nameLen);

Check notice (Code scanning / devskim): Problematic C function detected (memcpy).
if (!columnFormats[i]) {
size_t formatLen = format.length() + 1;
columnFormats[i] = std::make_unique<char[]>(formatLen);
std::memcpy(columnFormats[i].get(), format.c_str(), formatLen);

Check notice (Code scanning / devskim): Problematic C function detected (memcpy).
// so total length is value at index idxRowArrow
auto data_buf_len_total = buffersArrow.var[col][idxRowArrow];
auto dataBuffer = std::make_unique<uint8_t[]>(data_buf_len_total);
std::memcpy(dataBuffer.get(), buffersArrow.var_data[col].data(), data_buf_len_total);

Check notice (Code scanning / devskim): Problematic C function detected (memcpy).

sumitmsft commented Dec 4, 2025

Hello @ffelixg

My team and I are in the process of reviewing your PR. While we are getting started, it would be great to have some preliminary information from you on the following items:

  1. Have you created any design document for this feature (high/low level)? Could you please attach it here or share it with us at the email address mentioned below?
  2. What is your motivation to bring the support for Arrow in mssql-python? Could you help us understand the use case(s) you're trying to address?
  3. Is there a way to connect with you over a Microsoft Teams call, so that we can work closely on this feature together? You can reach us at mssql-python@microsoft.com with your contact details and your consent to be contacted.

Regards,
Sumit


ffelixg commented Dec 4, 2025

Hello @sumitmsft,

I'm happy to hear that.

  1. I don't have any design document beyond what I wrote in the PR description. Are there any areas in particular you would like me to provide more information on?
  2. I assume the motivation is mostly in line with what most Arrow users like about Arrow. Mainly, I believe Arrow is the right format for anything that works with batches of data and has C extensions on both the producer and consumer side. For example, Arrow gives you great interop with tools like duckdb, polars, and pandas on the analytics/ML side. I also want Python to be the obvious one-stop shop for ETL workloads, and for that, plain Python types don't work well, both for performance and for reliability. You still have plenty of situations, though, where you want to fetch one result set with Python types and the next with Arrow types, so the support has to live in the same driver.
  3. Yes, for sure. I have sent you an email.

Regards,
Felix


Labels

community (PR or Issue raised by community members), enhancement (New feature or request), inADO (under development)
