Conversation

@Chenglong-MS
Collaborator

This pull request introduces a new endpoint for refreshing derived data tables, improves data loader support and documentation, and makes UX and backend improvements for data refresh and streaming features. The most significant changes are the addition and relocation of the derived data refresh endpoint, expanded data loader integrations, and UI/UX enhancements for table metadata and messaging.

Backend API and Data Refresh Improvements:

  • Added a new /api/agent/refresh-derived-data endpoint in agent_routes.py to re-run Python transformation code for derived tables, replacing the previous implementation in tables_routes.py. This endpoint validates input, executes transformation code in a sandbox, and returns refreshed data or error messages.
  • Updated the frontend to use the new endpoint (REFRESH_DERIVED_DATA) for derived table refreshes and removed the old endpoint from the session-required list.

Data Loader Integrations and Documentation:

  • Added AthenaDataLoader support and included it in the DATA_LOADERS registry and __all__ export, enabling AWS Athena integration.
  • Expanded the data loader documentation to list new integrations: Athena, BigQuery, PostgreSQL, MSSQL, S3, Azure Blob, and MongoDB.

Frontend Data Refresh and Messaging Enhancements:

  • Removed throttling logic for refresh-related messages in useDataRefresh and useDerivedTableRefresh, ensuring all refresh status messages are dispatched immediately for better user feedback.
  • Improved the display of table metadata in the UI by showing the data loader type and source table name for imported tables.

Streaming and Demo Data Improvements:

  • Increased the ISS position history buffer from 500 to 10,000 entries and extended the allowed time and row limits for ISS data demo endpoints, enhancing demo streaming capabilities.

Comment on lines +781 to +784
return jsonify({
    "status": "error",
    "message": result.get('content', 'Unknown error during transformation')
}), 400

Code scanning / CodeQL

Information exposure through an exception Medium

Stack trace information
flows to this location and may be exposed to an external user.

Copilot Autofix

AI about 14 hours ago

To fix this, we should stop returning the detailed exception-derived message from the sandbox directly to the client, and instead (a) log the detailed error on the server, and (b) return a generic, non-sensitive message in the HTTP response. This applies at two levels:

  1. In py_sandbox.run_in_main_process, instead of building a verbose error_message that is later surfaced to the client, we should build both:
    • a detailed log message (kept server-side), and
    • a safer, generic or lightly sanitized error string intended to be propagated outward.
  2. In agent_routes.refresh_derived_data, we should ensure that the "message" field does not echo the detailed sandbox error verbatim, but only a generic message (optionally including a short code like the exception type, which is typically safe).

Because CodeQL’s taint tracking starts at err in run_in_main_process and flows through error_message to result['error_message'] and then all the way to refresh_derived_data, the most robust fix is to break that propagation path. Concretely:

  • Modify run_in_main_process:
    • Capture the exception details, including traceback.format_exc(), in a local variable.
    • Return a response object where:
      • error_message is a generic, non-detailed string (for example, "Execution failed due to an error in the transformation code." or at most "Execution failed with ValueError").
      • The detailed traceback is not included in the returned structure (or, if necessary for internal use, is in a separate key clearly not used for user-facing messages; however, given only the shown code, we’ll avoid returning it altogether).
  • Optionally, if py_sandbox.py is not using logging yet and we are allowed to introduce it, we could add basic logging there, but since you did not show the logger setup in that file and the alert is about exposure to the user, not missing logging, the minimal compliant fix is to keep detailed info local and not return it.
  • Modify refresh_derived_data in agent_routes.py:
    • When result['status'] != 'ok', stop returning result.get('content', ...) directly.
    • Instead, send a generic error message like "Error executing transformation code" (and optionally a hint: "Check your code and try again."), independent of result['content'].

This preserves the existing function behavior in terms of control flow (success vs. error) and structure of returned JSON (status, rows, message keys), but prevents internal exception details from being exposed.

Suggested changeset 2
py-src/data_formulator/agent_routes.py

Autofix patch
Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/py-src/data_formulator/agent_routes.py b/py-src/data_formulator/agent_routes.py
--- a/py-src/data_formulator/agent_routes.py
+++ b/py-src/data_formulator/agent_routes.py
@@ -778,9 +778,10 @@
                 "message": "Successfully refreshed derived data"
             })
         else:
+            # Do not expose detailed sandbox error information to the client.
             return jsonify({
                 "status": "error",
-                "message": result.get('content', 'Unknown error during transformation')
+                "message": "Error executing transformation code. Please check your code and try again."
             }), 400
             
     except Exception as e:
EOF
py-src/data_formulator/py_sandbox.py
Outside changed files

Autofix patch
Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/py-src/data_formulator/py_sandbox.py b/py-src/data_formulator/py_sandbox.py
--- a/py-src/data_formulator/py_sandbox.py
+++ b/py-src/data_formulator/py_sandbox.py
@@ -106,8 +106,11 @@
     try:
         exec(code, restricted_globals)
     except Exception as err:
-        error_message = f"Error: {type(err).__name__} - {str(err)}"
-        return {'status': 'error', 'error_message': error_message}
+        # Build a generic, non-sensitive error message for callers.
+        generic_message = f"Execution failed due to an error in the transformation code ({type(err).__name__})."
+        # Note: full traceback and error details are intentionally not returned to callers
+        # to avoid leaking internal information. They should be logged by the caller if needed.
+        return {'status': 'error', 'error_message': generic_message}
 
     return {'status': 'ok', 'allowed_objects': {key: restricted_globals[key] for key in allowed_objects}}
 
EOF
Copilot is powered by AI and may make mistakes. Always verify output.
Comment on lines +789 to +792
return jsonify({
    "status": "error",
    "message": str(e)
}), 400

Code scanning / CodeQL

Information exposure through an exception Medium

Stack trace information
flows to this location and may be exposed to an external user.

Copilot Autofix

AI about 14 hours ago

In general, to fix information exposure through exceptions, log detailed error information (including stack traces) only on the server side and return a generic, non-sensitive message to the client. Avoid echoing str(e) or any stack trace data back in HTTP responses.

For this specific endpoint (refresh_derived_data in py-src/data_formulator/agent_routes.py, lines 786–792), we should keep the logging behavior but replace the client-facing "message": str(e) with a generic message such as "An internal error occurred while refreshing derived data." or similar. This preserves existing functionality (the client still gets an "error" status and a message field) while removing the potential exposure of internal exception details. No new imports or helper methods are required; we only adjust the JSON payload in the except block. The rest of the function remains unchanged.

Concretely:

  • Keep:
    • logger.error(f"Error refreshing derived data: {str(e)}")
    • logger.error(traceback.format_exc())
  • Change:
    • In the jsonify call at line 789–792, replace str(e) with a fixed generic message string.
Suggested changeset 1
py-src/data_formulator/agent_routes.py

Autofix patch
Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/py-src/data_formulator/agent_routes.py b/py-src/data_formulator/agent_routes.py
--- a/py-src/data_formulator/agent_routes.py
+++ b/py-src/data_formulator/agent_routes.py
@@ -788,5 +788,5 @@
         logger.error(traceback.format_exc())
         return jsonify({
             "status": "error",
-            "message": str(e)
+            "message": "An internal error occurred while refreshing derived data."
         }), 400
EOF
Contributor

Copilot AI left a comment


Pull request overview

This PR relocates and updates the derived-table refresh API, expands supported external data loaders (notably AWS Athena), and improves the refresh/streaming UX in both backend and frontend.

Changes:

  • Moved derived table refresh to /api/agent/refresh-derived-data and updated frontend routing accordingly.
  • Added AthenaDataLoader and registered it in the data loader registry + docs.
  • Adjusted refresh messaging behavior and enhanced UI table metadata display; expanded demo streaming limits/buffers.

Reviewed changes

Copilot reviewed 10 out of 11 changed files in this pull request and generated 8 comments.

Summary per file:

  • src/views/MessageSnackbar.tsx: Removes unused state/vars related to message UI behavior.
  • src/views/DBTableManager.tsx: Displays imported-table provenance (data loader type + source table).
  • src/app/utils.tsx: Updates derived-refresh URL and removes old tables endpoint from session-required list.
  • src/app/useDataRefresh.tsx: Removes throttling of refresh messages; dispatches refresh status messages immediately.
  • py-src/data_formulator/tables_routes.py: Removes the old /api/tables/refresh-derived-data endpoint implementation.
  • py-src/data_formulator/agent_routes.py: Adds the new /api/agent/refresh-derived-data endpoint implementation.
  • py-src/data_formulator/demo_stream_routes.py: Increases ISS history buffer and loosens ISS query limits for demo streaming.
  • py-src/data_formulator/data_loader/athena_data_loader.py: Adds new AWS Athena integration for listing/ingesting/query sampling into DuckDB.
  • py-src/data_formulator/data_loader/__init__.py: Registers Athena loader in DATA_LOADERS and exports it.
  • py-src/data_formulator/data_loader/README.md: Documents additional loader integrations including Athena.

# Thread-safe storage for ISS position history
_iss_track_lock = threading.Lock()
-_iss_track_history: deque = deque(maxlen=500) # Keep last 500 positions (~40 min at 5s intervals)
+_iss_track_history: deque = deque(maxlen=10000) # Keep last 10000 positions (~20000 min at 5s intervals)

Copilot AI Jan 30, 2026


The inline comment about how much history is retained is incorrect: 10,000 points at 5s intervals is ~833 minutes (~13.9 hours), not “~20000 min”. Please fix the comment to avoid misleading future maintenance.

Suggested change
-_iss_track_history: deque = deque(maxlen=10000) # Keep last 10000 positions (~20000 min at 5s intervals)
+_iss_track_history: deque = deque(maxlen=10000) # Keep last 10000 positions (~833 min / ~14 hours at 5s intervals)
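The corrected figure is easy to verify with quick arithmetic (a Python sketch; the 5-second sampling interval is taken from the inline comment itself):

```python
# Retention window implied by a fixed-size position buffer.
MAXLEN = 10_000      # deque capacity after the change
INTERVAL_S = 5       # seconds between samples, per the inline comment

total_seconds = MAXLEN * INTERVAL_S
minutes = total_seconds / 60
hours = minutes / 60

print(round(minutes))    # 833 minutes
print(round(hours, 1))   # 13.9 hours
```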

Copilot uses AI. Check for mistakes.
Comment on lines +146 to +147
minutes = min(1440, max(1, int(request.args.get('minutes', 1440))))
limit = min(10000, max(1000, int(request.args.get('limit', 10000))))

Copilot AI Jan 30, 2026


int(request.args.get(...)) will raise ValueError for non-numeric query params, causing a 500. Since this endpoint is user-facing, please parse minutes/limit defensively (e.g., try/except with a 400 or default) and also update the docstring above, which still documents the old defaults/max values.

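One defensive-parsing sketch for the pattern above (the helper name and usage are illustrative, mirroring the endpoint's clamping bounds; Flask's `request.args.get(..., type=int)`, which falls back to the default when conversion fails, is another option):

```python
def parse_bounded_int(raw, default, lo, hi):
    """Parse a query-string value defensively: fall back to the default
    on missing or non-numeric input, then clamp into [lo, hi]."""
    try:
        value = int(raw)
    except (TypeError, ValueError):
        value = default
    return min(hi, max(lo, value))

# Hypothetical usage mirroring the endpoint's bounds:
minutes = parse_bounded_int("abc", default=1440, lo=1, hi=1440)     # -> 1440 instead of a 500
limit = parse_bounded_int("500", default=10000, lo=1000, hi=10000)  # -> 1000 (clamped up)
```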
Comment on lines +731 to +732

data = request.get_json()

Copilot AI Jan 30, 2026


data = request.get_json() can be None (or non-dict) if the client sends invalid JSON, and then data.get(...) will raise. Please validate request.is_json (and that data is a dict) and return a clear 400 like other agent endpoints (e.g., “Invalid request format”).

Suggested change
-data = request.get_json()
+if not request.is_json:
+    return jsonify({
+        "status": "error",
+        "message": "Invalid request format"
+    }), 400
+data = request.get_json()
+if not isinstance(data, dict):
+    return jsonify({
+        "status": "error",
+        "message": "Invalid request format"
+    }), 400

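A minimal standalone sketch of the guard (hypothetical helper; the `code` field name is illustrative, not taken from the actual endpoint contract):

```python
def validate_refresh_payload(data):
    """Return (ok, error_message) for a parsed JSON body.
    None or non-dict bodies are rejected, matching the suggested guard."""
    if not isinstance(data, dict):
        return False, "Invalid request format"
    if not data.get("code"):  # 'code' is an illustrative field name
        return False, "Missing transformation code"
    return True, None

# request.get_json(silent=True) returns None on malformed JSON,
# so the helper covers that case too:
ok, err = validate_refresh_payload(None)            # (False, "Invalid request format")
ok, err = validate_refresh_payload({"code": "x"})   # (True, None)
```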
Comment on lines +791 to +792
"message": str(e)
}), 400

Copilot AI Jan 30, 2026


The exception handler returns message: str(e) with HTTP 400. This can leak internal details (tracebacks/paths/sandbox errors) to the client and also misclassifies unexpected server failures as a client error. Please sanitize the error message (and consider a generic message for unexpected exceptions) and return an appropriate status code (typically 500).

Suggested change
-    "message": str(e)
-}), 400
+    "message": "An unexpected error occurred while refreshing derived data."
+}), 500

Comment on lines +297 to +302
# Set AWS credentials for DuckDB
self.duck_db_conn.execute(f"SET s3_region='{self.region_name}'")
self.duck_db_conn.execute(f"SET s3_access_key_id='{self.aws_access_key_id}'")
self.duck_db_conn.execute(f"SET s3_secret_access_key='{self.aws_secret_access_key}'")
if self.aws_session_token:
    self.duck_db_conn.execute(f"SET s3_session_token='{self.aws_session_token}'")

Copilot AI Jan 30, 2026


These DuckDB SET ...='...' statements interpolate credential values directly into SQL. If any value contains a single quote, this will break the statement and can create SQL-injection style issues. Please escape/quote these values safely before embedding (e.g., reuse _escape_sql_string(...)), or use a safer DuckDB API for setting secrets if available.

Suggested change
-# Set AWS credentials for DuckDB
-self.duck_db_conn.execute(f"SET s3_region='{self.region_name}'")
-self.duck_db_conn.execute(f"SET s3_access_key_id='{self.aws_access_key_id}'")
-self.duck_db_conn.execute(f"SET s3_secret_access_key='{self.aws_secret_access_key}'")
-if self.aws_session_token:
-    self.duck_db_conn.execute(f"SET s3_session_token='{self.aws_session_token}'")
+# Set AWS credentials for DuckDB using parameterized queries to avoid SQL injection and quoting issues
+self.duck_db_conn.execute("SET s3_region=?", [self.region_name])
+self.duck_db_conn.execute("SET s3_access_key_id=?", [self.aws_access_key_id])
+self.duck_db_conn.execute("SET s3_secret_access_key=?", [self.aws_secret_access_key])
+if self.aws_session_token:
+    self.duck_db_conn.execute("SET s3_session_token=?", [self.aws_session_token])

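If parameter binding turns out not to be supported for SET statements in the DuckDB version in use, the standard SQL fallback is to double embedded single quotes before interpolating (a sketch; `escape_sql_string` is an illustrative name, in the spirit of the `_escape_sql_string` helper the comment mentions):

```python
def escape_sql_string(value: str) -> str:
    """Escape a value for a single-quoted SQL literal by doubling
    embedded single quotes (standard SQL escaping)."""
    return value.replace("'", "''")

# A hostile or awkward value can no longer terminate the literal early:
token = "abc'; DROP TABLE x; --"
stmt = f"SET s3_session_token='{escape_sql_string(token)}'"
# stmt == "SET s3_session_token='abc''; DROP TABLE x; --'"
```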
@@ -34,18 +34,15 @@ export interface Message {
}

export function MessageSnackbar() {

Copilot AI Jan 30, 2026


useTheme is still imported but the theme variable was removed, so the import is now unused. Please remove the unused useTheme import (and any related dead code) to keep the module clean.

Comment on lines +915 to +917
<Typography component="span" sx={{fontSize: 12, color: 'text.secondary'}}>
    {currentTable.source_metadata && `imported from ${currentTable.source_metadata.data_loader_type}.${currentTable.source_metadata.source_table_name}`}
</Typography>

Copilot AI Jan 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

source_table_name is optional in the source_metadata type, but this string interpolation assumes it’s always present. If it’s missing, the UI will render imported from <type>.undefined. Consider guarding on both fields (or using fallbacks / optional chaining) and formatting without a trailing dot when source_table_name is absent.

Comment on lines +221 to 225
dispatch(dfActions.addMessages({
    timestamp: Date.now(),
    component: 'Data Refresh',
    type: 'info',
    value: `Table "${table.displayId || table.id}" data refreshed (${result.newRows.length} rows)`

Copilot AI Jan 30, 2026


With throttling removed, auto-refresh can append messages on every refresh tick and dfSlice.addMessages currently grows state.messages unbounded. This can lead to memory growth and slower renders (e.g., MessageSnackbar groups messages by iterating the full array). Consider adding a max length cap in the reducer (drop oldest), or coalescing per-table refresh messages instead of always appending.
