Dev #230
Conversation
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Feat: Added AWS Athena data loader
```python
return jsonify({
    "status": "error",
    "message": result.get('content', 'Unknown error during transformation')
}), 400
```
Check warning — Code scanning / CodeQL: Information exposure through an exception (Medium) · Stack trace information
Copilot Autofix · AI, about 14 hours ago
To fix this, we should stop returning the detailed exception-derived message from the sandbox directly to the client, and instead (a) log the detailed error on the server, and (b) return a generic, non-sensitive message in the HTTP response. This applies at two levels:

- In `py_sandbox.run_in_main_process`, instead of building a verbose `error_message` that is later surfaced to the client, we should build both:
  - a detailed log message (kept server-side), and
  - a safer, generic or lightly sanitized error string intended to be propagated outward.
- In `agent_routes.refresh_derived_data`, we should ensure that the `"message"` field does not echo the detailed sandbox error verbatim, but only a generic message (optionally including a short code like the exception type, which is typically safe).

Because CodeQL's taint tracking starts at `err` in `run_in_main_process` and flows through `error_message` to `result['error_message']` and then all the way to `refresh_derived_data`, the most robust fix is to break that propagation path. Concretely:

- Modify `run_in_main_process`:
  - Capture the exception details, including `traceback.format_exc()`, in a local variable.
  - Return a response object where `error_message` is a generic, non-detailed string (for example, `"Execution failed due to an error in the transformation code."` or at most `"Execution failed with ValueError"`). The detailed traceback is not included in the returned structure (or, if necessary for internal use, is kept in a separate key clearly not used for user-facing messages; given only the shown code, we'll avoid returning it altogether).
- Optionally, if `py_sandbox.py` is not using logging yet and we are allowed to introduce it, we could add basic logging there; but since the logger setup in that file was not shown and the alert is about exposure to the user, not missing logging, the minimal compliant fix is to keep detailed info local and not return it.
- Modify `refresh_derived_data` in `agent_routes.py`:
  - When `result['status'] != 'ok'`, stop returning `result.get('content', ...)` directly. Instead, send a generic error message like `"Error executing transformation code"` (and optionally a hint: `"Check your code and try again."`), independent of `result['content']`.

This preserves the existing function behavior in terms of control flow (success vs. error) and the structure of the returned JSON (`status`, `rows`, `message` keys), but prevents internal exception details from being exposed.
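The log-details-locally, return-generic-message pattern described above can be sketched as a minimal standalone example (this is an illustration of the pattern, not the project's actual code; `run_user_code` and its return shape are hypothetical, loosely mirroring the `{'status': ..., 'error_message': ...}` convention used in this PR):

```python
import logging
import traceback

logger = logging.getLogger(__name__)

def run_user_code(code: str, env: dict) -> dict:
    """Execute user code; on failure, log full details server-side and
    return only a generic, non-sensitive message to the caller."""
    try:
        exec(code, env)
    except Exception as err:
        # Full traceback stays in the server log only.
        logger.error("Sandbox execution failed:\n%s", traceback.format_exc())
        # The exception type name is typically safe to surface; the message is not.
        return {
            "status": "error",
            "error_message": f"Execution failed due to an error in the transformation code ({type(err).__name__}).",
        }
    return {"status": "ok"}
```

The key property is that nothing derived from `str(err)` or the traceback ever enters the returned structure, so no downstream handler can accidentally echo it to a client.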
```diff
@@ -778,9 +778,10 @@
                 "message": "Successfully refreshed derived data"
             })
         else:
+            # Do not expose detailed sandbox error information to the client.
             return jsonify({
                 "status": "error",
-                "message": result.get('content', 'Unknown error during transformation')
+                "message": "Error executing transformation code. Please check your code and try again."
             }), 400

     except Exception as e:
```
```diff
@@ -106,8 +106,11 @@
     try:
         exec(code, restricted_globals)
     except Exception as err:
-        error_message = f"Error: {type(err).__name__} - {str(err)}"
-        return {'status': 'error', 'error_message': error_message}
+        # Build a generic, non-sensitive error message for callers.
+        generic_message = f"Execution failed due to an error in the transformation code ({type(err).__name__})."
+        # Note: full traceback and error details are intentionally not returned to callers
+        # to avoid leaking internal information. They should be logged by the caller if needed.
+        return {'status': 'error', 'error_message': generic_message}

     return {'status': 'ok', 'allowed_objects': {key: restricted_globals[key] for key in allowed_objects}}
```
```python
return jsonify({
    "status": "error",
    "message": str(e)
}), 400
```
Check warning — Code scanning / CodeQL: Information exposure through an exception (Medium) · Stack trace information
Copilot Autofix · AI, about 14 hours ago
In general, to fix information exposure through exceptions, log detailed error information (including stack traces) only on the server side and return a generic, non-sensitive message to the client. Avoid echoing `str(e)` or any stack trace data back in HTTP responses.
For this specific endpoint (`refresh_derived_data` in `py-src/data_formulator/agent_routes.py`, lines 786–792), we should keep the logging behavior but replace the client-facing `"message": str(e)` with a generic message such as "An internal error occurred while refreshing derived data." or similar. This preserves existing functionality (the client still gets an "error" status and a message field) while removing the potential exposure of internal exception details. No new imports or helper methods are required; we only adjust the JSON payload in the except block. The rest of the function remains unchanged.
Concretely:
- Keep:
  - `logger.error(f"Error refreshing derived data: {str(e)}")`
  - `logger.error(traceback.format_exc())`
- Change:
  - In the `jsonify` call at lines 789–792, replace `str(e)` with a fixed generic message string.
```diff
@@ -788,5 +788,5 @@
         logger.error(traceback.format_exc())
         return jsonify({
             "status": "error",
-            "message": str(e)
+            "message": "An internal error occurred while refreshing derived data."
         }), 400
```
Pull request overview
This PR relocates and updates the derived-table refresh API, expands supported external data loaders (notably AWS Athena), and improves the refresh/streaming UX in both backend and frontend.
Changes:
- Moved derived table refresh to `/api/agent/refresh-derived-data` and updated frontend routing accordingly.
- Added `AthenaDataLoader` and registered it in the data loader registry + docs.
- Adjusted refresh messaging behavior and enhanced UI table metadata display; expanded demo streaming limits/buffers.
Reviewed changes
Copilot reviewed 10 out of 11 changed files in this pull request and generated 8 comments.
Show a summary per file
| File | Description |
|---|---|
| src/views/MessageSnackbar.tsx | Removes unused state/vars related to message UI behavior. |
| src/views/DBTableManager.tsx | Displays imported-table provenance (data loader type + source table). |
| src/app/utils.tsx | Updates derived-refresh URL and removes old tables endpoint from session-required list. |
| src/app/useDataRefresh.tsx | Removes throttling of refresh messages; dispatches refresh status messages immediately. |
| py-src/data_formulator/tables_routes.py | Removes the old /api/tables/refresh-derived-data endpoint implementation. |
| py-src/data_formulator/agent_routes.py | Adds the new /api/agent/refresh-derived-data endpoint implementation. |
| py-src/data_formulator/demo_stream_routes.py | Increases ISS history buffer and loosens ISS query limits for demo streaming. |
| py-src/data_formulator/data_loader/athena_data_loader.py | Adds new AWS Athena integration for listing/ingesting/query sampling into DuckDB. |
| py-src/data_formulator/data_loader/__init__.py | Registers Athena loader in DATA_LOADERS and exports it. |
| py-src/data_formulator/data_loader/README.md | Documents additional loader integrations including Athena. |
```diff
 # Thread-safe storage for ISS position history
 _iss_track_lock = threading.Lock()
-_iss_track_history: deque = deque(maxlen=500)    # Keep last 500 positions (~40 min at 5s intervals)
+_iss_track_history: deque = deque(maxlen=10000)  # Keep last 10000 positions (~20000 min at 5s intervals)
```
Copilot AI · Jan 30, 2026
The inline comment about how much history is retained is incorrect: 10,000 points at 5s intervals is ~833 minutes (~13.9 hours), not “~20000 min”. Please fix the comment to avoid misleading future maintenance.
```suggestion
_iss_track_history: deque = deque(maxlen=10000)  # Keep last 10000 positions (~833 min / ~14 hours at 5s intervals)
```
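The corrected retention figure follows from simple arithmetic (a quick sanity check, not project code):

```python
positions = 10_000
interval_seconds = 5

# 10000 positions * 5 s each = 50000 s of history
retained_minutes = positions * interval_seconds / 60  # ≈ 833.3 minutes
retained_hours = retained_minutes / 60                # ≈ 13.9 hours

print(round(retained_minutes), round(retained_hours, 1))  # → 833 13.9
```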
```python
minutes = min(1440, max(1, int(request.args.get('minutes', 1440))))
limit = min(10000, max(1000, int(request.args.get('limit', 10000))))
```
Copilot AI · Jan 30, 2026
`int(request.args.get(...))` will raise `ValueError` for non-numeric query params, causing a 500. Since this endpoint is user-facing, please parse `minutes`/`limit` defensively (e.g., try/except with a 400 or default) and also update the docstring above, which still documents the old defaults/max values.
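One way to implement the defensive parsing this comment asks for (a sketch; `parse_bounded_int` is a hypothetical helper, not an existing function in this repo):

```python
def parse_bounded_int(args, name, default, lo, hi):
    """Parse an integer query parameter defensively: fall back to the
    default on missing or non-numeric input, then clamp to [lo, hi]."""
    try:
        value = int(args.get(name, default))
    except (TypeError, ValueError):
        value = default
    return min(hi, max(lo, value))

# Usage inside the endpoint would then look like:
#   minutes = parse_bounded_int(request.args, 'minutes', 1440, 1, 1440)
#   limit = parse_bounded_int(request.args, 'limit', 10000, 1000, 10000)
```

Falling back to the default (rather than returning a 400) keeps the endpoint forgiving for demo use; either choice avoids the unhandled `ValueError`.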
```python
data = request.get_json()
```
Copilot AI · Jan 30, 2026
`data = request.get_json()` can be `None` (or non-dict) if the client sends invalid JSON, and then `data.get(...)` will raise. Please validate `request.is_json` (and that `data` is a dict) and return a clear 400 like other agent endpoints (e.g., "Invalid request format").
```suggestion
if not request.is_json:
    return jsonify({
        "status": "error",
        "message": "Invalid request format"
    }), 400
data = request.get_json()
if not isinstance(data, dict):
    return jsonify({
        "status": "error",
        "message": "Invalid request format"
    }), 400
```
| "message": str(e) | ||
| }), 400 |
Copilot AI · Jan 30, 2026
The exception handler returns `message: str(e)` with HTTP 400. This can leak internal details (tracebacks/paths/sandbox errors) to the client and also misclassifies unexpected server failures as a client error. Please sanitize the error message (and consider a generic message for unexpected exceptions) and return an appropriate status code (typically 500).
| "message": str(e) | |
| }), 400 | |
| "message": "An unexpected error occurred while refreshing derived data." | |
| }), 500 |
```python
# Set AWS credentials for DuckDB
self.duck_db_conn.execute(f"SET s3_region='{self.region_name}'")
self.duck_db_conn.execute(f"SET s3_access_key_id='{self.aws_access_key_id}'")
self.duck_db_conn.execute(f"SET s3_secret_access_key='{self.aws_secret_access_key}'")
if self.aws_session_token:
    self.duck_db_conn.execute(f"SET s3_session_token='{self.aws_session_token}'")
```
Copilot AI · Jan 30, 2026
These DuckDB `SET ...='...'` statements interpolate credential values directly into SQL. If any value contains a single quote, this will break the statement and can create SQL-injection style issues. Please escape/quote these values safely before embedding (e.g., reuse `_escape_sql_string(...)`), or use a safer DuckDB API for setting secrets if available.
```suggestion
# Set AWS credentials for DuckDB using parameterized queries to avoid SQL injection and quoting issues
self.duck_db_conn.execute("SET s3_region=?", [self.region_name])
self.duck_db_conn.execute("SET s3_access_key_id=?", [self.aws_access_key_id])
self.duck_db_conn.execute("SET s3_secret_access_key=?", [self.aws_secret_access_key])
if self.aws_session_token:
    self.duck_db_conn.execute("SET s3_session_token=?", [self.aws_session_token])
```
```diff
@@ -34,18 +34,15 @@ export interface Message {
 }

 export function MessageSnackbar() {
```
Copilot AI · Jan 30, 2026
`useTheme` is still imported but the `theme` variable was removed, so the import is now unused. Please remove the unused `useTheme` import (and any related dead code) to keep the module clean.
```tsx
<Typography component="span" sx={{fontSize: 12, color: 'text.secondary'}}>
    {currentTable.source_metadata && `imported from ${currentTable.source_metadata.data_loader_type}.${currentTable.source_metadata.source_table_name}`}
</Typography>
```
Copilot AI · Jan 30, 2026
`source_table_name` is optional in the `source_metadata` type, but this string interpolation assumes it's always present. If it's missing, the UI will render `imported from <type>.undefined`. Consider guarding on both fields (or using fallbacks / optional chaining) and formatting without a trailing dot when `source_table_name` is absent.
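The guarding logic this comment suggests can be sketched language-agnostically (a Python sketch of the idea; the field names mirror the TSX snippet above, and `format_provenance` is a hypothetical helper, not code from this PR):

```python
def format_provenance(meta):
    """Build the 'imported from <loader>.<table>' label,
    guarding both optional fields."""
    if not meta or not meta.get("data_loader_type"):
        return ""
    loader = meta["data_loader_type"]
    table = meta.get("source_table_name")
    # Omit the trailing ".<table>" segment when source_table_name is absent.
    return f"imported from {loader}.{table}" if table else f"imported from {loader}"
```

The TSX equivalent would use optional chaining plus a conditional template for the `.${source_table_name}` suffix.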
```tsx
dispatch(dfActions.addMessages({
    timestamp: Date.now(),
    component: 'Data Refresh',
    type: 'info',
    value: `Table "${table.displayId || table.id}" data refreshed (${result.newRows.length} rows)`
```
Copilot AI · Jan 30, 2026
With throttling removed, auto-refresh can append messages on every refresh tick, and `dfSlice.addMessages` currently grows `state.messages` unbounded. This can lead to memory growth and slower renders (e.g., `MessageSnackbar` groups messages by iterating the full array). Consider adding a max-length cap in the reducer (drop oldest), or coalescing per-table refresh messages instead of always appending.
This pull request introduces a new endpoint for refreshing derived data tables, improves data loader support and documentation, and makes UX and backend improvements for data refresh and streaming features. The most significant changes are the addition and relocation of the derived data refresh endpoint, expanded data loader integrations, and UI/UX enhancements for table metadata and messaging.
Backend API and Data Refresh Improvements:
- Added a new `/api/agent/refresh-derived-data` endpoint in `agent_routes.py` to re-run Python transformation code for derived tables, replacing the previous implementation in `tables_routes.py`. This endpoint validates input, executes transformation code in a sandbox, and returns refreshed data or error messages. [1] [2]
- Updated the frontend routing constant (`REFRESH_DERIVED_DATA`) for derived table refreshes and removed the old endpoint from the session-required list. [1] [2]

Data Loader Integrations and Documentation:
- Added `AthenaDataLoader` support and included it in the `DATA_LOADERS` registry and `__all__` export, enabling AWS Athena integration. [1] [2]

Frontend Data Refresh and Messaging Enhancements:
- Removed message throttling in `useDataRefresh` and `useDerivedTableRefresh`, ensuring all refresh status messages are dispatched immediately for better user feedback. [1] [2] [3] [4] [5] [6]

Streaming and Demo Data Improvements: