Merged

Dev #230

2 changes: 1 addition & 1 deletion public/df_stock_prices_live.json

Large diffs are not rendered by default.

82 changes: 82 additions & 0 deletions py-src/data_formulator/agent_routes.py
@@ -17,6 +17,7 @@

import json
import html
import pandas as pd

from data_formulator.agents.agent_concept_derive import ConceptDeriveAgent
from data_formulator.agents.agent_py_concept_derive import PyConceptDeriveAgent
@@ -708,3 +709,84 @@
headers={ 'Access-Control-Allow-Origin': '*', }
)
return response


@agent_bp.route('/refresh-derived-data', methods=['POST'])
def refresh_derived_data():
    """
    Re-run Python transformation code with new input data to refresh a derived table.

    This endpoint takes:
    - input_tables: list of {name: string, rows: list} objects representing the parent tables
    - code: the Python transformation code to execute

    Returns:
    - status: 'ok' or 'error'
    - rows: the resulting rows if successful
    - message: error message if failed
    """
    try:
        from data_formulator.py_sandbox import run_transform_in_sandbox2020
        from flask import current_app

        data = request.get_json()
Comment on lines +731 to +732

Copilot AI Jan 30, 2026

data = request.get_json() can be None (or non-dict) if the client sends invalid JSON, and then data.get(...) will raise. Please validate request.is_json (and that data is a dict) and return a clear 400 like other agent endpoints (e.g., “Invalid request format”).

Suggested change
-        data = request.get_json()
+        if not request.is_json:
+            return jsonify({
+                "status": "error",
+                "message": "Invalid request format"
+            }), 400
+        data = request.get_json()
+        if not isinstance(data, dict):
+            return jsonify({
+                "status": "error",
+                "message": "Invalid request format"
+            }), 400

Copilot uses AI. Check for mistakes.
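
The validation this comment asks for can be sketched as a small pure function, kept separate from Flask so it is easy to test. This is only an illustration: `validate_refresh_payload` is a hypothetical helper, not part of the PR, and the error strings are copied from the endpoint and the suggestion above.

```python
from typing import Any, Optional, Tuple


def validate_refresh_payload(data: Any) -> Tuple[bool, Optional[str]]:
    """Return (True, None) for a usable payload, else (False, reason).

    Hypothetical helper: mirrors the checks in refresh_derived_data plus
    the "is it a dict" check the review comment requests.
    """
    # A JSON body such as `null` or `[1, 2]` parses fine but is not a dict.
    if not isinstance(data, dict):
        return False, "Invalid request format"

    input_tables = data.get("input_tables")
    if not isinstance(input_tables, list) or not input_tables:
        return False, "No input tables provided"

    code = data.get("code")
    if not isinstance(code, str) or not code.strip():
        return False, "No transformation code provided"

    # Every parent table must be an object with a non-empty list of rows.
    for table in input_tables:
        if not isinstance(table, dict):
            return False, "Invalid request format"
        rows = table.get("rows")
        if not isinstance(rows, list) or not rows:
            return False, f"Table '{table.get('name', '')}' has no rows"

    return True, None
```

A route could call this right after parsing the body and return the reason with a 400 when the first element is `False`.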
        input_tables = data.get('input_tables', [])
        code = data.get('code', '')

        if not input_tables:
            return jsonify({
                "status": "error",
                "message": "No input tables provided"
            }), 400

        if not code:
            return jsonify({
                "status": "error",
                "message": "No transformation code provided"
            }), 400

        # Convert input tables to pandas DataFrames
        df_list = []
        for table in input_tables:
            table_name = table.get('name', '')
            table_rows = table.get('rows', [])

            if not table_rows:
                return jsonify({
                    "status": "error",
                    "message": f"Table '{table_name}' has no rows"
                }), 400

            df = pd.DataFrame.from_records(table_rows)
            df_list.append(df)

        # Get exec_python_in_subprocess setting from app config
        exec_python_in_subprocess = current_app.config.get('CLI_ARGS', {}).get('exec_python_in_subprocess', False)

        # Run the transformation code
        result = run_transform_in_sandbox2020(code, df_list, exec_python_in_subprocess)

        if result['status'] == 'ok':
            result_df = result['content']

            # Convert result DataFrame to list of records
            rows = json.loads(result_df.to_json(orient='records', date_format='iso'))

            return jsonify({
                "status": "ok",
                "rows": rows,
                "message": "Successfully refreshed derived data"
            })
        else:
            return jsonify({
                "status": "error",
                "message": result.get('content', 'Unknown error during transformation')
            }), 400
Comment on lines +781 to +784

Check warning

Code scanning / CodeQL

Information exposure through an exception Medium

Stack trace information flows to this location and may be exposed to an external user.

Copilot Autofix

AI 2 days ago

To fix this, we should stop returning the detailed exception-derived message from the sandbox directly to the client, and instead (a) log the detailed error on the server, and (b) return a generic, non-sensitive message in the HTTP response. This applies at two levels:

  1. In py_sandbox.run_in_main_process, instead of building a verbose error_message that is later surfaced to the client, we should build both:
    • a detailed log message (kept server-side), and
    • a safer, generic or lightly sanitized error string intended to be propagated outward.
  2. In agent_routes.refresh_derived_data, we should ensure that the "message" field does not echo the detailed sandbox error verbatim, but only a generic message (optionally including a short code like the exception type, which is typically safe).

Because CodeQL’s taint tracking starts at err in run_in_main_process and flows through error_message to result['error_message'] and then all the way to refresh_derived_data, the most robust fix is to break that propagation path. Concretely:

  • Modify run_in_main_process:
    • Capture the exception details, including traceback.format_exc(), in a local variable.
    • Return a response object where:
      • error_message is a generic, non-detailed string (for example, "Execution failed due to an error in the transformation code." or at most "Execution failed with ValueError").
      • The detailed traceback is not included in the returned structure (or, if necessary for internal use, is in a separate key clearly not used for user-facing messages; however, given only the shown code, we’ll avoid returning it altogether).
  • Optionally, if py_sandbox.py is not using logging yet and we are allowed to introduce it, we could add basic logging there, but since you did not show the logger setup in that file and the alert is about exposure to the user, not missing logging, the minimal compliant fix is to keep detailed info local and not return it.
  • Modify refresh_derived_data in agent_routes.py:
    • When result['status'] != 'ok', stop returning result.get('content', ...) directly.
    • Instead, send a generic error message like "Error executing transformation code" (and optionally a hint: "Check your code and try again."), independent of result['content'].

This preserves the existing function behavior in terms of control flow (success vs. error) and structure of returned JSON (status, rows, message keys), but prevents internal exception details from being exposed.
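
The idea in the explanation above (keep the detail in server logs, hand callers only a generic message) can be sketched as follows. This is a simplified stand-in, not the real `py_sandbox` code: `run_user_code`, the logger name, and the `result` variable convention are all assumptions made for illustration, and the bare `exec` here provides no sandboxing at all.

```python
import logging
import traceback

# Hypothetical logger name; the real module may configure logging differently.
logger = logging.getLogger("sandbox_sketch")


def run_user_code(code: str) -> dict:
    """Execute code and return a result dict with a sanitized error message."""
    namespace: dict = {}
    try:
        # NOTE: plain exec is NOT a sandbox; it only stands in for the real runner.
        exec(code, namespace)
    except Exception as err:
        # Full exception text and traceback are logged server-side only.
        logger.error("Transformation failed: %s\n%s", err, traceback.format_exc())
        # Callers see the exception type at most, never str(err) or the traceback.
        return {
            "status": "error",
            "error_message": f"Execution failed ({type(err).__name__}).",
        }
    return {"status": "ok", "result": namespace.get("result")}
```

Because the returned dict never contains `str(err)`, the path from the exception to the HTTP response that the scanner flags no longer carries exception details, whatever the route later does with `error_message`.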

Suggested changeset 2
py-src/data_formulator/agent_routes.py

Autofix patch
Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/py-src/data_formulator/agent_routes.py b/py-src/data_formulator/agent_routes.py
--- a/py-src/data_formulator/agent_routes.py
+++ b/py-src/data_formulator/agent_routes.py
@@ -778,9 +778,10 @@
                 "message": "Successfully refreshed derived data"
             })
         else:
+            # Do not expose detailed sandbox error information to the client.
             return jsonify({
                 "status": "error",
-                "message": result.get('content', 'Unknown error during transformation')
+                "message": "Error executing transformation code. Please check your code and try again."
             }), 400
             
     except Exception as e:
EOF
py-src/data_formulator/py_sandbox.py
Outside changed files

Autofix patch
Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/py-src/data_formulator/py_sandbox.py b/py-src/data_formulator/py_sandbox.py
--- a/py-src/data_formulator/py_sandbox.py
+++ b/py-src/data_formulator/py_sandbox.py
@@ -106,8 +106,11 @@
     try:
         exec(code, restricted_globals)
     except Exception as err:
-        error_message = f"Error: {type(err).__name__} - {str(err)}"
-        return {'status': 'error', 'error_message': error_message}
+        # Build a generic, non-sensitive error message for callers.
+        generic_message = f"Execution failed due to an error in the transformation code ({type(err).__name__})."
+        # Note: full traceback and error details are intentionally not returned to callers
+        # to avoid leaking internal information. They should be logged by the caller if needed.
+        return {'status': 'error', 'error_message': generic_message}
 
     return {'status': 'ok', 'allowed_objects': {key: restricted_globals[key] for key in allowed_objects}}
 
EOF

Copilot is powered by AI and may make mistakes. Always verify output.

    except Exception as e:
        logger.error(f"Error refreshing derived data: {str(e)}")
        logger.error(traceback.format_exc())
        return jsonify({
            "status": "error",
            "message": str(e)
        }), 400
Comment on lines +789 to +792

Check warning

Code scanning / CodeQL

Information exposure through an exception Medium

Stack trace information flows to this location and may be exposed to an external user.

Copilot Autofix

AI 2 days ago

In general, to fix information exposure through exceptions, log detailed error information (including stack traces) only on the server side and return a generic, non-sensitive message to the client. Avoid echoing str(e) or any stack trace data back in HTTP responses.

For this specific endpoint (refresh_derived_data in py-src/data_formulator/agent_routes.py, lines 786–792), we should keep the logging behavior but replace the client-facing "message": str(e) with a generic message such as "An internal error occurred while refreshing derived data." or similar. This preserves existing functionality (the client still gets an "error" status and a message field) while removing the potential exposure of internal exception details. No new imports or helper methods are required; we only adjust the JSON payload in the except block. The rest of the function remains unchanged.

Concretely:

  • Keep:
    • logger.error(f"Error refreshing derived data: {str(e)}")
    • logger.error(traceback.format_exc())
  • Change:
    • In the jsonify call at line 789–792, replace str(e) with a fixed generic message string.
Suggested changeset 1
py-src/data_formulator/agent_routes.py

Autofix patch
Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/py-src/data_formulator/agent_routes.py b/py-src/data_formulator/agent_routes.py
--- a/py-src/data_formulator/agent_routes.py
+++ b/py-src/data_formulator/agent_routes.py
@@ -788,5 +788,5 @@
         logger.error(traceback.format_exc())
         return jsonify({
             "status": "error",
-            "message": str(e)
+            "message": "An internal error occurred while refreshing derived data."
         }), 400
EOF
Comment on lines +791 to +792

Copilot AI Jan 30, 2026

The exception handler returns message: str(e) with HTTP 400. This can leak internal details (tracebacks/paths/sandbox errors) to the client and also misclassifies unexpected server failures as a client error. Please sanitize the error message (and consider a generic message for unexpected exceptions) and return an appropriate status code (typically 500).

Suggested change
-            "message": str(e)
-        }), 400
+            "message": "An unexpected error occurred while refreshing derived data."
+        }), 500
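
One way to get the 400 versus 500 split this comment describes is to raise a dedicated exception for problems the client can fix and treat everything else as a server error. The sketch below is framework-free and purely illustrative: `BadRequestError` and `respond` are hypothetical names, and the tuple it returns only imitates the `(body, status)` shape used in the endpoint.

```python
import logging
from typing import Callable, Tuple

# Hypothetical logger name, used only in this sketch.
logger = logging.getLogger("refresh_sketch")


class BadRequestError(Exception):
    """Hypothetical exception for input problems the client can correct."""


def respond(action: Callable[[], list]) -> Tuple[dict, int]:
    """Run action and map its outcome to a (body, status_code) pair."""
    try:
        rows = action()
    except BadRequestError as e:
        # The message was written by the server for the client, so it is safe to echo.
        return {"status": "error", "message": str(e)}, 400
    except Exception:
        # logger.exception records the traceback server-side; the client gets a fixed string.
        logger.exception("Unexpected error while refreshing derived data")
        return {
            "status": "error",
            "message": "An unexpected error occurred while refreshing derived data.",
        }, 500
    return {"status": "ok", "rows": rows}, 200
```

With this split, monitoring that counts 5xx responses would see genuine server failures instead of having them hidden among client errors.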

7 changes: 7 additions & 0 deletions py-src/data_formulator/data_loader/README.md
@@ -22,8 +22,15 @@ The UI automatically provide the query completion option to help user generate q

### Example Implementations

- `AthenaDataLoader`: AWS Athena integration (SQL queries on S3 data lakes)
- `BigQueryDataLoader`: Google BigQuery integration
- `KustoDataLoader`: Azure Data Explorer (Kusto) integration
- `MySQLDataLoader`: MySQL database integration
- `PostgreSQLDataLoader`: PostgreSQL database integration
- `MSSQLDataLoader`: Microsoft SQL Server integration
- `S3DataLoader`: Amazon S3 file integration (CSV, Parquet, JSON)
- `AzureBlobDataLoader`: Azure Blob Storage integration
- `MongoDBDataLoader`: MongoDB integration

### Testing

6 changes: 4 additions & 2 deletions py-src/data_formulator/data_loader/__init__.py
@@ -7,6 +7,7 @@
from data_formulator.data_loader.postgresql_data_loader import PostgreSQLDataLoader
from data_formulator.data_loader.mongodb_data_loader import MongoDBDataLoader
from data_formulator.data_loader.bigquery_data_loader import BigQueryDataLoader
from data_formulator.data_loader.athena_data_loader import AthenaDataLoader

 DATA_LOADERS = {
     "mysql": MySQLDataLoader,
@@ -16,7 +17,8 @@
     "azure_blob": AzureBlobDataLoader,
     "postgresql": PostgreSQLDataLoader,
     "mongodb": MongoDBDataLoader,
-    "bigquery": BigQueryDataLoader
+    "bigquery": BigQueryDataLoader,
+    "athena": AthenaDataLoader
 }

-__all__ = ["ExternalDataLoader", "MySQLDataLoader", "MSSQLDataLoader", "KustoDataLoader", "S3DataLoader", "AzureBlobDataLoader","PostgreSQLDataLoader", "MongoDBDataLoader", "BigQueryDataLoader", "DATA_LOADERS"]
+__all__ = ["ExternalDataLoader", "MySQLDataLoader", "MSSQLDataLoader", "KustoDataLoader", "S3DataLoader", "AzureBlobDataLoader", "PostgreSQLDataLoader", "MongoDBDataLoader", "BigQueryDataLoader", "AthenaDataLoader", "DATA_LOADERS"]
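
A registry dict like `DATA_LOADERS` is typically consumed by looking up a class by its string key and instantiating it. The sketch below shows that pattern with stub classes; every name in it (`LoaderStub`, `LOADERS`, `create_loader`) is hypothetical, and the real loaders' constructor signatures are not shown in this diff, so the single `params` argument is an assumption.

```python
from typing import Dict, Type


class LoaderStub:
    """Hypothetical stand-in for the real ExternalDataLoader base class."""

    def __init__(self, params: dict):
        self.params = params


class MySQLLoaderStub(LoaderStub):
    pass


class AthenaLoaderStub(LoaderStub):
    pass


# Same shape as DATA_LOADERS: a string key mapped to a loader class.
LOADERS: Dict[str, Type[LoaderStub]] = {
    "mysql": MySQLLoaderStub,
    "athena": AthenaLoaderStub,
}


def create_loader(loader_type: str, params: dict) -> LoaderStub:
    """Instantiate the loader registered under loader_type."""
    try:
        loader_cls = LOADERS[loader_type]
    except KeyError:
        # Fail with a clear message rather than a bare KeyError.
        raise ValueError(f"Unknown data loader type: {loader_type!r}") from None
    return loader_cls(params)
```

Under this pattern, adding a backend such as Athena needs only the import and one new dict entry, which matches the shape of the change above.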