Use standard logging module for sending the user useful messages. #76
base: main
Conversation
Adding some extra tests for untested functions. Adding a function to get namespace info.
# includes stack trace
logger.exception("Error creating namespace")
If you use logger.exception within the context of an exception, it automatically includes the stack trace. Very generous!
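A minimal runnable sketch of that behavior (the logger name and message are illustrative):

import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

try:
    raise ValueError("boom")
except ValueError:
    # Inside an except block, logger.exception logs at ERROR level and
    # appends the active traceback automatically; no exc_info needed.
    logger.exception("Error creating namespace")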
except Exception as e:
    print(f"Error creating namespace: {e}")
    raise e
delta_namespace, db_location = generate_namespace_location(namespace, tenant_name)
Changed this variable to delta_namespace so that it is clearer that it's the governance-service / MinIO-approved namespace, not the raw string entered by the user.
if not namespace:
    namespace = DEFAULT_NAMESPACE
Another Copilot suggestion from the last PR: have namespace default to None, then set the value of namespace within the function if it is falsy (None or "").
tables = [row["tableName"] for row in tables_df.collect()]
return tables
No need to set an intermediate variable and return it -- just return the expression directly, as sketched below.
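I.e., a trivial sketch of the suggested change:

# before: intermediate variable
tables = [row["tableName"] for row in tables_df.collect()]
return tables

# after: return the expression directly
return [row["tableName"] for row in tables_df.collect()]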
def get_table_info(spark: SparkSession, table_name: str, namespace: str = DEFAULT_NAMESPACE) -> dict:
def get_namespace_info(spark: SparkSession, namespace: str | None = None) -> dict:
New function -- just a wrapper around DESCRIBE NAMESPACE EXTENDED <namespace_name>. Extremely similar to the existing get_table_info function.
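For illustration, a hedged sketch of the kind of mapping it returns; the exact keys come from Spark's DESCRIBE NAMESPACE EXTENDED output and vary by catalog, and the values below are made up:

info = get_namespace_info(spark, "alice_experiments")
# e.g. {"Catalog Name": "spark_catalog",
#       "Namespace Name": "alice_experiments",
#       "Location": "s3a://cdm-lake/alice_experiments",
#       "Owner": "alice"}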
desc_df = spark.sql(f"DESCRIBE EXTENDED {db_table}").collect()
# N.b. if the table contains columns with the same names as table metadata fields
# ("Name", "Type", "Location", "Provider", etc.) they will be overwritten.
info = {row["col_name"]: row["data_type"] for row in desc_df if row["col_name"] and row["data_type"]}
Use a dictionary comprehension.
Will users even be able to see the logs, or is this for admins only? Are logs emitted to the notebook? I was thinking we need to save logs to /tmp/logs or something in order for the user to be able to see them.
Pull request overview
This PR modernizes user messaging by migrating from print statements to the standard Python logging module, improving code maintainability and allowing users to configure logging levels. It adds comprehensive test coverage for logging behavior and introduces a new get_namespace_info function for retrieving namespace metadata.
- Replaced all print statements with structured logging calls (logger.info, logger.warning, logger.exception)
- Added comprehensive tests using pytest's caplog fixture to verify logging behavior
- Introduced get_namespace_info function to query namespace metadata
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| notebook_utils/berdl_notebook_utils/spark/database.py | Adds logger setup, converts print statements to logging calls, renames variables for clarity (namespace → delta_namespace), adds get_namespace_info function, and updates exception handling to use logger.exception |
| notebook_utils/tests/spark/test_database.py | Migrates from capfd to caplog for testing, adds make_mock_spark_sql_error test helper, adds comprehensive logging tests for all modified functions, and adds tests for get_table_info and get_namespace_info functions |
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
The config settings used by Ahmed's data lakehouse ingest module seem pretty reasonable: https://github.com/kbase/data-lakehouse-ingest/blob/develop/src/data_lakehouse_ingest/logger.py#L41 It would be good if this module provided a basic logger config (level, where to output to, log format) that could be imported and reused by other packages that will be run on the Spark notebook. A more uniform approach to logging across the different BERDL products means easier troubleshooting, a standardised user experience, and no need for every package to reinvent the logging wheel.
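A hedged sketch of what such an importable config could look like; the function name, format string, and defaults here are assumptions for illustration, not the actual data-lakehouse-ingest code:

import logging
import sys

LOG_FORMAT = "%(asctime)s %(name)s %(levelname)s: %(message)s"

def configure_logging(level: int = logging.INFO) -> None:
    # Attach a single stream handler to the root logger; the guard
    # avoids stacking duplicate handlers when a notebook cell re-runs.
    root = logging.getLogger()
    if root.handlers:
        return
    handler = logging.StreamHandler(sys.stdout)
    handler.setFormatter(logging.Formatter(LOG_FORMAT))
    root.addHandler(handler)
    root.setLevel(level)

Packages would then call configure_logging() once at import or notebook start, and module loggers created with logging.getLogger(__name__) inherit the configuration.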
You can configure handlers that will distribute log messages wherever you want. For example, it may be useful to save all messages of level ERROR and above to a file, and all messages of level INFO and above (WARNING, ERROR, etc.) to stdout/stderr. Here, most messages are for the benefit of the user, so a config that outputs them to stdout/stderr would be fine. Some errors may be useful to aggregate across sessions and users, as they may indicate bugs, connectivity problems, and so on. I don't think there has been much discussion of centralised, aggregated logging yet, but I agree that we should come up with a shared format instead of each component having its own syntax.
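A minimal sketch of that routing, assuming a hypothetical "berdl" logger and a /tmp/logs directory (both illustrative):

import logging
import os
import sys

os.makedirs("/tmp/logs", exist_ok=True)

logger = logging.getLogger("berdl")
logger.setLevel(logging.INFO)

fmt = logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")

# INFO and above go to stdout for the user...
console = logging.StreamHandler(sys.stdout)
console.setLevel(logging.INFO)

# ...and ERROR and above are also written to a file for aggregation.
errfile = logging.FileHandler("/tmp/logs/errors.log")
errfile.setLevel(logging.ERROR)

for handler in (console, errfile):
    handler.setFormatter(fmt)
    logger.addHandler(handler)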
Tianhao-Gu left a comment
The code change looks fine (besides some minor comments), but without logger configuration, users will see no output, defeating the purpose of the PR.
It seems like it also breaks tests.
DEFAULT_NAMESPACE = "default"

logger = logging.getLogger(__name__)
The module creates a logger but doesn't configure any handlers. Is that intentional?
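For context, a small runnable demonstration of stock CPython behavior when nothing is configured: loggers inherit the root logger's default WARNING level, and with no handlers attached the module's "last resort" fallback writes WARNING and above to stderr while INFO and DEBUG calls are dropped. So the logger.info messages this PR adds would be invisible without configuration:

import logging

logger = logging.getLogger("demo")

# Dropped: below the default WARNING threshold, and no handler configured.
logger.info("user-facing message nobody will see")

# Printed to stderr by logging.lastResort, the fallback handler.
logger.warning("this one does appear")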
I would like to discuss a shared logging config (in terms of directing certain types of output to different places) that can be used by this package, Ahmed's dlh_ingest, and the cdm_data_loader_utils package.
For something like logging, is it worth building a shared module at this stage? Especially since we don't yet have any infrastructure for log monitoring.
except Exception as e:
    print(f"Error creating namespace: {e}")
    raise e
delta_namespace, db_location = generate_namespace_location(namespace, tenant_name)
The PR renames namespace to delta_namespace only in create_namespace_if_not_exists() but not in other functions. Is that intentional?
Yes. It's to differentiate between functions where namespace in == namespace out (e.g. checking if a namespace exists) and functions where the namespace in may be mutated to something different by the governance service.
Example:
    >>> info = get_table_info(spark, "user_data", "alice_experiments")
-   >>> print(f"Table location: {info.get('location', 'N/A')}")
+   >>> print(f"Table location: {info.get('Location', 'N/A')}")
Copilot AI, Jan 5, 2026
The docstring for the namespace parameter does not document the default behavior. The parameter defaults to None and falls back to DEFAULT_NAMESPACE ("default") when None is provided. This should be documented in the Args section, similar to how table_exists documents it as "The namespace of the table. Default is 'default'" (line 139).
Args:
    spark: The Spark session
    namespace: The namespace to retrieve information about

Returns:
    Dictionary containing namespace information

Example:
    >>> info = get_namespace_info(spark, "alice_experiments")
    >>> print(f"Namespace location: {info.get('Location', 'N/A')}")
Copilot AI, Jan 5, 2026
The docstring for the namespace parameter does not document the default behavior. The parameter defaults to None and falls back to DEFAULT_NAMESPACE ("default") when None is provided. This should be documented in the Args section, similar to how table_exists documents it as "The namespace of the table. Default is 'default'" (line 139).
def get_namespace_info(spark: SparkSession, namespace: str | None = None) -> dict:
    """
    Get detailed information about a namespace.

    Args:
        spark: The Spark session
        namespace: The namespace to retrieve information about

    Returns:
        Dictionary containing namespace information

    Example:
        >>> info = get_namespace_info(spark, "alice_experiments")
        >>> print(f"Namespace location: {info.get('Location', 'N/A')}")
    """
    if not namespace:
        namespace = DEFAULT_NAMESPACE

    info = {}
    try:
        # Get namespace description
        desc_df = spark.sql(f"DESCRIBE NAMESPACE EXTENDED {namespace}").collect()
        # Convert to dictionary
        info = {row["info_name"]: row["info_value"] for row in desc_df}
    except Exception:
        logger.exception("Error getting namespace info for %s", namespace)

    return info
Copilot AI, Jan 5, 2026
The new function get_namespace_info is not exported in berdl_notebook_utils/spark/__init__.py. This means users cannot access it via the package-level import (e.g., from berdl_notebook_utils.spark import get_namespace_info). Add it to the imports and __all__ list in the __init__.py file to make it accessible to users.
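A hypothetical sketch of that export; the existing contents of the __init__.py are assumed, and only the get_namespace_info lines are the point here:

# berdl_notebook_utils/spark/__init__.py
from .database import get_namespace_info, get_table_info

__all__ = [
    "get_table_info",
    "get_namespace_info",  # newly added export
]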
    spark: SparkSession,
    table_name: str,
-   namespace: str = DEFAULT_NAMESPACE,
+   namespace: str | None = None,
Copilot AI, Jan 5, 2026
Changing the default parameter of table_exists from namespace: str = DEFAULT_NAMESPACE to namespace: str | None = None is a breaking API change. While the behavior remains the same (defaults to "default" namespace), code that relied on keyword argument matching or type checking will break. The change is inconsistent with remove_table which still uses namespace: str = DEFAULT_NAMESPACE (line 160). For consistency and to avoid breaking changes, consider keeping the same signature pattern across all functions.
Suggested change:
-   namespace: str | None = None,
+   namespace: str = DEFAULT_NAMESPACE,
Co-authored-by: Tianhao Gu <Tianhao-Gu@users.noreply.github.com>
I have not looked at the tests you added over the holiday; I just merged the branch in. Code changes or tests that involve logging or stdout/stderr are probably to blame. I will wait until after discussing broader logging strategies, in case this turns out to have been a waste of time.