
Conversation

@ialarmedalien
Collaborator

  • Use logging to emit messages (on the sage advice of copilot).
  • Adding a bunch of tests to ensure that the aforementioned logging is occurring.
  • Adding some extra tests for untested functions.
  • Minor code simplification.

Bonus content!

  • Adding a function to get namespace info.

@ialarmedalien ialarmedalien self-assigned this Dec 23, 2025
Comment on lines +108 to +109
# includes stack trace
logger.exception("Error creating namespace")
Collaborator Author


If you use logger.exception within the context of an exception, it automatically includes the stack trace. Very generous!
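A minimal sketch of that behavior (the failing call here is just a placeholder):

import logging

logging.basicConfig(level=logging.ERROR)
logger = logging.getLogger(__name__)

try:
    1 / 0  # placeholder for the failing Spark call
except Exception:
    # inside an except block, logger.exception logs at ERROR level
    # and automatically appends the active traceback
    logger.exception("Error creating namespace")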

except Exception as e:
    print(f"Error creating namespace: {e}")
    raise e
delta_namespace, db_location = generate_namespace_location(namespace, tenant_name)
Collaborator Author


changed this variable to delta_namespace so that it is clearer that it's the governance service / minio-approved namespace, not the raw string entered by the user.

Comment on lines +149 to +150
if not namespace:
    namespace = DEFAULT_NAMESPACE
Collaborator Author


another copilot suggestion from the last PR - have namespace default to None and then set the value of namespace within the function if it is falsy (None or "").

Comment on lines -192 to -193
tables = [row["tableName"] for row in tables_df.collect()]
return tables
Collaborator Author


no need to set an intermediate variable and return it -- just return the value directly
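For example (a hypothetical reconstruction of the surrounding function; the name and SQL are assumptions, not part of the diff):

from pyspark.sql import SparkSession


def list_tables(spark: SparkSession, namespace: str) -> list[str]:
    # return the comprehension directly instead of binding it to a temporary variable
    return [row["tableName"] for row in spark.sql(f"SHOW TABLES IN {namespace}").collect()]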



def get_table_info(spark: SparkSession, table_name: str, namespace: str = DEFAULT_NAMESPACE) -> dict:
def get_namespace_info(spark: SparkSession, namespace: str | None = None) -> dict:
Collaborator Author


New function -- just a wrapper around DESCRIBE NAMESPACE EXTENDED <namespace_name>. Extremely similar to the existing get_table_info function.

desc_df = spark.sql(f"DESCRIBE EXTENDED {db_table}").collect()
# N.b. if the table contains columns with the same names as table metadata fields ("Name", "Type", "Location", "Provider", etc.)
# they will be overwritten.
info = {row["col_name"]: row["data_type"] for row in desc_df if row["col_name"] and row["data_type"]}
Collaborator Author


use a dictionary comprehension

@bio-boris
Contributor

Will users even be able to see the logs, or is this for admins only? Are logs emitted to the notebook? I was thinking we need to save logs to /tmp/logs or something in order for the user to be able to see them.

Contributor

Copilot AI left a comment


Pull request overview

This PR modernizes user messaging by migrating from print statements to the standard Python logging module, improving code maintainability and allowing users to configure logging levels. It adds comprehensive test coverage for logging behavior and introduces a new get_namespace_info function for retrieving namespace metadata.

  • Replaced all print statements with structured logging calls (logger.info, logger.warning, logger.exception)
  • Added comprehensive tests using pytest's caplog fixture to verify logging behavior (see the sketch after this list)
  • Introduced get_namespace_info function to query namespace metadata
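As a minimal sketch of the caplog pattern (a stub stands in for the real function; names are illustrative):

import logging

logger = logging.getLogger("berdl_notebook_utils.spark.database")


def create_namespace_stub() -> None:
    # stand-in for the real function; raises to exercise logger.exception
    try:
        raise RuntimeError("boom")
    except Exception:
        logger.exception("Error creating namespace")


def test_error_is_logged(caplog):
    # caplog captures records emitted at or above the given level
    with caplog.at_level(logging.ERROR, logger="berdl_notebook_utils.spark.database"):
        create_namespace_stub()
    assert "Error creating namespace" in caplog.text
    # logger.exception attaches the traceback to the record
    assert caplog.records[0].exc_info is not None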

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

File: notebook_utils/berdl_notebook_utils/spark/database.py
Description: Adds logger setup, converts print statements to logging calls, renames variables for clarity (namespace → delta_namespace), adds get_namespace_info function, and updates exception handling to use logger.exception

File: notebook_utils/tests/spark/test_database.py
Description: Migrates from capfd to caplog for testing, adds make_mock_spark_sql_error test helper, adds comprehensive logging tests for all modified functions, and adds tests for get_table_info and get_namespace_info functions


@ialarmedalien
Collaborator Author

The config settings used by Ahmed's data lakehouse ingest module seem pretty reasonable:

https://github.com/kbase/data-lakehouse-ingest/blob/develop/src/data_lakehouse_ingest/logger.py#L41

It would be good if this module provided a basic logger config (level, where to output to, log format) that could be imported and reused by other packages that will be run on the spark notebook. A more uniform approach to logging across the different BERDL products means easier troubleshooting, a standardised user experience, and no need for every package to reinvent the logging wheel.
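As a rough sketch of what such a shared config could look like (names and defaults here are assumptions, not the settings from the linked file):

import logging
import sys

LOG_FORMAT = "%(asctime)s %(name)s %(levelname)s %(message)s"


def configure_logging(level: int = logging.INFO, stream=sys.stderr) -> None:
    # attach a single stream handler to the root logger, guarding against
    # duplicate handlers when the function is called more than once
    root = logging.getLogger()
    if not root.handlers:
        handler = logging.StreamHandler(stream)
        handler.setFormatter(logging.Formatter(LOG_FORMAT))
        root.addHandler(handler)
    root.setLevel(level)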

@bio-boris
Contributor

  • The title of this PR seems to imply that the intent is that the users of the notebook will see these messages, but these messages will only be viewable by rancher2 users/admins.
  • I agree standardized logging is good practice. However, all these logs go to stdout and get thrown away on container restart. The consistency would mainly pay off if we aggregate them somewhere (VictoriaLogs, Loki, etc.). Are we planning to set that up? If so, then yes, let's align on format, and we should probably decide on the format based on what the aggregator prefers (logfmt is nice for Loki and human-readable; JSON works too). If we're not aggregating, the standardization would only affect rancher2 users tailing logs.

@ialarmedalien
Collaborator Author

You can configure handlers that will distribute log messages wherever you want. For example, it may be useful to save all messages of level ERROR and FATAL to a file, and all messages from levels INFO and above (WARNING, ERROR, etc.) to stdout / stderr.

For our purposes, most messages are for the benefit of the user, so a config that outputs them to stdout/stderr would be fine. Some errors may be useful to aggregate across sessions and users, as they may indicate bugs, connectivity problems, etc.

I don't think there has been much discussion of centralised, aggregated logging yet but I agree that we should come up with a shared format instead of each component having its own syntax.
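For illustration, a sketch of that kind of split (the logger name and file path are assumptions; note that FATAL is an alias for CRITICAL in Python's logging):

import logging
import sys

logger = logging.getLogger("berdl")
logger.setLevel(logging.INFO)

# INFO and above go to stdout for the notebook user
console = logging.StreamHandler(sys.stdout)
console.setLevel(logging.INFO)
logger.addHandler(console)

# ERROR and above (ERROR, CRITICAL) additionally go to a file
error_file = logging.FileHandler("/tmp/berdl-errors.log")
error_file.setLevel(logging.ERROR)
logger.addHandler(error_file)

logger.info("visible on stdout only")
logger.error("visible on stdout and in the error log")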

Contributor

@Tianhao-Gu Tianhao-Gu left a comment


The code change looks fine (besides some minor comments), but without logger configuration, users will see no output, defeating the purpose of the PR.

It seems like it also breaks tests.

DEFAULT_NAMESPACE = "default"


logger = logging.getLogger(__name__)
Contributor


The module creates a logger but doesn't configure any handlers - is that intentional?

Collaborator Author

@ialarmedalien ialarmedalien Jan 5, 2026


I would like to discuss a shared logging config (in terms of directing certain types of output to different places) that can be used by this package, Ahmed's dlh_ingest, and the cdm_data_loader_utils package.

Contributor


For something like logging, is it worth building a shared module at this stage, especially since we don't yet have any infrastructure for log monitoring?

except Exception as e:
    print(f"Error creating namespace: {e}")
    raise e
delta_namespace, db_location = generate_namespace_location(namespace, tenant_name)
Contributor


The PR renames namespace to delta_namespace only in create_namespace_if_not_exists() but not in other functions - is that intentional?

Collaborator Author


Yes. It's to differentiate between functions where namespace in == namespace out (e.g. checking if a namespace exists) and functions where the namespace in may be mutated to something different by the governance service.

Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.



Comment on lines 233 to +272
Example:
    >>> info = get_table_info(spark, "user_data", "alice_experiments")
    >>> print(f"Table location: {info.get('location', 'N/A')}")
    >>> print(f"Table location: {info.get('Location', 'N/A')}")

Copilot AI Jan 5, 2026


The docstring for the namespace parameter does not document the default behavior. The parameter defaults to None and falls back to DEFAULT_NAMESPACE ("default") when None is provided. This should be documented in the Args section, similar to how table_exists documents it as "The namespace of the table. Default is 'default'" (line 139).

Comment on lines +232 to +241
Args:
    spark: The Spark session
    namespace: The namespace to retrieve information about
Returns:
    Dictionary containing namespace information
Example:
    >>> info = get_namespace_info(spark, "alice_experiments")
    >>> print(f"Namespace location: {info.get('Location', 'N/A')}")

Copilot AI Jan 5, 2026


The docstring for the namespace parameter does not document the default behavior. The parameter defaults to None and falls back to DEFAULT_NAMESPACE ("default") when None is provided. This should be documented in the Args section, similar to how table_exists documents it as "The namespace of the table. Default is 'default'" (line 139).

Comment on lines +228 to +255
def get_namespace_info(spark: SparkSession, namespace: str | None = None) -> dict:
    """
    Get detailed information about a namespace.
    Args:
        spark: The Spark session
        namespace: The namespace to retrieve information about
    Returns:
        Dictionary containing namespace information
    Example:
        >>> info = get_namespace_info(spark, "alice_experiments")
        >>> print(f"Namespace location: {info.get('Location', 'N/A')}")
    """
    if not namespace:
        namespace = DEFAULT_NAMESPACE

    info = {}
    try:
        # Get namespace description
        desc_df = spark.sql(f"DESCRIBE NAMESPACE EXTENDED {namespace}").collect()
        # Convert to dictionary
        info = {row["info_name"]: row["info_value"] for row in desc_df}
    except Exception:
        logger.exception("Error getting namespace info for %s", namespace)

    return info

Copilot AI Jan 5, 2026


The new function get_namespace_info is not exported in berdl_notebook_utils/spark/__init__.py. This means users cannot access it via the package-level import (e.g., from berdl_notebook_utils.spark import get_namespace_info). Add it to the imports and __all__ list in the __init__.py file to make it accessible to users.
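Roughly (a sketch; any existing imports and __all__ entries in that file are omitted here):

# berdl_notebook_utils/spark/__init__.py
from berdl_notebook_utils.spark.database import get_namespace_info

__all__ = ["get_namespace_info"]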

spark: SparkSession,
table_name: str,
namespace: str = DEFAULT_NAMESPACE,
namespace: str | None = None,

Copilot AI Jan 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changing the default parameter of table_exists from namespace: str = DEFAULT_NAMESPACE to namespace: str | None = None is a breaking API change. While the behavior remains the same (defaults to "default" namespace), code that relied on keyword argument matching or type checking will break. The change is inconsistent with remove_table which still uses namespace: str = DEFAULT_NAMESPACE (line 160). For consistency and to avoid breaking changes, consider keeping the same signature pattern across all functions.

Suggested change
namespace: str | None = None,
namespace: str = DEFAULT_NAMESPACE,

Co-authored-by: Tianhao Gu <Tianhao-Gu@users.noreply.github.com>
@ialarmedalien
Collaborator Author

> It seems like it also breaks tests.

I have not looked at the tests you added over the holiday; I just merged the branch in. Code changes or tests that involve logging or stdout/stderr are probably to blame. I will wait until after discussing broader logging strategies, in case this turns out to have been a waste of time.
