
Conversation

@ialarmedalien
Collaborator

  • Use logging to emit messages (on the sage advice of copilot).
  • Adding a bunch of tests to ensure that the aforementioned logging is occurring.
  • Adding some extra tests for untested functions.
  • Minor code simplification.

Bonus content!

  • Adding a function to get namespace info.

@ialarmedalien ialarmedalien self-assigned this Dec 23, 2025
Comment on lines +108 to +109
# includes stack trace
logger.exception("Error creating namespace")
Collaborator Author


If you use logger.exception within the context of an exception, it automatically includes the stack trace. Very generous!
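A minimal sketch of that behavior (the failing call here is just a placeholder):

import logging

logging.basicConfig(level=logging.ERROR)
logger = logging.getLogger(__name__)

try:
    1 / 0  # placeholder for the failing Spark call
except Exception:
    # inside an except block, logger.exception logs at ERROR level
    # and automatically appends the active traceback
    logger.exception("Error creating namespace")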

except Exception as e:
    print(f"Error creating namespace: {e}")
    raise e
delta_namespace, db_location = generate_namespace_location(namespace, tenant_name)
Collaborator Author


changed this variable to delta_namespace so that it is clearer that it's the governance service / minio-approved namespace, not the raw string entered by the user.

Comment on lines +149 to +150
if not namespace:
    namespace = DEFAULT_NAMESPACE
Collaborator Author


another copilot suggestion from the last PR - have namespace default to None and then set the value of namespace within the function if it is falsy (None or "").

Comment on lines -192 to -193
tables = [row["tableName"] for row in tables_df.collect()]
return tables
Collaborator Author


no need to set an intermediate variable and return it -- just return the value directly
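For example (a hypothetical reconstruction of the surrounding function; the name and SQL are assumptions, not part of the diff):

from pyspark.sql import SparkSession


def list_tables(spark: SparkSession, namespace: str) -> list[str]:
    # return the comprehension directly instead of binding it to a temporary variable
    return [row["tableName"] for row in spark.sql(f"SHOW TABLES IN {namespace}").collect()]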



def get_table_info(spark: SparkSession, table_name: str, namespace: str = DEFAULT_NAMESPACE) -> dict:
def get_namespace_info(spark: SparkSession, namespace: str | None = None) -> dict:
Collaborator Author


New function -- just a wrapper around DESCRIBE NAMESPACE EXTENDED <namespace_name>. Extremely similar to the existing get_table_info function.

desc_df = spark.sql(f"DESCRIBE EXTENDED {db_table}").collect()
# N.b. if the table contains columns with the same names as table metadata fields ("Name", "Type", "Location", "Provider", etc.)
# they will be overwritten.
info = {row["col_name"]: row["data_type"] for row in desc_df if row["col_name"] and row["data_type"]}
Collaborator Author


use a dictionary comprehension

@bio-boris
Contributor

Will users even be able to see the logs, or is this for admins only? Are logs emitted to the notebook? I was thinking we need to save logs to /tmp/logs or something in order for the user to be able to see them.

Contributor

Copilot AI left a comment


Pull request overview

This PR modernizes user messaging by migrating from print statements to the standard Python logging module, improving code maintainability and allowing users to configure logging levels. It adds comprehensive test coverage for logging behavior and introduces a new get_namespace_info function for retrieving namespace metadata.

  • Replaced all print statements with structured logging calls (logger.info, logger.warning, logger.exception)
  • Added comprehensive tests using pytest's caplog fixture to verify logging behavior (see the sketch after this list)
  • Introduced get_namespace_info function to query namespace metadata
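As a minimal sketch of the caplog pattern (a stub stands in for the real function; names are illustrative):

import logging

logger = logging.getLogger("berdl_notebook_utils.spark.database")


def create_namespace_stub() -> None:
    # stand-in for the real function; raises to exercise logger.exception
    try:
        raise RuntimeError("boom")
    except Exception:
        logger.exception("Error creating namespace")


def test_error_is_logged(caplog):
    # caplog captures records emitted at or above the given level
    with caplog.at_level(logging.ERROR, logger="berdl_notebook_utils.spark.database"):
        create_namespace_stub()
    assert "Error creating namespace" in caplog.text
    # logger.exception attaches the traceback to the record
    assert caplog.records[0].exc_info is not None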

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

File: notebook_utils/berdl_notebook_utils/spark/database.py
Description: Adds logger setup, converts print statements to logging calls, renames variables for clarity (namespace → delta_namespace), adds get_namespace_info function, and updates exception handling to use logger.exception

File: notebook_utils/tests/spark/test_database.py
Description: Migrates from capfd to caplog for testing, adds make_mock_spark_sql_error test helper, adds comprehensive logging tests for all modified functions, and adds tests for get_table_info and get_namespace_info functions


@ialarmedalien
Collaborator Author

The config settings used by Ahmed's data lakehouse ingest module seem pretty reasonable:

https://github.com/kbase/data-lakehouse-ingest/blob/develop/src/data_lakehouse_ingest/logger.py#L41

It would be good if this module provided a basic logger config (level, where to output to, log format) that could be imported and reused by other packages that will be run on the spark notebook. A more uniform approach to logging across the different BERDL products means easier troubleshooting, a standardised user experience, and no need for every package to reinvent the logging wheel.
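As a rough sketch of what such a shared config could look like (names and defaults here are assumptions, not the settings from the linked file):

import logging
import sys

LOG_FORMAT = "%(asctime)s %(name)s %(levelname)s %(message)s"


def configure_logging(level: int = logging.INFO, stream=sys.stderr) -> None:
    # attach a single stream handler to the root logger, guarding against
    # duplicate handlers when the function is called more than once
    root = logging.getLogger()
    if not root.handlers:
        handler = logging.StreamHandler(stream)
        handler.setFormatter(logging.Formatter(LOG_FORMAT))
        root.addHandler(handler)
    root.setLevel(level)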

@bio-boris
Contributor

  • The title of this PR seems to imply that the intent is that the users of the notebook will see these messages, but these messages will only be viewable by rancher2 users/admins.
  • I agree standardized logging is good practice. However, all these logs go to stdout and get thrown away on container restart. The consistency would mainly pay off if we aggregate them somewhere (VictoriaLogs, Loki, etc.). Are we planning to set that up? If so, then yes, let's align on format, and we should probably decide on the format based on what the aggregator prefers (logfmt is nice for Loki and human-readable; JSON works too). If we're not aggregating, the standardization would only affect rancher2 users tailing logs.

@ialarmedalien
Collaborator Author

You can configure handlers that will distribute log messages wherever you want. For example, it may be useful to save all messages of level ERROR and FATAL to a file, and all messages from levels INFO and above (WARNING, ERROR, etc.) to stdout / stderr.

For our purposes, most messages are for the benefit of the user, so a config that outputs them to stdout/stderr would be fine. Some errors may be useful to aggregate across sessions and users, as they may indicate bugs, connectivity problems, etc.

I don't think there has been much discussion of centralised, aggregated logging yet but I agree that we should come up with a shared format instead of each component having its own syntax.
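For illustration, a sketch of that kind of split (the logger name and file path are assumptions; note that FATAL is an alias for CRITICAL in Python's logging):

import logging
import sys

logger = logging.getLogger("berdl")
logger.setLevel(logging.INFO)

# INFO and above go to stdout for the notebook user
console = logging.StreamHandler(sys.stdout)
console.setLevel(logging.INFO)
logger.addHandler(console)

# ERROR and above (ERROR, CRITICAL) additionally go to a file
error_file = logging.FileHandler("/tmp/berdl-errors.log")
error_file.setLevel(logging.ERROR)
logger.addHandler(error_file)

logger.info("visible on stdout only")
logger.error("visible on stdout and in the error log")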

Contributor

@Tianhao-Gu Tianhao-Gu left a comment


The code change looks fine (besides some minor comments), but without logger configuration, users will see no output, defeating the purpose of the PR.

It seems like it also breaks tests.

DEFAULT_NAMESPACE = "default"


logger = logging.getLogger(__name__)
Contributor


The module creates a logger but doesn't configure any handlers - is that intentional?

Collaborator Author

@ialarmedalien ialarmedalien Jan 5, 2026


I would like to discuss a shared logging config (in terms of directing certain types of output to different places) that can be used by this package, Ahmed's dlh_ingest, and the cdm_data_loader_utils package.

Contributor


For something like logging, is it worth building a shared module at this stage, especially since we don't yet have any infrastructure for log monitoring?

except Exception as e:
    print(f"Error creating namespace: {e}")
    raise e
delta_namespace, db_location = generate_namespace_location(namespace, tenant_name)
Contributor


The PR renames namespace to delta_namespace only in create_namespace_if_not_exists() but not in other functions - is that intentional?

Collaborator Author


Yes. It's to differentiate between functions where namespace in == namespace out (e.g. checking if a namespace exists) and functions where the namespace in may be mutated to something different by the governance service.

Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.



Comment on lines 233 to +272
Example:
    >>> info = get_table_info(spark, "user_data", "alice_experiments")
    >>> print(f"Table location: {info.get('location', 'N/A')}")
    >>> print(f"Table location: {info.get('Location', 'N/A')}")

Copilot AI Jan 5, 2026


The docstring for the namespace parameter does not document the default behavior. The parameter defaults to None and falls back to DEFAULT_NAMESPACE ("default") when None is provided. This should be documented in the Args section, similar to how table_exists documents it as "The namespace of the table. Default is 'default'" (line 139).

Comment on lines +232 to +241
Args:
    spark: The Spark session
    namespace: The namespace to retrieve information about
Returns:
    Dictionary containing namespace information
Example:
    >>> info = get_namespace_info(spark, "alice_experiments")
    >>> print(f"Namespace location: {info.get('Location', 'N/A')}")

Copilot AI Jan 5, 2026


The docstring for the namespace parameter does not document the default behavior. The parameter defaults to None and falls back to DEFAULT_NAMESPACE ("default") when None is provided. This should be documented in the Args section, similar to how table_exists documents it as "The namespace of the table. Default is 'default'" (line 139).

Comment on lines +228 to +255
def get_namespace_info(spark: SparkSession, namespace: str | None = None) -> dict:
    """
    Get detailed information about a namespace.
    Args:
        spark: The Spark session
        namespace: The namespace to retrieve information about
    Returns:
        Dictionary containing namespace information
    Example:
        >>> info = get_namespace_info(spark, "alice_experiments")
        >>> print(f"Namespace location: {info.get('Location', 'N/A')}")
    """
    if not namespace:
        namespace = DEFAULT_NAMESPACE

    info = {}
    try:
        # Get namespace description
        desc_df = spark.sql(f"DESCRIBE NAMESPACE EXTENDED {namespace}").collect()
        # Convert to dictionary
        info = {row["info_name"]: row["info_value"] for row in desc_df}
    except Exception:
        logger.exception("Error getting namespace info for %s", namespace)

    return info

Copilot AI Jan 5, 2026


The new function get_namespace_info is not exported in berdl_notebook_utils/spark/__init__.py. This means users cannot access it via the package-level import (e.g., from berdl_notebook_utils.spark import get_namespace_info). Add it to the imports and __all__ list in the __init__.py file to make it accessible to users.
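Roughly (a sketch; any existing imports and __all__ entries in that file are omitted here):

# berdl_notebook_utils/spark/__init__.py
from berdl_notebook_utils.spark.database import get_namespace_info

__all__ = ["get_namespace_info"]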

spark: SparkSession,
table_name: str,
namespace: str = DEFAULT_NAMESPACE,
namespace: str | None = None,

Copilot AI Jan 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changing the default parameter of table_exists from namespace: str = DEFAULT_NAMESPACE to namespace: str | None = None is a breaking API change. While the behavior remains the same (defaults to "default" namespace), code that relied on keyword argument matching or type checking will break. The change is inconsistent with remove_table which still uses namespace: str = DEFAULT_NAMESPACE (line 160). For consistency and to avoid breaking changes, consider keeping the same signature pattern across all functions.

Suggested change
namespace: str | None = None,
namespace: str = DEFAULT_NAMESPACE,

Co-authored-by: Tianhao Gu <Tianhao-Gu@users.noreply.github.com>
@ialarmedalien
Collaborator Author

> It seems like it also breaks tests.

I have not looked at the tests you added over the holiday; I just merged the branch in. Code changes or tests that involve logging or stdout/stderr are probably to blame. I will wait until after discussing broader logging strategies, in case this turns out to have been a waste of time.
