Conversation

@daniel-sanche (Contributor) commented Aug 11, 2025

Blocked on #1206

This PR revives #923, which was de-prioritized to work on the sync client. This PR brings it back, working with both the async and sync clients. It also adds a grpc interceptor as an improved way to capture metadata across both clients.


Design

The main architecture looks like this:

(architecture diagram)

Most of the work is done by the ActiveOperationMetric class, which is instantiated with each rpc call and updated through the lifecycle of the call. When the rpc is complete, it calls on_operation_complete and on_attempt_complete on the MetricsHandler, which can then log the completed data to OpenTelemetry (or, in theory, other destinations if needed).
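A minimal sketch of this lifecycle (the class and method names follow the PR; all fields and bodies are illustrative, not the actual implementation):

```python
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class CompletedOperationMetric:
    # illustrative snapshot; the real class carries many more fields
    op_type: str
    duration_ns: int
    final_status: str


class MetricsHandler:
    """Destination for finalized metrics; an OpenTelemetry-backed
    subclass would export instead of storing in memory."""

    def __init__(self):
        self.operations = []

    def on_operation_complete(self, op: CompletedOperationMetric) -> None:
        self.operations.append(op)


class ActiveOperationMetric:
    """Mutable tracker instantiated once per rpc call."""

    def __init__(self, op_type: str, handlers: list):
        self.op_type = op_type
        self.handlers = handlers
        self.start_time_ns = time.monotonic_ns()

    def end_with_status(self, status: str) -> None:
        # freeze the collected data and hand it to each handler
        completed = CompletedOperationMetric(
            op_type=self.op_type,
            duration_ns=time.monotonic_ns() - self.start_time_ns,
            final_status=status,
        )
        for handler in self.handlers:
            handler.on_operation_complete(completed)
```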

Note that there are separate classes for active vs completed metrics (ActiveOperationMetric, ActiveAttemptMetric, CompletedOperationMetric, CompletedAttemptMetric). This is so that fields can stay mutable and optional while the request is ongoing, while static, immutable copies are passed down once the attempt is completed and no new data is coming.
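The mutable/immutable split maps naturally onto dataclasses; a sketch (field names are illustrative):

```python
from dataclasses import dataclass, FrozenInstanceError
from typing import Optional


@dataclass
class ActiveAttemptMetric:
    # mutable while the attempt is in flight; most fields start unknown
    start_time_ns: int
    gfe_latency_ns: Optional[int] = None
    end_status: Optional[str] = None


@dataclass(frozen=True)
class CompletedAttemptMetric:
    # immutable snapshot handed to handlers after the attempt finishes;
    # every field is required, so incomplete data can't leak downstream
    start_time_ns: int
    duration_ns: int
    end_status: str
```

With `frozen=True`, any attempt to mutate a completed metric raises `dataclasses.FrozenInstanceError`, which enforces the "no new data is coming" contract at runtime.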

@daniel-sanche daniel-sanche marked this pull request as ready for review November 27, 2025 00:57
@daniel-sanche daniel-sanche requested review from a team as code owners November 27, 2025 00:57
@daniel-sanche (Contributor Author):

Before merging, we should re-run the benchmarking code to make sure we are satisfied with the performance

@vermas2012 vermas2012 assigned mutianf and unassigned vermas2012 Dec 5, 2025
DEFAULT_CLUSTER_ID = "unspecified"

# keys for parsing metadata blobs
BIGTABLE_METADATA_KEY = "x-goog-ext-425905942-bin"
Contributor:

nit: maybe a more descriptive name like BIGTABLE_LOCATION_METADATA_KEY?

Contributor Author:

sounds good, I'll change this

class OperationType(Enum):
"""Enum for the type of operation being performed."""

READ_ROWS = "ReadRows"
Contributor:

there should also be a READ_ROW so we know if it's a point read or a scan

Contributor Author:

We had this discussion when I did the first draft of this, and we landed on just keeping READ_ROWS. But I'm totally fine either way, if you feel differently now

MUTATE_ROW = "MutateRow"
CHECK_AND_MUTATE = "CheckAndMutateRow"
READ_MODIFY_WRITE = "ReadModifyWriteRow"

Contributor:

how about BulkMutateRows for write batcher?

Contributor Author:

There is one called BULK_MUTATE_ROWS (although it maps to MutateRows). Do we need another one?

backoff_before_attempt_ns: int = 0
# time waiting on grpc channel, in nanoseconds
# TODO: capture grpc_throttling_time
grpc_throttling_time_ns: int = 0
Contributor:

fyi: we realized that in Java this metric also doesn't capture the time a request is queued on the channel. So if it's hard to get in Python, we can skip it.

Contributor Author:

sounds good, maybe I'll remove this then

op_type=self.op_type,
uuid=self.uuid,
completed_attempts=self.completed_attempts,
duration_ns=time.monotonic_ns() - self.start_time_ns,
Contributor:

same here, can we add a sanity check to make sure it's >= 0 (or, if it's negative, use 0) so that if there's a bug in the code, CSM won't break the client.
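The suggested guard could be as simple as clamping the computed duration at zero (a sketch; `safe_duration_ns` is a hypothetical helper name, not from the PR):

```python
import time
from typing import Optional


def safe_duration_ns(start_time_ns: int, now_ns: Optional[int] = None) -> int:
    """Return elapsed nanoseconds, clamped at zero so a bookkeeping bug
    can never emit a negative duration that breaks the metrics export."""
    if now_ns is None:
        now_ns = time.monotonic_ns()
    return max(0, now_ns - start_time_ns)
```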


@CrossSync.convert
async def intercept_unary_unary(self, continuation, client_call_details, request):
@_with_active_operation
Contributor:

where is this called? Can you point me to the code location? It feels a bit weird that starting an attempt is called from an interceptor

Contributor Author:

The implementation of this is here. The wrapper is executed before each intercept_unary_unary call.

The main purpose of this is to add the operation argument to this method, since that's not part of the regular intercept_unary_unary signature.

But yeah, there is a line that starts an attempt if the associated rpc wasn't started previously. I can't remember whether that was actually needed or whether I added it as a safeguard. I'll look into it

def __init__(self, **kwargs):
pass

def on_operation_complete(self, op: CompletedOperationMetric) -> None:
Contributor:

how is on_operation_complete called vs end_with_status in ActiveOperationMetric?

Contributor Author:

This represents the split between recording metric data and exporting it somewhere else:

  • ActiveOperationMetric is the class that tracks an ongoing operation. It's passed around through an rpc's lifecycle, and is finalized by calling operation.end_with_status()
    • this will be called throughout the library codebase
  • MetricHandler represents a destination for the metrics, like OTel. The implementation of this is mostly in a separate PR. The ActiveOperationMetric will call metric_handler.on_operation_complete() when it's time to export a new metric
    • this is really internal to the _metrics implementation, and won't be used in other places

# record metadata from failed rpc
if isinstance(exc, GoogleAPICallError) and exc.errors:
rpc_error = exc.errors[-1]
metadata = list(rpc_error.trailing_metadata()) + list(
Contributor:

should this call metrics_interceptor._get_metadata()?

Contributor Author:

I don't think so

The metrics_interceptor and tracked_retry are two separate methods for capturing metadata in different contexts, but the source of truth is always operation.add_response_metadata(), which they both write to
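The shared write path can be pictured like this (the method name comes from the discussion; the body and the metadata keys shown are illustrative):

```python
class ActiveOperationMetric:
    """Sketch: both capture paths funnel into one method,
    so the operation holds the single source of truth."""

    def __init__(self):
        self.response_metadata = {}

    def add_response_metadata(self, metadata):
        # metadata arrives as (key, value) pairs from grpc
        for key, value in metadata:
            self.response_metadata[key] = value


op = ActiveOperationMetric()
# interceptor path: trailing metadata from a successful rpc
op.add_response_metadata([("x-goog-ext-425905942-bin", b"\x01")])
# tracked_retry path: metadata recovered from a failed rpc's last error
op.add_response_metadata([("grpc-status-details-bin", b"\x02")])
```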

# record ending attempt for timeout failures
attempt_exc = exc_list[-1]
_track_retryable_error(operation)(attempt_exc)
operation.end_with_status(source_exc)
Contributor:

where is end_with_success called?

Contributor Author (Dec 19, 2025):

The tracked_retry object won't be notified when the stream ends successfully, so I think we'll have to instrument those manually


Stepping back a bit, I found rpcs behave pretty differently depending on whether they're unary vs streaming, sync vs async, or retryable vs single-attempt. I found I had to provide a couple of different instrumentation methods to capture everything, and use different strategies in the instrumentation code instead of capturing it all the same way. I documented some of this in go/bigtable-csm-python

If you're finding it too confusing, let me know. Maybe I can improve the documentation around this, or find ways to streamline it a bit more

Contributor:

will take a look at the doc!

Co-authored-by: Mattie Fu <mattiefu@google.com>

Labels

api: bigtable Issues related to the googleapis/python-bigtable API. size: xl Pull request size is extra large.
