Feature Request: Add Go binding for dcgmGetFieldSummary API #110

@vimalk78

Description

The DCGM C API provides dcgmGetFieldSummary which computes summary statistics (min, max, avg, sum, count, integral, diff) over watched field samples. The C headers for this API are already included in go-dcgm (dcgm_agent.h, dcgm_structs.h), but there is no Go wrapper exposing this functionality.

This prevents Go consumers (such as https://github.com/NVIDIA/dcgm-exporter) from accessing pre-computed field summaries that DCGM maintains internally.

Relevant C API

  // dcgm_agent.h:1232
  // Get a summary of the values for a field id over a period of time.
  dcgmReturn_t dcgmGetFieldSummary(dcgmHandle_t pDcgmHandle, dcgmFieldSummaryRequest_t *request);

  // dcgm_structs.h:3831
  #define DCGM_SUMMARY_MIN      0x00000001
  #define DCGM_SUMMARY_MAX      0x00000002
  #define DCGM_SUMMARY_AVG      0x00000004
  #define DCGM_SUMMARY_SUM      0x00000008
  #define DCGM_SUMMARY_COUNT    0x00000010
  #define DCGM_SUMMARY_INTEGRAL 0x00000020
  #define DCGM_SUMMARY_DIFF     0x00000040

  // dcgm_structs.h:3857
  typedef struct {
      unsigned int version;
      unsigned short fieldId;
      dcgm_field_entity_group_t entityGroupId;
      dcgm_field_eid_t entityId;
      uint32_t summaryTypeMask;    // bitmask of DCGM_SUMMARY_*
      uint64_t startTime;          // 0 = from beginning
      uint64_t endTime;            // 0 = to now
      dcgmSummaryResponse_t response;
  } dcgmFieldSummaryRequest_v1;

Proposed Go API

  type FieldSummary struct {
      Min   float64
      Max   float64
      Avg   float64
      // additional fields as needed
  }

  // GetFieldSummary returns summary statistics for a watched field over a time window.
  func GetFieldSummary(gpuId uint, fieldId Short, summaryMask uint32, startTime, endTime uint64) (FieldSummary, error)

Use Case

https://github.com/sustainable-computing-io/kepler and other power monitoring tools need the minimum observed power of a GPU to determine idle/baseline power for workload attribution. DCGM's sample buffer may hold historical data from before the consumer started, which can make DCGM_SUMMARY_MIN on DCGM_FI_DEV_POWER_USAGE more accurate than self-tracking.

Downstream, https://github.com/NVIDIA/dcgm-exporter could use this binding to expose summary metrics as Prometheus gauges.
