Description
The DCGM C API provides dcgmGetFieldSummary which computes summary statistics (min, max, avg, sum, count, integral, diff) over watched field samples. The C headers for this API are already included in go-dcgm (dcgm_agent.h, dcgm_structs.h), but there is no Go wrapper exposing this functionality.
This prevents Go consumers (such as https://github.com/NVIDIA/dcgm-exporter) from using the field summaries that DCGM can compute from the samples it already holds internally.
Relevant C API
// dcgm_agent.h:1232
// Get a summary of the values for a field id over a period of time.
dcgmReturn_t dcgmGetFieldSummary(dcgmHandle_t pDcgmHandle, dcgmFieldSummaryRequest_t *request);
// dcgm_structs.h:3831
#define DCGM_SUMMARY_MIN 0x00000001
#define DCGM_SUMMARY_MAX 0x00000002
#define DCGM_SUMMARY_AVG 0x00000004
#define DCGM_SUMMARY_SUM 0x00000008
#define DCGM_SUMMARY_COUNT 0x00000010
#define DCGM_SUMMARY_INTEGRAL 0x00000020
#define DCGM_SUMMARY_DIFF 0x00000040
// dcgm_structs.h:3857
typedef struct {
    unsigned int version;
    unsigned short fieldId;
    dcgm_field_entity_group_t entityGroupId;
    dcgm_field_eid_t entityId;
    uint32_t summaryTypeMask; // bitmask of DCGM_SUMMARY_*
    uint64_t startTime;       // 0 = from beginning
    uint64_t endTime;         // 0 = to now
    dcgmSummaryResponse_t response;
} dcgmFieldSummaryRequest_v1;
Proposed Go API
type FieldSummary struct {
Min float64
Max float64
Avg float64
// additional fields as needed
}
// GetFieldSummary returns summary statistics for a watched field over a time window.
func GetFieldSummary(gpuId uint, fieldId Short, summaryMask uint32, startTime, endTime uint64) (FieldSummary, error)
Use Case
https://github.com/sustainable-computing-io/kepler and other power monitoring tools need the minimum observed power of a GPU to determine idle/baseline power for workload attribution. DCGM's sample buffer may hold samples from before the consumer started, so DCGM_SUMMARY_MIN on DCGM_FI_DEV_POWER_USAGE can be more accurate than a minimum the consumer tracks itself.
Downstream, https://github.com/NVIDIA/dcgm-exporter could use this binding to expose summary metrics as Prometheus gauges.