-
Notifications
You must be signed in to change notification settings - Fork 8
feat(prometheus_adapter): 新增prometheus_adapter实现指标查询功能 #90
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: develop
Are you sure you want to change the base?
Conversation
Code Review Completed ✅I've thoroughly reviewed the prometheus_adapter implementation and identified several areas that need attention: 🚨 Critical IssuesMissing Dependencies
Resource Validation Issues
Error Handling Gaps
|
Summary of ChangesHello @JaD1ng, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! 此拉取请求引入了一个新的Prometheus Adapter服务,旨在为系统提供统一的、可编程的服务指标查询接口。通过暴露RESTful API,它允许用户按服务名称和版本查询各种性能指标。此外,本次变更还对现有的错误注入框架进行了升级,使其能够基于服务版本进行指标异常和HTTP请求延迟的注入,显著提升了系统的可观测性和故障注入测试的灵活性。这些改进共同为服务的监控、诊断和弹性测试提供了更强大的工具。 Highlights
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
本次 PR 实现了 prometheus_adapter
,提供了一个用于查询 Prometheus 指标的 RESTful API,这是一个很棒的新功能。同时,PR 还对 mock 服务的可观测性部分进行了重构,引入了服务版本的概念,并增加了 HTTP 延迟注入功能,增强了测试能力。
代码整体结构清晰,但我在以下几个方面发现了一些可以改进的地方:
- 性能: 在
metric_service
中,每次查询都会执行额外的服务和指标存在性检查,这会增加延迟。建议进行优化。 - 安全性:
prometheus_client
中的 PromQL 查询构建方式存在注入风险。 - 代码健壮性和可维护性: 部分代码(如错误处理、随机数生成)可以更健壮和易于维护。
- 可观测性: 移除了
ServiceInstanceID
可能会降低分布式追踪的粒度。
具体的修改建议请见各文件的评论。
if len(warnings) > 0 { | ||
// 记录警告但不返回错误 | ||
fmt.Printf("Prometheus warnings: %v\n", warnings) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
直接使用 fmt.Printf
输出警告信息会绕过项目统一的结构化日志系统(如 zerolog
),导致日志格式不一致,且难以被日志采集和分析系统处理。建议使用注入的 logger 实例来记录警告,例如 log.Warn()...
。
if len(warnings) > 0 { | |
// 记录警告但不返回错误 | |
fmt.Printf("Prometheus warnings: %v\n", warnings) | |
} | |
if len(warnings) > 0 { | |
// 使用日志组件记录警告 | |
log.Warn().Strs("warnings", warnings).Msg("Prometheus query generated warnings") | |
} |
if len(warnings) > 0 { | ||
fmt.Printf("Prometheus warnings: %v\n", warnings) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
直接使用 fmt.Printf
输出警告信息会绕过项目统一的结构化日志系统(如 zerolog
),导致日志格式不一致,且难以被日志采集和分析系统处理。建议使用注入的 logger 实例来记录警告,例如 log.Warn()...
。
if len(warnings) > 0 { | |
fmt.Printf("Prometheus warnings: %v\n", warnings) | |
} | |
if len(warnings) > 0 { | |
log.Warn().Strs("warnings", warnings).Msg("Prometheus query for available metrics generated warnings") | |
} |
func BuildQuery(service, metric, version string) string { | ||
// 基础查询 | ||
query := fmt.Sprintf(`%s{service_name="%s"`, metric, service) | ||
|
||
// 如果指定了版本,添加版本过滤 | ||
if version != "" { | ||
query += fmt.Sprintf(`,service_version="%s"`, version) | ||
} | ||
|
||
query += "}" | ||
return query | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
直接使用 fmt.Sprintf
拼接 PromQL 查询字符串存在安全隐患,如果 service
或 version
参数包含特殊字符(如 "
),可能会导致查询语句注入,引发语法错误或非预期的查询行为。为了保证查询的安全性,应该对标签值进行正确的转义。
建议使用 %q
格式化占位符,它会自动为字符串添加双引号并转义其中的特殊字符。
func BuildQuery(service, metric, version string) string { | |
// 基础查询 | |
query := fmt.Sprintf(`%s{service_name="%s"`, metric, service) | |
// 如果指定了版本,添加版本过滤 | |
if version != "" { | |
query += fmt.Sprintf(`,service_version="%s"`, version) | |
} | |
query += "}" | |
return query | |
} | |
func BuildQuery(service, metric, version string) string { | |
var queryBuilder strings.Builder | |
queryBuilder.WriteString(metric) | |
queryBuilder.WriteString(fmt.Sprintf(`{service_name=%q`, service)) | |
// 如果指定了版本,添加版本过滤 | |
if version != "" { | |
queryBuilder.WriteString(fmt.Sprintf(`,service_version=%q`, version)) | |
} | |
queryBuilder.WriteString("}") | |
return queryBuilder.String() | |
} |
func (s *MetricService) QueryMetric(ctx context.Context, service, metric, version string, start, end time.Time, step time.Duration) (*model.MetricQueryResponse, error) { | ||
// 动态验证服务是否存在 | ||
serviceExists, err := s.promClient.CheckServiceExists(ctx, service) | ||
if err != nil { | ||
log.Error().Err(err).Str("service", service).Msg("failed to check service existence") | ||
return nil, &model.PrometheusError{Message: err.Error()} | ||
} | ||
if !serviceExists { | ||
return nil, &model.ServiceNotFoundError{Service: service} | ||
} | ||
|
||
// 动态验证指标是否存在 | ||
metricExists, err := s.promClient.CheckMetricExists(ctx, metric) | ||
if err != nil { | ||
log.Error().Err(err).Str("metric", metric).Msg("failed to check metric existence") | ||
return nil, &model.PrometheusError{Message: err.Error()} | ||
} | ||
if !metricExists { | ||
return nil, &model.MetricNotFoundError{Metric: metric} | ||
} | ||
|
||
// 构建 PromQL 查询 | ||
query := client.BuildQuery(service, metric, version) | ||
log.Debug().Str("query", query).Msg("executing prometheus query") | ||
|
||
// 执行查询 | ||
dataPoints, err := s.promClient.QueryRange(ctx, query, start, end, step) | ||
if err != nil { | ||
log.Error().Err(err).Str("query", query).Msg("failed to query prometheus") | ||
return nil, &model.PrometheusError{Message: err.Error()} | ||
} | ||
|
||
// 构建响应 | ||
response := &model.MetricQueryResponse{ | ||
Service: service, | ||
Version: version, | ||
Metric: metric, | ||
Data: dataPoints, | ||
} | ||
|
||
return response, nil | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
在每次 QueryMetric
调用中,都执行了 CheckServiceExists
和 CheckMetricExists
两次额外的 Prometheus 查询。这会显著增加 API 的响应延迟,尤其是在高并发场景下。这种“预检”操作是性能瓶颈。
建议优化此逻辑:
- 移除预检:直接执行最终的
QueryRange
查询。如果结果为空,可以直接返回空数据集,这通常是合法的。如果需要区分“不存在”和“无数据”的场景,可以在查询结果为空时,再进行存在性检查以返回更精确的错误信息。 - 添加缓存:如果必须保留预检,可以为服务和指标的存在性检查结果添加一个带 TTL 的缓存(例如使用
sync.Map
或第三方缓存库),因为这些信息通常不会频繁变动。
start, end, err := api.parseTimeRange(startStr, endStr) | ||
if err != nil { | ||
log.Error().Err(err).Msg("invalid time parameters") | ||
api.sendErrorResponse(c, http.StatusBadRequest, model.ErrorCodeInvalidParameter, | ||
fmt.Sprintf("参数 'start/end' 格式错误: %s", err.Error()), nil) | ||
return | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
func (api *Api) sendErrorResponse(c *fox.Context, statusCode int, errorCode, message string, extras map[string]string) { | ||
errorDetail := model.ErrorDetail{ | ||
Code: errorCode, | ||
Message: message, | ||
} | ||
|
||
// 添加额外的字段 | ||
if extras != nil { | ||
if service, ok := extras["service"]; ok { | ||
errorDetail.Service = service | ||
} | ||
if metric, ok := extras["metric"]; ok { | ||
errorDetail.Metric = metric | ||
} | ||
if parameter, ok := extras["parameter"]; ok { | ||
errorDetail.Parameter = parameter | ||
} | ||
if value, ok := extras["value"]; ok { | ||
errorDetail.Value = value | ||
} | ||
} | ||
|
||
response := model.ErrorResponse{ | ||
Error: errorDetail, | ||
} | ||
|
||
c.JSON(statusCode, response) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
使用 map[string]string
类型的 extras
参数来传递额外的错误详情字段,这种方式不够类型安全且容易出错(例如,硬编码的键名拼写错误)。
建议采用更健壮的方式,例如使用函数式选项模式(functional options)来构建 ErrorDetail
,以增强代码的可读性和可维护性。
例如,可以将函数签名修改为:
func (api *Api) sendErrorResponse(c *fox.Context, statusCode int, errorCode, message string, opts ...func(*model.ErrorDetail))
这样调用时会更清晰和安全:
api.sendErrorResponse(c, http.StatusNotFound, model.ErrorCodeServiceNotFound,
err.Error(), func(d *model.ErrorDetail) { d.Service = service })
if config.Probability < 1.0 { | ||
// 简单的概率实现(生产环境应使用更好的随机数) | ||
if time.Now().UnixNano()%100 >= int64(config.Probability*100) { | ||
return 0 | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
return resource.New(context.Background(), | ||
resource.WithAttributes( | ||
semconv.ServiceName(config.ServiceName), | ||
semconv.ServiceVersion(config.ServiceVersion), | ||
semconv.DeploymentEnvironment(config.Environment), | ||
), | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
在 createResource
函数中,semconv.ServiceInstanceID
属性被移除了。虽然错误注入逻辑已改为基于服务版本,但在分布式追踪和指标监控中,实例ID(service.instance.id
)仍然是区分同一服务不同副本的关键标识。移除它会降低可观测性数据的粒度,使得问题排查(例如定位到某个具体 Pod 的问题)变得更加困难。
建议重新添加 semconv.ServiceInstanceID
属性,以保持对服务实例的精细化监控能力。
return resource.New(context.Background(),
resource.WithAttributes(
semconv.ServiceName(config.ServiceName),
semconv.ServiceVersion(config.ServiceVersion),
semconv.DeploymentEnvironment(config.Environment),
semconv.ServiceInstanceID(utils.GetInstanceID(config.ServiceName)),
),
)
|
||
// 尝试从配置文件加载 | ||
// 获取服务版本(默认为 1.0.0) | ||
serviceVersion := "1.0.0" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review Summary
I've identified several critical issues that require immediate attention:
🚨 Critical: Missing github.com/prometheus/procfs
dependency will cause compilation failures
🔧 Bug Fix: Silent step duration parsing failures should be handled properly
🏗️ Architecture: Input validation missing for metric names to prevent PromQL injection
These suggestions address the most critical issues with concrete code improvements.
github.com/prometheus/common v0.66.1 | ||
github.com/redis/go-redis/v9 v9.5.1 | ||
github.com/rs/zerolog v1.34.0 | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add missing prometheus/procfs dependency that is imported in metrics.go:14. Without this dependency, the code will fail to compile. The version 0.15.1 is compatible with the current prometheus client version.
) | |
github.com/prometheus/procfs v0.15.1 | |
github.com/prometheus/client_golang v1.23.2 |
} | ||
|
||
// 验证时间范围的合理性 | ||
if end.Before(start) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add maximum time range validation to prevent expensive Prometheus queries that could impact performance. The original code only checked if end is before start, but didn't limit the maximum range.
if end.Before(start) { | |
// 验证时间范围的合理性 | |
if end.Before(start) { | |
return time.Time{}, time.Time{}, fmt.Errorf("end time must be after start time") | |
} | |
// 防止查询时间范围过大(最大24小时) | |
if end.Sub(start) > 24*time.Hour { | |
return time.Time{}, time.Time{}, fmt.Errorf("time range cannot exceed 24 hours") | |
} |
rulesFilePath := os.Getenv("PROMETHEUS_RULES_FILE") | ||
if rulesFilePath == "" { | ||
rulesFilePath = "/etc/prometheus/rules/alert_rules.yml" | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
我们的服务会与prometheus服务在一个机器下么?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
目前是
实现HTTP中间件记录请求时延并导出到Prometheus指标 添加服务信息到指标收集器 统一代码格式和修复缩进问题 添加Prometheus Adapter API文档
重构指标注入器以支持服务版本维度 移除冗余的exported_job标签和实例ID生成 新增HTTP延迟注入器并与中间件集成
更新HTTP时延指标名称使其更简洁,同时改进健康检查脚本以支持批量检查多个端口
实现Prometheus适配器模块,包括以下主要功能: - 添加Prometheus客户端封装,支持指标查询和范围查询 - 实现指标服务层,提供指标列表获取和指标数据查询 - 添加API路由和控制器,提供RESTful接口 - 定义模型结构体和错误处理机制 - 更新依赖添加Prometheus客户端库 - 编写详细API文档说明接口使用方式
新增告警规则同步API及服务实现,支持将规则同步到Prometheus并触发重载 重构API层提取公共错误处理和工具方法到通用模块 添加相关模型定义和文档更新
- 在prometheus.yml中配置告警规则文件路径 - 修改docker-compose.yml挂载规则目录 - 重构AlertService,移除本地文件存储,直接写入容器 - 添加容器内规则文件写入的容错机制
- 将watch_time字段从AlertRuleMeta移到AlertRule中 - 移除全量同步接口,改为增量更新方式 - 实现批量更新规则元信息的API - 重构服务层代码结构,提高可维护性 - 更新文档
添加告警规则本地文件持久化功能,支持启动时加载和关闭时保存规则 重构关闭逻辑实现优雅关闭,包括保存当前规则状态 更新构建和部署脚本以处理规则文件目录 修改测试脚本以适配新的增量更新接口
添加prometheus_adapter.yml配置文件支持 重构alert_service使用配置而非环境变量 新增alert_webhook_service实现告警轮询推送 更新build.sh和deploy.sh支持配置文件部署 更新README文档说明新的webhook架构
- 新增Alertmanager API v2兼容接口用于接收Prometheus告警 - 重构告警服务架构,替换原有的轮询模式为推送模式 - 添加docker-compose配置支持Prometheus管理API - 移除过时的AlertWebhookService实现
实现删除告警规则模板及其关联元信息的功能,包括: 1. 添加DELETE /v1/alert-rules/:rule_name接口删除规则模板 2. 添加DELETE /v1/alert-rules-meta/:rule_name接口删除特定元信息 3. 更新相关文档说明删除操作的使用方法
- 在deploy.sh和build.sh中添加PID文件管理,优化服务启动和停止流程 - 修复alert_service.go中告警表达式生成的标签处理逻辑 - 使用%g代替%f格式化浮点数以避免科学计数法显示
b746a0f
to
4555e06
Compare
变更背景和解决方案
关联issue: #80
文档更新(架构文档、API文档、升级文档)
Checklist