Skip to content

Conversation

JaD1ng
Copy link
Collaborator

@JaD1ng JaD1ng commented Sep 23, 2025

变更背景和解决方案

关联issue: #80

文档更新(架构文档、API文档、升级文档)

Checklist

  • 确认是否有文档更新
  • 确认是否自测
  • 确认是否考虑兼容升级,不兼容升级需要有对应的解决方案
  • 确认是否考虑添加监控指标
  • API 设计是否符合 API 设计规范, 如果没有 API 变更,请忽略
  • 我已经添加了相关人员到 Reviewers 列表中
  • 我已经设置了 Development 关联的 Issue

Copy link

niupilot bot commented Sep 23, 2025

Code Review Completed ✅

I've thoroughly reviewed the prometheus_adapter implementation and identified several areas that need attention:

🚨 Critical Issues

Missing Dependencies

  • go.mod includes github.com/prometheus/procfs but this dependency is missing from the current module requirements
  • The import in metrics.go:14 will cause compilation failures

Resource Validation Issues

  • Time range validation allows start time == end time (line 96 in metric_api.go), which could cause issues
  • No maximum time range limits could allow expensive queries

Error Handling Gaps

  • Client creation errors in server.go:32 don't provide actionable recovery information
  • Context cancellation in metrics collection isn't properly handled

⚠️ Security & Performance Concerns

Input Validation

  • No validation for malicious PromQL injection in metric names
  • Missing rate limiting for API endpoints
  • No timeout enforcement for expensive Prometheus queries

Resource Management

  • Unbounded time range queries could impact Prometheus performance
  • Background metrics collection goroutine may leak if context isn't properly cancelled

📋 Architecture & Design Issues

Code Organization

  • Service existence validation logic is duplicated between client and service layers
  • Error response construction is repeated across handlers
  • Missing interface abstractions make testing difficult

Configuration Management

  • Hardcoded Prometheus address fallback should be configurable
  • Magic numbers throughout metrics collection (30s intervals, cache TTL)

🔧 Implementation Bugs

Type Safety Issues

  • Step duration parsing fails silently and falls back to default (line 110 in metric_api.go)
  • Metrics collection continues even when procfs operations fail

Concurrency Issues

  • Metrics cache cleanup goroutine in HTTP latency injector may cause resource leaks

💡 Positive Aspects

  • Well-structured error types with proper Go error handling patterns
  • Good separation of concerns between API, service, and client layers
  • Comprehensive documentation with clear API examples
  • Proper use of OpenTelemetry metrics standards

📝 Recommendations

  1. Add missing dependencies and ensure all imports resolve correctly
  2. Implement proper input validation for all user inputs including time ranges and metric names
  3. Add interface abstractions for better testability and modularity
  4. Implement proper context handling throughout the metrics collection pipeline
  5. Add comprehensive unit tests for all error scenarios and edge cases

This is a solid foundation for a prometheus adapter, but requires attention to the critical issues before production deployment.


👉 Want me to re-run this review? Just type /review in the comments! For more usage examples, visit CodeAgent GitHub Repository.

Copy link
Contributor

Summary of Changes

Hello @JaD1ng, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

此拉取请求引入了一个新的Prometheus Adapter服务,旨在为系统提供统一的、可编程的服务指标查询接口。通过暴露RESTful API,它允许用户按服务名称和版本查询各种性能指标。此外,本次变更还对现有的错误注入框架进行了升级,使其能够基于服务版本进行指标异常和HTTP请求延迟的注入,显著提升了系统的可观测性和故障注入测试的灵活性。这些改进共同为服务的监控、诊断和弹性测试提供了更强大的工具。

Highlights

  • 新增Prometheus Adapter服务: 引入了一个全新的Prometheus Adapter服务,旨在通过RESTful API提供服务指标的查询功能。
  • 指标查询API: 新增了两个API端点:/v1/metrics用于获取所有可用指标列表,以及/v1/metrics/:service/:metric用于查询指定服务和指标的时间序列数据,支持按版本、时间范围和步长进行过滤。
  • Prometheus客户端集成: 集成了Prometheus客户端库,实现了与Prometheus的交互,包括范围查询、获取可用指标、检查指标和服务是否存在等核心功能。
  • 增强的错误注入机制: 改进了错误注入中间件,将指标异常注入的维度从实例级别调整为服务版本级别,并新增了HTTP请求延迟注入器,以支持更精细的故障注入测试。
  • HTTP延迟指标收集: 在可观测性模块中新增了HTTP请求延迟指标(http_latency)的收集,并将其与服务版本关联,以便更好地监控和分析服务性能。
  • API文档: 为新的Prometheus Adapter API提供了详细的Markdown文档,包括API概览、端点说明、请求示例、响应示例和错误响应格式。
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

本次 PR 实现了 prometheus_adapter,提供了一个用于查询 Prometheus 指标的 RESTful API,这是一个很棒的新功能。同时,PR 还对 mock 服务的可观测性部分进行了重构,引入了服务版本的概念,并增加了 HTTP 延迟注入功能,增强了测试能力。

代码整体结构清晰,但我在以下几个方面发现了一些可以改进的地方:

  1. 性能: 在 metric_service 中,每次查询都会执行额外的服务和指标存在性检查,这会增加延迟。建议进行优化。
  2. 安全性: prometheus_client 中的 PromQL 查询构建方式存在注入风险。
  3. 代码健壮性和可维护性: 部分代码(如错误处理、随机数生成)可以更健壮和易于维护。
  4. 可观测性: 移除了 ServiceInstanceID 可能会降低分布式追踪的粒度。

具体的修改建议请见各文件的评论。

Comment on lines +46 to +57
if len(warnings) > 0 {
// 记录警告但不返回错误
fmt.Printf("Prometheus warnings: %v\n", warnings)
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

直接使用 fmt.Printf 输出警告信息会绕过项目统一的结构化日志系统(如 zerolog),导致日志格式不一致,且难以被日志采集和分析系统处理。建议使用注入的 logger 实例来记录警告,例如 log.Warn()...

Suggested change
if len(warnings) > 0 {
// 记录警告但不返回错误
fmt.Printf("Prometheus warnings: %v\n", warnings)
}
if len(warnings) > 0 {
// 使用日志组件记录警告
log.Warn().Strs("warnings", warnings).Msg("Prometheus query generated warnings")
}

Comment on lines +78 to +88
if len(warnings) > 0 {
fmt.Printf("Prometheus warnings: %v\n", warnings)
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

直接使用 fmt.Printf 输出警告信息会绕过项目统一的结构化日志系统(如 zerolog),导致日志格式不一致,且难以被日志采集和分析系统处理。建议使用注入的 logger 实例来记录警告,例如 log.Warn()...

Suggested change
if len(warnings) > 0 {
fmt.Printf("Prometheus warnings: %v\n", warnings)
}
if len(warnings) > 0 {
log.Warn().Strs("warnings", warnings).Msg("Prometheus query for available metrics generated warnings")
}

Comment on lines +133 to +152
func BuildQuery(service, metric, version string) string {
// 基础查询
query := fmt.Sprintf(`%s{service_name="%s"`, metric, service)

// 如果指定了版本,添加版本过滤
if version != "" {
query += fmt.Sprintf(`,service_version="%s"`, version)
}

query += "}"
return query
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

直接使用 fmt.Sprintf 拼接 PromQL 查询字符串存在安全隐患,如果 serviceversion 参数包含特殊字符(如 "),可能会导致查询语句注入,引发语法错误或非预期的查询行为。为了保证查询的安全性,应该对标签值进行正确的转义。

建议使用 %q 格式化占位符,它会自动为字符串添加双引号并转义其中的特殊字符。

Suggested change
func BuildQuery(service, metric, version string) string {
// 基础查询
query := fmt.Sprintf(`%s{service_name="%s"`, metric, service)
// 如果指定了版本,添加版本过滤
if version != "" {
query += fmt.Sprintf(`,service_version="%s"`, version)
}
query += "}"
return query
}
func BuildQuery(service, metric, version string) string {
var queryBuilder strings.Builder
queryBuilder.WriteString(metric)
queryBuilder.WriteString(fmt.Sprintf(`{service_name=%q`, service))
// 如果指定了版本,添加版本过滤
if version != "" {
queryBuilder.WriteString(fmt.Sprintf(`,service_version=%q`, version))
}
queryBuilder.WriteString("}")
return queryBuilder.String()
}

Comment on lines +39 to +80
func (s *MetricService) QueryMetric(ctx context.Context, service, metric, version string, start, end time.Time, step time.Duration) (*model.MetricQueryResponse, error) {
// 动态验证服务是否存在
serviceExists, err := s.promClient.CheckServiceExists(ctx, service)
if err != nil {
log.Error().Err(err).Str("service", service).Msg("failed to check service existence")
return nil, &model.PrometheusError{Message: err.Error()}
}
if !serviceExists {
return nil, &model.ServiceNotFoundError{Service: service}
}

// 动态验证指标是否存在
metricExists, err := s.promClient.CheckMetricExists(ctx, metric)
if err != nil {
log.Error().Err(err).Str("metric", metric).Msg("failed to check metric existence")
return nil, &model.PrometheusError{Message: err.Error()}
}
if !metricExists {
return nil, &model.MetricNotFoundError{Metric: metric}
}

// 构建 PromQL 查询
query := client.BuildQuery(service, metric, version)
log.Debug().Str("query", query).Msg("executing prometheus query")

// 执行查询
dataPoints, err := s.promClient.QueryRange(ctx, query, start, end, step)
if err != nil {
log.Error().Err(err).Str("query", query).Msg("failed to query prometheus")
return nil, &model.PrometheusError{Message: err.Error()}
}

// 构建响应
response := &model.MetricQueryResponse{
Service: service,
Version: version,
Metric: metric,
Data: dataPoints,
}

return response, nil
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

在每次 QueryMetric 调用中,都执行了 CheckServiceExistsCheckMetricExists 两次额外的 Prometheus 查询。这会显著增加 API 的响应延迟,尤其是在高并发场景下。这种“预检”操作是性能瓶颈。

建议优化此逻辑:

  1. 移除预检:直接执行最终的 QueryRange 查询。如果结果为空,可以直接返回空数据集,这通常是合法的。如果需要区分“不存在”和“无数据”的场景,可以在查询结果为空时,再进行存在性检查以返回更精确的错误信息。
  2. 添加缓存:如果必须保留预检,可以为服务和指标的存在性检查结果添加一个带 TTL 的缓存(例如使用 sync.Map 或第三方缓存库),因为这些信息通常不会频繁变动。

Comment on lines 49 to 55
start, end, err := api.parseTimeRange(startStr, endStr)
if err != nil {
log.Error().Err(err).Msg("invalid time parameters")
api.sendErrorResponse(c, http.StatusBadRequest, model.ErrorCodeInvalidParameter,
fmt.Sprintf("参数 'start/end' 格式错误: %s", err.Error()), nil)
return
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

当前对 startend 参数的错误处理不够精确。当 parseTimeRange 返回错误时,错误消息 参数 'start/end' 格式错误 无法明确指出是哪个参数出了问题,这会给 API 使用者带来困惑。建议重构时间解析逻辑,为 startend 提供独立的错误校验和返回信息,从而提高错误提示的准确性。

Comment on lines 148 to 175
func (api *Api) sendErrorResponse(c *fox.Context, statusCode int, errorCode, message string, extras map[string]string) {
errorDetail := model.ErrorDetail{
Code: errorCode,
Message: message,
}

// 添加额外的字段
if extras != nil {
if service, ok := extras["service"]; ok {
errorDetail.Service = service
}
if metric, ok := extras["metric"]; ok {
errorDetail.Metric = metric
}
if parameter, ok := extras["parameter"]; ok {
errorDetail.Parameter = parameter
}
if value, ok := extras["value"]; ok {
errorDetail.Value = value
}
}

response := model.ErrorResponse{
Error: errorDetail,
}

c.JSON(statusCode, response)
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

使用 map[string]string 类型的 extras 参数来传递额外的错误详情字段,这种方式不够类型安全且容易出错(例如,硬编码的键名拼写错误)。

建议采用更健壮的方式,例如使用函数式选项模式(functional options)来构建 ErrorDetail,以增强代码的可读性和可维护性。

例如,可以将函数签名修改为:

func (api *Api) sendErrorResponse(c *fox.Context, statusCode int, errorCode, message string, opts ...func(*model.ErrorDetail))

这样调用时会更清晰和安全:

api.sendErrorResponse(c, http.StatusNotFound, model.ErrorCodeServiceNotFound,
    err.Error(), func(d *model.ErrorDetail) { d.Service = service })

Comment on lines +126 to +130
if config.Probability < 1.0 {
// 简单的概率实现(生产环境应使用更好的随机数)
if time.Now().UnixNano()%100 >= int64(config.Probability*100) {
return 0
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

使用 time.Now().UnixNano()%100 来实现概率判断不是一种健壮的随机方法。它的结果依赖于请求到达的具体纳秒时间,分布可能不均匀,并且在测试中难以复现。建议使用 math/rand 包来生成一个浮点数进行比较,这样更符合通用的概率实现方式,也更可靠。

由于项目使用的 Go 版本是 1.24,math/rand 已经默认自动播种,无需手动处理。

if config.Probability < 1.0 && rand.Float64() > config.Probability {
			return 0
		}

Comment on lines +185 to +191
return resource.New(context.Background(),
resource.WithAttributes(
semconv.ServiceName(config.ServiceName),
semconv.ServiceVersion(config.ServiceVersion),
semconv.DeploymentEnvironment(config.Environment),
),
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

createResource 函数中,semconv.ServiceInstanceID 属性被移除了。虽然错误注入逻辑已改为基于服务版本,但在分布式追踪和指标监控中,实例ID(service.instance.id)仍然是区分同一服务不同副本的关键标识。移除它会降低可观测性数据的粒度,使得问题排查(例如定位到某个具体 Pod 的问题)变得更加困难。

建议重新添加 semconv.ServiceInstanceID 属性,以保持对服务实例的精细化监控能力。

	return resource.New(context.Background(),
		resource.WithAttributes(
			semconv.ServiceName(config.ServiceName),
			semconv.ServiceVersion(config.ServiceVersion),
			semconv.DeploymentEnvironment(config.Environment),
			semconv.ServiceInstanceID(utils.GetInstanceID(config.ServiceName)),
		),
	)


// 尝试从配置文件加载
// 获取服务版本(默认为 1.0.0)
serviceVersion := "1.0.0"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

服务版本 serviceVersion 被硬编码为 "1.0.0"。这使得为不同版本的服务配置错误注入变得不灵活。为了更好地支持多版本测试场景,建议将服务版本变为可配置项,例如通过环境变量或服务配置文件来指定。

serviceVersion := os.Getenv("SERVICE_VERSION")
	if serviceVersion == "" {
		serviceVersion = "1.0.0"
	}

Copy link

@niupilot niupilot bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review Summary

I've identified several critical issues that require immediate attention:

🚨 Critical: Missing github.com/prometheus/procfs dependency will cause compilation failures
⚠️ Security: Time range validation needs strengthening to prevent expensive queries
🔧 Bug Fix: Silent step duration parsing failures should be handled properly
🏗️ Architecture: Input validation missing for metric names to prevent PromQL injection

These suggestions address the most critical issues with concrete code improvements.

github.com/prometheus/common v0.66.1
github.com/redis/go-redis/v9 v9.5.1
github.com/rs/zerolog v1.34.0
)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add missing prometheus/procfs dependency that is imported in metrics.go:14. Without this dependency, the code will fail to compile. The version 0.15.1 is compatible with the current prometheus client version.

Suggested change
)
github.com/prometheus/procfs v0.15.1
github.com/prometheus/client_golang v1.23.2

}

// 验证时间范围的合理性
if end.Before(start) {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add maximum time range validation to prevent expensive Prometheus queries that could impact performance. The original code only checked if end is before start, but didn't limit the maximum range.

Suggested change
if end.Before(start) {
// 验证时间范围的合理性
if end.Before(start) {
return time.Time{}, time.Time{}, fmt.Errorf("end time must be after start time")
}
// 防止查询时间范围过大(最大24小时)
if end.Sub(start) > 24*time.Hour {
return time.Time{}, time.Time{}, fmt.Errorf("time range cannot exceed 24 hours")
}

Comment on lines 25 to 28
rulesFilePath := os.Getenv("PROMETHEUS_RULES_FILE")
if rulesFilePath == "" {
rulesFilePath = "/etc/prometheus/rules/alert_rules.yml"
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

我们的服务会与prometheus服务在一个机器下么?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

目前是

@JaD1ng JaD1ng changed the title eat(prometheus_adapter): 新增prometheus_adapter实现指标查询功能 feat(prometheus_adapter): 新增prometheus_adapter实现指标查询功能 Sep 24, 2025
实现HTTP中间件记录请求时延并导出到Prometheus指标
添加服务信息到指标收集器
统一代码格式和修复缩进问题
添加Prometheus Adapter API文档
重构指标注入器以支持服务版本维度
移除冗余的exported_job标签和实例ID生成
新增HTTP延迟注入器并与中间件集成
更新HTTP时延指标名称使其更简洁,同时改进健康检查脚本以支持批量检查多个端口
实现Prometheus适配器模块,包括以下主要功能:
- 添加Prometheus客户端封装,支持指标查询和范围查询
- 实现指标服务层,提供指标列表获取和指标数据查询
- 添加API路由和控制器,提供RESTful接口
- 定义模型结构体和错误处理机制
- 更新依赖添加Prometheus客户端库
- 编写详细API文档说明接口使用方式
新增告警规则同步API及服务实现,支持将规则同步到Prometheus并触发重载
重构API层提取公共错误处理和工具方法到通用模块
添加相关模型定义和文档更新
- 在prometheus.yml中配置告警规则文件路径
- 修改docker-compose.yml挂载规则目录
- 重构AlertService,移除本地文件存储,直接写入容器
- 添加容器内规则文件写入的容错机制
- 将watch_time字段从AlertRuleMeta移到AlertRule中
- 移除全量同步接口,改为增量更新方式
- 实现批量更新规则元信息的API
- 重构服务层代码结构,提高可维护性
- 更新文档
添加告警规则本地文件持久化功能,支持启动时加载和关闭时保存规则
重构关闭逻辑实现优雅关闭,包括保存当前规则状态
更新构建和部署脚本以处理规则文件目录
修改测试脚本以适配新的增量更新接口
添加prometheus_adapter.yml配置文件支持
重构alert_service使用配置而非环境变量
新增alert_webhook_service实现告警轮询推送
更新build.sh和deploy.sh支持配置文件部署
更新README文档说明新的webhook架构
- 新增Alertmanager API v2兼容接口用于接收Prometheus告警
- 重构告警服务架构,替换原有的轮询模式为推送模式
- 添加docker-compose配置支持Prometheus管理API
- 移除过时的AlertWebhookService实现
实现删除告警规则模板及其关联元信息的功能,包括:
1. 添加DELETE /v1/alert-rules/:rule_name接口删除规则模板
2. 添加DELETE /v1/alert-rules-meta/:rule_name接口删除特定元信息
3. 更新相关文档说明删除操作的使用方法
- 在deploy.sh和build.sh中添加PID文件管理,优化服务启动和停止流程
- 修复alert_service.go中告警表达式生成的标签处理逻辑
- 使用%g代替%f格式化浮点数以避免科学计数法显示
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants