Commit aa589ce

Add "retention" feature allowing idle metrics to expire
Signed-off-by: Philipp Hossner <philipp.hossner@posteo.de>
1 parent ef53c83 commit aa589ce

7 files changed, +335 -26 lines changed

README.md

Lines changed: 35 additions & 0 deletions
@@ -286,6 +286,8 @@ For details of each metric type, see [Prometheus documentation](http://prometheu
286286
- `type`: metric type (required)
287287
- `desc`: description of this metric (required)
288288
- `key`: key name of record for instrumentation (**optional**)
289+
- `retention`: time in seconds after which a metric that has not been updated is removed (optional). See [Retention](#retention)
290+
- `retention_check_interval`: interval in seconds between checks for expired metrics (optional). Has no effect when `retention` is not set. See [Retention](#retention)
289291
- `<labels>`: additional labels for this metric (optional). See [Labels](#labels)
290292

291293
If key is empty, the metric value is treated as 1, so the counter increments by 1 on each record regardless of the contents of the record.
@@ -310,6 +312,8 @@ If key is empty, the metric values is treated as 1, so the counter increments by
310312
- `type`: metric type (required)
311313
- `desc`: description of metric (required)
312314
- `key`: key name of record for instrumentation (required)
315+
- `retention`: time in seconds after which a metric that has not been updated is removed (optional). See [Retention](#retention)
316+
- `retention_check_interval`: interval in seconds between checks for expired metrics (optional). Has no effect when `retention` is not set. See [Retention](#retention)
313317
- `<labels>`: additional labels for this metric (optional). See [Labels](#labels)
314318

315319
### summary type
@@ -332,6 +336,8 @@ If key is empty, the metric values is treated as 1, so the counter increments by
332336
- `type`: metric type (required)
333337
- `desc`: description of metric (required)
334338
- `key`: key name of record for instrumentation (required)
339+
- `retention`: time in seconds after which a metric that has not been updated is removed (optional). See [Retention](#retention)
340+
- `retention_check_interval`: interval in seconds between checks for expired metrics (optional). Has no effect when `retention` is not set. See [Retention](#retention)
335341
- `<labels>`: additional labels for this metric (optional). See [Labels](#labels)
336342

337343
### histogram type
@@ -356,6 +362,8 @@ If key is empty, the metric values is treated as 1, so the counter increments by
356362
- `desc`: description of metric (required)
357363
- `key`: key name of record for instrumentation (required)
358364
- `buckets`: buckets of record for instrumentation (optional)
365+
- `retention`: time in seconds after which a metric that has not been updated is removed (optional). See [Retention](#retention)
366+
- `retention_check_interval`: interval in seconds between checks for expired metrics (optional). Has no effect when `retention` is not set. See [Retention](#retention)
359367
- `<labels>`: additional labels for this metric (optional). See [Labels](#labels)
360368

361369
## Labels
@@ -430,6 +438,33 @@ Prometheus output/filter plugin can have multiple metric section. Top-level labe
430438

431439
In this case, `message_foo_counter` has `tag`, `hostname`, `key` and `data_type` labels.
432440

441+
## Retention
442+
443+
By default, metrics for every encountered label combination are preserved until the next restart of fluentd,
444+
even if a label combination has not received any update for a long time.
445+
That behavior is not always desirable, e.g. when the contents of fields change for good and the metric becomes idle.
446+
For these metrics you can set `retention` and `retention_check_interval` like this:
447+
448+
```
449+
<metric>
450+
name message_foo_counter
451+
type counter
452+
desc The total number of foo in message.
453+
key foo
454+
retention 3600 # 1h
455+
retention_check_interval 1800 # 30m
456+
<labels>
457+
bar ${bar}
458+
</labels>
459+
</metric>
460+
```
461+
462+
If `${bar}` was `baz` once but no records with that value were processed afterwards, then the metric
463+
`message_foo_counter{bar="baz"}` becomes eligible for removal after one hour.
464+
When it is actually removed depends on `retention_check_interval` (default 60).
465+
In this example, a background thread checks for expired metrics every 30 minutes.
466+
So in the worst case, a metric is removed 30 minutes after it expires.
467+
You can set this value as low as `1`, but that may put more load on your CPU.
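The timing described in the Retention section can be sketched in a few lines of Ruby. This is a minimal illustration under assumptions, not the plugin's code: the helper `expired_label_sets`, the label sets, and the fixed timestamps are all hypothetical, and the checker is simulated with a loop instead of a background thread.

```ruby
# Hypothetical helper: returns the label sets whose last update is
# strictly older than `retention` seconds at time `now`.
def expired_label_sets(last_updated, retention, now)
  cutoff = now - retention
  last_updated.select { |_labels, updated_at| updated_at < cutoff }.keys
end

start = Time.at(0)
last_updated = {
  { bar: 'baz' } => start,        # seen once at t = 0, then idle
  { bar: 'qux' } => start + 3000  # updated again at t = 3000
}

retention = 3600       # 1h, as in the example configuration
check_interval = 1800  # 30m, as in the example configuration

# Simulate the checker waking up at t = 0, 1800, 3600, 5400, ...
removed_at = nil
(0..4).each do |tick|
  now = start + tick * check_interval
  expired = expired_label_sets(last_updated, retention, now)
  if expired.include?({ bar: 'baz' })
    removed_at = now - start
    break
  end
end

puts removed_at
```

In this sketch the `baz` label set reaches its one-hour retention exactly when a check runs, is not yet strictly older than the cutoff, and is therefore only picked up by the following check, one interval later.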
433468

434469
## Try plugin with nginx
435470

lib/fluent/plugin/filter_prometheus.rb

Lines changed: 13 additions & 0 deletions
@@ -7,6 +7,8 @@ class PrometheusFilter < Fluent::Plugin::Filter
77
include Fluent::Plugin::PrometheusLabelParser
88
include Fluent::Plugin::Prometheus
99

10+
helpers :thread
11+
1012
def initialize
1113
super
1214
@registry = ::Prometheus::Client.registry
@@ -22,6 +24,17 @@ def configure(conf)
2224
@metrics = Fluent::Plugin::Prometheus.parse_metrics_elements(conf, @registry, labels)
2325
end
2426

27+
def start
28+
super
29+
Fluent::Plugin::Prometheus.start_retention_threads(
30+
@metrics,
31+
@registry,
32+
method(:thread_create),
33+
method(:thread_current_running?),
34+
@log
35+
)
36+
end
37+
2538
def filter(tag, time, record)
2639
instrument_single(tag, time, record, @metrics)
2740
record

lib/fluent/plugin/out_prometheus.rb

Lines changed: 13 additions & 0 deletions
@@ -7,6 +7,8 @@ class PrometheusOutput < Fluent::Plugin::Output
77
include Fluent::Plugin::PrometheusLabelParser
88
include Fluent::Plugin::Prometheus
99

10+
helpers :thread
11+
1012
def initialize
1113
super
1214
@registry = ::Prometheus::Client.registry
@@ -22,6 +24,17 @@ def configure(conf)
2224
@metrics = Fluent::Plugin::Prometheus.parse_metrics_elements(conf, @registry, labels)
2325
end
2426

27+
def start
28+
super
29+
Fluent::Plugin::Prometheus.start_retention_threads(
30+
@metrics,
31+
@registry,
32+
method(:thread_create),
33+
method(:thread_current_running?),
34+
@log
35+
)
36+
end
37+
2538
def process(tag, es)
2639
instrument(tag, es, @metrics)
2740
end
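Both plugins hand their thread helpers to a shared module function as `Method` objects, so the retention loop itself does not need to know which plugin owns it. The following is a rough, self-contained sketch of that idea. `FakeMetric`, `Host`, and `start_retention_loops` are hypothetical names for illustration, and the two helper methods on `Host` are simplified stand-ins rather than fluentd's real thread helper implementations.

```ruby
# Hypothetical stand-in for a metric that has retention enabled.
class FakeMetric
  attr_reader :name, :retention_check_interval, :checks

  def initialize(name)
    @name = name
    @retention_check_interval = 0 # no waiting in this sketch
    @checks = 0
  end

  def remove_expired_metrics
    @checks += 1
  end
end

# Hypothetical host object; its two methods are simplified stand-ins
# for the helpers the plugins pass in via method(...).
class Host
  def initialize(max_iterations)
    @remaining = max_iterations
  end

  def thread_create(_title, &block)
    Thread.new(&block)
  end

  # Reports "running" for a fixed number of calls, then stops the loop.
  def thread_current_running?
    (@remaining -= 1) >= 0
  end
end

# The loop only sees two callables, not the object that owns them.
def start_retention_loops(metrics, thread_create, thread_running)
  metrics.map do |metric|
    thread_create.call("retention_#{metric.name}".to_sym) do
      while thread_running.call
        metric.remove_expired_metrics
        sleep(metric.retention_check_interval)
      end
    end
  end
end

host = Host.new(3)
metric = FakeMetric.new('message_foo_counter')
threads = start_retention_loops(
  [metric],
  host.method(:thread_create),
  host.method(:thread_current_running?)
)
threads.each(&:join)
puts metric.checks
```

Passing callables this way keeps the loop testable without a running plugin, which is presumably part of the motivation for the shape of `start_retention_threads` in the diff.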

lib/fluent/plugin/prometheus.rb

Lines changed: 123 additions & 26 deletions
@@ -1,6 +1,7 @@
11
require 'prometheus/client'
22
require 'prometheus/client/formats/text'
33
require 'fluent/plugin/prometheus/placeholder_expander'
4+
require 'fluent/plugin/prometheus/data_store'
45

56
module Fluent
67
module Plugin
@@ -81,6 +82,17 @@ def self.parse_metrics_elements(conf, registry, labels = {})
8182
metrics
8283
end
8384

85+
def self.start_retention_threads(metrics, registry, thread_create, thread_running, log)
86+
metrics.select { |metric| metric.has_retention? }.each do |metric|
87+
thread_create.call("prometheus_retention_#{metric.name}".to_sym) do
88+
while thread_running.call()
89+
metric.remove_expired_metrics(registry, log)
90+
sleep(metric.retention_check_interval)
91+
end
92+
end
93+
end
94+
end
95+
8496
def self.placeholder_expander(log)
8597
Fluent::Plugin::Prometheus::ExpandBuilder.new(log: log)
8698
end
@@ -97,6 +109,11 @@ def stringify_keys(hash_to_stringify)
97109
end.to_h
98110
end
99111

112+
def initialize
113+
super
114+
::Prometheus::Client.config.data_store = Fluent::Plugin::Prometheus::DataStore.new
115+
end
116+
100117
def configure(conf)
101118
super
102119
@placeholder_values = {}
@@ -151,6 +168,8 @@ class Metric
151168
attr_reader :name
152169
attr_reader :key
153170
attr_reader :desc
171+
attr_reader :retention
172+
attr_reader :retention_check_interval
154173

155174
def initialize(element, registry, labels)
156175
['name', 'desc'].each do |key|
@@ -162,6 +181,11 @@ def initialize(element, registry, labels)
162181
@name = element['name']
163182
@key = element['key']
164183
@desc = element['desc']
184+
@retention = element['retention'].to_i
185+
@retention_check_interval = element.fetch('retention_check_interval', 60).to_i
186+
if has_retention?
187+
@last_modified_store = LastModifiedStore.new
188+
end
165189

166190
@base_labels = Fluent::Plugin::Prometheus.parse_labels_elements(element)
167191
@base_labels = labels.merge(@base_labels)
@@ -192,6 +216,74 @@ def self.get(registry, name, type, docstring)
192216

193217
metric
194218
end
219+
220+
def set_value?(value)
221+
if value
222+
return true
223+
end
224+
false
225+
end
226+
227+
def instrument(record, expander)
228+
value = self.value(record)
229+
if self.set_value?(value)
230+
labels = labels(record, expander)
231+
set_value(value, labels)
232+
if has_retention?
233+
@last_modified_store.set_last_updated(labels)
234+
end
235+
end
236+
end
237+
238+
def has_retention?
239+
@retention > 0
240+
end
241+
242+
def remove_expired_metrics(registry, log)
243+
if has_retention?
244+
metric = registry.get(@name)
245+
246+
expiration_time = Time.now - @retention
247+
expired_label_sets = @last_modified_store.get_labels_not_modified_since(expiration_time)
248+
249+
expired_label_sets.each { |expired_label_set|
250+
log.debug "Metric #{@name} with labels #{expired_label_set} expired. Removing..."
251+
metric.remove(expired_label_set) # this method is supplied by the data_store require at the top of this file
252+
@last_modified_store.remove(expired_label_set)
253+
}
254+
else
255+
log.warn('remove_expired_metrics should not be called when retention is not set for this metric!')
256+
end
257+
end
258+
259+
class LastModifiedStore
260+
def initialize
261+
@internal_store = Hash.new
262+
@lock = Monitor.new
263+
end
264+
265+
def synchronize
266+
@lock.synchronize { yield }
267+
end
268+
269+
def set_last_updated(labels)
270+
synchronize do
271+
@internal_store[labels] = Time.now
272+
end
273+
end
274+
275+
def remove(labels)
276+
synchronize do
277+
@internal_store.delete(labels)
278+
end
279+
end
280+
281+
def get_labels_not_modified_since(time)
282+
synchronize do
283+
@internal_store.select { |k, v| v < time }.keys
284+
end
285+
end
286+
end
195287
end
196288

197289
class Gauge < Metric
@@ -208,16 +300,17 @@ def initialize(element, registry, labels)
208300
end
209301
end
210302

211-
def instrument(record, expander)
303+
def value(record)
212304
if @key.is_a?(String)
213-
value = record[@key]
305+
record[@key]
214306
else
215-
value = @key.call(record)
216-
end
217-
if value
218-
@gauge.set(value, labels: labels(record, expander))
307+
@key.call(record)
219308
end
220309
end
310+
311+
def set_value(value, labels)
312+
@gauge.set(value, labels: labels)
313+
end
221314
end
222315

223316
class Counter < Metric
@@ -230,20 +323,22 @@ def initialize(element, registry, labels)
230323
end
231324
end
232325

233-
def instrument(record, expander)
234-
# use record value of the key if key is specified, otherwise just increment
326+
def value(record)
235327
if @key.nil?
236-
value = 1
328+
1
237329
elsif @key.is_a?(String)
238-
value = record[@key]
330+
record[@key]
239331
else
240-
value = @key.call(record)
332+
@key.call(record)
241333
end
334+
end
242335

243-
# ignore if record value is nil
244-
return if value.nil?
336+
def set_value?(value)
337+
!value.nil?
338+
end
245339

246-
@counter.increment(by: value, labels: labels(record, expander))
340+
def set_value(value, labels)
341+
@counter.increment(by: value, labels: labels)
247342
end
248343
end
249344

@@ -261,16 +356,17 @@ def initialize(element, registry, labels)
261356
end
262357
end
263358

264-
def instrument(record, expander)
359+
def value(record)
265360
if @key.is_a?(String)
266-
value = record[@key]
361+
record[@key]
267362
else
268-
value = @key.call(record)
269-
end
270-
if value
271-
@summary.observe(value, labels: labels(record, expander))
363+
@key.call(record)
272364
end
273365
end
366+
367+
def set_value(value, labels)
368+
@summary.observe(value, labels: labels)
369+
end
274370
end
275371

276372
class Histogram < Metric
@@ -294,16 +390,17 @@ def initialize(element, registry, labels)
294390
end
295391
end
296392

297-
def instrument(record, expander)
393+
def value(record)
298394
if @key.is_a?(String)
299-
value = record[@key]
395+
record[@key]
300396
else
301-
value = @key.call(record)
302-
end
303-
if value
304-
@histogram.observe(value, labels: labels(record, expander))
397+
@key.call(record)
305398
end
306399
end
400+
401+
def set_value(value, labels)
402+
@histogram.observe(value, labels: labels)
403+
end
307404
end
308405
end
309406
end
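The refactoring in this file moves the shared flow into `Metric#instrument` and leaves only `value`, `set_value?`, and `set_value` to the subclasses. A stripped-down sketch of that structure, assuming simplified classes that store values in plain hashes instead of a Prometheus registry, might look like this. All `Sketch*` class names and their accessors are hypothetical.

```ruby
# Simplified stand-ins for the refactored classes; not the plugin's code.
class SketchMetric
  attr_reader :last_updated

  def initialize(key)
    @key = key
    @last_updated = {} # label set => time of last update
  end

  # Shared flow: extract the value, decide, record it, remember when.
  def instrument(record, labels)
    extracted = value(record)
    return unless set_value?(extracted)

    set_value(extracted, labels)
    @last_updated[labels] = Time.now
  end

  # Default rule: skip nil and false values.
  def set_value?(value)
    value ? true : false
  end
end

class SketchGauge < SketchMetric
  attr_reader :current

  def initialize(key)
    super
    @current = {}
  end

  def value(record)
    record[@key]
  end

  def set_value(value, labels)
    @current[labels] = value
  end
end

class SketchCounter < SketchMetric
  attr_reader :totals

  def initialize(key = nil)
    super
    @totals = Hash.new(0)
  end

  # Without a key, every record counts as 1.
  def value(record)
    @key.nil? ? 1 : record[@key]
  end

  # Counters override the rule and only skip nil.
  def set_value?(value)
    !value.nil?
  end

  def set_value(value, labels)
    @totals[labels] += value
  end
end

counter = SketchCounter.new
counter.instrument({ 'foo' => 5 }, { bar: 'baz' })
counter.instrument({}, { bar: 'baz' })
puts counter.totals[{ bar: 'baz' }]

gauge = SketchGauge.new('foo')
gauge.instrument({ 'other' => 1 }, { bar: 'baz' }) # key missing, so skipped
puts gauge.last_updated.empty?
```

Because the last-update timestamp is written in one place, skipped records never refresh a label set, which is what allows idle label sets to expire.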
