57 changes: 56 additions & 1 deletion README.md
@@ -16,7 +16,7 @@ git clone https://github.com/openstack-exporter/helm-charts.git

# Package the chart
cd helm-charts/charts/prometheus-openstack-exporter/
helm package .
helm package .

# Get chart version & install
version="$(awk '/^version:/{ print $NF }' Chart.yaml)"
@@ -26,3 +26,58 @@ helm install prometheus-openstack-exporter prometheus-openstack-exporter-${versi
## Contributing

Please file pull requests or issues on GitHub.



## OpenStack volumes can be in one of the following statuses:
openstack_cinder_volume_status:

| Status | Value |
|---------------------|---------|
|"creating" | 0 |
|"available" | 1 |
|"reserved" | 2 |
|"attaching" | 3 |
|"detaching" | 4 |
|"in-use" | 5 |
|"maintenance" | 6 |
|"deleting" | 7 |
|"awaiting-transfer" | 8 |
|"error" | 9 |
|"error_deleting" | 10 |
|"backing-up" | 11 |
|"restoring-backup" | 12 |
|"error_backing-up" | 13 |
|"error_restoring" | 14 |
|"error_extending" | 15 |
|"downloading" | 16 |
|"uploading" | 17 |
|"retyping" | 18 |
|"extending" | 19 |
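
The table above can be used directly in PromQL comparisons. The fragment below is a minimal sketch of a plain Prometheus rule file (not a file shipped by this chart), assuming the sample value of `openstack_cinder_volume_status` carries the number from the table and that the series has an `id` label. The group name and alert name are hypothetical.

```yaml
# Illustrative Prometheus rule-file fragment; names are assumptions.
groups:
  - name: cinder-volume-status-example    # hypothetical group name
    rules:
      - alert: CinderVolumeInErrorExample  # hypothetical alert name
        # 9 is the value listed for "error" in the table above.
        expr: openstack_cinder_volume_status == 9
        for: 24h
        labels:
          severity: P4
        annotations:
          summary: "Volume {{ $labels.id }} is in error state"
```

If a rule like this is passed through a Helm template instead of being loaded as a plain file, the `{{ $labels.id }}` expression would need escaping so that Helm does not try to evaluate it.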

## OpenStack servers can be in one of the following statuses:
openstack_nova_server_status:

| Status            | Value | Description |
|-------------------|-------|-------------|
| ACTIVE            | 0     | The server is active. |
| BUILD             | 1     | The server has not finished the original build process. |
| BUILD(spawning)   | 2     | The server has not finished the original build process, but networking works (HP Cloud specific). |
| DELETED           | 3     | The server is deleted. |
| ERROR             | 4     | The server is in error. |
| HARD_REBOOT       | 5     | The server is hard rebooting. |
| PASSWORD          | 6     | The password is being reset on the server. |
| REBOOT            | 7     | The server is in a soft reboot state. |
| REBUILD           | 8     | The server is currently being rebuilt from an image. |
| RESCUE            | 9     | The server is in rescue mode. |
| RESIZE            | 10    | The server is performing the differential copy of data that changed during its initial copy. |
| SHUTOFF           | 11    | The virtual machine (VM) was powered down by the user, but not through the OpenStack Compute API. |
| SUSPENDED         | 12    | The server is suspended, either by request or necessity. |
| UNKNOWN           | 13    | The state of the server is unknown. Contact your cloud provider. |
| VERIFY_RESIZE     | 14    | The system is awaiting confirmation that the server is operational after a move or resize. |
| MIGRATING         | 15    | The server is migrating. This is caused by a live migration (moving a server that is active). |
| PAUSED            | 16    | The server is paused. |
| REVERT_RESIZE     | 17    | The resize or migration of a server failed for some reason. The destination server is being cleaned up and the original source server is restarting. |
| SHELVED           | 18    | The server is in shelved state. Depending on the shelve offload time, the server will be offloaded automatically. |
| SHELVED_OFFLOADED | 19    | The shelved server is offloaded (removed from the compute host) and needs an unshelve action to be used again. |
| SOFT_DELETED      | 20    | The server is marked as deleted but will remain in the cloud for a configurable amount of time. |
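
As a rough illustration of how the server status values might be consumed, the sketch below defines two recording rules in a plain Prometheus rule file. It assumes `openstack_nova_server_status` exposes a `status` label (as the selector `{status="ERROR"}` elsewhere in this change suggests) and that its sample value matches the table above. The group and record names are hypothetical.

```yaml
# Illustrative Prometheus rule-file fragment; names are assumptions.
groups:
  - name: nova-server-status-example           # hypothetical group name
    rules:
      # Number of servers per status, grouped by the status label.
      - record: openstack:nova_server_status:count
        expr: count by (status) (openstack_nova_server_status)
      # Number of servers whose value is 4 (ERROR in the table above).
      - record: openstack:nova_server_status_error:count
        expr: count(openstack_nova_server_status == 4)
```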
4 changes: 2 additions & 2 deletions charts/prometheus-openstack-exporter/Chart.yaml
@@ -1,5 +1,5 @@
---
apiVersion: v1
name: prometheus-openstack-exporter
version: 0.4.3
appVersion: v1.6.0
version: 0.6.3
appVersion: v1.7.0
246 changes: 46 additions & 200 deletions charts/prometheus-openstack-exporter/templates/prometheusrule.yaml
@@ -5,206 +5,52 @@ metadata:
name: {{ include "openstack-exporter.fullname" . }}
namespace: {{ .Release.Namespace }}
labels:
{{ include "openstack-exporter.labels" . | indent 4 }}
{{- include "openstack-exporter.labels" . | indent 4 }}
spec:
groups:
- name: cinder
{{- range $groupName, $group := .Values.promethuesRules }}
{{- if (dig "enabled" true $group )}}
- name: {{ $groupName }}
rules:
- alert: CinderAgentDown
expr: |
openstack_cinder_agent_state != 1
labels:
severity: P4
annotations:
summary: "[`{{`{{$labels.hostname}}`}}`] `{{`{{$labels.exported_service}}`}}` down"
description: >
The service `{{`{{$labels.exported_service}}`}}` running on `{{`{{$labels.hostname}}`}}`
is being reported as down.

- alert: CinderAgentDown
for: 5m
expr: |
openstack_cinder_agent_state != 1
labels:
severity: P3
annotations:
summary: "[`{{`{{$labels.hostname}}`}}`] `{{`{{$labels.exported_service}}`}}` down"
description: >
The service `{{`{{$labels.exported_service}}`}}` running on `{{`{{$labels.hostname}}`}}`
is being reported as down for 5 minutes. This can affect volume operations so it must
be resolved as quickly as possible.

- alert: CinderAgentDisabled
for: 1h
expr: |
openstack_cinder_agent_state{adminState!="enabled"}
labels:
severity: P5
annotations:
summary: "[`{{`{{$labels.hostname}}`}}`] `{{`{{$labels.exported_service}}`}}` disabled"
description: >
The service `{{`{{$labels.exported_service}}`}}` running on `{{`{{$labels.hostname}}`}}`
has been disabled for 60 minutes. This can affect volume operations so it must be resolved
as quickly as possible.

- alert: CinderVolumeInError
for: 24h
expr: |
openstack_cinder_volume_status{status=~"error.*"}
labels:
severity: P4
annotations:
summary: "[`{{`{{$labels.id}}`}}`] Volume in ERROR state"
description: >
The volume `{{`{{$labels.id}}`}}` has been in ERROR state for over 24 hours. It must
be cleaned up or removed in order to provide a consistent customer experience.


- name: neutron
rules:
- alert: NeutronAgentDown
expr: |
openstack_neutron_agent_state != 1
labels:
severity: P4
annotations:
summary: "[`{{`{{$labels.hostname}}`}}`] `{{`{{$labels.exported_service}}`}}` down"
description: >
The service `{{`{{$labels.exported_service}}`}}` running on `{{`{{$labels.hostname}}`}}`
is being reported as down.

- alert: NeutronAgentDown
for: 5m
expr: |
openstack_neutron_agent_state != 1
labels:
severity: P3
annotations:
summary: "[`{{`{{$labels.hostname}}`}}`] `{{`{{$labels.exported_service}}`}}` down"
description: >
The service `{{`{{$labels.exported_service}}`}}` running on `{{`{{$labels.hostname}}`}}`
is being reported as down for 5 minutes. This can affect network operations so it must
be resolved as quickly as possible.

- alert: NeutronAgentDisabled
for: 1h
expr: |
openstack_neutron_agent_state{adminState!="up"}
labels:
severity: P5
annotations:
summary: "[`{{`{{$labels.hostname}}`}}`] `{{`{{$labels.exported_service}}`}}` disabled"
description: >
The service `{{`{{$labels.exported_service}}`}}` running on `{{`{{$labels.hostname}}`}}`
has been disabled for 60 minutes. This can affect network operations so it must be resolved
as quickly as possible.

- alert: NeutronBindingFailedPorts
expr: |
openstack_neutron_port{binding_vif_type="binding_failed"} != 0
labels:
severity: P3
annotations:
summary: "[`{{`{{$labels.device_owner}}`}}`] `{{`{{$labels.mac_address}}`}}` binding failed"
description: >
The NIC `{{`{{$labels.mac_address}}`}}` of `{{`{{$labels.device_owner}}`}}`
has binding failed port now.

- alert: NeutronNetworkOutOfIPs
expr: |
sum by (network_id) (openstack_neutron_network_ip_availabilities_used{project_id!=""}) / sum by (network_id) (openstack_neutron_network_ip_availabilities_total{project_id!=""}) * 100 > 80
labels:
severity: P4
annotations:
summary: "[`{{`{{$labels.network_name}}`}}`] `{{`{{$labels.subnet_name}}`}}` running out of IPs"
description: >
The subnet `{{`{{$labels.subnet_name}}`}}` within `{{`{{$labels.network_name}}`}}`
is currently at `{{`{{$value}}`}}`% utilization. If the IP addresses run out, it will
impact the provisioning of new ports.


- name: nova
rules:
- alert: NovaAgentDown
expr: |
openstack_nova_agent_state != 1
labels:
severity: P4
annotations:
summary: "[`{{`{{$labels.hostname}}`}}`] `{{`{{$labels.exported_service}}`}}` down"
description: >
The service `{{`{{$labels.exported_service}}`}}` running on `{{`{{$labels.hostname}}`}}`
is being reported as down.

- alert: NovaAgentDown
for: 5m
expr: |
openstack_nova_agent_state != 1
labels:
severity: P3
annotations:
summary: "[`{{`{{$labels.hostname}}`}}`] `{{`{{$labels.exported_service}}`}}` down"
description: >
The service `{{`{{$labels.exported_service}}`}}` running on `{{`{{$labels.hostname}}`}}`
is being reported as down. This can affect compute operations so it must be resolved
as quickly as possible.

- alert: NovaAgentDisabled
for: 1h
expr: |
openstack_nova_agent_state{adminState!="enabled"}
labels:
severity: P5
annotations:
summary: "[`{{`{{$labels.hostname}}`}}`] `{{`{{$labels.exported_service}}`}}` disabled"
description: >
The service `{{`{{$labels.exported_service}}`}}` running on `{{`{{$labels.hostname}}`}}`
has been disabled for 60 minutes. This can affect compute operations so it must be resolved
as quickly as possible.

- alert: NovaInstanceInError
for: 24h
expr: |
openstack_nova_server_status{status="ERROR"}
labels:
severity: P4
annotations:
summary: "[`{{`{{$labels.id}}`}}`] Instance in ERROR state"
description: >
The instance `{{`{{$labels.id}}`}}` has been in ERROR state for over 24 hours. It must
be cleaned up or removed in order to provide a consistent customer experience.

- alert: NovaFailureRisk
for: 6h
expr: |
(sum(openstack_nova_memory_available_bytes-openstack_nova_memory_used_bytes) - max(openstack_nova_memory_used_bytes)) / sum(openstack_nova_memory_available_bytes-openstack_nova_memory_used_bytes) * 100 < 0.25
labels:
severity: P4
annotations:
summary: "[nova] Failure risk"
description: >
The cloud capacity will be at `{{`{{$value}}`}}` in the event of the failure of a single
hypervisor which puts the cloud at risk of not being able to recover should any hypervisor
failures occur. Please ensure that adequate amount of infrastructure is assigned to this
deployment to prevent this.

- alert: NovaCapacity
for: 6h
expr: |
sum (
openstack_nova_memory_used_bytes
+ on(hostname) group_left(adminState)
(0 * openstack_nova_agent_state{exported_service="nova-compute",adminState="enabled"})
) / sum (
openstack_nova_memory_available_bytes
+ on(hostname) group_left(adminState)
(0 * openstack_nova_agent_state{exported_service="nova-compute",adminState="enabled"})
) * 100 > 75
labels:
severity: P4
annotations:
summary: "[nova] Capacity risk"
description: >
The cloud capacity is currently at `{{`{{$value}}`}}` which means there is a risk of running
out of capacity due to the timeline required to add new nodes. Please ensure that adequate
amount of infrastructure is assigned to this deployment to prevent this.
{{- range $ruleName, $rule := $group.rules }}
{{- if (dig "enabled" true $rule )}}
- # {{ $ruleName }}
{{- with $rule.alert }}
alert: {{ . }}
{{- end }}

{{- with $rule.expr }}
expr: {{ tpl . $ | quote }}
{{- end }}

{{- with $rule.record }}
record: {{ . }}
{{- end }}

{{- with $rule.for }}
for: {{ . }}
{{- end }}

{{- with $rule.keep_firing_for }}
keep_firing_for: {{ . }}
{{- end }}

{{- with $rule.labels }}
labels:
{{- range $k,$v := . }}
{{ $k }}: {{ tpl $v $ | quote }}
{{- end }}
{{- end }}

{{- with $rule.annotations }}
annotations:
{{- range $k, $v := . }}
{{ $k }}: |
{{- tpl $v $ | nindent 10 }}
{{- end }}
{{- end }}

{{- end }}
{{- end }}
{{- end }}
{{- end }}
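
For orientation, the fragment below is a hypothetical `values.yaml` sketch of the structure the new template appears to read: a map of groups under `promethuesRules` (spelled as in the template), each with an optional `enabled` flag and a `rules` map keyed by rule name. The concrete groups, rules, and defaults shipped by the chart are not shown in this diff, so everything below is an assumption based on the old hard-coded rules.

```yaml
# Hypothetical values.yaml fragment; keys mirror what the template reads.
promethuesRules:          # key spelled exactly as in the template
  cinder:                 # becomes the rule group name
    enabled: true         # optional; the template defaults to true
    rules:
      CinderAgentDown:    # rule name, rendered as a comment in the output
        alert: CinderAgentDown
        for: 5m
        expr: openstack_cinder_agent_state != 1
        labels:
          severity: P3    # label values go through tpl, so keep them strings
        annotations:
          # Annotations also go through tpl, so Prometheus expressions are
          # wrapped in a raw string to survive Helm rendering.
          summary: "[{{`{{$labels.hostname}}`}}] {{`{{$labels.exported_service}}`}} down"
      CinderVolumeInError:
        enabled: false    # a single rule can be switched off
        alert: CinderVolumeInError
        for: 24h
        expr: openstack_cinder_volume_status{status=~"error.*"}
        labels:
          severity: P4
```

Because `expr`, label values, and annotation values are all rendered with `tpl`, any `{{ ... }}` meant for Prometheus has to be escaped as in the `summary` above, while ordinary Helm expressions in those fields would be evaluated at install time.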
@@ -13,4 +13,4 @@ spec:
port: 9180
targetPort: metrics
selector:
{{- include "openstack-exporter.labels" . | indent 4 }}
{{- include "openstack-exporter.labels" . | indent 4 }}