node-problem-detector aims to make various node problems visible to the upstream
@@ -12,6 +13,7 @@ Now it is running as a
enabled by default in the GCE cluster.
# Background
+
There are tons of node problems that could possibly affect the pods running on the
node such as:
* Infrastructure daemon issues: ntp service down;
@@ -29,6 +31,7 @@ layers. Once upstream layers have the visibility to those problems, we can discu
[remedy system](#remedy-systems).
# Problem API
+
node-problem-detector uses `Event` and `NodeCondition` to report problems to
apiserver.
* `NodeCondition`: Permanent problem that makes the node unavailable for pods should
@@ -37,6 +40,7 @@ be reported as `NodeCondition`.
should be reported as `Event`.
# Problem Daemon
+
A problem daemon is a sub-daemon of node-problem-detector. It monitors a specific
kind of node problems and reports them to node-problem-detector.
@@ -57,7 +61,9 @@ List of supported problem daemons:
|[CustomPluginMonitor](https://github.com/kubernetes/node-problem-detector/blob/master/config/custom-plugin-monitor.json)| On-demand(According to users configuration) | A custom plugin monitor for node-problem-detector to invoke and check various node problems with user defined check scripts. See proposal [here](https://docs.google.com/document/d/1jK_5YloSYtboj-DtfjmYKxfNnUxCAvohLnsH5aGCAYQ/edit#). |
# Usage
+
## Flags
+
* `--version`: Print current version of node-problem-detector.
* `--address`: The address to bind the node problem detector server.
* `--port`: The port to bind the node problem detector server. Use 0 to disable.
@@ -81,6 +87,7 @@ For example, to run without auth, use the following config:
* `--hostname-override`: A customized node name used for node-problem-detector to update conditions and emit events. node-problem-detector gets the node name first from `hostname-override`, then from the `NODE_NAME` environment variable, and finally falls back to `os.Hostname`.
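
The lookup order described for `--hostname-override` can be sketched in shell as follows. This is only an illustration of the precedence, not the project's actual Go code: `resolve_node_name` is a hypothetical helper, and `uname -n` stands in for `os.Hostname`.

```shell
#!/bin/sh
# Hypothetical helper (not part of node-problem-detector) that mirrors the
# documented precedence: explicit override, then NODE_NAME, then the hostname.
resolve_node_name() {
    override="${1:-}"   # value that would be passed via --hostname-override
    if [ -n "$override" ]; then
        printf '%s\n' "$override"
    elif [ -n "${NODE_NAME:-}" ]; then
        printf '%s\n' "$NODE_NAME"
    else
        uname -n        # assumption: shell counterpart of os.Hostname
    fi
}
```

For instance, `resolve_node_name ""` prints the value of `NODE_NAME` when that variable is set, and the machine's hostname otherwise.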
## Build Image
+
* `go get` or `git clone` node-problem-detector repo into `$GOPATH/src/k8s.io` or `$GOROOT/src/k8s.io`
@@ -97,17 +104,29 @@ You should download the systemd develop files first. For Ubuntu, `libsystemd-jou
be installed. For Debian, `libsystemd-dev` package should be installed.
## Push Image
+
`make push` uploads the docker image to registry. By default, the image will be uploaded to
`staging-k8s.gcr.io`. It's easy to modify the `Makefile` to push the image
to another registry.
-## Start DaemonSet
-* Edit [node-problem-detector.yaml](https://github.com/kubernetes/node-problem-detector/blob/master/deployment/node-problem-detector.yaml) to fit your environment: Set `log` volume to your system log directory. (Used by SystemLogMonitor). For **kubernetes <1.9** use [node-problem-detector-old.yaml](https://github.com/kubernetes/node-problem-detector/blob/master/deployment/node-problem-detector-old.yaml)
-* If needed, you can use [ConfigMap](https://kubernetes.io/docs/tasks/configure-pod-container/configure-pod-configmap/)
-to overwrite the `config/`, Edit [node-problem-detector-config.yaml](https://github.com/kubernetes/node-problem-detector/blob/master/deployment/node-problem-detector-config.yaml) to fit your environment. and create the ConfigMap with `kubectl create -f node-problem-detector-config.yaml`.
+## Installation
+
+The easiest way to install node-problem-detector into your cluster is to use the [Helm](https://helm.sh/) [chart](https://github.com/helm/charts/tree/master/stable/node-problem-detector):
+
+```
+helm install stable/node-problem-detector
+```
+
+Or alternatively, to install node-problem-detector manually:
+
+* Edit [node-problem-detector.yaml](deployment/node-problem-detector.yaml) to fit your environment. Set `log` volume to your system log directory (used by SystemLogMonitor). For Kubernetes versions older than 1.9, use [node-problem-detector-old.yaml](deployment/node-problem-detector-old.yaml).
+
+* If needed, you can use a [ConfigMap](https://kubernetes.io/docs/tasks/configure-pod-container/configure-pod-configmap/) to overwrite the `config` directory inside the pod. Edit [node-problem-detector-config.yaml](deployment/node-problem-detector-config.yaml) as required and create the `ConfigMap` with `kubectl create -f node-problem-detector-config.yaml`.
+
* Create the DaemonSet with `kubectl create -f node-problem-detector.yaml`.
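
The manual installation steps above could be scripted roughly as follows. This is a sketch that assumes both manifests have already been edited and sit in the current directory, and that `kubectl` points at the target cluster. `KUBECTL` and `install_npd` are illustrative names, not part of the project; setting `KUBECTL=echo` prints the commands instead of running them.

```shell
#!/bin/sh
# Illustrative wrapper around the two documented kubectl commands.
# KUBECTL may be overridden (for example with "echo") for a dry run.
KUBECTL="${KUBECTL:-kubectl}"

install_npd() {
    # Optional step: create the ConfigMap that overwrites the config directory.
    "$KUBECTL" create -f node-problem-detector-config.yaml || return 1
    # Create the DaemonSet itself.
    "$KUBECTL" create -f node-problem-detector.yaml
}
```

Calling `install_npd` then runs both steps in order and stops if the ConfigMap cannot be created.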
## Start Standalone
+
To run node-problem-detector standalone, you should set `inClusterConfig` to `false` and
teach node-problem-detector how to access apiserver with `apiserver-override`.
For more scenarios, see [here](https://github.com/kubernetes/heapster/blob/master/docs/source-configuration.md#kubernetes)
## Try It Out
+
You can try node-problem-detector in a running cluster by injecting messages to the logs that node-problem-detector is watching. For example, let's assume node-problem-detector is using [KernelMonitor](https://github.com/kubernetes/node-problem-detector/blob/master/config/kernel-monitor.json). On your workstation, run ```kubectl get events -w```. On the node, run ```sudo sh -c "echo 'kernel: BUG: unable to handle kernel NULL pointer dereference at TESTING' >> /dev/kmsg"```. Then you should see the ```KernelOops``` event.
When adding new rules or developing node-problem-detector, it is probably easier to test it on the local workstation in the standalone mode. For the API server, an easy way is to use ```kubectl proxy``` to make a running cluster's API server available locally. You will get some errors because your local workstation is not recognized by the API server. But you should still be able to test your new rules regardless.
@@ -139,6 +159,7 @@ For example, to test [KernelMonitor](https://github.com/kubernetes/node-problem-
- For [KernelMonitor](https://github.com/kubernetes/node-problem-detector/blob/master/config/kernel-monitor.json) message injection, all messages should have ```kernel: ``` prefix (also note there is a space after ```:```).
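
The prefix requirement for KernelMonitor test messages can be wrapped in a small helper. The following is a sketch only: `kernel_test_line` and `inject_kmsg` are hypothetical names, and `inject_kmsg` assumes `sudo` is available on the node and that `/dev/kmsg` is writable by root.

```shell
#!/bin/sh
# Hypothetical helpers for message injection; not shipped with the project.

# Prepend the "kernel: " prefix (note the trailing space) unless already present.
kernel_test_line() {
    case "$1" in
        "kernel: "*) printf '%s\n' "$1" ;;
        *)           printf 'kernel: %s\n' "$1" ;;
    esac
}

# Write the prefixed line to /dev/kmsg on the node. Requires root, so this
# function is only defined here and must be called explicitly.
inject_kmsg() {
    kernel_test_line "$1" | sudo tee /dev/kmsg >/dev/null
}
```

With `kubectl get events -w` running on the workstation, calling `inject_kmsg 'BUG: unable to handle kernel NULL pointer dereference at TESTING'` on the node writes the same line as the command quoted earlier.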
# Remedy Systems
+
A _remedy system_ is a process or processes designed to attempt to remedy problems
detected by the node-problem-detector. Remedy systems observe events and/or node
conditions emitted by the node-problem-detector and take action to return the
@@ -156,6 +177,7 @@ Kubernetes cluster to a healthy state. The following remedy systems exist: