WINC-1635, WINC-1592: enable log rotation for kubelet and kubeproxy services #3766
base: master
Changes from all commits
c0d2e60
b07f055
ace0f38
19c40bf
8f2badb
54c4ce8
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,63 @@ | ||
| # Log rotation for managed Windows services | ||
|
|
||
| Log rotation for managed Windows services is available for WMCO 10.22+. This feature rotates log files based | ||
| on configurable size and age thresholds and is configured via environment variables in the operator. | ||
|
|
||
| ## Enabling log rotation for managed Windows services | ||
|
|
||
| To enable and customize the log rotation behavior, add the following environment variables to the subscription (OLMv0). | ||
| The operator will restart to load the newly added environment variables and apply log rotation to the | ||
| managed services. This will result in a reconfiguration of the existing Windows nodes, one at a time, until all | ||
| nodes have been handled, to minimize disruption. | ||
|
|
||
| ### Setting environment variables in the subscription: | ||
| ```yaml | ||
| kind: Subscription | ||
| spec: | ||
|   config: | ||
|     env: | ||
|     - name: SERVICES_LOG_FILE_SIZE | ||
|       value: "100M" # Rotate when log reaches this size (suggested: 100M) | ||
|     - name: SERVICES_LOG_FILE_AGE | ||
|       value: "168h" # Keep rotated logs for this duration (e.g., 168h = 7 days) | ||
|     - name: SERVICES_LOG_FLUSH_INTERVAL | ||
|       value: "5s" # Flush logs to disk at this interval (suggested: 5s) | ||
| ``` | ||
|
|
||
| ### Patching the subscription using the CLI: | ||
| ```shell script | ||
| oc patch subscription <subscription_name> -n <namespace_name> \ | ||
| --type=merge \ | ||
| -p '{"spec":{"config":{"env":[{"name":"SERVICES_LOG_FILE_SIZE","value":"100M"},{"name":"SERVICES_LOG_FILE_AGE","value":"168h"},{"name":"SERVICES_LOG_FLUSH_INTERVAL","value":"5s"}]}}}' | ||
| ``` | ||
|
|
||
| ### Patching the operator deployment using the CLI (OLMv1 or manual installs): | ||
|
|
||
| ```shell script | ||
| oc set env deployment/windows-machine-config-operator -n <namespace_name> -c manager \ | ||
| SERVICES_LOG_FILE_SIZE="100M" \ | ||
| SERVICES_LOG_FILE_AGE="168h" \ | ||
| SERVICES_LOG_FLUSH_INTERVAL="5s" | ||
| ``` | ||
| where: | ||
| - `<namespace_name>`: The namespace where the operator is installed (e.g., `openshift-windows-machine-config-operator`) | ||
| - `<subscription_name>`: The name of the subscription used to install the operator (e.g., `windows-machine-config-operator-subscription`) | ||
|
|
||
| ## Disabling log rotation for managed Windows services | ||
|
|
||
| To disable log rotation, remove the `SERVICES_LOG_FILE_SIZE`, `SERVICES_LOG_FILE_AGE`, and `SERVICES_LOG_FLUSH_INTERVAL` | ||
| environment variables from the subscription or operator deployment. | ||
|
|
||
| ## Behavior when log rotation settings change | ||
|
|
||
| **Effect on existing log files:** When rotation settings are changed (enabled, disabled, or updated), any previously | ||
| rotated log files are retained according to the `SERVICES_LOG_FILE_AGE` value that was in effect when they were | ||
| created. Once that retention period expires, the files are cleaned up automatically. New log files and any future | ||
| rotated files will follow the updated rotation rules going forward. | ||
|
|
||
| **Operator and node behavior:** Any change to the `SERVICES_LOG_FILE_SIZE`, `SERVICES_LOG_FILE_AGE`, or | ||
| `SERVICES_LOG_FLUSH_INTERVAL` environment variables—whether in the subscription (OLMv0) or the operator deployment | ||
| (OLMv1 / manual installs)—will cause the operator to restart in order to load the updated configuration. After | ||
| restarting, the operator will reconfigure each Windows node one at a time to apply the new log rotation settings, | ||
| minimizing disruption. Note that service continuity during reconfiguration is not guaranteed; brief interruptions | ||
| to managed services (such as kubelet or kube-proxy) may occur on each node as it is reconfigured. |
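A note for readers choosing values such as `168h` and `5s`: the validation code in this PR checks the two duration variables with Go's `time.ParseDuration`, so the sketch below assumes that syntax applies end to end. It is only an illustration; `describeDuration` is a hypothetical helper and not part of WMCO. It shows that `168h` is exactly 7 days and that a day unit such as `7d` is rejected, so longer retention periods have to be expressed in hours.

```go
package main

import (
	"fmt"
	"time"
)

// describeDuration is a hypothetical helper (not part of WMCO) that parses a
// Go duration string and reports it in days and in seconds.
func describeDuration(value string) (days float64, seconds float64, err error) {
	d, err := time.ParseDuration(value)
	if err != nil {
		return 0, 0, fmt.Errorf("invalid duration %q: %w", value, err)
	}
	return d.Hours() / 24, d.Seconds(), nil
}

func main() {
	for _, v := range []string{"168h", "5s", "7d"} {
		days, secs, err := describeDuration(v)
		if err != nil {
			// "7d" ends up here: time.ParseDuration has no day unit.
			fmt.Println(err)
			continue
		}
		fmt.Printf("%s = %.0f seconds (%.4f days)\n", v, secs, days)
	}
}
```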
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -380,6 +380,10 @@ deleteParallelUpgradeCheckerResources() { | |
| } | ||
|
|
||
|
|
||
| # Enables debug logging and sets a smaller size for the services log file in the operator pod to make it easier to | ||
| # troubleshoot issues in CI. | ||
| # The method for patching the deployment depends on the OLM version, which is detected by checking for the presence | ||
| # of a subscription (OLMv0) or clusterextension (OLMv1). | ||
| enable_debug_logging() { | ||
|
Contributor
kind of weird to comment on, but I remember enable_debug_logging causing problems for the submodule upgrade script. Might be good to be sure that these changes don't mess with it
Contributor
Author
I don't anticipate any issues. @wgahnagl do you recommend creating a specific test to validate this? |
||
| if [[ $(oc get -n $WMCO_DEPLOY_NAMESPACE pod -l name=windows-machine-config-operator -ojson) == *"--debugLogging"* ]]; then | ||
| echo "Debug logging already enabled" | ||
|
|
@@ -390,13 +394,16 @@ enable_debug_logging() { | |
| WMCO_SUB=$(oc get sub -n "$WMCO_DEPLOY_NAMESPACE" --no-headers 2>/dev/null | awk '{print $1}') | ||
| if [[ -n "$WMCO_SUB" ]]; then | ||
| echo "Detected OLMv0, patching subscription $WMCO_SUB" | ||
| oc patch subscription $WMCO_SUB -n $WMCO_DEPLOY_NAMESPACE --type=merge -p '{"spec":{"config":{"env":[{"name":"ARGS","value":"--debugLogging"}]}}}' | ||
| oc patch subscription $WMCO_SUB -n $WMCO_DEPLOY_NAMESPACE --type=merge -p '{"spec":{"config":{"env":[{"name":"ARGS","value":"--debugLogging"},{"name":"SERVICES_LOG_FILE_SIZE","value":"1M"}]}}}' | ||
| # delete the deployment to ensure the changes are picked up in a timely manner | ||
| oc delete deployment -n $WMCO_DEPLOY_NAMESPACE windows-machine-config-operator | ||
| elif oc get clusterextension windows-machine-config-operator &>/dev/null; then | ||
| echo "Detected OLMv1, patching deployment directly..." | ||
| # Add debug env variable to the WMCO manager container | ||
| oc set env deployment/windows-machine-config-operator -n "$WMCO_DEPLOY_NAMESPACE" ARGS="--debugLogging" -c manager | ||
| # Add debug env variable and log file limit to the WMCO manager container | ||
| oc set env deployment/windows-machine-config-operator -n "$WMCO_DEPLOY_NAMESPACE" \ | ||
| ARGS="--debugLogging" \ | ||
| SERVICES_LOG_FILE_SIZE="1M" \ | ||
| -c manager | ||
| # force restart to pick up the env variable change | ||
| oc scale deployment/windows-machine-config-operator -n "$WMCO_DEPLOY_NAMESPACE" --replicas=0 | ||
| oc scale deployment/windows-machine-config-operator -n "$WMCO_DEPLOY_NAMESPACE" --replicas=1 | ||
|
|
||
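A side note on the `oc patch ... --type=merge` calls in this hunk: under JSON merge patch semantics an array in the payload replaces the existing array as a whole, so the `env` list sent in the patch becomes the complete list. That is presumably why `SERVICES_LOG_FILE_SIZE` is added to the same payload as `ARGS` rather than patched separately. The sketch below is a hypothetical way to generate such a payload instead of hand-writing the JSON; `envVar` and `buildEnvPatch` are illustrative names, not part of the repository.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// envVar mirrors the name/value pairs used in the subscription patch.
type envVar struct {
	Name  string `json:"name"`
	Value string `json:"value"`
}

// buildEnvPatch is a hypothetical helper that renders a merge-patch payload
// setting spec.config.env to exactly the given list.
func buildEnvPatch(env []envVar) (string, error) {
	patch := map[string]interface{}{
		"spec": map[string]interface{}{
			"config": map[string]interface{}{"env": env},
		},
	}
	out, err := json.Marshal(patch)
	if err != nil {
		return "", err
	}
	return string(out), nil
}

func main() {
	patch, err := buildEnvPatch([]envVar{
		{Name: "ARGS", Value: "--debugLogging"},
		{Name: "SERVICES_LOG_FILE_SIZE", Value: "1M"},
	})
	if err != nil {
		panic(err)
	}
	// The printed string can be passed to `oc patch ... --type=merge -p`.
	fmt.Println(patch)
}
```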
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -22,6 +22,7 @@ import ( | |
| "k8s.io/apimachinery/pkg/util/wait" | ||
| "k8s.io/client-go/kubernetes" | ||
| clientcmdv1 "k8s.io/client-go/tools/clientcmd/api/v1" | ||
| logsapi "k8s.io/component-base/logs/api/v1" | ||
| "k8s.io/kubectl/pkg/drain" | ||
| kubeletconfigv1 "k8s.io/kubelet/config/v1" | ||
| kubeletconfig "k8s.io/kubelet/config/v1beta1" | ||
|
|
@@ -743,14 +744,20 @@ func generateKubeletConfiguration(clusterDNS string) kubeletconfig.KubeletConfig | |
| Enabled: &falseBool, | ||
| }, | ||
| }, | ||
| ClusterDomain: "cluster.local", | ||
| ClusterDNS: []string{clusterDNS}, | ||
| CgroupsPerQOS: &falseBool, | ||
| RuntimeRequestTimeout: meta.Duration{Duration: 10 * time.Minute}, | ||
| MaxPods: 250, | ||
| KubeAPIQPS: &kubeAPIQPS, | ||
| KubeAPIBurst: 100, | ||
| SerializeImagePulls: &falseBool, | ||
| ClusterDomain: "cluster.local", | ||
| ClusterDNS: []string{clusterDNS}, | ||
| CgroupsPerQOS: &falseBool, | ||
| RuntimeRequestTimeout: meta.Duration{Duration: 10 * time.Minute}, | ||
| MaxPods: 250, | ||
| KubeAPIQPS: &kubeAPIQPS, | ||
| KubeAPIBurst: 100, | ||
| SerializeImagePulls: &falseBool, | ||
| Logging: logsapi.LoggingConfiguration{ | ||
| FlushFrequency: logsapi.TimeOrMetaDuration{ | ||
| Duration: meta.Duration{Duration: 5 * time.Second}, | ||
| SerializeAsString: true, | ||
| }, | ||
| }, | ||
|
Comment on lines +755 to +760
Contributor
This is explicitly setting a default, which is fine, but the commit message is misleading as there is no functional change here.
Contributor
Author
updated commit message |
||
| EnableSystemLogHandler: &trueBool, | ||
| EnableSystemLogQuery: &trueBool, | ||
| FeatureGates: map[string]bool{ | ||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,25 @@ | ||
| package services | ||
|
|
||
| import ctrl "sigs.k8s.io/controller-runtime" | ||
|
|
||
| var logFileSize, logFileAge, flushInterval string | ||
|
|
||
| func init() { | ||
| log := ctrl.Log.WithName("services").WithName("init") | ||
|
|
||
| var err error | ||
|
Contributor
it might feel a little better to save these in a config file somewhere instead of setting them with environment variables? And then address Sebastian's comment about checking the variables being too complicated by verifying the config all in one go when it loads.
Contributor
Author
I have mixed feelings about introducing a new config file and all the complexity of watching it. |
||
| logFileSize, err = getEnvQuantity(logFileSizeEnvVar) | ||
| if err != nil { | ||
| log.Error(err, "cannot load environment variable", "name", logFileSizeEnvVar) | ||
| } | ||
|
|
||
| logFileAge, err = getEnvDuration(logFileAgeEnvVar) | ||
| if err != nil { | ||
| log.Error(err, "cannot load environment variable", "name", logFileAgeEnvVar) | ||
| } | ||
|
|
||
| flushInterval, err = getEnvDuration(logFlushIntervalEnvVar) | ||
| if err != nil { | ||
| log.Error(err, "cannot load environment variable", "name", logFlushIntervalEnvVar) | ||
| } | ||
|
||
| } | ||
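To make the behavior of `init()` above easier to reason about: an invalid value is logged and the corresponding package variable stays empty, so the operator keeps starting and that option is simply left unconfigured. The following is a minimal, stdlib-only sketch of that pattern under stated assumptions. `validateDuration` is a hypothetical stand-in for the PR's duration getter, and it checks the sign of the parsed value instead of the `-` prefix check used in the PR.

```go
package main

import (
	"fmt"
	"os"
	"strings"
	"time"
)

// validateDuration is a hypothetical, pure variant of the duration check: it
// returns the trimmed value when it is a valid, non-negative duration, an
// empty string with no error when unset, and an empty string plus an error
// otherwise.
func validateDuration(key, raw string) (string, error) {
	value := strings.TrimSpace(raw)
	if value == "" {
		return "", nil // not present: the option stays unconfigured
	}
	d, err := time.ParseDuration(value)
	if err != nil {
		return "", fmt.Errorf("invalid duration value for %s: %w", key, err)
	}
	if d < 0 {
		return "", fmt.Errorf("duration cannot be negative for %s", key)
	}
	return value, nil
}

func main() {
	os.Setenv("SERVICES_LOG_FILE_AGE", "one week") // deliberately invalid
	os.Setenv("SERVICES_LOG_FLUSH_INTERVAL", " 5s ")

	for _, key := range []string{"SERVICES_LOG_FILE_AGE", "SERVICES_LOG_FLUSH_INTERVAL"} {
		value, err := validateDuration(key, os.Getenv(key))
		if err != nil {
			// Like init(): report the problem and continue with an empty value.
			fmt.Fprintln(os.Stderr, "cannot load environment variable:", err)
		}
		fmt.Printf("%s=%q\n", key, value)
	}
}
```

One consequence of this fail-soft design is that a typo in one variable does not block the operator, but it also means the misconfiguration is only visible in the operator log.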
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -2,10 +2,13 @@ package services | |
|
|
||
| import ( | ||
| "fmt" | ||
| "os" | ||
| "path/filepath" | ||
| "strings" | ||
| "time" | ||
|
|
||
| config "github.com/openshift/api/config/v1" | ||
| "k8s.io/apimachinery/pkg/api/resource" | ||
|
|
||
| "github.com/openshift/windows-machine-config-operator/pkg/cluster" | ||
| "github.com/openshift/windows-machine-config-operator/pkg/ignition" | ||
|
|
@@ -22,6 +25,13 @@ const ( | |
| // hostnameOverrideVar is the variable that should be replaced with the value of the desired instance hostname | ||
| hostnameOverrideVar = "HOSTNAME_OVERRIDE" | ||
| NodeIPVar = "NODE_IP" | ||
|
|
||
| // logFileSizeEnvVar is the environment variable name for log file size limit | ||
| logFileSizeEnvVar = "SERVICES_LOG_FILE_SIZE" | ||
|
Member
@jrvaldes is this a pattern on linux side to give options for these values to be configurable?
Contributor
Author
on linux side; yes it can be configurable via journald settings, not env vars |
||
| // logFileAgeEnvVar is the environment variable name for log file age retention | ||
| logFileAgeEnvVar = "SERVICES_LOG_FILE_AGE" | ||
| // logFlushIntervalEnvVar is the environment variable name for log flush interval | ||
| logFlushIntervalEnvVar = "SERVICES_LOG_FLUSH_INTERVAL" | ||
| ) | ||
|
|
||
| // GenerateManifest returns the expected state of the Windows service configmap. If debug is true, debug logging | ||
|
|
@@ -143,9 +153,8 @@ func hybridOverlayConfiguration(apiServerEndpoint, vxlanPort string, debug bool) | |
|
|
||
| // kubeProxyConfiguration returns the Service definition for kube-proxy | ||
| func kubeProxyConfiguration(debug bool) servicescm.Service { | ||
| cmd := fmt.Sprintf("%s -log-file=%s %s --config %s --windows-service", windows.KubeLogRunnerPath, windows.KubeProxyLog, | ||
| windows.KubeProxyPath, windows.KubeProxyConfigPath) | ||
|
|
||
| cmd := getLogRunnerForCmd(windows.KubeProxyPath, windows.KubeProxyLog) | ||
| cmd = fmt.Sprintf("%s --config %s --windows-service", cmd, windows.KubeProxyConfigPath) | ||
| verbosity := "0" | ||
| if debug { | ||
| verbosity = "4" | ||
|
|
@@ -222,8 +231,7 @@ func getKubeletServiceConfiguration(argsFromIginition map[string]string, debug b | |
| preScripts = append(preScripts, hostnameOverridePowershellVar) | ||
| } | ||
|
|
||
| kubeletServiceCmd := fmt.Sprintf("%s -log-file=%s %s", | ||
| windows.KubeLogRunnerPath, windows.KubeletLog, windows.KubeletPath) | ||
| kubeletServiceCmd := getLogRunnerForCmd(windows.KubeletPath, windows.KubeletLog) | ||
|
|
||
| for _, arg := range kubeletArgs { | ||
| kubeletServiceCmd += fmt.Sprintf(" %s", arg) | ||
|
|
@@ -307,3 +315,78 @@ func getHostnameCmd(platformType config.PlatformType) string { | |
| return "" | ||
| } | ||
| } | ||
|
|
||
| // getLogRunnerForCmd returns the command string to run the given commandPath with kube-log-runner | ||
| // logging to the given logfilePath. Log rotation parameters can be configured via environment variables. | ||
| func getLogRunnerForCmd(commandPath, logfilePath string) string { | ||
| cmdBuilder := strings.Builder{} | ||
| // log runner path must be first | ||
| cmdBuilder.WriteString(windows.KubeLogRunnerPath) | ||
|
|
||
| // add log file option | ||
| cmdBuilder.WriteString(" -log-file=") | ||
| cmdBuilder.WriteString(logfilePath) | ||
|
|
||
| if logFileSize != "" { | ||
| // log file size limit before creating a backup | ||
| cmdBuilder.WriteString(" -log-file-size=") | ||
| cmdBuilder.WriteString(logFileSize) | ||
| } | ||
|
|
||
| if logFileAge != "" { | ||
| // log retention for backup files created after the size limit is reached | ||
| cmdBuilder.WriteString(" -log-file-age=") | ||
| cmdBuilder.WriteString(logFileAge) | ||
| } | ||
|
|
||
| if flushInterval != "" { | ||
| // flush to ensure recent log entries are written to disk in near real-time | ||
| cmdBuilder.WriteString(" -flush-interval=") | ||
| cmdBuilder.WriteString(flushInterval) | ||
| } | ||
|
|
||
| // last, add the target command to be run | ||
| cmdBuilder.WriteString(" " + commandPath) | ||
|
|
||
| return cmdBuilder.String() | ||
| } | ||
|
|
||
| // getEnvQuantity returns the value of the environment variable with the given key | ||
| // if it represents a valid and non-negative quantity, otherwise returns error | ||
| func getEnvQuantity(key string) (string, error) { | ||
|
Contributor
getEnvQuantity returns a string, not a quantity?
Contributor
Author
correct, returns a string that represents a valid and non-negative quantity.
Contributor
Author
in this case, no, given that the function encapsulates the validation. I don't anticipate other usages ATM; the func can be adjusted in the future as needed.
Member
@jrvaldes can we rename this to getEnvString or getEnvValue? I feel like that will be more in line with what you are returning.
Member
To add to @sebsoto's point here:
Member
Additionally, have you considered returning resource.ParseQuantity here?
Contributor
Author
correct, and that is the point. I'm not proposing a generic getter; what's the point on having a func as dummy wrapper for
Contributor
Author
correct, unit tests only cover the specific edge cases (empty, negative, etc.). There is no intention to implement test coverage for the ParseQuantity or ParseDuration funcs.
Member
If the intention is not to introduce a generic getter that can be reused in many ways, the naming should reflect what the function is doing to avoid confusion. |
||
| value := os.Getenv(key) | ||
| value = strings.TrimSpace(value) | ||
| if value == "" { | ||
| // not present | ||
| return "", nil | ||
| } | ||
| // validate value as quantity | ||
| q, err := resource.ParseQuantity(value) | ||
| if err != nil { | ||
| return "", fmt.Errorf("invalid quantity value for %s: %w", key, err) | ||
| } | ||
| if q.Sign() < 0 { | ||
| return "", fmt.Errorf("quantity cannot be negative for %s", key) | ||
| } | ||
| return value, nil | ||
| } | ||
|
|
||
| // getEnvDuration returns the value of the environment variable with the given key | ||
| // if it represents a valid and non-negative duration, otherwise returns error | ||
| func getEnvDuration(key string) (string, error) { | ||
|
Contributor
Same as this 7722cda#r2850043734 But time.Duration rather than Quantity
Contributor
Author
see #3766 (comment)
Member
@jrvaldes there is a return type "time.Duration" and if this function is not returning that, the naming is confusing
Member
Would it be a better pattern to return time.Duration instead?
Contributor
Author
I see your point, will change it to getEnvDurationString
Contributor
Author
no, the logic needs a string. The only reason I am proposing the validation (duration, quantity, non-negative, etc.) is to prevent WMCO from starting with an invalid configuration for the log wrapper, and to keep kube-log-runner from failing in the background |
||
| value := os.Getenv(key) | ||
| value = strings.TrimSpace(value) | ||
| if value == "" { | ||
| // not present | ||
| return "", nil | ||
| } | ||
| if strings.HasPrefix(value, "-") { | ||
| return "", fmt.Errorf("duration cannot be negative for %s", key) | ||
| } | ||
|
|
||
| // validate value as duration | ||
| if _, err := time.ParseDuration(value); err != nil { | ||
| return "", fmt.Errorf("invalid duration value for %s: %w", key, err) | ||
| } | ||
| return value, nil | ||
| } | ||
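For reviewers who want to see what `getLogRunnerForCmd` produces, here is a dependency-free sketch of the same assembly order: runner path first, then `-log-file`, then the optional rotation flags, then the wrapped command. `rotationOptions` and `buildLogRunnerCmd` are hypothetical names, the options are passed in explicitly rather than read from package variables, and the file names are placeholders rather than the real values of `windows.KubeLogRunnerPath` and friends.

```go
package main

import (
	"fmt"
	"strings"
)

// rotationOptions holds already-validated flag values; an empty field means
// the corresponding flag is omitted from the command line.
type rotationOptions struct {
	FileSize, FileAge, FlushInterval string
}

// buildLogRunnerCmd is a hypothetical rendition of getLogRunnerForCmd that
// takes the runner path and rotation options as arguments.
func buildLogRunnerCmd(runnerPath, commandPath, logfilePath string, opts rotationOptions) string {
	var b strings.Builder
	// the log runner path must come first
	b.WriteString(runnerPath)
	b.WriteString(" -log-file=" + logfilePath)
	if opts.FileSize != "" {
		b.WriteString(" -log-file-size=" + opts.FileSize)
	}
	if opts.FileAge != "" {
		b.WriteString(" -log-file-age=" + opts.FileAge)
	}
	if opts.FlushInterval != "" {
		b.WriteString(" -flush-interval=" + opts.FlushInterval)
	}
	// the wrapped command goes last so its own arguments can be appended
	b.WriteString(" " + commandPath)
	return b.String()
}

func main() {
	// Placeholder file names, for illustration only.
	fmt.Println(buildLogRunnerCmd("kube-log-runner.exe", "kubelet.exe", "kubelet.log", rotationOptions{}))
	fmt.Println(buildLogRunnerCmd("kube-log-runner.exe", "kubelet.exe", "kubelet.log",
		rotationOptions{FileSize: "100M", FileAge: "168h", FlushInterval: "5s"}))
}
```

With every option empty, the result has the same shape as the `fmt.Sprintf("%s -log-file=%s %s", ...)` call this PR removes, which matches the documented way of disabling rotation by unsetting the variables.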
whitespace nit
will address in separate PR