[telemetry] Rotate streaming telemetry secrets.#9600
[telemetry] Rotate streaming telemetry secrets.#9600yozhao101 wants to merge 21 commits intosonic-net:masterfrom
Conversation
Signed-off-by: Yong Zhao <yozhao@microsoft.com>
Signed-off-by: Yong Zhao <yozhao@microsoft.com>
Signed-off-by: Yong Zhao <yozhao@microsoft.com>
…uration file. Signed-off-by: Yong Zhao <yozhao@microsoft.com>
Signed-off-by: Yong Zhao <yozhao@microsoft.com>
|
This pull request introduces 2 alerts and fixes 3 when merging 90d9d03 into 34be53a - view on LGTM.com new alerts:
fixed alerts:
|
Signed-off-by: Yong Zhao <yozhao@microsoft.com>
|
This pull request introduces 1 alert and fixes 3 when merging 0bfcec9 into 67e40b5 - view on LGTM.com new alerts:
fixed alerts:
|
Signed-off-by: Yong Zhao <yozhao@microsoft.com>
Signed-off-by: Yong Zhao <yozhao@microsoft.com>
Signed-off-by: Yong Zhao <yozhao@microsoft.com>
| root_process_pid = os.getppid() | ||
| syslog.syslog(syslog.LOG_INFO, | ||
| "Restarting streaming telemetry service by terminating the process with pid: '{}'".format(root_process_pid)) | ||
| os.kill(root_process_pid, signal.SIGTERM) |
There was a problem hiding this comment.
The modern convention is to use sighup (ref: https://stackoverflow.com/a/28327659/2514803 ).
The benefit is not to explicitly terminate the other process and trigger critical process monitor alerts.
You may need to add the sighup handler in sonic-telemetry if not supported.
There was a problem hiding this comment.
I think we can probably use the supervisorctl restart telemetry command to only restart the streaming telemetry server process once the secrets were rotated. This can avoid triggerring the critical process alerts.
There was a problem hiding this comment.
Restarting is using SIGTERM and SIGKILL internally. One big concern is graceful shutdown. Considering the client will often fetch large amount of data, graceful shutdown will make client easier.
There was a problem hiding this comment.
I got the point. We can do this by either sending the signal SIGKILL or executing the command supervisorctl restart telemetry. However, our main focus is how we can do some cleanup before gracefully stopping the telemetry server process and disconnecting with gNMI client side.
There was a problem hiding this comment.
Please check my first comment in this thread. You did not get the point of using sighup.
| # Install Python module for inotify | ||
| sudo https_proxy=$https_proxy LANG=C chroot $FILESYSTEM_ROOT pip3 install inotify | ||
|
|
||
|
|
There was a problem hiding this comment.
Remove extra empty line. #Closed
|
|
||
|
|
||
| def main(): | ||
| certificate_rollover_check() |
There was a problem hiding this comment.
Could you try inotify (https://www.linuxjournal.com/content/linux-filesystem-events-inotify ) instead of reinvent the wheel? #Closed
Signed-off-by: Yong Zhao <yozhao@microsoft.com>
…te and private key. Signed-off-by: Yong Zhao <yozhao@microsoft.com>
|
This pull request introduces 1 alert when merging 17c4d60 into 58c5bb6 - view on LGTM.com new alerts:
|
Signed-off-by: Yong Zhao <yozhao@microsoft.com>
|
/azp run |
|
You have several pipelines (over 10) configured to build pull requests in this repository. Specify which pipelines you would like to run by using /azp run [pipelines] command. You can specify multiple pipelines using a comma separated list. |
|
/azp run Azure.sonic-buildimage |
|
Azure Pipelines successfully started running 1 pipeline(s). |
| "Certificate and private key were rotated and restarting telemetry server process ...") | ||
| restart_telemetry_server() | ||
|
|
||
| # Wait for specified seconds and then do the next round checking |
There was a problem hiding this comment.
Initially what I am thinking is since the directory /etc/sonic/credentials/ will store all the certificates and keys for different applications in the future and if one or several certificates under this directory are rotated too frequently (by accident?), this event loop will be executed continuously and then burn too much CPU cycles potentially. Another thinking the polling frequency of our internal tool to get the update from certificate storage is 8 ~ 24 hours. That's why I put a sleep function at here.
I will let inotify to monitor the certificate file directly instead of monitoring the /etc/sonic/credentials/ and then we can remove this sleep function in event loop.
There was a problem hiding this comment.
Got your motivation, and the considering may be helpful. However, event you sleep, the event will be queued and processed after each sleep, you don't save too much?
There was a problem hiding this comment.
Agree and this will not save too much!
| inotify_instance.add_watch(CREDENTIALS_DIR_PATH) | ||
| for event in inotify_instance.event_gen(yield_nones=False): | ||
| header, event_type, monitoring_path, file_name = event | ||
| if (file_name == certificate_file_name |
There was a problem hiding this comment.
The FSM logic is complex and may be messed up by some input sequence. Could you use one file as the main indicator, and always rotate if that file changed.
ref: https://github.com/Azure/sonic-restapi/blob/94805a39ac0712219f7dc08faa2cfdbf371dd177/go-server-server/main.go#L103 #Closed
There was a problem hiding this comment.
Updated and use the rotation of certificate file as the main indicator.
There was a problem hiding this comment.
Please make sure describe the main file in document? This is very critical design assumption and the cert rotator should treat it as a contract.
There was a problem hiding this comment.
Updated the design document.
Signed-off-by: Yong Zhao <yozhao@microsoft.com>
Signed-off-by: Yong Zhao <yozhao@microsoft.com>
Signed-off-by: Yong Zhao <yozhao@microsoft.com>
Signed-off-by: Yong Zhao <yozhao@microsoft.com>
|
|
||
| inotify_instance = inotify.adapters.Inotify() | ||
| inotify_instance.add_watch(CREDENTIALS_DIR_PATH) | ||
| for event in inotify_instance.event_gen(yield_nones=False): |
There was a problem hiding this comment.
The inotify is inode based and it will monitor the credentials directory /etc/sonic/credentials/ to see whether the telemetry certificate file was rotated or not. If certificate file was deleted by accidentally, the inotify_instance will not be impacted.
I updated the PR to log an error message if the certificate was deleted. What I am thinking is if the certificate was restored later, then it can be treated as a kind of rotation operation and the telemetry server will be restarted by this script.
There was a problem hiding this comment.
If certificate file was deleted by accidentally, what is the expected behavior?
I am considering in this case, we can kill telemetry daemon.
| "Failed to retrieve the certificate information from 'TELEMETRY' table!") | ||
| sys.exit(4) | ||
|
|
||
| while True: |
There was a problem hiding this comment.
Updated and checks the existence of both two files.
…sted on device and log an error message if certificate file was deleted accidentally. Signed-off-by: Yong Zhao <yozhao@microsoft.com>
…P` signal. Signed-off-by: Yong Zhao <yozhao@microsoft.com>
|
This pull request introduces 1 alert when merging c6e2fdd into 61e9a76 - view on LGTM.com new alerts:
|
…ing script. Signed-off-by: Yong Zhao <yozhao@microsoft.com>
Signed-off-by: Yong Zhao <yozhao@microsoft.com>
Signed-off-by: Yong Zhao <yozhao@microsoft.com>
|
This pull request introduces 1 alert when merging 41377a1 into 1ac140a - view on LGTM.com new alerts:
|
|
@yozhao101 could you please make sure your PR is rebased to latest and rerun checkers? |
|
@yozhao101 could you please rebase your work on latest and rerun checkers? |
|
@qiluo-msft Can you help me review this PR please? |
Thanks @liat-grozovik for your reviewing! I will rebase my PR. |
|
|
||
| while True: | ||
| if not os.path.exists(certificate_file_path) or not os.path.exists(private_key_file_path): | ||
| syslog.syslog(syslog.LOG_ERR, |
|
@yozhao101 any update on this PR? should it go to 202205? if so, can you add the proper label? |
|
@yozhao101 can you please help to response the comments and provide an update on this PR? Thanks. |
|
Hi @yozhao101 this looks like a very useful change. Would like to know what is the status of this PR and what's the plan to get it to master? Thanks. |
Signed-off-by: Yong Zhao yozhao@microsoft.com
Why I did it
Initially the secrets of Streaming Telemetry container were managed and deployed by several internal tools. Since the secrets of Streaming Telemetry were not always be able to be deployed successfully on SONiC production devices, we decide to leverage an exising feature in SONiC to manage and deploy the secrets.
How I did it
I mainly did the following two changes:
First I added a python script named
certificate_rotation_checkerin Streaming Telemtry container and this script will check periodically in background to see whether the certificate and private key were rotated and updated. The Streaming Telemetry server process will be restarted if and only if the certificate and key were actually rotated. We have another open PR to reload gNMI server configuration in telemetry repo: sonic-net/sonic-telemetry#92.Second, In the script
telemetry.sh, the telemetry server process will be started if and only if the certificate and private key files did exist on the device. Otherwise, it will check the existence of certificate and private key files on device periodically.How to verify it
I did this implementation and tested them on a virtual switch.
Which release branch to backport (provide reason below if selected)
Description for the changelog
A picture of a cute animal (not mandatory but encouraged)