From 2a97493a9cf4e8eb1a94ff9c3c8cdd3702a7edd3 Mon Sep 17 00:00:00 2001 From: Ben Clifford Date: Mon, 6 Sep 2021 11:05:59 +0000 Subject: [PATCH 01/42] Fix typo --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index 29dbf01..02c2168 100644 --- a/README.md +++ b/README.md @@ -9,7 +9,7 @@ This application includes: * FuncX Web-Service * FuncX Websocket Service * FuncX Forwarder -* Kuberentes endpoint +* Kubernetes endpoint * Postgres database * Redis Shared Data Structure * RabbitMQ broker From fbeb8d5278f122d0a15359f12d747615b2d019bd Mon Sep 17 00:00:00 2001 From: Ben Clifford Date: Mon, 6 Sep 2021 13:39:16 +0000 Subject: [PATCH 02/42] assorted ramblings --- README.md | 97 ++++++++++++++++++++++++++++++++++++++++++++++++++++--- 1 file changed, 92 insertions(+), 5 deletions(-) diff --git a/README.md b/README.md index 02c2168..5e28ee5 100644 --- a/README.md +++ b/README.md @@ -14,8 +14,62 @@ This application includes: * Redis Shared Data Structure * RabbitMQ broker +## benc notes on what i added onto a hetzner machine installed with ubuntu 20.04.03 - as root (because this is a VM dedicated to this project, so I don't care about user permissions for kubernetes level stuff) + + +apt-get install docker.io +curl -LO https://storage.googleapis.com/minikube/releases/latest/minikube-linux-amd64 +install minikube-linux-amd64 /usr/local/bin/minikube + +apt-get install conntrack # because: ❌ Exiting due to GUEST_MISSING_CONNTRACK: Sorry, Kubernetes 1.22.1 requires conntrack to be installed in root's path + +minikube start --driver=none # because i am root in a VM. otherwise apparently driver=docker might be nice? I haven't tried + +Now can run to see running pods + +$ minikube kubectl -- get pods -A + +This gives me 7 running k8s pods. 
+ + +now install helm, following debian/apt instructions here: +https://helm.sh/docs/intro/install/ + +curl https://baltocdn.com/helm/signing.asc | sudo apt-key add - +sudo apt-get install apt-transport-https --yes +echo "deb https://baltocdn.com/helm/stable/debian/ all main" | sudo tee /etc/apt/sources.list.d/helm-stable-debian.list +sudo apt-get update +sudo apt-get install helm + +now can run: + +# helm + +and see the default helm help text. + +now get the funcx helm repo: + +mkdir src +cd src +git clone git@github.com:funcx-faas/helm-chart +cd helm-chart + +and run the first helm command from below: + +helm dependency update funcx + +which will download some stuff. + +Note that it downloads a funcx_endpoint chart over http - that isn't something contained in this repo, even though it is a funcx related chart... + +see notes further down for continuation... + ## Preliminaries +how is this a preliminary rather than part of the main install? it even looks like a funcx-endpoint is set up as part of helm automatically ... so is this whole endpoint section irrelevant for an initial install? or at least, there should be better intro description at this point +that an endpoint will be deployed inside k8s? + + There are two modes in which funcx-endpoints could be deployed: 1. funcx-endpoint deployed outside k8s, connecting to hosted services in k8s @@ -89,6 +143,9 @@ kubectl create secret generic funcx-sdk-tokens \ ``` ## Installing FuncX + +[how does this section relate to the previous section?] + 1. Make a clone of this repository 2. Download subcharts: ```shell script @@ -110,14 +167,26 @@ kubectl create secret generic funcx-sdk-tokens \ ``` 6. You can access your web service through the ingress or via a port forward to the web service pod. Instructions are provided in the displayed notes. +[what ingress? there is no ingress resource created by helm? talking about the +'service' resource for these?] + +now i see a bunch of srevices including a funcx endpoint 7. 
You should be able to see the endpoint registering with the web service in their respective logs, along with the forwarder log. Check the endpoint's logs for its ID. +[clarify which logs / *where* those logs are? explicitly which (3?) logs to +look at... - who is registering with whom?] + + + ### Database Setup -Until we migrate the webservice to use an ORM, we need to set the database -schema up using a SQL script. This is accomplished by an init-container that +[Until we migrate the webservice to use an ORM, (remove roadmap from install instructions)] + +We need to set the database +schema up using a SQL script. [clarify what is happening here... where is the SQL script? how is it run? +the text makes it sound like init-container will run it? but I don't see any evidence of that] This is accomplished by an init-container that is run prior to starting up the web service container. This setup image checks to see if the tables are there. If not, it runs the setup script. @@ -151,6 +220,13 @@ kubectl create secret generic funcx-forwarder-secrets --from-file=.curve/server. > :warning: **USE THE FOLLOWING deployed_values/values.yaml** Omit the > funcx_endpoint section if using an externally deployed endpoint. +should I use the following values.yaml or the values.yaml I was told to make earlier? +dedupe - and if this section is the values.yaml i should be using, move it up to +where i am told to create the values.yaml + +eg. why do I need to be exposing postgres to the internet? + + ``` yaml webService: pullPolicy: Always @@ -165,7 +241,7 @@ websocketService: # Note that we install numpy into the worker so that we can run tests against the local # deployment # Note that the workerImage needs the same python version as is used in the funcx_endpoint -# image. This requirement will be relaxed +# image. This requirement will be relaxed [give an issue url for tracking this or remove the promise/dream. 
it is barely relevant to config docs] funcx_endpoint: enabled: true funcXServiceAddress: http://funcx-funcx-web-service:8000 @@ -201,6 +277,8 @@ rabbitmq: ``` ### Additional config +[are these values that are defaulted in funcx/values.yaml and can be overridden in +deployed_values/values.yaml?] There are a few values that can be set to adjust the deployed system configuration @@ -238,6 +316,7 @@ configuration ## Sealed Secrets +[why would i want to do this?] The chart can take advantage of Bitnami's sealed secrets controller to encrypt sensitive config data so it can safely be checked into the GitHub repo. @@ -256,7 +335,7 @@ cat local-dev-secrets.yaml | \ ## Subcharts This chart uses two subcharts to supply dependent services. You can update -settings for these by referenceing the subchart name and values from +settings for these by referencing the subchart name and values from their READMEs. For example @@ -278,7 +357,7 @@ In the scripts directory there is `psql-busybox.yaml`. Create the pod with $ kubectl create -f scripts/psql-busybox.yaml ``` -You can then create a shell with `kubectl exec -it psql bash` +You can then create a shell with `kubectl exec -it plsql bash` Inside that shell there is a fun pg sql client which can be invoked with the same Postgres URL found in the web app's config file (`/opt/funcx/app.conf`) @@ -286,3 +365,11 @@ same Postgres URL found in the web app's config file (`/opt/funcx/app.conf`) ```console pgcli postgresql://funcx:XXXXXXXXXXXX@funcx-production-db.XXXXXX.rds.amazonaws.com:5432/funcx ``` +[if this is intended to be used inside a dev cluster, is there a better way to name +the DB than this rd.amazonaws url? + +Where does the XXXX password come from in a dev cluster? 
+Here's a better command line for my minikube setup: + +root@plsql:/# pgcli postgresql://funcx:leftfoot1@funcx-postgresql:5432/public + From 00bdcf43587b6158fbc1487fb84a68a9b974b0d7 Mon Sep 17 00:00:00 2001 From: Ben Clifford Date: Tue, 14 Sep 2021 17:01:32 +0000 Subject: [PATCH 03/42] More notes --- README.md | 57 +++++++++++++++++++++++++++++++++++++++++++++++++++---- 1 file changed, 53 insertions(+), 4 deletions(-) diff --git a/README.md b/README.md index 5e28ee5..38075a8 100644 --- a/README.md +++ b/README.md @@ -69,12 +69,23 @@ see notes further down for continuation... how is this a preliminary rather than part of the main install? it even looks like a funcx-endpoint is set up as part of helm automatically ... so is this whole endpoint section irrelevant for an initial install? or at least, there should be better intro description at this point that an endpoint will be deployed inside k8s? +make a decision for the user - as they are new. they can try different ways later, which +can be documented elsewhere - for example, this is the helm repo so should be talking +about the helm deployment of the endpoint. Or pointing people at the endpoint repo. + +There are two modes in which funcx-endpoints could be deployed: -There are two modes in which funcx-endpoints could be deployed: 1. funcx-endpoint deployed outside k8s, connecting to hosted services in k8s 2. funcx-endpoint deployed inside k8s +Also be clear on the deployment modes: production-like (eg with many users, centrally, +with expectation that images are from tags, endpoints deployed by other people) +and development-like - eg on my own private VM with my own hacked up source changes and +all sorts of mess, and everything including endpoints deployed by me. + +be clear throughout this document what refers to those two use cases. 
+ ### Deploying funcx-endpoint outside of K8s --- @@ -129,6 +140,17 @@ be created by running on your local workstation and running funcx-endpoint start ``` +(or if I don't have this on my local workstation because I'm deploying purely +inside kubernetes...? can I run this as a command inside minikube - eg after I've +got this running? +yes, with lots of error messages to ignore, I did this: + +docker run --rm -ti funcx/kube-endpoint:main /home/funcx/boot.sh funcx + +) + + + You will be prompted to follow the authorization link and paste the resulting token into the console. Once you do that, funcx-endpoint will create a `~/.funcx` directory and provide you with a token file. @@ -144,7 +166,8 @@ kubectl create secret generic funcx-sdk-tokens \ ## Installing FuncX -[how does this section relate to the previous section?] +[how does this section relate to the previous section? I think the +] 1. Make a clone of this repository 2. Download subcharts: @@ -152,6 +175,9 @@ kubectl create secret generic funcx-sdk-tokens \ helm dependency update funcx ``` 3. Create your own `values.yaml` inside the Git ignored directory `deployed_values/` + [forward reference to the two different values sections later on in this + document: should I just have the three lines mentioned here? or should I + be copy-pasting a huge example?] 4. Obtain Globus Client ID and Secret. These secrets need to exist in the correct Globus Auth app. Ask for access to the credentials by contacting https://github.com/BenGalewsky or sending a message to the `dev` funcx Slack @@ -170,7 +196,23 @@ to the web service pod. Instructions are provided in the displayed notes. [what ingress? there is no ingress resource created by helm? talking about the 'service' resource for these?] 
-now i see a bunch of srevices including a funcx endpoint +now i see a bunch of services including a funcx endpoint + +looks like this: + +# minikube kubectl get pods +NAME READY STATUS RESTARTS AGE +funcx-endpoint-86756c48c8-flhqf 1/1 Running 0 95m +funcx-forwarder-db744678c-fqxhg 1/1 Running 0 95m +funcx-funcx-web-service-97585d958-6zrvh 1/1 Running 0 95m +funcx-funcx-websocket-service-bb766fbcd-rvbgj 1/1 Running 0 95m +funcx-postgresql-0 1/1 Running 0 95m +funcx-rabbitmq-0 1/1 Running 0 95m +funcx-redis-master-0 1/1 Running 0 95m +funcx-redis-slave-0 1/1 Running 0 95m +funcx-redis-slave-1 1/1 Running 0 95m + + 7. You should be able to see the endpoint registering with the web service in their respective logs, along with the forwarder log. Check the endpoint's @@ -226,7 +268,6 @@ where i am told to create the values.yaml eg. why do I need to be exposing postgres to the internet? - ``` yaml webService: pullPolicy: Always @@ -373,3 +414,11 @@ Here's a better command line for my minikube setup: root@plsql:/# pgcli postgresql://funcx:leftfoot1@funcx-postgresql:5432/public +# upgrades +How does I upgrade this? It was installed using the latest images at install time I guess? +I see a funcx-web-service tag from 15h ago after running helm upgrade funcx funcx... +(imgage ID d21432c1525a) - so is that what is running now? looks like it pulled a new image when I rebooted the server (!) + +That seems a bit chaotic. And how do I switch these to using my own source builds? + +how can i run a test job against this install? From c4bf3f3e900d8bf520df634015f99b0ab1a19448 Mon Sep 17 00:00:00 2001 From: Ben Clifford Date: Wed, 15 Sep 2021 15:40:21 +0000 Subject: [PATCH 04/42] Ongoing narrative of trying to get a single task running --- README.md | 106 ++++++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 106 insertions(+) diff --git a/README.md b/README.md index 38075a8..02774b3 100644 --- a/README.md +++ b/README.md @@ -221,6 +221,103 @@ logs for its ID. 
[clarify which logs / *where* those logs are? explicitly which (3?) logs to look at... - who is registering with whom?] +Endpoint log will look like: + +2021-09-14 16:36:02 endpoint.endpoint_manager:172 [INFO] Starting endpoint with uuid: cfd389f3-4eda-413b-af95-4d54a8e944dc + +forwarder will look like: +{"asctime": "2021-09-14 18:20:44,535", "name": "funcx_forwarder.forwarder", "levelname": "DEBUG", "message": "endpoint_status_message", "log_type": "endpoint_status_message", "endpoint_id": "cfd389f3-4eda-413b-af95-4d54a8e944dc", "endpoint_status_message": {"_payload": null, "_header": "b'\\xcf\\xd3\\x89\\xf3N\\xdaA;\\xaf\\x95MT\\xa8\\xe9D\\xdc'", "ep_status": {"task_id": -2, "info": {"total_cores": 0, "total_mem": 0, "new_core_hrs": 0, "total_core_hrs": 0, "managers": 0, "active_managers": 0, "total_workers": 0, "idle_workers": 0, "pending_tasks": 0, "outstanding_tasks": {}, "worker_mode": "no_container", "scheduler_mode": "hard", "scaling_enabled": true, "mem_per_worker": null, "cores_per_worker": 1.0, "prefetch_capacity": 10, "max_blocks": 100, "min_blocks": 1, "max_workers_per_node": 1, "nodes_per_block": 1}}, "task_statuses": {}}} + +web service will look like: +{"asctime": "2021-09-14 16:36:03,273", "name": "funcx_web_service", "levelname": "INFO", "message": "Successfully registered cfd389f3-4eda-413b-af95-4d54a8e944dc in database"} + +### connecting clients + +the startup message (from helm) has a couple of kubectl port-forward commands that might be a bit wrong - i ended up using these two: +# minikube kubectl -- port-forward --address 0.0.0.0 service/funcx-funcx-web-service 8000 +# minikube kubectl -- port-forward --address 0.0.0.0 service/funcx-funcx-websocket-service 6000 + +This will expose the services on port 8000 and port 6000 - because this is a service, the 2nd port number in the helm suggested text is ignored, I think - so that could be removed in a PR (as long as I check and justify that with documentation links) + +now from a working funcx 
install, create a funcx client pointed at the current +service, like this: + +fxc = FuncXClient(funcx_service_address="http://amber.cqx.ltd.uk:8000/v2") + +and then run quickstart guide style stuff - probably i don't need to paste it here, but +I could... + + +from funcx.sdk.client import FuncXClient + +fxc = FuncXClient(PARMS HERE) + +def hello_world(): + return "Hello World!" + +func_uuid = fxc.register_function(hello_world) + +tutorial_endpoint = 'LOCAL ENDPOINT HERE' +result = fxc.run(endpoint_id=tutorial_endpoint, function_id=func_uuid) + +print(fxc.get_result(result)) + +this gets as far as submitting for me, but attempts to get the result always give +funcx.utils.errors.TaskPending: Task is pending due to waiting-for-nodes + +There's a pod started up ok - called funcx-1631645705407 without any more interesting name. i guess thats a worker? after 106s all that is in the logs is a warning from pip running as root +- but i can see pip running. + +So i should put a note here about how long things might take here? +Note that it has changes from waiting-for-ep in the error to waiting-for-nodes +after several minutes, so there is more stuff going on, slowly... 5m later and its still churning. it's had one restart 63s ago... nothing clear about *why* it restarted though. +in kubectl describe pod XXXXX I can see the commandline - a pip install and then a funcx-manager. +It seems to end with: PROCESS_WORKER_POOL main event loop exiting normally + +So lets debug a bit more about where the task execution happens or not. + +Is this doing a pip install on every restart (?!) - that's a question to ask. +(maybe it's not actually installing new stuff though - which is why i'm not seeing any +packages being installed on subsequent runs) + +Eventually it went into "CrashLoopBackOff" at the kubernetes level, which maybe isn't the right behaviour for "exiting normal" at the PROCESS_WORKER_POOL level? Ask on chat about that. 
+ +There's nothing in the endpoint logs about starting up that funcx process worker container, or about jobs happening - just every 600s a keepalive message + +Digging into the endpoint container environment, find ~/.funcx/funcx/EndpointInterchange.log +which is reporting a sequence of errors: + +2021-09-15 14:26:55.592 funcx_endpoint.executors.high_throughput.executor:540 [WARNING] [MTHR +EAD] Executor shutting down due to version mismatch in interchange +2021-09-15 14:26:55.610 funcx_endpoint.executors.high_throughput.executor:542 [ERROR] [MTHREA +D] Exception: Task failure due to loss of manager b'18e00d57935c' +NoneType: None +2021-09-15 14:26:55.610 funcx_endpoint.executors.high_throughput.executor:577 [INFO] [MTHREAD +] queue management worker finished + +then every 10ms this message *forever* 2021-09-15 14:26:55.613 funcx_endpoint:504 [ERROR] [MAIN] Something broke while forwarding re +sults from executor to forwarder queues +Traceback (most recent call last): + File "/usr/local/lib/python3.7/site-packages/funcx_endpoint/endpoint/interchange.py", line 4 +90, in _main_loop + results = self.results_passthrough.get(False, 0.01) + File "/usr/local/lib/python3.7/multiprocessing/queues.py", line 108, in get + res = self._recv_bytes() + File "/usr/local/lib/python3.7/multiprocessing/connection.py", line 216, in recv_bytes + buf = self._recv_bytes(maxlength) + File "/usr/local/lib/python3.7/multiprocessing/connection.py", line 407, in _recv_bytes + buf = self._recv(4) + File "/usr/local/lib/python3.7/multiprocessing/connection.py", line 383, in _recv + raise EOFError +EOFError + +- that kind of failure should be resulting in a kubernetes level restart (or some other exit/restart) not a hang loop like this? +- mismatch of what? between who? is it the process worker pool container vs the funcx container? 
Looking at interchange.py - this might not even be from a version mismatch: it can happen if reg_flag is false (due to a json deserialisation problem in registration message). Other than that, it can happen because the python versions from the manager vs the interchange. + +whle the task on the client side still reports: +funcx.utils.errors.TaskPending: Task is pending due to waiting-for-ep + + ### Database Setup @@ -422,3 +519,12 @@ I see a funcx-web-service tag from 15h ago after running helm upgrade funcx func That seems a bit chaotic. And how do I switch these to using my own source builds? how can i run a test job against this install? + + +## Upgrading and developing against non-main environments + +i've been trying this but it's not clear that it is pulling down the latest of +everything: (I think it does, but just the images are not changing often +upstream from me?) + helm upgrade -f deployed_values/values.yaml funcx funcx --recreate-pods + From f3a9b78d78d5b3566cbfcc753f0909262c8b15da Mon Sep 17 00:00:00 2001 From: Ben Clifford Date: Thu, 16 Sep 2021 09:58:44 +0000 Subject: [PATCH 05/42] more notes --- README.md | 85 +++++++++++++++++++++++++++++++++++++++++++++++++------ 1 file changed, 77 insertions(+), 8 deletions(-) diff --git a/README.md b/README.md index 02774b3..8a1658f 100644 --- a/README.md +++ b/README.md @@ -237,6 +237,8 @@ the startup message (from helm) has a couple of kubectl port-forward commands th # minikube kubectl -- port-forward --address 0.0.0.0 service/funcx-funcx-web-service 8000 # minikube kubectl -- port-forward --address 0.0.0.0 service/funcx-funcx-websocket-service 6000 +These port forwards are only temporary - they run as foreground processes and break as soon as the pods change (for example due to restarts). That seems a bit frustrating if they're meant to be pointing to services. Is there a more persistent kubernetes configuration that can be used to expose to the world? 
And for other people, to expose to whatever their security-scoped environment is? + This will expose the services on port 8000 and port 6000 - because this is a service, the 2nd port number in the helm suggested text is ignored, I think - so that could be removed in a PR (as long as I check and justify that with documentation links) now from a working funcx install, create a funcx client pointed at the current @@ -314,21 +316,85 @@ EOFError - that kind of failure should be resulting in a kubernetes level restart (or some other exit/restart) not a hang loop like this? - mismatch of what? between who? is it the process worker pool container vs the funcx container? Looking at interchange.py - this might not even be from a version mismatch: it can happen if reg_flag is false (due to a json deserialisation problem in registration message). Other than that, it can happen because the python versions from the manager vs the interchange. +I commented on these logs not being obvious, in slack, and ben g gave me: +> so for debugging, I added a value to the endpoint helm chart detachEndpoint -since the endpoint runs in a daemon, the output doesn’t show up in the pod’s logs.. Setting this to false means the endpoint runs in the main thread. Less reliable, but easy for debugging + +I haven't tried that yet. But if its good, then... if k8s endpoints are also expected for end users, maybe they should also get the same functionality? (eg why is this running as a daemon when its inside a pod anyway managing that?) + whle the task on the client side still reports: funcx.utils.errors.TaskPending: Task is pending due to waiting-for-ep +At the same time, the process worker service repeatedly exits and is restarted by kubernetes (with it eventually hitting CrashLoopBackOff to slow this down) - presumably that's somehow opposite half of this same error message, but it isn't clear. 
The entire log file is: + +root@amber:~# minikube kubectl logs -- funcx-1631715998878 +WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv +PROCESS_WORKER_POOL main event loop exiting normally +root@amber:~# + +I found a more interesting log file here: +/home/funcx/.funcx/funcx/HighThroughputExecutor/worker_logs/8e8f66c705d3/manager.log + +which eventually reports a critical error that the interchange heartbeat is missing - not at all like the kubectl log error of: ... exiting normally. + +It's a bit weird to be 2 minutes in before the manager even notices that the interchange isn't even alive. + +So... the version mismatch: +This is invoking python:3.6-buster - so let's track down where that was. + +Selecting the correct image (eg for AWS AMIs, not docker images) has been a massive +usability problem for testing parsl on image based systems... I'm not sure how much it +matters in end-use though, if you're assuming that users make app-specific images that +are tied to their own environment? I don't have experience there. I haven't spent any +time seriously trying to solve this for parsl, but eg ZZ did container stuff for parsl +so I'd be interested to hear any of his relevant experiences. not really a problem +i am interested in solving. +so grep around in the source tree for python:3.6-buster +The funcx endpoint helm chart is coming from a URL on funcx.org, not the helm-charts repo, +under here: http://funcx.org/funcx-helm-charts -### Database Setup -[Until we migrate the webservice to use an ORM, (remove roadmap from install instructions)] +There's a 0.3 chart in the funcx-helm-charts repo by the looks of it - perhaps I can try that, by hacking at the server-side chart. What's the right way to be controlling this? -We need to set the database -schema up using a SQL script. 
[clarify what is happening here... where is the SQL script? how is it run? -the text makes it sound like init-container will run it? but I don't see any evidence of that] This is accomplished by an init-container that -is run prior to starting up the web service container. This setup image checks -to see if the tables are there. If not, it runs the setup script. +While checking if process claims to be alive, the endpoint output line: +"funcx-endpoint process is still alive. Next check in 600s." +should be given a timestamp - it doesn't seem to go through the logging mechanism so +is not getting a timestamp that way. So I have no idea when it last ran, or if its +an outdated message, etc. +This command to install the worker inside the worker container is installing funcx-endpoint and directing the output to a file called =0.2.0. That is probably not what is intended. Put the whole thing in ' marks perhaps. + + pip install funcx-endpoint>=0.2.0 + +This is in the funcx endpoint helm chart... along with the naughty command right next +to it: +workerImage and +workerInit +so I should be able to override those myself in my values.yaml? + +funcx_endpoint: + workerImage: python:3.8-buster + workerInit: 'pip install "funcx-endpoint>=0.2.0"' + +note that workerInit is embedded python string syntax, not a plain string, so +you can't use ' marks inside it because it is substituted in somewhere I think +and that causes a syntax error - eg try the above line with " and ' swapped +and see: +""" +File "/home/funcx/.funcx/funcx/config.py", line 24 + worker_init='pip install 'funcx-endpoint>=0.2.0'', + ^ +SyntaxError: invalid syntax +""" +because string substitution into source without proper escaping. + +This could be fixed - either by reading the string from a different place and +not doing python source substitution, or by performing escaping on the string. +This behaviour is likely to cause trouble to anyone doing non-trivial bash +in their worker_init. 
+ +it's frustrating that the python version is not set to the version that +is actually used by the endpoint. ### Forwarder Debugging > :warning: *Only for debugging*: You can set the forwarder curve server key manually by creating @@ -511,6 +577,10 @@ Here's a better command line for my minikube setup: root@plsql:/# pgcli postgresql://funcx:leftfoot1@funcx-postgresql:5432/public + + +how can i run a test job against this install? + # upgrades How does I upgrade this? It was installed using the latest images at install time I guess? I see a funcx-web-service tag from 15h ago after running helm upgrade funcx funcx... @@ -518,7 +588,6 @@ I see a funcx-web-service tag from 15h ago after running helm upgrade funcx func That seems a bit chaotic. And how do I switch these to using my own source builds? -how can i run a test job against this install? ## Upgrading and developing against non-main environments From e05bc1ae501ed3715bd15ec2cb234074374e7bcd Mon Sep 17 00:00:00 2001 From: Ben Clifford Date: Mon, 20 Sep 2021 16:09:43 +0000 Subject: [PATCH 06/42] wip --- README.md | 19 +++++++++++++++++++ 1 file changed, 19 insertions(+) diff --git a/README.md b/README.md index 8a1658f..1c59419 100644 --- a/README.md +++ b/README.md @@ -589,6 +589,25 @@ I see a funcx-web-service tag from 15h ago after running helm upgrade funcx func That seems a bit chaotic. And how do I switch these to using my own source builds? +python versions: + The endpoint repo funcX/Dockerfile-endpoint defaults to building with python 3.8. + That doesn't align with the python that is being supplied by the helm images. + So rebuild my client, specifying 3.7. + +Perhaps funcx/Dockerfile-endpoint should force the user to choose, rather than building +a likely invalid one. + +The broad topic here is python versions are poorly co-ordinated as supplied - yes, they +have to align, but the defaults supplied should all align with each other at least. 
+(they need to align across all three of: the submitting client, the endpoint, the +endpoint workers, and all three are wrong by default) + + +Python mismatch between interchange and worker pod results in an eternal hang: interchange reports inside its endpoint logs that there's a version mismatch (but not what the version mismathc is). The worker just restarts every couple of minutes with missing heartbeat: no description of *why* the heartbeat is missing. The end user is never informed of anything more than "waiting for nodes". It's unclear to me if that should be a richer message or a richer hard error: could tell user that the worker version is wrong at least, becaues that is likely an error that won't fix itself (i.e. is not a transient error). Richer error here would help the submitting user understand that they need to contact the endpoint administrator for rectification and what to tell them, beyond "waiting for nodes". + + +funcx endpoint worker pods lose their logs at each restart - which is awkward to examine when the logs are in an every-two-minutes restart loop due to missing heartbeat. [should they even be autorestarting in that situation, rather than letting the endpoint handle restarting them if it still wants them?] + ## Upgrading and developing against non-main environments From 04ca10ba58b2fcd36afaa2ef1e659e508f0a28aa Mon Sep 17 00:00:00 2001 From: Ben Clifford Date: Mon, 20 Sep 2021 19:19:05 +0000 Subject: [PATCH 07/42] wip --- README.md | 34 ++++++++++++++++++++++++++++++++-- 1 file changed, 32 insertions(+), 2 deletions(-) diff --git a/README.md b/README.md index 1c59419..2bd8091 100644 --- a/README.md +++ b/README.md @@ -1,6 +1,30 @@ # helm-chart Helm Chart for Deploying funcX stack +# About this document + +Goal: Getting started guide for a new funcX developer +making their first install for themselves to hack on. 
+[so there should be a prefered main path with minimal +choices for customisation of the initial install - eg proper defaults for python +version] + +Non-goals: + +* Using the helm charts to deploy to the live/dev systems. +* Customising your helm deploy. + +# About benc's notes + +Notes are here for various purposes: (in no particular order) + +i) making this document better support the Document Goal +ii) making the default configuration of funcX in the repositories better support the +Document Goal +iii) making funcX better support users (people invoking functions, and people operating +their own endpoints). + + [![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0) [![NSF-2004894](https://img.shields.io/badge/NSF-2004894-blue.svg)](https://nsf.gov/awardsearch/showAward?AWD_ID=2004894) [![NSF-2004932](https://img.shields.io/badge/NSF-2004932-blue.svg)](https://nsf.gov/awardsearch/showAward?AWD_ID=2004932) @@ -14,6 +38,7 @@ This application includes: * Redis Shared Data Structure * RabbitMQ broker + ## benc notes on what i added onto a hetzner machine installed with ubuntu 20.04.03 - as root (because this is a VM dedicated to this project, so I don't care about user permissions for kubernetes level stuff) @@ -64,7 +89,7 @@ Note that it downloads a funcx_endpoint chart over http - that isn't something c see notes further down for continuation... -## Preliminaries +## Preliminaries [funcx endpoint] how is this a preliminary rather than part of the main install? it even looks like a funcx-endpoint is set up as part of helm automatically ... so is this whole endpoint section irrelevant for an initial install? or at least, there should be better intro description at this point that an endpoint will be deployed inside k8s? @@ -73,6 +98,10 @@ make a decision for the user - as they are new. 
they can try different ways late can be documented elsewhere - for example, this is the helm repo so should be talking about the helm deployment of the endpoint. Or pointing people at the endpoint repo. +I think for an initial install, the in-kubernetes default endpoint which configures +itself almost automatically should be chosen for getting started. With instructions +on attaching an external endpoint described *afterwards* at the end of this document. + There are two modes in which funcx-endpoints could be deployed: @@ -606,7 +635,8 @@ endpoint workers, and all three are wrong by default) Python mismatch between interchange and worker pod results in an eternal hang: interchange reports inside its endpoint logs that there's a version mismatch (but not what the version mismathc is). The worker just restarts every couple of minutes with missing heartbeat: no description of *why* the heartbeat is missing. The end user is never informed of anything more than "waiting for nodes". It's unclear to me if that should be a richer message or a richer hard error: could tell user that the worker version is wrong at least, becaues that is likely an error that won't fix itself (i.e. is not a transient error). Richer error here would help the submitting user understand that they need to contact the endpoint administrator for rectification and what to tell them, beyond "waiting for nodes". -funcx endpoint worker pods lose their logs at each restart - which is awkward to examine when the logs are in an every-two-minutes restart loop due to missing heartbeat. [should they even be autorestarting in that situation, rather than letting the endpoint handle restarting them if it still wants them?] +funcx endpoint worker pods lose their logs at each restart - which is awkward to examine when the logs are in an every-two-minutes restart loop due to missing heartbeat. debuggability might be enhanced by putting them in a pod-lifetime dir rather than a container-lifetime dir? 
[should they even be autorestarting in that situation, rather than letting the endpoint handle restarting them if it still wants them? - c.f. parsl discussion about how kubernetes pods are managed by the parsl kubernetes provider?] + ## Upgrading and developing against non-main environments From fce74763817bfe1b1c1f1cba864b8bfaddcc9853 Mon Sep 17 00:00:00 2001 From: Ben Clifford Date: Mon, 20 Sep 2021 19:27:51 +0000 Subject: [PATCH 08/42] wip --- README.md | 22 ++++++++++++++++------ 1 file changed, 16 insertions(+), 6 deletions(-) diff --git a/README.md b/README.md index 2bd8091..ac14103 100644 --- a/README.md +++ b/README.md @@ -38,9 +38,20 @@ This application includes: * Redis Shared Data Structure * RabbitMQ broker +## Kubernetes pre-reqs -## benc notes on what i added onto a hetzner machine installed with ubuntu 20.04.03 - as root (because this is a VM dedicated to this project, so I don't care about user permissions for kubernetes level stuff) +You will need a kubernetes installation. +Some ways in which you can get an installation: + +(benc: hetzner cloud + minikube) +(slack suggestion: docker desktop) + +## eg. minikube + hetzner cloud + +base os: ubuntu 20.04.03 + +running everything as roo apt-get install docker.io curl -LO https://storage.googleapis.com/minikube/releases/latest/minikube-linux-amd64 @@ -68,7 +79,9 @@ sudo apt-get install helm now can run: +``` # helm +``` and see the default helm help text. @@ -89,7 +102,7 @@ Note that it downloads a funcx_endpoint chart over http - that isn't something c see notes further down for continuation... -## Preliminaries [funcx endpoint] +## Preliminaries [for funcx endpoint] how is this a preliminary rather than part of the main install? it even looks like a funcx-endpoint is set up as part of helm automatically ... so is this whole endpoint section irrelevant for an initial install? or at least, there should be better intro description at this point that an endpoint will be deployed inside k8s? 
@@ -115,7 +128,7 @@ all sorts of mess, and everything including endpoints deployed by me. be clear throughout this document what refers to those two use cases. -### Deploying funcx-endpoint outside of K8s +### Deploying funcx-endpoint outside of K8s [this is "advanced" - move to end of doc, and crossref with other "install an endpoint" document] --- **NOTE** @@ -195,9 +208,6 @@ kubectl create secret generic funcx-sdk-tokens \ ## Installing FuncX -[how does this section relate to the previous section? I think the -] - 1. Make a clone of this repository 2. Download subcharts: ```shell script From 7fabd84346ef3451a25299adf9d700bd5fae1ff2 Mon Sep 17 00:00:00 2001 From: Ben Clifford Date: Mon, 20 Sep 2021 20:06:20 +0000 Subject: [PATCH 09/42] wip --- README.md | 68 ++++++++++++++++++++++++++++++++++++++++++++++++++++++- 1 file changed, 67 insertions(+), 1 deletion(-) diff --git a/README.md b/README.md index ac14103..a0b7db9 100644 --- a/README.md +++ b/README.md @@ -412,7 +412,7 @@ workerInit so I should be able to override those myself in my values.yaml? funcx_endpoint: - workerImage: python:3.8-buster + workerImage: python:3.7-buster workerInit: 'pip install "funcx-endpoint>=0.2.0"' note that workerInit is embedded python string syntax, not a plain string, so @@ -656,3 +656,69 @@ everything: (I think it does, but just the images are not changing often upstream from me?) helm upgrade -f deployed_values/values.yaml funcx funcx --recreate-pods +## Python version notes + +this is important for not having errors, so probably should be near the top. + +there are three different places where the python interpreter must be the same +major version (eg all 3.7). 
the tooling as it is now does not make that the +case by default - [TODO: make it so] + +- the endpoint worker +- the endpoint +- the submitting user python env + +TODO: make this consistent for a first time install experience: either describe +how to configure it in all three places, or make the defaults/documented +command lines make that happen. + +The setup as I came to it is installing 3 different incompatible python versions +without telling me not to. + +## Port notes + +nmap of amber.cqx.ltd.uk: + +``` +$ nmap amber.cqx.ltd.uk -p- -4 + +Starting Nmap 7.40 ( https://nmap.org ) at 2021-09-20 19:41 UTC +Nmap scan report for amber.cqx.ltd.uk (65.108.55.218) +Host is up (0.055s latency). +Other addresses for amber.cqx.ltd.uk (not scanned): 2a01:4f9:c010:e030::1 +Not shown: 65522 closed ports +PORT STATE SERVICE +22/tcp open ssh +2379/tcp open etcd-client +2380/tcp open etcd-server +6000/tcp open X11 +8000/tcp open http-alt +8080/tcp open http-proxy +8443/tcp open https-alt +10249/tcp open unknown +10250/tcp open unknown +10256/tcp open unknown +55001/tcp open unknown +55002/tcp open unknown +55003/tcp open unknown + +``` + +The above notes have two ports forwarded manually using kubectl port forwarding each time. + +funcx-forwarder is configured to expose a number of ports, 55002-55005. + +but in its environment, it declares ports which don't quite align. +it declares these: + TASKS_PORT: 55001 + RESULTS_PORT: 55002 + COMMANDS_PORT: 55003 +as well as 8080 + +As far as remote nmap is concerned, 55001-3 are exposed, not -4 and -5. + +So what's the configuration divergence here? (is there unnecessary configuration of those +ports, seeing as 55001 is finding its way in there anyway?) + +the funcx-forwarder service lists 55001, 55003, 55005 +which is a *third* combination of those different ports. (!) 
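
A minimal sketch of the port mismatch described in the notes above, assuming the port numbers quoted there are accurate. The dictionary labels and the `port_divergence` helper are hypothetical names made up for this illustration; they are not part of the chart or of funcX.

```python
# Port lists copied from the notes above. The labels and the helper name
# are hypothetical, invented for this sketch; none of them come from the chart.
observed = {
    "forwarder exposed ports": set(range(55002, 55006)),  # 55002-55005
    "forwarder environment": {55001, 55002, 55003},       # TASKS/RESULTS/COMMANDS
    "funcx-forwarder service": {55001, 55003, 55005},
    "remote nmap scan": {55001, 55002, 55003},
}


def port_divergence(port_sets):
    """For each source, list the ports that some other source mentions but it does not."""
    all_ports = set().union(*port_sets.values())
    return {name: sorted(all_ports - ports) for name, ports in port_sets.items()}


# Show what each source leaves out relative to the union of all sources.
for name, missing in port_divergence(observed).items():
    print(f"{name}: does not list {missing}")

# Ports that every source agrees on.
print("listed everywhere:", sorted(set.intersection(*observed.values())))
```

With the numbers as quoted, 55003 is the only port that appears in all four lists, which makes the divergence easier to see than reading the three configurations side by side.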
From 3444e6da77ff42c530a5a08cbc5701d0d687dde2 Mon Sep 17 00:00:00 2001 From: Ben Clifford Date: Mon, 20 Sep 2021 20:48:22 +0000 Subject: [PATCH 10/42] wip - some notes on understanding ports --- README.md | 28 ++++++++++++++++++++++++++++ 1 file changed, 28 insertions(+) diff --git a/README.md b/README.md index a0b7db9..f894761 100644 --- a/README.md +++ b/README.md @@ -722,3 +722,31 @@ ports, seeing as 55001 is finding its way in there anyway?) the funcx-forwarder service lists 55001, 55003, 55005 which is a *third* combination of those different ports. (!) +they're labelled in the service as zmq1, zmq2, zmq3 + +... is the forwarder running in some weird non-kubernetes network env? (eg the host native network env?) it appears to have an IP: field described in the pod: +IP: 65.108.55.218 +IPs: + IP: 65.108.55.218 + +and the chart has "hostNetwork: true" + - so all the port forwards that its doing in the 55000...whatever range are unneeded I guess and thats why it doesn't matter that they're messed up? + +[TODO: understand and rationalise that configuration. if this is always host network, +are any of those other port descriptions necessary at all? not in the minikube case +I think, but what about in real deployment] + +what do the ports: declarations do in funcx/templates/forwarder-deployment.yaml anyway? +it looks like they're explicitly adding on 1 to the actual values ?? + +These ports look like harmless fluff that is intellectually taxing/wasteful on someone +trying to understand... is that true? + +Who is communicating with the forwarder? + + +TODO: If these ports can be configured publicly, can the ports for the web service +and websockets service also be configured that way? What's the difference between +the web(sockets) ports and the forwarder ports? On the production system are they +made fully public in different ways? 
+ From c65918d7f56a7f694420fa0774cae36b6dd09a9e Mon Sep 17 00:00:00 2001 From: Ben Clifford Date: Thu, 23 Sep 2021 16:14:43 +0000 Subject: [PATCH 11/42] wip --- README.md | 50 +++++++++++++++++++++++++++++++++++++++++++++++ funcx/values.yaml | 2 +- 2 files changed, 51 insertions(+), 1 deletion(-) diff --git a/README.md b/README.md index f894761..fca694f 100644 --- a/README.md +++ b/README.md @@ -750,3 +750,53 @@ and websockets service also be configured that way? What's the difference betwe the web(sockets) ports and the forwarder ports? On the production system are they made fully public in different ways? + + +### nice ways to get endpoint in k8s cluster ofr devs + +eg make it consistent every time over restarts rather than random each time? + +or output it somewhere that cna be read programmatically by clients? + + +### k8s endpoint worker names + +... are millisecond based, so I have seen two pods next to each otehr with the same name: + +# minikube kubectl get pods +NAME READY STATUS RESTARTS AGE +funcx-1632329996841 1/1 Running 0 4m3s +funcx-1632329998841 1/1 Running 0 4m2s +funcx-1632330007842 1/1 Running 0 3m53s +funcx-endpoint-86756c48c8-nc7r2 1/1 Running 2 (10m ago) 12m +funcx-forwarder-db744678c-6r9nq 1/1 Running 0 12m +funcx-funcx-web-service-6745bd4f5d-p59qc 1/1 Running 0 12m +funcx-funcx-websocket-service-bb766fbcd-x82sg 1/1 Running 0 12m +funcx-postgresql-0 1/1 Running 0 12m +funcx-rabbitmq-0 1/1 Running 0 12m +funcx-redis-master-0 1/1 Running 0 12m +funcx-redis-slave-0 1/1 Running 0 12m +funcx-redis-slave-1 1/1 Running 0 12m +plsql 1/1 Running 389 (22m ago) 16d + + +## websocket-service tag bug in helm chart + +uses 'latest' not 'main' which is weeks old at time of my writing +i am trying to switch to main + + +## web-service build (and check others?) 
has a requiremenets.txt which +installs from git funcx api main + +docker build doesn't invalidate the cache as the api main tag advances, +becaues the command is not changed, and so rebuilds are not built with the +latest main. + +this is terrible and obscure and i only noticed it randomly in passing. + +=== +> docker build --no-cache -t funcx-web-service is my default + +says yadu. +=== diff --git a/funcx/values.yaml b/funcx/values.yaml index dc4a1c7..75a39e5 100644 --- a/funcx/values.yaml +++ b/funcx/values.yaml @@ -80,5 +80,5 @@ rabbitmq: websocketService: image: funcx/funcx-websocket-service - tag: latest + tag: main pullPolicy: Always From c167bad129cba55e6910a3f39e2b7b2fee7c6038 Mon Sep 17 00:00:00 2001 From: Ben Clifford Date: Fri, 24 Sep 2021 13:23:08 +0000 Subject: [PATCH 12/42] wip --- README.md | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/README.md b/README.md index fca694f..1b262a1 100644 --- a/README.md +++ b/README.md @@ -799,4 +799,8 @@ this is terrible and obscure and i only noticed it randomly in passing. > docker build --no-cache -t funcx-web-service is my default  says yadu. + - this has gone into the helm-chart Makefile documentation already +=== + +=== === From aad0839976755ea91896ca2930d2a9695503545e Mon Sep 17 00:00:00 2001 From: Ben Clifford Date: Fri, 1 Oct 2021 12:17:28 +0000 Subject: [PATCH 13/42] WIP --- README.md | 9 ++++++--- 1 file changed, 6 insertions(+), 3 deletions(-) diff --git a/README.md b/README.md index 1b262a1..e662cbd 100644 --- a/README.md +++ b/README.md @@ -85,6 +85,10 @@ now can run: and see the default helm help text. +TODO: is there a "hello world" style helm+kubernetes validation that could be run so +that we can say "you need helm+kubernetes at least good enough to do this:" + + now get the funcx helm repo: mkdir src @@ -189,10 +193,9 @@ yes, with lots of error messages to ignore, I did this: docker run --rm -ti funcx/kube-endpoint:main /home/funcx/boot.sh funcx +but how then did I get the token out? 
) - - You will be prompted to follow the authorization link and paste the resulting token into the console. Once you do that, funcx-endpoint will create a `~/.funcx` directory and provide you with a token file. @@ -206,7 +209,7 @@ kubectl create secret generic funcx-sdk-tokens \ --from-file ~/.funcx/credentials/funcx_sdk_tokens.json ``` -## Installing FuncX +## Installing FuncX ["central services"? what's the right title vs the endpoint and client?] 1. Make a clone of this repository 2. Download subcharts: From ed7fccc773b0301e7f363532c42cfa9f7caf22b6 Mon Sep 17 00:00:00 2001 From: Ben Clifford Date: Fri, 1 Oct 2021 12:19:10 +0000 Subject: [PATCH 14/42] Replace notes for a worker name bug with opened issue --- README.md | 28 +++++++--------------------- 1 file changed, 7 insertions(+), 21 deletions(-) diff --git a/README.md b/README.md index e662cbd..ebc3527 100644 --- a/README.md +++ b/README.md @@ -762,27 +762,6 @@ eg make it consistent every time over restarts rather than random each time? or output it somewhere that cna be read programmatically by clients? -### k8s endpoint worker names - -... are millisecond based, so I have seen two pods next to each otehr with the same name: - -# minikube kubectl get pods -NAME READY STATUS RESTARTS AGE -funcx-1632329996841 1/1 Running 0 4m3s -funcx-1632329998841 1/1 Running 0 4m2s -funcx-1632330007842 1/1 Running 0 3m53s -funcx-endpoint-86756c48c8-nc7r2 1/1 Running 2 (10m ago) 12m -funcx-forwarder-db744678c-6r9nq 1/1 Running 0 12m -funcx-funcx-web-service-6745bd4f5d-p59qc 1/1 Running 0 12m -funcx-funcx-websocket-service-bb766fbcd-x82sg 1/1 Running 0 12m -funcx-postgresql-0 1/1 Running 0 12m -funcx-rabbitmq-0 1/1 Running 0 12m -funcx-redis-master-0 1/1 Running 0 12m -funcx-redis-slave-0 1/1 Running 0 12m -funcx-redis-slave-1 1/1 Running 0 12m -plsql 1/1 Running 389 (22m ago) 16d - - ## websocket-service tag bug in helm chart uses 'latest' not 'main' which is weeks old at time of my writing @@ -807,3 +786,10 @@ says yadu. 
=== === + + +# issues opened + +https://github.com/funcx-faas/funcX/issues/600 (relative-dupe of existing parsl issue) + + From 9d5eeb43b21a28596a7fec7ddccdcaa2901dad0a Mon Sep 17 00:00:00 2001 From: Ben Clifford Date: Fri, 1 Oct 2021 12:21:49 +0000 Subject: [PATCH 15/42] replace web socket tag note with merged PR url --- README.md | 10 +++------- 1 file changed, 3 insertions(+), 7 deletions(-) diff --git a/README.md b/README.md index ebc3527..ba5e36a 100644 --- a/README.md +++ b/README.md @@ -762,12 +762,6 @@ eg make it consistent every time over restarts rather than random each time? or output it somewhere that cna be read programmatically by clients? -## websocket-service tag bug in helm chart - -uses 'latest' not 'main' which is weeks old at time of my writing -i am trying to switch to main - - ## web-service build (and check others?) has a requiremenets.txt which installs from git funcx api main @@ -788,7 +782,9 @@ says yadu. === -# issues opened +# PRs and issues opened + +https://github.com/funcx-faas/helm-chart/pull/36 (websockets installed from wrong tag) https://github.com/funcx-faas/funcX/issues/600 (relative-dupe of existing parsl issue) From dc6653ae397381334f2fa28a53b705e61f3ff6de Mon Sep 17 00:00:00 2001 From: Ben Clifford Date: Fri, 1 Oct 2021 12:33:51 +0000 Subject: [PATCH 16/42] Add url for broken k8s worker pod accumulation --- README.md | 2 ++ 1 file changed, 2 insertions(+) diff --git a/README.md b/README.md index ba5e36a..1a80d2a 100644 --- a/README.md +++ b/README.md @@ -788,4 +788,6 @@ https://github.com/funcx-faas/helm-chart/pull/36 (websockets installed from wron https://github.com/funcx-faas/funcX/issues/600 (relative-dupe of existing parsl issue) +https://github.com/funcx-faas/funcX/issues/601 (broken k8s worker pods accumulate forever) + From b3aff97a329af9c0844f4c215bf1601b834e4e6c Mon Sep 17 00:00:00 2001 From: Ben Clifford Date: Fri, 1 Oct 2021 14:10:46 +0000 Subject: [PATCH 17/42] update URL of readme-related issues --- README.md | 
4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/README.md b/README.md index 1a80d2a..e2a187a 100644 --- a/README.md +++ b/README.md @@ -786,7 +786,9 @@ says yadu. https://github.com/funcx-faas/helm-chart/pull/36 (websockets installed from wrong tag) -https://github.com/funcx-faas/funcX/issues/600 (relative-dupe of existing parsl issue) +https://github.com/funcx-faas/helm-chart/pull/39 (tidy up suggested values documentation) + +https://github.com/funcx-faas/funcX/issues/600 (duplicate pod names - dupe of existing parsl issue) https://github.com/funcx-faas/funcX/issues/601 (broken k8s worker pods accumulate forever) From 127b2b6531f355273b9aaf8946960d6083764b7c Mon Sep 17 00:00:00 2001 From: Ben Clifford Date: Mon, 18 Oct 2021 10:23:27 +0000 Subject: [PATCH 18/42] ingress notes for microk8s --- README.md | 25 +++++++++++++++++++++++++ 1 file changed, 25 insertions(+) diff --git a/README.md b/README.md index e2a187a..fc655c0 100644 --- a/README.md +++ b/README.md @@ -779,6 +779,31 @@ says yadu. === === +microk8s ingress (with ingress PR applied...) +install microk8s ingress. this puts in a DaemonSet that launches the ingress controller. 
+this needs editing: + +root@pearl:~# microk8s kubectl get daemonset -n ingress +NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE +nginx-ingress-microk8s-controller 1 1 1 1 1 12d + +~/kubectl edit daemonset nginx-ingress-microk8s-controller -n ingress + +edit eg the container port (for @sirosen's use case, but not otherwise): + ports: + - containerPort: 80 + hostPort: 81 + name: http + protocol: TCP + - containerPort: 443 + hostPort: 443 + name: https + protocol: TCP + +and somewhere (perhaps the container args in that daemonset) remove the restriction on running + + + === From 583a7d9d01e7cfc434d7cf1c22d721b171073bce Mon Sep 17 00:00:00 2001 From: Ben Clifford Date: Wed, 20 Oct 2021 10:40:09 +0000 Subject: [PATCH 19/42] More notes --- README.md | 20 +++++++++++++++----- 1 file changed, 15 insertions(+), 5 deletions(-) diff --git a/README.md b/README.md index fc655c0..603225a 100644 --- a/README.md +++ b/README.md @@ -11,8 +11,8 @@ version] Non-goals: -* Using the helm charts to deploy to the live/dev systems. -* Customising your helm deploy. +* Using the helm charts to deploy to the live/dev systems. [TODO: crossref to the actual notes for that] +* Customising your helm deploy. [well, actually maybe that is a goal - but there are some notes in another file for that so they should be crosslinked] # About benc's notes @@ -235,8 +235,8 @@ kubectl create secret generic funcx-sdk-tokens \ ``` 6. You can access your web service through the ingress or via a port forward to the web service pod. Instructions are provided in the displayed notes. -[what ingress? there is no ingress resource created by helm? talking about the -'service' resource for these?] 
+[ingress should become the official way, and be better documented - it's what I've +been working on] now i see a bunch of services including a funcx endpoint @@ -806,6 +806,17 @@ and somewhere (perhaps the container args in that daemonset) remove the restrict === +# dockerfile endpoint in funcx vs python version +should force choice of python rather than defaulting to some rando version that +doesn't make sense: +--- a/Dockerfile-endpoint ++++ b/Dockerfile-endpoint +@@ -1,4 +1,5 @@ +-ARG PYTHON_VERSION="3.8" ++ARG PYTHON_VERSION ++# eg PYTHON_VERSION="3.8" + + # PRs and issues opened @@ -817,4 +828,3 @@ https://github.com/funcx-faas/funcX/issues/600 (duplicate pod names - dupe of https://github.com/funcx-faas/funcX/issues/601 (broken k8s worker pods accumulate forever) - From 767d6a5c7303dfff6fe3c6539858c0f087c60a77 Mon Sep 17 00:00:00 2001 From: Ben Clifford Date: Wed, 27 Oct 2021 11:05:05 +0000 Subject: [PATCH 20/42] Rework post-install NOTES generation to nudge towards ingress usage --- funcx/templates/NOTES.txt | 12 ++++++++---- 1 file changed, 8 insertions(+), 4 deletions(-) diff --git a/funcx/templates/NOTES.txt b/funcx/templates/NOTES.txt index 9f89276..f425e9f 100644 --- a/funcx/templates/NOTES.txt +++ b/funcx/templates/NOTES.txt @@ -4,14 +4,18 @@ progress via the command kubectl get pods --namespace {{ .Release.Namespace }} -To access the REST server you will need to run a Kubernetes Port-Forward: - {{- if .Values.ingress.enabled -}} Your service will be waiting for you at http://{{ .Release.Name }}-funcx.{{ .Values.ingress.host }} {{- else }} +To access the REST server, either enable ingress or run Kubernetes port-forward commands: + +For the web service: + kubectl port-forward service/{{ .Release.Name }}-funcx-web-service 5000:8000 + +For the websocket service: + + kubectl port-forward service/{{ .Release.Name }}-funcx-websocket-service 6000:6000 {{- end }} -To access the websocket service you will need to run a Kubernetes Port-Forward: -kubectl 
port-forward service/{{ .Release.Name }}-funcx-websocket-service 6000:6000 From e105805d19c4ac4db48569166e229f4e44996252 Mon Sep 17 00:00:00 2001 From: Ben Clifford Date: Wed, 27 Oct 2021 11:18:28 +0000 Subject: [PATCH 21/42] Add first pass of install notes for developer-local ingress --- README.md | 56 ++++++++++++++++++++++++++++++++++++++++++++++++++++--- 1 file changed, 53 insertions(+), 3 deletions(-) diff --git a/README.md b/README.md index 71bc1c8..d9b9d8b 100644 --- a/README.md +++ b/README.md @@ -108,13 +108,63 @@ kubectl create secret generic funcx-sdk-tokens \ ```shell script helm install -f deployed_values/values.yaml funcx funcx ``` -6. You can access your web service through the ingress or via a port forward -to the web service pod. Instructions are provided in the displayed notes. -7. You should be able to see the endpoint registering with the web service +6. You should be able to see the endpoint registering with the web service in their respective logs, along with the forwarder log. Check the endpoint's logs for its ID. +## Exposing FuncX to external clients and endpoints + +You need to expose two ports from your cluster to clients. There are two ways: +ingress and port-forward. These instructions talk about ingress, which is more +complicated to set up but easier to maintain and closer to the production +configuration. If you do not configure ingress, then the post-install notes +output by `helm install` will tell you which port-forward commands to run. + +1. Install an ingress controller. Minikube and microk8s do this differently. + +1.a Minikube: + +Run this: + +``` +minikube addons enable ingress +``` + +1.b microk8s + +Run this: + +``` +microk8s enable ingress +``` + +Then configure microk8s to serve all namespaces. (by default, it only +serves the `public` ingress class). [TODO: i need to write the exact commands for this] + +2. Get a hostname that your kubernetes install is accessible under. 
+ +Depending on your development environment, this might be the public hostname of your +kubernetes server, or it might be an entry in `/etc/hosts` pointing to 127.0.0.1. +Maybe even `localhost` works in that case. + +3. Enable ingress in the funcx install + +Edit `deployed_values/values.yaml` to enable funcx ingress and to tell funcx the +host name from step 2. + +``` +ingress: + enabled: true + host: amber.cqx.ltd.uk +``` + +4. Redeploy funcx + +``` +helm upgrade --atomic -f deployed_values/values.yaml funcx funcx +``` + ### Forwarder Debugging > :warning: *Only for debugging*: You can set the forwarder curve server key manually by creating From c868fd54af493b95d97a81109c0ab104f5fc98a3 Mon Sep 17 00:00:00 2001 From: Ben Clifford Date: Mon, 14 Feb 2022 11:12:01 +0000 Subject: [PATCH 22/42] Refresh intro and add see-also --- README.md | 25 ++++++++++++++----------- 1 file changed, 14 insertions(+), 11 deletions(-) diff --git a/README.md b/README.md index 53f1f54..a59088b 100644 --- a/README.md +++ b/README.md @@ -1,18 +1,15 @@ -# helm-chart -Helm Chart for Deploying funcX stack +# A Chart for Deploying the funcX stack -# About this document +# Who is this README for? -Goal: Getting started guide for a new funcX developer -making their first install for themselves to hack on. -[so there should be a prefered main path with minimal -choices for customisation of the initial install - eg proper defaults for python -version] +This README is aimed at people who want to deploy funcX services into +kubernetes. -Non-goals: +The main part of the text is aimed at a new funcX developer making their +first install for themselves to hack on. -* Using the helm charts to deploy to the live/dev systems. [TODO: crossref to the actual notes for that] -* Customising your helm deploy. 
[well, actually maybe that is a goal - but there are some notes in another file for that so they should be crosslinked] +Other notes on the way talk about how the install can be made in the +production system at AWS, and in development environments hosted at AWS. # About benc's notes @@ -1026,3 +1023,9 @@ postgres, and rabbitmq) running at a specified host under `*.api.dev.funcx.org`. * Create a new route53 record for the given host (josh-test.dev.funcx.org). We won't have to do this after [external dns](https://kubernetes-sigs.github.io/aws-load-balancer-controller/v2.2/guide/integrations/external_dns/) has been enabled. + + +## See also + +More notes in the local_dev/ subdirectory that should be merged into this file + From 8d5abfeef24ed8819f8ab0b04a11a5f4200b6894 Mon Sep 17 00:00:00 2001 From: Ben Clifford Date: Mon, 14 Feb 2022 11:26:36 +0000 Subject: [PATCH 23/42] Work through tidying - most importantly, move non-k8s endpoint install option to the end as an advanced option --- README.md | 206 ++++++++++++++++++++++++++---------------------------- 1 file changed, 99 insertions(+), 107 deletions(-) diff --git a/README.md b/README.md index a59088b..c27fb10 100644 --- a/README.md +++ b/README.md @@ -3,7 +3,7 @@ # Who is this README for? This README is aimed at people who want to deploy funcX services into -kubernetes. +kubernetes, using the helm chart contained in this repository. The main part of the text is aimed at a new funcX developer making their first install for themselves to hack on. @@ -11,22 +11,15 @@ first install for themselves to hack on. Other notes on the way talk about how the install can be made in the production system at AWS, and in development environments hosted at AWS. 
-# About benc's notes - -Notes are here for various purposes: (in no particular order) - -i) making this document better support the Document Goal -ii) making the default configuration of funcX in the repositories better support the -Document Goal -iii) making funcX better support users (people invoking functions, and people operating -their own endpoints). - - [![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0) [![NSF-2004894](https://img.shields.io/badge/NSF-2004894-blue.svg)](https://nsf.gov/awardsearch/showAward?AWD_ID=2004894) [![NSF-2004932](https://img.shields.io/badge/NSF-2004932-blue.svg)](https://nsf.gov/awardsearch/showAward?AWD_ID=2004932) -This application includes: + +## Components + +A funcx install consists of a number of components, which will each be installed by this chart. + * FuncX Web-Service * FuncX Websocket Service * FuncX Forwarder @@ -35,143 +28,90 @@ This application includes: * Redis Shared Data Structure * RabbitMQ broker -## Kubernetes pre-reqs +## Kubernetes pre-requisites -You will need a kubernetes installation. +You will need a Kubernetes (k8s) installation. There is no particularly +favoured form of setup. Some ways used by funcX developers include: -Some ways in which you can get an installation: +* minikube on a hetzner cloud node +* docker desktop +* microk8s -(benc: hetzner cloud + minikube) -(slack suggestion: docker desktop) +You will also need `helm`. -## eg. minikube + hetzner cloud +You will maybe need an ingress controller - traditionally people haven't been +using one but we are pushing on that a bit. More later. 
-base os: ubuntu 20.04.03 +### Example install of hetzner cloud + minikube -running everything as roo +This shows how @benclifford installed minikube on a hetzner cloud node as +root: -apt-get install docker.io -curl -LO https://storage.googleapis.com/minikube/releases/latest/minikube-linux-amd64 -install minikube-linux-amd64 /usr/local/bin/minikube -apt-get install contrack # because: ❌ Exiting due to GUEST_MISSING_CONNTRACK: Sorry, Kubernetes 1.22.1 requires conntrack to be installed in root's path +Base os: Ubuntu 20.04.03 -minikube --driver=none # because i am root in a VM. otherwise apparently driver=docker might be nice? I haven't tried +``` +# apt-get install docker.io +# curl -LO https://storage.googleapis.com/minikube/releases/latest/minikube-linux-amd64 +# install minikube-linux-amd64 /usr/local/bin/minikube + +# apt-get install conntrack # because: ❌ Exiting due to GUEST_MISSING_CONNTRACK: Sorry, Kubernetes 1.22.1 requires conntrack to be installed in root's path + +# minikube start --driver=none # because i am root in a VM. otherwise apparently driver=docker might be nice? I haven't tried +``` Now can run to see running pods -$ minikube kubectl -- get pods -A +``` +# minikube kubectl -- get pods -A +``` This gives me 7 runnings k8s pods. +Now install helm, following debian/apt instructions here: https://helm.sh/docs/intro/install/ -now install helm, following debian/apt instructions here: -https://helm.sh/docs/intro/install/ - +``` curl https://baltocdn.com/helm/signing.asc | sudo apt-key add - sudo apt-get install apt-transport-https --yes echo "deb https://baltocdn.com/helm/stable/debian/ all main" | sudo tee /etc/apt/sources.list.d/helm-stable-debian.list sudo apt-get update sudo apt-get install helm +``` + +Now you can run this to see the default helm help text, to check it was installed: -now can run: ``` # helm ``` -and see the default helm help text. 
TODO: is there a "hello world" style helm+kubernetes validation that could be run so that we can say "you need helm+kubernetes at least good enough to do this:" -now get the funcx helm repo: +Now clone the funcx helm repo (which you are reading this document from, perhaps): +``` mkdir src cd src git clone git@github.com:funcx-faas/helm-chart cd helm-chart - -and run the first helm command from below: - -helm dependency update funcx - -which will download some stuff. - -Note that it downloads a funcx_endpoint chart over http - that isn't something contained in this repo, even though it is a funcx related chart... - -see notes further down for continuation... - -## Preliminaries [for funcx endpoint] - -how is this a preliminary rather than part of the main install? it even looks like a funcx-endpoint is set up as part of helm automatically ... so is this whole endpoint section irrelevant for an initial install? or at least, there should be better intro description at this point -that an endpoint will be deployed inside k8s? - -make a decision for the user - as they are new. they can try different ways later, which -can be documented elsewhere - for example, this is the helm repo so should be talking -about the helm deployment of the endpoint. Or pointing people at the endpoint repo. - -I think for an initial install, the in-kubernetes default endpoint which configures -itself almost automatically should be chosen for getting started. With instructions -on attaching an external endpoint described *afterwards* at the end of this document. - -There are two modes in which funcx-endpoints could be deployed: - - -1. 
funcx-endpoint deployed inside k8s - -Also be clear on the deployment modes: production-like (eg with many users, centrally, -with expection that images are from tags, endpoints deployed by other people) -and development-like - eg on my own private VM with my own hacked up source changes and -all sorts of mess, and everything including endpoints deployed by me. - -be clear throughout this document what refers to those two use cases. - -### Deploying funcx-endpoint outside of K8s [this is "advanced" - move to end of doc, and crossref with other "install an endpoint" document] - ---- -**NOTE** - -This only works on Linux systems. - ---- - -Here are the steps to install, preferably into your active conda environment: - -```shell script -git clone https://github.com/funcx-faas/funcX.git -cd funcX -git checkout main -pip install funcx_sdk -pip install funcx_endpoint ``` -Next create an endpoint configuration: +and run a helm command to update funcx dependencies: -```shell script -funcx-endpoint +``` +helm dependency update funcx ``` -Update the endpoint's configuration file to point the endpoint to locally -deployed services, which we will setup in the next sections. If using default -values, the funcx_service_address should be set to http://localhost:5000/v2. - -`~/.funcx/default/config.py` +which will download some stuff. (TODO: what?) -```python - config = Config( - executors=[HighThroughputExecutor( - provider=LocalProvider( - init_blocks=1, - min_blocks=0, - max_blocks=1, - ), - )], - funcx_service_address="http://127.0.0.1:5000/api/v1", # <--- UPDATE THIS LINE -) -``` +One note for later is this step downloads a funcx_endpoint chart: although +this is funcX related, it is a separately versioned component because end users +are expected to deploy the endpoint - unlike the funcX services that this +current chart describes. 
(TODO: notes later for overriding this for +development) ### Deploying funcx-endpoint into the K8s deployment @@ -1024,6 +964,58 @@ postgres, and rabbitmq) running at a specified host under `*.api.dev.funcx.org`. We won't have to do this after [external dns](https://kubernetes-sigs.github.io/aws-load-balancer-controller/v2.2/guide/integrations/external_dns/) has been enabled. +## Advanced option: Deploying funcx-endpoint outside of K8s [this is "advanced" - move to end of doc, and crossref with other "install an endpoint" document] + +The above noteis installed a funcx endpoint inside kubernetes, alongside the funcx services. +In real life, end users would install funcx endpoints elsewhere (on their compute +resources) and attach them to the officially funcx services. + +It is also possible to install an endpoint elsewhere and attach it to services +deployed by this chart for dev purposes. + +--- +**NOTE** + +This only works on Linux systems. + +--- + +Here are the steps to install, preferably into your active conda environment: + +```shell script +git clone https://github.com/funcx-faas/funcX.git +cd funcX +git checkout main +pip install funcx_sdk +pip install funcx_endpoint +``` + +Next create an endpoint configuration: + +```shell script +funcx-endpoint +``` + +Update the endpoint's configuration file to point the endpoint to locally +deployed services, which we will setup in the next sections. If using default +values, the funcx_service_address should be set to http://localhost:5000/v2. 
+ +`~/.funcx/default/config.py` + +```python + config = Config( + executors=[HighThroughputExecutor( + provider=LocalProvider( + init_blocks=1, + min_blocks=0, + max_blocks=1, + ), + )], + funcx_service_address="http://127.0.0.1:5000/api/v1", # <--- UPDATE THIS LINE +) +``` + + ## See also From e1900aee1fd42fcec89d7e6b21bd131d771e1577 Mon Sep 17 00:00:00 2001 From: Ben Clifford Date: Mon, 14 Feb 2022 11:26:58 +0000 Subject: [PATCH 24/42] fix markdown title level --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index c27fb10..ca59030 100644 --- a/README.md +++ b/README.md @@ -1,6 +1,6 @@ # A Chart for Deploying the funcX stack -# Who is this README for? +## Who is this README for? This README is aimed at people who want to deploy funcX services into kubernetes, using the helm chart contained in this repository. From bdb6057869bfc3d06ca749e7097f281064c71a99 Mon Sep 17 00:00:00 2001 From: Ben Clifford Date: Mon, 14 Feb 2022 11:44:56 +0000 Subject: [PATCH 25/42] Make another pass --- README.md | 203 +++++++++++++++++++++++++++++------------------------- 1 file changed, 111 insertions(+), 92 deletions(-) diff --git a/README.md b/README.md index ca59030..577bed3 100644 --- a/README.md +++ b/README.md @@ -116,6 +116,9 @@ development) ### Deploying funcx-endpoint into the K8s deployment +TODO: this is messy and not part of the service install so i'm not sure +if it should happen here or as part of the configuration section? + We can deploy the kubernetes endpoint as a pod as part of the chart. It needs to have a valid copy of the funcx's `funcx_sdk_tokens.json` which can be created by running on your local workstation and running @@ -148,7 +151,7 @@ kubectl create secret generic funcx-sdk-tokens \ ## Installing FuncX ["central services"? what's the right title vs the endpoint and client?] -0. Update cloudformation stack if necessary +0. 
Update cloudformation stack if necessary [TODO: I think this is only for production deployment? ask Josh. In which case, ignore for personal dev cluster] 1. Make a clone of this repository 2. Download subcharts: @@ -159,25 +162,31 @@ kubectl create secret generic funcx-sdk-tokens \ [forward reference to the two different values sections later on in this document: should I just have the three lines mentioned here? or should I be copy-pasting a huge example?] -4. Obtain Globus Client ID and Secret. These secrets need to exist in the - correct Globus Auth app. Ask for access to the credentials by contacting - https://github.com/BenGalewsky or sending a message to the `dev` funcx Slack - channel. Once you have your credentials, paste them into your `values.yaml`: + [TODO: paragraph desribing what values.yaml will do] +4. Obtain Globus Client ID and Secret. Get the credentials by asking on the + `dev` funcx Slack channel. + + Once you have your credentials, paste them into your `values.yaml`: ```yaml webService: globusClient: <> globusKey: <> ``` + + [TODO: there are plans afoot to make this different for developers. When + that is settled, can reuse deleted text, and elaborate: + These secrets need to exist in the correct Globus Auth app. + ] + 5. Install the helm chart: ```shell script helm install -f deployed_values/values.yaml funcx funcx ``` 5b. -now i see a bunch of services including a funcx endpoint - -looks like this: +now i see a bunch of services including a funcx endpoint, like this: +``` # minikube kubectl get pods NAME READY STATUS RESTARTS AGE funcx-endpoint-86756c48c8-flhqf 1/1 Running 0 95m @@ -189,50 +198,99 @@ funcx-rabbitmq-0 1/1 Running 0 funcx-redis-master-0 1/1 Running 0 95m funcx-redis-slave-0 1/1 Running 0 95m funcx-redis-slave-1 1/1 Running 0 95m - +``` 6. You should be able to see the endpoint registering with the web service in their respective logs, along with the forwarder log. Check the endpoint's logs for its ID. -7. 
You can access your web service through the ingress or via a port forward -to the web service pod. Instructions are provided in the displayed notes. -[ingress should become the official way, and be better documented - it's what I've -been working on] +7. You can access your funcX services through the ingress or via a port forward +to the web service pod. For port forwarding, instructions are provided in the displayed notes. +[ingress should become the official way, and be better documented, but there is less +experience with it and it needs an ingress controller...] [clarify which logs / *where* those logs are? explicitly which (3?) logs to look at... - who is registering with whom?] Endpoint log will look like: - +``` 2021-09-14 16:36:02 endpoint.endpoint_manager:172 [INFO] Starting endpoint with uuid: cfd389f3-4eda-413b-af95-4d54a8e944dc +``` forwarder will look like: +``` {"asctime": "2021-09-14 18:20:44,535", "name": "funcx_forwarder.forwarder", "levelname": "DEBUG", "message": "endpoint_status_message", "log_type": "endpoint_status_message", "endpoint_id": "cfd389f3-4eda-413b-af95-4d54a8e944dc", "endpoint_status_message": {"_payload": null, "_header": "b'\\xcf\\xd3\\x89\\xf3N\\xdaA;\\xaf\\x95MT\\xa8\\xe9D\\xdc'", "ep_status": {"task_id": -2, "info": {"total_cores": 0, "total_mem": 0, "new_core_hrs": 0, "total_core_hrs": 0, "managers": 0, "active_managers": 0, "total_workers": 0, "idle_workers": 0, "pending_tasks": 0, "outstanding_tasks": {}, "worker_mode": "no_container", "scheduler_mode": "hard", "scaling_enabled": true, "mem_per_worker": null, "cores_per_worker": 1.0, "prefetch_capacity": 10, "max_blocks": 100, "min_blocks": 1, "max_workers_per_node": 1, "nodes_per_block": 1}}, "task_statuses": {}}} +``` web service will look like: +``` {"asctime": "2021-09-14 16:36:03,273", "name": "funcx_web_service", "levelname": "INFO", "message": "Successfully registered cfd389f3-4eda-413b-af95-4d54a8e944dc in database"} +``` + +## Exposing FuncX to external clients and 
endpoints + +You need to expose two ports from your cluster to clients. There are two ways: +ingress and port-forward. These instructions talk about ingress, which is more +complicated to set up but easier to maintain and closer to the production +configuration. If you do not configure ingress, then the post-install notes +output by `helm install` will tell you which port-forward commands to run. -### connecting clients +1. Install an ingress controller. Minikube and microk8s do this differently. -the startup message (from helm) has a couple of kubectl port-forward commands that might be a bit wrong - i ended up using these two: -# minikube kubectl -- port-forward --address 0.0.0.0 service/funcx-funcx-web-service 8000 -# minikube kubectl -- port-forward --address 0.0.0.0 service/funcx-funcx-websocket-service 6000 +1.a Minikube: -These port forwards are only temporary - they run as foreground processes and break as soon as the pods change (for example due to restarts). That seems a bit frustrating if they're meant to be pointing to services. Is there a more persistent kubernetes configuration that can be used to expose to the world? And for other people, to expose to whatever their security-scoped environment is? +Run this: -This will expose the services on port 8000 and port 6000 - because this is a service, the 2nd port number in the helm suggested text is ignored, I think - so that could be removed in a PR (as long as I check and justify that with documentation links) +``` +minikube addons enable ingress +``` -now from a working funcx install, create a funcx client pointed at the current -service, like this: +1.b microk8s -fxc = FuncXClient(funcx_service_address="http://amber.cqx.ltd.uk:8000/v2") +Run this: + +``` +microk8s enable ingress +``` + +Then configure microk8s to serve all namespaces. (by default, it only +serves the `public` ingress class). [TODO: i need to write the exact commands for this] + +2. 
Get a hostname that your kubernetes install is accessible under. + +Depending on your development environment, this might be the public hostname of your +kubernetes server, or it might be an entry in `/etc/hosts` pointing to 127.0.0.1. +Maybe even `localhost` works in that case. + +3. Enable ingress in the funcx install + +Edit `deployed_values/values.yaml` to enable funcx ingress and to tell funcx the +host name from step 2. + +``` +ingress: + enabled: true + host: amber.cqx.ltd.uk +``` + +4. Redeploy funcx + +``` +helm upgrade --atomic -f deployed_values/values.yaml funcx funcx +``` + +### Connecting clients + +Create a `FuncXClient` instance pointing at your install, by specifying the funcx_service_address: -and then run quickstart guide style stuff - probably i don't need to paste it here, but -I could... +``` +fxc = FuncXClient(funcx_service_address="http://amber.cqx.ltd.uk:8000/v2") +``` +and then run the same sort of tests as can happen against the tutorial endpoint. For example: +``` from funcx.sdk.client import FuncXClient fxc = FuncXClient(PARMS HERE) @@ -246,6 +304,15 @@ tutorial_endpoint = 'LOCAL ENDPOINT HERE' result = fxc.run(endpoint_id=tutorial_endpoint, function_id=func_uuid) print(fxc.get_result(result)) +``` + +If you have got this far, then you have successfully installed the current +version of funcx, and can begin to hack. + + +[TODO: at this point I got into a lot of tangle with default Python versions +not matching up. We've subsequently talked and fiddled with this - so check +what will happen. In the meantime skip these notes: this gets as far as submitting for me, but attempts to get the result always give funcx.utils.errors.TaskPending: Task is pending due to waiting-for-nodes @@ -378,58 +445,7 @@ in their worker_init. it's frustrating that the python version is not set to the version that is actually used by the endpoint. - -## Exposing FuncX to external clients and endpoints - -You need to expose two ports from your cluster to clients. 
There are two ways: -ingress and port-forward. These instructions talk about ingress, which is more -complicated to set up but easier to maintain and closer to the production -configuration. If you do not configure ingress, then the post-install notes -output by `helm install` will tell you which port-forward commands to run. - -1. Install an ingress controller. Minikube and microk8s do this differently. - -1.a Minikube: - -Run this: - -``` -minikube addons enable ingress -``` - -1.b microk8s - -Run this: - -``` -microk8s enable ingress -``` - -Then configure microk8s to serve all namespaces. (by default, it only -serves the `public` ingress class). [TODO: i need to write the exact commands for this] - -2. Get a hostname that your kubernetes install is accessible under. - -Depending on your development environment, this might be the public hostname of your -kubernetes server, or it might be an entry in `/etc/hosts` pointing to 127.0.0.1. -Maybe even `localhost` works in that case. - -3. Enable ingress in the funcx install - -Edit `deployed_values/values.yaml` to enable funcx ingress and to tell funcx the -host name from step 2. - -``` -ingress: - enabled: true - host: amber.cqx.ltd.uk -``` - -4. Redeploy funcx - -``` -helm upgrade --atomic -f deployed_values/values.yaml funcx funcx -``` +] ### Forwarder Debugging @@ -463,7 +479,7 @@ overridden per-deployment by placing replacements in the non-version-controlled `deployed_values/values.yaml` - for example, the globusClient/globusKey values earlier in the install instructions. -This is a recommended initial set of values to override: +This is a recommended [TODO: by whom?] initial set of values to override: should I use the following values.yaml or the values.yaml I was told to make earlier? 
dedupe - and if this section is the values.yaml i should be using, move it up to @@ -496,14 +512,9 @@ rabbitmq: pullPolicy: Always ``` -### Additional config -[are these values that are defaulted in funcx/values.yaml and can be overridden in -deployed_values/values.yaml?] +Here are some more values that can be set to adjust the deployed system +configuration: -There are a few values that can be set to adjust the deployed system -configuration - -Here are some values that can be overriden: | Value | Desciption | Default | | ------------------------------ | ------------------------------------------------------------------- | ----------------- | @@ -542,7 +553,7 @@ Here are some values that can be overriden: ## Sealed Secrets -[why would i want to do this?] +[TODO: why would i want to do this?] The chart can take advantage of Bitnami's sealed secrets controller to encrypt sensitive config data so it can safely be checked into the GitHub repo. @@ -625,7 +636,7 @@ EOF ## Subcharts This chart uses two subcharts to supply dependent services. You can update settings for these by referencing the subchart name and values from -their READMEs. +their READMEs. [TODO: also the funcx endpoint subchart?] For example ``` yaml @@ -664,9 +675,13 @@ root@plsql:/# pgcli postgresql://funcx:leftfoot1@funcx-postgresql:5432/public -how can i run a test job against this install? +## Upgrades +[TODO: crossref/ incorporate text from the helm upgrade that happens in the +ingress section above +] + +# General benc grumbles to address -# upgrades How does I upgrade this? It was installed using the latest images at install time I guess? I see a funcx-web-service tag from 15h ago after running helm upgrade funcx funcx... (imgage ID d21432c1525a) - so is that what is running now? looks like it pulled a new image when I rebooted the server (!) @@ -804,6 +819,11 @@ eg make it consistent every time over restarts rather than random each time? 
or output it somewhere that cna be read programmatically by clients? +TODO: best common practice is probably to generate an endpoint ID and hard +configure it right from the start, as suggested by ryan. That eliminates the +need for fiddling in the logs to discover the random endpoint ID each time. +Be very clear that this needs to be unique. + ## web-service build (and check others?) has a requiremenets.txt which installs from git funcx api main @@ -873,10 +893,9 @@ https://github.com/funcx-faas/funcX/issues/601 (broken k8s worker pods accumulat -## Deployment/Release Guide - +## Making a release and deploying to the AWS clusters -The following is an incomplete guide to deploying a new release onto our development or production clusters. +The following is an incomplete guide to making and deploying a new release onto our development or production clusters. Here are the components that need updating as part of a release, in the order they should be updated due to dependencies. Note that only components that have changes for release need to updated and the From 870818185c6f6eec4fb12b50aaebbade6635e738 Mon Sep 17 00:00:00 2001 From: Ben Clifford Date: Mon, 14 Feb 2022 13:37:04 +0000 Subject: [PATCH 26/42] Re-title endpoint configuration --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index 577bed3..95154e0 100644 --- a/README.md +++ b/README.md @@ -114,7 +114,7 @@ current chart describes. (TODO: notes later for overriding this for development) -### Deploying funcx-endpoint into the K8s deployment +### Configuring a funcx-endpoint in the K8s deployment TODO: this is messy and not part of the service install so i'm not sure if it should happen here or as part of the configuration section? 
From ceafbe046e5cba64182ed090304d43dd6b0bd8e4 Mon Sep 17 00:00:00 2001 From: Ben Clifford Date: Mon, 14 Feb 2022 14:00:11 +0000 Subject: [PATCH 27/42] Switch to specifying fixed endpoint UUID --- README.md | 26 ++++++++++++++++++++++---- 1 file changed, 22 insertions(+), 4 deletions(-) diff --git a/README.md b/README.md index 95154e0..8cc7097 100644 --- a/README.md +++ b/README.md @@ -163,7 +163,9 @@ kubectl create secret generic funcx-sdk-tokens \ document: should I just have the three lines mentioned here? or should I be copy-pasting a huge example?] [TODO: paragraph desribing what values.yaml will do] -4. Obtain Globus Client ID and Secret. Get the credentials by asking on the + + +3a. Obtain Globus Client ID and Secret. Get the credentials by asking on the `dev` funcx Slack channel. Once you have your credentials, paste them into your `values.yaml`: @@ -178,6 +180,20 @@ kubectl create secret generic funcx-sdk-tokens \ These secrets need to exist in the correct Globus Auth app. ] +3b. Configure endpoint UUID: + + First generate a UUID, for example, by running `uuidgen` or `cat /proc/sys/kernel/random/uuid`. + + Do not copy someone elses UUID from their example configuration. All kinds of subtle identity + problems will happen if you do. + + Paste the UUID into your values.yaml in an endpoint section: + + ``` + funcx_endpoint: + endpointUUID: XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX + ``` + 5. Install the helm chart: ```shell script helm install -f deployed_values/values.yaml funcx funcx @@ -282,13 +298,15 @@ helm upgrade --atomic -f deployed_values/values.yaml funcx funcx ### Connecting clients -Create a `FuncXClient` instance pointing at your install, by specifying the funcx_service_address: +Create a `FuncXClient` instance pointing at your install, by specifying the funcx_service_address, ``` fxc = FuncXClient(funcx_service_address="http://amber.cqx.ltd.uk:8000/v2") ``` -and then run the same sort of tests as can happen against the tutorial endpoint. 
For example: +and by specifying your endpoint UUID (generated earlier) when invoking a function. + +Run the same sort of tests as can happen against the tutorial endpoint. For example: ``` from funcx.sdk.client import FuncXClient @@ -300,7 +318,7 @@ def hello_world(): func_uuid = fxc.register_function(hello_world) -tutorial_endpoint = 'LOCAL ENDPOINT HERE' +tutorial_endpoint = 'YOUR-ENDPOINT-UUID-HERE' result = fxc.run(endpoint_id=tutorial_endpoint, function_id=func_uuid) print(fxc.get_result(result)) From 77dea4a86ce1967cea4545ed2fc088c8afb4a163 Mon Sep 17 00:00:00 2001 From: Ben Clifford Date: Mon, 14 Feb 2022 14:00:58 +0000 Subject: [PATCH 28/42] Exhortions on unique uuids --- README.md | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/README.md b/README.md index 8cc7097..d8a9e58 100644 --- a/README.md +++ b/README.md @@ -185,7 +185,8 @@ kubectl create secret generic funcx-sdk-tokens \ First generate a UUID, for example, by running `uuidgen` or `cat /proc/sys/kernel/random/uuid`. Do not copy someone elses UUID from their example configuration. All kinds of subtle identity - problems will happen if you do. + problems will happen if you do. Similarly, if you make another endpoint install, do not re-use + the same UUID. UUIDs are cheap. If in doubt, generate a new one. Paste the UUID into your values.yaml in an endpoint section: From 59ff38f488020a4d1b20ebd22c6a9ce6f15fb0e2 Mon Sep 17 00:00:00 2001 From: Ben Clifford Date: Mon, 14 Feb 2022 14:18:02 +0000 Subject: [PATCH 29/42] Fix markdown which makes following paragraph into a code block --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index d8a9e58..ac77cc8 100644 --- a/README.md +++ b/README.md @@ -180,7 +180,7 @@ kubectl create secret generic funcx-sdk-tokens \ These secrets need to exist in the correct Globus Auth app. ] -3b. Configure endpoint UUID: +3b. Configure endpoint UUID. 
First generate a UUID, for example, by running `uuidgen` or `cat /proc/sys/kernel/random/uuid`. From 018b772cc82aa7c060ad81fc515a6efbb60ead1e Mon Sep 17 00:00:00 2001 From: Ben Clifford Date: Mon, 14 Feb 2022 14:29:49 +0000 Subject: [PATCH 30/42] Retitle endpoint config section that is really about making endpoint k8s secrets --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index ac77cc8..fe9639d 100644 --- a/README.md +++ b/README.md @@ -114,7 +114,7 @@ current chart describes. (TODO: notes later for overriding this for development) -### Configuring a funcx-endpoint in the K8s deployment +### Creating endpoint secrets for a funcx endpoint in the K8s deployment TODO: this is messy and not part of the service install so i'm not sure if it should happen here or as part of the configuration section? From 544812c8b212a825556a2715553841d7da16f5be Mon Sep 17 00:00:00 2001 From: Ben Clifford Date: Mon, 14 Feb 2022 15:13:50 +0000 Subject: [PATCH 31/42] Add a TODO label --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index fe9639d..49f30d7 100644 --- a/README.md +++ b/README.md @@ -149,7 +149,7 @@ kubectl create secret generic funcx-sdk-tokens \ --from-file ~/.funcx/credentials/funcx_sdk_tokens.json ``` -## Installing FuncX ["central services"? what's the right title vs the endpoint and client?] +## Installing FuncX [TODO: "central services"? what's the right title vs the endpoint and client?] 0. Update cloudformation stack if necessary [TODO: I think this is only for production deployment? ask Josh. 
In which case, ignore for personal dev cluster] From dbdea4879e2980995864ca9731f24cee5cc760db Mon Sep 17 00:00:00 2001 From: Ben Clifford Date: Mon, 14 Feb 2022 15:57:39 +0000 Subject: [PATCH 32/42] Minor fixes based on trying this out on a new machine --- README.md | 17 ++++++++++++----- 1 file changed, 12 insertions(+), 5 deletions(-) diff --git a/README.md b/README.md index 49f30d7..9dbe641 100644 --- a/README.md +++ b/README.md @@ -51,23 +51,30 @@ root: Base os: Ubuntu 20.04.03 ``` +# apt-get update # apt-get install docker.io # curl -LO https://storage.googleapis.com/minikube/releases/latest/minikube-linux-amd64 # install minikube-linux-amd64 /usr/local/bin/minikube -# apt-get install contrack # because: ❌ Exiting due to GUEST_MISSING_CONNTRACK: Sorry, Kubernetes 1.22.1 requires conntrack to be installed in root's path +# apt-get install conntrack # because: ❌ Exiting due to GUEST_MISSING_CONNTRACK: Sorry, Kubernetes 1.22.1 requires conntrack to be installed in root's path -# minikube --driver=none # because i am root in a VM. otherwise apparently driver=docker might be nice? I haven't tried +# minikube start --driver=none # because i am root in a VM. otherwise apparently driver=docker might be nice? I haven't tried ``` -Now can run to see running pods +Now can run this to see some running pods. The `kube-system` namespace is for kubernetes system related pods. Later on, when funcx is installed, this command should also show funcx-related pods in the `default` namespace. 
``` # minikube kubectl -- get pods -A +NAMESPACE NAME READY STATUS RESTARTS AGE +kube-system coredns-64897985d-9qhvm 1/1 Running 0 5s +kube-system etcd-ubuntu-4gb-hel1-1 1/1 Running 0 16s +kube-system kube-apiserver-ubuntu-4gb-hel1-1 1/1 Running 0 16s +kube-system kube-controller-manager-ubuntu-4gb-hel1-1 1/1 Running 0 18s +kube-system kube-proxy-cklqx 1/1 Running 0 5s +kube-system kube-scheduler-ubuntu-4gb-hel1-1 1/1 Running 0 18s +kube-system storage-provisioner 1/1 Running 0 15s ``` -This gives me 7 runnings k8s pods. - Now install helm, following debian/apt instructions here: https://helm.sh/docs/intro/install/ ``` From 4effdfd99c19cea857768eb3a564986ca3c515ab Mon Sep 17 00:00:00 2001 From: Ben Clifford Date: Mon, 14 Feb 2022 16:25:09 +0000 Subject: [PATCH 33/42] Add another shorter way to get funcx sdk tokens --- README.md | 43 ++++++++++++++++++++++++++++++++++++++++--- 1 file changed, 40 insertions(+), 3 deletions(-) diff --git a/README.md b/README.md index 9dbe641..57fced0 100644 --- a/README.md +++ b/README.md @@ -97,12 +97,13 @@ TODO: is there a "hello world" style helm+kubernetes validation that could be ru that we can say "you need helm+kubernetes at least good enough to do this:" -Now cloen the funcx helm repo (which you are reading this document from, perhaps): +Now clone the funcx helm repo (which you are reading this document from, perhaps). +You can also use the equivalent ssh-based URL. ``` mkdir src -cd src -git clone git@github.com:funcx-faas/helm-chart +cd src +git clone https://github.com/funcx-faas/helm-chart cd helm-chart ``` @@ -123,6 +124,9 @@ development) ### Creating endpoint secrets for a funcx endpoint in the K8s deployment +There are various awful ways to do this. Some of them are here. + + TODO: this is messy and not part of the service install so i'm not sure if it should happen here or as part of the configuration section? @@ -146,6 +150,39 @@ but how then did I get the token out? 
You will be prompted to follow the authorization link and paste the resulting token into the console. Once you do that, funcx-endpoint will create a `~/.funcx` directory and provide you with a token file. +=== + +here's another way + +``` +# docker run --rm -ti funcx/kube-endpoint:main bash -l +$ python3 -c "import funcx ; funcx.FuncXClient()" +Please paste the following URL in a browser: +https://auth.globus.org/v2/oauth2/authorize?client_id=..... +``` + +visit url + +paste code + +press enter + +``` +$ cat .funcx/credentials/funcx_sdk_tokens.json +{ + "auth.globus.org": { + "scope": "openid", + "access_token": +``` + +Copy that file (eg via clipboard) somewhere safe and make it available +in your minikube shell. + +=== + +after getting the funcx_sdk_tokens.json file by hook or by crook, do this: + + The Kubernetes endpoint expects this file to be available as a Kubernetes secret named `funcx-sdk-tokens`. From 8e7afee620fbdc34fea53a4ea9bb1949a9f7865d Mon Sep 17 00:00:00 2001 From: Ben Clifford Date: Mon, 14 Feb 2022 17:52:02 +0000 Subject: [PATCH 34/42] More on endpoint images and ingress --- README.md | 46 +++++++++++++++++++++++++++++++++++++++------- 1 file changed, 39 insertions(+), 7 deletions(-) diff --git a/README.md b/README.md index 57fced0..f7160c9 100644 --- a/README.md +++ b/README.md @@ -208,9 +208,11 @@ kubectl create secret generic funcx-sdk-tokens \ be copy-pasting a huge example?] [TODO: paragraph desribing what values.yaml will do] +3a. mkdir deployed_values/ -3a. Obtain Globus Client ID and Secret. Get the credentials by asking on the - `dev` funcx Slack channel. +3b. Obtain Globus Client ID and Secret for funcX. Get the credentials by asking on the + `dev` funcx Slack channel. This is distinct from the funcx credentials needed for the + endpoint, acquired in the previous section. 
Once you have your credentials, paste them into your `values.yaml`: ```yaml @@ -239,6 +241,25 @@ kubectl create secret generic funcx-sdk-tokens \ endpointUUID: XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX ``` +3c Fix the funcx_endpoint image. By default this will install an image tag `main` which is +hopelessly out of date and will result in obscure errors. [TODO: fix the endpoint default +in endpoint chart, and remove the `main` tag image] + +Instead, override the image tag with an explicit python version. [TODO: it seems quite confused +about what versions of python in each stage will work with each other. For now, try using a +tag: main-3.9, like this: + +``` +funcx_endpoint: + endpointUUID: XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX + image: + pullPolicy: Always + tag: main-3.9 + +``` + +] + 5. Install the helm chart: ```shell script helm install -f deployed_values/values.yaml funcx funcx @@ -320,9 +341,11 @@ serves the `public` ingress class). [TODO: i need to write the exact commands fo 2. Get a hostname that your kubernetes install is accessible under. -Depending on your development environment, this might be the public hostname of your -kubernetes server, or it might be an entry in `/etc/hosts` pointing to 127.0.0.1. -Maybe even `localhost` works in that case. +You can use `localhost` if you are running your client code on your local machine +too. + +Otherwise, figure out (using IP networking skills not described in this document) +how you will address and name your kubernetes host. 3. Enable ingress in the funcx install @@ -332,7 +355,7 @@ host name from step 2. ``` ingress: enabled: true - host: amber.cqx.ltd.uk + host: amber.cqx.ltd.uk # <- this is the hostname you chose in step 2 ``` 4. Redeploy funcx @@ -341,12 +364,21 @@ ingress: helm upgrade --atomic -f deployed_values/values.yaml funcx funcx ``` +5. 
You should now see the ingress definition in kubernetes: + +``` +# kubectl get ingress +NAME CLASS HOSTS ADDRESS PORTS AGE +funcx-funcx-ingress amber.cqx.ltd.uk 80 11d + +``` + ### Connecting clients Create a `FuncXClient` instance pointing at your install, by specifying the funcx_service_address, ``` -fxc = FuncXClient(funcx_service_address="http://amber.cqx.ltd.uk:8000/v2") +fxc = FuncXClient(funcx_service_address="http://localhost/v2") # <- this is also the hostname you chose in step 2 ``` and by specifying your endpoint UUID (generated earlier) when invoking a function. From 40dd287d84ab731fab642fd65a3a55b03b11b781 Mon Sep 17 00:00:00 2001 From: Ben Clifford Date: Mon, 14 Feb 2022 19:10:59 +0000 Subject: [PATCH 35/42] Tidy endpoint secrets section --- README.md | 45 ++++++++++++--------------------------------- 1 file changed, 12 insertions(+), 33 deletions(-) diff --git a/README.md b/README.md index f7160c9..4d72d1f 100644 --- a/README.md +++ b/README.md @@ -126,36 +126,10 @@ development) There are various awful ways to do this. Some of them are here. +* Ugly method 1 -TODO: this is messy and not part of the service install so i'm not sure -if it should happen here or as part of the configuration section? - -We can deploy the kubernetes endpoint as a pod as part of the chart. It -needs to have a valid copy of the funcx's `funcx_sdk_tokens.json` which can -be created by running on your local workstation and running -```shell script - funcx-endpoint start ``` - -(or if I don't have this on my local workstation because I'm deploying purely -inside kubernetes...? can I run this as a command inside minikube - eg after I've -got this running? -yes, with lots of error messages to ignore, I did this: - -docker run --rm -ti funcx/kube-endpoint:main /home/funcx/boot.sh funcx - -but how then did I get the token out? -) - -You will be prompted to follow the authorization link and paste the resulting -token into the console. 
Once you do that, funcx-endpoint will create a -`~/.funcx` directory and provide you with a token file. -=== - -here's another way - -``` -# docker run --rm -ti funcx/kube-endpoint:main bash -l +# docker run --rm -ti funcx/kube-endpoint:main-3.9 bash -l $ python3 -c "import funcx ; funcx.FuncXClient()" Please paste the following URL in a browser: https://auth.globus.org/v2/oauth2/authorize?client_id=..... @@ -172,20 +146,25 @@ $ cat .funcx/credentials/funcx_sdk_tokens.json { "auth.globus.org": { "scope": "openid", - "access_token": + "access_token": ... ``` Copy that file (eg via clipboard) somewhere safe and make it available in your minikube shell. -=== +* Ugly method 2 -after getting the funcx_sdk_tokens.json file by hook or by crook, do this: +If you have used funcx (endpoint or submit side) elsewhere, you can probably +find suitable tokens in ~/.funcx/credentials/funcx_sdk_tokens.json in that +environment. +After getting the funcx_sdk_tokens.json file by hook or by crook, copy it +to a file called `funcx_sdk_tokens.json` - the name of this file is +important as it will be used inside the created secret. Don't be +tempted to call it `tmp.json` for example. -The Kubernetes endpoint expects this file to be available as a Kubernetes -secret named `funcx-sdk-tokens`. +Now create the kubernetes secret like this: You can install this secret with: ```shell script From aa87c2185d384f52bfc881f387c120248b3d6c53 Mon Sep 17 00:00:00 2001 From: Ben Clifford Date: Mon, 14 Feb 2022 20:32:46 +0000 Subject: [PATCH 36/42] Mostly move my random notes to the end of the doc, out of the way of the normal flow --- README.md | 695 ++++++++++++++++++++++++++---------------------------- 1 file changed, 331 insertions(+), 364 deletions(-) diff --git a/README.md b/README.md index 4d72d1f..27de456 100644 --- a/README.md +++ b/README.md @@ -267,27 +267,7 @@ logs for its ID. 7. You can access your funcX services through the ingress or via a port forward to the web service pod. 
For port forwarding, instructions are provided in the displayed notes. -[ingress should become the official way, and be better documented, but there is less -experience with it and it needs an ingress controller...] - - -[clarify which logs / *where* those logs are? explicitly which (3?) logs to -look at... - who is registering with whom?] - -Endpoint log will look like: -``` -2021-09-14 16:36:02 endpoint.endpoint_manager:172 [INFO] Starting endpoint with uuid: cfd389f3-4eda-413b-af95-4d54a8e944dc -``` - -forwarder will look like: -``` -{"asctime": "2021-09-14 18:20:44,535", "name": "funcx_forwarder.forwarder", "levelname": "DEBUG", "message": "endpoint_status_message", "log_type": "endpoint_status_message", "endpoint_id": "cfd389f3-4eda-413b-af95-4d54a8e944dc", "endpoint_status_message": {"_payload": null, "_header": "b'\\xcf\\xd3\\x89\\xf3N\\xdaA;\\xaf\\x95MT\\xa8\\xe9D\\xdc'", "ep_status": {"task_id": -2, "info": {"total_cores": 0, "total_mem": 0, "new_core_hrs": 0, "total_core_hrs": 0, "managers": 0, "active_managers": 0, "total_workers": 0, "idle_workers": 0, "pending_tasks": 0, "outstanding_tasks": {}, "worker_mode": "no_container", "scheduler_mode": "hard", "scaling_enabled": true, "mem_per_worker": null, "cores_per_worker": 1.0, "prefetch_capacity": 10, "max_blocks": 100, "min_blocks": 1, "max_workers_per_node": 1, "nodes_per_block": 1}}, "task_statuses": {}}} -``` - -web service will look like: -``` -{"asctime": "2021-09-14 16:36:03,273", "name": "funcx_web_service", "levelname": "INFO", "message": "Successfully registered cfd389f3-4eda-413b-af95-4d54a8e944dc in database"} -``` +Ingress configuration is described in the following section. ## Exposing FuncX to external clients and endpoints @@ -383,144 +363,6 @@ print(fxc.get_result(result)) If you have got this far, then you have successfully installed the current version of funcx, and can begin to hack. - -[TODO: at this point I got into a lot of tangle with default Python versions -not matching up. 
We've subsequently talked and fiddled with this - so check -what will happen. In the meantime skip these notes: - -this gets as far as submitting for me, but attempts to get the result always give -funcx.utils.errors.TaskPending: Task is pending due to waiting-for-nodes - -There's a pod started up ok - called funcx-1631645705407 without any more interesting name. i guess thats a worker? after 106s all that is in the logs is a warning from pip running as root -- but i can see pip running. - -So i should put a note here about how long things might take here? -Note that it has changes from waiting-for-ep in the error to waiting-for-nodes -after several minutes, so there is more stuff going on, slowly... 5m later and its still churning. it's had one restart 63s ago... nothing clear about *why* it restarted though. -in kubectl describe pod XXXXX I can see the commandline - a pip install and then a funcx-manager. -It seems to end with: PROCESS_WORKER_POOL main event loop exiting normally - -So lets debug a bit more about where the task execution happens or not. - -Is this doing a pip install on every restart (?!) - that's a question to ask. -(maybe it's not actually installing new stuff though - which is why i'm not seeing any -packages being installed on subsequent runs) - -Eventually it went into "CrashLoopBackOff" at the kubernetes level, which maybe isn't the right behaviour for "exiting normal" at the PROCESS_WORKER_POOL level? Ask on chat about that. 
- -There's nothing in the endpoint logs about starting up that funcx process worker container, or about jobs happening - just every 600s a keepalive message - -Digging into the endpoint container environment, find ~/.funcx/funcx/EndpointInterchange.log -which is reporting a sequence of errors: - -2021-09-15 14:26:55.592 funcx_endpoint.executors.high_throughput.executor:540 [WARNING] [MTHR -EAD] Executor shutting down due to version mismatch in interchange -2021-09-15 14:26:55.610 funcx_endpoint.executors.high_throughput.executor:542 [ERROR] [MTHREA -D] Exception: Task failure due to loss of manager b'18e00d57935c' -NoneType: None -2021-09-15 14:26:55.610 funcx_endpoint.executors.high_throughput.executor:577 [INFO] [MTHREAD -] queue management worker finished - -then every 10ms this message *forever* 2021-09-15 14:26:55.613 funcx_endpoint:504 [ERROR] [MAIN] Something broke while forwarding re -sults from executor to forwarder queues -Traceback (most recent call last): - File "/usr/local/lib/python3.7/site-packages/funcx_endpoint/endpoint/interchange.py", line 4 -90, in _main_loop - results = self.results_passthrough.get(False, 0.01) - File "/usr/local/lib/python3.7/multiprocessing/queues.py", line 108, in get - res = self._recv_bytes() - File "/usr/local/lib/python3.7/multiprocessing/connection.py", line 216, in recv_bytes - buf = self._recv_bytes(maxlength) - File "/usr/local/lib/python3.7/multiprocessing/connection.py", line 407, in _recv_bytes - buf = self._recv(4) - File "/usr/local/lib/python3.7/multiprocessing/connection.py", line 383, in _recv - raise EOFError -EOFError - -- that kind of failure should be resulting in a kubernetes level restart (or some other exit/restart) not a hang loop like this? -- mismatch of what? between who? is it the process worker pool container vs the funcx container? 
Looking at interchange.py - this might not even be from a version mismatch: it can happen if reg_flag is false (due to a json deserialisation problem in registration message). Other than that, it can happen because the python versions from the manager vs the interchange. - -I commented on these logs not being obvious, in slack, and ben g gave me: -> so for debugging, I added a value to the endpoint helm chart detachEndpoint -since the endpoint runs in a daemon, the output doesn’t show up in the pod’s logs.. Setting this to false means the endpoint runs in the main thread. Less reliable, but easy for debugging - -I haven't tried that yet. But if its good, then... if k8s endpoints are also expected for end users, maybe they should also get the same functionality? (eg why is this running as a daemon when its inside a pod anyway managing that?) - -whle the task on the client side still reports: -funcx.utils.errors.TaskPending: Task is pending due to waiting-for-ep - -At the same time, the process worker service repeatedly exits and is restarted by kubernetes (with it eventually hitting CrashLoopBackOff to slow this down) - presumably that's somehow opposite half of this same error message, but it isn't clear. The entire log file is: - -root@amber:~# minikube kubectl logs -- funcx-1631715998878 -WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv -PROCESS_WORKER_POOL main event loop exiting normally -root@amber:~# - -I found a more interesting log file here: -/home/funcx/.funcx/funcx/HighThroughputExecutor/worker_logs/8e8f66c705d3/manager.log - -which eventully reports a critical error that the interchange heartbeat is missing - not at all like the kubectl log error of: ... exiting normally. - -It's a bit weird to be 2 minutes in before the manager even notices that the interchange isn't even alive. 
- -So... the version mismatch: -This is invoking python:3.6-buster - so let's track down where that was. - -Selecting the correct image (eg for AWS AMIs, not docker images) has been a massive -usability problem for testing parsl on image based systems... I'm not sure how much it -matters in end-use though, if you're assuming that users make app-specific images that -are tied to their own environment? I don't have experience there. I haven't spent any -time seriously trying to solve this for parsl, but eg ZZ did container stuff for parsl -so I'd be interested to here any of his relevant experiences. not really a problem -i am interested in solving. - -so grep around in the source tree for python:3.6-buster - -The funcx endpoint helm chart is coming from a URL on funcx.org, not the helm-charts repo, -under here: http://funcx.org/funcx-helm-charts - -There's a 0.3 chart in the funcx-helm-charts repo by the looks of it - perhaps I can try that, by hacking at the server-side chart. What's the right way to be controlling this? - -While checking if process claims to be alive, the endpoint output line: -"funcx-endpoint process is still alive. Next check in 600s." -should be given a timestamp - it doesn't seem to go through the logging mechanism so -is not getting a timestamp that way. So I have no idea when it last ran, or if its -an outdated message, etc. - -This command to install the worker inside the worker container is installing funcx-endpoint and directing the output to a file called =0.2.0. That is probably not what is intended. Put the whole thing in ' marks perhaps. - - pip install funcx-endpoint>=0.2.0 - -This is in the funcx endpoint helm chart... along with the naughty command right next -to it: -workerImage and -workerInit -so I should be able to override those myself in my values.yaml? 
- -funcx_endpoint: - workerImage: python:3.7-buster - workerInit: 'pip install "funcx-endpoint>=0.2.0"' - -note that workerInit is embedded python string syntax, not a plain string, so -you can't use ' marks inside it because it is substituted in somewhere I think -and that causes a syntax error - eg try the above line with " and ' swapped -and see: -""" -File "/home/funcx/.funcx/funcx/config.py", line 24 - worker_init='pip install 'funcx-endpoint>=0.2.0'', - ^ -SyntaxError: invalid syntax -""" -because string substitution into source without proper escaping. - -This could be fixed - either by reading the string from a different place and -not doing python source substitution, or by performing escaping on the string. -This behaviour is likely to cause trouble to anyone doing non-trivial bash -in their worker_init. - -it's frustrating that the python version is not set to the version that -is actually used by the endpoint. -] - ### Forwarder Debugging > :warning: *Only for debugging*: You can set the forwarder curve server key manually by creating @@ -754,75 +596,360 @@ root@plsql:/# pgcli postgresql://funcx:leftfoot1@funcx-postgresql:5432/public ingress section above ] -# General benc grumbles to address -How does I upgrade this? It was installed using the latest images at install time I guess? -I see a funcx-web-service tag from 15h ago after running helm upgrade funcx funcx... -(imgage ID d21432c1525a) - so is that what is running now? looks like it pulled a new image when I rebooted the server (!) +## Making a release and deploying to the AWS clusters -That seems a bit chaotic. And how do I switch these to using my own source builds? +The following is an incomplete guide to making and deploying a new release onto our development or production clusters. +Here are the components that need updating as part of a release, in the order they should be updated +due to dependencies. 
Note that only components that have changes for release need to be updated and the +rest can safely be skipped: -python versions: - The endpoint repo funcX/Dockerfile-endpoint defaults to building with python 3.8. - That doesn't align with the python that is being supplied by the helm images. - So rebuild my client, specifying 3.7. +* funcx-forwarder + * Update version number + * merge above changes to main in a PR + * Create a branch off of main with the version number, e.g. 'v0.3.3'. + For dev releases, do alpha releases `v0.3.3a0` + * Ensure that the branch has the CI tests passing and the publish step working -Perhaps funcx/Dockerfile-endpoint should force the user to choose, rather than building -a likely invalid one. +* funcx-web-service + * Same steps as funcx-forwarder -The broad topic here is python versions are poorly co-ordinated as supplied - yes, they -have to align, but the defaults supplied should all align with each other at least. -(they need to align across all three of: the submitting client, the endpoint, the -endpoint workers, and all three are wrong by default) +* funcx-websocket-service + * Same steps as funcx-forwarder +* Update helm-charts + * Update the smoke-tests in the helm-charts to use the new version numbers in `conftest.py` -Python mismatch between interchange and worker pod results in an eternal hang: interchange reports inside its endpoint logs that there's a version mismatch (but not what the version mismathc is). The worker just restarts every couple of minutes with missing heartbeat: no description of *why* the heartbeat is missing. The end user is never informed of anything more than "waiting for nodes". It's unclear to me if that should be a richer message or a richer hard error: could tell user that the worker version is wrong at least, becaues that is likely an error that won't fix itself (i.e. is not a transient error). 
Richer error here would help the submitting user understand that they need to contact the endpoint administrator for rectification and what to tell them, beyond "waiting for nodes". +* Prepare to deploy to cluster. + * Confirm that all the bits to be deployed are available on dockerhub. + * Run `kubectl config current-context` which should return something like: + >> arn:aws:eks:us-east-1:512084481048:cluster/funcx-dev -funcx endpoint worker pods lose their logs at each restart - which is awkward to examine when the logs are in an every-two-minutes restart loop due to missing heartbeat. debuggability might be enhanced by putting them in a pod-lifetime dir rather than a container-lifetime dir? [should they even be autorestarting in that situation, rather than letting the endpoint handle restarting them if it still wants them? - c.f. parsl discussion about how kubernetes pods are managed by the parsl kubernetes provider?] + * Make sure the right cluster is pointed to by kubectl, and use this terminal for all following steps. +* Download the current values deployed to the target cluster as a backup. Note: you can use this as a base values.yaml. + >> helm get values funcx > environment.yaml +* Update the values to use the release branch names as the new tags -## Upgrading and developing against non-main environments +* Deploy with: + >> helm upgrade -f prod-values.yaml funcx funcx -i've been trying this but it's not clear that it is pulling down the latest of -everything: (I think it does, but just the images are not changing often -upstream from me?) - helm upgrade -f deployed_values/values.yaml funcx funcx --recreate-pods +> :warning: It is preferable to upgrade rather than blow away the current deployment and redeploy + because wiping the current deployment loses state that ties the Route53 entries to point at + the ALB, and any configuration on the ALB itself could be lost. 
-## Python version notes +> :warning: If the deployment was wiped here are the steps: + * Go to Route53 on AWS Console and select the hosted zone: `dev.funcx.org`. Select the + appropriate A record for the deployment you are updating and edit the record to update the + value to something like `dualstack.k8s-default-funcxfun-dd14845f35-608065658.us-east-1.elb.amazonaws.com.` + * Add the ALB to the existing WAF Rules here: `https://console.aws.amazon.com/wafv2/homev2/web-acl/funcx-prod-web-acl/d82023f9-2cd8-4aed-b8e3-460dd399f4b0/overview?region=us-east-1#` -this is important for not having errors, so probably should be near the top. -there are three different places where the python interpreter must be the same -major version (eg all 3.7). the tooling as it is now does not make that the -case by default - [TODO: make it so] +* While a new forwarder will be launched on upgrade, the new one will not go online + since it requires the ports that are in use by the older one. So you must manually + delete the older funcx-forwarder pod. -- the endpoint worker -- the endpoint -- the submitting user python env + >> kubectl get pods + \# Find the older funcx-forwarder pod -TODO: make this consistent for a first time install experience: either describe -how to configure it in all three places, or make the defaults/documented -command lines make that happen. + >> kubctl delete pods \ -The setup as I came to it is installing 3 different incompatible python version -without telling me not to. -## Port notes +## Deploy a temporary k8s deployment in the dev cluster -nmap of amber.cqx.ltd.uk: +It is occasionally useful to deploy a full FuncX stack in the dev cluster under +a different namespace. This is useful when two developers are both working on +or debugging a feature as well as to verify a feature works as expected before +potentially deploying to the main dev environment deployment. 
These +instructions will get a second FuncX deployment (with k8s-based redis, +postgres, and rabbitmq) running at a specified host under `*.api.dev.funcx.org`. -``` -$ nmap amber.cqx.ltd.uk -p- -4 +* To avoid forwarder port conflicts, ensure at least as many nodes are running + in EKS as there will be forwarder deployments since forwarders rely on host + ports to be addressable. To scale the node group you can use `eksctl scale + nodegroup --cluster=funcx-dev --name=funcx-dev-node-group --nodes=2 + --nodes-max=2` where `nodes-max` and `nodes` are set to as many as are needed. +* Create a new namespace for your deployment: e.g. `kubectl create namespace josh-funcx` +* Create a `values.yaml` that includes information about the host name to use + in the ingress definition. E.g.: + ingress: + enabled: true + host: josh-test.dev.funcx.org + name: dev-lb + subnets: subnet-0c0d6b32bb57c39b2, subnet-0906da1c44cbe3b8d + use_alb: true +* Install the helm chart as described above, but specifying the new `values.yaml` file + and the namespace. E.g.: `helm install -f deployed_values/values.yaml josh-funcx funcx --namespace` +* Create a new route53 record for the given host (josh-test.dev.funcx.org). + We won't have to do this after [external dns](https://kubernetes-sigs.github.io/aws-load-balancer-controller/v2.2/guide/integrations/external_dns/) has been enabled. -Starting Nmap 7.40 ( https://nmap.org ) at 2021-09-20 19:41 UTC -Nmap scan report for amber.cqx.ltd.uk (65.108.55.218) -Host is up (0.055s latency). -Other addresses for amber.cqx.ltd.uk (not scanned): 2a01:4f9:c010:e030::1 -Not shown: 65522 closed ports -PORT STATE SERVICE + +## Advanced option: Deploying a funcx-endpoint outside of K8s + +The above notes installed a funcx endpoint inside kubernetes, alongside the funcx services. +In real life, end users would install funcx endpoints elsewhere (on their compute +resources) and attach them to the official funcx services. 
+ +It is also possible to install an endpoint elsewhere and attach it to services +deployed by this chart for dev purposes. + +--- +**NOTE** + +This only works on Linux systems. + +--- + +Here are the steps to install, preferably into your active conda environment: + +```shell script +git clone https://github.com/funcx-faas/funcX.git +cd funcX +git checkout main +pip install funcx_sdk +pip install funcx_endpoint +``` + +Next create an endpoint configuration: + +```shell script +funcx-endpoint +``` + +Update the endpoint's configuration file to point the endpoint to locally +deployed services, which we will setup in the next sections. If using default +values, the funcx_service_address should be set to http://localhost:5000/v2. + +`~/.funcx/default/config.py` + +```python + config = Config( + executors=[HighThroughputExecutor( + provider=LocalProvider( + init_blocks=1, + min_blocks=0, + max_blocks=1, + ), + )], + funcx_service_address="http://127.0.0.1:5000/api/v1", # <--- UPDATE THIS LINE +) +``` + + + +## See also + +More notes in the local_dev/ subdirectory that should be merged into this file + + + +# Assorted benc notes + +this gets as far as submitting for me, but attempts to get the result always give +funcx.utils.errors.TaskPending: Task is pending due to waiting-for-nodes + +There's a pod started up ok - called funcx-1631645705407 without any more interesting name. i guess thats a worker? after 106s all that is in the logs is a warning from pip running as root +- but i can see pip running. + +So i should put a note here about how long things might take here? +Note that it has changes from waiting-for-ep in the error to waiting-for-nodes +after several minutes, so there is more stuff going on, slowly... 5m later and its still churning. it's had one restart 63s ago... nothing clear about *why* it restarted though. +in kubectl describe pod XXXXX I can see the commandline - a pip install and then a funcx-manager. 
+It seems to end with: PROCESS_WORKER_POOL main event loop exiting normally + +So lets debug a bit more about where the task execution happens or not. + +Is this doing a pip install on every restart (?!) - that's a question to ask. +(maybe it's not actually installing new stuff though - which is why i'm not seeing any +packages being installed on subsequent runs) + +Eventually it went into "CrashLoopBackOff" at the kubernetes level, which maybe isn't the right behaviour for "exiting normal" at the PROCESS_WORKER_POOL level? Ask on chat about that. + +There's nothing in the endpoint logs about starting up that funcx process worker container, or about jobs happening - just every 600s a keepalive message + +Digging into the endpoint container environment, find ~/.funcx/funcx/EndpointInterchange.log +which is reporting a sequence of errors: + +2021-09-15 14:26:55.592 funcx_endpoint.executors.high_throughput.executor:540 [WARNING] [MTHR +EAD] Executor shutting down due to version mismatch in interchange +2021-09-15 14:26:55.610 funcx_endpoint.executors.high_throughput.executor:542 [ERROR] [MTHREA +D] Exception: Task failure due to loss of manager b'18e00d57935c' +NoneType: None +2021-09-15 14:26:55.610 funcx_endpoint.executors.high_throughput.executor:577 [INFO] [MTHREAD +] queue management worker finished + +then every 10ms this message *forever* 2021-09-15 14:26:55.613 funcx_endpoint:504 [ERROR] [MAIN] Something broke while forwarding re +sults from executor to forwarder queues +Traceback (most recent call last): + File "/usr/local/lib/python3.7/site-packages/funcx_endpoint/endpoint/interchange.py", line 4 +90, in _main_loop + results = self.results_passthrough.get(False, 0.01) + File "/usr/local/lib/python3.7/multiprocessing/queues.py", line 108, in get + res = self._recv_bytes() + File "/usr/local/lib/python3.7/multiprocessing/connection.py", line 216, in recv_bytes + buf = self._recv_bytes(maxlength) + File "/usr/local/lib/python3.7/multiprocessing/connection.py", 
line 407, in _recv_bytes + buf = self._recv(4) + File "/usr/local/lib/python3.7/multiprocessing/connection.py", line 383, in _recv + raise EOFError +EOFError + +- that kind of failure should be resulting in a kubernetes level restart (or some other exit/restart) not a hang loop like this? +- mismatch of what? between who? is it the process worker pool container vs the funcx container? Looking at interchange.py - this might not even be from a version mismatch: it can happen if reg_flag is false (due to a json deserialisation problem in registration message). Other than that, it can happen because the python versions from the manager vs the interchange. + +I commented on these logs not being obvious, in slack, and ben g gave me: +> so for debugging, I added a value to the endpoint helm chart detachEndpoint -since the endpoint runs in a daemon, the output doesn’t show up in the pod’s logs.. Setting this to false means the endpoint runs in the main thread. Less reliable, but easy for debugging + +I haven't tried that yet. But if its good, then... if k8s endpoints are also expected for end users, maybe they should also get the same functionality? (eg why is this running as a daemon when its inside a pod anyway managing that?) + +whle the task on the client side still reports: +funcx.utils.errors.TaskPending: Task is pending due to waiting-for-ep + +At the same time, the process worker service repeatedly exits and is restarted by kubernetes (with it eventually hitting CrashLoopBackOff to slow this down) - presumably that's somehow opposite half of this same error message, but it isn't clear. The entire log file is: + +root@amber:~# minikube kubectl logs -- funcx-1631715998878 +WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. 
It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv +PROCESS_WORKER_POOL main event loop exiting normally +root@amber:~# + +I found a more interesting log file here: +/home/funcx/.funcx/funcx/HighThroughputExecutor/worker_logs/8e8f66c705d3/manager.log + +which eventully reports a critical error that the interchange heartbeat is missing - not at all like the kubectl log error of: ... exiting normally. + +It's a bit weird to be 2 minutes in before the manager even notices that the interchange isn't even alive. + +So... the version mismatch: +This is invoking python:3.6-buster - so let's track down where that was. + +Selecting the correct image (eg for AWS AMIs, not docker images) has been a massive +usability problem for testing parsl on image based systems... I'm not sure how much it +matters in end-use though, if you're assuming that users make app-specific images that +are tied to their own environment? I don't have experience there. I haven't spent any +time seriously trying to solve this for parsl, but eg ZZ did container stuff for parsl +so I'd be interested to here any of his relevant experiences. not really a problem +i am interested in solving. + +so grep around in the source tree for python:3.6-buster + +The funcx endpoint helm chart is coming from a URL on funcx.org, not the helm-charts repo, +under here: http://funcx.org/funcx-helm-charts + +There's a 0.3 chart in the funcx-helm-charts repo by the looks of it - perhaps I can try that, by hacking at the server-side chart. What's the right way to be controlling this? + +While checking if process claims to be alive, the endpoint output line: +"funcx-endpoint process is still alive. Next check in 600s." +should be given a timestamp - it doesn't seem to go through the logging mechanism so +is not getting a timestamp that way. So I have no idea when it last ran, or if its +an outdated message, etc. 
+ +This command to install the worker inside the worker container is installing funcx-endpoint and directing the output to a file called =0.2.0. That is probably not what is intended. Put the whole thing in ' marks perhaps. + + pip install funcx-endpoint>=0.2.0 + +This is in the funcx endpoint helm chart... along with the naughty command right next +to it: +workerImage and +workerInit +so I should be able to override those myself in my values.yaml? + +funcx_endpoint: + workerImage: python:3.7-buster + workerInit: 'pip install "funcx-endpoint>=0.2.0"' + +note that workerInit is embedded python string syntax, not a plain string, so +you can't use ' marks inside it because it is substituted in somewhere I think +and that causes a syntax error - eg try the above line with " and ' swapped +and see: +""" +File "/home/funcx/.funcx/funcx/config.py", line 24 + worker_init='pip install 'funcx-endpoint>=0.2.0'', + ^ +SyntaxError: invalid syntax +""" +because string substitution into source without proper escaping. + +This could be fixed - either by reading the string from a different place and +not doing python source substitution, or by performing escaping on the string. +This behaviour is likely to cause trouble to anyone doing non-trivial bash +in their worker_init. + +it's frustrating that the python version is not set to the version that +is actually used by the endpoint. +] + +=== + +How does I upgrade this? It was installed using the latest images at install time I guess? +I see a funcx-web-service tag from 15h ago after running helm upgrade funcx funcx... +(imgage ID d21432c1525a) - so is that what is running now? looks like it pulled a new image when I rebooted the server (!) + +That seems a bit chaotic. And how do I switch these to using my own source builds? + + +python versions: + The endpoint repo funcX/Dockerfile-endpoint defaults to building with python 3.8. + That doesn't align with the python that is being supplied by the helm images. 
+ So rebuild my client, specifying 3.7. + +Perhaps funcx/Dockerfile-endpoint should force the user to choose, rather than building +a likely invalid one. + +The broad topic here is python versions are poorly co-ordinated as supplied - yes, they +have to align, but the defaults supplied should all align with each other at least. +(they need to align across all three of: the submitting client, the endpoint, the +endpoint workers, and all three are wrong by default) + + +Python mismatch between interchange and worker pod results in an eternal hang: interchange reports inside its endpoint logs that there's a version mismatch (but not what the version mismathc is). The worker just restarts every couple of minutes with missing heartbeat: no description of *why* the heartbeat is missing. The end user is never informed of anything more than "waiting for nodes". It's unclear to me if that should be a richer message or a richer hard error: could tell user that the worker version is wrong at least, becaues that is likely an error that won't fix itself (i.e. is not a transient error). Richer error here would help the submitting user understand that they need to contact the endpoint administrator for rectification and what to tell them, beyond "waiting for nodes". + + +funcx endpoint worker pods lose their logs at each restart - which is awkward to examine when the logs are in an every-two-minutes restart loop due to missing heartbeat. debuggability might be enhanced by putting them in a pod-lifetime dir rather than a container-lifetime dir? [should they even be autorestarting in that situation, rather than letting the endpoint handle restarting them if it still wants them? - c.f. parsl discussion about how kubernetes pods are managed by the parsl kubernetes provider?] 
+ + + +## Upgrading and developing against non-main environments + +i've been trying this but it's not clear that it is pulling down the latest of +everything: (I think it does, but just the images are not changing often +upstream from me?) + helm upgrade -f deployed_values/values.yaml funcx funcx --recreate-pods + +## Python version notes + +this is important for not having errors, so probably should be near the top. + +there are three different places where the python interpreter must be the same +major.minor version (eg all 3.7). the tooling as it is now does not make that the +case by default - [TODO: make it so] + +- the endpoint worker +- the endpoint +- the submitting user python env + +TODO: make this consistent for a first time install experience: either describe +how to configure it in all three places, or make the defaults/documented +command lines make that happen. + +The setup as I came to it is installing 3 different incompatible python versions +without telling me not to. + +## Port notes + +nmap of amber.cqx.ltd.uk: + +``` +$ nmap amber.cqx.ltd.uk -p- -4 + +Starting Nmap 7.40 ( https://nmap.org ) at 2021-09-20 19:41 UTC +Nmap scan report for amber.cqx.ltd.uk (65.108.55.218) +Host is up (0.055s latency). +Other addresses for amber.cqx.ltd.uk (not scanned): 2a01:4f9:c010:e030::1 +Not shown: 65522 closed ports +PORT STATE SERVICE 22/tcp open ssh 2379/tcp open etcd-client 2380/tcp open etcd-server @@ -886,19 +1013,6 @@ the web(sockets) ports and the forwarder ports? On the production system are th made fully public in different ways? - -### nice ways to get endpoint in k8s cluster ofr devs - -eg make it consistent every time over restarts rather than random each time? - -or output it somewhere that cna be read programmatically by clients? - -TODO: best common practice is probably to generate an endpoint ID and hard -configure it right from the start, as suggested by ryan.
That eliminates the -need for fiddling in the logs to discover the random endpoint ID each time. -Be very clear that this needs to be unique. - - ## web-service build (and check others?) has a requiremenets.txt which installs from git funcx api main @@ -967,150 +1081,3 @@ https://github.com/funcx-faas/funcX/issues/601 (broken k8s worker pods accumulat -## Making a release and deploying to the AWS clusters - -The following is an incomplete guide to making and deploying a new release onto our development or production clusters. - -Here are the components that need updating as part of a release, in the order they should be updated -due to dependencies. Note that only components that have changes for release need to updated and the -rest can safely be skipped: - -* funcx-forwarder - * Update version number - * merge above changes to main in a PR - * Create a branch off of main with the version number, for, eg: 'v0.3.3'. - For dev releases, do alpha releases `v0.3.3a0` - * Ensure that the branch has the CI tests passing and the publish step working - -* funcx-web-service - * Same steps as funcx-forwarder - -* funcx-websocket-service - * Same steps as funcx-websocket-service - -* Update helm-charts - * Update the smoke-tests in the helm-charts to use the new version numbers in `conftest.py` - -* Prepare to deploy to cluster. - * Confirm that all the bits to be deployed should be available on dockerhub. - * Run `kubectl config current-context` which should return something like: - - >> arn:aws:eks:us-east-1:512084481048:cluster/funcx-dev - - * Make sure the right cluster is pointed to by kubectl, and use this terminal for all following steps. - -* Download the current values deployed to the target cluster as a backup. Note: you can use this as a base values.yaml. 
- >> helm get values funcx > enviornment.yaml - -* Update the values to use the release branchnames as the new tags - -* Deploy with: - >> helm upgrade -f prod-values.yaml funcx funcx - -> :warning: It is preferable to upgrade rather than blow away the current deployment and redeploy - because, wiping the current deployment loses state that ties the Route53 entries to point at - the ALB, and any configuration on the ALB itself could be lost. - -> :warning: If the deployment was wiped here are the steps: - * Go to Route53 on AWS Console and select the hosted zone: `dev.funcx.org`. Select the - appropriate A record for the deployment you are updating and edit the record to update the - value to something like `dualstack.k8s-default-funcxfun-dd14845f35-608065658.us-east-1.elb.amazonaws.com.` - * Add the ALB to the existing WAF Rules here: `https://console.aws.amazon.com/wafv2/homev2/web-acl/funcx-prod-web-acl/d82023f9-2cd8-4aed-b8e3-460dd399f4b0/overview?region=us-east-1#` - - -* While a new forwarder will be launched on upgrade, the new one will not go online - since it requires the ports that are in use by the older one. So you must manually - delete the older funcx-forwarder pod. - - >> kubectl get pods - \# Find the older funcx-forwarder pod - - >> kubctl delete pods \ - - -## Deploy a temporary k8s deployment in the dev cluster - -It is occasionally useful to deploy a full FuncX stack in the dev cluster under -a different namespace. This is useful when two developers are both working on -or debugging a feature as well as to verify a feature works as expected before -potentially deploying to the main dev environment deployment. These -instructions will get a second FuncX deployment (with k8s based redis, -postgres, and rabbitmq) running at a specified host under `*.api.dev.funcx.org`. 
- -* To avoid forwarder port conflicts, ensure at least as many nodes are running - in EKS as there will be forwarder deployments since forwarders rely on host - ports to be addressable. To scale the node group you can use `eksctl scale - nodegroup --cluster=funcx-dev --name=funcx-dev-node-group --nodes=2 - --nodes-max=2` where `nodes-max` and `nodes` are set to as many as are needed. -* Create a new namespace for your deployment: e.g. `kubectl create namespace josh-funcx` -* Create a `values.yaml` that includes information about the host name to use - in the ingress definition. E.g.: - ingress: - enabled: true - host: josh-test.dev.funcx.org - name: dev-lb - subnets: subnet-0c0d6b32bb57c39b2, subnet-0906da1c44cbe3b8d - use_alb: true -* Install the helm chart as described above, but specifying the new `values.yaml` file - and the namespace. E.g.: `helm install -f deployed_values/values.yaml josh-funcx funcx --namespace` -* Create a new route53 record for the given host (josh-test.dev.funcx.org). - We won't have to do this after [external dns](https://kubernetes-sigs.github.io/aws-load-balancer-controller/v2.2/guide/integrations/external_dns/) has been enabled. - - -## Advanced option: Deploying funcx-endpoint outside of K8s [this is "advanced" - move to end of doc, and crossref with other "install an endpoint" document] - -The above noteis installed a funcx endpoint inside kubernetes, alongside the funcx services. -In real life, end users would install funcx endpoints elsewhere (on their compute -resources) and attach them to the officially funcx services. - -It is also possible to install an endpoint elsewhere and attach it to services -deployed by this chart for dev purposes. - ---- -**NOTE** - -This only works on Linux systems. 
- ---- - -Here are the steps to install, preferably into your active conda environment: - -```shell script -git clone https://github.com/funcx-faas/funcX.git -cd funcX -git checkout main -pip install funcx_sdk -pip install funcx_endpoint -``` - -Next create an endpoint configuration: - -```shell script -funcx-endpoint -``` - -Update the endpoint's configuration file to point the endpoint to locally -deployed services, which we will setup in the next sections. If using default -values, the funcx_service_address should be set to http://localhost:5000/v2. - -`~/.funcx/default/config.py` - -```python - config = Config( - executors=[HighThroughputExecutor( - provider=LocalProvider( - init_blocks=1, - min_blocks=0, - max_blocks=1, - ), - )], - funcx_service_address="http://127.0.0.1:5000/api/v1", # <--- UPDATE THIS LINE -) -``` - - - -## See also - -More notes in the local_dev/ subdirectory that should be merged into this file - From 4c30f45e4998a495ae19060594b58263d9c05a36 Mon Sep 17 00:00:00 2001 From: Ben Clifford Date: Tue, 15 Feb 2022 13:10:33 +0000 Subject: [PATCH 37/42] Move cloudformation update instruction around --- README.md | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/README.md b/README.md index 27de456..ca52145 100644 --- a/README.md +++ b/README.md @@ -174,8 +174,6 @@ kubectl create secret generic funcx-sdk-tokens \ ## Installing FuncX [TODO: "central services"? what's the right title vs the endpoint and client?] -0. Update cloudformation stack if necessary [TODO: I think this is only for production deployment? ask Josh. In which case, ignore for personal dev cluster] - 1. Make a clone of this repository 2. Download subcharts: ```shell script @@ -601,6 +599,10 @@ ingress section above The following is an incomplete guide to making and deploying a new release onto our development or production clusters. 
+[TODO: this was moved from the basic funcX k8s install sequence, because i don't think it is part of that - only when installing using AWS which is a special case of production] + +0. Update cloudformation stack if necessary [TODO: I think this is only for production deployment? ask Josh. In which case, ignore for personal dev cluster] + Here are the components that need updating as part of a release, in the order they should be updated due to dependencies. Note that only components that have changes for release need to updated and the rest can safely be skipped: From e6d6e51aec0bac91c94f904772bf059e2a8bc0b0 Mon Sep 17 00:00:00 2001 From: Ben Clifford Date: Tue, 15 Feb 2022 14:01:42 +0000 Subject: [PATCH 38/42] update a note on cloudformation --- README.md | 9 ++++++++- 1 file changed, 8 insertions(+), 1 deletion(-) diff --git a/README.md b/README.md index ca52145..882862c 100644 --- a/README.md +++ b/README.md @@ -601,7 +601,14 @@ The following is an incomplete guide to making and deploying a new release onto [TODO: this was moved from the basic funcX k8s install sequence, because i don't think it is part of that - only when installing using AWS which is a special case of production] -0. Update cloudformation stack if necessary [TODO: I think this is only for production deployment? ask Josh. In which case, ignore for personal dev cluster] +0. Update cloudformation stack if necessary + +[TODO: I asked josh: +Correct, if you are deploying locally there is nothing to do with cloudformation, and most of the time you deploy to either prod or dev you shouldn't need to mess with the CF stack unless you are changing configuration of the cluster itself or AWS managed services like rabbit or rds. +] + + + Here are the components that need updating as part of a release, in the order they should be updated due to dependencies. 
Note that only components that have changes for release need to updated and the From 744eabb39df6aa88c4adf89914ffc1eb212e4ea9 Mon Sep 17 00:00:00 2001 From: Ben Clifford Date: Tue, 15 Feb 2022 19:42:33 +0000 Subject: [PATCH 39/42] Add i'm a developer what's next! --- README.md | 9 +++++++++ 1 file changed, 9 insertions(+) diff --git a/README.md b/README.md index 882862c..f11a462 100644 --- a/README.md +++ b/README.md @@ -361,6 +361,15 @@ print(fxc.get_result(result)) If you have got this far, then you have successfully installed the current version of funcx, and can begin to hack. +# I'm a developer what's next?! + +Here are a couple of links you could look at: + +* https://github.com/funcx-faas/funcX/blob/main/CONTRIBUTING.md + +* local_dev/README.md + + ### Forwarder Debugging > :warning: *Only for debugging*: You can set the forwarder curve server key manually by creating From 222553e2a51f2647a4fc6a6ef34366c62064d6f4 Mon Sep 17 00:00:00 2001 From: Ben Clifford Date: Tue, 15 Feb 2022 19:45:08 +0000 Subject: [PATCH 40/42] Add cheatsheets reference --- README.md | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/README.md b/README.md index f11a462..6f81e6d 100644 --- a/README.md +++ b/README.md @@ -367,8 +367,9 @@ Here are a couple of links you could look at: * https://github.com/funcx-faas/funcX/blob/main/CONTRIBUTING.md -* local_dev/README.md +* this repo local_dev/README.md +* this repo cheatsheets/ ### Forwarder Debugging From a3893e6d756241488c39ff901318f8d0fd82028e Mon Sep 17 00:00:00 2001 From: Ben Clifford Date: Tue, 15 Feb 2022 20:58:32 +0000 Subject: [PATCH 41/42] fix typo --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index 6f81e6d..07c3569 100644 --- a/README.md +++ b/README.md @@ -129,7 +129,7 @@ There are various awful ways to do this. Some of them are here.
* Ugly method 1 ``` -# docker run --rm -ti funcx/kube-endpoint:main3.9 bash -l +# docker run --rm -ti funcx/kube-endpoint:main-3.9 bash -l $ python3 -c "import funcx ; funcx.FuncXClient()" Please paste the following URL in a browser: https://auth.globus.org/v2/oauth2/authorize?client_id=..... From a3fc399ff3310846c44ac2fe0869edcc1712427c Mon Sep 17 00:00:00 2001 From: Ben Clifford Date: Fri, 18 Feb 2022 18:01:48 +0000 Subject: [PATCH 42/42] Add more notes from hands on experience --- README.md | 16 ++++++++++------ 1 file changed, 10 insertions(+), 6 deletions(-) diff --git a/README.md b/README.md index 07c3569..617a502 100644 --- a/README.md +++ b/README.md @@ -44,6 +44,8 @@ using one but we are pushing on that a bit. More later. ### Example install of hetzner cloud + minikube +TODO: minikube always/sometimes forgets its install on restart. + This shows how @benclifford installed minikube on a hetzner cloud node as root: @@ -239,7 +241,7 @@ funcx_endpoint: 5. Install the helm chart: ```shell script - helm install -f deployed_values/values.yaml funcx funcx + helm install --values deployed_values/values.yaml funcx funcx ``` 5b. @@ -298,8 +300,9 @@ serves the `public` ingress class). [TODO: i need to write the exact commands fo 2. Get a hostname that your kubernetes install is accessible under. -You can use `localhost` if you are running your client code on your local machine -too. +You can sometimes use `localhost` if you are running your client code on your local machine +too. ** WARNING ** ingress-nginx in minikube doesn't always listen on the same address +as "localhost" is bound to. Otherwise, figure out (using IP networking skills not described in this document) how you will address and name your kubernetes host. @@ -318,7 +321,7 @@ ingress: 4. Redeploy funcx ``` -helm upgrade --atomic -f deployed_values/values.yaml funcx funcx +helm upgrade --atomic --values deployed_values/values.yaml funcx funcx ``` 5. 
You should now see the ingress definition in kubernetes: @@ -330,6 +333,7 @@ funcx-funcx-ingress amber.cqx.ltd.uk 80 11d ``` + ### Connecting clients Create a `FuncXClient` instance pointing at your install, by specifying the funcx_service_address, @@ -654,7 +658,7 @@ rest can safely be skipped: * Update the values to use the release branchnames as the new tags * Deploy with: - >> helm upgrade -f prod-values.yaml funcx funcx + >> helm upgrade --values prod-values.yaml funcx funcx > :warning: It is preferable to upgrade rather than blow away the current deployment and redeploy because, wiping the current deployment loses state that ties the Route53 entries to point at @@ -701,7 +705,7 @@ postgres, and rabbitmq) running at a specified host under `*.api.dev.funcx.org`. subnets: subnet-0c0d6b32bb57c39b2, subnet-0906da1c44cbe3b8d use_alb: true * Install the helm chart as described above, but specifying the new `values.yaml` file - and the namespace. E.g.: `helm install -f deployed_values/values.yaml josh-funcx funcx --namespace` + and the namespace. E.g.: `helm install --values deployed_values/values.yaml josh-funcx funcx --namespace josh-funcx` * Create a new route53 record for the given host (josh-test.dev.funcx.org). We won't have to do this after [external dns](https://kubernetes-sigs.github.io/aws-load-balancer-controller/v2.2/guide/integrations/external_dns/) has been enabled.