7 changes: 7 additions & 0 deletions docs/build-deploy/cluster-configuration.md
@@ -0,0 +1,7 @@
# Cluster configuration

Cloud Harness deploys with Helm and uses an ingress controller to serve application endpoints.

The ingress controller is not installed as part of the application and has to be installed separately. The installation process differs depending on the Kubernetes provider.

See [here](../../infrastructure/cluster-configuration/README.md) for more information and resources for setting up your cluster.
36 changes: 34 additions & 2 deletions infrastructure/cluster-configuration/README.md
@@ -1,4 +1,36 @@
# Cluster configuration

1. Initialize kubectl credentials to work with your cluster (e.g minikube or google cloud)
1. Run `source cluster-init.sh`
## Simple setup on an existing cluster

### TL;DR
1. Create a Kubernetes cluster (e.g. minikube or Google Cloud)
1. Initialize kubectl credentials to work with your cluster
1. Run `source cluster-init.sh` (this script installs the ingress-nginx controller and cert-manager using Helm in the configured k8s cluster)

### Cert-manager

Follow [these](https://cert-manager.io/docs/installation/kubernetes/) instructions to deploy cert-manager.
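
After installation, an ACME issuer is needed for certificates to be requested. A minimal sketch follows, assuming a Let's Encrypt HTTP-01 issuer; the namespace and email are placeholders, and the name follows the `letsencrypt-<namespace>` convention that the CloudHarness ingress template references:

```yaml
apiVersion: cert-manager.io/v1
kind: Issuer
metadata:
  name: letsencrypt-mynamespace   # placeholder: letsencrypt-<deployment namespace>
  namespace: mynamespace          # placeholder namespace
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: admin@mydomain.com     # placeholder contact email
    privateKeySecretRef:
      name: letsencrypt-mynamespace
    solvers:
      - http01:
          ingress:
            ingressClassName: nginx
```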

### Ingress

The ingress controller is the entry point for all CloudHarness applications.
Info on how to deploy ingress-nginx can be found [here](https://kubernetes.github.io/ingress-nginx/deploy/).

```bash
helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
helm repo update
helm install ingress-nginx ingress-nginx/ingress-nginx
```

On local clusters and GCP/GKE, the ingress-nginx chart will deploy a load balancer with a given IP address, while in other environments you may need to configure the load balancer manually. Use that address to create the CNAME and A records for the website.
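
One way to read the assigned address once the controller service is up (a sketch, assuming the default chart install above, so the service lives in the current namespace):

```bash
# On GKE this returns an IP; on some providers the field is .hostname instead
kubectl get svc ingress-nginx-controller \
  -o jsonpath='{.status.loadBalancer.ingress[0].ip}'
```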



## GCP GKE cluster setup

GKE setup is pretty straightforward: you can create a cluster and a node pool from the Google console, and internet-facing load balancers are created directly with the ingress controller.

For additional info see [here](gcp-setup.md).

## AWS EKS setup
AWS requires some additional steps to install the load balancer and the ingress; see [here](./aws-setup.md).
128 changes: 128 additions & 0 deletions infrastructure/cluster-configuration/aws-setup.md
@@ -0,0 +1,128 @@



# AWS EKS setup

## Create the EKS cluster

There are many ways to create a cluster; which one to use depends on requirements that are outside the scope of this guide.

This is a good starting point: https://docs.aws.amazon.com/eks/latest/userguide/getting-started.html.
If in doubt, an [Auto Mode cluster](https://docs.aws.amazon.com/eks/latest/userguide/getting-started-automode.html) is a good place to start.
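
For a quick start with eksctl, a minimal sketch (the cluster name, region, and node count are placeholders, chosen to match the examples below):

```bash
# Placeholder name/region; adjust node count and instance type to your workload
eksctl create cluster --name metacell-dev --region us-west-2 --nodes 3
```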

## Ingress setup

The following is inspired by https://aws.amazon.com/blogs/containers/exposing-kubernetes-applications-part-3-nginx-ingress-controller/, section "Exposing Ingress-Nginx Controller via a Load Balancer".
Be aware that the article is from 2022 and no longer works end to end.
The steps below are the ones that worked for us as of May 2025.

### Setup the policy and service account

Note that the version of the aws-load-balancer-controller has to match the IAM policy; a version mismatch will make the installation fail.

```bash
# Pin the policy to the controller version you will install (here v2.13.2);
# using main may drift out of sync with the installed controller
curl -o iam-policy.json https://raw.githubusercontent.com/kubernetes-sigs/aws-load-balancer-controller/v2.13.2/docs/install/iam_policy.json
aws iam create-policy --policy-name AWSLoadBalancerControllerIAMPolicy --policy-document file://iam-policy.json
AWS_ACCOUNT=123456789 # Your AWS account
eksctl create iamserviceaccount \
--cluster=metacell-dev \
--name=aws-load-balancer-controller \
--namespace=kube-system \
--attach-policy-arn=arn:aws:iam::${AWS_ACCOUNT}:policy/AWSLoadBalancerControllerIAMPolicy \
--approve
```
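
To check that the service account was created and annotated with the IAM role (a quick sanity check, assuming the names used above):

```bash
# The eks.amazonaws.com/role-arn annotation confirms the IRSA binding
kubectl -n kube-system get serviceaccount aws-load-balancer-controller -o yaml | grep role-arn
```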

### Install the aws-load-balancer-controller

First, apply the custom resource definitions:
```bash
wget https://raw.githubusercontent.com/aws/eks-charts/refs/heads/master/stable/aws-load-balancer-controller/crds/crds.yaml
kubectl apply -f crds.yaml
```

Then install the Helm chart, following https://github.com/aws/eks-charts/tree/master/stable/aws-load-balancer-controller:
```bash
helm repo add eks https://aws.github.io/eks-charts
# When using IAM roles for service accounts, set both serviceAccount.create=false and serviceAccount.name
# Consider also pinning a controller version compatible with the IAM policy above
helm install aws-load-balancer-controller eks/aws-load-balancer-controller --set clusterName=metacell-dev -n kube-system --set serviceAccount.create=false --set serviceAccount.name=aws-load-balancer-controller
```
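
Once installed, the controller deployment should become available; a quick check (deployment name and labels follow the chart defaults):

```bash
kubectl -n kube-system rollout status deployment aws-load-balancer-controller
kubectl -n kube-system get pods -l app.kubernetes.io/name=aws-load-balancer-controller
```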


### Fix the VPC ID

If you encounter the following error related to the VPC:

> {"level":"info","ts":"2025-05-21T13:53:48Z","msg":"version","GitVersion":"v2.13.2","GitCommit":"4236bd7928711874ae4d8aff6b97870b5625140f","BuildDate":"2025-05-15T17:37:55+0000"}
> {"level":"error","ts":"2025-05-21T13:53:53Z","logger":"setup","msg":"unable to initialize AWS cloud","error":"failed to get VPC ID: failed to fetch VPC ID from instance metadata: error in fetching vpc id through ec2 metadata: get mac metadata: operation error ec2imds: GetMetadata, canceled, context deadline exceeded"}

First get the VPC ID and store it in a variable:

```bash
VPC_ID=$(aws eks describe-cluster \
  --name metacell-dev \
  --region us-west-2 \
  --query "cluster.resourcesVpcConfig.vpcId" \
  --output text)
```

Then pass the VPC ID explicitly to the controller:
```bash
helm upgrade aws-load-balancer-controller eks/aws-load-balancer-controller -n kube-system --reuse-values --set vpcId=$VPC_ID
```

### Install ingress-nginx

```bash
helm upgrade -i ingress-nginx ingress-nginx/ingress-nginx \
--namespace kube-system \
--values ingress/values-aws.yaml

kubectl -n kube-system rollout status deployment ingress-nginx-controller

kubectl get deployment -n kube-system ingress-nginx-controller
```

### Associate the DNS

The deployment endpoint needs two CNAME entries pointing at the load balancer.
For instance, if you run `harness-deployment ... -d myapp.mydomain.com`,
the following CNAME entries are needed:
- myapp → [LB_ADDRESS]
- *.myapp → [LB_ADDRESS]


The easiest way to get the load balancer address is to run the deployment and
then read it from the ingress:

```bash
kubectl get ingress
```
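
Alternatively, the address can be read directly from the controller service (a sketch, assuming the kube-system install above; on AWS the NLB exposes a hostname rather than an IP):

```bash
kubectl -n kube-system get svc ingress-nginx-controller \
  -o jsonpath='{.status.loadBalancer.ingress[0].hostname}'
```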

## Storage class

EKS does not provide a default storage class.
To create one, run

```bash
kubectl apply -f storageclass-default-aws.yaml
```
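
To verify that the class is registered and marked as default:

```bash
kubectl get storageclass   # the new 'default' class should be flagged as (default)
```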

## Container registry

CloudHarness pushes images to a container registry, which must be readable from EKS.

Any public registry can be used seamlessly, while ECR is recommended for pulling private images:

1. Create a new ECR registry
2. Create all the repositories used by the deployment, as ECR does not create repositories automatically on push unless this is implemented: https://aws.amazon.com/blogs/containers/dynamically-create-repositories-upon-image-push-to-amazon-ecr/ (a sketch follows this list)
3. Grant pull permissions to the node IAM role, see
https://docs.aws.amazon.com/AmazonECR/latest/userguide/ECR_on_EKS.html (the role should be AmazonSSMRoleForInstancesQuickSetup for Auto Mode clusters)
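
A sketch of step 2 with the AWS CLI; the repository names here are hypothetical and should match the images your deployment pushes:

```bash
# Hypothetical repository names; ECR requires one repository per image
for repo in cloudharness/accounts cloudharness/samples; do
  aws ecr create-repository --repository-name "$repo" --region us-west-2
done
```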

To push images, you have to authenticate to the registry.

From a local console, the command looks like the following:

```bash
aws ecr get-login-password --region us-west-2 | docker login --username AWS --password-stdin 527966638683.dkr.ecr.us-west-2.amazonaws.com
```

The exact command can also be viewed by hitting "View push commands" from the web console.
45 changes: 45 additions & 0 deletions infrastructure/cluster-configuration/azure-setup.md
@@ -0,0 +1,45 @@
# Azure AKS setup

The main complication with AKS is that the nginx ingress controller
is not easy to set up, so the AGIC (Application Gateway Ingress Controller) is preferred.


1. Create a new public IP for the AGIC load balancer
2. Create the application gateway
3. Enable AGIC add-on in existing AKS cluster with Azure CLI
4. Peer the AKS and AG virtual networks together

See https://learn.microsoft.com/en-us/azure/application-gateway/tutorial-ingress-controller-add-on-existing.
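
A sketch of step 3 with the Azure CLI; the resource names are placeholders, and the linked tutorial covers steps 1, 2, and 4 in full:

```bash
# Placeholders: myResourceGroup, myCluster, myAppGateway
APPGW_ID=$(az network application-gateway show \
  --resource-group myResourceGroup --name myAppGateway --query id --output tsv)
az aks enable-addons --resource-group myResourceGroup --name myCluster \
  --addons ingress-appgw --appgw-id "$APPGW_ID"
```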

## Adapt / override the Ingress template
In addition, the [Ingress template](../../deployment-configuration/helm/templates/ingress.yaml) has to be adapted to use the AGIC controller.

An adaptation of the ingress template that optionally supports AGIC is the following:
```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
name: {{ .Values.ingress.name | quote }}
annotations:
{{- if eq .Values.ingress.className "azure-application-gateway" }}
appgw.ingress.kubernetes.io/backend-path-prefix: /
{{- if $tls }}
appgw.ingress.kubernetes.io/appgw-ssl-certificate: "nwwcssl"
{{- end }}
appgw.ingress.kubernetes.io/ssl-redirect: {{ (and $tls .Values.ingress.ssl_redirect) | quote }}
{{- else }}
nginx.ingress.kubernetes.io/ssl-redirect: {{ (and $tls .Values.ingress.ssl_redirect) | quote }}
nginx.ingress.kubernetes.io/proxy-body-size: '{{ .Values.proxy.payload.max }}m'
nginx.ingress.kubernetes.io/proxy-buffer-size: '128k'
nginx.ingress.kubernetes.io/from-to-www-redirect: 'true'
nginx.ingress.kubernetes.io/rewrite-target: /$1
nginx.ingress.kubernetes.io/auth-keepalive-timeout: {{ .Values.proxy.timeout.keepalive | quote }}
nginx.ingress.kubernetes.io/proxy-read-timeout: {{ .Values.proxy.timeout.read | quote }}
nginx.ingress.kubernetes.io/proxy-send-timeout: {{ .Values.proxy.timeout.send | quote }}
nginx.ingress.kubernetes.io/use-forwarded-headers: {{ .Values.proxy.forwardedHeaders | quote }}

{{- end }}
{{- if and (and (not .Values.local) (not .Values.certs)) $tls }}
kubernetes.io/tls-acme: 'true'
cert-manager.io/issuer: {{ printf "%s-%s" "letsencrypt" .Values.namespace }}
{{- end }}
```
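
With such a template, switching to AGIC is then a matter of setting the ingress class in the deployment values; a hypothetical override could look like this (key names follow the template above):

```yaml
# Hypothetical values override for an AGIC deployment
ingress:
  name: cloudharness-ingress
  className: azure-application-gateway
  ssl_redirect: true
```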
13 changes: 12 additions & 1 deletion infrastructure/cluster-configuration/cluster-init.sh
@@ -1,3 +1,14 @@
# Install the ingress-nginx controller in kube-system
helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
helm repo update
helm install ingress ingress-nginx/ingress-nginx -f ingress/values.yaml -n kube-system

# Install cert-manager with its CRDs
helm repo add jetstack https://charts.jetstack.io --force-update
helm install \
cert-manager jetstack/cert-manager \
--namespace cert-manager \
--create-namespace \
--version v1.17.2 \
--set crds.enabled=true


11 changes: 11 additions & 0 deletions infrastructure/cluster-configuration/gcp-setup.md
@@ -0,0 +1,11 @@
# Google GKE setup

The easiest way to create the cluster is to use the GCP web console. Define a node pool that satisfies the application requirements (no other particular requirements are generally necessary for the node pools).
The node pool can always be scaled to a different number of nodes, so the only important choice is the node type.
Other node pools can also be added to the cluster at any time to replace the original one.

After creating the cluster, use the `gcloud` command line client to set up credentials on your machine:
- `gcloud init`
- `gcloud container clusters get-credentials --zone us-central1-a <CLUSTER_NAME>`
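
A quick check that the credentials work:

```bash
kubectl get nodes   # should list the nodes of the pool created above
```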


16 changes: 16 additions & 0 deletions infrastructure/cluster-configuration/ingress/values-aws.yaml
@@ -0,0 +1,16 @@
controller:
  service:
    annotations:
      service.beta.kubernetes.io/aws-load-balancer-name: apps-ingress
      service.beta.kubernetes.io/aws-load-balancer-type: external
      service.beta.kubernetes.io/aws-load-balancer-scheme: internet-facing
      service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: ip
      service.beta.kubernetes.io/aws-load-balancer-healthcheck-protocol: http
      service.beta.kubernetes.io/aws-load-balancer-healthcheck-path: /healthz
      service.beta.kubernetes.io/aws-load-balancer-healthcheck-port: "10254"
  config:
    http-snippet: |
      proxy_cache_path /tmp/nginx-cache levels=1:2 keys_zone=static-cache:2m max_size=100m inactive=7d use_temp_path=off;
      proxy_cache_key $scheme$proxy_host$request_uri;
      proxy_cache_lock on;
      proxy_cache_use_stale updating;
10 changes: 10 additions & 0 deletions infrastructure/cluster-configuration/storageclass-default-aws.yaml
@@ -0,0 +1,10 @@
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: default
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"
reclaimPolicy: Delete
provisioner: ebs.csi.eks.amazonaws.com
volumeBindingMode: Immediate
---
6 changes: 3 additions & 3 deletions infrastructure/cluster-configuration/storageclass.yaml
@@ -1,8 +1,8 @@
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
name: persisted
reclaimPolicy: Retain
volumeProvisioner: kubernetes.io/gce-pd
name: standard
reclaimPolicy: Delete
provisioner: kubernetes.io/gce-pd
volumeBindingMode: Immediate
---