Enable the possibility of running ToughDay distributed on Kubernetes #10
Description
Context
Currently, ToughDay can only run on a single machine, and the configuration given by the user is constrained by the resource availability of the hardware (CPU, memory, IO). Horizontal scaling is not possible at the moment, so testing an application under a heavy load would require manually starting an individual execution on multiple machines.
A solution to the problem described above would be the ability to run the framework in a distributed mode, on a dedicated cluster.
Desired usage
By default, ToughDay will continue to run on the local machine with the configuration established by the user. If the user wants to trigger the execution on a dedicated cluster, the following steps should be taken:
- Two types of components must be deployed: Driver and Agents. The execution can start only after the pods are up and running.
- The Driver must expose a public IP address, accessible from outside the cluster.
Example of a command for running ToughDay distributed on Kubernetes:
java -jar td.jar --host=localhost --distributedconfig driverip=10.0.0.1 heartbeatinterval=5s --add CreateUserTest
Each property related to the distributed run must be specified after the "--distributedconfig" argument and must respect the following format: name=value. The available properties are:
- driverip: specifies the public IP address at which the Driver can be accessed from outside the cluster. This property is required when running TD distributed.
- agent: specifies whether ToughDay runs in Agent mode, waiting to receive tasks from the driver. Default value: false.
- driver: specifies whether ToughDay runs in Driver mode, coordinating the execution in the cluster. Default value: false.
- heartbeatinterval: time interval between two consecutive heartbeat messages sent from the driver to the agents.
- redistributionwaittime: time interval between two consecutive work redistributions. This process is triggered whenever the number of agents running in the cluster changes.
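To make the command format concrete, the snippet below is a minimal sketch of how the name=value pairs following "--distributedconfig" could be collected, with the defaults described above. The class name and parsing logic are illustrative assumptions, not ToughDay's actual argument parser.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only: collects name=value properties that follow
// "--distributedconfig" until the next "--" option is encountered.
public class DistributedConfigParser {
    public static Map<String, String> parse(String[] args) {
        Map<String, String> props = new HashMap<>();
        // defaults described above
        props.put("agent", "false");
        props.put("driver", "false");

        boolean inDistributedSection = false;
        for (String arg : args) {
            if (arg.equals("--distributedconfig")) {
                inDistributedSection = true;
                continue;
            }
            if (inDistributedSection) {
                if (arg.startsWith("--")) {        // next top-level option ends the section
                    inDistributedSection = false;
                    continue;
                }
                String[] pair = arg.split("=", 2); // each property is name=value
                if (pair.length == 2) {
                    props.put(pair[0], pair[1]);
                }
            }
        }
        return props;
    }

    public static void main(String[] args) {
        Map<String, String> props = parse(new String[] {
            "--host=localhost", "--distributedconfig",
            "driverip=10.0.0.1", "heartbeatinterval=5s", "--add", "CreateUserTest"
        });
        System.out.println(props.get("driverip"));          // 10.0.0.1
        System.out.println(props.get("heartbeatinterval")); // 5s
    }
}
```

With the example command above, this would yield driverip=10.0.0.1 and heartbeatinterval=5s, while agent and driver keep their default value of false.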
Driver
The driver coordinates the entire distributed execution and is responsible for the following main tasks:
- periodically sending heartbeat messages to each active agent. These HTTP requests are used for detecting when the agents are no longer running properly. There are multiple causes that could lead to this situation: an exception was thrown while running ToughDay, the pod was evicted because of a lack of resources in the cluster, the pod was manually deleted by the user etc.
- redistributing the work whenever the number of agents running in the cluster changes. When the user manually modifies this number (by setting a new value for the 'parallelism' field in the configuration file), Kubernetes deploys/removes pods in batches of unknown sizes. In order to reduce the number of work redistributions, the rebalancing process is scheduled to begin redistributionwaittime seconds after the first modification is detected. If another change occurs during this period, it will be handled when the scheduled process begins its execution.
- splitting each phase of the initial ToughDay configuration into a number of phases equal to the number of agents running in the cluster. Currently, the division is made based on the following three properties: the number of executions per test, the number of threads to be used for running the tests, and the load to be generated when running ToughDay with the constant load run mode.
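The per-agent division of these numeric properties could be sketched as below; the `share` helper is a hypothetical illustration (not ToughDay's actual API), assuming an even split with any remainder spread over the first agents.

```java
// Illustrative sketch of splitting a phase property (e.g. concurrency or
// test count) among the agents running in the cluster.
public class PhasePartitioner {
    // Returns the share of 'total' assigned to agent 'index' out of 'agents';
    // the first (total % agents) agents receive one extra unit.
    public static int share(int total, int agents, int index) {
        int base = total / agents;
        int remainder = total % agents;
        return base + (index < remainder ? 1 : 0);
    }

    public static void main(String[] args) {
        int agents = 2;
        for (int i = 0; i < agents; i++) {
            // e.g. a phase with concurrency 300 and count 200, split between 2 agents
            System.out.println("agent " + i
                + ": concurrency=" + share(300, agents, i)
                + ", count=" + share(200, agents, i));
        }
    }
}
```

With 2 agents, a concurrency of 300 and a count of 200 become 150 and 100 per agent, matching the partitioning example that follows.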
Example of a partitioning process assuming 2 agents running in the cluster:
Configuration used to execute ToughDay:
globals:
  host: 10.244.1.19
phases:
- metrics:
  - add: Passed
  - add: Failed
  name: phase1
  publishers:
  - add: CSVPublisher
  runmode:
    type: normal
    concurrency: 300
  tests:
  - add: CreateUserTest
    properties:
      count: 200

Configuration received by each of the two agents running in the cluster:
globals:
  host: 10.244.1.19
phases:
- metrics:
  - add: Passed
  - add: Failed
  name: phase1
  publishers:
  - add: CSVPublisher
  runmode:
    type: normal
    concurrency: 150
  tests:
  - add: CreateUserTest
    properties:
      count: 100

Agent
Component used for running TD tests. When the pod starts, the process sends a request to the driver in order to be marked as an active agent running in the cluster. Afterwards, the agent waits to receive phases to execute from the driver.
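The registration step could look roughly like the sketch below, using the JDK's built-in HTTP client. The "/registerAgent" path, the port, and the plain-IP payload are assumptions made for illustration; they are not ToughDay's actual protocol.

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Illustrative sketch of the agent's start-up call: one HTTP request to the
// driver's public IP so the driver marks this pod as an active agent.
public class AgentRegistration {
    // Builds the registration request; endpoint path and port are assumptions.
    public static HttpRequest buildRegistrationRequest(String driverIp, String agentIp) {
        return HttpRequest.newBuilder()
                .uri(URI.create("http://" + driverIp + ":4567/registerAgent"))
                // the agent identifies itself by its pod IP
                .POST(HttpRequest.BodyPublishers.ofString(agentIp))
                .build();
    }

    public static void main(String[] args) {
        HttpRequest request = buildRegistrationRequest("10.0.0.1", "10.244.1.20");
        System.out.println(request.uri()); // driver endpoint the agent registers with
        // In the real flow the agent would then send it, e.g.:
        // HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());
    }
}
```

After a successful registration, the agent simply waits for the driver to push phases to it, as described above.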