Sometimes the Flower server crashes, but the container running it (fl-server) is not killed, leaving the experiment running indefinitely.
I'm not sure whether Flower is catching the error and preventing the container from terminating. If that's the case, we could add a mechanism that terminates the execution when a round takes too long: for example, stop the experiment if more than 1 hour has passed since the last round completed. The researcher could be allowed to configure this value.

Example crash:
INFO :      Timestamp: 1743180348.4660757
INFO :      Starting Flower server, config: num_rounds=10, no round_timeout
INFO :      Flower ECE: gRPC server running (10 rounds), SSL is disabled
INFO :      [INIT]
INFO :      Requesting initial parameters from one random client
INFO :      Received initial parameters from one random client
INFO :      Evaluating initial global parameters
INFO :
INFO :      [ROUND 1]
INFO :      configure_fit: strategy sampled 10 clients (out of 10)
INFO :      aggregate_fit: received 10 results and 0 failures
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/fl_testbed/user_code/system/servers/serverGH.py", line 138, in <module>
    fl.server.start_server(
  File "/opt/conda/lib/python3.10/site-packages/flwr/server/app.py", line 171, in start_server
    hist = run_fl(
  File "/opt/conda/lib/python3.10/site-packages/flwr/server/server.py", line 483, in run_fl
    hist, elapsed_time = server.fit(
  File "/opt/conda/lib/python3.10/site-packages/flwr/server/server.py", line 113, in fit
    res_fit = self.fit_round(
  File "/opt/conda/lib/python3.10/site-packages/flwr/server/server.py", line 249, in fit_round
    ] = self.strategy.aggregate_fit(server_round, results, failures)
  File "/opt/conda/lib/python3.10/site-packages/colext/metric_collection/decorators/flwr_server_decorator.py", line 95, in aggregate_fit
    aggregate_fit_result = super().aggregate_fit(server_round, results, failures)
  File "/fl_testbed/user_code/system/servers/serverGH.py", line 79, in aggregate_fit
    self.update_Head(uploaded_protos)
  File "/fl_testbed/user_code/system/servers/serverGH.py", line 107, in update_Head
    for proto, y in proto_loader:
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 634, in __next__
    data = self._next_data()
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 678, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 54, in fetch
    return self.collate_fn(data)
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/data/_utils/collate.py", line 264, in default_collate
    return collate(batch, collate_fn_map=default_collate_fn_map)
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/data/_utils/collate.py", line 142, in collate
    return [collate(samples, collate_fn_map=collate_fn_map) for samples in transposed]  # Backwards compatibility.
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/data/_utils/collate.py", line 142, in <listcomp>
    return [collate(samples, collate_fn_map=collate_fn_map) for samples in transposed]  # Backwards compatibility.
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/data/_utils/collate.py", line 119, in collate
    return collate_fn_map[elem_type](batch, collate_fn_map=collate_fn_map)
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/data/_utils/collate.py", line 162, in collate_tensor_fn
    return torch.stack(batch, 0, out=out)
RuntimeError: stack expects each tensor to be equal size, but got [] at entry 0 and [512] at entry 29