Terminate experiments on server error #4

@Nandinski

Description

Sometimes the Flower server crashes, but the container running it (fl-server) is not killed, so the experiment keeps running indefinitely.
I'm not sure whether Flower is capturing the error and preventing the container from terminating. If that's the case, we could add a mechanism that terminates the execution when a round takes too long, for example stopping the experiment if more than 1 h has passed since the last round completed. The researcher should be able to configure this value.
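
A minimal sketch of the timeout idea, assuming a small watchdog that runs next to the server and is notified whenever a round finishes. `RoundWatchdog`, `notify_round_completed`, and `max_idle_seconds` are hypothetical names used for illustration; they are not part of Flower or of this project.

```python
import threading
import time


class RoundWatchdog:
    """Calls on_timeout once if no round completes within max_idle_seconds."""

    def __init__(self, max_idle_seconds, on_timeout, clock=time.monotonic):
        self.max_idle_seconds = max_idle_seconds  # researcher-configurable limit
        self.on_timeout = on_timeout
        self._clock = clock
        self._last_progress = clock()
        self._lock = threading.Lock()
        self._stop = threading.Event()

    def notify_round_completed(self):
        # Call this at the end of every round to reset the idle timer.
        with self._lock:
            self._last_progress = self._clock()

    def is_expired(self):
        # True when more than max_idle_seconds passed since the last round.
        with self._lock:
            return self._clock() - self._last_progress > self.max_idle_seconds

    def start(self, poll_interval=5.0):
        def _loop():
            # Event.wait returns False on timeout, True once stop() was called.
            while not self._stop.wait(poll_interval):
                if self.is_expired():
                    self.on_timeout()
                    return

        # Daemon thread: the watchdog itself never keeps the process alive.
        thread = threading.Thread(target=_loop, daemon=True)
        thread.start()
        return thread

    def stop(self):
        self._stop.set()
```

With the 1 h limit from above this would be something like `RoundWatchdog(3600, on_timeout=lambda: os._exit(1)).start()`, with `notify_round_completed()` called from wherever the server learns that a round finished. `os._exit` skips normal cleanup, so whether that is acceptable for fl-server still needs to be decided.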

Example crash:

INFO :      Timestamp: 1743180348.4660757
INFO :      Starting Flower server, config: num_rounds=10, no round_timeout
INFO :      Flower ECE: gRPC server running (10 rounds), SSL is disabled
INFO :      [INIT]
INFO :      Requesting initial parameters from one random client
INFO :      Received initial parameters from one random client
INFO :      Evaluating initial global parameters
INFO :      
INFO :      [ROUND 1]
INFO :      configure_fit: strategy sampled 10 clients (out of 10)
INFO :      aggregate_fit: received 10 results and 0 failures
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/fl_testbed/user_code/system/servers/serverGH.py", line 138, in <module>
    fl.server.start_server(
  File "/opt/conda/lib/python3.10/site-packages/flwr/server/app.py", line 171, in start_server
    hist = run_fl(
  File "/opt/conda/lib/python3.10/site-packages/flwr/server/server.py", line 483, in run_fl
    hist, elapsed_time = server.fit(
  File "/opt/conda/lib/python3.10/site-packages/flwr/server/server.py", line 113, in fit
    res_fit = self.fit_round(
  File "/opt/conda/lib/python3.10/site-packages/flwr/server/server.py", line 249, in fit_round
    ] = self.strategy.aggregate_fit(server_round, results, failures)
  File "/opt/conda/lib/python3.10/site-packages/colext/metric_collection/decorators/flwr_server_decorator.py", line 95, in aggregate_fit
    aggregate_fit_result = super().aggregate_fit(server_round, results, failures)
  File "/fl_testbed/user_code/system/servers/serverGH.py", line 79, in aggregate_fit
    self.update_Head(uploaded_protos)
  File "/fl_testbed/user_code/system/servers/serverGH.py", line 107, in update_Head
    for proto, y in proto_loader:
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 634, in __next__
    data = self._next_data()
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 678, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 54, in fetch
    return self.collate_fn(data)
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/data/_utils/collate.py", line 264, in default_collate
    return collate(batch, collate_fn_map=default_collate_fn_map)
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/data/_utils/collate.py", line 142, in collate
    return [collate(samples, collate_fn_map=collate_fn_map) for samples in transposed]  # Backwards compatibility.
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/data/_utils/collate.py", line 142, in <listcomp>
    return [collate(samples, collate_fn_map=collate_fn_map) for samples in transposed]  # Backwards compatibility.
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/data/_utils/collate.py", line 119, in collate
    return collate_fn_map[elem_type](batch, collate_fn_map=collate_fn_map)
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/data/_utils/collate.py", line 162, in collate_tensor_fn
    return torch.stack(batch, 0, out=out)
RuntimeError: stack expects each tensor to be equal size, but got [] at entry 0 and [512] at entry 29
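
The traceback reaches the top level of the server script, so in this crash the exception does not look like it was swallowed by Flower. One possible explanation, which is an assumption and not something I have confirmed, is that background threads keep the process alive after the main thread dies. Under that assumption, a sketch of a wrapper that forces the process to exit with a non-zero code could look like this (`run_or_exit` is a hypothetical helper name):

```python
import os
import sys
import traceback


def run_or_exit(start_fn, exit_fn=os._exit, exit_code=1):
    """Run start_fn; on any exception, print the traceback and force an exit."""
    try:
        return start_fn()
    except Exception:
        # Keep the original error visible in the container logs.
        traceback.print_exc()
        sys.stderr.flush()
        # os._exit ends the process immediately, even if other threads
        # are still running; it also skips normal interpreter cleanup.
        exit_fn(exit_code)
```

In serverGH.py this would wrap the existing call, e.g. `run_or_exit(lambda: fl.server.start_server(...))`, so that the container stops when the server fails.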
