Sometimes the Flower server crashes, but the container running it (fl-server) is not killed, leaving the experiment running indefinitely.
I'm not sure whether Flower is catching the error and preventing the container from terminating. If that's the case, we could add a mechanism that terminates the execution when a round takes too long: for example, stop the experiment if more than 1 hour has passed since the last round completed. The researcher could be allowed to configure this value.

Example crash:
INFO :      Timestamp: 1743180348.4660757
INFO :      Starting Flower server, config: num_rounds=10, no round_timeout
INFO :      Flower ECE: gRPC server running (10 rounds), SSL is disabled
INFO :      [INIT]
INFO :      Requesting initial parameters from one random client
INFO :      Received initial parameters from one random client
INFO :      Evaluating initial global parameters
INFO :
INFO :      [ROUND 1]
INFO :      configure_fit: strategy sampled 10 clients (out of 10)
INFO :      aggregate_fit: received 10 results and 0 failures
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/fl_testbed/user_code/system/servers/serverGH.py", line 138, in <module>
    fl.server.start_server(
  File "/opt/conda/lib/python3.10/site-packages/flwr/server/app.py", line 171, in start_server
    hist = run_fl(
  File "/opt/conda/lib/python3.10/site-packages/flwr/server/server.py", line 483, in run_fl
    hist, elapsed_time = server.fit(
  File "/opt/conda/lib/python3.10/site-packages/flwr/server/server.py", line 113, in fit
    res_fit = self.fit_round(
  File "/opt/conda/lib/python3.10/site-packages/flwr/server/server.py", line 249, in fit_round
    ] = self.strategy.aggregate_fit(server_round, results, failures)
  File "/opt/conda/lib/python3.10/site-packages/colext/metric_collection/decorators/flwr_server_decorator.py", line 95, in aggregate_fit
    aggregate_fit_result = super().aggregate_fit(server_round, results, failures)
  File "/fl_testbed/user_code/system/servers/serverGH.py", line 79, in aggregate_fit
    self.update_Head(uploaded_protos)
  File "/fl_testbed/user_code/system/servers/serverGH.py", line 107, in update_Head
    for proto, y in proto_loader:
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 634, in __next__
    data = self._next_data()
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 678, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 54, in fetch
    return self.collate_fn(data)
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/data/_utils/collate.py", line 264, in default_collate
    return collate(batch, collate_fn_map=default_collate_fn_map)
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/data/_utils/collate.py", line 142, in collate
    return [collate(samples, collate_fn_map=collate_fn_map) for samples in transposed]  # Backwards compatibility.
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/data/_utils/collate.py", line 142, in <listcomp>
    return [collate(samples, collate_fn_map=collate_fn_map) for samples in transposed]  # Backwards compatibility.
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/data/_utils/collate.py", line 119, in collate
    return collate_fn_map[elem_type](batch, collate_fn_map=collate_fn_map)
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/data/_utils/collate.py", line 162, in collate_tensor_fn
    return torch.stack(batch, 0, out=out)
RuntimeError: stack expects each tensor to be equal size, but got [] at entry 0 and [512] at entry 29