SL nodes haven't been terminated before running the next local compare task #42

@Ultimate-Storm

Port 16000 for the SL container is reported as already in use because the SL node from the previous task was not stopped before the next local compare task started.

```
2023-03-15 13:01:08,895 : swarm.swop : INFO : SWOPExecutor: ENROLL TO Task : swarm_task_local_compare_20230315_133712 , opId : 13485391426936112013 - Begins
2023-03-15 13:01:11,950 : swarm.swop : INFO : SWOPExecutor: ENROLL TO Task : swarm_task_local_compare_20230315_133712 , opId : 13485391426936112013 - Ends
2023-03-15 13:01:12,102 : swarm.swop : INFO : SWOPRunTask: Profile validated
2023-03-15 13:01:15,146 : swarm.swop : INFO : SWOPRunTask: APLS configured with non-default port : 5000
2023-03-15 13:01:15,147 : swarm.swop : INFO : SWOPRunTask: SL Image Name : hub.myenterpriselicense.hpe.com/hpe_eval/swarm-learning/sl:1.2.0
2023-03-15 13:01:15,242 : swarm.swop : INFO : SWOPRunTask: Arguments passed to User container idx : 0
2023-03-15 13:01:15,243 : swarm.swop : INFO : {'entrypoint': 'python3', 'detach': True, 'auto_remove': False, 'name': 'demo-swarm_task_local_compare_20230315_133712-u-0-cc4936c40dfcbd00', 'hostname': 'user-marugoto_mri-172.24.40.65', 'network': 'host-net', 'ports': {}, 'mounts': [{'Target': '/tmp/hpe-swarm', 'Source': 'swop-demo-swif-0', 'Type': 'volume', 'ReadOnly': False}, {'Target': '/tmp/test/model', 'Source': '/opt/hpe/swarm-learning-hpe/workspace/marugoto_mri/model', 'Type': 'bind', 'ReadOnly': False}, {'Target': '/tmp/test/data-and-scratch', 'Source': '/opt/hpe/swarm-learning-hpe/workspace/marugoto_mri/user/data-and-scratch', 'Type': 'bind', 'ReadOnly': False}], 'environment': {'DATA_DIR': 'data-and-scratch/data', 'SCRATCH_DIR': 'data-and-scratch/scratch', 'MODEL_DIR': 'model', 'MAX_EPOCHS': 100, 'MIN_PEERS': 5, 'LOCAL_COMPARE_FLAG': True, 'USE_ADAPTIVE_SYNC': False, 'SYNC_FREQUENCY': 32, 'MODEL_TYPE': 'transformer', 'SL_REQUEST_CHANNEL': '/tmp/hpe-swarm/demo.0.request.pipe', 'SL_RESPONSE_CHANNEL': '/tmp/hpe-swarm/demo.0.response.pipe'}, 'working_dir': '/tmp/test', 'user': '0:0', 'dns': [], 'labels': None, 'device_requests': [{'Driver': '', 'Count': 0, 'DeviceIDs': ['all'], 'Capabilities': [['gpu']], 'Options': {}}], 'shm_size': '16G'}
2023-03-15 13:01:15,243 : swarm.swop : INFO : SWOPRunTask: USER Image Name : user-env-marugoto-swop
2023-03-15 13:01:18,573 : swarm.swop : INFO : SWOPRunTask: failed to remove : POD : 0 , TYPE : SL , CID :6315ec5fe5a5cc1572cfa11539da519b0e2c876f9740873f4abadf638f87e481
2023-03-15 13:01:18,573 : swarm.swop : INFO : 500 Server Error for http+docker://localhost/v1.41/containers/6315ec5fe5a5cc1572cfa11539da519b0e2c876f9740873f4abadf638f87e481/start: Internal Server Error ("driver failed programming external connectivity on endpoint demo-swarm_task_local_compare_20230315_133712-s-0-cc4936c40dfcbd00 (b9ff2b414f8a2bd5080919ab0946d6004160d4573cbf39c8d22abd41c3933438): Bind for 0.0.0.0:16000 failed: port is already allocated")
2023-03-15 13:01:19,309 : swarm.swop : INFO : SWOPRunTask: Failed to start containers...Stopping Task Execution
2023-03-15 13:01:19,310 : swarm.swop : INFO : SWOPRunTask: Stopping Task
```
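
As a possible way to confirm the diagnosis, the sketch below pulls the blocked host and port out of the Docker error line in the log above and then probes whether something on the host still holds that port. It is not part of SWOP: `find_blocked_port` and `port_is_free` are hypothetical helper names, and only the Python standard library is used. One assumption to keep in mind: the probe detects a port held by a listening process (for example Docker's userland proxy); if published ports are implemented purely through firewall rules on the host, the probe may report the port as free even though Docker still considers it allocated.

```python
import re
import socket
from typing import Optional, Tuple

# Matches the Docker daemon message seen in the log, e.g.
# "Bind for 0.0.0.0:16000 failed: port is already allocated"
BIND_ERROR = re.compile(
    r"Bind for (?P<host>[\d.]+):(?P<port>\d+) failed: port is already allocated"
)


def find_blocked_port(log_line: str) -> Optional[Tuple[str, int]]:
    """Return (host, port) if the line reports an already-allocated port, else None."""
    match = BIND_ERROR.search(log_line)
    if match is None:
        return None
    return match.group("host"), int(match.group("port"))


def port_is_free(port: int, host: str = "0.0.0.0") -> bool:
    """Try to bind a TCP socket to host:port; a failed bind means the port is taken."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            probe.bind((host, port))
        except OSError:
            return False
    return True


if __name__ == "__main__":
    line = (
        'Internal Server Error ("driver failed programming external connectivity '
        '...: Bind for 0.0.0.0:16000 failed: port is already allocated")'
    )
    blocked = find_blocked_port(line)
    if blocked is not None:
        host, port = blocked
        state = "free" if port_is_free(port, host) else "still in use"
        print(f"Port {port} on {host} is {state}")
```

Running a check like this before the next task starts would show whether the SL container from the previous task is still holding port 16000.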
