Hi there,
Thanks for sharing the framework! I am trying to experiment with the codebase, but I have not been able to create a suitable environment on AWS. I am using Deep Learning AMI (Ubuntu 18.04) Version 60.4 as the instance image; the corresponding AMI ID is ami-0184e674549ab8432.
When I run ./create-grace-env-tf1.15.sh from the root folder of the project, the Horovod installation raises the error below. The installation process does not stop, however, and I can see that horovod-0.21.0 ends up installed in the conda environment, but it is not configured correctly.
make[1]: Leaving directory '/tmp/pip-install-hopus90d/horovod/build/temp.linux-x86_64-cpython-37'
Makefile:146: recipe for target 'all' failed
make: *** [all] Error 2
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/tmp/pip-install-hopus90d/horovod/setup.py", line 193, in <module>
'horovodrun = horovod.runner.launch:run_commandline'
File "/home/ubuntu/grace/env-tf1.15/lib/python3.7/site-packages/setuptools/__init__.py", line 87, in setup
return distutils.core.setup(**attrs)
File "/home/ubuntu/grace/env-tf1.15/lib/python3.7/site-packages/setuptools/_distutils/core.py", line 148, in setup
return run_commands(dist)
File "/home/ubuntu/grace/env-tf1.15/lib/python3.7/site-packages/setuptools/_distutils/core.py", line 163, in run_commands
dist.run_commands()
File "/home/ubuntu/grace/env-tf1.15/lib/python3.7/site-packages/setuptools/_distutils/dist.py", line 967, in run_commands
self.run_command(cmd)
File "/home/ubuntu/grace/env-tf1.15/lib/python3.7/site-packages/setuptools/dist.py", line 1229, in run_command
super().run_command(command)
File "/home/ubuntu/grace/env-tf1.15/lib/python3.7/site-packages/setuptools/_distutils/dist.py", line 986, in run_command
cmd_obj.run()
File "/home/ubuntu/grace/env-tf1.15/lib/python3.7/site-packages/wheel/bdist_wheel.py", line 299, in run
self.run_command('build')
File "/home/ubuntu/grace/env-tf1.15/lib/python3.7/site-packages/setuptools/_distutils/cmd.py", line 313, in run_command
self.distribution.run_command(command)
File "/home/ubuntu/grace/env-tf1.15/lib/python3.7/site-packages/setuptools/dist.py", line 1229, in run_command
super().run_command(command)
File "/home/ubuntu/grace/env-tf1.15/lib/python3.7/site-packages/setuptools/_distutils/dist.py", line 986, in run_command
cmd_obj.run()
File "/home/ubuntu/grace/env-tf1.15/lib/python3.7/site-packages/setuptools/_distutils/command/build.py", line 136, in run
self.run_command(cmd_name)
File "/home/ubuntu/grace/env-tf1.15/lib/python3.7/site-packages/setuptools/_distutils/cmd.py", line 313, in run_command
self.distribution.run_command(command)
File "/home/ubuntu/grace/env-tf1.15/lib/python3.7/site-packages/setuptools/dist.py", line 1229, in run_command
super().run_command(command)
File "/home/ubuntu/grace/env-tf1.15/lib/python3.7/site-packages/setuptools/_distutils/dist.py", line 986, in run_command
cmd_obj.run()
File "/home/ubuntu/grace/env-tf1.15/lib/python3.7/site-packages/setuptools/command/build_ext.py", line 79, in run
_build_ext.run(self)
File "/home/ubuntu/grace/env-tf1.15/lib/python3.7/site-packages/setuptools/_distutils/command/build_ext.py", line 339, in run
self.build_extensions()
File "/tmp/pip-install-hopus90d/horovod/setup.py", line 91, in build_extensions
cwd=self.build_temp)
File "/home/ubuntu/grace/env-tf1.15/lib/python3.7/subprocess.py", line 363, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['cmake', '--build', '.', '--config', 'RelWithDebInfo', '--', '-j8', 'VERBOSE=1']' returned non-zero exit status 2.
----------------------------------------
ERROR: Failed building wheel for horovod
Running setup.py clean for horovod
Failed to build horovod
Running horovodrun -cb gives the following output, which indicates that the PyTorch extension is not enabled:
(/home/ubuntu/grace/env-tf1.15) ubuntu@ip-172-31-82-84:~/grace$ horovodrun -cb
Horovod v0.21.0:
Available Frameworks:
[X] TensorFlow
[ ] PyTorch
[ ] MXNet
Available Controllers:
[X] MPI
[X] Gloo
Available Tensor Operations:
[X] NCCL
[ ] DDL
[ ] CCL
[X] MPI
[X] Gloo
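In case it helps with reproducing this, here is a minimal sketch of how the check-build output above could be inspected from a script. It only parses text in the format shown above; the function name parse_check_build and the SAMPLE string are my own and not part of Horovod.

```python
import re

# Sample text copied from the `horovodrun -cb` output in this report.
SAMPLE = """\
Horovod v0.21.0:
Available Frameworks:
[X] TensorFlow
[ ] PyTorch
[ ] MXNet
Available Controllers:
[X] MPI
[X] Gloo
"""


def parse_check_build(text):
    """Return {section: {name: enabled}} for check-build style text.

    Hypothetical helper: it assumes section headers end with ':' and
    entries look like '[X] Name' or '[ ] Name', as in the output above.
    """
    sections = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        entry = re.match(r"\[( |X)\]\s+(.+)", line)
        if entry and current is not None:
            sections[current][entry.group(2)] = entry.group(1) == "X"
        elif line.endswith(":") and not line.startswith("Horovod"):
            # Skip the version banner, start a new section otherwise.
            current = line[:-1]
            sections[current] = {}
    return sections


frameworks = parse_check_build(SAMPLE)["Available Frameworks"]
disabled = [name for name, enabled in frameworks.items() if not enabled]
print("Disabled frameworks:", disabled)
```

On the output from my instance this lists PyTorch and MXNet as disabled, which matches what I see by eye.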
Besides, even though the TensorFlow extension for Horovod appears to be available, actual training shows that it does not work properly. When I run tensorflow_mnist.py with horovodrun, only the first GPU does any computation: all four Python processes are placed on GPU 0 and the other three GPUs stay idle. This means distributed training is not working.
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03 Driver Version: 510.47.03 CUDA Version: 11.6 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla T4 On | 00000000:00:1B.0 Off | 0 |
| N/A 45C P0 27W / 70W | 390MiB / 15360MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 Tesla T4 On | 00000000:00:1C.0 Off | 0 |
| N/A 37C P8 15W / 70W | 2MiB / 15360MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 2 Tesla T4 On | 00000000:00:1D.0 Off | 0 |
| N/A 37C P8 14W / 70W | 2MiB / 15360MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 3 Tesla T4 On | 00000000:00:1E.0 Off | 0 |
| N/A 35C P8 14W / 70W | 2MiB / 15360MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 10049 C python 97MiB |
| 0 N/A N/A 10050 C python 97MiB |
| 0 N/A N/A 10051 C python 97MiB |
| 0 N/A N/A 10052 C python 97MiB |
+-----------------------------------------------------------------------------+
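For reference, this is a minimal sketch of the per-process GPU pinning I would expect, so that each of the four processes lands on its own GPU instead of all on GPU 0. It is not taken from the GRACE code or from tensorflow_mnist.py. It assumes the launcher exports the local rank as HOROVOD_LOCAL_RANK (Gloo) or OMPI_COMM_WORLD_LOCAL_RANK (Open MPI); the helper names are hypothetical.

```python
import os

# Assumption: one of these variables carries the process's local rank,
# depending on whether the Gloo or the Open MPI launcher is used.
LOCAL_RANK_VARS = ("HOROVOD_LOCAL_RANK", "OMPI_COMM_WORLD_LOCAL_RANK")


def local_rank_from_env(env):
    """Return the local rank found in env, or None if no variable is set."""
    for name in LOCAL_RANK_VARS:
        value = env.get(name)
        if value is not None:
            return int(value)
    return None


def pin_gpu(env):
    """Restrict the process to the GPU whose index equals its local rank.

    Hypothetical helper: it sets CUDA_VISIBLE_DEVICES in the given mapping
    and must be called before the framework initializes CUDA.
    """
    rank = local_rank_from_env(env)
    if rank is not None:
        env["CUDA_VISIBLE_DEVICES"] = str(rank)
    return rank


# Example with a plain dict standing in for os.environ.
example_env = {"OMPI_COMM_WORLD_LOCAL_RANK": "3"}
print(pin_gpu(example_env), example_env["CUDA_VISIBLE_DEVICES"])
```

In a real script this would be called as pin_gpu(os.environ) before importing TensorFlow. As far as I know, the usual TF 1.x way to do the same thing is to set config.gpu_options.visible_device_list = str(hvd.local_rank()) on the session config, so if that is already in place, the problem may be that hvd.local_rank() returns 0 in every process because of the broken build.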
Any suggestions? It would be great if you could share a Docker environment or a more detailed system configuration.