Environment build failed on AWS EC2 g4dn instance with OS image (Deep Learning AMI, ami-0184e674549ab8432)

Hi there,

Thanks for sharing the framework! I try to experiment with the codebase. But not able to create suitable environment on AWS platform. I am using `Deep Learning AMI (Ubuntu 18.04) Version 60.4` as the cloud instance system, the corresponding AMI is `ami-0184e674549ab8432`. 

When I use command `./create-grace-env-tf1.15.sh` under the root folder of the project, the installation of horovod raises the following error. But the installation process didn't stop. And I can see the horovod-0.21.0 is installed in conda environment. But the configuration is not correct. 

```
 make[1]: Leaving directory '/tmp/pip-install-hopus90d/horovod/build/temp.linux-x86_64-cpython-37'
  Makefile:146: recipe for target 'all' failed
  make: *** [all] Error 2
  Traceback (most recent call last):
    File "<string>", line 1, in <module>
    File "/tmp/pip-install-hopus90d/horovod/setup.py", line 193, in <module>
      'horovodrun = horovod.runner.launch:run_commandline'
    File "/home/ubuntu/grace/env-tf1.15/lib/python3.7/site-packages/setuptools/__init__.py", line 87, in setup
      return distutils.core.setup(**attrs)
    File "/home/ubuntu/grace/env-tf1.15/lib/python3.7/site-packages/setuptools/_distutils/core.py", line 148, in setup
      return run_commands(dist)
    File "/home/ubuntu/grace/env-tf1.15/lib/python3.7/site-packages/setuptools/_distutils/core.py", line 163, in run_commands
      dist.run_commands()
    File "/home/ubuntu/grace/env-tf1.15/lib/python3.7/site-packages/setuptools/_distutils/dist.py", line 967, in run_commands
      self.run_command(cmd)
    File "/home/ubuntu/grace/env-tf1.15/lib/python3.7/site-packages/setuptools/dist.py", line 1229, in run_command
      super().run_command(command)
    File "/home/ubuntu/grace/env-tf1.15/lib/python3.7/site-packages/setuptools/_distutils/dist.py", line 986, in run_command
      cmd_obj.run()
    File "/home/ubuntu/grace/env-tf1.15/lib/python3.7/site-packages/wheel/bdist_wheel.py", line 299, in run
      self.run_command('build')
    File "/home/ubuntu/grace/env-tf1.15/lib/python3.7/site-packages/setuptools/_distutils/cmd.py", line 313, in run_command
      self.distribution.run_command(command)
    File "/home/ubuntu/grace/env-tf1.15/lib/python3.7/site-packages/setuptools/dist.py", line 1229, in run_command
      super().run_command(command)
    File "/home/ubuntu/grace/env-tf1.15/lib/python3.7/site-packages/setuptools/_distutils/dist.py", line 986, in run_command
      cmd_obj.run()
    File "/home/ubuntu/grace/env-tf1.15/lib/python3.7/site-packages/setuptools/_distutils/command/build.py", line 136, in run
      self.run_command(cmd_name)
    File "/home/ubuntu/grace/env-tf1.15/lib/python3.7/site-packages/setuptools/_distutils/cmd.py", line 313, in run_command
      self.distribution.run_command(command)
    File "/home/ubuntu/grace/env-tf1.15/lib/python3.7/site-packages/setuptools/dist.py", line 1229, in run_command
      super().run_command(command)
    File "/home/ubuntu/grace/env-tf1.15/lib/python3.7/site-packages/setuptools/_distutils/dist.py", line 986, in run_command
      cmd_obj.run()
    File "/home/ubuntu/grace/env-tf1.15/lib/python3.7/site-packages/setuptools/command/build_ext.py", line 79, in run
      _build_ext.run(self)
    File "/home/ubuntu/grace/env-tf1.15/lib/python3.7/site-packages/setuptools/_distutils/command/build_ext.py", line 339, in run
      self.build_extensions()
    File "/tmp/pip-install-hopus90d/horovod/setup.py", line 91, in build_extensions
      cwd=self.build_temp)
    File "/home/ubuntu/grace/env-tf1.15/lib/python3.7/subprocess.py", line 363, in check_call
      raise CalledProcessError(retcode, cmd)
  subprocess.CalledProcessError: Command '['cmake', '--build', '.', '--config', 'RelWithDebInfo', '--', '-j8', 'VERBOSE=1']' returned non-zero
 exit status 2.
  ----------------------------------------
  ERROR: Failed building wheel for horovod
  Running setup.py clean for horovod
Failed to build horovod
```

Running command `horovodrun -cb` gives following message, which indicate the PyTorch extension is not enabled. 
```
(/home/ubuntu/grace/env-tf1.15) ubuntu@ip-172-31-82-84:~/grace$ horovodrun -cb
Horovod v0.21.0:

Available Frameworks:
    [X] TensorFlow
    [ ] PyTorch
    [ ] MXNet

Available Controllers:
    [X] MPI
    [X] Gloo

Available Tensor Operations:
    [X] NCCL
    [ ] DDL
    [ ] CCL
    [X] MPI
    [X] Gloo
```

Besides, even the TensorFlow extension for Horovod seems ready, the actual training indicate it isn't work properly. When running [tensorflow_mnist.py](https://github.com/sands-lab/grace/blob/master/examples/tensorflow/tensorflow_mnist.py) with `horovodrun`, only the first GPU is doing the computation. This means the distributed training isn't working.
```
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03    Driver Version: 510.47.03    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            On   | 00000000:00:1B.0 Off |                    0 |
| N/A   45C    P0    27W /  70W |    390MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla T4            On   | 00000000:00:1C.0 Off |                    0 |
| N/A   37C    P8    15W /  70W |      2MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla T4            On   | 00000000:00:1D.0 Off |                    0 |
| N/A   37C    P8    14W /  70W |      2MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Tesla T4            On   | 00000000:00:1E.0 Off |                    0 |
| N/A   35C    P8    14W /  70W |      2MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     10049      C   python                             97MiB |
|    0   N/A  N/A     10050      C   python                             97MiB |
|    0   N/A  N/A     10051      C   python                             97MiB |
|    0   N/A  N/A     10052      C   python                             97MiB |
+-----------------------------------------------------------------------------+
```

Any suggestions? It would be great if you can share a docker environment, or more detailed system configurations. 

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Environment build failed on AWS EC2 g4dn instance with OS image (Deep Learning AMI, ami-0184e674549ab8432) #24

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Environment build failed on AWS EC2 g4dn instance with OS image (Deep Learning AMI, ami-0184e674549ab8432) #24

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions