Environment build failed on AWS EC2 g4dn instance with OS image (Deep Learning AMI, ami-0184e674549ab8432) #24

@zarzen

Hi there,

Thanks for sharing the framework! I am trying to experiment with the codebase but have not been able to create a suitable environment on AWS. I am using the Deep Learning AMI (Ubuntu 18.04) Version 60.4 as the instance OS image; the corresponding AMI ID is ami-0184e674549ab8432.

When I run ./create-grace-env-tf1.15.sh from the root folder of the project, the Horovod installation raises the following error. The installation process does not stop, however, and horovod-0.21.0 ends up installed in the conda environment, but it is not configured correctly.

 make[1]: Leaving directory '/tmp/pip-install-hopus90d/horovod/build/temp.linux-x86_64-cpython-37'
  Makefile:146: recipe for target 'all' failed
  make: *** [all] Error 2
  Traceback (most recent call last):
    File "<string>", line 1, in <module>
    File "/tmp/pip-install-hopus90d/horovod/setup.py", line 193, in <module>
      'horovodrun = horovod.runner.launch:run_commandline'
    File "/home/ubuntu/grace/env-tf1.15/lib/python3.7/site-packages/setuptools/__init__.py", line 87, in setup
      return distutils.core.setup(**attrs)
    File "/home/ubuntu/grace/env-tf1.15/lib/python3.7/site-packages/setuptools/_distutils/core.py", line 148, in setup
      return run_commands(dist)
    File "/home/ubuntu/grace/env-tf1.15/lib/python3.7/site-packages/setuptools/_distutils/core.py", line 163, in run_commands
      dist.run_commands()
    File "/home/ubuntu/grace/env-tf1.15/lib/python3.7/site-packages/setuptools/_distutils/dist.py", line 967, in run_commands
      self.run_command(cmd)
    File "/home/ubuntu/grace/env-tf1.15/lib/python3.7/site-packages/setuptools/dist.py", line 1229, in run_command
      super().run_command(command)
    File "/home/ubuntu/grace/env-tf1.15/lib/python3.7/site-packages/setuptools/_distutils/dist.py", line 986, in run_command
      cmd_obj.run()
    File "/home/ubuntu/grace/env-tf1.15/lib/python3.7/site-packages/wheel/bdist_wheel.py", line 299, in run
      self.run_command('build')
    File "/home/ubuntu/grace/env-tf1.15/lib/python3.7/site-packages/setuptools/_distutils/cmd.py", line 313, in run_command
      self.distribution.run_command(command)
    File "/home/ubuntu/grace/env-tf1.15/lib/python3.7/site-packages/setuptools/dist.py", line 1229, in run_command
      super().run_command(command)
    File "/home/ubuntu/grace/env-tf1.15/lib/python3.7/site-packages/setuptools/_distutils/dist.py", line 986, in run_command
      cmd_obj.run()
    File "/home/ubuntu/grace/env-tf1.15/lib/python3.7/site-packages/setuptools/_distutils/command/build.py", line 136, in run
      self.run_command(cmd_name)
    File "/home/ubuntu/grace/env-tf1.15/lib/python3.7/site-packages/setuptools/_distutils/cmd.py", line 313, in run_command
      self.distribution.run_command(command)
    File "/home/ubuntu/grace/env-tf1.15/lib/python3.7/site-packages/setuptools/dist.py", line 1229, in run_command
      super().run_command(command)
    File "/home/ubuntu/grace/env-tf1.15/lib/python3.7/site-packages/setuptools/_distutils/dist.py", line 986, in run_command
      cmd_obj.run()
    File "/home/ubuntu/grace/env-tf1.15/lib/python3.7/site-packages/setuptools/command/build_ext.py", line 79, in run
      _build_ext.run(self)
    File "/home/ubuntu/grace/env-tf1.15/lib/python3.7/site-packages/setuptools/_distutils/command/build_ext.py", line 339, in run
      self.build_extensions()
    File "/tmp/pip-install-hopus90d/horovod/setup.py", line 91, in build_extensions
      cwd=self.build_temp)
    File "/home/ubuntu/grace/env-tf1.15/lib/python3.7/subprocess.py", line 363, in check_call
      raise CalledProcessError(retcode, cmd)
  subprocess.CalledProcessError: Command '['cmake', '--build', '.', '--config', 'RelWithDebInfo', '--', '-j8', 'VERBOSE=1']' returned non-zero exit status 2.
  ----------------------------------------
  ERROR: Failed building wheel for horovod
  Running setup.py clean for horovod
Failed to build horovod
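
One way to make this kind of partial build fail loudly instead of continuing might be to rebuild Horovod with its documented `HOROVOD_WITH_TENSORFLOW` / `HOROVOD_WITH_PYTORCH` switches set to `1`, which ask the build to abort if the corresponding extension cannot be compiled. The sketch below only assembles the command and environment and does not run anything. The helper name `horovod_reinstall_plan` is hypothetical, and the pip flags and `HOROVOD_GPU_OPERATIONS=NCCL` are assumptions, not necessarily what create-grace-env-tf1.15.sh uses.

```python
import os
import sys


def horovod_reinstall_plan(base_env=None):
    """Return (command, env) for a strict Horovod rebuild; nothing is executed."""
    env = dict(os.environ if base_env is None else base_env)
    # Horovod build switches: a value of "1" requests a hard failure
    # when the named framework extension cannot be built.
    env["HOROVOD_WITH_TENSORFLOW"] = "1"
    env["HOROVOD_WITH_PYTORCH"] = "1"
    # Assumption: NCCL is the intended GPU collective backend.
    env["HOROVOD_GPU_OPERATIONS"] = "NCCL"
    # Assumption: rebuild the same version that the script installed.
    cmd = [
        sys.executable, "-m", "pip", "install",
        "--no-cache-dir", "--force-reinstall", "--no-deps",
        "horovod==0.21.0",
    ]
    return cmd, env


if __name__ == "__main__":
    cmd, env = horovod_reinstall_plan()
    print(" ".join(cmd))
    # To actually run the rebuild: subprocess.check_call(cmd, env=env)
```

Run inside the activated conda environment so that `sys.executable` points at the environment's Python.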

Running horovodrun -cb gives the following output, which indicates that the PyTorch extension is not enabled.

(/home/ubuntu/grace/env-tf1.15) ubuntu@ip-172-31-82-84:~/grace$ horovodrun -cb
Horovod v0.21.0:

Available Frameworks:
    [X] TensorFlow
    [ ] PyTorch
    [ ] MXNet

Available Controllers:
    [X] MPI
    [X] Gloo

Available Tensor Operations:
    [X] NCCL
    [ ] DDL
    [ ] CCL
    [X] MPI
    [X] Gloo
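
For scripted checks, the `horovodrun -cb` report above can be turned into a dictionary. This is a minimal sketch that assumes the text layout shown in this issue (section titles starting with "Available" and `[X]` / `[ ]` markers); the function name `parse_check_build` is hypothetical, and `SAMPLE` is an excerpt of the output pasted above.

```python
import re


def parse_check_build(text):
    """Map each 'Available ...' section to {item name: enabled?}."""
    sections = {}
    current = None
    for line in text.splitlines():
        stripped = line.strip()
        item = re.match(r"\[( |X)\]\s+(.+)", stripped)
        if item and current is not None:
            # "[X] Name" means built in, "[ ] Name" means missing.
            sections[current][item.group(2)] = item.group(1) == "X"
        elif stripped.startswith("Available") and stripped.endswith(":"):
            current = stripped[:-1]
            sections[current] = {}
    return sections


SAMPLE = """\
Horovod v0.21.0:

Available Frameworks:
    [X] TensorFlow
    [ ] PyTorch
    [ ] MXNet

Available Controllers:
    [X] MPI
    [X] Gloo
"""

report = parse_check_build(SAMPLE)
missing = [name for name, ok in report["Available Frameworks"].items() if not ok]
print(missing)  # ['PyTorch', 'MXNet']
```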

Besides, even though the TensorFlow extension for Horovod appears to be available, actual training indicates that it is not working properly. When running tensorflow_mnist.py with horovodrun, only the first GPU does any computation: in the nvidia-smi output below, all four python processes are on GPU 0. This means the distributed training is not working.

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03    Driver Version: 510.47.03    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            On   | 00000000:00:1B.0 Off |                    0 |
| N/A   45C    P0    27W /  70W |    390MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla T4            On   | 00000000:00:1C.0 Off |                    0 |
| N/A   37C    P8    15W /  70W |      2MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla T4            On   | 00000000:00:1D.0 Off |                    0 |
| N/A   37C    P8    14W /  70W |      2MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Tesla T4            On   | 00000000:00:1E.0 Off |                    0 |
| N/A   35C    P8    14W /  70W |      2MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     10049      C   python                             97MiB |
|    0   N/A  N/A     10050      C   python                             97MiB |
|    0   N/A  N/A     10051      C   python                             97MiB |
|    0   N/A  N/A     10052      C   python                             97MiB |
+-----------------------------------------------------------------------------+
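
The placement of processes can also be tallied from the process table rather than read by eye. The following is a rough sketch that assumes the row layout printed by this driver version (510.47.03) and process names without spaces; `processes_per_gpu` is a hypothetical helper, and `SAMPLE` repeats the rows shown above.

```python
import re
from collections import Counter

# Matches process rows such as:
# |    0   N/A  N/A     10049      C   python      97MiB |
ROW = re.compile(
    r"^\|\s+(\d+)\s+\S+\s+\S+\s+(\d+)\s+\S+\s+(\S+)\s+(\d+)MiB\s+\|$"
)


def processes_per_gpu(smi_text):
    """Count compute processes per GPU index in nvidia-smi text output."""
    counts = Counter()
    for line in smi_text.splitlines():
        match = ROW.match(line.strip())
        if match:
            counts[int(match.group(1))] += 1
    return dict(counts)


SAMPLE = """\
|    0   N/A  N/A     10049      C   python                             97MiB |
|    0   N/A  N/A     10050      C   python                             97MiB |
|    0   N/A  N/A     10051      C   python                             97MiB |
|    0   N/A  N/A     10052      C   python                             97MiB |
"""

print(processes_per_gpu(SAMPLE))  # {0: 4}
```

With one worker per GPU, the expected result on this instance would be one process on each of GPUs 0 through 3 rather than four on GPU 0.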

Any suggestions? It would be great if you could share a Docker environment or a more detailed system configuration.
