Running TensorFlow on the GPUs - Not working on ARGO
Summary of issue
The version of the loaded CUDA module, the Python version, and the TensorFlow version need to match. Compatibility is given in this chart: https://www.tensorflow.org/install/source#gpu_support_3
The CUDA version supported by the installed GPU driver (the "CUDA Version" reported by nvidia-smi) must be at least as new as the CUDA runtime version (the module being loaded). On the GPU nodes this is currently 10.2. The driver needs to be upgraded so that it supports at least CUDA 11.2 (with the accompanying cuDNN version) in order to run the latest versions of Python/TensorFlow.
From the testing, the code as it is cannot be run, since it is written for TensorFlow 2.6. According to the chart, this needs CUDA 11.2 and cuDNN 8.1. The GPU nodes support only CUDA 10.2.
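The matching rule above can be written down as a small lookup. This is only a sketch under stated assumptions: the `REQUIRED_CUDA` table is a hand-copied subset of the linked compatibility chart, the names `REQUIRED_CUDA` and `driver_supports` are hypothetical, and reducing the rule to "driver CUDA version >= required CUDA version" is a simplification.

```python
# Hand-copied subset of the TensorFlow tested-build chart (assumption: consult
# the linked chart for the authoritative, complete list).
REQUIRED_CUDA = {
    "2.2": (10, 1),
    "2.5": (11, 2),
    "2.6": (11, 2),
}


def driver_supports(tf_version, driver_cuda):
    """Return True if the driver's CUDA version covers what tf_version needs."""
    return driver_cuda >= REQUIRED_CUDA[tf_version]


# The ARGO GPU nodes report CUDA 10.2 in nvidia-smi.
for version in ("2.2", "2.6"):
    print(version, driver_supports(version, (10, 2)))
# prints:
# 2.2 True
# 2.6 False
```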
Installed CUDA Version
$ nvidia-smi
Sun Sep 5 22:23:59 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01    Driver Version: 440.33.01    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           On   | 00000000:06:00.0 Off |                    0 |
| N/A   20C    P8    26W / 149W |      0MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
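The header line of this output carries the two values that matter here. The "CUDA Version" field is the newest CUDA runtime that the installed driver supports, not a toolkit installed on the node. As a rough sketch (the helper name `parse_smi_header` is hypothetical, and the pattern has only been checked against the header format shown above), the values can be extracted like this:

```python
import re

# Header line as printed by nvidia-smi on the GPU node.
HEADER = "| NVIDIA-SMI 440.33.01 Driver Version: 440.33.01 CUDA Version: 10.2 |"


def parse_smi_header(line):
    """Extract (driver_version, cuda_version) strings from the header line."""
    match = re.search(
        r"Driver Version:\s*([\d.]+)\s+CUDA Version:\s*([\d.]+)", line
    )
    if match is None:
        raise ValueError("not an nvidia-smi header line")
    return match.group(1), match.group(2)


driver, cuda = parse_smi_header(HEADER)
print(driver, cuda)  # prints: 440.33.01 10.2
```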
Checking that TensorFlow can detect GPU devices
Scripts are run in a Python virtual environment (tf-env) with the following packages installed:
tensorflow-gpu==2.2
glob2
imageio
matplotlib
tensorflow_probability
pathlib
pickle-mixin
tensorflow_addons
Loaded CUDA module and version
$ module list
Currently Loaded Modulefiles:
1) cuda/10.1
Running a Python script that uses TensorFlow
$ python gpu_check.py
2021-09-05 22:24:20.712230: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2021-09-05 22:24:20.736678: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 0 with properties:
pciBusID: 0000:06:00.0 name: Tesla K80 computeCapability: 3.7
coreClock: 0.8235GHz coreCount: 13 deviceMemorySize: 11.17GiB deviceMemoryBandwidth: 223.96GiB/s
2021-09-05 22:24:20.738524: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2021-09-05 22:24:21.150860: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2021-09-05 22:24:21.375723: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2021-09-05 22:24:21.378054: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2021-09-05 22:24:21.690392: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2021-09-05 22:24:21.882272: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2021-09-05 22:24:22.509111: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2021-09-05 22:24:22.511227: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1703] Adding visible gpu devices: 0
Num GPUs Available: 1
Name: /physical_device:GPU:0 Type: GPU
The script successfully finds the GPU device when the cuda/10.1 module is loaded.
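The source of gpu_check.py is not included in these notes. The following is a guess at an equivalent script that would print the last two lines of the output above. The function names `format_devices` and `main` are hypothetical; `tf.config.list_physical_devices` is the TensorFlow 2.x call for enumerating devices.

```python
def format_devices(devices):
    """Build report lines for objects that have .name and .device_type."""
    lines = [f"Num GPUs Available: {len(devices)}"]
    lines += [f"Name: {d.name} Type: {d.device_type}" for d in devices]
    return lines


def main():
    # Imported here so the formatting helper can be used without TensorFlow.
    import tensorflow as tf

    gpus = tf.config.list_physical_devices("GPU")
    print("\n".join(format_devices(gpus)))
```

Calling `main()` inside the virtual environment runs the check. A script like this only enumerates devices; it does not create a tensor on the GPU.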
Testing the actual script
$ python mahesh_vae_spp.py
Traceback (most recent call last):
File "mahesh_vae_spp.py", line 7, in <module>
import tensorflow_probability as tfp
File "/scratch/falam5/Summer2021/spp_graph_64/data/argo/tf-env/lib/python3.8/site-packages/tensorflow_probability/__init__.py", line 20, in <module>
from tensorflow_probability import substrates
File "/scratch/falam5/Summer2021/spp_graph_64/data/argo/tf-env/lib/python3.8/site-packages/tensorflow_probability/substrates/__init__.py", line 21, in <module>
from tensorflow_probability.python.internal import all_util
File "/scratch/falam5/Summer2021/spp_graph_64/data/argo/tf-env/lib/python3.8/site-packages/tensorflow_probability/python/__init__.py", line 142, in <module>
dir(globals()[pkg_name]) # Forces loading the package from its lazy loader.
File "/scratch/falam5/Summer2021/spp_graph_64/data/argo/tf-env/lib/python3.8/site-packages/tensorflow_probability/python/internal/lazy_loader.py", line 61, in __dir__
module = self._load()
File "/scratch/falam5/Summer2021/spp_graph_64/data/argo/tf-env/lib/python3.8/site-packages/tensorflow_probability/python/internal/lazy_loader.py", line 41, in _load
self._on_first_access()
File "/scratch/falam5/Summer2021/spp_graph_64/data/argo/tf-env/lib/python3.8/site-packages/tensorflow_probability/python/__init__.py", line 63, in _validate_tf_environment
raise ImportError(
ImportError: This version of TensorFlow Probability requires TensorFlow version >= 2.5; Detected an installation of version 2.2.0. Please upgrade TensorFlow to proceed.
This fails with a version error: the installed tensorflow_probability requires TensorFlow >= 2.5, but tf-env has TensorFlow 2.2.0.
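The comparison behind this ImportError is a plain minimum-version check. A minimal sketch of that kind of check follows; the helper names `parse_version` and `meets_minimum` are hypothetical, and this is not the code that tensorflow_probability itself uses.

```python
def parse_version(text):
    """Turn '2.2.0' into (2, 2, 0); non-numeric suffixes are not handled."""
    return tuple(int(part) for part in text.split("."))


def meets_minimum(installed, minimum):
    """Return True if the installed version is at least the minimum."""
    return parse_version(installed) >= parse_version(minimum)


print(meets_minimum("2.2.0", "2.5"))  # prints: False (the tf-env case)
print(meets_minimum("2.6.0", "2.5"))  # prints: True
```

In a real environment the installed string could come from `importlib.metadata.version("tensorflow-gpu")` (Python 3.8+), assuming the package is installed under that distribution name.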
Upgrading the TensorFlow version to 2.5 and the CUDA module to 11.0
$ module list
Currently Loaded Modulefiles:
1) cuda/11.0
The GPU devices are still detected:
$ python gpu_check.py
2021-09-05 22:59:54.327690: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
2021-09-05 22:59:57.009252: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcuda.so.1
2021-09-05 22:59:57.031748: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 0 with properties:
pciBusID: 0000:06:00.0 name: Tesla K80 computeCapability: 3.7
coreClock: 0.8235GHz coreCount: 13 deviceMemorySize: 11.17GiB deviceMemoryBandwidth: 223.96GiB/s
2021-09-05 22:59:57.031789: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
2021-09-05 22:59:57.036160: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcublas.so.11
2021-09-05 22:59:57.036204: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcublasLt.so.11
2021-09-05 22:59:57.037973: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcufft.so.10
2021-09-05 22:59:57.038585: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcurand.so.10
2021-09-05 22:59:57.043049: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcusolver.so.11
2021-09-05 22:59:57.044312: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcusparse.so.11
2021-09-05 22:59:57.044822: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudnn.so.8
2021-09-05 22:59:57.046808: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1871] Adding visible gpu devices: 0
Num GPUs Available: 1
Name: /physical_device:GPU:0 Type: GPU
$ python mahesh_vae_spp.py
2021-09-05 23:00:22.545385: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
2021-09-05 23:00:28.622259: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcuda.so.1
2021-09-05 23:00:28.644886: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 0 with properties:
pciBusID: 0000:06:00.0 name: Tesla K80 computeCapability: 3.7
coreClock: 0.8235GHz coreCount: 13 deviceMemorySize: 11.17GiB deviceMemoryBandwidth: 223.96GiB/s
2021-09-05 23:00:28.644926: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
2021-09-05 23:00:28.648687: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcublas.so.11
2021-09-05 23:00:28.648730: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcublasLt.so.11
2021-09-05 23:00:28.650461: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcufft.so.10
2021-09-05 23:00:28.651094: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcurand.so.10
2021-09-05 23:00:28.655563: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcusolver.so.11
2021-09-05 23:00:28.656558: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcusparse.so.11
2021-09-05 23:00:28.657015: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudnn.so.8
2021-09-05 23:00:28.659027: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1871] Adding visible gpu devices: 0
2021-09-05 23:00:28.659391: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-09-05 23:00:28.660468: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 0 with properties:
pciBusID: 0000:06:00.0 name: Tesla K80 computeCapability: 3.7
coreClock: 0.8235GHz coreCount: 13 deviceMemorySize: 11.17GiB deviceMemoryBandwidth: 223.96GiB/s
2021-09-05 23:00:28.662401: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1871] Adding visible gpu devices: 0
2021-09-05 23:00:28.662439: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
Traceback (most recent call last):
File "mahesh_vae_spp.py", line 130, in <module>
model = CVAE(latent_dim)
File "mahesh_vae_spp.py", line 23, in __init__
super(CVAE, self).__init__()
File "/scratch/falam5/Summer2021/spp_graph_64/data/argo/tf-env-2.6/lib/python3.8/site-packages/tensorflow/python/training/tracking/base.py", line 522, in _method_wrapper
result = method(self, *args, **kwargs)
File "/scratch/falam5/Summer2021/spp_graph_64/data/argo/tf-env-2.6/lib/python3.8/site-packages/tensorflow/python/keras/engine/training.py", line 318, in __init__
self._init_batch_counters()
File "/scratch/falam5/Summer2021/spp_graph_64/data/argo/tf-env-2.6/lib/python3.8/site-packages/tensorflow/python/training/tracking/base.py", line 522, in _method_wrapper
result = method(self, *args, **kwargs)
File "/scratch/falam5/Summer2021/spp_graph_64/data/argo/tf-env-2.6/lib/python3.8/site-packages/tensorflow/python/keras/engine/training.py", line 326, in _init_batch_counters
self._train_counter = variables.Variable(0, dtype='int64', aggregation=agg)
File "/scratch/falam5/Summer2021/spp_graph_64/data/argo/tf-env-2.6/lib/python3.8/site-packages/tensorflow/python/ops/variables.py", line 262, in __call__
return cls._variable_v2_call(*args, **kwargs)
File "/scratch/falam5/Summer2021/spp_graph_64/data/argo/tf-env-2.6/lib/python3.8/site-packages/tensorflow/python/ops/variables.py", line 244, in _variable_v2_call
return previous_getter(
File "/scratch/falam5/Summer2021/spp_graph_64/data/argo/tf-env-2.6/lib/python3.8/site-packages/tensorflow/python/ops/variables.py", line 237, in <lambda>
previous_getter = lambda **kws: default_variable_creator_v2(None, **kws)
File "/scratch/falam5/Summer2021/spp_graph_64/data/argo/tf-env-2.6/lib/python3.8/site-packages/tensorflow/python/ops/variable_scope.py", line 2662, in default_variable_creator_v2
return resource_variable_ops.ResourceVariable(
File "/scratch/falam5/Summer2021/spp_graph_64/data/argo/tf-env-2.6/lib/python3.8/site-packages/tensorflow/python/ops/variables.py", line 264, in __call__
return super(VariableMetaclass, cls).__call__(*args, **kwargs)
File "/scratch/falam5/Summer2021/spp_graph_64/data/argo/tf-env-2.6/lib/python3.8/site-packages/tensorflow/python/ops/resource_variable_ops.py", line 1584, in __init__
self._init_from_args(
File "/scratch/falam5/Summer2021/spp_graph_64/data/argo/tf-env-2.6/lib/python3.8/site-packages/tensorflow/python/ops/resource_variable_ops.py", line 1727, in _init_from_args
initial_value = ops.convert_to_tensor(initial_value,
File "/scratch/falam5/Summer2021/spp_graph_64/data/argo/tf-env-2.6/lib/python3.8/site-packages/tensorflow/python/profiler/trace.py", line 163, in wrapped
return func(*args, **kwargs)
File "/scratch/falam5/Summer2021/spp_graph_64/data/argo/tf-env-2.6/lib/python3.8/site-packages/tensorflow/python/framework/ops.py", line 1566, in convert_to_tensor
ret = conversion_func(value, dtype=dtype, name=name, as_ref=as_ref)
File "/scratch/falam5/Summer2021/spp_graph_64/data/argo/tf-env-2.6/lib/python3.8/site-packages/tensorflow/python/framework/tensor_conversion_registry.py", line 52, in _default_conversion_function
return constant_op.constant(value, dtype, name=name)
File "/scratch/falam5/Summer2021/spp_graph_64/data/argo/tf-env-2.6/lib/python3.8/site-packages/tensorflow/python/framework/constant_op.py", line 264, in constant
return _constant_impl(value, dtype, shape, name, verify_shape=False,
File "/scratch/falam5/Summer2021/spp_graph_64/data/argo/tf-env-2.6/lib/python3.8/site-packages/tensorflow/python/framework/constant_op.py", line 276, in _constant_impl
return _constant_eager_impl(ctx, value, dtype, shape, verify_shape)
File "/scratch/falam5/Summer2021/spp_graph_64/data/argo/tf-env-2.6/lib/python3.8/site-packages/tensorflow/python/framework/constant_op.py", line 301, in _constant_eager_impl
t = convert_to_eager_tensor(value, ctx, dtype)
File "/scratch/falam5/Summer2021/spp_graph_64/data/argo/tf-env-2.6/lib/python3.8/site-packages/tensorflow/python/framework/constant_op.py", line 97, in convert_to_eager_tensor
ctx.ensure_initialized()
File "/scratch/falam5/Summer2021/spp_graph_64/data/argo/tf-env-2.6/lib/python3.8/site-packages/tensorflow/python/eager/context.py", line 525, in ensure_initialized
context_handle = pywrap_tfe.TFE_NewContext(opts)
tensorflow.python.framework.errors_impl.InternalError: cudaGetDevice() failed. Status: CUDA driver version is insufficient for CUDA runtime version
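Main issue
The driver on the GPU nodes (440.33.01, which supports up to CUDA 10.2) is too old for the CUDA 11 runtime that the upgraded TensorFlow loads: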
tensorflow.python.framework.errors_impl.InternalError: cudaGetDevice() failed. Status: CUDA driver version is insufficient for CUDA runtime version
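The mismatch can be read directly from the logs: TensorFlow opens libcudart.so.11.0 while nvidia-smi reports CUDA 10.2 for the driver. A small sketch of that comparison follows. The helper name `cudart_major` is hypothetical, and only the major version is compared because the library file name is assumed not to identify the minor version reliably.

```python
import re


def cudart_major(log_text):
    """Return the major version of the libcudart TensorFlow loaded, or None."""
    match = re.search(r"libcudart\.so\.(\d+)", log_text)
    return int(match.group(1)) if match else None


LOG_LINE = "Successfully opened dynamic library libcudart.so.11.0"
DRIVER_CUDA = (10, 2)  # "CUDA Version" from the nvidia-smi header

runtime_major = cudart_major(LOG_LINE)
if runtime_major is not None and runtime_major > DRIVER_CUDA[0]:
    print(f"CUDA {runtime_major}.x runtime is newer than driver CUDA 10.2")
```

In the logs above, the device listing succeeds under cuda/11.0 and the failure appears only when the model creates its first variable, so a device-listing check alone does not catch this condition.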