Running TensorFlow on the GPUs - Not working on ARGO

Summary of issue

The loaded CUDA module, the Python version, and the TensorFlow version must be compatible with each other. The tested combinations are given in this chart: https://www.tensorflow.org/install/source#gpu_support_3

The CUDA version supported by the driver installed on the GPU node must be at least as new as the CUDA runtime version (the module being loaded). The GPU nodes currently report CUDA 10.2. This needs to be upgraded to at least 11.2 (with the accompanying cuDNN version) to be able to run the latest versions of Python/TensorFlow.

In testing, the code could not be run as it is, since it is written for TensorFlow 2.6. That version needs CUDA 11.2 and cuDNN 8.2, while the GPU nodes have CUDA 10.2.

Installed CUDA Version

$ nvidia-smi
Sun Sep  5 22:23:59 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01    Driver Version: 440.33.01    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           On   | 00000000:06:00.0 Off |                    0 |
| N/A   20C    P8    26W / 149W |      0MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
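
The "CUDA Version" field in the nvidia-smi header is the highest CUDA version the installed driver supports, so it is the number to compare against the loaded module. A minimal sketch of reading it from a script follows. The function names are hypothetical, and the regular expression assumes the header layout shown above.

```python
import re
import subprocess

# Matches e.g. "Driver Version: 440.33.01    CUDA Version: 10.2"
HEADER_RE = re.compile(
    r"Driver Version:\s*(?P<driver>[\d.]+)\s+"
    r"CUDA Version:\s*(?P<major>\d+)\.(?P<minor>\d+)"
)


def parse_nvidia_smi_header(text):
    """Return (driver_version, (cuda_major, cuda_minor)), or None if absent."""
    match = HEADER_RE.search(text)
    if match is None:
        return None
    cuda = (int(match.group("major")), int(match.group("minor")))
    return match.group("driver"), cuda


def query_driver():
    """Run nvidia-smi and parse its header; None if the tool is unavailable."""
    try:
        result = subprocess.run(
            ["nvidia-smi"], capture_output=True, text=True, check=True
        )
    except (FileNotFoundError, subprocess.CalledProcessError):
        return None
    return parse_nvidia_smi_header(result.stdout)


# Header line copied from the output above.
sample = "| NVIDIA-SMI 440.33.01    Driver Version: 440.33.01    CUDA Version: 10.2     |"
print(parse_nvidia_smi_header(sample))  # ('440.33.01', (10, 2))
```

On a GPU node, query_driver() would return the same tuple for the live driver. On a login node without nvidia-smi it returns None.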

Checking that TensorFlow can detect GPU devices

Scripts are run in a Python virtual environment (tf-env) with the following packages installed:

tensorflow-gpu==2.2
glob2
imageio
matplotlib
tensorflow_probability
pathlib
pickle-mixin
tensorflow_addons

Loaded cuda module and version

$ module list
Currently Loaded Modulefiles:
  1) cuda/10.1

Running a Python script that uses TensorFlow

$ python gpu_check.py
2021-09-05 22:24:20.712230: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2021-09-05 22:24:20.736678: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 0 with properties:
pciBusID: 0000:06:00.0 name: Tesla K80 computeCapability: 3.7
coreClock: 0.8235GHz coreCount: 13 deviceMemorySize: 11.17GiB deviceMemoryBandwidth: 223.96GiB/s
2021-09-05 22:24:20.738524: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2021-09-05 22:24:21.150860: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2021-09-05 22:24:21.375723: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2021-09-05 22:24:21.378054: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2021-09-05 22:24:21.690392: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2021-09-05 22:24:21.882272: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2021-09-05 22:24:22.509111: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2021-09-05 22:24:22.511227: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1703] Adding visible gpu devices: 0
Num GPUs Available:  1
Name: /physical_device:GPU:0   Type: GPU

The script successfully finds the GPU device when the cuda/10.1 module is loaded.
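
The contents of gpu_check.py are not shown here. A sketch that would produce the last two lines of the output above might look like the following. It is a guess at the script, not the original, and the helper names are hypothetical. tf.config.list_physical_devices is the TensorFlow 2.x call for listing devices.

```python
def format_report(devices):
    """Format the device list in the same shape as the output shown above."""
    lines = ["Num GPUs Available:  {}".format(len(devices))]
    for dev in devices:
        # Each PhysicalDevice carries a name and a device_type field.
        lines.append("Name: {}   Type: {}".format(dev.name, dev.device_type))
    return "\n".join(lines)


def list_gpus():
    """Ask TensorFlow which GPU devices it can see."""
    import tensorflow as tf  # imported lazily so the helpers work without TF

    return tf.config.list_physical_devices("GPU")


if __name__ == "__main__":
    try:
        print(format_report(list_gpus()))
    except ImportError:
        print("TensorFlow is not installed in this environment")
```

Listing devices only shows that TensorFlow can see the GPU. It does not create tensors or run kernels, so a successful report here does not guarantee that a full training script will start.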

Testing actual script

$ python mahesh_vae_spp.py
Traceback (most recent call last):
  File "mahesh_vae_spp.py", line 7, in <module>
    import tensorflow_probability as tfp
  File "/scratch/falam5/Summer2021/spp_graph_64/data/argo/tf-env/lib/python3.8/site-packages/tensorflow_probability/__init__.py", line 20, in <module>
    from tensorflow_probability import substrates
  File "/scratch/falam5/Summer2021/spp_graph_64/data/argo/tf-env/lib/python3.8/site-packages/tensorflow_probability/substrates/__init__.py", line 21, in <module>
    from tensorflow_probability.python.internal import all_util
  File "/scratch/falam5/Summer2021/spp_graph_64/data/argo/tf-env/lib/python3.8/site-packages/tensorflow_probability/python/__init__.py", line 142, in <module>
    dir(globals()[pkg_name])  # Forces loading the package from its lazy loader.
  File "/scratch/falam5/Summer2021/spp_graph_64/data/argo/tf-env/lib/python3.8/site-packages/tensorflow_probability/python/internal/lazy_loader.py", line 61, in __dir__
    module = self._load()
  File "/scratch/falam5/Summer2021/spp_graph_64/data/argo/tf-env/lib/python3.8/site-packages/tensorflow_probability/python/internal/lazy_loader.py", line 41, in _load
    self._on_first_access()
  File "/scratch/falam5/Summer2021/spp_graph_64/data/argo/tf-env/lib/python3.8/site-packages/tensorflow_probability/python/__init__.py", line 63, in _validate_tf_environment
    raise ImportError(
ImportError: This version of TensorFlow Probability requires TensorFlow version >= 2.5; Detected an installation of version 2.2.0. Please upgrade TensorFlow to proceed.

The import fails with a version error: the installed tensorflow_probability requires TensorFlow >= 2.5, but the environment has TensorFlow 2.2.0.
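
This kind of mismatch can be caught before launching the job by reading the installed package versions without importing TensorFlow. The sketch below assumes Python 3.8 (for importlib.metadata) and takes the 2.5 minimum from the ImportError message above. The helper names are hypothetical.

```python
from importlib import metadata  # standard library from Python 3.8


def version_tuple(version):
    """Turn '2.2.0' into (2, 2, 0); a suffix such as 'rc1' is ignored."""
    parts = []
    for piece in version.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def installed_version(*names):
    """Return the version of the first installed distribution in names."""
    for name in names:
        try:
            return metadata.version(name)
        except metadata.PackageNotFoundError:
            continue
    return None


def meets_minimum(installed, minimum):
    """True if the installed version is at least the minimum version."""
    return version_tuple(installed) >= version_tuple(minimum)


# TensorFlow may be installed as either distribution name.
tf_version = installed_version("tensorflow", "tensorflow-gpu")
if tf_version is None:
    print("TensorFlow is not installed in this environment")
elif meets_minimum(tf_version, "2.5"):
    print("TensorFlow", tf_version, "meets the 2.5 minimum")
else:
    print("TensorFlow", tf_version, "is older than the required 2.5")
```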

Upgrading TensorFlow to 2.5 and the CUDA module to 11.0

$ module list
Currently Loaded Modulefiles:
  1) cuda/11.0

The GPU devices are still detected:

$ python gpu_check.py
2021-09-05 22:59:54.327690: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
2021-09-05 22:59:57.009252: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcuda.so.1
2021-09-05 22:59:57.031748: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 0 with properties:
pciBusID: 0000:06:00.0 name: Tesla K80 computeCapability: 3.7
coreClock: 0.8235GHz coreCount: 13 deviceMemorySize: 11.17GiB deviceMemoryBandwidth: 223.96GiB/s
2021-09-05 22:59:57.031789: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
2021-09-05 22:59:57.036160: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcublas.so.11
2021-09-05 22:59:57.036204: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcublasLt.so.11
2021-09-05 22:59:57.037973: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcufft.so.10
2021-09-05 22:59:57.038585: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcurand.so.10
2021-09-05 22:59:57.043049: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcusolver.so.11
2021-09-05 22:59:57.044312: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcusparse.so.11
2021-09-05 22:59:57.044822: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudnn.so.8
2021-09-05 22:59:57.046808: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1871] Adding visible gpu devices: 0
Num GPUs Available:  1
Name: /physical_device:GPU:0   Type: GPU

But the actual code does not run, because the loaded CUDA module (11.0) is newer than the CUDA version supported by the driver installed on the node (10.2).

$ python mahesh_vae_spp.py
2021-09-05 23:00:22.545385: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
2021-09-05 23:00:28.622259: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcuda.so.1
2021-09-05 23:00:28.644886: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 0 with properties:
pciBusID: 0000:06:00.0 name: Tesla K80 computeCapability: 3.7
coreClock: 0.8235GHz coreCount: 13 deviceMemorySize: 11.17GiB deviceMemoryBandwidth: 223.96GiB/s
2021-09-05 23:00:28.644926: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
2021-09-05 23:00:28.648687: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcublas.so.11
2021-09-05 23:00:28.648730: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcublasLt.so.11
2021-09-05 23:00:28.650461: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcufft.so.10
2021-09-05 23:00:28.651094: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcurand.so.10
2021-09-05 23:00:28.655563: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcusolver.so.11
2021-09-05 23:00:28.656558: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcusparse.so.11
2021-09-05 23:00:28.657015: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudnn.so.8
2021-09-05 23:00:28.659027: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1871] Adding visible gpu devices: 0
2021-09-05 23:00:28.659391: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-09-05 23:00:28.660468: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 0 with properties:
pciBusID: 0000:06:00.0 name: Tesla K80 computeCapability: 3.7
coreClock: 0.8235GHz coreCount: 13 deviceMemorySize: 11.17GiB deviceMemoryBandwidth: 223.96GiB/s
2021-09-05 23:00:28.662401: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1871] Adding visible gpu devices: 0
2021-09-05 23:00:28.662439: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
Traceback (most recent call last):
  File "mahesh_vae_spp.py", line 130, in <module>
    model = CVAE(latent_dim)
  File "mahesh_vae_spp.py", line 23, in __init__
    super(CVAE, self).__init__()
  File "/scratch/falam5/Summer2021/spp_graph_64/data/argo/tf-env-2.6/lib/python3.8/site-packages/tensorflow/python/training/tracking/base.py", line 522, in _method_wrapper
    result = method(self, *args, **kwargs)
  File "/scratch/falam5/Summer2021/spp_graph_64/data/argo/tf-env-2.6/lib/python3.8/site-packages/tensorflow/python/keras/engine/training.py", line 318, in __init__
    self._init_batch_counters()
  File "/scratch/falam5/Summer2021/spp_graph_64/data/argo/tf-env-2.6/lib/python3.8/site-packages/tensorflow/python/training/tracking/base.py", line 522, in _method_wrapper
    result = method(self, *args, **kwargs)
  File "/scratch/falam5/Summer2021/spp_graph_64/data/argo/tf-env-2.6/lib/python3.8/site-packages/tensorflow/python/keras/engine/training.py", line 326, in _init_batch_counters
    self._train_counter = variables.Variable(0, dtype='int64', aggregation=agg)
  File "/scratch/falam5/Summer2021/spp_graph_64/data/argo/tf-env-2.6/lib/python3.8/site-packages/tensorflow/python/ops/variables.py", line 262, in __call__
    return cls._variable_v2_call(*args, **kwargs)
  File "/scratch/falam5/Summer2021/spp_graph_64/data/argo/tf-env-2.6/lib/python3.8/site-packages/tensorflow/python/ops/variables.py", line 244, in _variable_v2_call
    return previous_getter(
  File "/scratch/falam5/Summer2021/spp_graph_64/data/argo/tf-env-2.6/lib/python3.8/site-packages/tensorflow/python/ops/variables.py", line 237, in <lambda>
    previous_getter = lambda **kws: default_variable_creator_v2(None, **kws)
  File "/scratch/falam5/Summer2021/spp_graph_64/data/argo/tf-env-2.6/lib/python3.8/site-packages/tensorflow/python/ops/variable_scope.py", line 2662, in default_variable_creator_v2
    return resource_variable_ops.ResourceVariable(
  File "/scratch/falam5/Summer2021/spp_graph_64/data/argo/tf-env-2.6/lib/python3.8/site-packages/tensorflow/python/ops/variables.py", line 264, in __call__
    return super(VariableMetaclass, cls).__call__(*args, **kwargs)
  File "/scratch/falam5/Summer2021/spp_graph_64/data/argo/tf-env-2.6/lib/python3.8/site-packages/tensorflow/python/ops/resource_variable_ops.py", line 1584, in __init__
    self._init_from_args(
  File "/scratch/falam5/Summer2021/spp_graph_64/data/argo/tf-env-2.6/lib/python3.8/site-packages/tensorflow/python/ops/resource_variable_ops.py", line 1727, in _init_from_args
    initial_value = ops.convert_to_tensor(initial_value,
  File "/scratch/falam5/Summer2021/spp_graph_64/data/argo/tf-env-2.6/lib/python3.8/site-packages/tensorflow/python/profiler/trace.py", line 163, in wrapped
    return func(*args, **kwargs)
  File "/scratch/falam5/Summer2021/spp_graph_64/data/argo/tf-env-2.6/lib/python3.8/site-packages/tensorflow/python/framework/ops.py", line 1566, in convert_to_tensor
    ret = conversion_func(value, dtype=dtype, name=name, as_ref=as_ref)
  File "/scratch/falam5/Summer2021/spp_graph_64/data/argo/tf-env-2.6/lib/python3.8/site-packages/tensorflow/python/framework/tensor_conversion_registry.py", line 52, in _default_conversion_function
    return constant_op.constant(value, dtype, name=name)
  File "/scratch/falam5/Summer2021/spp_graph_64/data/argo/tf-env-2.6/lib/python3.8/site-packages/tensorflow/python/framework/constant_op.py", line 264, in constant
    return _constant_impl(value, dtype, shape, name, verify_shape=False,
  File "/scratch/falam5/Summer2021/spp_graph_64/data/argo/tf-env-2.6/lib/python3.8/site-packages/tensorflow/python/framework/constant_op.py", line 276, in _constant_impl
    return _constant_eager_impl(ctx, value, dtype, shape, verify_shape)
  File "/scratch/falam5/Summer2021/spp_graph_64/data/argo/tf-env-2.6/lib/python3.8/site-packages/tensorflow/python/framework/constant_op.py", line 301, in _constant_eager_impl
    t = convert_to_eager_tensor(value, ctx, dtype)
  File "/scratch/falam5/Summer2021/spp_graph_64/data/argo/tf-env-2.6/lib/python3.8/site-packages/tensorflow/python/framework/constant_op.py", line 97, in convert_to_eager_tensor
    ctx.ensure_initialized()
  File "/scratch/falam5/Summer2021/spp_graph_64/data/argo/tf-env-2.6/lib/python3.8/site-packages/tensorflow/python/eager/context.py", line 525, in ensure_initialized
    context_handle = pywrap_tfe.TFE_NewContext(opts)
tensorflow.python.framework.errors_impl.InternalError: cudaGetDevice() failed. Status: CUDA driver version is insufficient for CUDA runtime version

Main issue

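tensorflow.python.framework.errors_impl.InternalError: cudaGetDevice() failed. Status: CUDA driver version is insufficient for CUDA runtime version

The driver on the GPU nodes (440.33.01) supports CUDA up to 10.2, while the loaded runtime is CUDA 11.0, so TensorFlow cannot initialize the device.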
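
The comparison behind this error can be sketched as a simple version check. The rule used below (driver-supported CUDA version >= runtime version) is a deliberate simplification that ignores CUDA's compatibility packages. The function names are hypothetical, and the version numbers are the ones reported in this document.

```python
def parse(version):
    """Turn '10.2' into (10, 2) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def driver_can_run(driver_cuda, runtime_cuda):
    """Simplified rule: the driver must support at least the runtime version."""
    return parse(driver_cuda) >= parse(runtime_cuda)


DRIVER_CUDA = "10.2"  # "CUDA Version" reported by nvidia-smi on the GPU node

# Runtimes tried or required in this document: the cuda/10.1 module,
# the cuda/11.0 module, and the 11.2 needed by TensorFlow 2.6.
for runtime in ("10.1", "11.0", "11.2"):
    if driver_can_run(DRIVER_CUDA, runtime):
        status = "ok"
    else:
        status = "driver too old"
    print("runtime {} on driver CUDA {}: {}".format(runtime, DRIVER_CUDA, status))
```

Under this rule only the cuda/10.1 module passes on the current nodes, which matches the behaviour seen above.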