
Problem :

I am running an AWS EC2 g2.2xlarge instance with Ubuntu 14.04 LTS. I want to monitor GPU utilization while training TensorFlow models, but I get the error below when I try to run nvidia-smi.

ubuntu@ip-10-0-1-213:/etc/alternatives$ cd /usr/lib/nvidia-375/bin
ubuntu@ip-10-0-1-213:/usr/lib/nvidia-375/bin$ ls
nvidia-bug-report.sh     nvidia-debugdump     nvidia-xconfig
nvidia-cuda-mps-control  nvidia-persistenced
nvidia-cuda-mps-server   nvidia-smi
ubuntu@ip-10-0-1-213:/usr/lib/nvidia-375/bin$ ./nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
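
A first diagnostic that may help with the error above is to check whether the nvidia kernel module is loaded at all. The sketch below is only an illustration: `module_loaded` is a hypothetical helper name, and its optional second argument (a file to read instead of /proc/modules) exists only to make it easy to try out.

```shell
#!/usr/bin/env bash
# Report whether a kernel module appears in a /proc/modules-style listing.
# The optional second argument (hypothetical, for testing) overrides the file.
module_loaded() {
    local name="$1" listing="${2:-/proc/modules}"
    # Each line of /proc/modules starts with the module name followed by a space.
    grep -q "^${name} " "$listing" 2>/dev/null
}

if module_loaded nvidia; then
    echo "nvidia kernel module is loaded"
else
    echo "nvidia kernel module is NOT loaded; nvidia-smi cannot reach the driver"
fi
```

If the module is not loaded, the answers below (kernel upgrade, driver reinstall, Secure Boot) are the usual places to look.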

 


2 Answers


Solution :

I was getting the same error on Ubuntu 16.04 (Linux 4.14 kernel) in Google Compute Engine with a K80 GPU. Upgrading the kernel from 4.14 to 4.15 solved the problem. Here is how I upgraded the Linux kernel from 4.14 to 4.15:

1: Check the existing kernel of your Ubuntu Linux:

uname -a

2: Ubuntu maintains a website for all the kernel versions that have been released. At the time of this writing, the latest stable release of the Ubuntu kernel is 4.15. If you go to http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.15/, you will see several links for download.

3: Download the appropriate files for your type of OS. For 64-bit, I downloaded the deb files below:

wget https://kernel.ubuntu.com/~kernel-ppa/mainline/v4.15/linux-headers-4.15.0-041500_4.15.0-041500.201802011154_all.deb
wget https://kernel.ubuntu.com/~kernel-ppa/mainline/v4.15/linux-headers-4.15.0-041500-generic_4.15.0-041500.201802011154_amd64.deb
wget https://kernel.ubuntu.com/~kernel-ppa/mainline/v4.15/linux-image-4.15.0-041500-generic_4.15.0-041500.201802011154_amd64.deb

4: Install all the downloaded deb files:

sudo dpkg -i *.deb

5: Reboot the machine and check that the kernel has been updated:

uname -a
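
The check in step 5 can also be scripted. This is a minimal sketch, assuming a `sort` that supports `-V` (version sort, as in GNU coreutils); `version_ge` is a hypothetical helper name, and the 4.15 threshold is simply the version this answer upgraded to.

```shell
# Return success if version $1 is greater than or equal to version $2.
version_ge() {
    # sort -V orders version strings; if $2 sorts first (or equal), $1 >= $2.
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

running="$(uname -r)"      # e.g. 4.15.0-041500-generic
base="${running%%-*}"      # strip everything from the first "-" onward
if version_ge "$base" "4.15"; then
    echo "kernel $running is 4.15 or newer"
else
    echo "kernel $running is older than 4.15"
fi
```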

Solution:

If nvidia-smi still fails to communicate even though you have installed the driver many times, check prime-select.

  1. Run prime-select query to see the current selection. The available options should include at least nvidia | intel.

  2. Select nvidia with prime-select nvidia.

  3. If it reports that nvidia is already selected, select a different one, for example prime-select intel, then switch back with prime-select nvidia.

  4. Reboot and check nvidia-smi.
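
The prime-select steps above can be sketched as a dry run that only prints the commands instead of executing them. `prime_toggle_plan` is a hypothetical helper, and the sketch assumes `prime-select query` prints the name of the current profile.

```shell
# Print (do not run) the commands for the toggle described in the steps above.
prime_toggle_plan() {
    local current="$1"
    if [ "$current" = "nvidia" ]; then
        # Already on nvidia: switch away first, then switch back.
        echo "sudo prime-select intel"
    fi
    echo "sudo prime-select nvidia"
    echo "sudo reboot"
}

if command -v prime-select >/dev/null 2>&1; then
    prime_toggle_plan "$(prime-select query)"
else
    echo "prime-select not found on this machine"
fi
```

Printing the plan first makes it easy to review the commands before running them by hand.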

You may also want to install the CUDA toolkit. Use the following command to install it:

sudo apt install nvidia-cuda-toolkit

Once the installation is done, reboot the machine. nvidia-smi should then work.

This problem can often be solved simply by reinstalling the drivers and rebooting:

sudo apt-get update
sudo apt-get install --no-install-recommends nvidia-384 libcuda1-384 nvidia-opencl-icd-384
sudo reboot

I resolved "NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver" on my ASUS laptop with a GTX 950M and Ubuntu 18.04 by disabling Secure Boot Control in the BIOS.
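
Before rebooting into the BIOS, it may be worth checking whether Secure Boot is enabled at all, since Secure Boot can prevent unsigned kernel modules from loading. The sketch below assumes the `mokutil` tool is available; `secure_boot_state` is a hypothetical wrapper that falls back to "unknown" when the state cannot be read.

```shell
# Print the Secure Boot state, or "unknown" when it cannot be determined.
secure_boot_state() {
    if command -v mokutil >/dev/null 2>&1; then
        # mokutil --sb-state reports whether Secure Boot is enabled or disabled.
        mokutil --sb-state 2>/dev/null || echo "unknown"
    else
        echo "unknown (mokutil not installed)"
    fi
}

secure_boot_state
```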

I was getting a similar error on Ubuntu 16.04 (Linux 4.14 kernel) in Google Compute Engine with a K80 GPU. Upgrading the kernel from 4.14 to 4.15 solved the issue. Here is how I upgraded the Linux kernel from 4.14 to 4.15:

Step 1:
Check the existing kernel of your Ubuntu Linux:

uname -a

Step 2:

Ubuntu maintains a website for all the versions of kernel that have 
been released. At the time of this writing, the latest stable release 
of Ubuntu kernel is 4.15. If you go to this 
link: http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.15/, you will 
see several links for download.

Step 3:

Download the appropriate files based on the type of OS you have. For 64 
bit, I would download the following deb files:

wget http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.15/linux-headers-4.15.0-041500_4.15.0-041500.201802011154_all.deb
wget http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.15/linux-headers-4.15.0-041500-generic_4.15.0-041500.201802011154_amd64.deb
wget http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.15/linux-image-4.15.0-041500-generic_4.15.0-041500.201802011154_amd64.deb

Step 4:

Install all the downloaded deb files:

sudo dpkg -i *.deb

Step 5:
Reboot your machine and check if the kernel has been updated by:
uname -a

You should see that your kernel has been upgraded, and hopefully nvidia-smi will now work.

I was working on an AWS Deep Learning AMI P2 instance when I suddenly found that the NVIDIA driver commands had stopped working and the GPU was no longer detected by the torch or tensorflow libraries. I resolved the issue in the following way.

Run nvcc --version to check whether it works.

If it does not, run the following:

apt install nvidia-cuda-toolkit
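
The nvcc check above can be written as a small guard so that the toolkit is only suggested when nvcc is actually missing. A minimal sketch; `have_cmd` is a hypothetical helper name.

```shell
# Succeed if the given command is available on PATH.
have_cmd() {
    command -v "$1" >/dev/null 2>&1
}

if have_cmd nvcc; then
    nvcc --version
else
    echo "nvcc not found; consider: sudo apt install nvidia-cuda-toolkit"
fi
```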

I solved this by installing the latest compatible version of the NVIDIA drivers.
In my case I installed version 410.104.

$ cat driver_install.sh
#!/bin/bash
set -x
version=$1
wget http://us.download.nvidia.com/tesla/${version}/NVIDIA-Linux-x86_64-${version}.run
sudo sh ./NVIDIA-Linux-x86_64-${version}.run --no-drm --disable-nouveau --dkms --silent --install-libglvnd

$ sudo ./driver_install.sh 410.104
$ sudo modprobe nvidia
$ nvidia-smi
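
Because the script above builds its download URL directly from the first argument, a small validation step may be useful before calling wget. The following is a sketch under that assumption; `driver_url` is a hypothetical helper, and the URL pattern is copied from the script in this answer.

```shell
# Build the download URL used by driver_install.sh for a given driver version.
driver_url() {
    local version="$1"
    # Reject anything other than digits separated by single dots.
    case "$version" in
        ''|*[!0-9.]*|.*|*.|*..*) echo "invalid version: $version" >&2; return 1 ;;
    esac
    echo "http://us.download.nvidia.com/tesla/${version}/NVIDIA-Linux-x86_64-${version}.run"
}

driver_url 410.104
```

Whether a given version actually exists at that URL still has to be checked against the server.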

 

