
Problem :

I am running an AWS EC2 g2.2xlarge instance with Ubuntu 14.04 LTS. I want to observe GPU utilization while training TensorFlow models, but I get the error below when I try to run nvidia-smi.

ubuntu@ip-10-0-1-213:/etc/alternatives$ cd /usr/lib/nvidia-375/bin
ubuntu@ip-10-0-1-213:/usr/lib/nvidia-375/bin$ ls
nvidia-bug-report.sh     nvidia-debugdump     nvidia-xconfig
nvidia-cuda-mps-control  nvidia-persistenced
nvidia-cuda-mps-server   nvidia-smi
ubuntu@ip-10-0-1-213:/usr/lib/nvidia-375/bin$ ./nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

 


2 Answers


Solution :

I was getting the same error on Ubuntu 16.04 (Linux 4.14 kernel) on Google Compute Engine with a K80 GPU. I upgraded the kernel from 4.14 to 4.15 and the problem was solved. Here is how I upgraded the Linux kernel from 4.14 to 4.15:

1: Check the existing kernel of your Ubuntu Linux:

uname -a

2: Ubuntu maintains a website for all the kernel versions that have been released. At the time of this writing, the latest stable release of the Ubuntu kernel is 4.15. If you go to http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.15/, you will see several links for download.

3: Download the appropriate files for your type of OS. For 64-bit, I downloaded the following deb files:

wget https://kernel.ubuntu.com/~kernel-ppa/mainline/v4.15/linux-headers-4.15.0-041500_4.15.0-041500.201802011154_all.deb
wget https://kernel.ubuntu.com/~kernel-ppa/mainline/v4.15/linux-headers-4.15.0-041500-generic_4.15.0-041500.201802011154_amd64.deb
wget https://kernel.ubuntu.com/~kernel-ppa/mainline/v4.15/linux-image-4.15.0-041500-generic_4.15.0-041500.201802011154_amd64.deb

4: Install all the downloaded deb files:

sudo dpkg -i *.deb

5: Reboot the machine and check whether the kernel has been updated:

uname -a
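
To confirm the result after the reboot, the version reported by uname can be compared against the target. This is only a minimal sketch: the helper name version_ge is hypothetical, and it assumes GNU sort with the -V (version sort) option is available.

```shell
# Hypothetical helper: succeeds if version $1 is greater than or equal to $2.
# Relies on `sort -V` placing the smaller version on the first line.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Strip the suffix after the first dash, e.g. 4.15.0-041500-generic -> 4.15.0
running="$(uname -r | cut -d- -f1)"

if version_ge "$running" "4.15"; then
    echo "kernel $running is at least 4.15"
else
    echo "kernel $running is older than 4.15"
fi
```

If the second message is printed after the reboot, the new kernel was probably not selected at boot.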

Solution:

If nvidia-smi still fails to communicate even though you have installed the driver several times, check prime-select.

  1. Run prime-select query to get all possible options. You should see at least nvidia | intel.

  2. Select nvidia with prime-select nvidia.

  3. If it says nvidia is already selected, select a different one, for example prime-select intel, then switch back with prime-select nvidia.

  4. Reboot and check nvidia-smi.
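
Steps 1 to 3 above could be scripted roughly as follows. This is a sketch under assumptions, not a tested recipe: the function name reselect_nvidia and the PRIME_SELECT override are hypothetical, and it assumes prime-select query prints the name of the currently selected profile.

```shell
# Hypothetical helper: switch away from nvidia and back again.
# PRIME_SELECT may be overridden (e.g. PRIME_SELECT=echo) for a dry run.
reselect_nvidia() {
    ps_cmd="${PRIME_SELECT:-sudo prime-select}"
    # Assumption: `query` prints the currently selected profile.
    current="$($ps_cmd query)" || return 1
    if [ "$current" = "nvidia" ]; then
        # Already on nvidia: move to intel first so the switch is re-applied.
        $ps_cmd intel || return 1
    fi
    $ps_cmd nvidia
}
```

Call reselect_nvidia, then reboot and run nvidia-smi as in step 4.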

You may also want to install the CUDA toolkit. Use the following command to install it:

sudo apt install nvidia-cuda-toolkit

Once the installation is done, reboot the machine. nvidia-smi should then work.

This problem can often be solved simply by reinstalling the drivers and rebooting:

sudo apt-get update
sudo apt-get install --no-install-recommends nvidia-384 libcuda1-384 nvidia-opencl-icd-384
sudo reboot

I resolved "NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver" on my ASUS laptop with a GTX 950M and Ubuntu 18.04 by disabling Secure Boot Control in the BIOS.
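
Before going into the BIOS, it may be worth checking whether Secure Boot is actually on. The sketch below assumes the mokutil tool is installed and that mokutil --sb-state prints a line containing "SecureBoot enabled" when it is on; the helper name secure_boot_enabled is hypothetical.

```shell
# Hypothetical helper: reads `mokutil --sb-state` output on stdin and
# succeeds if it reports Secure Boot as enabled.
secure_boot_enabled() {
    grep -qi 'SecureBoot enabled'
}

if command -v mokutil >/dev/null 2>&1; then
    if mokutil --sb-state 2>/dev/null | secure_boot_enabled; then
        echo "Secure Boot is enabled; unsigned NVIDIA kernel modules may be blocked"
    else
        echo "Secure Boot is not reported as enabled"
    fi
else
    echo "mokutil not found; check the Secure Boot setting in the BIOS instead"
fi
```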

I was getting a similar error on Ubuntu 16.04 (Linux 4.14 kernel) on Google Compute Engine with a K80 GPU. I upgraded the kernel from 4.14 to 4.15 and the issue was solved. Here is how I upgraded my Linux kernel from 4.14 to 4.15:

Step 1:
Check the existing kernel of your Ubuntu Linux:

uname -a

Step 2:

Ubuntu maintains a website for all the versions of kernel that have 
been released. At the time of this writing, the latest stable release 
of Ubuntu kernel is 4.15. If you go to this 
link: http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.15/, you will 
see several links for download.

Step 3:

Download the appropriate files based on the type of OS you have. For 64 
bit, I would download the following deb files:

wget http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.15/linux-headers-4.15.0-041500_4.15.0-041500.201802011154_all.deb
wget http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.15/linux-headers-4.15.0-041500-generic_4.15.0-041500.201802011154_amd64.deb
wget http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.15/linux-image-4.15.0-041500-generic_4.15.0-041500.201802011154_amd64.deb

Step 4:

Install all the downloaded deb files:

sudo dpkg -i *.deb

Step 5:
Reboot your machine and check whether the kernel has been updated:
uname -a

You should see that your kernel has been upgraded, and hopefully nvidia-smi will now work.

I was working with an AWS Deep Learning AMI P2 instance and suddenly found that the NVIDIA driver commands were not working and the GPU was not detected by the torch or tensorflow libraries. I resolved the issue in the following way.

Run nvcc --version. If it doesn't work,

then run the following:

sudo apt install nvidia-cuda-toolkit
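
The check above can be written as a small conditional. This is only a sketch; the helper name command_missing is hypothetical, and it prints the install command rather than running it.

```shell
# Hypothetical helper: succeeds if the named command is not on PATH.
command_missing() {
    ! command -v "$1" >/dev/null 2>&1
}

if command_missing nvcc; then
    echo "nvcc not found; consider: sudo apt install nvidia-cuda-toolkit"
else
    nvcc --version
fi
```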

I solved this by installing the latest compatible version of the NVIDIA drivers.
In my case I installed version 410.104:

$ cat driver_install.sh
#!/bin/bash
# Download and install the NVIDIA Tesla driver version given as the first argument.
set -x
version=$1
wget "http://us.download.nvidia.com/tesla/${version}/NVIDIA-Linux-x86_64-${version}.run"
sudo sh "./NVIDIA-Linux-x86_64-${version}.run" --no-drm --disable-nouveau --dkms --silent --install-libglvnd

$ sudo ./driver_install.sh 410.104
$ sudo modprobe nvidia
$ nvidia-smi
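
If nvidia-smi still fails after modprobe, it can help to check whether the nvidia kernel module is actually loaded. A minimal sketch, assuming a Linux system where /proc/modules lists one module per line with the name in the first field; the helper name nvidia_module_loaded is hypothetical.

```shell
# Hypothetical helper: reads /proc/modules-style text on stdin and succeeds
# only if a module named exactly "nvidia" is listed.
nvidia_module_loaded() {
    awk '$1 == "nvidia" { found = 1 } END { exit !found }'
}

if [ -r /proc/modules ]; then
    if nvidia_module_loaded < /proc/modules; then
        echo "nvidia kernel module is loaded"
    else
        echo "nvidia kernel module is not loaded; try: sudo modprobe nvidia"
    fi
fi
```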

 
