
Problem:

I am running an AWS EC2 g2.2xlarge instance with Ubuntu 14.04 LTS. I want to observe GPU utilization while training TensorFlow models, but I get the error below when I try to run nvidia-smi.

ubuntu@ip-10-0-1-213:/etc/alternatives$ cd /usr/lib/nvidia-375/bin
ubuntu@ip-10-0-1-213:/usr/lib/nvidia-375/bin$ ls
nvidia-bug-report.sh     nvidia-debugdump     nvidia-xconfig
nvidia-cuda-mps-control  nvidia-persistenced
nvidia-cuda-mps-server   nvidia-smi
ubuntu@ip-10-0-1-213:/usr/lib/nvidia-375/bin$ ./nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

 


2 Answers

0 votes

Solution:

I was getting the same error on Ubuntu 16.04 (Linux 4.14 kernel) on Google Compute Engine with a K80 GPU. Upgrading the kernel from 4.14 to 4.15 solved the problem. Here is how I upgraded the Linux kernel from 4.14 to 4.15:

1: Check the existing kernel of your Ubuntu Linux:

uname -a

2: Ubuntu maintains a website for all the kernel versions that have been released. At the time of this writing, the latest stable release of the Ubuntu kernel is 4.15. If you go to http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.15/, you will see several links for download.

3: Download the appropriate files for your type of OS. For 64-bit, I downloaded the following deb files:

wget https://kernel.ubuntu.com/~kernel-ppa/mainline/v4.15/linux-headers-4.15.0-041500_4.15.0-041500.201802011154_all.deb
wget https://kernel.ubuntu.com/~kernel-ppa/mainline/v4.15/linux-headers-4.15.0-041500-generic_4.15.0-041500.201802011154_amd64.deb
wget https://kernel.ubuntu.com/~kernel-ppa/mainline/v4.15/linux-image-4.15.0-041500-generic_4.15.0-041500.201802011154_amd64.deb

4: Install all the downloaded deb files:

sudo dpkg -i *.deb

5: Reboot the machine and check that the kernel has been updated:

uname -a
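
The final check can also be scripted. The following is only a minimal sketch, assuming GNU sort (for the -V version ordering); version_ge is a hypothetical helper name, not part of any package.

```shell
#!/usr/bin/env bash
# Sketch: check whether the running kernel is at least a target version.
# version_ge is a hypothetical helper; it assumes GNU sort with -V support.

# Returns 0 if version $1 >= version $2.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

target="4.15"
# uname -r prints something like 4.15.0-041500-generic; keep the part before the first '-'.
running="$(uname -r | cut -d- -f1)"

if version_ge "$running" "$target"; then
    echo "Kernel $running is at least $target"
else
    echo "Kernel $running is older than $target"
fi
```

If the script reports an older kernel after the reboot, the new image was probably not selected at boot.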
0 votes

Solution:

If nvidia-smi fails to communicate even though you have installed the driver several times, check prime-select.

  1. Run prime-select query to check the current selection. The available options should include at least nvidia | intel.

  2. Select nvidia with prime-select nvidia.

  3. If it reports that nvidia is already selected, select a different one, for example prime-select intel, then switch back with prime-select nvidia.

  4. Reboot and check nvidia-smi.
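
The prime-select steps above could be wrapped in a small script. This is a rough sketch, assuming prime-select is installed (on Ubuntu it usually comes with the nvidia-prime package) and that switching profiles needs root; toggle_prime is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch of the prime-select steps. toggle_prime is a hypothetical helper.
# Assumes prime-select is available and that sudo is needed to switch profiles.

toggle_prime() {
    if ! command -v prime-select >/dev/null 2>&1; then
        echo "prime-select not found"
        return 1
    fi
    echo "Current profile: $(prime-select query)"
    # Switch away and back so that the nvidia profile is applied again.
    sudo prime-select intel &&
        sudo prime-select nvidia &&
        echo "Reboot, then run nvidia-smi."
}

toggle_prime || echo "Skipped: prime-select could not be used on this machine."
```

On a cloud instance without hybrid graphics, prime-select is normally absent and the script only prints the skip message.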

You may also want to install the CUDA toolkit. Use the following command to install it:

sudo apt install nvidia-cuda-toolkit

Once the installation is done, reboot the machine. nvidia-smi should then work.

This problem can also be solved by reinstalling the drivers and rebooting:

sudo apt-get update
sudo apt-get install --no-install-recommends nvidia-384 libcuda1-384 nvidia-opencl-icd-384
sudo reboot
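
After the reboot, one way to confirm that the reinstalled driver is actually active is to look for the nvidia kernel module. A minimal sketch follows; nvidia_module_loaded is a hypothetical helper, and its optional file argument exists only to make it easy to try against a sample file.

```shell
#!/usr/bin/env bash
# Sketch: report whether the nvidia kernel module is loaded.
# nvidia_module_loaded is a hypothetical helper; the optional argument
# (a modules file, defaulting to /proc/modules) is an assumption for testing.

nvidia_module_loaded() {
    local modules_file="${1:-/proc/modules}"
    # Each line of /proc/modules starts with the module name followed by a space,
    # so this matches "nvidia" but not "nvidia_uvm".
    grep -q '^nvidia ' "$modules_file" 2>/dev/null
}

if nvidia_module_loaded; then
    echo "nvidia module is loaded"
else
    echo "nvidia module is not loaded"
fi
```

If the module is missing, nvidia-smi has no driver to talk to, which matches the error in the question.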

I resolved "NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver" on my ASUS laptop with a GTX 950M and Ubuntu 18.04 by disabling Secure Boot Control in the BIOS.
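
Before changing firmware settings, it may help to check whether Secure Boot is enabled at all. The sketch below assumes the mokutil tool is installed; secure_boot_state is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch: print the Secure Boot state. Assumes mokutil is installed;
# secure_boot_state is a hypothetical helper.

secure_boot_state() {
    if command -v mokutil >/dev/null 2>&1; then
        # On EFI systems this prints the SecureBoot state;
        # on other systems it prints an error message instead.
        mokutil --sb-state 2>&1
    else
        echo "mokutil not found"
    fi
}

state="$(secure_boot_state)"
echo "Secure Boot check: $state"
```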

I was getting a similar error on Ubuntu 16.04 (Linux 4.14 kernel) on Google Compute Engine with a K80 GPU. I upgraded the kernel from 4.14 to 4.15 and the issue was solved. Here is how I upgraded my Linux kernel from 4.14 to 4.15:

Step 1:
Check the existing kernel of your Ubuntu Linux:

uname -a

Step 2:

Ubuntu maintains a website for all the versions of kernel that have 
been released. At the time of this writing, the latest stable release 
of Ubuntu kernel is 4.15. If you go to this 
link: http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.15/, you will 
see several links for download.

Step 3:

Download the appropriate files based on the type of OS you have. For 64 
bit, I would download the following deb files:

wget http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.15/linux-headers-4.15.0-041500_4.15.0-041500.201802011154_all.deb
wget http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.15/linux-headers-4.15.0-041500-generic_4.15.0-041500.201802011154_amd64.deb
wget http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.15/linux-image-4.15.0-041500-generic_4.15.0-041500.201802011154_amd64.deb

Step 4:

Install all the downloaded deb files:

sudo dpkg -i *.deb

Step 5:
Reboot your machine and check if the kernel has been updated by:
uname -a

You should see that your kernel has been upgraded, and hopefully nvidia-smi will work.

I was working with an AWS Deep Learning AMI on a P2 instance when I suddenly found that the NVIDIA driver command was not working and the GPU was not detected by the torch or tensorflow libraries. I resolved the issue in the following way.

Run nvcc --version. If it doesn't work, run the following:

apt install nvidia-cuda-toolkit

I solved this by installing the latest compatible version of the NVIDIA drivers.
In my case I installed version 410.104.

$ cat driver_install.sh
#!/bin/bash
set -x
version=$1
wget http://us.download.nvidia.com/tesla/${version}/NVIDIA-Linux-x86_64-${version}.run
sudo sh ./NVIDIA-Linux-x86_64-${version}.run --no-drm --disable-nouveau --dkms --silent --install-libglvnd

$ sudo ./driver_install.sh 410.104
$ sudo modprobe nvidia
$ nvidia-smi
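
The install script takes the version as its only argument and does not check it. As a hedged sketch, the check and the URL construction could look like the following; driver_url is a hypothetical helper, and the URL pattern is simply the one used in the script above.

```shell
#!/usr/bin/env bash
# Sketch: validate the version argument and build the download URL
# used in driver_install.sh. driver_url is a hypothetical helper.

driver_url() {
    local version="$1"
    # Accept only dot-separated numbers, such as 410.104.
    if ! printf '%s' "$version" | grep -Eq '^[0-9]+(\.[0-9]+)+$'; then
        echo "invalid version: '$version'" >&2
        return 1
    fi
    echo "http://us.download.nvidia.com/tesla/${version}/NVIDIA-Linux-x86_64-${version}.run"
}

driver_url 410.104
```

Rejecting an empty or malformed version early avoids a wget call against a URL that cannot exist.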

 
